<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.27893.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Method Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>A guide to creating design matrices for gene expression experiments</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Law</surname>
                        <given-names>Charity W.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6082-6814</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Zeglinski</surname>
                        <given-names>Kathleen</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Dong</surname>
                        <given-names>Xueyi</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-1136-3117</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Alhamdoosh</surname>
                        <given-names>Monther</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-2411-1325</uri>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Smyth</surname>
                        <given-names>Gordon K.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-9221-2892</uri>
                    <xref ref-type="corresp" rid="c2">b</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a4">4</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Ritchie</surname>
                        <given-names>Matthew E.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-7383-0609</uri>
                    <xref ref-type="corresp" rid="c3">c</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a4">4</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>The Walter and Eliza Hall Institute of Medical Research, Parkville, 3052, Australia</aff>
                <aff id="a2">
                    <label>2</label>Department of Medical Biology, The University of Melbourne, Parkville, 3010, Australia</aff>
                <aff id="a3">
                    <label>3</label>Research and Development, CSL Limited, Parkville, 3010, Australia</aff>
                <aff id="a4">
                    <label>4</label>School of Mathematics and Statistics, The University of Melbourne, Parkville, 3010, Australia</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:law@wehi.edu.au">law@wehi.edu.au</email>
                </corresp>
                <corresp id="c2">
                    <label>b</label>
                    <email xlink:href="mailto:smyth@wehi.edu.au">smyth@wehi.edu.au</email>
                </corresp>
                <corresp id="c3">
                    <label>c</label>
                    <email xlink:href="mailto:mritchie@wehi.edu.au">mritchie@wehi.edu.au</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>10</day>
                <month>12</month>
                <year>2020</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2020</year>
            </pub-date>
            <volume>9</volume>
            <elocation-id>1444</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>26</day>
                    <month>11</month>
                    <year>2020</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2020 Law CW et al.</copyright-statement>
                <copyright-year>2020</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/9-1444/pdf"/>
            <abstract>
                <p>Differential expression analysis of genomic data types, such as RNA-sequencing experiments, use linear models to determine the size and direction of the changes in gene expression. For RNA-sequencing, there are several established software packages for this purpose accompanied with analysis pipelines that are well described. However, there are two crucial steps in the analysis process that can be a stumbling block for many -- the set up an appropriate model via design matrices and the set up of comparisons of interest via contrast matrices. These steps are particularly troublesome because an extensive catalogue for design and contrast matrices does not currently exist. One would usually search for example case studies across different platforms and mix and match the advice from those sources to suit the dataset they have at hand. This article guides the reader through the basics of how to set up design and contrast matrices. We take a practical approach by providing code and graphical representation of each case study, starting with simpler examples (e.g. models with a single explanatory variable) and move onto more complex ones (e.g. interaction models, mixed effects models, higher order time series and cyclical models). Although our work has been written specifically with a 
                    <bold>limma</bold>-style pipeline in mind, most of it is also applicable to other software packages for differential expression analysis, and the ideas covered can be adapted to data analysis of other high-throughput technologies. Where appropriate, we explain the interpretation and differences between models to aid readers in their own model choices. Unnecessary jargon and theory is omitted where possible so that our work is accessible to a wide audience of readers, from beginners to those with experience in genomics data analysis.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Design matrix</kwd>
                <kwd>model matrix</kwd>
                <kwd>contrast matrix</kwd>
                <kwd>statistical models</kwd>
                <kwd>gene expression analysis</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="http://dx.doi.org/10.13039/100014989">
                    <funding-source>Chan Zuckerberg Initiative</funding-source>
                </award-group>
                <award-group id="fund-2" xlink:href="http://dx.doi.org/10.13039/501100000925">
                    <funding-source>National Health and Medical Research Council</funding-source>
                    <award-id>1154970</award-id>
                </award-group>
                <funding-statement>CWL was supported by a Chan Zuckerberg Initiative Essential Open Source Software for Science Program awarded to GKS and CWL. GKS was supported by NHMRC Fellowship [1154970].</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <p>Gene expression technologies are useful for the study of transcriptomics and their associated profiles amongst biological samples of interest. The technology is used worldwide to examine complex relationships between gene expression (which we will refer to as the 
                <italic toggle="yes">response variable</italic> when performing statistical modelling) and the variables that influence the expression (referred to as 
                <italic toggle="yes">explanatory variables</italic>). From the resulting datasets, careful statistical analysis can be used to find relationships that are of biological interest through the choice of appropriate statistical models applied to the data. The modelling process requires the use of a 
                <italic toggle="yes">design matrix</italic> (or model matrix) that has two roles: 1) it defines the form of the model, or structure of the relationship between genes and explanatory variables, and 2) it is used to store values of the explanatory variable(s)
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>,
                    <xref ref-type="bibr" rid="ref2">2</xref>,
                    <xref ref-type="bibr" rid="ref3">3</xref>
                </sup>. Although design matrices are fundamental concepts that are covered in many undergraduate mathematics and statistics courses, their specific and multi-disciplinary application to the analysis of genomic data types through the use of the R programming language adds several layers of complexity, both theoretically and in practice.</p>
            <p>This article describes the appropriate design matrix set up for differential expression analyses specific to using the 
                <bold>limma</bold>
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> software package, one of the most popular open-source software packages for such analysis worldwide. Our examples have been written for gene expression data, specifically with the assumption that the expression values are genewise log-count per million (log-CPM) measurements from an RNA-sequencing (RNA-seq) experiment. However, most of the concepts and R code covered in this article can also be applied to differential analyses of other genomic data types, including microarrays, ChIP-seq, ATAC-seq, BS-seq, Hi-C and proteomics. The main requirements are that the response data represents abundance on a log-scale and that each row corresponds to an appropriate genomic feature. Typically, the data table from an RNA-seq experiment contains the gene expression measurements for tens of thousands of genes and a small number of samples (usually no more than 10 or 20, although much larger sample sizes are possible). In the modelling process, a single design matrix is defined and then simultaneously applied to each and every gene in the dataset. Rather than demonstrating the application of design matrices across multiple genes, where the modelling concepts are consistent between genes, we simply describe the process for a single gene in our examples. This allows us to illustrate clearly differences between varying models and the implications of adding or removing model parameters.</p>
            <p>The article begins by introducing the basic concepts associated with design and contrast matrices. We cover common experimental designs used in genomics research, and move onto more complex study designs as we progress through the sections. We have approached our work from a practical stand-point, with a focus on the R programming inputs and outputs, accompanied by associated plots to illustrate the observed data that we begin with and the fitted models that are produced from a graphical perspective. By omitting most of the theory associated with design matrices, our article allows readers from various backgrounds to gain a better understanding of design matrices, without having statistics as a prerequisite. To enable readers to select the most appropriate design matrix set up for their study design, we also discuss the interpretation of the models and the differences between them.</p>
            <p>In each of our examples, we will explicitly display the observed data and include the R code for associated design and contrast matrices that are used in the modelling process. This allows readers to quickly grasp modelling concepts and to apply the R code in their own datasets. Each example is also accompanied by a figure displaying the design matrix and both a written and graphical representation of the statistical model. Whilst the complete data analysis process, from pre-processing data to variance modelling and parameter estimation is not discussed in this article, the design matrices we describe can be implemented in conjunction with the &#x201c;
                <italic toggle="yes">RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR</italic>&#x201d; differential expression workflow article
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup> for an RNA-seq analysis beginning with a table of counts.</p>
            <p>Other complementary work focusing on design matrices includes that of the CRAN 
                <bold>codingMatrices</bold> package vignette
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>
                </sup>, which describes the theoretical aspects of design matrices, and the 
                <bold>ExploreModelMatrix</bold> software package
                <sup>
                    <xref ref-type="bibr" rid="ref7">7</xref>
                </sup>, which allows interactive exploration of design matrices for specified explanatory variables. Although not focusing purely on design matrices, the user&#x2019;s guides for the 
                <bold>limma</bold> and 
                <bold>edgeR</bold>
                <sup>
                    <xref ref-type="bibr" rid="ref8">8</xref>,
                    <xref ref-type="bibr" rid="ref9">9</xref>
                </sup> software packages also contain many example case studies for different experimental designs.</p>
        </sec>
        <sec id="sec2">
            <title>Basic models</title>
            <sec id="sec3">
                <title>Background</title>
                <p>In this section, we outline the general form of some basic models and introduce terminology that will be used in the remainder of the article. The concept of model equations and associated graphical illustrations for fitted models are also introduced here.</p>
            </sec>
            <sec id="sec4">
                <title>Regression model for covariates</title>
                <p>To begin with, let us consider two types of explanatory variables: 
                    <italic toggle="yes">covariates</italic> and 
                    <italic toggle="yes">factors</italic>. Covariates contain numerical values that are quantitative measurements associated with samples in the experiment. These can be the age or weight of an individual, or other molecular or cellular phenotypes on a sample, such as measurements obtained from a polymerase chain reaction (PCR) experiment or fluorescence activated cell sorting (FACS). For covariates, it is generally of interest to know the rate of change between the response and the covariate, such as &#x201c;how much does the expression of a particular gene increase/decrease per unit increase in age?&#x201d;. We can use a straight line to model, or describe, this relationship, which takes the form of 
                    <disp-formula id="e1">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>0</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mtext>age</mml:mtext>
                        </mml:math>
                    </disp-formula>where the line is defined by 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>0</sub> the y-intercept and 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub> the slope (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>). In this model, the 
                    <monospace>age</monospace> covariate takes continuous, numerical values such as 0.8, 1.3, 2.0, 5.6, and so on. We refer to this model generally as a 
                    <italic toggle="yes">regression model</italic>, where the slope indicates the rate of change, or how much gene expression is expected to increase/decrease by per unit increase of the covariate. The y-intercept and slope of the line, or the 
                    <italic toggle="yes">&#x03b2;</italic>s (
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>0</sub> and 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub>), are referred to as the model 
                    <italic toggle="yes">parameters</italic>. The true values of the parameters are unknown, but are estimated in the modelling process. A positive estimate of a model parameter indicates that an explanatory variable has a positive influence on gene expression (an increasing slope), whilst a negative value indicates that the explanatory variable has a negative influence on gene expression (a decreasing slope). In some cases, one may convert the 
                    <monospace>age</monospace> covariate into a factor by categorising the smaller values as &#x201c;young&#x201d; and larger values as &#x201c;mature&#x201d;, and instead use the models described below.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Basic models for covariate and factor explanatory variables.</title>
                        <p>LEFT: The basic model for covariates is referred to as a regression model, which is a line defined by the model parameters 
                            <italic toggle="yes">&#x03b2;</italic>
                            <sub>0</sub> the y-intercept, and 
                            <italic toggle="yes">&#x03b2;</italic>
                            <sub>1</sub> the slope. CENTER: One of two basic models for factors is referred to as a means model, where model parameters are calculated as the mean gene expression of each level of the factor e.g. 
                            <italic toggle="yes">&#x03b2;</italic>
                            <sub>1</sub> represents the mean expression for wildtype and 
                            <italic toggle="yes">&#x03b2;</italic>
                            <sub>2</sub> represents the mean of mutant. RIGHT: The other basic model we refer to for factors is a mean-reference model, where the first model parameter is calculated as the mean gene expression of the reference level, and subsequent parameters are calculated relative to the reference level e.g. 
                            <italic toggle="yes">&#x03b2;</italic>
                            <sub>1</sub> represents the mean expression for wildtype and 
                            <italic toggle="yes">&#x03b2;</italic>
                            <sub>2</sub> represents the difference between mutant and wildtype. In each plot, the points represent the original data; coloured lines are used to represent expected gene expression, where dashed lines are specifically used to represent expected gene expression for non-reference levels in the mean-reference model e.g. mutant.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure1.gif"/>
                </fig>
            </sec>
            <sec id="sec5">
                <title>Means model for factors</title>
                <p>Factors are categorical variables or classifiers associated with samples in the experiment. They are often separated into those that are of a biological nature (e.g. disease status, genotype, treatment, cell-type) and those that are of a technical nature (e.g. experiment time, sample batch, handling technician, sequencing lane). Unique values within a factor are referred to as 
                    <italic toggle="yes">levels</italic>. For example, genotype as a factor may contain two levels, &#x201c;wildtype&#x201d; and &#x201c;mutant&#x201d;. Here, it is generally of interest to determine the expected or mean gene expression for each state or level of the factor. The relationship between gene expression and the factor can be described, or modelled, in the form of 
                    <disp-formula id="e2">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mtext>wildtype</mml:mtext>
                            <mml:mo>+</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>2</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mtext>mutant</mml:mtext>
                        </mml:math>
                    </disp-formula>where 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub> represents the mean gene expression for wildtype, and 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>2</sub> represents the mean gene expression for mutant (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>). Unlike that of the 
                    <monospace>age</monospace> covariate which can take on any non-negative numerical value in the model, the levels of genotype can only take on the values of 0 or 1. For example, 
                    <monospace>wildtype</monospace> is equal to 1 (and 
                    <monospace>mutant</monospace> is equal to 0) when determining the expected expression for the wildtype group, such that 
                    <disp-formula id="e3">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mspace width="2em"/>
                            <mml:mtext>for wildtype.</mml:mtext>
                        </mml:math>
                    </disp-formula>Similarly, 
                    <monospace>mutant</monospace> is equal to 1 (and 
                    <monospace>wildtype</monospace> is equal to 0) when determining the expected expression for the mutant group, such that 
                    <disp-formula id="e4">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>2</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mspace width="2em"/>
                            <mml:mtext>for mutant.</mml:mtext>
                        </mml:math>
                    </disp-formula>Notice that 
                    <monospace>wildtype</monospace> and 
                    <monospace>mutant</monospace> &#x201c;take turns&#x201d; in taking on the 0 and 1 values. This is because the categorisation of samples as wildtype or mutant are mutually exclusive, where a sample cannot be both wildtype and mutant, or neither wildtype nor mutant. The model estimates expected gene expression as 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub> or 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>2</sub>, where 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub> is calculated as the mean of observed expression values in wildtype, and 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>2</sub> is calculated as the mean of observed expression values in mutant. In other words, the 
                    <italic toggle="yes">&#x03b2;</italic>s (or model parameters) are estimated as the mean of each level in the genotype factor, as depicted in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref> as distinct solid lines.</p>
                <p>Each of the horizontal lines in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref> are defined by their y-intercept (and a slope of 0), and are themselves regression models. We, however, will refer specifically to models of this type as a 
                    <italic toggle="yes">means model</italic> since the model parameters represent the group means. This also allows us to differentiate these models from the general regression models applied to covariates where the y-intercept and slope can both be non-zero. As noted for covariates, the true values for the model parameters are unknown but estimable. Whilst the expected expression of each factor level is informative, it is often the difference in expression between levels that are of key interest, e.g. &#x201c;what is the difference in expression between wildtype and mutant?&#x201d;. These differences are calculated using linear combinations of the parameters (a fancy way to say that we multiply each parameter by a constant) which we refer to as 
                    <italic toggle="yes">contrasts</italic>. For example, a contrast of (1,&#x2212;1) calculates 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub> &#x2212; 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>2</sub>, the difference in means between wildtype and mutant.</p>
            </sec>
            <sec id="sec6">
                <title>Mean-reference model for factors</title>
                <p>An alternative parameterisation of the means model directly calculates the gene expression difference between mutant and wildtype. It does so by using one of the levels as a reference. Such a model is parameterised for the mean expression of the reference level (e.g. wildtype), and the rest of the levels are parameterised relative to the reference (e.g. the difference between mutant and wildtype). The relationship between gene expression and genotype is modelled in the form of 
                    <disp-formula id="e5">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>2</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mtext>mutant</mml:mtext>
                        </mml:math>
                    </disp-formula>where 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub> represents the mean gene expression for wildtype, and 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>2</sub> is the difference between means of mutant and wildtype (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>). Here, 
                    <monospace>mutant</monospace> in the equation takes the value of 0 when determining the expected expression for the wildtype group, such that 
                    <disp-formula id="e6">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mspace width="2em"/>
                            <mml:mtext>for wildtype.</mml:mtext>
                        </mml:math>
                    </disp-formula>On the other hand, 
                    <monospace>mutant</monospace> is equal to 1 when determining the expected expression for the mutant group, such that 
                    <disp-formula id="e7">
                        <mml:math display="block">
                            <mml:mtext>expression</mml:mtext>
                            <mml:mo>=</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:msub>
                                <mml:mrow>
                                    <mml:mi>&#x03b2;</mml:mi>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mn>2</mml:mn>
                                </mml:mrow>
                            </mml:msub>
                            <mml:mspace width="2em"/>
                            <mml:mtext>for mutant.</mml:mtext>
                        </mml:math>
                    </disp-formula>Expected gene expression for wildtype is represented directly by the first model parameter, 
                    <italic toggle="yes">&#x03b2;</italic>
                    <sub>1</sub>, and depicted as a solid line in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>. Whilst the expected gene expression for mutant is calculated as the sum of both parameters, and represented by a dashed line in in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>. Like the means model, the model demonstrated here is a regression model in itself. We, however, refer to this model specifically as a 
                    <italic toggle="yes">mean-reference model</italic> to distinguish it from the general model form that we use for covariate explanatory variables. The means model and the mean-reference model are equivalent models that differ in parameterisation, such that the form of the model is different but one could obtain equivalent values for the expected gene expression of wildtype and mutant from both models.</p>
            </sec>
            <sec id="sec7">
                <title>Terminology</title>
                <p>The terminology and concepts covered in this section are summarised in the table below, in the context of modelling gene expression data. The table also extends to some definitions and descriptions covered later in the article, and is a useful resource to refer to from time-to-time.</p>
                <table-wrap id="T1" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Term</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Description</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Response variable</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Gene expression, e.g. log-CPM values.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Explanatory variable</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Variable that influences gene expression.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Statistical model</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Used to describe the relationship between response and explanatory variables.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Model parameters</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Of statistical models, unknown but estimable values that describe the direction and magnitude with which explanatory variables influence gene expression.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Design matrix</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Used to define the form of a statistical model and to store observed values of the explanatory variable(s). Used in the computation process to estimate model parameters.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Contrast matrix</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Used in conjunction with a design matrix to calculate specific values of interest between estimated parameters.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Covariate</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Explanatory variable that is numerical in nature, e.g. age.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Y-intercept</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Point at which a model prediction crosses the y-axis.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Slope</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Rate of change for a model e.g. the change in gene expression per unit increase of a covariate.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Regression model</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Our reference to statistical models for covariates.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Factor</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Explanatory variable that is categorical in nature, e.g. genotype.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Levels</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Unique values within a factor, e.g. wildtype or mutant.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Means model</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Our reference to statistical models for factors where parameters are calculated as the mean of each factor level.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Contrasts</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Linear combinations of estimated parameters. A contrast matrix is made up of individual contrasts.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Mean-reference model</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Our reference to statistical models for factors where parameters are calculated as the mean reference level, and relative means for subsequent levels.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Fitted model</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">The statistical model written with estimates of the model parameters. In our figures, we draw the fitted model (expected gene expression) along with the data points (observed gene expression) to give an idea of how well the fitted model represents the relationship between response and explanatory variables.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Additive effect</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">When the combined effect of two factors equals the sum of the two individual effects, e.g. if the estimated effect of Group A is 
                                    <italic toggle="yes">&#x03ba;</italic> and the estimated effect of sequencing on lane I is 
                                    <italic toggle="yes">&#x03c4;</italic>, then a sample in Group A that is sequenced on lane I has an expected expression of 
                                    <italic toggle="yes">&#x03ba;</italic> + 
                                    <italic toggle="yes">&#x03c4;</italic> if the two factors have an additive effect.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Interaction effect</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">When the combined effect of two factors does not equal the sum of the two individual effects, e.g. for the example above, a sample in Group A that is sequenced on lane I has an expected expression of 
                                    <italic toggle="yes">&#x03ba;</italic> + 
                                    <italic toggle="yes">&#x03c4;</italic> + 
                                    <italic toggle="yes">&#x03b4;</italic> if the two factors have an interaction effect, where 
                                    <italic toggle="yes">&#x03b4;</italic> can be a positive or negative number.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Nested factors</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A factor is considered to be nested within a second factor, e.g. 
                                    <sans-serif>group</sans-serif> is nested within 
                                    <sans-serif>batch</sans-serif>, if different sets of its levels can be found in each level of the second factor, e.g. group A and group B are processed in batch B1 and group C and group D are processed in batch B2.</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Mixed effects models</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">A statistical model that contains both fixed and random effects, where random effects are usually not of interest to the study at hand.</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
        </sec>
        <sec id="sec8">
            <title>Overview of models fitted</title>
            <p>In the sections to follow, we explore various models for explanatory variables that are factors, starting from the most basic study designs to those that are more complex. We then cover some models for explanatory variables that are covariates. The tables below summarise the data examples, R input for the associated design matrices, and the sections from which they can be found. In the tables, factors are distinguished from covariates by the presence of subscripts listing their levels e.g. &#x201c;factor
                <sub>
                    <italic toggle="yes">LEVEL</italic>1,
                    <italic toggle="yes">LEVEL</italic>2</sub>&#x201d;. Associated sections are marked with an asterisk if the design matrix cannot be sufficiently summarised within the table.</p>
            <sec id="sec9">
                <title>Design and contrast matrices</title>
                <p>This section describes and compares models that are coded with and without an intercept term for covariates and factors. It also shows the fundamental elements for computing differences between model parameters using contrasts and contrast matrices.</p>
                <table-wrap id="T2" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Explanatory variables</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Design matrix</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Section</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">age</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;age)
                                    <break/>model.matrix(&#x223c;0+age)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Covariates: With intercept
                                    <break/>Covariates:Without intercept</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">group
                                    <sub>
                                        <italic toggle="yes">HEALTHY</italic>,
                                        <italic toggle="yes">SICK</italic>
                                    </sub>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;group)
                                    <break/>model.matrix(&#x223c;0+group)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Factors: With intercept
                                    <break/>Factors: Without intercept</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec id="sec10">
                <title>Study of treatments and control</title>
                <p>This section examines a study on four treatment groups (CTL, I, II, and III). The example here represents a study design that is very common in practice, where there are several treatments (or conditions or groups) including a control. Comparisons between the levels are computed using two alternative design matrices. In this section, we also look at more complex set ups for contrasts to compute comparisons that may be of interest.</p>
                <table-wrap id="T3" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Explanatory variables</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Design matrix</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Section</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">treatment
                                    <sub>
                                        <italic toggle="yes">CTL</italic>,
                                        <italic toggle="yes">I</italic>,
                                        <italic toggle="yes">II</italic>,
                                        <italic toggle="yes">III</italic>
                                    </sub>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;treatment)
                                    <break/>model.matrix(&#x223c;0+treatment)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Treatment versus control
                                    <break/>All pairwise comparisons</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec id="sec11">
                <title>Study of interactions and additivity of treatments</title>
                <p>In this section we consider the effect of combining two separate treatments. Our first example looks at the interactivity of the treatments using a model from the previous section, which has a single treatment factor. We then show an alternative method to calculate the same estimates using a two factor model.</p>
                <table-wrap id="T4" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Explanatory variables</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Design matrix</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Section</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">treat1
                                    <sub>
                                        <italic toggle="yes">NO</italic>,
                                        <italic toggle="yes">YES</italic>
                                    </sub> and treat2
                                    <sub>
                                        <italic toggle="yes">NO</italic>,
                                        <italic toggle="yes">YES</italic>
                                    </sub>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;treat1*treat2)
                                    <break/>model.matrix(&#x223c;treat1+treat2)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Interactions using a two-factor model
                                    <break/>Additivity using a two-factor model</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec id="sec12">
                <title>Studies with multiple factors</title>
                <p>This section looks at studies with multiple factors. It includes study designs that are more complex in nature and describes the approaches one can take to examine the differences of interest. The section covers studies with nested factors and the fitting of mixed effects models.</p>
                <table-wrap id="T5" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Explanatory variables</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Design matrix</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Section</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">tissue
                                    <sub>
                                        <italic toggle="yes">LUNG</italic>,
                                        <italic toggle="yes">BRAIN</italic>
                                    </sub> and cells
                                    <sub>
                                        <italic toggle="yes">B</italic>,
                                        <italic toggle="yes">T</italic>
                                    </sub>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;0+group) 
                                    <italic toggle="yes">with</italic> 
                                    <inline-formula>
                                        <mml:math display="inline">
                                            <mml:msub>
                                                <mml:mtext>group</mml:mtext>
                                                <mml:mrow>
                                                    <mml:mi>L</mml:mi>
                                                    <mml:mi>U</mml:mi>
                                                    <mml:mi>N</mml:mi>
                                                    <mml:mi>G</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>B</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>B</mml:mi>
                                                    <mml:mi>R</mml:mi>
                                                    <mml:mi>A</mml:mi>
                                                    <mml:mi>I</mml:mi>
                                                    <mml:mi>N</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>B</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>L</mml:mi>
                                                    <mml:mi>U</mml:mi>
                                                    <mml:mi>N</mml:mi>
                                                    <mml:mi>G</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>T</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>B</mml:mi>
                                                    <mml:mi>R</mml:mi>
                                                    <mml:mi>A</mml:mi>
                                                    <mml:mi>I</mml:mi>
                                                    <mml:mi>N</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>T</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:math>
                                    </inline-formula>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Conversion to a single factor</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">group
                                    <sub>
                                        <italic toggle="yes">A</italic>,
                                        <italic toggle="yes">B</italic>
                                    </sub>, lane
                                    <sub>
                                        <italic toggle="yes">L</italic>1,
                                        <italic toggle="yes">L</italic>2</sub> and technician
                                    <sub>
                                        <italic toggle="yes">I</italic>,
                                        <italic toggle="yes">II</italic>
                                    </sub>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;0+group+lane+technician)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Accounting for factors that are not of interest</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">group
                                    <sub>
                                        <italic toggle="yes">A</italic>,
                                        <italic toggle="yes">B</italic>
                                    </sub> and batch
                                    <sub>
                                        <italic toggle="yes">B</italic>1,
                                        <italic toggle="yes">B</italic>2</sub>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">Check rank of design matrix</italic>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Nested factors and matrices without full rank*</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">treatment
                                    <sub>
                                        <italic toggle="yes">X</italic>,
                                        <italic toggle="yes">Y</italic>
                                    </sub> and timepoint
                                    <sub>
                                        <italic toggle="yes">T</italic>1,
                                        <italic toggle="yes">T</italic>2</sub> with repeated measurements</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <italic toggle="yes">Model mouse IDs, then add columns representing timepoint T2 for both treatments</italic>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Time series experiment with repeated mouse measurements nested within treatments*</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">treatment
                                    <sub>
                                        <italic toggle="yes">X</italic>,
                                        <italic toggle="yes">Y</italic>
                                    </sub> and timepoint
                                    <sub>
                                        <italic toggle="yes">T</italic>1,
                                        <italic toggle="yes">T</italic>2</sub> with repeated measurements</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;0+group) 
                                    <italic toggle="yes">with</italic> 
                                    <inline-formula>
                                        <mml:math display="inline">
                                            <mml:msub>
                                                <mml:mtext>group</mml:mtext>
                                                <mml:mrow>
                                                    <mml:mi>X</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>T</mml:mi>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>X</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>T</mml:mi>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>Y</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>T</mml:mi>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>Y</mml:mi>
                                                    <mml:mo>_</mml:mo>
                                                    <mml:mi>T</mml:mi>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:math>
                                    </inline-formula>, 
                                    <italic toggle="yes">and mouse ID as random effect</italic>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Treating factors that are not of direct interest as random effects*</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec id="sec13">
                <title>Studies with covariates</title>
                <p>In this section, we look at explanatory variables that are covariates rather than factors. We begin with fitting some simple models and work up towards more complex ones such as the fitting of cyclical models.</p>
                <table-wrap id="T6" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Explanatory variables</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Design matrix</th>
                                <th align="left" colspan="1" rowspan="1" valign="bottom">Section</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">treatment
                                    <sub>
                                        <italic toggle="yes">X</italic>,
                                        <italic toggle="yes">Y</italic>
                                    </sub> and time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;0+treatment*time)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Combination with factor variable</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;time)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Linear time series</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;poly(time, degree=2, raw=TRUE))</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Quadratic time series</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;poly(time, degree=3, raw=TRUE))</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cubic time series</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">model.matrix(&#x223c;time+sinphase+cosphase)
                                    <break/>model.matrix(&#x223c;sinphase+cosphase)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cyclical models*
                                    <break/>Cyclical models*</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
        </sec>
        <sec id="sec14">
            <title>Design and contrast matrices</title>
            <sec id="sec15">
                <title>Background</title>
                <p>In this section, we demonstrate how design and contrast matrices can be created for the most basic experimental designs with covariates and factors. Specifically, we discuss the similarities and differences between design matrices that include and exclude an intercept term.</p>
                <p>Design matrices are used in the estimation process of model parameters. The design matrix has columns associated with the parameters and rows associated with samples (
                    <xref ref-type="fig" rid="f2">Figure 2</xref>). If the estimated parameters are not of direct interest, a 
                    <italic toggle="yes">contrast matrix</italic> can be used to calculate contrasts of the parameters. Combining multiple contrasts, each column in the contrast matrix represents a single contrast, and has rows associated with columns in the corresponding design matrix (
                    <xref ref-type="fig" rid="f2">Figure 2</xref>). Using the R programming language, we code for design matrices using the 
                    <monospace>model.matrix</monospace> function from the 
                    <bold>stats</bold> package, and contrast matrices using the 
                    <monospace>makeContrasts</monospace> function from the 
                    <bold>limma</bold> package. To learn more about these functions, one can bring up associated help pages by typing 
                    <monospace>?stats::model.matrix</monospace> and 
                    <monospace>?limma::makeContrasts</monospace> in the R console.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>The structure of design and contrast matrices.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure2.gif"/>
                </fig>
            </sec>
            <sec id="sec16">
                <title>Design matrices with and without intercept term</title>
                <p>For a single explanatory variable, which we simply call 
                    <monospace>variable</monospace>, a design matrix can be coded by 
                    <monospace>model.matrix(~variable)</monospace> to include an intercept term, or by 
                    <monospace>model.matrix(~0+variable)</monospace> to exclude the intercept term. One of the most fundamental concepts in the coding of design matrices is to understand when one should include an intercept term, when not to, and how it affects the underlying model. If 
                    <monospace>variable</monospace> is a factor, then the two models with and without the intercept term are equivalent, but if 
                    <monospace>variable</monospace> is a covariate the then two models are fundamentally different.</p>
            </sec>
            <sec id="sec17">
                <title>Covariates</title>
                <p>Using age as an example, let&#x2019;s look at the gene expression of mice where their age in weeks from birth were recorded. The expression of a single gene is recorded as a numerical vector called 
                    <monospace>expression</monospace>. The age of mice is also recorded as a numerical vector in 
                    <monospace>age</monospace> (as weeks), and we use an additional 
                    <monospace>mouse</monospace> character vector to show that these are independent measurements. The three vectors, 
                    <monospace>expression</monospace>, 
                    <monospace>mouse</monospace> and 
                    <monospace>age</monospace>, that represent our example data are displayed below as a data frame as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##   expression  mouse age
## 1       2.38 MOUSE1   1
## 2       2.85 MOUSE2   2
## 3       3.60 MOUSE3   3
## 4       4.06 MOUSE4   4
## 5       4.61 MOUSE5   5
## 6       5.04 MOUSE6   6
</preformat>
                </p>
                <sec id="sec55">
                    <title>With intercept</title>
                    <p>A design matrix with an intercept term can be coded as 
                        <monospace>model.matrix(~age)</monospace>. The resultant design matrix, which is displayed in 
                        <xref ref-type="fig" rid="f3">Figure 3</xref>, contains a column of 1s (the intercept term) and a column with values taken from 
                        <monospace>age</monospace>. This design matrix is associated with a regression model (see 
                        <xref ref-type="fig" rid="f1">Figure 1</xref>), where the intercept term in the first column is parameterised for the y-intercept, and &#x201c;age&#x201d; in the second column is parameterised for the slope of the regression line. In other words, the second column is used to estimate the rate of change in gene expression per unit increase in age.</p>
                    <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                        <label>Figure 3. </label>
                        <caption>
                            <title>Expected gene expression is modelled by an age covariate, which is denoted as 
                                <italic toggle="yes">x</italic> in the model and plot.</title>
                            <p>This particular model includes an intercept term so that the model (line) has flexibility in intersecting the y-axis at any point. MODEL: The fitted model in written form, where 
                                <italic toggle="yes">y</italic> represents gene expression and 
                                <italic toggle="yes">E</italic>(
                                <italic toggle="yes">y</italic>) expected gene expression. Estimated model parameters are highlighted in colour. MATRIX: R input and output for the associated design matrix, with the colour of column names (model parameters) matching that of the estimated parameters above. For simplicity, &#x201d;assign&#x201d; and &#x201d;contrasts&#x201d; attributes of the design matrix are not displayed in our figures (see &#x2018;?stats::model.matrix&#x2019; for more). PLOT: Observed data points are drawn together with the fitted model representing expected gene expression. Where appropriate, aspects of the fitted model are drawn in a colour that matches associated parameter estimates.</p>
                        </caption>
                        <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure3.gif"/>
                    </fig>
                    <p>The parameters can be estimated as 1.85 for the y-intercept and 0.54 for the slope. This means that for every 1 unit (or week) increase in age, gene expression increases by a value of 0.54 on average. We can write our statistical model using the estimated model parameters to give us our 
                        <italic toggle="yes">fitted model</italic>. The fitted model can be written as 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 1.85  + 0.54
                        <italic toggle="yes">x</italic>, with 
                        <italic toggle="yes">y</italic> representing expression of the gene, 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) representing the expected gene expression and 
                        <italic toggle="yes">x</italic> representing age.</p>
                </sec>
                <sec id="sec56">
                    <title>Without intercept</title>
                    <p>Alternatively, a design matrix without an intercept term can be coded as 
                        <monospace>model.matrix(~0+age)</monospace>. This design matrix contains a single column that represents age, as shown in 
                        <xref ref-type="fig" rid="f4">Figure 4</xref>, which is parameterised for the slope of a regression line. By adding a 
                        <monospace>0</monospace> to the formula in 
                        <monospace>model.matrix</monospace>, the intercept term has been removed. This means that the regression line is forced to intercept the y-axis at 0. The slope of the line can be estimated to be 0.97, such that gene expression is expected to increase by 0.97 for every 1 unit increase in age. The fitted model is written as 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 0.97
                        <italic toggle="yes">x</italic>.</p>
                    <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                        <label>Figure 4. </label>
                        <caption>
                            <title>Expected gene expression modelled by an age covariate, where the model (line) must intersect the y-axis at the zero-point.</title>
                            <p>This restriction is due to the design matrix set up which excludes the intercept term.</p>
                        </caption>
                        <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure4.gif"/>
                    </fig>
                </sec>
                <sec id="sec57">
                    <title>Comparing models</title>
                    <p>When comparing between the models with and without an intercept term, we observe that the model without an intercept term (
                        <xref ref-type="fig" rid="f4">Figure 4</xref>) does not fit as closely to the observed data as the model with an intercept term (
                        <xref ref-type="fig" rid="f3">Figure 3</xref>). It is not surprising that the model with an intercept term provides a better fit to the data since it is less restrictive (allows the y-intercept to be at any point) than the model without an intercept term. The extra parameter in the model allows it to be more flexible. In general, we suggest the inclusion of an intercept term for modelling explanatory variables that are covariates since it provides a more flexible fit to the data points. A model without an intercept term would only be recommended in cases where there is a strong biological reason why a zero covariate should be associated with a zero expression value, and such contexts are rare in gene expression modelling.</p>
                </sec>
            </sec>
            <sec id="sec18">
                <title>Factors</title>
                <p>Now we consider an example of gene expression on healthy and sick mice, each in triplicate. Healthy and sick mice are classified using a 
                    <monospace>group</monospace> factor which contains two levels, 
                    <monospace>HEALTHY</monospace> and 
                    <monospace>SICK</monospace>. The level names are written in all capitals so that design matrices have column names that are easier to read by default e.g. &#x201c;groupSICK&#x201d; (
                    <xref ref-type="fig" rid="f5">Figure 5</xref> and 
                    <xref ref-type="fig" rid="f6">6</xref>) rather than &#x201c;groupsick&#x201d; or &#x201c;groupSick&#x201d;. The data is displayed by combining vectors for 
                    <monospace>expression</monospace>, 
                    <monospace>mouse</monospace> and 
                    <monospace>group</monospace> as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##   expression  mouse   group
## 1       2.38 MOUSE1 HEALTHY
## 2       2.85 MOUSE2 HEALTHY
## 3       3.60 MOUSE3 HEALTHY
## 4       4.06 MOUSE4    SICK
## 5       4.61 MOUSE5    SICK
## 6       5.04 MOUSE6    SICK
</preformat>
                </p>
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>Figure 5. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a group factor, where 
                            <italic toggle="yes">x</italic> is an indicator variable for sick mice (
                            <italic toggle="yes">x</italic> = 1 for sick mice, and 
                            <italic toggle="yes">x</italic> = 0 otherwise).</title>
                        <p>The associated design matrix includes an intercept term, where healthy mice acts as the reference level. The expected gene expression of non-reference levels, e.g. that of sick mice, are represented by dashed lines in the plot.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure5.gif"/>
                </fig>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>Figure 6. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a group factor, where 
                            <italic toggle="yes">x</italic>
                            <sub>1</sub> is an indicator variable for healthy mice (
                            <italic toggle="yes">x</italic>
                            <sub>1</sub> = 1 for healthy; 0 otherwise), and 
                            <italic toggle="yes">x</italic>
                            <sub>2</sub> is an indicator variable for sick mice (
                            <italic toggle="yes">x</italic>
                            <sub>2</sub> = 1 for sick; 0 otherwise).</title>
                        <p>The associated design matrix excludes an intercept term.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure6.gif"/>
                </fig>
                <sec id="sec67">
                    <title>With intercept</title>
                    <p>A design matrix with an intercept term can be coded as 
                        <monospace>model.matrix(~group)</monospace> to obtain a two column matrix (
                        <xref ref-type="fig" rid="f5">Figure 5</xref>). In general, the resulting design matrix will have the same number of columns as the factor 
                        <monospace>group</monospace> has levels. The design matrix contains a column of 1s (the intercept term), and a column with values as 0s or 1s (a value of 1 when the associated sample is in the sick group, and 0 otherwise). This design matrix is parameterised for a mean-reference model (
                        <xref ref-type="fig" rid="f1">Figure 1</xref>), where the intercept term in the first column is parameterised for the mean expression of the healthy group, and the second column is parameterised for the mean expression of the sick group relative to healthy (difference between sick and healthy). The healthy group is selected as the reference level since 
                        <monospace>HEALTHY</monospace> is the first level in 
                        <monospace>group</monospace>. Levels in a factor are ordered alphanumerically by default, but re-specification of the reference can be carried out using the 
                        <monospace>relevel</monospace> function in the 
                        <bold>stats</bold> package. Using this design matrix, the parameters can be estimated to be 2.95 and 1.62, such that the mean gene expression of the healthy group is 2.95, and the mean gene expression of the sick group relative to healthy is 1.62. We can calculate the mean expression of the sick group by summing both parameter estimates, which in this case is equal to 4.57. Using the estimated values, the fitted model for expected gene expression can be written as 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 2.95 + 1.62
                        <italic toggle="yes">x</italic>, where 
                        <italic toggle="yes">x</italic> is an indicator variable for mice that are sick. The indicator variable, or 
                        <italic toggle="yes">x</italic>, takes the value of 1 when calculating the expected expression of sick mice, and takes the value of 0 when calculating the expected expression of healthy mice. In other words, 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 2.95 for healthy mice and 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 4.57 for sick mice.</p>
                </sec>
                <sec id="sec60">
                    <title>Without intercept</title>
                    <p>A design matrix without an intercept term can be coded as 
                        <monospace>model.matrix(~0+group)</monospace>, which gives an equivalent model to that of the previous model. The design matrix here also contains two columns (
                        <xref ref-type="fig" rid="f6">Figure 6</xref>), but is instead parameterised for a means model (
                        <xref ref-type="fig" rid="f1">Figure 1</xref>). This means that the first column of the design matrix is parameterised for the mean expression of the healthy group, where a value of 1 is present when the associated sample belongs to the healthy group, and 0 otherwise. The second column is parameterised for the mean expression of the sick group, and has a value of 1 when the associated sample belongs to the sick group, and 0 otherwise. The parameters in this model can be estimated as 2.95 and 4.57 for the mean gene expression of healthy and sick mice, respectively. Thus, the fitted model for expected gene expression can be written as 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 2.95
                        <italic toggle="yes">x</italic>
                        <sub>1</sub> + 4.57
                        <italic toggle="yes">x</italic>
                        <sub>2</sub>, where 
                        <italic toggle="yes">x</italic>
                        <sub>1</sub> and 
                        <italic toggle="yes">x</italic>
                        <sub>2</sub> are indicator variables for healthy mice and sick mice respectively. In other words, 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 2.95 for healthy mice and 
                        <italic toggle="yes">E</italic>(
                        <italic toggle="yes">y</italic>) = 4.57 for sick mice.</p>
                </sec>
                <sec id="sec63">
                    <title>Comparing models</title>
                    <p>As mentioned in our earlier description of basic models (
                        <xref ref-type="fig" rid="f1">Figure 1</xref>), models with and without an intercept term are equivalent for factor explanatory variables, but differ in parameterisation. This means that the expected expression values for healthy and sick mice are the same regardless of whether a means model (without an intercept term in the design matrix) or a mean-reference model (with an intercept term in the design matrix) is fitted. The only difference is that the expected gene expression for sick mice is calculated by summing both parameter estimates in a mean-reference model, whereas it is estimated directly as the second parameter in a means model. For this reason, it ultimately does not matter which design matrix is used. We recommend the use of whichever design matrix that is better understood by the reader, which is often the design matrix without the intercept term since the interpretation of parameters is more straightforward.</p>
                </sec>
            </sec>
            <sec id="sec19">
                <title>Contrast matrix for computing differences</title>
                <p>When fitting a means model, the parameter estimates themselves are usually not of direct interest. It is the difference between the parameter estimates, or difference between mean expression of groups, that is of interest. The difference in parameter estimates can be calculated using a contrast matrix via the 
                    <monospace>makeContrast</monospace> function. To specify the comparison of interest, column names from the design matrix, &#x201c;groupHEALTHY&#x201d; and &#x201c;groupSICK&#x201d;, are inserted into the function. The design matrix and associated contrast matrix is coded as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
design &lt;- model.matrix(~0+group)
makeContrasts(groupSICK-groupHEALTHY, levels=colnames(design))

##               Contrasts
## Levels         groupSICK - groupHEALTHY
##   groupHEALTHY                       -1
##   groupSICK                           1
</preformat>
                </p>
                <p>The 
                    <monospace>makeContrast</monospace> function simply creates the contrast of (-1, 1) which subtracts the first parameter estimate (mean expression of healthy) from the second parameter estimate (mean expression of sick). Using the parameter estimates estimated earlier (
                    <xref ref-type="fig" rid="f6">Figure 6</xref>), the contrast calculates -2.95 plus +4.57 which equals 1.62. In other words, we expect gene expression of sick mice to be upregulated by 1.62 units relative to healthy mice. Notice how this is the same value as the second parameter estimate in the mean-reference model (
                    <xref ref-type="fig" rid="f5">Figure 5</xref>), since that model is directly parameterised for the difference between sick and healthy mice. It is also reasonable to compute the difference in the opposite direction, by having 
                    <monospace>groupHEALTHY-groupSICK</monospace> as the first argument of the 
                    <monospace>makeContrasts</monospace> function. This will result in the value of -1.62 instead. The two options only differ in their interpretation, &#x201c;gene expression of sick mice is greater than healthy mice by a value of 1.62&#x201d; versus &#x201c;gene expression of healthy mice is greater than sick mice by a value of -1.62&#x201d;.</p>
            </sec>
        </sec>
        <sec id="sec20">
            <title>Study of treatments and control</title>
            <sec id="sec21">
                <title>Background</title>
                <p>In this section, we focus on a single factor as an explanatory variable to modelling gene expression. The factor we use contains several levels, which allows us to discuss some common comparisons of interest, and show different methods of calculating those differences.</p>
                <p>A very common study design examines several conditions of interest, where one condition represents the control. Considering such an experimental design, we want to model the relationship between gene expression and four possible conditions: three treatments and a control. The explanatory variable is set up as a factor vector, which we have named 
                    <monospace>treatment</monospace>, and the factor is used to classify samples into the control group (CTL), treatment I, treatment II, and treatment III. The factor has a total of 4 levels. Gene expression is recorded as a numeric vector called 
                    <monospace>expression</monospace>, and 
                    <monospace>mouse</monospace> is a character vector showing that the observations are independent measurements. The data combines the vectors as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression   mouse treatment
## 1        1.01  MOUSE1       CTL
## 2        1.04  MOUSE2       CTL
## 3        1.04  MOUSE3       CTL
## 4        1.99  MOUSE4         I
## 5        2.36  MOUSE5         I
## 6        2.00  MOUSE6         I
## 7        2.89  MOUSE7        II
## 8        3.12  MOUSE8        II
## 9        2.98  MOUSE9        II
## 10       5.00 MOUSE10       III
## 11       4.92 MOUSE11       III
## 12       4.78 MOUSE12       III
</preformat>
                </p>
                <p>We know from the previous section that the 
                    <monospace>treatment</monospace> factor can be represented in a means model or a mean-reference model using the design matrices coded as 
                    <monospace>model.matrix(~0+treatment)</monospace> or 
                    <monospace>model.matrix(~treatment)</monospace> respectively. Either representation would give equivalent models, and so it would be unnecessary to describe both models for the same exercise. Based on the comparison of interest at hand, we demonstrate the use of one of the models using the most direct approach.</p>
            </sec>
            <sec id="sec22">
                <title>Treatments versus control</title>
                <p>For a comparison of each treatment group versus the control group, we model gene expression using a mean-reference model. This is ideal since the differences can be estimated directly from the model parameters, and without the use of an additional contrast matrix. To do this, the control group would act as the reference and must be the first level of the 
                    <monospace>treatment</monospace> factor vector. We can view the order of levels by 
                    <monospace>levels(treatment)</monospace>. If the level associated with the control group is not listed first, it can be changed to the first or reference level with the code 
                    <monospace>treatment &lt;- relevel(treatment, ref="CTL")</monospace>.</p>
                <p>We can now create a design matrix that represents a mean-reference model by 
                    <monospace>model.matrix(~treatment)</monospace> (
                    <xref ref-type="fig" rid="f7">Figure 7</xref>). The columns of the design matrix represent the mean expression of the control group, and the difference in mean expression between treatment I and control, treatment II and control, and treatment III and control. Using the design matrix, the model parameters are estimated as 1.03, 1.09, 1.97 and 3.87. This means that the difference in expected gene expression between treatments and control are 1.09 for treatment I and control, 1.97 for treatment II and control, and 3.87 for treatment III and control. Treatment III has the greatest expected gene expression difference from the control. The fitted model for expected gene expression can then be written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = 1.03 + 1.09
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> + 1.97
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> + 3.87
                    <italic toggle="yes">x</italic>
                    <sub>3</sub>, where the 
                    <italic toggle="yes">x</italic>&#x2019;s are indicator variables for treatment I, treatment II and treatment III, respectively. In other words, 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> = 1 for treatment I, 
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> = 1 for treatment II, and 
                    <italic toggle="yes">x</italic>
                    <sub>3</sub> = 1 for treatment III. The 
                    <italic toggle="yes">x</italic>&#x2019;s are equal to 0 elsewhere.</p>
                <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                    <label>Figure 7. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a treatment factor.</title>
                        <p>The design matrix that is used includes an intercept term which represents the mean gene expression of the control group, or the reference level in the treatment factor. Other levels in the factor have mean gene expression represented relative to the control group. This means that the second to fourth parameters in the mean-reference model represent gene expression differences between treatment groups and the control group. The 
                            <italic toggle="yes">x</italic>&#x2019;s in the model are indicator variables for treatment groups, with 
                            <italic toggle="yes">x</italic>
                            <sub>1</sub> = 1 for treatment I, 
                            <italic toggle="yes">x</italic>
                            <sub>2</sub> = 1 for treatment II, and 
                            <italic toggle="yes">x</italic>
                            <sub>3</sub> = 1 for treatment III.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure7.gif"/>
                </fig>
            </sec>
            <sec id="sec23">
                <title>All pairwise comparisons</title>
                <p>In order to make all possible pairwise comparisons between the treatments, we model gene expression using a means model. Due to its parameterisation, the means model is simple to work with when specifying the comparisons of interest in the contrast matrix.</p>
                <p>The associated design matrix is coded as 
                    <monospace>design &lt;- model.matrix(~0+treatment)</monospace>, with columns or parameters representing the mean gene expression of each control and treatment group (
                    <xref ref-type="fig" rid="f8">Figure 8</xref>). The mean expression values can then be estimated as 1.03, 2.12, 3 and 4.9; where the fitted model for expected gene expression is written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = 1.03
                    <italic toggle="yes">x</italic>
                    <sub>0</sub> + 2.12
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> + 3
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> + 4.9
                    <italic toggle="yes">x</italic>
                    <sub>3</sub>. The 
                    <italic toggle="yes">x</italic>&#x2019;s are indicator variables for control, treatment I, treatment II and treatment III, respectively. Specifically, 
                    <italic toggle="yes">x</italic>
                    <sub>0</sub> = 1 for control, 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> = 1 for treatment I, 
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> = 1 for treatment II, and 
                    <italic toggle="yes">x</italic>
                    <sub>3</sub> = 1 for treatment III, and 0 elsewhere.</p>
                <fig fig-type="figure" id="f8" orientation="portrait" position="float">
                    <label>Figure 8. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a treatment factor.</title>
                        <p>The design matrix that is used excludes the intercept term so that the associated model is a means model. In other words, the mean gene expression of each level in &#x2018;treatment&#x2019; is represented by a parameter in the model. The 
                            <italic toggle="yes">x</italic>&#x2019;s in the model are indicator variables for control and treatment groups, with 
                            <italic toggle="yes">x</italic>
                            <sub>0</sub> = 1 for control, 
                            <italic toggle="yes">x</italic>
                            <sub>1</sub> = 1 for treatment I, 
                            <italic toggle="yes">x</italic>
                            <sub>2</sub> = 1 for treatment II, and 
                            <italic toggle="yes">x</italic>
                            <sub>3</sub> = 1 for treatment III.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure8.gif"/>
                </fig>
                <p>Taking these parameter estimates, we compute all pairwise differences between treatments using the 
                    <monospace>makeContrasts</monospace> function as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
contrasts &lt;- makeContrasts(
  treatmentI-treatmentCTL, treatmentII-treatmentCTL, 
  treatmentIII-treatmentCTL, treatmentII-treatmentI, 
  treatmentIII-treatmentI, treatmentIII-treatmentII, 
  levels=colnames(design))
colnames(contrasts) &lt;- abbreviate(colnames(contrasts))
contrasts

##               Contrasts
## Levels         tI-tC tII-tC tIII-tC tII-tI trIII-tI tIII-tII
##   treatmentCTL    -1     -1      -1      0        0        0
##   treatmentI       1      0       0     -1       -1        0
##   treatmentII      0      1       0      1        0       -1
##   treatmentIII     0      0       1      0        1        1
</preformat>
                </p>
                <p>Note that there are six possible pairwise comparisons between the four treatments. Note also that default column names in the contrast matrix have been abbreviated here using the 
                    <monospace>abbreviate</monospace> function from the 
                    <bold>base</bold> package so that the contrast matrix can display neatly above (although this step is not usually necessary). The contrast matrix contains six columns, each representing one comparison: &#x201c;tI-C&#x201d; for treatment I versus control, &#x201c;tII-C&#x201d; for treatment II versus control, &#x201c;tIII-C&#x201d; for treatment III versus control, &#x201c;tII-I&#x201d; treatment II versus I, &#x201c;trIII-I&#x201d; for treatment III versus I, and &#x201c;tIII-II&#x201d; treatment III versus II. The 1s and -1s in each column of the contrast matrix mark the parameters from the design matrix from which comparisons are made. For example, the first contrast subtracts the mean expression of the control group (parameter 1) from the mean expression of treatment II (parameter 2), which is calculated as 1.09. In such a way, differences between treatments or between treatments and control are estimated. The difference in mean gene expression is estimated as 1.97 for treatment II versus control, 3.87 for treatment III versus control, 0.88 for treatment II versus I, 2.78 for treatment III versus I, and 1.9 for treatment III versus II.</p>
            </sec>
            <sec id="sec24">
                <title>Control versus the rest</title>
                <p>Rather than considering each treatment-control comparison separately, suppose that it is of interest to compare the control group to all of the treatment groups simultaneously. The idea of this is to find the genes that may define the control relative to the treatments. The same can also be carried out for individual treatment groups. For example, we could also consider the genes that define treatment I relative to the rest of the groups.</p>
                <p>When comparing the control group to the rest of the groups, it is not advisable to merge treatments I, II and III into one big treatment group, and to simply fit a separate model for the combined treatment group and control. The combined treatment group does not account for group-specific variability, and the combined group would be biased towards larger treatment groups in an unbalanced study design. Instead, we demonstrate two methods to approach this. Both methods can use either of the fitted models from the previous sections (mean-reference or means model), where individual group means and variability are accounted for. The first method uses a contrast matrix to compare the control group to the treatment average, and the second looks at the overlap between treatment-control comparisons.</p>
                <sec id="sec71">
                    <title>Control versus treatment average</title>
                    <p>Using the means model defined earlier, we calculate the average of the mean gene expression of treatment groups. We then subtract the mean gene expression of the control group from the average treatment value. To do this, a contrast matrix is coded as follows:</p>
                    <p>

                        <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
makeContrasts((treatmentI+treatmentII+treatmentIII)/3-treatmentCTL,
  levels=colnames(design))

##               Contrasts
## Levels         (treatmentI + treatmentII + treatmentIII)/3 - treatmentCTL
##   treatmentCTL                                                      -1.00
##   treatmentI                                                         0.33
##   treatmentII                                                        0.33
##   treatmentIII                                                       0.33
</preformat>
                    </p>
                    <p>which calculates (2.12+3+4.9)/3 - 1.03 and is equal to 2.31. Notice how the parameter estimates for treatment groups are divided by 3, the number of treatment groups under consideration. This is important as it ensures the correct calculation of averages. What this method says is that the average gene expression of the treatment groups is greater than the control group by 2.31. In our case, the gene expression of each treatment group is also greater than the control. It is worth noting, however, that the average gene expression of the treatment groups can be greater than that of the control group when individual treatment groups are not necessarily all greater than the control.</p>
                </sec>
                <sec id="sec72">
                    <title>Overlap of treatment-control results</title>
                    <p>For a more stringent approach that ensures that gene expression in each of the treatment groups are greater (or lower) than the control, we use a method of overlaps. Taking results from three treatment-control comparisons, we overlap or take the intersection of the genes that are significantly up-regulated (or down-regulated). Significance is usually defined by an adjusted 
                        <italic toggle="yes">p</italic>-value cut-off of 5%, but it can also be defined at varying thresholds or by using other summary statistics such as log-fold-changes. Notice that we take the direction of change into consideration so that genes are consistently up- or down-regulated in the control group. The direction of change can be determined by log-fold-change values, 
                        <italic toggle="yes">t</italic>-statistics or similar statistics. In the case where there are only a small number of significant genes in each of the treatment-control comparisons, the method described here can be overly stringent and result in no overlapping genes in the set. If this is the case, it would be reasonable to relax the threshold for defining significance.</p>
                </sec>
            </sec>
            <sec id="sec25">
                <title>2 versus 2 group comparisons</title>
                <p>Let us suppose it is of interest to compare the gene expression of two groups against another two other groups. This may be of interest if there are prior expectations that two groups are more similar to each other than the other two. In this example, we compare control and treatment III against treatment I and II by applying the contrast coded as</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
makeContrasts((treatmentCTL+treatmentIII)/2-(treatmentI+treatmentII)/2, 
  levels=colnames(design))

##               Contrasts
## Levels         (treatmentCTL + treatmentIII)/2 - (treatmentI + treatmentII)/2
##   treatmentCTL                                                            0.5
##   treatmentI                                                             -0.5
##   treatmentII                                                            -0.5
##   treatmentIII                                                            0.5
</preformat>
                </p>
                <p>to the means model. In defining the contrast, parameter estimates are divided out by the number of groups used to calculate the average. Using the parameter estimates, the difference in the 2 versus 2 group comparison is calculated as (1.03  + 4.9)/2 - (2.12  + 3)/2, which equals 0.41.</p>
            </sec>
        </sec>
        <sec id="sec26">
            <title>Study of interactions and additivity of treatments</title>
            <sec id="sec27">
                <title>Background</title>
                <p>In this section, we reconsider the same experimental data as in the previous section, but we now suppose the treatment III is a combination of treatments I and II. Here we are interested in examining the effect of combining treatments I and II relative to their individual effects. We approach this using two methods. The first simply uses the parameter estimates that we have already calculated from the previous section, meaning that we use the single 
                    <monospace>treatment</monospace> factor to allocate sample information on the treatment and control types. The second approach uses two separate factors, which we will call 
                    <monospace>treat1</monospace> and 
                    <monospace>treat2</monospace>, to allocate sample information on whether treatment I and/or treatment II were administered.</p>
            </sec>
            <sec id="sec28">
                <title>Interaction using a single factor model</title>
                <p>Using the first approach, we model the relationship between gene expression and the 
                    <monospace>treatment</monospace> factor with a mean-reference model. Taking the corresponding parameter estimates from the mean-reference model, such that we use the design matrix coded as 
                    <monospace>model.matrix(~treatment)</monospace>, we find that the effect of treatment I relative to control is 1.09, such that the difference in means between treatment I and control is 1.09. The relative effect of treatment II is 1.97, and the relative effect of the combined treatment (previously referred to as treatment III) is 3.87. For simplicity, let us refer to these relative effects as 
                    <italic toggle="yes">A</italic>, 
                    <italic toggle="yes">B</italic> and 
                    <italic toggle="yes">C</italic>.</p>
                <p>We consider the combined treatment to have an 
                    <italic toggle="yes">additive</italic> effect if the combined treatment effect is equal to the sum of the two individual effects, such that 
                    <italic toggle="yes">C</italic> &#x2212; 
                    <italic toggle="yes">A</italic> &#x2212; 
                    <italic toggle="yes">B</italic> = 0, which we simplify to 
                    <italic toggle="yes">&#x03b4;</italic> = 0. On the other hand, we consider the combined treatment to have an 
                    <italic toggle="yes">interaction</italic> effect if the combined treatment effect is not equal to the sum of the two individual effects, such that 
                    <italic toggle="yes">&#x03b4;</italic>&#x2260;0. An interaction effect is considered to be synergistic if the combined effect is greater than the sum of the individual effects (
                    <italic toggle="yes">&#x03b4;</italic> &gt; 0), and is considered repressive if the combined effect is less than the sum of the individual effects (
                    <italic toggle="yes">&#x03b4;</italic> &lt; 0). As you can see, it is of interest to determine the value of 
                    <italic toggle="yes">&#x03b4;</italic>, which we call the interaction term. Using a design matrix with an intercept term, we define the interaction term, or 
                    <italic toggle="yes">&#x03b4;</italic> = 
                    <italic toggle="yes">C</italic> &#x2212; 
                    <italic toggle="yes">A</italic> &#x2212; 
                    <italic toggle="yes">B</italic>, as a contrast in the 
                    <monospace>makeContrast</monospace> function, as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
design &lt;- model.matrix(~treatment)
makeContrasts(treatmentIII-treatmentI-treatmentII,
  levels=colnames(design))

##               Contrasts
## Levels         treatmentIII - treatmentI - treatmentII
##   Intercept                                          0
##   treatmentI                                        -1
##   treatmentII                                       -1
##   treatmentIII                                       1
</preformat>
                </p>
                <p>Taking the parameter estimates from the mean-reference model, this simply calculates 
                    <italic toggle="yes">&#x03b4;</italic> as 3.87-1.09-1.97, which equals 0.82. Since the interaction term is a positive value, we conclude that combined treatment effect is interactive and synergistic.</p>
                <p>Note that in running the 
                    <monospace>makeContrasts</monospace> function above, the function automatically converted the &#x201c;(Intercept)&#x201d; column in the design matrix to &#x201c;Intercept&#x201d; since the brackets are syntactically invalid. To avoid distracting from the results, we suppressed the display of its warning message referring to this in our output above.</p>
            </sec>
            <sec id="sec29">
                <title>Interactions using a two-factor model</title>
                <p>Another way to approach the same problem is by reassign the explanatory variable into two factors representing the presence and absence of the treatments. The factors 
                    <monospace>treat1</monospace> and 
                    <monospace>treat2</monospace> are defined as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression   mouse treat1 treat2
## 1        1.01  MOUSE1     NO     NO
## 2        1.04  MOUSE2     NO     NO
## 3        1.04  MOUSE3     NO     NO
## 4        1.99  MOUSE4    YES     NO
## 5        2.36  MOUSE5    YES     NO
## 6        2.00  MOUSE6    YES     NO
## 7        2.89  MOUSE7     NO    YES
## 8        3.12  MOUSE8     NO    YES
## 9        2.98  MOUSE9     NO    YES
## 10       5.00 MOUSE10    YES    YES
## 11       4.92 MOUSE11    YES    YES
## 12       4.78 MOUSE12    YES    YES
</preformat>
                </p>
                <p>Here 
                    <monospace>treat1</monospace> indicates the presence or absence of treatment I and 
                    <monospace>treat2</monospace> indicates the presence or absence of treatment II. The two factors allow us to create a model that directly includes an interaction term. The associated design matrix is coded as 
                    <monospace>model.matrix(~treat1*treat2)</monospace>, where an asterisk is placed between the two factors (
                    <xref ref-type="fig" rid="f9">Figure 9</xref>). The design matrix is parameterised for the mean gene expression of the control group (first column), the difference in mean expression between treatment I and control (second column), treatment II and control (third column), and the interaction term (last column). Using this design matrix, the parameters can be estimated as 1.03, 1.09, 1.97 and 0.82. In other words, the interaction term is estimated as 0.82, and has the same value as calculated previously.</p>
                <fig fig-type="figure" id="f9" orientation="portrait" position="float">
                    <label>Figure 9. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a two factors representing treatment I and treatment II.</title>
                        <p>The design matrix that is used includes an interaction term in the last column, and we refer to the associated model as an interaction model. The interaction term can be used to indicates whether the combined administration of treatments I and II have an additive effect (interaction term equal to zero), have a synergistic effect (interaction term has a positive value), or have a repressive effect (interaction term has a negative value). In this example, the interactive effect is estimated as 0.82. The 
                            <italic toggle="yes">x</italic>&#x2019;s in the model are indicator variables for treatment I and treatment II, where 
                            <italic toggle="yes">x</italic>
                            <sub>1</sub>
                            <italic toggle="yes">x</italic>
                            <sub>2</sub> is only equal to 1 if both treatments are present.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure9.gif"/>
                </fig>
                <p>Moreover, whether we use a single 
                    <monospace>treatment</monospace> factor or the two factors here, the two models are equivalent, differing only in parameterisation. The interaction model fitted here can be written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = 1.03 + 1.09
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> + 1.97
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> + 0.82
                    <italic toggle="yes">x</italic>
                    <sub>1</sub>
                    <italic toggle="yes">x</italic>
                    <sub>2</sub>, where the 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> and 
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> are indicator variables for treatment I and treatment II respectively. Specifically, the fourth term in the model, the interaction term, is only included in the presence of both treatments, such that 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub>
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> = 1 but is 0 elsewhere.</p>
            </sec>
            <sec id="sec30">
                <title>Additivity using a two-factor model</title>
                <p>Whilst the interaction model is useful in identifying the effect of the combined treatment via the interaction term, such a model may not always be of interest. One may simply want to quantify the individual effects of treatment I and treatment II, and prefer the assumption that a combined treatment results in the additivity of the two effects. This means that we use all of the samples associated with treatment I (treatment I only and in combination with treatment II) to estimate the effect of treatment I. The same goes for treatment II.</p>
                <p>Using the two factors 
                    <monospace>treat1</monospace> and 
                    <monospace>treat2</monospace>, we create an additive model that excludes the interaction term. The associated design matrix is coded as 
                    <monospace>model.matrix(~treat1+treat2)</monospace>, where a plus sign is placed between the two factors (
                    <xref ref-type="fig" rid="f10">Figure 10</xref>). The design matrix contains 3 columns that are identical to the first three columns of the design matrix from the interaction model (
                    <xref ref-type="fig" rid="f9">Figure 9</xref>). The interpretation of those parameters remain the same. The first parameter represents the mean expression of the control group, the second represents the difference in mean expression between treatment I and control, and the third parameter represents the difference in mean expression between treatment II and control. The parameter estimates for this model can be calculated as 0.83, 1.5 and 2.37. The mean expression of the combined treatment can be calculated by combining all parameter estimates (gene expression from the control group, and the relative change when treatments I and II are added), such that it is equal to 0.83+1.5+2.37=4.7. We can write the fitted additive model as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = 0.83 + 1.5
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> + 2.37
                    <italic toggle="yes">x</italic>
                    <sub>2</sub>, where the 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> and 
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> are indicator variables for treatment I and treatment II. Relative to the interaction model, the fit of the additive model results in expected gene expression values that are further from the observed values. This is not unexpected since fewer parameters are used to model the relationship between gene expression and groups, and we know from the interaction model that the interaction term is non-zero. Even so, the additive model may be preferred for its simple interpretation and thus may be more applicable to some studies.</p>
                <fig fig-type="figure" id="f10" orientation="portrait" position="float">
                    <label>Figure 10. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a two factors representing treatment I and treatment II.</title>
                        <p>The design matrix that is used excludes the interaction term, and we refer to the associated model as an additive model. This means that a combined treatment is assumed to have the additive effects of individual treatment I and treatment II effects. The 
                            <italic toggle="yes">x</italic>&#x2019;s in the model are indicator variables for treatment I and treatment II.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure10.gif"/>
                </fig>
            </sec>
        </sec>
        <sec id="sec31">
            <title>Studies with multiple factors</title>
            <sec id="sec32">
                <title>Background</title>
                <p>In this section, we examine several study designs that contain two or more factors as explanatory variables. We begin with an example where we convert two factors of interest into one, and then consider cases where there are factors that are not of interest. In the second half of this section, more complex study designs are introduced, such as scenarios where there are nested factors and repeated measurements. We finish off the section by fitting a mixed effects model using functions from the 
                    <bold>limma</bold> package, where we treat a factor that is not of interest to the study as a random effect.</p>
            </sec>
            <sec id="sec33">
                <title>Conversion to a single factor</title>
                <p>Experimental studies often include multiple factors of interest. This could involve different treatments, cell types, tissue types, sex, and so on. Let us consider an experiment on lung and brain samples that are enriched for B-cells and T-cells. The data is as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression      id tissue cells
## 1        1.01  MOUSE1   LUNG     B
## 2        1.04  MOUSE2   LUNG     B
## 3        1.04  MOUSE3   LUNG     B
## 4        1.99  MOUSE4  BRAIN     B
## 5        2.36  MOUSE5  BRAIN     B
## 6        2.00  MOUSE6  BRAIN     B
## 7        2.89  MOUSE7   LUNG     T
## 8        3.12  MOUSE8   LUNG     T
## 9        2.98  MOUSE9   LUNG     T
## 10       5.00 MOUSE10  BRAIN     T
## 11       4.92 MOUSE11  BRAIN     T
## 12       4.78 MOUSE12  BRAIN     T
</preformat>
                </p>
                <p>For this experiment, there are several comparisons of interest: 1) overall differences between cell types, 2) overall differences between tissues, 3) differences between cell types within each tissue type, and 4) differences between tissues within each cell type. The simplest method is to merge 
                    <monospace>tissue</monospace> and 
                    <monospace>cells</monospace> factors into a single 
                    <monospace>group</monospace> factor, as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression      id tissue cells   group
## 1        1.01  MOUSE1   LUNG     B  LUNG_B
## 2        1.04  MOUSE2   LUNG     B  LUNG_B
## 3        1.04  MOUSE3   LUNG     B  LUNG_B
## 4        1.99  MOUSE4  BRAIN     B BRAIN_B
## 5        2.36  MOUSE5  BRAIN     B BRAIN_B
## 6        2.00  MOUSE6  BRAIN     B BRAIN_B
## 7        2.89  MOUSE7   LUNG     T  LUNG_T
## 8        3.12  MOUSE8   LUNG     T  LUNG_T
## 9        2.98  MOUSE9   LUNG     T  LUNG_T
## 10       5.00 MOUSE10  BRAIN     T BRAIN_T
## 11       4.92 MOUSE11  BRAIN     T BRAIN_T
## 12       4.78 MOUSE12  BRAIN     T BRAIN_T
</preformat>
                </p>
                <p>This allows us to fit a means model to the data, using a design matrix coded as 
                    <monospace>design &lt;- model.matrix(~0+group)</monospace> (
                    <xref ref-type="fig" rid="f11">Figure 11</xref>), and to define contrasts for comparisons of interest using the 
                    <monospace>makeContrasts</monospace> function. The contrasts are coded as</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
contrasts &lt;- makeContrasts(
  BVsT=(groupLUNG_B+groupBRAIN_B)/2-(groupLUNG_T+groupBRAIN_T)/2,
  LungVsBrain=(groupLUNG_B+groupLUNG_T)/2-(groupBRAIN_B+groupBRAIN_T)/2,
  BVsT_Lung=groupLUNG_B-groupLUNG_T,
  BVsT_Brain=groupBRAIN_B-groupBRAIN_T,
  LungVsBrain_B=groupLUNG_B-groupBRAIN_B, 
  LungVsBrain_T=groupLUNG_T-groupBRAIN_T, 
  levels=colnames(design))
rownames(contrasts) &lt;- gsub("group", "", rownames(contrasts))
contrasts

##          Contrasts
## Levels    BVsT LungVsBrain BVsT_Lung BVsT_Brain LungVsBrain_B LungVsBrain_T
##   LUNG_B   0.5         0.5         1          0             1             0
##   BRAIN_B  0.5        -0.5         0          1            -1             0
##   LUNG_T  -0.5         0.5        -1          0             0             1
##   BRAIN_T -0.5        -0.5         0         -1             0            -1
</preformat>
                </p>
                <p>with columns of the matrix representing 1) overall differences between cells, B-cells versus T-cells; 2) overall differences between tissues, lung versus brain; 3) differences between cells within lung, and 4) differences between cells within brain; 5) differences between tissues within B-cells, and 6) differences between tissues within T-cells. Notice that we specified our own contrast names in the code above. The row names were also shortened so that the contrast matrix could display neatly.</p>
                <fig fig-type="figure" id="f11" orientation="portrait" position="float">
                    <label>Figure 11. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a single group factor using a design matrix that excludes the intercept term.</title>
                        <p>The group factor is converted from two factors representing tissue samples and cell types.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure11.gif"/>
                </fig>
                <p>Using the design matrix, the parameters are estimated as 1.03 for the mean gene expression of B-cells in the lung, 2.12 for B-cells in the brain, 3 for T-cells in the lung, and 4.9 for T-cells in the brain. By applying the contrast matrix to the estimated parameters, we calculate that overall gene expression difference between B-cells versus T-cells is -2.37, and -1.5 for lung versus brain. B-cells and T-cells differ by -1.97 in the lung, and -2.78 in the brain. Lung samples and brain samples differ by -1.09 in B-cells, and by -1.9 in T-cells.</p>
            </sec>
            <sec id="sec34">
                <title>Accounting for factors that are not of interest</title>
                <p>Some factors within an experiment may not be of biological interest. Often they are technical factors such as handling technician, experimental time if samples were processed in separate batches, or the sequencing lane on which the samples were processed on. There are also biological factors that may not be of direct interest; such as ethnicity of patients in a human drug trial or the sex of individuals from which samples were taken. Let us consider an experiment with mice belonging to groups A, B, C, or D, each in triplicate. It is of interest to compare gene expression between the groups. In the process of the experiment, two sequencing lanes (L1 and L2) were used for sequencing and samples were processed by different technicians (I and II). To ensure that differences detected between groups are not influenced by these factors, we can account for any differences between the sequencing lanes and handling technician in our modelling process. The data is as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression      id group lane technician
## 1        1.01  MOUSE1     A   L2          I
## 2        1.04  MOUSE2     A   L2          I
## 3        1.04  MOUSE3     A   L1         II
## 4        1.99  MOUSE4     B   L1          I
## 5        2.36  MOUSE5     B   L2         II
## 6        2.00  MOUSE6     B   L1          I
## 7        2.89  MOUSE7     C   L1         II
## 8        3.12  MOUSE8     C   L2         II
## 9        2.98  MOUSE9     C   L1          I
## 10       5.00 MOUSE10     D   L1         II
## 11       4.92 MOUSE11     D   L2         II
## 12       4.78 MOUSE12     D   L2          I
</preformat>
                </p>
                <p>A means model can be fitted to the data, with a design matrix coded as 
                    <monospace>design &lt;- model.matrix(~0+group+lane+technician)</monospace> to model gene expression in groups, while accounting for effects resulting from differences in lane and technician. The first 4 columns of the design matrix are associated with parameters for the mean expression of group A, B, C and D (
                    <xref ref-type="fig" rid="f12">Figure 12</xref>). Specifically, the group means are parameterised for when the samples are in lane L1 and processed by technician I.</p>
                <fig fig-type="figure" id="f12" orientation="portrait" position="float">
                    <label>Figure 12. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a group factor and two additional factors that are not of interest (lane and technician).</title>
                        <p>The design matrix excludes the intercept term for the first factor added to the function. Only lines reflecting the first 4 parameters are drawn in the plot, representing the mean gene expression of groups A, B, C and D in lane L1 and with handling technician I. Samples are labelled by their sequencing lane (L1 or L2), and coloured black if they are processed by technician I, yellow if they are processed by technician II.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure12.gif"/>
                </fig>
                <p>The fifth column in the design matrix is parameterised for difference between lane L2 and lane L1 (for group A samples processed by technician I), and the sixth column is parameterised for the difference between technician II and I (for group A samples in lane L1). Although an intercept-free design matrix has been coded using the 
                    <monospace>0+</monospace> notation, the intercept is only excluded from the first factor that is listed within the 
                    <monospace>model.matrix</monospace> function. In other words, the second and third factors added to the 
                    <monospace>model.matrix</monospace> function are parameterised as though there is an intercept term. This is why we place the factor of interest first as it simplifies the subsequent code for the comparisons of interest, even though a different order of factors added give equivalent models with variations in parameterisation.</p>
                <p>By estimating model parameters, the fitted model can be written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = 0.92
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> + 2.04
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> + 2.87
                    <italic toggle="yes">x</italic>
                    <sub>3</sub> + 4.74
                    <italic toggle="yes">x</italic>
                    <sub>4</sub> + 0.1
                    <italic toggle="yes">x</italic>
                    <sub>5</sub> + 0.15
                    <italic toggle="yes">x</italic>
                    <sub>6</sub>, where 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> to 
                    <italic toggle="yes">x</italic>
                    <sub>4</sub> are indicator variables for groups A to D, respectively. Additionally, 
                    <italic toggle="yes">x</italic>
                    <sub>5</sub> is an indicator variable for lane L2, and 
                    <italic toggle="yes">x</italic>
                    <sub>6</sub> is an indicator variable for technician II. In other words, a group A sample processed in lane L1 and by technician I has expected gene expression 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>)=0.92. Whereas, the expected gene expression is 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>)=0.92+0.1=1.02 if it were processed in lane L2, 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>)=0.92+0.15=1.06 for technician II, and 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>)=0.92+0.1+0.15=1.16 for a group A sample processed in lane L2 and by technician II.</p>
                <p>For comparisons between groups, we form contrasts using only the first 4 parameter estimates, and keep lane and technician consistent. For example, a contrast comparing group A to group B can be coded as 
                    <monospace>makeContrasts(groupA-groupB, levels=colnames(design))</monospace>. All other pairwise comparisons can also be included into the contrast matrix.</p>
                <p>When modelling multiple factors of interest, the factors may be converted into a single factor for modelling, as shown in the previous section. We also note that it may not be sensible to add all known factors associated with the experiment. This could well exceed the number of degrees of freedom available for modelling (too many parameters when compared to the number of data points). A reasonable way to check the factors that should be accounted for include the use of unsupervised clustering plots, such as principal components analysis (PCA) or multi-dimensional scaling (MDS). Factors associated with separation between sample clusters should be included in the model. An alternative method is to fit a model to the biological groups of interest with one addition factor to observe whether the factor has substantial influence on gene expression (such that many genes are detected as differentially expressed for that factor). Repeat this for subsequent factors to determine the factors that should be included into the final model.</p>
            </sec>
            <sec id="sec35">
                <title>Nested factors and matrices without full rank</title>
                <p>Now consider a study design that includes two of the factors, 
                    <monospace>group</monospace> and 
                    <monospace>batch</monospace>, representing biological groups of interest and experimental batches. The samples in group A and group B are processed in batch B1, whilst samples in group C and group D are processed in batch B2. We say that the groups are 
                    <italic toggle="yes">nested</italic> within batches. The data is as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression      id group batch
## 1        1.01  MOUSE1     A    B1
## 2        1.04  MOUSE2     A    B1
## 3        1.04  MOUSE3     A    B1
## 4        1.99  MOUSE4     B    B1
## 5        2.36  MOUSE5     B    B1
## 6        2.00  MOUSE6     B    B1
## 7        2.89  MOUSE7     C    B2
## 8        3.12  MOUSE8     C    B2
## 9        2.98  MOUSE9     C    B2
## 10       5.00 MOUSE10     D    B2
## 11       4.92 MOUSE11     D    B2
## 12       4.78 MOUSE12     D    B2
</preformat>
                </p>
                <p>It is of interest to compare the gene expression between groups. Naturally, one may include both factors into a design matrix coded as 
                    <monospace>design &lt;- model.matrix(~0+group+batch)</monospace> or 
                    <monospace>design &lt;- model.matrix(~0+batch+group)</monospace>. This, however, produces a design matrix that is not of full rank, meaning that there are more columns in the design matrix (5 columns in this case) than what is needed (4 columns). The resultant design matrix has some columns that are linearly dependent, which is due to batch information being redundant once all group means are defined. This is because batch B1 is uniquely defined by group A and B, and batch B2 is uniquely defined by group C and D. Similarly, two of the groups are redundant if batch means are defined first. One would usually notice that their design matrix is not of full rank when the parameter estimation process returns 
                    <monospace>NA</monospace> or 
                    <monospace>non-estimable</monospace> results for some parameters.</p>
                <p>To check for redundancy of model parameters, one can compare between the number of columns in the design matrix with 
                    <monospace>ncol(design)</monospace> to the rank of the matrix with 
                    <monospace>qr(design)$rank</monospace>. This would show that there are 5 columns in the design matrix but only a rank of 4, meaning that one of the parameters defined in the design matrix is linearly dependent. This should prompt us to consider how to set up the model properly, figuring out which factors are dependent on others, and ultimately redefining the design matrix. For example, the design matrix can be set to 
                    <monospace>model.matrix(~0+group)</monospace> instead, although we should keep in mind that some pairwise group comparisons would be confounded by batch effects, such as when comparing group A to group C.</p>
            </sec>
            <sec id="sec36">
                <title>Time series experiment with repeated mouse measurements nested within treatments</title>
                <p>In a study of treatment effects, gene expression measurements were taken from mice at multiple time points. Three of the mice were administered treatment X, and another three were administered treatment Y. Measurements were taken for the mice at two timepoints, T1 (baseline) and T2. The data is as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression     id treatment timepoint
## 1        1.01 MOUSE1         X        T1
## 2        1.04 MOUSE2         X        T1
## 3        1.04 MOUSE3         X        T1
## 4        1.99 MOUSE1         X        T2
## 5        2.36 MOUSE2         X        T2
## 6        2.00 MOUSE3         X        T2
## 7        2.89 MOUSE4         Y        T1
## 8        3.12 MOUSE5         Y        T1
## 9        2.98 MOUSE6         Y        T1
## 10       5.00 MOUSE4         Y        T2
## 11       4.92 MOUSE5         Y        T2
## 12       4.78 MOUSE6         Y        T2
</preformat>
                </p>
                <p>It is of interest to compare timepoint T1 with T2 within treatment X, while accounting for how the samples are paired. What this means is that it is important to account for the relative change from timepoint T1 to T2 of each mice, when estimating the overall change between the timepoints. Similarly, a comparison between timepoint T1 and T2 within treatment Y is of interest. Additionally, we want to examine the overall differences between treatment X and Y. Since the mice are nested within treatment types, we create a custom design matrix to avoid a matrix that has linearly dependent columns or that is not of full rank. The custom design matrix is created using 
                    <monospace>model.matrix</monospace> and 
                    <monospace>cbind</monospace> functions in the following way:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
design &lt;- model.matrix(~0+id)
design &lt;- cbind(design, X= treatment=="X" &amp; timepoint=="T2")
design &lt;- cbind(design, Y= treatment=="Y" &amp; timepoint=="T2")
</preformat>
                </p>
                <p>In the first step, 
                    <monospace>model.matrix(~0+id)</monospace>, we account for each individual mouse and the pairing of samples. Since mice are nested within treatments, the treatment effects are encompassed within the mouse effects. In the second step, 
                    <monospace>cbind(design, X= treatment=="X" &amp; timepoint=="T2")</monospace>, an extra &#x201c;X&#x201d; column is added to the design matrix to represent treatment X at timepoint T2. Similarly, the third step, 
                    <monospace>cbind(design, Y= treatment=="Y" &amp; timepoint=="T2")</monospace>, appends column &#x201c;Y&#x201d; to the design matrix to represent treatment Y at timepoint T2. The two additional columns differentiate between samples at timepoint T1 (first 6 columns) from those at timepoint T2 (last 2 columns), such that the first 6 columns in the design matrix now represent the effect of each mouse at timepoint T1, and the last 2 columns represent the overall difference between timepoint T2 and T1 for treatments X and Y respectively (
                    <xref ref-type="fig" rid="f13">Figure 13</xref>).</p>
                <fig fig-type="figure" id="f13" orientation="portrait" position="float">
                    <label>Figure 13. </label>
                    <caption>
                        <title>We model the expected gene expression of mice that have been given either treatment X or treatment Y, with samples taken at timepoints T1 and T2.</title>
                        <p>Repeated measurements are taken from mice, as indicated by the numbers in the plot, such that label &#x201d;1&#x201d; represents MOUSE1. A custom design matrix is created to model mouse IDs, treatment and timepoint (complete R code shown in the main article). Fitted lines are drawn in pink for treatment X, and in aqua for treatment Y. Solid lines represent expected gene expression at timepoint T1, and dashed lines for timepoint T2.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure13.gif"/>
                </fig>
                <p>The first 6 parameters, which represent the expected expression of each mouse at timepoint T1, can be estimated as 0.96, 1.15, 0.98, 2.99, 3.07 and 2.93. Using the first three estimates (0.96, and 1.15, 0.98), we can calculate the expected expression of treatment X at timepoint T1 by taking the mean value, which equals 1.03 (and is marked by the thick, dark red line in the plot of 
                    <xref ref-type="fig" rid="f13">Figure 13</xref>). Similarly, the next three estimates (2.99, 3.07 and 2.93) can be used to calculate the expected expression of treatment Y at timepoint T1, which equals 3 (and is marked by the thick, dark green line in the plot of 
                    <xref ref-type="fig" rid="f13">Figure 13</xref>). The seventh parameter is estimated at 1.09, and the eighth is 1.9, such that gene expression is greater by 1.09 and 1.9 at timepoint T2 relative to timepoint T1 for treatments X and Y respectively. The overall difference between treatments X and Y can be coded as 
                    <monospace>makeContrasts(X-Y, levels=colnames(design))</monospace>, which calculates the difference between the seventh and eighth parameters, and has a value of -0.82.</p>
            </sec>
            <sec id="sec37">
                <title>Treating factors that are not of direct interest as random effects</title>
                <p>In the previous section, repeated measurements taken from mice receiving treatment X or Y were accounted for within the design matrix. By including the mouse IDs into the design matrix, we say that mouse IDs were treated as fixed effects in the modelling process. An alternative method treats the mouse IDs instead as random effects, and does not include the IDs into the design matrix. We refer to this type of model as a 
                    <italic toggle="yes">mixed effects model</italic>, such that treatment and timepoint are included into the design matrix as fixed effects in the model, whilst mouse IDs are included as random effects. One important advantage to the 
                    <bold>limma</bold> package is that it has the ability to fit a mixed effects model, unlike 
                    <bold>edgeR</bold> or 
                    <bold>DESeq2</bold>, which can only fit fixed effects.</p>
                <p>Why do we fit mouse IDs as a random effect rather than a fixed effect? The specific differences between mice are not of direct interest to the study, so removing them from the design matrix reduces the number of model parameters, conserves the number of degrees of freedom in modelling, and likely increases statistical power for testing. The effects, however, cannot be omitted completely because they are integral to the study design; individual mouse effects should still be accounted for when calculating relative difference between timepoints T1 and T2.</p>
                <p>To fit a mixed effects model, let us first define our fixed effects in a design matrix. To simplify the two factors of interest, 
                    <monospace>treatment</monospace> and 
                    <monospace>timepoint</monospace>, we merge them into a single factor called 
                    <monospace>group</monospace>. The data which includes the new 
                    <monospace>group</monospace> factor are as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression     id treatment timepoint group
## 1        1.01 MOUSE1         X        T1  X_T1
## 2        1.04 MOUSE2         X        T1  X_T1
## 3        1.04 MOUSE3         X        T1  X_T1
## 4        1.99 MOUSE1         X        T2  X_T2
## 5        2.36 MOUSE2         X        T2  X_T2
## 6        2.00 MOUSE3         X        T2  X_T2
## 7        2.89 MOUSE4         Y        T1  Y_T1
## 8        3.12 MOUSE5         Y        T1  Y_T1
## 9        2.98 MOUSE6         Y        T1  Y_T1
## 10       5.00 MOUSE4         Y        T2  Y_T2
## 11       4.92 MOUSE5         Y        T2  Y_T2
## 12       4.78 MOUSE6         Y        T2  Y_T2
</preformat>
                </p>
                <p>A means model is fitted to the groups by coding the design matrix as 
                    <monospace>design &lt;- model.matrix(~0+group)</monospace>, which gives a 4 column matrix representing the mean gene expression in each group (
                    <xref ref-type="fig" rid="f14">Figure 14</xref>). Mouse IDs are set as random effects by assigning 
                    <monospace>id</monospace> as the blocking variable in the 
                    <monospace>lmFit</monospace> function in 
                    <bold>limma</bold>. Before doing this, we first estimate the correlation between repeated mice measurements using the 
                    <monospace>duplicateCorrelation</monospace> function in 
                    <bold>limma</bold> as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
cor &lt;- duplicateCorrelation(expression, design, block=id)
cor$consensus.correlation

## [1] -0.05
</preformat>
                </p>
                <fig fig-type="figure" id="f14" orientation="portrait" position="float">
                    <label>Figure 14. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a mixed effects model, with treatment and timepoint (converted into a single group factor) as fixed effects and mouse IDs as the random effects.</title>
                        <p>A means model is fitted to the data using a design matrix that excludes the intercept term. In the plot, data points are labelled by mouse ID.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure14.gif"/>
                </fig>
                <p>The correlation between measurements taken from the same mouse is estimated as -0.05, which is considered to be quite a small correlation value. This is expected in our example since we did not program specific mouse effects into the dataset. In the case of a negative estimated correlation, the blocking variable should be removed and we can resume the usual modelling approach of accounting for the 
                    <monospace>group</monospace> fixed effect only in the design matrix. In fact, the blocking variable can be removed if the estimated correlation is very, very small (say less than 0.01) as its contribution to the overall model fit would also be very minor, however, keeping it in the model would not adversely affect the modelling process either. For real use cases, correlation estimates of 0.7 to 0.9 are considered high but not uncommon. Despite our recommendation above, let us continue with the fitting of our mixed effects model for the sake of demonstrating how it can be carried out.</p>
                <p>Using the 
                    <monospace>lmFit</monospace> function, we fit our random effects by setting the 
                    <monospace>correlation</monospace> argument to the estimated correlation and 
                    <monospace>block</monospace> argument to mouse 
                    <monospace>id</monospace>. The fixed effects modelled within the design matrix are given to the 
                    <monospace>design</monospace> argument, along with the expression data as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
fit &lt;- lmFit(object=expression, design=design, 
  block=id, correlation=cor$consensus.correlation)
</preformat>
                </p>
                <p>The mixed effects model estimates the mean gene expression of mouse receiving treatment X at timepoint T1 to be 1.03, treatment X at timepoint T2 to be 2.12, treatment Y at timepoint T1 to be 3, and treatment Y at timepoint T1 to be 4.9. To obtain estimates for the comparisons of interest, we use a contrast matrix coded as:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
contrasts &lt;- makeContrasts(
  X_T2vsT1=groupX_T2-groupX_T1, 
  Y_T2vsT1=groupY_T2-groupY_T1, 
  XvsY=(groupX_T2-groupX_T1)-(groupY_T2-groupY_T1), 
  levels=colnames(design)) 
contrasts

##            Contrasts
## Levels      X_T2vsT1 Y_T2vsT1 XvsY
##   groupX_T1       -1        0   -1
##   groupX_T2        1        0    1
##   groupY_T1        0       -1    1
##   groupY_T2        0        1   -1
</preformat>
                </p>
                <p>The first contrast calculates the difference between timepoint T2 and T1 in treatment X, the second contrast calculates the difference between timepoint T2 and T1 in treatment Y, and the last contrast the overall difference between treatment X and Y. The overall difference between treatment X and Y is calculated after adjusting the mean gene expression at timepoint T2 by that of the baseline (timepoint T1). The contrast matrix is incorporated into the 
                    <bold>limma</bold> pipeline using the 
                    <monospace>contrasts.fit</monospace> function as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
fit &lt;- contrasts.fit(fit, contrasts)
</preformat>
                </p>
                <p>Using the contrast matrix, the difference between timepoint T2 and T1 in treatment X is calculated as 1.09, and for treatment Y as 1.9. The overall difference between treatment X and Y is estimated at -0.82. Since the correlation of repeated mouse measurements is small, individual mouse effects have little influence on the calculation of expected gene expression values for groups. This means that the inclusion of mouse IDs as fixed or random effects do not have much practical influence in this particular example. For this reason, we observe similar results between this section and that of the previous section.</p>
            </sec>
        </sec>
        <sec id="sec38">
            <title>Studies with covariates</title>
            <sec id="sec39">
                <title>Background</title>
                <p>In the remaining sections, we switch from looking at explanatory variables that are factors, and instead consider studies where the explanatory variable of interest is a covariate. Let us recall the basic models outlined in earlier sections, where a simple regression model for a covariate can be represented as a straight line defined by its y-intercept and slope. In the following section, we cover models that are more complex in their design, starting with a mix of covariates and factors. We also discuss options for non-linear fitted models that extend beyond the simple framework of y-intercept and slope. Whilst in practice the vast majority of study designs involve only factor variables, which we have covered extensively over multiple sections, this section is useful for the occasional study where the relationship between gene expression and a given covariate is of interest.</p>
            </sec>
            <sec id="sec40">
                <title>Combination with factor variable</title>
                <p>In the previous section we looked at models for the factors 
                    <monospace>treatment</monospace> and 
                    <monospace>timepoint</monospace>. We now consider a similar example, where there are two explanatory variables, the 
                    <monospace>treatment</monospace> factor and 
                    <monospace>time</monospace> as a covariate. The timepoints from the previous example is now treated as a numerical variable. Let us suppose that the timepoints T1 and T2 represent 1 hour post treatment and 2 hours post treatment, by either treatment X or Y. For simplicity, we do not consider having repeated mouse measurements here and assume that the measurements are taken from different mice each time. The data is as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression      id treatment time
## 1        1.01  MOUSE1         X    1
## 2        1.04  MOUSE2         X    1
## 3        1.04  MOUSE3         X    1
## 4        1.99  MOUSE4         X    2
## 5        2.36  MOUSE5         X    2
## 6        2.00  MOUSE6         X    2
## 7        2.89  MOUSE7         Y    1
## 8        3.12  MOUSE8         Y    1
## 9        2.98  MOUSE9         Y    1
## 10       5.00 MOUSE10         Y    2
## 11       4.92 MOUSE11         Y    2
## 12       4.78 MOUSE12         Y    2
</preformat>
                </p>
                <p>An appropriate design matrix that considers both the treatment factor and time covariate can be coded as 
                    <monospace>model.matrix(~0+treatment+treatment:time)</monospace>, which gives a fitted line for each of the treatments (
                    <xref ref-type="fig" rid="f15">Figure 15</xref>). For two fitted lines with the same slope, we could use the design matrix coded as 
                    <monospace>model.matrix(~0+treatment+time)</monospace>. Let us recall that each regression line is defined by a y-intercept and slope. In our design matrix, the first and third columns are parameterised for the y-intercept and slope of the line modelling treatment X. The second and fourth columns are parameterised for the y-intercept and slope of treatment Y.</p>
                <fig fig-type="figure" id="f15" orientation="portrait" position="float">
                    <label>Figure 15. </label>
                    <caption>
                        <title>Expected gene expression is modelled by time as a covariate and treatment as a factor.</title>
                        <p>In the plot, data points are labelled by treatment, and a fitted line is drawn for each of the treatments.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure15.gif"/>
                </fig>
                <p>The y-intercepts for treatment X and treatment Y are estimated as -0.06 and 1.09 respectively. The y-intercepts, though, are generally not of interest. It is the slopes that are usually of interest since this quantifies the rate of increase or decrease in gene expression over time. The slope for treatment X is estimated as 1.09, and the slope for treatment Y is estimated as 1.9. In the previous section with the factors 
                    <monospace>treatment</monospace> and 
                    <monospace>timepoint</monospace>, time is consider as distinct changes in state from timepoint T1 to T2. Here, time as a covariate allows us to quantify the expected change in gene expression over an interval of time, such that over 0.5 units of time we can expect an increase of 0.54 in gene expression for treatment X (which is calculated by halving the third parameter estimate).</p>
                <p>The fitted model can be written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = -0.06
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> + 1.09
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> + 1.09
                    <italic toggle="yes">x</italic>
                    <sub>1</sub>
                    <italic toggle="yes">t</italic> + 1.9
                    <italic toggle="yes">x</italic>
                    <sub>2</sub>
                    <italic toggle="yes">t</italic>, where the 
                    <italic toggle="yes">x</italic>
                    <sub>1</sub> and 
                    <italic toggle="yes">x</italic>
                    <sub>2</sub> are indicator variables for treatment X and treatment Y respectively, and 
                    <italic toggle="yes">t</italic> is a numerical variable representing time. Specifically, the model for treatment X can be written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = -0.06 + 1.09
                    <italic toggle="yes">t</italic>. The model for treatment Y can be written as 
                    <italic toggle="yes">E</italic>(
                    <italic toggle="yes">y</italic>) = 1.09 + 1.9
                    <italic toggle="yes">t</italic>.</p>
            </sec>
            <sec id="sec41">
                <title>Linear time series</title>
                <p>We now consider a mouse study over several time points. This could represent a study examining gene expression changes in a developmental stage of interest, such as early embryonic development. There are gene expression measurements from mice at times 1, 2, 3, 4, 5 and 6, each in duplicate. The times could represent hours, days or weeks, with the data as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression   mouse time
## 1        2.08  MOUSE1    1
## 2        2.29  MOUSE2    1
## 3        3.58  MOUSE3    2
## 4        3.54  MOUSE4    2
## 5        3.66  MOUSE5    3
## 6        4.20  MOUSE6    3
## 7        2.56  MOUSE7    4
## 8        2.00  MOUSE8    4
## 9        0.81  MOUSE9    5
## 10       0.58 MOUSE10    5
## 11      -0.14 MOUSE11    6
## 12       0.14 MOUSE12    6
</preformat>
                </p>
                <p>Naturally, we use a design matrix coded as 
                    <monospace>design &lt;- model.matrix(~time)</monospace> (
                    <xref ref-type="fig" rid="f16">Figure 16</xref>). The design matrix contains two columns and is parameterised for the y-intercept (estimated here as 4.22) and the slope of the line (estimated as -0.6). Using this model, the gene expression is expected to decrease by 0.6 units for every unit increase in time.</p>
                <fig fig-type="figure" id="f16" orientation="portrait" position="float">
                    <label>Figure 16. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a time covariate, using a design matrix that includes an intercept term (linear fit).</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure16.gif"/>
                </fig>
            </sec>
            <sec id="sec42">
                <title>Quadratic time series</title>
                <p>Looking more closely at the plot of expression versus time in 
                    <xref ref-type="fig" rid="f16">Figure 16</xref>, we notice that expression seems to increase between time points 1 and 2, peaks between 2 and 3, and decreases between time points 3 and 6. So we also consider another model that describes the data using a curved line, such that the curved line takes the form of 
                    <italic toggle="yes">y</italic> = 
                    <italic toggle="yes">a</italic> + 
                    <italic toggle="yes">bt</italic> + 
                    <italic toggle="yes">ct</italic>
                    <sup>2</sup> where 
                    <italic toggle="yes">y</italic> represents gene expression, 
                    <italic toggle="yes">t</italic> represents time and 
                    <italic toggle="yes">t</italic>
                    <sup>2</sup> is calculated as the square of time (or time to the power of 2), and 
                    <italic toggle="yes">a</italic>, 
                    <italic toggle="yes">b</italic>, and 
                    <italic toggle="yes">c</italic> are model parameters which we estimate. Since the model includes the 
                    <monospace>time</monospace> covariate to the power of 2, the curve is referred to as a second degree polynomial or quadratic model.</p>
                <p>To fit such a model, the 
                    <monospace>poly</monospace> function from the 
                    <bold>stats</bold> package is used to specify the polynomial, with the specification that 
                    <monospace>degree=2</monospace> for a second degree polynomial. The design matrix is coded as 
                    <monospace>design &lt;- model.matrix(~poly(time, degree=2, raw=TRUE))</monospace> (see 
                    <xref ref-type="fig" rid="f17">Figure 17</xref> where column names have been simplified). Using the design matrix, the parameters can be estimated as 
                    <italic toggle="yes">a</italic> = 1.2, 
                    <italic toggle="yes">b</italic> = 1.67, and 
                    <italic toggle="yes">c</italic> = -0.32. The fitted model can be written as 
                    <italic toggle="yes">y</italic> = 1.2 + 1.67
                    <italic toggle="yes">t</italic> + -0.32
                    <italic toggle="yes">t</italic>
                    <sup>2</sup>. The quadratic model has a the peak (or trough) occurring at 
                    <italic toggle="yes">t</italic> = &#x2212;
                    <italic toggle="yes">b</italic>/(2
                    <italic toggle="yes">c</italic>), which in our case is at time 2.57, and this translates to a maximum gene expression of 3.33. Using this model, we can say that gene expression is increasing up until time 2.57, and decreases after time 2.57. The quadratic model can help differentiate between the genes that are increasing in expression over time (where 
                    <italic toggle="yes">b</italic> &gt; 0 and 
                    <italic toggle="yes">c</italic> = 0), genes that decreasing in expression over time (
                    <italic toggle="yes">b</italic> &lt; 0 and 
                    <italic toggle="yes">c</italic> = 0), genes that increase in expression then decrease in expression (
                    <italic toggle="yes">c</italic> &lt; 0), and genes that decrease then increase in expression (
                    <italic toggle="yes">c</italic> &gt; 0).</p>
                <fig fig-type="figure" id="f17" orientation="portrait" position="float">
                    <label>Figure 17. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a time covariate, using a design matrix that fits a second degree polynomial (quadratic fit).</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure17.gif"/>
                </fig>
                <p>In our example, we specify the use of raw polynomials via the 
                    <monospace>raw</monospace> argument in the 
                    <monospace>poly</monospace> function. This allows us to easily demonstrate what the linear component of the model (
                    <italic toggle="yes">t</italic> or 
                    <monospace>time</monospace> in the design matrix) and the quadratic component (
                    <italic toggle="yes">t</italic>
                    <sup>2</sup> or 
                    <monospace>time2</monospace> in the design matrix) looks like based on the data at hand. In practice, however, we would recommend using orthogonal polynomials rather than raw polynomials, by specifying 
                    <monospace>raw=FALSE</monospace> in the 
                    <monospace>poly</monospace> function. In fact, the default setting in 
                    <monospace>poly</monospace> computes orthogonal polynomials so the 
                    <monospace>raw</monospace> argument does not need to be specified. Raw polynomials can result in covariates that are correlated, which means that the importance of each individual effect is indistinguishable from each other. We can observe this correlation quite easily in our own example, by checking 
                    <monospace>cor(poly(time, degree=2, raw=TRUE))</monospace>. Orthogonal polynomials, on the other hand, ensure that the covariates are not correlated, which we can check with 
                    <monospace>cor(poly(time, degree=2))</monospace>, but give a messy demonstration of the parameters and model for the purpose of this article. Since the covariates are not correlated, orthogonal polynomials allows us to determine the genes where the linear term is important but the quadratic term is not, and vice versa. There can also be genes where both or neither linear and quadratic terms are important. This means that we recommend a design matrix coded as 
                    <monospace>model.matrix(~poly(time, degree=2))</monospace> when fitting a quadratic model in practice, rather than the design matrix that is displayed in 
                    <xref ref-type="fig" rid="f17">Figure 17</xref>. To compare the polynomial model against that of a constant term (i.e. gene expression does not change over time), an 
                    <italic toggle="yes">F</italic>-test can be used to test multiple parameters together (e.g. the linear and the quadratic term). In 
                    <bold>limma</bold>, we calculate 
                    <italic toggle="yes">F</italic>-statistics and their associated raw and adjusted 
                    <italic toggle="yes">p</italic>-values using the 
                    <monospace>topTable</monospace> function by specifying multiple columns in the 
                    <monospace>coef</monospace> argument, e.g. 
                    <monospace>topTable(fit, coef=c(2,3), number=Inf)</monospace> for such values across all genes as ranked by significance.</p>
            </sec>
            <sec id="sec43">
                <title>Cubic time series</title>
                <p>We extend our example with two additional time points, time 7 and 8, with the data as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression   mouse time
## 1        2.08  MOUSE1    1
## 2        2.29  MOUSE2    1
## 3        3.58  MOUSE3    2
## 4        3.54  MOUSE4    2
## 5        3.66  MOUSE5    3
## 6        4.20  MOUSE6    3
## 7        2.56  MOUSE7    4
## 8        2.00  MOUSE8    4
## 9        0.81  MOUSE9    5
## 10       0.58 MOUSE10    5
## 11      -0.14 MOUSE11    6
## 12       0.14 MOUSE12    6
## 13       1.13 MOUSE13    7
## 14       1.58 MOUSE14    7
## 15       3.45 MOUSE15    8
## 16       3.17 MOUSE16    8
</preformat>
                </p>
                <p>We know from the previous section that between times 1 and 6, a linear fit to the data shows a decreasing trend (
                    <xref ref-type="fig" rid="f16">Figure 16</xref>), and a quadratic fit to the same data shows an increasing trend before time 2.57 and a decreasing trend after time 2.57 (
                    <xref ref-type="fig" rid="f17">Figure 17</xref>). The new data points, however, do not continue with the decreasing trend at time 7 and 8. For this reason, we fit a cubic model, or a third degree polynomial, to the data so that it can capture the two changes in gene expression trends. Like before, we use the 
                    <monospace>poly</monospace> function from the 
                    <bold>stats</bold> package to specify the polynomial we want to fit. The design matrix is coded as 
                    <monospace>design &lt;- model.matrix(~poly(time, degree=3, raw=TRUE))</monospace> (see 
                    <xref ref-type="fig" rid="f18">Figure 18</xref> where column names have been simplified). Note that we are again using a raw polynomial in this example for simplicity of illustration of the model parameters, although we would use an orthogonal polynomial in practice.</p>
                <fig fig-type="figure" id="f18" orientation="portrait" position="float">
                    <label>Figure 18. </label>
                    <caption>
                        <title>Expected gene expression is modelled by a time covariate, using a design matrix that fits a third degree polynomial (cubic fit).</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure18.gif"/>
                </fig>
                <p>The four parameters in the model represent the intercept term, the coefficient for time (represented as 
                    <italic toggle="yes">t</italic>), the coefficient for time to the power of 2 (represented as 
                    <italic toggle="yes">t</italic>
                    <sup>2</sup>), and the coefficient for time to the power of 3 (represented as 
                    <italic toggle="yes">t</italic>
                    <sup>3</sup>). The third degree polynomial takes the form of 
                    <italic toggle="yes">y</italic> = 
                    <italic toggle="yes">a</italic> + 
                    <italic toggle="yes">bt</italic> + 
                    <italic toggle="yes">ct</italic>
                    <sup>2</sup> + 
                    <italic toggle="yes">dt</italic>
                    <sup>3</sup>, where 
                    <italic toggle="yes">y</italic> represents gene expression and 
                    <italic toggle="yes">t</italic> represents time, and 
                    <italic toggle="yes">a</italic>, 
                    <italic toggle="yes">b</italic>, 
                    <italic toggle="yes">c</italic>, and 
                    <italic toggle="yes">d</italic> are model parameters which we estimate. The parameters can be estimated as 
                    <italic toggle="yes">a</italic> = -1.85, 
                    <italic toggle="yes">b</italic> = 5.56, 
                    <italic toggle="yes">c</italic> = -1.64, and 
                    <italic toggle="yes">d</italic> = 0.13. The fitted model can be written as 
                    <italic toggle="yes">y</italic> = -1.85 + 5.56
                    <italic toggle="yes">t</italic> + -1.64
                    <italic toggle="yes">t</italic>
                    <sup>2</sup> + 0.13
                    <italic toggle="yes">t</italic>
                    <sup>3</sup>.</p>
            </sec>
            <sec id="sec44">
                <title>Smooth curves</title>
                <p>Other than polynomial models, another choice for fitting smooth curves to the data is via regression splines using the 
                    <monospace>ns</monospace> function for natural splines in the 
                    <bold>splines</bold> package. Examples of this can be found in the 
                    <bold>limma</bold> and 
                    <bold>edgeR</bold> user&#x2019;s guides. In general, one would fit the most complex curve that one wishes to interpret and for which the number of time points can support. Usually the complexity of curve would never exceed that of the fifth order. For example, if there is no replication at time points, one might choose a second order polynomial (
                    <monospace>degree=2</monospace>) or a spline with 2 parameters (
                    <monospace>df=2</monospace>) for a study on 5 time points, this would leave 2 residual degrees of freedom for the model. If there are 10 time points, then a fifth degree polynomial or a spline with 5 parameters may be appropriate, resulting in a model with 4 residual degrees of freedom. Keep in mind that in gene expression analyses, the same model is applied to every gene in the dataset, even though a simpler model may be sufficient for some genes, whilst a more complex one is needed for others.</p>
                <p>In the case where one wants to fit smooth curves to multiple groups where samples for the groups are taken at different time points, using regression splines allow the fitted trends to be compared between the groups whilst a polynomial fit does not. Furthermore, splines tend to give more sensible and stable curves with better behaviour at the boundaries of the fit than polynomials. On the other hand, polynomials such as the quadratic fit are handy for when one wants to determine the peak or trough of the fitted curve. Note that both the fitting of a spline or polynomial is particularly useful for time course experiments with lots of time points but no replication at each time point. The time series examples in the previous sections have replication at the time points. In that case, one could treat time as a factor rather than a covariate to avoid the interpretation of curves, and to obtain differences between distinct time points explicitly. This approach could be preferred by many.</p>
            </sec>
            <sec id="sec45">
                <title>Cyclical models</title>
                <p>We extend our example further to include an additional two time points, time 9 and 10. It turns out that the biology under study involves genes that are turned on and off cyclically over time. This example can represent studies on circadian rhythm where certain genes are turned on in the day and turned off at night, whilst others are on in the night and off in the day. The data is as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
##    expression   mouse time
## 1        2.08  MOUSE1    1
## 2        2.29  MOUSE2    1
## 3        3.58  MOUSE3    2
## 4        3.54  MOUSE4    2
## 5        3.66  MOUSE5    3
## 6        4.20  MOUSE6    3
## 7        2.56  MOUSE7    4
## 8        2.00  MOUSE8    4
## 9        0.81  MOUSE9    5
## 10       0.58 MOUSE10    5
## 11      -0.14 MOUSE11    6
## 12       0.14 MOUSE12    6
## 13       1.13 MOUSE13    7
## 14       1.58 MOUSE14    7
## 15       3.45 MOUSE15    8
## 16       3.17 MOUSE16    8
## 17       3.66 MOUSE17    9
## 18       4.08 MOUSE18    9
## 19       3.23 MOUSE19   10
## 20       2.93 MOUSE20   10
</preformat>
                </p>
                <p>A cyclical model that models the rhythmic pattern in the data is considered for this example. Although the model itself may be complex, the interpretability of the model is greater than that of higher order polynomials since the fitted cyclical pattern is repeated over time. To fit a cyclical model, we use 
                    <monospace>sin</monospace> and 
                    <monospace>cos</monospace> functions from the 
                    <bold>base</bold> package to obtain the sine and cosine trigonometric functions on the 
                    <monospace>time</monospace> covariate. Now, let us consider an approximate cycle length. Our cycle length appears to be roughly 6 units - the first peak occurs at time 2.57 (
                    <xref ref-type="fig" rid="f17">Figure 17</xref>), and the second peak occurs just after time 8 (
                    <xref ref-type="fig" rid="f18">Figure 18</xref>). Note that a cycle length of 24 units would be appropriate for studies on circadian rhythm if time were measured in hours. The sine and cosine functions, with cycle length of 6 units, are coded as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
cycle &lt;- 6
sinphase &lt;- sin(2*pi*time/cycle)
cosphase &lt;- cos(2*pi*time/cycle)
</preformat>
                </p>
                <p>The design matrix is then coded as:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
design &lt;- model.matrix(~sinphase+cosphase)
</preformat>
                </p>
                <p>The first column in the design matrix (
                    <xref ref-type="fig" rid="f19">Figure 19</xref>), or first parameter in the model, represents the horizontal shift of the cycling pattern from 0. The second column represents the amplitude (or height) of the sine function over time. Similarly, the third column represents the amplitude of the cosine function over time. The parameters can be estimated as 2.1, 0.53 and -1.87. The fitted model can be written as 
                    <inline-formula>
                        <mml:math display="inline">
                            <mml:mrow>
                                <mml:mi>E</mml:mi>
                                <mml:mo stretchy="false">(</mml:mo>
                                <mml:mi>y</mml:mi>
                                <mml:mo stretchy="false">)</mml:mo>
                                <mml:mo>=</mml:mo>
                                <mml:mn>2.1</mml:mn>
                                <mml:mo>+</mml:mo>
                                <mml:mn>0.53</mml:mn>
                                <mml:mtext>&#x2009;</mml:mtext>
                                <mml:mi>sin</mml:mi>
                                <mml:mo stretchy="false">(</mml:mo>
                                <mml:mfrac>
                                    <mml:mi>&#x03c0;</mml:mi>
                                    <mml:mn>3</mml:mn>
                                </mml:mfrac>
                                <mml:mi>t</mml:mi>
                                <mml:mo stretchy="false">)</mml:mo>
                                <mml:mo>+</mml:mo>
                                <mml:mo>&#x2212;</mml:mo>
                                <mml:mn>1.87</mml:mn>
                                <mml:mtext>&#x2009;</mml:mtext>
                                <mml:mi>cos</mml:mi>
                                <mml:mo stretchy="false">(</mml:mo>
                                <mml:mfrac>
                                    <mml:mi>&#x03c0;</mml:mi>
                                    <mml:mn>3</mml:mn>
                                </mml:mfrac>
                                <mml:mi>t</mml:mi>
                                <mml:mo stretchy="false">)</mml:mo>
                            </mml:mrow>
                        </mml:math>
                    </inline-formula>, where 
                    <italic toggle="yes">t</italic> represents the 
                    <monospace>time</monospace> covariate.</p>
                <fig fig-type="figure" id="f19" orientation="portrait" position="float">
                    <label>Figure 19. </label>
                    <caption>
                        <title>Expected gene expression is modelled using sin and cos functions (complete R code shown in the main article) to give a cyclical fit.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure19.gif"/>
                </fig>
                <p>The cyclical pattern can be integrated with other models in the data. If there is an upwards trend to the overall cyclical pattern in the data (data not shown), then we can include a linear component associated with time into the design matrix. The design matrix is coded as 
                    <monospace>design &lt;- model.matrix(~time + sinphase + cosphase)</monospace>, and the linear component is represented in the second column of the design matrix (
                    <xref ref-type="fig" rid="f20">Figure 20</xref>). One can consider including a natural spline (using the 
                    <monospace>ns</monospace> function from the 
                    <bold>splines</bold> package) rather than a linear component, if the upward trend is more &#x201c;curvy&#x201d; and the linear trend is inadequate. For example, the natural spline can be included into the design matrix by coding as 
                    <monospace>model.matrix(~ns(time, df=3) + sinphase + cosphase)</monospace> (design matrix and model not shown).</p>
                <fig fig-type="figure" id="f20" orientation="portrait" position="float">
                    <label>Figure 20. </label>
                    <caption>
                        <title>Expected gene expression is modelled using sin and cos functions (complete R code shown in the main article) to give a cyclical fit.</title>
                        <p>An upwards trend in gene expression over time is also accounted for by including the time covariate in the design matrix.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/30844/e0cd78d8-1ebb-48d6-a35c-007d3d42961c_figure20.gif"/>
                </fig>
            </sec>
        </sec>
        <sec id="sec46">
            <title>Further code notes</title>
            <sec id="sec47">
                <title>Alternative code for design matrices</title>
                <p>In this article, we have shown the coding of design matrices with an intercept term in the form of 
                    <monospace>model.matrix(~variable)</monospace> and those without an intercept term in the form of 
                    <monospace>model.matrix(~0+variable)</monospace> for an explanatory variable 
                    <monospace>variable</monospace>. There are other ways to code for the same design matrix, such as 
                    <monospace>model.matrix(~1+variable)</monospace> for a design matrix with an intercept term, and 
                    <monospace>model.matrix(~-1+variable)</monospace> or 
                    <monospace>model.matrix(~variable-1)</monospace> for a design matrix without an intercept term.</p>
            </sec>
            <sec id="sec48">
                <title>Alternative code for contrast matrices</title>
                <p>The 
                    <monospace>makeContrasts</monospace> function for creating contrast matrices ensures that the contrast matrix is ordered correctly according to model parameters in an associated design matrix. It also returns an error message if there is a mismatch with the associated design matrix which is helpful. However, the 
                    <monospace>makeContrast</monospace> functions requires one to type full column names from the design matrix which can be tedious. The function also complains (in the form of a warning message) about column names from the design matrix that are syntactically invalid.</p>
                <p>Alternatively, one can create a contrast matrix manually by using the 
                    <monospace>cbind</monospace> function as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
contrasts &lt;- cbind(
  AvsC=c(1,0,-1,0), BvsC=c(0,1,-1,0), ABvsCD=c(0.5,0.5,-0.5,-0.5))
rownames(contrasts) &lt;- LETTERS[1:4]
contrasts

##   AvsC BvsC ABvsCD
## A    1    0    0.5
## B    0    1    0.5
## C   -1   -1   -0.5
## D    0    0   -0.5
</preformat>
                </p>
                <p>The above contrast matrix contains three contrasts, which are linear combinations of four model parameters (A, B, C and D). The first contrast compares A to C, the second compares B to C, and the last contrast compares the average of A and B to the average of C and D. When coding for contrast matrices manually, one should carefully check that the rows in the contrast matrix match the columns of the design matrix.</p>
            </sec>
            <sec id="sec49">
                <title>Example code for a 
                    <italic toggle="yes">limma</italic> workflow</title>
                <p>Starting with a counts table, a complete workflow for differential gene expression analysis of RNA-seq data using the 
                    <bold>limma</bold> package can be found in the &#x201c;
                    <italic toggle="yes">RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR</italic>&#x201d; workflow article
                    <sup>
                        <xref ref-type="bibr" rid="ref5">5</xref>
                    </sup>. A summary of the main steps for fitting a linear model to each gene and obtaining parameter estimates are as follows:</p>
                <p>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
group &lt;- as.factor(c(1,1,1,2,2,2))
design &lt;- model.matrix(~0+group)
contrasts &lt;- makeContrasts(group1-group2, levels=colnames(design)) 
v &lt;- voom(counts, design)
fit &lt;- lmFit(v, design)
fit &lt;- contrasts.fit(fit, contrasts)
fit &lt;- eBayes(fit)
topTable(fit)
</preformat>
                </p>
                <p>The above code is non-runnable as the 
                    <monospace>counts</monospace> object is missing. The 
                    <monospace>counts</monospace> object here is assumed to be a table of counts, with rows as genes and columns as samples. In this example, there are six samples, three of which belong to group 1 and the other three to group 2. The design matrix is parameterised for a means model, and the contrast matrix is used to calculate the difference in mean expression between group 1 and group 2.</p>
            </sec>
        </sec>
        <sec id="sec50">
            <title>Theory of linear regression models, design matrices and contrast matrices</title>
            <p>This section briefly summarises the mathematics of design and contrasts matrices. This section is optional and understanding it is not required to undertake any of the analyses described earlier in the article. It is provided as a reference for those comfortable with the mathematical notation.</p>
            <p>A regression model, in the general sense, can be used to describe the relationship between explanatory variables and gene expression, the response variable. There are many different types of regression models, where each assume different characteristics for the relationship between explanatory and response variables, as well as the properties associated with variability in the data. Consider one such model, a simple linear regression model 
                <disp-formula id="e8">
                    <mml:math display="block">
                        <mml:mi>E</mml:mi>
                        <mml:mo stretchy="false">(</mml:mo>
                        <mml:mi>y</mml:mi>
                        <mml:mo stretchy="false">)</mml:mo>
                        <mml:mo>=</mml:mo>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b2;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>0</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mo>+</mml:mo>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b2;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>1</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mi>x</mml:mi>
                    </mml:math>
                </disp-formula>which describes the expected value for gene expression 
                <italic toggle="yes">E</italic>(
                <italic toggle="yes">y</italic>) as equal to the sum of a constant 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>0</sub> and an explanatory variable 
                <italic toggle="yes">x</italic> scaled by a coefficient 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>1</sub>. The 
                <italic toggle="yes">&#x03b2;</italic> values (
                <italic toggle="yes">&#x03b2;</italic>
                <sub>0</sub> and 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>1</sub>) are referred to as regression parameters, where their real values are unknown. In matrix notation, the right-hand-side of the equation, 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>0</sub> + 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>1</sub>
                <italic toggle="yes">x</italic>, can be written as 
                <disp-formula id="e9">
                    <mml:math display="block">
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>X</mml:mi>
                        </mml:mstyle>
                        <mml:mstyle mathvariant="bold-italic">
                            <mml:mi>&#x03b2;</mml:mi>
                        </mml:mstyle>
                        <mml:mo>=</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mi>x</mml:mi>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>0</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                    </mml:math>
                </disp-formula>where it is broken down into the matrix 
                <bold>X</bold> and the regression parameters 
                <bold>
                    <italic toggle="yes">&#x03b2;</italic>
                </bold>. The values along the row in 
                <bold>X</bold> are multiplied by the regression parameters, such that it calculates 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>0</sub> &#x00d7; 1 + 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>1</sub> &#x00d7; 
                <italic toggle="yes">x</italic>. Since regression parameters are unknown, they are estimated from a study of 
                <italic toggle="yes">n</italic> samples. For a given gene, gene expression values are denoted as 
                <bold>y</bold> = 
                <italic toggle="yes">y</italic>
                <sub>1</sub>,
                <italic toggle="yes">y</italic>
                <sub>2</sub>,
                <italic toggle="yes">y</italic>
                <sub>3</sub>,...,
                <italic toggle="yes">y</italic>
                <sub>
                    <italic toggle="yes">n</italic>
                </sub> for the 
                <italic toggle="yes">n</italic> samples. The matrix 
                <bold>X</bold>, which we refer to as the design matrix, is used to store values of the explanatory variable associated with each sample, such that 
                <disp-formula id="e10">
                    <mml:math display="block">
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>X</mml:mi>
                        </mml:mstyle>
                        <mml:mo>=</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                    </mml:math>
                </disp-formula>with each row representing a sample. Putting the response and explanatory variables together, we can then solve the equation 
                <disp-formula id="e11">
                    <mml:math display="block">
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>y</mml:mi>
                        </mml:mstyle>
                        <mml:mo>=</mml:mo>
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>X</mml:mi>
                        </mml:mstyle>
                        <mml:mstyle mathvariant="bold-italic">
                            <mml:mi>&#x03b2;</mml:mi>
                        </mml:mstyle>
                        <mml:mo>+</mml:mo>
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>e</mml:mi>
                        </mml:mstyle>
                        <mml:mo>.</mml:mo>
                    </mml:math>
                    <label>(1)</label>
                </disp-formula>to obtain estimates for parameters 
                <bold>
                    <italic toggle="yes">&#x03b2;</italic>
                </bold>. The 
                <bold>e</bold> denotes differences between observed gene expression values and the true population value (e.g. the population mean), where 
                <bold>e</bold> is referred to as the errors. The above equation (
                <xref ref-type="disp-formula" rid="e11">Equation 1</xref>) can be expanded out and written as 
                <disp-formula id="e12">
                    <mml:math display="block">
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>y</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>y</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>y</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>y</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mo>=</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mn>1</mml:mn>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>0</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mo>+</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>e</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>e</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>e</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>e</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mo>.</mml:mo>
                    </mml:math>
                </disp-formula>
            </p>
            <p>Through an estimation process, we obtain parameter estimates which we denote as 
                <inline-formula>
                    <mml:math display="inline">
                        <mml:mover>
                            <mml:mi mathvariant="bold-italic">&#x03b2;</mml:mi>
                            <mml:mo>^</mml:mo>
                        </mml:mover>
                    </mml:math>
                </inline-formula>. Using the parameter estimates, we can calculate fitted values for gene expression by multiplying the design matrix 
                <bold>X</bold> by the parameter estimates, such that the fitted values are calculated as 
                <inline-formula>
                    <mml:math display="inline">
                        <mml:mover>
                            <mml:mi mathvariant="bold">y</mml:mi>
                            <mml:mo>^</mml:mo>
                        </mml:mover>
                        <mml:mo>=</mml:mo>
                        <mml:mi mathvariant="bold">X</mml:mi>
                        <mml:mover>
                            <mml:mi mathvariant="bold-italic">&#x03b2;</mml:mi>
                            <mml:mo>^</mml:mo>
                        </mml:mover>
                    </mml:math>
                </inline-formula>. The least squares estimation strategy obtains estimates of regression parameters by minimising the sum of squared residuals, where residuals are calculated as the difference between observed 
                <bold>y</bold> and fitted gene expression values 
                <inline-formula>
                    <mml:math display="inline">
                        <mml:mover>
                            <mml:mi mathvariant="bold">y</mml:mi>
                            <mml:mo>^</mml:mo>
                        </mml:mover>
                    </mml:math>
                </inline-formula>.</p>
            <p>The simple linear regression model, 
                <italic toggle="yes">E</italic>(
                <italic toggle="yes">y</italic>) = 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>0</sub> + 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>1</sub>
                <italic toggle="yes">x</italic>, can be generalised and extended to contain more explanatory variables and written as 
                <disp-formula id="e13">
                    <mml:math display="block">
                        <mml:mi>E</mml:mi>
                        <mml:mo stretchy="false">(</mml:mo>
                        <mml:mi>y</mml:mi>
                        <mml:mo stretchy="false">)</mml:mo>
                        <mml:mo>=</mml:mo>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b2;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>1</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>x</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>1</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mo>+</mml:mo>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b2;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>2</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>x</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>2</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mo>+</mml:mo>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b2;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>3</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>x</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mn>3</mml:mn>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mo>+</mml:mo>
                        <mml:mo>.</mml:mo>
                        <mml:mo>.</mml:mo>
                        <mml:mo>.</mml:mo>
                        <mml:mo>+</mml:mo>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b2;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mi>p</mml:mi>
                            </mml:mrow>
                        </mml:msub>
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>x</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mi>p</mml:mi>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mo>.</mml:mo>
                    </mml:math>
                </disp-formula>We refer to this type of model as a linear regression model. The model contains 
                <italic toggle="yes">p</italic> regression parameters, each of which are associated with an explanatory variable. In a study on 
                <italic toggle="yes">n</italic> samples, an associated design matrix 
                <bold>X</bold> can be written as 
                <disp-formula id="e14">
                    <mml:math display="block">
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>X</mml:mi>
                        </mml:mstyle>
                        <mml:mo>=</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>p</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>p</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>p</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>x</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>p</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>n</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr/>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mo>,</mml:mo>
                    </mml:math>
                </disp-formula>where the values 
                <italic toggle="yes">x</italic>
                <sub>
                    <italic toggle="yes">i</italic>,
                    <italic toggle="yes">j</italic>
                </sub> represent explanatory variable 
                <italic toggle="yes">i</italic> in sample 
                <italic toggle="yes">j</italic>, for 
                <italic toggle="yes">i</italic> = 1,2,3,....,
                <italic toggle="yes">p</italic> and 
                <italic toggle="yes">j</italic> = 1,2,3,...,
                <italic toggle="yes">n</italic>. A model with an intercept term, simply sets the values in the first column of the design matrix as 1. To distinguish between a model with and without an intercept term, the associated parameter for the intercept term is sometimes denoted as 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>0</sub> rather than 
                <italic toggle="yes">&#x03b2;</italic>
                <sub>1</sub>, although either notation is acceptable.</p>
            <p>Contrasts are set up to examine relationships of interest, such that a contrast matrix 
                <bold>C</bold> contains 
                <italic toggle="yes">K</italic> column vectors of length 
                <italic toggle="yes">p</italic> (number of model parameters), and can be written as 
                <disp-formula id="e15">
                    <mml:math display="block">
                        <mml:mstyle mathvariant="bold">
                            <mml:mi>C</mml:mi>
                        </mml:mstyle>
                        <mml:mo>=</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>K</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>K</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>K</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>p</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>p</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>p</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>K</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>p</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr/>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mo>.</mml:mo>
                    </mml:math>
                </disp-formula>Each column in 
                <bold>C</bold>, or individual contrast 
                <bold>c</bold>
                <sub>
                    <italic toggle="yes">k</italic>
                </sub> = 
                <italic toggle="yes">c</italic>
                <sub>
                    <italic toggle="yes">k</italic>,1</sub>,
                <italic toggle="yes">c</italic>
                <sub>
                    <italic toggle="yes">k</italic>,2</sub>,
                <italic toggle="yes">c</italic>
                <sub>
                    <italic toggle="yes">k</italic>,3</sub>,....,
                <italic toggle="yes">c</italic>
                <sub>
                    <italic toggle="yes">k</italic>,
                    <italic toggle="yes">p</italic>
                </sub>, represents a relationship or comparison of interest, where for each gene we test whether or not 
                <italic toggle="yes">&#x03b4;</italic>
                <sub>
                    <italic toggle="yes">k</italic>
                </sub> =
                <bold>c</bold>
                <sub>
                    <italic toggle="yes">k</italic>
                </sub>
                <sup>
                    <italic toggle="yes">T</italic>
                </sup>
                <bold>
                    <italic toggle="yes">&#x03b2;</italic>
                </bold> is non-zero. 
                <italic toggle="yes">&#x03b4;</italic>
                <sub>
                    <italic toggle="yes">k</italic>
                </sub> can be expanded out and written as 
                <disp-formula id="e16">
                    <mml:math display="block">
                        <mml:msub>
                            <mml:mrow>
                                <mml:mi>&#x03b4;</mml:mi>
                            </mml:mrow>
                            <mml:mrow>
                                <mml:mi>k</mml:mi>
                            </mml:mrow>
                        </mml:msub>
                        <mml:mo>=</mml:mo>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>k</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>k</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>k</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>c</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>k</mml:mi>
                                                    <mml:mo>,</mml:mo>
                                                    <mml:mi>p</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mfenced close="]" open="[">
                            <mml:mrow>
                                <mml:mtable>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>1</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>2</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mn>3</mml:mn>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                            <mml:mo>.</mml:mo>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr>
                                        <mml:mtd>
                                            <mml:msub>
                                                <mml:mrow>
                                                    <mml:mi>&#x03b2;</mml:mi>
                                                </mml:mrow>
                                                <mml:mrow>
                                                    <mml:mi>p</mml:mi>
                                                </mml:mrow>
                                            </mml:msub>
                                        </mml:mtd>
                                    </mml:mtr>
                                    <mml:mtr/>
                                </mml:mtable>
                            </mml:mrow>
                        </mml:mfenced>
                        <mml:mo>.</mml:mo>
                    </mml:math>
                </disp-formula>
            </p>
        </sec>
        <sec id="sec52" sec-type="discussion">
            <title>Discussion</title>
            <p>In this article, we described the set up and interpretation of design matrices and their associated models for a variety of case studies involving factors and/or covariates. The examples in this article are completely reproducible via our Rmarkdown document that can be downloaded from the 
                <bold>RNAseq123</bold> workflow package available from 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/doi:10.18129/B9.bioc.RNAseq123">https://doi.org/doi:10.18129/B9.bioc.RNAseq123</ext-link>. The document can be used to recreate design matrices and plots found in this article, as well as to obtain estimated values for model parameters.</p>
            <p>The estimation process was not described explicitly in our work since it is not of direct interest here. Parameter estimates were merely used to illustrate aspects of the model relating to the design matrix. For simplicity, parameter estimation in our single gene examples was carried out using the 
                <monospace>lm</monospace> function from the 
                <bold>stats</bold> package, with the exception of the mixed effects model that uses 
                <monospace>lmFit</monospace> from 
                <bold>limma</bold> due to its complexity. In practice, 
                <bold>limma</bold>&#x2019;s 
                <monospace>lmFit</monospace> function would be used to obtain parameter estimates for multiple genes simultaneously in RNA-seq datasets and other genomic data types. The estimation process performed by 
                <monospace>lm</monospace> and 
                <monospace>lmFit</monospace> can be different (
                <monospace>lm</monospace> carries out ordinary least squares estimation, whereas 
                <monospace>lmFit</monospace> usually carries out weighted least squares estimation), so their parameter estimates may differ also. The two functions would produce the same parameter estimates if 
                <monospace>lmFit</monospace> was run in its simplest form without intrablock correlations, precision weights or robustification.</p>
            <p>Our article describes case studies that are common to the field of genomics research, where the choice of language used throughout the article makes it easily adaptable to studies and datasets for various applications. We have taken special care to explain, where appropriate, the reasoning behind specific design matrix set ups and describe how one would go about interpreting the associated model and its parameters. This allows readers to build their knowledge and understanding of simpler concepts, and work their way up to more advanced concepts, such as mixed effects or cyclical models, that are also described. Although we have covered design matrices in many common experimental settings, there will certainly be more complex scenarios that have been missed. We do not describe, for example, a study with a covariate and multiple factors, however, a reader should be able to tackle such an example quite easily with their understanding from the sections on multiple factors, combined with their understanding from the studies on covariates. For more complicated experimental designs, we would advise readers to consult with a statistician or bioinformatician who is experienced in linear modelling.</p>
        </sec>
        <sec id="sec53">
            <title>Software availability</title>
            <p>This article was written using Bioconductor
                <sup>
                    <xref ref-type="bibr" rid="ref10">10</xref>
                </sup> version 3.12, running on R version 4.0.3 (2020-10-10). The examples in this article made use of the software packages 
                <bold>limma</bold> version 3.45.19 and 
                <bold>TeachingDemos</bold>
                <sup>
                    <xref ref-type="bibr" rid="ref11">11</xref>
                </sup> version 2.12. This article was written as an Rmarkdown document that was compiled using knitr, and converted from an Rmarkdown document to LaTex with the help of 
                <bold>BiocWorkflowTools</bold> version 1.15.0. All packages and their version numbers are shown below. Reproducible code for this article is available at 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/doi:10.18129/B9.bioc.RNAseq123">https://doi.org/doi:10.18129/B9.bioc.RNAseq123</ext-link> (Artistic License 2.0), which also stores the code for the &#x201d;
                <italic toggle="yes">RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR</italic>&#x201d; workflow article
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup>.</p>
            <p>

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks
    /vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] TeachingDemos_2.12 limma_3.45.19
##
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.11          knitr_1.30               magrittr_1.5
##  [4] usethis_1.6.3            BiocWorkflowTools_1.15.0 statmod_1.4.35
##  [7] here_0.1                 R6_2.4.1                 rlang_0.4.8
## [10] stringr_1.4.0            httr_1.4.2               tools_4.0.3
## [13] xfun_0.18                git2r_0.27.1             htmltools_0.5.0
## [16] yaml_2.2.1               digest_0.6.27            rprojroot_1.3-2
## [19] bookdown_0.21            BiocManager_1.30.10      fs_1.5.0
## [22] glue_1.4.2               evaluate_0.14            rmarkdown_2.5
## [25] stringi_1.5.3            compiler_4.0.3           backports_1.1.10
</preformat>
            </p>
        </sec>
    </body>
    <back>
        <ack id="ack1">
            <title>Acknowledgments</title>
            <p>The authors would like to thank Tim Thomas, Quentin Gouil, Shani Amarasinghe and Mengbo Li for contributing suggestions that have improved the quality and clarity of the work presented in this article.</p>
        </ack>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
</person-group>:
                    <article-title>Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.</article-title>
                    <source>

                        <italic toggle="yes">Stat. Appl. Genet. Mol. Biol.</italic>
</source>
                    <year>2004</year>;<volume>3</volume>(<issue>1</issue>): Article
                    <elocation-id>3</elocation-id>.
                    <pub-id pub-id-type="pmid">16646809</pub-id>
                    <pub-id pub-id-type="doi">10.2202/1544-6115.1027</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
</person-group>:
                    <chapter-title>Limma: linear models for microarray data.</chapter-title>In:
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Carey</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dudoit</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Bioinformatics and computational biology solutions using R and Bioconductor.</italic>
</source>
                    <publisher-loc>New York</publisher-loc>:
                    <publisher-name>Springer</publisher-name>;<year>2005</year>.
p.<fpage>397</fpage>&#x2013;<lpage>420</lpage>.</mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Glonek</surname>
                            <given-names>GFV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Solomon</surname>
                            <given-names>PJ</given-names>
                        </name>
</person-group>:
                    <article-title>Factorial and time course designs for cDNA microarray experiments.</article-title>
                    <source>

                        <italic toggle="yes">Biostatistics.</italic>
</source>
                    <year>2004</year>;<volume>5</volume>(<issue>1</issue>):<fpage>89</fpage>&#x2013;<lpage>111</lpage>.
                    <pub-id pub-id-type="doi">10.1093/biostatistics/5.1.89</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ritchie</surname>
                            <given-names>ME</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Phipson</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wu</surname>
                            <given-names>D</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>limma powers differential expression analyses for RNA-sequencing and microarray studies.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2015</year>;<volume>43</volume>(<issue>7</issue>):<fpage>e47</fpage>.
                    <pub-id pub-id-type="pmid">25605792</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkv007</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4402510</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Law</surname>
                            <given-names>CW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alhamdoosh</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Su</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>RNA-seq analysis is easy as 1-2-3 with limma, glimma and edgeR.</article-title>
                    <source>

                        <italic toggle="yes">F1000 Research.</italic>
</source>
                    <year>2016</year>;<volume>5</volume>(<issue>1408</issue>).
                    <pub-id pub-id-type="pmid">27441086</pub-id>
                    <pub-id pub-id-type="doi">10.12688/f1000research.9005.3</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4937821</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Venables</surname>
                            <given-names>B</given-names>
                        </name>
</person-group>:
                    <source>

                        <italic toggle="yes">Coding matrices, contrast matrices and linear models.</italic>
</source>
                    <year>2018</year>; CRAN codingMatrices package vignette.</mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Soneson</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Marini</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Geier</surname>
                            <given-names>F</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>ExploreModelMatrix: interactive exploration for improved understanding of design matrices and linear models in R.</article-title>
                    <source>

                        <italic toggle="yes">F1000 Research.</italic>
</source>
                    <year>2020</year>;<volume>9</volume>(<issue>512</issue>).
                    <pub-id pub-id-type="pmid">32704355</pub-id>
                    <pub-id pub-id-type="doi">PMC7359746</pub-id>
                    <pub-id pub-id-type="pmcid">10.12688/f1000research.24187.2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Robinson</surname>
                            <given-names>MD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McCarthy</surname>
                            <given-names>DJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
</person-group>:
                    <article-title>edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2010</year>;<volume>26</volume>:<fpage>139</fpage>&#x2013;<lpage>140</lpage>.
                    <pub-id pub-id-type="pmid">19910308</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp616</pub-id>
                    <pub-id pub-id-type="pmcid">PMC2796818</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>McCarthy</surname>
                            <given-names>DJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
</person-group>:
                    <article-title>Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2012</year>;<volume>40</volume>:<fpage>4288</fpage>&#x2013;<lpage>4297</lpage>.
                    <pub-id pub-id-type="pmid">22287627</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gks042</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3378882</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Carey</surname>
                            <given-names>VJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Orchestrating high-throughput genomic analysis with Bioconductor.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Methods.</italic>
</source>
                    <year>2015</year>;<volume>12</volume>(<issue>2</issue>):<fpage>115</fpage>&#x2013;<lpage>21</lpage>
                    <ext-link ext-link-type="uri" xlink:href="http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html">Reference Source</ext-link>.
                    <pub-id pub-id-type="pmid">25633503</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3252</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4509590</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Snow</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <source>

                        <italic toggle="yes">TeachingDemos: Demonstrations for Teaching and Learning.</italic>
</source>
                    <year>2020</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://CRAN.R-project.org/package=TeachingDemos">Reference Source</ext-link>R package version 2.12.</mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report76092">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.30844.r76092</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Drnevich</surname>
                        <given-names>Jenny</given-names>
                    </name>
                    <xref ref-type="aff" rid="r76092a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-1401-8311</uri>
                </contrib>
                <aff id="r76092a1">
                    <label>1</label>High-Performance Biological Computing Group, University of Illinois, Urbana-Champaign, IL, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>9</day>
                <month>2</month>
                <year>2021</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2021 Drnevich J</copyright-statement>
                <copyright-year>2021</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport76092" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.27893.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Law 
                <italic>et al.</italic>&#x2019;s Method Article focuses on two crucial aspects of differential expression analysis: how to properly form the statistical model and how to pull appropriate contrasts from the model. Unlike other aspects involved with a full analysis that are generalizable to all experiments (e.g., sample quality control, normalization, filtering, visualization of results and down-stream data mining), the statistical modeling must be tailored for each experimental design and hence why it often is the most difficult part for beginners and experienced analysts alike. Technical mathematical notation is confined to one &#x201c;optional&#x201d; section, so the rest of the manuscript is more understandable to people from varied backgrounds that are looking for help with making correct design and contrast matrices. Although the manuscript is more understandable, it still contains such necessary depth of information spanning simple to very complex models that it will require many readings to fully digest all of it. I envision this to become a welcome guide to be referenced again and again.</p>
            <p> This manuscript distinguishes itself from other guides on selecting design and contrast matrices like the limmaUsersGuide() and edgeRUsersGuide(), which are excellent but also contain a tome of information on other aspects of gene expression analysis such that the design/contrast matrix is not as easy to find as this stand-alone article. I also really like the combination in each figure of the fitted model statements (e.g., E(y) = 1.85 + 0.54x), the actual design matrix and a plot visualizing the data points and estimated parameter values. The fitted model statements are an addition that I have not seen much in other references on making design matrices. I do have a few minor points below that should be addressed: 
                <list list-type="order">
                    <list-item>
                        <p>Why are all the makeContrasts() functions called with levels=colnames(design) instead of just levels=design? It seems the second way is not only simpler but also reinforces the idea that the contrast matrix must be defined from the design matrix. I am not sure I have seen the first way in any other documentation in making contrast matrices, so it seems like an explicit change but without a compelling rationale.</p>
                    </list-item>
                    <list-item>
                        <p>At the end of the &#x201c;Contrast matrix for computing differences &#x201c; section, the sentence &#x201c;gene expression of healthy mice is greater than sick mice by a value of &#x2212;1.62&#x201d; required several readings to understand what it was trying to say, probably because of the conflict between &#x201c;greater than&#x201d; and &#x201c;-1.62&#x201d;. I would suggest replacing &#x201c;greater than&#x201d; in this sentence and the one above with &#x201c;different than&#x201d;.</p>
                    </list-item>
                    <list-item>
                        <p>In the &#x201c;2 versus 2 group comparisons&#x201d; section, I question the choice to compare control and treatment III against treatment I and II given the rationale &#x201c;This may be of interest if there are prior expectations that two groups are more similar to each other than the other two&#x201d;. Given that we have already seen that treatment III has the biggest change from the control, I feel like readers will be confused as to why control and treatment III were paired together instead of control and treatment I.</p>
                    </list-item>
                    <list-item>
                        <p>I am somewhat disappointed not to see the &#x201c;sum to zero&#x201d; parametrization for a two-factor model that is in the limmaUsersGuide(). Upon reflection, I agree with the decision to not add yet another model type to this already complex guide which does contain example contrasts to get the traditional main effects contrasts.</p>
                    </list-item>
                    <list-item>
                        <p>Would unequal sample sizes in groups affect the results of the modeling in any way? Perhaps this only applies to the estimate of the grand mean in the &#x201c;sum to zero&#x201d; parametrization, but I have occasionally struggled with getting the same answers from different model parametrizations and on-line help has suggested it was due to uneven sample sizes in the groups. If uneven sample sizes would not affect the choice of parametrization or formation of contrasts, it would be nice to have a simple sentence of that somewhere. If there are situations where uneven sample sizes would be an issue but it is beyond the scope of this paper, that should be addressed briefly and outside references given.</p>
                    </list-item>
                </list>
            </p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the method technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Gene expression analysis, bioinformatics, RNA-Seq data,</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report76094">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.30844.r76094</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Marini</surname>
                        <given-names>Federico</given-names>
                    </name>
                    <xref ref-type="aff" rid="r76094a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-3252-7758</uri>
                </contrib>
                <aff id="r76094a1">
                    <label>1</label>Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>5</day>
                <month>2</month>
                <year>2021</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2021 Marini F</copyright-statement>
                <copyright-year>2021</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport76094" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.27893.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This manuscript presents a detailed description of an essential aspect of the differential expression analysis workflows, applicable to a variety of high-throughput assays, namely the creation, specification, and comprehension of proper design matrices, followed by the correct setup for the relevant comparisons, to be done via contrast matrices.</p>
            <p> </p>
            <p> While technologies such as RNA-seq are widespread in the current literature, and an extensive collection of software packages for such datasets are nowadays well established, the appropriate choices in statistical modeling might remain a more elusive topic to many researchers with experience in genomic data analysis, but not formally trained in biostatistics/bioinformatics.</p>
            <p> </p>
            <p> Seeing over the years the amount of questions on specialized support forums (the Bioconductor Support Site, Stack Overflow, Biostars) about the topics addressed in this work, I can only say it is a very welcome and valuable contribution to the scientific literature, and additionally become an essential resource to be used for teaching (e.g. in lectures but also in more focused workshops, as it is written in a way that it integrates seamlessly with the widely used framework of limma).</p>
            <p> </p>
            <p> Overall, the manuscript is clearly written, avoiding overcomplicated jargon to make it understandable to a large readership, and well structured in a way that builds up from the fundamental elements of commonly encountered experimental designs, followed by more complex yet plausible scenarios, where more refined approaches might be required to properly address the underlying scientific questions.</p>
            <p> </p>
            <p> I have a few points that a revised version of the manuscript might address to improve the overall quality of this work: 
                <list list-type="bullet">
                    <list-item>
                        <p>The text could refer to Figure 1 subpanels explicitly, for more clarity.</p>
                    </list-item>
                    <list-item>
                        <p>Moreover, in the center and right panel the description and legend both refer to Beta1 and Beta2, while the figure shows Beta0 and Beta1.</p>
                    </list-item>
                    <list-item>
                        <p>Some formal definitions in the Terminology table can be improved with more complete wordings to provide the clearest, non-ambiguous, yet formally correct descriptions of the fundamental concepts.</p>
                    </list-item>
                    <list-item>
                        <p>Response variable: add something like "often specified/referred to as Y in this manuscript".</p>
                    </list-item>
                    <list-item>
                        <p>Explanatory variable: add "often noted as X_i" or similar.</p>
                    </list-item>
                    <list-item>
                        <p>Nested factors: does this definition need e.g. the addition of "mutually exclusively" for better clarity?</p>
                    </list-item>
                    <list-item>
                        <p>Mixed effect models: "where random effects are usually not of interest to the study at hand.", probably add something like "yet need to be specified for proper modelling of the data at hand"? Related to this, a similar formulation could be adopted at page 19 when defining the random effect as "not of interest to the study" - while this is truly not of interest, the omission of accounting for that could be detrimental in the modeling.</p>
                        <p> Overall, I think that if the definitions in here would be a little more complete, it would increase a lot the value of this table as a go-to reference for many occasions - of course, without overcomplicating too much by overloading it with theoretical details.</p>
                    </list-item>
                    <list-item>
                        <p>As a general comment - for the table and the main text: I am not so sure that the "covariate" refers only to the continuous variables; I often encountered the nomenclature of a "categorical covariate". I suggest it is worth revisiting this aspect and if needed clarify it further in the scope of this manuscript.</p>
                    </list-item>
                    <list-item>
                        <p>In the subsection "Studies with covariates", the cyclical models are not included in the table just below the text.</p>
                    </list-item>
                    <list-item>
                        <p>It could be helpful to have a visual representation for the "Overlap of treatment-control results" with something like a Venn Diagram or an Upset plot to provide an immediate summary of the textual description.</p>
                    </list-item>
                    <list-item>
                        <p>While I understand that it could be out of scope for this manuscript, focused on the limma framework, it might be worth mentioning that some conventions might differ in other frameworks. For example, DESeq2 specifies the main variable of interest as the last term in the design, while on page 22 it is stated "This is why we place the factor of interest first as it simplifies the subsequent code for the comparisons of interest, even though a different order of factors added give equivalent models with variations in parameterisation.". The authors could mention that different software might follow different default implementation and recommend to accurately read the software documentation.</p>
                    </list-item>
                    <list-item>
                        <p>On page 36, the term delta_k was not previously defined (but was previously used for the interaction term). Does this need to be clarified explicitly?</p>
                    </list-item>
                    <list-item>
                        <p>It might concern the pagination of the article, but sometimes it is not entirely clear where the output of the code is close to a figure which actually relates to a section above (e.g. on page 10 with Figure 4). Another example where the sequence of text-output-figures is not so clear is on page 33 (Model belongs to above section, Matrix + Plot to the one below).</p>
                    </list-item>
                    <list-item>
                        <p>The authors could provide a snapshot of the RNAseq123 repository (e.g. on Zenodo) to guarantee the complete reproducibility of the manuscript - often times vignettes are edited over the development cycles. Even if not entirely a "Software" article, it is inserted into a workflow software package in Bioconductor.</p>
                    </list-item>
                </list> </p>
            <p> Some sentences are missing some prepositions or are not fully correct from the grammatical point of view. I am reporting here some other sentences as well that would need to be fixed: 
                <list list-type="bullet">
                    <list-item>
                        <p>&#x00a0;Abstract: "set up *of* an appropriate model via design matrices", "Differential expression analysis [...] use*s* linear models".</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 4, "where the line is defined by Beta0 the y-intercept and Beta1 the slope" should have some commas inserted.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;"classifiers associated with samples", the wording is not so clear.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 8, "if variable is a covariate the then", invert the and then.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;Legend of Figure 3, the backticks for rendering the code monospace font are still visible.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 12: refers to the `makeContrast` function, but should read `makeContrasts` (also page 17, please check all occurrences).</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 23: "the use of unsupervised clustering plots" could be rephrased in a more general way as "the use of exploratory data analysis techniques". In the sentences after that: "with one addition*al* factor to observe", please correct this.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 28: "time is consider*ed* as distinct changes in state".</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 35: Section "Theory of linear regression models, design matrices and contrast matrices" might be missing an oxford comma.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 36, "A model with an intercept term, simply sets", please remove the comma.</p>
                    </list-item>
                    <list-item>
                        <p>&#x00a0;page 37, "R markdown" should read "R Markdown" (at least it does so in most of RStudio's documentation)</p>
                    </list-item>
                </list> </p>
            <p> Most of these points are easily addressable, and I believe they would contribute in making even clearer the concepts already nicely described throughout this work, which I envision to be extremely useful for many practitioners that have to deal with (and properly perform) omics data analysis.</p>
            <p> </p>
            <p> </p>
            <p> </p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the method technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Bioinformatics, RNA-seq data analysis</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
</article>
