<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.9501.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                    <subj-group>
                        <subject>Bioinformatics</subject>
                    </subj-group>
                    <subj-group>
                        <subject>Genomics</subject>
                    </subj-group>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 3 approved, 2 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Lun</surname>
                        <given-names>Aaron T.L.</given-names>
                    </name>
                    <uri content-type="orcid">https://orcid.org/0000-0002-3564-4813</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>McCarthy</surname>
                        <given-names>Davis J.</given-names>
                    </name>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Marioni</surname>
                        <given-names>John C.</given-names>
                    </name>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a4">4</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Cancer Research UK Cambridge Institute, Cambridge, UK</aff>
                <aff id="a2">
                    <label>2</label>EMBL European Bioinformatics Institute, Cambridge, UK</aff>
                <aff id="a3">
                    <label>3</label>St Vincent&#x2019;s Institute of Medical Research, Fitzroy, Australia</aff>
                <aff id="a4">
                    <label>4</label>Wellcome Trust Sanger Institute, Cambridge, UK</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:alun@wehi.edu.au">alun@wehi.edu.au</email>
                </corresp>
                <fn fn-type="con">
                    <p>A.T.L.L. developed and tested the workflow on all datasets. A.T.L.L. and D.J.M. implemented improvements to the software packages required by the workflow. J.C.M. provided direction to the software and workflow development. All authors wrote and approved the final manuscript.</p>
                </fn>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>31</day>
                <month>10</month>
                <year>2016</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2016</year>
            </pub-date>
            <volume>5</volume>
            <elocation-id>2122</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>24</day>
                    <month>10</month>
                    <year>2016</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Lun ATL et al.</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/5-2122/pdf"/>
            <abstract>
                <p>Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses are demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This provides a range of usage scenarios from which readers can construct their own analysis pipelines.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Single cell</kwd>
                <kwd>RNA-seq</kwd>
                <kwd>bioinformatics</kwd>
                <kwd>Bioconductor</kwd>
                <kwd>workflow</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>A.T.L.L. and J.C.M. were supported by core funding from Cancer Research UK (award no. A17197). D.J.M. was supported by a CJ Martin Fellowship from the National Health and Medical Research Council of Australia. D.J.M. and J.C.M. were also supported by core funding from EMBL.</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>This version of the workflow contains a number of improvements based on the referees' comments. We have re-compiled the workflow using the latest packages from Bioconductor release 3.4, and stated the dependence on these package versions more explicitly. We have added a reference to the Bioconductor workflow page, which provides user-friendly instructions for installation and execution of the workflow. We have also moved cell cycle classification before gene filtering as this provides more precise cell cycle phase classifications. We have also made minor rewordings and elaborations in various parts of the article.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing (
                <xref ref-type="bibr" rid="ref-43">Stegle 
                    <italic toggle="yes">et al.</italic>, 2015</xref>). This can be done using microfluidics platforms like the Fluidigm C1 (
                <xref ref-type="bibr" rid="ref-38">Pollen 
                    <italic toggle="yes">et al.</italic>, 2014</xref>), protocols based on microtiter plates like Smart-seq2 (
                <xref ref-type="bibr" rid="ref-37">Picelli 
                    <italic toggle="yes">et al.</italic>, 2014</xref>), or droplet-based technologies like inDrop and Drop-seq (
                <xref ref-type="bibr" rid="ref-20">Klein 
                    <italic toggle="yes">et al.</italic>, 2015</xref>; 
                <xref ref-type="bibr" rid="ref-30">Macosko 
                    <italic toggle="yes">et al.</italic>, 2015</xref>). The number of reads mapped to each gene is then used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene (
                <xref ref-type="bibr" rid="ref-16">Islam 
                    <italic toggle="yes">et al.</italic>, 2014</xref>). Count data are analyzed to detect highly variable genes (HVGs) that drive heterogeneity across cells in a population, to find correlations between genes and cellular phenotypes, or to identify new subpopulations via dimensionality reduction and clustering. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.</p>
            <p>Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data are much noisier than bulk data (
                <xref ref-type="bibr" rid="ref-5">Brennecke 
                    <italic toggle="yes">et al.</italic>, 2013</xref>; 
                <xref ref-type="bibr" rid="ref-32">Marinov 
                    <italic toggle="yes">et al.</italic>, 2014</xref>). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise during analysis, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population (
                <xref ref-type="bibr" rid="ref-11">Fan 
                    <italic toggle="yes">et al.</italic>, 2016</xref>; 
                <xref ref-type="bibr" rid="ref-44">Trapnell 
                    <italic toggle="yes">et al.</italic>, 2014</xref>; 
                <xref ref-type="bibr" rid="ref-46">Vallejos 
                    <italic toggle="yes">et al.</italic>, 2015</xref>). This is simply not possible with bulk data, meaning that custom methods are required to perform these analyses.</p>
            <p>This article describes a computational workflow for basic analysis of scRNA-seq data, using software packages from the open-source Bioconductor project (release 3.4) (
                <xref ref-type="bibr" rid="ref-13">Huber 
                    <italic toggle="yes">et al.</italic>, 2015</xref>). Starting from a count matrix, this workflow contains the steps required for quality control to remove problematic cells; normalization of cell-specific biases, with and without spike-ins; cell cycle phase classification from gene expression data; data exploration to identify putative subpopulations; and finally, HVG and marker gene identification to prioritize interesting genes. The application of different steps in the workflow will be demonstrated on several public scRNA-seq datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms (
                <xref ref-type="bibr" rid="ref-7">Buettner 
                    <italic toggle="yes">et al.</italic>, 2015</xref>; 
                <xref ref-type="bibr" rid="ref-21">Kolodziejczyk 
                    <italic toggle="yes">et al.</italic>, 2015</xref>; 
                <xref ref-type="bibr" rid="ref-48">Wilson 
                    <italic toggle="yes">et al.</italic>, 2015</xref>; 
                <xref ref-type="bibr" rid="ref-49">Zeisel 
                    <italic toggle="yes">et al.</italic>, 2015</xref>). The aim is to provide a variety of modular usage examples that can be applied to construct custom analysis pipelines.</p>
        </sec>
        <sec>
            <title>Analysis of haematopoietic stem cells</title>
            <sec>
                <title>Overview</title>
                <p>To introduce most of the concepts of scRNA-seq data analysis, we use a relatively simple dataset from a study of haematopoietic stem cells (HSCs) (
                    <xref ref-type="bibr" rid="ref-48">Wilson 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). Single mouse HSCs were isolated into microtiter plates and libraries were prepared for 96 cells using the Smart-seq2 protocol. A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell&#x2019;s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. Counts for all genes/transcripts in each cell were obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file under the accession number GSE61533 (
                    <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61533">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61533</ext-link>).</p>
                <p>For simplicity, we forgo a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. These steps have been described in some detail elsewhere (
                    <xref ref-type="bibr" rid="ref-8">Chen 
                        <italic toggle="yes">et al.</italic>, 2016</xref>; 
                    <xref ref-type="bibr" rid="ref-27">Love 
                        <italic toggle="yes">et al.</italic>, 2015</xref>), and are largely the same for bulk and single-cell data. The only additional consideration is that the spike-in information must be included in the pipeline. Typically, spike-in sequences can be included as additional FASTA files during genome index building prior to alignment, while genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting. For users favouring an R-based approach to read alignment and counting, we suggest using the methods in the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/Rsubread">Rsubread</ext-link>
                    </italic> package (
                    <xref ref-type="bibr" rid="ref-25">Liao 
                        <italic toggle="yes">et al.</italic>, 2013</xref>; 
                    <xref ref-type="bibr" rid="ref-26">Liao 
                        <italic toggle="yes">et al.</italic>, 2014</xref>). Alternatively, rapid quantification of expression with alignment-free methods such as 
                    <italic toggle="yes">kallisto</italic> (
                    <xref ref-type="bibr" rid="ref-5">Bray 
                        <italic toggle="yes">et al.</italic>, 2016</xref>) or 
                    <italic toggle="yes">Salmon</italic> (
                    <xref ref-type="bibr" rid="ref-35">Patro 
                        <italic toggle="yes">et al.</italic>, 2015</xref>) can be performed using the functions 
                    <monospace>runKallisto</monospace> and 
                    <monospace>runSalmon</monospace> in the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scater">scater</ext-link>
                    </italic> package.</p>
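                <p>As a rough illustration of the concatenation step mentioned above, the following base R sketch appends spike-in intervals to gene intervals to form a single GTF file. The file names and the two toy records (including the feature name <monospace>SPIKE1</monospace> and all coordinates) are hypothetical placeholders written to a temporary directory; they are not files distributed with this workflow, and real annotation would be obtained from the genome and spike-in providers.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical toy annotation files, written to a temporary directory.
tmp &lt;- tempdir()
gene.gtf &lt;- file.path(tmp, "genes.gtf")
spike.gtf &lt;- file.path(tmp, "spikes.gtf")
writeLines('chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GeneA";', gene.gtf)
writeLines('SPIKE1\ttoy\texon\t1\t500\t.\t+\t.\tgene_id "SPIKE1";', spike.gtf)

# Concatenate both sets of intervals into a single GTF file for counting.
combined.gtf &lt;- file.path(tmp, "combined.gtf")
writeLines(c(readLines(gene.gtf), readLines(spike.gtf)), combined.gtf)

# Number of records in the combined annotation.
length(readLines(combined.gtf))</styled-content>
                    </preformat>
                </p>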
            </sec>
            <sec>
                <title>Count loading</title>
                <p>The first task is to load the count matrix into memory. In this case, some work is required to retrieve the data from the Gzip-compressed Excel format. Each row of the matrix represents an endogenous gene or a spike-in transcript, and each column represents a single HSC. For convenience, the counts for spike-in transcripts and endogenous genes are stored in a 
                    <monospace>SCESet</monospace> object from the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scater">scater</ext-link>
                    </italic> package (
                    <xref ref-type="bibr" rid="ref-34">McCarthy 
                        <italic toggle="yes">et al.</italic>, 2016</xref>).</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(R.utils)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">gunzip</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"GSE61533_HTSEQ_count_results.xls.gz"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">remove=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">overwrite=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(gdata)</styled-content>

                        <styled-content style="font-size:15px;">all.counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">read.xls</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">&#x2019;GSE61533_HTSEQ_count_results.xls&#x2019;</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sheet=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">header=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">row.names=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(scater)</styled-content>

                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">newSCESet</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">countData=</styled-content>
                        <styled-content style="font-size:15px;">all.counts)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">dim</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## Features Samples
##    38498      96</styled-content>
                    </preformat>
                </p>
                <p>We identify the rows corresponding to ERCC spike-ins and mitochondrial genes. For this dataset, this information can be easily extracted from the row names. In general, though, identifying mitochondrial genes from standard identifiers like Ensembl requires extra annotation (this will be discussed later in more detail).</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">is.spike &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">grepl</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"^ERCC"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>

                        <styled-content style="font-size:15px;">is.mito &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">grepl</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"^mt-"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>
                    </preformat>
                </p>
                <p>For each cell, we calculate quality control metrics such as the total number of counts or the proportion of counts in mitochondrial genes or spike-in transcripts. These are stored in the 
                    <monospace>pData</monospace> of the 
                    <monospace>SCESet</monospace> for future reference.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">calculateQCMetrics</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">feature_controls=list</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">ERCC=</styled-content>
                        <styled-content style="font-size:15px;">is.spike,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Mt=</styled-content>
                        <styled-content style="font-size:15px;">is.mito))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">head</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">colnames</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">pData</styled-content>
                        <styled-content style="font-size:15px;">(sce)))</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] "total_counts"          "log10_total_counts"       "filter_on_total_counts"
## [4] "total_features"        "log10_total_features"     "filter_on_total_features"</styled-content>
                    </preformat>
                </p>
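                <p>To make these metrics concrete, the following base R sketch computes the total count and the percentage of counts assigned to each control set for a small made-up matrix. The object <monospace>toy.counts</monospace> and its row and column names are hypothetical. The code only mirrors the definitions given in the text and should not be read as the internal implementation of <monospace>calculateQCMetrics</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Made-up counts: 4 features (rows) by 3 cells (columns), filled by column.
toy.counts &lt;- matrix(c(10, 0, 5, 5,
                       20, 10, 0, 10,
                       0, 2, 1, 1), nrow=4,
    dimnames=list(c("GeneA", "GeneB", "mt-Nd1", "ERCC-00002"),
                  paste0("cell", 1:3)))

# Identify control features from the row names, as done for the real data.
toy.spike &lt;- grepl("^ERCC", rownames(toy.counts))
toy.mito &lt;- grepl("^mt-", rownames(toy.counts))

# Total counts per cell, and percentage of counts in each control set.
total &lt;- colSums(toy.counts)
pct.spike &lt;- colSums(toy.counts[toy.spike,,drop=FALSE]) / total * 100
pct.mito &lt;- colSums(toy.counts[toy.mito,,drop=FALSE]) / total * 100</styled-content>
                    </preformat>
                </p>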
                <p>We need to explicitly indicate that the ERCC set is, in fact, a spike-in set. This is necessary as spike-ins require special treatment in some downstream steps such as variance estimation and normalization. We do this by supplying the name of the spike-in set to 
                    <monospace>isSpike</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(scran)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce) &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"ERCC"</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Quality control on the cells</title>
                <p>Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. Two common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the sum of counts across all features, i.e., genes and spike-in transcripts. Cells with relatively small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. The distributions of both of these metrics are shown in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>.</p>
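                <p>The two definitions above reduce to column-wise summaries of the count matrix, as in this minimal base R sketch. The matrix <monospace>toy.counts</monospace> is made up for illustration, with features in rows and cells in columns; in the workflow itself, the corresponding values are read from <monospace>sce$total_counts</monospace> and <monospace>sce$total_features</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Made-up counts: 4 features (rows) by 3 cells (columns), filled by column.
toy.counts &lt;- matrix(c(10, 0, 5, 5,
                       20, 10, 0, 10,
                       0, 2, 0, 1), nrow=4)

# Library size: sum of counts across all features in each cell.
lib.sizes &lt;- colSums(toy.counts)

# Number of expressed features: features with non-zero counts in each cell.
n.features &lt;- colSums(toy.counts > 0)</styled-content>
                    </preformat>
                </p>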
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Histograms of library sizes (left) and number of expressed genes (right) for all cells in the HSC dataset.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure1.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">par</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">mfrow=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_counts/</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1e6</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Library sizes (millions)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_features,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of expressed genes"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>Picking a threshold for these metrics is not straightforward as their absolute values depend on the protocol and biological system. For example, sequencing to greater depth will lead to more reads, regardless of the quality of the cells. To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells. We remove cells with log-library sizes that are more than 3 median absolute deviations (MADs) below the median log-library size. (A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median.) We also remove cells where the log-transformed number of expressed genes is more than 3 MADs below the median.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">libsize.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_counts,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"lower"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">feature.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_features,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"lower"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
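                <p>To make the threshold concrete, the same rule can be worked through by hand. The following is a minimal sketch in base R, assuming the conventional definition in which the MAD is scaled by 1.4826 (the default in <monospace>mad</monospace>); whether <monospace>isOutlier</monospace> matches this in every detail is an assumption. The library sizes are hypothetical toy values rather than values from the HSC dataset, and the variable names are ours. The base of the logarithm does not affect which cells are flagged, as the median and the MAD are rescaled by the same factor.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Toy library sizes for 8 cells (hypothetical values); the last cell is small.
lib.sizes &lt;- c(1.2e6, 0.9e6, 1.1e6, 1.5e6, 1.0e6, 1.3e6, 0.8e6, 2.0e4)

# Work on the log scale, as described in the text.
log.lib &lt;- log10(lib.sizes)

# Lower limit: 3 MADs below the median (mad() applies the 1.4826 scaling).
lower.limit &lt;- median(log.lib) - 3 * mad(log.lib)

# Flag cells whose log-library size falls below the limit.
manual.drop &lt;- log.lib &lt; lower.limit
which(manual.drop)</styled-content>
                    </preformat>
                </p>
                <p>In this toy example, only the cell with the very small library is flagged; the remaining cells vary by less than two-fold and lie well within 3 MADs of the median.</p>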
                <p>Another measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells (
                    <xref ref-type="bibr" rid="ref-14">Ilicic 
                        <italic toggle="yes">et al.</italic>, 2016</xref>; 
                    <xref ref-type="bibr" rid="ref-16">Islam 
                        <italic toggle="yes">et al.</italic>, 2014</xref>), possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells. Similar reasoning applies to the proportion of reads mapped to spike-in transcripts. The quantity of spike-in RNA added to each cell should be constant, which means that the proportion should increase upon loss of endogenous RNA in low-quality cells. The distributions of mitochondrial and spike-in proportions across all cells are shown in 
                    <xref ref-type="fig" rid="f2">Figure 2</xref>.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the HSC dataset.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure2.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">par</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">mfrow=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_Mt,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Mitochondrial proportion (%)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_ERCC,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ERCC proportion (%)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
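                <p>The proportions themselves are simple ratios of column sums. The sketch below computes them from a toy count matrix in base R; the gene names, the <monospace>^mt-</monospace> and <monospace>^ERCC</monospace> patterns and all counts are hypothetical and serve only to illustrate the arithmetic, not to reproduce how the metrics stored in <monospace>sce</monospace> were obtained.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Toy count matrix: 5 features (rows) by 3 cells (columns); hypothetical values.
toy.counts &lt;- matrix(c(10, 20, 5,  3, 2,
                       50, 40, 5,  4, 1,
                        1,  2, 1, 10, 6),
                     nrow=5,
                     dimnames=list(c("GeneA", "GeneB", "mt-Co1", "ERCC-1", "ERCC-2"),
                                   c("cell1", "cell2", "cell3")))

# Identify the control features by name (the naming patterns are assumptions).
is.mito &lt;- grepl("^mt-", rownames(toy.counts))
is.spike &lt;- grepl("^ERCC", rownames(toy.counts))

# Percentage of each cell's counts assigned to each control set.
total &lt;- colSums(toy.counts)
pct.mito &lt;- 100 * colSums(toy.counts[is.mito, , drop=FALSE]) / total
pct.spike &lt;- 100 * colSums(toy.counts[is.spike, , drop=FALSE]) / total
rbind(pct.mito, pct.spike)</styled-content>
                    </preformat>
                </p>
                <p>Here, the third cell has few endogenous counts, so its spike-in percentage is much larger than that of the other two cells. This mimics the loss of endogenous RNA described above.</p>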
                <p>Again, the ideal threshold for these proportions depends on the cell type and the experimental protocol. Cells with more mitochondria or more mitochondrial activity may naturally have larger mitochondrial proportions. Similarly, cells with more endogenous RNA or that are assayed with protocols using less spike-in RNA will have lower spike-in proportions. If we assume that most cells in the dataset are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. As with the library sizes, we use the MAD-based definition of outliers, here removing cells with proportions that are more than 3 MADs above the median.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">mito.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_Mt,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"higher"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">spike.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_ERCC,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"higher"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (&gt; 10%) may be indicative of an overall issue with data quality. It may also reflect genuine biology in extreme cases (e.g., low numbers of expressed genes in erythrocytes) for which the filters described here are inappropriate.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[,!(libsize.drop | feature.drop | mito.drop | spike.drop)]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">data.frame</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">ByLibSize=sum</styled-content>
                        <styled-content style="font-size:15px;">(libsize.drop),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ByFeature=sum</styled-content>
                        <styled-content style="font-size:15px;">(feature.drop),</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">ByMito=sum</styled-content>
                        <styled-content style="font-size:15px;">(mito.drop),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">BySpike=sum</styled-content>
                        <styled-content style="font-size:15px;">(spike.drop),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Remaining=ncol</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##         ByLibSize ByFeature ByMito BySpike Remaining
## Samples         2         2      6       3        86</styled-content>
                    </preformat>
                </p>
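                <p>When reading such a table, note that a single cell may fail more than one filter, so the per-filter counts need not sum to the number of cells that are actually removed. The following sketch illustrates this with two hypothetical logical vectors for 10 toy cells (the names <monospace>lib.flag</monospace>, <monospace>mito.flag</monospace> and <monospace>discard</monospace> are ours), and also computes the fraction of discarded cells for comparison against the 10% guideline.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical filter results for 10 cells (TRUE = flagged for removal).
lib.flag &lt;- rep(c(TRUE, FALSE), c(1, 9))   # cell 1 fails the library size filter
mito.flag &lt;- rep(c(TRUE, FALSE), c(2, 8))  # cells 1 and 2 fail the mitochondrial filter

# A cell is discarded if it fails any filter.
discard &lt;- lib.flag | mito.flag

# Per-filter counts versus the number of distinct cells removed.
c(ByLibSize=sum(lib.flag), ByMito=sum(mito.flag), Removed=sum(discard))

# Fraction of cells removed, and whether it exceeds 10%.
frac.removed &lt;- mean(discard)
frac.removed &gt; 0.1</styled-content>
                    </preformat>
                </p>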
                <p>An alternative approach to quality control is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features and the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. In 
                    <xref ref-type="fig" rid="f3">Figure 3</xref>, no obvious outliers are present, which is consistent with the removal of suspect cells in the preceding quality control steps.</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>Figure 3. </label>
                    <caption>
                        <title>PCA plot for cells in the HSC dataset, constructed using quality metrics.</title>
                        <p>The first and second components are shown on each axis, along with the percentage of total variance explained by each component. Bars represent the coordinates of the cells on each axis.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure3.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">fontsize &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">theme</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">axis.text=element_text</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">size=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">12</styled-content>
                        <styled-content style="font-size:15px;">),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">axis.title=element_text</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">size=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">plotPCA</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pca_data_input=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"pdata"</styled-content>
                        <styled-content style="font-size:15px;">) + fontsize</styled-content>
                    </preformat>
                </p>
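                <p>The idea behind a PCA on quality metrics can be sketched with base R alone. The example below applies <monospace>prcomp</monospace> to a simulated matrix of four metrics in which one cell is deliberately made aberrant. All values and column names are hypothetical, and the exact set of metrics and any preprocessing used internally by <monospace>plotPCA</monospace> are not reproduced here.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Simulate quality metrics for 50 cells (hypothetical values).
set.seed(100)
n.cells &lt;- 50
metrics &lt;- cbind(
    log.total=rnorm(n.cells, mean=6, sd=0.1),
    log.features=rnorm(n.cells, mean=3.8, sd=0.05),
    pct.mito=rnorm(n.cells, mean=5, sd=1),
    pct.spike=rnorm(n.cells, mean=10, sd=2))

# Make the last cell aberrant: small library, few genes, high control proportions.
metrics[n.cells,] &lt;- c(4.5, 3.0, 30, 60)

# Centre and scale each metric so that no single metric dominates.
pca &lt;- prcomp(metrics, center=TRUE, scale.=TRUE)
pct.var &lt;- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[,1], pca$x[,2], pch=16,
     xlab=paste0("PC1 (", pct.var[1], "%)"),
     ylab=paste0("PC2 (", pct.var[2], "%)"))

# The cell furthest from the origin along the first component.
most.extreme &lt;- which.max(abs(pca$x[,1]))
most.extreme</styled-content>
                    </preformat>
                </p>
                <p>Scaling is used in this sketch because the metrics are on different scales (log-counts and percentages); without it, the metric with the largest variance would dominate the first component.</p>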
                <p>Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (
                    <xref ref-type="bibr" rid="ref-14">Ilicic 
                        <italic toggle="yes">et al.</italic>, 2016</xref>). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Thus, for this workflow, we will use the simple approach whereby each quality metric is considered separately. Users interested in the more sophisticated approaches are referred to the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scater">scater</ext-link>
                    </italic> and 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/cellity">cellity</ext-link>
                    </italic> packages.</p>
            </sec>
            <sec>
                <title>Classification of cell cycle phase</title>
                <p>We use the prediction method described by 
                    <xref ref-type="bibr" rid="ref-42">Scialdone 
                        <italic toggle="yes">et al.</italic> (2015)</xref> to classify cells into cell cycle phases based on the gene expression data. Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the 
                    <monospace>cyclone</monospace> function using a pre-trained set of marker pairs for mouse data. The result of phase assignment for each cell in the HSC dataset is shown in 
                    <xref ref-type="fig" rid="f4">Figure 4</xref>. (Some additional work is necessary to match the gene symbols in the data to the Ensembl annotation in the pre-trained marker set.)</p>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>Figure 4. </label>
                    <caption>
                        <title>Cell cycle phase scores from applying the pair-based classifier on the HSC dataset, where each point represents a cell.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure4.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">mm.pairs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">readRDS</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">system.file</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"exdata"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905">"mouse_cycle_markers.rds"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">package=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"scran"</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">library</styled-content>
                        <styled-content style="font-size:15px;">(org.Mm.eg.db)</styled-content>

                        <styled-content style="font-size:15px;">anno &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">select</styled-content>
                        <styled-content style="font-size:15px;">(org.Mm.eg.db,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">keys=rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">keytype=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"SYMBOL"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">columns=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"ENSEMBL"</styled-content>
                        <styled-content style="font-size:15px;">)
ensembl &lt;- anno$ENSEMBL[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">match</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce), anno$SYMBOL)]
assignments &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">cyclone</styled-content>
                        <styled-content style="font-size:15px;">(sce, mm.pairs,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">gene.names=</styled-content>
                        <styled-content style="font-size:15px;">ensembl)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">plot</styled-content>
                        <styled-content style="font-size:15px;">(assignments$score$G1, assignments$score$G2M,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G1 score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G2/M score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">16</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
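                <p>The sign comparison underlying the pair-based classifier can be illustrated with a toy example. The sketch below is not the <monospace>cyclone</monospace> implementation, whose scoring involves further steps that are not shown here; it only computes, for one cell and a hypothetical set of marker pairs, the fraction of pairs in which the first gene is more highly expressed than the second. All gene names and expression values are invented for illustration.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical marker pairs for one phase: in that phase, 'first' is
# expected to be more highly expressed than 'second'.
marker.pairs &lt;- data.frame(first=c("GeneA", "GeneC", "GeneE"),
                           second=c("GeneB", "GeneD", "GeneF"),
                           stringsAsFactors=FALSE)

# Toy expression values for a single cell.
cell &lt;- c(GeneA=10, GeneB=2, GeneC=5, GeneD=1, GeneE=0, GeneF=7)

# Sign of the difference in expression for each pair.
signs &lt;- sign(cell[marker.pairs$first] - cell[marker.pairs$second])

# Fraction of pairs whose sign is consistent with the phase.
frac.consistent &lt;- mean(signs &gt; 0)
frac.consistent

# Rescaling the cell's expression values leaves the signs unchanged.
scaled &lt;- 3 * cell
identical(sign(scaled[marker.pairs$first] - scaled[marker.pairs$second]), signs)</styled-content>
                    </preformat>
                </p>
                <p>As only the signs are used, the comparison within each cell is unaffected by cell-specific scaling of the expression values.</p>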
                <p>Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. Here, the vast majority of cells are classified as being in G1 phase. We will focus on these cells in the downstream analysis. Cells in other phases are removed to avoid potential confounding effects from cell cycle-induced differences. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor in downstream analyses. This protects against cell cycle effects without discarding information.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[,assignments$phases==</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G1"</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>
                    </preformat>
                </p>
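                <p>The assignment rule described above can be written as a short function. This is a sketch of the rule as stated in the text rather than the code used inside <monospace>cyclone</monospace>; the function name <monospace>assign.phase</monospace> is hypothetical, and the handling of exact ties between the two scores (assigned to G1 here) is an assumption, as the text does not specify it.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Assign a phase from G1 and G2/M scores (hypothetical helper function).
assign.phase &lt;- function(g1, g2m, threshold=0.5) {
    # The larger score wins; exact ties go to G1 (an assumption).
    phase &lt;- ifelse(g1 &gt;= g2m, "G1", "G2M")
    # S phase if neither score is above the threshold.
    phase[g1 &lt;= threshold &amp; g2m &lt;= threshold] &lt;- "S"
    phase
}

# Four toy cells with hypothetical scores.
toy.phases &lt;- assign.phase(g1=c(0.9, 0.2, 0.3, 0.7), g2m=c(0.1, 0.8, 0.4, 0.6))
toy.phases

# Number of cells in each phase.
table(toy.phases)</styled-content>
                    </preformat>
                </p>
                <p>A vector of this form could also be supplied as a blocking factor in downstream analyses, as an alternative to subsetting on the G1 phase.</p>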
                <p>Pre-trained classifiers are available in 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scran">scran</ext-link>
                    </italic> for human and mouse data. While the mouse classifier used here was trained on data from embryonic stem cells, it is still accurate for other cell types (
                    <xref ref-type="bibr" rid="ref-42">Scialdone 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). This may be due to the conservation of the transcriptional program associated with the cell cycle (
                    <xref ref-type="bibr" rid="ref-3">Bertoli 
                        <italic toggle="yes">et al.</italic>, 2013</xref>; 
                    <xref ref-type="bibr" rid="ref-10">Conboy 
                        <italic toggle="yes">et al.</italic>, 2007</xref>). The pair-based method is also a non-parametric procedure that is robust to most technical differences between datasets. However, it will be less accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol. In such cases, users can construct a custom classifier from their own training data using the 
                    <monospace>sandbag</monospace> function. This will also be necessary for other model organisms where pre-trained classifiers are not available.</p>
            </sec>
            <sec>
                <title>Filtering out low-abundance genes</title>
                <p>Low-abundance genes are problematic as zero or near-zero counts do not contain enough information for reliable statistical inference (
                    <xref ref-type="bibr" rid="ref-4">Bourgon 
                        <italic toggle="yes">et al.</italic>, 2010</xref>). In addition, the discreteness of the counts may interfere with downstream statistical procedures, e.g., by compromising the accuracy of continuous approximations. Here, low-abundance genes are defined as those with an average count below a filter threshold of 1. These genes are likely to be dominated by drop-out events (
                    <xref ref-type="bibr" rid="ref-6">Brennecke 
                        <italic toggle="yes">et al.</italic>, 2013</xref>), which limits their usefulness in later analyses. Removal of these genes mitigates discreteness and reduces the amount of computational work without major loss of information.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">ave.counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rowMeans</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">counts</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>

                        <styled-content style="font-size:15px;">keep &lt;- ave.counts &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">sum</styled-content>
                        <styled-content style="font-size:15px;">(keep)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] 13965</styled-content>
                    </preformat>
                </p>
                <p>To check whether the chosen threshold is suitable, we examine the distribution of log-means across all genes (
                    <xref ref-type="fig" rid="f5">Figure 5</xref>). The peak represents the bulk of moderately expressed genes while the rectangular component corresponds to lowly expressed genes. The filter threshold should cut the distribution at some point along the rectangular component to remove the majority of low-abundance genes.</p>
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>Figure 5. </label>
                    <caption>
                        <title>Histogram of log-average counts for all genes in the HSC dataset.</title>
                        <p>The filter threshold is represented by the blue line.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure5.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">log10</styled-content>
                        <styled-content style="font-size:15px;">(ave.counts),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">100</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">xlab=expression</styled-content>
                        <styled-content style="font-size:15px;">(Log[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">]~</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"average count"</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">abline</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">v=log10</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"blue"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">lwd=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">lty=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>We also look at the identities of the most highly expressed genes (
                    <xref ref-type="fig" rid="f6">Figure 6</xref>). This set should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes is indicative of suboptimal alignment.</p>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>Figure 6. </label>
                    <caption>
                        <title>Percentage of total counts assigned to the 50 most abundant features in the HSC dataset.</title>
                        <p>For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure6.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">plotQC</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type =</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"highest-expression"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">n=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">50</styled-content>
                        <styled-content style="font-size:15px;">) + fontsize</styled-content>
                    </preformat>
                </p>
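                <p>As a rough illustration of the quantity being plotted, the percentage of each cell's total count assigned to each feature can be computed directly from a count matrix. The sketch below uses a small made-up matrix (
                    <monospace>toy</monospace>, a hypothetical name) rather than the HSC data, and is not intended to reproduce the internals of 
                    <monospace>plotQC</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Made-up counts: 4 features (rows) by 3 cells (columns), filled column-wise.</styled-content>
                        <styled-content style="font-size:15px;">toy &lt;- matrix(c(50, 30, 15, 5, 40, 40, 10, 10, 60, 20, 20, 0), nrow=4,</styled-content>
                        <styled-content style="font-size:15px;">    dimnames=list(paste0("gene", 1:4), paste0("cell", 1:3)))</styled-content>

                        <styled-content style="font-size:15px;"># Percentage of each cell's total count that is assigned to each feature.</styled-content>
                        <styled-content style="font-size:15px;">pct &lt;- t(t(toy) / colSums(toy)) * 100</styled-content>

                        <styled-content style="font-size:15px;"># Average percentage across cells, sorted from most to least abundant.</styled-content>
                        <styled-content style="font-size:15px;">ave.pct &lt;- sort(rowMeans(pct), decreasing=TRUE)</styled-content>
                        <styled-content style="font-size:15px;">head(ave.pct, 2)</styled-content>
                    </preformat>
                </p>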
                <p>An alternative approach to gene filtering is to select genes that have non-zero counts in at least 
                    <italic toggle="yes">n</italic> cells. This provides more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells. (The exception is for studies involving rare cells where the outliers may be biologically relevant.) An example of this filtering approach is shown below for 
                    <italic toggle="yes">n</italic> set to 10, though smaller values may be necessary to retain genes expressed in rare cell types.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">numcells &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nexprs</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">byrow=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">alt.keep &lt;- numcells &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">sum</styled-content>
                        <styled-content style="font-size:15px;">(alt.keep)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] 11988</styled-content>
                    </preformat>
                </p>
                <p>The relationship between the number of expressing cells and the mean is shown in 
                    <xref ref-type="fig" rid="f7">Figure 7</xref>. The two statistics tend to be well-correlated so filtering on either should give roughly similar results.</p>
                <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                    <label>Figure 7. </label>
                    <caption>
                        <title>Number of expressing cells against the log-mean expression for each gene in the HSC dataset.</title>
                        <p>Spike-in transcripts are highlighted in red.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure7.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">smoothScatter</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">log10</styled-content>
                        <styled-content style="font-size:15px;">(ave.counts), numcells,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=expression</styled-content>
                        <styled-content style="font-size:15px;">(Log[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">]~</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"average count"</styled-content>
                        <styled-content style="font-size:15px;">),</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;"> ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of expressing cells"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">is.ercc &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ERCC"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">points</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">log10</styled-content>
                        <styled-content style="font-size:15px;">(ave.counts[is.ercc]), numcells[is.ercc],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"red"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cex=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.5</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>In general, we prefer the mean-based filter as it tends to be less aggressive. A gene will be retained as long as it has sufficient expression in any subset of cells. Genes expressed in fewer cells require higher levels of expression in those cells to be retained, but this is not undesirable as it avoids selecting uninformative genes (with low expression in few cells) that contribute little to downstream analyses, e.g., HVG detection or clustering. In contrast, the &#x201c;at least 
                    <italic toggle="yes">n</italic>&#x201d; filter depends heavily on the choice of 
                    <italic toggle="yes">n</italic>. With 
                    <italic toggle="yes">n</italic> = 10, a gene expressed in a subset of 9 cells would be filtered out, regardless of the level of expression in those cells. This may result in the failure to detect rare subpopulations that are present in fewer than 
                    <italic toggle="yes">n</italic> cells. While the mean-based filter will retain more outlier-driven genes, this can be handled by choosing methods that are robust to outliers in the downstream analyses.</p>
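                <p>The difference between the two filters can be seen in a minimal sketch with two hypothetical genes in 100 cells. The counts and the names 
                    <monospace>rare.high</monospace> and 
                    <monospace>broad.low</monospace> are invented for illustration; the thresholds (mean of 1, 
                    <italic toggle="yes">n</italic> of 10) are those used above.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">ncells &lt;- 100</styled-content>
                        <styled-content style="font-size:15px;">rare.high &lt;- c(rep(20, 9), rep(0, ncells - 9)) # strong expression in 9 cells</styled-content>
                        <styled-content style="font-size:15px;">broad.low &lt;- c(rep(1, 20), rep(0, ncells - 20)) # weak expression in 20 cells</styled-content>
                        <styled-content style="font-size:15px;">toy &lt;- rbind(rare.high, broad.low)</styled-content>

                        <styled-content style="font-size:15px;">mean.keep &lt;- rowMeans(toy) &gt;= 1 # mean-based filter</styled-content>
                        <styled-content style="font-size:15px;">n.keep &lt;- rowSums(toy &gt; 0) &gt;= 10 # "at least n" filter</styled-content>
                        <styled-content style="font-size:15px;">data.frame(mean.keep, n.keep)</styled-content>
                    </preformat>
                </p>
                <p>Here, the mean-based filter retains the gene that is strongly expressed in 9 cells (mean of 1.8) and discards the weakly expressed gene (mean of 0.2), whereas the “at least 
                    <italic toggle="yes">n</italic>” filter does the opposite.</p>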
                <p>Thus, we apply the mean-based filter to the data by subsetting the 
                    <monospace>SCESet</monospace> object as shown below. This removes all rows corresponding to endogenous genes or spike-in transcripts with abundances below the specified threshold.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[keep,]</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Normalization of cell-specific biases</title>
                <p>
                    <italic toggle="yes">
                        <bold>Using the deconvolution method to deal with zero counts.</bold>
                    </italic> Read counts are subject to differences in capture efficiency and sequencing depth between cells (
                    <xref ref-type="bibr" rid="ref-43">Stegle 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, &#x201c;size factors&#x201d; are calculated that represent the extent to which counts should be scaled in each library.</p>
                <p>Size factors can be computed with several different approaches, e.g., using the 
                    <monospace>estimateSizeFactorsFromMatrix</monospace> function in the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/DESeq2">DESeq2</ext-link>
                    </italic> package (
                    <xref ref-type="bibr" rid="ref-1">Anders &amp; Huber, 2010</xref>; 
                    <xref ref-type="bibr" rid="ref-28">Love 
                        <italic toggle="yes">et al.</italic>, 2014</xref>), or with the 
                    <monospace>calcNormFactors</monospace> function (
                    <xref ref-type="bibr" rid="ref-41">Robinson &amp; Oshlack, 2010</xref>) in the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/edgeR">edgeR</ext-link>
                    </italic> package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (
                    <xref ref-type="bibr" rid="ref-29">Lun 
                        <italic toggle="yes">et al.</italic>, 2016</xref>). Pool-based size factors are then &#x201c;deconvolved&#x201d; into cell-based factors for cell-specific normalization.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">computeSumFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sizes=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">40</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">60</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">80</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">summary</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">sizeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.</styled-content>

                        <styled-content style="font-size:15px;">##   0.4161   0.8055   0.9434   1.0000   1.1890   1.8410</styled-content>
                    </preformat>
                </p>
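                <p>To give some intuition for why pooling helps, the sketch below simulates sparse counts and compares the proportion of zeros before and after summing counts across groups of cells. The Poisson mean of 0.5 and the pool size of 20 are arbitrary choices for illustration. Only the pooling step is shown; the subsequent deconvolution into cell-based factors performed by 
                    <monospace>computeSumFactors</monospace> is not reproduced here.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">set.seed(100)</styled-content>
                        <styled-content style="font-size:15px;">ngenes &lt;- 1000</styled-content>
                        <styled-content style="font-size:15px;">ncells &lt;- 40</styled-content>

                        <styled-content style="font-size:15px;"># Simulated low-abundance counts with many zeros (assumed Poisson, mean 0.5).</styled-content>
                        <styled-content style="font-size:15px;">toy &lt;- matrix(rpois(ngenes * ncells, lambda=0.5), nrow=ngenes)</styled-content>
                        <styled-content style="font-size:15px;">mean(toy == 0) # proportion of zeros in the per-cell counts</styled-content>

                        <styled-content style="font-size:15px;"># Sum the counts across two pools of 20 cells each.</styled-content>
                        <styled-content style="font-size:15px;">pool.id &lt;- rep(1:2, each=20)</styled-content>
                        <styled-content style="font-size:15px;">pooled &lt;- t(rowsum(t(toy), pool.id))</styled-content>
                        <styled-content style="font-size:15px;">mean(pooled == 0) # proportion of zeros in the pooled counts</styled-content>
                    </preformat>
                </p>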
                <p>In this case, the size factors are tightly correlated with the library sizes for all cells (
                    <xref ref-type="fig" rid="f8">Figure 8</xref>). This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist within a homogeneous population of cells.</p>
                <fig fig-type="figure" id="f8" orientation="portrait" position="float">
                    <label>Figure 8. </label>
                    <caption>
                        <title>Size factors from deconvolution, plotted against library sizes for all cells in the HSC dataset.</title>
                        <p>Axes are shown on a log-scale.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure8.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">plot</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">sizeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce), sce$total_counts/</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1e6</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"xy"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Library size (millions)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Size factor"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>
                    <italic toggle="yes">
                        <bold>Computing separate size factors for spike-in transcripts.</bold>
                    </italic> Size factors computed from the counts for endogenous genes are usually not appropriate for normalizing the counts for spike-in transcripts. Consider an experiment without library quantification, i.e., the amount of cDNA from each library is 
                    <italic toggle="yes">not</italic> equalized prior to pooling and multiplexed sequencing. Here, cells containing more RNA have greater counts for endogenous genes and thus larger size factors to scale down those counts. However, the same amount of spike-in RNA is added to each cell during library preparation. This means that the counts for spike-in transcripts are not subject to the effects of RNA content. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification of expression. Similar reasoning applies in cases where library quantification is performed. For a constant total amount of cDNA, any increases in endogenous RNA content will suppress the coverage of spike-in transcripts. As a result, the bias in the spike-in counts will be opposite to that captured by the gene-based size factor.</p>
                <p>To ensure normalization is performed correctly, we compute a separate set of size factors for the spike-in set. For each cell, the spike-in-specific size factor is defined as the total count across all transcripts in the spike-in set. This assumes that none of the spike-in transcripts are differentially expressed, which is reasonable given that the same amount and composition of spike-in RNA should have been added to each cell. (See below for a more detailed discussion on spike-in normalization.) These size factors are stored in a separate field of the 
                    <monospace>SCESet</monospace> object by setting 
                    <monospace>general.use=FALSE</monospace> in 
                    <monospace>computeSpikeFactors</monospace>. This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">computeSpikeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ERCC"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">general.use=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
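                <p>The definition above can be sketched in a few lines of base R. The matrix and the row names 
                    <monospace>spike1</monospace> and 
                    <monospace>spike2</monospace> are hypothetical, and the scaling of the totals to a mean of unity is a convention assumed here for illustration rather than a statement about the internals of 
                    <monospace>computeSpikeFactors</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Made-up counts: 2 spike-in transcripts and 2 genes (rows) in 3 cells (columns).</styled-content>
                        <styled-content style="font-size:15px;">toy &lt;- matrix(c(10, 20, 30, 5, 20, 40, 60, 10, 15, 30, 45, 8), nrow=4,</styled-content>
                        <styled-content style="font-size:15px;">    dimnames=list(c("spike1", "spike2", "gene1", "gene2"), paste0("cell", 1:3)))</styled-content>

                        <styled-content style="font-size:15px;"># Total spike-in count per cell, using only the spike-in rows.</styled-content>
                        <styled-content style="font-size:15px;">is.spike &lt;- grepl("^spike", rownames(toy))</styled-content>
                        <styled-content style="font-size:15px;">spike.totals &lt;- colSums(toy[is.spike, , drop=FALSE])</styled-content>

                        <styled-content style="font-size:15px;"># Scale so that the size factors have a mean of 1 across cells (assumed convention).</styled-content>
                        <styled-content style="font-size:15px;">spike.sf &lt;- spike.totals / mean(spike.totals)</styled-content>
                        <styled-content style="font-size:15px;">spike.sf</styled-content>
                    </preformat>
                </p>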
                <p>
                    <italic toggle="yes">
                        <bold>Applying the size factors to normalize gene expression.</bold>
                    </italic> The count data are used to compute normalized log-expression values for use in downstream analyses. Each value is defined as the log-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. Division by the size factor ensures that any cell-specific biases are removed. If spike-in-specific size factors are present in 
                    <monospace>sce</monospace>, they will be automatically applied to normalize the spike-in transcripts separately from the endogenous genes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">normalize</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
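                <p>The transformation can be sketched on a toy example. We assume a base-2 logarithm and that the prior count is added after division by the size factor; neither detail is specified above, so this should be read as one plausible form of the calculation rather than the exact behaviour of 
                    <monospace>normalize</monospace>. The counts and size factors are made up.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Made-up counts: cell2 has exactly twice the counts of cell1 for every gene.</styled-content>
                        <styled-content style="font-size:15px;">counts.toy &lt;- matrix(c(0, 10, 4, 0, 20, 8), nrow=3,</styled-content>
                        <styled-content style="font-size:15px;">    dimnames=list(paste0("gene", 1:3), c("cell1", "cell2")))</styled-content>

                        <styled-content style="font-size:15px;"># Hypothetical size factors (mean of 1) that capture the two-fold difference.</styled-content>
                        <styled-content style="font-size:15px;">sf &lt;- c(cell1=2/3, cell2=4/3)</styled-content>

                        <styled-content style="font-size:15px;"># Divide each column by its size factor, add a prior count of 1 and log-transform.</styled-content>
                        <styled-content style="font-size:15px;">norm.exprs &lt;- log2(t(t(counts.toy) / sf) + 1)</styled-content>
                        <styled-content style="font-size:15px;">norm.exprs</styled-content>
                    </preformat>
                </p>
                <p>After scaling, the two cells have the same normalized values for each gene, and zero counts are mapped to a log-expression of zero.</p>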
                <p>The log-transformation provides some measure of variance stabilization (
                    <xref ref-type="bibr" rid="ref-23">Law 
                        <italic toggle="yes">et al.</italic>, 2014</xref>), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an 
                    <monospace>exprs</monospace> matrix in addition to the other assay elements.</p>
            </sec>
            <sec>
                <title>Checking for important technical factors</title>
                <p>We check whether there are technical factors that contribute substantially to the heterogeneity of gene expression. If so, the factor may need to be regressed out to ensure that it does not inflate the variances or introduce spurious correlations. For this dataset, the simple experimental design means that there are no plate or batch effects to examine. Instead, we use the (log-transformed) total count for the spike-in transcripts as a proxy for the relative bias in each sample. This bias is purely technical in origin, given that the same amount of spike-in RNA should have been added to each cell. Thus, any association of gene expression with this factor is not biologically interesting and should be removed.</p>
                <p>For each gene, we calculate the percentage of the variance of the expression values that is explained by the spike-in totals (
                    <xref ref-type="fig" rid="f9">Figure 9</xref>). The percentages are generally small (1&#x2013;3%), indicating that the expression profiles of most genes are not strongly associated with this factor. This result is consistent with successful removal of cell-specific biases by scaling normalization. Thus, the spike-in total does not need to be explicitly modelled in our downstream analyses.</p>
                <fig fig-type="figure" id="f9" orientation="portrait" position="float">
                    <label>Figure 9. </label>
                    <caption>
                        <title>Density plot of the percentage of variance explained by the (log-transformed) total spike-in counts across all genes in the HSC dataset.</title>
                        <p>For each gene, the percentage of the variance of the normalized log-expression values across cells that is explained by each factor is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure9.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">plotExplanatoryVariables</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">variables=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"counts_feature_controls_ERCC"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#4F9905">"log10_counts_feature_controls_ERCC"</styled-content>
                        <styled-content style="font-size:15px;">)) + fontsize</styled-content>
                    </preformat>
                </p>
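                <p>The percentage of variance explained can be understood as the R-squared from a linear model fitted to the expression values of each gene against the factor of interest. The sketch below uses simulated values for two hypothetical genes; it illustrates the concept and is not claimed to match the exact computation in 
                    <monospace>plotExplanatoryVariables</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">set.seed(1)</styled-content>
                        <styled-content style="font-size:15px;">ncells &lt;- 50</styled-content>

                        <styled-content style="font-size:15px;"># Simulated factor, standing in for the log-transformed spike-in total.</styled-content>
                        <styled-content style="font-size:15px;">factor.x &lt;- rnorm(ncells)</styled-content>

                        <styled-content style="font-size:15px;"># Two hypothetical genes: one driven by the factor, one independent of it.</styled-content>
                        <styled-content style="font-size:15px;">expr &lt;- rbind(dependent=2 * factor.x + rnorm(ncells, sd=0.5),</styled-content>
                        <styled-content style="font-size:15px;">    independent=rnorm(ncells))</styled-content>

                        <styled-content style="font-size:15px;"># Percentage of variance explained = 100 * R-squared of the per-gene linear fit.</styled-content>
                        <styled-content style="font-size:15px;">pct.var &lt;- apply(expr, 1, function(y) summary(lm(y ~ factor.x))$r.squared * 100)</styled-content>
                        <styled-content style="font-size:15px;">pct.var</styled-content>
                    </preformat>
                </p>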
                <p>Note that the use of the spike-in total as an accurate proxy for the relative technical bias assumes that no library quantification is performed. Otherwise, the coverage of the spike-in transcripts would be dependent on the total amount of endogenous RNA in each cell. (Specifically, if the same amount of cDNA is used for sequencing per cell, any increase in the amount of endogenous RNA will suppress the coverage of the spike-in transcripts.) This means that the spike-in totals could be confounded with genuine biological effects associated with changes in RNA content.</p>
            </sec>
            <sec>
                <title>Identifying HVGs from the normalized log-expression</title>
                <p>We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. HVGs are then identified as those genes with the largest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation.</p>
                <p>Ideally, the technical component would be estimated by fitting a mean-variance trend to the spike-in transcripts using the 
                    <monospace>trendVar</monospace> function. Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, i.e., any variance in their counts should be technical in origin. Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene with the 
                    <monospace>decomposeVar</monospace> function.</p>
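                <p>The decomposition described above can be written compactly. This is a sketch in our own notation, which is not taken from the <italic toggle="yes">scran</italic> documentation: for gene <italic toggle="yes">g</italic> with mean log-expression &#x3BC;<sub>g</sub>, let <italic toggle="yes">f</italic> denote the fitted mean-variance trend.</p>
                <p>
                    <disp-formula><tex-math><![CDATA[
\hat{\sigma}^2_{\mathrm{tech},g} = f(\mu_g), \qquad \hat{\sigma}^2_{\mathrm{bio},g} = \hat{\sigma}^2_{\mathrm{total},g} - f(\mu_g) .
]]></tex-math></disp-formula>
                </p>
                <p>For instance, the values reported for <italic toggle="yes">Fos</italic> in the table of HVGs later in this section satisfy 20.167804 &#x2212; 7.880058 = 12.287746, i.e., total minus technical equals biological.</p>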
                <p>In practice, this strategy is compromised by the small number of spike-in transcripts, the uneven distribution of their abundances and (for low numbers of cells) the imprecision of their variance estimates. This makes it difficult to accurately fit a complex mean-dependent trend to the spike-in variances. An alternative approach is to fit the trend to the variance estimates of the endogenous genes, using the 
                    <monospace>use.spikes=FALSE</monospace> setting as shown below. This assumes that the majority of genes are not variably expressed, such that the technical component dominates the total variance for those genes. The fitted value of the trend is then used as an estimate of the technical component. It is also the only approach available if no spike-ins were added in the experiment.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">var.fit &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">trendVar</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">trend=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"loess"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">use.spikes=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">span=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">0.2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">var.out &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">decomposeVar</styled-content>
                        <styled-content style="font-size:15px;">(sce, var.fit)</styled-content>
                    </preformat>
                </p>
                <p>We assess the suitability of the trend fitted to the endogenous variances by examining whether it is consistent with the spike-in variances (
                    <xref ref-type="fig" rid="f10">Figure 10</xref>). The trend passes through or close to most of the spike-in variances, indicating that our assumption (that most genes have low levels of biological variability) is valid. This strategy exploits the large number of endogenous genes to obtain a stable trend, with the spike-in transcripts used as diagnostic features rather than in the trend fitting itself. However, if our assumption did 
                    <italic toggle="yes">not</italic> hold, we would instead fit the trend directly to the spike-in variances with the default 
                    <monospace>use.spikes=TRUE</monospace>. This sacrifices stability to reduce systematic errors in the estimate of the biological component for each gene. (In such cases, tinkering with the trend fitting parameters may yield a more stable curve &#x2013; see 
                    <monospace>?trendVar</monospace> for more details.)</p>
                <fig fig-type="figure" id="f10" orientation="portrait" position="float">
                    <label>Figure 10. </label>
                    <caption>
                        <title>Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression.</title>
                        <p>The blue line represents the mean-dependent trend fitted to the variances of the endogenous genes. Variance estimates for spike-in transcripts are highlighted in red.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure10.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">plot</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean, var.out$total,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">16</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">cex=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">0.6</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Mean log-expression"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Variance of log-expression"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">o &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">order</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">lines</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean[o], var.out$tech[o],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"dodgerblue"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">lwd=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">cur.spike &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">points</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean[cur.spike], var.out$total[cur.spike],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"red"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">16</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>HVGs are defined as genes with biological components that are significantly greater than zero at a false discovery rate (FDR) of 5%. These genes are interesting as they drive differences in the expression profiles between cells, and should be prioritized for further investigation. In addition, we only consider a gene to be an HVG if it has a biological component greater than or equal to 0.5. For transformed expression values on the log
                    <sub>2</sub> scale, this means that the average difference in true expression between any two cells will be at least 2-fold. (This reasoning assumes that the true log-expression values are Normally distributed with variance of 0.5. The root-mean-square of the difference between two values is treated as the average log
                    <sub>2</sub>-fold change between cells and is equal to unity.) We rank the results by the biological component to focus on genes with larger biological variability.</p>
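                <p>The arithmetic behind this threshold can be spelled out as follows. The symbols are introduced here for illustration only: let <italic toggle="yes">X</italic><sub>1</sub> and <italic toggle="yes">X</italic><sub>2</sub> be the true log<sub>2</sub>-expression values of a gene in two randomly chosen cells, assumed to be independent and Normally distributed with a common mean and variance &#x3C3;<sup>2</sup> = 0.5.</p>
                <p>
                    <disp-formula><tex-math><![CDATA[
\mathrm{Var}(X_1 - X_2) = 2\sigma^2 = 1, \qquad \sqrt{\mathrm{E}\left[(X_1 - X_2)^2\right]} = \sqrt{2\sigma^2} = 1, \qquad 2^{1} = 2 .
]]></tex-math></disp-formula>
                </p>
                <p>A biological component above 0.5 therefore corresponds to a root-mean-square difference above one log<sub>2</sub> unit, i.e., more than a 2-fold change under this reasoning.</p>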
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">hvg.out &lt;- var.out[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">which</styled-content>
                        <styled-content style="font-size:15px;">(var.out$FDR &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.05</styled-content> 
                        <styled-content style="font-size:15px;">&amp; var.out$bio &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.5</styled-content>
                        <styled-content style="font-size:15px;">),]
hvg.out &lt;- hvg.out[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">order</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out$bio,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">decreasing=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">TRUE</styled-content>
                        <styled-content style="font-size:15px;">),]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">nrow</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] 193</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">write.table</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"hsc_hvg.tsv"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;">hvg.out,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">sep=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"\t"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">quote=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">col.names=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">NA</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">head</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##              mean     total       bio      tech       p.value           FDR
## Fos      6.412282 20.167804 12.287746  7.880058  3.609804e-13  2.283693e-10
## Rgs1     5.214003 20.271925  9.430165 10.841761  3.065697e-06  5.019808e-04
## Dusp1    6.693026 16.074489  9.044983  7.029506  3.066936e-10  1.156266e-07
## H2-Aa    4.294426 19.390442  7.496497 11.893945  2.736909e-04  2.333494e-02
## Ppp1r15a 6.545438 14.964370  7.460786  7.503584  2.308822e-07  4.943721e-05
## Ctla2a   8.654347  9.471605  7.368337  2.103268  4.574748e-38  9.095906e-35</styled-content>
                    </preformat>
                </p>
                <p>We recommend checking the distribution of expression values for the top HVGs to ensure that the variance estimate is not being dominated by one or two outlier cells (
                    <xref ref-type="fig" rid="f11">Figure 11</xref>).</p>
                <fig fig-type="figure" id="f11" orientation="portrait" position="float">
                    <label>Figure 11. </label>
                    <caption>
                        <title>Violin plots of normalized log-expression values for the top 10 HVGs in the HSC dataset.</title>
                        <p>Each point represents the log-expression value in a single cell.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure11.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">plotExpression</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">rownames</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">1</styled-content>
                        <styled-content style="font-size:15px;">:</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">10</styled-content>
                        <styled-content style="font-size:15px;">]) + fontsize</styled-content>
                    </preformat>
                </p>
                <p>There are many other strategies for defining HVGs, e.g., by using the coefficient of variation (
                    <xref ref-type="bibr" rid="ref-6">Brennecke 
                        <italic toggle="yes">et al.</italic>, 2013</xref>; 
                    <xref ref-type="bibr" rid="ref-19">Kim 
                        <italic toggle="yes">et al.</italic>, 2015</xref>; 
                    <xref ref-type="bibr" rid="ref-21">Kolodziejczyk 
                        <italic toggle="yes">et al.</italic>, 2015</xref>), the dispersion parameter of the negative binomial distribution (
                    <xref ref-type="bibr" rid="ref-33">McCarthy 
                        <italic toggle="yes">et al.</italic>, 2012</xref>), or the proportion of total variability (
                    <xref ref-type="bibr" rid="ref-46">Vallejos 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). Some of these methods are available in 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scran">scran</ext-link>
                    </italic> &#x2013; for example, see 
                    <monospace>DM</monospace> or 
                    <monospace>technicalCV2</monospace> for calculations based on the coefficient of variation. Here, we use the variance of the log-expression values because the log-transformation protects against genes with strong expression in only one or two cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns.</p>
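                <p>The protective effect of the log-transformation can be seen in a small toy example. This is only a sketch in base R with made-up values; the helper functions <monospace>cv2</monospace> and <monospace>var.of.log</monospace> are hypothetical and do not reproduce how <monospace>DM</monospace> or <monospace>technicalCV2</monospace> compute their statistics.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
# Toy data (hypothetical values): expression of two genes across 100 cells.
outlier.gene <- c(rep(0, 99), 1000)          # strong expression in a single cell
bimodal.gene <- c(rep(0, 50), rep(100, 50))  # two equally sized groups of cells

# Squared coefficient of variation of the untransformed values.
cv2 <- function(x) var(x) / mean(x)^2

# Variance of log2-transformed values, with a pseudo-count of 1 to handle zeros.
var.of.log <- function(x) var(log2(x + 1))

stats <- rbind(
    CV2 = c(outlier = cv2(outlier.gene), bimodal = cv2(bimodal.gene)),
    log.var = c(outlier = var.of.log(outlier.gene), bimodal = var.of.log(bimodal.gene))
)
print(round(stats, 2))
```
]]></preformat>
                </p>
                <p>In this toy case the single-cell outlier gene has by far the larger squared coefficient of variation, whereas the gene that differs between two groups of cells has the larger variance of log-expression.</p>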
            </sec>
            <sec>
                <title>Identifying correlated gene pairs with Spearman&#x2019;s rho</title>
                <p>Another useful procedure is to identify the HVGs that are highly correlated with one another. This distinguishes between HVGs caused by random noise and those involved in driving systematic differences between subpopulations. Correlations between genes are quantified by computing Spearman&#x2019;s rho, which is rank-based and thus accommodates non-linear (monotonic) relationships in the expression values. Gene pairs with significantly large positive or negative values of rho are identified using the 
                    <monospace>correlatePairs</monospace> function. We only apply this function to the set of HVGs, because these genes have large biological components and are more likely to exhibit strong correlations driven by biology. In contrast, calculating correlations for all possible gene pairs would require too much computational time and increase the severity of the multiple testing correction. It may also prioritize uninteresting genes that have strong correlations but low variance, e.g., tightly co-regulated house-keeping genes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">set.seed</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">100</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">var.cor &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">correlatePairs</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">subset.row=rownames</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">write.table</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"hsc_cor.tsv"</styled-content>
                        <styled-content style="font-size:15px;">, var.cor</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">sep=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"\t"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">quote=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">row.names=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">FALSE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">head</styled-content>
                        <styled-content style="font-size:15px;">(var.cor)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##      gene1   gene2       rho      p.value         FDR
## 1   mt-Nd2 mt-Rnr1 0.6037110 1.999998e-06 0.005293709
## 2     Egr1     Jun 0.5218295 1.999998e-06 0.005293709
## 3    Pdia6   Hspa5 0.5119852 1.999998e-06 0.005293709
## 4      Fos    Egr1 0.5035263 1.999998e-06 0.005293709
## 5 Ppp1r15a   Zfp36 0.4975862 1.999998e-06 0.005293709
## 6   Hnrpdl  mt-Nd2 0.4963688 1.999998e-06 0.005293709</styled-content>
                    </preformat>
                </p>
                <p>The significance of each correlation is determined using a permutation test. For each pair of genes, the null hypothesis is that the expression profiles of the two genes are independent. Shuffling the profiles and recalculating the correlation yields a null distribution that is used to obtain a 
                    <italic toggle="yes">p</italic>-value for each observed correlation value (
                    <xref ref-type="bibr" rid="ref-36">Phipson &amp; Smyth, 2010</xref>). Correction for multiple testing across many gene pairs is performed by controlling the FDR at 5%. Correlated gene pairs can be directly used for experimental validation with orthogonal techniques (e.g., fluorescence-activated cell sorting, immunohistochemistry or RNA fluorescence 
                    <italic toggle="yes">in situ</italic> hybridization) to verify that these expression patterns are genuinely present across the cell population.</p>
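                <p>The logic of the permutation test can be sketched in base R for a single pair of profiles. This is an illustration under our own assumptions rather than the implementation in <monospace>correlatePairs</monospace>: the function <monospace>perm_spearman</monospace>, the two-sided comparison and the number of iterations are all hypothetical choices. The p-value adds one to both the count of extreme permutations and the number of permutations, so that it is never exactly zero.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
# Hypothetical helper (not part of scran): permutation test for Spearman's rho.
perm_spearman <- function(x, y, iters = 1000) {
    obs <- cor(x, y, method = "spearman")
    # Shuffle one profile to break any association, then recompute rho.
    null <- replicate(iters, cor(x, sample(y), method = "spearman"))
    # Count permutations at least as extreme as the observed value (two-sided).
    b <- sum(abs(null) >= abs(obs))
    # Add 1 to numerator and denominator so the p-value is never zero.
    list(rho = obs, p.value = (b + 1) / (iters + 1))
}

set.seed(100)
x <- rnorm(50)
y <- x + rnorm(50)   # simulated profile that is correlated with x
z <- rnorm(50)       # simulated profile that is independent of x
res.xy <- perm_spearman(x, y)
res.xz <- perm_spearman(x, z)
```
]]></preformat>
                </p>
                <p>With this construction the smallest attainable p-value is 1/(<monospace>iters</monospace> + 1), so the number of permutations limits how small a p-value can be reported.</p>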
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sig.cor &lt;- var.cor$FDR &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.05</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">summary</styled-content>
                        <styled-content style="font-size:15px;">(sig.cor)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##    Mode FALSE TRUE NA's
## logical 18485   43    0</styled-content>
                    </preformat>
                </p>
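                <p>As a consistency check on this output (our own arithmetic, not part of the original workflow), the 193 HVGs identified earlier give the following number of distinct gene pairs, which matches the total number of tests in the summary:</p>
                <p>
                    <disp-formula><tex-math><![CDATA[
\binom{193}{2} = \frac{193 \times 192}{2} = 18528 = 18485 + 43 .
]]></tex-math></disp-formula>
                </p>
                <p>Because the number of pairs grows quadratically with the number of genes, restricting the analysis to HVGs keeps both the computation and the multiple testing burden manageable.</p>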
                <p>Larger sets of correlated genes are assembled by treating genes as nodes in a graph and each pair of genes with significantly large correlations as an edge. In particular, an undirected graph is constructed using methods in the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/RBGL">RBGL</ext-link>
                    </italic> package. Highly connected subgraphs are then identified and defined as gene sets. This provides a convenient summary of the pairwise correlations between genes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">library</styled-content>
                        <styled-content style="font-size:15px;">(RBGL)
g &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ftM2graphNEL</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">cbind</styled-content>
                        <styled-content style="font-size:15px;">(var.cor$gene1, var.cor$gene2)[sig.cor,],</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">W=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">V=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">edgemode=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"undirected"</styled-content>
                        <styled-content style="font-size:15px;">)
cl &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">highlyConnSG</styled-content>
                        <styled-content style="font-size:15px;">(g)$clusters
cl &lt;- cl[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">order</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">lengths</styled-content>
                        <styled-content style="font-size:15px;">(cl),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">decreasing=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">head</styled-content>
                        <styled-content style="font-size:15px;">(cl)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [[1]]
## [1] "Egr1"  "Fos"   "Zfp36" "Ier2"
##
## [[2]]
## [1] "mt-Nd2"  "Sh3bgrl" "mt-Rnr1"
##
## [[3]]
## [1] "Hspd1"   "Pik3ip1" "Srm"
##
## [[4]]
## [1] "Sqstm1" "Phgdh"  "Cct3"
##
## [[5]]
## [1] "Morf4l2" "Impdh2"  "Ncl"
##
## [[6]]
## [1] "Hsd17b12" "Srsf7"</styled-content>
                    </preformat>
                </p>
                <p>Significant correlations provide evidence for substructure in the dataset, i.e., subpopulations of cells with systematic differences in their expression profiles. The number of significantly correlated HVG pairs represents the strength of the substructure. If many pairs were significant, this would indicate that the subpopulations were clearly defined and distinct from one another. For this particular dataset, a relatively low number of HVGs exhibit significant correlations. This suggests that any substructure in the data will be modest, which is expected given that rigorous selection was performed to obtain a homogeneous population of HSCs (
                    <xref ref-type="bibr" rid="ref-48">Wilson 
                        <italic toggle="yes">et al.</italic>, 2015</xref>).</p>
            </sec>
            <sec>
                <title>Using correlated HVGs for further data exploration</title>
                <p>We visualize the expression profiles of the correlated HVGs with a heatmap (
                    <xref ref-type="fig" rid="f12">Figure 12</xref>). All expression values are mean-centred for each gene to highlight the relative differences in expression between cells. If any subpopulations were present, they would manifest as rectangular &#x201c;blocks&#x201d; in the heatmap, corresponding to sets of genes that are systematically up- or down-regulated in specific groups of cells. This is not observed in 
                    <xref ref-type="fig" rid="f12">Figure 12</xref>, consistent with the lack of strong substructure. There may be a subpopulation of 
                    <italic toggle="yes">Fos</italic> and 
                    <italic toggle="yes">Jun</italic>-negative cells, but it is poorly defined given the small numbers of cells and genes involved.</p>
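                <p>Mean-centring of each gene can be illustrated on a toy matrix. This is a minimal sketch in base R with made-up values and hypothetical gene and cell names; it only demonstrates that subtracting the vector of row means from a matrix centres each row.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
# Toy matrix (hypothetical values): 2 genes (rows) by 3 cells (columns).
norm.exprs <- matrix(c(1, 2, 3,
                       10, 20, 30),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("geneA", "geneB"), paste0("cell", 1:3)))

# rowMeans() returns one mean per gene. Subtracting this vector from the matrix
# recycles it down each column, so entry [i, j] has the mean of gene i removed.
heat.vals <- norm.exprs - rowMeans(norm.exprs)
print(heat.vals)
```
]]></preformat>
                </p>
                <p>After centring, every row has a mean of zero, so the heatmap colours reflect differences between cells rather than differences in average abundance between genes.</p>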
                <fig fig-type="figure" id="f12" orientation="portrait" position="float">
                    <label>Figure 12. </label>
                    <caption>
                        <title>Heatmap of mean-centred normalized log-expression values for correlated HVGs in the HSC dataset.</title>
                        <p>Dendrograms are formed by hierarchical clustering on the Euclidean distances between genes (row) or cells (column).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure12.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">chosen &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">unique</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">c</styled-content>
                        <styled-content style="font-size:15px;">(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))
norm.exprs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce)[chosen,,drop=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">FALSE</styled-content>
                        <styled-content style="font-size:15px;">]
heat.vals &lt;- norm.exprs -</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">rowMeans</styled-content>
                        <styled-content style="font-size:15px;">(norm.exprs)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">library</styled-content>
                        <styled-content style="font-size:15px;">(gplots)
heat.out &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">heatmap.2</styled-content>
                        <styled-content style="font-size:15px;">(heat.vals,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">col=</styled-content>
                        <styled-content style="font-size:15px;">bluered,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">symbreak=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">trace=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">&#x2019;none&#x2019;</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">cexRow=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">0.6</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
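                <p>As a minimal sketch of the mean-centring step, the following uses a small simulated matrix rather than the HSC data (the object names <monospace>toy</monospace> and <monospace>toy.centred</monospace> are hypothetical and not part of the workflow). It checks that subtracting <monospace>rowMeans</monospace> from a gene-by-cell matrix centres each gene at zero, which works because R recycles the vector of row means down each column.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">set.seed(1)
# Simulated log-expression values for 5 genes (rows) and 4 cells (columns).
toy &lt;- matrix(rnorm(20, mean=5), nrow=5,
    dimnames=list(paste0("gene", 1:5), paste0("cell", 1:4)))
# A vector of length nrow(toy) is recycled down each column,
# so every entry has the mean of its own gene removed.
toy.centred &lt;- toy - rowMeans(toy)
# Each gene should now have a mean of (numerically) zero across cells.
all(abs(rowMeans(toy.centred)) &lt; 1e-12)</styled-content>
                    </preformat>
                </p>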
                <p>We also apply dimensionality reduction techniques to visualize the relationships between cells. This is done by constructing a PCA plot from the normalized log-expression values of the correlated HVGs (
                    <xref ref-type="fig" rid="f13">Figure 13</xref>). Cells with similar expression profiles should be located close together in the plot, while dissimilar cells should be far apart. We only use the correlated HVGs in 
                    <monospace>plotPCA</monospace> because any substructure should be most pronounced in the expression profiles of these genes. Even so, no clear separation of cells into distinct subpopulations is observed.</p>
                <fig fig-type="figure" id="f13" orientation="portrait" position="float">
                    <label>Figure 13. </label>
                    <caption>
                        <title>PCA plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell in the HSC dataset.</title>
                        <p>First and second components are shown, along with the percentage of variance explained. Bars represent the coordinates of the cells on each axis. Each cell is coloured according to its total number of expressed features.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure13.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">plotPCA</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"total_features"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">chosen) + fontsize</styled-content>
                    </preformat>
                </p>
                <p>On a related note, 
                    <xref ref-type="fig" rid="f13">Figure 13</xref> shows only the first two components, which contribute most to the variance. Additional components can be visualized by increasing the 
                    <monospace>ncomponents</monospace> argument in 
                    <monospace>plotPCA</monospace> to construct pairwise plots. The percentage of variance explained by each component can also be obtained by running 
                    <monospace>plotPCA</monospace> with 
                    <monospace>return_SCESet=TRUE</monospace>, and then calling 
                    <monospace>reducedDimension</monospace> on the returned object. This information may be useful for selecting high-variance components (possibly corresponding to interesting underlying factors) for further examination.</p>
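                <p>To illustrate what the percentage of variance explained represents, the sketch below computes it directly with the base R function <monospace>prcomp</monospace> on simulated data. This is an assumption made for illustration only: it does not use <monospace>plotPCA</monospace> or the HSC data, and the object names are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">set.seed(2)
# Simulated log-expression matrix: 50 genes (rows) by 30 cells (columns).
toy &lt;- matrix(rnorm(50 * 30), nrow=50)
# prcomp expects observations in rows, so cells are placed in rows by transposing.
pc &lt;- prcomp(t(toy))
# Percentage of the total variance explained by each component.
pct.var &lt;- pc$sdev^2 / sum(pc$sdev^2) * 100
round(head(pct.var), 1)
# Pairwise plots of the first three components.
pairs(pc$x[, 1:3])</styled-content>
                    </preformat>
                </p>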
                <p>Another widely used approach is the 
                    <italic toggle="yes">t</italic>-distributed stochastic neighbour embedding (
                    <italic toggle="yes">t</italic>-SNE) method (
                    <xref ref-type="bibr" rid="ref-47">Van der Maaten &amp; Hinton, 2008</xref>). 
                    <italic toggle="yes">t</italic>-SNE tends to work better than PCA for separating cells in more diverse populations. This is because the former can directly capture non-linear relationships in high-dimensional space, whereas the latter must represent them (suboptimally) as linear components. However, this improvement comes at the cost of more computational effort and complexity. In particular, 
                    <italic toggle="yes">t</italic>-SNE is a stochastic method, so users should run the algorithm several times to ensure that the results are representative, and then set a seed to ensure that the chosen results are reproducible. It is also advisable to test different settings of the &#x201c;perplexity&#x201d; parameter as this will affect the distribution of points in the low-dimensional space. This is demonstrated below in 
                    <xref ref-type="fig" rid="f14">Figure 14</xref>, though no substructure is consistently observed across the plots.</p>
                <fig fig-type="figure" id="f14" orientation="portrait" position="float">
                    <label>Figure 14. </label>
                    <caption>
                        <title>
							
                            <italic toggle="yes">t</italic>-SNE plots constructed from normalized log-expression values of correlated HVGs, using a range of perplexity values.</title>
                        <p>In each plot, each point represents a cell in the HSC dataset. Bars represent the coordinates of the cells on each axis. Each cell is coloured according to its total number of expressed features.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure14.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">set.seed</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">100</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">out5 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">plotTSNE</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">perplexity=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">5</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"total_features"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">chosen) + fontsize +</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ggtitle</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Perplexity = 5"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">out10 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">plotTSNE</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">perplexity=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">10</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"total_features"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">chosen) + fontsize +</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ggtitle</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Perplexity = 10"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">out20 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">plotTSNE</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">perplexity=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"total_features"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">chosen) + fontsize +</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ggtitle</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Perplexity = 20"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">multiplot</styled-content>
                        <styled-content style="font-size:15px;">(out5, out10, out20,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">cols=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">3</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
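                <p>The advice on seeds and perplexity can also be sketched outside of <monospace>plotTSNE</monospace>. The example below assumes that the <monospace>Rtsne</monospace> package from CRAN is installed and uses simulated data; the helper <monospace>runTSNE</monospace> and all object names are hypothetical. Embeddings from different seeds would be compared by eye, and re-using a seed should reproduce the chosen embedding.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Assumption: the Rtsne package is installed (the workflow itself uses plotTSNE).
library(Rtsne)
set.seed(3)
# Simulated log-expression values: 100 cells (rows) by 20 genes (columns).
toy &lt;- matrix(rnorm(100 * 20), nrow=100)
# Hypothetical helper: run t-SNE from a given seed and return the 2D coordinates.
runTSNE &lt;- function(seed, perplexity) {
    set.seed(seed)
    Rtsne(toy, dims=2, perplexity=perplexity)$Y
}
# Different seeds give different embeddings, to be compared visually.
coords.a &lt;- runTSNE(100, perplexity=10)
coords.b &lt;- runTSNE(200, perplexity=10)
# A range of perplexity values, all started from the same seed.
by.perp &lt;- lapply(c(5, 10, 20), function(p) runTSNE(100, perplexity=p))</styled-content>
                    </preformat>
                </p>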
                <p>Many other dimensionality reduction techniques could also be used but are not considered here, such as multidimensional scaling and diffusion maps. Each has its own advantages and disadvantages &#x2013; for example, diffusion maps (see 
                    <monospace>plotDiffusionMap</monospace>) place cells along a continuous trajectory and are suited for visualizing graduated processes like differentiation (
                    <xref ref-type="bibr" rid="ref-2">Angerer 
                        <italic toggle="yes">et al.</italic>, 2016</xref>). For each visualization method, additional cell-specific information can be incorporated into the colour, size or shape of each point. Here, cells are coloured by the total number of expressed features to demonstrate that this metric does not drive any systematic differences across the population. The 
                    <monospace>selectorPlot</monospace> function from 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scran">scran</ext-link>
                    </italic> can also be used to interactively select groups of cells in two-dimensional space. This facilitates data exploration as visually identified subpopulations can be directly selected for further examination.</p>
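                <p>As one example of an alternative technique, classical multidimensional scaling is available in base R through <monospace>cmdscale</monospace>. The sketch below is illustrative only: it uses simulated data, hypothetical object names, and a simulated per-cell metric in place of the number of expressed features.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">set.seed(4)
# Simulated log-expression values: 20 genes (rows) by 40 cells (columns).
toy &lt;- matrix(rnorm(20 * 40), nrow=20)
# dist computes distances between rows, so cells are placed in rows by transposing.
cell.dist &lt;- dist(t(toy))
# Classical multidimensional scaling into two dimensions.
mds &lt;- cmdscale(cell.dist, k=2)
# Simulated per-cell metric, standing in for the number of expressed features.
n.features &lt;- sample(1000:5000, 40)
# Colour points by whether the metric is above or below its median.
plot(mds[, 1], mds[, 2], pch=16, xlab="MDS 1", ylab="MDS 2",
    col=ifelse(n.features &gt; median(n.features), "red", "blue"))</styled-content>
                    </preformat>
                </p>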
                <p>Finally, putative subpopulations can be computationally defined by cutting the dendrogram in 
                    <monospace>heat.out$colDendrogram</monospace> with 
                    <monospace>cutree</monospace> to form clusters (the dendrogram must first be converted with 
                    <monospace>as.hclust</monospace>, as base 
                    <monospace>cutree</monospace> operates on 
                    <monospace>hclust</monospace> objects). We do not attempt this here as the substructure is too weak for reliable clustering. In fact, users should generally treat clustering results with some caution. If the differences between cells are subtle, the assignment of cells into clusters may not be robust. Moreover, different algorithms can yield substantially different clusters by focusing on different aspects of the data. Experimental validation of the clusters is critical to ensure that the putative subpopulations actually exist.</p>
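                <p>For datasets where clustering is appropriate, the dendrogram-cutting step might look like the sketch below. It uses simulated data with two well-separated groups and hypothetical object names, so it says nothing about the HSC dataset. The base <monospace>cutree</monospace> function operates on <monospace>hclust</monospace> objects, so the dendrogram is converted with <monospace>as.hclust</monospace> first.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">set.seed(5)
# Simulated data: 20 genes, with two groups of 10 cells that differ in mean expression.
toy &lt;- cbind(matrix(rnorm(200, mean=0), nrow=20),
    matrix(rnorm(200, mean=3), nrow=20))
colnames(toy) &lt;- paste0("cell", 1:20)
# Hierarchical clustering on Euclidean distances between cells,
# stored as a dendrogram (as in heat.out$colDendrogram).
dend &lt;- as.dendrogram(hclust(dist(t(toy))))
# Convert back to an hclust object before cutting into two clusters.
clusters &lt;- cutree(as.hclust(dend), k=2)
table(clusters)</styled-content>
                    </preformat>
                </p>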
            </sec>
            <sec>
                <title>Additional comments</title>
                <p>Once the basic analysis is completed, it is often useful to save the 
                    <monospace>SCESet</monospace> object to file with the 
                    <monospace>saveRDS</monospace> function. The object can then be easily restored into new R sessions using the 
                    <monospace>readRDS</monospace> function. This allows further work to be conducted without having to repeat all of the processing steps described above.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">saveRDS</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"hsc_data.rds"</styled-content>
                        <styled-content style="font-size:15px;">, sce)</styled-content>
                    </preformat>
                </p>
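                <p>The save-and-restore round trip can be checked with a small stand-in object, as sketched below. The list <monospace>toy</monospace> and the temporary file path are hypothetical; in the workflow, the object would be the <monospace>SCESet</monospace> named <monospace>sce</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Stand-in object; in the workflow this would be the SCESet named sce.
toy &lt;- list(counts=matrix(1:6, nrow=2), note="processed")
# Write to a temporary file so that the example leaves nothing behind.
path &lt;- tempfile(fileext=".rds")
saveRDS(toy, file=path)
# Restore the object, as would be done in a new R session.
restored &lt;- readRDS(path)
identical(toy, restored)
unlink(path)</styled-content>
                    </preformat>
                </p>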
                <p>A variety of methods are available to perform more complex analyses on the processed expression data. For example, cells can be ordered in pseudotime (e.g., for progress along a differentiation pathway) with 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/monocle">monocle</ext-link>
                    </italic> (
                    <xref ref-type="bibr" rid="ref-44">Trapnell 
                        <italic toggle="yes">et al.</italic>, 2014</xref>) or 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/TSCAN">TSCAN</ext-link>
                    </italic> (
                    <xref ref-type="bibr" rid="ref-17">Ji &amp; Ji, 2016</xref>); cell-state hierarchies can be characterized with the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/sincell">sincell</ext-link>
                    </italic> package (
                    <xref ref-type="bibr" rid="ref-18">Julia 
                        <italic toggle="yes">et al.</italic>, 2015</xref>); and oscillatory behaviour can be identified using 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/Oscope">Oscope</ext-link>
                    </italic> (
                    <xref ref-type="bibr" rid="ref-24">Leng 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). HVGs can be used in gene set enrichment analyses to identify biological pathways and processes with heterogeneous activity, using packages designed for bulk data like 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/topGO">topGO</ext-link>
                    </italic> or with dedicated single-cell methods like 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/scde">scde</ext-link>
                    </italic> (
                    <xref ref-type="bibr" rid="ref-11">Fan 
                        <italic toggle="yes">et al.</italic>, 2016</xref>). Full descriptions of these analyses are outside the scope of this workflow, so interested users are advised to consult the relevant documentation.</p>
            </sec>
        </sec>
        <sec>
            <title>Analysis of cell types in the brain</title>
            <sec>
                <title>Overview</title>
                <p>We proceed to a more heterogeneous dataset from a study of cell types in the mouse brain (
                    <xref ref-type="bibr" rid="ref-49">Zeisel 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of UMIs mapped to each gene. Count data for all endogenous genes, mitochondrial genes and spike-in transcripts were obtained from 
                    <ext-link ext-link-type="uri" xlink:href="http://linnarssonlab.org/cortex">http://linnarssonlab.org/cortex</ext-link>.</p>
            </sec>
            <sec>
                <title>Count loading</title>
                <p>The count data are distributed across several files, so some work is necessary to consolidate them into a single matrix. We define a simple utility function to load the data from each file. (We stress that this function is specific to the current dataset and should not be used for other datasets. This kind of effort is generally not required if all of the counts are in a single file and separated from the metadata.)</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">readFormat &lt;- function(infile) {</styled-content>
     
                        <styled-content style="font-size:15px;color:#8F5903;"># First column is empty.</styled-content>
     
                        <styled-content style="font-size:15px;">metadata &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">read.delim</styled-content>
                        <styled-content style="font-size:15px;">(infile,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">stringsAsFactors=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">header=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nrow=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">)[,-</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(metadata) &lt;- metadata[,</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>
     
                        <styled-content style="font-size:15px;">metadata &lt;- metadata[,-</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>
     
                        <styled-content style="font-size:15px;">metadata &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">as.data.frame</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">t</styled-content>
                        <styled-content style="font-size:15px;">(metadata))</styled-content>
     
                        <styled-content style="font-size:15px;color:#8F5903;"># First column after row names is some useless filler.</styled-content>
     
                        <styled-content style="font-size:15px;">counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">read.delim</styled-content>
                        <styled-content style="font-size:15px;">(infile,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">stringsAsFactors=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">header=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">row.names=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">skip=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">11</styled-content>
                        <styled-content style="font-size:15px;">)[,-</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>
     
                        <styled-content style="font-size:15px;">counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">as.matrix</styled-content>
                        <styled-content style="font-size:15px;">(counts)</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">return</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">list</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">metadata=</styled-content>
                        <styled-content style="font-size:15px;">metadata,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">counts=</styled-content>
                        <styled-content style="font-size:15px;">counts))
}</styled-content>
                    </preformat>
                </p>
                <p>Using this function, we read in the counts for the endogenous genes, ERCC spike-ins and mitochondrial genes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">endo.data &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">readFormat</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"expression_mRNA_17-Aug-2014.txt"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">spike.data &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">readFormat</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"expression_spikes_17-Aug-2014.txt"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">mito.data &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">readFormat</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"expression_mito_17-Aug-2014.txt"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>We also need to rearrange the columns for the mitochondrial data, as the order is not consistent with the other files.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">m &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">match</styled-content>
                        <styled-content style="font-size:15px;">(endo.data$metadata$cell_id, mito.data$metadata$cell_id)
mito.data$metadata &lt;- mito.data$metadata[m,]
mito.data$counts &lt;- mito.data$counts[,m]</styled-content>
                    </preformat>
                </p>
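                <p>The reordering relies on <monospace>match</monospace> returning, for each reference identifier, its position in the other vector. A toy sketch with hypothetical cell identifiers and counts is shown below, including a check that no reference cell is missing from the second file.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical cell identifiers in the reference order and in a shuffled order.
ref.ids &lt;- c("cellA", "cellB", "cellC", "cellD")
other.ids &lt;- c("cellC", "cellA", "cellD", "cellB")
# Toy counts for two genes, with columns in the shuffled order.
other.counts &lt;- matrix(1:8, nrow=2,
    dimnames=list(c("gene1", "gene2"), other.ids))
# m[i] is the position in other.ids of the i-th reference identifier.
m &lt;- match(ref.ids, other.ids)
# An NA would indicate a reference cell that is absent from the other file.
stopifnot(!anyNA(m))
reordered &lt;- other.counts[, m]
# The columns should now follow the reference order.
stopifnot(identical(colnames(reordered), ref.ids))</styled-content>
                    </preformat>
                </p>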
                <p>The counts are then combined into a single matrix for constructing a 
                    <monospace>SCESet</monospace> object. For convenience, metadata for all cells are stored in the same object for later access.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">all.counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rbind</styled-content>
                        <styled-content style="font-size:15px;">(endo.data$counts, mito.data$counts, spike.data$counts)
metadata &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">AnnotatedDataFrame</styled-content>
                        <styled-content style="font-size:15px;">(endo.data$metadata)
sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">newSCESet</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">countData=</styled-content>
                        <styled-content style="font-size:15px;">all.counts,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">phenoData=</styled-content>
                        <styled-content style="font-size:15px;">metadata)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">dim</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## Features Samples
##    20063    3005</styled-content>
                    </preformat>
                </p>
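                <p>As a minimal sketch of the combining step, the toy example below stacks three small matrices with 
                    <monospace>rbind</monospace> after checking that their columns (cells) are in the same order. The matrices 
                    <monospace>endo</monospace>, 
                    <monospace>mito</monospace> and 
                    <monospace>spike</monospace> are hypothetical stand-ins for the real count tables. For matrices, 
                    <monospace>rbind</monospace> stacks rows by position rather than by column name, so a check of this kind may be worthwhile before combining.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical toy matrices (genes in rows, cells in columns).
cells &lt;- c("Cell1", "Cell2", "Cell3")
endo &lt;- matrix(1:6, nrow=2, dimnames=list(c("GeneA", "GeneB"), cells))
mito &lt;- matrix(7:9, nrow=1, dimnames=list("mt-GeneC", cells))
spike &lt;- matrix(10:15, nrow=2, dimnames=list(c("Spike1", "Spike2"), cells))

# rbind() does not match column names between matrices, so confirm
# that all tables list the cells in the same order before stacking.
stopifnot(identical(colnames(endo), colnames(mito)),
          identical(colnames(endo), colnames(spike)))

combined &lt;- rbind(endo, mito, spike)
dim(combined)</styled-content>
                    </preformat>
                </p>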
                <p>We also add annotation identifying rows that correspond to each class of features.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">nrows &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">nrow</styled-content>
                        <styled-content style="font-size:15px;">(endo.data$counts),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nrow</styled-content>
                        <styled-content style="font-size:15px;">(mito.data$counts),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nrow</styled-content>
                        <styled-content style="font-size:15px;">(spike.data$counts))
is.spike &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rep</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">), nrows)
is.mito &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rep</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">), nrows)</styled-content>
                    </preformat>
                </p>
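                <p>The calls to 
                    <monospace>rep</monospace> above rely on its vectorized 
                    <monospace>times</monospace> argument: each element of the logical vector is repeated by the corresponding entry of 
                    <monospace>nrows</monospace>, giving one flag per row of the combined matrix. A small sketch with made-up row counts (4 endogenous, 2 mitochondrial and 3 spike-in features) illustrates this.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical row counts for endogenous, mitochondrial and spike-in features.
nrows &lt;- c(4, 2, 3)

# Each flag is repeated once per row of the corresponding block.
is.spike &lt;- rep(c(FALSE, FALSE, TRUE), nrows)
is.mito &lt;- rep(c(FALSE, TRUE, FALSE), nrows)

which(is.mito)   # rows 5 and 6
which(is.spike)  # rows 7 to 9</styled-content>
                    </preformat>
                </p>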
            </sec>
            <sec>
                <title>Quality control on the cells</title>
                <p>The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we compute some quality control metrics to check whether the remaining cells are satisfactory.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">calculateQCMetrics</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">feature_controls=list</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">Spike=</styled-content>
                        <styled-content style="font-size:15px;">is.spike,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Mt=</styled-content>
                        <styled-content style="font-size:15px;">is.mito))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce) &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"Spike"</styled-content>
                    </preformat>
                </p>
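                <p>To make the per-cell metrics concrete, the sketch below computes analogous quantities by hand for a toy matrix. It is an illustration only and not the implementation of 
                    <monospace>calculateQCMetrics</monospace>. The matrix 
                    <monospace>toy</monospace> and all variable names are invented for this example, and a feature is assumed to be expressed when its count is non-zero.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical counts: 4 features (rows) by 2 cells (columns).
toy &lt;- rbind(Endo1=c(90, 40), Endo2=c(0, 50), Mito1=c(5, 8), Spike1=c(5, 2))
colnames(toy) &lt;- c("Cell1", "Cell2")
is.mito &lt;- c(FALSE, FALSE, TRUE, FALSE)
is.spike &lt;- c(FALSE, FALSE, FALSE, TRUE)

lib.size &lt;- colSums(toy)          # total counts per cell
n.expressed &lt;- colSums(toy &gt; 0)   # features with non-zero counts per cell

# Percentage of each cell's counts assigned to each control set.
pct.mito &lt;- 100 * colSums(toy[is.mito,,drop=FALSE]) / lib.size
pct.spike &lt;- 100 * colSums(toy[is.spike,,drop=FALSE]) / lib.size

data.frame(lib.size, n.expressed, pct.mito, pct.spike)</styled-content>
                    </preformat>
                </p>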
                <p>We examine the distribution of library sizes and numbers of expressed genes across cells (
                    <xref ref-type="fig" rid="f15">Figure 15</xref>).</p>
                <fig fig-type="figure" id="f15" orientation="portrait" position="float">
                    <label>Figure 15. </label>
                    <caption>
                        <title>Histograms of library sizes (left) and number of expressed genes (right) for all cells in the brain dataset.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure15.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">par</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">mfrow=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_counts/</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1e3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Library sizes (thousands)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_features,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of expressed genes"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>We also examine the distribution of the proportions of UMIs assigned to mitochondrial genes or spike-in transcripts (
                    <xref ref-type="fig" rid="f16">Figure 16</xref>). The spike-in proportions here are more variable than in the HSC dataset. This may reflect a greater variability in the total amount of endogenous RNA per cell when many cell types are present.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">par</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">mfrow=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_Mt,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Mitochondrial proportion (%)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_Spike,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ERCC proportion (%)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Number of cells"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey80"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
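                <p>We remove cells that are small outliers for the metrics in 
                    <xref ref-type="fig" rid="f15">Figure 15</xref> (library size and number of expressed genes) and large outliers for the metrics in 
                    <xref ref-type="fig" rid="f16">Figure 16</xref> (mitochondrial and spike-in proportions), using a MAD-based threshold as previously described.</p>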
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">libsize.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_counts,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"lower"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)
feature.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$total_features,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"lower"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)
mito.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_Mt,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"higher"</styled-content>
                        <styled-content style="font-size:15px;">)
spike.drop &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">isOutlier</styled-content>
                        <styled-content style="font-size:15px;">(sce$pct_counts_feature_controls_Spike,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">nmads=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"higher"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
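                <p>The idea behind a MAD-based threshold can be sketched in base R as follows. The helper 
                    <monospace>mad_outlier</monospace> is hypothetical and written only for illustration; it is not the code of 
                    <monospace>isOutlier</monospace> and may differ from it in details. The library sizes are made up. The base of the logarithm does not change which cells are flagged, because the median and the MAD are rescaled by the same factor.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical helper, not part of scater: flag values that lie more than
# 'nmads' median absolute deviations below (or above) the median.
mad_outlier &lt;- function(x, nmads=3, type=c("lower", "higher"), log=FALSE) {
    type &lt;- match.arg(type)
    if (log) x &lt;- log10(x)
    centre &lt;- median(x)
    spread &lt;- mad(x)  # stats::mad() scales by 1.4826 by default
    if (type == "lower") {
        x &lt; centre - nmads * spread
    } else {
        x &gt; centre + nmads * spread
    }
}

# Made-up library sizes; the last cell is much smaller than the rest.
lib.sizes &lt;- c(9000, 10000, 11000, 12000, 10500, 9500, 11500, 300)
low &lt;- mad_outlier(lib.sizes, nmads=3, type="lower", log=TRUE)
which(low)  # only the last cell is flagged</styled-content>
                    </preformat>
                </p>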
                <fig fig-type="figure" id="f16" orientation="portrait" position="float">
                    <label>Figure 16. </label>
                    <caption>
                        <title>Histogram of the proportion of UMIs assigned to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the brain dataset.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure16.gif"/>
                </fig>
                <p>Removal of low-quality cells is then performed by combining the filters for all of the metrics. The vast majority of cells are retained, which suggests that the original quality control procedures were generally adequate.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[,!(libsize.drop | feature.drop | spike.drop | mito.drop)]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">data.frame</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">ByLibSize=sum</styled-content>
                        <styled-content style="font-size:15px;">(libsize.drop),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ByFeature=sum</styled-content>
                        <styled-content style="font-size:15px;">(feature.drop),</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;"> ByMito=sum</styled-content>
                        <styled-content style="font-size:15px;">(mito.drop),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">BySpike=sum</styled-content>
                        <styled-content style="font-size:15px;">(spike.drop),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Remaining=ncol</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## 	    ByLibSize ByFeature ByMito BySpike Remaining
##  Samples         8         3     87       8      2902</styled-content>
                    </preformat>
                </p>
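                <p>The per-filter counts need not sum to the number of discarded cells, because a cell can fail several filters at once. In the output above, 3005 - 2902 = 103 cells are removed, whereas the four per-filter counts sum to 106. The toy flags below, invented for illustration, show the same effect.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical flags for six cells; the first cell fails two filters.
libsize.drop &lt;- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
feature.drop &lt;- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
mito.drop &lt;- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
spike.drop &lt;- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# A cell is discarded if it fails at least one filter.
discard &lt;- libsize.drop | feature.drop | mito.drop | spike.drop
per.filter &lt;- c(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    ByMito=sum(mito.drop), BySpike=sum(spike.drop))

sum(per.filter)  # 3 filter failures in total
sum(discard)     # 2 distinct cells discarded</styled-content>
                    </preformat>
                </p>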
            </sec>
            <sec>
                <title>Cell cycle classification</title>
                <p>Application of 
                    <monospace>cyclone</monospace> to the brain dataset suggests that most of the cells are in G1 phase (
                    <xref ref-type="fig" rid="f17">Figure 17</xref>). However, the interpretation of this result requires some caution due to the differences between the test and training datasets. The classifier was trained on C1 SMARTer data (
                    <xref ref-type="bibr" rid="ref-42">Scialdone 
                        <italic toggle="yes">et al.</italic>, 2015</xref>) and accounts for the biases in that protocol. The brain dataset uses UMI counts, which have an entirely different set of biases, e.g., 3&#x2019;-end coverage only, no length bias, no amplification noise. These new biases (and the absence of expected biases) may interfere with accurate classification of some cells.</p>
                <fig fig-type="figure" id="f17" orientation="portrait" position="float">
                    <label>Figure 17. </label>
                    <caption>
                        <title>Cell cycle phase scores from applying the pair-based classifier on the brain dataset, where each point represents a cell.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_Figure17.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">anno &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">select</styled-content>
                        <styled-content style="font-size:15px;">(org.Mm.eg.db,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">keys=rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">keytype=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"SYMBOL"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">columns=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ENSEMBL"</styled-content>
                        <styled-content style="font-size:15px;">)
ensembl &lt;- anno$ENSEMBL[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">match</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce), anno$SYMBOL)]
assignments &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cyclone</styled-content>
                        <styled-content style="font-size:15px;">(sce, mm.pairs,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">gene.names=</styled-content>
                        <styled-content style="font-size:15px;">ensembl)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">plot</styled-content>
                        <styled-content style="font-size:15px;">(assignments$score$G1, assignments$score$G2M,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"G1 score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"G2/M score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
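                <p>The 
                    <monospace>match</monospace> step above keeps one Ensembl identifier per row of 
                    <monospace>sce</monospace>: it returns the first matching entry when a symbol maps to several identifiers, and 
                    <monospace>NA</monospace> when a symbol has no mapping. A small sketch with invented symbols and identifiers (not real annotation) shows this behaviour.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical annotation table: GeneA maps to two identifiers,
# and GeneC is absent from the table.
anno &lt;- data.frame(SYMBOL=c("GeneA", "GeneA", "GeneB"),
    ENSEMBL=c("ID1", "ID2", "ID3"), stringsAsFactors=FALSE)
genes &lt;- c("GeneB", "GeneA", "GeneC")

# match() returns the position of the first hit, or NA if there is none.
ensembl &lt;- anno$ENSEMBL[match(genes, anno$SYMBOL)]
ensembl  # "ID3" "ID1" NA</styled-content>
                    </preformat>
                </p>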
                <p>An additional complication is that many neuronal cell types are expected to lie in the G0 resting phase, which is distinct from the other phases of the cell cycle (
                    <xref ref-type="bibr" rid="ref-9">Coller 
                        <italic toggle="yes">et al.</italic>, 2006</xref>). Application of 
                    <monospace>cyclone</monospace> to these cells may be suboptimal if each cell must be assigned to one of the G1, S or G2/M phases. To avoid problems from misclassification, we will not perform any processing of this dataset by cell cycle phase. This is unlikely to be problematic for this analysis, as the cell cycle effect will be relatively subtle compared to the obvious differences between cell types in a diverse population. Thus, the former is unlikely to distort the conclusions regarding the latter.</p>
            </sec>
            <sec>
                <title>Removing uninteresting genes</title>
                <p>Low-abundance genes are removed by applying a simple mean-based filter. We use a lower threshold for UMI counts compared to that used for read counts. This is because the number of transcript molecules will always be lower than the number of reads generated from such molecules. While some information and power will be lost due to the decrease in the size of the counts, this is mitigated by a concomitant reduction in the variability of the counts. Specifically, the use of UMIs eliminates technical noise due to amplification biases (
                    <xref ref-type="bibr" rid="ref-16">Islam 
                        <italic toggle="yes">et al.</italic>, 2014</xref>).</p>
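                <p>The relationship between read counts and UMI counts can be sketched with a toy table of reads, where each read carries a gene assignment and a UMI sequence. This is a simplified illustration with invented data. It ignores sequencing errors in the UMIs and is not the pipeline used to generate the brain dataset.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical reads: reads sharing a gene and a UMI are assumed
# to derive from the same original transcript molecule.
reads &lt;- data.frame(
    gene=c("GeneA", "GeneA", "GeneA", "GeneA", "GeneB", "GeneB"),
    umi=c("AAC", "AAC", "AAC", "GTT", "CCA", "CCA"),
    stringsAsFactors=FALSE)

read.counts &lt;- table(reads$gene)         # one count per read
umi.counts &lt;- table(unique(reads)$gene)  # one count per distinct gene/UMI pair

read.counts  # GeneA: 4, GeneB: 2
umi.counts   # GeneA: 2, GeneB: 1</styled-content>
                    </preformat>
                </p>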
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">ave.counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rowMeans</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">counts</styled-content>
                        <styled-content style="font-size:15px;">(sce))</styled-content>

                        <styled-content style="font-size:15px;">keep &lt;- ave.counts &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">0.2</styled-content>
                    </preformat>
                </p>
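                <p>As a small illustration of the filter, the sketch below applies the same threshold of 0.2 to an invented matrix of three genes and eight cells. The matrix 
                    <monospace>toy</monospace> and its gene names are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical UMI counts for three genes (rows) across eight cells (columns).
toy &lt;- rbind(
    GeneLow=c(1, 0, 0, 0, 0, 0, 0, 0),
    GeneMid=c(1, 0, 2, 0, 1, 0, 0, 0),
    GeneHigh=c(5, 3, 4, 6, 2, 7, 4, 1))

ave.toy &lt;- rowMeans(toy)   # 0.125, 0.5 and 4
keep.toy &lt;- ave.toy &gt;= 0.2

# drop=FALSE keeps the result as a matrix even if one gene remains.
filtered &lt;- toy[keep.toy,,drop=FALSE]
rownames(filtered)  # "GeneMid" "GeneHigh"</styled-content>
                    </preformat>
                </p>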
                <p>
                    <xref ref-type="fig" rid="f18">Figure 18</xref> suggests that our choice of threshold is appropriate. The filter removes the bulk of lowly expressed genes while preserving the peak of moderately expressed genes.</p>
                <fig fig-type="figure" id="f18" orientation="portrait" position="float">
                    <label>Figure 18. </label>
                    <caption>
                        <title>Histogram of log-average counts for all genes in the brain dataset.</title>
                        <p>The filter threshold is represented by the blue line.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure18.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">hist</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">log10</styled-content>
                        <styled-content style="font-size:15px;">(ave.counts),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">breaks=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">100</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">main=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">""</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87;">xlab=expression</styled-content>
                        <styled-content style="font-size:15px;">(Log[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">]~</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"average count"</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">abline</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">v=log10</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.2</styled-content>
                        <styled-content style="font-size:15px;">),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"blue"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">lwd=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">lty=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>The mean-based filter is applied to the dataset by subsetting 
                    <monospace>sce</monospace> as previously described. Despite the reduced threshold, the number of retained genes is lower than that in the HSC dataset, simply because the library sizes are much smaller with UMI counts.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[keep,]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">nrow</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## Features
##     8939</styled-content>
                    </preformat>
                </p>
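                <p>As a minimal illustration of how a mean-based filter of this kind behaves, the following sketch applies a cutoff of 0.2 to a small made-up count matrix. The threshold is taken from the blue line in the histogram code above. The matrix <monospace>toy.counts</monospace>, its gene names and the use of <monospace>&gt;=</monospace> for the comparison are assumptions made for illustration and are not part of the brain dataset.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Hypothetical UMI count matrix: 4 genes (rows) by 10 cells (columns).
toy.counts &lt;- rbind(
    gene1=c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    gene2=c(2, 3, 1, 0, 4, 0, 1, 2, 0, 3),
    gene3=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    gene4=c(1, 0, 0, 1, 0, 0, 1, 0, 0, 1)
)

# Average count for each gene across all cells.
toy.ave &lt;- rowMeans(toy.counts)

# Retain genes with an average count at or above the threshold.
toy.keep &lt;- toy.ave &gt;= 0.2
toy.filtered &lt;- toy.counts[toy.keep, , drop=FALSE]
rownames(toy.filtered)</preformat>
                </p>
                <p>In this toy example, only <monospace>gene2</monospace> and <monospace>gene4</monospace> are retained, as their average counts (1.6 and 0.4) are above the threshold.</p>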
                <p>Some datasets also contain strong heterogeneity in mitochondrial RNA content, possibly due to differences in mitochondrial copy number or activity between cell types. This heterogeneity will cause mitochondrial genes to dominate the top set of results, e.g., for identification of correlated HVGs. However, these genes are largely uninteresting given that most studies focus on nuclear regulation. As such, we filter them out prior to further analysis. Other candidates for removal include pseudogenes or ribosome-associated genes, which might not be relevant for characterising cell types but can still interfere with the interpretation of the results.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[!</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">fData</styled-content>
                        <styled-content style="font-size:15px;">(sce)$is_feature_control_Mt,]</styled-content>
                    </preformat>
                </p>
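                <p>The same kind of subsetting can be sketched on a plain matrix when no feature control annotation is available. The example below assumes that mitochondrial genes can be recognised by an <monospace>mt-</monospace> prefix in the gene symbol. This naming convention and the object <monospace>toy.counts</monospace> are assumptions for illustration only; the workflow itself relies on the <monospace>is_feature_control_Mt</monospace> field.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Hypothetical gene symbols, two of which carry a mitochondrial prefix.
toy.genes &lt;- c("Plp1", "mt-Co1", "Mbp", "mt-Nd1", "Rpl13")
toy.counts &lt;- matrix(1:15, nrow=5,
    dimnames=list(toy.genes, paste0("cell", 1:3)))

# Flag mitochondrial genes by name (assumed convention) and drop them.
is.mito &lt;- grepl("^mt-", rownames(toy.counts))
toy.nomito &lt;- toy.counts[!is.mito, , drop=FALSE]
rownames(toy.nomito)</preformat>
                </p>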
            </sec>
            <sec>
                <title>Normalization of cell-specific biases</title>
                <p>Normalization of cell-specific biases is performed using the deconvolution method in the 
                    <monospace>computeSumFactors</monospace> function. Here, we cluster similar cells together and normalize the cells in each cluster using the deconvolution method. This improves normalization accuracy by reducing the number of DE genes between cells in the same cluster. Scaling is then performed to ensure that size factors of cells in different clusters are comparable.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">clusters &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">quickCluster</styled-content>
                        <styled-content style="font-size:15px;">(sce)
sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">computeSumFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cluster</styled-content>
                        <styled-content style="font-size:15px;">=clusters)</styled-content>
                    </preformat>
                </p>
                <p>Compared to the HSC analysis, more scatter is observed around the trend between the total count and size factor for each cell (
                    <xref ref-type="fig" rid="f19">Figure 19</xref>). This is consistent with an increased amount of DE between cells of different types, which compromises the accuracy of library size normalization (
                    <xref ref-type="bibr" rid="ref-41">Robinson &amp; Oshlack, 2010</xref>). In contrast, the deconvolution size factors are estimated from median ratios and are more robust to the presence of DE between cells.</p>
                <fig fig-type="figure" id="f19" orientation="portrait" position="float">
                    <label>Figure 19. </label>
                    <caption>
                        <title>Size factors from deconvolution, plotted against library sizes for all cells in the brain dataset.</title>
                        <p>Axes are shown on a log-scale.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure19.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">plot</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">sizeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce), sce$total_counts/</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1e3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"xy"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Library size (thousands)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Size factor"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
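                <p>The difference in robustness between library size normalization and a median-based estimate can be illustrated with a deliberately simple two-profile example. This is not the deconvolution algorithm; it only shows why a median of ratios is less affected by a minority of DE genes than a ratio of totals. All object names and values below are hypothetical.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Reference profile: 1000 genes with the same expected count.
toy.ref &lt;- rep(10, 1000)

# A cell with a true scaling bias of 2 relative to the reference.
toy.cell &lt;- toy.ref * 2

# 50 genes are additionally upregulated 20-fold in this cell (DE genes).
toy.cell[1:50] &lt;- toy.cell[1:50] * 20

# Library size factor: ratio of total counts, inflated by the DE genes.
toy.lib.sf &lt;- sum(toy.cell) / sum(toy.ref)

# Median-ratio factor: unaffected as long as most genes are not DE.
toy.med.sf &lt;- median(toy.cell / toy.ref)

c(library=toy.lib.sf, median=toy.med.sf)</preformat>
                </p>
                <p>Here, the library size factor is 3.9 whereas the median ratio recovers the true bias of 2, as only 5% of the genes are DE.</p>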
                <p>We also compute size factors specific to the spike-in set, as previously described.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">computeSpikeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Spike"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">general.use=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
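                <p>The idea behind spike-in size factors can be sketched as follows, assuming that the factor for each cell is proportional to its total spike-in count and that the factors are scaled to have a mean of one. This is a simplified reading for illustration; the exact scaling used by <monospace>computeSpikeFactors</monospace> should be checked against its documentation. The totals below are made up.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Hypothetical total spike-in counts for three cells.
toy.spike.totals &lt;- c(cell1=500, cell2=1000, cell3=1500)

# Scale the totals so that the resulting factors have a mean of one
# (assumed convention).
toy.spike.sf &lt;- toy.spike.totals / mean(toy.spike.totals)
toy.spike.sf</preformat>
                </p>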
                <p>Finally, normalized log-expression values are computed for each endogenous gene or spike-in transcript using the appropriate size factors.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">normalize</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
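                <p>The calculation can be sketched on a toy matrix as a division of each count by the size factor of its cell, followed by a log-transformation. The base-2 logarithm and the pseudo-count of 1 used below are assumptions for this sketch, as are the counts and size factors themselves.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Hypothetical counts for 2 genes across 3 cells, with one size factor per cell.
toy.counts &lt;- rbind(gene1=c(0, 3, 14), gene2=c(0, 1, 6))
colnames(toy.counts) &lt;- paste0("cell", 1:3)
toy.sf &lt;- c(0.5, 1, 2)

# Divide each column (cell) by its size factor.
toy.norm &lt;- sweep(toy.counts, 2, toy.sf, "/")

# Log-transform after adding a pseudo-count of 1 (assumed value).
toy.logexprs &lt;- log2(toy.norm + 1)
toy.logexprs</preformat>
                </p>
                <p>In a workflow with separate spike-in size factors, the same calculation would be applied to the spike-in rows using the spike-in factors rather than the factors for the endogenous genes.</p>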
            </sec>
            <sec>
                <title>Checking for important technical factors</title>
                <p>Larger experiments contain more technical factors that need to be investigated. In this dataset, factors include the sex of the animal from which the cells were extracted, the age of the animal, the tissue of origin for each cell, and the total spike-in count in each cell. 
                    <xref ref-type="fig" rid="f20">Figure 20</xref> shows that the tissue of origin explains a substantial proportion of the variance for a subset of genes. This is probably because each tissue contains a different composition of cell types, leading to systematic differences in gene expression between tissues. The other factors explain only a small proportion of the variance for most genes and do not need to be incorporated into our downstream analyses.</p>
                <fig fig-type="figure" id="f20" orientation="portrait" position="float">
                    <label>Figure 20. </label>
                    <caption>
                        <title>Density plot of the percentage of variance explained by each factor across all genes in the brain dataset.</title>
                        <p>For each gene, the percentage of the variance of the normalized log-expression values that is explained by the (log-transformed) total spike-in counts, the sex or age of the mouse, or the tissue of origin is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure20.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">plotExplanatoryVariables</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">variables=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"counts_feature_controls_Spike"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#4F9905;">"log10_counts_feature_controls_Spike"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"sex"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"tissue"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"age"</styled-content>
                        <styled-content style="font-size:15px;">)) + fontsize</styled-content>
                    </preformat>
                </p>
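                <p>For a single gene, the percentage of variance explained by a factor can be understood as the R-squared of a linear model fitted to the log-expression values with that factor as the only explanatory variable. The sketch below illustrates this with simulated values for one hypothetical gene and two made-up tissue labels; it is intended to convey the idea and is not a reproduction of the internals of <monospace>plotExplanatoryVariables</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">set.seed(100)

# Hypothetical factor with two levels, 20 cells each.
toy.tissue &lt;- factor(rep(c("tissueA", "tissueB"), each=20))

# Simulated log-expression for one gene that differs between tissues.
toy.expr &lt;- ifelse(toy.tissue == "tissueA", 2, 4) + rnorm(40, sd=0.5)

# Percentage of variance explained = 100 * R-squared of the linear fit.
toy.fit &lt;- lm(toy.expr ~ toy.tissue)
toy.pct &lt;- summary(toy.fit)$r.squared * 100
toy.pct</preformat>
                </p>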
                <p>Nonetheless, we demonstrate how to account for uninteresting technical factors by using sex as an example. We set up a design matrix with the sex of the animal as the explanatory factor for each cell. This ensures that any sex-specific changes in expression will be modelled in our downstream analyses. We do not block on the tissue of origin, despite the fact that it explains more of the variance than sex in 
                    <xref ref-type="fig" rid="f20">Figure 20</xref>. This is because the tissue factor is likely to be associated with genuine differences between cell types, so including it in the model might regress out interesting biological effects.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">design &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">model.matrix</styled-content>
                        <styled-content style="font-size:15px;">(~sce$sex)</styled-content>
                    </preformat>
                </p>
                <p>Other relevant factors include the chip or plate on which the cells were processed and the batch in which the libraries were sequenced. Blocking on these factors may be necessary to account for batch effects that are often observed in scRNA-seq data (
                    <xref ref-type="bibr" rid="ref-12">Hicks 
                        <italic toggle="yes">et al.</italic>, 2015</xref>; 
                    <xref ref-type="bibr" rid="ref-45">Tung 
                        <italic toggle="yes">et al.</italic>, 2016</xref>).</p>
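                <p>Blocking on an additional factor amounts to adding it to the model formula. The sketch below uses hypothetical <monospace>toy.sex</monospace> and <monospace>toy.plate</monospace> factors to show how an additive design matrix might be constructed. It assumes that the factors are not confounded, which should be checked before the design is used.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Hypothetical annotation for six cells.
toy.sex &lt;- factor(c("F", "M", "F", "M", "F", "M"))
toy.plate &lt;- factor(c("p1", "p1", "p2", "p2", "p3", "p3"))

# Design with sex only: intercept plus one coefficient for sex.
design.sex &lt;- model.matrix(~toy.sex)

# Additive design that also blocks on the plate of origin.
design.both &lt;- model.matrix(~toy.sex + toy.plate)

# The design must be of full column rank for all effects to be estimable.
qr(design.both)$rank == ncol(design.both)</preformat>
                </p>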
            </sec>
            <sec>
                <title>Identifying correlated HVGs</title>
                <p>We identify HVGs that may be involved in driving population heterogeneity. This is done by fitting a trend to the technical variances for the spike-in transcripts. We then compute the biological component of the variance for each endogenous gene by subtracting the fitted value of the trend from the total variance.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">var.fit &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trendVar</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trend=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"loess"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">design=</styled-content>
                        <styled-content style="font-size:15px;">design,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">span=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.4</styled-content>
                        <styled-content style="font-size:15px;">)
var.out &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">decomposeVar</styled-content>
                        <styled-content style="font-size:15px;">(sce, var.fit)</styled-content>
                    </preformat>
                </p>
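                <p>The decomposition can be sketched in base R as follows. A loess curve is fitted to the variances of a set of simulated spike-in transcripts, and the fitted value at the mean of each endogenous gene is subtracted from the total variance of that gene. All values are made up, and this is a simplified illustration of the principle rather than a reproduction of <monospace>trendVar</monospace> or <monospace>decomposeVar</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"># Hypothetical spike-in transcripts: the technical variance decreases with the mean.
toy.spikes &lt;- data.frame(mean=seq(0.5, 8, length.out=40))
toy.spikes$var &lt;- 2 * exp(-0.5 * toy.spikes$mean) + 0.05

# Fit a mean-dependent trend to the spike-in variances.
toy.fit &lt;- loess(var ~ mean, data=toy.spikes, span=0.4)

# Hypothetical endogenous genes with means inside the spike-in range.
toy.endo &lt;- data.frame(mean=c(1, 3, 6), total=c(5, 1, 0.2))

# Technical component = fitted trend at the mean of each gene.
toy.endo$tech &lt;- predict(toy.fit, newdata=data.frame(mean=toy.endo$mean))

# Biological component = total variance minus technical component.
toy.endo$bio &lt;- toy.endo$total - toy.endo$tech
toy.endo</preformat>
                </p>
                <p>Note that a loess fit of this kind does not extrapolate by default, so genes with means outside the range of the spike-in transcripts would require separate handling in this sketch.</p>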
                <p>
                    <xref ref-type="fig" rid="f21">Figure 21</xref> suggests that the trend is fitted accurately to the technical variances. Errors in fitting are negligible due to the precision of the variance estimates in a large dataset containing thousands of cells. The technical and total variances are also much smaller than those in the HSC dataset. This is due to the use of UMIs, which reduces the noise caused by variable PCR amplification. Furthermore, the spike-in trend is consistently lower than the variances of the endogenous genes. This reflects the heterogeneity in gene expression across cells of different types. It also means that the previous strategy of fitting a trend to the endogenous variances would not be appropriate here (or necessary, given the quality of the spike-in trend).</p>
                <fig fig-type="figure" id="f21" orientation="portrait" position="float">
                    <label>Figure 21. </label>
                    <caption>
                        <title>Variance of normalized log-expression values for each gene in the brain dataset, plotted against the mean log-expression.</title>
                        <p>The red line represents the mean-dependent trend in the technical variance of the spike-in transcripts (also highlighted as red points).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure21.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">plot</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean, var.out$total,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cex=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.6</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Mean log-expression"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Variance of log-expression"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">points</styled-content>
                        <styled-content style="font-size:15px;">(var.fit$mean, var.fit$var,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"red"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">)
o &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">order</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">lines</styled-content>
                        <styled-content style="font-size:15px;">(var.out$mean[o], var.out$tech[o],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"red"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">lwd=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>HVGs are identified as genes with large positive biological components. These are saved to file for future reference. Note that some of the p-values are reported as zero due to numerical imprecision.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">hvg.out &lt;- var.out[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">which</styled-content>
                        <styled-content style="font-size:15px;">(var.out$FDR &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">0.05</styled-content> 
                        <styled-content style="font-size:15px;">&amp; var.out$bio &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">0.5</styled-content>
                        <styled-content style="font-size:15px;">),]
hvg.out &lt;- hvg.out[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">order</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out$bio,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">decreasing=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">),]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">nrow</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] 1755</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">write.table</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"brain_hvg.tsv"</styled-content>
                        <styled-content style="font-size:15px;">, hvg.out,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sep=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"\t"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">quote=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col.names=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">NA</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">head</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##          mean     total       bio      tech p.value FDR
## Plp1 4.045420 16.949056 16.681804 0.2672513       0   0
## Trf  2.427692 11.317924 10.745370 0.5725539       0   0
## Mal  2.454213 10.427362  9.860428 0.5669333       0   0
## Apod 2.044163  8.973862  8.319578 0.6542837       0   0
## Mog  1.974681  8.472565  7.803619 0.6689461       0   0
## Mbp  2.324417  7.853273  7.259729 0.5935431       0   0</styled-content>
                    </preformat>
                </p>
                <p>Again, we check the distribution of expression values for the top 10 HVGs to ensure that they are not being driven by outliers (
                    <xref ref-type="fig" rid="f22">Figure 22</xref>). Some tweaking of the 
                    <monospace>plotExpression</monospace> parameters is necessary to visualize a large number of cells.</p>
                <fig fig-type="figure" id="f22" orientation="portrait" position="float">
                    <label>Figure 22. </label>
                    <caption>
                        <title>Violin plots of normalized log-expression values for the top 10 HVGs in the brain dataset.</title>
                        <p>For each gene, each point represents the log-expression value for an individual cell.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure22.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">plotExpression</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">:</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">alpha=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.05</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">jitter=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"jitter"</styled-content>
                        <styled-content style="font-size:15px;">) + fontsize</styled-content>
                    </preformat>
                </p>
                <p>To identify genes involved in defining subpopulations, the set of HVGs is tested for significant pairwise correlations. Given the size of the set, we only use the top 500 HVGs to reduce computational work. Here, the number of significantly correlated pairs is much higher than in the HSC dataset, indicating that strong substructure is present. These results are also saved to file for use in designing validation experiments.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">set.seed</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">100</styled-content>
                        <styled-content style="font-size:15px;">)
var.cor &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">correlatePairs</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">design=</styled-content>
                        <styled-content style="font-size:15px;">design,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">subset.row=rownames</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">:</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">500</styled-content>
                        <styled-content style="font-size:15px;">])</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">write.table</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"brain_cor.tsv"</styled-content>
                        <styled-content style="font-size:15px;">, var.cor,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sep=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"\t"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">quote=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">row.names=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">head</styled-content>
                        <styled-content style="font-size:15px;">(var.cor)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##    gene1  gene2       rho      p.value          FDR
## 1   Meg3 Snhg11 0.8542706 1.999998e-06 2.611414e-06
## 2 Snap25  Stmn2 0.8023813 1.999998e-06 2.611414e-06
## 3 Ppp3ca  Prkcb 0.7977351 1.999998e-06 2.611414e-06
## 4 Atp1b1   Rtn1 0.7959162 1.999998e-06 2.611414e-06
## 5  Stmn3  Stmn2 0.7958141 1.999998e-06 2.611414e-06
## 6 Snap25  Ndrg4 0.7938286 1.999998e-06 2.611414e-06</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <styled-content style="font-size:15px;">sig.cor &lt;- var.cor$FDR &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">0.05</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">sum</styled-content>
                        <styled-content style="font-size:15px;">(sig.cor)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <styled-content style="font-size:15px;">## [1] 111798</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Further data exploration with the correlated HVGs</title>
                <p>We first remove the sex effect using the 
                    <monospace>removeBatchEffect</monospace> function from the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/release/bioc/html/limma.html">limma</ext-link>
                    </italic> package (
                    <xref ref-type="bibr" rid="ref-39">Ritchie 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). This ensures that any sex-specific differences will not dominate the visualization of the expression profiles. In this manner, we maintain consistency with the use of 
                    <monospace>design</monospace> in the previous steps. (However, if an analysis method can accept a design matrix, blocking on nuisance factors in the design matrix is preferable to manipulating the expression values with 
                    <monospace>removeBatchEffect</monospace>. This is because the latter does not account for the loss of residual degrees of freedom, nor the uncertainty of estimation of the blocking factor terms.) We store these sex-corrected expression values in the 
                    <monospace>norm_exprs</monospace> field of the 
                    <monospace>SCESet</monospace> object for later use.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(limma)</styled-content>

                        <styled-content style="font-size:15px;">adj.exprs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>

                        <styled-content style="font-size:15px;">adj.exprs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">removeBatchEffect</styled-content>
                        <styled-content style="font-size:15px;">(adj.exprs,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">batch=</styled-content>
                        <styled-content style="font-size:15px;">sce$sex)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">norm_exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce) &lt;- adj.exprs</styled-content>
                    </preformat>
                </p>
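                <p>To make the correction more concrete, the following minimal sketch reproduces the idea in base R on simulated data. A linear model with a sum-to-zero coded sex term is fitted to each gene, and the fitted sex term is then subtracted from the expression values. All object names (<monospace>toy.exprs</monospace>, <monospace>toy.sex</monospace>, <monospace>toy.adj</monospace>) and the simulated effect size are hypothetical. The code is only an approximation of what <monospace>removeBatchEffect</monospace> does for a single batch factor with no additional covariates.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><styled-content style="font-size:15px;">set.seed(100)

# Hypothetical toy data: 5 genes x 20 cells, with a shift of 2 in male cells.
toy.sex &lt;- factor(rep(c("F", "M"), each=10))
toy.exprs &lt;- matrix(rnorm(5 * 20), nrow=5)
toy.exprs[, toy.sex == "M"] &lt;- toy.exprs[, toy.sex == "M"] + 2

# Sum-to-zero coding, so that the intercept is the average of the two sex means.
X &lt;- model.matrix(~toy.sex, contrasts.arg=list(toy.sex="contr.sum"))

# Fit all genes at once (cells in rows) and subtract the fitted sex term.
fit &lt;- lm.fit(X, t(toy.exprs))
sex.term &lt;- X[, -1, drop=FALSE] %*% fit$coefficients[-1, , drop=FALSE]
toy.adj &lt;- toy.exprs - t(sex.term)

# Difference between the sex-specific means of each gene after adjustment.
rowMeans(toy.adj[, toy.sex == "F"]) - rowMeans(toy.adj[, toy.sex == "M"])</styled-content></preformat>
                </p>
                <p>After adjustment, the two sex-specific means coincide for every gene, so sex no longer contributes to distances between cells. The residual variation around those means is left untouched.</p>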
                <p>We perform dimensionality reduction on the correlated HVGs to check if there is any substructure. Cells separate into clear clusters in the 
                    <italic toggle="yes">t</italic>-SNE plot (
                    <xref ref-type="fig" rid="f23">Figure 23</xref>), corresponding to distinct subpopulations. This is consistent with the presence of multiple cell types in the diverse brain population.</p>
                <fig fig-type="figure" id="f23" orientation="portrait" position="float">
                    <label>Figure 23. </label>
                    <caption>
                        <title>
							
                            <italic toggle="yes">t</italic>-SNE plots constructed from the normalized and corrected log-expression values of correlated HVGs for cells in the brain dataset.</title>
                        <p>Each point represents a cell and is coloured according to its expression of the top HVG (left) or 
                            <italic toggle="yes">Mog</italic> (right).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure23.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">chosen &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">unique</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(var.cor$gene1[sig.cor], var.cor$gene2[sig.cor]))</styled-content>

                        <styled-content style="font-size:15px;">top.hvg &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(hvg.out)[</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>

                        <styled-content style="font-size:15px;">tsne1 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">plotTSNE</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"norm_exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">colour_by=</styled-content>
                        <styled-content style="font-size:15px;">top.hvg,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">perplexity=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rand_seed=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">100</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">chosen) + fontsize</styled-content>

                        <styled-content style="font-size:15px;">tsne2 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">plotTSNE</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"norm_exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Mog"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">perplexity=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rand_seed=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">100</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">chosen) + fontsize</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">multiplot</styled-content>
                        <styled-content style="font-size:15px;">(tsne1, tsne2,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cols=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>The PCA plot is less effective at separating cells into many different clusters (
                    <xref ref-type="fig" rid="f24">Figure 24</xref>). This is because the first two principal components are driven by strong differences between specific subpopulations, which reduces the resolution of more subtle differences between some of the other subpopulations. Nonetheless, some substructure is still visible.</p>
                <fig fig-type="figure" id="f24" orientation="portrait" position="float">
                    <label>Figure 24. </label>
                    <caption>
                        <title>PCA plots constructed from the normalized and corrected log-expression values of correlated HVGs for cells in the brain dataset.</title>
                        <p>Each point represents a cell and is coloured according to its expression of the top HVG (left) or 
                            <italic toggle="yes">Mog</italic> (right).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure24.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">pca1 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">plotPCA</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"norm_exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">colour_by=</styled-content>
                        <styled-content style="font-size:15px;">top.hvg) + fontsize</styled-content>

                        <styled-content style="font-size:15px;">pca2 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">plotPCA</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"norm_exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Mog"</styled-content>
                        <styled-content style="font-size:15px;">) + fontsize</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">multiplot</styled-content>
                        <styled-content style="font-size:15px;">(pca1, pca2,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cols=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
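                <p>The loss of resolution in the first two principal components can be illustrated with a small simulation in base R. This is a sketch under stated assumptions: the three groups, the effect sizes and all object names (<monospace>toy.exprs</monospace>, <monospace>toy.group</monospace>, <monospace>toy.pcs</monospace>) are hypothetical and are not derived from the brain dataset. One group differs strongly from the other two, which differ only subtly from each other.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><styled-content style="font-size:15px;">set.seed(100)

# Hypothetical toy data: 100 genes x 60 cells in three groups of 20 cells.
toy.group &lt;- rep(c("A", "B", "C"), each=20)
toy.exprs &lt;- matrix(rnorm(100 * 60), nrow=100)

# Group A differs strongly from B and C; B and C differ only subtly.
toy.exprs[1:50, toy.group == "A"] &lt;- toy.exprs[1:50, toy.group == "A"] + 4
toy.exprs[51:60, toy.group == "C"] &lt;- toy.exprs[51:60, toy.group == "C"] + 1

# PCA with cells in rows, as expected by prcomp().
toy.pcs &lt;- prcomp(t(toy.exprs))

# Proportion of variance explained by the leading components.
var.explained &lt;- toy.pcs$sdev^2 / sum(toy.pcs$sdev^2)
round(head(var.explained), 3)

# Mean coordinate of each group on the first component.
grp.mean &lt;- tapply(toy.pcs$x[, 1], toy.group, mean)
grp.mean</styled-content></preformat>
                </p>
                <p>In this setup, the first component is expected to be driven by the contrast between group A and the rest. The B and C group means should lie close together on that axis, even though the two groups are distinct.</p>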
                <p>For both methods, we colour each cell based on the expression of a particular gene. This is a useful strategy for visualizing changes in expression across the lower-dimensional space. It can also be used to characterise each cluster if the selected genes are known markers for particular cell types. For example, 
                    <italic toggle="yes">Mog</italic> can be used to identify clusters corresponding to oligodendrocytes.</p>
            </sec>
            <sec>
                <title>Clustering cells into putative subpopulations</title>
                <p>The normalized and sex-adjusted log-expression values for correlated HVGs are used to cluster cells into putative subpopulations. Specifically, we perform hierarchical clustering on the Euclidean distances between cells, using Ward&#x2019;s criterion to minimize the total variance within each cluster. This yields a dendrogram that groups together cells with similar expression patterns across the chosen genes. An alternative approach is to cluster on a matrix of distances derived from correlations (e.g., as in 
                    <monospace>quickCluster</monospace>). This is more robust to noise and normalization errors, but is also less sensitive to subtle changes in the expression profiles.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">chosen.exprs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">norm_exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce)[chosen,]</styled-content>

                        <styled-content style="font-size:15px;">my.dist &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">dist</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">t</styled-content>
                        <styled-content style="font-size:15px;">(chosen.exprs))</styled-content>

                        <styled-content style="font-size:15px;">my.tree &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">hclust</styled-content>
                        <styled-content style="font-size:15px;">(my.dist,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">method=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ward.D2"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
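                <p>The correlation-based alternative can be sketched in base R as follows. The toy data and object names are hypothetical. The particular transformation of Spearman&#x2019;s rho into a distance, <monospace>sqrt(0.5 * (1 - rho))</monospace>, is an assumption of this sketch and should not be read as the exact procedure used by <monospace>quickCluster</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><styled-content style="font-size:15px;">set.seed(100)

# Hypothetical toy data: 50 genes x 30 cells in two groups of 15 cells,
# each group having a different set of upregulated genes.
toy.exprs &lt;- matrix(rnorm(50 * 30), nrow=50)
toy.exprs[1:25, 1:15] &lt;- toy.exprs[1:25, 1:15] + 3
toy.exprs[26:50, 16:30] &lt;- toy.exprs[26:50, 16:30] + 3

# Spearman correlations between cells (columns), converted to distances.
rho &lt;- cor(toy.exprs, method="spearman")
cor.dist &lt;- as.dist(sqrt(0.5 * (1 - rho)))

# Hierarchical clustering on the correlation-derived distances.
cor.tree &lt;- hclust(cor.dist, method="ward.D2")
cor.clusters &lt;- cutree(cor.tree, k=2)
table(cor.clusters, rep(c("A", "B"), each=15))</styled-content></preformat>
                </p>
                <p>Because rank correlations ignore the magnitude of expression differences, this distance is insensitive to scaling errors between cells. For the same reason, it cannot distinguish between cells that differ only in the size of a shared expression change.</p>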
                <p>Clusters are explicitly defined by applying a dynamic tree cut (
                    <xref ref-type="bibr" rid="ref-22">Langfelder 
                        <italic toggle="yes">et al.</italic>, 2008</xref>) to the dendrogram. This exploits the shape of the branches in the dendrogram to refine the cluster definitions, and is more appropriate than 
                    <monospace>cutree</monospace> for complex dendrograms. Greater control of the empirical clusters can be obtained by manually specifying 
                    <monospace>cutHeight</monospace> in 
                    <monospace>cutreeDynamic</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(dynamicTreeCut)</styled-content>

                        <styled-content style="font-size:15px;">my.clusters &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">unname</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">cutreeDynamic</styled-content>
                        <styled-content style="font-size:15px;">(my.tree,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">distM=as.matrix</styled-content>
                        <styled-content style="font-size:15px;">(my.dist),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">verbose=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>
                    </preformat>
                </p>
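                <p>As an illustration of manual control over the cut, the sketch below compares a static <monospace>cutree</monospace> call with a <monospace>cutreeDynamic</monospace> call in which <monospace>cutHeight</monospace> and <monospace>minClusterSize</monospace> are set explicitly. The toy data, the object names and the chosen parameter values are hypothetical and arbitrary. Suitable values for a real dataset would need to be chosen by inspecting the dendrogram.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><styled-content style="font-size:15px;">library(dynamicTreeCut)
set.seed(100)

# Hypothetical toy data: 50 genes x 60 cells in three groups of 20 cells.
toy.exprs &lt;- matrix(rnorm(50 * 60), nrow=50)
toy.exprs[1:20, 21:40] &lt;- toy.exprs[1:20, 21:40] + 3
toy.exprs[21:40, 41:60] &lt;- toy.exprs[21:40, 41:60] + 3

toy.dist &lt;- dist(t(toy.exprs))
toy.tree &lt;- hclust(toy.dist, method="ward.D2")

# Static cut into a fixed number of clusters.
static.clusters &lt;- cutree(toy.tree, k=3)

# Dynamic cut with an explicit (arbitrary) height and minimum cluster size.
dynamic.clusters &lt;- unname(cutreeDynamic(toy.tree, distM=as.matrix(toy.dist),
    cutHeight=0.9 * max(toy.tree$height), minClusterSize=5, verbose=0))

# Cross-tabulate the two sets of assignments.
table(static.clusters, dynamic.clusters)</styled-content></preformat>
                </p>
                <p>A label of zero in the <monospace>cutreeDynamic</monospace> output denotes a cell that was not assigned to any cluster. Such cells should be checked before proceeding to downstream analyses.</p>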
                <p>
                    <xref ref-type="fig" rid="f25">Figure 25</xref> contains a clear block-like pattern, representing systematic differences between clusters of cells with distinct expression profiles. This is consistent with the presence of well-defined subpopulations that were previously observed in the dimensionality reduction plots.</p>
                <fig fig-type="figure" id="f25" orientation="portrait" position="float">
                    <label>Figure 25. </label>
                    <caption>
                        <title>Heatmap of mean-centred normalized and corrected log-expression values for correlated HVGs in the brain dataset.</title>
                        <p>Dendrograms are formed by hierarchical clustering on the Euclidean distances between genes (row) or cells (column). Column colours represent the cluster to which each cell is assigned after a dynamic tree cut.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure25.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">heat.vals &lt;- chosen.exprs -</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rowMeans</styled-content>
                        <styled-content style="font-size:15px;">(chosen.exprs)</styled-content>

                        <styled-content style="font-size:15px;">clust.col &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rainbow</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">max</styled-content>
                        <styled-content style="font-size:15px;">(my.clusters))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">heatmap.2</styled-content>
                        <styled-content style="font-size:15px;">(heat.vals,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;">bluered,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">symbreak=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trace=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">'none'</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cexRow=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">ColSideColors=</styled-content>
                        <styled-content style="font-size:15px;">clust.col[my.clusters],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Colv=as.dendrogram</styled-content>
                        <styled-content style="font-size:15px;">(my.tree))</styled-content>
                    </preformat>
                </p>
                <p>This heatmap can be stored at a greater resolution for detailed inspection later.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">pdf</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"brain_heat.pdf"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">width=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">20</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">height=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">40</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">heatmap.2</styled-content>
                        <styled-content style="font-size:15px;">(heat.vals,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;">bluered,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">symbreak=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trace=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">'none'</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cexRow=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.3</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">ColSideColors=</styled-content>
                        <styled-content style="font-size:15px;">clust.col[my.clusters],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Colv=as.dendrogram</styled-content>
                        <styled-content style="font-size:15px;">(my.tree))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">dev.off</styled-content>
                        <styled-content style="font-size:15px;">()</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Detecting marker genes between subpopulations</title>
                <p>Once putative subpopulations have been defined, we can detect marker genes for specific subpopulations of interest. This is done by finding genes that are consistently DE in one subpopulation compared to the others. DE testing can be performed using a number of packages, but for this workflow, we will use the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/edgeR">edgeR</ext-link>
                    </italic> package (
                    <xref ref-type="bibr" rid="ref-40">Robinson 
                        <italic toggle="yes">et al.</italic>, 2010</xref>). First, we set up a design matrix specifying which cells belong to each cluster. Each 
                    <monospace>cluster*</monospace> coefficient represents the average log-expression of all cells in the corresponding cluster. We also block on uninteresting factors such as sex.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">cluster &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">factor</styled-content>
                        <styled-content style="font-size:15px;">(my.clusters)</styled-content>

                        <styled-content style="font-size:15px;">de.design &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">model.matrix</styled-content>
                        <styled-content style="font-size:15px;">(~</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0</styled-content> 
                        <styled-content style="font-size:15px;">+ cluster + sce$sex)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">head</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">colnames</styled-content>
                        <styled-content style="font-size:15px;">(de.design))</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] "cluster1" "cluster2" "cluster3" "cluster4" "cluster5" "cluster6"</styled-content>
                    </preformat>
                </p>
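                <p>The structure of such a design matrix is easiest to see on a small example. The sketch below builds one in base R for six hypothetical cells. The factors <monospace>toy.cluster</monospace> and <monospace>toy.sex</monospace> are invented for illustration and do not correspond to cells in the brain dataset.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><styled-content style="font-size:15px;"># Hypothetical toy annotation: three clusters, each with one female and one male cell.
toy.cluster &lt;- factor(c(1, 1, 2, 2, 3, 3))
toy.sex &lt;- factor(c("F", "M", "F", "M", "F", "M"))

# No intercept, one column per cluster, plus an additive blocking term for sex.
toy.design &lt;- model.matrix(~0 + toy.cluster + toy.sex)
colnames(toy.design)
toy.design</styled-content></preformat>
                </p>
                <p>The resulting columns are <monospace>toy.cluster1</monospace>, <monospace>toy.cluster2</monospace>, <monospace>toy.cluster3</monospace> and <monospace>toy.sexM</monospace>. Each cell has a one in the column of its cluster, and male cells additionally have a one in <monospace>toy.sexM</monospace>. With this additive parametrization, the cluster columns refer to the reference sex level (here, female) and the <monospace>toy.sexM</monospace> column is an offset shared by all clusters. Differences between cluster coefficients are therefore unaffected by sex.</p>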
                <p>We set up a 
                    <monospace>DGEList</monospace> object for entry into the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/edgeR">edgeR</ext-link>
                    </italic> analysis. This new object contains all relevant information from the original 
                    <monospace>SCESet</monospace> object, including the counts and (library size-adjusted) size factors.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(edgeR)</styled-content>

                        <styled-content style="font-size:15px;">y &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">convertTo</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">type=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"edgeR"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/edgeR">edgeR</ext-link>
                    </italic> uses negative binomial (NB) distributions to model the read/UMI counts for each cell. We estimate the NB dispersion parameter that quantifies the biological variability in expression across cells in the same cluster. Dispersion estimates above 0.5 are often observed in scRNA-seq data due to technical noise, in contrast to bulk data where values of 0.05&#x2013;0.2 are more typical. We then use the design matrix to fit an NB GLM to the counts for each gene (
                    <xref ref-type="bibr" rid="ref-33">McCarthy 
                        <italic toggle="yes">et al.</italic>, 2012</xref>).</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">y &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">estimateDisp</styled-content>
                        <styled-content style="font-size:15px;">(y, de.design)</styled-content>

                        <styled-content style="font-size:15px;">fit &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">glmFit</styled-content>
                        <styled-content style="font-size:15px;">(y, de.design)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">summary</styled-content>
                        <styled-content style="font-size:15px;">(y$tagwise.dispersion)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##    Min. 1st Qu.  Median    Mean 3rd Qu.      Max.
## 0.04733 0.35370 0.64530 1.28600 1.32400 102.40000</styled-content>
                    </preformat>
                </p>
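                <p>To put these dispersion values in context, recall that an NB-distributed count with mean &#x3BC; and dispersion &#x3C6; has variance &#x3BC; + &#x3C6;&#x3BC;<sup>2</sup>, so the square root of the dispersion approximates the coefficient of variation for large counts. The following is a minimal sketch in base R, using made-up values for <monospace>mu</monospace> and <monospace>phi</monospace> rather than estimates from the brain dataset.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical mean count and dispersions (bulk-like vs. scRNA-seq-like).
mu &lt;- 100
phi &lt;- c(bulk=0.05, single.cell=0.5)

# NB variance is mu + phi * mu^2, so the squared CV is 1/mu + phi.
nb.var &lt;- mu + phi * mu^2
nb.cv &lt;- sqrt(nb.var) / mu
nb.cv

# Simulated counts should have a similar CV (rnbinom uses size = 1/phi).
set.seed(100)
sim &lt;- rnbinom(1e5, mu=mu, size=1/phi[["single.cell"]])
sd(sim) / mean(sim)</styled-content>
                    </preformat>
                </p>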
                <p>We assume that one of the clusters corresponds to our subpopulation of interest. Each gene is tested for DE between the chosen cluster and every other cluster in the dataset. We demonstrate this below for cluster 1, though the same process can be applied to any other cluster by changing 
                    <monospace>chosen.clust</monospace>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">result.logFC &lt;- result.PValue &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">list</styled-content>
                        <styled-content style="font-size:15px;">()</styled-content>

                        <styled-content style="font-size:15px;">chosen.clust &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">which</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">levels</styled-content>
                        <styled-content style="font-size:15px;">(cluster)==</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"1"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;"># character, as 'cluster' is a factor.</styled-content>

                        <styled-content style="font-size:15px;">for (clust in</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">seq_len</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">nlevels</styled-content>
                        <styled-content style="font-size:15px;">(cluster))) {</styled-content>
    
                        <styled-content style="font-size:15px;">if (clust==chosen.clust) { next }</styled-content>
    
                        <styled-content style="font-size:15px;">contrast &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">numeric</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">ncol</styled-content>
                        <styled-content style="font-size:15px;">(de.design))</styled-content>
    
                        <styled-content style="font-size:15px;">contrast[chosen.clust] &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
    
                        <styled-content style="font-size:15px;">contrast[clust] &lt;- -</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
    
                        <styled-content style="font-size:15px;">res &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">glmLRT</styled-content>
                        <styled-content style="font-size:15px;">(fit,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">contrast=</styled-content>
                        <styled-content style="font-size:15px;">contrast)</styled-content>
    
                        <styled-content style="font-size:15px;">con.name &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">paste0</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">'vs.'</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">levels</styled-content>
                        <styled-content style="font-size:15px;">(cluster)[clust])</styled-content>
    
                        <styled-content style="font-size:15px;">result.logFC[[con.name]] &lt;- res$table$logFC</styled-content>
    
                        <styled-content style="font-size:15px;">result.PValue[[con.name]] &lt;- res$table$PValue</styled-content>

                        <styled-content style="font-size:15px;">}</styled-content>
                    </preformat>
                </p>
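                <p>The contrast vector used in each comparison can be illustrated on a small scale. The sketch below assumes a design matrix with one coefficient per cluster, built with <monospace>model.matrix(~0 + cluster)</monospace>; the cluster assignments and the names <monospace>design</monospace>, <monospace>chosen</monospace> and <monospace>other</monospace> are hypothetical and are not taken from the brain dataset.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical cluster assignments for eight cells in three clusters.
cluster &lt;- factor(c(1, 1, 2, 2, 3, 3, 1, 2))
design &lt;- model.matrix(~0 + cluster) # one column per cluster

chosen &lt;- which(levels(cluster) == "1")
other &lt;- which(levels(cluster) == "3")

# +1 for the chosen cluster, -1 for the cluster it is compared against.
contrast &lt;- numeric(ncol(design))
contrast[chosen] &lt;- 1
contrast[other] &lt;- -1
names(contrast) &lt;- colnames(design)
contrast</styled-content>
                    </preformat>
                </p>
                <p>With this contrast, a positive log-fold change corresponds to higher expression in the chosen cluster than in the cluster it is compared against.</p>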
                <p>Potential marker genes are identified by taking the top set of DE genes from each pairwise comparison between clusters. We arrange the results into a single output table that allows a marker set to be easily defined for a user-specified size of the top set. For example, to construct a marker set from the top 10 genes of each comparison, one would filter 
                    <monospace>marker.set</monospace> to retain rows with 
                    <monospace>Top</monospace> less than or equal to 10.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">collected.ranks &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">lapply</styled-content>
                        <styled-content style="font-size:15px;">(result.PValue, rank,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ties.method=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"first"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">min.rank &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">do.call</styled-content>
                        <styled-content style="font-size:15px;">(pmin, collected.ranks)</styled-content>

                        <styled-content style="font-size:15px;">marker.set &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">data.frame</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">Top=</styled-content>
                        <styled-content style="font-size:15px;">min.rank,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Gene=rownames</styled-content>
                        <styled-content style="font-size:15px;">(y),</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">logFC=do.call</styled-content>
                        <styled-content style="font-size:15px;">(cbind, result.logFC),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">stringsAsFactors=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">marker.set &lt;- marker.set[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">order</styled-content>
                        <styled-content style="font-size:15px;">(marker.set$Top),]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">head</styled-content>
                        <styled-content style="font-size:15px;">(marker.set,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##     Top    Gene  logFC.vs.2  logFC.vs.3 logFC.vs.4 logFC.vs.5 logFC.vs.6 logFC.vs.7
## 26    1  Gm9846 -2.69173561 -0.89238306 -4.2332332 -1.0222698 -0.5414615 -2.5287437
## 223   1 Slc32a1  0.09461874 -0.04485368  0.1585265 -4.5682143 -1.4174543 -0.2546009
## 297   1   Cspg5 -1.30778951 -2.54296437 -1.5771899 -1.9881673 -1.4086953 -5.0830952
## 298   1    Syt1  2.78822084 -0.25850578  1.4804092 -0.8895181  0.3458730  1.8327007
## 862   1   Mef2c -1.08816401 -4.45879597 -2.9639706 -2.7639706 -2.9780931 -0.8413323
## 2563  1    Scd2 -4.45332845 -0.26021806 -1.0034850  0.1048065 -2.7348760 -3.4221061
## 260   2   Rcan2 -3.22472364 -3.05410260 -2.1732655 -4.5132580 -2.6087020 -0.9949232
## 309   2   Ndrg4  3.83886951 -0.34125245  2.6623976 -0.9701018  0.3516469  2.8775459
## 763   2     Clu -1.59785766 -2.42881333 -2.7317868 -1.9346444 -0.8799791 -6.1547824
## 963   2   Ncald -2.87305577 -4.43604787 -2.1004299 -4.5752214 -3.5526851 -1.5981341</styled-content>
                    </preformat>
                </p>
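                <p>The minimum-rank strategy can be checked on a toy example. The sketch below uses made-up <italic toggle="yes">p</italic>-values for five hypothetical genes in two comparisons; retaining rows with <monospace>Top</monospace> less than or equal to 2 yields the union of the top two genes from each comparison.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical p-values for five genes in two pairwise comparisons.
pvals &lt;- list(vs.2=c(0.001, 0.5, 0.02, 0.3, 0.9),
              vs.3=c(0.4, 0.0001, 0.03, 0.8, 0.2))
genes &lt;- paste0("Gene", 1:5)

# Rank genes within each comparison, then take the best rank per gene.
ranks &lt;- lapply(pvals, rank, ties.method="first")
toy.set &lt;- data.frame(Top=do.call(pmin, ranks), Gene=genes,
                      stringsAsFactors=FALSE)
toy.set &lt;- toy.set[order(toy.set$Top),]

# Marker set from the top 2 genes of each comparison.
toy.set[toy.set$Top &lt;= 2,]</styled-content>
                    </preformat>
                </p>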
                <p>We save the list of candidate marker genes for further examination. We also examine their expression profiles to verify that the DE signature is robust. 
                    <xref ref-type="fig" rid="f26">Figure 26</xref> indicates that most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. Thus, cells from the subpopulation of interest can be identified as those that express the upregulated markers and do not express the downregulated markers.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">write.table</styled-content>
                        <styled-content style="font-size:15px;">(marker.set,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"brain_marker_1.tsv"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sep=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"\t"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">quote=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col.names=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">NA</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">top.markers &lt;- marker.set$Gene[marker.set$Top &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>

                        <styled-content style="font-size:15px;">top.exprs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">norm_exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce)[top.markers,,drop=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">FALSE</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>

                        <styled-content style="font-size:15px;">heat.vals &lt;- top.exprs -</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rowMeans</styled-content>
                        <styled-content style="font-size:15px;">(top.exprs)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">heatmap.2</styled-content>
                        <styled-content style="font-size:15px;">(heat.vals,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;">bluered,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">symbreaks=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trace=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">'none'</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cexRow=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">0.6</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">ColSideColors=</styled-content>
                        <styled-content style="font-size:15px;">clust.col[my.clusters],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">Colv=as.dendrogram</styled-content>
                        <styled-content style="font-size:15px;">(my.tree),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">dendrogram=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">'none'</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">legend</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"bottomleft"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;">clust.col,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">legend=sort</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">unique</styled-content>
                        <styled-content style="font-size:15px;">(my.clusters)),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>Many of the markers in 
                    <xref ref-type="fig" rid="f26">Figure 26</xref> are not uniquely up- or downregulated in the chosen cluster. Testing for unique DE tends to be too stringent as it overlooks important genes that are expressed in two or more clusters. For example, in a mixed population of CD4
                    <sup>+</sup>-only, CD8
                    <sup>+</sup>-only, double-positive and double-negative T cells, neither 
                    <italic toggle="yes">Cd4</italic> nor 
                    <italic toggle="yes">Cd8</italic> would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. With our approach, both of these genes will be picked up as candidate markers as they will be DE between at least one pair of subpopulations. A combination of markers can then be chosen to characterize a subpopulation, which is more flexible than trying to find uniquely DE genes.</p>
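                <p>The T cell example can be sketched numerically. The values below are hypothetical mean log-expression values chosen purely for illustration, and the check for "unique" upregulation is a simplified stand-in for a formal test. Neither gene is highest in any single subpopulation, yet both show non-zero differences in at least one pairwise comparison against the CD4<sup>+</sup>-only cells.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical mean log-expression in four T cell subpopulations.
means &lt;- rbind(Cd4=c(CD4=5, CD8=0, DP=5, DN=0),
               Cd8=c(CD4=0, CD8=5, DP=5, DN=0))

# A gene is "uniquely" upregulated in a subpopulation only if it is
# higher there than in every other subpopulation.
unique.up &lt;- t(apply(means, 1, function(x) {
    sapply(seq_along(x), function(i) all(x[i] &gt; x[-i]))
}))
colnames(unique.up) &lt;- colnames(means)
unique.up

# Pairwise differences against the CD4-only subpopulation.
pairwise &lt;- means[, "CD4"] - means[, c("CD8", "DP", "DN")]
pairwise</styled-content>
                    </preformat>
                </p>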
                <fig fig-type="figure" id="f26" orientation="portrait" position="float">
                    <label>Figure 26. </label>
                    <caption>
                        <title>Heatmap of mean-centred normalized and corrected log-expression values for the top set of markers for cluster 1 in the brain dataset.</title>
                        <p>Column colours represent the cluster to which each cell is assigned, as indicated by the legend.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure26.gif"/>
                </fig>
                <p>It must be stressed that the 
                    <italic toggle="yes">p</italic>-values computed here cannot be interpreted as measures of significance. This is because the clusters have been empirically identified from the data. 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/edgeR">edgeR</ext-link>
                    </italic> does not account for the uncertainty and stochasticity in clustering, which means that the 
                    <italic toggle="yes">p</italic>-values are much lower than they should be. As such, these 
                    <italic toggle="yes">p</italic>-values should only be used for ranking candidate markers for follow-up studies. However, this is not a concern in other analyses where the groups are pre-defined. For such analyses, the FDR-adjusted 
                    <italic toggle="yes">p</italic>-value can be directly used to define significant genes for each DE comparison, though some care may be required to deal with plate effects (
                    <xref ref-type="bibr" rid="ref-12">Hicks 
                        <italic toggle="yes">et al.</italic>, 2015</xref>; 
                    <xref ref-type="bibr" rid="ref-45">Tung 
                        <italic toggle="yes">et al.</italic>, 2016</xref>).</p>
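                <p>For analyses with pre-defined groups, the multiple testing adjustment might look like the following sketch. It applies the Benjamini-Hochberg method via <monospace>p.adjust</monospace> to made-up <italic toggle="yes">p</italic>-values; the 5% FDR threshold is an arbitrary choice for illustration.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical p-values from a DE comparison between pre-defined groups.
pvals &lt;- c(0.0001, 0.004, 0.02, 0.2, 0.6)

# Benjamini-Hochberg adjustment to control the FDR.
fdr &lt;- p.adjust(pvals, method="BH")
is.sig &lt;- fdr &lt;= 0.05
data.frame(PValue=pvals, FDR=fdr, Significant=is.sig)</styled-content>
                    </preformat>
                </p>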
            </sec>
            <sec>
                <title>Additional comments</title>
                <p>Having completed the basic analysis, we save the 
                    <monospace>SCESet</monospace> object with its associated data to file. This is especially important here as the brain dataset is quite large. If further analyses are to be performed, it would be inconvenient to have to repeat all of the pre-processing steps described above.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">saveRDS</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">file=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"brain_data.rds"</styled-content>
                        <styled-content style="font-size:15px;">, sce)</styled-content>
                    </preformat>
                </p>
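                <p>A saved object can be restored in a later session with <monospace>readRDS</monospace>. The round trip is sketched below with a small stand-in list and a hypothetical file in a temporary directory, rather than the full brain dataset.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># A small stand-in object; in the workflow this would be 'sce'.
obj &lt;- list(counts=matrix(1:6, nrow=2), note="toy example")

# Hypothetical file path in a temporary directory.
path &lt;- file.path(tempdir(), "toy_data.rds")
saveRDS(obj, file=path)

# Restore the object and check that it is unchanged.
restored &lt;- readRDS(path)
identical(obj, restored)</styled-content>
                    </preformat>
                </p>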
            </sec>
        </sec>
        <sec>
            <title>Alternative parameter settings and strategies</title>
            <sec>
                <title>Normalizing based on spike-in coverage</title>
                <p>Scaling normalization strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not DE between samples, as previously described. The second class uses the fact that the same amount of spike-in RNA was added to each cell. Differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. Scaling normalization is then applied to equalize spike-in coverage across cells.</p>
                <p>The choice between these two normalization strategies depends on the biology of the cells and the features of interest. If the majority of genes are expected to be DE and there is no reliable housekeeping set, spike-in normalization may be the only option for removing cell-specific biases. Spike-in normalization should also be used if differences in the total RNA content of individual cells are of interest. In any particular cell, an increase in the amount of endogenous RNA will not increase spike-in coverage (with or without library quantification). Thus, changes in total RNA content do not contribute to the bias estimated from the spike-in transcripts, which means that the effects of total RNA content on expression will not be removed upon scaling. With non-DE normalization, an increase in RNA content will systematically increase the expression of all genes in the non-DE subset, such that it will be treated as bias and removed.</p>
                <p>We demonstrate the use of spike-in normalization on a dataset involving different cell types &#x2013; namely, mouse embryonic stem cells (mESCs) and mouse embryonic fibroblasts (MEFs) (
                    <xref ref-type="bibr" rid="ref-15">Islam 
                        <italic toggle="yes">et al.</italic>, 2011</xref>). The count table was obtained from NCBI GEO as a supplementary file under the accession GSE29087 (
                    <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29087">http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29087</ext-link>). We load the counts into R and specify the rows corresponding to spike-in transcripts. The negative control wells do not contain any cells and are useful for quality control but need to be removed prior to downstream analysis.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">counts &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">read.table</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"GSE29087_L139_expression_tab.txt.gz"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">colClasses=c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">list</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"character"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#8F5903;">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">NULL</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;">NULL</styled-content>
                        <styled-content style="font-size:15px;">),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rep</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"integer"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">96</styled-content>
                        <styled-content style="font-size:15px;">)),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">skip=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">6</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sep=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">'\t'</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">row.names=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">newSCESet</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">countData=</styled-content>
                        <styled-content style="font-size:15px;">counts)</styled-content>

                        <styled-content style="font-size:15px;">sce$grouping &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rep</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"mESC"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"MEF"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"Neg"</styled-content>
                        <styled-content style="font-size:15px;">),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">48</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">44</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF;">4</styled-content>
                        <styled-content style="font-size:15px;">))</styled-content>

                        <styled-content style="font-size:15px;">sce &lt;- sce[,sce$grouping!=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Neg"</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;"># Removing negative control wells.</styled-content>

                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">calculateQCMetrics</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">feature_controls=list</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">spike=grep</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"SPIKE"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(counts))))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce) &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905;">"spike"</styled-content>
                    </preformat>
                </p>
                <p>We then apply the 
                    <monospace>computeSpikeFactors</monospace> method to estimate size factors for all cells. This method computes the total count over all spike-in transcripts in each cell, and calculates size factors to equalize the total spike-in count across cells. Here, we set 
                    <monospace>general.use=TRUE</monospace> as we intend to apply the spike-in factors to all counts.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">computeSpikeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">general.use=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
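                <p>The calculation described above can be sketched in base R on a toy matrix. This is only an illustration of the principle, not the <monospace>computeSpikeFactors</monospace> implementation; the counts, the <monospace>SPIKE</monospace> row names and the choice to centre the size factors at a mean of one are assumptions made for the example.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
# Toy count matrix: rows are features, columns are cells (values are made up).
counts &lt;- matrix(c(10, 20,  30,
                    5, 10,  15,
                   50, 25, 100,
                   50, 25, 100),
                 nrow=4, byrow=TRUE,
                 dimnames=list(c("GeneA", "GeneB", "SPIKE1", "SPIKE2"),
                               c("cell1", "cell2", "cell3")))
is.spike &lt;- grepl("^SPIKE", rownames(counts))

# Total spike-in count in each cell.
spike.totals &lt;- colSums(counts[is.spike, , drop=FALSE])

# Size factors proportional to the totals, centred at a mean of one (assumption).
spike.sf &lt;- spike.totals / mean(spike.totals)

# Dividing each cell by its size factor equalizes the spike-in totals.
adjusted &lt;- colSums(t(t(counts[is.spike, , drop=FALSE]) / spike.sf))
                    </preformat>
                </p>
                <p>After scaling, every cell in this toy example has the same total spike-in count, which is the property that spike-in normalization aims for.</p>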
                <p>Applying 
                    <monospace>normalize</monospace> will use the spike-in-based size factors to compute normalized log-expression values. Unlike in the previous analyses, we do not have to set separate size factors for the spike-in transcripts. This is because the relevant factors are already being used for all genes and spike-in transcripts when 
                    <monospace>general.use=TRUE</monospace>. (The exception is if the experiment uses multiple spike-in sets that behave differently and need to be normalized separately.)</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">normalize</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
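                <p>As a rough sketch of what this step amounts to, each count is divided by the size factor of its cell before log-transformation. The pseudo-count of 1 and the base-2 logarithm are assumptions here, and the counts and size factors are made up; the documentation for <monospace>normalize</monospace> should be consulted for the actual behaviour.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
# Made-up counts (genes in rows, cells in columns) and size factors.
counts &lt;- matrix(c(0, 4, 12,
                   3, 6, 24),
                 nrow=2, byrow=TRUE,
                 dimnames=list(c("GeneA", "GeneB"),
                               c("cell1", "cell2", "cell3")))
size.factors &lt;- c(cell1=0.5, cell2=1, cell3=2)

# Divide each cell (column) by its size factor, then log-transform.
norm.counts &lt;- t(t(counts) / size.factors)
log.exprs &lt;- log2(norm.counts + 1)
                    </preformat>
                </p>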
                <p>For comparison, we also compute the deconvolution size factors and plot them against the spike-in factors. We observe a negative correlation between the two sets of values (
                    <xref ref-type="fig" rid="f27">Figure 27</xref>). This is because MEFs contain more endogenous RNA, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the deconvolution size factors). If the spike-in size factors were applied to the counts, the expression values in MEFs would be scaled up while expression in mESCs would be scaled down. However, the opposite would occur if deconvolution size factors were used.</p>
                <fig fig-type="figure" id="f27" orientation="portrait" position="float">
                    <label>Figure 27. </label>
                    <caption>
                        <title>Size factors from spike-in normalization, plotted against the size factors from deconvolution for all cells in the mESC/MEF dataset.</title>
                        <p>Axes are shown on a log-scale, and cells are coloured according to their identity. Deconvolution size factors were computed with small pool sizes owing to the low number of cells of each type.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure27.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">colours &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">mESC=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"red"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">MEF=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"grey"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">deconv.sf &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">computeSumFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sf.out=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cluster=</styled-content>
                        <styled-content style="font-size:15px;">sce$grouping,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sizes=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">:</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">4</styled-content>
                        <styled-content style="font-size:15px;">*</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">10</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">plot</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">sizeFactors</styled-content>
                        <styled-content style="font-size:15px;">(sce), deconv.sf,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;">colours[sce$grouping],</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">log=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"xy"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Size factor (spike-in)"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"Size factor (deconvolution)"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">legend</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"bottomleft"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">col=</styled-content>
                        <styled-content style="font-size:15px;">colours,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">legend=names</styled-content>
                        <styled-content style="font-size:15px;">(colours),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
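                <p>The reasoning behind the negative correlation can be illustrated with a small numerical sketch. All molecule numbers and the sequencing depth below are hypothetical, and simple library-size factors on the endogenous genes are used as a crude stand-in for the deconvolution factors.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
# Hypothetical molecule numbers per cell: the same amount of spike-in RNA is
# added to every cell, but MEFs carry more endogenous RNA than mESCs.
spike.molecules &lt;- c(mESC=100, MEF=100)
endog.molecules &lt;- c(mESC=900, MEF=4500)

# Both libraries are sequenced to the same depth, and reads are split in
# proportion to the molecules present.
depth &lt;- 10000
total &lt;- spike.molecules + endog.molecules
spike.reads &lt;- depth * spike.molecules / total
endog.reads &lt;- depth * endog.molecules / total

# Spike-in size factors, and library-size factors on the endogenous genes
# (a stand-in for the deconvolution factors).
spike.sf &lt;- spike.reads / mean(spike.reads)
endog.sf &lt;- endog.reads / mean(endog.reads)

rbind(spike.sf, endog.sf)
                    </preformat>
                </p>
                <p>In this toy setting, the MEF has the smaller spike-in size factor but the larger endogenous size factor, so the two normalization strategies scale its expression values in opposite directions.</p>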
                <p>Whether or not total RNA content is relevant &#x2013; and thus, the choice of normalization strategy &#x2013; depends on the biological hypothesis. In the HSC and brain analyses, variability in total RNA across the population was treated as noise and removed by non-DE normalization. This may not always be appropriate if total RNA is associated with a biological difference of interest. For example, 
                    <xref ref-type="bibr" rid="ref-15">Islam 
                        <italic toggle="yes">et al.</italic> (2011)</xref> observe a 5-fold difference in total RNA between mESCs and MEFs. Similarly, the total RNA in a cell changes across phases of the cell cycle (
                    <xref ref-type="bibr" rid="ref-7">Buettner 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). Spike-in normalization will preserve these differences in total RNA content such that the corresponding biological groups can be easily resolved in downstream analyses.</p>
            </sec>
            <sec>
                <title>Blocking on the cell cycle phase</title>
                <p>Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. This cannot be avoided by simply removing cell cycle marker genes, as the cell cycle can affect a substantial number of other transcripts (
                    <xref ref-type="bibr" rid="ref-7">Buettner 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). Rather, more sophisticated strategies are required, one of which is demonstrated below using data from a study of T helper 2 (T
                    <sub>H</sub>2) cells (
                    <xref ref-type="bibr" rid="ref-31">Mahata 
                        <italic toggle="yes">et al.</italic>, 2014</xref>). 
                    <xref ref-type="bibr" rid="ref-7">Buettner 
                        <italic toggle="yes">et al.</italic> (2015)</xref> have already applied quality control and normalized the data, so we can use them directly as log-expression values (accessible as Supplementary Data 1 of 
                    <ext-link ext-link-type="uri" xlink:href="https://dx.doi.org/10.1038/nbt.3102">https://dx.doi.org/10.1038/nbt.3102</ext-link>).</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87;">library</styled-content>
                        <styled-content style="font-size:15px;">(openxlsx)</styled-content>

                        <styled-content style="font-size:15px;">incoming &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">read.xlsx</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"nbt.3102-S7.xlsx"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">sheet=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">1</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">rowNames=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">incoming &lt;- incoming[,!</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">duplicated</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">colnames</styled-content>
                        <styled-content style="font-size:15px;">(incoming))]</styled-content> 
                        <styled-content style="font-size:15px;color:#8F5903;"># Remove duplicated genes.</styled-content>

                        <styled-content style="font-size:15px;">sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">newSCESet</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">exprsData=t</styled-content>
                        <styled-content style="font-size:15px;">(incoming),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">logged=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">TRUE</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
                <p>We empirically identify the cell cycle phase using the pair-based classifier in 
                    <monospace>cyclone</monospace>. The majority of cells in 
                    <xref ref-type="fig" rid="f28">Figure 28</xref> seem to lie in G1 phase, with small numbers of cells in the other phases.</p>
                <fig fig-type="figure" id="f28" orientation="portrait" position="float">
                    <label>Figure 28. </label>
                    <caption>
                        <title>Cell cycle phase scores from applying the pair-based classifier on the T
                            <sub>H</sub>2 dataset, where each point represents a cell.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure28.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">anno &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">select</styled-content>
                        <styled-content style="font-size:15px;">(org.Mm.eg.db,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">keys=rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">keytype=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"SYMBOL"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">columns=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"ENSEMBL"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">ensembl &lt;- anno$ENSEMBL[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">match</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87;">rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce), anno$SYMBOL)]</styled-content>

                        <styled-content style="font-size:15px;">assignments &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">cyclone</styled-content>
                        <styled-content style="font-size:15px;">(sce, mm.pairs,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">gene.names=</styled-content>
                        <styled-content style="font-size:15px;">ensembl,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">assay=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"exprs"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87;">plot</styled-content>
                        <styled-content style="font-size:15px;">(assignments$score$G1, assignments$score$G2M,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">xlab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"G1 score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">ylab=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"G2/M score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">pch=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF;">16</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
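                <p>To give an intuition for pair-based classification, the sketch below scores each cell by the fraction of marker pairs in which the first gene is more highly expressed than the second. The gene names, pairs and expression values are hypothetical, and this is a simplification: <monospace>cyclone</monospace> relies on a pre-trained set of pairs and applies further processing to the scores that is not reproduced here.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
# Hypothetical marker pairs for one phase: within that phase, the first gene
# of each pair is expected to be more highly expressed than the second.
g1.pairs &lt;- data.frame(first=c("GeneA", "GeneC", "GeneE"),
                       second=c("GeneB", "GeneD", "GeneF"),
                       stringsAsFactors=FALSE)

# Made-up expression values for two cells.
exprs.mat &lt;- cbind(
    cell1=c(GeneA=5, GeneB=1, GeneC=4, GeneD=2, GeneE=3, GeneF=0),
    cell2=c(GeneA=1, GeneB=5, GeneC=2, GeneD=4, GeneE=3, GeneF=0))

# Fraction of pairs in which the expected ordering is observed in a cell.
pairScore &lt;- function(x, pairs) {
    mean(x[pairs$first] &gt; x[pairs$second])
}
scores &lt;- apply(exprs.mat, 2, pairScore, pairs=g1.pairs)
                    </preformat>
                </p>
                <p>Because the score depends only on the relative ordering of genes within a cell, it is insensitive to any cell-specific scaling of the expression values.</p>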
                <p>We can block directly on the phase scores in downstream analyses. This is more graduated than a strict assignment of each cell to a specific phase, as the magnitude of the score reflects the uncertainty of the assignment. The phase covariates in the design matrix will absorb any phase-related effects on expression, so that they do not affect estimation of the effects of other experimental factors. Users should also ensure that the phase score is not confounded with other factors of interest. For example, model fitting is not possible if all cells in one experimental condition are in one phase, and all cells in another condition are in a different phase.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">design &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">model.matrix</styled-content>
                        <styled-content style="font-size:15px;">(~ G1 + G2M, assignments$score)</styled-content>

                        <styled-content style="font-size:15px;">fit.block &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trendVar</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">use.spikes=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903;">NA</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">trend=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905;">"loess"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">design=</styled-content>
                        <styled-content style="font-size:15px;">design)</styled-content>

                        <styled-content style="font-size:15px;">dec.block &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87;">decomposeVar</styled-content>
                        <styled-content style="font-size:15px;">(sce, fit.block)</styled-content>
                    </preformat>
                </p>
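                <p>The confounding problem mentioned above can be checked before model fitting by comparing the rank of the design matrix to its number of columns. The sketch below uses hypothetical conditions and, for simplicity, strict phase assignments rather than scores.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
# Hypothetical design in which condition and phase are perfectly confounded.
condition &lt;- factor(rep(c("A", "B"), each=3))
phase.bad &lt;- factor(rep(c("G1", "G2M"), each=3))
design.bad &lt;- model.matrix(~ condition + phase.bad)

# The same conditions, with both phases represented in each condition.
phase.ok &lt;- factor(c("G1", "G1", "G2M", "G1", "G2M", "G2M"))
design.ok &lt;- model.matrix(~ condition + phase.ok)

# Coefficients are estimable only if the design has full column rank.
rank.bad &lt;- qr(design.bad)$rank
rank.ok &lt;- qr(design.ok)$rank
c(bad=rank.bad, ok=rank.ok, columns=ncol(design.ok))
                    </preformat>
                </p>
                <p>The confounded design has three columns but a rank of only two, so the condition and phase effects cannot be separated.</p>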
                <p>For analyses that do not use design matrices, we remove the cell cycle effect directly from the expression values using 
                    <monospace>removeBatchEffect</monospace>. The result of this procedure is visualized with PCA plots in 
                    <xref ref-type="fig" rid="f29">Figure 29</xref>. Before removal, the distribution of cells along the first two principal components is strongly associated with their G1 and G2/M scores. This is no longer the case after removal, which suggests that the cell cycle effect has been mitigated.</p>
                <fig fig-type="figure" id="f29" orientation="portrait" position="float">
                    <label>Figure 29. </label>
                    <caption>
                        <title>PCA plots before (left) and after (right) removal of the cell cycle effect in the T
                            <sub>H</sub>2 dataset.</title>
                        <p>Each cell is represented by a point with colour and size determined by the G1 and G2/M scores, respectively. Only HVGs were used to construct each plot.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure29.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#8F5903"># Finding HVGs without blocking on phase score.</styled-content>

                        <styled-content style="font-size:15px;">fit &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">trendVar</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">use.spikes=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">NA</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">trend=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"loess"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">dec &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">decomposeVar</styled-content>
                        <styled-content style="font-size:15px;">(sce, fit)</styled-content>

                        <styled-content style="font-size:15px;">top.hvgs &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">which</styled-content>
                        <styled-content style="font-size:15px;">(dec$FDR &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.05</styled-content> 
                        <styled-content style="font-size:15px;">&amp; dec$bio &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.5</styled-content>
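                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;">sce$G1score &lt;- assignments$score$G1</styled-content>

                        <styled-content style="font-size:15px;">sce$G2Mscore &lt;- assignments$score$G2M</styled-content>

                        <styled-content style="font-size:15px;">out &lt;-</styled-content> 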
                        <styled-content style="font-size:15px;color:#214A87">plotPCA</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">top.hvgs,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G1score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">size_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G2Mscore"</styled-content>
                        <styled-content style="font-size:15px;">) +
    fontsize +</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ggtitle</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Before removal"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>


                        <styled-content style="font-size:15px;color:#8F5903"># Using HVGs after blocking on the phase score.</styled-content>

                        <styled-content style="font-size:15px;">top.hvgs2 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">which</styled-content>
                        <styled-content style="font-size:15px;">(dec.block$FDR &lt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.05</styled-content> 
                        <styled-content style="font-size:15px;">&amp; dec.block$bio &gt;=</styled-content> 
                        <styled-content style="font-size:15px;color:#0000CF">0.5</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">norm_exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce) &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">removeBatchEffect</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">exprs</styled-content>
                        <styled-content style="font-size:15px;">(sce),</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">covariates=</styled-content>
                        <styled-content style="font-size:15px;">assignments$score[,</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">c</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G1"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905">"G2M"</styled-content>
                        <styled-content style="font-size:15px;">)])</styled-content>

                        <styled-content style="font-size:15px;">out2 &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">plotPCA</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"norm_exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">feature_set=</styled-content>
                        <styled-content style="font-size:15px;">top.hvgs2,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G1score"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content>
    
                        <styled-content style="font-size:15px;color:#214A87">size_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"G2Mscore"</styled-content>
                        <styled-content style="font-size:15px;">) + fontsize +</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">ggtitle</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"After removal"</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">multiplot</styled-content>
                        <styled-content style="font-size:15px;">(out, out2,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">cols=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">2</styled-content>
                        <styled-content style="font-size:15px;">)</styled-content>
                    </preformat>
                </p>
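                <p>The idea behind removing covariate effects from the expression values can be sketched with ordinary least squares in base R. This is an illustration under stated assumptions and not the <monospace>removeBatchEffect</monospace> implementation: the scores and expression values are made up, and the covariates are centred here so that the mean of each gene is left unchanged.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
# Made-up phase scores for eight cells.
G1  &lt;- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
G2M &lt;- c(0.1, 0.3, 0.2, 0.5, 0.4, 0.8, 0.6, 0.9)

# Two toy genes: one driven by the phase scores, one unrelated to them.
wiggle &lt;- c(0.2, -0.1, 0.3, -0.3, 0.1, -0.2, 0.25, -0.25)
exprs.mat &lt;- rbind(GeneA=5 + 4*G1 - 2*G2M + wiggle,
                   GeneB=3 + wiggle)

# Centre the covariates so that the mean of each gene is preserved.
covs &lt;- scale(cbind(G1, G2M), center=TRUE, scale=FALSE)

# Fit a linear model to each gene and subtract the fitted covariate effects.
fit &lt;- lm.fit(cbind(Intercept=1, covs), t(exprs.mat))
beta &lt;- fit$coefficients[-1, , drop=FALSE]
corrected &lt;- exprs.mat - t(covs %*% beta)
                    </preformat>
                </p>
                <p>The corrected values are uncorrelated with both scores by construction, which is why the association with the principal components is expected to weaken after removal.</p>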
                <p>As an aside, this dataset contains cells at various stages of differentiation (
                    <xref ref-type="bibr" rid="ref-31">Mahata 
                        <italic toggle="yes">et al.</italic>, 2014</xref>). This is an ideal use case for diffusion maps, which perform dimensionality reduction along a continuous process. In 
                    <xref ref-type="fig" rid="f30">Figure 30</xref>, cells are arranged along a trajectory in the low-dimensional space. The first diffusion component is likely to correspond to T
                    <sub>H</sub>2 differentiation, given that a key regulator 
                    <italic toggle="yes">Gata3</italic> (
                    <xref ref-type="bibr" rid="ref-50">Zhu 
                        <italic toggle="yes">et al.</italic>, 2006</xref>) changes in expression from left to right.</p>
                <fig fig-type="figure" id="f30" orientation="portrait" position="float">
                    <label>Figure 30. </label>
                    <caption>
                        <title>A diffusion map for the T
                            <sub>H</sub>2 dataset, where each cell is coloured by its expression of 
                            <italic toggle="yes">Gata3</italic>.</title>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10712/3370b003-e980-4395-b641-daad385c65d3_figure30.gif"/>
                </fig>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">plotDiffusionMap</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">exprs_values=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"norm_exprs"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">colour_by=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"Gata3"</styled-content>
                        <styled-content style="font-size:15px;">) + fontsize</styled-content>
                    </preformat>
                </p>
            </sec>
            <sec>
                <title>Extracting annotation from Ensembl identifiers</title>
                <p>Feature-counting tools typically report genes in terms of standard identifiers from Ensembl or Entrez. These identifiers are used because they are unambiguous and highly stable. However, they are more difficult to interpret than gene symbols, which are more commonly used in the literature. We can easily convert from one to the other using annotation packages like 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html">org.Mm.eg.db</ext-link>
                    </italic>. This is demonstrated below for Ensembl identifiers in a mESC dataset (
                    <xref ref-type="bibr" rid="ref-21">Kolodziejczyk 
                        <italic toggle="yes">et al.</italic>, 2015</xref>) obtained from 
                    <ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/teichmann-srv/espresso">http://www.ebi.ac.uk/teichmann-srv/espresso</ext-link>. The 
                    <monospace>select</monospace> call extracts the specified data from the annotation object, and the 
                    <monospace>match</monospace> call ensures that the first gene symbol is used if multiple symbols correspond to a single Ensembl identifier.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">incoming &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">read.table</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"counttable_es.csv"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">header=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">TRUE</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">row.names=</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">1</styled-content>
                        <styled-content style="font-size:15px;">)
my.ids &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">rownames</styled-content>
                        <styled-content style="font-size:15px;">(incoming)
anno &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">select</styled-content>
                        <styled-content style="font-size:15px;">(org.Mm.eg.db,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">keys=</styled-content>
                        <styled-content style="font-size:15px;">my.ids,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">keytype=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"ENSEMBL"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">columns=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"SYMBOL"</styled-content>
                        <styled-content style="font-size:15px;">)
anno &lt;- anno[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">match</styled-content>
                        <styled-content style="font-size:15px;">(my.ids, anno$ENSEMBL),]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">head</styled-content>
                        <styled-content style="font-size:15px;">(anno)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">##              ENSEMBL SYMBOL
## 1 ENSMUSG00000000001  Gnai3
## 2 ENSMUSG00000000003   Pbsn
## 3 ENSMUSG00000000028  Cdc45
## 4 ENSMUSG00000000031    H19
## 5 ENSMUSG00000000037  Scml2
## 6 ENSMUSG00000000049   Apoh</styled-content>
                    </preformat>
                </p>
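                <p>The effect of the 
                    <monospace>match</monospace> step can be illustrated with a minimal sketch in base R. The identifiers and symbols in the toy table below are hypothetical and do not come from the mESC dataset; they only mimic a one-to-many table of the kind that 
                    <monospace>select</monospace> can return. Because 
                    <monospace>match</monospace> reports the position of the first occurrence of each identifier, exactly one row is retained per input identifier, in the same order as the input.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical identifiers, standing in for rownames(incoming).
toy.ids &lt;- c("GENE_A", "GENE_B", "GENE_C")

# Hypothetical annotation table: GENE_B maps to two symbols, GENE_C to none.
toy.anno &lt;- data.frame(ENSEMBL=c("GENE_A", "GENE_B", "GENE_B", "GENE_C"),
    SYMBOL=c("SymA", "SymB1", "SymB2", NA), stringsAsFactors=FALSE)

# match() returns the first matching row for each identifier.
m &lt;- match(toy.ids, toy.anno$ENSEMBL)
toy.anno &lt;- toy.anno[m,]

# One row per identifier; only the first symbol of GENE_B is kept.
toy.anno$SYMBOL</styled-content>
                    </preformat>
                </p>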
                <p>To identify which rows correspond to mitochondrial genes, we need to use extra annotation describing the genomic location of each gene. For Ensembl, this involves using the 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/release/data/annotation/html/TxDb.Mmusculus.UCSC.mm10.ensGene.html">TxDb.Mmusculus.UCSC.mm10.ensGene</ext-link>
                    </italic> package.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;color:#214A87">library</styled-content>
                        <styled-content style="font-size:15px;">(TxDb.Mmusculus.UCSC.mm10.ensGene)
location &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">select</styled-content>
                        <styled-content style="font-size:15px;">(TxDb.Mmusculus.UCSC.mm10.ensGene,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">keys=</styled-content>
                        <styled-content style="font-size:15px;">my.ids,</styled-content>
     
                        <styled-content style="font-size:15px;color:#214A87">columns=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"CDSCHROM"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">keytype=</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"GENEID"</styled-content>
                        <styled-content style="font-size:15px;">)
location &lt;- location[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">match</styled-content>
                        <styled-content style="font-size:15px;">(my.ids, location$GENEID),]
is.mito &lt;- location$CDSCHROM ==</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905">"chrM"</styled-content> 
                        <styled-content style="font-size:15px;">&amp; !</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">is.na</styled-content>
                        <styled-content style="font-size:15px;">(location$CDSCHROM)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">sum</styled-content>
                        <styled-content style="font-size:15px;">(is.mito)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] 13</styled-content>
                    </preformat>
                </p>
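                <p>The additional 
                    <monospace>!is.na</monospace> term in the code above guards against genes without an annotated location. The following sketch, which uses made-up chromosome names rather than the real annotation, shows why: comparing a missing value to a string yields 
                    <monospace>NA</monospace> rather than 
                    <monospace>FALSE</monospace>, which would propagate into any subsequent sum or subsetting operation.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical chromosome assignments; NA mimics a gene with no known location.
toy.chrom &lt;- c("chr1", "chrM", NA, "chrM")

# A direct comparison propagates the missing value, so the sum is NA.
naive &lt;- toy.chrom == "chrM"
sum(naive)

# Combining with !is.na() turns the missing entry into FALSE, so the sum is 2.
toy.mito &lt;- toy.chrom == "chrM" &amp; !is.na(toy.chrom)
sum(toy.mito)</styled-content>
                    </preformat>
                </p>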
                <p>Identifying rows that correspond to spike-in transcripts is much easier, as the ERCC spike-ins used in this dataset have identifiers that begin with "ERCC".</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">is.spike &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">grepl</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"^ERCC"</styled-content>
                        <styled-content style="font-size:15px;">, my.ids)</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">sum</styled-content>
                        <styled-content style="font-size:15px;">(is.spike)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## [1] 92</styled-content>
                    </preformat>
                </p>
                <p>All of this information can be consolidated into a 
                    <monospace>SCESet</monospace> object for further manipulation. Alternatively, annotation from BioMart resources can be directly added to the object using the 
                    <monospace>getBMFeatureAnnos</monospace> function from 
                    <italic toggle="yes">
                        <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/release/bioc/html/scater.html">scater</ext-link>
                    </italic>.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">anno &lt;- anno[,-</styled-content>
                        <styled-content style="font-size:15px;color:#0000CF">1</styled-content>
                        <styled-content style="font-size:15px;">,drop=</styled-content>
                        <styled-content style="font-size:15px;color:#8F5903">FALSE</styled-content>
                        <styled-content style="font-size:15px;">]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">rownames</styled-content>
                        <styled-content style="font-size:15px;">(anno) &lt;- my.ids
sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">newSCESet</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">countData=</styled-content>
                        <styled-content style="font-size:15px;">incoming,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">featureData=AnnotatedDataFrame</styled-content>
                        <styled-content style="font-size:15px;">(anno))
sce &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">calculateQCMetrics</styled-content>
                        <styled-content style="font-size:15px;">(sce,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">feature_controls=list</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">ERCC=</styled-content>
                        <styled-content style="font-size:15px;">is.spike))</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce) &lt;-</styled-content> 
                        <styled-content style="font-size:15px;color:#4F9905">"ERCC"</styled-content>
                    </preformat>
                </p>
                <p>We filter out rows that do not correspond to endogenous genes or spike-in transcripts. This will remove rows containing mapping statistics such as the number of unaligned or unassigned reads, which would be misleading if treated as gene expression values. The object is then ready for downstream analyses as previously described.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">sce &lt;- sce[</styled-content>
                        <styled-content style="font-size:15px;color:#214A87">grepl</styled-content>
                        <styled-content style="font-size:15px;">(</styled-content>
                        <styled-content style="font-size:15px;color:#4F9905">"ENSMUS"</styled-content>
                        <styled-content style="font-size:15px;">,</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">rownames</styled-content>
                        <styled-content style="font-size:15px;">(sce)) |</styled-content> 
                        <styled-content style="font-size:15px;color:#214A87">isSpike</styled-content>
                        <styled-content style="font-size:15px;">(sce),]</styled-content>

                        <styled-content style="font-size:15px;color:#214A87">dim</styled-content>
                        <styled-content style="font-size:15px;">(sce)</styled-content>
                    </preformat>
                </p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;">## Features Samples
##    38653     704</styled-content>
                    </preformat>
                </p>
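                <p>The logic of this filter can be sketched on a small matrix in base R. The row names below are hypothetical, including the names chosen for the rows of mapping statistics, and the object is a plain matrix rather than a 
                    <monospace>SCESet</monospace>. A row is kept if it looks like a mouse Ensembl gene or a spike-in transcript, and is discarded otherwise.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="font-size:15px;"># Hypothetical rows: two Ensembl genes, one spike-in, two mapping statistics.
toy.names &lt;- c("ENSMUSG00000000001", "ENSMUSG00000000003", "ERCC-00002",
    "no_feature", "ambiguous")
toy.counts &lt;- matrix(1:10, nrow=5, dimnames=list(toy.names, c("cell1", "cell2")))

# The "^" anchor restricts the match to names that start with ERCC.
toy.spike &lt;- grepl("^ERCC", rownames(toy.counts))

# Keep endogenous genes or spike-ins; all other rows are discarded.
keep &lt;- grepl("ENSMUS", rownames(toy.counts)) | toy.spike
toy.counts &lt;- toy.counts[keep,]
rownames(toy.counts)</styled-content>
                    </preformat>
                </p>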
            </sec>
        </sec>
        <sec sec-type="conclusions">
            <title>Conclusions</title>
            <p>This workflow provides a step-by-step guide for performing basic analyses of single-cell RNA-seq data in R. It covers a number of low-level steps including quality control, normalization, cell cycle phase assignment, data exploration, HVG and marker gene detection, and clustering. Several different datasets are used to provide a range of usage examples. The workflow is modular, so individual steps can be substituted with alternative methods according to user preferences. In addition, the processed data can be easily used for higher-level analyses with other Bioconductor packages. We anticipate that this workflow will assist readers in assembling analyses of their own scRNA-seq data.</p>
        </sec>
        <sec>
            <title>Software availability</title>
            <p>All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (
                <ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org">https://cran.r-project.org</ext-link>) or the Bioconductor project (
                <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org">http://bioconductor.org</ext-link>). The specific version numbers of the packages used are shown below, along with the version of the R installation. Version numbers of all Bioconductor packages correspond to release version 3.4 of the Bioconductor project. Users can install all required packages and execute the workflow by following the instructions at 
                <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/help/workflows/simpleSingleCell">https://www.bioconductor.org/help/workflows/simpleSingleCell</ext-link>. The workflow takes less than an hour to run on a desktop computer with 8 GB of memory.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <styled-content style="font-size:15px;color:#214A87">sessionInfo</styled-content>
                    <styled-content style="font-size:15px;">()</styled-content>
                </preformat>
            </p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <styled-content style="font-size:15px;">## R version 3.3.1 Patched (2016-10-17 r71532)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8
##  [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0 GenomicFeatures_1.26.0
##  [3] GenomicRanges_1.26.0                   GenomeInfoDb_1.10.0
##  [5] openxlsx_3.0.0                         edgeR_3.16.0
##  [7] dynamicTreeCut_1.63-1                  limma_3.30.0
##  [9] gplots_3.0.1                           RBGL_1.50.0
## [11] graph_1.52.0                           org.Mm.eg.db_3.4.0
## [13] AnnotationDbi_1.36.0                   IRanges_2.8.0
## [15] S4Vectors_0.12.0                       scran_1.2.0
## [17] scater_1.2.0                           ggplot2_2.1.0
## [19] Biobase_2.34.0                         BiocGenerics_0.20.0
## [21] gdata_2.17.0                           R.utils_2.4.0
## [23] R.oo_1.20.0                            R.methodsS3_1.7.1
## [25] destiny_2.0.0                          mvoutlier_2.0.6
## [27] sgeostat_1.0-27                        Rtsne_0.11
## [29] BiocParallel_1.8.0                     knitr_1.14
## [31] BiocStyle_2.2.0
##
## loaded via a namespace (and not attached):
##   [1] Hmisc_3.17-4                RcppEigen_0.3.2.9.0         plyr_1.8.4
##   [4] igraph_1.0.1                sp_1.2-3                    shinydashboard_0.5.3
##   [7] splines_3.3.1               digest_0.6.10               htmltools_0.3.5
##  [10] viridis_0.3.4               magrittr_1.5                cluster_2.0.5
##  [13] Biostrings_2.42.0           matrixStats_0.51.0          xts_0.9-7
##  [16] colorspace_1.2-7            rrcov_1.4-3                 dplyr_0.5.0
##  [19] RCurl_1.95-4.8              tximport_1.2.0              lme4_1.1-12
##  [22] survival_2.39-5             zoo_1.7-13                  gtable_0.2.0
##  [25] XVector_0.14.0              zlibbioc_1.20.0             MatrixModels_0.4-1
##  [28] car_2.1-3                   kernlab_0.9-25              prabclus_2.2-6
##  [31] DEoptimR_1.0-6              SparseM_1.72                VIM_4.6.0
##  [34] scales_0.4.0                mvtnorm_1.0-5               DBI_0.5-1
##  [37] GGally_1.2.0                Rcpp_0.12.7                 sROC_0.1-2
##  [40] xtable_1.8-2                laeken_0.4.6                foreign_0.8-67
##  [43] proxy_0.4-16                mclust_5.2                  Formula_1.2-1
##  [46] vcd_1.4-3                   FNN_1.1                     RColorBrewer_1.1-2
##  [49] fpc_2.1-10                  acepack_1.3-3.3             modeltools_0.2-21
##  [52] reshape_0.8.5               XML_3.98-1.4                flexmix_2.3-13
##  [55] nnet_7.3-12                 locfit_1.5-9.1              labeling_0.3
##  [58] reshape2_1.4.1              munsell_0.4.3               tools_3.3.1
##  [61] RSQLite_1.0.0               pls_2.5-0                   evaluate_0.10
##  [64] stringr_1.1.0               cvTools_0.3.2               robustbase_0.92-6
##  [67] caTools_1.17.1              nlme_3.1-128                mime_0.5
##  [70] quantreg_5.29               formatR_1.4                 biomaRt_2.30.0
##  [73] pbkrtest_0.4-6              beeswarm_0.2.3              e1071_1.6-7
##  [76] statmod_1.4.26              smoother_1.1                tibble_1.2
##  [79] robCompositions_2.0.2       pcaPP_1.9-61                stringi_1.1.2
##  [82] lattice_0.20-34             trimcluster_0.1-2           Matrix_1.2-7.1
##  [85] nloptr_1.0.4                lmtest_0.9-34               data.table_1.9.6
##  [88] bitops_1.0-6                rtracklayer_1.34.0          httpuv_1.3.3
##  [91] R6_2.2.0                    latticeExtra_0.6-28         KernSmooth_2.23-15
##  [94] gridExtra_2.2.1             vipor_0.4.4                 boot_1.3-18
##  [97] MASS_7.3-45                 gtools_3.5.0                assertthat_0.1
## [100] SummarizedExperiment_1.4.0  chron_2.3-47                rhdf5_2.18.0
## [103] rjson_0.2.15                GenomicAlignments_1.10.0    Rsamtools_1.26.0
## [106] diptest_0.75-7              mgcv_1.8-15                 grid_3.3.1
## [109] rpart_4.1-10                class_7.3-14                minqa_1.2.4
## [112] TTR_0.23-1                  scatterplot3d_0.3-37        shiny_0.14.1
## [115] ggbeeswarm_0.5.0</styled-content>
                </preformat>
            </p>
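            <p>Readers who wish to compare their own installation against the versions listed above may find a sketch like the following useful. It relies only on 
                <monospace>packageVersion</monospace> from base R. The package names 
                <monospace>stats</monospace> and 
                <monospace>utils</monospace> are placeholders chosen because they ship with every R installation; they would be replaced with the packages of interest.</p>
            <p>
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <styled-content style="font-size:15px;"># Placeholder package names; substitute the packages used in the workflow.
pkgs &lt;- c("stats", "utils")

# Collect the installed version of each package as a named character vector.
versions &lt;- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))

# Report the R version alongside the package versions.
R.version.string
versions</styled-content>
                </preformat>
            </p>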
        </sec>
    </body>
    <back>
        <ack>
            <title>Acknowledgements</title>
            <p>We would like to thank Antonio Scialdone for helpful discussions, as well as Michael Epstein, James R. Smith and John Wilson-Kanamori for testing the workflow on other datasets.</p>
        </ack>
        <ref-list>
            <ref id="ref-1">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Anders</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>
					</person-group>:
                    <article-title>Differential expression analysis for sequence count data.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2010</year>;<volume>11</volume>(<issue>10</issue>):<fpage>R106</fpage>.
                    <pub-id pub-id-type="pmid">20979621</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2010-11-10-r106</pub-id>
                    <pub-id pub-id-type="pmcid">3218662</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Angerer</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Haghverdi</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>B&#x00fc;ttner</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>
                        <italic toggle="yes">destiny</italic>: diffusion maps for large-scale single-cell data in R.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2016</year>;<volume>32</volume>(<issue>8</issue>):<fpage>1241</fpage>&#x2013;<lpage>1243</lpage>.
                    <pub-id pub-id-type="pmid">26668002</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv715</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bertoli</surname>
                            <given-names>C</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Skotheim</surname>
                            <given-names>JM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>de Bruin</surname>
                            <given-names>RA</given-names>
                        </name>
					</person-group>:
                    <article-title>Control of cell cycle transcription during G1 and S phases.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Rev Mol Cell Biol.</italic>
					</source>
                    <year>2013</year>;<volume>14</volume>(<issue>8</issue>):<fpage>518</fpage>&#x2013;<lpage>528</lpage>.
                    <pub-id pub-id-type="pmid">23877564</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nrm3629</pub-id>
                    <pub-id pub-id-type="pmcid">4569015</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bourgon</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>
					</person-group>:
                    <article-title>Independent filtering increases detection power for high-throughput experiments.</article-title>
                    <source>
						
                        <italic toggle="yes">Proc Natl Acad Sci U S A.</italic>
					</source>
                    <year>2010</year>;<volume>107</volume>(<issue>21</issue>):<fpage>9546</fpage>&#x2013;<lpage>9551</lpage>.
                    <pub-id pub-id-type="pmid">20460310</pub-id>
                    <pub-id pub-id-type="doi">10.1073/pnas.0914005107</pub-id>
                    <pub-id pub-id-type="pmcid">2906865</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bray</surname>
                            <given-names>NL</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Pimentel</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Melsted</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Near-optimal probabilistic RNA-seq quantification.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Biotechnol.</italic>
					</source>
                    <year>2016</year>;<volume>34</volume>(<issue>5</issue>):<fpage>525</fpage>&#x2013;<lpage>527</lpage>.
                    <pub-id pub-id-type="pmid">27043002</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3519</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Brennecke</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Anders</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>JK</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Accounting for technical noise in single-cell RNA-seq experiments.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Methods.</italic>
					</source>
                    <year>2013</year>;<volume>10</volume>(<issue>11</issue>):<fpage>1093</fpage>&#x2013;<lpage>1095</lpage>.
                    <pub-id pub-id-type="pmid">24056876</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.2645</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Buettner</surname>
                            <given-names>F</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Natarajan</surname>
                            <given-names>KN</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Casale</surname>
                            <given-names>FP</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Biotechnol.</italic>
					</source>
                    <year>2015</year>;<volume>33</volume>(<issue>2</issue>):<fpage>155</fpage>&#x2013;<lpage>160</lpage>.
                    <pub-id pub-id-type="pmid">25599176</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3102</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Lun</surname>
                            <given-names>AT</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
					</person-group>:
                    <article-title>From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; referees: 5 approved].</article-title>
                    <source>
						
                        <italic toggle="yes">F1000Res.</italic>
					</source>
                    <year>2016</year>;<volume>5</volume>:<fpage>1438</fpage>.
                    <pub-id pub-id-type="pmid">27508061</pub-id>
                    <pub-id pub-id-type="doi">10.12688/f1000research.8987.2</pub-id>
                    <pub-id pub-id-type="pmcid">4934518</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Coller</surname>
                            <given-names>HA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Sang</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Roberts</surname>
                            <given-names>JM</given-names>
                        </name>
					</person-group>:
                    <article-title>A new description of cellular quiescence.</article-title>
                    <source>
						
                        <italic toggle="yes">PLoS Biol.</italic>
					</source>
                    <year>2006</year>;<volume>4</volume>(<issue>3</issue>):<fpage>e83</fpage>.
                    <pub-id pub-id-type="pmid">16509772</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pbio.0040083</pub-id>
                    <pub-id pub-id-type="pmcid">1393757</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Conboy</surname>
                            <given-names>CM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Spyrou</surname>
                            <given-names>C</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Thorne</surname>
                            <given-names>NP</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Cell cycle genes are the evolutionarily conserved targets of the E2F4 transcription factor.</article-title>
                    <source>
						
                        <italic toggle="yes">PLoS One.</italic>
					</source>
                    <year>2007</year>;<volume>2</volume>(<issue>10</issue>):<fpage>e1061</fpage>.
                    <pub-id pub-id-type="pmid">17957245</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0001061</pub-id>
                    <pub-id pub-id-type="pmcid">2020443</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Fan</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Salathia</surname>
                            <given-names>N</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Methods.</italic>
					</source>
                    <year>2016</year>;<volume>13</volume>(<issue>3</issue>):<fpage>241</fpage>&#x2013;<lpage>244</lpage>.
                    <pub-id pub-id-type="pmid">26780092</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3734</pub-id>
                    <pub-id pub-id-type="pmcid">4772672</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Hicks</surname>
                            <given-names>SC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Teng</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Irizarry</surname>
                            <given-names>RA</given-names>
                        </name>
					</person-group>:
                    <article-title>On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data.</article-title>
                    <source>
						
                        <italic toggle="yes">bioRxiv.</italic>
					</source>
                    <year>2015</year>.
                    <pub-id pub-id-type="doi">10.1101/025528</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-13">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Carey</surname>
                            <given-names>VJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Orchestrating high-throughput genomic analysis with Bioconductor.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Methods.</italic>
					</source>
                    <year>2015</year>;<volume>12</volume>(<issue>2</issue>):<fpage>115</fpage>&#x2013;<lpage>121</lpage>.
                    <pub-id pub-id-type="pmid">25633503</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3252</pub-id>
                    <pub-id pub-id-type="pmcid">4509590</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Ilicic</surname>
                            <given-names>T</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>JK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kolodziejczyk</surname>
                            <given-names>AA</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Classification of low quality cells from single-cell RNA-seq data.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2016</year>;<volume>17</volume>(<issue>1</issue>):<fpage>29</fpage>.
                    <pub-id pub-id-type="pmid">26887813</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-016-0888-1</pub-id>
                    <pub-id pub-id-type="pmcid">4758103</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-15">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Islam</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kj&#x00e4;llquist</surname>
                            <given-names>U</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Moliner</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Res.</italic>
					</source>
                    <year>2011</year>;<volume>21</volume>(<issue>7</issue>):<fpage>1160</fpage>&#x2013;<lpage>1167</lpage>.
                    <pub-id pub-id-type="pmid">21543516</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.110882.110</pub-id>
                    <pub-id pub-id-type="pmcid">3129258</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Islam</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Zeisel</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Joost</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Quantitative single-cell RNA-seq with unique molecular identifiers.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Methods.</italic>
					</source>
                    <year>2014</year>;<volume>11</volume>(<issue>2</issue>):<fpage>163</fpage>&#x2013;<lpage>166</lpage>.
                    <pub-id pub-id-type="pmid">24363023</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.2772</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Ji</surname>
                            <given-names>Z</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ji</surname>
                            <given-names>H</given-names>
                        </name>
					</person-group>:
                    <article-title>TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2016</year>;<volume>44</volume>(<issue>13</issue>):<fpage>e117</fpage>.
                    <pub-id pub-id-type="pmid">27179027</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkw430</pub-id>
                    <pub-id pub-id-type="pmcid">4994863</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Juli&#x00e1;</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Telenti</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Rausell</surname>
                            <given-names>A</given-names>
                        </name>
					</person-group>:
                    <article-title>
                        <italic toggle="yes">Sincell</italic>: an R/Bioconductor package for statistical assessment of cell-state hierarchies from single-cell RNA-seq.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2015</year>;<volume>31</volume>(<issue>20</issue>):<fpage>3380</fpage>&#x2013;<lpage>3382</lpage>.
                    <pub-id pub-id-type="pmid">26099264</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv368</pub-id>
                    <pub-id pub-id-type="pmcid">4595899</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>JK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kolodziejczyk</surname>
                            <given-names>AA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ilicic</surname>
                            <given-names>T</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Commun.</italic>
					</source>
                    <year>2015</year>;<volume>6</volume>:<fpage>8687</fpage>.
                    <pub-id pub-id-type="pmid">26489834</pub-id>
                    <pub-id pub-id-type="doi">10.1038/ncomms9687</pub-id>
                    <pub-id pub-id-type="pmcid">4627577</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Klein</surname>
                            <given-names>AM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Mazutis</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Akartuna</surname>
                            <given-names>I</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.</article-title>
                    <source>
						
                        <italic toggle="yes">Cell.</italic>
					</source>
                    <year>2015</year>;<volume>161</volume>(<issue>5</issue>):<fpage>1187</fpage>&#x2013;<lpage>1201</lpage>.
                    <pub-id pub-id-type="pmid">26000487</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.cell.2015.04.044</pub-id>
                    <pub-id pub-id-type="pmcid">4441768</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Kolodziejczyk</surname>
                            <given-names>AA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>JK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Tsang</surname>
                            <given-names>JC</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Single Cell RNA-Sequencing of Pluripotent States Unlocks Modular Transcriptional Variation.</article-title>
                    <source>
						
                        <italic toggle="yes">Cell Stem Cell.</italic>
					</source>
                    <year>2015</year>;<volume>17</volume>(<issue>4</issue>):<fpage>471</fpage>&#x2013;<lpage>485</lpage>.
                    <pub-id pub-id-type="pmid">26431182</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.stem.2015.09.011</pub-id>
                    <pub-id pub-id-type="pmcid">4595712</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Langfelder</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Zhang</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Horvath</surname>
                            <given-names>S</given-names>
                        </name>
					</person-group>:
                    <article-title>Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2008</year>;<volume>24</volume>(<issue>5</issue>):<fpage>719</fpage>&#x2013;<lpage>720</lpage>.
                    <pub-id pub-id-type="pmid">18024473</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btm563</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Law</surname>
                            <given-names>CW</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Shi</surname>
                            <given-names>W</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2014</year>;<volume>15</volume>(<issue>2</issue>):<fpage>R29</fpage>.
                    <pub-id pub-id-type="pmid">24485249</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2014-15-2-r29</pub-id>
                    <pub-id pub-id-type="pmcid">4053721</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-24">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Leng</surname>
                            <given-names>N</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chu</surname>
                            <given-names>LF</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Barry</surname>
                            <given-names>C</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Methods.</italic>
					</source>
                    <year>2015</year>;<volume>12</volume>(<issue>10</issue>):<fpage>947</fpage>&#x2013;<lpage>950</lpage>.
                    <pub-id pub-id-type="pmid">26301841</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3549</pub-id>
                    <pub-id pub-id-type="pmcid">4589503</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-25">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Liao</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Shi</surname>
                            <given-names>W</given-names>
                        </name>
					</person-group>:
                    <article-title>The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2013</year>;<volume>41</volume>(<issue>10</issue>):<fpage>e108</fpage>.
                    <pub-id pub-id-type="pmid">23558742</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkt214</pub-id>
                    <pub-id pub-id-type="pmcid">3664803</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-26">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Liao</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Shi</surname>
                            <given-names>W</given-names>
                        </name>
					</person-group>:
                    <article-title>featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2014</year>;<volume>30</volume>(<issue>7</issue>):<fpage>923</fpage>&#x2013;<lpage>930</lpage>.
                    <pub-id pub-id-type="pmid">24227677</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btt656</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-27">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Love</surname>
                            <given-names>MI</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Anders</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>V</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>RNA-Seq workflow: gene-level exploratory analysis and differential expression [version 1; referees: 2 approved].</article-title>
                    <source>
						
                        <italic toggle="yes">F1000Res.</italic>
					</source>
                    <year>2015</year>;<volume>4</volume>:<fpage>1070</fpage>.
                    <pub-id pub-id-type="pmid">26674615</pub-id>
                    <pub-id pub-id-type="doi">10.12688/f1000research.7035.1</pub-id>
                    <pub-id pub-id-type="pmcid">4670015</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-28">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Love</surname>
                            <given-names>MI</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Anders</surname>
                            <given-names>S</given-names>
                        </name>
					</person-group>:
                    <article-title>Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2014</year>;<volume>15</volume>(<issue>12</issue>):<fpage>550</fpage>.
                    <pub-id pub-id-type="pmid">25516281</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-014-0550-8</pub-id>
                    <pub-id pub-id-type="pmcid">4302049</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-29">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Lun</surname>
                            <given-names>AT</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bach</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Marioni</surname>
                            <given-names>JC</given-names>
                        </name>
					</person-group>:
                    <article-title>Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2016</year>;<volume>17</volume>:<fpage>75</fpage>.
                    <pub-id pub-id-type="pmid">27122128</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-016-0947-7</pub-id>
                    <pub-id pub-id-type="pmcid">4848819</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-30">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Macosko</surname>
                            <given-names>EZ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Basu</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Satija</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.</article-title>
                    <source>
						
                        <italic toggle="yes">Cell.</italic>
					</source>
                    <year>2015</year>;<volume>161</volume>(<issue>5</issue>):<fpage>1202</fpage>&#x2013;<lpage>1214</lpage>.
                    <pub-id pub-id-type="pmid">26000488</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.cell.2015.05.002</pub-id>
                    <pub-id pub-id-type="pmcid">4481139</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-31">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Mahata</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Zhang</surname>
                            <given-names>X</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kolodziejczyk</surname>
                            <given-names>AA</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Single-cell RNA sequencing reveals T helper cells synthesizing steroids
                        <italic toggle="yes">de novo</italic> to contribute to immune homeostasis.</article-title>
                    <source>
						
                        <italic toggle="yes">Cell Rep.</italic>
					</source>
                    <year>2014</year>;<volume>7</volume>(<issue>4</issue>):<fpage>1130</fpage>&#x2013;<lpage>1142</lpage>.
                    <pub-id pub-id-type="pmid">24813893</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.celrep.2014.04.011</pub-id>
                    <pub-id pub-id-type="pmcid">4039991</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-32">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Marinov</surname>
                            <given-names>GK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Williams</surname>
                            <given-names>BA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>McCue</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Res.</italic>
					</source>
                    <year>2014</year>;<volume>24</volume>(<issue>3</issue>):<fpage>496</fpage>&#x2013;<lpage>510</lpage>.
                    <pub-id pub-id-type="pmid">24299736</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.161034.113</pub-id>
                    <pub-id pub-id-type="pmcid">3941114</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-33">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>McCarthy</surname>
                            <given-names>DJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
					</person-group>:
                    <article-title>Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2012</year>;<volume>40</volume>(<issue>10</issue>):<fpage>4288</fpage>&#x2013;<lpage>4297</lpage>.
                    <pub-id pub-id-type="pmid">22287627</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gks042</pub-id>
                    <pub-id pub-id-type="pmcid">3378882</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-34">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>McCarthy</surname>
                            <given-names>DJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Campbell</surname>
                            <given-names>KR</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Lun</surname>
                            <given-names>AT</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.</article-title>
                    <source>
						
                        <italic toggle="yes">bioRxiv.</italic>
					</source>
                    <year>2016</year>.
                    <pub-id pub-id-type="doi">10.1101/069633</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-35">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Patro</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Duggal</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kingsford</surname>
                            <given-names>C</given-names>
                        </name>
					</person-group>:
                    <article-title>Accurate, fast, and model-aware transcript expression quantification with Salmon.</article-title>
                    <source>
						
                        <italic toggle="yes">bioRxiv.</italic>
					</source>
                    <year>2015</year>.
                    <pub-id pub-id-type="doi">10.1101/021592</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-36">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Phipson</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
					</person-group>:
                    <article-title>Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.</article-title>
                    <source>
						
                        <italic toggle="yes">Stat Appl Genet Mol Biol.</italic>
					</source>
                    <year>2010</year>;<volume>9</volume>: Article39.
                    <pub-id pub-id-type="pmid">21044043</pub-id>
                    <pub-id pub-id-type="doi">10.2202/1544-6115.1585</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-37">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Picelli</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Faridani</surname>
                            <given-names>OR</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bj&#x00f6;rklund</surname>
                            <given-names>AK</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Full-length RNA-seq from single cells using Smart-seq2.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Protoc.</italic>
					</source>
                    <year>2014</year>;<volume>9</volume>(<issue>1</issue>):<fpage>171</fpage>&#x2013;<lpage>181</lpage>.
                    <pub-id pub-id-type="pmid">24385147</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nprot.2014.006</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-38">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Pollen</surname>
                            <given-names>AA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Nowakowski</surname>
                            <given-names>TJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Shuga</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Biotechnol.</italic>
					</source>
                    <year>2014</year>;<volume>32</volume>(<issue>10</issue>):<fpage>1053</fpage>&#x2013;<lpage>1058</lpage>.
                    <pub-id pub-id-type="pmid">25086649</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.2967</pub-id>
                    <pub-id pub-id-type="pmcid">4191988</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-39">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Ritchie</surname>
                            <given-names>ME</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Phipson</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Wu</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>
                        <italic toggle="yes">limma</italic> powers differential expression analyses for RNA-sequencing and microarray studies.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2015</year>;<volume>43</volume>(<issue>7</issue>):<fpage>e47</fpage>.
                    <pub-id pub-id-type="pmid">25605792</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkv007</pub-id>
                    <pub-id pub-id-type="pmcid">4402510</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-40">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Robinson</surname>
                            <given-names>MD</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>McCarthy</surname>
                            <given-names>DJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
					</person-group>:
                    <article-title>edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2010</year>;<volume>26</volume>(<issue>1</issue>):<fpage>139</fpage>&#x2013;<lpage>140</lpage>.
                    <pub-id pub-id-type="pmid">19910308</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp616</pub-id>
                    <pub-id pub-id-type="pmcid">2796818</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-41">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Robinson</surname>
                            <given-names>MD</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Oshlack</surname>
                            <given-names>A</given-names>
                        </name>
					</person-group>:
                    <article-title>A scaling normalization method for differential expression analysis of RNA-seq data.</article-title>
                    <source>
						
                        <italic toggle="yes">Genome Biol.</italic>
					</source>
                    <year>2010</year>;<volume>11</volume>(<issue>3</issue>):<fpage>R25</fpage>.
                    <pub-id pub-id-type="pmid">20196867</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2010-11-3-r25</pub-id>
                    <pub-id pub-id-type="pmcid">2864565</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-42">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Scialdone</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Natarajan</surname>
                            <given-names>KN</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Saraiva</surname>
                            <given-names>LR</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Computational assignment of cell-cycle stage from single-cell transcriptome data.</article-title>
                    <source>
						
                        <italic toggle="yes">Methods.</italic>
					</source>
                    <year>2015</year>;<volume>85</volume>:<fpage>54</fpage>&#x2013;<lpage>61</lpage>.
                    <pub-id pub-id-type="pmid">26142758</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.ymeth.2015.06.021</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-43">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Stegle</surname>
                            <given-names>O</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Teichmann</surname>
                            <given-names>SA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Marioni</surname>
                            <given-names>JC</given-names>
                        </name>
					</person-group>:
                    <article-title>Computational and analytical challenges in single-cell transcriptomics.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Rev Genet.</italic>
					</source>
                    <year>2015</year>;<volume>16</volume>(<issue>3</issue>):<fpage>133</fpage>&#x2013;<lpage>145</lpage>.
                    <pub-id pub-id-type="pmid">25628217</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nrg3833</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-44">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Trapnell</surname>
                            <given-names>C</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Cacchiarelli</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Grimsby</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Biotechnol.</italic>
					</source>
                    <year>2014</year>;<volume>32</volume>(<issue>4</issue>):<fpage>381</fpage>&#x2013;<lpage>386</lpage>.
                    <pub-id pub-id-type="pmid">24658644</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.2859</pub-id>
                    <pub-id pub-id-type="pmcid">4122333</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-45">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Tung</surname>
                            <given-names>PY</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Blischak</surname>
                            <given-names>JD</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hsiao</surname>
                            <given-names>C</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Batch effects and the effective design of single-cell gene expression studies.</article-title>
                    <source>
						
                        <italic toggle="yes">bioRxiv.</italic>
					</source>
                    <year>2016</year>.
                    <pub-id pub-id-type="doi">10.1101/062919</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-46">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Vallejos</surname>
                            <given-names>CA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Marioni</surname>
                            <given-names>JC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Richardson</surname>
                            <given-names>S</given-names>
                        </name>
					</person-group>:
                    <article-title>BASiCS: Bayesian analysis of single-cell sequencing data.</article-title>
                    <source>
						
                        <italic toggle="yes">PLoS Comput Biol.</italic>
					</source>
                    <year>2015</year>;<volume>11</volume>(<issue>6</issue>):<fpage>e1004333</fpage>.
                    <pub-id pub-id-type="pmid">26107944</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pcbi.1004333</pub-id>
                    <pub-id pub-id-type="pmcid">4480965</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-47">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Van der Maaten</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hinton</surname>
                            <given-names>G</given-names>
                        </name>
					</person-group>:
                    <article-title>Visualizing data using t-SNE.</article-title>
                    <source>
						
                        <italic toggle="yes">J Mach Learn Res.</italic>
					</source>
                    <year>2008</year>;<volume>9</volume>:<fpage>2579</fpage>&#x2013;<lpage>2605</lpage>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-48">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Wilson</surname>
                            <given-names>NK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kent</surname>
                            <given-names>DG</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Buettner</surname>
                            <given-names>F</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.</article-title>
                    <source>
						
                        <italic toggle="yes">Cell Stem Cell.</italic>
					</source>
                    <year>2015</year>;<volume>16</volume>(<issue>6</issue>):<fpage>712</fpage>&#x2013;<lpage>724</lpage>.
                    <pub-id pub-id-type="pmid">26004780</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.stem.2015.04.004</pub-id>
                    <pub-id pub-id-type="pmcid">4460190</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-49">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Zeisel</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Mu&#x00f1;oz-Manchado</surname>
                            <given-names>AB</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Codeluppi</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.</article-title>
                    <source>
						
                        <italic toggle="yes">Science.</italic>
					</source>
                    <year>2015</year>;<volume>347</volume>(<issue>6226</issue>):<fpage>1138</fpage>&#x2013;<lpage>1142</lpage>.
                    <pub-id pub-id-type="pmid">25700174</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.aaa1934</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-50">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Zhu</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Yamane</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Cote-Sierra</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors.</article-title>
                    <source>
						
                        <italic toggle="yes">Cell Res.</italic>
					</source>
                    <year>2006</year>;<volume>16</volume>(<issue>1</issue>):<fpage>3</fpage>&#x2013;<lpage>10</lpage>.
                    <pub-id pub-id-type="pmid">16467870</pub-id>
                    <pub-id pub-id-type="doi">10.1038/sj.cr.7310002</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report17328">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10712.r17328</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Rausell</surname>
                        <given-names>Antonio</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17328a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-4832-6101</uri>
                </contrib>
                <aff id="r17328a1">
                    <label>1</label>Clinical Bioinformatics laboratory, Imagine Institute, Paris Descartes University - Sorbonne Paris Cit&#x00e9;, Paris, France</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>28</day>
                <month>12</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Rausell A</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17328" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors have satisfactorily addressed most of my questions and comments. However, I still have two main concerns (see the follow-up on Questions 12/13 and 19) and some other minor comments.</p>
            <p> 
                <bold>Follow-up on Q12/Q13:</bold>
            </p>
            <p> Dimensionality reduction approaches such as PCA are exploratory techniques suited to the early stages of an analysis. They may of course be affected by increasing levels of noise, but they have still proved valuable for uncovering the underlying structure of high-dimensional transcriptional data without feature selection steps. Whether a particular feature selection approach improves them is something the user can, and should, check by comparing the results obtained before and after the filtering steps. There is no reason to overlook the "unfiltered" results, as ignoring them a priori could mean missing important factors.</p>
            <p> While restricting the analysis to correlated HVGs can enhance the signal of the most prominent differences (e.g. main subgroups or trends), the approach risks filtering out genes that are important for identifying, for example, small subpopulations of cells and/or further heterogeneity within a given group, which are major motivations of many single-cell studies. The following code illustrates that the assessment of correlated genes used here (the correlatePairs function) has drawbacks (e.g. correlation results are sensitive to the relative sizes of the different subpopulations of cells and to random noise). Restricting the analysis to correlated HVGs could limit the ability to capture the structure of the data and, in the opinion of this referee, the general recommendation to the reader should be to apply the filter only after checking that nothing obvious is missed.</p>
            <p> The example shows three subpopulations of cells determined by two markers (genes 1 and 2), a common case in, for instance, immunology: -/-, +/- and -/+. In the example, another two genes (genes 3 and 4) are tightly associated with markers 1 and 2, respectively. By themselves, genes 3 and 4 would also determine the three populations. Two scenarios are illustrated: first, three subpopulations of equal size; second, three subpopulations of uneven size, one of which is small.</p>
            <p> &gt; library(scran) # provides correlatePairs</p>
            <p> &gt; set.seed(100)</p>
            <p> # Scenario A: 3 subpopulations of equal size</p>
            <p> &gt; Size1&lt;-333</p>
            <p> &gt; Size2&lt;-333</p>
            <p> &gt; Size3&lt;-333</p>
            <p> &gt; myvar=1</p>
            <p> &gt;</p>
            <p> &gt; NEG_NEG&lt;-cbind(rnorm(Size1,100,myvar),rnorm(Size1,100,myvar),rnorm(Size1,100,myvar),rnorm(Size1,100,myvar))</p>
            <p> &gt; POS_NEG&lt;-cbind(rnorm(Size2,200,myvar),rnorm(Size2,100,myvar),rnorm(Size2,200,myvar),rnorm(Size2,100,myvar))</p>
            <p> &gt; NEG_POS&lt;-cbind(rnorm(Size3,100,myvar),rnorm(Size3,200,myvar),rnorm(Size3,100,myvar),rnorm(Size3,200,myvar))</p>
            <p> &gt;</p>
            <p> &gt; M&lt;-as.matrix(rbind(NEG_NEG,POS_NEG,NEG_POS))</p>
            <p> &gt; var.cor &lt;- correlatePairs(t(M))</p>
            <p> &gt;</p>
            <p> &gt; subset(var.cor,FDR &lt;= 0.05)</p>
            <p> &#x00a0; gene1 gene2&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; rho&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; p.value&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; FDR</p>
            <p> 1&#x00a0;&#x00a0;&#x00a0;&#x00a0; 1&#x00a0;&#x00a0;&#x00a0;&#x00a0; 3&#x00a0; 0.6770746 1.999998e-06 1.999998e-06</p>
            <p> 2&#x00a0;&#x00a0;&#x00a0;&#x00a0; 2&#x00a0;&#x00a0;&#x00a0;&#x00a0; 4&#x00a0; 0.6680684 1.999998e-06 1.999998e-06</p>
            <p> 3&#x00a0;&#x00a0;&#x00a0;&#x00a0; 3&#x00a0;&#x00a0;&#x00a0;&#x00a0; 4 -0.3477739 1.999998e-06 1.999998e-06</p>
            <p> 4&#x00a0;&#x00a0;&#x00a0;&#x00a0; 2&#x00a0;&#x00a0;&#x00a0;&#x00a0; 3 -0.3477432 1.999998e-06 1.999998e-06</p>
            <p> 5&#x00a0;&#x00a0;&#x00a0;&#x00a0; 1&#x00a0;&#x00a0;&#x00a0;&#x00a0; 4 -0.3462825 1.999998e-06 1.999998e-06</p>
            <p> 6&#x00a0;&#x00a0;&#x00a0;&#x00a0; 1&#x00a0;&#x00a0;&#x00a0;&#x00a0; 2 -0.3168503 1.999998e-06 1.999998e-06</p>
            <p> &gt; dim(M)</p>
            <p> [1] 999&#x00a0;&#x00a0; 4</p>
            <p> &gt; M2 &lt;- cbind(M, matrix(rnorm((Size1+Size2+Size3)*1000), ncol=1000)) # Adding uncorrelated noise</p>
            <p> &gt; dim(M2)</p>
            <p> [1]&#x00a0; 999 1004</p>
            <p> &gt; PCM&lt;-prcomp(M2)</p>
            <p> &gt; plot(PCM$x[,1],PCM$x[,2], pch=16)</p>
            <p> &gt; cor(M,method="spearman")</p>
            <p> &#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,1]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,2]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,3]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,4]</p>
            <p> [1,]&#x00a0; 1.0000000 -0.3168503&#x00a0; 0.6770746 -0.3462825</p>
            <p> [2,] -0.3168503&#x00a0; 1.0000000 -0.3477432&#x00a0; 0.6680684</p>
            <p> [3,]&#x00a0; 0.6770746 -0.3477432&#x00a0; 1.0000000 -0.3477739</p>
            <p> [4,] -0.3462825&#x00a0; 0.6680684 -0.3477739&#x00a0; 1.0000000</p>
            <p> &gt; cor(PCM$x[,1],M)</p>
            <p> &#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,1]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,2]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,3]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,4]</p>
            <p> [1,] 0.8655852 -0.8660004 0.8659075 -0.8663694</p>
            <p> &gt; cor(PCM$x[,2],M)</p>
            <p> &#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,1]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,2]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,3]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,4]</p>
            <p> [1,] 0.5005309 0.4998326 0.4999728 0.4991926</p>
            <p> </p>
            <p> # Scenario B: 3 subpopulations of uneven size</p>
            <p> &gt; Size1&lt;-800</p>
            <p> &gt; Size2&lt;-185</p>
            <p> &gt; Size3&lt;-15</p>
            <p> &gt; myvar=.1</p>
            <p> &gt; NEG_NEG&lt;-cbind(rnorm(Size1,100,myvar),rnorm(Size1,100,myvar),rnorm(Size1,100,myvar),rnorm(Size1,100,myvar))</p>
            <p> &gt; POS_NEG&lt;-cbind(rnorm(Size2,200,myvar),rnorm(Size2,100,myvar),rnorm(Size2,200,myvar),rnorm(Size2,100,myvar))</p>
            <p> &gt; NEG_POS&lt;-cbind(rnorm(Size3,100,myvar),rnorm(Size3,200,myvar),rnorm(Size3,100,myvar),rnorm(Size3,200,myvar))</p>
            <p> &gt; M&lt;-as.matrix(rbind(NEG_NEG,POS_NEG,NEG_POS))</p>
            <p> &gt; var.cor &lt;- correlatePairs(t(M))</p>
            <p> &gt; subset(var.cor,FDR &lt;= 0.05)</p>
            <p> &#x00a0; gene1 gene2&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; rho&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; p.value&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; FDR</p>
            <p> 1&#x00a0;&#x00a0;&#x00a0;&#x00a0; 1&#x00a0;&#x00a0;&#x00a0;&#x00a0; 3 0.4423758 1.999998e-06 1.199999e-05</p>
            <p> &gt; dim(M)</p>
            <p> [1] 1000&#x00a0;&#x00a0;&#x00a0; 4</p>
            <p> &gt; M2 &lt;- cbind(M, matrix(rnorm((Size1+Size2+Size3)*1000), ncol=1000)) # Adding uncorrelated noise</p>
            <p> &gt; dim(M2)</p>
            <p> [1] 1000 1004</p>
            <p> &gt; PCM&lt;-prcomp(M2)</p>
            <p> &gt; plot(PCM$x[,1],PCM$x[,2], pch=16)</p>
            <p> &gt; cor(M,method="spearman")</p>
            <p> &#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,1]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,2]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,3]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,4]</p>
            <p> [1,]&#x00a0; 1.00000000 -0.04622429&#x00a0; 0.44237576 -0.02078442</p>
            <p> [2,] -0.04622429&#x00a0; 1.00000000 -0.02955135&#x00a0; 0.01551650</p>
            <p> [3,]&#x00a0; 0.44237576 -0.02955135&#x00a0; 1.00000000 -0.02922687</p>
            <p> [4,] -0.02078442&#x00a0; 0.01551650 -0.02922687&#x00a0; 1.00000000</p>
            <p> &gt; cor(PCM$x[,1],M)</p>
            <p> &#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,1]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,2]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,3]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,4]</p>
            <p> [1,] -0.9999771 0.06552188 -0.9999781 0.06533876</p>
            <p> &gt; cor(PCM$x[,2],M)</p>
            <p> &#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,1]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,2]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,3]&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0;&#x00a0; [,4]</p>
            <p> [1,] -0.006496133 -0.9978253 -0.006326823 -0.9978401</p>
            <p> In the first scenario, the correlatePairs function yields significant results for all pair-wise comparisons. However, in the second scenario, filtering through the correlatePairs function will miss the association between genes 2 and 4, as well as the antagonism between (1+, 3+) and (2+, 4+). In both scenarios, a PCA would have detected both factors and the 3 underlying subpopulations, despite the presence of uncorrelated noise.</p>
            <p> 
                <bold>Q19</bold>
            </p>
            <p> Inspecting the similarity between the quickCluster results and the dendrograms was not suggested as an indication of performance but as a possible way to avoid circularity. I wonder whether, if some groups from quickCluster were largely equivalent to groups resulting from the dendrogram, the differentially expressed genes detected among them could simply reflect the different size factors applied to each group at the normalization step rather than true biological differences in their expression levels.</p>
            <p> 
                <bold>Minor comments:</bold>
            </p>
            <p> 
                <bold>Q5</bold>
            </p>
            <p> The approach of Shalek 
                <italic>et al.</italic> was not suggested by this referee as an alternative way to filter out genes. On the contrary, it was proposed as a way to retain genes that could otherwise be filtered out by the "non-zero counts in at least n cells" criterion. For instance, under such a simple rule, a gene expressed at high levels in a few cells (i.e. possibly departing from expectations) will be treated the same way as a gene lowly expressed in a few cells (following noise expectations). As the authors already warn in the text, the risk is neglecting genes that are important for identifying rare subpopulations of cells.</p>
            <p> 
                <bold>Q15</bold>
            </p>
            <p> Heatmaps are typically used to visualize the clustering of both cells and genes. Since Spearman's correlation is the method of choice for assessing correlated HVGs, and the correlated HVGs are the genes on which the clustering of cells is represented, it seems more consistent to this referee to use Spearman's rho as the default distance for ordering the genes in the plot. As for the clustering of cells, the statement "in the context of this workflow, the differences in clustering on correlations versus Euclidean distances would only have a minor effect" may not hold true and, in any case, can easily be checked by the user.</p>
            <p> 
                <bold>Q18</bold>
            </p>
            <p> Inspecting the distribution of the percentage of variance explained by the different PCs is not suggested here to assist interpretation of the non-visualized components. On the contrary, it is necessary to avoid over-interpreting (or oversimplifying) a visualization of the first 2 dimensions when additional dimensions could still capture a high percentage of variance relevant to interpreting the structure of the data. Literature exists on how to determine the "relevant" number of dimensions and, hence, whether a 2D visualization is useful at all.</p>
            <p> </p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report17325">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10712.r17325</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Low</surname>
                        <given-names>Diana H.P.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17325a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r17325a1">
                    <label>1</label>Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore, Singapore</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>29</day>
                <month>11</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Low DHP</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17325" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The revised manuscript has satisfactorily addressed my largest concern about the reproducibility&#x00a0;of the workflow&#x00a0;by updating the code, and stating the dependence on Bioconductor 3.4.</p>
            <p> </p>
            <p> While I understand that this workflow is meant to work under version 3.4, it is important to mention this information in the manuscript regardless of the time of publication, especially if one expects that there might be casual users of R (at this point I would echo the sentiments of Hongkai Ji regarding installation convenience). Working with development versions of R - albeit awaiting its release at that point in time - is a slight danger in itself, unless the workflow relies only on packages that the authors have themselves developed. However, these are but minor issues.</p>
            <p> </p>
            <p> I thank the authors for creating this documentation that should provide a sufficient launchpad&#x00a0;for those getting into single-cell analyses. I hope the authors will continue to improve upon the components as the field matures.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report17327">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10712.r17327</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Ji</surname>
                        <given-names>Hongkai</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17327a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r17327a1">
                    <label>1</label>Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>21</day>
                <month>11</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Ji H</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17327" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In this revised manuscript, the authors have satisfactorily answered my previous questions. Overall, the workflow provides useful step-by-step instructions on how to analyze scRNA-seq data, and it is worth documenting in the literature. The authors explained why they hesitated to provide the whole pipeline through an R file and a graphical user interface. While their explanations are reasonable, convenience in installation and data exploration would greatly help many users. Therefore, I hope they can add or improve those components as they continue to develop this workflow.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report17329">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10712.r17329</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>McDavid</surname>
                        <given-names>Andrew</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17329a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-6581-1213</uri>
                </contrib>
                <aff id="r17329a1">
                    <label>1</label>Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>11</day>
                <month>11</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 McDavid A</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17329" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In the 2nd version, Lun and colleagues have updated the workflow for the 3.4 release of Bioconductor and provided a tool to install all the needed dependencies to replicate the workflow. This eases reproducibility and addresses my first and largest concern.</p>
            <p> </p>
            <p> The authors have not modified the article in response to my other comments, but rather offer a rebuttal. I acknowledge that some of the issues I raise are probably too large to be solved in a vignette, but nonetheless potential users of this workflow should be aware of them. I reply below:</p>
            <p> </p>
            <p> 
                <bold>1b.</bold>&#x00a0;Inheriting from a `SummarizedExperiment` not only allows storage in sparse formats, which as indicated in the rebuttal may not be terribly applicable for scRNAseq, but also allows analysis of data that is too big to fit into memory, by storing the data on-disk in HDF5 format. Adopting `SummarizedExperiment` as a container would future-proof this software in the event that data sets do get too large to fit into the memory of conventional machines. A back-of-the-envelope calculation suggests this won't be an issue until data sets are on the order of 10
                <sup>5</sup> cells, and require on the order of gigabytes of memory to store. Conducting analysis on big-memory machines might delay this issue indefinitely.</p>
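            <p> The arithmetic behind such an estimate can be sketched as follows. The figure of 20,000 genes and the function name dense_size_gb are illustrative assumptions, not values taken from the report; the sketch only relies on R storing doubles in 8 bytes and integers in 4 bytes, and it ignores object overhead.</p>
            <preformat><![CDATA[
```r
# Size in gigabytes (1 GB = 1e9 bytes) of a dense genes x cells matrix held in memory
dense_size_gb <- function(n_genes, n_cells, bytes_per_value = 8) {
  n_genes * n_cells * bytes_per_value / 1e9
}

n_genes <- 2e4                   # assumed number of genes (illustrative)
cells <- c(1e3, 1e4, 1e5, 1e6)   # data set sizes to compare

data.frame(
  cells = cells,
  double_gb = dense_size_gb(n_genes, cells, 8),   # numeric (double) storage
  integer_gb = dense_size_gb(n_genes, cells, 4)   # integer count storage
)
```
]]></preformat>
            <p> With these assumed dimensions, a dense matrix for 10<sup>5</sup> cells would need about 16 GB as doubles or 8 GB as integers, which is where conventional machines begin to struggle.</p>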
            <p> </p>
            <p> 
                <bold>2.</bold> &#x00a0;Although feature counting is an approach used by many studies, I do not know that it has been shown to provide equivalent estimates to EM-based transcript quantitation (and would welcome a reference otherwise!). Because feature counting throws away multi-mapped reads, at a minimum it is an inefficient use of (scarce, costly) data. I would not characterize it generally as "conservative," either, unless the reads are discarded uniformly at random from all possible transcripts, and only single isoforms are of interest. If uniformity doesn't hold, or multiple isoforms are present, feature counting distorts the relative abundances of genes, and even sample-to-sample comparisons of the same gene (e.g. Figures S1 and S4 in [1]). Additionally, modern quantitation programs do much more than account for multi-mapping reads: they can also model biases in the sequencing chemistry and fragment start sites, and can be faster than alignment-based procedures.
                <sup>2,3</sup> &#x00a0;I find analyst inertia an unconvincing reason not to adopt them.</p>
            <p> </p>
            <p> 
                <bold>3.&#x00a0;</bold>I also agree that the plethora of alternate approaches is a quagmire for end-users at the moment. Lamentably there has been very little in the way of across-method comparison, and instead the methods have been developed around a specific data set, or two, and rarely compared to each other. More direct head-to-head comparisons would advance the field dramatically. In the meantime, 
                <italic>caveat calculator</italic>. &#x00a0;</p>
            <p> </p>
            <p> I thank the authors for their effort in creating this tutorial. Having well-documented software and workflows will be critical for allowing the sorts of head-to-head comparisons I mention in [3].</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-17329-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Differential analysis of gene regulation at transcript resolution with RNA-seq.</article-title>
                        <source>
                            <italic>Nat Biotechnol</italic>
                        </source>.<year>2013</year>;<volume>31</volume>(<issue>1</issue>) :
                        <elocation-id>10.1038/nbt.2450</elocation-id>
                        <fpage>46</fpage>-<lpage>53</lpage>
                        <pub-id pub-id-type="pmid">23222703</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nbt.2450</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-17329-2">
                    <label>2</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Near-optimal probabilistic RNA-seq quantification.</article-title>
                        <source>
                            <italic>Nat Biotechnol</italic>
                        </source>.<year>2016</year>;<volume>34</volume>(<issue>5</issue>) :
                        <elocation-id>10.1038/nbt.3519</elocation-id>
                        <fpage>525</fpage>-<lpage>7</lpage>
                        <pub-id pub-id-type="pmid">27043002</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nbt.3519</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-17329-3">
                    <label>3</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.</article-title>
                        <source>
                            <italic>Nat Biotechnol</italic>
                        </source>.<year>2014</year>;<volume>32</volume>(<issue>5</issue>) :
                        <elocation-id>10.1038/nbt.2862</elocation-id>
                        <fpage>462</fpage>-<lpage>4</lpage>
                        <pub-id pub-id-type="pmid">24752080</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nbt.2862</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report15986">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10234.r15986</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Rausell</surname>
                        <given-names>Antonio</given-names>
                    </name>
                    <xref ref-type="aff" rid="r15986a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-4832-6101</uri>
                </contrib>
                <aff id="r15986a1">
                    <label>1</label>Clinical Bioinformatics laboratory, Imagine Institute, Paris Descartes University - Sorbonne Paris Cit&#x00e9;, Paris, France</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>20</day>
                <month>10</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Rausell A</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport15986" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In the Software tool article &#x201c;A step-by-step workflow for low-level analysis of single-cell RNA-seq data&#x201d;, Lun, McCarthy and Marioni thoroughly describe a comprehensive pipeline for the low-level analysis of single-cell RNA-seq data. The article covers important topics such as the quality control of cells and genes, normalization of expression levels, control for technical factors and cell cycle, detection of highly variable genes, assessment of subpopulations of cells and associated differentially expressed genes. The workflow is illustrated on a number of datasets offering diverse scenarios that nicely guide the reader through the different criteria that may be adopted throughout the analysis. The manuscript is clearly presented, the quality of the code and figures is excellent and great effort has been made to introduce complex questions in an easily accessible manner to a broad audience. Importantly, the authors discuss situations where it is difficult to provide a clear-cut recipe, and the need for experimental validation is stressed. Overall I think the article is an important contribution to the community and that it should quickly become a reference guide in the field.</p>
            <p> </p>
            <p> I report here a number of comments, questions and suggestions in the hope that they may help to improve an already excellent article:</p>
            <p> </p>
            <p> 1.&#x00a0;&#x00a0; &#x00a0;In addition to the approaches proposed by the authors to detect low quality cells, I would suggest to readers the possibility of identifying outlier cells by performing a PCA on the normalized gene expression matrix restricted to protein-coding genes (e.g. using biotype annotations from Ensembl biomart). On the one hand, outlier cells will dominate the first principal components, which will show a high percentage of variance that simply accounts for the separation of the outliers from the compact cloud of &#x201c;normal&#x201d; cells. On the other hand, a PCA could suggest keeping a cell whose similarity to the rest of the cells in a low-dimensional space seems rather normal, even if it has an allegedly "bad" quality metric.</p>
            <p> </p>
            <p> 2.&#x00a0;&#x00a0; &#x00a0;As an additional quality control check for the cells, I would also suggest assessing whether the sequencing depth was sufficient for most of the cells, for instance by inspecting the saturation curve of the number of detected genes (or other features such as known exon-exon junctions) as a function of the fraction of down-sampled reads.</p>
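            <p> A minimal sketch of such a saturation curve for a single cell is given below. The counts are simulated for illustration only, and binomial thinning of the per-gene counts is used as a stand-in for subsampling reads from the alignment files; both choices are assumptions and not part of the workflow.</p>
            <preformat><![CDATA[
```r
set.seed(1)

# Simulated per-gene read counts for a single cell (illustrative, not real data)
n_genes <- 5000
counts <- rnbinom(n_genes, mu = 20, size = 0.1)

# Fractions of reads to retain
fractions <- (1:10) / 10

# Binomial thinning: each read is kept independently with probability f,
# then the genes that still have at least one read are counted
detected <- sapply(fractions, function(f) {
  sub <- rbinom(length(counts), size = counts, prob = f)
  sum(sub > 0)
})

plot(fractions, detected, type = "b",
     xlab = "Fraction of reads retained", ylab = "Genes detected")
```
]]></preformat>
            <p> A curve that is still climbing steeply at a fraction of 1 would suggest that the cell was not sequenced to saturation, whereas a plateau would suggest that additional reads would detect few further genes.</p>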
            <p> </p>
            <p> 3.&#x00a0;&#x00a0; &#x00a0;In the text it is proposed to filter out low-abundance genes, defined as &#x201c;those with an average count below a filter threshold of 1." However, the average count is assessed before the normalization step. Would it be more meaningful to apply this filter on the normalized counts?</p>
            <p> </p>
            <p> 4.&#x00a0;&#x00a0; &#x00a0;A priori it is difficult to rule out the possibility that the filtering of low-abundance genes could eventually hamper the identification of relevant genes in rare populations of cells. I would generally suggest being rather inclusive at this stage, especially when no clustering of single-cells has been done yet, so that it would still be possible to check whether e.g. the few cells expressing some genes -even if at low levels- are actually forming a distinctive and biologically relevant cluster.</p>
            <p> </p>
            <p> 5.&#x00a0;&#x00a0; &#x00a0;As an alternative approach to gene filtering, the authors propose selecting genes that have non-zero counts in at least n cells. As illustrated in Figure 6, the number of cells expressing a given gene may be modeled by its mean expression level. This was elegantly addressed in Shalek 
                <italic>et al</italic>. (2014) through a likelihood ratio test comparing a null model -where all cells express a gene in a lognormal fashion- with an alternate model -where a gene is not expressed in a subpopulation of cells &#x03b1;&#x00a0;(See section "Controlling for relationship between expression level and detection efficiency" in the supplementary material: 
                <ext-link ext-link-type="uri" xlink:href="http://www.nature.com/nature/journal/v510/n7505/extref/nature13437-s1.pdf">http://www.nature.com/nature/journal/v510/n7505/extref/nature13437-s1.pdf</ext-link>). Genes for which the null model is rejected may be indicative of a subpopulation of cells not expressing the gene at a higher fraction than the one expected from technical noise (e.g. dropout events). I would suggest exploring such an approach in order to avoid filtering out relevant genes due to a sharp threshold on the number of cells expressing them.</p>
            <p> </p>
            <p> 6.&#x00a0;&#x00a0; &#x00a0;In the section &#x201c;Filtering out low-abundance genes&#x201d;, the sentence "This provides some more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells.[...]" would be better followed by setting alt.keep &lt;- numcells &gt;= 2 instead of &gt;= 10</p>
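            <p> The effect of the two thresholds can be compared directly. The sketch below assumes a simulated count matrix with genes in rows and cells in columns; apart from numcells and the thresholds of 10 and 2 quoted above, all object names and values are hypothetical.</p>
            <preformat>
```r
# Hypothetical count matrix (genes in rows, cells in columns), simulated for illustration.
set.seed(100)
counts = matrix(rnbinom(500 * 40, mu = 0.5, size = 0.2), nrow = 500)

# Number of cells in which each gene has a non-zero count.
numcells = rowSums(counts > 0)

# Compare the threshold used in the workflow (10) with the relaxed one (2).
keep.strict = numcells >= 10
keep.relaxed = numcells >= 2

# Cross-tabulate the two filters to see how many genes only the relaxed one retains.
table(strict = keep.strict, relaxed = keep.relaxed)
```
</preformat>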
            <p> </p>
            <p> 7.&#x00a0;&#x00a0; &#x00a0;It would be interesting to complement Figures 7 and 18 with a second panel representing the correlation between size factors from deconvolution versus spike-in-specific size factors, as done in Figure 27. In the eventual case that a low correlation between them was found in a non-DE scenario, would it be advisable to exclude spike-ins from the analysis?</p>
            <p> </p>
            <p> 8.&#x00a0;&#x00a0; &#x00a0;As pointed out by the authors, spike-in molecules have been extensively used to infer the amount of variability in the expression levels of a gene that can be explained by technical noise (e.g. Brennecke 
                <italic>et al.</italic>, 2013; Gr&#x00fc;n 
                <italic>et al.</italic>, 2014; Islam 
                <italic>et al.</italic>, 2014). Ding
                <italic> et al.</italic>&#x00a0;(2015) went further in the application of spike-in levels, by using them to explicitly remove technical noise and compute de-noised gene expression levels (R software GRM, http://wanglab.ucsd.edu/star/GRM/). I would suggest this possibility to readers, as it could largely benefit downstream analyses such as the detection of subpopulations of cells and cell trajectories, which would then rely mainly on biological variation. This would still be compatible with an assessment of HVGs based only on biological variation, by fitting the trend to the variance estimates of the endogenous genes (after technical denoising).</p>
            <p> </p>
            <p> 9.&#x00a0;&#x00a0; &#x00a0;Authors state that the technical component estimation through the fitting of a mean-variance trend to the spike-in transcripts &#x201c;is compromised by the small number of spike-in transcripts, the uneven distribution of their abundances and (for low numbers of cells) the imprecision of their variance estimates&#x201d;. Do the same remarks generally apply to a spike-in-specific normalization? And if so, should spike-in normalization be considered accurate enough when applied to cases with strong DE even if it is conceptually more appropriate than a deconvolution approach?</p>
            <p> </p>
            <p> 10.&#x00a0;&#x00a0; &#x00a0;In the section &#x201c;Identifying HVGs from the normalized log-expression&#x201d; the authors justify their choice of "the variance of the log-expression values because the log-transformation protects against genes with strong expression in only one or two cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns&#x201d;. However, the filtering of genes with such patterns has already been proposed in a previous section, so those cases should no longer be a risk here.</p>
            <p> </p>
            <p> 11.&#x00a0;&#x00a0; &#x00a0;The interpretability of the approach of "Identifying correlated gene pairs with Spearman&#x2019;s rho" is to some extent limited without a previous analysis such as PCA, ICA or MDS, transforming the high-dimensional space into a space of independent (uncorrelated) dimensions. I would rather favor the identification of sets of genes with a high weight on each of the retained independent axes (i.e. driving the variance in such axes, and therefore disentangling sets of correlated genes for each of the orthogonal dimensions). Otherwise, the analysis risks being dominated by the first component, probably neglecting other relevant hidden factors.</p>
            <p> </p>
            <p> 12.&#x00a0;&#x00a0; &#x00a0;In any case, I advise against restricting downstream dimensionality reduction analyses such as PCA or ICA to correlated HVGs when the aim is to identify subpopulations of cells and their gene signatures. Such methods exploit correlation patterns (linear or non-linear) in a well-grounded way and they do not require a feature selection step. The sentence "We only use the correlated HVGs in plotPCA because any substructure should be most pronounced in the expression profiles of these genes" may not hold true in some instances: correlated HVGs were assessed without considering those independent components, the relative contribution of each dimension to the total variance, and the relative contribution of each gene to each dimension.</p>
            <p> </p>
            <p> 13.&#x00a0;&#x00a0; &#x00a0;In the brain dataset, correlated HVGs were assessed using design &lt;- model.matrix(~sce$sex); correlatePairs(sce, design=design). It would be useful to further explain&#x00a0;here how this function accounts for the design matrix in the assessment of Spearman&#x2019;s rho.</p>
            <p> </p>
            <p> 14.&#x00a0;&#x00a0; &#x00a0;In the brain dataset, removeBatchEffect from limma package is used to remove the sex effect. Then tSNE and PCA are applied on the sex-corrected expression values restricted to correlated HVGs. Consistently, correlated HVGs were assessed considering the very same factor: design &lt;- model.matrix(~sce$sex); correlatePairs(sce, design=design). I would further warn the reader and stress the necessity of that consistency between both steps.</p>
            <p> </p>
            <p> 15.&#x00a0;&#x00a0; &#x00a0;For consistency with the assessment of correlations based on Spearman's rho, I would recommend first computing the dendrograms for the cells and the genes in the heatmap using a Spearman correlation as well. For instance:</p>
            <p> </p>
            <p> cells.cor &lt;- cor(expressionmatrix, method="spearman")</p>
            <p> cells.cor.dist &lt;- as.dist(1-cells.cor)</p>
            <p> cells.tree &lt;- hclust(cells.cor.dist, method='complete')</p>
            <p> Then set Colv=as.dendrogram(cells.tree) in heatmap.2, and proceed analogously with the genes for Rowv.</p>
            <p> This should be adapted in the case that a design is used, as in correlatePairs(sce, design=design).</p>
            <p> Personally, I would also suggest checking how the heatmaps look when setting scale='row' in the heatmap.2 function.</p>
            <p> </p>
            <p> 16.&#x00a0;&#x00a0; &#x00a0;I would suggest explicitly mentioning in the pipeline which approaches are based on linear or non-linear assumptions. The workflow alternates methods from both categories, which should be taken into account to understand their downstream consequences. For instance: 
                <list list-type="bullet">
                    <list-item>
                        <p>The function plotExplanatoryVariables from scater package, with the default method= "density", produces a density plot of R-squared values for each variable when fitted as the only explanatory variable in a linear model.</p>
                    </list-item>
                    <list-item>
                        <p>The function removeBatchEffect from limma package fits a linear model to the data, including both batches and regular treatments, then removes the component due to the batch effects.</p>
                    </list-item>
                    <list-item>
                        <p>Then the analysis is restricted to correlated HVGs, which are assessed on Spearman&#x2019;s rho, i.e. rank-based, non-linear.</p>
                    </list-item>
                    <list-item>
                        <p>tSNE is non-linear, whereas PCA is linear.</p>
                    </list-item>
                    <list-item>
                        <p>Clusters are defined by applying a dynamic tree cut to the dendrograms obtained by hierarchical clustering on the Euclidean distances between cells (linear, although in a non-orthogonal space).</p>
                    </list-item>
                </list>
            </p>
            <p> </p>
            <p> 17.&#x00a0;&#x00a0; &#x00a0;The use of hierarchical clustering for clustering cells into putative subpopulations is based on Euclidean distances (or correlations) assessed in a non-orthogonal space. I would rather favor an analytical clustering performed directly in a low-dimensional orthogonal space such as those produced by PCA, ICA or MDS, in which the most informative dimensions can be selected (e.g. through their eigenvalues in PCA).</p>
            <p> </p>
            <p> 18. &#x00a0;A PCA analysis should be accompanied by a plot representing the % of variance explained by each principal component, so that the number of relevant dimensions to retain can be judged, while disregarding the rest as &#x201c;noise&#x201d;. It could be the case that more than 2 dimensions are relevant to separate subpopulations in finer detail. The inspection of eigenvalues would help to support the statement that "PCA plot is less effective at separating cells into many different clusters (Figure 24). This is because the first two principal components are driven by strong differences between specific subpopulations, which reduces the resolution of more subtle differences between some of the other subpopulations."</p>
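            <p> A minimal sketch of such a plot in base R follows. It assumes a matrix of log-expression values with genes in rows and cells in columns; the simulated matrix and the object names are hypothetical, and prcomp is used here only as one convenient way of obtaining the eigenvalues.</p>
            <preformat>
```r
# Hypothetical log-expression matrix (genes in rows, cells in columns), simulated for illustration.
set.seed(100)
exprs.mat = matrix(rnorm(200 * 50), nrow = 200)

# prcomp() expects observations (cells) in rows, hence the transpose.
pca = prcomp(t(exprs.mat))

# Percentage of variance explained by each principal component.
pct.var = 100 * pca$sdev^2 / sum(pca$sdev^2)

# Bar plot of the first ten components, to judge how many are worth retaining.
barplot(pct.var[1:10], names.arg = 1:10,
        xlab = "Principal component", ylab = "Variance explained (%)")
```
</preformat>
            <p> With real data, a sharp drop after the first few bars would indicate how many components to retain for clustering.</p>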
            <p> </p>
            <p> 19.&#x00a0;&#x00a0; &#x00a0;In the brain analysis, the three main steps are: 1) the deconvolution method is used to normalize expression levels. Here similar cells are clustered together and cells are normalized in each cluster. The authors state: &#x201c;This improves normalization accuracy by reducing the number of DE genes between cells in the same cluster&#x201d;. Clustering is performed here with the quickCluster function from the scran package, where a distance matrix is constructed using Spearman&#x2019;s correlation on the counts between cells. 2) A hierarchical clustering is then performed and a dynamic tree cut is used to define clusters of cells. Then, the batch(sex)-corrected expression values of the (Spearman&#x2019;s rho) correlated HVGs are used to build a dendrogram through hierarchical clustering on the Euclidean distances between cells, where clusters are defined. And 3) those clusters are used to assess DE with edgeR on the counts, normalized using the library size-adjusted size factors (if I understood correctly) and including all genes (not only correlated HVGs). I personally found this procedure a bit cumbersome, as it relies on different types of expression matrices and metrics in each of the 3 steps (see also next comment). I also wonder to what extent the initial quickCluster results could be biasing the clusters detected downstream, and, if so, whether the normalization step would in turn be biasing the differential expression results. The correspondence between the quickCluster results and the clusters from the dendrograms should at least be inspected and discussed.</p>
            <p> </p>
            <p> 20.&#x00a0;&#x00a0; &#x00a0;In line with the previous comment, in the brain analysis I wonder whether the pipeline could somehow be simplified by 1) performing spike-in normalization (which seems possible given the quality of the spike-in trend observed in Figure 21), 2) doing a PCA on the batch(sex)-corrected expression values of all genes (not only correlated HVGs), and performing clustering on the retained principal components, and 3) assessing DE with edgeR on the counts normalized using the spike-in factors.</p>
            <p> </p>
            <p> </p>
            <p> Minor comments</p>
            <p> </p>
            <p> 21.&#x00a0;&#x00a0; &#x00a0;Some code at the beginning of the analysis to check and install all the required packages would be welcome.</p>
            <p> </p>
            <p> 22.&#x00a0;&#x00a0; &#x00a0;Everything ran smoothly in our hands except for the gdata package when trying to read the xls file. The perl command interpreter ran abnormally long and used a large amount of RAM. We finally opened the xls file in Excel, converted it into a tab-separated file, and then read it using the general read.table command.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-15986-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Single-cell RNA-seq reveals dynamic paracrine control of cellular variation.</article-title>
                        <source>
                            <italic>Nature</italic>
                        </source>.<year>2014</year>;<volume>510</volume>(<issue>7505</issue>) :
                        <elocation-id>10.1038/nature13437</elocation-id>
                        <fpage>363</fpage>-<lpage>9</lpage>
                        <pub-id pub-id-type="pmid">24919153</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nature13437</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-15986-2">
                    <label>2</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Accounting for technical noise in single-cell RNA-seq experiments.</article-title>
                        <source>
                            <italic>Nat Methods</italic>
                        </source>.<year>2013</year>;<volume>10</volume>(<issue>11</issue>) :
                        <elocation-id>10.1038/nmeth.2645</elocation-id>
                        <fpage>1093</fpage>-<lpage>5</lpage>
                        <pub-id pub-id-type="pmid">24056876</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nmeth.2645</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-15986-3">
                    <label>3</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Validation of noise models for single-cell transcriptomics.</article-title>
                        <source>
                            <italic>Nat Methods</italic>
                        </source>.<year>2014</year>;<volume>11</volume>(<issue>6</issue>) :
                        <elocation-id>10.1038/nmeth.2930</elocation-id>
                        <fpage>637</fpage>-<lpage>40</lpage>
                        <pub-id pub-id-type="pmid">24747814</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nmeth.2930</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-15986-4">
                    <label>4</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Quantitative single-cell RNA-seq with unique molecular identifiers.</article-title>
                        <source>
                            <italic>Nat Methods</italic>
                        </source>.<year>2014</year>;<volume>11</volume>(<issue>2</issue>) :
                        <elocation-id>10.1038/nmeth.2772</elocation-id>
                        <fpage>163</fpage>-<lpage>6</lpage>
                        <pub-id pub-id-type="pmid">24363023</pub-id>
                        <pub-id pub-id-type="doi">10.1038/nmeth.2772</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-15986-5">
                    <label>5</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Normalization and noise reduction for single cell RNA-seq experiments.</article-title>
                        <source>
                            <italic>Bioinformatics</italic>
                        </source>.<year>2015</year>;<volume>31</volume>(<issue>13</issue>) :
                        <elocation-id>10.1093/bioinformatics/btv122</elocation-id>
                        <fpage>2225</fpage>-<lpage>7</lpage>
                        <pub-id pub-id-type="pmid">25717193</pub-id>
                        <pub-id pub-id-type="doi">10.1093/bioinformatics/btv122</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment2252-15986">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Lun</surname>
                            <given-names>Aaron</given-names>
                        </name>
                        <aff>Cancer Research UK Cambridge Research Institute, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>None declared.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>26</day>
                    <month>10</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for your comments, Antonio. Our responses are as below:</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>1. In addition to the approaches proposed by the authors to detect low quality cells, I would suggest to readers the possibility of identifying outlier cells by performing a PCA on the normalized gene expression matrix restricted to protein-coding genes (e.g. using biotype annotations from Ensembl biomart). On the one hand, outlier cells will dominate the first principal components, which will show a high percentage of variance simply accounting for the separation of the outlier from the compact cloud of &#x201c;normal&#x201d; cells. On the other hand, a PCA could suggest keeping a cell whose relative similarity to the rest of the cells in a low-dimensional space seems rather normal, even if it has an allegedly "bad" quality metric.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> This is certainly a valid approach, though we do not mention it here for several reasons. The first reason is that there is an increased risk of being confounded by biological effects when gene expression patterns are directly used, e.g. where uncommon cell types are classified as outliers and removed. The second is that we do not want to confuse readers with a variety of possible options - while our approach is not the only way to do it, it does work, and thus serves its purpose in this workflow. Finally, the use of PCA-based outlier detection has been explored in some detail by Ilicic 
                    <italic>et al.</italic> (2016), which we have already mentioned in the text.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>2. As an additional quality control check for the cells, I would also suggest assessing whether the sequencing depth was generally deep enough for most of the cells, for instance by inspecting the saturation curve of the number of detected genes (or other features like the known exon-exon junctions) as a function of the fraction of down-sampled reads.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> This is an interesting idea, though it seems to be more useful as a diagnostic for future experiments rather than for an already existing dataset. Even if saturation is not reached, it would not affect the data analysis provided that the existing counts were large enough. Our diagnostics focus on the quality of the data that we currently have, rather than the potential for improving the experiment by collecting more data.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>3. In the text it is proposed to filter out low-abundance genes, defined as &#x201c;those with an average count below a filter threshold of 1." However, the average count is assessed before the normalization step. Would it be more meaningful to apply this filter on the normalized counts?</bold>
                    </italic>
                </p>
                <p> </p>
                <p> Unfortunately, most normalization methods (e.g. deconvolution, TMM, DESeq) perform poorly with unfiltered data due to the poor precision of low counts. This necessitates some degree of filtering prior to normalization. We do not think that this has a major effect on the mean count for most genes, given that the size factors average out to unity across all cells.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>4. A priori it is difficult to rule out the possibility that the filtering of low-abundance genes could eventually hamper the identification of relevant genes in rare populations of cells. I would generally suggest being rather inclusive at this stage, especially when no clustering of single-cells has been done yet, so that it would still be possible to check whether e.g. the few cells expressing some genes -even if at low levels- are actually forming a distinctive and biologically relevant cluster.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> In the context of this workflow, one of the roles of filtering is to reduce the number of genes that need to be tested as being highly variable. This improves power by reducing the severity of the multiple testing correction, increasing the chance that potentially informative genes are detected as HVGs and used in downstream analyses. Thus, while relaxing the filter may retain more genes, fewer of these genes may actually be used in the downstream analysis. (This is more likely than not - low-abundance genes are not generally detected as being highly variable, due to inherent limits on the scope of variability in count data.) Indeed, in the example of few cells expressing few genes at low levels, it is difficult to see how such genes would be detected as being significant in an HVG analysis.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>5. As an alternative approach to gene filtering, the authors propose selecting genes that have non-zero counts in at least n cells. As illustrated in Figure 6, the number of cells expressing a given gene may be modeled by its mean expression level. This was elegantly addressed in Shalek et al. (2014) through a likelihood ratio test comparing a null model -where all cells express a gene in a lognormal fashion- with an alternate model -where a gene is not expressed in a subpopulation of cells &#x03b1; (See section "Controlling for relationship between expression level and detection efficiency" in the supplementary material: http://www.nature.com/nature/journal/v510/n7505/extref/nature13437-s1.pdf). Genes for which the null model is rejected may be indicative of a subpopulation of cells not expressing the gene at a higher fraction than the one expected from technical noise (e.g. dropout events). I would suggest exploring such an approach in order to avoid filtering out relevant genes due to a sharp threshold on the number of cells expressing them.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> There are several arguments against using such an approach, at least during the filtering stage. First, this approach specifically selects for bimodal genes, whereas it is entirely possible that interesting genes could vary across a continuum of expression values (or, in fact, are bimodal at two non-zero locations). Second, the significance threshold effectively serves the same purpose as a threshold on the percentage of expressing cells - only less interpretable, as it depends on the vagaries and assumptions of the model. Indeed, default thresholds for significance (e.g. 1%, 5%) may not be appropriate for filtering and exploratory analyses. Thus, some tuning of the significance thresholds is likely to be required, further reducing interpretability. Consequently, we feel that the approach we have suggested is more likely to be generally useful to the wider biological community.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>6. In the section &#x201c;Filtering out low-abundance genes&#x201d;, the sentence "This provides some more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells.[...]" would be better followed by setting alt.keep &lt;- numcells &gt;= 2 instead of &gt;= 10</bold>
                    </italic>
                </p>
                <p> </p>
                <p> The "ideal" threshold depends largely on the biological context. The HSC dataset contains a highly purified and homogeneous population. We would expect that most expressed genes would be present in a substantial number of these cells, hence the choice of threshold. While relaxing the filter is possible, this runs into the problems discussed above in our response to point 4. Of course, in other situations where rare cell types are present (e.g. olfactory neurons expressing unique receptors), relaxing the filter might be necessary to retain biological information. We have added a comment about this in the revised manuscript.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>7. It would be interesting to complement Figures 7 and 18 with a second panel representing the correlation between size factors from deconvolution versus spike-in-specific size factors, as done in Figure 27. In the eventual case that a low correlation between them was found in a non-DE scenario, would it be advisable to exclude spike-ins from the analysis?</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We considered adding this, but felt that it would make this part of the workflow somewhat difficult to follow given that we use only the deconvolution factors for normalisation of the endogenous genes. Nevertheless, we agree that this is an important point and are glad that the reviewer pointed out Figure 27 where we discuss this issue in some detail.</p>
                <p> </p>
                <p> Low correlations between the spike-in and deconvolution size factors are not a cause for concern. As we have mentioned, this is entirely possible due to differences in total mRNA content. In terms of normalization, the two sets of size factors simply deal with different biases, so differences between them do not provide any indication of spike-in quality.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>8. As pointed out by the authors, spike-in molecules have been extensively used to infer the amount of variability in the expression levels of a gene that can be explained by technical noise (e.g. Brennecke et al., 2013; Gr&#x00fc;n et al., 2014; Islam et al., 2014). Ding et al. (2015) went further in the application of spike-in levels, by using them to explicitly remove technical noise and compute de-noised gene expression levels (R software GRM, http://wanglab.ucsd.edu/star/GRM/). I would suggest this possibility to readers, as it could largely benefit downstream analyses such as the detection of subpopulations of cells and cell trajectories, which would then rely mainly on biological variation. This would still be compatible with an assessment of HVGs based only on biological variation, by fitting the trend to the variance estimates of the endogenous genes (after technical denoising).</bold>
                    </italic>
                </p>
                <p> </p>
                <p> The GRM strategy is an interesting one. However, we do not use it here because the denoising is performed based on a curve fitted to the spike-in log-FPKMs against the known concentrations. This is philosophically similar to spike-in-based normalization, in that it will preserve information about total RNA content. For example, cells with more endogenous RNA will have larger gene counts and unchanged (or smaller) spike-in counts; this results in larger de-noised expression values compared to other cells with less total RNA. Such behaviour may not be desirable in situations where cell size is not of interest.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>9. Authors state that the technical component estimation through the fitting of a mean-variance trend to the spike-in transcripts &#x201c;is compromised by the small number of spike-in transcripts, the uneven distribution of their abundances and (for low numbers of cells) the imprecision of their variance estimates&#x201d;. Do the same remarks generally apply to a spike-in-specific normalization? And if so, should spike-in normalization be considered accurate enough when applied to cases with strong DE even if it is conceptually more appropriate than a deconvolution approach?</bold>
                    </italic>
                </p>
                <p> </p>
                <p> In general, no, the remarks do not apply to spike-in normalization. This is because spike-in normalization computes a single size factor, using information across all spike-in transcripts. As a result, the size factor is generally quite precise. Fitting of the mean-variance trend is less stable because it uses information from each individual spike-in transcript. This is subject to the issues described in the text, thus reducing the stability of the outcome.</p>
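                <p> The contrast can be illustrated with a small sketch in base R. It assumes that the size factor for each cell is proportional to its total spike-in count, scaled to average to unity across cells; the simulated matrix and all object names are hypothetical and do not reproduce the code used in the workflow.</p>
                <preformat>
```r
# Hypothetical spike-in count matrix (spike-in transcripts in rows, cells in columns).
set.seed(100)
spike.counts = matrix(rnbinom(92 * 30, mu = 50, size = 2), nrow = 92)

# One size factor per cell, pooling information over all spike-in transcripts.
spike.total = colSums(spike.counts)
spike.sf = spike.total / mean(spike.total)

# Per-transcript variances, as needed for trend fitting, each rely on a single row.
spike.var = apply(log2(spike.counts + 1), 1, var)
```
</preformat>
                <p> Each entry of spike.sf draws on every row of the matrix, whereas each entry of spike.var is estimated from one transcript only, which is why the latter is less stable.</p>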
                <p> </p>
                <p> 
                    <italic>
                        <bold>10. In the section &#x201c;Identifying HVGs from the normalized log-expression&#x201d; the authors justify their choice of &#x201c;the variance of the log-expression values because the log-transformation protects against genes with strong expression in only one or two cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns&#x201d;. However, the filtering of genes with such patterns has already been proposed in a previous section, so those cases should no longer be a risk here.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> This depends on the type of abundance filtering that was chosen. In this workflow, we performed filtering based on the average count, which does not explicitly protect against strong outliers. Thus, some additional protection is needed during the downstream analysis. If filtering were performed based on an "at least n" strategy, then outliers would be less of an issue during HVG detection. Of course, the "at least n" filter has problems of its own regarding an appropriate choice for "n", as we have discussed in the text and in our response to point 6, which is why we have not used it as the default filtering strategy.</p>
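                <p> </p>
                <p> A toy example may make the distinction concrete. The counts, the average-count threshold of 1 and the choice of n = 3 below are hypothetical values chosen for illustration, not the settings used in the workflow. The sketch shows a gene driven by a single outlier cell passing an average-count filter but failing an "at least n" filter, and how the log-transformation dampens the variance of that gene.</p>
                <preformat><![CDATA[
```r
# Toy counts (hypothetical): one gene driven by a single outlier cell and
# one gene expressed at a steady level in all ten cells.
counts <- rbind(
    outlier=c(1000, rep(0, 9)),
    steady=rep(5, 10)
)

# Filter on the average count (the threshold of 1 is an assumption).
keep.ave <- rowMeans(counts) >= 1

# "At least n" filter: non-zero counts in at least n cells (n = 3 is an assumption).
n <- 3
keep.n <- rowSums(counts > 0) >= n

# Variance of the raw counts versus the log-transformed counts.
var.raw <- apply(counts, 1, var)
var.log <- apply(log2(counts + 1), 1, var)

data.frame(keep.ave, keep.n, var.raw, var.log)
```
]]></preformat>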
                <p> </p>
                <p> 
                    <italic>
                        <bold>11. The interpretability of the approach of "Identifying correlated gene pairs with Spearman&#x2019;s rho" is to some extent limited without a previous analysis such as PCA, ICA or MDS, transforming the high-dimensional space into a space of independent (uncorrelated) dimensions. I would rather favor the identification of sets of genes with a high weight on each of the retained independent axes (i.e. driving the variance in such axes, and therefore disentangling sets of correlated genes for each of the orthogonal dimensions). Otherwise, the analysis could risk being dominated by the first component, probably neglecting other relevant hidden factors.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> The point of calculating these correlations is to provide a simple screen for genes that are likely to be involved in defining the substructure of the dataset. Interpretation of the cause of these correlations can then be performed using PCA, ICA, etc. as suggested on the subset of interesting genes. Without some pre-selection of genes (in terms of high variance or correlation), biological and technical noise may interfere with dimensionality reduction - see our response to point 12.</p>
                <p> </p>
                <p> Our approach allows relevant genes to be selected in a statistically rigorous manner based on significant correlations. In contrast, it is unclear how selection would be performed based on the PCA weights. For example, what should be considered a "high weight", and from how many principal components should genes be selected? The simplicity of the calculation of significant pairwise correlations also provides a useful sanity check for conclusions drawn from more complex downstream analyses.</p>
                <p> </p>
                <p> Finally, if there are hidden factors, these are likely to increase the correlations and cause rejection of the null hypothesis for the relevant genes. So, genes that are affected by these factors will still be retained for downstream analysis and interpretation.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>12. In any case, I advise against restricting downstream dimensionality reduction analyses, such as PCA or ICA aimed at identifying subpopulations of cells and their gene signatures, to correlated HVGs. Such methods exploit correlation patterns (linear or non-linear) in a well-grounded way and they do not require a feature selection step. The sentence "We only use the correlated HVGs in plotPCA because any substructure should be most pronounced in the expression profiles of these genes" may not hold true in some instances: correlated HVGs were assessed without considering those independent components, the relative contribution of each dimension to the total variance, and the relative contribution of each gene to each dimension.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> The aim of selecting correlated HVGs is to reduce the amount of technical and (uncorrelated/uninteresting) biological noise in the data to be used for downstream analyses. This improves the performance of dimensionality reduction approaches, especially if the substructure is relatively weak. For example, with PCA, adding a large number of uncorrelated genes will interfere with correct placement of cells along a trajectory:</p>
                <p> </p>
                <p> 
                    <italic>par(mfrow=c(1,2))</italic>
                </p>
                <p>
                    <italic> loc &lt;- 1:100/100 # True placement of cells</italic>
                </p>
                <p>
                    <italic> a1 &lt;- matrix(jitter(rep(loc, 50)), nrow=50, byrow=TRUE) # Correlated genes</italic>
                </p>
                <p>
                    <italic> x1 &lt;- prcomp(t(a1))</italic>
                </p>
                <p>
                    <italic> plot(x1$x[,1]) # Should be on the diagonal</italic>
                </p>
                <p>
                    <italic> a2 &lt;- rbind(a1, matrix(rnorm(100000), ncol=100)) # Adding uncorrelated noise</italic>
                </p>
                <p>
                    <italic> x2 &lt;- prcomp(t(a2))</italic>
                </p>
                <p>
                    <italic> plot(x2$x[,1]) # Correct placing is disrupted</italic>
                </p>
                <p> </p>
                <p> Similar arguments can be made for distance-based approaches like t-SNE and diffusion maps, where the nearest neighbours become more difficult to identify correctly with increasing noise.</p>
                <p> </p>
                <p> Finally, the identification of correlated HVGs does not need to consider the nature of the substructure. We only need to identify the genes that are affected by this substructure, in one way or the other - it is the function of downstream analyses to determine what the substructure actually represents.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>13. In the brain dataset, correlated HVGs were assessed considering the design &lt;- model.matrix(~sce$sex); correlatePairs(sce, design=design). It would be useful to further explain here how this function accounts for the design matrix in the assessment of Spearman&#x2019;s rho.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> For one-way layouts, a value of rho is first computed within each group of cells. The average across all groups (weighted by the number of cells) is then used as the final value of rho for any given pair of genes. For more complex designs, a linear model is fitted to the log-normalized counts, and rho is calculated using the residuals of the model fit. (While the linear model approach also works for one-way layouts, it requires some additional assumptions that can be avoided with a simpler group-based approach.) More details can be found in the documentation for the correlatePairs() function.</p>
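                <p> </p>
                <p> The two strategies can be sketched in base R for a single pair of genes. This is a simplified illustration of the description above, not the implementation inside correlatePairs(); the simulated data and all object names are hypothetical.</p>
                <preformat><![CDATA[
```r
# Simplified illustration only; not the code used by correlatePairs().
set.seed(1)
group <- factor(rep(c("female", "male"), c(30, 20)))  # hypothetical blocking factor
x <- rnorm(50) + (group == "male")   # toy log-expression values for gene 1
y <- x + rnorm(50)                   # toy log-expression values for gene 2

# One-way layout: compute rho within each group of cells, then average
# across groups with weights equal to the number of cells in each group.
cells.by.group <- split(seq_along(group), group)
rho.by.group <- sapply(cells.by.group, function(i) {
    cor(x[i], y[i], method="spearman")
})
n.by.group <- lengths(cells.by.group)
rho.oneway <- weighted.mean(rho.by.group, n.by.group)

# More complex designs: fit a linear model to the log-expression values
# and compute rho from the residuals of the fit.
design <- model.matrix(~group)
fit <- lm.fit(design, cbind(x, y))
rho.resid <- cor(fit$residuals[,1], fit$residuals[,2], method="spearman")

c(oneway=rho.oneway, residuals=rho.resid)
```
]]></preformat>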
                <p> </p>
                <p> 
                    <italic>
                        <bold>14. In the brain dataset, removeBatchEffect from the limma package is used to remove the sex effect. Then t-SNE and PCA are applied to the sex-corrected expression values restricted to correlated HVGs. Consistently, correlated HVGs were assessed considering the very same factor: design &lt;- model.matrix(~sce$sex); correlatePairs(sce, design=design). I would further warn the reader and stress the necessity of that consistency between both steps.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We have added a comment on this to the manuscript.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>15. For consistency with the assessment of correlations based on Spearman's rho, in the heatmap I would recommend first assessing the dendrograms for the cells and the genes using a Spearman correlation as well...</bold>
                    </italic>
                </p>
                <p> </p>
                <p> Our dendrograms are constructed based on the distances between cells, which is different from the correlations between genes. Using the correlations to cluster the genes makes more sense with respect to checking consistency, but the primary aim of our analysis is to identify clusters of cells (potential subpopulations) rather than clusters of genes. The latter is certainly a worthwhile analysis (e.g. to identify gene modules) but, in the context of this workflow, the differences in clustering on correlations versus Euclidean distances would only have a minor effect.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>16. I would suggest explicitly mentioning in the pipeline which approaches are based on linear or non-linear assumptions. The workflow alternates methods from both categories, which should be taken into account to understand their downstream consequences...</bold>
                    </italic>
                </p>
                <p> </p>
                <p> Obviously, each computational method makes a number of assumptions. For the sake of readability and simplicity (especially for inexperienced readers), we have not discussed most of these assumptions in this workflow, except for those that are critical to choosing between methods, e.g. spike-in normalization versus deconvolution. Nonetheless, we have modified the manuscript to elaborate on the reasons for using non-linear methods such as Spearman's rho and t-SNE.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>17. The use of hierarchical clustering for clustering cells into putative subpopulations is based on Euclidean distances (or correlations) assessed in a non-orthogonal space. I would rather favor an analytical clustering directly performed in a low-dimensional orthogonal space such as those produced by PCA, ICA or MDS, in which the most informative dimensions can be selected (e.g. through their eigenvalues in PCA).</bold>
                    </italic>
                </p>
                <p> </p>
                <p> There are many possible approaches to clustering, each with their own advantages and disadvantages. For example, pre-selection of a low-dimensional space via PCA may reduce noise during clustering, but it may also discard subtle features present in lower-ranked PCs. Our clustering approach is simple but effective enough, which is why we have used it in this workflow. Other methods may well do better, but a discussion of the pros and cons of different clustering strategies is beyond the scope of this article.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>18. A PCA analysis should be accompanied by a plot representing the % of variance explained by each principal component, so that the number of relevant dimensions to be retained can be judged while disregarding the rest as &#x201c;noise&#x201d;. It could be the case that more than 2 dimensions are relevant to separate subpopulations in finer detail. The inspection of eigenvalues would help support the statement that "PCA plot is less effective at separating cells into many different clusters (Figure 24). This is because the first two principal components are driven by strong differences between specific subpopulations, which reduces the resolution of more subtle differences between some of the other subpopulations."</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We only use PCA for visualization, rather than selection of principal components for further quantitative analysis in low-dimensional space. For this purpose, knowing the relative contributions to the total variance from non-visualized components is less helpful. For example, even if we determined that the top 10 dimensions were "relevant", it is unclear how this would assist visualization. Nonetheless, we now mention in the text how this information can be generated and used.</p>
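                <p> </p>
                <p> For readers who want this information, the proportion of variance explained by each component can be obtained from the standard deviations returned by prcomp(). The following is a minimal sketch on simulated data, assuming a hypothetical genes-by-cells matrix named exprs.mat; it is not the code from the workflow.</p>
                <preformat><![CDATA[
```r
# Minimal sketch on simulated data; exprs.mat is a hypothetical object.
set.seed(10)
exprs.mat <- matrix(rnorm(200 * 50), nrow=200)  # 200 genes by 50 cells
pc <- prcomp(t(exprs.mat))                      # cells as rows, genes as columns

# Proportion of the total variance explained by each principal component.
var.explained <- pc$sdev^2 / sum(pc$sdev^2)

# Bar plot of the percentage of variance explained by the first 10 components.
barplot(100 * var.explained[1:10], names.arg=1:10,
    xlab="Principal component", ylab="Variance explained (%)")
```
]]></preformat>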
                <p> </p>
                <p> 
                    <italic>
                        <bold>19. In the brain analysis, the three main steps are: 1) the deconvolution method is used to normalize expression levels. Here similar cells are clustered together and cells are normalized in each cluster. The authors state: &#x201c;This improves normalization accuracy by reducing the number of DE genes between cells in the same cluster&#x201d;. Clustering is performed here with the quickCluster function from the scran package, where a distance matrix is constructed using Spearman&#x2019;s correlation on the counts between cells. 2) A hierarchical clustering is then performed and a dynamic tree cut is used to define clusters of cells. Then, the batch(sex)-corrected expression values of the (Spearman&#x2019;s rho) correlated HVGs are used to build a dendrogram through hierarchical clustering on the Euclidean distances between cells, where clusters are defined. And 3) those clusters are used to assess DE with edgeR on the counts, normalized using the library size-adjusted size factors (if I understood correctly) and including all genes (not only correlated HVGs). I personally found this procedure a bit cumbersome, as it relies on different types of expression matrices and metrics in each of the 3 steps (see also next comment). I also wonder to what extent the initial quickCluster results could be biasing the clusters detected downstream and, if so, whether the normalization step would in turn be biasing the differential expression results. The correspondence between the quickCluster results and the clusters from the dendrograms should at least be inspected and discussed.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> In terms of the choice of matrices and metrics, we have chosen approaches that we feel are suitable for each step of the workflow. Given that each step examines a different aspect of the data, some flexibility is inevitably required in supplying the correct input to each method.</p>
                <p> </p>
                <p> Regarding quickCluster, Lun 
                    <italic>et al.</italic> (2016) show that unbiased size factor estimates are still obtained after clustering. This is because size factors computed within each cluster are explicitly corrected to be comparable between clusters. As for the similarity between the quickCluster results and the dendrograms, we do not believe that this provides a useful indication of method performance. Some agreement is expected, as the two methods should recover similar structure in the data. However, some disagreement is also expected, as quickCluster provides a quick-and-dirty clustering to reduce the number of DE genes present during deconvolution, while the dendrograms are much more refined due to feature selection. Such incongruences are not a problem for normalization - even if quickCluster identifies the "incorrect" clusters, it is still adequate if it separates cells with vastly different transcriptomic profiles.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>20. In line with the previous comment, in the brain analysis I wonder whether the pipeline could somehow be simplified by 1) performing spike-in normalization (which seems possible given the quality of the spike-in trend observed in Figure 21), 2) doing a PCA on the batch(sex)-corrected expression values of all genes (not only correlated HVGs), and performing clustering on the retained principal components, and 3) assessing DE with edgeR on the counts normalized using the spike-in factors.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> One could certainly perform such an analysis. However, we chose to use the approach described in the workflow, because feature selection can improve the results of downstream analyses, as discussed in our response to point 12; and the choice of whether or not to do spike-in normalization depends primarily on whether total RNA content is interesting, not on the quality of the spike-ins.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>21. Some code at the beginning of the analysis to check and install all the required packages would be welcome.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We have added a link to the Bioconductor workflow page, which provides instructions for installing all required packages and running the workflow.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>22. Everything ran smoothly in our hands except for the gdata package when trying to read the xls file. The Perl command interpreter ran abnormally long and used a large amount of RAM. We finally opened the xls file in Excel, converted it into a tab-separated file, and then read it using the general read.table command.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We understand the suboptimality of dealing with Excel files in bioinformatics analysis. Unfortunately, the authors of this study provided the count data in Excel format on NCBI GEO. We decided to load the data directly rather than manually supplying the counts in a simpler format. The latter would make the workflow less generalisable as it would no longer use data from public, well-recognised sources. In our hands, loading of the Excel file usually requires a couple of minutes and 3-4 GB of RAM.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report15991">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10234.r15991</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>duVerle</surname>
                        <given-names>David</given-names>
                    </name>
                    <xref ref-type="aff" rid="r15991a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r15991a1">
                    <label>1</label>Department of Computational Biology and Medical Sciences, University of Tokyo, Tokyo, Japan</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>18</day>
                <month>10</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 duVerle D</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport15991" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The pipeline described in this article seems promising. I was able to partly reproduce the results, as well as run similar treatment on a single cell dataset of my own.</p>
            <p> </p>
            <p> However: 
                <list list-type="bullet">
                    <list-item>
                        <p>The fundamental flaws mentioned by other reviewers over a month ago still have not been addressed: the pipeline requires Dev versions of R and Bioconductor packages, yet makes no mention of it anywhere in the article.</p>
                    </list-item>
                    <list-item>
                        <p>In fact, even after installing the Bioconductor Dev versions of all required modules, it would appear the pipeline no longer works with the latest versions (e.g. scran_1.1.10,&#x00a0;with R 3.3.1):</p>
                    </list-item>
                </list> </p>
            <p> 
                <italic>&gt; isSpike(sce) &lt;- "ERCC"</italic>
            </p>
            <p>
                <italic> Error in `isSpike&lt;-`(`*tmp*`, value = "Spike") :&#x00a0;</italic>
            </p>
            <p>
                <italic> &#x00a0; 'isSpike' must be logical or NULL</italic>
            </p>
            <p>
                <italic> </italic>
            </p>
            <p>
                <italic> etc.</italic>
            </p>
            <p> </p>
            <p> While likely easy to fix, this type of incompatibility issue undermines the entire point of the article and perfectly illustrates the dangers of relying on development versions for this type of pipeline. 
                <list list-type="bullet">
                    <list-item>
                        <p>Additionally, the example dataset used by the article is loaded from an Excel spreadsheet, which is 
                            <ext-link ext-link-type="uri" xlink:href="http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80">generally considered extremely bad practice</ext-link>. It would behoove the authors of a software walkthrough aimed at somewhat-novice bioinformaticians to encourage best practices.</p>
                    </list-item>
                    <list-item>
                        <p>In the current conditions, and until some 
                            <bold>major revision</bold> work is done, it is impossible to properly review the pipeline and approve this article unreservedly.</p>
                    </list-item>
                </list>
            </p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-15991-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics.</article-title>
                        <source>
                            <italic>BMC Bioinformatics</italic>
                        </source>.<year>2004</year>;<volume>5</volume>:
                        <elocation-id>10.1186/1471-2105-5-80</elocation-id>
                        <fpage>80</fpage>
                        <pub-id pub-id-type="pmid">15214961</pub-id>
                        <pub-id pub-id-type="doi">10.1186/1471-2105-5-80</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment2245-15991">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Lun</surname>
                            <given-names>Aaron</given-names>
                        </name>
                        <aff>Cancer Research UK Cambridge Research Institute, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>None declared.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>10</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for your comments, David. Regarding the incompatibility in software versions, we have been waiting for the imminent release of the latest version of Bioconductor (3.4) before revising the article. It seemed more prudent to wait for the latest software to become available, rather than making stop-gap modifications to accommodate soon-to-be-obsolete versions. We believe that this update should clear up any problems with execution of the workflow.</p>
                <p> </p>
                <p> We agree that Excel spreadsheets are a poor formatting choice for bioinformatics work. Unfortunately, the processed dataset is provided in this format from NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61533, see Supplementary files). While having to tackle Excel formatting is not ideal, it is preferable to having to re-process the entire dataset to obtain counts from the raw read sequences. Moreover, at no point do we save into Excel - analysis results are always stored in simple tab-delimited formats, and the R objects themselves are saved in serialized form.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report16243">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10234.r16243</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>McDavid</surname>
                        <given-names>Andrew</given-names>
                    </name>
                    <xref ref-type="aff" rid="r16243a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-6581-1213</uri>
                </contrib>
                <aff id="r16243a1">
                    <label>1</label>Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>30</day>
                <month>9</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 McDavid A</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport16243" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Lun, McCarthy and Marioni share a workflow for analysis of single cell RNAseq (scRNA-seq) data using software they have developed. The workflow is illustrated on two data sets of varying size and characteristics. The computational and statistical findings of the workflow are interpreted in their experimental context. Having a well-documented protocol for the analysis of scRNA-seq is an important contribution to the community, since it is still a wilderness in terms of methods and processing, for better or worse. That scRNA-seq is a quickly evolving discipline--and the implications this has for the workflow--forms the bulk of my criticism of this paper.</p>
            <p> </p>
            <p> 
                <bold>1a.</bold> The paper describes a currently-unreleased version of software. Other reviewers have indicated the difficulties this poses. I trust the authors will verify the correctness of their code and reproducibility of the analysis when their packages are finalized in Bioconductor 3.4. I also trust that this workflow will be made available as a literate (e.g. knitr) document so that readers won't have to cut and paste from their web browser. This reviewer was able to reproduce the figures reported in the first data set after loading the development version of `
                <italic>scater</italic>` (now version 1.1.14).</p>
            <p> </p>
            <p> 
                <bold>1b.</bold> The main software package `
                <italic>scater</italic>` defines a `SCESet` inheriting from `ExpressionSet`, which has been superseded by `SummarizedExperiment`. SummarizedExperiment is more likely to scale to large data sets (it can store data out of core or in sparse matrix formats). In practice, this is not such a big deal since it's relatively easy to coerce between the two object types.</p>
            <p> </p>
            <p> </p>
            <p> 
                <bold>2.</bold> &#x00a0;The title of this article stipulates that it is for "low-level" analysis of RNA-seq data, but the all-important question of how to process the data as many analysts will get them (short reads as .fasta files) is elided. &#x00a0;</p>
            <p> </p>
            <p> (Pseudo)-alignment and quantification is an important, and probably overlooked, step in scRNA-seq analysis. Counting transcripts by counting overlaps with features, a la 
                <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">countOverlaps</ext-link> or 
                <ext-link ext-link-type="uri" xlink:href="http://www-huber.embl.de/HTSeq/doc/overview.html">HTSeq</ext-link>, is inefficient
                <sup>1</sup>, since many reads (30%-80% of those that map anywhere, in this reviewer's experience) do not align uniquely. Hence the need for, and value of, quantification tools that respect the degeneracy of multimapping reads, e.g., RSEM, STAR, kallisto, Sailfish,
                <italic> et al</italic>. &#x00a0;A low-level analysis thus may wish to consider remapping with an appropriate tool. Fortunately, it does appear that `
                <italic>scater</italic>` has provisions for doing (re)-alignment with Kallisto.</p>
            <p> </p>
            <p> 
                <bold>3.</bold> &#x00a0;The authors may consider referencing other extant methods that could address areas of their workflow, especially methods that are adapted to deal with the non-normality of scRNA-seq data. 
                <list list-type="bullet">
                    <list-item>
                        <p>For normalization, there is 
                            <ext-link ext-link-type="uri" xlink:href="https://github.com/YosefLab/scone">scone</ext-link>, which evaluates many different normalization procedures and selects the "best" one.</p>
                    </list-item>
                    <list-item>
                        <p>For identification of highly variable genes, there is BASiCS
                            <sup>2</sup>, which applies a hierarchical Bayesian model to test for over-dispersion, as opposed to modeling departures from an overall mean-variance relationship.</p>
                    </list-item>
                    <list-item>
                        <p>For single cell differential expression and gene set enrichment for bimodal distributions found in scRNA-seq, there is 
                            <ext-link ext-link-type="uri" xlink:href="https://github.com/RGLab/MAST">MAST</ext-link>
                            <sup>3</sup>.</p>
                    </list-item>
                    <list-item>
                        <p>For clustering, there is 
                            <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/devel/bioc/html/clusterExperiment.html">clusterExperiment</ext-link>.</p>
                    </list-item>
                    <list-item>
                        <p>For multi-dimensional scaling on bimodal data, there is ZIFA
                            <sup>4</sup>.&#x00a0;All of the above, aside from ZIFA, are R/Bioconductor packages.</p>
                    </list-item>
                </list>
            </p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-16243-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.</article-title>
                        <source>
                            <italic>BMC Bioinformatics</italic>
                        </source>.<year>2011</year>;<volume>12</volume>:
                        <elocation-id>10.1186/1471-2105-12-323</elocation-id>
                        <fpage>323</fpage>
                        <pub-id pub-id-type="pmid">21816040</pub-id>
                        <pub-id pub-id-type="doi">10.1186/1471-2105-12-323</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-16243-2">
                    <label>2</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>BASiCS: Bayesian Analysis of Single-Cell Sequencing Data.</article-title>
                        <source>
                            <italic>PLoS Comput Biol</italic>
                        </source>.<year>2015</year>;<volume>11</volume>(<issue>6</issue>) :
                        <elocation-id>10.1371/journal.pcbi.1004333</elocation-id>
                        <fpage>e1004333</fpage>
                        <pub-id pub-id-type="pmid">26107944</pub-id>
                        <pub-id pub-id-type="doi">10.1371/journal.pcbi.1004333</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-16243-3">
                    <label>3</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data.</article-title>
                        <source>
                            <italic>Genome Biol</italic>
                        </source>.<year>2015</year>;<volume>16</volume>:
                        <elocation-id>10.1186/s13059-015-0844-5</elocation-id>
                        <fpage>278</fpage>
                        <pub-id pub-id-type="pmid">26653891</pub-id>
                        <pub-id pub-id-type="doi">10.1186/s13059-015-0844-5</pub-id>
                    </mixed-citation>
                </ref>
                <ref id="rep-ref-16243-4">
                    <label>4</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis.</article-title>
                        <source>
                            <italic>Genome Biol</italic>
                        </source>.<year>2015</year>;<volume>16</volume>:
                        <elocation-id>10.1186/s13059-015-0805-z</elocation-id>
                        <fpage>241</fpage>
                        <pub-id pub-id-type="pmid">26527291</pub-id>
                        <pub-id pub-id-type="doi">10.1186/s13059-015-0805-z</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment2244-16243">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Lun</surname>
                            <given-names>Aaron</given-names>
                        </name>
                        <aff>Cancer Research UK Cambridge Research Institute, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>None declared.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>10</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for your comments, Andrew. Our responses are as below.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>1a. The paper describes a currently-unreleased version of software. Other reviewers have indicated the difficulties this poses. I trust the authors will verify the correctness of their code and reproducibility of the analysis when their packages are finalized in Bioconductor 3.4. I also trust that this workflow will be made available as a literate (e.g., knitr) document so that readers won't have to cut and paste from their web-browser. This reviewer was able to reproduce the figures reported in the first data set after loading the development version of `scater` (now version 1.1.14).</bold>
                    </italic>
                </p>
                <p> </p>
                <p> Yes, this was an oversight on our part. The revised version will include a link to the Bioconductor workflow page, where users can simply run a command to automatically download the relevant data files and packages prior to running the workflow.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>1b. The main software package `scater` defines a `SCESet` inheriting from `ExpressionSet`, which has been superseded by `SummarizedExperiment`. `SummarizedExperiment` is more likely to scale to large data sets (it can store data out of core or in sparse matrix formats). In practice, this is not such a big deal since it's relatively easy to coerce between the two object types.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We considered the practicality of storing data in sparse matrix format. Unfortunately, most existing tools for downstream data analysis require a full-sized matrix as input, so any gains in memory efficiency during storage seem to be countered by the need to (repeatedly) expand the matrix at multiple analysis steps. Moreover, a sparse matrix only improves efficiency for raw count data where unambiguous zeroes are present; upon applying normalization and transformation steps, this may no longer be the case, such that a full-sized matrix will ultimately be required anyway.</p>
                <p> </p>
                <p> 
                    <italic>
                        <bold>2.&#x00a0; The title of this article stipulates that it is for "low-level" analysis of RNA-seq data, but the all-important question of how to process the data as many analysts will get them (short reads as .fasta files) is elided. (Pseudo)-alignment and quantification are an important, and probably overlooked, step in scRNA-seq analysis. Counting transcripts by counting overlaps with features, a la countOverlaps or HTSeq, is inefficient<sup>1</sup>, since many reads (30%-80% of those that map anywhere, in this reviewer's experience) do not align uniquely. Hence the need for, and value of, quantification tools that respect the degeneracy of multimapping reads, e.g., RSEM, STAR, kallisto, Sailfish, et al.&#x00a0; A low-level analysis thus may wish to consider remapping with an appropriate tool. Fortunately, it does appear that `scater` has provisions for doing (re)-alignment with Kallisto.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> We find that conventional feature counting works quite well for read-based scRNA-seq data, having used this approach in several recent studies (Achim 
                    <italic>et al.</italic>, 2015; Kolodziejczyk 
                    <italic>et al.</italic>, 2015; Scialdone 
                    <italic>et al</italic>., 2016). While ignoring multi-mapped reads during quantification is conservative, we feel that it does provide a greater degree of confidence in our downstream inferences. Certainly, there may be gains in power from using tools that extract more information from multi-mapping reads, but we do not consider this advantage to be so pronounced that it should be standard procedure for all scRNA-seq data analyses. For UMI-based data, there does not yet appear to be any clear "gold standard" approach for UMI processing into counts, so we have not provided any description of that step.</p>
                <p> </p>
                <p> In summary, we decided to start the workflow from the raw count data, rather than starting from read sequences, as conventional approaches for quantification described elsewhere seem to work well; to maintain some flexibility with respect to future developments in this field; and because our workflow focuses on the steps of the analysis that are carried out in R/Bioconductor, whereas most existing quantification tools require manual installation and execution from the command-line.</p>
                <p> </p>
                <p> </p>
                <p> 
                    <bold>References:</bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Achim 
                                <italic>et al.</italic> (2015), 
                                <italic>Nature Biotechnology</italic> 33:503&#x2013;509</p>
                        </list-item>
                        <list-item>
                            <p>Kolodziejczyk 
                                <italic>et al.</italic> (2015), 
                                <italic>Cell Stem Cell</italic> 17(4):471&#x2013;485</p>
                        </list-item>
                        <list-item>
                            <p>Scialdone 
                                <italic>et al.</italic> (2016), 
                                <italic>Nature</italic> 535:289&#x2013;293</p>
                        </list-item>
                    </list> </p>
                <p> 
                    <italic>
                        <bold>3. The authors may consider referencing other extant methods that could address areas of their workflow, especially methods that are adapted to deal with the non-normality of scRNA-seq data.</bold>
                    </italic>
                </p>
                <p> </p>
                <p> As you have stated, there are many alternative approaches that could be used in various parts of the workflow. However, we feel that it is beyond the scope of this article to enter into discussions about the relative advantages of different methods. In fact, this may undermine the pedagogical value of the workflow by providing too many options to inexperienced users. The methods we have described work well in a variety of situations, so we have chosen them for use in the various analysis steps. We have added a sentence to the discussion about the existence of alternative methods for low-level processing, and encouraged experienced users to explore them.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report15987">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10234.r15987</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Ji</surname>
                        <given-names>Hongkai</given-names>
                    </name>
                    <xref ref-type="aff" rid="r15987a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r15987a1">
                    <label>1</label>Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>19</day>
                <month>9</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Ji H</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport15987" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In this article, the authors introduce a computational workflow to perform low-level analysis of single-cell RNA-seq (scRNA-seq) data based on R and Bioconductor. The workflow takes a read count matrix as input, and it provides R commands for loading data, quality control, gene filtering, data normalization (with or without spike-in controls), classifying cells based on their cell cycle phase, identifying highly variable genes, analyzing genes&#x2019; pairwise correlation, and basic data exploration such as clustering and visualization. The workflow is demonstrated using a number of real data examples. Overall, I think that the workflow provides a timely and very useful guide for people who want to analyze scRNA-seq data.</p>
            <p> This study is largely reproducible. I am able to obtain all major results in this article by running the commands provided by the authors. I have several comments and suggestions which I hope the authors can address in order to make their workflow more user-friendly.</p>
            <p> </p>
            <p> 1. It seems that installing the right version of R and Bioconductor is crucial for this pipeline to work. Some commands in the workflow depend on R version 3.3.1 or higher and the development (devel) version of Bioconductor. The first time I tried the workflow, I encountered numerous errors. For example,</p>
            <p> </p>
            <preformat>&gt; isSpike(sce) &lt;- "ERCC"
Error in `isSpike&lt;-`(`*tmp*`, value = "ERCC") :
  'isSpike' must be logical or NULL
&gt; numcells &lt;- nexprs(sce, byrow=TRUE)
Error: could not find function "nexprs"
&gt; sce &lt;- computeSpikeFactors(sce, type="ERCC", general.use=FALSE)
Error in .local(x, ...) :
  unused arguments (type = "ERCC", general.use = FALSE)</preformat>
            <p> It turns out that I used an older version of R and Bioconductor. I then updated my R and Bioconductor packages and still had many problems. Finally, I decided to completely remove R and Bioconductor from my computer. I then installed R 3.3.1 and Bioconductor (devel version), and the pipeline worked. Although I eventually fixed the problem, I feel that this trial and error process can be frustrating for users. I therefore suggest that the authors make the R/Bioconductor dependencies clear at the beginning of the article. It would be even better if the authors could minimize the pipeline&#x2019;s dependency on certain versions of R/Bioconductor.</p>
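            <p> A minimal sketch of how such a version requirement could be checked at the top of a script, using only base R. The helper name meets_minimum is hypothetical, and the 3.3.1 threshold is taken from the experience described above rather than from any documented requirement of the workflow:</p>
            <preformat><![CDATA[
```r
# Hypothetical helper: TRUE if 'current' is at least 'minimum'.
# Both arguments may be version strings or numeric_version objects.
meets_minimum <- function(current, minimum) {
    as.numeric_version(current) >= as.numeric_version(minimum)
}

# Assumed threshold, based on the R version reported to work in this review.
min_r <- "3.3.1"

if (!meets_minimum(getRversion(), min_r)) {
    stop("This workflow was tested with R >= ", min_r,
         "; found R ", as.character(getRversion()))
}
```
]]></preformat>
            <p> Failing early with an explicit message is intended to replace the less informative downstream errors (such as a missing function) with a statement of the actual cause.</p>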
            <p> </p>
            <p> </p>
            <p> 2. This workflow uses a number of R and Bioconductor packages. A user may not have all packages installed on their computer. Installing these packages one by one manually can be a little tedious. It would be nice if the authors could provide an R script that automatically finds missing packages on a user&#x2019;s computer and installs them. This could improve the pipeline&#x2019;s user experience.</p>
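            <p> The detection half of such a script might be sketched as follows. The function name find_missing and the two-package list are illustrative assumptions only, not the workflow's actual dependency list, and the installation step is left as a comment because the appropriate installer depends on the Bioconductor version in use:</p>
            <preformat><![CDATA[
```r
# Hypothetical helper: return the packages in 'required' that are not installed.
# 'installed' defaults to the packages visible in the current library paths.
find_missing <- function(required,
                         installed = rownames(installed.packages())) {
    setdiff(required, installed)
}

# Illustrative list only; the real workflow depends on more packages than this.
required <- c("scater", "scran")

missing_pkgs <- find_missing(required)
if (length(missing_pkgs) > 0) {
    message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
    # Install 'missing_pkgs' here with the installer matching your
    # Bioconductor version; this step is deliberately not executed.
} else {
    message("All listed packages are installed.")
}
```
]]></preformat>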
            <p> </p>
            <p> </p>
            <p> 3. It will also be useful if the authors can provide an R file that contains all commands in the workflow so that users only need to slightly edit their code for future datasets. It might be beyond the scope of this article, but the authors may consider delivering the pipeline using an R Shiny graphical user interface in the future to make it accessible to users without R coding experience.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment2194-15987">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Lun</surname>
                            <given-names>Aaron</given-names>
                        </name>
                        <aff>Cancer Research UK Cambridge Research Institute, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests are declared.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>19</day>
                    <month>9</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for your comments, Hongkai. Our responses to each of your points are below: 
                    <list list-type="order">
                        <list-item>
                            <p>Yes, this was an oversight on our part. The pipeline was developed using packages from BioC-devel, to take advantage of cutting-edge methods in each package. For that reason, the pipeline is strictly dependent on Bioconductor release version 3.4, a fact that we will make explicit in the next revision. We do not think that&#x00a0;this is&#x00a0;a major inconvenience given that the next release of Bioconductor is less than a month away.</p>
                        </list-item>
                        <list-item>
                            <p>This is a good point. In fact, this article would ideally coincide with a parallel release on the Bioconductor workflow 
                                <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/help/workflows/simpleSingleCell/">page</ext-link>, where the workflow installation machinery will automatically install all dependencies required for the package. Unfortunately, because this article was using packages from BioC-devel, we were unable to coordinate its release with that on the Bioconductor workflow page (which is limited to BioC-release packages). This will be fixed in the next revision where we will add a reference to Bioconductor-based installation of required packages.</p>
                        </list-item>
                        <list-item>
                            <p>While we understand the convenience that an R script can offer, we feel that supplying such a script would invite attempts to blindly use the code without considering the context or caveats of the various methods. We believe that some initial copy-pasting&#x00a0;is a small&#x00a0;price to pay if the user is consistently reminded of how to properly&#x00a0;interpret the output. Note that the Bioconductor workflow site and our GitHub 
                                <ext-link ext-link-type="uri" xlink:href="https://github.com/MarioniLab/BiocWorkflow2016">page</ext-link>&#x00a0;provide&#x00a0;an R Markdown file containing all the necessary code blocks for easy execution of the entire workflow; if necessary, users can change the input files to generate an analysis&#x00a0;report similar to&#x00a0;the article. Of course, a&#x00a0;graphical user interface is even more intuitive, but this is difficult to set up in a manner that is amenable to rigorous and reproducible data analysis.</p>
                        </list-item>
                    </list>
                </p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report15990">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10234.r15990</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Low</surname>
                        <given-names>Diana H.P.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r15990a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r15990a1">
                    <label>1</label>Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore, Singapore</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>9</day>
                <month>9</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Low DHP</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport15990" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.9501.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Lun and colleagues describe a low-level analysis specific for single-cell RNA-seq experiments, using open-source packages available on Bioconductor. This paper could potentially be a valuable resource for those who want to carry out such analysis in R.</p>
            <p> </p>
            <p> The steps are very descriptive, and they even include 2 different datasets presenting different types and conditions for analysis. They have done a very thorough job in explaining the decisions taken at each step of QC, filtering and normalization, and provide some basic but important visualization examples (clustering, heatmaps) that would help in assessing not only the technical quality of the dataset, but also the outcome of the experiment itself.</p>
            <p> </p>
            <p> Unfortunately, I could not run some of the steps in the workflow, which prevented me from assessing the code. Some I could figure out and "fix" in the attempt to run the code, but others not so much.</p>
            <p> </p>
            <p> I provide some (not exhaustive) examples below to help in the troubleshooting, and if these (and the subsequent code relying on these outputs) could be solved, I would be happy to continue the review further. 
                <list list-type="order">
                    <list-item>
                        <p>isSpike(sce) &lt;- "ERCC" //worked with isSpike(sce) &lt;- is.spike</p>
                    </list-item>
                    <list-item>
                        <p>can't find the function nexprs [I had to use numcells &lt;- rowSums(exprs(sce)!=0)]</p>
                    </list-item>
                    <list-item>
                        <p>is.ercc &lt;- isSpike(sce, type="ERCC") //worked with&#x00a0;is.ercc &lt;- isSpike(sce)</p>
                    </list-item>
                    <list-item>
                        <p>Could not run code from the section:&#x00a0;Identifying HVGs from the normalized log-expression</p>
                    </list-item>
                </list> var.fit &lt;- trendVar(sce, trend="loess", use.spikes=FALSE, span=0.2)</p>
            <p> Error in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, &#x00a0;:&#x00a0;</p>
            <p> &#x00a0; invalid 'x'</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment2178-15990">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Lun</surname>
                            <given-names>Aaron</given-names>
                        </name>
                        <aff>Cancer Research UK Cambridge Research Institute, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests are declared.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>9</day>
                    <month>9</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thanks for your comments, Diana. The code actually depends on Bioconductor version 3.4 (i.e., BioC "devel"), rather than the current Bioconductor 3.3 (i.e., BioC "release"). This allows us to include cutting-edge features from all packages to provide a high level of functionality in the workflow. However, some of these features are not present in the release version, thus leading to execution failure.</p>
                <p> </p>
                <p> The devel versions of all packages can be easily installed by calling&#x00a0;
                    <italic>useDevel()&#x00a0;</italic>followed by&#x00a0;
                    <italic>biocLite()</italic>, as described on the Bioconductor 
                    <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/developers/how-to/useDevel/">website</ext-link>. We will also modify the text to explicitly state that Bioconductor 3.4 is required; currently, this can only be implicitly determined from the package versions, which admittedly is not obvious to casual users.</p>
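                <p> As a sketch only, the sequence described above would look roughly as follows with the BiocInstaller tooling of that period. It assumes network access and an R/Bioconductor installation that still supports biocLite(); later Bioconductor releases replaced this mechanism, so the snippet illustrates the procedure as it stood at the time of this response rather than current practice:</p>
                <preformat><![CDATA[
```r
# Legacy installer script; defines biocLite() and installs BiocInstaller if needed.
# Assumption: this only works on R/Bioconductor versions that still support it.
source("https://bioconductor.org/biocLite.R")
library(BiocInstaller)

# Point this R installation at the devel branch of Bioconductor.
useDevel()

# Update installed packages (and install the base set) from the devel repositories.
biocLite()
```
]]></preformat>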
            </body>
        </sub-article>
    </sub-article>
</article>
