<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.139116.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 3 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Hutchings</surname>
                        <given-names>Charlotte</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Dawson</surname>
                        <given-names>Charlotte S.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Krueger</surname>
                        <given-names>Thomas</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-8132-8870</uri>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Lilley</surname>
                        <given-names>Kathryn S.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Breckels</surname>
                        <given-names>Lisa M.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-8918-7171</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Cambridge Centre for Proteomics, University of Cambridge, Cambridge, CB2 1QR, UK</aff>
                <aff id="a2">
                    <label>2</label>Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:lms79@cam.ac.uk">lms79@cam.ac.uk</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>24</day>
                <month>10</month>
                <year>2023</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2023</year>
            </pub-date>
            <volume>12</volume>
            <elocation-id>1402</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>15</day>
                    <month>9</month>
                    <year>2023</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Hutchings C et al.</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/12-1402/pdf"/>
            <abstract>
                <p>
                    <bold>Background:</bold> Expression proteomics involves the global evaluation of protein abundances within a system. In turn, differential expression analysis can be used to investigate changes in protein abundance upon perturbation to such a system.</p>
                <p>
                    <bold>Methods:</bold> Here, we provide a workflow for the processing, analysis and interpretation of quantitative mass spectrometry-based expression proteomics data. This workflow utilizes open-source R software packages from the Bioconductor project and guides users end-to-end and step-by-step through every stage of the analyses. As a use-case we generated expression proteomics data from HEK293 cells with and without a treatment. Of note, the experiment included cellular proteins labelled using tandem mass tag (TMT) technology and secreted proteins quantified using label-free quantitation (LFQ).</p>
                <p>
                    <bold>Results:</bold> The workflow explains the software infrastructure before focusing on data import, pre-processing and quality control. This is done individually for TMT and LFQ datasets. The application of statistical differential expression analysis is demonstrated, followed by interpretation via gene ontology enrichment analysis.</p>
                <p>
                    <bold>Conclusions:</bold> A comprehensive workflow for the processing, analysis and interpretation of expression proteomics data is presented. The workflow is a valuable resource for the proteomics community, particularly for beginners with some familiarity with R who wish to understand their analyses and make data-driven decisions about them.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Bioconductor</kwd>
                <kwd>QFeatures</kwd>
                <kwd>proteomics</kwd>
                <kwd>shotgun proteomics</kwd>
                <kwd>bottom-up proteomics</kwd>
                <kwd>differential expression</kwd>
                <kwd>mass spectrometry</kwd>
                <kwd>quality control</kwd>
                <kwd>data processing</kwd>
                <kwd>limma</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="http://dx.doi.org/10.13039/100000936">
                    <funding-source>Gordon and Betty Moore Foundation</funding-source>
                    <award-id>#7872</award-id>
                </award-group>
                <award-group id="fund-2" xlink:href="http://dx.doi.org/10.13039/100010269">
                    <funding-source>Wellcome Trust</funding-source>
                    <award-id>110071/Z/15/Z</award-id>
                </award-group>
                <award-group id="fund-3" xlink:href="http://dx.doi.org/10.13039/100004325">
                    <funding-source>AstraZeneca</funding-source>
                    <award-id>BB/W509929/1</award-id>
                </award-group>
                <award-group id="fund-4" xlink:href="http://dx.doi.org/10.13039/501100000268">
                    <funding-source>Biotechnology and Biological Sciences Research Council</funding-source>
                    <award-id>BB/W509929/1</award-id>
                </award-group>
                <award-group id="fund-5" xlink:href="http://dx.doi.org/10.13039/501100007601">
                    <funding-source>Horizon 2020</funding-source>
                    <award-id>EPIC-XS(projectno:823839)</award-id>
                </award-group>
                <award-group id="fund-6">
                    <funding-source>Herchel Smith Research Studentship</funding-source>
                </award-group>
                <funding-statement>C. H. is funded through a BBSRC CASE award with AstraZeneca (BB/W509929/1). C. S. D. is funded by a Herchel Smith Research Studentship at the University of Cambridge, United Kingdom. T. K. is funded by a Gordon and Betty Moore Foundation Grant (#7872). K. S. L. is funded by Wellcome Trust (110071/Z/15/Z) and European Union Horizon 2020 program INFRAIA project EPIC-XS (project no.: 823839). L. M. B. is funded by European Union Horizon 2020 program INFRAIA project EPIC-XS (project no.: 823839). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <p>Proteins are responsible for carrying out a multitude of biological tasks, implementing cellular functionality and determining phenotype. Mass spectrometry (MS)-based expression proteomics allows protein abundance to be quantified and compared between samples. In turn, differential protein abundance can be used to explore how biological systems respond to a perturbation. Many research groups have applied such methodologies to understand mechanisms of disease, elucidate cellular responses to external stimuli, and discover diagnostic biomarkers (see Refs. 
                <xref ref-type="bibr" rid="ref1">1</xref>&#x2013;
                <xref ref-type="bibr" rid="ref3">3</xref> for recent examples). As the potential of proteomics continues to be realised, there is a clear need for resources demonstrating how to deal with expression proteomics data in a robust and standardised manner.</p>
            <p>The data generated during an expression proteomics experiment are complex, and unfortunately there is no one-size-fits-all method for the processing and analysis of such data. The reason for this is two-fold. Firstly, there is a wide range of experimental methods that can be used to generate expression proteomics data. Researchers can analyse full-length proteins (top-down proteomics) or complete an enzymatic digestion and analyse the resulting peptides. This proteolytic digestion can be either partial (middle-down proteomics) or complete (bottom-up proteomics). The latter approach is most commonly used because peptides have a more favourable ionisation capacity and predictable fragmentation patterns, and can be separated via reversed-phase liquid chromatography, ultimately making them more compatible with MS.
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> Within bottom-up proteomics, the relative quantitation of peptides can be determined using one of two approaches: (1) label-free or (2) label-based quantitation. Moreover, the latter can be implemented with a number of different peptide labelling chemistries, for example, using tandem mass tag (TMT), stable-isotope labelling by amino acids in cell culture (SILAC), isobaric tags for relative and absolute quantitation (iTRAQ), among others.
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup> MS analysis can also be used in either data-dependent or data-independent acquisition (DDA or DIA) mode.
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>
                </sup>
                <sup>,</sup>
                <sup>
                    <xref ref-type="bibr" rid="ref7">7</xref>
                </sup> Although all of these experimental methods typically result in a similar output, a matrix of quantitative values, the data are different and must be treated as such. Secondly, data processing is dependent upon the experimental goal and biological question being asked.</p>
            <p>Here, we provide a step-by-step workflow for processing, analysing and interpreting expression proteomics data derived from a bottom-up experiment using DDA and either LFQ or TMT label-based peptide quantitation. We outline how to process the data starting from a peptide-spectrum match (PSM)- or peptide-level 
                <monospace>.txt</monospace> file. Such files are the outputs of most major third-party search software (e.g. Proteome Discoverer, MaxQuant, FragPipe). We begin with data import and then guide users through the stages of data processing including data cleaning, quality control filtering, management of missing values, imputation, and aggregation to protein-level. Finally, we show how to identify differentially abundant proteins and carry out biological interpretation of the resulting data. The latter is achieved through the application of gene ontology (GO) enrichment analysis. Hence, users can expect to generate lists of proteins that are significantly up- or downregulated in their system of interest, as well as the GO terms that are significantly over-represented in these proteins.</p>
            <p>Using the R statistical programming environment
                <sup>
                    <xref ref-type="bibr" rid="ref8">8</xref>
                </sup> we make use of several state-of-the-art packages from the open-source, open-development Bioconductor project
                <sup>
                    <xref ref-type="bibr" rid="ref9">9</xref>
                </sup> to analyse use-case expression proteomics datasets
                <sup>
                    <xref ref-type="bibr" rid="ref10">10</xref>
                </sup> from both LFQ and label-based technologies.</p>
            <sec id="sec2">
                <title>Package installation</title>
                <p>In this workflow we make use of open-source software from the 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/">R Bioconductor</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref9">9</xref>
                    </sup> project. The 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/">Bioconductor initiative</ext-link> provides R software packages dedicated to the processing of high-throughput complex biological data. Packages are open-source, well-documented and benefit from an active community of developers. We recommend that users download the RStudio integrated development environment (IDE), which provides a graphical interface to the R programming language.</p>
                <p>Detailed instructions for the installation of Bioconductor packages are documented on the 
                    <ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/install/">Bioconductor Installation</ext-link> page. The main packages required for this workflow are installed using the code below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#17365D">if</styled-content> (!require(
                            <styled-content style="color:#37A82E">"BiocManager"</styled-content>, 
                            <styled-content style="color:#CC9900">quietly =</styled-content> TRUE)) {</monospace>

                        <monospace>install.packages(
                            <styled-content style="color:#37A82E">"BiocManager"</styled-content>)</monospace>

                        <monospace>}

</monospace>

                        <monospace>BiocManager::install(c(
                            <styled-content style="color:#37A82E">"QFeatures"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"ggplot2"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"stringr"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"NormalyzerDE"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"corrplot"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"Biostrings"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"limma"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"impute"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"dplyr"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"tibble"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"org.Hs.eg.db"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"clusterProfiler"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"enrichplot"</styled-content>))</monospace>
                    </preformat>
                </p>
                <p>After installation, each package must be loaded before it can be used in the R session. This is achieved via the 
                    <monospace>library</monospace> function. For example, to load the 
                    <monospace>QFeatures</monospace> package one would type 
                    <monospace>library("QFeatures")</monospace> after installation. Here we load all packages included in this workflow.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>library(
                            <styled-content style="color:#37A82E">"QFeatures"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"ggplot2"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"stringr"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"dplyr"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"tibble"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"NormalyzerDE"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"corrplot"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"Biostrings"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"limma"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"org.Hs.eg.db"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"clusterProfiler"</styled-content>)</monospace>

                        <monospace>library(
                            <styled-content style="color:#37A82E">"enrichplot"</styled-content>)</monospace>
                    </preformat>
                </p>
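                <p>As an optional check before loading, the short sketch below reports which of the workflow packages are not yet installed. It is an illustration rather than part of the original workflow: the object names 
                    <monospace>pkgs</monospace>, 
                    <monospace>installed</monospace> and 
                    <monospace>missing_pkgs</monospace> are our own, and the package list is simply copied from the installation call above.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
<monospace># Packages named in the installation call above
pkgs &lt;- c("QFeatures", "ggplot2", "stringr", "NormalyzerDE", "corrplot",
          "Biostrings", "limma", "impute", "dplyr", "tibble",
          "org.Hs.eg.db", "clusterProfiler", "enrichplot")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed &lt;- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report any packages that still need to be installed
missing_pkgs &lt;- pkgs[!installed]
if (length(missing_pkgs) &gt; 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All workflow packages are installed.")
}</monospace></preformat>
                </p>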
            </sec>
        </sec>
        <sec id="sec4">
            <title>The use-case: exploring changes in protein abundance in HEK293 cells upon perturbation</title>
            <p>As a use-case, we analyse two quantitative proteomics datasets derived from a single experiment. The aim of the experiment was to reveal the differential abundance of proteins in HEK293 cells upon a particular treatment, the exact details of which are anonymised for the purpose of this workflow. An outline of the experimental method is provided in 
                <xref ref-type="fig" rid="f1">Figure 1</xref>.</p>
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>Figure 1. </label>
                <caption>
                    <title>A schematic summary of the experimental protocol used to generate the use-case data.</title>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure1.gif"/>
            </fig>
            <p>Briefly, HEK293 cells were either (i) left untreated, or (ii) provided with the treatment of interest. These two conditions are referred to as &#x2018;control&#x2019; and &#x2018;treated&#x2019;, respectively. Each condition was evaluated in triplicate. At 96-hours post-treatment, samples were collected and separated into cell pellet and supernatant fractions containing cellular and secreted proteins, respectively. Both fractions were denatured, alkylated and digested to peptides using trypsin.</p>
            <p>The supernatant fractions were de-salted and analysed over a two-hour gradient in an Orbitrap Fusion&#x2122; Lumos&#x2122; Tribrid&#x2122; mass spectrometer coupled to an UltiMate&#x2122; 3000 HPLC system (Thermo Fisher Scientific). LFQ was achieved at the MS1 level based on signal intensities. Cell pellet fractions were labelled using TMT technology before being pooled and subjected to high pH reversed-phase peptide fractionation giving a total of 8 fractions. As before, each fraction was analysed over a two-hour gradient in an Orbitrap Fusion&#x2122; Lumos&#x2122; Tribrid&#x2122; mass spectrometer coupled to an UltiMate&#x2122; 3000 HPLC system (Thermo Fisher Scientific). To improve the accuracy of the quantitation of TMT-labelled peptides, synchronous precursor selection (SPS)-MS3 data acquisition was employed.
                <sup>
                    <xref ref-type="bibr" rid="ref11">11</xref>
                </sup>
                <sup>,</sup>
                <sup>
                    <xref ref-type="bibr" rid="ref12">12</xref>
                </sup> Of note, TMT labelling of cellular proteins was achieved using a single TMT6plex. Hence, this workflow will not include guidance on multi-batch TMT effects or the use of internal reference scaling. For more information about the use of multiple TMTplexes users are directed to Refs. 
                <xref ref-type="bibr" rid="ref13">13</xref>, 
                <xref ref-type="bibr" rid="ref14">14</xref>.</p>
            <p>The cell pellet and supernatant datasets were handled independently and we take advantage of this to discuss the processing of TMT-labelled and LFQ proteomics data. In both cases, the raw MS data were processed using Proteome Discoverer v2.5 (Thermo Fisher Scientific). While the focus of the workflow presented below is differential protein expression analysis, the data processing and quality control steps described here are applicable to any TMT or LFQ proteomics dataset. Importantly, however, the experimental aim will influence data-guided decisions, and the considerations discussed here are likely to differ from those of spatial proteomics, for example.</p>
            <sec id="sec5">
                <title>Downloading the data</title>
                <p>The files required for this workflow have been deposited to the ProteomeXchange Consortium via the PRIDE
                    <sup>
                        <xref ref-type="bibr" rid="ref15">15</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref16">16</xref>
                    </sup> partner repository with the dataset identifier PXD041794, on Zenodo at 
                    <ext-link ext-link-type="uri" xlink:href="http://doi.org/10.5281/zenodo.7837375">http://doi.org/10.5281/zenodo.7837375</ext-link> and in the GitHub repository 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics/">https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics/</ext-link>. Users are advised to download these files into their current working directory. In R the 
                    <monospace>setwd</monospace> function can be used to specify a working directory, or if using RStudio one can use the Session -&gt; Set Working Directory menu.</p>
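                <p>A minimal sketch of setting the working directory from the R console is given below. The folder name 
                    <monospace>expression_proteomics_data</monospace> and the object 
                    <monospace>data_dir</monospace> are hypothetical and used only for illustration; users should substitute the location to which they downloaded the files.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
<monospace># Show the directory in which R currently reads and writes files
getwd()

# Hypothetical folder for the downloaded files (an assumption, not
# a path prescribed by the workflow); created if it does not exist
data_dir &lt;- file.path(getwd(), "expression_proteomics_data")
dir.create(data_dir, showWarnings = FALSE)

# Make this folder the working directory for the session
setwd(data_dir)

# List any .txt files that have been downloaded into it
list.files(pattern = "\\.txt$")</monospace></preformat>
                </p>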
            </sec>
        </sec>
        <sec id="sec6">
            <title>The infrastructure: 
                <monospace>QFeatures</monospace> and 
                <monospace>SummarizedExperiments</monospace>
            </title>
            <p>To be able to conveniently track each step of this workflow, users should make use of the Quantitative features for mass spectrometry, or 
                <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/packages/release/bioc/html/QFeatures.html">
                    <monospace>QFeatures</monospace>, Bioconductor package</ext-link>.
                <sup>
                    <xref ref-type="bibr" rid="ref17">17</xref>
                </sup> Prior to utilising the 
                <monospace>QFeatures</monospace> infrastructure, it is first necessary to understand the structure of a 
                <monospace>
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html">SummarizedExperiment</ext-link>
                </monospace>
                <sup>
                    <xref ref-type="bibr" rid="ref18">18</xref>
                </sup> object as 
                <monospace>QFeatures</monospace> objects are based on the 
                <monospace>SummarizedExperiment</monospace> class. A 
                <monospace>SummarizedExperiment</monospace>, often referred to as an SE, is a data container and S4 object composed of three components: (1) the 
                <monospace>colData</monospace> (column data) containing sample metadata, (2) the 
                <monospace>rowData</monospace> containing data features, and (3) the 
                <monospace>assay</monospace> storing quantitation data, as illustrated in 
                <xref ref-type="fig" rid="f2">Figure 2</xref>. The sample metadata includes annotations such as condition and replicate, and can be accessed using the 
                <monospace>colData</monospace> function. Data features, accessed via the 
                <monospace>rowData</monospace> function, represent information derived from the identification search. Examples include peptide sequence, master protein accession, and confidence scores. Finally, quantitative data are stored in the 
                <monospace>assay</monospace> slot. These three independent data structures are neatly stored within a single 
                <monospace>SummarizedExperiment</monospace> object.</p>
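            <p>To make the three components concrete, the sketch below builds a small 
                <monospace>SummarizedExperiment</monospace> from simulated values and retrieves each component with its accessor. All values, sample names, peptide sequences and protein identifiers are invented for illustration and are not taken from the use-case data. The 
                <monospace>SummarizedExperiment</monospace> package is loaded explicitly here, although it should already be available as a dependency of 
                <monospace>QFeatures</monospace>.

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
<monospace>library("SummarizedExperiment")

# Simulated quantitation matrix: 4 features (rows) x 6 samples (columns)
set.seed(1)
quant &lt;- matrix(rlnorm(24, meanlog = 10), nrow = 4,
                dimnames = list(paste0("psm", 1:4), paste0("S", 1:6)))

# Sample metadata: one row per column of the quantitation matrix
sample_info &lt;- DataFrame(condition = rep(c("control", "treated"), each = 3),
                         replicate = rep(1:3, times = 2),
                         row.names = colnames(quant))

# Data features: one row per row of the quantitation matrix
feature_info &lt;- DataFrame(Sequence = c("AAK", "LLR", "GGK", "TTR"),
                          Protein = c("P1", "P1", "P2", "P2"),
                          row.names = rownames(quant))

# Combine the three components into a single object
se &lt;- SummarizedExperiment(assays = list(intensity = quant),
                           rowData = feature_info,
                           colData = sample_info)

colData(se)  # sample metadata
rowData(se)  # data features
assay(se)    # quantitation data</monospace></preformat>
            </p>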
            <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                <label>Figure 2. </label>
                <caption>
                    <title>A graphic representation of the 
                        <monospace>SummarizedExperiment</monospace> (SE) object structure.</title>
                    <p>Figure reproduced from the 
                        <monospace>SummarizedExperiment</monospace> package
                        <sup>
                            <xref ref-type="bibr" rid="ref18">18</xref>
                        </sup> vignette with permission.</p>
                </caption>
                <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure2.gif"/>
            </fig>
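            <p>The three components described above can be sketched with a toy object. This is a minimal illustration only: the feature names, peptide sequences, conditions and intensities below are invented and are not part of the use-case data.
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <monospace>library(SummarizedExperiment)</monospace>

                    <monospace>## Toy quantitation matrix: 3 features (rows) x 2 samples (columns); values are invented</monospace>
                    <monospace>quant &lt;- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,</monospace>
                    <monospace>                dimnames = list(paste0("psm", 1:3), c("S1", "S2")))</monospace>

                    <monospace>## Feature annotations (rowData) and sample annotations (colData), also invented</monospace>
                    <monospace>features &lt;- DataFrame(Sequence = c("AAK", "AAK", "CCR"),</monospace>
                    <monospace>                      row.names = rownames(quant))</monospace>
                    <monospace>samples &lt;- DataFrame(condition = c("Treated", "Control"),</monospace>
                    <monospace>                     row.names = colnames(quant))</monospace>

                    <monospace>## Combine the three components into a single object</monospace>
                    <monospace>se &lt;- SummarizedExperiment(assays = list(intensity = quant),</monospace>
                    <monospace>                           rowData = features,</monospace>
                    <monospace>                           colData = samples)</monospace>

                    <monospace>rowData(se)  # one row of annotations per feature</monospace>
                    <monospace>colData(se)  # one row of annotations per sample</monospace>
                    <monospace>assay(se)    # the 3 x 2 quantitation matrix</monospace>
                </preformat>
            </p>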
            <p>A 
                <monospace>QFeatures</monospace> object holds each level of quantitative proteomics data, including (but not limited to) PSM-, peptide- and protein-level data. Each level is stored as its own 
                <monospace>SummarizedExperiment</monospace> within a single 
                <monospace>QFeatures</monospace> object. The lowest-level data, e.g., PSMs, are first imported into a 
                <monospace>QFeatures</monospace> object before being aggregated upwards to the protein level (
                <xref ref-type="fig" rid="f3">Figure 3</xref>). During this process of aggregation, 
                <monospace>QFeatures</monospace> maintains the hierarchical links between quantitative levels whilst allowing easy access to all data levels for individual proteins of interest. This key aspect of 
                <monospace>QFeatures</monospace> will be exemplified throughout this workflow. Additional guidance on the use of 
                <monospace>QFeatures</monospace> can be found in Ref. 
                <xref ref-type="bibr" rid="ref17">17</xref>. For visualisation of the data, all plots are generated using standard 
                <monospace>ggplot2</monospace> functionality, but could equally be produced using base R.</p>
            <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                <label>Figure 3. </label>
                <caption>
                    <title>A graphic representation of the 
                        <monospace>QFeatures</monospace> object structure showing the relationship between 
                        <monospace>assay</monospace>s.</title>
                    <p>Figure modified from the 
                        <monospace>QFeatures</monospace>
                        <sup>
                            <xref ref-type="bibr" rid="ref17">17</xref>
                        </sup> vignette with permission.</p>
                </caption>
                <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure3.gif"/>
            </fig>
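            <p>The idea of storing several linked levels in one object can be sketched as follows. This is a toy illustration under stated assumptions: the PSM names, sequences and intensities are invented, and summation is used as the aggregation function purely because the result is easy to check by hand.
                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                    <monospace>library(SummarizedExperiment)</monospace>
                    <monospace>library(QFeatures)</monospace>

                    <monospace>## Toy PSM-level data: 3 PSMs x 2 samples; the first two PSMs share a peptide sequence</monospace>
                    <monospace>quant &lt;- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,</monospace>
                    <monospace>                dimnames = list(paste0("psm", 1:3), c("S1", "S2")))</monospace>
                    <monospace>psms &lt;- SummarizedExperiment(assays = list(quant),</monospace>
                    <monospace>                             rowData = DataFrame(Sequence = c("AAK", "AAK", "CCR"),</monospace>
                    <monospace>                                                 row.names = rownames(quant)))</monospace>

                    <monospace>## Store the PSM-level data as the first level of a QFeatures object</monospace>
                    <monospace>qf &lt;- QFeatures(list(psms = psms))</monospace>

                    <monospace>## Aggregate PSMs to peptides by summing intensities within each sequence</monospace>
                    <monospace>qf &lt;- aggregateFeatures(qf, i = "psms", fcol = "Sequence",</monospace>
                    <monospace>                        name = "peptides", fun = colSums)</monospace>

                    <monospace>names(qf)                # both levels are kept in the same object</monospace>
                    <monospace>assay(qf[["peptides"]])  # summed intensities per peptide sequence</monospace>
                </preformat>
            </p>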
        </sec>
        <sec id="sec7">
            <title>Processing and analysing quantitative TMT data</title>
            <p>First, we provide a workflow for the processing and quality control of quantitative TMT-labelled data. As outlined above, the cell pellet fractions of triplicate control and treated HEK293 cells were labelled using a TMT6plex. The labelling scheme is given in 
                <xref ref-type="table" rid="T1">Table 1</xref>.</p>
            <table-wrap id="T1" orientation="portrait" position="float">
                <label>Table 1. </label>
                <caption>
                    <title>TMT labelling strategy in the use-case experiment.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Sample name</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Condition</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Replicate</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Tag</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">S1</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Treated</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">TMT128</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">S2</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Treated</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">TMT127</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">S3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Treated</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">TMT131</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">S4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Control</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">TMT129</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">S5</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Control</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">TMT126</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">S6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">Control</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">TMT130</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <sec id="sec8">
                <title>Identification search of raw data</title>
                <p>The first processing step in any MS-based proteomics experiment is an identification search of the raw data. The aim of this search is to identify which peptide sequences, and therefore proteins, correspond to the raw spectra output by the mass spectrometer. Several third-party software packages exist to facilitate identification searches of raw MS data, but ultimately the output of any search is a list of PSM, peptide and protein identifications along with their corresponding quantification data.</p>
                <p>The use-case data presented here was processed using Proteome Discoverer 2.5 and additional information about this search is provided in an appendix in the GitHub repository 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics">https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics</ext-link>. Further, we provide template workflows for both the processing and consensus steps of the Proteome Discoverer identification runs. It is also possible to determine several of the key parameter settings during the preliminary data exploration. This step will be particularly important for those using publicly available data without detailed knowledge of the identification search parameters. For now, we simply export the PSM-level 
                    <monospace>.txt</monospace> file from the Proteome Discoverer output.</p>
            </sec>
            <sec id="sec75">
                <title>Importing data into R and creating a 
                    <monospace>QFeatures</monospace> object</title>
                <p>Data cleaning, exploration and filtering at the PSM level are performed in 
                    <monospace>R</monospace> using 
                    <monospace>QFeatures</monospace>. The function 
                    <monospace>readQFeatures</monospace> is used to import the PSM-level 
                    <monospace>.txt</monospace> file. As the cell pellet TMT data we will use is derived from one TMT6plex, only one PSM-level 
                    <monospace>.txt</monospace> file needs to be imported. This file should be stored within the user's working directory.</p>
                <p>The columns containing quantitative data also need to be identified before import. To check the column names we use 
                    <monospace>names</monospace> and 
                    <monospace>read.delim</monospace> (the equivalent for a 
                    <monospace>.csv</monospace> file would be 
                    <monospace>read.csv</monospace>). In the current experiment, the order of TMT labels was randomised in an attempt to minimise the effect of TMT channel leakage. To ease grouping and simplify downstream visualisation, samples are re-ordered during the import step by creating a vector containing the sample column names in the desired order. If samples are already in the desired order, the vector can be created by simply indexing the quantitative columns.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Locate the PSM .txt file</styled-content>
                        </monospace>

                        <monospace>cp_psm 
                            <styled-content style="color:#984806">&lt;-</styled-content> 
                            <styled-content style="color:#37A82E">"cell_pellet_tmt_results_psms.txt"</styled-content>

</monospace>

                        <monospace>
                            <styled-content style="color:#984806">## Identify columns containing quantitative data</styled-content>
                        </monospace>

                        <monospace>cp_psm %&gt;%</monospace>

                        <monospace>  read.delim() %&gt;%</monospace>

                        <monospace>  names()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##  [1] "PSMs.Workflow.ID"                  "PSMs.Peptide.ID"</monospace>

                        <monospace>##  [3] "Checked"                           "Tags"</monospace>

                        <monospace>##  [5] "Confidence"                        "Identifying.Node.Type"</monospace>

                        <monospace>##  [7] "Identifying.Node"                  "Search.ID"</monospace>

                        <monospace>##  [9] "Identifying.Node.No"               "PSM.Ambiguity"</monospace>

                        <monospace>## [11] "Sequence"                          "Annotated.Sequence"</monospace>

                        <monospace>## [13] "Modifications"                     "Number.of.Proteins"</monospace>

                        <monospace>## [15] "Master.Protein.Accessions"         "Master.Protein.Descriptions"</monospace>

                        <monospace>## [17] "Protein.Accessions"                "Protein.Descriptions"</monospace>

                        <monospace>## [19] "Number.of.Missed.Cleavages"        "Charge"</monospace>

                        <monospace>## [21] "Original.Precursor.Charge"         "Delta.Score"</monospace>

                        <monospace>## [23] "Delta.Cn"                          "Rank"</monospace>

                        <monospace>## [25] "Search.Engine.Rank"                "Concatenated.Rank"</monospace>

                        <monospace>## [27] "mz.in.Da"                          "MHplus.in.Da"</monospace>

                        <monospace>## [29] "Theo.MHplus.in.Da"                 "Delta.M.in.ppm"</monospace>

                        <monospace>## [31] "Delta.mz.in.Da"                    "Ions.Matched"</monospace>

                        <monospace>## [33] "Matched.Ions"                      "Total.Ions"</monospace>

                        <monospace>## [35] "Intensity"                         "Activation.Type"</monospace>

                        <monospace>## [37] "NCE.in.Percent"                    "MS.Order"</monospace>

                        <monospace>## [39] "Isolation.Interference.in.Percent" "SPS.Mass.Matches.in.Percent"</monospace>

                        <monospace>## [41] "Average.Reporter.SN"               "Ion.Inject.Time.in.ms"</monospace>

                        <monospace>## [43] "RT.in.min"                         "First.Scan"</monospace>

                        <monospace>## [45] "Last.Scan"                         "Master.Scans"</monospace>

                        <monospace>## [47] "Spectrum.File"                     "File.ID"</monospace>

                        <monospace>## [49] "Abundance.126"                     "Abundance.127"</monospace>

                        <monospace>## [51] "Abundance.128"                     "Abundance.129"</monospace>

                        <monospace>## [53] "Abundance.130"                     "Abundance.131"</monospace>

                        <monospace>## [55] "Quan.Info"                         "Peptides.Matched"</monospace>

                        <monospace>## [57] "XCorr"                             "Number.of.Protein.Groups"</monospace>

                        <monospace>## [59] "Percolator.q.Value"                "Percolator.PEP"</monospace>

                        <monospace>## [61] "Percolator.SVMScore"</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Store location of quantitative columns in a vector in the desired order</styled-content>
                        </monospace>

                        <monospace>abundance_ordered 
                            <styled-content style="color:#984806">&lt;-</styled-content> c(
                            <styled-content style="color:#37A82E">"Abundance.128"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"Abundance.127"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"Abundance.131"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"Abundance.129"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"Abundance.126"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#37A82E">"Abundance.130"</styled-content>)</monospace>
                    </preformat>
                </p>
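                <p>Both ways of building the column vector can be sketched in base R. The column names below are an abbreviated, illustrative subset of the export shown above, and the <monospace>design</monospace> table is a hypothetical helper that restates Table 1; neither is required by <monospace>QFeatures</monospace>.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <monospace>## Abbreviated, illustrative subset of the column names printed above</monospace>
                        <monospace>cols &lt;- c("Sequence", "Abundance.126", "Abundance.127", "Abundance.128",</monospace>
                        <monospace>          "Abundance.129", "Abundance.130", "Abundance.131", "Quan.Info")</monospace>

                        <monospace>## Samples already in the desired order: index the quantitative columns by pattern</monospace>
                        <monospace>abundance_cols &lt;- grep("^Abundance\\.", cols, value = TRUE)</monospace>

                        <monospace>## Samples need re-ordering: derive the vector from the labelling design (Table 1)</monospace>
                        <monospace>design &lt;- data.frame(sample = paste0("S", 1:6),</monospace>
                        <monospace>                     tag = c("TMT128", "TMT127", "TMT131",</monospace>
                        <monospace>                             "TMT129", "TMT126", "TMT130"))</monospace>
                        <monospace>abundance_ordered &lt;- sub("^TMT", "Abundance.", design$tag)</monospace>

                        <monospace>## Guard against typos: every requested column must exist in the file</monospace>
                        <monospace>stopifnot(all(abundance_ordered %in% abundance_cols))</monospace>
                        <monospace>abundance_ordered</monospace>
                    </preformat>
                </p>
                <p>Deriving the vector from a design table keeps the sample order and the label assignment in one place, which may reduce the chance of the two drifting apart.</p>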
                <p>Now that the necessary file and its quantitative data columns have been identified, we can pass both to the 
                    <monospace>readQFeatures</monospace> function. We also specify that the file is tab-delimited by including 
                    <monospace>sep = "\t"</monospace> (similarly, you would use 
                    <monospace>sep = ","</monospace> for a .csv file). Of note, the 
                    <monospace>readQFeatures</monospace> function can also take 
                    <monospace>fnames</monospace> as an argument to specify a column to be used as the row names of the imported object. Whilst previous 
                    <monospace>QFeatures</monospace> vignettes used the &#x201c;Sequence&#x201d; or &#x201c;Annotated.Sequence&#x201d; columns as row names, we advise against this because several PSMs may match the same peptide sequence with different modifications. In such cases, multiple rows would have the same name, forcing the 
                    <monospace>readQFeatures</monospace> function to output a &#x201c;making assay row names unique&#x201d; message and add an identifying number to the end of each duplicated row name. These sequences would then be treated as distinct during the aggregation of PSMs to peptides, resulting in two independent peptide-level quantitation values rather than one. Therefore, we do not pass a 
                    <monospace>fnames</monospace> argument and the row names automatically become indices. Finally, we pass the <monospace>name</monospace> argument to indicate the type of data added.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Create QFeatures</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> readQFeatures(
                            <styled-content style="color:#CC9900">table =</styled-content> cp_psm,</monospace>

                        <monospace>                       
                            <styled-content style="color:#CC9900">ecol =</styled-content> abundance_ordered,</monospace>

                        <monospace>                       
                            <styled-content style="color:#CC9900">sep =</styled-content> 
                            <styled-content style="color:#37A82E">"</styled-content>\t
                            <styled-content style="color:#37A82E">"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>)</monospace>
                    </preformat>
                </p>
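                <p>The row-name problem described above can be illustrated without any proteomics packages. In this sketch, base R's <monospace>make.unique</monospace> stands in for the renaming step; whether <monospace>readQFeatures</monospace> uses this exact function internally is an assumption, and the sequences and intensities are invented.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <monospace>## Three toy PSMs: the first two match the same peptide sequence (invented values)</monospace>
                        <monospace>pep_sequence &lt;- c("AAK", "AAK", "CCR")</monospace>
                        <monospace>intensity &lt;- c(10, 20, 5)</monospace>

                        <monospace>## Suffixing duplicated names makes them distinct</monospace>
                        <monospace>unique_names &lt;- make.unique(pep_sequence)</monospace>
                        <monospace>unique_names</monospace>

                        <monospace>## Grouping on the suffixed names yields three values, whereas grouping on</monospace>
                        <monospace>## the original sequence yields the intended two</monospace>
                        <monospace>by_name &lt;- tapply(intensity, unique_names, sum)</monospace>
                        <monospace>by_sequence &lt;- tapply(intensity, pep_sequence, sum)</monospace>
                        <monospace>length(by_name)</monospace>
                        <monospace>by_sequence[["AAK"]]</monospace>
                    </preformat>
                </p>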
            </sec>
            <sec id="sec10">
                <title>Accessing the 
                    <monospace>QFeatures</monospace> infrastructure</title>
                <p>As outlined above, a 
                    <monospace>QFeatures</monospace> data object is a list of 
                    <monospace>SummarizedExperiment</monospace> objects. As such, an individual 
                    <monospace>SummarizedExperiment</monospace> can be accessed using the standard double-bracket notation, as demonstrated in the code chunk below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Index using position</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#000099">1</styled-content>]]</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## class: SummarizedExperiment</monospace>

                        <monospace>## dim: 48832 6</monospace>

                        <monospace>## metadata(0):</monospace>

                        <monospace>## assays(1): ''</monospace>

                        <monospace>## rownames(48832): 1 2 ... 48831 48832</monospace>

                        <monospace>## rowData names(55): PSMs.Workflow.ID PSMs.Peptide.ID ... Percolator.PEP</monospace>

                        <monospace>##   Percolator.SVMScore</monospace>

                        <monospace>## colnames(6): Abundance.128 Abundance.127 ... Abundance.126</monospace>

                        <monospace>##   Abundance.130</monospace>

                        <monospace>## colData names(0):</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Index using name</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]]</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## class: SummarizedExperiment</monospace>

                        <monospace>## dim: 48832 6</monospace>

                        <monospace>## metadata(0):</monospace>

                        <monospace>## assays(1): ''</monospace>

                        <monospace>## rownames(48832): 1 2 ... 48831 48832</monospace>

                        <monospace>## rowData names(55): PSMs.Workflow.ID PSMs.Peptide.ID ... Percolator.PEP</monospace>

                        <monospace>##   Percolator.SVMScore</monospace>

                        <monospace>## colnames(6): Abundance.128 Abundance.127 ... Abundance.126</monospace>

                        <monospace>##   Abundance.130</monospace>

                        <monospace>## colData names(0):</monospace>
                    </preformat>
                </p>
                <p>A summary of the data contained in the slots is printed to the screen. To retrieve the 
                    <monospace>rowData</monospace>, 
                    <monospace>colData</monospace> or 
                    <monospace>assay</monospace> data from a particular 
                    <monospace>SummarizedExperiment</monospace> within a 
                    <monospace>QFeatures</monospace> object, users can use the 
                    <monospace>rowData</monospace>, 
                    <monospace>colData</monospace> and 
                    <monospace>assay</monospace> functions. For plotting or data transformation, the output must first be converted to a 
                    <monospace>data.frame</monospace> or 
                    <monospace>tibble</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Access feature information with rowData</styled-content>
                        </monospace>

                        <monospace>
                            <styled-content style="color:#984806">## The output should be converted to data.frame/tibble for further processing</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  summarise(
                            <styled-content style="color:#CC9900">mean_intensity =</styled-content> mean(Intensity))</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 1 x 1</monospace>

                        <monospace>##    mean_intensity</monospace>

                        <monospace>##             &lt;dbl&gt;</monospace>

                        <monospace>## 1       13281497.</monospace>
                    </preformat>
                </p>
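                <p>The same conversion applies to quantitative data retrieved with <monospace>assay</monospace>. The sketch below uses a small invented object as a stand-in for <monospace>cp_qf[["psms_raw"]]</monospace>, so the values shown are illustrative only.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <monospace>library(SummarizedExperiment)</monospace>

                        <monospace>## Toy stand-in for a PSM-level SummarizedExperiment; values are invented</monospace>
                        <monospace>quant &lt;- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,</monospace>
                        <monospace>                dimnames = list(paste0("psm", 1:3), c("S1", "S2")))</monospace>
                        <monospace>se &lt;- SummarizedExperiment(assays = list(quant))</monospace>

                        <monospace>## Extract the quantitation matrix and convert it to a data.frame</monospace>
                        <monospace>quant_df &lt;- as.data.frame(assay(se))</monospace>

                        <monospace>## Downstream summaries then use ordinary data.frame operations</monospace>
                        <monospace>colMeans(quant_df)</monospace>
                    </preformat>
                </p>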
            </sec>
            <sec id="sec11">
                <title>Adding metadata</title>
                <p>Having imported the data, we first annotate each sample with its TMT label, sample reference and condition. As this information is experimental metadata, it is added to the 
                    <monospace>colData</monospace> slot. It is also useful to clean up sample names such that they are short, intuitive and informative. This is done by editing the 
                    <monospace>colnames</monospace>. These steps may not always be necessary depending upon the identification search output.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Clean sample names</styled-content>
                        </monospace>

                        <monospace>colnames(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]]) 
                            <styled-content style="color:#984806">&lt;-</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Add sample info as colData to QFeatures object</styled-content>
                        </monospace>

                        <monospace>cp_qf$label 
                            <styled-content style="color:#984806">&lt;-</styled-content> c(
                            <styled-content style="color:#37A82E">"TMT128"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#37A82E">"TMT127"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#37A82E">"TMT131"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#37A82E">"TMT129"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#37A82E">"TMT126"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#37A82E">"TMT130"</styled-content>)</monospace>


                        <monospace>cp_qf$sample 
                            <styled-content style="color:#984806">&lt;-</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)</monospace>


                        <monospace>cp_qf$condition 
                            <styled-content style="color:#984806">&lt;-</styled-content> rep(c(
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>), 
                            <styled-content style="color:#CC9900">each =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#D47D3C">## Verify</styled-content>
                        </monospace>

                        <monospace>colData(cp_qf)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## DataFrame with 6 rows and 3 columns</monospace>

                        <monospace>##           label        sample     condition</monospace>

                        <monospace>##     &lt;character&gt;   &lt;character&gt;   &lt;character&gt;</monospace>

                        <monospace>## S1       TMT128            S1       Treated</monospace>

                        <monospace>## S2       TMT127            S2       Treated</monospace>

                        <monospace>## S3       TMT131            S3       Treated</monospace>

                        <monospace>## S4       TMT129            S4       Control</monospace>

                        <monospace>## S5       TMT126            S5       Control</monospace>

                        <monospace>## S6       TMT130            S6       Control</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Assign the colData to the first assay as well</styled-content>
                        </monospace>

                        <monospace>colData(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]]) 
                            <styled-content style="color:#984806">&lt;-</styled-content> colData(cp_qf)</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec12">
                <title>Preliminary data exploration</title>
                <p>As well as cleaning and annotating the data, it is always advisable to check that the import worked and that the data looks as expected. Further, preliminary exploration of the data can provide an early sign of whether the experiment and subsequent identification search were successful. Importantly, however, the names of key parameters will vary depending on the software used, and will likely change over time. Users will need to be aware of this and modify the code in this workflow accordingly.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check what information has been imported</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  colnames()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##  [1] "PSMs.Workflow.ID"                  "PSMs.Peptide.ID"</monospace>

                        <monospace>##  [3] "Checked"                           "Tags"</monospace>

                        <monospace>##  [5] "Confidence"                        "Identifying.Node.Type"</monospace>

                        <monospace>##  [7] "Identifying.Node"                  "Search.ID"</monospace>

                        <monospace>##  [9] "Identifying.Node.No"               "PSM.Ambiguity"</monospace>

                        <monospace>## [11] "Sequence"                          "Annotated.Sequence"</monospace>

                        <monospace>## [13] "Modifications"                     "Number.of.Proteins"</monospace>

                        <monospace>## [15] "Master.Protein.Accessions"         "Master.Protein.Descriptions"</monospace>

                        <monospace>## [17] "Protein.Accessions"                "Protein.Descriptions"</monospace>

                        <monospace>## [19] "Number.of.Missed.Cleavages"        "Charge"</monospace>

                        <monospace>## [21] "Original.Precursor.Charge"         "Delta.Score"</monospace>

                        <monospace>## [23] "Delta.Cn"                          "Rank"</monospace>

                        <monospace>## [25] "Search.Engine.Rank"                "Concatenated.Rank"</monospace>

                        <monospace>## [27] "mz.in.Da"                          "MHplus.in.Da"</monospace>

                        <monospace>## [29] "Theo.MHplus.in.Da"                 "Delta.M.in.ppm"</monospace>

                        <monospace>## [31] "Delta.mz.in.Da"                    "Ions.Matched"</monospace>

                        <monospace>## [33] "Matched.Ions"                      "Total.Ions"</monospace>

                        <monospace>## [35] "Intensity"                         "Activation.Type"</monospace>

                        <monospace>## [37] "NCE.in.Percent"                    "MS.Order"</monospace>

                        <monospace>## [39] "Isolation.Interference.in.Percent" "SPS.Mass.Matches.in.Percent"</monospace>

                        <monospace>## [41] "Average.Reporter.SN"               "Ion.Inject.Time.in.ms"</monospace>

                        <monospace>## [43] "RT.in.min"                         "First.Scan"</monospace>

                        <monospace>## [45] "Last.Scan"                         "Master.Scans"</monospace>

                        <monospace>## [47] "Spectrum.File"                     "File.ID"</monospace>

                        <monospace>## [49] "Quan.Info"                         "Peptides.Matched"</monospace>

                        <monospace>## [51] "XCorr"                             "Number.of.Protein.Groups"</monospace>

                        <monospace>## [53] "Percolator.q.Value"                "Percolator.PEP"</monospace>

                        <monospace>## [55] "Percolator.SVMScore"</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs are in the data</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  dim()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##&#x2003;&#x2003;[1]&#x2003;&#x2003;48832&#x2003;&#x2003;6</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>original_psms 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  nrow() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>
                    </preformat>
                </p>
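                <p>Because parameter names differ between software, one way to catch a mismatch early is to compare the imported column names against those that later code relies on. The following is a minimal base R sketch under stated assumptions: <monospace>psm_cols</monospace> is a toy vector standing in for the imported column names, and <monospace>required_cols</monospace> is an illustrative selection that users would adapt to their own analysis.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Toy stand-in for the column names of the imported PSM data
## (hypothetical subset, chosen so that two names are missing)
psm_cols <- c("Sequence", "Master.Protein.Accessions", "Delta.M.in.ppm",
              "RT.in.min", "Ion.Inject.Time.in.ms")

## Illustrative list of columns that downstream code refers to
required_cols <- c("Sequence", "Master.Protein.Accessions",
                   "Number.of.Missed.Cleavages", "Delta.M.in.ppm",
                   "Confidence", "RT.in.min", "Ion.Inject.Time.in.ms")

## Report any required column that is absent from the import
missing_cols <- setdiff(required_cols, psm_cols)
if (length(missing_cols) > 0) {
  message("Columns not found: ", paste(missing_cols, collapse = ", "))
}
```
]]></preformat>
                    With the real data, <monospace>psm_cols</monospace> would instead be the vector of names returned by the <monospace>rowData()</monospace> and <monospace>colnames()</monospace> calls shown above.
                </p>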
                <p>We can see that the original data includes 48832 PSMs across the 6 samples. It is also useful to make note of how many peptides and proteins the raw PSM data corresponds to, and to track how many we remove during the subsequent filtering steps. This can be done by counting the unique entries in the &#x201c;Sequence&#x201d; and &#x201c;Master.Protein.Accessions&#x201d; columns for peptides and proteins, respectively. Of note, because we count unique peptide sequences, a sequence that is identified with several different modifications is counted only once.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many peptides and master proteins are in the data</styled-content>
                        </monospace>

                        <monospace>original_peps 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Sequence) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>original_prots 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Master.Protein.Accessions) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>print(c(original_peps, original_prots))</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##&#x2003;&#x2003;[1]&#x2003;&#x2003;25969&#x2003;&#x2003;5040</monospace>
                    </preformat>
                </p>
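                <p>To track how many features are removed at each filtering step, the three counts can be wrapped in a small helper and compared before and after a filter. This is only a sketch in base R: the function name <monospace>count_features</monospace>, the toy table and the example filter are all hypothetical and are not part of the workflow itself.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Hypothetical helper: count PSMs, unique peptide sequences and unique
## master protein accessions in a PSM-level table
count_features <- function(psm_table) {
  c(psms = nrow(psm_table),
    peptides = length(unique(psm_table$Sequence)),
    proteins = length(unique(psm_table$Master.Protein.Accessions)))
}

## Toy PSM table (invented values, not taken from the use-case data)
toy_psms <- data.frame(
  Sequence = c("AAAK", "AAAK", "CCCR", "DDDK", "EEEK"),
  Master.Protein.Accessions = c("P1", "P1", "P1", "P2", ""),
  stringsAsFactors = FALSE
)

before <- count_features(toy_psms)

## Example filter: drop PSMs without a master protein accession
filtered <- toy_psms[toy_psms$Master.Protein.Accessions != "", ]
after <- count_features(filtered)

## Number of PSMs, peptides and proteins removed by the filter
before - after
```
]]></preformat>
                    Note that an empty accession is counted as one unique entry before filtering, which is why the protein count also drops in this toy example.
                </p>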
                <p>Hence, the output of the identification search contains 48832 PSMs corresponding to 25969 peptide sequences and 5040 master proteins. Finally, we confirm that the identification search was carried out as expected. For this, we print summaries of the key search parameters using the 
                    <monospace>table</monospace> function for discrete parameters and 
                    <monospace>summary</monospace> for those which are continuous. This is also helpful for users who are analysing publicly available data and have limited knowledge about the identification search parameters.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check missed cleavages</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Number.of.Missed.Cleavages) %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##     0     1    2</monospace>

                        <monospace>## 46164  2592   76</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check precursor mass tolerance</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Delta.M.in.ppm) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##    Min.   1st Qu.  Median    Mean  3rd Qu.   Max.</monospace>

                        <monospace>## -8.9300  -0.6000  0.3700  0.6447  1.3100  9.6700</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check fragment mass tolerance</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Delta.mz.in.Da) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##       Min.      1st Qu.    Median       Mean     3rd Qu.      Max.</monospace>

                        <monospace>## -0.0110400  -0.0004100  0.0002500  0.0006812  0.0010200  0.0135100</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check PSM confidence allocations</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Confidence) %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##   High</monospace>

                        <monospace>##  48832</monospace>
                    </preformat>
                </p>
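                <p>The same summaries can be turned into explicit checks against the settings that were expected for the search. The sketch below uses invented values in place of the real columns, and the two thresholds are assumptions that should be replaced with the settings of the actual identification search.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Toy values (invented) standing in for two PSM-level columns
toy_psms <- data.frame(
  Number.of.Missed.Cleavages = c(0, 0, 0, 1, 0, 2, 0, 0),
  Delta.M.in.ppm = c(-8.9, -0.6, 0.4, 1.3, 9.7, 0.1, -2.2, 3.5)
)

## Assumed search settings; replace with those of the actual search
max_missed_cleavages <- 2
precursor_tol_ppm <- 10

## Proportion of PSMs at each number of missed cleavages
missed_prop <- prop.table(table(toy_psms$Number.of.Missed.Cleavages))
round(missed_prop, 3)

## Stop with an error if any PSM exceeds the assumed settings
stopifnot(
  max(toy_psms$Number.of.Missed.Cleavages) <= max_missed_cleavages,
  max(abs(toy_psms$Delta.M.in.ppm)) <= precursor_tol_ppm
)
```
]]></preformat>
                </p>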
            </sec>
            <sec id="sec13">
                <title>Experimental quality control checks</title>
                <p>Experimental quality control of TMT-labelled quantitative proteomics data takes place in two steps: (1) assessment of the raw mass spectrometry data, and (2) evaluation of TMT labelling efficiency.</p>
            </sec>
            <sec id="sec14">
                <title>Quality control of the raw mass spectrometry data</title>
                <p>Having taken an initial look at the output of the identification search, it is possible to create some simple plots to inspect the raw mass spectrometry data. Such plots are useful for revealing problems that may have occurred during the mass spectrometry run, but they are far from exhaustive. Users who wish to carry out a more in-depth evaluation of the raw mass spectrometry data may benefit from use of the 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/Spectra.html">
                        <monospace>Spectra</monospace> Bioconductor package</ext-link> which allows for visualisation and exploration of raw chromatograms and spectra, among other features.
                    <sup>
                        <xref ref-type="bibr" rid="ref19">19</xref>
                    </sup>
                </p>
                <p>The first plot we generate looks at the delta precursor mass, that is, the difference between the observed and estimated precursor mass, across retention time. Importantly, this feature can only be explored when using the raw data prior to recalibration. For users of Proteome Discoverer, this means using the spectral files node rather than the spectral files recalibration node.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Generate scatter plot of mass accuracy</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> RT.in.min, 
                            <styled-content style="color:#CC9900">y =</styled-content> Delta.M.in.ppm)) +</monospace>

                        <monospace>  geom_point(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#CC9900">shape =</styled-content> 
                            <styled-content style="color:#000099">4</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">-5</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"RT (min)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Delta precursor mass (ppm)"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>)) +</monospace>

                        <monospace>  scale_y_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">-10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> c(
                            <styled-content style="color:#000099">-10</styled-content>, 
                            <styled-content style="color:#000099">-5</styled-content>, 
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"PSM retention time against delta precursor mass"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure4.gif"/>
                </p>
                <p>Since we applied a precursor mass tolerance of 10 ppm during the identification search, all of the PSMs are within 
                    <inline-formula>
                        <mml:math display="inline">
                            <mml:mo>&#x00b1;</mml:mo>
                            <mml:mn>10</mml:mn>
                        </mml:math>
                    </inline-formula> ppm. Ideally, however, we want the majority of the data to be within 
                    <inline-formula>
                        <mml:math display="inline">
                            <mml:mo>&#x00b1;</mml:mo>
                            <mml:mn>5</mml:mn>
                        </mml:math>
                    </inline-formula> ppm since smaller delta masses correspond to a greater rate of correct peptide identifications. From the graph we have plotted we can see that indeed the majority of PSMs are within 
                    <inline-formula>
                        <mml:math display="inline">
                            <mml:mo>&#x00b1;</mml:mo>
                            <mml:mn>5</mml:mn>
                        </mml:math>
                    </inline-formula> ppm. If users find that too many PSMs are outside of the desired 
                    <inline-formula>
                        <mml:math display="inline">
                            <mml:mo>&#x00b1;</mml:mo>
                            <mml:mn>5</mml:mn>
                        </mml:math>
                    </inline-formula> ppm, it is advisable to check the calibration of the mass spectrometer.</p>
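                <p>Rather than judging this by eye alone, the proportion of PSMs inside the desired window can be computed directly. The following base R sketch uses invented delta masses; with the real data the vector would be the <monospace>Delta.M.in.ppm</monospace> column, and what counts as an acceptable proportion is left to the user.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Toy delta precursor masses in ppm (invented values)
delta_ppm <- c(-8.9, -0.6, 0.4, 1.3, 9.7, 0.1, -2.2, 3.5, 4.9, -5.2)

## Proportion of PSMs within the desired window of +/- 5 ppm
within_5ppm <- mean(abs(delta_ppm) <= 5)
within_5ppm
```
]]></preformat>
                </p>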
                <p>The second quality control plot of raw data is that of MS2 ion inject time across the retention time gradient. Here, it is desirable to achieve an average MS2 injection time of 50 ms or less, although the exact target threshold will depend upon the sample load. If the average ion inject time is longer than desired, then the ion transfer tube and/or front end optics of the instrument may require cleaning.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Generate scatter plot of ion inject time across retention time</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> RT.in.min, 
                            <styled-content style="color:#CC9900">y =</styled-content> Ion.Inject.Time.in.ms)) +</monospace>

                        <monospace>  geom_point(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#CC9900">shape =</styled-content> 
                            <styled-content style="color:#000099">4</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">50</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"RT (min)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Ion inject time (ms)"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>)) +</monospace>

                        <monospace>  scale_y_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">60</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">60</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"PSM retention time against ion inject time"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure5.gif"/>
                </p>
                <p>From this plot we can see that whilst there is a high density of PSMs at low inject times, there are also many data points at the 50 ms threshold. This suggests that allowing ions more time to accumulate in the ion trap could have increased the number of PSMs. Finally, we inspect the distribution of PSMs across both the ion injection time and retention time by plotting histograms.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of PSM ion inject time</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> Ion.Inject.Time.in.ms)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Ion inject time (ms)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(-
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#000099">52.5</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">50</styled-content>, 
                            <styled-content style="color:#000099">5</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"PSM frequency across ion injection time"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure6.gif"/>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of PSM retention time</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> RT.in.min)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"RT (min)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"PSM frequency across retention time"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure7.gif"/>
                </p>
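                <p>The average ion inject time, and the share of PSMs that reached the maximum inject time, can also be summarised numerically. This is a minimal sketch with invented values; the 50 ms maximum is taken from the threshold discussed above, and with the real data the vector would be the <monospace>Ion.Inject.Time.in.ms</monospace> column.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Toy MS2 ion inject times in ms (invented values)
inject_ms <- c(3.2, 5.1, 50, 12.7, 50, 8.4, 22.0, 50, 4.6, 9.9)

## Assumed maximum inject time set in the acquisition method
max_inject_ms <- 50

## Average inject time and proportion of PSMs that reached the maximum
mean_inject <- mean(inject_ms)
prop_at_max <- mean(inject_ms >= max_inject_ms)
c(mean_ms = mean_inject, prop_at_max = prop_at_max)
```
]]></preformat>
                </p>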
                <p>The four plots that we have generated look relatively standard with no obvious problems indicated. Therefore, we continue by evaluating the quality of the processed data.</p>
            </sec>
            <sec id="sec15">
                <title>Checking the efficiency of TMT labelling</title>
                <p>The most fundamental data quality control step in a TMT experiment is to check the TMT labelling efficiency. TMT labels react with amine groups present at the peptide N-terminus as well as the side chain of lysine (K) residues. Of note, lysine residues can be TMT modified regardless of whether they are present at the C-terminus of a tryptic peptide or internally following miscleavage.</p>
                <p>To evaluate the TMT labelling efficiency, a separate identification search of the raw data was completed with lysine (K) and peptide N-termini TMT labels considered as dynamic modifications rather than static. No additional residues (S or T) were evaluated for labelling in the search. This allows the search engine to assess the presence of both the modified (TMT labelled) and unmodified (original) forms of each peptide. The relative proportions of modified and unmodified peptides can then be used to calculate the TMT labelling efficiency. To demonstrate how to check for TMT labelling efficiency, only two of the eight fractions were utilised for this search.</p>
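                <p>As an illustration of the calculation, labelling efficiency can be expressed as the number of labelled sites divided by the number of sites available for labelling, computed separately for peptide N-termini and lysine residues. The base R sketch below is one possible way of doing this and is not necessarily the approach taken in this workflow. The toy table, the helper <monospace>count_matches</monospace> and the format of the modification strings are all assumptions made for illustration.
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Toy PSMs from a search with TMT set as a dynamic modification.
## The modification strings are invented and only assumed to resemble
## the format written by the search software.
toy_psms <- data.frame(
  Sequence = c("AAAK", "CCCR", "DKDK", "EEER", "FFFK"),
  Modifications = c("N-Term(TMT6plex); K4(TMT6plex)",
                    "N-Term(TMT6plex)",
                    "N-Term(TMT6plex); K2(TMT6plex)",
                    "",
                    "K4(TMT6plex)"),
  stringsAsFactors = FALSE
)

## Hypothetical helper: count pattern matches in each element of x
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

## N-terminal efficiency: labelled N-termini / all peptide N-termini
nterm_eff <- mean(grepl("N-Term\\(TMT", toy_psms$Modifications))

## Lysine efficiency: labelled lysines / all lysines in the sequences
k_labelled <- sum(count_matches("K[0-9]+\\(TMT", toy_psms$Modifications))
k_total <- sum(count_matches("K", toy_psms$Sequence))
k_eff <- k_labelled / k_total

## Efficiencies as percentages
round(100 * c(n_terminal = nterm_eff, lysine = k_eff), 1)
```
]]></preformat>
                    Counting at the level of individual lysine residues, rather than whole PSMs, means that a peptide with one labelled and one unlabelled lysine contributes to both the numerator and the denominator.
                </p>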
                <p>As we will only look at TMT efficiency at the PSM level, here we import the 
                    <monospace>.txt</monospace> file directly as a 
                    <monospace>SummarizedExperiment</monospace> rather than a 
                    <monospace>QFeatures</monospace> object. This is done using the 
                    <monospace>readSummarizedExperiment</monospace> function and the same arguments as those in 
                    <monospace>readQFeatures</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Locate the PSM .txt file</styled-content>
                        </monospace>

                        <monospace>tmt_psm 
                            <styled-content style="color:#984806">&lt;-</styled-content> 
                            <styled-content style="color:#37A82E">"cell_pellet_tmt_efficiency_psms.txt"</styled-content>
                        </monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Identify columns containing quantitative data</styled-content>
                        </monospace>

                        <monospace>tmt_psm %&gt;%</monospace>

                        <monospace>  read.delim() %&gt;%</monospace>

                        <monospace>  names()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##  [1] "PSMs.Workflow.ID"                  "PSMs.Peptide.ID"</monospace>

                        <monospace>##  [3] "Checked"                           "Tags"</monospace>

                        <monospace>##  [5] "Confidence"                        "Identifying.Node.Type"</monospace>

                        <monospace>##  [7] "Identifying.Node"                  "Search.ID"</monospace>

                        <monospace>##  [9] "Identifying.Node.No"               "PSM.Ambiguity"</monospace>

                        <monospace>## [11] "Sequence"                          "Annotated.Sequence"</monospace>

                        <monospace>## [13] "Modifications"                     "Number.of.Proteins"</monospace>

                        <monospace>## [15] "Master.Protein.Accessions"         "Master.Protein.Descriptions"</monospace>

                        <monospace>## [17] "Protein.Accessions"                "Protein.Descriptions"</monospace>

                        <monospace>## [19] "Number.of.Missed.Cleavages"        "Charge"</monospace>

                        <monospace>## [21] "Original.Precursor.Charge"         "Delta.Score"</monospace>

                        <monospace>## [23] "Delta.Cn"                          "Rank"</monospace>

                        <monospace>## [25] "Search.Engine.Rank"                "Concatenated.Rank"</monospace>

                        <monospace>## [27] "mz.in.Da"                          "MHplus.in.Da"</monospace>

                        <monospace>## [29] "Theo.MHplus.in.Da"                 "Delta.M.in.ppm"</monospace>

                        <monospace>## [31] "Delta.mz.in.Da"                    "Ions.Matched"</monospace>

                        <monospace>## [33] "Matched.Ions"                      "Total.Ions"</monospace>

                        <monospace>## [35] "Intensity"                         "Activation.Type"</monospace>

                        <monospace>## [37] "NCE.in.Percent"                    "MS.Order"</monospace>

                        <monospace>## [39] "Isolation.Interference.in.Percent" "SPS.Mass.Matches.in.Percent"</monospace>

                        <monospace>## [41] "Average.Reporter.SN"               "Ion.Inject.Time.in.ms"</monospace>

                        <monospace>## [43] "RT.in.min"                         "First.Scan"</monospace>

                        <monospace>## [45] "Last.Scan"                         "Master.Scans"</monospace>

                        <monospace>## [47] "Spectrum.File"                     "File.ID"</monospace>

                        <monospace>## [49] "Abundance.126"                     "Abundance.127"</monospace>

                        <monospace>## [51] "Abundance.128"                     "Abundance.129"</monospace>

                        <monospace>## [53] "Abundance.130"                     "Abundance.131"</monospace>

                        <monospace>## [55] "Quan.Info"                         "Peptides.Matched"</monospace>

                        <monospace>## [57] "XCorr"                             "Number.of.Protein.Groups"</monospace>

                        <monospace>## [59] "Contaminant"                       "Percolator.q.Value"</monospace>

                        <monospace>## [61] "Percolator.PEP"                    "Percolator.SVMScore"</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Read in as a SummarizedExperiment</styled-content>
                        </monospace>

                        <monospace>tmt_se 
                            <styled-content style="color:#984806">&lt;-</styled-content> readSummarizedExperiment(
                            <styled-content style="color:#CC9900">table =</styled-content> tmt_psm,</monospace>

                        <monospace>                                   
                            <styled-content style="color:#CC9900">ecol =</styled-content> abundance_ordered,</monospace>

                        <monospace>                                   
                            <styled-content style="color:#CC9900">sep =</styled-content> 
                            <styled-content style="color:#37A82E">"</styled-content>\t
                            <styled-content style="color:#37A82E">"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Clean sample names</styled-content>
                        </monospace>

                        <monospace>colnames(tmt_se) 
                            <styled-content style="color:#984806">&lt;-</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Add sample info as colData to the SummarizedExperiment object</styled-content>
                        </monospace>

                        <monospace>tmt_se$label 
                            <styled-content style="color:#984806">&lt;-</styled-content> c(
                            <styled-content style="color:#37A82E">"TMT128"</styled-content>,</monospace>

                        <monospace>                  
                            <styled-content style="color:#37A82E">"TMT127"</styled-content>,</monospace>

                        <monospace>                  
                            <styled-content style="color:#37A82E">"TMT131"</styled-content>,</monospace>

                        <monospace>                  
                            <styled-content style="color:#37A82E">"TMT129"</styled-content>,</monospace>

                        <monospace>                  
                            <styled-content style="color:#37A82E">"TMT126"</styled-content>,</monospace>

                        <monospace>                  
                            <styled-content style="color:#37A82E">"TMT130"</styled-content>)</monospace>


                        <monospace>tmt_se$sample 
                            <styled-content style="color:#984806">&lt;-</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)</monospace>


                        <monospace>tmt_se$condition 
                            <styled-content style="color:#984806">&lt;-</styled-content> rep(c(
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>), 
                            <styled-content style="color:#CC9900">each =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>colData(tmt_se)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## DataFrame with 6 rows and 3 columns</monospace>

                        <monospace>##          label      sample   condition</monospace>

                        <monospace>##    &lt;character&gt; &lt;character&gt; &lt;character&gt;</monospace>

                        <monospace>## S1      TMT128          S1     Treated</monospace>

                        <monospace>## S2      TMT127          S2     Treated</monospace>

                        <monospace>## S3      TMT131          S3     Treated</monospace>

                        <monospace>## S4      TMT129          S4     Control</monospace>

                        <monospace>## S5      TMT126          S5     Control</monospace>

                        <monospace>## S6      TMT130          S6     Control</monospace>
                    </preformat>
                </p>
                <p>Information about the presence of labels is stored within the &#x2018;Modifications&#x2019; feature of the 
                    <monospace>rowData</monospace>. Using this information, the TMT labelling efficiency of the experiment is calculated using the code chunks below. Users should alter this code if TMTpro reagents are being used such that &#x201c;TMT6plex&#x201d; is replaced by &#x201c;TMTpro&#x201d;.</p>
                <p>First, we consider the TMT labelling efficiency of peptide N-termini. We use the <monospace>grep</monospace> function to identify PSMs that are annotated with an N-Term TMT6plex modification. We then express the number of PSMs with this annotation as a percentage of the total number of PSMs.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Count the total number of PSMs</styled-content>
                        </monospace>

                        <monospace>tmt_total 
                            <styled-content style="color:#984806">&lt;-</styled-content> length(tmt_se)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Count the number of PSMs with an N-terminal TMT modification</styled-content>
                        </monospace>

                        <monospace>nterm_labelled_rows 
                            <styled-content style="color:#984806">&lt;-</styled-content> grep(
                            <styled-content style="color:#37A82E">"N-Term\\(TMT6plex\\)"</styled-content>,</monospace>

                        <monospace>                            rowData(tmt_se)$Modifications)</monospace>

                        <monospace>nterm_psms_labelled 
                            <styled-content style="color:#984806">&lt;-</styled-content> length(nterm_labelled_rows)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Calculate N-terminal TMT labelling efficiency</styled-content>
                        </monospace>

                        <monospace>efficiency_nterm 
                            <styled-content style="color:#984806">&lt;-</styled-content> (nterm_psms_labelled / tmt_total) * 
                            <styled-content style="color:#000099">100</styled-content>
                        </monospace>


                        <monospace>efficiency_nterm %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) %&gt;%</monospace>

                        <monospace>  print()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 96.8</monospace>
                    </preformat>
                </p>
                <p>Secondly, we consider the TMT labelling efficiency of lysine (K) residues. As mentioned above, lysine residues can be TMT labelled regardless of their position within a peptide. Hence, we here calculate lysine labelling efficiency on a per lysine residue basis.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Count the number of lysine TMT6plex modifications in the PSM data</styled-content>
                        </monospace>

                        <monospace>k_tmt 
                            <styled-content style="color:#984806">&lt;-</styled-content> str_count(
                            <styled-content style="color:#CC9900">string =</styled-content> rowData(tmt_se)$Modifications,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">pattern =</styled-content> 
                            <styled-content style="color:#37A82E">"K[0-9]{1,2}</styled-content>\\
                            <styled-content style="color:#37A82E">(TMT6plex\\)"</styled-content>) %&gt;%</monospace>

                        <monospace>  sum() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Count the number of lysine residues in the PSM data</styled-content>
                        </monospace>

                        <monospace>k_total 
                            <styled-content style="color:#984806">&lt;-</styled-content> str_count(
                            <styled-content style="color:#CC9900">string =</styled-content> rowData(tmt_se)$Sequence,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">pattern =</styled-content> 
                            <styled-content style="color:#37A82E">"K"</styled-content>) %&gt;%</monospace>

                        <monospace>  sum() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine the percentage of TMT labelled lysines</styled-content>
                        </monospace>

                        <monospace>efficiency_k 
                            <styled-content style="color:#984806">&lt;-</styled-content> (k_tmt / k_total) * 
                            <styled-content style="color:#000099">100</styled-content>
                        </monospace>


                        <monospace>efficiency_k %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) %&gt;%</monospace>

                        <monospace>  print()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 98.5</monospace>
                    </preformat>
                </p>
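                <p>The two calculations above can also be wrapped into a single helper. The following is a minimal sketch in base R and is not part of the original workflow: the function name <monospace>tmt_efficiency</monospace>, its <monospace>tag</monospace> argument and the toy PSM annotations are hypothetical and serve only as an illustration. Passing <monospace>tag = "TMTpro"</monospace> should adapt the patterns for TMTpro reagents, assuming the modification strings otherwise follow the same format. For the toy data, both efficiencies are 75%.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Hypothetical helper: N-terminal and lysine labelling efficiency (%)
tmt_efficiency <- function(modifications, sequences, tag = "TMT6plex") {
  ## Count non-overlapping matches of a pattern in each string
  count_matches <- function(pattern, x) {
    lengths(regmatches(x, gregexpr(pattern, x)))
  }

  ## Patterns as used in the workflow, with the tag name substituted
  nterm_pattern <- paste0("N-Term\\(", tag, "\\)")
  k_pattern <- paste0("K[0-9]{1,2}\\(", tag, "\\)")

  ## N-terminal efficiency is calculated per PSM
  nterm <- sum(grepl(nterm_pattern, modifications)) / length(modifications)

  ## Lysine efficiency is calculated per lysine residue
  k <- sum(count_matches(k_pattern, modifications)) /
    sum(count_matches("K", sequences))

  c(nterm = nterm, k = k) * 100
}

## Toy data (made up): four PSMs
toy_mods <- c("N-Term(TMT6plex); K5(TMT6plex)",
              "K3(TMT6plex); K9(TMT6plex)",
              "N-Term(TMT6plex)",
              "N-Term(TMT6plex); M4(Oxidation)")
toy_seqs <- c("AAAAK", "AAKAAAAAK", "AAAAR", "AAAMKAR")

toy_eff <- tmt_efficiency(toy_mods, toy_seqs)
print(round(toy_eff, digits = 1))
```
]]></preformat>
                </p>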
                <p>Users should aim for an overall TMT labelling efficiency &gt;90% in order to achieve reliable quantitation. In cases where labelling efficiency is towards the lower end of the acceptable range, TMT labels should be set as dynamic modifications during the final identification search, although this will increase the search space and time as well as influencing false discovery rate (FDR) calculations. A summary of the current advice from Thermo Fisher is provided in 
                    <xref ref-type="table" rid="T2">Table 2</xref>. Where labelling efficiency is calculated as being between categories, how to progress is ultimately decided by the user.</p>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>Table 2. </label>
                    <caption>
                        <title>Thermo Fisher search strategy recommendations based on TMT labelling efficiency.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">N-term efficiency</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">K efficiency</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Suggested search method</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">&gt;98%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">&gt;98%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Both modifications as &#x2018;static&#x2019;</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">85-95%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">&gt;98%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">N-terminal modification &#x2018;dynamic&#x2019; and K modification &#x2018;static&#x2019;</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">&lt;84%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">&lt;84%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Data not suitable for quantitation</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
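                <p>For users who prefer to make this decision explicit in code, the three rows of the table can be written as a simple lookup. This is only a sketch: the function name <monospace>suggest_search</monospace> is hypothetical, the treatment of the range boundaries (inclusive at 85% and 95%) is our assumption, and any pair of values that falls between categories is deliberately returned as a user decision rather than being resolved automatically.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Hypothetical helper encoding the three rows of the table
suggest_search <- function(nterm, k) {
  if (nterm > 98 && k > 98) {
    "Both modifications static"
  } else if (nterm >= 85 && nterm <= 95 && k > 98) {
    "N-terminal modification dynamic, K modification static"
  } else if (nterm < 84 && k < 84) {
    "Data not suitable for quantitation"
  } else {
    "Between categories: decision left to the user"
  }
}

## Efficiencies reported for the use-case data
use_case <- suggest_search(nterm = 96.8, k = 98.5)
print(use_case)
```
]]></preformat>
                </p>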
                <p>Since the use-case data has a sufficiently high TMT labelling efficiency, we can continue to use the output of the identification search. This search considered TMT labelling of lysines as a static modification whilst N-terminal labelling was kept as dynamic, to investigate the presence of protein N-terminal modifications.</p>
            </sec>
            <sec id="sec16">
                <title>Basic data cleaning</title>
                <p>Being confident that the experiment and identification search were successful, we can now begin with some basic data cleaning. However, we also want to keep a copy of the raw PSM data. Therefore, we first create a second copy of the PSM 
                    <monospace>SummarizedExperiment</monospace>, called &#x201c;psms_filtered&#x201d;, and add it to the 
                    <monospace>QFeatures</monospace> object. This is done using the 
                    <monospace>addAssay</monospace> function. All changes made at the PSM-level will then only be applied to this second copy, so that we can refer back to the original data if needed.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Extract the "psms_raw" SummarizedExperiment</styled-content>
                        </monospace>

                        <monospace>data_copy 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]]</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Add copy of SummarizedExperiment</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> addAssay(
                            <styled-content style="color:#CC9900">x =</styled-content> cp_qf,</monospace>

                        <monospace>                  
                            <styled-content style="color:#CC9900">y =</styled-content> data_copy,</monospace>

                        <monospace>                  
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>cp_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 2 assays:</monospace>

                        <monospace>## [1] psms_raw: SummarizedExperiment with 48832 rows and 6 columns</monospace>

                        <monospace>## [2] psms_filtered: SummarizedExperiment with 48832 rows and 6 columns</monospace>
                    </preformat>
                </p>
                <p>Of note, manually adding an 
                    <monospace>assay</monospace> (or 
                    <monospace>SummarizedExperiment</monospace>) to the 
                    <monospace>QFeatures</monospace> object does not automatically generate links between these 
                    <monospace>assays</monospace>. We will manually add the explicit links later, after we complete data cleaning and filtering.</p>
                <p>The cleaning steps included in this section are non-specific and should be applied to all quantitative proteomics datasets. However, the names of key parameters will vary in the outputs of alternative third-party software, and users should remain aware of both changes in terminology over time and the introduction of new filters. All data cleaning steps are completed in the same way. We first determine how many rows, here PSMs, meet the conditions for removal. This is achieved by using the 
                    <monospace>dplyr::count</monospace> function. The unwanted rows are removed using the 
                    <monospace>filterFeatures</monospace> function. Since we only wish to apply the filters to the &#x201c;psms_filtered&#x201d; level, we specify this by using the 
                    <monospace>i =</monospace> argument. If this argument is not used, 
                    <monospace>filterFeatures</monospace> will remove features from all 
                    <monospace>assays</monospace> within a 
                    <monospace>QFeatures</monospace> object.</p>
            </sec>
            <sec id="sec17">
                <title>Removing PSMs not matched to a master protein</title>
                <p>The first common cleaning step we carry out is the removal of PSMs that have not been assigned to a master protein during the identification search. This can happen when the search software is unable to resolve conflicts caused by the presence of the isobaric amino acids leucine and isoleucine. Before implementing the filter, it is useful to find out how many PSMs we expect to remove. This is easily done by applying the 
                    <monospace>dplyr::count</monospace> function to the master protein column. Any PSMs that return 
                    <monospace>TRUE</monospace> will be removed by filtering. If the count returns no 
                    <monospace>TRUE</monospace> values, users should move on to the next filtering step without attempting to remove rows, as doing so will introduce an error.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Master.Protein.Accessions == "")</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##   `Master.Protein.Accessions == ""` n</monospace>

                        <monospace>##   &lt;lgl&gt;                         &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                         48660</monospace>

                        <monospace>## 2 TRUE                            172</monospace>
                    </preformat>
                </p>
                <p>For users who wish to explicitly track the process of data cleaning, the code chunk below demonstrates how to print a message containing the number of features removed.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>paste(
                            <styled-content style="color:#37A82E">"Removing"</styled-content>,</monospace>

                        <monospace>      length(which(rowData(</monospace>

                        <monospace>        cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]])$Master.Protein.Accessions == 
                            <styled-content style="color:#37A82E">""</styled-content>)),</monospace>

                        <monospace>      
                            <styled-content style="color:#37A82E">"PSMs without a master protein accession"</styled-content>) %&gt;%</monospace>

                        <monospace>  message()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Removing 172 PSMs without a master protein accession</monospace>
                    </preformat>
                </p>
                <p>This code could be adapted to each cleaning and filtering step. To maintain simplicity of this workflow, we will not print explicit messages at each step. Instead, the decision to do so is left to the user.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Remove PSMs without a master protein accession using filterFeatures</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ !Master.Protein.Accessions == 
                            <styled-content style="color:#37A82E">""</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
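                <p>One way to adapt the message to each cleaning step is to wrap it in a small function that also guards against filtering when nothing meets the removal condition. The sketch below uses base R on a toy character vector; the name <monospace>report_removal</monospace> and the toy accessions are hypothetical. In the workflow itself, the subsetting line would be replaced by the corresponding <monospace>filterFeatures</monospace> call.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve"><![CDATA[
```r
## Hypothetical helper: report how many features a filter would remove
report_removal <- function(to_remove, reason) {
  n_removed <- sum(to_remove, na.rm = TRUE)
  message("Removing ", n_removed, " PSMs ", reason)
  invisible(n_removed)
}

## Toy accessions (made up); two PSMs lack a master protein
toy_acc <- c("P12345", "", "Q67890", "")

n_removed <- report_removal(toy_acc == "",
                            "without a master protein accession")

## Only filter when there is something to remove
if (n_removed > 0) {
  toy_acc <- toy_acc[toy_acc != ""]
}
```
]]></preformat>
                </p>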
            </sec>
            <sec id="sec18">
                <title>Removing PSMs matched to a contaminant protein</title>
                <p>Next, we remove PSMs corresponding to contaminant proteins. Such proteins can be introduced intentionally as reagents during sample preparation, as is the case for digestive enzymes, or accidentally, as seen with human keratins derived from skin and hair. Since these proteins do not contribute to the biological question being asked, it is standard practice to remove them from the data. This is done by using a carefully curated, sample-specific contaminant database. Critically, the database used for filtering should be the same one that was used during the identification search. Whilst it is possible to remove contaminants using the 
                    <monospace>filterFeatures</monospace> function on a contaminants annotation column (as per the 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/vignettes/QFeatures/inst/doc/Processing.html">
                        <monospace>QFeatures</monospace> processing vignette</ext-link>), we demonstrate how to filter using only contaminant protein accessions for users who do not have contaminant annotations within their identification data.</p>
                <p>For this experiment, a contaminant database from Ref. 
                    <xref ref-type="bibr" rid="ref20">20</xref> was used. The 
                    <monospace>.fasta</monospace> file for this database is available from the Hao Group&#x2019;s GitHub repository for Protein Contaminant Libraries for DDA and DIA Proteomics, specifically at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/HaoGroup-ProtContLib/Protein-Contaminant-Libraries-for-DDA-and-DIA-Proteomics/tree/main/Universal%20protein%20contaminant%20FASTA">https://github.com/HaoGroup-ProtContLib/Protein-Contaminant-Libraries-for-DDA-and-DIA-Proteomics/tree/main/Universal%20protein%20contaminant%20FASTA</ext-link>. Here, we import this file using the 
                    <monospace>fasta.index</monospace> function from the 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/Biostrings.html">
                        <monospace>Biostrings</monospace> package</ext-link>.
                    <sup>
                        <xref ref-type="bibr" rid="ref21">21</xref>
                    </sup> This function requires the path to the <monospace>.fasta</monospace> file and the sequence type. As we have amino acid sequences, we pass 
                    <monospace>seqtype = "AA"</monospace>. The function returns a 
                    <monospace>data.frame</monospace> with one row per FASTA entry. We can then extract the protein accessions from the description of each entry. Users will need to alter the code below according to the contaminant file used.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Load Hao group .fasta file used in search</styled-content>
                        </monospace>

                        <monospace>cont_fasta 
                            <styled-content style="color:#984806">&lt;-</styled-content> 
                            <styled-content style="color:#37A82E">"220813_universal_protein_contaminants_Haogroup.fasta"</styled-content>
                        </monospace>

                        <monospace>conts 
                            <styled-content style="color:#984806">&lt;-</styled-content> Biostrings::fasta.index(cont_fasta, 
                            <styled-content style="color:#CC9900">seqtype =</styled-content> 
                            <styled-content style="color:#37A82E">"AA"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Extract only the protein accessions (not Cont_ at the start)</styled-content>
                        </monospace>

                        <monospace>cont_acc 
                            <styled-content style="color:#984806">&lt;-</styled-content> regexpr(
                            <styled-content style="color:#37A82E">"(?&lt;=</styled-content>\\
                            <styled-content style="color:#37A82E">_</styled-content>).
                            <styled-content style="color:#37A82E">*?(?=</styled-content>\\
                            <styled-content style="color:#37A82E">|)"</styled-content>, conts$desc, 
                            <styled-content style="color:#CC9900">perl =</styled-content> TRUE) %&gt;%</monospace>

                        <monospace>  regmatches(conts$desc, .)</monospace>
                    </preformat>
                </p>
                <p>Now that we have our contaminant list by accession number, we can identify and remove PSMs with any contaminant protein within their &#x201c;Protein.Accessions&#x201d;. Importantly, filtering on &#x201c;Protein.Accessions&#x201d; ensures the removal of PSMs which matched to a protein group containing a contaminant protein, even if the contaminant protein is not the group&#x2019;s master protein.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Define function to find contaminants</styled-content>
                        </monospace>

                        <monospace>find_cont 
                            <styled-content style="color:#984806">&lt;-</styled-content> 
                            <styled-content style="color:#000099">function</styled-content>(se, cont_acc) {</monospace>

                        <monospace>  cont_indices 
                            <styled-content style="color:#984806">&lt;-</styled-content> c()</monospace>

                        <monospace>  
                            <styled-content style="color:#000099">for</styled-content> (i in seq_along(cont_acc)) {</monospace>

                        <monospace>    cont_protein 
                            <styled-content style="color:#984806">&lt;-</styled-content> cont_acc[i]</monospace>

                        <monospace>    cont_present 
                            <styled-content style="color:#984806">&lt;-</styled-content> grep(cont_protein, rowData(se)$Protein.Accessions)</monospace>

                        <monospace>    output 
                            <styled-content style="color:#984806">&lt;-</styled-content> c(cont_present)</monospace>

                        <monospace>    cont_indices 
                            <styled-content style="color:#984806">&lt;-</styled-content> append(cont_indices, output)</monospace>

                        <monospace>  }</monospace>

                        <monospace>  unique(cont_indices) ## return each matching row index once</monospace>

                        <monospace>}</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Store row indices of PSMs matched to a contaminant-containing protein group</styled-content>
                        </monospace>

                        <monospace>cont_psms 
                            <styled-content style="color:#984806">&lt;-</styled-content> find_cont(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]], cont_acc)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## If we find contaminants, remove these rows from the data</styled-content>
                        </monospace>

                        <monospace>
                            <styled-content style="color:#000099">if</styled-content> (length(cont_psms) &gt; 
                            <styled-content style="color:#000099">0</styled-content>)</monospace>

                        <monospace>  cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]][-cont_psms, ]</monospace>
                    </preformat>
                </p>
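                <p>The loop above can also be written in vectorised form. The following is a minimal sketch on toy data, not part of the original workflow: the accession strings and the two contaminant accessions are hypothetical stand-ins for 
                    <monospace>rowData(se)$Protein.Accessions</monospace> and 
                    <monospace>cont_acc</monospace>, and 
                    <monospace>fixed = TRUE</monospace> is an added choice so that accessions are matched as literal text rather than as regular expressions.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Toy stand-ins for the accession column and contaminant list (hypothetical values)</styled-content>
                        </monospace>

                        <monospace>protein_accessions &lt;- c("P11111; P22222", "P33333", "P44444; P02768")</monospace>

                        <monospace>cont_acc &lt;- c("P02768", "P00761")</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## For each contaminant, find the rows whose accession string contains it</styled-content>
                        </monospace>

                        <monospace>hits &lt;- lapply(cont_acc, grep, x = protein_accessions, fixed = TRUE)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Combine into a single, duplicate-free vector of row indices</styled-content>
                        </monospace>

                        <monospace>cont_rows &lt;- unique(unlist(hits))</monospace>

                        <monospace>cont_rows</monospace>
                    </preformat>
                </p>
                <p>In this toy example only the third row is flagged, because its protein group contains a contaminant accession even though that accession is not listed first.</p>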
                <p>At this point, users can also remove any additional proteins which may not have been included in the contaminant database. For example, users may wish to remove human trypsin (accession P35050) should it appear in their data.</p>
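                <p>As an illustration of removing additional accessions, the sketch below flags entries containing any accession from a user-defined list. It uses toy data: the accession strings are hypothetical, and the only accession taken from the text is the trypsin example above. It is one possible approach under these assumptions rather than the workflow&#x2019;s own code.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Toy stand-in for the "Protein.Accessions" column (hypothetical values)</styled-content>
                        </monospace>

                        <monospace>protein_accessions &lt;- c("P12345", "P35050; Q99999", "Q11111")</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Additional accessions to remove (here, the trypsin example from the text)</styled-content>
                        </monospace>

                        <monospace>extra_acc &lt;- c("P35050")</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Flag entries containing any of the extra accessions</styled-content>
                        </monospace>

                        <monospace>is_extra &lt;- grepl(paste(extra_acc, collapse = "|"), protein_accessions)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Keep only entries without a match</styled-content>
                        </monospace>

                        <monospace>protein_accessions[!is_extra]</monospace>
                    </preformat>
                </p>
                <p>Applied to the real data, the same logical vector could be computed from the &#x201c;Protein.Accessions&#x201d; column of the PSM-level 
                    <monospace>rowData</monospace> and used to subset rows in the same way as for the contaminant PSMs above.</p>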
                <p>Several third-party software packages can also directly annotate which FASTA file (here, the human proteome or contaminant database) a PSM is derived from. In such cases, filtering can be simplified by removing PSMs annotated as contaminants in the output file.</p>
            </sec>
            <sec id="sec19">
                <title>Removing PSMs which lack quantitative data</title>
                <p>Now that we are left with only PSMs matched to proteins of interest, we filter out PSMs which cannot be used for quantitation. This includes PSMs which lack quantitative information altogether. In outputs derived from Proteome Discoverer, this information is found in the &#x201c;Quan.Info&#x201d; column, where such PSMs are annotated as &#x201c;NoQuanLabels&#x201d;. For users who have considered both lysine and N-terminal TMT labels as static modifications, the data should not contain any PSMs without quantitative information. However, since the use-case data was derived from a search in which N-terminal TMT modifications were dynamic, the data does include this annotation. Users are reminded that column names are typically software-specific: the &#x201c;Quan.Info&#x201d; column is found only in outputs derived from Proteome Discoverer. However, most alternative third-party software will have an equivalent column containing the same information.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Quan.Info == 
                            <styled-content style="color:#37A82E">"NoQuanLabels"</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##  `Quan.Info == "NoQuanLabels"`    n</monospace>

                        <monospace>##  &lt;lgl&gt;                        &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                       47241</monospace>

                        <monospace>## 2 TRUE                          228</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Drop these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Quan.Info != 
                            <styled-content style="color:#37A82E">"NoQuanLabels"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
                <p>This point in the workflow is a good time to check whether there are any other annotations within the &#x201c;Quan.Info&#x201d; column. For example, if there are any PSMs which have been &#x201c;ExcludedByMethod&#x201d;, this indicates that a PSM-level filter was applied in Proteome Discoverer during the identification search. If this is the case, users should determine which filter has been applied to the data and decide whether to remove the PSMs which were &#x201c;ExcludedByMethod&#x201d; (thereby applying the pre-set threshold) or leave them in (disregard the threshold).

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Are there any remaining annotations in the Quan.Info column?</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Quan.Info) %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##</monospace>

                        <monospace>## 47241</monospace>
                    </preformat>
                </p>
                <p>The output above shows that there are no remaining annotations in the &#x201c;Quan.Info&#x201d; column (all 47241 PSMs have an empty entry), so we can continue.</p>
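                <p>Had other annotations been present, the decision described above could be applied in the same way. The sketch below uses a toy vector of annotations (hypothetical values, not the use-case data) to show how PSMs annotated as &#x201c;NoQuanLabels&#x201d; or &#x201c;ExcludedByMethod&#x201d; could be flagged for removal, assuming the user has decided to apply the pre-set threshold.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Toy stand-in for the "Quan.Info" column (hypothetical values)</styled-content>
                        </monospace>

                        <monospace>quan_info &lt;- c("", "ExcludedByMethod", "", "NoQuanLabels", "")</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Tabulate the annotations that are present</styled-content>
                        </monospace>

                        <monospace>table(quan_info)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Keep only PSMs without either annotation</styled-content>
                        </monospace>

                        <monospace>keep &lt;- !quan_info %in% c("NoQuanLabels", "ExcludedByMethod")</monospace>

                        <monospace>sum(keep)</monospace>
                    </preformat>
                </p>
                <p>In the workflow itself, the corresponding condition could be passed to 
                    <monospace>filterFeatures</monospace> as a formula on &#x201c;Quan.Info&#x201d;, following the pattern used above for &#x201c;NoQuanLabels&#x201d;.</p>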
            </sec>
            <sec id="sec20">
                <title>Removing PSMs which are not unique to a protein</title>
                <p>The next step is to consider which PSMs are to be used for quantitation. There are two ways in which a PSM can be considered unique. The first and strictest form of uniqueness is a PSM corresponding to a single protein only, so that the PSM is allocated to one protein and one protein group. However, it is common to expand the definition of unique to include PSMs that map to multiple proteins within a single protein group, that is, PSMs which are allocated to more than one protein but only one protein group. This choice is ultimately up to the user. By contrast, PSMs corresponding to razor and shared peptides are linked to multiple proteins across multiple protein groups. In this workflow, the final grouping of peptides to proteins will be done based on master protein accession. Therefore, differential expression analysis will be based on protein groups, and we here consider unique as any PSM linked to only one protein group. This means removing PSMs where &#x201c;Number.of.Protein.Groups&#x201d; is not equal to 1.</p>
                <p>In the below code chunk we count the number of PSMs linked to more than 1 protein group.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Number.of.Protein.Groups != 
                            <styled-content style="color:#000099">1</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##   `Number.of.Protein.Groups != 1`     n</monospace>

                        <monospace>##   &lt;lgl&gt;                           &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                           44501</monospace>

                        <monospace>## 2 TRUE                             2740</monospace>
                    </preformat>
                </p>
                <p>We again use the 
                    <monospace>filterFeatures</monospace> function to retain PSMs linked to only 1 protein group and discard any PSMs linked to more than 1 group.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Remove these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Number.of.Protein.Groups == 
                            <styled-content style="color:#000099">1</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
                <p>
                    <bold>Additional considerations regarding protein isoforms</bold>
                </p>
                <p>Users searching against a database that includes protein isoforms must take extra caution when defining &#x2018;unique&#x2019; PSMs. A PSM that corresponds to a single protein when data is searched against the proteome without isoforms may correspond to multiple proteins once additional isoforms are included. As a result, PSMs or peptides that were previously mapped to one protein and one protein group could instead be mapped to multiple proteins and one protein group. These PSMs would be filtered out by defining &#x2018;unique&#x2019; as corresponding to only one protein and one protein group, but would be retained if the definition was expanded to multiple proteins and one protein group. Users should be aware of these possibilities and select their filtering strategy based on the biological question of interest.</p>
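                <p>The difference between the two definitions of uniqueness can be sketched on a toy table. The values below are hypothetical, and the column name 
                    <monospace>Number.of.Proteins</monospace> is an assumption about how the number of proteins per PSM is stored; users should check the corresponding column in their own data.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Toy PSM annotations (hypothetical values and assumed column names)</styled-content>
                        </monospace>

                        <monospace>psms &lt;- data.frame(</monospace>

                        <monospace>  Sequence = c("PEPA", "PEPB", "PEPC"),</monospace>

                        <monospace>  Number.of.Proteins = c(1, 3, 2),</monospace>

                        <monospace>  Number.of.Protein.Groups = c(1, 1, 2),</monospace>

                        <monospace>  stringsAsFactors = FALSE</monospace>

                        <monospace>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Strict definition: one protein and one protein group</styled-content>
                        </monospace>

                        <monospace>strict &lt;- psms$Number.of.Proteins == 1 &amp; psms$Number.of.Protein.Groups == 1</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Expanded definition: any number of proteins within one protein group</styled-content>
                        </monospace>

                        <monospace>expanded &lt;- psms$Number.of.Protein.Groups == 1</monospace>


                        <monospace>psms$Sequence[strict]</monospace>

                        <monospace>psms$Sequence[expanded]</monospace>
                    </preformat>
                </p>
                <p>Here the second PSM maps to three proteins (for example, isoforms) within one protein group, so it is retained only under the expanded definition. The third PSM spans two protein groups and is removed under both.</p>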
            </sec>
            <sec id="sec21">
                <title>Removing PSMs that are not rank 1</title>
                <p>Another filter that is important for quantitation is that of PSM rank. Since individual spectra can have multiple candidate peptide matches, Proteome Discoverer uses a scoring algorithm to determine the probability of a PSM being incorrect. Once each candidate PSM has been given a score, the one with the lowest score (lowest probability of being incorrect) is allocated rank 1. The PSM with the second lowest probability of being incorrect is rank 2, and so on. For the analysis, we only want rank 1 PSMs to be retained.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Rank != 
                            <styled-content style="color:#000099">1</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##  `Rank != 1`      n</monospace>

                        <monospace>##  &lt;lgl&gt;        &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE       43426</monospace>

                        <monospace>## 2 TRUE         1075</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Drop these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Rank == 
                            <styled-content style="color:#000099">1</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
                <p>The majority of search engines, including SequestHT, also provide their own PSM rank. To be conservative and ensure accurate quantitation, we also only retain PSMs that have a search engine rank of 1.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Search.Engine.Rank != 
                            <styled-content style="color:#000099">1</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##   `Search.Engine.Rank != 1`     n</monospace>

                        <monospace>##   &lt;lgl&gt;                     &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                     43153</monospace>

                        <monospace>## 2 TRUE                        273</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Drop these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Search.Engine.Rank == 
                            <styled-content style="color:#000099">1</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec22">
                <title>Removing ambiguous PSMs</title>
                <p>Finally, we retain only unambiguous PSMs. Since there can be several candidate peptides for each spectrum, Proteome Discoverer allocates each PSM a level of ambiguity to indicate whether it was possible to determine a definite PSM or whether one had to be selected from a number of candidates. The allocation of PSM ambiguity takes place during the process of protein grouping and the definitions of each ambiguity assignment are given below in 
                    <xref ref-type="table" rid="T3">Table 3</xref>.</p>
                <table-wrap id="T3" orientation="portrait" position="float">
                    <label>Table 3. </label>
                    <caption>
                        <title>Definitions of PSM ambiguity categories based on Proteome Discoverer outputs.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">PSM category</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Definition</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Unambiguous</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">The only candidate PSM</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Selected</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">PSM was selected from a group of candidates</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Rejected</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">PSM was rejected from a group of candidates</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ambiguous</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Two or more candidate PSMs could not be distinguished</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Unconsidered</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">PSM was not considered suitable</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>Importantly, depending upon the software being used, output files may already have excluded some of these categories. It is still good to check before proceeding with the data.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(PSM.Ambiguity != 
                            <styled-content style="color:#37A82E">"Unambiguous"</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 1 x 2</monospace>

                        <monospace>##   `PSM.Ambiguity != "Unambiguous"`     n</monospace>

                        <monospace>##   &lt;lgl&gt;                            &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                            43153</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## No PSMs to remove so proceed</styled-content>
                        </monospace>
                    </preformat>
                </p>
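                <p>For datasets in which other ambiguity categories are present, the retained PSMs can be selected with a simple equality test. The sketch below uses a toy vector of categories (hypothetical values, not the use-case data) to illustrate this.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Toy stand-in for the "PSM.Ambiguity" column (hypothetical values)</styled-content>
                        </monospace>

                        <monospace>psm_ambiguity &lt;- c("Unambiguous", "Selected", "Unambiguous", "Rejected")</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Tabulate the categories that are present</styled-content>
                        </monospace>

                        <monospace>table(psm_ambiguity)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Keep only unambiguous PSMs</styled-content>
                        </monospace>

                        <monospace>keep &lt;- psm_ambiguity == "Unambiguous"</monospace>

                        <monospace>sum(keep)</monospace>
                    </preformat>
                </p>
                <p>In the workflow itself, the corresponding condition on &#x201c;PSM.Ambiguity&#x201d; could be passed to 
                    <monospace>filterFeatures</monospace> as a formula, following the pattern of the earlier filtering steps.</p>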
            </sec>
            <sec id="sec23">
                <title>Assessing the impact of non-specific data cleaning</title>
                <p>Now that we have finished the non-specific data cleaning, we can pause and check to see what this has done to the data. We determine the number and proportion of PSMs, peptides, and proteins lost from the original dataset.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine number and proportion of PSMs removed</styled-content>
                        </monospace>

                        <monospace>psms_remaining 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  nrow() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>psms_removed 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_psms - psms_remaining</monospace>

                        <monospace>psms_removed_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((psms_removed / original_psms) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine number and proportion of peptides removed</styled-content>
                        </monospace>

                        <monospace>peps_remaining 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Sequence) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>peps_removed 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_peps - peps_remaining</monospace>

                        <monospace>peps_removed_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((peps_removed / original_peps) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine number and proportion of proteins removed</styled-content>
                        </monospace>

                        <monospace>prots_remaining 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Master.Protein.Accessions) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>prots_removed 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_prots - prots_remaining</monospace>

                        <monospace>prots_removed_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((prots_removed / original_prots) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Print as a table</styled-content>
                        </monospace>

                        <monospace>data.frame(
                            <styled-content style="color:#37A82E">"Feature"</styled-content> = c(
                            <styled-content style="color:#37A82E">"PSMs"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Peptides"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Proteins"</styled-content>),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Number lost"</styled-content> 
                            <styled-content style="color:#CC9900">=</styled-content> c(psms_removed,</monospace>

                        <monospace>  &#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;peps_removed,</monospace>

                        <monospace>  &#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;prots_removed),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Percentage lost"</styled-content> 
                            <styled-content style="color:#CC9900">=</styled-content> c(psms_removed_prop,</monospace>

                        <monospace>  &#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;peps_removed_prop,</monospace>

                        <monospace>  &#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;prots_removed_prop))</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##    Feature Number.lost Percentage.lost</monospace>

                        <monospace>## 1     PSMs        5679           11.63</monospace>

                        <monospace>## 2 Peptides        1565            6.03</monospace>

                        <monospace>## 3 Proteins         452            8.97</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec24">
                <title>PSM quality control filtering</title>
                <p>The next step is to examine the data and make informed decisions about in-depth filtering. Here, we focus on three key quality control filters for TMT data: 1) average reporter ion signal-to-noise (S/N) ratio, 2) percentage co-isolation interference, and 3) percentage SPS mass match. It is possible to set thresholds for these three parameters during the identification search. However, specifying thresholds before exploring the data could lead to excessive data exclusion or to the retention of poor quality PSMs. We suggest that users set the thresholds for all three filters to 0 during the identification search, thus allowing maximum flexibility during data processing. In all cases, quality control filtering represents a trade-off between ensuring high quality data and losing potentially informative data. The thresholds used will therefore likely depend upon the initial quality of the data, the number of PSMs, and whether the experimental goal is stringent or exploratory.</p>
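                <p>The trade-off between data quality and data quantity can be explored before committing to a cut-off. The following is a minimal sketch in base R and is not part of the workflow itself: the S/N values are simulated, and 
                    <monospace>count_retained()</monospace> is a hypothetical helper introduced here purely for illustration. It tabulates how many PSMs would be retained at a range of candidate thresholds.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Simulated average reporter S/N values (illustrative only, not the use-case data)</monospace>

                        <monospace>set.seed(1)</monospace>

                        <monospace>toy_sn &lt;- c(rlnorm(1000, meanlog = 5, sdlog = 1), NA, NA)</monospace>


                        <monospace>## Hypothetical helper: count the PSMs kept at each candidate threshold</monospace>

                        <monospace>## Percentages are relative to all values supplied, including NAs</monospace>

                        <monospace>count_retained &lt;- function(x, thresholds) {</monospace>

                        <monospace>  kept &lt;- sapply(thresholds, function(thr) sum(x &gt;= thr, na.rm = TRUE))</monospace>

                        <monospace>  data.frame(threshold = thresholds,</monospace>

                        <monospace>             retained = kept,</monospace>

                        <monospace>             percent_retained = round(100 * kept / length(x), digits = 2))</monospace>

                        <monospace>}</monospace>


                        <monospace>count_retained(toy_sn, thresholds = c(0, 5, 10, 20, 50))</monospace>
                    </preformat>
                    The same helper could be applied to a column extracted from the row data of the PSM-level data, allowing candidate thresholds to be compared on real data before any rows are removed.
                </p>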
            </sec>
            <sec id="sec25">
                <title>Quality control: Average reporter ion signal-to-noise</title>
                <p>Intensity measurements derived from a small number of ions tend to be more variable and less accurate. Therefore, reporter ion spectra with peaks generated from a small number of ions should be filtered out to ensure accurate quantitation and avoid stochastic ion effects. When using an Orbitrap analyser, as was the case in the collection of the use-case data, the number of ions is proportional to the S/N value of a peak. Hence, the average reporter ion S/N ratio can be used to filter out quantitation based on too few ions.</p>
                <p>To determine an appropriate reporter ion S/N threshold we need to understand the original, unfiltered data. Here, we print a summary of the average reporter S/N before plotting a simple histogram to visualise the data. The default threshold for average reporter ion S/N when filtering within Proteome Discoverer is 10, or 1 on the base-10 logarithmic scale displayed here. We include a line to show where this threshold would be on the data distribution.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Get summary information</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Average.Reporter.SN) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##   Min.   1st Qu.   Median   Mean   3rd Qu.     Max.   NA's</monospace>

                        <monospace>##    0.3      84.2    215.8  321.8     450.3   3008.2    140</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of reporter ion signal-to-noise</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> log10(Average.Reporter.SN))) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">0.05</styled-content>) +</monospace>

                        <monospace>  geom_vline(
                            <styled-content style="color:#CC9900">xintercept =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"log10(average reporter SN)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"Average reporter ion S/N"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure8.gif"/>
                </p>
                <p>From the distribution of the data it is clear that applying such a threshold would not result in dramatic data loss. Whilst we could set a higher threshold for a more stringent analysis, this would lead to unnecessary data loss. Therefore, we keep PSMs with an average reporter ion S/N of 10 or more. We also remove PSMs that have an NA value for their average reporter ion S/N since their quality cannot be guaranteed. This is done by including 
                    <monospace>na.rm = TRUE</monospace> in the call to <monospace>filterFeatures</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Average.Reporter.SN &lt; 
                            <styled-content style="color:#000099">10</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 3 x 2</monospace>

                        <monospace>##   `Average.Reporter.SN &lt; 10`     n</monospace>

                        <monospace>##   &lt;lgl&gt;                      &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                      42066</monospace>

                        <monospace>## 2 TRUE                         947</monospace>

                        <monospace>## 3 NA                           140</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Drop these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Average.Reporter.SN &gt;= 
                            <styled-content style="color:#000099">10</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
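                <p>The reason missing values need explicit handling is that a comparison involving NA is itself NA, rather than TRUE or FALSE. The following minimal sketch uses base R on made-up values rather than the QFeatures object, so it illustrates only the underlying logic and not the internals of 
                    <monospace>filterFeatures</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy S/N values including one missing value (illustrative only)</monospace>

                        <monospace>toy_sn &lt;- c(250.3, 8.1, NA, 10, 42.7)</monospace>


                        <monospace>## Comparisons involving NA return NA rather than TRUE or FALSE</monospace>

                        <monospace>toy_sn &gt;= 10</monospace>


                        <monospace>## Keep values that are both non-missing and at or above the threshold</monospace>

                        <monospace>keep &lt;- !is.na(toy_sn) &amp; toy_sn &gt;= 10</monospace>

                        <monospace>toy_sn[keep]</monospace>
                    </preformat>
                </p>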
            </sec>
            <sec id="sec26">
                <title>Quality control: Isolation interference</title>
                <p>A second data-dependent quality control parameter that should be considered is isolation interference. Two types of interference can occur in a TMT experiment. The first is reporter ion interference, also known as cross-label isotopic impurity, which arises from manufacturing-level impurities and experimental error. The former should be reduced somewhat by the inclusion of lot-specific correction factors in the search set-up, and users should ensure that these corrections are applied. In Proteome Discoverer this means setting &#x201c;Apply Quan Value Corrections&#x201d; to &#x201c;TRUE&#x201d; within the reporter ions quantifier node. The second is co-isolation interference, which occurs during the MS run when multiple labelled precursor peptides are co-isolated in a single data acquisition window. Following fragmentation of the co-isolated peptides, this results in an MS2 or MS3 reporter ion peak (depending upon the experimental design) derived from multiple precursor peptides. Hence, co-isolation interference leads to inaccurate quantitation of the identified peptide. This problem is reduced by filtering out PSMs with a high percentage isolation interference value. As was the case for reporter ion S/N, Proteome Discoverer has a suggested default threshold for isolation interference: 50% for MS2 experiments and 75% for SPS-MS3 experiments.</p>
                <p>Again, we get a summary and visualise the data using the code chunk below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Get summary information</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Isolation.Interference.in.Percent) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##   Min.  1st Qu.   Median    Mean  3rd Qu.    Max.</monospace>

                        <monospace>##  0.000    0.000    8.385  12.637  21.053   84.379</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of co-isolation interference</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> Isolation.Interference.in.Percent)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>) +</monospace>

                        <monospace>  geom_vline(
                            <styled-content style="color:#CC9900">xintercept =</styled-content> 
                            <styled-content style="color:#000099">75</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Isolation interference (%)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"Co-isolation interference %"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure9.gif"/>
                </p>
                <p>Looking at the data, very few PSMs have an isolation interference above the suggested threshold, and hence minimal data will be lost. Again, we choose to apply the standard threshold with the understanding that lowering the threshold, and so filtering more stringently, would result in greater data loss. Importantly, we are able to apply relatively standard thresholds here as the preliminary exploration did not expose any problems with the experimental data (in terms of labelling or MS analysis). If users have reason to believe the data is of poorer quality then more stringent thresholding should be considered.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(Isolation.Interference.in.Percent &gt; 
                            <styled-content style="color:#000099">75</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##   `Isolation.Interference.in.Percent &gt; 75`      n</monospace>

                        <monospace>##   &lt;lgl&gt;                                     &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                                     42007</monospace>

                        <monospace>## 2 TRUE                                         59</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Remove these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Isolation.Interference.in.Percent &lt;= 
                            <styled-content style="color:#000099">75</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
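                <p>Because the suggested default differs between acquisition methods, it can be useful to see how much data each default would remove. The following minimal sketch uses base R on made-up interference values. The two thresholds are the suggested Proteome Discoverer defaults for MS2 and SPS-MS3 experiments (50% and 75%, respectively), and the object names are hypothetical.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Suggested Proteome Discoverer defaults for each acquisition method</monospace>

                        <monospace>interference_defaults &lt;- c(MS2 = 50, SPS_MS3 = 75)</monospace>


                        <monospace>## Toy isolation interference values in percent (illustrative only)</monospace>

                        <monospace>toy_interference &lt;- c(0, 0, 8.4, 21.1, 47.9, 52.3, 76.0, 84.4)</monospace>


                        <monospace>## Number of PSMs that would be removed under each default</monospace>

                        <monospace>removed &lt;- sapply(interference_defaults,</monospace>

                        <monospace>                  function(thr) sum(toy_interference &gt; thr))</monospace>

                        <monospace>removed</monospace>
                    </preformat>
                </p>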
            </sec>
            <sec id="sec27">
                <title>Quality control: SPS mass match</title>
                <p>The final quality control filter that we will apply is a percentage SPS mass match threshold. SPS mass match is a metric introduced in Proteome Discoverer version 2.3 to quantify the percentage of SPS-MS3 fragments that can still be explicitly traced back to the precursor peptide. This parameter is of particular importance given that quantitation is based on the SPS-MS3 spectra. Unfortunately, the SPS Mass Match percentage is currently only a feature of Proteome Discoverer (2.3 and above) and will not be available to users of other third-party software.</p>
                <p>We follow the same format as before to investigate the SPS Mass Match (%) distribution of the data. The default threshold within Proteome Discoverer is an SPS Mass Match above 65%. In practice, since SPS Mass Match is only reported to the nearest 10%, removing PSMs annotated with a value below 65% means removing those with 60% or less. Hence, only PSMs with 70% SPS Mass Match or above would be retained. We can see how many PSMs would be lost based on such thresholds using the code chunk below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Get summary information</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(SPS.Mass.Matches.in.Percent) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##   Min.  1st Qu.   Median   Mean  3rd Qu.    Max.</monospace>

                        <monospace>##   0.00    50.00    70.00  64.31   80.00   100.00</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of SPS mass match %</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> SPS.Mass.Matches.in.Percent)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">10</styled-content>) +</monospace>

                        <monospace>  geom_vline(
                            <styled-content style="color:#CC9900">xintercept =</styled-content> 
                            <styled-content style="color:#000099">65</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"SPS mass matches (%)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">100</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"SPS mass match %"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure10.gif"/>
                </p>
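                <p>The equivalence between a 65% threshold and a 70% threshold follows directly from the reporting resolution. As a minimal sketch in base R, assuming that SPS Mass Match only takes values in steps of 10 as described above (the object names are hypothetical), three differently worded thresholds can be shown to retain the same set of values.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## SPS Mass Match is assumed to be reported in steps of 10</monospace>

                        <monospace>possible_values &lt;- seq(0, 100, by = 10)</monospace>


                        <monospace>## Values retained under three differently worded thresholds</monospace>

                        <monospace>kept_above_65 &lt;- possible_values[possible_values &gt; 65]</monospace>

                        <monospace>kept_at_least_65 &lt;- possible_values[possible_values &gt;= 65]</monospace>

                        <monospace>kept_at_least_70 &lt;- possible_values[possible_values &gt;= 70]</monospace>


                        <monospace>## Check that all three rules keep the same set of values</monospace>

                        <monospace>identical(kept_above_65, kept_at_least_65)</monospace>

                        <monospace>identical(kept_above_65, kept_at_least_70)</monospace>
                    </preformat>
                </p>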
                <p>From the summary and histogram we can see that the distribution of SPS Mass Matches is much less skewed than that of average reporter ion S/N or isolation interference. This means that whilst the application of thresholds on average reporter ion S/N and isolation interference led to minimal data loss, attempting to impose a threshold on SPS Mass Match represents a much greater trade-off between data quality and quantity. For simplicity, here we choose to use the standard threshold of 65%.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out how many PSMs we expect to lose</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(SPS.Mass.Matches.in.Percent &lt; 
                            <styled-content style="color:#000099">65</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 2 x 2</monospace>

                        <monospace>##   `SPS.Mass.Matches.in.Percent &lt; 65`      n</monospace>

                        <monospace>##   &lt;lgl&gt;                               &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE                               21697</monospace>

                        <monospace>## 2 TRUE                                20310</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Drop these rows from the data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ SPS.Mass.Matches.in.Percent &gt;= 
                            <styled-content style="color:#000099">65</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec28">
                <title>Assessing the impact of data-specific filtering</title>
                <p>As we did after the non-specific cleaning steps, we check to see how many PSMs, peptides and proteins have been removed throughout the in-depth data-specific filtering.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Summarize the effect of data-specific filtering</styled-content>
                        </monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine the number and proportion of PSMs removed</styled-content>
                        </monospace>

                        <monospace>psms_remaining_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  nrow() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>psms_removed_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> psms_remaining - psms_remaining_2</monospace>

                        <monospace>psms_removed_prop_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((psms_removed_2 / original_psms) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine number and proportion of peptides removed</styled-content>
                        </monospace>

                        <monospace>peps_remaining_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> rowData(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]])$Sequence %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>peps_removed_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> peps_remaining - peps_remaining_2</monospace>

                        <monospace>peps_removed_prop_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((peps_removed_2 / original_peps) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine number and proportion of proteins removed</styled-content>
                        </monospace>

                        <monospace>prots_remaining_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Master.Protein.Accessions) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>prots_removed_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> prots_remaining - prots_remaining_2</monospace>

                        <monospace>prots_removed_prop_2 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((prots_removed_2 / original_prots) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Print as a table</styled-content>
                        </monospace>

                        <monospace>data.frame(
                            <styled-content style="color:#37A82E">"Feature"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(
                            <styled-content style="color:#37A82E">"PSMs"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Peptides"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Proteins"</styled-content>),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Number lost"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_removed_2,</monospace>

                        <monospace>                             peps_removed_2,</monospace>

                        <monospace>                             prots_removed_2),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Percentage lost"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_removed_prop_2,</monospace>

                        <monospace>                                 peps_removed_prop_2,</monospace>

                        <monospace>                                 prots_removed_prop_2))</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##    Feature Number.lost Percentage.lost</monospace>

                        <monospace>## 1     PSMs       21456           43.94</monospace>

                        <monospace>## 2 Peptides       10162           39.13</monospace>

                        <monospace>## 3 Proteins        1299           25.77</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec29">
                <title>Managing missing data</title>
                <p>Having finished the data cleaning at the PSM-level, the final step is to deal with missing data. Missing values represent a common challenge in quantitative proteomics and there is no consensus within the literature on how this challenge should be addressed. Indeed, missing values fall into different categories based on the reason they were generated, and each category is best dealt with in a different way. There are three main categories of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Within proteomics, values which are MCAR arise due to technical variation or stochastic fluctuations and emerge in a uniform, intensity-independent distribution. Examples include values for peptides which cannot be consistently identified or are unable to be efficiently ionised. By contrast, MNAR values are expected to occur in an intensity-dependent manner due to the presence of peptides at abundances below the limit of detection.
                    <sup>
                        <xref ref-type="bibr" rid="ref17">17</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref22">22</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref23">23</xref>
                    </sup> In many cases this is due to the biological condition being evaluated, for example the cell type or treatment applied.</p>
                <p>To simplify this process, we consider the management of missing data in three steps. The first step is to determine the presence and pattern of missing values within the data. Next, we filter out data which exceed the desired proportion of missing values. This includes removing PSMs with a greater number of missing values across samples than we deem acceptable, as well as whole samples in cases where the proportion of missing values is substantially higher than the average. Finally, imputation can be used to replace any remaining NA values within the dataset. This final step is optional and can equally be done prior to filtering if the user wishes to impute all missing values without removing any PSMs, although this is not recommended. Further, whilst it is possible to complete such steps at the peptide- or protein-level, we advise management of missing values at the lowest data level to minimise the effect of implicit imputation during aggregation.</p>
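                <p>To make the distinction between MCAR and MNAR values more concrete, the following minimal sketch simulates both mechanisms on toy data using base R. The simulated intensities, the 10% MCAR rate and the limit of detection are arbitrary assumptions chosen for illustration and are not derived from the use-case data.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Simulate log2 intensities for 1000 hypothetical PSMs (toy data)</monospace>

                        <monospace>set.seed(1)</monospace>

                        <monospace>intensity &lt;- rnorm(1000, mean = 10, sd = 2)</monospace>


                        <monospace>## MCAR: every value has the same 10 % chance of being missing</monospace>

                        <monospace>mcar &lt;- intensity</monospace>

                        <monospace>mcar[runif(1000) &lt; 0.1] &lt;- NA</monospace>


                        <monospace>## MNAR: values below an assumed limit of detection are missing</monospace>

                        <monospace>lod &lt;- 8</monospace>

                        <monospace>mnar &lt;- intensity</monospace>

                        <monospace>mnar[intensity &lt; lod] &lt;- NA</monospace>


                        <monospace>## Compare the mean intensity of the values lost under each mechanism</monospace>

                        <monospace>mean(intensity[is.na(mcar)])</monospace>

                        <monospace>mean(intensity[is.na(mnar)])</monospace>
                    </preformat>
                </p>
                <p>In this sketch the values lost under MNAR all lie below the assumed limit of detection, whereas the values lost under MCAR are drawn from across the whole intensity range.</p>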
            </sec>
            <sec id="sec30">
                <title>Exploring the presence of missing values</title>
                <p>First, to determine the presence of missing values in the PSM-level data we use the 
                    <monospace>nNA</monospace> function within the 
                    <monospace>QFeatures</monospace> infrastructure. This function returns the absolute number and percentage of missing values for the dataset as a whole, for each feature (row) and for each sample (column). Importantly, alternative third-party software may output missing values in formats other than NA, such as zero or infinite values. In such cases, missing values can be converted directly into NA values through use of the 
                    <monospace>zeroIsNA</monospace> or 
                    <monospace>infIsNA</monospace> functions within the 
                    <monospace>QFeatures</monospace> infrastructure.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine whether there are any NA values in the data</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  anyNA()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] TRUE</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine the amount and distribution of NA values in the data</styled-content>
                        </monospace>

                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  nNA()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## $nNA</monospace>

                        <monospace>## DataFrame with 1 row and 2 columns</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##         nNA        pNA</monospace>

                        <monospace>##   &lt;integer&gt;  &lt;numeric&gt;</monospace>

                        <monospace>## 1         4 0.00307262</monospace>

                        <monospace>##</monospace>

                        <monospace>## $nNArows</monospace>

                        <monospace>## DataFrame with 21697 rows and 3 columns</monospace>

                        <monospace>##              name       nNA       pNA</monospace>

                        <monospace>##       &lt;character&gt; &lt;integer&gt; &lt;numeric&gt;</monospace>

                        <monospace>## 1              13         0         0</monospace>

                        <monospace>## 2              20         0         0</monospace>

                        <monospace>## 3              25         0         0</monospace>

                        <monospace>## 4              26         0         0</monospace>

                        <monospace>## 5              29         0         0</monospace>

                        <monospace>## ...           ...       ...       ...</monospace>

                        <monospace>## 21693       48786         0         0</monospace>

                        <monospace>## 21694       48792         0         0</monospace>

                        <monospace>## 21695       48797         0         0</monospace>

                        <monospace>## 21696       48810         0         0</monospace>

                        <monospace>## 21697       48819         0         0</monospace>

                        <monospace>##</monospace>

                        <monospace>## $nNAcols</monospace>

                        <monospace>## DataFrame with 6 rows and 3 columns</monospace>

                        <monospace>##          name       nNA        pNA</monospace>

                        <monospace>##   &lt;character&gt; &lt;integer&gt;  &lt;numeric&gt;</monospace>

                        <monospace>## 1          S1         0 0.00000000</monospace>

                        <monospace>## 2          S2         2 0.00921786</monospace>

                        <monospace>## 3          S3         0 0.00000000</monospace>

                        <monospace>## 4          S4         1 0.00460893</monospace>

                        <monospace>## 5          S5         1 0.00460893</monospace>

                        <monospace>## 6          S6         0 0.00000000</monospace>
                    </preformat>
                </p>
                <p>We can see that the data contains only 0.003% missing values, corresponding to 4 NA values. This low proportion is due to a combination of the TMT labelling strategy and the stringent PSM quality control filtering. In particular, co-isolation interference when using TMT labels often results in very low quantification values for peptides which should actually be missing or &#x2018;NA&#x2019;. Nevertheless, we continue and check for sample-specific bias in the distribution of NAs by plotting the percentage of missing values per sample as a simple bar chart. We also colour the bars by condition so as to check for condition-specific bias.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot bar chart to visualize the distribution of NAs per sample</styled-content>
                        </monospace>

                        <monospace>nNA(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]])$nNAcols %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  mutate(
                            <styled-content style="color:#CC9900">Condition =</styled-content> rep(c(
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>), 
                            <styled-content style="color:#CC9900">each =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>)) %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> name, 
                            <styled-content style="color:#CC9900">y =</styled-content> pNA, 
                            <styled-content style="color:#CC9900">group =</styled-content> Condition, 
                            <styled-content style="color:#CC9900">fill =</styled-content> Condition)) +</monospace>

                        <monospace>  geom_bar(
                            <styled-content style="color:#CC9900">stat =</styled-content> 
                            <styled-content style="color:#37A82E">"identity"</styled-content>, 
                            <styled-content style="color:#CC9900">position =</styled-content> 
                            <styled-content style="color:#37A82E">"dodge"</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">0.002</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Sample"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Missing values (%)"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure11.gif"/>
                </p>
                <p>The percentage of missing values is sufficiently low that none of the samples need be removed. Further, there is no sample- or condition-specific bias in the data. We can get more information about the PSMs with NA values using the code below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out the range of missing values per PSM</styled-content>
                        </monospace>

                        <monospace>nNA(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]])$nNArows$nNA %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##     0   1</monospace>

                        <monospace>## 21693   4</monospace>
                    </preformat>
                </p>
                <p>From this output we can see that the maximum number of NA values per PSM is one. This information is useful to know as it may inform the filtering strategy in the next step. We can also subset and inspect the PSMs which contain an NA value.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Get indices of rows which contain NA</styled-content>
                        </monospace>

                        <monospace>rows_with_na_indices 
                            <styled-content style="color:#984806">&lt;-</styled-content> which(nNA(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]])$nNArows$nNA != 
                            <styled-content style="color:#000099">0</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Subset rows with NA</styled-content>
                        </monospace>

                        <monospace>rows_with_na 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]][rows_with_na_indices, ]</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Inspect rows with NA</styled-content>
                        </monospace>

                        <monospace>assay(rows_with_na)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##         S1   S2   S3   S4   S5   S6</monospace>

                        <monospace>## 12087 11.0 17.0 13.3 22.1   NA 30.6</monospace>

                        <monospace>## 30824 45.0   NA 43.1 66.7 69.7 62.1</monospace>

                        <monospace>## 30846 34.3   NA 47.9 56.8 65.5 57.2</monospace>

                        <monospace>## 44791 22.8 28.7 19.6   NA  3.8 12.2</monospace>
                    </preformat>
                </p>
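                <p>For data exported by software which encodes missing values as zero or infinite, the conversion mentioned at the start of this section could look like the sketch below. The small matrix and its row and column names are invented for illustration and are not part of the use-case data; only 
                    <monospace>zeroIsNA</monospace>, 
                    <monospace>infIsNA</monospace> and 
                    <monospace>nNA</monospace> come from 
                    <monospace>QFeatures</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>library("QFeatures")</monospace>

                        <monospace>library("SummarizedExperiment")</monospace>


                        <monospace>## Toy quantitative matrix with missing values encoded as 0 and Inf</monospace>

                        <monospace>toy &lt;- matrix(c(10.2, 0, 5.1, Inf, 3.3, 8.7),</monospace>

                        <monospace>              nrow = 3,</monospace>

                        <monospace>              dimnames = list(paste0("psm", 1:3), c("S1", "S2")))</monospace>

                        <monospace>toy_se &lt;- SummarizedExperiment(assays = list(toy))</monospace>


                        <monospace>## Convert zero and infinite values into NA</monospace>

                        <monospace>toy_se &lt;- zeroIsNA(toy_se)</monospace>

                        <monospace>toy_se &lt;- infIsNA(toy_se)</monospace>


                        <monospace>## Confirm that both values are now recorded as NA</monospace>

                        <monospace>assay(toy_se)</monospace>

                        <monospace>nNA(toy_se)$nNA</monospace>
                    </preformat>
                </p>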
            </sec>
            <sec id="sec31">
                <title>Filtering out missing values</title>
                <p>First we apply some standard filtering. Typically, it is desirable to remove features, here PSMs, with greater than 20% missing values. We can do this using the 
                    <monospace>filterNA</monospace> function in 
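                    <monospace>QFeatures</monospace>, as outlined below. We pass the function the 
                    <monospace>QFeatures</monospace> object, use the 
                    <monospace>i</monospace> argument to select the experimental set to filter, and use the 
                    <monospace>pNA</monospace> argument to specify the maximum proportion of NA values to allow. Note that this argument is given as a proportion between 0 and 1, whereas the 
                    <monospace>pNA</monospace> column in the 
                    <monospace>nNA</monospace> output shown above is expressed as a percentage.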

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check how many PSMs we will remove</styled-content>
                        </monospace>

                        <monospace>nNA(cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]])$nNArows %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  dplyr::count(pNA &gt;= 
                            <styled-content style="color:#000099">20</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 1 x 2</monospace>

                        <monospace>##   `pNA &gt;= 20`     n</monospace>

                        <monospace>##   &lt;lgl&gt;       &lt;int&gt;</monospace>

                        <monospace>## 1 FALSE       21697</monospace>
                    </preformat>
                </p>
                <p>Although the use-case data does not contain any PSMs with &gt;20% missing values, we demonstrate how to apply the desired filter below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Remove PSMs with more than 20 % (0.2) NA values</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf %&gt;%</monospace>

                        <monospace>  filterNA(
                            <styled-content style="color:#CC9900">pNA =</styled-content> 
                            <styled-content style="color:#000099">0.2</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
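                <p>Because the use-case data contains six samples, the effect of the 20% threshold on an individual PSM can be worked out by hand. The short base R calculation below is an illustrative check rather than part of the workflow, and the object names are hypothetical.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Number of samples (TMT channels) in the use-case data</monospace>

                        <monospace>n_samples &lt;- 6</monospace>


                        <monospace>## Percentage of missing values for a PSM with 0 to 6 NA values</monospace>

                        <monospace>n_missing &lt;- 0:n_samples</monospace>

                        <monospace>pct_missing &lt;- round(n_missing / n_samples * 100, digits = 2)</monospace>


                        <monospace>## Flag PSMs whose proportion of NA values does not exceed 0.2</monospace>

                        <monospace>threshold_check &lt;- data.frame("Number missing" = n_missing,</monospace>

                        <monospace>                              "Percentage missing" = pct_missing,</monospace>

                        <monospace>                              "Retained" = (n_missing / n_samples) &lt;= 0.2)</monospace>

                        <monospace>threshold_check</monospace>
                    </preformat>
                </p>
                <p>With six samples, a single NA value corresponds to 16.67% missing values and is tolerated, whereas two or more NA values (33.33% or more) would lead to removal of the PSM. This is consistent with none of the 21697 PSMs exceeding the threshold.</p>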
                <p>Since previous exploration of missing data did not reveal any sample with an excessive number of NA values, we do not need to remove any samples from the analysis.</p>
                <p>Although not covered here, users may wish to carry out condition-specific filtering in cases where the exploration of missing values revealed a condition-specific bias, or where the experimental question requires it. This would be the case, for example, if one condition was transfected to express proteins of interest whilst the control condition lacked these proteins. Filtering of both conditions together could, therefore, lead to the removal of proteins of interest.</p>
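                <p>One way in which such condition-specific filtering could be approached is sketched below in base R. This is an illustration on a toy matrix under stated assumptions, not a 
                    <monospace>QFeatures</monospace> function: a PSM is kept if at least one condition has no more than 20% missing values. The matrix, the condition labels and the object names 
                    <monospace>prop_na</monospace> and 
                    <monospace>keep_psm</monospace> are all hypothetical.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy intensity matrix: 4 PSMs across 3 treated and 3 control samples</monospace>

                        <monospace>toy &lt;- matrix(c( 5,  6, 7, NA, NA, NA,</monospace>

                        <monospace>                 4, NA, 5,  6,  7,  8,</monospace>

                        <monospace>                NA, NA, 3, NA,  2, NA,</monospace>

                        <monospace>                 9,  8, 7,  6,  5,  4),</monospace>

                        <monospace>              nrow = 4, byrow = TRUE,</monospace>

                        <monospace>              dimnames = list(paste0("psm", 1:4), paste0("S", 1:6)))</monospace>

                        <monospace>condition &lt;- rep(c("Treated", "Control"), each = 3)</monospace>


                        <monospace>## Proportion of NA values per PSM within each condition</monospace>

                        <monospace>prop_na &lt;- sapply(unique(condition), function(cond) {</monospace>

                        <monospace>  rowMeans(is.na(toy[, condition == cond, drop = FALSE]))</monospace>

                        <monospace>})</monospace>


                        <monospace>## Keep a PSM if at least one condition has no more than 20 % NA values</monospace>

                        <monospace>keep_psm &lt;- rowSums(prop_na &lt;= 0.2) &gt;= 1</monospace>

                        <monospace>toy[keep_psm, ]</monospace>
                    </preformat>
                </p>
                <p>In this toy example psm1 is retained because it is fully quantified in the treated samples, even though a joint 20% filter across all six samples would remove it (50% missing values). Only psm3, which is poorly quantified in both conditions, is removed.</p>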
            </sec>
            <sec id="sec32">
                <title>Imputation (optional)</title>
                <p>The final step is to consider whether to impute the remaining missing values within the data. Imputation refers to the replacement of missing values with probable values. Since imputation requires complex assumptions and can have substantial effects on downstream statistical analysis, we here choose to skip imputation. This is reasonable given that we only have 4 missing values at the PSM-level, and that some of these will likely be removed by aggregation. A more in-depth discussion of imputation is provided below in the LFQ workflow.</p>
            </sec>
            <sec id="sec33">
                <title>Summary of PSM data cleaning</title>
                <p>Thus far we have checked that the experimental data we are using is of high quality by visualising the raw data and calculating TMT labelling efficiency. We then carried out non-specific data cleaning, data-specific filtering steps and management of missing data. Here, we present a combined summary of these PSM processing steps.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine final number of PSMs, peptides and master proteins</styled-content>
                        </monospace>

                        <monospace>psms_final 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  nrow() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>psms_removed_total 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_psms - psms_final</monospace>

                        <monospace>psms_removed_total_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((psms_removed_total / original_psms) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>peps_final 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Sequence) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>peps_removed_total 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_peps - peps_final</monospace>

                        <monospace>peps_removed_total_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((peps_removed_total / original_peps) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>prots_final 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Master.Protein.Accessions) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>prots_removed_total 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_prots - prots_final</monospace>

                        <monospace>prots_removed_total_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((prots_removed_total / original_prots) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Print as table</styled-content>
                        </monospace>

                        <monospace>data.frame(
                            <styled-content style="color:#37A82E">"Feature"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(
                            <styled-content style="color:#37A82E">"PSMs"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Peptides"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Proteins"</styled-content>),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Number lost"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_removed_total,</monospace>

                        <monospace>                             peps_removed_total,</monospace>

                        <monospace>                             prots_removed_total),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Percentage lost"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_removed_total_prop,</monospace>

                        <monospace>                                 peps_removed_total_prop,</monospace>

                        <monospace>                                 prots_removed_total_prop),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Number remaining"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_final,</monospace>

                        <monospace>                                  peps_final,</monospace>

                        <monospace>                                  prots_final))</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##    Feature Number.lost Percentage.lost Number.remaining</monospace>

                        <monospace>## 1     PSMs       27135           55.57            21697</monospace>

                        <monospace>## 2 Peptides       11727           45.16            14242</monospace>

                        <monospace>## 3 Proteins        1751           34.74             3289</monospace>
                    </preformat>
                </p>
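                <p>The counting steps above follow the same pattern for PSMs, peptides and proteins, so they could be wrapped in a small helper function. The function below is a hypothetical convenience sketch (
                    <monospace>summarise_loss</monospace> is not part of 
                    <monospace>QFeatures</monospace>). The original counts used here are back-calculated from the table above as the sum of the numbers lost and remaining.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical helper: summarise features lost between two stages</monospace>

                        <monospace>summarise_loss &lt;- function(feature, original, final) {</monospace>

                        <monospace>  removed &lt;- original - final</monospace>

                        <monospace>  data.frame("Feature" = feature,</monospace>

                        <monospace>             "Number lost" = removed,</monospace>

                        <monospace>             "Percentage lost" = round((removed / original) * 100, digits = 2),</monospace>

                        <monospace>             "Number remaining" = final)</monospace>

                        <monospace>}</monospace>


                        <monospace>## Counts back-calculated from the summary table above</monospace>

                        <monospace>loss_summary &lt;- summarise_loss(feature = c("PSMs", "Peptides", "Proteins"),</monospace>

                        <monospace>                               original = c(48832, 25969, 5040),</monospace>

                        <monospace>                               final = c(21697, 14242, 3289))</monospace>

                        <monospace>loss_summary</monospace>
                    </preformat>
                </p>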
            </sec>
            <sec id="sec34">
                <title>Logarithmic transformation of quantitative data</title>
                <p>Once satisfied that the PSM-level data is clean and of high quality, the PSM-level quantitative data is log transformed. Log2 transformation is a standard step when dealing with quantitative proteomics data since protein abundances are strongly right-skewed, with most values lying close to zero. Such a skewed distribution is to be expected given that the majority of cellular proteins present at any one time are of relatively low abundance, whilst only a few highly abundant proteins exist. To perform the logarithmic transformation and obtain data that is closer to a normal distribution, we pass the PSM-level data in the 
                    <monospace>QFeatures</monospace> object to the 
                    <monospace>logTransform</monospace> function, as per the below code chunk.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## log2 transform quantitative data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> logTransform(
                            <styled-content style="color:#CC9900">object =</styled-content> cp_qf,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">base =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_psms"</styled-content>)</monospace>

                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>cp_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 3 assays:</monospace>

                        <monospace>##  [1] psms_raw: SummarizedExperiment with 48832 rows and 6 columns</monospace>

                        <monospace>##  [2] psms_filtered: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [3] log_psms: SummarizedExperiment with 21697 rows and 6 columns</monospace>
                    </preformat>
                </p>
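                <p>As a minimal sketch of why this transformation is useful, the following base R code simulates right-skewed abundances and compares them before and after log2 transformation. The values are simulated and the object names (<monospace>raw_abundance</monospace>, <monospace>log_abundance</monospace>) are hypothetical; they are not part of the use-case data.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Simulate right-skewed abundances (hypothetical values)</monospace>

                        <monospace>set.seed(1)</monospace>

                        <monospace>raw_abundance &lt;- rlnorm(1000, meanlog = 10, sdlog = 1.5)</monospace>

                        <monospace>## On the raw scale the mean lies well above the median (right skew)</monospace>

                        <monospace>c(mean = mean(raw_abundance), median = median(raw_abundance))</monospace>

                        <monospace>## On the log2 scale the mean and median are close to one another</monospace>

                        <monospace>log_abundance &lt;- log2(raw_abundance)</monospace>

                        <monospace>c(mean = mean(log_abundance), median = median(log_abundance))</monospace>
                    </preformat>
                </p>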
            </sec>
            <sec id="sec35">
                <title>Aggregation of PSMs to proteins</title>
                <p>For the aggregation itself we use the 
                    <monospace>aggregateFeatures</monospace> function and provide the base level from which we wish to aggregate, the log PSM-level data in this case. We also tell the function which <monospace>rowData</monospace> column defines the grouping of features, via the <monospace>fcol</monospace> argument. We will first aggregate from PSM to peptide to create explicit 
                    <monospace>QFeatures</monospace> links. This means grouping PSMs by their &#x201c;Sequence&#x201d;.</p>
                <p>As well as grouping PSMs according to their peptide sequence, the quantitative values for each PSM must be aggregated into a single peptide-level value. The default aggregation method within 
                    <monospace>aggregateFeatures</monospace> is the 
                    <monospace>robustSummary</monospace> function from the 
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/MsCoreUtils.html">
                        <monospace>MsCoreUtils</monospace> package</ext-link>.
                    <sup>
                        <xref ref-type="bibr" rid="ref19">19</xref>
                    </sup> This method is a form of robust regression and is described in detail elsewhere.
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup> Nevertheless, the user must decide which aggregation method is most appropriate for their data and biological question. Further, an understanding of the selected method is critical given that aggregation is a form of implicit imputation and has substantial effects on the downstream data. Indeed, aggregation methods have different ways of dealing with missing data, either by removal or propagation. Aggregation methods available within the <monospace>aggregateFeatures</monospace> function include 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://rdrr.io/bioc/MsCoreUtils/man/medianPolish.html">MsCoreUtils::medianPolish</ext-link>
                    </monospace>, 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://rdrr.io/bioc/MsCoreUtils/man/robustSummary.html">MsCoreUtils::robustSummary</ext-link>
                    </monospace>, 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/colSums">base::colMeans</ext-link>
                    </monospace>, 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/colSums">base::colSums</ext-link>
                    </monospace>, and 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://rdrr.io/rforge/matrixStats/man/rowMedians.html">matrixStats::colMedians</ext-link>
                    </monospace>. Users should also be aware that some methods have specific input requirements. For example, 
                    <monospace>robustSummary</monospace> assumes that intensities have already been log transformed.</p>
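                <p>The difference between propagating and removing missing values can be sketched in base R. The matrix below holds hypothetical log2 intensities for three PSMs of a single peptide (it is not taken from the use-case data), and 
                    <monospace>colMeans</monospace> simply stands in for whichever aggregation function is chosen.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical log2 intensities: three PSMs (rows) in two samples (columns)</monospace>

                        <monospace>psm_mat &lt;- matrix(c(10, 12,</monospace>

                        <monospace>                    11, NA,</monospace>

                        <monospace>                    12, 14),</monospace>

                        <monospace>                  nrow = 3, byrow = TRUE,</monospace>

                        <monospace>                  dimnames = list(paste0("psm", 1:3), c("S1", "S2")))</monospace>

                        <monospace>## Propagation: the single missing PSM makes the S2 summary NA</monospace>

                        <monospace>colMeans(psm_mat)</monospace>

                        <monospace>## Removal: the missing PSM is dropped and S2 is summarised from two PSMs</monospace>

                        <monospace>colMeans(psm_mat, na.rm = TRUE)</monospace>
                    </preformat>
                </p>
                <p>In this sketch the S2 summary is NA in the first call and 13 in the second, whereas S1 is 11 in both.</p>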
            </sec>
            <sec id="sec36">
                <title>Aggregating using robust summarisation</title>
                <p>Here, we use 
                    <monospace>robustSummary</monospace> to aggregate from PSM to peptide-level. This method is currently considered to be state-of-the-art as it is more robust against outliers than other aggregation methods.
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref25">25</xref>
                    </sup> We also include 
                    <monospace>na.rm = TRUE</monospace> to exclude any NA values prior to completing the summarisation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Aggregate PSM to peptide</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> aggregateFeatures(cp_qf,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"log_psms"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fcol =</styled-content> 
                            <styled-content style="color:#37A82E">"Sequence"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fun =</styled-content> MsCoreUtils::robustSummary,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Your quantitative and row data contain missing values. Please read the</monospace>

                        <monospace>## relevant section(s) in the aggregateFeatures manual page regarding the</monospace>

                        <monospace>## effects of missing values on data aggregation.</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>cp_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 4 assays:</monospace>

                        <monospace>##  [1] psms_raw: SummarizedExperiment with 48832 rows and 6 columns</monospace>

                        <monospace>##  [2] psms_filtered: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [3] log_psms: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 14242 rows and 6 columns</monospace>
                    </preformat>
                </p>
                <p>We are now left with a 
                    <monospace>QFeatures</monospace> object holding the PSM and peptide-level data in their own 
                    <monospace>SummarizedExperiment</monospace>s. Importantly, an explicit link has been maintained between the two levels and this makes it possible to gain information about all PSMs that were aggregated into a peptide.</p>
            </sec>
            <sec id="sec37">
                <title>Considerations for aggregating non-imputed data</title>
                <p>If users did not impute prior to aggregation, NA values within the PSM-level data may have propagated into NaN values. This is because peptides only supported by PSMs containing missing values would not have any quantitative value to which a sum or median function, for example, can be applied. Therefore, we check for NaN and convert back to NA values to facilitate compatibility with downstream processing.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Confirm the presence of NaN</styled-content>
                        </monospace>

                        <monospace>assay(cp_qf[[
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>]]) %&gt;%</monospace>

                        <monospace>  is.nan() %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>## FALSE</monospace>

                        <monospace>## 85452</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Replace NaN with NA</styled-content>
                        </monospace>

                        <monospace>assay(cp_qf[[
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>]])[is.nan(assay(cp_qf[[
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>]]))] 
                            <styled-content style="color:#984806">&lt;-</styled-content> NA</monospace>
                    </preformat>
                </p>
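                <p>The origin of such NaN values can be reproduced in base R. The vectors below are hypothetical, and the behaviour shown is that of 
                    <monospace>mean</monospace> and 
                    <monospace>sum</monospace>; other aggregation functions may treat features without any observed values differently.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## A peptide whose PSMs are all missing in a given sample (hypothetical)</monospace>

                        <monospace>all_missing &lt;- c(NA_real_, NA_real_)</monospace>

                        <monospace>## Removing the NA values leaves nothing to average, giving NaN</monospace>

                        <monospace>mean(all_missing, na.rm = TRUE)</monospace>

                        <monospace>## The sum of an empty set is 0, which could be mistaken for a real value</monospace>

                        <monospace>sum(all_missing, na.rm = TRUE)</monospace>

                        <monospace>## Convert NaN back to NA, as done for the peptide-level assay above</monospace>

                        <monospace>pep_values &lt;- c(S1 = 10.5, S2 = NaN)</monospace>

                        <monospace>pep_values[is.nan(pep_values)] &lt;- NA</monospace>

                        <monospace>pep_values</monospace>
                    </preformat>
                </p>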
                <p>Next, using the same approach as above, we use the 
                    <monospace>aggregateFeatures</monospace> function to assemble the peptides into proteins. As before, we must pass several arguments to the function. Namely, the 
                    <monospace>QFeatures</monospace> object i.e. 
                    <monospace>cp_qf</monospace>, the data level we wish to aggregate from i.e. 
                    <monospace>log_peptides</monospace>, the column of the 
                    <monospace>rowData</monospace> defining how to aggregate the features i.e. by 
                    <monospace>"Master.Protein.Accessions"</monospace> and a name for the new data level e.g. 
                    <monospace>"log_proteins"</monospace>. We again choose to use 
                    <monospace>robustSummary</monospace> as our aggregation method and we pass 
                    <monospace>na.rm = TRUE</monospace> to ignore NA values. Users can type 
                    <monospace>?aggregateFeatures</monospace> to see more information. Users should be aware that peptides are grouped by their master protein accession and, therefore, downstream differential expression analysis will consider protein groups rather than individual proteins.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Aggregate peptides to protein</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> aggregateFeatures(cp_qf,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fcol =</styled-content> 
                            <styled-content style="color:#37A82E">"Master.Protein.Accessions"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fun =</styled-content> MsCoreUtils::robustSummary,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Your quantitative and row data contain missing values. Please read the</monospace>

                        <monospace>## relevant section(s) in the aggregateFeatures manual page regarding the</monospace>

                        <monospace>## effects of missing values on data aggregation.</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>cp_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 5 assays:</monospace>

                        <monospace>##  [1] psms_raw: SummarizedExperiment with 48832 rows and 6 columns</monospace>

                        <monospace>##  [2] psms_filtered: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [3] log_psms: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 14242 rows and 6 columns</monospace>

                        <monospace>##  [5] log_proteins: SummarizedExperiment with 3289 rows and 6 columns</monospace>
                    </preformat>
                </p>
                <p>Following aggregation, we have a total of 3289 proteins remaining within the data.</p>
            </sec>
            <sec id="sec38">
                <title>Normalisation of quantitative data</title>
                <p>After transforming and aggregating the data, we normalise the protein-level abundances. Normalization is a process of correction whereby quantitative data is returned to its original, or &#x2018;normal&#x2019;, state. In expression proteomics, the aim of post-acquisition data normalization is to minimise the biases that arise due to experimental error and technological variation. Specifically, reducing unwanted technical variation and batch effects allows samples to be aligned prior to downstream analysis. Importantly, however, users must also be aware of any normalization that has taken place during sample preparation, as this will ultimately influence which proteins appear differentially abundant downstream. An extensive review of normalization strategies, both experimental and computational, is provided in Ref. 
                    <xref ref-type="bibr" rid="ref26">26</xref>.</p>
                <p>Unfortunately, no single normalization method currently performs best for all quantitative proteomics datasets. Among the Bioconductor packages, however, is 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/packages/release/bioc/html/NormalyzerDE.html">NormalyzerDE</ext-link>
                    </monospace>, a tool for evaluating different normalisation methods.
                    <sup>
                        <xref ref-type="bibr" rid="ref27">27</xref>
                    </sup> By passing a 
                    <monospace>SummarizedExperiment</monospace> object to the 
                    <monospace>normalyzer</monospace> function it is possible to generate a report comparing common normalisation strategies, such as total intensity (TI), median intensity (MedI), average intensity (AI), quantile (from the 
                    <monospace>preprocessCore</monospace> package),
                    <sup>
                        <xref ref-type="bibr" rid="ref28">28</xref>
                    </sup> NormFinder (NM),
                    <sup>
                        <xref ref-type="bibr" rid="ref29">29</xref>
                    </sup> Variance Stabilising Normalization (VSN, from the <monospace>vsn</monospace> package),
                    <sup>
                        <xref ref-type="bibr" rid="ref30">30</xref>
                    </sup> Robust Linear Regression (RLR), and LOESS (from the 
                    <monospace>limma</monospace> package).
                    <sup>
                        <xref ref-type="bibr" rid="ref31">31</xref>
                    </sup> A number of qualitative and quantitative evaluation measures are provided within the report, including total intensity, Pooled intragroup Coefficient of Variation (PCV), Pooled intragroup Median Absolute Deviation (PMAD), CV-intensity plots, MA-plots, and Pearson and Spearman correlation.</p>
                <p>
                    The <monospace>normalyzer</monospace> function accepts intensity data in a raw format, prior to log transformation. Therefore, we first generate a protein-level 
                    <monospace>SummarizedExperiment</monospace> from our PSM-level data prior to transformation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Aggregate from PSM directly to protein</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> aggregateFeatures(cp_qf,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fcol =</styled-content> 
                            <styled-content style="color:#37A82E">"Master.Protein.Accessions"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"proteins_direct"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fun =</styled-content> base::colSums,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE)</monospace>
                    </preformat>
                </p>
                <p>Hence, we will use the &#x201c;proteins_direct&#x201d; 
                    <monospace>SummarizedExperiment</monospace> here and the function will do the log2 transformation for us. A second important consideration is that missing values must be denoted &#x2018;NA&#x2019;, not zero, NaN or infinite. We can pass the 
                    <monospace>SummarizedExperiment</monospace> containing the protein data to the 
                    <monospace>normalyzer</monospace> function. With this, we provide a name for the report and the directory in which to save the report. The 
                    <monospace>normalyzer</monospace> function also expects two pieces of information, the sample name and corresponding experimental group. We previously annotated the data with this information through the sample and condition columns of the 
                    <monospace>colData</monospace>, so we tell the 
                    <monospace>normalyzer</monospace> function to look here.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Generate normalyzer report</styled-content>
                        </monospace>

                        <monospace>normalyzer(
                            <styled-content style="color:#CC9900">jobName =</styled-content> 
                            <styled-content style="color:#37A82E">"normalyzer"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">experimentObj =</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"proteins_direct"</styled-content>]],</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">sampleColName =</styled-content> 
                            <styled-content style="color:#37A82E">"sample"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">groupColName =</styled-content> 
                            <styled-content style="color:#37A82E">"condition"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">outputDir =</styled-content> 
                            <styled-content style="color:#37A82E">"."</styled-content>)</monospace>
                    </preformat>
                </p>
                <p>The function will take a few minutes to run, particularly if there are many samples. Once complete, the report can be accessed as a 
                    <monospace>.pdf</monospace> file containing plots such as those displayed in 
                    <xref ref-type="fig" rid="f4">Figure 4</xref>.</p>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>Figure 4. </label>
                    <caption>
                        <title>Example of plots generated by the 
                            <monospace>normalyzer</monospace> tool and provided in the .pdf report.</title>
                        <p>Boxplots (top) and scatterplots (bottom) are two of the evaluation measures within the 
                            <monospace>normalyzer</monospace> report. Samples are grouped based on their condition to provide users with an easy way to evaluate the suitability of different normalization methods for their data. The log2 data can be used as a reference to compare the data pre- and post-normalization.</p>
                    </caption>
                    <graphic id="gr12" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure12.gif"/>
                </fig>
                <p>Since the 
                    <monospace>normalyzer</monospace> report did not indicate any superior normalisation method in this case, we will apply a center median approach here. To do this, we pass the log transformed protein-level data to the 
                    <monospace>normalize</monospace> function in 
                    <monospace>QFeatures</monospace>. We specify the method of normalisation that we wish to apply i.e. 
                    <monospace>method = "center.median"</monospace> and name the new data level e.g. 
                    <monospace>name = "log_norm_proteins"</monospace>. Of note, for users who wish to apply VSN normalisation, the raw protein data must be passed (prior to any log transformation), as the log transformation is done internally when specifying 
                    <monospace>method = "vsn"</monospace>. All other methods require users to explicitly perform log transformation on their data before use. More details can be found in the 
                    <monospace>QFeatures</monospace> documentation; please type 
                    <monospace>help("normalize,QFeatures-method")</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Normalize the log transformed protein data</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> normalize(cp_qf,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">method =</styled-content> 
                            <styled-content style="color:#37A82E">"center.median"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>cp_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 7 assays:</monospace>

                        <monospace>##  [1] psms_raw: SummarizedExperiment with 48832 rows and 6 columns</monospace>

                        <monospace>##  [2] psms_filtered: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [3] log_psms: SummarizedExperiment with 21697 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 14242 rows and 6 columns</monospace>

                        <monospace>##  [5] log_proteins: SummarizedExperiment with 3289 rows and 6 columns</monospace>

                        <monospace>##  [6] proteins_direct: SummarizedExperiment with 3289 rows and 6 columns</monospace>

                        <monospace>##  [7] log_norm_proteins: SummarizedExperiment with 3289 rows and 6 columns</monospace>
                    </preformat>
                </p>
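                <p>Conceptually, median centring subtracts each sample's median from all values in that sample, so that every sample ends up with a median of zero. The base R sketch below illustrates the idea on a small hypothetical matrix of log2 abundances; it is an illustration only and is not the 
                    <monospace>QFeatures</monospace> implementation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical log2 abundances: three proteins (rows) in two samples (columns)</monospace>

                        <monospace>log_mat &lt;- matrix(c(10, 11, 12,</monospace>

                        <monospace>                    12, 13, 14),</monospace>

                        <monospace>                  nrow = 3,</monospace>

                        <monospace>                  dimnames = list(paste0("prot", 1:3), c("S1", "S2")))</monospace>

                        <monospace>## Per-sample medians, ignoring missing values</monospace>

                        <monospace>col_medians &lt;- apply(log_mat, 2, median, na.rm = TRUE)</monospace>

                        <monospace>## Subtract each sample's median from that sample's values</monospace>

                        <monospace>centred &lt;- sweep(log_mat, 2, col_medians, FUN = "-")</monospace>

                        <monospace>## Every sample now has a median of zero</monospace>

                        <monospace>apply(centred, 2, median, na.rm = TRUE)</monospace>
                    </preformat>
                </p>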
                <p>To evaluate the effect of normalisation we plot a simple boxplot.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Evaluate the effect of data normalization</styled-content>
                        </monospace>

                        <monospace>pre_norm 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  longFormat() %&gt;%</monospace>

                        <monospace>  mutate(
                            <styled-content style="color:#CC9900">Condition =</styled-content> ifelse(colname %in% c(
                            <styled-content style="color:#37A82E">"S1"</styled-content>, 
                            <styled-content style="color:#37A82E">"S2"</styled-content>, 
                            <styled-content style="color:#37A82E">"S3"</styled-content>),</monospace>

                        <monospace>                            
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>)) %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> colname, 
                            <styled-content style="color:#CC9900">y =</styled-content> value, 
                            <styled-content style="color:#CC9900">fill =</styled-content> Condition)) +</monospace>

                        <monospace>  geom_boxplot() +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Sample"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(abundance)"</styled-content>, 
                            <styled-content style="color:#CC9900">title =</styled-content> 
                            <styled-content style="color:#37A82E">"Pre-normalization"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>


                        <monospace>post_norm 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  longFormat() %&gt;%</monospace>

                        <monospace>  mutate(
                            <styled-content style="color:#CC9900">Condition =</styled-content> ifelse(colname %in% c(
                            <styled-content style="color:#37A82E">"S1"</styled-content>, 
                            <styled-content style="color:#37A82E">"S2"</styled-content>, 
                            <styled-content style="color:#37A82E">"S3"</styled-content>),</monospace>

                        <monospace>                            
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>)) %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> colname, 
                            <styled-content style="color:#CC9900">y =</styled-content> value, 
                            <styled-content style="color:#CC9900">fill =</styled-content> Condition)) +</monospace>

                        <monospace>  geom_boxplot() +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Sample"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(abundance)"</styled-content>, 
                            <styled-content style="color:#CC9900">title =</styled-content> 
                            <styled-content style="color:#37A82E">"Post-normalization"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>


                        <monospace>(pre_norm + theme(
                            <styled-content style="color:#CC9900">legend.position =</styled-content> 
                            <styled-content style="color:#37A82E">"none"</styled-content>)) +</monospace>

                        <monospace>  post_norm &amp; plot_layout(
                            <styled-content style="color:#CC9900">guides =</styled-content> 
                            <styled-content style="color:#37A82E">"collect"</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure13.gif"/>
                </p>
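                <p>As a numerical complement to the boxplots, the per-sample medians can be compared before and after normalisation. The following is a minimal sketch that assumes the 
                    <monospace>cp_qf</monospace> object created above; the object names 
                    <monospace>pre_medians</monospace> and 
                    <monospace>post_medians</monospace> are introduced here for illustration only. If a median-centring method was used for normalisation, the post-normalisation medians would be expected to be very similar across samples; other methods may behave differently.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Illustrative sketch: compare per-sample medians before and after normalisation</monospace>

                        <monospace>library(QFeatures)</monospace>


                        <monospace>pre_medians &lt;- apply(assay(cp_qf[["log_proteins"]]), 2, median, na.rm = TRUE)</monospace>

                        <monospace>post_medians &lt;- apply(assay(cp_qf[["log_norm_proteins"]]), 2, median, na.rm = TRUE)</monospace>


                        <monospace>## One row per dataset, one column per sample</monospace>

                        <monospace>round(rbind(pre = pre_medians, post = post_medians), 2)</monospace>
                    </preformat>
                </p>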
                <p>We can now generate a density plot to help us visualise what the process of log transformation and normalisation has done to the data. This is done using the 
                    <monospace>plotDensities</monospace> function from the 
                    <monospace>limma</monospace> package.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## visualize the process of log transformation and normalization</styled-content>
                        </monospace>

                        <monospace>par(
                            <styled-content style="color:#CC9900">mfrow =</styled-content> c(
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#000099">3</styled-content>))</monospace>


                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"psms_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  plotDensities(
                            <styled-content style="color:#CC9900">legend =</styled-content> 
                            <styled-content style="color:#37A82E">"topright"</styled-content>,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">main =</styled-content> 
                            <styled-content style="color:#37A82E">"Raw PSMs"</styled-content>)</monospace>


                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"log_psms"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  plotDensities(
                            <styled-content style="color:#CC9900">legend =</styled-content> FALSE,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">main =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(PSMs)"</styled-content>)</monospace>


                        <monospace>cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  plotDensities(
                            <styled-content style="color:#CC9900">legend =</styled-content> FALSE,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">main =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(norm proteins)"</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure14.gif"/>
                </p>
            </sec>
            <sec id="sec39">
                <title>Exploration of data using 
                    <monospace>QFeatures</monospace> links</title>
            </sec>
            <sec id="sec40">
                <title>Creating 
                    <monospace>assay</monospace> links</title>
                <p>After completing all data pre-processing, we now add an explicit link between our final protein-level data and the raw PSM-level data, which we kept as an untouched copy. This allows us to investigate all data corresponding to the final proteins, including data that have since been removed. To do this, we use the 
                    <monospace>addAssayLink</monospace> function, demonstrated below. We can check that the 
                    <monospace>assay</monospace> link has been generated correctly by passing our 
                    <monospace>QFeatures</monospace> object to the 
                    <monospace>assayLink</monospace> function along with the 
                    <monospace>assay</monospace> of interest (
                    <monospace>i =</monospace>).

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Add an assay link from psms_raw to log_norm_proteins</styled-content>
                        </monospace>

                        <monospace>cp_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> addAssayLink(
                            <styled-content style="color:#CC9900">object =</styled-content> cp_qf,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">from =</styled-content> 
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">to =</styled-content> 
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">varFrom =</styled-content> 
                            <styled-content style="color:#37A82E">"Master.Protein.Accessions"</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">varTo =</styled-content> 
                            <styled-content style="color:#37A82E">"Master.Protein.Accessions"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>assayLink(cp_qf,</monospace>

                        <monospace>          
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## AssayLink for assay &lt;log_norm_proteins&gt;</monospace>

                        <monospace>## [from:psms_raw|fcol:Master.Protein.Accessions|hits:42678]</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec41">
                <title>Visualising aggregation</title>
                <p>One of the characteristic attributes of the 
                    <monospace>QFeatures</monospace> infrastructure is that explicit links have been maintained throughout the aggregation process. This means that we can now access all data corresponding to a protein, its component peptides and PSMs. One way to do this is through the use of the 
                    <monospace>subsetByFeature</monospace> function which will return a new 
                    <monospace>QFeatures</monospace> object containing data for the desired feature across all levels. For example, if we wish to subset information about the protein &#x201c;Q01581&#x201d;, that is hydroxymethylglutaryl-CoA synthase, we could use the following code:

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Subset all data linked to the protein with accession Q01581</styled-content>
                        </monospace>

                        <monospace>Q01581 
                            <styled-content style="color:#984806">&lt;-</styled-content> subsetByFeature(cp_qf, 
                            <styled-content style="color:#37A82E">"Q01581"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>Q01581</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 7 assays:</monospace>

                        <monospace>##  [1] psms_raw: SummarizedExperiment with 42 rows and 6 columns</monospace>

                        <monospace>##  [2] psms_filtered: SummarizedExperiment with 27 rows and 6 columns</monospace>

                        <monospace>##  [3] log_psms: SummarizedExperiment with 27 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 15 rows and 6 columns</monospace>

                        <monospace>##  [5] log_proteins: SummarizedExperiment with 1 rows and 6 columns</monospace>

                        <monospace>##  [6] proteins_direct: SummarizedExperiment with 1 rows and 6 columns</monospace>

                        <monospace>##  [7] log_norm_proteins: SummarizedExperiment with 1 rows and 6 columns</monospace>
                    </preformat>
                </p>
                <p>We find that in these data the protein Q01581 has 15 peptides and 27 PSMs supporting its identification and quantitation. We also see that the original data prior to processing contained 42 PSMs in support of this protein.</p>
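                <p>The same counts can be extracted programmatically rather than read from the printed object. The sketch below assumes the 
                    <monospace>Q01581</monospace> subset created above and uses the 
                    <monospace>nrows</monospace> accessor for 
                    <monospace>QFeatures</monospace> objects, which returns the number of features in each 
                    <monospace>assay</monospace> as a named vector.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Illustrative sketch: number of features per assay for the Q01581 subset</monospace>

                        <monospace>library(QFeatures)</monospace>


                        <monospace>nrows(Q01581)[c("psms_raw", "psms_filtered", "log_peptides")]</monospace>
                    </preformat>
                </p>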
                <p>Further, we can visualise the process of aggregation that has led to the protein-level abundance data for Q01581, as demonstrated below. Of note, this plot shows the protein data prior to normalisation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Define conditions</styled-content>
                        </monospace>

                        <monospace>treatment 
                            <styled-content style="color:#984806">&lt;-</styled-content> c(
                            <styled-content style="color:#37A82E">"S1"</styled-content>, 
                            <styled-content style="color:#37A82E">"S2"</styled-content>, 
                            <styled-content style="color:#37A82E">"S3"</styled-content>)</monospace>

                        <monospace>control 
                            <styled-content style="color:#984806">&lt;-</styled-content> c(
                            <styled-content style="color:#37A82E">"S4"</styled-content>, 
                            <styled-content style="color:#37A82E">"S5"</styled-content>, 
                            <styled-content style="color:#37A82E">"S6"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Plot abundance distributions across samples at PSM, peptide and protein-level</styled-content>
                        </monospace>

                        <monospace>Q01581[, , c(
                            <styled-content style="color:#37A82E">"log_psms"</styled-content>, 
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>, 
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>)] %&gt;%</monospace>

                        <monospace>  longFormat() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  mutate(
                            <styled-content style="color:#CC9900">assay_order =</styled-content> factor(</monospace>

                        <monospace>    assay,</monospace>

                        <monospace>    
                            <styled-content style="color:#CC9900">levels =</styled-content> c(
                            <styled-content style="color:#37A82E">"log_psms"</styled-content>, 
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>, 
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>),</monospace>

                        <monospace>    
                            <styled-content style="color:#CC9900">labels =</styled-content> c(
                            <styled-content style="color:#37A82E">"PSMs"</styled-content>, 
                            <styled-content style="color:#37A82E">"Peptides"</styled-content>, 
                            <styled-content style="color:#37A82E">"Protein"</styled-content>)),</monospace>

                        <monospace>    
                            <styled-content style="color:#CC9900">condition =</styled-content> ifelse(colname %in% control, 
                            <styled-content style="color:#37A82E">"control"</styled-content>, 
                            <styled-content style="color:#37A82E">"treatment"</styled-content>)) %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> colname, 
                            <styled-content style="color:#CC9900">y =</styled-content> value, 
                            <styled-content style="color:#CC9900">colour =</styled-content> assay)) +</monospace>

                        <monospace>  geom_point(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>) +</monospace>

                        <monospace>  geom_line(aes(
                            <styled-content style="color:#CC9900">group =</styled-content> rowname)) +</monospace>

                        <monospace>  scale_x_discrete(
                            <styled-content style="color:#CC9900">limits =</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)) +</monospace>

                        <monospace>  facet_wrap(~assay_order) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Sample"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Abundance"</styled-content>) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"log2 Q01581 abundance profiles"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure15.gif"/>
                </p>
            </sec>
            <sec id="sec42">
                <title>Determining PSM and peptide support</title>
                <p>Another benefit of the explicit links maintained within a 
                    <monospace>QFeatures</monospace> object is the ease with which we can determine PSM and peptide support per protein. When applying the 
                    <monospace>aggregateFeatures</monospace> function, a column termed 
                    <monospace>".n"</monospace> is created within the rowData of the new 
                    <monospace>SummarizedExperiment</monospace>. This column indicates how many lower-level features were aggregated into each new higher-level feature. Hence, 
                    <monospace>".n"</monospace> in the peptide-level data represents how many PSMs were aggregated into a peptide, whilst in the protein-level data it tells us how many peptides were grouped into a master protein. For ease of plotting, we will use the 
                    <monospace>"proteins_direct"</monospace> data generated above. Since these data were generated via direct aggregation of PSMs to proteins, 
                    <monospace>".n"</monospace> here tells us the PSM support per protein. We plot these data as simple histograms.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot PSM support per protein - .n in the proteins_direct SE</styled-content>
                        </monospace>

                        <monospace>psm_per_protein 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"proteins_direct"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> .n)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#CC9900">boundary =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"PSM support (shown up to 20)"</styled-content>,</monospace>

                        <monospace>       
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">expand =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">0</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">20.5</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>)) +</monospace>

                        <monospace>  scale_y_continuous(
                            <styled-content style="color:#CC9900">expand =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">0</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">1000</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">1000</styled-content>, 
                            <styled-content style="color:#000099">100</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"PSM support per protein"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Plot peptide support per protein - .n in the log_proteins SE</styled-content>
                        </monospace>

                        <monospace>peptide_per_protein 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> .n)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#CC9900">boundary =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Peptide support (shown up to 20)"</styled-content>,</monospace>

                        <monospace>       
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">expand =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">0</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">20.5</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>)) +</monospace>

                        <monospace>  scale_y_continuous(
                            <styled-content style="color:#CC9900">expand =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">0</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">1100</styled-content>),</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">1100</styled-content>, 
                            <styled-content style="color:#000099">100</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"Peptide support per protein"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>


                        <monospace>psm_per_protein + peptide_per_protein</monospace>
                    </preformat>
                </p>
                <p>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure16.gif"/>
                </p>
                <p>At this point, users may wish to include additional quality control filtering based on PSM and/or peptide support per protein. Given the extensive quality control filtering already applied in this workflow, we decide not to remove additional proteins based on PSM or peptide support.</p>
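                <p>Before deciding on such a filter, it can be useful to count how many proteins would be affected. The sketch below assumes the 
                    <monospace>cp_qf</monospace> object from this workflow; the threshold of two peptides and the name 
                    <monospace>peptide_support</monospace> are illustrative assumptions, not recommendations. The code only tabulates the proteins and does not remove any data.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Illustrative sketch: how many proteins have at least two supporting peptides?</monospace>

                        <monospace>library(QFeatures)</monospace>


                        <monospace>## Peptide support per protein, taken from .n in the log_proteins rowData</monospace>

                        <monospace>peptide_support &lt;- rowData(cp_qf[["log_proteins"]])$.n</monospace>


                        <monospace>## FALSE = fewer than two peptides, TRUE = two or more (threshold is an assumption)</monospace>

                        <monospace>table(at_least_two_peptides = peptide_support &gt;= 2)</monospace>
                    </preformat>
                </p>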
            </sec>
            <sec id="sec43">
                <title>Data export</title>
                <p>Finally, we save the protein-level data and export the 
                    <monospace>QFeatures</monospace> object to an 
                    <monospace>.rda</monospace> file so that we can re-load it later at our convenience.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Save protein-level SE</styled-content>
                        </monospace>

                        <monospace>cp_proteins 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]]</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Export the final TMT QFeatures object</styled-content>
                        </monospace>

                        <monospace>save(cp_qf, 
                            <styled-content style="color:#CC9900">file =</styled-content> 
                            <styled-content style="color:#37A82E">"cp_qf.rda"</styled-content>)</monospace>
                    </preformat>
                </p>
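                <p>For completeness, a minimal sketch of re-loading the exported object in a later R session is given here. It assumes that 
                    <monospace>cp_qf.rda</monospace> is located in the current working directory. Because the file was written with 
                    <monospace>save</monospace>, the object is restored under its original name, 
                    <monospace>cp_qf</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Illustrative sketch: re-load the saved QFeatures object in a new session</monospace>

                        <monospace>library(QFeatures)</monospace>


                        <monospace>load("cp_qf.rda")</monospace>


                        <monospace>## Verify</monospace>

                        <monospace>cp_qf</monospace>
                    </preformat>
                </p>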
            </sec>
        </sec>
        <sec id="sec44">
            <title>Label-free data processing workflow</title>
            <p>Having discussed the processing of quantitative TMT-labelled data, we now move on to consider that of label-free quantitative (LFQ) data. As described previously, the cell culture supernatant fractions of triplicate control and treated HEK293 cells were kept label-free. As such, each sample was analysed using an independent mass spectrometry run without pre-fractionation. Again, a two-hour gradient in an Orbitrap Lumos Tribrid mass spectrometer coupled to an UltiMate 3000 HPLC system was applied. Given that much of the TMT pre-processing workflow also applies to label-free data, we only discuss steps which are different to those previously described. Readers are advised to refer to the TMT processing workflow for a more in-depth explanation of any shared steps.</p>
            <sec id="sec45">
                <title>Identification search using Proteome Discoverer</title>
                <p>As was the case for the TMT-labelled cell pellets, raw LFQ data from the supernatant samples was searched using Proteome Discoverer 2.5. The identification search, along with an additional explanation of key parameters in an appendix, is available in the GitHub repository associated with this manuscript at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics">https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics</ext-link>. To begin processing LFQ data, users should export a peptide-level 
                    <monospace>.txt</monospace> file from the results of their identification search.</p>
            </sec>
            <sec id="sec46">
                <title>Data import, housekeeping and exploration</title>
                <p>Unlike the TMT-labelled use-case data, which was processed from the PSM level, the label-free use-case data can only be considered from the peptide level upwards. This is because a retention time alignment algorithm (equivalent to match between runs) was applied to the PSM-level data. As a result, peptides can be identified in samples even without a corresponding PSM, simply by sharing feature information across runs.</p>
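                <p>To illustrate what this means in practice, the following minimal sketch flags peptides that carry an abundance value in a run despite having no PSM in that run. The data frame 
                    <monospace>toy_run</monospace> and its column names are invented for illustration and do not correspond to columns in the Proteome Discoverer export.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy single-run table (hypothetical names and values)</monospace>

                        <monospace>toy_run &lt;- data.frame(</monospace>

                        <monospace>  sequence = c("PEPTIDEK", "ANOTHERK"),</monospace>

                        <monospace>  psm_count = c(2, 0),</monospace>

                        <monospace>  abundance = c(1.5e6, 3.2e5),</monospace>

                        <monospace>  stringsAsFactors = FALSE)</monospace>


                        <monospace>## Quantified through feature sharing only: no PSM, but an abundance value</monospace>

                        <monospace>matched_only &lt;- toy_run$psm_count == 0 &amp; !is.na(toy_run$abundance)</monospace>

                        <monospace>toy_run$sequence[matched_only]</monospace>
                    </preformat>
                </p>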
            </sec>
            <sec id="sec47">
                <title>Importing data into a 
                    <monospace>QFeatures</monospace> object</title>
                <p>We locate the PeptideGroups 
                    <monospace>.txt</monospace> file and import it into a 
                    <monospace>QFeatures</monospace> data container in the same way as before. Since the samples are already stored in the correct order, we simply identify the quantitative columns by their indices.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Locate the PeptideGroups .txt file</styled-content>
                        </monospace>

                        <monospace>sn_peptide 
                            <styled-content style="color:#984806">&lt;-</styled-content> 
                            <styled-content style="color:#37A82E">"supernatant_lfq_results_peptides.txt"</styled-content>
                        </monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Identify columns containing quantitative data</styled-content>
                        </monospace>

                        <monospace>sn_peptide %&gt;%</monospace>

                        <monospace>  read.delim() %&gt;%</monospace>

                        <monospace>  names()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##  [1] "Peptide.Groups.Peptide.Group.ID"</monospace>

                        <monospace>##  [2] "Checked"</monospace>

                        <monospace>##  [3] "Tags"</monospace>

                        <monospace>##  [4] "Confidence"</monospace>

                        <monospace>##  [5] "PSM.Ambiguity"</monospace>

                        <monospace>##  [6] "Sequence"</monospace>

                        <monospace>##  [7] "Modifications"</monospace>

                        <monospace>##  [8] "Modifications.all.possible.sites"</monospace>

                        <monospace>##  [9] "Qvality.PEP"</monospace>

                        <monospace>## [10] "Qvality.q.value"</monospace>

                        <monospace>## [11] "SVM_Score"</monospace>

                        <monospace>## [12] "Number.of.Protein.Groups"</monospace>

                        <monospace>## [13] "Number.of.Proteins"</monospace>

                        <monospace>## [14] "Number.of.PSMs"</monospace>

                        <monospace>## [15] "Master.Protein.Accessions"</monospace>

                        <monospace>## [16] "Master.Protein.Descriptions"</monospace>

                        <monospace>## [17] "Protein.Accessions"</monospace>

                        <monospace>## [18] "Number.of.Missed.Cleavages"</monospace>

                        <monospace>## [19] "Theo.MHplus.in.Da"</monospace>

                        <monospace>## [20] "Sequence.Length"</monospace>

                        <monospace>## [21] "Abundance.F1.Sample"</monospace>

                        <monospace>## [22] "Abundance.F2.Sample"</monospace>

                        <monospace>## [23] "Abundance.F3.Sample"</monospace>

                        <monospace>## [24] "Abundance.F4.Sample"</monospace>

                        <monospace>## [25] "Abundance.F5.Sample"</monospace>

                        <monospace>## [26] "Abundance.F6.Sample"</monospace>

                        <monospace>## [27] "Abundances.Count.F1.Sample"</monospace>

                        <monospace>## [28] "Abundances.Count.F2.Sample"</monospace>

                        <monospace>## [29] "Abundances.Count.F3.Sample"</monospace>

                        <monospace>## [30] "Abundances.Count.F4.Sample"</monospace>

                        <monospace>## [31] "Abundances.Count.F5.Sample"</monospace>

                        <monospace>## [32] "Abundances.Count.F6.Sample"</monospace>

                        <monospace>## [33] "Quan.Info"</monospace>

                        <monospace>## [34] "Found.in.File.in.F1"</monospace>

                        <monospace>## [35] "Found.in.File.in.F2"</monospace>

                        <monospace>## [36] "Found.in.File.in.F3"</monospace>

                        <monospace>## [37] "Found.in.File.in.F4"</monospace>

                        <monospace>## [38] "Found.in.File.in.F5"</monospace>

                        <monospace>## [39] "Found.in.File.in.F6"</monospace>

                        <monospace>## [40] "Found.in.Sample.in.S1.F1.Sample"</monospace>

                        <monospace>## [41] "Found.in.Sample.in.S2.F2.Sample"</monospace>

                        <monospace>## [42] "Found.in.Sample.in.S3.F3.Sample"</monospace>

                        <monospace>## [43] "Found.in.Sample.in.S4.F4.Sample"</monospace>

                        <monospace>## [44] "Found.in.Sample.in.S5.F5.Sample"</monospace>

                        <monospace>## [45] "Found.in.Sample.in.S6.F6.Sample"</monospace>

                        <monospace>## [46] "Found.in.Sample.Group.in.S1.F1.Sample"</monospace>

                        <monospace>## [47] "Found.in.Sample.Group.in.S2.F2.Sample"</monospace>

                        <monospace>## [48] "Found.in.Sample.Group.in.S3.F3.Sample"</monospace>

                        <monospace>## [49] "Found.in.Sample.Group.in.S4.F4.Sample"</monospace>

                        <monospace>## [50] "Found.in.Sample.Group.in.S5.F5.Sample"</monospace>

                        <monospace>## [51] "Found.in.Sample.Group.in.S6.F6.Sample"</monospace>

                        <monospace>## [52] "Confidence.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [53] "Charge.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [54] "Delta.Score.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [55] "Delta.Cn.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [56] "Rank.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [57] "Search.Engine.Rank.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [58] "Concatenated.Rank.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [59] "mz.in.Da.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [60] "Delta.M.in.ppm.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [61] "Delta.mz.in.Da.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [62] "RT.in.min.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [63] "Percolator.q.Value.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [64] "Percolator.PEP.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [65] "Percolator.SVMScore.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [66] "XCorr.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [67] "Top.Apex.RT.in.min"</monospace>
                    </preformat>
                </p>
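                <p>Hard-coded column indices can break if the export layout changes. As a minimal sketch, assuming the quantitative columns follow the 
                    <monospace>Abundance.F&lt;n&gt;.Sample</monospace> naming pattern seen in the output above, the indices could instead be derived from the column names. The vector 
                    <monospace>example_names</monospace> is a hypothetical stand-in for the real file header.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical subset of column names standing in for the real header</monospace>

                        <monospace>example_names &lt;- c("Sequence", "Sequence.Length",</monospace>

                        <monospace>                   "Abundance.F1.Sample", "Abundance.F2.Sample",</monospace>

                        <monospace>                   "Abundances.Count.F1.Sample", "Quan.Info")</monospace>


                        <monospace>## Match 'Abundance.F&lt;n&gt;.Sample' only, excluding 'Abundances.Count' columns</monospace>

                        <monospace>quant_idx &lt;- grep("^Abundance\\.F[0-9]+\\.Sample$", example_names)</monospace>

                        <monospace>quant_idx</monospace>
                    </preformat>
                    Anchoring the pattern at both ends keeps the similarly named 
                    <monospace>Abundances.Count</monospace> columns out of the selection.
                </p>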
                <p>In the code chunk below, we again use the 
                    <monospace>readQFeatures</monospace> function to import our data into 
                    <monospace>R</monospace> and create a 
                    <monospace>QFeatures</monospace> object. The abundance data is located in columns 21 to 26, so we pass these indices to the <monospace>ecol</monospace> argument. After import, we annotate the 
                    <monospace>colData</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Create QFeatures object</styled-content>
                        </monospace>

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> readQFeatures(
                            <styled-content style="color:#CC9900">table =</styled-content> sn_peptide,</monospace>

                        <monospace>                       
                            <styled-content style="color:#CC9900">ecol =</styled-content> 
                            <styled-content style="color:#000099">21</styled-content>:
                            <styled-content style="color:#000099">26</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#CC9900">sep =</styled-content> 
                            <styled-content style="color:#37A82E">"</styled-content>\t
                            <styled-content style="color:#37A82E">"</styled-content>,</monospace>

                        <monospace>                       
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Clean sample names</styled-content>
                        </monospace>

                        <monospace>colnames(sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]]) 
                            <styled-content style="color:#984806">&lt;-</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Annotate samples</styled-content>
                        </monospace>

                        <monospace>sn_qf$sample 
                            <styled-content style="color:#984806">&lt;-</styled-content> paste0(
                            <styled-content style="color:#37A82E">"S"</styled-content>, 
                            <styled-content style="color:#000099">1</styled-content>:
                            <styled-content style="color:#000099">6</styled-content>)</monospace>


                        <monospace>sn_qf$condition 
                            <styled-content style="color:#984806">&lt;-</styled-content> rep(c(
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>), 
                            <styled-content style="color:#CC9900">each =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify and allocate colData to initial SummarizedExperiment</styled-content>
                        </monospace>

                        <monospace>colData(sn_qf)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## DataFrame with 6 rows and 2 columns</monospace>

                        <monospace>##         sample   condition</monospace>

                        <monospace>##    &lt;character&gt; &lt;character&gt;</monospace>

                        <monospace>## S1          S1     Treated</monospace>

                        <monospace>## S2          S2     Treated</monospace>

                        <monospace>## S3          S3     Treated</monospace>

                        <monospace>## S4          S4     Control</monospace>

                        <monospace>## S5          S5     Control</monospace>

                        <monospace>## S6          S6     Control</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>colData(sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]]) 
                            <styled-content style="color:#984806">&lt;-</styled-content> colData(sn_qf)</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec48">
                <title>Preliminary data exploration</title>
                <p>Next, we check the names of the features within the peptide-level 
                    <monospace>rowData</monospace>. These features differ from those found at the PSM level. Users should be aware that, when starting from peptide-level data, they have less post-search control over the quality of the PSMs included in peptide quantitation and over the aggregation method used to derive it. Proteome Discoverer uses the sum of PSM quantitative values to calculate peptide-level values. Other third-party software may use different methods.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out what information was imported</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  colnames()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##  [1] "Peptide.Groups.Peptide.Group.ID"</monospace>

                        <monospace>##  [2] "Checked"</monospace>

                        <monospace>##  [3] "Tags"</monospace>

                        <monospace>##  [4] "Confidence"</monospace>

                        <monospace>##  [5] "PSM.Ambiguity"</monospace>

                        <monospace>##  [6] "Sequence"</monospace>

                        <monospace>##  [7] "Modifications"</monospace>

                        <monospace>##  [8] "Modifications.all.possible.sites"</monospace>

                        <monospace>##  [9] "Qvality.PEP"</monospace>

                        <monospace>## [10] "Qvality.q.value"</monospace>

                        <monospace>## [11] "SVM_Score"</monospace>

                        <monospace>## [12] "Number.of.Protein.Groups"</monospace>

                        <monospace>## [13] "Number.of.Proteins"</monospace>

                        <monospace>## [14] "Number.of.PSMs"</monospace>

                        <monospace>## [15] "Master.Protein.Accessions"</monospace>

                        <monospace>## [16] "Master.Protein.Descriptions"</monospace>

                        <monospace>## [17] "Protein.Accessions"</monospace>

                        <monospace>## [18] "Number.of.Missed.Cleavages"</monospace>

                        <monospace>## [19] "Theo.MHplus.in.Da"</monospace>

                        <monospace>## [20] "Sequence.Length"</monospace>

                        <monospace>## [21] "Abundances.Count.F1.Sample"</monospace>

                        <monospace>## [22] "Abundances.Count.F2.Sample"</monospace>

                        <monospace>## [23] "Abundances.Count.F3.Sample"</monospace>

                        <monospace>## [24] "Abundances.Count.F4.Sample"</monospace>

                        <monospace>## [25] "Abundances.Count.F5.Sample"</monospace>

                        <monospace>## [26] "Abundances.Count.F6.Sample"</monospace>

                        <monospace>## [27] "Quan.Info"</monospace>

                        <monospace>## [28] "Found.in.File.in.F1"</monospace>

                        <monospace>## [29] "Found.in.File.in.F2"</monospace>

                        <monospace>## [30] "Found.in.File.in.F3"</monospace>

                        <monospace>## [31] "Found.in.File.in.F4"</monospace>

                        <monospace>## [32] "Found.in.File.in.F5"</monospace>

                        <monospace>## [33] "Found.in.File.in.F6"</monospace>

                        <monospace>## [34] "Found.in.Sample.in.S1.F1.Sample"</monospace>

                        <monospace>## [35] "Found.in.Sample.in.S2.F2.Sample"</monospace>

                        <monospace>## [36] "Found.in.Sample.in.S3.F3.Sample"</monospace>

                        <monospace>## [37] "Found.in.Sample.in.S4.F4.Sample"</monospace>

                        <monospace>## [38] "Found.in.Sample.in.S5.F5.Sample"</monospace>

                        <monospace>## [39] "Found.in.Sample.in.S6.F6.Sample"</monospace>

                        <monospace>## [40] "Found.in.Sample.Group.in.S1.F1.Sample"</monospace>

                        <monospace>## [41] "Found.in.Sample.Group.in.S2.F2.Sample"</monospace>

                        <monospace>## [42] "Found.in.Sample.Group.in.S3.F3.Sample"</monospace>

                        <monospace>## [43] "Found.in.Sample.Group.in.S4.F4.Sample"</monospace>

                        <monospace>## [44] "Found.in.Sample.Group.in.S5.F5.Sample"</monospace>

                        <monospace>## [45] "Found.in.Sample.Group.in.S6.F6.Sample"</monospace>

                        <monospace>## [46] "Confidence.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [47] "Charge.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [48] "Delta.Score.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [49] "Delta.Cn.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [50] "Rank.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [51] "Search.Engine.Rank.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [52] "Concatenated.Rank.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [53] "mz.in.Da.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [54] "Delta.M.in.ppm.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [55] "Delta.mz.in.Da.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [56] "RT.in.min.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [57] "Percolator.q.Value.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [58] "Percolator.PEP.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [59] "Percolator.SVMScore.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [60] "XCorr.by.Search.Engine.Sequest.HT"</monospace>

                        <monospace>## [61] "Top.Apex.RT.in.min"</monospace>
                    </preformat>
                </p>
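                <p>To make the idea of sum-based aggregation concrete, the sketch below sums toy PSM-level intensities within each peptide sequence using the base 
                    <monospace>R</monospace> function 
                    <monospace>rowsum</monospace>. The values and object names are invented for illustration. This is not a reproduction of the Proteome Discoverer implementation, which may also apply PSM filtering before aggregation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy PSM-level intensities (rows = PSMs, columns = samples); values invented</monospace>

                        <monospace>psm_quant &lt;- matrix(c(10, 20, 5,</monospace>

                        <monospace>                      12, 18, 7),</monospace>

                        <monospace>                    ncol = 2,</monospace>

                        <monospace>                    dimnames = list(NULL, c("S1", "S2")))</monospace>


                        <monospace>## Peptide sequence to which each PSM belongs</monospace>

                        <monospace>psm_sequence &lt;- c("PEPTIDEK", "PEPTIDEK", "ANOTHERK")</monospace>


                        <monospace>## Sum PSM intensities within each peptide sequence, per sample</monospace>

                        <monospace>peptide_quant &lt;- rowsum(psm_quant, group = psm_sequence)</monospace>

                        <monospace>peptide_quant</monospace>
                    </preformat>
                </p>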
                <p>We also determine the number of PSMs, peptides and proteins represented within the initial data. Since identical peptide sequences with different modifications are stored as separate entities, the output of 
                    <monospace>dim</monospace> will not tell us the number of peptides. Instead, we need to consider only unique peptide sequence entries, as demonstrated in the code chunk below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine the number of PSMs</styled-content>
                        </monospace>

                        <monospace>original_psms 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Number.of.PSMs) %&gt;%</monospace>

                        <monospace>  sum()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine the number of peptides</styled-content>
                        </monospace>

                        <monospace>original_peps 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Sequence) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine the number of proteins</styled-content>
                        </monospace>

                        <monospace>original_prots 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Master.Protein.Accessions) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## View</styled-content>
                        </monospace>

                        <monospace>original_psms</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 144302</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>original_peps</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 20312</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>original_prots</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 3941</monospace>
                    </preformat>
                </p>
                <p>Thus, the search identified 144302 PSMs corresponding to 20312 peptides and 3941 proteins. Finally, we take a look at some of the key parameters applied during the identification search. This is an important verification step, particularly for those using publicly available data with limited access to parameter settings.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check missed cleavages</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Number.of.Missed.Cleavages) %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##     0    1  2</monospace>

                        <monospace>## 22055 1248 72</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check precursor mass tolerance</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Delta.M.in.ppm.by.Search.Engine.Sequest.HT) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##     Min. 1st Qu. Median   Mean 3rd Qu.   Max.</monospace>

                        <monospace>## -9.9600  -0.2500 0.1500 0.6576  0.6900 9.9900</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check fragment mass tolerance</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Delta.mz.in.Da.by.Search.Engine.Sequest.HT) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##       Min.     1st Qu.   Median      Mean   3rd Qu.      Max.</monospace>

                        <monospace>## -0.0113400 -0.0001400 0.0000900 0.0006618 0.0004800 0.0142300</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check peptide confidence allocations</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Confidence) %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##  High</monospace>

                        <monospace>## 23375</monospace>
                    </preformat>
                </p>
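                <p>Such checks can also be written as assertions, so that an unexpected export fails loudly rather than passing unnoticed. The following is a minimal sketch, assuming a precursor mass tolerance of 10 ppm and a maximum of two missed cleavages. These thresholds are inferred from the summaries above rather than taken from the search settings, and the data frame 
                    <monospace>example_rd</monospace> contains invented values standing in for the real 
                    <monospace>rowData</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Invented values standing in for the real peptide-level rowData</monospace>

                        <monospace>example_rd &lt;- data.frame(</monospace>

                        <monospace>  Number.of.Missed.Cleavages = c(0, 1, 2, 0),</monospace>

                        <monospace>  Delta.M.in.ppm.by.Search.Engine.Sequest.HT = c(-9.96, 0.15, 0.69, 9.99),</monospace>

                        <monospace>  Confidence = c("High", "High", "High", "High"),</monospace>

                        <monospace>  stringsAsFactors = FALSE)</monospace>


                        <monospace>## Assumed thresholds (not taken from the search settings)</monospace>

                        <monospace>max_missed_cleavages &lt;- 2</monospace>

                        <monospace>precursor_tol_ppm &lt;- 10</monospace>


                        <monospace>## Stop with an error if any peptide falls outside the expected settings</monospace>

                        <monospace>stopifnot(</monospace>

                        <monospace>  all(example_rd$Number.of.Missed.Cleavages &lt;= max_missed_cleavages),</monospace>

                        <monospace>  all(abs(example_rd$Delta.M.in.ppm.by.Search.Engine.Sequest.HT) &lt;= precursor_tol_ppm),</monospace>

                        <monospace>  all(example_rd$Confidence == "High"))</monospace>
                    </preformat>
                </p>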
                <p>The preliminary data is as expected, so we proceed to evaluate the quality of the raw data.</p>
            </sec>
            <sec id="sec49">
                <title>Experimental quality control checks</title>
            </sec>
            <sec id="sec50">
                <title>Quality control of the raw mass spectrometry data</title>
                <p>To briefly assess the quality of the raw mass spectrometry data from which the search results were derived, we create simple plots. In contrast to the previous PSM processing workflow, we do not have access to information about ion injection times from the peptide-level file. However, we can still look at the peptide delta mass across retention time, as well as the frequency of peptides across the retention time gradient.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot scatter plot of mass accuracy</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> RT.in.min.by.Search.Engine.Sequest.HT,</monospace>

                        <monospace>             
                            <styled-content style="color:#CC9900">y =</styled-content> Delta.M.in.ppm.by.Search.Engine.Sequest.HT)) +</monospace>

                        <monospace>  geom_point(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#CC9900">shape =</styled-content> 
                            <styled-content style="color:#000099">4</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> -
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"RT (min)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Delta precursor mass (ppm)"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>)) +</monospace>

                        <monospace>  scale_y_continuous(
                            <styled-content style="color:#CC9900">limits =</styled-content> c(-
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>), 
                            <styled-content style="color:#CC9900">breaks =</styled-content> c(-
                            <styled-content style="color:#000099">10</styled-content>, -
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"Peptide retention time against delta precursor mass"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure17.gif"/>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of peptide retention time</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> RT.in.min.by.Search.Engine.Sequest.HT)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"RT (min)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  scale_x_continuous(
                            <styled-content style="color:#CC9900">breaks =</styled-content> seq(
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#000099">120</styled-content>, 
                            <styled-content style="color:#000099">20</styled-content>)) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"Peptide frequency across retention time"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure18.gif"/>
                </p>
                <p>For a more in-depth discussion of these plots users should refer back to the TMT processing workflow. Since neither plot indicates any major problems with the MS runs, we continue on to basic data cleaning.</p>
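                <p>The visual check can be complemented by a simple numerical summary. The sketch below is illustrative only: 
                    <monospace>prop_within_tolerance</monospace> is a hypothetical helper (it is not part of the workflow or of any package) and the delta mass values are invented toy data. It reports the proportion of peptides whose delta mass falls within the &#xb1;5 ppm guide lines drawn on the scatter plot.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical helper: proportion of values within +/- tol (ppm)</monospace>

                        <monospace>prop_within_tolerance &lt;- function(delta_ppm, tol = 5) {</monospace>

                        <monospace>  mean(abs(delta_ppm) &lt;= tol, na.rm = TRUE)</monospace>

                        <monospace>}</monospace>


                        <monospace>## Toy delta mass values (ppm), for illustration only</monospace>

                        <monospace>toy_delta_ppm &lt;- c(-1.2, 0.4, 6.3, -0.8, 2.1, -7.5, 0.0, 3.9)</monospace>


                        <monospace>## Six of the eight toy values lie within the tolerance</monospace>

                        <monospace>prop_within_tolerance(toy_delta_ppm)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 0.75</monospace>
                    </preformat>
                </p>
                <p>In the workflow itself, the same helper could be applied to the 
                    <monospace>Delta.M.in.ppm.by.Search.Engine.Sequest.HT</monospace> column of the peptide 
                    <monospace>rowData</monospace>.</p>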
            </sec>
            <sec id="sec51">
                <title>Basic data cleaning</title>
                <p>As discussed in detail above, there are several basic data cleaning steps which are non-specific and should be applied to all quantitative datasets, regardless of the quantitation method or data level (PSM, peptide or protein). These steps are as follows:
                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Removal of features without a master protein accession</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Removal of features corresponding to protein groups which contain a contaminant</p>
                        </list-item>
                        <list-item>
                            <label>3.</label>
                            <p>Removal of features without quantitative data</p>
                        </list-item>
                        <list-item>
                            <label>4.</label>
                            <p>(Optional) Removal of features which are not unique to a protein group</p>
                        </list-item>
                        <list-item>
                            <label>5.</label>
                            <p>Removal of features not allocated rank 1 during the identification search</p>
                        </list-item>
                        <list-item>
                            <label>6.</label>
                            <p>Removal of features not annotated as unambiguous</p>
                        </list-item>
                    </list>
                </p>
                <p>In addition to these standard steps, LFQ data should be filtered to remove peptides that were not quantified based on a monoisotopic peak. The monoisotopic peak is the peak arising from peptide ions in which every atom is the most abundant natural isotope of its element. For bottom-up proteomics, this typically corresponds to peptides containing only carbon-12 and nitrogen-14. When the different isotopes are well resolved, the monoisotopic peak usually provides the most accurate measurement.</p>
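                <p>To make the distinction concrete, the following sketch computes the monoisotopic and average masses of glycine (C<sub>2</sub>H<sub>5</sub>NO<sub>2</sub>), chosen purely as a simple illustration. The atomic masses are standard reference values rounded to a few decimal places, and the object names are assumptions that do not appear elsewhere in the workflow.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Elemental composition of glycine (C2H5NO2)</monospace>

                        <monospace>composition &lt;- c(C = 2, H = 5, N = 1, O = 2)</monospace>


                        <monospace>## Masses of the most abundant isotopes (12C, 1H, 14N, 16O), in Da</monospace>

                        <monospace>monoisotopic &lt;- c(C = 12.000000, H = 1.007825, N = 14.003074, O = 15.994915)</monospace>


                        <monospace>## Abundance-weighted average atomic masses, in Da</monospace>

                        <monospace>average &lt;- c(C = 12.011, H = 1.008, N = 14.007, O = 15.999)</monospace>


                        <monospace>## Monoisotopic mass: every atom is the most abundant isotope</monospace>

                        <monospace>mono_mass &lt;- sum(composition * monoisotopic[names(composition)])</monospace>


                        <monospace>## Average mass: reflects the natural mixture of isotopes</monospace>

                        <monospace>avg_mass &lt;- sum(composition * average[names(composition)])</monospace>


                        <monospace>c(monoisotopic = mono_mass, average = avg_mass)</monospace>
                    </preformat>
                </p>
                <p>The two values are approximately 75.032 Da and 75.067 Da, respectively. The gap between them widens with molecular size, which is one reason the choice of isotopic peak matters for peptides.</p>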
                <p>Before we remove any data, we first create a second copy of the original 
                    <monospace>SummarizedExperiment</monospace>, so as to retain a copy of the raw data for reference. As before, we use the 
                    <monospace>addAssay</monospace> function.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Add second copy of data to be filtered</styled-content>
                        </monospace>

                        <monospace>data_copy 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_raw"</styled-content>]]</monospace>


                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> addAssay(
                            <styled-content style="color:#CC9900">x =</styled-content> sn_qf,</monospace>

                        <monospace>                  
                            <styled-content style="color:#CC9900">y =</styled-content> data_copy,</monospace>

                        <monospace>                  
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>sn_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 2 assays:</monospace>

                        <monospace>##  [1] peptides_raw: SummarizedExperiment with 23375 rows and 6 columns</monospace>

                        <monospace>##  [2] peptides_filtered: SummarizedExperiment with 23375 rows and 6 columns</monospace>
                    </preformat>
                </p>
                <p>Here, cleaning is done in two steps. The first is the removal of contaminant proteins using the self-defined 
                    <monospace>find_cont</monospace> function. Refer back to the TMT processing workflow for more details.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Store row indices of peptides matched to a contaminant-containing protein group</styled-content></monospace>

                        <monospace>cont_peptides 
                            <styled-content style="color:#984806">&lt;-</styled-content> find_cont(sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]], cont_acc)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Remove these rows from the data</styled-content>
                        </monospace>

                        <monospace>
                            <styled-content style="color:#000099">if</styled-content> (length(cont_peptides) &gt; 
                            <styled-content style="color:#000099">0</styled-content>)</monospace>

                        <monospace>  sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]][-cont_peptides, ]</monospace>
                    </preformat>
                </p>
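                <p>For readers who do not have the earlier definition to hand, the logic of this step can be sketched on toy data. The function below is a simplified stand-in written for illustration; it is not the 
                    <monospace>find_cont</monospace> function defined in the TMT workflow. The name 
                    <monospace>find_cont_sketch</monospace>, the assumption that accessions are separated by semicolons, and the example accessions are all assumptions made for this example.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical stand-in: indices of entries containing a contaminant accession</monospace>

                        <monospace>find_cont_sketch &lt;- function(accessions, cont_acc) {</monospace>

                        <monospace>  ## Split each semicolon-separated string into individual accessions</monospace>

                        <monospace>  is_cont &lt;- vapply(strsplit(accessions, ";\\s*"),</monospace>

                        <monospace>                    function(acc) any(acc %in% cont_acc),</monospace>

                        <monospace>                    logical(1))</monospace>

                        <monospace>  which(is_cont)</monospace>

                        <monospace>}</monospace>


                        <monospace>## Toy accessions and contaminant list, for illustration only</monospace>

                        <monospace>toy_acc &lt;- c("P12345", "P00761; Q9XYZ1", "", "P04264")</monospace>

                        <monospace>toy_cont &lt;- c("P00761", "P04264")</monospace>


                        <monospace>find_cont_sketch(toy_acc, toy_cont)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 2 4</monospace>
                    </preformat>
                </p>
                <p>Note that a group is flagged if any one of its members is a contaminant, which mirrors the second cleaning step listed above.</p>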
                <p>Second, we carry out all remaining cleaning using the 
                    <monospace>filterFeatures</monospace> function as before.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf %&gt;%</monospace>

                        <monospace>  filterFeatures(~ !Master.Protein.Accessions == 
                            <styled-content style="color:#37A82E">""</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>) %&gt;%</monospace>

                        <monospace>  filterFeatures(~ !Quan.Info == 
                            <styled-content style="color:#37A82E">"NoQuanValues"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content>
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>) %&gt;%</monospace>

                        <monospace>  filterFeatures(~ !Quan.Info == 
                            <styled-content style="color:#37A82E">"NoneMonoisotopic"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>) %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Number.of.Protein.Groups == 
                            <styled-content style="color:#000099">1</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>) %&gt;%</monospace>

                        <monospace>  filterFeatures(~ Rank.by.Search.Engine.Sequest.HT == 
                            <styled-content style="color:#000099">1</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>) %&gt;%</monospace>

                        <monospace>  filterFeatures(~ PSM.Ambiguity == 
                            <styled-content style="color:#37A82E">"Unambiguous"</styled-content>,</monospace>

                        <monospace>                 
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
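                <p>Because the filters are chained, a peptide is retained only if it passes every condition. The toy example below illustrates this with base R on an invented data frame that reuses three of the column names above. It is a sketch of the filtering logic under these assumptions, not a replacement for 
                    <monospace>filterFeatures</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy rowData with invented values, for illustration only</monospace>

                        <monospace>toy &lt;- data.frame(</monospace>

                        <monospace>  Master.Protein.Accessions = c("P1", "", "P3", "P4"),</monospace>

                        <monospace>  Quan.Info = c("", "", "NoQuanValues", ""),</monospace>

                        <monospace>  Number.of.Protein.Groups = c(1, 1, 1, 2),</monospace>

                        <monospace>  stringsAsFactors = FALSE</monospace>

                        <monospace>)</monospace>


                        <monospace>## Each condition is TRUE for rows to keep; all must hold</monospace>

                        <monospace>keep &lt;- toy$Master.Protein.Accessions != "" &amp;</monospace>

                        <monospace>  toy$Quan.Info != "NoQuanValues" &amp;</monospace>

                        <monospace>  toy$Number.of.Protein.Groups == 1</monospace>


                        <monospace>## Only the first toy row passes all three conditions</monospace>

                        <monospace>toy[keep, ]</monospace>
                    </preformat>
                </p>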
                <p>As before, we check to see whether additional annotations remain within the &#x201c;Quan.Info&#x201d; column.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check for remaining annotations</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Quan.Info) %&gt;%</monospace>

                        <monospace>  table()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## .</monospace>

                        <monospace>##</monospace>

                        <monospace>## 17999</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec52">
                <title>Assessing the impact of non-specific data cleaning</title>
                <p>As in the previous example, we assess the impact that cleaning has had on the data. Specifically, we determine the number and proportion of PSMs, peptides and proteins lost. Again, when we refer to the number of peptides we only consider unique peptide sequences, not those that differ in their modifications.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine number of PSMs, peptides and proteins remaining</styled-content>
                        </monospace>

                        <monospace>psms_remaining 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Number.of.PSMs) %&gt;%</monospace>

                        <monospace>  sum()</monospace>


                        <monospace>peps_remaining 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Sequence) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>prots_remaining 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  rowData() %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  pull(Master.Protein.Accessions) %&gt;%</monospace>

                        <monospace>  unique() %&gt;%</monospace>

                        <monospace>  length() %&gt;%</monospace>

                        <monospace>  as.numeric()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Determine the number and proportion of PSMs, peptides and proteins removed</styled-content>
                        </monospace>

                        <monospace>psms_removed 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_psms - psms_remaining</monospace>

                        <monospace>psms_removed_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((psms_removed / original_psms) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>peps_removed 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_peps - peps_remaining</monospace>

                        <monospace>peps_removed_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((peps_removed / original_peps) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>prots_removed 
                            <styled-content style="color:#984806">&lt;-</styled-content> original_prots - prots_remaining</monospace>

                        <monospace>prots_removed_prop 
                            <styled-content style="color:#984806">&lt;-</styled-content> ((prots_removed / original_prots) * 
                            <styled-content style="color:#000099">100</styled-content>) %&gt;%</monospace>

                        <monospace>  round(
                            <styled-content style="color:#CC9900">digits =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Present in a table</styled-content>
                        </monospace>

                        <monospace>data.frame(
                            <styled-content style="color:#37A82E">"Feature"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(
                            <styled-content style="color:#37A82E">"PSMs"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Peptides"</styled-content>,</monospace>

                        <monospace>                         
                            <styled-content style="color:#37A82E">"Proteins"</styled-content>),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Number lost"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_removed,</monospace>

                        <monospace>                             peps_removed,</monospace>

                        <monospace>                             prots_removed),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Percentage lost"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_removed_prop,</monospace>

                        <monospace>                                 peps_removed_prop,</monospace>

                        <monospace>                                 prots_removed_prop),</monospace>

                        <monospace>           
                            <styled-content style="color:#37A82E">"Number remaining"</styled-content> 
                            <styled-content style="color:#984806">=</styled-content> c(psms_remaining,</monospace>

                        <monospace>                                  peps_remaining,</monospace>

                        <monospace>                                  prots_remaining))</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##    Feature Number.lost Percentage.lost Number.remaining</monospace>

                        <monospace>## 1     PSMs       28140           19.50           116162</monospace>

                        <monospace>## 2 Peptides        3767           18.55            16545</monospace>

                        <monospace>## 3 Proteins         690           17.51             3251</monospace>
                    </preformat>
                </p>
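                <p>The three near-identical calculations above could be wrapped in a small helper. The following is a sketch only: 
                    <monospace>summarise_loss</monospace> is a hypothetical function name, and the original counts are inferred from the table above as the sum of the number lost and the number remaining.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical helper: number and percentage of features lost</monospace>

                        <monospace>summarise_loss &lt;- function(original, remaining) {</monospace>

                        <monospace>  lost &lt;- original - remaining</monospace>

                        <monospace>  data.frame(</monospace>

                        <monospace>    Number.lost = lost,</monospace>

                        <monospace>    Percentage.lost = round(lost / original * 100, digits = 2),</monospace>

                        <monospace>    Number.remaining = remaining</monospace>

                        <monospace>  )</monospace>

                        <monospace>}</monospace>


                        <monospace>## Counts inferred from the table above (original = lost + remaining)</monospace>

                        <monospace>original &lt;- c(PSMs = 144302, Peptides = 20312, Proteins = 3941)</monospace>

                        <monospace>remaining &lt;- c(PSMs = 116162, Peptides = 16545, Proteins = 3251)</monospace>


                        <monospace>summarise_loss(original, remaining)</monospace>
                    </preformat>
                </p>
                <p>Under these assumptions the helper returns the same percentages as the table above (19.50, 18.55 and 17.51), with each calculation written only once.</p>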
            </sec>
            <sec id="sec53">
                <title>Peptide quality control filtering</title>
                <p>When extracting data from the peptide-level 
                    <monospace>.txt</monospace> file rather than aggregating up from a PSM file, additional parameters exist within the peptide 
                    <monospace>rowData</monospace>. Such parameters include Quality PEP, Quality q-value, and SVM score, as well as similar scoring parameters provided by the search engine. Although we will not complete additional filtering based on these parameters in this workflow, users may wish to explore this option.</p>
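                <p>As a pointer for users who do wish to explore this, the sketch below applies a posterior error probability (PEP) cut-off to an invented data frame. Both the column name 
                    <monospace>PEP</monospace> and the 0.01 threshold are assumptions made for illustration; the actual column names depend on the exported file and should be checked in the 
                    <monospace>rowData</monospace> before filtering.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy peptide annotations with an assumed PEP column</monospace>

                        <monospace>toy_peptides &lt;- data.frame(</monospace>

                        <monospace>  Sequence = c("AAK", "CDEK", "FGHR", "ILMK"),</monospace>

                        <monospace>  PEP = c(0.001, 0.2, 0.008, 0.05),</monospace>

                        <monospace>  stringsAsFactors = FALSE</monospace>

                        <monospace>)</monospace>


                        <monospace>## Assumed threshold, chosen for illustration only</monospace>

                        <monospace>pep_threshold &lt;- 0.01</monospace>


                        <monospace>## Keep peptides with a PEP at or below the threshold</monospace>

                        <monospace>toy_filtered &lt;- toy_peptides[toy_peptides$PEP &lt;= pep_threshold, ]</monospace>

                        <monospace>toy_filtered$Sequence</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] "AAK"  "FGHR"</monospace>
                    </preformat>
                </p>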
            </sec>
            <sec id="sec54">
                <title>Managing missing data</title>
                <p>Having cleaned the peptide-level data, we now move on to the management of missing data. This is of particular importance for LFQ workflows, where the missing value challenge is amplified by intrinsic variability between independent MS runs. As before, the management of missing data can be divided into three steps: (1) exploring the presence and distribution of missing values, (2) filtering out missing values, and (3) optional imputation.</p>
            </sec>
            <sec id="sec55">
                <title>Exploring the presence of missing values</title>
                <p>The aim of the first step is to determine how many missing values are present within the data, and how they are distributed between samples and/or conditions.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Are there any NA values within the peptide data?</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  anyNA()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] TRUE</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## How many NA values are there within the peptide data?</styled-content>
                        </monospace>

                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  nNA()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## $nNA</monospace>

                        <monospace>## DataFrame with 1 row and 2 columns</monospace>

                        <monospace>##         nNA       pNA</monospace>

                        <monospace>##   &lt;integer&gt; &lt;numeric&gt;</monospace>

                        <monospace>## 1     15863   14.6888</monospace>

                        <monospace>##</monospace>

                        <monospace>## $nNArows</monospace>

                        <monospace>## DataFrame with 17999 rows and 3 columns</monospace>

                        <monospace>##              name       nNA       pNA</monospace>

                        <monospace>##       &lt;character&gt; &lt;integer&gt; &lt;numeric&gt;</monospace>

                        <monospace>## 1               1         4   66.6667</monospace>

                        <monospace>## 2               2         1   16.6667</monospace>

                        <monospace>## 3               3         0    0.0000</monospace>

                        <monospace>## 4               4         1   16.6667</monospace>

                        <monospace>## 5               5         0    0.0000</monospace>

                        <monospace>## ...           ...       ...       ...</monospace>

                        <monospace>## 17995       23371         0         0</monospace>

                        <monospace>## 17996       23372         0         0</monospace>

                        <monospace>## 17997       23373         0         0</monospace>

                        <monospace>## 17998       23374         0         0</monospace>

                        <monospace>## 17999       23375         0         0</monospace>

                        <monospace>##</monospace>

                        <monospace>## $nNAcols</monospace>

                        <monospace>## DataFrame with 6 rows and 3 columns</monospace>

                        <monospace>##          name       nNA       pNA</monospace>

                        <monospace>##   &lt;character&gt; &lt;integer&gt; &lt;numeric&gt;</monospace>

                        <monospace>## 1          S1      3699   20.5511</monospace>

                        <monospace>## 2          S2      1945   10.8062</monospace>

                        <monospace>## 3          S3      2048   11.3784</monospace>

                        <monospace>## 4          S4      3674   20.4122</monospace>

                        <monospace>## 5          S5      2673   14.8508</monospace>

                        <monospace>## 6          S6      1824   10.1339</monospace>
                    </preformat>
                </p>
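                <p>The summary statistics returned by 
                    <monospace>nNA</monospace> can be reproduced with base R, which helps to clarify how the percentages are defined. The matrix below is invented toy data and its object name is an assumption. The final line re-derives the overall percentage reported above from the 15863 missing values across 17999 peptides and 6 samples.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy intensity matrix (3 peptides x 2 samples), for illustration only</monospace>

                        <monospace>toy_assay &lt;- matrix(c(10, NA, 12,</monospace>

                        <monospace>                      NA, NA, 15),</monospace>

                        <monospace>                    nrow = 3,</monospace>

                        <monospace>                    dimnames = list(NULL, c("S1", "S2")))</monospace>


                        <monospace>## Missing values per sample (column) and per peptide (row)</monospace>

                        <monospace>colSums(is.na(toy_assay))</monospace>

                        <monospace>rowSums(is.na(toy_assay))</monospace>


                        <monospace>## Overall percentage of missing values in the toy matrix</monospace>

                        <monospace>mean(is.na(toy_assay)) * 100</monospace>


                        <monospace>## Overall percentage in the workflow data: NAs / (rows x columns)</monospace>

                        <monospace>round(15863 / (17999 * 6) * 100, digits = 4)</monospace>
                    </preformat>
                </p>
                <p>The last expression evaluates to 14.6888, matching the 
                    <monospace>pNA</monospace> value in the output above.</p>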
                <p>As expected, the LFQ data contains a higher proportion of missing values than the TMT-labelled data. There are 15863 missing (NA) values within the data, which corresponds to approximately 15% of all values. We check for sample- and condition-specific biases in the distribution of these NA values.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot bar chart to visualize sample-specific distribution of NAs</styled-content>
                        </monospace>

                        <monospace>nNA(sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]])$nNAcols %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  mutate(
                            <styled-content style="color:#CC9900">Condition =</styled-content> rep(c(
                            <styled-content style="color:#37A82E">"Treated"</styled-content>, 
                            <styled-content style="color:#37A82E">"Control"</styled-content>), 
                            <styled-content style="color:#CC9900">each =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>)) %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> name, 
                            <styled-content style="color:#CC9900">y =</styled-content> pNA, 
                            <styled-content style="color:#CC9900">group =</styled-content> Condition, 
                            <styled-content style="color:#CC9900">fill =</styled-content> Condition)) +</monospace>

                        <monospace>  geom_bar(
                            <styled-content style="color:#CC9900">stat =</styled-content> 
                            <styled-content style="color:#37A82E">"identity"</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">14.7</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>, 
                            <styled-content style="color:#CC9900">color =</styled-content> 
                            <styled-content style="color:#37A82E">"red"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Sample"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Missing values (%)"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure19.gif"/>
                </p>
                <p>Whilst S1 and S4 have a slightly higher proportion of missing values, all of the samples are within an acceptable range to continue. Again, there is no evidence of a condition-specific bias in the data.</p>
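                <p>As a quick sanity check, the overall percentage of missing values can be recomputed from the per-sample counts printed in the nNAcols table. The following is a minimal base R sketch; the object names are introduced here for illustration only, and the feature count of 17999 is an assumption inferred from the row index of the nNArows output.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Per-sample NA counts copied from the nNAcols output</styled-content>
                        </monospace>

                        <monospace>na_per_sample &lt;- c(S1 = 3699, S2 = 1945, S3 = 2048,</monospace>

                        <monospace>                   S4 = 3674, S5 = 2673, S6 = 1824)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Number of peptides in the assay (assumed from the nNArows output)</styled-content>
                        </monospace>

                        <monospace>n_features &lt;- 17999</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Total number of NA values and overall percentage</styled-content>
                        </monospace>

                        <monospace>total_na &lt;- sum(na_per_sample)</monospace>

                        <monospace>overall_pct &lt;- 100 * total_na / (n_features * length(na_per_sample))</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Per-sample percentages, corresponding to the pNA column</styled-content>
                        </monospace>

                        <monospace>sample_pct &lt;- 100 * na_per_sample / n_features</monospace>


                        <monospace>round(overall_pct, 1)</monospace>
                    </preformat>
                </p>
                <p>The total evaluates to 15863 NA values and the overall percentage rounds to 14.7, the value drawn as the dashed reference line in the plot.</p>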
            </sec>
            <sec id="sec56">
                <title>Filtering out missing values</title>
                <p>We next filter out features, here peptides, that contain 20% or more missing values.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Check how many peptides we will remove</styled-content>
                        </monospace>

                        <monospace>which(nNA(sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]])$nNArows$pNA &gt;= 
                            <styled-content style="color:#000099">20</styled-content>) %&gt;%</monospace>

                        <monospace>  length()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 4364</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#D47D3C">## Remove peptides with 2 or more NA values</styled-content>
                        </monospace>

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf %&gt;%</monospace>

                        <monospace>  filterNA(
                            <styled-content style="color:#CC9900">pNA =</styled-content> 
                            <styled-content style="color:#000099">0.2</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>)</monospace>
                    </preformat>
                </p>
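                <p>With six samples, the 20% threshold maps onto a simple count of missing values per peptide. The short base R sketch below works through this arithmetic; the object names are illustrative and not part of the workflow.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Possible numbers of NA values per peptide across six samples</styled-content>
                        </monospace>

                        <monospace>n_samples &lt;- 6</monospace>

                        <monospace>n_missing &lt;- 0:n_samples</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Corresponding percentage of missing values per peptide</styled-content>
                        </monospace>

                        <monospace>pct_missing &lt;- 100 * n_missing / n_samples</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Peptides at or above the 20% threshold are removed</styled-content>
                        </monospace>

                        <monospace>removed &lt;- pct_missing &gt;= 20</monospace>


                        <monospace>data.frame(n_missing, pct_missing = round(pct_missing, 1), removed)</monospace>
                    </preformat>
                </p>
                <p>A peptide with a single missing value (16.7%) is retained, whereas a peptide with two or more missing values (33.3% or higher) is removed. This is why the 20% threshold is equivalent to removing peptides with 2 or more NA values in this dataset.</p>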
            </sec>
            <sec id="sec57">
                <title>Imputation (optional)</title>
                <p>Finally, we check how many missing values remain in the data before making a decision as to whether imputation is required.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>nNA(sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]])$nNA</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## DataFrame with 1 row and 2 columns</monospace>

                        <monospace>##         nNA       pNA</monospace>

                        <monospace>##   &lt;integer&gt; &lt;numeric&gt;</monospace>

                        <monospace>## 1      2452   2.99719</monospace>
                    </preformat>
                </p>
                <p>There are 2452 missing values remaining. The presence of proteins with single or low peptide support means that some of these NA values will likely be propagated upward during aggregation. Whilst NA values were traditionally problematic during the application of downstream statistical methods, there are now a number of algorithms that allow statistics to be completed on data containing missing values. For example, the 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/packages/release/bioc/html/msqrob2.html">MSqRob2</ext-link>
                    </monospace>
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref25">25</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref32">32</xref>
                    </sup> package facilitates statistical differential expression analysis on datasets without the need for imputation and functions within the 
                    <monospace>QFeatures</monospace> infrastructure. Nevertheless, for the purpose of demonstration, we here choose to impute the raw intensity data.</p>
                <p>As alluded to above, the most appropriate method to determine such probable values depends upon why the value is missing, that is, whether it is MCAR or MNAR. Although the optimal imputation method is specific to each dataset, left-censored methods (e.g. minimal value approaches, limit of detection) have proven favorable for data with a high proportion of MNAR values, whilst hot deck methods (e.g. k-nearest neighbours, random forest, maximum likelihood methods) are more appropriate when the majority of missing data is MCAR [e.g. Refs. 
                    <xref ref-type="bibr" rid="ref23">23</xref>, 
                    <xref ref-type="bibr" rid="ref33">33</xref>]. Within the 
                    <monospace>QFeatures</monospace> infrastructure, imputation is carried out by passing the data to the 
                    <monospace>impute</monospace> function; please see 
                    <monospace>?impute</monospace> for more information. To see which imputation methods are supported by this function we use the following code:

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Find out available imputation methods</styled-content>
                        </monospace>

                        <monospace>MsCoreUtils::imputeMethods()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] "bpca" "knn"  "QRILC" "MLE"   "MLE2" "MinDet" "MinProb"</monospace>

                        <monospace>## [8] "min"  "zero" "mixed" "nbavg" "with" "RF"     "none"</monospace>
                    </preformat>
                </p>
                <p>Unfortunately, it is very challenging to determine the reason(s) behind missing data, and in most cases experiments contain a mixture of MCAR and MNAR values. For LFQ data where little is known about the cause of missing values, it is advisable to use methods optimised for MCAR. Here we will use the baseline k-nearest neighbours (k-NN) imputation on the raw peptide intensities. Of note, users who wish to utilise an alternative imputation method should check whether their selected method has a requirement for normality. If the method requires the data to display a normal distribution, users must log2 transform the data prior to imputation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Impute missing values using kNN</styled-content>
                        </monospace>

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> impute(sn_qf,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">method =</styled-content> 
                            <styled-content style="color:#37A82E">"knn"</styled-content>,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_imputed"</styled-content>)</monospace>
                    </preformat>
                </p>
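                <p>To give an intuition for what k-NN imputation does, the following toy base R sketch replaces each missing value with the average of the corresponding values in the k most similar complete rows. It is a simplified illustration under stated assumptions (a small made-up matrix, Euclidean distance on the observed columns, and complete rows as the only candidate neighbours) and is not the implementation called by the impute function. The function name knn_impute_sketch and the toy matrix are hypothetical.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Toy k-NN imputation: illustrative only, not the algorithm used by impute()</styled-content>
                        </monospace>

                        <monospace>knn_impute_sketch &lt;- function(x, k = 2) {</monospace>

                        <monospace>  complete &lt;- which(rowSums(is.na(x)) == 0)</monospace>

                        <monospace>  out &lt;- x</monospace>

                        <monospace>  for (i in which(rowSums(is.na(x)) &gt; 0)) {</monospace>

                        <monospace>    obs &lt;- !is.na(x[i, ])</monospace>

                        <monospace>
                            <styled-content style="color:#984806">    ## Euclidean distance to each complete row, using observed columns only</styled-content>
                        </monospace>

                        <monospace>    diffs &lt;- sweep(x[complete, obs, drop = FALSE], 2, x[i, obs])</monospace>

                        <monospace>    dists &lt;- sqrt(rowSums(diffs^2))</monospace>

                        <monospace>
                            <styled-content style="color:#984806">    ## Average the k nearest complete rows in the missing columns</styled-content>
                        </monospace>

                        <monospace>    nn &lt;- complete[order(dists)][seq_len(min(k, length(complete)))]</monospace>

                        <monospace>    out[i, !obs] &lt;- colMeans(x[nn, !obs, drop = FALSE])</monospace>

                        <monospace>  }</monospace>

                        <monospace>  out</monospace>

                        <monospace>}</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Made-up intensities: 4 peptides (rows) by 3 samples (columns)</styled-content>
                        </monospace>

                        <monospace>toy &lt;- rbind(c(10, 11, 12),</monospace>

                        <monospace>             c(10, 11, NA),</monospace>

                        <monospace>             c(20, 21, 22),</monospace>

                        <monospace>             c(11, 12, 13))</monospace>


                        <monospace>knn_impute_sketch(toy, k = 2)</monospace>
                    </preformat>
                </p>
                <p>In this toy example the missing value in the second row is filled with 12.5, the mean of the third-column values from the two most similar rows (rows one and four). All observed values are left unchanged.</p>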
                <p>Following imputation we check to ensure that the distribution of the data has not dramatically changed. To do so we create a density plot of the data pre- and post-imputation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## visualise the impact of imputation</styled-content>
                        </monospace>

                        <monospace>par(
                            <styled-content style="color:#CC9900">mfrow =</styled-content> c(
                            <styled-content style="color:#000099">1</styled-content>, 
                            <styled-content style="color:#000099">2</styled-content>))</monospace>


                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  log2() %&gt;%</monospace>

                        <monospace>  plotDensities(
                            <styled-content style="color:#CC9900">main =</styled-content> 
                            <styled-content style="color:#37A82E">"Pre-imputation"</styled-content>,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">legend =</styled-content> FALSE)</monospace>


                        <monospace>sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_imputed"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  log2() %&gt;%</monospace>

                        <monospace>  plotDensities(
                            <styled-content style="color:#CC9900">main =</styled-content> 
                            <styled-content style="color:#37A82E">"Post-imputation"</styled-content>,</monospace>

                        <monospace>                
                            <styled-content style="color:#CC9900">legend =</styled-content> 
                            <styled-content style="color:#37A82E">"topright"</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure20.gif"/>
                </p>
                <p>From this plot the change in the data appears to be minimal. We can further validate this by comparing the summary statistics of the data pre- and post-imputation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Determine the impact of imputation on summary statistics</styled-content>
                        </monospace>

                        <monospace>pre_imputation_summary 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_filtered"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  longFormat() %&gt;%</monospace>

                        <monospace>  group_by(colname) %&gt;%</monospace>

                        <monospace>  summarise(
                            <styled-content style="color:#CC9900">sum_intensity =</styled-content> sum(value, 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE),</monospace>

                        <monospace>            
                            <styled-content style="color:#CC9900">max_intensity =</styled-content> max(value, 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE),</monospace>

                        <monospace>            
                            <styled-content style="color:#CC9900">median_intensity =</styled-content> median(value, 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE))</monospace>


                        <monospace>post_imputation_summary 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"peptides_imputed"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  longFormat() %&gt;%</monospace>

                        <monospace>  group_by(colname) %&gt;%</monospace>

                        <monospace>  summarise(
                            <styled-content style="color:#CC9900">sum_intensity =</styled-content> sum(value, 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE),</monospace>

                        <monospace>            
                            <styled-content style="color:#CC9900">max_intensity =</styled-content> max(value, 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE),</monospace>

                        <monospace>            
                            <styled-content style="color:#CC9900">median_intensity =</styled-content> median(value, 
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE))</monospace>


                        <monospace>print(pre_imputation_summary)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 6 x 4</monospace>

                        <monospace>##   colname sum_intensity max_intensity median_intensity</monospace>

                        <monospace>##   &lt;chr&gt;           &lt;dbl&gt;         &lt;dbl&gt;            &lt;dbl&gt;</monospace>

                        <monospace>## 1 S1       98919496611.    1477162278.        1948794.</monospace>

                        <monospace>## 2 S2      155722262777.    1678988256.        3553168.</monospace>

                        <monospace>## 3 S3      145509803642.    1804842981.        3251578.</monospace>

                        <monospace>## 4 S4       94892286529.    1087946291.        1873948.</monospace>

                        <monospace>## 5 S5      121590387110.    1307181986         2503109</monospace>

                        <monospace>## 6 S6      143538084562.    1608003894.        3282077.</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>print(post_imputation_summary)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 6 x 4</monospace>

                        <monospace>##   colname sum_intensity max_intensity median_intensity</monospace>

                        <monospace>##   &lt;chr&gt;           &lt;dbl&gt;         &lt;dbl&gt;            &lt;dbl&gt;</monospace>

                        <monospace>## 1 S1       99811359317.    1477162278.        1812200.</monospace>

                        <monospace>## 2 S2      156124619343.    1678988256.        3478920.</monospace>

                        <monospace>## 3 S3      145754574641.    1804842981.        3201591</monospace>

                        <monospace>## 4 S4       96201734159.    1087946291.        1720642.</monospace>

                        <monospace>## 5 S5      122113590867.    1307181986         2440628.</monospace>

                        <monospace>## 6 S6      143994721088.    1608003894.        3199278.</monospace>
                    </preformat>
                </p>
                <p>Comparison of the two tables reveals minimal change within the data. However, we find that S1 and S4 display greater differences between pre- and post-imputation statistics because of the higher number of missing values which required imputation.</p>
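                <p>The size of these differences can be made explicit by expressing the post-imputation statistics relative to the pre-imputation values. The sketch below uses base R and the summed and median intensities copied from the two tables above; the object names are illustrative and the values are as printed in the output.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Values copied from the pre- and post-imputation summary tables</styled-content>
                        </monospace>

                        <monospace>samples &lt;- c("S1", "S2", "S3", "S4", "S5", "S6")</monospace>

                        <monospace>pre_sum &lt;- c(98919496611, 155722262777, 145509803642,</monospace>

                        <monospace>             94892286529, 121590387110, 143538084562)</monospace>

                        <monospace>post_sum &lt;- c(99811359317, 156124619343, 145754574641,</monospace>

                        <monospace>              96201734159, 122113590867, 143994721088)</monospace>

                        <monospace>pre_median &lt;- c(1948794, 3553168, 3251578, 1873948, 2503109, 3282077)</monospace>

                        <monospace>post_median &lt;- c(1812200, 3478920, 3201591, 1720642, 2440628, 3199278)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Percentage change introduced by imputation</styled-content>
                        </monospace>

                        <monospace>change &lt;- data.frame(</monospace>

                        <monospace>  sample = samples,</monospace>

                        <monospace>  sum_change_pct = round(100 * (post_sum - pre_sum) / pre_sum, 2),</monospace>

                        <monospace>  median_change_pct = round(100 * (post_median - pre_median) / pre_median, 2)</monospace>

                        <monospace>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Order samples from largest to smallest change in summed intensity</styled-content>
                        </monospace>

                        <monospace>change[order(-change$sum_change_pct), ]</monospace>
                    </preformat>
                </p>
                <p>S4 and S1 show the largest relative increase in summed intensity and the largest decrease in median intensity, in line with their higher proportion of missing values.</p>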
            </sec>
            <sec id="sec58">
                <title>Logarithmic transformation of quantitative data</title>
                <p>In the following code chunk we log2 transform the peptide-level data to generate a near-normal distribution within the quantitative data. This is necessary prior to the use of 
                    <monospace>robustSummary</monospace> aggregation.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## log2 transform the quantitative data</styled-content>
                        </monospace>

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> logTransform(
                            <styled-content style="color:#CC9900">object =</styled-content> sn_qf,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">base =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"peptides_imputed"</styled-content>,</monospace>

                        <monospace>                      
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>sn_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 4 assays:</monospace>

                        <monospace>##  [1] peptides_raw: SummarizedExperiment with 23375 rows and 6 columns</monospace>

                        <monospace>##  [2] peptides_filtered: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [3] peptides_imputed: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 13635 rows and 6 columns</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec59">
                <title>Aggregation of peptide to protein</title>
                <p>Now that we are happy with the peptide-level data, we aggregate upward to proteins using the 
                    <monospace>aggregateFeatures</monospace> function.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Aggregate peptide to protein</styled-content>
                        </monospace>

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> aggregateFeatures(sn_qf,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"log_peptides"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fcol =</styled-content> 
                            <styled-content style="color:#37A82E">"Master.Protein.Accessions"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">fun =</styled-content> MsCoreUtils::robustSummary,</monospace>

                        <monospace>                           
                            <styled-content style="color:#CC9900">na.rm =</styled-content> TRUE)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Your row data contain missing values. Please read the relevant</monospace>

                        <monospace>## section(s) in the aggregateFeatures manual page regarding the effects</monospace>

                        <monospace>## of missing values on data aggregation.</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>sn_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 5 assays:</monospace>

                        <monospace>##  [1] peptides_raw: SummarizedExperiment with 23375 rows and 6 columns</monospace>

                        <monospace>##  [2] peptides_filtered: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [3] peptides_imputed: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [5] log_proteins: SummarizedExperiment with 2837 rows and 6 columns</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec60">
                <title>Normalisation of quantitative data</title>
                <p>Finally, we complete the data processing by normalising quantitation between samples. This is done using the "
                    <monospace>center.median</monospace>" method via the 
                    <monospace>normalize</monospace> function.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## normalize protein-level quantitation data</styled-content>
                        </monospace>

                        <monospace>sn_qf 
                            <styled-content style="color:#984806">&lt;-</styled-content> normalize(sn_qf,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">i =</styled-content> 
                            <styled-content style="color:#37A82E">"log_proteins"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">method =</styled-content> 
                            <styled-content style="color:#37A82E">"center.median"</styled-content>)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>sn_qf</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## An instance of class QFeatures containing 6 assays:</monospace>

                        <monospace>##  [1] peptides_raw: SummarizedExperiment with 23375 rows and 6 columns</monospace>

                        <monospace>##  [2] peptides_filtered: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [3] peptides_imputed: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [4] log_peptides: SummarizedExperiment with 13635 rows and 6 columns</monospace>

                        <monospace>##  [5] log_proteins: SummarizedExperiment with 2837 rows and 6 columns</monospace>

                        <monospace>##  [6] log_norm_proteins: SummarizedExperiment with 2837 rows and 6 columns</monospace>
                    </preformat>
                </p>
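                <p>Conceptually, median centring subtracts each sample&#x2019;s median from all of its values, so that every sample ends up with a median of zero on the log2 scale. A minimal base R sketch on a made-up matrix is given below; it illustrates the idea only, and the object names are hypothetical.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Made-up log2 protein abundances: 4 proteins (rows) by 3 samples (columns)</styled-content>
                        </monospace>

                        <monospace>log_toy &lt;- cbind(S1 = c(20, 22, 24, 26),</monospace>

                        <monospace>                 S2 = c(21, 23, 25, 27),</monospace>

                        <monospace>                 S3 = c(19, 21, NA, 25))</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Median of each sample, ignoring missing values</styled-content>
                        </monospace>

                        <monospace>sample_medians &lt;- apply(log_toy, 2, median, na.rm = TRUE)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Subtract the sample median from every value in that sample</styled-content>
                        </monospace>

                        <monospace>centred &lt;- sweep(log_toy, 2, sample_medians)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Each sample now has a median of zero</styled-content>
                        </monospace>

                        <monospace>apply(centred, 2, median, na.rm = TRUE)</monospace>
                    </preformat>
                </p>
                <p>Because the same constant is subtracted from every value within a sample, the differences between proteins within each sample are preserved and missing values remain missing.</p>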
                <p>The final dataset comprises 2837 proteins. We extract the protein-level 
                    <monospace>SummarizedExperiment</monospace> into its own object and export the final 
                    <monospace>QFeatures</monospace> object.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Save protein-level SE</styled-content>
                        </monospace>

                        <monospace>sn_proteins 
                            <styled-content style="color:#984806">&lt;-</styled-content> sn_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]]</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Export LFQ final QFeatures object</styled-content>
                        </monospace>

                        <monospace>save(sn_qf, 
                            <styled-content style="color:#CC9900">file =</styled-content> 
                            <styled-content style="color:#37A82E">"sn_qf.rda"</styled-content>)</monospace>
                    </preformat>
                </p>
            </sec>
        </sec>
        <sec id="sec61">
            <title>Exploration of protein data</title>
            <p>Having described the processing steps for quantitative proteomics data, we next demonstrate how to explore the protein-level data prior to statistical analysis. For this we will utilise the TMT-labelled cell pellet dataset since it contains a greater number of proteins.</p>
            <sec id="sec62">
                <title>Correlation plots</title>
                <p>We will first generate correlation plots between pairs of samples. To do this we calculate the Pearson&#x2019;s correlation coefficient between each sample pair and plot the result using the 
                    <monospace>corrplot</monospace> package. The 
                    <monospace>cor</monospace> function from the base R 
                    <monospace>stats</monospace> package will create a correlation matrix but requires a 
                    <monospace>data.frame</monospace>, 
                    <monospace>matrix</monospace> or a 
                    <monospace>vector</monospace> of class 
                    <monospace>numeric</monospace> as input. To convert the 
                    <monospace>QFeatures assay</monospace> data into a 
                    <monospace>data.frame</monospace> we use the 
                    <monospace>as.data.frame</monospace> function.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Convert TMT CP protein assay into a dataframe</styled-content>
                        </monospace>

                        <monospace>prot_df 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]] %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  as.data.frame()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Calculate a correlation matrix between samples</styled-content>
                        </monospace>

                        <monospace>corr_matrix 
                            <styled-content style="color:#984806">&lt;-</styled-content> cor(prot_df,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">method =</styled-content> 
                            <styled-content style="color:#37A82E">"pearson"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">use =</styled-content> 
                            <styled-content style="color:#37A82E">"pairwise.complete.obs"</styled-content>)</monospace>


                        <monospace>print(corr_matrix)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##           S1        S2        S3        S4        S5        S6</monospace>

                        <monospace>## S1 1.0000000 0.9863382 0.9927432 0.9447089 0.9627190 0.9628944</monospace>

                        <monospace>## S2 0.9863382 1.0000000 0.9886960 0.9206049 0.9535162 0.9501522</monospace>

                        <monospace>## S3 0.9927432 0.9886960 1.0000000 0.9344640 0.9551061 0.9548680</monospace>

                        <monospace>## S4 0.9447089 0.9206049 0.9344640 1.0000000 0.9822722 0.9867422</monospace>

                        <monospace>## S5 0.9627190 0.9535162 0.9551061 0.9822722 1.0000000 0.9928376</monospace>

                        <monospace>## S6 0.9628944 0.9501522 0.9548680 0.9867422 0.9928376 1.0000000</monospace>
                    </preformat>
                </p>
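                <p>The effect of the 
                    <monospace>use = "pairwise.complete.obs"</monospace> argument can be illustrated with a minimal sketch. The toy values and the 
                    <monospace>toy_df</monospace> object are hypothetical and are not part of the use-case data. Without this argument a single missing value leaves the affected coefficients undefined, whereas with it each coefficient is calculated from the proteins quantified in both samples of the pair.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy log2 abundances with one missing value (hypothetical data)</monospace>

                        <monospace>toy_df &lt;- data.frame(S1 = c(1.0, 2.0, 3.0, 4.0),</monospace>

                        <monospace>                     S2 = c(1.1, 2.1, NA, 3.9),</monospace>

                        <monospace>                     S3 = c(0.9, 1.8, 3.2, 4.1))</monospace>


                        <monospace>## Default behaviour: coefficients between S2 and the other samples are NA</monospace>

                        <monospace>cor(toy_df, method = "pearson")</monospace>


                        <monospace>## Pairwise complete observations: each coefficient uses the proteins</monospace>

                        <monospace>## quantified in both samples of that pair</monospace>

                        <monospace>cor(toy_df, method = "pearson", use = "pairwise.complete.obs")</monospace>
                    </preformat>
                </p>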
                <p>Now we can visualise the correlation data using pairwise scatter plots and a correlation heat map.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot correlation between two samples - S1 and S2 used as example</styled-content>
                        </monospace>

                        <monospace>prot_df %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">`</styled-content>
                            <styled-content style="color:#CC9900">S1</styled-content>
                            <styled-content style="color:#37A82E">`</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">`</styled-content>
                            <styled-content style="color:#CC9900">S2</styled-content>
                            <styled-content style="color:#37A82E">`</styled-content>)) +</monospace>

                        <monospace>  geom_point(
                            <styled-content style="color:#CC9900">colour =</styled-content> 
                            <styled-content style="color:#37A82E">"grey45"</styled-content>, 
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>) +</monospace>

                        <monospace>  geom_abline(
                            <styled-content style="color:#CC9900">intercept =</styled-content> 
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#CC9900">slope =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) +</monospace>

                        <monospace>  theme(
                            <styled-content style="color:#CC9900">panel.grid.major =</styled-content> element_blank(),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">panel.grid.minor =</styled-content> element_blank(),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">plot.background =</styled-content> element_rect(
                            <styled-content style="color:#CC9900">fill =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">panel.background =</styled-content> element_rect(
                            <styled-content style="color:#CC9900">fill =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.title.x =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">15</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> -
                            <styled-content style="color:#000099">2</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.title.y =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">15</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.text.x =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> -
                            <styled-content style="color:#000099">1</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.text.y =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.line =</styled-content> element_line(
                            <styled-content style="color:#CC9900">linewidth =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#CC9900">colour =</styled-content> 
                            <styled-content style="color:#37A82E">"black"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">plot.margin =</styled-content> margin(
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>)) +</monospace>

                        <monospace>  xlim(-
                            <styled-content style="color:#000099">7.5</styled-content>, 
                            <styled-content style="color:#000099">5</styled-content>) +</monospace>

                        <monospace>  ylim(-
                            <styled-content style="color:#000099">7.5</styled-content>, 
                            <styled-content style="color:#000099">5</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(abundance S1)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(abundance S2)"</styled-content>) +</monospace>

                        <monospace>  coord_fixed(
                            <styled-content style="color:#CC9900">ratio =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure21.gif"/>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Create colour palette for continuum</styled-content>
                        </monospace>

                        <monospace>col 
                            <styled-content style="color:#984806">&lt;-</styled-content> colorRampPalette(c(
                            <styled-content style="color:#37A82E">"#BB4444"</styled-content>, 
                            <styled-content style="color:#37A82E">"#EE9988"</styled-content>, 
                            <styled-content style="color:#37A82E">"#FFFFFF"</styled-content>,</monospace>

                        <monospace>                          
                            <styled-content style="color:#37A82E">"#77AADD"</styled-content>, 
                            <styled-content style="color:#37A82E">"#4477AA"</styled-content>))</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Plot all pairwise correlations</styled-content>
                        </monospace>

                        <monospace>prot_df %&gt;%</monospace>

                        <monospace>  cor(
                            <styled-content style="color:#CC9900">method =</styled-content> 
                            <styled-content style="color:#37A82E">"pearson"</styled-content>,</monospace>

                        <monospace>      
                            <styled-content style="color:#CC9900">use =</styled-content> 
                            <styled-content style="color:#37A82E">"pairwise.complete.obs"</styled-content>) %&gt;%</monospace>

                        <monospace>  corrplot(
                            <styled-content style="color:#CC9900">method =</styled-content> 
                            <styled-content style="color:#37A82E">"color"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">col =</styled-content> col(
                            <styled-content style="color:#000099">200</styled-content>),</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">type =</styled-content> 
                            <styled-content style="color:#37A82E">"upper"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">addCoef.col =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">diag =</styled-content> FALSE,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">tl.col =</styled-content> 
                            <styled-content style="color:#37A82E">"black"</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">tl.srt =</styled-content> 
                            <styled-content style="color:#000099">45</styled-content>,</monospace>

                        <monospace>           
                            <styled-content style="color:#CC9900">outline =</styled-content> TRUE)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure22.gif"/>
                </p>
                <p>From these plots we can see that all replicate pairs have a Pearson&#x2019;s correlation coefficient &gt;0.98 whilst the correlation between pairs of control and treated samples is somewhat lower. Users may interpret this information as an early indication that some proteins may be differentially abundant between the two groups.</p>
                <p>Of note, whilst correlation is widely applied as a measure of reproducibility, users are reminded that correlation coefficients alone are not informative of reproducibility.
                    <sup>
                        <xref ref-type="bibr" rid="ref34">34</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref35">35</xref>
                    </sup> This is especially true for expression proteomics data in which high correlation values are likely due to the majority of proteins remaining at similar levels regardless of cellular perturbation. Users are directed to Ref. 
                    <xref ref-type="bibr" rid="ref36">36</xref> for additional information on how to assess experimental reproducibility.</p>
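                <p>This point can be illustrated with a small simulation. It is only a sketch under stated assumptions: the number of proteins, the noise level and the proportion of changing proteins are all hypothetical choices and are not estimated from the use-case data. Two simulated samples share the same underlying abundances, and 5 % of proteins are shifted two-fold in one of them, yet the correlation coefficient remains high because it is dominated by the wide range of unchanged abundances.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Simulate log2 abundances for 2000 proteins (hypothetical values)</monospace>

                        <monospace>set.seed(1)</monospace>

                        <monospace>baseline &lt;- rnorm(2000, mean = 0, sd = 2)</monospace>


                        <monospace>## Two samples sharing the same baseline plus independent noise</monospace>

                        <monospace>control &lt;- baseline + rnorm(2000, sd = 0.3)</monospace>

                        <monospace>treated &lt;- baseline + rnorm(2000, sd = 0.3)</monospace>


                        <monospace>## Shift the first 100 proteins (5 %) by one log2 unit in the treated sample</monospace>

                        <monospace>treated[1:100] &lt;- treated[1:100] + 1</monospace>


                        <monospace>## The correlation stays high despite the two-fold changes</monospace>

                        <monospace>cor(control, treated, method = "pearson")</monospace>
                    </preformat>
                </p>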
            </sec>
            <sec id="sec63">
                <title>Principal Component Analysis</title>
                <p>Principal Component Analysis (PCA) is a dimensionality reduction method which aims to simplify complex datasets and facilitate the visualisation of multi-dimensional data. Here we use the 
                    <monospace>prcomp</monospace> function from the 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html">stats</ext-link>
                    </monospace> package to perform the PCA. Since 
                    <monospace>prcomp</monospace> does not accept missing values and we did not impute the TMT data, the 
                    <monospace>filterNA</monospace> function is used to remove any proteins with missing values from the protein-level data. We then extract and transpose the assay data, so that samples are in rows, before passing it to the 
                    <monospace>prcomp</monospace> function to carry out PCA.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Carry out principal component analysis</styled-content>
                        </monospace>

                        <monospace>prot_pca 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]] %&gt;%</monospace>

                        <monospace>  filterNA() %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  t() %&gt;%</monospace>

                        <monospace>  prcomp(
                            <styled-content style="color:#CC9900">scale =</styled-content> TRUE, 
                            <styled-content style="color:#CC9900">center =</styled-content> TRUE)</monospace>
                    </preformat>
                </p>
                <p>We can get an idea of the outcome of the PCA by running the summary function on the results of the PCA.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Get a summary of the PCA</styled-content>
                        </monospace>

                        <monospace>summary(prot_pca)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Importance of components:</monospace>

                        <monospace>##                           PC1     PC2     PC3      PC4      PC5       PC6</monospace>

                        <monospace>## Standard deviation    42.4845 26.7522 18.1650 15.15893 14.44395 5.474e-14</monospace>

                        <monospace>## Proportion of Variance 0.5488  0.2176  0.1003  0.06987  0.06343 0.000e+00</monospace>

                        <monospace>## Cumulative Proportion  0.5488  0.7664  0.8667  0.93657  1.00000 1.000e+00</monospace>
                    </preformat>
                </p>
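                <p>The &#x2018;Proportion of Variance&#x2019; row reported by 
                    <monospace>summary</monospace> can also be derived directly from the standard deviations stored in the 
                    <monospace>prcomp</monospace> output, which is convenient for building axis labels without typing values by hand. The following is a minimal sketch on a simulated matrix; the objects 
                    <monospace>toy_mat</monospace>, 
                    <monospace>toy_pca</monospace>, 
                    <monospace>prop_var</monospace> and 
                    <monospace>pc_labels</monospace> are hypothetical names introduced for illustration only.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Toy matrix: 6 samples (rows) x 50 features (columns), hypothetical values</monospace>

                        <monospace>set.seed(1)</monospace>

                        <monospace>toy_mat &lt;- matrix(rnorm(6 * 50), nrow = 6,</monospace>

                        <monospace>                  dimnames = list(paste0("S", 1:6), NULL))</monospace>


                        <monospace>toy_pca &lt;- prcomp(toy_mat, scale. = TRUE, center = TRUE)</monospace>


                        <monospace>## Proportion of variance = squared standard deviation / total variance</monospace>

                        <monospace>prop_var &lt;- toy_pca$sdev^2 / sum(toy_pca$sdev^2)</monospace>


                        <monospace>## Compare with the values reported by summary()</monospace>

                        <monospace>summary(toy_pca)$importance["Proportion of Variance", ]</monospace>


                        <monospace>## Build axis labels from the proportions</monospace>

                        <monospace>pc_labels &lt;- sprintf("PC%d (%.1f %%)", seq_along(prop_var), 100 * prop_var)</monospace>

                        <monospace>pc_labels[1:2]</monospace>
                    </preformat>
                </p>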
                <p>Finally, we create a PCA plot. For additional PCA exploration and visualisation tools users are directed to the 
                    <ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org/web/packages/factoextra/index.html">
                        <monospace>factoextra</monospace> package</ext-link>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Generate dataframe of each sample's PCA results</styled-content>
                        </monospace>

                        <monospace>pca_df 
                            <styled-content style="color:#984806">&lt;-</styled-content> as.data.frame(prot_pca$x)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Annotate samples with their corresponding condition</styled-content>
                        </monospace>

                        <monospace>pca_df$condition 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"psms_raw"</styled-content>]]$condition</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Generate a PCA plot using PC1 and PC2</styled-content>
                        </monospace>

                        <monospace>pca_df %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> PC1, 
                            <styled-content style="color:#CC9900">y =</styled-content> PC2, 
                            <styled-content style="color:#CC9900">colour =</styled-content> condition)) +</monospace>

                        <monospace>  geom_point(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">4</styled-content>) +</monospace>

                        <monospace>  scale_color_brewer(
                            <styled-content style="color:#CC9900">palette =</styled-content> 
                            <styled-content style="color:#37A82E">"Set2"</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">colour =</styled-content> 
                            <styled-content style="color:#37A82E">"Condition"</styled-content>) +</monospace>

                        <monospace>  geom_hline(
                            <styled-content style="color:#CC9900">yintercept =</styled-content> 
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>) +</monospace>

                        <monospace>  geom_vline(
                            <styled-content style="color:#CC9900">xintercept =</styled-content> 
                            <styled-content style="color:#000099">0</styled-content>, 
                            <styled-content style="color:#CC9900">linetype =</styled-content> 
                            <styled-content style="color:#37A82E">"dashed"</styled-content>) +</monospace>

                        <monospace>  guides(
                            <styled-content style="color:#CC9900">colour =</styled-content> guide_legend(
                            <styled-content style="color:#CC9900">override.aes =</styled-content> list(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">3</styled-content>))) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"PC1 (54.9 %)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"PC2 (21.8 %)"</styled-content>) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"Protein-level PCA plot"</styled-content>) +</monospace>

                        <monospace>  xlim(-
                            <styled-content style="color:#000099">100</styled-content>, 
                            <styled-content style="color:#000099">100</styled-content>) +</monospace>

                        <monospace>  ylim(-
                            <styled-content style="color:#000099">100</styled-content>, 
                            <styled-content style="color:#000099">100</styled-content>) +</monospace>

                        <monospace>  coord_fixed(
                            <styled-content style="color:#CC9900">ratio =</styled-content> 
                            <styled-content style="color:#000099">1</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure23.gif"/>
                </p>
            </sec>
            <sec id="sec64">
                <title>Exploring potential batch effects</title>
                <p>Before carrying out differential expression analysis, it is first necessary to explore the presence of batch effects within the data. Batch effects are derived from non-biological factors which impact the experimental data. These include reagents, instrumentation, personnel and laboratory conditions. In most cases the increased variation caused by batch effects will lead to reduced downstream statistical power. On the other hand, if correlated with the experimental subgroups, batch effects can also lead to confounded results and the incorrect biological interpretation of differential expression.
                    <sup>
                        <xref ref-type="bibr" rid="ref37">37</xref>
                    </sup>
                </p>
                <p>Given that the use-case data was derived from a small experiment with only six samples and a single TMTplex, there are minimal batch effects to explore here. For users analysing larger experiments completed over a long period of time, across several laboratories/individuals, or using multiple TMTplex reagents, it is advisable to annotate the PCA plot with all potential batch factors. If data is found to cluster based on any of these factors, batch effects should be incorporated into downstream analyses. For example, users can apply the 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://rdrr.io/bioc/limma/man/removeBatchEffect.html">removeBatchEffect</ext-link>
                    </monospace> function from the limma package.</p>
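                <p>As a hedged illustration of these two options, the sketch below uses simulated data with two hypothetical TMTplex batches; the objects 
                    <monospace>toy_mat</monospace>, 
                    <monospace>batch</monospace> and 
                    <monospace>condition</monospace>, as well as the size of the batch shift, are assumptions made for this example only. Note that the 
                    <monospace>limma</monospace> documentation describes 
                    <monospace>removeBatchEffect</monospace> as intended for exploratory analyses such as PCA rather than for use prior to linear modelling. For statistical testing the batch factor is instead added to the design matrix.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>library(limma)</monospace>


                        <monospace>## Simulated log2 abundances: 200 proteins x 12 samples (hypothetical data)</monospace>

                        <monospace>set.seed(1)</monospace>

                        <monospace>condition &lt;- factor(rep(c("Control", "Treated"), times = 6))</monospace>

                        <monospace>batch &lt;- factor(rep(c("plex1", "plex2"), each = 6))</monospace>

                        <monospace>toy_mat &lt;- matrix(rnorm(200 * 12), nrow = 200,</monospace>

                        <monospace>                  dimnames = list(paste0("prot", 1:200), paste0("S", 1:12)))</monospace>


                        <monospace>## Add an artificial shift of 2 to every protein in the second batch</monospace>

                        <monospace>toy_mat[, batch == "plex2"] &lt;- toy_mat[, batch == "plex2"] + 2</monospace>


                        <monospace>## For visualisation (e.g. PCA): remove the batch effect whilst</monospace>

                        <monospace>## preserving the condition effect via the design matrix</monospace>

                        <monospace>design_condition &lt;- model.matrix(~ condition)</monospace>

                        <monospace>toy_corrected &lt;- removeBatchEffect(toy_mat, batch = batch,</monospace>

                        <monospace>                                   design = design_condition)</monospace>


                        <monospace>## For statistical testing: keep the original data and model the batch</monospace>

                        <monospace>design_full &lt;- model.matrix(~ batch + condition)</monospace>

                        <monospace>colnames(design_full)</monospace>
                    </preformat>
                </p>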
            </sec>
            <sec id="sec65">
                <title>Discovery and biological interpretation of differentially abundant proteins</title>
                <p>The last section of this workflow demonstrates how to gain biological insights from the resulting list of proteins. Again we will utilise the TMT-labelled cell pellet data, although the process would be exactly the same for the LFQ supernatant protein list. Users are reminded that although referred to as differential &#x2018;expression&#x2019; analysis, abundance is determined by both protein synthesis and degradation.</p>
            </sec>
            <sec id="sec66">
                <title>Extracting and organising protein-level data</title>
                <p>We first extract the protein-level 
                    <monospace>SummarizedExperiment</monospace> from the cell pellet TMT 
                    <monospace>QFeatures</monospace> object and specify the study factors. Here we are interested in discovering differences between conditions, control and treated. As well as assigning these conditions to each sample, we can define the control group as the reference level such that differential abundance is reported relative to the control. This means that when we get the results of the statistical analysis, &#x2018;upregulated&#x2019; will refer to increased abundance in treated cells relative to control cells.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Extract protein-level data and associated colData</styled-content>
                        </monospace>

                        <monospace>cp_proteins 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]]</monospace>

                        <monospace>colData(cp_proteins) 
                            <styled-content style="color:#984806">&lt;-</styled-content> colData(cp_qf[[
                            <styled-content style="color:#37A82E">"log_norm_proteins"</styled-content>]])</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Create factor of interest</styled-content>
                        </monospace>

                        <monospace>cp_proteins$condition 
                            <styled-content style="color:#984806">&lt;-</styled-content> factor(cp_proteins$condition)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Check which level of the factor is the reference level and correct</styled-content>
                        </monospace>

                        <monospace>cp_proteins$condition</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] Treated Treated Treated Control Control Control</monospace>

                        <monospace>## Levels: Control Treated</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>cp_proteins$condition 
                            <styled-content style="color:#984806">&lt;-</styled-content> relevel(cp_proteins$condition, 
                            <styled-content style="color:#CC9900">ref =</styled-content> 
                            <styled-content style="color:#37A82E">"Control"</styled-content>)</monospace>
                    </preformat>
                </p>
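                <p>In the use-case data the control group is already the first level because factor levels are sorted alphabetically by default, so the call to 
                    <monospace>relevel</monospace> acts as a safeguard. The minimal sketch below shows a case in which it does change the reference; the condition labels &#x2018;Treated&#x2019; and &#x2018;Untreated&#x2019; are hypothetical and were chosen only because the control label sorts second.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical labels where the control group is not first alphabetically</monospace>

                        <monospace>condition &lt;- factor(c("Treated", "Treated", "Untreated", "Untreated"))</monospace>


                        <monospace>## Levels are sorted alphabetically, so "Treated" is the reference by default</monospace>

                        <monospace>levels(condition)</monospace>


                        <monospace>## Set "Untreated" as the reference level</monospace>

                        <monospace>condition &lt;- relevel(condition, ref = "Untreated")</monospace>

                        <monospace>levels(condition)</monospace>


                        <monospace>## The design matrix now encodes Treated relative to Untreated</monospace>

                        <monospace>colnames(model.matrix(~ condition))</monospace>
                    </preformat>
                </p>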
            </sec>
            <sec id="sec67">
                <title>Differential expression analysis using limma</title>
                <p>Bioconductor contains several packages dedicated to the statistical analysis of proteomics data. For example, 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/3.15/bioc/html/MSstats.html">MSstats</ext-link>
                    </monospace> and 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/3.15/bioc/html/MSstatsTMT.html">MSstatsTMT</ext-link>
                    </monospace> can be used to determine differential protein expression within both DDA and DIA datasets for LFQ and TMT, respectively.
                    <sup>
                        <xref ref-type="bibr" rid="ref38">38</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref39">39</xref>
                    </sup> Of note, 
                    <monospace>MSstatsTMT</monospace> includes additional functionality for dealing with larger, multi-plexed TMT experiments. For LFQ experiments, 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/proDA.html">proDA</ext-link>
                    </monospace>, 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://rdrr.io/github/wolski/prolfqua/">prolfqua</ext-link>
                    </monospace> and 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/packages/release/bioc/html/msqrob2.html">MSqRob2</ext-link>
                    </monospace> can be utilised, among others.
                    <sup>
                        <xref ref-type="bibr" rid="ref32">32</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref40">40</xref>
                    </sup> Here, we will use the 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/limma.html">limma</ext-link>
                    </monospace> package.
                    <sup>
                        <xref ref-type="bibr" rid="ref41">41</xref>
                    </sup> <monospace>limma</monospace> is widely used for the analysis of large omics datasets and provides linear models that allow differential abundance to be assessed in multifactorial experiments. This is useful because multiple factors, including TMTplex, can be included in the model itself, which reduces the influence of confounding factors. In this example we will apply 
                    <monospace>limma</monospace>&#x2019;s empirical Bayes moderated t-test, a method that is appropriate for small sample sizes.
                    <sup>
                        <xref ref-type="bibr" rid="ref31">31</xref>
                    </sup>
                </p>
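                <p>As a minimal sketch of how an additional factor such as TMTplex could enter the model, the chunk below builds a design matrix from a hypothetical sample table. The data frame 
                    <monospace>toy_samples</monospace> and its 
                    <monospace>plex</monospace> column are illustrative assumptions and are not part of the use-case data.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical sample annotation: two conditions spread across two TMTplexes</monospace>

                        <monospace>toy_samples &lt;- data.frame(</monospace>

                        <monospace>  condition = factor(rep(c("Control", "Treated"), times = 4),</monospace>

                        <monospace>                     levels = c("Control", "Treated")),</monospace>

                        <monospace>  plex = factor(rep(c("plex1", "plex2"), each = 4))</monospace>

                        <monospace>)</monospace>


                        <monospace>## Additive design: condition effect adjusted for TMTplex</monospace>

                        <monospace>toy_design &lt;- model.matrix(~ condition + plex, data = toy_samples)</monospace>

                        <monospace>colnames(toy_design)</monospace>
                    </preformat>
                 In such a design the condition coefficient is estimated after accounting for the average difference between plexes.
                </p>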
                <p>We first use the 
                    <monospace>model.matrix</monospace> function to create a matrix in which each sample is annotated according to the factors we wish to model, here the condition group. This defines the &#x2018;design&#x2019; of the model, that is, how the samples are distributed between the groups of interest. We then fit a linear model to the abundance data of each protein by passing the data and the design matrix to the 
                    <monospace>lmFit</monospace> function. Finally, we update the estimated standard error for each model coefficient using the <monospace>eBayes</monospace> function. This function borrows information across features, here proteins, to shrink the per-protein variance estimates towards an expected value based on the variance estimates of other proteins with similar mean intensity. This empirical Bayes technique has been shown to reduce the number of false positives for proteins with small variances and to increase the power to detect differentially abundant proteins with larger variances.
                    <sup>
                        <xref ref-type="bibr" rid="ref42">42</xref>
                    </sup> Further, we use the 
                    <monospace>trend = TRUE</monospace> argument when calling the 
                    <monospace>eBayes</monospace> function so that an intensity-dependent trend can be fitted to the prior variances. For more information about the 
                    <monospace>limma</monospace> trend method users are directed to Ref. 
                    <xref ref-type="bibr" rid="ref43">43</xref>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Design a matrix containing all of the factors we wish to model the effects of</styled-content>
                        </monospace>

                        <monospace>model_design 
                            <styled-content style="color:#984806">&lt;-</styled-content> model.matrix(~ cp_proteins$condition)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>print(model_design)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##   (Intercept) cp_proteins$conditionTreated</monospace>

                        <monospace>## 1           1                            1</monospace>

                        <monospace>## 2           1                            1</monospace>

                        <monospace>## 3           1                            1</monospace>

                        <monospace>## 4           1                            0</monospace>

                        <monospace>## 5           1                            0</monospace>

                        <monospace>## 6           1                            0</monospace>

                        <monospace>## attr(,"assign")</monospace>

                        <monospace>## [1] 0 1</monospace>

                        <monospace>## attr(,"contrasts")</monospace>

                        <monospace>## attr(,"contrasts")$`cp_proteins$condition`</monospace>

                        <monospace>## [1] "contr.treatment"</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Create a linear model using this design</styled-content>
                        </monospace>

                        <monospace>fitted_lm 
                            <styled-content style="color:#984806">&lt;-</styled-content> cp_proteins %&gt;%</monospace>

                        <monospace>  assay() %&gt;%</monospace>

                        <monospace>  lmFit(
                            <styled-content style="color:#D47D3C">design =</styled-content> model_design)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Update the model using the limma eBayes algorithm</styled-content>
                        </monospace>

                        <monospace>fitted_lm 
                            <styled-content style="color:#984806">&lt;-</styled-content> eBayes(
                            <styled-content style="color:#CC9900">fit =</styled-content> fitted_lm,</monospace>

                        <monospace>                    
                            <styled-content style="color:#CC9900">trend =</styled-content> TRUE)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Save results of the test</styled-content>
                        </monospace>

                        <monospace>limma_results 
                            <styled-content style="color:#984806">&lt;-</styled-content> topTable(
                            <styled-content style="color:#CC9900">fit =</styled-content> fitted_lm,</monospace>

                        <monospace>                          
                            <styled-content style="color:#CC9900">coef =</styled-content> 
                            <styled-content style="color:#37A82E">"cp_proteins$conditionTreated"</styled-content>,</monospace>

                        <monospace>                          
                            <styled-content style="color:#CC9900">adjust.method =</styled-content> 
                            <styled-content style="color:#37A82E">"BH"</styled-content>,</monospace>

                        <monospace>                          
                            <styled-content style="color:#CC9900">number =</styled-content> Inf) %&gt;%</monospace>

                        <monospace>  rownames_to_column(
                            <styled-content style="color:#37A82E">"Protein"</styled-content>) %&gt;%</monospace>

                        <monospace>  as_tibble() %&gt;%</monospace>

                        <monospace>  mutate(
                            <styled-content style="color:#CC9900">TP =</styled-content> grepl(
                            <styled-content style="color:#37A82E">"ups"</styled-content>, Protein))</monospace>
                    </preformat>
                </p>
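                <p>To make the variance moderation performed by 
                    <monospace>eBayes</monospace> more concrete, the sketch below computes a moderated variance as a weighted average of a protein&#x2019;s own sample variance and a prior variance, weighted by the residual and prior degrees of freedom. All numbers are made up for illustration. In practice 
                    <monospace>eBayes</monospace> estimates the prior from the data, and with 
                    <monospace>trend = TRUE</monospace> the prior variance varies with average abundance, whereas a single assumed value is used here for simplicity.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Illustrative values only; eBayes() estimates the prior from all proteins</monospace>

                        <monospace>s2_protein &lt;- c(0.01, 0.10, 1.00)  # per-protein sample variances</monospace>

                        <monospace>df_resid   &lt;- 4                    # residual df: 6 samples - 2 coefficients</monospace>

                        <monospace>s2_prior   &lt;- 0.10                 # assumed prior variance</monospace>

                        <monospace>df_prior   &lt;- 3                    # assumed prior degrees of freedom</monospace>


                        <monospace>## Moderated variance: weighted average of prior and observed variances</monospace>

                        <monospace>s2_post &lt;- (df_prior * s2_prior + df_resid * s2_protein) / (df_prior + df_resid)</monospace>

                        <monospace>round(s2_post, 3)</monospace>
                    </preformat>
                 In this toy example the smallest variance is pulled upwards and the largest is pulled downwards, which is the behaviour that reduces false positives for proteins with very small variances.
                </p>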
                <p>Having applied the model to the data, we need to verify that this model was appropriate and that the statistical assumptions were met. To do this we first generate an SA plot using the 
                    <monospace>plotSA</monospace> function within 
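                    <monospace>limma</monospace>. An SA plot shows the residual standard deviation (sigma) of each protein against its average log2 abundance and is a simple way to visualise the trend that has been fitted to the data. Note that recent <monospace>limma</monospace> releases draw sigma on a square-root scale rather than a log2 scale.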

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot residual SD against average log abundance</styled-content>
                        </monospace>

                        <monospace>plotSA(fitted_lm,</monospace>

                        <monospace>       
                            <styled-content style="color:#CC9900">xlab =</styled-content> 
                            <styled-content style="color:#37A82E">"Average log2(abundance)"</styled-content>,</monospace>

                        <monospace>       
                            <styled-content style="color:#CC9900">ylab =</styled-content> 
                            <styled-content style="color:#37A82E">"sqrt(sigma)"</styled-content>,</monospace>

                        <monospace>       
                            <styled-content style="color:#CC9900">cex =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure24.gif"/>
                </p>
                <p>The residual standard deviation is a measure of model accuracy and is most easily conceptualised as a measurement of how far from the model prediction each data point lies. The smaller the residual standard deviation, the closer the fit between the model and observed data.</p>
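                <p>As a small worked example of this quantity, the chunk below fits the same two-group model to a single hypothetical protein using base R and computes the residual standard deviation by hand. The abundance values are invented for illustration.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical log2 abundances for one protein: 3 treated then 3 control</monospace>

                        <monospace>toy_abundance &lt;- c(1.2, 1.0, 1.4, 0.1, -0.1, 0.3)</monospace>

                        <monospace>toy_condition &lt;- factor(rep(c("Treated", "Control"), each = 3),</monospace>

                        <monospace>                        levels = c("Control", "Treated"))</monospace>


                        <monospace>## Fit the two-group model to this single protein</monospace>

                        <monospace>toy_fit &lt;- lm(toy_abundance ~ toy_condition)</monospace>


                        <monospace>## Residual SD: sqrt(residual sum of squares / residual degrees of freedom)</monospace>

                        <monospace>toy_sigma &lt;- sqrt(sum(residuals(toy_fit)^2) / df.residual(toy_fit))</monospace>

                        <monospace>toy_sigma</monospace>
                    </preformat>
                 The value computed by hand agrees with the one reported by 
                    <monospace>summary(toy_fit)$sigma</monospace>.
                </p>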
                <p>Next we will plot a p-value histogram. Importantly, this histogram shows the distribution of p-values prior to any multiple hypothesis test correction or FDR control. This means plotting the 
                    <monospace>P.Value</monospace> variable, not the 
                    <monospace>adj.P.Val</monospace>.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot histogram of raw p-values</styled-content>
                        </monospace>

                        <monospace>limma_results %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> P.Value)) +</monospace>

                        <monospace>  geom_histogram(
                            <styled-content style="color:#CC9900">binwidth =</styled-content> 
                            <styled-content style="color:#000099">0.025</styled-content>) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"P-value"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"Frequency"</styled-content>) +</monospace>

                        <monospace>  ggtitle(
                            <styled-content style="color:#37A82E">"P-value distribution following Limma eBayes trend model"</styled-content>) +</monospace>

                        <monospace>  theme_bw()</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure25.gif"/>
                </p>
                <p>The figure displayed shows an anti-conservative p-value distribution. The flat distribution across the base of the graph represents the non-significant p-values spread uniformly between 0 and 1, whilst the peak close to 0 contains significant p-values, along with some false positives. For a more thorough explanation of interpreting p-value distributions, including why your data may not produce an anti-conservative distribution if your statistical model is inappropriate, please see Ref. 
                    <xref ref-type="bibr" rid="ref44">44</xref>. Now, having applied the statistical model and verified its suitability, we take an initial look at the outputs.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Look at limma results table</styled-content>
                        </monospace>

                        <monospace>head(limma_results)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## # A tibble: 6x8</monospace>

                        <monospace>##   Protein logFC AveExpr     t  P.Value  adj.P.Val      B TP</monospace>

                        <monospace>##   &lt;chr&gt;   &lt;dbl&gt;   &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt;      &lt;dbl&gt;  &lt;dbl&gt; &lt;lgl&gt;</monospace>

                        <monospace>## 1 Q9C0G0  2.97   -0.814  33.7 1.81e-10 0.000000596 14.1  FALSE</monospace>

                        <monospace>## 2 Q01581  1.50    0.588  28.4 7.87e-10 0.00000129  13.0  FALSE</monospace>

                        <monospace>## 3 P15104  1.32     1.36  23.7 3.69e- 9 0.00000404  11.7  FALSE</monospace>

                        <monospace>## 4 Q9UK41  1.46    -1.49  22.0 7.05e- 9 0.00000553  11.1  FALSE</monospace>

                        <monospace>## 5 P37268  1.32   -0.678  21.1 9.80e- 9 0.00000553  10.8  FALSE</monospace>

                        <monospace>## 6 P04183  1.29    0.939  21.1 1.01e- 8 0.00000553  10.8  FALSE</monospace>
                    </preformat>
                </p>
                <p>The results table contains several important pieces of information. Each master protein is represented by its accession number and has an associated log2 fold change, that is, the difference in mean log2 abundance between conditions, as well as a mean log2 abundance across all six samples, termed 
                    <monospace>AveExpr</monospace>. Since we carried out an empirical Bayes moderated t-test, each protein also has a moderated t-statistic and associated p-value. The moderated t-statistic can be interpreted in the same way as a standard t-statistic. Each protein also has an adjusted p-value which accounts for multiple hypothesis testing to control the overall FDR. The default method for multiple hypothesis corrections within the 
                    <monospace>topTable</monospace> function that we applied is the Benjamini and Hochberg (BH) adjustment,
                    <sup>
                        <xref ref-type="bibr" rid="ref45">45</xref>
                    </sup> although we could have specified an alternative. Finally, the B-statistic is the log-odds that a protein is differentially abundant between the two conditions. The table is sorted in descending order of B, so the proteins with the highest log-odds of differential abundance appear at the top.</p>
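                <p>To illustrate two of these columns, the sketch below applies the BH adjustment to a handful of made-up raw p-values using the base R function 
                    <monospace>p.adjust</monospace>, and converts a B-statistic into a probability. The conversion assumes that B is taken at face value as a log-odds, and the resulting probability depends on the prior assumptions made by the model.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Made-up raw p-values, sorted in ascending order</monospace>

                        <monospace>toy_p &lt;- c(0.001, 0.01, 0.02, 0.04, 0.5)</monospace>


                        <monospace>## BH adjustment: p * n / rank, made monotone from the largest p-value down</monospace>

                        <monospace>toy_adj &lt;- p.adjust(toy_p, method = "BH")</monospace>

                        <monospace>toy_adj</monospace>


                        <monospace>## Convert a B-statistic (log-odds) into a probability</monospace>

                        <monospace>plogis(14.1)</monospace>
                    </preformat>
                </p>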
                <p>We can add annotations to this results table based on user-defined significance thresholds. In the literature, an FDR-adjusted p-value threshold of 0.01 is most frequently used for stringent analyses, and 0.05 for exploratory analyses. Ultimately these thresholds are arbitrary and set by the user. The addition of a log-fold change (
                    <monospace>logFC</monospace>) threshold is at the user&#x2019;s discretion and can help to identify significant results of biological relevance. When using a TMT labelling strategy, co-isolation interference can lead to substantial and uneven ratio compression, so applying a fold change threshold is not recommended here.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Add direction of log fold change relative to control</styled-content>
                        </monospace>

                        <monospace>limma_results$direction 
                            <styled-content style="color:#984806">&lt;-</styled-content> ifelse(limma_results$logFC &gt; 
                            <styled-content style="color:#000099">0</styled-content>,</monospace>

                        <monospace>                                  
                            <styled-content style="color:#37A82E">"up"</styled-content>, 
                            <styled-content style="color:#37A82E">"down"</styled-content>) %&gt;%</monospace>

                        <monospace>  as.factor()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Add significance thresholds</styled-content>
                        </monospace>

                        <monospace>limma_results$significance 
                            <styled-content style="color:#984806">&lt;-</styled-content> ifelse(limma_results$adj.P.Val &lt; 
                            <styled-content style="color:#000099">0.01</styled-content>,</monospace>

                        <monospace>                                     
                            <styled-content style="color:#37A82E">"sig"</styled-content>, 
                            <styled-content style="color:#37A82E">"not.sig"</styled-content>) %&gt;%</monospace>

                        <monospace>  as.factor()</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Verify</styled-content>
                        </monospace>

                        <monospace>str(limma_results)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## tibble [3,289 x 10] (S3: tbl_df/tbl/data.frame)</monospace>

                        <monospace>##  $ Protein      : chr [1:3289] "Q9C0G0" "Q01581" "P15104" "Q9UK41" &#x2026;</monospace>

                        <monospace>##  $ logFC        : num [1:3289] 2.97 1.5 1.32 1.46 1.32 &#x2026;</monospace>

                        <monospace>##  $ AveExpr      : num [1:3289] -0.814 0.588 1.359 -1.489 -0.678 &#x2026;</monospace>

                        <monospace>##  $ t            : num [1:3289] 33.7 28.4 23.7 22 21.1 &#x2026;</monospace>

                        <monospace>##  $ P.Value      : num [1:3289] 1.81e-10 7.87e-10 3.69e-09 7.05e-09 9.80e-09 &#x2026;</monospace>

                        <monospace>##  $ adj.P.Val    : num [1:3289] 5.96e-07 1.29e-06 4.04e-06 5.53e-06 5.53e-06 &#x2026;</monospace>

                        <monospace>##  $ B            : num [1:3289] 14.1 13 11.7 11.1 10.8 &#x2026;</monospace>

                        <monospace>##  $ TP           : logi [1:3289] FALSE FALSE FALSE FALSE FALSE FALSE &#x2026;</monospace>

                        <monospace>##  $ direction    : Factor w/ 2 levels "down","up": 2 2 2 2 2 2 2 1 1 2 &#x2026;</monospace>

                        <monospace>##  $ significance : Factor w/ 2 levels "not.sig","sig": 2 2 2 2 2 2 2 2 2 2 &#x2026;</monospace>
                    </preformat>
                </p>
                <p>In the next code chunk, we use the 
                    <monospace>decideTests</monospace> function to determine how many proteins are significantly up- and downregulated in the treated compared to the control HEK293 cells. We tell this function to classify the significance of each t-statistic using a BH-adjusted p-value threshold of 0.01. If we had not used TMT labels and wished to include a logFC threshold, we could have included 
                    <monospace>lfc =</monospace> as an argument. The function outputs a numerical matrix containing -1, 0, or 1 for each protein and each model coefficient, where -1 indicates significant downregulation, 0 no significant change and 1 significant upregulation. To simplify interpretation, we print a summary of this matrix.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Get a summary of statistically significant results</styled-content>
                        </monospace>

                        <monospace>fitted_lm %&gt;%</monospace>

                        <monospace>  decideTests(
                            <styled-content style="color:#CC9900">adjust.method =</styled-content> 
                            <styled-content style="color:#37A82E">"BH"</styled-content>,</monospace>

                        <monospace>              
                            <styled-content style="color:#CC9900">p.value =</styled-content> 
                            <styled-content style="color:#000099">0.01</styled-content>) %&gt;%</monospace>

                        <monospace>  summary()</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>##       (Intercept) cp_proteins$conditionTreated</monospace>

                        <monospace>## Down         1448                          395</monospace>

                        <monospace>## NotSig        414                         2569</monospace>

                        <monospace>## Up           1427                          325</monospace>
                    </preformat>
                </p>
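                <p>For datasets where a fold change threshold is appropriate, a call including the 
                    <monospace>lfc</monospace> argument might look like the sketch below. It uses simulated data rather than the use-case data, and the object names, the spiked-in effect size and the threshold of 1 are all assumptions chosen for illustration.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>library(limma)</monospace>


                        <monospace>## Simulated log2 abundances: 200 proteins, 3 treated then 3 control samples</monospace>

                        <monospace>set.seed(42)</monospace>

                        <monospace>toy_data &lt;- matrix(rnorm(200 * 6, sd = 0.2), nrow = 200)</monospace>

                        <monospace>toy_data[1:20, 1:3] &lt;- toy_data[1:20, 1:3] + 2  # spike in 20 true changes</monospace>


                        <monospace>toy_condition &lt;- factor(rep(c("Treated", "Control"), each = 3),</monospace>

                        <monospace>                        levels = c("Control", "Treated"))</monospace>

                        <monospace>toy_design &lt;- model.matrix(~ toy_condition)</monospace>


                        <monospace>toy_fit &lt;- eBayes(lmFit(toy_data, design = toy_design), trend = TRUE)</monospace>


                        <monospace>## Require BH-adjusted p &lt; 0.01 and an absolute log2 fold change of at least 1</monospace>

                        <monospace>toy_calls &lt;- decideTests(toy_fit, adjust.method = "BH",</monospace>

                        <monospace>                         p.value = 0.01, lfc = 1)</monospace>

                        <monospace>summary(toy_calls)</monospace>
                    </preformat>
                </p>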
                <p>From this table we can see that 395 proteins were downregulated in treated HEK293 cells compared to the control group whilst 325 were upregulated. The <monospace>(Intercept)</monospace> column reports whether the mean abundance in the control group differs from zero and is not of interest here. Given that no logFC threshold was applied, some of the significant differences in abundance may be small. Further, these results mean little without information about which proteins these were and what roles they play within the cell. We subset the significant proteins so that we can investigate them further.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Subset proteins that show significantly different abundance</styled-content>
                        </monospace>

                        <monospace>sig_proteins 
                            <styled-content style="color:#984806">&lt;-</styled-content> subset(limma_results,</monospace>

                        <monospace>                       adj.P.Val &lt;= 
                            <styled-content style="color:#000099">0.01</styled-content>)</monospace>


                        <monospace>length(sig_proteins$Protein)</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## [1] 720</monospace>
                    </preformat>
                </p>
            </sec>
            <sec id="sec68">
                <title>Visualising differentially abundant proteins</title>
                <p>Before looking deeper into which proteins have differential abundance, we first create some simple plots to visualise the results. Volcano plots and MA plots are two common visualisations for this purpose. When plotting the former, users are advised to plot raw p-values rather than the BH-adjusted p-values derived from them. Point colours can then be used to indicate significance based on the BH-adjusted p-values, as is shown in the code chunk below.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Generate a volcano plot</styled-content>
                        </monospace>

                        <monospace>limma_results %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> logFC, 
                            <styled-content style="color:#CC9900">y =</styled-content> -log10(P.Value))) +</monospace>

                        <monospace>  geom_point(aes(
                            <styled-content style="color:#CC9900">colour =</styled-content> significance:direction), 
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>) +</monospace>

                        <monospace>  scale_color_manual(</monospace>

                        <monospace>  
                            <styled-content style="color:#CC9900">values =</styled-content> c(
                            <styled-content style="color:#37A82E">"black"</styled-content>, 
                            <styled-content style="color:#37A82E">"black"</styled-content>, 
                            <styled-content style="color:#37A82E">"deepskyblue"</styled-content>, 
                            <styled-content style="color:#37A82E">"red"</styled-content>), 
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">""</styled-content>,</monospace>

                        <monospace>  
                            <styled-content style="color:#CC9900">labels =</styled-content> c(
                            <styled-content style="color:#37A82E">"Downregulated insignificant"</styled-content>,</monospace>

                        <monospace>             
                            <styled-content style="color:#37A82E">"Upregulated insignificant"</styled-content>,</monospace>

                        <monospace>             
                            <styled-content style="color:#37A82E">"Downregulated significant"</styled-content>,</monospace>

                        <monospace>             
                            <styled-content style="color:#37A82E">"Upregulated significant"</styled-content>)) +</monospace>

                        <monospace>  theme(
                            <styled-content style="color:#CC9900">axis.title.x =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">15</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> -
                            <styled-content style="color:#000099">2</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.title.y =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">15</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.text.x =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> -
                            <styled-content style="color:#000099">1</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.text.y =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">plot.background =</styled-content> element_rect(
                            <styled-content style="color:#CC9900">fill =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">panel.background =</styled-content> element_rect(
                            <styled-content style="color:#CC9900">fill =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.line =</styled-content> element_line(
                            <styled-content style="color:#CC9900">linewidth =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#CC9900">colour =</styled-content> 
                            <styled-content style="color:#37A82E">"black"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">plot.margin =</styled-content> margin(
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">legend.position =</styled-content> c(
                            <styled-content style="color:#000099">0.25</styled-content>, 
                            <styled-content style="color:#000099">0.9</styled-content>)) +</monospace>

                        <monospace>  labs(
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"log2(FC)"</styled-content>, 
                            <styled-content style="color:#CC9900">y =</styled-content> 
                            <styled-content style="color:#37A82E">"-log10(p-value)"</styled-content>) +</monospace>

                        <monospace>  xlim(-
                            <styled-content style="color:#000099">3.1</styled-content>, 
                            <styled-content style="color:#000099">3.1</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure26.gif"/>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Generate MA plot</styled-content>
                        </monospace>

                        <monospace>limma_results %&gt;%</monospace>

                        <monospace>  ggplot(aes(
                            <styled-content style="color:#CC9900">x =</styled-content> AveExpr, 
                            <styled-content style="color:#CC9900">y =</styled-content> logFC)) +</monospace>

                        <monospace>  geom_point(aes(
                            <styled-content style="color:#CC9900">colour =</styled-content> significance:direction), 
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>) +</monospace>

                        <monospace>  scale_color_manual(</monospace>

                        <monospace>  
                            <styled-content style="color:#CC9900">values =</styled-content> c(
                            <styled-content style="color:#37A82E">"black"</styled-content>, 
                            <styled-content style="color:#37A82E">"black"</styled-content>, 
                            <styled-content style="color:#37A82E">"deepskyblue"</styled-content>, 
                            <styled-content style="color:#37A82E">"red"</styled-content>), 
                            <styled-content style="color:#CC9900">name =</styled-content> 
                            <styled-content style="color:#37A82E">""</styled-content>,</monospace>

                        <monospace>  
                            <styled-content style="color:#CC9900">labels =</styled-content> c(
                            <styled-content style="color:#37A82E">"Downregulated insignificant"</styled-content>,</monospace>

                        <monospace>             
                            <styled-content style="color:#37A82E">"Upregulated insignificant"</styled-content>,</monospace>

                        <monospace>             
                            <styled-content style="color:#37A82E">"Downregulated significant"</styled-content>,</monospace>

                        <monospace>             
                            <styled-content style="color:#37A82E">"Upregulated significant"</styled-content>)) +</monospace>

                        <monospace>  theme(
                            <styled-content style="color:#CC9900">axis.title.x =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">15</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> -
                            <styled-content style="color:#000099">2</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.title.y =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">15</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> 
                            <styled-content style="color:#000099">2</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.text.x =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>, 
                            <styled-content style="color:#CC9900">vjust =</styled-content> -
                            <styled-content style="color:#000099">1</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.text.y =</styled-content> element_text(
                            <styled-content style="color:#CC9900">size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">plot.background =</styled-content> element_rect(
                            <styled-content style="color:#CC9900">fill =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">panel.background =</styled-content> element_rect(
                            <styled-content style="color:#CC9900">fill =</styled-content> 
                            <styled-content style="color:#37A82E">"white"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">axis.line =</styled-content> element_line(
                            <styled-content style="color:#CC9900">linewidth =</styled-content> 
                            <styled-content style="color:#000099">0.5</styled-content>, 
                            <styled-content style="color:#CC9900">colour =</styled-content> 
                            <styled-content style="color:#37A82E">"black"</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">plot.margin =</styled-content> margin(
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>, 
                            <styled-content style="color:#000099">10</styled-content>),</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">legend.position =</styled-content> c(
                            <styled-content style="color:#000099">0.25</styled-content>, 
                            <styled-content style="color:#000099">0.9</styled-content>)) +</monospace>

                        <monospace>  xlab(
                            <styled-content style="color:#37A82E">"log2(mean abundance)"</styled-content>) +</monospace>

                        <monospace>  ylab(
                            <styled-content style="color:#37A82E">"log2(FC)"</styled-content>) +</monospace>

                        <monospace>  xlim(-
                            <styled-content style="color:#000099">5</styled-content>, 
                            <styled-content style="color:#000099">3.5</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure27.gif"/>
                </p>
            </sec>
            <sec id="sec69">
                <title>Gene Ontology enrichment analysis</title>
                <p>The final step in the processing workflow is to apply Gene Ontology (GO) enrichment analyses to gain a biological understanding of the proteins which were either up or downregulated in HEK293 cells upon treatment. GO terms provide descriptions for genes and their corresponding proteins in the form of Molecular Functions (MF), Biological Processes (BP) and Cellular Components (CC). By carrying out GO enrichment analysis we can determine whether the frequency of any of these terms is higher than expected in the proteins of interest compared to all of the proteins which were detected. Such results can indicate whether proteins that were increased or decreased in abundance in treated HEK293 cells represent particular cellular locations, biological pathways or cellular functions.</p>
                <p>Although GO enrichment analysis can be carried out online using websites such as 
                    <ext-link ext-link-type="uri" xlink:href="http://cbl-gorilla.cs.technion.ac.il/">GOrilla</ext-link>
                    <sup>
                        <xref ref-type="bibr" rid="ref46">46</xref>
                    </sup> or 
                    <ext-link ext-link-type="uri" xlink:href="http://www.pantherdb.org/">PantherDB</ext-link>,
                    <sup>
                        <xref ref-type="bibr" rid="ref47">47</xref>
                    </sup>
                    <sup>,</sup>
                    <sup>
                        <xref ref-type="bibr" rid="ref48">48</xref>
                    </sup> we advise against this due to a lack of traceability and reproducibility. Instead, readers are advised to make use of GO enrichment packages within the Bioconductor infrastructure. Many such packages exist, including 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/topGO.html">topGO</ext-link>
                    </monospace>,
                    <sup>
                        <xref ref-type="bibr" rid="ref49">49</xref>
                    </sup> 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://www.bioconductor.org/packages/release/bioc/html/GOfuncR.html">GOfuncR</ext-link>
                    </monospace>,
                    <sup>
                        <xref ref-type="bibr" rid="ref50">50</xref>
                    </sup> and 
                    <monospace>
                        <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html">clusterProfiler</ext-link>
                    </monospace>.
                    <sup>
                        <xref ref-type="bibr" rid="ref51">51</xref>
                    </sup> Here we will use the 
                    <monospace>enrichGO</monospace> function in the 
                    <monospace>clusterProfiler</monospace> package.</p>
                <p>First, we subset the accessions of proteins that we consider to be significantly up or downregulated. These will be our proteins of interest.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Subset significantly upregulated and downregulated proteins</styled-content>
                        </monospace>

                        <monospace>sig_up 
                            <styled-content style="color:#984806">&lt;-</styled-content> limma_results %&gt;%</monospace>

                        <monospace>  filter(direction == 
                            <styled-content style="color:#37A82E">"up"</styled-content>) %&gt;%</monospace>

                        <monospace>  filter(significance == 
                            <styled-content style="color:#37A82E">"sig"</styled-content>) %&gt;%</monospace>

                        <monospace>  pull(Protein)</monospace>


                        <monospace>sig_down 
                            <styled-content style="color:#984806">&lt;-</styled-content> limma_results %&gt;%</monospace>

                        <monospace>  filter(direction == 
                            <styled-content style="color:#37A82E">"down"</styled-content>) %&gt;%</monospace>

                        <monospace>  filter(significance == 
                            <styled-content style="color:#37A82E">"sig"</styled-content>) %&gt;%</monospace>

                        <monospace>  pull(Protein)</monospace>
                    </preformat>
                </p>
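                <p>As an optional sanity check, the subsetting logic can be tried on a small mock table before it is applied to the real results. The sketch below uses base R only; the data frame <monospace>toy_results</monospace>, its accessions and the level <monospace>"not.sig"</monospace> are hypothetical and serve purely as an illustration. It also confirms that every protein of interest is present in the full list of detected proteins.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical mock results table (illustration only)</monospace>

                        <monospace>toy_results &lt;- data.frame(</monospace>

                        <monospace>  Protein = c("A0001", "A0002", "A0003", "A0004"),</monospace>

                        <monospace>  direction = c("up", "down", "up", "down"),</monospace>

                        <monospace>  significance = c("sig", "sig", "not.sig", "sig"))</monospace>


                        <monospace>## Base R equivalent of the two filter() calls followed by pull()</monospace>

                        <monospace>toy_up &lt;- toy_results$Protein[toy_results$direction == "up" &amp;</monospace>

                        <monospace>                              toy_results$significance == "sig"]</monospace>

                        <monospace>toy_down &lt;- toy_results$Protein[toy_results$direction == "down" &amp;</monospace>

                        <monospace>                                toy_results$significance == "sig"]</monospace>


                        <monospace>## All proteins of interest should be part of the full protein list</monospace>

                        <monospace>stopifnot(all(c(toy_up, toy_down) %in% toy_results$Protein))</monospace>
                    </preformat>
                </p>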
                <p>Next, we input the UniProt IDs of up and downregulated proteins into the GO enrichment analyses, as demonstrated below. Importantly, we provide the list of proteins of interest as the foreground and a list of all proteins identified within the study as the background, or &#x2018;universe&#x2019;. The <monospace>keyType</monospace> argument tells the function that our protein accessions are in UniProt format. This allows the UniProt IDs to be mapped to the genome-wide annotation database for human (
                    <monospace>org.Hs.eg.db</monospace>). We also inform the function which GO categories we wish to consider, here &#x201c;ALL&#x201d;, meaning BP, MF and CC.</p>
                <p>As well as the information outlined above, there is the opportunity for users to specify various thresholds for statistical significance. These include thresholds on original and adjusted p-values (using the 
                    <monospace>pvalueCutoff</monospace> argument) as well as q-values (via the 
                    <monospace>qvalueCutoff</monospace> argument). Although many papers use &#x2018;q-value&#x2019; to mean &#x2018;BH-adjusted p-value&#x2019;, the two are not always the same, and users should be explicit about the statistical thresholds that they have applied. For exploratory purposes we will use the standard BH method for FDR control and set p-value, BH-adjusted p-value, and q-value thresholds of 0.05.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Search for enriched GO terms within upregulated proteins</styled-content>
                        </monospace>

                        <monospace>ego_up 
                            <styled-content style="color:#984806">&lt;-</styled-content> enrichGO(
                            <styled-content style="color:#CC9900">gene =</styled-content> sig_up,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">universe =</styled-content> limma_results$Protein,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">OrgDb =</styled-content> org.Hs.eg.db,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">keyType =</styled-content> 
                            <styled-content style="color:#37A82E">"UNIPROT"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">ont =</styled-content> 
                            <styled-content style="color:#37A82E">"ALL"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">pAdjustMethod =</styled-content> 
                            <styled-content style="color:#37A82E">"BH"</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">pvalueCutoff =</styled-content> 
                            <styled-content style="color:#000099">0.05</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">qvalueCutoff =</styled-content> 
                            <styled-content style="color:#000099">0.05</styled-content>,</monospace>

                        <monospace>                   
                            <styled-content style="color:#CC9900">readable =</styled-content> TRUE)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Check results</styled-content>
                        </monospace>

                        <monospace>ego_up</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## #</monospace>

                        <monospace>## # over-representation test</monospace>

                        <monospace>## #</monospace>

                        <monospace>## #&#x2026;@organism     Homo sapiens</monospace>

                        <monospace>## #&#x2026;@ontology     GOALL</monospace>

                        <monospace>## #&#x2026;@keytype      UNIPROT</monospace>

                        <monospace>## #&#x2026;@gene     chr [1:325] "Q9C0G0" "Q01581" "P15104" "Q9UK41" "P37268" "P04183" "Q9UHI8" &#x2026;</monospace>

                        <monospace>## #&#x2026;pvalues adjusted by &#x2019;BH&#x2019; with cutoff &lt;0.05</monospace>

                        <monospace>## #&#x2026;2 enriched terms found</monospace>

                        <monospace>## &#x2019;data.frame&#x2019;: 2 obs. of 10 variables:</monospace>

                        <monospace>##  $ ONTOLOGY    : chr "CC" "CC"</monospace>

                        <monospace>##  $ ID          : chr "GO:0005758" "GO:0031970"</monospace>

                        <monospace>##  $ Description : chr "mitochondrial intermembrane space" "organelle envelope lumen"</monospace>

                        <monospace>##  $ GeneRatio   : chr "15/319" "15/319"</monospace>

                        <monospace>##  $ BgRatio     : chr "45/3228" "49/3228"</monospace>

                        <monospace>##  $ pvalue      : num 1.32e-05 4.18e-05</monospace>

                        <monospace>##  $ p.adjust    : num 0.00507 0.008</monospace>

                        <monospace>##  $ qvalue      : num 0.00506 0.00798</monospace>

                        <monospace>##  $ geneID      : chr "CHCHD2/TIMM9/AK2/TIMM8B/COA4/COA6/MIX23/TIMM8A/DIABLO/TIMM13/TIMM10/TRIAP1/CYCS/COX17/CAT"</monospace>

                        <monospace>##  $ Count       : int 15 15</monospace>

                        <monospace>## #&#x2026;Citation</monospace>

                        <monospace>##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.</monospace>

                        <monospace>##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.</monospace>

                        <monospace>##  The Innovation. 2021, 2(3):100141</monospace>
                    </preformat>
                </p>
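                <p>The over-representation p-values above can be checked by hand. The sketch below assumes that the test is a one-sided hypergeometric test, which is the usual choice for over-representation analysis; we have not verified the internal implementation here, so the calculation should be treated as illustrative. The counts are read from the <monospace>GeneRatio</monospace> and <monospace>BgRatio</monospace> columns of the first enriched term, and the variable names are our own.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Counts for GO:0005758, taken from the output above</monospace>

                        <monospace>k &lt;- 15    # proteins of interest annotated with the term</monospace>

                        <monospace>n &lt;- 319   # annotated proteins of interest (GeneRatio denominator)</monospace>

                        <monospace>M &lt;- 45    # background proteins annotated with the term</monospace>

                        <monospace>N &lt;- 3228  # annotated background proteins (BgRatio denominator)</monospace>


                        <monospace>## One-sided hypergeometric test: probability of observing k or more</monospace>

                        <monospace>p_value &lt;- phyper(k - 1, M, N - M, n, lower.tail = FALSE)</monospace>


                        <monospace>## Fold enrichment relative to the background frequency</monospace>

                        <monospace>fold_enrichment &lt;- (k / n) / (M / N)</monospace>


                        <monospace>p_value</monospace>

                        <monospace>fold_enrichment</monospace>
                    </preformat>
                    If the assumption holds, <monospace>p_value</monospace> should be close to the value reported in the <monospace>pvalue</monospace> column for this term.
                </p>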
                <p>We can see from the results that there are 2 significantly enriched terms associated with the upregulated proteins. Next, we take a look at the downregulated proteins.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Search for enriched GO terms within downregulated proteins</styled-content>
                        </monospace>

                        <monospace>ego_down 
                            <styled-content style="color:#984806">&lt;-</styled-content> enrichGO(
                            <styled-content style="color:#CC9900">gene =</styled-content> sig_down,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">universe =</styled-content> limma_results$Protein,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">OrgDb =</styled-content> org.Hs.eg.db,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">keyType =</styled-content> 
                            <styled-content style="color:#37A82E">"UNIPROT"</styled-content>,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">ont =</styled-content> 
                            <styled-content style="color:#37A82E">"ALL"</styled-content>,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">pAdjustMethod =</styled-content> 
                            <styled-content style="color:#37A82E">"BH"</styled-content>,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">pvalueCutoff =</styled-content> 
                            <styled-content style="color:#000099">0.05</styled-content>,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">qvalueCutoff =</styled-content> 
                            <styled-content style="color:#000099">0.05</styled-content>,</monospace>

                        <monospace>                     
                            <styled-content style="color:#CC9900">readable =</styled-content> TRUE)</monospace>


                        <monospace>
                            <styled-content style="color:#984806">## Check results</styled-content>
                        </monospace>

                        <monospace>ego_down</monospace>
                    </preformat>

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## #</monospace>

                        <monospace>## # over-representation test</monospace>

                        <monospace>## #</monospace>

                        <monospace>## #&#x2026;@organism     Homo sapiens</monospace>

                        <monospace>## #&#x2026;@ontology     GOALL</monospace>

                        <monospace>## #&#x2026;@keytype      UNIPROT</monospace>

                        <monospace>## #&#x2026;@gene     chr [1:395] "Q53EL6" "P08243" "P35716" "Q92878" "P26583" "Q92522" "O43657" &#x2026;</monospace>

                        <monospace>## #&#x2026;pvalues adjusted by &#x2019;BH&#x2019; with cutoff &lt;0.05</monospace>

                        <monospace>## #&#x2026;69 enriched terms found</monospace>

                        <monospace>## &#x2019;data.frame&#x2019;:    69 obs. of 10 variables:</monospace>

                        <monospace>##  $ ONTOLOGY    : chr "BP" "BP" "BP" "BP" &#x2026;</monospace>

                        <monospace>##  $ ID          : chr "GO:0006310" "GO:0006520" "GO:0000725" "GO:0006302" &#x2026;</monospace>

                        <monospace>##  $ Description : chr "DNA recombination" "amino acid metabolic process" "recombinational repair" "double-strand break repair" &#x2026;</monospace>

                        <monospace>##  $ GeneRatio   : chr "36/378" "35/378" "22/378" "31/378" &#x2026;</monospace>

                        <monospace>##  $ BgRatio     : chr "94/3166" "112/3166" "55/3166" "98/3166" &#x2026;</monospace>

                        <monospace>##  $ pvalue      : num 2.27e-11 2.48e-08 8.41e-08 1.20e-07 1.01e-06 &#x2026;</monospace>

                        <monospace>##  $ p.adjust    : num 5.66e-08 3.09e-05 7.00e-05 7.48e-05 5.06e-04 &#x2026;</monospace>

                        <monospace>##  $ qvalue      : num 5.37e-08 2.93e-05 6.63e-05 7.09e-05 4.80e-04 &#x2026;</monospace>

                        <monospace>##  $ geneID      : chr "RAD50/HMGB2/H1-10/RADX/MRE11/H1-0/H1-2/ZMYND8/HMGB3/MCM5/NUCKS1/RAD21/PRKDC/SFPQ/MCM4/XRCC6/H1-3/MCM7/TFRC/XRCC"| __truncated__ "ASNS/PHGDH/SDSL/SARS1/YARS1/AARS2/HMGCL/IARS2/GARS1/AARS1/HIBADH/PYCR1/MCCC2/ACADSB/DHFR/MARS1/SLC25A12/ETFA/PS"| __truncated__ "RADX/MRE11/ZMYND8/MCM5/NUCKS1/RAD21/SFPQ/MCM4/XRCC6/MCM7/XRCC5/PPP4R2/POGZ/YY1/MCM3/MCM2/VPS72/PARP1/BRD8/MCM6/FUS/RECQL" "RAD50/HMGB2/RADX/MRE11/DEK/ZMYND8/MCM5/NUCKS1/RAD21/PRKDC/TP53/SFPQ/SMARCC2/MCM4/XRCC6/HPF1/MCM7/XRCC5/HMGB1/PP"| __truncated__ &#x2026;</monospace>

                        <monospace>##  $ Count       : int 36 35 22 31 20 56 57 18 14 56 &#x2026;</monospace>

                        <monospace>## # &#x2026;Citation</monospace>

                        <monospace>##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.</monospace>

                        <monospace>##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.</monospace>

                        <monospace>##  The Innovation. 2021, 2(3):100141</monospace>
                    </preformat>
                </p>
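                <p>The <monospace>p.adjust</monospace> and <monospace>qvalue</monospace> columns in the output above are close but not identical, which illustrates why the two should not be used interchangeably. As a minimal sketch, BH adjustment can be reproduced by hand in base R and compared with <monospace>p.adjust</monospace>; the five p-values below are hypothetical. q-values in the sense of Storey additionally rely on an estimate of the proportion of true null hypotheses and are not reproduced here.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## Hypothetical raw p-values (illustration only)</monospace>

                        <monospace>p &lt;- c(0.001, 0.01, 0.02, 0.04, 0.5)</monospace>

                        <monospace>n &lt;- length(p)</monospace>


                        <monospace>## Sort from largest to smallest and scale each p-value by n / rank</monospace>

                        <monospace>ord &lt;- order(p, decreasing = TRUE)</monospace>

                        <monospace>scaled &lt;- p[ord] * n / (n:1)</monospace>


                        <monospace>## Enforce monotonicity, cap at 1 and restore the original order</monospace>

                        <monospace>bh_manual &lt;- pmin(1, cummin(scaled))[order(ord)]</monospace>


                        <monospace>## Compare with the built-in implementation</monospace>

                        <monospace>bh_builtin &lt;- p.adjust(p, method = "BH")</monospace>

                        <monospace>all.equal(bh_manual, bh_builtin)</monospace>
                    </preformat>
                </p>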
                <p>The downregulated proteins contain 69 significantly enriched GO terms. There are many ways in which users can represent these results visually. Here, we create a barplot using the 
                    <monospace>barplot</monospace> function from the 
                    <monospace>enrichplot</monospace> package.
                    <sup>
                        <xref ref-type="bibr" rid="ref52">52</xref>
                    </sup> Users are directed to the vignette of the 
                    <monospace>enrichplot</monospace> package for additional visualisation options and guidance. We plot the first 10 GO terms, i.e., the 10 most significantly enriched terms.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>
                            <styled-content style="color:#984806">## Plot the results</styled-content>
                        </monospace>

                        <monospace>barplot(ego_down,</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">x =</styled-content> 
                            <styled-content style="color:#37A82E">"Count"</styled-content>,</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">showCategory =</styled-content> 
                            <styled-content style="color:#000099">10</styled-content>,</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">font.size =</styled-content> 
                            <styled-content style="color:#000099">12</styled-content>,</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">label_format =</styled-content> 
                            <styled-content style="color:#000099">28</styled-content>,</monospace>

                        <monospace>        
                            <styled-content style="color:#CC9900">colorBy =</styled-content> 
                            <styled-content style="color:#37A82E">"p.adjust"</styled-content>)</monospace>
                    </preformat>
                    <inline-graphic xlink:href="https://f1000research-files.f1000.com/manuscripts/152361/ba5b98bf-5740-4a80-8072-0714e29e82e0_figure28.gif"/>
                </p>
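                <p>The barplot displays counts and adjusted p-values, neither of which expresses the size of the enrichment directly. One common summary is the fold enrichment, i.e., the ratio of <monospace>GeneRatio</monospace> to <monospace>BgRatio</monospace>. Both columns are stored as character strings, so they must be converted first. The sketch below does this in base R for the first four terms shown in the <monospace>ego_down</monospace> output; the helper <monospace>ratio_to_numeric</monospace> is our own illustrative function and is not part of any package.

                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                        <monospace>## First four GeneRatio and BgRatio entries from the ego_down output</monospace>

                        <monospace>gene_ratio &lt;- c("36/378", "35/378", "22/378", "31/378")</monospace>

                        <monospace>bg_ratio &lt;- c("94/3166", "112/3166", "55/3166", "98/3166")</monospace>


                        <monospace>## Illustrative helper: convert "a/b" strings into numeric fractions</monospace>

                        <monospace>ratio_to_numeric &lt;- function(x) {</monospace>

                        <monospace>  parts &lt;- strsplit(x, "/", fixed = TRUE)</monospace>

                        <monospace>  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))</monospace>

                        <monospace>}</monospace>


                        <monospace>## Fold enrichment of each term relative to the background</monospace>

                        <monospace>fold_enrichment &lt;- ratio_to_numeric(gene_ratio) / ratio_to_numeric(bg_ratio)</monospace>

                        <monospace>round(fold_enrichment, 2)</monospace>
                    </preformat>
                </p>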
            </sec>
        </sec>
        <sec id="sec70">
            <title>Writing and exporting data</title>
            <p>Finally, we export the results of our statistical analyses as 
                <monospace>.csv</monospace> files.

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                    <monospace>
                        <styled-content style="color:#984806">## Save results of limma statistics</styled-content>
                    </monospace>

                    <monospace>write.csv(gene_results, 
                        <styled-content style="color:#CC9900">file =</styled-content> 
                        <styled-content style="color:#37A82E">"all_limma_results.csv"</styled-content>)</monospace>


                    <monospace>
                        <styled-content style="color:#984806">## Save subsets of upregulated and downregulated proteins</styled-content>
                    </monospace>

                    <monospace>write.csv(sig_upregulated, 
                        <styled-content style="color:#CC9900">file =</styled-content> 
                        <styled-content style="color:#37A82E">"upregulated_results.csv"</styled-content>)</monospace>

                    <monospace>write.csv(sig_downregulated, 
                        <styled-content style="color:#CC9900">file =</styled-content> 
                        <styled-content style="color:#37A82E">"downregulated_results.csv"</styled-content>)</monospace>


                    <monospace>
                        <styled-content style="color:#984806">## Save results of GO enrichment</styled-content>
                    </monospace>

                    <monospace>write.csv(ego_up, 
                        <styled-content style="color:#CC9900">file =</styled-content> 
                        <styled-content style="color:#37A82E">"upregulated_go_enrichment.csv"</styled-content>)</monospace>

                    <monospace>write.csv(ego_down, 
                        <styled-content style="color:#CC9900">file =</styled-content> 
                        <styled-content style="color:#37A82E">"downregulated_go_enrichment.csv"</styled-content>)</monospace>
                </preformat>
            </p>
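            <p>By default, <monospace>write.csv</monospace> writes the row names as an additional unnamed first column. Where this is not wanted, <monospace>row.names = FALSE</monospace> can be supplied. The sketch below demonstrates a round trip through a temporary file; the table <monospace>toy_results</monospace> and its values are hypothetical.

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                    <monospace>## Hypothetical results table (illustration only)</monospace>

                    <monospace>toy_results &lt;- data.frame(Protein = c("A0001", "A0002"),</monospace>

                    <monospace>                          logFC = c(1.2, -0.8))</monospace>


                    <monospace>## Write to a temporary file without row names, then read it back</monospace>

                    <monospace>out_file &lt;- tempfile(fileext = ".csv")</monospace>

                    <monospace>write.csv(toy_results, file = out_file, row.names = FALSE)</monospace>

                    <monospace>reloaded &lt;- read.csv(out_file)</monospace>


                    <monospace>## Confirm that column names and dimensions survived the round trip</monospace>

                    <monospace>stopifnot(identical(names(reloaded), names(toy_results)),</monospace>

                    <monospace>          identical(dim(reloaded), dim(toy_results)))</monospace>
                </preformat>
            </p>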
            <p>Users can also use the 
                <monospace>
                    <ext-link ext-link-type="uri" xlink:href="https://www.rdocumentation.org/packages/ggplot2/versions/0.9.0/topics/ggsave">ggsave</ext-link>
                </monospace> function to export any of the figures generated.</p>
        </sec>
        <sec id="sec71">
            <title>Discussion and conclusion</title>
            <p>Expression proteomics is becoming an increasingly important tool in modern molecular biology. As more researchers participate in expression proteomics, either by collecting data or accessing data collected by others, there is a need for clear illustrations of how to deal with such complex data.</p>
            <p>Existing bottom-up proteomics workflows for differential expression analysis either provide pipelines with limited user control and flexibility (e.g., 
                <monospace>MSstats</monospace> and 
                <monospace>MSstatsTMT</monospace>
                <sup>
                    <xref ref-type="bibr" rid="ref38">38</xref>
                </sup>
                <sup>,</sup>
                <sup>
                    <xref ref-type="bibr" rid="ref39">39</xref>
                </sup>), can only be applied to specific data formats (e.g., 
                <monospace>Proteus</monospace> which is limited to input from MaxQuant
                <sup>
                    <xref ref-type="bibr" rid="ref53">53</xref>
                </sup>), or provide very limited commentary. The latter directly contributes to a problematic disconnect between researchers and their data whereby the users do not understand if or why each step is necessary for their given dataset and biological question. This can prevent researchers from refining a workflow to fit their specific needs. Finally, the majority of proteomics workflows utilise 
                <monospace>data.frame</monospace> or 
                <monospace>tibble</monospace> structures, which limits their traceability, as is the case for 
                <monospace>protti</monospace>, 
                <monospace>promor</monospace> and 
                <monospace>prolfqua</monospace>.
                <sup>
                    <xref ref-type="bibr" rid="ref54">54</xref>
                </sup>
                <sup>&#x2013;</sup>
                <sup>
                    <xref ref-type="bibr" rid="ref56">56</xref>
                </sup>
            </p>
            <p>The workflow presented here outlines in full how to process, analyse and interpret LFQ and TMT expression proteomics data derived from a bottom-up DDA experiment. Critically, we emphasize quality control and data-guided decisions with an extensive explanation of all key steps and how they may differ in various scenarios (e.g., the quantitation method, instrumentation and biological question). Our workflow takes advantage of the relatively recent 
                <monospace>QFeatures</monospace> infrastructure to ensure explicit and transparent data pre-processing as well as to provide an easy way for users to trace back through their analyses. These features are particularly important for beginners who wish to gain a better understanding of their data and how it changes throughout this workflow.</p>
            <p>No single workflow can demonstrate the processing, analysis and interpretation of all proteomics data. Our workflow is currently suitable for DDA datasets with label-free or TMT-based quantitation. We do not include examples of experiments that combine data from multiple TMTplexes, although the code provided could easily be expanded to include such a scenario. This workflow provides an in-depth, user-friendly pipeline for both new and experienced proteomics data analysts.</p>
        </sec>
        <sec id="sec72">
            <title>Session information and getting help</title>
            <p>The workflows provided use functions from many different R/Bioconductor packages. The <monospace>sessionInfo()</monospace> function provides an easy way to summarise all packages, and their corresponding versions, used to generate this document. Should software updates lead to errors or to results that differ from those demonstrated here, such changes should be easy to trace.

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                    <monospace>
                        <styled-content style="color:#984806">## Print session information</styled-content>
                    </monospace>

                    <monospace>sessionInfo()</monospace>
                </preformat>

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                    <monospace>## R version 4.3.0 (2023-04-21)</monospace>

                    <monospace>## Platform: x86_64-apple-darwin20 (64-bit)</monospace>

                    <monospace>## Running under: macOS Ventura 13.4</monospace>

                    <monospace>##</monospace>

                    <monospace>## Matrix products: default</monospace>

                    <monospace>## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib</monospace>

                    <monospace>## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0</monospace>

                    <monospace>##</monospace>

                    <monospace>## locale:</monospace>

                    <monospace>## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8</monospace>

                    <monospace>##</monospace>

                    <monospace>## time zone: Europe/London</monospace>

                    <monospace>## tzcode source: internal</monospace>

                    <monospace>##</monospace>

                    <monospace>## attached base packages:</monospace>

                    <monospace>## [1] stats4    stats    graphics    grDevices    utils   datasets   methods</monospace>

                    <monospace>## [8] base</monospace>

                    <monospace>##</monospace>

                    <monospace>## other attached packages:</monospace>

                    <monospace>##  [1] patchwork_1.1.2             enrichplot_1.20.0</monospace>

                    <monospace>##  [3] clusterProfiler_4.8.1       org.Hs.eg.db_3.17.0</monospace>

                    <monospace>##  [5] AnnotationDbi_1.62.1        limma_3.56.2</monospace>

                    <monospace>##  [7] Biostrings_2.68.1           XVector_0.40.0</monospace>

                    <monospace>##  [9] corrplot_0.92               NormalyserDE_1.18.0</monospace>

                    <monospace>## [11] tibble_3.2.1                dplyr_1.1.2</monospace>

                    <monospace>## [13] stringr_1.5.0               ggplot2_3.4.2</monospace>

                    <monospace>## [15] QFeatures_1.10.0            MultiAssayExperiment_1.26.0</monospace>

                    <monospace>## [17] SummarizedExperiment_1.30.2 Biobase_2.60.0</monospace>

                    <monospace>## [19] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0</monospace>

                    <monospace>## [21] IRanges_2.34.0              S4Vectors_0.38.1</monospace>

                    <monospace>## [23] BiocGenerics_0.46.0         MatrixGenerics_1.12.2</monospace>

                    <monospace>## [25] matrixStats_1.0.0</monospace>

                    <monospace>##</monospace>

                    <monospace>## loaded via a namespace (and not attached):</monospace>

                    <monospace>## [1] splines_4.3.0           bitops_1.0-7               ggplotify_0.1.0</monospace>

                    <monospace>## [4] cellranger_1.1.0        polyclip_1.10-4            preprocessCore_1.62.1</monospace>

                    <monospace>## [7] rpart_4.1.19            lifecycle_1.0.3            lattice_0.21-8</monospace>

                    <monospace>## [10] MASS_7.3-60            backports_1.4.1            magrittr_2.0.3</monospace>

                    <monospace>## [13] Hmisc_5.1-0            rmarkdown_2.22             yaml_2.3.7</monospace>

                    <monospace>## [16] sp_1.6-1               cowplot_1.1.1              MsCoreUtils_1.12.0</monospace>

                    <monospace>## [19] DBI_1.1.3              RColorBrewer_1.1-3         abind_1.4-5</monospace>

                    <monospace>## [22] zlibbioc_1.46.0        purrr_1.0.1                AnnotationFilter_1.24.0</monospace>

                    <monospace>## [25] ggraph_2.1.0           RCurl_1.98-1.12            yulab.utils_0.0.6</monospace>

                    <monospace>## [28] nnet_7.3-19            tweenr_2.0.2               sandwich_3.0-2</monospace>

                    <monospace>## [31] git2r_0.32.0           GenomeInfoDbData_1.2.10    ggrepel_0.9.3</monospace>

                    <monospace>## [34] tidytree_0.4.2         terra_1.7-37               nortest_1.0-4</monospace>

                    <monospace>## [37] codetools_0.2-19       DelayedArray_0.26.3        DOSE_3.26.1</monospace>

                    <monospace>## [40] ggforce_0.4.1          tidyselect_1.2.0           RcmdrMisc_2.7-2</monospace>

                    <monospace>## [43] aplot_0.1.10           raster_3.6-20              farver_2.1.1</monospace>

                    <monospace>## [46] viridis_0.6.3          base64enc_0.1-3            jsonlite_1.8.5</monospace>

                    <monospace>## [49] e1071_1.7-13           tidygraph_1.2.3            Formula_1.2-5</monospace>

                    <monospace>## [52] tools_4.3.0            treeio_1.24.1              Rcpp_1.0.10</monospace>

                    <monospace>## [55] glue_1.6.2             BiocBaseUtils_1.2.0        gridExtra_2.3</monospace>

                    <monospace>## [58] xfun_0.39              qvalue_2.32.0              usethis_2.2.0</monospace>

                    <monospace>## [61] withr_2.5.0            BiocManager_1.30.21        fastmap_1.1.1</monospace>

                    <monospace>## [64] fansi_1.0.4            digest_0.6.31              R6_2.5.1</monospace>

                    <monospace>## [67] gridGraphics_0.5-1     colorspace_2.1-0           GO.db_3.17.0</monospace>

                    <monospace>## [70] RSQLite_2.3.1          utf8_1.2.3                 tidyr_1.3.0</monospace>

                    <monospace>## [73] generics_0.1.3         data.table_1.14.8          class_7.3-22</monospace>

                    <monospace>## [76] graphlayouts_1.0.0     httr_1.4.6                 htmlwidgets_1.6.2</monospace>

                    <monospace>## [79] S4Arrays_1.0.4         scatterpie_0.2.1           pkgconfig_2.0.3</monospace>

                    <monospace>## [82] gtable_0.3.3           blob_1.2.4                 impute_1.74.1</monospace>

                    <monospace>## [85] shadowtext_0.1.2       htmltools_0.5.5            carData_3.0-5</monospace>

                    <monospace>## [88] bookdown_0.34          fgsea_1.26.0               ProtGenerics_1.32.0</monospace>

                    <monospace>## [91] clue_0.3-64            scales_1.2.1               png_0.1-8</monospace>

                    <monospace>## [94] ggfun_0.1.0            knitr_1.43                 rstudioapi_0.14</monospace>

                    <monospace>## [97] reshape2_1.4.4         nlme_3.1-162               checkmate_2.2.0</monospace>

                    <monospace>## [100] proxy_0.4-27          cachem_1.0.8               zoo_1.8-12</monospace>

                    <monospace>## [103] parallel_4.3.0        HDO.db_0.99.1              foreign_0.8-84</monospace>

                    <monospace>## [106] pillar_1.9.0          grid_4.3.0                 vctrs_0.6.3</monospace>

                    <monospace>## [109] car_3.1-2             cluster_2.1.4              htmlTable_2.4.1</monospace>

                    <monospace>## [112] evaluate_0.21         cli_3.6.1                  compiler_4.3.0</monospace>

                    <monospace>## [115] rlang_1.1.1           crayon_1.5.2               labeling_0.4.2</monospace>

                    <monospace>## [118] plyr_1.8.8            forcats_1.0.0              fs_1.6.2</monospace>

                    <monospace>## [121] stringi_1.7.12        viridisLite_0.4.2          BiocParallel_1.34.2</monospace>

                    <monospace>## [124] munsell_0.5.0         lazyeval_0.2.2             GOSemSim_2.26.0</monospace>

                    <monospace>## [127] Matrix_1.5-4.1        hms_1.1.3                  bit64_4.0.5</monospace>

                    <monospace>## [130] KEGGREST_1.40.0       haven_2.5.2                igraph_1.5.0</monospace>

                    <monospace>## [133] memoise_2.0.1         BiocWorkflowTools_1.26.0   ggtree_3.8.0</monospace>

                    <monospace>## [136] fastmatch_1.1-3       bit_4.0.5                  readxl_1.4.2</monospace>

                    <monospace>## [139] downloader_0.4        gson_0.1.0                 ape_5.7-1</monospace>
                </preformat>
            </p>
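            <p>As a complement to the session information above, the following minimal sketch compares a few of the recorded package versions with those installed locally. The helper <monospace>compare_versions()</monospace> is hypothetical and not part of the workflow or of any package; it assumes only base R, and the three packages shown are an arbitrary subset of the versions listed above.

```r
## Hypothetical helper (not part of the workflow): compare recorded package
## versions with the versions installed in the current R library
compare_versions = function(recorded) {
  installed = vapply(
    names(recorded),
    function(pkg) {
      ## packageVersion() throws an error if the package is not installed
      tryCatch(
        as.character(utils::packageVersion(pkg)),
        error = function(e) NA_character_
      )
    },
    character(1)
  )
  data.frame(
    package = names(recorded),
    recorded = unname(recorded),
    installed = unname(installed),
    ## TRUE only when the package is installed at exactly the recorded version
    matches = unname(ifelse(is.na(installed), FALSE, installed == recorded))
  )
}

## A subset of the versions listed in the session information
recorded = c(QFeatures = "1.10.0", limma = "3.56.2", ggplot2 = "3.4.2")
print(compare_versions(recorded))
```

            Packages that are not installed are reported with a missing value in the <monospace>installed</monospace> column. A mismatch is not necessarily a problem, but it is a sensible first place to look if results differ from those shown in this document.</p>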
            <p>Users are advised to update 
                <monospace>R</monospace> itself as well as packages as required. Bioconductor packages can be updated using the 
                <monospace>BiocManager::install()</monospace> function, as shown below.

                <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">

                    <monospace>if (!require(
                        <styled-content style="color:#37A82E">"BiocManager"</styled-content>, 
                        <styled-content style="color:#CC9900">quietly =</styled-content> TRUE)) {</monospace>

                    <monospace>  install.packages(
                        <styled-content style="color:#37A82E">"BiocManager"</styled-content>)</monospace>

                    <monospace>}</monospace>

                    <monospace>BiocManager::install()</monospace>
                </preformat>
            </p>
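            <p>Where only some of the required packages are absent, it may be preferable to install those alone rather than update the whole library. The sketch below illustrates one way to do this; the helper <monospace>find_missing()</monospace> is hypothetical, the package names are an arbitrary subset of those used in this workflow, and the installation call is left commented out so that nothing is changed unintentionally.

```r
## Hypothetical helper (not part of the workflow): report which of the
## requested packages are not present in the current R library
find_missing = function(pkgs) {
  setdiff(pkgs, rownames(utils::installed.packages()))
}

## An arbitrary subset of the packages used in this workflow
workflow_pkgs = c("QFeatures", "limma", "clusterProfiler")
missing_pkgs = find_missing(workflow_pkgs)
print(missing_pkgs)

## Uncomment to install only the missing packages, leaving all
## other installed packages at their current versions:
## BiocManager::install(missing_pkgs, update = FALSE, ask = FALSE)

## Uncomment to check whether installed packages are consistent
## with the Bioconductor release in use:
## BiocManager::valid()
```

            Setting <monospace>update = FALSE</monospace> keeps previously installed packages unchanged, which can help when trying to stay close to the versions recorded in the session information.</p>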
        </sec>
        <sec id="sec73">
            <title>Author contributions</title>
            <p>C. H.: conceptualisation, investigation, methodology, project administration, software, validation, writing &#x2013; original draft preparation, review and editing; C. S. D.: software, writing &#x2013; review and editing; T. K.: methodology, supervision, software, writing &#x2013; review and editing; K. S. L.: funding acquisition, supervision, writing &#x2013; review and editing; L. M. B.: conceptualisation, methodology, supervision, writing &#x2013; review and editing.</p>
        </sec>
    </body>
    <back>
        <sec id="sec74" sec-type="data-availability">
            <title>Data availability</title>
            <p>This workflow is written in the R statistical programming language and uses freely available open-source software packages from 
                <ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org/">CRAN</ext-link> and 
                <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/">Bioconductor</ext-link>. Version numbers for all packages are shown in the Session information section.</p>
            <p>Raw mass spectrometry data is freely available online through the ProteomeXchange Consortium via the PRIDE repository with identifier PXD041794. All processed data is available at 
                <ext-link ext-link-type="uri" xlink:href="http://doi.org/10.5281/zenodo.7837375">http://doi.org/10.5281/zenodo.7837375</ext-link> and in the GitHub repository 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics">https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics</ext-link>.</p>
        </sec>
        <ack>
            <title>Acknowledgements</title>
            <p>The authors would like to thank Savvas Kourtis from the Centre for Genomic Regulation, Barcelona, and Oliver M. Crook from the Department of Statistics, Oxford University, UK, for trialling this workflow.</p>
        </ack>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pina-Jim&#x00e9;nez</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Calzada</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bautista</surname>
                            <given-names>E</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Incomptine A induces apoptosis, ROS production and a differential protein expression on non-Hodgkin&#x2019;s lymphoma cells.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Mol. Sci.</italic>
</source>
                    <year>September 2021</year>;<volume>22</volume>(<issue>19</issue>):<fpage>10516</fpage>.
                    <pub-id pub-id-type="pmid">34638856</pub-id>
                    <pub-id pub-id-type="doi">10.3390/ijms221910516</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8508949</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Amiri-Dashatan</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ahmadi</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rezaei-Tavirani</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Identification of differential protein expression and putative drug target in metacyclic stage of Leishmania major and Leishmania tropica: A quantitative proteomics and computational view.</article-title>
                    <source>

                        <italic toggle="yes">Comp. Immunol. Microbiol. Infect. Dis.</italic>
</source>
                    <year>April 2021</year>;<volume>75</volume>:<fpage>101617</fpage>.
                    <pub-id pub-id-type="pmid">33581562</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.cimid.2021.101617</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Anitua</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Fuente</surname>
                            <given-names>M</given-names>
                            <prefix>de la</prefix>
                        </name>

                        <name name-style="western">
                            <surname>Muruzabal</surname>
                            <given-names>F</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Differential profile of protein expression on human keratocytes treated with autologous serum and plasma rich in growth factors (PRGF).</article-title>
                    <source>

                        <italic toggle="yes">PLoS One.</italic>
</source>
                    <year>October 2018</year>;<volume>13</volume>(<issue>10</issue>):<fpage>e0205073</fpage>.
                    <pub-id pub-id-type="pmid">30312303</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0205073</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6193583</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Dupree</surname>
                            <given-names>EJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jayathirtha</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yorkey</surname>
                            <given-names>H</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A critical review of bottom-up proteomics: The good, the bad, and the future of this field.</article-title>
                    <source>

                        <italic toggle="yes">Proteomes.</italic>
</source>
                    <year>July 2020</year>;<volume>8</volume>(<issue>3</issue>):<fpage>14</fpage>.
                    <pub-id pub-id-type="pmid">32640657</pub-id>
                    <pub-id pub-id-type="doi">10.3390/proteomes8030014</pub-id>
                    <pub-id pub-id-type="pmcid">PMC7564415</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Obermaier</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Griebel</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Westermeier</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <chapter-title>Principles of protein labeling techniques.</chapter-title>
                    <source>

                        <italic toggle="yes">Methods in Molecular Biology.</italic>
</source>
                    <publisher-loc>New York</publisher-loc>:
                    <publisher-name>Springer</publisher-name>;<year>2015</year>;<fpage>153</fpage>&#x2013;<lpage>165</lpage>.
                    <pub-id pub-id-type="doi">10.1007/978-1-4939-2550-6_13</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Fern&#x00e1;ndez-Costa</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Mart&#x00ed;nez-Bartolom&#x00e9;</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McClatchy</surname>
                            <given-names>DB</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Impact of the identification strategy on the reproducibility of the DDA and DIA results.</article-title>
                    <source>

                        <italic toggle="yes">J. Proteome Res.</italic>
</source>
                    <year>June 2020</year>;<volume>19</volume>(<issue>8</issue>):<fpage>3153</fpage>&#x2013;<lpage>3161</lpage>.
                    <pub-id pub-id-type="pmid">32510229</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.jproteome.0c00153</pub-id>
                    <pub-id pub-id-type="pmcid">PMC7898222</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hu</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Noble</surname>
                            <given-names>WS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wolf-Yadlin</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Technical advances in proteomics: new developments in data-independent acquisition.</article-title>
                    <source>

                        <italic toggle="yes">F1000Res.</italic>
</source>
                    <year>March 2016</year>;<volume>5</volume>:<fpage>419</fpage>.
                    <pub-id pub-id-type="pmid">27092249</pub-id>
                    <pub-id pub-id-type="doi">10.12688/f1000research.7042.1</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4821292</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="book">
                    <collab>R Development Core Team</collab>:
                    <source>

                        <italic toggle="yes">R: A Language and Environment for Statistical Computing.</italic>
</source>
                    <publisher-loc>Vienna, Austria</publisher-loc>:
                    <publisher-name>R Foundation for Statistical Computing</publisher-name>;<year>2011</year>.
                    <isbn>3-900051-07-0</isbn>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.R-project.org/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Carey</surname>
                            <given-names>VJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Orchestrating high-throughput genomic analysis with Bioconductor.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Methods.</italic>
</source>
                    <year>January 2015</year>;<volume>12</volume>(<issue>2</issue>):<fpage>115</fpage>&#x2013;<lpage>121</lpage>.
                    <pub-id pub-id-type="pmid">25633503</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3252</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4509590</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hutchings</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dawson</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Krueger</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A Bioconductor workflow for processing, evaluating and interpreting expression proteomics data.</article-title>
                    <year>2023</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://zenodo.org/record/7837375">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>McAlister</surname>
                            <given-names>GC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Nusinow</surname>
                            <given-names>DP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jedrychowski</surname>
                            <given-names>MP</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes.</article-title>
                    <source>

                        <italic toggle="yes">Anal. Chem.</italic>
</source>
                    <year>July 2014</year>;<volume>86</volume>(<issue>14</issue>):<fpage>7150</fpage>&#x2013;<lpage>7158</lpage>.
                    <pub-id pub-id-type="pmid">24927332</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ac502040v</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4215866</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ting</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rad</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gygi</surname>
                            <given-names>SP</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MS3 eliminates ratio distortion in isobaric multiplexed quantitative proteomics.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Methods.</italic>
</source>
                    <year>October 2011</year>;<volume>8</volume>(<issue>11</issue>):<fpage>937</fpage>&#x2013;<lpage>940</lpage>.
                    <pub-id pub-id-type="pmid">21963607</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.1714</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3205343</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Plubell</surname>
                            <given-names>DL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wilmarth</surname>
                            <given-names>PA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zhao</surname>
                            <given-names>Y</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue.</article-title>
                    <source>

                        <italic toggle="yes">Mol. Cell. Proteomics.</italic>
</source>
                    <year>May 2017</year>;<volume>16</volume>(<issue>5</issue>):<fpage>873</fpage>&#x2013;<lpage>890</lpage>.
                    <pub-id pub-id-type="pmid">28325852</pub-id>
                    <pub-id pub-id-type="doi">10.1074/mcp.m116.065524</pub-id>
                    <pub-id pub-id-type="pmcid">PMC5417827</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Brenes</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hukelmann</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bensaddek</surname>
                            <given-names>D</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Multibatch TMT reveals false positives, batch effects and missing values.</article-title>
                    <source>

                        <italic toggle="yes">Mol. Cell. Proteomics.</italic>
</source>
                    <year>October 2019</year>;<volume>18</volume>(<issue>10</issue>):<fpage>1967</fpage>&#x2013;<lpage>1980</lpage>.
                    <pub-id pub-id-type="pmid">31332098</pub-id>
                    <pub-id pub-id-type="doi">10.1074/mcp.ra119.001472</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6773557</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Perez-Riverol</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bai</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bandla</surname>
                            <given-names>C</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>November 2021</year>;<volume>50</volume>(<issue>D1</issue>):<fpage>D543</fpage>&#x2013;<lpage>D552</lpage>.
                    <pub-id pub-id-type="pmid">34723319</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkab1038</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8728295</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Deutsch</surname>
                            <given-names>EW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bandeira</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Perez-Riverol</surname>
                            <given-names>Y</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The ProteomeXchange consortium at 10 years: 2023 update.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>November 2022</year>;<volume>51</volume>(<issue>D1</issue>):<fpage>D1539</fpage>&#x2013;<lpage>D1548</lpage>.
                    <pub-id pub-id-type="pmid">36370099</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkac1040</pub-id>
                    <pub-id pub-id-type="pmcid">PMC9825490</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gatto</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Vanderaa</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>QFeatures: Quantitative features for mass spectrometry data.</article-title> R package version 1.9.2. <year>2023</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/RforMassSpectrometry/QFeatures">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <label>18</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Morgan</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Obenchain</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hester</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>SummarizedExperiment: SummarizedExperiment container.</article-title> R package version 1.29.1. <year>2022</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/SummarizedExperiment">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rainer</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Vicini</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Salzer</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A modular and expandable ecosystem for metabolomics data annotation in R.</article-title>
                    <source>

                        <italic toggle="yes">Metabolites.</italic>
</source>
                    <year>February 2022</year>;<volume>12</volume>(<issue>2</issue>):<fpage>173</fpage>.
                    <pub-id pub-id-type="pmid">35208247</pub-id>
                    <pub-id pub-id-type="doi">10.3390/metabo12020173</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8878271</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Frankenfield</surname>
                            <given-names>AM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ni</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ahmed</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Protein contaminants matter: Building universal protein contaminant libraries for DDA and DIA proteomics.</article-title>
                    <source>

                        <italic toggle="yes">J. Proteome Res.</italic>
</source>
                    <year>July 2022</year>;<volume>21</volume>(<issue>9</issue>):<fpage>2104</fpage>&#x2013;<lpage>2113</lpage>.
                    <pub-id pub-id-type="pmid">35793413</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.jproteome.2c00145</pub-id>
                    <pub-id pub-id-type="pmcid">PMC10040255</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref21">
                <label>21</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pages</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Aboyoun</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gentleman</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Biostrings: Efficient Manipulation of Biological Strings.</article-title> R package version 2.66.0. <year>2022</year>.</mixed-citation>
            </ref>
            <ref id="ref22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Karpievitch</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Stanley</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Taverner</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A statistical framework for protein quantitation in bottom-up MS-based proteomics.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>June 2009</year>;<volume>25</volume>(<issue>16</issue>):<fpage>2028</fpage>&#x2013;<lpage>2034</lpage>.
                    <pub-id pub-id-type="pmid">19535538</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp362</pub-id>
                    <pub-id pub-id-type="pmcid">PMC2723007</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Lazar</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gatto</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ferro</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies.</article-title>
                    <source>

                        <italic toggle="yes">J. Proteome Res.</italic>
</source>
                    <year>March 2016</year>;<volume>15</volume>(<issue>4</issue>):<fpage>1116</fpage>&#x2013;<lpage>1125</lpage>.
                    <pub-id pub-id-type="pmid">26906401</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.jproteome.5b00981</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sticker</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Goeminne</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Martens</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Robust summarization and inference in proteome-wide label-free quantification.</article-title>
                    <source>

                        <italic toggle="yes">Mol. Cell. Proteomics.</italic>
</source>
                    <year>July 2020</year>;<volume>19</volume>(<issue>7</issue>):<fpage>1209</fpage>&#x2013;<lpage>1219</lpage>.
                    <pub-id pub-id-type="pmid">32321741</pub-id>
                    <pub-id pub-id-type="doi">10.1074/mcp.ra119.001624</pub-id>
                    <pub-id pub-id-type="pmcid">PMC7338080</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goeminne</surname>
                            <given-names>LJE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gevaert</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Clement</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>Peptide-level robust ridge regression improves estimation, sensitivity, and specificity in data-dependent quantitative label-free shotgun proteomics.</article-title>
                    <source>

                        <italic toggle="yes">Mol. Cell. Proteomics.</italic>
</source>
                    <year>February 2016</year>;<volume>15</volume>(<issue>2</issue>):<fpage>657</fpage>&#x2013;<lpage>668</lpage>.
                    <pub-id pub-id-type="pmid">26566788</pub-id>
                    <pub-id pub-id-type="doi">10.1074/mcp.m115.055897</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4739679</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>O&#x2019;Rourke</surname>
                            <given-names>MB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Town</surname>
                            <given-names>SEL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dalla</surname>
                            <given-names>PV</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>What is normalization? The strategies employed in top-down and bottom-up proteome analysis workflows.</article-title>
                    <source>

                        <italic toggle="yes">Proteomes.</italic>
</source>
                    <year>August 2019</year>;<volume>7</volume>(<issue>3</issue>):<fpage>29</fpage>.
                    <pub-id pub-id-type="pmid">31443461</pub-id>
                    <pub-id pub-id-type="doi">10.3390/proteomes7030029</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6789750</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref27">
                <label>27</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Willforss</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chawade</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Levander</surname>
                            <given-names>F</given-names>
                        </name>
</person-group>:
                    <article-title>NormalyzerDE: Online tool for improved normalization of omics expression data and high-sensitivity differential expression analysis.</article-title>
                    <source>

                        <italic toggle="yes">J. Proteome Res.</italic>
</source>
                    <year>October 2018</year>;<volume>18</volume>(<issue>2</issue>):<fpage>732</fpage>&#x2013;<lpage>740</lpage>.
                    <pub-id pub-id-type="pmid">30277078</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.jproteome.8b00523</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref28">
                <label>28</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bolstad</surname>
                            <given-names>B</given-names>
                        </name>
</person-group>:
                    <article-title>preprocessCore: A collection of pre-processing functions.</article-title> R package version 1.60.2. <year>2023</year>.</mixed-citation>
            </ref>
            <ref id="ref29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Andersen</surname>
                            <given-names>CL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jensen</surname>
                            <given-names>JL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>&#x00d8;rntoft</surname>
                            <given-names>TF</given-names>
                        </name>
</person-group>:
                    <article-title>Normalization of real-time quantitative reverse transcription-PCR data: A model-based variance estimation approach to identify genes suited for normalization, applied to bladder and colon cancer data sets.</article-title>
                    <source>

                        <italic toggle="yes">Cancer Res.</italic>
</source>
                    <year>August 2004</year>;<volume>64</volume>(<issue>15</issue>):<fpage>5245</fpage>&#x2013;<lpage>5250</lpage>.
                    <pub-id pub-id-type="pmid">15289330</pub-id>
                    <pub-id pub-id-type="doi">10.1158/0008-5472.can-04-0496</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref30">
                <label>30</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Huber</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Heydebreck</surname>
                            <given-names>A</given-names>
                            <prefix>von</prefix>
                        </name>

                        <name name-style="western">
                            <surname>S&#x00fc;ltmann</surname>
                            <given-names>H</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Variance stabilization applied to microarray data calibration and to the quantification of differential expression.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>July 2002</year>;<volume>18</volume>(<issue>suppl_1</issue>):<fpage>S96</fpage>&#x2013;<lpage>S104</lpage>.
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/18.suppl_1.s96</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref31">
                <label>31</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Smyth</surname>
                            <given-names>GK</given-names>
                        </name>
</person-group>:
                    <article-title>Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.</article-title>
                    <source>

                        <italic toggle="yes">Stat. Appl. Genet. Mol. Biol.</italic>
</source>
                    <year>January 2004</year>;<volume>3</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>25</lpage>.
                    <pub-id pub-id-type="pmid">16646809</pub-id>
                    <pub-id pub-id-type="doi">10.2202/1544-6115.1027</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref32">
                <label>32</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goeminne</surname>
                            <given-names>LJE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sticker</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Martens</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MSqRob takes the missing hurdle: Uniting intensity- and count-based proteomics.</article-title>
                    <source>

                        <italic toggle="yes">Anal. Chem.</italic>
</source>
                    <year>March 2020</year>;<volume>92</volume>(<issue>9</issue>):<fpage>6278</fpage>&#x2013;<lpage>6287</lpage>.
                    <pub-id pub-id-type="pmid">32227882</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.analchem.9b04375</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref33">
                <label>33</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dongre</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Proper imputation of missing values in proteomics datasets for differential expression analysis.</article-title>
                    <source>

                        <italic toggle="yes">Brief. Bioinform.</italic>
</source>
                    <year>June 2021</year>;<volume>22</volume>(<issue>3</issue>).
                    <pub-id pub-id-type="pmid">32520347</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bib/bbaa112</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref34">
                <label>34</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Irizarry</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Correlation is not a measure of reproducibility.</article-title>
                    <year>2015</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://simplystatistics.org/posts/2015-08-12-correlation-is-not-a-measure-of-reproducibility/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref35">
                <label>35</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bunting</surname>
                            <given-names>KV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Steeds</surname>
                            <given-names>RP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Slater</surname>
                            <given-names>LT</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A practical guide to assess the reproducibility of echocardiographic measurements.</article-title>
                    <source>

                        <italic toggle="yes">J. Am. Soc. Echocardiogr.</italic>
</source>
                    <year>December 2019</year>;<volume>32</volume>(<issue>12</issue>):<fpage>1505</fpage>&#x2013;<lpage>1515</lpage>.
                    <pub-id pub-id-type="pmid">31653530</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.echo.2019.08.015</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref36">
                <label>36</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Darbani</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Stewart</surname>
                            <given-names>CN</given-names>
                        </name>
</person-group>:
                    <article-title>Reproducibility and reliability assays of the gene expression-measurements.</article-title>
                    <source>

                        <italic toggle="yes">J. Biol. Res. (Thessalon).</italic>
</source>
                    <year>May 2014</year>;<volume>21</volume>(<issue>1</issue>).
                    <pub-id pub-id-type="pmid">25984486</pub-id>
                    <pub-id pub-id-type="doi">10.1186/2241-5793-21-3</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4376515</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref37">
                <label>37</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Leek</surname>
                            <given-names>JT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Scharpf</surname>
                            <given-names>RB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bravo</surname>
                            <given-names>HC</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Tackling the widespread and critical impact of batch effects in high-throughput data.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Rev. Genet.</italic>
</source>
                    <year>September 2010</year>;<volume>11</volume>(<issue>10</issue>):<fpage>733</fpage>&#x2013;<lpage>739</lpage>.
                    <pub-id pub-id-type="pmid">20838408</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nrg2825</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3880143</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref38">
                <label>38</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Choi</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chang</surname>
                            <given-names>C-Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Clough</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>May 2014</year>;<volume>30</volume>(<issue>17</issue>):<fpage>2524</fpage>&#x2013;<lpage>2526</lpage>.
                    <pub-id pub-id-type="pmid">24794931</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btu305</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref39">
                <label>39</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Huang</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Choi</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Tzouros</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MSstatsTMT: Statistical detection of differentially abundant proteins in experiments with isobaric labeling and multiple mixtures.</article-title>
                    <source>

                        <italic toggle="yes">Mol. Cell. Proteomics.</italic>
</source>
                    <year>October 2020</year>;<volume>19</volume>(<issue>10</issue>):<fpage>1706</fpage>&#x2013;<lpage>1723</lpage>.
                    <pub-id pub-id-type="pmid">32680918</pub-id>
                    <pub-id pub-id-type="doi">10.1074/mcp.ra120.002105</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8015007</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref40">
                <label>40</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wolski</surname>
                            <given-names>WE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Nanni</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Grossmann</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>prolfqua: A comprehensive R-package for proteomics differential expression analysis.</article-title>
                    <source>

                        <italic toggle="yes">J. Proteome Res.</italic>
</source>
                    <year>March 2023</year>;<volume>22</volume>(<issue>4</issue>):<fpage>1092</fpage>&#x2013;<lpage>1104</lpage>.
                    <pub-id pub-id-type="pmid">36939687</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.jproteome.2c00441</pub-id>
                    <pub-id pub-id-type="pmcid">PMC10088014</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref41">
                <label>41</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ritchie</surname>
                            <given-names>ME</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Phipson</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Di</surname>
                            <given-names>W</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>limma powers differential expression analyses for RNA-sequencing and microarray studies.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>January 2015</year>;<volume>43</volume>(<issue>7</issue>):<fpage>e47</fpage>&#x2013;<lpage>e47</lpage>.
                    <pub-id pub-id-type="pmid">25605792</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkv007</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4402510</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref42">
                <label>42</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Phipson</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lee</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Majewski</surname>
                            <given-names>IJ</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression.</article-title>
                    <source>

                        <italic toggle="yes">Ann. Appl. Stat.</italic>
</source>
                    <year>June 2016</year>;<volume>10</volume>(<issue>2</issue>):<fpage>946</fpage>&#x2013;<lpage>963</lpage>.
                    <pub-id pub-id-type="pmid">28367255</pub-id>
                    <pub-id pub-id-type="doi">10.1214/16-aoas920</pub-id>
                    <pub-id pub-id-type="pmcid">PMC5373812</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref43">
                <label>43</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Law</surname>
                            <given-names>CW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shi</surname>
                            <given-names>W</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>voom: precision weights unlock linear model analysis tools for RNA-seq read counts.</article-title>
                    <source>

                        <italic toggle="yes">Genome Biol.</italic>
</source>
                    <year>2014</year>;<volume>15</volume>(<issue>2</issue>):<fpage>R29</fpage>.
                    <pub-id pub-id-type="pmid">24485249</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2014-15-2-r29</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4053721</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref44">
                <label>44</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Robinson</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <article-title>How to interpret a p-value histogram.</article-title>
                    <year>2014</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://varianceexplained.org/statistics/interpreting-pvalue-histogram/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref45">
                <label>45</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Benjamini</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hochberg</surname>
                            <given-names>Y</given-names>
                        </name>
</person-group>:
                    <article-title>Controlling the false discovery rate: a practical and powerful approach to multiple testing.</article-title>
                    <source>

                        <italic toggle="yes">J. R. Stat. Soc.</italic>
</source>
                    <year>1995</year>;<volume>57</volume>(<issue>1</issue>):<fpage>289</fpage>&#x2013;<lpage>300</lpage>.
                    <pub-id pub-id-type="doi">10.1111/j.2517-6161.1995.tb02031.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref46">
                <label>46</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Eden</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Navon</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Steinfeld</surname>
                            <given-names>I</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists.</article-title>
                    <source>

                        <italic toggle="yes">BMC Bioinformatics.</italic>
</source>
                    <year>February 2009</year>;<volume>10</volume>(<issue>1</issue>).
                    <pub-id pub-id-type="pmid">19192299</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-10-48</pub-id>
                    <pub-id pub-id-type="pmcid">PMC2644678</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref47">
                <label>47</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Mi</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Muruganujan</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Thomas</surname>
                            <given-names>PD</given-names>
                        </name>
</person-group>:
                    <article-title>PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>November 2012</year>;<volume>41</volume>(<issue>D1</issue>):<fpage>D377</fpage>&#x2013;<lpage>D386</lpage>.
                    <pub-id pub-id-type="doi">10.1093/nar/gks1118</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref48">
                <label>48</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Thomas</surname>
                            <given-names>PD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ebert</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Muruganujan</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>PANTHER: Making genome-scale phylogenetics accessible to all.</article-title>
                    <source>

                        <italic toggle="yes">Protein Sci.</italic>
</source>
                    <year>November 2021</year>;<volume>31</volume>(<issue>1</issue>):<fpage>8</fpage>&#x2013;<lpage>22</lpage>.
                    <pub-id pub-id-type="pmid">34717010</pub-id>
                    <pub-id pub-id-type="doi">10.1002/pro.4218</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8740835</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref49">
                <label>49</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Alexa</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>topGO: Enrichment Analysis for Gene Ontology.</article-title>R package version 2.50.0.<year>2022</year>.</mixed-citation>
            </ref>
            <ref id="ref50">
                <label>50</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Grote</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>GOfuncR: Gene ontology enrichment using FUNC.</article-title>R package version 1.18.0.<year>2022</year>.</mixed-citation>
            </ref>
            <ref id="ref51">
                <label>51</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wu</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hu</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Xu</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.</article-title>
                    <source>

                        <italic toggle="yes">Innovation.</italic>
</source>
                    <year>August 2021</year>;<volume>2</volume>(<issue>3</issue>):<fpage>100141</fpage>.
                    <pub-id pub-id-type="pmid">34557778</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.xinn.2021.100141</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8454663</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref52">
                <label>52</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yu</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>enrichplot: Visualization of Functional Enrichment Result.</article-title>R package version 1.18.3.<year>2022</year>.</mixed-citation>
            </ref>
            <ref id="ref53">
                <label>53</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gierlinski</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gastaldello</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cole</surname>
                            <given-names>C</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Proteus: an R package for downstream analysis of MaxQuant output.</article-title>
                    <source>

                        <italic toggle="yes">bioRxiv.</italic>
</source>
                    <year>2018</year>.
                    <pub-id pub-id-type="doi">10.1101/416511</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref54">
                <label>54</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ranathunge</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Patel</surname>
                            <given-names>SS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Pinky</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>promor: a comprehensive R package for label-free proteomics data analysis and predictive modeling.</article-title>
                    <source>

                        <italic toggle="yes">bioRxiv.</italic>
</source>
                    <year>2023</year>.
                    <pub-id pub-id-type="doi">10.1101/2022.08.17.503867</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref55">
                <label>55</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Quast</surname>
                            <given-names>J-P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schuster</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Picotti</surname>
                            <given-names>P</given-names>
                        </name>
</person-group>:
                    <article-title>protti: an R package for comprehensive data analysis of peptide- and protein-centric bottom-up proteomics data.</article-title>
                    <source>

                        <italic toggle="yes">Bioinform. Adv.</italic>
</source>
                    <year>December 2021</year>;<volume>2</volume>(<issue>1</issue>).
                    <pub-id pub-id-type="pmid">36699412</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioadv/vbab041</pub-id>
                    <pub-id pub-id-type="pmcid">PMC9710675</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref56">
                <label>56</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wolski</surname>
                            <given-names>WE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Nanni</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Grossmann</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>prolfqua: A comprehensive R-package for proteomics differential expression analysis.</article-title>
                    <source>

                        <italic toggle="yes">bioRxiv.</italic>
</source>
                    <year>2022</year>.
                    <pub-id pub-id-type="doi">10.1101/2022.06.07.494524</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report221408">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.152361.r221408</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Yadav</surname>
                        <given-names>Amit Kumar</given-names>
                    </name>
                    <xref ref-type="aff" rid="r221408a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-9445-8156</uri>
                </contrib>
                <aff id="r221408a1">
                    <label>1</label>Translational Health Science and Technology Institute (THSTI), Faridabad, Haryana, India</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>18</day>
                <month>12</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Yadav AK</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport221408" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.139116.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The article by Hutchings et al., titled "A Bioconductor Workflow for Processing, Evaluating, and Interpreting Expression Proteomics Data", provides a detailed tutorial on interpreting quantitative proteomics data in the R environment, using TMT and LFQ workflows as examples.</p>
            <p> 
                <bold>Major comments:</bold> 
                <list list-type="order">
                    <list-item>
                        <p>The use of open-source tools ensures transparency and reproducibility. The open-source R software packages from the Bioconductor project also add credibility to the workflow.</p>
                    </list-item>
                    <list-item>
                        <p>The workflow is comprehensive and serves as a good tutorial for analysis. The article describes it with R commands, covering data import, quality control, differential expression analysis, and gene ontology enrichment analysis. This approach ensures a holistic understanding of expression proteomics.</p>
                    </list-item>
                    <list-item>
                        <p>The described use-case examples from TMT and LFQ provide practical relevance and enhance the applicability of the workflow.</p>
                    </list-item>
                    <list-item>
                        <p>The target audience is beginners who are already familiar with R. It may also be helpful to optionally provide a Shiny app (a GUI-based application) that biologists who are unfamiliar with the CLI can also use.</p>
                    </list-item>
                </list> 
                <bold>Minor comments:</bold> 
                <list list-type="order">
                    <list-item>
                        <p>This article deals with Proteome Discoverer output. Including MaxQuant or other workflows may add to its reach.</p>
                    </list-item>
                    <list-item>
                        <p>The difference between razor and shared peptides is not clear. Perhaps rephrasing would help a newcomer to proteomics.</p>
                    </list-item>
                    <list-item>
                        <p>Page 40 (under the section Visualising Aggregation): &#x201c;&#x2026;15 peptides and 27 supporting&#x2026;&#x201d;. Did the authors mean 27 PSMs? The word is missing from the sentence. Please clarify whether these were peptides or PSMs.</p>
                    </list-item>
                    <list-item>
                        <p>PCA is not considered suitable when there are a lot of missing values, as in LFQ analysis. Should these be removed before PCA?</p>
                    </list-item>
                    <list-item>
                        <p>The authors have conflated modified and non-modified forms of the same peptides during analysis. This approach may fail to detect changes in modified peptides if they are differential in nature.</p>
                    </list-item>
                    <list-item>
                        <p>Perhaps &#x201c;quantitative proteomics&#x201d; is a more appropriate term than &#x201c;Expression proteomics&#x201d;.</p>
                    </list-item>
                </list> The article appears to be a valuable contribution to the field of quantitative proteomics, especially for R users, providing a comprehensive and user-friendly workflow.
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>quantitative proteomics, bioinformatics, computational biology, proteome informatics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment12038-221408">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Breckels</surname>
                            <given-names>Lisa</given-names>
                        </name>
                        <aff>University of Cambridge, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>7</month>
                    <year>2024</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We would like to thank Dr. Yadav for reviewing our workflow paper and providing positive feedback. In response to Dr. Yadav&#x2019;s minor comments, we have provided additional information on how to use this workflow with output from MaxQuant. This information can be found in a new section in the 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics/blob/main/appendix.pdf">Appendix</ext-link>&#x00a0;of our GitHub repository and is mentioned explicitly in the workflow. The sentence regarding shared and unique peptides has been changed slightly and additional discussion added to emphasise how protein inference and the presence of isoforms can alter the definition of &#x2018;unique&#x2019; peptides. We agree that missing values need to be addressed before performing PCA and, as such, in the workflow we use the function filterNA in QFeatures to remove missing data before any PCA is performed. We have also explicitly stated that we conflate modified and non-modified forms of each peptide, and that users could aggregate by &#x201c;Annotated. Sequence&#x201d; if they wish to retain this information. However, the discovery of differences in the behaviour of these peptides would require peptide-level statistical analyses, which are outside the scope of this workflow.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report221410">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.152361.r221410</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Locard-Paulet</surname>
                        <given-names>Marie</given-names>
                    </name>
                    <xref ref-type="aff" rid="r221410a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-2879-9224</uri>
                </contrib>
                <aff id="r221410a1">
                    <label>1</label>Universit&#x00e9; Toulouse III - Paul Sabatier (UT3), Toulouse, France</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>27</day>
                <month>11</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Locard-Paulet M</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport221410" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.139116.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors provide a full description of an R-based workflow for quantitative proteomics data analysis (TMT and label-free data). It makes use of QFeatures objects, a structure that is very well suited to bottom-up proteomics data analysis. I think that this is a very nice step-by-step tutorial on MS-based proteomics data analysis with R for people who are starting in the field and have limited coding skills. Beyond its educational aspect, it provides the backbone of a data analysis pipeline that can be modified and published alongside any manuscript that includes MS-based proteomics data analysis.</p>
            <p> </p>
            <p> The manuscript is very well written, and the authors provide very clear and detailed instructions on how to use the workflow. The figures are also very clear and informative. They present the data, detail what quality control to perform to assess if the data are suitable for statistical analysis. This workflow contains a lot of useful QC plots and checks that are often overlooked, including TMT labelling efficiency calculation. It makes it a good step-by-step guide on how to analyze TMT-labelled bottom-up data. I really like the description on how to double check quality metrics such as S/N and isolation interference to adapt them to the data.</p>
            <p> </p>
            <p> The raw data are available in PRIDE, and all the FASTA files and the Proteome Discoverer output files are in Zenodo. The code is in a GitHub repository. I easily found all the data associated with this manuscript.</p>
            <p> </p>
            <p> Major comments: 
                <list list-type="order">
                    <list-item>
                        <p>It should be made clear that this script is tailored for Proteome Discoverer outputs (and maybe even for a given version of Proteome Discoverer). For most of the plotting/filtering/analysis steps, changing input would only require adapting the column names/headers, and most of the time it is very well discussed by the authors. Nevertheless, right now this script is only adapted to Proteome Discoverer outputs and this should be made clear in the abstract and the introduction, as well as the discussion.</p>
                    </list-item>
                    <list-item>
                        <p>Aggregation of PSM quantification values to peptides and proteins:</p>
                        <p> </p>
                        <p> Here, the authors choose to aggregate all PSM intensities corresponding to the same sequence (stripped of its modifications). I would aggregate the PSM quantities per peptidoform (sequence + localized modifications) since two ions can have the same sequence but with modifications that are differentially regulated between the conditions compared. This should be mentioned.</p>
                    </list-item>
                    <list-item>
                        <p>For the GO-term enrichment: only the majority protein accessions are reported in the output of the limma analysis. So there is only one gene per feature. How do you make sure that it is the most representative annotation for a given protein group? I am fully aware that this issue is ignored by the community (for lack of alternative strategy), and that most of the time people just pick one accession per group for GO-term enrichment analysis. So, I am not asking the authors to find a solution. Nevertheless, it would be nice to mention that the output of the enrichment will depend on what accession is picked per group.</p>
                    </list-item>
                </list> Minor comments: 
                <list list-type="order">
                    <list-item>
                        <p>I would remove contaminants before estimation of TMT labelling efficiency.</p>
                    </list-item>
                    <list-item>
                        <p>I find the paragraph &#x201c;Additional considerations regarding protein isoforms&#x201d; unclear. The authors should rephrase sentences such as this one: &#x201c;PSMs or peptides that were previously mapped to one protein and one protein group could instead be mapped to multiple proteins and one protein group.&#x201d; Maybe they should use the term &#x201c;canonical protein&#x201d; to be more precise?</p>
                        <p> </p>
                        <p> In any case, the issue of peptide uniqueness depends not only on the presence of isoforms in the FASTA file but also on the strategy that was chosen for protein inference. I agree with the authors that people should precisely describe what peptides/PSMs were used for quantification, and it is good to mention it. Nevertheless, this level of detail on general MS-based proteomics concepts may not be necessary here. (So the entire paragraph on isoforms could perhaps be removed, especially since isoforms are not mentioned later.)</p>
                    </list-item>
                    <list-item>
                        <p>In the paragraph &#x201c;Removing PSMs that are not rank 1&#x201d;:</p>
                        <p> </p>
                        <p> I think that the &#x201c;PSM category&#x201d; that is discussed a bit later is specific to Proteome Discoverer. Some search engines report PSMs of equal score (this would correspond to the &#x201c;pretty rank&#x201d; in Mascot). I don&#x2019;t think that all this should necessarily be discussed in this manuscript, but I insist on the fact that the workflow is tuned for Proteome Discoverer output and this should be made clear in the introduction/abstract.</p>
                    </list-item>
                    <list-item>
                        <p>In the paragraph &#x201c;Managing missing data&#x201d; for the TMT:</p>
                        <p> </p>
                        <p> The authors mention MCAR, MAR, and MNAR, but do they all apply here? I would expect missing values in TMT-labelled data to arise mostly from ions falling below the limit of detection, because when an ion is fragmented, the reporter ions used for relative quantification are often all detected. Isn&#x2019;t that the case?</p>
                        <p> Knowing this should narrow the choice to a more suitable strategy for replacing missing values.</p>
                        <p> </p>
                        <p> Still on missing values: page 33, it is stated that &#x201c;Typically, it is desirable to remove features, here PSMs, with greater than 20% missing values&#x201d;. Why this number? Is this accepted by the entire community? (This comment also applies to the same step in the LFQ analysis.)</p>
                        <p> </p>
                        <p> Regarding missing value replacement (discussion on LFQ, page 55): it is really nice to provide references on strategies for replacing missing values. It is indeed a tricky decision to make (how to replace? Should we replace?) and there is no one-size-fits-all method. Why do you choose to replace at peptide level and not protein level?</p>
                    </list-item>
                    <list-item>
                        <p>Page 38: what do the authors mean by &#x201c;report did not indicate any superior normalization method&#x201d;? How would we know what normalization method works best? I raise this point out of curiosity, not to correct any obvious mistake. It is great to try out different normalization strategies, but I don&#x2019;t really see how to pick one based on the boxplots of normalized intensities (Figure 4).</p>
                    </list-item>
                    <list-item>
                        <p>The authors normalize the data after protein aggregation and not at the PSM level. Is this general practice? Wouldn&#x2019;t it make sense to normalize before aggregation?</p>
                    </list-item>
                    <list-item>
                        <p>Page 44; &#x201c;Data import, housekeeping and exploration&#x201d;:</p>
                        <p> </p>
                        <p> The authors mention that LFQ analysis cannot be performed at PSM level but has to be done at peptide level. I think that this depends on the software tool that is used, and it may be specific to Proteome Discoverer. Match between runs is performed at the ion level, but the intensity retrieved at this step can be reported at PSM level. I think that this is the case in MaxQuant &#x201c;evidence.txt&#x201d; tables, if I am not mistaken.</p>
                    </list-item>
                    <list-item>
                        <p>Page 63: DEqMS could be mentioned since it is specifically developed for proteomics statistical analysis. (DOI: 10.1074/mcp.TIR119.001646)</p>
                    </list-item>
                    <list-item>
                        <p>This is more of a na&#x00ef;ve question regarding using Limma in this context: does it make sense to model the variance depending on intensity after missing value replacement? I know that this manuscript may not be the place to discuss this, but I would be interested to know what the authors&#x2019; opinion is on this question.</p>
                    </list-item>
                    <list-item>
                        <p>Additional suggestions (only suggestions that may totally be ignored/dismissed by the authors if they don&#x2019;t think that they&#x2019;ll improve their manuscript):</p>
                        <p> The data are available in PRIDE (PXD041794), but the information necessary to match TMT channel/sample to conditions/replicates is only available in the associated paper. It would be great to add Table 1 of the manuscript and a link to the Zenodo repository in PRIDE alongside the data. An even better solution would be to provide the metadata in the SDRF-Proteomics format (
                            <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/s41467-021-26111-3">https://www.nature.com/articles/s41467-021-26111-3</ext-link>; there is a GUI to generate these files now: https://lessdrf.streamlit.app/). This would facilitate data reuse and transparency.</p>
                    </list-item>
                </list> Small comments / typos: 
                <list list-type="bullet">
                    <list-item>
                        <p>Page&#x00a0;3, in the code at the bottom of the page: the &#x201c;,&#x201d; is missing after &#x201c;stringr&#x201d;.</p>
                    </list-item>
                    <list-item>
                        <p>Paragraph &#x201c;Assessing the impact of non-specific data cleaning&#x201d;, in the table (and other similar tables or mentions of number of protein groups and peptide stripped sequences): &#x201c;proteins&#x201d; should be replaced by &#x201c;protein groups&#x201d; since this is what is actually counted. If the authors wanted to be more precise, they could also specify that the &#x201c;peptides&#x201d; correspond to peptide sequences stripped of all modifications.</p>
                    </list-item>
                    <list-item>
                        <p>Bar plot of missing value proportion page 32: what does the red dashed line correspond to?</p>
                    </list-item>
                    <list-item>
                        <p>Page 37, when running the function `normalizer`, I got an error &#x201c;No RT column specified (column named 'RT') or option not specified Skipping RT normalization.&#x201d; And could not get the expected report. I am not familiar with the tool and did not investigate further.</p>
                    </list-item>
                    <list-item>
                        <p>I don&#x2019;t think that &#x201c;softwares&#x201d; can be used. There is no &#x201c;s&#x201d; at the end.</p>
                    </list-item>
                    <list-item>
                        <p>Page 51: &#x201c;cleaning is done is two steps&#x201d; should be &#x201c;cleaning is done in two steps&#x201d;</p>
                    </list-item>
                    <list-item>
                        <p>Page 54: in the bar plot of missing value count, what is the dashed red line?</p>
                    </list-item>
                    <list-item>
                        <p>Page 55: &#x201c;If the method requires data to display a normal distribution, users must log2 transform the data prior to imputation.&#x201d; -&gt; is there a reason to perform missing values replacement before log transformation? If not, this step could be moved to after log transform?</p>
                    </list-item>
                    <list-item>
                        <p>Page 66: &#x201c;If we had not used TMT labels and wished to include a logFC threshold, we could have included lfc = as an argument&#x201d;. The characters of &#x201c;
                            <italic>lfc = as</italic>&#x201d; are in a different font; I think this is a bit unclear, since the &#x201c;as&#x201d; should be regular text. Also, maybe you could explain: &#x201c;have included 
                            <italic>lfc =</italic> followed by the minimum absolute log2-transformed fold change&#x201d;.</p>
                    </list-item>
                    <list-item>
                        <p>Page 72: &#x201c;summarize all packages and corresponding their versions used to generate&#x201d; -&gt; the sentence does not seem correct to me.</p>
                    </list-item>
                </list>
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>I work on MS-based proteomics data analysis and computational mass spectrometry.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment12037-221410">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Breckels</surname>
                            <given-names>Lisa</given-names>
                        </name>
                        <aff>University of Cambridge, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>7</month>
                    <year>2024</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We thank Dr Locard-Paulet for her positive comments and have answered her additional comments below.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold>
                </p>
                <p> </p>
                <p> Major comments: 
                    <list list-type="order">
                        <list-item>
                            <p>It should be made clear that this script is tailored for Proteome Discoverer outputs (and maybe even for a given version of Proteome Discoverer). For most of the plotting/filtering/analysis steps, changing input would only require adapting the column names/headers, and most of the time it is very well discussed by the authors. Nevertheless, right now this script is only adapted to Proteome Discoverer outputs and this should be made clear in the abstract and the introduction, as well as the discussion.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response: </bold>We appreciate that both use-case datasets in this workflow were processed using Proteome Discoverer and have added text to emphasise this further in the abstract and introduction. Nevertheless, we feel that the workflow is still useful to users of alternative third-party software, as the general data processing steps, and the extensive discussion of these, remain relevant. To help users make the most of this workflow, we have provided information on how to adapt it from Proteome Discoverer files to those generated by MaxQuant, a free and accessible search software. This information is provided in a new section in the 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics/blob/main/appendix.pdf">Appendix</ext-link> of our GitHub repository.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Aggregation of PSM quantification values to peptides and proteins:</p>
                            <p> </p>
                            <p> Here, the authors choose to aggregate all PSM intensities corresponding to the same sequence (stripped of its modifications). I would aggregate the PSM quantities per peptidoform (sequence + localized modifications) since two ions can have the same sequence but with modifications that are differentially regulated between the conditions compared. This should be mentioned.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response: </bold>As Dr Locard-Paulet mentions, it is possible for the same peptide sequence with and without modifications to behave differently. We have added the following to the workflow to acknowledge this: &#x201c;For simplicity, we here aggregate all PSMs corresponding to the same stripped peptide sequence, regardless of whether they contain different modifications. If users are interested in exploring differential expression of protein isoforms or have reason to believe that post-translational modifications may be important in answering their question, PSMs could also be aggregated by "Annotated.Sequence" to preserve this information.&#x201d; We agree that this point is worth drawing attention to. In order to observe any differential abundance of peptides when modified, statistical analysis would need to be carried out at the peptide level. Since this is outside the scope of this workflow, we continue to aggregate using the stripped sequence column to facilitate statistical analysis at the protein level.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>For the GO-term enrichment: only the majority protein accessions are reported in the output of the limma analysis. So there is only one gene per feature. How do you make sure that it is the most representative annotation for a given protein group? I am fully aware that this issue is ignored by the community (for lack of alternative strategy), and that most of the time people just pick one accession per group for GO-term enrichment analysis. So, I am not asking the authors to find a solution. Nevertheless, it would be nice to mention that the output of the enrichment will depend on what accession is picked per group.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response: </bold>We agree with and acknowledge this point. Indeed, it is a limitation of protein inference that results in statistics and interpretation being done at the level of protein groups rather than individual proteins. As a result, we would always encourage users to explore their statistical hits, particularly those of interest for follow up. This would include tracing the protein group back down all data levels and potentially plotting an aggregation graph, as demonstrated in &#x201c;Exploration of data using QFeatures links&#x201d;. With regard to the output of GO enrichment, we do not expect the overall biological interpretation to be dramatically skewed as only protein lists with multiple proteins with the same GO annotation will be considered enriched. Therefore, even if a few of the master proteins are not representative of the protein which is actually changing in abundance, these few cases are unlikely to cause an incorrect enrichment. Nevertheless, it is always worthwhile for the user to examine the results to verify correct interpretation.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold>
                </p>
                <p> </p>
                <p> Minor comments: 
                    <list list-type="order">
                        <list-item>
                            <p>I would remove contaminants before estimation of TMT labelling efficiency.</p>
                        </list-item>
                        <list-item>
                            <p>I find the paragraph &#x201c;Additional considerations regarding protein isoforms&#x201d; unclear. The authors should rephrase sentences such as this one: &#x201c;PSMs or peptides that were previously mapped to one protein and one protein group could instead be mapped to multiple proteins and one protein group.&#x201d; Maybe they should use the term &#x201c;canonical protein&#x201d; to be more precise?</p>
                            <p> </p>
                            <p> In any case, the issue of peptide uniqueness depends not only on the presence of isoforms in the FASTA file but also on the strategy that was chosen for protein inference. I agree with the authors that people should precisely describe what peptides/PSMs were used for quantification, and it is good to mention it. Nevertheless, this level of detail on general MS-based proteomics concepts may not be necessary here. (So the entire paragraph on isoforms could perhaps be removed, especially since isoforms are not mentioned later.)</p>
                        </list-item>
                    </list> 
                    <bold>Author Response: </bold>We thank Dr Locard-Paulet for her input on this complex issue. We agree that for a beginner-level workflow, the discussion of isoforms may be confusing. We have removed the paragraph about isoforms, whilst adding some further comment on the importance of defining uniqueness above.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>In the paragraph &#x201c;Removing PSMs that are not rank 1&#x201d;:</p>
                            <p> </p>
                            <p> I think that the &#x201c;PSM category&#x201d; that is discussed a bit later is specific to Proteome Discoverer. Some search engines report PSMs of equal score (this would correspond to the &#x201c;pretty rank&#x201d; in Mascot). I don&#x2019;t think that all this should necessarily be discussed in this manuscript, but I insist on the fact that the workflow is tuned for Proteome Discoverer output and this should be made clear in the introduction/abstract.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response: </bold>We have included additional emphasis on Proteome Discoverer in the introduction to this workflow. That said, users of alternative software will still benefit from being aware of the concept of &#x2018;PSM rank&#x2019;.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>In the paragraph &#x201c;Managing missing data&#x201d; for the TMT:</p>
                            <p> </p>
                            <p> The authors mention MCAR, MAR, and MNAR, but do they all apply here? I would expect missing values in TMT-labelled data to arise mostly from ions falling below the limit of detection, because when an ion is fragmented, the reporter ions used for relative quantification are often all detected. Isn&#x2019;t that the case?</p>
                            <p> Knowing this should narrow the choice to a more suitable strategy for replacing missing values.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response: </bold>Indeed, knowing the reason behind missing values is important in deciding upon an imputation method. We have added a few sentences to outline this in the &#x201c;Imputation&#x201d; section of the TMT workflow. Nonetheless, we still wish to emphasise that in the use-case, and in most datasets with so little missingness, it is not necessary to impute. Additional discussion has also been added under &#x201c;The importance of knowing what you expect in missing values&#x201d; to ensure that users consider their experimental design and quantitation strategy when assessing and dealing with missing values.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold>
                </p>
                <p> Still on missing values: page 33, it is stated that &#x201c;Typically, it is desirable to remove features, here PSMs, with greater than 20% missing values&#x201d;. Why this number? Is this accepted by the entire community? (This comment also applies to the same step in the LFQ analysis.)</p>
                <p> </p>
                <p> 
                    <bold>Author Response:&#x00a0;</bold>The percentage of missing data to allow per dataset will depend on the experimental design and number of samples. There are no strict rules on what one should allow; where imputation is required, several groups report that up to 20% missing values in quantitative proteomics is acceptable (please see, for example, 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41597-024-02922-z">https://doi.org/10.1038/s41597-024-02922-z</ext-link> and 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/s12864-022-08723-1">https://doi.org/10.1186/s12864-022-08723-1</ext-link>) to maintain data structure as close to reality as possible.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold>
                </p>
                <p> Regarding missing value replacement (discussion on LFQ, page 55): it is really nice to provide references on strategies for replacing missing values. It is indeed a tricky decision to make (how to replace? Should we replace?) and there is no one-size-fits-all method. Why do you choose to replace at peptide level and not protein level?</p>
                <p> </p>
                <p> 
                    <bold>Author Response:&#x00a0;</bold>The reason that we typically choose to impute at the lowest possible data level is to avoid the implicit imputation that occurs during aggregation. For example, if a protein is supported by two peptides, and one peptide contains missing values whilst the other does not, aggregation would result in a protein level quantitation value without the user explicitly stating how to deal with the missing value.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Page 38: what do the authors mean by &#x201c;report did not indicate any superior normalization method&#x201d;? How would we know what normalization method works best? I raise this point out of curiosity, not to correct any obvious mistake. It is great to try out different normalization strategies, but I don&#x2019;t really see how to pick one based on the boxplots of normalized intensities (Figure 4).</p>
                        </list-item>
                    </list> 
                    <bold>Author Response:&#x00a0;</bold>In our experience of sharing and teaching this workflow, we have found that one of the most common questions is in regard to selecting an appropriate normalisation method. Therefore, we decided to include NormalyzerDE as one approach for users to make this decision. However, we appreciate that the results of this comparison are not particularly informative for either of the use-case datasets. As a result, we have decided to remove the NormalyzerDE example from the workflow and instead include further discussion.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>The authors normalize the data after protein aggregation and not at the PSM level. Is this general practice? Wouldn&#x2019;t it make sense to normalize before aggregation?</p>
                        </list-item>
                    </list> 
                    <bold>Author Response:&#x00a0;</bold>We acknowledge that this decision is not a simple one, and not one which has a consensus in the literature. For example, some workflows choose to normalise at lower data levels so that normalisation can occur prior to imputation. We typically choose to normalise data at the protein level because this is the level at which statistical analyses are carried out. As the goal of normalisation is to remove non-biological variation to increase statistical power and lead to the discovery of meaningful, biological variation, we feel it makes the most sense to do this step immediately before statistics. Normalising the data at PSM or peptide level would not account for any small variation re-introduced during aggregation (depending upon the selected aggregation method).</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Page 44; &#x201c;Data import, housekeeping and exploration&#x201d;:</p>
                            <p> </p>
                            <p> The authors mention that LFQ analysis cannot be performed at the PSM level but has to be done at the peptide level. I think that this depends on the software tool that is used, and it may be specific to Proteome Discoverer. Match between runs is performed at the ion level, but the intensity retrieved at this step can be reported at the PSM level. I think that this is the case in MaxQuant &#x201c;evidence.txt&#x201d; tables, if I am not mistaken.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response:&#x00a0;</bold>We thank Dr Locard-Paulet for pointing this out and have adjusted the text in the LFQ data import section accordingly.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Page 63: DEqMS could be mentioned since it is specifically developed for proteomics statistical analysis. (http://doi.org/10.1074/mcp.TIR119.001646)</p>
                        </list-item>
                    </list> 
                    <bold>Author Response:&#x00a0;</bold>Thank you for this suggestion. We have now mentioned and referenced this package in the appropriate place.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>This is more of a na&#x00ef;ve question regarding using Limma in this context: does it make sense to model the variance depending on intensity after missing value replacement? I know that this manuscript may not be the place to discuss this, but I would be interested to know what the authors&#x2019; opinion is on this question.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response:&#x00a0;</bold>We agree with the reviewer that caution is needed, and in the text we advise users to check their data following imputation to confirm that the distribution has not dramatically changed. As we mention in the manuscript, it is now possible to use packages such as msqRob2 to facilitate statistical differential expression analysis on datasets without the need for imputation. With the limma package, for a given protein, the cases with missing values are removed from both the data and the design matrix. That is, the linear model is fitted to the non-missing values. If a particular regression coefficient cannot be estimated from the observed data for a protein of interest, then an NA value will be returned for that coefficient.</p>
                <p> </p>
                <p> 
                    <bold>Reviewer Comments:</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Additional suggestions (only suggestions that may totally be ignored/dismissed by the authors if they don&#x2019;t think that they&#x2019;ll improve their manuscript):</p>
                            <p> The data are available in PRIDE (PXD041794), but the information necessary to match TMT channels/samples to conditions/replicates is only available in the associated paper. It would be great to add Table 1 of the manuscript and a link to the Zenodo repository in PRIDE alongside the data. An even better solution would be to provide the metadata in the SDRF-Proteomics format (
                                <ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/s41467-021-26111-3">https://www.nature.com/articles/s41467-021-26111-3</ext-link>; there is a GUI to generate these files now: https://lessdrf.streamlit.app/). This would facilitate data reuse and transparency.</p>
                        </list-item>
                    </list> Small comments / typos: 
                    <list list-type="bullet">
                        <list-item>
                            <p>Page&#x00a0;3, in the code at the bottom of the page: the &#x201c;,&#x201d; is missing after &#x201c;stringr&#x201d;.</p>
                        </list-item>
                        <list-item>
                            <p>Paragraph &#x201c;Assessing the impact of non-specific data cleaning&#x201d;, in the table (and other similar tables or mentions of the number of protein groups and stripped peptide sequences): &#x201c;proteins&#x201d; should be replaced by &#x201c;protein groups&#x201d; since this is what is actually counted. If the authors wanted to be more precise, they could also specify that the &#x201c;peptides&#x201d; correspond to peptide sequences stripped of all modifications.</p>
                        </list-item>
                        <list-item>
                            <p>Bar plot of missing value proportion page 32: what does the red dashed line correspond to?</p>
                        </list-item>
                        <list-item>
                            <p>Page 37: when running the function `normalizer`, I got an error, &#x201c;No RT column specified (column named 'RT') or option not specified Skipping RT normalization.&#x201d;, and could not get the expected report. I am not familiar with the tool and did not investigate further.</p>
                        </list-item>
                        <list-item>
                            <p>I don&#x2019;t think that &#x201c;softwares&#x201d; can be used. There is no &#x201c;s&#x201d; at the end.</p>
                        </list-item>
                        <list-item>
                            <p>Page 51: &#x201c;cleaning is done is two steps&#x201d; should be &#x201c;cleaning is done in two steps&#x201d;</p>
                        </list-item>
                        <list-item>
                            <p>Page 54: in the bar plot of missing value count, what is the dashed red line?</p>
                        </list-item>
                        <list-item>
                            <p>Page 55: &#x201c;If the method requires data to display a normal distribution, users must log2 transform the data prior to imputation.&#x201d; -&gt; is there a reason to perform missing values replacement before log transformation? If not, this step could be moved to after log transform?</p>
                        </list-item>
                        <list-item>
                            <p>Page 66: &#x201c;If we had not used TMT labels and wished to include a logFC threshold, we could have included lfc = as an argument&#x201d;. The characters of &#x201c;
                                <italic>lfc = as</italic>&#x201d; are in a different font; I think that this is a bit unclear, since the &#x201c;as&#x201d; should be regular text. Also, maybe you could explain &#x201c;have included&#x00a0;
                                <italic>lfc =</italic>&#x00a0;followed by the minimum absolute log2-transformed fold change&#x201d;. &#x00a0;</p>
                        </list-item>
                        <list-item>
                            <p>Page 72: &#x201c;summarize all packages and corresponding their versions used to generate&#x201d; -&gt; the sentence does not seem correct to me.</p>
                        </list-item>
                    </list> 
                    <bold>Author Response:&#x00a0;</bold>We would like to thank Dr. Locard-Paulet for being so thorough in reviewing the manuscript. We have corrected the highlighted typos.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report221416">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.152361.r221416</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Borges Lima</surname>
                        <given-names>Diogo</given-names>
                    </name>
                    <xref ref-type="aff" rid="r221416a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6056-0825</uri>
                </contrib>
                <aff id="r221416a1">
                    <label>1</label>Leibniz - Forschungsinstitut f&#x00fc;r Molekulare Pharmakologie, Berlin, Germany</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>27</day>
                <month>11</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Borges Lima D</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
                <license>
                    <license-p>The author(s) is/are employees of the US Government and therefore domestic copyright protection in USA does not apply to this work. The work may be protected under the copyright laws of other jurisdictions when used in those jurisdictions.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport221416" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.139116.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The article entitled "A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data" presents a full and comprehensive workflow for performing not only qualitative but also quantitative proteomics analysis.</p>
            <p> </p>
            <p> The manuscript is very well written and describes in detail the whole workflow for quantifying proteomics data. The authors focus on LFQ analysis and on TMT (for labeled datasets). However, this workflow relies on the results from a search engine that provides the identified peptides. In this manuscript, the authors used results from Proteome Discoverer.</p>
            <p> </p>
            <p> I recommend publication, but I would like to raise some minor comments:</p>
            <p> </p>
            <p> a) Although the authors cover quantitative analysis using TMT for labeled datasets, why did they not show an analysis using SILAC, given that it uses XIC (the same strategy used by LFQ)?</p>
            <p> </p>
            <p> b) The authors used Proteome Discoverer as the search engine, and they provided a template to import the results into the workflow. They could also provide templates for other tools, such as FragPipe, PatternLab for Proteomics, etc., to make this workflow more versatile.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>I have experience in proteomics analyses (qualitative and quantitative); XL-MS analysis. I also develop software for identifying and quantifying proteomics datasets.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment12036-221416">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Breckels</surname>
                            <given-names>Lisa</given-names>
                        </name>
                        <aff>University of Cambridge, UK</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>18</day>
                    <month>7</month>
                    <year>2024</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We thank Dr. Lima for taking the time to review our workflow paper. The main aim of this workflow was to provide an overview of the essential steps required for the processing, analysis and interpretation of quantitative mass spectrometry-based expression proteomics data. We show two use-case examples, one labelled TMT experiment and one label-free experimental design. It would have been nice to also include a SILAC use-case example, among other designs, but in the interest of keeping the workflow to a reasonable length we chose these two designs as they are typically the most popular in the field. Further, statistical analysis of SILAC experiments often applies a ratiometric approach, which differs from the statistics discussed here. As Dr. Lima states, we do indeed use Proteome Discoverer to process the data and show how to import a PSM, peptide or protein .txt data table into R and into a QFeatures object. All third-party MS search engines typically output a data table of quantitation data, and for flexibility this is where we start our analysis. Header names in the output data will differ between software packages, but the approach should be transferable. Another popular third-party software package used for quantitation and identification is MaxQuant, and we have added a new section in the 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics/blob/main/appendix.pdf">Appendix</ext-link>&#x00a0;of our GitHub repository to show users how one might use data from MaxQuant in the context of this workflow.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
