<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.16665.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Garcia</surname>
                        <given-names>Maxime</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-2827-9261</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no" equal-contrib="yes">
                    <name>
                        <surname>Juhos</surname>
                        <given-names>Szilveszter</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Larsson</surname>
                        <given-names>Malin</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a4">4</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Olason</surname>
                        <given-names>Pall I.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Martin</surname>
                        <given-names>Marcel</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0680-200X</uri>
                    <xref ref-type="aff" rid="a5">5</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Eisfeldt</surname>
                        <given-names>Jesper</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-3716-4917</uri>
                    <xref ref-type="aff" rid="a6">6</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>DiLorenzo</surname>
                        <given-names>Sebastian</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a7">7</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Sandgren</surname>
                        <given-names>Johanna</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>D&#x00ed;az De St&#x00e5;hl</surname>
                        <given-names>Teresita</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Ewels</surname>
                        <given-names>Philip</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-4101-2502</uri>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Wirta</surname>
                        <given-names>Valtteri</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a8">8</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Nist&#x00e9;r</surname>
                        <given-names>Monica</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-1261-3790</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>K&#x00e4;ller</surname>
                        <given-names>Max</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a9">9</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Nystedt</surname>
                        <given-names>Bj&#x00f6;rn</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-7809-7664</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Department of Oncology-Pathology, Karolinska Institutet, J5:30 BioClinicum, Visionsgatan 4, Karolinska University Hospital at Solna, Solna, 17164, Sweden</aff>
                <aff id="a2">
                    <label>2</label>Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, Solna, 17121, Sweden</aff>
                <aff id="a3">
                    <label>3</label>Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Husargatan 3, Uppsala, 752 37, Sweden</aff>
                <aff id="a4">
                    <label>4</label>Department of Physics, Chemistry and Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Link&#x00f6;ping University, Link&#x00f6;ping, 58183, Sweden</aff>
                <aff id="a5">
                    <label>5</label>Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Box 1031, Solna, 17121, Sweden</aff>
                <aff id="a6">
                    <label>6</label>Clinical Genetics, Department of Molecular Medicine and Surgery, Karolinska Institutet, MMK L1:00, Karolinska University Hospital at Solna, Stockholm, 171 76, Sweden</aff>
                <aff id="a7">
                    <label>7</label>Department of Medical Sciences, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Husargatan 3, Uppsala, 752 37, Sweden</aff>
                <aff id="a8">
                    <label>8</label>Department of Microbiology, Tumor and Cell Biology, Clinical Genomics Facility, Science for Life Laboratory, Karolinska Institutet, Box 1031, Solna, 171 21, Sweden</aff>
                <aff id="a9">
                    <label>9</label>School of Engineering Sciences in Chemistry, Biotechnology and Health, Science for Life Laboratory, KTH Royal Institute of Technology, Box 1031, Solna, 17121, Sweden</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:bjorn.nystedt@scilifelab.se">bjorn.nystedt@scilifelab.se</email>
                </corresp>
                <fn fn-type="con">
                    <p>MK, BN and MN conceived the idea for Sarek. MG and SJ led the project. MG, SJ, ML, PIO, MM, JE, and SDL designed and implemented the workflow. JS, TDS, VW, MN, BN, PE and MK performed testing and provided design feedback. MG, SJ and BN wrote the manuscript with help from all authors.</p>
                </fn>
                <fn id="FN1">
                    <p>*These authors contributed equally to this work.</p>
                </fn>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>4</day>
                <month>9</month>
                <year>2020</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2020</year>
            </pub-date>
            <volume>9</volume>
            <elocation-id>63</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>3</day>
                    <month>7</month>
                    <year>2020</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2020 Garcia M et al.</copyright-statement>
                <copyright-year>2020</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/9-63/pdf"/>
            <abstract>
                <p>Whole-genome sequencing (WGS) is a fundamental technology for research to advance precision medicine, but the limited availability of portable and user-friendly workflows for WGS analyses poses a major challenge for many research groups and hampers scientific progress. Here we present Sarek, an open-source workflow to detect germline variants and somatic mutations based on sequencing data from WGS, whole-exome sequencing (WES), or gene panels. Sarek features (i) easy installation, (ii) robust portability across different computer environments, (iii) comprehensive documentation, (iv) transparent and easy-to-read code, and (v) extensive quality metrics reporting. Sarek is implemented in the Nextflow workflow language and supports both Docker and Singularity containers as well as Conda environments, making it well suited for easy deployment on any POSIX-compatible computer or cloud compute environment. Sarek follows the GATK best-practice recommendations for read alignment and pre-processing, and includes a wide range of software for the identification and annotation of germline and somatic single-nucleotide variants, insertion and deletion variants, structural variants, tumour sample purity, and variations in ploidy and copy number. Sarek offers easy, efficient, and reproducible WGS analyses, and can readily be used both as a production workflow at sequencing facilities and as a powerful stand-alone tool for individual research groups. The Sarek source code, documentation and installation instructions are freely available at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/nf-core/sarek">https://github.com/nf-core/sarek</ext-link> and at 
                    <ext-link ext-link-type="uri" xlink:href="https://nf-co.re/sarek/">https://nf-co.re/sarek/</ext-link>.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Analysis workflow</kwd>
                <kwd>Whole Genome Sequencing</kwd>
                <kwd>Germline variants</kwd>
                <kwd>Somatic variants</kwd>
                <kwd>Cancer</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="http://dx.doi.org/10.13039/501100004359">
                    <funding-source>Vetenskapsr&#x00e5;det</funding-source>
                    <award-id>2017-00630</award-id>
                    <award-id>2017-00656</award-id>
                </award-group>
                <award-group id="fund-2" xlink:href="http://dx.doi.org/10.13039/501100006313">
                    <funding-source>Barncancerfonden</funding-source>
                    <award-id>BB2018-0001</award-id>
                    <award-id>BB2017-0001</award-id>
                    <award-id>BB2019-0001</award-id>
                </award-group>
                <award-group id="fund-3" xlink:href="http://dx.doi.org/10.13039/501100004063">
                    <funding-source>Knut och Alice Wallenbergs Stiftelse</funding-source>
                    <award-id>2014.0278</award-id>
                </award-group>
                <funding-statement>This study was supported by the Swedish Research Council (NGI: 2017-00630, NBIS: 2017-00656), the Swedish Childhood Cancer Fund (The Swedish Childhood Tumor Biobank (BTB): BB2017-0001; BB2018-0001; BB2019-0001), and the Knut and Alice Wallenberg Foundation (KAW 2014.0278).</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>This version is a minor revision and improvement of the already accepted manuscript, based on the comments from the two reviewers. The main change is the inclusion of accuracy measures for germline variants based on the Genome In a Bottle HG001 gold standard dataset, presented in the text and in the new Table 4. In addition, we have added information about which tools are used for each type of variant calling in the revised Table 1. Other edits to the text are minor clarifications of i) the selection of the included software, ii) the usage of the &#x201c;-profile&#x201d; parameter, iii) the as yet limited benchmarking of exome sequencing data, iv) the availability of a small test dataset, v) the user responsibility to adjust the downstream filtering of variants, vi) how Docker, Singularity and Conda environments are provided, and vii) the workflow error handling.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies open up new avenues for research and for clinical applications, with many large initiatives launched worldwide. While much effort has been invested in novel sequencing analysis software, the importance of providing and maintaining workflows to combine software in an efficient and reproducible manner has been underestimated, and too few resources are typically dedicated to addressing this issue. This is of particular importance for somatic variant analysis and especially for analysis of complex cancer genomes, where a combination of tools is still required for optimal sensitivity and specificity and to detect various types of gene mutations and other abnormalities (
                <xref ref-type="bibr" rid="ref-1">Alioto 
                    <italic toggle="yes">et al</italic>., 2015</xref>). Some encouraging solutions have been presented in recent years, including 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/WGLab/SeqMule">SeqMule</ext-link> (
                <xref ref-type="bibr" rid="ref-17">Guo 
                    <italic toggle="yes">et al</italic>., 2015</xref>), 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/hall-lab/speedseq">SpeedSeq</ext-link> (
                <xref ref-type="bibr" rid="ref-5">Chiang 
                    <italic toggle="yes">et al</italic>., 2015</xref>), 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/bcbio/bcbio-nextgen">Bcbio-nextgen</ext-link>, and 
                <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.astate.edu/dna-pipeline/">DNAp</ext-link> (
                <xref ref-type="bibr" rid="ref-3">Causey 
                    <italic toggle="yes">et al</italic>., 2018</xref>). While all of the above represent commendable and important efforts, we have not found any workflow solution that in our opinion fulfils all of the following important user aspects: (i) easy installation, (ii) robust portability across different compute environments, (iii) comprehensive documentation, (iv) transparent and easy-to-read code, and (v) extensive quality metrics reporting. Here we present Sarek, an easy-to-install community-maintained workflow, offering a complete and scalable solution for germline and somatic variant detection, annotation and quality control. Sarek supports several reference genomes and can handle data from WGS, WES and gene panels, and is intended to be used both as a production workflow at core facilities and as a stand-alone tool for individual research groups. By using Docker or Singularity containers, Sarek installs easily on all POSIX-compatible systems such as Linux and Mac OS X and is designed to work on compute environments dedicated to handling sensitive personal data without direct internet access, a situation expected to become increasingly common with growing data security awareness.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Operation: workflow overview and software</title>
                <p>Sarek offers a portable workflow for germline and somatic variant detection, annotation and quality control based on WGS, WES or gene panel data, using a range of state-of-the-art software and data resources in the field (
                    <xref ref-type="table" rid="T1">Table 1</xref>, 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>). In the pre-processing step, sequence reads are aligned to the reference genome with BWA-MEM (
                    <xref ref-type="bibr" rid="ref-20">Li, 2013</xref>), followed by deduplication and recalibration with GATK (
                    <xref ref-type="bibr" rid="ref-23">McKenna 
                        <italic toggle="yes">et al</italic>., 2010</xref>). For germline samples, single-nucleotide variants and small insertions/deletions are detected with HaplotypeCaller (
                    <xref ref-type="bibr" rid="ref-23">McKenna 
                        <italic toggle="yes">et al</italic>., 2010</xref>) and Strelka2 (
                    <xref ref-type="bibr" rid="ref-18">Kim 
                        <italic toggle="yes">et al</italic>., 2018</xref>), and structural variations are detected with Manta (
                    <xref ref-type="bibr" rid="ref-4">Chen 
                        <italic toggle="yes">et al</italic>., 2016</xref>) and TIDDIT (
                    <xref ref-type="bibr" rid="ref-10">Eisfeldt 
                        <italic toggle="yes">et al</italic>., 2017</xref>). For somatic samples, somatic single-base mutations (SSM) and small somatic insertion/deletion mutations (SIM) are detected by GATK4 Mutect2 (
                    <xref ref-type="bibr" rid="ref-6">Cibulskis 
                        <italic toggle="yes">et al</italic>., 2013</xref>) and Strelka2 (
                    <xref ref-type="bibr" rid="ref-18">Kim 
                        <italic toggle="yes">et al</italic>., 2018</xref>). Somatic structural variants (including copy-number variation), as well as ploidy and sample purity, are detected by Manta (
                    <xref ref-type="bibr" rid="ref-4">Chen 
                        <italic toggle="yes">et al</italic>., 2016</xref>), ASCAT (
                    <xref ref-type="bibr" rid="ref-26">Van Loo 
                        <italic toggle="yes">et al</italic>., 2010</xref>), and Control-FREEC (
                    <xref ref-type="bibr" rid="ref-2">Boeva 
                        <italic toggle="yes">et al</italic>., 2012</xref>). All variants are annotated for potential functional effects with snpEff (
                    <xref ref-type="bibr" rid="ref-7">Cingolani 
                        <italic toggle="yes">et al</italic>., 2012</xref>) and VEP (
                    <xref ref-type="bibr" rid="ref-24">McLaren 
                        <italic toggle="yes">et al</italic>., 2016</xref>). Importantly, Sarek also generates a wide range of quality control metrics using 
                    <ext-link ext-link-type="uri" xlink:href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</ext-link>, QualiMap (
                    <xref ref-type="bibr" rid="ref-25">Okonechnikov 
                        <italic toggle="yes">et al</italic>., 2016</xref>), BCFtools (
                    <xref ref-type="bibr" rid="ref-22">Li, 2011</xref>), Samtools (
                    <xref ref-type="bibr" rid="ref-21">Li 
                        <italic toggle="yes">et al</italic>., 2009</xref>), and VCFtools (
                    <xref ref-type="bibr" rid="ref-8">Danecek 
                        <italic toggle="yes">et al</italic>., 2011</xref>), visualized as an aggregated quality control review across samples with MultiQC (
                    <xref ref-type="bibr" rid="ref-11">Ewels 
                        <italic toggle="yes">et al</italic>., 2016</xref>). All software currently included in Sarek was selected on the criteria that it be of high quality, well maintained, and robust in both installation and runtime performance. Additional alternative or complementary software will be added to Sarek in later updates, based on the input and engagement of the user community.</p>
                <table-wrap id="T1" orientation="portrait" position="anchor">
                    <label>Table 1. </label>
                    <caption>
                        <title>Software required and implemented in Sarek.</title>
                        <p>A list of all the software required and currently implemented in Sarek. All analysis and quality metrics software is installed automatically when Sarek is launched. P, Preprocessing; G, Germline; S, Somatic; snv, Single-nucleotide variants and small indels; sv, Structural variants; pp, Ploidy and sample purity; a, Annotation.</p>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Software/Resource</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Analyses</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Availability</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <bold>Required software</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Nextflow</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://www.nextflow.io/index.html">https://www.nextflow.io/index.html</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Docker, Singularity or Conda</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://www.docker.com/">https://www.docker.com/</ext-link>, 
                                    <ext-link ext-link-type="uri" xlink:href="https://sylabs.io/">https://sylabs.io/</ext-link>, 
                                    <ext-link ext-link-type="uri" xlink:href="https://docs.conda.io/en/latest/">https://docs.conda.io/en/latest/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <bold>Included analysis software</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">BWA-MEM</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">P</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="http://bio-bwa.sourceforge.net/">http://bio-bwa.sourceforge.net/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">GATK4</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">P, G(snv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://software.broadinstitute.org/gatk/">https://software.broadinstitute.org/gatk/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Samtools</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">P, G(snv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/samtools/samtools">https://github.com/samtools/samtools</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Strelka2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">G(snv), S(snv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/Illumina/strelka">https://github.com/Illumina/strelka</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Manta</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">G(sv), S(sv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/Illumina/manta">https://github.com/Illumina/manta</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">TIDDIT</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">G(sv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/SciLifeLab/TIDDIT">https://github.com/SciLifeLab/TIDDIT</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">GATK4 Mutect2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">S(snv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2">https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Freebayes</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">S(snv)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/ekg/freebayes">https://github.com/ekg/freebayes</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">ASCAT</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">S(pp)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/Crick-CancerGenomics/ascat">https://github.com/Crick-CancerGenomics/ascat</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Control-FREEC</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">S(pp)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="http://boevalab.inf.ethz.ch/FREEC/">http://boevalab.inf.ethz.ch/FREEC/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">snpEff</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">G(a), S(a)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="http://snpeff.sourceforge.net/">http://snpeff.sourceforge.net/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">VEP</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">G(a), S(a)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="http://www.ensembl.org/vep">http://www.ensembl.org/vep</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <bold>Included quality metrics software</bold>
</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">MultiQC</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="http://multiqc.info/">http://multiqc.info/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">FastQC</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/">https://www.bioinformatics.babraham.ac.uk/projects/fastqc/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">BamQC</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/s-andrews/BamQC">https://github.com/s-andrews/BamQC</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">QualiMap</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="http://qualimap.bioinfo.cipf.es/">http://qualimap.bioinfo.cipf.es/</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">BCFtools</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://github.com/samtools/bcftools">https://github.com/samtools/bcftools</ext-link>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">VCFtools</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <ext-link ext-link-type="uri" xlink:href="https://vcftools.github.io/index.html">https://vcftools.github.io/index.html</ext-link>
</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Schematic overview of the Sarek workflow for analysis of germline and somatic variants.</title>
                        <p>A schematic overview including some of the main analysis software implemented in the Sarek workflow. A more comprehensive list of the currently implemented software is given in 
                            <xref ref-type="table" rid="T1">Table 1</xref>.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/27789/89f3d64c-f181-4982-bd23-eca9a6f68ca7_figure1.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Portability and reproducibility</title>
                <p>Sarek is implemented in Nextflow (
                    <xref ref-type="bibr" rid="ref-9">Di Tommaso 
                        <italic toggle="yes">et al</italic>., 2017</xref>), a workflow language designed specifically for bioinformatics applications. Nextflow has a transparent design, making the Sarek code easy to read, adjust and extend. Sarek reports errors in a way that helps diagnose, for example, software or hardware failures during a run, and incomplete runs can easily be restarted from any stage of the workflow. Compared to the Bpipe workflow language (used, for example, in DNAp), Nextflow offers superior support for different execution environments, such as Slurm, Sun Grid Engine, LSF and Kubernetes, and includes native support for cloud compute environments including Google Cloud and AWS. Support for 
                    <ext-link ext-link-type="uri" xlink:href="https://aws.amazon.com/batch/">AWS Batch</ext-link> makes it possible to distribute thousands of batch jobs on Amazon Web Services. Sarek is part of a rapidly growing community effort to develop well-documented and community-tested 
                    <ext-link ext-link-type="uri" xlink:href="https://nf-co.re/">Nextflow pipelines</ext-link>, and adheres to the nf-core portability and documentation guidelines (
                    <xref ref-type="bibr" rid="ref-12">Ewels 
                        <italic toggle="yes">et al</italic>., 2019</xref>). To facilitate easy installation and to ensure reproducibility, all tools required by Sarek are installed with Conda and then pushed to Docker Hub (
                    <ext-link ext-link-type="uri" xlink:href="https://hub.docker.com/">https://hub.docker.com/</ext-link>), making Sarek and all its dependencies directly accessible from a Conda environment, or as 
                    <ext-link ext-link-type="uri" xlink:href="http://www.docker.com/">Docker</ext-link> or Singularity (
                    <xref ref-type="bibr" rid="ref-19">Kurtzer 
                        <italic toggle="yes">et al</italic>., 2017</xref>) containers. While Docker is a widely appreciated container solution, it is not always allowed at high-performance computing centers because of the security risks involved, making Singularity the preferred choice at these sites (
                    <xref ref-type="bibr" rid="ref-19">Kurtzer 
                        <italic toggle="yes">et al</italic>., 2017</xref>). This is particularly important for computing environments designed for handling sensitive personal data, where a high level of data security has to be maintained across multiple projects and users.</p>
            </sec>
            <sec>
                <title>Implementation: equipment and resource usage</title>
                <p>Sarek can be installed and executed on any POSIX-compatible computer system. To run a full WGS analysis, including both germline and somatic variants from a tumour/normal dataset with 90X/90X read coverage, we recommend a minimum of 16 cores on a node with 128 GB RAM, and at least 4 TB of free storage (in addition to the initial FASTQ files) in the input/output working directory. Of this, about 1.4 TB will be allocated for BAM files, annotated VCF files and CNV files, excluding GVCF files (
                    <xref ref-type="table" rid="T2">Table 2</xref>). At the end of the run, 2.3 TB of temporary data can be removed, unless the user plans to perform re-runs from intermediate processing states. Many processes are distributed across cores by dividing the genome into smaller chunks, each handled as a separate single-core job, with all the results merged and sorted in a final step. Some of the software used is parallelized by design, while for other tools Sarek uses a scatter-gather approach to distribute the processing load efficiently across CPU cores and reduce the wall clock runtime.</p>
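                <p>The scatter-gather approach described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code taken from Sarek: the contig lengths, the chunk size and the function names (<monospace>scatter</monospace>, <monospace>process_interval</monospace>, <monospace>gather</monospace>, <monospace>run</monospace>) are hypothetical, and a thread pool stands in for the separate jobs that a workflow engine would schedule.</p>
                <preformat>
```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contig lengths in base pairs, for illustration only.
CONTIG_LENGTHS = {"chr1": 1000, "chr2": 750}


def scatter(contig_lengths, chunk_size):
    """Split each contig into half-open (contig, start, end) intervals."""
    intervals = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)
            intervals.append((contig, start, end))
    return intervals


def process_interval(interval):
    """Stand-in for a per-interval tool run (e.g. a variant caller).

    It only reports the interval size so that the sketch stays runnable.
    """
    contig, start, end = interval
    return {"contig": contig, "start": start, "end": end, "bases": end - start}


def gather(results):
    """Merge the per-interval results and sort them by position."""
    return sorted(results, key=lambda r: (r["contig"], r["start"]))


def run(contig_lengths, chunk_size, workers=2):
    """Scatter the genome, process the chunks concurrently, gather the results."""
    intervals = scatter(contig_lengths, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_interval, intervals))
    return gather(results)


if __name__ == "__main__":
    merged = run(CONTIG_LENGTHS, chunk_size=400)
    print(len(merged), "intervals,", sum(r["bases"] for r in merged), "bases")
```
</preformat>
                <p>With these toy inputs the two contigs are split into five intervals, which are processed concurrently and then merged back into positional order.</p>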
                <table-wrap id="T2" orientation="portrait" position="anchor">
                    <label>Table 2. </label>
                    <caption>
                        <title>Sarek resource usage.</title>
                        <p>Resource usage during a Sarek run on a WGS 90X/90X coverage medulloblastoma dataset on a 48-threaded computer node, starting from compressed FASTQ files. The storage resources refer to result files only. The total storage including all temporary data was 3.7 TB.</p>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th colspan="1" rowspan="1"/>
                                <th align="center" colspan="1" rowspan="1" valign="top">Input
                                    <break/>data</th>
                                <th align="center" colspan="1" rowspan="1" valign="top">Mapping, merging,
                                    <break/>deduplication</th>
                                <th align="center" colspan="1" rowspan="1" valign="top">Quality score
                                    <break/>recalibration</th>
                                <th align="center" colspan="1" rowspan="1" valign="top">Variant calling,
                                    <break/>annotation</th>
                                <th align="center" colspan="1" rowspan="1" valign="top">Total</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Storage</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1" valign="top">458 GB</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">530 GB</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">386 GB</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">4 GB</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1378 GB</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <bold>Process time</bold>
</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="center" colspan="1" rowspan="1" valign="top">1081 CPU h</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">95 CPU h</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">614 CPU h</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1790 CPU h</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <bold>Wall clock time</bold>
</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="center" colspan="1" rowspan="1" valign="top">35h 26m</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">3h 26m</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">13h 29m</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">48h 21m</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">

                                    <bold>Peak memory</bold>
</td>
                                <td colspan="1" rowspan="1"/>
                                <td align="center" colspan="1" rowspan="1" valign="top">119 GB</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">18 GB</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">128 GB</td>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                        </tbody>
                    </table>
                    <table-wrap-foot>
                        <fn>
                            <p>GB, gigabyte; CPU, central processing unit; h, hours; m, minutes.</p>
                        </fn>
                    </table-wrap-foot>
                </table-wrap>
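                <p>The totals in Table 2 can be re-derived from the per-stage values with simple arithmetic. The short Python sketch below does this for storage and process time. The variable names are hypothetical, and the average number of busy threads is a rough figure derived here from the reported totals, not a measurement reported for the run.</p>
                <preformat>
```python
# Per-stage values copied from Table 2 (storage of result files in GB).
STORAGE_GB = {
    "input data": 458,
    "mapping, merging, deduplication": 530,
    "quality score recalibration": 386,
    "variant calling, annotation": 4,
}

# Per-stage process time in CPU hours, copied from Table 2.
CPU_HOURS = {
    "mapping, merging, deduplication": 1081,
    "quality score recalibration": 95,
    "variant calling, annotation": 614,
}

# Total wall clock time as reported in Table 2 (48h 21m), in hours.
TOTAL_WALL_CLOCK_H = 48 + 21 / 60

total_storage_gb = sum(STORAGE_GB.values())
total_cpu_hours = sum(CPU_HOURS.values())

# Result storage relative to the size of the input FASTQ files.
storage_ratio = total_storage_gb / STORAGE_GB["input data"]

# Derived estimate (an assumption of this sketch): CPU time per wall clock hour.
avg_busy_threads = total_cpu_hours / TOTAL_WALL_CLOCK_H

print("total storage (GB):", total_storage_gb)
print("total process time (CPU h):", total_cpu_hours)
print("storage relative to input:", round(storage_ratio, 1))
print("average busy threads:", round(avg_busy_threads))
```
</preformat>
                <p>The sums reproduce the Total column for storage (1378 GB) and process time (1790 CPU h), and the storage ratio of about 3 relative to the input data.</p>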
            </sec>
            <sec>
                <title>Installation and testing</title>
                <p>Sarek is run from a computer system with a local installation of Nextflow and support for Conda environments, Docker containers or Singularity containers. Nextflow can automatically fetch the Sarek source code from GitHub. All software dependencies are encapsulated in Docker or Singularity containers, which are downloaded from 
                    <ext-link ext-link-type="uri" xlink:href="http://hub.docker.com/">Docker Hub</ext-link>, or built in a new Conda environment using Bioconda (
                    <xref ref-type="bibr" rid="ref-16">Gr&#x00fc;ning 
                        <italic toggle="yes">et al</italic>., 2018</xref>). As a result, the user avoids cumbersome manual software installation. Configuration files allow the workflow to be tailored to specific user needs. Sarek comes with a small test dataset and a suite of tests to verify the installation. These are also used for continuous integration testing with 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/features/actions">GitHub Actions</ext-link>.</p>
            </sec>
        </sec>
        <sec sec-type="results">
            <title>Results</title>
            <p>To test performance in terms of resource usage and biological results, Sarek was run on a medulloblastoma WGS tumour/normal dataset from a sample with high tumour cell content (&#x223c;98%), and with a curated &#x201c;Gold Set&#x201d; of verified somatic mutations from a previous benchmark study (
                <xref ref-type="bibr" rid="ref-1">Alioto 
                    <italic toggle="yes">et al</italic>., 2015</xref>). In line with the above benchmark study, Sarek (version 2.5.2) was executed with WGS germline and somatic variant calling using a 90X/90X tumour/normal dataset (accession number EGAD00001001859, read sets EGAR00001387019-24 and EGAR00001387025-32). Runs were performed on a single 48-thread node with local direct-attached storage (DAS): a Dell PowerEdge R740 server with two Intel Xeon Gold 6126 CPUs (24 cores and 48 threads in total), 756 GB memory, and 100 TB SCv3020 Compellent storage. The complete Sarek run, including preprocessing followed by both germline and somatic variant calling and annotation, took 48 hours and 21 minutes, and required about three times more storage than the original input data (
                <xref ref-type="table" rid="T2">Table 2</xref>). Notably, the complete Sarek run was executed by a single command, with fully automated installation, execution and efficient job distribution across the more than 15 different software tools needed to complete the analysis and provide quality control metrics, and without any manual intervention during the two-day run. To ensure that the Sarek output was biologically sound, we calculated precision, recall and F1 statistics for the Sarek output based on the &#x201c;Gold Set&#x201d; of somatic single-base mutations (SSM) and somatic insertion/deletion mutations (SIM) as previously defined (
                <xref ref-type="bibr" rid="ref-1">Alioto 
                    <italic toggle="yes">et al</italic>., 2015</xref>). Using the intersection of the output from the two somatic variant callers (GATK4 Mutect2 and Strelka2), Sarek provided accuracy measures for SSMs (F1 score = 0.80) and SIMs (F1 score = 0.58) in the top range of the 18 somatic variant calling procedures included in the original benchmarking study on this data set (
                <xref ref-type="table" rid="T3">Table 3</xref>), indicating that the workflow operates as intended. The sample purity was estimated to be 100%, compared with the 98% previously reported for this sample. For somatic structural variants and ploidy, no relevant benchmark data were available, and therefore no quantitative assessment beyond previously published results for the implemented software could be performed. Instead, the integrity of the runs was checked by comparing the results of Manta, ASCAT, and Control-FREEC run within Sarek with those obtained when running the same tools stand-alone. To benchmark Sarek on germline single-nucleotide variants and small insertions/deletions, we used 46X WGS data for the well-studied individual NA12878:HG001 (
                <ext-link ext-link-type="uri" xlink:href="ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/">ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/</ext-link>, read set folders 131219_D00360_005_BH814YADXX [accession numbers SRR2052337 - SRR2052339, SRR2052342, SRR2052345, SRR20523428], and 131219_D00360_006_AH81VLADXX [accession numbers SRX1049774 - SRX1049779]) and a &#x201c;Gold Set&#x201d; of variants from the Genome in a Bottle project (
                <xref ref-type="bibr" rid="ref-27">Zook 
                    <italic toggle="yes">et al.</italic>, 2019</xref>), showing overall high accuracy (
                <xref ref-type="table" rid="T4">Table 4</xref>).</p>
            <table-wrap id="T3" orientation="portrait" position="anchor">
                <label>Table 3. </label>
                <caption>
                    <title>Sarek WGS somatic variant benchmarking.</title>
                    <p>Summary of accuracy measures for the two somatic variant callers used in Sarek to detect somatic single-base mutations (SSMs) and somatic insertion/deletion mutations (SIMs), as well as their union and intersection.</p>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Somatic caller</th>
                            <th align="center" colspan="1" rowspan="1" valign="top">Recall</th>
                            <th align="center" colspan="1" rowspan="1" valign="top">Precision</th>
                            <th align="center" colspan="1" rowspan="1" valign="top">F1-score</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">

                                <bold>SSM (Gold Set: n=1263)</bold>
</td>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">GATK4 Mutect2</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.80</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.45</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.58</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Strelka2</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.29</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.42</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Union (GATK4 Mutect2, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.23</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.36</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Intersection (GATK4 Mutect2, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.88</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.80</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Benchmark median
                                <xref ref-type="other" rid="FN1">*</xref>
                            </td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.68</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.71</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">

                                <bold>SIM (Gold Set: n=347)</bold>
</td>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">GATK4 Mutect2</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.48</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.38</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.42</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Strelka2</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.31</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.44</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Union (GATK4 Mutect2, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.25</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.38</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Intersection (GATK4 Mutect2, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.46</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.58</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Benchmark median
                                <xref ref-type="other" rid="FN1">*</xref>
                            </td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.34</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.48</td>
                        </tr>
                    </tbody>
                </table>
                <table-wrap-foot>
                    <fn>
                        <p>* The median accuracy measures across 18 somatic variant calling procedures as previously reported (
                            <xref ref-type="bibr" rid="ref-1">Alioto 
                                <italic toggle="yes">et al</italic>., 2015</xref>).</p>
                    </fn>
                </table-wrap-foot>
            </table-wrap>
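            <p>The F1 scores in Table 3 are the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R). As a minimal sketch, the Python snippet below recomputes F1 from the rounded recall and precision values in the table. The function and variable names are hypothetical, and because the inputs are already rounded to two decimals, recomputed scores could in general differ in the last digit from scores derived from raw variant counts.</p>
            <preformat>
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) as listed in Table 3.
SSM = {
    "GATK4 Mutect2": (0.80, 0.45, 0.58),
    "Strelka2": (0.77, 0.29, 0.42),
    "Union": (0.82, 0.23, 0.36),
    "Intersection": (0.74, 0.88, 0.80),
}
SIM = {
    "GATK4 Mutect2": (0.48, 0.38, 0.42),
    "Strelka2": (0.74, 0.31, 0.44),
    "Union": (0.77, 0.25, 0.38),
    "Intersection": (0.46, 0.77, 0.58),
}


def recompute(table):
    """Return {caller: (recomputed F1, reported F1)} for one variant class."""
    out = {}
    for caller, (recall, precision, reported) in table.items():
        out[caller] = (round(f1_score(precision, recall), 2), reported)
    return out


if __name__ == "__main__":
    for label, table in (("SSM", SSM), ("SIM", SIM)):
        for caller, (computed, reported) in recompute(table).items():
            print(label, caller, "computed:", computed, "reported:", reported)
```
</preformat>
            <p>For every row, the recomputed value agrees with the reported F1 score to two decimals, which also shows how the intersection gains in F1 by trading a small loss in recall for a large gain in precision.</p>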
            <table-wrap id="T4" orientation="portrait" position="anchor">
                <label>Table 4. </label>
                <caption>
                    <title>Sarek WGS germline variant benchmarking.</title>
                    <p>Summary of accuracy measures for the two variant callers used in Sarek to detect germline single-nucleotide variants (SNVs) and germline insertion/deletion variants (INDELs), as well as their union and intersection.</p>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Germline caller</th>
                            <th align="center" colspan="1" rowspan="1" valign="top">Recall</th>
                            <th align="center" colspan="1" rowspan="1" valign="top">Precision</th>
                            <th align="center" colspan="1" rowspan="1" valign="top">F1-score</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">
                                <bold>SNV (Gold Set: n=3088156)</bold>
                            </td>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">GATK4 HaplotypeCaller</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.93</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">1.00</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.96</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Strelka2</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.98</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">1.00</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.99</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Union (GATK4 HaplotypeCaller, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.99</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.94</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.96</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Intersection (GATK4 HaplotypeCaller, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.93</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">1.00</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.96</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">
                                <bold>INDEL (Gold Set: n=530423)</bold>
                            </td>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                            <td colspan="1" rowspan="1"/>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">GATK4 HaplotypeCaller</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.91</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.99</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.95</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Strelka2</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.92</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.99</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.95</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Union (GATK4 HaplotypeCaller, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.93</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.98</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.96</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Intersection (GATK4 HaplotypeCaller, Strelka2)</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.90</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">1.00</td>
                            <td align="center" colspan="1" rowspan="1" valign="top">0.94</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
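            <p>The F1-scores in Table 4 can be related to the other two columns, assuming the standard definition of the F1-score as the harmonic mean of precision (P) and recall (R). The worked example below uses the rounded SNV values listed for GATK4 HaplotypeCaller. Because it starts from rounded inputs, values recomputed this way for other rows may differ from the tabulated ones in the last digit.</p>
            <p>
                <disp-formula>
                    <tex-math><![CDATA[F_1 = \frac{2PR}{P + R} = \frac{2 \times 1.00 \times 0.93}{1.00 + 0.93} \approx 0.96]]></tex-math>
                </disp-formula>
            </p>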
        </sec>
        <sec>
            <title>Use case</title>
            <p>Sarek has been extensively tested on and applied to various WGS datasets, including thousands of samples for germline variant analyses and hundreds of paired tumour/normal samples for somatic mutation analyses. Sarek has also been adapted to run on WES data and gene panels and has been reported to work well in pilot user projects, although no systematic testing has yet been performed on such data. Below we present a standard use case with a tumour/normal WGS dataset as input, running both germline and somatic variant analyses.</p>
            <sec>
                <title>Input data</title>
                <p>For a somatic variant analysis, the user should provide the sequencing FASTQ files from both tumour and normal control tissue from the same individual, described in a tab-delimited TSV file (here: 
                    <italic toggle="yes">samples.tsv</italic>). Each line of the TSV file contains information about a sequence data file, including: the identifier of the individual, the gender (XX or XY), the status of the sample (0 for Normal or 1 for Tumour), the identifier of the sample, the sequencing lane (if samples are multiplexed across multiple lanes), and the paths to the FASTQ files of the first and second read in the read pair. Relapse samples from the same individual are also supported.</p>
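                <p>As an illustration, a minimal <italic toggle="yes">samples.tsv</italic> for one tumour/normal pair might look like the sketch below. It assumes that the tab-separated columns appear in the order listed above and that the file has no header line. All identifiers and file paths are hypothetical placeholders, and the tumour sample is shown split across two lanes.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
PATIENT1	XX	0	NORMAL1	1	/data/normal1_L001_R1.fastq.gz	/data/normal1_L001_R2.fastq.gz
PATIENT1	XX	1	TUMOUR1	1	/data/tumour1_L001_R1.fastq.gz	/data/tumour1_L001_R2.fastq.gz
PATIENT1	XX	1	TUMOUR1	2	/data/tumour1_L002_R1.fastq.gz	/data/tumour1_L002_R2.fastq.gz
                    </preformat>
                </p>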
            </sec>
            <sec>
                <title>Running Sarek on WGS data with Singularity containers</title>
                <p>Running Sarek with Singularity containers on a computer system supporting Java 8 requires only the installation of Nextflow and Singularity. A full analysis run starting from FASTQ files, including mapping, recalibration, variant calling and annotation, as well as generation of a full QC report, can be invoked with a single Nextflow command:</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
                        <styled-content style="color:#000000">&gt; nextflow run nf-core/sarek -r 2.5.2 -profile singularity --input samples.tsv --tools Mutect2,Strelka,Manta,TIDDIT,ASCAT,ControlFREEC,snpEff,VEP</styled-content>
                    </preformat>
                </p>
                <p>Nextflow will recognize the workflow name, download the specified version (2.5.2) of the pipeline from GitHub together with the corresponding container, and fetch the required reference files from 
                    <ext-link ext-link-type="uri" xlink:href="https://ewels.github.io/AWS-iGenomes/">AWS-iGenomes</ext-link>. The default reference genome is human GRCh38, but Sarek also supports GRCh37 and nearly 30 other genomes directly accessible from iGenomes. Alternatively, users can manually supply Sarek with other reference genomes. Non-default parameters and links to local reference files are handled in accordance with nf-core guidelines. User configuration profiles can be stored locally or centrally at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/nf-core/configs">https://github.com/nf-core/configs</ext-link>.</p>
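                <p>As a sketch of how a non-default reference genome might be selected, the command above could be extended with the <italic toggle="yes">--genome</italic> parameter. This assumes that GRCh37 is the identifier used in the iGenomes configuration shipped with the pipeline; the shortened tool list is only an example.</p>
                <p>
                    <preformat orientation="portrait" position="float" preformat-type="computer code" xml:space="preserve">
&gt; nextflow run nf-core/sarek -r 2.5.2 -profile singularity --input samples.tsv --genome GRCh37 --tools Mutect2,Strelka,Manta
                    </preformat>
                </p>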
            </sec>
            <sec>
                <title>Output</title>
                <p>A full Sarek run will produce a large number of output files, but the main results consist of (i) a set of annotated variants in VCF files from the various included tools for both germline and somatic variants, (ii) tumour sample purity and ploidy results for somatic samples, and (iii) a broad set of QC metrics. A detailed description of all output files is given in the 
                    <ext-link ext-link-type="uri" xlink:href="https://nf-co.re/sarek/docs/output">Sarek documentation pages</ext-link>. While Sarek will report variants from all callers included in the run, it is up to the user to decide how to combine and filter the results from different callers, since the optimal post-processing will depend on the particular samples and research questions at hand.</p>
            </sec>
        </sec>
        <sec sec-type="discussion">
            <title>Discussion</title>
            <p>Human WGS is transforming medical research and provides a foundation for developing novel clinical applications and improving health care. An important aspect of harnessing the potential of WGS is, however, to empower the research community with adequate bioinformatics tools. Reproducible bioinformatics workflows are important drivers of scientific progress because they make complex processing of large datasets feasible for a wide range of researchers. While we are highly appreciative of existing workflows for cancer and non-cancer variant detection, we argue that there is no one-size-fits-all solution and that more initiatives are needed to serve the large and diverse research user community, especially for WGS data. Sarek builds on a philosophy of reasonably narrow, independent workflows, written in the domain-specific language Nextflow. In our experience, this is an effective strategy to simplify workflow maintenance at sequencing core facilities and to allow easy deployment and modification by individual research groups. Sarek efficiently utilizes cloud and high-performance compute clusters and installs easily across compute environments. Starting from raw FASTQ sequencing data, Sarek provides annotated VCF files, CNV reports and quality metrics for germline and cancer samples in about 48 hours for 90X/90X WGS data (as demonstrated here), in a few hours for WES data, and within minutes for gene panels (in-house data, not presented here). It should be noted that while Sarek can substantially reduce the labour and management time of running and maintaining a large collection of software, and help users to perform quality-controlled reporting in an organized manner, careful parameter tuning, downstream variant filtering, and qualitative assessment by the user remain important. Ongoing efforts aim to develop add-on ranking and visualization modules and to efficiently extract clinically and biologically relevant findings, to help advance basic and translational research.</p>
        </sec>
        <sec sec-type="conclusions">
            <title>Conclusion</title>
            <p>Sarek is a portable and reproducible workflow to detect germline and somatic variants from WGS, WES and gene panel data. It includes extensive analysis and quality control metrics, while still being limited to a relatively narrow scope to achieve optimal usability, functionality and transparency. Sarek is flexible, with a low threshold for user modifications, and is thus well adapted to the current requirements in the research community. Thanks to its design, it installs easily and reproducibly on all POSIX-compatible computer systems, including secure compute environments for sensitive personal data with indirect Internet access.</p>
        </sec>
        <sec>
            <title>Data availability</title>
            <sec>
                <title>Source data</title>
                <p>European Genome-phenome Archive: A comprehensive assessment of somatic mutation detection in cancer using whole genome sequencing. 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/ega/datasets/EGAD00001001859">https://www.ebi.ac.uk/ega/datasets/EGAD00001001859</ext-link>. Read sets EGAR00001387019-24 and EGAR00001387025-32 were analysed. These data are held under restricted access. Readers wishing to apply for access to the data must first apply through the ICGC Data Access Compliance Office (
                    <ext-link ext-link-type="uri" xlink:href="https://icgc.org/daco">https://icgc.org/daco</ext-link>) and complete the data access form. Access will be granted to those whose projects conform to the 
                    <ext-link ext-link-type="uri" xlink:href="https://protect-eu.mimecast.com/s/lME9CwMESlzArUKAHyI">goals and policies of ICGC</ext-link>. Help with completing the data access form is available at 
                    <ext-link ext-link-type="uri" xlink:href="https://icgc.org/daco/help-guide-section">https://icgc.org/daco/help-guide-section</ext-link>.</p>
                <p>Sequence Read Archive: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra/SRX1049774">NIST Genome in a Bottle, 
                        ~300X sequencing of HG001 (NA12878)</ext-link>. 
                    <ext-link ext-link-type="uri" xlink:href="ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/">ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/</ext-link>, read set folders 131219_D00360_005_BH814YADXX [SRA accession numbers SRR2052337 - SRR2052339, SRR2052342, SRR2052345, SRR20523428] and 131219_D00360_006_AH81VLADXX [SRA accession numbers SRX1049774 - SRX1049779]. These data are publicly available for direct download.</p>
                <p>The workflow itself comes with a prebuilt profile with a complete configuration for automated testing, including links to a small test dataset.</p>
            </sec>
        </sec>
        <sec>
            <title>Software availability</title>
            <p>
                <bold>Sarek is available at:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://nf-co.re/sarek">https://nf-co.re/sarek</ext-link>.</p>
            <p>
                <bold>Source code available at:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/nf-core/sarek">https://github.com/nf-core/sarek</ext-link>.</p>
            <p>
                <bold>Archived source code at time of publication:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.3579102">https://doi.org/10.5281/zenodo.3579102</ext-link> (
                <xref ref-type="bibr" rid="ref-14">Garcia 
                    <italic toggle="yes">et al</italic>., 2019</xref>).</p>
            <p>
                <bold>License:</bold> 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/nf-core/sarek/blob/master/LICENSE">MIT License</ext-link>.</p>
        </sec>
    </body>
    <back>
        <ack>
            <title>Acknowledgements</title>
            <p>We are grateful for the valuable input from the Oslo University Hospital bioinformatics core facility (Oslo University Hospital), the T Martinsson lab (Gothenburg University), the A&#x2013;C Syv&#x00e4;nen lab (Uppsala University), and Alex Peltzer (Quantitative Biology Center, University of T&#x00fc;bingen). The National Genomics Infrastructure (NGI) and Uppsala Multidisciplinary Centre for Advanced Computational Science (UPPMAX) provided computational resources. Help with graphical design was provided by Dr. Jonas S&#x00f6;derberg (Uppsala University).</p>
        </ack>
        <ref-list>
            <ref id="ref-1">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Alioto</surname>
                            <given-names>TS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Buchhalter</surname>
                            <given-names>I</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Derdak</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing.</article-title>
                    <source>

                        <italic toggle="yes">Nat Commun.</italic>
</source>
                    <year>2015</year>;<volume>6</volume>:<fpage>10001</fpage>.
                    <pub-id pub-id-type="pmid">26647970</pub-id>
                    <pub-id pub-id-type="doi">10.1038/ncomms10001</pub-id>
                    <pub-id pub-id-type="pmcid">4682041</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Boeva</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Popova</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bleakley</surname>
                            <given-names>K</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2012</year>;<volume>28</volume>(<issue>3</issue>):<fpage>423</fpage>&#x2013;<lpage>5</lpage>.
                    <pub-id pub-id-type="pmid">22155870</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btr670</pub-id>
                    <pub-id pub-id-type="pmcid">3268243</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Causey</surname>
                            <given-names>JL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ashby</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Walker</surname>
                            <given-names>K</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>DNAp: A Pipeline for DNA-seq Data Analysis.</article-title>
                    <source>

                        <italic toggle="yes">Sci Rep.</italic>
</source>
                    <year>2018</year>;<volume>8</volume>(<issue>1</issue>):<fpage>6793</fpage>.
                    <pub-id pub-id-type="pmid">29717215</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41598-018-25022-6</pub-id>
                    <pub-id pub-id-type="pmcid">5931599</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schulz-Trieglaff</surname>
                            <given-names>O</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shaw</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2016</year>;<volume>32</volume>(<issue>8</issue>):<fpage>1220</fpage>&#x2013;<lpage>1222</lpage>.
                    <pub-id pub-id-type="pmid">26647377</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv710</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chiang</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Layer</surname>
                            <given-names>RM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Faust</surname>
                            <given-names>GG</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>SpeedSeq: ultra-fast personal genome analysis and interpretation.</article-title>
                    <source>

                        <italic toggle="yes">Nat Methods.</italic>
</source>
                    <year>2015</year>;<volume>12</volume>(<issue>10</issue>):<fpage>966</fpage>&#x2013;<lpage>968</lpage>.
                    <pub-id pub-id-type="pmid">26258291</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3505</pub-id>
                    <pub-id pub-id-type="pmcid">4589466</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cibulskis</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lawrence</surname>
                            <given-names>MS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Carter</surname>
                            <given-names>SL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2013</year>;<volume>31</volume>(<issue>3</issue>):<fpage>213</fpage>&#x2013;<lpage>219</lpage>.
                    <pub-id pub-id-type="pmid">23396013</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.2514</pub-id>
                    <pub-id pub-id-type="pmcid">3833702</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cingolani</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Platts</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wang le</surname>
                            <given-names>L</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of 
                        <italic toggle="yes">Drosophila melanogaster</italic> strain w
                        <sup>1118</sup>; iso-2; iso-3.</article-title>
                    <source>

                        <italic toggle="yes">Fly (Austin).</italic>
</source>
                    <year>2012</year>;<volume>6</volume>(<issue>2</issue>):<fpage>80</fpage>&#x2013;<lpage>92</lpage>.
                    <pub-id pub-id-type="pmid">22728672</pub-id>
                    <pub-id pub-id-type="doi">10.4161/fly.19695</pub-id>
                    <pub-id pub-id-type="pmcid">3679285</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Danecek</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Auton</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Abecasis</surname>
                            <given-names>G</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The variant call format and VCFtools.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>27</volume>(<issue>15</issue>):<fpage>2156</fpage>&#x2013;<lpage>2158</lpage>.
                    <pub-id pub-id-type="pmid">21653522</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btr330</pub-id>
                    <pub-id pub-id-type="pmcid">3137218</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Di Tommaso</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chatzou</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Floden</surname>
                            <given-names>EW</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Nextflow enables reproducible computational workflows.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2017</year>;<volume>35</volume>(<issue>4</issue>):<fpage>316</fpage>&#x2013;<lpage>319</lpage>.
                    <pub-id pub-id-type="pmid">28398311</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3820</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Eisfeldt</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Vezzi</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Olason</surname>
                            <given-names>P</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>
                        <italic toggle="yes">TIDDIT</italic>, an efficient and comprehensive structural variant caller for massive parallel sequencing data [version 2; peer review: 2 approved].</article-title>
                    <source>

                        <italic toggle="yes">F1000Res.</italic>
</source>
                    <year>2017</year>;<volume>6</volume>:<fpage>664</fpage>.
                    <pub-id pub-id-type="pmid">28781756</pub-id>
                    <pub-id pub-id-type="doi">10.12688/f1000research.11168.2</pub-id>
                    <pub-id pub-id-type="pmcid">5521161</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ewels</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Magnusson</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lundin</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>MultiQC: Summarize analysis results for multiple tools and samples in a single report.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2016</year>;<volume>32</volume>(<issue>19</issue>):<fpage>3047</fpage>&#x2013;<lpage>3048</lpage>.
                    <pub-id pub-id-type="pmid">27312411</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btw354</pub-id>
                    <pub-id pub-id-type="pmcid">5039924</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ewels</surname>
                            <given-names>PA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Peltzer</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Fillinger</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>
                        <italic toggle="yes">nf-core</italic>: Community curated bioinformatics pipelines.</article-title>
                    <source>

                        <italic toggle="yes">bioRxiv.</italic>
</source>
                    <year>2019</year>;<volume>610741</volume>.
                    <pub-id pub-id-type="doi">10.1101/610741</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Garcia</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Peltzer</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Alneberg</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>nf-core/sarek: Sarek 2.5.2 - J&#x00e5;kk&#x00e5;tjkaskajekna (Version 2.5.2).</article-title>
                    <source>

                        <italic toggle="yes">Zenodo.</italic>
</source>
                    <year>2019</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.doi.org/10.5281/zenodo.3579102">http://www.doi.org/10.5281/zenodo.3579102</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gr&#x00fc;ning</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dale</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sj&#x00f6;din</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Bioconda: sustainable and comprehensive software distribution for the life sciences.</article-title>
                    <source>

                        <italic toggle="yes">Nat Methods.</italic>
</source>
                    <year>2018</year>;<volume>15</volume>(<issue>7</issue>):<fpage>475</fpage>&#x2013;<lpage>476</lpage>.
                    <pub-id pub-id-type="pmid">29967506</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41592-018-0046-7</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Guo</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ding</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shen</surname>
                            <given-names>Y</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>SeqMule: automated pipeline for analysis of human exome/genome sequencing data.</article-title>
                    <source>

                        <italic toggle="yes">Sci Rep.</italic>
</source>
                    <year>2015</year>;<volume>5</volume>:<fpage>14283</fpage>.
                    <pub-id pub-id-type="pmid">26381817</pub-id>
                    <pub-id pub-id-type="doi">10.1038/srep14283</pub-id>
                    <pub-id pub-id-type="pmcid">4585643</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Scheffler</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Halpern</surname>
                            <given-names>AL</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Strelka2: fast and accurate calling of germline and somatic variants.</article-title>
                    <source>

                        <italic toggle="yes">Nat Methods.</italic>
</source>
                    <year>2018</year>;<volume>15</volume>(<issue>8</issue>):<fpage>591</fpage>&#x2013;<lpage>594</lpage>.
                    <pub-id pub-id-type="pmid">30013048</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41592-018-0051-x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kurtzer</surname>
                            <given-names>GM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sochat</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bauer</surname>
                            <given-names>MW</given-names>
                        </name>
</person-group>:
                    <article-title>Singularity: Scientific containers for mobility of compute.</article-title>
                    <source>

                        <italic toggle="yes">PLoS One.</italic>
</source>
                    <year>2017</year>;<volume>12</volume>(<issue>5</issue>):<fpage>e0177459</fpage>.
                    <pub-id pub-id-type="pmid">28494014</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0177459</pub-id>
                    <pub-id pub-id-type="pmcid">5426675</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>
</person-group>:
                    <article-title>A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>27</volume>(<issue>21</issue>):<fpage>2987</fpage>&#x2013;<lpage>2993</lpage>.
                    <pub-id pub-id-type="pmid">21903627</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btr509</pub-id>
                    <pub-id pub-id-type="pmcid">3198575</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>
</person-group>:
                    <article-title>Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.</article-title>
                    <source>

                        <italic toggle="yes">arXiv 1303.3997v2.</italic>
</source>
                    <year>2013</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/pdf/1303.3997v2.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Handsaker</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wysoker</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Sequence Alignment/Map format and SAMtools.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2009</year>;<volume>25</volume>(<issue>16</issue>):<fpage>2078</fpage>&#x2013;<lpage>2079</lpage>.
                    <pub-id pub-id-type="pmid">19505943</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp352</pub-id>
                    <pub-id pub-id-type="pmcid">2723002</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>McKenna</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hanna</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Banks</surname>
                            <given-names>E</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2010</year>;<volume>20</volume>(<issue>9</issue>):<fpage>1297</fpage>&#x2013;<lpage>1303</lpage>.
                    <pub-id pub-id-type="pmid">20644199</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.107524.110</pub-id>
                    <pub-id pub-id-type="pmcid">2928508</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-24">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>McLaren</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gil</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hunt</surname>
                            <given-names>SE</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Ensembl Variant Effect Predictor.</article-title>
                    <source>

                        <italic toggle="yes">Genome Biol.</italic>
</source>
                    <year>2016</year>;<volume>17</volume>(<issue>1</issue>):<fpage>122</fpage>.
                    <pub-id pub-id-type="pmid">27268795</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-016-0974-4</pub-id>
                    <pub-id pub-id-type="pmcid">4893825</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-25">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Okonechnikov</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Conesa</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Garc&#x00ed;a-Alcalde</surname>
                            <given-names>F</given-names>
                        </name>
</person-group>:
                    <article-title>Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2016</year>;<volume>32</volume>(<issue>2</issue>):<fpage>292</fpage>&#x2013;<lpage>294</lpage>.
                    <pub-id pub-id-type="pmid">26428292</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btv566</pub-id>
                    <pub-id pub-id-type="pmcid">4708105</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-26">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Van Loo</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Nordgard</surname>
                            <given-names>SH</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lingj&#x00e6;rde</surname>
                            <given-names>OC</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Allele-specific copy number analysis of tumors.</article-title>
                    <source>

                        <italic toggle="yes">Proc Natl Acad Sci U S A.</italic>
</source>
                    <year>2010</year>;<volume>107</volume>(<issue>39</issue>):<fpage>16910</fpage>&#x2013;<lpage>16915</lpage>.
                    <pub-id pub-id-type="pmid">20837533</pub-id>
                    <pub-id pub-id-type="doi">10.1073/pnas.1009843107</pub-id>
                    <pub-id pub-id-type="pmcid">2947907</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-27">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Zook</surname>
                            <given-names>JM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McDaniel</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Olson</surname>
                            <given-names>ND</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>An open resource for accurately benchmarking small variant and reference calls.</article-title>
                    <source>

                        <italic toggle="yes">Nat Biotechnol.</italic>
</source>
                    <year>2019</year>;<volume>37</volume>(<issue>5</issue>):<fpage>561</fpage>&#x2013;<lpage>566</lpage>.
                    <pub-id pub-id-type="pmid">30936564</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41587-019-0074-6</pub-id>
                    <pub-id pub-id-type="pmcid">6500473</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report61129">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.18214.r61129</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Pitk&#x00e4;nen</surname>
                        <given-names>Esa</given-names>
                    </name>
                    <xref ref-type="aff" rid="r61129a1">1</xref>
                    <xref ref-type="aff" rid="r61129a2">2</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-9818-6370</uri>
                </contrib>
                <aff id="r61129a1">
                    <label>1</label>Institute for Molecular Medicine Finland (FIMM), Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland</aff>
                <aff id="r61129a2">
                    <label>2</label>Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>31</day>
                <month>3</month>
                <year>2020</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2020 Pitk&#x00e4;nen E</copyright-statement>
                <copyright-year>2020</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport61129" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.16665.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This manuscript describes Sarek, a workflow for analyzing next-generation sequencing (NGS) data. Sarek is based on Nextflow, a popular tool for defining computational workflows. In order to process NGS data, i.e., generating annotated variant calls ready for downstream analyses, multiple complex software tools need to be executed. This is not only computationally demanding, but also labor-intensive due to operators having to install and maintain a complicated collection of software as well as diagnose failed analysis runs, often resulting in high management overhead compared to the total computation time (Yakneen and Waszak, 2020
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-61129-1">1</xref>
                </sup>). Sarek aims to minimize the installation and management time overhead by building an NGS workflow on top of Nextflow, automatically installing the required software components. These components include some of the state-of-the-art tools in read mapping, variant calling and annotation, and quality control. Sarek is a welcome addition to the toolkit of bioinformaticians looking for an NGS analysis workflow that can be easily installed on an HPC cluster or cloud environment. The article is well-written and clear to understand. While I'm happy to recommend indexing of the manuscript also in the present form, I have a few suggestions on how to improve it:</p>
            <p> 
                <bold>Major comments:</bold> 
                <list list-type="order">
                    <list-item>
                        <p>While some of the existing NGS workflows are mentioned, I would appreciate it if Sarek was compared to these approaches in more detail. Is there functionality that is currently missing from Sarek that is present in one of the other workflows?</p>
                    </list-item>
                    <list-item>
                        <p>Typically in NGS data analysis, a lot of time can be spent on debugging failed runs to find out whether one of the tools failed, if the data is corrupted/missing or if there was a hardware error. How does Sarek support run diagnostics and relaunching failed jobs?</p>
                    </list-item>
                    <list-item>
                        <p>How does Sarek combine variant calls when multiple callers are used for a variant type?&#x00a0;</p>
                    </list-item>
                    <list-item>
                        <p>Somatic single nucleotide and indel variant calls from Sarek were shown to match well with a previously defined gold standard callset. No benchmark data was available for more complex somatic variants and variant calling accuracy for germline variants was not evaluated. I am interested in seeing more comprehensive tests to cover all germline and somatic variant types.</p>
                    </list-item>
                    <list-item>
                        <p>I would appreciate it if a minimal test dataset, together with instructions and a suite of automated tests, was provided with Sarek. This would make it easier for users to test an installation and to raise issues on GitHub.</p>
                    </list-item>
                </list> &#x00a0;</p>
            <p> 
                <bold>Minor comments:</bold> 
                <list list-type="order">
                    <list-item>
                        <p>How easy is it for users to modify or extend Sarek, for example by adding a new variant caller to the workflow? This could be explored in more detail in the text.</p>
                    </list-item>
                    <list-item>
                        <p>It would be good to easily see which tools are used to call and analyze each variant type. This information could be added either to Fig 1, Table 1 or both.</p>
                    </list-item>
                    <list-item>
                        <p>FreeBayes is included in Figure 1 but missing from Table 1.</p>
                    </list-item>
                    <list-item>
                        <p>The wording in &#x201c;To facilitate easy installation and to ensure reproducibility, all Sarek required tools are managed in Docker or Singularity (Kurtzer et al., 2017) containers, or a Conda environment.&#x201d; should be clarified -- are all tools being maintained in all the three systems?</p>
                    </list-item>
                    <list-item>
                        <p>When running Sarek with default options, it crashes in the tool version check. It took me a while to figure out this was due to the default &#x201c;-profile&#x201d; argument of &#x201c;standard&#x201d; which seems to assume Singularity is available. It would be good to improve error messages so that it is easier to understand the underlying cause. A minimal installation and testing procedure mentioned above would help in this regard.</p>
                    </list-item>
                    <list-item>
                        <p>Typo:&#x00a0;&#x201c;Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies opens&#x2026;&#x201d; -&gt; &#x201c;...open&#x201d;.</p>
                    </list-item>
                </list>
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Bioinformatics, cancer genetics, machine learning</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-61129-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Butler enables rapid cloud-based analysis of thousands of human genomes.</article-title>
                        <source>
                            <italic>Nat Biotechnol</italic>
                        </source>.<year>2020</year>;<volume>38</volume>(<issue>3</issue>):<fpage>288</fpage>&#x2013;<lpage>292</lpage>.
                        <pub-id pub-id-type="pmid">32024987</pub-id>
                        <pub-id pub-id-type="doi">10.1038/s41587-019-0360-3</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment5675-61129">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Nystedt</surname>
                            <given-names>Bj&#x00f6;rn</given-names>
                        </name>
                        <aff>Uppsala University, Sweden</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>1</day>
                    <month>7</month>
                    <year>2020</year>
                </pub-date>
            </front-stub>
            <body>
                <p>
                    <bold>We are grateful to the reviewer for the positive and constructive comments! We have uploaded a revised version of the manuscript and included updated documentation for the workflow, including adjustments and improvements as suggested by the reviewer, and as described in the detailed comments below, marked in bold. </bold>
                </p>
                <p> Approved</p>
                <p> This manuscript describes Sarek, a workflow for analyzing next-generation sequencing (NGS) data. Sarek is based on Nextflow, a popular tool for defining computational workflows. In order to process NGS data, i.e., generating annotated variant calls ready for downstream analyses, multiple complex software tools need to be executed. This is not only computationally demanding, but also labor-intensive due to operators having to install and maintain a complicated collection of software as well as diagnose failed analysis runs, often resulting in high management overhead compared to the total computation time (Yakneen and Waszak, 2020
                    <ext-link ext-link-type="uri" xlink:href="https://f1000research.com/articles/9-63#rep-ref-61129-1">
                        <sup>1</sup>
                    </ext-link>). Sarek aims to minimize the installation and management time overhead by building an NGS workflow on top of Nextflow, automatically installing the required software components. These components include some of the state-of-the-art tools in read mapping, variant calling and annotation, and quality control. Sarek is a welcome addition to the toolkit of bioinformaticians looking for an NGS analysis workflow that can be easily installed on an HPC cluster or cloud environment. The article is well-written and clear to understand. While I'm happy to recommend indexing of the manuscript also in the present form, I have a few suggestions on how to improve it:</p>
                <p> Major comments: 
                    <list list-type="order">
                        <list-item>
                            <p>While some of the existing NGS workflows are mentioned, I would appreciate it if Sarek was compared to these approaches in more detail. Is there functionality that is currently missing from Sarek that is present in one of the other workflows?</p>
                        </list-item>
                    </list> 
                    <bold>This is a relevant comment, but a detailed comparison of the current functionality of different workflows risks quickly becoming outdated, since functionality changes frequently both in Sarek and in the other mentioned workflows. Also, the main purpose of Sarek is not to provide unique functionality 
                        <italic>per se</italic>, but to provide a workflow solution with some generally important features, as detailed in the manuscript:</bold>
                </p>
                <p>
                    <bold> 
                        <italic>&#x201c;..(i) easy installation, (ii) robust portability across different compute environments, (iii) comprehensive documentation, (iv) transparent and easy-to-read code, and (v) extensive quality metrics reporting.&#x201d;</italic>
                    </bold>
                </p>
                <p>
                    <bold> Therefore, the main difference between Sarek and the other mentioned workflows is in practical usability rather than a particular functionality, as these can typically be tuned or added as needed. </bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Typically in NGS data analysis, a lot of time can be spent on debugging failed runs to find out whether one of the tools failed, if the data is corrupted/missing or if there was a hardware error. How does Sarek support run diagnostics and relaunching failed jobs?</p>
                        </list-item>
                    </list> 
                    <bold>This is a good comment. Nextflow reports the failed process and the error code from the underlying software, and Sarek is designed to make it very easy to resume failed jobs from the point of failure. This is now better highlighted in the manuscript under the heading &#x201c;Portability and reproducibility&#x201d;. We will work continuously with the user community to further reduce run failures and to improve the diagnostic capability and error handling.</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>How does Sarek combine variant calls when multiple callers are used for a variant type?&#x00a0;</p>
                        </list-item>
                    </list> 
                    <bold>Sarek will report variants from all callers included in the run, but it is up to the user to decide how to combine results from different callers, since e.g. the optimal balance between high specificity 
                        <italic>versus</italic> high sensitivity differs across research projects. This has been clarified in the manuscript under the heading &#x201c;Output&#x201d;, and in the &#x201c;Discussion&#x201d;.</bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Somatic single nucleotide and indel variant calls from Sarek were shown to match well with a previously defined gold standard callset. No benchmark data was available for more complex somatic variants and variant calling accuracy for germline variants was not evaluated. I am interested in seeing more comprehensive tests to cover all germline and somatic variant types.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion and we have run germline variant calling with Sarek on the HG001 sample and compared to the Genome In a Bottle gold standard dataset, and these results are now included in the manuscript.</bold>
                </p>
                <p> 
                    <bold>Benchmarking of complex somatic variants is difficult due to the lack of robust and relevant benchmark datasets, which places such testing beyond the scope of this publication, particularly since Sarek is a workflow and does not provide any novel algorithms or methodology 
                        <italic>per se</italic>. For now, we refer users to the tests published along with the respective included variant calling software. This limitation is stated in the manuscript under the heading &#x201c;Results&#x201d;.</bold>
                </p>
                <p>
                    <list list-type="order">
                        <list-item>
                            <p>I would appreciate it if a minimal test dataset together with instructions and a suite of automated tests was provided with Sarek. This would make it easier for the user to test out an installation as well as raising issues in GitHub.&#x00a0;</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion, and there is actually already a minimal test dataset and a suite of tests available and documented in Sarek, as detailed under the heading &#x201c;Installation and testing&#x201d;. We have improved the Sarek documentation to make this clear and easy to find. </bold>
                </p>
                <p> Minor comments: 
                    <list list-type="order">
                        <list-item>
                            <p>How easy is it for users to modify or extend Sarek, for example by adding a new variant caller to the workflow? This could be explored in more detail in the text.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion and we have clarified this point in the manuscript under the heading &#x201c;Operation: Workflow overview and software&#x201d;. In brief, we have started with software that we have judged to be of high quality, well-maintained and robust. Additional software will be added to Sarek later on in a community effort, and this process is already ongoing.</bold>
                </p>
                <p>
                    <list list-type="order">
                        <list-item>
                            <p>It would be good to easily see which tools are used to call and analyze each variant type. This information could be added either to Fig 1, Table 1 or both.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion and we have added this information to Table 1.</bold>
                </p>
                <p>
                    <list list-type="order">
                        <list-item>
                            <p>FreeBayes is included in Figure 1 but missing from Table 1.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good note, and we have adjusted the manuscript accordingly. FreeBayes can optionally be run in Sarek, and it is now included in Table 1.&#x00a0;</bold>
                </p>
                <p>
                    <list list-type="order">
                        <list-item>
                            <p>The wording in &#x201c;To facilitate easy installation and to ensure reproducibility, all Sarek required tools are managed in Docker or Singularity (Kurtzer et al., 2017) containers, or a Conda environment.&#x201d; should be clarified -- are all tools being maintained in all the three systems?</p>
                        </list-item>
                    </list> 
                    <bold>This is a very useful comment, and this has now been clarified in the manuscript under the heading &#x201c;Portability and reproducibility&#x201d;. All tools are installed in Conda, and then pushed to DockerHub (https://hub.docker.com/). This way all tools are available directly from all three systems: Conda, Docker and Singularity.</bold>
                </p>
                <p>
                    <list list-type="order">
                        <list-item>
                            <p>When running Sarek with default options, it crashes in the tool version check. It took me a while to figure out this was due to the default &#x201c;-profile&#x201d; argument of &#x201c;standard&#x201d; which seems to assume Singularity is available. It would be good to improve error messages so that it is easier to understand the underlying cause. A minimal installation and testing procedure mentioned above would help in this regard.</p>
                        </list-item>
                    </list> 
                    <bold>This is a very useful comment, and we have improved the error message and the documentation regarding the &#x201c;-profile&#x201d; argument. </bold> 
                    <list list-type="order">
                        <list-item>
                            <p>Typo:&#x00a0;&#x201c;Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies opens&#x2026;&#x201d; -&gt; &#x201c;...open&#x201d;.</p>
                        </list-item>
                    </list> 
                    <bold>This typo has been corrected in the manuscript. </bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Is the rationale for developing the new software tool clearly explained?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Is the description of the software tool technically sound?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
                        </list-item>
                    </list> Yes</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report59295">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.18214.r59295</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>H&#x00e5;ndstad</surname>
                        <given-names>Tony</given-names>
                    </name>
                    <xref ref-type="aff" rid="r59295a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r59295a1">
                    <label>1</label>Oslo University Hospital, Oslo, Norway</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>9</day>
                <month>3</month>
                <year>2020</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2020 H&#x00e5;ndstad T</copyright-statement>
                <copyright-year>2020</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport59295" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.16665.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Sarek is a workflow for variant detection and analysis of sequencing data from WGS, WES and targeted panels. The workflow is comprehensive and versatile, allowing for variant detection in both germline and somatic samples, from WGS/WES/panel sequencing. 
                <list list-type="bullet">
                    <list-item>
                        <p>It includes variant calling of SNPs, indels, and structural variants, as well as annotation and extensive quality control.</p>
                    </list-item>
                    <list-item>
                        <p>Sarek is open source and part of the nf-core community effort which builds well-curated analysis pipelines in the Nextflow pipeline framework.</p>
                    </list-item>
                    <list-item>
                        <p>Sarek is very user friendly, and installation, configuration and execution are easy to perform, while the workflow is also flexible.</p>
                    </list-item>
                    <list-item>
                        <p>Implementation manages to be clear, despite also being advanced with parallelization and ample choice of installation/execution. Many researchers would likely struggle to implement pipelines at this advanced level.</p>
                    </list-item>
                    <list-item>
                        <p>The documentation is excellent, and despite the comprehensive functionality, most users should find it easy to set up Sarek and get started. I have no doubt that Sarek will be a very valuable addition to the research community.</p>
                    </list-item>
                </list></p>
            <p> The paper is well written and fulfils all the reviewer criteria. As reviewer, I have only a few minor comments: 
                <list list-type="bullet">
                    <list-item>
                        <p>Sarek can use different Nextflow configuration profiles. In the Sarek documentation, it says that the test profile is a profile with a complete configuration for automated testing and that it includes links to test data so needs no other parameters.</p>
                        <p> </p>
                        <p> It should be obvious to most users, but I would suggest that the authors make it clear that when using the test profile, a user must also supply conda, docker or singularity profile if not all the tools are installed in the PATH. This is clear from the general nf-core documentation (https://nf-co.re/usage/introduction) but less so from the Sarek documentation.</p>
                    </list-item>
                </list> 
                <list list-type="bullet">
                    <list-item>
                        <p>Whereas the choice of Nextflow is justified, there is little argumentation for why the different tools (variant callers in particular) are selected other than that they represent state-of-the-art. Also why several tools for variant calling are combined is not mentioned, though the referred paper (Alioto 
                            <italic>et al.</italic>, 2015) makes a clear case for this, at least for somatic variant calling.</p>
                    </list-item>
                </list> 
                <list list-type="bullet">
                    <list-item>
                        <p>The paper title makes it clear that Sarek is for whole genome sequencing analysis, but as stated in the text, Sarek is also applicable to exome and targeted panel analyses where the authors say it has been run successfully.</p>
                        <p> </p>
                        <p> Many researchers are using exome sequencing, so it could be of interest to know if the authors have an opinion or experience with how the use of targeted sequencing data limits Sarek in terms of accuracy or utility. For example, are the tools used for structural variant calling able to handle exome data well (to the extent possible with targeted sequencing)?</p>
                        <p> </p>
                        <p> The documentation has a small chapter stating that the authors recommend supplying a BED file with the targeted regions, but there is not so much explanation of what the effect of this is.</p>
                    </list-item>
                </list> 
                <list list-type="bullet">
                    <list-item>
                        <p>The authors demonstrate that Sarek is both fast and accurate by running it on a tumor/normal(germline) dataset from a previous benchmark study. I think this is acceptable/sufficient, but one could always wish for more; the paper could be strengthened by for example running the well-known public germline HG001 sample against the Genome In a Bottle gold standard dataset.</p>
                    </list-item>
                    <list-item>
                        <p>Accurate somatic variant calling is difficult. But the included benchmark study demonstrates that Sarek performs well in comparison with other pipelines. The tool leaves it up to the user to decide whether to use output from a single variant caller or the union or intersection from all tools for increased sensitivity or precision.</p>
                    </list-item>
                    <list-item>
                        <p>In summary, I think Sarek is a great addition to the community and recommend the paper for indexing.</p>
                    </list-item>
                </list>
            </p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Diagnostic bioinformatics (variant calling pipelines) and variant interpretation</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment5674-59295">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Nystedt</surname>
                            <given-names>Bj&#x00f6;rn</given-names>
                        </name>
                        <aff>Uppsala university, Sweden</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>1</day>
                    <month>7</month>
                    <year>2020</year>
                </pub-date>
            </front-stub>
            <body>
                <p>
                    <bold>We are grateful to the reviewer for the positive and constructive comments! We have uploaded a revised version of the manuscript and included updated documentation for the workflow, including adjustments and improvements as suggested by the reviewer, and as described in the detailed comments below, marked in bold. </bold>
                </p>
                <p> Approved</p>
                <p> Sarek is a workflow for variant detection and analysis of sequencing data from WGS, WES and targeted panels. The workflow is comprehensive and versatile, allowing for variant detection in both germline and somatic samples, from WGS/WES/panel sequencing. 
                    <list list-type="bullet">
                        <list-item>
                            <p>It includes variant calling of SNPs, indels, and structural variants, as well as annotation and extensive quality control.</p>
                        </list-item>
                        <list-item>
                            <p>Sarek is open source and part of the nf-core community effort which builds well-curated analysis pipelines in the Nextflow pipeline framework.</p>
                        </list-item>
                        <list-item>
                            <p>Sarek is very user friendly, and installation, configuration and execution are easy to perform, while the workflow is also flexible.</p>
                        </list-item>
                        <list-item>
                            <p>Implementation manages to be clear, despite also being advanced with parallelization and ample choice of installation/execution. Many researchers would likely struggle to implement pipelines at this advanced level.</p>
                        </list-item>
                        <list-item>
                            <p>The documentation is excellent, and despite the comprehensive functionality, most users should find it easy to set up Sarek and get started. I have no doubt that Sarek will be a very valuable addition to the research community.</p>
                        </list-item>
                    </list></p>
                <p> The paper is well written and fulfils all the reviewer criteria. As reviewer, I have only a few minor comments: 
                    <list list-type="bullet">
                        <list-item>
                            <p>Sarek can use different Nextflow configuration profiles. In the Sarek documentation, it says that the test profile is a profile with a complete configuration for automated testing and that it includes links to test data so needs no other parameters.</p>
                            <p> </p>
                            <p> It should be obvious to most users, but I would suggest that the authors make it clear that when using the test profile, a user must also supply conda, docker or singularity profile if not all the tools are installed in the PATH. This is clear from the general nf-core documentation (https://nf-co.re/usage/introduction) but less so from the Sarek documentation.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion and we have highlighted and improved this information in the Sarek documentation.&#x00a0;We are also working to revise the general documentation format in nf-core to make this more transparent throughout. </bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Whereas the choice of Nextflow is justified, there is little argumentation for why the different tools (variant callers in particular) are selected other than that they represent state-of-the-art. Also why several tools for variant calling are combined is not mentioned, though the referred paper (Alioto 
                                <italic>et al.</italic>, 2015) makes a clear case for this, at least for somatic variant calling.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion and we have clarified this point in the manuscript. In brief, we have included software that we have judged to be of high quality, well-maintained and robust. Additional software will be added to Sarek later on in a community effort, and this process is already ongoing. </bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>The paper title makes it clear that Sarek is for whole genome sequencing analysis, but as stated in the text, Sarek is also applicable to exome and targeted panel analyses where the authors say it has been run successfully.</p>
                            <p> </p>
                            <p> Many researchers are using exome sequencing, so it could be of interest to know if the authors have an opinion or experience with how the use of targeted sequencing data limits Sarek in terms of accuracy or utility. For example, are the tools used for structural variant calling able to handle exome data well (to the extent possible with targeted sequencing)?</p>
                            <p> </p>
                            <p> The documentation has a small chapter stating that the authors recommend supplying a BED file with the targeted regions, but there is not so much explanation of what the effect of this is.</p>
                        </list-item>
                    </list> 
                    <bold>This is a relevant comment. We want to clarify that Sarek has been run on whole-exome sequencing data in pilot user projects and has been reported to us to work well (personal communication), but we have not performed a comprehensive benchmark to evaluate this. This has been clarified in the Sarek documentation and in the manuscript. </bold> 
                    <list list-type="bullet">
                        <list-item>
                            <p>The authors demonstrate that Sarek is both fast and accurate by running it on a tumor/normal(germline) dataset from a previous benchmark study. I think this is acceptable/sufficient, but one could always wish for more; the paper could be strengthened by for example running the well-known public germline HG001 sample against the Genome In a Bottle gold standard dataset.</p>
                        </list-item>
                    </list> 
                    <bold>This is a good suggestion and we have run germline variant calling with Sarek on the HG001 sample and compared to the Genome In a Bottle gold standard dataset, and these results are now included in the manuscript. </bold>
                </p>
                <p> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Accurate somatic variant calling is difficult. But the included benchmark study demonstrates that Sarek performs well in comparison with other pipelines. The tool leaves it up to the user to decide whether to use output from a single variant caller or the union or intersection from all tools for increased sensitivity or precision.</p>
                        </list-item>
                        <list-item>
                            <p>In summary, I think Sarek is a great addition to the community and recommend the paper for indexing.</p>
                        </list-item>
                    </list> 
                    <list list-type="bullet">
                        <list-item>
                            <p>Is the rationale for developing the new software tool clearly explained?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Is the description of the software tool technically sound?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
                        </list-item>
                    </list> Yes 
                    <list list-type="bullet">
                        <list-item>
                            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
                        </list-item>
                    </list> Yes</p>
            </body>
        </sub-article>
    </sub-article>
</article>
