<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.168786.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Identification of Viral Variants from Functional Genomics Data</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 1 approved, 1 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>R&#x00f6;ckl</surname>
                        <given-names>Florian</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Friedel</surname>
                        <given-names>Caroline C.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-3569-4877</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Institute for Informatics, Ludwig-Maximilians-Universitaet Muenchen (LMU), Munich, Bavaria, Germany</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:caroline.friedel@bio.ifi.lmu.de">caroline.friedel@bio.ifi.lmu.de</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>18</day>
                <month>8</month>
                <year>2025</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2025</year>
            </pub-date>
            <volume>14</volume>
            <elocation-id>794</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>11</day>
                    <month>8</month>
                    <year>2025</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 R&#x00f6;ckl F and Friedel CC</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/14-794/pdf"/>
            <abstract>
                <sec>
                    <title>Background</title>
                    <p>Virus mutants are commonly used for studying the role of individual viral proteins in infections and are increasingly investigated with functional genomics experiments of infected cells that use sequencing-based assays such as RNA-seq or ATAC-seq. However, existing mutant virus strains are often poorly documented, particularly if they were created decades ago. Identifying viral variants directly in the functional genomics experiments avoids additional genome sequencing and allows confirming the presence of specific mutations directly in the experiment of interest.</p>
                </sec>
                <sec>
                    <title>Methods</title>
                    <p>We present a pipeline to directly identify mutations in viral genomes from sequencing-based functional genomics data. The pipeline combines existing SNP callers with novel methods for identifying deletions, insertions, and corresponding inserted sequences. These novel methods address the problem that existing structural variant callers performed poorly on functional genomics data with large variations in read coverage.</p>
                </sec>
                <sec>
                    <title>Results</title>
                    <p>We evaluated the pipeline on RNA-seq data for infection with knockout mutants for important proteins of Herpes simplex virus 1 (HSV-1). Comparison of the variants identified by our pipeline with the descriptions of the original publications showed that we could correctly recover the introduced mutations.</p>
                </sec>
                <sec>
                    <title>Conclusions</title>
                    <p>Our pipeline offers researchers a fast and easy way to identify variants in the viral genome without additional genome sequencing. The pipeline is implemented as a workflow for the workflow management system Watchdog and is available at 
                        <uri xlink:href="https://github.com/watchdog-wms/watchdog-wms-workflows/">https://github.com/watchdog-wms/watchdog-wms-workflows/</uri> (workflow VariantCallerPipeline).</p>
                </sec>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>variant calling pipeline</kwd>
                <kwd>functional genomics data</kwd>
                <kwd>virus infections</kwd>
                <kwd>null mutant virus</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="https://doi.org/10.13039/501100001659">
                    <funding-source>Deutsche Forschungsgemeinschaft</funding-source>
                    <award-id>FR2938/11-1</award-id>
                </award-group>
                <funding-statement>This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, www.dfg.de) in the framework of the Research Unit FOR5200 DEEP-DV (443644894) project FR 2938/11-1 to C.C.F. </funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec5" sec-type="intro">
            <title>Introduction</title>
            <p>Advances in molecular biology and genetics provide new technologies for studying virus infections and the role of individual viral genes during infection. This provides the basis for developing treatments against virus infections and for using viruses as tools in genetic engineering, vaccine development, or gene therapy.
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>
                </sup> A common approach is the creation of mutant virus strains (see e.g. Ref. 
                <xref ref-type="bibr" rid="ref2">2</xref>) containing single nucleotide polymorphisms (SNPs) or insertions or deletions (indels) of sequences that alter the functions of individual viral genes. For well-studied viruses like herpesviruses, such experiments have been conducted for decades. Consequently, many commonly used mutant strains were generated decades ago, often before complete genome sequences of these viruses were available (e.g., in Refs. 
                <xref ref-type="bibr" rid="ref3">3</xref>&#x2013;
                <xref ref-type="bibr" rid="ref11">11</xref> to list just a few examples). These have often been passed between laboratories and used for a multitude of experiments. However, the precise genomic locations of mutations or inserted sequences are often poorly documented, and other undocumented mutations may have been introduced either with the original mutation or in the time since. Furthermore, even for recently created viral mutants, the descriptions in the corresponding articles are often very limited and do not provide nucleotide positions (e.g. in Ref. 
                <xref ref-type="bibr" rid="ref12">12</xref>). Moreover, even if the precise location of introduced mutations is known, it is often important to verify their presence, in particular if results from experiments do not meet expectations.</p>
            <p>The standard approach to identify mutations in viral genomes is genome sequencing,
                <sup>
                    <xref ref-type="bibr" rid="ref13">13</xref>
                </sup> which requires separate experiments. However, due to advances in high-throughput sequencing technologies, analysis of virus gene functions is now commonly performed using sequencing-based functional genomics assays of virus-infected cells, such as RNA sequencing (RNA-seq), Assay for Transposase-Accessible Chromatin sequencing (ATAC-seq), or chromatin immunoprecipitation (ChIP) followed by sequencing (ChIP-seq) (e.g. in Refs. 
                <xref ref-type="bibr" rid="ref14">14</xref>, 
                <xref ref-type="bibr" rid="ref15">15</xref>). Since functional genomics experiments commonly also provide nucleotide coverage of viral genomes, though generally with very variable coverage, they afford the unique opportunity to identify viral variants directly in the experiment of interest without additional genome sequencing.</p>
            <p>In this article, we present a pipeline to automatically identify viral variants in functional genomics data of virus infections, including SNPs, deletions, insertions, and (optionally) inserted sequences. This pipeline uses existing SNP calling methods, in particular bcftools
                <sup>
                    <xref ref-type="bibr" rid="ref16">16</xref>
                </sup> and VarScan
                <sup>
                    <xref ref-type="bibr" rid="ref17">17</xref>
                </sup>, which we found to also perform well for RNA-seq or other functional genomics data that exhibit large variations in read coverage across the viral genome (see, e.g., 
                <xref ref-type="fig" rid="f1">
Figure 1</xref>). In contrast, state-of-the-art structural variant callers we evaluated (DELLY,
                <sup>
                    <xref ref-type="bibr" rid="ref18">18</xref>
                </sup> GRIDSS2,
                <sup>
                    <xref ref-type="bibr" rid="ref19">19</xref>
                </sup> and BreakDancer
                <sup>
                    <xref ref-type="bibr" rid="ref20">20</xref>
                </sup>) performed poorly in identifying insertions and deletions in viral null mutants from these data. This is not surprising, as RNA-seq data and other functional genomics data with non-uniform read distributions violate the underlying assumptions of existing structural variant callers. We thus implemented a new approach to identify deletions and insertions based on gaps in read coverage and clipped (i.e. partial) read alignments. We combined this with 
                <italic toggle="yes">de novo</italic> assembly using rnaSPAdes
                <sup>
                    <xref ref-type="bibr" rid="ref21">21</xref>
                </sup> to identify inserted sequences. Analysis of previously published RNA-seq data for infection with knockout mutants of herpes simplex virus 1 (HSV-1)
                <sup>
                    <xref ref-type="bibr" rid="ref14">14</xref>
                </sup> and an HSV-1 strain expressing a green fluorescent protein (GFP)
                <sup>
                    <xref ref-type="bibr" rid="ref22">22</xref>
                </sup> showed that our pipeline allows fast and easy identification of viral variants and their precise genomic locations to characterize poorly documented mutant virus strains at the nucleotide level.</p>
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>
Figure 1. </label>
                <caption>
                    <title>Per-base read coverage (y-axis) on the HSV-1 genome (x-axis) for a 4sU-seq sample for infection with an HSV-1 null mutant containing a deletion of the ICP22 protein (see Results for details).</title>
                    <p>4sU-seq is a variant of RNA-seq based on sequencing newly transcribed RNA obtained by RNA labelling with 4-thiouridine (4sU).
                        <sup>
                            <xref ref-type="bibr" rid="ref30">30</xref>
                        </sup> This shows that read coverage varies considerably across the genome depending on gene expression. The deletion is located between nucleotides 133,243 and 134,072 (see 
                        <xref ref-type="table" rid="T1">
Table 1</xref>) but cannot be distinguished from other regions with low expression either visually or with standard deletion callers.</p>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/185998/274de8c3-338b-4373-871e-df90906e57c1_figure1.gif"/>
            </fig>
        </sec>
        <sec id="sec6" sec-type="methods">
            <title>Methods</title>
            <sec id="sec7">
                <title>Implementation</title>
                <p>The virus variant caller pipeline was implemented as a workflow for the workflow management system Watchdog and is available at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-workflows/">https://github.com/watchdog-wms/watchdog-wms-workflows/</ext-link> (workflow VariantCallerPipeline). The workflow takes as input read alignments against the viral genome in BAM format for one or more virus-infected samples. Read sequences in FASTQ format are only required if inserted sequences are to be identified, which is an optional step. We used BWA
                    <sup>
                        <xref ref-type="bibr" rid="ref23">23</xref>
                    </sup> for read alignment as it is very fast and requires little memory, but any read alignment program can be used that provides SAM/BAM output, includes read sequences in the output and produces clipped read alignments if only parts of a read can be aligned to the viral genome. Notably, since we are not interested in identifying splicing events, which are rare in viruses, there is no need to use a splicing-aware aligner for RNA-seq
 data.</p>
                <p>The variant caller pipeline is divided into two main parts, which are described in the following: (1) SNP calling and (optionally) strain identification and (2) indel detection and (optionally) identification of inserted sequences.</p>
                <p>

                    <bold>

                        <italic toggle="yes">SNP calling</italic>
</bold>
                </p>
                <p>
                    <xref ref-type="fig" rid="f2">
Figure 2</xref> provides an overview of the steps performed for SNP calling. First, the variant callers bcftools
                    <sup>
                        <xref ref-type="bibr" rid="ref16">16</xref>
                    </sup> and VarScan
                    <sup>
                        <xref ref-type="bibr" rid="ref17">17</xref>
                    </sup> (after running &#x2018;samtools mpileup&#x2019;
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup>) are applied independently to each input BAM file. Both tools provide the identified SNPs in the variant call format (VCF).
                    <sup>
                        <xref ref-type="bibr" rid="ref25">25</xref>
                    </sup> Next, so-called 
                    <italic toggle="yes">consistent</italic> SNPs are determined, i.e. SNPs that are identified by both bcftools and VarScan. If more than one replicate is available, SNPs are considered consistent only if they are detected by both tools in all replicates. Consistent SNPs are then mapped to viral features, e.g., genes, coding sequences, or introns, given a gene annotation in GTF format for the viral genome.</p>
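                <p>The notion of consistent SNPs can be illustrated with a minimal Python sketch. It is not the implementation used in the pipeline: the function names <monospace>parse_vcf_snps</monospace> and <monospace>consistent_snps</monospace> are hypothetical, and the sketch assumes that two calls match only if chromosome, position, reference allele and alternative allele are identical, a matching rule that is not specified in the text above.</p>
                <code language="python"><![CDATA[
```python
from typing import Iterable, List, Set, Tuple

# A SNP is keyed by (chromosome, 1-based position, reference allele, alternative allele).
# Matching on all four fields is an assumption of this sketch.
Snp = Tuple[str, int, str, str]


def parse_vcf_snps(lines: Iterable[str]) -> Set[Snp]:
    """Collect single-nucleotide substitutions from the text lines of a VCF file."""
    snps: Set[Snp] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            # keep only substitutions of exactly one base (no indels, no symbolic alleles)
            if len(ref) == 1 and len(allele) == 1 and allele in "ACGT":
                snps.add((chrom, int(pos), ref, allele))
    return snps


def consistent_snps(calls_per_replicate: List[Tuple[Set[Snp], Set[Snp]]]) -> Set[Snp]:
    """Keep SNPs that both callers report in every replicate.

    Each list entry holds the calls of the two tools for one replicate.
    """
    if not calls_per_replicate:
        return set()
    per_replicate = [tool_a & tool_b for tool_a, tool_b in calls_per_replicate]
    return set.intersection(*per_replicate)
```
]]></code>
                <p>In this sketch, each replicate contributes one pair of call sets (one per tool), so adding replicates can only shrink the set of consistent SNPs.</p>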
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>
Figure 2. </label>
                    <caption>
                        <title>Overview of the steps the pipeline employs for SNP calling.</title>
                    </caption>
                    <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/185998/274de8c3-338b-4373-871e-df90906e57c1_figure2.gif"/>
                </fig>
                <p>Furthermore, if a set of reference SNPs for different virus strains is provided by the user, the pipeline performs a prediction of the virus strain for each sample. This is useful both for verifying the virus strain used in the experiment and the parental strain from which a particular null mutant was generated. Such reference SNPs can be obtained by identifying consistent SNPs with our pipeline for functional genomics data of various virus strains. An example file with reference SNPs for HSV-1 strains 17, F and KOS 1.1 is included with example input files at 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.14266852">https://doi.org/10.5281/zenodo.14266852</ext-link> and with the Watchdog module for strain identification (identifyStrain, available at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-modules">https://github.com/watchdog-wms/watchdog-wms-modules</ext-link>).</p>
                <p>For strain identification, the following distance 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>D</mml:mi>
                        </mml:math>
</inline-formula> is calculated for each reference strain: 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>D</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mrow>
                                <mml:mo>|</mml:mo>
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mn>1</mml:mn>
                                </mml:msub>
                                <mml:mo>&#x222a;</mml:mo>
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mn>2</mml:mn>
                                </mml:msub>
                                <mml:mo>|</mml:mo>
                            </mml:mrow>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:mrow>
                                <mml:mo>|</mml:mo>
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mn>1</mml:mn>
                                </mml:msub>
                                <mml:mo>&#x2229;</mml:mo>
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mn>2</mml:mn>
                                </mml:msub>
                                <mml:mo>|</mml:mo>
                            </mml:mrow>
                        </mml:math>
</inline-formula>, with 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mn>1</mml:mn>
                            </mml:msub>
                        </mml:math>
</inline-formula> the set of consistent SNPs identified for the virus used in the experiment and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>S</mml:mi>
                                <mml:mn>2</mml:mn>
                            </mml:msub>
                        </mml:math>
</inline-formula> the set of reference SNPs for a reference strain. The strain with the smallest distance 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>D</mml:mi>
                        </mml:math>
</inline-formula> is then predicted for the virus. This measure is largely independent of the reference genome sequence used for read alignment. For illustration, consider the following example. Assume a sample 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                        </mml:math>
</inline-formula> that was derived from strain 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>X</mml:mi>
                        </mml:math>
</inline-formula> but is aligned against the genome sequence of a different strain 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Y</mml:mi>
                        </mml:math>
</inline-formula>. Furthermore, reference SNPs for strains 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>X</mml:mi>
                        </mml:math>
</inline-formula>, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Y</mml:mi>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Z</mml:mi>
                        </mml:math>
</inline-formula> were also obtained by aligning functional genomics data for these strains against the genome for strain 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Y</mml:mi>
                        </mml:math>
</inline-formula>. This will result in a (relatively) large number of consistent SNPs for sample 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                        </mml:math>
</inline-formula>, no/few reference SNPs for strain 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Y</mml:mi>
                        </mml:math>
</inline-formula> and (relatively) large numbers of reference SNPs for strains 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>X</mml:mi>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Z</mml:mi>
                        </mml:math>
</inline-formula>. Since consistent SNPs for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                        </mml:math>
</inline-formula> and reference SNPs for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>X</mml:mi>
                        </mml:math>
</inline-formula> will be largely the same, the distance will be close to zero. The distance for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Y</mml:mi>
                        </mml:math>
</inline-formula> and 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Z</mml:mi>
                        </mml:math>
</inline-formula> will be larger since consistent SNPs for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>S</mml:mi>
                        </mml:math>
</inline-formula> are not in the reference set for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Y</mml:mi>
                        </mml:math>
</inline-formula> and will differ from reference SNPs for 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>Z</mml:mi>
                        </mml:math>
</inline-formula>.</p>
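                <p>The distance above is the size of the symmetric difference between the two SNP sets. The following Python sketch reproduces the example with strains X, Y and Z using made-up SNPs; the function names, the representation of a SNP as a (position, reference allele, alternative allele) tuple, and the tie-breaking behaviour are assumptions for illustration and do not describe the identifyStrain module.</p>
                <code language="python"><![CDATA[
```python
from typing import Dict, Set, Tuple

# (position, reference allele, alternative allele); this representation is an assumption.
Snp = Tuple[int, str, str]


def strain_distance(sample_snps: Set[Snp], reference_snps: Set[Snp]) -> int:
    """D = |S1 union S2| - |S1 intersect S2|, i.e. the size of the symmetric difference."""
    return len(sample_snps | reference_snps) - len(sample_snps & reference_snps)


def predict_strain(sample_snps: Set[Snp], references: Dict[str, Set[Snp]]) -> str:
    """Return the reference strain with the smallest distance D.

    Ties are resolved in favour of the strain listed first (a choice of this sketch).
    """
    return min(references, key=lambda name: strain_distance(sample_snps, references[name]))


# Made-up SNPs relative to the genome of strain Y, which is used for read alignment.
references = {
    "X": {(100, "A", "G"), (250, "C", "T"), (900, "G", "A")},
    "Y": set(),  # no differences to its own genome
    "Z": {(900, "G", "A"), (1200, "T", "C"), (1500, "A", "C")},
}
sample = {(100, "A", "G"), (250, "C", "T"), (900, "G", "A")}  # sample derived from X

distances = {name: strain_distance(sample, snps) for name, snps in references.items()}
predicted = predict_strain(sample, references)
```
]]></code>
                <p>With these made-up sets, the distance is 0 for X, 3 for Y (all sample SNPs are missing from the empty reference set) and 4 for Z (five SNPs in the union, one shared), so X is predicted.</p>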
                <p>

                    <bold>

                        <italic toggle="yes">Indel detection</italic>
</bold>
                </p>
                <p>Insertions and deletions in viral genomes are determined as outlined in 
                    <xref ref-type="fig" rid="f3">
Figure 3</xref>. First, per-base read coverage, i.e. the number of reads overlapping each genome position, and clipped reads (= reads with unaligned parts) are extracted from each input BAM file using samtools.
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup> The results are then used as input for indel calling as described below. Subsequently, identified indels are also mapped to genomic features. In addition, the pipeline can identify inserted sequences by combining the results from insertion detection with 
                    <italic toggle="yes">de novo</italic> read assembly obtained with rnaSPAdes
                    <sup>
                        <xref ref-type="bibr" rid="ref21">21</xref>
                    </sup> if raw read sequences in FASTQ format are provided.</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>
Figure 3. </label>
                    <caption>
                        <title>Overview of the steps the pipeline employs for indel detection.</title>
                    </caption>
                    <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/185998/274de8c3-338b-4373-871e-df90906e57c1_figure3.gif"/>
                </fig>
                <p>

                    <italic toggle="yes">Candidate deletion detection</italic>
                </p>
                <p>The pipeline first detects potential deletions by identifying regions of the genome with very low read coverage compared to (i) the complete genome using a global threshold and (ii) the surrounding genomic regions using a local threshold. For this purpose, a global z-score is calculated for each position, comparing the logarithm of the read coverage (= log read coverage) for this position to the mean and standard deviation of the log read coverage for the complete genome. If this z-score is below a stringent global threshold, the position is labelled as a potential deletion. If it is below only a less stringent global threshold, a local z-score is calculated comparing the log read coverage at this position to the mean and standard deviation of the log read coverage of the 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                        </mml:math>
</inline-formula> nucleotides (nt) before the current position (by default 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mn>500</mml:mn>
                        </mml:math>
</inline-formula>). If the local z-score is below the local threshold, the position will also be labelled as a potential deletion.</p>
                <p>The local z-score is used because read coverage can vary massively between positions in functional genomics data. This is exemplified for an RNA-seq sample in 
                    <xref ref-type="fig" rid="f1">
Figure 1</xref>. However, calculating local z-scores for every position is very costly as it requires calculating the mean and standard deviation over the preceding 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                        </mml:math>
</inline-formula> nt for every genomic position. Thus, the stringent global z-score threshold is first employed to identify clear-cut cases of potential deletions. Local z-scores are only calculated for less clear-cut cases. Optionally, a user-defined length threshold can also be used to exclude very short deletions.</p>
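                <p>The two-stage thresholding described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the pipeline's code: the three threshold values are hypothetical placeholders, a pseudocount of 1 is added before taking the logarithm so that positions without reads are defined, positions are 0-based list indices, and a local z-score of 0 is returned when the preceding window has no variation.</p>
                <code language="python"><![CDATA[
```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def z_score(value: float, values: Sequence[float]) -> float:
    """z-score of value relative to values; 0.0 if values show no spread."""
    sd = pstdev(values)
    return (value - mean(values)) / sd if sd > 0 else 0.0


def candidate_deletion_positions(
    coverage: Sequence[int],
    strict_global: float = -3.0,   # hypothetical stringent global threshold
    lenient_global: float = -1.0,  # hypothetical less stringent global threshold
    local: float = -2.0,           # hypothetical local threshold
    window: int = 500,             # n preceding nucleotides (default stated in the text)
) -> List[int]:
    """Return 0-based positions labelled as potential deletions."""
    if not coverage:
        return []
    log_cov = [math.log(c + 1) for c in coverage]  # pseudocount is an assumption
    mu, sd = mean(log_cov), pstdev(log_cov)
    hits: List[int] = []
    for i, value in enumerate(log_cov):
        global_z = (value - mu) / sd if sd > 0 else 0.0
        if global_z < strict_global:
            hits.append(i)  # clear-cut case, no local z-score needed
        elif global_z < lenient_global and i > 0:
            # less clear-cut case: compare against the preceding window only
            if z_score(value, log_cov[max(0, i - window):i]) < local:
                hits.append(i)
    return hits


def merge_positions(positions: Sequence[int], min_length: int = 1) -> List[Tuple[int, int]]:
    """Merge sorted positions into inclusive (start, end) regions of at least min_length."""
    regions: List[Tuple[int, int]] = []
    for pos in positions:
        if regions and pos == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], pos)  # extend the current region
        else:
            regions.append((pos, pos))  # start a new region
    return [(start, end) for start, end in regions if end - start + 1 >= min_length]
```
]]></code>
                <p>Note that in this simplified form the preceding window gradually fills with low-coverage positions inside a long deletion, which raises the local z-score; how the pipeline handles this case is not covered by the sketch. The <monospace>min_length</monospace> parameter corresponds to the optional length threshold.</p>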
                <p>

                    <italic toggle="yes">Deletion verification</italic>
                </p>
                <p>Candidate deletions are subsequently verified using clipped read alignments. As depicted in 
                    <xref ref-type="fig" rid="f4">
Figure 4</xref>, reads crossing a deletion in the genome can only be aligned with gaps to the reference genome. If the alignment is performed using a non-splice-aware read aligner, such as BWA, this will result in clipped read alignments where only parts of the read are aligned to the genome. Notably, this often also occurs with splice-aware read aligners as the start and end nucleotides of the deletion generally do not match canonical splicing signals expected by many splice-aware read aligners.</p>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>
Figure 4. </label>
                    <caption>
                        <title>Illustration of clipping at deletion sites.</title>
                        <p>The top shows the mutated viral genome that contains the green and blue sequences from the reference genome below, but the orange sequence was deleted. Reads from the mutant viral genome can thus only be aligned with gaps to the reference genome (top of the reference genome). If a non-splice-aware aligner is used or a splice-aware aligner that requires presence of splice signals, this results in clipped read alignments (at the bottom). If both parts of the read are sufficiently long to be aligned to the genome, this will result in multiple clipped alignments per read. If a part of the read is too short for alignment (marked by a red cross), this part will not be aligned at all.</p>
                    </caption>
                    <graphic id="gr4" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/185998/274de8c3-338b-4373-871e-df90906e57c1_figure4.gif"/>
                </fig>
                <p>Deletions should exhibit a peak of right-clipped reads ending at the deletion start and a peak of left-clipped reads beginning at the deletion end. Such peaks of clipped reads are again identified using both a global and local z-score, both of which are calculated separately for peaks of right-clipped and left-clipped reads. For the global z-score at each position, the number of clipped reads is compared against the mean and standard deviation for the same type of clipped reads across the whole genome. For the local z-score, the number of clipped reads is compared against the mean and standard deviation for a window starting 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                        </mml:math>
</inline-formula> nt upstream of the candidate peak and ending 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                        </mml:math>
</inline-formula> nt downstream of the candidate peak (by default 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>x</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mn>20</mml:mn>
                        </mml:math>
</inline-formula>), excluding the peak position itself. If both the global and the local z-scores pass a global (default: 10) and local (default: 50) threshold, respectively, the position is considered a peak. In addition, a minimum number of reads is required for a peak (default: 10 reads).</p>
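                <p>As an illustration of the peak criteria, the following Python sketch applies the global threshold, the local threshold and the minimum read count to a list of per-position clipped-read counts for one clipping type. The function name is hypothetical, and two details are assumptions that are not specified above: a threshold is treated as passed when the z-score is greater than or equal to it, and a neighbourhood without any variation is treated as passed whenever the count exceeds the neighbourhood mean.</p>
                <code language="python"><![CDATA[
```python
from statistics import mean, pstdev


def clip_peaks(counts, global_cutoff=10.0, local_cutoff=50.0, x=20, min_reads=10):
    """Return 0-based positions that qualify as peaks of clipped reads.

    counts is a list with the number of clipped reads per genome position
    for ONE clipping type; call the function once for right-clipped and
    once for left-clipped reads.
    """
    g_mean, g_sd = mean(counts), pstdev(counts)
    peaks = []
    for pos, count in enumerate(counts):
        if count < min_reads or g_sd == 0:
            continue
        if (count - g_mean) / g_sd < global_cutoff:
            continue
        # x nt upstream and x nt downstream, excluding the candidate itself.
        window = counts[max(0, pos - x):pos] + counts[pos + 1:pos + 1 + x]
        if not window:
            continue
        l_mean, l_sd = mean(window), pstdev(window)
        if l_sd == 0:
            # Assumption: flat neighbourhood, any excess counts as a peak.
            is_local_peak = count > l_mean
        else:
            is_local_peak = (count - l_mean) / l_sd >= local_cutoff
        if is_local_peak:
            peaks.append(pos)
    return peaks


# Toy example: sparse background clipping with one sharp peak.
toy_counts = [0, 1] * 1000
toy_counts[1000] = 200
print(clip_peaks(toy_counts))
```
]]></code>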
                <p>To verify deletions, the pipeline identifies pairs of right-clipped and subsequent left-clipped peaks (i.e. the 
                    <italic toggle="yes">clipping pattern</italic> of deletions) and determines whether the positions of the two peaks overlap with a candidate deletion detected based on the per-base read coverage. Subsequently, the clipped sequences of the corresponding clipped read alignments (i.e. the unaligned part of the read in this alignment) are extracted from the BAM file and position-weight-matrices (PWMs) are computed from the sequence profiles of the clipped sequence parts. As can be seen in 
                    <xref ref-type="fig" rid="f4">
Figure 4</xref>, the PWMs of the clipped sequence parts on either side of the deletion should match the reference sequence on the opposite side of the deletion.</p>
                <p>To test this, the best match of each PWM is determined in a window around the opposite deletion end. The score of a potential match is calculated as the sum of log-odds scores over all positions comparing the value of the PWM for the nucleotide at this position against the background probability of that nucleotide in the complete genome sequence. The best match for a PWM is the match with the highest score. If the best matches for both PWMs have a score &gt;0 or at least one has a score &gt;1, the deletion is accepted. If neither match is good enough, the deletion is flagged as a potential deletion that may contain an insertion. This special case was observed for one of the data sets analysed in the results section. In this case, the potential insertion sequence is determined as described in the next section and can be further analysed.</p>
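                <p>The log-odds scoring and the acceptance rule can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the pipeline: the function names are hypothetical, the clipped sequences are assumed to have been trimmed to equal length and to contain only A, C, G and T, and the pseudocount and the use of base-2 logarithms are assumptions.</p>
                <code language="python"><![CDATA[
```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sequences, pseudocount=1.0):
    """Column-wise nucleotide probabilities from equal-length sequences."""
    total = len(sequences) + pseudocount * len(BASES)
    pwm = []
    for i in range(len(sequences[0])):
        column = Counter(seq[i] for seq in sequences)
        pwm.append({b: (column[b] + pseudocount) / total for b in BASES})
    return pwm


def background(genome):
    """Nucleotide frequencies in the complete genome sequence."""
    counts = Counter(genome)
    total = sum(counts[b] for b in BASES)
    return {b: counts[b] / total for b in BASES}


def best_match(pwm, region, bg):
    """Return (score, offset) of the highest-scoring PWM match in region."""
    best_score, best_offset = -math.inf, None
    for offset in range(len(region) - len(pwm) + 1):
        # Sum of log-odds: PWM probability versus background probability.
        score = sum(
            math.log2(col[region[offset + i]] / bg[region[offset + i]])
            for i, col in enumerate(pwm)
        )
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


def deletion_accepted(score_a, score_b):
    """Both best matches score above 0, or at least one scores above 1."""
    return (score_a > 0 and score_b > 0) or score_a > 1 or score_b > 1


# Toy example with a made-up 30 nt "genome" and five identical clipped parts.
toy_genome = "ACGTTGCAAGGCTTACGGATCCATTGACGA"
toy_pwm = build_pwm(["GGATCC"] * 5)
print(best_match(toy_pwm, toy_genome, background(toy_genome)))
```
]]></code>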
                <p>It should be noted that our approach for predicting deletions may also identify splicing events in RNA-seq data. However, splicing is rare in viruses and the few cases detected can easily be excluded after mapping the deletions to the genome annotation. For instance, even a very thorough re-annotation of the HSV-1 genome, a relatively large viral genome of ~152 kb, based on short- and long-read RNA-seq identified only 15 splicing events.
                    <sup>
                        <xref ref-type="bibr" rid="ref26">26</xref>
                    </sup> Most of those had only low abundance compared to the corresponding unspliced transcripts.</p>
                <p>

                    <italic toggle="yes">Insertion detection</italic>
                </p>
                <p>Our pipeline also uses clipped read alignments to determine insertions since reads containing part of an inserted sequence cannot be completely aligned to the genome (see 
                    <xref ref-type="fig" rid="f5">
Figure 5</xref>). Originally, we expected that the resulting clipping pattern should consist of a peak of right-clipped reads at a reference genome position 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                        </mml:math>
</inline-formula> preceding the insertion in the genome followed by a peak of left-clipped reads at position 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                            <mml:mo>+</mml:mo>
                            <mml:mn>1</mml:mn>
                        </mml:math>
</inline-formula>. However, the examples of null mutants created by insertions that we investigated showed a different pattern, consisting of a peak of left-clipped reads at position 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                        </mml:math>
</inline-formula> and a peak of right-clipped reads at position 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>n</mml:mi>
                            <mml:mo>+</mml:mo>
                            <mml:mn>1</mml:mn>
                        </mml:math>
</inline-formula> (see 
                    <xref ref-type="fig" rid="f5">
Figure 5</xref>). This results from the first and last position of inserted sequences matching the genome on the other side of the insertion and is likely a consequence of the use of homologous recombination for inserting sequences.
                    <sup>
                        <xref ref-type="bibr" rid="ref27">27</xref>
                    </sup>
                </p>
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>
Figure 5. </label>
                    <caption>
                        <title>Illustration of read clipping at insertion sites.</title>
                        <p>The top shows the mutated viral genome (blue) that contains an inserted sequence (orange) not present in the reference genome. Reads spanning the boundary of the insertion therefore contain parts of both the reference and insertion sequence. When aligned to the reference genome, the part of the reads containing the insertion sequence (orange) have to be clipped since they cannot be aligned to the reference genome. We observed that commonly the start and/or end of the insertion also matches the reference genome directly before and after the insertion site (in this example, 1 nt matches on each side). As a result, reads can be aligned beyond the insertion site, resulting in a distinctive insertion clipping pattern with a peak of left-clipped positions one or more positions left of a peak of right-clipped positions.</p>
                    </caption>
                    <graphic id="gr5" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/185998/274de8c3-338b-4373-871e-df90906e57c1_figure5.gif"/>
                </fig>
                <p>To allow for such matches between the insertion start and/or end to the surrounding genomic regions, we introduced a parameter 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03d5;</mml:mi>
                        </mml:math>
</inline-formula> determining the maximum number of such matches that are allowed. Thus, any pair of positions for a left-clipped peak 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>l</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> and a right-clipped peak 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>r</mml:mi>
                            </mml:msub>
                        </mml:math>
</inline-formula> is used to predict an insertion if 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>r</mml:mi>
                            </mml:msub>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>l</mml:mi>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:mn>1</mml:mn>
                            <mml:mo>&#x2264;</mml:mo>
                            <mml:mi>&#x03d5;</mml:mi>
                        </mml:math>
</inline-formula>. In the example in 
                    <xref ref-type="fig" rid="f5">
Figure 5</xref>, 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>r</mml:mi>
                            </mml:msub>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>l</mml:mi>
                            </mml:msub>
                            <mml:mo>+</mml:mo>
                            <mml:mn>1</mml:mn>
                            <mml:mo>=</mml:mo>
                            <mml:mn>2</mml:mn>
                        </mml:math>
</inline-formula>. For each identified insertion, we extract the non-aligned parts of clipped reads to calculate consensus sequences for the insertion start and end, respectively. These consensus sequences are commonly 30-40 nt long.</p>
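                <p>The pairing criterion and the consensus calculation can be illustrated with a short Python sketch. The function names are hypothetical. The lower bound of 0 on the overlap (which corresponds to the originally expected pattern), the simple per-column majority vote and the minimum support of two reads per consensus position are assumptions made for illustration only.</p>
                <code language="python"><![CDATA[
```python
from collections import Counter


def insertion_candidates(left_peaks, right_peaks, phi=10):
    """Pair left- and right-clipped peak positions into insertion candidates.

    Returns (p_l, p_r, overlap) tuples with overlap = p_r - p_l + 1.
    """
    pairs = []
    for p_l in sorted(left_peaks):
        for p_r in sorted(right_peaks):
            overlap = p_r - p_l + 1
            if 0 <= overlap <= phi:  # lower bound of 0 is an assumption
                pairs.append((p_l, p_r, overlap))
    return pairs


def consensus(clipped, anchor="start", min_support=2):
    """Majority-vote consensus of the non-aligned parts of clipped reads.

    anchor="start": sequences share their first base (insertion start).
    anchor="end":   sequences share their last base (insertion end).
    """
    if not clipped:
        return ""
    if anchor == "end":
        clipped = [s[::-1] for s in clipped]
    out = []
    for i in range(max(len(s) for s in clipped)):
        column = Counter(s[i] for s in clipped if len(s) > i)
        if sum(column.values()) < min_support:
            break  # too few reads reach this far into the insertion
        out.append(column.most_common(1)[0][0])
    result = "".join(out)
    return result[::-1] if anchor == "end" else result


# Left-clipped peak at position 100, right-clipped peak at position 101.
print(insertion_candidates([100], [101]))
print(consensus(["ACGT", "ACG", "ACGTT"]))
```
]]></code>
                <p>With a left-clipped peak at position 100 and a right-clipped peak at position 101, the sketch reports an overlap of 2, which corresponds to the situation shown in Figure 5.</p>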
                <p>To identify the remaining central part of the inserted sequences, the pipeline optionally performs a 
                    <italic toggle="yes">de novo</italic> sequence assembly using rnaSPAdes,
                    <sup>
                        <xref ref-type="bibr" rid="ref21">21</xref>
                    </sup> a modification of the genome assembler SPAdes
                    <sup>
                        <xref ref-type="bibr" rid="ref28">28</xref>
                    </sup> for application to RNA-seq data. Assembly is performed for all reads, which also includes reads from non-viral sequences, in particular the inserted sequences. Following this, the consensus sequences for the insertion start and end are aligned to the resulting assembled contigs using BWA. If a match for both consensus sequences is found, the assembled sequence starting with the consensus of the insertion start and ending with the consensus of the insertion end is extracted. Insertion sequences containing only one of the consensus sequences are also extracted but are flagged for special attention. The origin of the inserted sequences can then be confirmed using BLAST.
                    <sup>
                        <xref ref-type="bibr" rid="ref29">29</xref>
                    </sup>
                </p>
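                <p>The extraction step can be sketched as follows in Python. Note that this sketch replaces the BWA alignment of the consensus sequences with exact substring search on both strands of each contig, which is a deliberate simplification; the function names and the return convention (sequence plus a flag indicating whether both consensus sequences were found) are hypothetical.</p>
                <code language="python"><![CDATA[
```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGTN."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def extract_insertion(contigs, start_consensus, end_consensus):
    """Return (sequence, complete) or None if no consensus is found.

    complete is True if the sequence starts with the start consensus and
    ends with the end consensus; False marks a partial hit that should be
    flagged for special attention.
    """
    partial = None
    for contig in contigs:
        for oriented in (contig, reverse_complement(contig)):
            s = oriented.find(start_consensus)
            if s >= 0:
                # Search from s so that overlapping consensus sequences
                # (very short insertions) are still found.
                e = oriented.find(end_consensus, s)
                if e >= 0:
                    return oriented[s:e + len(end_consensus)], True
                if partial is None:
                    partial = (oriented[s:], False)
            else:
                e = oriented.find(end_consensus)
                if e >= 0 and partial is None:
                    partial = (oriented[:e + len(end_consensus)], False)
    return partial


# Toy contig: flanking sequence, a made-up insertion, flanking sequence.
toy_contig = "TTTT" + "ACGTTCGGGCCCAGGA" + "GGGG"
print(extract_insertion([toy_contig], "ACGTTC", "CAGGA"))
```
]]></code>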
                <p>We also investigated whether 
                    <italic toggle="yes">de novo</italic> assembly alone was sufficient for detection of both insertions and deletions by aligning the assembled contigs to the viral reference genome (see results). However, this resulted in either too few or too many indels, depending on the parameters, so we did not pursue this approach for the pipeline.</p>
            </sec>
            <sec id="sec8">
                <title>Operation</title>
                <p>Watchdog and the VariantCallerPipeline can be run on Linux and macOS systems. Running Watchdog requires Java 11 or higher. The deployment of required software during the VariantCallerPipeline run is performed with conda (
                    <ext-link ext-link-type="uri" xlink:href="https://conda.io">https://conda.io</ext-link>, using conda-forge and bioconda channels) using the deployment functionality of Watchdog. Watchdog also supports easy parallelization of workflow runs on computing clusters and monitoring of workflow execution, which can be used when running our pipeline. Example input files can be found at 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.14266852">https://doi.org/10.5281/zenodo.14266852</ext-link>. A detailed README on installing and running the pipeline can be found at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-workflows/">https://github.com/watchdog-wms/watchdog-wms-workflows/</ext-link> in the VariantCallerPipeline directory.</p>
            </sec>
        </sec>
        <sec id="sec9" sec-type="results">
            <title>Results</title>
            <sec id="sec10">
                <title>Input data</title>
                <p>We applied our pipeline to previously published 4sU-seq data for infection with null mutants for multiple HSV-1 proteins.
                    <sup>
                        <xref ref-type="bibr" rid="ref14">14</xref>
                    </sup> 4sU-seq is a variant of RNA-seq based on sequencing newly transcribed RNA obtained by RNA labelling with 4-thiouridine (4sU).
                    <sup>
                        <xref ref-type="bibr" rid="ref30">30</xref>
                    </sup> 4sU-seq was performed for null mutant viruses of the following HSV-1 proteins:

                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>ICP4, null mutant created from HSV-1 strain 17 by a SNP, which resulted in a temperature sensitive mutant (TsK)
                                <sup>
                                    <xref ref-type="bibr" rid="ref3">3</xref>,
                                    <xref ref-type="bibr" rid="ref5">5</xref>
                                </sup>
                            </p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>ICP0 and ICP22, null mutants created by deletions from HSV-1 strains 17 and F, respectively
                                <sup>
                                    <xref ref-type="bibr" rid="ref4">4</xref>,
                                    <xref ref-type="bibr" rid="ref6">6</xref>,
                                    <xref ref-type="bibr" rid="ref7">7</xref>
                                </sup>
                            </p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>
ICP4, ICP27 and vhs, null mutants created by insertions from HSV-1 strains 17, KOS 1.1 and 17, respectively.
                                <sup>
                                    <xref ref-type="bibr" rid="ref8">8</xref>&#x2013;
                                    <xref ref-type="bibr" rid="ref11">11</xref>
                                </sup>
                            </p>
                        </list-item>
                    </list>
                </p>
                <p>The precise genomic locations of the mutations in these null mutants have not been described, and most of the mutants were created before the first HSV-1 genome sequence (for strain 17) was completed in 1988.
                    <sup>
                        <xref ref-type="bibr" rid="ref31">31</xref>
                    </sup> Two replicates were available for all null mutant viruses, except for the ICP4 knockout by insertion (&#x0394;ICP4), for which only one replicate was performed.</p>
                <p>In addition, we analysed RNA-seq data for human brain organoids
                    <sup>
                        <xref ref-type="bibr" rid="ref22">22</xref>
                    </sup> infected with an HSV-1 strain 17 virus engineered to express GFP.
                    <sup>
                        <xref ref-type="bibr" rid="ref12">12</xref>
                    </sup> Here, RNA-seq data was available for brain organoids from two genetically distinct induced pluripotent stem cell lines, each infected for 3 and 6 days (2 replicates each, resulting in 8 samples).</p>
                <p>All 4sU-seq and RNA-seq samples were aligned against the HSV-1 strain 17 genome (GenBank accession: JN555585) using BWA and then fed into the pipeline. The HSV-1 genome contains a repeat region at each end of the genome, each of which is repeated internally in the genome. Since read alignment cannot distinguish between the two copies of a repeat, one occurrence of each repeat (i.e. the ones at the genome ends) was replaced by N&#x2019;s for read alignment.</p>
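                <p>Masking one copy of each repeat before read alignment amounts to overwriting the corresponding reference positions with N. A minimal Python sketch is shown below; the function name is hypothetical and the coordinates in the example are made up for a toy sequence, not the actual repeat coordinates in the HSV-1 genome.</p>
                <code language="python"><![CDATA[
```python
def mask_regions(sequence, regions):
    """Replace 1-based, inclusive regions of a sequence with N's."""
    masked = list(sequence)
    for start, end in regions:
        for i in range(start - 1, min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Toy example: mask the first 3 nt and the last 2 nt of a 10 nt sequence.
print(mask_regions("ACGTACGTAC", [(1, 3), (9, 10)]))
```
]]></code>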
                <p>The performance of the pipeline was evaluated by comparing the results with the descriptions of the original publications. The insertion sequences that were extracted by the pipeline from the sequence assembly were investigated with the NCBI BLAST webserver to identify their origin.</p>
            </sec>
            <sec id="sec11">
                <title>SNPs in the TsK mutant</title>
                <p>Our pipeline identified 28 consistent SNPs in the TsK mutant, three of which were in the ICP4 gene. One of these was consistent with the sequence change identified by Davison 
                    <italic toggle="yes">et al.</italic>
                    <sup>
                        <xref ref-type="bibr" rid="ref5">5</xref>
                    </sup> as responsible for the mutant phenotype: a replacement of a C:G base pair by a T:A base pair that changed the 475th codon of the ICP4 gene from an alanine codon to a valine codon. Our pipeline matched this missense mutation to a SNP at nucleotide 129,708. It furthermore showed that the TsK mutant differed from its parental strain 17 by an additional 27 SNPs, whose effects remain unclear. In particular, one of the other two SNPs identified in the ICP4 gene leads to a second amino acid change in ICP4 from serine to asparagine.</p>
            </sec>
            <sec id="sec12">
                <title>Deletions identified for HSV-1 null mutants</title>
                <p>To detect deletions, our pipeline was run with a stringent global z-score cut-off of -2.5, a less stringent global cut-off of 0.0 and a local z-score cut-off of -6.0. No minimum length was required for the deletions. This resulted in the identification of the deletions shown in 
                    <xref ref-type="table" rid="T1">
Table 1</xref>. We identified one deletion each in the ICP0 null mutant (&#x0394;ICP0) and the ICP22 null mutant (&#x0394;ICP22) that matched the target gene and approximate length described in the corresponding articles.
                    <sup>
                        <xref ref-type="bibr" rid="ref4">4</xref>,
                        <xref ref-type="bibr" rid="ref6">6</xref>,
                        <xref ref-type="bibr" rid="ref7">7</xref>
                    </sup> Furthermore, the sequences found directly up- and downstream of the predicted deletions matched the target sequences of the restriction enzymes used in the corresponding experiments to create the deletions (XhoI &amp; SalI for &#x0394;ICP0; PvuII &amp; BstEII/Eco91I for &#x0394;ICP22). Thus, we could recover the exact locations of the introduced deletions.</p>
                <table-wrap id="T1" orientation="portrait" position="float">
                    <label>
Table 1. </label>
                    <caption>
                        <title>Deletions detected by the pipeline for any of the HSV-1 null mutants in the 4sU-seq data and whether this represents the deletion described in the original papers describing the null mutant, a deletion in the parental strain or a known intron.</title>
                        <p>For the known intron in ICP22, the position of the intron is also indicated in the last column. The genes US10-US12 overlap at the deletion position in the &#x0394;ICP27 virus.</p>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Mutant</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Type</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Start position</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">End position</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Gene</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP0</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">described deletion</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">120913</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">123031</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP0</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP22</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">deletion in parental strain F</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">132276</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">132280</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP22</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP22</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">described deletion</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">133243</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">134072</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP22</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP27</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">deletion in parental strain KOS 1.1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">144838</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">144849</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">US10;US11;US12</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP27</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">known intron</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">132404</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">132497</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP22 (132,375-132,543)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;vhs</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">known intron</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">132390</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">132513</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP22 (132,375-132,543)</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>A further deletion in the 5&#x2019; UTR of ICP22 identified in &#x0394;ICP22 infection corresponded to a genome deletion in the parental strain F from which the &#x0394;ICP22 virus was derived. Similarly, a deletion identified in the &#x0394;ICP27 virus is already present in its parental strain KOS 1.1. In addition, a deletion was identified for the ICP27 null mutant (&#x0394;ICP27) and the vhs null mutant (&#x0394;vhs) that fell into a known intron in the ICP22 gene. Although this intron is spliced in all samples, it was not detected as a deletion in &#x0394;ICP0, &#x0394;ICP4 and TsK infection. For &#x0394;ICP4 and TsK infection, this was likely because read coverage across the whole viral genome was relatively low, as ICP4 is necessary for optimal expression of other HSV-1 genes.
                    <sup>
                        <xref ref-type="bibr" rid="ref32">32</xref>
                    </sup> For &#x0394;ICP0 infection the opposite applied as both replicates had by far the highest read coverage on the viral genome of any of the samples. As a consequence, sufficient numbers of reads from unspliced ICP22 transcripts were detected for the intron not to be identified as a deletion.</p>
            </sec>
            <sec id="sec13">
                <title>Insertions identified for HSV-1 null mutants</title>
                <p>Insertions were also predicted using default values. Local z-scores were calculated for the 40 nt around each peak position, at least 10 clipped reads were required for each peak and a maximum overlap 
                    <inline-formula>

                        <mml:math display="inline">
                            <mml:mi>&#x03d5;</mml:mi>
                        </mml:math>
</inline-formula> of 10 nt was allowed for the insertion ends and the surrounding genome regions. Furthermore, the consensus sequences obtained from the clipped parts of the read had to be at least 10 nt long. The identified insertions are listed in 
                    <xref ref-type="table" rid="T2">
Table 2</xref>.</p>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>
Table 2. </label>
                    <caption>
                        <title>Insertions detected by the pipeline for any of the HSV-1 null mutants in the 4sU-seq data and whether this represents the insertion described in the original papers describing the null mutant or an insertion in the parental strain.</title>
                        <p>Information in brackets indicates characteristics of the inserted sequences that could be confirmed from the consensus sequences or the assembly followed by BLAST. Overlap = overlap between the ends of the inserted sequence and the surrounding genome sequences. The genes US5-US7 overlap at the insertion position in the parental strain KOS 1.1.</p>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Mutant</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Type</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Position</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Overlap</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Gene</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">described insertion (stop codons, HpaI recognition site)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">130376</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP4</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP27</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">described insertion (E. coli lacZ gene)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">113648</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">ICP27</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;ICP27</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">insertion in parental strain KOS 1.1</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">140458</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">US5;US6;US7</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">&#x0394;vhs</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">described insertion (cloning vector with lacZ gene)</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">91923</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">vhs</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>All but one of the identified insertions matched the description in the corresponding publications on how the null mutants were created.
                    <sup>
                        <xref ref-type="bibr" rid="ref8">8</xref>&#x2013;
                        <xref ref-type="bibr" rid="ref11">11</xref>
                    </sup> In particular, we confirmed the insertion of lacZ genes in both the &#x0394;ICP27 and &#x0394;vhs viruses by BLAST searches with the predicted insertion sequences obtained from the assembly. For the &#x0394;ICP4 mutant, the insertion of a short 16 nt sequence could be confirmed directly from the consensus sequences of the insertion start and end, as these overlapped. This insertion sequence contained the three stop codons, one for each reading frame, and the recognition site of the HpaI restriction enzyme described in the original publication. The additional insertion identified in the &#x0394;ICP27 virus represents a known insertion in the parental strain KOS 1.1.</p>
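                <p>The check described for the &#x0394;ICP4 insertion can be illustrated with a minimal Python sketch that tests whether a short sequence contains a stop codon in each of the three forward reading frames and an HpaI recognition site (GTTAAC). The function names and the 16 nt example sequence are hypothetical and serve only as an illustration; they are not taken from the pipeline or from the original publication.</p>
                <code language="python">
```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
HPAI_SITE = "GTTAAC"  # HpaI recognition sequence


def frames_with_stop(seq):
    """Return the set of forward reading frames (0, 1, 2) containing a stop codon."""
    seq = seq.upper()
    frames = set()
    for frame in range(3):
        # Walk through complete codons only, starting at the frame offset.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            frames.add(frame)
    return frames


def has_hpai_site(seq):
    """Return True if the sequence contains the HpaI recognition sequence."""
    return HPAI_SITE in seq.upper()


# Hypothetical 16 nt linker-like sequence, used for illustration only.
example = "GGCTAGTTAACTAGCC"
print(len(example), sorted(frames_with_stop(example)), has_hpai_site(example))
```
</code>
                <p>For this illustrative sequence, all three frames contain a stop codon and the HpaI site is present. In practice, the same test would be applied to the consensus sequence reconstructed from the clipped reads.</p>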
                <p>Interestingly, we found that the position for the insertion in the vhs coding sequence (251st codon) described in the corresponding publication for the &#x0394;vhs mutant
                    <sup>
                        <xref ref-type="bibr" rid="ref10">10</xref>
                    </sup> may have been calculated based on an incorrect strand assignment. The vhs gene is located on the negative strand, with the coding sequence ranging from position 91,167 (stop codon) to position 92,636 (start codon). Accordingly, the insertion position identified by our pipeline (91,923) is in the 238th codon. However, if the codon position is erroneously calculated from the positive strand, the insertion would be after the first position of the 252nd codon (excluding the stop codon), which is closer to the position reported in the original publication. Since the insertion site was described to be in the unique recognition site of the NruI restriction enzyme in the vhs gene
                    <sup>
                        <xref ref-type="bibr" rid="ref10">10</xref>
                    </sup> and the centre of this NruI recognition site coincides with the insertion position identified by our pipeline, we conclude that 91,923 is indeed the correct position.</p>
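                <p>The codon arithmetic above can be reproduced with a short Python sketch. It assumes 1-based inclusive genome coordinates and that the first base of the start codon lies at the rightmost coordinate for a gene on the negative strand. The function name is hypothetical and not part of the pipeline.</p>
                <code language="python">
```python
def codon_of_position(pos, cds_start, cds_end, strand):
    """Return (codon number, position within codon), both 1-based.

    cds_start and cds_end are the leftmost and rightmost genome coordinates
    (1-based, inclusive) of the region over which codons are counted.
    """
    if strand == "-":
        # On the negative strand, counting starts at the rightmost coordinate.
        offset = cds_end - pos
    else:
        offset = pos - cds_start
    return offset // 3 + 1, offset % 3 + 1


# vhs coding sequence as given in the text: 91,167 (stop codon) to 92,636 (start codon).
print(codon_of_position(91923, 91167, 92636, "-"))      # negative strand
# Erroneous positive-strand reading, skipping the 3 nt stop codon at the left end.
print(codon_of_position(91923, 91167 + 3, 92636, "+"))
```
</code>
                <p>Under these assumptions, the two calls return (238, 3) and (252, 1): the 238th codon for the negative strand and the first position of the 252nd codon for the erroneous positive-strand calculation.</p>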
            </sec>
            <sec id="sec14">
                <title>Insertions in the GFP-expressing HSV-1 virus</title>
                <p>According to the original publication describing this virus,
                    <sup>
                        <xref ref-type="bibr" rid="ref12">12</xref>
                    </sup> an enhanced GFP (EGFP) gene with a mouse cytomegalovirus promoter was inserted between the open reading frames (ORFs) UL55 and UL56. In addition, a LoxP site (a 34 nt DNA sequence recognized by the Cre recombinase) was inserted downstream of the UL23 ORF. Our pipeline identified two insertion sites: position 46,665, downstream of the UL23 coding sequence, in 8 of the samples, and position 116,147, between UL55 and UL56, in 7 of the samples. The insertion sequence for the first site indeed contained a LoxP site, and BLAST analysis of the insertion sequence for the second site showed that it matched several cloning vectors containing the GFP gene. Thus, we correctly identified the precise genome positions of both insertions.</p>
                <p>Notably, the insertion at position 116,147 was actually identified as a deletion between positions 116,147 and 116,154 into which an insertion was placed. The pipeline predicts this special case if the PWMs obtained from the clipped reads cannot be matched to the opposite end of the deletion and an insertion sequence can be identified from the assembly. Unfortunately, the original publication does not describe the insertion procedure in sufficient detail to explain how this small deletion was generated, but it is most likely a consequence of the experimental approach used.</p>
                <p>Additional insertions were identified at positions 62,143, 106,984, and 119,496 in 4&#x2013;8 of the samples. However, no insertion sequences could be extracted from the assembly for the insertions at positions 62,143 and 106,984 based on the consensus sequences from the clipped reads, while the insertion sequence for 119,496 matched the genome downstream of the predicted insertion site. Based on these results and inspection of the genome at these positions, we concluded that these were artefacts caused by repetitive sequences. This highlights how the combination of consensus sequences from the clipped parts of reads and the assembly can be used to filter out incorrectly identified insertions.</p>
            </sec>
            <sec id="sec15">
                <title>Comparison to 
                    <italic toggle="yes">de novo</italic> assembly</title>
                <p>For comparison, we also investigated whether deletions and insertions could be identified directly from the contigs assembled by rnaSPAdes, without the analysis of read coverage and clipped read alignments performed by our pipeline. For this purpose, contigs assembled for the 4sU-seq data of HSV-1 mutant infections were aligned against the reference genome using minimap2.
                    <sup>
                        <xref ref-type="bibr" rid="ref33">33</xref>
                    </sup> However, the assembled contigs often contained small indels (~1&#x2013;50 bp) relative to the reference genome, which would result in a large number of predicted indels if all of them were included. We therefore evaluated different minimum length thresholds for identified indels. Furthermore, we observed that for some insertions the inserted sequence was only partially assembled and thus located at the start or end of an assembled contig, which resulted in a clipped alignment of the contig to the genome. We thus also evaluated the option of including such clipped alignments to identify the position of the insertion and at least the start or end of the inserted sequence.</p>
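                <p>The filtering idea described above can be sketched in a few lines of Python. The sketch scans the CIGAR string of one contig alignment, of the kind found in SAM-format output, keeps insertions and deletions that reach a minimum length, and optionally reports clipped contig ends as candidate insertion sites. It is an illustrative assumption of how such a filter could look, not the code used for the comparison; the function name, the reported coordinates and the default values are hypothetical.</p>
                <code language="python">
```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(ref_start, cigar, min_len=16, allow_clipped=True):
    """List (kind, reference position, length) events for one contig alignment.

    ref_start is the 1-based leftmost reference position of the alignment.
    Insertions and end clips are reported at the last aligned reference base
    before them, start clips at the first aligned reference base.
    """
    events = []
    ref_pos = ref_start
    for number, op in CIGAR_OP.findall(cigar):
        length = int(number)
        if op in "SH":
            # Clipped contig end: candidate for a partially assembled insertion.
            if allow_clipped and length >= min_len:
                at_start = ref_pos == ref_start
                events.append(("clipped", ref_pos if at_start else ref_pos - 1, length))
        elif op == "I":
            if length >= min_len:
                events.append(("insertion", ref_pos - 1, length))
        elif op == "D":
            if length >= min_len:
                events.append(("deletion", ref_pos, length))
            ref_pos += length
        elif op in "MN=X":
            # These operations consume reference bases without being reported.
            ref_pos += length
    return events


# Hypothetical alignment: a 2 nt insertion is filtered out, the rest is kept.
print(indels_from_cigar(1000, "500M2I100M16I300M40D200M25S"))
```
</code>
                <p>Raising <monospace>min_len</monospace> above 16 in this sketch would drop the 16 nt insertion, and setting <monospace>allow_clipped</monospace> to false would drop the clipped contig end, which mirrors the trade-off between sensitivity and the number of predicted indels discussed here.</p>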
                <p>
                    <xref ref-type="fig" rid="f6">
Figure 6</xref> shows an evaluation of different thresholds on the indel length, with and without inclusion of clipped contig alignments for insertion detection. To identify all indels, a relatively small minimum indel length of 16 nt had to be used and clipped contig alignments had to be included. Higher minimum indel lengths excluded the 16 nt insertion in the &#x0394;ICP4 mutant, while the lacZ gene insertion in the &#x0394;ICP27 mutant was missed without clipped contig alignments. However, this parameter combination resulted in large numbers of predicted insertions for the &#x0394;ICP0, &#x0394;ICP22, &#x0394;ICP27, and &#x0394;vhs mutants, making it difficult to distinguish the correct indels in these mutants.</p>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>
Figure 6. </label>
                    <caption>
                        <title>Analysis of the number of predicted deletions or insertions identified from the contigs assembled from the raw sequencing reads for different minimum indel lengths.</title>
                        <p>For insertions, we also evaluated the effect of predicting insertions if only one end of the contig can be aligned to the genome in a clipped alignment. Parameters for which the correct deletion (for the &#x0394;ICP0 and &#x0394;ICP22 viruses) or the correct insertion (for the &#x0394;ICP4, &#x0394;ICP27 and &#x0394;vhs viruses) is recovered are filled in black.</p>
                    </caption>
                    <graphic id="gr6" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/185998/274de8c3-338b-4373-871e-df90906e57c1_figure6.gif"/>
                </fig>
            </sec>
        </sec>
        <sec id="sec16" sec-type="discussion">
            <title>Discussion</title>
            <p>In this article, we present a pipeline for identification of SNPs and indels in viral variants from functional genomics experiments, such as RNA-seq, ATAC-seq or others. Development of the pipeline was motivated by the observation that commonly used null mutant viruses are often not described in sufficient detail to determine the precise genomic location of mutations. Notably, this applies not only to null mutants created decades ago, before viral genome sequences became available, but also to more recently created virus variants such as the GFP-expressing HSV-1 virus described by Snijder <italic toggle="yes">et al.</italic>
                <sup>
                    <xref ref-type="bibr" rid="ref12">12</xref>
                </sup> In the latter case, only the approximate location relative to viral genes was described. Thus, application of our pipeline provides the first annotation of the precise genome location for key mutations in several widely used HSV-1 mutant viruses.</p>
            <p>Our pipeline has the advantage that it does not require additional genome sequencing experiments and can be run directly on the experiment from which biological conclusions are drawn. Furthermore, the computational overhead is relatively low, in particular if sequence assembly for identification of longer insertion sequences is omitted. This would be sufficient if one is not interested in the insertion sequence or the insertion is short enough that the sequence can be identified directly from the consensus sequences as in the case of the &#x0394;ICP4 mutant.</p>
            <p>Without assembly, indel detection takes a few minutes per sample, compared with more than 1 h with assembly, which substantially reduces the runtime. For SNP detection, the computational overhead is determined by the runtimes of bcftools and VarScan (including &#x2018;samtools mpileup&#x2019;), which took about 20 and 15 minutes per sample, respectively, even for the &#x0394;ICP0 infection samples with the highest coverage of the HSV-1 genome.</p>
            <p>Despite the additional overhead, identification of inserted sequences from read assemblies has the advantage that it allows confirming the insertion of particular marker genes like GFP or lacZ and distinguishing the marker insertions from other insertions that may have been correctly or incorrectly predicted. Notably, 
                <italic toggle="yes">de novo</italic> assembly alone is not sufficient to identify indels with high precision without further post-processing and sample-specific parameter tuning. In contrast, a single parameter combination for our pipeline recovered all variants introduced into the HSV-1 null mutants without predicting many additional indels. Moreover, the additional indels identified by our pipeline for the HSV-1 null mutants were not actually incorrect, as they represented indels in the parental strains of the null mutants or introns.</p>
            <p>A disadvantage of our pipeline is that it depends on sufficient read coverage of the corresponding genome regions. While this also applies to standard genome sequencing, functional genomics data can have low read coverage, either in parts of the genome or across the complete genome, if they measure gene expression (such as RNA-seq, PRO-seq or similar methods that capture transcriptional processes) or if the viral genome generally shows low coverage. Although most parts of viral genomes are generally transcribed to some degree, weakly expressed genes or non-transcribed regions can have insufficient coverage. Read coverage can be low across the whole genome when virus genome replication and transcription are impaired, such as during &#x0394;ICP4 and TsK infections, or in the early stages of infection. Nevertheless, this issue can be addressed by combining different types of functional genomics data, replicates or different time points of infection.</p>
            <p>Although we only tested the pipeline on (variants of) RNA-seq data, these represent both the major challenges for our pipeline, i.e. variable and low coverage samples, and the most commonly applied assay for functional studies of viral null mutants. We are thus confident that our pipeline will be highly useful for researchers using functional genomics to study viruses and the functional role of individual virus genes.</p>
        </sec>
    </body>
    <back>
        <sec id="sec20" sec-type="data-availability">
            <title>Data availability</title>
            <p>The data sets supporting the results of this article are available in the Gene Expression Omnibus (GEO) under the following identifiers:
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>4sU-seq data of HSV-1 null mutant infections: GSE151912, 
                            <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151912">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151912</ext-link> (previously published RNA-seq data from the study by Wang 
                            <italic toggle="yes">et al.</italic>
                            <sup>
                                <xref ref-type="bibr" rid="ref14">14</xref>
                            </sup>).</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>RNA-seq data of infection with the GFP-expressing HSV-1 virus: GSE163952, 
                            <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163952">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163952</ext-link> (previously published RNA-seq data from the study by Rybak-Wolf 
                            <italic toggle="yes">et al.</italic>
                            <sup>
                                <xref ref-type="bibr" rid="ref22">22</xref>
                            </sup>).</p>
                    </list-item>
                </list>
            </p>
            <p>A pre-print version of this article has been deposited at bioRxiv at: 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1101/2025.01.31.635891">https://doi.org/10.1101/2025.01.31.635891</ext-link>.
                <sup>
                    <xref ref-type="bibr" rid="ref34">34</xref>
                </sup>
            </p>
        </sec>
        <sec id="sec17">
            <title>Software availability</title>
            <p>Software available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-workflows">

                    <italic toggle="yes">https://github.com/watchdog-wms/watchdog-wms-workflows</italic>
</ext-link>, 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-modules">

                    <italic toggle="yes">https://github.com/watchdog-wms/watchdog-wms-modules</italic>
</ext-link>
            </p>
            <p>
Source code available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-workflows">

                    <italic toggle="yes">https://github.com/watchdog-wms/watchdog-wms-workflows</italic>
</ext-link>, 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-modules">

                    <italic toggle="yes">https://github.com/watchdog-wms/watchdog-wms-modules</italic>
</ext-link>
            </p>
            <p>Archived source code at time of publication: 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.16639950">

                    <italic toggle="yes">https://doi.org/10.5281/zenodo.16639950</italic>
</ext-link>
            </p>
            <p>License: 
                <italic toggle="yes">GNU General Public License v3.0</italic>
            </p>
        </sec>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Varanda</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Felix</surname>
                            <given-names>MR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Campos</surname>
                            <given-names>MD</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>An Overview of the Application of Viruses to Biotechnology.</article-title>
                    <source>

                        <italic toggle="yes">Viruses.</italic>
</source>
                    <year>2021</year>;<volume>13</volume>(<issue>10</issue>).
                    <pub-id pub-id-type="pmid">34696503</pub-id>
                    <pub-id pub-id-type="doi">10.3390/v13102073</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8541484</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Johnston</surname>
                            <given-names>JB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McFadden</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>Technical knockout: understanding poxvirus pathogenesis by selectively deleting viral immunomodulatory genes.</article-title>
                    <source>

                        <italic toggle="yes">Cell. Microbiol.</italic>
</source>
                    <year>2004</year>;<volume>6</volume>(<issue>8</issue>):<fpage>695</fpage>&#x2013;<lpage>705</lpage>.
                    <pub-id pub-id-type="pmid">15236637</pub-id>
                    <pub-id pub-id-type="doi">10.1111/j.1462-5822.2004.00423.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Marsden</surname>
                            <given-names>HS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Crombie</surname>
                            <given-names>IK</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Subak-Sharpe</surname>
                            <given-names>JH</given-names>
                        </name>
</person-group>:
                    <article-title>Control of protein synthesis in herpesvirus-infected cells: analysis of the polypeptides induced by wild type and sixteen temperature-sensitive mutants of HSV strain 17.</article-title>
                    <source>

                        <italic toggle="yes">J. Gen. Virol.</italic>
</source>
                    <year>1976</year>;<volume>31</volume>(<issue>3</issue>):<fpage>347</fpage>&#x2013;<lpage>372</lpage>.
                    <pub-id pub-id-type="pmid">180249</pub-id>
                    <pub-id pub-id-type="doi">10.1099/0022-1317-31-3-347</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Post</surname>
                            <given-names>LE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Roizman</surname>
                            <given-names>B</given-names>
                        </name>
</person-group>:
                    <article-title>A generalized technique for deletion of specific genes in large genomes: alpha gene 22 of herpes simplex virus 1 is not essential for growth.</article-title>
                    <source>

                        <italic toggle="yes">Cell.</italic>
</source>
                    <year>1981</year>;<volume>25</volume>(<issue>1</issue>):<fpage>227</fpage>&#x2013;<lpage>232</lpage>.
                    <pub-id pub-id-type="pmid">6268303</pub-id>
                    <pub-id pub-id-type="doi">10.1016/0092-8674(81)90247-6</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Davison</surname>
                            <given-names>MJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Preston</surname>
                            <given-names>VG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McGeoch</surname>
                            <given-names>DJ</given-names>
                        </name>
</person-group>:
                    <article-title>Determination of the sequence alteration in the DNA of the herpes simplex virus type 1 temperature-sensitive mutant ts K.</article-title>
                    <source>

                        <italic toggle="yes">J. Gen. Virol.</italic>
</source>
                    <year>1984</year>;<volume>65</volume>(<issue>Pt 5</issue>):<fpage>859</fpage>&#x2013;<lpage>863</lpage>.
                    <pub-id pub-id-type="doi">10.1099/0022-1317-65-5-859</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Stow</surname>
                            <given-names>ND</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Stow</surname>
                            <given-names>EC</given-names>
                        </name>
</person-group>:
                    <article-title>Isolation and characterization of a herpes simplex virus type 1 mutant containing a deletion within the gene encoding the immediate early polypeptide Vmw110.</article-title>
                    <source>

                        <italic toggle="yes">J. Gen. Virol.</italic>
</source>
                    <year>1986</year>;<volume>67</volume>(<issue>12</issue>):<fpage>2571</fpage>&#x2013;<lpage>2585</lpage>.</mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Perry</surname>
                            <given-names>LJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rixon</surname>
                            <given-names>FJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Everett</surname>
                            <given-names>RD</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Characterization of the IE110 gene of herpes simplex virus type 1.</article-title>
                    <source>

                        <italic toggle="yes">J. Gen. Virol.</italic>
</source>
                    <year>1986</year>;<volume>67</volume>(<issue>Pt 11</issue>):<fpage>2365</fpage>&#x2013;<lpage>2380</lpage>.
                    <pub-id pub-id-type="pmid">3023529</pub-id>
                    <pub-id pub-id-type="doi">10.1099/0022-1317-67-11-2365</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>DeLuca</surname>
                            <given-names>NA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schaffer</surname>
                            <given-names>PA</given-names>
                        </name>
</person-group>:
                    <article-title>Activities of herpes simplex virus type 1 (HSV-1) ICP4 genes specifying nonsense peptides.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>1987</year>;<volume>15</volume>(<issue>11</issue>):<fpage>4491</fpage>&#x2013;<lpage>4511</lpage>.
                    <pub-id pub-id-type="pmid">3035496</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/15.11.4491</pub-id>
                    <pub-id pub-id-type="pmcid">PMC340876</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>DeLuca</surname>
                            <given-names>NA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schaffer</surname>
                            <given-names>PA</given-names>
                        </name>
</person-group>:
                    <article-title>Physical and functional domains of the herpes simplex virus transcriptional regulatory protein ICP4.</article-title>
                    <source>

                        <italic toggle="yes">J. Virol.</italic>
</source>
                    <year>1988</year>;<volume>62</volume>(<issue>3</issue>):<fpage>732</fpage>&#x2013;<lpage>743</lpage>.
                    <pub-id pub-id-type="pmid">2828668</pub-id>
                    <pub-id pub-id-type="doi">10.1128/jvi.62.3.732-743.1988</pub-id>
                    <pub-id pub-id-type="pmcid">PMC253626</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Fenwick</surname>
                            <given-names>ML</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Everett</surname>
                            <given-names>RD</given-names>
                        </name>
</person-group>:
                    <article-title>Inactivation of the shutoff gene (UL41) of herpes simplex virus types 1 and 2.</article-title>
                    <source>

                        <italic toggle="yes">J. Gen. Virol.</italic>
</source>
                    <year>1990</year>;<volume>71</volume>(<issue>Pt 12</issue>):<fpage>2961</fpage>&#x2013;<lpage>2967</lpage>.
                    <pub-id pub-id-type="pmid">2177088</pub-id>
                    <pub-id pub-id-type="doi">10.1099/0022-1317-71-12-2961</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Smith</surname>
                            <given-names>IL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hardwicke</surname>
                            <given-names>MA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sandri-Goldin</surname>
                            <given-names>RM</given-names>
                        </name>
</person-group>:
                    <article-title>Evidence that the herpes simplex virus immediate early protein ICP27 acts post-transcriptionally during infection to regulate gene expression.</article-title>
                    <source>

                        <italic toggle="yes">Virology.</italic>
</source>
                    <year>1992</year>;<volume>186</volume>(<issue>1</issue>):<fpage>74</fpage>&#x2013;<lpage>86</lpage>.
                    <pub-id pub-id-type="pmid">1309283</pub-id>
                    <pub-id pub-id-type="doi">10.1016/0042-6822(92)90062-T</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Snijder</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sacher</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>R&#x00e4;m&#x00f6;</surname>
                            <given-names>P</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Single-cell analysis of population context advances RNAi screening at multiple levels.</article-title>
                    <source>

                        <italic toggle="yes">Mol. Syst. Biol.</italic>
</source>
                    <year>2012</year>;<volume>8</volume>:<fpage>579</fpage>.
                    <pub-id pub-id-type="pmid">22531119</pub-id>
                    <pub-id pub-id-type="doi">10.1038/msb.2012.9</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3361004</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jansz</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Faulkner</surname>
                            <given-names>GJ</given-names>
                        </name>
</person-group>:
                    <article-title>Viral genome sequencing methods: benefits and pitfalls of current approaches.</article-title>
                    <source>

                        <italic toggle="yes">Biochem. Soc. Trans.</italic>
</source>
                    <year>2024</year>;<volume>52</volume>(<issue>3</issue>):<fpage>1431</fpage>&#x2013;<lpage>1447</lpage>.
                    <pub-id pub-id-type="pmid">38747720</pub-id>
                    <pub-id pub-id-type="doi">10.1042/BST20231322</pub-id>
                    <pub-id pub-id-type="pmcid">PMC11346438</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hennig</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Whisnant</surname>
                            <given-names>AW</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Herpes simplex virus blocks host transcription termination via the bimodal activities of ICP27.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Commun.</italic>
</source>
                    <year>2020</year>;<volume>11</volume>(<issue>1</issue>):<fpage>293</fpage>.
                    <pub-id pub-id-type="pmid">31941886</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41467-019-14109-x</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6962326</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Djakovic</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hennig</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Reinisch</surname>
                            <given-names>K</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The HSV-1 ICP22 protein selectively impairs histone repositioning upon Pol II transcription downstream of genes.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Commun.</italic>
</source>
                    <year>2023</year>;<volume>14</volume>(<issue>1</issue>):<fpage>4591</fpage>.
                    <pub-id pub-id-type="pmid">37524699</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41467-023-40217-w</pub-id>
                    <pub-id pub-id-type="pmcid">PMC10390501</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>
</person-group>:
                    <article-title>A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>27</volume>(<issue>21</issue>):<fpage>2987</fpage>&#x2013;<lpage>2993</lpage>.
                    <pub-id pub-id-type="pmid">21903627</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btr509</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Koboldt</surname>
                            <given-names>DC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zhang</surname>
                            <given-names>Q</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Larson</surname>
                            <given-names>DE</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2012</year>;<volume>22</volume>(<issue>3</issue>):<fpage>568</fpage>&#x2013;<lpage>576</lpage>.
                    <pub-id pub-id-type="pmid">22300766</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.129684.111</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3290792</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rausch</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zichner</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Schlattl</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>DELLY: structural variant discovery by integrated paired-end and split-read analysis.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2012</year>;<volume>28</volume>(<issue>18</issue>):<fpage>i333</fpage>&#x2013;<lpage>i339</lpage>.
                    <pub-id pub-id-type="pmid">22962449</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/bts378</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3436805</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cameron</surname>
                            <given-names>DL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baber</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shale</surname>
                            <given-names>C</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing.</article-title>
                    <source>

                        <italic toggle="yes">Genome Biol.</italic>
</source>
                    <year>2021</year>;<volume>22</volume>(<issue>1</issue>):<fpage>202</fpage>.
                    <pub-id pub-id-type="pmid">34253237</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s13059-021-02423-x</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8274009</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wallis</surname>
                            <given-names>JW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>McLellan</surname>
                            <given-names>MD</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Methods.</italic>
</source>
                    <year>2009</year>;<volume>6</volume>(<issue>9</issue>):<fpage>677</fpage>&#x2013;<lpage>681</lpage>.
                    <pub-id pub-id-type="pmid">19668202</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.1363</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3661775</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bushmanova</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Antipov</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lapidus</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data.</article-title>
                    <source>

                        <italic toggle="yes">GigaScience.</italic>
</source>
                    <year>2019</year>;<volume>8</volume>(<issue>9</issue>):<fpage>giz100</fpage>.
                    <pub-id pub-id-type="pmid">31494669</pub-id>
                    <pub-id pub-id-type="doi">10.1093/gigascience/giz100</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6736328</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Rybak-Wolf</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wyler</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Pentimalli</surname>
                            <given-names>TM</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Modelling viral encephalitis caused by herpes simplex virus 1 infection in cerebral organoids.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Microbiol.</italic>
</source>
                    <year>2023</year>;<volume>8</volume>(<issue>7</issue>):<fpage>1252</fpage>&#x2013;<lpage>1266</lpage>.
                    <pub-id pub-id-type="pmid">37349587</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41564-023-01405-y</pub-id>
                    <pub-id pub-id-type="pmcid">PMC10322700</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Durbin</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Fast and accurate short read alignment with Burrows-Wheeler transform.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2009</year>;<volume>25</volume>(<issue>14</issue>):<fpage>1754</fpage>&#x2013;<lpage>1760</lpage>.
                    <pub-id pub-id-type="pmid">19451168</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp324</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Handsaker</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wysoker</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The Sequence Alignment/Map format and SAMtools.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2009</year>;<volume>25</volume>(<issue>16</issue>):<fpage>2078</fpage>&#x2013;<lpage>2079</lpage>.
                    <pub-id pub-id-type="pmid">19505943</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp352</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Danecek</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Auton</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Abecasis</surname>
                            <given-names>G</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The variant call format and VCFtools.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>27</volume>(<issue>15</issue>):<fpage>2156</fpage>&#x2013;<lpage>2158</lpage>.
                    <pub-id pub-id-type="pmid">21653522</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btr330</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Whisnant</surname>
                            <given-names>AW</given-names>
                        </name>

                        <name name-style="western">
                            <surname>J&#x00fc;rges</surname>
                            <given-names>CS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hennig</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Integrative functional genomics decodes herpes simplex virus 1.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Commun.</italic>
</source>
                    <year>2020</year>;<volume>11</volume>(<issue>1</issue>):<fpage>2038</fpage>.
                    <pub-id pub-id-type="pmid">32341360</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41467-020-15992-5</pub-id>
                    <pub-id pub-id-type="pmcid">PMC7184758</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref27">
                <label>27</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bollag</surname>
                            <given-names>RJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Waldman</surname>
                            <given-names>AS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liskay</surname>
                            <given-names>RM</given-names>
                        </name>
</person-group>:
                    <article-title>Homologous recombination in mammalian cells.</article-title>
                    <source>

                        <italic toggle="yes">Annu. Rev. Genet.</italic>
</source>
                    <year>1989</year>;<volume>23</volume>:<fpage>199</fpage>&#x2013;<lpage>225</lpage>.
                    <pub-id pub-id-type="doi">10.1146/annurev.ge.23.120189.001215</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref28">
                <label>28</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Prjibelski</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Antipov</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Meleshko</surname>
                            <given-names>D</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Using SPAdes De Novo Assembler.</article-title>
                    <source>

                        <italic toggle="yes">Curr. Protoc. Bioinformatics.</italic>
</source>
                    <year>2020</year>;<volume>70</volume>(<issue>1</issue>):<fpage>e102</fpage>.
                    <pub-id pub-id-type="doi">10.1002/cpbi.102</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Altschul</surname>
                            <given-names>SF</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gish</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Miller</surname>
                            <given-names>W</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Basic local alignment search tool.</article-title>
                    <source>

                        <italic toggle="yes">J. Mol. Biol.</italic>
</source>
                    <year>1990</year>;<volume>215</volume>(<issue>3</issue>):<fpage>403</fpage>&#x2013;<lpage>410</lpage>.
                    <pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref30">
                <label>30</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Windhager</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bonfert</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Burger</surname>
                            <given-names>K</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Ultrashort and progressive 4sU-tagging reveals key characteristics of RNA processing at nucleotide resolution.</article-title>
                    <source>

                        <italic toggle="yes">Genome Res.</italic>
</source>
                    <year>2012</year>;<volume>22</volume>(<issue>10</issue>):<fpage>2031</fpage>&#x2013;<lpage>2042</lpage>.
                    <pub-id pub-id-type="pmid">22539649</pub-id>
                    <pub-id pub-id-type="doi">10.1101/gr.131847.111</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3460197</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref31">
                <label>31</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>McGeoch</surname>
                            <given-names>DJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dalrymple</surname>
                            <given-names>MA</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Davison</surname>
                            <given-names>AJ</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The complete DNA sequence of the long unique region in the genome of herpes simplex virus type 1.</article-title>
                    <source>

                        <italic toggle="yes">J. Gen. Virol.</italic>
</source>
                    <year>1988</year>;<volume>69</volume>(<issue>Pt 7</issue>):<fpage>1531</fpage>&#x2013;<lpage>1574</lpage>.
                    <pub-id pub-id-type="doi">10.1099/0022-1317-69-7-1531</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref32">
                <label>32</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Watson</surname>
                            <given-names>RJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Clements</surname>
                            <given-names>JB</given-names>
                        </name>
</person-group>:
                    <article-title>A herpes simplex virus type 1 function continuously required for early and late virus RNA synthesis.</article-title>
                    <source>

                        <italic toggle="yes">Nature.</italic>
</source>
                    <year>1980</year>;<volume>285</volume>(<issue>5763</issue>):<fpage>329</fpage>&#x2013;<lpage>330</lpage>.
                    <pub-id pub-id-type="pmid">6246451</pub-id>
                    <pub-id pub-id-type="doi">10.1038/285329a0</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref33">
                <label>33</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>
</person-group>:
                    <article-title>Minimap2: pairwise alignment for nucleotide sequences.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2018</year>;<volume>34</volume>(<issue>18</issue>):<fpage>3094</fpage>&#x2013;<lpage>3100</lpage>.
                    <pub-id pub-id-type="pmid">29750242</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/bty191</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6137996</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref34">
                <label>34</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>R&#x00f6;ckl</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Friedel</surname>
                            <given-names>CC</given-names>
                        </name>
</person-group>:
                    <article-title>Identification of Viral Variants from Functional Genomics Data.</article-title>
                    <source>

                        <italic toggle="yes">bioRxiv.</italic>
</source>
                    <year>2025</year>;2025.01.31.635891.
                    <pub-id pub-id-type="doi">10.1101/2025.01.31.635891</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report420733">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.185998.r420733</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Kumar</surname>
                        <given-names>Anuj</given-names>
                    </name>
                    <xref ref-type="aff" rid="r420733a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-5023-7618</uri>
                </contrib>
                <aff id="r420733a1">
                    <label>1</label>Dalhousie University, Halifax, Canada</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>3</day>
                <month>11</month>
                <year>2025</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 Kumar A</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport420733" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.168786.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>In my opinion, the authors have done an excellent job in developing a robust pipeline for screening viral variants from functional genomics data. The pipeline facilitates SNP calling, indel detection, candidate deletion identification, and additional variant analyses. Its user-friendly design makes it a valuable tool for the scientific community to efficiently identify and interpret viral genome variants.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Bioinformatics, Emerging infectious diseases, Functional Genomics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report408633">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.185998.r408633</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Tomb&#x00e1;cz</surname>
                        <given-names>D&#x00f3;ra</given-names>
                    </name>
                    <xref ref-type="aff" rid="r408633a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r408633a1">
                    <label>1</label>University of Szeged, Szeged, Csongr&#x00e1;d, Hungary</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>19</day>
                <month>9</month>
                <year>2025</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 Tomb&#x00e1;cz D</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport408633" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.168786.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>
                <bold>Full report</bold>
            </p>
            <p> The manuscript presents a workflow (&#x201c;VariantCallerPipeline&#x201d;) for identifying viral SNPs and indels&#x2014;including reconstruction of inserted sequences&#x2014;directly from functional-genomics data (e.g., RNA-seq/4sU-seq) of infected cells, obviating separate viral genome sequencing. SNPs are called with bcftools and VarScan; indels are detected using a combination of read-coverage troughs and peaks of left/right clipped reads, with PWM-based breakpoint validation; optional rnaSPAdes assembly retrieves inserted sequences. On well-known HSV-1 mutants (&#x0394;ICP0, &#x0394;ICP22, &#x0394;ICP27, &#x0394;vhs, &#x0394;ICP4; TsK) and a GFP-expressing strain, the pipeline recovers the intended edits and clarifies additional parental-strain variants or introns. The problem is real and common&#x2014;legacy mutants with sparse nucleotide-level documentation&#x2014;and the solution is practical and timely.</p>
            <p> 
                <bold>Strengths</bold> 
                <list list-type="bullet">
                    <list-item>
                        <p>Clear rationale; addresses a frequent, under-served need in virology labs.</p>
                    </list-item>
                    <list-item>
                        <p>Sensible method design for uneven coverage typical of RNA-derived data.</p>
                    </list-item>
                    <list-item>
                        <p>Convincing case studies on canonical HSV-1 mutants; insertion sequences verified (e.g., lacZ/EGFP).</p>
                    </list-item>
                </list> 
                <bold>Limitations</bold> 
                <list list-type="bullet">
                    <list-item>
                        <p>Reliance on sufficient local read coverage; potential ambiguity with splice junctions in RNA-derived data if not controlled.</p>
                    </list-item>
                    <list-item>
                        <p>Evaluation centered on HSV-1 RNA/4sU-seq (coverage profiles may differ in other assays/viruses).</p>
                    </list-item>
                </list> 
                <bold>Points that must be addressed</bold> 
                <list list-type="order">
                    <list-item>
                        <p>Please pin exact software versions (bcftools/htslib, VarScan, samtools, BWA, SPAdes/rnaSPAdes), provide checksums for the reference genome and annotation used, and supply either a container (Docker/Singularity) 
                            <bold>or</bold> a one-command runnable example using your Zenodo inputs, with expected outputs and brief runtime/memory notes.</p>
                        <p> 
                            <italic>Rationale:</italic> ensures others can rerun the workflow without environment drift.</p>
                    </list-item>
                    <list-item>
                        <p>In Methods, specify how zeros are handled in log-coverage (pseudocount), define the global/local z-score calculations and clip-peak criteria, and add one sentence on preventing splice junctions being misinterpreted as deletions (e.g., default masking of annotated introns or an option for splice-aware alignment).</p>
                        <p> 
                            <italic>Rationale:</italic> removes ambiguity for RNA-derived datasets and documents default behavior.</p>
                    </list-item>
                    <list-item>
                        <p>Ensure Tables 1&#x2013;2 list complete coordinates, event sizes, strand, and genomic context (CDS/UTR/intron). Deposit per-sample final VCF/BED (SNPs and indels) and FASTA for insert consensus sequences (with brief BLAST summaries).</p>
                        <p> 
                            <italic>Rationale:</italic> makes the results reusable and easy to interpret by others.</p>
                    </list-item>
                </list> 
                <bold>Minor edits</bold> 
                <list list-type="bullet">
                    <list-item>
                        <p>Figure 1 caption: &#x201c;a 4sU-seq sample&#x201d; (not &#x201c;an&#x201d;).</p>
                    </list-item>
                    <list-item>
                        <p>Correct typo: reference 27 author &#x201c;Waldman&#x201d;.</p>
                    </list-item>
                    <list-item>
                        <p>Consistent notation: define n, x, &#x03d5; and coordinate conventions (1-based; inclusive bounds). Replace &#x201c;local z-cores&#x201d; with z-scores.</p>
                    </list-item>
                </list> 
                <bold>Recommendation</bold>
            </p>
            <p>The tool is valuable and technically sound; the three requested clarifications and packaging steps will make it easily reusable by the community.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Partly</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>viral transcriptomics, genomics, metagenomics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment15098-408633">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Friedel</surname>
                            <given-names>Caroline C.</given-names>
                        </name>
                        <aff>Ludwig Maximilian University of Munich, Germany</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>19</day>
                    <month>12</month>
                    <year>2025</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thank you for this generally positive assessment. We addressed all issues raised, as outlined below:</p>
                <p> </p>
                <p> 1.&#x00a0; &#x00a0; A Docker image for running the pipeline (preconfigured to run the examples) is now available via Docker Hub (
                    <ext-link ext-link-type="uri" xlink:href="https://hub.docker.com/r/carolinefriedel/virus-variant-caller">https://hub.docker.com/r/carolinefriedel/virus-variant-caller</ext-link>). Files for creating the Docker image are now also available together with the example files at Zenodo (
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.14266852">https://doi.org/10.5281/zenodo.14266852</ext-link>).</p>
                <p> Output files resulting from running the example are now also available at 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.14266852">https://doi.org/10.5281/zenodo.14266852</ext-link>. This includes exact software versions and notes on runtime for all steps. Memory requirements are also indicated in the description of the Zenodo entry. All steps of the pipeline can be run on a laptop with 16 GB RAM, except for the assembly of inserted sequences with rnaSPAdes, which can be omitted. rnaSPAdes can be run with 32 GB RAM.</p>
                <p> </p>
                <p> MD5 checksums for the reference genome and annotation are now also included with the example.</p>
                <p> </p>
                <p> 2.&#x00a0; &#x00a0; We uploaded a revised manuscript in which we specify how zeros are handled in log coverage, i.e. using a user-defined pseudocount (default 1), and define the global/local z-score calculations and clip-peak criteria. We also added more information on how splice junctions are prevented from being misinterpreted as deletions, i.e. by automatically comparing them to the genome annotation.</p>
                <p> </p>
                <p> 3.&#x00a0; &#x00a0; Tables 1-2 now also list the strand of the gene and the feature of this gene (i.e. CDS, UTR or intron) that contains the corresponding deletion or insertion. We also clarified in the text that strand is not considered for indel detection since we determine indels present in the DNA, i.e. on both strands. The size of the deletion or insertion is also provided in the table; however, for the &#x0394;ICP27 and &#x0394;vhs viruses this is only approximate, as multiple, largely identical insertion sequences were assembled.</p>
                <p> </p>
                <p> The full pipeline output for the 4sU-seq data is now also available at Zenodo (
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.17979981">https://doi.org/10.5281/zenodo.17979981</ext-link>) together with the BLAST results for the assembled sequences. Output formats are now also described in the description of the Zenodo entry and in the README file available for the workflow at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/watchdog-wms/watchdog-wms-workflows">https://github.com/watchdog-wms/watchdog-wms-workflows</ext-link>.&#x00a0;</p>
                <p> </p>
                <p> 4.&#x00a0; &#x00a0; Minor edits: We corrected the two typos and defined the requested variables. &#x201c;Local z-scores&#x201d; was not changed to &#x201c;z-scores&#x201d;, as we calculate both global and local z-scores and want to avoid confusion between the two. We added explanations to the tables that coordinates are 1-based and inclusive.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
