<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.10082.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                    <subj-group>
                        <subject>Animal Genetics</subject>
                    </subj-group>
                    <subj-group>
                        <subject>Bioinformatics</subject>
                    </subj-group>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 2 approved, 1 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Ahdesm&#x00e4;ki</surname>
                        <given-names>Miika J.</given-names>
                    </name>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Gray</surname>
                        <given-names>Simon R.</given-names>
                    </name>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Johnson</surname>
                        <given-names>Justin H.</given-names>
                    </name>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Lai</surname>
                        <given-names>Zhongwu</given-names>
                    </name>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>AstraZeneca Oncology iMed, Cambridge, UK</aff>
                <aff id="a2">
                    <label>2</label>AstraZeneca R&amp;D Information, Cambridge, UK</aff>
                <aff id="a3">
                    <label>3</label>AstraZeneca Oncology iMed, Waltham, USA</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:miika.ahdesmaki@astrazeneca.com">miika.ahdesmaki@astrazeneca.com</email>
                </corresp>
                <fn fn-type="con">
                    <p>MA authored the bwa and rna-star disambiguation algorithms, co-authored the manuscript and implemented the algorithms in Python. SG wrote the C++ implementation of the algorithms. JJ co-authored the manuscript. ZL designed and implemented the original Tophat (and Hisat2) disambiguation algorithm and co-authored the manuscript.</p>
                </fn>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>All authors are employees of AstraZeneca.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>22</day>
                <month>11</month>
                <year>2016</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2016</year>
            </pub-date>
            <volume>5</volume>
            <elocation-id>2741</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>17</day>
                    <month>11</month>
                    <year>2016</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Ahdesm&#x00e4;ki MJ et al.</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/5-2741/pdf"/>
            <abstract>
                <p>Grafting of cell lines and primary tumours is a crucial step in the drug development process between cell line studies and clinical trials. 
                    <italic toggle="yes">Disambiguate</italic> is a program for computationally separating the sequencing reads of two species derived from grafted samples. 
                    <italic toggle="yes">Disambiguate</italic> operates on alignments to the two species and separates the components at very high sensitivity and specificity as illustrated in artificially mixed human-mouse samples. This allows for maximum recovery of data from target tumours for more accurate variant calling and gene expression quantification. Given that no general use open source algorithm accessible to the bioinformatics community exists for the purposes of separating the two species data, the proposed 
                    <italic toggle="yes">Disambiguate</italic> tool presents a novel approach and improvement to performing sequence analysis of grafted samples. Both Python and C++ implementations are available and they are integrated into several open and closed source pipelines. 
                    <italic toggle="yes">Disambiguate</italic> is open source and is freely available at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/AstraZeneca-NGS/disambiguate">https://github.com/AstraZeneca-NGS/disambiguate</ext-link>.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>NGS</kwd>
                <kwd>patient derived xenograft</kwd>
                <kwd>explant</kwd>
                <kwd>disambiguation</kwd>
                <kwd>sequencing</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Xenografts, both cell line and primary tumour, are routinely profiled in preclinical and translational research. Xenografts are used to study everything from new target identification to responses to targeted therapeutics and mechanisms of resistance
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup> in an environment that is more realistic than just 2D cell lines. However, due to mouse stromal contamination of the human tumour, not all the data resulting from studying the extracted samples are guaranteed to be of human origin.</p>
            <p>Direct high-throughput sequencing of grafted samples containing a mixture of two species is routine practice. However, given the high volume of data and the computational challenges of alignment and k-mer identification, new strategies are required to computationally separate the two species&#x2019; components for more accurate downstream analysis
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup>, especially for the reduction of variant calling artefacts. However, the two-species alignment approach proposed in Bradford 
                <italic toggle="yes">et al</italic>.
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup> excludes reads that align to both organisms, clearly dismissing a large portion of the data as evidenced in 
                <xref ref-type="table" rid="T1">Table 1</xref> and 
                <xref ref-type="table" rid="T2">Table 2</xref> when observing cross species alignment rates.</p>
            <table-wrap id="T1" orientation="portrait" position="anchor">
                <label>Table 1. </label>
                <caption>
                    <title>Read pairs assigned human (hg19) and mouse (mm10) post disambiguation in BWA aligned DNA-seq data.</title>
                    <p>The &#x2018;Ambiguous&#x2019; column includes read pairs that aligned to neither species or whose alignments had equal quality scores and could not be disambiguated.</p>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">Sample</th>
                            <th align="center" colspan="1" rowspan="1">mm10</th>
                            <th align="center" colspan="1" rowspan="1">hg19</th>
                            <th align="center" colspan="1" rowspan="1">Ambiguous</th>
                            <th align="center" colspan="1" rowspan="1">Read pairs
                                <break/>total</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">SRR1176814</td>
                            <td align="center" colspan="1" rowspan="1">47197650</td>
                            <td align="center" colspan="1" rowspan="1">26157
                                <sup>
                                    <xref ref-type="other" rid="note-1">&#x2020;</xref>
                                </sup>
                            </td>
                            <td align="center" colspan="1" rowspan="1">88542</td>
                            <td align="center" colspan="1" rowspan="1">47312349</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">SRR1528269</td>
                            <td align="center" colspan="1" rowspan="1">11502
                                <sup>
                                    <xref ref-type="other" rid="note-2">&#x2020;&#x2020;</xref>
                                </sup>
                            </td>
                            <td align="center" colspan="1" rowspan="1">77102895</td>
                            <td align="center" colspan="1" rowspan="1">153767</td>
                            <td align="center" colspan="1" rowspan="1">77268164</td>
                        </tr>
                    </tbody>
                </table>
                <table-wrap-foot>
                    <fn>
                        <p id="note-1">
                            <sup>&#x2020;</sup>Down from 25638785 read pairs with alignment to hg19</p>
                        <p id="note-2">
                            <sup>&#x2020;&#x2020;</sup>Down from 39686392 read pairs with alignment to mm10</p>
                    </fn>
                </table-wrap-foot>
            </table-wrap>
            <table-wrap id="T2" orientation="portrait" position="anchor">
                <label>Table 2. </label>
                <caption>
                    <title>Read pairs assigned human (hg19) and mouse (mm10) post disambiguation in STAR aligned RNA-seq data.</title>
                    <p>The &#x2018;Ambiguous&#x2019; column includes read pairs that aligned to neither species or whose alignments had equal quality scores and could not be disambiguated.</p>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">Sample</th>
                            <th align="center" colspan="1" rowspan="1">mm10</th>
                            <th align="center" colspan="1" rowspan="1">hg19</th>
                            <th align="center" colspan="1" rowspan="1">Ambiguous</th>
                            <th align="center" colspan="1" rowspan="1">Read
                                <break/>pairs total</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">SRR1930152</td>
                            <td align="center" colspan="1" rowspan="1">23126086</td>
                            <td align="center" colspan="1" rowspan="1">80694
                                <sup>
                                    <xref ref-type="other" rid="note-3">&#x2020;</xref>
                                </sup>
                            </td>
                            <td align="center" colspan="1" rowspan="1">849364</td>
                            <td align="center" colspan="1" rowspan="1">24056144</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1">SRR387400</td>
                            <td align="center" colspan="1" rowspan="1">94289
                                <sup>
                                    <xref ref-type="other" rid="note-4">&#x2020;&#x2020;</xref>
                                </sup>
                            </td>
                            <td align="center" colspan="1" rowspan="1">49677937</td>
                            <td align="center" colspan="1" rowspan="1">9880844</td>
                            <td align="center" colspan="1" rowspan="1">59653070</td>
                        </tr>
                    </tbody>
                </table>
                <table-wrap-foot>
                    <fn>
                        <p id="note-3">
                            <sup>&#x2020;</sup>Down from 3005372 read pairs with alignment to hg19</p>
                        <p id="note-4">
                            <sup>&#x2020;&#x2020;</sup>Down from 6001230 read pairs with alignment to mm10</p>
                    </fn>
                </table-wrap-foot>
            </table-wrap>
            <p>Algorithms designed for disambiguating the host and tumour sequences include, for example, the Xenome tool
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>
                </sup>, which is based on machine learning applied to k-mers from both species. However, the implementation is not readily available and is not free for non-academic users. Rossello 
                <italic toggle="yes">et al</italic>.
                <sup>
                    <xref ref-type="bibr" rid="ref-3">3</xref>
                </sup> also aligned the reads to both species, but made no attempt to disambiguate the data, and no implementation is readily available.</p>
            <p>Here, an alternative approach using read alignment quality is proposed to further disambiguate reads that can be mapped to both species. Alignment is first performed to both species independently and the reads are disambiguated as a post-processing step. There is no requirement to maintain pseudo reference indices based on combinations of reference sequences. This approach shows a very high sensitivity and specificity on artificially generated samples obtained by mixing reads from the individual species. The 
                <italic toggle="yes">Disambiguate</italic> tool is community supported and widely used in several open and closed source pipelines.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Implementation</title>
                <p>The 
                    <italic toggle="yes">Disambiguate</italic> algorithm operates on natural name sorted BAM files from alignments to two species. Name sorting is critical because it avoids reading all the data from both species&#x2019; alignments into memory simultaneously: a read aligned to both species is disambiguated on the fly by traversing both alignment files synchronously. For reads that have alignments to both species and therefore require disambiguation, the details of the disambiguation process differ slightly between aligners. Thus far the algorithm has been tested with BWA-MEM
                    <sup>
                        <xref ref-type="bibr" rid="ref-4">4</xref>
                    </sup> and Bowtie2
                    <sup>
                        <xref ref-type="bibr" rid="ref-5">5</xref>
                    </sup> for DNA-seq, and TopHat2
                    <sup>
                        <xref ref-type="bibr" rid="ref-6">6</xref>
                    </sup>, STAR
                    <sup>
                        <xref ref-type="bibr" rid="ref-7">7</xref>
                    </sup> and Hisat2
                    <sup>
                        <xref ref-type="bibr" rid="ref-8">8</xref>
                    </sup> for RNA-seq. Illumina&#x2019;s paired end sequencing is preferred as the mate can often break a tie. 
                    <xref ref-type="fig" rid="f1">Figure 1</xref> illustrates the disambiguation process.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>The disambiguation process illustrated.</title>
                        <p>Alignment is first performed against both species. The disambiguation application then operates on the raw, natural name sorted BAM files to assign the read pairs into one of the two species or as ambiguous for unresolved cases.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/10863/2b87a384-55a7-417c-84b0-695438231a37_figure1.gif"/>
                </fig>
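                <p>The synchronous traversal of two name sorted alignment files can be sketched as follows. This is a minimal illustration rather than the 
                    <italic toggle="yes">Disambiguate</italic> source code: records are represented by hypothetical (read name, payload) tuples instead of BAM records, and the hypothetical helper natural_key only approximates the natural ordering used when name sorting BAM files. Both inputs are assumed to be sorted consistently with that key.</p>
                <preformat preformat-type="code">
```python
import re
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a BAM record: (read name, payload).
Record = Tuple[str, str]


def natural_key(name: str) -> list:
    """Split a read name into text and integer parts so that r2 sorts before r10."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group alternates text (even index) and digits (odd index).
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def paired_groups(
    human: Iterable[Record], mouse: Iterable[Record]
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Walk two name sorted streams in step, yielding the records of one read name at a time.

    Only the records for the current read name are held in memory.
    """
    h_iter = groupby(human, key=lambda r: r[0])
    m_iter = groupby(mouse, key=lambda r: r[0])
    h = next(h_iter, None)
    m = next(m_iter, None)
    while h is not None or m is not None:
        if m is None or (h is not None and natural_key(m[0]) > natural_key(h[0])):
            # Read name present only in the first stream.
            yield h[0], list(h[1]), []
            h = next(h_iter, None)
        elif h is None or natural_key(h[0]) > natural_key(m[0]):
            # Read name present only in the second stream.
            yield m[0], [], list(m[1])
            m = next(m_iter, None)
        else:
            # Same read name in both streams: this group needs disambiguation.
            yield h[0], list(h[1]), list(m[1])
            h = next(h_iter, None)
            m = next(m_iter, None)


human_records = [("r1", "h1a"), ("r1", "h1b"), ("r2", "h2"), ("r10", "h10")]
mouse_records = [("r2", "m2"), ("r3", "m3"), ("r10", "m10")]
for name, h_recs, m_recs in paired_groups(human_records, mouse_records):
    print(name, len(h_recs), len(m_recs))
```
</preformat>
                <p>Because each stream is consumed exactly once and in order, memory use in this sketch depends on the number of alignments per read name rather than on the size of the alignment files.</p>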
                <p>
                    <italic toggle="yes">Disambiguate</italic> assigns reads on a per-pair basis, based on the highest quality alignment of the read pair. For BWA and STAR, the alignment score (AS; higher is better) is used as the primary disambiguation metric, followed by the edit distance to the reference (NM; lower is better). For TopHat2 and Hisat2 alignments, the sum (lower is better) of the edit distance (NM), the number of reported alignments (NH) and the number of gap opens (XO) is used.</p>
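                <p>The assignment rules described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: each alignment is a hypothetical dictionary of SAM tag values, the lists contain only mapped alignments of one read pair, and the function names, aligner labels and tie handling are illustrative choices.</p>
                <preformat preformat-type="code">
```python
from typing import Dict, List

# Hypothetical stand-in for an aligned record: SAM tag name -> integer value.
Alignment = Dict[str, int]


def compare_as_nm(human: List[Alignment], mouse: List[Alignment]) -> str:
    """BWA/STAR style rule: best AS wins (higher is better), then best NM (lower is better)."""
    best_as_h = max(a["AS"] for a in human)
    best_as_m = max(a["AS"] for a in mouse)
    if best_as_h != best_as_m:
        return "human" if best_as_h > best_as_m else "mouse"
    best_nm_h = min(a["NM"] for a in human)
    best_nm_m = min(a["NM"] for a in mouse)
    if best_nm_h != best_nm_m:
        return "mouse" if best_nm_h > best_nm_m else "human"
    return "ambiguous"


def compare_tag_sum(human: List[Alignment], mouse: List[Alignment]) -> str:
    """TopHat2/Hisat2 style rule: the lowest NM + NH + XO sum wins."""
    score_h = min(a["NM"] + a["NH"] + a["XO"] for a in human)
    score_m = min(a["NM"] + a["NH"] + a["XO"] for a in mouse)
    if score_h == score_m:
        return "ambiguous"
    return "mouse" if score_h > score_m else "human"


def assign_pair(human: List[Alignment], mouse: List[Alignment], aligner: str = "bwa") -> str:
    """Assign one read pair to 'human', 'mouse' or 'ambiguous'."""
    if not human and not mouse:
        return "ambiguous"  # aligned to neither species
    if not mouse:
        return "human"  # no competing alignment, nothing to disambiguate
    if not human:
        return "mouse"
    if aligner in ("bwa", "star"):
        return compare_as_nm(human, mouse)
    if aligner in ("tophat", "hisat2"):
        return compare_tag_sum(human, mouse)
    raise ValueError("unsupported aligner: " + aligner)


# Both mates of a pair contribute alignments; the better mate can break a tie.
human_pair = [{"AS": 100, "NM": 0}, {"AS": 95, "NM": 1}]
mouse_pair = [{"AS": 100, "NM": 2}, {"AS": 80, "NM": 4}]
print(assign_pair(human_pair, mouse_pair))
```
</preformat>
                <p>In this example the best alignment scores tie at 100, so the lower edit distance of the human alignment decides the assignment.</p>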
            </sec>
            <sec>
                <title>Operation</title>
                <p>The algorithm is implemented in Python (with dependency on the 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/pysam-developers/pysam">Pysam</ext-link> package) and C++ (with dependency on 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/pezmaster31/bamtools">BamTools</ext-link>), with the C++ version being approximately four times faster than the Python code. 64-bit Unix/Linux systems are supported.</p>
                <p>Given name sorted BAM files of alignments to the two species of interest (e.g. human and mouse), the algorithm infers the most likely origin of each read. The output comprises BAM files for each species, BAM files for ambiguous reads and a text file reporting how many read pairs were assigned to each BAM file. The simplest way to perform both alignment and disambiguation is to run 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/chapmanb/bcbio-nextgen">bcbio</ext-link>, in which 
                    <italic toggle="yes">Disambiguate</italic> is integrated, on the raw sequencing data.</p>
            </sec>
        </sec>
        <sec sec-type="results">
            <title>Results</title>
            <p>To illustrate the utility of 
                <italic toggle="yes">Disambiguate</italic>, raw publicly available human and mouse exome sequencing reads (100 bp paired-end Illumina data) were downloaded from the European Nucleotide Archive (
                <ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena">ENA</ext-link>) with Run Accessions 
                <ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena/data/view/SRR1176814">SRR1176814</ext-link> and 
                <ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/ena/data/view/SRR1528269">SRR1528269</ext-link>.</p>
            <p>The reads were concatenated, aligned against hg19 and mm10 using BWA MEM, and processed using 
                <italic toggle="yes">Disambiguate</italic>. Before disambiguation, the human sample (SRR1528269) contained 39686392 read pairs (out of a total of 77268164) for which at least one read aligned to mouse. Similarly, the mouse sample (SRR1176814) contained 25638785 read pairs (out of a total of 47312349) for which at least one read aligned to human. 
                <xref ref-type="table" rid="T1">Table 1</xref> summarises the post-disambiguation results. The disambiguation algorithm correctly separates virtually all of the read pairs: fewer than 0.1% are assigned to the wrong species and fewer than 0.3% remain ambiguous. In other internal studies, 
                <italic toggle="yes">Disambiguate</italic> has repeatedly highlighted samples with a low human-assigned component, correlating with poor extraction or lack of growth of the tumour cells in the host.</p>
            <p>STAR-aligned human (SRR387400) and mouse (SRR1930152) RNA-seq data were also analysed, with very similar results (
                <xref ref-type="table" rid="T2">Table 2</xref>).</p>
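            <p>The proportions implied by the counts in Table 1 and Table 2 can be recomputed with a short script. The counts below are copied from the tables and their footnotes; the dictionary layout and the function name summarise are illustrative only. The cross-aligned fraction is the share of read pairs with at least one read aligning to the other species before disambiguation, which is an upper bound on what an approach that excludes such read pairs would discard.</p>
            <preformat preformat-type="code">
```python
# Read pair counts copied from Table 1 (DNA-seq), Table 2 (RNA-seq) and their footnotes.
SAMPLES = {
    "SRR1176814": dict(assay="DNA", true_species="mm10", mm10=47197650, hg19=26157,
                       ambiguous=88542, total=47312349, cross_aligned=25638785),
    "SRR1528269": dict(assay="DNA", true_species="hg19", mm10=11502, hg19=77102895,
                       ambiguous=153767, total=77268164, cross_aligned=39686392),
    "SRR1930152": dict(assay="RNA", true_species="mm10", mm10=23126086, hg19=80694,
                       ambiguous=849364, total=24056144, cross_aligned=3005372),
    "SRR387400": dict(assay="RNA", true_species="hg19", mm10=94289, hg19=49677937,
                      ambiguous=9880844, total=59653070, cross_aligned=6001230),
}


def summarise(counts: dict) -> dict:
    """Return the fractions of read pairs in each category for one single-species sample."""
    true_species = counts["true_species"]
    other_species = "hg19" if true_species == "mm10" else "mm10"
    total = counts["total"]
    return {
        "correct": counts[true_species] / total,
        "wrong": counts[other_species] / total,
        "ambiguous": counts["ambiguous"] / total,
        "cross_aligned_before": counts["cross_aligned"] / total,
    }


for name, counts in SAMPLES.items():
    # The three post-disambiguation categories account for every read pair.
    assert counts["mm10"] + counts["hg19"] + counts["ambiguous"] == counts["total"]
    s = summarise(counts)
    print(
        f"{name} ({counts['assay']}): correct {s['correct']:.3%}, wrong {s['wrong']:.3%}, "
        f"ambiguous {s['ambiguous']:.3%}, cross-aligned before {s['cross_aligned_before']:.3%}"
    )
```
</preformat>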
        </sec>
        <sec sec-type="conclusions">
            <title>Conclusions</title>
            <p>In summary, 
                <italic toggle="yes">Disambiguate</italic> provides an important tool for computationally separating sequence reads originating from two species. In human-mouse studies it also allows the study of the mouse stromal component for gene expression and DNA variation.</p>
            <p>In addition to RNA-seq and whole genome sequencing, it is worth highlighting that for targeted hybridisation capture sequencing of xenograft samples, where baits from a single species are used, disambiguation is still highly recommended. This is best seen in 
                <xref ref-type="table" rid="T1">Table 1</xref> where a large number of human exome reads aligned to mouse and would potentially affect downstream interpretation without disambiguation.</p>
            <p>
                <italic toggle="yes">Disambiguate</italic> has been well adopted in the open source community; it is integrated in the open source 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/chapmanb/bcbio-nextgen">bcbio</ext-link> pipeline, and has been used successfully in both RNA and DNA sequencing of xenografts at AstraZeneca and at other research institutes. This is evidenced by the number of support tickets from a variety of organisations on the bcbio-nextgen GitHub page.</p>
        </sec>
        <sec>
            <title>Data availability</title>
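            <p>The exome sequencing data used here are available from the European Nucleotide Archive with Run Accession numbers SRR1176814 and SRR1528269. The RNA-seq data analysed correspond to Run Accession numbers SRR1930152 and SRR387400.</p>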
        </sec>
        <sec>
            <title>Software availability</title>
            <p>Software integrating 
                <italic toggle="yes">Disambiguate</italic> available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/chapmanb/bcbio-nextgen">https://github.com/chapmanb/bcbio-nextgen</ext-link>
</p>
            <p>Latest source code: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/AstraZeneca-NGS/disambiguate">https://github.com/AstraZeneca-NGS/disambiguate</ext-link>
</p>
            <p>Archived source code as at time of publication: DOI: 
                <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.166017">10.5281/zenodo.166017</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup>
            </p>
            <p>License: MIT.</p>
        </sec>
    </body>
    <back>
        <ack>
            <title>Acknowledgments</title>
            <p>The authors wish to thank Brad Chapman, Rory Kirchner and Eric Schelhorn for feedback and fixes on 
                <italic toggle="yes">Disambiguate</italic>.</p>
        </ack>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Bradford</surname>
                            <given-names>JR</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Farren</surname>
                            <given-names>M</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Powell</surname>
                            <given-names>SJ</given-names>
                        </name>
	
                        <etal/>
</person-group>:
                    <article-title>RNA-Seq Differentiates Tumour and Host mRNA Expression Changes Induced by Treatment of Human Tumour Xenografts with the VEGFR Tyrosine Kinase Inhibitor Cediranib.</article-title>
                    <source>
	
                        <italic toggle="yes">PLoS One.</italic>
</source>
                    <year>2013</year>;<volume>8</volume>(<issue>6</issue>):<fpage>e66003</fpage>.
                    <pub-id pub-id-type="pmid">23840389</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0066003</pub-id>
                    <pub-id pub-id-type="pmcid">3686868</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Conway</surname>
                            <given-names>T</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Wazny</surname>
                            <given-names>J</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Bromage</surname>
                            <given-names>A</given-names>
                        </name>
	
                        <etal/>
</person-group>:
                    <article-title>Xenome--a tool for classifying reads from xenograft samples.</article-title>
                    <source>
	
                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2012</year>;<volume>28</volume>(<issue>12</issue>):<fpage>i172</fpage>&#x2013;<lpage>i178</lpage>.
                    <pub-id pub-id-type="pmid">22689758</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/bts236</pub-id>
                    <pub-id pub-id-type="pmcid">3371868</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Rossello</surname>
                            <given-names>FJ</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Tothill</surname>
                            <given-names>RW</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Britt</surname>
                            <given-names>K</given-names>
                        </name>
	
                        <etal/>
</person-group>:
                    <article-title>Next-generation sequence analysis of cancer xenograft models.</article-title>
                    <source>
	
                        <italic toggle="yes">PLoS One.</italic>
</source>
                    <year>2013</year>;<volume>8</volume>(<issue>9</issue>):<fpage>e74432</fpage>.
                    <pub-id pub-id-type="pmid">24086345</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pone.0074432</pub-id>
                    <pub-id pub-id-type="pmcid">3784448</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>
</person-group>:
                    <article-title>Aligning sequence reads, clone sequences and assembly contigs with bwa-mem</article-title>.
                    <italic toggle="yes">arXiv</italic>, arXiv:1303.3997 [q-bio.GN]. <year>2013</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/pdf/1303.3997v2.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Langmead</surname>
                            <given-names>B</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Salzberg</surname>
                            <given-names>SL</given-names>
                        </name>
</person-group>:
                    <article-title>Fast gapped-read alignment with Bowtie 2.</article-title>
                    <source>
	
                        <italic toggle="yes">Nat Methods.</italic>
</source>
                    <year>2012</year>;<volume>9</volume>(<issue>4</issue>):<fpage>357</fpage>&#x2013;<lpage>359</lpage>.
                    <pub-id pub-id-type="pmid">22388286</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
                    <pub-id pub-id-type="pmcid">3322381</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>D</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Pertea</surname>
                            <given-names>G</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Trapnell</surname>
                            <given-names>C</given-names>
                        </name>
	
                        <etal/>
</person-group>:
                    <article-title>TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.</article-title>
                    <source>
	
                        <italic toggle="yes">Genome Biol.</italic>
</source>
                    <year>2013</year>;<volume>14</volume>(<issue>4</issue>):<fpage>R36</fpage>.
                    <pub-id pub-id-type="pmid">23618408</pub-id>
                    <pub-id pub-id-type="doi">10.1186/gb-2013-14-4-r36</pub-id>
                    <pub-id pub-id-type="pmcid">4053844</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Dobin</surname>
                            <given-names>A</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Davis</surname>
                            <given-names>CA</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Schlesinger</surname>
                            <given-names>F</given-names>
                        </name>
	
                        <etal/>
</person-group>:
                    <article-title>STAR: ultrafast universal RNA-seq aligner.</article-title>
                    <source>
	
                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2013</year>;<volume>29</volume>(<issue>1</issue>):<fpage>15</fpage>&#x2013;<lpage>21</lpage>.
                    <pub-id pub-id-type="pmid">23104886</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/bts635</pub-id>
                    <pub-id pub-id-type="pmcid">3530905</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>D</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Langmead</surname>
                            <given-names>B</given-names>
                        </name>
	
                        <name name-style="western">
                            <surname>Salzberg</surname>
                            <given-names>SL</given-names>
                        </name>
</person-group>:
                    <article-title>HISAT: a fast spliced aligner with low memory requirements.</article-title>
                    <source>
	
                        <italic toggle="yes">Nat Methods.</italic>
</source>
                    <year>2015</year>;<volume>12</volume>(<issue>4</issue>):<fpage>357</fpage>&#x2013;<lpage>360</lpage>.
                    <pub-id pub-id-type="pmid">25751142</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.3317</pub-id>
                    <pub-id pub-id-type="pmcid">4655817</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
	
                        <name name-style="western">
                            <surname>Ahdesm&#x00e4;ki</surname>
                            <given-names>MJ</given-names>
                        </name>
</person-group>:
                    <article-title>AstraZeneca-NGS/disambiguate: Release for publication [Data set].</article-title>
                    <source>
	
                        <italic toggle="yes">Zenodo.</italic>
</source>
                    <year>2016</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.166017">Data Source</ext-link>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report17881">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10863.r17881</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Oliver</surname>
                        <given-names>Gavin R.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17881a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-9948-3799</uri>
                </contrib>
                <contrib contrib-type="author">
                    <name>
                        <surname>Nair</surname>
                        <given-names>Asha A.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17881a1">1</xref>
                    <role>Co-referee</role>
                </contrib>
                <aff id="r17881a1">
                    <label>1</label>Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>5</day>
                <month>12</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Oliver GR and Nair AA</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17881" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.10082.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>We believe that overall the software tool article by Ahdesm&#x00e4;ki&#x00a0;
                <italic>et al.&#x00a0;</italic>seems sound and provides a solution to&#x00a0;a problem that appears to be inadequately addressed in the field currently.&#x00a0;</p>
            <p> </p>
            <p> Nonetheless, we believe the manuscript would benefit from some minor amendments in order to increase its utility and accessibility to readers.</p>
            <p> </p>
            <p> In brief:</p>
            <p> </p>
            <p> 
                <bold>
                    <underline>Intro/Background</underline>
                </bold>
            </p>
            <p> </p>
            <p> Needs to be expanded slightly to better set the scene and describe the general approach of read disambiguation.</p>
            <p> </p>
            <p> </p>
            <p> 
                <bold>
                    <underline>Methodology</underline>
                </bold>
            </p>
            <p> </p>
            <p> The methodology should be expanded slightly and made more explicit.</p>
            <p> </p>
            <p> </p>
            <p> 
                <bold>
                    <underline>Tables 1&amp;2:</underline>
                </bold> 
                <list list-type="bullet">
                    <list-item>
                        <p>Combine Tables 1 &amp; 2 into a single table and label the samples by data type, i.e. DNA and RNA</p>
                    </list-item>
                    <list-item>
                        <p>Show percentages as well as numbers</p>
                    </list-item>
                    <list-item>
                        <p>Clearly label the species in the tables</p>
                    </list-item>
                    <list-item>
                        <p>Clearly label correctly mapped/incorrectly mapped reads in table</p>
                    </list-item>
                    <list-item>
                        <p>Clearly label human and mouse genomes as such</p>
                    </list-item>
                    <list-item>
                        <p>Tables should clearly show all numbers pre- and post- disambiguation, rather than having superscripted references in the table legend</p>
                    </list-item>
                    <list-item>
                        <p>Essentially, a&#x00a0;novice should be able to read the paper and extract relevant info more easily.</p>
                    </list-item>
                </list> </p>
            <p> 
                <underline>
                    <bold>Figure 1 </bold>
                </underline> 
                <list list-type="bullet">
                    <list-item>
                        <p>Should be more granular, informative and descriptive of the process.&#x00a0;Include read alignment etc.&#x00a0; Describe the Disambiguate process</p>
                    </list-item>
                    <list-item>
                        <p>Use same font size for all text in the Figure</p>
                    </list-item>
                </list> </p>
            <p> 
                <bold>
                    <underline>Comparison with a competitor product</underline>
                </bold>
            </p>
            <p> </p>
            <p> This is something that is clearly missing. If it is literally impossible to compare to a competitor because the software is not accessible, this should be stated clearly as a reason for the lack of comparison in the paper.</p>
            <p> </p>
            <p> 
                <bold>
                    <underline>Tumor samples</underline>
                </bold>
            </p>
            <p> </p>
            <p> It would be interesting to know how performance is affected by use of highly mutated tumor xenografts. This is arguably beyond the scope of the paper, but warrants at least some mention.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment2423-17881">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Ahdesm&#x00e4;ki</surname>
                            <given-names>Miika</given-names>
                        </name>
                        <aff/>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>NA</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>11</day>
                    <month>1</month>
                    <year>2017</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Dear Gavin and Asha,&#x00a0;</p>
                <p> Many thanks for the very detailed review and comments. We have addressed your points in v2 of the manuscript.&#x00a0;&#x00a0;</p>
                <p> </p>
                <p> Intro/background:</p>
                <p> We have added the text in braces: "Direct high throughput sequencing of grafted samples with a mixture of two species is routine practice. {However, the origin species of each read or read pair is unknown and needs to be determined informatically.}" to better set the scene. Further, the description of how Xenome operates has been updated, and Xenome is now included in a comparison study. We have also stated more explicitly that "Alignment is first performed to both species independently and the reads are disambiguated as a post-processing step, {assigning reads to the species with higher quality alignments}".</p>
                <p> Methodology:&#x00a0;</p>
                <p> We have clarified the methodology section by spelling out the disambiguation algorithm and giving the reasoning why two schemes are used.&#x00a0;&#x00a0;</p>
                <p> Table 1&amp;2:&#x00a0;</p>
                <p> We have combined Tables 1&amp;2 and revised the contents to address these points.&#x00a0;</p>
                <p> Figure 1:&#x00a0;</p>
                <p> We have redrawn the figure to be more descriptive.&#x00a0;</p>
                <p> Comparison to competitor product:&#x00a0;</p>
                <p> We have now compared our approach to Xenome, which was recently open sourced, and included the results of the comparison in the updated table with discussion.&#x00a0;</p>
                <p> Tumor samples:&#x00a0;</p>
                <p> We agree that evaluating the performance of the disambiguation algorithm in a messy cancer genome like the highly rearranged MCF7 would be extremely interesting. If we obtain appropriate data, we will consider publishing the results on the program's GitHub page.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report17879">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10863.r17879</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Eldridge</surname>
                        <given-names>Matthew D.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17879a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r17879a1">
                    <label>1</label>Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>25</day>
                <month>11</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Eldridge MD</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17879" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.10082.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This paper describes a computational tool for separating sequencing reads from a sample that contains DNA or RNA from two species. This is a necessary pre-processing step for genomic or transcriptomic analysis of patient-derived xenograft cancer models.</p>
            <p> </p>
            <p> The approach is based on alignments of sequence reads to the reference genome sequences for the two species in question. The authors have tested their approach on DNA-seq data from publicly available human and mouse exome datasets concatenated to simulate a xenograft sample. The results presented in Table 1 show very good separation of reads from the two species datasets with only a small percentage of reads being assigned to the wrong species (0.06% and 0.01%) and a higher but still very low percentage of reads flagged as ambiguous, i.e. align equally well to both genomes. Similar results were presented for RNA-seq data, although here the percentages of incorrectly assigned and ambiguous reads are unsurprisingly higher than for DNA-seq.</p>
            <p> </p>
            <p> Use of the alignment scores, and in the event of a tie the edit distance, is a reasonable approach to disambiguate reads and is the method used for BWA and STAR alignments. For TopHat2 and HISAT2 a different scoring function is required, although the reasons for this are not given. Further, the choice of function (sum of edit distance, number of reported alignments and number of gap opens) is not completely obvious and raises the question of whether the authors have attempted to tune the function, e.g. by adjusting the weighting of each component.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment2424-17879">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Ahdesm&#x00e4;ki</surname>
                            <given-names>Miika</given-names>
                        </name>
                        <aff/>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>NA</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>11</day>
                    <month>1</month>
                    <year>2017</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Dear Matthew,</p>
                <p> Many thanks for reviewing our manuscript and for your comments. We have modified v2 of the manuscript to address the points you raise, namely:&#x00a0; 
                    <list list-type="order">
                        <list-item>
                            <p>The aligner tags are very similar between BWA and STAR, and between TopHat2 and HISAT2. However, they differ considerably between BWA/STAR and TopHat2/HISAT2, so we could not use the scheme originally developed for TopHat2 with BWA/STAR. With the appearance of HISAT2, especially for hg38, we decided to use the TopHat2 scheme for HISAT2, given that their outputs are almost interchangeable. We have mentioned this in the updated text.</p>
                        </list-item>
                        <list-item>
                            <p>The sum of edit distance, number of reported alignments and number of gap opens has always worked well for us out of the box (as illustrated in the tables). While tuning the weights of these components may yield some minor benefits, it would risk overfitting to existing data. Any benefit of weight tuning would have to be measured over a very long time, running multiple weighted versions and the unweighted algorithm side by side. We have given this reasoning (complexity) in the text as our reason for not tuning the weights further.</p>
                        </list-item>
                    </list> &#x00a0;</p>
                <p> Thank you again for the comments and helping us improve the manuscript.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report17877">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.10863.r17877</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Nicorici</surname>
                        <given-names>Daniel</given-names>
                    </name>
                    <xref ref-type="aff" rid="r17877a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r17877a1">
                    <label>1</label>Orion Corporation Orion Pharma, Espoo, Finland</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>23</day>
                <month>11</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Nicorici D</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport17877" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.10082.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This paper introduces a tool, named Disambiguate, for computationally separating the DNA/RNA sequencing reads of two species, for example in xenograft samples. The tool takes as input BAM files from a wide range of NGS aligners.</p>
            <p> </p>
            <p> I have made the following minor observations:&#x00a0; 
                <list list-type="order">
                    <list-item>
                        <p>Disambiguate works on both RNA-seq and DNA-seq data, but this is first mentioned in the Methods section. It would help to mention this much earlier, for example in the abstract.</p>
                    </list-item>
                    <list-item>
                        <p>To improve clarity, percentages could also be added to Tables 1 and 2 where relevant; for example, "26157" would become "26157 (0.0553%)", and so on.</p>
                    </list-item>
                </list>
            </p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment2425-17877">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Ahdesm&#x00e4;ki</surname>
                            <given-names>Miika</given-names>
                        </name>
                        <aff/>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>NA</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>11</day>
                    <month>1</month>
                    <year>2017</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Dear Daniel,&#x00a0;</p>
                <p> Thank you for the review; your comments are much appreciated. We have addressed your points in v2 of the manuscript.&#x00a0; 
                    <list list-type="order">
                        <list-item>
                            <p>We have explicitly mentioned in the abstract and the introduction that the tool can be used for both DNA-seq and RNA-seq data.</p>
                        </list-item>
                        <list-item>
                            <p>We have added percentages into the tables as you suggested&#x00a0;</p>
                        </list-item>
                    </list> </p>
                <p> Thank you for the review and helping us improve the manuscript.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
