<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.6200.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Method Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                    <subj-group>
                        <subject>Bioinformatics</subject>
                    </subj-group>
                    <subj-group>
                        <subject>Genomics</subject>
                    </subj-group>
                    <subj-group>
                        <subject>Protein Chemistry &amp; Proteomics</subject>
                    </subj-group>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Prediction of multi-drug resistance transporters using a novel sequence analysis method</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>McDermott</surname>
                        <given-names>Jason E.</given-names>
                    </name>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Bruillard</surname>
                        <given-names>Paul</given-names>
                    </name>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Overall</surname>
                        <given-names>Christopher C.</given-names>
                    </name>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Gosink</surname>
                        <given-names>Luke</given-names>
                    </name>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Lindemann</surname>
                        <given-names>Stephen R.</given-names>
                    </name>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Biological Sciences, Pacific Northwest National Laboratory, Richland, WA, 99352, USA</aff>
                <aff id="a2">
                    <label>2</label>National Security Divisions, Pacific Northwest National Laboratory, Richland, WA, 99352, USA</aff>
                <aff id="a3">
                    <label>3</label>Department of Molecular Microbiology and Immunology, Oregon Health &amp; Science University, Portland, OR, 97239, USA</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:Jason.McDermott@pnnl.gov">Jason.McDermott@pnnl.gov</email>
                </corresp>
                <fn fn-type="con">
                    <p>J.E.M. conceived of the study, applied the methods, interpreted results, and wrote the manuscript. P.B. developed the software, devised the grammars used, and wrote the manuscript. C.O. analyzed results and wrote the manuscript. L.G. integrated results to provide final rankings and advised on statistics. S.R.L. provided data for microbiome applications and guidance in interpretation of results.</p>
                </fn>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>29</day>
                <month>5</month>
                <year>2015</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2015</year>
            </pub-date>
            <volume>4</volume>
            <elocation-id>60</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>18</day>
                    <month>5</month>
                    <year>2015</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2015 McDermott JE et al.</copyright-statement>
                <copyright-year>2015</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/4-60/pdf"/>
            <abstract>
                <p>There are many examples of groups of proteins that have similar function, but the determinants of functional specificity may be hidden by lack of sequence similarity, or by large groups of similar sequences with different functions. Transporters are one such protein group: the general function, transport, can be easily inferred from the sequence, but the substrates, that is, the molecules or classes of molecules transported, can be impossible to predict from sequence with current methods. In this paper we describe a linguistic-based approach to identify functional patterns from groups of unaligned protein sequences and its application to predict multi-drug resistance transporters (MDRs) from bacteria. We first show that our method can recreate known patterns from PROSITE for several motifs from unaligned sequences. We then show that the method, MDRpred, can predict MDRs with greater accuracy and positive predictive value than a collection of currently available family-based models from the Pfam database. Finally, we apply MDRpred to a large collection of protein sequences from an environmental microbiome study to make novel predictions about drug resistance in a potential environmental reservoir.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>antibiotic resistance</kwd>
                <kwd>bacteria</kwd>
                <kwd>machine learning</kwd>
                <kwd>linguistics</kwd>
                <kwd>protein function</kwd>
                <kwd>multidrug resistance transporters</kwd>
                <kwd>microbiome</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>This study was supported by the Signatures Discovery Initiative, a component of the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory (PNNL), a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830. A portion of this research was supported by the Genomic Science Program (GSP), Office of Biological and Environmental Research (OBER), U.S. Department of Energy (DOE) and is a contribution of the PNNL Foundational Scientific Focus Area. </funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Updated</label>
                <title>Changes from Version 1</title>
                <p>We have updated the manuscript to include clearer descriptions of the methods for generating regular expressions and scoring physicochemical properties. We have also updated the text to emphasize the strength of our approach, which does not require sequence alignment to identify functionally important sequence regions, and to better explain how the method predicts substrate specificity, in the broad class of antibiotic compounds, for transporters. Supporting data has been greatly expanded to enhance reproducibility and we include a link to our GitHub project for MDRpred that includes a Python script allowing users to apply it to their own sequences. Overall we believe that the insightful and constructive comments from the reviewers greatly improved the manuscript.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Gram-negative bacteria are a major cause of many human diseases and, due to the emergence of antibiotic resistance, new means to combat them are a pressing international health issue. Recently the Centers for Disease Control and Prevention (CDC) highlighted this problem by stating that &#x201c;&#x2026; new antibiotics will always be needed to keep up with resistant bacteria&#x2026;&#x201d; (
                <xref ref-type="bibr" rid="ref-8">CDC, 2013</xref>). Antibiotic resistance is mediated by several distinct mechanisms including enzymatic conversion of antibiotics and transporters that eliminate antibiotics from inside cells (
                <xref ref-type="bibr" rid="ref-5">Blair 
                    <italic toggle="yes">et al.</italic>, 2015</xref>). Transporter superfamilies can be easily identified by standard sequence similarity searches, but inferring specific functional information (e.g. substrate specificity) is more problematic.</p>
            <p>Protein function has traditionally been determined by costly and time-consuming experimental approaches. Tools to determine sequence similarity such as BLAST have enabled efficient annotation of novel proteins by transfer of function. Such methods have been very effective at delineating families of functionally similar proteins that have similar sequences. More flexible approaches using simple grammars like regular expressions and hidden Markov models have improved this process significantly (
                <xref ref-type="bibr" rid="ref-4">Bateman 
                    <italic toggle="yes">et al.</italic>, 2000</xref>; 
                <xref ref-type="bibr" rid="ref-15">Gough &amp; Chothia, 2002</xref>). However, there remain many proteins that cannot be readily associated with known functions using these approaches, largely because they are unrelated by sequence. The field of linguistics is concerned with the structure of languages and studies morphology, syntax, and semantics. This task, which is grounded in mathematics, is directly analogous to the task of interpreting sequences of amino acids to predict function. To date, the application of linguistic-rooted approaches, such as generative grammars, to protein sequences and the use of rigorous and exhaustive approaches to optimize models has been limited.</p>
            <p>Generative grammars have a rich history in linguistic analysis with limited application to biological problems (
                <xref ref-type="bibr" rid="ref-11">Durbin 
                    <italic toggle="yes">et al.</italic>, 1998</xref>). They can be classified in terms of the Chomsky hierarchy, where grammars lower in the hierarchy (e.g., regular grammars) are simpler to understand, compute with, and parse, while grammars further up in the hierarchy are more complex but also have more descriptive power. Resources such as PROSITE (
                <xref ref-type="bibr" rid="ref-16">Hofmann 
                    <italic toggle="yes">et al.</italic>, 1999</xref>) identify simple motifs in proteins using regular expressions, which correspond to the simplest class of grammar (i.e. regular grammars). Hidden Markov models (HMMs), a probabilistic type of regular grammar, have also been applied to detect protein motifs and families. In addition to regular grammars, computational biologists have utilized stochastic context-free grammars for sequence modeling (
                <xref ref-type="bibr" rid="ref-2">Anderson 
                    <italic toggle="yes">et al.</italic>, 2012</xref>; 
                <xref ref-type="bibr" rid="ref-13">Dyrka 
                    <italic toggle="yes">et al.</italic>, 2013</xref>). Such grammars are better at modeling palindromic sequences that are found in RNA structure. All three of these are limited, however, because they still require an underlying sequence alignment.</p>
            <p>The regular expressions in the PROSITE database are built manually: a set of examples of a functional class is gathered, a multiple sequence alignment is performed on those examples, and a regular expression is generated from regions of the sequence that align and are generally functionally important, for example a phosphorylated residue or an active site. A similar procedure is used to create HMMs such as those found in the Pfam database, except that the model is determined automatically. Motif determination using these methods is therefore practically limited to families of related protein sequences that have been aligned and, in the case of PROSITE, has been carried out manually for individual protein motifs. Many proteins with the same function lack sufficient sequence similarity for alignments to be performed easily or accurately. The dependence on multiple sequence alignments and manual construction of protein patterns limits the ability to provide insight into problematic protein motifs.</p>
            <p>Previously we have described an effective approach to classification of problematic protein families such as bacterial type III secreted effectors that share little sequence similarity (
                <xref ref-type="bibr" rid="ref-25">McDermott 
                    <italic toggle="yes">et al.</italic>, 2011</xref>; 
                <xref ref-type="bibr" rid="ref-33">Samudrala 
                    <italic toggle="yes">et al.</italic>, 2009</xref>). This method used a support vector machine to integrate different sequence-based features and did not use multiple sequence alignment; rather, because the secretion signal is located in the most N-terminal region of the proteins, it took advantage of this natural alignment of disparate sequences. For problematic protein families in which the discriminating motifs are located in different regions of the protein, methods are needed that automatically identify motifs or features even where the sequence background is very noisy and traditional methods for aligning sequences based on evolutionary conservation are not effective.</p>
            <p>In this study we describe an application of the 
                <underline>P</underline>roactive 
                <underline>I</underline>ntelligent 
                <underline>L</underline>earning with 
                <underline>G</underline>rammar (PILGram) method to protein sequences to develop patterns that can discriminate functional classes of proteins in an alignment-free manner. PILGram uses a genetic algorithm to automate feature selection and build regular expressions that discriminate between classes. We first show that PILGram can partially re-create PROSITE patterns for Ser/Thr phosphatase binding and for zinc fingers in an automated and alignment-free manner. We then apply PILGram to distinguish transporters involved in drug resistance from other transporter proteins and show that the resulting PILGram model performs better than existing HMM models at classifying proteins in this important functional class. Finally, we combine different PILGram models using a simple voting method to develop an effective classifier called MDRpred. The patterns identified by PILGram map to regions that are likely to be important for substrate specificity, highlighting regions that could be targeted for drug development. We show that PILGram can be a general tool for development of simple patterns for functional classification of protein sequences. As a demonstration we apply MDRpred to a metagenome from an environmental microbial community and highlight several high-confidence predictions of novel MDR transporter proteins. Our results indicate that PILGram may be very effective at identifying functional sequence patterns from groups of protein sequences in the absence of any kind of sequence alignment.</p>
        </sec>
        <sec sec-type="methods">
            <title>Methods</title>
            <sec>
                <title>Protein pattern datasets for proof-of-concept</title>
                <p>To examine the ability of PILGram to identify patterns from unaligned protein sequences, we used the sets of sequences from which regular expressions for protein motifs in the PROSITE database were defined. In this way we could compare the output of PILGram with the established PROSITE patterns that had been generated from the aligned set of protein examples. Proteins matching each indicated PROSITE pattern (positive examples) were obtained from the PROSITE website (
                    <ext-link ext-link-type="uri" xlink:href="http://prosite.expasy.org">http://prosite.expasy.org</ext-link>) as the &#x201c;prosite.dat&#x201d; file. UniProt identifiers were extracted from the &#x201c;DR&#x201d; fields and the matching sequences, obtained from the UniProt database, were listed as true positives &#x201c;T&#x201d;. Of the sequences in the UniProt database that did not match the positive examples, approximately 6000 were chosen at random (specific numbers given for each example) to serve as negative examples (See PROSITE_positives_PS000125.fasta, PROSITE_negatives_PS000125.fasta, PROSITE_positives_PS00028.fasta, PROSITE_negatives_PS00028.fasta). The most current PROSITE records available at the time were used (See PROSITE_PS00125.txt and PROSITE_PS00028.txt).</p>
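                <p>As a rough illustration of the extraction step described above, the following Python sketch collects accessions flagged as true positives from the &#x201c;DR&#x201d; lines of a PROSITE record and draws a random sample of negative examples. The assumed DR line layout (semicolon-separated &#x201c;accession, entry name, status&#x201d; triples), the toy record with made-up accessions, and the function names are assumptions for illustration only and are not taken from the MDRpred code.</p>
                <code language="python">
```python
import random


def true_positive_accessions(record_lines):
    """Collect accessions from DR lines whose status flag is 'T'."""
    accessions = []
    for line in record_lines:
        if not line.startswith("DR "):
            continue
        # Assumed layout: "DR   ACC, ENTRY_NAME, STATUS; ACC, ENTRY_NAME, STATUS;"
        for triple in line[2:].split(";"):
            fields = [field.strip() for field in triple.split(",")]
            if len(fields) == 3 and fields[2] == "T":
                accessions.append(fields[0])
    return accessions


def sample_negatives(all_accessions, positives, n, seed=0):
    """Randomly pick up to n accessions that are not positive examples."""
    positive_set = set(positives)
    pool = sorted(acc for acc in all_accessions if acc not in positive_set)
    return random.Random(seed).sample(pool, min(n, len(pool)))


# Toy record with hypothetical accessions (not real PROSITE content).
toy_record = [
    "ID   EXAMPLE_MOTIF; PATTERN.",
    "DR   P00001, AAAA_HUMAN, T; P00002, BBBB_MOUSE, T; P00003, CCCC_YEAST, N;",
    "DR   P00004, DDDD_ECOLI, T; P00005, EEEE_BACSU, P;",
]
positives = true_positive_accessions(toy_record)
universe = ["P00001", "P00002", "P00003", "P00004", "P00005", "Q00006", "Q00007"]
negatives = sample_negatives(universe, positives, n=2)
print(positives)
print(negatives)
```
                </code>
                <p>Fixing the random seed in this sketch makes the negative set reproducible, which is one simple way to keep a sampled dataset consistent between runs.</p>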
            </sec>
            <sec>
                <title>Drug resistance transporter dataset</title>
                <p>To construct a training set for multidrug resistance transporters we obtained the protein sequences of 6097 transporter proteins from the Transporter Classification Database [TCDB; (
                    <xref ref-type="bibr" rid="ref-31">Saier 
                        <italic toggle="yes">et al.</italic>, 2014</xref>)] along with family classifications. This database was searched for &#x201c;drug resistance&#x201d; giving 71 drug resistance (DR) transporters (See MDR_TCDB_positives.fasta and MDR_TCDB_negatives.fasta datasets). We then searched the protein sequence descriptions from the UniProt database and found an additional 89 sequences annotated with &#x201c;[drug] resistance&#x201d; that were not included in the TCDB annotations. We used the TCDB-annotated DR transporters as our positive examples because most are accompanied by references. The &#x2018;candidate&#x2019; list of positive examples annotated by UniProt was held out of the training set so as not to interfere with classification. The remaining 5934 sequences were used as negative examples since they are annotated as transporters but not as DR transporters in either database.</p>
            </sec>
            <sec>
                <title>Hot Lake peptide sequences</title>
                <p>Metagenomic DNA was extracted from two unicyanobacterial consortia cultivated from a microbial mat inhabiting Hot Lake, WA (
                    <xref ref-type="bibr" rid="ref-22">Lindemann 
                        <italic toggle="yes">et al.</italic>, 2013</xref>) as previously described (
                    <xref ref-type="bibr" rid="ref-9">Cole 
                        <italic toggle="yes">et al.</italic>, 2014</xref>). Metagenome reconstructions were generated as reported by Nelson 
                    <italic toggle="yes">et al.</italic>, (manuscript submitted). Briefly, paired-end reads were generated by the US Department of Energy (DOE) Joint Genome Institute (JGI; 
                    <ext-link ext-link-type="uri" xlink:href="http://jgi.doe.gov">http://jgi.doe.gov</ext-link>) under CSP 701, quality trimmed using Trimmomatic (
                    <xref ref-type="bibr" rid="ref-6">Bolger 
                        <italic toggle="yes">et al.</italic>, 2014</xref>), and assembled using IDBA-UD (
                    <xref ref-type="bibr" rid="ref-29">Peng 
                        <italic toggle="yes">et al.</italic>, 2012</xref>) with a minimum contig size of 250 bp. Contigs longer than 2 Kb were binned by the read coverage of each scaffold, calculated using Bowtie2 (
                    <xref ref-type="bibr" rid="ref-18">Langmead &amp; Salzberg, 2012</xref>) and samtools (
                    <xref ref-type="bibr" rid="ref-20">Li 
                        <italic toggle="yes">et al.</italic>, 2009</xref>). Gene models for the metagenome reconstructions were generated using Prodigal (
                    <xref ref-type="bibr" rid="ref-17">Hyatt 
                        <italic toggle="yes">et al.</italic>, 2010</xref>) and hand-curated in some instances. Additionally, axenic organisms isolated from the consortia were sequenced from 10 Kb libraries with PacBio and assembled by the JGI, also under CSP 701. The genomes of axenic organisms were shown to be identical to the corresponding genome reconstructions in the metagenome (Nelson 
                    <italic toggle="yes">et al.</italic>, submitted) and, being more complete, replaced these reconstructions in the metagenome database. For the axenic isolates, gene models were generated by IMG/ER (
                    <xref ref-type="bibr" rid="ref-23">Markowitz 
                        <italic toggle="yes">et al.</italic>, 2009</xref>). The sequences are available through NCBI GenBank under accessions NZ_JQMU00000000.1 GI:675281874 (
                    <italic toggle="yes">Porphyrobacter</italic> sp. HL-46), NZ_JMMC00000000.1 GI:653087839 (
                    <italic toggle="yes">Halomonas</italic> sp. HL-48), NZ_JAFX00000000.1 GI:635638184 (
                    <italic toggle="yes">Algoriphagus marincola</italic> str. HL-49), NZ_JYNR00000000.1 GI:761631804 (
                    <italic toggle="yes">Marinobacter excellens</italic> HL-55), and NZ_JMLY00000000.1 GI:654325145 (
                    <italic toggle="yes">Marinobacter</italic> sp. HL-58). Metagenome sequences not mapped to sequences from axenic cultures have been submitted to GenBank and are awaiting accessions.</p>
            </sec>
            <sec>
                <title>Feature generation</title>
                <p>Physicochemical properties (PPs) were calculated using the Python propy module (
                    <xref ref-type="bibr" rid="ref-7">Cao 
                        <italic toggle="yes">et al.</italic>, 2013</xref>). Properties were calculated using the 147 Composition, Transition, Distribution (CTD) descriptors in propy (
                    <xref ref-type="bibr" rid="ref-11">Dubchak, 1995</xref>). Classes of properties include hydrophobicity, normalized van der Waals volume (VDWV), polarity, charge, secondary structure, solvent accessibility, and polarizability. In each class, amino acids are divided into three groups based on their physicochemical properties; for example, hydrophobicity includes hydrophobic residues (C, L, V, I, M, F, W), polar residues (R, K, E, D, Q, N), and neutral residues (G, A, S, T, P, H, Y). Groups for the other classes can be found in (
                    <xref ref-type="bibr" rid="ref-11">Dubchak, 1995</xref>). Composition calculates a length-normalized score based on the number of residues in the sequence that belong to the group (for example, polar residues). Distribution calculates the portion of the sequence that includes a certain percentage (1, 25, 50, 75, or 100) of the matches for that group. Transition calculates the number of times an amino acid from one group (e.g. polar) is found next to one from another group (e.g. hydrophobic) in the sequence, normalized by length.</p>
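                <p>The composition and transition descriptors can be sketched in a few lines of Python for the hydrophobicity class. This is a minimal illustration and not the propy implementation: the function names are hypothetical, the transition score is assumed to be normalized by the number of adjacent residue pairs, and the distribution descriptor is omitted.</p>
                <code language="python">
```python
# Hydrophobicity groups as listed in the text.
GROUPS = {
    "hydrophobic": set("CLVIMFW"),
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
}


def composition(seq, group):
    """Fraction of residues in seq that belong to the named group."""
    members = GROUPS[group]
    return sum(1 for aa in seq if aa in members) / len(seq)


def transition(seq, group_a, group_b):
    """Fraction of adjacent residue pairs that switch between two groups.

    Assumption: normalized by the number of adjacent pairs, len(seq) - 1.
    """
    a, b = GROUPS[group_a], GROUPS[group_b]
    switches = sum(
        1
        for left, right in zip(seq, seq[1:])
        if (left in a and right in b) or (left in b and right in a)
    )
    return switches / (len(seq) - 1)


# Hydrophobic composition of a short peptide: 6 of 12 residues.
print(composition("FGADAYLLIWTL", "hydrophobic"))
# Toy peptide with three hydrophobic/polar switches among five adjacent pairs.
print(transition("LKVDAG", "hydrophobic", "polar"))
```
                </code>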
                <p>The PP-protein regular expressions (PRE) were represented as a combination of a regular expression from the standard PRE grammar with one of the PPs. PILGram treats the PP as an independent element that can be added to a regular expression. The fitness score for a particular combination is evaluated by calculating the PP score (see above) for the region or regions of the sequence matched by the regular expression. If the regular expression matches more than one region, the PP scores of the matched segments are averaged.</p>
                <p>As an example, if the PRE is &#x201c;FG*.TL&#x201d;, then a sequence such as:</p>
                <list list-type="bullet">
                    <list-item>
                        <label/>
                        <p>MKGGLA
                            <underline>
                                <bold>FGADAYLLIWTL</bold>
                            </underline>QQST&#x2026;</p>
                    </list-item>
                </list>
                <p>would be matched in the underlined region. An additional PP of &#x201c;hydrophobicityC1&#x201d; (that is, composition class for hydrophobic amino acids), would be scored by counting the number of hydrophobic residues (C, L, V, I, M, F, W) in the region (6) and dividing by the length of the matched region (12) to give 0.50. A second sequence:</p>
                <list list-type="bullet">
                    <list-item>
                        <label/>
                        <p>MIYTSSG
                            <underline>
                                <bold>FGLLILLYCMTL</bold>
                            </underline>RHCN&#x2026;</p>
                    </list-item>
                </list>
                <p>would be matched in the underlined region, but the PP hydrophobicityC1 score would be higher (9/12 = 0.75). The PILGram optimization uses a genetic algorithm (see below) to explore many possible combinations and find the PP and PRE combination that gives the best accuracy.</p>
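                <p>The scoring of a PRE combined with a PP can be sketched as follows. This is an illustrative sketch rather than the PILGram code: the PRE &#x201c;FG*.TL&#x201d; is assumed to mean F, G, any run of residues, then T, L, and is translated here into the Python regular expression &#x201c;FG.*?TL&#x201d;; the function names and the second test sequence are hypothetical.</p>
                <code language="python">
```python
import re

HYDROPHOBIC = set("CLVIMFW")  # hydrophobic group from the text


def hydrophobicity_c1(segment):
    """Composition score: hydrophobic residues divided by segment length."""
    return sum(1 for aa in segment if aa in HYDROPHOBIC) / len(segment)


def pp_pre_score(pattern, seq):
    """Average the PP score over every region matched by the pattern.

    Returns None when the pattern does not match the sequence.
    """
    segments = [m.group(0) for m in re.finditer(pattern, seq)]
    if not segments:
        return None
    return sum(hydrophobicity_c1(s) for s in segments) / len(segments)


# Assumed translation of the PRE "FG*.TL" into Python regex syntax;
# the non-greedy ".*?" keeps each match as short as possible.
PATTERN = "FG.*?TL"

# One matched region, FGADAYLLIWTL.
print(pp_pre_score(PATTERN, "MKGGLAFGADAYLLIWTLQQST"))
# Hypothetical sequence with two matched regions (FGLTL and FGAATL),
# whose scores are averaged.
print(pp_pre_score(PATTERN, "FGLTLKKFGAATL"))
```
                </code>
                <p>Whether matches should be greedy or non-greedy is a design choice that the text does not specify; the non-greedy form is used here only so that a sequence can yield several short matched regions.</p>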
                <p>The transmembrane region (TMR) grammar is composed of the PRE with the addition of predefined patterns that represent potential transmembrane regions. These were established by including all transmembrane regions defined in the entire TCDB (
                    <xref ref-type="bibr" rid="ref-31">Saier 
                        <italic toggle="yes">et al.</italic>, 2014</xref>) with two flanking amino acids from the N- and C-terminal portions of the region, but leaving the sequence of the transmembrane region itself as variable. This means that a given transmembrane region from TCDB (underlined here):</p>
                <list list-type="bullet">
                    <list-item>
                        <label/>
                        <p>AAQT
                            <underline>
                                <bold>LSVYFLAFALGVVIWGVLA</bold>
                            </underline>DKWGR</p>
                    </list-item>
                </list>
                <p>would result in a &#x2018;seed&#x2019; TMR-PRE of &#x201c;QT*.DK&#x201d;. These seed PREs can then be chosen by PILGram for incorporation into parse trees (see below) to generate new PREs. The resulting models therefore have the same form as those generated by the PRE grammar alone, but may be biased toward transmembrane regions.</p>
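                <p>The construction of a seed TMR-PRE from an annotated transmembrane region can be illustrated with a short Python sketch. The function name and the coordinate convention (0-based, end-exclusive) are assumptions made for this example; the wildcard is written &#x201c;*.&#x201d; to follow the notation used in the text.</p>
                <code language="python">
```python
def seed_tmr_pre(sequence, tm_start, tm_end, flank=2):
    """Build a seed pattern from the residues flanking a transmembrane region.

    tm_start and tm_end are 0-based, end-exclusive coordinates of the region.
    The region itself is replaced by the wildcard "*." used in the text.
    """
    n_flank = sequence[max(0, tm_start - flank):tm_start]
    c_flank = sequence[tm_end:tm_end + flank]
    return n_flank + "*." + c_flank


# The example from the text: N-terminal context, transmembrane region,
# and C-terminal context.
tm_region = "LSVYFLAFALGVVIWGVLA"
sequence = "AAQT" + tm_region + "DKWGR"
tm_start = 4
tm_end = tm_start + len(tm_region)
seed = seed_tmr_pre(sequence, tm_start, tm_end)
print(seed)  # QT*.DK
```
                </code>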
            </sec>
            <sec>
                <title>Performance evaluation</title>
                <p>PILGram models were constructed using half the training data and performance was evaluated with the other half. PILGram optimization was based on accuracy:</p>
                <p>
                    <disp-formula>
                        <mml:math display="block" id="math1">
                            <mml:mrow>
                                <mml:mi>A</mml:mi>
                                <mml:mtext>&#x2009;</mml:mtext>
                                <mml:mo>=</mml:mo>
                                <mml:mtext>&#x2009;</mml:mtext>
                                <mml:mfrac>
                                    <mml:mrow>
                                        <mml:mi>T</mml:mi>
                                        <mml:mi>P</mml:mi>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mo>+</mml:mo>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mi>T</mml:mi>
                                        <mml:mi>N</mml:mi>
                                    </mml:mrow>
                                    <mml:mrow>
                                        <mml:mi>T</mml:mi>
                                        <mml:mi>P</mml:mi>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mo>+</mml:mo>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mi>F</mml:mi>
                                        <mml:mi>P</mml:mi>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mo>+</mml:mo>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mi>T</mml:mi>
                                        <mml:mi>N</mml:mi>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mo>+</mml:mo>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mi>F</mml:mi>
                                        <mml:mi>N</mml:mi>
                                    </mml:mrow>
                                </mml:mfrac>
                            </mml:mrow>
                        </mml:math>
                    </disp-formula>
                </p>
                <p>where TP, TN, FP, and FN are true positive, true negative, false positive, and false negative predictions, respectively. For final evaluation of models we also calculated positive predictive value:</p>
                <p>
                    <disp-formula>
                        <mml:math display="block" id="math2">
                            <mml:mrow>
                                <mml:mi>P</mml:mi>
                                <mml:mi>P</mml:mi>
                                <mml:mi>V</mml:mi>
                                <mml:mtext>&#x2009;</mml:mtext>
                                <mml:mo>=</mml:mo>
                                <mml:mtext>&#x2009;</mml:mtext>
                                <mml:mfrac>
                                    <mml:mrow>
                                        <mml:mi>T</mml:mi>
                                        <mml:mi>P</mml:mi>
                                    </mml:mrow>
                                    <mml:mrow>
                                        <mml:mi>T</mml:mi>
                                        <mml:mi>P</mml:mi>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mo>+</mml:mo>
                                        <mml:mtext>&#x2009;</mml:mtext>
                                        <mml:mi>F</mml:mi>
                                        <mml:mi>P</mml:mi>
                                    </mml:mrow>
                                </mml:mfrac>
                            </mml:mrow>
                        </mml:math>
                    </disp-formula>
                </p>
                <p>and area under the receiver operating characteristic (ROC) curve (AUC) (
                    <xref ref-type="bibr" rid="ref-32">Salzberg, 1997</xref>).</p>
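                <p>As a minimal sketch of the two count-based measures, the following Python functions read the first formula as the usual accuracy, (TP + TN) / (TP + FP + TN + FN), and the second as PPV. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not results from this study.</p>
                <code language="python">
```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


def positive_predictive_value(tp, fp):
    """Fraction of positive predictions that are truly positive."""
    if tp + fp == 0:
        raise ValueError("no positive predictions to evaluate")
    return tp / (tp + fp)


# Hypothetical counts, invented for illustration only.
counts = {"tp": 8, "tn": 85, "fp": 2, "fn": 5}
acc = accuracy(**counts)
ppv = positive_predictive_value(counts["tp"], counts["fp"])
print(f"accuracy = {acc:.2f}, PPV = {ppv:.2f}")  # accuracy = 0.93, PPV = 0.80
```
                </code>
                <p>When negative examples greatly outnumber positive ones, accuracy can stay high even if many positive predictions are wrong, which is one reason to report PPV alongside it.</p>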
            </sec>
            <sec>
                <title>Pattern clustering</title>
                <p>Clustering of patterns for MDRpred was accomplished by assembling, for each of the 36 final MDR patterns, a vector of binary values (match or no match) across the 6005 examples (71 positive plus 5934 negative examples) from the training set. Euclidean distance was calculated between all pairs of vectors, and the hclust function from R (version 3.0.1) was used for hierarchical clustering with complete-linkage agglomeration.</p>
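                <p>The procedure can be illustrated with a small pure-Python sketch. It is not the R code used in the study: the naive complete-linkage routine below only stands in for the R clustering function, and the sequences, patterns, and function names are invented for illustration.</p>
                <code language="python">
```python
import math
import re


def match_vector(pattern, sequences):
    """Binary vector: 1 if the pattern matches a sequence, else 0."""
    regex = re.compile(pattern)
    return [1 if regex.search(seq) else 0 for seq in sequences]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(vectors):
    """Naive agglomerative clustering with complete linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of indices into `vectors`.
    """
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or best[0] > d:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy sequences and patterns, invented for illustration only.
sequences = ["MKRGNHE", "MCAACHH", "MKKGNHA", "MLLLLLL"]
patterns = ["[KQR]G+NH", "GNH", "C.{2}C"]
vectors = [match_vector(p, sequences) for p in patterns]
merges = complete_linkage(vectors)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, cluster_b, round(distance, 3))
```
                </code>
                <p>In this toy case the first two patterns match exactly the same sequences, so they merge at distance zero before the third pattern joins them.</p>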
            </sec>
            <sec>
                <title>PILGram</title>
                <p>Machine learning methods, like SIEVE (
                    <xref ref-type="bibr" rid="ref-26">McDermott 
                        <italic toggle="yes">et al.</italic>, 2011</xref>; 
                    <xref ref-type="bibr" rid="ref-33">Samudrala 
                        <italic toggle="yes">et al.</italic>, 2009</xref>), take features as input to build a model. Features are the smallest elements derived from the examples (protein sequences) that can be categories (e.g. amino acid type) or values (e.g. solvent accessibility values). While the selection of salient features is critical for classification, most algorithms require their manual specification. PILGram (Proactive Intelligent Learning with Grammar) is an approach to automate the feature selection process and allows for the selection of irredundant features. PILGram does this by combining a genetic algorithm and a generative grammar, which is a formalized set of rules for combining features into different patterns in the form of parse trees. PILGram generates a large number of such trees and then applies a genetic algorithm, which iteratively recombines these trees to determine an optimal model for classification of the positive and negative examples. In this way PILGram specifies an absorbing Markov chain on the space of features, and given sufficient time, will always converge to a collection of optimal non-redundant features. The mathematical foundations of and explicit algorithm for PILGram are currently pending review, but the algorithm is perhaps best understood by example.</p>
                <p>Consider the following toy example: height, weight, and age data are gathered from a population and each person is labeled as obese or not. One might like to automate the determination of obesity using only height, weight, and age. It is known that the body mass index (BMI) is a good indicator of obesity and is given by (weight/(height &#x00d7; height)). In order to determine this quantity, PILGram might make use of the following grammar.</p>
                <p>&#x2329;expr&#x232a;::=(&#x2329;expr&#x232a;&#x2329;op&#x232a;&#x2329;expr&#x232a;)|&#x2329;attr&#x232a;</p>
                <p>&#x2329;op&#x232a;::=+|-| &#x00d7; |/</p>
                <p>&#x2329;attr&#x232a;::=height | weight | age</p>
                <p>In this grammar the &#x2018;|&#x2019; symbol is to be read as &#x2018;or&#x2019; and &#x2018;::=&#x2019; can be read as &#x2018;is replaced by&#x2019;. So the second line tells us that &#x2018;&#x2329;op&#x232a;&#x2019; can be replaced by &#x2018;+&#x2019;, &#x2018;-&#x2019;, &#x2018;&#x00d7;&#x2019;, or &#x2018;/&#x2019;. The symbols to the left of ::= are called non-terminal symbols. This grammar can be used to generate features as follows.</p>
                <list list-type="bullet">
                    <list-item>
                        <label>1.</label>
                        <p>Write down &#x2329;expr&#x232a;.</p>
                    </list-item>
                    <list-item>
                        <label>2.</label>
                        <p>Locate any non-terminal symbol in your expression.</p>
                    </list-item>
                    <list-item>
                        <label>3.</label>
                        <p>Replace the chosen non-terminal according to the grammar.</p>
                    </list-item>
                    <list-item>
                        <label>4.</label>
                        <p>If there is a non-terminal symbol in your expression, then return to step 2.</p>
                    </list-item>
                </list>
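                <p>The four steps above can be sketched in Python as follows. This is an illustration of random derivation from the toy grammar, not PILGram itself: the dictionary encoding, the function name, and the length cap (which forces remaining expressions to become attributes so that the sketch always halts) are assumptions, and the asterisk stands in for the multiplication sign.</p>
                <code language="python">
```python
import random

# The toy grammar from the text; "*" stands in for the multiplication sign.
GRAMMAR = {
    "expr": [["(", "expr", "op", "expr", ")"], ["attr"]],
    "op": [["+"], ["-"], ["*"], ["/"]],
    "attr": [["height"], ["weight"], ["age"]],
}


def derive(rng, max_len=25):
    """Expand non-terminals at random until only terminals remain."""
    symbols = ["expr"]                                    # step 1
    while True:
        positions = [i for i, s in enumerate(symbols) if s in GRAMMAR]
        if not positions:                                 # step 4: finished
            return " ".join(symbols)
        i = rng.choice(positions)                         # step 2
        options = GRAMMAR[symbols[i]]
        if symbols[i] == "expr" and len(symbols) > max_len:
            # Assumption: cap growth so that the derivation always halts.
            options = [["attr"]]
        symbols[i:i + 1] = rng.choice(options)            # step 3


rng = random.Random(0)
for _ in range(3):
    print(derive(rng))
```
                </code>
                <p>Every string printed by this sketch is a fully parenthesized expression over height, weight, and age, which is the search space explored in the BMI example.</p>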
                <p>This process can be viewed as a parse tree. That is, at step 1 one writes &#x2329;expr&#x232a;. Then every time a non-terminal symbol is replaced one writes the replacement below the non-terminal symbol and connects each symbol in the replacement with the initial symbol with a line. A vertical line is placed below each non-terminal symbol that is not replaced. The resulting expression is then read from left to right along the &#x2018;leaves&#x2019; of the resulting tree. For instance, BMI might be produced from the procedure as follows:</p>
                <p>&#x2329;expr&#x232a;&#x2192; (&#x2329;expr&#x232a;&#x2329;op&#x232a;&#x2329;expr&#x232a;) &#x2192; (&#x2329;attr&#x232a;&#x2329;op&#x232a;&#x2329;expr&#x232a;) &#x2192; (weight &#x2329;op&#x232a;&#x2329;expr&#x232a;) &#x2192; (weight/&#x2329;expr&#x232a;) &#x2192; (weight/(&#x2329;expr&#x232a;&#x2329;op&#x232a;&#x2329;expr&#x232a;)) &#x2192; (weight/(&#x2329;expr&#x232a;&#x00d7;&#x2329;expr&#x232a;)) &#x2192; (weight/(&#x2329;attr&#x232a;&#x00d7;&#x2329;expr&#x232a;)) &#x2192; (weight/(&#x2329;attr&#x232a;&#x00d7;&#x2329;attr&#x232a;)) &#x2192; (weight/(height &#x00d7;&#x2329;attr&#x232a;)) &#x2192; (weight/(height &#x00d7; height)).</p>
                <p>This is more succinctly expressed by the parse tree:</p>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_figure_PT1.gif"/>
                <p>While one might get lucky and generate this expression by random application of the above grammar, it is highly unlikely. However, one might generate (weight-height) and (age+height &#x00d7; height). While neither of these expressions is BMI, BMI can be produced by 
                    <italic toggle="yes">mutating</italic> and 
                    <italic toggle="yes">crossing</italic> these features.</p>
                <p>Mutation is a process by which a node in the parse tree is randomly selected and then replaced with another value such that the tree remains consistent with the generative rules of the grammar. In some cases, one might opt to re-build the tree below the replaced node thereby giving the algorithm greater flexibility. For instance, the first expression can be represented as a parse tree and mutated as follows:</p>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_figure_PT2.gif"/>
                <p>The resulting feature, (weight/height), is more similar to BMI than the initial feature, and in fact performs better at classifying obesity. To arrive at BMI we could apply the 
                    <italic toggle="yes">crossing</italic> procedure to (weight/height) and (age+height &#x00d7; height).</p>
                <p>Crossing is a process by which two features are expressed as parse trees and two of their subtrees are exchanged so that the resulting parse trees are consistent with the grammar. For instance, BMI can be found by crossing (weight/height) and (age+height &#x00d7; height) as follows:</p>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_figure_PT3.gif"/>
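                <p>The mutation and crossing steps of the BMI example can be traced in a short Python sketch. Parse trees are represented here as nested (left, operator, right) tuples, which is an assumed encoding rather than PILGram's internal one; the second feature is read as age + (height * height), and the nodes to change are chosen by hand rather than at random. All function names are hypothetical.</p>
                <code language="python">
```python
def render(tree):
    """Turn a nested (left, op, right) tuple back into an expression."""
    if isinstance(tree, str):
        return tree
    left, op, right = tree
    return f"({render(left)} {op} {render(right)})"


def subtree_at(tree, path):
    """Follow child indices (0 = left, 1 = operator, 2 = right)."""
    for index in path:
        tree = tree[index]
    return tree


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return new_subtree
    children = list(tree)
    children[path[0]] = replace_at(children[path[0]], path[1:], new_subtree)
    return tuple(children)


def cross(tree_a, path_a, tree_b, path_b):
    """Exchange the subtrees found at path_a and path_b.

    The caller must pick two nodes of the same kind (two expressions or
    two operators) so that both results stay consistent with the grammar.
    """
    sub_a = subtree_at(tree_a, path_a)
    sub_b = subtree_at(tree_b, path_b)
    return (replace_at(tree_a, path_a, sub_b),
            replace_at(tree_b, path_b, sub_a))


first = ("weight", "-", "height")
second = ("age", "+", ("height", "*", "height"))

# Mutation: replace the operator node of the first feature.
mutated = replace_at(first, (1,), "/")

# Crossing: exchange the right-hand expressions of the two features.
child_a, child_b = cross(mutated, (2,), second, (2,))

print(render(mutated))   # (weight / height)
print(render(child_a))   # (weight / (height * height))
print(render(child_b))   # (age + height)
```
                </code>
                <p>Because the tuples are immutable, each operation returns a new tree and leaves the parent features unchanged, so both parents and offspring remain available.</p>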
                <p>Not all crossings and mutations will produce better features, and not all features should be considered for crossing or mutation. To handle this, PILGram behaves stochastically and preferentially selects features for mutation and crossing according to how well they perform. The guiding principle is that features which perform better should be closer to the optimal feature than those that do not. The entire PILGram algorithm can be outlined as follows:</p>
                <list list-type="bullet">
                    <list-item>
                        <label>1.</label>
                        <p>Select a grammar for feature generation and a fitness function to evaluate the features against.</p>
                    </list-item>
                    <list-item>
                        <label>2.</label>
                        <p>Randomly generate a population of features and determine the fitness of each feature.</p>
                    </list-item>
                    <list-item>
                        <label>3.</label>
                        <p>Randomly subsample the population where a feature is selected with probability proportional to its fitness.</p>
                    </list-item>
                    <list-item>
                        <label>4.</label>
                        <p>For each feature selected in step 3, copy the feature and randomly change the value of a random node in its parse tree in a manner consistent with the grammar. Return the initial feature and the result of the mutation to the population.</p>
                    </list-item>
                    <list-item>
                        <label>5.</label>
                        <p>Randomly subsample the population for pairs of features with each feature selected with probability proportional to its fitness.</p>
                    </list-item>
                    <list-item>
                        <label>6.</label>
                        <p>For each pair selected in step 5 produce a copy of their parse trees. Randomly select a subtree in each feature&#x2019;s parse tree and exchange these subtrees ensuring that the exchange produces features which are consistent with the grammar. Return the two initial features and the two new features to the population.</p>
                    </list-item>
                    <list-item>
                        <label>7.</label>
                        <p>Compute the fitness of all features in the population and remove the least fit features until the population returns to its initial size.</p>
                    </list-item>
                    <list-item>
                        <label>8.</label>
                        <p>If the fittest feature has converged, then terminate the algorithm, otherwise return to step 3.</p>
                    </list-item>
                </list>
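                <p>The outline above can be sketched on the toy BMI problem as follows. This is not the PILGram implementation: the tuple encoding of parse trees, the fitness function (best accuracy from thresholding a feature), the synthetic data, the parameter values, and all names are assumptions made for illustration. Mutation here rebuilds the subtree below a randomly chosen expression node, and convergence is declared when the best fitness has not changed for a fixed number of generations.</p>
                <code language="python">
```python
import random
from functools import lru_cache

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}
ATTRS = ["height", "weight", "age"]


def random_tree(rng, depth=2):
    """Random feature: an attribute or a (left, op, right) tuple."""
    if depth == 0 or rng.random() > 0.5:
        return rng.choice(ATTRS)
    return (random_tree(rng, depth - 1), rng.choice(list(OPS)),
            random_tree(rng, depth - 1))


def depth_of(tree):
    if isinstance(tree, str):
        return 0
    return 1 + max(depth_of(tree[0]), depth_of(tree[2]))


def evaluate(tree, person):
    if isinstance(tree, str):
        return person[tree]
    left, op, right = tree
    return OPS[op](evaluate(left, person), evaluate(right, person))


def fitness(tree, people, labels):
    """Best accuracy from thresholding the feature in either direction."""
    try:
        values = [evaluate(tree, p) for p in people]
    except ZeroDivisionError:
        return 0.0
    best = 0
    for cut in values:
        hits = sum((v >= cut) == y for v, y in zip(values, labels))
        best = max(best, hits, len(labels) - hits)
    return best / len(labels)


def expr_paths(tree, prefix=()):
    """Paths to every expression node, so swaps respect the grammar."""
    yield prefix
    if not isinstance(tree, str):
        yield from expr_paths(tree[0], prefix + (0,))
        yield from expr_paths(tree[2], prefix + (2,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def put(tree, path, new):
    if not path:
        return new
    node = list(tree)
    node[path[0]] = put(node[path[0]], path[1:], new)
    return tuple(node)


def pilgram_sketch(people, labels, rng, size=30, max_depth=4,
                   patience=10, max_generations=50):
    @lru_cache(maxsize=None)
    def score(tree):
        return fitness(tree, people, labels)

    def pick(pool, k):  # fitness-proportional sampling (steps 3 and 5)
        return rng.choices(pool, [score(t) + 1e-9 for t in pool], k=k)

    population = [random_tree(rng) for _ in range(size)]           # step 2
    best, stale = max(population, key=score), 0
    for _ in range(max_generations):
        for parent in pick(population, size // 2):                 # steps 3-4
            path = rng.choice(list(expr_paths(parent)))
            population.append(put(parent, path, random_tree(rng, 1)))
        for _ in range(size // 2):                                 # steps 5-6
            a, b = pick(population, 2)
            pa = rng.choice(list(expr_paths(a)))
            pb = rng.choice(list(expr_paths(b)))
            population += [put(a, pa, get(b, pb)), put(b, pb, get(a, pa))]
        pool = [t for t in dict.fromkeys(population)
                if max_depth >= depth_of(t)]
        population = sorted(pool, key=score, reverse=True)[:size]  # step 7
        stale = stale + 1 if score(population[0]) == score(best) else 0
        best = population[0]
        if stale >= patience:                                      # step 8
            break
    return best, score(best)


# Synthetic people, invented for illustration; obese means BMI of 30 or more.
rng = random.Random(0)
people = [{"height": rng.uniform(1.5, 2.0), "weight": rng.uniform(50.0, 130.0),
           "age": rng.uniform(20.0, 70.0)} for _ in range(40)]
labels = [p["weight"] / p["height"] ** 2 >= 30 for p in people]
best, best_score = pilgram_sketch(people, labels, rng)
print(best, best_score)
```
                </code>
                <p>Parents are always returned to the population before culling, so the best fitness never decreases between generations; the depth limit keeps the features from growing without bound.</p>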
                <p>A common variation of the algorithm is to randomly generate new features at the start of step 7 and add them to the population before reducing the population size. Another common modification is to iteratively apply the algorithm such that the fitness function is updated between iterations to account for the fittest feature. This allows one to generate a list of irredundant features. Unsurprisingly, the choice of generative grammar strongly influences the quality of the resulting features. Below we will make use of Perl&#x2019;s regular expression grammar to produce motifs in an alignment-free fashion (
                    <xref ref-type="other" rid="SF1">Supplemental Figure 1</xref>).</p>
                <p>Many conventional genetic algorithms use &#x2018;chromosomes&#x2019; (the group of variables that alter algorithm behavior) with a set length. Because PILGram is based on recombining parse trees, its features do not have a defined length. As described above, the trees can be mutated and crossed during the optimization process, so the length of the resulting regular expression is variable, though it is limited by the maximum parse tree depth allowed. This maximum depth can be set, but larger values are more computationally intensive.</p>
                <p>PILGram has been applied in areas ranging from text analysis, which uses a combination of atomic features based on letter frequency and regular expressions, to image analysis, which uses more complex image-based atomic features. In both of these cases PILGram not only provided features that were optimal for classification, but also features that were easily interpreted by a user (unpublished results). In addition to these application spaces, precursor technology has been applied to loop unrolling in the realm of compiler optimization (
                    <xref ref-type="bibr" rid="ref-19">Leather 
                        <italic toggle="yes">et al.</italic>, 2009</xref>), where it was found that learned features raised performance from 48% of the theoretical efficiency bound (using expert-driven features) to 76% of the theoretical bound (using features automatically identified by a PILGram-like algorithm). We note that PILGram does not train a classifier; rather, it selects features, which means that any improvements are not the result of overfitting but are instead a consequence of carefully chosen features.</p>
            </sec>
            <sec>
                <title>Regular expressions</title>
                <p>The protein regular expression (PRE) used by PILGram to identify patterns in protein sequences is expressed in standard regular expression notation. Briefly:</p>
                <table-wrap id="T7" orientation="portrait" position="anchor">
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">Symbols</th>
                                <th align="left" colspan="1" rowspan="1">Example
                                    <break/>use</th>
                                <th align="left" colspan="1" rowspan="1">Description</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">.</td>
                                <td align="left" colspan="1" rowspan="1">.</td>
                                <td align="left" colspan="1" rowspan="1">Matches any single residue</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">[XYZ]</td>
                                <td align="left" colspan="1" rowspan="1">[AFGHL]</td>
                                <td align="left" colspan="1" rowspan="1">Matches any single residue that is
                                    <break/>contained in the brackets</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">[^XYZ]</td>
                                <td align="left" colspan="1" rowspan="1">[^KR]</td>
                                <td align="left" colspan="1" rowspan="1">Matches any single residue that is not
                                    <break/>contained in the brackets</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">[X-Y]</td>
                                <td align="left" colspan="1" rowspan="1">[A-E]</td>
                                <td align="left" colspan="1" rowspan="1">Indicates a range of residues in
                                    <break/>alphabetical order</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">^</td>
                                <td align="left" colspan="1" rowspan="1">^MST</td>
                                <td align="left" colspan="1" rowspan="1">Matches the start (N-terminus) of the
                                    <break/>sequence</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">$</td>
                                <td align="left" colspan="1" rowspan="1">FGH$</td>
                                <td align="left" colspan="1" rowspan="1">Matches the end (C-terminus) of the
                                    <break/>sequence</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">X*</td>
                                <td align="left" colspan="1" rowspan="1">A*</td>
                                <td align="left" colspan="1" rowspan="1">Matches zero or more of the preceding
                                    <break/>element</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">X+</td>
                                <td align="left" colspan="1" rowspan="1">K+</td>
                                <td align="left" colspan="1" rowspan="1">Matches one or more of the preceding
                                    <break/>element</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">X?</td>
                                <td align="left" colspan="1" rowspan="1">C?</td>
                                <td align="left" colspan="1" rowspan="1">Matches zero or one of the preceding
                                    <break/>element</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">X{Y}</td>
                                <td align="left" colspan="1" rowspan="1">L{20}</td>
                                <td align="left" colspan="1" rowspan="1">Matches the indicated number of the
                                    <break/>preceding element</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">X{Y,Z}</td>
                                <td align="left" colspan="1" rowspan="1">R{2,4}</td>
                                <td align="left" colspan="1" rowspan="1">Matches the preceding element between Y and Z times</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
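                <p>The notation in the table can be tried out with Python's re module, whose syntax agrees with the Perl-style notation for the constructs listed. The sequence below is a toy example invented for illustration and is not taken from the study data.</p>
                <code language="python">
```python
import re

# Toy sequence, invented for illustration only.
sequence = "MSTAKRRRLGHCFGH"

patterns = [
    "^MST",     # anchored at the N-terminus
    "FGH$",     # anchored at the C-terminus
    "[^KR]",    # any single residue other than K or R
    "R{2,4}",   # between two and four consecutive R residues
    "L{20}",    # exactly twenty consecutive L residues
    "[A-E]K+",  # one residue from A to E followed by one or more K
]
results = {p: re.search(p, sequence) is not None for p in patterns}
for pattern, found in results.items():
    print(pattern, found)

# Quantifiers are greedy: R{2,4} consumes all three R residues here.
print(re.search("R{2,4}", sequence).group())  # RRR
```
                </code>
                <p>Only L{20} fails to match this sequence, since it contains a single leucine.</p>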
            </sec>
        </sec>
        <sec sec-type="results">
            <title>Results</title>
            <media content-type="figshare" orientation="portrait" position="float" xlink:href="http://dx.doi.org/10.6084/m9.figshare.1415804"/>
            <sec>
                <title>Alignment-free identification of discriminatory protein patterns in PROSITE</title>
                <p>To test the ability of PILGram to identify discriminatory regular expressions from unaligned sequences we focused on a well-defined group of proteins with a known discriminatory pattern. We first examined the serine-threonine phosphatase pattern (PROSITE PS00125) by obtaining 166 sequences listed as true positives from PROSITE (see Methods). For negative examples we randomly selected 5344 sequences from UniProt that are not included in the positive sequences.</p>
                <p>We applied PILGram to this dataset using a standard regular expression grammar modified for protein sequences (see 
                    <xref ref-type="other" rid="ST1">Supplemental Figure 1</xref>). The algorithm was terminated after 276 iterations when the fitness (classification accuracy) did not change over 10 consecutive iterations. The resulting pattern (
                    <xref ref-type="table" rid="T1">Table 1</xref>) had very high accuracy and positive predictive value (PPV), at 99.9% and 92%, respectively. The pattern identified by PILGram contains the core of the existing PROSITE pattern, a K or R (the PILGram pattern adds a Q) followed by GNH, but lacks the first and last residues of the PROSITE pattern; it performs nearly as well (see Supplemental Data PILGram_PATTERNS_PS00125.txt). In 
                    <xref ref-type="table" rid="T2">Table 2</xref> we show several examples of the functional regions identified in sequences by the original PROSITE model and the PILGram-derived model (PRE matches in bold type). Alignments for the complete set of positive examples are included as Supplemental Data PS00125_alignments.out. However, the PILGram pattern required no sequence alignment or manual determination of a conserved pattern.</p>
                <table-wrap id="T1" orientation="portrait" position="anchor">
                    <label>Table 1. </label>
                    <caption>
                        <title>Ser/Thr Phosphatase model.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">Model</th>
                                <th align="left" colspan="1" rowspan="1">Pattern</th>
                                <th align="left" colspan="1" rowspan="1">Accuracy</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td colspan="1" rowspan="1">
								
                                    <italic toggle="yes">PS00125</italic>
							</td>
                                <td colspan="1" rowspan="1">
								
                                    <italic toggle="yes">[LIVMN][KR]GNHE</italic>
							</td>
                                <td colspan="1" rowspan="1">
								
                                    <italic toggle="yes">100.0%</italic>
							</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">1</td>
                                <td colspan="1" rowspan="1">[KQR]G+NH</td>
                                <td colspan="1" rowspan="1">99.9%</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
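                <p>The relationship between the two patterns in Table 1 can be checked directly on the sequence fragments shown in Table 2. The following Python sketch is for illustration only; it searches the short fragments rather than full-length proteins, and the variable names are hypothetical.</p>
                <code language="python">
```python
import re

PROSITE = "[LIVMN][KR]GNHE"  # PS00125 as listed in Table 1
PILGRAM = "[KQR]G+NH"        # PILGram model 1 as listed in Table 1

# Sequence fragments as shown in Table 2 (not full-length proteins).
fragments = {
    "Q9LHE7": "PANITLLRGNHESRQLTQ",
    "P12982": "SENFFLLRGNHECASINR",
    "A2XN40": "PQRITILRGNHESRQITQ",
}

hits = {}
for name, seq in fragments.items():
    prosite_hit = re.search(PROSITE, seq)
    pilgram_hit = re.search(PILGRAM, seq)
    hits[name] = (prosite_hit.group(), pilgram_hit.group())
    # The PILGram match lies inside the PROSITE match in each fragment.
    print(name, prosite_hit.span(), pilgram_hit.span())
```
                </code>
                <p>For these three fragments the PILGram pattern matches RGNH, which sits one residue inside the LRGNHE region matched by the PROSITE pattern.</p>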
                <p>The ser/thr phosphatase pattern is relatively simple and does not include any gaps of variable size. We were interested in determining if PILGram would also work on a more complicated pattern and so chose the zinc finger pattern (PS00028), which is a somewhat variable arrangement of conserved cysteine and histidine residues. We obtained the 1997 sequences used for the construction of the PROSITE pattern and additionally collected 5435 randomly selected protein sequences from the UniProt database to serve as negative examples for this test. Because individual runs converged on different predictive patterns we ran PILGram 10 times on the dataset. In principle, PILGram will always eventually converge to the optimal pattern. In practice, however, the fitness landscape may contain local extrema or &#x2018;flat regions&#x2019; over which the fitness function does not vary significantly with feature modification. In such situations, PILGram may take significant time to escape these regions, and it is more economical to employ a weak convergence test, run PILGram several times, and aggregate the features.</p>
                <table-wrap id="T2" orientation="portrait" position="anchor">
                    <label>Table 2. </label>
                    <caption>
                        <title>Example alignments of ser/thr phosphatase sequences.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">Sequence</th>
                                <th align="left" colspan="1" rowspan="1">Model</th>
                                <th align="left" colspan="1" rowspan="1">Functional region</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q9LHE7</td>
                                <td align="left" colspan="1" rowspan="1">PS00125</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>PANITL
                                        <underline>
                                            <bold>LRGNHE</bold>
                                        </underline>SRQLTQ</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q9LHE7</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 1</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>PANITLL
                                        <underline>
                                            <bold>RGNH</bold>
                                        </underline>ESRQLTQ</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">P12982</td>
                                <td align="left" colspan="1" rowspan="1">PS00125</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>SENFFL
                                        <underline>
                                            <bold>LRGNHE</bold>
                                        </underline>CASINR</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">P12982</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 1</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>SENFFLL
                                        <underline>
                                            <bold>RGNH</bold>
                                        </underline>ECASINR</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">A2XN40</td>
                                <td align="left" colspan="1" rowspan="1">PS00125</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>PQRITI
                                        <underline>
                                            <bold>LRGNHE</bold>
                                        </underline>SRQITQ</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">A2XN40</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 1</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>PQRITIL
                                        <underline>
                                            <bold>RGNH</bold>
                                        </underline>ESRQITQ</monospace>
                                </td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>The resulting patterns (
                    <xref ref-type="table" rid="T3">Table 3</xref>; Supplemental Data PILGram_PATTERNS_PS00028.txt) vary in composition and accuracy, with a maximum accuracy of 92.7% (model 3). All patterns fall short of the manually determined PROSITE pattern, which has an accuracy of 99%. It is interesting to note that none of the identified patterns perfectly matches any portion of the manually determined PROSITE pattern, though there are some consistently identified features such as multiple cysteine residues.</p>
                <table-wrap id="T3" orientation="portrait" position="anchor">
                    <label>Table 3. </label>
                    <caption>
                        <title>Zinc finger patterns identified.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">Model</th>
                                <th align="left" colspan="1" rowspan="1">Pattern</th>
                                <th align="left" colspan="1" rowspan="1">Accuracy</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td colspan="1" rowspan="1">
								
                                    <italic toggle="yes">PS00028</italic>
							</td>
                                <td colspan="1" rowspan="1">
								
                                    <italic toggle="yes">C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H</italic>
							</td>
                                <td colspan="1" rowspan="1">
								
                                    <italic toggle="yes">99.0%</italic>
							</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">1</td>
                                <td colspan="1" rowspan="1">[^LV][^F][^VW]{8}[^ADILN]{7}C</td>
                                <td colspan="1" rowspan="1">87.6%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">2</td>
                                <td colspan="1" rowspan="1">C[D-H].+[R][^EFG]H</td>
                                <td colspan="1" rowspan="1">87.4%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">3</td>
                                <td colspan="1" rowspan="1">.{15}C[^FGIRW][^C]C</td>
                                <td colspan="1" rowspan="1">92.7%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">4</td>
                                <td colspan="1" rowspan="1">[CHP][^V]{53}</td>
                                <td colspan="1" rowspan="1">81.2%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">5</td>
                                <td colspan="1" rowspan="1">[C].{27}[C]</td>
                                <td colspan="1" rowspan="1">89.0%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">6</td>
                                <td colspan="1" rowspan="1">[^VW]{55}.*.+$</td>
                                <td colspan="1" rowspan="1">80.0%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">7</td>
                                <td colspan="1" rowspan="1">C[^V]{42}</td>
                                <td colspan="1" rowspan="1">83.0%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">8</td>
                                <td colspan="1" rowspan="1">K.{3}C+</td>
                                <td colspan="1" rowspan="1">87.0%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">9</td>
                                <td colspan="1" rowspan="1">C[AGHNQT]K</td>
                                <td colspan="1" rowspan="1">87.2%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">10</td>
                                <td colspan="1" rowspan="1">C[^IFPV]{3}F+[^CE]</td>
                                <td colspan="1" rowspan="1">91.0%</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">Combined</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">95.3%</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>We examined the possibility that the patterns identified by PILGram would be synergistic in their discriminatory ability. For each example protein (positive and negative) we counted how many of the individual PILGram patterns matched, then used this count as a discriminator. This simple voting procedure increased the accuracy from 92.7% for the best individual pattern to a maximum of 95.3% when six or more patterns match a sequence (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>). While this performance still does not reach the level of the original PROSITE pattern (99%), we believe it demonstrates the utility of PILGram for identifying patterns from unaligned sequences.</p>
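                <p>The voting procedure can be illustrated with a minimal sketch. It assumes that a pattern counts as a match if it occurs anywhere in the sequence and that a sequence is called positive when at least a threshold number of patterns match; the function names are illustrative and are not part of PILGram. The patterns are those listed in Table 3.</p>
                <preformat>
```python
import re

# The ten PILGram patterns as listed in Table 3.
PILGRAM_PATTERNS = [
    r"[^LV][^F][^VW]{8}[^ADILN]{7}C",
    r"C[D-H].+[R][^EFG]H",
    r".{15}C[^FGIRW][^C]C",
    r"[CHP][^V]{53}",
    r"[C].{27}[C]",
    r"[^VW]{55}.*.+$",
    r"C[^V]{42}",
    r"K.{3}C+",
    r"C[AGHNQT]K",
    r"C[^IFPV]{3}F+[^CE]",
]


def count_matches(sequence, patterns=PILGRAM_PATTERNS):
    """Return how many patterns match anywhere in the sequence."""
    return sum(1 for p in patterns if re.search(p, sequence))


def vote(sequence, threshold=6, patterns=PILGRAM_PATTERNS):
    """Call a sequence positive when at least `threshold` patterns match."""
    return count_matches(sequence, patterns) >= threshold


def accuracy(positives, negatives, threshold=6, patterns=PILGRAM_PATTERNS):
    """Fraction of labelled sequences that the vote classifies correctly."""
    true_pos = sum(vote(s, threshold, patterns) for s in positives)
    true_neg = sum(not vote(s, threshold, patterns) for s in negatives)
    return (true_pos + true_neg) / (len(positives) + len(negatives))
```
</preformat>
                <p>Evaluating the accuracy function for thresholds from one to ten on the positive and negative example sets would give a curve of the kind shown in Figure 1.</p>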
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Accuracy for prediction of zinc finger proteins.</title>
                        <p>Matches to PILGram-generated regular expression patterns for the zinc finger domain (represented in PROSITE PS00028) were counted (X axis) and accuracy (Y axis) was calculated based on the known positive and negative example datasets (see text). Peak accuracy of the approach is attained at six pattern matches.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_figure1.gif"/>
                </fig>
                <p>We were interested to know whether PILGram was identifying regions of the sequence that overlap with the PROSITE pattern. We identified regions in all positive example sequences that match the ten PILGram patterns and calculated a score for each sequence based on the number of patterns, per residue, that matched within the real zinc finger region. On average, 3.4 PILGram patterns match each residue of the known PS00028 pattern, whereas 2.1 patterns match an arbitrary residue in the sequence. This shows that PILGram patterns preferentially overlap the canonical zinc finger motif. However, it is clear that PILGram-derived motifs may not be canonical and further work needs to be done in this area. We show examples of matches from individual PILGram models as well as the per-residue overlap score (as &#x201c;Summary&#x201d;) in 
                    <xref ref-type="table" rid="T4">Table 4</xref>. Note that none of the individual PILGram models matches the single (Q24174, beginning at residue 540) or double (Q59RR0, beginning at residue 645) zinc finger motifs completely, but that the overlap scores for the functional regions in both sequences are higher than those for the surrounding sequence. Alignments for the complete set of positive examples are provided as Supplemental Data PS00028_alignments.out.</p>
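                <p>A per-residue overlap score of this kind could be computed as in the following sketch. It assumes that each pattern contributes at most one count to a residue and that only non-overlapping matches of a pattern are considered; the function names are illustrative and the original implementation may differ.</p>
                <preformat>
```python
import re


def per_residue_counts(sequence, patterns):
    """For each residue, count the patterns with a match covering it."""
    counts = [0] * len(sequence)
    for pattern in patterns:
        covered = set()
        # finditer yields non-overlapping matches, scanning left to right.
        for match in re.finditer(pattern, sequence):
            covered.update(range(match.start(), match.end()))
        for position in covered:
            counts[position] += 1
    return counts


def mean_count(counts, start, end):
    """Mean per-residue count over the half-open interval [start, end)."""
    window = counts[start:end]
    return sum(window) / len(window)
```
</preformat>
                <p>Comparing the mean count over the residues of the annotated motif with the mean count over the whole sequence gives the kind of contrast reported above (3.4 versus 2.1).</p>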
                <table-wrap id="T4" orientation="portrait" position="anchor">
                    <label>Table 4. </label>
                    <caption>
                        <title>Example alignments of zinc finger regions.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">Sequence</th>
                                <th align="left" colspan="1" rowspan="1">Model</th>
                                <th align="left" colspan="1" rowspan="1">Functional region</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PS00028</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRP
                                        <underline>
                                            <bold>CPKCGKIYRSAHTLRTHLEDKH</bold>
                                        </underline>TVCPGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 1</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPKCGKIYRSAH
                                        <underline>
                                            <bold>TLRTHLEDKHTVCPGY</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 2</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>ATDPRPCPKCGKIYRSAHTLRTHLEDKHTVCPGY</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 3</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>ATDPRPCPKC</bold>
                                        </underline>GKIYRSAHTL
                                        <underline>
                                            <bold>RTHLEDKHTVCPGY</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 4</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPKCGKIYRSAHTLRTHLEDKHTVCPGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 5</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPKCGKIYRSAHTLRTHLEDKHTVCPGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 6</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>ATDPRPCPKCGKIYRSAHTLRTHLEDKHTVCPGY</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 7</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPKCGKIYRSAHTLRTHLEDKHTVCPGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 8</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPKCGKIYRSAHTLRTHLED
                                        <underline>
                                            <bold>KHTVC</bold>
                                        </underline>PGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 9</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPK
                                        <underline>
                                            <bold>CGK</bold>
                                        </underline>IYRSAHTLRTHLEDKHTVCPGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 10</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ATDPRPCPKCGKIYRSAHTLRTHLEDKHTVCPGY</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q24174</td>
                                <td align="left" colspan="1" rowspan="1">Summary</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>222222
                                        <underline>3334332222223344444455</underline>444333</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PS00028</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>EDKIYT
                                        <underline>
                                            <bold>CTYKNCGKKFTRRYNVRSHIQTH</bold>
                                        </underline>LSDRPFG
                                        <underline>
                                            <bold>CQFCPKRFVRQHDLNRHVKGH</bold>
                                        </underline>IEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 1</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>EDKIYTCTYKNCGKKFTRRYNV
                                        <underline>
                                            <bold>RSHIQTHLSDRPFGCQFC</bold>
                                        </underline>PKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 2</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>EDKIYTCTYKNCGKKFTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 3</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>EDKIYTCTYKNCGKKFTRRYN
                                        <underline>
                                            <bold>VRSHIQTHLSDRPFGCQFC</bold>
                                        </underline>PKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 4</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>EDKIYTCTYKNCGKK</bold>
                                        </underline>FTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 5</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>EDKIYTCTYKN
                                        <underline>
                                            <bold>CGKKFTRRYNVRSHIQTHLSDRPFGCQFC</bold>
                                        </underline>PKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 6</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>EDKIYTCTYKNCGKKFTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 7</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>EDKIYTCTYKNCGKKFTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 8</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>ED
                                        <underline>
                                            <bold>KIYT</bold>
                                        </underline>CTYKNCGKKFTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 9</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>EDKIYTCTYKN
                                        <underline>
                                            <bold>CGK</bold>
                                        </underline>KFTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">PILGram 10</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>
                                        <underline>
                                            <bold>EDKIYTCTYKNCGKKFTRRYNVRSHIQTHLSDRPFGCQFCPKRFVRQHDLNRHVKGHIEARYS</bold>
                                        </underline>
                                    </monospace>
                                </td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1">Q59RR0</td>
                                <td align="left" colspan="1" rowspan="1">Summary</td>
                                <td align="left" colspan="1" rowspan="1">
                                    <monospace>334444544446665444444566666665555555666633333333333333333222222</monospace>
                                </td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec>
                <title>Drug resistance transporters</title>
                <p>A more difficult task for functional classification is to develop a model that will discriminate a group of functionally related proteins that cannot be aligned by traditional sequence alignment methods, or where the alignment does not allow discrimination between closely related sequences with different functions. To test its utility with these kinds of problematic proteins we applied PILGram to develop a classifier for antibiotic drug resistance transporters.</p>
                <p>Though transporter superfamily members can be identified fairly readily using standard sequence alignment approaches, previous studies have shown that sequence similarity has limited utility for classifying transporters by substrate specificity (
                    <xref ref-type="bibr" rid="ref-3">Barghash &amp; Helms, 2013</xref>). The same authors also showed separately that simple sequence-derived data (amino acid composition, dipeptide composition) could be integrated to classify some substrate families with good accuracy (
                    <xref ref-type="bibr" rid="ref-34">Schaadt 
                        <italic toggle="yes">et al.</italic>, 2010</xref>; 
                    <xref ref-type="bibr" rid="ref-35">Schaadt &amp; Helms, 2012</xref>), but these models have little potential for providing biological insight. Additionally, it remains unclear whether some members of functional families have yet to be discovered because they lack strong sequence similarity to known members. ATP-binding cassette (ABC), resistance-nodulation-cell division (RND), and major facilitator superfamily (MFS) transporters are common superfamilies of proteins involved in the transport of a wide variety of compounds, such as sugars, ions, peptides, and more complex organic molecules. Multidrug resistance (MDR) transporters are found in each of these superfamilies and are primary mediators of antibiotic drug resistance (
                    <xref ref-type="bibr" rid="ref-27">Nikaido, 2009</xref>; 
                    <xref ref-type="bibr" rid="ref-28">Nikaido &amp; Pages, 2012</xref>). Though MDR transporters actually encompass a range of substrate specificities because there are many types of drugs they export, we hypothesized that there would be some unifying features of MDR transporters that could be captured using PILGram.</p>
                <p>We gathered a set of 73 known MDR transporter sequences (positive examples) from the TCDB (
                    <xref ref-type="bibr" rid="ref-31">Saier 
                        <italic toggle="yes">et al.</italic>, 2014</xref>) and used the remainder of sequences classified in the TCDB as non-MDR transporters (negative examples; 5935 sequences). This dataset (Supplemental Data MDR_TCDB_positives.fasta and MDR_TCDB_negatives.fasta) was used to train and cross-validate MDRpred as described below.</p>
            </sec>
            <sec>
                <title>Traditional methods of identifying antibiotic resistance transporters</title>
                <p>We first evaluated how well previously generated HMM models from the Pfam database could discriminate between MDR and non-MDR transporters. We identified four Pfam models that appear to be specific for drug resistance transporters (PF00893, PF08370, PF00873, and PF13536) and applied them to the set of sequences, counting as a &#x2018;hit&#x2019; any sequence matched by any of the models with high confidence (E value &lt; 1e-100). The Pfam models provide very good accuracy (~97%) but identify only 10 of the 73 MDR transporters (14%), and many of these hits are likely to be sequences that were used to build the models in the first place.</p>
            </sec>
            <sec>
                <title>PILGram model training</title>
                <p>We examined the ability of PILGram to find patterns capable of distinguishing MDR transporters from other transporter sequences. Though regular expressions have been shown to be effective at capturing many types of functional patterns in proteins (
                    <xref ref-type="bibr" rid="ref-16">Hofmann 
                        <italic toggle="yes">et al.</italic>, 1999</xref>), other patterns may be better described by broader chemical and structural characteristics of protein regions (
                    <xref ref-type="bibr" rid="ref-11">Dubchak, 1995</xref>). Because we believed that transmembrane regions (TMRs) would be important features in this classification task, we modified our protein regular expression (PRE) grammar (
                    <xref ref-type="other" rid="SF1">Supplemental Figure 1</xref>) to bias the feature generation process toward producing TMRs (TMR-PRE). Additionally, we included a large set of protein physicochemical properties in our PILGram search (PP-PRE): 147 types of properties were available as features that could be chosen during the search. If a physicochemical property was used in a search, the score (the value for that particular property) was calculated for all matches of the accompanying regular expression on a sequence. If the expression matched a protein more than once, the scores were averaged.</p>
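                <p>The averaging of a property score over multiple matches can be sketched as follows. This is an illustration under stated assumptions, not the PILGram implementation: each match is scored as the mean scale value of its residues, and the Kyte-Doolittle hydropathy scale is used only as a stand-in for one of the 147 properties.</p>
                <preformat>
```python
import re

# Kyte-Doolittle hydropathy values, used only as a stand-in property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_score(sequence, pattern, scale=HYDROPATHY):
    """Average a property over all matches of a pattern in one sequence.

    Assumption: each match is scored as the mean scale value of its
    residues; residues missing from the scale are skipped.
    Returns None when the pattern does not match.
    """
    match_scores = []
    for match in re.finditer(pattern, sequence):
        values = [scale[aa] for aa in match.group() if aa in scale]
        if values:
            match_scores.append(sum(values) / len(values))
    if not match_scores:
        return None
    # Several matches on the same protein are averaged into one score.
    return sum(match_scores) / len(match_scores)
```
</preformat>
                <p>The resulting score gives one numeric feature per sequence for each pairing of a regular expression with a property.</p>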
                <p>Using a 2-fold cross-validation approach (see Methods), we used PILGram to generate 36 models (
                    <xref ref-type="other" rid="ST1">Supplemental Table 1</xref> and Supplemental Data PILGram_PATTERNS_MDRpred.txt), approximately 12 from each of the three grammars (PRE, TMR-PRE, and PP-PRE). The models had individual accuracies of 70&#x2013;75%, underperforming the combination of existing HMM models. However, applying the simple voting approach used above, in which the number of models that matched each sequence was counted, improved the results dramatically. The accuracy and PPV for increasing numbers of model matches are shown in 
                    <xref ref-type="other" rid="SF2">Supplemental Figure 2</xref>; they reach maximum values of 99% and 28%, respectively, at the most stringent threshold (requiring that all patterns be matched). Using the models from each grammar individually in the voting approach showed that the three grammars, PRE, TMR-PRE, and PP-PRE, perform very similarly in terms of accuracy and PPV when considering the maximum number of model matches (accuracies of 96%, 97%, and 95%, respectively, and PPVs all at 12%). From these results it appears that the overall performance of our approach benefits from the combination of different kinds of models, which more than doubles the PPV.</p>
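                <p>The quantities discussed here follow the standard confusion-matrix definitions, which can be written out as a short sketch. The function and argument names are illustrative; coverage is taken to mean the fraction of known MDR transporters that are recovered.</p>
                <preformat>
```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, PPV, and coverage from confusion-matrix counts."""
    total = tp + fp + tn + fn
    predicted_pos = tp + fp
    actual_pos = tp + fn
    return {
        "accuracy": (tp + tn) / total,
        # PPV is undefined when nothing is predicted positive.
        "ppv": tp / predicted_pos if predicted_pos else None,
        # Coverage: fraction of known positives that are recovered.
        "coverage": tp / actual_pos if actual_pos else None,
    }
```
</preformat>
                <p>For instance, recovering 10 of 73 known MDR transporters corresponds to a coverage of 10/73, or about 14%, regardless of the number of false positives.</p>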
                <p>To examine whether the individual scores could be combined to provide better prediction, we employed logistic regression and found that this improved our results somewhat (
                    <xref ref-type="fig" rid="f2">Figure 2</xref>; 
                    <xref ref-type="other" rid="SF3">Supplemental Figure 3</xref>). For comparison, at the same ~97% accuracy level provided by the traditional methods (Pfam family matches), our method, which we call MDRpred, identifies 37 of the 73 MDR transporters in our training set (approximately 50%) versus 10 for the traditional methods. It is clear that further development is needed to improve classification of this important group, but to our knowledge our approach is the best method to date for identifying drug resistance transporters using sequence alone.</p>
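                <p>Combining per-model outputs with logistic regression can be sketched in plain Python as below. This is a minimal illustration that assumes each sequence is represented by a vector of model outputs (for example, 1 for a match and 0 otherwise) and that the weights are fitted by batch gradient descent; the fitting procedure, learning rate, and function names are assumptions and not details taken from MDRpred.</p>
                <preformat>
```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(row, weights, bias):
    """Combined score in (0, 1) for one sequence's vector of model outputs."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(features, labels, rate=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(features, labels):
            error = predict(row, weights, bias) - label
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        # Step against the mean gradient over all training sequences.
        weights = [w - rate * g / len(features)
                   for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(features)
    return weights, bias
```
</preformat>
                <p>Thresholding the combined score at different values then trades accuracy and PPV against coverage, as plotted in Figure 2.</p>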
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>MDR classification results.</title>
                        <p>The accuracy (blue line), positive predictive value (black line), and percentage of total MDR transporters identified (coverage; red line) are shown as a function of the score threshold used (X axis). The score is derived from a logistic regression on the complete set of 36 models generated (see text).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_figure2.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Functional motifs identified</title>
                <p>In addition to classifying sequences, a second goal of this work is to identify biologically relevant regions of proteins that are responsible for protein function. We showed that PILGram can identify regions known to be functionally important in zinc fingers. Here we apply a similar approach to identify regions that may be important for drug resistance in transporters, that is, the regions most important for transporting a broad class of substrates: antibiotic drugs.</p>
                <p>We first examined the overlap in patterns by clustering models based on the training sequences that they matched (
                    <xref ref-type="other" rid="SF4">Supplemental Figure 4</xref>). The models were arranged using hierarchical clustering and seven clusters of similar models were then identified. We found that most of the clusters exhibited some similarity in patterns; the model with the highest independent accuracy from each cluster is listed in 
                    <xref ref-type="table" rid="T5">Table 5</xref>. We found that applying logistic regression to combine these seven models provided performance similar to that of the voting method, but underperformed the logistic regression on the complete set of models somewhat (
                    <xref ref-type="other" rid="SF3">Supplemental Figure 3</xref>). This indicates that the seven models represent a large portion of the information in the approach but that the additional models add significant value.</p>
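                <p>Clustering models by the sequences they match, and then choosing one representative per cluster, could look like the sketch below. The distance measure (Jaccard distance between sets of matched sequences) and the average-linkage merging rule are assumptions made for illustration, since the text does not specify them; the function names are likewise illustrative.</p>
                <preformat>
```python
def jaccard_distance(a, b):
    """One minus the shared fraction of two sets of matched sequence IDs."""
    union = a.union(b)
    if not union:
        return 0.0
    return 1.0 - len(a.intersection(b)) / len(union)


def cluster_models(matches, n_clusters):
    """Average-linkage agglomerative clustering of models.

    `matches` maps a model name to the set of sequence IDs it matches.
    """
    clusters = [[name] for name in matches]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                dist = sum(jaccard_distance(matches[a], matches[b])
                           for a, b in pairs) / len(pairs)
                if best is None or best[0] > dist:
                    best = (dist, i, j)
        # Merge the closest pair of clusters.
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def representatives(clusters, accuracy):
    """Pick the most accurate model from each cluster."""
    return [max(cluster, key=lambda name: accuracy[name])
            for cluster in clusters]
```
</preformat>
                <p>Applied to the 36 models with seven clusters requested, a procedure of this kind would yield one representative per cluster, analogous to the entries in Table 5.</p>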
                <table-wrap id="T5" orientation="portrait" position="anchor">
                    <label>Table 5. </label>
                    <caption>
                        <title>Drug resistance transporter patterns identified.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1">Model</th>
                                <th align="left" colspan="1" rowspan="1">Pattern</th>
                                <th align="left" colspan="1" rowspan="1">Physicochemical property</th>
                                <th align="left" colspan="1" rowspan="1">Accuracy</th>
                                <th align="left" colspan="1" rowspan="1">Cluster name</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td colspan="1" rowspan="1">36</td>
                                <td colspan="1" rowspan="1">D[^ADGHY]+[AEFHI].+SR</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">73%</td>
                                <td colspan="1" rowspan="1">Cluster 1</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">31</td>
                                <td colspan="1" rowspan="1">AR.+RL[DMPR-Y]</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">74%</td>
                                <td colspan="1" rowspan="1">AR-L</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">8</td>
                                <td colspan="1" rowspan="1">AQ.+AT</td>
                                <td colspan="1" rowspan="1">Solvent Accessibility</td>
                                <td colspan="1" rowspan="1">73%</td>
                                <td colspan="1" rowspan="1">AQ-T</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">18</td>
                                <td colspan="1" rowspan="1">[AC][DFGLMPQRVY]+RQ</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">75%</td>
                                <td colspan="1" rowspan="1">RQ-L</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">27</td>
                                <td colspan="1" rowspan="1">[DGLN-V]VR.+TV.+[CDEY]*$</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">76%</td>
                                <td colspan="1" rowspan="1">VR</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">13</td>
                                <td colspan="1" rowspan="1">AQ.+RQ.{49}</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">75%</td>
                                <td colspan="1" rowspan="1">Cluster 6</td>
                            </tr>
                            <tr>
                                <td colspan="1" rowspan="1">16</td>
                                <td colspan="1" rowspan="1">MR.+LL[STVW]</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1">73%</td>
                                <td colspan="1" rowspan="1">M-L</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
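                <p>Because each model in 
                    <xref ref-type="table" rid="T5">Table 5</xref> is written as a regular expression, it can be tried directly against a protein sequence. The following Python sketch is illustrative only and is not taken from the MDRpred code: it assumes that a model matches when its pattern is found anywhere in the sequence, and the function name matching_models and the toy sequence are hypothetical.</p>
                <code language="python">
```python
import re

# Patterns copied from Table 5, keyed by model number.
TABLE5_PATTERNS = {
    36: r"D[^ADGHY]+[AEFHI].+SR",
    31: r"AR.+RL[DMPR-Y]",
    8: r"AQ.+AT",
    18: r"[AC][DFGLMPQRVY]+RQ",
    27: r"[DGLN-V]VR.+TV.+[CDEY]*$",
    13: r"AQ.+RQ.{49}",
    16: r"MR.+LL[STVW]",
}


def matching_models(sequence, patterns=TABLE5_PATTERNS):
    """Return the sorted model numbers whose pattern occurs in the sequence.

    Assumption: a model counts as a match if re.search finds the pattern
    anywhere in the sequence (unanchored search).
    """
    return sorted(
        model for model, pattern in patterns.items()
        if re.search(pattern, sequence)
    )


# Hypothetical toy sequence for illustration; it is not a real protein.
toy_sequence = "MKAQLLGATPPARVVRLSK"
print(matching_models(toy_sequence))
```
                </code>
                <p>For the toy sequence only models 8 and 31 are reported, because the remaining patterns require residues (for example D, RQ, TV or MR) that the sequence does not contain.</p>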
                <p>EmrD is an MDR transporter with a solved crystal structure (
                    <xref ref-type="bibr" rid="ref-36">Yin 
                        <italic toggle="yes">et al.</italic>, 2006</xref>). We examined where matches from the PILGram models overlap on the EmrD sequence and found that the maximum overlap occurred in H3 (69-103) and in the loop following H4 (118-131). The latter region has been highlighted as the &#x2018;selectivity filter&#x2019;, a loop that extends into the cytoplasm and abrogates substrate selectivity when mutated (
                    <xref ref-type="bibr" rid="ref-36">Yin 
                        <italic toggle="yes">et al.</italic>, 2006</xref>) (
                    <xref ref-type="fig" rid="f3">Figure 3</xref>). This suggests that, in this case where a substrate selectivity region is known, our models can correctly identify it, though more examples would be necessary to demonstrate this fully. Alignments of the matches of individual models to all positive example MDR sequences are provided as Supplemental Data MDRpred_alignments.out.</p>
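                <p>One way to locate regions of maximum overlap is to count, for every residue, how many models have a match that covers it. The Python sketch below illustrates that idea under stated assumptions and is not the procedure used to produce the EmrD result: it uses only the first match of each pattern, the function names are hypothetical, and the toy sequence and patterns are invented for the example.</p>
                <code language="python">
```python
import re


def overlap_profile(sequence, patterns):
    """Count, for each residue, how many patterns have a match covering it."""
    counts = [0] * len(sequence)
    for pattern in patterns:
        match = re.search(pattern, sequence)  # first match only (assumption)
        if match:
            for i in range(match.start(), match.end()):
                counts[i] += 1
    return counts


def max_overlap_regions(counts):
    """Return 1-based inclusive (start, end) runs where coverage is maximal."""
    peak = max(counts, default=0)
    regions = []
    start = None
    # The trailing -1 is a sentinel that closes a run ending at the last residue.
    for i, count in enumerate(counts + [-1]):
        if count == peak and peak != 0 and start is None:
            start = i
        elif count != peak and start is not None:
            regions.append((start + 1, i))
            start = None
    return regions


# Hypothetical toy input for illustration only.
toy_sequence = "MKAQLLGATPPARVVRLSK"
toy_patterns = [r"AQ.+AT", r"AR.+RL[DMPR-Y]", r"AT.+AR"]
profile = overlap_profile(toy_sequence, toy_patterns)
print(max_overlap_regions(profile))
```
                </code>
                <p>Because quantifiers such as .+ are greedy, a single match can span a long stretch of sequence, so the coverage profile depends on how matches are delimited.</p>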
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>Figure 3. </label>
                    <caption>
                        <title>Prediction of selectivity in EmrD.</title>
                        <p>The structure of the MDR transporter EmrD from 
                            <italic toggle="yes">E. coli</italic> (PDB ID 2GFP) is shown with the regions of maximum pattern overlap in red. One of these regions has been shown to be the selectivity filter for substrates transported by the protein, indicating that MDRpred predictions can highlight functionally important regions.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_figure3.gif"/>
                </fig>
            </sec>
        </sec>
        <sec>
            <title>Identification of novel MDR transporter candidates from environmental microbiomes</title>
            <p>New antibiotic resistance mechanisms are thought to be acquired from a very large natural reservoir of environmental bacteria, most of which have not yet been characterized (
                <xref ref-type="bibr" rid="ref-10">D'Costa 
                    <italic toggle="yes">et al.</italic>, 2007</xref>; 
                <xref ref-type="bibr" rid="ref-14">Forsberg 
                    <italic toggle="yes">et al.</italic>, 2012</xref>; 
                <xref ref-type="bibr" rid="ref-21">Li 
                    <italic toggle="yes">et al.</italic>, 2014</xref>). This means that novel antibiotics may face emergence of antibiotic resistance in pathogenic bacteria by lateral gene transfer or other means (
                <xref ref-type="bibr" rid="ref-1">Aminov &amp; Mackie, 2007</xref>; 
                <xref ref-type="bibr" rid="ref-14">Forsberg 
                    <italic toggle="yes">et al.</italic>, 2012</xref>). We were interested in determining whether our models could be used to identify candidate MDRs from environmental samples. We therefore searched a species-resolved metagenomic dataset acquired from consortia (
                <xref ref-type="bibr" rid="ref-9">Cole 
                    <italic toggle="yes">et al.</italic>, 2014</xref>) cultivated from a phototrophic microbial mat in Hot Lake, Washington (
                <xref ref-type="bibr" rid="ref-22">Lindemann 
                    <italic toggle="yes">et al.</italic>, 2013</xref>). Though soil microbial communities have been examined for antibiotic resistance potential previously (
                <xref ref-type="bibr" rid="ref-10">D'Costa 
                    <italic toggle="yes">et al.</italic>, 2007</xref>), communities living in extreme environments such as Hot Lake have not. We postulated that these kinds of communities might be rich sources of novel MDR transporters given the manifold interactions between community members (
                <xref ref-type="bibr" rid="ref-24">Martinez 
                    <italic toggle="yes">et al.</italic>, 2009</xref>; 
                <xref ref-type="bibr" rid="ref-30">Piddock, 2006</xref>).</p>
            <p>We first searched the 69,010 protein sequences from the Hot Lake consortial metagenomes (Nelson 
                <italic toggle="yes">et al.</italic>, submitted) for known MDR transporters using the Pfam families (PF00893, PF08370, PF00873, and PF13536) and identified 118 high-confidence (E value &lt; 1e-100) matches. Interestingly, when we examined a set of clones gathered from 18 soil samples and selected for expression of multidrug resistance phenotypes (
                <xref ref-type="bibr" rid="ref-14">Forsberg 
                    <italic toggle="yes">et al.</italic>, 2012</xref>), we found only 14 MDR transporters at the same stringency, although the efficiency of transporter expression could be a limitation in that system. This comparison suggests that the Hot Lake community has a relatively large number of MDR transporters.</p>
            <p>We expected that this metagenome would contain MDR transporters that are not detected by the available Pfam families. Therefore, we searched the Hot Lake consortium metagenome using all 36 models and then ranked the sequences by the number of models they matched. A histogram of the number of matching models is shown in 
                <xref ref-type="other" rid="SF5">Supplemental Figure 5</xref>. Because MDRpred was trained only on transporter proteins, it cannot discriminate transporters from non-transporters; that is, a substantial number of its positive predictions match proteins that are unlikely to be transporters. Accordingly, we restricted the candidates to proteins identified as transporters by Pfam (the list of Pfam transporter families is provided as Supplemental Data Pfam_transporters.txt); at the highest stringency we identified five candidate MDR sequences (
                <xref ref-type="table" rid="T6">Table 6</xref>). This step is included in the overall MDRpred process to allow accurate prediction in entire genomes or metagenomes. We provide a full list of other high-confidence predictions (matching more than 30 individual models, annotated as transporters but not multidrug resistance transporters by Pfam) as Supplemental Data HotLake_MDRpred_predictions.fasta.</p>
            <p>Though two of these predictions are already annotated as transporters (arabinose efflux permease and lipid transporter), these annotations are largely automated predictions based on traditional sequence analysis approaches (BLAST searches and family/motif matches). Novel antibiotic resistance transporters are likely to show some similarities with known transporters (
                <xref ref-type="bibr" rid="ref-14">Forsberg 
                    <italic toggle="yes">et al.</italic>, 2012</xref>), but definite substrate specificity is often not revealed by these relationships. The value of MDRpred is the potential to identify novel antibiotic resistance transporters from sequences annotated as transporters where substrate specificity has not been experimentally established.</p>
            <table-wrap id="T6" orientation="portrait" position="anchor">
                <label>Table 6. </label>
                <caption>
                    <title>Predicted novel multidrug resistance transporters from Hot Lake.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">ID</th>
                            <th align="left" colspan="1" rowspan="1">Description</th>
                            <th align="left" colspan="1" rowspan="1">Length</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td colspan="1" rowspan="1">CY41DRAFT_3272</td>
                            <td colspan="1" rowspan="1">Arabinose efflux permease family
                                <break/>protein</td>
                            <td colspan="1" rowspan="1">434</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">HLSNC01_00824</td>
                            <td colspan="1" rowspan="1">ATPase components of ABC
                                <break/>transporters</td>
                            <td colspan="1" rowspan="1">547</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">HLSNC12_00368</td>
                            <td colspan="1" rowspan="1">Putative oligoketide cyclase/lipid
                                <break/>transport protein</td>
                            <td colspan="1" rowspan="1">152</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
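            <p>The ranking and filtering steps described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the MDRpred implementation: the set transporter_ids stands in for proteins identified as transporters by Pfam, the threshold min_votes and the function names are hypothetical, and the toy sequences are invented.</p>
            <code language="python">
```python
import re


def count_votes(sequence, patterns):
    """Number of patterns (models) with at least one match in the sequence."""
    return sum(1 for pattern in patterns if re.search(pattern, sequence))


def rank_candidates(proteins, patterns, transporter_ids, min_votes):
    """Rank annotated transporters by vote count, dropping low-vote proteins.

    proteins: dict mapping protein ID to amino acid sequence.
    transporter_ids: IDs assumed to be annotated as transporters
        (a hypothetical stand-in for the Pfam transporter filter).
    """
    scored = []
    for protein_id, sequence in proteins.items():
        if protein_id not in transporter_ids:
            continue  # skip proteins that are not annotated as transporters
        votes = count_votes(sequence, patterns)
        if votes >= min_votes:
            scored.append((protein_id, votes))
    # Highest vote count first; ties are broken by protein ID.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data for illustration only.
toy_patterns = [r"AQ.+AT", r"AR.+RL[DMPR-Y]", r"MR.+LL[STVW]"]
toy_proteins = {
    "p1": "MKAQLLGATPPARVVRLSK",
    "p2": "AQGGAT",
    "p3": "MKAQLLGATPPARVVRLSK",  # same sequence, but not a transporter
}
toy_transporters = {"p1", "p2"}
print(rank_candidates(toy_proteins, toy_patterns, toy_transporters, 2))
```
            </code>
            <p>Applying the transporter filter before thresholding reflects the point made above that the models were trained only on transporters and so are not expected to separate transporters from other proteins.</p>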
        </sec>
        <sec sec-type="discussion | conclusions">
            <title>Discussion and conclusions</title>
            <p>The explosion in the number of sequences available from a large number of sources has driven the need for better methods to capture patterns in distinct groups of functionally related sequences. Our method, based on linguistic approaches to pattern identification, has several advantages over existing methods. Because it does not require a sequence alignment, important and discriminatory sequence regions can be identified in functionally similar proteins that are highly divergent or whose evolutionary relationships are unclear. The wide range of grammars that can be applied in the framework is a significant strength, allowing flexible pattern discovery. In the current paper we use only variants of a protein regular expression grammar, but other grammars can easily be used depending on the application. For example, context-free grammars could be applied to better identify potential non-local interactions between different regions of the protein sequences.</p>
            <p>In the current study we have shown that PILGram can be successfully applied to identify patterns in protein sequences, first by application to known functional sequences from the PROSITE database, and then by application to a set of proteins related by function but whose functional determinants of specificity are not well understood. From our initial work with PROSITE families we found that some kinds of patterns may be more amenable than others to identification using PILGram, but this was a limited proof-of-concept application that merits further characterization. In the case of the zinc finger pattern, which has variable spacing between the active cysteine and histidine residues, we found that very accurate models could be obtained by taking a simple voting approach across multiple independent PILGram models.</p>
            <p>Application of our approach to the MDR sequences identified a set of over 30 individual PILGram models that, when combined, provided very good accuracy and positive predictive value relative to a combination of existing HMM models in Pfam. To our knowledge, this is the first attempt to develop a predictive model for MDR transporters across families. As with our results for PROSITE patterns, we found that these models could identify regions known to be important for substrate specificity in MDRs. This represents a step forward in the classification of this important group of transporters.</p>
            <p>The vast number of uncharacterized and often unculturable bacteria in environmental communities represents a large amount of genetic potential, given the ability of bacteria to share genetic information. As an example application, we ran our method on sequences identified from a moderately complex community derived from an extreme environment, in this case the Hot Lake unicyanobacterial consortia (
                <xref ref-type="bibr" rid="ref-9">Cole 
                    <italic toggle="yes">et al.</italic>, 2014</xref>). We identified five candidates that were strongly predicted by the combination of our models to be MDRs. Given that the positive predictive value of the combined method is nearly 30%, it is likely that one or two of these predictions are true positives. Further research is needed to predict the specific drug substrates of MDRs and other transporters.</p>
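            <p>The arithmetic behind the estimate of one or two true positives can be made explicit. The short Python sketch below is an illustration, not part of the published analysis: it takes the positive predictive value as exactly 0.30 and, as a simplifying assumption, treats the five predictions as independent. The function names are hypothetical.</p>
            <code language="python">
```python
def expected_true_positives(n_candidates, ppv):
    """Expected number of true positives among n_candidates predictions."""
    return n_candidates * ppv


def prob_at_least_one(n_candidates, ppv):
    """Probability that at least one prediction is a true positive.

    Assumption: predictions are independent, each correct with probability ppv.
    """
    return 1.0 - (1.0 - ppv) ** n_candidates


# Five candidates and a PPV taken as 0.30 (the text states "nearly 30%").
print(expected_true_positives(5, 0.30))
print(prob_at_least_one(5, 0.30))
```
            </code>
            <p>Under these assumptions the expected number of true positives is about 1.5, and the probability that at least one candidate is a true MDR transporter is roughly 0.83.</p>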
            <p>We believe that MDRpred, the method we describe, will complement other commonly used sequence annotation methods and that it provides a unique set of predictions about potential novel MDRs. Furthermore, the PILGram approach to identifying functional patterns in unaligned sequences has applications to many other problematic protein groups in which function is conserved but sequence is not.</p>
        </sec>
        <sec>
            <title>Data and Software availability</title>
            <sec>
                <title>Software access</title>
                <p>The PILGram software will be described in a separate publication (Gosink &amp; Bruillard, manuscript in preparation); in the meantime, the software is available upon request from the authors.</p>
            </sec>
            <sec>
                <title>Latest source code</title>
                <p>Code implementing the MDRpred algorithm as described is available on GitHub (
                    <ext-link ext-link-type="uri" xlink:href="http://github.com/biodataganache/MDRpred">http://github.com/biodataganache/MDRpred</ext-link>).</p>
            </sec>
            <sec>
                <title>Source code as at the time of publication</title>
                <p>
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/F1000Research/MDRpred/releases/tag/V2.0">https://github.com/F1000Research/MDRpred/releases/tag/V2.0</ext-link>
                </p>
            </sec>
            <sec>
                <title>Archived source code as at the time of publication</title>
                <p>
                    <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.17514">http://dx.doi.org/10.5281/zenodo.17514</ext-link>
                </p>
            </sec>
            <sec>
                <title>Software license</title>
                <p>Apache License v2.0</p>
                <p>
                    <bold>
                        <italic toggle="yes">Figshare:</italic>
                    </bold> Prediction of multi-drug resistance transporters dataset doi: 
                    <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.6084/m9.figshare.1415804">10.6084/m9.figshare.1415804</ext-link> (
                    <xref ref-type="bibr" rid="ref-26">McDermott 
                        <italic toggle="yes">et al.</italic>, 2015</xref>).</p>
            </sec>
        </sec>
    </body>
    <back>
        <sec id="S1" sec-type="supplementary material">
            <title>Supplementary material</title>
            <fig fig-type="figure" id="SF1" orientation="portrait" position="float">
                <label>Supplemental Figure 1. </label>
                <caption>
                    <title>Backus&#x2013;Naur form grammar for proteins.</title>
                </caption>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_Suppl_figure1.gif"/>
            </fig>
            <fig fig-type="figure" id="SF2" orientation="portrait" position="float">
                <label>Supplemental Figure 2. </label>
                <caption>
                    <title>Simple voting method for combining models.</title>
                    <p>
						
                        <bold>A. Accuracy of combined patterns for classification.</bold> Matches to PILGram-generated regular expression patterns for the MDR transporters were counted (X axis) and accuracy (Y axis) was calculated based on the known positive and negative example datasets (see text). Peak accuracy is attained when all 36 patterns match the sequence, indicating that the diversity of MDR transporter sequences is likely to be high. Redundancy analysis (
                        <xref ref-type="table" rid="T3">Table 3</xref>) shows that a similar accuracy can be obtained with seven patterns. 
                        <bold>B. Positive-predictive value of combined patterns for classification.</bold> Positive predictive value (the percentage of true positives in all positive predictions; Y axis) was calculated for each number of MDR transporter pattern matches (X axis). The maximum value is reached in sequences that match all patterns.</p>
                </caption>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_Suppl_figure2.gif"/>
            </fig>
            <fig fig-type="figure" id="SF3" orientation="portrait" position="float">
                <label>Supplemental Figure 3. </label>
                <caption>
                    <title>Comparison of MDRpred models.</title>
                    <p>The receiver operating characteristic (ROC) curves are shown for the simple vote combination of 36 models (black line), the logistic regression combination of 36 models (blue line), and the logistic regression combination of the selected seven models shown in 
                        <xref ref-type="table" rid="T3">Table 3</xref> (red line). The area under the curve (AUC) for each method is indicated in the legend. The results show that using all models in a score derived by logistic regression provides the best performance, though the other methods also perform adequately.</p>
                </caption>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_Suppl_figure3.gif"/>
            </fig>
            <fig fig-type="figure" id="SF4" orientation="portrait" position="float">
                <label>Supplemental Figure 4. </label>
                <caption>
                    <title>Clustering of MDRpred individual models.</title>
                    <p>For each labeled sequence (rows) we assessed the presence (red cell) or absence (white cell) of a match to any of the 36 MDRpred regular expression models (columns). Hierarchical clustering was used to highlight relationships between models based on their patterns of matches. Dendrograms for sequences and models are shown. The patterns highlight seven groups of models that share a large number of predictions. Representative members from each of these clusters are shown in 
                        <xref ref-type="table" rid="T3">Table 3</xref>.</p>
                </caption>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_Suppl_figure4.gif"/>
            </fig>
            <fig fig-type="figure" id="SF5" orientation="portrait" position="float">
                <label>Supplemental Figure 5. </label>
                <caption>
                    <title>Histogram of MDRpred votes for metagenomic sequences.</title>
                    <p>The number of sequences (Y axis) with N matches (X axis) in the Hot Lake metagenome (69,010 sequences total) is shown as a histogram. This plot can be compared with similar plots in 
                        <xref ref-type="fig" rid="SF2">Supplemental Figure 2</xref> showing accuracy and positive predictive value at each of these stringency thresholds.</p>
                </caption>
                <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/6999/83424c33-4b8d-4815-9859-38b476a7a8b7_Suppl_figure5.gif"/>
            </fig>
            <table-wrap id="ST1" orientation="portrait" position="anchor">
                <label>Supplemental Table 1. </label>
                <caption>
                    <title>PILGram models for MDR transporter prediction.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">Order</th>
                            <th align="left" colspan="1" rowspan="1">Physicochemical property (PP)</th>
                            <th align="left" colspan="1" rowspan="1">Simplified regular expression (RESmall)</th>
                            <th align="left" colspan="1" rowspan="1">Accuracy</th>
                            <th align="left" colspan="1" rowspan="1">Cluster</th>
                            <th align="left" colspan="1" rowspan="1">ClusterName</th>
                            <th align="left" colspan="1" rowspan="1">Grammar</th>
                            <th align="left" colspan="1" rowspan="1">Regular expression (RE)</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td colspan="1" rowspan="1">1</td>
                            <td colspan="1" rowspan="1">_SecondaryStrD1075</td>
                            <td colspan="1" rowspan="1">.{408}$</td>
                            <td colspan="1" rowspan="1">70%</td>
                            <td colspan="1" rowspan="1">1</td>
                            <td colspan="1" rowspan="1">Cluster 1</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">.{408}$</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">3</td>
                            <td colspan="1" rowspan="1">_SolventAccessibilityD3075</td>
                            <td colspan="1" rowspan="1">LM.*VV</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">1</td>
                            <td colspan="1" rowspan="1">Cluster 1</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">LM.*VV</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">4</td>
                            <td colspan="1" rowspan="1">_ChargeD1100</td>
                            <td colspan="1" rowspan="1">VG.*GL[^CQ].{233}</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">1</td>
                            <td colspan="1" rowspan="1">Cluster 1</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">VG.*GL[^QC].{233}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">5</td>
                            <td colspan="1" rowspan="1">_PolarityD3100</td>
                            <td colspan="1" rowspan="1">VG.*GL[^EF].{233}</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">1</td>
                            <td colspan="1" rowspan="1">Cluster 1</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">VG.*GL[^FE].{233}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">6</td>
                            <td colspan="1" rowspan="1">_HydrophobicityD3100</td>
                            <td colspan="1" rowspan="1">VG.*GL[^DH].{233}</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">1</td>
                            <td colspan="1" rowspan="1">Cluster 1</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">VG.*GL[^HD].{233}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>36</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>NA</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>D[^ADGHY]+[AEFHI].+SR</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>73%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>1</bold>
						</td>
                            <td colspan="1" rowspan="1">Cluster 1</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>D[IK-SCFK-WE]+[^YC-DQWK-WG].+SR</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">2</td>
                            <td colspan="1" rowspan="1">_SecondaryStrT13</td>
                            <td colspan="1" rowspan="1">GR[^A-P]</td>
                            <td colspan="1" rowspan="1">70%</td>
                            <td colspan="1" rowspan="1">2</td>
                            <td colspan="1" rowspan="1">AR-RL</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">GR[^PA-PA]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">19</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">AR.+AP[DGILNPQRVW]</td>
                            <td colspan="1" rowspan="1">74%</td>
                            <td colspan="1" rowspan="1">2</td>
                            <td colspan="1" rowspan="1">AR-RL</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">AR.+AP[^YAKASHKTMEHCKTEYMSECMYF]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">23</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">RT.*AG.+$</td>
                            <td colspan="1" rowspan="1">66%</td>
                            <td colspan="1" rowspan="1">2</td>
                            <td colspan="1" rowspan="1">AR-RL</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">RT.*AG.+$</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">29</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">[RSTVWY]*.{41}AR.+RL</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">2</td>
                            <td colspan="1" rowspan="1">AR-RL</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">[^EA-IEGH-NAAQ]*.{41}AR.+RL</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>31</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>NA</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>AR.+RL[DMPRSTVWY]</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>74%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>2</bold>
						</td>
                            <td colspan="1" rowspan="1">AR-RL</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>AR.+RL[PM-MDUR-YM]</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">33</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">AR.+LR</td>
                            <td colspan="1" rowspan="1">70%</td>
                            <td colspan="1" rowspan="1">2</td>
                            <td colspan="1" rowspan="1">AR-RL</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">AR.+LR</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">7</td>
                            <td colspan="1" rowspan="1">_NormalizedVDWVD2001</td>
                            <td colspan="1" rowspan="1">AQ.+AT</td>
                            <td colspan="1" rowspan="1">73%</td>
                            <td colspan="1" rowspan="1">3</td>
                            <td colspan="1" rowspan="1">AQ-T</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">AQ.+AT</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>8</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>_SolventAccessibilityD1100</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>AQ.+AT</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>73%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>3</bold>
						</td>
                            <td colspan="1" rowspan="1">AQ-T</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>AQ.+AT</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">22</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">AQ.*LF.*LT.*</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">3</td>
                            <td colspan="1" rowspan="1">AQ-T</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">AQ.*LF.*LT.*</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">30</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">AQ.*TI[AQRSTVWY]+</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">3</td>
                            <td colspan="1" rowspan="1">AQ-T</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">AQ.*TI[^LC-P]+</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">17</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">RQ.*LA[ACDE]</td>
                            <td colspan="1" rowspan="1">73%</td>
                            <td colspan="1" rowspan="1">4</td>
                            <td colspan="1" rowspan="1">RQ-L</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">RQ.*LA[^TF-YM]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>18</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>NA</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>[AC][DFGLMPQRVY]+RQ</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>75%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>4</bold>
						</td>
                            <td colspan="1" rowspan="1">RQ-L</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>[^DM-YYRL-VEUQ][GG-GFVMQD-DLUQ-</bold>
                                <break/>
                                <bold>RMPY]+RQ</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">20</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">RQ.+LL[AEFHKNRT]</td>
                            <td colspan="1" rowspan="1">74%</td>
                            <td colspan="1" rowspan="1">4</td>
                            <td colspan="1" rowspan="1">RQ-L</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">RQ.+LL[^PGDV-WSDM-QPYI-LC]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">25</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">RQ.*LA.{12}[^GM-Y]*[^MY]{14}</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">4</td>
                            <td colspan="1" rowspan="1">RQ-L</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">RQ.*LA.{12}[^M-YG]*[^YM]{14}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">26</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">[KLNPQSTVW]VR.*AV</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">5</td>
                            <td colspan="1" rowspan="1">VR</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">[^RA-IIEKCMAFIM]VR.*AV</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>27</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>NA</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>[DGLNPQRSTV]VR.+TV.+[CDEY]*$</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>76%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>5</bold>
						</td>
                            <td colspan="1" rowspan="1">VR</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>[LGDN-VU]VR.+TV.+[^QF-NUAF-WH]*$</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">28</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">[^D-LW]+VR.+ML</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">5</td>
                            <td colspan="1" rowspan="1">VR</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">[^MD-LW]+VR.+ML</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">35</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">[^A-KNR]SR.*AA[ADEGHIKSTVY]+</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">5</td>
                            <td colspan="1" rowspan="1">VR</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">[^NA-KR]SR.*AA[^FF-FLWL-QCFI-RK]+</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">12</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">RV.+IA.[K-S]</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">6</td>
                            <td colspan="1" rowspan="1">Cluster 6</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">RV.+IA.[MK-SS]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>13</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>NA</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>AQ.+RQ.{49}</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>75%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>6</bold>
						</td>
                            <td colspan="1" rowspan="1">Cluster 6</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>AQ.+RQ.{49}</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">14</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">NV.+AQ.{48}</td>
                            <td colspan="1" rowspan="1">70%</td>
                            <td colspan="1" rowspan="1">6</td>
                            <td colspan="1" rowspan="1">Cluster 6</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">NV.+AQ.{48}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">24</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">VV.+VR.*RQ.*LA*[ACDT]</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">6</td>
                            <td colspan="1" rowspan="1">Cluster 6</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">VV.+VR.*RQ.*LA*[^WE-SC]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">32</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">ID.+AQ.{48}</td>
                            <td colspan="1" rowspan="1">70%</td>
                            <td colspan="1" rowspan="1">6</td>
                            <td colspan="1" rowspan="1">Cluster 6</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">ID.+AQ.{48}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">34</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">.{45}AQ.+LV.*AR.*FN*</td>
                            <td colspan="1" rowspan="1">74%</td>
                            <td colspan="1" rowspan="1">6</td>
                            <td colspan="1" rowspan="1">Cluster 6</td>
                            <td colspan="1" rowspan="1">PRE</td>
                            <td colspan="1" rowspan="1">.{45}AQ.+LV.*AR.*FN*</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">9</td>
                            <td colspan="1" rowspan="1">_SolventAccessibilityT12</td>
                            <td colspan="1" rowspan="1">MF.*QL</td>
                            <td colspan="1" rowspan="1">72%</td>
                            <td colspan="1" rowspan="1">7</td>
                            <td colspan="1" rowspan="1">M-L</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">MF.*QL</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">10</td>
                            <td colspan="1" rowspan="1">_NormalizedVDWVT13</td>
                            <td colspan="1" rowspan="1">MF.*QL</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">7</td>
                            <td colspan="1" rowspan="1">M-L</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">MF.*QL</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">11</td>
                            <td colspan="1" rowspan="1">_PolarizabilityT12</td>
                            <td colspan="1" rowspan="1">MF.*QL</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">7</td>
                            <td colspan="1" rowspan="1">M-L</td>
                            <td colspan="1" rowspan="1">PP-PRE</td>
                            <td colspan="1" rowspan="1">MF.*QL</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">15</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">MR.+LV.{49}.{49}</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">7</td>
                            <td colspan="1" rowspan="1">M-L</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">MR.+LV.{49}.{49}</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">
							
                                <bold>16</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>NA</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>MR.+LL[STVW]</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>73%</bold>
						</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>7</bold>
						</td>
                            <td colspan="1" rowspan="1">M-L</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">
							
                                <bold>MR.+LL[^HA-RD]</bold>
						</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">21</td>
                            <td colspan="1" rowspan="1">NA</td>
                            <td colspan="1" rowspan="1">MR.+AQ</td>
                            <td colspan="1" rowspan="1">71%</td>
                            <td colspan="1" rowspan="1">7</td>
                            <td colspan="1" rowspan="1">M-L</td>
                            <td colspan="1" rowspan="1">TMR-PRE</td>
                            <td colspan="1" rowspan="1">MR.+AQ</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <table-wrap id="ST2" orientation="portrait" position="anchor">
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="2" rowspan="1">Columns are as follows</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td colspan="1" rowspan="1">Order</td>
                            <td colspan="1" rowspan="1">The order in which the pattern was generated (no significance)</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">PP</td>
                            <td colspan="1" rowspan="1">If the pattern includes physicochemical properties (PP), this column
                                <break/>indicates which property was used. NA indicates that no PP was
                                <break/>used.</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">RESmall</td>
                            <td colspan="1" rowspan="1">A canonical regular expression generated by PILGram (non-
                                <break/>canonical expressions are in the column labeled RE)</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">Accuracy</td>
                            <td colspan="1" rowspan="1">The cross-validated accuracy of this pattern</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">Cluster</td>
                            <td colspan="1" rowspan="1">The cluster assignment from clustering patterns based on
                                <break/>which examples were matched</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">ClusterName</td>
                            <td colspan="1" rowspan="1">A descriptive name for the cluster</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">Grammar</td>
                            <td colspan="1" rowspan="1">The original grammar used to generate the pattern. Note
                                <break/>that TMR-PRE yields patterns similar to those from PRE because it
                                <break/>only biases the search space at the outset.</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">RE</td>
                            <td colspan="1" rowspan="1">The original regular expression generated by PILGram. This is
                                <break/>functionally equivalent to RESmall but frequently contains
                                <break/>redundant character sets.</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
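            <p>The relationship between the RE and RESmall columns can be illustrated with a minimal sketch. It assumes that the alphabet is the 20 standard one-letter amino acid codes; the helper name matched_residues is hypothetical and is not part of PILGram. The snippet expands a bracketed character set into the residues it accepts, using the final sets of patterns 17 and 30 as examples.</p>
            <preformat>
```python
import re

# Assumption: patterns are evaluated over the 20 standard amino acid letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def matched_residues(char_class):
    """Return the residues accepted by one bracketed character set."""
    pattern = re.compile(char_class)
    return "".join(aa for aa in AMINO_ACIDS if pattern.fullmatch(aa))


# Pattern 17: RESmall set followed by the original RE set.
print(matched_residues("[ACDE]"), matched_residues("[^TF-YM]"))
# prints "ACDE ACDE"

# Pattern 30: RESmall set followed by the original RE set.
print(matched_residues("[AQRSTVWY]"), matched_residues("[^LC-P]"))
# prints "AQRSTVWY AQRSTVWY"
```
            </preformat>
            <p>In this sketch, two sets are treated as equivalent when they expand to the same string of residues, so the same helper could be applied to other bracketed sets to compare their RE and RESmall forms.</p>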
            <table-wrap id="ST3" orientation="portrait" position="anchor">
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="1" rowspan="1">PPs included</th>
                            <th colspan="1" rowspan="1"/>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td colspan="1" rowspan="1">_ChargeD1100</td>
                            <td colspan="1" rowspan="1">Distribution of positively charged amino acids [KR]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_HydrophobicityD3100</td>
                            <td colspan="1" rowspan="1">Distribution of hydrophobic amino acids [CLVIMFW]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_NormalizedVDWVD2001</td>
                            <td colspan="1" rowspan="1">Distribution of medium-sized amino acids [NVEQIL]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_NormalizedVDWVT13</td>
                            <td colspan="1" rowspan="1">Transition between small and large amino acids [GASTPD]
                                <break/>&lt;-&gt;[MHKFRYW]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_PolarityD3100</td>
                            <td colspan="1" rowspan="1">Distribution of highly polar amino acids [KMHFRYW]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_PolarizabilityT12</td>
                            <td colspan="1" rowspan="1">Transition between low and medium polarizable amino acids
                                <break/>[GASDT]&lt;-&gt;[CPNVEQIL]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_SecondaryStrD1075</td>
                            <td colspan="1" rowspan="1">Distribution of amino acids that tend to form alpha helices
                                <break/>[EALMQKRH]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_SecondaryStrT13</td>
                            <td colspan="1" rowspan="1">Transition between helical and coil amino acids
                                <break/>[EALMQKRH]&lt;-&gt;[GNPSD]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_SolventAccessibilityD1100</td>
                            <td colspan="1" rowspan="1">Distribution of amino acids that tend to be buried [ALFCGIVW]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_SolventAccessibilityD3075</td>
                            <td colspan="1" rowspan="1">Distribution of amino acids that tend to be intermediate
                                <break/>between buried and exposed [MPSTHY]</td>
                        </tr>
                        <tr>
                            <td colspan="1" rowspan="1">_SolventAccessibilityT12</td>
                            <td colspan="1" rowspan="1">Transition between buried and exposed amino acids
                                <break/>[ALFCGIVW]&lt;-&gt;[RKQEND]</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
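            <p>As a rough illustration of what a transition property such as _SolventAccessibilityT12 measures, the sketch below counts adjacent residue pairs that switch between the buried and exposed classes listed above and divides by the number of adjacent pairs. This follows the commonly stated composition/transition/distribution definition and is an assumption; the exact scaling used by the descriptor software may differ, and the function name transition_frequency and the example sequence are hypothetical.</p>
            <preformat>
```python
# Residue classes copied from the table entry for _SolventAccessibilityT12.
BURIED = "ALFCGIVW"
EXPOSED = "RKQEND"


def transition_frequency(sequence, class_a, class_b):
    """Fraction of adjacent residue pairs that switch between two classes."""
    if not sequence[1:]:
        # Fewer than two residues: there are no adjacent pairs to count.
        return 0.0
    a, b = set(class_a), set(class_b)
    switches = sum(
        1
        for left, right in zip(sequence, sequence[1:])
        if (left in a and right in b) or (left in b and right in a)
    )
    return switches / (len(sequence) - 1)


# Toy sequence (not from the data set): A-K and Q-L are the two transitions
# among five adjacent pairs.
print(transition_frequency("MFAKQL", BURIED, EXPOSED))
# prints 0.4
```
            </preformat>
            <p>Residues that belong to neither class, such as M in the toy sequence, never contribute a transition but still count toward the number of adjacent pairs.</p>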
        </sec>
        <ref-list>
            <ref id="ref-1">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Aminov</surname>
                            <given-names>RI</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Mackie</surname>
                            <given-names>RI</given-names>
                        </name>
					</person-group>:
                    <article-title>Evolution and ecology of antibiotic resistance genes.</article-title>
                    <source>
						
                        <italic toggle="yes">FEMS Microbiol Lett.</italic>
					</source>
                    <year>2007</year>;<volume>271</volume>(<issue>2</issue>):<fpage>147</fpage>&#x2013;<lpage>161</lpage>.
                    <pub-id pub-id-type="pmid">17490428</pub-id>
                    <pub-id pub-id-type="doi">10.1111/j.1574-6968.2007.00757.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Anderson</surname>
                            <given-names>JW</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Tataru</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Staines</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Evolving stochastic context-free grammars for RNA secondary structure prediction.</article-title>
                    <source>
						
                        <italic toggle="yes">BMC Bioinformatics.</italic>
					</source>
                    <year>2012</year>;<volume>13</volume>:<fpage>78</fpage>.
                    <pub-id pub-id-type="pmid">22559985</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-13-78</pub-id>
                    <pub-id pub-id-type="pmcid">3464655</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Barghash</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Helms</surname>
                            <given-names>V</given-names>
                        </name>
					</person-group>:
                    <article-title>Transferring functional annotations of membrane transporters on the basis of sequence similarity and sequence motifs.</article-title>
                    <source>
						
                        <italic toggle="yes">BMC Bioinformatics.</italic>
					</source>
                    <year>2013</year>;<volume>14</volume>:<fpage>343</fpage>.
                    <pub-id pub-id-type="pmid">24283849</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-14-343</pub-id>
                    <pub-id pub-id-type="pmcid">4219331</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bateman</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Birney</surname>
                            <given-names>E</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Durbin</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The Pfam protein families database.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2000</year>;<volume>28</volume>(<issue>1</issue>):<fpage>263</fpage>&#x2013;<lpage>266</lpage>.
                    <pub-id pub-id-type="pmid">10592242</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/28.1.263</pub-id>
                    <pub-id pub-id-type="pmcid">102420</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Blair</surname>
                            <given-names>JM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Webber</surname>
                            <given-names>MA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Baylay</surname>
                            <given-names>AJ</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Molecular mechanisms of antibiotic resistance.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Rev Microbiol.</italic>
					</source>
                    <year>2015</year>;<volume>13</volume>(<issue>1</issue>):<fpage>42</fpage>&#x2013;<lpage>51</lpage>.
                    <pub-id pub-id-type="pmid">25435309</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nrmicro3380</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bolger</surname>
                            <given-names>AM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Lohse</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Usadel</surname>
                            <given-names>B</given-names>
                        </name>
					</person-group>:
                    <article-title>Trimmomatic: a flexible trimmer for Illumina sequence data.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2014</year>;<volume>30</volume>(<issue>15</issue>):<fpage>2114</fpage>&#x2013;<lpage>2120</lpage>.
                    <pub-id pub-id-type="pmid">24695404</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btu170</pub-id>
                    <pub-id pub-id-type="pmcid">4103590</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Cao</surname>
                            <given-names>DS</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Xu</surname>
                            <given-names>QS</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Liang</surname>
                            <given-names>YZ</given-names>
                        </name>
					</person-group>:
                    <article-title>propy: a tool to generate various modes of Chou's PseAAC.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2013</year>;<volume>29</volume>(<issue>7</issue>):<fpage>960</fpage>&#x2013;<lpage>962</lpage>.
                    <pub-id pub-id-type="pmid">23426256</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btt072</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <mixed-citation publication-type="book">
                    <collab>CDC</collab>.
                    <article-title>Antibiotic resistance threats in the United States.</article-title> <year>2013</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.cdc.gov/drugresistance/threat-report-2013/pdf/ar-threats-2013-508.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Cole</surname>
                            <given-names>JK</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hutchison</surname>
                            <given-names>JR</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Renslow</surname>
                            <given-names>RS</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Phototrophic biofilm assembly in microbial-mat-derived unicyanobacterial consortia: model systems for the study of autotroph-heterotroph interactions.</article-title>
                    <source>
						
                        <italic toggle="yes">Front Microbiol.</italic>
					</source>
                    <year>2014</year>;<volume>5</volume>:<fpage>109</fpage>.
                    <pub-id pub-id-type="pmid">24778628</pub-id>
                    <pub-id pub-id-type="doi">10.3389/fmicb.2014.00109</pub-id>
                    <pub-id pub-id-type="pmcid">3985010</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>D'Costa</surname>
                            <given-names>VM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Griffiths</surname>
                            <given-names>E</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Wright</surname>
                            <given-names>GD</given-names>
                        </name>
					</person-group>:
                    <article-title>Expanding the soil antibiotic resistome: exploring environmental diversity.</article-title>
                    <source>
						
                        <italic toggle="yes">Curr Opin Microbiol.</italic>
					</source>
                    <year>2007</year>;<volume>10</volume>(<issue>5</issue>):<fpage>481</fpage>&#x2013;<lpage>489</lpage>.
                    <pub-id pub-id-type="pmid">17951101</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.mib.2007.08.009</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Dubchak</surname>
                            <given-names>I</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Muchnik</surname>
                            <given-names>I</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Holbrook</surname>
                            <given-names>SR</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Prediction of protein folding class using global description of amino acid sequence.</article-title>
                    <source>
						
                        <italic toggle="yes">Proc Natl Acad Sci U S A.</italic>
					</source>
                    <year>1995</year>;<volume>92</volume>(<issue>19</issue>):<fpage>8700</fpage>&#x2013;<lpage>8704</lpage>.
                    <pub-id pub-id-type="pmid">7568000</pub-id>
                    <pub-id pub-id-type="doi">10.1073/pnas.92.19.8700</pub-id>
                    <pub-id pub-id-type="pmcid">41034</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Durbin</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Eddy</surname>
                            <given-names>SR</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Krogh</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.</article-title> Cambridge University Press, <year>1998</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.cambridge.org/ae/academic/subjects/life-sciences/genomics-bioinformatics-and-systems-biology/biological-sequence-analysis-probabilistic-models-proteins-and-nucleic-acids">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-13">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Dyrka</surname>
                            <given-names>W</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Nebel</surname>
                            <given-names>JC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kotulska</surname>
                            <given-names>M</given-names>
                        </name>
					</person-group>:
                    <article-title>Probabilistic grammatical model for helix-helix contact site classification.</article-title>
                    <source>
						
                        <italic toggle="yes">Algorithms Mol Biol.</italic>
					</source>
                    <year>2013</year>;<volume>8</volume>(<issue>1</issue>):<fpage>31</fpage>.
                    <pub-id pub-id-type="pmid">24350601</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1748-7188-8-31</pub-id>
                    <pub-id pub-id-type="pmcid">3892132</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Forsberg</surname>
                            <given-names>KJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Reyes</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The shared antibiotic resistome of soil bacteria and human pathogens.</article-title>
                    <source>
						
                        <italic toggle="yes">Science.</italic>
					</source>
                    <year>2012</year>;<volume>337</volume>(<issue>6098</issue>):<fpage>1107</fpage>&#x2013;<lpage>1111</lpage>.
                    <pub-id pub-id-type="pmid">22936781</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.1220761</pub-id>
                    <pub-id pub-id-type="pmcid">4070369</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-15">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Gough</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chothia</surname>
                            <given-names>C</given-names>
                        </name>
					</person-group>:
                    <article-title>SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2002</year>;<volume>30</volume>(<issue>1</issue>):<fpage>268</fpage>&#x2013;<lpage>272</lpage>.
                    <pub-id pub-id-type="pmid">11752312</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/30.1.268</pub-id>
                    <pub-id pub-id-type="pmcid">99153</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Hofmann</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bucher</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Falquet</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The PROSITE database, its status in 1999.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>1999</year>;<volume>27</volume>(<issue>1</issue>):<fpage>215</fpage>&#x2013;<lpage>219</lpage>.
                    <pub-id pub-id-type="pmid">9847184</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/27.1.215</pub-id>
                    <pub-id pub-id-type="pmcid">148139</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Hyatt</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>GL</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Locascio</surname>
                            <given-names>PF</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Prodigal: prokaryotic gene recognition and translation initiation site identification.</article-title>
                    <source>
						
                        <italic toggle="yes">BMC Bioinformatics.</italic>
					</source>
                    <year>2010</year>;<volume>11</volume>:<fpage>119</fpage>.
                    <pub-id pub-id-type="pmid">20211023</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-11-119</pub-id>
                    <pub-id pub-id-type="pmcid">2848648</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Langmead</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Salzberg</surname>
                            <given-names>SL</given-names>
                        </name>
					</person-group>:
                    <article-title>Fast gapped-read alignment with Bowtie 2.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Methods.</italic>
					</source>
                    <year>2012</year>;<volume>9</volume>(<issue>4</issue>):<fpage>357</fpage>&#x2013;<lpage>359</lpage>.
                    <pub-id pub-id-type="pmid">22388286</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
                    <pub-id pub-id-type="pmcid">3322381</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Leather</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bonilla</surname>
                            <given-names>E</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>O&#x2019;Boyle</surname>
                            <given-names>M</given-names>
                        </name>
					</person-group>:
                    <article-title>Automatic Feature Generation for Machine Learning Based Optimizing Compilation.</article-title>
                    <source>
						
                        <italic toggle="yes">International Symposium on Code Generation and Optimization.</italic>
					</source>
                    <year>2009</year>.
                    <pub-id pub-id-type="doi">10.1109/CGO.2009.21</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Handsaker</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Wysoker</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The Sequence Alignment/Map format and SAMtools.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2009</year>;<volume>25</volume>(<issue>16</issue>):<fpage>2078</fpage>&#x2013;<lpage>2079</lpage>.
                    <pub-id pub-id-type="pmid">19505943</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp352</pub-id>
                    <pub-id pub-id-type="pmcid">2723002</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>W</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Sharma</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kaur</surname>
                            <given-names>P</given-names>
                        </name>
					</person-group>:
                    <article-title>The DrrAB efflux system of 
                        <italic toggle="yes">Streptomyces peucetius</italic> is a multidrug transporter of broad substrate specificity.</article-title>
                    <source>
						
                        <italic toggle="yes">J Biol Chem.</italic>
					</source>
                    <year>2014</year>;<volume>289</volume>(<issue>18</issue>):<fpage>12633</fpage>&#x2013;<lpage>12646</lpage>.
                    <pub-id pub-id-type="pmid">24634217</pub-id>
                    <pub-id pub-id-type="doi">10.1074/jbc.M113.536136</pub-id>
                    <pub-id pub-id-type="pmcid">4007453</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Lindemann</surname>
                            <given-names>SR</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Moran</surname>
                            <given-names>JJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Stegen</surname>
                            <given-names>JC</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The epsomitic phototrophic microbial mat of Hot Lake, Washington: community structural responses to seasonal cycling.</article-title>
                    <source>
						
                        <italic toggle="yes">Front Microbiol.</italic>
					</source>
                    <year>2013</year>;<volume>4</volume>:<fpage>323</fpage>.
                    <pub-id pub-id-type="pmid">24312082</pub-id>
                    <pub-id pub-id-type="doi">10.3389/fmicb.2013.00323</pub-id>
                    <pub-id pub-id-type="pmcid">3826063</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Markowitz</surname>
                            <given-names>VM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Mavromatis</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ivanova</surname>
                            <given-names>NN</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>IMG ER: a system for microbial genome annotation expert review and curation.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2009</year>;<volume>25</volume>(<issue>17</issue>):<fpage>2271</fpage>&#x2013;<lpage>2278</lpage>.
                    <pub-id pub-id-type="pmid">19561336</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btp393</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-24">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Martinez</surname>
                            <given-names>JL</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>S&#x00e1;nchez</surname>
                            <given-names>MB</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Mart&#x00ed;nez-Solano</surname>
                            <given-names>L</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Functional role of bacterial multidrug efflux pumps in microbial natural ecosystems.</article-title>
                    <source>
						
                        <italic toggle="yes">FEMS Microbiol Rev.</italic>
					</source>
                    <year>2009</year>;<volume>33</volume>(<issue>2</issue>):<fpage>430</fpage>&#x2013;<lpage>449</lpage>.
                    <pub-id pub-id-type="pmid">19207745</pub-id>
                    <pub-id pub-id-type="doi">10.1111/j.1574-6976.2008.00157.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-25">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>McDermott</surname>
                            <given-names>JE</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Corrigan</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Peterson</surname>
                            <given-names>E</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Computational prediction of type III and IV secreted effectors in gram-negative bacteria.</article-title>
                    <source>
						
                        <italic toggle="yes">Infect Immun.</italic>
					</source>
                    <year>2011</year>;<volume>79</volume>(<issue>1</issue>):<fpage>23</fpage>&#x2013;<lpage>32</lpage>.
                    <pub-id pub-id-type="pmid">20974833</pub-id>
                    <pub-id pub-id-type="doi">10.1128/IAI.00537-10</pub-id>
                    <pub-id pub-id-type="pmcid">3019878</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-26">
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>McDermott</surname>
                            <given-names>JE</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bruillard</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Overall</surname>
                            <given-names>CC</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Prediction of multi-drug resistance transporters dataset.</article-title>
                    <source>
						
                        <italic toggle="yes">Figshare.</italic>
					</source>
                    <year>2015</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.6084/m9.figshare.1415804">Data Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-27">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Nikaido</surname>
                            <given-names>H</given-names>
                        </name>
					</person-group>:
                    <article-title>Multidrug resistance in bacteria.</article-title>
                    <source>
						
                        <italic toggle="yes">Annu Rev Biochem.</italic>
					</source>
                    <year>2009</year>;<volume>78</volume>:<fpage>119</fpage>&#x2013;<lpage>146</lpage>.
                    <pub-id pub-id-type="pmid">19231985</pub-id>
                    <pub-id pub-id-type="doi">10.1146/annurev.biochem.78.082907.145923</pub-id>
                    <pub-id pub-id-type="pmcid">2839888</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-28">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Nikaido</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Pag&#x00e8;s</surname>
                            <given-names>JM</given-names>
                        </name>
					</person-group>:
                    <article-title>Broad-specificity efflux pumps and their role in multidrug resistance of Gram-negative bacteria.</article-title>
                    <source>
						
                        <italic toggle="yes">FEMS Microbiol Rev.</italic>
					</source>
                    <year>2012</year>;<volume>36</volume>(<issue>2</issue>):<fpage>340</fpage>&#x2013;<lpage>363</lpage>.
                    <pub-id pub-id-type="pmid">21707670</pub-id>
                    <pub-id pub-id-type="doi">10.1111/j.1574-6976.2011.00290.x</pub-id>
                    <pub-id pub-id-type="pmcid">3546547</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-29">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Peng</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Leung</surname>
                            <given-names>HC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Yiu</surname>
                            <given-names>SM</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>IDBA-UD: a 
                        <italic toggle="yes">de novo</italic> assembler for single-cell and metagenomic sequencing data with highly uneven depth.</article-title>
                    <source>
						
                        <italic toggle="yes">Bioinformatics.</italic>
					</source>
                    <year>2012</year>;<volume>28</volume>(<issue>11</issue>):<fpage>1420</fpage>&#x2013;<lpage>1428</lpage>.
                    <pub-id pub-id-type="pmid">22495754</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/bts174</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-30">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Piddock</surname>
                            <given-names>LJ</given-names>
                        </name>
					</person-group>:
                    <article-title>Multidrug-resistance efflux pumps - not just for resistance.</article-title>
                    <source>
						
                        <italic toggle="yes">Nat Rev Microbiol.</italic>
					</source>
                    <year>2006</year>;<volume>4</volume>(<issue>8</issue>):<fpage>629</fpage>&#x2013;<lpage>636</lpage>.
                    <pub-id pub-id-type="pmid">16845433</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nrmicro1464</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-31">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Saier</surname>
                            <given-names>MH</given-names>
                            <suffix>Jr</suffix>
                        </name>
						
                        <name name-style="western">
                            <surname>Reddy</surname>
                            <given-names>VS</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Tamang</surname>
                            <given-names>DG</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The transporter classification database.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2014</year>;<volume>42</volume>(<issue>Database issue</issue>):<fpage>D251</fpage>&#x2013;<lpage>D258</lpage>.
                    <pub-id pub-id-type="pmid">24225317</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkt1097</pub-id>
                    <pub-id pub-id-type="pmcid">3964967</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-32">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Salzberg</surname>
                            <given-names>SL</given-names>
                        </name>
					</person-group>:
                    <article-title>On comparing classifiers: Pitfalls to avoid and a recommended approach.</article-title>
                    <source>
						
                        <italic toggle="yes">Data Min Knowl Discov.</italic>
					</source>
                    <year>1997</year>;<volume>1</volume>(<issue>3</issue>):<fpage>317</fpage>&#x2013;<lpage>328</lpage>.
                    <pub-id pub-id-type="doi">10.1023/A:1009752403260</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-33">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Samudrala</surname>
                            <given-names>R</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Heffron</surname>
                            <given-names>F</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>McDermott</surname>
                            <given-names>JE</given-names>
                        </name>
					</person-group>:
                    <article-title>Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems.</article-title>
                    <source>
						
                        <italic toggle="yes">PLoS Pathog.</italic>
					</source>
                    <year>2009</year>;<volume>5</volume>(<issue>4</issue>):<fpage>e1000375</fpage>.
                    <pub-id pub-id-type="pmid">19390620</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.ppat.1000375</pub-id>
                    <pub-id pub-id-type="pmcid">2668754</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-34">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Schaadt</surname>
                            <given-names>NS</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Christoph</surname>
                            <given-names>J</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Helms</surname>
                            <given-names>V</given-names>
                        </name>
					</person-group>:
                    <article-title>Classifying substrate specificities of membrane transporters from 
                        <italic toggle="yes">Arabidopsis thaliana</italic>.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2010</year>;<volume>50</volume>(<issue>10</issue>):<fpage>1899</fpage>&#x2013;<lpage>1905</lpage>.
                    <pub-id pub-id-type="pmid">20925375</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci100243m</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-35">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Schaadt</surname>
                            <given-names>NS</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Helms</surname>
                            <given-names>V</given-names>
                        </name>
					</person-group>:
                    <article-title>Functional classification of membrane transporters and channels based on filtered TM/non-TM amino acid composition.</article-title>
                    <source>
						
                        <italic toggle="yes">Biopolymers.</italic>
					</source>
                    <year>2012</year>;<volume>97</volume>(<issue>7</issue>):<fpage>558</fpage>&#x2013;<lpage>567</lpage>.
                    <pub-id pub-id-type="pmid">22492257</pub-id>
                    <pub-id pub-id-type="doi">10.1002/bip.22043</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-36">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Yin</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>He</surname>
                            <given-names>X</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Szewczyk</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Structure of the multidrug transporter EmrD from 
                        <italic toggle="yes">Escherichia coli</italic>.</article-title>
                    <source>
						
                        <italic toggle="yes">Science.</italic>
					</source>
                    <year>2006</year>;<volume>312</volume>(<issue>5774</issue>):<fpage>741</fpage>&#x2013;<lpage>744</lpage>.
                    <pub-id pub-id-type="pmid">16675700</pub-id>
                    <pub-id pub-id-type="doi">10.1126/science.1125629</pub-id>
                    <pub-id pub-id-type="pmcid">3152482</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report8819">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.6999.r8819</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Baltrus</surname>
                        <given-names>David Anthony</given-names>
                    </name>
                    <xref ref-type="aff" rid="r8819a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r8819a1">
                    <label>1</label>School of Plant Sciences, University of Arizona, Tucson, AZ, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>17</day>
                <month>6</month>
                <year>2015</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2015 Baltrus DA</copyright-statement>
                <copyright-year>2015</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport8819" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.6200.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>It's good to go; my critiques have been adequately addressed.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report8818">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.6999.r8818</article-id>
            <title-group>
                <article-title>Reviewer response for version 2</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Flight</surname>
                        <given-names>Robert</given-names>
                    </name>
                    <xref ref-type="aff" rid="r8818a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r8818a1">
                    <label>1</label>Resource Center for Stable Isotope-Resolved Metabolomics, University of Kentucky, Lexington, KY, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>1</day>
                <month>6</month>
                <year>2015</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2015 Flight R</copyright-statement>
                <copyright-year>2015</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport8818" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.6200.2"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Much improved, and it is now much clearer how the PILGram regexes are being generated and used.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report7890">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.6648.r7890</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Baltrus</surname>
                        <given-names>David Anthony</given-names>
                    </name>
                    <xref ref-type="aff" rid="r7890a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r7890a1">
                    <label>1</label>School of Plant Sciences, University of Arizona, Tucson, AZ, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>25</day>
                <month>3</month>
                <year>2015</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2015 Baltrus DA</copyright-statement>
                <copyright-year>2015</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport7890" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.6200.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Given the growing problem of antibiotic resistance across bacterial pathogens, multi-drug resistance (MDR) transporters are an intrinsically important group of bacterial proteins. However, unlike other resistance protein families where precise characterization is possible (e.g. &#x03b2;-lactamases), and while we can often "see" MDR transporters in bacterial genomes due to sequence similarity, it is nearly impossible at the present time to accurately annotate what antibiotic substrates these transporters act on. As genome sequences pile up, and whole genome sequences begin to be used to predict drug resistances/sensitivities in clinical settings, it becomes increasingly important to accurately describe the roles of MDR transporter complexes.</p>
            <p>The article from McDermott 
                <italic>et al</italic>. focuses on using grammar-based, alignment-free approaches to predict and classify MDR protein complexes, and develops a program called PILGram for this purpose. The authors do a good job of describing the problem they are addressing throughout the introduction, and of giving examples of the utility of grammar-based approaches to an audience that is likely not well versed in these analyses. Realistically, the results are not exceptional. The model does a great job of predicting ser/thr phosphatase patterns, but current approaches using similarity-based searches do a pretty good job as well. The model does slightly worse than conventional methods with the prediction of zinc fingers, likely because of their unstructured nature, but taking the consensus using grammar-based approaches is still on par with other widely used methods. The authors don't improve on predictions for these two classes with grammar-based methods, but they provide a good demonstration that such models can work on par with conventional analyses. I think it's important to develop both sequence-based and sequence-independent approaches and that these go hand in hand rather than act in a mutually exclusive way.</p>
            <p>The rubber meets the road when the authors try to predict novel MDR classes, and the results are not great. While numerically the data seem to show that PILGram can be trained to identify MDR transporters with accuracy above chance, it misses a lot. On the other hand, so do conventional analyses, which is what makes this an interesting problem to tackle. Moreover, the authors use a metagenomic dataset (nicely done by the way, I wasn't expecting that) to try and predict novel MDR transporters. The data do suggest that PILGram can pick up&#x00a0;
                <italic>something</italic>&#x00a0;of a signal within these metagenomes compared to a soil sample, which is encouraging. However, I'm left with a bit of an unenthusiastic taste in my mouth when I see the table of "high confidence novel transporter proteins" and 3 of the 5 are annotated as some kind of transporter, and the other two are FtsH and a related protein. The authors do point out that it's likely that at least one of these is a true positive (my guess, it's not either of the last two), but it would be good if, in the discussion, the authors could further flesh out what differentiates the information PILGram provides from simply looking through the annotations for "transporter" proteins, given that 3 of 5 are likely transporting something based on the JGI annotation. Said more plainly, it would be good if the authors could describe what PILGram is telling them about the first three genes in Table 4 that the annotations don't. I think this would really wrap the story up better.</p>
            <p>My overall impression is that this is a solid paper, albeit without really exceptional results. However, the use of these alignment-independent grammar models and pipelines, and the description of how they behave on real-world data, is a step forward and therefore worthy of being published. The data are solid, and the authors do a good job of describing the limitations. We need anything and everything possible to be able to predict MDR proteins given the large amounts of genome data that are going to be piling up. PILGram will only get better with larger training sets.</p>
            <p>Slight side note: I'm wondering whether glc-1 from 
                <italic>C. elegans</italic> should be included in the training set for the "Prediction of multi-drug resistance transporters dataset" table. It seems odd to me given that these are bacterial proteins.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment1340-7890">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>McDermott</surname>
                            <given-names>Jason</given-names>
                        </name>
                        <aff>Pacific Northwest National Laboratory, USA</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>None (aside from the fact that I'm the author of the paper)</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>8</day>
                    <month>5</month>
                    <year>2015</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We thank both the reviewers for their insightful and very helpful comments. We have revised the manuscript according to the reviewers&#x2019; suggestions and feel that it is substantially improved in terms of clarity and potential for reproducibility. Importantly, we have provided more complete data, results, and code that we employ in the paper.</p>
                <p>Dr. Baltrus points out that two of the high-confidence predictions made by the method are annotated as non-transporter proteins. This issue arises from the fact that a preliminary screen was done on the metagenome to identify transporters (in the &#x2018;Identification of novel MDR transporter candidates from environmental microbiomes&#x2019; section) using Pfam families. As explained in the paper, this is necessary because MDRpred was trained only on transporter proteins and so may give spurious results when applied to non-transporters. However, it looks like the FtsH sequence was erroneously included as a transporter, probably because it has an ABC-associated ATPase domain. The other protein, listed as a &#x201c;Bacterial cell division membrane protein&#x201d;, has a membrane domain associated with O-antigen, but this appears to be involved in synthesis of O-antigen and not transport. These Pfam families have both been removed from our list for transporters, which is now provided as a data file. Table 4 (now Table 6) has been updated by removing these two predictions, and a complete set of higher-confidence predictions is now included as a Supplemental Data file.</p>
                <p>Dr. Baltrus also asked for clarification of what MDRpred would be giving beyond examining annotations for &#x201c;transporter&#x201d; proteins. The value of MDRpred is that it will predict which transporter proteins are capable of transporting a specific class of substrates, namely antibiotic drug compounds. As we point out in the text, this is a broad class of compounds and is often incompletely defined for individual well-studied transporters. However, our method is able to accurately classify MDR transporters 
                    <italic>relative</italic> to other transporters that do not transport drug compounds. Even in the case of the first and third predictions (annotated as arabinose and lipid transporters, respectively) the specific annotations are based on best matches by Pfam or BLAST, and may not accurately represent the substrates that are actually transported depending on how close the matches are. We now include an extended discussion of the interpretation of the list of proteins found in Table 4 (now Table 6) in the Results section.</p>
                <p>Glc-1, a glutamate-gated transporter, can confer drug resistance in 
                    <italic>C. elegans</italic>, though it appears that it does not transport drugs itself. We&#x2019;ve removed it from the training set.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report7889">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.6648.r7889</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Flight</surname>
                        <given-names>Robert</given-names>
                    </name>
                    <xref ref-type="aff" rid="r7889a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r7889a1">
                    <label>1</label>Resource Center for Stable Isotope-Resolved Metabolomics, University of Kentucky, Lexington, KY, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>19</day>
                <month>3</month>
                <year>2015</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2015 Flight R</copyright-statement>
                <copyright-year>2015</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport7889" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.6200.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>
                <underline>
                    <bold>Claims</bold>
                </underline>
                <list list-type="bullet">
                    <list-item>
                        <p>Implement a linguistic-based approach that allows the identification of functional patterns from groups of functionally related proteins that does not require alignment of the proteins</p>
                    </list-item>
                    <list-item>
                        <p>The method uses regular expressions that are generated using a parse tree that is modified via a genetic algorithm, and fitness is scored by accuracy using training data.</p>
                    </list-item>
                    <list-item>
                        <p>Able to find discriminative patterns for serine-threonine phosphatases, zinc fingers, and multi-drug resistance (MDR) transporters</p>
                    </list-item>
                    <list-item>
                        <p>Predict MDR transporters in a bacterial community from "Hot Lake" as a potential pool for novel MDR transporters that could be transferred to current bacteria as a novel source of antibiotic resistance</p>
                    </list-item>
                    <list-item>
                        <p>PILGram is able to identify and separate based on the binding region responsible for substrate specificity</p>
                    </list-item>
                </list>
            </p>
            <p>
                <underline>
                    <bold>Praises</bold>
                </underline>
            </p>
            <p>From this version of the manuscript (v1), the claims are justified. Regular expressions are a linguistic construct; the authors are able to reproduce previously defined regexes without prior alignment using the PILGram method, and to classify zinc fingers and MDRs by counting the number of regexes matching a particular sequence, resulting in the MDRPred method. This method, MDRPred, was then applied to a newly sequenced bacterial community, and possibly novel MDR transporters were identified.</p>
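            <p>As a minimal sketch of the counting step described above (an illustration only, not the authors' MDRPred code, which is not provided with this version), the Python below scores a sequence by the number of regular expressions that match it. The patterns, the toy sequences, and the function name count_matching_regexes are hypothetical placeholders, not output from PILGram.</p>
            <preformat>
```python
import re

# Hypothetical placeholder patterns; these are NOT regexes produced by PILGram.
PATTERNS = [
    r"G[AG]..G",
    r"C..C",
    r"[ST]P",
]


def count_matching_regexes(sequence, patterns=PATTERNS):
    """Return how many patterns match anywhere in the sequence."""
    return sum(1 for p in patterns if re.search(p, sequence))


# A sequence matched by more patterns receives a higher score.
scores = {
    name: count_matching_regexes(seq)
    for name, seq in {
        "toy_a": "MKGALLGSPCAAC",
        "toy_b": "MKLLVVAA",
    }.items()
}
```
            </preformat>
            <p>How such a count would be turned into a class call (for example, by a threshold) is left open here, since that detail would have to come from the authors' own code.</p>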
            <p>In addition, from the text, the generation and validation of the regexes was done in a statistically rigorous way, with half of the data used for training and half of the data used for testing / validation / calculation of metrics. This is nice to see in this kind of paper, as it has become the exception rather than the rule.</p>
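            <p>The hold-out design praised above can be sketched as follows. This is a generic Python illustration, not taken from the manuscript: the identifiers are hypothetical, the fixed seed is arbitrary, and splitting each class separately is my own assumption rather than something stated in the text.</p>
            <preformat>
```python
import random


def split_half(items, seed=0):
    """Shuffle a copy of items and split it into training and test halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Hypothetical identifiers standing in for positive and negative examples.
positives = [f"pos_{i}" for i in range(10)]
negatives = [f"neg_{i}" for i in range(10)]

# Assumption: split each class separately so both halves keep the same balance.
train_pos, test_pos = split_half(positives)
train_neg, test_neg = split_half(negatives)
train = train_pos + train_neg
test = test_pos + test_neg
```
            </preformat>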
            <p>
                <bold>
                    <underline>Reservations</underline>
                </bold>
            </p>
            <p>Although I think the general claims can be justified from the text, there are some areas of concern that I think should be addressed in a subsequent version of the manuscript.&#x00a0;</p>
            <p>These reservations fall under the following major areas, ordered from what I consider most important to least important:
                <list list-type="bullet">
                    <list-item>
                        <p>data availability for reproducibility</p>
                    </list-item>
                    <list-item>
                        <p>actual MDRPred code</p>
                    </list-item>
                    <list-item>
                        <p>actual IDs for positive and negative examples</p>
                    </list-item>
                    <list-item>
                        <p>Weak "substrate specificity" claim</p>
                    </list-item>
                    <list-item>
                        <p>lack of description in the text leading to either lack of clarity or possible misunderstandings</p>
                    </list-item>
                    <list-item>
                        <p>PILGram details</p>
                    </list-item>
                    <list-item>
                        <p>Physicochemical Properties and TMR</p>
                    </list-item>
                    <list-item>
                        <p>Elaboration of clustering</p>
                    </list-item>
                    <list-item>
                        <p>Supplemental Table 1</p>
                    </list-item>
                    <list-item>
                        <p>Describing regexes</p>
                    </list-item>
                    <list-item>
                        <p>language implying other methods are not "linguistic"</p>
                    </list-item>
                </list>
            </p>
            <p>Each of these reservations is detailed further below.</p>
            <p>
                <underline>
                    <bold>Data &amp; Code Availability</bold>
                </underline>
            </p>
            <p>Not all of the code / data necessary to reproduce the results are currently provided. While acknowledging that the primary algorithm (PILGram) is currently awaiting publication and that this is <bold>not</bold> the place to describe the particulars of that software, I think there are still steps to be taken to improve the reproducibility of <bold>this</bold> publication by providing more of the data. It should be noted that when the PILGram algorithm is published, this publication should be updated with references and links to make it easier for others to find.&#x00a0;</p>
            <p>That being said, this is a publication about a <bold>method</bold>, and although the particulars of the <bold>method</bold> are well described, there is no accompanying code, scripts, or even pseudocode supplied so that the reader might make use of the <bold>method</bold> themselves, either on the provided supplementary FASTA files or on their own sequences. I searched GitHub for the term "mdrpred", and also for the lead author's name and Twitter username, to no avail. The need for an actual script or executable (preferably open source) is increased after reading the description of including PP-PRE and TMR-PRE, and calculating their matches, as this section is a little unclear as to how exactly that calculation is performed with no example (see comment below).</p>
            <p>In addition to the code, other data that should be included are:
                <list list-type="bullet">
                    <list-item>
                        <p>UniProt entries for positive and negative examples for serine-threonine phosphatases and zinc fingers</p>
                    </list-item>
                    <list-item>
                        <p>genome accession and gene annotations from the metagenome analysed</p>
                    </list-item>
                    <list-item>
                        <p>list of metagenome accessions annotated as MDRs using MDRPred</p>
                    </list-item>
                    <list-item>
                        <p>Date of download of PROSITE data. prosite.dat on Mar 17, 2015 shows 198 positive matches for PS00125, and I'm assuming 2018 (hard to tell from file) positive matches for PS00028, versus 166 and 1997 sequences mentioned in the text.</p>
                    </list-item>
                    <list-item>
                        <p>Text files of the regexes generated by PILGram in each case</p>
                    </list-item>
                </list>
            </p>
            <p>
                <underline>
                    <bold>Weak "substrate specificity" claim</bold>
                </underline>
            </p>
            <p>This is mentioned in the abstract and twice in the introduction. The wording in the abstract implies that the method is able to delineate substrate specificity, i.e. that the method can generate regexes that are specific for different substrates. However, the single result provided implies instead that the regexes identify the region responsible for substrate specificity (which is really neat). To my mind these are two different things, and I think the claim in the abstract and introduction should be either dropped or clarified, especially given that only one example is provided. Finally, the claim is further weakened in the current text because the word <bold>substrate</bold> is missing from the paragraph discussing the evidence for substrate specificity (Results, Drug resistance transporters, Functional motifs identified, last paragraph of that section: no mention of "substrate", just specificity).</p>
            <p>More so than the "substrate specificity" claim, I think the authors would do well to place more emphasis on the fact that <italic>all</italic> of this work is done on sets of sequences <bold>without alignment</bold> first! It might just be me, but this was one of the most important things in the paper (and something I will probably make use of in my own research), and it did not seem to be highlighted enough.</p>
            <p>
                <underline>
                    <bold>Lack of Description</bold>
                </underline>
            </p>
            <p>
                <bold>PILGram Algorithmic Details</bold>
            </p>
            <p>Again, I acknowledge that this is not the place to detail the full inner workings of the PILGram algorithm, and the example in the text for BMI is useful. However, most genetic algorithms have a fixed chromosome length that defines the solution. I would have expected an analogous situation for PILGram, in that one would have to define the <bold>length</bold> of the regular expression. This does not appear to be the case, given the variety of regexes noted for zinc fingers and MDRs. As far as I can tell, this is likely due to the way that individual trees can be recombined, but the text does not make clear how regexes of different lengths arise. Clarification on this point would be useful.</p>
            <p>
                <bold>Physicochemical Properties and TMR</bold>
            </p>
            <p>I think I understand why the physicochemical properties (PP) and transmembrane region (TMR) scores were included in MDRPred; however, the text currently neither discusses nor justifies their inclusion. From the current description, it is also difficult to see how a sequence matches a PP-PRE or TMR-PRE when the PP and TMR scores are part of the match. Therefore I recommend:
                <list list-type="bullet">
                    <list-item>
                        <p>Having a better description of the PP and TMR scores in Methods</p>
                    </list-item>
                    <list-item>
                        <p>Justification for the inclusion of PP-PRE and TMR-PRE in MDRPred. Currently the only justification is "Because we believed ...". I would hazard a guess that the accuracy drops precipitously without them, but there is nothing in the text currently describing why they are needed.</p>
                    </list-item>
                    <list-item>
                        <p>Giving examples of how some PPs are different for different AAs</p>
                    </list-item>
                    <list-item>
                        <p>Example of calculation of PP score and TMR score for a regex match</p>
                    </list-item>
                    <list-item>
                        <p>Example of full match for a derived PP-PRE or TMR-PRE</p>
                    </list-item>
                </list>
            </p>
            <p>
                <bold>Elaboration of clustering</bold>
            </p>
            <p>In the Results section "Functional motifs identified", a description of clustering the generated models is provided, but it is ambiguous. My reading is that a vector of length 71 (one element per training sequence) was generated for each model, with 1 indicating a match to the model and 0 indicating no match, and that these 36 vectors (one per model) were then clustered using hierarchical clustering.</p>
            <p>Neither the distance metric used to calculate the distance between the model vectors nor the hierarchical clustering method is described. The R stats package alone offers, among others, two variations of Ward's minimum variance method, complete linkage, single linkage, median, and centroid. The software, version, and algorithm reference should be provided for completeness.</p>
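            <p>As a concrete illustration of why these details matter, the following minimal sketch builds binary match vectors and clusters them using only the Python standard library. The toy sequences and patterns, the Jaccard distance, and the average-linkage rule are all assumptions chosen for illustration; none is taken from the manuscript, and the helper names are hypothetical.</p>
            <preformat preformat-type="code"><![CDATA[
```python
import re
from itertools import combinations


def match_vector(pattern, sequences):
    """Return a 0/1 vector: 1 where the regex matches anywhere in the sequence."""
    compiled = re.compile(pattern)
    return [1 if compiled.search(seq) else 0 for seq in sequences]


def jaccard_distance(u, v):
    """1 - |intersection| / |union| of the sets of matched sequences."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(vectors, n_clusters):
    """Naive agglomerative clustering; returns sorted lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dists = [jaccard_distance(vectors[i], vectors[j])
                     for i in clusters[a] for j in clusters[b]]
            mean = sum(dists) / len(dists)
            if best is None or mean < best[0]:
                best = (mean, a, b)
        _, a, b = best
        # a < b always holds here, so deleting b leaves index a valid.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Toy data (hypothetical): four sequences and four candidate models.
sequences = ["MKTLLVGA", "GGHEAAKR", "MKTAAHEL", "PPPPGGGG"]
models = ["MKT", "MK[ST]", "HE", "H[DE]"]

vectors = [match_vector(m, sequences) for m in models]
clusters = average_linkage(vectors, n_clusters=2)
print(vectors)
print(clusters)
```
]]></preformat>
            <p>In this toy case the models group by which sequences they match. Choosing a different distance or linkage rule can change the grouping on real data, which is why both should be stated.</p>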
            <p>Supplemental Figure 4 should have the clusters indicated on the figure (boxes or something).</p>
            <p>
                <bold>Supplemental Table 1</bold>
            </p>
            <p>I believe supplemental table 1 could benefit from including:
                <list list-type="bullet">
                    <list-item>
                        <p>a description of what each column is beyond the title (for example, what is the difference between RESmall and RE?)</p>
                    </list-item>
                    <list-item>
                        <p>a description of the PP that are included (it appears there are only 11 that end up being used)</p>
                    </list-item>
                    <list-item>
                        <p>an indication of which are PRE, PP-PRE, and TMR-PRE</p>
                    </list-item>
                </list>
            </p>
            <p>
                <bold>Describing Regexes</bold>
            </p>
            <p>I use regular expressions regularly, but even so I found it difficult to follow the regular expressions listed in the text without looking at a reference. A short description of the general features of the regexes, even in the supplemental materials, would be useful. For example, [ABC] means one of A, B, or C at that position; {3,8} means that the preceding element is repeated between 3 and 8 times; and [^ABC] means any residue other than A, B, or C.</p>
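            <p>These three constructs can be checked with Python's standard re module. The sketch below is illustrative only: the patterns are made up for the demonstration and are not taken from the manuscript. Note that a quantifier such as {3,8} accepts any count from 3 to 8 inclusive.</p>
            <preformat preformat-type="code"><![CDATA[
```python
import re

# [ABC]: exactly one of A, B or C at this position.
one_of = re.compile(r"[ABC]")
# [^ABC]: exactly one character that is none of A, B or C.
none_of = re.compile(r"[^ABC]")
# C.{3,8}H: a C, then any 3 to 8 characters (inclusive), then an H.
spacer = re.compile(r"C.{3,8}H")


def full(pattern, text):
    """True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


one_of_hits = [c for c in "ABCD" if full(one_of, c)]
none_of_hits = [c for c in "ABCD" if full(none_of, c)]
# Try spacers of 1 to 10 residues between the flanking C and H.
spacer_lengths = [n for n in range(1, 11)
                  if full(spacer, "C" + "A" * n + "H")]

print(one_of_hits)      # ['A', 'B', 'C']
print(none_of_hits)     # ['D']
print(spacer_lengths)   # [3, 4, 5, 6, 7, 8]
```
]]></preformat>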
            <p>Further, examples of what portion of a sequence is matched would help, especially for the serine-threonine case, where the sequence interval generally overlaps between the PROSITE pattern and the PILGram-derived pattern. Examples for the zinc finger showing the attributes described, or a description of which part of the regex encodes which features, would also help a lot.</p>
            <p>
                <bold>Language implying other methods are not linguistic</bold>
            </p>
            <p>In its current form, the abstract reads:</p>
            <p>"In this paper we describe a linguistic approach to identify ..."</p>
            <p>This implies that regular expressions in PROSITE and hidden Markov models are not "linguistic" approaches. However, the text describes the regular expressions used by PROSITE as the simplest form of grammar (regular grammars), and hidden Markov models as a type of regular grammar (Introduction, paragraph 3). If these are grammars, then they are linguistic approaches.</p>
            <p>In fact, from the description in the manuscript, PILGram generates regular expressions that in some cases are very similar to those used by PROSITE. It is currently unclear how generating regular expressions using PILGram is a "linguistic" approach, but aligning and finding common features (as in PROSITE or HMMs) is not.</p>
            <p>I understand that PILGram is able to generate discriminative regexes without alignment first, and that is very useful (as exemplified by this manuscript), but from the current description that does not make it "linguistic". I admit I may be missing something in reading the current text in this area, as I am not a linguist.</p>
            <p>
                <bold>
                    <underline>Other Simple Improvements</underline>
                </bold>
            </p>
            <p>Methods: under PILGram, first sentence, a reference to SIEVE is missing.</p>
            <p>Methods: in the example grammar of the PILGram example, spaces around symbols would greatly improve readability.</p>
            <p>Results: explicitly identify the <bold>core</bold> of the PROSITE regex that PILGram is able to capture, noting that PILGram drops the first and last amino acids of the PROSITE pattern and adds <bold>Q</bold> to the set of alternatives.</p>
            <p>Results: paragraph 2 says "Supplemental Table 1", but I believe this should be "Supplemental Figure 1".</p>
            <p>Results: "We found that applying logistic regression to combine these seven models provided a similar performance as the voting method, but underperformed the logistic regression on the complete set of models." By <bold>how much</bold> did it underperform? Curious minds want to know.</p>
            <p>Methods: in the hot lake peptide data, the wording implies a single bacterial genome, whereas the <bold>Results</bold> make it clear that a metagenome is being used. One of these two sections should be modified to clarify whether it was a single genome or a metagenome. In addition, if these sequences have been submitted, accession numbers should be provided.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <sub-article article-type="response" id="comment1341-7889">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>McDermott</surname>
                            <given-names>Jason</given-names>
                        </name>
                        <aff>Pacific Northwest National Laboratory, USA</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>None (aside from the fact that I'm the author)</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>8</day>
                    <month>5</month>
                    <year>2015</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We thank both the reviewers for their insightful and very helpful comments. We have revised the manuscript according to the reviewers&#x2019; suggestions and feel that it is substantially improved in terms of clarity and potential for reproducibility. Importantly, we have provided more complete data, results, and code that we employ in the paper.</p>
                <p>Dr. Flight had a number of points grouped by subject matter, so I&#x2019;ve addressed each of them below using the same organization.</p>
                <p>
                    <bold>Data and code availability</bold>
                </p>
                <p>Dr. Flight&#x2019;s points are very good. We now include the requested datasets in the manuscript and reference them appropriately (see below for details). We have put a script and associated files that implement the MDRpred algorithm on GitHub as an open source project at: 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/biodataganache/MDRpred">https://github.com/biodataganache/MDRpred</ext-link>
                </p>
                <p>The following data files have been added to the manuscript:
                    <list list-type="order">
                        <list-item>
                            <p>UniProt ids and sequences have been included for both positive and negative examples for the PS00125 and PS00028 PROSITE patterns.</p>
                        </list-item>
                        <list-item>
                            <p>Genome accessions and links to genome annotations for all sequenced genomes in the metagenome used have been provided. Sequence bins (that is, sequences that are specific to a species that hasn&#x2019;t been sequenced as an axenic culture) are currently being deposited in GenBank and the manuscript will be updated when accession numbers are available.</p>
                        </list-item>
                        <list-item>
                            <p>A full list of high-confidence MDR predictions from the metagenome and their annotations are provided.</p>
                        </list-item>
                        <list-item>
                            <p>The original PROSITE data records used in our analysis are provided. Both records have been updated in PROSITE since our analysis, and the numbers of sequences have changed since then.</p>
                        </list-item>
                        <list-item>
                            <p>The lists of regular expressions associated with each problem (the two PROSITE patterns and the MDR task) are now provided as text files.</p>
                        </list-item>
                        <list-item>
                            <p>As soon as the PILGram code is released we will update the manuscript with a link to the software and citation for the publication.</p>
                        </list-item>
                    </list></p>
                <p>
                    <bold>Weak substrate specificity claim</bold>
                </p>
                <p>We have updated the manuscript in several places (Introduction, Results, and Discussion) to clarify our claim of substrate specificity. MDRpred predicts substrate specificity at a broader class level, essentially drug-type compound or not. We have included text to explain this distinction and also updated our discussion of specificity in the Results section to make clear that we mean substrate specificity at this broad level.</p>
                <p>We have also added stronger language about the lack of need for sequence alignment for our method to work, which we agree is one of the major points in the paper.</p>
                <p>
                    <bold>PILGram Algorithmic details</bold>
                </p>
                <p>In the Methods section we include a paragraph describing how the genetic algorithm operates on parse trees. It is indeed the case that the length of the regular expression is not fixed because genetic algorithm recombinations occur on these trees. We believe that this, combined with the other clarifications of the method now included in the revision, should adequately resolve this confusion.</p>
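                <p>As a hedged illustration only, the sketch below shows how generic subtree crossover on parse trees yields regular expressions of different lengths. It is not the PILGram implementation, which is not yet released; the nested-list tree representation, the function names, and the toy patterns are all assumptions made for this example, and the crossover points are fixed rather than chosen at random.</p>
                <preformat preformat-type="code"><![CDATA[
```python
import copy


def render(tree):
    """Concatenate the leaves of a parse tree into a regex string."""
    if isinstance(tree, str):
        return tree
    return "".join(render(child) for child in tree)


def get_subtree(tree, path):
    """Follow a list of child indices down from the root."""
    for index in path:
        tree = tree[index]
    return tree


def crossover(recipient, donor, recipient_path, donor_path):
    """Return a copy of recipient with one subtree replaced by a donor subtree.

    recipient_path must be non-empty (the root itself is never replaced).
    """
    child = copy.deepcopy(recipient)
    parent_node = get_subtree(child, recipient_path[:-1])
    parent_node[recipient_path[-1]] = copy.deepcopy(
        get_subtree(donor, donor_path))
    return child


# Hypothetical parents: leaves are regex fragments, lists are internal nodes.
parent_a = ["[LIV]", ["G", "N"], "H"]
parent_b = ["C", [".{2,4}", "C", [".{3}", "[LIVMF]"]], "H"]

# Replace the two-leaf subtree of parent_a with a larger subtree of parent_b.
child = crossover(parent_a, parent_b, recipient_path=[1], donor_path=[1])

print(render(parent_a))  # [LIV]GNH
print(render(parent_b))  # C.{2,4}C.{3}[LIVMF]H
print(render(child))     # [LIV].{2,4}C.{3}[LIVMF]H
```
]]></preformat>
                <p>Because the exchanged subtrees need not have the same number of leaves, the child can be longer or shorter than either parent, so no fixed expression length has to be specified in advance.</p>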
                <p>
                    <bold>Physicochemical Properties and TMR</bold>
                </p>
                <p>To address Dr. Flight&#x2019;s comments, we have greatly expanded our description of how the physicochemical properties and TMR scoring and grammars work. We have also examined the contribution of models arising from each of these grammars to the overall method performance. Interestingly, we found that models from each grammar displayed very similar performance independently, and each contributed to the final combined performance. We now provide examples of matches and of how the scores are calculated for different sequences and for different physicochemical properties.</p>
                <p>This was a good suggestion and we believe that the manuscript is really strengthened with these revisions.</p>
                <p>
                    <bold>Elaboration of clustering</bold>
                </p>
                <p>The details of the clustering approach are now described in a new Methods subsection, &#x201c;<bold>Pattern clustering</bold>&#x201d;.</p>
                <p>
                    <bold>Supplemental Table 1</bold>
                </p>
                <p>Supplemental Table 1 now includes column descriptions, descriptions for the PPs that were included, and a column indicating the source of each pattern (PRE, TMR-PRE, or PP-PRE).</p>
                <p>
                    <bold>Describing Regexes</bold>
                </p>
                <p>We have added a subsection to the Methods titled &#x201c;Regular Expressions&#x201d;, which summarizes how to interpret the regular expressions found in the manuscript.</p>
                <p>We have also added examples for the PROSITE patterns showing which portions of sequences were matched by the PILGram-generated patterns as new Tables 2 and 4 in the Results section. The full alignment files are now provided as supplemental data files.</p>
                <p>
                    <bold>Language implying other methods are not linguistic</bold>
                </p>
                <p>Dr. Flight&#x2019;s point is well-taken. It was not our intent to imply that other approaches, like those we mention (HMMs and PROSITE), are not derived from linguistics. We have revised the text throughout to make clear that other currently used bioinformatics methods are also derived from linguistics.</p>
                <p>
                    <bold>Simple improvements</bold>
                </p>
                <p>All addressed.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
