<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="other" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.129161.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Software Tool Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>AMAW: automated gene annotation for non-model eukaryotic genomes</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 3 approved with reservations, 1 not approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Meunier</surname>
                        <given-names>Lo&#x00ef;c</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Baurain</surname>
                        <given-names>Denis</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Cornet</surname>
                        <given-names>Luc</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Data Curation</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-3420-4488</uri>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>InBios-PhotoSYSTEMS, University of Liege, Liege, B-400, Belgium</aff>
                <aff id="a2">
                    <label>2</label>TERRA Teaching and research centre, University of Liege, Gembloux, B-5030, Belgium</aff>
                <aff id="a3">
                    <label>3</label>Mycology and Aerobiology, Sciensano, Ixelles, B-1000, Belgium</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:lmeunier.bioinfo@gmail.com">lmeunier.bioinfo@gmail.com</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>16</day>
                <month>2</month>
                <year>2023</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2023</year>
            </pub-date>
            <volume>12</volume>
            <elocation-id>186</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>16</day>
                    <month>1</month>
                    <year>2023</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Meunier L et al.</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/12-186/pdf"/>
            <abstract>
                <p>
                    <bold>Background:</bold> Genome annotation is a crucial step in the analysis of new genomic data, especially for emerging organisms that give researchers access to unexplored lineages and thereby expand our knowledge of poorly represented taxonomic groups. Complete pipelines for eukaryotic genome annotation have been available for more than a decade, but the task remains challenging. One of the most widely used tools in the field is MAKER2, an annotation pipeline that uses experimental evidence (mRNA-seq and proteins) and combines different gene prediction tools. MAKER2 enables individual laboratories and small-scale projects to annotate non-model organisms for which pre-existing gene models are not available. However, optimal use of MAKER2 requires gathering evidence data (by searching for and assembling transcripts, and/or collecting homologous proteins from related organisms), devising the best annotation strategy (training of gene models) and efficiently orchestrating the different steps of the software in a grid computing environment, which is tedious, time-consuming and demands considerable bioinformatics skill.</p>
                <p>
                    <bold>Methods:</bold> To address these issues, we present AMAW (Automated MAKER2 Annotation Wrapper), a wrapper pipeline for MAKER2 that automates the above-mentioned tasks. Given a genome and the corresponding organism name, our tool gathers and assembles mRNA-seq evidence, collects protein evidence and iteratively trains the hidden Markov models (HMMs) used for gene prediction, in order to yield the most accurate evidence-supported annotation possible without manual curation or expertise on the organism. Importantly, AMAW also exists as a Singularity container recipe that is easy to deploy on a grid computer, thereby sidestepping the tricky installation of MAKER2.</p>
                <p>
                    <bold>Use case:</bold> The performance of AMAW is illustrated through the annotation of a selection of 32 protist genomes, for which we compared its annotations with those produced with gene models directly available in AUGUSTUS.</p>
                <p>
                    <bold>Conclusions:</bold> AMAW proves to be an effective tool for automating the annotation of non-model organism genomes, and it significantly improves annotation quality compared with the naive use of MAKER2 with pre-existing gene models. AMAW thus directly supports both small- and large-scale genome annotation projects: it facilitates the annotation of emerging unicellular eukaryotic genomes by scientists without a strong bioinformatics background and speeds up the annotation of many genomes at once.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Genome annotation</kwd>
                <kwd>non-model unicellular eukaryotes</kwd>
                <kwd>gene prediction</kwd>
                <kwd>evidence data acquisition</kwd>
                <kwd>Singularity container</kwd>
                <kwd>automation</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1">
                    <funding-source>F.R.S.-FNRS</funding-source>
                    <award-id>CDRJ.0008.20</award-id>
                </award-group>
                <award-group id="fund-2">
                    <funding-source>BELSPO</funding-source>
                    <award-id>B2/191/P2/BCCMGEN-ERA</award-id>
                </award-group>
                <award-group id="fund-3">
                    <funding-source>F.R.S.-FNRS</funding-source>
                    <award-id>2.5020.11</award-id>
                </award-group>
                <funding-statement>This work was supported by the F.R.S.-FNRS. Computational resources were provided by the Consortium des &#x00c9;quipements de Calcul Intensif (C&#x00c9;CI) funded by the F.R.S.-FNRS (2.5020.11), and through two research grants to DB: B2/191/P2/BCCM GEN-ERA (Belgian Science Policy Office - BELSPO) and CDR J.0008.20 (F.R.S.-FNRS). LC was also supported by the GEN-ERA research grant.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <p>Coding sequences (CDS) and, more generally, gene structures from an organism, are essential genomic data, especially for phylogenomics and gene mining, for which accessing reliable protein sequences from publicly available emerging draft genomes is invaluable (
                <xref ref-type="bibr" rid="ref14">Keeling and Burki, 2019</xref>). These can be more or less accurately obtained through the structural annotation of a genome, for which the collection of evidence data and the use of annotation pipelines are tricky at best (
                <xref ref-type="bibr" rid="ref23">Yandell and Ence, 2012</xref>).</p>
            <p>Following the decrease in sequencing costs due to the advent of Next Generation Sequencing and the concomitant explosion of sequenced organisms, new genomic data from emerging model organisms allow researchers to access unexplored taxonomic groups (
                <xref ref-type="bibr" rid="ref14">Keeling and Burki, 2019</xref>). However, eukaryotic genomes, whose biodiversity is predominantly represented by protist lineages (
                <xref ref-type="bibr" rid="ref1">Adl 
                    <italic toggle="yes">et al.</italic>, 2019</xref>, 
                <xref ref-type="bibr" rid="ref2">Burki 
                    <italic toggle="yes">et al.</italic>, 2020</xref>), have features that complicate structural annotation: large genomes with low gene density, long intergenic regions, and introns (
                <xref ref-type="bibr" rid="ref23">Yandell and Ence, 2012</xref>). Although pipelines for eukaryotic genome annotation have been developed for more than a decade, it is still challenging to obtain an accurate annotation of the gene structures, a shortcoming that is often revealed in phylogenomic studies (
                <xref ref-type="bibr" rid="ref11">Di Franco 
                    <italic toggle="yes">et al.</italic>, 2019</xref>). MAKER2 (
                <xref ref-type="bibr" rid="ref13">Holt and Yandell, 2011</xref>) has been, for more than a decade, one of the most popular annotation pipelines for eukaryotes.</p>
            <p>Although MAKER2 (
                <xref ref-type="bibr" rid="ref13">Holt and Yandell, 2011</xref>) enables individual laboratories to annotate non-model organisms (for which pre-existing gene models are not available), using this tool remains complex, as it requires orchestrating and fine-tuning a multi-step process (
                <xref ref-type="bibr" rid="ref3">Campbell 
                    <italic toggle="yes">et al.,</italic> 2015</xref>). First, an evidence dataset must be compiled by collecting phylogenetically related proteins and species-specific transcripts, which often requires the assembly of RNA-Seq data for new organisms. Next, iterative runs of MAKER2 (
                <xref ref-type="bibr" rid="ref13">Holt and Yandell, 2011</xref>) must be coordinated to obtain accurate predictions, which includes intermediate training of specific models for the different gene predictors.</p>
            <p>Here we present AMAW (Automated MAKER2 Annotation Wrapper) (
                <xref ref-type="bibr" rid="ref20">Lo&#x00ef;c Meunier 
                    <italic toggle="yes">et al.,</italic> 2022</xref>), a wrapper pipeline that facilitates the annotation of genomes of emerging unicellular eukaryotes (
                <italic toggle="yes">i.e.</italic>, protists) in both small- and large-scale projects in a grid-computing environment. This tool addresses all the above-mentioned tasks according to the MAKER2 authors&#x2019; recommendations (
                <xref ref-type="bibr" rid="ref3">Campbell 
                    <italic toggle="yes">et al.</italic>, 2015</xref>) and is, to our knowledge, the first implementation automating the use of MAKER2. We also show that AMAW yields significantly better genome annotations than MAKER2 run with the AUGUSTUS (
                <xref ref-type="bibr" rid="ref22">Stanke 
                    <italic toggle="yes">et al.</italic>, 2008</xref>) gene models that are available by default.</p>
        </sec>
        <sec id="sec2" sec-type="methods">
            <title>Methods</title>
            <sec id="sec3">
                <title>Implementation</title>
                <p>AMAW is implemented in Perl 5 version 22 (Perl, 1994) (RRID:SCR_018313) and is available either as a standalone version or as a Singularity container. The basic inputs required by the AMAW pipeline are a FASTA-formatted nucleotide genome file and the organism name. Alternatively, user-provided evidence data, such as proteins or transcripts/ESTs, or even gene models, can be used directly for genome annotation.</p>
            </sec>
            <sec id="sec4">
                <title>Functionalities</title>
                <p>MAKER2 was chosen as the annotation suite to automate because of its performance and useful features: besides supporting gene prediction with evidence data, MAKER2 has been shown to improve the accuracy of its internal gene predictors, to maintain this accuracy even when the quality or size of evidence data decreases, and to limit the number of overpredictions (
                    <xref ref-type="bibr" rid="ref13">Holt and Yandell, 2011</xref>).</p>
                <p>Using MAKER2 as its internal engine, AMAW gathers and assembles RNA-Seq evidence, collects protein evidence and iteratively trains the hidden Markov models (HMMs) of the predictors to yield the most accurate evidence-supported annotation possible without manual curation or prior expertise on the organism (see AMAW subsection). Our tool, designed for non-model unicellular eukaryotic genomes, has useful applications in phylogenomics and comparative genomics. Indeed, some taxonomic lineages still lack high-quality genomic data (
                    <xref ref-type="bibr" rid="ref2">Burki 
                        <italic toggle="yes">et al.</italic>, 2020</xref>), and filling these gaps would extend studies to these interesting groups.</p>
                <p>The pipeline devised in AMAW (
                    <xref ref-type="fig" rid="f1">Figure 1</xref>) aims to reach three goals: (1) to achieve the most accurate annotation of a non-model genome without manual curation, (2) to automate the use of MAKER2 for supporting large-scale annotation projects, and (3) to simplify its installation and usage for users without a strong bioinformatics background.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Overview of AMAW pipeline and steps.</title>
                        <p>First, transcript and protein evidence are collected and deployed, if required. Then, three iterative runs of MAKER2 are performed to progressively train the SNAP and AUGUSTUS gene predictors. The final genome annotation is generated after the third MAKER2 run.</p>
                    </caption>
                    <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/141827/60541e32-e348-43f8-baf9-9df1e823d06a_figure1.gif"/>
                </fig>
                <p>First, a key factor for achieving accurate genome annotation is to collect as much evidence data (transcripts and/or proteins) as possible. This is needed both to optimize the training of specific gene models of 
                    <italic toggle="yes">ab initio</italic> gene predictors and to improve the confidence level in predictions supported by experimental data (
                    <xref ref-type="bibr" rid="ref13">Holt and Yandell, 2011</xref>).</p>
                <p>Second, building evidence datasets is a time-consuming task that also requires a certain level of bioinformatics skill. In the best cases, it consists of finding and downloading readily available transcript or protein datasets for the species to annotate. However, it often further requires assembling raw RNA-Seq reads into transcripts and gathering a reasonably sized protein dataset, usually including sequences from taxa phylogenetically close to the organism of interest. While building evidence datasets is feasible for a few genomes, doing so repeatedly for dozens or hundreds of genomes is hardly practical. AMAW addresses this issue by automating the acquisition of both RNA-Seq and protein data from reliable public databases (&#x201c;NCBI Sequence Read Archive (SRA)&#x201d; for RNA-Seq data and a combination of &#x201c;Ensembl genomes&#x201d; and NCBI databases for protein sequences).</p>
                <p>Third, in addition to constructing a good input dataset for the annotation, AMAW automates the installation and overall use of the MAKER2 annotation pipeline, following the good practices published by its authors (
                    <xref ref-type="bibr" rid="ref3">Campbell 
                        <italic toggle="yes">et al.</italic>, 2015</xref>), and orchestrates the successive runs in a grid-computing environment. Although MAKER2 is described as an easy-to-use pipeline, running it and fine-tuning its parameters require users to study its extensive documentation and, again, to have a good understanding of bioinformatics.</p>
                <p>The complete workflow of AMAW can be summarized in three steps:
                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Transcript evidence data acquisition: RNA-Seq acquisition, assembly into transcripts, quantification of the abundance of the transcripts and filtering of redundant transcripts and minor isoforms;</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Protein evidence deployment;</p>
                        </list-item>
                        <list-item>
                            <label>3.</label>
                            <p>MAKER2 iterative runs and progressive training of its internal gene predictors.</p>
                        </list-item>
                    </list>
                </p>
                <p>It is possible for the user to provide their own in-house protein and/or transcript dataset(s). Moreover, they can short-circuit the pipeline by choosing an existing gene model for AUGUSTUS (
                    <xref ref-type="bibr" rid="ref22">Stanke 
                        <italic toggle="yes">et al.</italic>, 2008</xref>) and/or SNAP (
                    <xref ref-type="bibr" rid="ref16">Korf, 2004</xref>). However, unless the available models are well suited to the organism at hand (matching species), we advise relying on the full AMAW analysis.</p>
            </sec>
            <sec id="sec5">
                <title>AMAW</title>
                <p>
                    <italic toggle="yes">Acquisition and building of transcript evidence data</italic>
                </p>
                <p>The generation of a specific transcript dataset is carried out on the basis of the organism species name, provided by the user. This name is used to search for RNA-Seq experiments in 
                    <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/sra">NCBI SRA</ext-link>. Considering the divergence between nucleotide sequences at the genus level, only species-specific data is collected to perform direct nucleotide alignment (
                    <xref ref-type="bibr" rid="ref3">Campbell 
                        <italic toggle="yes">et al.</italic>, 2015</xref>). Information on RNA-Seq experiment runs is collected with E-utilities and the corresponding FASTQ files are downloaded with fastq-dump v3.0.0. When available, paired-end reads are prioritized over single-end libraries for more accurate transcript assembly. To limit the data volume to be stored in the case of well-represented organisms, two options are implemented: (1) a threshold on the maximal cumulative size of FASTQ files to download (by default: 25 GB) and (2) a threshold on the number of experiments (by default: none). Moreover, RNA-Seq experiments are sorted by ascending data volume before being selected in an attempt to maximize the diversity of RNA-Seq libraries.</p>
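                <p>The selection rules described above (paired-end priority, ascending data volume, cumulative size threshold and optional experiment threshold) can be illustrated with a minimal Python sketch. This is not the AMAW implementation, which is written in Perl: the <monospace>SraRun</monospace> record, the <monospace>select_runs</monospace> function and the accessions are hypothetical, and the sketch assumes that paired-end runs are ranked ahead of single-end ones and that a run exceeding the remaining size budget is skipped rather than truncated.</p>
                <code language="python"><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SraRun:
    accession: str    # hypothetical placeholder, not a real SRA accession
    size_gb: float    # size of the FASTQ data for this run, in GB
    paired_end: bool  # True for paired-end libraries


def select_runs(runs: List[SraRun],
                max_total_gb: float = 25.0,
                max_runs: Optional[int] = None) -> List[SraRun]:
    """Pick runs smallest-first, paired-end before single-end, under a size cap."""
    # Assumption: paired-end runs are ranked ahead of single-end runs, and
    # runs are ordered by ascending size within each group.
    ordered = sorted(runs, key=lambda r: (not r.paired_end, r.size_gb))
    selected: List[SraRun] = []
    total_gb = 0.0
    for run in ordered:
        # Optional threshold on the number of experiments (default: none).
        if max_runs is not None and len(selected) >= max_runs:
            break
        # Assumption: a run that would exceed the cumulative cap is skipped.
        if total_gb + run.size_gb > max_total_gb:
            continue
        selected.append(run)
        total_gb += run.size_gb
    return selected


example_runs = [
    SraRun("RUN_A", 12.0, True),
    SraRun("RUN_B", 3.0, False),
    SraRun("RUN_C", 8.0, True),
    SraRun("RUN_D", 6.0, True),
]
default_selection = [r.accession for r in select_runs(example_runs)]
capped_selection = [r.accession for r in select_runs(example_runs, max_runs=2)]
```
]]></code>
                <p>Sorting by ascending size means that many small libraries are favored over a few large ones within the same size budget, which is the rationale given above for maximizing library diversity.</p>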
                <p>FASTQ read files are assembled into transcripts with Trinity v2.12.0 (
                    <xref ref-type="bibr" rid="ref12">Grabherr 
                        <italic toggle="yes">et al.</italic>, 2013</xref>) (standard parameters). The abundance of transcripts is first assessed with &#x201c;align_and_estimate_abundance.pl&#x201d;, a Trinity utility script that uses RSEM (
                    <xref ref-type="bibr" rid="ref18">Li and Dewey, 2011</xref>); a custom script then removes redundant transcripts (which are common when several samples are pooled) and minor isoforms (by default, those with abundance &lt; 10% for a Trinity-defined gene). Finally, assembled transcripts are pooled and passed to MAKER2.</p>
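                <p>The filtering step can be sketched as follows in Python. This is an illustration only, not the custom script shipped with AMAW: the <monospace>Transcript</monospace> record and the <monospace>filter_transcripts</monospace> function are hypothetical, "redundant" is assumed here to mean an identical sequence that has already been kept, and the 10% threshold is applied to each isoform's share of the summed abundance of its Trinity-defined gene.</p>
                <code language="python"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Transcript(NamedTuple):
    transcript_id: str
    gene_id: str       # Trinity-defined gene the isoform belongs to
    sequence: str
    abundance: float   # abundance estimate, e.g. as reported by RSEM


def filter_transcripts(transcripts: List[Transcript],
                       min_fraction: float = 0.10) -> List[Transcript]:
    """Drop minor isoforms and exact duplicate sequences."""
    # Sum the abundance of all isoforms for each gene.
    gene_total: Dict[str, float] = defaultdict(float)
    for t in transcripts:
        gene_total[t.gene_id] += t.abundance

    kept: List[Transcript] = []
    seen_sequences = set()
    for t in transcripts:
        total = gene_total[t.gene_id]
        # Minor isoform: share of the gene's abundance below the threshold.
        # Assumption: isoforms of genes with zero total abundance are dropped.
        if total <= 0 or t.abundance / total < min_fraction:
            continue
        # Assumption: redundancy is detected as an identical sequence.
        if t.sequence in seen_sequences:
            continue
        seen_sequences.add(t.sequence)
        kept.append(t)
    return kept


example = [
    Transcript("t1", "g1", "ACGT", 95.0),
    Transcript("t2", "g1", "ACGA", 5.0),   # 5% of g1: minor isoform
    Transcript("t3", "g2", "ACGT", 10.0),  # same sequence as t1: redundant
    Transcript("t4", "g2", "TTTT", 10.0),
]
kept_ids = [t.transcript_id for t in filter_transcripts(example)]
```
]]></code>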
                <p>
                    <italic toggle="yes">Deployment of preloaded protein evidence data</italic>
                </p>
                <p>To collect a set of curated protein sequences of eukaryotic microorganisms, Ensembl genomes (
                    <xref ref-type="bibr" rid="ref15">Kersey 
                        <italic toggle="yes">et al.</italic>, 2018</xref>) were downloaded (Protists, Fungi and Plants, release 35.0, 08 May 2017) and combined with protist genomes available at the NCBI (March 2017) into a single database. To reduce the computation time of MAKER2 annotations, this protein sequence database was then subdivided according to the major eukaryotic taxonomic clades. For this, we used the third NCBI taxonomic level (usually the phylum), which considerably reduces the quantity of data to deploy for an annotation while retaining enough sequence evidence for less studied lineages. To further reduce computation time, these subsets were also dereplicated with CD-HIT version 4.6 (
                    <xref ref-type="bibr" rid="ref19">Li and Godzik, 2006</xref>): sequences sharing &#x2265; 99% identity were removed in favor of a single representative sequence. In practice, the taxonomy of the user-given organism species name is used to deploy the protein database corresponding to its taxon.</p>
                <p>
                    <italic toggle="yes">MAKER2 runs and intermediate trainings of the gene predictors</italic>
                </p>
                <p>Following the good practices given by 
                    <xref ref-type="bibr" rid="ref3">Campbell 
                        <italic toggle="yes">et al.</italic> (2015)</xref>, the default AMAW workflow consists of three successive MAKER2 runs:
                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>The first MAKER2 round predicts genes based solely on the alignment of the provided transcript and protein data to the genome assembly to annotate. The predicted gene sequences are then used to train a gene model for the SNAP gene predictor.</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>The second MAKER2 round uses SNAP with the trained gene model; the evidence data are used only to support the presence or absence of the predicted genes. Then, the SNAP gene model is trained again and a gene model is trained for AUGUSTUS.</p>
                        </list-item>
                        <list-item>
                            <label>3.</label>
                            <p>The third and final MAKER2 round performs gene prediction with both the trained SNAP and AUGUSTUS gene predictors.</p>
                        </list-item>
                    </list>
                </p>
                <p>At the end of these three annotation rounds, two sets of consensus gene predictions from the gene predictors are returned: one containing the predictions supported by evidence data and another containing the unsupported ones. The latter set should be used with caution, as its false positive rate is expected to be higher.</p>
                <p>For optimal performance of the pipeline, it is possible (and recommended when applicable) for the user to provide their own experimental transcript data.</p>
                <p>Besides the complete pipeline, AMAW also offers the possibility to shorten the analysis to a single round in order to:
                    <list list-type="bullet">
                        <list-item>
                            <label>-</label>
                            <p>annotate several genomes of the same species (or re-run a previous analysis) for which the evidence data has already been constructed and the SNAP and AUGUSTUS gene models already trained.</p>
                        </list-item>
                        <list-item>
                            <label>-</label>
                            <p>directly use an AUGUSTUS gene model (available in its library or provided by the user) without evidence data building. It is noteworthy that this mode does not use the SNAP gene predictor.</p>
                        </list-item>
                    </list>
                </p>
                <p>In these cases, only the third round is launched, according to the chosen mode.</p>
            </sec>
        </sec>
        <sec id="sec6">
            <title>Use cases: structural genome annotation of protist lineages</title>
            <p>The efficiency of MAKER2 being well known (
                <xref ref-type="bibr" rid="ref13">Holt and Yandell, 2011</xref>), we illustrate the performance of AMAW by comparing its annotations with those of MAKER2 on a selection of 32 protist genomes under two sharply contrasting conditions (
                <xref ref-type="bibr" rid="ref4">Cornet, Luc 2022a</xref>). In detail, the annotations generated with AMAW, where a gene model is specifically created for the genome from the available data, are compared with those produced with gene models directly available in AUGUSTUS (
                <xref ref-type="bibr" rid="ref22">Stanke 
                    <italic toggle="yes">et al.</italic>, 2008</xref>). The latter (control) condition corresponds to a basic usage of MAKER2.</p>
            <p>To explore the impact of gene model choice, four AUGUSTUS models were compared with the AMAW-generated ones: 
                <italic toggle="yes">Homo sapiens</italic>, 
                <italic toggle="yes">Arabidopsis thaliana</italic>, 
                <italic toggle="yes">Aspergillus oryzae</italic> and the &#x201c;closest&#x201d; available model with respect to the organism to annotate. For this, a dataset of 32 genomes of protist organisms was designed and the quality of the different structural annotations was assessed using the completeness metrics provided by BUSCO v4 (
                <xref ref-type="bibr" rid="ref21">Seppey 
                    <italic toggle="yes">et al.</italic>, 2019</xref>) and the latest orthologous databases (
                <xref ref-type="bibr" rid="ref17">Kriventseva 
                    <italic toggle="yes">et al.</italic>, 2019</xref>). The genomes were downloaded from the NCBI and are available in the Supplementary Database (
                <xref ref-type="bibr" rid="ref4">Cornet, Luc 2022a</xref>). For more details, see Supplementary Tables 1 (
                <xref ref-type="bibr" rid="ref8">Cornet, Luc 2022e</xref>) and 2 (
                <xref ref-type="bibr" rid="ref9">Cornet, Luc 2022f</xref>) for the complete taxonomy of these genomes, evidence data used to train the gene models and orthologous databases used with BUSCO.</p>
            <p>The analysis of median values of BUSCO metrics shows that AMAW gene models significantly improve the quality of MAKER2 annotations (
                <xref ref-type="fig" rid="f2">Figure 2A</xref>): with a median completeness of 90.6% (the closest gene model is the second most complete with a median of 68.7%), a median rate of fragmented annotations of 3.8% (second: closest gene model with 8.2%) and a median rate of missing annotations of 5.4% (second: closest gene model with 14.0%). Complete BUSCO results are provided as a table (see Supplementary Table 3 (
                <xref ref-type="bibr" rid="ref10">Cornet, Luc 2022g</xref>)) and individual barplots for completeness, fragmented and missing genes (see Supplementary Figures 1 (
                <xref ref-type="bibr" rid="ref5">Cornet, Luc 2022b</xref>), 2 (
                <xref ref-type="bibr" rid="ref6">Cornet, Luc 2022c</xref>) and 3 (
                <xref ref-type="bibr" rid="ref7">Cornet, Luc 2022d</xref>), respectively).</p>
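            <p>As an illustration of how such medians can be derived from per-genome BUSCO results, the following minimal Python sketch summarizes each metric per gene model. The dictionary layout, the function name <monospace>median_metrics</monospace> and all numeric values are hypothetical and are not taken from Supplementary Table 3.</p>
            <preformat>
```python
from statistics import median

# Hypothetical BUSCO percentages for three genomes (illustrative values only,
# not taken from Supplementary Table 3).
busco_results = {
    "AMAW": [
        {"complete": 92.0, "fragmented": 3.0, "missing": 5.0},
        {"complete": 88.0, "fragmented": 4.0, "missing": 8.0},
        {"complete": 95.0, "fragmented": 2.0, "missing": 3.0},
    ],
    "closest": [
        {"complete": 70.0, "fragmented": 10.0, "missing": 20.0},
        {"complete": 60.0, "fragmented": 8.0, "missing": 32.0},
        {"complete": 81.0, "fragmented": 7.0, "missing": 12.0},
    ],
}


def median_metrics(per_genome):
    """Return the median of each BUSCO metric across genomes."""
    metrics = ("complete", "fragmented", "missing")
    return {m: median(row[m] for row in per_genome) for m in metrics}


# One summary dictionary per gene model.
summary = {model: median_metrics(rows) for model, rows in busco_results.items()}
for model, stats in summary.items():
    print(model, stats)
```
</preformat>
            <p>With real data, the per-genome percentages would be read from the BUSCO summaries rather than hard-coded as in this sketch.</p>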
            <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                <label>Figure 2. </label>
                <caption>
                    <title>A. Comparison of median values of the percentage of completeness, and fragmented and missing genes between MAKER2 with AUGUSTUS gene models (
                        <italic toggle="yes">H. sapiens</italic>, 
                        <italic toggle="yes">A. thaliana</italic>, 
                        <italic toggle="yes">A. oryzae</italic> and closest available) and AMAW gene models. B. Representation of the percentage of occurrences (out of 32 genomes) where a gene model yields the most complete annotation, the least fragmented proteins or the least missing proportion of expected proteins, in comparison with other gene models.</title>
                </caption>
                <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/141827/60541e32-e348-43f8-baf9-9df1e823d06a_figure2.gif"/>
            </fig>
            <p>Among the five gene models used for each genome, AMAW performed best, giving the most complete annotation in 59.4% of cases, the least fragmented annotations in 34.4% of cases and the lowest proportion of missing proteins in 50.0% of cases (
                <xref ref-type="fig" rid="f2">Figure 2B</xref>). AMAW annotations are of higher quality for the genomes for which RNA-Seq data are available (see 
                <xref ref-type="fig" rid="f3">Figure 3</xref>).</p>
            <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                <label>Figure 3. </label>
                <caption>
                    <title>A. Comparison of median values of the percentage of completeness, and fragmented and missing genes between MAKER2 with AUGUSTUS gene models (
                        <italic toggle="yes">H. sapiens</italic>, 
                        <italic toggle="yes">A. thaliana</italic>, 
                        <italic toggle="yes">A. oryzae</italic> and closest available) and AMAW gene models. B. Representation of the percentage of occurrences (out of 17 genomes) where a gene model yields the most complete annotation, the least fragmented proteins or the least missing proportion of expected proteins, in comparison with other gene models.</title>
                </caption>
                <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/141827/60541e32-e348-43f8-baf9-9df1e823d06a_figure3.gif"/>
            </fig>
            <p>For the genomes with available RNA-Seq data, AMAW again performed best among the five gene models assayed for each genome, giving the most complete annotation in 73.3% of cases (in comparison with 59.4% for the full genome dataset), the least fragmented annotations in 46.7% of cases (in comparison with 34.4%) and the lowest proportion of missing proteins in 60.0% of cases (in comparison with 50.0%).</p>
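            <p>The percentages of occurrences reported in Figures 2B and 3B can be read as the share of genomes in which a given gene model ranks first for a metric; for instance, 59.4% is consistent with 19 of 32 genomes. The minimal Python sketch below illustrates this computation under assumed inputs: the genome names, the values, the function name <monospace>win_percentages</monospace> and the tie-breaking rule are hypothetical and do not reproduce the analysis itself.</p>
            <preformat>
```python
from collections import Counter

# Hypothetical completeness (%) per genome and gene model; illustrative only.
completeness = {
    "genome_1": {"AMAW": 91.0, "closest": 70.0, "H. sapiens": 40.0},
    "genome_2": {"AMAW": 65.0, "closest": 72.0, "H. sapiens": 35.0},
    "genome_3": {"AMAW": 88.0, "closest": 60.0, "H. sapiens": 50.0},
    "genome_4": {"AMAW": 79.0, "closest": 75.0, "H. sapiens": 80.0},
}


def win_percentages(scores, higher_is_better=True):
    """Percentage of genomes in which each gene model ranks first.

    Ties are resolved in favour of the first model listed, which is an
    assumption of this sketch.
    """
    pick = max if higher_is_better else min
    wins = Counter(pick(models, key=models.get) for models in scores.values())
    return {model: round(100 * n / len(scores), 1) for model, n in wins.items()}


result = win_percentages(completeness)
print(result)

# Sanity check on the reported figure: 19 of 32 genomes, as a percentage.
print(round(100 * 19 / 32, 1))
```
</preformat>
            <p>For the fragmented and missing metrics, lower values are better, so the same function would be called with <monospace>higher_is_better=False</monospace>.</p>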
        </sec>
        <sec id="sec7" sec-type="conclusions">
            <title>Conclusions</title>
            <p>We presented AMAW and its set of functionalities for automating genome annotation, with a specific focus on non-model organisms. The application example shows how AMAW significantly improves genome annotation quality in comparison with a naive use of MAKER2 with pre-existing gene models, and highlights the importance of providing specific evidence data. By automating the acquisition and deployment of evidence data, we aim to contribute to the effort towards ever more complete and accurate annotations, especially for poorly represented eukaryotic lineages. Given its streamlined installation and straightforward usage in grid-computing environments, we hope that AMAW will be useful in future small and large genome annotation projects.</p>
        </sec>
        <sec id="sec8">
            <title>Author contributions</title>
            <p>L. Meunier: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualization, Writing - Original Draft Preparation.</p>
            <p>D. Baurain: Conceptualization, Funding Acquisition, Methodology, Resources, Validation, Writing - Review &amp; Editing.</p>
            <p>L. Cornet: Conceptualization, Data Curation, Formal Analysis, Investigation, Software, Supervision, Writing - Review &amp; Editing.</p>
        </sec>
    </body>
    <back>
        <sec id="sec11" sec-type="data-availability">
            <title>Data availability</title>
            <p>The genome assemblies used to assess AMAW are publicly available on the NCBI assembly database (
                <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/assembly/">https://www.ncbi.nlm.nih.gov/assembly/</ext-link>) and released as an archived database (
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21757880">https://doi.org/10.6084/m9.figshare.21757880</ext-link>):</p>
            <p>
                <bold>Acytostelium subglobosum</bold>:</p>
            <p>Genbank GCA_000787575.2</p>
            <p>
                <bold>Ascogregarina taiwanensis</bold>:</p>
            <p>Genbank GCA_000172235.1</p>
            <p>
                <bold>Auxenochlorella pyrenoidosa</bold>:</p>
            <p>Genbank GCA_001430745.1</p>
            <p>
                <bold>Balamuthia mandrillaris</bold>:</p>
            <p>Genbank GCA_001185145.1</p>
            <p>
                <bold>Breviolum minutum</bold>:</p>
            <p>Genbank GCA_000507305.1</p>
            <p>
                <bold>Chromera velia</bold>:</p>
            <p>Genbank GCA_000585135.1</p>
            <p>
                <bold>Chlorella vulgaris</bold>:</p>
            <p>Genbank GCA_001021125.1</p>
            <p>
                <bold>Cladosiphon okamuranus</bold>:</p>
            <p>Genbank GCA_001742925.1</p>
            <p>
                <bold>Coccomyxa subellipsoidea</bold>:</p>
            <p>Genbank GCA_000258705.1</p>
            <p>
                <bold>Crithidia acanthocephali</bold>:</p>
            <p>Genbank GCA_000482105.1</p>
            <p>
                <bold>Cyclospora cayetanensis</bold>:</p>
            <p>Genbank GCA_000769155.2</p>
            <p>
                <bold>Cymbomonas tetramitiformis</bold>:</p>
            <p>Genbank GCA_001247695.1</p>
            <p>
                <bold>Diplonema papillatum</bold>:</p>
            <p>Genbank GCA_001655075.1</p>
            <p>
                <bold>Endotrypanum monterogeii</bold>:</p>
            <p>Genbank GCA_000333855.2</p>
            <p>
                <bold>Euplotes focardii</bold>:</p>
            <p>Genbank GCA_001880345.1</p>
            <p>
                <bold>Fragilariopsis cylindrus</bold>:</p>
            <p>Genbank GCA_001750085.1</p>
            <p>
                <bold>Gonium pectorale</bold>:</p>
            <p>Genbank GCA_001584585.1</p>
            <p>
                <bold>Haemoproteus tartakovskyi</bold>:</p>
            <p>Genbank GCA_001625125.1</p>
            <p>
                <bold>Halocafeteria seosinensis</bold>:</p>
            <p>Genbank GCA_001687465.1</p>
            <p>
                <bold>Herpetomonas muscarum</bold>:</p>
            <p>Genbank GCA_000482205.1</p>
            <p>
                <bold>Lotmaria passim</bold>:</p>
            <p>Genbank GCA_000635995.1</p>
            <p>
                <bold>Mastigamoeba balamuthi</bold>:</p>
            <p>Genbank GCA_000765095.1</p>
            <p>
                <bold>Moneuplotes crassus</bold>:</p>
            <p>Genbank GCA_001880385.1</p>
            <p>
                <bold>Neospora caninum</bold>:</p>
            <p>RefSeq GCF_000208865.1</p>
            <p>
                <bold>Parachlorella kessleri</bold>:</p>
            <p>Genbank GCA_001598975.1</p>
            <p>
                <bold>Pilasporangium apinafurcum</bold>:</p>
            <p>Genbank GCA_001600475.1</p>
            <p>
                <bold>Porphyridium purpureum</bold>:</p>
            <p>Genbank GCA_000397085.1</p>
            <p>
                <bold>Pseudoperonospora cubensis</bold>:</p>
            <p>Genbank GCA_000252605.1</p>
            <p>
                <bold>Saccharina japonica</bold>:</p>
            <p>Genbank GCA_000978595.1</p>
            <p>
                <bold>Sarcocystis neurona</bold>:</p>
            <p>Genbank GCA_000727475.1</p>
            <p>
                <bold>Trebouxia gelatinosa</bold>:</p>
            <p>Genbank GCA_000818905.1</p>
            <p>
                <bold>Uroleptopsis citrina</bold>:</p>
            <p>Genbank GCA_001653735.1</p>
            <sec id="sec12">
                <title>Extended data</title>
                <p>Supplementary Database: Figshare: AMAW-genomes-used 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21757880">https://doi.org/10.6084/m9.figshare.21757880</ext-link> (
                    <xref ref-type="bibr" rid="ref4">Cornet, Luc, 2022a</xref>)</p>
                <p>Archive containing the FASTA files of the 32 genomes selected for the use case.</p>
                <p>Figshare: AMAW-Supplementary_Figure1.jpg 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21603990">https://doi.org/10.6084/m9.figshare.21603990</ext-link> (
                    <xref ref-type="bibr" rid="ref5">Cornet, Luc, 2022b</xref>)</p>
                <p>This project contains the following extended data:
                    <list list-type="bullet">
                        <list-item>
                            <label>-</label>
                            <p>AMAW-Supplementary_Figure1.jpg (BUSCO metrics: percentage of completeness for each of the 32 analyzed genomes using five gene models.)</p>
                        </list-item>
                    </list>
</p>
                <p>Figshare: AMAW-Supplementary_Figure2.jpg 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21603996">https://doi.org/10.6084/m9.figshare.21603996</ext-link> (
                    <xref ref-type="bibr" rid="ref6">Cornet, Luc, 2022c</xref>)</p>
                <p>This project contains the following extended data:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2010;</label>
                            <p>AMAW-Supplementary_Figure2.jpg (BUSCO metrics: percentage of fragmented genes for each of the 32 analyzed genomes using five gene models.)</p>
                        </list-item>
                    </list>
</p>
                <p>Figshare: AMAW-Supplementary_Figure3.jpg 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21603999">https://doi.org/10.6084/m9.figshare.21603999</ext-link> (
                    <xref ref-type="bibr" rid="ref7">Cornet, Luc, 2022d</xref>)</p>
                <p>This project contains the following extended data:
                    <list list-type="bullet">
                        <list-item>
                            <label>-</label>
                            <p>AMAW-Supplementary_Figure3.jpg (BUSCO metrics: percentage of missing genes for each of the 32 analyzed genomes using five gene models.)</p>
                        </list-item>
                    </list>
</p>
                <p>Figshare: AMAW-Supplementary_Table1.csv 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21604011">https://doi.org/10.6084/m9.figshare.21604011</ext-link> (
                    <xref ref-type="bibr" rid="ref8">Cornet, Luc, 2022e</xref>)</p>
                <p>This project contains the following extended data:
                    <list list-type="bullet">
                        <list-item>
                            <label>-</label>
                            <p>AMAW-Supplementary_Table1.csv</p>
                        </list-item>
                    </list>
</p>
                <p>Figshare: AMAW-Supplementary_Table2.csv 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21604002">https://doi.org/10.6084/m9.figshare.21604002</ext-link> (
                    <xref ref-type="bibr" rid="ref9">Cornet, Luc, 2022f</xref>)</p>
                <p>This project contains the following extended data:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2010;</label>
                            <p>AMAW-Supplementary_Table2.csv</p>
                        </list-item>
                    </list>
</p>
                <p>Figshare: AMAW-Supplementary_Table3.csv 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.21750965">https://doi.org/10.6084/m9.figshare.21750965</ext-link> (
                    <xref ref-type="bibr" rid="ref10">Cornet, Luc, 2022g</xref>)</p>
                <p>This project contains the following extended data:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2010;</label>
                            <p>AMAW-Supplementary_Table3.csv (BUSCO metrics for each combination of genome and gene model, together with the OrthoDB database used in the analysis)</p>
                        </list-item>
                    </list>
</p>
                <p>Data are available under the terms of the 
                    <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International license</ext-link> (CC-BY 4.0).</p>
                <p>
                    <bold>Software availability</bold>
                </p>
                <p>AMAW is released both as a Singularity container recipe and a standalone Perl script (
                    <ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/phylogeno/amaw/">https://bitbucket.org/phylogeno/amaw/</ext-link>)</p>
                <p>Archived source code at time of publication: 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.7490001">https://doi.org/10.5281/zenodo.7490001</ext-link> (
                    <xref ref-type="bibr" rid="ref20">Lo&#x00ef;c Meunier 
                        <italic toggle="yes">et al.,</italic> 2022</xref>)</p>
                <p>License: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.gnu.org/licenses/gpl-3.0.en.html">GNU GPL v3</ext-link>
                </p>
            </sec>
        </sec>
        <ack>
            <title>Acknowledgments</title>
            <p>We thank David Colignon (ULi&#x00e8;ge) and Olivier Mattelaer (UCLouvain) for their help with the C&#x00c9;CI computing clusters.</p>
        </ack>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Adl</surname>
                            <given-names>SM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bass</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lane</surname>
                            <given-names>CE</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Revisions to the Classification, Nomenclature, and Diversity of Eukaryotes.</article-title>
                    <source>

                        <italic toggle="yes">J. Eukaryot. Microbiol.</italic>
</source>
                    <year>2019</year>;<volume>66</volume>(<issue>1</issue>):<fpage>4</fpage>&#x2013;<lpage>119</lpage>.
                    <pub-id pub-id-type="pmid">30257078</pub-id>
                    <pub-id pub-id-type="doi">10.1111/jeu.12691</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6492006</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Burki</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Roger</surname>
                            <given-names>AJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>MW</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The New Tree of Eukaryotes.</article-title>
                    <source>

                        <italic toggle="yes">Trends Ecol. Evol.</italic>
</source>
                    <year>2020</year>;<volume>35</volume>:<fpage>43</fpage>&#x2013;<lpage>55</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.tree.2019.08.008</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Campbell</surname>
                            <given-names>MS</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Holt</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Moore</surname>
                            <given-names>B</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Genome Annotation and Curation Using MAKER2 and MAKER-P (Vol. 3).</article-title>
                    <year>2015</year>.</mixed-citation>
            </ref>
            <ref id="ref4">
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <data-title>AMAW-genomes-used.</data-title>Dataset.
                    <source>

                        <italic toggle="yes">figshare.</italic>
</source>
                    <year>2022a</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21757880.v1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>AMAW-Supplementary_Figure1.png. figshare. Figure.</article-title>
                    <year>2022b</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21603990.v3</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>AMAW-Supplementary_Figure2.png. figshare. Figure.</article-title>
                    <year>2022c</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21603996</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>AMAW-Supplementary_Figure3.png. figshare. Figure.</article-title>
                    <year>2022d</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21603999.v2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <data-title>AMAW-Supplementary_Table1.csv.</data-title>[Data].
                    <source>

                        <italic toggle="yes">figshare.</italic>
</source>
                    <year>2022e</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21604011.v2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <data-title>AMAW-Supplementary_Table2.csv.</data-title>[Data].
                    <source>

                        <italic toggle="yes">figshare.</italic>
</source>
                    <year>2022f</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21604002.v2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <data-title>AMAW-Supplementary_Table3.csv.</data-title>[Data].
                    <source>

                        <italic toggle="yes">figshare.</italic>
</source>
                    <year>2022g</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.21750965.v1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Di Franco</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Poujol</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baurain</surname>
                            <given-names>D</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Evaluating the usefulness of alignment filtering methods to reduce the impact of errors on evolutionary inferences.</article-title>
                    <source>

                        <italic toggle="yes">BMC Evol. Biol.</italic>
</source>
                    <year>2019</year>;<volume>19</volume>:<fpage>21</fpage>.
                    <pub-id pub-id-type="pmid">30634908</pub-id>
                    <pub-id pub-id-type="doi">10.1186/s12862-019-1350-2</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6330419</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Grabherr</surname>
                            <given-names>MG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Haas</surname>
                            <given-names>BJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yassour</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Biotechnol.</italic>
</source>
                    <year>2013</year>;<volume>29</volume>(<issue>7</issue>):<fpage>644</fpage>&#x2013;<lpage>652</lpage>.
                    <pub-id pub-id-type="doi">10.1038/nbt.1883</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Holt</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yandell</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects.</article-title>
                    <source>

                        <italic toggle="yes">BMC Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>12</volume>(<issue>1</issue>):<fpage>491</fpage>.
                    <pub-id pub-id-type="pmid">22192575</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-12-491</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3280279</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Keeling</surname>
                            <given-names>PJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Burki</surname>
                            <given-names>F</given-names>
                        </name>
</person-group>:
                    <article-title>Progress towards the Tree of Eukaryotes.</article-title>
                    <source>

                        <italic toggle="yes">Curr. Biol.</italic>
</source>
                    <year>2019</year>;<volume>29</volume>(<issue>16</issue>):<fpage>R808</fpage>&#x2013;<lpage>R817</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.cub.2019.07.031</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kersey</surname>
                            <given-names>PJ</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Allen</surname>
                            <given-names>JE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Allot</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Ensembl Genomes 2018: An integrated omics infrastructure for non-vertebrate species.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2018</year>;<volume>46</volume>(<issue>D1</issue>):<fpage>D802</fpage>&#x2013;<lpage>D808</lpage>.
                    <pub-id pub-id-type="pmid">29092050</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkx1011</pub-id>
                    <pub-id pub-id-type="pmcid">PMC5753204</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Korf</surname>
                            <given-names>I</given-names>
                        </name>
</person-group>:
                    <article-title>Gene finding in novel genomes.</article-title>
                    <source>

                        <italic toggle="yes">BMC Bioinformatics.</italic>
</source>
                    <year>2004</year>;<volume>5</volume>:<fpage>59</fpage>.
                    <pub-id pub-id-type="pmid">15144565</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-5-59</pub-id>
                    <pub-id pub-id-type="pmcid">PMC421630</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kriventseva</surname>
                            <given-names>EV</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kuznetsov</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Tegenfeldt</surname>
                            <given-names>F</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>OrthoDB v10: Sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs.</article-title>
                    <source>

                        <italic toggle="yes">Nucleic Acids Res.</italic>
</source>
                    <year>2019</year>;<volume>47</volume>(<issue>D1</issue>):<fpage>D807</fpage>&#x2013;<lpage>D811</lpage>.
                    <pub-id pub-id-type="pmid">30395283</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gky1053</pub-id>
                    <pub-id pub-id-type="pmcid">PMC6323947</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dewey</surname>
                            <given-names>CN</given-names>
                        </name>
</person-group>:
                    <article-title>RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome.</article-title>
                    <source>

                        <italic toggle="yes">BMC Bioinformatics.</italic>
</source>
                    <year>2011</year>;<volume>12</volume>:<fpage>323</fpage>.
                    <pub-id pub-id-type="pmid">21816040</pub-id>
                    <pub-id pub-id-type="doi">10.1186/1471-2105-12-323</pub-id>
                    <pub-id pub-id-type="pmcid">PMC3163565</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Godzik</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2006</year>;<volume>22</volume>(<issue>13</issue>):<fpage>1658</fpage>&#x2013;<lpage>1659</lpage>.
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btl158</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref20">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Meunier</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baurain</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cornet</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>AMAW - Automated MAKER2 Annotation Wrapper (0.223430). Zenodo. [Code].</article-title>
                    <year>2022</year>.
                    <pub-id pub-id-type="doi">10.5281/zenodo.7490001</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref21">
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Seppey</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Manni</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zdobnov</surname>
                            <given-names>EM</given-names>
                        </name>
</person-group>:
                    <chapter-title>BUSCO: Assessing Genome Assembly and Annotation Completeness.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Kollmar</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>, editor.
                    <source>

                        <italic toggle="yes">Gene Prediction. Methods in Molecular Biology.</italic>
</source>
                    <publisher-loc>New York, NY</publisher-loc>:
                    <publisher-name>Humana</publisher-name>; <year>2019</year>; vol. <volume>1962</volume>.</mixed-citation>
            </ref>
            <ref id="ref22">
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Stanke</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Diekhans</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baertsch</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Using native and syntenically mapped cDNA alignments to improve de novo gene finding.</article-title>
                    <source>

                        <italic toggle="yes">Bioinformatics.</italic>
</source>
                    <year>2008</year>;<volume>24</volume>(<issue>5</issue>):<fpage>637</fpage>&#x2013;<lpage>644</lpage>.
                    <pub-id pub-id-type="pmid">18218656</pub-id>
                    <pub-id pub-id-type="doi">10.1093/bioinformatics/btn013</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref23">
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yandell</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ence</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <article-title>A beginner&#x2019;s guide to eukaryotic genome annotation.</article-title>
                    <source>
                        <italic toggle="yes">Nat Rev Genet.</italic>
                    </source>
                    <year>2012</year>;<volume>13</volume>(<issue>5</issue>):<fpage>329</fpage>&#x2013;<lpage>342</lpage>.</mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report267741">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.141827.r267741</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Ruiz Torres</surname>
                        <given-names>Laura</given-names>
                    </name>
                    <xref ref-type="aff" rid="r267741a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-7631-9139</uri>
                </contrib>
                <contrib contrib-type="author">
                    <name>
                        <surname>Contreras-Moreira</surname>
                        <given-names>Bruno</given-names>
                    </name>
                    <xref ref-type="aff" rid="r267741a2">2</xref>
                    <role>Co-referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-5462-907X</uri>
                </contrib>
                <aff id="r267741a1">
                    <label>1</label>University of Ja&#x00e9;n, Ja&#x00e9;n, Spain</aff>
                <aff id="r267741a2">
                    <label>2</label>Estaci&#x00f3;n Experimental de Aula Dei-CSIC, Zaragoza, Spain</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>3</day>
                <month>6</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Ruiz Torres L and Contreras-Moreira B</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport267741" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.129161.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>
                <bold>Introduction:</bold>
            </p>
            <p> page 3 parag2a: consider changing "following the decrease in sequencing costs due to the advent of Next Generation Sequencing and the concomitant explosion of sequenced organisms" to "The advent of Next Generation Sequencing has decreased sequencing costs, leading to an explosion in the number of sequenced organisms"</p>
            <p> page 3 parag2b: Please support "MAKER2 has been, for more than a decade, one of the most popular annotation pipelines for eukaryotes" with the number of citations in that period, for instance. This would also be a good place to mention other software choices beyond MAKER2.</p>
            <p> page 3: I am not sure the data presented support "We also demonstrate that the use of AMAW yields genome annotation significantly improved in comparison to the use of MAKER2 with the AUGUSTUS (Stanke et al., 2008) [Ref 1] gene models that are available by default."</p>
            <p> 
                <bold>Methods:</bold>
            </p>
            <p> Implementation: Which version of the Singularity container was used?</p>
            <p> page 3 paragraph-1: I am not sure the documentation at 
                <ext-link ext-link-type="uri" xlink:href="https://bitbucket.org/phylogeno/amaw">https://bitbucket.org/phylogeno/amaw</ext-link> supports the sentence "simplify its installation and usage for users without a strong bioinformatics background". As described there, the installation of dependencies does not seem particularly easy, nor could I find instructions for installing Singularity on my system. I also cannot find any instructions for grid computing on the Bitbucket landing page.</p>
            <p> page 4 Please fix "AMAW automates the installation and the and orchestrates the successive runs in a grid-computing environment"</p>
            <p> page 5: Please consider changing "Moreover, RNA-Seq experiments are sorted by ascending data volume before being selected in an attempt to maximize the diversity of RNA-Seq libraries." to "To enhance the diversity of RNA-seq libraries, experiments are strategically selected based on ascending data volume"</p>
            <p> 
                <bold>Use cases:</bold>
            </p>
            <p> Please consider changing "For this, a dataset of 32 genomes of protist organisms was designed and the quality of the different structural annotations was assessed using the completeness metrics provided by BUSCO v4" to "To assess the impact of gene model choice, we designed a 32 protist genomes dataset and evaluated the quality of different structural annotations using BUSCO v4"</p>
            <p> Please start this section by explaining exactly what is being compared. Readers will already know that AMAW automates several steps, but we do not know exactly which non-automatic choices were made for running MAKER2, although it is described as "basic" later on. We need to know whether this is a fair comparison; this will also help readers understand Figure 2.</p>
            <p> Figure 2: Please change the figure colors and rewrite the legend, as it is really hard to understand which bars correspond to the 32 protists. Why is Arabidopsis (green) in both panels? Are human and Arabidopsis non-model species?</p>
            <p> Figure 3: The legend is identical to that of Figure 2 but with 17 genomes instead of 32. From the text I can't really understand it.</p>
            <p> Please rewrite the "Use cases" section; as it stands, we do not understand it at all.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>No</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>No</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>No</p>
            <p>Reviewer Expertise:</p>
            <p>Population Genetics, Ecology, Bioinformatics</p>
            <p>We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-267741-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
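                        <person-group person-group-type="author">
                            <name>
                                <surname>Stanke</surname>
                                <given-names>M</given-names>
                            </name>
                            <name>
                                <surname>Diekhans</surname>
                                <given-names>M</given-names>
                            </name>
                            <name>
                                <surname>Baertsch</surname>
                                <given-names>R</given-names>
                            </name>
                            <etal/>
                        </person-group>:
                        <article-title>Using native and syntenically mapped cDNA alignments to improve de novo gene finding.</article-title>
                        <source>
                            <italic>Bioinformatics</italic>
                        </source>. <year>2008</year>;<volume>24</volume>(<issue>5</issue>):<fpage>637</fpage>&#x2013;<lpage>644</lpage>.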
                        <pub-id pub-id-type="pmid">18218656</pub-id>
                        <pub-id pub-id-type="doi">10.1093/bioinformatics/btn013</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report267739">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.141827.r267739</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Titus-McQuillan</surname>
                        <given-names>James</given-names>
                    </name>
                    <xref ref-type="aff" rid="r267739a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-5197-4791</uri>
                </contrib>
                <aff id="r267739a1">
                    <label>1</label>University of North Carolina, Charlotte NC, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>10</day>
                <month>5</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Titus-McQuillan J</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport267739" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.129161.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Review: AMAW: automated gene annotation for non-model eukaryotic genomes</p>
            <list list-type="bullet">
                <list-item>
                    <p>Software Tool Article</p>
                </list-item>
            </list>
            <p>Overview:</p>
            <p> MAKER2 is one of the most widely used tools for annotating non-model genomes. However, its learning curve is steep, and it is not tailor-made for every genome. The authors have built a container with a wrapper inside to help flatten the difficulty curve of using the MAKER2 software suite.</p>
            <p>Major Revision:</p>
            <list list-type="bullet">
                <list-item>
                    <p>I think the conclusion section outpaces what this article aims to accomplish. As I read the article, looked over the code, and viewed the figures, I saw a well-executed wrapper for MAKER2, similar to this tutorial: 
                        <ext-link ext-link-type="uri" xlink:href="https://darencard.net/blog/2017-05-16-maker-genome-annotation/">https://darencard.net/blog/2017-05-16-MAKER-genome-annotation/</ext-link> from Card 
                        <italic>et al.</italic> 2019 [Ref-1]. I do not see any evidence that &#x201c;The application example shows how AMAW significantly improves the genome annotation quality in comparison of naive use of MAKER2 with pre-existing gene models, as well as the importance of providing specific evidence data.&#x201d; Does this wrapper allow for a much faster and more user-friendly implementation of MAKER2 pipelines? I see evidence for that claim. However, the former conclusions are baseless, given that the same pipeline could still be run from the template 
                        <italic>sans</italic> AMAW.</p>
                </list-item>
            </list>
            <p>Minor Revision:</p>
            <list list-type="bullet">
                <list-item>
                    <p>&#x201c;Moreover, they can short-circuit the pipeline by choosing an existing gene model for AUGUSTUS.&#x201d; - I think the term &#x201c;short-circuit&#x201d; may cause confusion here. Colloquially, it usually means that a device has broken, so readers may take it to mean not a shortcut but that doing this will break the pipeline.</p>
                </list-item>
                <list-item>
                    <p>Supplemental materials Fig. 3. &#x201c;It is desirable to a genome with the lowest proportion of missing genes.&#x201d; This needs to be clearer. Perhaps: &#x201c;Lower scores are desirable.&#x201d;</p>
                </list-item>
            </list>
            <p>Conclusion:</p>
            <p> To strengthen this article, I would ground the conclusions. This is a software release that solves a problem many scientists face: the learning curve of running gene annotations with MAKER2. I think it is also worthwhile to mention other software that aims to accomplish similar tasks, e.g., BRAKER (1 &amp; 2), Funannotate, etc. In addition, there is a nearly endless list of annotation software for specific tasks such as long reads (PacBio SMRT Analysis and Nanopore&#x2019;s software suite), proteomics approaches (AnnotaPipeline), finding isoforms (SQANTI and TranD), and a plethora of others that could be listed, which, in my opinion, seem to be (or are) roughly as user-friendly as AMAW. AMAW is just one of those software applications, one that uses MAKER2 as its engine.</p>
            <p> This article could help many run annotations and further the knowledge for those lost in the world of annotation software. Using a tried-and-true method like MAKER2 with an easy-to-follow pipeline is perfect for getting results and understanding the processes of annotation. However, the claims currently need to be tailored to what this paper accomplishes: a user-friendly pipeline and software wrapper for genomic annotations.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Yes</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Yes</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Biostatistics, genomics, bioinformatics</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-267739-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author">
                            <name>
                                <surname>Card</surname>
                            </name>
                            <etal/>
                        </person-group>:
                        <article-title>Genomic Basis of Convergent Island Phenotypes in Boa Constrictors.</article-title>
                        <source>
                            <italic>Genome Biol Evol</italic>
                        </source>. <year>2019</year>;<volume>11</volume>(<issue>11</issue>):<fpage>3123</fpage>&#x2013;<lpage>3143</lpage>.
                        <pub-id pub-id-type="pmid">31642474</pub-id>
                        <pub-id pub-id-type="doi">10.1093/gbe/evz226</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report267742">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.141827.r267742</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Torruella</surname>
                        <given-names>Guifr&#x00e9;</given-names>
                    </name>
                    <xref ref-type="aff" rid="r267742a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-6534-4758</uri>
                </contrib>
                <aff id="r267742a1">
                    <label>1</label>Institut de Biologia Evolutiva (UPF-CSIC), Barcelona, Spain</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>10</day>
                <month>5</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Torruella G</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport267742" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.129161.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>
                <bold>Overview:</bold>
            </p>
            <p> The manuscript by Meunier et al. is about an automated pipeline to annotate eukaryotic genomes using various rounds of MAKER2. I have found it relevant, so I thank the authors for their work. The analyses are sound, but the authors do not present them in enough detail. It reads more like an abstract and not like a software manuscript. Overall, I think the manuscript could be much improved. Please see some comments below that I hope the authors find useful, since I would be really interested in using this software for my own projects.</p>
            <p> 
                <bold>General comments and questions:</bold>
            </p>
            <p> Regarding the workflow, I find it interesting that the authors try to filter the transcripts used to train the model, but they do not explain why this is important. Does it improve accuracy or reduce computational time, as they say for the protein set? In this regard, did the authors compare Trinity to SPAdes? In my personal experience, the former gives more isoforms, probably spurious ones, but I have never done a proper comparison.</p>
            <p> I understand the pipeline is designed to study non-model organisms, and that accuracy or sensitivity cannot be provided for the &#x201c;use cases,&#x201d; but I wonder whether the authors could validate the pipeline with the same model organisms whose gene models they use. For example, how poor is the annotation of Homo or Arabidopsis when the gene models of a closely related species are used?</p>
            <p> Then, I don&#x2019;t know if providing only the median values of the % complete, fragmented, and missing markers is informative enough. It is known that the issues with annotation depend on genome complexity (size, number of introns, genetic code, etc.). I think the authors should add some extra validations, such as plotting the number of complete BUSCO markers against various genome features. I understand the software is designed to help protistologists without a bioinformatics background improve their genome annotations, so I think it would benefit everybody if more details were given regarding the diversity of protist genomes. I assume it is easier to annotate the condensed genomes of parasites than those of free-living organisms, but how does this change between amoebae and flagellates, or between heterotrophic and photosynthetic organisms? I understand this might be completely beyond the scope of the authors&#x2019; work, but as a protistologist, these are things that I would like developers to address.</p>
            <p> Again, if there is no RNA-seq data from the same species, a comparison of AMAW with MAKER2 across all taxa is not informative. First, which species in the use cases have RNA-seq data available, and which do not? Then, did the authors compare annotations of the same genome with and without RNA-seq input?</p>
            <p> Another question: why does SNAP need to be run before AUGUSTUS? If the authors of MAKER2 provide evidence for that, it should be cited. Since there have not been many developments in genome annotation, I would suggest the authors explain a bit more about how previous software works, mostly since their pipeline is based on these methods.</p>
            <p> How is the last step of annotation done? How are the gene models refined?</p>
            <p> I strongly suggest the authors compare AMAW with the various modes in BRAKER, another widely used tool for structural genome annotation.</p>
            <p> On a computational level, the authors mention the suitability for grid-computing environments (I guess this means clusters and supercomputers?). So, can it take advantage of MPI? The authors should provide some details on how the pipeline works in this regard, e.g., the computational requirements and run times. Related to the previous comment, I wonder how it compares with other structural annotation tools.</p>
            <p> 
                <bold>Detailed comments:</bold>
            </p>
            <p> In the first sentence of the conclusions, it&#x2019;s written &#x201c;automazing,&#x201d; which I believe is incorrect.</p>
            <p> Then, the first part of the following sentence is cumbersome. I&#x2019;d simplify it: &#x201c;We aim with AMAW&#x2019;s functionalities to automate the acquisition and deployment of evidence data to contribute to the effort for achieving continually more complete and accurate annotations, especially for poorly represented eukaryotic lineages.&#x201d;</p>
            <p> Figure S3 has some issues with the upper and lower borders.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Partly</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>No</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>No</p>
            <p>Reviewer Expertise:</p>
            <p>Evolutionary protistology. I am a user of genome assembly and annotation tools. I have experience in various genome projects of non-model eukaryotic genomes.</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report255599">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.141827.r255599</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Lang</surname>
                        <given-names>B. Franz</given-names>
                    </name>
                    <xref ref-type="aff" rid="r255599a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r255599a1">
                    <label>1</label>University of Montreal, Montreal, Canada</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>2</day>
                <month>4</month>
                <year>2024</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2024 Lang BF</copyright-statement>
                <copyright-year>2024</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport255599" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.129161.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>reject</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The manuscript by Meunier et al. aims to simplify the use of MAKER2, a genome annotator developed over a dozen years ago by Holt and Yandell, by automating its installation and usage processes. While the initiative to aid non-specialist users through this automation is commendable, the manuscript falls short in critically evaluating MAKER2's limitations compared to more recent annotators like BRAKER, GeMoMa, and Funannotate. These newer tools, which address some of MAKER2's conceptual and practical shortcomings, are neither mentioned nor evaluated for annotation quality. The literature review in the manuscript is notably brief and lacks depth, undermining the manuscript's contribution to the field, even if considered a brief technical note.</p>
            <p> </p>
            <p> To strengthen the manuscript, it is essential to include a comprehensive literature review on the current state of genome annotation, highlighting the advancements and shortcomings of existing tools, including MAKER2. A detailed comparison of annotation quality among these tools should also be provided, moving beyond mere gene counts to assess the quality of predicted gene models. Such analyses are crucial for justifying the automation of the MAKER2 pipeline and ensuring that potential users do not choose this approach merely because it is easy to use. Extensive revisions and additional analyses are necessary to address these significant gaps and to honour the contributions of previous work in the field.</p>
            <p>Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?</p>
            <p>No</p>
            <p>Is the rationale for developing the new software tool clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the software tool technically sound?</p>
            <p>Partly</p>
            <p>Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?</p>
            <p>Partly</p>
            <p>Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>genomics, bioinformatics, gene finding, RNA structure and prediction</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.</p>
        </body>
    </sub-article>
</article>
