Gene Annotation Easy Viewer (GAEV): Integrating KEGG’s Gene Function Annotations and Associated Molecular Pathways

We developed a Gene Annotation Easy Viewer (GAEV) that integrates the gene annotation data from the KEGG (Kyoto Encyclopedia of Genes and Genomes) Automatic Annotation Server. GAEV generates an easy-to-read table that summarizes the query gene name, the KO (KEGG Orthology) number, the name of the gene ortholog, the functional definition of the ortholog, and the functional pathways to which the query gene has been mapped. Via links to KEGG pathway maps, users can directly examine the interaction between gene products involved in the same molecular pathway. We provide a usage example by annotating the newly published genome of the freshwater microcrustacean Daphnia pulex. This gene-centered view of gene function and pathways will greatly facilitate the genome annotation of non-model species and metagenomics data. GAEV runs on a Windows or Linux system equipped with Python 3 and is accessible to users with no prior Unix command line experience.


Introduction
Describing the biological function of computationally annotated genes in non-model assemblies and the molecular pathways formed by these genes' products is critical for identifying the genetic basis of the various unique biological attributes (e.g., physiology, life history, behavior) of the species in question. Computational searches against DNA/protein databases, e.g., NCBI Blast (Boratyn et al., 2013), UniProt (Bateman et al., 2017), and InterPro (Finn et al., 2017), based on homology and protein domain information and using computational tools such as Blast (Camacho et al., 2009), InterProScan (Jones et al., 2014), and Hmmer (Mistry et al., 2013), can make predictions for individual gene functions. In contrast, delineating the molecular pathways encoded by the entire suite of genes of a single species is a much more challenging task, especially for non-model species. To this end, mapping genes to the molecular pathways derived from intensively studied model organisms provides an entry point for addressing this need.
For mapping genes into known molecular pathways, the Kyoto Encyclopedia of Genes and Genomes (KEGG) provides comprehensive web services (Kanehisa et al., 2017; Kanehisa & Goto, 2000; Kanehisa et al., 2016a). KEGG is an integrated database for biological interpretation of genome sequences. The molecular function of genes is classified using ortholog groups, i.e., KEGG Orthology (KO). KEGG also contains KEGG pathways, BRITE hierarchies, and KEGG modules, all of which are networks of KO nodes. It is possible to annotate the molecular functions of a set of genes from a complete/partial genome assembly or metagenomics dataset, together with their encoded molecular pathways, using the KEGG automatic annotation services provided through the web servers BlastKOALA and GhostKOALA (Kanehisa et al., 2016b). For a non-model species, we can use the KAAS (KEGG Automatic Annotation Server) web services to annotate a complete or random set of genes, describing their molecular function and mapping them into identified molecular pathways. The annotation results consist of KO numbers for each gene, genes mapped to the KEGG pathway database, and genes mapped to BRITE. Nonetheless, the resulting complete set of pathways and BRITE hierarchy can only be viewed through the temporary URL provided by KEGG, which is only available for several days after the analyses are completed. Although these results are organized through either curated KEGG pathways or BRITE hierarchy, KAAS does not provide an integrative gene-centered view of gene function and pathways, i.e., the complete summary of gene function and all associated molecular pathways for each gene.
As can be envisioned, integrating the gene function annotation based on KEGG orthology and KEGG pathways can provide an efficient way to characterize both the predicted genes and associated pathways for a newly assembled genome or metagenomics dataset. Despite numerous computational packages for retrieving KEGG pathways using the API provided by the KEGG database (e.g., Moutselos et al., 2009; Wrzodek et al., 2011), none of these packages, to the best of our knowledge, allows us to reconstruct the complete set of molecular pathways contained in a newly assembled genome. To provide a means of utilizing the highly informative resources at KEGG for annotating genomic sequences and molecular pathways for non-model species, we have developed a Gene Annotation Easy Viewer (GAEV) for integrating results of KEGG orthology annotation and KEGG pathway mapping using KEGG API tools in both Windows and Linux environments. GAEV aims to provide a gene-centered view of gene function and pathways, i.e., the complete summary of gene function and all possibly associated molecular pathways for each gene. This is distinct from other KEGG-related software such as MEGAN (Huson et al., 2016) and MinPath (Ye & Doak, 2009). MEGAN can achieve overall functional analysis of microbiome data with KEGG data (Huson et al., 2016), whereas MinPath aims to provide a conservative and faithful estimation of the biological pathways for a query dataset (Ye & Doak, 2009). GAEV is implemented in Python 3 and can be used as an independent package.

Methods
Assuming that the KEGG ortholog number is known for a single gene, the KO information can be retrieved from KEGG database by utilizing KEGG REST-style API. GAEV uses the 'get' operation of the KEGG API to extract data on the gene and linked pathways of every K number provided in the input file. The data extracted from KEGG database are stored in data files that can be loaded into GAEV to skip the data extraction step (Figure 1). Some genes will not have a KO number associated with them. Before data extraction, GAEV will trim the input file to remove all genes that do not have a KO number or have a KO number that cannot be searched in the KEGG database. Once data extraction from KEGG's database is complete and the data file is generated, GAEV helps the user handle and visualize the data by exporting the data as a table in an HTML file. GAEV populates the table with the user defined gene ID provided in the input file and the associated K number provided in the input file, as well as the gene name, definition, and linked pathways that have been retrieved from the KEGG database. The linked pathway map URLs that highlight identified genes in the genome assembly are created using the following formula: http://www. In the above URL, [mapno] represents the pathway accession number. [k-num{1,2,3…}] represents the K number for each gene in the pathway that is present in the provided genome assembly, and [k-num_interest] represents the K number of the focal gene that will be highlighted with a unique color.
[node_color] and [font_color] represent the desired color of the focal gene's node and font on the pathway map, respectively. By default, the node color of the focal gene is dark red, whereas the node color of other genes in the same pathway that are present in the genome assembly is light green.

Amendments from Version 2
A section was added under use cases to further explain how to use the batch jobs feature, and various typos raised by reviewers were addressed.

Figure 1. Workflow of Gene Annotation Easy Viewer (GAEV).
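As a minimal sketch of how a highlighted pathway link could be assembled from the placeholders described in Methods, consider the following. It is not taken from GAEV's source. The base address, the separator layout (a slash between genes and a URL-encoded tab, %09, before the colors), the color names, and the function name build_pathway_url are all assumptions made for illustration, because the URL formula in the text is truncated.

```python
# Assumed base address: the formula in the text is cut off after "http://www."
ASSUMED_BASE = "https://www.kegg.jp/kegg-bin/show_pathway"


def build_pathway_url(mapno, k_nums, k_num_interest,
                      node_color="red", font_color="white",
                      other_color="green", base=ASSUMED_BASE):
    """Return a link that colors every K number found in the assembly and
    gives the focal K number its own node and font colors.

    mapno           pathway accession, e.g. "map00010"
    k_nums          K numbers of the pathway that occur in the assembly
    k_num_interest  K number of the focal gene
    """
    parts = [mapno]
    # Sort for a reproducible URL; make sure the focal gene is included.
    for k in sorted(set(k_nums) | {k_num_interest}):
        if k == k_num_interest:
            parts.append(f"{k}%09{node_color},{font_color}")
        else:
            parts.append(f"{k}%09{other_color}")
    return base + "?" + "/".join(parts)


# Illustrative K numbers and pathway accession only.
url = build_pathway_url("map00010", ["K00001", "K00002"], "K00001")
print(url)
```

Keeping the base address as a parameter means the same helper can be pointed at whichever URL form KEGG currently documents without touching the coloring logic.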

Use cases

Installation
The most up-to-date version of this software can be downloaded at https://github.com/UtaDaphniaLab/kegg_path_generator. This software requires Python 3 to run. It is recommended that this software be used as a standalone program, either by double-clicking on GAEV.py or by running the 'python3 GAEV.py' command (or 'python GAEV.py' on systems where 'python' refers to Python 3).

Annotation
We analyze the newly published Daphnia pulex genome (Ye et al., 2017) to demonstrate the usage of our package. The required input file for our package contains two columns. The first column contains the gene names, whereas the second column contains the KO (KEGG orthology) numbers (Figure 2, Supplementary File 1). The KO numbers for the entire set of genes can be obtained through the KEGG Automatic Annotation Server. Briefly, users can provide the query protein sequences in a fasta file and use one of the provided search algorithms (e.g., Blast, GhostX, GhostZ) to assign KO numbers to each of the queried genes. The Daphnia protein fasta file can be found at https://figshare.com/articles/PA42_3_0_protein_new_txt/6653297. With a gff/gtf genome annotation file, users can also use tools such as gff2sequence (Camiolo & Porceddu, 2013) to extract DNA/protein sequences from genomic assemblies, which can be used as query sequences. Furthermore, researchers working with non-model organisms could use protein sequences extracted from an assembled transcriptome as input data. At the end of this analysis, the user will receive via email a link to the result page, where the query result can be downloaded. The downloaded query result can be used directly as the input file for our package, even when some genes have not been assigned a KO number (such genes are automatically excluded from further analysis).
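The two-column input and the exclusion of genes without a KO number can be illustrated with a short sketch. This is not GAEV's own trimming code. It assumes whitespace- or tab-separated columns and gene names without embedded spaces, and the function name read_gene_ko_pairs and the sample gene names are hypothetical. It performs only a syntactic check (K followed by five digits) and does not test whether a K number can actually be found in the KEGG database.

```python
import re

# A KO identifier is the letter K followed by five digits, e.g. K00001.
KO_PATTERN = re.compile(r"^K\d{5}$")


def read_gene_ko_pairs(lines):
    """Yield (gene, ko) pairs, skipping genes with no usable KO number."""
    for line in lines:
        fields = line.split()
        # Rows with a gene name only, or with a malformed second column,
        # are dropped, mirroring the trimming step in spirit.
        if len(fields) >= 2 and KO_PATTERN.match(fields[1]):
            yield fields[0], fields[1]


# Hypothetical rows in the layout of a downloaded annotation result.
sample = [
    "gene00001\tK00001",
    "gene00002",              # no KO assigned: excluded
    "gene00003\tK00002",
    "",                       # blank line: ignored
]
pairs = list(read_gene_ko_pairs(sample))
print(pairs)
```

In practice the same generator could be fed an open file handle instead of a list, since it only iterates over lines.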
With the obtained input file, the annotation analysis can be started by simply running GAEV.py and following the instructions of the menus. The first menu provides the option of using the obtained input file to extract data from KEGG or skipping the data extraction step by loading a pre-generated data file. Next, GAEV will prompt the user for the location of the input or data file. Both absolute and relative paths are accepted, but it is recommended that the GAEV.py file be placed in the same folder as the input or data file, so that the relative path can be easily used. After the data extraction from KEGG's servers is completed, a data file will be created, which can be reused for making different pathway tables. The next several menus guide the user through the process of customizing the output table. The user has the option to apply filters so that GAEV only outputs a table of genes with a specific keyword in their definition or linked pathways.
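To make the extraction and filtering steps more concrete, the sketch below parses one KEGG-style flat-file entry into a record and applies a keyword filter to its definition and linked pathways. It is an illustration under stated assumptions and not GAEV's implementation: the function names parse_flat_entry and matches_keyword are hypothetical, the sample entry is abbreviated and may not match the current KEGG content or field labels, and no network request is made.

```python
def parse_flat_entry(text):
    """Group a KEGG-style flat-file entry into {FIELD: [values]}."""
    fields, current = {}, None
    for line in text.splitlines():
        if line.startswith("///"):        # end-of-entry marker
            break
        if not line.strip():
            continue
        if line[0].isspace():             # continuation of the previous field
            value = line.strip()
        else:                             # new field: label, then content
            current, _, value = line.partition(" ")
            fields[current] = []
            value = value.strip()
        if current is not None and value:
            fields[current].append(value)
    return fields


def matches_keyword(record, keyword):
    """True if the keyword occurs in the definition or a linked pathway."""
    keyword = keyword.lower()
    haystack = record.get("DEFINITION", []) + record.get("PATHWAY", [])
    return any(keyword in item.lower() for item in haystack)


# Abbreviated, illustrative entry (assumed layout).
SAMPLE = """\
ENTRY       K00001                      KO
NAME        E1.1.1.1, adh
DEFINITION  alcohol dehydrogenase [EC:1.1.1.1]
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00071  Fatty acid degradation
///
"""
record = parse_flat_entry(SAMPLE)
keep = matches_keyword(record, "glycolysis")
print(record["PATHWAY"], keep)
```

Matching case-insensitively on both fields is a design choice of this sketch; the text states only that the keyword may occur in the definition or the linked pathways.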

Output file
The output file is an html file that can be opened in any internet browser (for an example, see Supplementary File 2). The results are organized in three sections. The first section is Genes and Linked Pathways, where the molecular function based on KO and the relevant pathways are listed for each query gene. Each associated pathway of a gene contains a link to the corresponding pathway page on the KEGG website, where this specific gene is colored in red and all other identified genes from the genome assembly are colored in green. The other two sections contain a list of the pathways sorted by the number of identified genes and by alphabetic order, respectively. These two sections provide a pathway-centered view of the functions of the annotated genome.
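The pathway-centered sections can be thought of as an inversion of the gene-to-pathway table followed by two different sorts. The following sketch shows one way to do this. The function name pathway_view, the gene names, and the pathway labels are hypothetical, and the tie-breaking rule (alphabetical within equal counts) is an assumption rather than documented GAEV behavior.

```python
from collections import defaultdict


def pathway_view(gene_to_pathways):
    """Invert gene -> pathways into pathway -> sorted list of genes."""
    view = defaultdict(set)
    for gene, pathways in gene_to_pathways.items():
        for pathway in pathways:
            view[pathway].add(gene)
    return {pathway: sorted(genes) for pathway, genes in view.items()}


# Hypothetical gene-centered table.
genes = {
    "geneA": ["Glycolysis / Gluconeogenesis", "Fatty acid degradation"],
    "geneB": ["Glycolysis / Gluconeogenesis"],
    "geneC": ["Circadian rhythm - fly"],
}
view = pathway_view(genes)

# Section 2: most identified genes first, ties broken alphabetically.
by_count = sorted(view, key=lambda p: (-len(view[p]), p.lower()))
# Section 3: plain alphabetical order.
by_name = sorted(view, key=str.lower)
print(by_count)
print(by_name)
```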

Batch jobs
The batch functions located in the first menu can be used when there are several sets of genes in different input files that the user wants to annotate. The batch functions require a file with the relative/absolute paths of each input file on separate lines. Alternatively, entering 'all' will direct GAEV to run using every file with a txt extension as an input file if new data files need to be generated or every file with a dat extension if new tables need to be created from existing data files.
Filters can be applied to batch jobs and will apply to all sets of genes. An html output file like the one described above will be created for each set of genes in the same folder as its respective input and data file.
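The two ways of specifying a batch (a list file with one path per line, or the word 'all') can be sketched as follows. This is an illustration only: the function name resolve_batch_inputs, its parameters, and the file names in the demonstration are hypothetical and do not come from GAEV's source.

```python
import tempfile
from pathlib import Path


def resolve_batch_inputs(spec, folder=".", from_data_files=False):
    """Return the files a batch run would process.

    spec is either the word 'all' or the path of a list file that holds
    one relative/absolute input path per line.
    """
    if spec.strip().lower() == "all":
        # txt files when new data files are generated, dat files when
        # tables are rebuilt from existing data files.
        pattern = "*.dat" if from_data_files else "*.txt"
        return sorted(Path(folder).glob(pattern))
    lines = Path(spec).read_text().splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


# Small demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("set1.txt", "set2.txt", "set1.dat"):
        (Path(tmp) / name).write_text("")
    found_txt = [p.name for p in resolve_batch_inputs("all", tmp)]
    found_dat = [p.name for p in
                 resolve_batch_inputs("all", tmp, from_data_files=True)]
    list_file = Path(tmp) / "jobs.list"
    list_file.write_text("a/set1.txt\n\nb/set2.txt\n")
    listed = resolve_batch_inputs(str(list_file))
print(found_txt, found_dat, listed)
```

Note that in 'all' mode a list file saved with a txt extension in the same folder would itself be picked up as an input, which is why the demonstration uses a different extension for it.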

Conclusions
The integrative annotation approach implemented in our package GAEV draws upon resources available at KEGG and provides an efficient way to explore the molecular pathways embodied in a draft genome. The integration of the generated html file with KEGG web services provides an intuitive interface to explore specific molecular pathways, with all the identified KEGG homologs highlighted in the pathway map. This type of information is essential to the initial exploration of non-model organisms' genomes to understand the conservation of specific pathways compared to established model systems. For example, if we examine the circadian rhythm pathway in the Daphnia genome (by clicking on the link to the circadian rhythm pathway in the generated html file), we see strong conservation between Daphnia and Drosophila, with only 1 gene (i.e., Vri) in this pathway missing an identified homolog in the Daphnia assembly (Figure 3). Further efforts can be dedicated to verifying the absence of the Vri gene in the Daphnia genome. The strong conservation of the circadian pathway can greatly aid future efforts in using the freshwater microcrustacean Daphnia to understand the internal clock of aquatic organisms in response to aquatic environments.
In principle, GAEV can be used for visualizing functions and pathways for gene sets of any scale, ranging from genome-wide data to subsets of genes in a genome. For example, we can use GAEV to visualize the pathways that differentially expressed genes are involved in. Often the large number of differentially expressed genes from RNA-seq experiments prevents clear cataloguing of these genes and molecular pathways. Analyzing the genes of interest using our package can provide a quick, integrative view of the genes and affected pathways.
In summary, with a user-friendly design (e.g., no requirement of UNIX command line experience) in mind, we have developed GAEV to provide a fast, easily accessible summary for KEGG gene annotation results. We expect that GAEV will find its use in many bioinformatic analyses, especially those involving non-model species.

Grant information
This work is supported by start-up funds from University of Texas at Arlington to SX.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Huynh and Xu present a new approach, GAEV, to annotate gene models of a given non-model genome assembly. It maps genes to their associated pathway counterpart of model species. GAEV specializes on improving the conventional KEGG analyses and focuses on providing a gene centric view of gene function and pathways. Overall, I think GAEV can be a good addition to the current bioinformatics research community. I recommend indexing of this manuscript once the authors could address my following questions/suggestions.

1. The first sentence of the manuscript confuses me about what problem the GAEV tries to solve. It provides a new and better way to annotate a given genome with pathways, but I don't think it tries to help assembly per se. I can see a possible role that GAEV plays in evaluating a de novo non-model assembly by looking at how many conservative pathways and genes are completed (for example, using Drosophila's circadian rhythm pathway to evaluate the newly assembled Daphnia genome). So it would be better if the authors could be clearer what problems GAEV tries to solve.
2. I have tested GAEV using the example data provided by the authors. It worked without any errors. However, I found two typos in the first two options of the menu popping up from typing "python GAEV.py". Is it supposed to be "Create and generate a new data file and table from a new dataset of KO numbers"? Please fix it if it is the case.
3. GAEV provides an interactive interface to walk the users through the entire procedure. Such a design gives the users certain degree of flexibility. However, if some users need to try GAEV using different parameters or filters (which is usually true when one tries to annotate their genome/metagenome), such an interactive design can be overwhelming. I am wondering if the authors have plans to have an end-to-end automatic version of GAEV? By providing enough options from command line, I think GAEV can be both flexible and efficient. The likely problem with relative/absolute path to input files can also be naturally solved, and the requirement of putting the input files together with GAEV.py is not necessary as well. I am not asking the authors to have an implementation for this publication, but it would be nice to know the authors' thoughts on this.
4. GAEV implements a Trimming method. However, I cannot find some descriptions of this step. Such information is very important to users to understand what has been done internally and how it has been done.
5. GAEV also supports "BATCH" mode. Could the authors provide some use cases and let the users know what they should expect in terms of the final HTML report?
6. One typo: "MEGAN can achieve overall functional analysisof microbiome …" should be "MEGAN can achieve overall functional analysis of microbiome …".
7. In the URL formula, it seems that there are extra spaces after "www". Please fix it if it is the case.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, genomics, genetics
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Efthymios Ladoukakis
National Technical University of Athens, Athens, Greece
I have no further comments to make.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
which they are linked. The html view provides also an overview of the total existing pathways per associated genes, a task which can be useful for whole genome and metagenome annotation queries.
The authors provide a thorough explanation about how the tool works and communicates with KEGG API by generating the appropriate links and exporting the generated data. The authors also provide some test datasets which can be used to generate the results mentioned in the manuscript.
Nevertheless this reviewer considers this endeavour to have already been covered by other bioinformatic tools with which a comparison would be necessary to underline the importance of the new tool. For example tools like MEGAN can provide a thorough investigation of the existing KEGG pathways in a genome/metagenome (although by using an older and not commercial version of the KEGG database). Furthermore MinPath can be also used in combination with KEGG generated datasets in order to provide a similar pathway reconstruction analysis. Maybe the authors could elaborate a little bit more about what makes their tool more suitable than these already published tools.
Moreover during the Conclusions section the authors do not seem to explain the methodology in order to examine the differences between Daphnia and Drosophila and how that (and similar analyses) can be achieved solely (or more intuitively) by exploiting this particular tool.
In general this tool seems like a good addition to a bioinformatic pipeline for genomic or metagenomic analysis but this reviewer thinks that the author must emphasize more on the differences and/or improvements regarding similar tools.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly

Competing Interests: No competing interests were disclosed.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

Thanks very much for your comments on our manuscript. Please see below for how we revised our manuscript to address your concerns.
1. Nevertheless this reviewer considers this endeavour to have already been covered by other bioinformatic tools with which a comparison would be necessary to underline the importance of the new tool. For example tools like MEGAN can provide a thorough investigation of the existing KEGG pathways in a genome/metagenome (although by using an older and not commercial version of the KEGG database). Furthermore MinPath can be also used in combination with KEGG generated datasets in order to provide a similar pathway reconstruction analysis.
Response: Our goal with our new tool is to provide a gene-centric view of molecular pathways, where each gene is accompanied by all the pathways where this gene is predicted to play a role. This is different from the purposes of MEGAN and MinPath. We emphasized this idea in the last paragraph of Introduction and drew comparison with MEGAN and MinPath.

Moreover during the Conclusions section the authors do not seem to explain the methodology in order to examine the differences between Daphnia and Drosophila and how that (and similar analyses) can be achieved solely (or more intuitively) by exploiting this particular tool.
Response: This example of the Daphnia circadian pathway is to demonstrate using this tool for initial exploration of non-model organisms' genomes to understand the conservation of specific pathways compared to established model systems (i.e., Drosophila). The Drosophila circadian pathway is provided through KEGG database. Users can directly examine their interested pathways from the results of GAEV (click on the link in the generated html file) and view the pathways and mapped genes on KEGG website. We provide a brief explanation of how to technically view the pathway on KEGG server in Discussion.
Competing Interests: No competing interests were disclosed.