A TALE of shrimps: Genome-wide survey of homeobox genes in 120 species from diverse crustacean taxa

The homeodomain-containing proteins are an important group of transcription factors found in most eukaryotes including animals, plants and fungi. Homeobox genes are responsible for a wide range of critical developmental and physiological processes, ranging from embryonic development and innate immune homeostasis to whole-body regeneration. Given the continued fascination with this key class of proteins among developmental and evolutionary biologists, multiple efforts have thus far focused on the identification and characterization of homeobox orthologs from key model organisms in attempts to infer their evolutionary origin and how this underpins the evolution of complex body plans. Despite their importance, the genetic complement of homeobox genes has yet to be described in one of the most valuable groups of animals representing economically important food crops. With crustacean aquaculture being a growing industry worldwide, it is clear that systematic and cross-species identification of crustacean homeobox orthologs is necessary in order to harness this genetic circuitry for the improvement of aquaculture sustainability. Using publicly available transcriptome data sets, we identified a total of 4183 putative homeobox genes from 120 crustacean species that include food crop species, such as lobsters, shrimps, crayfish and crabs. Additionally, we identified 717 homeobox orthologs from 6 other non-crustacean arthropods, which include the scorpion, deer tick, mosquitoes and centipede. This high confidence set of homeobox genes will now serve as a key resource to the broader community for future functional and comparative genomics studies.


Introduction
As one of the fastest growing industries, the seafood trade is dominated by fishing and farming of crustaceans, with annual sales exceeding $40 billion (Stentiford et al., 2012). Crustacean aquaculture is multi-faceted: it not only contributes to the ever-increasing demands of international markets, but is also directly linked to the socio-economic aspects of many developing nations through the creation of jobs and infrastructure. Aquaculture practices have intensified in recent years to cope with the demand. Yet many are not sustainable, since the increased densities of farmed shrimps often serve as hotbeds for pathogens if left unabated, causing infectious diseases and the devastation of cultures, resulting in massive financial losses. As a result, regulations associated with aquaculture diseases are being enforced with emphasis placed on preventative measures, e.g. enhancement of broodstock and research aiming to further our understanding of crustacean development and ways to utilize the innate ability of crustaceans to combat pathogens (Lai & Aboobaker, 2017; Stentiford et al., 2012).
Several conserved molecular genetic circuitries are well-known for regulating many aspects of development and innate immune homeostasis. One prominent example is the homeobox genes, a family of transcription factors defined by the presence of a homeodomain (Holland, 2013). Homeobox genes are among the most important master controls in development, and some headway has already been made in understanding their involvement in innate immunity; for example, Caudal in Drosophila melanogaster is implicated in commensal-gut mutualism (Ryu et al., 2004; Ryu et al., 2008). Given their importance, major efforts have thus far focused on characterization of homeobox genes in well-known model organisms such as humans (Garcia-Fernàndez, 2005; Holland et al., 2007), Caenorhabditis elegans (Bürglin, 1997), D. melanogaster (Mukherjee & Bürglin, 2007), planarians (Currie et al., 2016; Felix & Aboobaker, 2010; Garcia-Fernandez et al., 1991), amphioxus (Luke et al., 2003), teleost fish (Mulley et al., 2006) and many more. Although homeobox orthologs have been previously studied in the crustacean Parhyale hawaiensis (Kao et al., 2016), systematic and cross-species characterization of this gene family across the broader Crustacea with focus on food crop species is currently lacking. A better understanding of homeobox genes in crustaceans is therefore required to address this major shortfall, leading us to our present work.

Transcriptome data sets and query sets
We retrieved complete transcriptome data sets for 120 crustacean species available at the time of manuscript preparation from the European Nucleotide Archive. Six non-crustacean arthropod proteomes were retrieved from Uniprot. A complete list of accessions used in this study is provided in Supplementary Table 1. We retrieved a list of query sequences used in subsequent homology searches from Uniprot and GenBank.

Identification of homeobox orthologs
Based on a previously published workflow (Lai & Aboobaker, 2017), we used multiple Basic Local Alignment Search Tool (BLAST)-based approaches, such as BLASTp and tBLASTn to identify genes with homeodomain sequences. The BLAST results were filtered by an e-value cutoff of < 10^-6 and by best reciprocal BLAST hits against the GenBank non-redundant (nr) database, and redundant contigs having at least 95% identity were collapsed using CD-HIT. We then utilized HMMER (version 3.1), employing profile hidden Markov models (HMMs) (Finn et al., 2011), to scan for the presence of Pfam homeodomains (Bateman et al., 2004) on the best reciprocal nr BLAST hits, to compile a final non-redundant set of crustacean and arthropod homeobox gene orthologs (Dataset 1).
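The filtering logic described above can be illustrated with a minimal Python sketch. It is not the script used in this study: it assumes BLAST+ tabular output (-outfmt 6, with its standard 12-column order), applies the stated e-value cutoff, and uses a generic reciprocal-best-hit rule in which two sequences are kept only if each is the other's top-scoring hit. The function names and the toy hit tables are hypothetical.

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6 = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-6):
    """Return {query: (subject, evalue, bitscore)}.

    For each query, keep the highest-bitscore hit whose e-value is
    strictly below max_evalue; hits at or above the cutoff are dropped.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        rec = dict(zip(OUTFMT6, row))
        evalue = float(rec["evalue"])
        bits = float(rec["bitscore"])
        if evalue >= max_evalue:
            continue
        query = rec["qseqid"]
        if query not in best or bits > best[query][2]:
            best[query] = (rec["sseqid"], evalue, bits)
    return best


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where a's best hit is b and b's is a."""
    return sorted(
        (a, hit[0]) for a, hit in forward.items()
        if hit[0] in reverse and reverse[hit[0]][0] == a
    )


# Toy hit tables (hypothetical identifiers and scores, for illustration only).
FORWARD = "\n".join([
    "queryA\tcontig1\t90.0\t60\t6\t0\t1\t60\t1\t60\t1e-30\t120",
    "queryA\tcontig2\t50.0\t60\t30\t0\t1\t60\t1\t60\t1e-8\t55",
    "queryB\tcontig3\t40.0\t60\t36\t0\t1\t60\t1\t60\t1e-3\t35",
])
REVERSE = "\n".join([
    "contig1\tqueryA\t90.0\t60\t6\t0\t1\t60\t1\t60\t1e-28\t118",
    "contig2\tqueryC\t70.0\t60\t18\t0\t1\t60\t1\t60\t1e-15\t80",
])

forward = best_hits(io.StringIO(FORWARD))
reverse = best_hits(io.StringIO(REVERSE))
pairs = reciprocal_best_hits(forward, reverse)
print(pairs)
```

In this toy example queryB is discarded by the e-value cutoff, and only the queryA/contig1 pair survives the reciprocal test. Collapsing redundant contigs and scanning for Pfam homeodomains would follow as separate steps with CD-HIT and HMMER.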

Multiple sequence alignment and phylogenetic tree construction
Multiple sequence alignment of homeodomain sequences was performed using MAFFT (version 7) (Katoh et al., 2009). A phylogenetic tree was built from the MAFFT alignment using RAxML with the WAG + G model to generate a best-scoring maximum likelihood tree (Stamatakis, 2014). Geneious (version 7) was used to generate a graphical representation of the Newick tree (Kearse et al., 2012).
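As a rough guide to reproducing this step, the sketch below assembles plausible command lines for the two tools in Python without running them. The exact options used in this study are not stated, so everything here is an assumption: the MAFFT --auto mode, the raxmlHPC executable name (which differs between builds), the rapid-bootstrap analysis (-f a), the random seeds, the file names and the helper function names. PROTGAMMAWAG is the RAxML model string for WAG with gamma-distributed rate heterogeneity, and 1000 replicates matches the bootstrap number reported for Figure 2.

```python
import shlex


def mafft_command(fasta_in):
    """Build a MAFFT call; MAFFT writes the alignment to standard output."""
    return ["mafft", "--auto", fasta_in]


def raxml_command(alignment, run_name, bootstraps=1000, seed=12345):
    """Build a RAxML call for a WAG + gamma protein analysis.

    -f a  : rapid bootstrapping plus search for the best-scoring ML tree
    -m    : substitution model (PROTGAMMAWAG = WAG + gamma)
    -p/-x : random seeds for parsimony starting trees and rapid bootstrap
    -N    : number of bootstrap replicates
    -s/-n : input alignment and run name used for the output files
    """
    return ["raxmlHPC", "-f", "a", "-m", "PROTGAMMAWAG",
            "-p", str(seed), "-x", str(seed),
            "-N", str(bootstraps), "-s", alignment, "-n", run_name]


# Hypothetical file names, for illustration only.
align_cmd = mafft_command("tale_homeodomains.fasta")
tree_cmd = raxml_command("tale_homeodomains.aln", "tale")

print(shlex.join(align_cmd), "> tale_homeodomains.aln")
print(shlex.join(tree_cmd))
```

The commands are returned as argument lists so that they can be passed to subprocess.run without shell quoting problems; the redirection of the MAFFT output to a file is shown only in the printed form.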

Results and discussion
Identification of putative homeobox genes in crustaceans
With the recent availability of a large number of transcriptome data sets, we performed an extensive search for homeobox genes from 120 crustacean species. We focused on species represented across the broader Crustacea, sampling from three main crustacean classes, Malacostraca, Branchiopoda and Copepoda, with focus on key food crop species from the order Decapoda (Supplementary Table 1). Using BLAST-based approaches and profile HMMs (Bateman et al., 2004; Finn et al., 2011; Finn et al., 2015) for homology searches, we conservatively identified 4183 transcripts with homeodomain sequences from crustaceans (Figure 1; Dataset 1). Additionally, we included six non-crustacean arthropod species in our search and from these species, we identified 717 homeobox orthologs (Figure 1; Dataset 1).

Classification and phylogenetic analysis of TALE class genes
Concerted efforts to establish evolutionary classification of homeobox genes have resulted in 11 recognised classes (Edvardsen et al., 2005; Holland et al., 2007; Ryan et al., 2006; Zhong et al., 2008; Zhong & Holland, 2011). The Three-Amino acid-Loop Extension (TALE) superclass within the group of homeobox genes is characterized by three additional residues between alpha helices 1 and 2 of the homeodomain (Bertolino et al., 1995). TALE class homeodomain proteins are further divided into 6 subclasses, Meis, Pknox, Pbc, Irx, Mkx and Tgif, characterized by distinct motifs beyond the homeodomain (Bürglin, 1997; Bürglin, 2005; Holland et al., 2007; Mukherjee & Bürglin, 2007). We have classified a total of 165 TALE class orthologs from 15 decapod crustacean species (Figure 2). These genes form distinct phylogenetic groupings, which allows confident assignment of decapod TALE class orthologs into 6 sub-families (Figure 2). Importantly, the tree topology of crustacean TALE class orthologs recapitulated observations from a previous study (Holland et al., 2007).
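The defining feature of the TALE superclass, three extra residues in the loop between helices 1 and 2, can be expressed as a simple length check. The sketch below is only an illustration and was not part of the classification reported here: it assumes a canonical homeodomain of 60 residues (so a TALE homeodomain would be 63), the function name is hypothetical, and the test strings are placeholders rather than real sequences. A length check cannot assign subclasses and is no substitute for the phylogenetic analysis.

```python
CANONICAL_LENGTH = 60  # assumed length of a typical homeodomain
TALE_EXTENSION = 3     # extra loop residues between helices 1 and 2


def looks_like_tale(homeodomain):
    """Return True if the ungapped homeodomain has the TALE length (63)."""
    residues = homeodomain.replace("-", "")  # drop alignment gap characters
    return len(residues) == CANONICAL_LENGTH + TALE_EXTENSION


# Placeholder strings standing in for extracted homeodomain sequences.
typical = "A" * 60
tale_like = "A" * 63
tale_like_aligned = "A" * 30 + "---" + "A" * 33

print(looks_like_tale(typical))
print(looks_like_tale(tale_like))
print(looks_like_tale(tale_like_aligned))
```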
The tree was constructed using the maximum-likelihood method from an amino acid multiple sequence alignment, which includes TALE class genes from other species (Zhong et al., 2008; Zhong & Holland, 2011). TALE orthologs representing 6 subclasses are colour-coded. The node labels of each taxon are marked with distinctive colours denoted in the figure inset. Bootstrap support values (n=1000) are denoted as branch labels.

Conclusion
We identified 4900 homeodomain transcripts from 120 crustaceans and 6 non-crustacean arthropod species. Although this data set is non-exhaustive (transcriptomes contain only genes expressed at the point of sample collection), it will now serve as a key resource for future functional studies in the context of crustacean aquaculture. Beyond crustaceans, this work is widely applicable to studies on homeobox genes from other animals and will facilitate evolutionary and comparative genomics investigations.

Competing interests
No competing interests were disclosed.

Grant information
This work was supported by the EMBO Fellowship and the Human Frontier Science Program Fellowship to AGL.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

This work makes a putative assessment of the overall homeodomain complements of the transcriptomes of a number of crustacean species. One class of homeodomain-containing genes (TALE) from one order of crustaceans (Decapoda) is assessed in detail, but otherwise no attempt is made to categorise putative hits into gene families. The work is therefore preliminary in scope, and would benefit from the provision of even the broadest-level classification of hits into appropriate classes/subclasses of gene, which should be fairly straightforward given the diagnostic residues used to categorise these classes.
The title as it stands is misleading: a genome-wide survey is not made, and instead transcriptomic data is used, which will (by necessity) be gappy.
A link is made between aquaculture, innate immunity and homeodomain-containing proteins, but this is very tenuous. In particular, why the focus is on TALE class genes is unclear, if Caudal is the gene used as the exemplar for a link between these fields.
The methods section needs to be more precise. For example: "such as BLASTp and tBLASTn to identify genes with homeodomain sequences". What was done? How were protein sequences derived from nucleotide data for blastp searches? What sequences were used to search your datasets? (Perhaps add these to the sentence: "list of query sequences used in subsequent homology searches from Uniprot and GenBank."). The latter is particularly important as more distant sequences may be missed.
I have several questions about E values.
-Dataset 1 contains several ID'd genes with higher E values than the stated cutoff (< 10^-6). Is this deliberate?
-The e-value of < 10^-6 will also likely result in larger datasets returning more hits, purely as a consequence of how the E (expect) value is calculated. For example, the Daphnia magna transcriptome has 271,000 sequences, Triops 12,000. Therefore it is much more likely that sequences will make it through your annotation pathway in Daphnia magna rather than Triops. This will skew the results shown in Fig 1B. Is it possible to show that homeodomain genes are not artificially excluded, perhaps by giving the E values and best blast hits from "next-best" excluded sequences in your initial searches of small datasets, to prove no homeodomain sequences were artificially excluded? This is crucial, given the short length of the homeodomain, which will be the primary source of signal.
Was HMMER really run on the best reciprocal nr hits, as is suggested by your phrasing? Or was it run on the transcriptome-derived data?
Fig 1A: Violin plots are not appropriate here. Look for instance at the Branchiopod data, where 3 points are used to infer this plot.
The results in the tree in Fig 2 seem to indicate that decapod crustaceans completely lack Mkx genes, and the presentation of this is disingenuous in text (note the paraphyly of known Mkx homologues with regard to the inferred crustacean Mkx). Instead, the crustacean Mkx seem to be Pbc? "Importantly, the tree topology of crustacean TALE class orthologs recapitulated observations from a previous study (Holland et al., 2007)." -this statement does not seem to be correct.
The homeodomain complements, especially of ANTP class genes, of several crustacean species have been described previously, but no attempt is made to place the results observed in the context of the annotated sets of other species. Could this be provided? Particularly, the utility of the re-assessment of non-crustacean datasets is unclear, as these resources have been annotated previously in more detail. Were additional homeodomain-containing genes found by this re-analysis? Or fewer?
In short, this work seems to be partially successful in its aims. With the addition of additional information about the identity of sequences, and the correction of the problems noted above, it will be a coherent addition to extant information on crustacean homeodomain-containing genes.
There are several areas where the phrasing could be improved, e.g.
-"With continued fascination on this key class of proteins"
Sometimes articles (a/the) are missing from the text, e.g.
-"Phylogenetic tree was built from"

Is the work clearly and accurately presented and does it cite the current literature? Partly

Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others?

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
