Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups

Samuel Moix; Natasha Glover; Sina Majidian

doi:10.12688/f1000research.128656.1

Method Article

[version 1; peer review: 1 approved with reservations, 1 not approved]

Samuel Moix¹, Natasha Glover², Sina Majidian¹,²

PUBLISHED 12 Apr 2023

Author details

¹ Department of Computational Biology, University of Lausanne, Lausanne, 1015, Switzerland
² Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland

Samuel Moix
Roles: Formal Analysis, Methodology, Software, Visualization, Writing – Original Draft Preparation

Natasha Glover
Roles: Methodology, Supervision, Visualization, Writing – Review & Editing

Sina Majidian
Roles: Conceptualization, Supervision, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

This article is included in the OMA collection.

Abstract

Background: Whole genome duplications (WGD) are genomic events leading to the formation of polyploid organisms. The resulting duplicated genes play important roles in driving species evolution and diversification. After such events, the initial ploidy is usually restored, which complicates their detection over evolutionary time. With advances in bioinformatics and the rising number of well-assembled genomes, new detection methods are continually being developed to overcome the weaknesses of existing approaches.

Results: Here we propose a novel method for detecting WGD in yeast lineages based on the quantitative and comparative analysis of hierarchical orthologous groups (HOGs) of duplicated genes for a given set of organisms. We reconstruct ancestral genomes to obtain evolutionary information for each phylogenetic branch. This reconstruction relies on the inference of HOGs from the selected species’ proteomes. To estimate WGD events, the number of HOGs of duplicated genes at each taxonomic level is adjusted according to the molecular clock hypothesis and by the average genome size. Branches with a significant increase in the adjusted number of duplicated gene families are kept as candidates for WGD placement. The developed method was tested on two real datasets and showed promising results in phylogenetic WGD placement on the yeast lineage.

Keywords

comparative genomics, orthologous groups, whole genome duplications, yeast

Corresponding authors: Natasha Glover, Sina Majidian

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by the Swiss National Science Foundation [grant numbers 205085 and 183723].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2023 Moix S et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Moix S, Glover N and Majidian S. Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:382 (https://doi.org/10.12688/f1000research.128656.1) First published: 12 Apr 2023, 12:382 (https://doi.org/10.12688/f1000research.128656.1) Latest published: 12 Apr 2023, 12:382 (https://doi.org/10.12688/f1000research.128656.1)

Introduction

The potential role of gene duplication in driving evolutionary innovation has long been recognized.¹ Whole genome duplication (WGD) is a major source of duplicated genes that has occurred several times over the course of evolution in many eukaryotic organisms;² it leads to the formation of an organism (or a cell) with additional copies of its entire genome. This event is often followed by massive gene loss due to non-functionalization. However, some duplicated genes may undergo subfunctionalization (partitioning of the ancestral gene's functions between the copies) or neofunctionalization (acquisition of a novel gene function), which favors their preservation.³

To understand the circumstances leading to WGD in eukaryotes, Marcet-Houben and Gabaldón proposed a hypothesis describing an ancient allopolyploidization event (a hybridization between two different yeast species prior to genome doubling).⁴ In an autopolyploid scenario (genome doubling within a single species), the benefits of neofunctionalization would only take effect in the long run. In contrast, an allopolyploid scenario would confer an initial selective advantage, because it brings together different physiological properties and sexually isolates the newly formed lineage.⁴

Elucidating the causes of WGD remains a non-trivial task, as such events are often ancient and leave few traces, since polyploidy is usually not maintained over the course of evolution.⁵–⁷ For the same reasons, dating these events is challenging. One type of trace remaining in available genomes is conserved paralogs, which are defined as genes descended from an ancestral gene through a duplication event. Paralogs that result from WGD are referred to as ohnologs.

Finding homologous relationships (genes derived from the same ancestor) is a key concept in evolutionary biology used by researchers to describe genetic differences and similarities between different species. One way to categorize genes is by inferring so-called hierarchical orthologous groups (HOGs). HOGs are defined as “sets of genes that have descended from a common ancestral gene in a given ancestral species”.⁸ These nested groups of orthologs and paralogs, defined at each taxonomic level, can be used to reconstruct ancestral genomes and infer gene families. This also provides information on evolutionary events such as gene duplications, gains or losses between known or reconstructed genomes. These inferred relationships play a critical role for this study.

Over the past few decades, multiple WGD events on the tree of life have been detected.⁹ Susumu Ohno proposed that two rounds of complete genome duplication occurred in early vertebrate evolution (often referred to as 2R). There is now strong evidence confirming this hypothesis.¹⁰ Additionally, a consensus has been reached in favor of a third round (3R) of WGD in the fish lineage, which is therefore also referred to as the fish-specific genome duplication.¹¹ As mentioned earlier, WGD events have also been established in other eukaryotic lineages. For instance, an ancestral genome duplication has been placed before the common ancestor of extant seed plants, followed by another in the common ancestor of extant angiosperms.¹² Moreover, multiple WGD events have been estimated among eudicotyledons and monocotyledons.¹² Another well-known case occurred just before the separation of the yeast species Vanderwaltozyma polyspora from the S. cerevisiae lineage.¹³

Although found in vertebrates, plants, fungi, and invertebrates, WGDs remain relatively rare evolutionary events that have far-reaching repercussions on speciation and biodiversity dynamics.²,¹⁴ Being able to place them across the tree of life is crucial to our understanding of species’ evolutionary history. Additionally, polyploidy has important effects on human health. Polyploidization events, which are among the hallmarks of cancer, are found in roughly one third of human cancers. Besides cancer, genome duplications also occur in normal cells such as hepatocytes and in premalignant lesions, impacting other disease conditions.¹⁵ Thus, expanding knowledge of WGD is of great importance for cancer and other disease research.

Currently, the main existing approaches for detecting WGD using genomic data are based on the detection of large conserved syntenic regions, gene counts along the phylogeny, and synonymous mutation rates.¹⁶,¹⁷ These methods all have their own drawbacks. For instance, investigations through synteny require high-quality genome assemblies providing gene order, and might be biased by small-scale duplications or remaining layers of older WGD.¹⁸ In methods based on the number of synonymous substitutions per synonymous site (Ks), signal is lost in older WGDs due to Ks stochasticity. Moreover, these methods suffer from Ks saturation, which might lead to the formation of artificial peaks.¹⁹ Such limitations could be mitigated by combining multiple techniques when trying to infer WGD events. Finally, to cope with the rapid increase in available genomic data, scalable methods need to be developed.

This study aims to propose a novel approach, based on hierarchical orthologous groups of duplicated genes (henceforth referred to as duplication HOGs), to locate WGD events on the given phylogenetic tree of a set of organisms. The method relies on the organisms’ proteomes, i.e. protein sequence annotations, for the detection of WGD resulting from allopolyploidy or autopolyploidy. This study is focused on the yeast lineage as it contains a single well-established WGD.¹³ Additionally, yeast species have relatively small and less complex genomes, facilitating computation and analysis, thus favoring this choice. Applying our proposed method on two datasets comprising yeast species allowed proper identification of previously reported WGD events (details in the Results section). It is worth mentioning that the method could be used for newly sequenced genomes as well.

Methods

Our method takes the HOGs and the species tree as inputs. In the following section, we describe how HOGs can be estimated for the genomes of interest. Then, in the section “Detection of whole genome duplications (WGD)”, our method to detect WGD is presented in detail. Of note, we evaluated our method using 32 yeast genomes from the Saccharomycetales clade (details in the Results section).

Orthology inference

We used the OMA Standalone software²⁰ to infer HOGs and to reconstruct ancestral genomes. To obtain HOGs, OMA Standalone performs multiple steps. First, the data undergoes quality and consistency checks. Second, all-against-all sequence alignments are computed in order to infer pairwise orthologs. We exported precomputed all-against-all protein sequence alignments from the OMA browser.²¹,²² Using precomputed all-against-all alignments as the input of OMA Standalone significantly lowers the required computational resources. In contrast to most other orthology inference resources, during the all-against-all alignment phase, OMA compares multiple isoforms in order to identify and use the isoform with the best cross-species matches.²¹ Next, pairwise orthologs are used to build an orthology graph. Finally, the GETHOGs algorithm (“Graph-based Efficient Technique for Hierarchical Orthologous Groups”) uses this graph and the given species tree for HOG inference. If no species tree is given, OMA estimates it using a least-squares distance approach based on OMA groups²³ (Figure 1). OMA Standalone was run with the selected sets of yeasts to obtain an estimated species tree in Newick format and HOGs in an OrthoXML file.²⁰

Figure 1. Quantitative duplication hierarchical orthologous group (HOG) analysis pipeline flowchart.

The main steps of the developed method are illustrated on this flowchart. (For the complete description of the method, see the Methods section).

Detection of whole genome duplications (WGD)

For WGD analysis, the number of inferred duplicated genes at each taxonomic level is essential. To retrieve this information, the Python library pyHam (“python HOG analysis method”)²⁴ was used. This package facilitates the extraction of evolutionary information, such as gene duplications or losses at specific taxonomic levels, from OMA Standalone’s HOG inference output. Duplication events can be inferred by the “mapper” class, which compares evolutionary relationships between an ancestral node and any of its descendants.

Our method to place WGD events is based on the number of HOGs that are the direct result of a duplication event, which we call “HOGs of duplicated genes” or duplication HOGs (Figure 1). The species tree was traversed from the root node, and the evolutionary history of all genes, including duplication events, between each ancestral node and its two closest descendant nodes was inferred using pyHam’s “vertical comparison” method.²⁴ The number of HOGs of duplicated genes, rather than the number of duplication events, was considered, as ohnologs should result from one large duplication event. Branches in the species tree with significantly higher numbers of duplication HOGs (after the adjustments described below) were considered as putative WGD events.
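The traversal and the two kinds of counts can be illustrated with a minimal sketch. It assumes, purely for illustration, that the vertical comparison of each branch has already been summarised as a plain dictionary mapping every duplicated ancestral HOG to the descendant HOGs it gave rise to. The toy tree, all identifiers, and the helpers `iter_branches` and `count_duplications` are hypothetical and are not part of pyHam or of the published notebook; reading "duplication HOGs" as the descendant HOGs is likewise an assumption.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical species tree: each ancestral node maps to its direct descendants.
# Names that never appear as keys are extant species.
TOY_TREE: Dict[str, List[str]] = {
    "root": ["sp_A", "anc1"],
    "anc1": ["sp_B", "anc2"],
    "anc2": ["sp_C", "sp_D"],
}


def iter_branches(tree: Dict[str, List[str]], node: str = "root") -> Iterator[Tuple[str, str]]:
    """Yield (ancestor, descendant) pairs, walking the tree from the root down."""
    for child in tree.get(node, []):
        yield node, child
        yield from iter_branches(tree, child)


def count_duplications(dup_map: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (duplication events, duplication HOGs) for one branch.

    Assumption: one key is one duplicated ancestral HOG (one event), and the
    listed values are the HOGs of duplicated genes descending from it.
    """
    n_events = len(dup_map)
    n_dup_hogs = sum(len(descendants) for descendants in dup_map.values())
    return n_events, n_dup_hogs


# Made-up duplication maps for two branches; other branches have none.
toy_duplications: Dict[Tuple[str, str], Dict[str, List[str]]] = {
    ("anc1", "anc2"): {
        "HOG_a": ["HOG_a1", "HOG_a2"],
        "HOG_b": ["HOG_b1", "HOG_b2", "HOG_b3"],
    },
    ("anc1", "sp_B"): {"HOG_c": ["HOG_c1", "HOG_c2"]},
}

branches = list(iter_branches(TOY_TREE))
counts = {branch: count_duplications(toy_duplications.get(branch, {})) for branch in branches}

for branch, (events, dup_hogs) in counts.items():
    print(branch, "events:", events, "duplication HOGs:", dup_hogs)
```

In this toy layout every ancestral node is compared with each of its direct descendants, so a bifurcating tree with four extant species yields six branches.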

In order to have a fair comparison between branches on the species tree with different lengths, the number of duplication HOGs was adjusted by evolutionary distance and average genome size. Adjusting by distance is based on the molecular clock hypothesis, which states that DNA and protein sequences evolve at a somewhat constant rate throughout time and across organisms.²⁵ With this assumption, we would expect more evolutionary events over a longer time period and thus need to adjust the number of duplication HOGs accordingly by dividing by branch length.

Note that very short branches can make the adjusted number of duplication HOGs misleadingly high, so a pseudocount is added to the branch length before dividing. Depending on the dataset, users of our code can choose different values for this pseudocount (set to one by default). Finally, to account for differences in genome size, the average number of genes of the two compared nodes is used, based on the assumption that the ancestor and descendant have similar genome sizes. These adjustments can be summarized with the following formula:

\text{adjusted value} = \frac{\#\,\text{duplication HOGs}}{(\text{branch length} + \text{pseudocount}) \times \text{average genome size}}
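The formula above can be written as a small function. This is a sketch rather than the code of the published notebook: the function name, its parameter names and the example numbers are illustrative assumptions, and genome size is taken to be the number of genes, as described in the text.

```python
def adjusted_duplication_hogs(
    n_dup_hogs: int,
    branch_length: float,
    ancestor_genes: float,
    descendant_genes: float,
    pseudocount: float = 1.0,
) -> float:
    """Duplication HOGs divided by (branch length + pseudocount) and by the
    average number of genes of the two compared nodes."""
    if branch_length < 0 or pseudocount < 0:
        raise ValueError("branch length and pseudocount must be non-negative")
    distance = branch_length + pseudocount
    average_genome_size = (ancestor_genes + descendant_genes) / 2
    if distance == 0 or average_genome_size <= 0:
        raise ValueError("distance and average genome size must be positive")
    return n_dup_hogs / (distance * average_genome_size)


# Made-up numbers: 500 / ((0.25 + 1) * 5000) = 0.08
example = adjusted_duplication_hogs(500, 0.25, 5200, 4800)

# Effect of the pseudocount on a very short branch with only 10 duplication HOGs
without_pseudocount = adjusted_duplication_hogs(10, 0.001, 5000, 5000, pseudocount=0.0)
with_pseudocount = adjusted_duplication_hogs(10, 0.001, 5000, 5000, pseudocount=1.0)
print(example, without_pseudocount, with_pseudocount)
```

Without a pseudocount, the short branch in this example receives a far larger adjusted value than the branch with 500 duplication HOGs, which is the distortion the pseudocount is meant to dampen.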

After obtaining the adjusted number of duplication HOGs for each branch, the values and their distribution were plotted to determine a significance threshold for selecting possible WGD candidates. Another way to find candidates is to compute the adjusted HOG counts with different pseudocounts and retain the branches that remain significant across the various adjustments. To identify these high values (outliers), we set a threshold for each case based on the interquartile range:

\text{threshold} = \text{third quartile} + 1.5 \times \text{interquartile range}
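A minimal sketch of this outlier rule, using only the Python standard library, is given below. The adjusted values are made up, the helper names `upper_fence` and `wgd_candidates` are hypothetical, and the quartile interpolation convention (`method="inclusive"`) is an assumption, since the text does not state which convention was used.

```python
from statistics import quantiles
from typing import Dict, List, Tuple


def upper_fence(values: List[float]) -> float:
    """Third quartile plus 1.5 times the interquartile range."""
    # "inclusive" interpolates linearly between the sorted data points;
    # this convention is an assumption, not taken from the article.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def wgd_candidates(adjusted: Dict[int, float]) -> Tuple[float, List[int]]:
    """Return the threshold and the branch ids whose value exceeds it."""
    threshold = upper_fence(list(adjusted.values()))
    candidates = sorted(bid for bid, value in adjusted.items() if value > threshold)
    return threshold, candidates


# Made-up adjusted values keyed by branch id (not the values of the study)
toy_adjusted = {
    1: 0.010, 2: 0.012, 3: 0.015, 4: 0.018,
    5: 0.080, 6: 0.020, 7: 0.022, 8: 0.025,
}

threshold, candidates = wgd_candidates(toy_adjusted)
print(f"threshold = {threshold:.4f}, candidate branches = {candidates}")
```

With these toy values only branch 5 lies above the upper fence, so it would be the single WGD candidate.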

Benchmarking of this method was done by comparing results with known cases found in the literature.⁴,²⁶

Results

A diverse range of yeast organisms was used in this project in order to assess the performance of our WGD placement method. We chose yeast for a proof of concept because this clade is well documented and the species have simple genomes relative to other eukaryotes. These genomes were extracted from the Orthologous MAtrix (OMA) database.²⁰ This database currently contains 2,496 genomes (Nov. 2022) from across the entire Tree of Life and is continuously being updated. As part of method evaluation, we first consider a small dataset, “small yeast”, of five species: Schizosaccharomyces pombe, Kluyveromyces lactis, Zygosaccharomyces rouxii, Vanderwaltozyma polyspora and Saccharomyces cerevisiae. In this dataset, the well-known model organism S. pombe acts as a rooting species. K. lactis and Z. rouxii are potential candidates for the species that might have led to the WGD under the hybridization theory⁴ and thus branch off before the known WGD, which is placed before V. polyspora’s branch. In other words, Vanderwaltozyma polyspora and Saccharomyces cerevisiae are known to have undergone polyploidization.⁴ Another dataset, “large yeast”, was used as proof of concept and is composed of 32 species from the Saccharomycetales order found in the OMA database, in addition to the species Penicillium chrysogenum for rooting.

As described in the Methods section, once HOGs were inferred with OMA Standalone, the estimated number of gained, duplicated, retained or lost genes on each branch was obtained and visualized with the Python package pyHam (Figure 2A,B). We notice a striking gene loss on the branch leading to the extant Saccharomyces cerevisiae South African wine strain (strain AWRI796), which may be due to its domestication. However, many of the gene losses might be false positives, which can happen when the extant genome is of lower quality or incomplete. Branches leading to Kluyveromyces lactis and Zygosaccharomyces rouxii have slightly fewer duplications than the other branches, but this data on its own is not sufficient to discriminate between possible WGD candidates (Figure 2A,B).

Figure 2. Hierarchical orthologous groups (HOGs) reveal evolutionary events in the yeast lineage.

(A) Inferred phylogenetic tree of the “small yeast” dataset using the OMA Standalone pipeline. Numbers in black correspond to branch lengths and numbers in purple to the branch ids found in the bar plots. Colored stacked bars on branches illustrate the non-adjusted number of gained, duplicated, retained or lost genes from one node to another (these species have roughly 5000 genes on average). The same information can be found as a bar plot in panel B. The red-colored branch corresponds to the placement of the whole genome duplication (WGD) detected by our method. (B) Bar plots represent the number of gained genes, duplication events, duplication HOGs, retained genes, and lost genes for the “small yeast” dataset. Branch ids correspond to the ids found on the phylogenetic tree of panel A. (C) Same bar plots as panel B, adjusted by the distance (branch length plus a pseudocount of one) and the average genome size for each branch (id).

However, when the number of duplication HOGs is adjusted by genome size and branch length, a distinctly larger relative value (seen as an outlier in the distribution) appears on the branch of the known WGD (branch id 5; Figure 2C). The significance threshold is obtained by adding 1.5 times the interquartile range to the third quartile (Figure 3A). This allows clear placement of the WGD event on the species tree. Note that this value remains significant with a pseudocount of five (see Figure 3B). Indeed, the next highest value (branch id 8) reaches only 52% of this significant value.

Figure 3. Distinctive peak in accordance with the known whole genome duplications (WGD) branch.

(A) Bar graph of the relative (by min-max normalization) adjusted (by branch length plus one and average genome size) number of duplication hierarchical orthologous groups (HOGs) for each branch of the “small yeast” dataset (see Figure 2 for branch id correspondence). On the right side, the box plot represents the distribution of these values. The distribution outlier is considered a significant value and thus a WGD candidate. Red bars correspond to branches that are significant over a range of different pseudocounts. (B) Replication of the panel A bar plot with a pseudocount of five. (C) Adjusted number of duplication HOGs for each branch of the “large yeast” dataset, with the pseudocount added to the distance set to five. Red bars correspond to branches that remain significant across a range of different pseudocounts (e.g. 1, 5 and 10). See Figure 4 for branch id correspondence.

Applying the same approach to the larger dataset containing 32 species from the Saccharomycetales order resulted in two possible WGD candidates. The first corresponds to the aforementioned WGD branch, with branch id 25 (Figure 4). The second, with an adjusted number above 0.8, corresponds to the branch leading to the extant species Pichia sorbitophila, with branch id 42. Interestingly, it was shown that Pichia sorbitophila results from allopolyploidization and has a genome of about twice the size of its ancestor.²⁸ In this dataset, the differences among adjusted HOG counts are smaller, which makes it more difficult to set the threshold that distinguishes WGD branches, i.e. those with the highest numbers of duplication HOGs (Figure 4).

Figure 4. Quantitative duplication hierarchical orthologous group (HOG) analysis results on the “large yeast” dataset.

(A) Inferred phylogenetic tree of the “large yeast” dataset. Established post-whole genome duplication (WGD) species are shown in the light-green box. (B) Number of duplication HOGs adjusted by the distance (branch length plus one) and the average genome size for each branch (id). On the right side, the box plot represents the distribution of these values. The values that remain significant with higher pseudocounts (Figure 3C) correspond to the WGD candidates found (branch ids 25 and 42). Filled blue circles on the phylogenetic tree mark the values that are significant with a pseudocount of one.

Discussion

Proposed method is able to correctly place the WGD event in the yeast lineage

Results obtained on the two yeast datasets are promising as the branch with the maximum adjusted duplication HOGs corresponds to the known WGD.

It is important to note that, despite the pseudocount, adjusting by evolutionary distance still inflates the values of short branches slightly relative to longer branches. To account for this, we recommend trying different pseudocounts and comparing the outcomes. With the “large yeast” dataset, the WGD candidates vary with the pseudocount used. When the analysis is repeated with different pseudocounts (e.g., integers between one and 20), two branches persist: the known WGD branch and the branch leading to P. sorbitophila, which has a very large number of duplicated genes. As mentioned earlier, this species has been shown to have undergone genome duplication through allopolyploidization.²⁸
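The idea of keeping only branches that stay significant across pseudocounts can be sketched as follows. The branch data are invented for illustration, the helper `outlier_branches` is hypothetical, and the quartile convention is an assumption; none of this is taken from the published notebook.

```python
from statistics import quantiles
from typing import Dict, Set, Tuple

# Made-up data: branch id -> (duplication HOGs, branch length, average number of genes)
TOY_BRANCHES: Dict[int, Tuple[int, float, float]] = {
    1: (40, 10.0, 5000.0),
    2: (45, 12.0, 5000.0),
    3: (50, 15.0, 5000.0),
    4: (35, 8.0, 5000.0),
    5: (400, 12.0, 5000.0),  # branch with a genuine burst of duplications
    6: (20, 0.5, 5000.0),    # very short branch with few duplications
    7: (50, 14.0, 5000.0),
    8: (42, 11.0, 5000.0),
}


def outlier_branches(branches: Dict[int, Tuple[int, float, float]], pseudocount: float) -> Set[int]:
    """Branch ids whose adjusted value exceeds Q3 + 1.5 * IQR for one pseudocount."""
    adjusted = {
        bid: n_dup / ((length + pseudocount) * genes)
        for bid, (n_dup, length, genes) in branches.items()
    }
    # Quartile convention ("inclusive") is an assumption of this sketch.
    q1, _median, q3 = quantiles(list(adjusted.values()), n=4, method="inclusive")
    threshold = q3 + 1.5 * (q3 - q1)
    return {bid for bid, value in adjusted.items() if value > threshold}


pseudocounts = (1, 5, 10)
outliers_by_pseudocount = {p: outlier_branches(TOY_BRANCHES, p) for p in pseudocounts}

# Keep only the branches flagged for every pseudocount tried.
robust_candidates = set.intersection(*outliers_by_pseudocount.values())

for p, flagged in outliers_by_pseudocount.items():
    print(f"pseudocount {p}: {sorted(flagged)}")
print("robust candidates:", sorted(robust_candidates))
```

With these invented numbers, the short branch 6 is flagged for pseudocounts of one and five but not ten, whereas branch 5 is flagged every time and is the only candidate that survives the intersection.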

Overall, the applied method seems to perform quite well on its own for WGD placement in the yeast lineage, assuming the user keeps a fairly stringent threshold or discriminates by applying different pseudocounts.

Limitations of the quantitative duplication HOG analysis

As this work was limited to yeast, and given the clear results obtained on this lineage, extending the approach to place WGD events in plant datasets would be of great interest. This is challenging because the current release of OMA contains only a few, relatively distant plant species for many reported WGDs. The method also has inherent limitations. First, being based on comparison, it depends entirely on the characteristics of the species used as input. Using only post-WGD species or, conversely, only species that did not undergo WGD would not yield significant values of interest, as a baseline needs to be established. In such cases, results should probably not be considered on their own to place WGD events but could be combined with other approaches.

In addition, the number of input species also plays an important role in setting the baseline to find abnormal duplication increases. While only a few well-selected species of the yeast lineage allowed proper WGD placement, a random selection might require a larger dataset to identify significant values. The size and variety of the set should also be considered. First, to infer ancient duplications, more extant information is needed for a better ancestral reconstruction. Second, to find abnormal changes, normality needs to be defined through a sufficient amount of information, and such changes might also result from events other than WGD. Moreover, in cases with multiple overlapping WGDs on the same branch, only one event could potentially be highlighted.

Perspectives

In the end, the proposed method indicates branches with an abnormal increase of HOGs of duplicated genes under the assumption of the molecular clock hypothesis and correct ancestral genome reconstruction. While these may not be WGD events per se, this information remains beneficial, or even sufficient on its own, to detect WGD. Like most approaches, this method has its own strengths and weaknesses and should therefore be combined with other known or novel methods in order to place WGD in more complex datasets. A coherent next step would be to consider the HOG synteny of the WGD candidates found, as this information can be retrieved as a weighted network on the OMA platform and would likely improve the detection accuracy.

WGD placement, despite the many existing methods and algorithms,²⁷ remains a complicated task and is therefore an ongoing research field. Most recently, researchers applied a supervised machine learning approach based on the known detectable signatures of WGD.¹⁸ Advances in computational methods and biotechnology should lead to the development of such new or improved methods to resolve remaining questions around WGD evolution. Such findings would contribute to a better understanding of the related rise of evolutionary innovations throughout history and could also be extended to the study of WGD as a macro-evolutionary event occurring in tumorigenesis.

Data availability

Underlying data

Datasets are publicly available:

- species tree in Newick format: https://omabrowser.org/All/speciestree.nwk
- precomputed all-against-all alignments can be exported from https://omabrowser.org/oma/export/

Software availability

OMA standalone program available at https://omabrowser.org/standalone/

OMA browser license: Mozilla Public License 2.0

Source code for Python jupyter notebook and intermediate results for producing figures: https://github.com/cChiiper/UNIL_MLS_Master_FS_HOG_WGD_Detection

Archive source code for jupyter notebook at time of publication: https://doi.org/10.5281/zenodo.7682181

Python jupyter notebook license: Creative Commons Attribution 4.0 International license

Specifically, to manipulate data the following packages are needed:

- pyHam (https://zoo.cs.ucl.ac.uk/doc/pyHam/index.html)
- pandas (https://pandas.pydata.org/)
- numpy (https://numpy.org/)

For plotting the data:

- matplotlib (https://matplotlib.org/)
- seaborn (https://seaborn.pydata.org/)

References

1. Ohno S: Evolution by Gene Duplication. Allen and Unwin; 1970.
2. Van de Peer Y, Mizrachi E, Marchal K: The evolutionary significance of polyploidy. Nat. Rev. Genet. 2017; 18: 411–424.
3. Glasauer SMK, Neuhauss SCF: Whole-genome duplication in teleost fishes and its evolutionary consequences. Mol. Gen. Genomics. 2014; 289: 1045–1060.
4. Marcet-Houben M, Gabaldón T: Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker’s Yeast Lineage. PLoS Biol. 2015; 13: e1002220.
5. Wendel JF: The wondrous cycles of polyploidy in plants. Am. J. Bot. 2015; 102: 1753–1756.
6. Albertin W, Marullo P: Polyploidy in fungi: evolution after whole-genome duplication. Proc. R. Soc. B Biol. Sci. 2012; 279: 2497–2509.
7. Campbell MA, Ganley ARD, Gabaldón T, et al.: The Case of the Missing Ancient Fungal Polyploids. Am. Nat. 2016; 188: 602–614.
8. Zahn-Zabal M, Dessimoz C, Glover NM: Identifying orthologs with OMA: A primer. 2020.
9. Watts RL, Watts DC: Gene duplication and the evolution of enzymes. Nature. 1968; 217: 1125–1130.
10. Clear Evidence for Two Rounds of Vertebrate Genome Duplication. PLoS Biol. 2005; 3: e344.
11. Meyer A, Van de Peer Y: From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). BioEssays. 2005; 27: 937–945.
12. Jiao Y, et al.: Ancestral polyploidy in seed plants and angiosperms. Nature. 2011; 473: 97–100.
13. Scannell DR, et al.: Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proc. Natl. Acad. Sci. 2007; 104: 8397–8402.
14. Wolfe KH: Origin of the Yeast Whole-Genome Duplication. PLoS Biol. 2015; 13: e1002221.
15. Matsumoto T, et al.: Proliferative polyploid cells give rise to tumors via ploidy reduction. Nat. Commun. 2021; 12: 646.
16. Mabry ME, et al.: Phylogeny and multiple independent whole-genome duplication events in the Brassicales. Am. J. Bot. 2020; 107: 1148–1164.
17. Yang Y, Li Y, Chen Q, et al.: WGDdetector: a pipeline for detecting whole genome duplication events using the genome or transcriptome annotations. BMC Bioinformatics. 2019; 20: 75.
18. McKibben MTW, Barker MS: Applying Machine Learning to Classify the Origins of Gene Duplications. 2021.
19. Vanneste K, Van de Peer Y, Maere S: Inference of Genome Duplications from Age Distributions Revisited. Mol. Biol. Evol. 2013; 30: 177–190.
20. Altenhoff AM, et al.: OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 2019; 29: 1152–1163.
21. Altenhoff AM, et al.: OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 2021; 49: D373–D379.
22. Dylus D, et al.: How to build phylogenetic species trees with OMA. 2020.
23. Altenhoff AM, Gil M, Gonnet GH, et al.: Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs. PLoS One. 2013; 8: e53786.
24. pyHam: a python package to visualize and process hierarchical orthologous groups (HOGs) – Open Reading Frame - Dessimoz Lab. http
25. The Molecular Clock and Estimating Species Divergence | Learn Science at Scitable. http
26. Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004; 428: 617–624.
27. Lallemand T, Leduc M, Landès C, et al.: An Overview of Duplicated Gene Detection Methods: Why the Duplication Mechanism Has to Be Accounted for in Their Choice. Genes. 2020; 11: 1046.
28. Louis VL, et al.: Pichia sorbitophila, an Interspecies Yeast Hybrid, Reveals Early Steps of Genome Resolution After Polyploidization. G3 Genes|Genomes|Genetics. 2012; 2: 299–311.

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 12 Apr 2023

Author details Author details

¹ Department of Computational Biology, University of Lausanne, Lausanne, 1015, Switzerland
² Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland

Samuel Moix
Roles: Formal Analysis, Methodology, Software, Visualization, Writing – Original Draft Preparation

Natasha Glover
Roles: Methodology, Supervision, Visualization, Writing – Review & Editing

Sina Majidian
Roles: Conceptualization, Supervision, Writing – Review & Editing

Competing interests

No competing interests were disclosed.

Grant information

This work was supported by the Swiss National Science Foundation [grant numbers 205085 and 183723].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Published: 12 Apr 2023, 12:382

https://doi.org/10.12688/f1000research.128656.1

Copyright

© 2023 Moix S et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Moix S, Glover N and Majidian S. Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:382 (https://doi.org/10.12688/f1000research.128656.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 28 Sep 2023

Gavin Conant, Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina, USA

Not Approved

https://doi.org/10.5256/f1000research.141268.r203842

Review of Moix et al., “Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups.”

NOTE: I agree with much of what X. Guo has written in their peer review and notice that these points have not been addressed, which will be essential should this manuscript be published.

Overview: In this manuscript, the authors propose a new approach to detecting whole-genome duplications from sequenced genomes using the concept of hierarchical orthologous groups (HOGs) and validate that method on genomic data from yeast.

Major comments:
At the highest level, there is no harm in having new methods for detecting WGD events, and I think that there is some promise in the idea of using HOGs for such a problem. I nonetheless find it difficult to summon a great deal of enthusiasm for this tool and manuscript for a few reasons.

Most importantly, if my understanding is correct, this method represents an approximation to existing approaches of modeling gene family evolution on a phylogeny using birth-death processes (this point was again noted by the first reviewer). Since there is at least one existing tool that uses just such models to infer the presence of WGD events (1, 2), the authors here need to explain the similarities and differences between their approach and the prior work. I will note that this existing tool does treat the WGD events in a slightly ad hoc way, so there is potentially room for improvement.

My second concern is that what the community actually lacks is not tools for detecting WGD events. We lack tools that can assess the statistical support for those inferences. The authors take some welcome steps in this direction, but I don’t find them to be especially well-explained or particularly robust. It appears that the authors use the normalized distribution of HOG appearance across the branches of the tree as a null distribution and consider quartile outliers within this distribution to be potential WGD events. That approach gives the appearance of good behavior in the dataset considered here, which essentially has exactly one WGD. However, it is less clear that the method would reject the hypothesis of a WGD in datasets where in fact no WGDs occurred or could handle data like flowering plants with many WGD events on the tree. I would encourage the authors to better explain this section.
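
To make the procedure described in this comment concrete, the following is a minimal Python sketch of a quartile-based outlier rule applied to per-branch counts of duplicated HOGs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy numbers, and the exact normalisation (count divided by branch length plus a pseudocount, times average genome size) are hypothetical, and the factor 1.5 is the conventional Tukey fence multiplier.

```python
import statistics


def adjusted_counts(dup_hogs, branch_lengths, genome_sizes, pseudocount=0.01):
    """Normalise duplicated-HOG counts per branch.

    Assumed form (hypothetical, for illustration only):
        count / ((branch_length + pseudocount) * average_genome_size)
    """
    return {
        branch: dup_hogs[branch]
        / ((branch_lengths[branch] + pseudocount) * genome_sizes[branch])
        for branch in dup_hogs
    }


def upper_outliers(values, k=1.5):
    """Return branches whose value exceeds Q3 + k * IQR, sorted by name."""
    q1, _, q3 = statistics.quantiles(values.values(), n=4)
    threshold = q3 + k * (q3 - q1)
    return sorted(b for b, v in values.items() if v > threshold)


# Toy data: seven branches, one with a burst of duplicated HOGs.
dup_hogs = {"A": 10, "B": 12, "C": 11, "D": 9, "E": 13, "F": 10, "G": 400}
branch_lengths = {branch: 0.1 for branch in dup_hogs}
genome_sizes = {branch: 5000 for branch in dup_hogs}

adjusted = adjusted_counts(dup_hogs, branch_lengths, genome_sizes)
candidates = upper_outliers(adjusted)
print(candidates)
```

On this toy input only branch G is flagged. Because the fence is computed from the same branches it is then applied to, the rule flags relative outliers, so its behaviour on a tree with no WGD or with several WGDs, the concern raised above, would need to be checked separately.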

Minor points:
I am not sure “pseudo-count” is the right term for the scaling parameter the authors use, since that term calls to mind a more controlled procedure for accounting for unobserved data. Here what is being done is accounting for short branches, but the authors should differentiate between branches that are short but non-zero versus branches that the routines are estimating to be numerically zero. Doing so is tricky but important. For instance, some phylogenetic packages can estimate branch length confidence intervals, which might be a more rigorous approach to this problem.

The first paragraph of the abstract is rather badly written. I would say that polyploidy leads to WGD, not vice versa. More accurately, WGD is a shorthand for genomes with polyploidy in their history. Likewise “initial ploidy is restored” isn’t really correct: the chromosomes may return to a diploid pairing state in meiosis with the organism still retaining many more chromosomes than its ancestors had.

The first paragraph of the introduction could also mention the dosage balance hypothesis and how it impacts duplicate retention (3). (In contrast to the first reviewer, I would urge the authors not to eliminate this more general discussion of polyploidy here).

I would not write that “allopolyploidy…lead[s] to an initial selective advantage…” because I do not believe this is invariably the case. Rather I would phrase this as “Allopolyploidy could…”

The first reviewer is also correct that the section on Marcet-Houben and Gabaldón’s work is not clear.

References:

1. Tiley GP, Ané C, & Burleigh JG (2016) Evaluating and Characterizing Ancient Whole-Genome Duplications in Plants with Gene Count Data. Genome biology and evolution 8(4):1023-1037.
2. Rabier C-E, Ta T, & Ané C (2014) Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Molecular biology and evolution 31(3):750-762.
3. Edger PP & Pires JC (2009) Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 17(5):699-717.

Is the rationale for developing the new method (or application) clearly explained?
Yes

Is the description of the method technically sound?
Partly

Are sufficient details provided to allow replication of the method development and its use by others?
Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Evolution of polyploid genomes

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Reviewer Report 12 May 2023

Xinyi Guo, CEITEC-Central European Institute of Technology, Masarykova univerzita, Brno, South Moravian Region, Czech Republic

Approved with Reservations

https://doi.org/10.5256/f1000research.141268.r171823

The manuscript presents a simple method for WGD inference based on gene counts and the authors have shown its good performance in yeast lineages. While the authors recognized knowledge gaps in the field and presented both their own solution and its limitations, I found that the rationale and description of the method are still not very clear, and there is a lack of comparison with other existing methods based on gene count data. In addition, the presented method is a Python-based wrapper for extracting and processing HOG gene count information from an existing gene orthology inference method (OMA), which might exclude many potential users who prefer other pipelines.

Major comments:

1. The main contribution of the method presented by the authors is the use of the adjusted number of HOGs according to the molecular clock hypothesis and by the average genome size. While branch length is somehow expected to be associated with gene counts due to the impact of paralogs on phylogenetic inference, genome size might be a poor indicator for WGD or gene counts due to the amplification of genomic repeats or progressive diploidization. For both factors, the authors didn’t mention in their introduction why they decided to consider their impact on WGD and why to implement the parameter in their method.
2. Following point 1, in their analyses, the authors presented adjusted values based on both branch length and genome size. Does this mean that both factors should be jointly considered in order to give the best performance? If so, I suggest providing comparative results based on individual factors as well.
3. The principle of the method is easy to understand and should be sufficient to replicate. I didn’t check if similar methods have been presented elsewhere, but I do realize that there are some existing methods based on gene count data, e.g., WGDgc (Rabier et al. 2014¹). Therefore, I recommend that the authors check more publications and compare the performance of their methods with others.
4. In their discussion, the authors presented a short but nice discussion on the development of WGD placing methods and their limitations. I like it. However, related to point 1, I found much dispensable information in their introduction. I would expect more focus on the rationale of the study and on developing the method.

Minor points:

In Abstract:

“After such events, the initial ploidy is usually restored” - The sentence is not very clear to me. What is the initial ploidy, do you mean diploid? And if you believe this usually happens, why consider genome size?
“We reconstruct ancestral genomes” - I guess you mean you inferred the number of genes in ancestral genomes. Please check such wording carefully throughout the manuscript.

In Introduction:

Paragraph 2 (P2). “Marcet-Houben and Gabaldón proposed a hypothesis” - This sentence is misleading as you were talking about eukaryotes first and mentioned yeast only in the parenthesis.
P3. “For the same reasons, dating these events becomes challenging. ” - If it’s for the same reason, why not mention both tasks in one sentence to make it more coherent?
P3. “One type of remaining traces in available genomes are conserved paralogs (i.e. ohnologs) which are defined as descendants of an ancestral gene derived from a duplication event. Paralogs are also referred to as ohnologs when resulting from WGD.” - Please check these two sentences for redundancy (paralogs/ohnologs appeared twice). If you mention one type, I would expect some information on other types.
P4. “One way to categorize genes is by inferring so-called hierarchical orthologous groups” - How about other ways? Why did you choose this way?
P5. Ref9 is too old – Is a paper from 1968 well suited to represent something “over the past few decades”? Also, I don’t think it’s necessary to give too many details (nearly a whole paragraph) on other species.
P6. Similar to P5, it seems too detailed. Both P5 and P6 are not so much related to the method itself.
P7. I found synteny and Ks here. How about gene counts, which is the main focus of this study?
P7. “scalable methods need to be developed” - This sentence seems to imply that something on the scalability of the methods is considered by the authors. But I didn’t see it in the remaining part of the manuscript.
P8. “proteomes, i.e. protein sequence annotations” - Are you sure proteomes mean just annotations? You might want to say protein sequences and their annotations.

In Methods:

P2. I would like to know why the authors consider OMA instead of many other orthology inference pipelines (orthofinder, orthoMCL etc.) to perform their analyses, although the authors mentioned some features of the OMA pipeline. Perhaps some comparison between pipelines will help.
P3. “For WGD analysis, the number of inferred duplicated genes at each taxonomic level is essential.” - This sentence is not very clear. What kind of WGD analysis? And I would say it’s the reliable estimates of the duplicated gene number that is essential.
P4. “The number of HOGs of duplicated genes, rather than the number of duplication events was considered, as ohnologs should result from one large duplication event. ” - I suggest some rewording as you already mentioned ohnologs resulted from WGD. Perhaps it’s better to say there are also many local gene duplication events besides large scale duplication events caused by WGD.
P5. Why adjust genome size? This is not mentioned in the introduction or here.
P5. “With this assumption, we would expect more evolutionary events over a longer time period” - What kind of events do you mean? Gene duplication? And again, how this is linked with the branch length of the phylogeny, which indicates the substitution rates?
P6-7. Any recommendation to adjust the pseudocount?

In Results:

P1. This should belong to part of the methods.
P2. “which may be due to its domestication” - Any evidence or reference for that?
P2. “However, many of the gene losses might be false positives, which can happen when the extant genome is of lower quality or incomplete.” - Again any evidence to support the low quality or completeness of the genomes?
P3. I noticed that the value of 1.5 appears in the results but not in the methods (although in the formula).
P4. What’s the difference between Fig3C and Fig4B?

In Discussion:

P2. “To account for this, it is recommended to consider different pseudocounts and compare the outcomes.” - Again, it’s better to give more details. What does pseudocounts mean? How could the value vary across organisms?
P5. “more extant information” - What kind of information, for example?
P6. “correct ancestral genome reconstruction” - Again genome reconstruction is not proper here.

Expanded answers to the mandatory questions:

Is the rationale for developing the new method (or application) clearly explained?
The explanation is not clear enough. Please see my major comments #1-3.
Is the description of the method technically sound?
Technically, the description is fine, but more details are needed. Please see my major comment #1.
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
The conclusions should be supported but the performance of the method needs to be further tested or compared. Please see my major comment #4.

Is the rationale for developing the new method (or application) clearly explained?
Partly

Is the description of the method technically sound?
Partly

Are sufficient details provided to allow replication of the method development and its use by others?
Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly

References

1. Rabier CE, Ta T, Ané C: Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol Biol Evol. 2014; 31 (3): 750-62.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genome evolution

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report

18 Views

12 May 2023 | for Version 1

Xinyi Guo, CEITEC-Central European Institute of Technology, Masarykova univerzita, Brno, South Moravian Region, Czech Republic

18 Views Cite this report Responses(0)

Approved With Reservations

The manuscript presents a simple method for WGD inference based on gene counts and the authors have shown its good performance in yeast lineages. While the authors recognized knowledge gaps in the field and presented both their own solution and its limitations, I found that the rationale and description of the method are still not very clear, and there is a lack of comparison with other existing methods based on gene count data. In addition, the presented method is a Python-based wrapper for extracting and processing HOG gene count information from an existing gene orthology inference method (OMA), which might exclude many potential users who prefer other pipelines.

Major comments:

The main contribution of the method presented by the authors is the use of the adjusted number of HOGs according to the molecular clock hypothesis and by the average genome size. While branch length is somehow expected to be associated with gene counts due to the impact of paralogs on phylogenetic inference, genome size might be a poor indicator for WGD or gene counts due to the amplification of genomic repeats or progressive diploidization. For both factors, the authors didn’t mention in their introduction why they decided to consider their impact on WGD and why to implement the parameter in their method.
Following point 1, in their analyses, the authors presented adjusted values based on both branch length and genome size. Does this mean that both factors should be jointly considered in order to give the best performance? If so, I suggest providing comparative results based on individual factors as well.
The principle of the method is easy to understand and should be sufficient to replicate. I didn’t check if similar methods have been presented elsewhere, but I do realize that there are some existing methods based on gene count data, e.g., WGDgc (Rabier et al. 2014¹). Therefore, I recommend that the authors check more publications and compare the performance of their methods with others.
In their discussion, the authors presented a short but nice discussion on the development of WGD placing methods and their limitations. I like it. However, related to point 1, I found much dispensable information in their introduction. I would expect more focus on the rationale of the study and on developing the method.

Minor points:

In Abstract:

“After such events, the initial ploidy is usually restored” - The sentence is not very clear to me. What is the initial ploidy, do you mean diploid? And if you believe this usually happens, why consider genome size?
“We reconstruct ancestral genomes” - I guess you mean you inferred the number of genes in ancestral genomes. Please check such wording carefully throughout the manuscript.

In Introduction:

Paragraph 2 (P2). “Marcet-Houben and Gabaldón proposed a hypothesis” - This sentence is misleading as you were talking about eukaryotes first and mentioned yeast only in the parenthesis.
P3. “For the same reasons, dating these events becomes challenging. ” - If it’s for the same reason, why not mention both tasks in one sentence to make it more coherent?
P3. “One type of remaining traces in available genomes are conserved paralogs (i.e. ohnologs) which are defined as descendants of an ancestral gene derived from a duplication event. Paralogs are also referred to as ohnologs when resulting from WGD.” - Please check these two sentences for redundancy (paralogs/ohnologs appeared twice). If you mention one type, I would expect some information on other types.
P4. “One way to categorize genes is by inferring so-called hierarchical orthologous groups” - How about other ways? Why did you choose this way?
P5. Ref9 is too old – Is a paper from 1968 well suited to represent something “over the past few decades”? Also, I don’t think it’s necessary to give too many details (nearly a whole paragraph) on other species.
P6. Similar to P5, it seems too detailed. Both P5 and P6 are not so much related to the method itself.
P7. I found synteny and Ks here. How about gene counts, which is the main focus of this study?
P7. “scalable methods need to be developed” - This sentence seems to imply that something on the scalability of the methods is considered by the authors. But I didn’t see it in the remaining part of the manuscript.
P8. “proteomes, i.e. protein sequence annotations” - Are you sure proteomes mean just annotations? You might want to say protein sequences and their annotations.

In Methods:

P2. I would like to know why the authors consider OMA instead of many other orthology inference pipelines (orthofinder, orthoMCL etc.) to perform their analyses, although the authors mentioned some features of the OMA pipeline. Perhaps some comparison between pipelines will help.
P3. “For WGD analysis, the number of inferred duplicated genes at each taxonomic level is essential.” - This sentence is not very clear. What kind of WGD analysis? And I would say it’s the reliable estimates of the duplicated gene number that is essential.
P4. “The number of HOGs of duplicated genes, rather than the number of duplication events was considered, as ohnologs should result from one large duplication event. ” - I suggest some rewording as you already mentioned ohnologs resulted from WGD. Perhaps it’s better to say there are also many local gene duplication events besides large scale duplication events caused by WGD.
P5. Why adjust genome size? This is not mentioned in the introduction or here.
P5. “With this assumption, we would expect more evolutionary events over a longer time period” - What kind of events do you mean? Gene duplication? And again, how this is linked with the branch length of the phylogeny, which indicates the substitution rates?
P6-7. Any recommendation to adjust the pseudocount?

In Results:

P1. This should belong to part of the methods.
P2. “which may be due to its domestication” - Any evidence or reference for that?
P2. “However, many of the gene losses might be false positives, which can happen when the extant genome is of lower quality or incomplete.” - Again any evidence to support the low quality or completeness of the genomes?
P3. I noticed that the value of 1.5 appears in the results but not in the methods (although in the formula).
P4. What’s the difference between Fig3C and Fig4B?

In Discussion:

P2. “To account for this, it is recommended to consider different pseudocounts and compare the outcomes.” - Again, it’s better to give more details. What does pseudocounts mean? How could the value vary across organisms?
P5. “more extant information” - What kind of information, for example?
P6. “correct ancestral genome reconstruction” - Again genome reconstruction is not proper here.

Expanded answers to the mandatory questions:

Is the rationale for developing the new method (or application) clearly explained?
The explanation is not very clear enough. Please see my major comments #1-3
Is the description of the method technically sound?
Technically, the description is fine but some more details should be needed. Please see my major comment #1
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
The conclusions should be supported but the performance of the method needs to be further tested or compared. Please see my major comment #4.

Is the rationale for developing the new method (or application) clearly explained?

Partly
Is the description of the method technically sound?

Partly
Are sufficient details provided to allow replication of the method development and its use by others?

Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

References

1. Rabier CE, Ta T, Ané C: Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach.Mol Biol Evol. 2014; 31 (3): 750-62 PubMed Abstract | Publisher Full Text

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Genome evolution

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


[1] Ohno S: Evolution by Gene Duplication. Allen and Unwin; 1970.

[2] Van de Peer Y, Mizrachi E, Marchal K: The evolutionary significance of polyploidy. Nat. Rev. Genet. 2017; 18: 411–424.

[3] Glasauer SMK, Neuhauss SCF: Whole-genome duplication in teleost fishes and its evolutionary consequences. Mol. Gen. Genomics. 2014; 289: 1045–1060.

[4] Marcet-Houben M, Gabaldón T: Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker’s Yeast Lineage. PLoS Biol. 2015; 13: e1002220.

[5] Wendel JF: The wondrous cycles of polyploidy in plants. Am. J. Bot. 2015; 102: 1753–1756.

[6] Albertin W, Marullo P: Polyploidy in fungi: evolution after whole-genome duplication. Proc. R. Soc. B Biol. Sci. 2012; 279: 2497–2509.

[7] Campbell MA, Ganley ARD, Gabaldón T, et al.: The Case of the Missing Ancient Fungal Polyploids. Am. Nat. 2016; 188: 602–614.

[8] Zahn-Zabal M, Dessimoz C, Glover NM: Identifying orthologs with OMA: A primer. 2020.

[9] Watts RL, Watts DC: Gene duplication and the evolution of enzymes. Nature. 1968; 217: 1125–1130.

[10] Clear Evidence for Two Rounds of Vertebrate Genome Duplication. PLoS Biol. 2005; 3: e344.

[11] Meyer A, Van de Peer Y: From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). BioEssays. 2005; 27: 937–945.

[12] Jiao Y, et al.: Ancestral polyploidy in seed plants and angiosperms. Nature. 2011; 473: 97–100.

[13] Scannell DR, et al.: Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proc. Natl. Acad. Sci. 2007; 104: 8397–8402.

[14] Wolfe KH: Origin of the Yeast Whole-Genome Duplication. PLoS Biol. 2015; 13: e1002221.

[15] Matsumoto T, et al.: Proliferative polyploid cells give rise to tumors via ploidy reduction. Nat. Commun. 2021; 12: 646.

[16] Mabry ME, et al.: Phylogeny and multiple independent whole-genome duplication events in the Brassicales. Am. J. Bot. 2020; 107: 1148–1164.

[17] Yang Y, Li Y, Chen Q, et al.: WGDdetector: a pipeline for detecting whole genome duplication events using the genome or transcriptome annotations. BMC Bioinformatics. 2019; 20: 75.

[18] McKibben MTW, Barker MS: Applying Machine Learning to Classify the Origins of Gene Duplications. 2021.

[19] Vanneste K, Van de Peer Y, Maere S: Inference of Genome Duplications from Age Distributions Revisited. Mol. Biol. Evol. 2013; 30: 177–190.

[20] Altenhoff AM, et al.: OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 2019; 29: 1152–1163.

[21] Altenhoff AM, et al.: OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 2021; 49: D373–D379.

[22] Dylus D, et al.: How to build phylogenetic species trees with OMA. 2020.

[23] Altenhoff AM, Gil M, Gonnet GH, et al.: Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs. PLoS One. 2013; 8: e53786.

[24] pyHam: a python package to visualize and process hierarchical orthologous groups (HOGs) – Open Reading Frame - Dessimoz Lab. http

[25] The Molecular Clock and Estimating Species Divergence | Learn Science at Scitable. http

[26] Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004; 428: 617–624.

[27] Lallemand T, Leduc M, Landès C, et al.: An Overview of Duplicated Gene Detection Methods: Why the Duplication Mechanism Has to Be Accounted for in Their Choice. Genes. 2020; 11: 1046.

[28] Louis VL, et al.: Pichia sorbitophila, an Interspecies Yeast Hybrid, Reveals Early Steps of Genome Resolution After Polyploidization. G3 Genes Genomes Genetics. 2012; 2: 299–311.
