ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article

Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 12 Apr 2023
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioinformatics gateway.

This article is included in the The OMA collection collection.

Abstract

Background: Whole genome duplications (WGD) are genomic events leading to formation of polyploid organisms. Resulting duplicated genes play important roles in driving species evolution and diversification. After such events, the initial ploidy is usually restored, complicating their detection across evolution. With the advance of bioinformatics and the rising number of new well-assembled genomes, new detection methods are ongoingly being developed to overcome the weaknesses of different approaches.

Results: Here we propose a novel method for detecting WGD in yeast lineages based on the quantitative and comparative analysis of hierarchical orthologous groups (HOGs) of duplicated genes for a given set of organisms. We reconstruct ancestral genomes to obtain evolutionary information for each phylogenetic branch. This reconstruction relies on the inference of HOGs from the selected species’ proteomes. To estimate WGD events, the number of HOGs of duplicated genes across all taxonomic ranges are adjusted according to the molecular clock hypothesis and by the average genome size. Branches with a significant increase in the adjusted number of duplicated gene families are kept as candidates for WGD placement. The developed method was tested on two real datasets and showed promising results in phylogenetic WGD placements on the yeast lineage.

Keywords

comparative genomics, orthologous groups, whole genome duplications, yeast

Introduction

The potential roles of gene duplication in driving evolutionary innovation have now long been recognized.1 Whole genome duplication (WGD) is a major event leading to duplicated genes, found several times over the course of evolution in many eukaryotic organisms,2 and leads to the formation of an organism (or a cell) with additional copies of its entire genome. This event is often followed by massive gene losses due to non-functionalization, however some duplicated genes might undergo subfunctionalization (gene partitioning) or neofunctionalization (acquire a novel gene function) which will favor their preservation.3

To understand the circumstances leading to WGD in eukaryotes, Marcet-Houben and Gabaldón proposed a hypothesis describing an ancient allopolyploidization event (hybridization event between two different species of the yeast prior to the genome doubling).4 In case of an autopolyploid scenario (genome doubling in a single species), benefits from neofunctionalization would only take effect in the long run. In contrast, an allopolyploid scenario would lead to an initial selective advantage, because of the fact that it brings together different physiological properties and isolates the newly formed lineage sexually.4

Elucidating the causes of WGD remains a non-trivial task, as such events are often ancient and leave few traces, since over the course of evolution polyploidy is usually not maintained.57 For the same reasons, dating these events becomes challenging. One type of remaining traces in available genomes are conserved paralogs (i.e. ohnologs) which are defined as descendants of an ancestral gene derived from a duplication event. Paralogs are also referred to as ohnologs when resulting from WGD.

Finding homologous relationships (genes derived from the same ancestor) is a key concept in evolutionary biology used by researchers to describe genetic differences and similarities between different species. One way to categorize genes is by inferring so-called hierarchical orthologous groups (HOGs). HOGs are defined as “sets of genes that have descended from a common ancestral gene in a given ancestral species”.8 These nested groups of orthologs and paralogs, defined at each taxonomic level, can be used to reconstruct ancestral genomes and infer gene families. This also provides information on evolutionary events such as gene duplications, gains or losses between known or reconstructed genomes. These inferred relationships play a critical role for this study.

Over the past few decades, multiple WGD events on the tree of life have been detected.9 In early vertebrate evolution, Susumu Ohno proposed that two rounds of complete genome duplication occurred (often referred as 2R). There is now strong evidence confirming this hypothesis.10 Additionally, a consensus has been reached in favor of a third round (3R) WGD occurring in the fish lineage and thus also referred to as the fish-specific genome duplication.11 As mentioned earlier, WGD events have also been established in other eukaryotic lineages. For instance, an ancestral genome duplication has been placed before the common ancestor of extant seed plants, followed by another in the common ancestor of extant angiosperms.12 Moreover, multiple WGD events have been estimated among eudicotyledons and monocotyledons.12 Over the course of evolution, another well-known case occurred just before the separation of the yeast species Vanderwaltozyma polyspora from the S. cerevisiae lineage.13

Although found in vertebrates, plants, fungi, and invertebrates, WGDs remain relatively rare evolutionary events that have far-reaching repercussions on speciation and biodiversity dynamics.2,14 Being able to place them across the tree of life is crucial to our understanding of species’ evolutionary history. Additionally, polyploidy has important effects on human health. Belonging to the hallmarks of cancers, polyploidization events are found in roughly one third of human cancers. Besides cancer, genome duplications also occur in normal cells such as hepatocytes and in premalignant lesions, impacting other disease conditions.15 Thus, expanding knowledge of WGD is of great importance for cancer and other disease research.

Currently, the main existing approaches for detecting WGD using genomic data are based on the detection of large conserved syntenic regions, gene counts along phylogeny, and synonymous mutation rates.16,17 These methods all have their own drawbacks. For instance, investigations through synteny require high-quality genome assemblies providing gene order, and might be biased by small scale duplications or remaining layers of older WGD.18 In methods based on the number of synonymous substitutions per synonymous site (Ks), signal is lost in older WGDs due to Ks stochasticity. Moreover, these methods suffer from Ks saturation which might lead to the formation of artificial peaks.19 Such limitations could be mitigated by leveraging a combination of multiple techniques when trying to infer WGD events. Finally, to cope with the rapid increase in available genomic data, scalable methods need to be developed.

This study aims to propose a novel approach, based on hierarchical orthologous groups of duplicated genes (henceforth referred to as duplication HOGs), to locate WGD events on the given phylogenetic tree of a set of organisms. The method relies on the organisms’ proteomes, i.e. protein sequence annotations, for the detection of WGD resulting from allopolyploidy or autopolyploidy. This study is focused on the yeast lineage as it contains a single well-established WGD.13 Additionally, yeast species have relatively small and less complex genomes, facilitating computation and analysis, thus favoring this choice. Applying our proposed method on two datasets comprising yeast species allowed proper identification of previously reported WGD events (details in the Results section). It is worth mentioning that the method could be used for newly sequenced genomes as well.

Methods

Our method uses the HOGs in addition to the species tree as inputs. In the following section, we described how one can estimate HOGs for the genome of interest. Then, in the section “Detection of whole genome duplications (WGD)”, our method to detect WGD is presented in detail. Of note, we evaluated our method using 32 yeast genomes from the Saccharomycetales clade (details in the Results section).

Orthology inference

We used the OMA Standalone software20 to infer HOGs and to reconstruct ancestral genomes. To obtain HOGs, OMA Standalone performs multiple steps. First, the data undergoes quality and consistency checks. Second, all-against-all sequence alignments are computed in order to infer pairwise orthologs. We exported precomputed all-against-all protein sequence alignments from the OMA browser.21,22 Using precomputed all-against-all alignments as the input of the OMA standalone software significantly lowers the needed computational resources. In contrast to most other orthology inference resources, during the all-against-all alignment phase, OMA compares multiple isoforms in order to identify and utilize the isoform with the best cross-species matches.21 Next, pairwise orthologs are used to build an orthology graph. Finally, the GETHOGs algorithm (“Graph-based Efficient Technique for Hierarchical Orthologous Groups”) utilizes this graph and the given species tree for HOG inference. If no species tree is given, OMA estimates it using a least-squares distance approach based on OMA groups23 (Figure 1). OMA standalone was run with the selected sets of yeasts to obtain an estimated species tree in Newick format and HOGs in an OrthoXML file.20

16ad8209-f591-49ec-8212-b6c9cf1fd082_figure1.gif

Figure 1. Quantitative duplication hierarchical orthologous group (HOG) analysis pipeline flowchart.

The main steps of the developed method are illustrated on this flowchart. (For the complete description of the method, see the Methods section).

Detection of whole genome duplications (WGD)

For WGD analysis, the number of inferred duplicated genes at each taxonomic level is essential. To retrieve this information, the python library pyHam (“python HOG analysis method”)24 was used. This package facilitates the extraction of evolutionary information such as gene duplications or losses at specific taxonomic levels from OMA Standalone’s HOG inference output. The duplication events can be inferred by the “mapper” class by comparing evolutionary relationships between an ancestral node and any of its descendants.

Our method to place WGD events is based on the number of HOGs which were the direct result of a duplication event (we call them as “HOGs of duplicated genes”) (Figure 1). The species tree was traversed from the root node and the evolutionary history, including duplication events, of all the genes between each ancestor node and its two closest descendant nodes were inferred using pyHam’s “vertical comparison” method.24 The number of HOGs of duplicated genes, rather than the number of duplication events was considered, as ohnologs should result from one large duplication event. Branches in the species tree with significantly higher numbers of duplication HOGs (after the adjustments described below) were considered as putative WGD events.

In order to have a fair comparison between branches on the species tree with different lengths, the number of duplication HOGs was adjusted by evolutionary distance and average genome size. Adjusting by distance is based on the molecular clock hypothesis, which states that DNA and protein sequences evolve at a somewhat constant rate throughout time and across organisms.25 With this assumption, we would expect more evolutionary events over a longer time period and thus need to adjust the number of duplication HOGs accordingly by dividing by branch length.

Note that since there could be some small branches in the tree which make the number of duplication HOGs adjusted value misleadingly high, it is needed to use a pseudocount when dividing counts by distance. Depending on the dataset, the user of our code is able to use different values of this pseudocount (set to one as default). Finally, to balance among the different genome sizes, the average number of genes between the compared nodes is used based on the assumption that the ancestor and descendant share similar genome sizes. These adjustments can be summarized with the following formula:

adjusted value=#duplication HOGsbranch length+pseudocountaverage genome size

After obtaining the number of adjusted duplication HOGs for each branch, values and distribution were plotted to determine a significant threshold for selecting possible WGD candidates. Another way to find candidates is by computing adjusted HOG counts with different pseudocounts and considering branches that remain significant over the various adjustments. In order to find these high values (outliers), we set a threshold for each separate case which is defined based on the interquartile range.

threshold=third quartile+1.5interquartile range

Benchmarking of this method was done by comparing results with known cases found in the literature.4,26

Results

A diverse range of yeast organisms were used in this project in order to assess the performance of our WGD placement method. We chose yeast for a proof of concept because this clade is well documented and the species have simple genomes relative to other eukaryotes. These genomes were extracted from the Orthologous MAtrix (OMA) database.20 This database currently contains 2,496 genomes (Nov. 2022) from across the entire Tree of Life and is continuously being updated. As part of method evaluation, we first consider a small dataset, “small yeast”, of five species including Schizosaccharomyces pombe, Kluyveromyces lactis, Zygosaccharomyces Rouxii, Vanderwaltozyma polyspora and Saccharomyces cerevisiae. In this dataset, the well-known model organism S. pombe acts as a rooting species. It is known that K. lactis and Z. rouxii are potential candidates of species that might have led to the WGD in the hybridization theory4 and thus branching off before the known WGD placed before V. polyspora’s branch. In other words, Vanderwaltozyma polyspora and Saccharomyces cerevisiae are known to have undergone polyploidization.4 Another dataset, “large yeast”, was used as proof of concept and is composed of 32 species from the Saccharomycetales order found on the OMA database in addition to the species Penicillium chrysogenum for rooting.

As described in the method section, once HOGs were inferred with OMA standalone, the estimated number of gained, duplicated, retained or lost genes on each branch was obtained and visualized with a python package pyHam (Figure 2A,B). We notice a striking gene loss on the branch leading to the extant Saccharomyces cerevisiae South African wine strain (strain AWRI796), which may be due to its domestication. However, many of the gene losses might be false positives, which can happen when the extant genome is of lower quality or incomplete. Branches leading to Kluyveromyces lactis and Zygosaccharomyces rouxii have slightly less duplication than the other branches, but this data on its own is not sufficient to discriminate against possible WGD candidates (Figure 2A,B).

16ad8209-f591-49ec-8212-b6c9cf1fd082_figure2.gif

Figure 2. Hierarchical orthologous groups (HOGs) reveal evolutionary events in the yeast lineage.

(A) Inferred phylogenetic tree of the “small yeast” dataset using OMA standalone pipeline. Numbers in black correspond to branch length and numbers in purple to the branch id found in the bar plots. Colored stacked bars on branches illustrate the non-adjusted number of gained, duplicated, retained or lost genes from one node to another (these species have roughly 5000 genes in average). The same information can be found as a barplot on panel B. The red-colored branch corresponds to the placement of the whole genome duplications (WGD) detected by our method. (B) Bar plots represent the number of gained genes, duplication events, duplication HOGs, retained genes, and lost genes for the “small yeast” dataset. Branch ids correspond to the ids found on the phylogenetic tree of panel A. (C) Same bar plots as panel B adjusted by the distance (branch length plus pseudocount of one) and the average genome size for each branch (id).

However, when counting the number of duplication HOGs adjusted by genome size and branch length, a distinctive larger relative quantity (seen as an outlier on the distribution) appears on the branch of the known WGD (branch id 5; Figure 2C). The significance threshold is determined by adding the third quartile to the multiplication of the interquartile range by 1.5 (Figure 3A). This allows clear placement of the WGD event on the species tree. Note that this significant value remains with a pseudocount of five (see Figure 3B). Indeed, the next highest quantity (branch id 8) lies at only 52% of this significant value.

16ad8209-f591-49ec-8212-b6c9cf1fd082_figure3.gif

Figure 3. Distinctive peak in accordance with the known whole genome duplications (WGD) branch.

(A) Bar graph of the relative (by min-max normalization) adjusted (by branch length plus one and average genome size) number of duplication hierarchical orthologous groups (HOGs) for each branch of the “small yeast” dataset (see Figure 2 for branch id correspondence). On the right side, the box plot represents the distribution of these values. The distribution outlier is considered as a significant value and thus as a WGD candidate. Red colored bars correspond to branches that are significant over a range of different pseudocounts. (B) Replication of panel A barplot with a pseudocount of five. (C) Adjusted number of duplication HOGs for each branch of the “large yeast” dataset. Pseudocount added to the distance changed to five. Red colored bars correspond to branches that remain significant across a range of different pseudocounts (e.g. 1,5 and 10). See Figure 4 for branch id correspondence.

Applying the same approach on the bigger data set containing 32 species from the Saccharomycetales order resulted in two possible WGD candidates. The first corresponds to the aforementioned WGD branch with the branch id 25 (Figure 4). The second, with an adjusted number above 0.8, corresponds to the branch leading to the extant species Pichia sorbitophila with the branch id 42. Interestingly, it was shown that Pichia sorbitophila results from allopolyploidization and has a genome of about twice the size of its ancestor.27 In this data set, we can see smaller differences among adjusted HOG counts which makes it more difficult to set the threshold to distinguish WGD branches, i.e. those with the highest duplication HOGs (Figure 4).

16ad8209-f591-49ec-8212-b6c9cf1fd082_figure4.gif

Figure 4. Quantitative duplication hierarchical orthologous group (HOG) analysis results on the “large yeast” dataset.

(A) Inferred phylogenetic tree of the “large yeast” dataset. Established post-whole genome duplications (WGD) species under the light-green box. (B) Graph represents the number of duplication HOGs adjusted by the distance (branch length plus one) and the average genome size for each branch (id). On the right side, the box plot represents the distribution of these values. Keeping significant values that remain with higher pseudocounts (Figure 3C) correspond to found WGD candidates (branch id 25 and 42). Filled blue colored circles on the phylogenetic tree correspond to remaining significant values with a pseudocount of one.

Discussion

Proposed method is able to correctly place the WGD event in the yeast lineage

Results obtained on the two yeast datasets are promising as the branch with the maximum adjusted duplication HOGs corresponds to the known WGD.

It is important to note that despite adding a pseudocount, when adjusting by evolutionary distance, small branches are still slightly more pushed upward compared to longer branches. To account for this, it is recommended to consider different pseudocounts and compare the outcomes. With the “large yeast” dataset we see various WGD candidates depending on the used pseudocount. By repeating the analysis with different pseudocounts (e.g., integers between one and 20) two branches persist: the known WGD branch but also the branch leading to P. sorbitophila which has a very large number of duplicated genes. As mentioned earlier, it was shown that this species has undergone genome duplication through allopolyploidization.28

Overall, the applied method seems to perform quite well on its own for WGD placement in the yeast lineage, assuming the user keeps a fairly stringent threshold or discriminates by applying different pseudocounts.

Limitations of the quantitative duplication HOG analysis

As the scope of this work was limited on yeast and witnessing clear results obtained on this lineage, extending the approach to find placement of WGD on plant datasets would be of great interest. This is quite challenging due to the fact that the plant dataset in the current release of OMA contains at the moment only few and relatively distanced species for many reported WGD. First of all, based on comparison, the method depends entirely on the characteristics of the species used as input. Only using species from post-WGD or opposingly, only ones that did not undergo WGD, would not lead to significant values of interest, as an average needs to be set. In such cases, results should probably not be considered on their own to place WGD events but could be combined with different approaches.

In addition, the number of input species also play an important role in setting the baseline to find abnormal duplication increases. While only a few well-selected species of the yeast lineage allowed proper WGD placement, a random selection might require a larger dataset to identify significant values. The size and variety of the set should also be considered. First, to infer ancient duplications, more extant information is needed for a better resulting ancestral reconstruction. Second, to find abnormal changes, normality needs to be defined through a sufficient amount of information and such changes might also result from different events than WGD. Moreover, in cases with multiple overlapping WGDs on the same branch, only one event could potentially be highlighted.

Perspectives

In the end, the proposed method indicates branches with an abnormal increase of HOGs of duplicated genes under the assumption of the molecular clock hypothesis and correct ancestral genome reconstruction. While these may not be WGD events per se, this information remains beneficial, or even sufficient on its own, to detect WGD. Like most approaches this method has its own strengths and weaknesses and should therefore be combined with other known or novel methods in order to place WGD in more complex datasets. A coherent next step would be to consider the HOG synteny of the found WGD candidates, as this information can be retrieved as a weighted network on the OMA platform and would likely improve the detection accuracy.

WGD placement, despite the many existing methods and algorithms,27 remains a complicated task and is therefore an ongoing research field. Most recently, researchers tried to apply a supervised machine learning approach based on the known detectable signatures of WGD.18 The advance of computational and biotechnologies should lead to the development of such new or improved methods to resolve remaining questions around WGD evolution. Such findings would contribute to a better understanding of the related rise of evolutionary innovations throughout history and could also be expanded to the study of WGD as macro-evolutionary event occurring in tumorigenesis.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 12 Apr 2023
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Moix S, Glover N and Majidian S. Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:382 (https://doi.org/10.12688/f1000research.128656.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 12 Apr 2023
Views
12
Cite
Reviewer Report 28 Sep 2023
Gavin Conant, Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina, USA 
Not Approved
VIEWS 12
Review of Moix et al., “Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups.”

NOTE: I agree with much of what X. Guo has written in their peer review and notice ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Conant G. Reviewer Report For: Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:382 (https://doi.org/10.5256/f1000research.141268.r203842)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
14
Cite
Reviewer Report 12 May 2023
Xinyi Guo, CEITEC-Central European Institute of Technology, Masarykova univerzita, Brno, South Moravian Region, Czech Republic 
Approved with Reservations
VIEWS 14
The manuscript presents a simple method for WGD inference based on gene counts and the authors have shown its good performance in yeast lineages. While the authors recognized knowledge gaps in the field and presented both their own solution and ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Guo X. Reviewer Report For: Phylogenetic placement of whole genome duplications in yeasts through quantitative analysis of hierarchical orthologous groups [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:382 (https://doi.org/10.5256/f1000research.141268.r171823)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 12 Apr 2023
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.