Models for the retention of duplicate genes and their biological underpinnings

Gene content in genomes changes through several different processes, with gene duplication being an important contributor to such changes. Gene duplication occurs over a range of scales from individual genes to whole genomes, and the dynamics of this process can be context dependent. Still, there are rules by which genes are retained or lost from genomes after duplication, and probabilistic modeling has enabled characterization of these rules, including their context-dependence. Here, we describe the biology and corresponding mathematical models that are used to understand duplicate gene retention and its contribution to the set of biochemical functions encoded in a genome.


Introduction
Richard Feynman left the quotation, "What I cannot create, I do not understand" on his blackboard at the time of his death.Creation in mathematical modeling is writing down a model that describes a process.The retention of duplicate genes over long evolutionary periods involves mechanisms and processes in population genetics, evolution, molecular biology, ecology, and biochemistry.Here we describe the current state of modeling in the field of duplicate gene retention.
The genome can experience duplications of its content across a range of sizes, from incomplete duplications of single genes to small-scale events (single duplicate genes created through tandem duplication or retrotransposition events) to large scale events that involve multiple genes or even whole genomes.These events are broadly classified as whole genome duplications (WGD) or small scale duplications (SSD): the two types have several key differences.WGD duplicates are duplicated together with their interacting partners, and the population (and cellular) genetic model for the initial fixation of WGD duplicates is less straightforward than for SSDs.WGD events can be either allopolyploidies or autopolyploidies (see below).SSD duplicates typically do not see their interaction partners duplicated and initially have a frequency of 1/(2N e ) in a diploid population.[3][4][5][6] In examining small-scale duplication, tandem duplicates may be identical at birth, but also may be born as chimeras or as partial duplicates. 2,7Those that are non-identical can be viewed as partly along the way to functional divergence.Identical duplicates are more common at birth, but less common among older tandem duplicates. 2,7Similarly, genes that emerge by retrotransposition are less common at birth, but because they are likely born in a different expression environment and chromosomal location, diverge faster and are, relative to their birth rate, more common at older ages. 83][14] In general, it has been suggested by Wagner  (2005)  15 that the expression cost of a duplicated gene leads to a fitness cost in its possessor, and a similar argument could be made for a replication cost, especially for organisms that have limitations on genome size.Such a limitation would occur when selection acts to minimize replication time in log phase growth.
More generally, duplicates can be retained as a mechanism to gain extra expression. 16An example of this in humans and other mammals appears to be the convergent duplication and retention of copies of the amylase gene. 17Trypanosomes seem to regulate gene expression more generally through gene duplication, with very few transcription factors in their genomes. 18,191][22][23][24] Mechanisms of duplicate preservation that involve changes of function through mutation accumulation include subfunctionalization, 25,26 the partitioning of functions among copies from the pre-duplication ancestral state, and neofunctionalization, 16 the acquisition of a new beneficial function.
This review is focused on the characterization of different types of models with distinct assumptions that characterize the duplicate gene retention process.The goal of all of these models is to probabilistically predict which genes are likely to be found in genomes that have WGD events of different ages and ongoing processes of SSD.Multiple models for all of the processes described above exist and are discussed below together with their biological assumptions.Table 1 summarizes the models that have been characterized here.

REVISED Amendments from Version 1
In response to reviewers, a few changes were made to the manuscript.Most importantly, a new table (Table 1) has been added to summarize the models described and their biological assumptions.A paragraph has been added to describe the mechanisms of small scale duplication and corresponding retention profiles, including two new references.A clarifying paragraph organizing the work has been added to the introduction.Lastly, a series of more minor presentation and grammatical changes has been made.
Any further responses from the reviewers can be found at the end of the article

Gene Duplicability
Different retention mechanisms are differentially applicable to different genes.This has given rise to the notion of gene duplicability, that some genes are inherently more duplicable than other genes. 10,27,28For a gene to be retained after duplication, it ultimately needs to be either subfunctionalizable or neofunctionalizable. 12 To be subfunctionalizable, a gene must have at least two modular functions (biochemical activities, including binding, or modular expression domains), such that there are mutations that can disable one subfunction without damaging others.The probability of eventual subfunctionalization for two identical duplicate gene copies scales as 1 À 0.5 (f À 1) , where f is the number of functions.The probability of neofunctionalization is harder to quantify, but either new selectable biochemical functions or expression domains must be evolvable.7][38][39][40] Empirically, there is a class of "duplication-resistant" genes where natural selection apparently acts against the maintenance of both copies. 35Different genomes might have different proportions of genes that are duplicable, as well as that are subject to dosage constraints. 41 analyzing the retention of genes following two rounds of WGD, the Atlantic salmon genome paper 42 presented a conditional probability analysis suggesting that the gene duplicability hypothesis predicts that genes retained after one round of WGD might be more likely to be retained after the second round.However, prior analyses from plants tended to suggest that the factors that favor the retention of duplicates after a first polyploidy tend to be attenuated in subsequent polyploidies. 43Support for the hypothesis from the test in Atlantic salmon was also lacking, but there is more complexity to the process, including changing gene duplicability, the time-dependence of the retention process, and other factors, 44,45 and probabilistic models that can be used as an expectation for different hypotheses are described below. 41e Biological Considerations as Building Blocks for Models To model duplicate gene retention, one must describe what is mutable and selectable.Protein encoding genes must function as proteins after transcription and translation.They are expressed at a given concentration in specific places and at specific times.They then carry out various functions: binding, catalysis, or transport in interaction with other molecules in the cell.This is what we mean by function.The expression domains for a protein are a quantitative description of where and when expression occurs.There are cases where duplication is used as a mechanism for amplifying expression level, although this seems to be a temporary situation in most organisms, with trypanosomes being a possible exception.

Modeling Expression Evolution of Duplicate Genes as a Stand-Alone Process
Though classical models for the retention of duplicate genes often consider their levels of sequence divergence, gene expression data provide a promising source of underutilized information.In particular, gene expression data are now widely available for many species and often consist of measurements across multiple conditions, which can include tissues, sexes, and developmental stages.These measurements are an attribute of function, as knowledge of where and when a gene is expressed provides insight into its biological roles.Indeed, Ohno proposed that the first step of functional divergence between duplicate genes is their expression divergence. 16Thus, gene expression is a trait that can be exploited to understand gene function and, in the case of duplicate genes, the divergence between their functions.
7][48] Yet expression divergence between duplicate genes does not provide information about the exact mechanisms of their retention.For instance, both neofunctionalization and subfunctionalization result in functional divergence between gene copies.Thus, it is important to compare the expression profiles of both copies to that of the ancestral single-copy gene, as this can elucidate how each copy has changed since duplication.Such an approach was developed about a decade ago 49 and later implemented as the software CDROM. 50Applications of this approach have uncovered widespread neofunctionalization in Drosophila, 49 mammals, 51 honeybees, 52 and grasses. 53wever, a key shortcoming of the approach of Assis and Bachtrog (2013) 49 is that it does not account for stochastic changes in gene expression arising from phenotypic drift. 54This obstacle can be overcome by modeling gene expression evolution on the phylogenetic tree relating a pair of duplicates and their single-copy ancestor.There is a natural hierarchy of models for describing how gene expression evolves along the branches of a phylogenetic tree.The simplest is Brownian motion (BM), which models phenotypic drift without making any assumption of selection for a particular expression level. 552][63][64][65] Another conceivable approach is to model the genotype driving expression evolution, but this requires an understanding of the combinatorial role of promoters and enhancers regulating expression, 66,67 which is currently beyond our modeling capabilities. 68ith this in mind, researchers have recently begun to employ OU models for studying the expression evolution of duplicate genes. 24,69Additionally, DeGiorgio and Assis (2021) 69 developed CLOUD, which predicts retention mechanisms of duplicate genes by overlaying their OU model with a neural network. 69Though likelihood ratio tests (LRTs) have classically been used for similar tasks with single-copy genes, 24,[58][59][60][61]70 machine learning approaches present several advantages, such as the optimization of model fit to training data, direct evaluation of performance on independent test data, and ability to make predictions from data with correlated or conflicting signals. 71Further, such methods make predictions solely from data, 71 which can be advantageous when the underlying evolutionary model is unknown.Indeed, CLOUD demonstrates excellent predictive performance, outshining CDROM in classifying retention mechanisms while also being able to predict parameters corresponding to expression optima and strengths of selection and drift.69 Still, much remains to be done in this area.For one, many advanced machine learning algorithms have yet to be explored in this context.Moreover, though expression data can provide a lot of useful information about a gene, this does not necessarily mean that we should neglect other complementary sources, such as its sequence or protein structure.Another advantage of machine learning is that it is not weighed down by additional information, as correlated or conflicting signals can be reduced or even removed through regularization.However, the problem lies in extending the underlying OU model to accommodate diverse pieces of information.Last, one can argue that the most important extension of such work is to accommodate more species and gene copies.Currently, most researchers do not have access to expression data for multiple of the same conditions in many species, but this is soon to change as the cost of sequencing continues to decrease.And of course, it is critical to assay the fates of gene families with more than two members, as many such families are prominent across study systems and may be key to understanding adaptation.

Modeling the Evolutionary Cost of Gene Duplication
Analyses of duplicate genes often start from the premise that gene duplications are selectively neutral, creating redundant copies that can potentially degrade through degenerative mutations. 9,72The patterns of which genes do and do not tend to survive in duplicate and the dosage-balance hypothesis (see above) already suggest that gene duplications are not all selectively neutral at birth.As mentioned above, Wagner has extended this argument by showing that, at least in microbial organisms, the gene expression costs associated with an extra gene copy are rarely if ever expected to be selectively neutral. 15,73His model considers the per-time unit cost, in terms of high energy phosphate bonds, of expressing a duplicate gene for different ranges of mRNA and protein levels, finding that, for reasonably large values of the effective population size of microbial species, those energy costs are large enough to discount the hypothesis that the fate of a duplication is primarily driven by neutral evolution. 74The principle that excess gene expression has measurable negative fitness effects due to the costs of transcription and translation has been elegantly experimentally explored in E. coli. 75

Dosage Imbalance Cost
It is well established that maintaining stoichiometric balance with interacting partners is an important driving force to preserve duplicate genes in genomes while waiting for other preservation mechanisms to act.The mechanistic driving force behind this is thought to be the prevention of the accumulation of exposed hydrophobic residues that populate binding interfaces and can lead to misinteractions that might be deleterious to cells.An explicit model that relates fitness to the expected concentration of surface hydrophobic residues has been generated and used to explore how this model enables the transition to subfunctionalized states, with opposite trends observed after WGD and SSD. 76This model is a mechanistic update over a previously described hazard function model that did not model this underlying biochemistry. 77

Population Genetic Considerations
In eukaryotic organisms, the baseline state for most chromosomes is diploidy.Either across the whole genome or for individual loci, this diploid state is disrupted after gene duplication.Functional tetraploidy has meiotic implications that are not present for SSD events (see 42 for a discussion in Atlantic salmon).Over time, WGD events return to a state of functional diploidy and may start that way for alloduplication events with the chromosome sets already diverged, as may have been the case for Xenopus 78 and Brassica species. 79,80wever, so far, this divergence characterization has been viewed without the underlying population-level dynamics.While more complex for WGD events, SSDs in diploid organisms begin with a frequency of 1/(2N e ) and must fix before they are lost if they are going to be retained.The neutral expectation for eventual fixation of such duplicates is that they will fix with a probability equal to their frequency.Classical population genetics then gives a time-dependent expectation for the frequency based upon the age of the duplicate.Stark et al. (2021) 81 have presented a population genetic model to evaluate if the age-dependent frequency is unexpected for a duplicate evolving neutrally.The power of this approach was evaluated using a Moran Model, with two selective parameters, one for selection on the duplicate itself, which can be positive when total gene dosage amplification is beneficial or negative due to factors like expression and replication cost, and the other for selection on the new function (for example neofunctionalization).
In using this model with actual segregating duplicates, the frequency of a duplicate in a population can be measured through population genomic sequencing.However, the age of the duplicate is the next question.For SNPs, population geneticists examine the length of tracts of identity by descent to estimate the age of an allele (e.g., duplicate locus) (see 82 ), but a much simpler approach based upon pairwise pS values between copies may be possible.Application of these approaches to real data has not been performed yet to evaluate their performance.
Those considerations are used to evaluate selection on the copy itself.Selection on the sequence of the duplicate might be measured by a more rich data source and parametrization or in a simpler manner by examining ratios like pN/pS or using tests like the McDonald-Kreitman test applied to duplicates. 83terspecific and Phylogenetic Models Moving from intraspecific to interspecific analysis of gene duplicates, including models that run on a single genome, early work from Lynch and Conery (2000) 9 and from Lynch, Force, and coworkers 10,25,84 11 presented a formal probabilistic gene tree-species tree reconciliation framework using the exponential distribution model and generated software for this.Others have created similar software packages. 87,88Additional innovations to this framework have included the treatment of synteny, 89 and species level processes. 90ntemporaneously with the Lynch and Conery (2000) 9 modeling, Force et al. (1999) 25 presented a more mechanistic framework for subfunctionalization and neofunctionalization as processes.This was formalized as a Markov Model by Stark et al. (2017) 91 and expanded upon by Wilson and Liberles (2023) 76 to enable consideration of dosage balance.Multi-scale Markov models reflect a further step in this trajectory.

Multi-Scale Markov Models
A wide range of multi-scale Markov models for the evolution of gene families has been studied in the literature.Models with states that record very detailed information about biology are suitable for simulation-based analysis.However, such models may not be useful for theoretical analysis due to the size of their state space.On the other hand, models with simplified state space are useful for in-depth theoretical and numerical analysis, which often leads to novel biological insights.Both types of models provide powerful tools for the analysis, and the choice of one over another may depend on the types of biological questions one might want to answer.
As an example, Stark (2017) 92 suggested a simulation model for the evolution of a family of genes, in which detailed information is recorded within a binary-matrix of 0's and 1s such that each row corresponds to a gene and each column corresponds to its function.Later, Diao et al. (2022) 93 applied the binary matrix model of Stark (2017) 92 in their simulation-based analysis, which led to some interesting biological insights.Their results suggested that when the rate of gene duplication dominates the rate of gene loss, then the distribution of tree shapes is close to following the uniform ranked tree shape (URT) distribution (i.e., the distribution for a constant birth-death process).However, when the opposite is true, then gene trees are more balanced than the URT distribution.
Next, Stark et al. (2017) 91 constructed a model with a simplified state space, for the theoretical analysis of the evolution of duplicate genes, see Figure 1.The manageable size of the state space allowed them to derive analytical expressions for the rates of subfunctionalization and pseudogenization.This led to the result, earlier predicted by classical models, that models with subfunctionalization provided a better fit to the age distribution of duplicate genes than models with a constant rate.Further, Diao et al. (2020) 94 developed a more advanced model for the theoretical analysis of the evolution of a family of duplicate genes that was based on the application of a level-dependent Quasi-Birth-and-Death (QBD) process.The state (n,m,k) of their QBD model consists of the variables n and m representing the number of genes and the number of the redundant genes respectively, and the variable k which, in a simplified manner, records the remaining information about the family.The authors took advantage of both types of models.They used the simulation-based binary matrix model of Stark (2017) 92 to obtain detailed outputs and then fitted the parameters of their QBD to data obtained from these outputs.Next, they derived biological insights by computing metrics based on the expressions from the theory of QBDs, such as the stationary distribution of observing various states within the model and the distribution of the time it takes for the family to lose a gene.Soewongsono et al. (2023) 95 then applied this QBD model to a more general problem of reconciliation, in which the task is to find a mapping of a gene tree to a species tree, to maximize the likelihood.
The authors provided an algorithm to compute the likelihood of the reconciliation given the available incomplete data.

Modeling Asymmetric Divergence and Gene Conversion in Duplicate Genes
Another angle to evaluate duplicate gene retention mechanisms involves an examination of evolutionary symmetry.
In fact asymmetric divergence between duplicate gene copies is relatively common. 96This result is particularly striking when one considers that many tests have low power to detect asymmetry. 97The test that was employed uses an outgroup gene from a relative lacking the duplication to polarize differences between the two paralogs, forming a triplet tree.
Maximum likelihood was then used to compare a symmetric model, where dS (K s ), dN (K a ), or dN/dS (K a /K s or omega) was constrained to be equal for both paralogs, to an alternative model where the divergence statistic was allowed to differ between paralogs. 96It is tempting to attribute to neofunctionalization cases where one paralog has accelerated evolution relative to the other, but in fact many different modes of evolution can induce asymmetry. 98In fact, asymmetry in divergence is arguably more interesting for the differences seen between duplicate genes created by different mechanisms 3 or as a means for detecting recent gene conversion.We have used the ancient polyploidy in baker's yeast have experienced very recent gene conversion, such that ancient polyploidy-produced paralogs in one species are more similar to each other than those copies are to their orthologs in a closely related species, despite the fact that the divergence time between the ortholog pairs is probably ten-fold lower than the time since the paralogs were formed. 99ne conversion refers to several mutational mechanisms that can allow part of the sequence of one member of a gene family to overwrite the corresponding region in another paralog, effectively erasing some of the divergence between the two paralogs. 100Because such events violate the assumption of independent evolution between paralogs, they are difficult to treat with standard models.Ji et al. (2016), 101 have described a codon model of evolution that jointly considers the paired codons from two paralogous genes, incorporating a parameter t modeling the frequency with which conversion events alter the paralogs' sequences.This model confirms the surprisingly high rate of gene conversion among the yeast ribosomal proteins, which had previously and incorrectly been taken to represent the more general rate of gene conversion among yeast paralogs. 102ole Genome Duplication: Duplicate Losses, Modeling and Synteny While approaches such as standard time-independent birth and death models can be applied to duplicate genes produced by WGD, or polyploidies, there are complexities and opportunities introduced by WGD events that benefit from models that are specific to them.Polyploidy refers to a variety of events that result in eukaryotic cells with more than two copies of the genome. 103Polyploid lineages are formed relatively often, but most quickly go extinct. 104However, great trunks of the eukaryotic tree of life descend from surviving ancient polyploidy events, including all vertebrates and flowering plants, as well as specific lineages of yeasts, ciliates, and other plants. 105olyploid individuals can form through the merger of genomes from the same species, known as autopolyploidy, or of distinct species, referred to as allopolyploidy. 106[109] The term WGD is potentially slightly misleading because it suggests that all genes in the genome are duplicated.Initially they are.However, for any reasonably old polyploidy event, many or even most of the duplicate genes will have been lost. 109,110Probably most of these losses occur through the fixation of loss-of-function mutations in one copy by genetic drift, a process common to duplicates of all types. 9As described in the earlier characterization of duplicate retention mechanisms, selection from various sources can also play a role.

Another Class of Models for Evolution After Polyploidization Events
The question of demonstrating that a particular genome has an ancient polyploidy in its history is a complex one 111 and somewhat distinct from our concerns here.However, one obvious consequence of a polyploidy is the production of a group of duplicate genes that were all formed "at the same moment."In principle, a neutral measure of paralog divergence, such as the number of synonymous substitutions per synonymous site (K s ) should be able to detect a polyploidy through the excess of duplicates with similar K S values. 9,47While the actual practice of detecting polyploidy events in this way requires care, 112 it has been an extremely illuminating approach.For instance, in a pioneering study, Maere et al. (2005), 113 were able to fit a mixture of age models to the Arabidopsis thaliana genome and detect three ancient polyploidies in its history.They further showed convergent retention of genes of similar function in duplicate after these events. 113To do so, they modeled three distinct processes: 1) a basal continuous rate of single gene duplication, 2) a set of between one and three ancient polyploidy events and 3) continuous losses of duplicates created by processes 1 and 2. They evolved these three processes in simulated discrete K s time intervals and fit the simulations to the observed set of K s values from duplicate genes found in the A. thaliana genome.
Maere et al.'s (2005) 113 approach is elegant but challenging to implement: other analyses of a similar form have instead fit mixtures of models to the observed duplicate divergences, combining a basal steady-state duplicate birth-death model with one or more discrete events duplicating the entire genome (with the potential for the immediate removal of some of these duplicates). 112,114Such approaches allow for testing hypotheses regarding the number of polyploidy events in the Figure 2. A region of ten ancestral genes duplicated through the teleost-specific genome duplication (TGD). 125hown in the center in gray are the ten genes as they are found in the genome of the spotted gar (L.oculatus), which lacks the TGD.The paralogous regions created by the TGD in the eight genomes possessing it are then shown above and below the gar genes.The lines joining pairs of genes indicate that these genes are neighbors in the genome (i.e., they are in synteny).After the TGD, some duplicates survive in all (pink) or some (tan) genomes, while others have been returned to single copy, either from the subgenome with more surviving genes (blue) or than with fewer (green).Numbers at the top of each column/pillar are the orthology confidence estimates from POInT.In other words, this figure gives the confidence for placing the genes in this orthology state relative to the other 2 8 -1=255 orthology configurations.Genes are shown with their Ensemble identifiers 126 for reference.This figure is an original figure produced by the authors for this review article.each genome may have one or two homologs present at a pillar (a state for complete homolog absence will be added to future versions of POInT).Different parental subgenomes within an extant genome can be distinguished (orange verses tan) but subgenome identities between the genomes are unknown.B) For N = 2 polyploid genomes, there are 2 N possible orthology relations.At each pillar i, we can compute the likelihood of the observed gene presence and absence data for a given orthology pattern XX using the model M and the tree t: L i xx |M,t.C) Using the synteny relationships, the values L i 00 |M,t .. L i 11 |M,t can be conditioned on L i-1 00 |M,t .. L i-1 11 |M,t with a transition probability matrix Θ.The elements of Θ depend on Θ i,g , where i is the pillar number and g is the genome.If synteny in maintained between pillars i and i+1 for genome g, Θ i,g = Θ M , a global constant estimated by maximum likelihood (0≤Θ M ≤1).Otherwise Θ i,g =0.5, meaning the orthology pattern at i is independent of that at i-1.This equation can be applied recursively to compute the likelihood of the entire dataset with standard hidden Markov model approaches 123 : the ⨀ operator represents an element-wise vector product.The tree branch lengths and model parameters are estimated from the data by maximum likelihood using standard numerical techniques. 127This figure is an original figure produced by the authors for this review article.lineage of a genome, but the results require some caution in their interpretation due to the relatively modest information provided by K s values. 112lyploidy and Gene Synteny Our discussion so far has considered signals such as gene tree topologies and divergence measures which are applicable to all types of duplicate genes.However, for the specific case of a polyploidy, another type of highly informative data is present: the gene order patterns among the duplicated and non-duplicated genes.These patterns are commonly referred to as gene synteny.7][118] Figure 2 illustrates the principle that a WGD, in contrast to SSDs, produces duplicate genes that preserve the gene order present in the unduplicated ancestor.Indeed, these patterns can identify ancient polyploidies even in the limiting case where all the duplicate genes were subsequently lost, provided that an outgroup genome lacking the polyploidy is available and the degree of chromosomal rearrangement is not too large.Hence, synteny is often considered the best evidence of the presence of an ancient polyploidy, even if formal tests using it are hard to develop. 112,119e example of the power of combining syntenic information with models of duplicate gene gain and loss can be seen with POInT (the Polyploid Orthology Inference Tool) 120,121 Assuming that the duplicate products of a WGD are known through syntenic information (as in Figure 2), one can use the generic discrete character evolutionary model of Paul Lewis  (2001)  122 to model the preservation or loss of duplicate copies in different genomes that share this WGD (Figure 3A).Briefly, the loss model presumes that all loci (or pillars) start in a duplicated state D and then can undergo fates such as loss (resulting in states S 1 or S 2 ) or fixation (D f ).Such a model can be applied to the duplicate presence and absence data for a group of genomes sharing the polyploidy.However, the difficulty arises that the orthology assignments between those genomes are unknown.POInT hence computes the likelihood of the observed gene presence/absence data at each pillar for all possible orthology relationship under a duplicate loss model.It then uses a hidden Markov model to condition that set of likelihoods at the current pillar i on those from pillars 0..i-1-using a transition matrix Q. 123 The elements of Q are determined by whether or not synteny is preserved between i-1 and i in each genome. 109ce such a framework is in place, standard likelihood ratio tests 124 can be used to test hypotheses about the evolution of polyploid genomes such as what fraction of the duplicates appear to have been fixed 120 or whether one of the two parental genomes from an allopolyploid is favored when duplicates are lost. 121This second pattern, termed biased fractionation, is likely indicative of an allopolyploidy 108 and raises questions as to whether the subgenomes of allopolyploid hybrids are functionally compatible. 79ncluding Thoughts A number of models have been generated that describe different levels of duplicate gene retention with different levels of mechanistic detail and as standalone models for individual problems, or as models that are integrated with other problems.These models are summarized in Table 1.Two key elements of duplicate gene retention are coding sequence function modeled using summary statistics (like dN/dS) or Markov models describing increasing layers of complexity and expression evolution.These can also include models for syntenic position.While some of the Markov models attempt to integrate the two layers of evolution reflecting coding sequence function and expression in a sophisticated manner, other approaches either use a simpler unifying factor, like [P] in a biophysical model or treat them independently or without differential specification.Work in these directions is making substantial progress in capturing biological realism.Modeling of duplicate gene retention can converge with the broader modeling frameworks being developed for the genotype-phenotype map.While much of this modeling is in the realm of additive statistical association, the field of computational systems biology includes modeling frameworks that add another layer to the genotype-phenotype map that have not been touched much at the boundaries of the duplicate gene retention modeling field.Mechanistic models for gene expression evolution will also be fruitful in this field.There is a lot of room to keep expanding these modeling frameworks as genomic and other omic data accumulate for species and underlying populations and as biological domain-specific modeling improves that can improve mechanistic duplicate gene retention models.

Are the conclusions drawn appropriate in the context of the current research literature? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Early evolution of life I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.This is an interesting review on the dynamic of gene retention and loss, depending on gene structure, gene function, the mechanism creating the duplicated copies, and other biological constraints.It presents various mathematical models characterizing such gene duplicates retention/loss dynamics in a context-dependent way.In particular, the difference between gene duplicates evolution depends on whether gene copies result from small scale duplication or whole genome duplication (WGD).The last part of the paper also explains the probabilistic and algorithmic ways of predicting WGD events.
My main criticism is that the review intermingles a variety of biological concepts and mathematical models, with no clear subdivision.The introduction should clearly explain the objective of the review and how it is organized.The content should be subdivided into coherent subsections.I would suggest adding a table with all the various constraints for gene duplicability and retention listed in the different sections, summarizing which of them are considered by which model/paper.
In addition, it would be interesting to discuss gene retention as a function not only of small-scale duplication versus whole-genome duplication, but also as a function of the type of small-scale events: duplication of a single gene versus duplication of larger segments, tandem duplication versus retrotransposition.

○
Although the target audience should be familiar with most of the biological and mathematical terms used, some are less obvious and should be explained, such as "heteromultimerization partners".
○ On page 7, correct the sentence: "The relative frequency of formation of these two types of polyploids many be approximately equal".I guess "many" should be replaced by "may".

○
On the same page, 6th line before end: "As described, selection from various sources can also play a role."As described where?
○ First part of Figure 2: I guess the segments, for each genome, above and below that for L. oculatus correspond to the two paralogous syntenies?That should be specified.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com

Figure 1 .
Figure 1.To model the evolution of gene duplicates, Stark et al. (2017) 91 constructed a Markov chain with state space {0,1, …,z-1,S,P} and generator Q where z is the number of regulatory regions within the gene, and S and P is are the subfunctionalization and the pseudogenization (absorbing) state, respectively.In the above example of transitions with z=4, the regions hit by null mutations are in red, and the regions protected by selective pressure are in yellow.This figure is adapted from Stark et al. (2017), 91 which was published under an open access license.

Figure 3 .
Figure 3. Modeling duplicate gene loss after polyploidy.A) Following Lewis (2001),122 a discrete state model M allows an ancestral position to be duplicated (D), single copy (S 1 or S 2 ) or a fixed duplicate (D f ).Transitions between these states occur at rates proportional to model parameters α, ɛ, and γ.Losses occur along an assumed phylogenetic tree t with branch lengths l 1 ..l t .The extant genomes are phased into a series of homologous columns or pillars: each genome may have one or two homologs present at a pillar (a state for complete homolog absence will be added to future versions of POInT).Different parental subgenomes within an extant genome can be distinguished (orange verses tan) but subgenome identities between the genomes are unknown.B) For N = 2 polyploid genomes, there are 2 N possible orthology relations.At each pillar i, we can compute the likelihood of the observed gene presence and absence data for a given orthology pattern XX using the model M and the tree t: L i xx |M,t.C) Using the synteny relationships, the values L i 00 |M,t .. L i 11 |M,t can be conditioned on L i-1 00 |M,t .. L i-1 11 |M,t with a transition probability matrix Θ.The elements of Θ depend on Θ i,g , where i is the pillar number and g is the genome.If synteny in maintained between pillars i and i+1 for genome g, Θ i,g = Θ M , a global constant estimated by maximum likelihood (0≤Θ M ≤1).Otherwise Θ i,g =0.5, meaning the orthology pattern at i is independent of that at i-1.This equation can be applied recursively to compute the likelihood of the entire dataset with standard hidden Markov model approaches123 : the ⨀ operator represents an element-wise vector product.The tree branch lengths and model parameters are estimated from the data by maximum likelihood using standard numerical techniques.127This figure is an original figure produced by the authors for this review article.

○
Is the topic of the review discussed comprehensively in the context of the current literature?YesAre all factual statements correct and adequately supported by citations?YesIs the review written in accessible language?YesAre the conclusions drawn appropriate in the context of the current research literature?YesCompeting Interests: No competing interests were disclosed.

Table 1 .
A summary of the models described for duplicate gene retention together with their assumptions is presented.
12s been pioneering.The first model presented byLynch and Conery (2000)9modeled duplicate gene retention expectations with a simple exponential distribution.This assumes that no matter how long a gene has been in a genome, the instantaneous probability of loss is constant, which is not consistent with retention mechanisms, but is useful as a nonfunctionalization null model.Konrad et al. (2011)12and Teufel et al. (2016)77described sets of hazard functions that did not have the property of timeindependent hazard functions, and Zhao et al. (2015)85presented an age-dependent birth-death process inspired by this framework.Yohe et al. (2015)86presented a theoretical gene tree-species tree reconciliation framework using the Konrad et al. (2011)12model, but this was never implemented as software.Arvestad et al. (2009)