Is the pan-genome also a pan-selectome?

The comparative genomics of prokaryotes has shown the presence of conserved regions containing highly similar genes (the 'core genome') and other regions that vary in gene content (the 'flexible' regions). A significant part of the latter is involved in surface structures that are phage recognition targets. Another sizeable part provides for differences in niche exploitation. Metagenomic data indicate that natural populations of prokaryotes are composed of assemblages of clonal lineages or "meta-clones" that share a core of genes but harbor high diversity by varying the flexible component. This meta-clonal diversity is maintained by a collection of phages that equalize the populations by preventing any individual clonal lineage from hoarding common resources. Thus, these polyclonal assemblages and the phages preying upon them constitute natural selection units.


The pan-genomic world
Bacterial and archaeal genomes show a surprising diversity in gene content even in otherwise very similar strains 1,2 . Some parts of the genome are shared and retain high sequence similarity; these make up the 'core' of prokaryotic genomes. An average nucleotide identity (ANI) of 95% appears to be a kind of "magic number" that fits the definition of most classical species, replacing the pre-genomic 70% DNA-DNA hybridization "golden rule" 3 . Surprisingly similar figures hold for Bacteria and Archaea in spite of their highly divergent molecular biology 4,5 .
However, the most remarkable finding of prokaryotic genomes is the presence of other genomic regions that are extremely variable and differ in gene content and synteny from one strain to another 6 . One consequence is that the diversity of genes found in a single prokaryotic species is stunning; for example in Escherichia coli with about 5000 genes per genome, it is estimated that there could be about 45,000 different gene families in its pan-genome 7 . The 'open-ness' of a bacterial pan-genome for a species ranges from roughly twice the size of an individual genome, to more than ten-fold 6,8 . Thus, the genetic diversity hidden in the prokaryotic domain is much higher than initially suspected. This raises several questions regarding the biology and evolution of prokaryotic cells. How is this enormous diversity generated and, even more importantly, maintained? How does it impact an organism's survival strategies and adaptation potential? These questions are fundamental gaps in our knowledge of the largest and oldest group of organisms on the planet.
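To make the core/flexible/pan-genome vocabulary concrete, the following minimal sketch computes the three gene sets from per-strain gene-family lists. The function name `summarize_pan_genome`, the strain labels and the gene-family identifiers are hypothetical and serve only as illustration; real analyses first cluster genes into families, which is not shown here. The "openness" ratio is simply the pan-genome size divided by the mean genome size; for the E. coli figures quoted above (about 45,000 families against about 5000 genes per genome) this ratio is roughly nine.

```python
from typing import Dict, Set


def summarize_pan_genome(genomes: Dict[str, Set[str]]) -> dict:
    """Split gene families into core and flexible sets.

    `genomes` maps a strain name to the set of gene families it carries.
    Assumption: gene families have already been defined upstream.
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)   # families present in every strain
    pan = set.union(*gene_sets)           # families present in any strain
    flexible = pan - core                 # families missing from at least one strain
    mean_size = sum(len(g) for g in gene_sets) / len(gene_sets)
    return {
        "core": core,
        "flexible": flexible,
        "pan": pan,
        "openness": len(pan) / mean_size,  # pan-genome size relative to one genome
    }


if __name__ == "__main__":
    # Hypothetical toy strains sharing two families and carrying one unique family each.
    toy = {
        "strain_A": {"g1", "g2", "g3"},
        "strain_B": {"g1", "g2", "g4"},
        "strain_C": {"g1", "g2", "g5"},
    }
    summary = summarize_pan_genome(toy)
    print(sorted(summary["core"]), sorted(summary["flexible"]), summary["openness"])
```

With a strict intersection, a single incomplete genome removes families from the core, so a tolerance (for example, presence in most strains) may be preferable in practice.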
The availability of multiple genomes from the same bacterial species has greatly advanced our knowledge of prokaryotic pan-genomes 2,8 . Furthermore, the availability of large metagenomic datasets permits the analysis of the presence or absence of parts of the genomes of microbes that are known to be abundant in a specific habitat 9-12 ; this bypasses the limitations and biases of retrieving strains in pure culture. We can now begin to see general trends in the pools of genes in the core and 'flexible' components of prokaryotic pan-genomes.

The flexible pool and the cell surface
One major problem of the flexible pool is its remarkable diversity, which makes it hard to classify its genes. Being less widely distributed, they are more difficult to annotate and often appear as hypothetical proteins. However, as more genomes are sequenced, patterns start to be discernible. Particularly informative are the clusters of 'flexible genes' collected in genomic islands (GIs), which contain groups of contiguous genes, making functional inference much more reliable. Much of the flexible pool is collected in GIs of 10 kb or more. In this paper we will focus on some of these islands that are present in most (or all) strain genomes but contain different genes (i.e. they are in the same genomic context and code for the same type of function or structure, but the genes are only distantly related, if at all). For the sake of clarity we will designate these genomic islands, found in many strains but containing different genes, 'flexible Genomic Islands' (fGIs).
One kind of fGI that appears to have universal distribution encodes the synthesis of exposed structures of the prokaryotic cell. One of the most remarkable examples of this phenomenon is the gene cluster that codes for the synthesis of the O-chain of the Gram-negative lipopolysaccharide (LPS; 13 ). Classically known as the 'O-antigen', the diversity of this exposed envelope in Salmonella or Escherichia has been known for many years 14,15 . The O-chain is a repeat-unit polysaccharide; the monomeric repeat generally has between two and six sugar residues. O-chains are extremely variable in the nature, order and linkage of the different sugars 13,16 . This complex polysaccharide is very important for the survival of the cell, providing the appropriate hydrophilic envelope to allow nutrient import into the cell 17 . After subculture in the laboratory many strains lose part of the polysaccharide, giving rise to "rough" mutants, a phenomenon that has probably obscured the critical importance that this structure has for the cell's lifestyle. However, the importance of the O-chain for antibiotic susceptibility has long been known, illustrating how the permeability properties of the cell vary with small changes affecting this structure 18 . Yet the gene cluster for such an important cell component is extremely variable. This diversity has classically been explained by the advantage of the concomitant antigenic variation, which could prevent the host from identifying, and eventually expelling, all the strains of one of these species. However, even accepting this simplistic inference from host-microbiome interaction, the variability found in free-living bacteria is comparable to (if not higher than) that of specialized pathogens or symbionts 17 .
As an example, Figure 1 shows the fGIs detected in the available genomes of Candidatus Pelagibacter ubique, probably the most abundant pelagic marine microbe. In addition to the O-chain gene cluster, all known exposed structural motifs that can vary show similar genomic patterns of variation. For example, capsular or slime layer polysaccharides 5,19 , the teichoic acids of Gram-positives 20 , or the S-layer glycoproteins of archaea 4,21 all seem to be located in fGIs. Other exposed structures that are also typical components of the flexible gene pool are flagella, pili and porins. An alternative way to view the diversity found in all these cell components is that they are all important phage recognition targets.

Phages, phages everywhere
Viruses and their hosts are extremely entangled entities. In many environments there are on average about 10 phages for every bacterium 22 , which means that bacteria are under constant attack. There are millions of viruses in every drop of ocean water; on average, it is estimated that about a mole of viruses (6 x 10^23) attack bacteria every minute in the oceans 5,6 . Some estimates suggest that a quarter of newly photosynthesized carbon in marine environments travels through the "viral shunt", moving it directly to dissolved organic carbon before grazers or other consumers can access it 6,7 . The diversity of bacteriophages is large and dynamic, changing with time in a given environment 23 . Presumably, this change in virus diversity reflects changes in bacterial diversity, since phages are obligate parasites and many phages can only infect specific bacterial strains. It has been estimated that 60-70% of the bacterial genomes sequenced to date contain prophages 8,9 (Rob Edwards, personal communication). About two-thirds of all sequenced Gammaproteobacteria and low-GC Gram-positive bacteria harbor prophages 10 ; thus the phages are also part of the pan-genome for many of these organisms.
The surface of a cell is what is presented to the world (friend and foe alike); in environments where bacteria are under constant attack from viruses, the need to frequently change the shape and appearance of surface proteins is compelling. Thus a good evolutionary strategy would be to vary these proteins, both by changing their structures and by distributing them amongst other bacteria within the population, where possible.
Can the need to diversify phage receptors explain the enormous diversity of the pan-genome? Certainly not, but it could be responsible for a large part. However, the other major component of the flexible pool might also be preserved via phage predation 12 .

The flexible pool and niche partitioning
Many components of the flexible pool are involved in niche partitioning: 1) transport of substrates and the cognate metabolic pathways required for their assimilation by the cell; 2) regulation, such as two-component systems involved in fine-tuning the response to environmental stimuli; and 3) respiratory chains and/or protective mechanisms involved in different oxygen or light relationships. This has been found to apply to both bacteria 11,24-27 and archaea 4 . For heterotrophic osmotrophs the transporter baggage carried in the genome is determined largely by lifestyle and niche specialization. Accordingly, the transporters found in different clonal lineages are extremely variable and are typical components of the flexible gene pool. In a remarkably parallel way, hli genes (coding for high-light inducible proteins) present in marine picocyanobacteria might influence the fine light qualities exploited by these widespread phototrophs and are also typical components of the flexible pool of these microbes 28 . A similar story is depicted by the tonB receptors involved in the transport of micronutrients 29 .
It is important to emphasize here that fGIs related to phage sensitivity, such as the O-chain of the LPS, and fGIs related to niche specialization are genetically linked in a single replicon, so that negative selection by phage predation automatically offsets positive selection from overly efficient exploitation of resources. For example, a sudden increase in the concentration of nutrients exploited by one clone, which might otherwise lead to a major clonal expansion, would be kept in check by the concomitant increase in the concentration of the linked phage receptors 12 . This mechanism of population control, although negative for the short-term prevalence of the clone, might be good for its long-term survival, since it maintains the complexity of the community and its endurance.

The pan-selectome
The unit of selection has been a major conundrum in evolutionary biology 30 . Historically, according to times and fashions, the proposals have ranged from the gene 31 to the community 32 or even the planet 33 . In 1997 Ernst Mayr defined the unit of selection as "a discrete entity and a cohesive whole, an individual or a social group, the survival and successful reproduction of which is favored by selection owing to its possession of certain properties" 34 .
We would like to advance here the idea that in nature, the selection unit in a prokaryotic community or assemblage is an ensemble of clonal lineages that share the same (or highly similar) core genome but differ in their flexible gene complement (i.e. the selection unit at the genomic level would be the pan-genome) (Box 1). These meta-clonal populations are maintained and equalized by phage predation 12 , in much the same way as the immune system of a mammal keeps tumors in check. Thus, phage populations should be considered as belonging to the same selection unit, not only because they are part of the pan-genome (which they often are) but because they are instrumental for its long-term preservation (Figure 2). Furthermore, individual clonal lineages, as retrieved by pure culture, have little chance of succeeding on their own in nature (or in complex biotechnological processes such as dairy or wine production or sewage treatment).
Although this idea invokes elements of group selection, a controversial theory in evolutionary biology 35 , it is proposed here to explain the evolution of prokaryotic populations, about which very little evolutionary theory has been solidly established, particularly outside the walls of the test tube.

Figure 2 (partial caption). In reality there are many more fGIs (often four or five or more); furthermore, many differential transporters or other niche-exploitation features can be encoded in small flexible islands or islets interspersed along the core, but they are all genetically linked to the O-chain and other surface-related fGIs that are major recognition targets for phages. Three types of phages and receptors are indicated by different geometric forms. This set of cellular clones and phages is in equilibrium, since a disproportionate increase of cells or phages is prevented by the density-dependent kinetics of phage infection. For example, if one clone increases over a certain threshold it will be over-preyed by its phages until the population returns to normal, following a classical Lotka-Volterra predator/prey equilibrium. This is favorable for the meta-clonal population, since it keeps different lineages with complementary ecological skills acting in tune. This meta-clonal/viral population can be selected as a unit to exploit similar environments, such as the water column of the ocean, which is very similar worldwide.
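The Lotka-Volterra-type equilibrium invoked above can be illustrated with a small numerical sketch. It is not the authors' model: it assumes a simple "kill-the-winner"-style system in which each clone has one dedicated phage, all clones share a carrying capacity `K`, and the parameters (`adsorption`, `burst`, `decay`) are hypothetical and in arbitrary units. Under these assumptions the equilibrium host density of every clone is `decay / (burst * adsorption)`, independent of its growth rate, so a faster-growing clone supports more phage and does not outgrow its neighbours.

```python
from typing import List, Tuple


def derivatives(hosts: List[float], phages: List[float], growth: List[float],
                K: float, adsorption: float, burst: float,
                decay: float) -> Tuple[List[float], List[float]]:
    """Rates of change for clones sharing capacity K, each preyed on by its own phage."""
    total = sum(hosts)
    d_hosts = [r * n * (1 - total / K) - adsorption * n * p
               for r, n, p in zip(growth, hosts, phages)]
    d_phages = [burst * adsorption * n * p - decay * p
                for n, p in zip(hosts, phages)]
    return d_hosts, d_phages


def equilibrium(growth: List[float], K: float, adsorption: float, burst: float,
                decay: float) -> Tuple[List[float], List[float]]:
    """Coexistence equilibrium: host density is set by the phage, not by growth rate."""
    n_star = decay / (burst * adsorption)
    crowding = 1 - len(growth) * n_star / K
    if crowding <= 0:
        raise ValueError("carrying capacity too small for all clones to coexist")
    hosts = [n_star] * len(growth)
    phages = [r * crowding / adsorption for r in growth]
    return hosts, phages


def simulate(hosts: List[float], phages: List[float], growth: List[float],
             K: float, adsorption: float, burst: float, decay: float,
             dt: float = 0.001, steps: int = 1000) -> Tuple[List[float], List[float]]:
    """Forward-Euler integration (crude; use a proper ODE solver for real work)."""
    hosts, phages = list(hosts), list(phages)
    for _ in range(steps):
        d_h, d_p = derivatives(hosts, phages, growth, K, adsorption, burst, decay)
        hosts = [n + dt * d for n, d in zip(hosts, d_h)]
        phages = [p + dt * d for p, d in zip(phages, d_p)]
    return hosts, phages


if __name__ == "__main__":
    growth = [1.0, 2.0]  # hypothetical: the second clone grows twice as fast
    hosts, phages = equilibrium(growth, K=1.0, adsorption=2.0, burst=10.0, decay=1.0)
    print("host densities:", hosts)
    print("phage densities:", phages)
```

In this toy system the advantage of the faster-growing clone shows up as a larger phage population rather than a larger host population, which is the sense in which linked phage receptors can hold a successful clone in check.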

Competing interests
No competing interests were disclosed. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data supporting group selection in prokaryotes
One of us proposed a Constant-Diversity model based on the distribution of gene functions in the genomic islands that under-recruit in metagenomes, where the core genome recruits at high similarity 12 . Since then, several independent workers have found data supporting this notion.
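As an illustration of what "under-recruiting" means in practice, the sketch below flags stretches of a reference genome whose metagenomic read coverage falls far below the genome-wide median. It is only a simplified stand-in for the recruitment analyses cited here: the function name, the per-window coverage input, the 20% cutoff and the example numbers are all assumptions chosen for illustration, and coverage depth is used as a proxy for recruitment at high similarity.

```python
from statistics import median
from typing import List, Sequence, Tuple


def under_recruiting_islands(coverage: Sequence[float], window_size: int,
                             fraction: float = 0.2,
                             min_windows: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of runs of low-coverage windows.

    `coverage` holds the mean read depth of consecutive, equal-sized windows.
    A window is "low" when its depth is below `fraction` times the median depth.
    Runs shorter than `min_windows` windows are ignored.
    """
    threshold = fraction * median(coverage)
    islands = []
    run_start = None
    # The trailing sentinel closes a run that reaches the end of the genome.
    for i, depth in enumerate(list(coverage) + [float("inf")]):
        low = depth < threshold
        if low and run_start is None:
            run_start = i
        elif not low and run_start is not None:
            if i - run_start >= min_windows:
                islands.append((run_start * window_size, i * window_size))
            run_start = None
    return islands


if __name__ == "__main__":
    # Hypothetical depths for ten 5 kb windows; windows 3-5 barely recruit.
    depths = [50, 52, 48, 3, 2, 1, 49, 51, 0, 50]
    print(under_recruiting_islands(depths, window_size=5000))
```

With 5 kb windows and `min_windows=2`, only stretches of at least 10 kb are reported, in line with the island size mentioned earlier in the text; single low windows are treated as noise.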
One of the most convincing demonstrations that different fGIs in prokaryotes are largely involved in phage sensitivity is the work of Avrani et al. 36 , in which the resistance to phages in several isolates of Prochlorococcus could all be assigned to mutations in Genomic Island 4 of this microbe, identified previously 9 as involved in the synthesis of the O-chain of the lipopolysaccharide. By measuring the frequencies of mutants in metagenomic datasets the authors conclude: "abundant Prochlorococcus populations belonging to a single ecotype with common physiological and ecological characteristics are actually an assortment of subpopulations with different susceptibilities to co-occurring phages" and "Thus, large numbers of taxonomically identical organisms, fulfilling the same ecological role, are probably maintained in the environment as a result of micro-diversity in phage susceptibility regions" 36 . A similar situation is found in Synechococcus, where resistant mutants were also found to have altered genes in the O-chain region 37 .
Many other recent developments support the coexistence of complex populations of phages and their hosts in natural communities 23,38-41 . There is also evidence that phages contribute to keep high levels of host diversity 42 and that diversity promotes productivity 43 .
Recently, other alternatives have been used to explain the diversity of pan-genomes by some type of kin selection, such as the so-called Black Queen hypothesis 44 , in which some genes present in certain lineages can supply functions for other clonal lineages. Along the same lines, Teusink et al. describe a "game theory" explanation for the coexistence of proteolytic and non-proteolytic strains in dairy multi-starter cultures 45 . However, these models only make the role of phages in keeping the proper ratios among the different cooperating lineages even more critical.

Conclusion
We are proposing here a way of thinking about prokaryotic communities in which cells with a core above 95% ANI, but with a wide diversity of flexible genome complement, and the phages preying on them, form an evolutionary selection unit, the "prokaryotic selecton". This has important repercussions for the evolutionary biology of prokaryotes.

Author contributions
The two authors contributed to this manuscript; both were involved in writing multiple drafts. Francisco Eduardo Rodriguez Valera wrote Box 1 and prepared Figure 2, and Dave Ussery made Figure 1.

Box 1
The species as an evolutionary unit is at the centre of the modern Neo-Darwinian synthesis. Species are not just considered as a taxonomic level of classification but as a kind of biological entity or level of organization beyond the cell or the individual 46 . However, this "natural species concept" has been difficult to transfer to prokaryotes due to the lack of sexual reproduction and the unpredictable levels of recombination (particularly when including illegitimate recombination) of prokaryotic genomes 47 . If the pan-selectomes described here are the units of selection, they might also fulfill the requirements to be considered natural evolutionary units. However, this requires a mechanism that provides the discontinuities in genetic diversity found in nature. Metagenomic data show that there are discontinuities located at ca. 95% ANI, beyond which a gap indicates an empty region of sequence diversity space 48 . Of course this only applies to the core genome of the meta-clones, but it still reflects a coherence that requires an evolutionary drive reminiscent of, for example, the breeding barriers found in animal species. How could the meta-clones explain such discontinuities? This critical question remains to be answered.
However, we would like to advance one hypothesis that we call "the maverick hypothesis". Let us assume that a meta-clone of bacteria and phages is established somewhere, for example exploiting the degradation of chitin, a common component in the water column of the ocean. A pan-genome evolves that allows for the efficient exploitation of the multiple resources associated with this polysaccharide (other accompanying organic molecules, phosphorus and nitrogen sources, etc.). The populations of this microbe also have a complement of phages that keeps the consortium well equilibrated. Physical proximity, near where the resources (such as zooplankton remains) are found, facilitates genomic homogenization by homologous recombination. The pools of genes in the flexible genome can diverge enormously, but the core will remain relatively coherent. The rise of a "maverick" that would try to form a monoclonal population, diverging away from the homogenizing influence of the rest, would be prevented by excessive phage predation pressure coupled with less efficient exploitation of resources. This trend in the long run might be enough to provide the discontinuities required to form a natural species-like entity.