Campylobacter jejuni genomes exhibit notable GC variation within housekeeping genes [version 1; peer review: 2 not approved]

Campylobacter jejuni (C. jejuni) is a rapidly evolving bacterial species with massive genetic recombination potential to generate niche-specific genotypes. Generally, the housekeeping gene lineage has been shown to undergo lateral gene transfer and recombination more frequently than the information processing gene lineage. During such exchanges, genetic amelioration takes place over time, with the acquired DNA taking on the molecular characteristics of the host genome. In this study, fifty genes, comprising twenty-five metabolic housekeeping lineage genes and twenty-five information processing lineage genes, from nineteen C. jejuni genomes were studied. These nineteen genomes included seven C. jejuni isolates of the same genotype, multilocus sequence type ST-474, that were sequenced in New Zealand. The genes from both lineages were tested for recombination and guanine-cytosine (GC) variation. There was a positive correlation between the GC variance and the number of recombination events amongst the metabolic housekeeping genes: genes that showed wider GC variance had a relatively high number of recombination events. In contrast, although recombination was evident in all of the informational genes, there was no correlation between the GC variance and recombination. The positive correlation between the GC variance and the recombination events in the metabolic housekeeping genes may reflect recent DNA exchange and regions that remain dynamic and prone to recombination under certain circumstances.


Introduction
Molecular events such as mutation, deletion, recombination and gene transfer play paramount roles in shaping the evolution of prokaryotes (reviewed by 1,2). As a consequence the genomes are more prone for nucleotide base compositional fluctuations. Particularly, the evolutionary forces pose a major impact on the guanine-cytosine (GC) content of bacteria at the level of genes and genomes 3. Amongst all the important evolutionary forces, the impact of recombination (homologous, non-homologous or illegitimate) on the evolution of bacteria has been evidenced as the major driving force or factor of microevolution 4-8. However, the rate of recombination may differ greatly amongst different bacterial species; while some species recombine more frequently to have multiple recombination events than mutations that render them weakly clonal, whereas in other species it appears to be a rare incident leading to distinct clonal lineages 5,6,8,9. Studies of genetic diversity in the bacterial kingdom have shown that bacteria form clusters of genetically related strains and that extensive recombination among related clusters have been regarded as normal rather than exceptional events 10. However, not every single gene is involved in recombination or horizontal gene transfer 11,12.
Gene lineages have been broadly classified into informational genes and operational (metabolic housekeeping) genes. The informational genes include genes of translation (T), transcription (S) and replication (R), as well as the ATPases, GTPases (G) and tRNA synthetases, whereas the operational genes are those involved in cell operations such as amino acid synthesis (A), biosynthesis of co-factors (B), cell envelope proteins (C), energy metabolism (E), intermediary metabolism (I), fatty acid and phospholipid biosynthesis (L), nucleotide biosynthesis (N), and regulatory genes (Z) 12. The operational genes are the most modular genes in the cell and are the most inclined to be horizontally transferred or recombined 11,13. Given this behaviour, it is reasonable to speculate that housekeeping genes may exhibit notable molecular differences compared to the informational genes.
Campylobacter jejuni (C. jejuni) is a zoonotic pathogen that colonises the gut of a wide variety of birds and mammals and has been implicated in the majority of bacterial gastroenteritis cases in developed countries 14. Disease is caused predominantly by C. jejuni and is often self-limiting; however, on rare occasions there can be serious sequelae such as Guillain-Barré syndrome and reactive arthritis 15. The natural competency and the plasticity of C. jejuni were not investigated until Dingle et al. (2001) 16 designed an MLST scheme for C. jejuni, which has subsequently been exploited to structure C. jejuni populations and to investigate their association with the different hosts and environments from which human clinical infection originated 17-21. Further, Wilson et al. 22 used a population genetics-phylogenetics approach to demonstrate the massive evolutionary potential of C. jejuni, inferring that recombination generates diversity at twice the rate of mutation.
As C. jejuni is an actively recombining bacterial species, we investigated the GC variation in a subset of operational genes and informational genes in an effort to understand the difference between these two lineages of genes within C. jejuni genomes. The assumptions underlying this analysis were: (1) amelioration or coalescence of GC content (nucleotide base composition) takes a relatively long time 23, which means that recombination in a given group of genes, without sufficient time for coalescence, will leave notable GC variation compared to other genes in the population; (2) housekeeping genes are relatively more prone to recombination, and interspecies recombination has been demonstrated between donor and recipient DNA molecules that differ by up to 25-30% of their nucleotide sites 24-26. Given that C. jejuni is a competent bacterial species in which the generation of host-adapted variants has been documented using MLST datasets by several studies 27-31, it may be hypothesised that the housekeeping genes exhibit notable differences in their GC content 24; (3) informational genes require highly stringent homology for recombination to occur, and recombination takes place as a part of DNA repair 32; hence these genes may not exhibit such notable GC variation even in the presence of recombination events. In order to test these hypotheses, nineteen C. jejuni genomes were analysed in this study.

Methods
Analyses of metabolic and informational housekeeping genes
Housekeeping genes were located on the C. jejuni NCTC 11168 genome (GenBank accession number NC_002163) and a total of 50 genes were selected for the analyses 33-35. The genes were further classified into operational genes (metabolic housekeeping genes) and informational genes based on their function by referring to the KEGG pathway and the gene function websites (http://www.genome.jp/dbget-bin/www_bget?ko+K03495 and http://www.igs.cnrs-mrs.fr/mgdb-cgi/www_gene_catalog?rpr.ann). The positions of the selected genes (metabolic housekeeping and informational genes) are marked on the reference C. jejuni NCTC 11168 (GenBank accession number NC_002163) circular genome and are shown in Figures 1A and 1D, respectively. Hereafter, the two groups are referred to as housekeeping genes (metabolic housekeeping genes) and informational genes for ease of comparison and interpretation. Twelve fully sequenced C. jejuni genomes (Table 1) were used to compare the 50 selected genes. The gene sequences were downloaded from the GenBank database (http://www.ncbi.nlm.nih.gov/genbank/). Seven C. jejuni MLST ST-474 isolates were sequenced at the Institute of Veterinary, Animal and Biomedical Sciences, Massey University, Palmerston North, New Zealand. The overall GC and GC3 contents of individual genes were compared using DnaSP v5 36. The frequency distribution graphs of GC contents from all 19 genomes were generated in the R programming language 37. Inferences on recombination within each gene were drawn using Dual-Brothers within Geneious v.5.3.4 38, and the number of recombination events, referred to as Rm, was estimated using DnaSP v5. The relationship between the GC variance and the recombination events was analysed using linear models with Rm as the dependent variable and the log GC variance and the gene length as independent variables. A detailed description of the methods used in this article is available from a related data article 39.
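To illustrate the quantities compared in this section, the following minimal Python sketch computes the overall GC content of a gene, the GC content at third codon positions (GC3), and the variance of GC content across genomes. It is not the DnaSP v5 implementation used in this study; the function names, the toy alleles and the choice of the sample variance are assumptions made for illustration only.

```python
from statistics import variance


def gc_content(seq: str) -> float:
    """Fraction of G or C among the unambiguous A/C/G/T bases of a sequence."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence has no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_content(cds: str) -> float:
    """GC fraction at third codon positions of an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return gc_content(cds[2:usable:3])


def gc_variance(alleles) -> float:
    """Sample variance of GC content across the alleles of one gene.

    Assumption: the paper does not state which variance estimator was used.
    """
    return variance(gc_content(seq) for seq in alleles)


if __name__ == "__main__":
    # Hypothetical alleles of one short gene from three genomes (not real data).
    toy_alleles = {
        "genome_A": "ATGGCCAAATAA",
        "genome_B": "ATGGCGAAGTAA",
        "genome_C": "ATGGCTAAATAA",
    }
    for name, seq in toy_alleles.items():
        print(name, round(gc_content(seq), 3), round(gc3_content(seq), 3))
    print("GC variance:", gc_variance(toy_alleles.values()))
```

In such a sketch, ambiguous bases and alignment gaps are simply skipped; a real analysis would need to decide how to treat them and whether alleles of unequal length are comparable.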

Results and Discussion
GC variance in housekeeping genes
Guanine-cytosine (GC) content ranged between 31.9% and 36.4% across the metabolic housekeeping genes in general; both the highest and the lowest GC contents were found in the C. jejuni subsp. doylei (CJJD269.97) genome. The GC content of the housekeeping alleles varied among genomes; of the MLST scheme alleles, tkt and gltA showed relatively wide variation, followed by glyA, glmM, aspA, glnA and uncA. GC contents were identical amongst all genes across all the ST-474 genomes except for two genes, fumC and trpC; however, the GC3 content of the pycA gene varied between the ST-474 genomes. Analysis of recombination events showed that the infB gene possessed the highest number of sites (Rm [the number of recombination events] = 27) (Table 2a), while trpB and uncA possessed the fewest sites involved in recombination (n = 1). The linear model showed that the number of recombination events was positively correlated with the GC variance: genes that showed a wider GC variance showed a higher number of recombination events. However, the infB and gltB genes showed relatively high numbers of recombination events (n = 27 and n = 26, respectively) without showing convincing GC variation between genomes of the kind observed in other genes. While the GC variance and the number of recombination events were positively correlated (p value = 0.009), gene length did not significantly influence the number of recombination events (p value = 0.7). Figures 1A and 1B illustrate the relationship between the GC variance and the recombination events, and the relationship between the length and the recombination events, respectively. The frequency distribution graphs of the GC contents for all the 25 genes are provided in Mohan et al. 37 in figures 2-5.

Figure 1A. Relationship between the GC variance and the number of recombination events.
Figure 1B. Relationship between the GC variance and the length of genes.
Figure 1C. Relationship between the GC variance and the number of recombination events in informational genes.
Figure 1D. Relationship between the length and the number of recombination events in informational genes.
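The linear model used here (Rm as the dependent variable, log GC variance and gene length as independent variables) can be sketched in a few lines of standard-library Python. This is a hedged illustration rather than the analysis actually run for the paper: the function names are hypothetical, the fit uses ordinary least squares via the normal equations, and p values are not computed because they would additionally require the t distribution.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_rm_model(rm, gc_var, length):
    """Least-squares fit of Rm = b0 + b1 * log(GC variance) + b2 * length.

    Returns [b0, b1, b2]. One observation per gene is assumed.
    """
    rows = [[1.0, math.log(v), float(n)] for v, n in zip(gc_var, length)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, rm)) for i in range(k)]
    return solve_linear_system(xtx, xty)


if __name__ == "__main__":
    # Invented per-gene values, for demonstration only (not data from this study).
    demo_gc_var = [0.001, 0.002, 0.005, 0.01, 0.02]
    demo_length = [900, 1200, 1500, 800, 2000]
    demo_rm = [3, 6, 9, 10, 15]
    print(fit_rm_model(demo_rm, demo_gc_var, demo_length))
```

A positive slope on the log GC variance term with a slope near zero on length would correspond to the pattern reported for the housekeeping genes; with only 25 genes per group, any such fit should be read cautiously.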

GC variance in informational genes
The overall guanine-cytosine (GC) content varied greatly between informational genes in general: genes such as mfd, ogt, polA, recJ, recN and xseA showed lower GC contents (from 0.265 to 0.298), whereas the remainder showed relatively higher GC contents, with rplB showing the highest GC content of 0.392. The analysis of informational genes for recombination events showed that ogt and ssb each possessed a single recombination site, while mutS possessed the highest number of recombination events (n = 38) (Table 1b describes the recombination events that occurred in the informational genes). The linear model showed that Rm was dependent on the length of the genes (p value = 0.05) rather than on the GC variance (p value = 0.9). Figures 1C and 1D illustrate the relationship between the GC variance and the number of recombination events, and the relationship between the length and recombination events, respectively. Frequency distribution graphs of GC contents for the informational genes are provided in Mohan et al. 37 in figures 6-9.
Differences in the nucleotide base composition of a gene and/or genome are a fundamental element shaping genomic evolution, which in turn directly influences the GC contents of genes and/or genomes 40. GC variation has been thought to be driven both by neutral mutational effects and by adaptive selection pressures 41,42. In this study, GC contents of the housekeeping and informational genes were measured across 19 C. jejuni genomes. Variation in the GC contents amongst the housekeeping genes within the 19 genomes was clearly evident. However, it should be noted that the GC contents of the housekeeping genes within the seven ST-474 genomes were identical except for two genes, fumC and trpC; the pycA gene varied in its GC3 content but showed an identical overall GC content across all the ST-474 genomes. There was an association between the GC variance and the number of recombination sites within different housekeeping genes: the majority of the genes investigated in this study that possessed a wider GC variance showed a higher number of recombination sites.
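Since this association rests on the count of recombination events (Rm), it may help to recall how such a lower bound is commonly derived. DnaSP reports the minimum number of recombination events of Hudson and Kaplan (1985), which is based on the four-gamete test. The sketch below follows that idea in simplified form; it is an assumption-laden illustration (biallelic sites only, hypothetical function names, toy alignment), not the DnaSP code and not a reproduction of the values reported here.

```python
from itertools import combinations


def four_gamete_incompatible(col_i, col_j):
    """True if two biallelic sites show all four two-site haplotypes."""
    return len(set(zip(col_i, col_j))) == 4


def minimum_recombination_events(alignment):
    """Lower bound on recombination events in the spirit of Hudson and Kaplan.

    `alignment` is a list of equal-length, gap-free DNA strings.
    """
    seqs = [s.upper() for s in alignment]
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    columns = [[s[i] for s in seqs] for i in range(n_sites)]
    # Keep only sites with exactly two unambiguous alleles.
    sites = [i for i, col in enumerate(columns)
             if len(set(col)) == 2 and set(col) <= set("ACGT")]
    # Each incompatible pair implies at least one event between the two sites.
    intervals = [(i, j) for i, j in combinations(sites, 2)
                 if four_gamete_incompatible(columns[i], columns[j])]
    # Greedy selection of the largest set of intervals that do not overlap
    # (a shared endpoint is allowed, as the intervals are treated as open).
    intervals.sort(key=lambda iv: iv[1])
    rm, last_end = 0, -1
    for start, end in intervals:
        if start >= last_end:
            rm += 1
            last_end = end
    return rm


if __name__ == "__main__":
    # Toy alignment of four haplotypes (invented, not C. jejuni data).
    print(minimum_recombination_events(["AAT", "ACT", "GAG", "GCG"]))
```

Under this logic a single nucleotide difference alone never counts as a recombination event; at least two variable sites carrying all four haplotype combinations are required, and the estimate remains a minimum rather than the true number of events.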
The variation in base composition is a consequence of differences in the patterns of evolutionary events 43, where variation in the GC content depends on the mutational patterns and/or the evolutionary events that have occurred in a given nucleotide sequence 43,44. GC content has also been shown to be correlated with various biological factors 45 and remains an active research area in which a significant amount of work has been carried out 3,41,46-52. Previous reports that investigated the causes of differences in GC content have shown that changes at the third codon position (GC3), namely the conversion of GC to AT and of AT to GC, occur in favour of selection for the GC content of a given genome 47,53. Further, this selection pressure has a great impact on evolution 47,53. Furthermore, major changes in the GC content have been shown to predict the future direction of evolutionary change in the genomic GC content 54. The tRNA abundances in a genome have been shown to be yet another important selective pressure that determines synonymous codon usage (change at the GC3 position), which in turn reflects differential evolution in organisms 50,55. Hence, in the light of previous reports, it is reasonable to hypothesise that, since the housekeeping genes are involved in metabolic processes, their GC variation may reflect the different metabolic evolutionary pressures that have acted upon these genes as a measure of adaptation to the prevailing environmental conditions. Also, the base composition may have changed as a result of recombination within the same species and/or between bacteria with a similar base composition and tRNA pools. Determining the driving forces behind the change in base composition may help to better understand the biological reasons for the frequent nucleotide changes within housekeeping genes in C. jejuni.
Further, there may be a few additional explanations for the variation in GC content that might have arisen as a result of recombination: (1) the recombined regions may not have ameliorated or coalesced after an event of recombination; (2) they may not have had ample time for amelioration after recombination; or (3) those sites may be hotspots on the genes that are dynamic and continuously engaging in recombination.
Reports using MLST datasets have revealed mosaic alleles within the seven housekeeping alleles and have also raised the possibilities of both host adaptation and convergence of C. jejuni and Campylobacter coli (C. coli) 27,30,31. There may be an additional, possibly biological, influence that triggers such recombination between C. jejuni isolates present in a host and/or convergence of C. jejuni with C. coli, which may in turn enable better survival of the bacteria in certain hosts. Hybrid alleles of tkt and gltA have been frequently documented in previous reports 30,31; in our study, these two genes showed wider GC variance within genomes. In contrast, atpD and trpB showed the lowest number of recombination sites, and their GC content did not vary in the way seen for the tkt and gltA alleles across the investigated genomes. However, there was one exception: the infB gene showed a high number of recombination sites but relatively small GC variance.
In the case of informational genes, the overall GC content varied greatly amongst the genes investigated within the genomes. Although there was variation in the GC contents within genes, it was not positively correlated with the number of recombination sites; the number of recombination sites was instead correlated with the length of the genes, which is not surprising. It may be speculated that, since recombination in informational genes demands a relatively high degree of homology 56, recombination in this subset of genes may have occurred between highly homologous sequences. Moreover, according to the complexity theory 11,12, the repair and ribosomal genes belong to lineages of higher complexity, which may not allow them to tolerate GC variation during recombination. Further, apart from the differences between individual genomes, the overall GC variation amongst the informational genes investigated in this study (lower GC content in the mfd, ogt, polA, recJ, recN and xseA genes and higher GC contents in the remainder) reflects differences in functional conservation and complexity. For example, rplB, which showed the highest GC content of 39.2%, is relatively complex and conserved: it encodes the 50S ribosomal protein L2, which is involved in several discrete steps of polypeptide synthesis, such as peptidyl transferase activity and the binding of aminoacyl-tRNA to the A and P sites 57.

Future research
Even though C. jejuni has been shown over the past decade, through a significant amount of research, to be a promiscuous and competent bacterial species, the biological triggers behind recombination that lead to host adaptation and the emergence of new variants remain unclear. Further, most studies are based on the seven housekeeping alleles that cover internal fragments (approximately 400-500 base pairs) of the genes used in the MLST scheme for typing isolates 16,58. The nucleotide differences outside the typing region are therefore neglected, although they matter when the evolution of a functional gene is to be evaluated. Housekeeping genes involved in various metabolic functions and cellular processes play a critical role in the overall integrity and survival of micro-organisms in general. The correlation between GC variance and recombination indicates the vulnerability of housekeeping genes to evolutionary forces, and it also shows how dynamic the regions on these genes are in continuously responding to such stimuli. Although base composition varies with mutation and recombination events (which is biologically expected), it is intriguing to speculate (1) how frequently the nucleotide base composition changes in C. jejuni; (2) how many nucleotide changes a C. jejuni gene and/or genome can tolerate; (3) what the biological triggers are that generate new C. jejuni variants; (4) in the event of recombination, how long the exchanged gene portion takes to coalesce; and (5) which functional genes are most affected by evolutionary forces.
Even though our study has made an effort to differentiate between two lineages of genes in C. jejuni and to substantiate and speculate on the possible triggers for GC variation and its relationship with recombination, the dataset we used is too small to support concrete statements. However, larger genomic datasets will be able to provide better resolution on the hypotheses formulated in our study and will also provide answers to the questions that we are raising in this paper.

Table footnote (beginning missing): … sequence type; GBS: Guillain-Barré syndrome; ORFs: Open reading frames; Mb: Mega bases; GC: Guanine: Cytosine; ns: not stated; *: Date of start of project.

Table 2a. GC content range and recombination sites identified in the housekeeping genes.
Rm: Number of recombination sites.

Peer review reports
This paper compared 50 genes, obtained from 12 Campylobacter genomes, to test two hypotheses. I rephrase them for clarity here as: (1) Does GC content provide evidence of recombination events, and can this shed light on the history of recombinational events in the Campylobacter genome? (2) Are housekeeping genes more prone for recombination than informational genes in Campylobacter? From the 'Results and discussion' section it is not clear whether the authors have confirmed or rejected these hypotheses.
I have difficulties with both of these hypotheses. My concerns with the first hypothesis are the following: The authors consider only recombination between donor DNA with a higher GC-content into an AT-rich Campylobacter genome. They ignore that the vast majority of recombination can occur within a Campylobacter species, or, to a lesser extent, between closely related species (C. jejuni and C. coli), in which case there is no base composition difference between donor and acceptor and you wouldn't see amelioration of GC content. The 'recombination' that the authors have identified and use in their analysis (they even mention recombination sites although it is unclear how these were identified) have more likely taken place between alleles of different Campylobacter clones than between genes of different species.
My concern with the second hypothesis is that there is no fundamental genetic or physiological difference between 'housekeeping genes' and 'informational genes'. It is a man-made division only, while the cell just maintains its physiology by means of all these proteins. The only relevant distinction here is whether a gene product is active as a sole contributor to a process (an enzyme, say, that works on a single substrate-to-product conversion without interaction with other factors) or whether a gene product acts in close contact with many other gene products (a ribosomal protein, say). The latter are constrained in mutation, but only for non-synonymous mutations.
The authors ignore codon usage effects, expression levels (highly expressed genes employ different codons), and compositional constraints of proteins that are reflected by their codons and thus affect the gene's GC content.
A recent paper studied evolution in the complete core genome of 27 Campylobacter genomes. It covered more than 1100 genes and employed various statistical methods. That paper reported that there was NO division in evolutionary signature between informational versus housekeeping genes (Snipen et al. (2012) Analysis of evolutionary patterns of genes in Campylobacter jejuni and C. coli. Microb. Inform. Expt. 2:8).
I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

https://doi.org/10.5256/f1000research.1103.r922
© 2013 Ussery D. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
David Ussery, Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Lyngby, Denmark
This short article examines variation in GC content and recombination in two different sets of genes, across nineteen Campylobacter jejuni genomes. After reading the abstract several times, I'm still not sure exactly what hypothesis is being tested here. I am confused by the first sentence in the abstract, which states that C. jejuni is 'rapidly evolving' and has 'massive genetic recombination potential'. Compared to what? Certainly compared to a virus, C. jejuni is quite slowly evolving. Further, it seems from the larger picture of whole genome comparison, there is not THAT much difference within the C. jejuni genomes, compared to say for example E. coli, which can be three to four times the size of C. jejuni, and has a very large pan-genome - about ten times the size of any individual E. coli genome. Several years ago, based on a smaller set of genomes, we found that the C. jejuni genome was much less 'open' than the E. coli genome (see PMID 19691844).
I had to read and re-read the first paragraph of the Introduction section many times. Going through just that first paragraph:
"Molecular events such as mutation, deletion, recombination and gene transfer play paramount roles in shaping the evolution of prokaryotes (reviewed by 1,2)."
I'm curious as to what exactly a 'molecular event' is? I would think that most molecular biologists are 'atomists', that is, they think in terms of biochemistry. Is there an alternative, perhaps vitalism or some supernatural event that is an alternative to a 'molecular event'? What is the difference between 'mutation' and 'deletion'? Would a 'deletion' not be considered a subset of 'mutation'? And what is meant by 'recombination'? Is this different to 'mutation'?
"As a consequence the genomes are more prone for nucleotide base compositional fluctuations."
The [prokaryotes] are MORE prone than what? Viruses? Eukaryotes? Not sure what is being referred to here. Are the authors saying that mutations, deletions, recombination, gene transfer happen more often in bacteria than in eukaryotes? Or PERHAPS the fact that, because bacterial genomes tend to be more coding-rich (80% or more of the genome codes for proteins), then variations in the third codon position might allow genomes to become more AT rich or GC rich??
"Particularly, the evolutionary forces pose a major impact on the guanine-cytosine (GC) content of bacteria at the level of genes and genomes 3. Amongst all the important evolutionary forces, the impact of recombination (homologous, non-homologous or illegitimate) on the evolution of bacteria has been evidenced as the major driving force or factor of microevolution 4-8."
So recombination is driving the changes in GC content? How does that work exactly? Or perhaps the authors here are talking about something other than differences in GC content? If so, this could be elaborated.
"However, the rate of recombination may differ greatly amongst different bacterial species; while some species recombine more frequently to have multiple recombination events than mutations that render them weakly clonal, whereas in other species it appears to be a rare incident leading to distinct clonal lineages 5,6,8,9."
And where does Campylobacter fit on this scale? Is there more variation, more recombination, or less? And, once again, perhaps it would be nice for the authors to mention that Campylobacter is very AT rich.
"Studies of genetic diversity in the bacterial kingdom have shown that bacteria form clusters of genetically related strains and that extensive recombination among related clusters have been regarded as normal rather than exceptional events 10. However, not every single gene is involved in recombination or horizontal gene transfer 11,12."
This is comforting perhaps, but an indication of the fraction of genes undergoing recombination in Campylobacter would be good…
In the Methods section, the NCBI refseq number is given, rather than a true 'GenBank accession number', which for the C. jejuni genome is AL111168. The methods section states that the genomes were downloaded from GenBank - but it appears that perhaps NCBI RefSeq was used instead. RefSeq is a somewhat curated database, with different gene annotations (and hence there could be different gene lengths which might give different %GC contents) - so it is very important to clearly state WHICH database was actually used here. The INSDC accession numbers are shared between GenBank, EMBL, and the DNA DataBase of Japan (DDBJ).
To be honest, I really do not see that large of a difference between the variation in %GC in the housekeeping genes vs. information genes. I think a simple box and whiskers plot, showing the genome distribution, compared to the distribution for the housekeeping vs. information genes is the best way to visualise and compare the three distributions. I suspect that there's not that much of a difference here. (Which by the way, I looked through the manuscript several times - nowhere could I find a mention or even brief discussion that C. jejuni is quite AT rich, compared to other bacterial genomes.)
The conclusion, that the information genes have less recombination than the housekeeping genes is perhaps not surprising, but I'm not sure how 'recombination' is measured here - the authors refer to a computer program which was used to measure recombination in HIV sequences, with high rates of changes. However, applying this model to something like Campylobacter, where the variation is extremely low (compared to HIV certainly). So if there is a single nucleotide difference, is this called a 'recombination event'? Are SNPs recombination events? Maybe sometimes?
In summary, I find this short paper would reflect a nice student project, done as part of a course. It is a good exercise to write up what has been done, but I'm still missing the hypothesis that is being tested, the anticipated results, and what sorts of results would falsify the hypothesis. Why not come up with a very clear hypothesis that can be tested, and run it across the roughly 80 C. jejuni genomes available from NCBI? http://www.ncbi.nlm.nih.gov/genome/149
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.