Horizontal transfer and evolution of wall teichoic acid gene cassettes in Bacillus subtilis

Background: Wall teichoic acid (WTA) genes are essential for production of cell walls in gram-positive bacteria and necessary for survival and variability in the cassette has led to recent antibiotic resistance acquisition in pathogenic bacteria. Methods: Using a pan-genome approach, we examined the evolutionary history of WTA genes in Bacillus subtilis ssp. subtilis. Results: Our analysis reveals an interesting pattern of evolution from the type-strain WTA gene cassette possibly resulting from horizontal acquisition from organisms with similar gene sequences. The WTA cassettes have a high level of variation which may be due to one or more independent horizontal transfer events during the evolution of Bacillus subtilis ssp. subtilis. This swapping of entire WTA cassettes and smaller regions within the WTA cassettes is an unusual feature in the evolution of the Bacillus subtilis genome and highlights the importance of horizontal transfer of gene cassettes through homologous recombination within B. subtilis or other bacterial species. Conclusions: Reduced sequence conservation of these WTA cassettes may indicate a modified function like the previously documented WTA ribitol/glycerol variation. An improved understanding of high-frequency recombination of gene cassettes has ramifications for synthetic biology and the use of B. subtilis in industry.


Introduction
In a recent paper, we studied the relationship between essential and core genes in Bacillus subtilis ssp. subtilis through a pan-genomic approach. Core genes are the set of genes present in all or almost all strains of a species/subspecies in a pangenome. Pan-genomes are determined computationally by finding orthologous gene clusters (OGC) between strains based on homology and genome context. [1][2][3] An OGC is the set of genes, with at most one per strain, that have been computationally determined to be orthologs. An OGC is "core" if the number of genes in the OGC exceeds or equals some threshold such as 95% or 100% of the strains in the pan-genome. A node within a pan-genome graph (PGG) is an OGC and if two nodes (OGCs -specifically the genes in the OGCs) are adjacent in the genome of one or more of the pangenome strains this is considered an edge. While core genes are determined computationally, "essential genes" are experimentally determined genes that render an organism as nonviable if removed in laboratory growth conditions. Such genes are determined experimentally through knockout studies via methods such as random transposon insertion.
For example, Koo et al. 4 and Kobayashi et al. 5 computationally and experimentally determined the essential gene set in the gram-positive bacterium Bacillus subtilis ssp. subtilis. For B. subtilis ssp. subtilis, both Koo et al. 4 and Kobayashi et al. 5 used similar single knockout methods to determine "essential" genes when grown in LB at 37°C. Koo identified 257 essential genes while Kobayashi identified 271 essential genes. The union of these two sets resulted in 305 essential genes. 6 Sutton et al. 6 mapped these 305 genes to the PGG OGCs using the type strain 168 RefSeq genome NC_000964.3 (BioSample SAMEA3138188) used by Koo et al. 4 and Kobayashi et al. 5 and 289 were determined to be core OGCs. 6 For those genes that were determined to be essential in laboratory conditions but determined not to be core for the subspecies pan-genome, Sutton et al. 6 found that some were not truly essential but rather "conditionally essential" due to the presence of other genes such as toxin/antitoxin cognate pairs where the antitoxin is essential in the presence of the toxin gene.
Curiously, eight of the 305 essential genes that were not core genes are involved in the biosynthesis of wall teichoic acid (WTA). These include tuaB (OGC 4729 present in 85 of 108 genomes), mnaA/yvyH (OGC 4735 present in 84 of 108 genomes), tagH (OGC 4744 present in 84 of 108 genomes), tagG (OGC 4745 present in 35 of 108 genomes), tagF (OGC 4746 present in 35 of 108 genomes), tagD (OGC 4748 present in 35 of 108 genomes), tagA (OGC 4749 present in 35 of 108 genomes) and tagB (OGC 4750 present in 35 of 108 genomes). By examining the PGG we determined that genes homologous but diverged from the type strain 168 WTA genes were present at the same relative genome location but were contained in other OGCs. In order to understand the evolution of these essential WTA genes that are not contained in single OGCs, more diverged protein orthology was used to overcome the strong OGC constraint of a minimum of 90% identity over 90% of gene length at the nucleotide level. Our analysis indicates that the WTA genes have a high level of variation possibly due to horizontal gene transfer via recombination as entire cassettes and smaller regions in B. subtilis ssp. subtilis.
WTA genes are involved in production of anionic glycopolymers required for consistent cell shape and division. 7,8 Mutants deficient in WTA biosynthesis show increased sensitivity to temperature and certain buffer components and cells tend to aggregate in culture. [9][10][11][12][13][14] While the WTA genes have been shown to be dispensable, 11 the resulting cells have abnormal morphology and show impaired growth and reproduction. B. subtilis and S. aureus mutants deficient in lipoteichoic acid (LTA) biosynthesis can be obtained but only if grown under a narrow range of conditions; they are temperature sensitive and exhibit severe growth defects. 9,10 Taken together, LTAs and WTAs create what has been aptly described by Neuhaus and Baddiley as a "continuum of negative charge" that extends from the bacterial cell membrane beyond the outermost layers of peptidoglycan. 15 New pathogenesis-related functions for WTAs have also been realized and it has been suggested that the biosynthetic enzymes that make these polymers are targets for novel antibacterial agents. 16,17 Indeed, the first WTA-active antibiotic 18 acts by blocking the export of WTA to the extracellular surface. The chemical structures of WTAs vary in gram-positive bacteria but the most common structures are glycerol or ribitol phosphate repeats. [19][20][21][22][23][24][25][26] B. subtilis can make poly (glycerol phosphate) or poly (ribitol phosphate) WTAs depending on the strain, 27 while S. aureus strains primarily make poly (ribitol phosphate) WTAs. [28][29][30][31] The ribitol WTA genes (tar) were thought to distinguish B. subtilis ssp. spizizenii strains from the glycerol WTA genes (tag) contained in B. subtilis ssp. subtilis strains based on functional studies in B. subtilis strain W23 which is a B. subtilis ssp. spizizenii strain and B. subtilis strain 168 which is a B. subtilis ssp. subtilis strain. 32 Sequencing of the W23 tar genes revealed eight open reading frames in two adjacent divergently transcribed operons, tarABIJKL and tarDF, where tarA, tarB, tarD and tarF have clear homology to their counterparts tagA, tagB, tagD and tagF while tarI, tarJ, tarK and tarL have no obvious homology to tag genes. The four conserved tar/tag genes appear to construct the basic core of the teichoic acid structure although for tar/tagF the function is somewhat different since tagF is much longer than and shares only C-terminal homology with tarF. 33 The four tar genes which are not conserved presumably are specific to the ribitol modifications to teichoic acids. The tag genes are also organized in two adjacent divergently transcribed operons tagABC and tagDEF where tagC and tagE have no obvious homology to tar genes. 32 More recently it was determined that the ribitol/glycerol distinction of tar versus tag genes is not distinguishing between spizizenii and subtilis subspecies but rather either subspecies can contain one or the other. 34 Methods For the B. subtilis ssp. subtilis pan-genome we selected strains with complete genomes in RefSeq. 35 We restricted our analysis to complete genomes to ensure that missing genes due to incomplete genome sequencing/assembly did not affect the approach or results. We limited our choice to RefSeq for two reasons: RefSeq performs a series of quality checks to remove dubious genome assemblies, and the initial pan-genome construction depends upon reasonably consistent annotation which RefSeq provides. We extracted the genomes based on organism name: Bacillus subtilis (we did not specify subspecies, since for many RefSeq genomes a subspecies is not given). For each pan-genome we then compared the genomes using a fast Average Nucleotide Identity (ANI) estimate generated using MASH. 36 We used type strains and ANI to determine which of these genomes were the desired organism. We also used ANI to remove very closely related strains to reduce oversampling bias (for example, for the B. subtilis type strain, 168, has at least eight genomes in RefSeq). We used GGRaSP 37 to choose a single medoid sequence from any complete linkage ANI cluster with a threshold of 0.01% or 1/10,000 base pair difference. The strain 168 medoid genome is the Entrez reference genome for the B. subtilis type strain (GenBank sequence AL009126.3, BioSample SAMEA3138188, Assembly ASM904v1/GCA_000009045.1) which can be used to map the Koo et al. 4 and Kobayashi et al. 5 results.
Using this approach, for B. subtilis 143 genomes were downloaded from RefSeq. Of these 132 genomes were determined to be B. subtilis ssp. subtilis based on type strains and ANI. The minimum ANI between any pair of the 132 B. subtilis ssp. subtilis genomes was 97.28% whereas the maximum ANI of any of the 11 other genomes to the 132 genomes was 95.73% providing good separation between the other subspecies. The 132 genomes were reduced to 109 genomes after removing redundant strains. Finally, we removed strain delta6 (BioSample SAMN05150066) because it is known to have been engineered to remove multiple genes. Thus, we were left with 108 B. subtilis genomes (Supplementary Table 1).
For B. subtilis ssp. subtilis the initial pan-genome was based on the RefSeq annotation of these genomes. The pan-genome was generated using the pan-genome pipeline at the J. Craig Venter Institute (JCVI) at the nucleotide level using default parameters with the exception that a minimum of 90% identity and 90% length for pairwise Blast matches were used to prevent possible clustering of non-orthologous genes. 38 This produced ortholog gene clusters (OGC) using gene context 1 as well as a Pan-Genome Graph (PGG). 2 The PGG has two main components: nodes representing OGCs, and edges representing the sequence between genes and the order and orientation of the genes in the genomes. We updated the code repository for the JCVI pan-genome pipeline with a script: iterate_pgg_graph.pl which calls pgg_annotate.pl for the genomes in the existing PGG in order to ensure consistent annotation of the genomes and iterates until the PGG stabilizes. The script pgg_annotate.pl uses an existing PGG to assign regions of a genome to nodes of the graph. This is done by blasting the medoid gene sequence for the OGC the node represents against the genome and then uses Needleman-Wunsch 39 to extend the alignment if needed. If there are conflicting blast matches, then the matches are resolved based on which matches are consistent with the structure of the PGG which encapsulates gene context across the entire pangenome. Once the nodes of the PGG are mapped to each of the genomes in the pan-genome a new version of the PGG is intrinsic and then explicitly extracted. This process is iterated to stability. This ensures that each genome is consistently annotated so that missing genes are not due to inconsistent annotation.
The medoid nucleotide sequence for each of the 144 OGCs found in any of the WTA cassettes was translated into peptide sequences. These 144 peptide sequences were combined into a multifasta file to create a peptide Blast database (makeblastdb -in WTA_prot.fasta -dbtype prot). A peptide level all versus all Blast search 40 of these 144 peptide sequences was performed (blastp -query WTA_prot.fasta -db WTA_prot.fasta -out tmp -task blastp -evalue 0.000001 -outfmt "6 qseqid sseqid pident qstart qend qlen sstart send slen evalue bitscore stitle"). Potential protein ortholog matches were retained if the percent identity was ≥40% and the length of the match was ≥80% of the shorter protein. Matches with smaller bit scores than the first match to a protein from the same clade (i.e. a paralog) were not retained. More limited matches were retained for tagF matches in an attempt to determine possible orthologs but they were not treated as likely orthologs. The protein ortholog Blast matches are in Supplementary Table 7 and the probable protein orthologs are in Table 1.
Phylogenetic trees were generated from the pan-genome using complete linkage hierarchical clustering of pairwise Jaccard distance and genome ANI distances. The resulting trees were rendered using the Interactive Tree of Life (iTOL). [45][46][47] A linear illustration of the WTA cassettes was generated using the SimpleSynteny tool. 48

Results
For our 108 strain pan-genome (Supplementary Table 1), the WTA genes in all genomes are colocalized and bounded by two core genes: yvyE (OGC 3712) and ribZB (OGC 3756). For all but one genome, this region is contiguous and therefore can be considered as a cassette. For strain N1-1 (SAMN10225190), the cassette is split into two nearby contiguous pieces due to an inversion between two inverted copies of an IS1182 transposon, one inside the cassette and one outside the cassette. This could be a real rearrangement or an assembly error. For the purpose of our analysis we treated these two pieces as a single cassette as if the inversion had not taken place.
We extracted the PGG annotation, which assigns genome coordinates to OGCs, for just the WTA cassettes from the complete PGG annotation of the 108 genomes (Supplementary Table 2). There were a total of 144 OGCs which were present in one or more of the WTA cassettes which we extracted from the overall set of OGCs (Supplementary Table 3). We distinguished seven clades. Clade I (blue) is the ribitol WTA consistent with that found in strain W23. Clade II (purple) and clade III (orange) appear to also be ribitol WTA based on the presence of tarI (OGC 5434), tarJ (OGC 5435), tarK (OGC 5436) and tarL (OGC 5437). Clade IV (yellow) may be a ribitol WTA based on a shorter tagF gene. Clade V (red) may be a glycerol WTA based on a longer tagF gene. Clade VI is the type strain glycerol WTA. Clade VII (teal) is missing many key WTA genes (tag/tarBADFG) so it is unclear how the one strain in that clade is constructing WTAs. The 10 medoid strains used in Figure 3 are in bold.
We also extracted the set of edges between these OGCs from the PGG (Supplementary Table 4). Based on the presence or absence of each of the 144 OGCs in a strain we calculated the pairwise Jaccard distance as a measure of WTA gene content dissimilarity (Supplementary Table 5). The strains were then classified into seven clades (I-VII) based on the Jaccard distance ( Figure 1). Clades I-VII have 43, 2, 3, 1, 23, 35, and 1 strain(s), respectively. To determine whether the gene content difference was due to evolutionary inheritance, we determined the pairwise average nucleotide identity (ANI) similarity between strains (Supplementary Table 6) which is considered to be strongly correlated with evolutionary distance (Figure 2). Since the WTA gene-content based clades do not form the same clades under ANI distance, but rather intermix, something beyond straight inheritance is accounting for the WTA gene content within strains.
To determine orthology beyond the strict orthology of the OGCs, we used protein level Blast (blastp) to compare the medoid gene sequence of each of the 144 WTA OGCs against the other 143 WTA OGC medoid gene sequences (Supplementary Table 7). The medoid gene sequence of an OGC is the gene sequence with the lowest summed pairwise homology distances as determined by Blast to all other gene sequences in the OGC. The ribitol specific genes tarI (OGC 5434), tarJ (OGC 5435), tarK (OGC 5436) and tarL (OGC 5437) had no obvious OGC protein orthologs. Likewise, the glycerol specific genes tarC (OGC 4751) and tarE (OGC 4747) had no obvious OGC protein orthologs. Based on the ribitol and glycerol specific OGCs, three clades (I, II and III) appear to be ribitol WTAs, one clade (VI) appears to be glycerol WTA, and three clades are an indeterminate WTA type. Within the WTA cassette there are three adjacent OGC pairs (for a total of six OGCs) which are conserved across all 108 strains: tagO-tuaH (OGCs 3721-3722), lytD-pmiA (OGCs 3749-3750) and gerBC-ywtG (OGCs 3753-3754), in that order from bounding OGC 3712 to bounding OGC 3756. Each of these highly similar OGC pairs contain similar DNA sequences which allow for homologous recombination events to occur between strains from different clades. We did not explicitly examine possible recombination between strains of the same clade. We illustrate the basic OGC structure within the WTA cassette for the seven clades shown in Figure 1 in the linear alignment shown in Figure 3 of the WTA cassettes from the 10 medoid genomes determined by ANI and shown in bold in Figure 2.
We examined OGC and OGC protein ortholog based evidence of recombination between clades for each clade. We did this by determining the OGC patterns for each of the 108 strains across the WTA cassette (Supplementary Table 8). An OGC pattern is simply the order of OGCs across the WTA cassette for a given strain from OGC 3712 to OGC 3756. For identical OGC patterns we collapsed identical columns for a simpler presentation in Supplementary Table 9. When analyzing the OGC patterns, we specifically looked for common subpatterns, a subinterval of the OGC pattern, to determine possible recombination events. An OGC pattern or subpattern is shown as a parentheses bounded comma separated list of OGCs where OGCs with consecutive numbers are indicated with a hyphen (e.g. 4724-4730,5418-5421,4734-4735,5422-5424,4736,4744, 5425). A null subpattern indicated by () means there are no OGCs for that subpattern. We looked specifically at the four subregions of the WTA between the bounding OGCs and the three conserved OGC pairs. We designated these four regions based on their bounding OGCs: 3712-3721, 3722-3749, 3750-3753, and 3754-3756.
For the 23 strains in clade V, there are eight OGC patterns (Supplementary Table 9, Supplementary Table 8, Table 2   For clades II, III, IV, and VII recombination based on OGC patterns is harder to evaluate since they only have 2, 3, 1, and 1 strain(s), respectively. For clade VII, the OGCs which are not unique to clade VII are common to most other clades. For clade IV, this is also true with the exception of OGC subpattern (7625-7631) which is shared with clade III but it is unclear if this is due to shared ancestry or recombination. For clade III, OGC patterns are mixed between that seen for clades I and VI with some unique OGCs mixed in but there is no obvious recombination pattern except in region 3754-3756 where there are two OGC subpatterns: (3755) and (4754). For clade II, the OGC patterns are very similar to clade I with some unique OGCs and no evidence of recombination except in region 3754-3756 where there are two OGC subpatterns: (5438) and (4754).
For the eight essential WTA genes that did not have core OGCs, we combined the protein ortholog OGCs (Table 1) showing: tuaB, mnaA/yvyH, tagG, and tagH are in all 108 strains; tagD, tagA, and tagB are in 107 strains (missing from the one clade VII strain); and tagF is in only 87 strains. We investigated why tagF is missing from 20 strains besides the clade VII strain also missing tagDAB. It is also missing from all clade II and III strains and from 15 clade I strains. For the 15 clade I strains two short proteins (OGCs 4740-4741) have homology to tagF in nonoverlapping regions but many clade VI strains have both OGCs 4740-4741 and 4746 (tagF). We searched the entire genomes of the strains missing an obvious tagF ortholog using tblastn to search at the peptide level the tag/tarF medoid proteins for OGCs 4746 (tagF), 5988 (tarF), and 5430 (tarF). For the 15 clade I genomes both 5998 and 5430 matched full length at 65% and 70% identity to a region adjacent to OGC 5431 and overlapping the OGCs 4740 and 4741. This indicates that these 15 genomes are misannotated in this region. Unfortunately, the PGG-based annotation pipeline enforces consistency of annotation but not correctness of annotation. The original RefSeq annotation had OGCs 4740 and 4741 for 24 and 23 strains respectively, while the tagF ortholog for the same region was annotated for 15 strains. The PGG refinement process will choose between competing OGC annotation for a region based on predominance which resulted in OGCs 4740 and 4741 replacing the tagF ortholog annotation for those 15 strains. This results in 39 genomes being annotated with OGCs 4740 and 4741: 15 from clade I which do not have an alternative tagF ortholog and 24 from clade VI which do have an alternative tagF ortholog (OGC 4746). In these 24 clade VI strains the entire tagF ortholog (1185 bp) from the 15 clade I strains is not present but only the C terminal region (bps 654-1185) which contains OGCs 4740 and 4741 matches with 93% sequence identity. One possibility is that this region was transferred during homologous recombination but due to the presence of another tagF gene in the genome this particular tagF devolved in these 24 strains leaving two small open reading frames annotated as hypothetical proteins. For the clades II and III genomes no obvious tagF ortholog was found using tblastn but significant hits to OGC 5437 (tarL) had low homology and significant but lower homology was seen for several other WTA cassette OGCs. It is not clear whether one or more of these homologs may be providing the needed tagF functionality. For the one clade VII strain the entire core WTA machinery is missing (tag/tarBADF). Based on the literature one would expect this to be an engineered strain with limited viability and abnormal cell morphology but it is a soil isolate with normal cell morphology. 49 It is an interesting question whether the one clade VII strain is deficient in WTAs but compensating with LTAs.
To better understand the homologous recombination occurring in the WTA cassettes, we extracted the nucleotide sequences for all 108 cassettes, aligned them, trimmed them to remove columns predominated by gaps, and generated a maximum likelihood tree (Figure 4). Since clade V strains cluster on one branch of the tree with a much higher bootstrap value than any other branch, they clearly have a very diverged cassette from other strains. The intermixing of other clades in the remainder  The WTA gene cassette is highly variable in B. subtilis ssp. subtilis, much more so than other essential or core genes. 6 This does not appear to be an ancient evolutionary split of strain subtypes, although inheritance is an obvious component, but rather the result of homologous recombination of WTAs as an entire cassette or just regions within the cassette. This method of rapid evolution is often in response to environmental pressure (e.g. phage infection). The WTAs exposed on the cell surface have been shown to be potential phage targets and variation in the WTAs can determine phage susceptibility [50][51][52] and therefore strain fitness. Likewise, antibiotic susceptibility has been shown to be a driving factor in strain fitness in the wild and WTA biosynthetic proteins are the targets of some antibiotics and further study of these novel WTA OGCs may lead to novel antibiotics. [53][54][55][56][57] WTAs can also modulate specificity of species and strains that can effectively engage in horizontal gene transfer and thereby acquire antibiotic resistance or virulence traits. 58 WTAs are also involved in host immune system evasion. [59][60][61][62] The high rate of recombination and genetic tractability in B. subtilis make it a model organism for biological engineering. Discovery of novel core OGCs using a pan-genome approach can help identify genomic regions capable of excluding gene insertions or non-essential OGCs ready for deletion. The high degree of essential gene conservation in the WTA cassette suggests that they might not be easily deleted. The high level of variation in the WTA cassette is intriguing and a possible target for engineering since it appears to be under adaptive pressure. By naturally rearranging the WTA cassette B. subtilis may be able to occupy new niches where acquisition of WTA genes not susceptible to certain antibiotics or phages is advantageous. Conversely, biological engineers might be able to recombine genes conferring antibiotic susceptibility to expand the number of usable antibiotic genes required for manipulating multiple endogenous loci concurrently.

Conclusion
Reduced sequence conservation of the WTA cassettes in the seven clades we determined may indicate a modified function like the previously documented WTA ribitol/glycerol variation found in two of those clades. This WTA variation poses a number of questions about function, response to environmental pressure, and potential engineering targets as discussed above. An improved understanding of high-frequency recombination of WTA gene cassettes has ramifications for synthetic biology and the use of B. subtilis in industry. This project contains the following underlying data:

Data availability
• Table 1. The protein level orthologous OGCs within the WTA cassettes. Column 1 is the gene name/symbol. Column 2 is the set of OGCs determined to be orthologs at the protein level. Column 3 is the number of the 108 strains in the PGG which contain one of the protein level orthologs. Column 4 is OGC medoid sequence RefSeq annotation for one of the protein level orthologs.
• Table 2. OGC subpatterns for the WTA cassettes across clades I-VII. The OGC subpatterns show some limited recombination within the WTA cassettes but most recombination seems limited to the entire cassette. Column 1 is the region between core OGCs within the WTA cassette. Column 2 is an OGC subpattern. Columns 3-9 indicate the number of strains within a clade that has the given OGC subpattern for that row. The rows are ordered relative to their order in the WTA cassette from core OGC 3712 to core OGC 3756. This project contains the following extended data: • Supplementary Table 1 • Supplementary Table 2 • Supplementary Table 3 • Supplementary Table 4 • Supplementary Table 5 • Supplementary Table 6 • Supplementary Table 7 • Supplementary Table 8 • Supplementary Table 9 Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC BY 4.0).

Open Peer Review
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com