A bioinformatics insight to rhizobial globins: gene identification and mapping, polypeptide sequence and phenetic analysis, and protein modeling.

Globins (Glbs) are proteins widely distributed in organisms. Three evolutionary families have been identified in Glbs: the M, S and T Glb families. The M Glbs include flavohemoglobins (fHbs) and single-domain Glbs (SDgbs); the S Glbs include globin-coupled sensors (GCSs), protoglobins and sensor single domain globins, and the T Glbs include truncated Glbs (tHbs). Structurally, the M and S Glbs exhibit 3/3-folding whereas the T Glbs exhibit 2/2-folding. Glbs are widespread in bacteria, including several rhizobial genomes. However, only few rhizobial Glbs have been characterized. Hence, we characterized Glbs from 62 rhizobial genomes using bioinformatics methods such as data mining in databases, sequence alignment, phenogram construction and protein modeling. Also, we analyzed soluble extracts from Bradyrhizobium japonicum USDA38 and USDA58 by (reduced + carbon monoxide (CO) minus reduced) differential spectroscopy. Database searching showed that only fhb, sdgb, gcs and thb genes exist in the rhizobia analyzed in this work. Promoter analysis revealed that apparently several rhizobial glb genes are not regulated by a -10 promoter but might be regulated by -35 and Fnr (fumarate-nitrate reduction regulator)-like promoters. Mapping analysis revealed that rhizobial fhbs and thbs are flanked by a variety of genes whereas several rhizobial sdgbs and gcss are flanked by genes coding for proteins involved in the metabolism of nitrates and nitrites and chemotaxis, respectively. Phenetic analysis showed that rhizobial Glbs segregate into the M, S and T Glb families, while structural analysis showed that predicted rhizobial SDgbs and fHbs and GCSs globin domain and tHbs fold into the 3/3- and 2/2-folding, respectively. Spectra from B. japonicum USDA38 and USDA58 soluble extracts exhibited peaks and troughs characteristic of bacterial and vertebrate Glbs thus indicating that putative Glbs are synthesized in B. japonicum USDA38 and USDA58.


Introduction
Structurally, Glbs fold into a tertiary structure known as the globin fold. This fold consists of six to eight α-helices (designated with the letters A to H) that form a hydrophobic pocket where a heme prosthetic group is located 2 . Two structural types of the globin fold have been identified in Glbs: the 2/2- and 3/3-folds. In the 2/2-Glbs, helices B and E overlap helices G and H 3 , and in the 3/3-Glbs helices A, E and F overlap helices B, G and H 4,5 . Likewise, three evolutionary families have been identified in Glbs 6,7 : the M, S and T Glb families. The M Glbs include flavohemoglobins (fHbs) and single-domain Glbs (SDgbs); the S Glbs include globin-coupled sensors (GCSs), protoglobins and sensor single domain globins, and the T Glbs include truncated Glbs (tHbs), which are further classified into class 1, class 2 and class 3 tHbs. Canonical tHbs are ~20 to 40 amino acids shorter than the globin fold, resulting in an almost absent helix A and a helix F that is reduced to a single turn 8,9 . The M and S Glbs fold into the 3/3-fold whereas the T Glbs fold into the 2/2-fold.
18,19 . A plant Glb (leghemoglobin (Lb)) that is synthesized at high (~3 to 5 mM) concentrations in nodules apparently facilitates O2 diffusion to the symbiotic rhizobia and maintains low (submicromolar) concentrations of O2 within nodules. This is essential for sustaining the (micro)aerobic respiration of symbiotic rhizobia and for preventing the inactivation by O2 of nitrogenase (which fixes atmospheric N2 into NH4+). The binding and metabolism of NO by Lb and other Glbs is also well documented 11,20 . Thus, a likely function for Lb in nodules is to detoxify the NO that is generated during plant infection by rhizobia 21 . However, little is known about the properties and functions of Glbs within either symbiotic or free-living rhizobia.
Forty-six years ago Appleby 22 was the first to propose the existence of Glbs in rhizobia. This author detected absorption peaks and troughs that are characteristic of Glbs in differential (dithionite reduced + CO minus dithionite reduced) spectra of soluble extracts from Bradyrhizobium japonicum 505 (Wisconsin). Subsequent spectroscopic analyses suggested the existence of soluble Glbs in Rhizobium leguminosarum bv. viciae 23 , B. japonicum NPK63 24 and R. etli CE3 25 . The first rhizobial glb gene was identified in the pSymA megaplasmid of Sinorhizobium meliloti 1021 26 . BLAST analysis revealed that this gene corresponded to an fhb gene, and it was thus named smfhb. A bioinformatics analysis showed that smfhb is flanked by nos and fix genes (which code for denitrification enzymes, and for high O2-affinity terminal oxidases and an O2 sensor, respectively) and that it is apparently regulated by an Fnr-like promoter. These observations suggested that smfhb is regulated by the concentration of O2 and that SmfHb functions in some aspects of nitrogen metabolism. A transcriptomic analysis of the S. meliloti response to NO in culture showed that smfhb (also designated as S. meliloti hmp) is upregulated by NO, and a smfhb mutant exhibited high sensitivity to NO in culture and a reduced N2-fixation efficiency in planta. These observations suggested that SmfHb functions in some aspects of NO metabolism, possibly by detoxifying NO 27 .
Genomic analysis reported by Vinogradov et al. 7 revealed that Glb sequences exist in several rhizobia. However, in spite of the above reports, knowledge of rhizobial Glbs remains quite limited. Hence, in order to obtain information on the properties of rhizobial Glbs, we characterized Glb sequences from selected rhizobial genomes using bioinformatics methods. These included gene characterization, polypeptide sequence and phenetic analysis, and protein modeling. We also analyzed soluble extracts from B. japonicum USDA38 and USDA58 by differential spectroscopy. Our main results showed that only fhb, sdgb, gcs and thb genes exist in the rhizobia analyzed in this work; that several rhizobial glb genes are not regulated by a -10 promoter but might be regulated by -35 and Fnr-like promoters; that rhizobial fhbs and thbs are flanked by a variety of genes whereas several rhizobial sdgbs and gcss are flanked by genes coding for proteins involved in the metabolism of nitrates and nitrites and in chemotaxis, respectively; that rhizobial Glbs segregate into the M, S and T Glb families; that predicted rhizobial SDgbs, fHbs and the GCS globin domain fold into the 3/3-fold whereas predicted rhizobial tHbs fold into the 2/2-fold; and that spectra from B. japonicum USDA38 and USDA58 soluble extracts exhibit peaks and troughs characteristic of bacterial and vertebrate Glbs.

Methods
Database search
Putative Glb sequences and Glb domains were identified in databases (Table S1) containing the genomes of rhizobial species and strains. The query sequences were S. meliloti fHb; Vitreoscilla SDgb; Agrobacterium tumefaciens GCS; Methanosarcina acetivorans protoglobin; Methylacidiphilum infernorum sensor single domain globin; Mycobacterium tuberculosis tHb class 1; A. tumefaciens tHb class 2, and M. avium tHb class 3 (GenBank accession numbers AY328026, AAA75506, NP_354049, 2VEB_A, YP_001939425, NP_216058, WP_020813663 and BAN32501, respectively). The SUPERFAMILY database (http://supfam.mrc-lmb.cam.ac.uk) 28 was also used. Resulting sequences were subjected to a FUGUE analysis (http://tardis.nibio.go.jp/fugue/prfsearch.html) 29 to determine the most similar Glb structure and the presence of a proximal H at the myoglobin-fold position F8. Putative Glbs had to satisfy the following criteria: a length of approximately 100 amino acids or more, a FUGUE Z score higher than 6 (which corresponds to 99% specificity 29 ) against known Glb structures, and the presence of a proximal H at position F8.
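The three acceptance criteria above can be expressed as a simple filter. The following is a minimal sketch, not the procedure actually used in this work: the record fields, the candidate names and the exact length cutoff (the text gives only "~100 amino acids") are assumptions made for illustration, and the values are invented.

```python
from dataclasses import dataclass

# Assumed cutoffs: the text states "~100 amino acids" and "Z score higher than 6".
MIN_LENGTH = 100
MIN_FUGUE_Z = 6.0


@dataclass
class Candidate:
    name: str         # hypothetical identifier, not a real database entry
    length: int       # polypeptide length in amino acids
    fugue_z: float    # best FUGUE Z score against a known Glb structure
    f8_residue: str   # one-letter code found at myoglobin-fold position F8


def is_putative_glb(c: Candidate) -> bool:
    """Apply the three criteria literally: length, FUGUE Z score, proximal H at F8."""
    return c.length >= MIN_LENGTH and c.fugue_z > MIN_FUGUE_Z and c.f8_residue == "H"


# Invented example records, one failing each criterion in turn.
candidates = [
    Candidate("hit_A", 142, 18.3, "H"),   # passes all three criteria
    Candidate("hit_B", 88, 9.1, "H"),     # too short
    Candidate("hit_C", 150, 4.2, "H"),    # Z score too low
    Candidate("hit_D", 460, 12.0, "E"),   # no proximal H at F8
]
accepted = [c.name for c in candidates if is_putative_glb(c)]
print(accepted)
```

Because the F8 residue is read from an alignment, a filter of this kind is only as reliable as the alignment that assigns position F8.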
Gene mapping and detection of promoter sequences
Scaffolds containing copies of the glb gene were used for mapping glbs. Mapping included the detection of open reading frames (ORFs) within ~5 kb up- and downstream of glbs, and the determination of ORF length, transcription direction and localization in the +/- strand. Canonical (-10 and -35) and Fnr 30 promoter sequences and Shine-Dalgarno sequences were searched within 130 nucleotides upstream of the rhizobial glb genes, either by using the search tool of MS Word ® or by pairwise sequence alignments using the ClustalX program (http://www.clustal.org/clustal2/) 31 .
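The motif search described above can be sketched programmatically as a sliding-window comparison against a consensus. This is an illustrative alternative to the text-search and alignment tools used in this work. The consensus strings below (the E. coli σ70 -35 and -10 hexamers and the E. coli FNR box) are assumptions; the Fnr sequences actually used are those of reference 30. The upstream sequence is synthetic.

```python
# Assumed consensus motifs; N matches any base.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"
FNR_BOX = "TTGATNNNNATCAA"


def find_motif(upstream, consensus, max_mismatches=1, window=130):
    """Scan the last `window` nucleotides before the start codon.

    Returns a list of (offset, site, mismatches); offset is the position of the
    first base of the site relative to the first base of the gene (negative).
    """
    region = upstream.upper()[-window:]
    hits = []
    for i in range(len(region) - len(consensus) + 1):
        site = region[i:i + len(consensus)]
        mismatches = sum(1 for s, c in zip(site, consensus) if c != "N" and s != c)
        if mismatches <= max_mismatches:
            hits.append((i - len(region), site, mismatches))
    return hits


# Synthetic 57-nt upstream region containing one copy of each motif.
upstream = ("GGCC" + "TTGATCGCGATCAA" + "CCGG" + "TTGACA"
            + "GC" * 8 + "G" + "TATAAT" + "GCGCGC")

for label, motif in (("-35", MINUS_35), ("-10", MINUS_10), ("Fnr", FNR_BOX)):
    print(label, find_motif(upstream, motif, max_mismatches=0))
```

Raising `max_mismatches` makes the search more permissive, at the cost of more chance matches in a region of only 130 nucleotides.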

Protein sequence alignments and phenetic analysis
Pairwise and multiple sequence alignments were performed using the ClustalX program 31 . Multiple sequence alignment was manually verified using the procedure described by Kapp et al. 32 based on the myoglobin-fold 33 . A phenogram was constructed from the aligned sequences using the UPGMA method from the ClustalX program. The resulting phenogram was edited using the iTOL program (http://itol.embl.de/) 34 .
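For readers unfamiliar with UPGMA, the clustering step can be sketched as follows. This is a minimal pure-Python illustration of the general UPGMA procedure, not the ClustalX implementation; the sequence names and pairwise distances are invented, and branch lengths are omitted so that only the topology is returned.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA topology as nested tuples.

    `distances` maps (name_i, name_j), with name_i before name_j in `names`,
    to a pairwise distance (for example 1 - fractional identity).
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {frozenset((i, j)): distances[(names[i], names[j])]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest average distance.
        i, j = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the new cluster to each remaining one: size-weighted mean.
        for k in clusters:
            d[frozenset((next_id, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Invented distances for four hypothetical sequences.
names = ["A", "B", "C", "D"]
distances = {("A", "B"): 0.1, ("A", "C"): 0.6, ("A", "D"): 0.7,
             ("B", "C"): 0.6, ("B", "D"): 0.7, ("C", "D"): 0.2}
tree = upgma(names, distances)
print(tree)
```

UPGMA groups sequences by overall similarity, which is why the resulting tree is referred to here as a phenogram.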
Modeling and analysis of the predicted proteins tertiary structure
The tertiary structure of rhizobial Glbs was modeled using the automated mode of the I-TASSER server (http://zhanglab.ccmb.med.umich.edu/I-TASSER/) 35-37 , which also provided the best structural homologs to the query sequences. Models were edited using the VMD program (http://www.ks.uiuc.edu/Research/vmd/) 38 and Adobe Photoshop ® software. Distances and dihedral angles of amino acids at the heme prosthetic group were calculated using the distance and dihedral tools of the SwissPDBViewer program (http://spdbv.vital-it.ch/) as described by Gopalasubramaniam et al. 39 and Sáenz-Rivera et al. 40 , respectively.
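The two geometric quantities mentioned above can be computed directly from atomic coordinates. The sketch below is a generic illustration and not the SwissPDBViewer implementation; the coordinates are invented, and which atoms define each distance and dihedral angle in references 39 and 40 is not specified here, so the atom choice is left to the user.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def distance(p, q):
    """Euclidean distance between two atoms (same unit as the coordinates)."""
    return math.dist(p, q)


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four atoms."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n2 = cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


# Invented coordinates (in Å): a heme Fe and a coordinating atom 2.1 Å away.
fe = (0.0, 0.0, 0.0)
ligand_atom = (0.0, 0.0, 2.1)
print(round(distance(fe, ligand_atom), 2))

# Four invented points whose dihedral angle is a right angle.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```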
Bacterial growth, cell rupture and spectral analysis
Bradyrhizobium japonicum USDA38 and USDA58 were kindly provided by Drs. Donald Keister and Douglas Jones (United States Department of Agriculture, USA). All reagents were purchased from Sigma-Aldrich (St. Louis, MO, USA). B. japonicum cells were grown in YM (Yeast Mannitol) broth (per 100 ml: KH2PO4, 50 mg; MgSO4, 20 mg; NaCl, 10 mg; mannitol, 1 g; yeast extract, 50 mg; pH 7.0) for 3 to 5 days at 30°C with shaking at 200 rpm. Cells were harvested by centrifugation at 11,000 × g, and pellets were resuspended in 50 mM Na-phosphate buffer (pH 7.2) containing 1 mM EDTA and 1 mM phenylmethylsulfonyl fluoride (PMSF). Cells were disrupted by sonication at maximum power (three cycles of 1 min each on ice), followed by the addition of DNase I (40 U/ml), RNase A (3 U/ml) and lysozyme (2 mg/ml) and overnight incubation at 4°C with gentle agitation. The resulting solution was cleared by centrifugation at 22,000 × g for 40 min at 4°C, and the supernatant was fractionated with solid ammonium sulphate between 35 and 65% saturation. The resulting pellet was resuspended in 5 ml of 50 mM Na-phosphate buffer (pH 7.2) containing 1 mM EDTA and 1 mM PMSF and dialyzed for 18 h against the same buffer to remove excess salts. Aliquots (0.5 to 1 ml) of the dialyzed solution were used to obtain the dithionite reduced + CO minus dithionite reduced differential spectra in a Beckman DU6 spectrophotometer. Control spectra were obtained from commercial (Sigma-Aldrich) preparations of sperm whale myoglobin and bovine blood hemoglobin.

Dataset 1.
Globin nomenclature corresponds to the first three letters of the genus and of the species names followed by the strain name, globin type and gene copy number. URLs indicate links to individual glb gene sequences 56 .

Dataset 2. Predicted
Globin nomenclature corresponds to the first three letters of the genus and of the species names followed by the strain name, globin type and globin copy number. URLs indicate links to individual Glb polypeptide sequences 57 .

Dataset 3. Distance to the heme Fe and orientation of distal, proximal, B10 and CD1 amino acids in the predicted structure of selected rhizobial Glbs (Table S2)
http://dx.doi.org/10.5256/f1000research.6392.d46191
Structural homologs (including the PDB ID number), the corresponding amino acids from the structural homologs and the values for these amino acids are indicated in parentheses for comparison with individual rhizobial Glbs 58 .

Results and discussion
Detection of Glb sequences in the genomes of α- and β-rhizobia
Recently, Vinogradov et al. 7 reported that Glb sequences exist in the genomes of 96 rhizobia. However, this report did not provide the rhizobial Glb sequences or links to rhizobial scaffolds containing the Glb sequences. Hence, we searched in databases (see the Methods section and Table S1) in order to obtain rhizobial Glb sequences for analysis. We selected 62 out of the 96 rhizobial genomes reported by the above authors, representing the major rhizobial genera, species and strains and including α- and β-rhizobia (i.e. those classified within the α- and β-proteobacteria, respectively). A total of 197 glb sequences were detected in the 62 rhizobial genomes, corresponding to 7 fhbs, 47 sdgbs, 40 gcss and 103 thbs (4 thbs class 1, 56 thbs class 2 and 43 thbs class 3). Individual Glb nucleotide and polypeptide sequences and links to rhizobial scaffolds containing the Glb sequences are provided in Dataset 1 and Dataset 2, respectively. All the rhizobial genomes analyzed in this work contained glb sequences, thus indicating that glbs are widespread in rhizobia. However, protoglobin and sensor single domain globin sequences were not detected in the rhizobial genomes. This observation indicates that apparently only the fhb, sdgb, gcs and thb lineages evolved within rhizobia.
A distribution analysis showed that most (61) of the rhizobial genomes analyzed in this work contain thbs, either alone (13) or in combination with fhbs, sdgbs and/or gcss (48). Furthermore, one rhizobial genome contained only a gcs, and none contained only fhbs, only sdgbs, or the combinations fhbs + sdgbs, fhbs + gcss or sdgbs + gcss without thbs (Figure 1). These observations indicate that in the rhizobia analyzed in this work thbs predominate over other glbs, and that in these bacteria fhbs, sdgbs and gcss mostly exist in combination with thbs. Also, analysis of the glb copy number showed that in the rhizobia analyzed in this work fhbs mostly exist as a single copy (ranging from one to two copies), sdgbs mostly exist as two copies (ranging from one to four copies), gcss exist as either one or two copies, and thbs mostly exist as two copies (ranging from one to three copies), although quite a few thbs exist as a single copy (Table 1). Thus, apparently rhizobial glbs mostly exist as either one or two copies.
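The counts reported in this subsection can be cross-checked with a short tally. The snippet below only restates the numbers given in the text; the dictionary keys are labels chosen for illustration.

```python
# Counts as reported in the text.
glb_counts = {"fhb": 7, "sdgb": 47, "gcs": 40, "thb": 103}
thb_classes = {"class 1": 4, "class 2": 56, "class 3": 43}
genome_groups = {"thb only": 13, "thb plus other glbs": 48, "gcs only": 1}

total_glbs = sum(glb_counts.values())
total_thbs = sum(thb_classes.values())
total_genomes = sum(genome_groups.values())
genomes_with_thb = genome_groups["thb only"] + genome_groups["thb plus other glbs"]

print(total_glbs, total_thbs, total_genomes, genomes_with_thb)
print(f"thbs account for {100 * glb_counts['thb'] / total_glbs:.1f}% of the glb sequences")
```

The totals agree with the 197 glb sequences, 103 thbs and 62 genomes stated above, with thbs present in 61 of the 62 genomes.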

Mapping of glb genes in the rhizobial genomes
The glb genes detected in this work were mapped within the rhizobial genomes in order to identify genes that flank glbs and could be coexpressed with them. Mapping analysis showed that rhizobial glb copies are located in different scaffolds and that they are not tandemly arrayed. Figure S1A shows that either no ORFs or ORFs coding for hypothetical or non-identified proteins are located near most of the rhizobial fhb genes. However, genes coding for the transcriptional regulator NsrR, 2-nitropropane dioxygenase and NosR, Z, D, F, Y and X are located near cupnecN1fhb1, rhilegUPM1137fhb and sinmel1021fhb, respectively. Figure S1B shows that B. elkanii and B. japonicum sdgbs are mostly flanked by genes coding for proteins that function in nitrate/nitrite metabolism and sugar transport. Figure S1C shows that genes coding for proteins that function in chemotaxis are located near several rhizobial gcss, although genes coding for a peptide deformylase, sugar and nitrate transport proteins and NAD(P)H nitrate reductase are located near some other rhizobial gcss. Figure S1D shows that genes flanking the rhizobial thbs are rather variable. However, B. japonicum thbs are often flanked by genes coding for the transcriptional regulator Rieske Fe-S, shikimate kinase and alcohol dehydrogenase; mesorhizobial thbs are often flanked by genes coding for permeases and tRNA-Trp, and R. leguminosarum thbs are often flanked by genes coding for membrane proteins. Thus, if glbs and their flanking genes are coexpressed in rhizobia, and the proteins coded by these genes function within the same metabolic pathways, the above observations suggest that rhizobial Glbs could play a variety of roles in rhizobial physiology, including nitrate/nitrite metabolism, transport processes, gene regulation and chemotaxis.
Interestingly, with the exception of sinmel1021fhb, which is flanked by nos and fix genes (Figure S1A) 26 , nif and fix genes coding for proteins that function in N2-fixation were not detected near the rhizobial glb genes. This observation suggests that rhizobial Glbs might not directly function in N2-fixation.
Detection of promoter sequences upstream to the rhizobial glb genes
Identification of promoter sequences is crucial to an understanding of gene regulation and, ultimately, of protein function within the cell's physiology. Hence, we searched for canonical (-10 and -35) promoters and the O2- and NO-regulated Fnr promoter 30,41,42 within 130 nucleotides upstream of 44 selected rhizobial glb genes (i.e. those representative of the major rhizobial Glb clades identified in this work (see Figure 2)). We also searched within the same region for Shine-Dalgarno sequences, which indicate that Glb transcripts could be translated into proteins. Results showed that, with the exception of burphySTM815thb1, burphySTM815thb2 and rhilupHPC(L)thb1, a -10 promoter is absent upstream of the selected rhizobial glbs. In contrast, with the exception of cupnecN1thb1 and rhilupHPC(L)thb2, a -35 promoter exists upstream of the selected rhizobial glbs. Searching for Fnr promoter sequences revealed that Fnr-like promoters exist upstream of 30 out of the 44 selected rhizobial glbs, including fhb, sdgb, gcs and thb genes. A Shine-Dalgarno sequence was detected upstream of most of the selected rhizobial glbs (Table 2). These observations suggest that the -35 promoter is a major canonical promoter that regulates most of the rhizobial glbs, that several rhizobial glbs are likely regulated by levels of O2 and NO through an Fnr mechanism 41-44 , and that rhizobial Glb transcripts are translated into proteins.

Sequence alignments and phenetic analysis of rhizobial Glbs
Pairwise sequence alignments showed that the rhizobial fHbs, SDgbs, GCSs and tHbs analyzed in this work are 34.6 to 85.4%, 6.7 to 100%, 10.9 to 100% and 3.5 to 100% identical, respectively. This indicates that variability among the rhizobial Glb sequences is high. Moreover, identity values for the fHb globin and flavin domains were 39.1 to 93.7% and 26.5 to 81.1%, respectively, and identity values for the GCS globin and transmitter domains were 17.5 to 100% and 5.9 to 100%, respectively. Thus, apparently in the rhizobial fHbs and GCSs analyzed in this work the globin domain is more conserved than the flavin and transmitter domains.
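Identity values such as those above are derived from aligned sequence pairs. The following minimal sketch shows one common way to compute percent identity; the choice to ignore columns containing a gap in either sequence is an assumption (alignment programs differ in the denominator they use, and the convention applied by ClustalX is not stated here), and the two aligned sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


# Invented pair of aligned sequences: 5 ungapped columns, 4 of them identical.
print(percent_identity("MKT-LLA", "MRTALL-"))
```

With a gap-inclusive denominator (alignment length, 7 columns here) the same pair would score lower, which is one reason identity ranges from different tools are not directly comparable.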
Multiple sequence alignment showed that, with the exception of 21 GCSs, in the rhizobial Glbs analyzed in this work the proximal (F8, located at position 322/323 in Figure S2) amino acid to the heme Fe is H. Apparently, in these 21 rhizobial GCSs, F8 is E. Amino acids other than H occupying the F8 position in bacterial Glbs were previously reported by Vinogradov et al. 7 . However, because H F8 is absolutely conserved in Glbs (i.e. from bacteria to mammals) 1,32,45-47 , assigning E F8 to rhizobial (and other bacterial) GCSs should be taken with caution, as this assignment might result from a sequence alignment artifact. Ideally, F8 from rhizobial GCSs should be identified by experimental methods, such as x-ray crystallography. Multiple sequence alignment also showed that in the rhizobial Glbs analyzed in this work the distal (E7, located at position 285/289/290 in Figure S2) amino acid to the heme Fe is Q in fHbs, can be Q/R/K/M/L in SDgbs, is Q in GCSs and can be H/F/L/V/R in tHbs. This indicates that distal Q is conserved in rhizobial fHbs and GCSs and that the amino acids occupying the distal position in rhizobial SDgbs and tHbs are variable. The B10 and CD1 amino acids (located at positions 257 and 270/271/273 in Figure S2, respectively), which also participate in the binding of ligands to the heme Fe 48-50 , are Y and F, respectively, in most of the rhizobial Glbs analyzed in this work, followed (in order of abundance) by F, S and V at B10 and by H, I, S and Y at CD1.

Figure 2. Phenogram of the rhizobial Glbs analyzed in this work, constructed from the multiple sequence alignment shown in Figure S2. The fHb, SDgb, GCS, tHb class 1, tHb class 2 and tHb class 3 clusters are indicated with light blue, dark blue, red, light green, bright green and dark green, respectively. Stars indicate Glbs selected for the detection of promoter sequences upstream to the glb genes and Glb protein modeling.
A phenogram was constructed from the above multiple sequence alignment. Figure 2 shows that the rhizobial Glbs analyzed in this work segregate into two main lineages: one containing fHbs, SDgbs and GCSs, and the other containing tHbs (the fHb/SDgb/GCS and tHb lineages, respectively). This is consistent with the main evolutionary lineages identified in bacterial Glbs 1,51,52 thus indicating that major evolutionary patterns for rhizobial Glbs were identical to those for other bacterial Glbs. Rhizobial fHbs and GCSs cluster with rhizobial SDgbs within the fHb/SDgb/GCS lineage owing to the similarity between the fHb and GCS globin domains and SDgbs. This has been postulated to be the result of an early divergence from a common ancestor to the bacterial fHb and GCS globin domains and SDgbs 1,6 . The tHb lineage segregates into rhizobial tHbs class 1, tHbs class 2 and tHbs class 3. Within this lineage the rhizobial tHbs class 3 segregate in ancestral position to the rhizobial tHbs class 1 and tHbs class 2. Also, the bradyrhizobial, azorhizobial, mesorhizobial, rhizobial and burkholderial tHbs class 3 segregate from each other; the segregation within rhizobial, sinorhizobial, mesorhizobial and β-rhizobial tHbs class 2 is rather conserved, and bradyrhizobial tHbs class 2 and class 3 segregate into the B. elkanii and B. japonicum tHb sublineages. These observations indicate that rhizobial tHbs evolved similarly to other bacterial tHbs 7,8,52 and that evolution of rhizobial tHb sublineages was rather conserved.
Modeling and analysis of the predicted rhizobial Glbs tertiary structure
Structure elucidation is essential to a full understanding of a protein's function within the cell's physiology. The structure of a considerable number of bacterial and non-bacterial Glbs has been elucidated by x-ray crystallography. However, with the exception of a S. meliloti fHb whose tertiary structure was predicted using bioinformatics methods 26 , the structure of rhizobial Glbs is not known. Hence, we used bioinformatics methods to predict and analyze the tertiary structure of 44 selected rhizobial Glbs (i.e. those representative of the major rhizobial Glb clades identified in this work (see Figure 2 and Table S2)) using the best structural homologs as templates (Dataset 3).
Predicted structures for the selected rhizobial SDgbs, fHbs and GCS globin domains fold into the 3/3-globin fold, whereas those for the selected rhizobial tHbs fold into the 2/2-globin fold (Figure 3 to Figure 8). Figure 3 shows that structures among the predicted rhizobial fHbs are highly similar. Yet major differences were detected in the BurphySTM815fHb, CupnecHPC(L)fHb and RhilegUPM1137fHb flavin domains, which exhibited two additional helices. Dataset 3 shows that among globin domains from predicted rhizobial fHbs the distance of the proximal H and distal Q to the heme Fe is 1.44 to 2.47 Å and 6.71 to 15.35 Å, respectively. This observation suggests that the heme Fe in rhizobial fHbs is pentacoordinate. Figure 4 shows that 3/3-globin folding is highly conserved in the predicted structure of the rhizobial SDgbs AzodoeUFLA1-100SDgb, BraelkUSDA3254SDgb2, BraelkUSDA3259SDgb1 and BrajapUSDA38SDgb2. Major variations to 3/3-globin folding in predicted rhizobial SDgbs consisted of an unusually short helix E in BraelkUSDA94SDgb2, a long helix H in BraelkUSDA3254SDgb1 and BrajapUSDA124SDgb1, and a pre-helix A followed by a long loop at the N-terminal of BraelkWSM1741SDgb2. Dataset 3 shows that among the predicted rhizobial SDgbs the distance of proximal H and distal Q/R/K/M to the heme Fe is 2.11 to 4.44 Å and 5.08 to 6.63 Å, respectively. This observation suggests that the heme Fe in rhizobial SDgbs is either penta- or hexacoordinate.
Only the globin domain from bacterial GCSs has been crystallized and analyzed by x-ray crystallography 53,54 (Dataset 3). The crystal structure of the bacterial GCS transmitter domain has not been elucidated. Hence, we only predicted and analyzed the tertiary structure of globin domains from the selected rhizobial GCSs. Figure 5 shows that the predicted rhizobial GCS globin domain exhibits a 1.5- to 3-turn pre-helix A, that (with the exception of SinfreGR64GCS) no loop exists between helices A and B, and that helix H is unusually long in Rhietl8C3GCS, RhietlCIAT652GCS2 and RhilegGB30GCS2. Dataset 3 shows that among the predicted rhizobial GCS globin domains the distance of proximal H/E and distal Q to the heme Fe is 1.77 to 5.56 Å and 4.09 to 9.04 Å, respectively. This observation suggests that the heme Fe in the rhizobial GCS globin domain is either penta- or hexacoordinate. Figure 6 to Figure 8 show that 2/2-globin folding is highly conserved in the predicted rhizobial tHbs class 1, class 2 and class 3. Major variations to 2/2-globin folding in predicted rhizobial tHbs consisted of a 2.5-turn pre-helix A followed by a long loop at the N-terminal of (class 1) CupnecN1tHb1 (Figure 6); a one-turn pre-helix F (designated as φ in Figure 7) 8 in the rhizobial tHbs class 2; a long and extended C-terminal region in (class 2) BraelkUSDA94tHb1 (Figure 7), and the substitution of helix A by a long loop that connects to helix B through a 1- to 2.5-turn pre-helix B in (class 3) BraelkUSDA76tHb2, BrajapUSDA123tHb1, BurphySTM815tHb1, MeslotNZP2037tHb2 and Sinmel1021tHb2 (Figure 8). Dataset 3 shows that among the predicted rhizobial tHbs the distance of proximal H and distal H/L/F to the heme Fe is 1.77 to 7.51 Å and 4.09 to 8.25 Å, respectively. This observation suggests that the heme Fe in the rhizobial tHbs is either penta- or hexacoordinate.
The above observations suggest that in spite of sequence variability (see the Sequence alignments and phenetic analysis of rhizobial Glbs subsection) the structure of rhizobial Glbs is similar to the canonical 3/3- or 2/2-globin folding of bacterial and non-bacterial Glbs. However, a number of predicted rhizobial Glbs exhibited variations at the N- and C-terminal regions, suggesting that their structural properties could differ from those of canonical Glbs.

Conclusions
Rhizobial Glbs have been poorly studied. However, results reported in this work provide molecular and biochemical data from a bioinformatics perspective that contribute to a better understanding of these proteins. For example, the distribution and outline for the evolution of glb genes and Glb proteins among rhizobia was clarified, genes that could coexpress with the rhizobial glbs were identified and the predicted tertiary structure for rhizobial Glbs was elucidated. Also, spectroscopic analysis suggested that soluble Glbs are synthesized in free-living B. japonicum USDA38 and USDA58. This information will be useful in designing future experimental work focused on clarifying Glb functions within the physiology of free-living and symbiotic rhizobia.