Keywords
Comparative genomics, oxygen requirements, prediction, Bayesian inference
Comparative genomics, oxygen requirements, prediction, Bayesian inference
Identification of microbial organisms with specific habitat adaptations is important for a range of purposes, such as specifying organisms as likely producers of industrially or scientifically relevant enzymes. An easy-to-make prediction of adaptations to specific habitats based on genome sequences, independent of time consuming laboratory tests, would therefore be of value to researchers for narrowing down a list of potential organisms of interest for their particular purpose. In addition, a list of genomic features that effectively predicts the environmental preference of a group of organisms would aid scientific researchers in gaining a mechanistic understanding of the requirements a given environment imposes on its microbial inhabitants.
To demonstrate a method for making such predictions, this study aims to predict bacterial oxygen requirements. This choice was made in part because prediction and description of genomic characteristics relevant for oxygen requirements are relatively absent in the literature, in spite of a many characterized genomes available. Furthermore, when prediction of oxygen requirement has been attempted in the literature, the authors generally invoke the false dichotomy of a bacterium being either an aerobe or anaerobe1,2. Similar unfairly dichotic approaches are often seen with respect to other habitat classifications, e.g. salinity and thermophilicity3–5. In contrast, this study aims to distinguish between three different classifications: aerobe, anaerobe and facultative anaerobe. These oxygen requirement classifications can be found at the NCBI list of sequenced genomes (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi) and have simple and specific definitions. Obligate aerobes are organisms that require the presence of oxygen for respiration, while the presence of oxygen is detrimental to the growth of obligate anaerobes. Non-obligate, or aero tolerant, anaerobes may grow in the presence of oxygen, but are unable to use it in respiration. In this study, we did not distinguish between these two types of anaerobes. Facultative anaerobic bacteria can use oxygen for respiration, but will also grow in the absence of oxygen, although typically more slowly6.
Specific living-conditions will naturally impose selective pressures on the optimal set of protein functions, and a sensible basis for prediction would thus be the genomic make-up with respect to an organism protein domain profile. This idea has been the basis of a number relatively successful attempts at predicting different types of habitat adaptations1.
For the purpose of classification prediction, this study implements a naive Bayesian classifier. This is a relatively simple method, but it has in the past been shown to be effective prediction tool in a vast range of areas, including bacterial thermophilicity prediction7,4, genetic risk factors for disease8,9 and taxonomic classification of fungi10.
The genomes included in this study were selected from the NCBI genome database based on the oxygen requirement classifications in the NCBI Iproks table (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). To avoid overestimation of the predictive performance, only one member of each genus was randomly selected to be included within each classification. Thus the overall dataset configuration was as show in Table 5.
The included genomes where translated to predicted proteomes using the Prodigal tool11 with default settings. The predicted proteomes were searched for the presence of the protein domain Pfam-A12. This search was performed using hmmscan3 with default settings, a tool which is part of the HMMR3 package13. The presence or absence of all Pfam-A domains found in the sum of proteomes was stored in a presence/absence matrix (Additional file 6). Based on this matrix, Pfam-A domains overrepresented in any one specific class were identified. Similarly to a previous study7, overrepresentation is here defined as the domain being present in at least 65% of the members of a given class, and that the frequency in that class is significantly (p < 0.05) different from the frequency in all other classes, given a two-tailed independent t-test. In this manner, a list of class-associated Pfam-A domains, along with their frequency of occurring in each of the three classifications, was created. This list contained the observed likelihood of a given Pfam-A domain being present, given the classification, and will be referred to as the ‘likelihood file’. The script used to construct the model can be found in Additional file 7. All scripts can additionally be found at https://github.com/danbjensen/Oxygen_requirement_prediction. The scripts are also permanently available at http://dx.doi.org/10.5281/zenodo.7099.
Predictions were based on the above described likelihood file. A flat prior was used, meaning that initially the probability of an arbitrary genome being any of the three classifications was considered to be 1/NClasses ~ 0.333. If a given domain was found to be present in a given genome, the probability of that genome belonging to each of the included classifications was updated by a factor of the observed likelihood for the individual groups, p(family|class). If the family or domain was found not to be present, the probability was updated by a factor of 1-p(family|class). The posterior probability for a given genome belonging to the various classifications, C, given the observed presence or absence of a specific domain, Oi, was then calculated using Bayes rule:
A pseudo-count (PC) of 0.1 was used for all likelihoods to prevent the probabilities from plummeting to zero. The three included classifications of oxygen requirements were predicted using a one-step and two-step naive Bayesian inference network, as illustrated in Figure 1.
a) In the one-step prediction method the genomes in the test set are assigned a posterior probability for each of the three included classifications, given their protein domain profile. The genomes are predicted to belong to the classification to which they have the highest posterior probability. b) The genomes in the test set are first assigned posterior probabilities for being able or unable to respire, based on their protein domain profile. Using a second model, those genomes found most likely to be capable of respiration are assigned a posterior probability of belonging to the classifications Aerobe or Facultative.
In the one-step approach, each genome is assigned the single classification it is considered most likely to have, based on its protein domain profile. The likelihood file is based on a training matrix containing the protein domain profiles of all genomes with their associated classification, with the exception of the genome being predicted (N-fold cross-validation).
In the two-step approach, every genome is first predicted to be able or unable to use oxygen for respiration. This is done based on a training matrix, containing every included genome, marked as either respiration-capable (aerobe and facultative anaerobe) or not respiration-capable (anaerobe). The genome being predicted is excluded from the training matrix. Any genome predicted to not be respiration-capable is considered to be a predicted anaerobe, while genomes predicted to be respiration-capable go through a second round of predictions. These predictions are based on the likelihood files derived from a matrix containing the protein domain profiles of aerobes and facultative bacteria only. Based on this, every genome predicted to be respiration-capable is predicted to be either an aerobe or facultative anaerobe. If a genome under prediction is present in the training matrix, the profile of this genome is excluded from the training. The script used to make the described predictions can be found as Additional file 8.
To evaluate predictive performance, Matthew's Correlation Coefficient (MCC)14 was used. As three categories were included in this study, the predictions for each category were evaluated individually by forcing the three classes into two; the one a given genome belongs to, and every other class. This is a common method for adapting the MCC method to prediction data with more than two possible classifications15. The script used to calculate the performances can be found in Additional file 9.
To evaluate the efficiency of class-associated Pfam-A domains, i.e. Pfam-A domains found significantly more frequently in one specific oxygen requirement class compared to any other, as an input for a naive Bayesian classification of bacterial oxygen requirements, the Matthew's Correlation Coefficient (MCC)14 was used. In the context of the MCC, a value of 1 indicates perfect correlation between predicted and actual class, a value of -1 indicates a perfect anti-correlation and a value of 0 is expected when the predictions are perfectly random. Two strategies were attempted: one where prediction of all three classifications was attempted in a single step and another where a simple Bayesian network was implemented, describing the oxygen requirement classifications as two nested dichotomies.
One-step predictions. Table 1 shows the predictive performance achieved when the three classifications are predicted in a single step, based on the relative abundance of the various Pfam-A domains in the different classes. The performance is clearly best for the prediction of aerobes and anaerobes, which perform with an MCC value well above 0, although not fully 1. The performance for predicting facultative anaerobes, although higher than 0, are not satisfyingly above what one might expect from random clustering of the data to be considered truly meaningful. The exact predictions and correct classification of the individual genomes are listed in Additional file 1.
Predictions of all classes are performed better than random chance, although aerobe and anaerobe bacteria clearly show the best performance compared to facultative anaerobe bacteria.
Classification | Predictive performance (MCC) |
---|---|
Aerobe | 0.63 |
Anaerobe | 0.76 |
Facultative | 0.31 |
To further examine the conditions behind the above performances, Table 2 shows the distribution of class-predictions for genomes of each of the three actual classifications. As can be seen, the vast majority of aerobe and anaerobe genomes are predicted correctly. For the facultative subsection of the dataset, however, many genomes are erroneously predicted to be aerobes. By contrast, the rate of erroneous prediction of genomes in these an aerobe or facultative anaerobe of being an anaerobe is rather low. This explains why the prediction performance of aerobe genomes appeared lower than for anaerobes; the false positive value in the MCC equation becomes larger for aerobe genomes, thus causing the overall value to drop.
Aerobe bacteria are correctly predicted to aerobe in 87% of the cases and are mis-predicted to be facultative anaerobes in 11% of the cases. Similarly anaerobe bacteria are correctly predicted in 88% of the cases, and are mis-prediction of anaerobes as aerobe or facultative anaerobes happen equally frequently, in 6% of the cases. Facultative anaerobes are most commonly mis-predicted to be aerobes, in 44% of the cases. The facultative anaerobes are only correctly predicted in 35% of the cases.
Aerobe genomes | Anaerobe genomes | Facultative genomes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Predictions | Aerobe | 137 | 87 | % | Aerobe | 6 | 6 | % | Aerobe | 43 | 44 | % |
Anaerobe | 3 | 2 | % | Anaerobe | 95 | 88 | % | Anaerobe | 21 | 21 | % | |
Facultative | 17 | 11 | % | Facultative | 7 | 6 | % | Facultative | 34 | 35 | % |
This finding arguably makes sense in light of the fact that facultative anaerobes possess the ability to respire using oxygen, which is a feature missing in strict anaerobic organisms. It thus makes sense to assume that a specific list of enzymatic characteristics are required or useful for the organism to perform respiration, which one might thus expect to find in aerobes as well as facultative anaerobes. The same characteristics would likely not be useful in anaerobe bacteria, which would result in an enzymatic profile of aerobe and facultative anaerobe bacteria, which would stand out as separate from anaerobes.
Two-step predictions. Inspired by the findings described above, a reasonable prediction strategy would be to first separate the respiration-capable organisms from the anaerobes, and subsequently attempt to further distinguish between the two kinds of respiring bacteria. Here, the initial prediction of distinguishing anaerobe from non-anaerobe bacteria is based on the Pfam-A presence/absence data, with all aerobe and facultative anaerobe bacteria simply considered as the same classification. The secondary prediction is based on the Pfam-A presence/absence data from known respiration-capable genomes only, disregarding the anaerobe portion of the dataset. This two-step approach yields the overall predictive performances shown in Table 3. The exact predictions and correct classification of the individual genomes can be found in Additional file 2. It should be noted that a considerable improvement is found in the performance of prediction of facultative anaerobes. These improvements can be understood by how the members of the three classes are actually predicted, as shown in Table 4. Many facultative anaerobe genomes are still erroneously predicted to be aerobes. However, the percentage of correct predictions of aerobes and anaerobes has clearly increased compared to the one-step method, indicating that the two-step network offers some advantage.
The performance for aerobe and anaerobe predictions are the same as for the one step prediction method, but the performance for prediction of facultative anaerobes have increased from 0.31 to 0.39.
Classification | MCC |
---|---|
Aerobe | 0.63 |
Anaerobe | 0.76 |
Facultative | 0.39 |
Notice that the frequency of correctly predicted facultative anaerobes have not increased compared with the one-step method (33% vs. 35%), but that the fraction of erroneous predictions of aerobe and anaerobe bacteria have been decreased (5% vs. 11% for aerobes, 4% vs. 6% for anaerobes). Thus the better performance of the prediction of facultative anaerobe genomes is due to an increased accuracy in predicting aerobe and anaerobe bacteria rather than an increased accuracy in predicting facultative anaerobe bacteria.
Aerobe genomes | Anaerobe genomes | Facultative genomes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Predictions | Aerobe | 141 | 90 | % | Aerobe | 8 | 7 | % | Aerobe | 47 | 48 | % |
Anaerobe | 8 | 5 | % | Anaerobe | 96 | 89 | % | Anaerobe | 18 | 19 | % | |
Facultative | 8 | 5 | % | Facultative | 4 | 4 | % | Facultative | 32 | 33 | % |
As described above, the most effective method for predicting oxygen requirement attempted in the present study was the two-step Bayesian network, where anaerobes were first distinguished from the respiring bacteria (aerobes and facultative anaerobes). This study found a total of 252 protein domains to be consistently over-represented in anaerobe genomes, compared to the respiration-capable genomes. The specific likelihoods of these domains being present given an anaerobe or aerobe/facultative genome, are listed in Additional file 3.
Aerobe genomes were consistently distinguished from facultative anaerobe genomes by 402 domains, while facultative genomes consistently had 122 specific domains over-represented in their genomes, compared to aerobe genomes. The specific likelihoods of these 524 domains being present given that the genome is from an aerobe or facultative anaerobe bacterium, respectively, are listed in Additional file 4.
Thus, by applying the information provided in Additional file 3 and Additional file 4 in a two-step Bayesian estimation, as described in the Method section, it is possible to calculate the most likely oxygen requirement class of an arbitrary bacterial genome, provided a Pfam-A profile is available for said genome.
Very few studies that attempt to predict microbial oxygen requirements can be found in the literature. Two examples are the studies by Wu & Moore and Lingner et al.1,2. In both of these studies they distinguish just two classes of oxygen requirement at a time. Although Wu & Moore look at three classes (aerobe, anaerobe and facultative anaerobe), they only attempt dichotic predictions, always leaving out one class entirely. They are thus uninformative about the reliability of predictions where the genome in question can be any of the possible classifications. In contrast, the method described in the present study offer more realistic estimations of how well the prediction of any of the three included classifications will perform.
In their distinction of aerobic and anaerobic organisms, Wu & Moore report an average misclassification rate, when distinguishing between aerobe and anaerobe genomes, of 15% and 13% when basing predictions on Clusters of Orthologous Groups and KEGG Orthology groups, respectively. To allow for a direct comparison, the two-step method described in the present study shows an average misclassification rate of a slightly less than 8% (MCC = 0.84) when distinguishing between aerobe and anaerobe genomes alone (Additional file 5). This suggest that the two-step method described here, along with being more simple to perform, is actually almost twice as accurate, when directly compared to the methods presented by Wu & Moore.
Similar to the present study, Lingner et al. attempted to predict oxygen requirements based on protein domain profiles; however only the distinction between aerobe and anaerobe genomes was described. For this purpose, Lingner et al. reported a performance in the form of sensitivity multiplied by specificity, of 0.88, which is comparable to the 0.84 achieved for aerobe/anaerobe distinction when using the method described here (Additional file 5). To construct the protein domain profiles used by Lingner et al., the number of each of the Pfam-A domains present in a given genome was used. In contrast, this study looked only on the presence or absence of the various Pfam-A domains in the individual genome. The comparable performances thus indicate that the presence of specific protein domains is indicative of oxygen requirements, regardless of the copy-number of those domains. Furthermore, it should be noted that Lingner et al. performed their predictions based on genomes available from NCBI 2009. They do not specifically specify the number of genomes labeled with respect to oxygen requirement at that time, but given the continuous additions of new genome sequences, it can reasonably be assumed to be fewer than the genomes available for the present study.
One of the main basic premises of the naive Bayesian inference method is that the various inputs, on which the inference is made, are mutually independent. In this study, the various inputs were different Pfam-A domains16. Protein domains by definition consist of compact sequences that will fold, perform functions and even evolve independently of the rest of the protein in which they reside17. Based on this fact, the basic premise of independence seems reasonable, and the method should thus be applicable.
Furthermore, even in situations where the premise of independence is invalid, the naive Bayesian classifier can be shown to produce excellent performance18,19. This means that even if certain protein domains might be found together regardless of classification, it would still be reasonable to expect the two-step method described here to be effectively applicable.
The results presented in this study show that bacterial oxygen requirements can be accurately predicted without considering protein domain copy number. Although facultative anaerobes could be predicted with a performance significantly better than random guessing, further optimization is still desired so as to make the distinction meaningful in practice. Such optimization would include additional biologically meaningful markers, e.g. the presence of specific transcription factors. However, the distinction between aerobe and anaerobe organisms, are as good, or better than what is achieved by other methods published in the scientific literature1,2.
The best performances were achieved when using a simple Bayesian network, first distinguishing respiration-capable bacteria (aerobes and facultative anaerobes) from anaerobic bacteria, and subsequently distinguishing the aerobes from the facultative anaerobes. The respiration-capable bacteria could be distinguished from anaerobic bacteria with a Matthews' Correlation Coefficient of 0.76, while pure aerobes could be distinguished from anaerobes with a Matthews' Correlation Coefficient of 0.84.
Given the success with respect to distinguishing respiring bacteria from anaerobes, a reasonable follow up would be to study the class-associated Pfam-A domains identified in this study in more detail. They offer a logical first step for supplying a mechanistic model, explaining the genetic adaptations necessary for a bacterium given certain environmental oxygen exposures. Furthermore, we plan to test this method for prediction of other types of habitats, including cases with more than two categories. Based on the findings of this study, we would recommend that the categories of such cases be divided into biologically meaningful sets of dichotic super-classes, followed by two or more rounds of predictions.
Conceived the project: DWU, DBJ.
Performed data analysis: DBJ.
Wrote the first draft of the manuscript: DBJ.
Paper feedback and analysis suggestions: DWU.
Both authors have read and approved the final manuscript.
We thank the Technical University of Denmark and the Danish Research Council for funding this work. All grants were assigned to David Ussery.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Competing Interests: No competing interests were disclosed.
Competing Interests: No competing interests were disclosed.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 1 13 Sep 13 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)