Research Article

Streptococcal taxonomy based on genome sequence analyses

[version 1; peer review: 2 approved]
PUBLISHED 01 Mar 2013

This article is included in the Phylogenetics collection.

Abstract

The identification of the clinically relevant viridans group streptococci at the species level remains problematic. The aim of this study was to extract taxonomic information from the complete genome sequences of 67 streptococci, comprising 19 species, by means of multilocus sequence analysis (MLSA), average amino acid identity (AAI), genomic signatures, genome-to-genome distances (GGD) and codon usage bias. We then assessed the usefulness of these genomic tools for species identification in streptococci. Our results showed that MLSA, AAI and GGD are robust markers for identifying streptococci at the species level, for instance, S. pneumoniae, S. mitis, and S. oralis. A Streptococcus species can be defined as a group of strains that share ≥ 95% identity in MLSA and AAI, and > 70% identity in GGD. This approach allows an advanced understanding of bacterial diversity.

Keywords

Comparative Genomics, Genomic Taxonomy, Streptococci species

Introduction

Bacteria are subjected to numerous forces that drive their diversification. As a consequence, different strains of a single bacterial species can sometimes explore distinct niches, be pathogenic or non-pathogenic, and present different metabolic pathways1,2. In such a scenario, identifying bacterial isolates to the species level is difficult1,2.

Currently, the genus Streptococcus comprises 99 recognized species, many of which are associated with disease in humans and animals (http://www.bacterio.net/s/streptococcus.html). The viridans group streptococci (VGS) encompass four phylogenetic clusters: Mitis, Mutans, Salivarius and Anginosus. They are part of the human microbiota and are isolated mainly from the oral cavity and the gastrointestinal and genitourinary tracts3. The Mitis group currently includes the important pathogen S. pneumoniae and 12 other recognized species: S. australis, S. cristatus (formerly S. crista), S. gordonii, S. infantis, S. mitis, S. oligofermentans, S. oralis, S. parasanguinis (formerly S. parasanguis), S. peroris, S. pseudopneumoniae, S. sanguinis (formerly S. sanguis) and S. sinensis. The Anginosus group includes three recognized species, S. anginosus, S. constellatus (including two subspecies, S. constellatus subsp. constellatus and S. constellatus subsp. pharyngis) and S. intermedius, and the Salivarius group includes S. salivarius, S. vestibularis, and S. thermophilus.

Currently, bacterial species are considered to be a group of strains (including the type strain) that are characterized by a certain degree of phenotypic consistency, showing > 70% DNA-DNA hybridization values and over 97% 16S rRNA sequence similarity4,5. Identification of streptococci is based on the current taxonomic standards using a combination of 16S rRNA gene sequence analyses, DNA-DNA hybridization, serologic and phenotypic data; however, they have been strikingly resistant to satisfactory classification, reflected in frequently changing nomenclature6,7. For instance, the 16S rRNA gene sequences of S. mitis and S. oralis are almost identical (> 99%) to S. pneumoniae, making the use of this information alone insufficient to distinguish these species8.

Recent studies have used whole genome analysis to determine the taxonomic relationships among bacterial species9–14. In order to determine the robustness of genomic markers in streptococci species delineation, we analyzed a collection of 67 complete genomes. The availability of whole genome sequences of several closely related species, for instance, S. mitis - S. oralis - S. pneumoniae, and S. salivarius - S. thermophilus - S. vestibularis, formed an ideal test case for the establishment of the genomic taxonomy of streptococci.

Material and methods

Genome sequence data

The genomic sequences of 67 streptococci that were publicly available for download by June 2nd, 2011 at the National Center for Biotechnology Information (NCBI) under the project accession number indicated in Table 1 were used in this study. The following analyses were performed according to Thompson et al. (2009)13 and are briefly described below.

Table 1. Genomic features of the streptococci.

G+C content (%): guanine + cytosine content (%). No. of CDS: number of coding DNA sequences. Nc: effective number of codons.

Organism | GenBank accession no. | Genome size (nt) | G+C content (%) | No. of CDS | Nc
S. agalactiae A909 | CP000114 | 2,127,839 | 35 | 1996 | 44.9
S. agalactiae NEM316 | AL732656 | 2,211,485 | 35 | 2094 | 45.2
S. agalactiae 2603VR | AE009948 | 2,160,267 | 35 | 2124 | 45.1
S. anginosus F0211 | AECT00000000 | 1,993,709 | 38 | 2035 | 50.6
S. bovis ATCC 700338 | AEEL00000000 | 2,050,893 | 37 | 2088 | 44.5
S. downei F0415 | AEKN00000000 | 2,239,421 | 43 | 2204 | 54.4
S. dysgalactiae subsp. equisimilis GGS-124 | AP010935 | 2,106,340 | 39 | 2094 | 50.3
S. equi subsp. equi 4047 | FM204883 | 2,253,793 | 41 | 2001 | 52.6
S. equi subsp. zooepidemicus | FM204884 | 2,149,868 | 41 | 1869 | 52.4
S. equi subsp. zooepidemicus MGCS10565 | CP001129 | 2,024,171 | 41 | 1893 | 52.3
S. gallolyticus subsp. gallolyticus TX20005 | AEEM00000000 | 2,214,091 | 37 | 2218 | 44.5
S. gallolyticus UCN34 | FN597254 | 2,350,911 | 37 | 2223 | 44.4
S. gordonii str. Challis substr. CH1 | CP000725 | 2,196,662 | 40 | 2051 | 52.4
S. infantis SK1302 | AEDY00000000 | 1,792,252 | 39 | 2102 | 48.9
S. infantarius subsp. infantarius ATCC BAA-102 | ABJK00000000 | 1,925,087 | 37 | 2051 | 44.0
S. mitis B6 | FN568063 | 2,146,611 | 39 | 2004 | 50.4
S. mitis SK321 | AEDT00000000 | 1,873,702 | 40 | 1757 | 49.8
S. mutans NN2025 | AP010655 | 2,013,587 | 36 | 1895 | 46.4
S. mutans UA159 | AE014133 | 2,030,921 | 36 | 1960 | 46.5
S. oralis ATCC 35037 | AEDW00000000 | 1,884,712 | 41 | 1793 | 51.4
S. parasanguinis ATCC 15912 | ADVN00000000 | 2,124,730 | 41 | 2035 | 52.8
S. parasanguinis F0405 | AEKM00000000 | 2,050,302 | 41 | 1978 | 52.9
S. pneumoniae AP200 | CP002121 | 2,130,580 | 39 | 2216 | 50.3
S. pneumoniae ATCC 700669 | FM211187 | 2,221,315 | 39 | 1990 | 50.0
S. pneumoniae CGSP14 | CP001033 | 2,209,198 | 39 | 2206 | 50.3
S. pneumoniae D39 | CP000410 | 2,046,115 | 39 | 1914 | 49.8
S. pneumoniae G54 | CP001015 | 2,078,953 | 39 | 2114 | 50.0
S. pneumoniae Hungary19A-6 | CP000936 | 2,245,615 | 39 | 2155 | 50.2
S. pneumoniae INV104 | FQ312030 | 2,142,122 | 39 | 1824 | 49.9
S. pneumoniae INV200 | FQ312029 | 2,093,317 | 39 | 1930 | 50.0
S. pneumoniae JJA | CP000919 | 2,120,234 | 39 | 2123 | 50.2
S. pneumoniae OXC141 | FQ312027 | 2,036,867 | 39 | 1824 | 49.9
S. pneumoniae P1031 | CP000920 | 2,111,882 | 39 | 2073 | 50.1
S. pneumoniae R6 | AE007317 | 2,038,615 | 39 | 2042 | 50.1
S. pneumoniae Taiwan19F-14 | CP000921 | 2,112,148 | 39 | 2044 | 50.1
S. pneumoniae TCH843119A | CP001993 | 2,088,772 | 39 | 2275 | 50.4
S. pneumoniae TIGR4 | AE005672 | 2,160,842 | 39 | 2105 | 50.0
S. pneumoniae 670-6B | CP002176 | 2,240,045 | 39 | 2352 | 50.4
S. pneumoniae 70585 | CP000918 | 2,184,682 | 39 | 2202 | 50.1
S. pseudoporcinus SPIN 20026 | AENS00000000 | 2,111,372 | 36 | 2030 | 48.6
S. pyogenes MGAS315 | AE014074 | 1,900,521 | 38 | 1865 | 49.1
S. pyogenes MGAS2096 | CP000261 | 1,860,355 | 38 | 1898 | 49.4
S. pyogenes MGAS5005 | CP000017 | 1,838,554 | 38 | 1865 | 48.9
S. pyogenes MGAS6180 | CP000056 | 1,897,573 | 38 | 1894 | 48.9
S. pyogenes MGAS8232 | AE009949 | 1,895,017 | 38 | 1839 | 49.0
S. pyogenes MGAS9429 | CP000259 | 1,836,467 | 38 | 1877 | 49.0
S. pyogenes MGAS10270 | CP000260 | 1,928,252 | 38 | 1986 | 49.0
S. pyogenes MGAS10394 | CP000003 | 1,899,877 | 38 | 1886 | 49.2
S. pyogenes MGAS10750 | CP000262 | 1,937,111 | 38 | 1979 | 49.1
S. pyogenes M1 GAS | AE004092 | 1,852,441 | 38 | 1696 | 48.8
S. pyogenes NZ131 | CP000829 | 1,815,785 | 38 | 1700 | 48.8
S. pyogenes SSI-1 | BA000034 | 1,894,275 | 38 | 1859 | 49.1
S. pyogenes str. Manfredo | AM295007 | 1,841,271 | 38 | 1745 | 48.9
S. salivarius SK126 | ACLO00000000 | 2,128,332 | 40 | 1992 | 47.0
S. sanguinis ATCC 49296 | AEPO00000000 | 2,054,852 | 41 | 2013 | 51.7
S. sanguinis SK36 | CP000387 | 2,388,435 | 43 | 2270 | 54.5
S. sanguinis VMC66 | AEVH00000000 | 2,311,949 | 43 | 2260 | 54.5
S. suis BM407 | FM252032 | 2,146,229 | 41 | 1932 | 52.0
S. suis GZ1 | CP000837 | 2,038,034 | 41 | 1979 | 52.4
S. suis P17 | AM946016 | 2,007,491 | 41 | 1824 | 51.9
S. suis SC84 | FM252031 | 2,095,898 | 41 | 1898 | 52.0
S. thermophilus CNRZ1066 | CP000024 | 1,796,226 | 39 | 1915 | 47.0
S. thermophilus LMD-9 | CP000419 | 1,856,368 | 39 | 1709 | 46.8
S. thermophilus LMG 18311 | CP000023 | 1,796,846 | 39 | 1888 | 46.9
S. thermophilus ND03 | CP002340 | 1,831,949 | 39 | 1919 | 46.8
S. uberis 0140J | AM946015 | 1,852,352 | 36 | 1762 | 46.4
S. vestibularis F0396 | AEKO00000000 | 2,022,289 | 39 | 1979 | 47.1

16S rRNA gene sequence analysis and multilocus sequence analysis (MLSA)

The 16S rRNA gene sequences and the gene sequences used for MLSA were obtained from GenBank (http://www.ncbi.nlm.nih.gov). The MLSA approach was based on the concatenated sequences of five house-keeping genes (aroE, ddl, gki, pheS and recA)15,16. The concatenated sequences were aligned with the ClustalX program17. Phylogenetic inference was based on the neighbour-joining (NJ) genetic distance method18 using MEGA519. Distances were estimated with the Kimura 2-parameter model20 for both the 16S rRNA gene and MLSA. The reliability of each tree topology was checked by 2000 bootstrap replications21.
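As an illustration of the distance step, the following is a minimal sketch of the Kimura 2-parameter distance between two aligned sequences. It is not the MEGA5 implementation; the function name `k2p_distance` and the choice to skip gapped or ambiguous columns are assumptions made for this example.

```python
import math

# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance for two aligned DNA sequences.

    Columns containing gaps or ambiguity codes are skipped
    (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites      # proportion of transitional differences
    q = transversions / sites    # proportion of transversional differences
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The corrected distance is always at least as large as the raw proportion of differing sites, because the model accounts for multiple substitutions at the same site.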

Average amino acid identity (AAI)

The AAI of all conserved protein-coding genes was calculated as described previously22. Conserved protein-coding genes between a pair of genomes were determined by whole-genome pairwise sequence comparisons using the BLASTp algorithm23. For these comparisons, all protein-coding sequences (CDSs) from one genome were searched against the genomic sequence of the other genome. The genetic relatedness between a pair of genomes was measured by the AAI of all conserved genes between the two genomes as computed by the BLAST algorithm. By this approach, a value of < 95% AAI of protein-coding genes indicates separate species.
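The averaging step can be sketched as follows, assuming the pairwise comparison has already been run and its hits are available in the standard BLAST tabular format (12 tab-separated columns, with percent identity in column 3 and bit score in column 12). The function name `average_amino_acid_identity` and the `min_identity` filter are hypothetical; the cut-offs used to call a gene "conserved" in the cited method are not stated here, so none is applied by default.

```python
def average_amino_acid_identity(blast_lines, min_identity=0.0):
    """Mean percent identity over the best hit of each query protein.

    blast_lines: iterable of BLAST tabular lines
    (qseqid sseqid pident length mismatch gapopen qstart qend
     sstart send evalue bitscore).
    min_identity: hypothetical filter; hits below it are ignored.
    """
    best = {}  # query id -> (percent identity, bit score) of its best hit
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        pident = float(fields[2])
        bitscore = float(fields[11])
        if pident < min_identity:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (pident, bitscore)
    if not best:
        raise ValueError("no hits passed the filter")
    return sum(pident for pident, _ in best.values()) / len(best)
```

Under the criterion described above, a pair of genomes whose value falls below 95 would be read as belonging to separate species.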

Codon usage

Codon usage bias was calculated for each genome. The effective number of codons used in a sequence (Nc)24 was calculated using CHIPS (http://emboss.bioinformatics.nl/cgi-bin/emboss/chips) with the default parameters.
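For readers unfamiliar with Nc, the sketch below follows Wright's estimator, Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where Fk is the mean codon homozygosity of the amino acids with k synonymous codons in the standard genetic code. It is an illustration only and is not the CHIPS source code; the function name and the handling of edge cases (amino acids seen fewer than twice are skipped, values above 61 are capped) are assumptions of this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def effective_number_of_codons(cds):
    """Effective number of codons (Nc) for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    families = defaultdict(list)  # amino acid -> its synonymous codons
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    homozygosity = defaultdict(list)  # degeneracy class -> F values
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met, Trp, or too few observations
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            homozygosity[k].append(f)
    nc = 2.0  # Met and Trp each contribute one codon
    for k, n_families in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = homozygosity[k]
        if not values:
            raise ValueError(f"no usable {k}-fold degenerate amino acids")
        nc += n_families / (sum(values) / len(values))
    return min(nc, 61.0)
```

A sequence that uses a single codon for every amino acid gives 20, and a sequence that uses all synonymous codons equally gives a value at the upper bound of 61.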

Determination of dinucleotide relative abundance values and genomic dissimilarity

Mononucleotide and dinucleotide frequencies were calculated using COMPSEQ (http://emboss.bioinformatics.nl/cgi-bin/emboss/compseq) with default parameters. Dinucleotide relative abundances ($\rho^*_{XY}$) were calculated using the equation $\rho^*_{XY} = f_{XY}/(f_X f_Y)$, where $f_{XY}$ denotes the frequency of dinucleotide XY, and $f_X$ and $f_Y$ denote the frequencies of X and Y, respectively. The difference in genome signature between two sequences is expressed by the genomic dissimilarity ($\delta^*$), which is the average absolute difference in dinucleotide relative abundance between the two sequences. It was calculated using the equation $\delta^*(f,g) = \frac{1}{16}\sum_{XY} |\rho^*_{XY}(f) - \rho^*_{XY}(g)|$ (multiplied by 1000 for convenience), where the sum extends over all 16 dinucleotides25.
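The two equations can be sketched in a few lines of Python. This is a minimal illustration that counts overlapping dinucleotides on the given strand only; it does not reproduce COMPSEQ, and the function names are chosen for this example. Whether the original analysis also pooled the reverse-complement strand is not stated here, so that step is left out.

```python
from collections import Counter
from itertools import product

def relative_abundances(seq):
    """Dinucleotide relative abundances rho_XY = f_XY / (f_X * f_Y)."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in "ACGT")
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence has no usable nucleotides")
    rho = {}
    for x, y in product("ACGT", repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        rho[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return rho

def genomic_dissimilarity(seq_f, seq_g):
    """delta*(f, g): mean absolute rho difference over the 16 dinucleotides, x1000."""
    rho_f = relative_abundances(seq_f)
    rho_g = relative_abundances(seq_g)
    return 1000 * sum(abs(rho_f[k] - rho_g[k]) for k in rho_f) / 16
```

Comparing a sequence with itself returns 0, and larger values indicate more divergent genome signatures.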

Genome-to-genome distances (GGD)

The genome distance was calculated using the Genome-to-Genome Distance Calculator (GGDC)26. Distances between a pair of genomes were determined by whole-genome pairwise sequence comparisons using BLAST23. For these comparisons, high-scoring segment pairs (HSPs) were determined and used to infer intergenomic distances. The corresponding distance threshold can be used for species delimitation26.

Results and discussion

In this work we compared complete genomes for 67 streptococci comprising 19 species to address their taxonomic position. A previous study with a small set of streptococci genomes (eight) and species (four), using a combination of several genomic analyses, showed the applicability of this approach in streptococci taxonomy9. Overall our analysis, using a large data set, showed that genomic taxonomy is an accurate approach to clearly define the streptococci species. The taxonomic resolution of the 16S rRNA, AAI, MLSA, GGD and codon usage analysis for streptococci species definition is summarized in Table 2.

Table 2. Taxonomic resolution of genomic analyses of streptococci species.

MLSA: multilocus sequence analysis. AAI: amino acid identity. GGD: genome to genome distance. Nc: effective number of codons.

Level / species | 16S rRNA (%) | MLSA (%) | AAI (%) | GGD (%) | Codon usage (Nc)
Intraspecies | ≥99 | ≥95 | ≥95 | >70 | -
S. pyogenes | ≥99 | ≥98 | >97 | >70 | 49
S. agalactiae | 99 | 100 | 98 | >70 | 45
S. equi | 99 | 98 | >96 | >70 | 52
S. suis | 100 | 100 | 100 | >70 | 52
S. pneumoniae | 99 | ≥97 | >97 | >70 | 50
S. thermophilus | 99 | 100 | >97 | >70 | 47
Interspecies | ≤99 | <95 | <95 | <70 | 44–54
S. thermophilus-salivarius-vestibularis | 99 | <94 | <92 | <70 | 47
S. pneumoniae-mitis-oralis | >99 | <94 | <93 | <70 | 50–51

General genomic features

Each streptococcal genome comprised a single chromosome. Genome sizes ranged from approximately 1.8 Mb (S. infantis) to 2.4 Mb (S. sanguinis), and the number of CDS varied from about 1,700 (S. pyogenes) to 2,352 (S. pneumoniae) (Table 1). The G+C content of the genomes ranged from 35% to 43%. Genome size and G+C content thus varied between species, indicating heterogeneity within the genus Streptococcus. One of the reasons for this variability could be the frequent occurrence of horizontal gene transfer events27–29.

Phylogenetic reconstructions by 16S rRNA and MLSA

MLSA and 16S rRNA phylogenetic trees showed similar topologies (Figure 1). The MLSA was performed using five genes instead of the seven applied in the pneumococcus multilocus sequence typing (MLST) scheme (http://spneumoniae.mlst.net/)15,16. Three genes, aroE, ddl and gki, are from the MLST scheme, and pheS and recA were added in this work. The concatenation of these genes (7741 bp) allowed an accurate delineation of the streptococci species considered here. The nucleotide sequence similarities were much lower for MLSA than for the 16S rRNA gene. A pairwise comparison among the species revealed MLSA sequence similarities between 67% and 100%, while the 16S rRNA gene sequence similarities varied from 92% to 100%. At the intraspecies level, the similarity values ranged from 95% to 100% for MLSA, and from 99% to 100% for the 16S rRNA gene sequences. The closest species within the Mitis (S. pneumoniae - S. oralis - S. mitis) and Salivarius groups (S. vestibularis - S. salivarius - S. thermophilus) were clearly placed apart from each other by MLSA, while these species had almost identical 16S rRNA gene sequences (≥ 99% sequence similarity). A previous study showed that recA analysis is a valuable tool for proper identification of pneumococci in routine diagnostics, but limitations in the discrimination of other members of the Mitis group were observed30. S. sanguinis ATCC 49296 showed a much closer relationship to S. oralis ATCC 35037T (95% similarity) than to other S. sanguinis strains (77% similarity), suggesting that it belongs to the species S. oralis. In addition, S. bovis ATCC 700338 was placed in the S. gallolyticus cluster with 98% MLSA sequence similarity. This work showed that MLSA, using this new combination of five concatenated genes (aroE, ddl, gki, pheS and recA), distinct from the Streptococcus MLST scheme, allowed a proper identification of most streptococci species, even within the VGS group.


Figure 1. Neighbor-joining tree based on 16S rRNA gene sequences and MLSA concatenated sequences of Streptococcus.

The numbers at the nodes indicate the values of bootstrap statistics after 2000 replications, and values below 50% are not shown. Bars, 0.005% and 0.02% estimated sequence divergence.

Average amino acid identity (AAI)

The percentage of average amino acid identity (AAI) among streptococci species ranges from 68% to 94%, while within species it varies from 95% to 100%. The VGS species S. pneumoniae, S. mitis and S. oralis shared 89–93% AAI. The species S. salivarius, S. thermophilus and S. vestibularis showed a maximum AAI of 93%. S. sanguinis ATCC 49296 and S. oralis ATCC 35037 showed 96% identity and S. bovis ATCC 700338 and S. gallolyticus strains had 98% identity. These findings suggest that strains ATCC 49296 and ATCC 700338 belong to the species S. oralis and S. gallolyticus, respectively. According to our analyses the AAI and MLSA are the most useful genomic features for the elucidation of streptococci taxonomy.

Genome signature

The genomic dissimilarity values among streptococci were between 3 and 127, while the intraspecies values were between 0 and 17. Streptococci within the VGS group, for instance, S. salivarius, S. thermophilus and S. vestibularis species, showed dissimilarity values between 5 and 12 and S. pneumoniae, S. mitis and S. oralis species had dissimilarity values between 3 and 14. Thus, there was not a clear differentiation of these closely related species within the VGS group on the basis of the genomic dissimilarity values. This could be due to the extensive recombination and horizontal gene transfer events which occur between closely related streptococci species that share ecological niches12,30.

On the other hand, species within the Pyogenic group had a distinct genomic signature, with values ranging from 13 to 85. However, genome signatures alone have significant limitations when used as phylogenetic markers for differentiating members of the VGS. The exact mechanisms that generate and maintain the genome signatures are complex, but possibly involve differences in species-specific compositional bias, i.e., G+C content, G+C and A+T skews, codon bias, and mutation bias32,33.

Codon usage bias (Nc)

Nc values provide a meaningful measure of the extent of codon preference in a genome; values range between 20 (an extremely biased genome in which one codon is used per amino acid) and 61 (all synonymous codons are used equally). Within the set of 67 complete streptococci genomes examined in this study, the Nc ranged from 44.0 to 54.5 (Table 1). For instance, the S. pneumoniae - S. oralis - S. mitis species had Nc values of 50, 51 and 50, respectively. The Salivarius group (S. vestibularis - S. salivarius - S. thermophilus) and S. bovis ATCC 700338 - S. gallolyticus showed Nc values of 47 and 44.5, respectively. Overall, codon usage bias was very similar among the streptococci species investigated. However, S. sanguinis ATCC 49296 had an Nc value much closer to that of S. oralis ATCC 35037 (51.7 and 51.4, respectively) than to that of other S. sanguinis strains (54.5), in agreement with the other analyses used in this study.

Genome distance analysis

The GGD was calculated only for closely related species that were not differentiated by 16S rRNA gene sequence analysis (Figure 1). Based on GGD analysis the species within the Mitis and Salivarius groups were identified as separate species, showing GGD values analogous to the < 70% discriminatory value used for DNA-DNA hybridization. Conversely, S. bovis ATCC 700338 and S. gallolyticus were identified as belonging to the same species by GGD.

S. bovis ATCC 700338 (biotype II) and S. gallolyticus, as well as S. sanguinis ATCC 49296 and S. oralis ATCC 35037T, were not separated and, according to this analysis, each pair would therefore be classified as a single species. It was shown previously that S. bovis biotype I and II/2 isolates were, in fact, S. gallolyticus34, and S. sanguinis ATCC 49296 was placed into the species S. oralis by GGD analysis. A misidentification of S. sanguinis ATCC 49296 has already been shown by means of biochemical and serological properties by Narikawa and colleagues35.

Another interesting result is that the S. parasanguinis ATCC 15912 and F0405 strains were at the borderline for definition as members of the same species in several genomic analyses. For instance, they shared 95% AAI, 94% identity by MLSA, a genomic dissimilarity value of 17 and < 70% similarity in GGD. Therefore, based on these genomic markers, these S. parasanguinis strains could, in fact, be separate species. These data reflect the complexity of bacterial species delineation, since these organisms are all under a constant evolutionary process.

Conclusion

The delineation of closely related streptococci species was evident in this genomic study. Different methods produced different levels of taxonomic resolution. The methods with the highest resolution for species identification were MLSA and AAI, while closely related species had similar Nc values and genomic signatures. Based on the genomic analyses, a Streptococcus species can be defined as a group of strains that share ≥ 95% identity in MLSA and AAI, and > 70% identity in GGD. This definition may be useful to advance the taxonomy of Streptococcus. This approach allows an advanced understanding of bacterial diversity and identification.
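The proposed definition can be written as a simple three-part check. The sketch below is illustrative only: the function name `same_species` is hypothetical, and the GGD value of 69.0 used for the S. parasanguinis pair in the usage lines is a placeholder standing in for the reported "< 70%", not a measured value.

```python
def same_species(mlsa_identity, aai, ggd):
    """Apply the proposed thresholds to a pair of strains.

    All arguments are percentages. Returns True only if the pair meets
    >= 95% in MLSA, >= 95% in AAI and > 70% in GGD.
    """
    return mlsa_identity >= 95.0 and aai >= 95.0 and ggd > 70.0

# S. parasanguinis ATCC 15912 vs F0405: 94% MLSA, 95% AAI, GGD below 70%
# (69.0 is a placeholder for "< 70%").
parasanguinis_pair = same_species(mlsa_identity=94.0, aai=95.0, ggd=69.0)
```

With these inputs the pair fails both the MLSA and the GGD criterion, which matches the observation that the two S. parasanguinis strains could be separate species.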

how to cite this article
Thompson CC, Emmel VE, Fonseca EL et al. Streptococcal taxonomy based on genome sequence analyses [version 1; peer review: 2 approved]. F1000Research 2013, 2:67 (https://doi.org/10.12688/f1000research.2-67.v1)

Open Peer Review

Reviewer Report 03 Apr 2013
Bruno Gomez-Gil, CIAD, A.C., Mazatlán Unit for Aquaculture and Environmental Management, Mazatlán, Mexico 
Approved
The article is well written with an appropriate title and abstract. The methods are adequate for the aims of the study, but I would suggest that including the Average Nucleotide Identity (ANI) analysis as suggested by Rosello-Mora et al. 2006, ...
Gomez-Gil B. Reviewer Report For: Streptococcal taxonomy based on genome sequence analyses [version 1; peer review: 2 approved]. F1000Research 2013, 2:67 (https://doi.org/10.5256/f1000research.865.r872)
Reviewer Report 03 Apr 2013
Tomoo Sawabe, Laboratory of Microbiology, Graduate School of Fisheries Sciences, Hokkaido University, Hakodate, Japan 
Approved
  • Title and abstract are good enough to attract readers in the scientific community.

  • Genome based multi-gene sequence comparison is one of the promising tools to analyse bacterial populations. To achieve the analysis for Streptococcus, the authors carefully designed massive data genome ...
Sawabe T. Reviewer Report For: Streptococcal taxonomy based on genome sequence analyses [version 1; peer review: 2 approved]. F1000Research 2013, 2:67 (https://doi.org/10.5256/f1000research.865.r818)