Keywords
COVID-19, SARS-CoV-2, Synonymous mutations, RNA secondary structure
This article is included in the Emerging Diseases and Outbreaks gateway.
This article is included in the Research Synergy Foundation gateway.
This article is included in the Coronavirus (COVID-19) collection.
The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to a global pandemic since December 2019. SARS-CoV-2 is a single-stranded RNA virus, and RNA viruses generally mutate at a higher rate than DNA viruses. Many studies have examined nonsynonymous mutations, which change protein sequences. However, few studies have examined the effects of SARS-CoV-2 synonymous mutations, which may affect viral fitness. This study aims to predict the effect of synonymous mutations on the SARS-CoV-2 genome.
A total of 26,645 SARS-CoV-2 genomic sequences retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) database were aligned using MAFFT. The mutations and their respective frequencies were then identified. Multiple RNA secondary structure prediction tools, namely RNAfold, IPknot++ and MXfold2, were applied to predict the effect of the mutations on RNA secondary structure, and their base pair probabilities were estimated using MutaRNA. Relative synonymous codon usage (RSCU) analysis was also performed to measure the codon usage bias (CUB) of SARS-CoV-2.
A total of 150 synonymous mutations were identified. The synonymous mutation with the highest frequency is the C3037U mutation in nsp3 of ORF1a. Of the top 10 highest-frequency synonymous mutations, the C913U, C3037U, U16176C and C18877U mutants show pronounced changes between wild type and mutant in all 3 RNA secondary structure prediction tools, suggesting these mutations may have some biological impact on viral fitness. These four mutations also show changes in base pair probabilities. All mutations except U16176C change the codon to a more preferred codon, which may result in higher translation efficiency.
Synonymous mutations in SARS-CoV-2 genome may affect RNA secondary structure, changing base pair probabilities and possibly resulting in a higher translation rate. However, lab experiments are required to validate the results obtained from prediction analysis.
The title of the paper has been revised to “Prediction of the effects of the top 10 synonymous mutations from 26645 SARS-CoV-2 genomes of early pandemic phase”. Information on the Alpha variant has been added to the introduction, and parts of the introduction have been revised. Additional paragraphs on the identification of synonymous mutations, RNA secondary structure prediction and codon usage bias have been included in the discussion section to improve the clarity of the manuscript. Some citations have been removed, added or updated accordingly.
In December 2019, coronavirus disease 2019 (COVID-19) cases first emerged from Wuhan, China1. Soon after, the rapid spread of COVID-19 resulted in a serious global outbreak. COVID-19 is an infectious and potentially lethal disease caused by a newly found coronavirus strain, known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus causes clinical manifestations ranging from asymptomatic infection to severe pneumonia and, in the worst scenario, death2. SARS-CoV-2 seems to have a higher transmission rate3 but a lower mortality rate2 in comparison to Middle East respiratory syndrome coronavirus (MERS-CoV) and severe acute respiratory syndrome coronavirus (SARS-CoV).
SARS-CoV-2 is a single-stranded RNA virus with a genome size of 29,903 bases. In general, RNA viruses have a higher mutation rate than DNA viruses and this allows them to evolve rapidly, escaping the host immune defence response4. Different SARS-CoV-2 variants with multiple synonymous and nonsynonymous mutations have been reported since the beginning of the outbreak5. Some variants are classified as variants of concern (VOCs) because their mutations are associated with changes in viral pathogenicity, such as higher disease severity, a higher transmission rate or a weaker immune response in the host6. However, it is expected that most of these mutations in the SARS-CoV-2 genome are either neutral or mildly deleterious7. Numerous studies have been carried out to understand the molecular effects of nonsynonymous mutations on the functions of different SARS-CoV-2 proteins6. For example, the Alpha variant (B.1.1.7) of SARS-CoV-2, first identified in the UK in late 2020, is characterized by several mutations, including the D614G mutation in the spike protein, which enhances its binding affinity to the ACE2 receptor8. This variant exhibits increased transmissibility compared to the original virus, which led to its rapid spread globally9. However, there are only a few studies on the synonymous mutations of the SARS-CoV-2 genome10,11.
Synonymous mutations are also known as silent mutations because the nucleotide mutations result in a change in the RNA sequence without altering the amino acid sequence12. Synonymous mutations have been suggested to have no functional consequence on the fitness of organisms and their evolution in the long term13. However, numerous recent studies have shown that synonymous mutations may affect the folding and stability of RNA structures14. Interestingly, a large-scale study of synonymous mutations in multiple yeast genes has shown that most synonymous mutations are not neutral and affect the fitness of the cell15. For RNA viruses, even though synonymous mutations generally do not change their pathogenicity directly, some studies reveal that synonymous mutations may affect the RNA secondary structure of the virus16 and also change the codon usage bias of the genes in the virus17,18. The use of mRNA-based COVID-19 vaccines reduces the severity of the disease. However, the mRNA molecule is susceptible to degradation due to the presence of the 2’ OH group in the ribose. To improve the stability of mRNA vaccines, Zhang et al. (2023) designed a novel algorithm, which optimizes the codon usage and RNA secondary structure by using synonymous codons19.
Synonymous mutations play some important biological roles, which may affect viral fitness and pathogenicity. However, the study of the biological consequences of synonymous mutations has been largely overlooked. In this study, we identified synonymous mutations of the SARS-CoV-2 genome from the early pandemic phase. We predicted the effects of the 10 highest-frequency synonymous mutations on the RNA secondary structures and codon usage bias of the SARS-CoV-2 genome. These findings allow researchers to prioritize these mutations for functional analysis in the future.
A total of 30,229 SARS-CoV-2 genomic sequences, with dates ranging from 31 December 2019 to 22 March 2021, were downloaded from the GISAID database (Global Initiative on Sharing All Influenza Data, RRID:SCR_018251)20. The sequences were filtered by setting parameters to keep only those with a complete genome and high coverage. The sequences were further filtered to remove those with more than 0.1% “N” unresolved nucleotides and ambiguous letters. A total of 3,584 sequences were removed by applying this filter, leaving 26,645 genomes for analysis. The reference sequence of the SARS-CoV-2 genome (NC_045512.2)21 was retrieved in FASTA format from the NCBI database (NCBI, RRID:SCR_006472). It is a Wuhan isolate with a complete genome comprising 29,903 bases.
The rapid calculation option available on the MAFFT online server (MAFFT, version 7.467, RRID:SCR_011811)22 was used to perform multiple sequence alignment (MSA) for the 26,645 SARS-CoV-2 genomes. This option supports the alignment of more than 20,000 sequences with approximately 30,000 sites. The alignment length was kept, which means that insertions in the mutated sequences were removed so that the alignment length remained the same as that of the reference sequence. All other parameters were left as default.
A simple Python script was written to identify the mutations in the 26,645 SARS-CoV-2 genomes. To determine whether the identified mutations are synonymous or nonsynonymous, MEGA X software, version 10.2.5 build 10210330 (MEGA Software, RRID:SCR_000667)23 was used to translate the sequences for inspection. The presence of amino acid changes was identified by referring to the genomic position of the nucleotide mutations. The 10 synonymous mutations with the highest frequencies were then selected for further analysis.
The RNA secondary structures of the wild type and mutant sequences were predicted using the RNAfold program, version 2.4.18 (Vienna RNA, RRID:SCR_008550)24 with the incorporation of SHAPE reactivity data obtained from the study by Manfredonia et al. (2020)25. The RNA secondary structure prediction was performed using a sequence spanning 250 nucleotides upstream and 250 nucleotides downstream of the mutation site. In addition to RNAfold, two other programs, IPknot++ version 2.2.1 (SCR_022557)26 and MXfold2 (SCR_022558)27, were used to predict the RNA secondary structure of the SARS-CoV-2 wild type and mutants.
To predict how the mutations affect local RNA folding, base pair probabilities were estimated using MutaRNA, version 1.3.0 (MutaRNA, RRID:SCR_021723)28. MutaRNA is a web-based tool that allows prediction and visualization of the structural changes induced by a single nucleotide polymorphism (SNP) in an RNA sequence. It reports the base pair probabilities within the RNA molecule for both wild type and mutant. The MutaRNA parameters were left at their defaults, except that the window size was changed to 501 nt.
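The ambiguity filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, and the sketch assumes that every character other than A, C, G, T or U counts as unresolved or ambiguous.

```python
def ambiguous_fraction(seq: str) -> float:
    """Fraction of characters that are not an unambiguous base (A, C, G, T/U)."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty record as fully unresolved
    resolved = sum(seq.count(base) for base in "ACGTU")
    return (len(seq) - resolved) / len(seq)


def passes_quality_filter(seq: str, max_fraction: float = 0.001) -> bool:
    """Keep a genome only if at most 0.1% of its letters are N or ambiguous."""
    return ambiguous_fraction(seq) <= max_fraction


# Toy usage: 2 ambiguous letters in 4,002 pass, 40 in 4,040 do not.
good = "ACGT" * 1000 + "NN"
bad = "ACGT" * 1000 + "N" * 40
print(passes_quality_filter(good), passes_quality_filter(bad))
```

Counting resolved bases, instead of listing every IUPAC ambiguity code, keeps the check short and treats any unexpected character as ambiguous.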
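As an illustration of this step, the sketch below calls substitutions from one aligned sequence against the reference and checks whether a substitution changes the encoded amino acid. It is not the authors' script: the function names and the `cds_start` parameter are hypothetical, the sequences are assumed to use the DNA alphabet and to be aligned to the reference length, and the mutated codon is assumed to lie in a single uninterrupted reading frame (so the ORF1a/ORF1b frameshift would need separate handling).

```python
from itertools import product

# Standard genetic code, built in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_substitutions(reference: str, sample: str):
    """Return (1-based position, reference base, sample base) for each substitution."""
    subs = []
    pairs = zip(reference.upper(), sample.upper())
    for pos, (ref_base, alt_base) in enumerate(pairs, start=1):
        # Skip gaps and ambiguous letters; only compare resolved bases.
        if ref_base != alt_base and ref_base in BASES and alt_base in BASES:
            subs.append((pos, ref_base, alt_base))
    return subs


def is_synonymous(reference: str, pos: int, alt: str, cds_start: int) -> bool:
    """True if replacing the base at 1-based `pos` by `alt` keeps the amino acid.

    `cds_start` is the 1-based position of the first base of the coding region.
    """
    index_in_codon = (pos - cds_start) % 3
    codon_start = pos - 1 - index_in_codon  # 0-based start of the codon
    ref_codon = reference[codon_start:codon_start + 3].upper()
    alt_codon = ref_codon[:index_in_codon] + alt.upper() + ref_codon[index_in_codon + 1:]
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]


# Toy usage: ATG TTC GGA encodes Met-Phe-Gly.
ref = "ATGTTCGGA"
for pos, _, alt in find_substitutions(ref, "ATGTTTGGA"):
    print(pos, is_synonymous(ref, pos, alt, cds_start=1))  # TTC -> TTT, both Phe
```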
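The window of 250 nucleotides on either side of the mutation site gives a 501 nt query for each wild type and mutant pair. The sketch below shows one way such a pair could be cut from a genome string before being passed to a folding tool. The function name and its arguments are hypothetical, and the clipping at the genome ends is an assumption, since the source does not say how sites near the ends were handled.

```python
def mutation_window(genome: str, pos: int, ref: str, alt: str, flank: int = 250):
    """Return (wild_type, mutant, local_pos) for a window around a 1-based site.

    `local_pos` is the 1-based position of the mutated base inside the window.
    The window is clipped at the genome ends, so it can be shorter than
    2 * flank + 1 for sites close to either end.
    """
    if genome[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {genome[pos - 1]}")
    start = max(0, pos - 1 - flank)      # 0-based, inclusive
    end = min(len(genome), pos + flank)  # 0-based, exclusive
    wild_type = genome[start:end]
    local = pos - 1 - start
    mutant = wild_type[:local] + alt + wild_type[local + 1:]
    return wild_type, mutant, local + 1


# Toy usage on an artificial 1,000 nt sequence with a C at position 500.
toy = "A" * 499 + "C" + "G" * 500
wt, mut, local_pos = mutation_window(toy, 500, "C", "U")
print(len(wt), local_pos, wt[local_pos - 1], mut[local_pos - 1])
```

Checking the reference base before cutting the window guards against off-by-one errors between genome coordinates and string indices.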
Relative synonymous codon usage (RSCU) represents the ratio of the observed frequency of a codon in a gene to the frequency expected under equal usage of synonymous codons. RSCU is calculated using the formula:
$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n}\sum_{j=1}^{n} X_j}$$
where $X_i$ is the number of occurrences of codon $i$, $n$ is the number of synonymous codons encoding that particular amino acid, and the sum runs over those $n$ synonymous codons.
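A minimal Python sketch of this calculation is given below, assuming the standard genetic code and in-frame coding sequences. The function name `rscu` and the toy input are illustrative and are not taken from the study. Each codon count is divided by the mean count of its synonymous family, which is the ratio described above.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(coding_sequences):
    """Return {codon: RSCU} for in-frame coding sequences (DNA or RNA letters).

    Stop codons and single-codon amino acids (Met, Trp) are excluded, as are
    amino acids that never occur in the input.
    """
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous letters
                counts[codon] += 1

    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for amino_acid, codons in families.items():
        if amino_acid == "*" or len(codons) == 1:
            continue
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        n = len(codons)
        for c in codons:
            # X_i divided by the family mean (total / n)
            values[c] = counts[c] * n / total
    return values


# Toy usage: three TTT and one TTC codon for Phe (n = 2).
print(rscu(["TTTTTTTTTTTC"]))
```

Within each amino acid family the RSCU values sum to n, so a value of 1.0 corresponds to equal usage.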
A synonymous mutation is a nucleotide change that does not alter the encoded amino acid. Synonymous mutations were previously considered to be less important, but they have now been shown to have some effects on RNA folding, RNA stability, miRNA binding and translational efficiency29. Synonymous mutations may have significant effects on the adaptation, virulence, and evolution of RNA viruses30. Another study indicated that synonymous mutations are associated with more than 50 human diseases, such as hemophilia B, tuberculosis (TB), cystic fibrosis (CF), Alzheimer's disease, schizophrenia and chronic hepatitis C31. Together, these studies show that synonymous mutations have received increasing attention in recent years. Hence, it is necessary to study the effects of synonymous mutations in the SARS-CoV-2 genome.
A total of 381 mutations were found in the SARS-CoV-2 genomes using the Python script, of which 150 are synonymous mutations. The distribution of these 150 synonymous mutations across 11 coding regions is shown in Figure 1. ORF1a and ORF1b have the highest numbers of synonymous mutations, at 76 and 33, respectively, which might be due to their longer sequence length. Our findings also show a high C to U mutation rate in the SARS-CoV-2 genome, and this mutational skew is in line with multiple studies32–35. The high C to U mutation rate may be driven by the host APOBEC-mediated RNA editing system, and overexpression of APOBEC3 protein promotes viral replication and propagation in a human colon epithelial cell line36. These mutational skews need to be considered when deducing the selection acting on synonymous variants in SARS-CoV-2 evolution11. Synonymous mutations are assumed to be subject to lower selective pressure than nonsynonymous mutations, presumably because purifying selection has a stronger negative impact on the frequencies of nonsynonymous mutations. Interestingly, a few studies suggest that there may be some selection on synonymous mutations, which would mean that these mutations are not random and neutral and may have some biological impact on viral fitness11,32,37.
The 10 synonymous mutations with the highest frequencies among the 150 synonymous mutations identified in the SARS-CoV-2 genomes are listed in Table 1. Our sequence samples were obtained from December 2019 to March 2021, and this period overlapped with the peak of the Alpha variant (B.1.1.7) outbreak5. The defining synonymous SNPs of the Alpha variant include C241T, C913T, C3037T, C5986T, C14676T, C15279T and T16176C5, and all except C241T are reported in our study as well. As shown in Table 1, the synonymous mutation with the highest frequency is the C3037U mutation located in nsp3 of ORF1a, followed by the C313U mutation in nsp1 of ORF1a and the C9286U mutation in nsp4 of ORF1a. Mutations with higher frequency are mostly found in ORF1a and ORF1b. Although there are some overlapping ORFs in the SARS-CoV-2 genome, such as ORF1a and ORF1b, and ORF3a and ORF3c38, the top 10 highest-frequency synonymous mutations are not located in these overlapping sites. It is of great interest to find out the effect of these top 10 synonymous mutations on the SARS-CoV-2 genome. However, it is important to note that the high frequency of some mutations is not necessarily due to their positive effects. They may have emerged during the early stage of the pandemic and been transmitted to all of their descendants, even though they have little or no effect on viral fitness39.
As in a companion paper, which focuses on the prediction analysis of nonsynonymous mutations in SARS-CoV-2 proteins40, SARS-CoV-2 genome data from the GISAID database ranging from 1 January 2020 to 22 March 2021 were used in this study. The data collection period overlapped with March–May 2021, when the frequency of the Alpha variant reached its highest numbers34. There are seven synonymous mutations identified as defining mutations of the Alpha variant, of which all except C241T are also reported in our study. Due to the rapid evolution of the SARS-CoV-2 genome, it is beyond the scope of our study to keep track of the SARS-CoV-2 mutational profile and to predict the consequences of all these mutations. Two independent studies reported that Alpha or Alpha-like SARS-CoV-2 variants were circulating among wild deer populations in North America in late 2021, as described in refs41,42. Although there is no reported case of viral spillback from deer to humans, we cannot rule out this possibility yet. Hence, our findings remain relevant despite not using the latest genome dataset.
SARS-CoV-2 can form highly structured RNA elements, which may affect viral replication, discontinuous transcription and translation43,44. For example, SARS-CoV-2 forms a three-stemmed pseudoknot structure to promote programmed -1 ribosomal frameshifting, which increases the synthesis of the proteins required for viral replication43,44. There are numerous high-throughput studies on the characterization of the RNA secondary structure of the SARS-CoV-2 genome25,45–48. In these recent high-throughput studies, the RNA secondary structures of the SARS-CoV-2 genome were determined experimentally using chemical probing methods, such as SHAPE-MaP25,45, or proximity ligation methods, such as RIC-seq47 and COMRADES48. Although these data are very useful for determining the RNA secondary structures of SARS-CoV-2, very few studies have examined the effect of synonymous mutations on RNA secondary structure, which may be beneficial or deleterious to viral fitness. Therefore, we performed RNA secondary structure prediction and base pair probability estimation for the top 10 highest-frequency synonymous mutations.
To improve the outcome of the study, multiple RNA secondary structure prediction tools, namely RNAfold with SHAPE reactivity data24, IPknot++26 and MXfold227, were applied in our study. In addition, the MutaRNA analysis tool was used to estimate the base pair probabilities of the wild type and mutant sequences. RNAfold with SHAPE reactivity data uses a thermodynamic approach to calculate the minimum free energy of the most probable RNA secondary structure, incorporating nucleotide reactivity data derived from experiments. If the reactivity value is high, the nucleotide is less likely to be paired, and vice versa. SARS-CoV-2 can form pseudoknot structures, which promote ribosomal frameshifting44. However, many RNA secondary structure prediction programmes do not predict pseudoknot structures because the calculation is computationally demanding. IPknot++ is one of the few programmes that can predict pseudoknot structures. MXfold2 predicts RNA secondary structure using a deep learning method trained on a large dataset.
Although different tools may produce similar results for identical RNA sequences, it is important to note that prediction outputs can vary due to differences in algorithms, thermodynamic parameter settings, inclusion of pseudoknot calculation, incorporation of experimental data and the assumptions of each tool. RNAfold, IPknot++ and MXfold2 apply the nearest neighbour model, using different thermodynamic parameters. In addition, MXfold2 implements deep learning models with a max-margin framework49. Multiple experimental genome-wide mapping studies of RNA secondary structure showed that the 5’ UTR of the SARS-CoV-2 RNA genome forms 7 conserved stem-loop (SL) structures, or 8 SLs in some studies, depending on the sequence length25,45–48,50. To demonstrate the usability of the prediction tools, we predicted the RNA secondary structure of the 5’ UTR (1-480 nt) of SARS-CoV-2 (Extended data 1)51. The RNA secondary structures predicted by RNAfold with SHAPE data, IPknot++ and MXfold2 are similar, especially in the SL1 and SL5-8 regions, and they are comparable to most of the published experimental data25,45–48,50. Both RNAfold with SHAPE data and MXfold2 successfully predicted SL4, but IPknot++ predicted SL4 with a pseudoknot structure, which has not been reported in other studies. Interestingly, the result obtained from RNAfold without SHAPE data is quite different, possibly due to the missing experimental data. In addition, it has been shown that SARS-CoV-2 may adopt different RNA secondary structure conformations7,19,36,37,39,41. Our study aims to predict whether a synonymous SNP (sSNP) may affect RNA secondary structure, and the outcomes allow us to prioritize variants for experimental functional studies in the future. Using multiple prediction tools may help to increase the accuracy and reliability of the prediction results.
The prediction results for all 10 synonymous mutations using these 3 tools and the base pair probability estimation results are summarized in Table 2 (✓ - changes, × - no change). The results for all 10 synonymous mutations predicted with RNAfold, IPknot++ and MXfold2 are available in Extended data 2, 3 and 4, respectively51. The base pair probabilities for all 10 synonymous mutations are shown as circular plots in Extended data 551. The darker the edge, the more likely the two connected bases are to form a base pair. Of these 10 synonymous mutations, four mutants, all located in ORF1ab, namely C913U, C3037U, U16176C and C18877U, show pronounced changes between wild type and mutant in all 3 prediction tools, suggesting these synonymous mutations may have some biological impact on viral fitness. Having said that, it is also possible that other mutants, for which only one or two of these analyses predicted changes, also affect RNA secondary structures and have some impact on viral fitness. It has been shown that SARS-CoV-2 can form elaborate RNA secondary structures at the 5’ and 3’ UTRs and at the frameshifting element (FSE), located at the boundary of ORF1a and ORF1ab7,19,36,37,39,41. The 5’ UTR of SARS-CoV-2 is important for viral mRNA stability52 and protein translation53, while the 3’ UTR may be involved in viral proliferation in the host cell54. Interestingly, it has been observed that C to U transitions occurred at higher frequency in the stem regions of the RNA secondary structures of the 5’ and 3’ UTRs of the SARS-CoV-2 genome, possibly because they are less detrimental to the structure34. The FSE can form pseudoknot structures, which regulate the relative protein expression of ORF1a and ORF1ab during viral infection43,44.
Mutation | RNAfold (SHAPE) | IPknot++ | MXfold2 | MutaRNA |
---|---|---|---|---|
C313U | × | × | × | × |
C913U | ✓ | ✓ | ✓ | ✓ |
C3037U | ✓ | ✓ | ✓ | ✓ |
C5986U | ✓ | × | ✓ | ✓ |
C9286U | × | ✓ | × | ✓ |
C14676U | ✓ | ✓ | × | ✓ |
C15279U | × | ✓ | × | × |
U16176C | ✓ | ✓ | ✓ | ✓ |
C18877U | ✓ | ✓ | ✓ | ✓ |
C26735U | ✓ | ✓ | × | ✓ |
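The consensus reading of Table 2 can be reproduced with a short tally. The sketch below transcribes the ✓/× entries of the table into booleans and lists the mutations for which all three structure prediction tools report a change. The variable and function names are illustrative only.

```python
# Table 2 transcribed as (RNAfold with SHAPE, IPknot++, MXfold2, MutaRNA);
# True means a change was predicted (✓), False means no change (×).
TABLE_2 = {
    "C313U": (False, False, False, False),
    "C913U": (True, True, True, True),
    "C3037U": (True, True, True, True),
    "C5986U": (True, False, True, True),
    "C9286U": (False, True, False, True),
    "C14676U": (True, True, False, True),
    "C15279U": (False, True, False, False),
    "U16176C": (True, True, True, True),
    "C18877U": (True, True, True, True),
    "C26735U": (True, True, False, True),
}


def changed_in_all_structure_tools(table):
    """Mutations flagged by RNAfold, IPknot++ and MXfold2 (MutaRNA ignored)."""
    return [name for name, flags in table.items() if all(flags[:3])]


def tool_votes(table):
    """Number of structure prediction tools (0 to 3) reporting a change."""
    return {name: sum(flags[:3]) for name, flags in table.items()}


print(changed_in_all_structure_tools(TABLE_2))
# ['C913U', 'C3037U', 'U16176C', 'C18877U']
```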
Other than the 5’ and 3’ UTRs, Huston et al. (2021) found that the ORF1ab region forms an extensive RNA secondary structure network45. Coincidentally, all four mutations reported in our study, C913U, C3037U, U16176C and C18877U, are located within ORF1ab. The C913U mutation is found in Nsp2, near its start (position 806) in ORF1a of the SARS-CoV-2 genome. As shown in Figure 2, the wild type structures predicted by RNAfold and MXfold2 share some degree of similarity around positions 95–330 of the 501-base structure. The C913U mutation has a pronounced effect on the RNA secondary structure predicted by RNAfold. It results in the appearance or disappearance of multiple loops, not only near the mutated residue but also at sites further apart, suggesting this mutation may affect long-range RNA interactions. MXfold2 predicts that the U913 mutant forms a shorter stem and a larger hairpin loop compared to the C913 wild type. However, the structure predicted by IPknot++ is quite different from the others, in that C913U results in a change of pseudoknot structure. Figure 2D shows that the base pair interactions of the wild type RNA are changed substantially by the U913 mutation. Previously, it has been shown that Nsp2 protein suppresses the host immune response by inhibiting the mRNA translation of the interferon gene55. Although the C913U mutation does not alter the amino acid residue of Nsp2 protein, it may be worthwhile to see if this mutation plays a direct or indirect role in the host immune response through Nsp2 protein. Since C913U is near the boundary of Nsp1 and Nsp2, the altered RNA secondary structure may affect ribosome stalling, which, in turn, may affect the folding of nascent polypeptides and translation initiation. In addition, Nsp1 protein facilitates viral propagation by inhibiting the host protein translation machinery56 and promoting host mRNA degradation57. It will be interesting to investigate whether the C913U mutation affects these functions.
Figure 2. (A) RNA secondary structure of C913 wild type and U913 mutant predicted using RNAfold. (B) RNA secondary structure of C913 wild type and U913 mutant predicted using IPknot++. (C) RNA secondary structure of C913 wild type and U913 mutant predicted using MXfold2. (D) MutaRNA circular plots of base pairing probabilities of C913 wild type and U913 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.
The C3037U mutation is found in Nsp3 in ORF1a. As shown in Figure 3, both IPknot++ and MXfold2 predict that the U3037 mutant forms a longer stem and a smaller internal loop compared to wild type. In contrast, RNAfold predicts that a small internal loop fuses into a bigger internal loop in the U3037 mutant. The MutaRNA circular plot shows minor differences in base pair probabilities between the C3037 wild type and the U3037 mutant. Nsp3 is a papain-like protease, which hydrolyzes several Nsp proteins involved in viral replication58. Hence, the effect of this mutation on its cleavage activity, probably through a change in the transcription or translation level of Nsp3, should be investigated.
Figure 3. (A) RNA secondary structure of C3037 wild type and U3037 mutant predicted using RNAfold. (B) RNA secondary structure of C3037 wild type and U3037 mutant predicted using IPknot++. (C) RNA secondary structure of C3037 wild type and U3037 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of C3037 wild type and U3037 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.
The U16176C mutation is located in Nsp12, close to the boundary of the Nsp12 and Nsp13 genes in ORF1b. As shown in Figure 4, the U16176C mutation results in a drastic change in the RNA secondary structure predicted using RNAfold. IPknot++ predicts that the C16176 mutant forms new pseudoknot structures, which are absent in the wild type U16176. On the other hand, MXfold2 predicts that the C16176 mutant forms a larger multi-branched loop and a shorter stem compared to wild type. Similarly, the MutaRNA result shows that the C16176 mutant affects base pair probabilities at multiple sites. Nsp12 is one of the subunits of RNA-dependent RNA polymerase (RdRp), which is required for RNA synthesis59. A study showed that a 1.4-kb-long SARS-CoV-2 RNA sequence (residues 15071–16451) located in the Nsp12 and Nsp13 regions is required to facilitate viral RNA packaging60. Since the U16176C mutation may affect RNA secondary structure, it will be interesting to see if it affects viral RNA packaging. U16176C, C14676U and C15279U have very similar frequencies, as shown in Table 1. Interestingly, IPknot++ predicted that all of them result in changes in pseudoknot structure, as shown in Extended data 3. We speculate that these three sSNPs may be functionally related. These mutations are located downstream of the frameshifting element (residues 13405–13488), and this element forms a pseudoknot to promote ribosomal frameshifting during viral replication61. It has been demonstrated that synonymous mutations affect both the RNA secondary structure of the ribosomal frameshift signal and frameshifting efficiency in SARS-CoV62. Another study has shown that this ribosomal frameshifting structure in SARS-CoV-2 involves a long-range sequence interaction of 1.5 kb48. It remains to be seen whether the long-range sequence interaction for ribosomal frameshifting can extend beyond 1.5 kb.
Figure 4. (A) RNA secondary structure of U16176 wild type and C16176 mutant predicted using RNAfold. (B) RNA secondary structure of U16176 wild type and C16176 mutant predicted using IPknot++. (C) RNA secondary structure of U16176 wild type and C16176 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of U16176 wild type and C16176 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.
The C18877U mutation is located in Nsp14 in ORF1b. As shown in Figure 5, RNAfold predicts that an additional internal loop is formed in the U18877 mutant. IPknot++ predicts that the U18877 mutant forms extra internal loops and a longer hairpin near the mutated residue, and that it also affects the pseudoknot structure at 2 different sites further from the mutated residue. MXfold2 predicts that the U18877 mutant forms one hairpin with multiple loops instead of the single hairpin seen in wild type. Changes at multiple base pairing sites due to the U18877 mutation are also observed in the MutaRNA circular plot. Nsp14 is important for maintaining high fidelity during viral RNA synthesis63.
Figure 5. (A) RNA secondary structure of C18877 wild type and U18877 mutant predicted using RNAfold. (B) RNA secondary structure of C18877 wild type and U18877 mutant predicted using IPknot++. (C) RNA secondary structure of C18877 wild type and U18877 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of C18877 wild type and U18877 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.
Other than affecting RNA secondary structure, it has been shown that synonymous mutations may affect protein translation efficiency and accuracy through the formation of codon usage bias (CUB), which is non-random usage of synonymous codons, common in all species64. It is a phenomenon where some codons are preferred over others for a specific amino acid. SARS-CoV-2 replicates using host cell’s machinery and synthesizes its protein by utilizing host cellular components. Hence, codon usage bias may affect the replication of viruses65.
Relative synonymous codon usage (RSCU) is a widely used statistical approach66 that can be used to measure codon usage bias in coding sequences. The RSCU values of SARS-CoV-2 are shown in Table 3 and the most preferred codons for each amino acid are marked in bold. Stop codons (UAA, UAG, UGA) and codons which code for an amino acid uniquely (AUG, UGG) are excluded from RSCU analysis.
Based on the RSCU values, the synonymous codons can be classified into five groups: i) codons with an RSCU value equal to 1.0 are unbiased codons; ii) codons with an RSCU value > 1.0 are preferred in a genome; iii) codons with an RSCU value < 1.0 are less preferred in a genome; iv) codons with an RSCU value > 1.6 are over-represented in a genome; v) codons with an RSCU value < 0.6 are under-represented in a genome65. There are 15 preferred codons (RSCU value > 1.0) and 11 over-represented codons (RSCU value > 1.6) in the SARS-CoV-2 genome, as shown in Table 3. The preferred codons in the SARS-CoV-2 genome are GCA (Ala), CGU (Arg), AAU (Asn), GAU (Asp), UGU (Cys), CAA (Gln), GAA (Glu), CAU (His), AUU (Ile), UUG (Leu), AAA (Lys), UUU (Phe), CCA (Pro), AGU (Ser) and UAU (Tyr), while the over-represented codons are GCU (Ala), AGA (Arg), GGU (Gly), CUU (Leu), UUA (Leu), CCU (Pro), UCA (Ser), UCU (Ser), ACA (Thr), ACU (Thr), and GUU (Val). The presence of preferred and over-represented codons in a genome may increase the protein synthesis rate.
Table 4 shows the RSCU analysis of the top 10 synonymous mutations. The codons in bold in the ‘codon change’ column are the codons with the higher RSCU value, which means they are more preferred in the SARS-CoV-2 genome. Most of the mutations change the codon to a more preferred codon, as shown in Table 4. Nine of the ten synonymous mutations involve changes from C to U, and eight of them are located at the third position of the codon, suggesting these changes are not random and are possibly subject to some selection pressure. In agreement with our study, an excess of C to U changes in the SARS-CoV-2 genome has been reported in multiple studies32–35. Since preferred codons may have better translation efficiency and accuracy than non-preferred codons64, it is possible that most of these mutations increase viral fitness. Although one study shows, based on experimental evidence, that RNA secondary structures may be functionally linked to protein translation67, it is difficult to establish this connection using in silico studies alone.
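The grouping by RSCU value can be written as a small function. This is a sketch under stated assumptions and not code from the study: it uses the conventional cut-offs of 1.6 for over-represented and 0.6 for under-represented codons, it treats the five groups as mutually exclusive by testing the extreme categories first, and the function name and tolerance are hypothetical.

```python
def classify_rscu(value: float, tol: float = 1e-9) -> str:
    """Assign an RSCU value to one of five usage categories.

    Assumed cut-offs: > 1.6 over-represented, < 0.6 under-represented,
    1.0 unbiased, otherwise preferred (> 1.0) or less preferred (< 1.0).
    """
    if value > 1.6:
        return "over-represented"
    if value < 0.6:
        return "under-represented"
    if abs(value - 1.0) <= tol:
        return "unbiased"
    return "preferred" if value > 1.0 else "less preferred"


# Toy usage with made-up RSCU values (not taken from Table 3).
for v in (2.1, 1.3, 1.0, 0.8, 0.2):
    print(v, classify_rscu(v))
```

Testing the extreme categories first means a codon with RSCU > 1.6 is reported as over-represented and not also as preferred, which matches how the two codon lists are kept separate in the text.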
The effects of SARS-CoV-2 synonymous mutations on RNA secondary structure and codon usage bias were studied, even though these mutations do not change the amino acid residues of the protein. The C913U, C3037U, U16176C and C18877U mutants show pronounced changes between wild type and mutant in all 3 RNA secondary structure prediction tools, suggesting these mutations may have some biological impact on viral fitness. In addition, these mutations showed changes in the base pair probabilities estimated by MutaRNA. All mutations except U16176C change the codon to a more preferred codon, which may result in higher translation efficiency. Due to the shortcomings of prediction tools, experimental studies, such as protein translation assays and RNA packaging assays, are needed to give a more comprehensive understanding of the biological consequences of synonymous mutations in SARS-CoV-2.
No ethical approval is required for data analysis in this study (EA2702021).
SARS-CoV-2 genome sequence data were obtained from the GISAID database. The multiple sequence alignment data can be accessed through Figshare.
Figshare: MSA (SARS-CoV-2). https://doi.org/10.6084/m9.figshare.20486178.v168
Figshare: RNA secondary structure prediction and base pair probability estimation analysis
https://doi.org/10.6084/m9.figshare.20486166.v652
Extended data 1. Comparison of the RNA secondary structures of the SARS-CoV-2 5’ UTR (1-480 nt) predicted using RNAfold without SHAPE data, RNAfold with SHAPE data, IPknot++ and MXfold2.
Extended data 2. The RNA secondary structure of SARS-CoV-2 genome predicted using RNAfold.
Extended data 3. The RNA secondary structure of SARS-CoV-2 genome predicted using IPknot++.
Extended data 4. The RNA secondary structure of SARS-CoV-2 genome predicted using MXfold2.
Extended data 5. The base pair probabilities of SARS-CoV-2 genome estimated using MutaRNA
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
The Python code for the identification of SARS-CoV-2 genome mutations can be accessed through GitHub.
CHN contributed to the concept, design and supervision of the project. WXB and SBZ contributed to the design, methodology and data collection. WXB contributed to the analysis and interpretation of data. All authors were involved in drafting and revising the manuscript and approved the final version.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: viral genomics, viral evolution
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: viral genomics, viral evolution
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational Biology, Structural Genomics, RNA biology, Virology
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: RNA structure prediction; Computational Structural Biology; RNA-protein complexes.
Is the work clearly and accurately presented and does it cite the current literature?
No
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions drawn adequately supported by the results?
No
References
1. Lan T, Allan M, Malsick L, Woo J, et al.: Secondary structural ensembles of the SARS-CoV-2 RNA genome in infected cells. Nature Communications. 2022; 13 (1).
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: RNA structure prediction; Computational Structural Biology; RNA-protein complexes.
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Clinical Virology Diagnosis
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genome Analysis, SARS-CoV-2, Cancer, Immunology, Stem cell
Alongside their report, reviewers assign a status to the article:
Invited Reviewers: 1 to 6
Version | Date | Reports available |
---|---|---|
Version 4 (revision) | 18 Sep 24 | 1 |
Version 3 (revision) | 29 Feb 24 | 2 |
Version 2 (revision) | 05 Sep 22 | 2 |
Version 1 | 18 Oct 21 | 3 |