Keywords
Ribosome profiling, codon optimization, Ribo-seq, RNA-seq, translation kinetics, codon usage, codon pair usage, protein therapeutics
The ribosome profiling data analysis pipeline described in the manuscript has been extensively updated in terms of its usability and composition of tools. We have updated the alignment tool from TopHat to HISAT2, changed the code from Python 2.7 to Python 3.7 and updated many of the scripts to make the code more readable and to generate error messages that assist in debugging. Furthermore, we have bundled all of the scripts involved in the pipeline into two bash scripts that are easier to run. The GitHub documentation and usage notes have also been updated to describe more explicitly the versions of assemblies and tools used in the pipeline. We have updated the figures in the paper with data generated from the new pipeline and have added an additional statistic, the Spearman's rank correlation coefficient, to the data validation section.
The ribosome profiling (footprinting) technique has only been around for a decade1 but has already contributed tremendously to our understanding of translation efficiency and kinetics. Initially developed to systematically monitor protein translation in yeast1, it has since been adapted to work in a range of organisms2,3 and to tackle a variety of questions. Ribosome profiling data typically consist of a set of sequences of ribosome protected fragments (RPFs), designated as Ribo-seq data, accompanied by sequences from total RNA (RNA-seq). The availability of Ribo-seq and RNA-seq data from the same sample provides a treasure trove of information, enabling quantitative study of the translation efficiency, rate and kinetics of every mRNA sequence in the pool4. Given that these sequences cover the entire transcriptome, and also include tRNA and rRNA, typically only a fraction of the data is presented and constructively used within the initial publication. Further analyses and comparisons of different ribosome profiling datasets can yield significant new information.
We recently conducted a ribosome profiling study to examine the translation kinetics of blood coagulation factor IX5, a protein of great pharmaceutical interest. Two human embryonic kidney 293T (HEK293T) cell lines were lentivirally transduced, one with the wild type (WT) version of the gene and one with a codon optimized (CO) F95. Codon optimization is a widely used technique that aims to increase protein expression levels by replacing multiple codons within a coding sequence with synonymous ones. Because the amino acid sequence of the protein remains unaltered, these changes have been assumed to be inconsequential for the structure and function of the protein. However, this is not always true; in our ribosome profiling study, we showed that these synonymous changes drastically altered translational kinetics and led to protein conformational changes5.
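The idea of a synonymous substitution can be illustrated with a short sketch. The sequences below are hypothetical toy examples, not the WT or CO F9 sequences, and the helper names are introduced only for illustration; the codon table is the standard genetic code.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence (length divisible by 3) into amino acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def is_synonymous(cds_a, cds_b):
    """True if both coding sequences encode the same amino acid sequence."""
    return translate(cds_a) == translate(cds_b)


def changed_codons(cds_a, cds_b):
    """Count codon positions that differ between two equal-length sequences."""
    return sum(
        cds_a[i:i + 3] != cds_b[i:i + 3] for i in range(0, len(cds_a), 3)
    )


# Hypothetical toy sequences (both encode Met-Ala-Lys).
original = "ATGGCTAAA"
recoded = "ATGGCCAAG"
print(is_synonymous(original, recoded), changed_codons(original, recoded))
```

Two sequences that pass this check encode the same protein, yet they may still differ in translation kinetics, which is the point the dataset is meant to address.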
The translational kinetics of the F9 variants, along with the control genes, GAPDH and ACTB, were analyzed in detail in the original publication5. Similarly, any other gene of interest can be investigated in this dataset in terms of its rate of synthesis and translational kinetics, and genes across the entire transcriptome can be compared to each other. Since several other HEK293T ribosome profiling datasets are available, these could be used to examine the reproducibility of the results6. Furthermore, by looking into ribosome profiling datasets from other cell types, such as other human cells7 and/or across species, it would be valuable to examine whether a given gene maintains the same translation kinetics or whether there are significant differences that could affect the conformation of the protein. Clearly, since a rather large inter-experiment variation is expected, the accumulation of several ribosome profiling datasets would be very useful for this type of analysis.
Innovative computational approaches to analyzing ribosome profiling data have led to the identification of novel CDSs that lead to the production of previously unidentified peptides and variants of known proteins8. Such coding sequences may be found in what are typically designated as untranslated regions (UTRs) of the mRNA, particularly the 5’UTRs, and may originate from non-AUG start sites9–11. However, such approaches have not yet been applied to this dataset, and it would be intriguing to see if they could lead to new discoveries12. Importantly, since the genome of the HEK293T cells used to generate this dataset contains part of the lentiviral vector and the cytomegalovirus (CMV) promoter used to drive expression of F9, it would be interesting to examine whether any part of this sequence is actively translated. These analyses may be particularly insightful in studies of immunogenicity.
Further analysis of this dataset will help elucidate the effect of codon usage, codon context and possibly other factors on translational kinetics. By looking at the global rate at which each codon is translated, and examining adjacent sequences on a transcriptome level, it may be possible to predict the translational kinetics of recombinant genes and to make inferences on whether cotranslational folding may be affected. This may be particularly important in gene therapy applications where the cell type expressing the gene of interest may be different from the naturally expressing cells, e.g. expression of coagulation factor VIII from hepatocytes in gene therapy. A recent study in yeast13 showed promising results in this direction; however, increasing availability of ribosome profiling datasets from other cell types will allow further comparisons. A unique feature of this dataset that may be pivotal in these types of studies is the presence of F9 in two genes with very different codon usage.
WT (RefSeq NM_000133.3) and CO (accessible at https://github.com/FDA/Ribosome-Profiling F9_opt1_construct_100bpUTRs.fasta)14 F9 ORFs were sub-cloned into pcDNA3.1/V5-His-TOPO (Invitrogen/Life Technologies) according to the manufacturer’s instructions to generate pcDNA3.1-F9-V5-His plasmids. Each fusion construct (WTF9-V5-His and COF9-V5-His) was sub-cloned into a lentiviral vector pTK642 (gift from Dr. Kafri, University of North Carolina at Chapel Hill) at the PacI/SfiI site.
Human embryonic kidney cells (HEK293T; ATCC) were grown in Dulbecco’s Modified Eagle Medium (Quality Biological, Inc) with 1% L-glutamine (Quality Biological), 1% penicillin-streptomycin (Hyclone) and 10% fetal bovine serum (Quality Biological) at 37°C in 5% CO2. HEK293T cells stably expressing WT or CO FIX were established following transduction with lentiviral vectors, as previously described15.
An equivalent number of cells were plated in T-flasks and supplemented with 10 ng/ml of Vitamin K3, one day prior to all experiments. The culture medium was replaced with Opti-MEM Reduced Serum Medium (Life Technologies) at approximately 80–90% cell confluency and cells were harvested after an additional 24 hours of incubation.
Ribosome profiling was conducted as described previously7 using the Illumina TruSeq Ribo Profile (Mammalian) Kit according to the manufacturer’s instructions, with modifications in harvest, RNA isolation/purification (isopropanol isolation was used to improve the yield) and size selection of ribosome protected fragments (~20–32 nt). During harvest, media was carefully removed, and cells were immediately flash-frozen. All equipment used from this point on was pre-chilled. Cells were quickly scraped into 1 ml of ice-cold lysis buffer (5X Mammalian Polysome Buffer, 10% Triton-X100, 100 mM DTT, DNase I, Nuclease-free water) and homogenized on ice by passing through a 26G needle 10 times. Lysate was then spun at 4°C for 10 minutes at 20,000 × g. Supernatant was aliquoted into cryovials and immediately frozen in liquid nitrogen for future use. Samples were sequenced using Illumina HiSeq 2500.
The complete ribosome profiling analysis pipeline is outlined in Figure 1. Sequencing data were pre-processed and aligned as described by Alexaki et al.5 and in the step-by-step guide found in the README.txt accessible on GitHub.
Colored arrows indicate steps that first require execution of utility script (blue and yellow) or require manual input by the user (red). Pipeline steps are represented as ovals (main step) or pentagons (validation / analysis step). Rectangles represent input / output data. UTR: untranslated region, CDS: coding sequence, RPF: ribosome protected fragments, RPKM: reads per kilobase of transcript per million mapped reads.
RPF sequences were analyzed based on fragment length (Figure 2a), alignment distribution between coding sequences (CDSs) and 5’- and 3’-UTRs (Figure 2b), triplet periodicity (Figure 3a) and reading frame (Figure 3b). RPFs 20–22 nt and 27–29 nt in length were used for further analysis with a P-site offset of 12 nucleotides from the 5’ end of the fragment. Pearson and Spearman correlations were used to evaluate the reproducibility between replicates using a common subset of moderately to highly expressed genes (reads per kilobase of transcript per million mapped reads, RPKMCDS ≥10) and considering reads with the ribosome A site annotated at least 20 nt downstream of the coding sequence start codon (Table 1). Both Pearson and Spearman coefficients show strong correlation between experimental replicates.
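The fragment-length filter and P-site offset described above can be sketched as follows. This is a minimal illustration rather than the pipeline's own code: the function names, the 0-based transcript coordinates and the example reads are assumptions made for the example; only the length windows (20–22 nt and 27–29 nt) and the 12-nt offset come from the text.

```python
# Fragment-length windows and P-site offset as stated in the text.
ALLOWED_LENGTHS = set(range(20, 23)) | set(range(27, 30))
P_SITE_OFFSET = 12  # nucleotides from the 5' end of the fragment


def p_site_position(five_prime_pos, length):
    """Return the transcript coordinate of the P site, or None if the
    fragment length falls outside the accepted windows."""
    if length not in ALLOWED_LENGTHS:
        return None
    return five_prime_pos + P_SITE_OFFSET


def reading_frame(p_site_pos, cds_start):
    """Frame (0, 1 or 2) of a P-site position relative to the CDS start."""
    return (p_site_pos - cds_start) % 3


# Hypothetical reads: (5' end position on the transcript, fragment length).
reads = [(88, 28), (91, 21), (95, 25), (100, 29)]
cds_start = 100  # hypothetical 0-based coordinate of the start codon

for pos, length in reads:
    p_site = p_site_position(pos, length)
    if p_site is not None:
        print(pos, length, p_site, reading_frame(p_site, cds_start))
```

Under these assumptions, a read whose 5′ end lies 12 nt upstream of the start codon is assigned a P site on the first nucleotide of the start codon, and the 25-nt read is discarded.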
(a) Fragment size distribution of Ribo-seq and RNA-seq reads. The average of 6 experiments (3 WT and 3 CO F9) was plotted, s.e.m. are shown. (b) Distribution of Ribo-seq (left) and RNA-seq (right) reads in mRNA coding regions (CDSs) and untranslated (5’UTR and 3’UTR) regions. The average of 6 experiments (3 WT and 3 CO F9) was plotted, s.e.m. are shown.
(a) Profiles of the 5′ end positions of all 20–22 nt (top) and 27–29 nt (bottom) fragments relative to the start codon of their genes. The average of 6 experiments (3 WT and 3 CO F9) was plotted. (b) Positions of 20–22 nt and 27–29 nt fragments relative to the reading frame of the Ribo-seq (left) and RNA-seq (right) reads. The average of 6 experiments (3 WT and 3 CO F9) was plotted, s.e.m. are shown.
RPKM of each gene in the Ribo-seq and RNA-seq datasets was calculated, considering reads with the ribosome A site annotated at least 20 nt downstream of the start codon. A comparison between each pair of experiments within the 3 replicates was performed.
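The RPKM calculation and the replicate comparison can be sketched in plain Python as shown here. This is an illustrative sketch, not the pipeline's implementation: the gene names, read counts and library sizes are hypothetical, the correlation helpers are written out by hand to keep the example self-contained, and the A-site position filter mentioned above is not reproduced. Only the RPKM ≥ 10 threshold and the choice of Pearson and Spearman coefficients come from the text.

```python
import math


def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: gene -> (CDS length in nt, reads rep 1, reads rep 2).
genes = {
    "geneA": (1500, 300, 320),
    "geneB": (900, 45, 40),
    "geneC": (2400, 1200, 1100),
    "geneD": (1200, 2, 3),
}
total_rep1, total_rep2 = 1_000_000, 1_100_000  # hypothetical library sizes

rep1, rep2 = [], []
for length, n1, n2 in genes.values():
    r1, r2 = rpkm(n1, length, total_rep1), rpkm(n2, length, total_rep2)
    if r1 >= 10 and r2 >= 10:  # keep the common subset with RPKM >= 10
        rep1.append(r1)
        rep2.append(r2)

print(pearson(rep1, rep2), spearman(rep1, rep2))
```

Reporting both coefficients is useful because Pearson is sensitive to a few very highly expressed genes, whereas Spearman depends only on the rank order.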
The quality of the sequencing files is presented in Table 2. A pipeline was created to process the data (Figure 1). A number of steps allow for validation of the data and confirmation of their quality. The fragment length distributions for the whole genome were plotted, indicating that the vast majority of the fragments from the Ribo-seq data are either 20–21 or 27–28 nucleotides in length (Figure 2a), and, as expected, the RNA-seq data have a flatter distribution. The distribution of the Ribo-seq data in the UTRs and CDSs of the mRNA was also plotted. As expected, most of the sequences aligned within the CDSs (Figure 2b), while a smaller fraction of the RNA-seq data aligned with the CDSs. It should be noted that, as the 3’ UTR and 5’ UTR are typically shorter in length than the CDSs, it is not surprising that about 60% of the RNA-seq data align with the CDSs (Figure 2b). In addition, Ribo-seq data exhibit periodicity, characteristic of the RPFs (Figure 3a and Figure 3b), which is not observed in the RNA-seq data (Figure 3b). In accordance with previously published data16, we can infer that the 5′-most peaks in Figure 3a represent ribosomes with the start codon in the P site and the second codon in the A site, for both large and short fragments. Very tight correlation between the experiments, both for Ribo-seq and RNA-seq data, supports the reproducibility of the results (Table 1).
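The two validation summaries discussed above, the fragment length distribution and the split of reads between UTRs and CDSs, can be sketched as follows. The function names, coordinate convention (0-based, CDS end exclusive) and example values are assumptions for illustration and are not taken from the pipeline.

```python
from collections import Counter


def length_distribution(lengths):
    """Fraction of reads observed at each fragment length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def region_of(position, cds_start, cds_end):
    """Classify a transcript coordinate as 5'UTR, CDS or 3'UTR.

    cds_start is inclusive and cds_end is exclusive (0-based).
    """
    if position < cds_start:
        return "5'UTR"
    if position < cds_end:
        return "CDS"
    return "3'UTR"


def region_distribution(positions, cds_start, cds_end):
    """Fraction of read positions falling in each transcript region."""
    counts = Counter(region_of(p, cds_start, cds_end) for p in positions)
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


# Hypothetical fragment lengths and read positions on one transcript.
print(length_distribution([28, 28, 21, 30]))
print(region_distribution([5, 120, 300, 950], cds_start=100, cds_end=900))
```

For Ribo-seq data one would expect the first summary to peak at the two fragment-length windows and the second to be dominated by the CDS.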
Sample ID, index, yield, number of clusters, percent Q30 and above and mean Q score for all sequencing experiments.
Sequencing for 3 replicates of RNA-seq and Ribo-seq of HEK293T cells stably expressing WT and CO FIX was performed by Eurofins Genomics (Louisville, KY, USA), resulting in 12 raw data files (3 WT and 3 CO F9 for both Ribo-seq and RNA-seq) in FASTQ format. Raw data are accessible at the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA591214. File names, SRA accession numbers (experiment and sample) and descriptions of data are summarized below in Table 3.
Filenames, SRA experiment accession, SRA sample accession and brief description of the 12 Ribo-seq and RNA-seq FASTQ files. All data files are accessible from SRA BioProject accession PRJNA591214. Data files represent three replicates of each condition (WT F9 Ribo-seq, WT F9 RNA-seq, CO F9 Ribo-seq and CO F9 RNA-seq).
The custom ribosome profiling analysis pipeline has been deposited in GitHub in the FDA/Ribosome-Profiling directory14. Raw data files may be accessed from SRA and downloaded to the ‘./Ribosome_profiling/Raw_data/X/’ folder. In our descriptions and instructions, ‘X’ is replaced with ‘S12’, but the user may choose any designation they prefer. Detailed instructions for running the data analysis pipeline are included in the ‘README.txt’ file.
Execution of the pipeline requires the following tools (version tested) be installed on the user’s system: Python (3.7.6) (https://www.python.org) (Python Software Foundation, Wilmington, DE, USA) and modules pysam (0.15.3) (https://github.com/pysam-developers/pysam) and biopython (1.77) (https://biopython.org/), GFF Utilities (gffread v0.12.1) (http://ccb.jhu.edu/software/stringtie/gff.shtml) (Johns Hopkins University, Baltimore, MD, USA), Bowtie (1.0.0) (http://bowtie-bio.sourceforge.net/index.shtml) (Johns Hopkins University, Baltimore, MD, USA), HISAT2 (2.1.0) (https://ccb.jhu.edu/software/hisat2/manual.shtml) (Johns Hopkins University, Baltimore, MD, USA), FASTX-Toolkit (0.0.14) (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA), Samtools (1.7 using htslib 1.7) (http://www.htslib.org/) (Genome Research Limited, Hinxton, Cambridgeshire, UK).
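Before running the pipeline, it can be convenient to confirm that the required command-line tools are on the PATH. The sketch below is an assumption-laden convenience check, not part of the deposited pipeline: the executable names are those commonly installed by each package, `fastx_clipper` stands in as one representative FASTX-Toolkit program, and versions are not verified.

```python
import shutil

# Executable names are assumptions about a typical installation.
# fastx_clipper is used as a representative FASTX-Toolkit executable.
REQUIRED_TOOLS = [
    "python3",
    "gffread",
    "bowtie",
    "hisat2",
    "fastx_clipper",
    "samtools",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on the PATH.")
```

The Python modules pysam and biopython would need a separate check, for example by attempting to import them in the interpreter used to run the pipeline.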
NCBI BioProject: Ribosome profiling of HEK-293T cells stably expressing wild-type and codon-optimized coagulation factor IX. Accession number PRJNA591214; https://identifiers.org/NCBI/bioproject:PRJNA591214.
This project collates the raw data, held at the NCBI Sequence Read Archive (SRA).
The pipeline, including the code used to process the presented dataset and instructions for use, is available: https://github.com/FDA/Ribosome-Profiling
Archived pipeline at time of publication: https://doi.org/10.5281/zenodo.3678709 14.
License: MIT License.
The authors would like to thank Dr. Nicholas T. Ingolia and Dr. Estelle Russek-Cohen, CDER, FDA for very useful discussions. This work was supported by funds from the U.S. Food and Drug Administration Chief Scientist grant and in part by an appointment to the Research Participation Program at the Center for Biologics Evaluation and Research administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the U.S. Food and Drug Administration (C.K.-S.). Additionally, this work was in part supported by the National Institutes of Health grant HL151392 (A.A.K.). This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH (M.D.).
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Ribosome profiling library creation and data analysis.
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a useable and accessible format?
Yes
References
1. Alexaki A, Hettiarachchi G, Athey J, Katneni U, et al.: Effects of codon optimization on coagulation factor IX translation and structure: Implications for protein and gene therapies. Scientific Reports. 2019; 9 (1).
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Protein trafficking, protein expression and folding (Collawn); mRNA structure, translational expression, and silent polymorphisms (Bartoszewski).
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Partly
Are sufficient details of methods and materials provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Partly
References
1. Alexaki A, Hettiarachchi G, Athey J, Katneni U, et al.: Effects of codon optimization on coagulation factor IX translation and structure: Implications for protein and gene therapies. Scientific Reports. 2019; 9 (1).
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Ribosome profiling library creation and data analysis.
Is the rationale for creating the dataset(s) clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a useable and accessible format?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Translational control of gene expression
Version 2 (revision): 21 Sep 20
Version 1: 10 Mar 20