Ribosome profiling of HEK293T cells overexpressing codon optimized coagulation factor IX

Ribosome profiling provides the opportunity to evaluate translation kinetics at codon-level resolution. Here, we describe ribosome profiling data generated from two HEK293T cell lines. The ribosome profiling data are composed of Ribo-seq data (mRNA sequencing data from ribosome protected fragments) and RNA-seq data (total RNA sequencing). The two HEK293T cell lines each express a version of the F9 gene; both versions are translated into proteins with identical amino acid sequences. However, these F9 genes vary drastically in their codon usage and predicted mRNA structure. We also provide the pipeline that we used to analyze the data. Further analysis of this dataset holds great potential, as it can be used i) to unveil insights into the composition and regulation of the transcriptome, ii) for comparison with other ribosome profiling datasets, iii) to measure the rate of protein synthesis across the proteome and identify differences in elongation rates, iv) to discover previously unidentified translated peptides, v) to explore the effects of codon usage or codon context on translational kinetics and vi) to investigate cotranslational folding. Importantly, a unique feature of this dataset, compared to other available ribosome profiling data, is the presence of the F9 gene in two very distinct coding sequences.


Introduction
The ribosome profiling (footprinting) technique has only been around for a decade 1 but has already contributed tremendously to our understanding of translation efficiency and kinetics. Initially developed to systematically monitor protein translation in yeast 1, it has since been adapted to work in a range of organisms 2,3 and to tackle a variety of questions. Ribosome profiling data typically consist of a set of sequences of ribosome protected fragments (RPFs), designated as Ribo-seq data, accompanied by sequences from total RNA (RNA-seq). The availability of Ribo-seq and RNA-seq data from the same sample provides a treasure trove of information, enabling quantitative study of the translation efficiency, rate and kinetics of every mRNA sequence in the pool 4. Given that these sequences cover the entire transcriptome, and also include tRNA and rRNA, typically only a fraction of the data is presented and constructively used within its initial publication. Further analyses and comparisons of different ribosome profiling datasets can therefore yield significant new information.
We recently conducted a ribosome profiling study to examine the translation kinetics of blood coagulation factor IX 5, a protein of great pharmaceutical interest. Two human embryonic kidney 293T (HEK293T) cell lines were lentivirally transduced, one with the wild type (WT) version of the gene and one with a codon optimized (CO) F9 5. Codon optimization is a widely used technique that aims to increase protein expression levels by replacing multiple codons within a coding sequence with synonymous ones. In doing so, the amino acid sequence of the protein remains unaltered; these changes were therefore assumed to be inconsequential for the structure and function of the protein. However, this is not always true; through our ribosome profiling study, we showed that these synonymous changes drastically altered translational kinetics and led to protein conformational changes 5.
The translational kinetics of the F9 variants, along with the control genes GAPDH and ACTB, were analyzed in detail in the original publication 5. Similarly, any other gene of interest can be investigated in this dataset in terms of its rate of synthesis and translational kinetics; genes across the entire transcriptome can be compared to each other. Since several other HEK293T ribosome profiling datasets are available, these could be used to examine the reproducibility of the results 6. Furthermore, by looking into ribosome profiling datasets from other cell types, such as other human cells 7 and/or across species, it would be valuable to examine whether a given gene maintains the same translation kinetics or whether there are significant differences that could be reflected in the conformation of the protein. Clearly, since a rather large inter-experiment variation is expected, the accumulation of several ribosome profiling datasets would be very useful for this type of analysis.
Innovative computational approaches to analyzing ribosome profiling data have led to the identification of novel coding sequences (CDSs) that lead to the production of previously unidentified peptides and variants of known proteins 8. Such coding sequences may be found in what is typically designated as untranslated regions (UTRs) of the mRNA, particularly the 5'UTRs, and may originate from non-AUG start sites 9-11. However, such approaches have not yet been applied to this dataset, and it would be intriguing to see if they could lead to new discoveries 12. Importantly, since the genome of the HEK293T cells used to generate this dataset contains part of the lentiviral vector and the cytomegalovirus (CMV) promoter that drives expression of F9, it would be interesting to examine whether any part of this sequence is actively translated. These analyses may be particularly insightful in studies of immunogenicity.
Further analysis of this dataset will help elucidate the effects of codon usage, codon context and possibly other factors on translational kinetics. By looking at the global rate at which each codon is translated, and examining adjacent sequences at the transcriptome level, it may be possible to predict the translational kinetics of recombinant genes and to make inferences about whether cotranslational folding may be affected. This may be particularly important in gene therapy applications, where the cell type expressing the gene of interest may differ from the naturally expressing cells, e.g. expression of coagulation factor VIII from hepatocytes in gene therapy. A recent study in yeast 13 showed promising results in this direction; however, the increasing availability of ribosome profiling datasets from other cell types will allow further comparisons. A unique feature of this dataset that may be pivotal in these types of studies is the presence of F9 in two genes with very different codon usage.
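As a toy illustration of synonymous recoding (the sequences below are short hypothetical examples, not the actual WT or CO F9 coding sequences), codon usage can be tallied directly from a CDS:

```python
# Sketch (hypothetical sequences): two synonymous CDSs encode the same
# peptide (Leu-Gly-Arg) but have entirely different codon usage.
from collections import Counter

def codons(cds):
    """Split a CDS string into its in-frame codons, dropping any trailing
    partial codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

wt = "CTTGGAAGA"  # Leu-Gly-Arg, one set of codon choices
co = "CTGGGCCGG"  # Leu-Gly-Arg, synonymous codon choices

print(Counter(codons(wt)))
print(Counter(codons(co)))
```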

Amendments from Version 1
The ribosome profiling data analysis pipeline described in the manuscript has been extensively updated in terms of its usability and composition of tools. We have updated the alignment tool from Tophat to HISAT2, changed the code from Python 2.7 to Python 3.7 and updated many of the scripts to make the code more readable and generate error messages to assist in debugging. Furthermore, we have bundled all of the scripts involved in the pipeline into two more easily run bash scripts to aid in usability. The GitHub documentation and usage notes have also been updated to more explicitly describe versions of assemblies and tools used in the pipeline. We have updated the figures in the paper with data generated from the new pipeline and have added an additional statistic in the data validation section, the Spearman's rank correlation coefficient.
Any further responses from the reviewers can be found at the end of the article.

Methods
F9 constructs were cloned (Invitrogen/Life Technologies) according to the manufacturer's instructions to generate pcDNA3.1-F9-V5-His plasmids. Each fusion construct (WTF9-V5-His and COF9-V5-His) was subcloned into the lentiviral vector pTK642 (a gift from Dr. Kafri, University of North Carolina at Chapel Hill) at the PacI/SfiI site.
An equivalent number of cells was plated in T-flasks and supplemented with 10 ng/ml of vitamin K3 one day prior to all experiments. The culture medium was replaced with Opti-MEM Reduced Serum Medium (Life Technologies) at approximately 80-90% cell confluency, and cells were harvested after an additional 24 hours of incubation.

Ribosome profiling
Ribosome profiling was conducted as described previously 7 using the Illumina TruSeq Ribo Profile (Mammalian) Kit according to the manufacturer's instructions, with modifications to harvest, RNA isolation/purification (isopropanol isolation was used to improve the yield) and ribosome protected fragment size selection (~20-32 nt). During harvest, media was carefully removed, and cells were immediately flash-frozen. All equipment used from this point forward was pre-chilled. Cells were quickly scraped into 1 ml of ice-cold lysis buffer (5X Mammalian Polysome Buffer, 10% Triton X-100, 100 mM DTT, DNase I, nuclease-free water) and homogenized on ice by passing through a 26G needle 10 times. The lysate was then spun at 4°C for 10 minutes at 20,000 × g. The supernatant was aliquoted into cryovials and immediately frozen in liquid nitrogen for future use. Samples were sequenced on an Illumina HiSeq 2500.
The complete ribosome profiling analysis pipeline is outlined in Figure 1. Sequencing data were pre-processed and aligned as described by Alexaki et al. 5 and in the step-by-step guide found in the README.txt accessible on GitHub.
RPF sequences were analyzed based on fragment length (Figure 2a), alignment distribution between coding sequences (CDSs) and 5'- and 3'-UTRs (Figure 2b), triplet periodicity (Figure 3a) and reading frame (Figure 3b). RPFs 20-22 nt and 27-29 nt in length were used for further analysis, with a P-site offset of 12 nucleotides from the 5' end of the fragment. Pearson and Spearman correlations were used to evaluate the reproducibility between replicates, using a common subset of moderately to highly expressed genes (reads per kilobase of transcript per million mapped reads, RPKM CDS ≥ 10) and considering reads with the ribosome A site annotated at least 20 nt downstream of the coding sequence start codon (Table 1). Both Pearson and Spearman coefficients show strong correlation between experimental replicates.
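To make the offset convention concrete, the following minimal sketch (not part of the deposited pipeline; coordinates are hypothetical) maps RPF 5' ends to P-site positions using the fixed 12-nt offset and the retained fragment lengths:

```python
# Minimal sketch (not the deposited pipeline): assign P-site positions to
# ribosome protected fragments using a fixed 12-nt offset from the 5' end,
# keeping only the fragment lengths retained for analysis (20-22, 27-29 nt).
P_SITE_OFFSET = 12
KEPT_LENGTHS = set(range(20, 23)) | set(range(27, 30))

def p_site_positions(reads):
    """reads: iterable of (five_prime_pos, length) tuples for forward-strand
    alignments; returns transcript coordinates of the inferred P-site."""
    return [pos + P_SITE_OFFSET
            for pos, length in reads
            if length in KEPT_LENGTHS]

# Hypothetical alignments: (5' position on transcript, fragment length).
reads = [(100, 28), (103, 21), (110, 25), (130, 29)]
print(p_site_positions(reads))  # the 25-nt fragment is filtered out
```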

Dataset validation
The quality of the sequencing files is presented in Table 2. A pipeline was created to process the data (Figure 1). A number of steps allow for validation of the data and confirmation of their quality. The fragment length distributions for the whole genome were plotted, indicating that the vast majority of the fragments from the Ribo-seq data are either 20-21 or 27-28 nucleotides in length (Figure 2a); as expected, the RNA-seq data have a flatter distribution. The distribution of the Ribo-seq data across the UTRs and CDSs of the mRNA was also plotted. As expected, most of the sequences aligned within the CDSs (Figure 2b), while a smaller fraction of the RNA-seq data aligned with the CDSs. It should be noted that, as the 3' UTR and 5' UTR are typically shorter than the CDS, it is not surprising that about 60% of the RNA-seq data align with the CDSs (Figure 2b). In addition, the Ribo-seq data exhibit the periodicity characteristic of RPFs (Figure 3a and Figure 3b), which is not observed in the RNA-seq data (Figure 3b). In accordance with previously published data 16, we can infer that the 5′-most peaks in Figure 3a represent ribosomes with the start codon in the P site and the second codon in the A site, for both large and short fragments. The very tight correlation between the experiments, both for Ribo-seq and RNA-seq data, supports the reproducibility of the results (Table 1).
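The reading-frame check underlying Figure 3b can be summarized in a few lines; the sketch below (not the deposited pipeline; positions are hypothetical) tallies inferred P-sites by frame relative to the annotated start codon, where strong enrichment of one frame indicates triplet periodicity:

```python
# Sketch (hypothetical positions): count P-site positions per reading frame
# (0, 1, 2) relative to the CDS start to summarize triplet periodicity.
from collections import Counter

def frame_counts(p_sites, cds_start):
    """Return a Counter of reading frames for P-site positions on one
    transcript; RPF data should be heavily enriched for a single frame."""
    return Counter((pos - cds_start) % 3 for pos in p_sites)

# Hypothetical P-sites on a transcript whose CDS starts at position 50.
p_sites = [50, 53, 56, 59, 60, 62, 65]
counts = frame_counts(p_sites, cds_start=50)
print(counts)  # most P-sites fall in frame 0
```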

Data records
Sequencing for 3 replicates of RNA-seq and Ribo-seq of HEK293T cells stably expressing WT and CO FIX was performed by Eurofins Genomics (Louisville, KY, USA), resulting in 12 raw data files (3 WT and 3 CO F9 for both Ribo-seq and RNA-seq) in FASTQ format. Raw data are accessible at the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA591214. File names, SRA accession numbers (experiment and sample) and descriptions of data are summarized below in Table 3.

Usage notes
The custom ribosome profiling analysis pipeline has been deposited in GitHub in the FDA/Ribosome-Profiling directory 14 . Raw data files may be accessed from SRA and downloaded to the './Ribosome_profiling/Raw_data/X/' folder. In our descriptions and instructions, 'X' is replaced with 'S12', but the user may choose any designation they prefer. Detailed instructions for running the data analysis pipeline are included in the 'README.txt' file.
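Assuming the SRA Toolkit is available, the folder layout expected by the pipeline can be prepared and the download commands generated along these lines (the run accessions below are placeholders, not the actual accessions from Table 3):

```python
# Sketch (illustrative only): create the folder layout expected by the
# pipeline and print SRA Toolkit commands to fetch each run. 'S12' matches
# the designation used in the instructions; the run accessions are
# placeholders, not the actual SRA accessions listed in Table 3.
from pathlib import Path

raw_dir = Path("./Ribosome_profiling/Raw_data/S12")
raw_dir.mkdir(parents=True, exist_ok=True)

for run in ["SRRXXXXXXX1", "SRRXXXXXXX2"]:  # placeholder accessions
    print(f"prefetch {run} && fasterq-dump {run} -O {raw_dir}")
```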
Execution of the pipeline requires the tools listed in the README.txt, including Python 3.7 and HISAT2, to be installed on the user's system; the versions tested are documented there.

Software availability
The pipeline, including the code used to process the presented dataset and instructions for use, is available at: https://github.com/FDA/Ribosome-Profiling
License: MIT License.

Open Peer Review
of the RNA-seq and Ribo-seq of the two cell lines. The correlations between the pairs of experiments seemed quite good, and we believe this is an important type of study; however, we do have two concerns. First of all, the housekeeping control genes they utilize are GAPDH and ACTB, both of which are translated on free ribosomes in the cytosol. F9, however, is a secreted protein that is translated on ER ribosomes. Therefore, a rationale for the appropriateness of the two controls needs to be discussed. Secondly, they used a Box-Cox variance-stabilizing transformation for raw data followed by a Kolmogorov-Smirnov test, which tests for probability distribution functions but is a less selective normality test in our view. Some more detail here about the rationale for the statistical analyses is therefore warranted, and whether additional controls and samples would have any effect on the outcomes of the studies should be discussed.
Thank you very much for raising this interesting question. Although they are not secreted proteins like F9, ACTB and GAPDH were chosen because they have a relatively high level of expression and therefore generate sufficient sequencing reads to produce a well resolved ribosome profile. Moreover, the expression of these two genes is stable and does not change due to the transfection or cell culture growth. It is not our intention to make any conclusions regarding the translation kinetics of secreted versus non-secreted proteins. Rather, the control genes are compared to themselves between experimental groups (cells expressing WT F9 and cells expressing CO F9). In this context, we believe that ACTB and GAPDH are appropriate to use as control genes.
Regarding the statistical method, we want to point out that the Kolmogorov-Smirnov method is considered more conservative than the methods mentioned by the reviewer. Therefore, if by using it we are able to detect a difference between two cumulative curves, we can be more confident that the difference is real. Second, our aim is not to test for normality, but rather to compare two distributions. The main aims of the normalization step via the Box-Cox transformation are to stabilize the variance (of the originally skewed distribution) and to put the distributions on a comparable scale, so that we can compare the cumulative plots.
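For illustration, the transformation-and-test procedure described in this response can be sketched with SciPy on synthetic data (the samples below are simulated, not drawn from the dataset):

```python
# Illustrative sketch of the described procedure: apply a Box-Cox
# variance-stabilizing transformation to two skewed (simulated) samples,
# then compare their distributions with a two-sample Kolmogorov-Smirnov
# test. Data here are synthetic, not from the dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.lognormal(mean=1.0, sigma=0.5, size=500)  # skewed sample 1
b = rng.lognormal(mean=1.2, sigma=0.5, size=500)  # skewed sample 2

# Box-Cox requires strictly positive data; estimate lambda on sample 1
# and reuse it so both samples are transformed to a comparable scale.
a_t, lam = stats.boxcox(a)
b_t = stats.boxcox(b, lmbda=lam)

stat, p = stats.ks_2samp(a_t, b_t)
print(f"KS statistic = {stat:.3f}, p = {p:.2e}")
```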
If part of the intended contribution of this data note is the pipeline, it would be helpful to at least provide a bash script that ties together the different scripts and flexibly accepts input FASTQ files. Currently, with all the hard-coding present, it makes it difficult to reuse these scripts, especially when input references or sequence files vary from those used when analyzing this particular dataset.

Methods:
○ A Pearson's correlation assumes the data are normally distributed. As sequencing data follow a negative binomial distribution, the use of a Spearman rank-order coefficient would be more appropriate. It might be helpful to clarify that this is for comparing biological replicates.
○ Is there a reason the A-site offset is set at a strict 15 nt? Recent methods, such as those found in the RiboWaltz (https://doi.org/10.1371/journal.pcbi.1006169) 2 package allow for a more optimized P-site offset determination that could probably be applied to A-site offset determination. On that note, is there a downstream reason you are interested in calculating the offset of the A-site? It looks like P-sites were used when calculating periodicity in the figures and scripts and thus it is not clear why the A-site offset is mentioned.

○ The pipeline uses deprecated software (TopHat has been deprecated for nearly 5 years now) and should be updated to use a more accurate splice-aware option (such as STAR or HISAT2). Figures might need to be updated as appropriate. Is the installation of dependencies included in a script, or does the user need to handle that? For example, BioPython and pysam are included as dependencies in the Python scripts, but I don't see them listed in the manuscript.
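The Pearson-versus-Spearman point raised under Methods above can be illustrated with a toy example (the RPKM values below are synthetic): a single very highly expressed gene can dominate Pearson's r, while Spearman's rho depends only on ranks.

```python
# Toy illustration (synthetic values): Pearson's r is dominated by one
# highly expressed gene, while Spearman's rho uses only the ranks.
from scipy import stats

# Hypothetical RPKM values for the same genes in two biological replicates,
# including one very highly expressed gene.
rep1 = [12, 15, 30, 45, 80, 5000]
rep2 = [14, 13, 28, 50, 75, 9000]

r, _ = stats.pearsonr(rep1, rep2)
rho, _ = stats.spearmanr(rep1, rep2)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Here the outlier pushes Pearson's r very close to 1 even though the ranks of two low-expression genes are swapped, which Spearman's rho reflects.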

GitHub Documentation:
○ To aid in readability, some formatting updates (e.g., new lines in example code) would be helpful. Currently, there are several commands on the same line as a comment, making the commands and comments difficult to read. Using markdown syntax to display the code sections would aid in readability as well.

○ Explicitly state the gene annotation used for the preparation of this dataset. Comprehensive, basic gene, etc.? The same applies to the GFF3 file. It is not clear from the documentation itself what versions were used, and stating this would be helpful from an archiving/reproducibility standpoint. The same goes for listing the software versions used for processing the dataset as presented in the paper.

Code:
○ I could not get the trim-adapters.py script to work using the deposited data. I had empty trimmed output files, and no error information was displayed to aid in debugging.

○ As far as functionality, all of the indexing scripts appeared to function, but I was unable to test past the trim-adapters.py script due to the issue mentioned above.
○ If part of the intended contribution of this data note is the pipeline, it would be helpful to at least provide a bash script that ties together the different scripts and flexibly accepts input FASTQ files. Currently, with all the hard-coding present, it makes it difficult to reuse these scripts, especially when input references or sequence files vary from those used when analyzing this particular dataset.

We appreciate the suggestion and have made changes accordingly. The pipeline has been modified to be more user friendly, and updates have been made as described in the response to Reviewer 1.

Methods:
A Pearson's correlation assumes the data are normally distributed. As sequencing data follow a negative binomial distribution, the use of a Spearman rank-order coefficient would be more appropriate. It might be helpful to clarify that this is for comparing biological replicates.

Thank you for pointing this out. We have added a Spearman rank-order coefficient to Table 1.

The pipeline uses deprecated software (TopHat has been deprecated for nearly 5 years now) and should be updated to use a more accurate splice-aware option (such as STAR, HISAT2). Figures might need to be updated as appropriate.

We agree with your concern. When we started developing our pipeline, TopHat was the software of choice, and we retained it to allow duplication of the data in our published paper [1]. We have now updated the pipeline to use HISAT2; all of the figures have been updated accordingly. The readme file and manuscript have been updated to reflect the complete list of software dependencies and their versions.

GitHub Documentation:
To aid in readability, some formatting updating (new lines in example code) would be helpful. Currently, there are several commands on the same line as a comment, making the commands and comments difficult to read. Using markdown syntax to display the code sections would aid in readability as well.

We thank you very much for pointing this out. We have removed all unnecessary commands and lines of code that had been commented out. We believe this will aid in the readability of the code.
Explicitly state the gene annotation used for the preparation of this dataset. Comprehensive, basic gene, etc.? The same applies to the GFF3 file. It is not clear from the documentation itself what versions were used, and stating this would be helpful from an archiving/reproducibility standpoint. The same goes for listing the software versions used for processing the dataset as presented in the paper.

We agree with your concern. We have updated the GitHub documentation to explicitly state which annotation and assembly versions were used in the analysis of this dataset.
I could not get the trim-adapters.py script to work using the deposited data. I had empty trimmed files output and no error information was displayed to aid in debugging. As far as functionality, all of the indexing scripts appeared to function, but I was unable to test past the trim-adapters.py script due to the issue mentioned above.
Thank you very much for pointing this out. This is an interesting result. We have re-run the analysis pipeline using the new bash scripts and the code updated to Python 3.7, and the adapter trimming step ran correctly. Nevertheless, we have added an error message to the Trim_adapters.py script to notify the user if an error occurs, to aid in the debugging process.
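For illustration only (the function name and structure below are hypothetical and may differ from the actual Trim_adapters.py code), a check that surfaces a silent empty-output failure might look like:

```python
# Hedged sketch (hypothetical names, not the actual Trim_adapters.py code):
# after adapter trimming, warn the user if the output is empty so that a
# silent failure like the one the reviewer encountered is surfaced.
import sys

def check_trimmed_output(n_input_reads, n_trimmed_reads, path):
    """Return False and emit a diagnostic if trimming produced no reads."""
    if n_trimmed_reads == 0:
        print(f"ERROR: no reads written to {path} "
              f"({n_input_reads} reads in). Check the adapter sequence "
              f"and FASTQ formatting.", file=sys.stderr)
        return False
    return True

print(check_trimmed_output(1000, 0, "trimmed.fastq"))  # prints False
```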
Are the protocols appropriate and is the work technically sound? Partly.
We have updated the alignment tool used in the pipeline to HISAT2 and the code used in the individual scripts to Python 3.7. We hope that these updates to the pipeline have improved the appropriateness of the protocol.
Are sufficient details of methods and materials provided to allow replication by others? Partly.
We have bundled all scripts in the pipeline into two shell scripts, and there is now a warning message in the Trim_adapters.py script that notifies the user when it does not execute properly. We have also updated the dependency list to accurately reflect all additional software required, as well as the versions that were tested. We hope that these updates have made replication of the work easier for the user.
Are the datasets clearly presented in a useable and accessible format? Partly.
We hope that the updates to the pipeline described above have made the datasets more useable and accessible.