parazitCUB: An R package to streamline the process of investigating the adaptations of parasites' codon usage bias

Examining the association between parasites and their hosts at the codon level is important for understanding evolutionary processes and for predicting the characteristics of novel parasites. While diverse metrics and statistical analyses are available to explore codon usage bias (CUB), there is currently no dedicated tool for examining the co-adaptation of codon usage between parasites and hosts. We therefore introduce the parazitCUB R package, which addresses this challenge in a scalable and efficient manner: it can handle extensive datasets and analyze multiple parasites simultaneously with optimized performance. parazitCUB enables the elucidation of parasite-host interactions and the evolutionary patterns of parasites through the implementation of various indices, cluster analysis, multivariate analysis, and data visualization techniques. The tool is available at https://github.com/AliYoussef96/parazitCUB


Introduction
The transfer of genetic information from messenger RNAs (mRNAs) to proteins occurs through codons, which are sequences of three nucleotides representing amino acids. With the exception of methionine (Met) and tryptophan (Trp), amino acids can be encoded by multiple codons, resulting in codon degeneracy. Studies conducted on multiple organisms have shown that synonymous codons, which encode the same amino acid, are not used uniformly within genes or across different genes in the same genome, a phenomenon known as codon usage bias (CUB). 1 In every organism, specific preferred (optimal) codons exist, which are used more frequently in highly expressed genes than in genes with lower expression levels. 2 The codon usage of an organism is influenced by two major forces: mutation pressure and natural selection. Nucleotide composition, synonymous substitution rate, tRNA abundance, codon hydropathy, DNA replication initiation sites, gene length, and expression level are all known to impact CUB. 1

Intracellular parasites can be categorized as facultative or obligate. Facultative parasites can reproduce both inside and outside host cells, whereas obligate parasites are unable to replicate outside their host cells and depend solely on the host cell's resources for reproduction. 3 Previous studies have shown that translational selection and/or directed mutational pressure shape the codon usage of intracellular parasite genomes to optimize or deoptimize it towards the codon usage of their hosts. 3,4 Previous investigations have emphasized the significance of examining the interplay between parasites and the codon usage of their hosts. For instance, research conducted on the Influenza A virus (IAV) has demonstrated that understanding the patterns of codon usage in viruses might aid in the development of novel vaccines through Synthetic Attenuated Virus Engineering (SAVE), which weakens a virus by deoptimizing its codons. 5 Similarly, another study demonstrated that replacing natural codons with synonymous triplets possessing higher CpG frequencies can effectively inactivate poliovirus infectivity. 6 To understand how parasites interact with their hosts and how they evolve, it is crucial to investigate the composition of parasite genes at the codon or nucleotide level. This analysis could help uncover the mechanisms underlying parasite-host interactions and help predict the characteristics of newly discovered parasites.
A variety of metrics have been established to evaluate codon usage bias (CUB), including the effective number of codons (ENc), codon adaptation index (CAI), relative synonymous codon usage (RSCU), and translational selection index (P2-index). 7 Statistical analyses, such as correspondence analysis and the Neutrality plot, have been employed to explore the influence of selection and mutation on CUB. 7 Various tools and packages, such as coRdon, 9 CodonW (http://codonw.sourceforge.net), and BCAWT, 8 are available for assessing and measuring CUB. However, there is currently no specialized software designed to examine the co-adaptation of codon usage between parasites and their hosts. The only available package for studying the interaction of codon usage between viruses and hosts is the vhcub R package, created in 2019 by the first author of this study.
The infection of multiple organisms by various parasites is a widespread phenomenon, exemplified by the 1424 known viruses that can affect humans, as documented in the virus-host database. 9 Investigating the co-evolution of codon usage between parasites and their respective hosts is a challenging task in bioinformatics, but modern techniques and software advancements have made it feasible. To address this challenge in a scalable and efficient manner, the parazitCUB tool was developed. In contrast to its predecessor, the vhcub package, parazitCUB offers an expanded scope that goes beyond virus-host interactions: it covers the co-evolution of codon usage between parasites in general and their hosts. Notably, parazitCUB allows larger and more extensive datasets to be examined. It also enables the concurrent study of multiple parasites, a feature lacking in vhcub, which only permits the analysis of a single organism with its host. Moreover, the functions of parazitCUB have been significantly optimized, resulting in improved speed and performance. Finally, parazitCUB has a user-friendly interface that can be used even by those with limited proficiency in R programming.

Implementation
parazitCUB employs several packages, such as Biostrings, 10 seqinr, 11 and stringr, 12 to handle FASTA format files and perform DNA sequence modifications. For CUB and multivariate analysis, the package uses coRdon and factoextra, 13 as well as newly implemented functions. To visualize the data, parazitCUB uses ggplot2, 14 pheatmap, 15 and RColorBrewer. 16 parazitCUB extracts the DNA sequences in FASTA format for each organism under study and combines them into a single list. The package encompasses various indices for investigating CUB, as well as cluster analysis, multivariate analysis, and data visualization. A comprehensive list of the package's functions, along with their corresponding results, can be found in Table 1. The package workflow is summarized in Figure 1.

Operation
parazitCUB was developed in R, and the source code can be found on GitHub and is archived with Zenodo. 20 It works with Windows and most Linux operating systems.
    # install devtools if it is not available
    # install.packages("devtools")
    devtools::install_github("AliYoussef96/parazitCUB")

The parazitCUB package consists of six main branches, each serving a distinct purpose: nucleotide content analysis, CUB analysis at the gene level, CUB analysis at the codon level, cluster analysis, multivariate analysis, and data visualization. Within each branch, a range of methods is available for conducting CUB studies. The complete workflow of parazitCUB is illustrated in Figure 1. Detailed documentation on using parazitCUB is available at https://github.com/AliYoussef96/parazitCUB.

Use cases
parazitCUB offers a straightforward and highly customizable approach to investigating codon usage bias (CUB) in viruses (or any other type of parasite), their respective hosts, and the co-adaptation between them. To exemplify the capabilities of the package, the coding sequences of seven viruses, namely Influenza A, Influenza B, Influenza C, Influenza D, MERS, SARS-CoV, and SARS-CoV-2, were obtained from the NCBI virus gateway. 17 To showcase the package's ability to handle larger datasets, two variants of each virus were downloaded, one isolated from a non-human host and the other from a human host (except for Influenza D).
To begin, all the FASTA files for the viruses should be located in a single directory, such as a folder named "flufasta". All the files can then be read at once for parazitCUB analysis as follows:

    library("parazitCUB")
    library("ggplot2")
    fasta.files <- list.files("flufasta/", pattern = ".fasta", full.names = TRUE)
    # The sep parameter is used to manage long headers in FASTA files.
    list.virus <- read.virus(fasta.files, sep = "|")
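To make the role of the separator concrete, the following base-R sketch shows one way a FASTA file could be read while shortening long headers at the first "|". It is only an illustration: `read_fasta_simple` is a hypothetical helper, not a parazitCUB function, and how read.virus() actually parses headers is not described here. The sketch assumes every header is followed by at least one sequence line.

```r
# Hypothetical helper (not part of parazitCUB): read one FASTA file into a
# named character vector, keeping only the header text before the first `sep`.
read_fasta_simple <- function(path, sep = "|") {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]              # drop empty lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)             # record index for every line
  headers <- sub("^>", "", lines[is_header])
  # Shorten long headers: keep the field before the first separator
  short <- trimws(vapply(strsplit(headers, sep, fixed = TRUE),
                         `[`, character(1), 1))
  # Join the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record_id[!is_header],
                 paste, collapse = "")
  stats::setNames(toupper(as.vector(seqs)), short)
}

# Small demonstration on a temporary file with made-up sequences
demo_file <- tempfile(fileext = ".fasta")
writeLines(c(">seq1|Influenza A|segment 4", "ATGAAG", "GCATAA",
             ">seq2|Influenza A|segment 6", "ATGAATCCATAA"), demo_file)
virus_demo <- read_fasta_simple(demo_file)
nchar(virus_demo)
```

Note that the `pattern` argument of list.files() is a regular expression, so a stricter pattern such as `"\\.fasta$"` matches only names that end in ".fasta".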
The host FASTA file is imported next. We focused exclusively on the human host for this analysis (only one host can be selected). In this instance, the genes with the highest expression levels in human lung tissues were collected from the Human Protein Atlas project database.
    theHost <- read.host("humanfasta/Human.fasta", sep = "|")

A reasonable quality control step is to examine the coding sequence lengths in the virus datasets, in order to remove any bias (very long or very short sequences) that could negatively affect the results. parazitCUB provides a straightforward function to do that:

    QC.boxplot(list.virus)
This function creates a boxplot (Figure 2A) that illustrates the distribution of coding sequence lengths across the study. After the outliers in the boxplot have been examined, the QC.cutoff() function can be used to exclude extremely long and short sequences from subsequent analyses, thereby improving the integrity of the data.
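The idea behind this quality control step can be sketched in base R. The lengths below are made up, and the code is not the implementation of QC.boxplot() or QC.cutoff(); it only shows how boxplot outliers are defined and how an upper and lower length cutoff filters sequences.

```r
# Toy coding sequence lengths in nucleotides (made-up values)
cds_len <- c(a = 900, b = 1020, c = 1100, d = 980, e = 1050, f = 30, g = 9000)

# boxplot.stats() returns the points beyond the whiskers,
# i.e. more than 1.5 * IQR away from the box, in its $out element
outliers <- boxplot.stats(cds_len)$out

# Keep only the sequences whose length lies inside user-chosen bounds
lower <- 100
upper <- 4000
kept <- cds_len[cds_len >= lower & cds_len <= upper]
kept
```

In practice the bounds would be chosen after inspecting the boxplot, since a sensible cutoff depends on the organisms in the dataset.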
    list.virus <- QC.cutoff(list.virus, cut.off.up = 4000, cut.off.down = 100)

To exemplify the provided CUB workflow, we present a case study that uses functions from each of the six branches of the package.

Nucleotide content analysis
To compute the GC content at every codon position across all viruses included in the study:

    GC.list <- GC.content(list.virus)

The GC.boxplot() function produces a graphical representation of the GC content distribution (Figure 2B).
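As an illustration of what GC content per codon position means, the sketch below computes the overall GC fraction and the GC fraction at the first, second, and third codon positions for one sequence. `gc_by_position` is a hypothetical helper written for this example; whether GC.content() excludes particular codons (for instance stop codons) is not stated here, so the sketch simply uses every complete codon.

```r
# Hypothetical helper: GC fraction overall and at codon positions 1, 2 and 3
gc_by_position <- function(cds) {
  bases <- strsplit(toupper(cds), "")[[1]]
  # Drop an incomplete trailing codon, if any
  bases <- bases[seq_len(3 * (length(bases) %/% 3))]
  is_gc <- bases %in% c("G", "C")
  pos <- rep(1:3, length.out = length(bases))  # codon position of each base
  c(GC  = mean(is_gc),
    GC1 = mean(is_gc[pos == 1]),
    GC2 = mean(is_gc[pos == 2]),
    GC3 = mean(is_gc[pos == 3]))
}

# Codons: ATG GCG GGC TAA
gc_demo <- gc_by_position("ATGGCGGGCTAA")
gc_demo
```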

CUB analysis on genes/codons levels
In this section, numerous indices can be calculated to assess codon usage bias (CUB). For example, the effective number of codons (ENc) can be determined using the ENc.values.new() function, which implements a modified version of the index. 18

Cluster analysis and Multivariate analysis
Cluster analysis and multivariate analysis have been widely used to explore codon usage patterns. parazitCUB integrates three essential functions for this purpose. One of them, cub.heatmap(), generates a heatmap using either relative synonymous codon usage (RSCU) values or statistical representations of dinucleotide over- and underrepresentation. cub.heatmap() also supports the various clustering methods implemented in the R stats function hclust() 19 (Figure 2C).
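The clustering step that underlies such a heatmap can be sketched with base R alone. The RSCU-like matrix below is made up for illustration, and the code is not the implementation of cub.heatmap(); it only shows how hclust() groups organisms with similar codon usage profiles.

```r
# Toy RSCU-like matrix: rows are organisms, columns are codons (made-up values)
rscu <- rbind(
  virusA = c(1.6, 0.4, 1.2, 0.8),
  virusB = c(1.5, 0.5, 1.1, 0.9),
  virusC = c(0.5, 1.5, 0.7, 1.3),
  host   = c(0.6, 1.4, 0.8, 1.2)
)
colnames(rscu) <- c("TTT", "TTC", "AAA", "AAG")

# Pairwise Euclidean distances between organisms, then average-linkage clustering
d <- dist(rscu, method = "euclidean")
tree <- hclust(d, method = "average")

# Cut the dendrogram into two groups
groups <- cutree(tree, k = 2)
groups
```

Other linkage methods accepted by hclust(), such as "complete" or "ward.D2", can be substituted through the `method` argument, and `plot(tree)` draws the dendrogram.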

Data visualization
The forces that impact codon usage bias (CUB), such as mutational pressure and natural selection, have been extensively explored using various plots, including the ENc-GC3 plot, PR2 plot, and Neutrality plot. In parazitCUB, two versions of the ENc-GC3 plot are available: the first displays the ENc-GC3 of a specific virus (Figure 3A), while the second presents the average ENc-GC3 for all the organisms studied in a single figure (Figure 3B). The same applies to the PR2 plot (Figure 4A and B). The Neutrality plot can only be used for one organism at a time (Figure 5).

This project contains the following underlying data:
• Fasta Files: A folder containing all the fasta files used in the case study https://github.com/AliYoussef96/parazitCUB/tree/main/flu%20fasta

Improved:
1. "Host (e.g., virus) Fasta File" in the upper right corner of Figure 1 should be "Host (e.g., human) Fasta File".
2. To avoid repeating the word "previous," the sentence (second paragraph of the Introduction) "Previous investigations have emphasized the significance of examining the interplay between parasites and the codon usage of their hosts" should be rephrased as "This has emphasized the significance of examining the interplay between parasites and the codon usage of their hosts."
Is the rationale for developing the new software tool clearly explained? Yes

Is the description of the software tool technically sound? Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Ahmed Abdelmonem Hemedan
Luxembourg Centre for Systems Biomedicine, Luxembourg University, Luxembourg City, Luxembourg District, Luxembourg

parazitCUB manuscript review
The article introduces parazitCUB, an R package for analysing codon usage bias in parasites and hosts. It is a valuable addition to the field, addressing a unique research need.
The introduction would benefit from specific examples where codon usage bias analysis has impacted our understanding of parasitic diseases and treatment strategies.
A thorough comparative analysis with existing tools such as CodonW or coRdon is required. Focusing on unique features, user interface, data handling capabilities, and specific functionalities would provide valuable insights.
Expanding on the computational methodologies and algorithms used in parazitCUB is crucial for scientific rigor and reproducibility.
Including a robust performance assessment section, exploring aspects like accuracy, computational efficiency, and scalability through comparative analyses, is essential.
Detailed case studies in the use case section, demonstrating the tool's utility with specific data and analytical processes, would illustrate its practical value.
Enhanced documentation, including a comprehensive user guide, example datasets, troubleshooting tips, and FAQs, is necessary to make the tool approachable to a broader audience.
Enriching the paper with more effective visuals and data visualizations would aid in conveying complex data and analyses in an engaging manner.
The discussion section should contextualise parazitCUB within the broader bioinformatics field and its potential impact on future research, including possible expansions or updates of the tool.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Figure 2. "Quality Control," "Nucleotide Content," "Cluster Analysis," and "Multivariate Analysis" within the parazitCUB package. A) A box plot of the lengths of all coding sequences in the study, serving as a quality control measure, with outliers displayed as red dots. B) A box plot of the GC content of all organisms in the study for each codon position, together with the overall GC content. C) A heatmap, combined with cluster analysis, that uses the statistical representation of dinucleotide over- and underrepresentation to depict patterns and similarities among the data. D) Cluster analysis on a Principal Component Analysis (PCA) of the RSCU values for each organism in the study, which enables the identification of clusters and relationships based on codon usage.

Figure 3. ENc-GC3 analysis implemented in parazitCUB. A) The ENc-GC3 plot displays the ENc values plotted against the GC3 content for the Influenza A virus CDS (human CDS as reference). The solid red line represents the expected ENc values when the codon bias is influenced solely by GC3s. B) The plot represents the average effect of ENc-GC3 for all organisms included in the study.
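An expected-ENc curve of this kind is commonly drawn with Wright's formula, ENc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3s. The sketch below evaluates that formula in R; it assumes, without confirmation from the text, that the red line in the figure follows this standard expression, and `enc_expected` is a hypothetical helper name.

```r
# Expected ENc under GC3s-only constraints (Wright's formula);
# assumed here to correspond to the reference curve, which the text does not confirm
enc_expected <- function(gc3s) {
  stopifnot(all(gc3s >= 0 & gc3s <= 1))
  2 + gc3s + 29 / (gc3s^2 + (1 - gc3s)^2)
}

# Evaluate the curve on a coarse grid of GC3s values
gc3_grid <- seq(0, 1, by = 0.25)
enc_curve <- enc_expected(gc3_grid)
round(enc_curve, 2)
```

Genes that fall well below the curve have stronger codon bias than GC3s composition alone would predict, which is usually read as a sign of additional forces such as selection.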

Figure 4. PR2-plot analysis implemented in parazitCUB. A) A PR2-plot of the Influenza A virus coding sequences (CDS; human CDS as reference), depicting their GC bias (ratio of G3 to G3 + C3) and AT bias (ratio of A3 to A3 + T3) at the third position of each codon. The two solid red lines indicate where the vertical and horizontal coordinates are both 0.5, the condition in which A equals T and G equals C. B) The plot represents the average effect of PR2 values for all organisms included in the study.
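The two ratios named in the caption can be computed directly from third-codon-position counts. The following sketch does so for a single sequence using every codon, as the caption describes; `pr2_bias` is a hypothetical helper, and any codon filtering the package may apply is not reproduced.

```r
# Hypothetical helper: PR2 coordinates from third codon positions
pr2_bias <- function(cds) {
  bases <- strsplit(toupper(cds), "")[[1]]
  stopifnot(length(bases) >= 3)
  third <- bases[seq(3, length(bases), by = 3)]  # third base of each codon
  n <- function(b) sum(third == b)
  c(GC_bias = n("G") / (n("G") + n("C")),   # G3 / (G3 + C3)
    AT_bias = n("A") / (n("A") + n("T")))   # A3 / (A3 + T3)
}

# Codons: GCG GCC GCA GCT GCG -> third positions G, C, A, T, G
pr2_demo <- pr2_bias("GCGGCCGCAGCTGCG")
pr2_demo
```

A point at (0.5, 0.5) corresponds to A = T and G = C at the third position; departures from the centre indicate a skew between complementary bases.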

Figure 5. The Neutrality plot analyzes the GC12 and GC3 contents by plotting their frequencies against each other. The y-axis represents the average GC frequency at the first and second codon positions (GC12), while the x-axis represents the GC frequency at the third codon position (GC3). The equation for the slope, along with the coefficient of determination (R) and its associated p-value, are shown.
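The quantities reported on such a plot come from a simple linear regression of GC12 on GC3. The sketch below fits that regression with lm() on made-up per-gene values; it is an illustration of the calculation and not the package's plotting code.

```r
# Toy per-gene GC fractions (made-up values)
gc3  <- c(0.30, 0.35, 0.42, 0.50, 0.58, 0.65)
gc12 <- c(0.40, 0.41, 0.43, 0.44, 0.46, 0.47)

# Regress GC12 (y-axis) on GC3 (x-axis)
fit <- lm(gc12 ~ gc3)
fit_summary <- summary(fit)

slope     <- unname(coef(fit)["gc3"])
r_squared <- fit_summary$r.squared
p_value   <- fit_summary$coefficients["gc3", "Pr(>|t|)"]

c(slope = slope, r_squared = r_squared, p_value = p_value)
```

In the usual reading of a neutrality plot, a slope close to 1 suggests that mutation pressure dominates, whereas a slope close to 0 suggests that selection constrains the first and second codon positions.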
Is the rationale for developing the new software tool clearly explained? Partly

Is the description of the software tool technically sound? Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Systems medicine, Disease dynamic modelling, Biostatistics, Mathematics, Bioinformatics, Molecular Pathology.

Table 1. A comprehensive list of the package's functions, along with their corresponding results.

Also, the MILC.values(), B.values(), and MCB.values() functions can be used to calculate MILC, B, and MCB, respectively. All of these functions can work on the virus coding sequences without the need for host coding sequences as a reference set. Some indices within the parazitCUB package require a reference gene set for their accurate computation and cannot be executed without it. One such example is:

    MILC.list.virus <- MILC.values(list.virus, host = theHost)

Other indices that rely on a reference gene set include:

    # Codon Adaptation Index
    cai.list <- CAI.values(list.virus, host = theHost)
    # MILC-based Expression Level Predictor
    melp.list <- MELP.values(list.virus, host = theHost)
    # Related measure of expression
    e.list <- E.values(list.virus, host = theHost)

Moreover, various metrics are provided within parazitCUB to facilitate the examination of codon usage bias (CUB) at the codon level. For instance:

    # Relative synonymous codon usage; can be computed for the virus and the host
    rscu.virus <- RSCU.values(list.virus)
    rscu.host <- RSCU.values(theHost)
    # Similarity index between the RSCU of the virus and the host
    SiD <- SiD.list(RSCU.host = rscu.host, RSCU.virus = rscu.virus)
    # Relative codon deoptimization index
    rcdi <- RCDI.calc(list.virus, theHost, rscu.host, enc.host)
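For readers unfamiliar with RSCU, the value for a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid, so a value of 1 means no bias. The sketch below works this out for a single amino acid family; `rscu_family` is a hypothetical helper for illustration and not the implementation of RSCU.values().

```r
# Hypothetical helper: RSCU for the codons of one synonymous family
rscu_family <- function(cds, family) {
  cds <- toupper(cds)
  n_codons <- nchar(cds) %/% 3
  starts <- seq(1, by = 3, length.out = n_codons)
  codons <- substring(cds, starts, starts + 2)   # split into in-frame codons
  counts <- vapply(family, function(cd) sum(codons == cd), numeric(1))
  # Observed count relative to the mean count within the family;
  # returns NaN if the amino acid does not occur in the sequence
  counts / mean(counts)
}

# Lysine is encoded by AAA and AAG
lys <- c("AAA", "AAG")
# Codons: AAA AAA AAG AAA GCT -> AAA occurs 3 times, AAG once
rscu_demo <- rscu_family("AAAAAAAAGAAAGCT", lys)
rscu_demo
```

Here AAA has an RSCU above 1 and AAG below 1, meaning AAA is used more often than expected under uniform synonymous codon usage.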