Regulatory patterns of differentially expressed genes in Ebola and related viruses are critical for viral screening and diagnosis

PepVax, Inc, 10411 Motor City Drive, Bethesda, MD, 20817, USA Immage Biotherapeutics Corp, University City Science Center, 3711 Market St., Philadelphia, PA, 19104, USA Current affiliation: Computational Biology Branch, NCBI/NLM/NIH, 8600 Rockville Pike, Bethesda, MD, 20894, USA Current affiliation: National Human Genome Center (NHGC), Cancer Center Building, Howard University, 2041 Georgia Avenue, Washington, D.C., 20060, USA


Introduction
Viruses aim to interrupt and manipulate normal functions of cells, qualifying them as biological agents (Eckert, 1967). As years go by, human viruses are increasingly being studied and recorded, with examples like the Ebolavirus (Sanchez et al., 2016), human immunodeficiency virus (HIV) and the human herpesvirus (HHV) (Zell et al., 2008). Some viruses are considered emerging with epidemic manifestations, and their pathogenic effects have reshaped human response to viral epidemics and their detection methodologies. In the space of two to three years, epidemics resulting from Ebola virus and Zika virus caused the world health organization (WHO) to put in place necessary international public health emergency rules to better deal with such viral threats (Rojek et al., 2016). Adequate preparation for such threats engages both community and health care systems, with better surveillance and response time and better diagnostic development pipelines for vaccines and therapies. WHO recommends that devices used for diagnosis should be sensitive, specific, affordable, equipment-free, robust, rapid and easily delivered, especially in developing countries. Several detection techniques used in the last decade, like PCR and microarrays, have seriously improved upon infectious virus identification (Lau et al., 2010). Given that the safety and quality of blood related tools remains a public health concern, these products require routine screening, good quality control and good manufacturing practices (Albertoni et al., 2014). Development of better serological tests have drastically decreased transmission risk of human immunodeficiency virus (HIV), hepatitis C virus (HCV) and hepatitis B virus (HBV) (Tabor & Epstein, 2002). Antigen detection as a viral quantification method has been demonstrated for gag p24 antigen (Dong et al., 2012) and genome identification (Cotten et al., 2014) but sometimes, the absence of a good dynamic viral detection range results in limitations of this technique. Biological understanding of viral pathogenesis is also critical for developing better detection tools. Several viruses interact with host cell proteins to access and control cell machinery. Viral interference with host regulatory networks affects normal cell processes like gene expression and cell growth (Chatr-aryamontri et al., 2009). Documented information has shown that viruses exploit host phosphorylation mechanisms for their replication and in turn manipulate transcription mechanisms for pathogenic gains (Schang et al., 2002). Due to challenges encountered with current technologies like mass spectrometry, in silico tools are used both in the development of targeted therapies for viral inhibition through derived antigenic epitopes and for identification of viral phosphorylation sites (Xue et al., 2010). Current genomic viral detection techniques use DNA microarray technology, which is a collection of single-stranded DNA spots attached to a solid surface (Clinical and Laboratory Standards Institute, 2005). This technique has helped detect HIV mutations associated with drug resistance (Kozal et al., 1996), after which several studies have used chips to detect HCV (Xu et al., 2005), CNS infecting viruses (Leveque et al., 2011) and respiratory viruses (Coiras et al., 2005). Previous Ebola virus and related virus microarray data from the Gene Expression Omnibus (Achinko et al., 2016) identified 3 main viral clusters and 4 main proteins (CD209, LCK, IL-2 and MYB), which play critical roles in host immune activation after viral attacks. Their structural analysis and differential expression analysis between clusters shows viral regulatory patterns of the host genome that are critical to pathogenesis and can be exploited for screening novel viruses and developing novel therapeutic targets.
to round up GVps in RCRD data table to one decimal place and columns with gene-sum per sample = 0 were removed from the data, bringing total samples to n = 28 (Dataset 2, Achinko et al., 2017b), including sample_1, Sample_2 and sample_3 (KGC); sample_27, sample_28 and sample_29 (MDC); and sample_30, sample_31, sample_32, sample_33 and sample_34 (EBOVC), as described in Achinko et al. (2016). The RCRD table processed in edgeR was split into 3 groups: group 1 related to KGC, group 2 related to EBOVC and group 3 related to MDC. The data was further processed for library size (sequencing depth or gene counts) and typically genes with less than 2 fold expression were filtered out, retaining only genes represented in samples across groups.
Given that groups in our data didn't reflect true biological replicates but gene profiles observed from the RCDN heatmap showed patterns which could be qualified as such, we analyzed our data using edgeR to include: Principal component analysis (PCA), Biological Coefficient of Variation (BCV), Differential Expression (DE) analyses and gene functional analyses. PCA uses the multidimensional scaling plot of distance between gene expression profiles (MDS). This is a two-dimensional scatter plot showing that distances between samples approximate to log2 fold change. The biological coefficient of variation (BCV) is the square root of the dispersion parameter considered using the negative binomial model. In our case, the algorithm to Estimate Empirical Bayes Tagwise Dispersion values was used to estimate dispersion. Tagwise dispersion uses Bayes empirical method, laying emphasis on a maximum likelihood conditionally weighted approach. Differential expression analysis was performed through genewise negative binomial generalized linear models using quasi-likelihood tests. This test fits a negative binomial generalized log-linear model to count data, then a genewise statistical test was used to compare genes between groups. The false discovery rate (FDR) calculated for the P-values was based on Benjamini & Hochberg (1995) at α < 0.05. Functional analysis was performed using the GO (Gene Ontology, RRID: SCR_006580) database with focus on the biological process, and KEGG (Kyoto Encyclopedia for Genes and Genomes, RRID: SCR_012773).  HLA-K, LCK, MYB) were extracted into bed file format using a unique code (Achinko, 2017) and files were mounted into the Integrated Genomics Viewer software (RRID:SCR_011793) and compared with Human genome (hg38) sequences (https://www.ncbi.nlm.nih.gov/genome) and Gencode V24 (Harrow et al., 2012) sequences to view genomic locations and variations for comparative analysis. Genomic matches were exported as image files for viewing.

Results
Cluster determination from extracted and normalized Gene Expression Omnibus data Through the use of hierarchical clustering, RCDN data showed cluster patterns on the heatmap (Figure 1) following the three main and previously identified clusters, KGC, MDC and EBOVC (Achinko et al., 2016). KGC comprised 5 samples, MDC comprised 16 samples and EBOVC comprised 13 samples. Comparative statistical analysis with ANOVA between Sample 30 (EBOV), Sample1 and Sample 27 then Sample 27 and Sample 1, was significant between Sample 30 and Sample 1 (P-value = 0.0371; F statistic = 4.348, DF=1) Sample 30 and Sample 27, (P-value = 7.51e-05, F-statistic = 15.692, DF=1). No significant difference was seen between Sample 1 and Sample 27 (P-value = 0.3687, F statistic = 0.808). From our previous work (Achinko et al., 2016), heatmaps centric for HLA and signaling proteins showed a similar pattern of CD209 upregulated in KGC and MDC only ( Figure 2) but not expressed in EBOVC. Meanwhile, the same couldn't be said for LCK, MYB and IL-2 proteins, which showed differential expression patterns in the 3 clusters for this analysis, although they were previously observed to be up-regulated in KGC and EBOVC only but down-regulated in Sample 32 (EBOV). The focus on these four proteins (CD209, LCK, MYB and IL-2) was relative to their role in MHC class I and II gene activation. The expression patterns led to identification of variant types expressed by the three cluster groups and it was observed that, apart from

Sample variation analysis
Distance analysis for variation between sample expression profiles using a two dimensional PCA plot ( Figure 4A) with log of fold change (logFC) on both axes showed three clusters, with KGC and EBOVC related samples dominantly clustering on the negative X-axis (leading logFC dim 1) and EBOVC located to the bottom left. Clustering patterns observed in the plot suggest common expression patterns in identified viral clusters. MDC dominantly clustered to the positive X and Y axes. EBOV never really grouped in any cluster though aligned with KGC and EBOVC. The estimated coefficient of variation was 8.5958 with a BCV of 2.93, which is much higher than a BCV of 0.4 for a standard controlled experiment. The trend of dispersion fit analysis under the quasi-likelihood pipeline shows an increase in dispersion as the log2CPM increases and the squeezed values show a good fit along the trend ( Figure 4B).

Differential expression analyses
The three viral clusters (KGC, MDC and EBOVC) were used in pairs for comparative analyses and variation was considered significant at 5% false discovery rate. The plots were performed for log-fold changes (logFC) against log-counts per million (logCPM) and the blue lines define limits for logFC = ±2, where differentially expressed proteins were highlighted as red point dots. In all three clusters the majority of proteins were downregulated ≤ 5-fold and a few up-regulated ≥ 10-fold ( Figure 5A). Pairwise comparative analysis showed a similar pattern for differentially expressed proteins observed in all three clusters when MDC was compared to KGC and EBOVC but when EBOVC was compared to KGC, there were very few differentially expressed genes down-regulated ≤ 10-fold. Differential expression analysis showed that more proteins showed a > ±2 fold change when MDC was individually compared to KGC ( Figure 5B) and EBOVC ( Figure 5C), while fewer proteins were significantly down regulated when KGC was compared to EBOVC ( Figure 5D). This could be confirmed on the heatmaps plotted on estimated tagwise values for the first 100 proteins in each comparative category. Mapping the top 100 differentially expressed proteins for DE on heatmaps showed a very clear split pattern for MDC against KGC ( Figure 5E) and EBOVC ( Figure 5F), and it could be observed from the pattern that MDC samples clustered together with their proteins down-regulated (green) while KGC and EBOVC samples clustered with up-regulated proteins (red). For EBOVC compared to KGC ( Figure 5G), the pattern was a little bit different with MDC samples not completely separated from KGC and EBOVC because some proteins were both up and down-regulated in different samples. This pattern further shows the similarity in EBOVC and KGC and how they differ from MDC.

Pathway analysis for differentially expressed clusters
Pathway analysis was performed on two databases, GO and KEGG. GO identified 20855 terms in total with the first 20 (Table 1) focusing on cell activation and positive and negative regulation of the immune system. Total GO data can be seen in Dataset 3 (Achinko et al., 2017c). KEGG identified 311 pathways with the first 20 (Table 2) involved in phosphate metabolism (glycogenesis and gluconeogenesis), ATP synthesis, fatty acid metabolism, steroid hormone metabolism and DNA metabolism. Total KEGG data can be seen in Dataset 4 (Achinko et al., 2017d). This pattern showed that though the top 20 differentially expressed genes varied as seen in their respective tables, most of them were involved in similar pathways.

Discussion
Viral genomics is an area of intense research, given that viruses could be regional and specific to the local human host. Viruses like Ebola, Rift Valley fever and Zika have resulted in sudden epidemics in West Africa, East Africa and South America, respectively. This work aimed to look at gene expression patterns of EBOV and related viral species as decribed in Achinko et al. (2016). Previous work done by Achinko et al. (2016) identified four main genes   (CD209, LCK, IL-2 and MYB) involved with successful transmission of the virus within the host. Given that total genomic data for Zaire EBOV (Dataset 1: RCRD, Achinko et al., 2017a) and related viruses extracted from the GEO profile database maintained known clusters, it may be that pathogenic events already documented for related viral species could help in further understanding EBOV pathogenesis. The identified viral related cluster patterns could serve as groups to differentiate relative host pathogenic events and viral expression patterns. Statistical analysis performed with ANOVA showed similar results to previous data (Achinko et al., 2016), where EBOVC was significantly different to KGC and MDC but no significant difference was observed between KGC and MDC. This pattern could suggest a different mode of entry by EBOV into the host cell, which could be the result of its use of the nonstructural soluble glycoprotein (sGP) to penetrate its host, resulting in by-stander apoptosis (Baize et al., 2000).
Heatmaps centric to HLA class I and II variants showed similar expression patterns to previous heatmaps (Achinko et al., 2016), where HLA class I isoforms were down-regulated across clusters and CD209 was not expressed for EBOVC but slight variations in expression patterns were observed for LCK, IL-2 and MYB proteins. The expressed variant types identified for each of the four proteins of interest (CD209, LCK, IL-2 and MYB), showed that KGC and MDC expressed unique protein variant types different from those for EBOVC. This observation could confirm the statistically significant difference in expression pattern of EBOVC from KGC and MDC. The observed variants for each protein showed differential domain expression for EBOVC but maintained all domains per variant per protein type for KGC and MDC. This pattern for EBOVC, suggests differential regulation due to differential usage of promoter regions on related genes through alternative splicing processes, leading to protein variants with misplaced cellular functions (Landry et al., 2003;Pajares et al., 2007). Furthermore, differential usage of EBOV distal promoter regions has been shown to produce non-functional LCK mRNA variants (Takadera et al., 1989), and differential expression of MYB protein variants has been shown to occur at gene promoter regions (O'Rourke & Ness, 2008). Though the expression of these functional protein variants in KGC and MDC, it has been shown that the Epstein-Barr virus (EBV) nuclear antigen 1 (EBNA1) protein (Sample 28) possesses a large Gly-Ala repeat domain affecting proteosomal degradation of EBNA1 polypeptide and preventing HLA I antigenic peptide formation, which is needed to induce CD8+ immune T-cells (Levitskaya et al., 1997). It has also been shown that EBV G protein-coupled receptor BILF1 engages the HLA-I/β 2 -microglobulin complex, favoring rapid loss of HLA-I molecules from the cell surface and intense reduction of HLA-I half-life (Zuo et al., 2009). HIV-I nef protein (Sample 2) either induces degradation of MHC class I complexes in endosomes and lysosomes through RING-finger motifs or targets the degradation of MHC-I complexes through di-leucine motifs (Rowe et al., 2010) common to CD209 protein. EBV has also been shown to interact epigenetically with host genes through histone modifications and promoter hypermethylation, resulting in gene expression variations and tumorigenesis (Tsai et al., 2006). The mechanism of function in KGC and MDC clusters shows a different form of regulation from EBOVC, suggesting that genetic mechanisms underlying the former lead to gene repression or expression while in the latter, it leads to alternative splicing with aim to favor virulence in host cells.
The coefficient of biological variation depicted by PCA was 8.5958, greater than that in controlled biological replication experiments, but the plot grouped EBOVC and KGC to the negative x-axis, centralizing Ebola virus although its related cluster (EBOVC) samples dominantly located to the bottom while MDC samples located to the positive x-axis. This pattern suggests that EBOV could be more heterogenous than the other viral samples and given that it uses sGP to infect its host, there could be different genomic pathways involved in its pathogenesis. The high BCV suggests an uncontrolled biological replication experiment, but the observed expression patterns ( Figure 4A) show that individual viral samples were very similar and could be used as replication data.
Differential expression analysis performed on all three clusters showed that the majority of proteins were down-regulated, and those of interest belonged to: human retina cDNA samples, potassium voltage-gated channel interacting protein 4 (KCNIP4), a mucin protein (MUC8) highly expressed in human airways (Shankar et al., 1994), a nucleoporin 36 (nup36) of the nuclear pore complex (NPC) needed for mitotic progression and chromosome segregation (Zuccolo et al., 2007), a suppressor of tumorigenicity protein (ST20), eukaryotic translation initiation factor 5A (EIF5A) whose activation by Kirsten ras oncogene (K-Ras) favors pancreatic ductal adenocarcinoma migration (Fujimura et al., 2015), nitric oxide synthetase (NO), which requires L-arginine and molecular oxygen to synthesize nitric oxide and acts as a neurotransmitter (Bloch et al., 1995), natriuretic peptide A (NPPA) involved in regulation of inflammation by altering cytokine secretion and macrophage function (Mezzasoma et al., 2016), non-coding microRNA 664B known to promote proliferation of human osteosarcoma (Chen et al., 2015) and a basic leucine zipper (MAFF), known to be a proto-oncogene and transcription factor that interacts with the promoter of oxytocin receptor gene. The identification of gene variants associated with human retina cDNA samples could be linked to sample 24 (Munoz-Erazo et al., 2012), which is part of the EBOVC cluster. This sample showed that West Nile Virus (WNV) has various forms of ocular involvement (chorioretinitis, uveitis, and occlusive retinal vasculitis), of which chorioretinitis is associated with retinal pigment epithelium degeneration. In response to WNV infection, interferon-β signaling is induced (Cinatl et al., 2006) through Toll-like receptors.
From all the differentially expressed proteins discussed, we can identify those that directly involve viral interaction, those that are involved with pathways that successfully transmit the virus and those that result in oncogenesis. The general differential expression pattern was mapped onto pairwise comparative analyses of viral clusters for which EBOVC and KGC had very few proteins down regulated. Though the top differentially expressed proteins including the human retina cDNA protein variants and von Willebrand factor (VWF) involved with homeostasis, were down-regulated, some proteins were up-regulated which included but not limited to: wnt family member 1 (WNT1), shown to increase the expression of a scavenger receptor B (CD36) located on endothelial cells, dendritic cells and macrophages (Matsumoto et al., 2000); and vestigial like family member 4 (VLL4), known to act as tumor suppressor in lung cancer by down regulating yes-associated protein 1 (YES)-transcription factor interacting domain (YAP-TEAD) (Zhang et al., 2014). YES protein has also been identified in an interaction pathway with LCK (Achinko et al., 2015). Wolframin (WFS1) is a transmembrane protein mainly located in the endoplasmic reticulum and known to be involved in the regulation of Ca 2+ homeostasis (Takei et al., 2006). WAP four-disulfide core domain 8 (WFDC8) functions as a serine protease inhibitor through its Kunitz-inhibitor domain and also plays a role in immune homeostasis (Ferreira et al., 2013). WASP homolog-associated protein, associated with actin, membranes and microtubules, is known to help promote nucleation, which in turn stimulates actin polymerization of actin related protein (Arp2/3 complex) at the golgi and tubular membranes (Campellone et al., 2008). WASP is known to favor proper coordinated transport of viral particles through vesicles into the cell. The up-regulated proteins observed from differential expression analysis of EBOVC and KGC suggest activation of proteins that favor membrane receptors and vesicular transport of viruses into host cells. Comparative analysis between MDC and KGC showed a similar pattern of protein expression as seen in the general expression pattern comparing all three clusters discussed above. Interestingly, when compared to EBOVC, MDC showed a differential expression pattern where the retina cDNA protein variants and VWF were up-regulated and WASP was down-regulated, as opposed to the differential expression pattern seen when comparing EBOVC to KGC where VWF was downregulated and WASP up-regulated. Down-regulation of VWF suggests a reduced cellular response to viral entry, hence supporting the idea that EBOV uses sGP to penetrate host cells, whilst up-regulation of VWF suggests an immune response of infected cells after viral pathogenesis. Down-regulation of WASP causes inactivation of vesicles needed to transport MHC class I antigenic epitopes out of the cell (van Endert et al., 2002) and hence causes a downregulation of immune response.
The heatmaps targeting proteins from each differential expression comparative viral cluster analysis presented with a distinct pattern in which proteins were down-regulated in MDC compared to KGC and EBOVC, however this distinct pattern did not show up during comparative viral cluster analysis of differentially expressed proteins from EBOVC and KGC. Previous work on viral clusters MDC, KGC and EBOVC (Achinko et al., 2016)  GO identified pathways showed that proteins like VWF, involved in regulation of complement activation and the lectin pathway (GO:0001867, GO:0001869) were of priority to MDC but not EBOVC and KGC. Likewise, KEGG identified pathways showed that genes like WASP protein, involved in glycolysis and ATP synthesis to transport vesicles around the cell, were of priority in KGC and EBOVC but not MDC. These cellular pathways and glycolysis related functions further support a coordinated regulatory pattern in MDC to repress transcription of genes but the same genes are up-regulated for KGC related viral pathogenesis and alternatively spliced for EBOVC related viral pathogenesis. The immune expression patterns observed through bed graphs showed HLA class I and II alleles were identified for each viral infection type and these variants differed from one sample to the other. It was observed that all viruses generated HLA class I alleles with variants cutting across HLA-A, which suggests a common region of interest to be targeted when designing viral antigenic epitopes in silico, with aim to develop a cut across antiviral therapy. The ubiquitously transcribed tetratricopeptide repeat containing, Y-linked (UTY) protein is a male-specific demethylase catalyzing the demethylation of lysine-27 (Lys27) in histone 3 (H3K27me3) (Walport et al., 2013). A variant of UTY is located on chromosome 6 and has several amino acid repeats in the protein, suggesting a possible translocation of the gene. Its epigenetic role has been shown to non-catalytically regulate transcription by interacting with T-box proteins (Miller et al., 2010). Some KEGG pathways that UTY has been shown to be prominently involved in are: Lysine biosynthesis (path:hsa00300) and Lysine degradation (path:hsa00310). This suggests UTY protein can have an effect on expression of Lysine-containing signaling proteins. With this regulatory pattern in mind, it suggests a possible mechanism of regulation for proteins of interest like: tyrosine-protein kinases (LCK) at the promoter regions with related processes like Tyrosine metabolism (path:hsa00350) and Glycolysis/Gluconeogenesis (path:hsa00010), CD209 with processes like Valine, leucine and isoleucine degradation (path:hsa00280) and Valine, leucine and isoleucine biosynthesis (path:hsa00290), MYB with processes like Purine metabolism (path:hsa00230) and Pyrimidine metabolism (path:hsa00240) and IL-2 with process like Citrate cycle (TCA cycle) (path:hsa00020). TAP2 was found in fusion with the HLA class II gene (HLA-DOB) and was previously shown to be down-regulated in MDC (Achinko et al., 2016). Given that TAP2 is actively involved in HLA class I peptide antigenic processing, transcription of TAP2 in MDC could be repressed. Major HLA class I alleles (HLA-A, HLA-B and HLA-C) were down-regulated in all clusters, but more so in the EBOVC cluster, which suggests important roles of other possible proteins like LCK, which together with IL-2 and MYB were all down-regulated in this cluster (Achinko et al., 2016). Contrasting the role of these proteins in EBOVC and KGC can help to further explain regulatory patterns on host genomes critical for the viral expression patterns observed. The role of proteins with regulatory function on HLA class II chromosomal genes locations needs to be further investigated. Their interplay with HLA class I and II immune genes and how they influence viral pathogenesis will aid in the identification of genetic targets for development of antiviral therapy and vaccine designs.

Conclusion
This study was undertaken to further investigate the role of four proteins (CD209, LCK, IL-2 and MYB) previously identified to be critical for viral pathogenesis in related clusters (KGC, MDC and EBOVC) (Achinko et al., 2016). Structural variant analysis showed that variants for these four proteins were unique and specific to EBOVC or KGC and MDC. The domain regulatory pattern suggested differential splicing for EBOVC and gene expression or repression for KGC and MDC, which could be directed by epigenetic programs or differential promoter regulatory patterns on the chromosome. Differential expression analyses suggested that the proteins that vary greatly during viral infection are involved with immune homeostasis and regulation of host cell immune receptors, required by viruses to penetrate the host cells. The heatmaps for these differential expression patterns showed that viral-immune cell interaction patterns were B-lymphotropic for MDC and T-lymphotropic for KGC and EBOVC. Protein expression regulatory patterns observed also suggested an immune regulatory pattern with genes like UTY, which process pattern recognition receptors needed to activate CD4+ and CD8+ T-cells, and specific amino acids on proteins related to proper identification of HLA class I and II peptides by antigen processing cells (APC). Specific differential expression patterns identified in this work set a basis for developing tools necessary for viral screening during epidemic outbreaks and development of antiviral immune therapies to properly stimulate immune expression.