A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections [version 1; peer review: awaiting peer review]

Background: Certain riboviruses can cause severe pulmonary complications leading to death in some infected patients. We propose that DNA damage-induced apoptosis accelerates viral release, triggered by depletion of host RNA binding proteins (RBPs) from nuclear RNA as they become bound to replicating viral sequences. Methods: Information theory-based analysis of interactions between RBPs and individual sequences in the Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2), Influenza A (H3N1), HIV-1, and Dengue genomes identifies strong RBP binding sites in these viral genomes. Replication and expression of viral sequences are expected to increasingly sequester the RBPs SRSF1 and RNPS1. Ordinarily, RBPs bound to nascent host transcripts prevent their annealing to complementary DNA. Their depletion induces destabilizing R-loops. Chromosomal breakage occurs when an excess of unresolved R-loops collides with incoming replication forks, overwhelming the DNA repair machinery. We estimated the stoichiometry of inhibition of RBPs in host nuclear RNA by counting competing binding sites in replicating viral genomes and host RNA. Results: Host RBP binding sites are frequent and conserved among different strains of RNA viral genomes. Similar binding motifs of SRSF1 and RNPS1 explain why DNA damage resulting from SRSF1 depletion is complemented by expression of RNPS1. Clustering of strong RBP binding sites coincides with the distribution of RNA-DNA hybridization sites across the genome. SARS-CoV-2 replication is estimated to require 32.5-41.8 hours to effectively compete for binding of an equal proportion of SRSF1 binding sites in host encoded nuclear RNAs. Significant changes in expression of transcripts encoding DNA repair and apoptotic proteins were found in an analysis of Influenza A and Dengue-infected cells in some individuals.
Conclusions: R-loop-induced apoptosis indirectly resulting from viral replication could release significant quantities of membrane-associated virions into neighboring alveoli. These could infect adjacent …

F1000Research 2020, 9:943. Last updated: 07 AUG 2020


Introduction
Background
RNA viruses have long been known as an important source of zoonotic disease transmission 1 . In these infections, a key question is which infected individuals will progress from mild to severe symptoms that require intensive care. While complex underlying conditions increase susceptibility, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and Influenza A can lead to severe or lethal outcomes in certain individuals regardless of age or health status. Studies of Chinese and initial US patients with SARS-CoV-2 showed that higher viral replication and multiplicity of infection are evident in severely ill individuals [2][3][4] . Textbook depictions of viral release and infection indicate budding from the cell membrane. This depiction might not adequately explain the rapid onset of symptoms and transmissibility seen in some individuals infected with these agents. We suggest that these factors can be explained by a cytopathology of induced lytic events, releasing high titers of virus. Programmed cell death (PCD; apoptosis), which has been suggested to occur in RNA viral conditions such as Influenza, is activated through innate immunity, with concomitant inflammatory responses. Viral RNA has been suggested to signal through Toll-like receptors and induce type I interferon expression; interferon binds to its receptor, IFNAR, and stimulates induction of PCD genes such as FasL or TRAIL 5 .
We propose an alternative mechanism in which infection by an RNA virus triggers unrepaired sites of chromosomal breakage, causing apoptosis and, consequently, high-titer viral release (Figure 1). This is precipitated by the binding of RNA binding proteins (RBPs) to viral genomes and transcripts instead of the nuclear transcripts to which they normally bind to prevent destabilization of chromosome structure. This study identifies the sequences, locations and abundance of these binding sites and presents evidence for specific expression changes in DNA damage genes in Influenza and Dengue infections and evidence of expression changes consistent with induction of apoptosis. The damage is thought to arise as the result of replication forks colliding with R-loops formed by host transcripts. Ordinarily these structures are mitigated through formation of stable interactions with frequently bound endogenous RBPs 7 .

Figure 1. Proposed mechanism of high multiplicity of RNA viral infections. Newly synthesized host RNA binding proteins (SRSF1, RNPS1) are required to stabilize nascent transcripts throughout the nucleus. During influenza or other viral infections, these proteins can be bound to viral genomes and transcriptomes. As viral replication and transcription proceed, these nucleic acids, which contain strong binding sites for these RBPs, are present in the cytoplasm (SARS-CoV-2) and nucleus (Influenza), where they compete with host RNAs and deplete these proteins from the nucleus. This enables nascent transcripts to reanneal with transcription templates, and R-loops are formed. If not removed by RNase H or helicases, unresolved R-loops at numerous genomic loci trigger genomic instability. The frequency and density of unrepaired chromosome damage would be expected to overwhelm DNA repair components (BRCA1/2, FANC complex, and XPC), inducing multiple chromosomal strand breaks in each cell 6 . These breakage events initiate apoptosis, releasing a high multiplicity of infectious viral particles.
The SR protein family consists of RNA binding proteins that play significant roles in the regulation of mRNA splicing 8 . SRSF1 (formerly ASF/SF2) binds exonic splicing enhancer (ESE) sequences and has been shown to interact with the U1 snRNP and recruit it to the donor (5') splice site 9, 10 . However, binding of SRSF1 to nascent transcripts has also been shown to play a significant role in genome stability. This was first described in reference 6, where SRSF1 bound to pre-mRNA repressed the formation of DNA:RNA hybrids; in its absence, these hybrids led to R-loops, double-stranded breaks, and a hypermutation phenotype. This phenotype could be corrected not only by increasing RNase H expression (to eliminate DNA:RNA hybrids), but also by overexpression of the RNA binding protein RNPS1 11 . RNPS1, part of the apoptosis- and splicing-associated protein (ASAP) complex, can directly interact with SRSF1 12 and could possibly help recruit SRSF1 to ESE sites 13 . Other RNA binding proteins have been shown to increase genome instability when depleted, including THOC1 14 , MFAP1 15 , and FIP1L1 16 .
Binding sites for these RBPs are identified using information theory (IT)-based sequence analysis, which has proven both theoretically and in numerous practical examples to be an accurate approach for predicting binding affinities of nucleic acid sequences recognized by particular DNA or RNA binding proteins 17 . IT can be used to identify binding sites, and to evaluate the impact a sequence variant may have on binding site strength 18 . IT has been applied in studies which involved mRNA splicing 19,20 , splicing regulatory factors (SRFs 21,22 ), other RNA binding proteins 23 and transcription factor binding sites (TFBS; 24,25 ), and has been used to accurately predict levels of gene expression and identify causative mutations in a wide spectrum of diseases 17 . IT-based analysis has a distinct advantage over other bioinformatic approaches: the predicted information content (known as R i ; measured in bits) can be interpreted as binding site affinity, as it is related to thermodynamic entropy 26 . The binding affinity of a sequence predicted by IT has been shown experimentally to directly relate to the observed binding quantity of said sequence 26 . IT-based models are generated from a series of annotated binding sites for a particular RBP. The average strength of the sites used to generate said model is referred to as its R sequence . IT-based models can also be derived from high-throughput binding site identification techniques such as ChIP-seq (e.g. the derivation of TFBS models in 24). Information density-based clustering (IDBC) analysis, where groups of closely situated binding sites are evaluated based on their combined strength (their "information density") and intersite distances, has been applied along with these TFBS models in both the identification of TFBS-dense clusters and the accurate prediction of gene expression patterns 25 .
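The relationship between an information weight matrix, the R i of an individual site and R sequence can be illustrated with a minimal sketch. It is not the software used in this study: the aligned sites are invented for illustration, the pseudocount is an assumption, and the small-sample correction normally applied to information estimates is omitted.

```python
import math

BASES = "ACGU"

def build_iwm(sites):
    """Build a simple information weight matrix from aligned, equal-length sites.

    Each entry is 2 + log2(f(b, l)) bits, where f(b, l) is the frequency of
    base b at position l. A pseudocount of 1 per base (an assumption of this
    sketch) keeps unobserved bases finite.
    """
    n = len(sites)
    iwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        weights = {}
        for base in BASES:
            freq = (column.count(base) + 1) / (n + len(BASES))
            weights[base] = 2.0 + math.log2(freq)
        iwm.append(weights)
    return iwm

def r_i(iwm, seq):
    """Individual information (bits) of one candidate site."""
    return sum(iwm[pos][base] for pos, base in enumerate(seq))

def r_sequence(iwm, sites):
    """Mean R_i over the sites used to build the model."""
    return sum(r_i(iwm, site) for site in sites) / len(sites)

# Hypothetical aligned binding sites, for illustration only.
toy_sites = ["GAAGAA", "GAAGAC", "GGAGAA", "GAAGGA"]
toy_iwm = build_iwm(toy_sites)
print(round(r_sequence(toy_iwm, toy_sites), 2), round(r_i(toy_iwm, "GAAGAA"), 2))
```

In this framing, a candidate site is called strong when its R i is at least the R sequence of the model, which is the threshold used throughout the Results.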
We suggest that viral genomes also bind these RBPs, and we define the locations of likely strong binding sites across the genomes of various RNA viruses. As it replicates, we propose that the viral genome binds these proteins, preventing their reimportation into the nucleus where they are normally needed for essential post-transcriptional activities. We theorize that incremental replication and transcription of viral RNAs in the cytoplasm creates a sink for these proteins, starving the host nucleus, and initiating a series of events that release viral particles into the lumen, enabling rapid infection of neighboring lung epithelial cells (Figure 1). An infographic has been created to provide a detailed step-by-step guide to the proposed mechanism, from the initial viral infection to spread of infection to the lungs and other major organs, leading to lowered blood oxygen levels, and multi-system organ failure 27 .
Proposed molecular pathogenetic mechanism of RNA-viral infection
RNA viral genomes of Influenza viruses replicate in the nucleus and are processed by host RNA spliceosomes. For example, the M and NS segments of the Influenza genome are processed using the host splicing mechanism 28 . Viral RNAs, like host transcripts, are capable of sequence specific binding to RBPs. This can conceivably deplete RBPs from host encoded RNAs, where they ordinarily function. These unbound RNAs are capable of hybridizing to the non-template derived strand of the chromosome 29 . RNA naturally forms a stronger bond to DNA than DNA does to itself, especially in rG:dC hybrids 6 . As a result, mRNAs would displace DNA by hybridizing to complementary bases, resulting in R-loop formation, which can lead to DNA damage.
The RNA spliceosome regulator SRSF1 acts on exonic splicing enhancer sequences in pre-mRNA and forms RNP complexes with nascent mRNA precursors. Aside from its established role in enhancing exon recognition 10 , binding of SRSF1 to these transcripts is required to prevent or destabilize the formation of R-loops 6 . R-loops are derived from RNA transcripts that anneal to the chromosomal strand complementary to the transcription template strand. If not eliminated, these structures pose a threat to genomic integrity as targets for DNA damage. The structure of R-loops consists of two duplex-single strand junctions which are recognized by nucleases that cleave the DNA 29 . DNA fragmentation causes a G2 phase cell cycle arrest which can potentially lead to cell death 11 . R-loops that are not targeted by nucleases are nonetheless still non-functional and thus inflict damage on the cell 6 . As RNA viruses enter the cell and replicate, the nucleic acid sequences they encode divert RBPs such as SRSF1 away from binding to nuclear RNA transcripts, thus promoting the creation of R-loops.
RNPS1 is a pre-mRNA splicing activator protein that functions together with SRSF1 to form RNP complexes on nascent transcripts 13,30 , but also has a role in preventing transcriptional R-loop formation 11 . RNPS1 also suppresses high molecular weight DNA fragmentation at high expression levels. These two proteins work together but have independent mechanisms as RNPS1 cannot compensate for SRSF1 splicing function in its absence and vice versa 11 . In Dengue virus, the protein called NS5 binds to host spliceosome complexes and modulates endogenous splicing to change mRNA isoform abundance of antiviral factors. By also interacting with U5 snRNP particles, it reduces the efficiency of pre-mRNA processing, hence resulting in a less restrictive environment for viral replication. It has also been shown that NS5 interacts with the host protein, RNPS1, which disrupts normal nuclear RNA binding processes 31 .
Viral infections interfere with post-transcriptional processing of host pre-mRNA including splicing, capping, and translation during viral invasion. Since SRSF1 binds and interacts with pre-mRNA during the earliest stages of splicing, diversion of SRSF1 and other spliceosomal components to other RNA sequences depletes the cell's resources. Normally, a 7-methylguanosine cap is added to the 5' end of cellular mRNA to protect the sequence from degradation. However, Influenza carries proteins that have "cap-snatching" abilities 32 . Influenza snatches the 5' cap by cleaving the mRNA 10 to 15 nucleotides away from the guanosine, and this cap is used to prime transcription of the virus. Finally, during viral infections, all RNA processing mechanisms are shared between two genomes. Ultimately, as transcriptional and translational mechanisms fail to process the mRNA, the unprocessed transcripts will form R-loops with DNA, cause DNA damage, and induce higher expression of DNA repair genes (such as DDB2; see Results).
Unrepaired damaged DNA that encounters a replication fork leads to unresolved double strand breaks, triggering apoptosis. The quantity of virus that escapes into tissues, blood and other conduits (e.g. lymphatic), and other systems would likely dwarf the amount that is released by conventional viral budding from the cell membrane. This viral load will likely overwhelm the immune system in individuals who are already immune deficient and might provoke a systemic inflammatory response (like a cytokine storm). However, the high titer of virus is likely to infect neighboring cells and other tissues. The extent of the apoptotic response may be the distinguishing finding which separates the patients who survive the infection from those who end up in intensive care, develop pulmonary insufficiency and multi-system failure.
The deficiency in SRSF1 and other RBPs in the nuclei of Influenza, Dengue or SARS-CoV-2 infected cells does not require any specialized mechanism. Assuming that the virus is replicating freely in the cytoplasm (or nucleus, in the case of Influenza), the significant excess of unpackaged, replicated viral RNA acts as a sponge to sequester newly synthesized, folded RBPs. Based on mass action, the quantity of RBPs that would be transported into the nucleus for host mRNA processing would have a much-diminished nuclear stoichiometry in comparison with normal, uninfected cells.

Results
Derivation of CLIP-based SRSF1 and RNPS1 information theory-based models
Cells depleted of SRSF1 have been shown to have unstable genomes, which can be corrected by overexpression of RNPS1 11 .
In order to investigate the significance of SRSF1 and RNPS1 binding in viral genomes, we first developed information theory-based models for the recognition sequences for each of these proteins using binding site datasets derived from transcriptome-wide RNA binding protein datasets of CLIP sequencing data. We then scanned multiple RNA viral genomes, as well as the human transcriptome, with these derived models to identify and predict the strength of individual binding sites.
An Information Weight Matrix (IWM) for SRSF1 has been previously derived 21 ; however, it was based only on a very small set of manually curated binding sites (N=28). We therefore derived new SRSF1 IWMs using publicly available eCLIP data (two separate replicates from reference 33). Multiple SRSF1 models exhibited very similar binding motifs; however, their differences justified our analyses using the two most divergent IWMs in this study. These models are referred to as SRSF1 "Replicate 1" and "Replicate 2" models, as they are derived from two separate eCLIP experimental replicates from the same study. SRSF1 "Replicate 1" is derived from a larger number of eCLIP peaks (50,000) compared to 5,000 for "Replicate 2". Since SRSF1 "Replicate 1" was derived from a greater number of sites, it may be more accurate for detection of weaker SRSF1 binding sites.
A distinct IWM was derived from iCLIP data from transcriptome-wide protein crosslinking to sequences recognized by RNPS1 30 . It was evident that the RNPS1 IWM and the newly derived SRSF1 models exhibited a similar pattern of nucleotide conservation based on comparison of their respective sequence logos (Table 1). STAMP, a program which analyzes DNA-binding motifs, was used to compare these models 34 . The SRSF1 "Replicate 1" and "Replicate 2" models were both highly similar (motif alignment e-value < 0.01) to the RNPS1 IWM (Table 1), implying that individual binding sites recognized by these two factors are similar. Indeed, the motif similarity between these two factors has been described 13 . We suggest that this overlap in their respective binding affinities may account for why RNPS1 overexpression can enable SRSF1-deficient cells to overcome their inherent genomic instability phenotype.

RBP binding sites in RNA viral genomes
The newly derived SRSF1 and RNPS1 models (as well as an hnRNP A1 model to act as a positive control [its derivation described in 22], as the RBP has been shown to regulate transcription of beta coronaviral genes 35 ) were used to scan the genomes of multiple RNA viruses: Dengue (Type 3), HIV (Strain B and C), Influenza A (H3N2; two separate strains), and SARS-CoV-2 (NC_045512.2). In coronaviruses, the infectious particle contains the positive strand, but the negative strand copy of the RNA is generated for protein translation 36 and may be available to bind RBPs. Therefore, both the positive and negative strands of the viral genomes were scanned for SRSF1, RNPS1 and hnRNP A1 binding, regardless of the replication mechanism of the virus.
The SARS-CoV-2 genome was determined to contain >600 SRSF1 (with either SRSF1 model) and RNPS1 binding sites (Table 1). However, histograms which illustrate the distribution of the strengths of all SRSF1 and RNPS1 binding sites in SARS-CoV-2 (Figure 2A) reveal that the majority of these are weak sites (where R i < R sequence ) that may not be used. We therefore focused downstream analysis on strong binding sites (where R i ≥ R sequence ) of each IWM (R sequence : 6.7 bits for the SRSF1 "Replicate 1" model; 6.4 bits for the SRSF1 "Replicate 2" model; 7.8 bits for the RNPS1 model; and 4.6 bits for the hnRNP A1 model). There are only 35 RNPS1 and between 31 and 60 SRSF1 binding sites (depending on SRSF1 model) on the positive strand of the SARS-CoV-2 genome that meet this R sequence threshold (Table 1). The total numbers of SRSF1 binding sites within all other viral genomes tested are provided in Table 2, while RNPS1 and hnRNP A1 binding site counts are available within a Zenodo repository for this study (extended data 38 Section 1 - Table 1). The hnRNP A1 model consistently predicts more strong binding sites than the SRSF1 and RNPS1 models across all the RNA viral genomes tested, as well as in the human gene controls. This is likely due in part to its relatively low R sequence threshold compared to the other models used. Interestingly, we observed significantly more SRSF1 and RNPS1 binding sites on the positive strand compared to the negative strand for all tested RNA viral genomes (exception: sites in SARS-CoV-2 predicted by SRSF1 "Replicate 1" model). This phenomenon was observed in both positive-strand and negative-strand RNA viruses (e.g. both Influenza A strains tested). This imbalance was not observed in the human genes tested (Table 2).
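The per-strand counts described above amount to sliding a weight matrix along a genome and its reverse complement and comparing each R i with the R sequence threshold. The following sketch shows that bookkeeping only. The three-position matrix, the short sequence and the threshold are invented for illustration and are not the SRSF1 or RNPS1 models, and treating R i > 0 as the criterion for "any site" is an assumption of the sketch.

```python
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A"}

def reverse_complement(rna):
    """Return the complementary strand read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def scan(iwm, seq):
    """Yield (start, R_i) for every window as wide as the weight matrix."""
    width = len(iwm)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, sum(iwm[i][base] for i, base in enumerate(window))

def count_sites(iwm, genome, r_seq):
    """Count sites with R_i > 0 and strong sites (R_i >= r_seq) on each strand."""
    counts = {}
    strands = (("+", genome), ("-", reverse_complement(genome)))
    for strand, seq in strands:
        scores = [ri for _, ri in scan(iwm, seq)]
        counts[strand] = {
            "all": sum(1 for ri in scores if ri > 0),
            "strong": sum(1 for ri in scores if ri >= r_seq),
        }
    return counts

def toy_column(preferred):
    """Hypothetical matrix column: 2 bits for one base, -1 bit for the rest."""
    return {base: (2.0 if base == preferred else -1.0) for base in "ACGU"}

toy_iwm = [toy_column("G"), toy_column("A"), toy_column("A")]  # favors GAA
toy_counts = count_sites(toy_iwm, "GAAGAACUUC", r_seq=5.0)
print(toy_counts)
```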
Previously, tightly organized groups of transcription factor binding sites (TFBS) were identified using information dense clustering 25,39 . We applied this method to identify regions of the viral genomes with large concentrations of binding sites (extended data 38 Section 1 - Table 2). Clusters of weak SRSF1 and RNPS1 sites are common (e.g. there are 5 SRSF1 clusters on the positive strand of SARS-CoV-2; extended data 38 Section 1 -Tables 2A and 2B); however, clusters made up exclusively of strong binding sites (R i ≥ R sequence ) are extremely rare in the viral genomes tested.
We observed that all strong RNPS1 sites were also predicted to be strong (R i ≥ R sequence ) by the SRSF1 "Replicate 2" model. This is not surprising, as the two models were found to have significantly similar binding motifs (Table 1). This overlap, as well as the location and strength of all other strong SRSF1 ("Replicate 2" model only) and RNPS1 binding sites, can be observed in Figure 3 where sites were mapped across the SARS-CoV-2 and Influenza A genomes. This was not observed, however, for SRSF1 "Replicate 1" despite its similarity to the RNPS1 model. For this SRSF1 model, nearly half of all strong RNPS1 sites were predicted to be weak (R i below the R sequence threshold).
Despite its low mutation rate, over 220 SARS-CoV-2 strains have already been identified, with potential mutational hot spots of different geographic origins 40 . If the proposed mechanism does play a role in the severity of infection, then it is expected that various strains of SARS-CoV-2 would not significantly differ in numbers of binding sites, as no particular strain of SARS-CoV-2 has yet been proven to affect disease recovery (indeed, more transmissible strains have been identified but none more pathogenic 41,42 ). To test this theory, genomes of 8 SARS-CoV-2 strains were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) database and analyzed using the IWMs for SRSF1, RNPS1 and hnRNP A1 (Table 3 for positive strand analysis; extended data 38 Section 1 - Table 3 for analysis of both strands). The strains selected were those that showed maximum divergence from one another based on analyses by NextStrain (which tracks the genomic epidemiology of SARS-CoV-2 43 ). Binding site counts were within 90% of one another across all strains, except for MT198652.1 (Spain), which contains an undetermined sequence where binding site differences are mapped. A strong consistency between binding site counts and strengths was noted, despite maximizing the divergence between the selected SARS-CoV-2 strains. For RBP binding, it therefore made little difference which SARS-CoV-2 sequence was selected for the subsequent analyses.

Figure 3. … ) were scanned for strong preexisting binding sites for the RBPs RNPS1 and SRSF1 (newly derived "Replicate 2" model). Custom wiggle tracks containing those RBP binding sites with R i ≥ R sequence were generated and visualized by NCBI Nucleotide. Track images were manually adjusted to indicate the strand in which the binding site was identified (blue vertical lines indicate sites on the positive strand, orange on the negative strand). The majority of sites predicted by the RNPS1 model were simultaneously predicted by the SRSF1 model; however, the SRSF1 model identifies additional unique binding sites.

… of binding sites remains relatively consistent between each Influenza A strain, despite their divergent genomic sequences.
The locations of all predicted binding sites and information-dense clusters within the genome of each RNA virus tested have been made available within the extended data archive (Section 2 38 ). These data are provided in the form of 'bedgraph' genome browser tracks. The locations of binding site clusters are also provided as lollipop plots within the archive (Section 3), as are the IWMs used to evaluate each site (Section 4).
Human transcriptome analysis of RNA binding sites
Each of these RNA viral genomes contains multiple strong RNA binding sites. The frequency of RBP binding in human transcriptomes was determined in order to relate the abundance of these proteins bound to viral RNAs to their normal reservoir in host nuclear RNA of infected cells. Expressed host gene sequences were scanned with IWMs for SRSF1, RNPS1 and hnRNP A1 to locate all potential binding sites throughout transcribed regions of the human genome, then partitioned among these genes based on their abundance in relevant cell types. These were compared with binding sites within 300nt of a known exon, as many of these RBPs have critical functions in exon recognition and maturation of mRNA splice isoforms (provided as bedgraph tracks in the Zenodo archive [Section 2] 38 ). While the majority of these binding sites are considered weak (R i < R sequence ; Figure 2B) … Regardless of these differences, however, this analysis illustrates that many strong binding sites are separated by < 200nt and highlights how densely arrayed these sites are in the human transcriptome.
The number of strong SRSF1, RNPS1 and hnRNP A1 binding sites (R i ≥ R sequence ) was enumerated by gene (extended data 38 Section 1 - Table 5 [A-D]; genes without any strong binding sites are not listed). Similar tables were created which count the number of information-dense clusters located within each gene (extended data 38 Section 1 - …

DRIP (DNA-RNA immunoprecipitation) sequencing is a high-throughput method of identifying regions of the genome where R-loops can form. DRIPc sequencing is an improvement which provides higher resolution mapping data in a strand-specific … Despite an additional level of filtering (where the strand of the clusters and DRIPc-seq intervals must match), the frequency of overlap between binding site clusters and DRIPc-seq was much higher compared to the frequency of overlap to the DRIP-seq dataset (~15-17% overlap depending on IWM; extended data 38 Section 1 - Table 6A). In all test cases, limiting analysis to only those genes that are expressed in A549 cells (≥1 TPM) increased the percent overlap of clusters with both DRIP- and DRIPc-seq data sets (e.g. 15.3% of RNPS1 clusters overlap DRIPc-seq intervals among all genes, but 20.2% overlap when considering only genes expressed in the A549 cell line). When this analysis was repeated but limited to only those clusters near an exon (within 300nt), this also showed a significant increase in the fraction of clusters overlapping DRIP-seq intervals (extended data 38 Section 1 - Table 6B). These observations remain consistent when considering individual binding sites, rather than binding site clusters (extended data 38 Section 1 - Table 6C and 6D). It therefore seems that the vast majority of individual binding sites and information-dense binding site clusters do not overlap these DRIP- and DRIPc-seq regions. For example, only 5 of 36 clusters within THSD4 overlap the DRIPc-seq dataset (Figure 4B; extended data 38 Section 1 - Table 5F).
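The overlap percentages reported here depend on whether strand matching is enforced, which a small sketch makes concrete. The coordinates below are invented, intervals are assumed to be half-open (start inclusive, end exclusive), and the brute-force comparison is chosen for clarity; it is not the pipeline used to produce the extended data tables.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end

def percent_overlap(clusters, intervals, match_strand=True):
    """Percentage of clusters that overlap at least one R-loop interval.

    clusters and intervals are lists of (chrom, start, end, strand) tuples.
    With match_strand=True a hit also requires identical strands, mirroring
    the extra filter applied to strand-specific DRIPc-seq data.
    """
    if not clusters:
        return 0.0
    hits = 0
    for c_chrom, c_start, c_end, c_strand in clusters:
        for i_chrom, i_start, i_end, i_strand in intervals:
            if c_chrom != i_chrom:
                continue
            if match_strand and c_strand != i_strand:
                continue
            if intervals_overlap(c_start, c_end, i_start, i_end):
                hits += 1
                break  # count each cluster at most once
    return 100.0 * hits / len(clusters)

# Hypothetical coordinates, for illustration only.
toy_clusters = [
    ("chr15", 100, 200, "+"),
    ("chr15", 500, 600, "-"),
    ("chr15", 900, 950, "+"),
    ("chr1", 100, 200, "+"),
]
toy_rloops = [("chr15", 150, 300, "+"), ("chr15", 550, 700, "+")]

stranded = percent_overlap(toy_clusters, toy_rloops, match_strand=True)
unstranded = percent_overlap(toy_clusters, toy_rloops, match_strand=False)
print(stranded, unstranded)
```

For genome-scale data an interval tree or a sorted sweep would replace the nested loop, but the counting rule stays the same.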
As we are limiting this analysis to sites that are within a few, often short DRIPc-seq intervals, the distances between pairs of sites are likely to be tightly grouped. We also computed the average number of all binding sites and clusters, and only those which overlap the DRIPc-seq dataset, for each individual gene (sites and clusters per 100nt of gene length; extended data 38 Section 1 - Table 5). Binding site densities within specific genes are reduced for sites overlapping DRIPc-seq intervals (e.g. THSD4 SRSF1 cluster density reduces from 5.2E-03 to 7.0E-04 clusters per 100nt).

DNA damage response by RNA viral infection
We have previously described a machine learning (ML) based approach for developing gene expression signatures of various environmental exposures to cells, initially focusing on prediction of chemotherapy effects 46 . This method was applied to ionizing radiation data, from which accurate gene signatures were derived that could differentiate levels of radiation exposures. In particular, low exposures were distinguished from higher radiation levels that cause Acute Radiation Syndrome (ARS 47 ). ARS is characterized by vomiting, diarrhea, fever, low white blood cell count and fatigue. Physicians might not consider ARS in the differential diagnosis when presented with a patient exhibiting these symptoms, since Influenza and Dengue (viral) infections also present with vomiting, diarrhea, lymphopenia (especially Influenza H1N1 48 ) and fatigue, and are more common. Like ARS, these conditions lead to death in some cases. While Influenza A has a worldwide distribution, Dengue is more prevalent in Southeast Asia, the Americas and the Western Pacific, where it typically presents with severe manifestations including hemorrhagic fever and shock. We have considered how the life cycle of these viruses might be related to the corresponding cellular responses.
Expression data from irradiated blood samples were used to derive the human radiation gene signatures reported in Zhao et al. 47 . While it was assumed that these ML models were specific for diagnosing ARS, the models were further tested to determine if they could distinguish ARS from other conditions that share similar clinical presentation (e.g. vomiting, diarrhea). Four human ML radiation signatures from Zhao et al. (assessed by traditional validation; denoted as ML models "M1", "M2", "M3" and "M4", which are described in extended data 38 Section 1 - … Table 7). Approximately 15% of aplastic anemia patients were also misclassified. The model "M1" showed the lowest misclassification rate against Influenza patients (9-29% of patients misclassified), model "M2" best classified Dengue-infected patients (7-33% misclassified), while models "M1" and "M3" performed well with patients with aplastic anemia (5-20% misclassified for "M1" and 0-14% misclassified for "M3"). In nearly every instance, the inclusion of normal controls from the Influenza and Dengue studies improved overall accuracy of all four ML models (17.4% and 18.1% average misclassification of Influenza and Dengue-infected patients, respectively). This phenomenon was not observed in the aplastic anemia dataset tested. The observation that normal controls are more often correctly classified indicates that these models are not so much incorrectly classifying infected patients as they are identifying gene expression differences that may be a response to, or caused by, the viral infection itself.
The four radiation gene signatures assessed from Zhao et al. 47 consist of 32 unique genes. When performing feature removal analysis (where model accuracy is reassessed after each gene is individually removed from it), 10 genes were identified that greatly contribute to patient misclassification: DDB2, PCNA, GTF3A, PRKCH, CDKN1A, GADD45A, BCL2, MOAP1, TRIM22 and TALDO1 (extended data 38 Section 1 - Table 8).
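Feature removal analysis, as described above, re-evaluates a model after dropping one gene at a time and records how the misclassification rate changes. The sketch below illustrates only that loop. The nearest-centroid classifier is a stand-in chosen to keep the example self-contained and is not the ML method of Zhao et al.; the expression values and the gene named GENE_X are invented.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length expression vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def misclassification_rate(train, test, keep):
    """Error rate of a nearest-centroid classifier restricted to genes in keep.

    train and test are lists of (expression_vector, label) pairs.
    """
    by_label = {}
    for vector, label in train:
        by_label.setdefault(label, []).append([vector[i] for i in keep])
    centroids = {label: centroid(rows) for label, rows in by_label.items()}

    errors = 0
    for vector, label in test:
        sample = [vector[i] for i in keep]
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(sample, centroids[lab])),
        )
        errors += predicted != label
    return errors / len(test)

def feature_removal(train, test, genes):
    """Return baseline error and the change in error when each gene is removed."""
    all_idx = list(range(len(genes)))
    baseline = misclassification_rate(train, test, all_idx)
    change = {}
    for i, gene in enumerate(genes):
        keep = [j for j in all_idx if j != i]
        change[gene] = misclassification_rate(train, test, keep) - baseline
    return baseline, change

# Invented expression values: columns are DDB2, PCNA and a hypothetical GENE_X.
genes = ["DDB2", "PCNA", "GENE_X"]
train = [
    ([8.0, 5.0, 1.0], "irradiated"),
    ([9.0, 6.0, 1.0], "irradiated"),
    ([2.0, 5.0, 1.0], "unirradiated"),
    ([1.0, 4.0, 1.0], "unirradiated"),
]
# Unirradiated test samples; the first has elevated DDB2, as in an infected patient.
test = [([7.0, 4.5, 1.0], "unirradiated"), ([2.0, 4.6, 1.0], "unirradiated")]

baseline, change = feature_removal(train, test, genes)
print(baseline, change)
```

A negative change for a gene means that removing it reduces misclassification, which is the sense in which a gene "contributes to" misclassification in this toy setting.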
DDB2 is a DNA damage binding protein that is present in all four ML models. DDB2 expression levels were elevated in irradiated patients, which is likely due to a cellular response to radiation exposure, as this gene participates in nucleotide excision repair (NER; it ubiquitinates histones H3 and H4 to increase accessibility of nucleosomes, exposing DNA and enabling access to XPC [xeroderma pigmentosum group C-complementing protein], which performs NER 49,50 ). DDB2 shared a similar pattern of expression between irradiated samples and infected patients that were misdiagnosed as irradiated (elevated DDB2 expression in misclassified Influenza and Dengue patients; Figure 5). The activation of DDB2 would be consistent with the proposed mechanism, whereby high levels of RNA viral genome increase the formation of abnormal, unresolved R-loops which in turn activate a DNA damage response. The difference in DDB2 expression between patients correctly classified and those misclassified as irradiated was significant by the Mann-Whitney test (p-value = 0.0001). Other genes with significant differences in expression included GTF3A, PRKCH and PCNA (which also has a role in the DNA damage response; extended data 38 Section 1 - Table 8).

Biochemical kinetics of depleted RNA binding proteins in the human transcriptome
In the mechanism proposed (Figure 1), the fraction of SRSF1 and RNPS1 bound to host RNA decreases as the SARS-CoV-2 genome replicates and its share of binding sites in the cell increases, causing RNA:DNA hybrids which result in R-loops. We therefore estimate the quantity of viral genomes and extent of viral replication required for viral binding site counts to approach, match, and exceed the number of host RNA sites available. These are derived from the number of SRSF1 and RNPS1 sites expressed in either a single A549 cell or a type II primary pneumocyte. The overall expression of each host gene was normalized by dividing by the total expression of the given dataset, then the number of all binding sites within a gene was multiplied by its normalized gene expression value, and finally the sum of all expression-adjusted binding site counts was multiplied by the expected number of mature RNAs in a cell. We estimate a total of 80,000 RNAs per single cell (as determined by Marinov et al. 51 ), which is comparable with totals determined in other studies (e.g. Xia et al. 52 determined that a single osteosarcoma cell contains 92,000 ± 32,000 mature RNAs).
Based on this approach, the total number of expressed binding sites (of any strength) was computed for SRSF1 and RNPS1 (Table 1). However, this estimate includes sites expected to be weakly binding. When taking only strong binding sites into account, we estimate 12.7 to 18.2 million expressed SRSF1 ("Replicate 1" and "Replicate 2" SRSF1 models, respectively) and 9.9 million expressed RNPS1 binding sites in a single A549 cell. In a single primary pneumocyte, we estimated 6.6 to 9.4 million expressed SRSF1 sites ("Replicate 1" and "Replicate 2" models, respectively), as well as 5.2 million expressed RNPS1 binding sites. These estimates are based on expression levels in normal cells and may differ in infected cells. While the dissociation constant for RNPS1 is unknown, the dissociation constant (K D ) of SRSF1 bound to the RNA sequence 5′-UGAAGGAC-3′ has been experimentally measured as 0.8 μM 53 . With the K D , a Scatchard plot for SRSF1 binding was derived where host binding sites are substrates and viral binding sites are considered to be inhibitors of host RNA binding. We assumed that there is no free RNA binding protein, i.e. that all of it is bound to either host or viral binding sites. This assumption is reasonable for strong binding sites (where R i ≥ R sequence ). We use K D to compute the theoretical number of viral genomes required to satisfy various viral genome to host binding site ratios ( Figure 6 [ Table left]). This calculation is also carried out without reference to K D , by instead computing the number of viral genomes required to achieve binding site ratios in viral to host-bound RBP from a direct analysis of primary pneumocyte and A549 transcriptomes. The number of strong SRSF1 binding sites in a single viral genome multiplied by the level of viral replication is compared with the estimated number of expressed SRSF1 sites in the host nucleus (in a pneumocyte or an A549 cell; Figure 6 [ Table right]).
The data presented in Figure 6 use the number of sites predicted by the SRSF1 "Replicate 2" model, and only consider the positive strand of SARS-CoV-2. Despite their similarities, the SRSF1 "Replicate 2" model predicts far more binding sites on the positive strand of SARS-CoV-2 than the "Replicate 1" model (N=60 and 31, respectively). This leads to small differences in the estimated doubling time when only the positive strand of the virus is considered (extended data 38 Section 1 - Table 9A). An examination of potential binding sites on both strands of SARS-CoV-2 does not appreciably alter the estimated doubling time for either SRSF1 IWM (extended data 38 Section 1 - Table 9B).
The doubling times required for infection initiated by a single virion were computed for varying numbers of viral genomes, as replication increases the overall counts of viral RBP binding sites. The processivity rate of genome replication for SARS-CoV-2 is currently unknown, so a value was estimated based on a polymerization rate of 3.7 nt/s for a different RNA-dependent viral RNA polymerase, that of Vesicular Stomatitis Virus (VSV) 54 . The doubling time was then adjusted to 2.31 hours per replication event, based on the increased length of the SARS-CoV-2 genome (L=30,899nt) compared to VSV. The doubling time is estimated to be between 32.5 and 44.1 hours to achieve a level of SARS-CoV-2 binding that depletes RBP from an equal number of expressed host nuclear RNA sites (1:1 ratio). However, fewer replication events and shorter doubling times are computed using the published K D of SRSF1 (between 5 and 14 hours less). The number of replication events required for viral genome binding sites to overtake host RNA binding was lower in primary pneumocytes than in A549 cells (~2.3 hours or 1 doubling of the SARS-CoV-2 genome). This was anticipated, since the total number of expressed SRSF1 (and RNPS1) sites is lower in primary pneumocytes than in the immortalized cell line due to lower overall gene expression levels.
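The arithmetic behind these estimates can be illustrated with a short sketch. It assumes, as a simplification that is not stated explicitly in the text, that infection starts from one genome, that every replication event doubles the genome count, and that the number of doublings is rounded up to a whole event. The function names are hypothetical; the rate, genome length and site counts are the rounded values quoted above.

```python
import math

POLYMERASE_RATE_NT_PER_S = 3.7   # VSV polymerase rate cited in the text
GENOME_LENGTH_NT = 30_899        # SARS-CoV-2 genome length used in the text


def replication_time_hours(length_nt: int = GENOME_LENGTH_NT,
                           rate: float = POLYMERASE_RATE_NT_PER_S) -> float:
    """Time to copy one genome once, in hours."""
    return length_nt / rate / 3600.0


def genomes_needed(host_sites: float, sites_per_genome: int,
                   ratio: float = 1.0) -> int:
    """Viral genomes required for viral:host binding sites to reach `ratio`."""
    return math.ceil(ratio * host_sites / sites_per_genome)


def doublings_needed(n_genomes: int) -> int:
    """Whole doublings needed to grow from one genome to n_genomes (assumption)."""
    return math.ceil(math.log2(n_genomes)) if n_genomes > 1 else 0


def time_to_ratio_hours(host_sites: float, sites_per_genome: int,
                        ratio: float = 1.0) -> float:
    n = genomes_needed(host_sites, sites_per_genome, ratio)
    return doublings_needed(n) * replication_time_hours()


# SRSF1 "Replicate 2" model: 60 sites per positive-strand genome;
# 9.4 million (pneumocyte) and 18.2 million (A549) expressed host sites.
pneumocyte_hours = time_to_ratio_hours(9.4e6, 60)
a549_hours = time_to_ratio_hours(18.2e6, 60)
```

With these rounded inputs the sketch gives 18 doublings (about 41.8 hours) for a pneumocyte and 19 doublings (about 44.1 hours) for an A549 cell at a 1:1 ratio, a difference of one doubling between the two cell types.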

Discussion
We propose a previously undescribed putative mechanism of RNA viral infection-induced apoptosis, supported by RNA binding events determined by information theoretic analysis. In the mechanism, viral release is enhanced by viral genome replication, which sequesters RBPs, thereby depleting native binding of RBPs to and stabilization of host-encoded transcripts. This process can occur in either the cytoplasm or the nucleus of the host cell, depending on specific replication requirements of different viral families. In SARS-CoV-2, this is expected to substantially reduce import of RBPs into the nucleus. Reduced availability of nuclear RBPs promotes R-loops through formation of complementary duplexes between nascent transcripts and chromosomal sequences. High densities of R-loops at a late stage of infection would be expected to overwhelm cellular DNA repair mechanisms that ordinarily remove these structures and eliminate DNA breakage. DNA damage markers DDB2 and PCNA are increased in both Influenza and Dengue infections, respectively. Unrepaired, persistent chromosome double strand breaks are unstable and induce apoptosis, which would be expected to release high viral titers.
IT-based models of RBP binding sites were used to scan viral RNA genomes (Influenza, SARS-CoV-2 and Dengue) and host transcriptomes. IT models derived from thousands of validated RBP binding sites delineated numerous strong SRSF1, RNPS1 (and hnRNP A1) RNA binding sites within these viral genomes. The derived SRSF1 and RNPS1 binding motifs were shown to be highly similar, consistent with previously published studies demonstrating that RNPS1 could partially complement genomic instability due to SRSF1 deficiency. Indeed, both models detected many of the same RNA binding sites in the host transcriptome, and all strong RNPS1 binding sites detected in the SARS-CoV-2 genome were simultaneously detected by at least one SRSF1 information model. In divergent strains of both SARS-CoV-2 and Influenza A (H3N2), the frequencies and strengths of these binding sites are highly consistent. Finally, we estimate the quantity of replicated viral genomes at which viral binding sites meet or exceed the number of binding sites expressed in a host lung cell, and the doubling time required to deplete these RBPs, which is consistent with the observed time course of severe infections. SARS-CoV-2 efficiently infects multiple species of mammals 55 , and possesses an RNA polymerase with proofreading capability, which enables it to faithfully and accurately replicate and transcribe its genome. In this study, we suggest that effects of SARS-CoV-2 infection are mild in most individuals because most of us mount robust immune responses and eventually clear the virus. The mechanism that we propose (Figure 1), which may be a contributing factor in infections by a variety of different RNA viruses, has the potential to overwhelm that response through jackpot replication coupled to apoptotic events caused by loss of chromosome integrity stemming from depletion of essential RBPs. This results in high multiplicities of infection in the most vulnerable cells.
This could cause a rapid loss of viable pneumocytes, compromising oxygen transport to a point where it is insufficient to maintain the blood pO 2 levels needed to support organ functions. Systemic inhibition of viral replication and transcription of viral proteins will be essential to prevent or mitigate this pathological mechanism.
Other coronaviruses such as MERS and SARS have been shown to induce apoptosis 56 . The polyphenol Resveratrol has been shown to downregulate apoptosis in vitro 56,57 , possibly by overexpressing sirtuins (a family of signalling proteins). However, this is ultimately not a practical solution to infection, as the drug will only delay an eventual high multiplicity infection event. In order to inhibit the viral mechanism proposed in this study, a drug must inhibit the viral machinery that sequesters spliceosomal components, leading to R-loops and DNA damage. This may explain, in part, why remdesivir (Gilead) improves the recovery of patients with severe respiratory symptoms. The drug, which was originally developed for treatment of Ebola virus by inhibiting its RNA dependent RNA polymerase, also inhibits viral replication of SARS-CoV-2.
Other potential therapies include those targeting expression of genes encoded by the viral genome, which use a common 5' leader sequence of all transcripts. The promoter sequence for these genes binds to the host encoded hnRNP A1, which regulates transcription of beta coronaviral genes (a family of which SARS is a member). While hnRNP A1 could be a potential drug target for therapy (there are small molecules that have been shown to inhibit hnRNP A1 RNA splicing activity 58 ), there would be concerns that this may cause inadvertent side effects due to its impact on normal mRNA splicing.
Regardless of whether apoptosis releases large quantities of mature infectious virus, the proposed mechanism will still likely impact pneumocyte function. A high multiplicity of infection arising from apoptotic release of infectious particles, following jackpot viral replication and RBP depletion, is expected to severely damage both the original cell and neighboring pneumocytes. The severe symptoms might be the result of rapid lysis of cells responsible for oxygen transport, rather than a cytokine storm. Autopsies of infected individuals from Wuhan China have shown evidence of inflammation, but not necessarily macrophage invasion and pulmonary edema 59 .
Furthermore, apoptosis has been demonstrated in lung epithelial cells in Macaques infected with Influenza virus 60 . This could explain why physicians and other health professionals in repeated contact with multiple infected patients do not seem to have time to develop immunity to the virus, regardless of their age. Type II pneumocytes which produce surfactant, required at high levels in newborns, decrease with age 61 and are particularly diminished in individuals with respiratory disease like COPD (Chronic obstructive pulmonary disease) and ARDS (Acute respiratory distress syndrome). If the multiplicity of infection (MOI) of virus damages this population of cells, then individuals with fewer cells might be more susceptible to exhibiting insufficient pulmonary function due to the high MOI released by the mechanism proposed. These patients would be at greater risk for severe complications requiring assisted ventilation. It is also possible that the deficiency of functional pneumocytes in such individuals cannot be compensated for by extracorporeal membrane oxygenation to rescue multiple organ failure.
Humans have the highest number of type II cells at birth, because the first breath requires significant levels of surfactant to expand lung volume. Synthetic surfactant is an essential treatment for premature birth, since type II pneumocytes mature late in gestation. Age-related loss of these cells has been measured and the mechanism leading to it was described 61 . Loss of functional pneumocytes is particularly evident in individuals with ARDS, who exhibit significant lung fibrosis, which is also seen in patients with SARS-CoV-2 infections. Older individuals (or those with pre-existing respiratory conditions) are more susceptible to the loss of the remaining cells by apoptosis or autophagia. Decreased pneumocyte counts affect O 2 transport efficiency, which lowers pO 2 in the blood, tissues and organs. The proposed mechanism implies that jackpot viral replication events, regardless of age of the infected individual, enhance viral release through apoptosis and infection of adjacent cells. Jackpot replication events are more likely in coronaviruses like SARS-CoV-2, as they are capable of repressing the innate immune response, i.e. induction of interferon response to viral double stranded RNA (unlike Influenza) 62-64 . Repression of innate immunity enables the virus to replicate unabated, which would be expected to delay recognition of these cells by regulatory T cells and killing by macrophages.
The immune system appears to be a witness, rather than a direct participant in the process of killing infected pneumocytes in many SARS-CoV-2 patients. Approximately a third of infected patients do not raise antibodies after exposure to SARS-CoV-2 65 , and lack proinflammatory cytokines 66 . Indeed, other coronaviruses have been shown to counter the innate immune response 36 . One plausible explanation for this is that coronaviruses can evade the host immune response 62,63 , specifically response involving antiviral type I and type III Interferon proteins (IFNs) 64 . While many viruses can suppress immune response, IFN response is significantly delayed in coronavirus-infected host cells. There is very little IFN detected in the early stages of MERS-CoV infection 67,68 . In MERS-CoV, innate immunity is suppressed by NS4B. This raises questions about how, or even whether, extrinsic (IFN receptor-mediated) apoptotic response occurs. However, while the suppression of interferon response is clearly a necessity for SARS-CoV-2 progression, it is not sufficient to explain the severity of the molecular pathogenesis since unrestricted replication alone cannot account for the release of high titer of mature virus that would be necessary for widespread, rapid infection. We therefore suggest that the mechanism proposed could explain how SARS-CoV-2 causes severe lung damage without requiring a hyperinflammatory reaction resulting from a cytokine storm initiated by activating an innate immunity response.
Viral infections significantly alter the transcriptional profiles of host genes in infected cells. Recent studies of Zika virus (an RNA virus) have revealed that infection not only impacts transcription, but affects alternative mRNA splicing as well 69 . Both RNA and DNA viruses encode factors that directly 70 or indirectly 69 alter host RNA processing, producing transcripts that resemble alternative mRNA isoforms. We suggest that the mRNA splicing changes observed subsequent to infection by an RNA virus could be a consequence of replicated viral genome binding to RBPs, thus changing the nuclear stoichiometry of splicing proteins (such as SRSF1). This would effectively reduce the concentration of available splicing factors, which could be responsible for the observed alternative splicing events of other splicing factors (such as SRSF2 and SRSF3) reported by Bonenfant et al. 69 . Thus, the mechanism proposed in this study may not only impact genome stability by the introduction of R-loops, but may simultaneously alter the global alternative splicing landscape in infected host cells.
RNA-based vaccines based upon synthetic SARS-CoV-2 transcripts containing modified nucleosides that have been dephosphorylated to escape innate immunity are being tested 71 . These candidates exploit host protein synthesis machinery to transiently express viral antigens that activate B and T-cell immunity. However, these synthetic RNAs would also be available for RBP binding. A transcript encoding the SARS-CoV-2 spike glycoprotein 'S' gene, for example, would contain 7 strong RNPS1 and between 6 and 8 strong SRSF1 binding sites (depending on the SRSF1 model). If the levels of expression produced from these transcription templates cannot be carefully controlled, excess production of these RNAs could potentially elicit undesirable side effects through sequestration of critical host RNA binding proteins required to inhibit R-loop formation.
Localization of viral replication to the cytoplasm does not obviate the fact that there is still a competition between the host and viral genomes for these RNA binding proteins. While the binding site stoichiometry calculations are unchanged, compartmentalization of the viral and host genomes does have implications for preventing R-loops during host transcription. Since coronavirus replicates in the cytoplasm, binding of newly synthesized RBPs occurs there. This makes less protein available to be imported into the nucleus for binding to nascent transcripts to prevent R-loops from forming. The viral genome may have an advantage in this competition for binding to RBPs relative to nuclear transcripts, due to the proximity of the viral genome to nascent RBPs in the cytoplasm, which may limit transport and impede their import into the nucleus. RBPs are often highly expressed, including SRSF1 and RNPS1, and are abundant in the lung (where SARS-CoV-2 infection is most prominent). Thus, the cytoplasmic concentration of viral genome necessary to prevent the localization of RBPs into the nucleus is likely to vary between different tissues.
The proposed mechanism of RNA virally-induced apoptosis is supported by extensive bioinformatic analyses indicating that strong RNA binding sites of host RBPs are common in RNA viral genomes, and that the frequencies of such binding sites are relatively consistent between divergent strains in both Influenza A and SARS-CoV-2. Future efforts should elucidate details of the mechanism with functional analysis of infected cells, including demonstration of increased R-loop formation, induction of relevant apoptotic or DNA repair responses, and direct interaction between viral genomes and host RBPs. This would justify further investigations into binding of specific RBPs to viral sequences in infected patients. The potentially prognostic significance of such data could be useful in differentiating among drug therapies that target RNA viral genome replication and/or expression.

Methods
IWMs for SRSF1 and RNPS1 were derived from eCLIP and iCLIP-seq datasets (respectively) using Maskminent under varying model length conditions (6-10nt long; 1000 Monte Carlo cycles). As experimental noise has been found to contribute to non-specific IWMs 24 , we limited model derivation to only the 5,000 or 50,000 iCLIP peaks with the highest signal value (SRSF1) or the lowest p-values (RNPS1; computed by Piranha). In practice, the derived models remained similar regardless of the size of the peak subset used. As many intervals from the SRSF1 and RNPS1 datasets were short (<20nt), peak lengths were extended in both directions by the interval length (e.g. a 10nt interval becomes 30nt long). We found both RNPS1 and SRSF1 models derived at a length of 6nt to be the most informative, with similar R i densities, although they differed slightly (Table 1). Both the RNPS1 model and the SRSF1 model derived from the second replicate (SRSF1 "Replicate 2") were generated from 5,000 CLIP-seq peaks, while the SRSF1 "Replicate 1" model was derived from 50,000 peaks.
The derived RNPS1 and SRSF1 models had visually similar RNA binding motifs. To evaluate the similarity between them, the RNPS1 and SRSF1 IWMs were compared using the STAMP web server 34 , which performs a pairwise alignment between motifs (ungapped Smith-Waterman alignment method) and compares them using a Pearson correlation coefficient distance metric. Tables 3 and 4.

Expressed RNA binding sites in lung cells
Publicly-available expression datasets were downloaded from the Gene Expression Omnibus for A549 cell lines (GSE141171; RNAseq) and primary type II pneumocytes (GSE86618; scRNAseq). Normal expression for each cell type was computed by taking the average of all control samples from each dataset (N=3 control samples in GSE141171; N=215 control samples in GSE86618). We then use this information to estimate the total number of binding sites present in a single pneumocyte or A549 cell. First, the program "ScanDataSummaryProgram.pl" (available within underlying data 38 Section 6) was used to compute the total number of binding sites (≥R sequence ) in each cell type for each expressed gene (TPM >0; underlying data 37 Section 1 - Table 5). The overall expression of each gene was then normalized using the program "TotalBindingSitePerCellCalculator.pl" (underlying data 38 Section 6), which divides expression by the sum of all TPM values in the cell, multiplied by the estimated number of mature RNAs in a cell at any given timepoint (80,000 RNAs per lymphoblastoid cell 51 ). It then multiplies this normalized gene expression value with its binding site total to determine the overall contribution of binding sites from that gene in a single cell. The sum of this value across all expressed genes gives the total number of RNA binding sites expected to be available in a cell at any given time ( Figure 6).
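The normalization described above can be sketched in a few lines. This is an illustrative reimplementation in Python, not the Perl program from the data archive; the function name, input layout (dictionaries keyed by gene) and toy values are assumptions made for the example.

```python
RNAS_PER_CELL = 80_000  # estimate of mature RNAs per cell cited in the text


def expressed_sites_per_cell(tpm, sites_per_gene, rnas_per_cell=RNAS_PER_CELL):
    """Estimate the number of binding sites present in one cell.

    tpm:            dict of gene -> expression (TPM)
    sites_per_gene: dict of gene -> binding sites at or above the threshold
    """
    total_tpm = sum(value for value in tpm.values() if value > 0)
    if total_tpm == 0:
        return 0.0
    total_sites = 0.0
    for gene, expression in tpm.items():
        if expression <= 0:
            continue  # only expressed genes (TPM > 0) contribute
        # Expected number of transcripts of this gene in a single cell
        copies = expression / total_tpm * rnas_per_cell
        total_sites += copies * sites_per_gene.get(gene, 0)
    return total_sites


# Invented toy values for three placeholder genes
toy_tpm = {"GENE_A": 600.0, "GENE_B": 400.0, "GENE_C": 0.0}
toy_sites = {"GENE_A": 10, "GENE_B": 5, "GENE_C": 100}
toy_total = expressed_sites_per_cell(toy_tpm, toy_sites)
```

In the toy data, GENE_C has many sites but contributes nothing because it is not expressed, which is the reason the site counts are weighted by expression.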

Information-dense clustering of RBPs across viral genomes and human transcriptome
Information dense clustering has previously been applied to the human genome to identify clusters of organized TFBSs 25,39 . The clustering software (v1; described in reference 25; software provided in a Zenodo archive, https://doi.org/10.5281/zenodo.1707423) was used in this study to identify clusters of low-affinity (R i > 0 bits), moderate-affinity (≥ ½ R sequence ) and high-affinity (≥R sequence ) RBP sites in both the viral genomes investigated in this study, and across the entire human transcriptome. To be considered a cluster, each set of component sites was required to occur ≤25nt from one another, and the total information of all sites within the cluster had to equal or exceed 50 bits. In its original design, the clustering algorithm considered binding sites on both strands in forming clusters. To maintain strand specificity, we separated input by strand. Due to the high memory demands of the clustering algorithm, transcriptome scan input was separated into segments of ~200,000 sites per run, the results of which were subsequently combined. To avoid the inadvertent separation of a binding site cluster, input was split only when two sequential binding sites were >1000nt apart.
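The two cluster criteria (spacing of ≤25nt and a total of ≥50 bits) can be illustrated with a simplified sketch. This is not the published clustering software: it assumes a greedy chaining of adjacent sites on a single strand, and the function name and toy sites are invented for the example.

```python
from typing import List, Tuple

Site = Tuple[int, float]  # (position in nt, information content R_i in bits)


def cluster_sites(sites: List[Site], max_gap: int = 25,
                  min_total_bits: float = 50.0) -> List[List[Site]]:
    """Chain neighbouring sites and keep chains that carry enough information.

    Assumption: a site joins the current chain when it lies within `max_gap`
    nt of the previous site. Input should contain one strand only.
    """
    def total_bits(group: List[Site]) -> float:
        return sum(bits for _, bits in group)

    clusters: List[List[Site]] = []
    current: List[Site] = []
    for site in sorted(sites):
        if current and site[0] - current[-1][0] > max_gap:
            # Gap too large: close the current chain and start a new one
            if total_bits(current) >= min_total_bits:
                clusters.append(current)
            current = []
        current.append(site)
    if current and total_bits(current) >= min_total_bits:
        clusters.append(current)
    return clusters


# Invented toy sites: six closely spaced sites, then an isolated pair
toy_sites = [(100, 9.0), (110, 8.5), (125, 9.5), (140, 8.0), (150, 9.0),
             (170, 8.0), (400, 9.0), (410, 9.0)]
toy_clusters = cluster_sites(toy_sites)
```

Only the first six sites form a cluster here (52 bits in total); the pair at positions 400 and 410 is close enough together but carries only 18 bits.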

Identification of RBP sites and clusters within DRIP-seq intervals
All binding sites and information-dense clusters identified in the human genome were intersected with DRIP-seq and DRIPc-seq intervals, which indicate where there is evidence of R-loop formation in the human genome (performed by "ClusterToDRIPseqAnalysisProgram.pl"; underlying data 38 Section 6). The DRIP-seq dataset (GSE68845; IMR90 cells) is not strand specific, thus binding sites and clusters from either strand are considered when intersected against these intervals. DRIPc-seq data (GSE70189; NTERA2 cells), however, is strand specific, which has been taken into account (e.g. positive strand clusters found in positive strand DRIPc-seq intervals are reported). We then computed the gene density of sites and clusters that are found within these intervals (underlying data 38 Section 1 - Table 5) using the script "ClusterToDRIPseqAnalysisProgram.GeneDensityFinder.pl" (underlying data 38 Section 6) to determine if there is a correlation between the presence of binding sites and R-loop formation.
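The strand handling described above can be sketched as a simple overlap test. This is an illustration only, not the Perl program from the data archive; the tuple layouts, inclusive coordinates, function names and toy intervals are assumptions.

```python
from typing import List, Optional, Tuple

Feature = Tuple[str, int, int, str]             # chrom, start, end, strand
Interval = Tuple[str, int, int, Optional[str]]  # strand None = not strand specific


def overlaps(feature: Feature, interval: Interval) -> bool:
    """True if a site or cluster overlaps an R-loop interval.

    Unstranded intervals (DRIP-seq) accept features from either strand;
    stranded intervals (DRIPc-seq) require the strands to match.
    """
    f_chrom, f_start, f_end, f_strand = feature
    i_chrom, i_start, i_end, i_strand = interval
    if f_chrom != i_chrom:
        return False
    if i_strand is not None and i_strand != f_strand:
        return False
    return f_start <= i_end and i_start <= f_end


def features_in_intervals(features: List[Feature],
                          intervals: List[Interval]) -> List[Feature]:
    return [f for f in features if any(overlaps(f, i) for i in intervals)]


# Invented toy clusters and intervals
toy_clusters = [("chr1", 100, 170, "+"), ("chr1", 500, 560, "-"),
                ("chr2", 100, 150, "+")]
toy_drip = [("chr1", 150, 300, None)]    # unstranded, as for DRIP-seq
toy_dripc = [("chr1", 450, 600, "+")]    # stranded, as for DRIPc-seq

in_drip = features_in_intervals(toy_clusters, toy_drip)
in_dripc = features_in_intervals(toy_clusters, toy_dripc)
```

The negative strand cluster at chr1:500-560 lies inside the DRIPc-seq interval but is not reported, because that interval is on the positive strand. For genome-scale data, a sorted or indexed interval lookup would be preferable to this pairwise comparison.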

Lollipop plots and intersite distance histograms
Lollipop plots which indicate the location of information-dense clusters for all viral genomes described in this study and for all genes in the human transcriptome (with ≥1 cluster) were generated in R (version 3.6.3) using the Bioconductor package "trackViewer" (v.1.20.3 74 ). The lollipop plots presenting human genes contain intron and exon boundary information which was generated using the RefSeq database (release 60). Multiple lollipop plots were generated for multi-segmented viral genomes (one image per segment). The height of each "lollipop" corresponds to the information density of a cluster, and its location in the genome is indicated (GRCh37) along with the number of sites which comprise the cluster.
Histograms which illustrate the distribution of binding site R i values and the frequency of the distance between RBPs ("intersite distances") were generated using the R package 'ggplot2' (v3.1.1 75 ). Intersite distance frequency was determined by first grouping all RBP by gene, followed by determining the distance between each site in sequential order. Distance thresholds of 500nt or 1000nt were assigned for all intersite distance histograms. Rare instances of distances greater than these thresholds were excluded from the histogram, as their inclusion led to plots too wide to be informative.

Radiation gene expression signatures and viral infection
Gene expression data for individuals with the diseases above were collected from Gene Expression Omnibus (GEO), and consisted of 5 Influenza studies (GSE29385, GSE82050, GSE50628, GSE61821, GSE27131), 4 Dengue studies (GSE97861, GSE97862, GSE51808, GSE58278) and 2 studies involving Aplastic Anemia patients (GSE16334, GSE33812). We also collected expression data from two studies with radiation-exposed samples (GSE6874 and GSE10640). The best performing human signatures (assessed by traditional validation; described in Table 7 [underlying data 38 Section 1]) from Zhao et al. 47 were then used to test the gene expression datasets in order to determine if these models would misclassify infected patients as irradiated (with and without control patients). Models were tested using the MatLab script used to perform "traditional validation" in the Zhao et al. study ("regularValidation_multiclassSVM.m", https://zenodo.org/record/1170572), which first normalizes gene expression values by quantile normalization before applying the radiation model to the infected patient data to predict outcome. The script then compares prediction of radiation exposure to the clinical data provided. MatLab scripts are compatible with GNU Octave.
To better understand why the radiation models are predicting certain Influenza- and Dengue-infected patients as irradiated, violin plots were generated using GraphPad Prism v8 to visually illustrate differences in gene expression between infected individuals correctly classified and those misclassified by each radiation model ( Figure 5). When inspecting violin plots of the 32 genes which make up the 4 radiation models tested, 10 genes were identified as having contributed towards false positive predictions, as they shared a pattern of expression similar to that of irradiated individuals in two gene expression datasets (GSE6874 and GSE10640). The 10 genes are: DDB2, PCNA, GTF3A, PRKCH, CDKN1A, GADD45A, BCL2, MOAP1, TRIM22 and TALDO1. Mann Whitney tests were used to compare the expression of these genes in false negative and true positive patients. Four genes (DDB2, PCNA, GTF3A and PRKCH) were consistently found significant in most of the studies tested.

Association kinetic analysis
The dissociation constant of SRSF1 RRM2 domain bound to the RNA sequence 5′-UGAAGGAC-3′ was experimentally determined to be 0.8 μM 53 . This information allowed for the derivation of a theoretical Scatchard plot for SRSF1 binding by varying the relative proportions of viral to host binding sites bound (where viral binding sites are considered inhibitors, and host binding sites as substrate). We can compute the theoretical number of viral genomes necessary to reach these relative proportions according to:
\[ \frac{v}{[L]} = \frac{n}{K_d} - \frac{v}{K_d} \]
where K d is the SRSF1 dissociation constant, n is the number of sites an SRSF1 protein can bind at one time (n=1), [L] is the concentration of free SRSF1, and v is the amount of SRSF1 bound to the viral genome relative to host. With this derivation, we assume there is no free RNA binding protein. These proportions were converted to numbers of viral genomes per infected host cell (determined using the above formula in an MS-Excel spreadsheet), with the computed number of viral genomes per cell adjusted by the number of SRSF1 binding sites in a single viral genome (described earlier). We also computed the number of viral genomes necessary to reach these proportions by taking A549 or pneumocyte host cell binding site expression (computed previously) into account. We then used the known processivity rate of 3.7 nucleotides/sec for VSV RNA dependent RNA polymerase 54 to estimate the doubling time required.

Statistical analysis
The average distances between adjacent binding sites of SRSF1, RNPS1 and hnRNP A1 were determined within both expressed human genes and RNA viral genomes (Dengue, HIV-1 strains B and C, Influenza A and SARS-CoV-2). A program script "calculateIntersiteDistance.pl" (underlying data 38 Section 6) takes a set of binding site coordinates and their associated genes as input and determines the pairwise distances between all consecutive binding sites in the same gene. Subsequently, "removeOutliersHigherThanN.pl" is used to discard extreme outlier distances exceeding a specified threshold (thresholds of 500nt and 1000nt were evaluated). Finally, "getStatisticsOnCol.pl" evaluates a given set of intersite distances and computes the count, geometric mean, median, arithmetic mean and their standard deviation. The program was used to evaluate intersite distances at multiple R i thresholds (low-[R i > 0 bits], moderate-[≥ ½ R sequence ] and high-affinity [≥R sequence ] binding sites). We also examined binding sites which intersect DRIPc-seq intervals in the human genome using this procedure. Output from this analysis is provided as histograms in extended data 38 Section 5, as described earlier.
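The three steps of this procedure (consecutive distances per gene, removal of distances above a threshold, summary statistics) can be sketched as below. This is an illustrative Python equivalent, not the Perl scripts themselves; the function names and toy coordinates are invented, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics
from collections import defaultdict


def intersite_distances(sites):
    """Distances between consecutive binding sites within the same gene.

    sites: iterable of (gene, position) pairs.
    """
    by_gene = defaultdict(list)
    for gene, position in sites:
        by_gene[gene].append(position)
    distances = []
    for positions in by_gene.values():
        positions.sort()
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def remove_outliers(distances, threshold):
    """Discard distances greater than the threshold (e.g. 500 or 1000 nt)."""
    return [d for d in distances if d <= threshold]


def summarize(distances):
    """Count, geometric mean, median, arithmetic mean and standard deviation.

    Requires at least two distances; the geometric mean uses positive values.
    """
    positive = [d for d in distances if d > 0]
    geometric_mean = math.exp(sum(math.log(d) for d in positive) / len(positive))
    return {
        "count": len(distances),
        "geometric_mean": geometric_mean,
        "median": statistics.median(distances),
        "mean": statistics.mean(distances),
        "stdev": statistics.stdev(distances),  # sample standard deviation (assumption)
    }


# Invented toy sites in two placeholder genes
toy_sites = [("G1", 100), ("G1", 110), ("G1", 150), ("G1", 2150),
             ("G2", 20), ("G2", 100)]
kept = remove_outliers(intersite_distances(toy_sites), threshold=1000)
summary = summarize(kept)
```

In the toy data the 2000nt gap in gene G1 is discarded by the 1000nt threshold, leaving distances of 10, 40 and 80nt for the summary.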

Data availability
A data repository titled "Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1" has been deposited as a Zenodo archive (DOI: 10.5281/zenodo.3737089 38 ). The archive contains the following underlying and extended data, organized across 6 sections. Section 1 primarily consists of extended data, and Sections 2-6 contain the underlying data presented in the paper.

Extended data
Zenodo: Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1. http://doi.org/10.5281/zenodo.3737089 38 This project contains the following extended data: Section 1 -The nine additional tables described in this study ("Section 1 -Tables 1-9"), which provide SRSF1, RNPS1 and hnRNP A1 binding site and information-dense cluster counts across various RNA viral genomes [including multiple SARS-CoV-2 and Influenza strains] and the human transcriptome, the estimated SARS-CoV-2 doubling time necessary for viral genome SRSF1 binding site availability to exceed sites within the host transcriptome, and an analysis of Influenza, Dengue, and aplastic anemia patients misdiagnosed as irradiated by established radiation gene signatures.

Underlying data
Zenodo: Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1. http://doi.org/10.5281/zenodo.3737089 38 Section 2. All SRSF1, RNPS1 and hnRNP A1 binding site genome browser tracks for human and all viral genomes analyzed in this study (GRCh37). Section 3. The full set of lollipop plots (indicating the location of SRSF1, RNPS1 and hnRNP A1 information-dense clusters) in all human genes and in each of the viral genomes analyzed. Section 4. The Ri(b,l) matrices or IWMs for all RBPs analyzed (SRSF1, hnRNP A1 and RNPS1). Section 5. The full set of histograms which display the distribution of R i strength and intersite distance between the binding sites for each RBP [across all transcribed regions or within known DRIPc-seq intervals]. Section 6. A set of 7 Perl scripts created specifically for this study, with instructions for their use: A) "ClusterToDRIPseqAnalysisProgram.pl" - reports which information-dense clusters are located within DRIPc- and/or DRIP-seq intervals (individually and by gene); B) "ClusterToDRIPseqAnalysisProgram.GeneDensityFinder.pl" - uses the output from script "A" to determine the number and the density of information-dense clusters within a gene (total clusters within the gene and those within DRIPc-seq intervals); C) "calculateIntersiteDistance.pl" - determines the distance between all binding sites in the same gene from a list of genomic coordinates; D) "removeOutliersHigherThanN.pl" - discards intersite distances computed by script "C" that are greater than a specified threshold; E) "getStatisticsOnCol.pl" - calculates the count, geometric mean, median, arithmetic mean, and standard deviation of values from script "D"; F) "ScanDataSummaryProgram.pl" - determines the number of binding sites (above a specified R i threshold) found within known genes (the program also reports the total expression of those genes using external A549 and pneumocyte expression datasets) from binding site coordinate data; G) "TotalBindingSitePerCellCalculator.pl" - estimates the number of binding sites expressed in a single A549 or pneumocyte cell at any given time.