Differential enrichment of yeast DNA in SARS-CoV-2 and related genomes supports synthetic origin hypothesis [version 2; peer review: awaiting peer review]

Background: Knowledge about the origin of SARS-CoV-2 is necessary for both a biological and epidemiological understanding of the COVID19 pandemic. Evidence suggests that a proximal evolutionary ancestor of SARS-CoV-2 belongs to the bat coronavirus family. However, as further evidence for a direct zoonosis remains limited, alternative modes of SARS-CoV-2 biogenesis should be also considered. Results: Here we show that the genomes from SARS-CoV-2 and from SARS-CoV-1 are differentially enriched with short chromosomal sequences from the yeast S. cerevisiae at focal positions that are known to be critical for virus replication, host cell invasion, and host immune response. Specifically, for SARS-CoV-2, we identify two sites: one at the start of the viral replicase domain, and the other at the end of the spike gene past its critical domain junction; for SARS-CoV-1, one at the start of the RNA dependent RNA polymerase gene, and the other at the start of the spike protein’s receptor binding domain. As yeast is not a natural host for this virus family, we propose a directed passage model for viral constructs, including virus replicase, in yeast cells based on co-transformation of virus DNA plasmids carrying yeast selectable genetic markers followed by intrachromosomal homologous recombination through gene conversion. Highly differential sequence homology data across yeast chromosomes congruent with chromosomes harboring specific auxotrophic markers further support this passage model. Model and data together allow us to infer a hypothetical tripartite genome assembly scheme for the synthetic biogenesis of SARS-CoV-2 and SARS-CoV-1. Conclusions: These results provide evidence that the genome sequences of SARS-CoV-1, SARS-CoV-2, but not that of RaTG13, BANAL20-52 and all other closest SARS coronavirus family members identified, are carriers of distinct homology signals that might point to large-scale genomic editing during a passage of directed replication and chromosomal integration inside genetically modified yeast cells. Open Peer Review Reviewer Status AWAITING PEER REVIEW Any reports and responses or comments on the article can be found at the end of the article. Page 1 of 11 F1000Research 2021, 10:912 Last updated: 11 NOV 2021


Introduction
From the beginning of the COVID-19 pandemic, in March 2020, evidence was put forward that the outbreak of novel coronavirus SARS-CoV-2 within the human population was most likely a product of natural evolution 1 . According to this view, COVID-19 is a zoonosis that probably originated from a species of closely related bat coronaviruses 2 . Prior to a hypothetical spillover event, a recent ancestor to SARS-CoV-2 likely evolved inside bat host cells for many decades 3 . However, the natural evolution hypothesis of SARS-CoV-2 origin is currently not without considerable limitations: first, the difficulty in characterizing the evolutionary origin of the unusual polybasic (RRAR) furin cleavage site at the S1/S2 junction of the SARS-COV-2 spike (S) glycoprotein 4 ; second, the discrepancy between an exponentially suppressed tropism of SARS-CoV-2 in Rhinolophus sinicus bat cells 5 and the high susceptibility of SARS-CoV-2 toward cell entry via Rhinolophus sinicus angiotensin-converting enzyme 2, its primary entry receptor 6 ; and third, the persistent inability to identify an intermediate ancestral host between human and the horseshoe bat Rhinolophus affinis. This species was reported to be the host of coronavirus RaTG13 7,8 , currently the isolate with the highest sequence similarity to the SARS-CoV-2 genome, which is located on the same phylogenetic branch as Rhinolophus sinicus bat coronavirus 9 . Finding the last animal progenitor host of SARS-CoV-2 has been further complicated by a continued uncertainty about the origin of RaTG13 itself 10,11 . In contrast to the natural evolution hypothesis for SARS-CoV-2, the above limitations do not necessarily apply to genetic engineering of viral genomes. For example, in the case of SARS coronavirus, it has long been established that introducing a synthetic poly-arginine construct at the furin cleavage site significantly increases the rate of entry into human cells compared with wild-type spike protein 12 . Also before 2010, after a period of rapid progress in the understanding the relevant host-virus factors 13,14 , natural barriers in host range of positive strand RNA viruses were rationally extended, leading to directed viral replication in new species including model organisms that originally were not permissive, such as the yeast Saccharomyces cerevisiae 15 . Accordingly, to transform yeast into a synthetic host for viral replication, the scheme has been to co-express viral RNA dependent RNA polymerase (RdRp) and, if also necessary for replication, additional viral factors on plasmids under the control of auxotrophic yeast selectable markers (YSM) 16 . These selectable markers are primarily there to direct cell lines into stable expression of desired plasmid DNA, but at the same time may function as entry gates for directed insertion of exogenous genetic material into yeast chromosomes 17 . In principle, once the RdRp and required auxiliary factors are selectively and functionally expressed, this approach applies to any replication competent SARS coronavirus RNA, including cloned or de novo synthesized genomic parts or even entire genomes, thus facilitating their replication as well as integration into the yeast genome. Our hypothesis is that such a passage would leave behind traces in the genomes of both the virus construct and the synthetic host.

Methods
SARS and SARS-like betacoronavirus whole genome nucleotide sequences were taken from the comprehensive sequence and phylogenetic analyses by Zhou et al. 18 and from Li et al. 9 . In our study, sequences were selected only if they had a valid GenBank accession identifier or an NCBI Reference Sequence (RefSeq) accession identifier, as of 5 June 2021, resulting in the reference set of 13 whole genome virus sequences (see also Extended Data). BLAT whole genome comparative sequence analysis was performed using the BLAT public webserver (BLAT, RRID:SCR_011919) with options set "Genome: Search all" and "All results (no minimum matches)". Given each one of the corresponding 13 BLAT output tables produced from genomic alignments to the yeast S. cerevisiae (Extended data Tables S2 -S14), the profiled BLAT score, pS, was the genome-wide distribution of BLAT scores (output table column [SCORE]) weighted by the corresponding length of the homologous genomic region (output table distance between columns [START] and [END]). The cumulative profiled BLAT score cS, which was used as a genomewide quantitative indicator of yeast (S. cerevisiae) homology, was the total sum over this distribution. After shifting cS by the sample's mean and dividing by its standard deviation, the resulting standardized BLAT z-score became then a relative indicator of sequence homology with S. cerevisiae. Sequence alignments for cross-validation were produced with lalign from the fasta36-36.3.8/bin/lalign36 software package (version number 36.3.8) with parameter settings: -f -12 -g 0 -E 1. This parameter choice followed standard parameters for LALIGN.
Sequence identities were calculated using the Clustal Omega public webserver (RRID:SCR_001591) with standard preset parameters.

Results
To interrogate the possibility that a similar passage through yeast cells took place within the family of SARS coronaviruses, we selected eight reference genomes 18 19 . Each search from the above set of query sequences against the entire multi-species genome database produced a high number (between 1689 and 5083) of tiles, i.e. perfectly aligned short

Updates from Version 1
Minor data changes included (due to recent publication of the BANAL-20-52 virus genome).
Any further responses from the reviewers can be found at the end of the article UPDATE DNA sequences of length 11. A large majority of these tiles were repeatedly matched on the same two target genomes (out of 107 total; see also Extended data Table S1): SARS-CoV-2 (NC_045512.2), the only coronavirus genome in the database, and S. cerevisiae (SacCer3/S288c). In these instances, BLAT identified many homologous regions by aggregating multiple tiles (Tables S2-S9), and to each homologous region it produced an integer score S, which is the number of perfectly matched positions therein. To obtain a genome-wide view of this homology signal we stacked together all homologous regions weighted by their individual alignment scores S, which resulted in an accumulated homology profile, pS (see Methods and Extended Data Figures S1 and S2). To remove its shortest-scale fluctuations, the profile was smoothed by a centered sliding window filter with window size of 200 nucleotides (nt). The output of eight genomic profiles ( Figure 1 and Figure S2) were ordered by decreasing sequence identity to SARS-CoV-2.
For SARS-CoV-2, two prominent (pS > 20) peaks indicated highly localized profile scores at levels ~10-fold above the apparent background. A first peak (P1) reaching a top alignment score of 47 in the narrow genomic interval [7191 nt..7192 nt] max , and a second peak P2 over ~18,000 nt downstream with a score of 36 in the region [25196 nt..25212 nt] max (see, Figure 1). To put these data into an established gene-function context these two maxima, with half-maximum widths w 1/2 = 215 nt and w 1/2 = 219 nt, respectively, were annotated with available information from the closest and most specifically annotated genomic region in RefSeq, the NCBI Reference Sequence database 20 . Thus P1 was closest to the start of the C-terminal domain of non-structural protein 3 (designated nsp3C), which extends over the interval [6962 nt..8552 nt]. The C-terminal domain of nsp3 is known to play a critical role in replication due to its direct interaction with nsp4, thereby facilitating virus-induced membrane rearrangement and replication complex formation; conversely, loss of nsp3C-nsp4 interaction abolishes SARS coronavirus replication 21 . P2 was located toward the 3′ end of the open reading frame of the spike gene. Here it overlapped with the 3′ end of the stretch that covers both the S1/S2 cleavage region and the S2 fusion subunit of the S protein (S_S1/S2, with interval [23192 nt..25187 nt]). The S_S1/S2 domain includes the characteristic furin cleavage site at the S1/S2 junction 22 . Cleavage activates the nearby S2 fusion peptide and together they constitute an essential part in SARS-CoV-2 particle-dependent and particle-independent cell entry through fusion of viral and cellular membranes 23,24 . A similar analysis for the RaTG13 viral genome identified only one isolated peak (P3) with a maximum profile score of 50 on the interval In contrast to the five signals identified in these three genomes, an equivalent analysis for the other three (RacCS203, Rc-o319, MERS-CoV) produced only negative results. Their accumulated homology profiles were evenly distributed across the entire genomes consistent with a low random score background from many short spurious matches. As a further specificity control, negative results were obtained (see, Figure S3 and Tables S10-S14) after profiling the five most closely SARS-CoV-1 related betacoronavirus isolates from five wild animals (civet, Paradoxurus hermaphroditus, Paguma larvata, Aselliscus stoliczkanus, and Rhinolophus sinicus), which together with SARS-CoV-2 occupy the same phylogenetic branch 9 . These data lead to a highly differential signature of yeast homology in SARS-CoV-1, SARS-CoV-2 and RaTG13 genomes after calculating standardized z-scores ( Figure 2) from the entire BLAT profiles produced to all 13 of the above sequences (Tables S2-S14). This analysis was also extended by including the three recently identified bat SARS-like coronavirus genomes from the same clade as RaTG13 (z = 4.72), i.e., BANAL-20-52 (z = 0.36), BANAL-20-103 (z = -1.56) and BANAL-20-236 (z = 0.20), which all produced markedly smaller z-scores than SARS-CoV-1 (z = 9.44) and SARS-CoV-2 (z = 5.47). To cross-validate the detected yeast homology signals in P1-P5, we also used an independent sequence alignment method, LALIGN 25 , which additionally produced statistics (E-values) for pairwise alignments. While the peaks P1 and P2, as well as P4 and P5, could be positively cross-validated, the P3 signal in RaTG13 detected by BLAT did not yield a statistically significant alignment with LALIGN, with its E-value reaching above 0.01 (see , Table S16 and Figure S4). Taken together, these highly differential data show that, for SARS-CoV-1 and for SARS-CoV-2, genes known to be critical for viral replication and host cell invasion display localized yeast homology at their flanking regions with limited extensions into the corresponding open reading frames.
To explain this yeast DNA enrichment pattern, we propose the following artificial passage model ( Figure 3A): Its starting point is a doubly auxotrophic, synthetic yeast cell line with stable, heterologous expression of viral replicase complex (RdRp, optionally together with auxiliary factors for replication, Aux) from a plasmid under the control of a selectable marker YSM1. A second plasmid carries another auxotrophic yeast selectable marker YSM2, which originates from a different chromosome, and regulates the expression of a non-replicative segment of viral RNA (nrvRNA1). At this point, nrvRNA1 is any uninterrupted DNA segment from a SARS-coronavirus related genome prior to passage. Through homologous recombination, the target yeast chromosome is transformed and nrvRNA1 is integrated 17 at the chromosomal site of the auxotrophy conferring allele homologous to YSM2. During passage cell growth double stranded DNA breaks occur, and breaks at both ends of nrvRNA1 ends, their flanking regions, and their homologous extensions into YSM2 are repaired preferably by intra-chromosomal gene conversion 26 , i.e. through a non-crossover homologous recombination, and with the endogenous site as the homologous repair donor ( Figure 3A).  Figure S2). Alignment scores retrieved only from hits matching S. cerevisiae full genomic sequence assembly SacCer3/S288c. For the corresponding BLAT output, see Table S1, and Table S2- If we assume that nrvRNA1 itself contains a copy of RdRp (and of Aux), then the above model implies that higher-order integration events 17 will occur between the YSM1 plasmid and the primary site of integration. In effect, short segments from its YSM1 region will be also integrated into nrvRNA1. In this case the passage model specifically predicts that during S. cerevisiae growth nrvRNA1 will accumulate sequences from exactly two yeast chromosomes, i.e. those two which YSM1 and YSM2 originated from.
To test this prediction, we produced the score profile pS, but this time from the yeast sequence hits on each chromosome. For direct comparison, we then transformed each profile into a single number (cS), for all 16 chromosomes (mitochondrial chromosome excluded), by calculating the sum of pS over the entire chromosome length conditional on the cutoff pS > 30. In the case of SARS-CoV-2, this procedure resulted in two distinct peaks at chromosome number II and number XV ( Figure 3B). For SARS-CoV-1, the highest two peaks were at chromosomes IV and V, followed by a much shallower peak on XVI with only 0.24 the height of IV. One peak was detected for RaTG13, also at XVI, whereas the other three viral genomes produced no signal at the chosen cutoff (see, Figure 3B, also for similar data without a cutoff). To further connect these data to our passage model, we attempted to match the seven most commonly used auxotrophic yeast selectable markers 27,28 according to their chromosomal origin: ADE2 (adenine requiring phosphoribosylaminoimidazole carboxylase, on chromosome XV), HIS3 (histidine requiring imidazoleglycerol-phosphate dehydratase, chr. XV), LEU2 (leucine requiring Betaisopropylmalate dehydrogenase, chr. III), LYS2 (lysine requiring aminoadipate reductase, chr. II), MET15 (methionine requiring O-acetyl homoserine-O-acetyl serine sulfhydrylase, chr. XII), URA3 (uracil requiring orotidine-5'-phosphate (OMP) decarboxylase, chr.V), and TRP1 (tryptophan requiring phosphoribosylanthranilate isomerase, chr. IV). In agreement with the model prediction, five out the seven markers could be matched to the four highest of the five chromosome peaks detected in SARS-CoV-2 and SARS-CoV-1 ( Figure 3B). This outcome especially implies that for SARS-CoV-2 the two auxotrophic markers (YSM1, YSM2) could be any pair from the triple (ADE2, HIS3, LYS2), and for SARS-CoV-1 either the pair (LEU2, TRP1) or (TRP1, LEU2). Thus SARS-CoV-2 and SARS-CoV-1 both did, but RaTG13 did not fit into this synthetic passage model.
These results further allowed us to infer a specific scheme for the synthetic biogenesis of SARS-CoV-2 and SARS-CoV-1 in transformed yeast cells ( Figure 3C). The idea is to stitch together both outer DNA complements of a chosen viral genome with the inner segment nrvRNA1. For co-transformation and integration, two plasmids are designed that carry the YSM2 selectable marker with either the 5′-end (nrvRNA2) or the 3′-end (nrvRNA3) of the target virus genome along with some overlap into nrvRNA1 (regions 1′ and 1′′, respectively, see Figure 3C). Essential plasmid ingredients are also a transcriptional promoter for nrvRNA2, and a self-cleaving ribozyme (Rz) sequence for the correct 3′-end in nrvRNA3 15 . Once these three non-replicable RNA encoding segments are integrated on the chromosome in the correct order, expression of fully replicable virus (+)RNA begins and replication commences upon co-expression of the viral replicase complex (RdRp and Aux, controlled through the auxotrophic marker YSM1). The final step, assembly into a fully infectious viral particle, is conveniently achieved with a yeast virus-like-particle (VLP)  , Table S2-S14 and Methods section). Evolutionary guide tree represents pairwise sequence identities between genomic sequences (see , Table S15). Grey shaded box indicates [−1, 1] standard deviation interval. originating from a SARS-CoV related virus. Primary integration of non-homologous nrvRNA1 sequence occurs through homologous recombination (HR) between the auxotrophic plasmid yeast selectable marker YSM1 (grey box) and its chromosomal homolog (striped grey box); higher-order homologous recombination follows on the flanking regions of nrvRNA1 through intra-chromosomal gene-conversion; co-expression of viral replicase complex (RdRp) and other auxiliary viral genes (Aux). Scheme in parts adapted from Compton et al. (1982), and from Alves-Rodrigues et al. (2006). P, yeast promoter; A n , poly-adenosine sequence. (B) Integrated profile scores, cS, from BLAT sequence hits on S. cerevisiae by chromosome number from the same six input sequences as in Figure 1 (purple columns); to calculate cS, a score profile cutoff pS > 30 was used. Without a cut-off (pS > 0), the same order emerged (black horizontal bars, maximum pS score at each chromosome; all other maximum pS scores from the other genomic queries are below, within shaded area). Five common yeast selectable markers are assigned to their chromosomes of origin. (C) Procedural second stage scheme for the synthetic biogenesis of SARS-CoV-2 and SARS-CoV-1. For the possible pairings of yeast selectable markers (YSM1, YSM2) matched in (B), genome editing of the three segments nrvRNA1, 2, 3 leads to a fully replicable virus (+)sense RNA. Virus (self-)assembly follows via expression of the structural proteins S, E, M, and N from an enhanced plasmid set Aux*. Rz, self-cleaving ribozyme.
expression system for the structural proteins S, E (envelope), M (membrane), and N (nucleocapsid) that can be used in parallel by an extended set of auxiliary proteins, Aux* 29 . This hypothetical cellular factory may therefore produce the targeted, fully infectious viral particles without itself being infected by the virus produced.

Discussion
Taken together, our results reveal a highly differential homology signal on SARS-CoV-2 and SARS-CoV-1 genomes, which-according to our model-points to their history of targeted integration, recombination, and directed viral replication through passage in an artificial S. cerevisiae host. This genomic pattern suggests similar synthetic origins of SARS-CoV-1 and SARS-CoV-2, but at the same time robustly excludes all other clade members from this type of synthetic origin. A special case is RaTG13, which in our analysis produced both a simpler pattern and a weaker signal of common genetic history with yeast than the two mutually more similar homology signals found in SARS-CoV-1 and SARS-CoV-2. Yet RaTG13 is claimed to be much closer to SARS-CoV-2 evolutionarily 7 , i.e. 96% genomic sequence identity to SARS-CoV-2 against 80% between SARS-CoV-1 and the latter. This divergence suggests that if RaTG13 is assumed to be a product of natural evolution then both the sequences of SARS-CoV-1 and SARS-CoV-2 cannot be. Alternatively, the origin of RaTG13 could be artificial 11 -along with SARS-CoV-2 and SARS-CoV-1 30 , as our results also suggest. In this context, a point of uttermost importance would be the identification of the putative input progenitor SARS-CoV nucleotide sequence that went into passage. For example, it could be a highly pathogenic virus designed for, or naturally adapted to human cells and then selected for a transient artificial passage together with some genetic modifications 31 of the virus to attenuate its virulence. Then its release back into the human host would likely initiate a rapid succession of complex reversal mutations toward its more pathogenic original structure 30,31 . Intriguingly, during the first months of the SARS-CoV-2 outbreak, the genomic regions of nsp3 and spike protein had the highest mutational rate within the SARS-CoV-2 genome 32 which may interfere with the yeast homology signal detected in the present study. During an epidemic, such reversal mutations toward an unidentified artificial genotype would be highly detrimental to most public health countermeasures, including pharmacological interventions and vaccinations. In contrast, through specific guidance of countermeasures such as vaccine development, detailed knowledge about the input progenitor's nucleotide sequence would effectively confer population immunity against the pathogen. Without such specific knowledge, our and further results about the conditions of the yeast artificial passage could offer important directions for SARS-CoV-2 antiviral interventions.

Data availability
Associated or additional data. All data underlying the results are available as part of the article and no additional source data are required.
Repository-hosted data. The following sequence data was retrieved from the NCBI GenBank repository: The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com