Complete genome of Pieris rapae, a resilient alien, a cabbage pest, and a source of anti-cancer proteins

The Small Cabbage White ( Pieris rapae) is originally a Eurasian butterfly. Being accidentally introduced into North America, Australia, and New Zealand a century or more ago, it spread throughout the continents and rapidly established as one of the most abundant butterfly species. Although it is a serious pest of cabbage and other mustard family plants with its caterpillars reducing crops to stems, it is also a source of pierisin, a protein unique to the Whites that shows cytotoxicity to cancer cells. To better understand the unusual biology of this omnipresent agriculturally and medically important butterfly, we sequenced and annotated the complete genome from USA specimens. At 246 Mbp, it is among the smallest Lepidoptera genomes reported to date. While 1.5% positions in the genome are heterozygous, they are distributed highly non-randomly along the scaffolds, and nearly 20% of longer than 1000 base-pair segments are SNP-free (median length: 38000 bp). Computational simulations of population evolutionary history suggest that American populations started from a very small number of introduced individuals, possibly a single fertilized female, which is in agreement with historical literature. Comparison to other Lepidoptera genomes reveals several unique families of proteins that may contribute to the unusual resilience of Pieris. The nitrile-specifier proteins divert the plant defense chemicals to non-toxic products. The apoptosis-inducing pierisins could offer a defense mechanism against parasitic wasps. While only two pierisins from Pieris rapae were characterized before, the genome sequence revealed eight, offering additional candidates as anti-cancer drugs. The reference genome we obtained lays the foundation for future studies of the Cabbage White and other Pieridae species.


Introduction
The Small Cabbage White (Pieris rapae, Figure 1), also known as European Cabbage Butterfly, or Imported Cabbageworm, is one of the most common and widely spread butterflies in North America, ranging from Southern Canada to Mexico 1 . While mostly present in disturbed open habitats, it also invades valley bottoms, mountain tops, and forested areas 2 . In many northeastern USA states, it frequently outnumbers all other butterflies combined 3 . North American populations of the Cabbage Whites, currently numbering in billions, are likely a progeny of a single female accidentally introduced to Quebec, Canada during the second half of the 19 th century 4,5 . By the beginning of the 20 th century it had reached California Coast 6 . Around the same time, it was introduced into Hawaii, New Zealand and Australia 6,7 . Originally from Eurasia and Northern Africa 1 , Cabbage White has become one of the most ubiquitous butterfly species. The reasons for its population expansion across variable habitats as well as the population history of American invasion are poorly understood.
While only very few butterflies are agricultural pests, the Small White is notorious for reducing cabbage plants to stems. Going through its life-cycle quickly and having up to 6 generations per year 8 , it is a serious pest of the mustard family crops 5,9 . In addition to damaging plants, caterpillars contaminate and stain produce with feces.
These butterflies are also a source of a protein with anti-cancer properties 10 . Aptly termed pierisin, this enzyme of a probable bacterial origin is unique to Pieris and its close relatives among Lepidoptera species 10,11 . Pierisin contains an N-terminal ADP-ribosylation catalytic domain followed by four ricin domains, and it can induce apoptosis and thus contribute to metamorphosis and resistance to parasitoids 11,12 . Due to its cytotoxic effects on many cancer cell lines, pierisin is an unexpected protein of medical importance 10 . Agricultural and medical significance of the Cabbage White has attracted broad attention from researchers and the general public. However, the lack of complete genome sequence hinders these studies.
To aid genetics, evolutionary, and biochemical studies of the Cabbage White, we sequenced and annotated its complete genome from North American specimens. At 246 Mbp, it is one of the smallest genomes among Lepidoptera genomes assembled to this day, and the first representative from the Pierinae subfamily. Overall, this diploid genome contains 1.5% heterozygous positions that is consistent with the expected high level of butterfly's heterozygosity. However, the Pieris genome contains a large number of SNP-free segments that are at least 1000 bp long (with the median length equal to 38000 bp), which together constitute 18.3% of the assembled genome. This number is below 4% in other species. The high fraction of homozygous segments indicates low genetic diversity of the population, which supports the hypothesis that Cabbage White expansion in America started from a very small number of individuals, which could be as low as 1 or 2 fertilized females.
Comparison to other Lepidoptera genomes reveals several unique families of proteins that may contribute to the unusual resilience and adaptability of Pieris. For instance, the nitrile-specifier proteins, which converts plant defense chemicals to non-toxic molecules 13 are unique to these species. The apoptosis-inducing pierisins could offer a defense mechanism against parasitic wasps. While only two pierisins from Pieris rapae were characterized before 14,15 , the genome sequencing revealed eight genes coding for pierisins, offering additional candidates for anti-cancer drugs development. The reference genome we obtained lays the foundation for future studies of the Cabbage White and other species of Pieridae.

Results and discussion
Genome assembly, annotation, and comparison to other Lepidoptera genomes We assembled a 246 Mb reference genome of Pieris rapae (Pra), which is one of the smallest among currently sequenced Lepidoptera genomes (Supplementary Table S1A) [16][17][18][19][20][21][22][23][24][25][26] . The scaffold N50 of Pra genome assembly is 617 kb, better than many other published Lepidoptera genomes (Table 1). The genome assembly is also better than many other Lepidoptera genomes in terms of completeness measured by the presence of Core Eukaryotic Genes Mapping Approach (CEGMA) genes (Supplementary Table S1B) 27 , cytoplasmic ribosomal proteins and independently assembled transcripts (Table 1). The genome sequence has been deposited at DDBJ/EMBL/GenBank under the accession LWME00000000. The version described in this paper is version LWME01000000. In addition, the main results from genome assembly, annotation and analysis can be downloaded at http://prodata.swmed.edu/LepDB/.
We assembled the transcriptome of Pra using another specimen (NVG-3537) from the same locality. Based on the transcriptome, homologs from other Lepidoptera and Drosophila melanogaster, de novo gene predictions, and repeat identification (Supplementary  Table S2B), we predicted 13,188 protein-coding genes in the Pra genome (Supplementary Table S2C). 74.4% of these genes are likely expressed in the adult, as they fully or partially overlap with the transcripts. We annotated the putative functions of the 10,747 protein-coding genes (Supplementary Table S2D). Comparison of the protein sets from Lepidoptera species revealed the presence of some proteins unique to the Cabbage White and not present in other species. Among these are pierisins and nitrile-specifier proteins that play important roles in resistance against parasites and toxins from plants and contribute to the successful spread of Pieris rapae across continents.

Phylogeny of Lepidoptera
We identified orthologous proteins encoded by 13 Lepidoptera genomes (Plutella xylostella, Bombyx mori, Manduca sexta, Lerema accius, Papilio glaucus, Papilio polytes, Papilio xuthus, Phoebis sennae, Melitaea cinxia, Heliconius melpomene, Danaus plexippus, Calycopis cecrops and Pieris rapae) and detected 4906 universal orthologous groups, from which 1845 groups consist of a single-copy gene in each of the species. A phylogenetic tree built from the concatenated alignment of the single-copy orthologs using RAxML places Pieris as the sister to Phoebis (Figure 2), the only other member of the Pieridae family with sequenced genome. Our analysis places Papilionidae as a sister to all other butterflies, including skippers (Hesperiidae). Such placement contradicts morphology-based phylogeny, but is reproduced in all maximumlikelihood and Bayesian trees published recently 26,28 .
All nodes received 100% bootstrap support when the alignment of all single-copy orthologs was used. However, since bootstrap only measures internal consistency of phylogenetic signal in the alignment, very large datasets will almost always result in 100% support, even if the tree is incorrect and biased by various factors such as nucleotide composition and long branch attraction. To find the weakest nodes, we reduced the amount of data by randomly splitting the concatenated alignment of all single-copy orthologs into 100 alignments (about 3088 positions in each alignment). The consensus tree based on these alignments revealed that the node referring to relative position of skippers and swallowtails shows the lowest support (68%) compared to other nodes, and their evolutionary history remains to be further investigated when better taxon sampling by complete genomes is achieved.

Anti-cancer protein pierisin
We identified 8 copies of the pierisin gene (Supplementary  Table S3A), while only 2 copies were previously reported from Pieris rapae (GenBank) 14,15 . At least 7 pierisin copies are likely expressed, as their partial sequences are present in the RNA-seq data from adult. The pierisin protein resembles a classic bacterial AB-toxin, with an enzymatically active A domain toxin that is delivered across the eukaryotic membrane through interaction with receptors on the cell surface by the B domain. Pierisin is closely related to the bacterial mosquitocidal toxin MTX NAD(+)dependent ADP-ribosyltransferase for which the crystal structure is known 29 , with the closest pierisin sequence Pra57.2 having 32.56% identity to the structure sequence represented by the MTX holotoxin (PDB 2vse). The pierisin toxin transfers an ADP-ribosyl moiety to 2'-deoxyguanosine residues in DNA 30 , while the ricin domains mediate interactions with neutral glycosphingolipid receptors, globotriaosylceramide (Gb3), and globotetraosylceramide (Gb4) 31 . The toxin is thought to serve as a defense factor against parasitization by wasps 12 , but also induces apoptosis in cancer cell lines 10,11,32 .
Seven copies of pierisin encoded by the Pieris rapae genome include an N-terminal ADP-ribosylation toxin followed by an inhibitory linker and four ricin domains. Mapping the Pieris rapae pierisin sequence conservations (in rainbow from conserved red to variable blue) to the MTX holotoxin structure revealed a strict conservation of the active site and residues surrounding the NADbinding site ( Figure 3A, NAD in ball and stick), as well as conservation of the inhibitory linker in the region that replaces NAD ( Figure 3B, linker in tube). The receptor-interacting ricin domains include QxW motifs that contribute to cytotoxicity ( Figure 3B, spheres), and display relatively lower overall conservation than the catalytic domain. Thus, the receptor-interacting function might be diverging across the different copies of the gene, potentially allowing broader receptor specificity.
One copy of pierisin (Pra57.3) lacks the N-terminal ADP-ribosylation domain, and is composed of four ricin domains following an N-terminal signal peptide, as validated by both the assembled genome and de novo assembled transcripts. In addition, the phylogenetic tree of the ricin domains in the eight copies of pierisin places this protein on the longest branch, suggesting that it has undergone rapid divergence from other pierisins and could have adopted a different function. Lacking the toxin domain, Pra57.3 may aid others toxins in entering the cells. Alternatively, it may be able to bind to the neutral glycosphingolipid receptors in the Pieris, and protect its own cells against other pierisins with the toxic ADP-ribosylation domains.
Detoxifying nitrile-specifier proteins During feeding, the cabbage white butterfly larvae possess the ability to counteract toxic secondary metabolites produced by the food plant glucosinolate-myrosinase major chemical defense system. The hydrolysis reaction of plant myrosinase, which normally produces toxic isothiocyanates, is redirected to the production of nitriles in the presence of the larval gut nitrile-specifier protein (NSP) 13 . The exact role of NSP in nitrile production is debatable, the protein could either serve as an enzyme catalyzing the formation of nitriles from the aglycone intermediate or as an allosteric cofactor for myrosinase 13,33 . The detoxifying NSP protein belongs to an insectspecific gene family consisting of variable tandem repeating units termed insect allergen-related repeats. While other Lepidoptera genomes appear to have no NSP genes, the Pieris genome encodes two copies of the NSPs (Supplementary Table S3B), each containing three copies of the insect allergen-related repeat domain 34 .
Recently, a crystal structure of an insect allergen-related repeat domain from cockroach revealed a novel fold of twelve alphahelices (two 6 helical repeating units) encapsulating a large hydrophobic cavity. While the sequence identity between the allergen structure and each of the three Pieris NSP domains is relatively low (˜ 20% to each), their sequences can be confidently mapped to the known structure for functional inference. The cockroach allergen repeat cavity binds phospholipids such as phosphatidylethanolamine and phosphatidylglycerol when expressed in bacteria; and phosphatidylinositol (PI), phosphatidylserine and phosphatidylcholine when expressed in yeast. Alternately, the allergen purified from cockroach bound nonphosphorylated fatty acids such as palmitate, stearate, and oleate 35 , revealing a promiscuous binding capacity of the hydrophobic pocket. Such a promiscuous allergen binding activity might translate to the sequence-related NSP pockets, allowing binding of the various aglycone intermediates of the glucosinolate-myrosinase system.
Mapping the NSP-related protein sequences conservations to the allergen structure highlights invariant residues that both line the hydrophobic cavity of each domain, connect the repeating units, and connect adjacent α-helices of the repeat (Figure 4, conserved residues colored red). The hydrophobic nature of the binding cavity is preserved in the NSP sequences, including numerous invariant hydrophobic residues that likely contribute to function. Conserved NSP residues also reside near the PO4 group of the phospholipid binding site (Figure 4D), including a YxxxW motif found in each repeat that should restrict the site to accommodate smaller ligands.
In fact, the aglycone intermediate SO4 group and adjacent backbone atoms could mimic the PO4 in phospholipid ( Figure 4E).
Alternately, the positions of invariant polar residues are limited to those that contribute to α-helical interactions, to the linker regions that do not line the hydrophobic cavity, or to insertions not present in the template allergen-repeat structure. While an active site could potentially form between repeating domains of the NSP structure, no obvious clusters of catalytic residues could be mapped to the individual cavities of any of the domain repeats present in NSP. Potentially, the NSP cavities could accommodate binding the various aglycone intermediates produced by myrosinase, allowing time for spontaneous conversion to simple nitriles in the low pH of the gut. Thus, the NSP binding cavity could act in a pseudoenzymatic capacity, without traditional catalytic residues mediating chemistry.
Inferring the population history from the SNP distribution pattern While the Pieris rapae genome is very heterozygous at 1.5%, the distribution of these SNPs in the genome is highly non-random. The histogram of SNP fraction in 1000 bp genomic windows for both Pieris rapae and Papilio glaucus (Pgl) is shown in Figure 4A. Since the reads from the highly heterozygous regions in the genome may not map well to the reference genome, such regions usually show lower-than-expected coverage and may hinder the detection of heterozygous positions. Therefore, in the analysis of both Pgl and Pra genomes, we focused on the genomic regions with coverage that are expected for a diploid genome. Compared to Pgl, the Pra genome contains a much higher fraction of homozygous (SNP-free) regions ( Figure 4B). This difference cannot be simply explained by the relatively low heterozygosity of Pra (1.5% for Pra and 2.3% for Pgl), because the probability of observing SNP-free segments longer than 500 bp is below 1% for genome of this size having 1.5% of heterozygosity ( Figure 4C).
The Pra genome assembly contains a large portion (18.3% of the total length) of SNP-free segments that are at least 1,000 bp. The average coverage of the SNP-free segments by the reads is 87 fold, which is higher than the average coverage of all the segments under study (coverage: 84 fold). Therefore, the lack of heterozygous positions does not arise from the failure of mapping reads from one haplotype to the reference genome which represents another haplotype in the highly heterozygous region. The Pgl genome contains only 1.55% long (>= 1000 bp) SNP-free segments, which also support that the large portion of SNP-free segments in the Pra genome is not an artifact. The median length of these segments is 38,000 bp, and the longest SNP-free region in the P. rapae draft genome is 339,000 bp.
The presence of such high proportion of SNP-free segments indicates that this Pra specimen inherited a large proportion of identical alleles from its parents. Two scenarios could explain this: (1) this specimen is a result of recent inbreeding between brothers and sisters or between cousins (2) the population started from a very small number of individuals or has been through very severe bottlenecks and therefore the genetic diversity within the population is low. In order to distinguish between these two scenarios, we simulated them.
Inbreeding between brother and sister would result in the presence of ˜2 5% long homozygous segments, and this ratio goes down to 6.3% when the parents are cousins ( Figure 6A). Inbreeding between half-blooded brother and sister from the same father (or mother) and whose mothers are sisters would result in 18.6% of homozygous segments. However, inbreeding between very close relatives would result in a very high median lengths of the SNP-free segments ( Figure 6B), even if we assumed a very high recombination rate, 10 cM/Mb 36 . The median length of SNP-free segments in this scenario is still above 200,000 bp, which is much higher than the observed value, 38,000 bp. Therefore, inbreeding between close relatives cannot explain the observed SNP pattern.
The observed pattern of SNP-free segments agrees very well with the second scenario, i.e., the genetic diversity in the population is low, because the population started from very small number of individuals or has undergone very severe bottlenecks. The observed fraction and median lengths of the long SNP-free segments agrees very well with the simulated data assuming that the population started with 3 individuals (could be one female carrying spermatophores of two males) and has been developing for about 500 generations ( Figure 6C, D). This supports the hypothesis that Pieris rapae came to America in 19 th century and the population started from very few individuals introduced by human activity. It cannot be excluded that the population started with a larger number of introduced individuals, but the genetic diversity was reduced due to severe bottlenecks, possibly early on, so only the progeny of one or two females gave rise to American populations of Pieris rapae. However, as a widely spread butterfly species over all different habitats that is somewhat resistant to parasite and toxins in plant, bottlenecks in the later stage of population history is not very likely.

Materials and methods
Library preparation and sequencing

Genome and transcriptome assembly
We removed sequence reads that did not pass the purity filter and classified the remaining reads according to their TruSeq adapter indices to get individual sequencing libraries. Mate pair libraries were processed by the Delox script 37 to remove the loxP sequences and to separate true mate pair from paired-end reads. All reads were processed by mirabait 38 v4.0.2 to remove contamination from the TruSeq adapters, an in-house script to remove low quality portions (quality score < 20) at the ends of both reads, JELLYFISH 39 v2.2.3 to obtain k-mer frequencies in all the libraries, and QUAKE 40 v0.3.5 to correct sequencing errors. The data processing resulted in seven libraries that were supplied to Platanus 41 v1.2.4 for genome assembly: 250 bp and 500 bp paired-end libraries, 2 kbp, 6kbp, 15k bp true mate pair libraries, a library containing all the paired-end reads from the mate pair libraries, and a single-end library containing all reads whose pairs were removed in the process.
We mapped these reads to the initial assembly with Bowtie2 42 v2.2.3 and calculated the coverage of each scaffold with the help of SAMtools 43 v1.0. Many short scaffolds in the assembly showed coverage that was about half of the expected value; they likely came from highly heterozygous regions that were not merged to the equivalent segments in the homologous chromosomes. We removed them if they could be fully aligned to another significantly less covered region (coverage > 90% and uncovered region < 500 bp) in a longer scaffold with high sequence identity (>95%). Similar problems occurred in the Heliconius melpomene, Pterourus glaucus and Lerema accius genome projects, and similar strategies were used to improve the assemblies 19,24,26 .
The RNA-seq reads were processed using a similar procedure as the genomic DNA reads to remove contamination from TruSeq adapters and the low quality portion of the reads. Afterwards, we applied three methods to assemble the transcriptomes:

Identification of repeats and gene annotation
Two approaches were used to identify repeats in the genome: the RepeatModeler 48 v1.0.7 pipeline and in-house scripts that extracted regions with coverage 3 times higher than expected. These repeats were submitted to the CENSOR 49 server to assign them to the repeat classification hierarchy. The species-specific repeat library and all repeats classified in RepBase 50 v18.12 were used to mask repeats in the genome by RepeatMasker 51 v4.0.3.
We obtained two sets of transcript-based annotations from two pipelines: TopHat followed by Cufflinks and Trinity followed by PASA. In addition, we obtained eight sets of homology-based annotations by aligning protein sets from Drosophila melanogaster 52  Identification of orthologous proteins, analysis of unique genes for Pieris rapae, and phylogenetic tree reconstruction We identified the orthologous groups from 13 Lepidoptera genomes using OrthoMCL 65 v2.0.9. The orthologous groups that contain only Pieris proteins were further investigated. Starting from these Pieris sequences, we attempted to identify their orthologs in other Lepidoptera genomes using reciprocal BLAST. Potential orthologs encoded by the genome but missed in the protein sets were predicted with the help of genblastG. Two groups of proteins, i.e. the pierisins and nitrile-specifier proteins discussed above turned out to be unique for Pieris. We manually curated the sequences for proteins in these two groups and submitted them to MESSA 66 to perform secondary structure and disordered region prediction, domain identification and 3D structure prediction. We aligned the pierisin sequences using MAFFT v7. 1845 orthologous groups consisted of single-copy genes from every species, and they were used for phylogenetic analysis. An alignment was built for each universal single-copy orthologous group using both global sequence aligner MAFFT 68 and local sequence aligner BLASTP. Positions that were consistently aligned by both aligners were extracted from each individual alignment and concatenated to obtain an alignment containing 308,750 positions. The concatenated alignment was used to obtain a phylogenetic tree using RAxML. Bootstrap resampling of the aligned positions was performed to assign the confidence level of each node in the tree. In addition, in order to detect the weakest nodes in the tree, we reduced the amount of data by randomly splitting the concatenated alignment into 100 alignments (about 3,088 positions in each alignment) and applied RAxML to each alignment. We obtained a 50% majority rule consensus tree and assigned confidence level to each node based on the percent of individual trees supporting this node.

Analysis of the SNP patterns in Pieris rapae
We analyzed the SNPs in Pra and Papilio glaucus (Pgl) genomes using the same protocol, in which we mapped each read to the genomes and detected SNPs using the Genome Analysis Toolkit 69 v3.5. The distribution of genome coverage by the reads in 100 bp windows was plotted. For both Pra and Pgl genomes, this distribution shows two peaks. In addition to the main peak centered at the expected coverage for a diploid genome, there is an additional peak to the left that corresponds to highly divergent regions between the two homologous chromosomes. Owing to this sequence divergence, only the reads corresponding to the sequence of one of the homologous chromosomes can be mapped, which results in the lower-than-expected coverage. To analyze the distribution of SNPs, we used the regions whose coverage by the reads falls within the diploid peak.
We calculated the total number of positions with SNPs in such regions and simulated random distribution of these SNPs. The simulated distributions were used as controls. For the random control, experimental data, and the simulated genomes discussed below, we divided the scaffolds into 100, 200, 300, 400, 500, 1000, 2000, 5000, and 10000 bp windows (segments less than the window length at the ends of scaffolds are discarded), respectively, and calculated the presence of SNP-free windows. We concatenated neighboring SNP-free regions to obtain the longest SNP-free segments, and calculated the median length of these SNP-free segments.
Simulation of recent inbreeding and evolutionary history of the Pieris rapae population in America We simulated Pieris rapae haplotypes by randomly introducing SNPs to the Pra reference genome, and the frequency of SNPs was set to be half of the frequency of heterozygous positions in the sequenced Pra individual (i.e., 0.7%). Two simulated haplotypes were randomly paired to represent another simulated Pieris rapae individual, and the rate of heterozygous positions in the simulated individuals would be comparable to that observed in the sequenced specimen. To simulate the mating between two individuals, we assumed the two haplotypes of each individual could recombine at a certain rate (recombination rate) and generate a new haplotype that is inherited to the offspring.
The recombination rates of insects are rather variable, and the recombination rates for Bombyx mori, Heliconius melpomene and Heliconius erato are estimated to be 2.6, 5.5 and 6.1, respectively 36 . Therefore, we estimated the recombination rate for Pieris rapae to range between 1cM/Mb and 10cM/Mb per generation. To simulate recent inbreeding, we randomly select a recombination rate within this range. The mutations in this process are not introduced because the per generation mutation rate for butterflies are expected to be in the magnitude of 1e-9 mutation per base pair 70 , much lower than the existing variation between haplotypes. We simulated three scenarios of inbreeding: (1) between brother and sister (2) between cousins and (3) between half-blooded brother and sister.
To simulate the evolution of Pieris rapae population, we assumed the population started from a certain number of individuals (2, 3 and 4). Several parameters would affect the population evolution, i.e., the number of generations since the species invaded America, the recombination rate, the mutation rate, and the effective population size. Pieris rapae was suggested to invade America in the second half of 19 century, and has 3-6 generations per year. Therefore, we assumed the number of generations to be 500. Based on the known values for other Lepidoptera species, we assumed the recombination rate to be 5cM/Mb and the mutation rate to be 2.5e-9. In the initial generations, the effective population size is mainly limited by the population size, and the population may undergo exponential growth. We assumed an exponential growth of the effective population size at rate of 10 fold per generation (each pair produce 20 offsprings). Later on, the population may reach its stationary phase, and the effective population size will be limited by the population structure and will not keep increasing. The effective population size of insects usually ranges between 1e5 and 1e6 71 , and we assumed the effective population size to be 5e5 after the initial exponential growth phase.

Data availability
Sequencing reads were deposited in the NCBI SRA database under accession number SRP073457. The genome sequence was deposited at DDBJ/EMBL/GenBank under accession number LWME00000000.