A tiling array-based comparative genomic hybridization approach to predict copy number variations between Plasmodium falciparum field isolates from the Indian Sub- continent [version 1; peer review: 1 approved, 1 approved with reservations]

Background: There are several techniques to analyse copy number variation in both research and clinical settings, such as whole genome amplification (sWGA), SNP arrays and one of the most commonly used techniques, array based comparative genomic hybridization (aCGH). In the latter, copy number comparison is obtained between differentially labelled target and reference DNAs by measuring ratio of fluorescence intensity of probes indicating loss or gain in the chromosomal region. Methods: Here we carry out a comparative analysis between two Plasmodium falciparum parasite isolates (Pf-isolate-2 and Pf-isolate-1) causing malaria using array CGH. The array contains approximately 418,577, 60mer custom-designed probes with an average probe spacing of 56 bp. The significant major variations (amplifications and deletions) copy number variations (CNV) in Pf-isolate-2 (Pf-2) in comparison with Pf-isolate-1 (Pf-1), are reported. Results: CNVs have been seen in all the chromosomes in Pf-2, most of the deletions have been seen mostly in sub-telomeric and telomeric regions of the chromosomes that comprises of variant surface antigen family genes. Apart from the subtelomeric regions other parts of the chromosomes have also shown CNVs. Novel variations ,  like Open Peer Review


Introduction
Plasmodium falciparum is the most virulent among the 6 Plasmodium species that causes malaria in humans. Current efforts to eliminate the disease have not been successful because of the rapid genetic adaptation and natural selection of the parasite. The currently known causes for genome sequence variations in the parasite are single nucleotide polymorphism (SNP), insertions, deletions of short sequences, major amplifications, translocations, inversions, and allelic variations (Cheeseman et al., 2009). Apart from the host factors these parasitic genetic variations may also lead to the disease severity., P.falciparum strains have long been established in continuous culture systems and most genomic analysis has naturally utilized the incredible benefit from such techniques. However, analysis of genome variations from culture adapted parasite lines remains intrinsically very different from analysis of in-vivo derived parasite nucleic acid material and the latter could have a major impact on understanding of the parasite biology (LeRoux et al., 2009). Environmental pressures have been known to introduce changes in genomes providing organisms an evolutionary advantage for adaptation and CNVs could be one such change that assists the parasite adaptation to the new environment. These CNVs could enable survival of the parasite under drug pressure. An amplification event in the GTP cyclohydrolase has been frequently reported and the high copy number of GCH-1 has been shown to confer resistance to anti-folates (Heinberg et al., 2013;Jiang et al., 2008;Kidgell et al., 2006;Nair et al., 2008;Ribacke et al., 2007). The major known parasite molecule that influences the severity of the disease are encoded by Variant Surface Antigen family genes that are located in the sub-telomeric part of the chromosome. P. falciparum 3D7 genome sequencing revealed the presence of ~60 var genes followed by other variant surface antigen families, rifin/stevor (~190), Pfmc-2TM (~12) and surfin (~10) (Frech & Chen, 2013;Scherf et al., 2008). Most studied var genes encodes for PfEMP-1 (Erythrocyte membrane protein-1) that play an important role in the pathogenesis by periodic switching and immune evasion. Along with the exported family members, these genes play an important role in cytoadhesion by binding to multiple cell receptors, including ICAM-1, CD36, E-selectin, NCAM and CD31 on micro vascular endothelia, which mediates sequestration that in-turn, leads to the severity of the disease. Recently another receptor, Endothelial protein C receptor (EPCR), a brain specific receptor has been identified as a binding partner of PfEMP-1 (Smith et al., 2013;Wassmer & Gray, 2017). Var genes are divided into 3 major groups A, B and C and 2 intermediate groups UpsB/A and B/C (Lavstsen et al., 2003). The higher transcript level of var group B and var group C family genes have been reported in severe malaria cases in comparison with the uncomplicated malaria cases from India (Subudhi et al., 2015).
There is limited information regarding complete nuclear genome variation for any of the human malaria parasites from field isolates from the Indian subcontinent, however, Tyagi et al. in 2014 have reported evolutionary history of Plasmodium falciparum by sequence analysis of mitochondrial genome of Indian isolates. Most studies have focused on analysis of sequence variations in single genes from different parts of India (Kumar et al., 2005;Kumar et al., 2012;Mishra et al., 2015;Rajesh et al., 2008;Zeeshan et al., 2012). These have focused on SNP's for "drug resistance" marker genes of Indian isolates. Garg et al., 2009;Pathak et al., 2014;Sharma et al., 2015).
Genomics technologies like whole genome microarrays and more recently next-generation sequencing have enabled deeper analysis of various types of genomic variations. Using a reference clone/isolate, copy number variations of other isolates can be identified (Carret et al., 2005). An interesting study by Oyola et al. in 2016 details of SNP analysis from dried blood spots using Illumina based sequencing approaches.
Array-based technologies have been used to distinguish CNV's in different diseases (Coughlin et al., 2012). Such techniques have been extensively used in cancer research to identify DNA copy number variability to represent structural variations at higher resolution and increased sensitivity (Liu, 2007;Mockler et al., 2005;Przybytkowski et al., 2011).
The objective of this study was to see whether we could at all distinguish differences between the two parasite isolates taken from individual patients showing differing disease severity using aCGH tilling array approach.
The microarray data discussed in this manuscript has been deposited in the NCBI Gene Expression Omnibus (GEO) under the GEO series accession number (GSE77165).

Collection of blood sample
Venous blood samples were collected (5ml) from P. falciparum infected patients admitted to S.P. Medical College, Bikaner, India. Patient's samples were collected on informed consent according to hospital guidelines. Sample collection was approved by the hospital ethical committee (No.F.(Acad) SPMC/2003/2395). Infection with P. falciparum was confirmed by microscopy and rapid diagnostic tests (OptiMal test; Diamed AG, cressier sur Morat, Switzerland, Falcivax test; Zephyr Biomedical System Goa, India) in the hospital. Blood was immediately (within 15 mins) subjected to density gradient-based separation (Histopaque 1077, Sigma Aldrich, USA) to separate the peripheral blood mononuclear cells (PBMCs) from the infected blood samples following manufactures instruction. Both the fractions were subsequently washed twice with phosphate buffered saline (PBS) and lysed using 4 volumes of TRI-reagent and stored at -80°C. Samples were transported in cold chain to BITS, processed and assessed by 18s r RNA gene based multiplex PCR to eliminate any possibility of P. vivax co-infection. Details of the PCR primers used for multiplex PCR for the detection of P. falciparum and P. vivax are presented in Table 1 as cited in the research articles (Pakalapati et al., 2013). PCR conditions used for the amplifications of 18S r RNA gene involved initial denaturaton at 93°C for 2 minutes followed by denaturation at 93°C for 1 minute 30 seconds, annealing at 52°C for 2 minutes, extension at 72°C for 3 minutes, after initial denaturation all the reactions were run for 30 cycles. The TRI protocol was used to allow investigations of both RNA and DNA from the precious clinical samples. (Chomczynski, 1993) Criteria for determining complicated cases were based on World Health Organization guidelines (World malaria report, 2010). For CGH array hybridization, isolates from two patients were considered a) cerebral malaria (CM) whose Glasgow coma scale = 7/14 was considered as a complicated case (PFC) Pf-2 or test, b) uncomplicated case (PFU) Pf-1.

Genomic DNA extraction
Total DNA was isolated from samples using back extraction buffer following manufacture's protocol (TRI-Reagent, Sigma Aldrich, and USA). Both the DNA samples (Pf-isolate-2 and Pfisolate-1) were run on an agarose gel and found to be intact. Total DNA purity and concentration were assessed using Nanodrop ND-1000 UV-Visible spectrophotometer (Nanodrop technologies, Rockland, USA). The A260/A280 ratio suggests the absence of proteins and RNA.
Genomic DNA preparation and hybridization 1.5 Micrograms of each DNA sample were restriction digested using Alu1 and Rsa1. This restricted DNA was random-primer labelled with Cyanine-3 (Cy3) and Cyanine-5-dUTP (Cy5) using Agilent Genomic DNA labeling kit (Part Number: 5190-0453). The labelled DNA was concentrated and quality assessed for yields and specific activity. Pf-2 which is a test sample was labelled with Cy3 and Pf-1 sample taken as a control was labelled with Cy5. 5 micrograms of the labelled samples were hybridized on a Genotypic designed P. falciparum 3D7 Custom 2x400K CGH Microarray, (Agilent Microarray Design ID AMADID: 025327). Hybridization of probes was carried out in Agilent Sure hyb chambers at 65°C with rotation for 40 hours. The hybridized slides were washed using Agilent aCGH wash buffer (Part No: 5188-5221/22) and scanned using the Agilent Microarray Scanner G2505C. The Scanned image was processed using Agilent Feature extraction software version 10.7 and probe intensities were normalized using the linear dye normalization method.
Comparative genome hybridization array design A high resolution custom CGH array was designed with 418,577 probes, representing the entire P.falciparum 3D7 reference genome using the e-Array tool SureDesign v 5.0.1 by the Agilent certified microarray facility of Genotypic Technology, Bengaluru, India. Whole genome sequence for P. falciparum 3D7 strain was downloaded from NCBI genome database; the 60mer probes were designed based on optimal GC% and melting temperature using Probe parser V.1.0.exe (Perl program developed by Genotypic Technology, Bangalore, India) Genotypic Probe parser is the perl code developed by Genotypic Technology to design 60mer Probes from the downloaded genome sequence and calculate the GC% and melting temperature (Tm) for the probes. The 60 mer probes designed by the above mentioned program were then subjected to the Agilent eArray tool and the CGH array designed. The probes were designed by tiling over the DNA sequence. The tiling resolution was 56 bp. The probe length has been set as 60 bp which is the optimised length for Agilent comparative genomic hybridization workflow. The probes were designed by calculating GC% and optimum Tm values of (>65 genome sequence and calculate the GC% and melting temperature. The sequence information, including chromosomal location and direction of transcriptional regulation of P. falciparum 3D7 were retrieved from Plasmodium Genome Resource. [PlasmoDB.V.6.0] (Aurrecoechea et al., 2009;Gardner et al., 2002).
Array CGH data analysis and aberration filter to detect copy number variation Image analysis and data normalization was done by applying the linear dye normalization method using the Agilent Feature Extraction Software version 10.7. The Raw data were analysed using the Agilent Genomic Workbench Software version 6.5. A well-accepted ADM-2 algorithm was used (Aberration Detection Method II algorithm) developed by Agilent and included in the Agilent Genomic Workbench for CGH analysis. ADM-2 is a statistical algorithm which computes copy number differences between the sample and reference using an iterative procedure to identify all genomic regions from its expected value of 0 larger than a given threshold (In this case we have taken a minimum of 5 consecutive probes, with an average log ratio of 0.3). At each iteration, the region with the most significant score is reported. ADM-2 statistical score is computed as the average normalized Cy3/Cy5 log 2 ratios of all probes in the genomic interval multiplied by the square root of the number of these probes. Deletions are reported where the signal intensities of the probes representing the region in the test strain are near background and represents the absence of hybridization. Graphical visualization of the data for the significant aberrations was done using UCSC genome browser (Schneider et al., 2006). Table 1. Details of the PCR primers used for the multiplex PCR for the detection of Plasmodium falciparum, and P.vivax, PCR conditions used for the amplifications of 18S r RNA gene involved initial denaturaton at 93°C for 2 minutes followed by denaturation at 93°C for 1 minute 30 seconds, annealing at 52°C for 2 minutes, extension at 72°C for 3 minutes, after initial denaturation all the reactions were run for 30 cycles.

Results
In this study, a custom designed tiling microarray has been used to investigate the overall difference between the two clinical isolates i.e., Pf-isolate-2 (Pf-2) (complicated/cerebral malaria) and Pf-isolate-1 (Pf-1) (uncomplicated malaria). The reference genome 3D7 was used to design the probes. The tiling array used in this study enables large-scale detection of regions showing copy number variations across the whole genome. The CNV's were identified in Pf-2 (complicated) by comparison to Pf-1 (uncomplicated malaria). The well-accepted ADM-2 algorithm was used with filtering criteria of, minimum of 5 probes and the average log fold change of 0.3 for the analysis of CNV. Here we are reporting significant CNV regions seen in Pf-2 in comparison with the Pf-1. In this data, the term amplification and deletion are used to represent copy number of test strain Pf-2 compared to Pf-1. Deletions are reported where the signal intensities of the probes representing the region in the test strain are near background and represent the absence of hybridization.
The size of each chromosome with the number of genes reported here are as per PlasmoDB V.6.0 (Aurrecoechea et al., 2009;Gardner et al., 2002). Those regions that are re-annotated in the genome assembly of P. falciparum 3D7 and updated in recent version of PlasmoDB i.e PlasmoDB V.34 and are eliminated from the study to rule out any inadvertent errors due to the re-annotation processes. A circos plot of whole genome CNV data of Pf-Isolate-2 has been generated to show the regions in chromosomes exhibiting CNV (Figure 1).
Genome wide gene copy number variation A total of 687 unique genes was seen to exhibit CNVs in the genome of Pf-isolate-2 (Dataset 1, (Pandey et al., 2018a)). The loss/gains varied in size from 100 nucleotides to 47,000 bp and included intra and intergenic regions, as they are an essential part of the genome and might be involved in the regulation of the genes. From the total of 687 genes, amplification is reported in 511 and deletions in 176 genes. There are 27 genes in regions of chr-4,6,7,8,10,11,12,13 in Pf-2 in which both amplifications/gains and deletions/loss are present. This region is rich in Variant Surface Antigen family genes. The deletions or amplifications covered smaller regions and did not cover more than 2 contiguous genes. Most variations were seen in the sub-telomeric and telomeric regions.

Variations in Variant Surface Antigens (VSA)
Our analysis revealed CNVs in 268 unique VSA gene families, (Dataset 2, (Pandey et al., 2018b)) out of which variation is present in 58 unique var genes in Pf-2 in comparison to Pf-1. Most of these genes showing CNVs belong to var group B and C. A previous report from our group has also reported differential expression in the transcriptome of these family genes in the isolates causing severe malaria. The longest CNV was reported in chr2: 86, 3297: 94, 7019, near the telomere covering about 20 genes. This region largely had deletions and covered genes belonging to the: var gene family (PfEMP-1) rifin, stevor, and exported family members Plasmodium exported protein (EPF-1, EPF-3, EPF-4) Pfmc-2TM Maurer's cleft two transmembrane protein (MC-2TM) and hypothetical proteins. Regions with multigene family genes showed deletions at both the telomeric and centromeric regions of the chromosomes. In chromosome-1 of Pf-isolate-2 a deletion of 16.3kb (563139-579488) region was seen, that includes 5 genes coding for Stevor, pseudogene (PFA0705C) rifin (RIF, PFA0710C) Plasmodium exported protein (hyp7), unknown function (PFA0715C) Plasmodium exported protein, unknown function, pseudogene (PFA0720W). There is also a 11.7 kb, deletion region in the same chromosome, which includes 4 genes coding for, Exported protein Family-3 and 50S integral memberane protein, putative (PFL1140W) An amplification of the region, that covers three genes, PFL1155W, PFL1150C, and PFL1145W have been seen previously in Dd2 strain of Plasmodium falciparum (Dharia et al., 2009). One of the genes showing amplification codes for the GTP cyclohydrolase enzyme, which is the 1 st and the rate-limiting enzyme of the folate biosynthesis pathway. Amplification of this gene has been previously reported in parasite isolates from certain endemic countries (Kidgell et al., 2006;Nair et al., 2008). Another gene, mitochondrial carrier protein, putative (PFL1145W) is reported as a homologue of Yhm2 which was characterized from S. cerevisiae as an inner membrane DNA binding protein.
It is speculated that this protein helps in mitochondrial maintenance by tethering mitochondrial DNA to inner membranes (van Dooren et al., 2006). Another region in chromosome-12 showing amplification covers 2 genes codes for Polyubiquitin (PfpUB, PFL0585W) and non-SERCA-type Ca transporting P-ATPase (ATP-4, PFL0590C).
An amplification of 3.3 kb is present in chromosome -13 (408857-412219) in the single gene that codes for mRNAdecapping enzyme 2 putative (DCP2, PF13_0048), Amplification of 3.19 present on chromosome-14 covering a single gene coding for pyridine nucelotide transhydrogeanse, putative (PF14_0508) and 3.13 kb which is also present in chromosome-14 covering a single gene EMP-1 trafficking protein (PTP3, PF14_0758). It has been reported that disruption of the gene (PF14_0758) showed reduced levels of expression of PfEMP-1, which suggests that the protein encoded by PF14_0758 plays a role in trafficking and the display of the virulence protein PfEMP-1 on the host erythrocyte (Maier et al., 2008). All these genes showing amplifications in the data are important for the parasite to survive in the host environment

Discussion
Several studies at the genomic and transcriptomic level have been undertaken to understand the molecular basis of pathogenesis involved in severe disease manifestation. However, there is scarcity of information on genomic variation in parasites causing complicated manifestations, regions with seasonal malarial transmission. Besides, most of the CNV or SNP studies have been done on in-vitro parasite isolates or clinical isolates from other parts of the world, but very few studies are from parts of India (Cheeseman et al., 2009;Cheeseman et al., 2016). The lacunae in genome variation study still exists as there is no study from parts of India showing nuclear genomic variations between two parasites isolates with differing disease manifestations. CNVs were seen to be present throughout the chromosomes of Pf-2. Large amplifications and deletions of several kilo bases in regions covering Pf-EMP1 and exporter family member genes, have been seen, as it is a known fact that the variant surface antigen family genes are prone to variations and immune invasions. (Carret et al., 2005;Mok et al., 2008;Tan et al., 2009). Amplification in chromosome number 12 includes genes like GCH-1 which is a rate limiting enzyme of folate biosynthesis and is important for the parasite to survive, amplification of this gene is reported to cause antifolate resistance and drives the evolution of drug resistance (Kidgell et al., 2006;Nair et al., 2008). Amplification in Glycophorin binding protein (GBP, PF10_0159) was also seen in the PF-2, this gene is important for the parasite to invade the erythrocyte as it binds to the Glycophorin protein on the erythrocyte membrane. Amplification in sodium-dependent phosphate transporter (PiT, MAL13P1.206) has also seen, the uptake of Pi is essential for the parasite for the synthesis of DNA, RNA and numerous phosphorylated metabolic intermediates (Saliba et al., 2006).

Conclusion
The implications of deletions or amplifications in the coding regions of the genes remain unanswerable at this point. This study shows the presence of CNV based genome differences from two field isolates. Based on msp-1 and msp-2 genotyping it appears that both the isolates exhibit allelic families representing K1 (msp-1 block-2) and FC27 (msp-2 repeats) (Snounou et al., 1999). At this level of the study, it is not possible to speculate whether the differences that we are seeing in Pf-isolate-2 are instrumental in causing the severity of the disease. However, we are reporting here some putative and novel variations, which have not been reported before in single gene studies with field isolates. The Data clearly shows that there is a difference in the isolate which has been taken from the patient with severe malaria (Pf-2) than from that with uncomplicated malaria (Pf-1). Differences exist not only in coding regions, but also in the intergenic locations. Thus, there could be implications at the level of gene regulation. However, these differences could also be inherently natural differences between the two parasite isolates without any direct correlation with the disease states, which could be influenced by various other host factors. This case study is just a preliminary attempt to utilize aCGH tiling array using ADM-2 algorithm to report the variations if any or differences in the genomic segments of parasite isolate causing severe malaria with that of the isolate causing uncomplicated malaria. To get a real picture of the genomic variations related to the disease severity, a tiling array approach can be used to analyze a large set of clinical samples exhibiting different disease complications.

Data availability
The microarray data discussed in this manuscript has been deposited in the NCBI Gene Expression Omnibus (GEO) under the GEO series accession number (GSE77165). Plasmodium causing complicated malaria) and Pf-isolate-1 (uncomplicated malaria). The study is largely a case study reporting the genomic differences in Pf-isolate-2. The study has identified amplifications in all the chromosomes of Pf-isolate-2 that are largely limited to telomeric, and subtelomeric regions and include multigene families such as vars, rifins, and stevors. Though, this is not the first report of aCGH technique being used for comparative genome analysis, it's application in study of field isolates is demonstrated. The study holds merit for publication in the journal however, there are some criticisms that need to be addressed.

Major points:
The variations including amplifications and deletions are of common occurrence between two parasite strains isolated from different patients. These variations are more prominent in telomeric and sub-telomeric regions comprising of multigene families such as var genes, etc. Even two different isolates exhibiting identical phenotypic expression may show variations in multigene families. Since the present study is focused on identifying the genomic differences in Pf-isolate-2 (causing complicated malaria) with Pf-isolate-1 (causing uncomplicated malaria), there is no experimental evidence suggestive of direct association of the CNVs found in the study with the disease phenotype. Moreover, it would have been beneficial to incorporate more samples for comparison including parasites exhibiting identical phenotypes. This would have been useful in drawing meaningful conclusions on the CNVs and disease severity.

1.
What variations in results are expected if the study were to use Next Generation Sequencing 2.
(NSG) instead of aCGH, a summary in the discussion section will be appropriate? How the results might be affected if a newer version of Plasmodium genome database was used?

3.
Mention of each gene name in the text, to describe changes, is confusing and it can be presented in the table format making it easier to read and understand.

4.
There are genes such as msp2 (Chaorattanakawee et al, 2018;PMID: 29342212), etc that have been demonstrated to be associated with disease severity. Did the authors found any variation in msp2? The findings of such studies reporting genetic markers of disease severity should be discussed in detail in the discussion section.

Minor points:
Page 3, Introduction section: references for studies that have identified the receptors for EMP1s should be given such as ICAM-1, CD36, E-selectin, NCAM and CD31 Page 3 paragraph 3 last sentence is not necessary, just include the reference. Page 3 paragraph 5: Objective statement is too long. Reframe it to something like "The objective of this study was to identify genetic difference(s) between two parasite isolates, with varying disease severity, using aCGH tilling approach" Page 4 Table 1 legend: No need to mention the PCR condition here as it has already been mentioned in method section in Page 3 paragraph 7. Page 4 paragraph 5: Please mention what is the "GC%" and "melting temperature" considered for designing 60mer probe. Page 4 paragraph 5: Reframe the sentence "The probes were designed by calculating GC% and optimum Tm values of (>65 genome sequence and calculate the GC% and melting temperature." The Plasmodium Genome Resource used was very old (V 6.0) need to be verified with the current version of Plasmo DB (V 39.0). Please be consistent with Plasmodium Gene nomenclature as the Gene ID used were from different version of Plasmo DB it makes it difficult to follow.