Evaluation of cytosine conversion methods for whole-genome DNA methylation profiling [version 1; peer review: awaiting peer review]

Background: DNA methylation, the most common epigenetic modification, is defined as the removal or addition of methyl groups to cytosine bases. Studying DNA methylation provides insight into the regulation of gene expression, transposon mobility, genomic stability, and genomic imprinting. Whole-genome DNA methylation profiling (WGDM) is a powerful tool to find DNA methylation. This technique combines standard whole-genome sequencing methodology ( e.g. , Illumina high-throughput sequencing) with additional steps where unmethylated cytosine is converted to uracil. However, factors such as low cytosine conversion efficiency and inadequate DNA recovery during sample preparation oftentimes render poor-quality data. It is therefore imperative to benchmark sample preparation protocols to increase sequencing data quality and reduce false positives in methylation detection. Methods: A survey analysis was performed to investigate the efficiency of the following commercially available cytosine conversion kits when coupled with the NEBNext® Ultra™ DNA Library Prep Kit for Illumina (NEB): Zymo Research EZ DNA Methylation™ kit (hereafter known as Zymo Conversion kit), QIAGEN EpiTect Bisulfite kit (hereafter known as QIAGEN Conversion kit


Introduction
DNA methylation is a key epigenetic regulator of many biological processes, including transposon activation, heterosis, biotic and abiotic stress response, development, and reproduction (Moore et al., 2013).Whole-genome DNA methylation profiling (WGDM) is among the most efficient techniques available for the detection of DNA methylation, providing >90% single-nucleotide-based genome-wide coverage for cytosines followed by guanine (CpG) residues (Chatterjee et al., 2017).WGDM is similar to the traditional whole-genome sequencing technique with additional steps of cytosine conversion to uracil (Plongthongkum et al., 2014;Li et al., 2018).Of deployed cytosine conversion techniques, sodium bisulfite (SB) conversion is a gold standard process that utilizes single-base-level chemical treatment to convert cytosine to uracil while leaving 5-methylcytosine (5-mC) unaffected prior to PCR enrichment (Figure 1).During PCR amplification, uracil is converted to thiamine, allowing researchers to perform high-throughput sequencing to determine the average methylation level in a sample (Feng et al., 2020).Although considered a gold standard, several limitations can arise through SB treatment, such as low conversion efficiency (failed conversion of non-methylated cytosines to uracil) and DNA degradation.These can impair the efficiency of downstream steps such as PCR amplification, library yield, and false detection of methylated cytosine (Hernández et al., 2013;Iurlaro et al., 2016;Tierling et al., 2018).
Recent data indicate EM conversion provides superior results compared to SB treatment.In a current research, Feng et al., (2020) investigated DNA methylation in Arabidopsis thaliana using the enzymatic methylation sequencing (EM-seq) method and compared their findings to whole-genome bisulfite sequencing (WGBS) data.They reported EM-seq provided better data than that of WGBS.Specifically, libraries prepared with EM-seq kits had higher mapping and lower duplication rates, lower background noise, higher average coverage, and higher total cytosine coverage between replicates (Feng et al., 2020).In another study using human gDNA, similar results were obtained where EM-seq libraries outscored bisulfite-converted libraries in most of the examined parameters, including coverage, duplication, sensitivity, and nucleotide composition (Vaisvila et al., 2021).Additionally, even with extremely low levels of DNA (100 pg), EM-seq proved to be a more reliable method for detecting methylation state than WBGS, which requires 10-200 ng DNA input (Vaisvila et al., 2021).
However, optimized sample preparation protocols are only available for a handful of species and are missing for many important crop plants.In addition, most existing benchmark methods have been performed for mammalian genomes, which are different from plant genomes (e.g., plant genomes are enriched with repetitive elements, as well as CHG and CHH methylation).Therefore, in this study, we have compared two commercial SB conversion kits: Zymo Conversion kit and QIAGEN Conversion kit with an EM conversion kit, NEB EM-seq kit, using soybean (Glycine max [L.] Merrill) DNA samples.Further, we have benchmarked the use of these three kits with the NEBNext ® Ultra™ DNA Library Prep Kit for Illumina (NEB) for sample preparation for high-throughput sequencing.

Plant growth and DNA isolation
Seeds of soybean genotype Williams 82 were sown in a Miracle-Gro Moisture Control potting medium and grown at 25°C, 16 h d À1 light at 230 to 365 μM m À2 s À1 , and 90% relative humidity in a growth chamber.At the V2 developmental stage (Fehr and Caviness, 1977), a trifoliate leaf was collected for DNA isolation (Figure 2).
A detailed protocol can be found on protocols.io.Genomic DNA (gDNA) was isolated using the Zymo Research Quick-DNA Plant/Seed Miniprep kit (Cat #D6020; Irvine, CA, USA).DNA concentration was measured using a Qubit fluorometer (Thermo Fisher Scientific; Waltham, MA, USA) coupled with a Thermo Fisher Scientific dsDNA High Sensitivity Assay kit (Cat #Q32851).DNA purity was assessed from 260/230 and 260/280 nm absorbance ratios using a NanoDrop ND-1000 spectrophotometer (Thermo Fisher Scientific).Furthermore, the quality and size of obtained DNA was determined using gel electrophoresis (Fisher Scientific gel casting and electrophoresis apparatus (Cat #FB-SB-1316); 1% agarose gel with 1X Tris acetate-EDTA buffer (ThermoFisher; Cat # B49).Nucleic Acid was stained with Biotium GelRed Nucleic Acid Gel Stain (Fisher Scientific; Cat # NC9594719) and gel imaging was performed using LiCOR ODYSSEY imager (Serial # OFC1321) under a 520 nm channel.As a non-methylated control, the soybean gDNA was spiked with Escherichia coli gDNA (Zymo Research, Cat #D5016) representing 1% of the total gDNA input.The spiked sample was then separated into three aliquots (each 65 μL), which were sheared using Qsonica sonicators (Newtown, CT, USA).Sonication settings entailed 15 sec on/90 sec off for 8 cycles at a 20 kHz frequency, resulting in DNA fragments of approximately 350 bp.Sheared aliquots were used for library preparation, as described below.
For SB conversion using the Zymo Conversion kit, DNA was incubated with CT Conversion Reagent (Zymo Research; Cat #D5001-1) (Table 3 and Figure 1) in a thermocycler for approximately 16 hrs.The converted DNA samples were then desulphonated using M-Desulphonation Buffer (Zymo Research; Cat #D5001-5).For SB conversion using the QIAGEN Conversion kit, the sample was incubated in a thermocycler for approximately 5 hrs with the Bisulfite mix (Table 2 and Figure 1) and the DNA protection buffer provided in the kit.Converted DNA samples were then treated with Buffer BD from the kit for desulphonation.For EM conversion, the sample was first treated with TET2 (NEB Cat #E7130AVIAL; part of the NEBNext ® Enzymatic Methyl-seq Conversion Module) and APOBEC (NEB Cat #E7133AVIAL) (Table 1 and Figure 1) enzymes in a series of steps, which ultimately converted cytosines to uracil, allowing for the detection of 5 mC and 5 hMC.
Converted samples were purified using Beckman Coulter SPRIselect magnetic beads (Cat #B23317).The barcoding of converted DNA fragments was carried out with NEBNext ® Multiplex Oligos for Illumina (Methylated Adaptor, Index Primers Set 1) (NEB Cat #E7535S/L) using PCR (16 cycles) with EpiMark ® Hot Start Taq DNA Polymerase (NEB Cat #M0490S) following manufacturer instructions (more details can be found on protocols.io).

Quality control and sequencing of WGBS and EM-seq libraries
After amplification, prepared libraries were purified using SPRIselect magnetic beads (Cat #B23317).The concentration of prepared libraries was measured using the Thermo Fisher Scientific dsDNA High Sensitivity Assay kit (Cat #Q32851) with an Invitrogen Qubit Fluorometer as well as a NanoDrop ND-1000 spectrophotometer (Cat# E1123552; Thermo Fisher Scientific).Prepared libraries were then sent to Novogene Corporation Inc. (Sacramento, CA, USA) for downstream quality assessment and whole-genome sequencing.Fragment size and concentration were validated with an Agilent Bioanalyzer (Agilent Technologies; Santa Clara, CA, USA).Libraries were sequenced on an Illumina Hi-Seq sequencer to obtain 150 bp paired-end reads.

Results and discussion
Despite the utility and growing popularity of WGBS for epigenome analysis in crop plants, relatively little has been done to identify various factors affecting sample preparation and data quality.Cytosine conversion is a crucial step of this process, as partial conversion can lead to false methylation calls.In this survey experiment, we used soybean DNA to prepare samples using three different cytosine conversion protocols.We have evaluated their ease of use, DNA loss during the conversion step, and subsequent data quality.These evaluations shed light on the performance of these three kits and their use for preparing samples to analyze crop plant epigenomes.
The QIAGEN kit outperformed other kits in time requirement and DNA recovery Essential factors to consider when selecting a kit are the final yield of libraries and the ease of the protocol.Based on the time required for cytosine conversion steps, the QIAGEN Conversion kit was the fastest method to prepare the whole library (≈ 24 hrs), followed by the NEB EM-seq kit (≈ 26 hrs) and Zymo Conversion kit (≈ 35 hrs).When evaluating the cytosine conversion step, the Zymo Conversion kit needed the longest time of ≈ 16 hrs.However, the QIAGEN Figure 3.Comparison of sequencing data quality between the cytosine conversion kits.Quality was determined by library recovery (%), number of unprocessed and preprocessed raw sequences, average length of preprocessed sequences (bp), duplication rate (%), GC content (%), mapping efficiency (%), total coverage (X), and cytosine conversion efficiency (%).NEB, New England BioLabs.
Conversion kit and the NEB EM-seq kit required a total of ≈ 5 hrs, which included the post-conversion cleanup step (Tables 1-3).
Following cytosine conversion, the highest recovery of DNA was observed with the QIAGEN conversion kit (100%), followed by the Zymo Conversion (20%) and NEB EM-seq (7.40%) kits (Figure 3).Previously, researchers noted factors that resulted in the reduction of libraries while using SB-based cytosine conversion.In a study conducted by Tanaka and Okamoto (2007), a real-time PCR experiment indicated that DNA degradation during SB conversion occurs primarily due to the hydrolytic reaction or the prolonged exposure to the SB.Additionally, prolonged exposure to SB results in the formation of abasic sites and subsequent DNA strand damage (Tanaka and Okamoto, 2007).An abasic site is a region of DNA that lacks both purine and pyrimidine bases due to DNA damage formed by spontaneous hydrolysis of the N-glycosidic bond (Boiteux and Guillet, 2004).According to the manufacturer's manuals for SB-based conversion kits, DNA is denatured and fragmented before or during cytosine conversion in the thermocycler to chemically convert nonmethylated cytosines into uracil.Thus, the prolonged exposure to SB during denaturation, as well as DNA fragmentation, likely contributed to the loss of small DNA fragments during the downstream purification steps, reducing library yield.Conversely, high recovery of DNA when using the QIAGEN conversion kit may be attributed to the presence of DNA protection buffer in the kit (Izzi et al., 2014;Leontiou et al., 2015;Tierling et al., 2018).The drastic loss of DNA with the NEB EM-seq kit may be due to comparatively more purification steps, increasing the chances of DNA loss due to pipetting.All cytosine conversion kits yielded quality sequencing data Prepared samples were sequenced to see if the cytosine conversion method had any effect on the sequencing.The initial quality of sequencing data was analyzed based on the following criteria: the total number of sequencing reads obtained, total coverage (X), duplication rate (%), Phred quality score, Per-base sequence, and GC content (%) in the sample.These parameters indicated no major differences among the three samples as discussed below.The number of raw sequencing reads per sample ranged from 33-50 million, with coverage ranging from 20X to 30X (Figure 3).The Phred quality score of each data set was found to be excellent, ranging between 31 to 40 (Figure 4).Based on the per-base sequence plot, thymine was the most abundant base in each library with 50% of all four bases, whereas cytosine was the least abundant (Figure 5).GC content was similar between samples, with the Zymo Research library having the highest level, followed by the QIAGEN and NEB kits (Figure 3).As expected, the per-sequence GC content for each sample was biased due to cytosine conversion into thiamine during PCR amplification.Thus, our data suggest that the cytosine conversion method has little effect on the initial sequence quality.
Nevertheless, initial data quality may not reflect data utility.Therefore, we mapped reads to the soybean genome to see the mapping efficiency (defined as how many reads can be aligned to the genome proportional to the total reads).Reads generated from all three methods had similar mapping efficiencies.The NEB kit library had the highest mapping efficiency (74.4%), followed by the QIAGEN kit (72.5%) and the Zymo Research kit (70.8%) (Figure 3).The observed mapping efficiency (>70%) is comparable with previously reported efficiencies for cytosine-converted samples (≈ 60-75%) (Hari and Parthasarathy, 2019).These results implied that sequences obtained from all three methods could be aligned correctly to the soybean genome.
We also calculated the duplication rate for each sample, as a higher rate of duplication distorts the true sequence proportion in a given sample.Duplication rates between 13 and 15% were detected for each library (Figure 3), which is lower than previously reported duplication rates during soybean methylome profiling (Rambani et al., 2015).Duplication can occur for several reasons.For instance, biased PCR enrichment and overamplification of DNA fragments can result in an overrepresentation of library fragments.Duplication can also occur when an identical template binds to numerous clusters on a flow cell.Duplicates of PCR products distort the actual proportion of sequences in the sample (please see more details on the Babraham Institute website).

Cytosine conversion efficiency
It is crucial to analyze the cytosine conversion efficiency during DNA methylation analysis.If a library has low conversion efficiency, it can result in false methylation detection (Singer, 2019).Adding methylated, non-methylated, or both types of genomic controls in a sample can help identify these limitations presented by bisulfite-converted DNA.
Here, the cytosine conversion efficiency was calculated using reads aligned to non-methylated E. coli and chloroplast gDNA for each sample.According to these alignments, the bisulfite conversion efficiency of samples ranged between 90 and 99.8% (Figure 3).QIAGEN and Zymo Conversion kits had a similar conversion efficiency of roughly 99% (Figure 3).Similar results were reported for both bisulfite kits in a study conducted on human chromosomal DNA (Tierling et al., 2018).For the NEB EM-seq kit, the cytosine conversion efficiency was found to be 90.9%,around 9% lower than the data obtained from a published study (Figure 3) (Feng et al., 2020).The efficiency and accuracy of a conversion kit are dependent primarily on the PCR cycling procedure, conditions, the length of the desulphonation process, and the addition of reagents to prevent DNA degradation (Izzi et al., 2014;Tierling et al., 2018).In this case, the QIAGEN kit included a DNA protection buffer containing a pH indicator dye to ensure the correct pH for cytosine conversion.Alternatively, the Zymo Research kit protocol distinguishes the alkaline or denaturation step from the conversion step (Izzi et al., 2014).However, it is unclear what contributed to poor cytosine conversion efficiency in the NEB EM-seq kit.It is possible that further optimization of the input DNA quantity is necessary for this kit when it is used with the NEB library preparation kit.
Furthermore, the data were aligned to the soybean reference genome to detect the number of methylated cytosines in each library.The highest number of methylated cytosines in all three contexts (i.e., CpG, CHG, and CHH) was found in the library prepared with the NEB EM-seq kit, with around 1.295 Â 10 9 total methylated cytosines, followed by 1.169 Â 10 9 in the QIAGEN library and 703 million in the Zymo Research library.Moreover, the QIAGEN and Zymo Research libraries displayed a similar distribution of methylated cytosine across contexts.The NEB library showed a distinct distribution compared to the other kits, with an increase in methylated cytosines at CHH context (Figure 6).This can be attributed to the comparatively poor cytosine conversion efficiency, which may have resulted in a high number of erroneously methylated cytosines (Simpson et al., 2017).As expected, the majority of cytosines in each library were methylated in CpG context, followed by CHG and CHH (Figure 6).These data are congruent with a past soybean DNA methylation study, which revealed the greatest levels of methylation in CpG context, followed by CHG and CHH contexts, under normal developmental conditions (Song et al., 2013).Extended data Analysis code available from: https://github.com/ajwije/DNA_methylation_analysis.git Archived analysis code at time of publication: https://doi.org/10.5281/zenodo.7328525(Wijeratne, 2022).License: MIT The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com

Figure 5 .
Figure 5. Per Base Sequence Content of the (a) Zymo-research kit, (b) Qiagen kit, and (c) NEB kit.Results were obtained using MultiQC.NEB, New England BioLabs.

Table 3 .
Thermocycler incubation condition for the Zymo Research kit.

Table 1 .
Thermocycler incubation condition for the NEB kit.NEB, New England BioLabs.

Table 2 .
Thermocycler incubation condition for the Qiagen kit.