Long read nanopore sequencing for detection of HLA and CYP2D6 variants and haplotypes

Haplotypes are often critical for the interpretation of genetic laboratory observations into medically actionable findings. Current massively parallel DNA sequencing technologies produce short sequence reads that are often unable to resolve haplotype information. Phasing short read data typically requires supplemental statistical phasing based on known haplotype structure in the population or parental genotypic data. Here we demonstrate that the MinION nanopore sequencer is capable of producing very long reads to resolve both variants and haplotypes of HLA-A, HLA-B and CYP2D6 genes important in determining patient drug response in sample NA12878 of CEPH/UTAH pedigree 1463, without the need for statistical phasing. Long read data from a single 24-hour nanopore sequencing run was used to reconstruct haplotypes, which were confirmed by HapMap data and statistically phased Complete Genomics and Sequenom genotypes. Our results demonstrate that nanopore sequencing is an emerging standalone technology with potential utility in a clinical environment to aid in medical decision-making.

An important aspect of precision medicine is the study of how genes influence individual response to drug therapies, known as pharmacogenomics (PGx). PGx genotyping impacts the choice of drug dosing in many medical contexts. As an example, in acute lymphoblastic leukemia patients the metabolizer status of thiopurine methyltransferase (TPMT) must be considered when calculating the initial drug dose of mercaptopurine (6-MP) to ensure proper treatment and avoid fatal toxicity 1,2 .
PGx data are typically collected by sequencing a small panel of known PGx genes via traditional Sanger sequencing, or targeted genotyping technologies 3 . Diagnostic labs are also exploring the use of whole genome or exome sequencing (WGS, WES) for PGx. However, existing methods have various limitations, which may lead to adverse drug responses. WGS and WES methods may fail to capture or provide adequate sequence coverage for certain PGx loci. Targeted genotyping approaches, such as Taqman (Life Technologies), Luminex (Luminex Corp.) or Sequenom (Agena Bioscience), can fail to detect novel loss-of-function mutations due to their selective interrogation of predefined genomic loci. In a diagnostic clinic, where results are often required within days of administering diagnostic tests, some existing technologies can delay the return of clinical results. Current massively parallel technologies have high capital costs (ranging from $100,000-$1,000,000) requiring clinical laboratories to purchase and maintain large instruments to perform in-house genotyping. Alternatively, a laboratory can send these PGx samples to a third party service for a fee, but may wait up to several months for a clinical report.
Clinical haplotypes, tightly-linked collections of inherited alleles, that are responsible for a plethora of medical phenotypes including patient drug response are important in many medical sequencing applications. Information about haplotypes, or genotype phase, can be inferred from parental genotypes or genetic pattern frequency in the human population, however, these predictions can be inaccurate if de novo or rare haplotypes are encountered in a patient 4,5 . Due to the chromosomal distance between alleles, current short read technologies in use for PGx are often unable to resolve haplotype information without supplemental statistical phasing or parental genotypic data.
Recently, Oxford Nanopore Technologies Inc. has developed the MinION, a real time nanopore-based DNA sequencing instrument which is compact, inexpensive and faster than most established DNA sequencing technologies. The time it takes from initiation of library preparation to basecalling the first sequence read is approximately 3 hours and the instrument is capable of detecting long sequence reads in excess of 50kb (according to the manufacturer, ONT). Nanopore-based technology promises major advances in DNA sequencing by offering an inexpensive (e.g. on the order of $1000) pocket-sized device for clinical diagnostics or field experiments.
Here we report nanopore-based sequencing of three clinically relevant PGx genes to identify medically actionable variants and haplotypes without statistical phasing.

PCR amplification
Primer sequences for HLA-A, HLA-B and CYP2D6 are available in Table S1. For CYP2D6, we designed primers to specifically amplify CYP2D6 while not amplifying the 94% identical CYP2D7. PCR primer specificity was verified using UCSC in silico PCR. We used the standard protocol (for fragments up to 8 kb) of the KAPA LongRange HotStart PCR system: 5× KAPALongRange Buffer (without Mg 2 + ) 1×, MgCl 2 (25 mM) 1.75 mM, dNTPs (10 mM each dNTP) 0.3 mM, Fwd primer (10 μM) 0.5 μM, Rev primer (10 μM) 0.5 μM 50ng of genomic DNA (1ul of a 50ng/ul preparation), KAPA LongRange HotStart DNA Polymerase (2.5 U/μl) 1.25 U/50 μl, PCR grade water up to 50μl. For HLA-A and HLA-B, genomic sequence was downloaded from UCSC Browser with common SNPs masked. Primers were designed using Primer3 using parameters of 68°C for optimal annealing temp and 26bp minimum primer length 6 . PCR cycling conditions were: 94°C for 3 mins, followed by 35 cycles of 94°C for 20 sec, 68°C for 15 sec and 68°C for 5 mins followed by a final step of 72°C for 5 mins and hold at 10°C.
Oxford Nanopore genomic DNA library preparation The DNA libraries were prepared using the Oxford Nanopore Genomic DNA Sequencing protocol (SQK-MAP003). 1.5μg of PCR product was used (instead of the suggested 1μg based on improved yield in earlier testing) with equimolar amounts of CYP2D6, HLA-A and HLA-B amplicons in solution. DNA was not fragmented because the PCR amplicons were already at the desired size for sequencing and downstream haplotyping (4-5Kbp; Table S1).

Amendments from Version 1
Dear reviewers, We would like to thank you for your thorough and useful review of our manuscript "Long read nanopore sequencing for detection of HLA and CYP2D6 variants and haplotypes". Please find our point-by-point response in Supplementary file 2.
Both of you suggested we reproduce our results using the newest Oxford Nanopore Technology kits and pipeline. Therefore, we performed the sequencing experiment using the newest Oxford Nanopore Genomic DNA Sequencing protocol (SQK-MAP004). The library was sequenced on 3 fresh R7.3 flow cells, only one of which produced a significant number of reads (just over 1000 reads from the other two flow cells). As anticipated, the 2D consensus reads generated using the newer protocol were longer on average suggesting advances in basecalling accuracy since our initial experiments. Unfortunately, yield was significantly lower with a total of 885 aligned reads for all targeted loci (including our 3 genes-of-interest and the nanopore controls), which did not enable us to make accurate genotyping calls. These new read data have been deposited in the NCBI Sequence Read Archive along with our other data.
Thank you again for your feedback on our manuscript.

Sincerely,
The authors

REVISED
In accordance with the protocol, we end-repaired the DNA with the NEBNext end repair module (New England Biolabs, cat. no. E6050) and subsequently dA-tailed the sample using the NEB-Next dA-tailing module (New England Biolabs cat. no. E6053), prior to ligation of nanopore-specific adapters. All purifications were accomplished with Agencourt AMPure XP beads (Beckman Coulter Inc., cat. no. A63880). Throughout the library preparation, care was taken not to vortex or vigorously pipette/mix the library to avoid shearing the DNA into smaller fragments.
Oxford Nanopore MinION sequencing and basecalling The MinION flowcell (R7.3 flowcell chemistry) was run for 24 hours using the MinKNOW software (v47.3) producing 24,859 fast5 files, corresponding to individual reads from base detection events at specific nanopore channels. Online basecalling was performed using the Metrichor software (v2.23). The MinION outputs 3 reads for each dsDNA molecule that passes through a pore. The leading ssDNA is referred to as "1D template" and its complementary ssDNA strand is the "1D complement". When both 1D template and 1D complement reads are basecalled, a 2D consensus sequence is determined based on complementarity. All read information was extracted using the python HDF5 package h5py (http://github.com/h5py/h5py). We observed 19,655 1D template reads, 9,584 1D complement reads and 7,540 2D reads. The mean lengths were 2,693bp for 1D template, 2,706bp for 1D complement and 3,486bp for 2D consensus.

Read alignment
Existing massively parallel sequencing instruments, such as the Illumina HiSeq 2500, produce accurate short reads typically up to ~250bp in length. These sequencers can produce hundreds of millions of reads which need to be rapidly aligned to a reference genome. Current computational methods for accurate alignment of these reads, including BWA 7 and Bowtie2 8 are based on the Burrows-Wheeler Transform FM index, and they are designed to align short reads with minimal variation to a reference assembly. BWT-FM methods are insufficiently sensitive to align much longer reads with higher error rates 9 . These long reads, generated by single molecule sequencers such as the Oxford Nanopore MinION or the Pacific Biosciences RS II, have a significantly higher error rate, enriched for insertions or deletions (indels) rather than substitutions 9 . Mapping of these reads is best suited to aligners that were originally designed for whole genome alignments, such as LAST 10 . We chose to use BLASR, originally developed for the Pacific Biosciences system, to align our data, because it was designed to align long error-prone reads rather than genomes 9 .
All reads were aligned to the human genome reference assembly GRCh37.p13 (hg19) using default BLASR parameters (gap open penalty = ten, gap extension penalty = zero, minimum seed length = 12). The use of other parameters, such as a gap-open penalty of zero (with default gap extension penalty = zero) did not alter the results, even though it may be expected to do so given the prevalence of indels expected in single molecule nanopore sequencing. The majority of successfully aligned long read fragments were obtained from 2D basecalls (Table 1), and these were of higher quality because they are consensus reads constructed from corresponding 1D template and complement. For the final alignment data, for each separate read event (1D template, 1D complement and 2D consensus), we selected the 2D read if it was available. Since the 1D reads typically had lower mapping accuracy and significantly shorter aligned fragments (Table 1), these were not included in our variant or haplotype calling analysis.
To repeat our findings, we performed the sequencing experiment using the newest Oxford Nanopore Genomic DNA Sequencing protocol (SQK-MAP004). The library was sequenced on 3 fresh R7.3 flow cells, only one of which produced a significant number of reads. The 2D consensus reads generated using the newer protocol were longer on average (3885.3bp compared to 2952.3bp, see Table 1) suggesting advances in basecalling accuracy since our initial experiments. However, yield was significantly lower with a total of 885 aligned reads for all targeted loci, which did not enable us to make accurate genotyping calls.
Finally, we performed two separate alignments depending on our desired sequencing application: a) gene targets with highly similar nearby genes; and b) highly polymorphic gene targets. The first analysis only selected the single best alignment for each long read. This was critical for the gene CYP2D6 because the CYP2D locus on chromosome 22 harbors two paralogous pseudogenes CYP2D7 and CYP2D8P 11 . In particular, CYP2D6 and CYP2D7 are highly similar (94% identity, BLAST E-value = 0.0) and are positioned in tandem on the chromosome. By allowing reads to only map to a single best hit, we were able to verify that the PCR selectively amplified CYP2D6 and not nearby related genes (see coverage in Figure 1). Our second analysis was performed due to the high degree of polymorphism in the MHC locus on chromosome 6. As part of the MHC haplotype project 12 , multiple reference contigs for this highly variable region are included in the GRCh37 reference assembly as indicated in the release notes (http://www.ncbi.nlm. nih.gov/genome/guide/human/release_notes.html). Since long reads from the NA12878 HLA-A and HLA-B genes mapped to different Variants were detected using a naïve threshold requiring 1/3 of reads to contain the variant genotype at that position. While this was effective for substitution detection, we are likely to detect many false positive deletions due to the high deletion error rate (Table 1). Haplotype proportions were identified by interrogating clinical marker positions (Table S2)

Results
To evaluate the MinION for diagnostic PGx sequencing, we selectively amplified and sequenced the genes CYP2D6, HLA-A, and HLA-B from the CEPH/UTAH pedigree 1463 sample NA12878. CYP2D6 is a pharmacogenetically vital cytochrome P450 gene because it encodes a protein responsible for metabolism of 20% of clinically used drugs 11 . The diagnostic relevance of 2D6 is derived from its significant polymorphism which contributes to dramatic inter-individual variability in enzyme activity 11 . Also important are the HLA genes which are clinically relevant for solid organ transplantation and accurate dosing of abacavir, allopurinol and carbamazepine, used to treat HIV/AIDS, hyperuricemia and seizure disorders, respectively 17-19 . The HLA genes are among the most polymorphic loci in the human genome, making their sequencing and confident typing difficult with current short read DNA sequencing methods. These three genes were also chosen for sequencing due to their length (4-5Kbp), which did not require long range PCR amplification methods.
PCR amplicons of these three genes from NA12878 (CEPH/Utah Pedigree 1463) were sequenced on the MinION instrument yielding 19655 read events. Each read event could be basecalled in multiple forms, as a template or complement strand (1D) or as a consensus of the two (2D), and we obtained 36779 1D and 2D reads in total. For the purpose of diagnostic evaluation, we chose to align only the consensus 2D reads due to their lower error rate and extended length (Table 1; see Methods). Reads were aligned to the human genome (GRCh37), with an abundance of aligned reads 4-5Kb in length representing full-length PCR amplicons. As well, smaller aligned read fragments were observed, some of which are speculated to be by-products of shearing during experimental DNA handling (Figure 2A, Table S1).
With depth of coverage of ~1000× for each of the genes, many chromosomal positions were called with 70-90% consensus, demonstrating that as coverage of loci increases on the MinION, confidence improves with regard to specific base calls ( Figure 2B). While the MinION basecalls are emitted with a comparatively high error rate (Table 1), the majority of errors appear to be randomly distributed across the length of the reads, which is why increasing coverage can yield a consensus that matches variant calls from existing sequencing and genotyping platforms such as Illumina, Complete Genomics and Sequenom.
MinION-called variants and haplotypes were validated against statistically phased genotypes from multiple platforms including Complete Genomics and Sequenom MassARRAY (see Methods). Based on the statistically phased genotypes, we determined that NA12878 possesses both the *3 and *4 loss-of-function alleles for CYP2D6, and this *3/*4 diplotype is interpreted as reduced metabolism of drugs such as codeine (an opiate) and olanzapine (an atypical antipsychotic) 20-22 .
CYP2D6 haplotype proportions in MinION data were identified by interrogating clinical marker positions across all aligned reads to establish the proportion of reads corresponding to each PGx haplotype (Table S2). Only reads spanning all clinical markers were included (n = 404), so that haplotypes could be measured by linkage of markers on a single DNA molecule. The MinION data confirmed the statistically-phased haplotypes by direct interrogation of markers from individual reads ( Figure 2C). However, we also observed a prominent *2 haplotype, which we could not account for given that our Sequenom MassARRAY and qPCR results indicated that the CYP2D6 locus was diploid (no copy number variation) and could only correspond to a *3/*4 diplotype. To determine whether the *2 haplotype could arise from mismatched CYP2D7 DNA, which was not supposed to be PCR amplified (see Methods), we interrogated four positions with different bases between CYP2D6 and CYP2D7 reference sequences and found that all reads corresponded to CYP2D6 (Supplementary File S1). Finally, we hypothesized that the *2 haplotype might arise due to duplexes forming between *3 and *4 oligonucleotides during PCR (effectively outcompeting the primer binding during the annealing step), but this was ruled out by identifying the *2 haplotype using only 1D reads. It is possible that this *2 haplotype arose either due to early cycle template switching during PCR or sample contamination 23 . Also, the relative proportion of haplotypes was potentially skewed by the PCR amplification step.
The HLA-A and HLA-B haplotypes were determined in the same way as the CYP2D6 haplotypes, using predefined markers from the HapMap project. In HLA-A (only spanning reads, n = 203), the most abundant haplotype matched the transmitted haplotypes of the parents NA12891 and NA12892, which both transmitted an identical haplotype ( Figure 2C). Accounting for the errors in MinION sequencing, when allowing for a single mismatch in the haplotype, ~85% of reads confirm the NA12878 diplotype. In HLA-B (only spanning reads, n = 202), the majority of reads corresponded to the transmitted and untransmitted haplotypes of the parent NA12891, with only 8.4% of reads corresponding to the transmitted haplotype of parent NA12892 ( Figure 2C). This could be a result of potential contamination suspected earlier in described above with CYP2D6. As suggested for CYP2D6, the relative proportion of HLA haplotypes was likely also affected by PCR bias during amplification. HLA alleles were called with 4-digit resolution using the GATK HLACaller, but due to the high error rates of nanopore reads, HLA alleles did not match with alleles called using HapMap data (Table S3) 24 .

Conclusions and discussion
Phasing of genotypes is critical to prevent misinterpretation of PGx variants. The importance of correct phasing of PGx genotypes is illustrated with the gene TPMT, which plays a critical role in the metabolism of thiopurine, a drug used to treat acute lymphoblastic leukemia. In a recent study 25 , an individual was reported to have a TPMT *3B/*3C diplotype, based on observed heterozygous genotypes for the rs1142345 and rs1800460 variants, but this was a misinterpretation due to faulty haplotyping and *1/*3A is the correct diplotype. The rs1800460 variant is present in both *3A and *3B haplotypes while the rs1142345 variant is present in both *3A and *3C haplotypes. As a result, it was possible for the individual to have a *1/*3A diplotype or a *3B/*3C diplotype. Clinically, a *1/*3A diplotype corresponds to an intermediate metabolizer, requiring a 30-70% reduction in thiopurine dose, while a *3B/*3C diplotype corresponds to a poor metabolizer with a 90% reduction in dose 2 . An individual who receives a standard dose and is a poor metabolizer can experience fatal toxicity, while a low dose for a normal metabolizer can lead to disease progression. Clinical trials have demonstrated the medical importance of TPMT haplotyping in treatment of myeloid leukemias and non-malignant immunologic disorders 2,26 .
Long sequence reads aid haplotype identification by determining which genetic variants are in phase (i.e. on the same DNA strand). If the TPMT genotypes from the example above were sequenced using nanopore-based long read technology, the *1/*3A diplotype would likely be called correctly (note that rs1142345 and rs1800460 are only 8,310bp apart).  Supplementary material Figure S1 (Updated). CopyCaller software analysis of data from Taqman copy number assays Hs04502391_cn and Hs04083572_cn. CYP2D6 from sample NA12878 was observed to be diploid. Samples known to be 1 copy or greater than 2 copies for CYP2D6 (GM17235 and GM17232 respectively 28 ) were run for the two Taqman copy number assays as well.
While nanopore sequencing with the MinION is demonstrably error-prone in its current stage of development, we assert that this technology holds promise for clinical applications because accurate consensus sequences can be built with sufficient coverage given the high number of reads generated. As well, we have been able to successfully call haplotypes from long reads de novo in the absence of parental haplotypes or statistical phasing. The MinION device produced sufficiently long mappable reads to phase all variants in the loci examined. As error rates on the MinION decrease, we can expect to deconvolute these data into more accurate diplotypes with less noise and will be able to measure how much multi-sample multiplexing can be supported by a single run.
According to the CPIC, 63 genes and 132 drugs have guidelines for pharmacogenomic status (http://www.pharmgkb.org/cpic/pairs), and this list is constantly expanding. With increasing guidelines and demands for PGx in the clinic, affordable and rapid nanopore sequencing may hold great utility. We confirm that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.  This manuscript describes the analysis of three PCR-fragments generated from a human reference sample using a MinION sequencing device. Despite the preliminary nature of the data, and the fact that the paper is based on a single run of a single sample, it will nevertheless be of significant interest, and publication as a research note is in my view justified. In fact, given the broad interest in this emerging new technology, and its applicability to all life sciences (not just human (pharmaco-)genetics), the readership might be much wider than anticipated by the authors. This constitutes at the same time the biggest weakness of the paper, since it is in parts rather inaccessible for readers from fields other than human genetics (e.g. virology or microbiology), for whom this paper might nevertheless be highly relevant.
In order to address the concern that the manuscript is based on a single MinION run, and given the nature of the MAP program, it should be no problem for the authors to repeat the experiment with another sample, and while doing so specifically address the concerns of Dr. Kennedy. In addition, the authors should strive to improve accessibility to a wider audience wherever possible. Finally, it would be helpful to include additional experimental details that are currently missing. technology and there is a lot of interest in understanding its current capabilities, longer-term potential and limitations. Positive aspects of the report are description of long-read nanopore sequencing on long amplicons (4-5kb), and description of data handling and bioinformatics approaches this team is using, which may aid others working with the MinIon device. Negative aspects of the report are that all the data are from only one reference sample, and one run on the device; rather than being utterly convincing, the data suggests haplotyping on unknown samples may be possible only once error rates on the MinIon reduce; and there is an unresolved question about the CYP2D6 CNV or haplotype analyses, potentially due to sample contamination.
Despite these issues, the report is still of merit and will be of interest to many in the field.

Specific comments:
Methods Long PCR has been widely used for specific amplification of CYP2D6, with reaction conditions and primer sequences well established. It is not clear why the authors designed their own primers for this task, or how these novel primers were validated. This should be spelt out more clearly.
It would have been useful to have non-diploid control samples for the CYP2D6 CNV assay -for example a haploid (CYP2D6*5) case or multicopy case, to provide confidence that the assay was working as expected. This is relevant because of the question raised by the *2 haplotype which shows up in the MinIon analyis of NA12878.

Results
Page 6, first para: This description of the possible origins of the mystery CYP2D6*2 haplotype needs some editing for improved clarity. Not clear what is meant by "*3 and *4 duplexes forming". Also not clear what would cause PCR biases alluded to in the last sentence of this paragraph (and of the following paragraph).

Introduction
Suggest this sentence be changed: However, existing methods have various limitations, which may lead to adverse drug responses. > However, existing methods have various limitations, which may lead to failure to detect variants of pharmacogenetic significance.

Results
Page 5, para 4. First sentence should read thus, for improved clarity: "CYP2D6 haplotype proportions in nd Page 5, para 4. First sentence should read thus, for improved clarity: "CYP2D6 haplotype proportions in were identified by interrogating clinical marker positions…" MinIon data Supp File S1 It would be helpful to indicate which of the sequences is CYP2D6 and which is CYP2D7.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
I am also a participant in the MinIon Access Programme, and my laboratory is Competing Interests: working with the device. I met Dr Ammar at a meeting organised by ONT, and we have a shared interest in use of this technology for pharmacogenetic analysis. However, I have not been involved in the work described here.