DNA Sequencing with Nanopores
[version 1; not peer reviewed]
No competing interests were disclosed.
Greetings to you all this morning
Kurahaupo is the canoe
Tapuae-o-uenuku is the mountain
Wairau is the river
Rangitāne is the tribe
Huataki is my ancestor
Moa is my family
Rawiri Ekeru is my name
And so
Greetings, greetings, greetings to you all
My name is David Eccles, and I want to thank you for being here to listen to my talk, and the organisers for bringing me here. I am both a freelance bioinformatician and a Research Fellow in Bioinformatics at the Malaghan Institute of Medical Research in Wellington. Part of my work at the Malaghan Institute involves using genetics to work out how cancer cells interact with mitochondria. I will be talking a tiny bit about that work, but I'm mostly here to talk about my experience (and that of others) sequencing DNA with nanopores. This talk is my own, and the opinions in it are my own.
I brought a script with me today, because I'm prone to rambling when I get nervous.
A Rhapsody of Errors
My first ever accepted conference talk was on nanopore sequencing, at the New Zealand Next-Generation Sequencing conference in 2015. I was following right after the keynote speaker, who was talking about sepsis and how nanopore sequencing could help with getting the right antibiotics to a patient as fast as possible.
So, let's get this out of the way first. In that 2015 talk I was talking about errors in nanopore reads, and why I didn't think they would be a big problem for most use cases, even at the really high error rates of the first batch of flow cells.
https://f1000research.com/slides/4-1224
I had, and still have, a lot of belief in the capability of nanopore sequencing. Oxford Nanopore has on occasion mentioned that the electrical sensor is better than the chemistry, and the chemistry is better than the software, and as far as I'm aware that is still essentially true. With a strong background in computer science and a little bit of mathematics, my biggest belief is in the capability for the software to improve, even now. One slide I put up in that first talk is this one here, which shows the base call of a single read, basecalled on three separate occasions. In this case I wanted to highlight a particularly tricky region, an 8-base homopolymer in the mitochondrial genome, and I found it quite interesting how much the basecalled sequences changed over the course of a year.
I'd like to think that people have moved on from talking about errors, but they haven't. The single-base error rate is pretty much the only thing remaining that gives people a reason they can cling to; a reason for justifying the use of other technologies. But that single-base error rate is a software problem. It will keep getting better. Regardless of that, for most use-cases where nanopore reads are compared against existing sequences, they're perfectly fine as they are.
Next Next Generation Sequencing
In that first talk, I introduced my own ideas about sequencer generations. I see sequencing technology as being currently split up into three technological advances:
The first generation is Sanger sequencing. Typically gel-based sequencing done by tagging sequences and counting their lengths.
The second generation is the one that I expect most of you are familiar with. DNA bases are individually tagged (or otherwise identified), and the action of base incorporation is recorded during DNA synthesis.
The third generation is the new one, where sequences are observed at a single molecule scale as they progress through the sequencer.
I've tried to be a bit careful with my definitions here because I think it's reasonable to say that some sequencers straddle two generations. If I consider PacBio systems, for example, their sequencers combine synthesis of known bases with some recording of sequence dynamics during base incorporation.
Model-free Sequencing
In October 2014, if you had asked me for one word to describe the sequencer that we had just started testing, I would have said "potential". Give me three words, and I would have expanded that to "potentially disruptive technology".
The most important piece of advice I think I can give about nanopore sequencing is this: just try it.
One of the problems with disruptive technologies is that they have uses that we don't know about yet. I'm not so interested in what Nanopore sequencing can do that every other sequencer can already do. What interests and excites me is the potential for nanopore to bring forth a whole raft of completely unexpected uses. In many cases, we have no support for these things, no foundation, no precedent. If you've got an idea about what you want to do with nanopore sequencing, then... sorry, I probably can't give you any advice on whether or not it would work.
But just try it anyway. I'd love to hear about what you think nanopore devices can do. Compared to other high-throughput sequencing methods, nanopore sequencing has a very low capital cost, and it's also pretty competitively priced in terms of minimum run cost. You're unlikely to blow your budget by giving nanopore sequencing a go, particularly if you've been able to borrow someone else's flongle flow cells for the project (as part of a joint academic collaboration, of course).
Even though it feels like a long time when doing your first experiments -- navigating the frustration of extra reagents and the demands of Oxford Nanopore's Support team -- it's really not that long when considering that a completely new thing is being learnt. And that surge of accomplishment when you've prepared your own sequencing library, carefully loaded it onto a flow cell, and the bloody thing actually works?
Ah, that's such a wonderful feeling.
Julie Blommaert
Nanopore sequencing is helping us make genetic discoveries that are not possible with other technologies.
I've had a brief chat on Twitter with Julie Blommaert, who is currently assembling the genomes of a few different microscopic rotifer species. She created assemblies that are relatively complete in terms of gene content, but the bigger genomes were less contiguous. The reason behind this loss of contiguity was only visible after specifically looking for transposable elements and repetitive DNA, using a bioinformatics pipeline called dnaPipeTE. So she knows that these elements are there, but not *where* they are in the genome.
Julie tried PacBio sequencing on these rotifer genomes, and was able to patch up a few of the repetitive holes, but there are still some missing pieces. She hopes to use nanopore sequencing in the future to identify and place really long repeats, as well as other unassembled repetitive regions from her genomes. The long queues of service centres are a bit of a drag, so it'll be great to be able to do her sequencing in-house and get instant results.
Read Count
I want to spend a few minutes going through one of the first things that I do with a sequencing run. This data comes from a 2-day cDNA sequencing run I did a couple of weeks ago with Carole Grasso at the Malaghan Institute, looking for genes that are up- and down-regulated when we fiddle with the mitochondrial DNA content of cells. It just so happens that the cDNA used to create this plot is still sitting inside the flow cell that I'm trying to run here.
This here is a read frequency plot, essentially a line graph version of a histogram. Oxford Nanopore does tend to hide this data in the graphs they output, but I think it can help to highlight one of the biggest issues with sequencing yield, that the smaller sequences get the lion's share of the sequencing time.
I've got two hypotheses around why this happens. The first is that smaller reads are harder to filter out during sample prep, and harder to correct for when calculating sample molarity for loading onto the sequencer. It's somewhat surprising that just one nanogram of 150bp reads makes up 10-20% of the recommended 50-100 fmol loading of a flow cell.
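The arithmetic behind that 10-20% figure can be checked with a short sketch. It assumes double-stranded DNA at roughly 660 g/mol per base pair, which is a commonly used approximation rather than a number from the talk, and the function name dsdna_fmol is made up for illustration.

```python
# Assumption: approximate average mass of one double-stranded base pair (g/mol).
AVG_BP_MASS = 660.0

def dsdna_fmol(mass_ng, length_bp, bp_mass=AVG_BP_MASS):
    """Convert a mass of double-stranded DNA fragments into femtomoles."""
    grams = mass_ng * 1e-9
    moles = grams / (length_bp * bp_mass)
    return moles * 1e15

# One nanogram of 150 bp fragments, compared with a 50-100 fmol flow cell load.
short_fmol = dsdna_fmol(1.0, 150)
for target_fmol in (50, 100):
    share = 100 * short_fmol / target_fmol
    print(f"{short_fmol:.1f} fmol is {share:.0f}% of a {target_fmol} fmol load")
```

The same function shows why longer fragments are less of a worry: for a fixed mass, the number of molecules drops in proportion to fragment length.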
My second hypothesis is that smaller sequences move faster through the flow cell. They have less mass, which means they're quicker to react when the pores open up to accept more sequences. A short read can run halfway across a sequencing well while a long read is still putting its boots on.
Base Count
These numbers are multiplied by the sequence length to produce a plot of the number of sequenced bases with a given sequence length. The MinKNOW interface was my first experience with seeing a plot like this, and I've found it to be quite a useful representation, particularly for genomic DNA, because MinION reads typically have a more bell-shaped normal distribution under this transformation (except when sequencing transcripts). It also represents things that are actually useful for assembling DNA. In this case, with the cDNA reads, I can see that transcripts appear to be generally well represented throughout the range of sequence lengths.
Sequenced Base Density
These base counts are divided by the total number of sequenced bases to produce a plot of the proportion of sequenced bases with a given length.
However, I much prefer the look of it when I take that data, put it into a matrix, then plot the matrix on a black background with a splash of orange colour.
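The three steps described so far (read count, base count, and base density) can be sketched in a few lines. This is a minimal illustration rather than the script behind the slides: the function name and the toy read lengths are invented, and it tallies exact lengths where a real plot would group lengths into bins.

```python
from collections import Counter

def length_distributions(read_lengths):
    """Return {length: (read_count, base_count, base_proportion)}."""
    read_count = Counter(read_lengths)             # reads seen at each length
    total_bases = sum(read_lengths)
    table = {}
    for length in sorted(read_count):
        base_count = read_count[length] * length   # read count multiplied by length
        table[length] = (read_count[length], base_count, base_count / total_bases)
    return table

# Toy data (hypothetical): many short reads, a few long ones.
lengths = [150, 150, 150, 150, 500, 1000, 2000]
for length, (reads, bases, share) in length_distributions(lengths).items():
    print(f"{length:>5} bp  reads={reads}  bases={bases}  share={share:.2f}")
```

In the toy data the 150 bp reads dominate the read count but contribute only a small share of the sequenced bases, which is the contrast the two plots are meant to bring out.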
Digital Electrophoresis
I call this "digital electrophoresis", because it looks like the image plot that you might get when running your samples out on a gel.
It's not a perfect match, but in hindsight I've found it quite surprising, when comparing the gel images to the digital electrophoresis, to see that weird stuff in the reads is also present in the gels. In this case there's a slightly stronger banding of short sequences in BC09 compared to BC07, which is picked up in both the gel image and the simulated image based on sequenced reads.
Cumulative Base Density
Going further in the processing of sequence data, we can convert the density plot into a cumulative frequency plot by adding each total cumulatively to the previous total, starting with the longest sequences.
This produces a much smoother graph, largely independent of the nature of the input reads, and makes it a little bit easier for me to pick out general trends.
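As a rough sketch of that accumulation (the function name and toy lengths are made up for illustration), the reads are sorted longest first and each read's bases are added to a running total:

```python
def cumulative_base_fraction(read_lengths):
    """Return [(length, fraction of all bases in reads at least this long)]."""
    total_bases = sum(read_lengths)
    running = 0
    curve = []
    for length in sorted(read_lengths, reverse=True):  # longest reads first
        running += length
        curve.append((length, running / total_bases))
    return curve

# Toy data (hypothetical), same shape as a run with many short reads.
for length, fraction in cumulative_base_fraction([150, 150, 150, 150, 500, 1000, 2000]):
    print(f"{length:>5} bp  cumulative fraction={fraction:.2f}")
```

Plotting length on the X axis against the cumulative fraction on the Y axis gives the smooth curve described above.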
N50 / L50
If you do any amount of nanopore sequencing and start talking about quality control of data, you'll probably hear 'N50' being talked about a lot. This statistic is one way of summarising the typical length of sequences (a median weighted by bases rather than a plain average), so that a lot of rubbish short sequences don't drag the number down.
A question that you might ask about a sequence data set is, "what is the minimum number of sequences I need to capture half this dataset?" A supplementary question to that is, "what is the shortest sequence length in such a minimal dataset?" For long-read sequencing, we talk about read N50, which gives information about the length of reads in a sequencing run, and about genome N50, which gives information about the completeness of assembled contiguous sequences, also called contigs.
It turns out, the N50 value can be determined by reading off a cumulative frequency plot. By starting the cumulative frequency generation with the longest reads first, the N50 value can be read off the cumulative graph by looking at the point where 50% of the total bases have been added up. Ditto for N10, N90, and any other N you're interested in.
Through the usual process of making things in biology more confusing than they need to be, the N50 value actually refers to a length, and the complementary L50 value refers to a number. To avoid confusion, I prefer to always include a unit (kilobases, for example) and only give the length statistic, which is the most commonly used of the two values.
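The two questions above translate fairly directly into code. The sketch below uses a made-up function name, nx_lx, and returns both values: the length statistic (N50 by default) and the matching count (L50).

```python
def nx_lx(read_lengths, x=50):
    """Return (Nx, Lx): the length statistic and the matching sequence count."""
    threshold = sum(read_lengths) * x / 100
    running = 0
    # Walk from the longest sequence down until x% of all bases are covered.
    for count, length in enumerate(sorted(read_lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("read_lengths must contain at least one sequence")

# Toy contig lengths in kilobases (hypothetical).
contigs_kb = [80, 70, 50, 40, 30, 20]
n50, l50 = nx_lx(contigs_kb)
print(f"N50 = {n50} kb, L50 = {l50} contigs")
```

Changing x gives N10, N90 and so on, which is the same as reading a different height off the cumulative plot.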
Bridging the Gaps / Blank
Note: where I displayed blank slides in the presentation, I flipped back to a live demonstration of a sequencing run.
I've attended every yearly London Calling conference since they began in 2015.
It's a great experience, but an expensive one. For those that do attend in person, Oxford Nanopore has given us a little something extra in our swag bag to compensate for the expense. For the first conference, it was a new and improved version of the MinION sequencer. The last conference gave us an early-access taste of the higher-efficiency Series D flow cells.
Oxford Nanopore do a great job with making the conference accessible to others who can't attend in person. All previous talks are publicly available on the nanopore website, and I've found it useful to see them myself to watch something I had to miss on the day, or revisit a talk that I really enjoyed.
One of the talks that stuck in my head was Dr. Karen Miga's presentation in 2017 on the linear assembly of the human Y centromere.
Bridging the Gaps / 1
Just in case you weren't aware, the human reference genome isn't complete. Every human chromosome in our current reference genome has a multi-megabase region that was impossible to sequence using existing sequencing technology.
Dr. Miga and her group at UCSC used a set of 9 bacterial artificial chromosomes (BACs) of over 100kb that were known to span the centromere of the Y chromosome. The circular chromosomes were linearised, adapter-tailed, then sequenced on a MinION, with read N50s of over 100kb; many of the reads spanned the full length of the target BAC. Reads were aligned and polished to generate a consensus sequence for each array. With the help of Illumina sequencing, they were able to correct errors, and assemble the corrected sequences from different BAC arrays into a 346kb centromeric region.
https://londoncallingconf.co.uk/lc/2017-plenary#217670414
http://dx.doi.org/10.1038/nbt.4109
Bridging the Gaps / 2
https://doi.org/10.1038/nbt.4060
About a year later, Dr. Miga was involved in the formation of an interdisciplinary research team that has been tasked with using nanopore reads to make the human genome better. The nanopore whole-genome-sequencing consortium have released long and ultra-long read nanopore datasets (both fastq and fast5) for the human cell line NA12878, and have been updating their called read set as base calling software improves. They have also created a draft assembly of the human genome, including an unbroken span across the HLA region, which is regarded as an incredibly challenging region to sequence using Illumina technology.
But they're not stopping there. Dr. Miga is now part of the telomere-to-telomere consortium, who have sequenced the CHM13hTERT human cell line using the Nanopore GridION, and have supplemented that data with 10X, BioNano, HiC, and PacBio. Dr. Miga and Adam Phillippy gave talks at the Advances in Genome Biology and Technology meeting this year, demonstrating that telomere-to-telomere assemblies for all human chromosomes are now within reach. They got to three assembled contigs for the X chromosome, and with her manual curation of the data, were able to turn that three into one: a single contig for a single X chromosome.
bit.ly/PhillippySlides
A Single Ultra-long Read
One of the things I really like about nanopore sequencing is how much information you can get out of a single read. You may remember my first slide of this talk, where I spent a while talking about basecalling a read, and digging down a bit into why it was different from the reference. That was a small 50bp fragment from one of the first reads I'd ever sequenced on the MinION.
This here is what the nanopore community has been calling a long read, or possibly bordering on an ultra-long read. This came from a flow cell sequenced as part of the nanopore whole-genome consortium experimentation with an ultra-long-read protocol, flow cell FAF15586, sequenced in Birmingham on the 8th of March 2017.
What I'm showing here is a dotplot of this single 250kb read mapped to itself. To people who aren't familiar with them, dotplots are fairly unintuitive, but I'm starting with this visualisation because it's what people used to mapping are more likely to understand.
I found that it was quite difficult to use this to explain to other people what was going on. They kept asking difficult questions about it, like, "why is there a diagonal line down the middle of the image?" and, "why are the repeating bits a triangle, or a square?" There was something about the idea of comparing things to themselves that was unintuitive.
Profile Plot of an Ultra-long Read
So I had another think about the data that I was trying to represent. Was there a better way to represent it, one that made the locations of repetitive features more obvious? What I came up with was this: the location on a sequence is represented on the X axis, and the distance between repetitive features is represented on the Y axis in a log scale. No box shapes to be seen, and it helps to hide the idea that there's a central spine of a reference sequence that represents the ideal path through a sequence.
So now, of course, because this is a completely unfamiliar plot to everyone, I get asked even more questions, although usually they're just variants of "what do the colours mean?"
Reverse complement pairs appear as funnel-shaped things in orange, and repetitive blocks appear as sliced hills in maroon/purple, or maybe a ripple pattern of a sunset on water. And there are a few AT and other microsatellites that pop up in the very small scale which appear as greenish-yellow blobs down the bottom.
I find it interesting that, among all this diversity in repetitive sequence, there don't seem to be any instances where the complement of a sequence is found. I expect it's going to take a few more years to work out why that is the case.
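One way to collect the points for a plot like this is to remember where each short subsequence (k-mer) was last seen, and record the distance whenever it, its reverse complement, or its complement turns up again. The sketch below is an illustration under that assumption and is not necessarily how the plots on the slides were made; the function name, the k-mer size and the example sequence are all invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def repeat_profile(sequence, k=12):
    """Yield (position, distance, kind) for each k-mer that echoes an earlier one."""
    last_seen = {}  # k-mer -> most recent start position
    for pos in range(len(sequence) - k + 1):
        kmer = sequence[pos:pos + k]
        variants = {
            "repeat": kmer,
            "reverse complement": kmer.translate(COMPLEMENT)[::-1],
            "complement": kmer.translate(COMPLEMENT),
        }
        for kind, query in variants.items():
            if query in last_seen:
                yield pos, pos - last_seen[query], kind
        last_seen[kmer] = pos  # record after the lookups, so a k-mer never matches itself

# Tiny example (hypothetical sequence): AAAC appears again 7 bases later.
for hit in repeat_profile("AAACGGGAAAC", k=4):
    print(hit)
```

Plotting position on the X axis, the distance on a log-scaled Y axis, and colouring by kind would give the three classes of feature discussed above.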
Semicircular Plot of an Ultra-long Read
Thinking about sunset on water, I had a go at modifying this visualisation to make it a bit more like an actual sunset. This is representing the same information as the profile plot, but in a form that I consider to be more visually appealing.
In any case, this particular sequence has a very interesting feature right at its start: a tandem repeat array where the unit size of the tandem repeat is about 40kb, and the entire array covers at least 180kb.
Those lengths are completely out of the ballpark of what any other sequencer would be able to discover. Even with a 10kb read length, another sequencer wouldn't even be able to determine that this region was repetitive, let alone identify a unique flanking region for anchoring. Yet here we are, with a long nanopore read that's nonchalantly eating a massive tandem array for breakfast.
For clarification, at 450 bases per second, 250kb works out to a little under 10 minutes, which is about how long I might take to eat my breakfast.
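That breakfast estimate is a one-line calculation; the sketch below just spells it out, using the 450 bases per second quoted above (the function name is made up).

```python
SEQUENCING_SPEED = 450  # bases per second, the figure quoted in the talk

def minutes_in_pore(read_length_bases, speed=SEQUENCING_SPEED):
    """Time for one read to pass through a pore, in minutes."""
    return read_length_bases / speed / 60

print(f"{minutes_in_pore(250_000):.1f} minutes")
```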
Why Use Nanopore? / Blank
I can see for myself that nanopore sequencing is a disruptive technology. My own curiosity drives me to find ways in which nanopore sequencing can do things that no other sequencer can do, and those are frequently quite different from what others ask me to do with the sequencer. As another example, the flow cell that I'm sequencing upside-down today flew down here in my hand luggage yesterday.
Sequencing In Space
But that's not the first time a nanopore sequencer has flown, and not the highest by a long shot. In 2016, the Oxford Nanopore MinION was launched into space, and Dr. Kate Rubins was the first person to demonstrate that DNA could be sequenced in microgravity.
This is interesting, because here on Earth, we believe that loading beads help with the sequencing process. But Dr. Rubins has shown us it works at 0 G, so it can't be gravity that's helping out there. Around the same time, Matthias Maurer also showed that sequencing works in a high-pressure diving bell at the bottom of the ocean. Even more interesting to me was that the replicated sequencing runs actually performed slightly better in space than on Earth.
https://www.ncbi.nlm.nih.gov/pubmed/29269933
The first tests of nanopore sequencing in space were carried out on existing known genomes. They sequenced bacteriophage lambda, E. coli, and Mus musculus, three common model organisms with well-annotated and understood genomes.
I find it interesting looking at their comparison of the MinION-sequenced reads and the Illumina reads and seeing an abundance of unmatched reads in the MinION runs that are absent from the Illumina runs. This is something of a curse and a blessing for nanopore sequencing: it tells you almost everything, warts and all.
Viral Sequencing
I worked with some researchers at ESR: Jing Wang, Nicole Moore and Richard Hall, who were in the MinION Early Access Program, and enlisted my help in adding the final touches to their paper on sequencing an Influenza genome. The different components were PCR-amplified from cDNA transcripts, and it was nice to see that proportional transcript coverage was a lot more consistent across the length of transcripts when compared to a similar extraction followed by Illumina sequencing. Also, we had reasonably high consensus concordance with Illumina sequencing; over 99% for all transcripts, and 100% agreement for two of the eight transcripts. This was in 2015 using R7.3 flow cells. Even now, I still have arguments with people about the terrible accuracy of nanopore sequencing.
De-novo Assembly
Around the same time, I was able to encourage Jodie Chandler and Graham Le Gros at the Malaghan Institute to let me have a crack at assembling the genome of the rodent parasite, Nippostrongylus brasiliensis. We didn't get enough sequence for the whole genome, but I was able to find some sequences that looked suspiciously like mitochondrial DNA. Jodie and I put in a lot of effort in properly annotating the mitochondrial genome, including identifying the start and stop sites of all encoded genes, and even working out the likely codons for tRNA based on matches to existing databases. Jodie collected gene sequences from other nematode species for the construction of a phylogenetic tree, and I dug deeper into the nanopore data to produce an electrical consensus sequence for the mitochondrial genome.
That work was a great springboard for our next de-novo assembly paper, where we enlisted the help of Oxford Nanopore Technologies to generate enough reads on R9.4 flow cells to assemble the whole genome of that same parasite. I don't really want to spend time on talking about that, because it's still an unfinished story. I'm not completely happy with that assembly. I think that with another single MinION run with the new ligation kit, we could probably fix up most of the issues with the genome assembly graphs.
Disease Outbreaks
Something that gives me confidence that the MinION will work in dealing with complex genome graphs is the work that Dr. Una Ren and Dr. Jenny Draper have been doing at ESR in assembling bacterial genomes and plasmids for outbreak investigations. At the top here you can see a spaghetti ball of an assembled genome, derived from MiSeq reads, and on the bottom is the nice clean assembly derived from MinION reads. Library preparation for these two assemblies was carried out on the same initial DNA sample.
In the future, Dr. Ren and Dr. Draper hope to carry out more metagenomic work for culture-independent testing and diagnosis. They're also interested in looking beyond the basic sequence, finding out more about the use of methylation in bacterial genomes.
Professor Kat Holt
If you have any interest in nanopore sequencing and, in particular, how it is useful for the investigation of infectious disease, you're probably already familiar with Professor Kat Holt's lab.
For example, two of her postdocs, Dr. Margaret Lam and Dr. Kelly Wyres, have been leading research into virulence and resistance genes of Klebsiella. Nanopore sequencing was used to resolve the plasmid vectors, which is essential to understanding the evolutionary pathways and public health risks that these elements pose. Dr. Wyres looked at the genomic epidemiology of Klebsiella in 7 Asian countries. She used nanopore sequencing to show that in most of these cases there was convergence of resistance and virulence genes in the same plasmid vector.
Professor Holt's lab manager, Dr. Louise Judd, has been doing the wet lab work, and has developed and shared her protocols for bacterial DNA extraction and sequencing. Dr. Judd has also been generating the raw data for Ryan Wick, helping him to develop and improve PoreChop, Unicycler, Bandage, and also for the lab's exhaustive overview of the behaviour and effectiveness of different nanopore basecallers.
Transcript Sequencing / Blank
Oxford Nanopore accepted our institute into the MinION Early Access programme in 2014. I was so enthusiastic about the MinION sequencer that I rushed around for a few weeks trying to get as many abstracts as possible. In the end we submitted five of them, and someone from Oxford Nanopore told me afterwards that they pretty much picked one at random from our institute because it didn't make sense to send us more than one sequencer.
Transcript Sequencing / 1
As I've said, Carole Grasso and I were comparing mouse transcript expression in cells with and without mitochondrial DNA. There are a lot of mitochondrial fragments in the genome, so I wanted to have a quick look to make sure there was no funny business going on with ghost genes being expressed in cells that had no DNA in their mitochondria.
I did a whole-transcriptome analysis. Reassuringly there was no mitochondrial expression in that case, but I had a peek at the rest of the genome, and found a number of genes that appeared to be switched on when mitochondrial DNA was removed. We're still investigating this stuff: the cDNA that was sequenced on this flow cell a couple of weeks ago is hopefully the last of what we need to firm up our data and get a good publication out.
Chimeric Reads
https://f1000research.com/articles/6-631/v2
Ruby White and Olivier Lamiable from Professor Franca Ronchese's group at the Malaghan Institute were interested in looking at mouse interferon gene copies using nanopore sequencing. The interferon A genes are found near each other in a genome cassette and are closely related, with a single primer set being able to amplify transcripts from 14 different genes within the same family. Interestingly, there are other interferon B, K, and E gene classes that do not have the same large diversity. We wanted to see if the MinION could distinguish between the different isoforms, despite their substantial sequence identity.
So, primer sets were created for the interferon A family, which has gene copies, and interferon B, which doesn't, and also beta actin as an additional control.
But then something interesting happened when we looked at the first few sequences, which completely changed the course of our research. We put one read into BLAST, and found that it mapped to both the beta actin gene, and an interferon A gene. This led to follow-up experimentation and additional controls, in which we discovered that the joining of the two sequences was happening during sample preparation.
If you're interested in excluding chimeric reads from your own data, then PoreChop from Professor Holt's lab works pretty well at identifying and optionally excluding these reads. And if you want to get more of them, do an overnight ligation.
I've got an image here that shows what was happening with the physical sequence, which has been mapped approximately to the raw signal to demonstrate all the corroborating parts of the read. This was back when nanopore used hairpins in their sequencing kits with an easily-recognised signal, so it was a bit easier to do a mapping by eye.
I considered this chimeric read finding to be a pretty big deal, particularly in light of the index switching saga around Illumina reads that was happening at the same time, and it was a great demonstration of the benefits of a warts-and-all sequencing approach.
Visualising a Nanopore Read / 1
We've seen chimeric reads in our other barcoded data. If you look hard enough, you can even find them in situations where it's very unlikely that they arose during sample preparation. Like this, for example, once again from our most recent mouse cDNA run.
I like looking at fastq files, but it can be quite tricky to see the entirety of a single read, particularly if it's more than 300 bp long. I've spent a lot of my time inspecting sequences, and I wanted to find some way to visualise a sequence in a single image, regardless of its length. This here is the output of an annotation script I wrote up last week in preparation for this talk, showing a chimeric nanopore read from Carole Grasso's last cDNA sequencing run. I've been previously making things like this manually in a word processor, but I figured it was time to make things a little bit easier for me.
This looks great for short sequences, but I run into issues with really long sequences.
Visualising a Nanopore Read / 2
My initial thoughts on how to improve long sequence visualisation were along a cartesian path, thinking about arrays and rectangular matrices. But I realised there were limitations to that approach. Especially when there's repetition in a sequence, it often doesn't make sense to decide on a start and end point for each repeat unit.
So what I came up with was a spiral representation of a DNA sequence, starting in the middle, and spiralling outwards.
This sequence doesn't really have any repetitive elements in it, but you might recall I talked about one from the nanopore whole-genome consortium a few slides back...
Visualising an Ultra-long Read / 1
And here it is. This is that same 250kb sequence, showing a squished representation of the actual sequence, with the spiral arrangement organised to emphasise any repetitive elements that have a unit length of around 40kb. These repetitive bits can be identified as rays of similar patterns going from the fifth inner ring inwards to the centre. The repetitive array finishes about two thirds of the way through the sequence, beyond which the rings stop looking like the ones further in.
I also used this spiral visualisation to look at individual bases of the longest read that's been sequenced so far on a nanopore sequencer, 2.3 megabases, zooming through the sequence at a rate of 20kb per second, or about fifty times the speed at which it's sequenced by the nanopores [see https://doi.org/10.5281/zenodo.1254317].
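The idea of lining repeats up as rays can be sketched with a simple Archimedean spiral, where one turn of the spiral holds one repeat unit. This is a minimal illustration under that assumption and is not the code behind the slides; the function name and the ring_spacing parameter are invented.

```python
import math

def spiral_coordinates(position, unit_length, ring_spacing=1.0):
    """Map a base position to (x, y) on a spiral with one repeat unit per turn."""
    turns = position / unit_length   # whole part = ring number, fraction = angle
    angle = 2 * math.pi * turns
    radius = ring_spacing * turns    # radius grows steadily, starting at the centre
    return radius * math.cos(angle), radius * math.sin(angle)

# Two positions exactly one 40kb unit apart land on the same ray, one ring apart.
print(spiral_coordinates(10_000, 40_000))
print(spiral_coordinates(50_000, 40_000))
```

With the unit length set to about 40kb, bases that are one repeat unit apart share an angle, so a tandem array shows up as similar patterns stacked along a ray. A different unit length would smear those rays into curves.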
Novel Variant Discovery / Blank
I started my talk with things that the MinION found tricky when we first started out. I expect that homopolymer sequences are always going to be challenging when the time resolution of the sequencer is not sufficient to distinguish between where one base stops and the next one starts. Then again, maybe the bases don't travel through the pore in a half-ladder form, and are actually physically overlapping, which would mean that we wouldn't ever be able to tease apart a single base in the electrical signal.
However, it did seem to have potential in the area of variant analysis, particularly when there's an insertion or deletion of more than a couple of bases.
Novel Variant Discovery
https://doi.org/10.1002/mgg3.564
I'd like to highlight some work that was recently published by Melissa Leija-Salazar in the area of variant analysis.
She used the MinION to detect clinically-relevant heterozygous variants in the GBA gene. Defects in this gene have an established clinical link to a lysosomal storage disorder called Gaucher disease. GBA variants have also been found to be a significant risk factor for Parkinson's disease.
Starting with an amplified 9kb genomic region, Melissa's research group was able to detect a 55bp deletion in a recombinant allele from a Parkinson’s patient, which was missed by other sequencing methods. Using MinION reads, they were able to detect all previously known coding missense mutations at the correct zygosity, and with some help from Nanopolish and NanoOK, were able to exclude most false variants that appeared to be the result of systematic base calling errors. In total, they processed 85 samples for downstream analysis, and encountered one instance of a false negative, which was possibly as a result of failed PCR primer binding to the recombinant region.
In summary, I share Melissa's view that Nanopore sequencing is a versatile, cheap technology, and a suitable platform for sequencing, even of difficult regions, both in the diagnostic and in the research environments.
https://twitter.com/Mel_Salazar_PD/status/1107733088140967938
Acknowledgements
I've got a bit more to say, but first a few acknowledgements. Thanks to nanopore users both in public spaces, and more privately within the nanopore community. In spite of my occasional rants, Oxford Nanopore has been very supportive of my comments, and in supporting me to continue with my Nanopore research.
I'd also like to put in a special thanks to Dr. Laura Boykin, for reminding me to keep trying to do better with regards to diversity, and to Jo Stanton and the Genetics Otago Team for bringing me to Otago University for the first time.
Lastly, I'd like to acknowledge the Health Research Council and the Cancer Society of New Zealand, who have funded the mitochondrial genome project that I've been part of at the Malaghan Institute of Medical Research. That funding has helped me stay afloat while doing primarily voluntary work for interdisciplinary projects around the world.
If you're interested in any of my visualisation scripts, have a look at the things that I've most recently modified in my bioinformatics scripts repository on gitlab.
Diversity [blank]
So... I want to talk about an elephant in the room. It's uncomfortable for me to talk about this. I'm not the right person to say these things, but maybe I can inspire the right people to say it again a little louder, and maybe I can help people like me to hear a little bit better.
Oxford Nanopore has mentioned that there are over 6,000 MinION devices out and about in over 80 countries. With those numbers, there's also a great geographical and social diversity in the people that are using nanopore sequencing. We should be embracing this diversity, because different perspectives give research its life and purpose.
There are people in the Middle East trying to get nanopore sequencing working for them. I've heard stories about sequencing on the slopes of a volcano in Tanzania, and elsewhere in Africa: in Kenya, in Uganda, and in Nigeria. It's been used in Brazil. It's been used in Antarctica, in space, at the bottom of the ocean... and it's also been used by a few groups in Australia and New Zealand.
Most of these people are struggling, partly because this technology stuff is new and improving at a breakneck pace, but also partly because the systems that have been set up haven't been designed with all that diversity in mind. They have been built on foundations that support a certain group of people: people with money and power.
These foundations can be replaced, but that replacement is going to need a lot of support. And to stop us from sinking into the mud, that support *has* to be there before the existing foundations can be knocked away.
To be frank, I'm not *just* talking about nanopore sequencing here. This philosophy is encapsulated in the best projects of the free and open source software community, and can extend out into other parts of our life. But let's start small, within our little community, because starting small is quick, easy, and will help us to see that it's going to work.
Please don't hide your own struggles because you think they're not good enough, or because you think that competition is healthy, and we need to do everything in isolation. Please don't hide your failures; we need information on both successes and failures in order to learn. If you know someone who's struggling, please help to amplify their voice. If you don't know anyone like that, ask around and find someone to share the joys and pain of learning together with you. If you do this, we can help build the foundations for a better community, and discover the benefits that a strong, evolving diversity can bring to our world.
Thanks for listening. Kia Ora.
I brought a script with me today, because I'm prone to rambling when I get nervous.
A Rhapsody of Errors
My first ever accepted conference talk (ever) was a talk on nanopore sequencing at the New Zealand Next-Generation Sequencing conference in 2015. I was following right after the keynote speaker, who was talking about sepsis and how nanopore sequencing could help with getting the right antibiotics to a patient as fast as possible.
So, let's get this out of the way first. I was talking about errors in nanopore reads in my talk in 2015, and why I didn't think they would be a big problem for most use cases, even at the really high error rate we saw with the first batch of flow cells.
https://f1000research.com/slides/4-1224
I had, and still have, a lot of belief in the capability of nanopore sequencing. Oxford Nanopore has on occasion mentioned that the electrical sensor is better than the chemistry, and the chemistry is better than the software, and as far as I'm aware that is still essentially true. With a strong background in computer science and a little bit of mathematics, my biggest belief is in the capability for the software to improve, even now. One slide I put up in that first talk is this one here, which shows the base call of a single read, recalled on three separate occasions. In this case I wanted to highlight a particularly tricky region, an 8-base homopolymer in the mitochondrial genome, and I found it quite interesting how much the basecalled sequences changed over the course of a year.
I'd like to think that people have moved on from talking about errors, but they haven't. The single-base error rate is pretty much the only thing remaining that gives people a reason they can cling to; a reason for justifying the use of other technologies. But that single-base error rate is a software problem. It will keep getting better. Regardless of that, for most use-cases where nanopore reads are compared against existing sequences, they're perfectly fine as they are.
Next Next Generation Sequencing
In that first talk, I introduced my own ideas about sequencer generations. I see sequencing technology as being currently split up into three technological advances:
The first generation is Sanger sequencing. Typically gel-based sequencing done by tagging sequences and counting their lengths.
The second generation is the one that I expect most of you are familiar with. DNA bases are individually tagged (or otherwise identified), and the action of base incorporation is recorded during DNA synthesis.
The third generation is the new one, where sequences are observed at a single molecule scale as they progress through the sequencer.
I've tried to be a bit careful with my definitions here because I think it's reasonable to say that some sequencers straddle two generations. If I consider PacBio systems, for example, their sequencers combine synthesis of known bases with some recording of sequence dynamics during base incorporation.
Model-free Sequencing
In October 2014, if you had asked me for one word to describe the sequencer that we had just started testing, I would have said "potential". Give me three words, and I would have expanded that to "potentially disruptive technology".
The most important piece of advice I think I can give about nanopore sequencing is this: just try it.
One of the problems with disruptive technologies is that they have uses that we don't know about yet. I'm not so interested in what Nanopore sequencing can do that every other sequencer can already do. What interests and excites me is the potential for nanopore to bring forth a whole raft of completely unexpected uses. In many cases, we have no support for these things, no foundation, no precedent. If you've got an idea about what you want to do with nanopore sequencing, then... sorry, I probably can't give you any advice on whether or not it would work.
But just try it anyway. I'd love to hear about what you think nanopore devices can do. Compared to other high-throughput sequencing methods, nanopore sequencing has a very low capital cost, and it's also pretty competitively priced in terms of minimum run cost. You're unlikely to blow your budget by giving nanopore sequencing a go, particularly if you've been able to borrow someone else's flongle flow cells for the project (as part of a joint academic collaboration, of course).
Even though it feels like a long time when doing your first experiments -- navigating the frustration of extra reagents and the demands of Oxford Nanopore's Support team -- it's really not that long when considering that a completely new thing is being learnt. And that surge of accomplishment when you've prepared your own sequencing library, carefully loaded it onto a flow cell, and the bloody thing actually works?
Ah, that's such a wonderful feeling.
Julie Blommaert
Nanopore sequencing is helping us make genetic discoveries that are not possible with other technologies.
I've had a brief chat on Twitter with Julie Blommaert, who is currently assembling the genomes of a few different microscopic rotifer species. She created assemblies that are relatively complete in terms of gene content, but the bigger genomes were less contiguous. The reason behind this loss of contiguity was only visible after specifically looking for transposable elements and repetitive DNA, using a bioinformatics pipeline called dnaPipeTE. So she knows that these elements are there, but not *where* they are in the genome.
Julie tried PacBio sequencing on these rotifer genomes, and was able to patch up a few of the repetitive holes, but there are still some missing pieces. She hopes to use nanopore sequencing in the future to identify and place really long repeats, as well as other unassembled repetitive regions from her genomes. The long queues of service centres are a bit of a drag, so it'll be great to be able to do her sequencing in-house and get instant results.
Read Count
I want to spend a few minutes going through one of the first things that I do with a sequencing run. This data comes from a 2-day cDNA sequencing run I did a couple of weeks ago with Carole Grasso at the Malaghan Institute, looking for genes that are up and down regulated when we fiddle with the mitochondrial DNA content of cells. It just so happens that the cDNA used to create this plot is still sitting inside the flowcell that I'm trying to run here.
This here is a read frequency plot, essentially a line graph version of a histogram. Oxford Nanopore does tend to hide this data in the graphs they output, but I think it can help to highlight one of the biggest issues with sequencing yield, that the smaller sequences get the lion's share of the sequencing time.
I've got two hypotheses around why this happens. The first is that smaller reads are harder to filter out during sample prep, and harder to correct for when calculating sample molarity for loading onto the sequencer. It's somewhat surprising that just one nanogram of 150bp reads makes up 10-20% of the recommended 50-100 fmol loading of a flow cell.
My second hypothesis is that smaller sequences move faster through the flow cell. They have less mass, which means they're quicker to react when the pores open up to accept more sequences. A short read can run halfway across a sequencing well while a long read is still putting its boots on.
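The molarity figure for short fragments can be checked with a quick back-of-the-envelope calculation. This is a rough sketch only: it assumes double-stranded DNA with an average mass of about 650 g/mol per base pair (a common approximation, not a number from this talk), and the function name `ng_to_fmol` is made up for the example.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA; an approximate, assumed value

def ng_to_fmol(mass_ng, length_bp, bp_mass=AVG_BP_MASS):
    """Convert a mass of equal-length dsDNA fragments into femtomoles."""
    molar_mass = length_bp * bp_mass           # g/mol for one whole fragment
    # ng -> g is a factor of 1e-9 and mol -> fmol is 1e15, so 1e6 overall
    return mass_ng * 1e6 / molar_mass

short_fmol = ng_to_fmol(1, 150)                # one nanogram of 150 bp fragments
for target in (50, 100):                       # recommended loading range, in fmol
    share = 100 * short_fmol / target
    print(f"{short_fmol:.1f} fmol is {share:.0f}% of a {target} fmol load")
```

The same nanogram of 10kb fragments works out to well under 1 fmol under the same assumption, which is why a small mass of short fragments can have such a large effect on the number of molecules competing for pores.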
Base Count
These numbers are multiplied by the sequence length to produce a plot of the number of sequenced bases with a given sequence length. The MinKNOW interface was my first experience with seeing a plot like this, and I've found it to be quite a useful representation, particularly for genomic DNA, because MinION reads typically have a more bell-shaped normal distribution under this transformation (except when sequencing transcripts). It also represents things that are actually useful for assembling DNA. In this case, with the cDNA reads, I can see that transcripts appear to be generally well represented throughout the range of sequence lengths.
Sequenced Base Density
These base counts are divided by the total number of sequenced bases to produce a plot of the proportion of sequenced bases with a given length.
However, I much prefer the look of it when I take that data, put it into a matrix, then plot the matrix on a black background with a splash of orange colour.
Digital Electrophoresis
I call this "digital electrophoresis", because it looks like the image plot that you might get when running your samples out on a gel.
It's not a perfect match, but I've found it quite surprising in hindsight when comparing the gel plots to the digital electrophoresis and seeing that weird stuff in the reads is also present in the gels. In this case there's a slightly stronger banding of short sequences in BC09 compared to BC07, which is picked up in both the gel image, and the simulated image based on sequenced reads.
Cumulative Base Density
Going further in the processing of sequence data, we can convert the density plot into a cumulative frequency plot by adding each total cumulatively to the previous total, starting with the longest sequences.
This produces a much smoother graph, largely independent of the nature of the input reads, and makes it a little bit easier for me to pick out general trends.
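The chain of transformations from the last few slides (read count, base count, base density, cumulative base density) can be sketched in a few lines. This is a simplified illustration, not the script behind the plots: it tallies exact read lengths rather than binning them, and the function name `length_profiles` is my own invention for this example.

```python
from collections import Counter

def length_profiles(read_lengths):
    """Return one row per distinct read length, longest first.

    Each row is (length, read_count, base_count, base_density,
    cumulative_base_density). A real plot would bin the lengths first;
    exact lengths are used here to keep the sketch short.
    """
    counts = Counter(read_lengths)
    total_bases = sum(read_lengths)
    rows = []
    cumulative = 0.0
    for length in sorted(counts, reverse=True):   # start with the longest reads
        read_count = counts[length]
        base_count = length * read_count          # read count times length
        density = base_count / total_bases        # share of all sequenced bases
        cumulative += density                     # running total from the long end
        rows.append((length, read_count, base_count, density, cumulative))
    return rows

for row in length_profiles([100, 100, 200, 600]):
    print(row)
```

In that toy example the two 100-base reads are half of the reads but only a fifth of the bases, which is the difference between the read count view and the base count view.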
N50 / L50
If you do any amount of nanopore sequencing and start talking about quality control of data, you'll probably hear 'N50' being talked about a lot. This statistic is one way of calculating the average length of sequences, but allowing for a lot of rubbish short sequences.
A question that you might ask about a sequence data set is, "what is the minimum number of sequences I need to capture half this dataset?" A supplementary question to that is, "what is the shortest sequence length in such a minimal dataset?" For long-read sequencing, we talk about read N50, which gives information about the length of reads in a sequencing run, and about genome N50, which gives information about the completeness of assembled contiguous sequences, also called contigs.
It turns out, the N50 value can be determined by reading off a cumulative frequency plot. By starting the cumulative frequency generation with the longest reads first, the N50 value can be read off the cumulative graph by looking at the point where 50% of the total bases have been added up. Ditto for N10, N90, and any other N you're interested in.
Through the usual process of making things in biology more confusing than they need to be, the N50 value actually refers to a length, and the complementary L50 value refers to a number. To avoid confusion, I prefer to always include a unit (kilobases, for example) and only give the length statistic, which is the most commonly used of the two values.
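Reading N50 off the cumulative plot is equivalent to the small calculation below. It is a minimal sketch using the definition given above (longest sequences first, stop once half of the total bases are covered); the function name `nx_lx` is hypothetical and not taken from any particular tool.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of
    sequences that covers `fraction` of all bases; Lx is how many
    sequences are in that set. fraction=0.5 gives N50 and L50.
    """
    ordered = sorted(lengths, reverse=True)    # longest sequences first
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no sequence lengths given")

n50, l50 = nx_lx([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(f"N50 = {n50} bases, L50 = {l50} sequences")   # N50 = 8 bases, L50 = 3 sequences
```

Changing `fraction` to 0.1 or 0.9 gives N10 and N90 in the same way.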
Bridging the Gaps / Blank
Note: where I displayed blank slides in the presentation, I flipped back to a live demonstration of a sequencing run.
I've attended every yearly London Calling conference since they began in 2015.
It's a great experience, but an expensive one. For those that do attend in person, Oxford Nanopore has given us a little something extra in our swag bag to compensate for the expense. For the first conference, it was a new and improved version of the MinION sequencer. The last conference gave us an early-access taste of the higher-efficiency Series D flow cells.
Oxford Nanopore do a great job with making the conference accessible to others who can't attend in person. All previous talks are publicly available on the nanopore website, and I've found it useful to see them myself to watch something I had to miss on the day, or revisit a talk that I really enjoyed.
One of the talks that stuck in my head was Dr. Karen Miga's presentation in 2017 on the linear assembly of the human Y centromere.
Bridging the Gaps / 1
Just in case you weren't aware, the human reference genome isn't complete. Every human chromosome in our current reference genome has a multi-megabase region that was impossible to sequence using existing sequencing technology.
Dr. Miga and her group at UCSC used a set of 9 bacterial artificial chromosomes (BACs) of over 100kb that were known to span the centromere of the Y chromosome. The circular chromosomes were linearised, adapter-tailed, then sequenced on a MinION, with read N50s of over 100kb; many of the reads were the full length of the target BAC. Reads were aligned and polished to generate a consensus sequence for each array. With the help of Illumina sequencing, they were able to correct errors, and assemble the corrected sequences from different BAC arrays into a 346kb centromeric region.
https://londoncallingconf.co.uk/lc/2017-plenary#217670414
http://dx.doi.org/10.1038/nbt.4109
Bridging the Gaps / 2
https://doi.org/10.1038/nbt.4060
About a year later, Dr. Miga was involved in the formation of an interdisciplinary research team that has been tasked with using nanopore reads to make the human genome better. The nanopore whole-genome-sequencing consortium have released long and ultra-long read nanopore datasets (both fastq and fast5) for the human cell line NA12878, and have been updating their called read set as base calling software improves. They have also created a draft assembly of the human genome, including an unbroken span across the HLA region, which is regarded as an incredibly challenging region to sequence using Illumina technology.
But they're not stopping there. Dr. Miga is now part of the telomere-to-telomere consortium, who have sequenced the CHM13hTERT human cell line using the Nanopore GridION, and have supplemented that data with 10X, BioNano, HiC, and PacBio. Dr. Miga and Adam Phillippy gave talks at the Advances in Genome Biology and Technology meeting this year, demonstrating that telomere-to-telomere assemblies for all human chromosomes are now within reach. They got to three assembled contigs for the X chromosome, and with her manual curation of the data, were able to turn that three into one: a single contig for a single X chromosome.
bit.ly/PhillippySlides
A Single Ultra-long Read
One of the things I really like about nanopore sequencing is how much information you can get out of a single read. You may remember my first slide of this talk, where I spent a while talking about basecalling a read, and digging down a bit into why it was different from the reference. That was a small 50bp fragment from one of the first reads I'd ever sequenced on the MinION.
This here is what the nanopore community has been calling a long read, or possibly bordering on an ultra-long read. This came from a flow cell sequenced as part of the nanopore whole-genome consortium experimentation with an ultra-long-read protocol, flow cell FAF15586, sequenced in Birmingham on the 8th of March 2017.
What I'm showing here is a dotplot of this single 250kb read mapped to itself. To people who aren't familiar with them, dotplots are fairly unintuitive, but I'm starting with this visualisation because it's what people used to mapping are more likely to understand.
I found that it was quite difficult to use this to explain to other people what was going on. They kept asking difficult questions about it, like, "why is there a diagonal line down the middle of the image?" and, "why are the repeating bits a triangle, or a square?" There was something about the idea of comparing things to themselves that was unintuitive.
Profile Plot of an Ultra-long Read
So I had another think about the data that I was trying to represent. Was there a better way to represent it, something that made the locations of repetitive features more obvious? What I came up with was this: the location on a sequence is represented on the X axis, and the distance between repetitive features is represented on the Y axis in a log scale. No box shapes to be seen, and it helps to hide the idea that there's a central spine of a reference sequence that represents the ideal path through a sequence.
So now, of course, because this is a completely unfamiliar plot to everyone, I get asked even more questions, although usually they're just variants of "what do the colours mean?"
Reverse complement pairs appear as funnel-shaped things in orange, and repetitive blocks appear as sliced hills in maroon/purple, or maybe a ripple pattern of a sunset on water. And there are a few AT and other microsatellites that pop up in the very small scale which appear as greenish-yellow blobs down the bottom.
I find it interesting, seeing all this diversity in repetitive sequence, that there don't seem to be any instances where the complement of a sequence is found. I expect it's going to take a few more years to work out why that is the case.
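For the curious, the points behind a profile plot like this can be approximated with a simple k-mer scan. This is a hedged sketch of the general idea rather than my actual script: it assumes exact k-mer matches, only looks back to the most recent earlier occurrence, and the names `repeat_distances` and `revcomp` are invented for the example.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (other characters pass through)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def repeat_distances(seq, k=4):
    """Return (position, distance, kind) points for repeated k-mers.

    position is where the k-mer starts (the X axis), distance is how far
    back the most recent matching k-mer started (the Y axis, normally
    drawn on a log scale), and kind is "repeat" or "revcomp".
    """
    last_seen = {}                             # k-mer -> most recent start position
    points = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if kmer in last_seen:
            points.append((pos, pos - last_seen[kmer], "repeat"))
        rc = revcomp(kmer)
        if rc != kmer and rc in last_seen:     # palindromes are counted as repeats only
            points.append((pos, pos - last_seen[rc], "revcomp"))
        last_seen[kmer] = pos
    return points

# A 5-base unit repeated 7 bases further along shows up at distance 7:
print(repeat_distances("AACCGTTAACCG", k=5))
```

With a tandem array, every k-mer in the second and later units reports roughly the same distance, which is what produces a flat band at the unit length. A much longer `k` than in this toy example would be needed for real reads, to keep chance matches under control.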
Semicircular Plot of an Ultra-long Read
Thinking about sunset on water, I had a go at modifying this visualisation to make it a bit more like an actual sunset. This is representing the same information as the profile plot, but in a form that I consider to be more visually appealing.
In any case, this particular sequence has a very interesting feature right at its start: a tandem repeat array where the unit size of the tandem repeat is about 40kb, and the entire array covers at least 180kb.
Those lengths are completely out of the ballpark of what any other sequencer would be able to discover. Even with a 10kb read length, another sequencer wouldn't even be able to determine that this region was repetitive, let alone identify a unique flanking region for anchoring. Yet here we are, with a long nanopore read that's nonchalantly eating a massive tandem array for breakfast.
For clarification, at 450 bases per second, 250kb works out to a little under 10 minutes, which is about how long I might take to eat my breakfast.
Why Use Nanopore? / Blank
I can see for myself that nanopore sequencing is a disruptive technology. My own curiosity drives me to find ways in which nanopore sequencing can do things that no other sequencer can do, and those are frequently quite different from what others ask me to do with the sequencer. As another example, the flow cell that I'm sequencing upside-down today flew down here in my hand luggage yesterday.
Sequencing In Space
But that's not the first time a nanopore sequencer has flown, and not the highest by a long shot. In 2016, the Oxford Nanopore MinION was launched into space, and Dr. Kate Rubins was the first person to demonstrate that DNA could be sequenced in microgravity.
This is interesting, because here on Earth, we believe that loading beads help with the sequencing process. But Dr. Rubins has shown us it works at 0 G, so it can't be gravity that's helping out there. Around the same time, Matthias Maurer also showed that sequencing works in a high-pressure diving bell at the bottom of the ocean. Even more interesting to me was that the replicated sequencing runs actually performed slightly better in space than on Earth.
https://www.ncbi.nlm.nih.gov/pubmed/29269933
The first tests of nanopore sequencing in space were carried out on existing known genomes. They sequenced bacteriophage lambda, E. coli, and Mus musculus, three common model organisms with well-annotated and understood genomes.
I find it interesting looking at their comparison of the MinION-sequenced reads and the Illumina reads and seeing an abundance of unmatched reads in the MinION runs that are absent from the Illumina runs. This is something of a curse and a blessing for nanopore sequencing: it tells you almost everything, warts and all.
Viral Sequencing
I worked with some researchers at ESR: Jing Wang, Nicole Moore and Richard Hall, who were in the MinION Early Access Program, and enlisted my help in adding the final touches to their paper on sequencing an Influenza genome. The different components were PCR-amplified from cDNA transcripts, and it was nice to see that proportional transcript coverage was a lot more consistent across the length of transcripts when compared to a similar extraction followed by Illumina sequencing. Also, we had reasonably high consensus concordance with Illumina sequencing; over 99% for all transcripts, and 100% agreement for two of the eight transcripts. This was in 2015 using R7.3 flow cells. Even now, I still have arguments with people about the terrible accuracy of nanopore sequencing.
De-novo Assembly
Around the same time, I was able to encourage Jodie Chandler and Graham Le Gros at the Malaghan Institute to let me have a crack at assembling the genome of the rodent parasite, Nippostrongylus brasiliensis. We didn't get enough sequence for the whole genome, but I was able to find some sequences that looked suspiciously like mitochondrial DNA. Jodie and I put in a lot of effort in properly annotating the mitochondrial genome, including identifying the start and stop sites of all encoded genes, and even working out the likely codons for tRNA based on matches to existing databases. Jodie collected gene sequences from other nematode species for the construction of a phylogenetic tree, and I dug deeper into the nanopore data to produce an electrical consensus sequence for the mitochondrial genome.
That work was a great springboard for our next de-novo assembly paper, where we enlisted the help of Oxford Nanopore Technologies to generate enough reads on R9.4 flow cells to assemble the whole genome of that same parasite. I don't really want to spend time on talking about that, because it's still an unfinished story. I'm not completely happy with that assembly. I think that with another single MinION run with the new ligation kit, we could probably fix up most of the issues with the genome assembly graphs.
Disease Outbreaks
Something that gives me confidence that the MinION will work in dealing with complex genome graphs is the work that Dr. Una Ren and Dr. Jenny Draper have been doing at ESR in assembling bacterial genomes and plasmids for outbreak investigations. At the top here you can see a spaghetti ball of an assembled genome, derived from MiSeq reads, and on the bottom is the nice clean assembly derived from MinION reads. Library preparation for these two assemblies was carried out on the same initial DNA sample.
In the future, Dr. Ren and Dr. Draper hope to carry out more metagenomic work for culture-independent testing and diagnosis. They're also interested in looking beyond the basic sequence, finding out more about the use of methylation in bacterial genomes.
Professor Kat Holt
If you have any interest in nanopore sequencing and, in particular, how it is useful for the investigation of infectious disease, you're probably already familiar with Professor Kat Holt's lab.
For example, two of her postdocs, Dr. Margaret Lam and Dr. Kelly Wyres, have been leading research into virulence and resistance genes of Klebsiella. Nanopore sequencing was used to resolve the plasmid vectors, which is essential to understanding the evolutionary pathways and public health risks that these elements pose. Dr. Wyres looked at the genomic epidemiology of Klebsiella in 7 Asian countries. She used nanopore sequencing to show that in most of these cases there was convergence of resistance & virulence genes in the same plasmid vector.
Professor Holt's lab manager, Dr. Louise Judd, has been doing the wet lab work, and has developed and shared her protocols for bacterial DNA extraction and sequencing. Dr. Judd has also been generating the raw data for Ryan Wick, helping him to develop and improve PoreChop, Unicycler, Bandage, and also for the lab's exhaustive overview of the behaviour and effectiveness of different nanopore basecallers.
Transcript Sequencing / Blank
Oxford Nanopore accepted our institute into the MinION Early Access programme in 2014. I was so enthusiastic about the MinION sequencer that I rushed around for a few weeks trying to get as many abstracts as possible. In the end we submitted five of them, and someone from Oxford Nanopore told me afterwards that they pretty much picked one at random from our institute because it didn't make sense to send us more than one sequencer.
Transcript Sequencing / 1
As I've said, Carole Grasso and I were comparing mouse transcript expression in cells with and without mitochondrial DNA. There are a lot of mitochondrial fragments in the genome, so I wanted to have a quick look to make sure there was no funny business going on with ghost genes being expressed in cells that had no DNA in their mitochondria.
I did a whole-transcriptome analysis. Reassuringly there was no mitochondrial expression in that case, but I had a peek at the rest of the genome, and found a number of genes that appeared to be switched on when mitochondrial DNA was removed. We're still investigating this stuff: the cDNA that was sequenced on this flow cell a couple of weeks ago is hopefully the last of what we need to firm up our data and get a good publication out.
Chimeric Reads
https://f1000research.com/articles/6-631/v2
Ruby White and Olivier Lamiable from Professor Franca Ronchese's group at the Malaghan Institute were interested in looking at mouse interferon gene copies using nanopore sequencing. The interferon A genes are found near each other in a genome cassette and are closely related, with a single primer set being able to amplify transcripts from 14 different genes within the same family. Interestingly, there are other Interferon B, K, and E gene classes that do not have the same large diversity. We wanted to see if the MinION could distinguish between the different isoforms, despite their substantial sequence identity.
So, primer sets were created for the interferon A family, which has gene copies, and interferon B, which doesn't, and also beta actin as an additional control.
But then something interesting happened when we looked at the first few sequences, which completely changed the course of our research. We put one read into BLAST, and found that it mapped to both the beta actin gene, and an interferon A gene. This led to follow-up experimentation and additional controls, in which we discovered that the joining of the two sequences was happening during sample preparation.
If you're interested in excluding chimeric reads from your own data, then Porechop from Professor Holt's lab works pretty well at identifying and optionally excluding these reads. And if you want to get more of them, do an overnight ligation.
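The general idea of spotting a chimeric read can be sketched in a few lines of Python: look for an adapter sequence sitting in the middle of a read rather than at its ends, and split the read there. This is only a minimal illustration, not Porechop's actual algorithm. The function names, the `end_margin` parameter, and the adapter sequence in the usage example are all hypothetical, and the sketch assumes exact matching, which real nanopore reads (with their base-calling errors) would usually need an error-tolerant alignment to replace.

```python
def internal_adapter_hits(read, adapter, end_margin=50):
    """Return start positions of exact adapter matches that lie away from the read ends.

    Assumption: exact string matching; real data would need approximate matching.
    """
    hits = []
    start = read.find(adapter)
    while start != -1:
        end = start + len(adapter)
        # Only count adapters that sit well inside the read
        if start >= end_margin and end <= len(read) - end_margin:
            hits.append(start)
        # Continue searching after this match (non-overlapping)
        start = read.find(adapter, end)
    return hits


def split_chimera(read, adapter, end_margin=50):
    """Split a read at each internal adapter; a non-chimeric read comes back as one piece."""
    pieces = []
    prev = 0
    for hit in internal_adapter_hits(read, adapter, end_margin):
        pieces.append(read[prev:hit])
        prev = hit + len(adapter)
    pieces.append(read[prev:])
    return pieces


if __name__ == "__main__":
    adapter = "ACGTACGTAC"  # hypothetical adapter, for illustration only
    chimera = "A" * 60 + adapter + "C" * 60
    print(internal_adapter_hits(chimera, adapter))
    print([len(piece) for piece in split_chimera(chimera, adapter)])
```

An adapter found at the very start or end of a read is expected and is ignored here; only an adapter in the interior suggests that two molecules were joined together.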
I've got an image here that shows what was happening with the physical sequence, which has been mapped approximately to the raw signal to demonstrate all the corroborating parts of the read. This was back when nanopore used hairpins in their sequencing kits with an easily-recognised signal, so it was a bit easier to do a mapping by eye.
I considered this chimeric read finding to be a pretty big deal, particularly in light of the index switching saga around Illumina reads that was happening at the same time, and it was a great demonstration of the benefits of a warts-and-all sequencing approach.
Visualising a Nanopore Read / 1
We've seen chimeric reads in our other barcoded data. If you look hard enough, you can even find it in situations where it's very unlikely to have happened during sample preparation. Like this, for example, once again from our most recent mouse cDNA run.
I like looking at fastq files, but it can be quite tricky to see the entirety of a single read, particularly if it's more than 300 bp long. I've spent a lot of my time inspecting sequences, and I wanted to find some way to visualise a sequence in a single image, regardless of its length. This here is the output of an annotation script I wrote up last week in preparation for this talk, showing a chimeric nanopore read from Carole Grasso's last cDNA sequencing run. I'd previously been making things like this manually in a word processor, but I figured it was time to make things a little bit easier for myself.
This looks great for short sequences, but I run into issues with really long sequences.
Visualising a Nanopore Read / 2
My initial thoughts on how to improve long sequence visualisation were along a cartesian path, thinking about arrays and rectangular matrices. But I realised there were limitations to that approach. Especially when there's repetition in a sequence, it often doesn't make sense to decide on a start and end point for each repeat unit.
So what I came up with was a spiral representation of a DNA sequence, starting in the middle, and spiralling outwards.
This sequence doesn't really have any repetitive elements in it, but you might recall I talked about one from the nanopore whole-genome consortium a few slides back...
Visualising an Ultra-long Read / 1
And here it is. This is that same 250kb sequence, showing a squished representation of the actual sequence, with the spiral arrangement organised to emphasise any repetitive elements that have a unit length of around 40kb. These repetitive bits can be identified as rays of similar patterns going from the fifth inner ring inwards to the centre. The repetitive array finishes about two thirds of the way through the sequence, beyond which the rings stop looking like the ones further in.
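To show why repeats line up as rays in a layout like this, here is a minimal Python sketch of one way to place bases on a spiral. It is not the script used for the slide. It assumes a simple Archimedean spiral in which one full turn holds exactly one repeat unit, so bases that are one unit apart share the same angle and sit on neighbouring rings. The function name and the `ring_gap` and `r0` parameters are hypothetical.

```python
import math


def spiral_coords(n_bases, unit_len, ring_gap=1.0, r0=1.0):
    """Place bases 0..n_bases-1 on an Archimedean spiral, starting near the centre.

    One full turn holds `unit_len` bases, so base i and base i + unit_len
    have the same angle and differ in radius by `ring_gap`.
    """
    coords = []
    for i in range(n_bases):
        turns = i / unit_len            # how many times we have gone around
        theta = 2 * math.pi * turns     # angle in radians
        r = r0 + ring_gap * turns       # radius grows steadily with each turn
        coords.append((r * math.cos(theta), r * math.sin(theta)))
    return coords


if __name__ == "__main__":
    # Scaled-down stand-in for a 250kb read with a 40kb repeat unit:
    # 250 positions, 40 per turn, giving 6.25 turns in total.
    coords = spiral_coords(250, 40)
    for i in (5, 45, 85):
        x, y = coords[i]
        print(i, round(math.degrees(math.atan2(y, x)), 3), round(math.hypot(x, y), 3))
```

Positions 5, 45, and 85 come out at the same angle on successive rings, which is the effect that turns a tandem repeat into a ray. Choosing a `unit_len` that does not match the repeat length would make the repeats drift around the spiral instead.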
I also used this spiral visualisation to look at individual bases of the longest read that's been sequenced so far on a nanopore sequencer, 2.3 megabases, zooming through the sequence at a rate of 20kb per second, or about fifty times the speed at which it's sequenced by the nanopores [see https://doi.org/10.5281/zenodo.1254317].
Novel Variant Discovery [blank]
I started my talk with things that the MinION found tricky when we first started out. I expect that homopolymer sequences are always going to be challenging when the time resolution of the sequencer is not sufficient to distinguish between where one base stops and the next one starts. Then again, maybe the bases don't travel through the pore in a half-ladder form, and are actually physically overlapping, which would mean that we wouldn't ever be able to tease apart a single base in the electrical signal.
However, it did seem to have potential in the area of variant analysis, particularly when there's an insertion or deletion of more than a couple of bases.
Novel Variant Discovery
https://doi.org/10.1002/mgg3.564
I'd like to highlight some work that was recently published by Melissa Leija-Salazar in the area of variant analysis.
She used the MinION to detect clinically-relevant heterozygous variants in the GBA gene. Defects in this gene have an established clinical link to a lysosomal storage disorder called Gaucher disease. GBA variants have also been found to be a significant risk factor for Parkinson's disease.
Starting with an amplified 9kb genomic region, Melissa's research group was able to detect a 55bp deletion in a recombinant allele from a Parkinson’s patient, which was missed by other sequencing methods. Using MinION reads, they were able to detect all previously known coding missense mutations at the correct zygosity, and with some help from Nanopolish and NanoOK, were able to exclude most false variants that appeared to be the result of systematic base calling errors. In total, they processed 85 samples for downstream analysis, and encountered one instance of a false negative, possibly as a result of failed PCR primer binding to the recombinant region.
In summary, I share Melissa's view that Nanopore sequencing is a versatile, cheap technology, and a suitable platform for sequencing, even of difficult regions, both in the diagnostic and in the research environments.
https://twitter.com/Mel_Salazar_PD/status/1107733088140967938
Acknowledgements
I've got a bit more to say, but first a few acknowledgements. Thanks to nanopore users both in public spaces, and more privately within the nanopore community. In spite of my occasional rants, Oxford Nanopore has been very supportive of my comments, and has supported me in continuing my Nanopore research.
I'd also like to put in a special thanks to Dr. Laura Boykin, for reminding me to keep trying to do better with regards to diversity, and to Jo Stanton and the Genetics Otago Team for bringing me to Otago University for the first time.
Lastly, I'd like to acknowledge the Health Research Council and the Cancer Society of New Zealand, who have funded the mitochondrial genome project that I've been part of at the Malaghan Institute of Medical Research. That funding has helped me stay afloat while doing primarily voluntary work for interdisciplinary projects around the world.
If you're interested in any of my visualisation scripts, have a look at the things that I've most recently modified in my bioinformatics scripts repository on gitlab.
Diversity [blank]
So... I want to talk about an elephant in the room. It's uncomfortable for me to talk about this. I'm not the right person to say these things, but maybe I can inspire the right people to say it again a little louder, and maybe I can help people like me to hear a little bit better.
Oxford Nanopore has mentioned that there are over 6,000 MinION devices out and about in over 80 countries. With those numbers, there's also a great geographical and social diversity in the people that are using nanopore sequencing. We should be embracing this diversity, because different perspectives give research its life and purpose.
There are people in the Middle East trying to get nanopore sequencing working for them. I've heard stories about sequencing on the slopes of a volcano in Tanzania, and elsewhere in Africa: in Kenya, in Uganda, and in Nigeria. It's been used in Brazil. It's been used in Antarctica, in space, at the bottom of the ocean... and it's also been used by a few groups in Australia and New Zealand.
Most of these people are struggling, partly because this technology stuff is new and improving at a breakneck pace, but also partly because the systems that have been set up haven't been designed with all that diversity in mind. They have been built on foundations that support a certain group of people: people with money and power.
These foundations can be replaced, but that replacement is going to need a lot of support. And to stop us from sinking into the mud, that support *has* to be there before the existing foundations can be knocked away.
To be frank, I'm not *just* talking about nanopore sequencing here. This philosophy is encapsulated in the best projects of the free and open source software community, and can extend out into other parts of our life. But let's start small, within our little community, because starting small is quick, easy, and will help us to see that it's going to work.
Please don't hide your own struggles because you think it's not good enough, or because you think that competition is healthy, and we need to do everything in isolation. Please don't hide your failures; we need information on both successes and failures in order to learn. If you know someone who's struggling, please help to amplify their voice. If you don't know anyone like that, ask around and find someone to share the joys and pain of learning together with you. If you do this, we can help build the foundations for a better community, and discover the benefits that a strong, evolving diversity can bring to our world.
Thanks for listening. Kia Ora.