A rhapsody of errors: sequence exploration with nanopores
[version 1; not peer reviewed]
Flow cells and reagents were received from Oxford Nanopore at a reduced cost as part of the MinION Access program.
The Next Next Generation
By next generation sequencing, I of course refer to next next
generation sequencing, or third generation sequencing to use a
slightly better phrase. I see sequencing technology as being currently
split up into three technological advances:
The first generation is Sanger sequencing: typically gel-based
sequencing done by tagging sequences and counting their lengths.
The second generation I hope most of you are familiar with, given that
this is a next generation sequencing conference. DNA bases are
individually tagged (or otherwise identified), and the action of base
incorporation is recorded during DNA synthesis.
The third generation is the new one, where sequences are observed at a
single molecule scale as they progress through the sequencer.
I'll be talking about a pure third-generation sequencer, but I'm being
a bit careful with my definitions here because I think it's reasonable
to say that some sequencers straddle two generations. These sequencers
combine synthesis of known bases (in other words, de-known
synthesis) with some recording of sequence dynamics during base
incorporation.
From Cells to Cell Metabolism
As a bioinformatician heavily on the data processing side of things, I
don't really pay much attention to the biology side of things unless
it's relevant to what I'm doing. In this case, one of the staff at the
Malaghan Institute, Mike Berridge, was having trouble getting a paper
published that talked about mitochondrial transfer from skin cells to
cancer cells. Reviewers of his paper wanted to see some sequencing
results to confirm what they had seen experimentally.
I saw an opportunity for an abstract to test out this new sequencing
technology through sequencing mitochondrial DNA, which I figured would
be perfect for the device. It's a small genome with a length pretty
close to the sweet spot for third-generation sequencing, so even with
fairly low yields I expected we'd get reasonable results from
sequencing.
In my haste to get us sequencing something using the new
technology, I glossed over the fact that the most important region for
genotyping happened to be an 8 base-pair homopolymer region in the
mitochondrial genome. This was the only known location where our
cancer and skin cells had a different mitochondrial genotype.
Random Error
Unfortunately, this meant our genotyping had a single point of
failure, and this point happened to be prone to failure. We had a
great run with high genome coverage, and a quick turnaround time from
sample loading to genotyping, but the area around the homopolymer
region was rubbish.
This is an alignment plot showing deletions in blue and substitutions
in red. Any insertions appear as chartreuse at the top. This
particular region has about half of the bases incorrect, which is high
enough that it might even be random accuracy, rather than a true
representation of the sequence.
Third Generation Sequencers
So in October 2014, if you had asked me for one word to describe the
sequencer that we were testing, I would have said, "potential".
Give me three words, and I would have expanded that to, "potentially
disruptive technology".
Just in case you were asleep when the title of this session was
announced, the device we were testing was in fact the Oxford Nanopore
MinION. Before I continue, could I please have a show of hands to help
me to understand what people know already about the MinION?
- Who already has a MinION in their lab?
- Who has been so eager to know about the technology that they've already seen my poster, or read one of the summaries I've posted on the Internet?
- Who knows nothing about the MinION, and came here to see what it's all about?
Great, thanks. I've actually brought with me one of the sequencing devices to show around. This is one that had a hardware failure and
was replaced, but I've been dragging my feet about getting it sent
back because it's so useful for show and tell. Please make sure it
gets back to me by the end of this session, so that I don't have to
make up a story about how it was eaten by a dragon while in
Palmerston North.
Flowing by Hand
As the first speaker talking about the MinION, I'll give you a very
brief introduction about the sequencer.
The MinION is pretty tiny, and is powered from a USB3 port and a
pipette.
What we have here on this slide is the consumable hardware of the
sequencer; the sequencing flow cell. There's an entry port underneath
a cover here which you pipette your sample into, an array of
sequencing pores that accept sequence, and an outlet channel for
holding about 2 ml of waste.
Drawing Pores
Here's a bioinformatician's impression of what happens at the site of
sequencing. DNA is drawn into the pores (shown here in orange) via a
small voltage gradient, and a motor protein (shown here in crimson)
untwists the DNA and makes sure it passes through the pore at the
right speed. As each base moves through the pore, the current across
the pore changes ever so slightly, and this current is sampled at
5,000 times per second by the sequencer.
A Nebulous Calling
Hmm, I think I recognise the guy in this photo.
This shows the sequencing and base-calling setup, with the sequencer,
the laptop, and the network port circled, because they're pretty hard
to see otherwise (or maybe it's just my eyes).
Basecalling is currently done all in the cloud, using one of Oxford
Nanopore's programs that runs on an Amazon Elastic Compute
instance. The computer records the signals, converts them into events
that hopefully describe a single base transition, and then uploads the
event data to the internet where it is converted into a standard FASTQ
sequence.
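For anyone who hasn't looked inside a FASTQ file, here is a minimal sketch of reading one record. It assumes the common four-line layout and Phred+33 quality encoding; the function name and the example read are made up for illustration and are not MinION output.

```python
def parse_fastq_record(record):
    """Parse one four-line FASTQ record into (name, sequence, qualities).

    Assumes Phred+33 quality encoding, so each quality character maps
    to the score ord(character) - 33.
    """
    header, sequence, separator, quality = record.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, [ord(char) - 33 for char in quality]


# Hypothetical record, for illustration only
example = "@read_1\nACGT\n+\nI5+!\n"
name, sequence, qualities = parse_fastq_record(example)
```

A Phred score of Q corresponds to an estimated error probability of 10^(-Q/10), so the scores in this made-up record run from very confident down to a coin toss among the bases.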
Polymeric Cryptography
If you recall a few slides previously, you'll remember that we had a
bit of a problem with getting good FASTQ sequence out of the
MinION. This graph shows why, and why it is such a hard problem to
fix.
The base calling program used by Oxford Nanopore works on a five-base
model. Each block of consistent current is considered to represent the
sequence at five adjacent bases. Once you enter into a homopolymer
region of sufficient length, the distinction between one event and the
next is lost, and everything starts to look the same. You just get a
longer sequence of the same event.
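A toy sketch of that problem, assuming an idealised five-base window and a detector that only notices when the window changes. This is not Oxford Nanopore's base caller, and the two sequences are invented: one has an 8-base homopolymer and the other a 9-base homopolymer.

```python
def kmers(sequence, k=5):
    """Return the overlapping k-base windows of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def collapse_repeats(windows):
    """Merge runs of identical adjacent windows into a single window.

    This mimics an idealised detector that only sees a new event when
    the bases inside the window change.
    """
    collapsed = []
    for window in windows:
        if not collapsed or collapsed[-1] != window:
            collapsed.append(window)
    return collapsed


# Hypothetical sequences differing by one base in the homopolymer
short_run = "ACGT" + "C" * 8 + "TGCA"
long_run = "ACGT" + "C" * 9 + "TGCA"

short_events = collapse_repeats(kmers(short_run))
long_events = collapse_repeats(kmers(long_run))
```

Under these assumptions both sequences collapse to the same list of windows, so the only remaining clue to the homopolymer length is how long the "CCCCC" event lasts.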
So that's the theoretical model, but during this fateful run, I
decided to set a flag to capture the raw signal on the MinION as it
was sequencing.
Friendly Error
Here's what I saw across that region. The time per base change is not
consistent, which is why base calling is not as easy as it might first
appear. While most of the base changes seem to be quite obvious, it
doesn't actually look much like the theoretical model, and the
homopolymer region is surprisingly not at a consistent signal level.
But the thing that struck me about this signal was the
noise. Excluding a few pops and clicks, it was pretty close to an
ideal model of Gaussian noise. This demonstrated to me what the chief
technology officer of Oxford Nanopore kept saying, that the sensors on
their flow cells were extremely sensitive, and it would be a long time
before the chemistry and software caught up to that sensitivity.
An Eventful Sequence
What's shown on this graph is what happens when the signal has gone
through Oxford Nanopore's event calling software on the computer,
where it tries to find out where the base switching has happened. It's
done a pretty good job at finding events, although there are possibly
a couple of places where it has picked up two base changes as a single
base. These places are less evident on the raw trace, suggesting that
the sequences have the occasional tendency to zip through much faster
than expected, resulting in deleted bases in the output.
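To give a feel for what event calling involves, here is a deliberately naive sketch. It is not Oxford Nanopore's event caller: it starts a new event whenever a sample jumps away from the running mean of the current event by more than a fixed threshold. The threshold value and the trace are invented for illustration.

```python
from statistics import mean


def call_events(samples, threshold=5.0):
    """Split a current trace into (start, length, mean_level) events.

    A new event begins when a sample differs from the mean of the
    current event by more than `threshold` (an arbitrary toy value).
    """
    events = []
    start = 0
    for i in range(1, len(samples) + 1):
        at_end = i == len(samples)
        if at_end or abs(samples[i] - mean(samples[start:i])) > threshold:
            events.append((start, i - start, mean(samples[start:i])))
            start = i
    return events


# Hypothetical trace with three current levels
trace = [60.1, 59.8, 60.3, 72.0, 71.6, 72.3, 71.9, 55.2, 54.9, 55.4]
events = call_events(trace)
event_spans = [(start, length) for start, length, _ in events]
```

At 5,000 samples per second, dividing an event's length by 5,000 gives its duration in seconds. A base that zips through in only one or two samples leaves very little signal for a detector like this to work with, which is consistent with the deletions described above.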
Precision and Recall
And, as it happens, that's what is dominating errors in the
base-called FASTQ files. What you're looking at here is an identical
event sequence that has been base-called on three different occasions
over the past year. First when it was sequenced in October 2014, again
when I was doing my genotyping bit for the mitochondrial paper
submission in November 2014, and again a couple of weeks ago for this
conference.
This demonstrates the value in keeping your data available, and makes
fairly clear one of the benefits of MinION sequencing, that it is
possible to re-call sequences to account for changes in the base
calling model as the model improves over time.
The eagle-eyed among you may notice that the homopolymer region in the
middle no longer has any deletions in it, which indicates to me that
it might just be possible to properly type this region in the future.
You Can Call Me 'A'
Everything that I've shown you up to this point has been data from a
single read that has come out of the MinION. This is a graph showing
what happens when you pile up multiple sequences together and look at
individual base frequencies at each location. We got 16 reads across
this region that had the best chance of being correct. The reference
sequence is shown with open circles around dots.
The base with the highest proportion matches the reference here at
every location except in the middle of the homopolymer region, where
there are a couple of deletions called instead of the actual
sequence. What we were expecting was a 1 bp insertion, rather than a
deletion.
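The pile-up calculation itself is simple. Here is a minimal sketch, assuming the reads have already been aligned to the same coordinates and padded with "-" where a base is deleted. The five reads are made up, and insertions are ignored to keep the example short.

```python
from collections import Counter


def base_frequencies(aligned_reads):
    """Return one {symbol: proportion} dict per alignment column.

    `aligned_reads` is a list of equal-length strings in which "-"
    marks a deleted base.
    """
    frequencies = []
    for column in zip(*aligned_reads):
        counts = Counter(column)
        frequencies.append(
            {symbol: n / len(column) for symbol, n in counts.items()}
        )
    return frequencies


def consensus(aligned_reads):
    """Pick the most frequent symbol in each column."""
    return "".join(
        max(freqs, key=freqs.get) for freqs in base_frequencies(aligned_reads)
    )


# Hypothetical aligned reads across a short homopolymer
reads = ["ACCCCT", "ACC-CT", "AC--CT", "ACC-CT", "ATCCCT"]
called = consensus(reads)
```

In this invented pile-up the deletion symbol wins one column inside the homopolymer, which is the same kind of outcome seen in the middle of the real region.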
If this homopolymer region were the only region we were able to
genotype with the MinION, I'd have to reject this as a useless
technology. However, through looking at the entire mitochondrial
genome, we were able to pick up a few other completely novel mutations
that were also supported by an IonTorrent run we did on the same
samples.
Forward To The Past
As it happens, for genotyping the mitochondria we needed to use all
three generations of sequencing technology to get an accurate
representation of the mitochondrial genome. The MinION data was key in
helping to troubleshoot a failure we had with a complete loss of
coverage in the IonTorrent sequencing, and was actually used for
confirmatory genotyping of novel variants. And, over that homopolymer
region, our best results were still produced using Sanger sequencing.
If you're thirsty for more MinION publications, I've been asked if I
could please mention another paper I helped out with, where an
influenza genome was assembled from nanopore reads via overlap
consensus to produce single contigs per gene and greater than 99%
accuracy.
Acknowledgements
I've got a few people to thank for helping me out in my exploration of
the MinION, and you can see them here. Oxford Nanopore have been
wonderful in supporting the critique and exploration of their own
devices, and have an amazing community of people to share our joys and
sufferings.
But I want to put out an especially big thank you to Graham Le Gros
and his lovely team of researchers for helping me to carry out my own
independent studies at the Malaghan Institute of Medical Research.