Chimeric Reads and Where to Find Them
[version 1; not peer reviewed]
No competing interests were disclosed.
Greetings to you all
Kurahaupo is the canoe
Tapuae-o-uenuku is the mountain
Wairau is the river
Rangitāne is the tribe
I am from Pōneke (Wellington)
My home is in Karori
I am a scientist at the Malaghan Institute of Medical Research
My name is Rawiri
Greetings, greetings, greetings to you all
Introduction
Hello, Kia ora, and thank you for being here to listen to my talk on chimeric reads. My name is David Eccles, and for the last five years or so, I've been doing DNA sequencing work on the Oxford Nanopore MinION at the Malaghan Institute of Medical Research. In this talk, I'll discuss the consequences of a discovery we made when looking at the results from one of our sequencing runs a few years ago.
Barcoding DNA Sequences
When we prepare more than one DNA sample for sequencing on the MinION, we attach a known sequence of DNA to the sample, a bit like the address section of a postcard. We call these address sequences "DNA barcodes". Most sequencing systems require an additional adapter to be added, which acts a little bit like a postage stamp: no adapter, no sequencing. As long as we are able to read the postcard, no matter how much the individual cards get jumbled up, the barcodes mean that they can always be assigned to the right origin or destination.
At least, that's what's meant to happen. Usually it happens this way, but not always.
Sometimes when you get one thing sequenced, and later on open up your folder full of DNA sequences (we usually call them "reads") that are labelled with a particular barcode, you can find another thing entirely.
Nanopore Sequencing
Oxford Nanopore have created a range of devices that can detect and sequence very, very long molecules. To put a number to that statement, the sequencers typically produce reads of tens to hundreds of thousands of DNA bases, with the longest being a bit over two million bases. While the output of these devices varies by orders of magnitude, the fundamental method of action is identical.
For those who don't know about the technology, nanopore sequencing involves the translocation of a long polymer through a protein pore with the help of an electric current, detecting changes in translocation speed as it goes through. The changes are represented as change in electric current over time.
The speed of translocation usually depends on the size and shape of the thing going through the pore, so the current can be used as a proxy for electrophysical properties. By understanding how that shape changes with different DNA bases, it's possible to work backwards from the electrical current trace to the bases. After one read finishes, another can feed into the nanopore, resulting in a very fast, observational method of DNA sequencing.
The reads that come out of the nanopore sequencing devices are electrical traces of the entire DNA sequence... whether you want it or not.
What We Tried To Create
This presentation is based around a paper that we published a couple of years ago.
https://f1000research.com/articles/6-631/v2
Ruby and Olivier from Professor Franca Ronchese's group at the Malaghan Institute were interested in looking at mouse interferon gene copies using nanopore sequencing. The interferon A genes are found near each other in a genome cassette and are closely related; a single primer set is able to amplify transcripts from 14 different genes within the same family. Interestingly, there are other interferon gene classes (B, K, and E) that do not have the same large diversity. Our initial interest was in trying to work out if the MinION reads would be able to distinguish between the different isoforms, despite their substantial sequence similarity.
So, PCR primer sets were created for the interferon A family, which has multiple gene copies, and interferon B, which doesn't, and also beta actin as an additional control.
During sample preparation, sample barcodes were enzymatically joined (or ligated) to the amplified sequences. These barcodes then had additional adapter sequences ligated to them that allowed the sequencing to be carried out, as well as a DNA hairpin that allowed the forward and reverse-complement strands to be sequenced in tandem.
How To Find Chimeric Reads
Email: 2016-Jul-25
But then something interesting happened when we looked at the first few sequences, which changed the course of our research. I sent the first set of results to my colleagues, which was a table of counts indicating the number of reads matching the transcripts we expected to see. Some of these numbers didn't make sense, but I wasn't really paying attention to the numbers I was producing. Luckily my colleagues were paying attention, and gave me enough of a mental slap to wake me out of my data-robot trance and explore the reads in more depth.
We put one read into a genetic search engine called BLAST, and found that it matched to both the beta actin gene, and an interferon A gene. This was... unexpected.
These were genes that appeared nowhere near each other in the genome, and somehow ended up right next to each other in one of the reads that came off the sequencer. What we had accidentally encountered was a chimeric read; that is, a single DNA sequence originating from at least two separate origins. I want to give credit to my colleagues, Ruby and Olivier, at the Malaghan Institute, and to Professor Ronchese for having an interest in looking deeper into this to understand what was going on.
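The BLAST observation above can be turned into a rough screen over a whole run. What follows is a minimal sketch, not the method used in our paper: it assumes you already have a table of alignment hits for each read from whatever search tool you prefer, and it flags reads where mostly separate stretches of the read match different references. The `Hit` record, the `max_overlap` threshold, and the coordinates in the example are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment hit (hypothetical record layout)."""
    read_id: str
    target: str  # name of the matched reference sequence
    start: int   # 0-based start of the match on the read
    end: int     # end of the match on the read (exclusive)


def find_chimeric_reads(hits, max_overlap=20):
    """Return IDs of reads whose hits to different targets sit in
    mostly separate regions of the read."""
    by_read = defaultdict(list)
    for hit in hits:
        by_read[hit.read_id].append(hit)

    chimeric = set()
    for read_id, read_hits in by_read.items():
        read_hits.sort(key=lambda h: h.start)
        for i, first in enumerate(read_hits):
            for second in read_hits[i + 1:]:
                # Hits are sorted by start, so second.start >= first.start.
                overlap = min(first.end, second.end) - second.start
                if first.target != second.target and overlap <= max_overlap:
                    chimeric.add(read_id)
    return chimeric


# Illustrative hits with invented coordinates.
example_hits = [
    Hit("read1", "Actb", 30, 480),    # beta actin at the start of the read...
    Hit("read1", "Ifna4", 520, 1050), # ...followed by an interferon A gene
    Hit("read2", "Ifna4", 25, 560),   # ordinary single-amplicon read
    Hit("read3", "Ifna2", 30, 560),   # two similar genes matching the SAME
    Hit("read3", "Ifna4", 32, 558),   # region: ambiguous, but not chimeric
]
print(find_chimeric_reads(example_hits))
```

The overlap test matters for a gene family like interferon A: a single amplicon will often match several closely related genes over the same stretch of the read, and that should not be counted as a chimera.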
How To Find Chimeric Reads
This led to follow-up experimentation and additional controls, in which we discovered that some, but not all, of the joining of the sequences was happening during sample preparation.
PCR reactions were done in separate tubes. We used two separate barcodes for each of the amplicons that we had amplified. After that, one set of barcodes was pooled together and used for subsequent processing via a 10-minute joining (or ligation) step, and the other set of barcodes was processed through a slower overnight ligation in the fridge. These separate quick and overnight ligations were cleaned up to remove excess enzymes, then combined together and run on the MinION.
After sequencing, most of the reads looked fine, but we found a few barcodes that had been mixed up in the sequencing process. We also found amplicons with no barcode that were nevertheless sequenced, joined to another, different amplicon. Multi-barcode chimeric reads were being formed during all of the ligation steps, but they were also being created during the sequencing itself, in the software, where the sequencer had trouble telling where one sequence stopped and the next one began.
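One simple way to look for the multi-barcode reads described here is to search each read for every barcode, rather than only checking the ends. The sketch below is a toy version under loud assumptions: the barcode sequences are invented, matching is exact and forward-strand only (real nanopore reads have enough errors that an error-tolerant aligner would be needed, and a hairpin read would also carry reverse-complemented barcodes), and `end_window` is an arbitrary cut-off for what counts as "near the end".

```python
# Made-up barcode sequences for illustration; not a real kit's barcodes.
BARCODES = {"BC01": "ACGTACGTTGCA", "BC02": "TTGGCCAATCGA"}


def barcode_hits(read, barcodes):
    """Return sorted (position, name) pairs for every exact barcode match."""
    hits = []
    for name, seq in barcodes.items():
        pos = read.find(seq)
        while pos != -1:
            hits.append((pos, name))
            pos = read.find(seq, pos + 1)
    return sorted(hits)


def classify_read(read, barcodes, end_window=50):
    """Label a read by where, and how many, barcodes were found in it."""
    hits = barcode_hits(read, barcodes)
    if not hits:
        return "unbarcoded"
    if len({name for _, name in hits}) > 1:
        return "multi-barcode"
    for pos, name in hits:
        barcode_end = pos + len(barcodes[name])
        # A barcode far from both ends suggests two molecules joined together.
        if pos >= end_window and barcode_end <= len(read) - end_window:
            return "internal-barcode"
    return "single-barcode"


# Toy reads: runs of a single base stand in for the amplicon sequence.
normal = BARCODES["BC01"] + "C" * 100
two_barcodes = BARCODES["BC01"] + "C" * 100 + BARCODES["BC02"] + "G" * 100
joined = "C" * 100 + BARCODES["BC01"] + "G" * 100

for label, read in [("normal", normal), ("two_barcodes", two_barcodes), ("joined", joined)]:
    print(label, classify_read(read, BARCODES))
```

The "internal-barcode" case is the one that corresponds to an unbarcoded amplicon being sequenced together with a different, barcoded one.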
What we found scared me enough for me to look for chimeric reads in almost all the sequencing runs I've done since.
What We Ended Up Creating
Here's an example of one of the reads that came off that exploratory run.
I talked face-to-face with the people from Oxford Nanopore who designed the sequencing kits, who claimed that this situation was not possible. While I was talking to them, something clicked in my head, and I realised that the hairpin adapter would pair up with itself under the right conditions. I realised the chimeric structure was probably formed during the ligation of hairpins and adapters to barcoded amplicons, joining two separate double-stranded molecules with a double-hairpin linker during the overnight ligation.
Where To Find Chimeric Reads
Some of our chimeric reads were created during sample preparation, which makes me think it's happening during the ligation steps for other sequencing platforms as well. Around the time we were writing up our chimeric reads paper, there was a lot of noise on Twitter about an issue on Illumina systems that has been called "index swapping", which I suspect is a symptom of chimeric reads.
We've seen chimeric reads in our other barcoded nanopore data. Even in our cDNA sequencing runs on nanopore, where we're using a special rapid attachment chemistry that doesn't involve any enzymatic ligation, and even when I exclude in-silico, software-based joining, I can still find chimeric reads that don't make biological sense.
From this, I've come to the conclusion that they must be forming spontaneously, or by magic. If anyone wants to take a look and see if they come to any other conclusion, feel free to let me know and I can direct you to a few of the more fantastic beasts that I've seen in our nanopore reads.
Why Does It Matter?
In 2018, after a few pre-prints from researchers concerned about the index swapping problem, Illumina put out a white paper to discuss their own internal observations, as well as kits to reduce the issue. Illumina now has a "Unique Dual Index" library preparation option that makes it easier to detect when samples are mixed up during sequencing, so at least they're doing something about it. But I don't get the feeling that sequencing companies are concerned about the problem. I found this text on Illumina's website yesterday:
While index hopping can occur, it has a limited effect on most
applications and background hopped reads can be filtered out as
noise. Typical levels of index hopping on patterned flow cell
systems range from 0.1–2% depending on the type, quality, and
handling of the library.
https://www.illumina.com/science/education/minimizing-index-hopping.html
I agree that the impact of chimeric reads depends on the application, but think that they will cause issues in a number of common situations, especially where a small number of mis-assigned reads will have an impact on the results of a study. This would be an issue where large amounts of one type of sequence are being mixed with small amounts of another type of sequence. I want to particularly highlight single-cell sequencing as an area of concern, where a few tens of thousands of reads are being assigned to each cell within sequencing runs sometimes containing billions of reads, so the impact of misclassification is quite high.
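To put rough numbers on the "large amounts mixed with small amounts" situation, here is a back-of-the-envelope sketch. It is a toy model and not anything Illumina has published: it assumes that a fixed fraction of every sample's reads (the `hop_rate`) is reassigned, and that those reads are spread evenly over the other samples in the pool. The read counts are invented; the only figures taken from the quote above are the 0.1% and 2% ends of the range.

```python
def contaminant_fraction(own_reads, other_reads, hop_rate, n_samples):
    """Fraction of the reads assigned to a sample that came from elsewhere.

    Toy model (an assumption): each read hops with probability hop_rate,
    and hopped reads land evenly on the other n_samples - 1 samples.
    """
    retained = own_reads * (1 - hop_rate)
    incoming = other_reads * hop_rate / (n_samples - 1)
    return incoming / (retained + incoming)


# Invented scenario: a rare sample with 10,000 reads is pooled
# with an abundant sample that has 990,000 reads.
for hop_rate in (0.001, 0.01, 0.02):
    fraction = contaminant_fraction(10_000, 990_000, hop_rate, n_samples=2)
    print(f"hop rate {hop_rate:.1%}: {fraction:.1%} of the rare sample's reads are foreign")
```

Under these assumptions, roughly 9% of the rare sample's reads are foreign at a 0.1% hop rate, about half at 1%, and about two thirds at 2%. A rate that looks like background noise for the whole run can dominate a low-input sample.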
How To Avoid Chimeric Reads
If you're interested in excluding chimeric reads from your own data, then I can give you a few suggestions. Most people are cash-strapped, so the simplest approach of loading single samples into each run isn't a realistic solution.
If you're sequencing on Illumina, then use unique dual-index barcodes. If you can get access to the raw data, it might also be possible to get statistics on index swapping, so you can find out how rare it is for your samples.
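As a sketch of how those statistics might be gathered: with unique dual indexes, every sample has its own i7 and its own i5 index, so a read carrying two known indexes in a combination that was never loaded is a likely swap. The function below is a hypothetical helper, not part of any Illumina software, and it assumes you have already extracted the (i7, i5) pair for each read from the raw data. The index names are placeholders.

```python
from collections import Counter


def tally_index_pairs(observed_pairs, expected_pairs):
    """Count reads with expected, swapped, or unknown index combinations."""
    expected = set(expected_pairs)
    known_i7 = {i7 for i7, _ in expected}
    known_i5 = {i5 for _, i5 in expected}

    counts = Counter()
    for i7, i5 in observed_pairs:
        if (i7, i5) in expected:
            counts["expected"] += 1
        elif i7 in known_i7 and i5 in known_i5:
            # Both indexes were loaded, but never together on one sample.
            counts["swapped"] += 1
        else:
            counts["unknown"] += 1
    return counts


# Placeholder index names standing in for real index sequences.
sample_sheet = [("i7_A", "i5_A"), ("i7_B", "i5_B")]
reads = [
    ("i7_A", "i5_A"),
    ("i7_A", "i5_A"),
    ("i7_B", "i5_B"),
    ("i7_A", "i5_B"),  # known indexes, unexpected combination
    ("i7_X", "i5_A"),  # index not on the sample sheet at all
]
counts = tally_index_pairs(reads, sample_sheet)
print(dict(counts))
print("swapped fraction:", counts["swapped"] / len(reads))
```

The swapped reads can then be thrown away rather than assigned to the wrong sample, which is not possible when samples share an index.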
Nanopore have a rapid adapter solution which can work for some applications, with a bit of a hit on the yield of sequencing runs. Porechop from Professor Kat Holt's lab works pretty well at identifying and optionally excluding chimeric nanopore reads. I've also written up my own protocol to remove chimeric reads in software, but should point out that I don't know of any methods that can get rid of chimeric reads entirely.
Another option is to ignore the possibility of chimeric reads, or accept the sequencing companies' claims that their occurrence is so rare that it won't make a difference.
But hopefully, you're not going to do that. I'm grateful to have had the opportunity to tell you that chimeric reads exist, and am hopeful that this release of information - this curious observation - will eventually lead to a better understanding of biology.
Acknowledgements
Thank you