Reflections on repetitive repeats with REPAVER
[version 1; not peer reviewed]
No competing interests were disclosed.
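Tēnā koutou katoa (Greetings to you all)
Ko Kurahaupo te waka (Kurahaupo is the canoe)
Ko Tapuae-o-uenuku te maunga (Tapuae-o-uenuku is the mountain)
Ko Wairau te awa (Wairau is the river)
Ko Rangitāne te iwi (Rangitāne is the tribe)
Nō Pōneke ahau (I am from Wellington)
Kei Karori tōku kāinga (My home is in Karori)
He kai pūtaiao ki Malaghan Institute of Medical Research ahau (I am a scientist at the Malaghan Institute of Medical Research)
Ko Rawiri tōku ingoa (My name is Rawiri)
Tēnā koutou, tēnā koutou, tēnā koutou katoa (Greetings, greetings, greetings to you all)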
Introduction
Hello, Kia ora, and thank you for being here to listen to me give a talk on repetitive sequences. My name is David Eccles, and I come from a little town in the middle of New Zealand called Wellington, which is famous for its wind, and infamous for its remoteness. For the last five years or so, I've been doing DNA sequencing on Oxford Nanopore Technologies' MinION sequencer at the Malaghan Institute of Medical Research. Part of my work at the Malaghan Institute involves using genetics to work out how cancer cells interact with mitochondria, but in *this* talk, I'm going to explore one of my favourite hobbies outside the work that I'm paid for.
I brought a script with me today, because I'm prone to rambling when I get nervous, and it helps in the future for other people who are looking at my slides.
This image is a panorama photo, taken on a sunny day when I got distracted by the scenery. This is showing, from left to right, the Wellington harbour, the circular Malaghan Institute building, and a sliver of Victoria University of Wellington.
Panorama Stitching
The image has been used frequently by the Malaghan Institute for publicity purposes. It's actually a crop of a full 360 degree panorama that I made from 16 source images using a program called Hugin.
Control Points
For those who aren't familiar with photo stitching, in order to assemble the panorama, you first identify control points for matching bits of different images. The ideal points are things that have high contrast and are easy to identify in other images. There's a bit of an art to getting a good panorama. You need a good eye for the repetitive structures that can cause problems, including the things that look locally unique, but are duplicated in completely different regions of the big picture. If you're lucky and have control of the camera, source images can be taken that make sure no picture sits entirely within repetitive regions, so that there's always a unique bit in each image that can be matched to neighbouring images.
Polar Mapping
Once you've got a 360 degree panorama, you can do a whole bunch of interesting things with it. One thing I love doing is mapping the horizontal panorama to a circular image. I was standing on the well of a mini and another car when I took these images, so there are bits missing from the centre. In any case, this circular mapping is a reasonably good way of visually representing a large horizontal expanse in a smaller area.
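As a rough illustration of that circular mapping, here is a minimal Python sketch. The function name `panorama_to_polar` and its parameters are hypothetical, and it assumes that panorama columns wrap around the full circle and that the bottom row of the panorama (the ground) lands at the centre of the circle. Real tools also handle interpolation and lens projection, which this ignores.

```python
import math

def panorama_to_polar(col, row, width, height, out_radius):
    """Map panorama pixel (col, row) to (x, y) in a circular image.

    Assumptions: col 0..width covers 360 degrees, row 0 is the top of the
    panorama, and the bottom row maps to the centre of the circle.
    """
    angle = 2 * math.pi * col / width         # horizontal position -> angle
    radius = out_radius * (1 - row / height)  # top row -> outer edge
    return (radius * math.cos(angle), radius * math.sin(angle))
```

Under these assumptions, anything missing from the bottom of the source images ends up as a hole in the middle of the circular picture.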
Nanopore Sequencing
But I'm not here to talk about panoramas; I'm here to talk about nanopore sequencing, specifically that carried out by Oxford Nanopore Technologies' MinION, GridION and PromethION devices. For those who don't know about the technology, nanopore sequencing involves the translocation of a long polymer through a hole, detecting changes in translocation speed as it goes through. These changes are represented as changes in electric current over time.
The speed of translocation usually depends on the size and shape of the thing going through the pore, so the current can be used as a proxy for electrophysical properties. By understanding how that shape changes with different DNA bases, it's possible to work backwards from the electrical current trace to the bases, resulting in a very fast, observational method of DNA sequencing.
Next Next Generation Sequencing
So let's take a step back here and consider how nanopore sequencing is more than a little bit different to other methods. I see sequencing technology as being currently split up into three technological advances:
The first generation is Sanger sequencing. This is typically gel-based sequencing, done by tagging sequences and counting their lengths.
The second generation is the one that I expect most of you are familiar with. DNA bases are individually tagged (or otherwise identified), and the action of base incorporation is recorded during DNA synthesis.
The third generation is the new one, where sequences are observed at a single molecule scale as they progress through the sequencer.
I've tried to be a bit careful with my definitions here because I think it's reasonable to say that some sequencers straddle two generations. If I consider PacBio systems, for example, their sequencers combine synthesis of known bases with some recording of sequence dynamics during base incorporation.
Nanopore sequencing has more than a few differences from other sequencing technologies beyond the fundamental, generational, differences that I've mentioned here. There are two that I want to mention briefly in passing. The first is that the sequencing pores are re-usable: soon after one DNA strand has gone through the pore, it can be re-loaded with another one; one pore will sequence thousands of strands over the course of a single run.
The second is that it can sequence *really long* strands of DNA, and that's what I'll be discussing in my talk today.
Mapping Illumina to Nanopore
About three years ago, I managed to encourage someone at our institute to prepare some DNA from the rodent parasite, Nippostrongylus brasiliensis. At that time, Oxford Nanopore only had a single sample preparation method that produced what they called a two-direction (or 2D) read. I got a few reads out of the sequencer, with what seemed to be fairly high quality, but nothing close to what I'd be able to use to assemble a genome. But I was curious about what I could find out from the reads I had. The Sanger Institute had previously produced a genome assembly for the same parasite using Illumina reads, so I did a few computational experiments involving mapping Illumina reads to nanopore sequences.
What I ended up getting when I looked at the mapping results in Tablet looked really strange to me. Instead of a fairly flat coverage across each read, I found quite a lot of cases where the Illumina coverage had a weird mountainous profile. The sequences themselves didn't look particularly suspicious at the base level, so I actually spent a bit of time on coding in an attempt to get rid of these visual eyesores.
Mapping Nanopore to Nanopore
Things got weirder when I tried mapping the nanopore reads to themselves. I was surprised to see that I got the same patchiness in coverage, but it was even more pronounced and hard-edged in the nanopore-to-nanopore mapping. Eyeballing the results in Tablet increased my discomfort, because in some places there seemed to be differences in the identity of mapped reads in high-coverage regions that were right next to each other.
To explain this further, I found a locally non-repetitive region (in other words, it wasn't a tandem repeat, it wasn't repeated near itself), the region was about a thousand bases in length, and that region was present in a lot of different places in the genome.
I got quite hopeful when I mentioned this on Twitter, and was reminded of the existence of transposons that hopped around the genome, and that there was a database about these things. Unfortunately that database didn't have this particular 1kb sequence from our rodent parasite, and I'm loath to categorise these things as transposons before demonstrating that they actually do move around the genome.
I was really confused, and realised that I should probably find some way to better determine what was going on, and how common this problem was. As full disclosure, I still haven't worked out why this was happening, so if you've got any ideas about this, please feel free to tell me.
Repeats - a Ubiquitous Problem / 1
Thus began my process of discovery. I'd started with a can of worms, so I thought it might be better to take a step back and try to understand something I was more familiar with, and that thing was tandem repeats.
Here's a sequence that came off a MinION sequencing device a bit over a year ago. It's a cDNA transcript encoding one of the ubiquitin genes in Mus musculus, and I've mapped the roughly 1kb linear sequence to a spiral, starting at the centre and going out to the edge of the image. Sorry about the shift in gear from genome assembly to cDNA transcripts; I'm easily distracted.
Ubiquitin is a protein composed of 76 amino acids. For those of you who are quick at maths, you might realise that this transcript is a lot longer than the 228 nucleotides that are needed to encode the protein. In fact, this transcript is actually a polyubiquitin precursor transcript, but you wouldn't know it from looking at this image.
Repeats - a Ubiquitous Problem / 2
However, if I adjust the number of bases per ring to a few more, up to 226, then the repetitive structure of the cDNA transcript becomes a lot more obvious. I notice very obvious bands of the same base sequences radiating out from the centre of the image.
Spiralling Out Of The Void
For those interested in the nuts and bolts of this spiral plot, here's my drawing code as it exists at the moment. I use a bit of calculus to work out how far along the spiral a particular location is, so that I can get a smooth transition between different loops while keeping the angle per base constant. I do most of my work on the command line, and frequently work with sequences from a few tens of bases to a few hundred kilobases, so I've got quite a lot of variables in there to try to make things look at least passable at all scales.
However, this spiral visualisation code doesn't have any way to guess the number of bases per ring. That's all down to the person creating the image. It's a hard problem, and one that can have multiple solutions.
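The drawing code from the slide is not reproduced in this transcript, so here is a simplified Python sketch of the layout idea only. It assumes an Archimedean spiral with a constant angle per base and a radius that grows linearly with the angle. The function name `spiral_coords` and its parameters are hypothetical, and it skips the arc-length calculation and the scaling variables mentioned above.

```python
import math

def spiral_coords(n_bases, bases_per_ring, ring_width=1.0, inner_radius=1.0):
    """Return an (x, y) position for each base along a spiral.

    Each base advances the angle by the same amount, so one full ring
    always holds exactly bases_per_ring bases.
    """
    coords = []
    for i in range(n_bases):
        turns = i / bases_per_ring                  # rings completed so far
        angle = 2 * math.pi * turns
        radius = inner_radius + ring_width * turns  # smooth outward drift
        coords.append((radius * math.cos(angle), radius * math.sin(angle)))
    return coords
```

In this layout, base i and base i + bases_per_ring sit on the same ray, one ring apart. When bases_per_ring matches the repeat unit length, identical bases stack up along rays, which is one way to picture the radial bands in the 226-base image.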
Repetitive Lyrics
So I'm going to take you through a bit of the process of development that led to me discovering that 226 number. It doesn't quite start with a song and a dance, but that'll do for now.
About two years ago a Reddit user, Frigorifico, had created and reported on a lyric visualisation tool called SongSim. You put in the lyrics of a song, and it will visualise the repetitive elements in the words. This tool was actually based on dotplots that are frequently used for DNA alignment.
Each dot in this matrix represents a comparison between two words in the lyrics of a song. Where there is a black dot, it means that the word is the same. In the example highlighted here, we can see that the 34th word in the lyrics is the same as the fifth word in the lyrics, and that word "lamb" appears in a few other places. There's a solid black line down the diagonal because each word is the same as itself.
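A word-level self-similarity matrix of this kind can be sketched in a few lines of Python. This is not SongSim's code; the function name `self_similarity` and the short lyric fragment are made up for illustration.

```python
def self_similarity(words):
    """Return a square matrix where cell [i][j] is True when word i equals word j."""
    cleaned = [w.lower().strip(".,!?;:") for w in words]
    return [[a == b for b in cleaned] for a in cleaned]

# Illustrative fragment only, not the full lyrics shown on the slide
lyrics = "Mary had a little lamb little lamb little lamb".split()
matrix = self_similarity(lyrics)

# Text rendering of the dot plot: '#' marks a matching pair of words
for row in matrix:
    print("".join("#" if cell else "." for cell in row))
```

The diagonal is always filled because every word matches itself, and repeated phrases show up as shorter lines parallel to the diagonal.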
The interesting thing I noticed about this approach is that it concentrates on the words of a song, rather than letters. I saw parallels between the words in songs, and multi-base sequences in DNA sequences, which we call kmers.
This multi-base approach is also used in a completely non-visual program called SATFIND, which looks for repetitive motifs in DNA. Taking these things as inspiration, I worked through my own method to numerically describe the nature of repetitiveness in a DNA sequence.
Discovering Repeats in DNA
I wanted to have a go at doing something similar to that in arbitrary DNA sequences, with lengths from a few hundred bases to a few million bases. I needed a method that was fast, but still able to capture essentially every repetitive pattern in a sequence.
This is the method that I've ended up with, where instead of words in English text, I'm picking out overlapping equal-length subsequences of DNA. By recording the location of repeated subsequences, I can generate statistics that relate to the size of repetitive regions. This is how I got that 226 number for ubiquitin. I just looked for the distances between repeated kmers, and picked the most common distance as my repeat length, or ring length.
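The idea of picking the most common distance between repeated kmers can be sketched as follows. This is a minimal Python illustration and not the REPAVER implementation: the function names are hypothetical, it only looks at forward-strand repeats, and it measures the distance from each kmer to its previous occurrence.

```python
from collections import Counter

def repeat_distances(seq, k=10):
    """Count distances between each kmer and its previous occurrence."""
    last_seen = {}
    distances = Counter()
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if kmer in last_seen:
            distances[pos - last_seen[kmer]] += 1
        last_seen[kmer] = pos  # remember the most recent position
    return distances

def guess_repeat_length(seq, k=10):
    """Return the most common repeat distance, or None if nothing repeats."""
    distances = repeat_distances(seq, k)
    if not distances:
        return None
    return distances.most_common(1)[0][0]
```

For a sequence built from five copies of a 20-base unit, `guess_repeat_length(unit * 5, k=5)` returns 20 as long as no 5-mer repeats within the unit itself. A noisy real read would give a spread of distances around the true unit length instead of a single clean value.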
When I'm able to generate statistics, I like to be able to visualise those statistics as well, so to start off, I converted these numbers back into the standard dot-plot format.
Rhyme & Readin'
The mouse cDNA ubiquitin transcript is quite a short sequence, as nanopore reads go. From a brief search on Google Scholar, it doesn't look like researchers have been too interested in doing mouse genomic DNA on the nanopore. However, I was able to find a fairly close match in the nanopore reads I've got lying around from the rodent parasite, Nippostrongylus brasiliensis. It didn't need to be ubiquitin for demonstrating the algorithm, but I figured I might as well add in that thread of continuity.
That's a dot-plot representation, but I found that it was quite difficult to use this to explain to other people what was going on. They kept asking difficult questions about it, like, "why is there a diagonal line down the middle of the image?" and, "why are the repeating bits a triangle, or a square?" There was something about the idea of comparing things to themselves that was unintuitive.
Reading From a Profile Perspective
So I had another think about the data that I was trying to represent. Was there a better way to represent something that made the locations of repetitive features more obvious? What I came up with was this: the location on a sequence is represented on the X axis, and the distance between features is represented on the Y axis in a log scale. No box shapes to be seen, and it helps to hide the idea that there's a central spine of a reference sequence that represents the ideal path through a sequence. Reverse complement pairs appear as funnel-shaped things, and repetitive blocks appear as sliced hills, or maybe a ripple pattern of a sunset on water.
As an aside, if I'm trying to model this as a water surface, then the Y axis should be a reciprocal function, rather than a log function. I intend to fix that sometime in the near future.
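To make the axes concrete, here is a small Python sketch that turns repeated kmers into (position, height) points for a profile plot of this kind. It is an illustration under assumptions, not the plotting code behind the slides: the function name `profile_points` is hypothetical, only forward repeats are handled, and each point is placed at the later of the two matching positions. The `scale` argument switches between the log axis described above and the reciprocal axis mentioned in the aside.

```python
import math

def profile_points(seq, k=10, scale="log"):
    """Return (position, transformed distance) for each repeated kmer."""
    last_seen = {}
    points = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if kmer in last_seen:
            distance = pos - last_seen[kmer]
            # Y axis: log10 of the distance, or its reciprocal
            y = math.log10(distance) if scale == "log" else 1.0 / distance
            points.append((pos, y))
        last_seen[kmer] = pos
    return points
```

A tandem repeat with a fixed unit length produces a horizontal row of points at one height. Finding the funnel shapes would need an extra pass that compares each kmer with its reverse complement.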
The read shown here is a moderate-length read, within the normal range of what would appear in a genomic DNA sequencing run on a nanopore MinION. However, this read is only a small part of a bigger picture. The read was actually used as evidence to assemble a 400kb contig from about 500 source sequences using a genome assembly program called Canu.
Mapping Seeds
For those who aren't familiar with the overlap-consensus method of genome assembly, in order to assemble the genome, Canu first identifies seeds for matching short subsequences in different sequenced reads. The ideal seeds are ones that have high complexity and are easy to identify in other sequences. There's a bit of an art to getting a good assembly. You need a good eye for the repetitive structures that can cause problems, including the things that look locally unique, but are duplicated in completely different regions of the genome. If you're lucky and have control of the sample preparation, long reads can be sequenced to make sure no reads sit entirely within repetitive regions, so that there's always a unique bit in each read that can be matched to neighbouring reads.
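As a toy illustration of seed matching, the Python sketch below finds kmers that occur exactly once in each of two reads, and reports where they sit. This is not how Canu finds overlaps; the function names and the choice of exact, unique kmers as seeds are assumptions made to keep the example short.

```python
from collections import defaultdict

def kmer_positions(seq, k):
    """Map each kmer to the list of positions where it starts."""
    positions = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        positions[seq[pos:pos + k]].append(pos)
    return positions

def unique_seeds(read_a, read_b, k=8):
    """Return (pos_a, pos_b) for kmers found exactly once in each read."""
    pos_a = kmer_positions(read_a, k)
    pos_b = kmer_positions(read_b, k)
    return sorted(
        (hits[0], pos_b[kmer][0])
        for kmer, hits in pos_a.items()
        if len(hits) == 1 and len(pos_b.get(kmer, ())) == 1
    )
```

Kmers from a repetitive stretch appear many times in a read and are skipped here, so only the unique part of each read contributes seeds. That mirrors the point above about wanting a unique bit in every read.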
People have frequently compared genome assembly to jigsaw puzzles, and photo stitching is basically a jigsaw puzzle where you get to decide what the pieces look like. Even though it's one dimensional rather than two dimensional, genome assembly still has the usual issues. Sometimes it's hard to know which way round a sequence goes, like with ropes and poles in jigsaw puzzles. Sometimes you get bits which are complex, but highly repetitive, like windows and bricks. And sometimes there are low-complexity regions, where the same thing is repeated in close proximity with itself, just like the sky and grass.
But I will admit that this 400kb sequence is an assembled contig, not an ultra-long read. It's a good demonstration of why the genome of Nippostrongylus brasiliensis is so hard to assemble with short Illumina reads, but there hasn't yet been any ultra-long read sequencing done on Nippo.
Nanopore WGS Consortium
For that, we need to move on to the less repetitive human genome, and the Nanopore whole genome sequencing consortium, and more recently the Telomere-to-telomere consortium headed by Karen Miga and Adam Phillippy. I haven't participated directly in their projects, but I'm a huge fan of the data they have produced, which is released under a creative commons attribution license. They've got one paper shunted through the peer review process, and it looks like there'll be a few more to follow (both DNA and RNA) in the next year or so.
Human UBC
So there we go. In the whole genome consortium reads, I found one read over 100kb that spanned the human UBC gene, about 20kb from the end of the read. The current consensus of the nanopore sequencing community is that an ultra-long read sequencing run has a read N50 of over 100 kilobases; in other words, over 50% of the sequenced bases in the run come from reads that are over 100kb in length.
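The read N50 mentioned here can be computed directly from a list of read lengths. The Python sketch below uses the common definition (the length L such that reads of length L or longer contain at least half of all sequenced bases); the function name `read_n50` is just for illustration.

```python
def read_n50(lengths):
    """Return the read N50 for a list of read lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For example, `read_n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8, because the two longest reads already hold 16 of the 32 bases.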
Repeats, Four Ways
That's as far as I've got with tracking down ubiquitin. Before I show the next visualisation, I want to distract you with a discussion about what it actually means for a sequence to have repeated in the genome.
When we look at a DNA sequence on our computer screens, we tend to imagine it on its own, unmoving and alone. But DNA is almost always double stranded, paired up with a reverse-complement sequence on the other side. If you consider that DNA is a physical object in 3D space, then it might be easier to see that a sequence can be physically reversed without actually breaking anything in the sequence, or in its surroundings.
So if we're looking for repeats in the genome, it's probably a good idea to look for all four of those possible configurations, rather than just the sequence on its own.
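The slide showing the four configurations is not part of this transcript. Assuming they are the sequence itself, its reverse, its complement, and its reverse complement, they can be generated in Python like this (the function name `four_orientations` is hypothetical):

```python
# Translation table for complementing unambiguous DNA bases
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def four_orientations(seq):
    """Return the four strand/direction variants of a DNA sequence."""
    complement = seq.translate(COMPLEMENT)
    return {
        "forward": seq,
        "reverse": seq[::-1],
        "complement": complement,
        "reverse_complement": complement[::-1],
    }
```

A repeat search that covers all four cases would look up each kmer under each of these variants, at the cost of roughly four times as many lookups.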
I'd like to point out that if DNA sequences are truly random, then all four of these possibilities should appear in roughly equal quantities, and be roughly equally distributed through the genome. They don't, and they aren't, so DNA sequences are not random.
Larger Tandem Repeats
This here is another long read, or maybe even an ultra-long read. This came from a flow cell sequenced as part of the nanopore whole-genome consortium experimentation with an ultra-long-read protocol, flow cell FAF15586, sequenced in Birmingham on the 8th of March 2017. It's not the biggest sequence, but it includes a tandem repeat array with one of the largest unit lengths that I've seen so far, about 40kb. That's a massively long repeat unit, and completely out of the ballpark of what any other sequencer would be able to discover. Even with a 10kb read length, another sequencer wouldn't even be able to determine that this region was repetitive, let alone identify a unique flanking region for anchoring. Yet here we are, with a long nanopore read that's nonchalantly eating a massive tandem array for breakfast.
Once again, you can see things a bit like water ripples from the tandem repeat array, getting closer and closer together in the log scale here as the things getting compared get further and further apart.
Semicircular Plot of an Ultra-long Read
Thinking about water, hills, and sunsets, I had a go at modifying this visualisation to make it a bit more like an actual sunset. This is representing the same information as the profile plot, but in a form that I consider to be more visually-appealing.
According to the consortium's paper, the longest full-length mapped read in the nanopore WGS data set (aligned with GraphMap) was 882 kb, corresponding to a reference span of 993 kb.
Long Read in Spiral Form
Just in case you were wondering, the spiral sequence plot also kind-of works for really long reads, especially if they contain repetitive sequences.
Human VDR
This here is the longest read that has been sequenced so far on Oxford Nanopore's MinION (as far as I'm aware). It's a read from chromosome 12 that's almost 2.3 megabases, produced by Alex Payne and Nadine Holmes at the University of Nottingham. The DNA sequence would have had a physical length of about 2/3 of a millimetre, and it would have taken about an hour and a half to move through the nanopore at 450 bases per second.
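Those two figures can be checked with some quick arithmetic. The Python sketch below rounds the read up to 2.3 megabases and assumes a rise of 0.34 nanometres per base pair (the textbook value for B-form DNA, which is not stated in the talk).

```python
read_length_bases = 2_300_000  # "almost 2.3 megabases", rounded up
rise_per_base_nm = 0.34        # assumption: B-form DNA, nm per base pair
speed_bases_per_s = 450        # translocation speed quoted in the talk

length_mm = read_length_bases * rise_per_base_nm * 1e-6  # nm -> mm
minutes = read_length_bases / speed_bases_per_s / 60

print(f"physical length: about {length_mm:.2f} mm")
print(f"time in pore: about {minutes:.0f} minutes")
```

With these assumptions the length comes out a little under 0.8 mm, in the same ballpark as the two-thirds of a millimetre quoted above, and the time comes out at roughly 85 minutes.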
As an aside, this sequence couldn't actually be sequenced in one go by Oxford Nanopore's standard sequencing algorithm at the time. They put in an upper limit of a million electrical samples per read, and this sequence exceeds that about five times over. Matt Loose did some post-processing on consecutive raw signal traces in order to create the complete sequence.
Here the profile plot is again spread into a semicircle so that the long range differences have a bit more leg room. The sequence shown here apparently includes the vitamin D receptor, so for those who are researching vitamin D, there's now a single nanopore read available that has a couple of megabases of surrounding context.
Human HLA
I want to finish up the formal part of my talk by showing a more extreme version of this plot. This is a contig assembled by the nanopore WGS consortium that contains the entirety of the MHC region... plus another 12Mb of surrounding sequence. The MHC region is shown at about 2.5 megabases to 6.5 megabases on this plot (give or take half a megabase or so).
As an aside, the Telomere-to-telomere consortium have also assembled the entirety of the X chromosome of CHM13 as a single contiguous sequence. Unfortunately my code is not yet efficient enough to process that in one go, so you'll have to settle for this 15Mb sequence for now.
This sequence contains two quite similar 25bp sequences, one of which appears 41 times (ACTCCAGCCTGGTGACAGAGTGAGA) and one that appears just once (ACTCCAGCTCACAGTCCTGTCGATG) within this region. And, I'm a little bit embarrassed to admit that finding those 25bp sequences is the closest I've got in the last 3 years to revisiting the weird non-repetitive repeats in the Nippo genome. I got too distracted with the other reverse-complement repeat patterns that I've been finding in most long DNA sequences I've looked at, and I just keep finding more loose ends as I dig deeper into the complexities of these patterns.
If there's still time, I want to tinker around with my laptop and show some other visualisations that I've found quite interesting. Feel free to fire away some questions, while I have a peek. But before I do that, I might as well take the opportunity to say Kia Ora: thanks for listening, and for being a great audience.
Tēnā koutou katoa
Ko Kurahaupo te waka
Ko Tapuae-o-uenuku te maunga
Ko Wairau te awa
Ko Rangitāne te iwi
Nō Pōneke ahau
Kei Karori tōku kāinga
He kai pūtaiao ki Malaghan Institute of Medical Research ahau
Ko Rawiri tōku ingoa
Tēnā koutou, tēnā koutou, tēnā koutou... READ MORE
Tēnā koutou katoa
Ko Kurahaupo te waka
Ko Tapuae-o-uenuku te maunga
Ko Wairau te awa
Ko Rangitāne te iwi
Nō Pōneke ahau
Kei Karori tōku kāinga
He kai pūtaiao ki Malaghan Institute of Medical Research ahau
Ko Rawiri tōku ingoa
Tēnā koutou, tēnā koutou, tēnā koutou katoa
Introduction
Hello, Kia ora, and thank you for being here to listen to me giving a talk on repetitive sequences. My name is David Eccles, and I come from a little town in the middle of New Zealand called Wellington, which is famous for its wind, and infamous for its remoteness. For the last five years or so, I've been doing DNA sequencing on the Oxford Nanopore Technologies' MinION sequencer at the Malaghan Institute of Medical Research. Part of my work at the Malaghan Institute involves using genetics to work out how cancer cells interact with mitochondria, but in \emph{this} talk, I'm going to explore one of my favourite hobbies outside the work that I'm paid for.
I brought a script with me today, because I'm prone to rambling when I get nervous, and it helps in the future for other people who are looking at my slides.
This image is a panorama photo, taken on a sunny day when I got distracted by the scenery. This is showing, from left to right, the Wellington harbour, the circular Malaghan Institute building, and a sliver of Victoria University of Wellington.
Panorama Stitching
The image has been used frequently by the Malaghan Institute for publicity purposes. It's actually a crop of a full 360 degree panorama that I made from 16 source images using a program called Hugin.
Control Points
For those who aren't familiar with photo stitching, in order to assemble the panorama, you first identify control points for matching bits of different images. The ideal points are things that have high contrast and are easy to identify in other images. There's a bit of an art to getting a good panorama. You need a good eye for the repetitive structures that can cause problems, including the things that look locally unique, but are duplicated in completely different regions of the big picture. If you're lucky and have control of the camera, source images can be taken that make sure no pictures sits entirely within repetitive regions, so that there's always a unique bit in each image that can be matched to neighbouring images.
Polar Mapping
Once you've got a 360 degree panorama, you can do a whole bunch of interesting things with it. One thing I love doing is mapping the horizontal panorama to a circular image. I was standing on the well of a mini and another car when I took these images, so there are bits missing from the centre. In any case, this circular mapping is a reasonably good way at visually representing a large horizontal expanse in a smaller area.
Nanopore Sequencing
But I'm not here to talk about panoramas; I'm here to talk about nanopore sequencing, specifically that carried out by Oxford Nanopore Technologies' MinION, GridION and PromethION devices. For those who don't know about the technology, nanopore sequencing involves the translocation of a long polymer through a hole, detecting changes in translocation speed as it goes through. These changes are represented as change in electric current over time.
The speed of translocation usually depends on the size and shape of the thing going through the pore, so the current can be used as a proxy for electrophysical properties. By understanding how that shape changes with different DNA bases, it's possible to work backwards from the electrical current trace to the bases, resulting in a very fast, observational method of DNA sequencing.
Next Next Generation Sequencing
So let's take a step back here and consider how nanopore sequencing is more than a little bit different to other methods. I see sequencing technology as being currently split up into three technological advances:
The first generation is sanger sequencing. Typically gel-based sequencing done by tagging sequences and counting their lengths.
The second generation is the one that I expect most of you are familiar with. DNA bases are individually tagged (or otherwise identified), and the action of base incorporation is recorded during DNA synthesis.
The third generation is the new one, where sequences are observed at a single molecule scale as they progress through the sequencer.
I've tried to be a bit careful with my definitions here because I think it's reasonable to say that some sequencers straddle two generations. If I consider PacBio systems, for example, their sequencers combine synthesis of known bases with some recording of sequence dynamics during base incorporation.
Nanopore sequencing has more than a few differences from other sequencing technologies beyond the fundamental, generational, differences that I've mentioned here. There are two that I want to mention briefly in passing. The first is that the sequencing pores are re-usable: soon after one DNA strand has gone through the pore, it can be re-loaded with another one; one pore will sequence thousands of strands over the course of a single run.
The second is that it can sequence *really long* strands of DNA, and that's what I'll be discussing in my talk today.
Mapping Illumina to Nanopore
About three years ago, I managed to encourage someone at our institute to prepare some DNA from the rodent parasite, Nippostrongylus brasiliensis. At that time, Oxford Nanopore only had a single sample preparation method that produced what they called a
two-direction (or 2D) read. I got a few reads out of the sequencer, with what seemed to be fairly high quality, but nothing close to what I'd be able to use to assemble a genome. But I was curious about what I could find out from the reads I had. The Sanger Institute had previously produced a genome assembly for the same parasite using Illumina reads, so I did a few computational experiments involving mapping Illumina reads to nanopore sequences.
What I ended up getting when I looked at the mapping results in Tablet looked really strange to me. Instead of a fairly flat coverage across each read, I found quite a lot of cases where the Illumina coverage had a weird mountainous profile. The sequences themselves didn't look particularly suspicious at the base level, so I actually spent a bit of time on coding in an attempt to get rid of these visual eyesores.
Mapping Nanopore to Nanopore
Things got weirder when I tried mapping the nanopore reads to themselves. I was surprised to see that I got the same patchiness in coverage, but it was even more pronounced and hard-edged in the nanopore-to-nanopore mapping. Eyeballing the results in Tablet increased my discomfort, because in some places there seemed to be differences in the identity of mapped reads in high-coverage regions that were right next to each other.
To explain this further, I found a locally non-repetitive region (in other words, it wasn't a tandem repeat, it wasn't repeated near itself), the region was about a thousand bases in length, and that region was present in a lot of different places in the genome.
I got quite hopeful when I mentioned this on Twitter, and was reminded of the existence of transposons that hopped around the genome, and that there was a database about these things. Unfortunately that database didn't have this particular 1kb sequence from our rodent parasite, and I'm loathe to categorise these things as transposons before demonstrating that they actually do move around the genome.
I was really confused, and realised that I should probably find some way to better determine what was going on, and how common this problem was. As full disclosure, I still haven't worked out why this was happening, so if you've got any ideas about this, please feel free to tell me.
Repeats - a Ubiquitous Problem / 1
So thus began my process of discovery. I'd started with a can of worms, so though maybe it might be better to take a step back and try to understand something I was more familiar with, and that thing was tandem repeats.
Here's a sequence that came off a MinION sequencing device a bit over a year ago. It's a cDNA transcript encoding one of the ubiquitin genes in Mus musculus, and I've mapped the roughly 1kb linear sequence to a spiral, starting at the centre and going out to the edge of the image. Sorry about the shift in gear from genome assembly to cDNA transcripts; I'm easily distracted.
Ubiquitin is a protein composed of 76 amino acids. For those of you who are quick at maths, you might realise that this transcript is a lot longer than the 228 nucleotides that are needed to encode the protein. In fact, this transcript is actually a polyubiquitin precursor transcript, but you wouldn't know it from looking at this image.
Repeats - a Ubiquitous Problem / 2
However, if I adjust the number of bases per ring to a few more, up to 226, then the repetitive structure of the cDNA transcript becomes a lot more obvious. I notice very obvious bands of the same base sequences radiating out from the centre of the image.
Spiralling Out Of The Void
For those interested in the nuts and bolts of this spiral plot, here's my drawing code as it exists at the moment. I use a bit of calculus to work out how far along the spiral a particular location is, so that I can get a smooth transition between different loops while keeping the angle per base constant. I do most of my work on the command line, and frequently work with sequences from a few tens of bases to a few hundred kilobases, so I've got quite a lot of variables in there to try to make things look at least passable at all scales.
However, this spiral visualisation code doesn't have any way to guess the number of bases per ring. That's all down to the person creating the image. It's a hard problem, and one that can have multiple solutions.
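As a rough illustration of why the ring length matters, here is a minimal sketch of a spiral layout with a constant angle per base. It is not the drawing code from the talk: it skips the arc-length calculation mentioned above, and the names `spiral_coords`, `spoke_angle`, `inner_radius` and `ring_spacing` are hypothetical.

```python
import math

def spiral_coords(index, bases_per_ring, inner_radius=1.0, ring_spacing=1.0):
    """Return (x, y) for base `index` on a simple Archimedean spiral.

    Assumption: the radius grows linearly with the angle, and every base
    advances the angle by the same amount.
    """
    turns = index / bases_per_ring            # completed rings so far
    theta = 2.0 * math.pi * turns             # constant angle per base
    radius = inner_radius + ring_spacing * turns
    return (radius * math.cos(theta), radius * math.sin(theta))

def spoke_angle(index, bases_per_ring):
    """Angle in radians (0 to 2*pi) of the spoke that a base falls on."""
    return 2.0 * math.pi * ((index % bases_per_ring) / bases_per_ring)
```

Bases that are exactly `bases_per_ring` apart fall on the same spoke, one ring further out. When the ring length matches the repeat unit length, repeated bases line up and show as radial bands.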
Repetitive Lyrics
So I'm going to take you through a bit of the process of development that led to me discovering that 226 number. It doesn't quite start with a song and a dance, but that'll do for now.
About two years ago a Reddit user, Frigorifico, had created and reported on a lyric visualisation tool called SongSim. You put in the lyrics of a song, and it will visualise the repetitive elements in the words. This tool was actually based on dotplots that are frequently used for DNA alignment.
Each dot in this matrix represents a comparison between two words in the lyrics of a song. Where there is a black dot, it means that the word is the same. In the example highlighted here, we can see that the 34th word in the lyrics is the same as the fifth word in the lyrics, and that word ``lamb'' appears in a few other places. There's a solid black line down the diagonal because each word is the same as itself.
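The matrix described above can be sketched in a few lines. This is an illustrative reconstruction rather than SongSim's own code; the function names and the short example lyric are assumptions made for the sketch.

```python
def self_similarity(tokens):
    """Compare every token with every other token.

    matrix[i][j] is True when token i and token j are the same word.
    """
    n = len(tokens)
    return [[tokens[i] == tokens[j] for j in range(n)] for i in range(n)]

def render(matrix):
    """Draw the matrix as text: '#' for a match, '.' for a mismatch."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )

# Hypothetical example lyric, split into words.
words = "mary had a little lamb little lamb little lamb".split()
matrix = self_similarity(words)
print(render(matrix))
```

The main diagonal is always filled, because each word matches itself, and repeated phrases show up as shorter diagonal runs away from the main diagonal.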
The interesting thing I noticed about this approach is that it concentrates on the words of a song, rather than letters. I saw parallels between the words in songs, and multi-base sequences in DNA sequences, which we call kmers.
This multi-base approach is also used in a completely non-visual program called SATFIND, which looks for repetitive motifs in DNA. Taking these things as inspiration, I worked through my own method to numerically describe the nature of repetitiveness in a DNA sequence.
Discovering Repeats in DNA
I wanted to have a go at doing something similar to that in arbitrary DNA sequences, with lengths from a few hundred bases to a few million bases. I needed a method that was fast, but still able to capture essentially every repetitive pattern in a sequence.
This is the method that I've ended up with, where instead of words in English text, I'm picking out overlapping equal-length subsequences of DNA. By recording the location of repeated subsequences, I can generate statistics that relate to the size of repetitive regions. This is how I got that 226 number for ubiquitin. I just looked for the distances between repeated kmers, and picked the most common distance as my repeat length, or ring length.
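The kmer-distance idea described above can be sketched roughly as follows. This is a simplified reconstruction, not the actual implementation: it assumes that only the distance to the previous occurrence of each kmer is counted, and the function names and default `k` are hypothetical.

```python
from collections import Counter

def repeat_distances(sequence, k=10):
    """Count distances between consecutive occurrences of each kmer."""
    last_seen = {}          # kmer -> position of its most recent occurrence
    distances = Counter()   # distance -> number of times it was observed
    for pos in range(len(sequence) - k + 1):
        kmer = sequence[pos:pos + k]
        if kmer in last_seen:
            distances[pos - last_seen[kmer]] += 1
        last_seen[kmer] = pos
    return distances

def most_common_repeat_length(sequence, k=10):
    """Return the most frequent kmer repeat distance, or None if no repeats."""
    distances = repeat_distances(sequence, k)
    if not distances:
        return None
    return distances.most_common(1)[0][0]
```

For a perfect tandem array the most common distance is the unit length. For a noisy read, sequencing errors spread the counts across nearby distances, so the modal value can differ slightly from the true unit length.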
When I'm able to generate statistics, I like to be able to visualise those statistics as well, so to start off, I converted these numbers back into the standard dot-plot format.
Rhyme & Readin'
The mouse cDNA ubiquitin transcript is quite a short sequence, as nanopore reads go. From a brief search on Google Scholar, it doesn't look like researchers have been too interested in doing mouse genomic DNA on the nanopore. However, I was able to find a fairly close match in the nanopore reads I've got lying around from the rodent parasite, Nippostrongylus brasiliensis. It didn't need to be ubiquitin for demonstrating the algorithm, but I figured I might as well add in that thread of continuity.
That's a dot-plot representation, but I found that it was quite difficult to use this to explain to other people what was going on. They kept asking difficult questions about it, like, ``why is there a diagonal line down the middle of the image?'' and, ``why are the repeating bits a triangle, or a square?'' There was something about the idea of comparing things to themselves that was unintuitive.
Reading From a Profile Perspective
So I had another think about the data that I was trying to represent. Was there a better way to represent it, one that made the locations of repetitive features more obvious? What I came up with was this: the location on a sequence is represented on the X axis, and the distance between features is represented on the Y axis in a log scale. No box shapes to be seen, and it helps to hide the idea that there's a central spine of a reference sequence that represents the ideal path through a sequence. Reverse complement pairs appear as funnel-shaped things, and repetitive blocks appear as sliced hills, or maybe a ripple pattern of a sunset on water.
As an aside, if I'm trying to model this as a water surface, then the Y axis should be a reciprocal function, rather than a log function. I intend to fix that sometime in the near future.
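One way to picture the data behind such a profile plot is as a list of (position, distance) points, one for each end of every repeated kmer pair. The sketch below is an assumption about the general shape of that data rather than the plotting code from the talk. The names are hypothetical, only forward repeats are handled, and the pairwise loop is quadratic in the number of occurrences of each kmer.

```python
import math
from collections import defaultdict

def profile_points(sequence, k=10):
    """Return sorted (position, distance) points for repeated kmers."""
    positions = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        positions[sequence[pos:pos + k]].append(pos)
    points = []
    for locs in positions.values():
        for i, first in enumerate(locs):
            for second in locs[i + 1:]:
                distance = second - first
                # Mark both ends of the pair at the same height.
                points.append((first, distance))
                points.append((second, distance))
    return sorted(points)

def log_scale(distance):
    """Y value on a log axis, as used in the plot described above."""
    return math.log10(distance)

def reciprocal_scale(distance):
    """Y value on a reciprocal axis, the alternative mentioned in the aside."""
    return 1.0 / distance
```

Both scales compress large distances. The log scale keeps larger distances higher up the axis, whereas the reciprocal scale flips the order so that the closest repeats get the largest values.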
The read shown here is a moderate-length read, within the normal range of what would appear in a genomic DNA sequencing run on a nanopore MinION. However, this read is only a small part of a bigger picture. The read was actually used as evidence to assemble a 400kb contig from about 500 source sequences using a genome assembly program called Canu.
Mapping Seeds
For those who aren't familiar with the overlap-layout-consensus method of genome assembly, in order to assemble the genome, Canu first identifies seeds for matching short subsequences in different sequenced reads. The ideal seeds are ones that have high complexity and are easy to identify in other sequences. There's a bit of an art to getting a good assembly. You need a good eye for the repetitive structures that can cause problems, including the things that look locally unique, but are duplicated in completely different regions of the genome. If you're lucky and have control of the sample preparation, long reads can be sequenced to make sure no reads sit entirely within repetitive regions, so that there's always a unique bit in each read that can be matched to neighbouring reads.
People have frequently compared genome assembly to jigsaw puzzles, and photo stitching is basically a jigsaw puzzle where you get to decide what the pieces look like. Even though it's one dimensional rather than two dimensional, genome assembly still has the usual issues. Sometimes it's hard to know which way round a sequence goes, like with ropes and poles in jigsaw puzzles. Sometimes you get bits which are complex, but highly repetitive, like windows and bricks. And sometimes there are low-complexity regions, where the same thing is repeated in close proximity with itself, just like the sky and grass.
But I will admit that this 400kb sequence is an assembled contig, not an ultra-long read. It's a good demonstration of why the genome of Nippostrongylus brasiliensis is so hard to assemble with short Illumina reads, but there hasn't yet been any ultra-long read sequencing done on Nippo.
Nanopore WGS Consortium
For that, we need to move on to the less repetitive human genome, and the Nanopore whole genome sequencing consortium, and more recently the Telomere-to-telomere consortium headed by Karen Miga and Adam Phillippy. I haven't participated directly in their projects, but I'm a huge fan of the data they have produced, which is released under a Creative Commons Attribution license. They've got one paper shunted through the peer review process, and it looks like there'll be a few more to follow (both DNA and RNA) in the next year or so.
Human UBC
So there we go. In the whole genome consortium reads, I found one read over 100kb that spanned the human UBC gene, about 20kb from the end of the read. The current consensus of the nanopore sequencing community is that an ultra-long read sequencing run has a read N50 of over 100 kilobases; in other words, at least 50% of the sequenced bases in the run come from reads that are over 100kb in length.
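For anyone who hasn't calculated a read N50 before, here is a minimal sketch using the usual definition: sort the reads from longest to shortest, and report the length of the read at which the running total first reaches half of all sequenced bases. The function name `read_n50` is hypothetical.

```python
def read_n50(read_lengths):
    """Return the read N50 for a collection of read lengths."""
    if not read_lengths:
        raise ValueError("at least one read length is required")
    total_bases = sum(read_lengths)
    running_total = 0
    for length in sorted(read_lengths, reverse=True):
        running_total += length
        # Stop at the first read that takes us to half of all bases.
        if 2 * running_total >= total_bases:
            return length

# Hypothetical toy run: one long read dominates the total yield.
print(read_n50([10, 10, 10, 100]))
```

In the toy run, the single 100-base read carries more than half of the 130 sequenced bases, so the N50 is 100 even though most reads are much shorter.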
Repeats, Four Ways
That's as far as I've got with tracking down ubiquitin. Before I show the next visualisation, I want to distract you with a discussion about what it actually means for a sequence to be repeated in the genome.
When we look at a DNA sequence on our computer screens, we tend to imagine it on its own, unmoving and alone. But DNA is almost always double stranded, paired up with a reverse-complement sequence on the other side. If you consider that DNA is a physical object in 3D space, then it might be easier to see that a sequence can be physically reversed without actually breaking anything in the sequence, or in its surroundings.
So if we're looking for repeats in the genome, it's probably a good idea to look for all four of those possible configurations, rather than just the sequence on its own.
I'd like to point out that if DNA sequences are truly random, then all four of these possibilities should appear in roughly equal quantities, and be roughly equally distributed through the genome. They don't, and they aren't, so DNA sequences are not random.
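As a small sketch, and assuming that the four configurations are the forward, reverse, complement and reverse-complement forms of a sequence, they can be generated like this. The function name and dictionary keys are hypothetical, and only unambiguous A, C, G and T bases are handled.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def orientations(sequence):
    """Return the four orientations of a DNA sequence as a dictionary."""
    complement = sequence.translate(_COMPLEMENT)
    return {
        "forward": sequence,
        "reverse": sequence[::-1],
        "complement": complement,
        "reverse_complement": complement[::-1],
    }

for name, seq in orientations("AACG").items():
    print(name, seq)
```

A repeat search that covers all four forms would look up each kmer under each of these orientations, rather than only looking for exact forward copies.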
Larger Tandem Repeats
This here is another long read, or maybe even an ultra-long read. It came from a flow cell sequenced as part of the nanopore whole-genome consortium's experimentation with an ultra-long-read protocol: flow cell FAF15586, sequenced in Birmingham on the 8th of March 2017. It's not the biggest sequence, but it includes a tandem repeat array with one of the largest unit lengths that I've seen so far, about 40kb. That's a massively long repeat unit, and completely out of the ballpark of what any other sequencer would be able to discover. Even with a 10kb read length, another sequencer wouldn't even be able to determine that this region was repetitive, let alone identify a unique flanking region for anchoring. Yet here we are, with a long nanopore read that's nonchalantly eating a massive tandem array for breakfast.
Once again, you can see things a bit like water ripples from the tandem repeat array, getting closer and closer together in the log scale here as the things getting compared get further and further apart.
Semicircular Plot of an Ultra-long Read
Thinking about water, hills, and sunsets, I had a go at modifying this visualisation to make it a bit more like an actual sunset. This is representing the same information as the profile plot, but in a form that I consider to be more visually-appealing.
According to the consortium's paper, the longest full-length mapped read in the nanopore WGS data set (aligned with GraphMap) was 882 kb, corresponding to a reference span of 993 kb.
Long Read in Spiral Form
Just in case you were wondering, the spiral sequence plot kind-of works for really long reads as well, especially if they contain repetitive sequences.
Human VDR
This here is the longest read that has been sequenced so far on Oxford Nanopore's MinION (as far as I'm aware). It's a read from chromosome 12 that's almost 2.3 megabases, produced by Alex Payne and Nadine Holmes at the University of Nottingham. The DNA sequence would have had a physical length of about 2/3 of a millimetre, and it would have taken about an hour and a half to move through the nanopore at 450 bases per second.
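The time estimate is simple division, sketched here with the read length rounded to 2.3 million bases and the 450 bases per second speed quoted above. The function name is hypothetical.

```python
def translocation_minutes(read_length_bases, bases_per_second=450):
    """Minutes needed for a read to pass through the pore at a fixed speed."""
    return read_length_bases / bases_per_second / 60

minutes = translocation_minutes(2_300_000)
print(f"{minutes:.0f} minutes")
```

That works out at roughly 85 minutes, which matches the "about an hour and a half" figure.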
As an aside, this sequence couldn't actually be sequenced in one go by Oxford Nanopore's standard sequencing algorithm at the time. They put in an upper limit of a million electrical samples per read, and this sequence exceeds that about five times over. Matt Loose did some post-processing on consecutive raw signal traces in order to create the complete sequence.
Here the profile plot is again spread into a semicircle so that the long range differences have a bit more leg room. The sequence shown here apparently includes the vitamin D receptor, so for those who are researching vitamin D, there's now a single nanopore read available that has a couple of megabases of surrounding context.
Human HLA
I want to finish up the formal part of my talk by showing a more extreme version of this plot. This is a contig assembled by the nanopore WGS consortium that contains the entirety of the MHC region... plus another 12Mb of surrounding sequence. The MHC region is shown at about 2.5 megabases to 6.5 megabases on this plot (give or take half a megabase or so).
As an aside, the Telomere-to-telomere consortium have also assembled the entirety of the X chromosome of CHM13 as a single contiguous sequence. Unfortunately my code is not yet efficient enough to process that in one go, so you'll have to settle for this 15Mb sequence for now.
This sequence contains two quite similar 25bp sequences, one of which appears 41 times (ACTCCAGCCTGGTGACAGAGTGAGA) and one that appears just once (ACTCCAGCTCACAGTCCTGTCGATG) within this region. And, I'm a little bit embarrassed to admit that finding those 25bp sequences is the closest I've got in the last 3 years to revisiting the weird non-repetitive repeats in the Nippo genome. I got too distracted with the other reverse-complement repeat patterns that I've been finding in most long DNA sequences I've looked at, and I just keep finding more loose ends as I dig deeper into the complexities of these patterns.
If there's still time, I want to tinker around with my laptop and show some other visualisations that I've found quite interesting. Feel free to fire away some questions, while I have a peek. But before I do that, I might as well take the opportunity to say Kia Ora: thanks for listening, and for being a great audience.