A glitch in the matrix: scratching the itch of repetitive DNA
[version 1; not peer reviewed]
No competing interests were disclosed.
The story I want to tell you about today starts with a common, everyday object, particularly in a research setting. Namely, a can of worms.
Our Can of Worms
In this case, it's pretty close to a literal can of worms. Five tubes of 25 μg of freeze-killed Nippostrongylus brasiliensis were prepared by Jodie Chandler and Mali Camberis from the Malaghan Institute, and were sequenced by Oxford Nanopore Technologies at the start of last year. I eventually managed to reduce the 4.5 terabytes of sequence data to something more manageable and transfer it to my computer, and I have been exploring those DNA sequences for the past year or so.
One thing that I noticed early on (actually prior to sending worms to Oxford Nanopore) was that the genome was very repetitive: a large part of the genome was composed of sequence that I had seen elsewhere, which brings me to The Matrix.
What Is The Matrix?
For those among you who haven't seen it, an R16 movie came out in 1999 about a simulated world, very much like the world we live in today. The movie begins with a woman, Trinity, who runs on walls to take out a squad of police officers, and gets crazier from that point on, as the audience is introduced to the slightly different rules of the simulated world. The protagonist, Neo, comments at one point that he had "a little Déjà vu", where he saw a cat walking past, "and then another that looked just like it". Trinity explains that "a Déjà vu is usually a glitch in the Matrix; it happens when they change something."
We can use a matrix for DNA sequence analysis to visually demonstrate a comparison between two sequences: one sequence on the X axis, and another sequence on the Y axis. When repetitive regions are present, they appear as diagonal interruptions in a mostly featureless expanse of nothingness; a glitch, in other words.
By the way, this is not the cat from The Matrix; this is my cat, Hansa.
Genome Assembly Improvement
With the help of Jodie Chandler and Mali Camberis, we were able to get enough DNA to have a decent crack at assembling the genome of Nippo. With Jonathan Ewbank supervising my bioinformatics, we were even able to get a paper out of that work before I got distracted by something else. Our research demonstrated that we could assemble a genome from nanopore long reads that was better than the existing Illumina-based assembly for almost all measures. But it left a bit of an itch for me to scratch: the assembly wasn't perfect, and we didn't have a complete answer as to why.
Most of our assembled genome was made up of single contiguous sequences of DNA, or contigs, that had no substantial overlap anywhere else in the genome. These are represented as dark red portions in the stacked barplot shown here. But there were a lot of regions where very large chunks of sequence were shared, covering many kilobases of DNA. These are represented as the joined bits in the squiggly firework display above the bar plot. In some cases, over 12 different contigs shared a DNA sequence among themselves, telling me that we've still got a bit of a way to go before we can call this genome complete.
So, I started looking at these regions of matched sequence, starting with the simplest non-trivial cases, where three contigs joined together in a Y-shaped subgraph. I didn't realise at that time how deep the rabbit hole of repetitive regions goes.
Comparing Contigs
I categorised all the three-way subgraphs based on how the contigs appeared to link together, using the same dotplot visualisation as I showed you before. I'll take you through my thought process for one of these.
This is the simplest connection of three contigs that I could find in the Nippo genome that we assembled. The line down the diagonal is where a contig is being compared with itself. The other bits show comparisons between different contigs. Here on the top line the start of contig B matches the end of contig A, while on the second line the start of contig C matches the end of contig B. I interpret this diagram as demonstrating that the middle contiguous DNA sequence, sequence B, sits right between A and C.
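The "end of A matches the start of B" reading can be sketched in Python. This is only a toy illustration, not the analysis behind the slide: the three contig sequences are made up, the function name `end_to_start_overlap` is hypothetical, and it assumes exact matches, whereas real nanopore contigs would need an error-tolerant aligner.

```python
def end_to_start_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    # Try the longest possible overlap first, then shrink down to min_len.
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0

# Made-up contigs: A ends with AGGC, B starts with AGGC and ends with TCCA,
# and C starts with TCCA.
contigs = {"A": "GATTACAGGC", "B": "AGGCTTTCCA", "C": "TCCAGGGATA"}

# Overlap length for every ordered pair of different contigs.
overlaps = {
    (a, b): end_to_start_overlap(contigs[a], contigs[b])
    for a in contigs
    for b in contigs
    if a != b
}

for (a, b), n in sorted(overlaps.items()):
    if n:
        print(f"end of {a} overlaps start of {b} by {n} bases")
```

In this toy case the only non-zero overlaps are A to B and B to C, which is the "plain linked" arrangement with B sitting between A and C.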
Choosing a Discovery VeCTR
I categorised the contig I just showed you as "plain linked", denoting a situation where the way to join bits seemed obvious and deterministic to me. There were also similar situations where it seemed like one contig was entirely contained within another, which I called "contained".
I also found situations where linking was obviously not possible, due to population variation in the Nippo worms that we sequenced; I called these situations "heterozygous branch".
Choosing a Discovery VeCTR - highlighted
The last category interested me more. These were regions where the matching region was defined by a large matrix of repeated sequences very close to each other, which we called a very long complex tandem repeat, or VeCTR for short.
You've seen tandem repeats before, or at least heard them. Maybe not very long ones, but certainly their shorter equivalents. They happen all over the place: they happen in speeches, they happen in music, and they happen in nursery rhymes.
Repetitive Rhymes
A couple of months ago I was reminded (possibly by Georgia Carson) of a visualisation tool that a Reddit user, Frigorifico, had created. You put in the lyrics of a song, and it will visualise the repetitive elements in the words. This tool was based on dotplots that are used for DNA alignment, so if you had a bit of trouble understanding that stuff before, I'll try and walk you through this now.
Each dot in this matrix represents a comparison between two words in the lyrics. Where there is a black dot, it means that the word is the same. In the example highlighted here, we can see that the 34th word in the lyrics is the same as the fifth word in the lyrics, and that this word, "lamb", appears in a few other places. There's a solid black line down the diagonal because each word is the same as itself.
This rhyme has a couple of other interesting features; one is that the repeat matrix is interrupted (or you might say glitched) on the second to last line of each verse. The other is that the repeat length increases for subsequent verses: two words, then three words, then five words.
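For anyone who wants to try the word-level version, here is a minimal Python sketch. It is not Frigorifico's tool; the function name `word_dotplot` and the one-line lyric are illustrative assumptions, and the "plot" is just text, with `#` for a matching pair of words and `.` for a mismatch.

```python
def word_dotplot(lyrics):
    """Return the cleaned words and the set of (i, j) pairs where word i equals word j."""
    # Lower-case and strip simple punctuation so that "lamb," matches "lamb".
    words = [w.strip(".,;:!?\"'").lower() for w in lyrics.split()]
    words = [w for w in words if w]
    dots = {
        (i, j)
        for i, a in enumerate(words)
        for j, b in enumerate(words)
        if a == b
    }
    return words, dots

lyrics = "Mary had a little lamb, little lamb, little lamb"
words, dots = word_dotplot(lyrics)

# One row per word; the diagonal is always filled in, because each word
# matches itself.
for i in range(len(words)):
    print("".join("#" if (i, j) in dots else "." for j in range(len(words))))
```

The short diagonal runs away from the main diagonal are the repeated "little lamb", which is the same signature that a tandem repeat leaves in a DNA dotplot.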
Discovering Repeats in DNA
I wanted to have a go at doing something similar to that in arbitrary DNA sequences, with lengths from a few hundred bases to a few million bases. I needed a method that was fast, but still able to capture essentially every repetitive pattern in a sequence.
This is the method that I've ended up with, where instead of words in English text, I'm picking out overlapping equal-length subsequences of DNA. By recording the location of repeated subsequences, I can generate a self-vs-self dotplot, or generate statistics that relate to the size of repetitive regions.
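The general idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used for the slides: the function names, the toy sequence and the tiny subsequence length `k=4` are all made up for the example, and a real run on megabase sequences would use a larger `k` and more careful memory handling.

```python
from collections import defaultdict

def repeated_kmers(seq, k=4):
    """Map each length-k subsequence seen more than once to its start positions."""
    positions = defaultdict(list)
    # Slide a window of length k along the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}

def dotplot_points(repeats):
    """Expand repeated positions into (x, y) points for a self-vs-self dotplot."""
    # Every pair of positions sharing a subsequence becomes one dot; this grows
    # quadratically for subsequences that occur very many times.
    return sorted((x, y) for pos in repeats.values() for x in pos for y in pos)

# Toy sequence: the 8-base unit ACGTTGCA twice, followed by ACGT.
seq = "ACGTTGCAACGTTGCAACGT"
repeats = repeated_kmers(seq, k=4)
points = dotplot_points(repeats)

print(repeats["ACGT"])  # start positions of one repeated subsequence
print(len(points), "dotplot points")
```

Recording positions in a dictionary means that the sequence is only scanned once, which is what keeps this kind of approach fast compared with comparing every position against every other position.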
Rhyming DNA
Here's a piece of the Nippo genome that has an expanding repeat structure that is a little bit like that of the nursery rhyme. There are these blocks of repeats within the pairwise matrix. In the upper left corner there are a couple of repeat blocks that share sequence, whereas the middle and lower right repeat blocks don't share sequence.
After spending too much time trying to explain to people why these dotplots were reflected down the diagonal, and why there was a line down the diagonal, I had a think about how I could better represent a sequence, but still show any patterns in that sequence.
Alternative Profile View of DNA
This is what I came up with. This is showing exactly the same region as in the previous slide.
The X axis still shows the sequence location, and the individual data points are still representing the same thing, but I've changed the Y axis to indicate the separation distance between features. Instead of rectangular grids, the repetitive regions now have a more interesting curved outline, and it's a bit easier to see substructure and interruptions within the repetitive regions.
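One way to read this change of axes is sketched below in Python. It is an interpretation rather than the code behind the slide: it assumes that the "separation distance" is the gap between a subsequence and its next occurrence, and the function name `distance_profile`, the toy sequence and `k=4` are hypothetical.

```python
def distance_profile(seq, k=4):
    """Return (position, distance) points: the distance from each length-k
    subsequence to the next occurrence of the same subsequence."""
    last_seen = {}
    points = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            prev = last_seen[kmer]
            # X is where the subsequence was seen before; Y is how far away
            # its next copy is.
            points.append((prev, i - prev))
        last_seen[kmer] = i
    return sorted(points)

# Toy sequence: the 8-base unit ACGTTGCA twice, followed by ACGT.
profile = distance_profile("ACGTTGCAACGTTGCAACGT", k=4)
for position, distance in profile:
    print(position, distance)
```

For a perfect tandem repeat every point sits at a height equal to the repeat unit size (8 in this toy case), so a change in unit size or an interruption shows up as a change in height rather than as a shifted diagonal.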
These repeats are much larger than STRs or microsatellites that you might be used to. While STRs and microsatellites have a repeat unit size of a few bases up to 50 or so, the repeat unit size of these regions is over a hundred bases; in some cases over a thousand.
I'd love to expand on another couple of garden paths that I went down with regards to visualising DNA, but for reasons of brevity and sanity, it's probably better if I keep to this format. Feel free to talk to me in my office if you want to know more.
But Why?
Before I go further, I'm going to try to give some meaning to what I've seen, and give you a bit of time to digest what you've seen.
The first thing that I should point out is that the discovery of large, repetitive sequences is not novel. As one example from many, defects in a huge repeat on human chromosome 4 have been implicated in a disease called Facioscapulohumeral muscular dystrophy. Large repeats have also been implicated in some cancers.
The visualisation might be novel, but I admit that I haven't spent much time looking for similar things.
There's been a recent paper that's come out on the presence of "extrachromosomal circular DNA" in the human genome. I don't think it's too much of a stretch to imagine that something like rolling circle amplification on these could make a long linear sequence, which could then be incorporated somehow into chromosomes.
One of the most interesting suggestions I had from the Reddit community was that repetitive sequences could be used to calibrate the timing of chemical reactions.
I've also wondered if these sequences might have some functional benefit in either creating a defined structure in the DNA, for example for chromosome packing during mitosis, or to help proteins to narrow in on a particular binding sequence.
The answer is probably all of the above, and more. I feel like I've dropped into another world of discovery here, and hopefully you'll forgive me if I haven't fully explored the biology or the computational side of this.
Eat Me!
And, indeed, the journey continues. The repetitive sequences that I saw in DNA have only been the beginning of my travels into the large-scale structure of DNA. As alluded to previously, I've been tempted by the delicious journey of further discovery. What if I'm blinded by the limitations I've put on how I see DNA sequences, and am missing something that's blindingly obvious? What if there were reverse-complement patterns in the DNA, just like there are exact-copy patterns in the DNA? What if the reverse of a sequence were also embedded as a pattern in the DNA?
Well, as it turns out, the rabbit hole goes deeper, even on the visualisation side of things. I'm going to leave you with a few examples that represent the beginning of my discoveries of the large-scale complexities of DNA at the sequence level.
Nippo 28S rRNA Gene
I've found repetitive regions that are within repetitive regions. Here's a region from the Nippo genome which has the 28S ribosomal RNA gene that is nailed together with smaller repetitive chunks. I say "small", but these are regions of something like a thousand bases, with repetitive units that are about thirty bases long. I've got no idea what they're doing there, no idea what their function might be, but I can see them, and hopefully that will lead to something more in the future.
Complexity at Massive Scales
And then when I look at larger regions of the genome, megabase-sized contigs where we've been able to assemble them, I see other interesting patterns when I consider reversal and reverse-complementation as well as simple repeats. The reverse-sequence patterns tend to reside in small regions, and are most commonly in AT-rich regions, whereas the reverse-complement patterns stretch over regions as far as I've been able to observe. They're not in all DNA sequences, and they are present far in excess of what I would expect when looking at completely random strings of letters.
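The three kinds of pattern can be told apart with the same position-recording trick. The Python below is a hedged sketch, not the visualisation algorithm from the talk: the names `orientation_matches`, `reverse` and `reverse_complement`, the toy sequence and `k=4` are assumptions made for illustration, and only the bases A, C, G and T are handled.

```python
from collections import defaultdict

# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse(seq):
    """Same strand read backwards."""
    return seq[::-1]

def reverse_complement(seq):
    """Opposite strand read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def orientation_matches(seq, k=4):
    """For each orientation, list (i, j) with i < j where the subsequence at i
    equals the transformed subsequence at j."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    transforms = {
        "repeat": lambda s: s,
        "reverse": reverse,
        "reverse_complement": reverse_complement,
    }
    matches = {name: [] for name in transforms}
    for name, transform in transforms.items():
        for j in range(len(seq) - k + 1):
            target = transform(seq[j:j + k])
            # Keep i < j only, so each pair is reported once and a
            # subsequence is never matched against itself.
            matches[name].extend((i, j) for i in index.get(target, []) if i < j)
    return matches

# Toy sequence: AAAC at the start and its reverse complement GTTT at the end.
matches = orientation_matches("AAACTTGTTT", k=4)
for name, pairs in matches.items():
    print(name, pairs)
```

Comparing the number of matches in each orientation against the same counts for a shuffled copy of the sequence would be one simple way to check whether a pattern is more common than chance.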
And, just in case you're wondering, this is not just a Nippo-specific pattern.
Human MHC Region
At the start of the year, a new paper came out where the Nanopore Whole-genome Sequencing Consortium had assembled the human genome using nanopore reads alone. One of the things that they particularly emphasised in their paper was that they had managed to assemble a 15 megabase contig that included the Major Histocompatibility Complex region, typically considered to be one of the most difficult regions to analyse from a genetic perspective.
I thought it would be interesting to run my visualisation algorithm through this region and see what came up. What I saw was the same massively long-range reverse-complement patterns, and actually nothing particularly out of the ordinary at that scale in the MHC region compared to the surrounding sequence. Again, I've got no idea what this means.
So, that's the beginning of discovery. I look forward to your thoughts on where the future may lead. Thank you.