High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing

Devika Ganesamoorthy; Mengjia Yan; Valentine Murigneux; Chenxi Zhou; Minh Duc Cao; Tania P. S. Duarte; Lachlan J. M. Coin

doi:10.12688/f1000research.25693.1

Research Article

[version 1; peer review: 1 approved with reservations, 1 not approved]

Devika Ganesamoorthy¹,², Mengjia Yan¹, Valentine Murigneux¹, Chenxi Zhou¹,², Minh Duc Cao¹, Tania P. S. Duarte¹, Lachlan J. M. Coin¹,²

PUBLISHED 02 Sep 2020

Author details

¹ Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, 4072, Australia
² Department of Clinical Pathology, The University of Melbourne, Melbourne, Victoria, 3052, Australia

Devika Ganesamoorthy
Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Mengjia Yan
Roles: Formal Analysis, Investigation, Methodology, Writing – Original Draft Preparation, Writing – Review & Editing

Valentine Murigneux
Roles: Formal Analysis, Validation, Visualization, Writing – Review & Editing

Chenxi Zhou
Roles: Data Curation, Formal Analysis, Visualization, Writing – Review & Editing

Minh Duc Cao
Roles: Formal Analysis, Software, Writing – Review & Editing

Tania P. S. Duarte
Roles: Investigation, Validation, Writing – Review & Editing

Lachlan J. M. Coin
Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Methodology, Project Administration, Resources, Software, Supervision, Visualization, Writing – Review & Editing

This article is included in the Nanopore Analysis gateway.

Abstract

Background: Tandem repeats (TRs) are highly prone to variation in copy numbers due to their repetitive and unstable nature, which makes them a major source of genomic variation between individuals. However, population variation of TRs has not been widely explored due to the limitations of existing approaches, which are either low-throughput or restricted to a small subset of TRs. Here, we demonstrate a targeted sequencing approach combined with Nanopore sequencing to overcome these limitations.
Methods: We selected 142 TR targets and enriched these regions using Agilent SureSelect target enrichment approach with only 200 ng of input DNA. We barcoded the enriched products and sequenced on Oxford Nanopore MinION sequencer. We used VNTRTyper and Tandem-genotypes to genotype TRs from long-read sequencing data. Gold standard PCR sizing analysis was used to validate genotyping results from targeted sequencing data.
Results: We achieved an average of 3062-fold target enrichment on a panel of 142 TR loci, generating an average of 97X coverage per sample with 200 ng of input DNA per sample. We successfully genotyped an average of 75% targets and genotyping rate increased to 91% for the highest-coverage sample for targets with length less than 2 kb, and GC content greater than 25%. Alleles estimated from targeted long-read sequencing were concordant with gold standard PCR sizing analysis and highly correlated with alleles estimated from whole genome long-read sequencing.
Conclusions: We demonstrate a targeted long-read sequencing approach that enables simultaneous analysis of hundreds of TRs and accuracy is comparable to PCR sizing analysis. Our approach is feasible to scale for more targets and more samples facilitating large-scale analysis of TRs.

Keywords

Tandem repeats, targeted sequencing, long-read sequencing

Corresponding authors: Devika Ganesamoorthy, Lachlan J. M. Coin

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by the Australian Government National Health and Medical Research Council Project Grant APP1052303 to Prof Lachlan J. M. Coin.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2020 Ganesamoorthy D et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

How to cite: Ganesamoorthy D, Yan M, Murigneux V et al. High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2020, 9:1084 (https://doi.org/10.12688/f1000research.25693.1) First published: 02 Sep 2020, 9:1084 (https://doi.org/10.12688/f1000research.25693.1) Latest published: 02 Sep 2020, 9:1084 (https://doi.org/10.12688/f1000research.25693.1)

Introduction

Repeated sequences occur in multiple copies throughout the genome; they make up almost half of the human genome¹. Repeat sequences can be divided into two categories: interspersed repeats and tandem repeats (TRs). Interspersed repeats are scattered throughout the genome and are remnants of transposons². TRs consist of repeat units that are located adjacent to each other (i.e. in tandem). There are almost 1 million TRs in the human genome, covering 10.6% of the entire genome³. TRs can be further divided into two types based on the length of the repeat unit: repeats with one to six base pair repeat units are classified as microsatellites or short tandem repeats (STRs), and those with more than six base pair repeat units are known as minisatellites⁴.

TRs are prone to high rates of copy number variation and mutation due to their repetitive and unstable nature, which makes them a major source of genomic variation between individuals. Variation in TRs may explain some of the phenotypic variation observed in complex diseases, as it is poorly tagged by single nucleotide variation^5,6. Recent studies have shown that 10% to 20% of coding and regulatory regions contain TRs and suggested that variations in TRs could have phenotypic effects⁷. Although TRs represent a highly variable fraction of the genome, analysis of TRs has so far been limited to known pathogenic regions, mainly STRs, due to the limitations of analysis techniques.

Traditionally, TR analysis has been carried out via restriction fragment length polymorphism (RFLP) analysis⁸ or PCR amplification of the target loci followed by fragment length analysis⁹. These techniques are only applicable to a specific target region and are not scalable to high-throughput analysis, which limits the possibility of genome-wide TR analysis. In the past decade, significant progress has been made in utilising high-throughput short-read sequencing data for genotyping STRs¹⁰. Our group and others have also demonstrated targeted sequencing approaches using short-read sequencing for TR analysis^11,12. Several computational tools have been developed to improve the accuracy of TR genotyping from short-read sequencing data, with varying performance^13–19. Yet most of these tools have focused mainly on the analysis of STRs, and analysis of longer TRs remains a hurdle for these approaches. We reported a new approach, GtTR, in Ganesamoorthy et al.¹², which utilizes short-read sequencing data to genotype longer TRs. GtTR reports the absolute copy number of the TRs, but it does not report the exact genotype of the two alleles because it relies on short-read sequencing data.

Sequencing reads that span the entire repeat region are informative to accurately genotype TRs¹¹, and therefore are ideal for genome-wide TR analysis. Long-read sequencing technologies have the potential to span all TRs in the human genome, including long TRs. There have been few reports on the use of long-read sequencing for the analysis of specific TRs implicated in diseases^20–22. Genotyping tools utilizing long-read sequencing data, such as Nanosatellite²¹, RepeatHMM²³ and Tandem-genotypes²⁴, have been reported in recent years with varying performance across different repeat unit lengths and repeat lengths. We reported VNTRTyper in Ganesamoorthy et al.¹² to genotype TRs from long-read whole genome sequencing data. Despite the availability of genotyping tools, long-read sequencing is not widely used for TR analysis, due to the high costs associated with whole genome long-read sequencing. Cost-effective long-read sequencing approaches will be an important and attractive option to genotype TRs in large-scale studies. However, there has been limited progress on targeted long-read sequencing of TRs.

We have previously demonstrated that targeted sequence capture of repetitive TR sequences is feasible using short-read sequencing technologies¹². In this study, we demonstrate the targeted sequence capture of repetitive TRs using Oxford Nanopore long-read sequencing technologies. There have been previous reports on the use of targeted sequencing combined with long-read sequencing technologies²⁵; however, enrichment of repeat sequences requires optimization in probe design and probe hybridization approaches. We optimized the protocols and report successful enrichment of repetitive sequences followed by long-read sequencing. We demonstrate the accuracy of genotype estimates from targeted long-read sequencing by comparison with gold-standard PCR sizing analysis. In this study, we predominantly targeted longer TRs (i.e. minisatellites); however, our approach is applicable to all TRs. The targeted long-read sequencing method presented here provides an accurate and cost-effective approach for large-scale analysis of TRs, which will be useful for researchers to explore the impact of TR variants on diseases and phenotypes.

Methods

Samples for sequencing

DNA samples of CEPH/UTAH pedigree 1463 were purchased from Coriell Institute for Medical Research (USA). Seven family members from the pedigree used for sequencing analysis were NA12877, NA12878, NA12879, NA12881, NA12882, NA12889 and NA12890.

Selection of TRs and probe design

The selection of TRs and design of probes were described in Ganesamoorthy et al. (2018)¹². Briefly, 142 TRs were selected; they range from 112 to 25236 bp in length in the reference human genome (hg19) and the number of repeat units ranges from 2 to 2300. TRs used in this study were selected as part of another study investigating the association between TRs and obesity; these targeted TRs are not disease associated. Agilent SureSelect DNA design (Agilent Technologies) was used to design target probes to capture the targeted regions (including 100-bp flanking regions) and regions flanking the TRs (at least 1000 bp).

Nanopore targeted sequencing of TRs

All seven family members from the CEPH pedigree 1463 were used for Nanopore targeted sequencing analysis (NA12877, NA12878, NA12879, NA12881, NA12882, NA12889 and NA12890). Target sequence capture for Nanopore sequencing was performed using Agilent SureSelect XT HS Target Enrichment System (Agilent Technologies) according to the manufacturer’s instructions with slight modifications. Briefly, 200 ng of DNA was fragmented to 3 kb using Covaris Blue miniTUBE (Covaris). Greater than 90% of the targeted TRs are less than 3 kb and the SureSelect capture protocol works effectively on fragments less than 4 kb in length; therefore, DNA products were sheared to 3 kb. Fragmented DNA was end-repaired, adapter-ligated and amplified prior to target capture. Extension time for pre-capture amplification was increased to 4 minutes to allow for the amplification of long fragments, and 14 amplification cycles were used. Purified pre-capture PCR products were hybridized to the designed capture probes for 2 hours. Streptavidin beads (Thermo Fisher) were used to pull down the DNA fragments bound to the probes. Finally, captured DNA was amplified with a long extension time (4 minutes) using Illumina Index adapters provided in the enrichment kit. Post-capture PCR products were purified using 0.8X - 1X AMPure XP beads (Beckman Coulter).

Nanopore sequencing library preparation was performed using the 1D Native barcoding genomic DNA protocol (with EXP-NBD103 and SQK-LSK108) (Oxford Nanopore Technologies) according to the manufacturer’s instructions with minor modifications. Briefly, 100–200 ng of post-capture PCR products were end repaired and incubated at 20°C for 15 mins and 65°C for 15 mins. End-repaired products were ligated with unique native barcodes. Purification steps after end repair and barcode ligation were omitted to minimize the loss of DNA. Barcoded samples were pooled in equimolar concentrations prior to adapter ligation. Adapter-ligated samples were purified using 0.4X AMPure XP beads (Beckman Coulter). Samples were split into two sequencing groups: NA12877, NA12878, NA12879 and NA12890 (group 1); and NA12881, NA12882 and NA12889 (group 2). Sequencing was performed on a MinION sequencer (Oxford Nanopore Technologies) with an R9.5 flow cell. Both groups were sequenced for 48 hours. Nanopore sequencing data were base called with Albacore (version 2.2.7), which was also used to demultiplex reads based on the barcode sequences.

Public data used in the study

Nanopore WGS data on CEPH Pedigree 1463 sample NA12878 were obtained from the Nanopore WGS consortium (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel_3_4.md)²⁶. PacBio WGS data on the NA12878 sample were downloaded from SRA with accession numbers SRX627421 and SRX638310²⁷.

VNTRTyper

Sequencing reads were mapped to hg19 reference genome using Minimap2 (version 2.13)²⁸. For Nanopore sequencing ‘-ax map-ont’ and for PacBio WGS ‘-ax map-pb’ parameters were used. VNTRTyper, our in-house tool described by Ganesamoorthy et al.¹² was used to genotype TRs from long-read sequencing data. Briefly, VNTRTyper takes advantage of the long-read sequencing to identify the number of repeat units in the TR regions. Firstly, the tool identifies reads that span the repeat region and applies hidden Markov models (HMM) to align the repetitive portion of each read to the repeat unit. Then it estimates the multiplicity of the repeat units in a read using a profile HMM.
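
The first step described above, selecting reads that span the repeat region, can be illustrated with a minimal Python sketch. This is not VNTRTyper's code: the `Alignment` record, the function names and the coordinate convention (0-based, end-exclusive) are assumptions made for illustration. The 30 bp flank mirrors the default flanking length that the Discussion reports for VNTRTyper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Reference interval covered by one mapped read (0-based, end-exclusive).

    Hypothetical record for illustration; real alignments would come from
    the Minimap2 output.
    """
    read_id: str
    ref_start: int
    ref_end: int


def spans_repeat(aln, repeat_start, repeat_end, flank=30):
    """Return True when the alignment covers the repeat plus `flank` bp on both sides."""
    return aln.ref_start <= repeat_start - flank and aln.ref_end >= repeat_end + flank


def spanning_reads(alignments, repeat_start, repeat_end, flank=30):
    """Collect the IDs of reads that span the whole repeat region."""
    return [a.read_id for a in alignments
            if spans_repeat(a, repeat_start, repeat_end, flank)]


# Toy example: only r1 covers the repeat at 1000-2000 with 30 bp on each side.
alignments = [
    Alignment("r1", 900, 2100),   # spans repeat and both flanks
    Alignment("r2", 990, 2100),   # left flank too short
    Alignment("r3", 500, 1500),   # ends inside the repeat
]
print(spanning_reads(alignments, 1000, 2000))
```

Only reads passing this filter would be handed to the HMM step that counts repeat units.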

Recently, we further improved the accuracy of genotyping estimates by clustering the copy number counts from reads to identify the likely genotypes per target. We used k-means clustering with the number of clusters fixed at two, one for each allele. A minimum threshold of two supporting reads per genotype was used to assign genotypes. Furthermore, for heterozygous genotypes, each allele had to be supported by at least 10% of reads; an allele supported by less than 10% of reads was excluded from the analysis. The updated version of VNTRTyper can be accessed from GitHub Japsa release 1.9–3c and can be deployed using script name jsa.tr.longreads. Details of VNTRTyper analysis are previously reported by Ganesamoorthy et al.¹².
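
The clustering and filtering rules described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the Japsa implementation: the function names are hypothetical, the clusters are seeded at the smallest and largest copy number, each allele is reported as its cluster mean, and both thresholds (two reads, 10% of reads) are applied to every cluster.

```python
def two_means_1d(values, max_iter=100):
    """Lloyd's k-means with k=2 on 1-D data, seeded at the min and max (assumed seeding)."""
    c1, c2 = float(min(values)), float(max(values))
    for _ in range(max_iter):
        # Assign every read-level copy number to the nearer centre.
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        n1 = sum(g1) / len(g1)
        n2 = sum(g2) / len(g2) if g2 else n1
        if n1 == c1 and n2 == c2:
            break
        c1, c2 = n1, n2
    return [(c1, len(g1)), (c2, len(g2))]


def call_genotype(copy_numbers, min_reads=2, min_fraction=0.10):
    """Return (allele1, allele2) or None when too few reads support any allele."""
    if len(copy_numbers) < min_reads:
        return None
    total = len(copy_numbers)
    kept = [centre for centre, n in two_means_1d(copy_numbers)
            if n >= min_reads and n / total >= min_fraction]
    if not kept:
        return None
    if len(kept) == 1:
        kept = kept * 2  # one supported cluster: report as homozygous
    return tuple(round(c, 1) for c in sorted(kept))


# Heterozygous example: reads cluster around 6 and 9 copies.
print(call_genotype([6.0, 6.1, 5.9, 6.0, 9.0, 9.1, 8.9, 9.0]))
# A single stray read at 15 copies is below both thresholds and is dropped.
print(call_genotype([12.0] * 20 + [15.0]))
```

Fixing the number of clusters at two reflects a diploid genome; the filter on supporting reads is what turns a weakly supported second cluster into a homozygous call.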

Tandem-genotypes

We also used another independent method, Tandem-genotypes, to estimate genotypes from long-read sequencing data. Tandem-genotypes was recently reported for analysis of TR genotypes from long-read sequencing data²⁴ and it can be utilised for both Nanopore and PacBio sequencing technologies.

Nanopore and PacBio sequencing data were mapped to the hg19 reference genome using LAST v959²⁹. Calculation of repeat length per sequencing read was performed with Tandem-genotypes as reported by Mitsuhashi et al.²⁴. Copy number changes in reads covering the repeat’s forward and reverse strands were merged and the two alleles with the highest number of supporting reads for each VNTR were extracted. A minimum threshold of two supporting reads per genotype was used to assign genotypes.
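
The post-processing step described above (merging strands, then keeping the two best-supported alleles) might look like the following Python sketch. It does not reproduce Tandem-genotypes itself; the input format (one list of per-read copy number changes for each strand), the function name and the tie-breaking rule are assumptions for illustration.

```python
from collections import Counter


def top_two_alleles(forward_changes, reverse_changes, min_reads=2):
    """Merge per-read copy number changes from both strands and return the two
    best-supported alleles, or None if no allele has `min_reads` supporting reads."""
    counts = Counter(forward_changes) + Counter(reverse_changes)
    supported = [(allele, n) for allele, n in counts.items() if n >= min_reads]
    if not supported:
        return None
    # Most reads first; ties broken by the smaller allele (assumed rule).
    supported.sort(key=lambda item: (-item[1], item[0]))
    best = [allele for allele, _ in supported[:2]]
    if len(best) == 1:
        best = best * 2  # only one supported allele: report as homozygous
    return tuple(sorted(best))


# Changes relative to the reference copy number: +1 (4 reads) and 0 (3 reads) win.
print(top_two_alleles([0, 0, 1, 1, 1, 3], [0, 1, -2]))
```

The returned values are copy number changes relative to the reference, so the absolute repeat number would be obtained by adding the reference copy number of the target.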

PCR analysis of VNTRs

A total of 10 targeted VNTR regions with less than 1 kb of repetitive sequence were validated by PCR sizing analysis in this study (PCR primer sequences provided in Extended data, Supplementary Table 1)³⁰. These ten targets include various repeat unit length and repeat sequence combinations to assess the accuracy of the genotypes determined from sequencing data. The majority of these targets were tested in our previous study¹² and the results from the previous PCR analysis were used for these regions. PCRs were performed using HotStar Taq DNA Polymerase (Qiagen) and PCR conditions were optimized for each PCR target. PCR products were purified and subjected to capillary electrophoresis on an ABI3500xL Genetic Analyzer (Applied BioSystems). Fragment sizes were analyzed using GeneMapper 4.0 (Applied BioSystems). Alternatively, STRand or Osiris could be used for fragment size analysis. Capillary electrophoresis plots are provided in Extended data, Supplementary Information PCR data³⁰.

Statistical analysis

Linear regression analysis was used to determine correlation between genotype estimates. All plots were generated using GraphPad Prism (version 7.00 for Windows; GraphPad Software, La Jolla California USA).
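
As an illustration of how such a correlation can be computed from genotype tables, the sketch below flattens "a/b" genotype strings into paired allele lists and computes the Pearson coefficient. The helper names, the skipping of "ND" entries and the pairing of sorted alleles are assumptions; the article does not state exactly how alleles were paired.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def paired_alleles(reference_calls, estimated_calls):
    """Flatten 'a/b' genotype strings into aligned allele lists, skipping 'ND'."""
    xs, ys = [], []
    for ref, est in zip(reference_calls, estimated_calls):
        if "ND" in (ref, est):
            continue  # no genotype available for this target
        xs.extend(sorted(float(v) for v in ref.split("/")))
        ys.extend(sorted(float(v) for v in est.split("/")))
    return xs, ys


# Toy genotypes in the same "allele1/allele2" notation used in the tables.
pcr = ["12.0/13.0", "10.6/13.6", "2.0/2.0", "6.0/8.0"]
seq = ["12.0/13.0", "ND", "2.0/2.0", "6.2/8.4"]
xs, ys = paired_alleles(pcr, seq)
print(round(pearson(xs, ys), 4))
```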

We investigated the effect of GC content, repeat length, repeat period and repeat copy number on target sequencing depth using a multivariate linear regression model. We used ggplot2 (version 3.2.0) to visualize the relationship between these factors and sequencing depth across all seven samples. Thresholds on GC content and repeat length were chosen based on this visual analysis. Genotype rate was calculated as the proportion of (sample, target) pairs that had a predicted genotype (based on VNTRTyper) amongst all targets that met the GC and repeat length thresholds.
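
The genotype rate definition above can be written out as a short Python sketch. The data layout (a dictionary of target properties and a dictionary of calls keyed by sample and target) and the function name are hypothetical, and treating the thresholds as inclusive at exactly 25% GC and 2 kb is an assumption.

```python
def genotype_rate(targets, samples, calls, min_gc=25.0, max_length=2000):
    """Proportion of (sample, target) pairs with a genotype among eligible targets.

    targets: {target_id: (gc_percent, repeat_length_bp)}
    calls:   {(sample_id, target_id): genotype or None}
    Boundary handling (>= 25% GC, <= 2 kb) is an assumption.
    """
    eligible = [t for t, (gc, length) in targets.items()
                if gc >= min_gc and length <= max_length]
    pairs = [(s, t) for s in samples for t in eligible]
    if not pairs:
        return 0.0
    genotyped = sum(1 for pair in pairs if calls.get(pair) is not None)
    return genotyped / len(pairs)


# Toy panel: TR_2 fails the GC threshold and TR_3 fails the length threshold.
targets = {"TR_1": (40.0, 500), "TR_2": (20.0, 500),
           "TR_3": (45.0, 3000), "TR_4": (50.0, 1500)}
calls = {("S1", "TR_1"): (2.0, 2.0), ("S2", "TR_1"): None,
         ("S1", "TR_4"): (3.0, 4.0), ("S2", "TR_4"): (3.0, 3.0),
         ("S1", "TR_2"): (1.0, 1.0)}
print(genotype_rate(targets, ["S1", "S2"], calls))  # 3 of 4 eligible pairs
```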

Results

We demonstrate a targeted sequence capture approach combined with Nanopore long-read sequencing to genotype hundreds of TRs.

Targeted Capture Sequencing of Tandem repeats

We performed sequence enrichment of targeted TRs for seven samples followed by long-read sequencing using Oxford Nanopore Technologies’ MinION as described in the Methods. Figure 1a shows the read length distribution observed in targeted capture sequencing data. The median read length followed the expected read-length distribution, with the exception of an under-representation of repeats of length >3 kb (Figure 1a). The read length in this study was sufficient to analyse the majority of targeted TRs of length less than 2 kb. Sequence coverage varied across targets and samples; on average, 97X sequence coverage was achieved, with only 19 targets having less than 1X coverage (Figure 1b). The majority of the low-coverage targets (16 of the 19 targets) have less than 25% GC content (Extended data, Supplementary Figure 1)³⁰.

Extended data, Supplementary Table 2³⁰ summarises the metrics for targeted sequencing on Nanopore sequencing technologies. Nanopore multiplexing (see Methods) group 1 samples had similar yield between samples; however, Nanopore multiplexing group 2 samples had varying yield per sample. Despite the differences in sequencing yield, we achieved an average of 3062-fold target enrichment and the on-target capture rate was approximately 50%.
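
The article does not state the formula behind the fold-enrichment figure. One commonly used definition, shown here purely as an assumption, divides the observed on-target fraction of sequenced bases by the fraction of the genome that the target regions occupy. The function name and all inputs below are hypothetical.

```python
def fold_enrichment(on_target_bases, total_bases, target_size_bp, genome_size_bp):
    """Observed on-target fraction divided by the fraction expected without capture.

    Assumed definition; the article does not specify how enrichment was computed.
    """
    on_target_fraction = on_target_bases / total_bases
    expected_fraction = target_size_bp / genome_size_bp
    return on_target_fraction / expected_fraction


# Toy numbers: half of the bases on target, target covering 0.1% of a toy genome.
print(fold_enrichment(on_target_bases=50, total_bases=100,
                      target_size_bp=1_000, genome_size_bp=1_000_000))
```

Under this definition, a capture panel that covers a small fraction of the genome can reach a several-thousand-fold enrichment even when only about half of the sequenced bases fall on target.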

Figure 1. Read length and sequence coverage distribution.

(a) Read length distribution of Nanopore targeted sequencing. Lines indicate the read length distribution for each sample and grey bars indicate the length distribution of targeted TRs. (b) Sequence coverage distribution of Nanopore targeted sequencing for all seven samples.

Genotyping of TRs using targeted long-read sequencing

Genotypes were estimated from targeted long-read sequencing datasets using our tool VNTRTyper¹² with the improvements described in the Methods. We also applied Tandem-genotypes²⁴ to determine the genotypes of TRs from long-read sequencing data. We required a minimum of two supporting reads to determine the repeat number for each allele.

Prior to obtaining any sequence data, we generated PCR sizing results as a gold standard on 10 targets for comparison to sequencing analysis. These 10 targets were selected to include various repeat unit length and repeat sequence combinations to assess the accuracy of the genotypes determined from sequencing data. Of these 10 targets, two were excluded from the comparison because all seven samples had too few spanning reads to genotype them (a minimum of two reads is required for genotyping). Genotype estimates from VNTRTyper on the remaining eight targets correlated well with PCR (Pearson correlation greater than 0.980 for all samples) (Table 1 and Extended data, Supplementary Figure 2)³⁰. Genotype estimates by Tandem-genotypes also correlated well with PCR, with a correlation greater than 0.984 for all samples (Extended data, Supplementary Table 3 and Supplementary Figure 3)³⁰; however, fewer targets had sufficient data to compare with PCR sizing results.

Table 1. Genotype estimates on Nanopore targeted capture sequencing using VNTRTyper.

Sample	Method	Genotype of Target*								Pearson correlation with PCR
Sample	Method	TR_8 (12.0)	TR_57 (15.6)	TR_86 (2.0)	TR_87 (9.0)	TR_93 (2.0)	TR_109 (15.3)	TR_112 (4.0)	TR_120 (2.2)	Pearson correlation with PCR
NA12877	PCR	12.0/13.0	10.6/13.6	2.0/2.0	6.0/8.0	2.0/2.0	15.3/17.3	3.0/4.0	2.2/2.2	0.9988
NA12877	Nanopore	12.0/13.0	ND	2.0/2.0	6.2/8.4	2.0/2.0	15.3/17.3	3.0/4.1	2.2/3.2	0.9988
NA12878	PCR	12.0/12.0	10.6/12.6	2.0/2.0	6.0/9.0	2.0/2.0	15.3/17.3	3.0/4.0	2.2/2.2	0.9978
NA12878	Nanopore	11.0/12.1	ND	2.0/2.0	6.0/8.8	2.0/2.0	14.8/17.1	3.0/4.0	2.2/3.2	0.9978
NA12879	PCR	12.0/13.0	10.6/10.6	2.0/2.0	8.0/9.0	2.0/2.0	15.3/17.3	3.0/3.0	2.2/2.2	0.9927
NA12879	Nanopore	12.4/12.4	10.6/10.6	2.0/2.0	6.0/8.6	2.0/2.0	14.8/16.9	3.0/4.0	2.2/3.2	0.9927
NA12881	PCR	12.0/12.0	10.6/10.6	2.0/2.0	8.0/9.0	2.0/2.0	15.3/15.3	3.0/4.0	2.2/2.2	0.9944
NA12881	Nanopore	12.0/12.0	10.6/10.6	2.0/2.0	8.0/9.0	ND	15.3/17.3	3.0/4.0	2.2/3.2	0.9944
NA12882	PCR	12.0/13.0	10.6/10.6	2.0/2.0	6.0/6.0	2.0/2.0	15.3/17.3	3.0/3.0	2.2/2.2	0.9919
NA12882	Nanopore	11.8/13.0	10.6/10.6	2.0/2.0	6.0/8.5	2.0/2.0	15.3/17.3	3.0/4.0	2.2/3.2	0.9919
NA12889	PCR	12.0/12.0	13.6/17.6	2.0/2.0	8.0/8.0	2.0/2.0	17.3/17.3	4.0/4.0	2.2/2.2	0.9800
NA12889	Nanopore	12.1/12.1	10.6/15.8	2.0/2.0	6.1/8.1	2.0/2.0	14.8/17.2	3.0/4.0	2.2/3.2	0.9800
NA12890	PCR	12.0/13.0	10.6/10.6	2.0/2.0	6.0/9.0	2.0/2.0	15.3/15.3	3.0/3.0	2.2/2.2	0.9936
NA12890	Nanopore	12.2/12.2	10.6/10.6	2.0/2.0	6.0/9.0	2.0/2.0	13.3/15.3	3.0/4.0	2.2/3.2	0.9936

*Repeat Number in reference hg19 is provided within brackets for each target

*Repeat numbers that do not agree with PCR results are highlighted in red.

ND – Sufficient data not available for genotype analysis

Genotype estimates from VNTRTyper and Tandem-genotypes for all 142 targets from Nanopore capture sequencing samples are provided in Extended data, Supplementary Spreadsheet Tables 1 and 2³⁰, respectively. Genotype estimates by VNTRTyper and Tandem-genotypes correlate well, with correlation values ranging from 0.904 to 0.994 for Nanopore targeted sequencing.

We were able to determine the genotype on average for 60% of the targets (range 48% to 75%) using VNTRTyper and 57% of the targets (range 41% to 74%) using Tandem-genotypes. Both VNTRTyper and Tandem-genotypes failed to genotype targets with low GC sequence content (<25% GC content) and targets which are greater than 2 kb in length, which accounts for approximately 22% of the targets (32 of the 142 targets). Targets with low GC sequence content (<25% GC content) did not have sufficient sequence coverage for analysis due to inefficient sequence enrichment in these regions (Extended data, Supplementary Figure 1)³⁰. Targets which are greater than 2 kb in length did not have sufficient spanning reads for genotyping analysis (See Methods and Figure 1a).

It was evident that the GC content and size (i.e. repeat length) of the target affected the genotyping efficiency of our targeted capture sequencing approach. Therefore, we assessed the genotyping rate based on the size and GC content of the target (Figure 2). For all 142 targets, the genotyping rate using VNTRTyper was only 59.8% (Figure 2a); however, the genotyping rate improved to 67% for 125 targets with a size threshold of 2 kb (Figure 2b) and 67.1% for 125 targets with a 25% GC threshold (Figure 2c). Furthermore, the genotyping rate improved to 75.2% for 110 targets with a combined 2 kb size threshold and 25% GC threshold (Figure 2d). The sample with the highest sequence coverage (NA12889) also had the highest genotyping rate, 90.9% for 110 targets (Extended data, Supplementary Figure 4)³⁰. The genotyping rate using Tandem-genotypes also improved to an average of 63.7% for 110 targets (range 43.6% to 85.5%) (Extended data, Supplementary Figure 5)³⁰.

Figure 2. Assessment of genotyping rate using VNTRTyper based on the size and GC content of the target for all seven samples.

Triangles indicate that more than 50% of the samples had a genotype estimate and circles indicate that fewer than 50% of the samples had a genotype estimate for the given target. Colours indicate the depth, which is defined as the number of spanning reads detected for the target region. Thick black lines inside the plots indicate the 2-kb size threshold and the 25% GC content threshold. (a) Genotyping rate for all targets, shown as repeat number vs period (i.e. repeat unit). (b) Genotyping rate with 2-kb size threshold. (c) Genotyping rate with 25% GC threshold. (d) Genotyping rate with both 25% GC threshold and 2-kb size threshold.

Genotyping of Tandem repeats using long-read whole genome sequencing

To investigate the accuracy of genotype estimates of TRs from targeted sequence capture compared to WGS, we performed genotyping analysis on the targeted regions using VNTRTyper and Tandem-genotypes on whole genome long-read sequencing data. We downloaded whole genome long-read Nanopore and PacBio sequencing data on the CEPH Pedigree 1463 NA12878 sample. We have previously reported genotyping estimates by VNTRTyper on PacBio NA12878 WGS data¹². Here we compare those VNTRTyper estimates on PacBio NA12878 WGS data with genotype estimates by Tandem-genotypes and with the results of targeted sequencing analysis.

We compared the accuracy of genotype estimates from WGS data with PCR sizing analysis. Genotype estimates by VNTRTyper and Tandem-genotypes on WGS data were compared with PCR sizing results on 10 targets (Table 2). VNTRTyper and Tandem-genotypes had comparable correlation with PCR sizing analysis for both Nanopore and PacBio WGS. Genotype estimates for all 142 targets from Nanopore and PacBio WGS data determined using VNTRTyper and Tandem-genotypes are provided in Extended data, Supplementary Spreadsheet Table 3³⁰.

Table 2. Genotype estimates on NA12878 WGS and Capture sequencing data using VNTRTyper and Tandem-genotypes.

Method^	Genotype of Target*										Pearson Correlation with PCR
Method^	TR_8 (12.0)	TR_32 (38.4)	TR_57 (15.6)	TR_64 (18.3)	TR_86 (2.0)	TR_87 (9.0)	TR_93 (2.0)	TR_109 (15.3)	TR_112 (6.0)	TR_120 (2.2)	Pearson Correlation with PCR
PCR	12.0/12.0	9.4/10.4	10.6/12.6	17.3/18.3	2.0/2.0	6.0/9.0	2.0/2.0	15.3/17.3	3.0/4.0	2.2/2.2
NPW_VT	12.0/12.0	8.2/10.7	10.6/12.6	17.3/18.6	2.0/2.0	6.0/9.0	2.0/2.0	15.3/17.3	3.0/4.0	2.2/3.2	0.9980
NPW_TG	11.0/12.0	4.4/9.4	10.6/12.6	18.3/19.3	2.0/2.0	6.0/9.0	2.0/2.0	15.3/17.3	3.0/4.0	2.2/2.2	0.9800
PBW_VT	12.0/12.0	11.0/12.4	10.6/12.6	18.3/19.3	2.0/2.0	6.0/9.0	2.0/2.0	15.3/17.3	3.0/4.0	2.2/2.2	0.9957
PBW_TG	12.0/12.0	ND	12.6/12.6	ND	2.0/2.0	ND	2.0/2.0	15.3/15.3	4.0/4.0	2.2/2.2	0.9900
NPC_VT	11.0/12.1	ND	ND	ND	2.0/2.0	6.0/8.8	2.0/2.0	14.8/17.1	3.0/4.0	2.2/3.2	0.9978
NPC_TG	11.0/12.0	ND	ND	ND	2.0/2.0	6.0/9.0	ND	15.3/17.3	3.0/4.0	2.2/2.2	0.9988

^NPW_VT – Nanopore WGS VNTRTyper; NPW_TG – Nanopore WGS Tandem-genotypes; PBW_VT – PacBio WGS VNTRTyper; PBW_TG – PacBio WGS Tandem-genotypes; NPC_VT – Nanopore Capture sequencing VNTRTyper; NPC_TG – Nanopore Capture sequencing Tandem-genotypes

*Repeat Number in reference hg19 is provided within brackets for each target

*Repeat numbers that do not agree with PCR results are highlighted in red.

ND – Sufficient data not available for genotype analysis.

We compared the genotype estimates between WGS data and targeted capture sequencing data (77 targets which had results for both WGS and targeted sequencing). Genotype estimates by VNTRTyper showed a correlation of 0.9782 between WGS data and targeted capture sequencing data (154 alleles) (Figure 3a). Genotype estimates by Tandem-genotypes showed a lower correlation of 0.7694 between WGS and targeted capture sequencing data (152 alleles from 76 targets; removing the two outliers improved the correlation to 0.9084) (Figure 3b). On the subset of seven targets for which we had generated PCR sizing analysis, VNTRTyper estimates from Nanopore WGS data agreed with 12 of the 14 allele estimates from Nanopore capture sequencing, with PCR sizing as the reference (Table 2 and Figure 4a). Tandem-genotypes estimates from Nanopore WGS data agreed with 11 of the 12 allele estimates from Nanopore capture sequencing, again with PCR sizing as the reference (Figure 4b).

Figure 3. Correlation between whole genome sequencing and targeted sequencing genotype estimates.

Using (a) VNTRTyper and (b) Tandem-genotypes for the NA12878 sample.

Figure 4. Comparison of genotype estimates between whole genome sequencing and target capture sequencing for NA12878 sample.

Using (a) VNTRTyper and (b) Tandem-genotypes. Red line indicates PCR sizing results. Targets with no genotype estimates are shown as a gap for the corresponding column.

Variation in Tandem repeats

To assess the extent of variation in repeat numbers between individuals, we compared the genotype estimates to the reported reference (hg19) repeat number. Genotype estimates determined by VNTRTyper on Nanopore capture sequencing on seven members of CEPH pedigree 1463 were used to assess the variation. We found that for a given sample, on average 51% (range 45–60%) of the targets have a genotype which is different to the reference, with more deletions (28%) than duplications (23%) (Figure 5).

Figure 5. Percentage difference between reported repeat number in reference genome (hg19) and estimated repeat number based on genotype estimates using VNTRTyper on Nanopore targeted sequencing.

Discussion

In this study, we present a targeted sequencing approach combined with long-read sequencing technology to genotype TRs. To our knowledge, this is the first report on genotyping analysis of hundreds of TRs using a targeted long-read sequencing approach. Sequencing reads that span the entire repeat region and flanking region are often useful in providing an accurate estimation of the repeat genotype. Long-read sequencing technologies can generate reads that span the entire repeat region and flanking regions. However, whole genome long-read sequencing analysis is still expensive for large-scale population analysis; hence, we developed a targeted long-read sequencing approach for TR analysis.

We showed that 1) target enrichment of repetitive sequences followed by long-read sequencing is feasible and 2) genotype predictions on targeted TR sequencing are comparable to the accuracy of PCR sizing analysis of repeats. Overall, we achieved an average genotyping rate of 75% for 110 TR loci with repeat length less than 2 kb and GC content greater than 25%. Genotyping rate improved to 91% for the highest-coverage sample, indicating that more sequencing could improve genotyping rate.

Targets with low GC sequence content (<25% GC content) did not have sufficient sequence coverage with targeted sequencing. We have previously performed short-read target capture on these regions¹² and observed low sequence coverage in low GC targets. However, both Nanopore and PacBio WGS data do not have any bias in sequence coverage in low GC regions. Hence, the lack of sequence coverage in low GC regions for targeted sequencing is likely due to the capture protocol. To overcome the issue of low capture efficiency for low GC regions, it is feasible to increase the number of probes in low GC regions during probe design. This will improve sequence enrichment in low GC targets. Furthermore, simulation tools^31,32, which can simulate sequencing data from probe sequences designed for capture sequencing, can be used to assess the probe design efficiency prior to sequencing. This would allow probe design to be improved in regions with low capture efficiency and would subsequently improve coverage.

We also observed that targets greater than 2 kb in length could not be genotyped due to the lack of spanning reads. This is primarily due to the limited sequence read length obtained from the capture process. The streptavidin beads used during capture have limitations on the size of the fragments they can bind, which limits the fragment length attainable with this capture protocol. Although there are longer TRs (greater than 2 kb) in the human genome, more than 99% of the TRs reported in the human reference genome (hg38) are less than 2 kb in length³. Therefore, our protocol would still be able to genotype most of the TRs in the human genome. TRs greater than 2 kb might need further optimized enrichment protocols.

Our target panel included eight STRs (out of the 142 targets) with longer expansions (>200 copies), and seven of these targets failed to genotype. However, three of these had low GC content and one was greater than 4 kb in repeat length. The longer expansions that failed to genotype also had low sequence coverage; however, due to the low number of targets, we could not conclusively identify the cause of failure for these targets.

We used VNTRTyper, an in-house genotyping tool described by Ganesamoorthy et al. (2018)¹², to determine the repeat number of TRs from long-read sequencing data. For comparison, we used Tandem-genotypes²⁴, a recently reported genotyping tool for the detection of TR expansions from long-read sequencing. Both genotyping methods were comparable to PCR sizing analysis, and genotype estimates were comparable between the two approaches. However, Tandem-genotypes genotyped fewer targets than VNTRTyper. The differences are likely due to the different algorithms used by the two methods. Both VNTRTyper and Tandem-genotypes use reads spanning the repeat region. However, for Tandem-genotypes the flanking length used for analysis depends on the length of the repeat unit, with a maximum of 100 bp on both sides of the repeat unit. VNTRTyper, on the other hand, uses a default flanking length of 30 bp, which can be changed. The longer flank length requirement may explain why Tandem-genotypes failed to genotype more targets than VNTRTyper.
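The effect of the flank requirement on the number of usable reads can be illustrated with a simplified spanning check. This sketch is not the implementation of either VNTRTyper or Tandem-genotypes: the function names and read coordinates are hypothetical, and only the 30 bp and 100 bp flank values come from the text.

```python
from typing import List, Tuple


def spans_repeat(read_start: int, read_end: int,
                 repeat_start: int, repeat_end: int,
                 flank: int = 30) -> bool:
    """True if the aligned read covers the repeat plus `flank` bp on each side."""
    return read_start <= repeat_start - flank and read_end >= repeat_end + flank


def count_spanning(reads: List[Tuple[int, int]],
                   repeat_start: int, repeat_end: int,
                   flank: int = 30) -> int:
    """Number of reads usable for genotyping under a given flank requirement."""
    return sum(spans_repeat(s, e, repeat_start, repeat_end, flank) for s, e in reads)


# Toy alignments (start, end) around a repeat at reference positions 1000-2000.
toy_reads = [(900, 2100), (960, 2040), (1000, 2000)]
usable_30bp = count_spanning(toy_reads, 1000, 2000, flank=30)
usable_100bp = count_spanning(toy_reads, 1000, 2000, flank=100)
```

With these toy reads, two reads satisfy a 30 bp flank but only one satisfies a 100 bp flank, which shows how a stricter flank requirement can leave a target with too few spanning reads.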

Variations in TRs are a major source of genomic variation between individuals. The TRs targeted in this study were initially selected because of the variation observed between case and control samples in an obesity analysis¹², and these TRs are variable in the population. We show that approximately 50% of the targeted TRs differ from the reported reference copy number. However, the major limitation of this analysis is that the sample size is small and the individuals are related, which introduces a bias. Nevertheless, these findings indicate the possibility of variation in TR copy number between individuals, and further large-scale studies are required to ascertain the extent of variation.

We demonstrated that genotype estimates from both WGS and targeted capture sequencing were comparable in accuracy to PCR sizing analysis. However, the targeted capture enrichment protocols used in this study include amplification steps, which can introduce errors in TR analysis. This could explain the differences in genotype estimates observed between WGS and targeted capture sequencing for some targets.

An amplification-free targeted analysis with long-read sequencing is an ideal option for accurate genotyping of TRs. Targeted cleavage with the Cas9 enzyme followed by Nanopore sequencing³³ or PacBio sequencing³⁴ has recently been reported as an alternative option for enrichment of regions of interest. This method involves no amplification and can be adapted for multiple targets in a single assay. However, the DNA input requirements are currently high and the sequencing output is low, which restricts wide use of this technique for large-scale analysis.

Selective sequencing approaches that utilise the real-time capabilities of Nanopore sequencing have recently been reported as an alternative way to enrich regions of interest^35,36. Selective sequencing maps a section of each read, as it is generated, to the regions of interest; if the fragment matches a region of interest, sequencing of the fragment continues, and if not, the fragment is ejected from the pore. This approach could be cost-effective for genotyping TRs, as it removes the need for specific sample preparation for target enrichment; however, its efficiency for TR analysis is yet to be determined.
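The decision logic of selective sequencing described above can be sketched as a simple interval check. This is a conceptual illustration only and does not reproduce any published tool: the function names, the tuple layout of a mapping, and the choice to eject unmapped fragments are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end); hypothetical layout


def overlaps_target(mapping: Region, targets: List[Region]) -> bool:
    """True if the mapped read section overlaps any region of interest."""
    chrom, start, end = mapping
    return any(chrom == t_chrom and start < t_end and end > t_start
               for t_chrom, t_start, t_end in targets)


def decide(mapping: Optional[Region], targets: List[Region]) -> str:
    """Return 'sequence' to keep reading the fragment, or 'eject' to reject it.

    Assumption: a read section that cannot be mapped is treated as off-target.
    """
    if mapping is not None and overlaps_target(mapping, targets):
        return "sequence"
    return "eject"


# Toy region of interest and toy mappings of the first section of three reads.
toy_targets = [("chr1", 400, 900)]
decisions = [
    decide(("chr1", 100, 500), toy_targets),  # overlaps the target
    decide(("chr2", 100, 500), toy_targets),  # different chromosome
    decide(None, toy_targets),                # unmapped
]
```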

The targeted long-read sequencing approach presented in this study is a cost-effective way to analyse hundreds of TRs simultaneously. Long-read Nanopore WGS can cost approximately $4000 for 30X coverage of the human genome, often with varying coverage across the genome. However, targeted long-read sequencing can be performed for a fraction of the cost (less than $300 per sample, depending on the multiplexing level) to enrich up to 25 Mb of genomic sequence of interest. The ability to analyse hundreds of TRs at a fraction of the cost makes it possible to explore TRs in large-scale studies.
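As a rough worked example of this cost comparison, the sketch below uses the approximate figures quoted above ($4000 per sample for WGS and $300 per sample as an upper bound for targeted sequencing). The budget value and the function name are hypothetical and chosen only for illustration.

```python
def samples_within_budget(budget: float, cost_per_sample: float) -> int:
    """Number of whole samples that fit within a fixed sequencing budget."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget // cost_per_sample)


WGS_COST = 4000       # approximate cost of 30X Nanopore WGS, from the text
TARGETED_COST = 300   # upper bound per sample for targeted sequencing, from the text
budget = 40000        # hypothetical study budget

wgs_samples = samples_within_budget(budget, WGS_COST)
targeted_samples = samples_within_budget(budget, TARGETED_COST)
fold_difference = WGS_COST / TARGETED_COST
```

Because $300 is an upper bound that depends on the multiplexing level, the number of samples for targeted sequencing is a conservative estimate.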

In summary, we present a targeted approach combined with long-read sequencing to enable cost-effective and accurate genotyping of TRs. Using this method, we have demonstrated the feasibility of targeted capture sequencing of repetitive sequences and of genotyping TRs using Nanopore long-read sequencing technology. Our targeted long-read sequencing approach would provide a cost-effective tool for large-scale population analysis of tandem repeats.

Data availability

Underlying data

NCBI BioProject: Capture Sequencing of Tandem Repeats. Accession number PRJNA422490, https://identifiers.org/ncbi/bioproject:PRJNA422490.

Extended data

Figshare: Supplementary Information for the "High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing" article. https://doi.org/10.6084/m9.figshare.12789278.v1³⁰.

This project contains the following extended data:

Supplementary_Information.pdf, which contains the following:
- Supplementary Figure 1. Sequence coverage distribution on targets vs GC%.
- Supplementary Figure 2. Correlation of genotype estimates between PCR sizing and VNTRTyper.
- Supplementary Figure 3. Correlation of genotype estimates between PCR sizing and Tandem-genotypes.
- Supplementary Figure 4. VNTRTyper genotyping rate with 25% GC and 2 kb size threshold.
- Supplementary Figure 5. Tandem-genotypes genotyping rate with 25% GC and 2 kb size threshold.
- Supplementary Table 1. PCR primer sequences.
- Supplementary Table 2. Targeted sequencing metrics for Nanopore capture sequencing of tandem repeats.
- Supplementary Table 3. Genotype estimates on Nanopore targeted capture sequencing using Tandem-genotypes.
Supplementary_Spreadsheet_Table1.csv (Genotype predictions on Nanopore capture sequencing data using VNTRTyper.)
Supplementary_Spreadsheet_Table2.csv (Genotype predictions on Nanopore capture sequencing data using Tandem-genotypes.)
Supplementary_Spreadsheet_Table3.csv (Genotype predictions on NA12878 sample Nanopore WGS data and PacBio WGS data using VNTRTyper and Tandem-genotypes.)
Supplementary_Information_PCR_data.pdf (Capillary electrophoresis results of PCR sizing analysis.)

Extended data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Acknowledgements

We thank Agilent Technologies for their technical assistance with targeted sequencing experiments.

References

1. Lander ES, Linton LM, Birren B, et al.: Initial sequencing and analysis of the human genome. Nature. 2001; 409(6822): 860–921.
2. Jurka J, Kapitonov VV, Kohany O, et al.: Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007; 8: 241–59.
3. Gelfand Y, Rodriguez A, Benson G: TRDB--the Tandem Repeats Database. Nucleic Acids Res. 2007; 35(Database issue): D80–7.
4. Gemayel R, Cho J, Boeynaems S, et al.: Beyond junk-variable tandem repeats as facilitators of rapid evolution of regulatory and coding sequences. Genes (Basel). 2012; 3(3): 461–480.
5. Armour JA: Tandemly repeated DNA: why should anyone care? Mutat Res. 2006; 598(1–2): 6–14.
6. Hannan AJ: TRPing up the genome: Tandem repeat polymorphisms as dynamic sources of genetic variability in health and disease. Discov Med. 2010; 10(53): 314–21.
7. Gemayel R, Vinces MD, Legendre M, et al.: Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu Rev Genet. 2010; 44: 445–477.
8. Bidwell JL, Bignon JD: DNA-RFLP methods and interpretation scheme for HLA-DR and DQ typing. Eur J Immunogenet. 1991; 18(1–2): 5–22.
9. Tagliabracci A, Buscemi L, Sassaroli C, et al.: Allele typing of short tandem repeats by capillary electrophoresis. Int J Legal Med. 1999; 113(1): 26–32.
10. Bahlo M, Bennett MF, Degorski P, et al.: Recent advances in the detection of repeat expansions with short-read next-generation sequencing [version 1; peer review: 3 approved]. F1000Res. 2018; 7: F1000 Faculty Rev-736.
11. Duitama J, Zablotskaya A, Gemayel R, et al.: Large-scale analysis of tandem repeat variability in the human genome. Nucleic Acids Res. 2014; 42(9): 5728–5741.
12. Ganesamoorthy D, Cao MD, Duarte T, et al.: GtTR: Bayesian estimation of absolute tandem repeat copy number using sequence capture and high throughput sequencing. BMC Bioinformatics. 2018; 19(1): 267.
13. Gymrek M, Golan D, Rosset S, et al.: lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 2012; 22(6): 1154–62.
14. Highnam G, Franck C, Martin A, et al.: Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013; 41(1): e32.
15. Cao MD, Tasker E, Willadsen K, et al.: Inferring short tandem repeat variation from paired-end short reads. Nucleic Acids Res. 2014; 42(3): e16.
16. Willems T, Zielinski D, Yuan J, et al.: Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017; 14(6): 590–592.
17. Dashnow H, Lek M, Phipson B, et al.: STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018; 19(1): 121.
18. Dolzhenko E, van Vugt JJF, Shaw RJ, et al.: Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 2017; 27(11): 1895–1903.
19. Mousavi N, Shleizer-Burko S, Yanicky R, et al.: Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019; 47(15): e90.
20. Schule B, McFarland KN, Lee K, et al.: Parkinson's disease associated with pure ATXN10 repeat expansion. NPJ Parkinsons Dis. 2017; 3: 27.
21. De Roeck A, De Coster W, Bossaerts L, et al.: Accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION. Genome Biol. 2019; 20(1): 239.
22. Ebbert MTW, Farrugia SL, Sens JP, et al.: Long-read sequencing across the C9orf72 'GGGGCC' repeat expansion: implications for clinical use and genetic discovery efforts in human disease. Mol Neurodegener. 2018; 13(1): 46.
23. Liu Q, Zhang P, Wang D, et al.: Interrogating the "unsequenceable" genomic trinucleotide repeat disorders by long-read sequencing. Genome Med. 2017; 9(1): 65.
24. Mitsuhashi S, Frith MC, Mizuguchi T, et al.: Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol. 2019; 20(1): 58.
25. Karamitros T, Magiorkinis G: Multiplexed Targeted Sequencing for Oxford Nanopore MinION: A Detailed Library Preparation Procedure. Methods Mol Biol. 2018; 1712: 43–51.
26. Jain M, Koren S, Miga KH, et al.: Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018; 36(4): 338–345.
27. Pendleton M, Sebra R, Pang AWC, et al.: Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods. 2015; 12(8): 780–6.
28. Li H: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018; 34(18): 3094–3100.
29. Kielbasa SM, Wan R, Sato K, et al.: Adaptive seeds tame genomic sequence comparison. Genome Res. 2011; 21(3): 487–93.
30. Ganesamoorthy D, Yan M, Murigneux V, et al.: Supplementary Information for the "High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing" article. figshare. Dataset. 2020. http://www.doi.org/10.6084/m9.figshare.12789278.v1
31. Kim S, Jeong K, Bafna V: Wessim: a whole-exome sequencing simulator based on in silico exome capture. Bioinformatics. 2013; 29(8): 1076–7.
32. Cao MD, Ganesamoorthy D, Zhou C, et al.: Simulating the dynamics of targeted capture sequencing with CapSim. Bioinformatics. 2018; 34(5): 873–874.
33. Gilpatrick T, Lee I, Graham JE, et al.: Targeted nanopore sequencing with Cas9-guided adapter ligation. Nat Biotechnol. 2020; 38(4): 433–438.
34. Hafford-Tear NJ, Tsai YC, Sadan AN, et al.: CRISPR/Cas9-targeted enrichment and long-read sequencing of the Fuchs endothelial corneal dystrophy-associated TCF4 triplet repeat. Genet Med. 2019; 21(9): 2092–2102.
35. Payne A, Holmes N, Clarke T, et al.: Nanopore adaptive sequencing for mixed samples, whole exome capture and targeted panels. bioRxiv. 2020; 2020.02.03.926956.
36. Kovaka S, Fan Y, Ni B, et al.: Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. bioRxiv. 2020; 2020.02.03.931923.
