High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing

Tandem repeats (TRs) are highly prone to variation in copy number because of their repetitive and unstable nature, which makes them a major source of genomic variation between individuals. However, population variation of TRs has not been widely explored because existing tools are either low-throughput or restricted to a small subset of TRs. Here, we combined a SureSelect targeted sequencing approach with Nanopore sequencing to overcome these limitations. We achieved an average of 3062-fold target enrichment on a panel of 142 TR loci, generating an average of 97X sequence coverage on 7 samples using 2 MinION flow cells with 200 ng of input DNA per sample. We identified a subset of 110 TR loci with length less than 2 kb and GC content greater than 25%, for which we achieved an average genotyping rate of 75%, increasing to 91% for the highest-coverage sample. Alleles estimated from targeted long-read sequencing were concordant with gold-standard PCR sizing analysis and highly correlated with alleles estimated from whole-genome long-read sequencing. We demonstrate a targeted long-read sequencing approach that enables simultaneous analysis of hundreds of TRs with accuracy comparable to PCR sizing analysis. Our approach can feasibly be scaled to more targets and more samples, facilitating large-scale analysis of TRs.

[...] well and the correlation values range from 0.904 to 0.994 for Nanopore targeted sequencing. We were able to determine the genotype on average for 60% of the targets (range 48% to 75%) using VNTRTyper and 57% of the targets (range 41% to 74%) using Tandem-genotypes. Both VNTRTyper and Tandem-genotypes failed to genotype targets with low GC content (< 25% GC) and targets greater than 2 kb in length, which account for approximately 22% of the targets (32 of the 142 targets). Targets with low GC content (< 25% GC) did not have sufficient sequence coverage for analysis due to [...] Methods and Figure 1a).

It was evident that the GC content and the size (i.e. repeat length) of the target affected the genotyping efficiency of our targeted capture sequencing approach. Therefore, we assessed [...] improved to 75.2% for 110 targets with a combined 2 kb size threshold and 25% GC [...] Tandem-genotypes also improved to an average of 63.7% for 110 targets (range 43.6% to [...] Tandem-genotypes and the results of targeted sequencing analysis.

We compared the accuracy of genotype estimates from WGS data with PCR sizing analysis. We also compared the genotype estimates between WGS data and targeted capture sequencing data (77 targets which had results for both WGS and targeted sequencing). Genotype [...] Tandem-genotypes had lower correlation between WGS and targeted capture sequencing data [...] which we had generated PCR sizing analysis, Nanopore WGS data correlated with 12/14 genotype estimates on Nanopore capture sequencing using VNTRTyper precisely compared to PCR sizing (Table 2 and Figure 4a). Genotype estimates using Tandem-genotypes on Nanopore WGS data correlated with 11/12 genotype estimates on Nanopore capture sequencing precisely compared to PCR sizing (Figure 4b).
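As an illustration of the WGS versus targeted comparison described above, the following is a minimal sketch of how copy-number estimates from two data sets could be paired on their shared targets and correlated. It assumes a Pearson coefficient, which the text does not specify; the target names, values and function names are hypothetical and are not taken from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


def shared_estimates(wgs, targeted):
    """Pair up estimates for targets that were genotyped in both data sets."""
    shared = sorted(set(wgs) & set(targeted))
    return [wgs[t] for t in shared], [targeted[t] for t in shared]


# Hypothetical copy-number estimates keyed by made-up target names.
wgs_calls = {"TR_a": 10.0, "TR_b": 25.0, "TR_c": 7.0, "TR_d": 40.0}
targeted_calls = {"TR_a": 11.0, "TR_b": 24.0, "TR_c": 7.0, "TR_e": 3.0}

x, y = shared_estimates(wgs_calls, targeted_calls)
r = pearson_r(x, y)
print(f"{len(x)} shared targets, r = {r:.3f}")
```

Only targets with a call in both data sets enter the correlation, mirroring the restriction to targets with results for both WGS and targeted sequencing.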
To assess the extent of variation in repeat numbers between individuals, we compared the genotype estimates to the reported reference (hg19) repeat number. Genotype estimates [...] pedigree 1463 were used to assess the variation. We found that for a given sample, on average 51% (range 45% to 60%) of the targets have a genotype that differs from the reference, with more deletions (28%) than duplications (23%) (Figure 5).

In this study, we presented a targeted sequencing approach combined with long-read sequencing technology to genotype TRs. To our knowledge, this is the first report on genotyping analysis of hundreds of TRs using a targeted long-read sequencing approach. [...] have the ability to generate reads which can span the entire repeat region and flanking regions. However, whole-genome long-read sequencing analysis is still expensive for large-scale population analysis; hence, we developed a targeted long-read sequencing approach for TR analysis. We showed that 1) target enrichment of repetitive sequences followed by long-read sequencing is feasible and 2) genotype predictions on targeted TR sequencing are comparable to the accuracy of PCR sizing analysis of repeats. Overall, we achieved an average genotyping rate of 75% for 110 TR loci with repeat length less than 2 kb and GC content greater than 25%.
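The size and GC thresholds and the genotyping rate quoted above can be made concrete with a short sketch. It assumes that the thresholds are applied to the reference sequence of each repeat region and that the genotyping rate is the fraction of targets with a call; the `Target` class, the example sequences and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Target:
    name: str                                # hypothetical target identifier
    sequence: str                            # reference sequence of the repeat region
    genotype: Optional[Tuple[float, float]]  # allele copy numbers, None if no call


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_filter(target: Target, max_len: int = 2000, min_gc: float = 0.25) -> bool:
    """Keep targets shorter than max_len bases with GC content above min_gc."""
    return len(target.sequence) < max_len and gc_fraction(target.sequence) > min_gc


def genotyping_rate(targets) -> float:
    """Fraction of targets for which a genotype was called."""
    if not targets:
        return 0.0
    return sum(t.genotype is not None for t in targets) / len(targets)


# Made-up targets covering each case: passing, low GC, too long, passing without a call.
targets = [
    Target("TR_a", "GCGCAT" * 10, (10.0, 12.0)),
    Target("TR_b", "ATATAT" * 10, None),
    Target("TR_c", "ACGT" * 600, None),
    Target("TR_d", "AACG" * 20, None),
]

kept = [t for t in targets if passes_filter(t)]
print("all targets:", genotyping_rate(targets))
print("filtered targets:", genotyping_rate(kept))
```

Reporting the rate on the filtered subset separates failures caused by target properties (low GC, long repeats) from failures among targets the assay can reasonably capture.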

Genotyping rate improved to 91% for the highest-coverage sample, indicating that more sequencing could improve the genotyping rate.

Targets with low GC content (< 25% GC) did not have sufficient sequence coverage with targeted sequencing. We have previously performed short-read target capture on these regions [12] and observed low sequence coverage in low-GC targets. However, neither Nanopore nor PacBio WGS data showed any bias in sequence coverage in low-GC regions. Hence, the lack of sequence coverage in low-GC regions for targeted sequencing is [...] design. This will improve sequence enrichment in low-GC targets.

We also observed that targets greater than 2 kb in length could not be genotyped due to the lack of spanning reads for genotyping analysis. This is primarily due to the limitation in sequence [...] number of copies) and 7 of these targets failed to genotype. However, 3 of these had low GC content and one was greater than 4 kb in repeat length. The longer expansions which failed to genotype also had low sequence coverage; however, due to the low number of targets, we could not conclusively identify the cause of failure for these targets.

[...] [12] to determine the repeat number of TRs from long-read sequencing technologies. For [...] approaches. However, Tandem-genotypes genotyped fewer targets than VNTRTyper. The difference is likely due to the different algorithms used by the two methods. Both VNTRTyper and Tandem-genotypes use reads spanning the repeat region. However, for [...] possibly failed to genotype more targets compared to VNTRTyper.

Variations in TRs are a major source of genomic variation between individuals. TRs targeted in this study were initially selected due to the variation observed between case and control [...] the major limitation in this analysis is that the sample size is small and the individuals are [...] studies are required to ascertain the extent of variation.
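The dependence on spanning reads noted above can be illustrated with a small sketch. It assumes 0-based, half-open reference coordinates for both the aligned read and the repeat, and a hypothetical minimum flank of 50 bases on each side; neither the flank requirement nor the function names come from VNTRTyper or Tandem-genotypes.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_flank=50):
    """True if an aligned read covers the whole repeat plus min_flank bases per side."""
    return (read_start <= repeat_start - min_flank
            and read_end >= repeat_end + min_flank)


def count_spanning(alignments, repeat_start, repeat_end, min_flank=50):
    """Count alignments, given as (start, end) pairs, that span the repeat."""
    return sum(
        spans_repeat(start, end, repeat_start, repeat_end, min_flank)
        for start, end in alignments
    )


# Hypothetical 2.5 kb repeat and four reads of roughly 3 kb, the shearing size used here.
repeat = (10_000, 12_500)
reads = [(9_000, 12_000), (9_900, 12_600), (9_800, 12_800), (11_000, 14_000)]

print("spanning reads, 50-base flank:", count_spanning(reads, *repeat))
print("spanning reads, 150-base flank:", count_spanning(reads, *repeat, min_flank=150))
```

With fragments sheared to about 3 kb, a repeat longer than 2 kb leaves little room for flanking sequence, so few reads satisfy the spanning condition.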
We demonstrated that the accuracy of genotype estimates between WGS and targeted capture [...] observed between WGS and targeted capture sequencing for some targets. An amplification-free targeted analysis with long-read sequencing is an ideal option for this technique for large-scale analysis.

The targeted long-read sequencing approach presented in this study is a cost-effective approach to analyse hundreds of TRs simultaneously. Long-read Nanopore WGS can cost [...] allows exploration of TRs in large-scale studies.

In summary, we present a targeted approach combined with long-read sequencing to enable cost-effective and accurate genotyping of TRs. Using this [...] repetitive sequences and genotyping TRs using Nanopore long-read sequencing technology.

[...] reference human genome (hg19) and the number of repeat units ranges from 2 to 2300 repeats.

TRs used in this study were selected as part of another study investigating the association between TRs and obesity; these targeted TRs are not disease associated. Agilent SureSelect DNA design (Agilent Technologies) was used to design target probes to capture [...]

Nanopore targeted sequencing of TRs: All 7 family members from the CEPH pedigree 1463 were used for Nanopore targeted sequencing analysis (NA12877, NA12878, NA12879, NA12881, NA12882, NA12889 and NA12890). Target sequence capture for Nanopore sequencing was performed using Agilent [...] to 3 kb using Covaris Blue miniTUBE (Covaris). Greater than 90% of the targeted TRs are less than 3 kb, and the SureSelect capture protocol works effectively on fragments less than 4 kb in length; therefore, DNA products were sheared to 3 kb. Fragmented DNA was end repaired, adapter ligated and amplified prior to target capture. Extension time for pre-capture [...] capture PCR products were purified using 0.8X-1X AMPure XP beads (Beckman Coulter). Nanopore sequencing library preparation was performed using 1D Native barcoding genomic [...] 65 °C for 15 min. End-repaired products were ligated with unique native barcodes.

Purification steps after end repair and barcode ligation were avoided to minimize the loss of [...] Adapter-ligated samples were purified using 0.4X AMPure XP beads (Beckman Coulter).

Samples were split into 2 sequencing groups: NA12877, NA12878, NA12879 and NA12890 [...] For Nanopore sequencing '-ax map-ont' and for PacBio WGS '-ax map-pb' parameters were [...] repeat units in a read using a profile HMM. Recently, we further improved the accuracy of genotyping estimates by clustering the copy