NAD: Noise-augmented direct sequencing of target nucleic acids by augmenting with noise and selective sampling

Hyunjin Shim

doi:10.12688/f1000research.163516.1

Method Article

[version 1; peer review: 1 approved, 1 approved with reservations]

PUBLISHED 10 Apr 2025

Department of Biology, California State University Fresno, 5241 N Maple Ave, Fresno, California, 93740, USA

Hyunjin Shim
Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Resources, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Nanopore Analysis gateway.

This article is included in the Genomics and Genetics gateway.

Abstract

Background

Next-generation sequencing necessitates a minimum quantity and concentration of DNA/RNA samples, typically achieved through amplification using the PCR technique. However, this amplification step introduces several drawbacks to biological insights, including PCR bias and the loss of epigenetic information. The advent of long-read sequencing technologies facilitates direct sequencing, with the primary constraint being the limited amount of DNA/RNA present in biological samples.

Methods

Here, we present a novel method called Noise-Augmented Direct (NAD) sequencing that enables the direct sequencing of target DNA even when it falls below the minimum quantity and concentration required for long-read sequencing by augmenting with noise DNA and adaptive sampling. Adaptive sampling is an emerging technology of nanopore sequencing, allowing the enhanced sequencing of target DNA by selectively depleting noise DNA. In this study, we use the DNA standard of the Lambda phage genome as the noise DNA to augment samples containing low amounts of bacterial genomes (1 ng to 300 ng).

Results

The results with cost-effective flow cells indicate that NAD sequencing successfully detects the target DNA with an input quantity as low as 1 ng, and the bacterial genome of Salmonella enterica can be assembled to 30% completion at an accuracy of 98% with an input quantity of 3 ng. With high throughput flow cells, the bacterial genome of Pseudomonas aeruginosa was assembled to near completion (99.9%) at an accuracy of 99.97% with an input quantity of 300 ng.

Conclusions

This proof-of-concept study demonstrates the potential of NAD sequencing in enhancing the robustness of long-read sequencing for small input DNA/RNA samples with noise augmentation and adaptive sampling.

Keywords

Native DNA/RNA sequencing, Data augmentation, White noise, Long-read sequencing, Adaptive sampling, Metagenomics, Diagnostics

Corresponding author: Hyunjin Shim

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2025 Shim H. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Shim H. NAD: Noise-augmented direct sequencing of target nucleic acids by augmenting with noise and selective sampling [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2025, 14:423 (https://doi.org/10.12688/f1000research.163516.1) First published: 10 Apr 2025, 14:423 (https://doi.org/10.12688/f1000research.163516.1) Latest published: 10 Apr 2025, 14:423 (https://doi.org/10.12688/f1000research.163516.1)

Introduction

Long-read sequencing is revolutionizing DNA/RNA sequencing by simplifying the workflow of genomic research with the advantages of long reads.¹ This novel technology has contributed to numerous unsolved problems in the field of genomics, including the completion of the human genome through the Telomere-to-Telomere Consortium (T2T)² and the Human Pangenome Reference Consortium (HPRC).^3–5 Among the long-read sequencing technologies, nanopore sequencing has several unique advantages by utilizing the biophysical properties of nanopores and nucleic acids.^6,7 This technology measures changes in ionic current flows across a protein nanopore as a string of nucleic acids passes through to reconstruct the genetic sequence from the electric signals.⁸ This signal conversion process further employs advanced computational methods such as neural networks and parallel computing, leveraging recent technological advancements spanning various fields.^9,10 Notable advantages of nanopore sequencing over other sequencing methods include the ability to sequence native DNA/RNA and the ability to sequence selectively through depletion or enrichment. Direct sequencing of native DNA/RNA generates additional information on the sequences. This epigenetic information, such as the detection of DNA and RNA methylation, plays a key role in understanding many diseases, including cancer.^11,12 Furthermore, nanopore sequencing has an important feature of adaptive sampling, which selectively depletes or enriches sequences of interest.¹³ This is achieved by dynamically controlling the voltage across individual nanopores to eject unwanted sequences or allow targeted sequences to continue through the pore, based on rapid basecalling and alignment to reference genomes. Adaptive sampling leverages advanced neural networks and parallel computing to perform real-time analysis and decision-making, allowing researchers to focus sequencing resources on regions or organisms of interest.⁹

Despite these advantages, nanopore sequencing also requires a high input quantity of DNA/RNA to fully leverage these features of direct sequencing and adaptive sampling. For instance, most protocols recommend an input DNA/RNA quantity of at least 500 ng and 1,000 ng for Flongle flow cells and MinION flow cells, respectively. This high input quantity of DNA/RNA is necessary for the adapter efficiency in the ligation preparation, as well as the long-read sequencing of DNA/RNA.¹⁴ This requirement is a limiting factor in utilizing nanopore sequencing in genomic studies involving biological samples with small quantities of DNA/RNA. In scenarios involving small DNA/RNA quantities within samples, an extra PCR step is often employed to amplify these scarce sequences, undermining the capacity of nanopores to capture epigenetic information. Moreover, it necessitates the quantification of DNA/RNA during each sequencing experiment, typically with a stringent quantity threshold. These experimental criteria often lead to the abandonment of direct sequencing or long-read sequencing in numerous studies.¹⁵

Here, we introduce a novel method called Noise-Augmented Direct (NAD) sequencing, which augments a sample containing a small DNA quantity (<500 ng) with ‘noise’ DNA so that only the target sequence is selectively sequenced using adaptive sampling (Figure 1). Noise or background DNA, such as human DNA during sequencing experiments, is commonly considered an undesirable form of contamination that needs to be identified and eliminated.^16,17 However, the concept of adding noise to increase the generalization performance with real-world data is widely utilized in other fields. For instance, data augmentation with white noise is known to improve the accuracy of deep-learning models in real-world testing.¹⁸ Deep learning has been widely used for various artificial intelligence (AI) tasks such as speech and image recognition and natural language processing.¹⁹ However, the generalization performance of deep learning drops drastically when tested on noisy real-world data.²⁰ One of the strategies to circumvent this pitfall is to train the neural networks using data augmented with some random noise to increase the robustness and generalization power.²¹

Figure 1. Process of Noise-Augmented Direct (NAD) sequencing.

The target DNA/RNA is extracted from a sample of interest (e.g. human nasal swab) and augmented with the noise DNA/RNA. The sequencing library of the target and noise DNA/RNA is directly sequenced in real-time using adaptive sampling by depleting the noise DNA/RNA. This allows direct sequencing of the target DNA/RNA in a native form, without the need to amplify the target DNA/RNA that is below the minimum quantity and concentration of long-read sequencing requirements.

In this study, we adopt a similar line of reasoning to increase the generalization performance of nanopore sequencing in real-world datasets by adding a controlled amount of ‘noise’ DNA to a biological sample and exploiting the ability of adaptive sampling to enrich the target DNA by selectively depleting the noise DNA. In the context of sequencing experiments, the noise DNA is a preselected sequence that is randomly sequenced by nanopores. It should be a sequence long enough (a minimum of 450 base pairs) to allow adaptive sampling to make a decision, but not so long that it significantly increases mapping time. Here, we use the lambda phage genome as the noise DNA and two bacterial genomes (Pseudomonas aeruginosa and Salmonella enterica) as the target DNA. We demonstrate that NAD sequencing can detect the target DNA with an input quantity as low as 1 ng in cost-effective Flongle flow cells. Furthermore, NAD sequencing can efficiently assemble parts of the bacterial genome, with a target DNA quantity as low as 3 ng (30% complete with an accuracy of 97.58%) in cost-effective Flongle flow cells, and a target DNA quantity of 300 ng (99.9% complete with an accuracy of 99.97%) in high-throughput MinION flow cells. This result demonstrates the potential of NAD sequencing as a practical method for detecting and assembling target DNA of limited quantity. This proof-of-concept study underscores the effectiveness of injecting noise DNA to augment target DNA to expand the applicability of long-read sequencing to real-world settings where biological samples often contain limited nucleic acids. We further discuss the necessity of enhancing the integration of computational processing power to handle the vast datasets generated by NAD sequencing, particularly when incorporating adaptive sampling in real time.

Results

Quality control of NAD sequencing experiments

Microbial DNA standards serve as a benchmark to evaluate the performance along the workflow of genomics analyses and as a tool to increase reproducibility. Before noise augmentation, the quality and quantity of the DNA standards were checked using a DNA Spectrophotometer (DS-11 Series from DeNovix). The results show that the DNA standard of the noise genome (Lambda phage) was as stated by the manufacturers at the concentration of ~600 ng/μL, meeting the quality criteria of A260/280 and A260/230 at ~1.8 and ~2.0, respectively (Extended Data: Figure S1 and Table S1). However, the target DNA samples were below the minimum concentration range of the DNA Spectrophotometer. For example, the target DNA sample of P. aeruginosa was measured to be 15.322 ng/μL before augmenting with the noise DNA (Extended Data: Table S1), which was close to the concentration stated by the manufacturers of 10 ng/μL. Due to the minimum concentration threshold, the quality control values of 260/280 and 260/230 were not relevant for these initial target DNA samples.

Data output of NAD sequencing experiments

The target DNA samples that were below the minimum concentration were augmented with the noise DNA to meet the input DNA criteria for nanopore sequencing (Extended Data: Tables S2 and S3). The augmented DNA samples of Pseudomonas aeruginosa and Salmonella enterica were ligated and sequenced in MinION flow cells and Flongle flow cells (Oxford Nanopore Technologies), respectively.
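The augmentation arithmetic can be sketched as follows. This is a minimal illustration, not the author's protocol: the helper names are hypothetical, the masses come from Table 1, and the ~600 ng/μL noise stock concentration is the value reported in the quality-control section. Note that the runs in Table 1 added more noise DNA than the bare minimum computed here.

```python
# Sketch only: helper names are hypothetical; values are taken from Table 1
# and from the ~600 ng/uL Lambda stock reported in the quality-control section.

def min_noise_ng(target_ng: float, required_ng: float) -> float:
    """Smallest noise mass (ng) that brings the total input up to required_ng."""
    return max(0.0, required_ng - target_ng)


def stock_volume_ul(mass_ng: float, conc_ng_per_ul: float) -> float:
    """Volume of stock (uL) that contains mass_ng at the given concentration."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return mass_ng / conc_ng_per_ul


# (target ng, noise ng actually added, recommended input ng) per run, Table 1.
RUNS = {
    "Salmonella_1ng": (1, 600, 500),
    "Salmonella_3ng": (3, 600, 500),
    "Salmonella_5ng": (5, 600, 500),
    "P_aeruginosa": (300, 5100, 1000),
}

NOISE_STOCK_NG_PER_UL = 600.0  # approximate concentration of the noise stock

for name, (target, noise, required) in RUNS.items():
    total = target + noise
    print(
        f"{name}: total input = {total} ng, "
        f"meets recommendation = {total >= required}, "
        f"minimum noise = {min_noise_ng(target, required):.0f} ng, "
        f"noise volume used = {stock_volume_ul(noise, NOISE_STOCK_NG_PER_UL):.2f} uL"
    )
```

The totals reproduce the "Total input DNA" row of Table 1 (601 ng, 603 ng, 605 ng, and 5,400 ng).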

The noise-augmented sample of P. aeruginosa had two technical replicates, where the first replicate was run for only 13 hours due to a flow cell error (Table 1). The second replicate was run for 96 hours to fully exploit the capacity of the MinION flow cell. However, the quality score of real-time basecalling declined significantly after ~72 hours, which is a typical runtime of MinION flow cells, despite replenishing with the flush buffer (data not shown). The second replicate had a better experimental output, with more than twice as many bases sequenced (4.79 Gb) as the first replicate (2.14 Gb). This data output was below the theoretical output of MinION flow cells of 50 Gb, but achieving ~10% of the theoretical capacity was typical of the data sizes generated with nanopore sequencing in previous experiments.¹⁵ The second replicate also had a better N50 estimate (40,320 b), which was around 6 times longer than that of the first replicate (6,630 b; Table 1 and Extended Data: Figure S2). The estimated N50 from the reads shows a satisfactory performance of high-molecular-weight DNA sequencing.

Table 1. Experimental parameters of NAD sequencing experiments.

	Salmonella_1ng	Salmonella_3ng	Salmonella_5ng	P_aeruginosa_1	P_aeruginosa_2
Flow cell	Flongle 9.4.1	Flongle 9.4.1	Flongle 9.4.1	MinION 9.4.1	MinION 9.4.1
Input DNA needed	500 ng	500 ng	500 ng	1,000 ng	1,000 ng
Target DNA	1 ng	3 ng	5 ng	300 ng	300 ng
Noise DNA	600 ng	600 ng	600 ng	5,100 ng	5,100 ng
Total input DNA	601 ng	603 ng	605 ng	5,400 ng	5,400 ng
Run time	72 hours	72 hours	72 hours	13 hours	96 hours
Estimated bases	148.61 Mb	231.59 Mb	96.77 Mb	2.14 Gb	4.79 Gb
Data produced	2.2 GB	3.12 GB	1.15 GB	26.79 GB	50.6 GB
Reads generated	129.18 k	151.4 k	44.05 k	1.36 M	1.3 M
Estimated N50	672 b	708 b	11,440 b	6,630 b	40,320 b

The noise-augmented sample of S. enterica had three experimental runs with a range of target DNA quantities (1 ng, 3 ng, and 5 ng). All the experiments were run for 72 hours to fully exploit the capacity of the Flongle flow cells. However, the quality score of real-time basecalling declined significantly after ~24 hours, which is a typical runtime of Flongle flow cells (data not shown). All the runs generated roughly 3% to 8% of the theoretical capacity of Flongle flow cells of 2.8 Gb (Table 1). The N50 estimates of all the sequencing runs showed much shorter read lengths than expected of high-molecular-weight DNA (Extended Data: Figure S2). The sequencing run with the largest amount of target DNA at 5 ng shows the largest estimated N50 at 11,440 b. The estimated N50 correlates with the quantity of the target DNA, indicating that the low estimated N50 may arise as an artifact of sequencing the noise DNA of the Lambda phage genome.
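As a quick check on the yield figures, the fraction of theoretical capacity reached by each run can be recomputed from the "Estimated bases" row of Table 1 and the theoretical outputs quoted in the text (2.8 Gb for Flongle, 50 Gb for MinION). The function and variable names below are illustrative only.

```python
# Theoretical outputs quoted in the text, in gigabases.
THEORETICAL_GB = {"Flongle": 2.8, "MinION": 50.0}

# (run, flow cell type, estimated bases in Gb) from Table 1.
RUNS = [
    ("Salmonella_1ng", "Flongle", 0.14861),
    ("Salmonella_3ng", "Flongle", 0.23159),
    ("Salmonella_5ng", "Flongle", 0.09677),
    ("P_aeruginosa_1", "MinION", 2.14),
    ("P_aeruginosa_2", "MinION", 4.79),
]


def capacity_fraction(flow_cell: str, bases_gb: float) -> float:
    """Fraction of the flow cell's theoretical output that was sequenced."""
    return bases_gb / THEORETICAL_GB[flow_cell]


for run, flow_cell, bases_gb in RUNS:
    print(f"{run}: {100 * capacity_fraction(flow_cell, bases_gb):.1f}% of {flow_cell} capacity")
```

Under these inputs the Flongle runs land at roughly 3.5% to 8.3% of theoretical capacity, and the two MinION replicates at roughly 4.3% and 9.6%.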

Adaptive sampling of NAD sequencing experiments

In NAD sequencing of the noise-augmented samples, the adaptive sampling feature of nanopore sequencers was used to enrich the target DNA by depleting the noise DNA. The read length graphs from the nanopore report show there are two peaks in all the NAD sequencing experiments (Extended Data: Figure S2). The first peaks are at the read length of 500 b, while the second peaks are at the read length of ~45 kb. The bimodal distribution of these read length graphs indicates successful adaptive sampling in these experiments, with the first peak representing the noise DNA that was rejected within a few seconds after being sequenced with a nanopore. The second peak likely represents the noise DNA of Lambda phage genomes with the length of 48,502 base pairs. The accepted reads of each experiment that were classified as the target DNA had lengths greater than 4,000 b in both the Pseudomonas and Salmonella samples (Extended Data: Figure S3). Furthermore, the N50 values calculated for these accepted reads classified as the target DNA were over 17 kb for the Salmonella samples (Extended Data: Figure S3A), and over 29 kb for the Pseudomonas sample (Extended Data: Figure S3B).
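Because N50 is used throughout to summarize read and contig lengths, a minimal sketch of the statistic may help explain why a bimodal length distribution can yield a low run-level N50. The read lengths in the usage lines are toy values chosen for illustration, not data from this study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: a few long reads dominate the total bases, so N50 is long.
few_rejected = [500] * 10 + [45_000] * 2

# Toy example: many short rejected reads carry most of the bases, so N50 is short.
many_rejected = [500] * 200 + [45_000]

print(n50(few_rejected), n50(many_rejected))
```

In the second toy case the short rejected fragments hold more than half of the sequenced bases, so the N50 collapses to the rejection length even though a full-length read is present. This is one way the depletion of noise DNA could pull down the overall N50 reported for a run.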

During adaptive sampling, each read passing the nanopore was mapped against the reference genome of the Lambda phage while sequencing (Extended Data: Table S4). There are three types of decisions in adaptive sampling of nanopore sequencing: “no_decision,” “stop_receiving,” and “unblock.” A “stop_receiving” is a decision made when the read is accepted and fully sequenced, based on the real-time basecalling and mapping. On the other hand, “no_decision” occurs when the read continues through the nanopore with no decision to either accept or reject it and becomes fully sequenced. An “unblock” is a decision made when the read is stopped and rejected by reversing the voltage, based on the real-time basecalling and mapping.

To evaluate the decision-making process of adaptive sampling, the proportion of these three types of decisions was plotted for each NAD sample, in nominal value (Figure 2A) and relative value (Figure 2B). For all the samples, a majority of the reads were “unblocked” while being sequenced, indicating that more than 50% of reads were rejected, as adaptive sampling was set to deplete the noise DNA. For the Salmonella samples, there is a decreasing trend of unblocked reads as the quantity of the target DNA increases. This trend indicates that adaptive sampling was functioning as expected, as the nanopores rejected fewer reads when the sample contained proportionally less noise DNA. Conversely, the proportion of acceptance (“stop_receiving”) in sequencing increases as the quantity of the target DNA increases in the Salmonella samples. For NAD sequencing, it is notable that the proportion of unblocked reads is much larger than the proportion of accepted reads, as adaptive sampling is set to deplete rather than enrich.

Figure 2. Three decision types in adaptive sampling of the NAD samples.

(A) Comparison in nominal value; (B) Comparison in relative value.

“no_decision” when the read has been continued without decision; “stop_receiving” when the read was accepted and fully sequenced (read acceptance occurs at ~4000 bp for depletion protocols); “unblock” when the read was stopped and rejected by the nanopore (read rejection occurs at ~400 bp).

Performance of NAD sequencing experiments

To evaluate the performance of NAD sequencing, the sequence length of the reads was compared by the sample type (Figure 3A) and by the classification type (Figure 3B). In all the samples, the average sequence length of the reads that were unblocked and rejected was much shorter, at ~500 b, than those of the other reads (Figure 3A). The nanopore sequencer was preset to deplete the noise DNA by rejecting the Lambda phage genome of ~48 kb using adaptive sampling. As it takes around 1 second for nanopores to decide whether to accept or reject a read using the real-time basecalling and mapping process of adaptive sampling, this result verifies the implementation of adaptive sampling given the translocation speed of 450 bases per second.²² Furthermore, the average sequence length of the reads that were accepted and fully sequenced (“stop_receiving”) was much longer, at ~4,000 b, than that of the reads that were continued (“no_decision”), at ~1,000 b.
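The link between decision time and rejected read length is simple arithmetic: bases sequenced before ejection ≈ translocation speed × decision time. The sketch below applies it in both directions, using the 450 bases per second quoted above and the mean noise read lengths from Table 2. The function names are hypothetical, and the "implied decision time" is only a back-of-the-envelope estimate, not a measured quantity.

```python
TRANSLOCATION_B_PER_S = 450.0  # translocation speed quoted in the text


def bases_before_decision(decision_time_s: float,
                          speed_b_per_s: float = TRANSLOCATION_B_PER_S) -> float:
    """Bases that pass through the pore before an unblock decision is applied."""
    return decision_time_s * speed_b_per_s


def implied_decision_time_s(mean_rejected_length_b: float,
                            speed_b_per_s: float = TRANSLOCATION_B_PER_S) -> float:
    """Decision time implied by the mean length of rejected reads."""
    return mean_rejected_length_b / speed_b_per_s


# Mean length of reads classified as noise DNA, from Table 2.
MEAN_NOISE_LENGTH_B = {
    "Salmonella_1ng": 443.83,
    "Salmonella_3ng": 410.45,
    "Salmonella_5ng": 454.53,
    "Pseudomonas": 532.88,
}

print(f"1 s decision -> {bases_before_decision(1.0):.0f} b sequenced before rejection")
for sample, mean_length in MEAN_NOISE_LENGTH_B.items():
    print(f"{sample}: implied decision time = {implied_decision_time_s(mean_length):.2f} s")
```

With these inputs the implied decision times fall between roughly 0.9 s and 1.2 s, in line with the ~1 second estimate, with the longest value for the MinION (Pseudomonas) run.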

Figure 3. Sequence length of the NAD sequencing experiments.

(A) Comparison by the sample for all reads; (B) Comparison by the noise DNA (Lambda phage) versus the target DNA (S. enterica or P. aeruginosa).

Lambda_S indicates the noise DNA from the samples of S. enterica and Lambda_P indicates the noise DNA from the sample of P. aeruginosa. From the NAD experiments, only reads that had a quality score above 10 were classified using the WIMP workflow (Extended Data: Figure S2). For the noise DNA and the target DNA, Lambda_S_1ng: unblock=161; Lambda_S_3ng: unblock=131; Lambda_S_5ng: unblock=175; Lambda_P: unblock=11574, no_decision=790, stop_receiving=75; Salmonella_1ng: unblock=23, no_decision=380, stop_receiving=30; Salmonella_3ng: unblock=28, no_decision=1862, stop_receiving=123; Salmonella_5ng: unblock=19, no_decision=5394, stop_receiving=391; Pseudomonas: unblock=1766, no_decision=975621, stop_receiving=55543.

To confirm the decision-making process of adaptive sampling, each read was classified by a cloud-based analysis platform called the WIMP workflow (Extended Data: Figures S4 and S5). The species identification was downloaded from the WIMP workflow, and the reads identified as the noise DNA or the target DNA were saved separately. Subsequently, each identified read was subcategorized into its adaptive sampling decision type (Figure 3B). The results show that the noise DNA from the Salmonella samples was all rejected (“unblocked”) by adaptive sampling. Interestingly, there were some noise DNA reads from the Pseudomonas sample that were either continued (“no_decision”) or even accepted (“stop_receiving”). This result is unexpected, as MinION flow cells are expected to have more stable nanopores and generate higher outputs. However, the current setting of GPU-accelerated adaptive sampling may be a limiting factor in depleting the noise DNA when a larger number of nanopores generates large data outputs that must be processed in real time in NAD sequencing experiments.
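The per-class decision counts listed in the Figure 3 legend can be turned into relative proportions to make this comparison concrete. The sketch below uses those counts as given (with zeros for decision types not listed for the Salmonella noise reads, as in Table 2); the dictionary layout and function name are illustrative assumptions.

```python
# Decision counts per classified group, from the Figure 3 legend.
COUNTS = {
    "Lambda_S_1ng":   {"unblock": 161,   "no_decision": 0,      "stop_receiving": 0},
    "Lambda_S_3ng":   {"unblock": 131,   "no_decision": 0,      "stop_receiving": 0},
    "Lambda_S_5ng":   {"unblock": 175,   "no_decision": 0,      "stop_receiving": 0},
    "Lambda_P":       {"unblock": 11574, "no_decision": 790,    "stop_receiving": 75},
    "Salmonella_1ng": {"unblock": 23,    "no_decision": 380,    "stop_receiving": 30},
    "Salmonella_3ng": {"unblock": 28,    "no_decision": 1862,   "stop_receiving": 123},
    "Salmonella_5ng": {"unblock": 19,    "no_decision": 5394,   "stop_receiving": 391},
    "Pseudomonas":    {"unblock": 1766,  "no_decision": 975621, "stop_receiving": 55543},
}


def proportions(decision_counts):
    """Convert raw decision counts into fractions of the group's total reads."""
    total = sum(decision_counts.values())
    if total == 0:
        raise ValueError("group has no reads")
    return {decision: n / total for decision, n in decision_counts.items()}


for group, decision_counts in COUNTS.items():
    shares = proportions(decision_counts)
    summary = ", ".join(f"{d}={100 * p:.2f}%" for d, p in shares.items())
    print(f"{group}: {summary}")
```

Read this way, all noise reads in the Salmonella runs were unblocked, whereas about 7% of the noise reads in the Pseudomonas run escaped depletion (continued or accepted). On the target side, the share of reads wrongly unblocked falls from about 5.3% at 1 ng to about 0.3% at 5 ng in the Salmonella runs.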

For the target DNA, the NAD experiments successfully continued (“no_decision”) or accepted (“stop_receiving”) a larger number of reads in both the Salmonella samples and the Pseudomonas sample (Extended Data: Table S5). In the Pseudomonas sample, it is notable that more target DNA reads with longer sequence lengths were wrongly rejected (“unblocked”) than in the Salmonella samples (Figure 3B). The decision-making process of adaptive sampling potentially takes longer in higher-throughput flow cells, where basecalling for read mapping becomes a bottleneck due to the limited parallel computing power available in this study.

In the Salmonella samples, the output ratio of target DNA to noise DNA increases from 2.69 to 33.17 as the quantity of the target DNA increases from 1 ng to 5 ng (Table 2). Furthermore, the ratio of acceptance to rejection of the target DNA increases from 1.30 to 20.58 with the increasing quantity of bacterial DNA. This result indicates that adaptive sampling functions with predictable outcomes, and NAD experiments perform effectively in sequencing low quantities of target DNA. Additionally, the ratio of target DNA read length to noise DNA read length remains roughly constant at around 4 (Table 2), indicating that the correlation between the performance and the quantity of the target DNA is independent of other factors in NAD experiments.

Table 2. Summary statistics of the NAD datasets by WIMP classification at the Genus level.

	Salmonella_1ng	Salmonella_3ng	Salmonella_5ng	Pseudomonas
Input target DNA (ng)	1	3	5	300
Input noise DNA (ng)	600	600	600	5100
Input target/noise DNA ratio	0.00177	0.005	0.00833	0.0591
Output target DNA (number of reads: Genus)	433	2,013	5,804	256,622
Output noise DNA (number of reads: Genus)	161	131	175	13,629
Output target/noise DNA ratio	2.69	15.371	33.17	18.83
Output target DNA length (mean)	1807.95	1961.96	1999.51	1,823.25
Output noise DNA length (mean)	443.83	410.45	454.53	532.88
Output target/noise length ratio	4.074	4.78	4.4	3.42
Target DNA: no_decision	380	1862	5394	975621
Target DNA: stop_receiving	30	123	391	55543
Target DNA: unblock	23	28	19	1766
Target DNA stop_receiving/unblock ratio	1.30	4.39	20.58	31.45
Noise DNA: no_decision	0	0	0	790
Noise DNA: stop_receiving	0	0	0	75
Noise DNA: unblock	161	131	175	11574
Noise DNA stop_receiving/unblock ratio	0	0	0	0.0065

However, the Pseudomonas sample has a much lower output ratio of target DNA to noise DNA, at 18.83, despite a higher input ratio of target DNA to noise DNA. This result is due to some of the noise DNA being continued or accepted and therefore fully sequenced, as shown by the higher ratio of acceptance to rejection in the noise DNA (Table 2 and Extended Data: Table S5). These summary statistics emphasize the importance of increasing parallel computing power to process higher throughput in MinION flow cells for NAD experiments.
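The ratios discussed here follow directly from the read counts in Table 2. As a worked check, the sketch below recomputes the output target/noise ratio and the target stop_receiving/unblock ratio for each sample; the data layout and function name are illustrative, and only values printed in Table 2 are used.

```python
# Per sample, from Table 2: reads classified as target, reads classified as noise,
# and target reads by decision type (stop_receiving, unblock).
TABLE_2 = {
    "Salmonella_1ng": {"target": 433,    "noise": 161,   "accepted": 30,    "rejected": 23},
    "Salmonella_3ng": {"target": 2013,   "noise": 131,   "accepted": 123,   "rejected": 28},
    "Salmonella_5ng": {"target": 5804,   "noise": 175,   "accepted": 391,   "rejected": 19},
    "Pseudomonas":    {"target": 256622, "noise": 13629, "accepted": 55543, "rejected": 1766},
}


def ratio(numerator: float, denominator: float) -> float:
    """Plain ratio with an explicit guard against an empty denominator."""
    if denominator == 0:
        raise ZeroDivisionError("denominator is zero")
    return numerator / denominator


for sample, row in TABLE_2.items():
    output_ratio = ratio(row["target"], row["noise"])
    accept_reject = ratio(row["accepted"], row["rejected"])
    print(f"{sample}: target/noise = {output_ratio:.2f}, "
          f"stop_receiving/unblock = {accept_reject:.2f}")
```

The recomputed values agree with the corresponding rows of Table 2 to two decimal places.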

Metagenomic assembly of the target DNA and the noise DNA

The species identification of each read using the WIMP workflow shows that, while a large number of reads were correctly classified as the target DNA or the noise DNA at the family level, a higher proportion of reads were wrongly classified as other organisms at the genus level (Extended Data: Figures S4 and S5). For example, a majority of the reads were identified as Escherichia coli in the Salmonella samples. This misclassification may arise from the fact that the genomes of Escherichia coli and Salmonella enterica share similar gene content,^23,24 making it difficult for the Centrifuge classification engine to accurately identify the genus or species based on a single long read.

After the rapid species classification, the potential of genome assembly using NAD sequencing experiments with such low DNA inputs was investigated. The reads from the NAD datasets were assembled with a de novo assembler for single-molecule sequencing reads called Flye using a metagenomic option. The assembled fragments had an N50 of around 50 kb in the Salmonella samples and an N50 of around 1 Mb in the Pseudomonas sample (Extended Data: Table S6). This shows that MinION flow cells are much more effective in assembling bacterial genomes with a higher data output, in spite of the lower efficiency in adaptive sampling ( Table 2).

After metagenomic genome assembly, the assembled fragments were aligned against the reference target genome or the reference noise genome (Table 3 and Extended Data: Tables S7-S8). We evaluated the extent and accuracy of the noise genome assemblies from each sample against the reference genome of Lambda phage. The purpose of this assembly and comparison was to verify the accuracy of long-read sequencing using an independent dataset, different from the target genomes. Additionally, this approach allowed us to quantify the amount of noise reads sequenced as artifacts. The fewer the noise DNA reads, the more sequencing capacity can be allocated to assembling the target genomes. The results show that the Salmonella samples assembled only a small percentage of the target genome of S. enterica. For example, the assembled fragments from the Salmonella_1ng sample covered only 0.36% of the reference target genome, while they covered 100% of the reference noise genome (Figure 4). The coverage of the target genome is higher at the larger target DNA quantities, as shown by the Salmonella_3ng sample and the Salmonella_5ng sample covering almost 30% and 20% of the reference target genome, respectively. They both covered the reference noise genome fully as an artifact of NAD sequencing experiments. Notably, the Pseudomonas sample covered 99.9% of the reference target genome (Figure 5).

Table 3. Analysis of the assembled target genomes against the target reference genome.

	S. enterica reference	Salmonella_1ng	S. enterica reference	Salmonella_3ng	S. enterica reference	Salmonella_5ng	P. aeruginosa reference	P_aeruginosa
Total Bases	4857450	182527	4857450	1557907	4857450	1055836	6264404	7439804
Aligned Bases	17573(0.36%)	16793(9.20%)	1357232(27.94%)	1279936(82.16%)	899445(18.52%)	846000(80.13%)	6258121(99.90%)	6297543(84.65%)
Unaligned Bases	4839877(99.64%)	165734(90.80%)	3500218(72.06%)	277971(17.84%)	3958005(81.48%)	209836(19.87%)	6283(0.10%)	1142261(15.35%)
Total Sequences	1	5	1	44	1	35	1	98
Aligned Sequences	1(100.00%)	4(80.00%)	1(100.00%)	42(95.45%)	1(100.00%)	33(94.29%)	1(100.00%)	20(20.41%)
Unaligned Sequences	0(0.00%)	1(20.00%)	0(0.00%)	2(4.55%)	0(0.00%)	2(5.71%)	0(0.00%)	78(79.59%)
1-to-1	10	10	59	59	48	48	14	14
Total Length of 1-to-1	16798	16793	1280926	1285001	845297	845479	6263146	6261391
Average Length of 1-to-1	1679.8	1679.3	21710.61	21779.68	17610.35	17614.15	447367.57	447242.21
Average Identity of 1-to-1	82.53%	82.53%	97.58%	97.58%	97.77%	97.77%	99.97%	99.97%
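The reference coverage percentages in Table 3 are the aligned reference bases divided by the total reference bases. The short sketch below recomputes them from the "Total Bases" and "Aligned Bases" rows of the reference columns; the names are illustrative and no values beyond those in Table 3 are assumed.

```python
# (reference total bases, reference aligned bases) per sample, from Table 3.
REFERENCE_ALIGNMENT = {
    "Salmonella_1ng": (4_857_450, 17_573),
    "Salmonella_3ng": (4_857_450, 1_357_232),
    "Salmonella_5ng": (4_857_450, 899_445),
    "P_aeruginosa":   (6_264_404, 6_258_121),
}


def coverage_percent(total_bases: int, aligned_bases: int) -> float:
    """Percentage of the reference covered by aligned assembly fragments."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return 100.0 * aligned_bases / total_bases


for sample, (total, aligned) in REFERENCE_ALIGNMENT.items():
    print(f"{sample}: {coverage_percent(total, aligned):.2f}% of the reference covered")
```

The results match the percentages printed in the "Aligned Bases" row of Table 3 (0.36%, 27.94%, 18.52%, and 99.90%).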

Figure 4. MUMmer N charts to compare Flye-assembled genomes against the target reference (Salmonella enterica: GCF_000006945) or the noise reference (Lambda phage: NC_001416).

(A) Salmonella_1ng against the target reference; (B) Salmonella_1ng against the noise reference; (C) Salmonella_3ng against the target reference; (D) Salmonella_3ng against the noise reference; (E) Salmonella_5ng against the target reference; (F) Salmonella_5ng against the noise reference.

Figure 5. MUMmer N charts to compare Flye-assembled genomes against the target reference (Pseudomonas aeruginosa: GCF_000006765) or the noise reference (Lambda phage: NC_001416).

(A) Pseudomonas_300ng against the target reference; (B) Pseudomonas_300ng against the noise reference.

The assembled genomes had a variable average identity depending on the sample type. The Pseudomonas sample had the highest accuracy of 99.97% in genome assembly when aligned to the reference target genome (Table 3). The assembled genome of the noise DNA from this sample also had the highest accuracy of 99.93% when mapped to the reference noise genome (Extended Data: Table S7). The Salmonella_1ng sample, with the lowest target quantity, had the lowest accuracy of 82.53% in genome assembly when aligned to the reference target genome (Table 3). In this NAD sample, several differences from the reference were detected, including breakpoints, insertions, and single nucleotide polymorphisms (SNPs) (Extended Data: Figure S6). However, the assembled genome of the noise DNA had a high accuracy of 99.89% when mapped to the reference noise genome (Extended Data: Table S8). The Salmonella_3ng and the Salmonella_5ng samples had an average identity of 97.58% and 97.77% when aligned to the reference target genome, respectively (Table 3). These results show that NAD sequencing experiments accurately assemble near-complete bacterial genomes at a lower input DNA (300 ng) than recommended (1,000 ng) using MinION flow cells. Furthermore, NAD sequencing experiments assemble a fraction of bacterial genomes (~30%) accurately at a much lower input DNA of 3 ng than recommended (500 ng) using Flongle flow cells. Lower input DNA at 1 ng may still be used for species identification with NAD sequencing, potentially overcoming the current limitation of long-read sequencing of high input DNA requirements.

Discussion

NAD sequencing explores the potential of sequencing low-input target DNA/RNA in its native state by augmenting biological samples with a controlled quantity and quality of noise DNA/RNA. This concept is inspired by the data augmentation technique of machine learning and enabled by the technological advances in long-read sequencing and parallel computing. This study has the specific aim of lowering the minimum input DNA/RNA to a fraction of the recommended input amount of 500 ng and 1,000 ng in cost-effective and high-throughput nanopore sequencing, respectively. This input quantity requirement of DNA/RNA is not realistic in many biological samples without amplification. Conventionally, these biological samples with scarce DNA/RNA have been amplified with the Polymerase Chain Reaction (PCR) technique to meet the minimum criteria of input DNA/RNA before sequencing.

Using PCR, very small amounts of DNA sequences are exponentially amplified to millions to billions of copies with a DNA polymerase in a series of cycles of temperature changes.^25,26 However, there are several PCR-induced biases and artifacts, such as DNA polymerase errors and the loss of epigenetic signatures.^27–29 These PCR-induced issues have been a limiting factor in understanding some biological processes, such as DNA methylation, which plays a key role in development and gene expression.³⁰ Recently, there has been an increasing interest in studying the genomes of various species and individuals at the epigenetic level, that is, how DNA and RNA sequences undergo epigenetic modifications to inherit information without changing the genetic sequences.³¹ Such epigenetic information is relevant in medical fields such as cancer genomics,^32,33 but recent findings also suggest that various organisms utilize base modifications to escape host immunity.^34,35 The development of the ground-breaking mRNA vaccines also builds on differentially modified nucleotides as a method to deliver mRNA without triggering the immune system.^36,37 Thus, expanding the capacity to sequence native DNA/RNA from diverse biological samples will enable the scientific community to further explore novel territories of epigenetics.

The study aims to develop a method that broadens the possibility of direct sequencing for various biological samples so that more DNA/RNA can be sequenced in their native states. In this study, the novel method of NAD sequencing tested the minimum input ranges of target DNA from 1 ng to 5 ng for cost-effective nanopore sequencing and 300 ng for high-throughput nanopore sequencing. We demonstrate that NAD sequencing can detect the target DNA with a quantity as low as 1 ng with the cost-effective Flongle flow cells. Furthermore, NAD sequencing can efficiently assemble parts of a bacterial genome, with the target DNA quantity as small as 3 ng (30% complete with an accuracy of 97.58%) in the cost-effective Flongle flow cell, and the target DNA quantity of 300 ng (99.9% complete with an accuracy of 99.97%) in the high-throughput MinION flow cell.

The initial concentration and quantity of the microbial DNA standards of Salmonella enterica and Pseudomonas aeruginosa, at approximately 10 ng/μl, fell below the minimum concentration measurable within the confidence level of a DNA spectrophotometer. Without NAD sequencing, the target DNA in these samples was not sufficient to be sequenced effectively in its native form with nanopore sequencing. The primary challenge with low-input samples in Oxford Nanopore Technologies' sequencing lies in the ligation step, for which the recommended starting amount is typically 1 μg of gDNA or 100-200 fmol of amplicons or cDNA. The rapid ligation kit, which is optimized for speed and simplicity, also has a recommended input of 100 ng of gDNA. Using lower amounts of input material or impure samples can compromise library preparation efficiency and significantly reduce sequencing throughput. The smallest quantity of target DNA examined in this study was 1 ng of the Salmonella enterica genome, a limit set by the laboratory equipment, but NAD sequencing shows the potential to detect smaller input amounts of target DNA than those employed in this study.

The broad aim of this study is to increase the robustness of long-read sequencing to real-world biological samples with small input amounts and noisy backgrounds. Because of the high input requirement, the option of direct sequencing using long-read technologies is frequently supplanted by PCR and short-read sequencing. In contrast, NAD sequencing capitalizes on noise combined with adaptive sampling to mitigate the challenges associated with the high DNA inputs required in long-read sequencing, without the need for amplification or any supplementary processing. The only additional step is to determine the type and amount of noise DNA needed to produce noise-augmented samples suitable for adaptive sampling. Augmenting biological samples with a controlled noise DNA is inspired by the data augmentation technique of white noise injection in machine learning, which improves the robustness and generalization power of deep learning models on noisy real-world data.³⁸

In future developments, the proof-of-concept NAD sequencing experiments will be broadened to establish standardized protocols for augmenting biological samples with noise DNA. These protocols are intended to eliminate the need for input DNA/RNA quantification in experimental conditions where such equipment is not available, such as in cost-constrained settings or during remote sampling campaigns. Nanopore sequencing has been optimized to perform in remote and cost-constrained situations where rapid, on-site sequencing of biological samples is desirable. Such scenarios may involve the urgent detection of infectious agents in remote areas. In such cases, the simple addition of noise DNA such as the Lambda phage genome will suffice for detecting scarce target DNA/RNA or even for assembling the target genome. This noise augmentation increases the robustness of direct sequencing on real-world biological data and eliminates the bottleneck of DNA/RNA quantification.

Methods

Quality control of the microbial DNA standards

Microbial DNA standards from Pseudomonas aeruginosa and Salmonella enterica (Sigma-Aldrich) were obtained at a concentration of 10 ng/μl. For both microbial DNA standards, the manufacturer gives the UV absorbance ratio (OD260/OD280) as 1.8 and the bacterial identity as 95%. DNA concentration and purity were confirmed using a DNA spectrophotometer (Extended Data: Table S1, Figure S1). Given the total volume of 30 μl, the total amount of DNA from each of these microbial standards was 300 ng. The input DNA requirements for MinION flow cells (R9.4.1) and Flongle flow cells (R9.4.1) are 1,000 ng and 500 ng, respectively.

Noise augmentation of the microbial DNA standards

Lambda DNA standards from Escherichia coli bacteriophage (Thermo Fisher) were obtained at a concentration of 0.3 μg/μl. To meet the input DNA requirements of nanopore sequencing, the microbial DNA standards were augmented with the lambda DNA standard. For nanopore sequencing of P. aeruginosa, 30 μl of the microbial DNA standard was augmented with 17 μl of the lambda DNA standard to obtain 5.4 μg of input DNA (Extended Data: Table S2). The total DNA amount of this noise-augmented sample far exceeds the minimum DNA input requirement for MinION flow cells (R9.4.1). For nanopore sequencing of S. enterica, a small amount of the microbial DNA standard (0.1 μl, 0.3 μl, or 0.5 μl) was augmented with 2 μl of the lambda DNA standard to obtain approximately 600 ng of input DNA (Extended Data: Table S3). The total DNA amounts of these noise-augmented samples meet the minimum DNA input requirement for Flongle flow cells (R9.4.1).
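The input amounts quoted above follow directly from the stated volumes and concentrations. The short Python sketch below reproduces that arithmetic for the four mixes described in this section; the function and variable names are illustrative and are not taken from the project code.

```python
# Concentrations as stated in the Methods text.
MICROBIAL_NG_PER_UL = 10.0   # microbial DNA standards (10 ng/ul)
LAMBDA_NG_PER_UL = 300.0     # lambda DNA standard (0.3 ug/ul)


def augmented_input(target_ul, noise_ul):
    """Return (target_ng, noise_ng, total_ng, target_fraction) for one mix."""
    target_ng = target_ul * MICROBIAL_NG_PER_UL
    noise_ng = noise_ul * LAMBDA_NG_PER_UL
    total_ng = target_ng + noise_ng
    return target_ng, noise_ng, total_ng, target_ng / total_ng


# (target volume in ul, lambda volume in ul) for each mix in this section.
mixes = {
    "P. aeruginosa, MinION": (30.0, 17.0),
    "S. enterica 0.1 ul, Flongle": (0.1, 2.0),
    "S. enterica 0.3 ul, Flongle": (0.3, 2.0),
    "S. enterica 0.5 ul, Flongle": (0.5, 2.0),
}

for name, (target_ul, noise_ul) in mixes.items():
    target_ng, noise_ng, total_ng, fraction = augmented_input(target_ul, noise_ul)
    print(f"{name}: target {target_ng:.1f} ng + noise {noise_ng:.1f} ng "
          f"= {total_ng:.1f} ng total (target fraction {fraction:.2%})")
```

The target fraction is printed alongside the totals because it shows how strongly the noise DNA dominates each mix, which is the condition adaptive sampling in depletion mode has to handle.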

Ligation of the augmented DNA samples

A ligation-based sequencing kit was chosen for processing singleplex samples of the noise-augmented target DNA. Library preparation was carried out using the ligation sequencing kits (SQK-LSK109; Oxford Nanopore Technologies) according to the manufacturer’s instructions. For Flongle flow cells (R9.4.1), the Flongle Sequencing Expansion (EXP-FSE001; Oxford Nanopore Technologies) was used in combination for optimal results. These ligation sequencing kits are optimized for preparing sequencing libraries from dsDNA such as gDNA, cDNA, or amplicons. The library preparation method involves repairing and dA-tailing DNA ends using the NEBNext End Repair/dA-tailing module, and then ligating sequencing adapters onto the prepared ends. For the highest data yields, these ligation kits recommend starting with 1 μg of gDNA or 100-200 fmol of shorter-fragment input such as amplicons or cDNA. Starting with lower amounts of input material, or impure samples, may affect library preparation efficiency and reduce sequencing throughput.
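Because the kit recommendation is given in mass for gDNA but in femtomoles for shorter fragments, it can help to convert between the two. The sketch below does this in Python under the common approximation that one double-stranded base pair has an average molar mass of about 650 g/mol; that constant and the 1,500 bp fragment length in the usage line are assumptions for illustration, not values from this study.

```python
# Assumed average molar mass of one double-stranded DNA base pair (g/mol).
AVG_BP_MASS_G_PER_MOL = 650.0


def ng_to_fmol(mass_ng, length_bp):
    """Convert a dsDNA mass in ng to fmol for fragments of the given length."""
    # ng -> g is 1e-9 and mol -> fmol is 1e15, so the net factor is 1e6.
    return mass_ng * 1e6 / (length_bp * AVG_BP_MASS_G_PER_MOL)


def fmol_to_ng(amount_fmol, length_bp):
    """Convert a dsDNA amount in fmol to ng for fragments of the given length."""
    return amount_fmol * length_bp * AVG_BP_MASS_G_PER_MOL / 1e6


# Hypothetical example: mass range matching 100-200 fmol of 1,500 bp amplicons.
low_ng = fmol_to_ng(100, 1500)
high_ng = fmol_to_ng(200, 1500)
print(f"100-200 fmol of 1,500 bp fragments is about {low_ng:.1f}-{high_ng:.1f} ng")
```

For a given mass, longer fragments contribute fewer molecules, which is why molar amounts are the more natural unit for short amplicons or cDNA.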

Nanopore sequencing of the NAD samples with adaptive sampling

The NAD samples of the microbial DNA standards augmented with the noise DNA were sequenced with a MinION Mk1B (Oxford Nanopore Technologies). For each sequencing run, adaptive sampling implemented in the MinKNOW software (v.21.11.8) was preset to deplete the Lambda phage genome. The complete genome of the Escherichia coli bacteriophage (NC_001416.1) was uploaded as a FASTA file to serve as the reference sequence to deplete while sampling. Adaptive sampling requires high computational power because reads must be basecalled in real time to decide whether to accept or reject them for further sequencing. GPU-accelerated adaptive sampling was performed using an NVIDIA GPU on Windows (NVIDIA Quadro P3000). For better performance, we recommend using GPUs with a base clock of 1320 MHz and a boost clock of 1777 MHz for NAD sequencing and super-accuracy basecalling.

High-accuracy basecalling of NAD sequencing experiments

After the NAD sequencing of the samples was completed, the raw signal data in FAST5 files were basecalled with Guppy (v6.5.7). The GPU version of Guppy was used to improve the performance of super-accuracy basecalling, which achieves the highest raw read accuracy among the neural network models available in Guppy, the others being fast analysis and high-accuracy analysis.³⁹ An external GPU enclosure (eGPU) paired with an Nvidia Ampere card (RTX3060) was connected to a Dell Latitude laptop to perform high-accuracy basecalling, and the basecalled long reads were saved in FASTQ files.

Rapid classification pipeline of NAD sequencing experiments

For the rapid classification of reads, EPI2ME, a cloud-based platform providing analysis workflows, was used. On the EPI2ME platform (v.3.5.7), the WIMP workflow (v.2021.11.26) rapidly classifies long reads from nanopore sequencing based on the Centrifuge classification engine.⁴⁰ The NAD datasets basecalled with the super-accuracy models were classified at the Family, Genus, and Species levels using the WIMP workflow.

The classification of each long read was saved to assess the accuracy and efficiency of NAD sequencing. For accuracy, the decision of adaptive sampling to accept or reject each read for further sequencing was compared with the classification of that read as target (bacterial DNA) or noise (Lambda phage DNA). For efficiency, the mean sequenced DNA length was calculated separately for the reads classified as target (bacterial DNA) and for those classified as noise (Lambda phage DNA).
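The two performance measures described above can be sketched in a few lines of Python. The example assumes each read has already been reduced to a tuple of its classification ("target" or "noise"), the adaptive sampling decision ("accepted" or "rejected"), and its sequenced length; this record layout and the toy values are hypothetical and are not data from the study.

```python
from collections import defaultdict
from statistics import mean


def summarize(reads):
    """Summarize accuracy and efficiency per read class.

    reads: iterable of (label, decision, length_bp) tuples, where label is
    "target" or "noise" and decision is "accepted" or "rejected".
    """
    lengths = defaultdict(list)
    correct = defaultdict(int)
    for label, decision, length_bp in reads:
        lengths[label].append(length_bp)
        # In depletion mode the desired outcome is to reject noise reads
        # and to keep sequencing target reads.
        expected = "rejected" if label == "noise" else "accepted"
        correct[label] += int(decision == expected)
    return {
        label: {
            "n_reads": len(values),
            "fraction_correct": correct[label] / len(values),
            "mean_length_bp": mean(values),
        }
        for label, values in lengths.items()
    }


# Toy records for illustration only.
toy_reads = [
    ("noise", "rejected", 500),
    ("noise", "rejected", 480),
    ("noise", "accepted", 4200),
    ("target", "accepted", 5000),
    ("target", "accepted", 3000),
    ("target", "rejected", 450),
]

for label, stats in summarize(toy_reads).items():
    print(label, stats)
```

A large gap between the mean length of target reads and that of noise reads indicates that pore time is being spent on the target rather than on the depleted genome.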

Genome assembly of the noise DNA and the target DNA

For the downstream analysis, the NAD datasets from each sample were assembled using Flye (v2.9.2).^41,42 For the assembly, the FASTQ files generated from each experiment were combined into one file, and the metagenomic option was used to assemble the long reads into contigs.⁴³
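The two assembly steps above (merging the FASTQ files, then running Flye with the metagenomic option) could be scripted roughly as follows. This is a minimal sketch rather than the pipeline used in the study: the file names are placeholders, the files are assumed to be uncompressed and newline-terminated, and the choice of the `--nano-raw` read-type flag is an assumption, since the text does not state which read-type option was passed to Flye.

```python
import shutil
from pathlib import Path


def combine_fastq(fastq_paths, combined_path):
    """Concatenate uncompressed FASTQ files into one file and return its path."""
    combined_path = Path(combined_path)
    with open(combined_path, "wb") as out:
        for path in sorted(Path(p) for p in fastq_paths):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return combined_path


def flye_meta_command(reads, out_dir, threads=4):
    """Build (but do not run) a Flye command line with the metagenomic option."""
    return [
        "flye",
        "--nano-raw", str(reads),   # assumed read type, not stated in the text
        "--meta",                   # metagenomic assembly mode
        "--out-dir", str(out_dir),
        "--threads", str(threads),
    ]


# Placeholder paths for illustration; pass the list to subprocess.run to execute.
command = flye_meta_command("combined.fastq", "flye_out")
print(" ".join(command))
```

Building the command as a list keeps the call easy to inspect before it is handed to `subprocess.run`, and it avoids shell quoting issues with unusual file names.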

After metagenomic genome assembly, the resulting contigs were aligned to the reference noise genome (NC_001416) and the reference target genome (GCF_000006945 or GCF_000006765) using MUMmer (v4.0+).^44–46 The alignment quality of NAD datasets to the reference genomes was analyzed and visualized using Assemblytics.⁴⁷

Declarations

Ethics approval and consent to participate

Not Applicable

Consent for publication

Not Applicable

Authors' contributions

The author confirms sole responsibility for the study conception and design, data collection, analysis and interpretation, and manuscript preparation.

Availability of data and materials

Software availability

All code related to this project is available at https://github.com/hshimlab.

For data analysis, Python v.3.6.4 (https://www.python.org), NumPy v.1.17.5 (https://github.com/numpy/numpy), SciPy v.1.1.0 (https://www.scipy.org), seaborn v.0.9.0 (https://github.com/mwaskom/seaborn), Matplotlib v.3.3.4 (https://github.com/matplotlib/matplotlib), and pandas v.0.22.0 (https://github.com/pandas-dev/pandas) were used.

For nanopore data acquisition, we used MinKNOW v.21.11.8 and MinKNOW core v.2.1.0. For rapid nanopore data analysis, we used the EPI2ME platform v.3.5.7. For high-accuracy basecalling, we used Guppy v6.5.7. For genome assembly and visualization, we used Flye v2.9.2, MUMmer v4.0+, and Assemblytics (http://assemblytics.com).

All code related to this project is available under the GNU General Public License v3.0.

Extended data

GitHub: NAD: Noise-augmented direct sequencing of target nucleic acids by augmenting with noise and selective sampling.

The project contains the following extended data:

1. 2025_NAD_sequencing_Supp_Info.pdf

All data related to this project are available under the GNU General Public License v3.0.

References

1. Marx V: Method of the year: long-read sequencing. Nat. Methods. 2023; 20: 6–11.
2. Nurk S, Koren S, Rhie A, et al.: The complete sequence of a human genome. Science. 2022; 376: 44–53.
3. Wang T, Antonacci-Fulton L, Howe K, et al.: The Human Pangenome Project: a global resource to map genomic diversity. Nature. 2022; 604: 437–446.
4. Jarvis ED, Formenti G, Rhie A, et al.: Semi-automated assembly of high-quality diploid human reference genomes. Nature. 2022; 611: 519–531.
5. Liao W-W, Asri M, Ebler J, et al.: A draft human pangenome reference. Nature. 2023; 617: 312–324.
6. Akeson M, Branton D, Kasianowicz JJ, et al.: Microsecond time-scale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophys. J. 1999; 77: 3227–3233.
7. Meller A, Nivon L, Brandin E, et al.: Rapid nanopore discrimination between single polynucleotide molecules. Proc. Natl. Acad. Sci. USA. 2000; 97: 1079–1084.
8. Branton D, Deamer DW, Marziali A, et al.: The potential and challenges of nanopore sequencing. Nat. Biotechnol. 2008; 26: 1146–1153.
9. Shim H: Futuristic Methods in Virus Genome Evolution Using the Third-Generation DNA Sequencing and Artificial Neural Networks. Global Virology III: Virology in the 21st Century. 2019; 485–513.
10. Shim H: Three Innovations of Next-Generation Antibiotics: Evolvability, Specificity, and Non-Immunogenicity. Antibiotics (Basel). 2023; 12.
11. van Belzen IAEM, Schönhuth A, Kemmeren P, et al.: Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology. npj Precision Oncology. 2021; 5: 1–11.
12. Filser M, Schwartz M, Merchadou K, et al.: Adaptive nanopore sequencing to determine pathogenicity of BRCA1 exonic duplication. J. Med. Genet. 2023; 60: 1206–1209.
13. Martin S, Heavens D, Lan Y, et al.: Nanopore adaptive sampling: a tool for enrichment of low abundance species in metagenomic samples. Genome Biol. 2022; 23: 1–27.
14. Maghini DG, Moss EL, Vance SE, et al.: Improved high-molecular-weight DNA extraction, nanopore sequencing and metagenomic assembly from the human gut microbiome. Nat. Protoc. 2021; 16: 458.
15. Park Y, Lee J, Shim H: Sequencing, Fast and Slow: Profiling Microbiomes in Human Samples with Nanopore Sequencing. Applied Biosciences. 2023; 2: 437–458.
16. Breitwieser FP, Pertea M, Zimin AV, et al.: Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res. 2019; 29: 954.
17. Davis NM, Proctor DM, Holmes SP, et al.: Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome. 2018; 6: 1–14.
18. Haba D: Data Augmentation with Python: Enhance deep learning accuracy with data augmentation methods for image, text, audio, and tabular data. Packt Publishing Ltd; 2023.
19. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature. 2015; 521: 436–444.
20. Tang Y, Eliasmith C (Centre for Theoretical Neuroscience, University of Waterloo, Waterloo ON, Canada): Deep networks for robust visual recognition. [cited 11 Dec 2023].
21. Goodfellow I, Bengio Y, Courville A: Deep Learning. MIT Press; 2016.
22. Wang C, Sensale S, Pan Z, et al.: Slowing down DNA translocation through solid-state nanopores by edge-field leakage. Nat. Commun. 2021; 12: 1–10.
23. Karberg KA, Olsen GJ, Davis JJ: Similarity of genes horizontally acquired by Escherichia coli and Salmonella enterica is evidence of a supraspecies pangenome. Proc. Natl. Acad. Sci. USA. 2011; 108: 20154.
24. Heinrichs DE, Yethon JA, Whitfield C: Molecular basis for structural diversity in the core regions of the lipopolysaccharides of Escherichia coli and Salmonella enterica. Mol. Microbiol. 1998; 30: 221–232.
25. Saiki RK, Scharf S, Faloona F, et al.: Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science. 1985; 230: 1350–1354.
26. Saiki RK, Gelfand DH, Stoffel S, et al.: Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science. 1988; 239: 487–491.
27. Acinas SG, Sarma-Rupavtarm R, Klepac-Ceraj V, et al.: PCR-Induced Sequence Artifacts and Bias: Insights from Comparison of Two 16S rRNA Clone Libraries Constructed from the Same Sample. Appl. Environ. Microbiol. 2005; 71: 8966–8969.
28. Aird D, Ross MG, Chen W-S, et al.: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011; 12: 1–14.
29. Krehenwinkel H, Wolf M, Lim JY, et al.: Estimating and mitigating amplification bias in qualitative and quantitative arthropod metabarcoding. Sci. Rep. 2017; 7: 1–12.
30. Wani K, Aldape KD: PCR Techniques in Characterizing DNA Methylation. Methods Mol. Biol. 2016; 1392.
31. Deans C, Maggert KA: What Do You Mean, “Epigenetic”? Genetics. 2015; 199: 887–896.
32. Dupont C, Randall Armant D, Brenner CA: Epigenetics: Definition, Mechanisms and Clinical Perspective. Semin. Reprod. Med. 2009; 27: 351.
33. Sharma S, Kelly TK, Jones PA: Epigenetics in cancer. Carcinogenesis. 2010; 31: 27.
34. Shim H, Shivram H, Lei S, et al.: Diverse ATPase Proteins in Mobilomes Constitute a Large Potential Sink for Prokaryotic Host ATP. Front. Microbiol. 2021; 12: 691847.
35. Park H-M, Park Y, Berani U, et al.: In silico optimization of RNA-protein interactions for CRISPR-Cas13-based antimicrobials. Biol. Direct. 2022; 17: 27.
36. Karikó K, Buckstein M, Ni H, et al.: Suppression of RNA Recognition by Toll-like Receptors: The Impact of Nucleoside Modification and the Evolutionary Origin of RNA. Immunity. 2005; 23: 165–175.
37. Sahin U, Karikó K, Türeci Ö: mRNA-based therapeutics — developing a new class of drugs. Nat. Rev. Drug Discov. 2014; 13: 759–780.
38. Creating artificial neural networks that generalize. Neural Netw. 1991; 4: 67–79.
39. Wick RR, Judd LM, Holt KE: Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 2019; 20: 1–10.
40. Kim D, Song L, Breitwieser FP, et al.: Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016; 26: 1721–1729.
41. Yuan J: Genome Assembly of Long Error-Prone Reads Using De Bruijn Graphs and Repeat Graphs. 2019.
42. Lin Y, Yuan J, Kolmogorov M, et al.: Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA. 2016; 113: E8396–E8405.
43. Kolmogorov M, Bickhart DM, Behsaz B, et al.: metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods. 2020; 17: 1103–1110.
44. Marçais G, Delcher AL, Phillippy AM, et al.: MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 2018; 14: e1005944.
45. Delcher AL, Kasif S, Fleischmann RD, et al.: Alignment of whole genomes. Nucleic Acids Res. 1999; 27: 2369–2376.
46. Kurtz S, Phillippy A, Delcher AL, et al.: Versatile and open software for comparing large genomes. Genome Biol. 2004; 5: R12.
47. Nattestad M, Schatz MC: Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016; 32: 3021–3023.

Competing interests

No competing interests were disclosed.

Grant information

The author(s) declared that no grants were involved in supporting this work.

Published: 10 Apr 2025, 14:423

https://doi.org/10.12688/f1000research.163516.1

Copyright

© 2025 Shim H. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Shim H. NAD: Noise-augmented direct sequencing of target nucleic acids by augmenting with noise and selective sampling [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2025, 14:423 (https://doi.org/10.12688/f1000research.163516.1)

Open Peer Review

Reviewer Report 28 May 2025

Runsheng Li, City University of Hong Kong, Hong Kong, Hong Kong

Approved

https://doi.org/10.5256/f1000research.179880.r380261

This proof-of-concept study introduces Noise-Augmented Direct (NAD) sequencing, a workflow that mixes very small amounts of target DNA (< 500 ng) with an excess of “noise” DNA (λ-phage) and then uses nanopore adaptive sampling in depletion mode to eject the noise reads as they enter the pore.
The authors show that
1. Flongle runs can detect Salmonella enterica DNA down to 1 ng and assemble ~30 % of the genome at 3 ng (97.6 % identity).
2. A MinION R9.4.1 run with 300 ng of Pseudomonas aeruginosa DNA yields a near-complete assembly (99.9 % coverage, 99.97 % identity).
3. Rejected reads peak at ~500 bp (the time required for a decision), while accepted reads peak at ≥ 4 kb.
4. Most λ reads are correctly un-blocked; efficiency deteriorates on the high-throughput MinION run, presumably due to GPU saturation.

The authors propose that NAD sequencing could obviate PCR amplification for low-input samples, preserving epigenetic signals and simplifying field or clinical workflows.

Issues that need further discussion:
1. All data were produced on the R9.4.1 flow cell using the WIMP workflow. As R9 flow cells are being retired and R10.4.1 adaptive sampling relies on a different protocol, please clarify whether your conclusions extend to the R10 chemistry.
2. Please recommend optimal mixing ratios of carrier (“noise”) DNA to target DNA for different input amounts. For instance, if only 3 µg of target DNA is available, how much carrier DNA would you add? In our R10.4.1 tests, adaptive sampling becomes ineffective when the target fraction falls below 0.1 %, because most reads derive from the 400–500 bp carrier genome.

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Yes
Are sufficient details provided to allow replication of the method development and its use by others?

Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genetics, Long read sequencing, Bioinformatics, DNA/RNA modification

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report 21 Apr 2025

Ruslan Kalendar, University of Helsinki, Helsinki, Finland

Approved with Reservations

https://doi.org/10.5256/f1000research.179880.r378092

In this paper, the authors propose a novel sequencing method called Noise-Augmented Direct (NAD) that enables the direct sequencing of target DNA even when it falls below the minimum quantity and concentration required for long-read sequencing by augmentation with noise DNA and adaptive sampling.

The authors point out, “Long-read sequencing ... DNA/RNA sequencing”, RNA sequencing is not practiced, therefore, no need to specify ‘DNA/RNA’, only DNA.
Methylation and other DNA modifications as a necessary requirement for genome sequencing, but this direction makes sense probably only for the human genome and other eukaryotic model organisms, but not for analyzing bacterial genomes, which is the focus of the authors' study.

The main problem cited by the authors, when the target sample concentration is low, is related to quality sequencing.
Based on my experience, we obtained data from single molecules rather than 1 ng when using ONT. There may be a problem of poor ONT sequencing when most of the pores are inactive, I, personally, am not aware of.
Adding background DNA to mix with target low DNA inputs from bacterial genome, I do not believe this is a new method that even requires its title.

Figure 1. Process of Noise-Augmented Direct (NAD) sequencing.

- Here is a general schematic of the standard protocol for preparation for ONT and sequencing. The authors could have simplified this figure for a specific idea.
“The target DNA/RNA is extracted from a sample of interest (e.g. human nasal swab)” - the authors are not specifically investigating this type of sample, but the bacterial genome.

Next, the authors add Lambda phage along with the bacterial genome when preparing the library. Have the authors attempted to add Lambda phage at the last step, before the DNA library is put on the chip? Maybe excessive amounts of Lambda phage, without barcodes, adaptors and integrase, also has some effect on the “dormant” pores.

“Augmented” - why do the authors use this term for a mixed sample?

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Partly
Are sufficient details provided to allow replication of the method development and its use by others?

Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomics, bioinformatics and molecular biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 10 Apr 2025

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2
Version 1 10 Apr 25	read	read

Ruslan Kalendar, University of Helsinki, Helsinki, Finland
Runsheng Li, City University of Hong Kong, Hong Kong, Hong Kong

Comments on this article

All Comments(0)

Add a comment

Sign up for content alerts

Browse by related subjects

Back to all reports

Reviewer Report

5 Views

28 May 2025 | for Version 1

Runsheng Li, City University of Hong Kong, Hong Kong, Hong Kong

5 Views Cite this report Responses(0)

Approved

This proof-of-concept study introduces Noise-Augmented Direct (NAD) sequencing, a workflow that mixes very small amounts of target DNA (< 500 ng) with an excess of “noise” DNA (λ-phage) and then uses nanopore adaptive sampling in depletion mode to eject the noise reads as they enter the pore.
The authors show that
1.Flongle runs can detect Salmonella enterica DNA down to 1 ng and assemble ~30 % of the genome at 3 ng (97.6 % identity).
2.A MinION R9.4.1 run with 300 ng of Pseudomonas aeruginosa DNA yields a near-complete assembly (99.9 % coverage, 99.97 % identity).
3.Rejected reads peak at ~500 bp (the time required for a decision), while accepted reads peak at ≥ 4 kb.
4.Most λ reads are correctly un-blocked; efficiency deteriorates on the high-throughput MinION run, presumably due to GPU saturation.

The authors propose that NAD sequencing could obviate PCR amplification for low-input samples, preserving epigenetic signals and simplifying field or clinical workflows.

Issues that need further discussion:
1. All data were produced on the R9.4.1 flow cell using the WIMP workflow. As R9 flow cells are being retired and R10.4.1 adaptive sampling relies on a different protocol, please clarify whether your conclusions extend to the R10 chemistry.
2. Please recommend optimal mixing ratios of carrier (“noise”) DNA to target DNA for different input amounts. For instance, if only 3 µg of target DNA is available, how much carrier DNA would you add? In our R10.4.1 tests, adaptive sampling becomes ineffective when the target fraction falls below 0.1 %, because most reads derive from the 400–500 bp carrier genome.

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Yes
Are sufficient details provided to allow replication of the method development and its use by others?

Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Genetics, Long read sequencing, Bioinformatics, DNA/RNA modification

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

15 Views

21 Apr 2025 | for Version 1

Ruslan Kalendar, University of Helsinki, Helsinki, Finland

15 Views Cite this report Responses(0)

Approved With Reservations

In this paper, the authors propose a novel sequencing method called Noise-Augmented Direct (NAD) that enables the direct sequencing of target DNA even when it falls below the minimum quantity and concentration required for long-read sequencing by augmentation with noise DNA and adaptive sampling.

The authors point out, “Long-read sequencing ... DNA/RNA sequencing”, RNA sequencing is not practiced, therefore, no need to specify ‘DNA/RNA’, only DNA.
Methylation and other DNA modifications as a necessary requirement for genome sequencing, but this direction makes sense probably only for the human genome and other eukaryotic model organisms, but not for analyzing bacterial genomes, which is the focus of the authors' study.

The main problem cited by the authors, which arises when the target sample concentration is low, relates to sequencing quality.
In my experience with ONT, we have obtained data from single molecules rather than from 1 ng. There may be a problem of poor ONT sequencing when most of the pores are inactive, but I am personally not aware of one.
I do not believe that adding background DNA to low-input target DNA from a bacterial genome is a new method, or one that even requires its own name.

Figure 1. Process of Noise-Augmented Direct (NAD) sequencing.

- This is a general schematic of the standard protocol for ONT library preparation and sequencing. The authors could have simplified the figure to convey the specific idea of the method.
“The target DNA/RNA is extracted from a sample of interest (e.g. human nasal swab)”: the authors do not investigate this type of sample specifically, but rather bacterial genomes.

Next, the authors add Lambda phage DNA along with the bacterial genome when preparing the library. Have the authors attempted to add the Lambda phage DNA at the last step, just before the DNA library is loaded onto the chip? Perhaps excessive amounts of Lambda phage DNA, without barcodes, adaptors and integrase, also have some effect on the “dormant” pores.

“Augmented” - why do the authors use this term for a mixed sample?

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Partly
Are sufficient details provided to allow replication of the method development and its use by others?

Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Genomics, bioinformatics and molecular biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
