Software Tool Article

Picopore: A tool for reducing the storage size of Oxford Nanopore Technologies datasets without loss of functionality

[version 1; peer review: 2 approved]
PUBLISHED 07 Mar 2017


Abstract

Oxford Nanopore Technologies' (ONT) MinION and PromethION long-read sequencing technologies are emerging as genuine alternatives to established Next-Generation Sequencing technologies. A combination of the highly redundant file format and a rapid increase in data generation have created a significant problem both for immediate data storage on MinION-capable laptops, and for long-term storage on lab data servers. 
We developed Picopore, a software suite offering three methods of compression. Picopore's lossless and deep lossless methods provide a 25% and 44% average reduction in size, respectively, without removing any data from the files. Picopore's raw method provides an 88% average reduction in size, while retaining biologically relevant data for the end-user. All methods have the capacity to run in real-time in parallel to a sequencing run, reducing demand for both immediate and long-term storage space.

Keywords

DNA Sequencing, Genome Informatics, Nanopore Sequencing, Compression, Data Storage

Introduction

Oxford Nanopore Technologies' (ONT) MinION nanopore sequencing device provides a high-throughput, low-cost alternative to traditional Next-Generation Sequencing (NGS) technologies [1]. The sequencing device itself is handheld and connects by USB to a laptop computer. Even including the equipment and reagents required for DNA library preparation, the equipment needed to run a MinION is minimal; entire laboratories have been transported overseas in a suitcase, allowing a versatile and agile approach to DNA and RNA sequencing [2].

Over the course of ONT's Early Access Program, several improvements in software and chemistry have led to a rapid increase in yield, through an increase in average read length, an improvement in basecalling accuracy and an increase in the total number of reads. In October 2015, the MinION Analysis and Reference Consortium (MARC), using R7.3 flow cells and SQK-MAP005 (2D) chemistry, reported a median of 60,600 reads and a median yield of 650,000 events across 20 MinION experiments [3]. In contrast, ONT claim to have obtained a total base yield of 17 gigabases using an R9.4 flowcell on the latest version of their MinKNOW software (https://nanoporetech.com/about-us/news/minion-software-minknow-upgraded-enable-increased-data-yield-other-benefits). This drastic increase in the data generated by a single experiment is rapidly becoming a limiting factor in uptake of the technology.

The concerns over data storage extend beyond the data generation capabilities of a single flowcell. Recent attempts to perform de novo assembly of eukaryotic genomes have combined the data generated by multiple flowcells in order to gain sufficient coverage of the genome [4]. To this end, ONT have begun the pre-commercial release of the PromethION, a benchtop nanopore sequencing device with 48 flowcells. Each of these flowcells contains 3,000 channels, as opposed to the 512 channels in a single MinION flowcell, with data generation projected at 6 terabases per flowcell per day [5].

Numerous methods have been developed for the efficient analysis of increasingly large nanopore datasets, but current methods to reduce the data storage footprint are extremely limited. Nanopore runs uploaded to online repositories, such as the European Nucleotide Archive, are bundled into a tarball, a process which facilitates upload as a single file but does not decrease file size. ONT runs bundled into a tarball (which could then be compressed by traditional means) cannot be read by any existing nanopore analysis tools. Moreover, traditional compression technologies are poorly adapted to the needs of individual users, many of whom have no need for a large portion of the data stored by ONT's basecallers. We therefore developed Picopore, a tool for reducing the storage footprint of ONT runs without preventing users from using their preferred analysis tools. Picopore uses a combination of storage reduction techniques, including the built-in dynamic compression of the HDF5 file format, reduction of data duplication, efficient allocation of memory within the file, and removal of intermediate data generated during basecalling that the end-user deems unnecessary.

Methods

Implementation

Picopore is developed using the Python h5py module (http://www.h5py.org/), an interface to the HDF5 file format (http://www.hdfgroup.org/HDF5), which ONT use under the FAST5 file extension. Picopore implements a number of different compression methods, a selection of which is applied according to user preferences, before HDF5's h5repack utility is used to rebuild the file at its reduced size.
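
As a minimal sketch of this workflow (not Picopore's actual code; the file names and the assumption that h5repack is on the PATH are illustrative), a FAST5 file can be edited with h5py and then rebuilt with h5repack so that freed space is reclaimed and a stronger GZIP filter is applied:

    import shutil
    import subprocess
    import h5py

    def repack_fast5(src, dst):
        # work on a copy so the original read is never modified in place
        shutil.copyfile(src, dst)
        with h5py.File(dst, "r+") as f:
            pass  # dataset rewrites and deletions would happen here
        # h5repack rewrites the file, reclaiming freed space and applying GZIP level 9
        tmp = dst + ".repacked"
        subprocess.check_call(["h5repack", "-f", "GZIP=9", dst, tmp])
        shutil.move(tmp, dst)

    # repack_fast5("read.fast5", "read.small.fast5")  # file names are hypothetical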

Compression techniques

Built-in GZIP compression. The HDF5 file format allows for both files and datasets within files to be written using a number of different compression filters, the most universally implemented being GZIP. GZIP applies traditional compression to the data stored in the HDF5 file with choices of compression level between 1 and 9. ONT’s default compression uses GZIP at level 1; Picopore increases this to level 9 in order to decrease disk space usage.
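
The effect of raising the filter level can be sketched with h5py as follows; the function simply rewrites one array-valued dataset with GZIP level 9 while preserving its attributes, and the dataset path in the commented example is only a typical basecalled event table location, not a guaranteed one:

    import h5py

    def recompress_dataset(f, path, level=9):
        # read the existing dataset and remember its attributes
        data = f[path][()]
        attrs = dict(f[path].attrs)
        # replace it with a copy compressed at the requested GZIP level
        del f[path]
        dset = f.create_dataset(path, data=data,
                                compression="gzip", compression_opts=level)
        dset.attrs.update(attrs)

    # with h5py.File("read.fast5", "r+") as f:  # hypothetical file name
    #     recompress_dataset(f, "Analyses/Basecall_2D_000/BaseCalled_template/Events")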

Dynamic memory allocation for variables. Data stored in the HDF5 file format uses fixed-size data types provided by NumPy, which offers a wide range of options for storing integers, floating-point numbers and strings within high-dimensional datasets [6]. ONT's native data is written using the largest data types provided by NumPy: 64-bit integers, 64-bit floating-point numbers, and variable-length strings. Picopore vastly reduces disk space usage by analysing each dataset to determine the minimum number of bytes required by each variable in the file, and changing the data type accordingly.
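
The idea can be sketched with NumPy as below; this is illustrative rather than Picopore's exact logic, and mixed-sign data receives a conservative rather than strictly minimal type:

    import numpy as np

    def compact_dtype(array):
        # leave non-integer data unchanged in this sketch
        if not np.issubdtype(array.dtype, np.integer):
            return array.dtype
        # combine the smallest types able to hold the minimum and maximum values
        return np.promote_types(np.min_scalar_type(int(array.min())),
                                np.min_scalar_type(int(array.max())))

    events = np.array([0, 511, 1023], dtype=np.int64)   # example values
    print(compact_dtype(events))                        # uint16 instead of int64
    shrunk = events.astype(compact_dtype(events))       # same values, a quarter of the bytes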

Collapsing of file structure. The advantage of the HDF5 file format is that it provides a file directory-like storage format for datasets and properties, making reading and writing to the files straightforward and easy to understand. However, the inherent nature of the highly-structured file format requires HDF5 to allocate slots of memory to "groups", which represent the internal directory structure of the file. Picopore reduces the disk space used by this file metadata by collapsing the directory structure, while retaining the option for users to reverse this action when tools that only recognize the original file format are required.
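
The flattening step might look roughly like the following h5py sketch; it illustrates the idea rather than Picopore's implementation, which also records enough information to reverse the change (group attributes are ignored here):

    import h5py

    def flatten_groups(f):
        # collect every dataset path in the file
        datasets = []
        f.visititems(lambda name, obj: datasets.append(name)
                     if isinstance(obj, h5py.Dataset) else None)
        # move each dataset to the root, encoding its old path in its name
        for name in datasets:
            if "/" in name:
                f.move(name, name.replace("/", "."))
        # delete the now-empty groups, deepest first
        groups = []
        f.visititems(lambda name, obj: groups.append(name)
                     if isinstance(obj, h5py.Group) else None)
        for name in sorted(groups, key=len, reverse=True):
            if name in f and len(f[name]) == 0:
                del f[name]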

Indexing of duplicated data. ONT’s most widely used basecalling software, the cloud-based Metrichor service (https://metrichor.com/s/), performs feature recognition (or "event detection"). This segments the electrical signal representing each nanopore read into events, each of which represents a period of time when the DNA was stationary in the nanopore. These events are then converted into basecalled data, which provides a single k-mer (at present a 5-mer) of DNA representing the bases in the nanopore contributing to the signal at that time. Each event corresponds to a single row in the basecalled dataset, and both the event detection and basecalled datasets thus store the mean signal, standard deviation, start time and length of the event. Picopore reduces disk space usage by indexing the basecalled dataset to the event detection dataset, removing the duplicated data while retaining the option for users to reverse this action when tools that require access to this data are required.
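
The indexing idea can be sketched with a NumPy structured array; the shared field names below ("mean", "stdv", "start", "length") are assumptions about the duplicated columns, and a one-to-one row correspondence is assumed as described above:

    import numpy as np

    def index_basecalled(basecalled, shared=("mean", "stdv", "start", "length")):
        # keep only the basecall-specific columns, plus a row index into the
        # event detection table from which the shared columns can be recovered
        keep = [n for n in basecalled.dtype.names if n not in shared]
        out_dtype = [("event_index", np.uint32)] + \
                    [(n, basecalled.dtype[n]) for n in keep]
        out = np.empty(basecalled.shape[0], dtype=out_dtype)
        out["event_index"] = np.arange(basecalled.shape[0], dtype=np.uint32)
        for n in keep:
            out[n] = basecalled[n]
        return out

Reversing the operation is then a matter of joining the two tables on event_index.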

Removal of intermediate data. The primary function of all basecalling software is to generate a FASTQ file containing the genomic sequence and associated quality scores representing the read stored in each FAST5 file. While some software tools, such as nanopolish [7] and nanoraw [8], do make use of the signal, event detection and basecalled datasets, the large majority of analyses, including alignment, assembly and variant calling, simply require access to the FASTQ data. Picopore allows users to remove the intermediate data generated during the process of converting raw signal to FASTQ, while retaining the signal data, should they ever want to re-basecall the run to attain improved FASTQ data or to access this intermediate data at a later stage.
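
A rough sketch of this removal with h5py is shown below; matching per-event tables by the suffix "Events" mirrors dataset names typically written by ONT's basecallers, but the exact layout varies between basecaller versions, so treat the selection rule as an assumption:

    import h5py

    def strip_intermediate(path):
        with h5py.File(path, "r+") as f:
            doomed = []
            def visit(name, obj):
                # mark per-event tables for deletion; FASTQ and raw signal stay
                if isinstance(obj, h5py.Dataset) and name.endswith("Events"):
                    doomed.append(name)
            f.visititems(visit)
            for name in doomed:
                del f[name]
        # a follow-up h5repack (as in the earlier sketch) reclaims the freed space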

Operation

Requirements. Picopore is built in Python 2.7 (www.python.org) and runs on Windows, Mac OS and Linux. It requires the following Python packages:

  • h5py 2.2.0 or later

  • watchdog 0.8.3 or later

In addition, Picopore requires HDF5 1.8.4 or newer with development headers (libhdf5-dev or similar), which provide the binary utility h5repack.

Installation. The latest stable version of Picopore is available on PyPi and bioconda (see Software availability). It can be installed according to the following commands:

Linux and Mac OS: pip install picopore

Windows: conda install picopore -c bioconda -c conda-forge

Picopore can also be installed from source (see Software availability) using the command python setup.py install.

Execution. Picopore is run from the command-line as a binary executable.

Picopore accepts both folders and FAST5 files as input. If a folder is provided, it will be searched recursively for FAST5 files, and all files found will be considered as input.

There are three modes of compression available, each of which performs a selection of the techniques described above.

  • lossless: performs built-in GZIP compression and dynamic memory allocation for variables. This mode is both fast and allows continued analysis of data by any existing software.

  • deep-lossless: performs lossless compression, as well as collapsing of file structure and indexing of duplicated data. This mode obtains the best compression results without removing any data, but comes at the cost of requiring reversion before most software tools can analyse the data.

  • raw: performs lossless compression, as well as removal of intermediate data, partially reverting files to the "raw" pre-basecalled file format. This mode is fast, obtains the best file size reduction, and allows continued analysis by tools that extract FASTQ and related data, but comes at the cost of removing intermediate basecalling data required for some niche applications, such as nanopolish, which cannot be retrieved by Picopore (but can be regenerated using basecalling software).

Optional arguments (illustrated in the example invocations below) include:

  • revert: reverts lossless compressed files to their original state to allow high-speed access at the cost of disk usage;

  • realtime: watches for file creation in the given input folder(s) and performs the selected mode of compression on new files in real time to reduce the footprint of an ongoing MinION run;

  • prefix: allows the user to specify a filename prefix to prevent in-place overwriting of files;

  • group: allows the user to select only one of the analysis groups on files that have been processed by multiple basecallers;

  • threads: allows the user to specify the number of files to be processed in parallel.
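
The following invocations illustrate how the modes and options combine. They follow the option names listed above, but the exact command-line syntax may differ between Picopore versions, and the input path is hypothetical, so treat these as a sketch and consult picopore --help for the authoritative usage:

    # compress a finished run in place, keeping all data, using four workers
    picopore --mode lossless --threads 4 /data/minion_run/

    # watch an ongoing run and strip intermediate basecalling data as reads are written
    picopore --mode raw --realtime /data/minion_run/

    # restore deep-lossless files before running tools that need the original layout
    picopore --mode deep-lossless --revert /data/minion_run/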

Results

To demonstrate the effectiveness of Picopore’s compression, we ran all three modes of compression on four toy datasets of 40 FAST5 files each, run using the R9 SQK-RAD001 (R9_1D), R9 SQK-NSK007 (R9_2D), R9.4 SQK-RAD002 (R9.4_1D) and R9.4 SQK-LSK208 (R9.4_2D) protocols. The files for the toy datasets were chosen randomly from the pass folder of four MinION datasets generated at the Australian Genome Research Facility (see Software and data availability). For the R9_1D dataset, DNA was extracted from the lung of a juvenile 129/Sv mouse using the DNeasy Blood and Tissue kit (Qiagen). For the R9_2D, R9.4_1D and R9.4_2D datasets, DNA was extracted from a culture of Escherichia coli K12 MG1655 using the Blood & Cell Culture DNA Kit (Qiagen). Quality control was performed by visualisation on the TapeStation (Agilent). Run metadata is shown in Table 1.

Table 1. Metadata for MinION datasets sampled to produce toy datasets.

Name      Chemistry        Protocol      MinION ID   Flowcell ID   Sample              Strain
R9_1D     R9 1D            SQK-RAD001    MN17324     FAD24340      Mouse               129/Sv
R9_2D     R9 2D            SQK-NSK007    MN17324     FAD24193      Escherichia coli    K12 MG1655
R9.4_1D   R9.4 1D Rapid    SQK-RAD002    MN17324     FAF04136      Escherichia coli    K12 MG1655
R9.4_2D   R9.4 2D          SQK-RAD002    MN17324     FAF04232      Escherichia coli    K12 MG1655

Each file was compressed and tarred using each of five methods: no compression, gzip (applied after tarring, as per convention), picopore lossless, picopore deep-lossless and picopore raw. Each of these methods was run on a single core. Figure 1 shows that lossless achieves only slightly less compression than gzip, giving an average reduction in size of 25% compared to gzip’s 32%, while deep-lossless and raw perform significantly better, giving average reductions in size of 44% and 88%, respectively. A dependent sample t-test (R v3.3.0) was run on individual compressed file sizes. Table 2 shows that each successive method of compression (excluding gzip, which does not compress individual files) gives a significant reduction in size from the previous. Figure 2 shows that while all of Picopore’s compression methods are much slower than gzip, raw is the fastest of these, followed by lossless and deep-lossless. Note that the tarring time makes up a maximum of 0.04 s/read in each case and is largely negligible.


Figure 1. Size of tarball containing FAST5 files compressed using various methods.

Table 2. Significance of difference in size of files compressed with different methods using a dependent sample t-test (R v3.3.0).

Mode 1          Mode 2          t-statistic   p-value
uncompressed    lossless        33.58         < 10^-15
lossless        deep-lossless   20.14         < 10^-15
deep-lossless   raw             17.69         < 10^-15
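
The test in Table 2 is a paired (dependent-sample) comparison across the 40 per-file sizes; the published analysis used R v3.3.0, but an equivalent check can be sketched in Python (the byte counts below are placeholders, not values from the study):

    import numpy as np
    from scipy.stats import ttest_rel

    # per-file sizes in bytes for the same files under two modes (placeholder values)
    uncompressed = np.array([615000, 801000, 783000, 256000], dtype=float)
    lossless     = np.array([344000, 406000, 598000, 161000], dtype=float)

    t, p = ttest_rel(uncompressed, lossless)
    print("t = %.2f, p = %.3g" % (t, p))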

Figure 2. Time taken for single-thread compression and tarring of FAST5 files using various methods.

To demonstrate the effectiveness of Picopore’s multithreading, we ran deep-lossless, the most computationally expensive of the Picopore compression methods, on each dataset using 1, 2, 5, and 10 cores. Figure 3 shows an almost linear improvement in speed, indicating that even on a small dataset the multithreading overhead is relatively small.


Figure 3. Speed of deep lossless compression of FAST5 files using multiple threads. The dotted blue line shows the theoretical linear maximum increase in speed for the R9 2D run.

[Dataset 1 is summarised here rather than reproduced in full: for each toy dataset (R9_1D, R9_2D, R9.4_1D, R9.4_2D) it records the size in bytes of the tarball and of each of the 40 FAST5 files under the uncompressed, tar.gz, picopore lossless, deep-lossless and raw methods, and, for three replicate runs, the compression, tarring and total times in seconds for each method, including deep-lossless runs on 1, 2, 5 and 10 threads.]
Dataset 1. Output data generated using Picopore on four toy datasets.
http://dx.doi.org/10.5256/f1000research.11022.d153370
Each of the four toy datasets was compressed using each of Picopore’s three compression modes and GZIP. Datasets compressed using Picopore were tarred after compression, while the GZIP dataset was tarred before compression.

Discussion

It is clear that, due to the enormous reduction in disk space and low total time requirements, Picopore’s raw compression is the optimal mode for users who have no need for the intermediate event detection and basecalling data. For users who wish to retain all data and need to compress it in real time, the superior running speed of lossless compression over deep lossless compression may make lossless the preferred method; for users with the same data retention requirements who wish to store data long-term on a file server, and for whom compression speed is not an issue, deep lossless compression provides the best option. All Picopore compression methods provide significant improvements over uncompressed or traditionally compressed files, and the lossless and raw methods carry the added benefit that files can be processed by analysis tools in compressed form.

Although the compression has a high CPU cost, Picopore’s ability to run on multiple threads means that, if computing resources are available, files can be compressed in a reasonably short period of time. Finally, extrapolating the running time per file to a real-time run of 500,000 reads over 48 hours (0.34 s/read), Picopore has the capability to run the lossless (0.33 s/read) and raw (0.25 s/read) modes on a single core in real time. While deep-lossless requires multiple cores to keep pace with MinION data generation, this adds little to the overall computational cost, and it can reach real-time speed with just five cores (0.24 s/read). As data generation continues to increase in scale, further gains could be made by incorporating the compression methods used in Picopore into the basecalling software itself.

Conclusions

ONT’s MinION and PromethION sequencing devices promise to produce increasingly large datasets as the technology progresses toward commercial release. The disk space required to run and store one or more datasets from these devices poses a problem for service providers and users alike; Picopore provides three different solutions that cater to the differing needs of users.

Although the trade-off between data retention, computing time and disk space cannot be perfectly resolved, Picopore provides options that allow users to reduce their ONT datasets to the minimum viable size for their intended use, whether that involves real-time compression to relieve laptop disk space, reduced bandwidth for transferring datasets between laboratories, or a smaller storage footprint on shared data servers.

Software and data availability

Software for Linux or Mac OS available from: https://pypi.python.org/pypi/picopore

Software for Linux, Mac OS and Windows available from: https://anaconda.org/bioconda/picopore

Source code available from: https://github.com/scottgigante/picopore

Archived source code from: https://doi.org/10.5281/zenodo.3219579

License: GPLv3

Dataset 1. Output data generated using Picopore on four toy datasets. Each of the four toy datasets was compressed using each of Picopore’s three compression modes and GZIP. Datasets compressed using Picopore were tarred after compression, while the GZIP dataset was tarred before compression. doi: 10.5256/f1000research.11022.d153370

The toy dataset used for the analyses in this paper is available on Zenodo: Toy datasets for compression by Picopore, doi: 10.5281/zenodo.321959

Author endorsement

Chris Woodruff confirms that the author has an appropriate level of expertise to conduct this research, and confirms that the submission is of an acceptable scientific standard. Chris Woodruff declares he has no competing interests. Affiliation: Visiting Scientist at Bioinformatics Division of Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.
