Software Tool Article

Picopore: A tool for reducing the storage size of Oxford Nanopore Technologies datasets without loss of functionality

[version 1; peer review: 2 approved]
PUBLISHED 07 Mar 2017


Abstract

Oxford Nanopore Technologies' (ONT) MinION and PromethION long-read sequencing technologies are emerging as genuine alternatives to established Next-Generation Sequencing technologies. A combination of the highly redundant file format and a rapid increase in data generation have created a significant problem both for immediate data storage on MinION-capable laptops, and for long-term storage on lab data servers. 
We developed Picopore, a software suite offering three methods of compression. Picopore's lossless and deep lossless methods provide a 25% and 44% average reduction in size, respectively, without removing any data from the files. Picopore's raw method provides an 88% average reduction in size, while retaining biologically relevant data for the end-user. All methods have the capacity to run in real-time in parallel to a sequencing run, reducing demand for both immediate and long-term storage space.

Keywords

DNA Sequencing, Genome Informatics, Nanopore Sequencing, Compression, Data Storage

Introduction

Oxford Nanopore Technologies' (ONT) MinION nanopore sequencing device provides a high-throughput, low-cost alternative to traditional Next-Generation Sequencing (NGS) technologies [1]. The sequencing device itself is handheld and connects by USB to a laptop computer. Even including the equipment and reagents required for DNA library preparation, the equipment needed to run a MinION is minimal; entire laboratories have been transported overseas in a suitcase, allowing a versatile and agile approach to DNA and RNA sequencing [2].

Over the course of ONT's Early Access Program, several improvements in software and chemistry have led to a rapid increase in yield, through an increase in average read length, an improvement in basecalling accuracy and an increase in the total number of reads. In October 2015, the MinION Analysis and Reference Consortium (MARC), using R7.3 flow cells and SQK-MAP005 (2D) chemistry, reported a median of 60,600 reads and a median yield of 650,000 events across 20 MinION experiments [3]. In contrast, ONT claim to have obtained a total base yield of 17 gigabases using an R9.4 flowcell on the latest version of their MinKNOW software (https://nanoporetech.com/about-us/news/minion-software-minknow-upgraded-enable-increased-data-yield-other-benefits). This drastic increase in the data generated by a single experiment is rapidly becoming a limiting factor in uptake of the technology.

The concerns over data storage extend beyond the data generation capabilities of a single flowcell. Recent attempts to perform de novo assembly of eukaryotic genomes have combined the data generated by multiple flowcells in order to gain sufficient coverage of the genome [4]. To this end, ONT have begun the pre-commercial release of the PromethION, a benchtop nanopore sequencing device with 48 flowcells. Each of these flowcells contains 3,000 channels, as opposed to the 512 channels in a single MinION flowcell, with data generation projected at 6 terabases per flowcell per day [5].

Numerous methods have been developed for the efficient analysis of increasingly large nanopore datasets, but current methods to reduce the data storage footprint are extremely limited. Nanopore runs uploaded to online repositories, such as the European Nucleotide Archive, are bundled into a tarball, a process which facilitates upload as a single file but does not decrease file size. ONT runs bundled into a tarball (which could then be compressed by traditional means) cannot be read by any existing nanopore analysis tools. Moreover, traditional compression technologies are poorly adapted to the needs of individual users, many of whom have no need for a large portion of the data stored by ONT's basecallers. We therefore developed Picopore, a tool for reducing the storage footprint of ONT runs without preventing users from using their preferred analysis tools. Picopore uses a combination of storage reduction techniques, including the built-in dynamic compression of the HDF5 file format, reduction of data duplication, efficient allocation of memory within the file, and removal of intermediate data generated during basecalling that the end-user deems unnecessary.

Methods

Implementation

Picopore is developed using the Python h5py module (http://www.h5py.org/), an interface to the HDF5 file format (http://www.hdfgroup.org/HDF5), which ONT use under the FAST5 file extension. Picopore implements a number of different compression methods, a selection of which is applied according to user preferences, before HDF5's h5repack utility is used to rebuild the file at its reduced size.
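
As a minimal sketch of this workflow (not Picopore's actual code; the file names and the assumption that h5repack is on the PATH are illustrative), a FAST5 file can be edited with h5py and then rebuilt with h5repack so that freed space is reclaimed and a stronger GZIP filter is applied:

    import shutil
    import subprocess
    import h5py

    def repack_fast5(src, dst):
        # work on a copy so the original read is never modified in place
        shutil.copyfile(src, dst)
        with h5py.File(dst, "r+") as f:
            pass  # dataset rewrites and deletions would happen here
        # h5repack rewrites the file, reclaiming freed space and applying GZIP level 9
        tmp = dst + ".repacked"
        subprocess.check_call(["h5repack", "-f", "GZIP=9", dst, tmp])
        shutil.move(tmp, dst)

    # repack_fast5("read.fast5", "read.small.fast5")  # file names are hypothetical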

Compression techniques

Built-in GZIP compression. The HDF5 file format allows for both files and datasets within files to be written using a number of different compression filters, the most universally implemented being GZIP. GZIP applies traditional compression to the data stored in the HDF5 file with choices of compression level between 1 and 9. ONT’s default compression uses GZIP at level 1; Picopore increases this to level 9 in order to decrease disk space usage.
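
The effect of raising the filter level can be sketched with h5py as follows; the function simply rewrites one array-valued dataset with GZIP level 9 while preserving its attributes, and the dataset path in the commented example is only a typical basecalled event table location, not a guaranteed one:

    import h5py

    def recompress_dataset(f, path, level=9):
        # read the existing dataset and remember its attributes
        data = f[path][()]
        attrs = dict(f[path].attrs)
        # replace it with a copy compressed at the requested GZIP level
        del f[path]
        dset = f.create_dataset(path, data=data,
                                compression="gzip", compression_opts=level)
        dset.attrs.update(attrs)

    # with h5py.File("read.fast5", "r+") as f:  # hypothetical file name
    #     recompress_dataset(f, "Analyses/Basecall_2D_000/BaseCalled_template/Events")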

Dynamic memory allocation for variables. Data stored in the HDF5 file format uses fixed-size data types provided by NumPy, which offers a wide range of options for storing integers, floating-point numbers and strings within high-dimensional datasets [6]. ONT's native data is written using the largest data types provided by NumPy: 64-bit integers, 64-bit floating-point numbers, and variable-length strings. Picopore vastly reduces disk space usage by analysing each dataset to determine the minimum number of bytes required by each variable in the file, and changing the data type accordingly.
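
The idea can be sketched with NumPy as below; this is illustrative rather than Picopore's exact logic, and mixed-sign data receives a conservative rather than strictly minimal type:

    import numpy as np

    def compact_dtype(array):
        # leave non-integer data unchanged in this sketch
        if not np.issubdtype(array.dtype, np.integer):
            return array.dtype
        # combine the smallest types able to hold the minimum and maximum values
        return np.promote_types(np.min_scalar_type(int(array.min())),
                                np.min_scalar_type(int(array.max())))

    events = np.array([0, 511, 1023], dtype=np.int64)   # example values
    print(compact_dtype(events))                        # uint16 instead of int64
    shrunk = events.astype(compact_dtype(events))       # same values, a quarter of the bytes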

Collapsing of file structure. The advantage of the HDF5 file format is that it provides a file directory-like storage format for datasets and properties, making reading and writing to the files straightforward and easy to understand. However, the inherent nature of the highly-structured file format requires HDF5 to allocate slots of memory to "groups", which represent the internal directory structure of the file. Picopore reduces the disk space used by this file metadata by collapsing the directory structure, while retaining the option for users to reverse this action when tools that only recognize the original file format are required.
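
The flattening step might look roughly like the following h5py sketch; it illustrates the idea rather than Picopore's implementation, which also records enough information to reverse the change (group attributes are ignored here):

    import h5py

    def flatten_groups(f):
        # collect every dataset path in the file
        datasets = []
        f.visititems(lambda name, obj: datasets.append(name)
                     if isinstance(obj, h5py.Dataset) else None)
        # move each dataset to the root, encoding its old path in its name
        for name in datasets:
            if "/" in name:
                f.move(name, name.replace("/", "."))
        # delete the now-empty groups, deepest first
        groups = []
        f.visititems(lambda name, obj: groups.append(name)
                     if isinstance(obj, h5py.Group) else None)
        for name in sorted(groups, key=len, reverse=True):
            if name in f and len(f[name]) == 0:
                del f[name]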

Indexing of duplicated data. ONT’s most widely used basecalling software, the cloud-based Metrichor service (https://metrichor.com/s/), performs feature recognition (or "event detection"). This segments the electrical signal representing each nanopore read into events, each of which represents a period of time when the DNA was stationary in the nanopore. These events are then converted into basecalled data, which provides a single k-mer (at present a 5-mer) of DNA representing the bases in the nanopore contributing to the signal at that time. Each event corresponds to a single row in the basecalled dataset, and both the event detection and basecalled datasets thus store the mean signal, standard deviation, start time and length of the event. Picopore reduces disk space usage by indexing the basecalled dataset to the event detection dataset, removing the duplicated data while retaining the option for users to reverse this action when tools that require access to this data are required.
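
The indexing idea can be sketched with a NumPy structured array; the shared field names below ("mean", "stdv", "start", "length") are assumptions about the duplicated columns, and a one-to-one row correspondence is assumed as described above:

    import numpy as np

    def index_basecalled(basecalled, shared=("mean", "stdv", "start", "length")):
        # keep only the basecall-specific columns, plus a row index into the
        # event detection table from which the shared columns can be recovered
        keep = [n for n in basecalled.dtype.names if n not in shared]
        out_dtype = [("event_index", np.uint32)] + \
                    [(n, basecalled.dtype[n]) for n in keep]
        out = np.empty(basecalled.shape[0], dtype=out_dtype)
        out["event_index"] = np.arange(basecalled.shape[0], dtype=np.uint32)
        for n in keep:
            out[n] = basecalled[n]
        return out

Reversing the operation is then a matter of joining the two tables on event_index.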

Removal of intermediate data. The primary function of all basecalling software is to generate a FASTQ file containing the genomic sequence and associated quality scores representing the read stored in each FAST5 file. While some software tools, such as nanopolish [7] and nanoraw [8], do make use of the signal, event detection and basecalled datasets, the large majority of analyses, including alignment, assembly and variant calling, simply require access to the FASTQ data. Picopore allows users to remove the intermediate data generated during the process of converting raw signal to FASTQ, while retaining the signal data, should they ever want to re-basecall the run to attain improved FASTQ data or to access this intermediate data at a later stage.
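
A rough sketch of this removal with h5py is shown below; matching per-event tables by the suffix "Events" mirrors dataset names typically written by ONT's basecallers, but the exact layout varies between basecaller versions, so treat the selection rule as an assumption:

    import h5py

    def strip_intermediate(path):
        with h5py.File(path, "r+") as f:
            doomed = []
            def visit(name, obj):
                # mark per-event tables for deletion; FASTQ and raw signal stay
                if isinstance(obj, h5py.Dataset) and name.endswith("Events"):
                    doomed.append(name)
            f.visititems(visit)
            for name in doomed:
                del f[name]
        # a follow-up h5repack (as in the earlier sketch) reclaims the freed space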

Operation

Requirements. Picopore is built in Python 2.7 (www.python.org) and runs on Windows, Mac OS and Linux. It requires the following Python packages:

  • h5py 2.2.0 or later

  • watchdog 0.8.3 or later

In addition, Picopore requires HDF5 1.8.4 or newer with development headers (libhdf5-dev or similar), which provide the binary utility h5repack.

Installation. The latest stable version of Picopore is available on PyPi and bioconda (see Software availability). It can be installed according to the following commands:

Linux and Mac OS: pip install picopore

Windows: conda install picopore -c bioconda -c conda-forge

Picopore can also be installed from source (see Software availability) using the command python setup.py install.

Execution. Picopore is run from the command-line as a binary executable.

Picopore accepts both folders and FAST5 files as input. If a folder is provided, it will be searched recursively for FAST5 files, and all files found will be considered as input.

There are three modes of compression available, each of which performs a selection of the techniques described above.

  • lossless: performs built-in GZIP compression and dynamic memory allocation for variables. This mode is both fast and allows continued analysis of data by any existing software.

  • deep-lossless: performs lossless compression, as well as collapsing of file structure and indexing of duplicated data. This mode obtains the best compression results without removing any data, but comes at the cost of requiring reversion before most software tools can analyse the data.

  • raw: performs lossless compression, as well as removal of intermediate data, partially reverting files to the "raw" pre-basecalled file format. This mode is fast, obtains the best file size reduction, and allows continued analysis by tools that extract FASTQ and related data, but comes at the cost of removing intermediate basecalling data required for some niche applications, such as nanopolish, which cannot be retrieved by Picopore (but can be regenerated using basecalling software).

Optional arguments (illustrated in the example invocations below) include:

  • revert: reverts lossless compressed files to their original state to allow high-speed access at the cost of disk usage;

  • realtime: watches for file creation in the given input folder(s) and performs the selected mode of compression on new files in real time to reduce the footprint of an ongoing MinION run;

  • prefix: allows the user to specify a filename prefix to prevent in-place overwriting of files;

  • group: allows the user to select only one of the analysis groups on files that have been processed by multiple basecallers;

  • threads: allows the user to specify the number of files to be processed in parallel.
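
The following invocations illustrate how the modes and options combine. They follow the option names listed above, but the exact command-line syntax may differ between Picopore versions, and the input path is hypothetical, so treat these as a sketch and consult picopore --help for the authoritative usage:

    # compress a finished run in place, keeping all data, using four workers
    picopore --mode lossless --threads 4 /data/minion_run/

    # watch an ongoing run and strip intermediate basecalling data as reads are written
    picopore --mode raw --realtime /data/minion_run/

    # restore deep-lossless files before running tools that need the original layout
    picopore --mode deep-lossless --revert /data/minion_run/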

Results

To demonstrate the effectiveness of Picopore’s compression, we ran all three modes of compression on four toy datasets of 40 FAST5 files each, run using the R9 SQK-RAD001 (R9_1D), R9 SQK-NSK007 (R9_2D), R9.4 SQK-RAD002 (R9.4_1D) and R9.4 SQK-LSK208 (R9.4_2D) protocols. The files for the toy datasets were chosen randomly from the pass folder of four MinION datasets generated at the Australian Genome Research Facility (see Software and data availability). For the R9_1D dataset, DNA was extracted from the lung of a juvenile 129/Sv mouse using the DNeasy Blood and Tissue kit (Qiagen). For the R9_2D, R9.4_1D and R9.4_2D datasets, DNA was extracted from a culture of Escherichia coli K12 MG1655 using the Blood & Cell Culture DNA Kit (Qiagen). Quality control was performed by visualisation on the TapeStation (Agilent). Run metadata is shown in Table 1.

Table 1. Metadata for MinION datasets sampled to produce toy datasets.

Name      Chemistry        Protocol      MinION ID   Flowcell ID   Sample              Strain
R9_1D     R9 1D            SQK-RAD001    MN17324     FAD24340      Mouse               129/Sv
R9_2D     R9 2D            SQK-NSK007    MN17324     FAD24193      Escherichia coli    K12 MG1655
R9.4_1D   R9.4 1D Rapid    SQK-RAD002    MN17324     FAF04136      Escherichia coli    K12 MG1655
R9.4_2D   R9.4 2D          SQK-RAD002    MN17324     FAF04232      Escherichia coli    K12 MG1655

Each file was compressed and tarred using each of five methods: no compression, gzip (applied after tarring, as per convention), picopore lossless, picopore deep-lossless and picopore raw. Each of these methods was run on a single core. Figure 1 shows that lossless achieves only slightly less compression than gzip, giving an average reduction in size of 25% compared to gzip’s 32%, while deep-lossless and raw perform significantly better, giving average reductions in size of 44% and 88%, respectively. A dependent sample t-test (R v3.3.0) was run on individual compressed file sizes. Table 2 shows that each successive method of compression (excluding gzip, which does not compress individual files) gives a significant reduction in size from the previous. Figure 2 shows that while all of Picopore’s compression methods are much slower than gzip, raw is the fastest of these, followed by lossless and deep-lossless. Note that the tarring time makes up a maximum of 0.04 s/read in each case and is largely negligible.


Figure 1. Size of tarball containing FAST5 files compressed using various methods.

Table 2. Significance of difference in size of files compressed with different methods using a dependent sample t-test (R v3.3.0).

Mode 1          Mode 2          t-statistic   p-value
uncompressed    lossless        33.58         < 10^-15
lossless        deep-lossless   20.14         < 10^-15
deep-lossless   raw             17.69         < 10^-15
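
The test in Table 2 is a paired (dependent-sample) comparison across the 40 per-file sizes; the published analysis used R v3.3.0, but an equivalent check can be sketched in Python (the byte counts below are placeholders, not values from the study):

    import numpy as np
    from scipy.stats import ttest_rel

    # per-file sizes in bytes for the same files under two modes (placeholder values)
    uncompressed = np.array([615000, 801000, 783000, 256000], dtype=float)
    lossless     = np.array([344000, 406000, 598000, 161000], dtype=float)

    t, p = ttest_rel(uncompressed, lossless)
    print("t = %.2f, p = %.3g" % (t, p))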

Figure 2. Time taken for single-thread compression and tarring of FAST5 files using various methods.

To demonstrate the effectiveness of Picopore’s multithreading, we ran deep-lossless, the most computationally expensive of the Picopore compression methods, on each dataset using 1, 2, 5, and 10 cores. Figure 3 shows an almost linear improvement in speed, indicating that even on a small dataset the multithreading overhead is relatively small.


Figure 3. Speed of deep lossless compression of FAST5 files using multiple threads. The dotted blue line shows the theoretical linear maximum increase in speed for the R9 2D run.

[Dataset 1 is summarised here rather than reproduced in full: for each toy dataset (R9_1D, R9_2D, R9.4_1D, R9.4_2D) it records the size in bytes of the tarball and of each of the 40 FAST5 files under the uncompressed, tar.gz, picopore lossless, deep-lossless and raw methods, and, for three replicate runs, the compression, tarring and total times in seconds for each method, including deep-lossless runs on 1, 2, 5 and 10 threads.]
Dataset 1. Output data generated using Picopore on four toy datasets.
http://dx.doi.org/10.5256/f1000research.11022.d153370
Each of the four toy datasets was compressed using each of Picopore’s three compression modes and GZIP. Datasets compressed using Picopore were tarred after compression, while the GZIP dataset was tarred before compression.

Discussion

It is clear that, due to the enormous reduction in disk space and low total time requirements, Picopore’s raw compression is the optimal mode for users who have no need for the intermediate event detection and basecalling data. For users who wish to retain all data and need to compress it in real time, the superior running speed of lossless compression over deep lossless compression may make lossless the preferred method; for users with the same data retention requirements who wish to store data long-term on a file server, and for whom compression speed is not an issue, deep lossless compression provides the best option. All Picopore compression methods provide significant improvements over uncompressed or traditionally compressed files, and the lossless and raw methods carry the added benefit that files can be processed by analysis tools in compressed form.

Although the compression has a high CPU cost, Picopore’s ability to run on multiple threads means that, if computing resources are available, files can be compressed in a reasonably short period of time. Finally, extrapolating the running time per file to a real-time run of 500,000 reads over 48 hours (0.34 s/read), Picopore has the capability to run the lossless (0.33 s/read) and raw (0.25 s/read) modes on a single core in real time. While deep-lossless requires multiple cores to keep pace with MinION data generation, this adds little to the overall computational cost, and it can reach real-time speed with just five cores (0.24 s/read). As data generation continues to increase in scale, further gains could be made by incorporating the compression methods used in Picopore into the basecalling software itself.

Conclusions

ONT’s MinION and PromethION sequencing devices promise to produce increasingly large datasets as the technology progresses toward commercial release. The disk space required to run and store one or more datasets from these devices poses a problem for service providers and users alike; Picopore provides three different solutions that cater to the differing needs of users.

Although the trade-off between data retention, computing time and disk space cannot be perfectly resolved, Picopore provides options that allow users to reduce their ONT datasets to the minimum viable size for their intended use, whether that involves real-time compression to relieve laptop disk space, reduced bandwidth for transferring datasets between laboratories, or a smaller storage footprint on shared data servers.

Software and data availability

Software for Linux or Mac OS available from: https://pypi.python.org/pypi/picopore

Software for Linux, Mac OS and Windows available from: https://anaconda.org/bioconda/picopore

Source code available from: https://github.com/scottgigante/picopore

Archived source code from: https://doi.org/10.5281/zenodo.3219579

License: GPLv3

Dataset 1. Output data generated using Picopore on four toy datasets. Each of the four toy datasets was compressed using each of Picopore’s three compression modes and GZIP. Datasets compressed using Picopore were tarred after compression, while the GZIP dataset was tarred before compression. doi: 10.5256/f1000research.11022.d153370

The toy dataset used for the analyses in this paper is available on Zenodo: Toy datasets for compression by Picopore, doi: 10.5281/zenodo.321959

Author endorsement

Chris Woodruff confirms that the author has an appropriate level of expertise to conduct this research, and confirms that the submission is of an acceptable scientific standard. Chris Woodruff declares he has no competing interests. Affiliation: Visiting Scientist at Bioinformatics Division of Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.
