pyGeno: A Python package for precision medicine and proteogenomics

pyGeno is a Python package mainly intended for precision medicine applications that revolve around genomics and proteomics. It integrates reference sequences and annotations from Ensembl, genomic polymorphisms from the dbSNP database and data from next-gen sequencing into an easy-to-use, memory-efficient and fast framework, thereby allowing the user to easily explore subject-specific genomes and proteomes. Compared to a standalone program, pyGeno gives the user access to the complete expressivity of Python, a general-purpose programming language. Its range of application therefore encompasses both short scripts and large-scale genome-wide studies.

High-throughput systems biology and precision medicine applications require the integration of data from many different sources. For instance, a significant part of precision medicine research revolves around the identification of relevant single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs) and the study of their context 1. Furthermore, recent studies in proteogenomics show that replacing traditional reference databases such as UniProt 2 with customized databases that integrate the subject's genomic polymorphisms can significantly improve the identification of peptides or proteins using mass spectrometry 3-6. These applications usually require the integration of reference sequences, reference genome annotations, specific SNPs and INDELs along with an external SNP database such as dbSNP 7 for validation. The sheer amount of data generated by these studies rules out most spreadsheet analyses and requires tools that are both fast and memory efficient. Furthermore, these studies often require the collaboration of people with different sets of skills. Thus, it was important to us to develop a tool that is powerful enough to be integrated in complex high-throughput pipelines, while still being understandable by users with limited technical abilities. In contrast to other projects such as BioPython 8 and PyCogent 9, whose objective is to provide a general set of tools for bioinformatics, the primary ambition behind pyGeno is to provide the community with a powerful genome and proteome exploration tool that can be easily integrated into scripts. The current version integrates gene set annotations and reference sequences from Ensembl 10 along with polymorphisms (both SNPs and INDELs) derived from dbSNP 7, and experimentally detected patient-specific polymorphisms.
To our knowledge, pyGeno is the only available tool that provides this kind of integration in an easy-to-use and programming-friendly environment. Furthermore, more advanced users can rely on object-oriented inheritance to extend the functionalities of pyGeno and implement support for polymorphisms from other sources. pyGeno has been used with human and mouse genomes and should readily work with any diploid organism whose annotations are made available by Ensembl.

Methods
Design and implementation
pyGeno is written in Python, a language that enjoys a large set of well-established and mature scientific libraries that are used in research fields such as physics, mathematics and bioinformatics 8,11-13. pyGeno gives users access to the full expressivity of Python to explore reference and patient-specific genomes and proteomes, by manipulating familiar objects such as genomes, chromosomes, genes, transcripts, proteins and exons. In order to make pyGeno as easy to use and learn as possible, we have created an interface where only one function, get(), can be used for almost any query. An example of usage can be seen in Figure 1. Integrated documentation is also available through the help() function.
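The idea of a single query entry point can be illustrated with a toy sketch. This is not pyGeno's implementation, and it does not reproduce the signature of pyGeno's get(); the `Gene` record, the `Collection` container and all field names are hypothetical and serve only to show how one keyword-driven method can cover many different queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical record type, used only for this illustration."""
    name: str
    chromosome: str
    biotype: str


class Collection:
    """Toy container exposing a single query method, get()."""

    def __init__(self, records):
        self._records = list(records)

    def get(self, **criteria):
        """Return every record whose fields equal all the given criteria."""
        return [
            record for record in self._records
            if all(getattr(record, field) == value
                   for field, value in criteria.items())
        ]


# Made-up records; names and locations carry no biological meaning.
genes = Collection([
    Gene("geneA", "22", "protein_coding"),
    Gene("geneB", "17", "miRNA"),
    Gene("geneC", "17", "protein_coding"),
])

on_chr17 = genes.get(chromosome="17")
coding_on_chr17 = genes.get(chromosome="17", biotype="protein_coding")
```

Because every criterion is passed as a keyword argument, adding a new searchable field requires no new function, which is one plausible reason for favouring such an interface.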
The current version of pyGeno does not require any access to remote REST APIs. This results in more robust and faster processing, since the application is not affected by connection speed or sudden changes to the server API. On the other hand, it also implies that extra care must be taken regarding the optimization of the application.
Memory efficiency and speed are mainly achieved through the use of a custom lazy object-oriented database system that we have specifically written for pyGeno (https://github.com/tariqdaouda/rabaDB). When an object is loaded through the get() function, only a minimal version of it is served. The object fully develops only once the user accesses a field that is not present in the minimal version (Figure 1). The transformation is entirely transparent and does not require more memory than necessary to store the fully developed object. This is especially important, since most of the time users are only interested in specific regions of the genome, and do not require that the full genome be loaded into memory. Every loaded object is also a singleton: if the user asks for a previously loaded object, pyGeno will serve the object already in memory.
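The combination of lazy loading and singleton objects described above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the code of rabaDB or pyGeno: the class `LazyRecord`, its `sequence` field and the fake "database read" are all hypothetical.

```python
class LazyRecord:
    """Toy lazy object: cheap fields are set at creation, the rest on demand."""

    _cache = {}      # one instance per identifier (singleton behaviour)
    full_loads = 0   # counts how often the expensive load actually ran

    def __new__(cls, identifier):
        # Serve the instance already in memory when there is one.
        if identifier in cls._cache:
            return cls._cache[identifier]
        instance = super().__new__(cls)
        instance.identifier = identifier  # the "minimal version"
        cls._cache[identifier] = instance
        return instance

    def _load_full(self):
        # Stand-in for an expensive database read (hypothetical content).
        LazyRecord.full_loads += 1
        self.__dict__["sequence"] = "ATG" * 3

    def __getattr__(self, name):
        # Python calls this only when normal lookup fails,
        # i.e. for a field missing from the minimal version.
        if name == "sequence":
            self._load_full()
            return self.__dict__["sequence"]
        raise AttributeError(name)


first = LazyRecord("record-1")
second = LazyRecord("record-1")   # same identifier: same object is served
loads_before_access = LazyRecord.full_loads
```

In this sketch the expensive load runs the first time `first.sequence` is read and never again, since the value is then stored on the instance and found by normal attribute lookup.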
Furthermore, this database system is built on top of SQLite version 3 (http://www.sqlite.org/), a serverless relational database. Because SQLite3 uses single files to store data, pyGeno's database can be easily backed up and shared by a simple copy/paste. Moreover, the files can be directly read, modified and analyzed through any SQLite3 client.
As with any other database system, indexes play a crucial role in determining the general performance. Within pyGeno's database, several reference genomes along with patient-specific data and versions of dbSNP can coexist, so building indexes for all the stored information would result in unnecessarily large databases. We have therefore taken the approach of giving the end user full control over indexation through the ensureGlobalIndex() and dropGlobalIndex() functions. Users can, for example, decide to index the field 'id' of transcripts by using Transcript.ensureGlobalIndex('id') and dramatically speed up queries based on transcript ids.
pyGeno's database is populated through imports of datawraps using the importSNPs and importGenome functions. Datawraps are compressed archives that can be shared among co-workers, and are designed to solve the version and update problems. A datawrap contains at least one file named manifest.ini that contains basic information about the package such as a description, a version and a maintainer, as well as a list of files from which data must be extracted. It is possible to either compress these files within the archive, or to specify URLs from which the files can be downloaded.
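The effect of a user-requested index can be observed directly at the SQLite level with Python's standard sqlite3 module. The sketch below does not show what ensureGlobalIndex() does internally; the table layout, the index name `idx_transcript_id` and the identifiers are assumptions made for the illustration.

```python
import sqlite3

# In-memory database with a hypothetical transcript table (no index yet).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE transcript (id TEXT, name TEXT)")
connection.executemany(
    "INSERT INTO transcript VALUES (?, ?)",
    [("T%05d" % i, "transcript %d" % i) for i in range(1000)],
)


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


QUERY = "SELECT name FROM transcript WHERE id = ?"

plan_without_index = query_plan(connection, QUERY, ("T00042",))
connection.execute("CREATE INDEX idx_transcript_id ON transcript (id)")
plan_with_index = query_plan(connection, QUERY, ("T00042",))

# Dropping the index again reclaims the space it occupied.
# connection.execute("DROP INDEX idx_transcript_id")
```

Printing the two plans shows the query switching from a full table scan to an index search; leaving index creation to the user keeps the database file small for fields that are never queried.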
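To make the role of manifest.ini more concrete, the following sketch reads a manifest with Python's standard configparser module and separates files bundled in the archive from files given as URLs. The section names, key names and file names below are hypothetical; the actual layout expected by pyGeno is described in its documentation and may differ.

```python
import configparser

# Hypothetical manifest content: only the kinds of information listed in the
# text (description, version, maintainer, list of files) are represented.
MANIFEST_TEXT = """
[package_infos]
description = Example reference genome datawrap
maintainer = Jane Doe
version = 1

[files]
annotations = annotations.gtf.gz
sequences = http://example.org/chr1.fa.gz
"""


def read_manifest(text):
    """Parse a manifest and split its files into local and remote ones."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    info = dict(parser["package_infos"])
    files = dict(parser["files"])
    # A value containing "://" is treated as a URL to download.
    local = {key: value for key, value in files.items() if "://" not in value}
    remote = {key: value for key, value in files.items() if "://" in value}
    return info, local, remote


info, local_files, remote_files = read_manifest(MANIFEST_TEXT)
```

Keeping the description, version and maintainer next to the file list is what lets a datawrap document exactly which release of the data it provides.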

Amendments from Version 1
Updated the archived source code link to point to the latest version released since the original publication, 1.2.8.

In an effort to make pyGeno as easy to install as possible we have made it as dependency-free as possible. This approach has motivated our choice of SQLite3, since it is natively supported by Python 2.5 and above, and it also led us to develop many tools that were subsequently integrated into pyGeno. Among these tools are various functions for translating sequences, parsers for GTF/GFF, VCF, FASTA, FASTQ and CSV files, a progress bar, and an efficient way of annotating the genome called segment trees.
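As an example of the kind of dependency-free helper mentioned above, a sequence translation function can be written with the standard library alone. The sketch below is not pyGeno's own translation code; the function name `translate` and its `to_stop` parameter are hypothetical, and only the standard genetic code is handled.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, to_stop=False):
    """Translate a DNA coding sequence; '*' marks a stop codon.

    Trailing bases that do not form a full codon are ignored.
    With to_stop=True, translation ends before the first stop codon.
    """
    sequence = sequence.upper()
    protein = []
    for start in range(0, len(sequence) - len(sequence) % 3, 3):
        amino_acid = CODON_TABLE[sequence[start:start + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


reference_protein = translate("ATGGCCTAA")
variant_protein = translate("ATGGTCTAA")  # single C>T change in codon 2
```

Comparing the two translations shows how a single nucleotide change in the second codon (GCC to GTC) replaces an alanine with a valine in the protein.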

Personalized genomes
One of the biggest strengths of pyGeno is that it allows the user to define personalized genomes. These genomes are built by combining a reference genome with sets of polymorphisms and a filtering function that returns the alleles to be inserted at the appropriate locus (Figure 1). Personalized genomes are a powerful tool that can go beyond the definition of patient-specific genomes. For instance, we recently used this tool to combine the results of both RNA- and DNA-seq data and create more robust personalized genomes that were used to identify protein-derived peptides by mass spectrometry 3. Furthermore, because pyGeno loads the necessary parts of a given reference genome only once, a pyGeno application can handle several personalized genomes without significantly increasing its memory consumption.

Operation
pyGeno's only requirement is Python 2, and we highly recommend version 2.7.6 or later. pyGeno can be easily installed using the pip package manager (https://pip.pypa.io/) by typing pip install pyGeno into a command-line interface. Alternatively, the latest developments can be obtained from the GitHub repository. Once pyGeno's installation has been completed, the first action that users must perform is the importation of a reference genome datawrap. In order to simplify the process, pyGeno comes with several datawraps that can be directly listed and installed using its bootstrap module. If the desired reference genome is not among the ones provided, users can also create their own from scratch by following the steps described in the documentation. After the first reference genome importation, pyGeno is fully functional and users can further expand its database by importing other reference genomes or SNP sets.
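The principle of combining a reference, several polymorphism sets and a filtering function can be sketched as follows. This is a simplified illustration and not pyGeno's API: the `SNP` record, the `quality_filter` and `personalize` functions and the quality threshold of 20 are all hypothetical, and only single-nucleotide substitutions are handled (no INDELs, no diploid alleles).

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class SNP(NamedTuple):
    """Hypothetical polymorphism record: 0-based position, allele, quality."""
    position: int
    alt: str
    quality: float


def quality_filter(candidates: List[SNP]) -> Optional[str]:
    """Return the allele to insert at a locus, or None to keep the reference.

    Here: keep the best-scoring candidate if its quality exceeds 20.
    """
    best = max(candidates, key=lambda snp: snp.quality)
    return best.alt if best.quality > 20 else None


def personalize(reference: str,
                snp_sets: List[List[SNP]],
                snp_filter: Callable[[List[SNP]], Optional[str]]) -> str:
    """Combine a reference with several SNP sets through a filter function."""
    # Group the calls from every set by locus.
    by_position: Dict[int, List[SNP]] = {}
    for snp_set in snp_sets:
        for snp in snp_set:
            by_position.setdefault(snp.position, []).append(snp)

    # Let the filter decide, locus by locus, which allele (if any) to insert.
    sequence = list(reference)
    for position, candidates in by_position.items():
        allele = snp_filter(candidates)
        if allele is not None:
            sequence[position] = allele
    return "".join(sequence)


reference = "ATGGCCTAA"
dna_seq_calls = [SNP(4, "T", 50.0), SNP(7, "G", 5.0)]
rna_seq_calls = [SNP(4, "T", 35.0)]
personalized = personalize(reference, [dna_seq_calls, rna_seq_calls],
                           quality_filter)
```

Since the reference string is never modified, any number of personalized sequences can be derived from the same reference by changing only the SNP sets or the filter.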

Summary
We have developed pyGeno because, in an age where both precision medicine and DNA/RNA sequencing are becoming more and more important, we needed a tool that would allow us to easily work on personalized genomes that include subject-specific genomic features. Nowadays research teams are increasingly multidisciplinary and are composed of people with very different backgrounds. Since we wanted pyGeno to serve as a common language between users, we took great care to make it easy to install and easy to use, and optimized it so that it can run on computers with limited resources (e.g. laptops). The fact that pyGeno has been downloaded more than 12,000 times over its first year of existence suggests that there is indeed a need for powerful user-friendly precision medicine tools. With pyGeno we have taken a rather unusual approach to user-friendliness. Instead of writing a program with a graphical user interface (GUI), we have decided to create a Python module that fully integrates within the Python environment. This ensures that users can leverage the full expressiveness of Python as well as the functionalities of other Python modules such as SciPy and NumPy 11, pandas (http://pandas.pydata.org/) and matplotlib 13, to meet their specific needs. Furthermore, it led us to think of the functions and objects the user manipulates as pyGeno's interface, and we strove to make it as simple and easy to learn as possible.
In the past few years great technologies have been developed. Scripting languages such as Python and JavaScript have taken programming to a whole new level of simplicity, and are now fast enough to serve as foundations for large-scale projects. Freely available libraries such as D3.js (http://d3js.org/) allow for the creation of stunning data representations that, once coupled with tools such as pyGeno, could be used to create powerful interactive representations of biological data. The NoSQL movement has produced several new database systems from which developers can choose, offering them the opportunity to store large amounts of data with a flexibility that was not available only a few years ago. These technologies and many others are only waiting to be put together into groundbreaking tools for the treatment of biological data. In life-saving research areas, we believe that great tools that dramatically improve workflow efficiency are not a luxury but a necessity.

Author contributions
TD designed and developed pyGeno. SL, CP, and TD contributed to the preparation of the manuscript. All authors were involved in the revision of the draft manuscript and have agreed to the final content.

Competing interests
No competing interests were disclosed.

Grant information
This work was supported by the Canadian Cancer Society (Grant number 701564), assigned to Claude Perreault.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Open Peer Review
Several times the authors make reference to 'precision medicine' without clarifying what is meant by the term. If the authors have a specific workflow or use case which befits 'precision medicine' they need to make it clearer. I note that the authors do not include a 'Use Cases' section as suggested in the 'Instructions to Authors'. Their previous paper (Granados et al., 2014) would be a great example of how to use pyGeno on real data. The examples provided in the paper are either too vague or too simplistic.
Too often new tools come out which try to reinvent the wheel for a small incremental improvement. Here, the authors need to be acknowledged for using well-established systems like Ensembl, SQLite and reading in existing data types (e.g. GFF, VCF, fasta).
The paper is too short on specifics and somewhat unstructured. The Methods section is fine, although it would benefit from an overview (with a figure and/or text), as the overall structure of pyGeno is unclear. The Personalized genomes section should be expanded as a 'Use Cases' section. The 'Operation' section should be part of the Methods.
The online instructions are quite clear for installation; however, I gave up due to the incredibly slow progress: >20 hours remaining for a full genome 'datawrap'. Thus, unfortunately, I was not able to test the software myself. Given that pyGeno has been downloaded many times, this might be an issue local to me.
Overall I do feel pyGeno is a valuable contribution to the community, however the paper needs some improvement to highlight the tool's usefulness better.

Specific Comments:
In the third paragraph of the Methods, the authors state that pyGeno is not dependent on any 'remote REST APIs'. If this is the case, how does pyGeno interact with Ensembl and keep in sync with the regular updates (every 6 months)? The focus on 'robust and faster processing' is understandable, but version drift from official sources can be a serious problem. Is version maintenance something end users can do or are they dependent on the authors keeping the versions up-to-date?
In Figure 1, a simple example is provided showing a protein sequence with what appears to be two non-synonymous variants highlighted in the protein sequence. How does pyGeno cope with summarising/visualising: 1) mutually exclusive variants at a single amino acid (e.g. two non-synonymous variants at different positions of the codon), or 2) more complex variants like splicing-affecting changes and loss/gain of STOP codons? Similarly, does pyGeno accept phased haplotypes, thereby allowing inspection of both protein products from each of the individual's alleles?
What are the hardware requirements for running pyGeno and associated analyses? Is a well-specified workstation with several GB of RAM, fast cpu and terabytes of diskspace required or can it be run on a laptop?
the first year, demonstrating its utility. There is a very helpful figure, Figure 1, that gives an overview of the process. The authors might consider adding a section of the manuscript in which they walk the reader through the analysis of an actual gene in an actual genome to give the reader a sense of the findings.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

pyGeno is a Python package that allows a user to simply query either a standard reference genome or a custom genome for information such as sequences, SNPs, and related RNA and protein data. I agree that this package fills a need for genome research, appears to be very straightforward to use, and I would like to use it myself.
However, when I tested it I was unable to get it to run; contrary to the authors' claims that it is easy to install with minimal dependencies I couldn't make it work. (I am not a Python expert, but I have installed much more complicated packages successfully.) With Python virtual environments, I wonder if there is a need to be quite so minimal. Would it be easier to rely on standard packages to ensure a universal, smooth experience?
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Author Response 03 Jun 2016
Tariq Daouda, University of Montreal, Canada
Thank you for taking the time to review our work.
In light of your review we have retested the installation, genome importation and polymorphism insertion on Linux, MacOS and Windows and it seems to work on every platform. We are fully committed to making the experience for the end user as smooth and easy as possible. If you could give us more details about the problems you encountered by filing a GitHub issue, we would be very happy to address them.
Competing Interests: No competing interests were disclosed.