Software Tool Article

Pathogen Sequence Signature Analysis (PSSA): A software tool for analyzing sequences to identify microorganism genotypes

[version 1; peer review: 2 approved with reservations]
PUBLISHED 09 Jan 2017

This article is included in the Neglected Tropical Diseases collection.

Abstract

Introduction
The chikungunya virus (CHIKV) is an arbovirus vectored by Aedes mosquitoes that infects humans in tropical and sub-tropical areas of Asia and Africa. Recently, outbreaks have been reported in tropical and sub-tropical areas of countries that were previously unaffected (e.g., Brazil, Colombia). Currently, the following geographical genotypes have been identified through phylogenetic analysis of CHIKV E1 gene sequences: the West African (WAf), East/Central/South African (ECSA), and Asian genotypes. Outbreaks in a geographical area can happen with the same or different genotypes. Determining which genotypes are circulating in an outbreak is important for public health management.
Objectives
To create a computer-based system available online that is suitable for detecting changes in CHIKV nucleotide and amino acid sequences and identifying their corresponding geographical genotype.
Methods
We used several computer frameworks, tools, programming languages, algorithms, and infrastructure systems to build a software tool that analyzes changes in nucleotide and amino acid sequences and identifies different geographical genotypes through phylogenetic analysis.
Results
We have built an online software tool called Pathogen Sequence Signature Analysis (PSSA) that allows researchers to analyze nucleotide and amino acid sequence variations between sample CHIKV sequences taken from infected patients and obtained through conventional Sanger sequencing, to identify their corresponding geographical genotype.
Conclusion
PSSA is able to analyze sequences in a simple and effective manner, and includes proper documentation (i.e., UML diagrams) and also basic examples that serve to test the algorithm. Furthermore, PSSA provides various ways to visualize the data in order to aid understanding and interpretation of results.
Results provided by PSSA will be useful for the identification of circulating CHIKV genotypes and public health surveillance. PSSA is available at: http://pssa.itiud.org.

Keywords

Chikungunya virus, Public health, sequences, information system, phylogenetic analysis.

Introduction

Chikungunya virus (CHIKV) is an arbovirus (arthropod-borne virus), which is part of the Alphavirus genus and belongs to the Togaviridae family. It is vectored by Aedes mosquitoes and infects humans in tropical and sub-tropical areas of Asia and Africa1. CHIKV has a positive-sense, single-stranded RNA genome of 12 kb, which can persist for years in humans. Symptoms include rash and febrile illness associated with severe arthralgia2.

Currently, the following geographical genotypes have been identified through phylogenetic analysis of CHIKV E1 gene sequences: the West African (WAf), East/Central/South African (ECSA), and Asian genotypes3–5. However, most CHIKV phylogenetic studies have used only partial sequences of the E1 envelope glycoprotein gene, which prevents accurate assessment of the relationships between strains and of their evolutionary dynamics.

Recently, some complete sequences from the CHIKV genome have been made available, so we have used the available data for a complete E1 gene to develop an automated computational algorithm that can be used for accurate and rapid identification of this pathogen.

The online software tool that we have created, called Pathogen Sequence Signature Analysis (PSSA), will allow researchers to analyze nucleotide and amino acid sequence variations between sample CHIKV sequences taken from infected patients, and determine the corresponding genotype from phylogenetic analysis of the results. PSSA also provides various ways to visualize the data in order to aid understanding and interpretation of results.

Methods

Implementation

To build PSSA, we used standard computer-based tools, programming languages, and infrastructure systems. PSSA is based on the object-oriented paradigm; thus, for its design, we used the Unified Modelling Language (UML)6. For its development, we used version 7.1.0 of the PHP language (https://www.php.net/), running on version 2.4.23 of the Apache application server (http://www.apache.org/). PSSA’s front end was developed with version 3.3.7 of Bootstrap (http://getbootstrap.com/), a framework that integrates JavaScript, CSS, and HTML for creating responsive web applications. PSSA uses version 3.1.1 of the jQuery library (https://www.jquery.com/) to simplify the use of JavaScript functionality. After PSSA performs an analysis, it provides results in several formats. One result is an automatically generated report in pdf format, created with version 0.0.8 of the PHP library ezpdf (https://github.com/rebuy-de/ezpdf). The other results are a force-directed graph, a radial tree, and a cartesian tree, which were developed with version 3.5.17 of the JavaScript library Data-Driven Documents, known as D3 (https://www.d3js.org/). D3 provides services for presenting data through interactive visualizations.

The geographical genotype of a sample sequence was determined based on well-defined phylogenetic clusters whose origins have been linked to a given geographic region. We analyzed all whole genomes available in the GenBank database (www.ncbi.nlm.nih.gov/genbank/). However, because the E1 gene has been used in several previous studies to determine the genotype of sample sequences, including Nunes et al.7, Laiton-Donat et al.8, and Volk et al.9, we decided to use the E1 gene. In addition, we performed extensive testing to confirm that our reference strains can accurately classify other sequences.

PSSA stores all information related to nucleotide and amino acid analysis in one relational database. In this project, we created the database using version 5.7.16 of the MySQL community server (http://dev.mysql.com/downloads/). The database was designed to handle the information required for the proposed nucleotide and amino acid analysis, as well as for the phylogenetic analyses. The database is managed using version 4.6.5.2 of phpMyAdmin (https://www.phpmyadmin.net/), a project for administering MySQL databases. The MySQL community also offers MySQL Workbench (https://www.mysql.com/products/workbench/), of which version 6.3 was used to design PSSA’s database.

In addition, version 4.1 of the Integrated Development Environment (IDE) EclipsePHP (https://eclipse.org/pdt/) was used to develop PSSA. EclipsePHP allows the creation of PHP-based projects and supports the PHP, CSS, and JavaScript languages. It also provides Git services for storing projects in the desired repositories. Thus, we started the development of PSSA by creating a PHP project in EclipsePHP; next, we created all required files and wrote the source code for the algorithms that perform the desired analyses; finally, we created a Git configuration to store the project in the GitHub repository system. PSSA is hosted on SUSE Linux Enterprise Server 12 SP2 (https://www.suse.com). This server includes all of the applications mentioned above that PSSA needs to operate correctly.

The various libraries, frameworks, and software we used to develop PSSA are distributed under free software licenses such as the GNU General Public License (http://www.gnu.org/licenses/licenses.en.html). This means that only free software was used to develop our software tool.

As reference sequences, we used HM045811-Ross for the ECSA genotype, HM045810 for the Asian genotype, and HM045807 for the West African genotype (the first isolates identified for each of the three genotypes). These were selected by searching through all nucleotide sequences of the CHIKV E1 gene that are available in the GenBank database (www.ncbi.nlm.nih.gov/genbank/).

The accession numbers for the representative or alternative CHIKV E1 sequences used in the phylogenetic analysis are as follows:

ECSA genotype: HM045823, AM258993, EF012359, AM258991, AB455494, GU199352, FJ445426, FN295485, GU301781, HM045784, HM045822, KP164568, KP164570, KP164569.

Asian genotype: HM045813, HM045800, HM045790, HM045789, EF027140, EF027141, FN295483, L37661, EF452493, FJ807897, HE806461, KF318729, KJ451622, AB860301, KP164567, KP164572, KP164571, KJ451624, KP851709, KT211035, KT211049.

West African genotype: HM045816, HM045785, HM045815, HM045818, AY726732, HM045820, HM045817.

Operation

PSSA has been developed to run in Google Chrome, Mozilla Firefox, Internet Explorer, and Safari; it might also run in other browsers, such as Opera. To run PSSA, type the URL http://pssa.itiud.org into the web browser.

However, if the user wants to run PSSA locally, the following steps need to be followed:

  • 1. Download PSSA from the github repository: https://github.com/florezfernandez/pssa.

  • 2. Install the following software on the local server: Apache, PHP, MySQL, and phpMyAdmin (optional).

  • 3. Run the database script, which is available once the project has been retrieved from the repository.

  • 4. The project contains a file called “connection.php” in which the information regarding the connection to the database is configured. Update the information of the server, database, database user, and database password with the information of the local server. The default values provided with PSSA are: server = localhost, database = pssa, database user = “root”, and database password = “”.

Results

The most important feature of PSSA is that the algorithm has been developed to analyze sequences taken from multiple patients. Several sequences can also be submitted per patient. The analysis process is carried out via the following steps:

  • 1. Once the user has accessed PSSA through the corresponding web address, they open the “Sequence Analysis” menu and select the menu item “Chikungunya Virus”. PSSA then presents the name, description, and reference and alternative sequences of the available gene(s) for CHIKV (Figure 1). For each gene (e.g., E1), two icons appear on the right side. The first icon displays the reference sequences, while the second icon displays the alternative sequences.

  • 2. There are two different types of analysis in PSSA. By selecting the reference sequences, users proceed with the mutation analysis, which analyzes nucleotide and amino acid changes in patient sequences, and by selecting alternative sequences users proceed with the phylogenetic analysis, which establishes the phylogenetic relationship between the submitted sequences and determines which genotype they belong to.

  • 3. Once the type of analysis has been selected, the user can provide the patient sequences as FASTA files. PSSA includes an example dataset that can be used to test the system. The symbol '-' can be included in sequences to mark possible missing data; however, tabs, blank spaces, and any symbols other than the ones used in this kind of sequence (i.e., A, C, T, G) are not accepted.
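The input rule described in step 3 can be illustrated with a short Python sketch. This is not PSSA's own code (PSSA is written in PHP); the function names `parse_fasta` and `invalid_symbols` are hypothetical, and treating lowercase letters as valid is an assumption that the article does not confirm.

```python
# Minimal sketch of FASTA parsing and symbol validation.
# Assumption: only A, C, G, T and '-' (missing data) are valid,
# and lowercase input is treated the same as uppercase.
ALLOWED = set("ACGT-")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if not line:
            continue  # skip completely empty lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_symbols(sequence):
    """Return the sorted list of characters that are not allowed."""
    return sorted(set(sequence.upper()) - ALLOWED)


if __name__ == "__main__":
    example = ">patient1_seq1\nACGT-A\nTTGA\n>patient1_seq2\nAC GT\n"
    for name, seq in parse_fasta(example):
        bad = invalid_symbols(seq)
        print(name, "OK" if not bad else f"rejected, invalid symbols: {bad!r}")
```

A sequence is rejected as soon as `invalid_symbols` returns a non-empty list, which covers tabs and blank spaces as well as unexpected letters.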


Figure 1. Screenshot of the PSSA Sequence Analysis menu presenting the available name, description and two icons for the CHIKV E1 gene.

The first icon represents the reference sequence, while the second icon represents the alternative sequences of the CHIKV E1 gene. By selecting the reference sequences, users proceed with the mutation analysis, and by selecting the alternative sequences, users proceed with the phylogenetic analysis.

After patient sequences have been provided, the analysis algorithm is run and the system presents the corresponding results.

Mutation analysis

For the mutation analysis, the system provides an online report that includes the nucleotide changes and the corresponding amino acid changes in each patient’s sequence (Figure 2). The report can be sent to the user via e-mail in pdf format, and contains both a summary of the results and complete details of the analysis. The system also provides a force-directed graph, which presents each sequence as a node and in which the set of nodes displayed in the same color represents one patient (Figure 3). In addition, the nodes that belong to each patient are clustered based on the number of nucleotide and amino acid changes. When a sequence contains a substantial part of the E1 gene, these results are reliable and can be used for further purposes. Users can confirm that the results are reliable by reading the pdf report and comparing it to the force-directed graph.
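The kind of comparison behind such a report can be sketched in Python as follows. This is an illustrative sketch rather than PSSA's implementation: it assumes that the sample is already aligned to the reference, in frame, and of equal length, and the names `translate` and `find_changes` are hypothetical. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(codon):
    """Translate one codon; 'X' marks codons containing gaps or unknowns."""
    return CODON_TABLE.get(codon, "X")


def find_changes(reference, sample):
    """List nucleotide changes and the amino acid change each one implies.

    Assumes both sequences are aligned, in frame, and of equal length.
    Positions where the sample has '-' (missing data) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for i, (ref_nt, alt_nt) in enumerate(zip(reference, sample)):
        if alt_nt == "-" or ref_nt == alt_nt:
            continue
        start = i - i % 3  # first position of the codon containing i
        changes.append({
            "nt_pos": i + 1,
            "nt_ref": ref_nt,
            "nt_alt": alt_nt,
            "aa_pos": start // 3 + 1,
            "aa_ref": translate(reference[start:start + 3]),
            "aa_alt": translate(sample[start:start + 3]),
        })
    return changes


if __name__ == "__main__":
    # Toy sequences for illustration only, not real CHIKV data.
    for change in find_changes("ATGGCATTT", "ATGGCGTTA"):
        kind = "synonymous" if change["aa_ref"] == change["aa_alt"] else "non-synonymous"
        print(change, kind)
```

Filtering the returned list for entries where `aa_ref` differs from `aa_alt` gives the nucleotide changes that produce an amino acid change.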


Figure 2. Online textual visualization of the report for the mutation analysis, presenting the nucleotide changes that produce an amino acid change.


Figure 3. Force-directed graph visualization.

The force-directed graph presents each sequence as a node and each set of nodes of the same color as one patient. In addition, the nodes that belong to each patient are clustered based on the number of nucleotide and amino acid changes.
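As an illustration of the data such a graph needs, the following Python sketch builds a nodes-and-links structure of the kind commonly passed to D3 force layouts. The structure and the function name `build_graph` are assumptions; in particular, linking sequences of the same patient with a weight equal to the difference in their number of changes is only one plausible reading of the clustering described above, not a documented PSSA rule.

```python
import json
from itertools import combinations


def build_graph(change_counts):
    """Build a nodes/links dictionary from per-sequence change counts.

    change_counts maps patient -> {sequence_id: number_of_changes}.
    Each patient gets its own group number, which a front end can map
    to a color. Links connect sequences of the same patient (assumption).
    """
    nodes, links = [], []
    for group, (patient, seqs) in enumerate(sorted(change_counts.items())):
        ordered = sorted(seqs.items())
        for seq_id, count in ordered:
            nodes.append({"id": seq_id, "group": group,
                          "patient": patient, "changes": count})
        for (a, count_a), (b, count_b) in combinations(ordered, 2):
            links.append({"source": a, "target": b,
                          "value": abs(count_a - count_b)})
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    # Hypothetical counts for two patients.
    counts = {"P1": {"P1_s1": 2, "P1_s2": 5}, "P2": {"P2_s1": 0}}
    print(json.dumps(build_graph(counts), indent=2))
```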

Phylogenetic analysis

For the phylogenetic analysis, the system presents results as a radial tree (Figure 4) and a cartesian tree (Figure 5) to establish the phylogenetic relationship between sequences and determine which genotype they belong to, based on an array of alternative sequences corresponding to the E1 gene.

The algorithm is an iterative process. For each patient file, the algorithm collects all sequences; then, it compares each sequence in turn to the selected reference sequence as well as to the alternative sequences. All information regarding the analysis is stored and used to generate the reports through the visualizations described above. There are two different types of analysis in PSSA, although they are closely related. The mutation analysis presents its results as a report of the nucleotide and amino acid variations in each patient’s sequence and as a force-directed graph, whilst the phylogenetic analysis is based on a nucleotide substitution model and presents its results as a radial tree and a cartesian tree, which establish the phylogenetic relationship between sequences and determine which genotype they belong to.
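The idea of assigning a genotype by comparison with the alternative sequences can be sketched in Python as below. This is a simplified stand-in, not the PSSA algorithm: it assigns the genotype of the nearest alternative sequence instead of building a tree, and it uses the Jukes-Cantor (JC69) distance only because the article does not name the substitution model. All function names are hypothetical, and sequences are assumed to be aligned and of equal length.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring sites with '-' in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def jc69_distance(a, b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


def assign_genotype(sample, alternatives):
    """Return (genotype, distance) of the closest alternative sequence.

    alternatives maps genotype name -> list of aligned sequences.
    """
    distance, genotype = min(
        (jc69_distance(sample, seq), name)
        for name, seqs in alternatives.items()
        for seq in seqs
    )
    return genotype, distance


if __name__ == "__main__":
    # Toy sequences for illustration only, not real CHIKV E1 data.
    alternatives = {
        "ECSA": ["ACGTACGTAC"],
        "Asian": ["ACGTTCGTAA"],
        "WAf": ["TCGAACGGAC"],
    }
    print(assign_genotype("ACGTACGTAG", alternatives))
```

A tree-based method, as used for the radial and cartesian trees, would compute such distances between all pairs of sequences before clustering them; the nearest-sequence shortcut here only shows how a substitution model turns observed differences into a distance.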


Figure 4. Radial tree visualization.

The three genotypes are separated into the different branches, where each branch corresponds to one sequence obtained from the GenBank database.


Figure 5. Cartesian tree visualization.

It presents the same information as the radial tree, but it shows the three genotypes separated into the different branches more clearly.

Conclusions

PSSA is an online software tool that provides an automated computational algorithm that guarantees accurate and reliable detection of nucleotide and amino acid sequence variations and provides various ways to visualize the data in order to aid understanding and interpretation of results.

PSSA differs from BMA10, another analysis tool developed in our research group, because it not only provides information regarding nucleotide and amino acid changes, but also compares the sequences with multiple alternative sequences to identify the genotype in a phylogenetic tree.

PSSA will be useful for the identification of circulating CHIKV genotypes in an outbreak and public health surveillance. It is a flexible tool, which implies that it could be used for evaluating other microorganisms, such as bacteria (e.g., Mycobacterium tuberculosis), parasites (e.g., Leishmania) or other viruses (e.g., Dengue, Zika).

Software availability

Software available from: http://pssa.itiud.org

Latest source code: https://github.com/florezfernandez/pssa

Archived source code as at the time of publication:

http://dx.doi.org/10.5281/zenodo.17992211

License: GNU General Public License (GPL)

Salvatierra K and Florez H. Pathogen Sequence Signature Analysis (PSSA): A software tool for analyzing sequences to identify microorganism genotypes [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:21 (https://doi.org/10.12688/f1000research.10393.1)
Open Peer Review

Reviewer Report 15 May 2017
Massimo Ciccozzi, Department of Infectious, Parasitic and Immunomediated Diseases, Istituto Superiore di Sanità, Rome, Italy 
Approved with Reservations
Salvatierra & Florez describe the development of software tool and web interface for analyzing only Chikungunya sequences.

The idea is very interesting but I have some different concerns about the utility of its application.
  1. It is not well documented that Chikungunya virus can persist in
...
Ciccozzi M. Reviewer Report For: Pathogen Sequence Signature Analysis (PSSA): A software tool for analyzing sequences to identify microorganism genotypes [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:21 (https://doi.org/10.5256/f1000research.11199.r22414)
Reviewer Report 18 Apr 2017
Easwaran Sreekumar, Molecular Virology Laboratory, Rajiv Gandhi Centre for Biotechnology, Thiruvananthapuram, Kerala, India 
Approved with Reservations
The article by Salvatierra & Florez describes development of software tool and web interface for analyzing sequences of microorganisms. Essentially, the software is currently configured for analyzing only chikungunya sequences.

The reviewer has a number of comments/ ...
Sreekumar E. Reviewer Report For: Pathogen Sequence Signature Analysis (PSSA): A software tool for analyzing sequences to identify microorganism genotypes [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:21 (https://doi.org/10.5256/f1000research.11199.r21444)