ProfileGrids solve the large alignment visualization problem: influenza hemagglutinin example

Abstract
Large multiple sequence alignments are a challenge for current visualization programs. ProfileGrids are a solution that reduces alignments to a matrix, color-shaded according to the residue frequency at each column position. ProfileGrids are not limited by the number of sequences and so solve this visualization problem. We demonstrate the new metadata searching and grep filtering features of the JProfileGrid version 2.0 software on an alignment of 11,900 hemagglutinin protein sequences. JProfileGrid is free and available from http://www.ProfileGrid.org.


Introduction
The explosion in biological sequence information has led to the generation of large multiple sequence alignments (MSA). For example, the biggest protein family alignment currently in the Pfam database (Wellcome Trust-Sanger Institute) has over 288,000 sequences [1]. A new generation of alignment programs, such as Clustal Omega [2], is available that allows the routine calculation of such large alignments. However, a Nature Methods review [3] noted the lack of software tools for visualizing the results of large alignment calculations. Specifically, there was a call for overcoming the conceptual and technical limitations of large data sets to allow one to navigate visually both an overview and the details of an alignment, while having mechanisms to query annotated data. We point out that this conceptual limitation was solved in late 2008 by the introduction of ProfileGrids as a new paradigm for visualizing large multiple sequence alignments [4]. Here, we report that the remaining technical limitations have been overcome with version 2.0 of the JProfileGrid software, and that therefore the large alignment visualization problem has now been solved. We use the influenza hemagglutinin protein family as a case study to demonstrate the new features of the software.
Previous MSA visualization paradigms [3,5] do not provide both alignment overviews together with the details of each character's frequency distribution at each homologous position in the entire alignment. A particular technical limitation of stacked sequence representations is that as the alignment size grows, a printed or digital visualization can run out of convenient space. This compounds the conceptual limitation where the user cannot grasp the overall conservation trends and the observed variation details in a large data set. The ProfileGrid paradigm was introduced as a solution to this conceptual problem by converting a multiple sequence alignment to a color-coded matrix of the residue frequency occurring at every homologous position across the entire length of an MSA. Therefore, all MSA information is represented, including variable regions and infrequent residues that may yield clues about biological function.
[...] aligned them using MUSCLE [7], and visualized the MSA [8] with the new JProfileGrid 2.0 software. Both paradigms depict the entire width of the alignment, i.e., 650 columns from the approximately 570 protein residues and the 80 inserted gap characters. However, due to space limitations, the stacked sequence overview (Fig. 1a) is only a sampling of 600 sequences from the whole alignment. In the stacked representation, each row is an individual protein sequence with the amino acids represented as pixels colored according to the Taylor scheme [9]. Below the stacked sequence overview is a similarly colored magnified view (Fig. 1b) from a much smaller alignment with the amino acid one-letter codes shown. By contrast, the ProfileGrid overview representation (Fig. 1c) divides the alignment width into multiple "tiers" of which six are 100 characters wide. Within each tier are 21 rows for the 20 amino acids and a gap character.
Each cell in the matrix is a color-shaded count of the character occurrence at the corresponding MSA column position from low (white) to high (dark blue) frequency. Thus, all of the sequences from the entire alignment are represented in the ProfileGrid by the frequency color shading of each cell in the matrix. As a reference, the left-most column of each ProfileGrid tier is colored according to the Taylor scheme for each amino acid row. ProfileGrids solve the visualization problem of large alignments since there is no limit to the number of sequences that can be represented. Note that a stacked sequence representation of all 11,981 hemagglutinin sequences in this example would need about twenty times as many rows as are shown in the overview (Fig. 1a).
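The reduction of an alignment to a 21-row grid can be illustrated with a minimal Java sketch. This is not the JProfileGrid source: the class name, the row order of the alphabet, the linear shading, and the choice of 0x00008B as "dark blue" are all assumptions made for illustration, and the sequences are assumed to be of equal length.

```java
import java.util.List;

/** Illustrative sketch only; names and color values are assumptions, not JProfileGrid code. */
class ProfileGridSketch {
    // Assumed row order: the 20 amino acid one-letter codes followed by the gap character.
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY-";

    /** counts[row][col] = number of sequences with ALPHABET.charAt(row) in alignment column col. */
    static int[][] countGrid(List<String> alignedSequences) {
        int width = alignedSequences.get(0).length();
        int[][] counts = new int[ALPHABET.length()][width];
        for (String seq : alignedSequences) {
            for (int col = 0; col < width; col++) {
                int row = ALPHABET.indexOf(seq.charAt(col));
                if (row >= 0) {              // characters outside the 21 symbols are skipped
                    counts[row][col]++;
                }
            }
        }
        return counts;
    }

    /** Linear shade from white (count 0) to an assumed dark blue (all sequences), packed as 0xRRGGBB. */
    static int shade(int count, int totalSequences) {
        double f = (double) count / totalSequences;
        int red = (int) Math.round(255 * (1 - f));
        int green = (int) Math.round(255 * (1 - f));
        int blue = (int) Math.round(255 - (255 - 139) * f);
        return (red << 16) | (green << 8) | blue;
    }

    public static void main(String[] args) {
        List<String> msa = List.of("AC-", "AD-", "AC-");   // toy alignment, not real data
        int[][] grid = countGrid(msa);
        System.out.printf("C in column 2: %d, shade #%06X%n",
                grid[1][1], shade(grid[1][1], msa.size()));
    }
}
```

The grid has a fixed number of rows whatever the number of sequences, which is why the representation does not grow with alignment depth.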
The first version of the Java JProfileGrid software was designed for alignments with hundreds of sequences such as the ubiquitous RecA protein involved in bacterial recombination [4]. Upon analyzing virus protein families such as influenza hemagglutinins with thousands of sequences, it became apparent that the larger data sets were straining the software's technical limits. We completely overhauled the program to improve the code with respect to object-oriented design, memory management, calculation efficiency, and speed. We strategically reduced memory usage and introduced parallelized code for computing amino acid frequency counts. For example, we were able to reduce the memory requirements by addressing a technical limitation of the Java programming language. Java uses 2 bytes (16 bits) of memory for every Unicode character (UTF-16), reflecting the need to support thousands of characters from hundreds of languages. However, typical MSAs consist of only ASCII characters. Thus, we were able to reduce the memory use 2-fold by introducing a byte (8-bit) map between the integers 0 to 20 and the twenty common amino acid codes and a gap character. Parallelization was possible due to the nature of the ProfileGrid format. Amino acid counts within each column are independent of each other and so were partitioned into separately running threads, thereby taking advantage of multi-core processor environments for enhanced ProfileGrid calculation speed.
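The two ideas in this paragraph, a one-byte code per residue and per-column parallel counting, can be sketched as follows. This is a hedged illustration rather than the JProfileGrid implementation: the class and method names are hypothetical, the code assignment (position in the string "ACDEFGHIKLMNPQRSTVWY-") is an assumption, and a parallel stream stands in for whatever threading scheme the software actually uses.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Illustrative sketch only; names and the code assignment are assumptions. */
class ByteEncodedCounts {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY-";
    // Lookup table from ASCII character to a code in 0..20; -1 marks anything else.
    static final byte[] CODE_OF = new byte[128];
    static {
        Arrays.fill(CODE_OF, (byte) -1);
        for (int i = 0; i < ALPHABET.length(); i++) {
            CODE_OF[ALPHABET.charAt(i)] = (byte) i;
        }
    }

    /** Stores one byte per aligned position instead of one 16-bit char. */
    static byte[] encode(String seq) {
        byte[] out = new byte[seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            out[i] = c < 128 ? CODE_OF[c] : (byte) -1;
        }
        return out;
    }

    /** counts[col][code]; each column is tallied by exactly one task, so no locking is needed. */
    static int[][] countColumns(byte[][] encoded, int width) {
        int[][] counts = new int[width][ALPHABET.length()];
        IntStream.range(0, width).parallel().forEach(col -> {
            int[] column = counts[col];
            for (byte[] seq : encoded) {
                byte code = seq[col];
                if (code >= 0) {
                    column[code]++;
                }
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        byte[][] msa = { encode("AC-"), encode("AD-"), encode("AC-") };  // toy alignment
        int[][] counts = countColumns(msa, 3);
        System.out.println("Column 2 counts: " + Arrays.toString(counts[1]));
    }
}
```

Because every task writes only to the count array of its own column, the columns can be processed in any order and on any number of cores without synchronization.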
New features in JProfileGrid v2.0 accommodate viewing and analyzing very large MSAs. The software now allows sorting and searching the menu list of sequence names for finding a specific homolog to serve as the reference sequence for the ProfileGrid. The new "overview" modes enable the user to visualize the entire MSA within one window. The detailed ProfileGrid window with the character counts has a new second pane that facilitates simultaneous viewing of different parts of the MSA. To easily focus on rare residues, the "highlight" functionality now identifies residues whose frequency is above or below a user-defined threshold. We introduced data sampling to accelerate similarity plot calculations rather than needlessly including every single sequence from large MSAs. Regular expression functionality has also been implemented to find sequences with particular names or amino acid patterns. Finally, JProfileGrid can import simple sequence annotations from flat file spreadsheet databases [10] for metadata filtering to reduce large MSAs to subsets of interest.
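Combining a metadata filter with a regular expression search on sequence names might look like the following minimal sketch. It is not taken from JProfileGrid: the class, the method signature, the "host" field name, and the strain names are hypothetical, and the metadata is assumed to have already been read from the flat file into a map keyed by sequence name.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Illustrative sketch only; all names and sample records are invented. */
class SubsetFilterSketch {
    /** Keeps names whose metadata field equals the wanted value and whose name matches the regex. */
    static List<String> filter(List<String> names,
                               Map<String, Map<String, String>> metadata,
                               String field, String wanted, String nameRegex) {
        Pattern pattern = Pattern.compile(nameRegex);
        return names.stream()
                // Sequences without an annotation for the field are dropped.
                .filter(n -> wanted.equalsIgnoreCase(
                        metadata.getOrDefault(n, Map.of()).get(field)))
                // find() matches the pattern anywhere in the name.
                .filter(n -> pattern.matcher(n).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "A/Mexico/0001/2009", "A/Texas/0002/2009", "A/swine/Mexico/0003/2009");
        Map<String, Map<String, String>> metadata = Map.of(
                "A/Mexico/0001/2009", Map.of("host", "Human"),
                "A/Texas/0002/2009", Map.of("host", "Human"),
                "A/swine/Mexico/0003/2009", Map.of("host", "Swine"));
        System.out.println(filter(names, metadata, "host", "human", "Mexico"));
    }
}
```

The retained names would then select the rows of the alignment from which a ProfileGrid of the subset is computed.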

Example: influenza hemagglutinin
As a demonstration of the new metadata (Fig. 2a) and regular expression (Fig. 2b) software features, the 11,981 sequence alignment (Fig. 2c) was filtered to just 60 hemagglutinin homologs (Fig. 2d) by using a metadata search for "human" hosts and a regular expression search for a "Mexico" country location. The ProfileGrid views are positioned at alignment column 333 where there is a potential glycosylation site (asn-thr-thr-cys underlined in red) that is found among this sequence subset. Glycosylation is a key post-translational modification that is vital to the proper folding and trafficking of viral proteins. In addition to the role in protein folding and processing, glycosylation in influenza is also important to virus immune evasion [11]. The majority of antibody recognition is dedicated to the surface antigen on the globular head, although stem-directed antibodies have recently been described [12]. Glycosylation in the stem region could prevent recognition by these antibodies and allow the virus to escape the immune response. The region visualized by ProfileGrid analysis lies in the stem region of the hemagglutinin molecule. ProfileGrids, therefore, permit observing rare natural sequence variation (Fig. 2d) within the context of an entire multiple sequence alignment (Fig. 2c). ProfileGrids have this unique advantage over other compressed alignment visualization methods such as sequence logos [13].
In conclusion, ProfileGrids have solved the problem of visualizing very large alignments. As sequence data sets grow, both for the end user and in central database repositories, we anticipate that ProfileGrids will simplify the dissection and analysis of MSAs. Parenthetically, we note that some bioinformatic studies about influenza proteins have lacked figures depicting the alignment upon which the analyses are based [14]. This omission was probably due to technical limitations that now no longer exist. Thus, we propose that ProfileGrid representations of alignment datasets be included as part of publication to assist science communication. This may initiate establishing a future standard for MIMSA: Minimum Information about a Multiple Sequence Alignment [15].

Author contributions
AIR and DJV conceived the study. AIR designed the software, collected hemagglutinin sequences, performed the bioinformatic analysis and biocuration, and prepared the first draft of the manuscript. ACA wrote Java code and contributed to writing the manuscript. DJV interpreted the bioinformatic observations and contributed to writing the manuscript. AIR and ACA contributed equally to the work. All authors were involved in the revision of the draft manuscript and have agreed to the final content.

Competing interests
We have developed the software but have no competing interests.

Referee report
Computing and visualizing multiple sequence alignments might have been considered a solved issue for bioinformatics a few years ago, until the flood of sequence data we have been experiencing recently. In that respect, efforts to address the issue on another scale altogether are indeed necessary. This work reports on a newer version of ProfileGrid that attempts to solve the visualization of very large alignments and a demonstration of its capabilities with the influenza hemagglutinin as an example. JProfileGrid 2.0 appears to perform well, captures some elements of large alignments and provides a useful interface for the exploration of large homologous sequence data-sets. Offering a number of options to users, e.g. color schemes and frequency diagrams, might indeed be helpful, but the claim of 'solving' the visualization issues with these data is a bit of an over-statement. For funding agencies and beginners in the field, that might have implications such as not appreciating the importance of this line of research.
Beyond this extraordinary claim, the work is fine and provides sufficient detail for interested readers.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Referee report
Visualising multiple sequence alignments (MSA) is of great importance, and has only become more essential in the wake of next generation sequencing (NGS) projects. The authors report on a new improved version of their software called JProfileGrid 2.0, originally published in BMC Bioinformatics in 2008. In their new implementation, the authors add the possibility to view very large alignments by means of a profile abstracted from the alignment, such that the number of sequences (typically running into the hundreds or even thousands in the NGS era) is compressed to just 21 rows in a protein MSA, corresponding to the 20 amino acid types and a gap character.
However, visualising a profile is an entirely different thing than visualising an alignment. The authors' claim, therefore, that they have 'solved' the alignment visualisation problem is unwarranted. They should adapt their title and main conclusion to reflect this fact.
Profiles are an abstraction of an MSA, but they lose quite a bit of information: for example, correlations between alignment positions, or specific subgroups discernible in an MSA, are not visible in a profile. The authors could have opted for alternative ways to compress MSAs containing many sequences, for instance, by clustering the sequences based on sequence similarity, and selecting a representative sequence for each cluster group. The number of clusters could then be selected by the user, or even set automatically based upon optimality criteria concerning the clustering.
Many techniques exist already for visualising profiles, such as the widely used SequenceLogos and related representations. As such, just representing the raw amino acid frequencies using a color coding scheme can hardly be considered novel. The authors could make their profile visualization more useful by using other frequency-derived schemes, such as the aforementioned SequenceLogo (entropy-based), or the log likelihood used in PSSMs calculated by BLAST. On a more general note, the authors should cite related work and software by other researchers in the field.
Summarising, the authors have compiled a useful software package to visualise alignment profiles, but have not solved the MSA visualisation problem for large sequence sets. Further, they could enhance the usability of the software by implementing more profile scoring schemes as indicated above.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Referee report
The authors have selected an important problem to address, which is how to visualise very large multiple sequence alignments. The manuscript describes a new improved version of their software called JProfileGrid, originally published in BMC Bioinformatics in 2008. The software ran on a Stockholm formatted alignment I selected from Pfam and so appears to work largely as advertised.
My main critique of the work is the claim that ProfileGrids solve the large alignment visualisation problem as stated in the title. In my opinion the visualisation given does not solve the problem. The problem is recast as a visualisation of a profile. Profiles give the frequency of amino acids at each position calculated from a multiple sequence alignment and so their size is independent of the number of sequences in the alignment. So visualising the profile is one way of gaining an overview of a multiple sequence alignment. But others have already done this, for example with HMM-logos and the SUPERFAMILY profile visualisation.
Both of these visualisations are more intuitive than the ProfileGrid because it is easier to map these visualisations to a real sequence. However, none of these are really alignment visualisations. They all lose a lot of information that is implicit in the alignment, for example, the correlations between amino acid positions in subfamilies. The JProfileGrid software does provide a compact stacked alignment representation, but that is also possible in other alignment viewers by choosing a sufficiently small font.
Overall I think that this is a well engineered update of an existing software package that has some utility. The authors need to remove the claims that this software solves the large alignment visualisation problem. This is, in my opinion, not justified. In particular the current title is not acceptable. The text of the paper also needs to include mention that other profile visualisation software exists.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.