AlignmentViewer: Sequence Analysis of Large Protein Families

AlignmentViewer is a web-based tool to view and analyze multiple sequence alignments of protein families. The particular strengths of AlignmentViewer include flexible visualization at different scales as well as analysis of conservation patterns and of the distribution of proteins in sequence space. The tool is directly accessible in web browsers without the need for software installation. It can handle protein families with tens of thousands of sequences and is particularly suitable for evolutionary coupling analysis, e.g. via EVcouplings.org.


Introduction
Multiple sequence alignment (MSA) analysis (e.g., analysis of sequence patterns, subfamilies, specificity residues, evolutionary couplings) and visualization allow researchers to extract information and gain a better understanding of protein families. MSA is a basic step in many protein analysis workflows, including 3D structure prediction (Marks et al., 2011), structure detection in flexible ('disordered') domains (Toth-Petroczy et al., 2016), function prediction (Tamames et al., 1998) and intracellular localization (Goldberg et al., 2014).
A number of useful tools exist for the visualization of protein MSAs, such as MView, Wasabi, AliView, MSAViewer and Jalview (Brown et al., 1998; Larsson, 2014; Veidenberg et al., 2016; Waterhouse et al., 2009; Yachdav et al., 2016). MView was one of the first online browser-based MSA viewers, with alignments formatted as an HTML document. Wasabi is a web-based tool particularly useful for phylogenetic analysis and incorporates phylogeny-aware alignment methods. AliView, a desktop application, has features such as sorting, viewing, removing, editing and merging sequences from large nucleotide sequence datasets. MSAViewer is an interactive MSA visualizer in JavaScript that implements basic features of viewing, scrolling and motif selection. Jalview is a Java-based desktop tool accessible through websites using an embeddable applet, but unfortunately the technology for these applets is no longer supported in most browsers.
AlignmentViewer complements these MSA tools and provides the following features: (i) in-browser and serverless execution, (ii) visualization of very large MSAs, (iii) visualization of conservation patterns, (iv) sequence filtering, (v) logo display, (vi) pairwise sequence identity map, (vii) sequence space exploration by UMAP dimensionality reduction, and (viii) display of top-ranked evolutionary couplings (Hopf et al., 2019).
An earlier version of this article can be found on bioRxiv (DOI: https://doi.org/10.1101/269720); additional features have been implemented since the earlier version.

Operation
AlignmentViewer is a web-based tool written in JavaScript with minimal system requirements. AlignmentViewer works best on Chrome regardless of operating system. AlignmentViewer is developed with the D3 library (d3js.org) to produce dynamic and interactive data visualizations, with performance (speed) for large alignments a major consideration. The tool is entirely client-based, running inside a web browser without the need for server-side computation.

Implementation
Users can access AlignmentViewer and all its features directly from alignmentviewer.org, and because execution is serverless, anyone can quickly start a local copy for online or offline use. Hyperlinks for lookup in background databases, such as UniProt or Pfam, are made directly from the client. Alignments can also be passed to AlignmentViewer via a URL query parameter, provided the alignment is served over https and the URL is properly encoded (e.g., https://alignmentviewer.org/?url=https://alignmentviewer.org/example/1bkr_A.1-108.msa.txt). This enables seamless integration with external web services via a simple link; for example, the EVcouplings web server (evcouplings.org; Hopf et al., 2019) offers visualization of computed alignments via a link to AlignmentViewer. The tool has been thoroughly tested with many large alignments. An alignment with, e.g., 50,000 sequences (about 13 MB of memory) loads in the Safari browser within one minute; further speedup is planned.
Figure 1 shows the main functionalities of AlignmentViewer, which are explained in more detail in the next subsections. The top sub-figure shows the msa view, dominated by the sequence logo and the alignment. This view lets the user examine the alignment in depth. Each amino acid position is represented in the sequence logo, and the height shows the information content of that position in bits. Below, from left to right, are the pixel view, a part of the stats view, and the sequence space (with annotations). The pixel view gives a coarse overview of the whole alignment for better visualization and pattern identification. The all-versus-all sequence identity sub-figure in Figure 1 (part of the stats view) allows users to identify possible clusters in the alignment based on sequence identity.
The bottom right subfigure of Figure 1 displays the sequences clustered by similarity (see section Sequence space) highlighted by user-provided annotations to aid in interpretation of the clusters.
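The URL-based loading described in this section can be illustrated with a minimal sketch. It builds a viewer link from the example alignment address given above and percent-encodes the address passed in the url parameter. The helper name viewer_link is hypothetical, and the assumption that the viewer expects the whole address to be percent-encoded (in the manner of encodeURIComponent in a browser) is ours, not a documented requirement.

```python
from urllib.parse import quote

# Base address of the viewer, as given in the text.
VIEWER = "https://alignmentviewer.org/"


def viewer_link(alignment_url: str) -> str:
    """Return a viewer link with the alignment address percent-encoded.

    Hypothetical helper for illustration only.
    """
    # The text states that the alignment must be served over https.
    if not alignment_url.startswith("https://"):
        raise ValueError("the alignment must be served over https")
    # safe="" also encodes ':' and '/', similar to encodeURIComponent.
    return VIEWER + "?url=" + quote(alignment_url, safe="")


link = viewer_link("https://alignmentviewer.org/example/1bkr_A.1-108.msa.txt")
print(link)
```

An external service could place such a link next to each computed alignment so that a single click opens it in the viewer.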

Use case
MSA view
Alignment details. The msa view page has summary information: number of sequences, conservation and gap counts for each position, a sequence logo, and the residues in one letter code. By default, columns with gaps in the reference sequence (first row) are omitted in order to facilitate visual focus on sequence patterns relative to a protein of interest and to avoid extremely gapped alignment views typical of many MSA presentations. The amino acids are colored using a conventional coloring scheme, adopted from MView, based on amino acid properties.
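As a rough illustration of two ideas in this paragraph, the sketch below drops columns that are gapped in the reference sequence and then computes a per-column information content in bits, the quantity a sequence logo displays as letter-stack height. It uses the standard definition log2(20) minus the Shannon entropy of the observed residue frequencies, ignoring gaps and without small-sample correction. Whether AlignmentViewer uses exactly this formula is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter

GAP_CHARS = "-."  # characters treated as gaps (an assumption)


def drop_reference_gap_columns(msa):
    """Keep only columns where the reference (first row) has a residue."""
    keep = [i for i, c in enumerate(msa[0]) if c not in GAP_CHARS]
    return ["".join(seq[i] for i in keep) for seq in msa]


def column_information(msa, alphabet_size=20):
    """Information content per column in bits: log2(alphabet) - entropy."""
    bits = []
    for col in zip(*msa):
        residues = [c for c in col if c not in GAP_CHARS]
        if not residues:
            bits.append(0.0)  # an all-gap column carries no information
            continue
        n = len(residues)
        counts = Counter(residues)
        entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


# Toy alignment; the first row is the reference sequence.
msa = ["AC-D", "ACGD", "AT-E", "A-GD"]
trimmed = drop_reference_gap_columns(msa)
info = column_information(trimmed)
print(trimmed)
print([round(b, 2) for b in info])
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, and more variable columns score lower.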
Sequence attributes and sorting. Sequences in the alignment can be sorted using one of four different methods: (i) the original order provided by the user, (ii-iii) by % sequence identity between a particular sequence and the reference (top) sequence, relative to the first or the second (gaps not counted), and (iv) by user-provided (upload annotations tab) sequence weights or other attributes, such as alignment profile scores (e.g., HMM bit scores). Sequences can be filtered by sequence identity relative to a reference sequence or by percentage of gaps.

Amendments from Version 1
We have updated the manuscript and the software according to reviewer comments. We have added a new feature to select the coloring scheme of the alignment, addressed cross-browser compatibility issues, and improved the usability. We have also addressed comments from the reviewers meant to improve the clarity of the manuscript.
Any further responses from the reviewers can be found at the end of the article.

Figure 1. AlignmentViewer visualization of the beta-lactamase protein domain family. Bars above the sequence alignment quantify residue conservation. The alignment consensus logo (just below the bar chart) is based on the amino acid frequencies. Lower left: pixel view of the alignment especially useful for large families; lower middle: protein-protein sequence similarity matrix graded by percentage identity; lower right: distribution of sequences in sequence space (UMAP projection), colored by species groups.
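The identity-based sorting and the filtering described under "Sequence attributes and sorting" can be sketched as follows. This is a minimal illustration under one reading of "gaps not counted", namely that identity is computed only over columns where neither sequence has a gap. The tool's exact normalization may differ, and all function names are hypothetical.

```python
GAP_CHARS = "-."  # characters treated as gaps (an assumption)


def percent_identity(seq, ref):
    """Percent identical residues over columns where neither row is gapped."""
    pairs = [(a, b) for a, b in zip(seq, ref)
             if a not in GAP_CHARS and b not in GAP_CHARS]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def gap_percent(seq):
    """Percentage of gap characters in one aligned sequence."""
    return 100.0 * sum(c in GAP_CHARS for c in seq) / len(seq)


def sort_by_identity(msa):
    """Sort rows by identity to the reference (first row), highest first."""
    ref = msa[0]
    return sorted(msa, key=lambda s: percent_identity(s, ref), reverse=True)


def filter_msa(msa, min_identity=0.0, max_gaps=100.0):
    """Keep rows passing both the identity and the gap-percentage threshold."""
    ref = msa[0]
    return [s for s in msa
            if percent_identity(s, ref) >= min_identity
            and gap_percent(s) <= max_gaps]


msa = ["ACDE", "ACDF", "A--E", "TCDE", "----"]
ordered = sort_by_identity(msa)
strict = filter_msa(msa, min_identity=80.0, max_gaps=25.0)
print(ordered)
print(strict)
```

Note that under this reading a heavily gapped sequence such as "A--E" can still score 100% identity, which is one reason to combine the identity filter with the gap filter.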

Pixel view (suitable for large families)
The pixel view (image view website tab) provides an overview of the entire depth and breadth of an MSA. The amino acid letters are represented by small rectangles of pixels, retaining the amino acid type coloring. This striking visual impression can reveal patterns of conservation and variation, especially for large alignments. This is very useful to gain an intuitive view of sequence properties, noise at the uncertain edges of a protein family, as well as subfamily distributions. The coloring scheme can be by (1) amino acid properties, (2) hydrophobicity (red to blue) or (3) mutational difference (stronger color) in a sequence relative to the reference (first row) sequence.
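The third coloring mode, mutational difference relative to the reference, can be illustrated with a small sketch that reduces the alignment to a binary mask: 1 where a residue differs from the first row, 0 where it matches. The viewer draws colored rectangles with graded color strength; the binary mask and the text rendering here are simplifications, and the function names are hypothetical.

```python
def difference_mask(msa):
    """Return 1 where a residue differs from the reference (first row), else 0."""
    ref = msa[0]
    return [[int(c != r) for c, r in zip(seq, ref)] for seq in msa]


def render_ascii(mask):
    """Draw the mask as text: '#' for a difference, '.' for a match."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


msa = ["ACDE", "ACDF", "TCDE"]
mask = difference_mask(msa)
picture = render_ascii(mask)
print(picture)
```

In a real pixel view each cell would become a small colored rectangle, so that conserved regions and divergent subfamilies stand out even for tens of thousands of rows.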

Stats view
The stats view tab leads to plots of statistical properties of the set of sequences in the alignment, including (i) sequence identity relative to the reference sequence; (ii) the minimum, maximum, and average of (i); and (iii) a pairwise sequence identity matrix in which each pixel represents the degree of similarity between two sequences, such that a block-diagonal structure of the matrix is indicative of distinct subfamilies, given, e.g., a tree-derived sequence order as user input. The ordering of sequences by phylogeny is (currently) not part of the tool and can be performed using external tools, e.g., Wasabi (Veidenberg et al., 2016).
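The statistics described above can be sketched with a toy example: identity to the reference with its minimum, maximum and average, and an all-versus-all identity matrix. Identity is taken here simply as the fraction of aligned columns carrying the same character (so a gap aligned with a gap counts as a match). This is an assumption made for brevity, not necessarily the tool's definition, and the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of aligned columns with the same character."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def identity_matrix(msa):
    """Symmetric all-versus-all identity matrix with 1.0 on the diagonal."""
    n = len(msa)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = identity(msa[i], msa[j])
    return matrix


def reference_summary(msa):
    """Min, max and mean identity of all other rows to the first row."""
    values = [identity(seq, msa[0]) for seq in msa[1:]]
    return min(values), max(values), sum(values) / len(values)


# Two toy subfamilies, already ordered so that related rows are adjacent.
msa = ["AAAA", "AAAT", "CCCC", "CCCG"]
matrix = identity_matrix(msa)
summary = reference_summary(msa)
for row in matrix:
    print(row)
print(summary)
```

With the rows ordered by subfamily, the high values cluster in two blocks along the diagonal, which is the block-diagonal pattern the text describes. The computation is quadratic in the number of sequences, so it becomes expensive for very large alignments.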

Annotations and evolutionary couplings
Users can upload custom numerical attributes or labels for the sequences in the MSA (upload annotations) or evolutionary couplings between residue positions (load couplings). Adding these attributes allows users to use sequence weights, compare different measures of sequence fitness (e.g., bitscore, sequence identity, statistical energy) or visualize evolutionary coupling constraints for pairs of positions.

Sequence space
Users can view representations of the MSA sequences in two- or three-dimensional space under the "sequence space" tab. These representations are generated using the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction algorithm (McInnes et al., 2018), which has been adapted for JavaScript in the umap-js library (https://github.com/PAIR-code/umap-js). The alignmentviewer.org implementation uses the number of amino acid differences between pairs of sequences (the Hamming distance) as the distance metric parameter. The algorithm then iteratively calculates an embedding in two- or three-dimensional space, which is displayed in real time for the end users. UMAP hyperparameters are set to reasonable defaults, but can also be configured via the settings panel. Sequences can be colored by user-provided annotations ("upload annotations" tab).
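The distance metric named above can be made concrete with a short sketch that computes the Hamming distance between aligned sequences and assembles the pairwise distance matrix. The embedding step itself is not reproduced here; such a matrix could in principle be handed to a UMAP implementation that accepts precomputed distances. The function names are hypothetical, and treating gap characters like any other symbol is an assumption.

```python
def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(msa):
    """All-versus-all Hamming distances for the rows of an alignment."""
    return [[hamming(a, b) for b in msa] for a in msa]


msa = ["ACDE", "ACDF", "TCGF"]
dist = distance_matrix(msa)
for row in dist:
    print(row)
```

Sequences separated by few differences end up close together in the embedding, so subfamilies tend to appear as distinct clusters that can then be colored by annotation.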

Conclusion
AlignmentViewer is a lightweight online viewer for biological multiple sequence alignments that focuses on usability and performance. Written in JavaScript, this tool can be used in many browsers. The architecture of AlignmentViewer allows its use without software installation and without an internet connection. The visualization capabilities, analysis features and metrics in AlignmentViewer are useful in many areas of biology, especially evolutionary, structural, synthetic and chemical biology. In the future we plan to add a visualization of species diversity, predicted contact maps, and organization by sequence subfamilies with specificity residues. A standalone version of AlignmentViewer is available at alignmentviewer.org and is in use by external services including EVcouplings.org.
AlignmentViewer is an open source project hosted on GitHub, which welcomes engagement of interested members of the community.

Data availability
All data underlying the results are available as part of the article and no additional source data are required.

Software availability
AlignmentViewer website and demo can be found at: https://alignmentviewer.org/.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
Overall there is a need for Javascript based alignment viewers which can handle large numbers of sequences. So this software is rather timely. However, I think the current implementation seems to still contain significant bugs and the web page requires a round of user experience testing to make it easier to use and interpret the results. Once these improvements are made then I think this has the potential to be a very useful tool.

Manuscript changes:
Please fix capitalisation of PFAM to Pfam

Software/Website changes:
The computing conservation box is annoyingly placed. It does not go away and covers the UMAP view significantly. I really need an estimate of how long it will take for the pairwise identity to load. Some time passes…I didn't realise I had to click on calculate to get the pairwise identity to show. Please UX test this page to make it easier/clearer for the user to understand what to do.
For the top graph on the stats view the three colours used for max/average and min identity are hard to distinguish. Please use a broader colour palette. Please add axis legends for this graph and make it clearer what the title of this graph is. The title is very clear for the bottom plot.
Please add alternative colouring schemes for viewing the alignments. It makes a surprisingly big difference for experienced users ability to interpret an alignment. I personally like the ClustalX colouring scheme that is widely adopted, but there are other popular schemes it would be nice to incorporate as options.
"Based on this and other small tests we can safely state that any computer that is able to run any modern web browser will be able to run any alignment that requires visualization". This is overstating the case. Needs to be toned down. Try testing the software on the ABC transporter family (PF00005) alignment in Pfam. The NCBI alignment contains 2.6 million sequences. If it works seamlessly for that then you can safely state. I tried to hook the viewer up to a Pfam alignment which is not very large (840 sequences). I got a box that said Fetching file…and no response beyond that. https://alignmentviewer.org/?url=http://pfam.xfam.org/family/PF00571/alignment/seed/format?format=fasta&a also tried Stockholm format with no luck.
https://alignmentviewer.org/?url=http://pfam.xfam.org/family/PF00571/alignment/seed/format?format=stockh I also tried with a Pfam family with just 3 sequences. Still with no luck. I got the viewer to work by uploading a Pfam alignment in fasta format with 3 sequences. Then I tried uploading the seed alignment for the CBS domain downloaded from Pfam in fasta format. I then got the following error: Parsing error: sequence #3 has different length.
The first few lines of the alignment are included below and the third sequence appears to be the same length as all the others.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
and that the string that follows "&url=" has been sanitized using Chrome's built-in console and the "window.encodeURIComponent(x)" function, where x is the string to be sanitized. We have added some text to the manuscript to highlight these requirements. Thank you for pointing out this issue. This bug occurred when refreshing an internal variable, and has now been fixed.
Competing Interests: No competing interests were disclosed.

Response: The manuscript has been corrected.
When loading an alignment using a somewhat older version of Safari (12.1.2), the "computing conservation" window does not disappear after the operation finishes. This problem was absent in Chrome. Response: Thank you for noticing this bug. We have now corrected the issue and the dialog disappears when using Safari.
Competing Interests: No competing interests were disclosed.