HPAStainR: a Bioconductor and Shiny app to query protein expression patterns in the Human Protein Atlas

The Human Protein Atlas is a website of protein expression in human tissues. It is an excellent resource of tissue and cell type protein localization, but only allows the query of a single protein at a time. We introduce HPAStainR as a new Shiny app and Bioconductor/R package used to query the scored staining patterns in the Human Protein Atlas with multiple proteins/genes of interest. This allows the user to determine if an experimentally-generated protein/gene list associates with a particular cell type. We validated the tool using the Panglao Database cell type specific marker genes and a Genotype Expression (GTEx) tissue deconvolution dataset. HPAStainR identified 92% of the Panglao cell types in the top quartile of confidence scores limited to tissue type of origin results. It also appropriately identified the correct cell types from the GTEx dataset. HPAStainR fills a gap in available bioinformatics tools to identify cell type protein expression patterns and can assist in establishing ground truths and exploratory analysis. HPAStainR is available from: https://32tim32.shinyapps.io/HPAStainR/

This article is included in the RPackage gateway.

Reviewer Status
Invited Reviewers 08 Oct 2020 report report report

Introduction
The Human Protein Atlas (HPA) has performed immunohistochemistry-based visual proteomics for over 15,313 proteins across 59 tissues. Within each tissue a number of different cell types have been scored for staining patterns by a group of pathologists. Therefore, there is a great amount of visual proteomic data that can be used to classify gene or protein lists into specific cell types [1][2][3] . Their website is designed to query one protein of interest at a time and there is no option to query a list of proteins to determine if that protein set is enriched in a particular cell type. This would be a useful feature to take advantage of this robust dataset. Other gene list tools such as Enrichr, which query multiple databases for associations, have not incorporated the HPA protein cell expression dataset into their tools 4 . There are other R packages used to incorporate and query HPA data such as hpar 5 , which allows for easy loading and querying of version controlled data and HPAnalyze 6 which has powerful visualization tools for protein levels. However, both packages lack the functionality of determining enriched proteins in the database.
We introduce HPAStainR (https://32tim32.shinyapps.io/HPA StainR/), a Bioconductor R package and Shiny app developed to query the cell staining database of the HPA. HPAStainR allows a user to input a list of proteins/genes and returns a rank ordered list of cell types that are stained for the input list. HPAStainR is customizable, allowing the user the ability to include cancer or normal tissue data, change the HPA confidence levels, toggle the identification of what proteins from the list were detected, generate a p-value for how many cell type specific proteins are counted for a given cell type, and allow the downloading of the output as a comma separated (csv) file.

Implementation
The user interface of Shiny HPAStainR is made of a sidebar where one can input their protein/gene list, along with various options to customize the output of the Shiny app. The main panel consists of two tabs. The first tab is the output tab, where the DataTable from the user's query is output. The second tab is informational giving the user a list of HPA cell types and how many proteins were tested/histologically scored in a given cell type.
The HPAStainR package is available on Bioconductor. The package shares all of the same functionality as the Shiny web application, including the ability to run the Shiny app locally and acquire all of the data to do so. This allows HPAStainR to be used as the Shiny app or incorporated into a local R pipeline.

Amendments from Version 1
After carefully reading the comments from our reviewers we have made several changes to the HPAStainR package and paper briefly described here. In the package, `HPA_data_downloader()` has been updated with further functionality and arguments allowing individuals to not only date the downloaded data but use data from a specific date of download. In the vignette, we also now suggest the usage of the Bioconductor package `hpar` for its benefit in version control when paired with HPAStainR. The main function, `HPAStainR()` has also been updated, replacing the default Chi-Square analysis with a Fisher's Exact test, as the data is non-parametric and the Fisher's exact test results in more reproducibility when compared to the permuted p-values from the Chi-Square test. The p-value output has also been changed to be numeric instead of a character. `shiny_HPAStainR()` has also been edited to reflect these changes.
The paper also received revisions based on reviewer comments and changes in the package. Due to using Fisher's Exact Test, all p-values have been changed to be the result of this new method including tables and figures. We have cited other important Bioconductor packages relevant to the Human Protein Atlas, such as `hpar` and `HPAnalyze`. Table 1 has two new columns showing the p-values of the HPAStainR results. We described how the "stringency" is based on the "Reliability" column from HPA's data describing how certain HPA is of the staining result. To better explore the arbitrary staining score we generated staining score distributions for random gene groupings with an n ranging from 10 to 100 genes and added them as extended figures. Lastly, we have added further explanations of the figures in their legends for further clarification.
Any further responses from the reviewers can be found at the end of the article REVISED > BiocManager::install(version = 'devel ') > BiocManager::install("HPAStainR") The next Bioconductor version (3.13) is expected in April or May of 2021 and at that time HPAStainR v.1.2.0 will be released with all the changes in the devel. At that time the HPAStainR shiny app will be updated as well.
Input: There are three required R objects for the main HPAStainR function to work and one optional data frame. The first two required objects are the public staining files from the HPA, which can be downloaded using the package and the `HPA_data_downloader()` function. If the argument `save_file` in `HPA_data_ downloader()`is set to `TRUE` then the file will be dated and downloaded, and following runs will load the saved file. The third required input is either a vector of proteins or genes or a character list of proteins separated by a space, comma, or newline to be queried in HPAStainR. The optional data frame, used in the Shiny app version of HPAStainR, is a table that contains the percent of proteins that stained the tissue compared to the number of the proteins evaluated in the tissue, represented in Extended Table 1 9 , which can be generated using the `hpa_sum-mary_maker()` function. This table demonstrates that not all cell types/tissues have the same number of proteins stained for.
Output: If using `HPAStainR()`, a tibble containing the summarized detection of the input list of proteins or genes for each available cell type customized by the options selected before running the analysis. If using `shiny_HPAStainR()`, a shiny Datatable containing the data previously mentioned in the base `HPAStaiR()` output.
HPA data distribution Our analysis in this paper uses the 19.3 version of the Human Protein Atlas Data. Staining was scored by cell type in each tissue by a group of pathologists who rated the intensity in each evaluated cell type as "high, medium, low or not detected." Not all cell types in all tissues were scored, nor were all cell types consistently evaluated. As a result, there are some caveats in the HPA data that should be noted. The distribution of how many proteins are histologically scored in each of the 137 cell types varies in HPA, such that not all results are equal. The number of proteins scored in cell types ranges from 1 in four substantia nigra cell types to over 17,000 in endometrial glandular cells ( Figure 1A; Extended Table 1 9 ), impacting how often a protein is detected in a given cell type ( Figure 1B). The percent of stained to scored proteins demonstrates an enrichment at both extremes of the distribution ( Figure 1C). To highlight this discrepancy in testing, we have made the information in Extended Table 1 as an available tab on the Shiny app. This data can also be made using the `HPA_summary_maker()` function on normal tissue.

Staining score calculation
The staining score calculation is an arbitrary measure of how well an input list of proteins are enriched for a particular cell type. A formal equation is below, but briefly, it is calculated based on the frequency and intensity of staining within a given cell type. Staining intensity is a percentage of high, medium, low, and not detected counts. The high percentage is multiplied by a value of 100, the medium percentage by 50, and the low percentage by 25, before adding all the results together to generate the final staining score. While arbitrary, we over-weighted high staining as the IHC was more robust and those proteins may better define the cell type. To illustrate the distribution of the staining score we generated 1,000 HPAStainR results on random genes, including the top 10 results from HPAStainR and all results, for random gene lists of sizes 10, 25, 50, and 100. As the number of genes increases the staining score distribution decreases. For the top 10 results from each run, ordered on staining score, the distribution appears to be normal (Extended Figure 1). Analysis of all staining data suggests a right skew (Extended Figure 2).
The model for the staining score equation is below where t is the total number of proteins from the list tested in the cell type, h is the number of proteins with high staining in the cell type, m is the number of proteins with medium staining in the cell type, and l is the number of proteins with low staining in the cell type.

Confidence score calculation
The confidence score is unique to this paper, and only used for the comparison of the Panglao Database (PanglaoDB) cell types to HPA cell types. It is a modified version of the staining score adjusting for size of the protein list for each cell type from PanglaoDB. The confidence score calculation weights PanglaoDB cell types based on how many marker genes they have. Like the staining score, this score ranges from 0-100. The model for the equation is below, where p is the number of proteins tested, with a max p being 50 (standardizing the score range), and the staining score of the protein list in the cell type is represented by s.
Confidence Score 50 p s × = Cell type enriched p-value While we utilized all expressed proteins in our staining score, we recognize that some proteins demonstrate cell type enrichment. For this analysis we generated the "enriched-protein p-value" based on either a Fisher's Exact Test or χ 2 analysis. In this paper we used the results of the Fisher's Exact Test.
To calculate the enriched-protein p-value we generated a list of cell type enriched proteins for each level of stringency, low, normal and high. The stringency parameter filters the `Reliability` column from the normal tissue HPA dataset. This `Reliability` varies from "Enhanced," "Supported," "Approved," to "Uncertain" in decreasing order of certainty (full descriptions of these labels are found here https://www.proteinatlas.org/about/assays+ annotation). Low stringency includes all data, normal stringency includes "Enhanced," "Supported," and "Approved," while high stringency only includes "Enhanced" and "Supported". The cell type enriched protein list was generated by calculating a percentage of positively stained to evaluated proteins across each cell type to adjust for protein scoring frequency. This percentage generated our 'enriched proteins' list from the top quartile of enriched proteins (the proteins present in <25% of the evaluated cell types. The number of proteins were 3,275, 2,543, and 1,235 for low, normal, and high stringency respectively and 3,818 in cancer) (Extended Tables 2 and 3 9 ; Figure 2 and Figure 3). The Fisher's Exact Test analysis was based on the staining presence/absence of 'enriched proteins' for a given HPA cell vs presence/absence of proteins from a protein list query. For all experiments in this paper, stringency was set to normal.

Use cases HPA functionality
HPAStainR uses the publicly available HPA cell type histologically scored staining data to identify the top cell type matches to a queried protein/gene list. It ranks cell types on a 0 to 100 "staining score" (Figure 4). This score is based on the pathologist annotated staining intensity (high, medium, low) of each protein/gene in the query list for each HPA cell type, as a percent of the total number of proteins/genes queried (see Methods). For example, a query of the pancreatic enzymes PRSS1, PNLIP, and CELA3A, along with the protein PRL, would identify "pancreas exocrine glandular cells" as the top hit with a staining score of 75 due to the high staining intensity in three proteins and negative staining of the fourth protein.
The second hit would be the "pituitary gland cells in anterior" due to PRL's high expression in that cell type Four cell types were evaluated for >15,000 proteins (green line) and 61 for <700 proteins (red line). This bimodal distribution shows that there are nearly separate groups of cell types based on how often they are scored. B. A histogram of the 137 cell types on the amount of proteins that had positive staining in each cell type. This distribution reveals an expected skew, cell types that are more often histologically scored, tend to have a higher count of proteins detected. C. A histogram of the percent of positively stained proteins to the total amount of histologically scored proteins for the given cell type. The extreme ends of the distribution are populated by samples with less than 700 scored proteins revealing that seldom scored cell types staining frequencies are often artefacts.
(score of 25), followed by "intestinal glandular cells" which only have medium staining of PRSS1 (score of 12.5).

The Panglao Database
To show the functionality of the Shiny app we applied HPAStainR to the Panglao Database, a hub of communitycurated cell type markers from single cell data 10 . We wanted to investigate how well HPAStainR would mark the cell types based on PanglaoDB's annotations. We downloaded a tsv file of PanglaoDB's cell type gene marker data, and parsed it down to only human protein coding marker genes. We assayed 146 human cell types and their 3,661 marker genes through HPAStainR. The number of marker genes per cell type in PanglaoDB are variable, ranging from one marker in trophoblast stem cells to 216 in interneurons. A histogram of markers per cell type showing the distribution can be found in Figure 5. respectively. This illustrates that tissue enrichment of a protein is not an artifact of how often a protein is scored, as there is a similar level of testing across bins. B. A histogram demonstrating the percent of positive staining cells per protein. Three quality stringencies are given. The 1 st quartile lines demonstrate the specificity cut off of the distribution used at each stringency level. As expected, the number of proteins in the 1 st quartile increase with lower stringency. More commonly detected genes are more greatly affected by stringency when compared to rarely detected genes. There also appear to be a larger number of specific proteins (10%) than there are of semi-specific proteins (25%) in HPA, as the distribution from common to specific decreases before peaking again.

HPAStainR identified many of the cell types in PanglaoDB
To perform analyses between multiple runs of HPAStainR and PanglaoDB, we generated a "confidence score," a value (theoretical 0-100) that corrected for the staining score's determination using an additional feature of how many proteins were evaluated (see Methods). This score weighted cell types with multiple marker proteins staining over cell types with a single or fewer marker proteins. Thus, the confidence score allowed us to rank the cell types based on both staining and depth of data.
HPAStainR is agnostic to the source of a protein/gene list. Therefore, an identification of equivalent cell types across two methods provides strong evidence of HPAStainR's usefulness. Specific protein lists, corresponding to the 146 cell types were evaluated from PanglaoDB in HPAStainR with the  top HPA cell types identified for each. To cover both potential user needs, we included in our PanglaoDB output both the top result of HPAStainR and the top result in the appropriate tissue. The confidence score across these comparisons, generated on HPAStainR data, ranged from 1.5 to 66.75. The results of the 146 cell types were divided into quartiles (Qs) based on the confidence score. The average number of proteins associated with a PanglaoDB cell type used to identify the top HPA cell type strongly correlated with the quartile (76.3; 55.7; 23.9; 7.7 proteins in Q1 to Q4, respectively). In the top quartile of scores, 75% (27/36) of cells matched between PanglaoDB and HPA. Of the nine that were not a perfect match, six matched the top hit when limited by tissue type. Of the remaining Q1 PanglaoDB cell types; liver kupffer cells (a type of macrophage), mesothelial cells, and embryonic stem cells, none had matching cell types in the HPA 11 .
A subset of this analysis can be seen below in Table 1 with the full results being in Extended Table 4 9 . Results were ranked by confidence score, with a strong correlation of higher confidence scores to more accurate cell type assignments between PanglaoDB and HPA. An interesting example are chondrocytes, where the top stained score (27.25) was to TONSIL squamous epithelial cells and the top tissue specific cell type was SOFT TISSUE -chondrocytes (20.75). In addition to the stain score, HPAStainR provides a p-value (and Holm adjusted p-value) based on a separate metric based on cell type specific/ enriched protein expression (see Methods). Although tonsil squamous epithelial cells is the top HPAStainR result, the adjusted enriched protein p-value was 1.0 (nonsignificant) while it was p=2.5E-04 for the chondrocytes, indicating cell-enriched proteins favored the correct match.
HPAStainR can help determine cell type populations in bulk RNA sequencing We then demonstrated the functionality of HPAStainR in bulk datasets. We utilized the variable gene expression data from the Genotype Expression (GTEx) dataset that we had previously uncovered as being driven by variation in pneumocytes or the presence of bronchial epithelium 12 . There were 33 genes identified in the pneumocyte cluster and 70 genes in the bronchial epithelium cluster. HPAStainR was applied separately to both lists and found the top results to be lung pneumocytes and bronchus respiratory epithelial cells respectively ( Figure 6A and 6B; Extended Tables 5 and 6 9 ). Therefore, across both single cell and bulk gene expression data, we have identified useful functionality to HPAStainR.

Conclusion
HPAStainR fills a small gap in our knowledge base by allowing for the query of gene/protein lists against the cellular protein expression pattern data of HPA. As datasets of single cell RNA sequencing analysis become available, it is useful to have a tool to correlate these individual cellular transcriptomic gene profiles with translated protein expression patterns. We have also shown the tool can recapitulate bulk RNA sequencing findings making it a valuable tool to understand the cellular composition of a sample. The HPA is an excellent resource to observe staining patterns within cells across tissues for proteins of interest. The limitations of the study are the quality of the staining across all HPA tissues and the quality/consistency of the pathology scoring of the tissues 13 .
Both of these may impact the scoring achieved for any given query. HPAStainR is a new valuable resource to accelerate exploratory and ground truth queries in the HPA cell type protein staining data.
Human Protein Atlas normal tissue and cancer tissue data was acquired from the website: https://www.proteinatlas.org/about/ download (last visited March 28 th 2020)

Extended data
Harvard Dataverse: HPAStainR -A Bioconductor and shiny app to query protein expression patterns in the Human Protein Atlas, https://doi.org/10.7910/DVN/CL5ZTA 9 .   each tissue cell combination the number of proteins being positively scored over the number of times proteins are evaluated. These are categorically grouped on the amount of proteins evaluated.
•�� Extended Table 2. The rarity for proteins in normal tissue across different filters. For each gene in normal tissue that was detected, the percent of how often proteins stained compared to how often they were histologically scored based on three quality filters, low, normal and high. Each protein is also labeled if it is considered rare or not in a given tissue based on if its percentage was in the bottom 1 st quartile of the distribution for each quality filter. NA indicates that the protein never reached the threshold to be counted in a given filter level. Proteins that never positively stained were not included.
•�� Extended Open Peer Review predictions. The main purpose of the tool addresses a need in the field and should be encouraged. However, I would like to mention some of my concerns below about the basis of the tool and the structure of the manuscript.
It has been mentioned that there were two main input files from the HPA database which can be downloaded with the 'HPA_data_downloader' function. Do you need to download all the data from the HPA for each analysis? Is this really necessary? Can't you do this on the fly?
The staining score seems to be constructed a bit arbitrarily. Based on staining intensity (high, medium, and low), scores are weighted with some arbitrary constant values and normalized by the total query number. But I wonder if there is any skew in this scoring scheme in case extreme queries are tried, such as all high with many proteins or all low, etc. Especially, given the highly non-uniform scoring distribution in the HPA database ( Figure 1).
Confidence score? Why not combine this equation with the staining score? The point of keeping them as separate metrics seems a bit vague. Also, a bit confusing from the user standpoint.
Do the authors use the cell type enrichment p-value in the enrichment at all? If so, it is not obvious from the text. The actual purpose of these figures should be clearly explained in the main text. It is hard to get it unless staring at them for a while. Also, these first three figures are not directly related to the tool explained here. They are rather some statistics showing the key points of HPA data. I would recommend replacing these with more directly related graphs demonstrating the performance of the HPAStainR and including the current ones as supplemental data.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics particularly in RNA-seq, single-cell genomics, and algorithm dev. areas.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 12 Mar 2021
Tim Nieuwenhuis, Johns Hopkins University School of Medicine Baltimore, Baltimore, USA Thank you for taking the time to review this paper and the associated package, below is a response to your concerns and the changes we've made because of them.

… Do you need to download all the data from the HPA for each analysis? Is this really necessary? Can't you do this on the fly?
In the current version of `HPA_data_downloader()`, as long as the parameter `save_file` is set to `TRUE`, it will only require you to download the files once. The next release of the package (v1.1.4 available on the main branch at https://github.com/tnieuwe/HPAStainR) includes an updated version of the function that marks the download date of the files, and also allows the user to select which downloaded file they want to use in their analysis for backward reproducibility.

The staining score seems to be constructed a bit arbitrarily…
We acknowledge the somewhat arbitrary nature of the equation, although it is based on prior histology scoring methods more common in histopathology studies. Additionally, we have generated a distribution of 1,000 HPAStainR results on randomly selected genes, including the top 10 results from HPAStainR and all results, for random gene lists of sizes 10, 25, 50, and 100, these are found as Extended Figures 1 and 2. Our findings show the increased number of genes results in lower staining scores. Regardless, for the top 10 results, the distribution appears to be normal. Analysis of all staining data suggests a right skew. We were unable to create an extreme skew, but we cannot entirely exclude it for some unique queries.

Why not combine the confidence score with the staining score?
The reason that they are separate is because the confidence score is simply a scaled staining score that is only used for PanglaoDB-HPAStainR analyses. The context we used the confidence score was strictly for testing how well proteins considered marker genes in PanglaoDB mark the equivalent cell types in HPA. This was described in greater detail to reviewer 2's query.

Do the authors use the cell type enrichment pvalue at all?
Yes, in the original manuscript we use the p-value at the end of the HPAStainR identified many of the cell types in PanglaoDB section. We also show it in figure 6, and in the updated version of the manuscript, we now include it in table 1. Also, the p-values have changed as the analysis has been updated from a X 2 analysis to a Fisher's Exact Test.

The purpose of these figures should be clearly explained in the main text.
We have added further information in the figure legends to better clarify what the figures represent. HPAStainR does not have an output that lends itself to a graphical representation, however, we have added two extended figures to the manuscript showing the distribution of the staining score in various random samplings as noted above.
avoid repeated downloading of the data and/or to provide some reproducibility in the analyses (see below for details on this hpaNormalTissue, hpaCancer, 'both') ``` There are generally two approaches when it comes to using data from remote resources: Download the data on the fly, which allows to use the very latest version of the data, but at the expense of lack of reproducibility/tracking. Indeed, the results can unexpectedly change from one day to another. This would be to option in the HPAStainR package, as well as other Bioconductor packages such as rols (that queries various ontologies, including GO).

1.
Packaging and versioning data to guarantee tracking and reproducibility of the analysis. This is for instance the option provided by hpar (latest hpar release, version 1.32.1, provides HPA data version 19.3, dated 2020/03/06). Other Bioconductor package that offer this solution are GO.db (package GO), and many other Bioconductor data packages.

2.
It would be useful for HPAStainR to make these assumptions explicit and to document how to track results: using hpar or manually (and documenting!) storing the tables downloaded using `HPA_data_downloader()`.
Returning to the reproducibility of the results, despite identical input data (except for the cancer data column names), it is intriguing that the results aren't identical. and found the issue was most likely due to the usage of simulated p-values in the chi-square analysis. To fix this, and overall improve the tool, we have changed the base test in HPAStainR() to a Fisher's Exact Test, which works on the non-parametric data we have. From data not shown here, it does not lengthen the run time of HPAStainR(). Therefore all pvalues in the paper have been updated to the results of the Fisher's Exact Test. Output: We have updated the text to reflect the output of both `HPAStainR()` and `shiny_HPAStainR()` separately. If using the base `HPAStainR ()` function a tibble is returned, while the table returned in the `shiny_HPAStainR` is referred to as a Datatable. The p-values were character values due to the usage of `format.pvalue().` This has been revised and changed to numerical values for the next release.

##HPA data distribution
We have also clarified Extended Table 1 and its relationship as the output to the 'HPA_summary_maker()' function.

## Confidence Score Calculation
The creation of the "confidence scores" was strictly for our PanglaoDB analysis comparison and is not useful outside of validation studies. PanglaoDB had cell types with wildly variable numbers of marker genes. This caused cell types with 1 marker gene that had "high" staining (such as Schwann cells in extended table 4) to become a top hit in PanglaoDB (by staining score), but that does not reveal the accuracy of the tools as the marker genes may just be unique to the cell type in its specific tissue. Therefore we generated the confidence score, which controls and adjusts for the number of marker genes used because having ~40 proteins properly staining is more informative than ~2. The confidence scores are not used in the HPAStainR tool due to the higher consistency of protein staining per cell type, allowing the staining score to be sufficient for ranking.
## Cell type enriched p-value We have added more information explaining how the stringency is based on the "Reliability" column and how each level of stringency functions. The next release of HPAStainR will include this information in the description of the stringency parameter as well.

## The Panglao Database
We have revised the table to include p-values of the tissue specificity. The reason we used confidence scores is that the p-values were simulated in the chi-square analysis used in HPAStainR, where there is a smaller range of possible p-values. We felt the confidence score was a better way to order the cell types due to its ability to highlight the staining score while controlling for how many genes each PanglaoDB cell type has.
## HPAStainR can help determine cell type populations in bulk RNA Sequencing In extended table 5 stomach glandular cells have the second-lowest adjusted p-value at 0.306 and in extended table 6 fallopian tube glandular cells, nasopharynx respiratory epithelial cells, endometrium glandular cells, and cervix-uterine glandular cells all have an adjusted p-value <.05. However, the bronchus epithelial cell's p-value is still the smallest at 1.58x10 -21 compared to fallopian tube glandular cells at 3.99x10 -19 .
## Software availability All "old" versions have been removed and will remain so in the next version.