A curated transcriptome dataset collection to investigate the immunobiology of HIV infection

Compendia of large-scale datasets available in public repositories provide an opportunity to identify and fill current gaps in biomedical knowledge. But first, these data need to be readily accessible to research investigators for interpretation. Here, we make available a collection of transcriptome datasets relevant to HIV infection. A total of 2717 unique transcriptional profiles distributed among 34 datasets were identified, retrieved from the NCBI Gene Expression Omnibus (GEO), and loaded in a custom web application, the Gene Expression Browser (GXB), designed for interactive query and visualization of integrated large-scale data. Multiple sample groupings and rank lists were created to facilitate dataset query and interpretation via this interface. Web links to customized graphical views can be generated by users and subsequently inserted in manuscripts reporting novel findings, such as discovery notes. The tool also enables browsing of a single gene across projects, which can provide new perspectives on the role of a given molecule across biological systems. This curated dataset collection is available at: http://hiv.gxbsidra.org/dm3/geneBrowser/list.

Uncovering the gene transcription signatures associated with different outcomes of HIV infection is paramount to a deeper understanding of HIV pathogenesis and to identifying potential therapeutic targets for improving immunological response and for eradicating HIV infection 1. HIV has a complex life cycle during which it engages multiple host cellular components, including the immune cells in which it replicates, thereby undermining immune functions. It also hijacks host transcription factors and enzymes to ensure viral production and subsequent infections 2. HIV dysregulates host genes, resulting in aberrant immune responses, disease progression, and opportunistic infections 3,4. The ability to pool and analyze samples across various groups of HIV-infected individuals with different disease outcomes, and across various cell types or tissues, offers a unique opportunity to define common denominators of the immune control of HIV infection, the regulation of HIV replication, and/or the virus-host interaction. With this in mind, we make available, via an interactive web application, a curated collection of transcriptome datasets relevant to HIV infection.
With over 65,000 studies deposited in the NCBI Gene Expression Omnibus (GEO), a public repository of transcriptome profiles, identifying the datasets relevant to a particular research area is not straightforward. Furthermore, GEO is primarily designed as a repository for storing data, rather than for browsing and interacting with the data. Thus, we used a custom web application, the Gene Expression Browser (GXB), to host a collection of datasets that we identified as particularly relevant to the study of the immunobiology of HIV infection. This tool has been described in detail and its source code released as part of a recent publication 5. It allows seamless browsing and interactive visualization of large volumes of heterogeneous data. Users can easily customize data plots by adding multiple layers of information, modifying the sample order, and generating links that capture these settings and can be inserted in email communications or in publications. Accessing the tool via these links also provides access to rich contextual information essential for data interpretation, including gene information and relevant literature, study design, and detailed sample information.

Identification of relevant datasets
Potentially relevant datasets deposited in GEO were identified using an advanced query based on the Bioconductor package GEOmetadb, version 1.30.0, and on the SQLite database that captures detailed information on GEO data structure (https://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html) 6. The search query was designed to retrieve entries whose title or summary contained the word HIV and that were generated from human samples using Illumina or Affymetrix commercial platforms.
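The kind of query described above can be sketched as follows. This is only an illustration under stated assumptions, not the query used for this collection: it builds a small in-memory SQLite database whose table and column names (gse, gpl, gse_gpl, title, summary, organism, manufacturer) are assumed to mirror the GEOmetadb layout, and all records and accession labels are made up.

```python
import sqlite3

# Toy stand-in for the GEOmetadb SQLite file. Table and column names are
# assumed to mirror GEOmetadb; every record below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gse (gse TEXT, title TEXT, summary TEXT);
CREATE TABLE gpl (gpl TEXT, organism TEXT, manufacturer TEXT);
CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);
INSERT INTO gse VALUES
  ('GSE_A', 'Whole blood profiles in HIV infection', 'Illustrative record'),
  ('GSE_B', 'Liver response to fasting', 'No relevant keyword'),
  ('GSE_C', 'T cell profiling', 'Samples from HIV positive donors'),
  ('GSE_D', 'HIV infection in macaque-derived cells', 'Illustrative record'),
  ('GSE_E', 'Monocyte transcriptome', 'Donors with chronic HIV infection');
INSERT INTO gpl VALUES
  ('GPL_1', 'Homo sapiens', 'Illumina Inc.'),
  ('GPL_2', 'Homo sapiens', 'Affymetrix'),
  ('GPL_3', 'Macaca mulatta', 'Affymetrix'),
  ('GPL_4', 'Homo sapiens', 'Agilent Technologies');
INSERT INTO gse_gpl VALUES
  ('GSE_A', 'GPL_1'), ('GSE_B', 'GPL_2'), ('GSE_C', 'GPL_4'),
  ('GSE_D', 'GPL_3'), ('GSE_E', 'GPL_2');
""")

# Keep series that mention HIV in the title or summary, were run on human
# samples, and used an Illumina or Affymetrix platform.
QUERY = """
SELECT DISTINCT gse.gse
FROM gse
JOIN gse_gpl ON gse.gse = gse_gpl.gse
JOIN gpl ON gse_gpl.gpl = gpl.gpl
WHERE (gse.title LIKE '%HIV%' OR gse.summary LIKE '%HIV%')
  AND gpl.organism = 'Homo sapiens'
  AND (gpl.manufacturer LIKE '%Illumina%' OR gpl.manufacturer LIKE '%Affymetrix%')
ORDER BY gse.gse
"""

candidates = [row[0] for row in conn.execute(QUERY)]
print(candidates)
```

A keyword filter of this kind only produces candidates; each returned entry still needs the manual relevance assessment described in the next paragraph.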
The relevance of each entry returned by this query was assessed individually. This process involved reading through the descriptions and examining the list of available samples and their annotations. Sometimes it was also necessary to review the original published report in which the design of the study and generation of the dataset are described in more detail. We identified 87 datasets meeting the search criteria and containing HIV-infected samples (some HIV-related studies contained uninfected samples only). Out of the 87 datasets, 41 were generated from tissues or cells isolated from HIV-infected individuals, and 46 contained cell lines or primary cells infected in vitro. Since the molecular, cellular and physiological processes involved in in vivo and in vitro infections are dramatically different, we decided to create two separate collections. Here we describe the "in vivo collection", composed of 34 curated datasets (after filtering out datasets that did not meet quality control criteria, as described in the "Dataset validation" section, or datasets generated using an unsupported array platform). Of the 34 datasets, 7 are from whole blood, 7 from peripheral blood mononuclear cells (PBMCs), 8 from CD4+ and/or CD8+ T cells, 4 from monocytes, 1 from dendritic cells (DCs), and 7 from tissues other than blood (Figure 1). Four datasets comprise samples from patients co-infected with tuberculosis (TB) 7-10, one dataset comprises samples from AIDS-related lymphomas 11, and four datasets address HIV-infected patients with neurological disorders, such as HIV-related fatigue syndrome 12, major depressive disorder (MDD) 13, or HIV-Associated Neurocognitive Disorder (HAND) 14,15. Among the many noteworthy datasets, several stood out, such as the extensive  Table 1. The thematic composition of our collection is illustrated by a graphical representation of the relative occurrences of terms in the list of titles loaded into the GXB tool (Figure 2).

Gene expression browser (GXB) - dataset upload and annotation
Once a final selection had been made, each dataset was downloaded from GEO as a Simple Omnibus Format in Text (SOFT) file. It was in turn uploaded to a dedicated instance of the GXB, an interactive web application developed at the Benaroya Research Institute and hosted on the Amazon Web Services cloud. Available sample and study information was also uploaded. Samples were grouped according to possible interpretations of study results, and gene rankings were computed based on different group comparisons (e.g. comparing samples from HIV negative vs HIV positive patients, with or without antiretroviral therapy, in different stages of disease progression, or with or without co-infection, depending on the focus of the respective studies).
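To make the idea of a group-comparison rank list concrete, here is a minimal sketch. The statistic actually used by GXB to rank genes is not specified in this text, so ranking by absolute log2 fold change of group means is an assumption made purely for illustration; the function name rank_genes, the gene labels, the sample IDs and the signal values are all hypothetical.

```python
import math
from statistics import mean


def rank_genes(expression, groups, case, control):
    """Rank genes by absolute log2 fold change between two sample groups.

    expression: {gene: {sample_id: linear-scale signal}}
    groups:     {sample_id: group label}
    Assumption: fold change of group means stands in for whatever
    ranking statistic the real tool applies.
    """
    ranked = []
    for gene, values in expression.items():
        case_vals = [v for s, v in values.items() if groups[s] == case]
        ctrl_vals = [v for s, v in values.items() if groups[s] == control]
        log2_fc = math.log2(mean(case_vals) / mean(ctrl_vals))
        ranked.append((gene, log2_fc))
    # Largest absolute change first, regardless of direction.
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked


# Invented example data: two HIV positive and two HIV negative samples.
groups = {"s1": "HIV+", "s2": "HIV+", "s3": "HIV-", "s4": "HIV-"}
expression = {
    "GENE_A": {"s1": 800.0, "s2": 800.0, "s3": 100.0, "s4": 100.0},
    "GENE_B": {"s1": 210.0, "s2": 190.0, "s3": 200.0, "s4": 200.0},
    "GENE_C": {"s1": 50.0, "s2": 50.0, "s3": 200.0, "s4": 200.0},
}

ranking = rank_genes(expression, groups, case="HIV+", control="HIV-")
for gene, log2_fc in ranking:
    print(gene, round(log2_fc, 2))
```

Changing the case and control labels yields a different rank list from the same data, which is the behaviour the multiple group comparisons described above rely on.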

GXB - short tutorial
The GXB software has been described in detail in a recent publication 5. This custom software interface provides users with a means to easily navigate and filter the dataset collection available at http://hiv.gxbsidra.org/dm3/geneBrowser/list. A web tutorial is also available online: https://gxb.benaroyaresearch.org/dm3/tutorials.gsp#gxbtut. Briefly, datasets of interest can be quickly identified either by filtering on criteria from pre-defined lists on the left side of the dataset navigation page, or by entering a query term in the search box at the top of the dataset navigation page. Clicking on one of the studies listed in the dataset navigation page opens a viewer designed to provide interactive browsing and graphic representations of large-scale data in an interpretable format. This interface is designed to present ranked gene lists and to display expression results graphically in a context-rich environment. Selecting a gene from the rank-ordered list on the left of the data-viewing interface will display its expression values graphically in the screen's central panel. Directly above the graphical display, drop-down menus give users the ability:
a) To change the rank list by selecting different comparisons (in cases where the dataset is split into more than two groups), or to only include genes that are selected for specific biological interest.
b) To change sample grouping (Group Set button); in some datasets, users can switch between interpretations where samples are grouped based on cell type or disease, for example.
c) To sort individual samples within a group based on associated categorical or continuous variables (e.g. gender or age).
d) To toggle between a bar plot view and a box plot view, with expression values represented as a single point for each sample. Samples are split into the same groups whether displayed as a bar plot or a box plot.
e) To provide a color legend for the sample groups.
f) To select categorical information to be overlaid at the bottom of the graph. For example, the user can display gender or smoking status in this manner.
g) To provide a color legend for the categorical information overlaid at the bottom of the graph.
h) To download the graph as a portable network graphics (png) image or the table with expression values as a comma separated values (csv) file.
Measurements have no intrinsic utility in the absence of contextual information. It is this contextual information that makes the results of a study or experiment interpretable. It is therefore important to capture, integrate and display information that will give users the ability to interpret data and gain new insights from it. We have organized this information under different tabs directly above the graphical display. The tabs can be hidden to make more room for displaying the data plots, or revealed by clicking on the blue "hide/show info panel" button on the top right corner of the display. Information about the gene selected from the list on the left side of the display is available under the "Gene" tab. Information about the study is available under the "Study" tab. Information available about individual samples is provided under the "Sample" tab. Rolling the mouse cursor over a bar plot, while displaying the "Sample" tab, lists any clinical, demographic, or laboratory information available for the selected sample. Finally, the "Downloads" tab allows advanced users to retrieve the original dataset for analysis outside this tool. It also provides all available sample annotation data for use alongside the expression data in third party analysis software. Other functionalities are provided under the "Tools" drop-down menu located in the top right corner of the user interface. These functionalities notably include:
a) "Annotations", which provides access to all the ancillary information about the study, samples and the dataset, organized across different tabs;
b) "Cross Project View", which provides the ability to browse across all available studies for a given gene;
c) "Copy Link", which generates a mini-URL that encapsulates information about the display settings in use and can be saved and shared with others (clicking on the envelope icon on the toolbar inserts the URL in an email message via the local email client); and
d) "Chart Options", which gives users the option to customize chart labels.

Dataset validation
Quality control checks were performed by examining the expression profiles of relevant biological markers. Known leukocyte surface markers were used to verify the consistency of the information provided by dataset depositors, and to identify instances where contamination of samples by other leukocyte populations may be confounding. The markers that were used include: CD3 (CD3D), a T-cell marker; CD4 and CD8 (CD8A), markers of CD4+ and CD8+ T cells respectively; CD11c (ITGAX), an mDC marker; CD14, expressed by monocytes and macrophages; and adiponectin (ADIPOQ), expressed in adipose tissue. Expression of the XIST transcript, which is sex-specific, was also examined in datasets containing the relevant information, to determine its concordance with the demographic information provided with the GEO submission (respective links in Table 1).
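The XIST concordance check lends itself to a small sketch. The checks described above were made by examining expression profiles, so the automated threshold rule below is an assumption for illustration only; the function name flag_sex_mismatches, the threshold, the sample IDs and the signal values are hypothetical and would have to be chosen per dataset and platform.

```python
def flag_sex_mismatches(xist, reported_sex, threshold):
    """Return sample IDs whose XIST signal disagrees with the annotated sex.

    XIST is expected to be high in female samples and low in male samples.
    The threshold is a hypothetical, dataset-specific cut-off.
    """
    mismatches = []
    for sample, signal in xist.items():
        inferred = "female" if signal >= threshold else "male"
        if inferred != reported_sex[sample]:
            mismatches.append(sample)
    return sorted(mismatches)


# Invented example: sample s3 is annotated as male but shows high XIST signal.
xist = {"s1": 1200.0, "s2": 35.0, "s3": 980.0, "s4": 40.0}
reported_sex = {"s1": "female", "s2": "male", "s3": "male", "s4": "male"}

mismatches = flag_sex_mismatches(xist, reported_sex, threshold=200.0)
print(mismatches)
```

A flagged sample is a prompt to re-examine the annotation or possible sample swaps, not proof of an error on its own.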

Data availability
All datasets included in our curated collection are also publicly available via the NCBI GEO website (www.ncbi.nlm.nih.gov/geo) and are referenced throughout the manuscript by their GEO accession numbers (e.g. GSE44228). Signal files and sample description files can also be downloaded from the GXB tool under the "Downloads" tab.

As strengths of the article I will highlight:
The application is friendly and easy to use and allowed us to compare our results with a large collection of databases in a comprehensive way.
The software allows searches related to a particular gene and how its expression is modified in different scenarios (infected vs non-infected, long-term non-progressors vs typical progressors, treated vs untreated).
The cell types from which the datasets were obtained are indicated.
The datasets included have been selected according to their interest and high methodological standards. For example, when contamination with cell types different from those initially targeted is detected, the studies are not considered for the final collection, thus enhancing the quality of the results.
I would propose some suggestions to improve this interesting tool:
All the studies were performed with microarrays. It would be important to discuss whether data from RNA-seq approaches, and the units currently used in these studies (FPKMs, RPKMs, TPMs), could be incorporated in the future.
It should be clarified whether the results from the different studies are normalized or just described with the units used in each study. If data normalization has been performed, it would be important to describe how it was done.
Overall, it represents an important effort that can be useful for many researchers working in the field of HIV genetics and pathogenesis.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
also easier for standardization across studies. Second, that users should be attentive to the subtleties of analysis: covariates such as gender, age, cellularity, analytical platforms and batch effects can influence expression profiles significantly. In-depth analysis may thus require downloading of original expression data.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.