A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research

Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp.


This article is included in the Collective Data channel.
This article is included in the Sidra Research channel.
Platforms such as microarrays and, more recently, next generation sequencing have been leveraged to generate molecular profiles at the scale of entire systems. The global perspective gained using such approaches is potentially transformative. Transcriptome profiling has enabled, for instance, the characterization of molecular perturbations that occur in the context of a wide range of disease processes 1-10. This in turn has provided opportunities for the discovery of biomarkers and for the development of novel therapeutic modalities 3,11-13. More recently, such systems-scale profiling of the blood transcriptome has also been used to monitor response to vaccines or therapeutic drugs 14-19. The democratization of these approaches has led to a proliferation of data in public repositories: over 1.7 million individual transcriptome profiles from more than 65,000 studies have been deposited to date in the NCBI Gene Expression Omnibus (GEO), a public repository of transcriptome profiles.
Taken together this vast body of "collective data" holds the promise of accelerating the pace of biomedical discovery by creating countless opportunities for identifying and filling critical knowledge gaps. Building tools that provide biomedical researchers with the ability to seamlessly interact with collections of datasets along with rich contextual information is essential in promoting insight and enabling knowledge discovery. To address this need we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB).
GXB was described in a recent publication and is available as open source software on GitHub 20 . This tool constitutes a simple interface for the browsing and interactive visualization of large volumes of heterogeneous data. Users can easily customize data plots by adding multiple layers of information, modifying the order of samples, and generating links that capture these settings, which can be inserted in email communications or in publications. Accessing the tool via these links also provides access to rich contextual information that is essential for data interpretation. This includes access to gene information and relevant literature, study design information, detailed sample information as well as ancillary data 20 .
In recent years, a large number of transcriptional studies have been conducted with the aim of characterizing and functionally classifying monocytes in health and disease. Monocytes are a population of immune cells found in the blood, bone marrow, and spleen. They constitute ~10% of the total circulating blood leukocytes in humans. They can remain in the blood circulation for up to 1-2 days, after which time, if they have not been recruited to a tissue, they die and are removed. They are considered the systemic reservoir of myeloid precursors for renewal of tissue macrophages and dendritic cells. 63,64. Moreover, loss of CCR2-expressing nonclassical monocytes is associated with cognitive impairment in antiretroviral therapy-naïve infected subjects 65. Altogether, these findings indicate that monocytes are more than circulating precursors and have different effector functions in response to various infections and during inflammation. Clearly, furthering our understanding of the role of monocyte subsets in health and disease will require many more studies, and we hope that the dataset compendium that we are making available to the research community via this publication can help support these endeavors.
In this data note we are making available via GXB a curated compendium of 93 public datasets relevant to human monocyte immunobiology, representing a total of 4,516 transcriptome profiles.

Identification of monocyte datasets
Potentially relevant datasets deposited in GEO were identified using an advanced query based on the Bioconductor package GEOmetadb and the SQLite database that captures detailed information on the GEO data structure; https://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html 66. The search query was designed to retrieve entries whose title and description contained the word Monocyte OR Monocytes, that were generated from human samples, and that used Illumina or Affymetrix commercial platforms. The query result is appended with rich metadata from GEOmetadb that allows for manual filtering of the retrieved collection.
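The kind of query described above can be sketched as follows. This is a minimal illustration only and not the query used to build the compendium: it runs against a small in-memory SQLite database, and the table and column names (gse, gpl, gse_gpl, title, summary, organism, manufacturer) are assumed to mirror the GEOmetadb schema. The accession identifiers and study titles are hypothetical placeholders.

```python
import sqlite3

# Toy stand-in for the GEOmetadb SQLite file. Table and column names are
# assumptions modelled on GEOmetadb; all rows are made-up placeholders.
def build_toy_db():
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gse (gse TEXT, title TEXT, summary TEXT);
        CREATE TABLE gpl (gpl TEXT, organism TEXT, manufacturer TEXT);
        CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);
    """)
    con.executemany("INSERT INTO gse VALUES (?, ?, ?)", [
        ("GSE_TOY1", "Monocyte response to LPS",
         "Primary human monocytes stimulated in vitro"),
        ("GSE_TOY2", "T cell activation", "CD4 T cells"),
        ("GSE_TOY3", "Monocytes in mouse sepsis",
         "Mouse monocytes profiled ex vivo"),
        ("GSE_TOY4", "Monocyte subsets",
         "Classical and nonclassical monocytes on a custom array"),
    ])
    con.executemany("INSERT INTO gpl VALUES (?, ?, ?)", [
        ("GPL_TOY_A", "Homo sapiens", "Affymetrix"),
        ("GPL_TOY_B", "Mus musculus", "Illumina Inc."),
        ("GPL_TOY_C", "Homo sapiens", "Agilent Technologies"),
    ])
    con.executemany("INSERT INTO gse_gpl VALUES (?, ?)", [
        ("GSE_TOY1", "GPL_TOY_A"), ("GSE_TOY2", "GPL_TOY_A"),
        ("GSE_TOY3", "GPL_TOY_B"), ("GSE_TOY4", "GPL_TOY_C"),
    ])
    return con

# SQLite's LIKE is case-insensitive for ASCII, so '%monocyte%' matches
# "Monocyte" and "Monocytes" alike.
QUERY = """
SELECT DISTINCT gse.gse, gse.title
FROM gse
JOIN gse_gpl ON gse.gse = gse_gpl.gse
JOIN gpl ON gse_gpl.gpl = gpl.gpl
WHERE gse.title LIKE '%monocyte%'
  AND gse.summary LIKE '%monocyte%'
  AND gpl.organism = 'Homo sapiens'
  AND (gpl.manufacturer LIKE '%Illumina%'
       OR gpl.manufacturer LIKE '%Affymetrix%')
ORDER BY gse.gse
"""

def find_candidates(con):
    """Return (accession, title) pairs matching the keyword/species/platform filter."""
    return con.execute(QUERY).fetchall()

if __name__ == "__main__":
    for accession, title in find_candidates(build_toy_db()):
        print(accession, title)
```

A query of this kind only produces candidates; each returned entry still has to be assessed individually.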
The relevance of each entry returned by this query was assessed individually. This process involved reading through the descriptions and examining the list of available samples and their annotations. Sometimes it was also necessary to review the original published report in which the design of the study and the generation of the dataset are described in more detail. The search query also returned a number of datasets that did not include profiles of monocytes but rather of "monocyte-derived dendritic cells" or "monocyte-derived macrophages". During our manual screen these were excluded, as were studies employing monocytic cell lines.
Only studies including primary human monocyte profiles were retained. The datasets cover a broad range of studies investigating human monocyte immunobiology in the context of diseases and through comparison with diverse cell populations and study types, as illustrated by a graphical representation of relative occurrences of terms in the descriptions of the studies loaded into our tool (Figure 1). A wide range of cell types and diseases are represented. Ultimately, the collection comprised 93 curated datasets. It includes datasets generated from studies profiling primary human CD14+ cells isolated from patients with autoimmune diseases (7), bacterial, viral and parasitic infections (7), cancer (4), cardiovascular diseases (4), kidney diseases (4), as well as monocytes isolated from healthy subjects (58) (Figure 2). The 58 datasets in which monocytes were isolated from healthy subjects were classified based on whether profiling was conducted ex vivo or following in vitro experiments. In total, 38 datasets were identified in which primary human CD14+ cells were stimulated or infected in in vitro experiments (Figure 2). Among the many noteworthy datasets, there are 8 datasets investigating differences between monocyte subsets; classical ( Table 1 and can be browsed interactively in GXB.

Dataset upload and annotation on GXB
Once a final selection was made, each dataset was downloaded from GEO in the SOFT file format. It was in turn uploaded on an instance of the Gene Expression Browser (GXB) hosted on the Amazon Web Services cloud. Available sample and study information were also uploaded. Samples were grouped according to possible interpretations of study results, and gene rankings were computed based on the different group comparisons (e.g. comparing monocytes isolated from cases vs controls in studies where profiling was performed ex vivo; or stimulated vs medium control in in vitro experiments).
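As an illustration of how a ranked gene list can be derived from a group comparison, the sketch below ranks genes by the absolute log2 fold change between the mean expression of two sample groups. The statistic used here is an assumption chosen for simplicity; this note does not state which ranking methods GXB applies. The gene names, sample identifiers and expression values are hypothetical.

```python
import math
from statistics import mean

# Hypothetical linear-scale expression values: gene -> {sample: value}.
EXPRESSION = {
    "GENE_A": {"s1": 800, "s2": 800, "s3": 100, "s4": 100},
    "GENE_B": {"s1": 100, "s2": 100, "s3": 100, "s4": 100},
    "GENE_C": {"s1": 50, "s2": 50, "s3": 200, "s4": 200},
}

# Hypothetical group assignment: sample -> group label.
GROUPS = {"s1": "stimulated", "s2": "stimulated", "s3": "medium", "s4": "medium"}

def rank_genes(expression, groups, case, control):
    """Rank genes by absolute log2 fold change (case vs control), largest first."""
    case_ids = [s for s, g in groups.items() if g == case]
    control_ids = [s for s, g in groups.items() if g == control]
    ranked = []
    for gene, values in expression.items():
        case_mean = mean(values[s] for s in case_ids)
        control_mean = mean(values[s] for s in control_ids)
        # Assumes strictly positive values; real data may need an offset.
        ranked.append((gene, math.log2(case_mean / control_mean)))
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked

if __name__ == "__main__":
    for gene, log2fc in rank_genes(EXPRESSION, GROUPS, "stimulated", "medium"):
        print(f"{gene}\t{log2fc:+.2f}")
```

Switching the group labels passed as case and control yields the ranking for a different comparison on the same dataset.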

Short Gene Expression Browser tutorial
The GXB software has been described in detail in a recent publication 20. This custom software interface provides users with a means to easily navigate and filter the dataset collection available at http://monocyte.gxbsidra.org/dm3/landing.gsp. A web tutorial is also available online: http://monocyte.gxbsidra.org/dm3/tutorials.gsp#gxbtut. Briefly, datasets of interest can be quickly identified either by filtering using criteria from pre-defined lists on the left or by entering a query term in the search box at the top of the dataset navigation page. Clicking on one of the studies listed in the dataset navigation page opens a viewer designed to provide interactive browsing and graphic representations of large-scale data in an interpretable format. This interface is designed to present ranked gene lists and display expression results graphically in a context-rich environment. Selecting a gene from the rank-ordered list on the left of the data-viewing interface will display its expression values graphically in the screen's central panel.
Directly above the graphical display, drop-down menus give users the ability: a) To change how the gene list is ranked; this allows the user to change the method used to rank the genes, or to include only genes that are selected for specific biological interest; b) To change sample grouping (Group Set button); in some datasets a user can switch from groups based on cell type to groups based on disease type, for example; c) To sort individual samples within a group based on associated categorical or continuous variables (e.g. gender or age); d) To toggle between the bar chart view and a box plot view, with expression values represented as a single point for each sample. Samples are split into the same groups whether displayed as a bar chart or box plot; e) To provide a color legend for the sample groups; f) To select categorical information that is to be overlaid at the bottom of the graph. For example, the user can display gender or treatment status in this manner; g) To provide a color legend for the categorical information overlaid at the bottom of the graph; and h) To download the graph as a png image or csv file for performing a separate analysis.
Measurements have no intrinsic utility in the absence of contextual information. It is this contextual information that makes the results of a study or experiment interpretable. It is therefore important to capture, integrate and display information that will give users the ability to interpret data and gain new insights from it. We have organized this information under different tabs directly above the graphical display. The tabs can be hidden to make more room for displaying the data plots, or revealed by clicking on the blue "show info panel" button on the top right corner of the display. Information about the gene selected from the list on the left side of the display is available under the "Gene" tab. Information about the study is available under the "Study" tab. Information available about individual samples is provided under the "Sample" tab. Rolling the mouse cursor over a bar chart's element while displaying the "Sample" tab lists any clinical, demographic, or laboratory information available for the selected sample. Finally, the "Downloads" tab allows advanced users to retrieve the original dataset for analysis outside this tool. It also provides all available sample annotation data for use alongside the expression data in third party analysis software.
Other functionalities are provided under the "Tools" drop-down menu located in the top right corner of the user interface. Some of the notable functionalities available through this menu include: a) Annotations, which provides access to all the ancillary information about the study, samples and dataset organized across different tabs; b) Cross-project view, which provides the ability to browse through all available studies for a given gene; c) Copy link, which generates a mini-URL that encapsulates information about the display settings in use and can be saved and shared with others (clicking on the envelope icon on the toolbar inserts the URL in an email message via the local email client); and d) Chart options, which gives users the option to customize chart labels.
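The cross-project view can also be addressed directly by URL. The sketch below assembles such an address from a probe identifier, gene symbol and gene identifier. The parameter names and path are inferred solely from the single CD14 address quoted under Dataset validation; whether the same pattern holds for other genes or probes is an assumption, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Path taken from the CD14 example quoted in this note; treating it as a
# general pattern is an assumption.
CROSS_PROJECT_BASE = "http://monocyte.gxbsidra.org/dm3/geneBrowser/crossProject"

def cross_project_url(probe_id, gene_symbol, gene_id):
    """Build a cross-project view address for one gene (hypothetical helper)."""
    query = urlencode({
        "probeID": probe_id,
        "geneSymbol": gene_symbol,
        "geneID": gene_id,
    })
    return f"{CROSS_PROJECT_BASE}?{query}"

if __name__ == "__main__":
    # Values from the CD14 example in this note.
    print(cross_project_url("201743_at", "CD14", 929))
```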

Dataset validation
Quality control checks were performed by examining the profiles of relevant biological indicators. Known leukocyte markers were used, such as CD14, which is expressed by monocytes and macrophages, as well as markers that would indicate significant contamination of the sample by other leukocyte populations, such as CD3, a T-cell marker; CD19, a B-cell marker; and CD56, an NK cell marker (Figure 3; the expression of the CD14 marker across all studies can be checked using the cross-project functionality of GXB: http://monocyte.gxbsidra.org/dm3/geneBrowser/crossProject?probeID=201743_at&geneSymbol=CD14&geneID=929). We have systematically verified that expression of the genes encoding those surface markers was consistent with grouping labels provided by depositors. In addition, expression of the XIST transcript, whose expression is gender-specific, was also examined to determine its concordance with demographic information provided with the GEO submission (expression of XIST should be high in females and low in males).
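The checks described above were carried out by inspecting expression profiles in GXB. Purely as an illustration of the logic, the sketch below flags a sample when CD14 is low, when a marker of another leukocyte population is high, or when XIST expression disagrees with the reported gender. The single expression threshold, the sample records and the function name are hypothetical; CD3E and NCAM1 are used here as the gene symbols corresponding to CD3 and CD56.

```python
# Gene symbols standing in for the surface markers named in the text
# (CD3 -> CD3E, CD56 -> NCAM1); the mapping to populations follows the text.
CONTAMINATION_MARKERS = {"CD3E": "T-cell", "CD19": "B-cell", "NCAM1": "NK-cell"}

def qc_flags(sample, threshold=100.0):
    """Return a list of QC warnings for one sample.

    `threshold` is an arbitrary illustrative cut-off separating "low" from
    "high" expression; a real analysis would choose it per platform.
    """
    flags = []
    expr = sample["expression"]
    if expr["CD14"] < threshold:
        flags.append("low CD14")
    for gene, population in CONTAMINATION_MARKERS.items():
        if expr[gene] >= threshold:
            flags.append(f"possible {population} contamination ({gene})")
    # XIST is expected to be high in females and low in males.
    xist_high = expr["XIST"] >= threshold
    if xist_high != (sample["sex"] == "female"):
        flags.append("XIST discordant with reported sex")
    return flags

# Hypothetical sample records.
CLEAN_SAMPLE = {
    "sex": "female",
    "expression": {"CD14": 2500, "CD3E": 20, "CD19": 15, "NCAM1": 10, "XIST": 900},
}
SUSPECT_SAMPLE = {
    "sex": "male",
    "expression": {"CD14": 2000, "CD3E": 450, "CD19": 12, "NCAM1": 8, "XIST": 700},
}

if __name__ == "__main__":
    print(qc_flags(CLEAN_SAMPLE))
    print(qc_flags(SUSPECT_SAMPLE))
```

A flagged sample is a prompt for closer inspection of its annotation rather than an automatic exclusion.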

Data availability
All datasets included in our curated collection are also available publicly via the NCBI GEO website: http://www.ncbi.nlm.nih.gov/geo/; and are referenced throughout the manuscript by their GEO accession numbers (e.g. GSE25913). Signal files and sample description files can also be downloaded from the GXB tool under the "Downloads" tab.

Author contributions
DR: curated, uploaded and annotated datasets, and drafted the manuscript. SB: installed the software, uploaded datasets, programmed portions of the web application, tested the software, and assisted in drafting the manuscript. SP: participated in the design of the software, programmed portions of the original web application, installed the software, tested the software, and assisted in drafting the manuscript. CQ: participated in the design of the software, programmed portions of the original web application, tested the software, and assisted in drafting the manuscript. DC: participated in software design, tested the software, and drafted the manuscript.

Competing interests
No competing interests were disclosed.

Grant information
DR, SB and DC received support from the Qatar Foundation.
I confirm that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
dataset validation. The Gene Expression Browser should prove very useful for investigating large datasets; however, I have several questions and comments regarding the curated data itself:

Title:
The novel aspect and apparent emphasis of this data note is using the Gene Expression Browser to more easily explore the curated ninety-six datasets. But the current title emphasizes the key information on fostering the knowledge discovery. Please consider rephrasing it by focusing on the monocyte datasets and web application.

Introduction:
As the Gene Expression Browser has been described in detail previously, the emphasis of this data note should be on the curated data. It would be helpful to discuss the motivation for creating this particular compendium of monocyte transcriptome datasets as well as the intended use of the curated data given the breadth and heterogeneity of diseases, cell types, and experiments that it includes.

Methods:
1. Please elaborate more specifically on how the datasets were curated. What were the eligibility criteria for inclusion into the compendium?
2. The table summarizing the published data can be difficult to read due to its landscape orientation. Consider rotating the table from a landscape orientation to a portrait orientation.
3. In the right pie chart of Figure 2, there are twelve datasets studying primary monocytes; however, datasets classified as stimulation, infection, and monocyte subsets may also contain primary in vitro monocytes. Better categorization is needed.
4. Data validation is critical for verifying that a dataset is acceptable for use. The authors mention performing dataset validation but do not report the related results or summary of their validation. On page 9, the process of assessing contamination by other leukocyte populations using surface markers should be done carefully as CD14 monocytes do share surface marker CD4.
5. In Fig. 3, it is unclear whether the orange bar plot is referring to CD4 T cells or CD4 cells in general. They are different cell types.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Author Response 29 Mar 2016
Darawan Rinchai, Sidra Medical and Research Center, Qatar
We thank the reviewers for their valuable feedback and suggestions to improve our manuscript.

Title:
Following the suggestion of the reviewers, we changed the title of the manuscript to "A curated compendium of transcriptome datasets of relevance to human monocyte immunobiology research".