A compendium of monocyte transcriptome datasets to foster biomedical knowledge discovery

Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp.


Introduction
Platforms such as microarrays and, more recently, next generation sequencing have been leveraged to generate molecular profiles at the scale of entire systems. The global perspective gained using such approaches is potentially transformative. Transcriptome profiling has, for instance, enabled the characterization of molecular perturbations that occur in the context of a wide range of disease processes [1][2][3][4][5][6][7][8][9][10]. This in turn has provided opportunities for the discovery of biomarkers and for the development of novel therapeutic modalities [3][11][12][13]. More recently, such systems-scale profiling of the blood transcriptome has also been used to monitor response to vaccines or therapeutic drugs [14][15][16][17][18][19]. The democratization of these approaches has led to a proliferation of data in public repositories: over 1.7 million individual transcriptome profiles from more than 65,000 studies have been deposited to date in the NCBI Gene Expression Omnibus (GEO), a public repository of transcriptome profiles.
Taken together, this vast body of "collective data" holds the promise of accelerating the pace of biomedical discovery by creating countless opportunities for identifying and filling critical knowledge gaps. Building tools that provide biomedical researchers with the ability to seamlessly interact with collections of datasets along with rich contextual information is essential in promoting insight and enabling knowledge discovery. To address this need we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB).
GXB was described in a recent publication and is available as open source software on GitHub [20]. This tool constitutes a simple interface for the browsing and interactive visualization of large volumes of heterogeneous data. Users can easily customize data plots by adding multiple layers of information, modifying the order of samples, and generating links that capture these settings, which can be inserted in email communications or in publications. Accessing the tool via these links also provides access to rich contextual information that is essential for data interpretation. This includes access to gene information and relevant literature, study design information, detailed sample information as well as ancillary data [20].
In recent years, a large number of transcriptional studies have been conducted aiming at the characterization and functional classification of monocytes in health and disease. Monocytes are a population of immune cells found in the blood, bone marrow, and spleen. They constitute ~10% of the total circulating blood leukocytes in humans. They can remain in the blood circulation for up to 1-2 days, after which time, if they have not been recruited to a tissue, they die and are removed. They are considered the systemic reservoir of myeloid precursors for renewal of tissue macrophages and dendritic cells. Monocytes play a key role during the immune response as professional phagocytes [21,22] and producers of immune mediators [23,24]. Indeed, reports show that monocytes are recruited to the site of infections as innate effectors of the inflammatory response to microbes, killing pathogens via phagocytosis, production of reactive oxygen intermediates (ROIs) [25], reactive nitrogen intermediates (RNIs) [26,27] and myeloperoxidase (MPO) [28,29], and producing inflammatory cytokines [30] that contribute to further amplifying the antimicrobial response [31].
Human monocytes are derived from hematopoietic stem cells in the bone marrow and are released into the peripheral blood circulation upon maturation. They are divided into three major subsets based on the expression of the cell surface markers CD14 and CD16. The most prevalent subset in the blood circulation, accounting for 90% of all monocytes, consists of classical monocytes, which express high levels of CD14 but low levels of CD16. The remaining 10% is divided into two subsets: intermediate monocytes, with high expression of CD14 and CD16 (CD14+CD16+), and non-classical monocytes, which express low levels of CD14 but high levels of CD16 (CD14dimCD16++ or CD14+CD16++) [32][33][34].
In this data note we are making available via GXB a curated compendium of 93 public datasets relevant to human monocyte immunobiology, representing a total of 4,516 transcriptome profiles.

Identification of monocyte datasets
Potentially relevant datasets deposited in GEO were identified using an advanced query based on the Bioconductor package GEOmetadb and the SQLite database that captures detailed information on the GEO data structure (https://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html) [35]. The search query was designed to retrieve entries whose title and description contained the word Monocyte OR Monocytes, and that were generated from human samples using Illumina or Affymetrix commercial platforms. The query result is appended with rich metadata from GEOmetadb that allows for manual filtering of the retrieved collection.
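The search criteria above can be sketched as a single SQL query run against the GEOmetadb SQLite file. This is a minimal illustration, not the query actually used: the table and column names (`gse`, `gpl`, `gse_gpl`, `title`, `summary`, `organism`, `manufacturer`) are assumed to match the GEOmetadb schema and should be checked against the database in hand, and the text does not specify whether both title and description must match, so the sketch accepts a match in either field. The toy database and its rows are made up so that the example runs without downloading anything.

```python
import sqlite3

# Assumed subset of the GEOmetadb schema: series (gse), platforms (gpl)
# and the link table between them (gse_gpl).
QUERY = """
SELECT DISTINCT gse.gse, gse.title
FROM gse
JOIN gse_gpl ON gse_gpl.gse = gse.gse
JOIN gpl     ON gpl.gpl = gse_gpl.gpl
WHERE (gse.title LIKE '%monocyte%' OR gse.summary LIKE '%monocyte%')
  AND gpl.organism = 'Homo sapiens'
  AND (gpl.manufacturer LIKE '%Illumina%'
       OR gpl.manufacturer LIKE '%Affymetrix%')
ORDER BY gse.gse
"""


def find_candidate_series(conn):
    """Return (accession, title) pairs for series matching the criteria."""
    return conn.execute(QUERY).fetchall()


def build_toy_db():
    """Create a tiny in-memory stand-in for GEOmetadb (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gse (gse TEXT, title TEXT, summary TEXT);
        CREATE TABLE gpl (gpl TEXT, organism TEXT, manufacturer TEXT);
        CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);
        INSERT INTO gse VALUES
            ('GSE_A', 'Human monocytes stimulated with LPS', 'in vitro study'),
            ('GSE_B', 'Mouse monocyte subsets', 'sorted cells'),
            ('GSE_C', 'Whole blood profiling', 'no purified cells');
        INSERT INTO gpl VALUES
            ('GPL_1', 'Homo sapiens', 'Affymetrix'),
            ('GPL_2', 'Mus musculus', 'Illumina Inc.');
        INSERT INTO gse_gpl VALUES
            ('GSE_A', 'GPL_1'), ('GSE_B', 'GPL_2'), ('GSE_C', 'GPL_1');
        """
    )
    return conn


if __name__ == "__main__":
    for accession, title in find_candidate_series(build_toy_db()):
        print(accession, title)
```

SQLite's `LIKE` is case-insensitive for ASCII text, so the single pattern `%monocyte%` covers both "Monocyte" and "Monocytes". With a real copy of the database, `sqlite3.connect` would be pointed at the downloaded file instead of the toy builder.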
The relevance of each entry returned by this query was assessed individually. This process involved reading through the descriptions and examining the list of available samples and their annotations. In some cases it was also necessary to review the original published report, in which the design of the study and the generation of the dataset are described in more detail.
The datasets cover a broad range of human immunology studies investigating monocyte immunobiology in the context of diseases and through comparison with diverse cell populations and study types, as illustrated by a graphical representation of the relative occurrences of terms in the list of diseases loaded into our tool (Figure 1). A wide range of cell types and diseases are represented. Ultimately, the collection comprised 93 curated datasets. It includes datasets generated from studies profiling primary human CD14+ cells isolated from patients with autoimmune diseases (7), bacterial, viral and parasitic infections (7), cancer (4), cardiovascular diseases (4) and kidney diseases (4), as well as monocytes isolated from healthy subjects (58) (Figure 2). The 58 datasets in which monocytes were isolated from healthy subjects were classified based on whether profiling was conducted ex vivo or following in vitro experiments. In total, 38 datasets were identified in which primary human CD14+ cells were stimulated or infected in in vitro experiments (Figure 2).
Among the many noteworthy datasets, 8 investigate differences between monocyte subsets: classical (CD14++CD16-), intermediate (CD14+CD16+) and non-classical (CD14dimCD16++) monocytes [32][33][34] [GXB: GSE16836, GSE18565, GSE25913, GSE34515, GSE35457, GSE51997, GSE60601, GSE66936]. Another dataset, from Banchereau and colleagues, investigated responses of monocytes and dendritic cells to 13 different vaccines in vitro [36] [GXB: GSE44721]. The datasets that comprise our collection are listed in Table 1 and can be browsed interactively in GXB.

Dataset upload and annotation on GXB
Once a final selection had been made, each dataset was downloaded from GEO in the SOFT file format. It was in turn uploaded to an instance of the Gene Expression Browser (GXB) hosted on the Amazon Web Services cloud. Available sample and study information were also uploaded. Samples were grouped according to possible interpretations of study results, and gene rankings were computed for the different group comparisons (e.g. comparing monocytes isolated from cases vs controls in studies where profiling was performed ex vivo, or stimulated vs medium control in in vitro experiments).
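To make the idea of a ranked gene list for a group comparison concrete, the sketch below orders genes by the absolute log2 fold change between the mean expression of two sample groups. The ranking statistic is an assumption chosen for illustration; the text does not state which statistic GXB uses, and the function name, data layout and pseudocount are hypothetical.

```python
import math
from statistics import mean


def rank_genes(expression, groups, case, control, pseudocount=1.0):
    """Rank genes by absolute log2 fold change of group means.

    expression: {gene: {sample_id: value}} on a linear (non-log) scale
    groups:     {sample_id: group_label}
    Returns a list of (gene, log2_fold_change), largest change first.
    """
    case_ids = [s for s, g in groups.items() if g == case]
    control_ids = [s for s, g in groups.items() if g == control]
    if not case_ids or not control_ids:
        raise ValueError("both groups need at least one sample")

    ranked = []
    for gene, values in expression.items():
        case_mean = mean(values[s] for s in case_ids)
        control_mean = mean(values[s] for s in control_ids)
        # The pseudocount avoids division by zero for unexpressed genes.
        log2_fc = math.log2(
            (case_mean + pseudocount) / (control_mean + pseudocount)
        )
        ranked.append((gene, log2_fc))

    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked
```

Calling `rank_genes(expression, groups, "stimulated", "medium")` would correspond to the stimulated vs medium control comparison mentioned above; a case vs control comparison only changes the two group labels.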

Short Gene Expression Browser tutorial
The GXB software has been described in detail in a recent publication [20]. This custom software interface provides users with a means to easily navigate and filter the dataset collection available at http://monocyte.gxbsidra.org/dm3/landing.gsp. A web tutorial is also available online: http://monocyte.gxbsidra.org/dm3/tutorials.gsp#gxbtut. Briefly, datasets of interest can be quickly identified either by filtering using criteria from pre-defined lists on the left or by entering a query term in the search box at the top of the dataset navigation page. Clicking on one of the studies listed in the dataset navigation page opens a viewer designed to provide interactive browsing and graphic representations of large-scale data in an interpretable format. This interface is designed to present ranked gene lists and display expression results graphically in a context-rich environment. Selecting a gene from the rank ordered list on the left of the data-viewing interface will display its expression values graphically in the screen's central panel. Directly above the graphical display, drop-down menus give users the ability: a) To change how the gene list is ranked; this allows the user to change the method used to rank the genes, or to include only genes that are selected for specific biological interest; b) To change sample grouping (Group Set button); in some datasets a user can switch from groups based on cell type to groups based on disease type, for example; c) To sort individual samples within a group based on associated categorical or continuous variables (e.g. gender or age); d) To toggle between the bar chart view and a box plot view, with expression values represented as a single point for each sample. Samples are split into the same groups whether displayed as a bar chart or box plot; e) To provide a color legend for the sample groups; f) To select categorical information that is to be overlaid at the bottom of the graph. For example, the user can display gender or smoking status in this manner; g) To provide a color legend for the categorical information overlaid at the bottom of the graph; and h) To download the graph as a png image or csv file for performing a separate analysis.
Measurements have no intrinsic utility in the absence of contextual information. It is this contextual information that makes the results of a study or experiment interpretable. It is therefore important to capture, integrate and display information that will give users the ability to interpret data and gain new insights from it. We have organized this information under different tabs directly above the graphical display. The tabs can be hidden to make more room for displaying the data plots, or revealed by clicking on the blue "show info panel" button in the top right corner of the display. Information about the gene selected from the list on the left side of the display is available under the "Gene" tab. Information about the study is available under the "Study" tab. Information about individual samples is provided under the "Sample" tab. Rolling the mouse cursor over a bar chart element while displaying the "Sample" tab lists any clinical, demographic, or laboratory information available for the selected sample. Finally, the "Downloads" tab allows advanced users to retrieve the original dataset for analysis outside this tool. It also provides all available sample annotation data for use alongside the expression data in third party analysis software.
Other functionalities are provided under the "Tools" drop-down menu located in the top right corner of the user interface. Some of the notable functionalities available through this menu include: a) Annotations, which provides access to all the ancillary information about the study, samples and dataset, organized across different tabs; b) Cross-project view, which provides the ability to browse through all available studies for a given gene; c) Copy link, which generates a mini-URL that encapsulates information about the display settings in use and can be saved and shared with others (clicking on the envelope icon on the toolbar inserts the URL in an email message via the local email client); and d) Chart options, which gives users the option to customize chart labels.
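The cross-project view can also be opened directly by URL. The sketch below rebuilds the CD14 link quoted in the Dataset validation section; the parameter names (`probeID`, `geneSymbol`, `geneID`) are read off that single link, so whether the server accepts other probe and gene combinations in the same form is an assumption, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the CD14 cross-project link given in this note.
BASE_URL = "http://monocyte.gxbsidra.org/dm3/geneBrowser/crossProject"


def cross_project_url(probe_id, gene_symbol, gene_id):
    """Build a cross-project link for one probe and gene."""
    query = urlencode(
        {"probeID": probe_id, "geneSymbol": gene_symbol, "geneID": gene_id}
    )
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    # Values for CD14 as they appear in the link quoted in this note.
    print(cross_project_url("201743_at", "CD14", 929))
```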

Dataset validation
Quality control checks were performed by examining the profiles of relevant biological indicators. Known leukocyte markers were used, such as CD14, which is expressed by monocytes and macrophages, as well as markers that would indicate significant contamination of the sample by other leukocyte populations, such as CD3, a T-cell marker; CD19, a B-cell marker; and CD56, an NK cell marker (Figure 3). The expression of the CD14 marker across all studies can be checked using the cross-project functionality of GXB: http://monocyte.gxbsidra.org/dm3/geneBrowser/crossProject?probeID=201743_at&geneSymbol=CD14&geneID=929. In addition, expression of XIST transcripts, which is gender-specific, was examined to determine its concordance with the demographic information provided with the GEO submission.
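The XIST concordance check lends itself to a short sketch. The version below flags samples whose XIST level disagrees with the annotated sex, on the premise that XIST is expected to be high in female and low in male samples. The function name, data layout and the use of a single fixed threshold are assumptions for illustration; a suitable cut-off depends on the platform and normalization, and the text does not describe how the check was implemented.

```python
def find_sex_discordant_samples(xist_expression, annotated_sex, threshold):
    """Return ids of samples whose XIST level contradicts the annotated sex.

    xist_expression: {sample_id: XIST expression value}
    annotated_sex:   {sample_id: "female" or "male"}; other labels are skipped
    threshold:       values at or above it are treated as female-like
    """
    discordant = []
    for sample_id, sex in annotated_sex.items():
        label = sex.strip().lower()
        if label not in ("female", "male") or sample_id not in xist_expression:
            continue  # nothing to compare against
        is_high = xist_expression[sample_id] >= threshold
        predicted = "female" if is_high else "male"
        if predicted != label:
            discordant.append(sample_id)
    return discordant
```

The same pattern could be applied to the purity markers, for example by flagging monocyte samples with low CD14 or high CD3, CD19 or CD56 signal, again with thresholds chosen per dataset.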
3. In the right pie chart of Figure 2, there are twelve datasets studying primary monocytes; however, datasets classified as stimulation, infection, and monocyte subsets may also contain primary in vitro monocytes. Better categorization is needed.
4. Data validation is critical for verifying that a dataset is acceptable for use. The authors mention performing dataset validation but do not report the related results or a summary of their validation. On page 9, the process of assessing contamination by other leukocyte populations using surface markers should be done carefully, as CD14 monocytes do share the surface marker CD4.
5. In Fig. 3, it is unclear whether the orange bar plot is referring to CD4 T cells or CD4 cells in general. They are different cell types.
Competing Interests: No competing interests were disclosed.

We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
comparison with other cell types, or evaluation of variation among healthy individuals.

4. Assessing contamination can indeed be difficult, especially using this type of data where cell-level information is lacking. We plan to explore with our bioinformatics collaborators the development of a "scoring" approach to better quantify potential contamination, but this is not a simple matter to address. At this point we have simply verified for each dataset that expression of markers was consistent with the grouping labels provided by depositors. We have added language in the manuscript to clarify this point.

Thank you for pointing out the typo on this label. This dataset focuses on the genomic profiles of human blood cells: both CD4+ and CD8+ T cells, B cells, NK cells, monocytes and neutrophils.

Figure 1. Thematic composition of the dataset collection. Word frequencies extracted from text descriptions of the studies loaded into the GXB tool are depicted as a word cloud. The size of the words is proportional to their frequency.

Figure 2. Breakdown of the dataset collection by category. The pie chart in the left panel indicates dataset frequencies by disease status. The chart in the right panel indicates the type of studies carried out for the 58 datasets consisting of monocytes obtained exclusively from healthy donors.

Table 1. List of datasets constituting the collection.

Figure 3. Illustrative example showing the abundance levels of CD14 transcripts across samples in a given study. The expression of this gene is indicative of the purity of primary human monocyte preparations; this marker is expected to be high in monocyte preparations and low in other leukocyte populations. In this GXB view, the expression of CD14 can be visualized across the projects listed on the left.
Figure 3 was corrected accordingly, as shown in the new Figure 3.
Competing Interests: No competing interests were disclosed.

Referee Report, 16 March 2016
doi:10.5256/f1000research.8800.r12768
Marc Pellegrini, Division of Infection and Immunity, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
In this short descriptive report the authors put their published Gene Expression Browser tool to work in arranging several thousand transcriptome profiles obtained from public datasets that looked at monocyte immunology. They were able to compare groups of monocytes based on phenotypic attributes and rank gene expression. The authors provide a nice summary of the technique and validation.
Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Research 2016, 5:291. Last updated: 01 AUG 2018