A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, samples were assigned to multiple groupings in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.


Introduction
Oocytes are maternal germ cells that develop in the ovaries during the fetal phase and are retained throughout the female reproductive years for monthly maturation and subsequent ovulation, following the endocrinological regulation associated with menstrual cycles 1 . Oocyte maturation starts with the monthly resumption of the first meiotic process of one primary oocyte arrested in prophase I (characterized by the germinal vesicle, also classified as immature or metaphase I (MI) stage) 1 . After extrusion of the first polar body, the primary oocyte progresses to metaphase II of the second meiosis and becomes the secondary oocyte, which is competent for fertilization by a sperm.
Such oocyte growth/maturation occurs inside the ovarian follicle, which concomitantly undergoes a process called folliculogenesis. Folliculogenesis consists of follicular cell proliferation, development and differentiation 1 . Primordial follicles containing primary oocytes grow into the mature Graafian follicle, with the coordinated progression of the germ cells they hold to secondary oocytes 2 . Ovulation then occurs under the regulation of gonadotropins and sex steroids, resulting in the release of an oocyte into the peritoneal cavity. Upon fertilization by a sperm, the liberated oocyte resumes its second meiotic division to become the zygote, which develops into an embryonic form called the morula through several mitotic divisions and compaction of its component cells. Continued cell division further transforms the morula into the blastocyst, which has a fluid-filled cavity and is ready to implant into the uterine endometrium 3 .
The oocyte in the ovarian follicle is a primary regulator of follicular cell differentiation and function, whereas metabolic cooperation occurs between oocytes and follicular cells to ensure substrate supply necessary for oocyte growth/maturation 4 . The follicular cells consist of two types of cell groups, theca cells (also known as stromal cells) and granulosa cells. Theca cells form the outer layer of the ovarian follicle, while inner granulosa cells make a direct contact with the oocyte. These cells also produce steroid hormones, such as progestins and estrogens, under the control of pituitary gonadotropins, which is important for priming uterine endometrium and other reproductive tissues for supporting expected implantation and pregnancy 5 . During folliculogenesis, granulosa cells continuously proliferate to form the follicular antrum, a fluid-filled cavity formed among the granulosa cell cluster. Upon formation of the antrum, two populations of granulosa cells become identifiable: one cell group known as cumulus cells (CCs), which surround the oocyte and remain associated with it even after ovulation, and the other group called mural granulosa cells, which form an inner layer of the follicle. The oocyte and CCs form the cumulus-oocyte-complex in which these cells directly communicate with each other through the gap-junctions created between them. This cellular communication plays a central role in the regulation of folliculogenesis and oocyte maturation by enabling the nutritional transfer and traffic of macromolecules between them 6 .
In vitro fertilization (IVF) is one type of assisted reproductive technology developed for the treatment of infertility 7 . It is a procedure consisting of (1) harvesting oocytes from the peritoneal cavity of women whose ovulation has been artificially stimulated, (2) fertilization of the oocytes by mixing them with sperm in vitro, and (3) implantation of fertilized oocytes into the uterine cavity. Before implantation, fertilized oocytes are routinely cultured for 2-6 days in a growth medium that allows their cell division and multiplication. Although many improvements have been made to IVF, its live birth rate is still less than 50% even in younger women, and the main challenge remains the risk of multiple pregnancies, which is directly associated with increased incidence of fetal morbidity and infant mortality during maternal, perinatal and neonatal periods 8 . To prevent multiple IVF-associated pregnancies, single-embryo transfer is considered, for which selection of the most viable and healthy embryo is critical. Morphological inspection of embryos is employed for selecting high quality embryos 9,10 , but it is not sufficient to predict the developmental potential of embryos. Therefore, studies have been performed during the last several years to develop better methods of embryo selection by examining proteomics or metabolomics of embryos 11-13 . Recently, the emergence of microarray technology has introduced a new approach to study the genetic aspects of fertility. Studies employing this technique have focused primarily on the role of surrounding follicular cells in evaluating the quality of the oocytes they carry, and have estimated its usefulness by comparing and correlating the data from stromal cells with the quality of embryos and with a positive or negative IVF outcome 14-19 . Such studies also included samples obtained from healthy or diseased women, for example women with polycystic ovary syndrome (PCOS), for whom the IVF success rate is known to be reduced compared with healthy subjects 20 .
To help identify knowledge gaps in the field of IVF, ovarian function and/or the influence of reproductive diseases, we provide here a resource enabling mainstream researchers in this field to browse transcriptomic datasets relevant to the oocyte and surrounding stromal cells obtained from healthy subjects or those with PCOS, in association with IVF outcome. Such a resource offers a unique opportunity to identify the genes that play key roles in oocyte maturation, embryonic development and crosstalk between oocytes and granulosa cells, eventually contributing to the future improvement of the IVF procedure.

Methods
In order to identify datasets relevant to IVF, we developed queries designed to cover conditions such as oocytes, CCs or granulosa cells in humans. The queries were run on NCBI (https://www.ncbi.nlm.nih.gov/) and are as follows:

-Homo sapiens[organism] AND (in vitro fertilization OR in vitro fertilization OR in vitro fecundation) AND ("Expression profiling by array"[gdsType] OR "Expression profiling by high throughput sequencing"[gdsType]).
This query retrieved 85 datasets. After excluding RNA-seq datasets from the collection and carefully examining each dataset, based on its study description and its list of samples and their annotations, to verify its direct relevance to the theme of this data compendium, 23 datasets were selected. Of these, 12 were successfully uploaded into the data browser. Details of these datasets are summarized in Table 1.
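The structure of the query above can be sketched programmatically. The following is a minimal illustration, assuming the search is reproduced through the NCBI E-utilities `esearch` endpoint against the GEO DataSets database (`db=gds`); the text only states that the query was run on NCBI, so the use of E-utilities, the helper names `build_geo_query` and `esearch_url`, and the `retmax` value are assumptions made for this example. The code only assembles the URL and does not contact NCBI, and the number of hits returned today may differ from the count reported above.

```python
from urllib.parse import urlencode

# E-utilities search endpoint; "gds" is the Entrez name of the GEO DataSets database.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_query(organism, terms, dataset_types):
    """Assemble an Entrez query string with the same shape as the query in the text."""
    term_clause = " OR ".join(terms)
    type_clause = " OR ".join(f'"{t}"[gdsType]' for t in dataset_types)
    return f"{organism}[organism] AND ({term_clause}) AND ({type_clause})"


def esearch_url(query, retmax=200):
    """Return the esearch URL for the query (hypothetical helper, no request is sent)."""
    return ESEARCH + "?" + urlencode({"db": "gds", "term": query, "retmax": retmax})


# The repeated search term in the published query is listed once here.
query = build_geo_query(
    "Homo sapiens",
    ["in vitro fertilization", "in vitro fecundation"],
    [
        "Expression profiling by array",
        "Expression profiling by high throughput sequencing",
    ],
)
url = esearch_url(query)
print(query)
print(url)
```

Keeping the search terms and dataset types as lists makes it straightforward to test additional terms and compare the resulting hit counts.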
After curation, each dataset was downloaded from the Gene Expression Omnibus of the National Center for Biotechnology Information website (NCBI GEO) using the SOFT file format, and was then uploaded, along with its study information and samples available, to the Gene Expression Browser, version 1. Blastocysts group vs embryos of poor quality. Finally, computed ranking lists were created based on each grouping, using the rank list option provided in the GXB software. Therefore, GXB provides the users with a means to easily navigate and filter our uploaded and processed dataset collections, which are available at http://ivf.gxbsidra.org/dm3/landing.gsp.
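The idea of a group-based rank list can be illustrated with a small sketch. The text does not describe how the GXB rank list option computes its ordering, so the scoring rule used here (absolute difference between group means) is an assumption chosen for illustration, not the GXB method. The function name, gene labels, sample identifiers and expression values are all hypothetical.

```python
from statistics import mean


def rank_by_group_difference(expression, group_a, group_b):
    """Order genes by |mean(group_a) - mean(group_b)|, largest difference first.

    This is an illustrative scoring rule, not the one implemented in GXB.
    """
    scores = {}
    for gene, values in expression.items():
        mean_a = mean(values[s] for s in group_a)
        mean_b = mean(values[s] for s in group_b)
        scores[gene] = mean_a - mean_b
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Made-up log2 expression values for three placeholder genes and four samples.
expression = {
    "GENE_A": {"s1": 8.0, "s2": 8.2, "s3": 5.0, "s4": 5.2},
    "GENE_B": {"s1": 6.0, "s2": 6.1, "s3": 6.0, "s4": 5.9},
    "GENE_C": {"s1": 3.0, "s2": 3.5, "s3": 4.5, "s4": 5.0},
}

# Two hypothetical sample groups, e.g. good-quality vs poor-quality embryos.
ranked = rank_by_group_difference(expression, ["s1", "s2"], ["s3", "s4"])
print(ranked)
```

Because the ranking depends on the sample grouping, each grouping yields its own rank list, which is why changing the group set in the browser changes the order of genes.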
A web tutorial for GXB is available online: http://ivf.gxbsidra.org/dm3/tutorials.gsp#gxbtut and is briefly reproduced here so that readers can use this article as a standalone resource 21,22 : "datasets of interest can be quickly identified either by filtering criteria from pre-defined lists shown on the left side of the GXB dataset navigation window, or by entering a query term in the search box located at its left top portion. Clicking on one of the studies listed in the dataset navigation window opens a viewer, which is designed to provide interactive browsing and graphic representations of the large-scale data in an interpretable format. This interface is intended to navigate ranked gene lists and displays transcriptomic results graphically in a context-rich environment. Selecting a gene from the rank-ordered list on the left side of the data-viewing window displays its expression values graphically. The drop-down menus directly above the graphical display give the users the following options: a) Change how the gene list is ranked, which allows the user to change the method used to rank the genes, or to include only the genes that are selected based on his/her specific biological interest; b) Change sample grouping (Group Set button), so that in some datasets, a user can switch between groups, based on, for example, the cell types and the diseases of interest; c) Sort individual samples within a group based on the associated categorical or continuous variables (e.g., gender and age); d) Toggle between the histogram and a box-plot plot with expression values, which are demonstrated as a single point for each sample in the graph; e) Paste color legends for sample groups; f) Select categorical information that is to be overlaid at the bottom of the graph. For example, the user can display gender or smoking status using this function; g) Provide a color legend for the categorical information overlaid at the bottom of the graph; h) Download the graph in a jpeg format. Generally, raw data of the measurements per se shown in graphs have no intrinsic utility in the absence of their contextual information. It is therefore important to display such information together with the data shown in the graphs, so that viewers are able to interpret demonstrated data and gain new insights from it. In the datasets provided, the contextual information has been organized under different tabs directly above the graphical display. The tabs can be hidden to make more room for displaying the data plots, or revealed by clicking on the blue "Show Info Panel" button in the top right corner of the display window. Information for the gene, which is selected from the list and is shown in the left side of the display, is available under the "Gene" tab. The study information is also available under the "Study" tab. Further, information on individual samples is provided under the "Sample" tab. Rolling the mouse cursor over a histogram bar while displaying the "Sample" tab enables viewing of any clinical, demographic, or laboratory information provided for the selected sample. Finally, the "Downloads" tab allows advanced users to retrieve the original datasets for their future analysis to be performed outside GXB. It also provides all available sample annotation data together with the expression data."

Dataset validation
Quality checks for the datasets uploaded to GXB were performed by validating the specific expression of the Xist transcript (X-inactive specific transcript), a non-protein-coding RNA that inactivates one of the two X chromosomes present in the diploid cells of female mammals 23,24 . Since all uploaded datasets comprised samples obtained from women, Xist was expected to be present and expressed at high levels in all samples, except in one dataset comprising oocyte transcriptomic data, as haploid oocytes do not undergo X chromosome inactivation. As expected, when microarrays provided probes for Xist, its expression was present in all datasets comprising cumulus or granulosa cells. While Xist expression was absent in the oocyte samples of the GSE12034 dataset, it was highly expressed in the non-ovarian diploid tissue samples of the same dataset. Additional validation of our datasets was performed by examining the expression of some ovarian-specific genes, such as those encoding the zona pellucida proteins (ZP1, ZP2 and ZP3), FIGLA (folliculogenesis-specific basic helix-loop-helix gene, also known as factor in the germline α), which encodes a transcription factor regulating the expression of multiple oocyte-specific genes 25 , and BMP15 (bone morphogenetic protein 15), which functions in folliculogenesis 26 . FIGLA was selectively expressed in oocyte samples in the GSE12034 dataset, but not in non-ovarian control tissues. The same expression pattern was also confirmed for ZP1, ZP2, ZP3, and BMP15.
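The logic of the Xist check can be expressed as a short sketch. The text reports the validation qualitatively and does not state a numeric cut-off, so the threshold, the sample names and the expression values below are hypothetical; the function name `flag_unexpected_xist` is likewise introduced only for this example. The rule encoded is the expectation stated above: Xist signal should be present in cumulus and granulosa cell samples and absent in oocyte samples.

```python
def flag_unexpected_xist(samples, threshold):
    """Return the names of samples whose XIST signal contradicts the expectation.

    Expectation taken from the text: XIST is expressed in diploid female cells
    (cumulus, granulosa, control tissue) and not in oocytes. The threshold that
    separates "expressed" from "absent" is an assumption of this sketch.
    """
    flagged = []
    for name, info in samples.items():
        expressed = info["xist"] >= threshold
        expected = info["cell_type"] != "oocyte"
        if expressed != expected:
            flagged.append(name)
    return flagged


# Made-up signal intensities for three hypothetical samples.
samples = {
    "cumulus_1": {"cell_type": "cumulus", "xist": 950.0},
    "granulosa_1": {"cell_type": "granulosa", "xist": 40.0},
    "oocyte_1": {"cell_type": "oocyte", "xist": 12.0},
}

flagged = flag_unexpected_xist(samples, threshold=100.0)
print(flagged)
```

In this made-up example only the granulosa sample is flagged, since its signal falls below the assumed cut-off although expression is expected; a real check would need a platform-appropriate threshold.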

Data availability
All datasets were cited in our manuscript. They are designated by their GEO accession numbers (e.g. GSE34526), and can also be accessed using this identifier via the NCBI GEO website (https://www.ncbi.nlm.nih.gov/gds/?term=). Users can download all uploaded dataset files and associated sample information through the "Downloads" tab of the GXB tool.

Author contributions
RM and TK conceived the theme of this dataset collection. SB contributed to the loading and curation of datasets. RM prepared the first draft of the manuscript, and TK and DC edited it.

Competing interests
No competing interests were disclosed.

Grant information
This study was supported by the Intramural Grant of the Sidra Medical and Research Center.

However, there are some concerns that are not addressed here.
How were the expression values, as shown in the graphs, obtained from the raw data files? The details of the methodology used to analyse the raw data and to generate the ranked gene lists should be given. In the present form, it is difficult to make use of the data for any meaningful scientific analysis.
The purpose of this study is to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. However, the gene expression data is shown as expression values for some datasets and as Log expression values for the others.
There is a typographical mistake in the spellings of granulosa cells.
The Pubmed articles linked to the data sets are not available.
Although putting together these data is helpful for the analysis of transcriptome data from normal and PCOS patients undergoing IVF, it would be meaningful but not mandatory to include the data available from similar platforms for theca cells.
It is a good effort done by the authors to put together several studies.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Although this manuscript presents a web platform for examining genes involved in PCOS in oocytes, cumulus cells and granulosa cells, some concerns should be considered that are not addressed here.
Bias may be introduced because the methods of RNA isolation, purification and amplification may differ between published papers.
The datasets were validated using cellular specific gene expression but nothing is mentioned about cellular contamination, reference genes (housekeeping genes), ...
Because the data are classified by study, using raw values of a specific gene for each sample, it becomes highly difficult to grasp meaningful information. It would have been helpful to analyze the data further rather than only showing the raw data of each sample.
In conclusion, it seems to be a good platform to run a quick analysis looking at several studies.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

The manuscript by Mackeh et al. presents a very interesting and novel approach to identify genes that are potentially linked to embryonic development. The authors introduce a valuable resource that collects gene expression profiling datasets from oocytes and surrounding stromal cells of healthy subjects or those with polycystic ovary syndrome, in correlation with IVF outcome. This resource is quite beneficial in providing a catalogue of genes that show altered expression in negative IVF outcome.
The transcriptomic datasets are presented in an easy-to-use interactive web application that enables users, including those who are not experts in gene expression profiling, to identify altered gene expression in oocytes and associated cells in normal and diseased situations. Overall, I would give the manuscript in its current form a high priority to be indexed. I have a minor comment and a suggestion.
In page 3, the first paragraph (line 10) of the introduction, the secondary oocyte is also commonly known as 'egg'. I would suggest that both terms are mentioned.
It will be interesting for the authors to check whether adding the term (Egg) in the queries will yield extra datasets.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.