Method Article

Improving prediction of core transcription factors for cell reprogramming and transdifferentiation

[version 1; peer review: 2 not approved]
PUBLISHED 13 Jan 2022

This article is included in the Bioinformatics gateway.

This article is included in the Cell & Molecular Biology gateway.

Abstract

Identification of transcription factors (TFs) that can induce and direct cell conversion remains a challenge. Although several hundred TFs are typically transcribed in each cell type, the identity of a cell is controlled by, and can be established through the ectopic overexpression of, only a small subset of so-called core TFs. Currently, the experimental identification of core TFs for a broad spectrum of cell types remains difficult. Computational solutions to this problem would provide a better understanding of the mechanisms controlling cell identity during natural embryonic or malignant development, and would give a foundation for cell-based therapy. Herein, we propose a computational approach based on over-representation of transcription factor binding sites (TFBS) in differentially accessible chromatin regions that can identify potential core TFs for a variety of primary human cells involved in hematopoiesis. Our approach integrates transcriptomic (single-cell RNA sequencing, scRNA-seq) and epigenomic (single-cell assay for transposase-accessible chromatin, scATAC-seq) data at single-cell resolution to search for core TFs, and can be scaled to predict subsets of core TFs and their role in a given conversion between cells.

Keywords

cell conversion, scRNA-seq, scATAC-seq, transcription factors, epigenetics

Introduction

Cell identity is largely controlled by transcription factors (TFs). TFs regulate gene expression by binding DNA in a sequence-specific manner, targeting short sequences called transcription factor binding sites (TFBS). Although almost half of all TFs are expressed in a particular cell type,21 only a minor share of these TFs, the so-called core TFs, are sufficient to maintain cell identity by defining the corresponding gene expression programs.11,22,23 The identification of core TFs for a large number of cell types would be a valuable addition to an atlas of transcription regulators supplementing the Encyclopedia of Regulatory DNA Elements (ENCODE, Ref. 16). Such an atlas, in turn, would facilitate systematic investigation of regulatory networks and contribute to establishing and refining direct cell conversion protocols for clinically relevant cell types.6,7

Systematic determination of core TFs controlling individual cell type identity has previously been attempted. Initial efforts focused mainly on experimental screening of TFs presumed to regulate the differentially expressed genes (DEGs) between a query cell type and a small number of alternative cell types that could serve as a starting point for conversion. Some of these TFs could act as regulators controlling cellular identity. For example, studies showed that over-expression of MyoD1 in fibroblasts leads to their conversion into muscle cells,19 while inhibition of Oct4 resulted in the suppression of the pluripotent stem cell population during mammalian embryo development.12 More recently, TF over-expression experiments that convert cells into another cell type have been used as a stringent test of the potential of specific TFs to establish and maintain cell identity.11,22,23 Nonetheless, while providing illustrative validation for each TF, such experiments are time-consuming and labor-intensive, and the resulting observations are limited to specific cell types.

The growth of genome-wide sequencing technologies has enabled the development of computational systems capable of predicting candidate core TFs.2,9,14,17 However, although broad in scope and easily scalable, these methods base their predictions predominantly on bulk RNA sequencing (RNA-seq) data, which estimate the average gene expression level across hundreds of thousands to millions of cells. As a result, they are insufficient for the analysis of heterogeneous systems, such as early embryonic populations or complex tissues, including brain or bone marrow.

Here we propose an approach that uses single-cell expression and DNA accessibility data to select core TFs for cell differentiation or directed conversion. A distinct feature of the approach is that it incorporates not only TF expression levels in the original and target cell types, but also (1) the chromatin state of gene regulatory elements and (2) putative TF binding sites. Thus, the method simultaneously takes into account the accessibility and expression profiles of the initial and terminal cell types involved in the conversion. Additionally, our method uses a modified gene set enrichment analysis (GSEA)18 for the selection of core TFs, thus reducing the number of arbitrary thresholds in the pipeline.

Results

To validate our method, we applied it to hematopoietic differentiation datasets,5,1 since this process has been extensively studied. As an example, we report the TFs predicted for the differentiation of hematopoietic stem cells (HSC) into CD4(+) cells (Table 1). Several of the detected TFs have documented roles in HSC-to-CD4(+) differentiation. The top-ranked TF, TCF7, is a transcriptional activator involved in T-lymphocyte differentiation and is necessary for the survival of immature CD4(+) and CD8(+) thymocytes.13,10 The RORA gene plays a crucial role in the regulation of embryonic development, differentiation and immunity.4 TBX21 is a lineage-defining TF that initiates Th1 (CD4(+)) lineage development from naive T helper (CD4(+)) precursor cells.24,10 The LEF1 TF has a high affinity for a functionally important site in the T-cell receptor-alpha enhancer, and its presence in this region increases the activity of the enhancer.3

Table 1. A predicted list of transcription factors for HSC to CD4(+) lymphocytes differentiation.

HGNC gene    GSEA p-val      GSEA q-val
TCF7         4.36×10^-33     1.50×10^-31
RORA         6.98×10^-31     2.17×10^-29
NR1D1        1.83×10^-29     5.29×10^-28
TBX21        1.57×10^-8      6.20×10^-8
LEF1         2.53×10^-7      8.24×10^-7

Methods

The proposed approach (Figure 1) consists of the following steps. First, for two given cell types involved in cell differentiation or conversion pathways, the minimal spanning tree (MST) is reconstructed based on the open chromatin in regulatory regions (Figure 2, Figure 3). Then, a differential accessibility analysis (DAA) between initial and final cell types is performed to retrieve a list of genomic regions (ATAC-seq peaks) ranked by the statistical significance of a change in chromatin accessibility for a given cell conversion (Figures 4, 5). Next, the sequences corresponding to each of the ranked regions undergo the functional annotation with TFBS. Finally, TFs ranking is inferred by GSEA,18 which was adjusted to estimate the tendency of TFBS for given TF under investigation to be over-represented at the most statistically significant genomic regions for a given cell differentiation or conversion.

Figure 1. Schematic overview of the proposed approach within the typical pipeline of TFs selection.

Figure 2. The Minimal Spanning Tree (MST), reconstructed on scATAC-seq data (GSE74912) for the system of the 8 hematopoietic cell types.

HSC, hematopoietic stem cells; MPP, multipotent progenitor; LMPP, lymphoid-primed multipotent progenitor; CLP, common lymphoid progenitor; NK, natural killer cells.

Figure 3. UMAP clustering of (A) scATAC-seq and (B) scRNA-seq of the 13 primary hematopoietic cell types (GSE74912).

Figure 4. Heatmap of the most differentially accessible scATAC-seq regions between hematopoietic stem cells (HSC) and CD4(+) T helper cells (CD4Tcell) (GSE74912).

Figure 5. Heatmap of the most differentially expressed genes from scRNA-seq data between hematopoietic stem cells (HSC) and CD4(+) T helper cells (CD4Tcell) (GSE74912).

Reconstruction of cell trajectories with scATAC-seq data

scATAC-seq data (GEO: GSE96769, GSE111586) were used to reconstruct the minimal spanning tree (MST) of hematopoietic cell types. The hierarchy of the tree was aligned along pseudo-time, reflecting the degree of pluripotency of the cells observed in the scATAC-seq dataset.15 The resulting MST thus represents a collection of possible cell trajectories among the analyzed cell types.
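To make the idea of a minimal spanning tree over cell types concrete, the following minimal sketch builds an MST with Prim's algorithm from Euclidean distances between cell-type centroids. It is an illustration only and not the authors' implementation: the function name `minimum_spanning_tree`, the use of centroids, the Euclidean metric, and the two-dimensional toy coordinates are all assumptions introduced here.

```python
import math


def minimum_spanning_tree(centroids):
    """Prim's algorithm on Euclidean distances between cell-type centroids.

    centroids: dict mapping a cell-type name to a list of floats
    (a hypothetical low-dimensional accessibility profile).
    Returns a list of (type_a, type_b, distance) edges.
    """
    names = sorted(centroids)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = None
        # Find the shortest edge that connects the tree to a new cell type.
        for a in sorted(in_tree):
            for b in names:
                if b in in_tree:
                    continue
                d = math.dist(centroids[a], centroids[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Toy coordinates, invented for illustration (not derived from GEO data).
toy_centroids = {
    "HSC": [0.0, 0.0],
    "MPP": [1.0, 0.0],
    "LMPP": [2.0, 0.5],
    "CLP": [3.0, 1.0],
    "CD4": [4.0, 2.0],
}
mst_edges = minimum_spanning_tree(toy_centroids)
for a, b, d in mst_edges:
    print(f"{a} - {b}: {d:.2f}")
```

With these toy coordinates the tree is a single chain through the five cell types; on real data the branching structure would depend on the chosen embedding and distance.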

Differential accessibility analysis

Similarly to DEG analysis,20 a differential accessibility analysis (DAA) of genomic regions was performed between two given cell types on the cell trajectory using Slingshot v2.3 (https://www.bioconductor.org/packages/devel/bioc/manuals/slingshot/man/slingshot.pdf). Accordingly, for each cell population on the MST, a subset of regions ranked by p-value can be obtained that discriminates the given cell population from the others.
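As a rough illustration of ranking regions by the significance of an accessibility change, the sketch below applies a pooled two-proportion z-test to the number of cells in which each peak is open. This is a stand-in statistic chosen for simplicity, not the test used by the tools named in the text; the function names, the binarized open/closed representation, and the toy counts are assumptions.

```python
import math


def two_proportion_p(open_a, n_a, open_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (open_a + open_b) / (n_a + n_b)
    var = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if var == 0:
        return 1.0  # peak open in all cells or in none: no evidence of change
    z = (open_a / n_a - open_b / n_b) / math.sqrt(var)
    # Normal approximation: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def rank_peaks(counts, n_initial, n_final):
    """Rank peaks by ascending p-value.

    counts: dict mapping peak -> (cells open in initial type, cells open in final type).
    """
    scored = [
        (peak, two_proportion_p(a, n_initial, b, n_final))
        for peak, (a, b) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy counts, invented for illustration.
toy_counts = {
    "chr1:1000-1500": (5, 90),
    "chr2:2000-2500": (50, 52),
    "chr3:3000-3500": (30, 60),
}
ranked_peaks = rank_peaks(toy_counts, n_initial=100, n_final=100)
for peak, p in ranked_peaks:
    print(peak, f"{p:.3g}")
```

The output of such a step is the ranked region list that the later annotation and enrichment steps consume.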

TFs filtration and TFBS annotation

We excluded from the downstream analysis TFs that either had a near-zero median expression (below the 5th percentile) in the final cell type or had a higher expression in the original cell type, based on scRNA-seq data (GEO: GSE74912). Thus, only TFs preferentially expressed in the final cell population were considered.
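The expression-based filter can be sketched as follows. The sketch assumes per-cell expression values for each TF in the initial and final cell types and takes the low-expression cutoff as an explicit parameter, `min_median`, because the text does not state which distribution the 5th percentile refers to. The function name, TF names, and values are hypothetical.

```python
from statistics import median


def filter_tfs(initial, final, min_median):
    """Keep TFs that are expressed in the final cell type and not higher initially.

    initial, final: dict mapping TF name -> list of per-cell expression values.
    min_median: cutoff standing in for the 5th-percentile threshold (assumption).
    """
    kept = []
    for tf in sorted(final):
        med_final = median(final[tf])
        # A TF absent from the initial table is treated as unexpressed there.
        med_initial = median(initial.get(tf, [0.0]))
        if med_final < min_median:
            continue  # near-zero expression in the final cell type
        if med_initial > med_final:
            continue  # higher expression in the initial cell type
        kept.append(tf)
    return kept


# Toy expression values, invented for illustration.
toy_initial = {"TF_A": [0.0, 0.1, 0.2], "TF_B": [3.0, 2.5, 4.0], "TF_C": [0.0, 0.0, 0.0]}
toy_final = {"TF_A": [2.0, 3.0, 2.5], "TF_B": [0.5, 0.4, 0.6], "TF_C": [0.0, 0.0, 0.1]}
kept_tfs = filter_tfs(toy_initial, toy_final, min_median=0.2)
print(kept_tfs)
```

Here only `TF_A` passes: `TF_B` is higher in the initial cell type and `TF_C` is essentially unexpressed in the final one.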

Genomic regions (scATAC-seq peaks from GSE74912) were ranked by the significance of the DAA performed with Monocle2 (p-value < 0.01) and then annotated with TFBS using position weight matrices (PWM, p-value < 0.0001) from the HOCOMOCO database.8
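Annotating a region with TFBS amounts to sliding a PWM along its sequence and recording the windows that score above a threshold. The sketch below shows this for the forward strand only, with a raw score cutoff in place of the p-value threshold mentioned in the text (converting scores to p-values requires the score distribution of each matrix, which is omitted here). The three-column matrix is a made-up example, not a HOCOMOCO model, and the function name is hypothetical.

```python
def scan_pwm(sequence, pwm, min_score):
    """Return (position, score) for every forward-strand window scoring >= min_score.

    sequence: string over the alphabet A, C, G, T.
    pwm: list of dicts, one per motif position, mapping base -> log-odds weight.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The window score is the sum of the per-position weights.
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= min_score:
            hits.append((start, score))
    return hits


# Toy matrix preferring the motif "ACG", invented for illustration.
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan_pwm("TTACGTACGA", toy_pwm, min_score=3.0)
print(hits)
```

A fuller version would also scan the reverse complement and handle ambiguous bases such as N, which this sketch does not.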

TFs ranking via GSEA-like enrichment analysis

GSEA18 was modified to perform the TF ranking according to their significance for a given cell conversion.

Since sequence preferences, and therefore the number of TFBS, differ between TFs, TF annotations are distributed highly unevenly across the region ranking. GSEA was therefore used here to quantify how strongly the TFBS of each TF are concentrated at the top of the region ranking for a given conversion.

Consequently, for GSEA, the TFBS-annotated ranking of genomic regions was used as the pre-ranked list, and each individual TF was treated as a signature gene set. The final TF ranking obtained from GSEA thus represents the significance of individual TFs for cell differentiation or conversion.
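The enrichment step can be illustrated with an unweighted GSEA-style running-sum statistic computed over a ranked list of regions, where the "set" for a TF is the regions that carry its binding sites. This is a simplified sketch and not the authors' modified GSEA: the unweighted step sizes, the one-sided permutation p-value, and all names and toy data are assumptions.

```python
import random


def enrichment_score(ranked_regions, hit_regions):
    """Unweighted running-sum statistic (Kolmogorov-Smirnov-like).

    Walks down the ranking, stepping up at regions that contain the TF's
    binding sites and down otherwise; returns the largest deviation from zero.
    """
    hits = set(hit_regions) & set(ranked_regions)
    n, n_hit = len(ranked_regions), len(hits)
    if n_hit == 0 or n_hit == n:
        return 0.0
    step_hit, step_miss = 1.0 / n_hit, 1.0 / (n - n_hit)
    running, best = 0.0, 0.0
    for region in ranked_regions:
        running += step_hit if region in hits else -step_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_regions, hit_regions, n_perm=1000, seed=0):
    """One-sided p-value for enrichment at the top, by shuffling the ranking."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_regions, hit_regions)
    shuffled = list(ranked_regions)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, hit_regions) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)


# Toy ranking (most significant region first) and toy TFBS annotations.
ranking = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
tfbs = {"TF_A": {"r1", "r2", "r4"}, "TF_B": {"r6", "r7", "r8"}}
scores = {tf: enrichment_score(ranking, regions) for tf, regions in tfbs.items()}
p_a = permutation_p_value(ranking, tfbs["TF_A"])
p_b = permutation_p_value(ranking, tfbs["TF_B"])
tf_order = sorted(scores, key=scores.get, reverse=True)
print(tf_order, scores)
```

In this toy case `TF_A`, whose sites sit near the top of the ranking, receives a positive score and ranks first, while `TF_B`, whose sites sit at the bottom, receives a negative score.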

Discussion

The proposed pipeline uses both transcriptomic and epigenomic data at single-cell resolution to search for core TFs that enable cell differentiation and conversion within the human hematopoietic system. The TF rankings obtained (Table 1) suggest that the current approach is capable of predicting subsets of core TFs as well as reflecting their importance for cell differentiation and conversion between cells.

Conclusions

Herein, we described a method for integrating single-cell chromatin accessibility and gene expression data that can successfully select core TFs for cell differentiation and conversion in silico.

Data availability

Underlying data

Gene Expression Omnibus: A Single-Cell Atlas of in vivo Mammalian Chromatin Accessibility, https://identifiers.org/geo:GSE111586

Gene Expression Omnibus: Single-cell epigenomics maps the continuous regulatory landscape of human hematopoietic differentiation [scATAC-Seq], https://identifiers.org/geo:GSE96769

Gene Expression Omnibus: ATAC-seq data, https://identifiers.org/geo:GSE74912

Extended data

Analysis code available from: https://github.com/annykay/transFactorsPrediction

Archived analysis code as at time of publication: https://doi.org/10.5281/zenodo.5799254

License: MIT

Competing interests

No competing interests were disclosed.

Grant information

The study was supported by the Ministry of Science and Higher Education of the Russian Federation (agreement no. 075-15-2020-899).

how to cite this article
Raevskiy M, Kondrashina A and Medvedeva Y. Improving prediction of core transcription factors for cell reprogramming and transdifferentiation [version 1; peer review: 2 not approved]. F1000Research 2022, 11:38 (https://doi.org/10.12688/f1000research.75321.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 31 Jan 2022
Erdem B. Dashinimaev, Koltzov Institute of Developmental Biology, Russian Academy of Sciences, Moscow, Russian Federation;  Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russian Federation 
Not Approved
Major comments:
  • On the whole, the article is unclear. I have no doubt that some thoughtful and valuable work has been done in which interesting data have been obtained, but in this manner, the entire work ...
HOW TO CITE THIS REPORT
Dashinimaev EB. Reviewer Report For: Improving prediction of core transcription factors for cell reprogramming and transdifferentiation [version 1; peer review: 2 not approved]. F1000Research 2022, 11:38 (https://doi.org/10.5256/f1000research.79178.r119675)
Reviewer Report 27 Jan 2022
Valentina Boeva, Department of Computer Science, Institute for Machine Learning, ETH Zurich, Zurich, Switzerland
Samuel Gunz, Department of Computer Science, Institute for Machine Learning, ETH Zurich, Zurich, Switzerland 
Not Approved
Rationale:

The rationale of developing this method is clearly stated. The authors identify experimental work to be the bottleneck in identifying core transcription factors (TFs) and therefore suggest a computational approach to solve this challenge. However, the ...
HOW TO CITE THIS REPORT
Boeva V and Gunz S. Reviewer Report For: Improving prediction of core transcription factors for cell reprogramming and transdifferentiation [version 1; peer review: 2 not approved]. F1000Research 2022, 11:38 (https://doi.org/10.5256/f1000research.79178.r119678)