ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article

Space-log: a novel approach to inferring gene-gene net-works using SPACE model with log penalty

[version 1; peer review: 3 approved with reservations]
PUBLISHED 21 Sep 2020
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Cell & Molecular Biology gateway.

This article is included in the Bioinformatics gateway.

This article is included in the RPackage gateway.

Abstract

Gene expression data have been used to infer gene-gene networks (GGN) where an edge between two genes implies the conditional dependence of these two genes given all the other genes. Such gene-gene networks are of-ten referred to as gene regulatory networks since it may reveal expression regulation. Most of existing methods for identifying GGN employ penalized regression with L1 (lasso), L2 (ridge), or elastic net penalty, which spans the range of L1 to L2 penalty. However, for high dimensional gene expression data, a penalty that spans the range of L0 and L1 penalty, such as the log penalty, is often needed for variable selection consistency. Thus, we develop a novel method that em-ploys log penalty within the framework of an earlier network identification method space (Sparse PArtial Correlation Estimation), and implement it into a R package space-log. We show that the space-log is computationally efficient (source code implemented in C), and has good performance comparing with other methods, particularly for networks with hubs.Space-log is open source and available at GitHub, https://github.com/wuqian77/SpaceLog

Keywords

Gene-gene network, gene regulation, penalized regression, log penalty, partial correlation, R package, algorithm

Introduction

The objective of this paper is to introduce a novel method that constructs gene-gene network (GGN) based on high dimensional gene expression data. Popular methods for GGN include neighborhood selection1, graphical Lasso2, and space (Sparse PArtial Correlation Estimation)3. Neighborhood selection consistently estimates the non-zero entries of the partial correlation matrix, and provide an approximation of the maximum likelihood estimate of partial correlation matrix. Graphical Lasso improves on neighborhood selection by providing a maximum likelihood estimate of the partial correlation matrix. The space method exploits the symmetry of partial correlation matrix to improve the estimation accuracy. It also avoids potential conflicts in neighborhood selection, that is, Yi is selected as a neighbor of Yj but Yj is not selected as a neighbor of Yi, and one has to make a post-hoc decision for whether Yi and Yj are connected. Furthermore, those available methods employ L1, L2 or elastic net penalty. However, penalties in the range of L0 to L1 is often needed to improve the accuracy of variable selection for high-dimensional gene expression data4. In this paper, we propose a new statistical method to estimate GGN by implementing the log penalty for the space approach, and we refer to our method as space-log.

Methods

Suppose that we have data on n independent individuals and m genes. Assume the expression of m genes, after appropriate normalization, follow a multivariate Gaussian distribution N(0, ∑).

Neighborhood selection using lasso or log penalty: NS-lasso, NS-log

The neighborhood selection (NS) approach considers each gene separately. Let Yi be the gene expression value for the ith gene and Yi = (Y1, ..., Yi−1, Yi+1, Ym)T . For the NS approach, Yi is regressed on Yi by a penalized regression:

β^=argmin{12(YiYiβ)T(YiYiβ)+nijp(|βi,j|;ω)}(1)

with penalty function p(|β|;ω). We will compare NS-lasso with lasso penalty p(|β|;λ)=λ|β|5 and NS-log with log penalty p(|β|;λ,τ)=λlog(|β|+τ)6,7. Source codes of NS-lasso and NS-log are available at https://github.com/Sun-lab/penalized_estimation/.

Joint modeling space using lasso penalty: space-lasso

The joint modeling approach space3 is to estimate GGN, without the need to fit many (m) single gene regression models separately, but directly estimate partial correlation among all the genes. Denote the partial correlation between Yi and Yj by ρi,j. If we know the concentration matrix ∑−1 = (σij)m×m, then ρi,j=σijσiiσjj. So we can easily get that ρi,j=sign(βi,j)βijβji. Thus, the problem is translated into partial correlation matrix estimation. Specifically3, proposed to minimize a penalized loss function

Ln(β,σ,Y)=12i=1mwiYijiβijYj2+ijp(|ρij|;λ)(2)

=12i=1mwiYijiρi,jσjjσiiYj2+ijp(|ρi,j|;λ),(3)

where wi ≥ 0 is the weight, e.g., uniform weights wi = 1 for space-no, residual variance based weights wi = σjj for space-res, and degree based weights wi = number of genes that {j : ρi,j ≠ 0, ji} for space-df. In 3, p(|ρ|; λ) = λ|ρ| and we call it as space-lasso.

New algorithm space-log: joint modeling space using log penalty

Inspired by 6 and 7, we extended the space approach with log penalty as space-log p(|ρ|; λ, τ) = λlog(|ρ| + τ) and used the active shooting algorithm3 to update the coefficient estimates iteratively in space-log (Extended data). We determined the tuning parameters by using extended BIC (extBIC)8.

Denote the target loss function as

f(ρ;σ)=12i=1mwiYijiρi,jσjjσiiYj2+ijp(|ρi,j|;τ,λ)=12i=1mwiYijiρi,jσjjσiiYj2+λijlog(|ρi,j|+τ)(4)

The goal is to estimate ρ^ = argminρf(ρ) for a given λ and τ. We implement the penalized estimation using space and Log penalties by Local Linear Approximation (LLA)9.

p(|ρi,j|;λ,τ)p(|ρ^i,j(k)|;λ,τ)+p(|ρ^i,j(k)|;λ,τ)(|ρi,j||ρ^i,j(k)|)(5)

Where |ρ^i,j(k)| is the estimate of regression coefficient ρi,j at the k-th iteration. After applying LLA for the penalty part, we can minimize loss function at the (k + 1)-th step, while solving for ρi,j by

f(k+1)(ρij)=12i=1mwiYiρi,jσjjσiiYjli;ljρ^i,l(k)σ^(k)llσ^(k)iiYl2+ijp(|ρ^i,j(k)|;λ,τ)|ρi,j|(6)

By letting f(k+1)(ρi,j)/ρi,j=0, we can find the solution for ρi,j as follows:

ρ^i,j(k+1)={0if|zj(k)|vj1p(|ρ^i,j(k)|;λ,τ)sgn(ρ^i,j(k))[|zj(k)|vj1p(|ρ^i,j(k)|;λ,τ)]if|zj(k)|>vj1p(|ρ^i,j(k)|;λ,τ)(7)

where zi,j(k)=Yj(Yili;ljρ^i,l(k)σ^(k)llσ^(k)iiYl)/vj,vj=YjTYj, and p(|ρ^i,j(k)|;λ,τ)=sgn(ρ^i,j(k))λ|ρ^i,j(k)|+τ.

Active-shooting

We adapted the same idea active-shooting algorithm from3 to update the coefficient estimation iteratively in space-log. Without loss of generality, we kept most notation from3 but tailored with space-log. The details are included in the Extended data.

Simulation studies

In this section, we present Monte Carlo simulation to evaluate the performance of the space-log, space-lasso, NS-log, and NS-lasso. Following7, we studied two types of graphs: the traditional random graphs (ER model) where all the genes have the same expected number of neighbors10,11, and hubs graphs where a few genes may have a large number of neighbors (BA model), and BA model is more frequently observed in gene networks12.

We simulated GGN of m genes under both the BA and ER models, respectively. The initial graph had one gene and no edge. In the (k+1)th step, we added e edges between a new gene and e old genes. Under the BA model, there is a greater probability for the new gene to connect to an existing hub gene that has larger number of edges with the probability pE=vi(t)/jvj(t), where vj(t) number of edges connected with the ith gene at the tth step. For the ER model, each edge of any gene pair (Gi,Gj) was added randomly in the GGN with probability pE independent from all other edges. After constructing the bone of GGN, we simulated gene expression based on multivariate Gaussian. Without loss of generality, we simulated data sets with n = 400 individuals. As shown in Table 1, we considered different number of genes m = 100, 200, 300 with various sparsity level determined by pE = 1=m or 2=m for the ER model and e = 1 or e = 2 for the BA model.

We evaluated the performance of the methods by the following metrics: number of false positives (FP), false negatives (FN), FP+FN, F1 score, FDR, true positive rate (power). Note that there are three different weights used in joint modeling setting (space-log, space-lasso): (1) uniform weights; (2) residual variance based weights; and (3) degree freedom based weights. The corresponding methods are referred to as sp_no, sp_res, and sp_df with/without log respectively.

Under the BA model with m=100 and e=1 (Figure 1), we can see that space-log has smallest Errors (FP+FN), smallest FDR, and highest F1 score than other approaches, indicating that space-log controls overall false positive and false negative rates well. Under the ER model (Figure 2) with m=100 and e=1, space-log is slightly better than space, and NS-log shows lower Errors and higher F1 score than other approaches including space-log. Under both models, the log penalty has less false positives but slightly more false negatives compared to lasso penalty. We note that although log penalty performs well for both the ER and BA models, space-log is particularly powerful in identifying hub networks (such as BA models).

In the Extended data, Figures S3 and S4 show the results under the BA model for m=100,200,300 with low number of connections (e=1) and high number of connections (e=2), respectively. Figures S5 and S6 show the results under the ER model with low and high numbers of connections, respectively. Comparing with Figure 1, a similar pattern was noted with the increase of number of genes (m increases from 100, 200, to 300). In BA with low connections (Figure S3), space-log showed smallest FP+FN error and largest F1 score, which outperform all other methods. In BA with high connections (Figure S4), NS-log showed smallest FP+FN error and largest F1 score. For ER model with low and high connections, NS-log outperforms other methods in terms of FP+FN and F1 scores. It’s in line with our understanding that space-log is powerful at identifying hubs network, and NS-log is powerful at dealing with complex network with high number of gene-gene interactions and random networks.

We showed a simulated graph for the BA model with 400 subjects, 100 genes and each gene has only 1 connection (Figure 3). The GGN was estimated by 8 different approaches. In Figure 3, the true edges were indicated by black color, false positive (FP) edges by red color, and false negative (FN) edges by grey color. It’s clear that space-log identified far fewer false positive edges (red line) comparing with space-lasso and NS approaches, while clearly indicating the hub structures. We observed that the FP edges by two NS approaches were quite randomly identified, and the FP edges by two space approaches were mostly within a hub and not between hubs. In summary, the log penalty generally has better performance than the lasso penalty, and both space-log and NS-log control false positive and false negative rate well. For random networks, i.e., no hub, NS-log performs better than other methods. space-log performs best for hub-like gene networks (Figure 1 and Figure 3) with higher F1 score and less false positive edges. Identifying hub networks is generally considered of great interest in the GGN analysis, because a few of hubs connecting with a large proportion of genes, and those hub genes are thought to be master regulators and play a critical role in a biological system13.

Table 1. Simulation settings.

m n pE (ER) e (BA)
4001001/100, 2/1001,2
4002001/200, 2/2001,2
4003001/300, 2/3001,2
498f1a18-e082-410d-8671-aba20b00ded1_figure1.gif

Figure 1. BA with 100 genes and e=1.

498f1a18-e082-410d-8671-aba20b00ded1_figure2.gif

Figure 2. ER with 100 genes and e=1.

498f1a18-e082-410d-8671-aba20b00ded1_figure3.gif

Figure 3. A simulated graph for the BA model with 100 genes (e=1) and multiple hubs.

A total of 400 subjects were generated. The GGN was estimated by 8 different approaches. Black is true edges, red is false positive edges, and grey is false negative edges.

498f1a18-e082-410d-8671-aba20b00ded1_figure4.gif

Figure 4. TCGA data analysis with ALL 189 Gene Sets.

498f1a18-e082-410d-8671-aba20b00ded1_figure5.gif

Figure 5. TCGA data analysis with BA hub-type Gene Sets.

498f1a18-e082-410d-8671-aba20b00ded1_figure6.gif

Figure 6. GTEx data analysis with ALL 189 Gene Sets.

498f1a18-e082-410d-8671-aba20b00ded1_figure7.gif

Figure 7. GTEx data analysis with BA hub-type Gene Sets.

Application to GTEx and TCGA data

TCGA data

We applied both proposed space-log and existing methods (space-lasso, NS-log, and NS-lasso) to identify GGN using RNA-seq data from tumor tissue of 550 TCGA (The Cancer Genome Atlas) Colon Adenocarcinoma (TCGA-COAD) cancer patients14. The preprocessing steps of RNA-seq data included: (1) transforming the expression of each gene by log(total read count) = logTReC (2) removing the confounding effects by taking residuals of logTReC from a linear regression with the following covariates: 75% of logTReC per sample (which captures read depth), plate, institution, age, and six PCs from the corresponding germline genotype data. After removing genes with low expression across most samples, we had 18,238 genes and 450 samples.

We considered gene sets C6 curated oncogenic pathways by MSigDB from the Broad Institute and inferred the GGN within each gene set. There were 189 gene pathways with a total of 8,737 unique genes for which TCGA have expression data. The sizes of gene sets ranged from 9 to 338 genes. Since we don’t know the true GGN, we downloaded the common pathway version 10 from www.pathwaycommons.org to provide a partial "gold standard". The observed GGN by different methods were compared with the known edges from common pathway and calculated FP, FN, FP+FN, number of total discovery, F1 score, and true positive rate (Extended data: Figure S10). The NS-based approach with both LASSO and log penalty discovered much more edges than space-based approach and space-log had fewer false positive (fewer FN+FP too) than space-lasso. There is almost no difference on number of false negative between different methods, as well as F1 score (Figure 4). Furthermore, in order to show the performance of these methods on the hub networks, we identified 17 pathways with hub-like genes (each hub gene set has < 50 genes and variance of the number of identified edges for each gene in the gene set > the first quartile of all 189 gene sets) and re-calculated the summary metrics in Figure 5. We noted that space-log approach has smallest Errors and slightly higher F1 than other approaches, which is in line with our finding in simulation that space-log is powerful in identifying hub networks, (such as BA models).

GTEx data

The Genotype Tissue Expression (GTEx) project15 aims to study tissue-specific gene expression and regulation in normal individuals. In this paper, we used gene expression data (RNA-seq) from blood tissue of 451 patients to identify GGN. We pre-processed gene expression data using the same procedure as for TCGA data. We mapped genes to gene pathways by MSigDB (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). A total of 189 gene pathways were represented with a total of 8097 unique genes. The size of gene sets ranged from 8 to 306 genes.

Again, we applied space-log, space-lasso, NS-log, and NS-lasso approaches to identify GGN. Using the same common pathway file used for the TCGA analysis as gold standard, we calculated FP, FN, FP+FN, # of discovery, F1, TPR (Figure 6). We obtained very similar results to the TCGA data. The NS-based approach with both LASSO and log penalty discovered much more edges than the space-based approach and space-log has fewer false positive (fewer FN+FP too) than space-lasso. There is almost no difference in the number of false negative between different methods, as well as F1 score. A similar sensitivity analysis was conducted to a subset of hub-type genes (Figure 7), where 30 pathways were selected to be in the first quartile of the variance of the number of identified genes with < 50. It also showed the space-log approach has smallest Errors (F1 is similar to other approaches).

Conclusions

In this paper, we proposed a new joint modeling method with log penalty, space-log, to identify gene-gene network. An assumption of the GGN analysis is that most of gene pairs do not directly interact with each other, and there are a few of master genes (hubs) for network that connect with many other genes, which are thought to play a critical role in a biological system13,16. Both simulation and real data analyses showed that space-log is particularly powerful in identifying hub networks and master genes, which is considered of great interest in gene-gene network analysis. In the Extended data, we compared several tuning parameter selection approaches, such as BIC Zou et al.17, extBIC8,18, and oracle4, and showed that extBIC outperforms other methods in simulation. The R package "SpaceLog" on GitHub includes algorithms, simulation, and real data examples: https://github.com/wuqian77/SpaceLog.

Data availability

Underlying data

Simulation data. We used barabasi.game function from igraph R package to generate the skeleton of a BA model.

Source code, simulated data, and plots: https://github.com/wuqian77/SpaceLog/tree/master/Simulation.

TCGA data. The RNA-seq dataset from tumor tissue of 550 TCGA (The Cancer Genome Atlas) colon adenocarcinoma (TCGA-COAD) cancer patients14 can be downloaded from dbGap phs000178.v1.p1.: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga.

Pre-processing and data analysis source code: https://github.com/wuqian77/SpaceLog/tree/ master/Analysis/TCGA.

GTex data. The RNA-seq dataset of blood tissue from the Genotype Tissue Expression (GTEx) project15 can be downloaded from dbGap Genotype-Tissue Expression Project and the study accession is phs000424.v7.p2.: https://www.gtexportal.org/home/datasets.

Pre-processing and data analysis source code: https://github.com/wuqian77/SpaceLog/tree/master/Analysis/GTex.

Extended data

Zenodo: SpaceLog: First release of spacelog, http://doi.org/10.5281/zenodo.400293119.

This project contains the following extended data:

  • the detailed algorithm for active shooting;

  • simulation and figures on comparing methods to choose tuning parameters;

  • simulation and figures on comparing different GGN methods under various scenarios.

License: GPL-3

Software availability

Source code for space-log available from: https://github.com/wuqian77/SpaceLog

Archived source code as at time of publication: http://doi.org/10.5281/zenodo.400293119.

License: GPL-3

Source code for NS-log and NS-lasso available from: https://github.com/Sun-lab/penalized_estimation

License: GPL-3

Existing methods space-lasso is available on R CRAN: https://cran.r-project.org/web/packages/space/index.html.

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 21 Sep 2020
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Wu Q(, Sun W and Hsu L. Space-log: a novel approach to inferring gene-gene net-works using SPACE model with log penalty [version 1; peer review: 3 approved with reservations]. F1000Research 2020, 9:1159 (https://doi.org/10.12688/f1000research.26128.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 21 Sep 2020
Views
20
Cite
Reviewer Report 21 Jun 2021
Jessie Jeng, Department of Statistics, North Carolina State University, Raleigh, NC, USA 
Approved with Reservations
VIEWS 20
This paper considers the problem of identifying gene regulatory networks based on high-dimensional gene expression data. A new method, space-log, is introduced to perform penalized regression with a log penalty. The new method is compared with several existing methods in ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Jeng J. Reviewer Report For: Space-log: a novel approach to inferring gene-gene net-works using SPACE model with log penalty [version 1; peer review: 3 approved with reservations]. F1000Research 2020, 9:1159 (https://doi.org/10.5256/f1000research.28834.r86957)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 05 Jan 2022
    Qian Wu, Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, 98109, USA
    05 Jan 2022
    Author Response
    1. I think it would be important to explain why the proposed method has advantages in identifying networks with hubs. I would imagine that a network with hubs tends to have ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 05 Jan 2022
    Qian Wu, Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, 98109, USA
    05 Jan 2022
    Author Response
    1. I think it would be important to explain why the proposed method has advantages in identifying networks with hubs. I would imagine that a network with hubs tends to have ... Continue reading
Views
19
Cite
Reviewer Report 19 Oct 2020
Chi Song, College of Public Health, Division of Biostatistics, The Ohio State University, Columbus, OH, USA 
Approved with Reservations
VIEWS 19
In this paper, the authors developed “space-log”, a new method to infer gene-gene network (GGN), by incorporating log penalty into the “space” method. The authors compared their method to the original “space” with LASSO penalty and the neighborhood selection (NS) ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Song C. Reviewer Report For: Space-log: a novel approach to inferring gene-gene net-works using SPACE model with log penalty [version 1; peer review: 3 approved with reservations]. F1000Research 2020, 9:1159 (https://doi.org/10.5256/f1000research.28834.r71864)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
25
Cite
Reviewer Report 14 Oct 2020
Yuying Xie, Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, USA 
Yuning Hao, Amazon, Seattle, WA, USA 
Approved with Reservations
VIEWS 25
The paper introduces an extension of the SPACE (Sparse Partial Correlation Estimation) method, which is used to estimate a gene-gene network (GNN). The proposed framework, space-log, relies on the log penalty, which delivers better variable selection performance than LASSO, especially ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Xie Y and Hao Y. Reviewer Report For: Space-log: a novel approach to inferring gene-gene net-works using SPACE model with log penalty [version 1; peer review: 3 approved with reservations]. F1000Research 2020, 9:1159 (https://doi.org/10.5256/f1000research.28834.r71868)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 05 Jan 2022
    Qian Wu, Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, 98109, USA
    05 Jan 2022
    Author Response
    1. In both TCGA and GTEx analyses, the dimensions m is smaller than the sample size n. Since log penalty can handle high dimension low sample size data, can
    ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 05 Jan 2022
    Qian Wu, Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, 98109, USA
    05 Jan 2022
    Author Response
    1. In both TCGA and GTEx analyses, the dimensions m is smaller than the sample size n. Since log penalty can handle high dimension low sample size data, can
    ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 21 Sep 2020
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.