ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article

Robust and efficient identification of biomarkers from RNA-Seq data using median control chart

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 03 Jan 2019
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Background: One of the main goals of RNA-seq data analysis is identification of biomarkers that are differentially expressed (DE) across two or more experimental conditions. RNA-seq uses next generation sequencing technology and it has many advantages over microarrays. Numerous statistical methods have already been developed for identification the biomarkers from RNA-seq data. Most of these methods were based on either Poisson distribution or negative binomial distribution. However, efficient biomarker identification from discrete RNA-seq data is hampered by existing methods when the datasets contain outliers or extreme observations. Specially, the performance of these methods becomes more severe when the data come from a small number of samples in the presence of outliers. Therefore, in this study, an attempt is made to propose an outlier detection and modification approach for RNA-seq data to overcome the aforesaid problems of traditional methods. We make our proposed method facilitate in RNA-seq data by transforming the read count data into continuous data.
Methods: We use median control chart to detect and modify the outlying observation in a log-transformed RNA-seq dataset. To investigate the performance of the proposed method in absence and presence of outliers, we employ the five popular biomarker selection methods (edgeR, edgeR_robust, DEseq, DEseq2 and limma) both in simulated and real datasets.
Results: The simulation results strongly suggest that the performance of the proposed method improved in the presence of outliers. The proposed method also detected an additional 18 outlying DE genes from a real mouse RNA-seq dataset that were not detected by traditional methods. Using the KEGG pathway and gene ontology analysis results we reveal that these genes may be biomarkers, which require validation in a wet lab.
Conclusions: Our proposal is to apply the proposed method for biomarker identification from other RNA-seq data.

Keywords

RNA-Seq data, Logarithmic transformation, Biomarker, Outliers and Robustness

Introduction

One of the major objectives of researchers is to identify biomarkers from RNA-Seq data that are differentially expressed (DE) between two or more experimental conditions. Microarrays have been extensively used during the past few decades to perform this task. But after reducing the cost of sequencing, biomarker identification using RNA-seq data has emerged as an alternative choice to microarrays1,2. RNA-seq uses next generation sequencing technology to produce a vast amount of data. Curse of dimensionality is a common problem for analyzing RNA-Seq data, which means a "large p, small n" problem. Hence, dimension reduction of the data matrix is a primary objective for further downstream analysis. Identification of biomarkers is one of the dimensionality reduction approaches.

RNA-seq count data inherently follow a Poisson or negative binomial (NB) distribution rather than normal distribution like microarray data. Numerous statistical methods have been developed to identify biomarkers from RNA-seq count data. The earliest method is DEGseq, which is based on Poisson distribution. This method suffers from overdispersion and therefore Poisson distribution-based methods are not suitable for RNA-seq data36. To overcome this problem, NB distribution-based methods have been proposed. Some NB based methods are: baySeq, DESeq, DESeq2, EBSeq, edgeR, edgeR (robust) and NBPSeq711. However, most of the methods cannot estimate properly the gene-wise dispersion parameters and they also suffer from small sample sizes12. DESeq, DESeq2, edgeR, edgeR (robust) and NBPSeq incorporate information of all genes in their algorithms.

Despite the popularity of these statistical methods for identification of biomarker genes, they are sensitive to outliers and often produce lower accuracies in the presence of outliers. Outliers may arise in RNA-seq count data because there are several data generating stages from biological harvesting of RNA samples to counting of sequence read map data13. To mitigate this issue many algorithms use transformation methods. There are several transformation methods for RNA-seq data: logarithmic transformation14, variance-stabilizing transformation (vst)6, TMM transformation15, regularized logarithm8 and variance modeling at the observation level (voom)16. These methods only reduce the low level outliers into reasonable spaces during parameter estimations; however they fail to reduce the influence of high level outliers with small sample sizes in the data matrix.

Consequently, most biomarker selection methods that use the aforesaid transformations, are sensitive to outliers or extreme values with small-sample sizes. Therefore, in this study, an attempt is made to propose an outlier detection and modification approach for RNA-seq data to improve the performance of the popular biomarker selection methods in the presence of outliers. To make our proposed method facilitate in RNA-seq data we transform the read count data into continuous data using regularize logarithmic transformation.

The article is organized as follows: Methods briefly describes the logarithmic transformation and formulation the proposed outlier detection and modification approach. In Results and Conclusions a broad simulation study and a real data study have been carried out.

Methods

Let ygik be the number of reads simulated from gth gene of kth replicates in the ith condition (g = 1, 2, . . ., G; i = 1,2; k =1, 2, . . . , ni); where ni is the number of replicated in condition i. G is the total number of gene. Then the negative bionomial-distribution is as follows:

ygikNegativeBinomial(mean=μgi,size=rgi)

In this negative binomial parameterization, E(ygik) = μgi and Var ( (ygik)=μgi+μgi2rgi; where, μgi is the mean of gth gene in ith condition and 1rgi is the dispersion parameter. Now we want to test the following null hypothesis:

H0:μg1μg2=0VsH1:μg1μg2

A gene will be declared as DE if H0 is rejected, otherwise it is equally expressed (EE).

Log-transformation

Log-transformation is very useful in RNA-Seq data. The log-transformed data usually follow the normal distribution, which depends on the degree of skewness before transformation. As the RNA-Seq count data can be equal to zero, so we shift them by one before transforming them:

xgik=log(ygik+1)

Log-transformed values have less extreme values (or outliers) than the untransformed data.

However, the log-transformed values reduces the influence of low level outlying observations; however this transformation fails to reduce the influence of high level outlying observations. Therefore, we propose the following outlier detection and modification rule.

Biomarker identification using the proposed procedure

Since the median and median absolute deviation is the robust measure of location and scale parameter, respectively, therefore, we used the median control chart as a measure of outlier detection. The proposed procedure is as follows:

  • 1. We declare a gene as an outlying gene if it doesn’t fall into the interval [LCL, UCL]. Where LCL and UCL are the lower and upper control limit for median and they are defined by LCL= MEDg,(i)-3×MADg,(i) and UCL=MEDg,(i)+3×MADg,(i)]. Here, MEDg,(i)=median

    (xgik; g =1, 2,…, G; k =1, 2,…, ni ; i=1,2) is the median of gth gene in ith condition, MADg,(i) = mediank=1,2,..,ni(|xgik − MEDg, (i)|) is the median absolute deviation.

  • 2. Check the existence of outliers for each gene from each of the conditions (i=1,2), separately using step 1. If outliers exist, replace them by their respective group medians (MEDg,(i)).

  • 3. Apply the anti-log transformation to obtain modified RNA-seq (MRS) count datasets.

  • 4. Apply traditional statistical methods in MRS data to identify biomarker genes using the p-values adjusted by Benjamini-Hochberg method.

  • 5. Obtain the functional annotations and KEGG pathways for detected biomarker genes.

Performance evaluation

In order to evaluate the performance of different biomarkers selection methods we considered the area under the receiver operating characteristic curve (ROC) curve. The ROC is created by plotting the true positive rate (TPR) against the false positive rate (FPR) for different cut-off points of a parameter. For a particular threshold each point on the ROC curve produces a TPR/FPR pair. The area under the ROC curve (AUC) is a performance measure which helps us to select an optimum method that can distinguish between two gene groups such as DE or EE well.

Datasets

To investigate the performance of the proposed method in comparison with five popular methods as mentioned above, for both small-and-large-sample cases with 2 groups/conditions, we considered 100 datasets for both cases with sample sizes of n1=n2= 3 and 15, respectively. Each dataset for each case represented gene expression profiles for 1000 genes, each with n=(n1+n2) samples, where the read counts of each gene was generated using the negative-binomial distribution and this type of simulation study was also used in 11. The number of DE genes were set to 40 for each of the 100 datasets. We divided these 40 DE genes into two groups: 20, up-regulated DE genes and 20, down-regulated DE genes. To show the effect of outliers (extreme high counts) on the methods, we randomly selected 10% and 30% genes and for each of these genes, we selected a single sample randomly and multiplied the observed count of this sample with randomly selected factor between 5 and 10. This process was applied for each of the 100 datasets. We computed average values of different performance measures such as true positive rate (TPR), false positive rate (FPR) and AUC based on 40 estimated DE genes by five methods (edgeR, edgeR_robust, DESeq, DESeq2, and limma (voom)) for each of 100 original datasets and proposed MRS datasets.

We also considered a real RNA-seq mouse dataset17 to demonstrate the performance of the methods. This dataset consists of 36535 genes with 21 samples. This dataset was downloaded from ReCount website http://bowtie-bio.sourceforge.net/recount. It can also be downloaded from the GEO series accession number GSE26024. Among 21 samples, RNA-seq count expression collected from 10 C57BL/6J (B6) and 11 DBA/2J (D2) inbred mouse strains.

Software

To demonstrate the performance of the proposed method, a comparison with five popular methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)) was performed. We used both simulated and real RNA-seq count datasets. We used three R packages of other methods: edgeR, DESeq and limma. The performance measure area under the receiver operating characteristics curve (AUC) was computed for each of the methods using R package ROCR. All R packages are available in the comprehensive R archive network (cran) or Bioconductor.

Results

Performance evaluation based on simulated dataset

Table 1 summarizes the average AUCs estimated by eight methods based on 100 simulated datasets using 4% DE genes in absence and presence of single outlier in each of 10% and 30% genes for both small-sample cases (n1=n2=3) and large-sample cases (n1=n2=15), respectively. In Table 1, the results without and within the brackets (.) indicate the estimated AUCs by the five methods using the original RNA-seq datasets and proposed MRS datasets. From Table 1 we observed that in absence of outliers, five methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)) produced almost similar results using the original RNA-seq datasets and proposed MRS datasets, for both small-and-large-sample cases. However, in the presence of outliers, the performance of these methods has significantly increased using the proposed MRS datasets for both cases. For example, in the presence of 10% outliers, edgeR and DESeq produce AUCs 0.829 and 0.818, respectively for small-sample case. Whereas for the same condition these two methods produce AUCs 0.842 and 0.838, respectively using our proposed MRS datasets. Figure 1a and b and Figure 1c and d show the boxplot of AUCs based on 100 simulated datasets by each of the methods in absence and presence of outliers for small-and large-sample cases, respectively. The left and right-side panels in this figure indicate the boxplot of estimated AUCs using original RNA-seq datasets and proposed MRS datasets. Similar results were found from these boxplots, as in Table 1.

Table 1. Performance evaluation of different methods using AUC values for both small-and-large samples cases

For small-sample case (n1=n2=3)
Performance MeasureOutliersedgeRedgeR
(robust)
DESeqDESeq2Limma
(voom)
AUCWithout Outliers0.844
(0.844)
0.851
(0.85)
0.837
(0.842)
0.835
(0.833)
0.819
(0.819)
10% Outliers0.829
(0.842)
0.826
(0.825)
0.818
(0.838)
0.815
(0.826)
0.784
(0.801)
30% Outliers0.697
(0.796)
0.709
(0.786)
0.713
(0.785)
0.705
(0.791)
0.708
(0.768)
For large-sample case (n1=n2=15)
Performance MeasureOutliersedgeRedgeR
(robust)
DESeqDESeq2Limma
(voom)
AUCWithout Outliers0.966
(0.965)
0.943
(0.942)
0.963
(0.965)
0.959
(0.962)
0.931
(0.931)
10% Outliers0.940
(0.959)
0.934
(0.946)
0.936
(0.953)
0.898
(0.952)
0.918
(0.931)
30% Outliers0.894
(0.937)
0.907
(0.948)
0.852
(0.945)
0.782
(0.912)
0.851
(0.929)
1320fc27-145a-435f-a1b9-00cac29fe055_figure1.gif

Figure 1. Performance evaluation using boxplot of AUCs estimated by different methods using the simulated datasets.

(ab) for small-sample case (n1=n2= 3) and (cd) for large-sample case (n1=n2= 15).

Performance evaluation based on real dataset

After filtering we retain 11,474 genes. To investigate the performance of the proposed method, we employ three methods (edgeR, DEseq and limma) for detection of the biomarkers between the two mouse strains. Figure 2a and b represents the Venn diagram of estimated DE genes by edgeR, DESeq and Limma using the original and MRS dataset, respectively. From Figure 2a, we revealed that edgeR and Limma performed better than DESeq by sharing more genes (414). We also noticed that there are 1925 overlapping DE genes between these methods. To investigate the performance of the proposed outlier detection and modification approach in this dataset, we first detect and modify the outliers (if any) to get the MRS dataset. We detected 200 outliers in this dataset using the proposed outlier detection rule. The Venn diagram in Figure 2b represents the results of these three methods using the proposed MRS dataset. From this figure we can clearly observe that there are 1956 overlapping genes detected by these methods. Among these genes there are 18 genes that are declared as outliers by the proposed method and those were not detected as DE genes using the original mouse dataset.

1320fc27-145a-435f-a1b9-00cac29fe055_figure2.gif

Figure 2. Comparison of the DEGs detected by three methods and outlying gene expression profiles for Mouse dataset.

Venn diagram of DEGs detected by (a) the edgeR, DESeq and Limma in the original mouse dataset or by (b) the edgeR, DESeq and Limma in the modified mouse dataset using the proposed method. (c) Heatmap of 16 outlying DEGs detected by the proposed method.

Furthermore, we performed the gene overexpression analyses through Database for Annotation, Visualization and Integrated Discovery (DAVID)18 to explore the biological process (BF) categories and pathway annotations of the 18 identified outlying DE genes. Out of 18 genes DAVID identified 16 genes. A heatmap is created for these 16 outlying DE genes in Figure 2c. The heatmap correctly clusters the samples between C57BL/6J (B6) and 11 DBA/2J (D2) using these genes. Among the 16 genes, 8 upregulated (Fam46b, Alx3, Dusp2, Pdyn, Agbl2, Pcdh12, Ubl5, Gpx8) and 8 downregulated (Stard5, Ptprc, Slc7a5, Slc24a1, Ehd2, Adgrg3, Tefm, Tsnaxip1) DE genes are identified. GO analysis results showed that upregulated DE genes are significantly enriched in protein side chain deglutamylation, protein deglutamylation and embryo development at BP level. Downregulated DEGs were enriched in negative regulation of CREB transcription factor activity, negative regulation of NIK/NF-kappaB signaling and sterol import at BP level (see extended data). KEGG analysis showed that the upregulated DEGs were mostly enriched in cocaine addiction, glutathione metabolism and thyroid hormone synthesis. The downregulated DEGs were enriched in phototransduction, primary immunodeficiency and central carbon metabolism in cancer (Table 2). We also constructed the protein-protein interaction (PPI) network around the proteins encoded by these 16 outlying genes using STRING database19. We considered confidence score 400 for selection these networks. Figure 3 and Figure 4 represent the PPI networks using the up-regulated and down-regulated DEGs, respectively. In addition, we explored miRNAs-target gene interactions from miRTarBase20 to identify miRNAs. The miRNAs-target gene interactions network is shown in Figure S1 in extended data.

Table 2. KEGG pathways for the 16 outlying DEGs detected by the proposed method for Mouse dataset.

Pathways for up-regulated genes
KEGG IDPathway NameName of GeneP-value
mmu05030Cocaine addictionPdyn1.79e-02
mmu00480Glutathione metabolismGpx82.29e-02
mmu05031Amphetamine addictionPdyn2.53e-02
mmu04918Thyroid hormone synthesisGpx82.79e-02
mmu00590Arachidonic acid metabolismGpx83.31e-02
mmu05034AlcoholismPdyn7.29e-02
mmu04010MAPK signaling pathwayDusp29.17e-02
Pathways for down-regulated genes
KEGG IDPathway NameName of GeneP-value
mmu04744PhototransductionSlc24a11.35e-02
mmu05340Primary immunodeficiencyPtprc1.79e-02
mmu05230Central carbon metabolism in cancerSlc7a53.27e-02
mmu04666Fc gamma R-mediated phagocytosisPtprc4.38e-02
mmu04660T cell receptor signaling pathwayPtprc5.16e-02
mmu04150mTOR signaling pathwaySlc7a57.59e-02
mmu04514Cell adhesion molecules (CAMs)Ptprc8.2e-02
mmu04144EndocytosisEhd21.36e-01
1320fc27-145a-435f-a1b9-00cac29fe055_figure3.gif

Figure 3. PPI network using the up-regulated outlying genes identified by the proposed method.

1320fc27-145a-435f-a1b9-00cac29fe055_figure4.gif

Figure 4. PPI network using the down-regulated outlying genes identified by the proposed method.

Conclusions

Biomarker identification under two or more conditions is an important task for elucidating the molecular basis of phenotypic variation. Next generation sequencing (RNA-seq) has become very popular and a competitive alternative to microarrays because of reducing the cost of sequencing and limitation of microarrays. A number of methods have been developed for detecting biomarkers from RNA-seq data. However, most of the methods are sensitive to outliers and produce misleading results in the presence of outliers. In this study, we have proposed an outlier detection and modification approach using the median control chart. From the simulation study in the presence of outliers we have observed that the performance of five biomarker selection methods are improved significantly when the datasets are modified by the proposed method, both for small-and large-sample cases. The proposed method also detected an additional 16 outlying genes from a real mouse dataset. From GO and KEGG pathway enrichment analysis, we have shown that these genes belong to some important pathways.

Data availability

Underlying data

Simulated datasets available from: https://doi.org/10.5281/zenodo.221288121

Real dataset: The mouse dataset used in this study is publicly available at the NCBI GEO website: GSE26024.

Extended data

Zenodo: Figure S1: miRNAs-target gene interactions using the outlying genes identified by the proposed method, http://doi.org/10.5281/zenodo.227992122

Zenodo: Table A1. Biological process categories for 16 genes, http://doi.org/10.5281/zenodo.228001223

Software availability

The R code for the proposed method is available in https://github.com/snotjanu/OutMod-RnaSeq

Archived code: http://doi.org/10.5281/zenodo.227940524

License: MIT

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 03 Jan 2019
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Shahjaman M, Akter H, Rashid MM et al. Robust and efficient identification of biomarkers from RNA-Seq data using median control chart [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:7 (https://doi.org/10.12688/f1000research.17351.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 03 Jan 2019
Views
12
Cite
Reviewer Report 06 Jun 2019
Jun Li, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA 
Not Approved
VIEWS 12
The author proposed a method for outlier detection and differential expression (DE) identification for RNA-seq data. While DE is surely an important problem in RNA-seq data analysis, the proposed method is based on a wrong mathematical model and thus makes ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Li J. Reviewer Report For: Robust and efficient identification of biomarkers from RNA-Seq data using median control chart [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:7 (https://doi.org/10.5256/f1000research.18975.r48899)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
25
Cite
Reviewer Report 29 Apr 2019
Lei Li, Dan L Duncan Comprehensive Cancer Center, Baylor College of Medicine (BCM), Houston, TX, USA 
Approved with Reservations
VIEWS 25
In the manuscript, Shahjaman et al. presents a new median control chart method for controlling outliers in RNA-Seq data. They applied this method on simulated data and also a real mouse dataset. In general, a useful bioinformatics method for outlier ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Li L. Reviewer Report For: Robust and efficient identification of biomarkers from RNA-Seq data using median control chart [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:7 (https://doi.org/10.5256/f1000research.18975.r47416)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 03 Jan 2019
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.