Robust and efficient identification of biomarkers from RNA-Seq data using median control chart

Md Shahjaman; Habiba Akter; Md. Mamunur Rashid; Md. Ibnul Asifuzzaman; Md. Bipul Hossen; Md. Rezanur Rahman

doi:10.12688/f1000research.17351.1

Research Article

[version 1; peer review: 1 approved with reservations, 1 not approved]

Md Shahjaman ¹, Habiba Akter¹, Md. Mamunur Rashid¹, Md. Ibnul Asifuzzaman¹, Md. Bipul Hossen¹, Md. Rezanur Rahman²

PUBLISHED 03 Jan 2019

Author details

¹ Department of Statistics, Begum Rokeya University, Rangpur, 5400, Bangladesh
² Department of Biochemistry and Biotechnology, Khwaja Yunus Ali University, Sirajgonj, 6200, Bangladesh

Md Shahjaman
Roles: Conceptualization, Supervision, Writing – Review & Editing

Habiba Akter
Roles: Methodology

Md. Mamunur Rashid
Roles: Resources, Software

Md. Ibnul Asifuzzaman
Roles: Conceptualization, Writing – Review & Editing

Md. Bipul Hossen
Roles: Writing – Review & Editing

Md. Rezanur Rahman
Roles: Validation, Writing – Review & Editing

Abstract

Background: One of the main goals of RNA-seq data analysis is the identification of biomarkers that are differentially expressed (DE) across two or more experimental conditions. RNA-seq uses next generation sequencing technology and has many advantages over microarrays. Numerous statistical methods have already been developed for identifying biomarkers from RNA-seq data. Most of these methods are based on either the Poisson distribution or the negative binomial distribution. However, existing methods identify biomarkers from discrete RNA-seq data less efficiently when the datasets contain outliers or extreme observations. In particular, their performance deteriorates further when the data come from a small number of samples in the presence of outliers. Therefore, in this study, we propose an outlier detection and modification approach for RNA-seq data to overcome these problems of traditional methods. To make the proposed method applicable to RNA-seq data, we transform the read count data into continuous data.
Methods: We use the median control chart to detect and modify outlying observations in a log-transformed RNA-seq dataset. To investigate the performance of the proposed method in the absence and presence of outliers, we employ five popular biomarker selection methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma) on both simulated and real datasets.
Results: The simulation results strongly suggest that the proposed method improves the performance of these methods in the presence of outliers. The proposed method also detected an additional 18 outlying DE genes from a real mouse RNA-seq dataset that were not detected by traditional methods. Using KEGG pathway and gene ontology analysis, we show that these genes may be biomarkers, which requires validation in a wet lab.
Conclusions: We recommend applying the proposed method for biomarker identification from other RNA-seq data.

Keywords

RNA-Seq data, Logarithmic transformation, Biomarker, Outliers and Robustness

Corresponding authors: Md Shahjaman, Habiba Akter

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2019 Shahjaman M et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Shahjaman M, Akter H, Rashid MM et al. Robust and efficient identification of biomarkers from RNA-Seq data using median control chart [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2019, 8:7 (https://doi.org/10.12688/f1000research.17351.1) First published: 03 Jan 2019, 8:7 (https://doi.org/10.12688/f1000research.17351.1) Latest published: 03 Jan 2019, 8:7 (https://doi.org/10.12688/f1000research.17351.1)

Introduction

One of the major objectives of researchers is to identify biomarkers from RNA-Seq data that are differentially expressed (DE) between two or more experimental conditions. Microarrays have been used extensively for this task during the past few decades. However, as the cost of sequencing has fallen, biomarker identification using RNA-seq data has emerged as an alternative to microarrays¹,². RNA-seq uses next generation sequencing technology to produce a vast amount of data. The curse of dimensionality, that is, the "large p, small n" problem, is common when analyzing RNA-Seq data. Hence, dimension reduction of the data matrix is a primary objective before further downstream analysis. Identification of biomarkers is one dimensionality reduction approach.

RNA-seq count data inherently follow a Poisson or negative binomial (NB) distribution rather than a normal distribution, as microarray data do. Numerous statistical methods have been developed to identify biomarkers from RNA-seq count data. The earliest method is DEGseq, which is based on the Poisson distribution. This method suffers from overdispersion, and therefore Poisson distribution-based methods are not suitable for RNA-seq data³⁻⁶. To overcome this problem, NB distribution-based methods have been proposed. Some NB-based methods are baySeq, DESeq, DESeq2, EBSeq, edgeR, edgeR (robust) and NBPSeq⁷⁻¹¹. However, most of these methods cannot properly estimate the gene-wise dispersion parameters, and they also suffer when sample sizes are small¹². DESeq, DESeq2, edgeR, edgeR (robust) and NBPSeq incorporate information from all genes in their algorithms.

Despite the popularity of these statistical methods for identifying biomarker genes, they are sensitive to outliers and often produce lower accuracies in their presence. Outliers may arise in RNA-seq count data because there are several data-generating stages between the biological harvesting of RNA samples and the counting of mapped sequence reads¹³. To mitigate this issue, many algorithms use transformation methods. There are several transformation methods for RNA-seq data: logarithmic transformation¹⁴, variance-stabilizing transformation (vst)⁶, TMM transformation¹⁵, regularized logarithm⁸ and variance modeling at the observation level (voom)¹⁶. These methods only bring low-level outliers into a reasonable range during parameter estimation; they fail to reduce the influence of high-level outliers in the data matrix when sample sizes are small.

Consequently, most biomarker selection methods that use the aforesaid transformations are sensitive to outliers or extreme values when sample sizes are small. Therefore, in this study, we propose an outlier detection and modification approach for RNA-seq data to improve the performance of popular biomarker selection methods in the presence of outliers. To make the proposed method applicable to RNA-seq data, we transform the read count data into continuous data using the regularized logarithmic transformation.

The article is organized as follows: Methods briefly describes the logarithmic transformation and the formulation of the proposed outlier detection and modification approach. Results presents a broad simulation study and a real data study, and Conclusions summarizes the findings.

Methods

Let $y_{gik}$ be the number of reads for the gth gene in the kth replicate of the ith condition (g = 1, 2, ..., G; i = 1, 2; k = 1, 2, ..., n_i), where n_i is the number of replicates in condition i and G is the total number of genes. The read counts are assumed to follow a negative binomial distribution:

$$y_{gik} \sim \text{Negative Binomial}(\text{mean} = \mu_{gi},\ \text{size} = r_{gi})$$

In this negative binomial parameterization, $E(y_{gik}) = \mu_{gi}$ and $\mathrm{Var}(y_{gik}) = \mu_{gi} + \frac{\mu_{gi}^{2}}{r_{gi}}$, where $\mu_{gi}$ is the mean of the gth gene in the ith condition and $\frac{1}{r_{gi}}$ is the dispersion parameter. We want to test the following null hypothesis:

$$H_{0}: \mu_{g1} - \mu_{g2} = 0 \quad \text{vs.} \quad H_{1}: \mu_{g1} \neq \mu_{g2}$$

A gene is declared DE if H₀ is rejected; otherwise it is considered equally expressed (EE).
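As a small worked illustration of the negative binomial parameterization above (this is not the authors' code), the following Python sketch evaluates the variance formula for a hypothetical gene and shows how the variance approaches the Poisson variance (equal to the mean) as the size parameter grows. The function name `nb_variance` and the example numbers are assumptions chosen for illustration only.

```python
def nb_variance(mu, r):
    """Variance of a negative binomial count with mean mu and size r:
    Var(y) = mu + mu**2 / r, where 1 / r is the dispersion."""
    if mu < 0 or r <= 0:
        raise ValueError("mu must be non-negative and r must be positive")
    return mu + mu ** 2 / r


# Hypothetical gene with a mean count of 100 under three size values.
mean_count = 100.0
for size in (1.0, 10.0, 1000.0):
    variance = nb_variance(mean_count, size)
    print(f"size r = {size:7.1f}  dispersion = {1 / size:.4f}  variance = {variance:.1f}")
```

A small size (large dispersion) gives a variance far above the mean, which is the overdispersion that Poisson-based methods cannot capture.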

Log-transformation

Log-transformation is very useful for RNA-Seq data. Log-transformed data are often approximately normally distributed, although how closely depends on the degree of skewness before transformation. Because RNA-Seq counts can be equal to zero, we shift them by one before transforming them:

$$x_{gik} = \log (y_{gik} + 1)$$

Log-transformed values contain fewer extreme values (or outliers) than the untransformed data.
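The shifted log-transformation and its inverse can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the text does not state the base of the logarithm, so the natural logarithm is assumed here, and the function names and example counts are hypothetical.

```python
import math


def log_transform(counts):
    """Shifted log-transform x = log(y + 1); natural log is an assumption."""
    return [math.log1p(y) for y in counts]


def inverse_log_transform(values):
    """Anti-log transform y = exp(x) - 1, the inverse of log_transform."""
    return [math.expm1(x) for x in values]


# Hypothetical read counts for one gene; the last value is extreme.
counts = [0, 9, 99, 5000]
transformed = log_transform(counts)
for y, x in zip(counts, transformed):
    print(f"count = {y:5d}  log(count + 1) = {x:.3f}")
```

A zero count maps to zero, and the extreme count is pulled much closer to the others on the log scale, although it remains the largest value.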

However, although the log-transformation reduces the influence of low-level outlying observations, it fails to reduce the influence of high-level outlying observations. Therefore, we propose the following outlier detection and modification rule.

Biomarker identification using the proposed procedure

Because the median and the median absolute deviation (MAD) are robust measures of location and scale, respectively, we used the median control chart for outlier detection. The proposed procedure is as follows:

1. We declare a gene as an outlying gene if any of its observations x_gik does not fall into the interval [LCL, UCL], where LCL and UCL are the lower and upper control limits for the median, defined by LCL = MED_g,(i) − 3×MAD_g,(i) and UCL = MED_g,(i) + 3×MAD_g,(i). Here, MED_g,(i) = median_{k=1,2,…,n_i}(x_gik) is the median of the gth gene in the ith condition (g = 1, 2, …, G; i = 1, 2), and MAD_g,(i) = median_{k=1,2,…,n_i}(|x_gik − MED_g,(i)|) is the corresponding median absolute deviation.
2. Check for outliers in each gene within each condition (i = 1, 2) separately, using step 1. If outliers exist, replace them with their respective group medians (MED_g,(i)).
3. Apply the anti-log transformation to obtain the modified RNA-seq (MRS) count dataset.
4. Apply traditional statistical methods to the MRS data to identify biomarker genes, using p-values adjusted by the Benjamini-Hochberg method.
5. Obtain the functional annotations and KEGG pathways for the detected biomarker genes.
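Steps 1 to 3 above can be sketched in Python for a single gene within a single condition. This is a minimal sketch under stated assumptions, not the authors' R code: the natural logarithm is assumed, the back-transformed values are rounded to integers (an assumption, made because count-based DE tools expect integer input), and the names `control_limits` and `modify_outliers` are hypothetical.

```python
import math
from statistics import median


def control_limits(values, k=3.0):
    """Return (median, LCL, UCL) with limits median -/+ k * MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, med - k * mad, med + k * mad


def modify_outliers(counts, k=3.0):
    """Detect and replace outliers for one gene in one condition.

    counts: raw read counts of the replicates.
    Returns (modified_counts, flags), where flags[j] is True if
    replicate j fell outside [LCL, UCL] on the log scale.
    """
    x = [math.log1p(y) for y in counts]            # x = log(y + 1)
    med, lcl, ucl = control_limits(x, k)           # step 1
    flags = [not (lcl <= v <= ucl) for v in x]
    x_mod = [med if f else v for v, f in zip(x, flags)]  # step 2
    modified = [round(math.expm1(v)) for v in x_mod]     # step 3 (rounding assumed)
    return modified, flags


# Hypothetical gene with five replicates; the last count is extreme.
modified, flags = modify_outliers([12, 15, 14, 13, 160])
print(modified, flags)
```

In this toy example only the extreme count lies outside the control limits and is replaced by the group median on the log scale. Note that when more than half of the replicates share the same value the MAD is zero, so every differing replicate is flagged; how the original implementation treats that case is not stated in the text.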

Performance evaluation

To evaluate the performance of the different biomarker selection methods, we considered the area under the receiver operating characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different cut-off points of a parameter. Each point on the ROC curve represents the TPR/FPR pair for a particular threshold. The area under the ROC curve (AUC) is a performance measure that helps us select the method that best distinguishes between the two gene groups, DE and EE.
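The AUC described above can be illustrated with a short Python sketch. The study computed AUCs with the R package ROCR; the code below is a stand-in that uses the equivalent rank-based definition (the probability that a randomly chosen DE gene receives a higher score than a randomly chosen EE gene, with ties counted as one half). The function name `auc` and the example scores are hypothetical; a score could be, for instance, one minus the p-value of a gene.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: larger values mean stronger evidence that a gene is DE.
    labels: 1 (or True) for truly DE genes, 0 (or False) for EE genes.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("both DE and EE genes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # DE gene ranked above EE gene
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three DE genes followed by three EE genes.
scores = [0.99, 0.95, 0.40, 0.80, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc(scores, labels), 3))
```

An AUC of 1 means every DE gene is ranked above every EE gene, whereas 0.5 corresponds to a random ranking.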

Datasets

To investigate the performance of the proposed method in comparison with the five popular methods mentioned above, we considered two groups/conditions in both a small-sample case (n₁=n₂=3) and a large-sample case (n₁=n₂=15), with 100 datasets for each case. Each dataset represented gene expression profiles for 1000 genes, each with n=(n₁+n₂) samples, where the read counts of each gene were generated from the negative binomial distribution; this type of simulation study was also used in reference 11. The number of DE genes was set to 40 for each of the 100 datasets. We divided these 40 DE genes into two groups: 20 up-regulated and 20 down-regulated DE genes. To show the effect of outliers (extremely high counts) on the methods, we randomly selected 10% or 30% of the genes; for each of these genes, we selected a single sample at random and multiplied its observed count by a randomly selected factor between 5 and 10. This process was applied to each of the 100 datasets. We computed the average values of performance measures such as the true positive rate (TPR), false positive rate (FPR) and AUC, based on the 40 estimated DE genes, for the five methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)) on each of the 100 original datasets and the proposed MRS datasets.
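The outlier contamination step of the simulation can be sketched in Python as follows. This is an illustrative sketch, not the authors' simulation code: the text says only that the factor is randomly selected between 5 and 10, so a continuous uniform draw is assumed, the contaminated count is rounded to an integer (also an assumption), and the function name `inject_outliers` and the toy matrix are hypothetical.

```python
import random


def inject_outliers(counts, fraction, rng, low=5.0, high=10.0):
    """Contaminate a genes-by-samples count matrix with high outliers.

    A fraction of genes is chosen at random; for each chosen gene one
    randomly chosen sample is multiplied by a factor in [low, high].
    Returns (contaminated_matrix, sorted indices of contaminated genes).
    """
    n_genes = len(counts)
    n_out = round(fraction * n_genes)
    genes = rng.sample(range(n_genes), n_out)
    contaminated = [row[:] for row in counts]      # leave the input untouched
    for g in genes:
        k = rng.randrange(len(contaminated[g]))    # one sample per gene
        factor = rng.uniform(low, high)            # uniform draw is an assumption
        contaminated[g][k] = round(contaminated[g][k] * factor)
    return contaminated, sorted(genes)


# Toy matrix: 10 genes, 6 samples (n1 = n2 = 3), every count equal to 20.
toy = [[20] * 6 for _ in range(10)]
out, genes = inject_outliers(toy, 0.30, random.Random(1))
print("contaminated genes:", genes)
```

Applying the same function with `fraction=0.10` would give the 10% contamination scenario; in the study the underlying counts were generated from the negative binomial distribution rather than fixed as in this toy matrix.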

We also considered a real RNA-seq mouse dataset¹⁷ to demonstrate the performance of the methods. This dataset consists of 36,535 genes with 21 samples. It was downloaded from the ReCount website (http://bowtie-bio.sourceforge.net/recount) and is also available from GEO under series accession number GSE26024. The 21 samples comprise RNA-seq count data collected from 10 C57BL/6J (B6) and 11 DBA/2J (D2) mice of the two inbred strains.

Software

To demonstrate the performance of the proposed method, we compared it with five popular methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)), using both simulated and real RNA-seq count datasets. We used three R packages for the other methods: edgeR, DESeq and limma. The area under the receiver operating characteristic curve (AUC) was computed for each of the methods using the R package ROCR. All R packages are available from the Comprehensive R Archive Network (CRAN) or Bioconductor.

Results

Performance evaluation based on simulated dataset

Table 1 summarizes the average AUCs estimated by the five methods based on 100 simulated datasets with 4% DE genes, in the absence of outliers and in the presence of a single outlier in each of 10% and 30% of the genes, for both the small-sample case (n₁=n₂=3) and the large-sample case (n₁=n₂=15). In Table 1, the values outside and inside the brackets are the AUCs estimated by the five methods using the original RNA-seq datasets and the proposed MRS datasets, respectively. From Table 1 we observe that, in the absence of outliers, the five methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)) produced almost identical results with the original RNA-seq datasets and the proposed MRS datasets, in both the small- and large-sample cases. However, in the presence of outliers, the performance of these methods improved in nearly all settings when the proposed MRS datasets were used. For example, in the presence of 10% outliers, edgeR and DESeq produced AUCs of 0.829 and 0.818, respectively, in the small-sample case, whereas under the same condition these two methods produced AUCs of 0.842 and 0.838, respectively, using our proposed MRS datasets. Figure 1a and b and Figure 1c and d show boxplots of the AUCs based on 100 simulated datasets for each of the methods in the absence and presence of outliers, for the small- and large-sample cases, respectively. The left and right panels in this figure show the boxplots of estimated AUCs using the original RNA-seq datasets and the proposed MRS datasets, respectively. These boxplots show results similar to those in Table 1.

Table 1. Performance evaluation of different methods using AUC values for both small-and-large samples cases

For small-sample case (n₁=n₂=3)
Performance Measure	Outliers	edgeR	edgeR (robust)	DESeq	DESeq2	Limma (voom)
AUC	Without Outliers	0.844 (0.844)	0.851 (0.85)	0.837 (0.842)	0.835 (0.833)	0.819 (0.819)
	10% Outliers	0.829 (0.842)	0.826 (0.825)	0.818 (0.838)	0.815 (0.826)	0.784 (0.801)
	30% Outliers	0.697 (0.796)	0.709 (0.786)	0.713 (0.785)	0.705 (0.791)	0.708 (0.768)
For large-sample case (n₁=n₂=15)
Performance Measure	Outliers	edgeR	edgeR (robust)	DESeq	DESeq2	Limma (voom)
AUC	Without Outliers	0.966 (0.965)	0.943 (0.942)	0.963 (0.965)	0.959 (0.962)	0.931 (0.931)
	10% Outliers	0.940 (0.959)	0.934 (0.946)	0.936 (0.953)	0.898 (0.952)	0.918 (0.931)
	30% Outliers	0.894 (0.937)	0.907 (0.948)	0.852 (0.945)	0.782 (0.912)	0.851 (0.929)

Figure 1. Performance evaluation using boxplot of AUCs estimated by different methods using the simulated datasets.

(a–b) for small-sample case (n₁=n₂= 3) and (c–d) for large-sample case (n₁=n₂= 15).

Performance evaluation based on real dataset

After filtering, we retained 11,474 genes. To investigate the performance of the proposed method, we employed three methods (edgeR, DESeq and limma) to detect biomarkers between the two mouse strains. Figure 2a and b show Venn diagrams of the DE genes estimated by edgeR, DESeq and limma using the original and MRS datasets, respectively. From Figure 2a, we found that edgeR and limma performed better than DESeq in the sense that they shared more genes (414). We also noticed 1925 overlapping DE genes among the three methods. To investigate the performance of the proposed outlier detection and modification approach on this dataset, we first detected and modified the outliers (if any) to obtain the MRS dataset. We detected 200 outliers in this dataset using the proposed outlier detection rule. The Venn diagram in Figure 2b shows the results of the three methods using the proposed MRS dataset. From this figure we observe 1956 overlapping genes detected by these methods. Among these, 18 genes were declared outlying by the proposed method and had not been detected as DE genes using the original mouse dataset.

Figure 2. Comparison of the DEGs detected by three methods and outlying gene expression profiles for Mouse dataset.

Venn diagram of DEGs detected by (a) the edgeR, DESeq and Limma in the original mouse dataset or by (b) the edgeR, DESeq and Limma in the modified mouse dataset using the proposed method. (c) Heatmap of 16 outlying DEGs detected by the proposed method.

Furthermore, we performed gene over-representation analysis with the Database for Annotation, Visualization and Integrated Discovery (DAVID)¹⁸ to explore the biological process (BP) categories and pathway annotations of the 18 identified outlying DE genes. Of the 18 genes, DAVID identified 16. A heatmap of these 16 outlying DE genes is shown in Figure 2c. The heatmap correctly clusters the samples into C57BL/6J (B6) and DBA/2J (D2) using these genes. Among the 16 genes, 8 upregulated (Fam46b, Alx3, Dusp2, Pdyn, Agbl2, Pcdh12, Ubl5, Gpx8) and 8 downregulated (Stard5, Ptprc, Slc7a5, Slc24a1, Ehd2, Adgrg3, Tefm, Tsnaxip1) DE genes were identified. GO analysis showed that the upregulated DE genes were significantly enriched in protein side chain deglutamylation, protein deglutamylation and embryo development at the BP level. The downregulated DEGs were enriched in negative regulation of CREB transcription factor activity, negative regulation of NIK/NF-kappaB signaling and sterol import at the BP level (see extended data). KEGG analysis showed that the upregulated DEGs were mostly enriched in cocaine addiction, glutathione metabolism and thyroid hormone synthesis. The downregulated DEGs were enriched in phototransduction, primary immunodeficiency and central carbon metabolism in cancer (Table 2). We also constructed protein-protein interaction (PPI) networks around the proteins encoded by these 16 outlying genes using the STRING database¹⁹, with a confidence score of 400 for selecting the networks. Figure 3 and Figure 4 show the PPI networks for the up-regulated and down-regulated DEGs, respectively. In addition, we explored miRNA-target gene interactions from miRTarBase²⁰ to identify miRNAs. The miRNA-target gene interaction network is shown in Figure S1 in the extended data.

Table 2. KEGG pathways for the 16 outlying DEGs detected by the proposed method for Mouse dataset.

Pathways for up-regulated genes
KEGG ID	Pathway Name	Name of Gene	P-value
mmu05030	Cocaine addiction	Pdyn	1.79e-02
mmu00480	Glutathione metabolism	Gpx8	2.29e-02
mmu05031	Amphetamine addiction	Pdyn	2.53e-02
mmu04918	Thyroid hormone synthesis	Gpx8	2.79e-02
mmu00590	Arachidonic acid metabolism	Gpx8	3.31e-02
mmu05034	Alcoholism	Pdyn	7.29e-02
mmu04010	MAPK signaling pathway	Dusp2	9.17e-02
Pathways for down-regulated genes
KEGG ID	Pathway Name	Name of Gene	P-value
mmu04744	Phototransduction	Slc24a1	1.35e-02
mmu05340	Primary immunodeficiency	Ptprc	1.79e-02
mmu05230	Central carbon metabolism in cancer	Slc7a5	3.27e-02
mmu04666	Fc gamma R-mediated phagocytosis	Ptprc	4.38e-02
mmu04660	T cell receptor signaling pathway	Ptprc	5.16e-02
mmu04150	mTOR signaling pathway	Slc7a5	7.59e-02
mmu04514	Cell adhesion molecules (CAMs)	Ptprc	8.2e-02
mmu04144	Endocytosis	Ehd2	1.36e-01

Figure 3. PPI network using the up-regulated outlying genes identified by the proposed method.

Figure 4. PPI network using the down-regulated outlying genes identified by the proposed method.

Conclusions

Biomarker identification under two or more conditions is an important task for elucidating the molecular basis of phenotypic variation. RNA-seq, which is based on next generation sequencing, has become a popular and competitive alternative to microarrays because of the falling cost of sequencing and the limitations of microarrays. A number of methods have been developed for detecting biomarkers from RNA-seq data. However, most of these methods are sensitive to outliers and produce misleading results in their presence. In this study, we have proposed an outlier detection and modification approach using the median control chart. In the simulation study, we observed that in the presence of outliers the performance of five biomarker selection methods improved when the datasets were modified by the proposed method, in both the small- and large-sample cases. The proposed method also detected an additional 18 outlying DE genes from a real mouse dataset, 16 of which were identified by DAVID. From GO and KEGG pathway enrichment analysis, we have shown that these genes belong to some important pathways.

Data availability

Underlying data

Simulated datasets available from: https://doi.org/10.5281/zenodo.2212881²¹

Real dataset: The mouse dataset used in this study is publicly available at the NCBI GEO website: GSE26024.

Extended data

Zenodo: Figure S1: miRNAs-target gene interactions using the outlying genes identified by the proposed method, http://doi.org/10.5281/zenodo.2279921²²

Zenodo: Table A1. Biological process categories for 16 genes, http://doi.org/10.5281/zenodo.2280012²³

Software availability

The R code for the proposed method is available in https://github.com/snotjanu/OutMod-RnaSeq

Archived code: http://doi.org/10.5281/zenodo.2279405²⁴

License: MIT

Acknowledgements

We would like to thank the reviewers for their valuable comments on the paper, as these comments led us to an improvement of the work.

References

1. Mortazavi A, Williams BA, McCue K, et al.: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–628.
2. Beyer M, Mallmann MR, Xue J, et al.: High-resolution transcriptome of human macrophages. PLoS One. 2012; 7(9): e45466.
3. Wang L, Feng Z, Wang X, et al.: DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010; 26(1): 136–138.
4. Nagalakshmi U, Waern K, Snyder M: RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010; Chapter 4: Unit 4.11.1–13.
5. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1): 139–140.
6. Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010; 11(10): R106.
7. Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010; 11: 422.
8. Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12): 550.
9. Leng N, Dawson JA, Thomson JA, et al.: EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013; 29(8): 1035–1043.
10. Zhou X, Lindsay H, Robinson MD: Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014; 42(11): e91.
11. Di Y, Schafer DW, Cumbie JS, et al.: The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol. 2011; 10(1): 1–18.
12. Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008; 9(2): 321–332.
13. George NI, Bowyer JF, Crabtree NM, et al.: An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data. PLoS One. 2015; 10(6): e0125224.
14. Zwiener I, Frisch B, Binder H: Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS One. 2014; 9(1): e85150.
15. Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11(3): R25.
16. Law CW, Chen Y, Shi W, et al.: voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15(2): R29.
17. Bottomly D, Walter NA, Hunter JE, et al.: Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One. 2011; 6(3): e17820.
18. Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4(1): 44–57.
19. Szklarczyk D, Morris JH, Cook H, et al.: The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017; 45(D1): D362–8.
20. Hsu SD, Lin FM, Wu WY, et al.: miRTarBase: a database curates experimentally validated microRNA-target interactions. Nucleic Acids Res. 2011; 39(Database issue): 163–9.
21. Shahjaman Md: Simulated Data for figure 1 (Version v1). 2018. http://www.doi.org/10.5281/zenodo.2212881
22. Shahjaman Md: miRNAs-target gene interactions using the outlying genes identified by the proposed method (Version v1.0.0). 2018. http://www.doi.org/10.5281/zenodo.2279921
23. Shahjaman Md: Biological process categories for 16 genes (Version v1.0.0). 2018. http://www.doi.org/10.5281/zenodo.2280012
24. snotjanu: snotjanu/OutMod-RnaSeq v1.0.0 (Version v1.0.0). 2018. http://www.doi.org/10.5281/zenodo.2279405

Open Peer Review

Reviewer Report 06 Jun 2019

Jun Li, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA

Not Approved

https://doi.org/10.5256/f1000research.18975.r48899

The author proposed a method for outlier detection and differential expression (DE) identification for RNA-seq data. While DE is surely an important problem in RNA-seq data analysis, the proposed method is based on a wrong mathematical model and thus makes no sense.

The author assumes the read count $y_{gik}$ follows a negative binomial model with mean $\mu_{gi}$ depending only on the gene ($g$) and the condition $i$. Where is the sequencing depth? Normalizing/incorporating sequencing depth has been a central question in DE analysis and significant efforts have been made by many important papers in this field, but the author completely ignored this term. Similarly, without considering the sequencing depth, the null hypothesis the author wrote is also wrong. With the wrong model and wrong hypothesis for testing, the proposed method does not make any sense.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? No
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: statistics, biostatistics, bioinformatics, sequencing data analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Reviewer Report 29 Apr 2019

Lei Li, Dan L Duncan Comprehensive Cancer Center, Baylor College of Medicine (BCM), Houston, TX, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.18975.r47416

In the manuscript, Shahjaman et al. present a new median control chart method for controlling outliers in RNA-Seq data. They applied this method to simulated data and to a real mouse dataset. In general, a useful bioinformatics method for outlier detection would be valuable for biomarker detection. However, in the present state of the manuscript, the robustness and efficiency of this method are not clear. A few concerns should be addressed to improve the manuscript.

Outlier handling has already been implemented in a few tools such as DESeq2 (https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers). The authors need to compare their method with other available outlier detection methods. Also, which tool parameters did the authors use in their evaluation? Outlier detection performance may depend largely on the parameters used.
The figure legends need to be described more fully. What is the meaning of Figure 1A and Figure 1C: is one before and the other after median control? Also, Figure 1c and Figure 1d?
Why were only three of the five methods used for the real-data performance evaluation?
The released GitHub code does not reproduce the authors' results.
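
To make the point about parameter dependence concrete, the following is a minimal sketch of a generic median-based control rule: log-transform the counts, then flag values outside median ± k × scaled MAD for each gene. It is not the authors' procedure and not DESeq2's Cook's-distance approach. The function name `flag_outliers_median_chart`, the limit width `k`, the pseudo-count of 1 and the toy counts are all assumptions chosen for illustration.

```python
import numpy as np

def flag_outliers_median_chart(counts, k=3.0):
    """Flag values outside median +/- k * scaled MAD, row by row.

    counts: 2-D array (genes x samples) of raw read counts.
    k: hypothetical width of the control limits; results depend on it.
    """
    # log2(count + 1) turns discrete counts into continuous values
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    # Robust center and spread per gene (row)
    center = np.median(logged, axis=1, keepdims=True)
    mad = np.median(np.abs(logged - center), axis=1, keepdims=True)
    # 1.4826 makes the MAD comparable to a standard deviation under normality
    scale = 1.4826 * mad
    lower = center - k * scale
    upper = center + k * scale
    # Note: if a row's MAD is 0, any value different from the median is flagged
    return (logged < lower) | (logged > upper)

# Toy data (made up): the first gene has one extreme count
counts = np.array([
    [10, 12, 11, 9, 500],
    [20, 22, 19, 21, 20],
])
flags = flag_outliers_median_chart(counts)
print(flags)
```

In this toy example only the value 500 in the first row is flagged. Changing `k` moves the limits and therefore changes which observations are treated as outliers. In a real analysis such a rule would more plausibly be applied within each experimental condition and after accounting for sequencing depth.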

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: RNA seq

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

[1] Mortazavi A, Williams BA, McCue K, et al.: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–628.

[2] Beyer M, Mallmann MR, Xue J, et al.: High-resolution transcriptome of human macrophages. PLoS One. 2012; 7(9): e45466.

[3] Wang L, Feng Z, Wang X, et al.: DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010; 26(1): 136–138.

[4] Nagalakshmi U, Waern K, Snyder M: RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010; Chapter 4: Unit 4.11.1–13.

[5] Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1): 139–140.

[6] Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010; 11(10): R106.

[7] Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010; 11: 422.

[8] Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12): 550.

[9] Leng N, Dawson JA, Thomson JA, et al.: EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013; 29(8): 1035–1043.

[10] Zhou X, Lindsay H, Robinson MD: Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014; 42(11): e91.

[11] Di Y, Schafer DW, Cumbie JS, et al.: The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol. 2011; 10(1): 1–18.

[12] Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008; 9(2): 321–332.

[13] George NI, Bowyer JF, Crabtree NM, et al.: An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data. PLoS One. 2015; 10(6): e0125224.

[14] Zwiener I, Frisch B, Binder H: Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS One. 2014; 9(1): e85150.

[15] Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11(3): R25.

[16] Law CW, Chen Y, Shi W, et al.: voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15(2): R29.

[17] Bottomly D, Walter NA, Hunter JE, et al.: Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One. 2011; 6(3): e17820.

[18] Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4(1): 44–57.

[19] Szklarczyk D, Morris JH, Cook H, et al.: The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017; 45(D1): D362–8.

[20] Hsu SD, Lin FM, Wu WY, et al.: miRTarBase: a database curates experimentally validated microRNA-target interactions. Nucleic Acids Res. 2011; 39(Database issue): 163–9.

[21] Shahjaman Md: Simulated Data for figure 1 (Version v1). 2018. http://www.doi.org/10.5281/zenodo.2212881

[22] Shahjaman Md: miRNAs-target gene interactions using the outlying genes identified by the proposed method (Version v1.0.0). 2018. http://www.doi.org/10.5281/zenodo.2279921

[23] Shahjaman Md: Biological process categories for 16 genes (Version v1.0.0). 2018. http://www.doi.org/10.5281/zenodo.2280012

[24] snotjanu: snotjanu/OutMod-RnaSeq v1.0.0 (Version v1.0.0). 2018. http://www.doi.org/10.5281/zenodo.2279405
