Robust and efficient identification of biomarkers from RNA-Seq data using median control chart [version 1; referees: awaiting peer review]

One of the main goals of RNA-seq data analysis is identification Background: of biomarkers that are differentially expressed (DE) across two or more experimental conditions. RNA-seq uses next generation sequencing technology and it has many advantages over microarrays. Numerous statistical methods have already been developed for identification the biomarkers from RNA-seq data. Most of these methods were based on either Poisson distribution or negative binomial distribution. However, efficient biomarker identification from discrete RNA-seq data is hampered by existing methods when the datasets contain outliers or extreme observations. Specially, the performance of these methods becomes more severe when the data come from a small number of samples in the presence of outliers. Therefore, in this study, an attempt is made to propose an outlier detection and modification approach for RNA-seq data to overcome the aforesaid problems of traditional methods. We make our proposed method facilitate in RNA-seq data by transforming the read count data into continuous data. We use median control chart to detect and modify the outlying Methods: observation in a log-transformed RNA-seq dataset. To investigate the performance of the proposed method in absence and presence of outliers, we employ the five popular biomarker selection methods (edgeR, edgeR_robust, DEseq, DEseq2 and limma) both in simulated and real datasets. The simulation results strongly suggest that the performance of the Results: proposed method improved in the presence of outliers. The proposed method also detected an additional 18 outlying DE genes from a real mouse RNA-seq dataset that were not detected by traditional methods. Using the KEGG pathway and gene ontology analysis results we reveal that these genes may be biomarkers, which require validation in a wet lab. Our proposal is to apply the proposed method for biomarker Conclusions: identification from other RNA-seq data.


Introduction
One of the major objectives of researchers is to identify biomarkers from RNA-Seq data that are differentially expressed (DE) between two or more experimental conditions.Microarrays have been extensively used during the past few decades to perform this task.But after reducing the cost of sequencing, biomarker identification using RNA-seq data has emerged as an alternative choice to microarrays 1,2 .RNA-seq uses next generation sequencing technology to produce a vast amount of data.
Curse of dimensionality is a common problem for analyzing RNA-Seq data, which means a "large p, small n" problem.Hence, dimension reduction of the data matrix is a primary objective for further downstream analysis.Identification of biomarkers is one of the dimensionality reduction approaches.
RNA-seq count data inherently follow a Poisson or negative binomial (NB) distribution rather than normal distribution like microarray data.Numerous statistical methods have been developed to identify biomarkers from RNA-seq count data.The earliest method is DEGseq, which is based on Poisson distribution.This method suffers from overdispersion and therefore Poisson distribution-based methods are not suitable for RNA-seq data [3][4][5][6] .To overcome this problem, NB distributionbased methods have been proposed.Some NB based methods are: baySeq, DESeq, DESeq2, EBSeq, edgeR, edgeR (robust) and NBPSeq [7][8][9][10][11] .However, most of the methods cannot estimate properly the gene-wise dispersion parameters and they also suffer from small sample sizes 12 .DESeq, DESeq2, edgeR, edgeR (robust) and NBPSeq incorporate information of all genes in their algorithms.
Despite the popularity of these statistical methods for identification of biomarker genes, they are sensitive to outliers and often produce lower accuracies in the presence of outliers.Outliers may arise in RNA-seq count data because there are several data generating stages from biological harvesting of RNA samples to counting of sequence read map data 13 .To mitigate this issue many algorithms use transformation methods.There are several transformation methods for RNA-seq data: logarithmic transformation 14 , variance-stabilizing transformation (vst) 6 , TMM transformation 15 , regularized logarithm 8 and variance modeling at the observation level (voom) 16 .These methods only reduce the low level outliers into reasonable spaces during parameter estimations; however they fail to reduce the influence of high level outliers with small sample sizes in the data matrix.
Consequently, most biomarker selection methods that use the aforesaid transformations, are sensitive to outliers or extreme values with small-sample sizes.Therefore, in this study, an attempt is made to propose an outlier detection and modification approach for RNA-seq data to improve the performance of the popular biomarker selection methods in the presence of outliers.
To make our proposed method facilitate in RNA-seq data we transform the read count data into continuous data using regularize logarithmic transformation.
The article is organized as follows: Methods briefly describes the logarithmic transformation and formulation the proposed outlier detection and modification approach.In Results and Conclusions a broad simulation study and a real data study have been carried out.

Methods
Let y gik be the number of reads simulated from gth gene of kth replicates in the ith condition (g = 1, 2, . .., G; i = 1,2; k =1, 2, . . ., n i ); where n i is the number of replicated in condition i. G is the total number of gene.Then the negative bionomialdistribution is as follows: Negative Binomial (mean= ,size= ) However, the log-transformed values reduces the influence of low level outlying observations; however this transformation fails to reduce the influence of high level outlying observations.Therefore, we propose the following outlier detection and modification rule.
Biomarker identification using the proposed procedure Since the median and median absolute deviation is the robust measure of location and scale parameter, respectively, therefore, we used the median control chart as a measure of outlier detection.The proposed procedure is as follows: 1. We declare a gene as an outlying gene if it doesn't fall into the interval [LCL, UCL].Where LCL and UCL are the lower and upper control limit for median and they are defined by LCL= MED g,(i) -3×MAD g,(i) and UCL=MED g,(i) +3×MAD g,(i) ].Here, MED g,(i) =median 2. Check the existence of outliers for each gene from each of the conditions (i=1,2), separately using step 1.If outliers exist, replace them by their respective group medians (MED g,(i) ). 3. Apply the anti-log transformation to obtain modified RNA-seq (MRS) count datasets.
4. Apply traditional statistical methods in MRS data to identify biomarker genes using the p-values adjusted by Benjamini-Hochberg method.
5. Obtain the functional annotations and KEGG pathways for detected biomarker genes.

Performance evaluation
In order to evaluate the performance of different biomarkers selection methods we considered the area under the receiver operating characteristic curve (ROC) curve.The ROC is created by plotting the true positive rate (TPR) against the false positive rate (FPR) for different cut-off points of a parameter.For a particular threshold each point on the ROC curve produces a TPR/FPR pair.The area under the ROC curve (AUC) is a performance measure which helps us to select an optimum method that can distinguish between two gene groups such as DE or EE well.

Datasets
To investigate the performance of the proposed method in comparison with five popular methods as mentioned above, for both small-and-large-sample cases with 2 groups/conditions, we considered 100 datasets for both cases with sample sizes of n 1 =n 2 = 3 and 15, respectively.Each dataset for each case represented gene expression profiles for 1000 genes, each with n=(n 1 +n 2 ) samples, where the read counts of each gene was generated using the negative-binomial distribution and this type of simulation study was also used in 11.The number of DE genes were set to 40 for each of the 100 datasets.We divided these 40 DE genes into two groups: 20, up-regulated DE genes and 20, down-regulated DE genes.To show the effect of outliers (extreme high counts) on the methods, we randomly selected 10% and 30% genes and for each of these genes, we selected a single sample randomly and multiplied the observed count of this sample with randomly selected factor between 5 and 10.This process was applied for each of the 100 datasets.We computed average values of different performance measures such as true positive rate (TPR), false positive rate (FPR) and AUC based on 40 estimated DE genes by five methods (edgeR, edgeR_robust, DESeq, DESeq2, and limma (voom)) for each of 100 original datasets and proposed MRS datasets.
We also considered a real RNA-seq mouse dataset 17 to demonstrate the performance of the methods.This dataset consists of 36535 genes with 21 samples.This dataset was downloaded from ReCount website http://bowtie-bio.sourceforge.net/recount.It can also be downloaded from the GEO series accession number GSE26024.Among 21 samples, RNA-seq count expression collected from 10 C57BL/6J (B6) and 11 DBA/2J (D2) inbred mouse strains.

Software
To demonstrate the performance of the proposed method, a comparison with five popular methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)) was performed.We used both simulated and real RNA-seq count datasets.We used three R packages of other methods: edgeR, DESeq and limma.The performance measure area under the receiver operating characteristics curve (AUC) was computed for each of the methods using R package ROCR.All R packages are available in the comprehensive R archive network (cran) or Bioconductor.

Results
Performance evaluation based on simulated dataset Table 1 summarizes the average AUCs estimated by eight methods based on 100 simulated datasets using 4% DE genes in absence and presence of single outlier in each of 10% and 30% genes for both small-sample cases (n 1 =n 2 =3) and large-sample cases (n 1 =n 2 =15), respectively.In Table 1, the results without and within the brackets (.) indicate the estimated AUCs by the five methods using the original RNA-seq datasets and proposed MRS datasets.From Table 1 we observed that in absence of outliers, five methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma (voom)) produced almost similar results using the original RNA-seq datasets and proposed MRS datasets, for both small-and-large-sample cases.However, in the presence of outliers, the performance of these methods has significantly increased using the proposed MRS datasets for both cases.For example, in the presence of 10% outliers, edgeR and DESeq produce AUCs 0.829 and 0.818, respectively for small-sample case.Whereas for the same condition these two methods produce AUCs 0.842 and 0.838, respectively using our proposed MRS datasets.Figure 1a  RNA-seq datasets and proposed MRS datasets.Similar results were found from these boxplots, as in Table 1.

Performance evaluation based on real dataset
After filtering we retain 11,474 genes.To investigate the performance of the proposed method, we employ three methods (edgeR, DEseq and limma) for detection of the biomarkers between the two mouse strains.Figure 2a and b represents the Venn diagram of estimated DE genes by edgeR, DESeq and Limma using the original and MRS dataset, respectively.From Figure 2a, we revealed that edgeR and Limma performed better than DESeq by sharing more genes (414).We also noticed that there are 1925 overlapping DE genes between these methods.
To investigate the performance of the proposed outlier detection and modification approach in this dataset, we first detect and modify the outliers (if any) to get the MRS dataset.We detected 200 outliers in this dataset using the proposed outlier detection rule.The Venn diagram in Figure 2b represents the results of these three methods using the proposed MRS dataset.From this figure we can clearly observe that there are 1956 overlapping  genes detected by these methods.Among these genes there are 18 genes that are declared as outliers by the proposed method and those were not detected as DE genes using the original mouse dataset.
Furthermore, we performed the gene overexpression analyses through Database for Annotation, Visualization and Integrated Discovery (DAVID) 18 to explore the biological process (BF) categories and pathway annotations of the 18 identified outlying DE genes.Out of 18 genes DAVID identified 16 genes.A heatmap is created for these 16 outlying DE genes in Figure 2c.The heatmap correctly clusters the samples between C57BL/6J (B6) and 11 DBA/2J (D2) using these genes.Among the 16 genes, 8 upregulated (Fam46b, Alx3, Dusp2, Pdyn, Agbl2, Pcdh12, Ubl5, Gpx8) and 8 downregulated (Stard5, Ptprc, Slc7a5, Slc24a1, Ehd2, Adgrg3, Tefm, Tsnaxip1) DE genes are identified.GO analysis results showed that upregulated DE genes are significantly enriched in protein side chain deglutamylation, protein deglutamylation and embryo development at BP level.Downregulated DEGs were enriched in negative regulation of CREB transcription factor activity, negative regulation of NIK/NF-kappaB signaling and sterol import at BP level (see extended data).KEGG analysis showed that the upregulated DEGs were mostly enriched in cocaine addiction, glutathione metabolism and thyroid hormone synthesis.The downregulated DEGs were enriched in phototransduction, primary immunodeficiency and central carbon metabolism in cancer (Table 2).We also constructed the protein-protein interaction (PPI) network around the proteins encoded by these 16 outlying genes using STRING database 19 .We considered confidence score 400 for selection these networks.Figure 3 and Figure 4 represent the PPI networks using the up-regulated and down-regulated DEGs, respectively.In addition, we explored miRNAs-target gene interactions from miRTarBase 20 to identify miRNAs.The miRNAs-target gene interactions network is shown in Figure S1 in extended data.

Conclusions
Biomarker identification under two or more conditions is an important task for elucidating the molecular basis of phenotypic variation.Next generation sequencing (RNA-seq) has become very popular and a competitive alternative to microarrays because of reducing the cost of sequencing and limitation of microarrays.A number of methods have been developed for detecting biomarkers from RNA-seq data.However, most of the methods are sensitive to outliers and produce misleading results in the presence of outliers.In this study, we have proposed an outlier detection and modification approach using the median control chart.From the simulation study in the presence of outliers we have observed that the performance of five biomarker selection methods are improved significantly when the datasets are modified by the proposed method, both for small-and large-sample cases.The proposed method also detected an additional 16 outlying genes from a real mouse dataset.From GO and KEGG pathway   enrichment analysis, we have shown that these genes belong to some important pathways.

Extended data
Zenodo: Figure S1: miRNAs-target gene interactions using the outlying genes identified by the proposed method, http://doi.org/10.5281/zenodo.227992122 Zenodo: Table A1.Biological process categories for 16 genes, http://doi.org/10.5281/zenodo.2280012 23he benefits of publishing with F1000Research: Your article is published within days, with no editorial bias You can publish traditional articles, null/negative results, case reports, data notes and more The peer review process is transparent and collaborative Your article is indexed in PubMed after passing peer review Dedicated customer support at every stage For pre-submission enquiries, contact research@f1000.com and b and Figure 1c and d show the boxplot of AUCs based on 100 simulated datasets by each of the methods in absence and presence of outliers for small-and large-sample cases, respectively.The left and right-side panels in this figure indicate the boxplot of estimated AUCs using original

Figure 2 .
Figure 2. Comparison of the DEGs detected by three methods and outlying gene expression profiles for Mouse dataset.Venn diagram of DEGs detected by (a) the edgeR, DESeq and Limma in the original mouse dataset or by (b) the edgeR, DESeq and Limma in the modified mouse dataset using the proposed method.(c) Heatmap of 16 outlying DEGs detected by the proposed method.

Figure 3 .
Figure 3. PPI network using the up-regulated outlying genes identified by the proposed method.

Figure 4 .
Figure 4. PPI network using the down-regulated outlying genes identified by the proposed method.