High throughput biological sequence analysis using machine learning-based integrative pipeline for extracting functional annotation and visualization

Md Al Amin; Feroza Naznin; Most Nilufa Yeasmin; Md Sumon Sarkar; Md Misor Mia; Abdullahi Chowdhury; Md Zahidul Islam

doi:10.12688/f1000research.144871.1

Research Article

[version 1; peer review: 2 approved with reservations]

Md Al Amin¹, Feroza Naznin², Most Nilufa Yeasmin³, Md Sumon Sarkar⁴, Md Misor Mia⁴, Abdullahi Chowdhury⁵, Md Zahidul Islam³

PUBLISHED 07 Mar 2024

¹ Department of Computer Science and Engineering, Prime University, Dhaka, 1216, Bangladesh
² Department of Computer Science and Engineering, Green University of Bangladesh, Dhaka, 1460, Bangladesh
³ Department of Information and Communication Technology, Islamic University, Kushtia, 7003, Bangladesh
⁴ Department of Pharmacy, Islamic University, Kushtia, 7003, Bangladesh
⁵ Department of Computer Science and Engineering, East West University, Dhaka, 1212, Bangladesh

Md Al Amin
Roles: Conceptualization, Formal Analysis, Methodology, Resources, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Feroza Naznin
Roles: Conceptualization, Methodology, Resources, Validation, Writing – Review & Editing

Most Nilufa Yeasmin
Roles: Validation, Visualization, Writing – Review & Editing

Md Sumon Sarkar
Roles: Software, Validation, Writing – Review & Editing

Md Misor Mia
Roles: Software, Visualization, Writing – Review & Editing

Abdullahi Chowdhury
Roles: Writing – Original Draft Preparation, Writing – Review & Editing

Md Zahidul Islam
Roles: Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Machine Learning in Drug Discovery and Development collection.

Abstract

Differential Gene Expression (DGE) analysis identifies expressed genes using measures such as log-fold change and adjusted p-values. Fold change is widely used in gene expression studies, especially microarray and RNA sequencing experiments, to quantify changes in a gene’s expression level, but it is inherently biased: it can misclassify genes that show large absolute differences but small ratios, which leads to poor detection of changes in highly expressed genes. Machine learning (ML), in contrast, offers a more comprehensive view: it can capture the non-linear complexity of gene expression data and is robust to noise. This motivated us to use ML models to explore differential gene expression, based on feature importance, in Type 2 Diabetes (T2D), a significant global health concern. We validated the resulting biomarker genes against previous studies to confirm the effectiveness of our ML models, and then analyzed pathways, gene ontologies, protein-protein interactions, transcription factors, miRNAs, and candidate drugs for T2D. This study aims to show that ML is an effective way to study expressed genes in depth without relying on the DGE approach, and to support drug development researchers in controlling or reducing the risk faced by T2D patients.

Keywords

Bioinformatics, Machine Learning, Type-2 Diabetes, Proteins, Pathways, Gene Ontology, RNA-Sequence, Drug

Corresponding authors: Abdullahi Chowdhury, Md Zahidul Islam

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2024 Al Amin M et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Al Amin M, Naznin F, Yeasmin MN et al. High throughput biological sequence analysis using machine learning-based integrative pipeline for extracting functional annotation and visualization [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:161 (https://doi.org/10.12688/f1000research.144871.1) First published: 07 Mar 2024, 13:161 (https://doi.org/10.12688/f1000research.144871.1) Latest published: 07 Mar 2024, 13:161 (https://doi.org/10.12688/f1000research.144871.1)

1. Introduction

Differential Gene Expression (DGE) analysis, usually based on the DESeq2 package,¹ is a traditional and common bioinformatics technique that identifies genes whose expression levels vary between conditions.² In RNA sequencing data, however, fold change can be biased, potentially misclassifying genes with large absolute differences but small relative ratios.³ Meanwhile, the advent of Machine Learning (ML) has changed bioinformatics significantly; ML is now widely acknowledged as a powerful tool for explaining complex data that were once difficult to interpret.⁴ ML techniques are also gaining popularity in medicine, where they have proven effective for decision-making.⁵^,⁶ Various ML algorithms have been applied to RNA sequencing data for different detection tasks and for finding correlations between sequences,⁷ and have proven effective at detecting splice variants from RNA sequencing data.⁸ For example, support vector machines, random forests, and neural networks have been applied to microarray data sets to identify and classify cancers early.⁹ Another study used a neural network to analyze expressed genes from different RNA sequencing datasets and predict a patient’s health status.¹⁰ A further study classified different types of cancer from patterns in gene expression data, aiming to improve the accuracy and efficiency of cancer diagnosis and potentially lead to more targeted and effective treatments.¹¹ These studies inspired us to apply ML models in bioinformatics, particularly to RNA sequencing count data.
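The fold-change bias described above can be made concrete with a small worked example. The sketch below uses made-up mean read counts (not values from the study) and a hypothetical helper, `log2_fold_change`, with a pseudocount of 1; a cutoff of |log2FC| > 1 is assumed purely for illustration.

```python
import math

def log2_fold_change(case_mean, control_mean, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))

# Hypothetical mean read counts, chosen only to illustrate the bias.
high_gene = {"control": 10_000.0, "case": 13_000.0}  # large absolute shift, small ratio
low_gene = {"control": 5.0, "case": 20.0}            # small absolute shift, large ratio

for name, gene in (("high_gene", high_gene), ("low_gene", low_gene)):
    lfc = log2_fold_change(gene["case"], gene["control"])
    diff = gene["case"] - gene["control"]
    passes = abs(lfc) > 1.0  # assumed cutoff, for illustration only
    print(f"{name}: difference = {diff:.0f} reads, log2FC = {lfc:.2f}, passes cutoff = {passes}")
```

Under these assumptions the highly expressed gene changes by 3,000 reads yet fails the ratio cutoff, while the weakly expressed gene changes by 15 reads and passes it.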

Type 2 Diabetes (T2D), sometimes referred to simply as diabetes, is a long-term illness that affects metabolism.¹² According to the International Diabetes Federation (IDF), diabetes caused around 6.7 million deaths in 2021, making it one of the ten leading causes of death worldwide, and around 541 million adults are affected by T2D.¹³ It was also projected in 2021 that 643 million people will have diabetes by 2030 and 783 million by 2045.¹⁴ However, the risk of serious complications from T2D is greatly reduced if it is diagnosed in its early stages.¹⁵ Advances in biotechnology have produced several bioinformatics tools that support the study of T2D.¹⁶ Other groups of researchers have relied on ML-based decision support systems for forecasting chronic illnesses.¹⁷^–¹⁹ Researchers have also suggested using ML-based classification models to estimate the prevalence of T2D from its risk factors.²⁰^–²³ These findings encouraged us to focus on T2D.

In our research on individuals with T2D, we used an XGBoost-based feature importance method to identify highly expressed genes from RNA sequencing count data, while classifying individuals as T2D or non-T2D from the same counts. This approach was used instead of relying solely on adjusted p-values and log fold change values to determine significant genes. We trained several algorithms on the RNA sequencing count data and achieved notable prediction accuracies, with XGBoost standing out. The approach not only enhances gene detection accuracy but also challenges traditional bioinformatics metrics, suggesting a richer, ML-driven perspective on the genetic landscape of diseases like T2D. We then ran several bioinformatics analyses on the detected genes (the significant features), covering pathways, gene ontologies, protein-protein interactions, transcription factors, miRNAs, and drug prediction for T2D. Importantly, we validated our findings against past studies to show the effectiveness of our models. We hope this study will help researchers learn more about altered genes through ML and consider T2D prevention strategies based on bioinformatics analyses such as the drug prediction discussed in this paper.

2. Methods

2.1 Data acquisition and pre-processing

We began our study by selecting the RNA sequencing count dataset GSE81608²⁴ from GEO.²⁵ This dataset was chosen for its reliability, especially in the context of biological data. Its details are given in Table 1. For ease of access and analysis, we downloaded the raw count data from GREIN.²⁶ The data were then pre-processed with the “pandas” library²⁷ to prepare them for machine learning training.
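The pre-processing steps are not spelled out beyond the use of pandas, so the following is only a sketch of what such a step might involve. It assumes a count table with one row per gene and one column per sample, drops genes with no reads in any sample, and transposes the table so that each sample becomes one feature row. The data, the labels, and the helper `build_feature_matrix` are all hypothetical, and the sketch uses only the standard library so that it runs without extra dependencies.

```python
import csv
import io

# Tiny made-up count table in an assumed layout: one row per gene, one column per sample.
RAW_COUNTS = (
    "gene\tS1\tS2\tS3\tS4\n"
    "GENE_A\t10\t0\t7\t3\n"
    "GENE_B\t0\t0\t0\t0\n"
    "GENE_C\t250\t310\t120\t95\n"
)
# Hypothetical sample labels: 1 = T2D, 0 = non-diabetic.
LABELS = {"S1": 1, "S2": 1, "S3": 0, "S4": 0}

def build_feature_matrix(raw_text, labels):
    """Turn a genes-by-samples count table into (gene names, X, y)."""
    rows = list(csv.reader(io.StringIO(raw_text), delimiter="\t"))
    samples = rows[0][1:]
    # Drop genes with zero reads in every sample: they carry no signal.
    kept = [
        (row[0], [int(v) for v in row[1:]])
        for row in rows[1:]
        if any(int(v) for v in row[1:])
    ]
    genes = [name for name, _ in kept]
    # Transpose so that each sample becomes one row of features.
    X = [[counts[j] for _, counts in kept] for j in range(len(samples))]
    y = [labels[s] for s in samples]
    return genes, X, y

genes, X, y = build_feature_matrix(RAW_COUNTS, LABELS)
print(genes)
print(X)
print(y)
```

With pandas, the same two steps would amount to a row filter followed by a transpose of the DataFrame.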

Table 1. Overview of dataset.

Disease Name	GEO Accession	GEO Platform	Tissues/Cells	Genes per sample	Case Samples	Control Samples
Type 2 Diabetes	GSE81608	Illumina HiSeq 2500	Alpha/Beta/Delta/PP	28089	949	651

2.2 Machine learning analysis

On the pre-processed data, we trained six distinct ML models (Random Forest, AdaBoost, Gradient Boosting, Logistic Regression, Decision Tree, and XGBoost), using 80% of the dataset for training and the remaining 20% for testing. The models were trained on two classes: diabetes and non-diabetes. Our primary objective was to identify significant features,²⁸ which in this study we treated as differentially expressed genes.²⁹ The entire machine learning process on this dataset is depicted in Figure 2. In addition, the RNA sequencing data and DGE are visualized in Figure 4.
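The 80/20 split can be sketched as follows. The text does not say whether the split was stratified or which random seed was used; the sketch assumes a stratified split (each class divided in the same proportion) with an arbitrary seed, and the helper `stratified_split` is hypothetical. Only the class sizes come from Table 1.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class sizes from Table 1: 949 case samples and 651 control samples.
labels = [1] * 949 + [0] * 651
train_idx, test_idx = stratified_split(labels)
n_test_cases = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_cases)
```

Under these assumptions the test set holds 320 samples, 190 of them cases, and the training set holds the remaining 1,280.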

2.3 Bioinformatics analysis

We examined the significant features further with a set of bioinformatics techniques: protein-protein interaction (PPI) network analysis, gene ontology and pathway analysis, transcription factor (TF) and miRNA analysis, hub gene extraction, and protein-drug interaction analysis. To ensure the robustness and validity of our methods, we validated the results against existing literature. The overall methodology is shown in Figure 1.

For gene ontology and pathway analysis, we used EnrichR.³⁰ This tool provided insights into the three aspects of gene ontology: biological processes, cellular components, and molecular functions. To enrich the pathway analysis, we drew on the KEGG,³¹ Reactome,³² WikiPathways,³³ Elsevier, and BioCarta³⁴ databases. Throughout, an adjusted p-value below 0.05 was the threshold for significant pathways.

Protein-protein interactions were explored with the STRING online tool.³⁵ We then built a hub-protein network using the Cytoscape application with the Cytohubba plugin.³⁶ For transcription factors, we consulted the JASPAR³⁷ and ChEA³⁸ databases, which allowed us to identify topologically plausible TFs that might interact with our differentially expressed genes (DEGs). The TarBase³⁹ and miRTarBase⁴⁰ databases were used to identify miRNA-DEG interactions. NetworkAnalyst⁴¹ was central to our analysis of the TF-gene and miRNA-gene interaction networks. Finally, for drug-protein interactions we relied on DrugBank,⁴² a comprehensive online resource of medicines and their associated drug targets.
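The adjusted p-value threshold used for pathway selection can be illustrated with a short sketch. It assumes the adjustment is the Benjamini-Hochberg procedure, which is the usual choice in enrichment tools but is not stated in the text. The term names and raw p-values are made up, and `benjamini_hochberg` is a hypothetical helper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four hypothetical pathway terms.
raw = {"term_1": 0.001, "term_2": 0.02, "term_3": 0.03, "term_4": 0.5}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
significant = [term for term, q in adjusted.items() if q < 0.05]
print(adjusted)
print(significant)
```

In this toy input, three of the four terms remain below the 0.05 threshold after adjustment.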

Figure 1. A diagrammatical depiction of the methodology used in this study.

Figure 2. Supervised learning to diagnose diabetes.

3. Results

3.1 Machine learning model results and evaluation metrics

Evaluating machine learning models is crucial in every domain, and it is indispensable for biological data.⁷⁹ Accuracy is a commonly used metric, but on its own it may not give a comprehensive assessment, especially for biological or health-related datasets. To obtain a clearer picture of each model’s efficacy, we therefore used a range of metrics: Precision,⁴³ Recall,⁴⁴ F1-Score,⁴⁵ Specificity,⁴⁶ RocAuc, True Positive Rate (TPR), and False Positive Rate (FPR). Together these metrics capture different aspects of a model’s predictive capability. The confusion matrix is the foundation here, since the measures used to evaluate a classification model are derived from it.

To diagnose T2D from RNA sequencing data, we evaluated five supervised models: Gradient Boosting, XGBoost, Logistic Regression, Random Forest, and AdaBoost. XGBoost performed well, reaching the joint-highest prediction accuracy (0.941) and the second-lowest Log-Loss (0.282), with Precision 0.943, Recall 0.958, RocAuc 0.937, Specificity 0.915, F1-Score 0.950, TPR 0.958, and FPR 0.085. Gradient Boosting matched this accuracy (0.941) and had the lowest Log-Loss (0.280), with Precision 0.948, Recall 0.953, RocAuc 0.938, Specificity 0.923, F1-Score 0.950, TPR 0.953, and FPR 0.077. Logistic Regression ranked third with an accuracy of 0.884 and Random Forest fourth with 0.772, while AdaBoost had the lowest accuracy, 0.744. All values are listed in Table 2, and Figure 3 compares the models by Accuracy and Log-Loss.
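Log-Loss is worth a brief illustration because it can disagree with accuracy: in Table 2, Logistic Regression is more accurate than Random Forest yet has the higher Log-Loss. The sketch below implements the standard binary cross-entropy on made-up probabilities; `log_loss` here is a hypothetical helper, not the function used in the study.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 1, 0, 0]
all_correct = [0.9, 0.8, 0.2, 0.1]        # every sample on the right side of 0.5
one_confident_miss = [0.9, 0.8, 0.2, 0.99]  # last sample is confidently wrong

print(f"all correct:        {log_loss(y_true, all_correct):.3f}")
print(f"one confident miss: {log_loss(y_true, one_confident_miss):.3f}")
```

A single confidently wrong prediction dominates the average, which is why a model with good accuracy can still show a poor Log-Loss.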

Table 2. Assessing model performance: evaluation metric scores.

ML Models	Accuracy	Log_Loss	Precision	Recall	RocAuc	Specificity	F1-Score	TPR	FPR
XGBoost	0.941	0.282	0.943	0.958	0.937	0.915	0.950	0.958	0.085
Gradient Boosting	0.941	0.280	0.948	0.953	0.938	0.923	0.950	0.953	0.077
Logistic Regression	0.884	0.711	0.909	0.895	0.882	0.869	0.902	0.895	0.131
Random Forest	0.772	0.527	0.755	0.911	0.740	0.569	0.826	0.911	0.431
AdaBoost	0.744	0.582	0.748	0.858	0.717	0.577	0.799	0.858	0.423
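The metrics in Table 2 all derive from the four cells of a confusion matrix. The raw counts are not reported, so the counts below are an assumption: they were chosen for a 320-sample test set (190 cases, 130 controls, i.e. 20% of the samples in Table 1) because they reproduce the XGBoost row after rounding. The helper `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Derive Table 2 style metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # identical to the true positive rate (TPR)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),     # equals 1 - specificity
    }

# Assumed counts (not reported in the paper) for a 320-sample test set.
metrics = classification_metrics(tp=182, fn=8, tn=119, fp=11)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that TPR is the same quantity as Recall and FPR is one minus Specificity, so those columns of Table 2 carry no independent information.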

Figure 3. Comparison of machine learning models based on test accuracies.

3.2 Differential gene analysis using feature importance

Although we trained five different models, we selected the top 100 important genes (our significant features, i.e. expressed genes) from the best-performing model, XGBoost, which reached an accuracy of 0.941 (Figure 3). The predicted expressed genes are listed in the Extended data (supplementary file 1). By using this method, we bypassed the conventional Differential Gene Expression (DGE) analysis. The XGBoost parameters we used are given in Table 3.
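Selecting the top-ranked genes from a fitted model reduces to sorting features by their importance score. The sketch below assumes the scores are already available as a mapping from gene name to importance (with XGBoost's scikit-learn interface these would typically be read from the fitted model's `feature_importances_` attribute, though the text does not say which importance type was used). The gene names, scores, and the helper `top_k_features` are placeholders.

```python
def top_k_features(importances, k=100):
    """Return up to k feature names with the largest non-zero importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked[:k] if score > 0]

# Placeholder gene names and scores, purely for illustration.
importances = {"GENE_A": 0.02, "GENE_B": 0.31, "GENE_C": 0.0, "GENE_D": 0.12}
print(top_k_features(importances, k=3))
```

Features with zero importance are dropped even when fewer than k remain, since a gene the model never used carries no evidence of differential expression.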

Table 3. Important XGBoost parameters.

Parameter	Value	Description
`learning_rate (eta)`	0.5	Controls the contribution of each tree. Lower values make optimization more robust and help prevent overfitting.
`max_depth`	8	Maximum depth of a tree. Increasing it can lead to overfitting.
`subsample`	1	Fraction of training data sampled for growing each tree. A value of 1 uses all samples; values below 1 can help prevent overfitting.
`colsample_bytree`	1	Fraction of features sampled for building each tree. A value of 1 uses all features; values below 1 can help prevent overfitting.
`min_child_weight`	1	Minimum sum of instance weight needed in a child. Higher values make the algorithm conservative.
`gamma`	0	Minimum loss reduction for further partition on a leaf node. Acts as tree regularization.
`lambda (reg_lambda)`	1	L2 regularization term on weights. Avoids overfitting.
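For reproducibility, the values in Table 3 can be written as a parameter dictionary. The sketch below uses the keyword names of XGBoost's scikit-learn style interface; any setting not listed in Table 3 (number of trees, objective, random seed) is unknown and is deliberately left out rather than guessed.

```python
# Table 3 values as keyword arguments, using XGBoost's scikit-learn style names.
xgb_params = {
    "learning_rate": 0.5,     # eta
    "max_depth": 8,
    "subsample": 1,           # use every training sample for each tree
    "colsample_bytree": 1,    # use every feature for each tree
    "min_child_weight": 1,
    "gamma": 0,               # no minimum loss reduction required to split
    "reg_lambda": 1,          # L2 regularization on leaf weights
}

# Assumed usage (not run here): xgboost.XGBClassifier(**xgb_params)
for name, value in xgb_params.items():
    print(f"{name} = {value}")
```

With `subsample` and `colsample_bytree` both at 1 and `gamma` at 0, the only regularization in this configuration comes from `max_depth`, `min_child_weight`, and `reg_lambda`.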

3.3 Pathway enrichment and gene ontology evaluation

Using the computational tool EnrichR, we performed gene set enrichment analysis on the T2D DEGs, drawing on five pathway databases. The top 20 signaling pathway terms are presented in Figure 5. The top 10 terms for biological processes, molecular functions, and cellular components are listed in Table 4. Both the GO terms and the pathways were filtered by adjusted p-value (mostly below 0.05) and sorted in ascending order.
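The p-value behind an over-representation result of this kind can be sketched with the hypergeometric tail, which is equivalent to a one-sided Fisher exact test. This is an assumption about how the enrichment p-values are computed, and every number below (background size, gene set size, overlap) is hypothetical; only the query size of 100 mirrors the top 100 genes used in this study.

```python
from math import comb

def enrichment_p_value(overlap, gene_set_size, query_size, background_size):
    """P(X >= overlap) when query_size genes are drawn without replacement."""
    upper = min(gene_set_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(gene_set_size, k) * comb(background_size - gene_set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 100 query genes, a 50-gene term, 20,000 background genes.
p = enrichment_p_value(overlap=3, gene_set_size=50, query_size=100, background_size=20_000)
print(f"p = {p:.5f}")
```

With an expected overlap of only 0.25 genes under these assumptions, observing three shared genes is unlikely by chance, which is why terms supported by two or three genes can still reach small p-values in Table 4.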

Figure 4. Differential gene analysis using machine learning technique.

Figure 5. Pathway enrichment summary of type 2 diabetes DEGs.

Table 4. Exploration of DEGs of type 2 diabetes from an ontological perspective.

Category	GO ID	Term	P-value	Genes
GO Biological Process	GO:0002484	antigen processing and presentation of endogenous peptide antigen via MHC class I via ER pathway	0.000511	HLA-C;HLA-A
	GO:0009954	proximal/distal pattern formation	0.000679	CYP26B1;SIX3
	GO:0046631	alpha-beta T cell activation	0.000679	HLA-A;INS
	GO:0032024	positive regulation of insulin secretion	0.0009010	PSMD9;BAD;CFTR
	GO:0050709	negative regulation of protein secretion	0.0009726	PSMD9;CYP51A1;INS
	GO:0090277	positive regulation of peptide hormone secretion	0.0012945	PSMD9;BAD;INS
	GO:0016050	vesicle organization	0.0015751	SAR1A;ALS2;SNX11
	GO:1902285	semaphorin-plexin signaling pathway involved in neuron projection guidance	0.0018624	PLXND1;SEMA3F
	GO:0010564	regulation of cell cycle process	0.0019787	SIX3;KLHL13;CDK13;MAD1L1
	GO:0045859	regulation of protein kinase activity	0.0022649	FGR;DBNDD1;ALS2;CAMKK1
GO Cellular Component	GO:0042612	MHC class I protein complex	0.0003664	HLA-C;HLA-A
	GO:0031264	death-inducing signaling complex	0.0006795	CASP10;CFLAR
	GO:0012507	ER to Golgi transport vesicle membrane	0.0025005	SAR1A;HLA-C;HLA-A
	GO:0042611	MHC protein complex	0.0044344	HLA-C;HLA-A
	GO:0030666	endocytic vesicle membrane	0.0081768	M6PR;HLA-C;HLA-A;CFTR
	GO:0098553	lumenal side of endoplasmic reticulum membrane	0.0085962	HLA-C;HLA-A
	GO:0071556	integral component of lumenal side of endoplasmic reticulum membrane	0.0085962	HLA-C;HLA-A
	GO:0070013	intracellular organelle lumen	0.0098880	ASAH1;VGF;POLDIP2;NDUFAF7;CSNK2B;FUCA2;CDK13;HS3ST1;CHGB;INS
	GO:0005769	early endosome	0.0108350	ASAH1;ALS2;HLA-C;HLA-A;CFTR
	GO:0101002	ficolin-1-rich granule	0.0137120	ASAH1;CSNK2B;ENPP4;CDK13
GO Molecular Function	GO:0097199	cysteine-type endopeptidase activity involved in apoptotic signaling pathway	0.0010850	CASP10;CFLAR
	GO:0016274	protein-arginine N-methyltransferase activity	0.0013218	NDUFAF7;PRMT1
	GO:0097200	cysteine-type endopeptidase activity involved in execution phase of apoptosis	0.0018624	CASP10;CFLAR
	GO:0015174	basic amino acid transmembrane transporter activity	0.0018624	SLC7A7;SLC7A2
	GO:0097153	cysteine-type endopeptidase activity involved in apoptotic process	0.0024908	CASP10;CFLAR
	GO:0043425	bHLH transcription factor binding	0.0053564	PSMD9;KDM1A
	GO:0005179	hormone activity	0.0070353	VGF;CHGB;INS
	GO:0001222	transcription corepressor binding	0.0092030	CTBP1;SIX3
	GO:0048018	receptor ligand activity	0.0190294	VGF;SEMA3F;WNT16;CHGB;INS
	GO:0140297	DNA-binding transcription factor binding	0.0205433	PSMD9;KDM1A;CTBP1;MIXL1

3.4 Establishing a PPI network and discovering hub genes

We used STRING to build the PPI network and Cytoscape to visualize it, in order to predict the pathways and recurrent interactions among the DEGs. Highly connected proteins were identified from topological metrics, such as a degree greater than 15. The PPI network of the most prominent DEGs contains 75 nodes and 226 edges (Figure 6). Hub genes are strongly associated within functional modules and rank in the top 10% by connectivity; because of these interconnections, they typically play a crucial role in biological systems. We used Cytoscape’s Cytohubba plugin to find the top 18 DEGs (hub genes). Figure 7 shows these hub genes: TP53, INS, KDM1A, SNAI1, RCOR1, CTBP1, RPA1, RAD52, SQLE, CYP51A1, CFTR, CPE, C3, PRMT1, NFYB, CD38, CFP, and CASP10. These hub proteins could be useful as therapeutic targets, although their roles still need to be explored. The T2D-related DEGs identified as hub genes are summarized in Table 5.
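The degree criterion for highly connected proteins can be sketched on an edge list. This shows degree ranking only; it is not the MCC or BottleNeck scoring that Cytohubba provides. The network is a placeholder whose node names do not refer to real proteins, and `node_degrees` and `hubs_by_degree` are hypothetical helpers.

```python
def node_degrees(edges):
    """Count the distinct partners of each node in an undirected network."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {node: len(partners) for node, partners in neighbours.items()}

def hubs_by_degree(edges, min_degree):
    """Return (node, degree) pairs with degree above min_degree, highest first."""
    degrees = node_degrees(edges)
    ranked = sorted(degrees.items(), key=lambda item: (-item[1], item[0]))
    return [(node, deg) for node, deg in ranked if deg > min_degree]

# Placeholder network; the node names do not refer to real proteins.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
print(hubs_by_degree(edges, min_degree=2))
```

On the real network the threshold would be the degree cutoff given in the text (greater than 15) rather than the toy value used here.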

Figure 6. PPI network consisting of type 2 diabetes DEGs.

The circular nodes represent the proteins encoded by differentially expressed genes, and the edges represent the interactions between nodes. The PPI network consists of 75 nodes connected by 226 edges. It was generated with STRING and visualized in Cytoscape.

Figure 7. Determining hub genes in the cluster using Cytohubba.

The MCC and BottleNeck methods available in the Cytohubba plugin were used to obtain hub genes. The top 14 hub genes from each method are highlighted, along with the links between them and other proteins. The BottleNeck network contains 58 nodes and 100 edges, whereas the MCC network has 48 nodes and 90 edges.

Table 5. Hub genes identified from the protein-protein interaction analysis of the DEGs.

Gene Symbol	Description	Feature
TP53	Tumor Protein P53	DNA-binding transcription factor activity
INS	Insulin	Identical protein binding and protease binding
KDM1A	Lysine Demethylase 1A	DNA-binding transcription factor activity and enzyme binding
SNAI1	Snail Family Transcriptional Repressor 1	Sequence-specific DNA binding and DNA-binding transcription repressor activity
RCOR1	REST Corepressor 1	Chromatin binding and DNA-binding transcription repressor activity
CTBP1	C-Terminal Binding Protein 1	DNA-binding transcription factor activity and transcription factor binding
RPA1	Replication Protein A1	Nucleic acid binding and single-stranded DNA binding
RAD52	RAD52 Homolog, DNA Repair Protein	Identical protein binding and DNA strand exchange activity
SQLE	Squalene Epoxidase	Oxidoreductase activity and squalene monooxygenase activity
CYP51A1	Cytochrome P450 Family 51 Subfamily A Member 1	Iron ion binding and oxidoreductase activity
CFTR	CF Transmembrane Conductance Regulator	Enzyme binding and PDZ domain binding
CPE	Carboxypeptidase E	Cell adhesion molecule binding and carboxypeptidase activity
C3	Complement C3	Signaling receptor binding and C5L2 anaphylatoxin chemotactic receptor binding
PRMT1	Protein Arginine Methyltransferase 1	RNA binding and methyltransferase activity
NFYB	Nuclear Transcription Factor Y Subunit Beta	DNA-binding transcription factor activity and protein heterodimerization activity
CD38	CD38 Molecule	Transferase activity and hydrolase activity, acting on glycosyl bonds
CFP	Complement Factor Properdin	Positively regulates the alternative complement pathway of the innate immune system
CASP10	Caspase 10	Ubiquitin protein ligase binding and cysteine-type peptidase activity

3.5 Recognition of transcriptional and post-transcriptional regulators

We used a network-based strategy to identify the regulatory TFs and miRNAs, in order to locate substantial transcriptional changes and learn more about the regulators of the hub proteins. Transcription factors are proteins that govern gene activity and transcription in all life forms.⁴⁷ miRNAs are small RNA molecules that regulate gene expression post-transcriptionally. We investigated the interactions between DEGs and TFs (Figure 8) and between DEGs and miRNAs (Figure 9). The main TFs regulating the DEGs were ELK4, FOXC1, FOXL1, GATA2, JUN, MEF2A, NFIC, NFKB1, POU2F2, PPARG, RELA, TEAD1, USF2, YY1, PRRX2, STAT3, TP53, E2F1, CREB1, NANOG, CREM, RUNX1, TP63, AR, HNF4A, POU5F1, SOX2, MITF, SPI1, MYC, FLI1, SUZ12, and EGR1. The miRNAs mir-6883-5p, mir-6785-5p, mir-149-3p, mir-4728-5p, mir-17-5p, mir-210-3p, mir-374a-5p, mir-21-3p, mir-129-2-3p, mir-7-5p, mir-16-5p, mir-1-3p, mir-124-3p, mir-155-5p, mir-27a-3p, mir-34a-5p, let-7b-5p, and mir-107 were identified as post-transcriptional regulators of the DEGs. Table 6 summarizes both the transcriptional and the post-transcriptional regulators of the T2D-related DEGs.

Figure 8. Network of regulatory interactions between DEGs and TFs, generated with NetworkAnalyst.

The circular cyan nodes represent transcription factors, and the circular red nodes represent the genes that interact with them.

Figure 9. Network of regulatory interactions between DEGs and miRNAs.

The square nodes represent miRNAs, and the circular nodes represent the genes that interact with them.

Table 6. Overview of transcriptional and post-transcriptional regulatory biomolecules of differentially expressed genes of T2D: (a) transcriptional regulators and (b) post-transcriptional regulators.

Symbol	Description	Feature
(a) Transcriptional regulators
ELK4	ETS Transcription Factor ELK4	DNA-binding transcription factor activity and chromatin binding
FOXC1	Forkhead Box C1	DNA-binding transcription factor activity and transcription factor binding
FOXL1	Forkhead Box L1	DNA-binding transcription factor activity and transcription factor activity
GATA2	GATA Binding Protein 2	DNA-binding transcription factor activity and chromatin binding
JUN	Jun Proto-Oncogene	RNA binding and sequence-specific DNA binding
MEF2A	Myocyte Enhancer Factor 2A	DNA-binding transcription factor activity and protein heterodimerization activity
NFIC	Nuclear Factor I C	DNA-binding transcription activator activity, RNA polymerase II-specific
NFKB1	Nuclear Factor Kappa B Subunit 1	DNA-binding transcription factor activity and sequence-specific DNA binding
POU2F2	POU Class 2 Homeobox 2	DNA-binding transcription factor activity and protein domain specific binding
PPARG	Peroxisome Proliferator-Activated Receptor Gamma	DNA-binding transcription factor activity and chromatin binding
RELA	RELA Proto-Oncogene	DNA-binding transcription factor activity and identical protein binding
TEAD1	TEA Domain Transcription Factor 1	DNA-binding transcription factor activity
USF2	Upstream Transcription Factor 2	DNA-binding transcription factor activity and sequence-specific DNA binding
YY1	YY1 Transcription Factor	DNA-binding transcription factor activity and transcription coactivator activity
PRRX2	Paired Related Homeobox 2	DNA-binding transcription factor activity and sequence-specific DNA binding
STAT3	Signal Transducer And Activator Of Transcription 3	DNA-binding transcription factor activity and sequence-specific DNA binding
TP53	Tumor Protein P53	DNA-binding transcription factor activity and protein heterodimerization activity
E2F1	E2F Transcription Factor 1	DNA-binding transcription factor activity and transcription factor binding
CREB1	CAMP Responsive Element Binding Protein 1	DNA-binding transcription factor activity and enzyme binding
NANOG	Nanog Homeobox	DNA-binding transcription factor activity and chromatin binding
CREM	CAMP Responsive Element Modulator	DNA-binding transcription factor activity and core promoter sequence-specific DNA binding
RUNX1	RUNX Family Transcription Factor 1	DNA-binding transcription factor activity and protein homodimerization activity
TP63	Tumor Protein P63	DNA-binding transcription factor activity and identical protein binding
AR	Androgen Receptor	DNA-binding transcription factor activity and chromatin binding
HNF4A	Hepatocyte Nuclear Factor 4 Alpha	DNA-binding transcription factor activity and sequence-specific DNA binding
POU5F1	POU Class 5 Homeobox 1	RNA binding and sequence-specific DNA binding
SOX2	SRY-Box Transcription Factor 2	DNA-binding transcription factor activity and protein heterodimerization activity
MITF	Melanocyte Inducing Transcription Factor	RNA polymerase II cis-regulatory region sequence-specific DNA binding
SPI1	Spi-1 Proto-Oncogene	DNA-binding transcription factor activity and RNA binding
MYC	MYC Proto-Oncogene	RNA polymerase II cis-regulatory region sequence-specific DNA binding
FLI1	Fli-1 Proto-Oncogene	DNA-binding transcription factor activity and chromatin binding
SUZ12	SUZ12 Polycomb Repressive Complex 2 Subunit	Sequence-specific DNA binding and chromatin binding
EGR1	Early Growth Response 1	DNA-binding transcription factor activity and transcription factor binding
(b) Post-transcriptional regulators
mir-17-5p	MicroRNA 17	Improved inflammation-induced insulin resistance by suppressing ASK1 expression in macrophages
mir-4728-5p	MicroRNA 4728	Promotes proliferation and migration in breast cancer cells
mir-149-3p	MicroRNA 149	Role in obesity-associated metabolic abnormalities
mir-6785-5p	MicroRNA 6785	Role in tumor proliferation and invasion
mir-6883-5p	MicroRNA 6883	Induces G1-phase cell-cycle arrest in colon cancer cells
mir-210-3p	MicroRNA 210	Plays a protective role in cardiovascular homeostasis and is decreased in whole blood of T2DM mice
mir-374a-5p	MicroRNA 374a	Regulates the inflammatory response in diabetic nephropathy by targeting MCP-1 expression
mir-21-3p	MicroRNA 21	Enhances glucose uptake and subsequently promotes insulin secretion
mir-129-2-3p	MicroRNA 129-2	Involved in inflammatory responses and apoptosis
mir-7-5p	MicroRNA 7	Regulates GLP-1-mediated insulin release
mir-16-5p	MicroRNA 16	miR-16 deletion leads to insulin resistance in males and exacerbated glucose intolerance in females
mir-1-3p	MicroRNA 1	Positively associated with important characteristics of pre-diabetes, including glycaemic abnormalities and insulin resistance
mir-124-3p	MicroRNA 124	Acts as a negative regulator that inhibits the proliferation, migration and invasion of cancer cells
mir-155-5p	MicroRNA 155	Plays a crucial role in the pathogenesis of diabetes mellitus (DM) and its complications
mir-27a-3p	MicroRNA 27a	Promotes insulin resistance and mediates glucose metabolism by targeting PPAR-gamma-mediated PI3K/AKT signaling
mir-34a-5p	MicroRNA 34a	Affects the development and functional maturity of beta-cells, which in turn decreases glucose tolerance and insulin secretion
let-7b-5p	MicroRNA Let-7b	Regulates glucose metabolism and insulin sensitivity
mir-107	MicroRNA 107	Higher miR-107 expression is related to insulin resistance in the diabetic group

3.6 Detection of potential medications

Protein-drug interaction analysis⁴⁸ is needed to understand the structural features involved in signal transduction. Using NetworkAnalyst with drug-protein connections from the DrugBank library, we identified 18 potential drugs targeting the DEGs as possible pharmacological candidates for T2D. Figure 10 shows 14 well-known therapeutic agents found in the protein-drug interactions of the T2D DEGs: Insulin Human, Dalteparin, Lovastatin, Atorvastatin, Insulin glargine, Myristic acid, M-cresol, Insulin peglispro, L-lysine, L-ornithine, Ivacaftor, Glyburide, Bumetanide, and Lumacaftor. The potential healthcare uses of the remaining four compounds are still being investigated.

Figure 10. Eighteen potential drugs for T2D treatment identified through the protein-drug interaction strategy.

The rectangular nodes represent drugs, and the circular nodes represent the genes linked to them.

4. Discussion

As artificial intelligence improves rapidly, Machine Learning has become an essential part of bioinformatics, enabling in-depth data analysis.⁴⁹ Although ML techniques can be applied to most RNA sequencing data, in this research we analyzed T2D data because T2D is a chronic illness that can have severe and life-threatening complications.¹⁵ We presented a count-based classification pipeline that identifies expressed genes using feature importance and also classifies patients from count data. Moreover, the approaches used here allow us to process large amounts of transcriptome data and, by combining various bioinformatics techniques, to draw reliable conclusions about T2D-related proteins, understand T2D more comprehensively, and identify associated biomarkers.

In our investigation of Type 2 Diabetes (T2D), we employed several supervised machine learning algorithms: Random Forest, AdaBoost, Gradient Boosting, Logistic Regression, Decision Tree, and XGBoost. Their performance metrics, accuracies, and losses are shown in Figure 3 and detailed in Table 2. From a bioinformatics perspective, we conducted pathway enrichment analysis (Figure 5), Gene Ontology assessments (Table 4), and protein-protein interaction studies (Figure 6), and explored hub-protein interactions (Figure 7), transcription factor interactions (Figure 8), miRNA interactions (Figure 9), and drug-protein interactions (Figure 10). The features of each hub gene are detailed in Table 5, and Table 6 gives an overview of the transcriptional and post-transcriptional regulators of the differentially expressed genes. The efficacy of our machine learning models in identifying significant genes from the T2D RNA sequence data sourced from NCBI is illustrated in Figure 4. The entire methodology is outlined in Figure 1, the dataset is presented in Table 1, and the parameters of our top-performing model, XGBoost, are shown in Table 3.

The superior performance of XGBoost, our best model, on our dataset can plausibly be attributed to several factors. Its ability to model complex non-linear relationships, combined with built-in L1 and L2 regularization, makes it adept at handling high-dimensional data. Features such as internal handling of missing values, tree pruning, and efficient column block computation further enhance its efficiency. The model's adaptability in hyperparameter tuning, resilience to outliers, and capability to capture feature interactions likely contributed to its edge. Additionally, some datasets are inherently better suited to gradient-boosted trees, which suggests that the underlying patterns in our data were particularly well matched to XGBoost. The objective function and split criterion of the XGBoost model are given below.

Given a dataset with $n$ samples and $m$ features, the prediction of the model for the $i^{th}$ instance at the $t^{th}$ iteration is denoted as $\hat{y}_{i}^{(t)}$. The objective function to be optimized in XGBoost at each iteration is:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t)}\right) + \sum_{j=1}^{t} \Omega(f_{j}) \tag{1}$$

where $l$ is the differentiable convex loss function and $\Omega$ is the regularization term defined as:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2} \tag{2}$$

Here, $T$ is the number of leaves in the tree and $w_{j}$ is the score assigned to the $j^{th}$ leaf.
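For context, the step that links the regularized objective to a split criterion can be sketched as follows. This is the standard second-order argument for gradient-boosted trees rather than anything specific to this study. The symbols $\ell'_{i}$ and $\ell''_{i}$ (first and second derivatives of the loss with respect to the previous prediction for instance $i$) and $I_{j}$ (the set of instances assigned to leaf $j$) are notation introduced here for illustration. Writing $G_{j} = \sum_{i \in I_{j}} \ell'_{i}$ and $H_{j} = \sum_{i \in I_{j}} \ell''_{i}$, a second-order Taylor expansion of the loss gives, up to a constant,

$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ G_{j} w_{j} + \frac{1}{2} \left(H_{j} + \lambda\right) w_{j}^{2} \right] + \gamma T$$

Minimizing over each leaf score separately yields

$$w_{j}^{*} = -\frac{G_{j}}{H_{j} + \lambda}, \qquad \mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_{j}^{2}}{H_{j} + \lambda} + \gamma T$$

Each leaf therefore contributes a score of $G_{j}^{2} / (H_{j} + \lambda)$, and a candidate split is judged by how much it increases the total score relative to the cost $\gamma$ of adding one more leaf.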

The optimal structure of the tree is found by choosing, at each node, the split that maximizes the gain:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{g^{2}}{h + \lambda} + \frac{(G - g)^{2}}{H - h + \lambda} - \frac{G^{2}}{H + \lambda} \right] - \gamma \tag{3}$$

with $G$ and $H$ being the sums of the first- and second-order gradients of the loss function for the instances in the current node, and $g$ and $h$ being the corresponding sums for the instances that go to the left child node, so that $G - g$ and $H - h$ are the sums for the right child node.
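The split criterion can be illustrated with a minimal Python sketch. It assumes the standard XGBoost form of the gain (scores of the two children added, score of the parent subtracted, minus $\gamma$) and a logistic loss. The function names, the six toy labels, and the two candidate splits are hypothetical and are not taken from the study's notebook.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """First- and second-order gradients of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def leaf_score(grad_sum, hess_sum, lam):
    """Structure score G^2 / (H + lambda) of a node with gradient sums G and H."""
    return grad_sum * grad_sum / (hess_sum + lam)


def split_gain(grads, hess, left_mask, lam=1.0, gamma=0.0):
    """Gain of sending the instances flagged in left_mask to the left child."""
    G, H = sum(grads), sum(hess)                         # sums for the current node
    g = sum(gi for gi, m in zip(grads, left_mask) if m)  # sums for the left child
    h = sum(hi for hi, m in zip(hess, left_mask) if m)
    children = leaf_score(g, h, lam) + leaf_score(G - g, H - h, lam)
    return 0.5 * (children - leaf_score(G, H, lam)) - gamma


# Hypothetical toy node: three cases (1) and three controls (0), all raw scores 0.
labels = [1, 1, 1, 0, 0, 0]
pairs = [logistic_grad_hess(y, 0.0) for y in labels]
grads = [p[0] for p in pairs]
hess = [p[1] for p in pairs]

# A split that separates the classes versus one that mixes them.
gain_good = split_gain(grads, hess, [True, True, True, False, False, False])
gain_poor = split_gain(grads, hess, [True, False, True, False, True, False])
print(f"separating split: {gain_good:.4f}, mixing split: {gain_poor:.4f}")
```

In this toy case the split that separates cases from controls receives the larger gain, which is the behaviour the criterion is meant to reward. Raising `gamma` lowers every gain by the same amount, so weak splits fall below zero first.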

Gene Ontology (GO) and pathway enrichment analysis is a widely used statistical method in bioinformatics that helps researchers gain insight into the biological relevance of large gene sets. In T2D, persistent exposure to high glucose levels and free fatty acids induces beta-cell dysfunction and may initiate beta-cell apoptosis.⁵⁰ Sterol regulatory element binding proteins (SREBPs) regulate lipid production and adipogenesis, and SREBP expression was considerably reduced in individuals with T2D.⁵¹ Neutrophil degranulation is related to an aberrant echocardiographic pattern in T2D.⁵² Expression of FAS (CD95), which initiates the FAS signaling pathway, was connected to systemic and skeletal muscle insulin resistance.⁵³ IFN-gamma- or TNF-alpha-mediated cell proliferation is associated with T2D. Interferon-gamma is crucial for the destruction of β cells and the development of insulin-dependent diabetes mellitus.⁵⁴ Tumor necrosis factor (TNF)-alpha, a cytokine derived primarily from macrophages and adipocytes, can promote insulin resistance (IR) and thereby aid the advancement of T2D.⁵⁵ The innate immune system plays a crucial part in T2D by contributing to low-grade inflammation and insulin resistance, which are key factors in the development and progression of the disease.⁵⁶ Several MHC class I alleles, such as HLA-B, and MHC class II alleles (including HLA-DRB1 and HLA-DQB1) were associated with T2D risk.⁵⁷,⁵⁸
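As an illustration of the statistics behind this kind of enrichment analysis, the following minimal Python sketch computes a one-sided hypergeometric p-value for the overlap between a query gene list and a pathway gene set. It is an assumption-laden sketch: the function name and the toy gene identifiers are hypothetical, no multiple-testing correction is applied, and the tools used in this study may compute and rank their scores differently.

```python
from math import comb


def enrichment_p_value(query_genes, term_genes, background_genes):
    """One-sided hypergeometric p-value for over-representation of a term.

    Returns the overlap size and P(X >= overlap), where X is the number of
    term genes obtained when len(query) genes are drawn at random, without
    replacement, from the background.
    """
    background = set(background_genes)
    query = set(query_genes) & background   # ignore genes outside the background
    term = set(term_genes) & background
    overlap = len(query & term)

    n_bg, n_term, n_query = len(background), len(term), len(query)
    total = comb(n_bg, n_query)
    tail = sum(
        comb(n_term, k) * comb(n_bg - n_term, n_query - k)
        for k in range(overlap, min(n_term, n_query) + 1)
    )
    return overlap, tail / total


# Hypothetical toy data: 20 background genes, a 5-gene pathway, a 4-gene query.
background = [f"gene_{i}" for i in range(20)]
pathway = ["gene_0", "gene_1", "gene_2", "gene_3", "gene_4"]
query = ["gene_0", "gene_1", "gene_2", "gene_10"]

overlap, p_value = enrichment_p_value(query, pathway, background)
print(f"overlap = {overlap}, p = {p_value:.4f}")
```

A small p-value indicates that the query list shares more genes with the pathway than would be expected by chance. In practice this test is repeated for every pathway or GO term, so the resulting p-values need to be adjusted for multiple testing before terms are reported.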

We identified hub proteins that are expressed highly or poorly in T2D patients. Patients with T2D had significantly greater serum TP53 levels than healthy non-diabetic controls.⁵⁹ The decreased level of CTBP1-AS2 was linked with diabetes in the Iranian population.⁶⁰ The presence of complement factor 3 (C3) is linked to insulin resistance.⁶¹ ROS- and RAS-mediated diabetic retinopathy involves PRMT-1 and DDAHs-induced ADMA upregulation.⁶² In a sizeable portion (9–15%) of patients with Type 2 or persistent Type 1 diabetes, CD38 autoantibodies have been discovered. Most of these autoantibodies (about 60%) exhibit agonistic characteristics, such as Ca²⁺ mobilization in lymphocytic cell lines and in pancreatic islets, indicating that they are biologically active. CD38 autoantibodies promote glucose-mediated insulin secretion in human pancreatic islets.⁶³ The hypomethylation of CASP10 may result from T2D and severe and long-lasting hyperglycemia.⁶⁴ In humans, CFTR deficiency causes intrinsic abnormalities in insulin secretion inside the islets.⁶⁵

Our discovered TFs have been studied in relation to T2D. The adult human pancreas has been found to express the Sox2 gene, although it seems improbable that Sox2 has a genetic influence on the development of T2D.⁶⁶ In the human intestine, MYC transcription factor expression is not correlated with either glycaemic management (HbA1c level) or body mass index (BMI).⁶⁷ In the skeletal muscle of T2D patients, STAT3 is constitutively phosphorylated, and increased STAT3 signaling plays a role in the etiology of T2D and insulin resistance.⁶⁸ Much research has been conducted on the peroxisome proliferator-activated receptor gamma (PPARG), whose ligands have become effective insulin sensitizers in T2D.⁶⁹ YY1 plays an important role in T2D and might be useful as a new therapeutic target against the disease.⁷⁰

Our identified miRNAs are also linked to T2D. By targeting the SOCS1-mediated NF-κB pathway, miR-210-3p increases insulin resistance and obesity-induced adipose tissue inflammation. MiR-374a-5p appears to be associated with the downregulation of pro-inflammatory biomarkers that are connected to insulin resistance, and it is elevated in metabolically healthy obese persons compared with metabolically abnormal obese patients.⁷¹ MiR-21-3p, whose expression is downregulated in association with the onset of diabetic nephropathy, is a possible biomarker for the early diagnosis of that condition.⁷² Targeting miR-7-5p may be a therapeutic approach to metabolic illnesses brought on by insulin dysfunction.⁷³ Patients with diabetic neuropathy have a dysregulated long non-coding RNA MALAT1/miR-1-3p/CXCR4 axis.⁷⁴ Compared with healthy individuals, peripheral blood mononuclear cells (PBMCs) obtained from T2D patients exhibited low miR-155 expression.⁷⁵

Our identified potential drug molecules may be effective for T2D patients. Insulin Human N (intermediate-acting) and Insulin Human R (short-acting) are both used in diabetes mellitus to lower blood glucose levels. Insulin glargine modulates carbohydrate, protein, and lipid metabolism by suppressing hepatic glucose synthesis and lipolysis and improving peripheral glucose clearance. Insulin peglispro, a long-acting insulin, has been investigated for the treatment of both T1D and T2D.⁷⁶ Glyburide belongs to the class of medications known as sulfonylureas, which stimulate the pancreas to secrete insulin and thereby reduce blood sugar levels.⁷⁷ T2D and its associated illnesses may be prevented and treated using myristic acid.⁷⁸ How Dalteparin, Lovastatin, Atorvastatin, Myristic acid, M-cresol, L-lysine, L-ornithine, Ivacaftor, Bumetanide, and Lumacaftor interact with T2D remains uncertain, so further studies, including preclinical and clinical trials, are required.

5. Conclusions

This research aimed to provide a machine learning approach that identifies significant genes through the feature importance technique, and to help address the global burden of T2D and improve the lives of individuals affected by this chronic disease. To this end, we combined a machine learning approach with common bioinformatics methods.

In terms of the machine learning approach, our research represents a remarkable advancement in bioinformatics and T2D research. By combining machine learning algorithms, statistical analysis, and bioinformatics techniques, we have gained valuable insights into the molecular mechanisms underlying T2D. This approach helps to identify expressed genes without relying solely on the conventional bioinformatics approach of differentially expressed gene (DEG) analysis. In addition, when the number of features is large, a machine learning approach can be an effective and more reliable way to predict expressed genes. We processed and analyzed large volumes of transcriptomic data, and our XGBoost model, for example, achieved an accuracy of 94.1%. This result demonstrates the potential of machine learning as a tool for precise and efficient T2D diagnosis, which could also significantly impact clinical practice by enabling early intervention and personalized treatment approaches based on count data.

In our molecular biomarker study on T2D, we analyzed a diverse set of molecular entities, including 20 pathways, 30 gene ontologies, 51 transcriptional factors, 18 hub-genes, 18 miRNAs, and 18 potential drugs, of which we were able to validate a subset against prior literature. According to this validation, key pathways such as SREBPs, neutrophil degranulation, FAS signaling, IFN-gamma/TNF-alpha-mediated cell proliferation, and the innate immune system were highlighted as central to the development of T2D. Hub proteins, notably TP53, CTBP1-AS2, C3, PRMT-1, DDAHs, CD38 autoantibodies, CASP10, and CFTR, present promising avenues for disease research and treatment. Similarly, transcriptional factors like Sox2, MYC, STAT3, PPARG, and YY1, coupled with miRNAs such as miR-210-3p, miR-374a-5p, miR-21-3p, miR-7-5p, and miR-155, shed light on the regulatory dynamics underpinning T2D. Furthermore, our research identified potential drug molecules, including insulin analogs, sulfonylureas, myristic acid, and other compounds, that hold promise for therapeutic intervention. While further preclinical and clinical trials are necessary to validate their efficacy and safety profiles, these findings offer potential avenues for improving T2D treatment strategies and patient outcomes.

Overall, this study represents a significant advancement in using machine learning to identify expressed genes from RNA sequence count data, and by integrating key bioinformatics methods on T2D and validating our findings against prior research, we offer a robust approach to understanding and addressing T2D at the molecular level.

One of the limitations of our study is the accuracy of our models, which needs further improvement. While our current approach provided insights, the application of deep learning models in future studies could enhance the precision of our findings. Additionally, there remains a set of biomarkers that we have yet to validate with existing literature. A thorough validation of these biomarkers is essential, as it would offer researchers a more detailed understanding of expressed genes, ultimately aiding in more accurate analyses and better strategies to address T2D.

Author contributions

Md Al Amin: Took the lead in conceptualization and design of the work; primary role in data curation and analysis; led the writing of the original draft of the manuscript; major contributor in review and editing; gave final approval of the version to be published; agreed to be accountable for all aspects of the work.

Feroza Naznin: Played a significant role in the formal analysis of the data; contributed to the investigation; involved in writing, review, and editing of the manuscript; gave final approval and agreed to be accountable for all aspects of the work.

Most Nilufa Yeasmin: Led the acquisition of resources; involved in data curation; gave final approval and agreed to be accountable for all aspects of the work.

Md Sumon Sarkar: Took the lead in software development; contributed to the validation of the results; gave final approval and agreed to be accountable for all aspects of the work.

Md Misor Mia: Led the visualization process; supported the investigation; gave final approval and agreed to be accountable for all aspects of the work.

Abdullahi Chowdhury: Contributed to the methodology; supported the writing, review, and editing of the manuscript; gave final approval and agreed to be accountable for all aspects of the work.

Md Zahidul Islam: Acted as the primary supervisor and administrator of the project; took a leading role in writing, reviewing, and editing the manuscript; gave final approval and agreed to be accountable for all aspects of the work.

Data availability

Underlying data

Zenodo: Data from Analysis on Type-2 Diabetes RNA-Sequence Data, https://doi.org/10.5281/zenodo.10603991.⁷⁹

This project contains the following underlying data:

- Data from analysis.zip (protein-protein interaction network analysis, gene ontology and pathway analysis, transcription factor and miRNA analysis, hub gene extraction, and protein-drug interactions).

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Zenodo: Top 100 proteins, https://doi.org/10.5281/zenodo.10603257.⁸⁰

This project contains the following extended data:

- Top-100-proteins.txt (supplementary file-1).

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Source data

Data repository: Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI). Title: RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes. The persistent identifier: GSE81608. Archived source code at time of publication: http://dx.doi.org/10.1016/j.cmet.2016.08.018 Link of dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81608 and http://www.ilincs.org/apps/grein/?gse=GSE81608

Description of this dataset: Pancreatic islet cells are critical for maintaining normal blood glucose levels, and their malfunction underlies diabetes development and progression. The authors of the dataset used single-cell RNA sequencing to determine the transcriptomes of 1,492 human pancreatic α-, β-, δ-, and PP cells from non-diabetic and type 2 diabetes organ donors. They identified cell-type-specific genes and pathways as well as 245 genes with disturbed expression in type 2 diabetes. Importantly, 92% of the genes had not previously been associated with islet cell function or growth. Comparison of gene profiles in mouse and human α- and β-cells revealed species-specific expression. All data are available for online browsing and download and are intended to serve as a resource for the islet research community.

License: Data is available under the terms of the Open Database License. GEO is an open-access database, meaning the data stored within it is freely available for anyone to access, download, and reuse.

For the citation of this dataset: “Xin, Y., Kim, J., Okamoto, H., Ni, M., Wei, Y., Adler, C., Murphy, A.J., Yancopoulos, G.D., Lin, C. and Gromada, J., 2016. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell metabolism, 24(4), pp. 608-615.”

Software availability

Software-1: EnrichR: Utilizing the computational tool EnrichR, we conducted a gene set enrichment approach to determine pathways and took into account five pathways databases to conduct experiments utilizing DEGs of T2D. Software available from: https://maayanlab.cloud/Enrichr/. Software-2: STRING and Cytoscape: We used STRING to analyze the PPI network and a Cytoscape representation to predict the adherence pathways and recurrent interactions between DEGs. Software (String) available from: https://string-db.org/. Software (Cytoscape) available from: https://cytoscape.org/.

Source code available from: https://github.com/alamin852369/ML-for-Type-2-Diabetes/blob/main/ML_for_TD.ipynb.

References

1. Love M, Anders S, Huber W: Differential analysis of count data–the DESeq2 package. Genome Biol. 2014; 15(550): 10–1186.
2. McDermaid A, Monier B, Zhao J, et al.: Interpretation of differential gene expression results of RNA-seq data: review and integration. Brief. Bioinform. 2019; 20(6): 2044–2054.
3. Wikipedia: Fold change. accessed: Date (Year).
4. Kumar I, Singh SP, et al.: Machine learning in bioinformatics. Bioinformatics. 2022; 443–456. Elsevier.
5. Kaisar S, Chowdhury A: Integrating oversampling and ensemble-based machine learning techniques for an imbalanced dataset in dyslexia screening tests. ICT Express. 2022; 8(4): 563–568.
6. Shafin SS, Prottoy SA, Abbas S, et al.: Distributed denial of service attack detection using machine learning and class oversampling. Applied Intelligence and Informatics: First International Conference, AII 2021, Nottingham, UK, July 30–31, 2021, Proceedings 1. Springer; 2021; pp. 247–259.
7. Sprang M, Andrade-Navarro MA, Fontaine J-F: Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality. BMC Bioinformatics. 2022; 23(6): 1–15.
8. Billard MJ, Fitzhugh DJ, Parker JS, et al.: G protein coupled receptor kinase 3 regulates breast cancer migration, invasion, and metastasis. PLoS One. 2016; 11(4): e0152856.
9. Shi J: Machine learning and bioinformatics approaches for classification and clinical detection of bevacizumab responsive glioblastoma subtypes based on miRNA expression. Sci. Rep. 2022; 12(1): 8685.
10. Urda D, Montes-Torres J, Moreno F, et al.: Deep learning to analyze RNA-seq gene expression data. Advances in Computational Intelligence: 14th International Work-Conference on Artificial Neural Networks, IWANN 2017, Cadiz, Spain, June 14-16, 2017, Proceedings, Part II 14. Springer; 2017; pp. 50–59.
11. Rukhsar L, Bangyal WH, Ali Khan MS, et al.: Analyzing RNA-seq gene expression data using deep learning approaches for cancer classification. Appl. Sci. 2022; 12(4): 1850.
12. International Diabetes Federation: About diabetes.
13. International Diabetes Federation: About diabetes (facts and figures).
14. Merve A, Yayintaş Ö: In silico analysis of quercetin, gallic acid, oleanolic acid, and ursolic acid on diabetes mellitus. Troia Med. J. 2022; 3(3): 100–110.
15. American Diabetes Association: The path to understanding diabetes starts here. http
16. Palnitkar U: Growth of Indian biotech companies, in the context of the international biotechnology industry. J. Commer. Biotechnol. 2005; 11: 146–154.
17. De Silva K, Jönsson D, Demmer RT: A combined strategy of feature selection and machine learning to identify predictors of prediabetes. J. Am. Med. Inform. Assoc. 2020; 27(3): 396–406.
18. Coombes CE, Abrams ZB, Li S, et al.: Unsupervised machine learning and prognostic factors of survival in chronic lymphocytic leukemia. J. Am. Med. Inform. Assoc. 2020; 27(7): 1019–1027.
19. Hyland SL, Faltys M, Hüser M, et al.: Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 2020; 26(3): 364–373.
20. Larabi-Marie-Sainte S, Aburahmah L, Almohaini R, et al.: Current techniques for diabetes prediction: review and case study. Appl. Sci. 2019; 9(21): 4604.
21. Sisodia D, Sisodia DS: Prediction of diabetes using classification algorithms. Procedia Comput. Sci. 2018; 132: 1578–1585.
22. Zou Q, Qu K, Luo Y, et al.: Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018; 9: 515.
23. Alehegn M, Joshi RR, Mulay P: Diabetes analysis and prediction using random forest, KNN, naïve Bayes and J48: An ensemble approach. Int. J. Sci. Technol. Res. 2019; 8(9): 1346–1354.
24. Xin Y, Kim J, Okamoto H, et al.: RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 2016; 24(4): 608–615.
25. Barrett T, Troup DB, Wilhite SE, et al.: NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic Acids Res. 2010; 39(suppl_1): D1005–D1010.
26. Bernstein MN, Ni Z, Collins M, et al.: CHARTS: a web application for characterizing and comparing tumor subpopulations in publicly available single-cell RNA-seq data sets. BMC Bioinformatics. 2021; 22(1): 1–9.
27. McKinney W, et al.: pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing. 2011; 14(9): 1–9.
28. Rajbahadur GK, Wang S, Oliva GA, et al.: The impact of feature importance methods on the interpretation of defect classifiers. IEEE Trans. Softw. Eng. 2021; 48(7): 2245–2261.
29. Anders S, Huber W: Differential expression analysis for sequence count data. Nature Precedings. 2010; pp. 1–1.
30. Kuleshov MV, Jones MR, Rouillard AD, et al.: Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016; 44(W1): W90–W97.
31. Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 28(1): 27–30.
32. Fabregat A, Jupe S, Matthews L, et al.: The Reactome pathway knowledgebase. Nucleic Acids Res. 2018; 46(D1): D649–D655.
33. Slenter DN, Kutmon M, Hanspers K, et al.: WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018; 46(D1): D661–D667.
34. Nishimura D: BioCarta. Biotech Software & Internet Report. The Computer Software Journal for Scient. 2001; 2(3): 117–120.
35. Szklarczyk D, Gable AL, Lyon D, et al.: STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019; 47(D1): D607–D613.
36. Chin C-H, Chen S-H, Wu H-H, et al.: cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst. Biol. 2014; 8(4): 1–7.
37. Khan A, Fornes O, Stigliani A, et al.: JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2018; 46(D1): D260–D266.
38. Lachmann A, Xu H, Krishnan J, et al.: ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics. 2010; 26(19): 2438–2444.
39. Sethupathy P, Corda B, Hatzigeorgiou AG: TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA. 2006; 12(2): 192–197.
40. Huang H-Y, Lin Y-C-D, Li J, et al.: miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database. Nucleic Acids Res. 2020; 48(D1): D148–D154.
41. Zhou G, Soufan O, Ewald J, et al.: NetworkAnalyst 3.0: a visual analytics platform for comprehensive gene expression profiling and meta-analysis. Nucleic Acids Res. 2019; 47(W1): W234–W241.
42. Wishart DS, Feunang YD, Guo AC, et al.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018; 46(D1): D1074–D1082.
43. Fawcett T: An introduction to ROC analysis. Pattern Recogn. Lett. 2006; 27(8): 861–874.
44. Sofaer HR, Hoeting JA, Jarnevich CS: The area under the precision-recall curve as a performance metric for rare binary events. Methods Ecol. Evol. 2019; 10(4): 565–577.
45. Lever J, Krzywinski M, Altman N: Points of significance: model selection and overfitting. Nat. Methods. 2016; 13(9): 703–704.
46. Lange RT, Lippa SM: Sensitivity and specificity should never be interpreted in isolation without consideration of other clinical utility metrics. Clin. Neuropsychol. 2017; 31(6-7): 1015–1028.
47. Cheng C, Alexander R, Min R, et al.: Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res. 2012; 22(9): 1658–1667.
48. Lahti JL, Tang GW, Capriotti E, et al.: Bioinformatics and variability in drug response: a protein structural perspective. J. R. Soc. Interface. 2012; 9(72): 1409–1437.
49. Baldi P, Brunak S: Bioinformatics: the machine learning approach. MIT Press; 2001.
50. Cnop M, Welsh N, Jonas J-C, et al.: Mechanisms of pancreatic β-cell death in type 1 and type 2 diabetes: many differences, few similarities. Diabetes. 2005; 54(suppl_2): S97–S107.
51. Laudes M, Barroso I, Luan J, et al.: Genetic variants in human sterol regulatory element binding protein-1c in syndromes of severe insulin resistance and type 2 diabetes. Diabetes. 2004; 53(3): 842–846.
52. Ministrini S, Andreozzi F, Montecucco F, et al.: Neutrophil degranulation biomarkers characterize restrictive echocardiographic pattern with diastolic dysfunction in patients with diabetes. Eur. J. Clin. Investig. 2021; 51(12): e13640.
53. Wueest S, Mueller R, Blüher M, et al.: Fas (CD95) expression in myeloid cells promotes obesity-induced muscle insulin resistance. EMBO Mol. Med. 2014; 6(1): 43–56.
54. von Herrath MG, Oldstone MB: Interferon-γ is essential for destruction of β cells and development of insulin-dependent diabetes mellitus. J. Exp. Med. 1997; 185(3): 531–540.
55. Tilg H, Moschen AR: Inflammatory mechanisms in the regulation of insulin resistance. Mol. Med. 2008; 14: 222–231.
56. Berbudi A, Rahmadika N, Tjahjadi AI, et al.: Type 2 diabetes and its impact on the immune system. Curr. Diabetes Rev. 2020; 16(5): 442–449.
57. Marzban A, Kiani J, Hajilooi M, et al.: HLA class II alleles and risk for peripheral neuropathy in type 2 diabetes patients. Neural Regen. Res. 2016; 11(11): 1839–1844.
58. Frydrych LM, Bian G, O’Lone DE, et al.: Obesity and type 2 diabetes mellitus drive immune dysfunction, infection development, and sepsis mortality. J. Leukoc. Biol. 2018; 104(3): 525–534.
59. Sliwinska A, Kasznicki J, Kosmalski M, et al.: Tumour protein 53 is linked with type 2 diabetes mellitus. Indian J. Med. Res. 2017; 146(2): 237–243.
60. Erfanian Omidvar M, Ghaedi H, Kazerouni F, et al.: Clinical significance of long noncoding RNA VIM-AS1 and CTBP1-AS2 expression in type 2 diabetes. J. Cell. Biochem. 2019; 120(6): 9315–9323.
61. Wlazlo N, Van Greevenbroek MM, Ferreira I, et al.: Complement factor 3 is associated with insulin resistance and with incident type 2 diabetes over a 7-year follow-up period: the CODAM study. Diabetes Care. 2014; 37(7): 1900–1909.
62. Chen Y, Xu X, Sheng M, et al.: PRMT-1 and DDAHs-induced ADMA upregulation is involved in ROS- and RAS-mediated diabetic retinopathy. Exp. Eye Res. 2009; 89(6): 1028–1034.
63. Antonelli A, Ferrannini E: CD38 autoimmunity: recent advances and relevance to human diabetes. J. Endocrinol. Investig. 2004; 27: 695–707.
64. Volkmar M, Dedeurwaerder S, Cunha DA, et al.: DNA methylation profiling identifies epigenetic dysregulation in pancreatic islets from type 2 diabetic patients. EMBO J. 2012; 31(6): 1405–1426.
65. Koivula FNM, McClenaghan NH, Harper AG, et al.: Islet-intrinsic effects of CFTR mutation. Diabetologia. 2016; 59: 1350–1355.
66. Gu HF, Gu T, Östenson C-G, et al.: Evaluation of SOX2 genetic effect on the development of type 2 diabetes. Gene. 2011; 486(1-2): 94–96.
67. Ellegaard A-M, Knop FK: MYC mRNA expression throughout the intestine is not associated with body mass index or type 2 diabetes. Endocrinol. Diabetes Metab. 2022; 5(2): e00327.
68. Mashili F, Chibalin AV, Krook A, et al.: Constitutive STAT3 phosphorylation contributes to skeletal muscle insulin resistance in type 2 diabetes. Diabetes. 2013; 62(2): 457–465.
69. Janani C, Kumari BR: PPAR gamma gene–a review. Diabetes Metab. Syndr. Clin. Res. Rev. 2015; 9(1): 46–50.
70. Kosasih FR, Bonavida B: YY1-mediated regulation of type 2 diabetes via insulin. YY1 in the Control of the Pathogenesis and Drug Resistance of Cancer. 2021; 271–287.
71. Doumatey AP, He WJ, Gaye A, et al.: Circulating miR-374a-5p is a potential modulator of the inflammatory process in obesity. Sci. Rep. 2018; 8(1): 7680.
72. Akpınar K, Aslan D, Fenkçi SM, et al.: miR-21-3p and miR-192-5p in patients with type 2 diabetic nephropathy. Diagnosis. 2022; 9(4): 499–507.
73. Saeidi L, Shahrokhi SZ, Sadatamini M, et al.: Can circulating miR-7-1-5p, and miR-33a-5p be used as markers of T2D patients? Arch. Physiol. Biochem. 2021; 129: 771–777.
74. Ashjari D, Karamali N, Rajabinejad M, et al.: The axis of long non-coding RNA MALAT1/miR-1-3p/CXCR4 is dysregulated in patients with diabetic neuropathy. Heliyon. 2022; 8(3): e09178.
75. Jankauskas SS, Gambardella J, Sardu C, et al.: Functional role of miR-155 in the pathogenesis of diabetes mellitus and its complications. Non-coding RNA. 2021; 7(3): 39.
76. Jacober S, Prince M, Beals J, et al.: Basal insulin peglispro: overview of a novel long-acting insulin with reduced peripheral effect resulting in a hepato-preferential action. Diabetes. Obes. Metab. 2016; 18: 3–16.
77. Langer O, Yogev Y, Xenakis EM, et al.: Insulin and glyburide therapy: dosage, severity level of gestational diabetes, and pregnancy outcome. Am. J. Obstet. Gynecol. 2005; 192(1): 134–139.
78. Takato T, Iwata K, Murakami C, et al.: Chronic administration of myristic acid improves hyperglycaemia in the Nagoya–Shibata–Yasuda mouse model of congenital type 2 diabetes. Diabetologia. 2017; 60(10): 2076–2083.
79. Al Amin M: Data from Analysis on Type-2 Diabetes RNA-Sequence Data. [Dataset]. Zenodo. 2024.
80. Al Amin M: Top 100 proteins. [Dataset]. Zenodo. 2024.
