Research Article

High throughput biological sequence analysis using machine learning-based integrative pipeline for extracting functional annotation and visualization

[version 1; peer review: 2 approved with reservations]
PUBLISHED 07 Mar 2024

This article is included in the Machine Learning in Drug Discovery and Development collection.

Abstract

The Differential Gene Expression (DGE) approach identifies expressed genes using measures such as log-fold change and adjusted p-values. Fold change is commonly employed in gene expression studies, especially microarray and RNA sequencing experiments, to quantify alterations in a gene’s expression level, but relying on it carries an inherent bias. It can incorrectly categorize genes that have large absolute differences but small ratios, resulting in poor detection of mutations in genes with high expression levels. In contrast, machine learning offers a more comprehensive view: it can capture the non-linear complexities of gene expression data and is robust against noise. This motivated us to use machine learning models to explore differential gene expression based on feature importance in Type 2 Diabetes (T2D), a significant global health concern. We validated the biomarkers derived from our expressed genes against previous studies to confirm the effectiveness of our ML models, and then analyzed pathways, gene ontologies, protein-protein interactions, transcription factors, miRNAs, and drug predictions relevant to T2D. This study aims to establish machine learning as a sound way to study expressed genes in depth without relying on the DGE approach, and to help control or reduce the risk for T2D patients by supporting drug development researchers.

Keywords

Bioinformatics, Machine Learning, Type-2 Diabetes, Proteins, Pathways, Gene Ontology, RNA-Sequence, Drug

1. Introduction

Differential Gene Expression (DGE) analysis, usually based on the DESeq2 package,1 is a traditional and common bioinformatics technique for identifying genes whose expression levels vary between conditions.2 In RNA sequence data, fold change can be biased, potentially misclassifying genes with large absolute differences but small relative ratios.3 The advent of Machine Learning (ML), however, has brought a significant change to bioinformatics, and ML is now widely acknowledged as a powerful tool for explaining complex data that were once difficult to understand.4 ML techniques are also becoming increasingly popular in the medical sector, where they are effective for decision-making.5,6 Various ML algorithms have been applied to RNA sequence data for different detection tasks and for finding correlations between sequences,7 and they have proved effective in detecting splice variants from RNA sequence data.8 For example, support vector machines, random forests, and neural networks have been applied to microarray datasets to identify and classify cancers early.9 Another study used a neural network to analyze expressed genes from several RNA sequence datasets in order to predict a patient’s health status.10 A further study aimed to classify different types of cancer from patterns in gene expression data, with the goal of improving the accuracy and efficiency of cancer diagnosis and potentially enabling more targeted and effective treatments.11 These studies inspired us to apply ML models in bioinformatics, especially to RNA sequence count data.

Type 2 Diabetes (T2D), sometimes referred to simply as diabetes, is a long-term illness that affects the metabolic process.12 According to the IDF, diabetes caused around 6.7 million deaths in 2021, making it one of the ten leading causes of death worldwide, and around 541 million adults are affected by T2D.13 It was also projected in 2021 that 643 million people would have diabetes by 2030 and 783 million by 2045.14 However, the risk of serious complications from T2D is greatly reduced if it is diagnosed in its early stages.15 Advances in biotechnology have produced several bioinformatics tools that support the study of T2D.16 Other groups of researchers have relied on ML-based aid systems for forecasting chronic illnesses.17–19 Researchers have also suggested using ML-based classification models to estimate the prevalence of T2D from its risk factors.20–23 This body of work encouraged us to focus on T2D.

In our research on individuals with T2D, we used an XGBoost-based feature importance method to identify highly expressed genes from RNA sequencing count data while classifying individuals as T2D or non-T2D. This approach was used instead of relying solely on adjusted p-values and log-fold change values to determine significant genes. By training various algorithms on RNA sequence count data, we achieved notable prediction accuracies, with XGBoost standing out. The approach not only improves gene detection accuracy but also challenges traditional bioinformatics metrics, suggesting a richer, machine learning-driven perspective on the genetics of diseases like T2D. On the detected expressed genes (significant features), we then performed several bioinformatics analyses covering pathways, gene ontologies, protein-protein interactions, transcription factors, miRNAs, and drug predictions for T2D. We also validated our findings against past studies to show the effectiveness of our models. We expect this study to help researchers learn more about mutated genes through machine learning and to consider the prevention of T2D based on bioinformatics analyses such as the drug prediction discussed in this paper.

2. Methods

2.1 Data acquisition and pre-processing

We began our study by selecting the RNA sequencing count dataset (GSE81608)24 from GEO.25 This dataset was chosen for its reliability, especially in the context of biological data. The specifics of the dataset are detailed in Table 1. For ease of access and analysis, we downloaded the raw-count data from GREIN.26 Afterward, the data were pre-processed using the “pandas” library27 to prepare them for machine learning training.

Table 1. Overview of dataset.

| Disease name | GEO accession | GEO platform | Tissues/Cells | Genes per sample | Case samples | Control samples |
|---|---|---|---|---|---|---|
| Type 2 Diabetes | GSE81608 | Illumina HiSeq 2500 | Alpha/Beta/Delta/PP | 28089 | 949 | 651 |

2.2 Machine learning analysis

On the preprocessed data, we trained six distinct ML models: Random Forest, AdaBoost, Gradient Boosting, Logistic Regression, Decision Tree, and XGBoost. We used 80% of the dataset for training and held out the remaining 20% for testing. The models were trained on two classes: diabetes and non-diabetes. Our primary objective was to identify significant features,28 which in our study were treated as differentially expressed genes.29 The entire machine learning process on this dataset is depicted in Figure 2. Additionally, for a more intuitive understanding, we visualized the RNA sequencing data and DGE in Figure 4.
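The 80/20 split described above can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `stratified_split` and the fixed seed are hypothetical, and stratifying by class is an assumption, since the text does not say whether the split preserved class proportions. The class sizes are taken from Table 1.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class sizes from Table 1: 949 T2D samples (label 1) and 651 controls (label 0).
labels = [1] * 949 + [0] * 651
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Under these assumptions the held-out set contains 320 samples, of which 190 are T2D and 130 are controls.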

2.3 Bioinformatics analysis

Starting from our findings (the significant features), we examined the data further using a set of bioinformatics techniques. This approach covered protein-protein interaction network analysis, gene ontology and pathway analysis, transcription factor and miRNA analysis, hub gene extraction, and protein-drug interactions. To ensure the robustness and validity of our methods, we validated our results against the existing literature. Our methodology is represented visually in Figure 1. For gene ontology and pathway analysis, we used EnrichR.30 This tool provided insights into the three aspects of gene ontology: biological processes, cellular components, and molecular functions. To enrich the pathway analysis, we drew on established databases: KEGG,31 Reactome,32 WikiPathways,33 Elsevier, and BioCarta.34 Throughout this process, we used an adjusted p-value of less than 0.05 as the threshold for significant pathways. Protein-protein interactions were explored with the STRING online tool.35 We then built a hub-protein network using the Cytoscape application with the cytohubba plugin.36 For transcription factors, we consulted the JASPAR37 and ChEA38 databases, which allowed us to identify plausible transcription factors that might interact with our differentially expressed genes (DEGs). The TarBase39 and miRTarBase40 databases were used to identify miRNA-DEG interactions. NetworkAnalyst41 was the central tool for our analysis of the TF-gene and miRNA-gene interaction networks. Finally, for drug-protein interactions, we relied on DrugBank,42 a comprehensive online resource of medicines and their associated drug targets.
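As a rough illustration of the adjusted p-value threshold mentioned above, the following sketch applies a Benjamini-Hochberg correction and keeps terms below 0.05. The choice of Benjamini-Hochberg is an assumption (the text only says "adjusted p-value"), and the term names and raw p-values are hypothetical placeholders, not results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four illustrative terms (not from the study).
terms = {"term_a": 0.001, "term_b": 0.02, "term_c": 0.03, "term_d": 0.5}
names = list(terms)
adj = benjamini_hochberg([terms[n] for n in names])

# Keep terms with adjusted p < 0.05, ordered ascendingly by raw p-value.
significant = sorted(
    (n for n, q in zip(names, adj) if q < 0.05),
    key=lambda n: terms[n],
)
print(significant)
```

In practice the enrichment tool reports adjusted p-values directly, so only the filtering and sorting steps would be needed.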


Figure 1. A diagrammatical depiction of the methodology used in this study.


Figure 2. Supervised learning to diagnose diabetes.

3. Results

3.1 Machine learning model results and evaluation metrics

The evaluation of machine learning models is crucial in every domain, and it is indispensable for biological data.79 While accuracy is a commonly used metric of a model’s performance, it alone may not provide a comprehensive assessment, especially for biological or health-related datasets. Therefore, to obtain a clear picture of each model’s efficacy, we used a range of metrics: Precision,43 Recall,44 F1-Score,45 Specificity,46 RocAuc, True Positive Rate (TPR), and False Positive Rate (FPR). Together, these metrics capture different aspects of a model’s predictive capability. The confusion matrix is the foundation here, since the measures used to evaluate a classification model are derived from it. To diagnose T2D from RNA sequence data, we evaluated five supervised models: Gradient Boosting, XGBoost, Logistic Regression, Random Forest, and AdaBoost. The XGBoost model performed well, with the highest prediction accuracy (0.941) and the second-lowest Log-Loss (0.282), and with Precision 0.943, Recall 0.958, RocAuc 0.937, Specificity 0.915, F1-Score 0.950, TPR 0.958, and FPR 0.085. Gradient Boosting matched this accuracy (0.941) and had the lowest Log-Loss (0.280), with Precision 0.948, Recall 0.953, RocAuc 0.938, Specificity 0.923, F1-Score 0.950, TPR 0.953, and FPR 0.077. AdaBoost was the weakest model, with the lowest accuracy of 0.744. Logistic Regression and Random Forest ranked third and fourth, with accuracies of 0.884 and 0.772, respectively. All values are shown in Table 2, and Figure 3 visualizes the performance of the models based on Accuracy and Log-Loss.
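To make the relationship between the confusion matrix and these metrics concrete, here is a minimal sketch that derives them from the four cell counts. The counts are not reported in the article; they are an assumption, chosen for a 320-sample test set (190 T2D, 130 control) so that the rounded results agree with the XGBoost row of Table 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # identical to the true positive rate (TPR)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),       # false positive rate = 1 - specificity
    }

# Assumed counts (illustrative only; the article does not report them).
metrics = classification_metrics(tp=182, fp=11, tn=119, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that TPR is the same quantity as Recall, and FPR is one minus Specificity, which is why those columns of Table 2 move together.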

Table 2. Assessing model performance: evaluation metric scores.

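| ML Models | Accuracy | Log_Loss | Precision | Recall | RocAuc | Specificity | F1-Score | TPR | FPR |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.941 | 0.282 | 0.943 | 0.958 | 0.937 | 0.915 | 0.950 | 0.958 | 0.085 |
| Gradient Boosting | 0.941 | 0.280 | 0.948 | 0.953 | 0.938 | 0.923 | 0.950 | 0.953 | 0.077 |
| Logistic Regression | 0.884 | 0.711 | 0.909 | 0.895 | 0.882 | 0.869 | 0.902 | 0.895 | 0.131 |
| Random Forest | 0.772 | 0.527 | 0.755 | 0.911 | 0.740 | 0.569 | 0.826 | 0.911 | 0.431 |
| AdaBoost | 0.744 | 0.582 | 0.748 | 0.858 | 0.717 | 0.577 | 0.799 | 0.858 | 0.423 |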

Figure 3. Comparison of machine learning models based on test accuracies.

3.2 Differential gene analysis using feature importance

Although we trained five different models to obtain our significant features (expressed genes), we selected the top 100 important genes from XGBoost, our best-performing model with an accuracy of 0.941, because it was trained well and showed the best performance (Figure 3). The predicted expressed genes are included in the Extended data (supplementary file 1). By using this method, we deliberately set aside the conventional Differential Gene Expression (DGE) approach. The parameters used for the XGBoost model are provided in Table 3.
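The selection step can be sketched as a simple ranking, assuming the importance scores have already been extracted from the trained model as a mapping from gene symbol to score (for a fitted scikit-learn-style XGBoost classifier, these would typically come from its `feature_importances_` attribute). The helper name `top_k_features` and the scores below are hypothetical and are not values from this study.

```python
def top_k_features(importances, k):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance scores (feature name -> score); illustrative only.
importances = {
    "gene_1": 0.12,
    "gene_2": 0.08,
    "gene_3": 0.05,
    "gene_4": 0.00,
    "gene_5": 0.01,
}
print(top_k_features(importances, k=3))
```

With k set to 100 and the real importance scores, the same ranking would yield the gene list used in the downstream analyses.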

Table 3. Important XGBoost parameters.

| Parameter name | Value | Description |
|---|---|---|
| learning_rate (eta) | 0.5 | Controls the contribution of each tree. Lower values make optimization more robust and help prevent overfitting. |
| max_depth | 8 | Maximum depth of a tree. Increasing it can lead to overfitting. |
| subsample | 1 | Fraction of training data sampled for growing each tree. A value of 1 uses all rows; values below 1 can help prevent overfitting. |
| colsample_bytree | 1 | Fraction of features sampled for building each tree. A value of 1 uses all features; values below 1 can help prevent overfitting. |
| min_child_weight | 1 | Minimum sum of instance weight needed in a child. Higher values make the algorithm more conservative. |
| gamma | 0 | Minimum loss reduction required for a further partition on a leaf node. Acts as tree regularization. |
| lambda (reg_lambda) | 1 | L2 regularization term on weights. Helps avoid overfitting. |
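For reference, the values in Table 3 can be collected into a single configuration mapping. The dictionary name `xgb_params` is hypothetical, and the commented usage lines assume the scikit-learn-style `XGBClassifier` interface of the `xgboost` package; the article does not state which interface was used.

```python
# Hyperparameters as listed in Table 3, using scikit-learn-style names.
xgb_params = {
    "learning_rate": 0.5,     # eta
    "max_depth": 8,
    "subsample": 1,           # use all training rows for each tree
    "colsample_bytree": 1,    # use all features for each tree
    "min_child_weight": 1,
    "gamma": 0,               # no minimum loss reduction required to split
    "reg_lambda": 1,          # L2 regularization on leaf weights
}

# Assumed usage (requires the xgboost package; not executed here):
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params)
print(sorted(xgb_params))
```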

3.3 Pathway enrichment and gene ontology evaluation

Using the computational tool EnrichR, we performed gene set enrichment analysis on the T2D DEGs to determine pathways, drawing on five pathway databases. The 20 leading signaling pathway terms are presented in Figure 5. The top 10 terms for biological processes, molecular functions, and cellular components are listed in Table 4. Both the GO terms and the pathways were filtered by adjusted p-value (mostly below 0.05) and sorted in ascending order.


Figure 4. Differential gene analysis using machine learning technique.


Figure 5. Pathway enrichment summary of type 2 diabetes DEGs.

Table 4. Exploration of DEGs of type 2 diabetes from an ontological perspective.

| Category | GO ID | Term | P-value | Genes |
|---|---|---|---|---|
| GO Biological Process | GO:0002484 | antigen processing and presentation of endogenous peptide antigen via MHC class I via ER pathway | 0.000511 | HLA-C;HLA-A |
| GO Biological Process | GO:0009954 | proximal/distal pattern formation | 0.000679 | CYP26B1;SIX3 |
| GO Biological Process | GO:0046631 | alpha-beta T cell activation | 0.000679 | HLA-A;INS |
| GO Biological Process | GO:0032024 | positive regulation of insulin secretion | 0.0009010 | PSMD9;BAD;CFTR |
| GO Biological Process | GO:0050709 | negative regulation of protein secretion | 0.0009726 | PSMD9;CYP51A1;INS |
| GO Biological Process | GO:0090277 | positive regulation of peptide hormone secretion | 0.0012945 | PSMD9;BAD;INS |
| GO Biological Process | GO:0016050 | vesicle organization | 0.0015751 | SAR1A;ALS2;SNX11 |
| GO Biological Process | GO:1902285 | semaphorin-plexin signaling pathway involved in neuron projection guidance | 0.0018624 | PLXND1;SEMA3F |
| GO Biological Process | GO:0010564 | regulation of cell cycle process | 0.0019787 | SIX3;KLHL13;CDK13;MAD1L1 |
| GO Biological Process | GO:0045859 | regulation of protein kinase activity | 0.0022649 | FGR;DBNDD1;ALS2;CAMKK1 |
| GO Cellular Component | GO:0042612 | MHC class I protein complex | 0.0003664 | HLA-C;HLA-A |
| GO Cellular Component | GO:0031264 | death-inducing signaling complex | 0.0006795 | CASP10;CFLAR |
| GO Cellular Component | GO:0012507 | ER to Golgi transport vesicle membrane | 0.0025005 | SAR1A;HLA-C;HLA-A |
| GO Cellular Component | GO:0042611 | MHC protein complex | 0.0044344 | HLA-C;HLA-A |
| GO Cellular Component | GO:0030666 | endocytic vesicle membrane | 0.0081768 | M6PR;HLA-C;HLA-A;CFTR |
| GO Cellular Component | GO:0098553 | lumenal side of endoplasmic reticulum membrane | 0.0085962 | HLA-C;HLA-A |
| GO Cellular Component | GO:0071556 | integral component of lumenal side of endoplasmic reticulum membrane | 0.0085962 | HLA-C;HLA-A |
| GO Cellular Component | GO:0070013 | intracellular organelle lumen | 0.0098880 | ASAH1;VGF;POLDIP2;NDUFAF7;CSNK2B;FUCA2;CDK13;HS3ST1;CHGB;INS |
| GO Cellular Component | GO:0005769 | early endosome | 0.0108350 | ASAH1;ALS2;HLA-C;HLA-A;CFTR |
| GO Cellular Component | GO:0101002 | ficolin-1-rich granule | 0.0137120 | ASAH1;CSNK2B;ENPP4;CDK13 |
| GO Molecular Function | GO:0097199 | cysteine-type endopeptidase activity involved in apoptotic signaling pathway | 0.0010850 | CASP10;CFLAR |
| GO Molecular Function | GO:0016274 | protein-arginine N-methyltransferase activity | 0.0013218 | NDUFAF7;PRMT1 |
| GO Molecular Function | GO:0097200 | cysteine-type endopeptidase activity involved in execution phase of apoptosis | 0.0018624 | CASP10;CFLAR |
| GO Molecular Function | GO:0015174 | basic amino acid transmembrane transporter activity | 0.0018624 | SLC7A7;SLC7A2 |
| GO Molecular Function | GO:0097153 | cysteine-type endopeptidase activity involved in apoptotic process | 0.0024908 | CASP10;CFLAR |
| GO Molecular Function | GO:0043425 | bHLH transcription factor binding | 0.0053564 | PSMD9;KDM1A |
| GO Molecular Function | GO:0005179 | hormone activity | 0.0070353 | VGF;CHGB;INS |
| GO Molecular Function | GO:0001222 | transcription corepressor binding | 0.0092030 | CTBP1;SIX3 |
| GO Molecular Function | GO:0048018 | receptor ligand activity | 0.0190294 | VGF;SEMA3F;WNT16;CHGB;INS |
| GO Molecular Function | GO:0140297 | DNA-binding transcription factor binding | 0.0205433 | PSMD9;KDM1A;CTBP1;MIXL1 |

3.4 Establishing a PPI network and discovering hub genes

We used STRING to analyze the PPI network and Cytoscape to visualize it, in order to predict the adherence pathways and recurrent interactions between DEGs. Highly interacting proteins were defined from the PPI analysis using topological metrics, such as a degree higher than 15. The PPI network of the most prominent DEGs contains 75 nodes and 226 edges (Figure 6). Hub genes are strongly associated within candidate modules and rank in the top 10% by connectivity. Because of these interconnections, hub genes typically play a crucial role in biological systems. To find the top 18 DEGs (hub genes), we used Cytoscape’s Cytohubba plugin. Figure 7 illustrates the hub genes: TP53, INS, KDM1A, SNAI1, RCOR1, CTBP1, RPA1, RAD52, SQLE, CYP51A1, CFTR, CPE, C3, PRMT1, NFYB, CD38, CFP, and CASP10. These hub proteins could be useful as therapeutic targets, yet their roles still need to be explored. The T2D-related DEGs and their hub genes are summarized in Table 5.
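The degree criterion mentioned above can be illustrated with a small sketch that counts interaction partners in an undirected edge list and keeps nodes above a threshold. This is plain degree ranking only; the Cytohubba scoring methods used for the final hub list (MCC and BottleNeck) are different algorithms and are not reproduced here. The node names and edges are hypothetical placeholders, not the STRING network of this study.

```python
from collections import Counter

def node_degrees(edges):
    """Count interaction partners per node in an undirected network."""
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops.
    unique = {tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    return degree

def hubs_by_degree(edges, min_degree):
    """Return nodes whose degree exceeds min_degree, most connected first."""
    degree = node_degrees(edges)
    selected = [node for node, d in degree.items() if d > min_degree]
    return sorted(selected, key=lambda node: (-degree[node], node))

# Toy edge list with placeholder protein names (illustrative only).
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P5"), ("P2", "P1"), ("P3", "P6"),
]
print(hubs_by_degree(edges, min_degree=1))
```

On a real network exported from STRING, the same function with `min_degree=15` would correspond to the threshold stated in the text.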


Figure 6. PPI network consisting of type 2 diabetes DEGs.

The circular nodes in the diagram symbolize differentially expressed protein genes, while the edges depict the communication between nodes. The PPI consists of 75 nodes connected by 226 edges. The PPI network was generated utilizing STRING and visualized via Cytoscape.


Figure 7. Determining hub genes in the cluster using Cytohubba.

The MCC and BottleNeck methods available in the Cytohubba plugin were used to obtain hub genes. The top 14 hub genes from each method are highlighted, along with the links between them and other nodes. The BottleNeck network contains 58 nodes and 100 edges, whereas the MCC network has only 48 nodes and 90 edges.

Table 5. Evaluation of protein-protein interactions recognizes hub genes compiled by DEGs.

| Gene Symbol | Description | Feature |
|---|---|---|
| TP53 | Tumor Protein P53 | DNA-binding transcription factor activity |
| INS | Insulin | Identical protein binding and protease binding |
| KDM1A | Lysine Demethylase 1A | DNA-binding transcription factor activity and enzyme binding |
| SNAI1 | Snail Family Transcriptional Repressor 1 | Sequence-specific DNA binding and DNA-binding transcription repressor activity |
| RCOR1 | REST Corepressor 1 | Chromatin binding and DNA-binding transcription repressor activity |
| CTBP1 | C-Terminal Binding Protein 1 | DNA-binding transcription factor activity and transcription factor binding |
| RPA1 | Replication Protein A1 | Nucleic acid binding and single-stranded DNA binding |
| RAD52 | RAD52 Homolog, DNA Repair Protein | Identical protein binding and DNA strand exchange activity |
| SQLE | Squalene Epoxidase | Oxidoreductase activity and squalene monooxygenase activity |
| CYP51A1 | Cytochrome P450 Family 51 Subfamily A Member 1 | Iron ion binding and oxidoreductase activity |
| CFTR | CF Transmembrane Conductance Regulator | Enzyme binding and PDZ domain binding |
| CPE | Carboxypeptidase E | Cell adhesion molecule binding and carboxypeptidase activity |
| C3 | Complement C3 | Signaling receptor binding and C5L2 anaphylatoxin chemotactic receptor binding |
| PRMT1 | Protein Arginine Methyltransferase 1 | RNA binding and methyltransferase activity |
| NFYB | Nuclear Transcription Factor Y Subunit Beta | DNA-binding transcription factor activity and protein heterodimerization activity |
| CD38 | CD38 Molecule | Transferase activity and hydrolase activity, acting on glycosyl bonds |
| CFP | Complement Factor Properdin | Positively regulates the alternative complement pathway of the innate immune system |
| CASP10 | Caspase 10 | Ubiquitin protein ligase binding and cysteine-type peptidase activity |

3.5 Recognition of transcriptional and post-transcriptional regulators

We used a network-based strategy to identify the regulatory TFs and miRNAs, in order to locate substantial transcriptional changes and learn more about the regulatory molecules of the hub proteins. Transcription factors are proteins that govern gene activity and transcription across all life forms.47 miRNAs are small RNA molecules that regulate expression post-transcriptionally. We investigated the interactions between DEGs and TFs (Figure 8) and between DEGs and miRNAs (Figure 9). The major TFs regulating the differentially expressed genes were ELK4, FOXC1, FOXL1, GATA2, JUN, MEF2A, NFIC, NFKB1, POU2F2, PPARG, RELA, TEAD1, USF2, YY1, PRRX2, STAT3, TP53, E2F1, CREB1, NANOG, CREM, RUNX1, TP63, AR, HNF4A, POU5F1, SOX2, MITF, SPI1, MYC, FLI1, SUZ12, and EGR1. The miRNAs mir-6883-5p, mir-6785-5p, mir-149-3p, mir-4728-5p, mir-17-5p, mir-210-3p, mir-374a-5p, mir-21-3p, mir-129-2-3p, mir-7-5p, mir-16-5p, mir-1-3p, mir-124-3p, mir-155-5p, mir-27a-3p, mir-34a-5p, let-7b-5p, and mir-107 were identified as post-transcriptional regulators of the DEGs. Table 6 summarizes both the transcriptional and post-transcriptional regulators of the T2D-related differentially expressed genes.


Figure 8. The network of coordinated regulatory interactions between DEGs and TFs generated by NetworkAnalyst.

The circular cyan nodes represent transcription factors, while the circular red nodes represent gene icons that connect with transcription factors.


Figure 9. The connectivity of interrelated regulatory interactions between DEGs and miRNAs.

Here, the square node represents miRNAs, while the circular-shaped gene symbols connect with miRNAs.

Table 6. Overview of transcriptional and post-transcriptional regulatory biomolecules of differentially expressed genes of T2D (a) transcription regulators and (b) post-transcriptional regulators.

(a) Transcriptional regulators

| Symbol | Description | Feature |
|---|---|---|
| ELK4 | ETS Transcription Factor ELK4 | DNA-binding transcription factor activity and chromatin binding |
| FOXC1 | Forkhead Box C1 | DNA-binding transcription factor activity and transcription factor binding |
| FOXL1 | Forkhead Box L1 | DNA-binding transcription factor activity and transcription factor activity |
| GATA2 | GATA Binding Protein 2 | DNA-binding transcription factor activity and chromatin binding |
| JUN | Jun Proto-Oncogene | RNA binding and sequence-specific DNA binding |
| MEF2A | Myocyte Enhancer Factor 2A | DNA-binding transcription factor activity and protein heterodimerization activity |
| NFIC | Nuclear Factor I C | DNA-binding transcription activator activity, RNA polymerase II-specific |
| NFKB1 | Nuclear Factor Kappa B Subunit 1 | DNA-binding transcription factor activity and sequence-specific DNA binding |
| POU2F2 | POU Class 2 Homeobox 2 | DNA-binding transcription factor activity and protein domain specific binding |
| PPARG | Peroxisome Proliferator-Activated Receptor Gamma | DNA-binding transcription factor activity and chromatin binding |
| RELA | RELA Proto-Oncogene | DNA-binding transcription factor activity and identical protein binding |
| TEAD1 | TEA Domain Transcription Factor 1 | DNA-binding transcription factor activity |
| USF2 | Upstream Transcription Factor 2 | DNA-binding transcription factor activity and sequence-specific DNA binding |
| YY1 | YY1 Transcription Factor | DNA-binding transcription factor activity and transcription coactivator activity |
| PRRX2 | Paired Related Homeobox 2 | DNA-binding transcription factor activity and sequence-specific DNA binding |
| STAT3 | Signal Transducer And Activator Of Transcription 3 | DNA-binding transcription factor activity and sequence-specific DNA binding |
| TP53 | Tumor Protein P53 | DNA-binding transcription factor activity and protein heterodimerization activity |
| E2F1 | E2F Transcription Factor 1 | DNA-binding transcription factor activity and transcription factor binding |
| CREB1 | CAMP Responsive Element Binding Protein 1 | DNA-binding transcription factor activity and enzyme binding |
| NANOG | Nanog Homeobox | DNA-binding transcription factor activity and chromatin binding |
| CREM | CAMP Responsive Element Modulator | DNA-binding transcription factor activity and core promoter sequence-specific DNA binding |
| RUNX1 | RUNX Family Transcription Factor 1 | DNA-binding transcription factor activity and protein homodimerization activity |
| TP63 | Tumor Protein P63 | DNA-binding transcription factor activity and identical protein binding |
| AR | Androgen Receptor | DNA-binding transcription factor activity and chromatin binding |
| HNF4A | Hepatocyte Nuclear Factor 4 Alpha | DNA-binding transcription factor activity and sequence-specific DNA binding |
| POU5F1 | POU Class 5 Homeobox 1 | RNA binding and sequence-specific DNA binding |
| SOX2 | SRY-Box Transcription Factor 2 | DNA-binding transcription factor activity and protein heterodimerization activity |
| MITF | Melanocyte Inducing Transcription Factor | RNA polymerase II cis-regulatory region sequence-specific DNA binding |
| SPI1 | Spi-1 Proto-Oncogene | DNA-binding transcription factor activity and RNA binding |
| MYC | MYC Proto-Oncogene | RNA polymerase II cis-regulatory region sequence-specific DNA binding |
| FLI1 | Fli-1 Proto-Oncogene | DNA-binding transcription factor activity and chromatin binding |
| SUZ12 | SUZ12 Polycomb Repressive Complex 2 Subunit | Sequence-specific DNA binding and chromatin binding |
| EGR1 | Early Growth Response 1 | DNA-binding transcription factor activity and transcription factor binding |

(b) Post-transcriptional regulators

| Symbol | Description | Feature |
|---|---|---|
| mir-17-5p | MicroRNA 17 | Improved inflammation-induced insulin resistance by suppressing ASK1 expression in macrophages |
| mir-4728-5p | MicroRNA 4728 | Promotes proliferation and migration in breast cancer cells |
| mir-149-3p | MicroRNA 149 | Role in obesity-associated metabolic abnormalities |
| mir-6785-5p | MicroRNA 6785 | Role in tumor proliferation and invasion |
| mir-6883-5p | MicroRNA 6883 | Induces G1 phase cell-cycle arrest in colon cancer cells |
| mir-210-3p | MicroRNA 210 | Plays a protective role in cardiovascular homeostasis and is decreased in whole blood of T2DM mice |
| mir-374a-5p | MicroRNA 374a | Regulates inflammatory response in diabetic nephropathy by targeting MCP-1 expression |
| mir-21-3p | MicroRNA 21 | Enhances glucose uptake and subsequently promotes insulin secretion |
| mir-129-2-3p | MicroRNA 129-2 | Involved in inflammatory responses and apoptosis |
| mir-7-5p | MicroRNA 7 | Regulates GLP-1-mediated insulin release |
| mir-16-5p | MicroRNA 16 | miR-16 deletion leads to insulin resistance in males and exacerbated glucose intolerance in females |
| mir-1-3p | MicroRNA 1 | Positively associated with important characteristics of pre-diabetes, including glycaemic abnormalities and insulin resistance |
| mir-124-3p | MicroRNA 124 | Negative regulator that inhibits the proliferation, migration and invasion of cancer cells |
| mir-155-5p | MicroRNA 155 | Plays a crucial role in the pathogenesis of diabetes mellitus (DM) and its complications |
| mir-27a-3p | MicroRNA 27a | Promotes insulin resistance and mediates glucose metabolism by targeting PPAR-gamma-mediated PI3K/AKT signaling |
| mir-34a-5p | MicroRNA 34a | Affects the development and functional maturity of beta-cells, which in turn decreases the body’s glucose tolerance and insulin secretion |
| let-7b-5p | MicroRNA Let-7b | Regulates glucose metabolism and insulin sensitivity |
| mir-107 | MicroRNA 107 | Higher miR-107 expression is related to insulin resistance in the diabetic group |

3.6 Detection of potential medications

To understand the structural features implicated in signal transduction, a protein-drug interaction analysis48 is necessary. Using NetworkAnalyst with drug-protein connections from the DrugBank library, we listed 18 potential treatment drugs for the DEGs as possible pharmacological candidates in T2D. Figure 10 shows 14 well-known therapeutic agents found in the protein-drug interactions of the T2D DEGs: Insulin Human, Dalteparin, Lovastatin, Atorvastatin, Insulin glargine, Myristic acid, M-cresol, Insulin peglispro, L-lysine, L-ornithine, Ivacaftor, Glyburide, Bumetanide, and Lumacaftor. The potential uses of the remaining four chemical compounds in healthcare are still being investigated.


Figure 10. The picture represents 18 potential drugs for T2D treatments employing the protein-drug interaction strategy.

Here, the rectangular node symbolizes drugs, whereas the circular gene symbols are linked to drugs.

4. Discussion

As artificial intelligence advances rapidly, Machine Learning has become an essential part of bioinformatics for in-depth data analysis.49 Although ML techniques can be applied to most RNA sequence data, in this research we analyzed T2D data because T2D is a chronic illness that can have severe and life-threatening complications.15 We presented a count-based classification pipeline that identifies expressed genes using the feature importance technique and classifies patients from count data. The approaches used here also enable us to process large amounts of transcriptome data and, combined with various bioinformatics techniques, to draw reliable conclusions about T2D proteins, giving a comprehensive understanding of T2D and its associated biomarkers.

In our investigation of Type 2 Diabetes (T2D), we employed several supervised machine learning algorithms, including Random Forest, AdaBoost, Gradient Boosting, Logistic Regression, Decision Tree, and XGBoost. Their performance metrics, accuracies, and losses are visualized in Figure 3 and detailed in Table 2. From a bioinformatics perspective, we conducted pathway enrichment analysis (Figure 5), Gene Ontology assessments (Table 4), and protein-protein interaction studies (Figure 6), and we explored hub-protein interactions (Figure 7), transcription factor interactions (Figure 8), miRNA interactions (Figure 9), and drug-protein interactions (Figure 10). Each hub gene is described with its features in Table 5, and the transcriptional and post-transcriptional regulators of the differentially expressed genes are summarized in Table 6. The use of our machine learning models to identify significant genes from T2D RNA sequence data sourced from NCBI is illustrated in Figure 4. The entire methodology is outlined in Figure 1, the dataset is presented in Table 1, and the parameters of our top-performing model, XGBoost, are shown in Table 3.

The superior performance of XGBoost, our best model, on this dataset can be attributed to several factors. Its ability to model complex non-linear relationships, combined with built-in L1 and L2 regularization, makes it well suited to high-dimensional data. Features such as internal handling of missing values, tree pruning, and efficient column block computation further improve its efficiency. The model’s flexibility in hyperparameter tuning, resilience to outliers, and capacity to capture feature interactions likely contributed to its edge. In addition, some datasets are inherently better suited to gradient-boosted trees, which suggests that the underlying patterns in our data were a particularly good fit for XGBoost. The objective function and split criterion of the XGBoost model are given below:

Given a dataset with $n$ samples and $m$ features, the prediction of the model for the $i$th instance at the $t$th iteration is denoted as $\hat{y}_i^{(t)}$. The objective function to be optimized in XGBoost at each iteration is:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{j=1}^{t} \Omega(f_j) \tag{1}$$

where $l$ is the differentiable convex loss function and $\Omega$ is the regularization term defined as:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{2}$$

Here, $T$ is the number of leaves in the tree and $w_j$ is the score assigned to the $j$th leaf.
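
As a concrete reading of Eqs. (1) and (2), the following minimal Python sketch evaluates the regularized objective for a toy case. It assumes a squared-error loss for $l$ and represents each tree only by its list of leaf scores; the function names, the toy values, and the choice of loss are illustrative assumptions and not the configuration used in this study.

```python
def omega(leaf_weights, gamma, lam):
    """Regularization term of Eq. (2): gamma * T + 0.5 * lambda * sum(w_j^2)."""
    num_leaves = len(leaf_weights)  # T in Eq. (2)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees, gamma, lam):
    """Eq. (1): loss summed over samples plus Omega summed over all trees.

    Assumption: l(y, y_hat) = (y - y_hat)^2, one possible convex loss.
    `trees` is a list of leaf-score lists, one list per boosting iteration.
    """
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(omega(leaves, gamma, lam) for leaves in trees)
    return loss + penalty


# Hypothetical toy values: three samples, two trees with 2 and 3 leaves.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.5, 0.5, 1.0]
trees = [[0.5, -0.5], [0.25, 0.25, -0.5]]

print(objective(y_true, y_pred, trees, gamma=1.0, lam=2.0))  # 6.375
```

Here the loss contributes 0.5 and the two trees contribute 2.5 and 3.375, which shows how a larger $\gamma$ penalizes additional leaves while a larger $\lambda$ penalizes large leaf scores.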

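The optimal structure of the tree is found by choosing, at each node, the split that maximizes the gain, that is, the reduction in the regularized objective:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{g^2}{h+\lambda} + \frac{(G-g)^2}{H-h+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma \tag{3}$$

with $G$ and $H$ being the sums of the first- and second-order gradients of the loss function for the instances in the current node, and $g$ and $h$ being the corresponding sums for the instances that go to the left child node, so that $G-g$ and $H-h$ are the sums for the right child node. The parameters $\gamma$ and $\lambda$ are the regularization parameters from Eq. (2).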
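
The split search can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard XGBoost split gain (left-child score plus right-child score minus parent score, halved, minus $\gamma$) and the standard optimal leaf score $-G/(H+\lambda)$; the function names and the toy gradients are hypothetical, and the samples are assumed to be already sorted by the candidate feature.

```python
def node_score(grad_sum, hess_sum, lam):
    """Structure score of a node: G^2 / (H + lambda)."""
    return grad_sum * grad_sum / (hess_sum + lam)


def split_gain(G, H, g, h, gamma, lam):
    """Gain of splitting a node with sums (G, H) into left (g, h) and right (G-g, H-h)."""
    left = node_score(g, h, lam)
    right = node_score(G - g, H - h, lam)
    parent = node_score(G, H, lam)
    return 0.5 * (left + right - parent) - gamma


def leaf_weight(grad_sum, hess_sum, lam):
    """Optimal leaf score for a node: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + lam)


def best_split(grads, hess, gamma, lam):
    """Scan every split position and return (left_size, gain) with the largest gain."""
    G, H = sum(grads), sum(hess)
    best_pos, best_gain = None, float("-inf")
    g = h = 0.0
    for k in range(len(grads) - 1):
        g += grads[k]
        h += hess[k]
        gain = split_gain(G, H, g, h, gamma, lam)
        if gain > best_gain:
            best_pos, best_gain = k + 1, gain
    return best_pos, best_gain


# Toy case (assumed): loss 0.5 * (y - y_hat)^2 with y_hat = 0 and y = [1, 1, -1, -1],
# so each gradient is y_hat - y and each second-order gradient is 1.
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]

pos, gain = best_split(grads, hess, gamma=0.0, lam=1.0)
print(pos, gain)
```

In this toy case the largest gain is obtained by placing the first two samples in the left child, which separates the positive from the negative targets. A split is only worthwhile when its gain is positive, so raising $\gamma$ prunes weak splits.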

Gene Ontology (GO) and pathway enrichment analysis is a widely used statistical method in bioinformatics that helps researchers gain insights into the biological relevance of large gene sets. In T2D, persistent exposure to high glucose levels and free fatty acids induces beta-cell dysfunction and may initiate beta-cell apoptosis.50 Sterol regulatory element binding proteins (SREBPs) regulate lipid production and adipogenesis, and SREBP expression was considerably reduced in individuals with T2D.51 Neutrophil degranulation is related to an aberrant echocardiographic pattern in T2D.52 The expression of the FAS signaling pathway (CD95) was connected to systemic and skeletal muscle insulin resistance.53 IFN-gamma- or TNF-alpha-mediated cell proliferation is associated with T2D. Interferon-gamma is crucial for the destruction of cells and the onset of T2D.54 Tumor necrosis factor (TNF)-alpha, a cytokine derived primarily from macrophages and adipocytes, can promote insulin resistance (IR) and ultimately aid the progression of T2D.55 The innate immune system plays a crucial part in T2D by contributing to low-grade inflammation and insulin resistance, which are key factors in the development and progression of the disease.56 Several MHC class I alleles, such as HLA-B, and MHC class II alleles (including HLA-DRB1 and HLA-DQB1) were associated with T2D risk.57,58
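
To illustrate the statistics behind enrichment analysis, the sketch below computes an over-representation p-value with the hypergeometric test, a common choice for this kind of analysis. This section does not state which test or tool produced the reported pathways, so the test itself, the function name, and the gene counts are assumptions for illustration only.

```python
from math import comb


def enrichment_p_value(background, in_pathway, selected, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(background, in_pathway, selected).

    background: number of genes in the background set
    in_pathway: number of background genes annotated to the pathway
    selected:   number of genes in the list being tested
    overlap:    number of selected genes annotated to the pathway
    """
    upper = min(in_pathway, selected)
    total = comb(background, selected)
    tail = sum(
        comb(in_pathway, i) * comb(background - in_pathway, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 in the pathway,
# 4 genes selected by the model, 2 of which fall in the pathway.
p = enrichment_p_value(background=20, in_pathway=5, selected=4, overlap=2)
print(round(p, 4))  # 0.2487
```

A small p-value indicates that the selected genes overlap the pathway more than expected by chance. In practice the p-values for all tested pathways are adjusted for multiple testing before any pathway is reported as enriched.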

We identified hub proteins that are expressed highly or poorly in T2D patients. Patients with T2D had significantly greater serum TP53 levels than healthy non-diabetic controls.59 The decreased level of CTBP1-AS2 was linked with diabetes in the Iranian population.60 The presence of complement factor 3 (C3) is linked to insulin resistance.61 ROS- and RAS-mediated diabetic retinopathy involves PRMT-1 and DDAHs-induced ADMA upregulation.62 In a sizeable portion (9–15%) of patients with Type 2 or persistent Type 1 diabetes, CD38 autoantibodies have been discovered. Most of these autoantibodies (about 60%) exhibit agonistic characteristics, such as Ca2+ mobilization in lymphocytic cell lines and in pancreatic islets, indicating that they are biologically active. CD38 autoantibodies promote glucose-mediated insulin secretion in human pancreatic islets.63 The hypomethylation of CASP10 may result from T2D and severe and long-lasting hyperglycemia.64 In humans, CFTR deficiency causes intrinsic abnormalities in insulin secretion inside the islets.65

Our discovered TFs are associated with T2D. The adult human pancreas has been found to express the Sox2 gene. It seems improbable that Sox2 will have a genetic influence on the development of T2D.66 In the human intestine, MYC transcription factor expression is correlated with either glycaemic management (HbA1c level) or body mass index (BMI).67 In the skeletal muscle of T2D patients, STAT3 is constitutively phosphorylated, and increased STAT3 signaling plays a role in the etiology of T2D and insulin resistance.68 Much research has been conducted on the peroxisome proliferator-activated receptor gamma (PPARG), whose ligands have become effective insulin sensitizers in type 2 diabetes.69 YY1 plays an important role in T2DM and it might be useful as a new therapeutic target in the fight against the disease.70

Our identified miRNAs are also linked to T2D. By targeting the SOCS1-mediated NF-κB pathway, miR-210-3p increases insulin resistance and obesity-induced adipose tissue inflammation. MiR-374a-5p appears to be associated with the downregulation of pro-inflammatory biomarkers that are connected to insulin resistance and is elevated in metabolically healthy obese persons as compared to metabolically abnormal obese patients.71 A possible biomarker for the early diagnosis of diabetic nephropathy is miR-21-3p, whose expression is downregulated in association with the onset of diabetic nephropathy.72 Targeting miR-7-5p may be a therapeutic approach to metabolic illnesses brought on by insulin dysfunction.73 Patients with diabetic neuropathy have a dysregulated long non-coding miR-1-3p axis.74 Compared to healthy individuals, peripheral blood mononuclear cells (PBMCs) obtained from T2D patients exhibited low miR-155 expression.75

Our identified potential drug molecules may be effective for T2D patients. Insulin Human N (medium-acting) and Insulin Human R (short-acting) are both used in diabetes mellitus to lower blood glucose levels. Insulin glargine modulates carbohydrate, protein, and lipid metabolism by suppressing hepatic glucose synthesis and lipolysis and improving peripheral glucose clearance. Long-acting insulin, or insulin peglispro, is used to treat both T1D and T2D.76 Glyburide belongs to the class of medications known as sulfonylureas, which stimulate the pancreas to produce insulin and reduce blood sugar levels.77 T2D and its associated illnesses may be prevented and treated using myristic acid.78 It remains uncertain how Dalteparin, Lovastatin, Atorvastatin, Myristic acid, M-cresol, L-lysine, L-ornithine, Ivacaftor, Bumetanide, and Lumacaftor interact with T2D, so further studies, including preclinical and clinical trials, are required.

5. Conclusions

Our research aimed to use a machine learning approach to identify significant genes through feature importance and, together with common bioinformatics methods, to help address the global burden of T2D and improve the lives of individuals affected by this chronic disease.

In terms of the machine learning approach, our research represents a notable advancement in bioinformatics and T2D research. By combining machine learning algorithms, statistical analysis, and bioinformatics techniques, we have gained valuable insights into the molecular mechanisms underlying T2D. This approach helps identify expressed genes without depending solely on the conventional bioinformatics approach of Differentially Expressed Genes (DEGs). Moreover, for a large number of features, machine learning can be an effective and more reliable way to predict expressed genes. We processed and analyzed large volumes of transcriptomic data, and our XGBoost model showed a noticeable performance with 94.1% accuracy. This result demonstrates the potential of machine learning as a tool for precise and efficient T2D diagnosis, which could also significantly impact clinical practice by enabling early intervention and personalized treatment approaches based on count data.

On the other hand, in our comprehensive molecular biomarker study on T2D, we analyzed a diverse set of molecular entities, including 20 pathways, 30 gene ontologies, 51 transcriptional factors, 18 hub-genes, 18 miRNAs, and 18 potential drugs, a subset of which we were able to validate. According to our validation, key pathways such as SREBPs, neutrophil degranulation, FAS signaling, IFN-gamma/TNF-alpha-mediated cell proliferation, and the innate immune system were highlighted as central to T2D’s development. Hub proteins, notably TP53, CTBP1-AS2, C3, PRMT-1, DDAHs, CD38 autoantibodies, CASP10, and CFTR, present promising avenues for disease research and treatment. Similarly, transcriptional factors like Sox2, MYC, STAT3, PPARG, and YY1, coupled with miRNAs such as miR-210-3p, miR-374a-5p, miR-21-3p, miR-7-5p, and miR-155, shed light on the regulatory dynamics underpinning T2D. Furthermore, our research identified potential drug molecules, including insulin analogs, sulfonylureas, myristic acid, and other compounds, that hold promise for therapeutic intervention. While further preclinical and clinical trials are necessary to validate their efficacy and safety profiles, these findings offer potential avenues for improving T2D treatment strategies and patient outcomes.

Overall, this study represents a significant advancement in using machine learning to identify expressed genes from RNA sequence count data, and by integrating key bioinformatics methods on T2D and validating our findings against prior research, we offer a robust approach to understanding and addressing T2D at the molecular level.

One of the limitations of our study is the accuracy of our models, which needs further improvement. While our current approach provided insights, the application of deep learning models in future studies could enhance the precision of our findings. Additionally, there remains a set of biomarkers that we have yet to validate with existing literature. A thorough validation of these biomarkers is essential, as it would offer researchers a more detailed understanding of expressed genes, ultimately aiding in more accurate analyses and better strategies to address T2D.

Author contributions

Md Al Amin: Took the lead in conceptualization and design of the work; primary role in data curation and analysis; led the writing of the original draft of the manuscript; major contributor in review and editing; gave final approval of the version to be published; agreed to be accountable for all aspects of the work.

Feroza Naznin: Played a significant role in the formal analysis of the data; contributed to the investigation; involved in writing, review, and editing of the manuscript; gave final approval and agreed to be accountable for all aspects of the work.

Most Nilufa Yeasmin: Led the acquisition of resources; involved in data curation; gave final approval and agreed to be accountable for all aspects of the work.

Md Sumon Sarkar: Took the lead in software development; contributed to the validation of the results; gave final approval and agreed to be accountable for all aspects of the work.

Md Misor Mia: Led the visualization process; supported the investigation; gave final approval and agreed to be accountable for all aspects of the work.

Abdullahi Chowdhury: Contributed to the methodology; supported the writing, review, and editing of the manuscript; gave final approval and agreed to be accountable for all aspects of the work.

Md Zahidul Islam: Acted as the primary supervisor and administrator of the project; took a leading role in writing, reviewing, and editing the manuscript; gave final approval and agreed to be accountable for all aspects of the work.

how to cite this article
Al Amin M, Naznin F, Yeasmin MN et al. High throughput biological sequence analysis using machine learning-based integrative pipeline for extracting functional annotation and visualization [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:161 (https://doi.org/10.12688/f1000research.144871.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 01 Oct 2024
Hemant Kulkarni, M&H Research, LLC, San Antonio, TX, USA 
Approved with Reservations
The authors take a machine learning approach to identification of DEGs in the context of type 2 diabetes. The dataset is from real-life, the sample size is adequate, the methods are well described and the results appropriately stated.

HOW TO CITE THIS REPORT
Kulkarni H. Reviewer Report For: High throughput biological sequence analysis using machine learning-based integrative pipeline for extracting functional annotation and visualization [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:161 (https://doi.org/10.5256/f1000research.158726.r317925)
Reviewer Report 22 Jul 2024
Farizky Martriano Humardani, Universitas Brawijaya, Malang, East Java, Indonesia 
Approved with Reservations
This study presents comprehensive and well - organized data on markers for T2DM and associated drugs. However, I have several feedback points for improvement:

1. Points 3.3, 3.4, and 3.5 do not show any correlation or new ...
HOW TO CITE THIS REPORT
Humardani FM. Reviewer Report For: High throughput biological sequence analysis using machine learning-based integrative pipeline for extracting functional annotation and visualization [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:161 (https://doi.org/10.5256/f1000research.158726.r303201)