Keywords
Marine Predator Algorithm, Particle Swarm Optimization, Gene selection Optimization, rMRMR, Classification
This article is included in the Bioinformatics gateway.
High-dimensional microarray data complicates reliable cancer classification. Compact, informative gene panels are needed to maintain predictive power while improving interpretability and cost.
We propose a two-stage feature-selection pipeline. Stage 1 ranks genes via an ensemble of filters—ReliefF, chi-square, and Kullback–Leibler divergence—tempered by minimum redundancy–maximum relevance to promote diversity. Stage 2 performs wrapper-based subset search using a Particle Marine Predator Optimizer that fuses Marine Predators Algorithm for global exploration with Particle Swarm Optimization for local refinement. The objective maximizes cross-validated SVM accuracy while penalizing subset size.
Across seven benchmarks (Breast, CNS, Leukemia, Leukemia-3c, Leukemia-4c, Lymphoma, Ovarian), we compare against Bat Algorithm, Grey Wolf Optimizer, Marine Predators Algorithm, White Shark Optimizer, and recent representatives using accuracy, F1, precision, sensitivity, Matthews correlation coefficient, selected-gene count, and convergence behavior. The method frequently matches or exceeds alternatives while selecting few genes, achieving perfect accuracy on several datasets (Leukemia, Leukemia-3c, Lymphoma, Ovarian) and stable, strong performance on the remainder. Typical subset sizes are 2–5 genes for Leukemia variants, 7–8 for CNS, and ~20 for Breast. Optimization traces show rapid, steady improvement.
The pipeline achieves an effective exploration–exploitation balance, yielding compact gene panels without sacrificing classification performance. Its modular design supports straightforward extension to larger cohorts and other omics modalities.
Feature selection involves choosing or deriving features to reduce the amount of data to be processed, which is crucial in addressing various challenges.1–3 DNA microarrays represent a molecular technique enabling the analysis of thousands of genes in a single experiment, using numerous cells or tissues. The evolution of DNA microarray technology has resulted in the production of high-dimensional datasets, significantly influencing areas like clinical diagnostics and drug development.4 Gene expression data derived from DNA microarray experiments has become a crucial tool for cancer classification and detection.5,6 However, this data is often burdened with irrelevant, redundant, and noisy genes, posing a significant challenge to machine learning algorithms. Developing a predictive model based on unrelated genes can lead to decreased classification accuracy. One approach to resolve this issue is through gene selection, a process of eliminating irrelevant and redundant genes while preserving the most relevant ones.7 Gene selection can offer deeper insights,5 such as assisting researchers in understanding the molecular mechanisms of cancer and potentially leading to new therapies through an analysis of gene patterns, as well as reducing clinical costs.
Gene selection techniques are generally classified into two categories7: filter and wrapper methods. Filter methods are valued for their computational efficiency, as they assess genes based on the dataset’s intrinsic properties without involving machine learning algorithms. Commonly used filter techniques include Minimum-Redundancy-Maximum-Relevance (MRMR), Robust-MRMR, and ReliefF.
Conversely, wrapper methods frame gene selection as an optimization problem,8–14 employing search techniques or machine learning algorithms to evaluate gene subsets. While wrapper approaches typically achieve higher classification accuracy compared to filter methods, they come with significant computational costs. To address this, hybrid methods that integrate filter and wrapper techniques have become increasingly popular.15–18 These hybrid approaches have shown greater effectiveness in handling high-dimensional datasets, such as microarray data, particularly in classification tasks. Despite advancements, further studies are essential to design more effective hybrid gene selection techniques.15,16
A key challenge in gene selection arises from the exponential increase in potential solutions as the number of genes grows. Consequently, researchers strive to discover near-optimal gene subsets by improving existing metaheuristic methods.
Metaheuristic algorithms serve as general-purpose frameworks that optimize search processes independently of specific problems.19
Several metaheuristic methods have been adapted for gene selection. Examples include a Harmony Search method improved by a Markov Blanket,20 a Binary Flower Pollination Algorithm merged with β-Hill Climbing,21 the rMRMR technique paired with an enhanced Bat Algorithm,6 a Binary JAYA Algorithm incorporating Adaptive Mutation,22 and Correlation-Based Feature Selection used alongside a refined Binary Particle Swarm Optimization.15 Nonetheless, the complexity of the search space and gene interactions means these methods frequently encounter issues with becoming trapped in local optima.
The Marine Predators Algorithm (MPA) is a metaheuristic optimization method introduced by Faramarzi,23 influenced by marine animals’ hunting habits. MPA operates using stochastic population updates and employs two random walk strategies: Brownian motion and Lévy flight. Recognized for its simple parameter tuning, wide applicability, user-friendliness, and strong search performance, MPA has seen successful use in diverse areas. These applications encompass ECG signal classification,24 dynamic clustering,25 energy-efficient fog computing,26 medical image segmentation for COVID-19,27,28 and photovoltaic array reconfiguration.29
The MPA has been adapted and enhanced in several studies. One study33 introduced a hybrid gene selection method (MPAC) for DNA microarray-based cancer classification, combining Minimum Redundancy Maximum Relevance (mRMR) filtering with an Improved Marine Predator Optimizer enhanced by a crossover operator. By optimizing both exploration and exploitation, MPAC sought concise biomarker subsets and employed k-nearest neighbor for classification. Experiments on nine benchmark datasets demonstrated that MPAC consistently outperformed or remained competitive with state-of-the-art algorithms. Nonetheless, the reliance on the crossover operator introduces potential weaknesses, including risks of premature convergence, excessive randomness that disrupts promising search trajectories, and sensitivity to crossover rate settings, which limit stability and reproducibility across diverse datasets.
We introduce a hybrid gene selection technique, named PMPA, which integrates the rMRMR filter method with a modified version of the Marine Predators Algorithm that incorporates Particle Swarm Optimization (PSO) as a wrapper. The modifications are designed to enhance population diversity at the conclusion of each MPA iteration. We evaluate the performance of the proposed method on seven datasets of different dimensions, using the number of selected genes and classification accuracy as metrics, and compare the approach against other gene selection techniques. Additional comparisons with seven recent advanced methods on the same datasets indicate the effectiveness of the proposed method, which achieves superior results on four of the datasets.
The structure of this study is as follows: Section 2 and Section 3 provide a review and explanation of the methodology. The results are presented in Section 4, and the conclusions along with potential future research directions are discussed in Section 5.
This part explores the development of the MPA, which has been improved to function as a simple and efficient metaheuristic optimization method.
MPA is a population-based optimizer modeled on marine foraging; it alternates between Lévy flights and Brownian moves to explore and exploit the search space.
1. High velocity ratio: if predators move far faster than prey, the best option is effectively to hold position while the prey’s motion (Lévy or Brownian) drives encounters.
2. Unit velocity ratio: when predator and prey speeds are similar, Brownian updates are favored for the predator, particularly if the prey follows Lévy motion.
3. Low velocity ratio: if prey outruns predators, Lévy steps are preferred by the predator regardless of the prey’s movement style.
Figure 1 summarizes the three search regimes that underpin the Marine Predators Algorithm (MPA) and explain how the optimizer balances exploration and exploitation over time.

Conceptual schematic of the three velocity-ratio stages (high, unit, low) that govern exploration vs. exploitation. Early iterations emphasize Brownian exploration around elite guidance; the mid regime mixes Lévy and Brownian updates across sub-populations; late iterations emphasize Lévy jumps plus Fish Aggregating Devices (FADs) disturbance to avoid stagnation. Abbreviations: MPA, Marine Predators Algorithm; FADs, Fish Aggregating Devices.
Figure 1 contrasts predator and prey movement under different relative speed ratios and links these behavioral rules to the mathematical updates used during optimization. In the early stage, when the effective predator speed exceeds the prey’s, the most advantageous tactic is to remain largely stationary while sampling the space with wide, randomized steps. Operationally, MPA models this with Brownian perturbations around elite guidance, which encourages broad coverage of the decision space and reduces the risk of premature commitment to a local basin. This phase is dedicated to exploration and typically occupies the first third of the iteration budget.
When predator and prey move at comparable speeds—the mid-optimization regime—the algorithm mixes movement models to hedge between global search and local refinement. Half of the population is updated using Lévy flights, which inject heavy-tailed steps capable of vaulting across deceptively flat regions, while the remaining half follows Brownian motion to consolidate promising areas. The alternation helps the swarm probe new basins without abandoning ongoing improvements, acting as a controlled transition toward exploitation.
In the late stage, the prey effectively outpaces the predator, so the predator relies on occasional long-range Lévy jumps to re-engage valuable regions while fine-tuning around incumbents. Here, exploitation dominates: solutions are refined relative to an elite memory that tracks the best-so-far candidate and broadcasts directional cues to the population. The algorithm also introduces ecologically motivated disturbances—such as the Fish Aggregating Devices (FADs) mechanism—to periodically reshuffle a subset of positions. This controlled randomness prevents stagnation around deceptive attractors while preserving the information accumulated in the elite matrix.
Taken together, the three regimes in Figure 1 provide a principled schedule for step-size distributions, population partitioning, and memory use. Early Brownian exploration maps the landscape; a mid-course Lévy/Brownian hybrid tests new basins while validating incumbents; and a late Lévy-accented exploitation phase concentrates effort where returns are highest. In the context of gene selection, these dynamics help the optimizer discover sparse, high-performing subsets despite the combinatorial explosion of possibilities. By coupling elite guidance with stochastic motion models whose statistics change over time, MPA systematically converts biological foraging insight into an effective search policy for high-dimensional feature selection.
2.1.1 Initialization
The initialization phase begins with the creation of a prey population within the defined search space, as outlined in Eq. 1:

$$\vec{X}_0 = lb + rand \times (ub - lb) \quad (1)$$

In this equation, $lb$ and $ub$ indicate the minimum and maximum boundaries, respectively, while $rand$ represents a randomly generated value between 0 and 1.
After producing the prey population, the fitness values are calculated. The predator with the best score, represented by $X^I$, is identified as the most efficient forager based on evolutionary principles. The Elite matrix is then constructed by replicating this individual; the matrix has dimensions $(n \times d)$, where $n$ represents the population size and $d$ represents the number of dimensions, as defined in Eq. 2.
To update predator positions, a second matrix called Prey is constructed, having the same dimensions as the Elite matrix, as expressed in Eq. 3.
2.1.2 High velocity ratio stage
As previously described, when the predator’s speed exceeds the speed of the prey, the predator should stay motionless as this is the best course of action. This phase represents the exploration stage, which continues as long as $iter < \frac{1}{3} max\_iter$. Equations 4 and 5 give the mathematical model for this phase:

$$\vec{stepsize}_i = \vec{R}_B \otimes (\vec{Elite}_i - \vec{R}_B \otimes \vec{Prey}_i) \quad (4)$$
$$\vec{Prey}_i = \vec{Prey}_i + P \cdot \vec{R} \otimes \vec{stepsize}_i \quad (5)$$

In Equation 4, the Brownian motion is symbolized by $\vec{R}_B$, a vector comprised of random values derived from a normal distribution. The study’s authors indicate that in Equation 5, $P$ is assigned a value of 0.5, while $\vec{R}$ is a uniformly distributed random vector ranging from 0 to 1. In both of these equations, the symbol $\otimes$ represents element-wise multiplication.
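As a concrete sketch, the exploration-phase update of Eqs. 4 and 5 can be written in a few lines of NumPy. The function name and toy population sizes below are illustrative, not from the paper:

```python
import numpy as np

def phase1_update(prey, elite, rng, P=0.5):
    """High-velocity-ratio (exploration) update, Eqs. 4-5:
    stepsize_i = R_B (x) (Elite_i - R_B (x) Prey_i)
    Prey_i     = Prey_i + P * R (x) stepsize_i
    where R_B is Brownian (standard normal) and R is uniform on [0, 1).
    """
    R_B = rng.standard_normal(prey.shape)   # Brownian random vector
    R = rng.random(prey.shape)              # uniform random vector
    stepsize = R_B * (elite - R_B * prey)   # (x) denotes element-wise product
    return prey + P * R * stepsize

rng = np.random.default_rng(0)
prey = rng.random((5, 3))                   # 5 agents, 3 dimensions (toy sizes)
elite = np.tile(prey[0], (5, 1))            # Elite matrix replicates the best agent
new_prey = phase1_update(prey, elite, rng)
```

Because the step is centered on the elite, all agents drift stochastically toward the best-known region while retaining broad Brownian spread.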
2.1.3 Unit velocity ratio stage
During this phase, the predator and prey move at matching speeds. This stage corresponds to the midpoint of the optimization process, where the focus begins to transition from exploration to exploitation. The condition $\frac{1}{3} max\_iter < iter < \frac{2}{3} max\_iter$ must be satisfied. The mathematical models for the first half of the population, which uses Levy motion, are described in Eqs. 6 and 7:

$$\vec{stepsize}_i = \vec{R}_L \otimes (\vec{Elite}_i - \vec{R}_L \otimes \vec{Prey}_i) \quad (6)$$
$$\vec{Prey}_i = \vec{Prey}_i + P \cdot \vec{R} \otimes \vec{stepsize}_i \quad (7)$$

For the second half of the population, which utilizes Brownian motion, the mathematical expressions are given by Eqs. 8 and 9:

$$\vec{stepsize}_i = \vec{R}_B \otimes (\vec{R}_B \otimes \vec{Elite}_i - \vec{Prey}_i) \quad (8)$$
$$\vec{Prey}_i = \vec{Elite}_i + P \cdot CF \otimes \vec{stepsize}_i \quad (9)$$

In these equations, $\vec{R}_L$ and $\vec{R}_B$ represent Levy and Brownian motion, respectively, while $CF$ is an adaptive parameter that controls the step size. This parameter is calculated using Eq. 10:

$$CF = \left(1 - \frac{iter}{max\_iter}\right)^{2 \frac{iter}{max\_iter}} \quad (10)$$
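The adaptive factor of Eq. 10 shrinks the step size as iterations progress; a one-line sketch makes the behavior easy to check:

```python
def cf(iteration, max_iter):
    """Adaptive step-size factor, Eq. 10: CF = (1 - iter/max_iter)^(2*iter/max_iter)."""
    t = iteration / max_iter
    return (1.0 - t) ** (2.0 * t)
```

CF starts at 1 and decays monotonically to 0 over the run, so updates that are scaled by CF stay progressively closer to the elite as the search matures.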
2.1.4 Low-velocity ratio stage
In this phase, the prey moves at a much higher speed compared to the predator, making Levy motion the predator’s most effective hunting strategy. This stage corresponds to the exploitation phase, occurring during the latter part of the optimization process when $iter > \frac{2}{3} max\_iter$. This stage is modeled in Eqs. 11 and 12:

$$\vec{stepsize}_i = \vec{R}_L \otimes (\vec{R}_L \otimes \vec{Elite}_i - \vec{Prey}_i) \quad (11)$$
$$\vec{Prey}_i = \vec{Elite}_i + P \cdot CF \otimes \vec{stepsize}_i \quad (12)$$
Numerous studies30 emphasize that environmental factors, such as eddy formation and Fish Aggregating Devices (FADs), greatly influence prey behavior. FADs, specifically, alter the time predators allocate to searching, with 80% of their efforts concentrated locally and the remaining 20% directed toward chasing prey in other areas. The influence of FADs is quantified using Eq. 13:

$$\vec{Prey}_i = \begin{cases} \vec{Prey}_i + CF\left[lb + \vec{R} \otimes (ub - lb)\right] \otimes \vec{U} & \text{if } r \le FADs \\ \vec{Prey}_i + \left[FADs(1 - r) + r\right](\vec{Prey}_{r1} - \vec{Prey}_{r2}) & \text{if } r > FADs \end{cases} \quad (13)$$

In Eq. 13, $\vec{U}$ is a binary vector composed of elements that are either 1 or 0. It is generated by assigning random values between 0 and 1 to each element, where values below 0.2 are set to zero, and those equal to or above 0.2 are set to one. The parameter $r$ signifies a random value generated between 0 and 1, while $FADs$ represents the probability of FADs impacting the search process. The subscripts $r1$ and $r2$ denote randomly picked indices from the prey matrix, and $lb$ and $ub$ indicate the lower and upper bounds, respectively.
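The FADs disturbance can be sketched in NumPy as follows; this is a minimal illustration, and the helper name, toy population sizes, and CF value are ours:

```python
import numpy as np

def fads_effect(prey, lb, ub, rng, FADs=0.2, CF=0.5):
    """Apply the FADs disturbance of Eq. 13 to the whole population (sketch)."""
    n, d = prey.shape
    r = rng.random()                                     # single uniform draw
    if r <= FADs:
        # binary mask U: entries below 0.2 -> 0, otherwise -> 1
        U = (rng.random((n, d)) >= 0.2).astype(float)
        return prey + CF * (lb + rng.random((n, d)) * (ub - lb)) * U
    # otherwise jump along the line between two random population members
    r1, r2 = rng.integers(0, n, size=2)
    return prey + (FADs * (1.0 - r) + r) * (prey[r1] - prey[r2])

rng = np.random.default_rng(1)
pop = rng.random((6, 4))                                 # 6 agents, 4 dimensions (toy)
perturbed = fads_effect(pop, lb=0.0, ub=1.0, rng=rng)
```

Only a random subset of dimensions (the mask U) is relocated, which is what lets the mechanism reshuffle stagnant agents without destroying the whole population’s progress.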
The MPA models the memory behavior of marine predators by maintaining earlier prey positions alongside updating the current ones. The fitness scores of both current and previous solutions are evaluated, and positions are swapped when the prior solution demonstrates superior fitness.
The MPA optimization steps are presented in Algorithm 1.
Initialize prey population Prey with random values within [lb, ub]
Evaluate fitness for each prey in Prey
Set the predator with the best fitness as the Elite predator X^I
Initialize the Elite matrix from X^I and the Prey matrix with the population
for each iteration iter do
if iter < max_iter/3 then
for each prey do
Calculate stepsize using Brownian motion (Eq. 4)
Update prey position with step size P · R (Eq. 5)
end for
else if max_iter/3 ≤ iter < 2 · max_iter/3 then
for the first half of the prey population do
Calculate stepsize using Levy motion (Eq. 6)
Update prey position (Eq. 7)
end for
for the second half of the prey population do
Calculate stepsize using Brownian motion (Eq. 8)
Update prey position with step size P · CF (Eq. 9)
end for
else
for each prey do
Calculate stepsize using Levy motion (Eq. 11)
Update prey position with step size P · CF (Eq. 12)
end for
end if
Update the Elite matrix and recalculate fitness values
Apply the Fish Aggregating Devices (FADs) effect (Eq. 13):
for each prey do
if r ≤ FADs then
Update prey position using CF, U, lb, and ub
else
Update prey position using two random Prey positions Prey_r1 and Prey_r2
end if
end for
end for
Return the optimal solution found in Elite
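Putting Algorithm 1 together, the following compact NumPy sketch runs the three phases, the FADs disturbance, and the marine-memory step on a toy sphere function. It works in a continuous domain, Lévy steps are approximated by Gaussian draws for brevity, and all names and default parameters are illustrative:

```python
import numpy as np

def mpa_minimize(f, lb, ub, n=20, d=5, max_iter=60, P=0.5, FADs=0.2, seed=0):
    """Minimal MPA sketch following Algorithm 1 (minimization)."""
    rng = np.random.default_rng(seed)
    prey = lb + rng.random((n, d)) * (ub - lb)            # Eq. 1
    fit = np.array([f(x) for x in prey])
    best = prey[fit.argmin()].copy()                      # top predator X^I
    best_fit = float(fit.min())
    old_prey, old_fit = prey.copy(), fit.copy()           # marine memory
    for it in range(max_iter):
        elite = np.tile(best, (n, 1))                     # Eq. 2
        CF = (1 - it / max_iter) ** (2 * it / max_iter)   # Eq. 10
        R = rng.random((n, d))
        if it < max_iter / 3:                             # phase 1: Brownian
            RB = rng.standard_normal((n, d))
            prey = prey + P * R * (RB * (elite - RB * prey))
        elif it < 2 * max_iter / 3:                       # phase 2: mixed
            h = n // 2
            RL = rng.standard_normal((h, d))              # Levy stand-in
            prey[:h] = prey[:h] + P * R[:h] * (RL * (elite[:h] - RL * prey[:h]))
            RB = rng.standard_normal((n - h, d))
            prey[h:] = elite[h:] + P * CF * (RB * (RB * elite[h:] - prey[h:]))
        else:                                             # phase 3: Levy
            RL = rng.standard_normal((n, d))
            prey = elite + P * CF * (RL * (RL * elite - prey))
        if rng.random() <= FADs:                          # Eq. 13, first branch
            U = (rng.random((n, d)) >= 0.2).astype(float)
            prey = prey + CF * (lb + rng.random((n, d)) * (ub - lb)) * U
        else:                                             # Eq. 13, second branch
            r = rng.random()
            i1, i2 = rng.integers(0, n, size=2)
            prey = prey + (FADs * (1 - r) + r) * (prey[i1] - prey[i2])
        prey = np.clip(prey, lb, ub)
        fit = np.array([f(x) for x in prey])
        worse = fit > old_fit                             # marine memory:
        prey[worse] = old_prey[worse]                     # keep fitter old
        fit[worse] = old_fit[worse]                       # positions
        old_prey, old_fit = prey.copy(), fit.copy()
        if fit.min() < best_fit:
            best_fit = float(fit.min())
            best = prey[fit.argmin()].copy()
    return best, best_fit

best, best_fit = mpa_minimize(lambda x: float(np.sum(x ** 2)), lb=-5.0, ub=5.0)
```

The greedy memory step guarantees that no individual’s fitness ever worsens between iterations, so the best-so-far trajectory is monotonically non-increasing.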
Particle Swarm Optimization (PSO) is a method inspired by the social interactions observed in animals such as birds and bees.31 The algorithm employs particles to traverse the search space and identify the optimal solution,32 drawing on simple social dynamics: particles share information about promising regions while balancing personal and global experience.
Positions and velocities are iteratively updated using a memory of each particle’s personal best and the swarm’s best-so-far, injecting randomness to avoid premature convergence.
PSO is widely used for solving both minimization and maximization problems due to its straightforward implementation and the limited number of parameters it requires. Its applications extend to areas such as function optimization, feature selection, and clustering.
During iteration $iter$, the position and velocity of particle $i$ are represented by $x_i(iter)$ and $v_i(iter)$, respectively. The particle’s movement is updated according to Eqs. 14 and 15:

$$v_i(iter + 1) = w \cdot v_i(iter) + c_1 \cdot rand() \cdot (pbest_i - x_i(iter)) + c_2 \cdot rand() \cdot (gbest - x_i(iter)) \quad (14)$$
$$x_i(iter + 1) = x_i(iter) + v_i(iter + 1) \quad (15)$$

In these formulas, $w$ signifies the inertia weight, which determines how much the previous velocity affects the current one. The constants $c_1$ and $c_2$ influence the particle’s tendency to move toward its personal best ($pbest$) and the global best ($gbest$) positions, respectively. The function $rand()$ produces random values within the range (0, 1), adding stochastic behavior to the particle’s motion.
The particles update their positions and velocities in successive iterations until a termination condition is fulfilled, such as reaching a predefined iteration limit or obtaining a target fitness value.
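A single swarm update of this kind can be sketched as follows; the parameter values (w, c1, c2) are chosen here only for illustration:

```python
import numpy as np

def pso_step(x, v, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """One velocity/position update for the whole swarm:
    v <- w*v + c1*rand*(pbest - x) + c2*rand*(gbest - x);  x <- x + v
    """
    r1 = rng.random(x.shape)                 # fresh randomness per dimension
    r2 = rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

rng = np.random.default_rng(2)
x = rng.random((4, 3))                       # 4 particles, 3 dimensions (toy)
v = np.zeros((4, 3))
pbest = x.copy()                             # personal bests start at the initial positions
gbest = x[0]                                 # suppose particle 0 holds the global best
x_new, v_new = pso_step(x, v, pbest, gbest, rng)
```

Note that a particle sitting exactly at both its personal and the global best experiences no pull, so with zero initial velocity it stays put; every other particle is drawn toward the best-known regions.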
The suggested PMPA hybrid approach for gene selection is presented in this section. The filter strategy is discussed in Section 3.1, while the wrapper approach—which includes the suggested PMPA optimization steps—is explained in Section 3.2.
This stage comprises three main steps: initialization, hybridization, and filtering process outcomes.
Step 1: Initialization.
First, we form an ensemble of three classical filters—ReliefF, Chi-square, and KL divergence—to score genes independently; scores are then combined by averaging to obtain a single ranking.
Step 2: Hybridization.
Next, we blend the ensemble ranking with MRMR’s relevance estimates. Each gene receives two signals: its mutual-information relevance to the class label, $I(G_i, c)$, and its average ensemble rank, $R(G_i)$. We modulate MRMR by this per-gene mean score to stabilize selection; the final relevance score is the product $I(G_i, c) \times R(G_i)$.
Step 3: Filtering process outcomes.
Finally, we threshold the ranking to produce a compact gene list, which becomes the search space for the wrapper optimizer.
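The three-step filter stage can be illustrated with a small rank-aggregation sketch. The score arrays below are made-up stand-ins for real ReliefF, chi-square, and KL outputs, and the function name is ours:

```python
import numpy as np

def ensemble_rank(score_lists, top_k):
    """Average per-filter ranks into one ordering, then keep the top_k genes.

    score_lists: list of 1-D arrays, one per filter; higher score = more
    informative. Averaging ranks rather than raw scores sidesteps the scale
    mismatch between heterogeneous filters.
    """
    ranks = [np.argsort(np.argsort(-s)) for s in score_lists]  # rank 0 = best
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:top_k]                        # gene indices

# toy example: three hypothetical filter scores over 6 genes
relieff = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
chi2    = np.array([5.0, 0.5, 4.0, 1.0, 3.5, 0.8])
kl      = np.array([2.0, 0.2, 1.8, 0.4, 1.5, 0.3])
selected = ensemble_rank([relieff, chi2, kl], top_k=3)
```

The returned indices define the reduced search space handed to the wrapper; in the full pipeline this ranking is further modulated by the MRMR-based relevance score before thresholding.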
3.2.1 Solution representation
Gene selection is characterized as a combinatorial optimization problem, where the solution comprises a subset of genes.34,35 As the number of selected genes from experimental datasets grows, the complexity of searching the solution space for the optimal subset increases. Formally, let N denote the total number of genes, so that there are $2^N$ potential subsets of candidate genes. A solution is encoded as a binary string x = (x1, x2, . . . , xN), where N is the string length. In this encoding, a bit value of ‘1’ signifies that a gene is included, while ‘0’ indicates exclusion.
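For example, with a toy value of N = 8 genes, a candidate solution and its decoded subset look like:

```python
import numpy as np

# Binary gene-mask encoding: bit j = 1 means gene j is in the subset.
x = np.array([1, 0, 1, 0, 0, 1, 0, 0])   # candidate solution over N = 8 genes
selected_genes = np.flatnonzero(x)        # indices of the included genes
subset_size = int(x.sum())                # s, the number of selected genes
```

Here genes 0, 2, and 5 are included, giving a subset of size s = 3.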
3.2.2 Fitness function
Gene selection methods aim to reduce the number of selected genes while improving accuracy. To accomplish this, several studies have integrated classification accuracy and gene subset size into a unified weighted function. This function serves as a fitness metric for evaluating potential gene subsets, as detailed in Equation (16) below:

$$fitness = \alpha \times (1 - ACC) + \beta \times \frac{s}{p} \quad (16)$$

where $ACC$ denotes the cross-validated classification accuracy, so that lower fitness values are better.
Our objective rewards predictive performance while discouraging unnecessarily large subsets.
In this context, p represents the total genes in the dataset, and s indicates the candidate subset size. The weighting factors for classification accuracy (α) and gene subset size (β) are assigned values of 1 and 0.001, respectively.36,37 In this study, classification accuracy is evaluated through 10-fold cross-validation using an SVM classifier. This cross-validation approach is commonly applied in gene selection because it provides consistent results and minimizes variability in relation to input data.38 It is worth noting that nearly all comparative methods rely on k-fold cross-validation for validation purposes.
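Under this weighted-objective reading (lower is better), the fitness can be sketched in a few lines. The exact algebraic form of Eq. 16 is not reproduced in this excerpt, so the formula below is the standard minimization variant with the stated α and β:

```python
def fitness(accuracy, s, p, alpha=1.0, beta=0.001):
    """Weighted fitness (sketch): penalize classification error plus,
    lightly, the fraction of genes retained. Lower values are better."""
    return alpha * (1.0 - accuracy) + beta * (s / p)

# at equal accuracy, the smaller subset wins
f_small = fitness(accuracy=1.0, s=3, p=7129)
f_large = fitness(accuracy=1.0, s=200, p=7129)
```

With β three orders of magnitude smaller than α, accuracy dominates and subset size acts only as a tie-breaker, which matches the compact panels reported in the results.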
We weight accuracy and subset size with (α, β) = (1, 0.001); performance is estimated via 10-fold cross-validated SVM, a common and stable choice in this domain.
3.2.3 Marine Predator Optimizer Based PSO
This section presents the rMRMR-PMPA method, a variation of the PMPA strategy designed specifically to address the gene selection optimization challenge. The primary goal of rMRMR-PMPA is to find biologically significant genes using the rMRMR approach. The PMPA approach then uses the identified genes as input to improve and streamline the gene selection procedure. The PMPA method comprises six progressive steps, which are described below and illustrated in Figure 2.

Workflow showing: (1) parameter initialization (MPA/PSO); (2) population initialization over binary gene masks; (3) fitness evaluation using SVM accuracy with size penalty; (4) MPA-based position updates with elite memory and FADs; (5) PSO refinement using personal/global bests; (6) stopping criterion. The filter space is produced by rMRMR-enhanced ranking before the wrapper search. Abbreviations: PMPA, Particle Marine Predator Optimizer; rMRMR, robust minimum-redundancy maximum-relevance; PSO, Particle Swarm Optimization; SVM, support vector machine.
Step 1: PMPA Initialization. In this step, the parameters for the MPA and the PSO are initialized. The MPA parameters include β, P, FADs, the population size (N), the number of iterations (MaxItr), and the upper and lower bounds (ub, lb). The PSO parameters are the two cognitive factors c1 and c2 and the inertia weight w.
Step 2: Population Initialization. At this stage, random solutions are generated to create the population for the gene selection problem. The population is an N × D matrix, where D represents the total number of decision variables. The population for the proposed PMPA is formulated using Equation 17.
Step 3: Population Evaluation. During this phase, the fitness values of the population solutions are computed and assessed using the fitness function defined in Equation 16. This function evaluates the problem’s objectives and criteria. Once the fitness values are determined, the solution with the best fitness, referred to as the best solution, is identified as the top predator $X^I$ in the MPA. This solution is considered the most optimal among all candidates. Additionally, the Elite matrix is updated to store information about the top predator, serving as a memory to retain the best solutions encountered throughout the optimization process.
Step 4: Population Update Using MPA. In this phase, the MPA search agents perform their search operations by updating their positions to discover improved candidate solutions, guided by the Elite matrix. After updating the positions, the influence of FADs is applied and adjusted to balance exploration and exploitation. The newly generated solutions are then evaluated, and the Elite matrix and memory are updated accordingly to retain details of the best solutions identified during the process.
Step 5: Update Population by PSO. In this step, PSO is applied to enhance both the exploration and exploitation capabilities of the MPA and to find better candidate solutions. Once PSO starts its operations, its search agents search for a better solution than that obtained by the MPA. The resulting solutions are then assessed and updated via the MPA process.
Step 6: Verify the Stopping Criterion. Steps 3, 4, and 5 are iteratively repeated until the stopping condition is satisfied.
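Steps 3 through 5 can be condensed into one hybrid iteration. The sketch below works in a continuous toy domain (binarization to gene masks and the SVM-based fitness are omitted), and all names and parameter values are illustrative:

```python
import numpy as np

def pmpa_iteration(pop, vel, pbest, gbest, f, rng,
                   P=0.5, w=0.7, c1=1.5, c2=1.5):
    """One PMPA-style iteration (sketch): an MPA Brownian move guided by the
    elite, followed by a PSO refinement using personal/global bests."""
    n, d = pop.shape
    elite = np.tile(gbest, (n, 1))
    # --- Step 4: MPA-style update around the elite ---
    RB = rng.standard_normal((n, d))
    pop = pop + P * rng.random((n, d)) * (RB * (elite - RB * pop))
    # --- Step 5: PSO refinement of the same population ---
    r1, r2 = rng.random((n, d)), rng.random((n, d))
    vel = w * vel + c1 * r1 * (pbest - pop) + c2 * r2 * (gbest - pop)
    pop = pop + vel
    # --- Step 3 (next round): re-evaluate and update memories ---
    fit = np.array([f(x) for x in pop])
    improved = fit < np.array([f(x) for x in pbest])
    pbest[improved] = pop[improved]
    if fit.min() < f(gbest):
        gbest = pop[fit.argmin()].copy()
    return pop, vel, pbest, gbest
```

Because the global best is replaced only when a strictly better candidate is found, its fitness can never worsen across iterations, mirroring the elite memory in the full method.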
This section presents the experimental protocol adopted to rigorously evaluate the proposed PMPA feature selection framework on high-dimensional gene expression benchmarks. We consider standard microarray datasets that exhibit the small-n/large-p regime, where only a few dozen samples are measured over thousands of genes; this setting is known to challenge both overfitting control and search stability. To ensure comparability, all methods operate under identical computational budgets, candidate subset bounds, and classifier back-ends. Unless otherwise indicated, performance is estimated with stratified k-fold cross-validation using the same fold partitions across methods to minimize variability due to resampling.

We report complementary metrics: overall accuracy and F1-score (discrimination), sensitivity and precision (error asymmetry), the Matthews correlation coefficient (class-imbalance robustness), and the number of selected genes (model parsimony). Optimizer fitness balances predictive performance against subset size via a weighted objective, promoting compact yet informative biomarker panels. Comparators include single-strategy metaheuristics (MPA, BAT, GWO, and WSO) configured with recommended settings within the same search ranges, alongside recent hybrid filter–wrapper approaches from the literature where available.

We further analyze convergence behavior—best-so-far fitness trajectories across iterations—to characterize exploration–exploitation trade-offs and robustness. All experiments were repeated over multiple random seeds, and summary results are reported as mean ± standard deviation to reflect run-to-run variability. These choices establish a transparent, reproducible basis for testing whether PMPA achieves (i) competitive or superior classification, (ii) substantial gene reduction, and (iii) consistent performance across datasets—properties that are essential for reliable downstream use in bioinformatics pipelines and decision support.
Seven popular microarray benchmark datasets47–49 were used to assess the PMPA technique. These datasets are widely used in pattern recognition research employing evolutionary algorithms and machine learning to identify gene patterns that distinguish malignant samples from healthy ones.35 The datasets vary in size: gene counts range from 2,000 to 15,154, patient sample sizes range from 60 to 235, and some datasets have limited prognostic significance. Column 1 of Table 1 lists the datasets examined; they were obtained from https://csse.szu.edu.cn/staff/zhuzx/Datasets.html.47–49
Datasets: Breast, CNS, Leukemia, Leukemia-3c, Leukemia-4c, Lymphoma, Ovarian. Methods: PMPA, MPA, BAT, GWO, WSO. Metric: overall accuracy; SD = standard deviation over repeated runs/folds.
This section compares the outcomes of the proposed method with those achieved by other algorithms. We benchmark PMPA against representative baselines such as the Bat Optimization Algorithm (BAT), Grey Wolf Optimizer (GWO), Marine Predator Optimizer (MPA), and White Shark Optimizer (WSO) on standard microarray datasets.
F1 Score
The F1 score results shown in Table 2 demonstrate the strong performance of PMPA when compared to other methods (BAT, GWO, MPA, and WSO) across multiple datasets. In the Breast dataset, PMPA achieved the highest average F1 score (0.9491) along with the lowest standard deviation (0.0092), indicating both superior accuracy and stability. Although BAT slightly outperformed PMPA in the CNS dataset with an F1 score of 0.9611, PMPA remained highly competitive with a score of 0.9520. For the Leukemia_3c, Leukemia, Lymphoma, and Ovarian datasets, all methods, including PMPA, reached a perfect F1 score of 1 with a standard deviation of 0, signifying flawless and consistent performance. In the Leukemia_4c dataset, BAT achieved the highest F1 score (0.9847), while PMPA performed strongly with a score of 0.9760. Overall, PMPA showed excellent results, especially in the Breast dataset, and maintained strong stability across most datasets, proving its effectiveness relative to the other optimization methods.
Same method/dataset roster as Table 1. F1 combines precision and sensitivity (recall); higher is better.
Matthews Correlation Coefficient (MCC)
Table 3 (MCC results) demonstrates how well PMPA performs in comparison to other algorithms (BAT, GWO, MPA, and WSO) across several datasets. In the Breast dataset, PMPA achieved the highest average MCC (0.9069), while BAT had a slightly lower average. BAT, however, showed the lowest standard deviation (0.0168), indicating higher consistency in performance. In the CNS dataset, BAT performed the best with an average MCC of 0.8918, but PMPA remained competitive with a score of 0.8643. Both algorithms showed strong results in terms of standard deviation, with BAT having the smallest value (0.0004). Across the Leukemia_3c, Leukemia, Lymphoma, and Ovarian datasets, all methods, including PMPA, achieved a perfect MCC of 1, along with a standard deviation of 0, reflecting flawless and consistent performance. In the Leukemia_4c dataset, BAT once again had the highest MCC (0.9822), while PMPA remained strong with an MCC of 0.9734. Overall, PMPA exhibited superior performance in the Breast dataset and remained competitive across other datasets, showing stability and effectiveness in comparison to the other methods.
Robust correlation-based metric for binary/multi-class performance; +1 perfect, 0 random, −1 total disagreement. Same methods/datasets as Table 1.
Precision
The precision results in Table 4 indicate the competitive performance of PMPA compared to other algorithms (BAT, GWO, MPA, and WSO). In the Breast dataset, PMPA achieved the highest average precision (0.9740), outperforming the other methods, with MPA obtaining the lowest standard deviation (0.0175). In the CNS dataset, BAT showed the best precision (0.9729), but PMPA remained competitive with a score of 0.9564, and BAT also had the smallest standard deviation (0.0043). Across the Leukemia_3c, Leukemia, Lymphoma, and Ovarian datasets, all methods, including PMPA, achieved perfect precision with a value of 1 and a standard deviation of 0, indicating flawless performance across these datasets. In the Leukemia_4c dataset, BAT had the highest precision (0.9892), while PMPA performed well with a precision of 0.9859. Overall, PMPA demonstrated strong precision, particularly in the Breast dataset, and maintained competitive results across other datasets, further demonstrating its reliability and effectiveness when compared to the other algorithms.
Fraction of predicted positives that are true positives. Same methods/datasets as Table 1.
Sensitivity (Recall)
Table 5 (sensitivity findings) indicates PMPA’s performance in relation to other algorithms (BAT, GWO, MPA, and WSO). PMPA outperformed the other techniques in the Breast dataset, achieving the highest average sensitivity (0.9261) and the lowest standard deviation (0.0203), demonstrating accuracy and stability. BAT demonstrated marginally superior sensitivity (0.9496) for the CNS dataset, but PMPA maintained its competitiveness with a score of 0.9479. Across the Leukemia_3c, Leukemia, Lymphoma, and Ovarian datasets, all methods, including PMPA, obtained perfect sensitivity with a value of 1 and a standard deviation of 0, demonstrating flawless and consistent performance in these datasets. In the Leukemia_4c dataset, MPA achieved the highest sensitivity (0.9901), while PMPA remained competitive with a sensitivity of 0.9705.
Fraction of actual positives that are correctly identified. Same methods/datasets as Table 1.
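As a concrete illustration of the metrics reported in Tables 4 and 5, the following sketch computes precision, sensitivity (recall), and the Matthews correlation coefficient from confusion-matrix counts. The label vectors below are made-up placeholders, not values drawn from the benchmark datasets.

```python
# Illustrative computation of precision, sensitivity (recall), and MCC
# from confusion-matrix counts. The labels are hypothetical examples,
# not predictions from any of the benchmark datasets.
import math

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth class labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # classifier predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)    # fraction of predicted positives that are true
sensitivity = tp / (tp + fn)  # fraction of actual positives recovered
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(precision, sensitivity, round(mcc, 3))  # → 0.75 0.75 0.5
```

The same quantities are conventionally obtained with library helpers such as scikit-learn's `precision_score`, `recall_score`, and `matthews_corrcoef`.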
Selected Features
The comparison of the selected features across different algorithms (PMPA, BAT, GWO, MPA, and WSO) shown in Table 6 highlights PMPA’s performance in selecting fewer and more relevant features. In the Breast dataset, GWO selected the fewest features (14.7), followed by MPA (16.8), and PMPA selected 19.8 features, which is still competitive. In terms of stability, GWO had the lowest standard deviation (2.1359). For the CNS dataset, PMPA achieved the best result, selecting the fewest features (7.77) with the lowest standard deviation (1.5241), outperforming the other methods. In the Leukemia_3c dataset, MPA selected the fewest features (4.1), but PMPA closely followed with 4.67 features. Similarly, in the Leukemia_4c dataset, MPA had the best result (6.1), with PMPA closely behind, selecting 8.07 features. For the Leukemia dataset, PMPA and MPA both selected the fewest features (3), with PMPA having the lowest standard deviation (0). In the Lymphoma and Ovarian datasets, PMPA also tied for the fewest selected features (2 and 3, respectively), demonstrating its ability to consistently choose minimal and relevant features across multiple datasets. Overall, PMPA showed strong feature selection performance, consistently selecting fewer features with high stability, making it an effective and efficient method compared to other algorithms.
Parsimony comparison showing typical subset sizes produced by each method on each dataset. Lower is better when accuracy is comparable.
Fitness Value
The fitness value results (Table 7) demonstrate the performance of PMPA compared to other algorithms (BAT, GWO, MPA, and WSO). In the Breast dataset, PMPA achieved the best fitness value (0.0498) with the lowest standard deviation (0.0077), indicating superior performance and stability compared to the other methods. For the CNS dataset, BAT had the best fitness value (0.0523), followed by MPA and PMPA with slightly higher values, while BAT also showed the lowest standard deviation (0.0004). In the Leukemia_3c dataset, MPA demonstrated the best fitness (0.00082), but PMPA remained highly competitive with a value of 0.00093 and the second-lowest standard deviation. In the Leukemia_4c dataset, BAT again achieved the best fitness value (0.0117), while PMPA followed closely with 0.0122. In the Leukemia, Lymphoma, and Ovarian datasets, PMPA obtained the best fitness values, tying with MPA in the Leukemia dataset (0.0006) and showing superior results in Lymphoma (0.0004) and Ovarian (0.0006). Across most datasets, PMPA demonstrated excellent fitness value performance, particularly excelling in the Breast, Lymphoma, and Ovarian datasets, and maintaining consistent stability, highlighting its effectiveness relative to other algorithms.
Objective couples cross-validated SVM accuracy with a size penalty; lower fitness indicates better trade-off (higher accuracy/fewer genes). Same methods/datasets as Table 1.
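A fitness of the kind described above is commonly implemented as a weighted sum of the cross-validated error and the relative subset size. The sketch below follows that common form; the weights `ALPHA` and `BETA` and the example numbers are illustrative assumptions, since the paper's exact coefficients are not restated in this section.

```python
# Sketch of a wrapper fitness of the form described above: lower is
# better, combining cross-validated classification error with a
# subset-size penalty. ALPHA/BETA are assumed illustrative weights,
# not the coefficients used in the paper.
ALPHA = 0.99   # weight on cross-validated error
BETA = 0.01    # weight on relative subset size

def fitness(cv_accuracy: float, n_selected: int, n_total: int) -> float:
    """Lower fitness = higher accuracy and/or fewer selected genes."""
    error = 1.0 - cv_accuracy
    size_ratio = n_selected / n_total
    return ALPHA * error + BETA * size_ratio

# Hypothetical example: 95% accuracy using 20 of 10,000 candidate genes
print(round(fitness(0.95, 20, 10_000), 5))  # → 0.04952
```

Under this form, perfect accuracy drives the error term to zero, so the residual fitness is proportional to the fraction of genes retained, which is why compact subsets yield the near-zero fitness values seen in Table 7.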
Convergence Behavior
Figure 3 summarizes the per-iteration fitness trajectories for PMPA versus four widely used metaheuristics (MPA, BAT, GWO, and WSO) across seven microarray benchmarks. In all panels, lower curves indicate better objective values. Two broad patterns emerge. First, PMPA generally exhibits a steep initial decline followed by early stabilization, suggesting that the rMRMR-driven filtering sharply contracts the search space while the MPA↔PSO wrapper alternation accelerates exploitation. Second, the variance band around PMPA’s trajectory is narrow in most datasets, indicating run-to-run stability consistent with the small standard deviations reported for accuracy, F1, MCC, and fitness. Occasional shallow oscillations after the mid-iterations correspond to controlled step-size adjustments when the algorithm transitions from exploration (Brownian/Lévy phases in MPA) to PSO-based local refinement. Below, we comment on each subfigure.

Best-so-far objective vs. iteration for PMPA and baselines (MPA, BAT, GWO, WSO) on: (A) Breast, (B) CNS, (C) Leukemia_3c, (D) Leukemia_4c, (E) Leukemia, (F) Lymphoma, (G) Ovarian. Lower curves indicate better fitness (higher accuracy with smaller subsets). PMPA typically shows rapid initial descent and early stabilization; variability bands (if shown) reflect run-to-run spread. Abbreviations: BAT, Bat Algorithm; GWO, Grey Wolf Optimizer; WSO, White Shark Optimizer; CNS, Central Nervous System.
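The "best-so-far" curves in Figure 3 are derived from raw per-iteration fitness values as a running minimum, which is why each trajectory is monotonically non-increasing. A minimal sketch, using an invented fitness sequence for illustration:

```python
# Minimal sketch: deriving a best-so-far convergence curve from raw
# per-iteration fitness values. The curve is the running minimum, so
# it never increases. The fitness sequence below is invented.
raw_fitness = [0.41, 0.22, 0.25, 0.09, 0.12, 0.05, 0.05, 0.06]

best_so_far = []
best = float("inf")
for f in raw_fitness:
    best = min(best, f)        # keep the best objective seen so far
    best_so_far.append(best)

print(best_so_far)  # → [0.41, 0.22, 0.22, 0.09, 0.09, 0.05, 0.05, 0.05]
```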
In the Breast dataset presented in Figure 3 (A), PMPA undergoes a rapid mono-exponential drop in the first few tens of iterations and settles at the lowest fitness among competitors, with minimal post-convergence jitter. This behavior mirrors the quantitative tables, where PMPA attains the best mean fitness and the smallest standard deviation, together with strong precision, sensitivity, and MCC. The early plateau suggests that the PSO pass quickly locks onto compact, high-quality gene subsets after MPA disperses candidates into promising basins.
In the CNS dataset presented in Figure 3 (B), convergence is slower and exhibits one or two inflection points, reflecting a more rugged objective surface. PMPA’s curve descends steadily and shows a secondary improvement mid-run, typical of a handoff from global exploration to PSO-driven refinement. Although BAT may transiently approach similar minima, PMPA maintains a favorable stability–fitness trade-off and, importantly, achieves highly competitive accuracy using markedly fewer genes, indicating efficient subset regularization.
In the Leukemia_3c dataset presented in Figure 3 (C), all algorithms rapidly reach a near-zero fitness floor, consistent with the near-perfect classification metrics. PMPA is among the first to converge and exhibits an extremely tight variance band, implying robustness to initialization and hyperparameter perturbations. The ceiling effect in performance underscores the dataset’s separability once key biomarkers are retrieved.
In the Leukemia_4c dataset presented in Figure 3 (D), the panel shows the most pronounced multi-stage descent. PMPA advances with steady, staircase-like improvements and a late-stage refinement consistent with PSO fine-tuning. While BAT occasionally attains a marginally lower terminal fitness, PMPA’s curve remains smoother and less erratic, indicative of better generalization control under the wrapper’s size penalty and rMRMR-informed search space.
In the Leukemia dataset presented in Figure 3 (E), the trajectories collapse quickly to the minimum, echoing the perfect accuracy and MCC observed across methods. PMPA’s decay is smooth and essentially variance-free, which is compatible with its ability to isolate a minimal subset (two to three genes) without sacrificing discriminative power. This emphasizes the value of the filter–wrapper coupling for highly separable tasks.
In the Lymphoma dataset presented in Figure 3 (F), as in Leukemia, the curves flatten early at the optimum with negligible oscillations. PMPA’s stabilization is earliest or among the earliest and remains stable thereafter, reinforcing its repeatability on datasets where biological signal-to-noise is high and redundant genes are abundant.
In the Ovarian dataset presented in Figure 3 (G), PMPA exhibits a sharp initial descent and reaches the optimum swiftly. The convergence band is nearly flat after early iterations, aligning with the perfect downstream metrics and the very small number of selected genes. This suggests that the hybrid search identifies a sparse yet sufficient signature rapidly, then avoids unnecessary exploration that could destabilize the solution.
Taken together, the seven panels indicate that PMPA’s advantage is twofold: fast descent driven by informed initialization and global search, followed by stable exploitation that resists overfitting through explicit subset-size regularization. Where baselines catch up (e.g., CNS or Leukemia_4c), they typically require more iterations and display higher fluctuation amplitudes. Conversely, on strongly separable datasets (Leukemia, Lymphoma, Ovarian), all methods converge rapidly, yet PMPA maintains the earliest and smoothest stabilization while using among the fewest features. These observations corroborate the tabulated accuracy, F1, MCC, precision, sensitivity, and fitness summaries, and they collectively support the conclusion that PMPA offers a reliable convergence profile with favorable stability–efficiency characteristics across diverse gene-expression landscapes.
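The exploration-then-exploitation pattern described above can be conveyed with a toy sketch. This is an illustrative stand-in only, not the authors' implementation: the first half of the run takes large MPA-style random jumps, the second half takes PSO-style stochastic pulls toward the best-known solution, and a simple sphere function replaces the real subset-evaluation objective.

```python
# Toy sketch (illustrative only, NOT the paper's algorithm) of a
# two-phase hybrid search: MPA-style global jumps early, PSO-style
# pulls toward the global best late. A sphere function stands in
# for the real wrapper fitness.
import random

random.seed(42)

def objective(x):
    return sum(v * v for v in x)  # toy stand-in for wrapper fitness

dim, iters, pop = 2, 60, 10
swarm = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop)]
gbest = min(swarm, key=objective)
initial_fit = objective(gbest)

for t in range(iters):
    exploring = t < iters // 2  # phase switch at mid-run
    for i, x in enumerate(swarm):
        if exploring:  # MPA-like large random jump (exploration)
            cand = [v + random.gauss(0, 1.0) for v in x]
        else:          # PSO-like stochastic pull toward gbest (exploitation)
            cand = [v + 1.5 * random.random() * (g - v)
                    for v, g in zip(x, gbest)]
        if objective(cand) < objective(x):  # greedy acceptance
            swarm[i] = cand
    gbest = min(swarm + [gbest], key=objective)

final_fit = objective(gbest)
print(final_fit <= initial_fit)  # → True (best-so-far never worsens)
```

Because the incumbent best is carried forward explicitly, the best-so-far trajectory is monotone, mirroring the non-increasing curves in Figure 3.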
To evaluate the effectiveness of the proposed PMPA algorithm, we conducted a comprehensive comparison with several well-known and widely used feature selection methods, including mRMR-DBH,39 SU-RSHSA,40 mRMR-MBAO,41 MIM-MFOA,42 IBCFPA,43 ISFLA,37 rMRMR-HBA,44 BCROSAT,45 MRMR-BA,17 IG-MBKH,36 and SARA.46 The comparison was carried out across six publicly available gene expression datasets: CNS, Ovarian, Leukemia 4c, Leukemia, Leukemia 3c, and Breast. Two key performance indicators were used in the evaluation: the average number of selected genes and the classification accuracy. These metrics are critical for assessing the trade-off between dimensionality reduction and predictive performance. The results, summarized in Table 8, illustrate the strengths and weaknesses of each method and provide insights into the advantages of the proposed PMPA approach.
For each dataset, the average number of genes (ANoG) and classification accuracy (CACC) reported by PMPA vs. literature methods (e.g., mRMR-MBAO, SU-RSHSA, mRMR-DBH, IBCFPA, MIM-MFOA, BCROSAT, ISFLA, SARA, rMRMR-HBA, IG-MBKH, MRMR-BA). “nd” indicates values not disclosed in source reports.
With an emphasis on the average number of selected genes (ANoG) and the associated classification accuracy (CACC), Table 8 compares the proposed PMPA approach with a number of state-of-the-art feature selection algorithms across the biomedical datasets. The results demonstrate the strong performance of PMPA in both accuracy and gene reduction. Notably, in the CNS dataset, PMPA selected only 7.767 genes while achieving an accuracy of 93.78%.
Notes: Results are recorded as the average number of genes (ANoG) and classification accuracy (CACC); “nd” denotes that the value was not disclosed.
For the Ovarian dataset, PMPA achieved perfect classification accuracy (100%) with just 3 selected genes. In the Leukemia 4c dataset, PMPA reached an accuracy of 98.93% with 8.067 genes, indicating a favorable trade-off compared to methods like mRMR-MBAO and mRMR-DBH, which selected more than 20 genes. In both the Leukemia and Leukemia 3c datasets, PMPA achieved 100% accuracy using only 2 and 4.667 genes, respectively, while alternative techniques required up to 49.6 genes. For the Breast dataset, PMPA reported an accuracy of 95.36% with 19.833 selected genes, though several baselines did not report results for this dataset.
This study addressed the gene selection problem in cancer classification by proposing a novel hybrid approach called PMPA. The method strikes a balance between exploration and exploitation by combining the complementary strengths of the Marine Predators Algorithm (MPA) and Particle Swarm Optimization (PSO), which allows it to traverse high-dimensional gene expression datasets efficiently. Experimental evaluations on several well-known microarray benchmarks show that PMPA regularly outperforms other optimization techniques, such as BAT, GWO, and WSO, across a range of performance measures, including classification accuracy, sensitivity, and the number of selected genes. Additionally, PMPA improves computational efficiency by removing unnecessary genes and exhibits improved stability, as evidenced by lower standard deviations in the results.
The effectiveness of the PMPA method stems from its adaptive mechanisms, which allow it to search the solution space effectively while avoiding local optima. This makes PMPA a robust and reliable tool for gene selection tasks in cancer classification. Its ability to identify a minimal subset of genes without compromising classification performance highlights its practicality and potential for real-world applications in cancer diagnostics and personalized medicine.
Future research could explore several extensions to the PMPA framework. One direction is integrating additional evolutionary algorithms, such as Genetic Algorithms or Differential Evolution, to enhance performance on larger and more complex datasets. Another potential avenue involves applying PMPA to other types of omics data, such as proteomics and metabolomics, to validate its applicability beyond gene expression. Exploring the combination of PMPA with deep learning models could automate the gene selection process further and improve scalability for handling larger datasets.
Dynamic parameter tuning during the search process could enhance the method’s efficiency, while implementing parallel processing techniques could significantly reduce computational time, making PMPA more practical for real-time clinical diagnostics.
Zenodo: A Novel PMPA for Gene Selection Health SS with Datasets and Codes https://doi.org/10.5281/zenodo.17390187
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).