Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.168092.1

Research Article

Articles

Impact of sample size on optimisation algorithms for the MLP used in the prediction of client subscription to a term deposit

[version 1; peer review: 1 approved with reservations]

Botlhoko

Tshegofatso

Conceptualization Formal Analysis Methodology Writing – Original Draft Preparation https://orcid.org/0000-0003-4939-7582 1 Volition Montshiwa

Tlhalitshi

Conceptualization Methodology Supervision Writing – Review & Editing https://orcid.org/0000-0003-3168-3441 a 1 1Department of Business Statistics & Operations Research, North West University Faculty of Economic and Management Sciences, Potchefstroom, North West, 2735, South Africa

a volition.montshiwa@nwu.ac.za

No competing interests were disclosed.

22 12 2025

2025

1426

12 12 2025

2025

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The multilayer perceptron (MLP) is a machine learning (ML) algorithm used in various fields. Its disadvantages include the uncontrollable growth of the total number of parameters, which may make the MLP redundant in such high dimensions, and an uncontrollably growing stack of layers that ignores spatial information. Optimization algorithms were developed to determine the optimum number of parameters for the MLP.

Methods

In this paper, the performances of the Genetic Algorithm (GA), Grasshopper Optimization Algorithm (GOA), and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) are compared. The study also sought to determine the impact of sample size variations on these optimization algorithms. A dataset on the direct marketing campaigns of a Portuguese banking institution from the UCI Machine Learning Repository with a sample size of 4 521 was used. The Synthetic Minority Oversampling Technique (SMOTE) was applied to balance the binary dependent variable in the training data across the various sample sizes.

Results

Based on the classification accuracy, specificity, sensitivity, precision, F-score, and execution time, the MLP based on CMA-ES (CMA-ES-MLP) was identified as the best classifier overall, as it maintained high rates of these classification metrics and was the second fastest to train. CMA-ES-MLP with a training sample of 5 114 was our ideal classifier, and it competes well with the classifiers that have been built by previous studies that used the same dataset.

Conclusions

The study found no consistent increase or decrease in the classification performance of the algorithms as the sample size increased, and the metrics fluctuated rapidly across sample sizes. It is recommended that future studies be conducted to compare the best-performing classifiers identified in previous studies with the CMA-ES-MLP in this study under the same experimental conditions.

Multilayer Perceptron (MLP); Genetic Algorithm (GA); Grasshopper Optimization Algorithm (GOA); Covariance Matrix Adaptation Evolution Strategy (CMA-ES); Machine Learning; Term Deposit Subscription.

The author(s) declared that no grants were involved in supporting this work.

1.1 Introduction

The multidisciplinary field of data mining includes Information Technology (IT), Artificial Intelligence (AI), Machine Learning (ML), statistics, pattern recognition, data retrieval, Neural Networks (NN), and information-based systems. ¹ This study focused on ML classification algorithms and classifiers. A classifier is an algorithm that links input data to a specific category. ² More specifically, this study focuses on the Multilayer Perceptron (MLP) classifier because it is one of the most-used algorithms in data science and in recent studies ^{3–7} and because of its flexibility and ability to differentiate data that can be split linearly. Reference [8] defined the MLP as a feedforward artificial neural network (ANN) that comprises an input layer, at least one hidden layer, and an output layer, which are connected by nodes. The MLP is also of interest in this study because it is applicable to various fields such as speech recognition, image recognition, text classification, and machine translation software.

Although it is applicable across various disciplines, a disadvantage of the MLP is that its total number of parameters can grow uncontrollably, whereby the number of perceptrons in layer one is multiplied by the number of parameters in layer two, which is then multiplied by the number of parameters in layer three, and so on. This is inefficient because of redundancy in such high dimensions. In addition, Reference [9] (p. 400) stated that when flattened vectors are used as inputs, this uncontrollably growing stack of layers ignores the spatial information. These multiplying parameters can be difficult to control; hence, optimization algorithms were established to determine the optimum number of parameters for the MLP.

Reference [10] defined an algorithm as a process or equation that solves a problem by following a predetermined set of steps. Reference [11] described optimization techniques as analytical approaches that use differential calculus to find the best solution. Reference [12] further explained that the purpose of optimization techniques is mainly to handle problems that cannot be handled by classifiers. These problems consist of functions with a single variable, functions with multiple variables and no constraints, and functions with multiple variables with both equality and inequality constraints. A variety of optimization algorithms have been developed, and because of their adaptable and flexible searching processes, they have demonstrated a great degree of promise in solving optimization issues. In addition, ¹³ mentioned their capacity to use specific statistical tools to display satisfactory performance on MLP classification methods, as well as their efficiency in resolving linear and non-linear problems by avoiding local optima and balancing the exploration and exploitation trends.

According to ¹⁴, there are several optimization algorithms used in optimizing the MLP, including the Bayesian optimization algorithms (BOA), binary particle swarm optimization (BPSO), Covariance Matrix Adaptation Evolution Strategy (CMA-ES), Differential Evolution (DE), FireFly Algorithm (FFA), genetic algorithms (GA), grasshopper optimization algorithm (GOA), and particle swarm optimization (PSO). Other optimization algorithms include the hybrid meta-heuristic approach, which was used in the studies by ^{15,16} and has been compared to other newly developed optimization algorithms that were used to form hybrid MLP models, such as the Glowworm Swarm Optimization-MLP (GSO-MLP), Biogeographical-Based Optimization-MLP (BBO-MLP), and Genetic Algorithm-MLP (GA-MLP).

The scope of this study is limited to the Genetic Algorithm (GA), grasshopper optimization algorithm (GOA), and Covariance Matrix Adaptation Evolution Strategy (CMA-ES). This is because the literature comparing these novel evolutionary optimization algorithms is scarce. Therefore, although they are known to be better performers than older algorithms, the best optimization algorithm for the MLP among GA, GOA, and CMA-ES remains unknown. It is imperative to determine the most efficient optimization algorithm for an optimal MLP because each optimization technique differs in reliability, strength, efficiency, utilization, and limitations. According to ¹⁷, one of the disadvantages of not knowing the most efficient optimization algorithm is that the best level of local optima cannot be determined. It can also waste the time of end-users of MLPs (i.e., non-statisticians/non-data scientists), who would need to compare the optimization algorithms before fitting their MLPs, as opposed to referring to a study such as the current one, which has already compared these algorithms and recommended the most efficient one(s).

This study also intended to explore the effect of changes in sample size on the efficiency of GA, GOA, and CMA-ES. This is because an increase in sample size is reported by some studies to improve the accuracy and robustness of many statistical methods, as detailed by studies such as those conducted by ^{18–21}. These studies highlighted that when the focus is on individualised outcome risk prediction, extremely large datasets might be needed for ML techniques. The authors explained that for binary outcomes, ML techniques could require more than ten times as many events for each predictor to achieve a small amount of over-fitting compared with classic modelling techniques such as logistic regression, and might show instability and high optimism. Reference [21] explained that when dealing with optimization algorithms and sample size, it is vital to ensure accurate predictions in key subgroups and to consider the accurate sample size when using an existing dataset to avoid overfitting. On the other hand, although some studies advocate for a large dataset for ML algorithms, ²² explained that a study with a sample size that is too small has a higher risk of missing a meaningful underlying difference, while one with a sample size that is too large may be more expensive than necessary.

It is evident that sample size affects the efficiency of ML algorithms. However, the efficiency of GA, GOA, and CMA-ES when used in optimizing the MLP relative to the sample size remains unknown, and to the best of our knowledge, this has never been explored before in a single study. In this study, efficiency refers to a measure of the quality of the optimization algorithms depending on the sample size, which is evaluated using measures such as specificity, sensitivity/recall, and execution time. Therefore, this study intended to determine the impact of sample size on the efficiency of GA, GOA, and CMA-ES when used for optimizing the MLP, with a focus on these algorithms due to their wide application in various studies ^{23–27} and because of their known effectiveness and flexibility.

A comparison of GA, GOA, and CMA-ES in optimizing the MLP and the effect of sample size on the performance of these algorithms is the main objective of this study. As an area of application, these methods are applied to predict the likelihood of subscribing to a term deposit following telephone-based direct marketing by a banking institution. This has been the focus of application of ML classifiers in several previous studies, including ^{28–30}. Therefore, this study intends to extend the literature in this area, which has caught the attention of many researchers when comparing the performance of ML classifiers. The ML classifiers applied and the conclusions reached in these previous studies are detailed in Table 2 in the dataset section of this paper.

1.2 Related works on evaluation of optimisation algorithms for the MLP

Several previous studies that explored the efficiency of various optimization algorithms for the MLP in different areas of application showed that the most efficient optimized MLP varies depending on the area of application, sample sizes, and evaluation metrics implemented in such studies. From the studies reviewed, the most common area of research is information technology, ^{31–34} followed by the medical sector. ^{35–37} To extend the study by ³⁸, who focused on the financial sector, the current study uses a financial dataset, but it includes the CMA-ES-MLP and GOA-MLP, which are compared to the basic MLP and GA-MLP (both also included in the study by ³⁸), and it does so across different sample sizes rather than only one. From the studies reviewed, the sample sizes ranged from 400 to 8367, but only one sample was used per study. As such, the current study expands the scope of these studies by comparing the basic MLP and its optimized variants using different samples to determine the effect of sample size on the performance of these ML algorithms.

The literature shows that in all the studies, the optimized versions of the MLP were selected as the best performers, and not the basic MLP, which was not optimized. This is evident from the studies conducted by ³⁸, in which the diversity-considered GA-MLP ensemble algorithm (DGAMLPE) outperformed the unoptimized basic MLP; ³⁵, in which DGAMLPE outperformed the basic MLP; and ³¹, in which the GOA-MLP outperformed the basic MLP. “This implies that indeed the optimised variates of the MLP can improve the basic MLP, and it is also seen that optimisation algorithms give the MLP a competitive advantage over other ML classifiers such as Random Forest (RF), Extreme Gradient Boost (X-GBoost), Weighted Count of Errors and Correct (WCEC), and Deep Belief Network-Support Vector Machine (DBN-SVM), Logistic Regression (LR), K-Nearest Neighbors (K-NN), Decision Tree Classifier (DTC), Support Vector Machine (SVM), Random Forest Classifier (RFC), and Ensemble” ( ³⁵:314). Considering these findings from the literature, the researchers were interested in optimization algorithms for the MLP in the current study. To extend the literature, the researchers included the CMA-ES-MLP in the competing models and explored the effect of sample size on these ML algorithms.

The most used optimized variants of the MLP from previous studies are GA-MLP, ^{16,32,34,35,37} followed by PSO-MLP, ^{16,31,32,34,39} and GOA-MLP, ^{31,33,39} but none of these studies included CMA-ES-MLP, which implies that the performance of CMA-ES-MLP against GA-MLP and GOA-MLP remains an area that requires further research. This study bridges this gap. It also appears that the most frequently used accuracy metric from the reviewed studies is classification accuracy, ^{16,31–39} followed by the F-measure. ^{33,35–38} The negative and positive predictive values appear to be the least used accuracy metrics. ^{31,35} Other classification evaluation metrics used in previous studies included sensitivity/recall, specificity, and precision. The current study likewise used classification accuracy, precision, sensitivity/recall, specificity, F-measure, and execution time to compare the optimized algorithms, given the popularity of these metrics in previous studies. Including a variety of comparison metrics in a single study assists in minimizing the model selection bias that may be experienced when very few similar metrics are used in the comparison and selection of the most efficient model.
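As an illustration of how the classification metrics named above relate to one another, the following minimal sketch computes them from the four cells of a binary confusion matrix. The function name `classification_metrics` and the counts used in the example are hypothetical and are not taken from the study; the formulas are the standard textbook definitions, which the study is assumed to follow.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Execution time is not derived from the confusion matrix; it would be measured separately, for example by timing the training call.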

1.3 Method

1.3.1 Dataset

The data used in this study are a secondary dataset on the direct marketing campaigns of a Portuguese banking institution. The dataset was obtained from the UCI Machine Learning Repository of the Center for Machine Learning and Intelligent Systems. The primary contributor to the data is ²⁸. The dataset can be accessed at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The dataset has a total of 4 521 observations, and 11 variables were selected for use as attributes (see Table 1) in this study to predict whether a client will subscribe to a term deposit following the marketing campaign. That is, the binary variable “has the client subscribed to a term deposit” from the dataset is used as the dependent variable (binary; 0 is no and 1 is yes).

Table 1. Description of features.

Name of variables	Description of variable	Variable type/category
Age	Client’s age	Numeric
Type of Job	The type of job of client	Admin, blue collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, and unknown
Marital Status	What is the marital status of the client?	Divorced, married, single, unknown
Educational level	Highest qualification of client	Basic 4y, basic 6y, basic 9y, high school, illiterate, professional course, university degree, and unknown
Default	Does the client have credit in default?	No, yes, and unknown
Housing	Does the client have housing loan?	No, yes, and unknown
Loan	Does the client have a personal loan?	No, yes, and unknown
Contact	Contact communication type	Cellular and telephone
Day	The last contact day of the week	Monday, Tuesday, Wednesday, Thursday, Friday
Duration	Last contact duration, in seconds (numeric)	e.g., if duration = 0 then y = 'no'
Outcome of Previous Marketing Campaign		Failure, non-existent, and success

To mimic the different sample sizes needed to study the impact of sample size on the efficiency of the MLP optimisation algorithms, nine (9) random samples of different sizes (in increments of 10%) were drawn with replacement from the 4 521 observations. The samples were selected using stratified sampling, in which the dependent variable was used as the stratum to ensure that each sample maintained the distribution of the dependent variable in the main dataset. The following sample sizes were used: 10% (n = 452), 20% (n = 904), 30% (n = 1356), 40% (n = 1808), 50% (n = 2261), 60% (n = 2713), 70% (n = 3165), 80% (n = 3617), 90% (n = 4069), and the entire dataset, which contained 100% of the observations (n = 4521). The variables described in Table 1 were used as independent variables or features.
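The stratified sampling step described above can be sketched as follows. This is only an illustrative sketch: the function name `stratified_sample`, the seed, and the label vector are hypothetical, and the sketch draws each stratum without replacement for simplicity, which is an assumption that may differ from the procedure used in the study.

```python
import random


def stratified_sample(labels, fraction, rng):
    """Return sorted row indices of a stratified sample of the given fraction.

    Each class of the dependent variable is sampled separately, so the class
    proportions of the full dataset are approximately preserved.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    picked = []
    for indices in by_class.values():
        n_take = round(fraction * len(indices))
        picked.extend(rng.sample(indices, n_take))  # without replacement (assumption)
    return sorted(picked)


rng = random.Random(0)
# Hypothetical label vector with 4 521 rows (illustrative class ratio only).
labels = [1] * 521 + [0] * 4000
samples = {pct: stratified_sample(labels, pct / 100, rng) for pct in range(10, 100, 10)}
for pct, idx in samples.items():
    share = sum(labels[i] for i in idx) / len(idx)
    print(f"{pct}%: n = {len(idx)}, positive share = {share:.3f}")
```

Because each stratum is rounded separately, the resulting sample sizes can differ by one observation from a simple percentage of the full dataset.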

All categorical features with at least three (3) classes from Table 1 were converted to dummy variables using the one-hot encoding technique, which converts the classes of a categorical variable to a vector of 1s and 0s denoting the presence and absence of each class, respectively. This increased the number of features used in the paper to 42. Previous studies that focused on the application and/or comparison of ML classifiers (including the MLP and its variants) on the dataset chosen for this study are summarized in Table 2.
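To make the one-hot encoding step concrete, the following minimal sketch encodes a single categorical feature. The helper `one_hot_encode` and the example values are hypothetical; in practice a library routine would typically be used, and the sketch only illustrates the mapping from classes to 0/1 vectors.

```python
def one_hot_encode(values, categories=None):
    """Map each categorical value to a 0/1 indicator vector."""
    if categories is None:
        categories = sorted(set(values))
    index = {cat: pos for pos, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # 1 marks the class that is present
        encoded.append(row)
    return categories, encoded


# Hypothetical values of the marital status feature, for illustration only.
marital = ["married", "single", "divorced", "married"]
cats, rows = one_hot_encode(marital)
print(cats)
for value, row in zip(marital, rows):
    print(value, row)
```

Each categorical feature with c classes therefore contributes c columns, which is why the feature count grows after encoding.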

Table 2. A summary of studies on the comparison and application of ML classifiers using the dataset on direct marketing campaigns of a Portuguese banking institution from the UCI repository.

Authors	Classifiers compared or applied	Best model
Moro et al. (2014)	LR, DT, NN, and SVM.	NN with an area under the receiver operating characteristic curve (AUC) of 0.8 and an area of the LIFT cumulative curve (ALIFT) of 0.7
Ghatasheh et al. (2020)	Meta-Cost-MLP, Cost Sensitive Classifier-MLP, MLP (Baseline), DL-MLP, J48, LL, DT, Very Fast Decision Rules (VFDR) and RF.	Meta-cost MLP with recall of 0.808, precision of 0.771, Geometric mean of 78.93%, and Classification accuracy of 77.48%.
Moro et al. (2011)	NB, DT and SVM.	SVM with AUC of 0.938 and ALIFT=0.887.
Asare-Frempong and Jayabalan (2017)	MLP, DT (C4.5), LR and RF	RF with classification accuracy of 86.08% and AUC of 92.7%.
Moro et al. (2015) ⁴²	Customer lifetime value (LTV) based NN (LTV-NN), baseline NN (with no historical data),	LTV-NN increased the AUC of the baseline-NN from 0.8002 to 0.8609, while ALIFT improved from 0.6701 to 0.7044 where AUC was at least 0.84, and ALIFT was at least 0.69.
Elsalamony (2014)	MLP, NB, LR, and the Ross Quinlan new DT (C5.0).	Based on the testing dataset, MLP produced the highest classification accuracy of 90.49%, LR the highest sensitivity of 65.53%, and C5.0 yielded specificity of 93.23%.
Zaki et al. (2024)	Stochastic Gradient Descent (SGD) Classifier, k-nearest neighbour Classifier, and Random Forest Classifier.	DT with a classification accuracy of 87.5%, a negative predictive value (NPV) of 93%, and a positive predictive value (PPV) of 87.8%.
Ładyżyński et al. (2019) ⁴⁴	RF, classification and regression tree (CART) and deep belief learning implemented in H2O framework, and deep belief networks implemented in H2O framework with l1 regularization parameter added.	CART with a precision of 9.01% and recall of 67.27%, and the authors commented it is the most efficient in terms of computing power.
Pavlović et al. (2014) ⁴⁵	DT	DT yielded classification accuracy of 88.51%, sensitivity of 93.6%, specificity of 50.1%, AUC of 70.5%, and Brier of 20.5%.
Karim and Rahman (2013) ⁴⁶	NB and DT (C4.5).	DT (C4.5) with classification accuracy of 94%, precision for “yes” of 79.1%, precision for “no” of 95.5% and AUC of 93.3, but the DT (C4.5) was 5.78 seconds slower to train than the NB.
Kim and Street (2004)	Baseline ANN and Genetic Algorithm (GA) based ANN (GA-ANN).	GA-ANN.

Table 2 shows that, from 2004 to date, several ML classifiers have been evaluated to predict the likelihood of a client subscribing to a term deposit following a direct marketing campaign by the bank, using data from a Portuguese banking institution. In general, the table shows that the results vary depending on the setting, such as the number of attributes, the number of observations in the data, and the number of training times, to mention a few. Most of these studies included neural networks, ^{29,40–43,47} including the basic MLP and its variants such as Meta-Cost-MLP, Cost Sensitive Classifier-MLP, and the GA-based ANN (GA-ANN). Although it appears in most previous studies, the basic neural network classifier was only found to be the best performer when compared to LR, DT, and SVM in the study by ²⁸. However, whenever its modified variants were included in the comparison, these variants were found to be the best performers against the basic MLP, such as in the study by ²⁹, in which the Meta-Cost-MLP outperformed the basic MLP and other classifiers (J48, LL, DT, and VFDR), and in the study by ⁴⁷, in which the GA-ANN outperformed the baseline ANN. These results show that making improvements to the basic MLP can improve its performance; hence, this paper extends the literature on the enhancement of neural networks (specifically the MLP), as done by some authors in Table 2, by comparing the GA, GOA, and CMA-ES optimisation algorithms for the MLP using the direct marketing data used in the studies summarised in this table. It is evident from Table 2 that these optimisation algorithms have never been compared in a single study using this dataset.

1.4 Data analysis methods

1.4.1 Data balancing

The data in this study were split into 80% training data and 20% testing data, which is a commonly used train-to-test splitting ratio. The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the training samples. Reference [48] defined SMOTE as one of the most used oversampling techniques to solve imbalanced data problems; it aims to balance class distributions by randomly increasing minority class examples by replicating them. Reference [48] explained that SMOTE uses linear interpolation to generate the virtual training records. These synthetic data were generated through a random selection of at least one k-nearest neighbor for each observation in the minority class. ⁴⁸ In this study, SMOTE was chosen because of its advantage in reducing the risk of overfitting and its wide application in many previous studies, such as ^{48–52}.

In Figure 1, $Y_i$ is the point under consideration, $Y_{i1}$ to $Y_{i4}$ are the nearest neighbors, and $w_1$ to $w_4$ represent the synthetic data generated by the randomized interpolation. Reference [53] explained that synthetic samples are generated by considering the difference between the nearest neighbor and the feature vector. Reference [53] further explained that the difference is multiplied by a random number between 0 and 1 and then added to the feature vector under consideration. Table 3 presents the balanced training data from the original dataset.

Figure 1. Example of how to generate synthetic data using SMOTE ( ⁵³:1414).

Figure 1 illustrates how SMOTE randomly generates synthetic data ($w_1$ to $w_4$) to balance the imbalanced dataset by taking the difference between the data point under consideration ($Y_i$) and one of its nearest neighbours ($Y_{i1}$ to $Y_{i4}$), multiplying this difference by a random number between 0 and 1, and then adding the result to the feature vector under consideration. ⁵³
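The interpolation step of SMOTE described above can be sketched in a few lines. This is a simplified illustration rather than the implementation used in the study: the function name `smote_samples`, the choice of k, the seed, and the minority-class points are all hypothetical, and distances are assumed to be Euclidean.

```python
import math
import random


def smote_samples(minority, n_new, k, rng):
    """Create n_new synthetic minority points by linear interpolation."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority-class neighbours of the chosen point
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # new point = base + gap * (neighbour - base)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


rng = random.Random(42)
# Hypothetical two-feature minority observations (illustrative values only).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.9]]
new_points = smote_samples(minority, n_new=4, k=3, rng=rng)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on the line segment between a minority observation and one of its neighbours, so the new points stay inside the region already occupied by the minority class.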

Table 3. Frequencies of the dependent variable in the SMOTE balanced training sets across the sample sizes.

Unbalanced data				Balanced data
Sample Size	Client Subscription	N	%	Sample Size	Client Subscription	N	%
n = 362	Unsubscribed	313	86	n = 626	Unsubscribed	313	50
n = 362	Subscribed	49	14	n = 626	Subscribed	313	50
n = 723	Unsubscribed	640	89	n = 1280	Unsubscribed	640	50
n = 723	Subscribed	83	11	n = 1280	Subscribed	640	50
n = 1085	Unsubscribed	970	89	n = 1940	Unsubscribed	970	50
n = 1085	Subscribed	115	11	n = 1940	Subscribed	970	50
n = 1446	Unsubscribed	1296	90	n = 2592	Unsubscribed	1296	50
n = 1446	Subscribed	150	10	n = 2592	Subscribed	1296	50
n = 1809	Unsubscribed	1602	89	n = 3204	Unsubscribed	1602	50
n = 1809	Subscribed	207	11	n = 3204	Subscribed	1602	50
n = 2170	Unsubscribed	1916	88	n = 3832	Unsubscribed	1916	50
n = 2170	Subscribed	254	12	n = 3832	Subscribed	1916	50
n = 2026	Unsubscribed	1791	88	n = 3582	Unsubscribed	1791	50
n = 2026	Subscribed	235	12	n = 3582	Subscribed	1791	50
n = 2894	Unsubscribed	2557	88	n = 5114	Unsubscribed	2557	50
n = 2894	Subscribed	337	12	n = 5114	Subscribed	2557	50
n = 3255	Unsubscribed	2880	88	n = 5760	Unsubscribed	2880	50
n = 3255	Subscribed	375	12	n = 5760	Subscribed	2880	50
n = 3617	Unsubscribed	3199	88	n = 6398	Unsubscribed	3199	50
n = 3617	Subscribed	418	12	n = 6398	Subscribed	3199	50

Table 3 shows that the class of the dependent variable is balanced after using SMOTE specifically for the training data samples. In all the samples, equal numbers of unsubscribed participants and subscribed participants are observed.

1.4.2 Multilayer Perceptron (MLP)

According to ⁵⁴, the MLP was invented in 1958 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the Office of Naval Research in the United States. The author further explained that, although it was originally designed as a machine rather than a program, the perceptron was first implemented on the IBM 704 as software before being implemented in specially designed hardware as the “Mark 1 perceptron.” In addition, ⁵⁵ explained that the purpose of this machine was image recognition; it had 400 photocells arranged in an array and randomly connected to the “neurons.” According to the author, electric motors updated the weights during learning, and the weights were encoded in potentiometers. The flexibility of the MLP has enabled its use in various activities. ⁵⁶ It was previously used only for image recognition, ^{57,58} speech recognition, ⁵⁹ and machine translation software. ⁶⁰ Currently, it can also be used for text data, ^{61,62} speech recognition, ⁵⁸ and other types of data. The MLP can be fitted using various software, such as Waikato Environment for Knowledge Analysis 3.9 (WEKA), Statistical Package for the Social Sciences (SPSS), and Python. With the use of optimization algorithms, such as those being compared in this study, MLPs have become very useful, convenient, and easy to use.

The MLP consists of an input and an output layer with one or more hidden layers of non-linear activating nodes. ⁶³ Each node in one layer connects with a certain weight to every node in the following layer. ⁶³ In the input layer, the activations, which were defined by ⁶⁴ as the source of the MLP’s power, were determined using the following equation:

$$b_j = \sum_{i=0}^{D} w_{ij}^{(1)} x_i, \tag{1}$$

The first layer involves $M$ linear combinations of the $D$-dimensional input, for $j = 1, 2, \ldots, M$ and $i = 0, 1, \ldots, D$, where $w_{ij}^{(1)}$ are the weights for node $j$ in layer 1 for incoming node $i$, and the superscript $(1)$ indicates that this is the first layer of the network. Each activation was then transformed by a non-linear activation function $g$.

In this study, tanh was used as the activation function for the hidden layer. Reference [65] described the tanh function as a smoother, zero-centred function with a range between -1 and 1. The tanh function is defined by the following equation, sourced from ⁶⁵:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \tag{2}$$

where $x$ is an input to the neuron and $e$ is Euler’s number.

A sigmoid function was used as the activation function for the output layer. Reference [65] defined the sigmoid as a non-linear activation used mostly in feedforward neural networks. “It is a bounded differentiable real function, defined for real input values, with positive derivatives everywhere and some degree of smoothness” ( ⁶⁵:5). The sigmoid activation function is given by the following relationship, sourced from ⁶⁵:

$$f(x) = \frac{1}{1 + \exp(-x)}, \tag{3}$$

where the input $x$ is the activation $b_j$, and $f(x)$ corresponds to the outputs of the basis functions and is interpreted as the output of the hidden units.
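The forward pass implied by Equations (1) to (3) can be sketched as follows for a network with one tanh hidden layer and a single sigmoid output unit. The function names, the layer sizes, and all weight values are hypothetical and chosen only for illustration; the bias is written as a separate term, which is assumed to be equivalent to the $i = 0$ term of Equation (1) with a constant input of 1.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, as in Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer followed by one sigmoid output unit."""
    hidden = []
    for weights, bias in zip(w_hidden, b_hidden):
        # weighted sum of the inputs for one hidden node, Equation (1)
        activation = bias + sum(w * xi for w, xi in zip(weights, x))
        hidden.append(math.tanh(activation))  # Equation (2)
    z = b_out + sum(w * h for w, h in zip(w_out, hidden))
    return sigmoid(z)


# Hypothetical input and weights (three features, two hidden nodes).
x = [0.5, -1.0, 2.0]
w_hidden = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
b_hidden = [0.0, 0.1]
w_out = [0.7, -0.5]
b_out = 0.05
probability = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(round(probability, 4))
```

The output lies strictly between 0 and 1 and can be read as the predicted probability of subscription; the optimization algorithms compared in this study would search over quantities such as the entries of `w_hidden` and `w_out`.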

1.4.3 Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) was developed by Hansen et al. in 2003. ⁶⁶ According to ⁶⁷, the algorithm’s theoretical underpinnings include variable metrics, and the CMA-ES uses maximum-likelihood updates in conjunction with a stochastic variable-metric approach. In an algorithm that quickly converges to the global optimum across a wide class of functions, the covariance matrix maximizes likelihood while resembling an expectation-maximization algorithm. ^{68,69} It has also been explained that the CMA-ES algorithm has certain drawbacks, such as its performance becoming slow if the number of model parameters that need to be estimated is large. The approximation of gradients without assuming or requiring their existence is another flaw of this algorithm. CMA-ES is a plausible candidate for an effective parameter estimation algorithm, ⁷⁰ but it must be tested against other algorithms to ascertain its efficiency, particularly when the sample size is varied.

The CMA-ES samples candidate solutions from a multivariate normal search distribution and ranks the sampled points according to their fitness function values. The sampling can be expressed using the following equations obtained from ⁷¹:

$$x_i \sim \mathcal{N}(m_k, \sigma_k^2 C_k), \tag{4}$$

$$\sim m_k + \sigma_k \times \mathcal{N}(0, C_k), \tag{5}$$

where $m_k$ is the distribution mean and current favourite solution to the optimization problem, $\sigma_k$ is the step size, and $C_k$ is the symmetric and positive definite covariance matrix.

The fitness function for the CMA-ES is defined as:

$$f(x) = g(x^{T} H x), \tag{6}$$

where $H$ is the Hessian matrix of $f(x)$ and $x^{T}$ is the transpose of $x$.

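The distribution mean is then updated to a weighted average using the following equation:

$$m_{\text{new}} \leftarrow \sum_{i=1}^{\mu} w_i x_{i:\lambda} = m + \sum_{i=1}^{\mu} w_i (x_{i:\lambda} - m), \tag{7}$$

where $m_{\text{new}}$ is the new distribution mean, $\mu$ is the number of best-ranked candidate solutions used for recombination, $\lambda$ is the population size, $x_{i:\lambda}$ is the $i$-th best of the $\lambda$ sampled points, $m$ is the mean vector, and $w_i$ is the recombination weight.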
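The sampling and mean-update steps in Equations (4) to (7) can be sketched as below. This is a deliberately reduced illustration, not a full CMA-ES: the step size and covariance matrix are held fixed, the logarithmic recombination weights are one common choice assumed here, and the function names, the sphere test function, the seed, and all parameter values are hypothetical.

```python
import math
import random


def cholesky(c):
    """Lower-triangular factor a with a a^T = c (c symmetric positive definite)."""
    n = len(c)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(c[i][i] - s)
            else:
                a[i][j] = (c[i][j] - s) / a[j][j]
    return a


def sample_population(mean, sigma, cov, lam, rng):
    """Draw lam candidates x_i = m + sigma * N(0, C), as in Equations (4)-(5)."""
    a = cholesky(cov)
    n = len(mean)
    population = []
    for _ in range(lam):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [sum(a[r][k] * z[k] for k in range(r + 1)) for r in range(n)]
        population.append([m + sigma * yi for m, yi in zip(mean, y)])
    return population


def update_mean(population, fitness, weights):
    """Weighted recombination of the best candidates, as in Equation (7)."""
    ranked = sorted(population, key=fitness)  # minimisation: best first
    best = ranked[:len(weights)]
    n = len(best[0])
    return [sum(w * x[d] for w, x in zip(weights, best)) for d in range(n)]


def sphere(x):
    """Hypothetical test function with its minimum at the origin."""
    return sum(v * v for v in x)


rng = random.Random(1)
lam, mu = 12, 6
raw = [math.log(mu + 0.5) - math.log(i) for i in range(1, mu + 1)]
weights = [r / sum(raw) for r in raw]  # positive weights that sum to one
mean, sigma = [3.0, -2.0], 0.5
cov = [[1.0, 0.0], [0.0, 1.0]]
start_value = sphere(mean)
for _ in range(30):
    population = sample_population(mean, sigma, cov, lam, rng)
    mean = update_mean(population, sphere, weights)
print(start_value, sphere(mean))
```

In the full algorithm, the step size and the covariance matrix would also be adapted at every iteration; keeping them fixed here isolates the effect of the ranking and recombination steps.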

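The isotropic evolution path is then updated using the following equation:

$$p_\sigma \leftarrow (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, C_k^{-1/2}\, \frac{m_{k+1} - m_k}{\sigma_k}, \tag{8}$$

where $p_\sigma$ is the evolution path, $(1 - c_\sigma)$ is the discount factor, $\sqrt{1 - (1 - c_\sigma)^2}$ is the complement for the discounted variance, and $\sqrt{\mu_w}\, C_k^{-1/2}\, \frac{m_{k+1} - m_k}{\sigma_k}$ is distributed as $\mathcal{N}(0, I)$ under neutral selection. The step size is then updated using:

$$\sigma_{k+1} = \sigma_k \times \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)\right), \tag{9}$$

where $\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0, I)\|} - 1\right)$ is unbiased about 0 under neutral selection, and

$$E\|\mathcal{N}(0, I)\| = \sqrt{2}\, \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \tag{10}$$

$$\approx \sqrt{n}\left(1 - \frac{1}{4n} + \frac{1}{21 n^2}\right). \tag{11}$$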
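As a quick numerical check on Equations (10) and (11), the sketch below compares the exact expected norm of a standard normal vector with its approximation for a few dimensions. It assumes the standard form of these expressions, in which Equation (10) involves the gamma function and Equation (11) the square root of $n$; the function names and the chosen dimensions are hypothetical.

```python
import math


def expected_norm_exact(n):
    """E||N(0, I)|| = sqrt(2) * Gamma((n + 1) / 2) / Gamma(n / 2)."""
    log_ratio = math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    return math.sqrt(2.0) * math.exp(log_ratio)


def expected_norm_approx(n):
    """Approximation sqrt(n) * (1 - 1/(4n) + 1/(21 n^2))."""
    return math.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n * n))


for n in (1, 2, 10, 100):
    exact = expected_norm_exact(n)
    approx = expected_norm_approx(n)
    print(f"n = {n}: exact = {exact:.6f}, approx = {approx:.6f}")
```

The log-gamma function is used so that the ratio remains numerically stable for large $n$, which matters when $n$ is the number of MLP weights being optimized.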

The evolution path used for the covariance matrix update is described as follows:

$$p_c \leftarrow (1 - c_c)\, p_c + \mathbf{1}_{[0, \alpha\sqrt{n}]}(\|p_\sigma\|)\, \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\, \frac{m_{k+1} - m_k}{\sigma_k}, \tag{12}$$

The covariance matrix of the CMA-ES is finally updated using:

$$C_{k+1} = (1 - c_1 - c_\mu + c_s)\, C_k + c_1\, p_c p_c^{T} + c_\mu \sum_{i=1}^{\mu} w_i\, \frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^{T}, \tag{13}$$

where $c_s$ is the small variance loss, $c_1$ is the learning rate for the rank-one update of the covariance matrix, and $c_\mu$ is the learning rate for the rank-$\mu$ update of the covariance matrix.

1.4.4 Genetic Algorithm (GA)

Reference [73] proposed a learning machine called the Genetic Algorithm (GA), which paralleled the principles of evolution. The first computer simulation of evolution was created in 1954 at the Institute for Advanced Study in Princeton, New Jersey, by Barricelli (1954). Reference [73] found that GA has some limitations, such as repeated evaluation of the fitness function and difficulties in working with dynamic datasets; it also tends to converge to a local optimum or even arbitrary points instead of the global optimum of the problem. “A better solution is only in comparison to other solutions, and the stop criterion is not clear in every problem” ( ⁷³:226). On the other hand, GA has been noted to be a very efficient and effective technique for both optimisation and ML applications. ⁷⁴ Another advantage of GA is that it requires less information about the problem. ^{75,76} It has also been stated that GA can work very well on mixed (discrete and/or continuous) problems. “The GA can be applied in real world situations such as engineering design, to make the design cycle process fast and economical, and in robotics too, to create learning robots which will behave as humans and will do tasks like cooking and laundry” ( ⁷⁷: 347).

The efficiency of GAs depends on mutation and crossover operators and their relationships. “To determine the most appropriate operators, different mutation and crossover operators are used and they are compared with each other since GA involves a process of complex interaction between its parameters.” Reference [78] suggested that for the algorithm to perform best, the population size must range between 50 and 100 observations. In this study, we verified this recommendation by studying the effectiveness of GA in different sample sizes. Reference [79] stated that the algorithm comprises four main steps: selection, reproduction, replacement, and termination. The steps are as follows:

1.4.4.1 Selection

Reference [80] explained that, by choosing which individuals reproduce, the primary goal of this phase is to identify the area with the highest likelihood of producing a solution to the problem that is superior to that of the previous generation. The authors add that the selected individuals are then arranged in pairs to enhance reproduction. Reference [79] also explained that individuals will then pass on their genes to the next generation. “The GA uses the fitness proportionate selection technique to ensure that useful solutions are used for recombination” ( ⁷⁹: 3). Fitness proportionate selection is defined by the author as the most popular method of parent selection, where every individual can become a parent with a probability that is proportional to its fitness. “Fitter individuals have a higher chance of mating and propagating their features to the next generation. Therefore, such a selection strategy applies a selection pressure to the more fit individuals in the population, evolving better individuals over time” ( ⁸⁰: 16). The fitness proportionate selection can be calculated using the following equation adopted from ⁸⁰:

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}, \tag{14}$$

where $f_i$ denotes the fitness of individual $i$ in the population, $N$ denotes the number of individuals in the population, and $p_i$ denotes the probability that individual $i$ is selected.
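Fitness proportionate selection (Equation 14) can be sketched in a few lines of Python. This is a minimal illustration under the assumption of non-negative fitness values; the function names are hypothetical and the code is not the implementation used in the study.

```python
import random


def selection_probabilities(fitness):
    """Return p_i = f_i / sum_j f_j for each individual (Equation 14)."""
    if any(f < 0 for f in fitness):
        raise ValueError("fitness values must be non-negative")
    total = sum(fitness)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    return [f / total for f in fitness]


def roulette_select(population, fitness, k, rng=None):
    """Draw k parents with replacement, each with probability proportional to fitness."""
    rng = rng or random.Random()
    return rng.choices(population, weights=selection_probabilities(fitness), k=k)


if __name__ == "__main__":
    parents = roulette_select(["a", "b", "c"], [1.0, 3.0, 6.0], k=4, rng=random.Random(0))
    print(parents)
```

Individuals with zero fitness are never selected under this scheme, which is one reason some GA implementations rescale or rank the fitness values before selection.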

1.4.4.2 Reproduction

Reference [80] explained that the algorithm applies variation operators to the parent population during the reproduction phase, creating a child population. This phase has four main operators, crossover, mutation, replacement, and termination, which are discussed below.

1.4.4.3 Crossover

According to ⁸¹, the crossover operator swaps the genetic information of two parents to produce offspring. Reference [81] also explained that this is performed on parent pairs that are selected randomly to generate a child population of equal size to the parent population. For this study, a single-point crossover was considered. “Single point crossover works in such a way that a parent organism string is selected. All data beyond this point in the organism string were swapped between the two parent organisms. Strings are characterized by positional bias” ( ⁸¹: 13).
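A single-point crossover of the kind described above can be sketched as follows. The list-based chromosome encoding and the function name are assumptions made for illustration.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=None):
    """Swap everything beyond a random cut point between two equal-length parents.

    Returns the two children and the cut point that was used.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2:
        raise ValueError("chromosomes need at least two genes to be cut")
    rng = rng or random.Random()
    point = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b, point


if __name__ == "__main__":
    print(single_point_crossover([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], random.Random(1)))
```

Because the cut point is drawn strictly inside the string, each child inherits at least one gene from each parent.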

1.4.4.4 Mutation

The mutation operator adds genetic information to the new child population. According to ⁸², the operator achieves this by flipping some bits in the chromosome to solve the problem of local minima and enhance diversification. In the present study, a bit-flip mutation was considered. “Bit flip mutation works in such a way that it selects one or more random bits and flip them. This can only be done for binary encoded GA’s” ( ⁸²: 47).
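The bit-flip mutation can be illustrated with the sketch below. It assumes a binary chromosome stored as a list of 0s and 1s and a per-bit mutation rate; the rate is an assumption of this example, since the text does not state how many bits are flipped.

```python
import random


def bit_flip_mutation(chromosome, rate, rng=None):
    """Return a copy of a binary chromosome with each bit flipped with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rng = rng or random.Random()
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


if __name__ == "__main__":
    print(bit_flip_mutation([0, 1, 1, 0, 1, 0, 0, 1], rate=0.25, rng=random.Random(3)))
```

The input chromosome is left unchanged, so the parent population remains available for the replacement step.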

1.4.4.5 Replacement

Reference [80] elucidated that the replacement operator acts as the final generational step to replace the old population with the new child population. In this study, a generational replacement operator is used, where the previous generation is replaced with a newly generated child population.

1.4.4.6 Termination

Reference [80] explains that termination is only possible in specific situations, such as having reached an absolute number of generations, not having improved the population for X iterations, or the objective function value reaching a pre-defined threshold. Reference [79] cited a genetic algorithm example in which a counter was maintained to record generations for which the population did not improve. “Initially, we set the counter to zero. Each time we do not generate an offspring, which is better than the individuals in the population, we increase the counter. However, if the fitness of any offspring is better, then we reset the counter to zero” ( ⁷⁹: 2). The author also stated that the algorithm terminates when the counter reaches a predetermined value.
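The counter-based stopping rule quoted above can be sketched as follows. The `step` callback, which stands in for one full generation of selection, crossover, mutation, and replacement, is a hypothetical placeholder, as are the names `patience` and `max_generations`.

```python
def run_until_stalled(step, initial_best, patience, max_generations):
    """Run generations until no improvement is seen for `patience` generations in a row.

    `step(generation)` must return the best offspring fitness of that generation
    (higher is better). Returns the best fitness found and the last generation run.
    """
    best = initial_best
    stall_counter = 0  # generations without improvement
    generation = 0
    for generation in range(1, max_generations + 1):
        offspring_best = step(generation)
        if offspring_best > best:
            best = offspring_best
            stall_counter = 0  # improvement found: reset the counter
        else:
            stall_counter += 1
        if stall_counter >= patience:
            break  # counter reached the predetermined value
    return best, generation


if __name__ == "__main__":
    scores = [0.5, 0.6, 0.6, 0.55, 0.7, 0.1]  # illustrative values only
    print(run_until_stalled(lambda g: scores[g - 1], 0.4, patience=2, max_generations=6))
```

The `max_generations` cap covers the other termination condition mentioned in the text, namely reaching an absolute number of generations.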

1.4.5 Grasshopper Optimisation Algorithm (GOA)

The Grasshopper Optimisation Algorithm (GOA) is a new swarm intelligence algorithm and population-based method developed by Seyedali Mirjalili in 2017. ⁸³ According to the authors, the GOA mainly observes the behavior of grasshopper swarms and their social interactions. Every grasshopper in the population represents a solution, and its location within the swarm is determined by three forces: wind advection, the force of gravity applied to it, and social interactions with other grasshoppers. ⁸⁴ The process of optimizing the grasshopper algorithm involves several steps, including initialization, creation, and evaluation of the first population, identification of the best overall solution, updating the decreasing coefficient parameter, mapping the grasshopper’s distance, and updating the solution. ⁸⁵

Reference [87] explained that the GOA can improve the average fitness of all grasshoppers, which helps the GOA effectively improve the initial randomly generated solutions. The algorithm can be computed using software such as Matrix Laboratory (MATLAB) and Python. No information relating to the GOA in comparison with other algorithms has emerged, as this is a newly developed algorithm. Therefore, little is known about the efficiency of this algorithm compared to its predecessors; hence, this study seeks to expand the scope of this algorithm. Grasshopper position ($X_i$) calculations depend on three types of forces: social interactions with other grasshoppers, wind advection, and gravitational force. ⁸⁷ All equations used in the description of the GOA in this study were sourced from ⁸⁷. The grasshopper’s position is defined as:

$$X_i = S_i + G_i + A_i, \tag{15}$$

where $X_i$ defines the position of the $i$-th grasshopper, $S_i$ is the social interaction, $G_i$ is the gravitational force on the $i$-th grasshopper, and $A_i$ is wind advection.

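From Equation 15, social interaction is defined as:

$$S_i = \sum_{\substack{j=1 \\ j \neq i}}^{N} s(d_{ij})\, \hat{d}_{ij}, \tag{16}$$

where $d_{ij} = |x_j - x_i|$ is the distance between grasshopper $i$ and grasshopper $j$, $\hat{d}_{ij} = (x_j - x_i)/d_{ij}$ is the unit vector from grasshopper $i$ to grasshopper $j$, and $s$ is the social force function.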

From Equation 15, the gravitational force ($G_i$) on the grasshopper is computed as follows:

$$G_i = -g\, \hat{e}_g, \tag{17}$$

where $g$ denotes the gravitational constant and $\hat{e}_g$ is the unit vector towards the center of the earth.

From Equation 15, the wind advection is computed as follows:

$$A_i = u\, \hat{e}_w, \tag{18}$$

where $u$ is a constant drift and $\hat{e}_w$ represents a unit vector in the direction of the wind.

When substituting Equations 16–18 into Equation 15, the position of the current grasshopper becomes:

$$X_i = \sum_{\substack{j=1 \\ j \neq i}}^{N} s\left(|x_j - x_i|\right) \frac{x_j - x_i}{d_{ij}} - g\, \hat{e}_g + u\, \hat{e}_w, \tag{19}$$

where $N$ is the total number of grasshoppers.
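Equation 19 can be evaluated directly, as in the minimal Python sketch below. The social force function $s$ is not defined in this section, so the sketch assumes the form $s(r) = f e^{-r/l} - e^{-r}$ with $f = 0.5$ and $l = 1.5$, the form commonly given in the GOA literature. That choice, the function names, and all numerical values are assumptions for illustration and are not the settings used in this study.

```python
import math


def social_force(r, f=0.5, length_scale=1.5):
    """Assumed social force s(r) = f * exp(-r / l) - exp(-r); not defined in the text."""
    return f * math.exp(-r / length_scale) - math.exp(-r)


def grasshopper_position(i, positions, g, e_g, u, e_w):
    """Evaluate Equation 19 for grasshopper i.

    positions: list of coordinate lists, one per grasshopper
    g, e_g:    gravitational constant and unit vector towards the center of the earth
    u, e_w:    constant drift and unit vector in the direction of the wind
    """
    x_i = positions[i]
    dim = len(x_i)
    social = [0.0] * dim
    for j, x_j in enumerate(positions):
        if j == i:
            continue
        d_ij = math.dist(x_i, x_j)
        if d_ij == 0.0:
            continue  # coincident grasshoppers: the unit vector is undefined, so skip
        strength = social_force(d_ij)
        for k in range(dim):
            social[k] += strength * (x_j[k] - x_i[k]) / d_ij
    return [social[k] - g * e_g[k] + u * e_w[k] for k in range(dim)]


if __name__ == "__main__":
    swarm = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # illustrative positions
    print(grasshopper_position(0, swarm, g=0.1, e_g=[0.0, -1.0], u=0.2, e_w=[1.0, 0.0]))
```

Under the assumed form of $s$, the social force is negative (repulsive) at short distances and positive (attractive) at larger ones, which is what keeps the swarm spread over the search space.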

Reference [84] explained how the pseudocode of the GOA algorithm works. The GOA starts optimization by creating a set of random solutions; the search agents then update their positions, followed by the determination of the position of the best target obtained thus far, and this position is updated in each iteration. ⁸³ Additionally, the distances between grasshoppers are normalized in each iteration. Reference [83] stated that position updating is performed iteratively until the end criterion is satisfied. Finally, the position and fitness of the best target are returned as the best approximation of the global optimum.

1.4.6 Model comparison criteria

Precision, sensitivity/recall, F-score, classification accuracy, specificity, and execution time were used to evaluate and compare the optimization algorithms for the MLP, as described in this section. The classifier with the highest precision, sensitivity/recall, F-score, accuracy rate, and specificity, and the lowest execution time, is preferred.

Classification accuracy (also referred to as overall accuracy) was described by ⁸⁸ as the number of correct forecasts divided by the total number of forecasts. It is the most straightforward clustering quality measure proposed by ⁸⁹ to assess the clustering results related to the ground truth. ⁸⁸ Classification accuracy was calculated by ⁸⁸ as follows:

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Positives} + \text{Negatives}} \tag{20}$$

Reference [91] characterized specificity as the proportion of real negatives that are correctly identified, and they described the specificity equation as follows:

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \tag{21}$$

Precision was defined by ⁹¹ as a measure of how close a series of measurements are to one another. The author explained that precise measurements are highly reproducible, even if the measurements are not near the correct value. Precision was calculated as follows ⁹¹:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{22}$$

Reference [91] characterized the sensitivity/recall rate as a measure of the proportion of real positives that are accurately identified. The following equation for recall/sensitivity was adopted from ⁹⁰:

$$\text{Sensitivity/recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{23}$$

Reference [93] defined the F-measure as a weighted harmonic mean of recall and precision. There are several motivations for this choice. Reference [92] explains that the harmonic mean is commonly appropriate when averaging rates or frequencies, but there is also a set of theoretical reasons. The author further explains that the F-measure allows differential weighting of recall and precision, but they are commonly given equal weights. The F-measure was computed as follows:

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{24}$$

Execution time is defined by ⁹³ as the amount of time spent by the system executing a given task, including the amount of time it spends executing runtime or system services.
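The comparison criteria in Equations 20 to 24 can be computed from the four confusion-matrix counts, as in the following minimal sketch. The function names, the convention of returning 0 when a denominator is zero, and the counts used in the usage lines are assumptions for illustration and are not results from the study.

```python
import time


def classification_metrics(tp, tn, fp, fn):
    """Compute Equations 20 to 24 from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


def timed(func, *args, **kwargs):
    """Return func's result together with its wall-clock execution time in seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    metrics, seconds = timed(classification_metrics, tp=30, tn=50, fp=10, fn=10)
    print(metrics, seconds)
```

In practice the timed call would wrap the training of each classifier, so that execution time is measured in the same way for every model and sample size.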

1.5 Results

To ease the presentation and interpretation of the results, the results are presented by plotting each classification metric of all the ML classifiers under comparison across the sample sizes in Figures 2 to 8.

Figure 2. Classification accuracy for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP by sample size.

The line graphs represent the overall classification accuracy of the basic MLP and each optimised MLP to determine the impact of various sample sizes on their classification accuracy. The classification accuracy is the rate at which the model correctly classifies all the observations (both non-subscriptions and subscriptions).

Figure 3. Precision rate for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP by sample size.

The line graphs represent the overall precision rate of the basic MLP and each optimised MLP across the various sample sizes. This was to determine the impact of various sample sizes on the percentage of the term deposit subscriptions that are correctly classified by the models under comparison out of all the cases that were predicted as term deposit subscriptions by these models.

Figure 4. Sensitivity/recall for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP by sample size.

The line graphs represent the sensitivity rate of the basic MLP and each optimised MLP across the various sample sizes. This was to determine the impact of various sample sizes on the percentage of the term deposit subscriptions that are correctly classified by the models under comparison out of all the term deposit subscriptions from the testing datasets.

Figure 5. Specificity rates for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP by sample size.

The line graphs represent the specificity rate of the basic MLP and each optimised MLP across the various sample sizes. This was to determine the impact of various sample sizes on the percentage of the term deposit non-subscriptions that are correctly classified by the models under comparison out of all the term deposit non-subscriptions from the testing datasets.

Figure 6. F-measure rates for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP by sample size.

The line graphs represent the F-measure rate of the basic MLP and each optimised MLP across the various sample sizes. This was to determine the impact of various sample sizes on the harmonic mean of precision and recall. That is, how the sample size impacts the ability of the models under comparison to balance precision and recall.

Figure 7. Execution times for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP by sample size.

The line graphs represent the execution time of the basic MLP and each optimised MLP across the various sample sizes. This was to determine the impact of various sample sizes on the time it takes to complete the processes of deriving each model.

Figure 8. Mean of classification metrics for the basic MLP, GA-MLP, GOA-MLP and CMA-ES-MLP across the samples.

The line graphs represent the performance of the basic MLP and each optimised MLP across the various sample sizes on average. This was to determine the impact of various sample sizes on the average performance of the models. The mean classification performance was computed by taking the average of overall classification accuracy, precision, sensitivity and specificity.

Figure 2 shows that, for the basic MLP classifier, the classification accuracy values fluctuate and do not follow a clear increasing or decreasing trend with different sample sizes. The GA-MLP shows that the classification accuracy values also fluctuate and do not exhibit a consistent pattern with the change in the sample size. For GOA-MLP, the classification accuracy values show fluctuations but are more stable than those of its competitors. For CMA-ES-MLP, the classification accuracy values appear to fluctuate with a significant drop for the 60% sample (n = 3832), but in most of the sample sizes (10% (n = 626), 20% (n = 1280), 40% (n = 2592), 70% (n = 3582), 80% (n = 5114), 90% (n = 5760), and 100% (n = 6398)), this classifier has the highest overall classification accuracy rates; that is, the ability to classify both subscribers and non-subscribers from the datasets. The precision rates for all classifiers across sample sizes are shown in Figure 3.

Figure 3 shows that the precision values for all the models fluctuated as the sample sizes increased and did not show a consistent upward or downward pattern as the sample size increased. Generally, relatively high rates of precision are shown for the CMA-ES-MLP in most of the sample sizes (10% (n = 626), 20% (n = 1280), 30% (n = 1940), 40% (n = 2592), 50% (n = 3204), 60% (n = 3832), 70% (n = 3582), 80% (n = 5114), and 100% (n = 6398)). This implies that CMA-ES-MLP has the highest ability to correctly classify positive cases (subscribers) out of all predicted positives compared to GA-MLP, GOA-MLP, and the basic MLP in most instances. The sensitivity/recall rates for all classifiers across sample sizes are presented in Figure 4.

Figure 4 shows a sharp increase from the smallest sample size (10% (n = 626)) to the second-smallest sample size (20% (n = 1280)) in the sensitivity/recall rate for the basic MLP. Thereafter, a steady increase was observed until n = 2592, followed by fluctuating values of sensitivity in the remaining sample sizes. Generally, the basic MLP with no optimization yielded the lowest sensitivity rates across all sample sizes compared to its competitors (except for the full dataset (n = 6398)). The second lowest sensitivity/recall rates were observed for GA-MLP across all samples, except for the 70% sample (n = 3582) and the 80% sample (n = 5114), so generally GA-MLP is the second worst performer among all four models. Figure 4 shows that the sensitivity/recall values for GOA-MLP and CMA-ES-MLP also show fluctuations but are generally relatively higher than those of the basic MLP and GA-MLP for most samples.

The sensitivity/recall rates for CMA-ES-MLP decreased slowly as the sample size increased (except when n = 5760). In general, the sensitivity rates for GA-MLP, GOA-MLP, and CMA-ES-MLP are more stable across the sample sizes relative to those derived from the basic MLP without optimization because they do not fluctuate rapidly, as in the case of the basic MLP. In most instances, GA-MLP and CMA-ES-MLP correctly classified the negatives (non-subscribers) better than GOA-MLP and basic MLP. The specificity rates for all classifiers across sample sizes are shown in Figure 5.

Figure 5 shows that for the basic MLP, the specificity values are relatively low and fluctuate with different sample sizes. The specificity for GA-MLP was highest for the smallest sample size (10% (n = 626)), followed by an upward trend between the second-smallest sample (20% (n = 1280)) and the fifth-smallest sample (50% (n = 3204)). Thereafter, it fluctuates, but for most sample sizes, its values are greater than those of the basic MLP and lower than those of the GOA-MLP. The specificity values for CMA-ES-MLP generally increased from the 50% sample (n = 3204) to the full dataset (n = 6398). Generally, the CMA-ES-MLP classifies the negatives (non-subscribers) more accurately than the basic MLP, GA-MLP, and GOA-MLP. The F-measure rates for all classifiers across sample sizes are shown in Figure 6.

Figure 6 shows that for the basic MLP, the F-measure appears to fluctuate with different sample sizes without forming a clear upward or downward trend, and the basic MLP yielded the lowest F-measure across all sample sizes. For GA-MLP, generally, there seems to be an increase in the F-measure as the sample size increases from 20% (n = 1280) to 50% (n = 3204) and from 60% (n = 3832) to 90% (n = 5760); however, GA-MLP is the second worst performer in terms of the F-measure. Figure 6 also shows that the F-measure for the GOA-MLP fluctuates, and there is a significant drop in its performance for the whole dataset (n=1940); however, this classifier is generally the second-best performer in terms of the F-measure, after the CMA-ES-MLP. The execution times for all classifiers across sample sizes are shown in Figure 7.

Figure 7 shows that the basic MLP was the fastest to train, followed by CMA-ES-MLP. For the GA-MLP and GOA-MLP algorithms, there was an increasing trend whereby, as the sample sizes increased, the execution time also increased for these classifiers, but GA-MLP was the most expensive model when the sample size was at least 5114. The means of the classification metrics for all classifiers across the sample sizes are shown in Figure 8.

Figure 8 shows that the CMA-ES-MLP algorithm consistently achieved the highest mean accuracy across different sample sizes (except for the 50% sample size (n = 3204)), indicating that it is the most accurate model overall. The GA-MLP and GOA-MLP algorithms showed varied performance, but for most sample sizes (10% (n = 626), 20% (n = 1280), 30% (n = 1940), 40% (n = 2592), 70% (n = 3582), 80% (n = 5114), and 100% (n = 6398)), GOA-MLP provided more accurate classifications than GA-MLP. The basic MLP algorithm consistently achieved the lowest mean classification accuracy, indicating its poor performance compared to its optimized variants. In general, the classifiers can be ranked in descending order of mean classification accuracy: CMA-ES-MLP, GOA-MLP, GA-MLP, and basic MLP.

1.6 Conclusion

This study was conducted to determine the impact of sample size on the classification ability and efficiency of GA, GOA, and CMA-ES, which are optimization algorithms for the MLP. The comparison was performed using line graphs of precision, F-measure, accuracy, sensitivity/recall, specificity, and execution time for basic MLP, GA-MLP, GOA-MLP, and CMA-ES-MLP across the ten samples. The line charts did not reveal a defined relationship between the performance of the classifiers and the sample size because the plots varied rapidly as the sample size increased. However, the execution time showed a clearer pattern as the sample size increased. The results revealed that GOA-MLP had more stable classification accuracy values than its competitors. Generally, the sensitivity rates for GA-MLP, GOA-MLP, and CMA-ES-MLP were more stable across the sample sizes relative to those derived from the basic MLP without optimization, since they did not fluctuate rapidly like those of the basic MLP.

The researchers concluded that the CMA-ES-MLP is the best model for this study in general because it maintains high rates of classification accuracy, F-measure, precision, and specificity for most sample sizes, and was the second-best performing classifier in terms of execution time. Furthermore, the mean classification metric results revealed that the CMA-ES-MLP algorithm achieved the highest mean accuracy across nine different sample sizes, indicating that it is the most accurate model overall. The CMA-ES optimizer was identified as the most efficient optimization algorithm for an optimum MLP compared with GA and GOA across all samples: it was generally the most accurate optimizer, and its execution time was lower than that of GA-MLP and GOA-MLP and did not increase noticeably as the sample size increased.

Generally, the sample size affects the performance of the MLP because the values of the classification metrics do not remain constant as the sample size changes. However, the results revealed that the values of the accuracy metrics for all the models fluctuated as the sample size increased, and there was no consistent increase or decrease in the classification performance of the algorithms as the sample size increased. On the other hand, the execution times for the GA and GOA optimizers increased as the sample size increased, but the execution time of the basic MLP remained the lowest and was almost constant as the sample size increased. Although CMA-ES had the lowest execution time compared to GOA and GA, it increased slightly when the sample size was at least 5114.

Contribution

This study compared the performance of the basic MLP to MLPs optimized using GA, GOA, and CMA-ES, which has not been done in other studies; therefore, this is a contribution to the literature on MLP and optimization algorithms. Through this study, it is now known that the performance of MLP, GA-MLP, CMA-ES-MLP, and GOA-MLP varies rapidly across the sample sizes, so we cannot generalize that the larger the sample size, the better the model, or vice versa. This novel knowledge extends the literature on ML classifiers, especially MLP. From the execution time results, the change in sample sizes revealed that the basic MLP was the fastest, followed by the CMA-ES-MLP, whereas in the other models, as the sample size increased, the execution time also increased. This implies that the CMA-ES-MLP is not just the most accurate, but also less expensive, and has proven to be more stable in terms of training time as the sample size increases. In other words, the training time for the CMA-ES-MLP is least affected by changes in the dataset, and using it with large datasets is unlikely to affect its training time significantly, as opposed to the GA and GOA. These results contribute novel knowledge about the efficiency of CMA-ES in optimizing the MLP.

The findings of this study also showed that training the MLP and its optimized variants on different samples that are randomly drawn from a larger dataset may aid in identifying the sample that can yield the most accurate classifier, as opposed to training the classifiers using one training dataset. More specifically, the selected model CMA-ES-MLP yielded the highest accuracy (overall classification accuracy, precision, and specificity) when the sample size was 5114, which is less than that of the mother dataset of 6398 observations. The best CMA-ES-MLP identified in this study competes well with classifiers that were the best performers from previous studies using the same dataset. For example, the CMA-ES-MLP that was identified as the best performer in this study has a classification accuracy of 90.18%, which is higher than that of the Meta-cost MLP (77.48%), ²⁹ RF (86.08%), ⁴¹ and DT (87.5%). ³⁰ This comparison does not ignore the fact that in some previous studies, the setting was different from that used in our study. It is recommended that a future study be conducted to compare, under the same setting, the classifiers that were identified as the best from previous studies in Table 2 and the CMA-ES-MLP from this study. The recommendations drawn from this study contribute new possible areas of research around ML classifiers, and the implications of the findings from this study contribute to a novel, accurate, and efficient approach to predicting the likelihood of a potential client subscribing to a term deposit using CMA-ES-MLP.

Ethical considerations

This paper was written using parts of a PhD study whose proposal was presented at the school colloquium, where it received approval. It was subsequently submitted to the School Scientific Committee for approval as well. The proposal was then approved by the North-West University’s Faculty of Economic and Management Sciences Research Scientific Committee (FEMS-REC) on 30 June 2023, with the study classified as minimal risk. The ethics approval number is NWU-00684-22-A4.

Data availability

The data used in this study is a secondary dataset on direct marketing campaigns of a Portuguese banking institution named “Bank Marketing.” The dataset was obtained from the UCI Machine Learning Repository by the Center for Machine Learning and Intelligent Systems. The primary contributors of the data are Moro, Cortez, and Rita. ²⁸ The dataset can be accessed through https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. DOI: 10.24432/C5K306. The researchers drew random samples to mimic different sample sizes so that they could achieve the objective of the study, which is to determine the impact of sample size on the performance of optimisation algorithms for the MLP used in the prediction of client subscription to a term deposit. The dataset is licensed under the CC BY 4.0 license, which allows for its sharing and adaptation for any purpose (which implies that research purposes are included), provided that appropriate credit is given (which is done in this paper in section 1.3.1).

Acknowledgements

The authors of this research acknowledge North-West University (NWU) for availing resources to support this research.

References 1

Mythili

Shanavas

: An Analysis of students’ performance using classification algorithms. IOSR Journal of Computer Engineering. 2014;16(1):63–69. 10.9790/0661-16136369

Tomar

Chaudhari

Barbosa

JLV

: International conference on intelligent computing and smart communication 2019: Proceedings of ICSC 2019. Springer Nature;2020.

Khan

Jan

: Performance of machine learning techniques in protein fold recognition problem. 2010 International Conference on Information Science and Applications. IEEE;2010; pp.1–6.

Stottinger

Hanbury

Sebe

: Sparse color interest points for image retrieval and object categorization. IEEE Trans. Image Process. 2012;21(5):2681–2692. 22294029

10.1109/TIP.2012.2186143

Gulia

Vohra

Rani

: Liver patient classification using intelligent techniques. International Journal of Computer Science and Information Technologies. 2014;5(4):5110–5115.

Shafiq

AlRegib

: Patch-level MLP classification for improved fault detection. SEG Technical Program Expanded Abstracts 2018. Society of Exploration Geophysicists;2018; pp.2211–2215.

Çığşar

Ünal

: Comparison of data mining classification algorithms determining the default risk. Sci. Program. 2019;2019:1–8. 10.1155/2019/8706505

Jamuna

Karpagavalli

Vijaya

: Classification of seed cotton yield based on the growth stages of cotton crop using machine learning techniques. 2010 International Conference on Advances in Computer Engineering. IEEE;2010; pp.312–315.

Harikrishnan

Sethi

Pandey

: Handwritten digit recognition with feed-forward multi-layer perceptron and convolutional neural network architectures. 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA). IEEE;2020; pp.398–402.

Rouse

: Internet of Things (IOT),[ONLINE] Internet-of-Things [Acedido em 23 Junho 2015]. 2014. Reference Source

Hull

: Optimal control theory for applications. Springer Science & Business Media;2013.

Fernández

López

Galar

: Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl.-Based Syst. 2013;42:97–110. 10.1016/j.knosys.2013.01.018

HARIT

: Optimizing Weights And Biases in MLP Using Whale Optimization Algorithm. Durham University;2022.

Abdel-Basset

El-Shahat

El-Henawy

: A new fusion of grey wolf optimizer algorithm with a two-phase mutation for feature selection. Expert Syst. Appl. 2020;139:112824. 10.1016/j.eswa.2019.112824

Bhesdadiya

Jangir

: Training multi-layer perceptron in neural network using whale optimization algorithm. Indian J. Sci. Technol. 2016;9(19):28–36.

Alboaneen

Tianfield

Zhang

: Sentiment analysis via multi-layer perceptron trained by meta-heuristic optimisation. 2017 IEEE International Conference on Big Data (Big Data). IEEE;2017; pp.4630–4635.

Aljarah

Faris

Mirjalili

: Optimizing connection weights in neural networks using the whale optimization algorithm. Soft. Comput. 2018;22:1–15. 10.1007/s00500-016-2442-1

Anderson

Kelley

Maxwell

: Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychol. Sci. 2017;28(11):1547–1562. 28902575

10.1177/0956797617723724

Kyriazos

: Applied psychometrics: sample size and sample power considerations in factor analysis (EFA, CFA) and SEM in general. Psychology. 2018;09(08):2207–2230. 10.4236/psych.2018.98126

Uttley

: Power analysis, sample size, and assessment of statistical assumptions—Improving the evidential value of lighting research. Leukos;2019.

Riley

: Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368. 10.1136/bmj.m441

Gibson

Huisman

: Designing image segmentation studies: statistical power, sample size and reference standard quality. Med. Image Anal. 2017;42:44–59. 28772163

10.1016/j.media.2017.07.004

PMC5666910

Taud

Mas

J-F

: Multilayer perceptron (MLP). Geomatic approaches for modeling land change scenarios. 2018;451–455. 10.1007/978-3-319-60801-3_27

Bisong

: The multilayer perceptron (MLP). Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. 2019;401–405. 10.1007/978-1-4842-4470-8_31

Zare

Pourghasemi

Vafakhah

: Landslide susceptibility mapping at Vaz Watershed (Iran) using an artificial neural network model: a comparison between multilayer perceptron (MLP) and radial basic function (RBF) algorithms. Arab. J. Geosci. 2013;6:2873–2888. 10.1007/s12517-012-0610-x

Fath

Madanifar

Abbasi

: Implementation of multilayer perceptron (MLP) and radial basis function (RBF) neural networks to predict solution gas-oil ratio of crude oil systems. Petroleum. 2020;6(1):80–91. 10.1016/j.petlm.2018.12.002

Mohammadi

Ataei

Kakaei

: Prediction of the production rate of chain saw machine using the multilayer perceptron (MLP) neural network. Civil Engineering Journal. 2018;4(7):1575–1583. 10.28991/cej-0309196

Moro

Cortez

Rita

: A data-driven approach to predict the success of bank telemarketing. Decis. Support. Syst. 2014;62:22–31. 10.1016/j.dss.2014.03.001

Ghatasheh

Faris

AlTaharwa

: Business analytics in telemarketing: Cost-sensitive analysis of bank campaigns using artificial neural networks. Appl. Sci. 2020;10(7):2581. 10.3390/app10072581

Zaki

Khodadadi

Lim

: Predictive Analytics and Machine Learning in Direct Marketing for Anticipating Bank Term Deposit Subscriptions. American Journal of Business and Operations Research. 2024;11(1):79–88. 10.54216/AJBOR.110110

Ghaleb

Mohamad

Fadzli

: E-mail spam classification using grasshopper optimization algorithm and neural networks. Comput., Mater. Continua. 2022;71(3):4749–4766. 10.32604/cmc.2022.020472

Das

Jena

Nayak

: A novel PSO based back propagation learning-MLP (PSO-BP-MLP) for classification. Computational Intelligence in Data Mining-Volume 2: Proceedings of the International Conference on CIDM, 20-21 December 2014. Springer;2015; pp.461–471.

Michira

Rimiru

Mwangi

: Improved multilayer perceptron neural networks weights and biases based on the grasshopper optimization algorithm to predict student performance on ambient learning. Proceedings of the 2023 7th international conference on machine learning and soft computing. 2023; pp.61–68.

Yuan

Moayedi

: The performance of six neural-evolutionary classification techniques combined with multi-layer perception in two-layered cohesive slope stability analysis and failure recognition. Eng. Comput. 2020;36:1705–1714. 10.1007/s00366-019-00791-4

Abdollahi

Keshandehghan

Gardaneh

: Accurate detection of breast cancer metastasis using a hybrid model of artificial intelligence algorithm. Archives of Breast Cancer. 2020;18–24. 10.32768/abc.20207118-24

Mishra

Tripathy

Mallick

: EAGA-MLP—an enhanced and adaptive hybrid classification model for diabetes diagnosis. Sensors. 2020;20(14):4036. 32698547

10.3390/s20144036

PMC7411768

Dweekat

Lam

: Cervical cancer diagnosis using an integrated system of principal component analysis, genetic algorithm, and multilayer perceptron. Healthcare. 2022;10(10):2002.

Zhang

Wang

: Financial distress prediction with a novel diversity-considered GA-MLP ensemble algorithm. Neural. Process. Lett. 2022;54(2):1175–1194. 10.1007/s11063-021-10674-9

Ghaleb

Mohamad

Abdullah

EFHS

: Spam classification based on supervised learning using grasshopper optimization algorithm and artificial neural network. Advances in Cyber Security: Second International Conference, ACeS 2020, Penang, Malaysia, December 8-9, 2020, Revised Selected Papers 2. Springer;2021; pp.420–434.

Moro

Laureano

Cortez

: Using data mining for bank direct marketing: An application of the crisp-dm methodology. 2011.

Asare-Frempong

Jayabalan

: Predicting customer response to bank direct telemarketing campaign. 2017 International Conference on Engineering Technology and Technopreneurship (ICE2T). IEEE;2017; pp.1–4.

Moro

Cortez

Rita

: Using customer lifetime value and neural networks to improve the prediction of bank deposit subscription in telemarketing campaigns. Neural Comput. & Applic. 2015;26:131–139. 10.1007/s00521-014-1703-0

Elsalamony

: Bank direct marketing analysis of data mining techniques. Int. J. Comput. Appl. 2014;85(7):12–22. 10.5120/14852-3218

Ładyżyński

Żbikowski

Gawrysiak

: Direct marketing campaigns in retail banking with the use of deep learning and random forests. Expert Syst. Appl. 2019;134:28–35. 10.1016/j.eswa.2019.05.020

Pavlović

Reljić

Jaćimović

: Application of Data Mining in direct marketing. Industrija. 2014;42(1):189–201. 10.5937/industrija42-5087

Karim

Rahman

: Decision tree and naive bayes algorithm for classification and generation of actionable knowledge for direct marketing. 2013.

Kim

Street

: An intelligent system for customer targeting: a data mining approach. Decis. Support. Syst. 2004;37(2):215–228. 10.1016/S0167-9236(03)00008-3

Fernández

Garcia

Herrera

: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018;61:863–905. 10.1613/jair.1.11192

Karabulut

Ibrikci

: Effective automated prediction of vertebral column pathologies based on logistic model tree with SMOTE preprocessing. J. Med. Syst. 2014;38:1–9. 10.1007/s10916-014-0050-0

Bahaweres

Agustian

Hermadi

: Software defect prediction using neural network based smote. 2020 7th International Conference on Electrical Engineering, Computer Sciences and Informatics (EECSI). IEEE;2020; pp.71–76.

Zhang

: Phishing detection method based on borderline-smote deep belief network. Security, Privacy, and Anonymity in Computation, Communication, and Storage: SpaCCS 2017 International Workshops, Guangzhou, China, December 12-15, 2017, Proceedings 10. Springer;2017; pp.45–53.

Liu

Song

: Research on intrusion detection method based on improved smote and XGBoost. Proceedings of the 8th International Conference on Communication and Network Security. 2018; pp.37–41.

Hussein

Yohannese

: A-SMOTE: A new preprocessing approach for highly imbalanced datasets by improving SMOTE. Int. J. Comput. Intell. Syst. 2019;12(2):1412–1422. 10.2991/ijcis.d.191114.002

Olazaran

: A sociological study of the official history of the perceptrons controversy. Soc. Stud. Sci. 1996;26(3):611–659. 10.1177/030631296026003005

Bishop

Nasrabadi

: Pattern recognition and machine learning. Springer;2006; (no.4).

Gaikwad

Tiwari

Keskar

: Efficient FPGA implementation of multilayer perceptron for real-time human activity classification. IEEE Access. 2019;7:26696–26706. 10.1109/ACCESS.2019.2900084

Yan

Shan

: Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876. 2015.

Kanan

Cottrell

: Color-to-grayscale: does the method matter in image recognition?. PLoS One. 2012;7(1):e29740. 22253768

10.1371/journal.pone.0029740

PMC3254613

Deng

: Automatic speech recognition. Springer;2016.

Parloff

: Why deep learning is suddenly changing your life. Fortune. New York: Time Inc;2016.

Aggarwal

Zhai

: A survey of text classification algorithms. Mining text data. Springer;2012; pp.163–222.

Miner

: Practical text mining and statistical analysis for non-structured text data applications. Academic Press;2012.

Ahishakiye

Taremwa

Omulo

: Crime prediction using decision tree (J48) classification algorithm. International Journal of Computer and Information Technology. 2017;6(3):188–195.

Colak

Yesilbudak

Bayindir

: Daily photovoltaic power prediction enhanced by hybrid GWO-MLP, ALO-MLP and WOA-MLP models using meteorological information. Energies. 2020;13(4):901. 10.3390/en13040901

Nwankpa

Ijomah

Gachagan

: Activation functions: Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378. 2018.

Hansen

Müller

Koumoutsakos

: Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evol. Comput. 2003;11(1):1–18. 12804094

10.1162/106365603321828970

Hansen

: The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772. 2016.

Balakrishnan

Wainwright

: Statistical guarantees for the EM algorithm: From population to sample-based analysis. 2017.

Arsenault

Poulin

Côté

: Comparison of stochastic optimization algorithms in hydrological model calibration. J. Hydrol. Eng. 2014;19(7):1374–1384. 10.1061/(ASCE)HE.1943-5584.0000938

Suominen

Brink

Salmi

: Parameter estimation of complex chemical kinetics with covariance matrix adaptation evolution strategy. Match-Communications in Mathematical and Computer Chemistry. 2012;68(2):469.

Lin

Nielsen

Emtiyaz

: Tractable structured natural-gradient descent using local parameterizations. International Conference on Machine Learning. PMLR;2021; pp.6680–6691.

Burgin

Eberbach

: Evolutionary Turing in the Context of Evolutionary Machines. arXiv preprint arXiv:1304.3762. 2013.

Chernukhin

Zingg

: Multimodality and global optimization in aerodynamic design. AIAA J. 2013;51(6):1342–1354. 10.2514/1.J051835

Bottou

Curtis

Nocedal

: Optimization methods for large-scale machine learning. SIAM Rev. 2018;60(2):223–311. 10.1137/16M1080173

Gálvez

Iglesias

: A new iterative mutually coupled hybrid GA–PSO approach for curve fitting in manufacturing. Appl. Soft Comput. 2013;13(3):1491–1504. 10.1016/j.asoc.2012.05.030

Miller

De Lamare

: Distributed spectrum estimation based on alternating mixed discrete-continuous adaptation. IEEE Signal Processing Letters. 2016;23(4):551–555. 10.1109/LSP.2016.2539328

Hook

: Designing with the body: Somaesthetic interaction design. MIT Press;2018.

Hassanat

Almohammadi

Alkafaween

: Choosing mutation and crossover ratios for genetic algorithms—a review with a new dynamic approach. Information. 2019;10(12):390. 10.3390/info10120390

Hermawanto

: Genetic algorithm for solving simple mathematical equality problem. arXiv preprint arXiv:1308.4675. 2013.

Drezner

: Biologically inspired parent selection in genetic algorithms. Ann. Oper. Res. 2020;287(1):161–183. 10.1007/s10479-019-03343-7

Mazidi

Fakhrahmad

Sadreddini

: A meta-heuristic approach to CVRP problem: local search optimization based on GA and ant colony. 2016.

Mirjalili

: Genetic algorithm. Evolutionary algorithms and neural networks: Theory and applications. 2019;43–55. 10.1007/978-3-319-93025-1_4

Saremi

Mirjalili

Lewis

: Grasshopper optimisation algorithm: theory and application. Adv. Eng. Softw. 2017;105:30–47. 10.1016/j.advengsoft.2017.01.004

Zakeri

Hokmabadi

: Efficient feature selection method using real-valued grasshopper optimization algorithm. Expert Syst. Appl. 2019;119:61–72. 10.1016/j.eswa.2018.10.021

Peng

: A novel meta-matching approach for ontology alignment using grasshopper optimization. Knowl.-Based Syst. 2020;201:106050.

Heidari

Faris

Aljarah

: An efficient hybrid multilayer perceptron neural network with grasshopper optimization. Soft. Comput. 2019;23:7941–7958. 10.1007/s00500-018-3424-2

Mirjalili

Saremi

: Grasshopper optimization algorithm for multi-objective optimization problems. Appl. Intell. 2018;48:805–820. 10.1007/s10489-017-1019-8

Yang

Chen

: Temporal data clustering via weighted clustering ensemble with different representations. IEEE Trans. Knowl. Data Eng. 2010;23(2):307–320. 10.1109/TKDE.2010.112

Gavrilov

Anguelov

Indyk

: Mining the stock market (extended abstract) which measure is best?. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000; pp.487–496.

Lalkhen

McCluskey

: Clinical tests: sensitivity and specificity. Contin. Educ. Anaesth. Crit. Care Pain. 2008;8(6):221–223. 10.1093/bjaceaccp/mkn041

Shepherd

Wheeler

Selbie

: Overseer®: accuracy, precision, error and uncertainty. Accurate and efficient use of nutrients on farms. 2013;1–8.

Powers

: What the F-measure doesn’t measure: Features, Flaws, Fallacies and Fixes. arXiv preprint arXiv:1503.06410. 2015.

Casas

Taheri

Ranjan

: A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems. Futur. Gener. Comput. Syst. 2017;74:168–178. 10.1016/j.future.2015.12.005

10.5256/f1000research.185252.r446950

Reviewer response for version 1

Mohammadagha

Mohsen

1 Referee https://orcid.org/0009-0007-0394-353X 1University of Texas at Arlington, Arlington, Texas, USA

Competing interests: No competing interests were disclosed.

10 1 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approved with reservations

This study compares the performance of three optimization algorithms—Genetic Algorithm (GA), Grasshopper Optimization Algorithm (GOA), and Covariance Matrix Adaptation Evolution Strategy (CMA-ES)—for optimizing Multilayer Perceptron (MLP) neural networks across varying sample sizes. Using a Portuguese banking dataset (4,521 observations) from the UCI repository, the researchers evaluated how sample size affects classification performance when predicting client subscription to term deposits. The study employed SMOTE for data balancing and tested ten different sample sizes (10%-100%). Results identified CMA-ES-MLP as the best overall performer with high accuracy, precision, and specificity, while maintaining competitive execution time.
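The sample-size design summarised above can be illustrated with a minimal sketch. It assumes that each fraction is drawn as a class-stratified random subsample; the manuscript text available here does not state how the subsamples were drawn, so the function `stratified_subsample`, the seed, and the class counts below are hypothetical and only the total of 4,521 rows comes from the study.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, seed=0):
    """Return sorted row indices of a class-stratified sample of the given fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Keep at least one row per class so the minority class never vanishes.
        n_keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, n_keep))
    return sorted(chosen)

# Hypothetical labels: 4,521 rows with ASSUMED class counts (500 positive,
# 4,021 negative). These are illustrative, not the dataset's actual counts.
labels = [1] * 500 + [0] * 4021
fractions = [i / 10 for i in range(1, 11)]  # 10%, 20%, ..., 100%
sizes = {f: len(stratified_subsample(labels, f, seed=42)) for f in fractions}
```

Under this assumption, oversampling such as SMOTE would be applied only to the training portion of each subsample, in line with the abstract's statement that SMOTE was used on the training data.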

1. Is the work clearly and accurately presented and does it cite current literature?

Recommendations: Update literature review with 2023-2025 papers.

2. Is the study design appropriate and is the work technically sound? Yes

3. Are sufficient details of methods and analysis provided to allow replication by others?

Recommended: provide code availability, random seeds, complete hyperparameter specifications, software versions, and the MLP architecture.

4. If applicable, is the statistical analysis and its interpretation appropriate?

Conduct runs with statistical tests (e.g., the Friedman test or another suitable method) and report confidence intervals.

5. Are all the source data underlying the results available to ensure full reproducibility? Yes

6. Are the conclusions drawn adequately supported by the results?

Add confidence intervals and statistical tests (e.g., McNemar's test, Friedman test with post-hoc analysis).
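As an illustration of the kind of paired comparison meant here, the following is a minimal sketch of a two-sided exact McNemar test for two classifiers evaluated on the same test cases. The function name `mcnemar_exact` and the ten labels and predictions are invented for the example and are not taken from the study.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions from two classifiers."""
    # b: A correct and B wrong; c: A wrong and B correct (the discordant pairs).
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Synthetic toy data (NOT study results): model A errs once, model B five times.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
pred_b = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

The test uses only the cases on which the two models disagree, which is why it suits comparisons made on a single shared test set.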

Citation Format Issues: The manuscript contains improper in-text citation formatting that creates grammatically incomplete sentences. Specifically:

"According to,14 there are several..." is missing the author name(s) before the superscript citation.

"in the study by,15,16 and it has been..." similarly omits the required author name(s).

Corrections needed throughout the manuscript:

Use narrative citations, for example: “According to Abdel-Basset et al. [14], there are several optimization algorithms.”

I provide a reference that gives an example of statistical evaluation in machine learning research.

(Reference 1)

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Reviewer Expertise:

Civil Engineering and Computer Science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References 1

: Evaluating machine learning performance using python for neural network models in urban transportation in New York city case study. Journal of Economy and Technology. 2026;4:266–283. 10.1016/j.ject.2025.11.001

Montshiwa

Tlhalitshi

Business Statistics, North West University Faculty of Economic and Management Sciences, Mahikeng, North West, South Africa

Competing interests: None

23 1 2026

Comment 1: Update literature review with 2023-2025 papers.

Response 1: Thank you for the comment. Kindly note that this article is derived from a PhD study conducted in 2020, which informed the initial scope and selection of the literature. Removing references published before 2023 may misalign the article with the theoretical framework and objectives of the study.

Comment 2: Code availability, Random seeds, Incomplete hyperparameter specifications, Software versions, MLP architecture.

Response 2: Thank you for this comment. The code used for the analysis is available upon request. It includes detailed information on the random seed settings, hyperparameter specifications, software versions, and the MLP architecture used in the study. This statement is included in the data availability section of the manuscript.
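One way such details could be recorded alongside the code is a small run manifest, sketched below. The function `build_run_manifest` and every value passed to it (seed, population size, hidden layer size, activation) are placeholders chosen for illustration and are not the study's actual settings.

```python
import json
import platform
import random
import sys

def build_run_manifest(seed, hyperparameters, architecture):
    """Collect the settings a reader would need to rerun an experiment."""
    random.seed(seed)  # fixes the stdlib RNG only; other libraries need their own seeding
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "hyperparameters": hyperparameters,
        "mlp_architecture": architecture,
    }

# Placeholder values for illustration only (ASSUMED, not taken from the study).
manifest = build_run_manifest(
    seed=2020,
    hyperparameters={"optimizer": "CMA-ES", "population_size": 20, "max_generations": 100},
    architecture={"hidden_layers": [10], "activation": "sigmoid"},
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Writing `manifest_json` to a file next to the results would let a reader check the seed, software version, and network layout without having to request the code.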

Comment 3: It can be suggested to conduct runs with statistical tests (e.g., Friedman or any other methods) and report confidence intervals.

Response 3: This is a valuable suggestion, and the inclusion of confidence intervals as well as statistical comparison tests such as the Friedman and McNemar tests would enhance the robustness of the analysis. Due to the limited scope and objectives of the current study, these analyses were not implemented at this stage. However, they will be incorporated in future work to provide a more comprehensive statistical comparison of the models.
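For reference, a confidence interval for a single accuracy estimate can be obtained with very little code. The sketch below uses the Wilson score interval, which is one common choice and not necessarily the one the authors or the reviewer have in mind; the counts in the example are hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical counts (NOT study results): 800 of 905 test cases correct.
low, high = wilson_interval(800, 905)
```

The interval narrows as the test set grows, so reporting it for each sample size would show how much of the observed difference between sample sizes could be due to sampling noise.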

Comment 4: It is recommended to add confidence intervals, statistical tests (e.g., McNemar's test, Friedman test with post-hoc analysis)

Response 4: This is a valuable suggestion, and the inclusion of confidence intervals and formal statistical comparison tests such as McNemar’s and Friedman tests would indeed strengthen the robustness of the analysis. However, given the limited scope and objectives of the current study, the focus was placed on comparative predictive performance using standard evaluation metrics. Incorporating additional inferential statistical tests is therefore left as a potential extension for future work.
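A minimal sketch of the Friedman test for three models compared across several sample sizes is given below. It is written for exactly three models so that the chi-square p-value has a closed form, and it omits the tie correction; the helper names and all scores are hypothetical and are not results from the study.

```python
from math import exp

def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_three(scores):
    """Friedman chi-square (no tie correction) for rows of three paired scores."""
    n, k = len(scores), 3
    if any(len(row) != k for row in scores):
        raise ValueError("each row must hold three scores")
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    # With k = 3 the reference chi-square has 2 degrees of freedom,
    # whose survival function is exactly exp(-q / 2).
    return q, exp(-q / 2), rank_sums

# Synthetic accuracies (NOT study results): one row per sample size,
# columns are three hypothetical models A, B, and C.
scores = [
    [0.71, 0.74, 0.78],
    [0.73, 0.75, 0.80],
    [0.74, 0.77, 0.81],
    [0.76, 0.78, 0.83],
    [0.77, 0.80, 0.84],
]
q_stat, p_value, rank_sums = friedman_three(scores)
```

For real data with ties or more than three models, a library routine such as `scipy.stats.friedmanchisquare` would be the safer choice, followed by a post-hoc comparison if the overall test is significant.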