Prediction of PM2.5 concentrations in Malaysia using machine learning techniques: a review [version 1; peer review: 1 approved, 1 approved with reservations]

Particulate matter (PM), an air pollutant that is detrimental to breathing, is either emitted directly or formed in the ambient air. Exposure of the respiratory system to PM2.5, fine particles with a diameter of 2.5 micrometres or less, causes health complications. Thus, developing pollution control strategies requires the prediction of PM2.5 concentrations. With advances in technology and computer science, machine learning (ML) algorithms are now used for highly accurate prediction of air pollutant concentrations. Recently, air quality in the smart cities of Malaysia has been getting worse due to industrialization, emissions from private motor vehicles, and transboundary haze pollution. Therefore, forecasting PM2.5 levels to ensure they stay within the statutory limits has become necessary. Several machine learning methods have been implemented in existing research to predict air pollutant concentrations, including PM2.5. However, very few studies have used ML techniques to predict air quality in Malaysia when compared with global studies. Hence, to raise awareness of ML techniques and promote further research in this area, this study reviews and highlights most of the existing ML techniques for the prediction of PM2.5.


Introduction
Air quality is important for human health, crops, vegetation, and aesthetic considerations such as visibility. Air pollution, in which the air is contaminated with a variety of dirt and chemicals, is detrimental to breathing and can cause a wide variety of health problems. Polluted air is a combination of hazardous substances from both natural and human-made sources. It was estimated in 2016 that outdoor air pollution in rural and urban areas caused 4.2 million premature deaths worldwide annually; this mortality is attributed to PM exposure, which causes cancers and respiratory and cardiovascular diseases [https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health].
A complex mixture of ultrafine particles and tiny vaporized molecules or liquid droplets is known as particulate matter (PM) [https://www.epa.gov/pm-pollution/particulate-matter-pm-basics]. PM is categorized according to particle size. PM2.5 has a diameter of 2.5 microns or less. It is inhalable and can travel deep into the body and deposit in the alveoli, eventually passing into the bloodstream, where it may cause cardiovascular diseases [https://blog.breezometer.com/what-is-particulate-matter].
The health impact of exposure to PM2.5 is severe and affects people in different age groups [1]. For urban workers exposed to PM2.5, the number of coughs counted indicates how badly their respiratory system is endangered. In a review, Usmani [2] noted that environmental indicators and PM2.5 have a great impact on Malaysian health services via infant mortality rate, fertility rate, and life expectancy. He also concluded that outdoor and indoor air quality can affect the health of children who attend school every day.
Traffic pollution and industrialization are the root sources of PM2.5 emissions. In Malaysia, the main sources of PM2.5 are emissions from industrial growth, motor vehicles, and, recently, transboundary haze pollution. Traffic-related air pollution (TRAP) results from motor vehicle emissions [1-3]. S N Brohi et al. [4] concluded that industries were the main contributors to PM in Malaysia, accounting for 32%. Haze incidents have been a serious problem in Malaysia for decades. Jaafar et al. [5] concluded that PM2.5 is presumed to be one of the most critical health hazards and should be continuously monitored during haze episodes.
Several empirical studies have identified air quality as a major concern in smart cities. The non-linear behaviour of air pollutants, combined with other significant regional factors, results in a highly complex system of air pollutant generation. According to research, traditional deterministic models have difficulty capturing the non-linearity between air contaminants and their emission and dispersal sources. As a result, to capture non-linear trends in air pollution models and mitigate the impact of PM2.5, reliable and widely used ML approaches based on statistical algorithms should be considered.
This literature review is structured as follows. After the justification for the prediction of PM2.5, the steps followed in reviewing the ML techniques are described. The results and discussion of each ML approach used to predict PM2.5 concentrations are presented next. Finally, the conclusions discuss the use of ML for PM2.5 prediction.

Methods
The review of ML approaches comprised the following phases. First, related Scopus-indexed papers were retrieved using the keyword combination {'Particulate Matter'} AND {'PM2.5'} AND {'Prediction'} AND {'Machine Learning'}. The publication period was limited to 2017-2021, and the search was limited to journal articles and conference proceedings published in English. This search returned 284 documents.
Next, the studies were screened by title and abstract. Biological studies, social studies, and investigations into the relationship between PM2.5 and other air contaminants, among other topics, were excluded, which reduced the number of documents to 36. Finally, after the full text was reviewed, the papers that were unanimously deemed out of scope were excluded. As a result, 20 manuscripts were chosen for further examination.
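The search criteria described above can be sketched as a simple filter over bibliographic records. This is only an illustration under stated assumptions: the Record type, its field names, and the three sample records are hypothetical, and matching keywords against the title and abstract is a simplification of how a bibliographic database evaluates a query.

```python
from dataclasses import dataclass

# Search terms quoted in the Methods section; all must match (logical AND).
KEYWORDS = ("particulate matter", "pm2.5", "prediction", "machine learning")
YEARS = range(2017, 2022)              # 2017-2021 inclusive
DOC_TYPES = {"journal", "conference"}  # journal articles and proceedings


@dataclass
class Record:
    """Hypothetical bibliographic record; the field names are assumptions."""
    title: str
    abstract: str
    year: int
    doc_type: str
    language: str


def matches_search(rec: Record) -> bool:
    """Return True if the record satisfies every stated search criterion."""
    text = f"{rec.title} {rec.abstract}".lower()
    return (
        all(keyword in text for keyword in KEYWORDS)
        and rec.year in YEARS
        and rec.doc_type in DOC_TYPES
        and rec.language == "English"
    )


# Made-up records, used only to exercise the filter.
records = [
    Record("Machine learning prediction of PM2.5",
           "We model particulate matter in an urban area.",
           2019, "journal", "English"),
    Record("PM2.5 and asthma",
           "A clinical study of particulate matter.",
           2020, "journal", "English"),
    Record("Machine learning prediction of PM2.5",
           "Particulate matter forecasting.",
           2015, "conference", "English"),
]
selected = [rec for rec in records if matches_search(rec)]
print(len(selected))
```

The later screening stages (title and abstract screening, then full-text review) rely on reviewer judgement and are not modelled here.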
The articles were reviewed on five aspects. The first is the type of ML that was utilized. The second is the method proposed by the researchers. The third is the study's location and the dataset's characteristics. The fourth and fifth are the evaluation method, with performance measures such as root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R2), and the performance of the algorithms under consideration.
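For readers unfamiliar with these performance measures, the sketch below computes MSE, RMSE, MAE, and R2 from paired observed and predicted values using their standard definitions. The function name and the sample values are illustrative assumptions and do not come from any of the reviewed studies.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for paired observed/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Made-up PM2.5 concentrations, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Lower MSE, RMSE, and MAE indicate smaller prediction errors, while an R2 closer to 1 indicates that the model explains more of the variance in the observations.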

Results and discussion
The findings of the review are explained in this section. To begin, the findings are reported in terms of the number of studies that used machine learning approaches to forecast PM2.5 levels. Figure 3 shows the number of conference and journal articles published in the last five years on PM2.5 prediction. The number of studies has been increasing in recent years, and if this trend continues, a higher number of documents might be expected by the end of 2021.
Next, the number of studies using machine learning algorithms has increased significantly; however, this increase has not been evenly distributed globally. Figure 4 shows that the number of published studies in Eurasia and North America is significantly higher than elsewhere. While China (146 studies) and the US (109 studies) have the most published research, Malaysia has only three studies.
According to the review, supervised machine learning techniques such as neural networks (NN), deep learning (DL), and decision trees (DT) were commonly employed. DL and NN account for the largest percentages, followed by DT, regression, and support vector machine (SVM). The full descriptions of the selected papers are grouped into the following categories.

Category 1: deep learning
For the prediction of PM2.5, Zhang et al. [9] suggested a DL model including an auto-encoder (AE) and a bidirectional long short-term memory network (Bi-LSTM). The proposed method incorporates data preprocessing to improve prediction accuracy, an AE layer to extract implicit features and increase training efficiency, and a Bi-LSTM layer to make the prediction. The results indicate that the proposed model predicted better, with a lower RMSE and a higher R2, and that a positive correlation exists.
Liu et al. [10] proposed a hybrid ensemble model using a deep belief network (DBN), LSTM, and multilayer perceptron (MLP). To reduce the model's complexity, complementary ensemble empirical mode decomposition (CEEMD) is used to extract features from the data series. To produce the best forecast, the imperialist competitive algorithm (ICA) is used to adjust the weights of the predictors. Two sets of hourly PM2.5 concentrations from Shanghai are used to validate the model. According to the experimental findings, the proposed model performed better in terms of accuracy and resilience.
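The idea of weighting several predictors to form one ensemble forecast can be sketched as follows. This is not the method of the cited study: an exhaustive grid search stands in for the competition-based optimizer, and the three member forecasts are made-up numbers rather than outputs of trained DBN, LSTM, or MLP models.

```python
import math
from itertools import product


def rmse(y_true, y_pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def combine(forecasts, weights):
    """Weighted sum of the member forecasts at each time step."""
    return [
        sum(w * f[i] for w, f in zip(weights, forecasts))
        for i in range(len(forecasts[0]))
    ]


def search_weights(y_true, forecasts, steps=10):
    """Grid search over non-negative weights that sum to 1.

    Assumption: this simple search replaces the optimizer used in the
    cited work; it only illustrates what "adjusting the weights" means.
    """
    best_w, best_err = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(forecasts)):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        w = [c / steps for c in combo]
        err = rmse(y_true, combine(forecasts, w))
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Made-up hourly PM2.5 values and three hypothetical member forecasts.
observed = [35.0, 42.0, 50.0, 47.0, 39.0]
member_forecasts = [
    [33.0, 40.0, 52.0, 49.0, 37.0],
    [38.0, 45.0, 47.0, 44.0, 42.0],
    [30.0, 39.0, 55.0, 50.0, 35.0],
]
weights, ensemble_rmse = search_weights(observed, member_forecasts)
print(weights, ensemble_rmse)
```

Because the grid includes the case where one member receives all the weight, the selected combination is never worse on this data than the best single member. In practice the weights would be tuned on held-out data.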

Category 2: neural network
A neural network classifies or predicts from input observations by passing weighted linear combinations of the inputs through non-linear activation functions. To reduce difficulties and errors, Jiang et al. [11] presented an extreme learning machine (ELM) optimised by the group teaching optimization algorithm (GTOA) and built on a data-preprocessing strategy. Two-step decomposition is used to break down the PM concentration data into high-frequency intrinsic mode functions (IMFs). GTOA is then used to optimise the ELM. The data cover hourly PM2.5 levels in Beijing over a 16-month period. The findings showed that the proposed model improved prediction accuracy while maintaining a low MAE and RMSE. Hung et al. [15] conducted research to determine how transported smoke aerosols affect air quality in New York State. The method used an artificial neural network (ANN), and five models with distinct sets of predictors were considered for the summer seasons of 2012-2019. When the models were evaluated using RMSE and R2, the results revealed that smoke cases had higher average PM2.5 concentrations than non-smoke cases.
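To make the ELM idea concrete, the following is a minimal sketch of a regularised extreme learning machine: the hidden-layer weights are drawn at random and left fixed, and only the output weights are solved by least squares. It assumes a small ridge term for numerical stability, omits the decomposition and GTOA steps of the cited work, and uses synthetic data; the class and function names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


class ExtremeLearningMachine:
    """Single hidden layer with random, untrained input weights."""

    def __init__(self, n_inputs, n_hidden=10, ridge=1e-6, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.ridge = ridge
        self.beta = [0.0] * n_hidden

    def _hidden(self, x):
        """Sigmoid activations of the hidden units for one input vector."""
        return [
            1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for w, b in zip(self.w, self.b)
        ]

    def fit(self, xs, ys):
        """Solve (H^T H + ridge * I) beta = H^T y for the output weights."""
        h = [self._hidden(x) for x in xs]
        k = len(self.b)
        gram = [[sum(row[i] * row[j] for row in h)
                 + (self.ridge if i == j else 0.0)
                 for j in range(k)] for i in range(k)]
        rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(k)]
        self.beta = solve(gram, rhs)
        return self

    def predict(self, xs):
        return [sum(bi * hi for bi, hi in zip(self.beta, self._hidden(x)))
                for x in xs]


# Synthetic, scaled data: one input feature and a linear target.
xs = [[i / 10] for i in range(11)]
ys = [2.0 * x[0] + 1.0 for x in xs]
model = ExtremeLearningMachine(n_inputs=1).fit(xs, ys)
preds = model.predict(xs)
```

Because only the output weights are fitted, training reduces to a single linear solve. An optimiser such as GTOA would then be used to tune parts of the model that this sketch leaves random.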

Category 3: decision tree
In decision analysis, a decision tree can be used to depict decisions and decision making visually and explicitly. Angelin Jebamalar et al. [16] created a method that combined light gradient boosting and decision tree techniques. The model grows the tree leaf by leaf using the best fit, allowing it to handle massive amounts of data with little memory and great speed. From 2017 to 2019, PM2.5 data for Chennai, India, were collected using IoT devices and stored in the cloud. When compared to random forest (RF), DT, and regression approaches, the suggested model performed best, with the lowest MAE and RMSE values.
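Leaf-by-leaf growth can be illustrated with a small regression tree that, at every step, splits whichever leaf offers the largest reduction in squared error. The sketch assumes a single input feature and made-up data, and it leaves out gradient boosting entirely; it is not the implementation used in the cited study.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(points):
    """Best threshold for one leaf: returns (gain, threshold) or (0.0, None)."""
    points = sorted(points)
    ys = [y for _, y in points]
    parent = sse(ys)
    best_gain, best_thr = 0.0, None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical feature values
        gain = parent - sse(ys[:i]) - sse(ys[i:])
        if gain > best_gain:
            best_gain = gain
            best_thr = (points[i - 1][0] + points[i][0]) / 2
    return best_gain, best_thr


def grow_leaf_wise(points, max_leaves=3):
    """Repeatedly split the leaf with the largest error reduction."""
    leaves = [list(points)]
    while len(leaves) < max_leaves:
        candidates = [(best_split(leaf), idx) for idx, leaf in enumerate(leaves)]
        (gain, thr), idx = max(candidates, key=lambda c: c[0][0])
        if thr is None:
            break  # no leaf can be improved further
        leaf = leaves.pop(idx)
        leaves.append([p for p in leaf if p[0] <= thr])
        leaves.append([p for p in leaf if p[0] > thr])
    return leaves


# Made-up (hour, PM2.5) pairs with three visible concentration regimes.
data = [(1, 10.0), (2, 11.0), (3, 12.0), (10, 50.0),
        (11, 52.0), (12, 51.0), (20, 30.0), (21, 29.0)]
leaves = grow_leaf_wise(data, max_leaves=3)
for leaf in leaves:
    print(sorted(leaf))
```

A level-wise strategy would instead split every leaf at the current depth before moving deeper. Leaf-wise growth spends its limited number of leaves where the error reduction is largest.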
Another study [17] focused on PM2.5 predictions from environmental sensor data streams using a stacked boosting ensemble (STBoost) model with z-score optimization techniques. STBoost uses LightGBM as the meta-regressor and base regressors such as XGBoost, GBM regression, and RF to improve prediction accuracy. The results indicate that the STBoost model outperformed the compared models, with an accuracy score of 99.52 and an RMSE value of 0.1048.
To improve PM2.5 estimation, Luo et al. [18] used an image-based technique that combined a convolutional neural network (CNN) and GBM. Daily weather conditions, 6976 pictures, and hourly data from Shanghai collected in 2016 were used to build the model. With the proposed technique, the MAE, RMSE, and R2 of the PM2.5 estimates are 3.56, 10.02, and 0.85, respectively.
Using a 1.5-year dataset from Malaysia, the study in [19] examined the performance of MLP and RF models in predicting PM2.5. A confusion matrix was utilised as the performance metric. Overall, RF outperformed MLP.
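A confusion matrix applies when PM2.5 levels are predicted as categories rather than as continuous values. The sketch below builds one from paired actual and predicted labels and derives the overall accuracy. The three category names and the sample labels are assumptions made for illustration; the classes used in the cited study are not stated here.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical air quality categories and made-up labels.
labels = ["good", "moderate", "unhealthy"]
actual = ["good", "good", "moderate", "moderate",
          "moderate", "unhealthy", "unhealthy", "good"]
predicted = ["good", "moderate", "moderate", "moderate",
             "unhealthy", "unhealthy", "unhealthy", "good"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(accuracy(matrix))
```

Off-diagonal cells show which categories a model confuses, which a single accuracy figure hides.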

Category 4: regression
Regression models were used in a number of articles to predict particulate matter. Kleine Deters et al. [20] present a machine learning technique for predicting PM2.5 based on pollution and meteorological data from two Ecuadorian cities over a six-year period. Using CGM in regression analysis, they found that PM2.5 could be predicted more accurately during extreme weather.
Kowalski and Warchałowski [21] provide a comparison of machine learning techniques for predicting dust-type air pollution levels. Real-time hourly data from Krakow, collected over the course of a year, were used to train and test the models, which were evaluated using MSE and R2. According to the findings, the best prediction approach was a regression model. Gu et al. [22] introduced a recurrent air quality prediction (RAQP) model, which combines a recurrent framework and support vector regression (SVR). The model was tested on 180 hourly records from a small Chinese town. With a low RMSE value, the RAQP model was found to be more successful than state-of-the-art air quality predictors.
Aljuaid et al. [23] compared numerous forecasting approaches and strategies based on mathematics and machine learning. The one-hour and five-minute datasets were created by combining and manipulating numerous sources of Danish data. For comparison, the researchers utilised multivariate (SVR, DT, and K-nearest neighbour) and univariate (autoregression) techniques. According to the findings, SVR had the lowest RMSE and MAE values for the one-hour dataset, and autoregression had the lowest for the five-minute dataset.
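As an illustration of the univariate approach, the sketch below fits a first-order autoregressive model, x[t] = c + phi * x[t-1], by ordinary least squares and produces multi-step forecasts. The model order, the function names, and the synthetic series are assumptions; the order and fitting procedure used in the cited comparison are not given here.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next)
              for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


def forecast(series, c, phi, steps):
    """Iterate the fitted recursion forward from the last observation."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


# Synthetic series generated exactly by x[t] = 5 + 0.8 * x[t-1].
series = [10.0, 13.0, 15.4, 17.32, 18.856]
c, phi = fit_ar1(series)
next_values = forecast(series, c, phi, steps=3)
print(c, phi, next_values)
```

A univariate model of this kind uses only past PM2.5 values, whereas the multivariate techniques also draw on other measured variables.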

Conclusions
This study reviewed 20 scientific papers that focused on machine learning algorithms for PM2.5 prediction. The use of machine learning to forecast PM2.5 has grown significantly in the previous five years, although only three research articles have been published in Malaysia. There are also several international research programmes that concentrate on more than one type of particulate matter. For predicting PM2.5, the ML techniques DL, NN, and DT are often utilised. Overall, it appears that supervised machine learning algorithms are employed to predict air pollution, notably PM2.5. The review concludes that while there have been few studies in Malaysia using machine learning to forecast PM2.5, the field can be expanded and the accuracy improved, as has been done globally using supervised machine learning methodologies.

Data availability
No data are associated with this article.

Author contributions
Palanichamy Naveen conceived the work, drafted the article, and revised the final version. Kuhaneswaran and Rishanti performed data collection, data analysis, and interpretation under the guidance of S Subramanian and their supervisors Su-Cheng Haw and Palanichamy Naveen. Palanichamy Naveen is the corresponding author for this paper.