Prediction of PM2.5 concentrations in Malaysia using machine learning techniques: a review

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 14 Dec 2021

This article is included in the Research Synergy Foundation gateway.

Abstract

Particulate matter (PM), an air pollutant that is detrimental to breathing, is either emitted directly or formed in the ambient air. Exposure of the respiratory system to PM2.5, fine particles with a diameter of 2.5 micrometres or less, causes health complications. Thus, developing pollution control strategies requires the prediction of PM2.5 concentrations. With advances in technology and computer science, machine learning (ML) algorithms are now used for highly accurate prediction of air pollutant concentrations. Recently, air quality in the smart cities of Malaysia has been getting worse due to industrialization, emissions from private motor vehicles, and transboundary haze pollution. Therefore, forecasting PM2.5 emissions to ensure they stay within the statutory limits has become necessary. Several ML methods have been implemented in existing research to predict air pollutant concentrations, including PM2.5. However, compared with global studies, very few studies have used ML techniques to predict air quality in Malaysia. Hence, to raise awareness of ML techniques and promote further research in this area, this study reviews and highlights most of the existing ML techniques for the prediction of PM2.5.

Keywords

Air Pollution, Particulate Matter (PM2.5), Neural Network, Deep Learning, Decision Tree

Introduction

Air quality is important for human health, crops, vegetation, and aesthetic considerations such as visibility. Air pollution, in which the air is contaminated with a variety of particles and chemicals, is detrimental to breathing and can cause a wide variety of health problems. Polluted air is a combination of hazardous substances from both natural and human-made sources. Outdoor air pollution in rural and urban areas was estimated to cause 4.2 million premature deaths worldwide per year in 2016; this mortality is attributed to PM exposure, which causes cancers and respiratory and cardiovascular diseases [https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health].

Particulate matter (PM) is a complex mixture of solid particles and liquid droplets suspended in the air [https://www.epa.gov/pm-pollution/particulate-matter-pm-basics]. PM is categorized by particle size. PM2.5 has a diameter of 2.5 microns or less. It is inhalable, can travel deep into the body, and deposits in the alveoli, eventually passing into the bloodstream. Once in the bloodstream, it may cause cardiovascular diseases [https://blog.breezometer.com/what-is-particulate-matter].

The health impact of exposure to PM2.5 is severe and affects people of all age groups.1 The number of coughs recorded among urban workers exposed to PM2.5 indicates how badly their respiratory systems are affected. In a review, Usmani2 noted that environmental indicators and PM2.5 have a great impact on Malaysian health services via infant mortality rate, fertility rate, and life expectancy. The review also concluded that outdoor and indoor air quality can affect the health of school children.

Traffic pollution and industrialization are the main sources of PM2.5 emissions. In Malaysia, the main sources of PM2.5 are emissions from industrial growth, motor vehicles, and, recently, transboundary haze pollution. Traffic-related air pollution (TRAP) results from motor vehicle emissions.13 S N Brohi et al.4 concluded that industries were the main contributors to PM in Malaysia, accounting for 32%. Haze incidents have been a crucial problem in Malaysia for decades. Jaafar et al.5 concluded that PM2.5 is presumed to be one of the most critical health hazards and should be continuously monitored during haze episodes.

Several empirical studies have identified air quality as a major concern in smart cities. The non-linear behaviour of air pollutants, combined with other significant regional factors, results in a highly complex system of air pollutant generation. Traditional deterministic models have difficulty capturing the non-linear relationship between air contaminants and their emission and dispersal sources. To capture these non-linear trends in air pollution models and mitigate the impact of PM2.5, reliable and widely used ML approaches based on statistical algorithms should therefore be considered.

This literature review is structured as follows. After the justification for the prediction of PM2.5, the steps involved in each ML technique are discussed first. The results and discussion of each ML approach used to predict PM2.5 concentrations are presented next. Finally, the conclusions discuss the use of ML for PM2.5 prediction.

Methods

The review of ML approaches comprised the following phases. First, related Scopus-indexed papers were retrieved using the keyword combination {‘Particulate Matter’} AND {‘PM2.5’} AND {‘Prediction’} AND {‘Machine Learning’}. The publication period was limited to 2017–2021, and the search was limited to journal articles and conference proceedings published in English. This search returned 284 documents.

Next, the studies were screened by title and abstract. Biological studies, social studies, and investigations into the relationship between PM2.5 and other air contaminants, among other topics, were omitted. This reduced the number of documents to 36. Finally, after the full text was reviewed, papers that were unanimously deemed out of scope were excluded. As a result, 20 manuscripts were chosen for further examination.

The articles were reviewed on five aspects. The first is the type of ML that was used. The second is the researchers' method. The third is the study's location and the dataset's characteristics. The fourth and fifth are the evaluation method, including performance measures such as root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R2), and the performance of the algorithms under consideration.
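The four performance measures named above can be written out in a few lines. The following is a minimal sketch in plain Python, not code from any of the reviewed studies; the function name `regression_metrics` and the sample PM2.5 values are made up for illustration.

```python
import math

def regression_metrics(observed, predicted):
    """Return MSE, RMSE, MAE, and R2 for paired observed/predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    # R2 is undefined when the observations have no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Hypothetical PM2.5 values in micrograms per cubic metre, for illustration only.
observed = [12.0, 18.0, 25.0, 30.0]
predicted = [14.0, 16.0, 25.0, 33.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Lower MSE, RMSE, and MAE indicate smaller errors, while an R2 closer to 1 indicates that the predictions explain more of the variance in the observations.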

Results and discussion

The findings of the review are explained in this section. First, we consider the number of studies that used machine learning approaches to forecast PM2.5 levels. Figure 3 shows the number of conference and journal articles on PM2.5 prediction produced in the last five years. The number of studies has been increasing in recent years, and if this trend continues, a higher number of documents might be expected by the end of 2021.

Next, although the number of studies using machine learning algorithms has increased significantly, this increase has not been evenly distributed globally. Figure 4 shows that the number of published studies is significantly higher in Eurasia and North America. While China (146 studies) and the US (109 studies) have the most published research, Malaysia has only three studies.

According to the review, supervised machine learning techniques such as neural networks (NN), deep learning (DL), and decision trees (DT) were commonly employed. DL and NN account for the largest percentages, followed by DT, regression, and support vector machine (SVM). The selected papers are described in full under the following categories.

Category 1: deep learning

Artificial neural networks, which are algorithms inspired by the structure and function of the brain, are used in deep learning. The papers that fall within this category are listed below.

Shahriar et al.6 conducted a study in three Bangladeshi air pollution hotspots to evaluate the effectiveness of two hybrid models for predicting daily PM2.5 concentrations in terms of computational efficiency and accuracy. The two hybrid models, Autoregressive Integrated Moving Average (ARIMA)-Artificial Neural Network (ANN) and ARIMA-SVM, were compared with DT and CatBoost. With lower MAE and RMSE and a better R2 value, the CatBoost and ARIMA-ANN models fared well.

By comparing four prediction models, Yang et al.7 sought to predict PM several days in advance at 39 stations in Seoul. The two models proposed and compared combined a convolutional neural network (CNN) with a gated recurrent unit (CNN-GRU) and with long short-term memory (CNN-LSTM). The CNN-LSTM model provided reliable predictions by capturing hidden patterns, with low RMSE and MAE values. In Ref. 8, a deep neural network (DNN) based hybrid model, DNN-LSTM, was proposed. The DNN-LSTM model outperformed the multiple additive regression trees (MART) and deep feedforward neural network (DFNN) models, with the highest R2 and the lowest MAE and RMSE values for 48-h predictions.

For the prediction of PM2.5, Zhang et al.9 suggested a DL model that includes an auto-encoder (AE) and a bidirectional LSTM (Bi-LSTM). The proposed method incorporates data preprocessing to improve prediction accuracy, an AE layer to extract implicit features and increase training efficiency, and a Bi-LSTM layer to make the prediction. The results indicate that the proposed model predicted better, with a lower RMSE and a higher R2, and that a positive correlation exists.

Liu et al.10 proposed a hybrid ensemble model using a deep belief network (DBN), LSTM, and multilayer perceptron (MLP). To reduce the model's complexity, it uses complementary ensemble empirical mode decomposition (CEEMD) to extract features from the data series. To produce the best forecast, the imperialist competitive algorithm (ICA) is used to adjust the weights of the predictors. Two sets of hourly PM2.5 concentrations from Shanghai were used to validate the model. According to the experimental findings, the proposed model performed better in terms of accuracy and robustness.

Category 2: neural network

The neural network is based on the idea of classifying input observations by forming weighted linear combinations of the inputs. To reduce prediction difficulty and error, Jiang et al.11 presented an extreme learning machine (ELM) method optimized by the group teaching optimization algorithm (GTOA) and based on a data-preprocessing strategy. Two-step decomposition is used to break down PM concentration data into high-frequency intrinsic mode functions (IMFs). GTOA is then used to optimise the ELM. The data cover hourly PM2.5 levels in Beijing over a 16-month period. The findings showed that the proposed model improved prediction accuracy while maintaining a low MAE and RMSE.

With daily records from cities in Finland and Brazil, Neto et al.12 evaluated single and neural-based ensemble techniques. The values of MSE, mean absolute percentage error (MAPE), MAE, and RMSE were compared. The MLP ensemble method had the best overall performance. Suleiman et al.13 examined the performance of boosted regression trees (BRT), artificial neural network (ANN), and support vector machine (SVM) models in forecasting roadside PM2.5 concentrations at nineteen monitoring sites in London. RMSE and other metrics were used for performance evaluation. In general, ANN performed better.
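MAPE, one of the measures compared above, expresses the average error as a percentage of the observed value. The following is a minimal sketch in plain Python, assuming non-zero observations; the function name `mape` and the sample values are hypothetical and are not taken from the study.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        # Division by an observed value of zero is undefined.
        raise ValueError("MAPE is undefined when an observed value is zero")
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical daily PM2.5 values, for illustration only.
value = mape([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
print(round(value, 2))
```

Because MAPE is relative, it allows comparison across cities with different pollution levels, but it becomes unstable when observed concentrations are close to zero.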

Ali Shah et al.14 assessed the performance of the Phase Space Reconstruction (PSR) technique, which captures multi-time scale information, using radial and linear Support Vector Regression (SVR), Feedforward Neural Network (FFNN), and Random Forest (RF). RMSE and MAE were used to assess the performance of these ML techniques on a dataset collected in Saudi Arabia (Masfalah) over a 21-month period. The results showed that FFNN produced a reliable prediction.
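Phase space reconstruction is commonly implemented as a time-delay embedding, in which each model input is a vector of lagged values of the series. The sketch below illustrates that general idea in plain Python; it is not the implementation used in the study, and the function name `delay_embed`, the embedding dimension, and the delay are assumptions chosen for illustration.

```python
def delay_embed(series, dimension, delay):
    """Build delay vectors [x(t), x(t + delay), ..., x(t + (dimension - 1) * delay)]."""
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    span = (dimension - 1) * delay  # distance from first to last element of a vector
    if len(series) <= span:
        raise ValueError("series is too short for this dimension and delay")
    return [
        [series[t + k * delay] for k in range(dimension)]
        for t in range(len(series) - span)
    ]

# Hypothetical hourly PM2.5 readings, for illustration only.
series = [5, 7, 9, 8, 6, 4, 3]
vectors = delay_embed(series, dimension=3, delay=2)
print(vectors)
```

Each reconstructed vector can then be passed as a feature row to a regressor such as SVR, FFNN, or RF. In practice the dimension and delay are usually selected from the data rather than fixed by hand.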

Hung et al.15 conducted research to determine how transported smoke aerosols affect air quality in New York State. ANN was used in the method, and five models with distinct sets of predictors were considered during the summer seasons of 2012–2019. When the models were evaluated using RMSE and R2, the results revealed that smoke cases had higher average PM2.5 concentrations than non-smoke cases.

Category 3: decision tree

In decision analysis, a decision tree can be used to depict decisions and decision making visually and explicitly. Angelin Jebamalar et al.16 created a method that combined light gradient boosting and decision tree techniques. The model splits the tree leaf by leaf using the best fit, allowing it to handle massive amounts of data with little memory and great speed. From 2017 to 2019, PM2.5 data for Chennai, India, were collected using IoT devices and stored in the cloud. Compared with RF, DT, and regression approaches, the suggested model performed best, with the lowest MAE and RMSE values.

Another study17 focused on PM2.5 predictions from environmental sensor data streams using a stacked boosting ensemble (STBoost) model with z-score optimization techniques. To improve prediction accuracy, STBoost uses Light GBM as the meta regressor and XGBoost, GBM regression, and RF as base regressors. The results indicate that the STBoost model achieved an accuracy score of 99.52 and an RMSE value of 0.1048.

To improve PM2.5 perception, Luo et al.18 used an image-based technique that included CNN and GBM. Daily weather conditions, 6976 pictures, and hourly data from Shanghai for 2016 were used to create the model. With the proposed technique, the MAE, RMSE, and R2 of the PM2.5 estimates are 3.56, 10.02, and 0.85, respectively.

Using a 1.5-year dataset from Malaysia, Ref. 19 examined the performance of MLP and RF models in predicting PM2.5. A confusion matrix was used as the performance metric. Overall, RF outperformed MLP.
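A confusion matrix applies when the prediction target is a category rather than a concentration, for example an air quality class. The following is a minimal sketch in plain Python; the class labels and the sample predictions are hypothetical and are not taken from the study.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes and columns are predicted classes, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Share of samples on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else float("nan")

# Hypothetical air quality classes and predictions, for illustration only.
labels = ["good", "moderate", "unhealthy"]
actual = ["good", "good", "moderate", "unhealthy", "moderate", "unhealthy"]
predicted = ["good", "moderate", "moderate", "unhealthy", "moderate", "moderate"]
matrix = confusion_matrix(actual, predicted, labels)
print(matrix, accuracy(matrix))
```

Off-diagonal cells show which classes a model confuses, which a single accuracy figure does not reveal.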

Category 4: regression

The regression model was used in a number of articles to predict particulate matter. Kleine Deters et al.20 present a machine learning technique for predicting PM2.5 based on pollution and meteorological data from two Ecuadorian cities over a six-year period. Using CGM in regression analysis, it was found that PM2.5 could be predicted more accurately during extreme weather.

Kowalski and Warchałowski21 provide a comparison of machine learning techniques for predicting dust-type air pollution levels. Real-time hourly data from Krakow was utilised to train and test the models using MSE and R2 over the course of a year. The best prediction approach, according to the findings, was a regression model.

Gu et al.22 introduced a recurrent air quality prediction (RAQP) model, which combines a recurrent framework and SVR. The model was tested on 180 hourly records from a small Chinese town. The RAQP model was found to be more successful than state-of-the-art air quality predictors with a low RMSE value.

Aljuaid et al.23 compared numerous forecasting approaches and strategies based on mathematics and machine learning. The one-hour and five-minute datasets were created by combining and manipulating numerous sources of Danish data. For comparison, the researchers used multivariate (SVR, DT, and K-nearest neighbour) and univariate (auto-regression) techniques. According to the findings, SVR had the lowest RMSE and MAE values for the one-hour dataset, and auto-regression had the lowest values for the five-minute dataset.
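Univariate auto-regression predicts the next value of a series from its own past values. The sketch below fits the simplest case, a first-order model x[t] = intercept + slope * x[t-1], by ordinary least squares in plain Python. It is an illustration only: the model order, the function names, and the sample series are assumptions, and the study's own configuration is not described here.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + slope * x[t-1]."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    var_prev = sum((a - mean_prev) ** 2 for a in x_prev)
    if var_prev == 0:
        raise ValueError("lagged series has no variance")
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    slope = cov / var_prev
    intercept = mean_next - slope * mean_prev
    return intercept, slope

def forecast_next(series, intercept, slope):
    """One-step-ahead forecast from the last observed value."""
    return intercept + slope * series[-1]

# Hypothetical five-minute PM2.5 readings, for illustration only.
series = [10.0, 12.0, 13.0, 13.5, 13.75]
intercept, slope = fit_ar1(series)
next_value = forecast_next(series, intercept, slope)
print(intercept, slope, next_value)
```

A model of this kind uses no meteorological or traffic inputs, which is what distinguishes it from the multivariate techniques in the comparison.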

Category 5: support vector machine

For ground-level PM2.5 forecasting in the city of Bogotá, Mogollón-Sotelo et al.24 presented an SVM model. Statistical validation used RMSE and other metrics, and the SVM model predicted with greater accuracy. In Ref. 25, a combination of ensemble empirical mode decomposition (EEMD), least squares SVM (LSSVM), and PSR was proposed as an alternative method for forecasting the following day. The empirical results reveal that the EEMD-PSR-LSSVM model outperformed other models in terms of MAPE and RMSE values.

Conclusions

This study reviewed 20 scientific papers that focused on machine learning algorithms for PM2.5 prediction. The use of machine learning to forecast PM2.5 has grown significantly in the previous five years, although only three research articles have been published in Malaysia. There are also several international research programmes that concentrate on more than one type of particulate matter. For predicting PM2.5, the ML techniques DL, NN, and DT are often used. Overall, supervised machine learning algorithms are the ones employed to predict air pollution, notably PM2.5. The review concludes that although few studies in Malaysia have used machine learning to forecast PM2.5, the field can be expanded and the accuracy improved, as has been done globally using supervised machine learning methodologies.

Data availability

No data are associated with this article.

Author contributions

Palanichamy Naveen did the conception of the work, drafting the article, and revision to the final version. Kuhaneswaran and Rishanti did data collection, data analysis and interpretation, under the guidance of S Subramanian and their supervisors Su-Cheng Haw and Palanichamy Naveen. Palanichamy Naveen is the corresponding author for this paper.

How to cite this article

Palanichamy N, Haw SC, S S et al. Prediction of PM2.5 concentrations in Malaysia using machine learning techniques: a review [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:1279 (https://doi.org/10.12688/f1000research.73163.1)

Open Peer Review

Reviewer Report 24 Jan 2022
D. Devaraj, Department of Electrical and Electronics Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India 
Approved with Reservations
Globally, in the major cities, air pollution is becoming a major issue. To adopt appropriate pollution control strategies Prediction of PM 2.5 concentration is necessary. Recently, in the literature, machine learning algorithms like Decision tree algorithms, Artificial neural networks, and ... Continue reading
Devaraj D. Reviewer Report For: Prediction of PM2.5 concentrations in Malaysia using machine learning techniques: a review [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:1279 (https://doi.org/10.5256/f1000research.76794.r115274)
Reviewer Report 06 Jan 2022
Nagender Aneja, Digital Science, Faculty of Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei 
Approved
This manuscript examines and highlights the most available ML strategies for PM2.5 prediction to raise knowledge of ML techniques and encourage additional research in this field. The articles published in Malaysia proposed supervised machine learning algorithms to predict air pollution, ... Continue reading
Aneja N. Reviewer Report For: Prediction of PM2.5 concentrations in Malaysia using machine learning techniques: a review [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:1279 (https://doi.org/10.5256/f1000research.76794.r115273)