Keywords
Air Pollution, Particulate Matter (PM2.5), Neural Network, Deep Learning, Decision Tree
This article is included in the Research Synergy Foundation gateway.
Air quality matters for human health, crops, vegetation, and aesthetic considerations such as visibility. Air pollution, in which the air is contaminated with a variety of dirt and chemicals, impairs breathing and can cause a wide range of health problems. Polluted air is a combination of hazardous substances from both natural and human-made sources. In 2016, outdoor air pollution in rural and urban areas was estimated to cause 4.2 million premature deaths worldwide per year; this mortality is attributed to exposure to particulate matter (PM), which causes cancers and respiratory and cardiovascular diseases [https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health].
Particulate matter (PM) is a complex mixture of ultrafine particles and vaporized tiny molecules or liquid droplets [https://www.epa.gov/pm-pollution/particulate-matter-pm-basics]. PM is categorized by particle size; PM2.5 has a diameter of 2.5 microns or less. PM2.5 is inhalable and can travel deep into the body and deposit in the alveoli, eventually passing into the bloodstream. Once mixed into the bloodstream, these particles may cause cardiovascular diseases [https://blog.breezometer.com/what-is-particulate-matter].
The health impact of exposure to PM2.5 is severe and affects different age groups.1 Among urban workers exposed to PM2.5, the number of coughs counted indicates how badly their respiratory systems are at risk. In a review, Usmani2 noted that environmental indicators and PM2.5 have a great impact on Malaysian health services via infant mortality rate, fertility rate, and life expectancy, and concluded that outdoor and indoor air quality can affect the health of children who attend school every day.
Traffic pollution and industrialization are the root sources of PM2.5 emissions. In Malaysia, the main sources of PM2.5 are emissions from industrial growth, motor vehicles, and, more recently, transboundary haze pollution. Traffic-related air pollution (TRAP) results from motor vehicle emissions.1–3 S N Brohi et al.4 concluded that industries were the main contributors to PM in Malaysia, accounting for 32%. Haze incidents have been a serious problem in Malaysia for decades. Jaafar et al.5 concluded that PM2.5 is presumed to be one of the most critical health hazards and should be continuously monitored during haze episodes.
Several empirical studies have identified air quality as a major concern in smart cities. The non-linear behaviour of air pollutants, combined with other significant regional factors, makes air pollutant generation a highly complex system. Research indicates that traditional deterministic models struggle to capture the nonlinearity between air contaminants and their emission and dispersal sources. To capture non-linear trends in air pollution models and mitigate the impact of PM2.5, machine learning (ML) approaches based on reliable and widely used statistical algorithms should therefore be considered.
This literature review is structured as follows. After the justification for the prediction of PM2.5, the steps involved in each ML technique are discussed first. The results and discussion of each ML approach used to predict PM2.5 concentrations are presented next. Finally, the conclusions discuss the use of ML for PM2.5 prediction.
The review of ML approaches comprised the following phases. First, related SCOPUS-indexed papers were retrieved using the keyword combination {‘Particulate Matter’} AND {‘PM2.5’} AND {‘Prediction’} AND {‘Machine Learning’}. The publication period was limited to 2017–2021, and the search was limited to journal articles and conference proceedings published in English. This yielded 284 documents.
After that, the studies were screened by looking at the title and abstract. Biological studies, social studies, and investigations into the relationship between PM2.5 and other air contaminants, among other topics, were omitted. The number of documents was reduced to 36. Finally, the papers that were unanimously deemed out of scope were excluded after reviewing the whole document. As a consequence, 20 manuscripts were chosen for further examination.
Five aspects are used to review the articles. The first is the type of ML utilized. The second is the researchers' method. The third is the study's location and the dataset's characteristics. The fourth and fifth are the evaluation method, including performance measures such as root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R2), and the performance of the algorithms under consideration.
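For reference, the four performance measures named above can be computed as in the following minimal Python sketch. The function name `regression_metrics` and the sample readings are illustrative assumptions and are not taken from any of the reviewed papers.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R2 is undefined when the observations are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Hypothetical PM2.5 readings (ug/m3), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, predicted)
```

Lower MSE, RMSE, and MAE indicate smaller errors, while an R2 closer to 1 indicates that the predictions explain more of the variance in the observations.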
The findings of the review are explained in this section. To begin, the review's findings are based on the number of studies that used machine learning approaches to forecast PM2.5 levels. Figure 3 shows the number of conference and journal articles produced in the last five years to predict PM2.5. It shows that the number of studies has been increasing in recent years, and if this trend continues, a potentially higher number of documents might be expected by the end of 2021.
Next, the number of studies applying machine learning algorithms has increased significantly; however, this increase has not been evenly distributed globally. Figure 4 shows that the number of published studies in Eurasia and North America is significantly higher. While China (146 studies) and the US (109 studies) have the most published research, Malaysia has only three studies.
According to the study, supervised machine learning techniques such as neural networks (NN), deep learning (DL), and decision trees (DT) were commonly employed. The largest percentages are for DL and NN, followed by DT, regression, and support vector machine (SVM). The full descriptions of the selected papers are grouped into the following categories.
Artificial neural networks, which are algorithms inspired by the structure and function of the brain, are used in deep learning. The papers that fall within this category are listed below.
Shahriar et al.6 conducted a study in three Bangladeshi air pollution hotspots to evaluate the effectiveness of two hybrid models for predicting daily PM2.5 concentrations in terms of computational efficiency and accuracy. The proposed models, Autoregressive Integrated Moving Average (ARIMA)-Artificial Neural Network (ANN) and ARIMA-SVM, were compared with DT and CatBoost. With lower MAE and RMSE and a better R2 value, the CatBoost and ARIMA-ANN models fared well.
By comparing four prediction models, Yang et al.7 sought to predict PM several days in advance at 39 stations in Seoul. The two proposed models combined a convolutional neural network (CNN) with a gated recurrent unit (CNN-GRU) and with long short-term memory (CNN-LSTM). The CNN-LSTM model provided reliable predictions by capturing hidden patterns, with low RMSE and MAE values. In Ref. 8, a deep neural network (DNN)-based hybrid model, DNN-LSTM, was proposed. The DNN-LSTM model outperformed the multiple additive regression trees (MART) and deep feedforward neural network (DFNN) models, with the highest R2 and the lowest MAE and RMSE values for 48-h predictions.
For the prediction of PM2.5, Zhang et al.9 suggested a DL model comprising an auto-encoder (AE) and a bidirectional LSTM (Bi-LSTM). The proposed method incorporates data preprocessing to improve prediction accuracy, an AE layer to extract implicit features and increase training efficiency, and a Bi-LSTM layer to make the prediction. The results indicate that the proposed model predicted better, with lower RMSE and higher R2 values, and that a positive correlation exists.
Liu et al.10 proposed a hybrid ensemble model using a deep belief network (DBN), LSTM, and multilayer perceptron (MLP). To reduce the model's complexity, it uses complementary ensemble empirical mode decomposition (CEEMD) to extract features from the data series. To produce the best forecasts, the imperialist competitive algorithm (ICA) is used to adjust the weights of the predictors. Two sets of hourly PM2.5 concentrations from Shanghai are used to validate the model. According to the experimental findings, the proposed model performed better in terms of accuracy and robustness.
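The idea of weighting several predictors to form a single forecast can be illustrated with a small Python sketch. It is not the method of Liu et al.: a coarse grid search stands in for ICA, only two predictors are combined, and every name and number below is hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def combine(pred_a, pred_b, weight_a):
    """Weighted average of two predictors; the weights sum to one."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def best_weight(y_true, pred_a, pred_b, steps=10):
    """Grid search (a stand-in for ICA) for the weight with the lowest RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(y_true, combine(pred_a, pred_b, w)))


# Hypothetical validation data: one predictor overshoots, the other undershoots.
observed = [10.0, 20.0, 30.0]
pred_a = [12.0, 22.0, 32.0]
pred_b = [8.0, 18.0, 28.0]

weight_a = best_weight(observed, pred_a, pred_b)
ensemble = combine(pred_a, pred_b, weight_a)
```

In this toy case the two errors cancel at equal weights. A metaheuristic such as ICA plays the same role as the grid search when there are more predictors and the weight space is too large to enumerate.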
A neural network classifies input observations by forming linear combinations of the input data. To reduce difficulty and error, Jiang et al.11 presented a method that combines the group teaching optimization algorithm (GTOA) with the extreme learning machine (ELM) and is based on a data-preprocessing strategy. A two-step decomposition breaks the PM concentration data down into high-frequency intrinsic mode functions (IMFs). GTOA is then used to optimise ELM. The data cover hourly PM2.5 levels in Beijing over a 16-month period. The findings showed that the proposed model improved prediction accuracy while maintaining a low MAE and RMSE.
With daily records from cities in Finland and Brazil, Neto et al.12 evaluated single and neural-based ensemble techniques. The values of MSE, mean absolute percentage error (MAPE), MAE, and RMSE were compared. MLP, the ensemble method, had the best overall performance. Suleiman et al.13 examined the performance of boosted regression trees (BRT), ANN, and SVM models in forecasting roadside PM2.5 concentrations at nineteen monitoring sites in London. RMSE and other metrics were used for performance evaluation. In general, ANN performed better.
Ali Shah et al.14 assessed the performance of the Phase Space Reconstruction (PSR) technique, which captures multi-time scale information, in combination with radial and linear Support Vector Regression (SVR), Feedforward Neural Network (FFNN), and Random Forest (RF) models. RMSE and MAE were used to assess the performance of the ML techniques on a dataset collected in Saudi Arabia (Masfalah) over a 21-month period. The results showed that FFNN produced a reliable prediction.
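PSR is commonly implemented as a time-delay embedding, in which each input vector collects lagged values of the series. The Python sketch below assumes that formulation; the embedding dimension, the delay, the forecast horizon, and the function names are hypothetical, since the source does not report the settings used by Ali Shah et al.

```python
def delay_embed(series, dim, delay):
    """Build vectors [x[t], x[t+delay], ..., x[t+(dim-1)*delay]] for every valid t."""
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive integers")
    span = (dim - 1) * delay  # distance between first and last element of a vector
    if len(series) <= span:
        raise ValueError("series is too short for this dim and delay")
    return [[series[t + k * delay] for k in range(dim)]
            for t in range(len(series) - span)]


def make_supervised(series, dim, delay, horizon=1):
    """Pair each delay vector with the value `horizon` steps after its last element."""
    span = (dim - 1) * delay
    vectors = delay_embed(series, dim, delay)
    usable = len(series) - span - horizon  # vectors that still have a future target
    if usable < 1:
        raise ValueError("series is too short for this horizon")
    inputs = vectors[:usable]
    targets = [series[t + span + horizon] for t in range(usable)]
    return inputs, targets


# Hypothetical hourly readings, for illustration only.
readings = [1, 2, 3, 4, 5, 6]
embedded = delay_embed(readings, dim=3, delay=2)
inputs, targets = make_supervised(readings, dim=2, delay=1)
```

The resulting input and target pairs can then be passed to any regressor, such as the SVR, FFNN, or RF models mentioned above.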
Hung et al.15 conducted research to determine how transported smoke aerosols affect air quality in New York State. ANN was used in the method, and five models with distinct sets of predictors were considered during the summer seasons of 2012–2019. When the models were evaluated using RMSE and R2, the results revealed that smoke cases had higher average PM2.5 concentrations than non-smoke cases.
In decision analysis, a decision tree can be used to visually and explicitly depict decisions and decision making. Angelin Jebamalar et al.16 created a method that combined light gradient boosting and decision tree techniques. The model grows the tree leaf by leaf, splitting the leaf with the best fit, which allows it to handle massive amounts of data with little memory and great speed. From 2017 to 2019, PM2.5 data for Chennai, India, were collected using IoT devices and stored in the cloud. When compared with RF, DT, and regression approaches, the suggested model performed best, with the lowest MAE and RMSE values.
Another study17 focused on PM2.5 predictions from environmental sensor data streams using a stacked boosting ensemble (STBoost) model with z-score optimization techniques. To improve prediction accuracy, STBoost uses LightGBM as the meta regressor and XGBoost, GBM regression, and RF as base regressors. The results indicate that the STBoost model outperformed the other models considered, with an accuracy score of 99.52 and an RMSE value of 0.1048.
To improve PM2.5 perception, Luo et al.18 used an image-based technique that included CNN and GBM. Daily weather conditions, 6976 pictures, and hourly data from Shanghai (2016) were used to create the model. With the proposed technique, the MAE, RMSE, and R2 of the PM2.5 estimates are 3.56, 10.02, and 0.85, respectively.
Using a 1.5-year dataset from Malaysia, Ref. 19 examined the performance of MLP and RF models in predicting PM2.5. A confusion matrix was used as the performance metric. Overall, RF outperformed MLP.
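A confusion matrix applies when PM2.5 levels are treated as classes rather than continuous values. The Python sketch below shows how such a matrix and the accuracy derived from it might be computed; the three class labels and the sample predictions are hypothetical, as the source does not say how the study binned PM2.5.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of observations on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else float("nan")


# Hypothetical air-quality classes and predictions, for illustration only.
labels = ["good", "moderate", "unhealthy"]
actual = ["good", "good", "moderate", "moderate", "unhealthy", "unhealthy"]
predicted = ["good", "moderate", "moderate", "moderate", "unhealthy", "moderate"]

matrix = confusion_matrix(actual, predicted, labels)
score = accuracy(matrix)
```

Off-diagonal cells show which classes a model confuses, which a single error value such as RMSE cannot reveal.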
The regression model was used in a number of articles to predict particulate matter. Kleine Deters et al.20 present a machine learning technique for predicting PM2.5 based on pollution and meteorological data from two Ecuadorian cities over a six-year period. Using CGM in regression analysis, it was found that PM2.5 could be predicted more accurately during extreme weather.
Kowalski and Warchałowski21 provide a comparison of machine learning techniques for predicting dust-type air pollution levels. Real-time hourly data collected in Krakow over the course of a year were used to train and test the models, which were evaluated using MSE and R2. According to the findings, the best prediction approach was a regression model.
Gu et al.22 introduced a recurrent air quality prediction (RAQP) model, which combines a recurrent framework and SVR. The model was tested on 180 hourly records from a small Chinese town. The RAQP model was found to be more successful than state-of-the-art air quality predictors with a low RMSE value.
Aljuaid et al.23 compared numerous forecasting approaches and strategies based on mathematics and machine learning. The one-hour and five-minute datasets were created by combining and manipulating numerous sources of Danish data. For comparison, the researchers used multivariate (SVR, DT, and K-nearest neighbour) and univariate (auto-regression) techniques. According to the findings, SVR had the lowest RMSE and MAE values for the one-hour dataset, and auto-regression had the lowest for the five-minute dataset.
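Univariate auto-regression predicts the next value of a series from its own past values. The Python sketch below fits a first-order model by ordinary least squares; the model order, the function names, and the sample series are assumptions made for illustration, since the source does not give the configuration used by Aljuaid et al.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + slope * x[t-1]."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    xs, ys = series[:-1], series[1:]  # lagged values and the values that follow them
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("lagged values are constant; slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_next(series, intercept, slope):
    """One-step-ahead forecast from the last observed value."""
    return intercept + slope * series[-1]


# Hypothetical series generated by x[t] = 2 + 0.5 * x[t-1], starting at 1.
readings = [1.0, 2.5, 3.25, 3.625, 3.8125]
intercept, slope = fit_ar1(readings)
next_value = forecast_next(readings, intercept, slope)
```

Because this model uses only the target series itself, it needs no meteorological or other covariates, which is what distinguishes it from the multivariate techniques in the comparison.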
For ground-level PM2.5 forecasting in the city of Bogotá, Mogollón-Sotelo et al.24 presented an SVM model. Statistical validation was done using RMSE and other metrics. The SVM model predicted with greater accuracy. A combination of ensemble empirical mode decomposition (EEMD), least squares SVM (LSSVM), and PSR was proposed in Ref. 25 as an alternative method for next-day forecasting. The empirical results reveal that the EEMD-PSR-LSSVM model outperformed other models in terms of MAPE and RMSE values.
This study reviewed 20 scientific papers that focused on machine learning algorithms for PM2.5 prediction. The use of machine learning to forecast PM2.5 has grown significantly over the previous five years, although only three research articles have been published in Malaysia. Several international research programmes also concentrate on more than one type of particulate matter. The ML techniques DL, NN, and DT are often used to predict PM2.5. Overall, it appears that supervised machine learning algorithms are employed to predict air pollution, notably PM2.5. The review concludes that although few studies in Malaysia have used machine learning to forecast PM2.5, the field can be expanded and accuracy improved, as has been done globally using supervised machine learning methodologies.
Palanichamy Naveen conceived the work, drafted the article, and revised it to the final version. Kuhaneswaran and Rishanti carried out data collection, data analysis, and interpretation under the guidance of S Subramanian and their supervisors, Su-Cheng Haw and Palanichamy Naveen. Palanichamy Naveen is the corresponding author for this paper.
Is the topic of the review discussed comprehensively in the context of the current literature?
Yes
Are all factual statements correct and adequately supported by citations?
Yes
Is the review written in accessible language?
Yes
Are the conclusions drawn appropriate in the context of the current research literature?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Machine learning, Smart grid
Is the topic of the review discussed comprehensively in the context of the current literature?
Yes
Are all factual statements correct and adequately supported by citations?
Yes
Is the review written in accessible language?
Yes
Are the conclusions drawn appropriate in the context of the current research literature?
Yes
References
1. Masood A, Ahmad K: A model for particulate matter (PM2.5) prediction for Delhi based on machine learning approaches. Procedia Computer Science. 2020; 167: 2101-2110.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Deep Learning, NLP
Version 1 (14 Dec 21): reports available from invited reviewers 1 and 2.