Prediction of Air Pollution in Seoul based on functionalities of various Machine Learning Models

Tanmay Debnath
6 min read · Jun 16, 2020

Air pollution has been an area of concern for the modern world, and the topic of global warming has generated considerable controversy, along with many facts and figures. Overall, air pollution causes around 7 million deaths worldwide each year and is one of the world's largest single environmental risks. Productivity losses and degraded quality of life caused by air pollution are estimated to cost the world economy around $5 trillion per year.

Mathematically speaking, it has been observed that air-pollution trends are quite similar for a particular place over a particular time frame. Hence, researchers and data scientists have used modern mathematical tools to exploit this similarity in the trends. Here, we demonstrate how accurately the concentrations of certain particles in the atmosphere can be predicted from datasets collected over time for Seoul, South Korea.

DATASETS

The dataset was obtained from the Kaggle website. Kaggle, a subsidiary of Google LLC, is an online community for data scientists and machine learning enthusiasts. Kaggle allows users to find and publish datasets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data-science challenges.

LINK: https://www.kaggle.com/bappekim/air-pollution-in-seoul

CONTENT

This data provides average values for six pollutants (SO2, NO2, CO, O3, PM10, PM2.5).

  • Data were measured every hour between 2017 and 2019.
  • Data were measured for 25 districts in Seoul.
  • This dataset is divided into four files.
A view of the dataset from the Seoul Air Pollution reference

The dataset offers a detailed view of how much each pollutant contributes to the predictability of pollution levels over a particular period of time.

In the language of machine learning, this problem falls under time-series data analysis. Earlier studies on time-series analysis provided the basic foundation for our initial investigation of the dataset. The data varies over a wide range of values, and even for a particular place over a particular time frame, it varies considerably. To identify and predict the pollution level for the next time frame, we chose a single variable for the implementation: 'PM2.5'.

The following figure shows the variability in the data (across the total number of data points).
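The article does not show its preprocessing code, but a minimal sketch might look like the following. The file name Measurement_summary.csv, the column names, and the min-max scaling are assumptions based on the Kaggle dataset (the small error values reported later suggest the series was normalised before modelling):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed file and column names from the Kaggle "Air Pollution in Seoul" dataset.
df = pd.read_csv("Measurement_summary.csv", parse_dates=["Measurement date"])

# Focus on a single pollutant, PM2.5, averaged across the 25 districts
# so we obtain one city-wide value per hourly time step.
pm25 = df.groupby("Measurement date")["PM2.5"].mean()

# Scale to [0, 1] before modelling (an assumption, not stated in the article).
scaler = MinMaxScaler()
pm25_scaled = scaler.fit_transform(pm25.values.reshape(-1, 1)).ravel()
```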

To tackle the problem, we chose a few models based on their performance and wide acceptance. The models we propose are:

  1. Linear Regression (LR)
  2. Random Forest Regression (RFR)
  3. Support Vector Regression (SVR)
  4. Long Short-Term Memory (LSTM)

MODEL PREDICTION

We used the mathematical tools listed above to make the final predictions and compare the best results. Each tool has its own capabilities and features; each works best in specific situations, and some work well almost universally. Here we explore the possibilities of each tool and point out the features that helped in solving the problem.

LINEAR REGRESSION

Linear regression models a linear relationship between the independent variables and the dependent variable; the resulting model resembles the equation of a straight line. It tries to fit a single line that relates all the independent variables to the dependent variable. In this analysis, however, we implemented the model somewhat differently: we took the data from the previous time frame and used it as the input for the next prediction. In doing so, we divided the entire dataset into fairly large segments and made a prediction for each underlying segment.
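The exact implementation is not shown in the article; the following is a minimal sketch of this lagged-input idea, where the window length (24 hours) and the 80/20 chronological split are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lagged(series, n_lags=24):
    """Build (X, y) pairs where each target is predicted from the
    previous n_lags observations (here, an assumed 24 hours)."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return np.array(X), np.array(y)

X, y = make_lagged(pm25_scaled, n_lags=24)

# Chronological split: train on the earlier segment, test on the later one,
# so the model never sees future values during training.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
```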

Predicted and actual values for the linear regression model

Upon close inspection of the results, we observed the following mean squared error and mean absolute error values.

Average Mean squared error = 0.00374

Average Mean absolute error = 0.0318

(Note: In statistics, these metrics play an important role in assessing how closely the predicted data converges to the actual values. The smaller the values, the more realistic our predictions and the better our model.)
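As a quick illustration, both metrics can be computed directly with scikit-learn, using the hypothetical y_test and y_pred arrays from the sketch above:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE: {mse:.5f}, MAE: {mae:.4f}")
```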

RANDOM FOREST REGRESSION

This regressor belongs to the ensemble family of classifiers and regressors, because it searches for a solution by aggregating decision-tree models. The model analyses the provided data, splits it according to certain rules, and builds the final decision trees. These trees help the model arrive at the answer the user is seeking, weighting each tree's output to produce the final decision. In simple terms, Random Forest Regression handles non-linear segments of the data, which makes it one of the more desirable and competent models for analysing time-series datasets. In our case, we implemented the regression model on the earlier data points; the decision trees are generated from them, and the final deduction is made by the model.
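A minimal sketch, reusing the hypothetical lagged features from the linear-regression example; the number of trees is an assumed default, not a value reported in the article:

```python
from sklearn.ensemble import RandomForestRegressor

# 100 trees and a fixed random seed are assumptions for reproducibility.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
```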

Analysis of the dataset based on the Random Forest Regression model

From the analysis, we found the following measures of how closely the predicted data converges to the actual values.

Average Mean squared error = 0.0132

Average Mean absolute error = 0.0728

SUPPORT VECTOR MACHINES (SVM)

As the name suggests, support vector machines work with support vectors. The algorithm searches for the best position for a boundary such that points can be correctly assigned to their classes. The points closest to the boundary serve as the support vectors, and they determine the line (or hyperplane) that best fits the entire dataset; in the regression setting (SVR), the same idea yields a function that fits the data within a tolerance margin. Support vector methods can handle both linear and non-linear data: thanks to kernel functions and the method's robust structure, the best hyperplane can be chosen based on the behavior of the data points in the (possibly transformed) feature space.
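Again as a sketch, an SVR with an RBF kernel on the same hypothetical lagged features; the kernel choice and the C and epsilon values are assumed starting points, not the article's tuned settings:

```python
from sklearn.svm import SVR

# An RBF kernel lets the regressor fit non-linear structure in the series.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.01)
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)
```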

Analysis of the dataset based on the Support Vector Machine model

From the analysis, we found the following measures of how closely the predicted data converges to the actual values.

Average Mean squared error = 0.00181

Average Mean absolute error = 0.0237

LONG SHORT TERM MEMORY (LSTM)

This mathematical tool is a special case of another family of neural networks: the LSTM is a form of Recurrent Neural Network (RNN). This type of network has the ability to memorize data and analyse it over time. Each LSTM cell retains information for a certain amount of time, and training with backpropagation (through time) adjusts the weights so that the network makes its final decisions with the smallest possible error. LSTMs have a wide range of applications, from handwriting recognition to audio-processing systems. We applied one such network to our dataset.
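As a rough illustration, a minimal single-layer LSTM using the Keras API (an assumption; the article does not state which framework was used) might look like this, with the layer size, epochs, and batch size chosen for illustration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# LSTM layers expect 3-D input: (samples, time steps, features).
X_train_seq = X_train.reshape(-1, X_train.shape[1], 1)
X_test_seq = X_test.reshape(-1, X_test.shape[1], 1)

# A small single-layer network; 50 units, 10 epochs, and batch size 64
# are assumptions, not the article's exact settings.
model = Sequential([
    LSTM(50, input_shape=(X_train.shape[1], 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train_seq, y_train, epochs=10, batch_size=64, verbose=0)
y_pred_lstm = model.predict(X_test_seq).ravel()
```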

Analysis of the dataset based on the LSTM model

From the analysis, we found the following measures of how closely the predicted data converges to the actual values.

Average Mean squared error = 0.00281

Average Mean absolute error = 0.0287

RESULTS

Based on the analyses performed with the four mathematical tools, the mean squared error and mean absolute error values make it clear that the SVM (Support Vector Machine) model is the best-fitting model for this analysis. SVM is highly versatile and is therefore also significant for non-linear modelling.
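For quick reference, the error values reported above are:

Model                    Avg. MSE    Avg. MAE
Linear Regression        0.00374     0.0318
Random Forest            0.0132      0.0728
Support Vector Machine   0.00181     0.0237
LSTM                     0.00281     0.0287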
