Deep learning-based PM2.5 prediction considering the spatiotemporal correlations: A case study of Beijing, China


Unjin Pak, Jun Ma, Unsok Ryu, Kwangchol Ryom, U. Juhyok, Kyongsok Pak, Chanil Pak

PII: S0048-9697(19)33481-3
DOI: https://doi.org/10.1016/j.scitotenv.2019.07.367
Reference: STOTEN 33561

To appear in: Science of the Total Environment

Received date: 14 May 2019
Revised date: 28 June 2019
Accepted date: 22 July 2019

Please cite this article as: U. Pak, J. Ma, U. Ryu, et al., Deep learning-based PM2.5 prediction considering the spatiotemporal correlations: A case study of Beijing, China, Science of the Total Environment, https://doi.org/10.1016/j.scitotenv.2019.07.367

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Unjin Pak 1 ✉, Jun Ma 2, Unsok Ryu 3, Kwangchol Ryom 4, Juhyok U 5, Kyongsok Pak 3, Chanil Pak 6

1 Department of Automation Engineering, Kim Chaek University of Technology, Pyongyang 950003, Democratic People's Republic of Korea

2 Department of Geology, Kim Il Sung University, Pyongyang 999093, Democratic People's Republic of Korea

3 School of Information Science, Kim Il Sung University, Pyongyang 999093, Democratic People's Republic of Korea

4 Department of Metallurgical Engineering, Kim Chaek University of Technology, Pyongyang 950003, Democratic People's Republic of Korea

5 Digital Library, Kim Chaek University of Technology, Pyongyang 950003, Democratic People's Republic of Korea

6 Information Center, Kim Chaek University of Technology, Pyongyang 950003, Democratic People's Republic of Korea

✉ Unjin Pak: [email protected]

ABSTRACT

Air pollution is one of the serious environmental problems facing humankind and a hot topic in Northeastern Asia. The accurate prediction of PM2.5 (particulate matter with an aerodynamic diameter of ≤2.5 μm) is therefore very significant for the management of human health and for governmental decision-making on environmental management. In this study, a spatiotemporal convolutional neural network (CNN) and long short-term memory (LSTM) model (CNN-LSTM, also called the PM (particulate matter) predictor) was proposed and used to predict the next day's daily average PM2.5 concentration in Beijing City. The spatiotemporal correlation analysis using mutual information (MI) was performed, considering not only the linear correlation but also the nonlinear correlation between target and observation parameters; in addition, it was carried out for the whole area of China with the target monitoring station as the center and also for the historic air quality and meteorological data. As a result, the spatiotemporal feature vector (STFV), which reflects both linear and nonlinear correlations between parameters, was effectively constructed. The PM predictor secured a fast and accurate prediction performance by efficiently extracting the inherent features of the latent air quality and meteorological input data associated with PM2.5 through CNN and by fully reflecting the long-term historic process of input time series data through LSTM. The air quality and meteorological data from the 384 monitoring stations, which represent the whole area of China with Beijing City as the center, during the 3 years (Jan. 1st, 2015 to Dec. 31st, 2017) were used to verify the validity of the proposed method. In conclusion, the proposed method was shown to have better stability and prediction performance than multi-layer perceptron (MLP) and LSTM models.

Keywords: PM2.5 prediction, PM predictor, Deep learning, CNN, LSTM, Spatiotemporal correlation

1. Introduction

Currently, air pollution due to PM2.5 is considered a critical problem threatening human health, and therefore practical measures for its prevention are strongly required. Over the years, many investigations and studies have elucidated the compositions and dispersion properties (WHO 2003; Xing et al., 2016) of air pollutants including PM2.5 and reported that they cause various diseases such as respiratory and heart diseases and lung and skin cancers (Kampa and Castanas, 2008; WHO 2016). Therefore, air quality prediction with a high accuracy is essential for the prevention of medical accidents due to air pollution and for the comprehensive and effective control of the atmospheric environment. The research methodology for air pollution prediction mainly includes deterministic (Baklanov et al., 2008; Kim et al., 2010; Woody et al., 2016; Bray et al., 2017; Zhou et al., 2017) and statistical (Carlo et al., 2007; Castellano et al., 2009; Gennaro et al., 2013; Donnelly et al., 2015) methods. The deterministic method, in which the formation and diffusion process of pollutants is modeled by using theoretical meteorological, emission and chemical models, is not enough to explain the nonlinearity and heterogeneity of the many factors related to the formation of pollutants, because it relies on ideal theory to determine the model structure and on experience to estimate the parameters. In contrast, the statistical method avoids the complexity and burden of such modeling and shows a good performance because it uses data-driven statistical modeling techniques.

Artificial neural network (ANN) has been widely employed in air pollution prediction research because it can capture the nonlinear mechanism of atmospheric phenomena well and shows a high prediction performance (Pérez et al., 2000; Perez and Reyes, 2006; Yildirim and Bayramoglu, 2006; Zhou et al., 2014). After that, researchers focused on improving the prediction accuracy and started to pay attention to hybrid or integrated models that combine various optimization methods with ANN, going beyond the conventional simple neural network. To predict the next day's daily average concentrations of PM10 and PM2.5 in Thessaloniki, Greece and in Helsinki, Finland (Voukantsis et al., 2011), a prediction model combining linear regression and multi-layer perceptron (MLP) was developed on the basis of principal component analysis to inter-compare the air pollution patterns of both cities and extract the principal components. To predict the emission amount of PM10, ANN was combined with a genetic algorithm and showed a prediction performance more than 3 times better than the conventional multi-linear regression and principal component regression models (Antanasijević et al., 2013). Monte Carlo simulations were employed to overcome the uncertainty of the model and improve the prediction performance of the neural network (Mohammad et al., 2013), and a hybrid model combining air mass trajectory analysis with wavelet transformation was developed to improve the performance of ANN (Feng et al., 2015). The emission and diffusion of pollutants are largely related not only to the interaction of pollutants but also to meteorological factors, and therefore air pollution prediction first requires many air quality and meteorological parameters. This in turn requires the neural network to train on and process huge input data, which increases the difficulty of using the conventional shallow model.

Nowadays, deep learning, a field of machine learning for artificial intelligence, has been successfully applied to wide-ranging fields such as computer vision (Bengio, 2009), image classification (Bengio, 2009; Chan et al., 2015), speech recognition (Mohamed et al., 2011), prediction of time series data (Zhang et al., 2015) and natural language processing (Collobert and Weston, 2008; Bengio, 2009; Mohamed et al., 2011). Moreover, it has also been widely applied to the prediction of air quality (Pak et al., 2018; Ahn et al., 2017; Ong et al., 2014); the prediction models considering the spatial and temporal correlation of data have been remarkable (Li et al., 2017; Li et al., 2017; Soh et al., 2018). Air pollutants certainly have spatiotemporal correlations in their emission and diffusion, and therefore it is natural to discuss the spatiotemporal correlation related to the target parameter when considering air quality. In other words, if the air pollutants are analyzed only in the time domain, the impacts of and relationships with other regions may be ignored; in contrast, if only the spatial relationships are considered, the diffusion of pollutants over time may be ignored. From this viewpoint, the following previous studies are relevant to our research.

In the LSTME model (Li et al., 2017), which captures spatiotemporal correlations, the researchers used only PM2.5 as the historic air quality data and integrated meteorological data such as temperature, humidity, wind velocity and visibility into the model as additional data. In addition, they considered the spatiotemporal correlation only with regard to PM2.5 and used all the data of 12 monitoring stations as the input data of the model. In the Geoi-deep belief network (Geoi-DBN) model (Li et al., 2017), the satellite-derived aerosol optical depth (AOD) and normalized difference vegetation index (NDVI), in addition to PM2.5, temperature, humidity and wind velocity, were also used as the input data in order to consider the geographical correlation; three factors were subjectively introduced as additional input data in order to express the spatiotemporal correlation. In the space-temporal deep neural network (ST-DNN) model (Soh et al., 2018), the temporal, geographical and spatial features were extracted by LSTM, CNN, and an integration of k-nearest neighbor (KNN) and ANN, respectively; the extracted data were ultimately combined in an ANN to produce the prediction value. To sum up, in the above studies the spatiotemporal correlation analysis considered only the linear correlation between parameters, and moreover the historic air quality data and the meteorological data were not both fully considered.

In this study, we proposed a method to predict the next day's daily average PM2.5 concentration in Beijing City by using the PM predictor, which considers the spatiotemporal correlation with regard to the target parameter.

The contributions of this study are as follows:

(1) The spatiotemporal correlation analyzed by using MI was fully considered and effectively extracted for the 384 monitoring stations, which represent the whole area of China with the target monitoring station as the center, and also for the historic air quality and meteorological data; as a result, the spatiotemporal feature vector, which reflects both linear and nonlinear correlations between factors, was constructed.

(2) The PM predictor, including the CNN-LSTM model and the MI estimator, was designed to predict the next day's daily average PM2.5 concentration in Beijing City. The CNN was designed to efficiently extract the inherent features essential for the prediction of PM2.5 from input data; the LSTM was designed to fully represent the long-term historic process of input time series data.

(3) As a result of the performance evaluation, it was demonstrated that the prediction accuracy is considerably improved when both the air quality and meteorological data are used compared with when only the air quality data are used. In addition, the proposed method was shown to be superior to the previous methods for PM2.5 prediction.

2. Materials and methods

2.1. Data collection

In this study, the 4608 air quality and meteorological data from the 384 monitoring stations, which represent the whole area of China with Beijing City as the center, during the 3 years (Jan. 1st, 2015 to Dec. 31st, 2017) were used; among them, the air quality data were downloaded from the Ministry of Ecology and Environment of the People's Republic of China (http://www.mee.gov.cn/), and the meteorological data were downloaded from the 2345 weather forecast website (http://tianqi.2345.com/). The climate of Beijing, China, is a typical semi-humid continental monsoon climate in the northern temperate zone. It is hot and rainy in summer and cold and dry in winter; spring and autumn are relatively short. Fig. 1 shows the distribution of air quality monitoring stations in China and the rank of the daily average PM2.5 value for each station on Jan. 1st, 2015. The air quality and meteorological data used in this study are shown in Table 1.

The present dataset contained 13,152 records for each station. In the experimental stage to determine the suitable structure and optimal parameters of CNN-LSTM, the dataset was split into two subsets, learning data and test data; 80% of the whole data was used as the learning data, and the remaining 20% was used as the test data.

2.2. The analysis of spatiotemporal correlation by MI

2.2.1. MI Estimator

Identifying the inherent interaction between given variables in engineering problems such as pattern recognition and prediction has been an important research topic. Several measures (Ryu et al., 2018) to evaluate correlations between variables have long been developed and used, and the most widely used measure is the correlation coefficient. However, this measure has the limitation that only a linear relationship between variables can be detected (Jain and Murthy, 2016). In contrast, MI is one of the popular measures to quantify the various correlations between random variables, including nonlinear correlations. It is based on measuring the amount of information that the variables share and does not require any assumptions about the distribution of data.

Given two variables X and Y, the definition of MI is as follows.

$$
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}, \tag{1}
$$

where p(x) and p(y) are the marginal probability mass functions of the random variables X and Y, respectively, and p(x, y) is the joint distribution of the two variables. If X and Y are statistically independent, MI has a value of 0, whereas if they are strongly correlated, MI has a large value. The estimators for the calculation of MI typically include the Basic Histogram, Kernel Estimator, B-splines Estimator and Nearest Neighbors (NN)-based Estimator, and it was reported that the NN-based Estimator is more beneficial for multivariate filter feature selection (Doquire and Verleysen, 2012). Therefore, we adopted the NN-based Estimator as the MI estimator for the spatiotemporal correlation evaluation. The main principle of the estimator is as follows. If the neighbors of a specific observation in the X space correspond to the neighbors of the same point in the Y space, there will certainly be a strong relationship between X and Y. The estimator starts from the following entropy estimate.

$$
\hat{H}(X) = -\psi(K) + \psi(Q) + \log(c_d) + \frac{d}{Q} \sum_{q=1}^{Q} \log\big(\varepsilon_X(q,K)\big), \tag{2}
$$

where $\psi$, $Q$, $d$ and $c_d$ are the digamma function, the number of samples in X, the dimensionality of these samples and the volume of a d-dimensional unitary ball, respectively, and $\varepsilon_X(q,K)$ is twice the distance (usually chosen as the Euclidean distance) from the q-th observation in X to its K-th NN. The most popular estimator can be derived as follows.

$$
\hat{I}(X;Y) = \psi(Q) + \psi(K) - \frac{1}{K} - \frac{1}{Q} \sum_{i=1}^{Q} \big(\psi(\tau_{x_i}) + \psi(\tau_{y_i})\big), \tag{3}
$$

where $\tau_{x_i}$ is the number of points whose distance from $x_i$ is not greater than $0.5\,\varepsilon(q,K) = 0.5\,\max\big(\varepsilon_X(q,K), \varepsilon_Y(q,K)\big)$ (and similarly for $\tau_{y_i}$).

2.2.2. STFV construction

The set of monitoring stations is defined as $S = \{s_1, s_2, \ldots, s_N\}$, and the set of indexes is defined as $R = \{r_1, r_2, \ldots, r_F\}$, and then each historic dataset can be expressed as follows.

$$
x_{s_n}^{r_j}(t), \quad j = 1, \ldots, F, \quad n = 1, \ldots, N, \quad t = 1, \ldots, T,
$$

where N and F are the number of monitoring stations and features, respectively, and T is time. That is, $x_{s_n}^{r_j}(t)$ is the observation value of an index $r_j$ in a station $s_n$ at a time t. We sorted the observation time series in order of geographical proximity to the target station and described a new input time series, $X = \{X_{s_1}^{r_1}, \ldots, X_{s_1}^{r_F}, X_{s_2}^{r_1}, \ldots, X_{s_2}^{r_F}, \ldots, X_{s_N}^{r_1}, \ldots, X_{s_N}^{r_F}\}$, as follows.

$$
X = \begin{bmatrix}
x_{s_1}^{r_1}(1) & \cdots & x_{s_1}^{r_F}(1) & x_{s_2}^{r_1}(1) & \cdots & x_{s_2}^{r_F}(1) & \cdots & x_{s_N}^{r_1}(1) & \cdots & x_{s_N}^{r_F}(1) \\
x_{s_1}^{r_1}(2) & \cdots & x_{s_1}^{r_F}(2) & x_{s_2}^{r_1}(2) & \cdots & x_{s_2}^{r_F}(2) & \cdots & x_{s_N}^{r_1}(2) & \cdots & x_{s_N}^{r_F}(2) \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
x_{s_1}^{r_1}(T-1) & \cdots & x_{s_1}^{r_F}(T-1) & x_{s_2}^{r_1}(T-1) & \cdots & x_{s_2}^{r_F}(T-1) & \cdots & x_{s_N}^{r_1}(T-1) & \cdots & x_{s_N}^{r_F}(T-1)
\end{bmatrix} \tag{4}
$$

Each column of the matrix is one time series $X_{s_n}^{r_j}$, and the columns are grouped by station $s_1, s_2, \ldots, s_N$.

Therefore, L-dimensional time series are obtained, and X reflects both the spatial and temporal features of the input observation data. The maximum value of time delay, D, is then set, so that the historic data affecting the target PM2.5 concentration are considered with a time delay of up to D. D can be set application-specifically or empirically through experiment, taking into account the pollutant diffusion properties and meteorological characteristics of the input observation features. Our purpose is to predict the pollutant concentration of the next day, so the time delay has an interval of one day. When a time series, $Y = [y(1), y(2), \ldots, y(T-1)]^T$, is given, an operator, $B_d$, $d = 1, 2, \ldots, D$, called the delay operator of Y is defined, and the delayed version of Y, $B_d(Y)$, is defined as

$$
B_d(Y) = [\underbrace{0, 0, \ldots, 0}_{d},\; y(1), \ldots, y(T-1-d)]^T .
$$

At this time, X is constituted of the following $L \times D$ time series, Z.

$$
Z = \{B_1(X_{s_1}^{r_1}), \ldots, B_D(X_{s_1}^{r_1}), \ldots, B_1(X_{s_N}^{r_F}), \ldots, B_D(X_{s_N}^{r_F})\}
$$
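The delay operator can be illustrated with a minimal Python sketch. The function names `delay` and `delayed_set` and the dictionary layout of the input columns are illustrative assumptions, not part of the original method description; only the zero-padding behavior follows the definition of $B_d$ given above.

```python
def delay(series, d):
    """Return B_d(series): d leading zeros, then the first len(series) - d values."""
    if not 0 <= d <= len(series):
        raise ValueError("d must be between 0 and len(series)")
    return [0.0] * d + list(series[: len(series) - d])


def delayed_set(columns, max_delay):
    """Build Z: every column delayed by d = 1..max_delay, keyed by (name, d).

    `columns` is a hypothetical mapping from a (station, feature) label
    to its time series; the paper does not prescribe a data layout.
    """
    return {
        (name, d): delay(values, d)
        for name, values in columns.items()
        for d in range(1, max_delay + 1)
    }


y = [10.0, 20.0, 30.0, 40.0, 50.0]
print(delay(y, 2))  # [0.0, 0.0, 10.0, 20.0, 30.0]
```

The output keeps the original length, so every delayed series stays aligned row by row with the target series. With L input columns and a maximum delay of D, `delayed_set` returns L × D series, matching the size of Z.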

In this case, the target time series is shifted by one day relative to Y, so it is described as follows.

$$
Y^* = [x_s^r(2), x_s^r(3), \ldots, x_s^r(T)]^T
$$

The temporal diagram showing the temporal difference between the target time series and the delayed time series spatiotemporally correlated to it is shown in Fig. 2. Then, the MI between the target time series, $Y^*$, and the delayed version, $B_d(X_{s_n}^{r_j})$, that is, $I(Y^*, B_d(X_{s_n}^{r_j}))$, $d = 1, 2, \ldots, D$, is calculated. The larger the value of I, the larger the impact that the feature $r_j$ in a station $s_n$ at a time d earlier has on the target parameter. In other words, it means that the historic data, $x_{s_n}^{r_j}(t-d)$, should be used for the prediction of $y^*(t)$. We set the correlation threshold $\delta$ of MI to determine the number of spatiotemporal features and then include only the features beyond this value in the input STFV. The set of time series beyond the correlation threshold, $Y_o$, is described as follows.

$$
\begin{aligned}
Y_o &= \{B_d(X_{s_n}^{r_j}) \mid I(Y^*, B_d(X_{s_n}^{r_j})) > \delta,\; d = 1, \ldots, D,\; n = 1, \ldots, N,\; j = 1, \ldots, F\} \\
&= \{B_{d_1^*}(X_{s_1^*}^{r_1^*}), B_{d_2^*}(X_{s_2^*}^{r_2^*}), \ldots, B_{d_P^*}(X_{s_P^*}^{r_P^*})\},
\end{aligned}
$$

where P is the number of time series selected by the MI estimator and equals the dimension of the STFV. Therefore, the STFV, $v_t$, which is the input of the PM predictor, can be finally constructed as follows.

$$
v_t = \big(x_{s_1^*}^{r_1^*}(t - d_1^*),\; x_{s_2^*}^{r_2^*}(t - d_2^*),\; \ldots,\; x_{s_P^*}^{r_P^*}(t - d_P^*)\big), \tag{5}
$$

where $v_t$ is the P-dimensional STFV at a time t.
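The selection step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it scores each delayed series with a histogram (plug-in) estimate of Eq. (1) instead of the NN-based estimator adopted in the paper, and the names `binned`, `mutual_information`, `select_features` and `stfv`, the bin count and the dictionary layout are all hypothetical.

```python
import math
from collections import Counter


def binned(values, bins):
    """Map each value to an equal-width bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant series fall into bin 0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=4):
    """Plug-in estimate of Eq. (1), in bits, from a joint histogram.

    Stand-in for the NN-based estimator; used here only for illustration.
    """
    bx, by = binned(x, bins), binned(y, bins)
    n = len(bx)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def select_features(columns, target, max_delay, threshold, bins=4):
    """Return [(name, d, mi)] for delayed columns whose MI with target exceeds threshold."""
    selected = []
    for name, values in columns.items():
        for d in range(1, max_delay + 1):
            delayed = [0.0] * d + list(values[: len(values) - d])  # B_d
            mi = mutual_information(delayed, target, bins)
            if mi > threshold:
                selected.append((name, d, mi))
    return sorted(selected, key=lambda item: -item[2])


def stfv(columns, selected, t):
    """Assemble v_t from the selected (name, d) pairs; t is a 0-based row index with t >= d."""
    return [columns[name][t - d] for name, d, _ in selected]


# Toy data: the target is column "a" delayed by one step.
a = [1, 2, 3, 4] * 3
columns = {"a": a, "b": [0.0] * 12}
target = [0.0] + a[:-1]
chosen = select_features(columns, target, max_delay=3, threshold=0.5)
print(chosen[0][:2])  # ('a', 1)
```

In the toy data the one-step delay of column "a" reproduces the target exactly, so it receives the highest score, while the constant column "b" scores zero and is dropped. Replacing `mutual_information` with an NN-based estimator would bring the sketch closer to the procedure described in Section 2.2.1.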

Fig. 3 shows the correspondence between the 60 features and their MI values, taking into account the spatiotemporal correlation for the target. Here, the features were selected in descending order of correlation, starting from the feature with the highest correlation. The time delay, d, was set to 20, which allows the features to be fully included, as determined through a series of experiments. MI values range from 0.126 to 0.335, and the features with a large correlation were mostly the pollutants in and around Beijing. From Fig. 3, it can be seen that, in the overall trend, the closer a station is to the target station, the larger the spatiotemporal correlation becomes. In addition, air quality data have a relatively higher correlation than meteorological data, because the air quality features appeared more often than the meteorological features in the STFV. Moreover, it is remarkable that the meteorological data in the western region of Beijing have a relatively large impact on the PM2.5 concentration in Beijing.

2.3. Proposed PM predictor

As shown in Fig. 4, the PM predictor consists of two parts: the STFV generation part and the CNN-LSTM model part. The STFV generation part mainly includes the MI estimator that analyzes the spatiotemporal correlation between the target parameter and the observation time series; the CNN-LSTM model consists of 6 CNNs (each with a convolutional layer and a pooling layer) and 2 LSTM layers. CNN1 receives only the features of the Beijing area, the target station, which has the most features in the STFV obtained by the MI estimator. For CNNs 2 to 6, the stations were assigned in order of how frequently their features appear in the STFV. As with CNN1, each of these CNNs receives the STFV features of its own station. The input dimensions may differ between CNNs because the features appearing in the STFV, and their numbers, differ from station to station. Then, the feature time series passed through each CNN are integrated in the LSTM and finally constitute the output of the PM predictor. The detailed descriptions of CNN and LSTM are in Appendices A and B.

Three evaluation indexes, the root-mean-square error (RMSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE), are used to evaluate the performance of the PM predictor, and they are calculated by Eqs. (6) to (8), respectively.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{l} \sum_{i=1}^{l} (O_i - P_i)^2} \tag{6}
$$

$$
\mathrm{MAE} = \frac{1}{l} \sum_{i=1}^{l} |O_i - P_i| \tag{7}
$$

$$
\mathrm{MAPE} = \frac{1}{l} \sum_{i=1}^{l} \frac{|O_i - P_i|}{O_i}, \tag{8}
$$

where $O_i$, $P_i$ and $l$ represent the observed values, the predicted values and the number of evaluation samples, respectively. MAE and RMSE measure the absolute error, and the smaller their values are, the better the model performance is. MAPE measures the relative error, and the smaller its value, the closer the predicted result is to the actual value.
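As a small illustration of Eqs. (6) to (8), the three indexes can be computed in Python as sketched below. The function names and the sample values are made up for the example and are not results from this study; MAPE is returned as a fraction rather than a percentage, which appears to match how the values are reported in the text.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, Eq. (6)."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def mae(observed, predicted):
    """Mean absolute error, Eq. (7)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error as a fraction, Eq. (8).

    Observed values must be non-zero.
    """
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)


# Illustrative values only (not data from the study).
observed = [50.0, 100.0, 200.0]
predicted = [45.0, 110.0, 190.0]
print(rmse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

Because MAPE divides by the observed value, the same absolute error weighs less on high-concentration days, which is one reason MAPE can fall while RMSE and MAE rise over a period with large concentration swings.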

2.4. PM predictor structure

First, we used RMSE, MAE and MAPE to examine the effect of different correlation thresholds $\delta$ on prediction accuracy. The maximum delay, D, was set to 19 because the delay did not exceed 19 in repeated experiments. Table 2 shows the prediction performances obtained by using different $\delta$ values when the time delay, D, is set to 19. It was found that excessively high or low values of $\delta$ lead to a poor prediction performance. The reason is mainly that when the threshold is high, the number of important features is low, and when the threshold is low, some irrelevant indexes are input and cause interference. Table 2 shows that the model yields a better prediction performance when the correlation threshold is 0.1. For this reason the correlation threshold was set to 0.1, which was the most suitable setting for the PM predictor.

Next, a reasonable model structure was determined through a series of comparison experiments. When the correlation threshold is 0.1, the total number of spatiotemporal features is 83; among them, the target station accounts for 31%, 5 neighboring stations account for 9%, 9%, 11%, 16% and 18%, respectively, and the remaining stations account for 6% in total. Since the maximum time delay is 19, we input the feature time series over 19 days to all CNNs. In this case, the number of features input to each CNN is different because the number of features appearing in the STFV is different for each station. On the basis of the above data, we set the network structures for all the CNNs (CNN1 to CNN6) as follows. In the convolutional layer, the number of feature maps and the size of the kernel are set to 32 and 3, respectively. In the pooling layer, the pooling size is set to 2, and average pooling is used. In addition, for the two LSTM layers, the numbers of neurons are 150 and 50 on the first and second layers, respectively. The connection between the two layers was formed at 60% to overcome overfitting problems in the model.
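The layer sizes above can be traced with a short Python sketch of one convolution and one average-pooling step on a 19-day window. The stride of 1, the absence of padding and the non-overlapping pooling windows are assumptions, since the text does not state them; the helper names and the kernel weights are likewise illustrative, and only a single feature channel and a single one of the 32 feature maps is shown.

```python
def conv1d_valid(series, kernel):
    """1-D convolution (cross-correlation) with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(w * series[i + j] for j, w in enumerate(kernel))
        for i in range(len(series) - k + 1)
    ]


def avg_pool1d(series, size):
    """Non-overlapping average pooling; a trailing partial window is dropped."""
    return [
        sum(series[i : i + size]) / size
        for i in range(0, len(series) - size + 1, size)
    ]


days = 19  # maximum time delay used as the input window length
window = [float(t) for t in range(days)]  # placeholder feature values
feature_map = conv1d_valid(window, [1.0, 0.0, -1.0])  # kernel size 3
pooled = avg_pool1d(feature_map, 2)  # pooling size 2
print(len(feature_map), len(pooled))  # 17 8
```

Under these assumptions each CNN hands the LSTM a sequence of 8 time steps per feature map. A different padding or stride choice would change these lengths, so the numbers should be read as a consistency check rather than as the exact shapes used in the study.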

3. Results and discussion

3.1. Prediction performance

We evaluated the performance of the proposed PM predictor by predicting the daily average PM2.5 concentration from July 2016 to June 2017 after training the predictor with the learning data from January 2015 to June 2016. The prediction range was set in this way because the air pollution due to PM becomes worse in winter. The prediction result is shown in Fig. 5. The green and red lines represent the actually observed and the predicted PM2.5 concentrations, respectively. In terms of how closely the predicted values follow the observed values, the PM predictor displayed an accurate prediction performance over the whole prediction range. This shows that the proposed model expresses the seasonal pattern and the nonlinear pattern well, handles not only short-term but also long-term predictions well, and performs well even for sudden changes in air quality. From Fig. 5, we took the part in which the concentration change of the pollutant was relatively severe (Nov. 2016 to Feb. 2017) as a new prediction range and evaluated the performance of the PM predictor again. The prediction result is shown in Fig. 6, and the performance indexes RMSE, MAE and MAPE were 3.855, 3.007 and 0.027, respectively. The RMSE and MAE values during the 4 months were larger than the RMSE and MAE values (3.011 and 2.296) during 1 year because the range of concentration change during the 4 months was larger. In contrast, the MAPE value decreased from 0.04 during 1 year to 0.027 during the 4 months because the predictor's ability to overcome the under- or over-prediction of higher or lower concentrations, which generally occurs in air quality prediction, was greatly improved.

3.2. Comparison of the proposed PM predictor with other prediction methods

In the early research on air quality prediction, it was demonstrated that MLP has a better performance than the conventional statistical methods. After that, several approaches that combine ANN with other methods were introduced in order to improve the prediction performance compared to MLP (Chattopadhyay and Chattopadhyay, 2012; Dutot et al., 2007; Biancofiore et al., 2015). Recently, air quality prediction that employs deep neural network (DNN) has become popular, and therefore we compared the prediction performance of the proposed method with the conventional MLP model and one kind of DNN, LSTM.

3.2.1. The influence of spatiotemporal correlation on prediction performance

In order to evaluate the influence of spatiotemporal analysis on the prediction accuracy, we evaluated the performance of the PM predictor in two ways: when the air quality and meteorological data from the 384 monitoring stations are input to the models directly after being normalized (Table 3), and when the STFV constructed after the spatiotemporal correlation analysis of these data is input to the models (Table 4). In the former case, the PM predictor acts simply as a CNN-LSTM model, and the input data of the model include the Beijing dataset and 5 groups of the other datasets, which are divided by region. The performance evaluation clearly showed that the proposed PM predictor and the LSTM model have smaller errors than the MLP model for both types of input data. This shows that DNN has a prediction performance superior to MLP. In Tables 3 and 4, LSTM (RMSE 4.764, MAE 3.612, MAPE 0.068) shows an obvious improvement in prediction performance compared to MLP (RMSE 20.278, MAE 14.982, MAPE 0.258). This is related to the following advantage of LSTM. LSTM can process huge input air quality and meteorological data, and even the longest sequence data, without vanishing gradients. In other words, LSTM can sufficiently reflect the long-term historic process in the input time series data. This is the superiority of LSTM over MLP. In addition, it was demonstrated that when CNN, which can efficiently extract the inherent features of long-term input data, is combined with LSTM, the prediction performance improves further compared to when LSTM is used alone, in both of the above cases. Moreover, the comparison of Tables 3 and 4 showed that the PM predictor considering spatiotemporal correlation is more adaptable than the other methods.

3.2.2. The influence of meteorological data on prediction performance

In order to evaluate the influence of meteorological data on the prediction accuracy, we input the STFV of air quality data and of meteorological data separately to the model, and the results are shown in Tables 5 and 6. As shown in these tables, the air quality data improved the prediction performance of the model compared to the meteorological data. In the comparison of Tables 4 and 5 for the proposed PM predictor, the performance indexes had smaller errors for the spatiotemporal analysis using the integration of air quality and meteorological data (RMSE 2.997, MAE 2.21, MAPE 0.039) than for the spatiotemporal analysis using only the air quality data (RMSE 3.763, MAE 2.932, MAPE 0.052). This shows that the integration of air quality and meteorological data provides a higher accuracy and stability. In Table 3, the performance indexes are RMSE 5.357, MAE 4.971 and MAPE 0.081, and in Table 6, the performance indexes are RMSE 5.287, MAE 5.21 and MAPE 0.08; in the comparison of Tables 3 and 6, the performance indexes show no big difference. Moreover, the MAE value was larger when the STFV of meteorological data was considered (5.21) than when it was not (4.971). This draws attention to the fact that the use of meteorological data alone cannot improve the prediction performance even when the spatiotemporal correlation analysis of the data is conducted. Therefore, it was demonstrated that air quality data are more effective than meteorological data in improving the model's accuracy, and that the spatiotemporal correlation analysis considering both air quality and meteorological data is superior to all the other methods.

4. Conclusion

In this paper, we proposed a spatiotemporal PM predictor, which combines an MI estimator with a CNN-LSTM model, to predict the next day’s daily average PM2.5 concentration in Beijing City. First, the STFV suitable for the CNN-LSTM model was constructed through spatiotemporal correlation analysis by the MI estimator, which considers both linear and nonlinear correlations. The spatiotemporal correlation associated with the target parameter was fully considered and effectively extracted for the 384 monitoring stations, which represent the whole area of China with the target monitoring station as the center, and also for the historical air quality and meteorological data. In addition, fast and accurate prediction was secured by efficiently extracting, through the CNN, the inherent latent features of the air quality and meteorological input data associated with PM2.5, and by fully reflecting, through the LSTM, the long-term historical process of the input time series. The performance evaluation and comparison demonstrated that air quality data have a larger impact on prediction performance than meteorological data, and that the proposed method considering the spatiotemporal correlation shows higher accuracy and stability than the MLP and LSTM models. To the best of our knowledge, an approach that combines MI with deep learning to predict air quality has not been reported before. In the future, we will try to limit the application range of the proposed approach to the target station itself or to some neighboring stations centered on the target station.
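The MI-based selection of delayed features summarized above can be illustrated with a small Python sketch. It is not the estimator used in this study: a plain histogram estimate of mutual information (in nats) is substituted for simplicity, and the function names, the bin count and the threshold scale are assumptions made only for illustration.

```python
import math
from collections import Counter


def _bin_indices(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information between x and y (nats)."""
    n = len(x)
    bx, by = _bin_indices(x, bins), _bin_indices(y, bins)
    joint = Counter(zip(bx, by))
    cx, cy = Counter(bx), Counter(by)
    # sum over occupied cells of p(a, b) * log(p(a, b) / (p(a) * p(b)))
    return sum(
        (c / n) * math.log(c * n / (cx[a] * cy[b])) for (a, b), c in joint.items()
    )


def select_features(target, candidates, max_delay, threshold, bins=8):
    """Keep (name, delay, MI) triples whose MI with the target reaches the threshold.

    candidates maps a hypothetical feature name (for example a station and
    pollutant label) to a series aligned in time with the target series.
    """
    selected = []
    for name, series in candidates.items():
        for delay in range(1, max_delay + 1):
            # Pair the candidate at time t - delay with the target at time t.
            mi = mutual_information(series[:-delay], target[delay:], bins)
            if mi >= threshold:
                selected.append((name, delay, mi))
    return sorted(selected, key=lambda item: -item[2])
```

The selected (feature, delay) pairs would then be stacked into the input vector of the CNN-LSTM model. Because this histogram estimate is unnormalized, its threshold values are not comparable with the correlation thresholds listed in Table 2.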

SC

Acknowledgements

This study was supported by a grant from the Environmental Science and Engineering Research Council, Democratic People’s Republic of Korea. The authors thank the Ministry of Ecology and Environment of the People’s Republic of China and Shanghai 2345 Network Technology Co., Ltd. for providing the experimental data for this work. The critical reading of the manuscript by the anonymous reviewer is greatly appreciated.

Appendix A. CNN related work

CNN is a neural network inspired by the human neural system that shows excellent performance in a wide range of applications (Ko, 2018). A typical CNN is a hierarchical model that alternates two computational layers (a convolutional layer and a subsampling, or pooling, layer) and finally performs classification through a fully connected layer, as shown in Fig. 1. The convolutional layer extracts features from the input data in a sliding-window manner, producing feature maps that express the temporal arrangement of the time series data. The weights of the convolutional filter that performs the feature mapping are shared within the convolutional layer and are locally connected to the input data. The subsampling layer reduces the output dimension by average-pooling or max-pooling over the feature maps of the convolutional layer; therefore, variations such as minute shifts or deformations in the input data can be ignored. The last fully connected layer of the CNN generates the output of the CNN model. Fig. 1 shows the processing of the CNN for multivariate input time series data, where M and E denote the number of input features and the time length, respectively, and Ci and Si denote the i-th convolutional and subsampling layers, respectively. Let CKi and CEi denote the kernel size and the time length in layer Ci, and let SPi and SEi denote the pooling size and the time length in layer Si. Then CE1 = E - CK1 + 1; similarly, CE2 = SE1 - CK2 + 1, and SEi = CEi / SPi. Here, CMi and SMi denote the number of filter outputs in layer Ci and the number of outputs in layer Si, respectively. FNN is a fully connected layer whose output is the output of the entire network. Thus, the CNN is trained on data by alternating between convolutional and subsampling layers, so that it can fully reflect the characteristics of time series or sequence data.
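The length relations CE1 = E - CK1 + 1, CE2 = SE1 - CK2 + 1 and SEi = CEi / SPi can be traced with a short Python sketch. The function name and the example sizes are hypothetical, and floor division is assumed when CEi is not a multiple of SPi, which the text does not specify.

```python
def cnn_time_lengths(E, kernel_sizes, pool_sizes):
    """Return [(CE_i, SE_i), ...] for alternating convolution and pooling layers.

    E            : time length of the input series
    kernel_sizes : CK_i for each convolutional layer C_i
    pool_sizes   : SP_i for each subsampling layer S_i
    """
    if len(kernel_sizes) != len(pool_sizes):
        raise ValueError("each convolutional layer needs one pooling layer")
    lengths = []
    current = E
    for ck, sp in zip(kernel_sizes, pool_sizes):
        ce = current - ck + 1  # CE_i = (previous length) - CK_i + 1
        if ce < 1:
            raise ValueError("kernel is longer than its input")
        se = ce // sp  # SE_i = CE_i / SP_i (floor division is an assumption)
        if se < 1:
            raise ValueError("pooling window is longer than its input")
        lengths.append((ce, se))
        current = se
    return lengths


if __name__ == "__main__":
    # Hypothetical sizes: 30 time steps, two layers with kernel 3 and pooling 2.
    print(cnn_time_lengths(30, [3, 3], [2, 2]))
```

For the hypothetical sizes in the usage lines, the time length shrinks from 30 to 28 and 14 in the first layer pair and to 12 and 6 in the second.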

Appendix B. LSTM related work

LSTM is a special type of recurrent neural network that is able to learn long-term dependencies; it is designed to solve the long-term dependency problem by means of short-term memory (Hochreiter and Schmidhuber, 1997). LSTM can process long sequence data while largely avoiding the vanishing gradient problem, and it is now widely used for sequence data problems such as speech recognition, natural language processing and automatic annotation of images. As shown in Fig. 2, the LSTM has a complex recurrent structure in a single cell, and the cells are connected chronologically. The LSTM has two state values: the hidden state V(t) of the cell, which changes with time, and the cell state C(t), which makes it possible to maintain memory over the long term. In Fig. 2, the cell state runs horizontally along the top line of the LSTM cell in the block diagram. The LSTM can add information to or remove it from the cell state. The forget gate F(t) adjusts the connection of the input U(t) and the previous hidden state V(t-1) to the cell state C(t), allowing the cell to remember or forget U(t) and V(t-1) when needed. The input gate I(t), together with Î(t), determines whether to feed the input value to the cell state C(t). The output gate O(t) determines the output based on the cell state C(t). This process is shown in Eqs. (1)-(6).

$$I(t) = \sigma\left(W_i\,[V(t-1),\,U(t)] + B_i\right), \tag{1}$$

$$F(t) = \sigma\left(W_f\,[V(t-1),\,U(t)] + B_f\right), \tag{2}$$

$$O(t) = \sigma\left(W_o\,[V(t-1),\,U(t)] + B_o\right), \tag{3}$$

$$\hat{I}(t) = \tanh\left(W_{\hat{i}}\,[V(t-1),\,U(t)] + B_{\hat{i}}\right), \tag{4}$$

$$C(t) = F(t) \odot C(t-1) + I(t) \odot \hat{I}(t), \tag{5}$$

$$V(t) = O(t) \odot \tanh\left(C(t)\right), \tag{6}$$

where W and B denote the weight matrix and bias vector, respectively; σ(·) denotes the sigmoid function; tanh(·) denotes the hyperbolic tangent function; and ⊙ denotes element-wise multiplication. From the above figure and equations, it can be seen that the LSTM precisely controls how the internal state and the input data are reflected in the cell state; in addition, both fixed-length and variable-length data can be processed at the input and output. These advantages are more prominent when the LSTM is combined with other types of DNN than when it is used alone.
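A single time step of the gate computations in Eqs. (1)-(6) can be sketched in plain Python as follows. This is only an illustration: the parameter names (such as "Wi_hat") are hypothetical, a separate weight matrix and bias for Î(t) are assumed, and the products in the state updates are taken to be element-wise.

```python
import math


def sigmoid(z):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, B, x):
    """Compute W x + B for a weight matrix W (list of rows) and a bias vector B."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]


def lstm_step(u_t, v_prev, c_prev, params):
    """One LSTM step: returns the new hidden state V(t) and cell state C(t)."""
    x = list(v_prev) + list(u_t)  # concatenation [V(t-1), U(t)]
    i_t = [sigmoid(z) for z in affine(params["Wi"], params["Bi"], x)]  # Eq. (1)
    f_t = [sigmoid(z) for z in affine(params["Wf"], params["Bf"], x)]  # Eq. (2)
    o_t = [sigmoid(z) for z in affine(params["Wo"], params["Bo"], x)]  # Eq. (3)
    # Eq. (4): candidate value; separate parameters are an assumption here.
    i_hat = [math.tanh(z) for z in affine(params["Wi_hat"], params["Bi_hat"], x)]
    # Eq. (5): forget part of the old cell state and add the gated candidate.
    c_t = [f * c + i * ih for f, c, i, ih in zip(f_t, c_prev, i_t, i_hat)]
    # Eq. (6): the output gate scales the squashed cell state.
    v_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return v_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and Î(t) to 0, so the cell state is simply halved at each step; this gives a quick sanity check of an implementation.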

References

Ahn, J., Shin, D., Kim, K., Yang, J., 2017. Indoor air quality analysis using deep learning with sensor data. Sensors. 17, 2476. https://doi.org/10.3390/s17112476.
Antanasijević, DZ., Pocajt, VV., Povrenović, DS., Ristić, MĐ., Perić-Grujić, A., 2013. PM10 emission forecasting using artificial neural networks and genetic algorithm input variable optimization. Sci. Total. Environ. 443, 511–519. https://doi.org/10.1016/j.scitotenv.2012.10.110.
Baklanov, A., Mestayer, PG., Clappier, A., Zilitinkevich, S., Joffre, S., Mahura, A., Nielsen, NW., 2008. Towards improving the simulation of meteorological fields in urban areas through updated/advanced surface fluxes description. Atmos. Chem. Phys. 8(3), 523–543. https://doi.org/10.5194/acp-8-523-2008.


Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends. Mach. Learn. 2(1), 1–127. https://doi.org/10.1561/2200000006.
Biancofiore, F., Verdecchia, M., Carlo, PD., Tomassetti, B., Aruffo, E., Busilacchio, M., Bianco, S., Tommaso, SD., Colangeli, C., 2015. Analysis of surface ozone using a recurrent neural network. Sci. Total. Environ. 514, 379–387. https://doi.org/10.1016/j.scitotenv.2015.01.106.
Bray, CD., Battye, W., Aneja, VP., Tong, D., Lee, P., Tang, Y., Nowak, JB., 2017. Evaluating ammonia (NH3) predictions in the NOAA National Air Quality Forecast Capability (NAQFC) using in-situ aircraft and satellite measurements from the CalNex2010 campaign. Atmos. Environ. 163, 65–76. https://doi.org/10.1016/j.atmosenv.2017.05.032.
Carlo, PD., Pitari, G., Mancini, E., Gentile, S., Pichelli, E., Visconti, G., 2007. Evolution of surface ozone in central Italy based on observations and statistical model. J. Geophys. Res. 112, D10316. https://doi.org/10.1029/2006JD007900.
Castellano, M., Franco, A., Cartelle, D., Febrero, M., Roca, E., 2009. Identification of NOx and ozone episodes and estimation of ozone by statistical analysis. Water. Air. Soil. Pollut. 198, 95–110. https://doi.org/10.1007/s11270-008-9829-2.
Chan, TH., Jia, K., Gao, S., Lu, J., Zeng, Z., Ma, Y., 2015. PCANet: a simple deep learning baseline for image classification? IEEE Trans. Image. Process. 24(12), 5017–5032. https://doi.org/10.1109/TIP.2015.2475625.
Chattopadhyay, S., Chattopadhyay, G., 2012. Modeling and prediction of monthly total ozone concentrations by use of an artificial neural network based on principal component analysis. Pure. Appl. Geophys. 169(10), 1891–1908. https://doi.org/10.1007/s00024-011-0437-5.
Collobert, R., Weston, J., 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning. ACM, New York, pp. 160–167. https://doi.org/10.1145/1390156.1390177.
Donnelly, A., Misstear, B., Broderick, B., 2015. Real time air quality forecasting using integrated parametric and non-parametric regression techniques. Atmos. Environ. 103, 53–65. https://doi.org/10.1016/j.atmosenv.2014.12.011.
Doquire, G., Verleysen, M., 2012. A comparison of multivariate mutual information estimators for feature selection. In: Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pp. 176–185. https://doi.org/10.5220/0003726101760185.
Dutot, AL., Rynkiewicz, J., Steiner, FE., Rude, J., 2007. A 24-h forecast of ozone peaks and exceedance levels using neural classifiers and weather predictions. Environ. Model. Softw. 22(9), 1261–1269. https://doi.org/10.1016/j.envsoft.2006.08.002.
Feng, X., Li, Q., Zhu, Y., Hou, J., Jin, L., Wang, J., 2015. Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation. Atmos. Environ. 107, 118–128. https://doi.org/10.1016/j.atmosenv.2015.02.030.
Gennaro, G., Trizio, L., Gilio, AD., Pey, J., Pérez, N., Cusack, M., Alastuey, A., Querol, X., 2013. Neural network model for the prediction of PM10 daily concentrations in two sites in the Western Mediterranean. Sci. Total. Environ. 463–464, 875–883. https://doi.org/10.1016/j.scitotenv.2013.06.093.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural. Comput. 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
Jain, N., Murthy, CA., 2016. A new estimate of mutual information based measure of dependence between two variables: properties and fast implementation. Int. J. Mach. Learn. Cybern. 7(5), 857–875. https://doi.org/10.1007/s13042-015-0418-6.
Kampa, M., Castanas, E., 2008. Human health effects of air pollution. Environ. Pollut. 151(2), 362–367. https://doi.org/10.1016/j.envpol.2007.06.012.
Kim, Y., Fu, JS., Miller, TL., 2010. Improving ozone modeling in complex terrain at a fine grid resolution: part I - examination of analysis nudging and all PBL schemes associated with LSMs in meteorological model. Atmos. Environ. 44(4), 523–532. https://doi.org/10.1016/j.atmosenv.2009.10.045.
Ko, BC., 2018. A brief review of facial emotion recognition based on visual information. Sensors. 18(2), 401. https://doi.org/10.3390/s18020401.
Li, T., Shen, H., Yuan, Q., Zhang, X., Zhang, L., 2017. Estimating ground-level PM2.5 by fusing satellite and station observations: a geo-intelligent deep learning approach. Geophys. Res. Lett. 44, 11,985–11,993. https://doi.org/10.1002/2017GL075710.
Li, X., Peng, L., Yao, X., Cui, S., Hu, Y., You, C., Chi, T., 2017. Long short-term memory neural network for air pollutant concentration predictions: method development and evaluation. Environ. Pollut. 231, 997–1004. https://doi.org/10.1016/j.envpol.2017.08.114.
Mohammad, A., Nima, K., Mohammad, MR., 2013. Predicting hourly air pollutant levels using artificial neural networks coupled with uncertainty analysis by Monte Carlo simulations. Environ. Sci. Pollut. Res. 20, 4777–4789. https://doi.org/10.1007/s11356-012-1451-6.

Mohamed, AR., Sainath, TN., Dahl, G., Ramabhadran, B., Hinton, GE., Picheny, MA., 2011. Deep belief networks using discriminative features for phone recognition. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP.2011.5947494.
Ong, BT., Sugiura, K., Zettsu, K., 2014. Dynamic pre-training of deep recurrent neural networks for predicting environmental monitoring data. IEEE Int. Conf. Big. Data. 16(2), 760–765. https://doi.org/10.1109/BigData.2014.7004302.
Pak, U., Kim, C., Ryu, U., Sok, K., Pak, S., 2018. A hybrid model based on convolutional neural networks and long short-term memory for ozone concentration prediction. Air. Qual. Atmos. Health. 11(8), 883–895. https://doi.org/10.1007/s11869-018-0585-1.
Perez, P., Reyes, J., 2006. An integrated neural network model for PM10 forecasting. Atmos. Environ. 40(16), 2845–2851. https://doi.org/10.1016/j.atmosenv.2006.01.010.


Pérez, P., Trier, A., Reyes, J., 2000. Prediction of PM2.5 concentrations several hours in advance using neural networks in Santiago, Chile. Atmos. Environ. 34(8), 1189–1196. https://doi.org/10.1016/s1352-2310(99)00316-7.
Ryu, U., Wang, J., Kim, T., Kwak, S., U, J., 2018. Construction of traffic state vector using mutual information for short-term traffic flow prediction. Transport. Res. Part C. 96, 55–71. https://doi.org/10.1016/j.trc.2018.09.015.
Soh, PW., Chang, JW., Huang, JW., 2018. Adaptive deep learning-based air quality prediction model using the most relevant spatial-temporal relations. IEEE Access. 6, 38186–38199. https://doi.org/10.1109/ACCESS.2018.2849820.
Voukantsis, D., Karatzas, K., Kukkonen, J., Räsänen, T., Karppinen, A., Kolehmainen, M., 2011. Intercomparison of air quality data using principal component analysis, and forecasting of PM10 and PM2.5 concentrations using artificial neural networks, in Thessaloniki and Helsinki. Sci. Total. Environ. 409(7), 1266–1276. https://doi.org/10.1016/j.scitotenv.2010.12.039.
WHO., 2003. Health aspects of air pollution with particulate matter, ozone and nitrogen dioxide. Tech. Rep. WHO.
WHO., 2016. Ambient air pollution: A global assessment of exposure and burden of disease. WHO Library Cataloguing-in-Publication Data.
Woody, MC., Wong, HW., West, JJ., Arunachalam, S., 2016. Multiscale predictions of aviation-attributable PM2.5 for U.S. airports modeled using CMAQ with plume-in-grid and an aircraft-specific 1-D emission model. Atmos. Environ. 147, 384–394. https://doi.org/10.1016/j.atmosenv.2016.10.016.
Xing, YF., Xu, YH., Shi, MH., Lian, YX., 2016. The impact of PM2.5 on the human respiratory system. J. Thorac. Dis. 8(1), 69–74. https://doi.org/10.3978/j.issn.2072-1439.2016.01.19.


Yildirim, Y., Bayramoglu, M., 2006. Adaptive neuro-fuzzy based modelling for prediction of air pollution daily levels in city of Zonguldak. Chemosphere. 63(9), 1575–1582. https://doi.org/10.1016/j.chemosphere.2005.08.070.
Zhang, CY., Chen, CLP., Gan, M., Chen, L., 2015. Predictive deep Boltzmann machine for multiperiod wind speed forecasting. IEEE Trans. Sustain. Energy. 6(4), 1416–1425. https://doi.org/10.1109/TSTE.2015.2434387.
Zhou, G., Xu, J., Xie, Y., Chang, L., Gao, W., Gu, Y., Zhou, J., 2017. Numerical air quality forecasting over eastern China: an operational application of WRF-Chem. Atmos. Environ. 153, 94–108. https://doi.org/10.1016/j.atmosenv.2017.01.020.
Zhou, Q., Jiang, H., Wang, J., Zhou, J., 2014. A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network. Sci. Total. Environ. 496, 264–274. https://doi.org/10.1016/j.scitotenv.2014.07.051.

Table 1 Air quality and meteorological data used in the modeling study.

| No. | Input parameter | Timing | Unit |
|---|---|---|---|
| 1 | PM2.5 | | μg/m3 |
| 2 | PM10 | | μg/m3 |
| 3 | SO2 | | μg/m3 |
| 4 | CO | | mg/m3 |
| 5 | NO2 | | μg/m3 |
| 6 | O3 | 8-h average | μg/m3 |
| 7 | Maximum temperature (maxt) | Daily data | |
| 8 | Minimum temperature (mint) | | |
| 9 | Weather Grade (weag) | | |
| 10 | Wind Direction (wind) | | |
| 11 | Wind Grade (wing) | | - |

| No. | Output parameter | Timing | Unit |
|---|---|---|---|
| 1 | Next day’s daily average PM2.5 concentration | daily average | μg/m3 |

Table 2 Effect of the correlation threshold on the present PM predictor.

| Correlation threshold | RMSE (μg/m3) | MAE (μg/m3) | MAPE |
|---|---|---|---|
| 0.3 | 3.115 | 2.501 | 0.053 |
| 0.25 | 2.973 | 2.271 | 0.041 |
| 0.2 | 2.971 | 2.27 | 0.039 |
| 0.15 | 2.891 | 2.15 | 0.038 |
| 0.1 | 2.875 | 2.117 | 0.037 |
| 0.05 | 2.973 | 2.274 | 0.04 |

Table 3 Comparison of the prediction performance of MLP, LSTM and the PM predictor when the normalized input data are directly used.

| | MLP | LSTM | PM predictor |
|---|---|---|---|
| RMSE (μg/m3) | 37.792 | 11.337 | 5.357 |
| MAE (μg/m3) | 31.59 | 10.985 | 4.971 |
| MAPE | 0.549 | 0.189 | 0.081 |

Table 4 Comparison of the prediction performance of MLP, LSTM and the PM predictor when the STFV is input.

| | MLP | LSTM | PM predictor |
|---|---|---|---|
| RMSE (μg/m3) | 20.278 | 4.764 | 2.997 |
| MAE (μg/m3) | 14.982 | 3.612 | 2.21 |
| MAPE | 0.258 | 0.068 | 0.039 |

Table 5 Comparison of the prediction performance of MLP, LSTM and the PM predictor when only the air quality data are input.

| | MLP | LSTM | PM predictor |
|---|---|---|---|
| RMSE (μg/m3) | 26.639 | 8.11 | 3.763 |
| MAE (μg/m3) | 21.01 | 6.791 | 2.932 |
| MAPE | 0.362 | 0.128 | 0.052 |

Table 6 Comparison of the prediction performance of MLP, LSTM and the PM predictor when only the meteorological data are input.

| | MLP | LSTM | PM predictor |
|---|---|---|---|
| RMSE (μg/m3) | 35.936 | 11.229 | 5.287 |
| MAE (μg/m3) | 28.881 | 10.637 | 5.21 |
| MAPE | 0.502 | 0.183 | 0.08 |


Fig. 1. The distribution of air quality monitoring stations in China. (The color of each station represents the rank of daily average PM2.5 on Jan. 1st, 2015, as described in the bottom right of the figure. For interpretation of the references to color in this figure legend, the reader is referred to the website https://www.aqistudy.cn/.)


Fig. 2. The temporal diagram showing the temporal difference between the target time series and the delayed time series.


Fig. 3. The spatiotemporal correlation features according to MI values. (In the format A_B-C on the horizontal axis, A is the name of the monitoring station, B is the name of the feature in the input data, and C is the delay time. The station name is formed from the first letters of its English name, and the feature names are listed in Table 1.)


Fig. 4. Proposed PM predictor.


Fig. 5. The prediction result of PM2.5 concentration during the period from July 2016 to June 2017 (1 year).


Fig. 6. The prediction result of PM2.5 concentration during the period from November 2016 to February 2017 (4 months).


Appendix A. CNN related work


Fig. 1. CNN structure for multivariate time series data.


Appendix B. LSTM related work


Fig. 2. Network structure of the LSTM.


GRAPHICAL ABSTRACT

HIGHLIGHTS

An MI estimator and a CNN-LSTM model are proposed to predict the next day’s daily average PM2.5 concentration in Beijing City.
The spatiotemporal correlation between the input variables of the model is analyzed by the MI estimator, which improves the prediction accuracy at the present station.
The CNN-LSTM model takes the STFV as input and predicts the PM2.5 concentration through deep learning.
