A Gaussian process mixture model-based hard-cut iterative learning algorithm for air quality prediction


Applied Soft Computing Journal 85 (2019) 105789


Yatong Zhou a, Xiangyu Zhao a, Kuo-Ping Lin b,c,∗, Ching-Hsin Wang d, Lingling Li e

a School of Electronics and Information Engineering, Hebei University of Technology, 5340 Xiping Road, Beichen District, Tianjin 300401, PR China
b Department of Industrial Engineering and Enterprise Information, Tunghai University, Taichung 40704, Taiwan
c Institute of Innovation and Circular Economy, Asia University, Taichung 41354, Taiwan
d Department of Leisure Industry Management, National Chin-Yi University of Technology, Taichung 41170, Taiwan
e State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology, Tianjin 300130, China

Article info

Article history: Received 28 March 2018; Received in revised form 12 August 2019; Accepted 7 September 2019; Available online 24 September 2019.

Keywords: Gaussian processes mixtures; Gaseous pollutant time series; Prediction; Machine learning

Abstract: Air quality is closely related to the concentrations of gaseous pollutants, and the prediction of gaseous pollutant concentrations plays a decisive role in regulating plant and vehicle emissions. Due to the non-linear and chaotic characteristics of gas concentration series, traditional models may not easily capture the complex time series patterns. In this study, the Gaussian Process Mixture (GPM) model, trained with the hidden variables posterior hard-cut (HC) iterative learning algorithm, is applied for the first time to the prediction of gaseous pollutant concentrations in order to improve prediction performance. The algorithm adopts iterative learning and uses maximum a posteriori (MAP) estimation to achieve an optimal grouping of the samples, which effectively improves expectation–maximization (EM) learning in GPM. The empirical results of the GPM model reveal improved prediction accuracy in gaseous pollutant concentration prediction, as compared with the kernel regression (K-R), minimax probability machine regression (MPMR), linear regression (L-R) and Gaussian Process (GP) models. Furthermore, GPM with various learning algorithms, namely the HC, Leave-One-Out Cross-Validation (LOOCV) and Variational algorithms, is also examined in this study. The results show that GPM with HC learning achieves superior performance compared with the other learning algorithms. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

With ever-increasing population growth and rapid industrial development and expansion, air quality in China is declining year by year, which causes significant harm to human health and the environment [1–4]. In preventing air pollution, early warning mechanisms and the advance start-up of emission reductions are of particular importance, so more accurate prediction of gaseous pollutant concentrations has great practical significance. To date, research on airborne pollutant concentration prediction is still at an exploratory stage. Prediction models such as the Kalman filter [5], Artificial Neural Networks (ANN) and Support Vector Machines (SVM) [6] are commonly adopted for air pollution prediction. Wang et al. [7] applied a back-propagation neural network (BPNN) system to predict atmospheric particulate content. Shen et al. [8] used the optimal brain surgeon method to prune neural

∗ Corresponding author at: Department of Industrial Engineering and Enterprise Information, Tunghai University, Taichung 40704, Taiwan. E-mail addresses: [email protected], [email protected] (K.-P. Lin). https://doi.org/10.1016/j.asoc.2019.105789 1568-4946/© 2019 Elsevier B.V. All rights reserved.

network models. Song et al. [9] combined the advantages of two models to adapt to different prediction intervals. Zhao et al. [10] combined a genetic algorithm with ANN and proposed a predictive model that reduces the dimensionality of the input factors and avoids the curse of dimensionality. Si et al. [11] proposed a combined model of an improved grey system and ANN to predict air quality, which used the output of the grey system as the input of a BP network to make more accurate predictions. However, these ANN-based models have high computational complexity and are prone to over-fitting or to becoming trapped in local minima. Muñoz et al. [12] used different classifiers to predict daily peaks of PM10 concentrations in a region, demonstrating that SVM prediction is significantly better than ANN. Leng et al. [13] applied SVM to predict the content of metal elements in the air. Nieto et al. [14] proposed an air quality prediction model based on SVM, which was used to predict the concentrations of pollutants such as CO, SO2 and O3 in the air. After that, García Nieto et al. [15] proposed a hybrid PSO-SVM, which combined SVM with particle swarm optimization (PSO) to predict the concentration of cyanobacteria in reservoirs. Wang et al. [16] proposed a model based on the wavelet transform and SVM to predict PM10 concentrations in the air. Chen et al. [17] used SVM to predict methane gas


concentrations and improved the prediction accuracy by optimizing the model parameters in the leave-one-out cross-validation (LOOCV) framework. Ni et al. [18] proposed an SVM predictor based on an artificial fish swarm and fractal dimension. However, SVM is not a probabilistic model and cannot output confidence intervals, which reduces prediction reliability. In addition, Wang et al. [19] combined two-phase decomposition optimized by a differential evolution algorithm with an extreme learning machine (ELM) to predict air quality. Prasad et al. [20] used an adaptive neuro-fuzzy inference system to predict the concentrations of five air pollutants, considering meteorological factors (pressure, temperature, relative humidity, visibility, etc.) and the pollutant concentrations of the previous day. Donnelly et al. [21] proposed a real-time air quality prediction model based on integrated parameters and non-parametric kernel regression, where kernel regression is used to incorporate factors such as wind speed, wind direction, season, and day or night. However, the above models consider a large number of factors and neglect the chaotic characteristics of the gaseous pollutant concentration series, which restricts them in practical prediction. The Gaussian process (GP) model has been successfully used to predict chaotic time series since it combines the advantages of both ANN and SVM [22,23]. However, a single GP model usually ignores subtle differences between different periods of the series, which causes bias and inaccuracy in prediction. Tresp [24] proposed the Gaussian process mixture (GPM) model in 2000, which adopts a ''divide and conquer'' strategy: the samples are divided into several components and a GP is assigned to each component for learning and prediction. This strategy not only reduces the computational complexity, but also finely depicts the nuances among different time periods of the series [25].
Classical GPM learning algorithms include the Markov Chain Monte Carlo (MCMC), Variational Bayes (VB) and Expectation–Maximization (EM) algorithms. Among these, the MCMC algorithm converges slowly when there are many parameters, and its convergence condition is difficult to determine. The VB algorithm assumes that the parameters are conditionally independent of the particular learning samples, but this assumption is very doubtful: the estimated posterior can differ significantly from the real one when the parameters are highly correlated. The EM algorithm is better than the first two and has a wide range of applications. It usually adopts approximate strategies, such as leave-one-out cross-validation (LOOCV) and the Variational algorithm, considering that all the samples in the GPM model are jointly correlated. However, LOOCV is time-consuming, while variational inference is sensitive to the initial values of the parameters, resulting in large prediction errors. In this study, we propose to use the hidden variables posterior hard-cut (HC) iterative learning algorithm [26] for GPM in air quality prediction. The HC algorithm, proposed by the authors several years ago, is essentially an EM algorithm in which the samples are divided into different groups according to the maximum a posteriori (MAP) criterion in the E step, and the parameters of each group of samples are estimated by Maximum Likelihood Estimation (MLE) in the M step, to achieve an optimal grouping of the samples. In the experiments, HC was compared with the LOOCV and Variational algorithms given the same GPM model. Additionally, GPM was compared with the K-R, MPMR, L-R and GP prediction models. The HC algorithm was found to be more accurate than the other two learning algorithms, and the GPM model more accurate than the other four traditional models.

2. Gaussian process mixture model: An overview

2.1. Brief review of the GP model

Given the learning sample set D = {(x_t, y_t)}_{t=1}^{n}, x_t is an input sample and y_t is the corresponding target value. Denote the regression function as f(X), where X = (x_1, x_2, ..., x_n)^T. For a finite sample set D, f(X) = [f(x_1), f(x_2), ..., f(x_n)]^T has a joint Gaussian distribution. The Gaussian process is expressed as:

f(X) ∼ GP(µ(X), K(X, X′)),  (1)

where µ(X) is the mean function and K(X, X′) is the covariance function. After adding noise:

y = f(X) + ε,  (2)

where ε is independent Gaussian white noise satisfying ε ∼ N(0, σw² I), in which I is an identity matrix and σw² is the noise variance. From Eqs. (1) and (2), y is still a Gaussian process:

y ∼ GP(µ(X), K(X, X′) + σw² δ(X, X′)),  (3)

where µ(X) is the mean function and K(X, X′) + σw² δ(X, X′) is the covariance function, in which δ(X, X′) = 1 if X = X′ and δ(X, X′) = 0 otherwise. Usually, a zero mean function is adopted, and thus the prior distribution of y is:

y ∼ N(0, K(X, X′) + σw² I).  (4)

Then, for a test sample {x∗, y∗}, the learning samples' output y and the test sample's output y∗ are subject to the joint Gaussian distribution:

[y; y∗] ∼ N(0, [K(X, X) + σw² I, K(X, x∗); K(x∗, X), K(x∗, x∗) + σw²]),  (5)

where the kernel function usually used is the Squared Exponential (SE) function [19], defined as follows:

K(x, x′) = σg² exp(−‖x − x′‖² / (2l²)),  (6)

where σg² is the variance of the kernel function and l is the feature width. Let θ = {l, σg, σw} denote the vector of undetermined hyper-parameters of the GP model, which must be determined in the learning process. Based on the Bayesian formula, the predictive distribution of y∗ is obtained as:

p(y∗|y) ∼ N(µ∗, σ²_{y∗}),  (7)

where µ∗ and σ²_{y∗} are the mean and variance of the test sample's output, respectively:

µ∗ = K(x∗, X) [K(X, X) + σw² I]^{−1} y,  (8)

σ²_{y∗} = K(x∗, x∗) − K(x∗, X) [K(X, X) + σw² I]^{−1} K(X, x∗).  (9)

Using µ∗ as the predicted value, combined with σ²_{y∗}, the predictive confidence interval can be derived as [µ∗ − 2√(σ²_{y∗}), µ∗ + 2√(σ²_{y∗})].

2.2. The basic principles of GPM

GPM is a typical mixture model in which each component is a GP. Generally, z_t = c means that the t-th sample belongs to the c-th GP component. X_c and y_c denote the input matrix and output vector of the c-th GP component, respectively. Therefore, the GPM

can be expressed as:

p(y|X, θ) = ∏_{c=1}^{C} p(y_c|X_c, θ_c),  (10)

p(y_c|X_c, θ_c) = N(y_c|0, K(X_c, X_c|θ_c) + σ²_{wc} I), c = 1, 2, ..., C,  (11)

where C represents the total number of GP components and σ²_{wc} is the noise variance of the c-th GP component, equivalent to σw² in the previous section. Eq. (10) indicates the independence among the GP components, and Eq. (11) defines the c-th GP component containing the parameter set θ_c. The GPM model first uses a multinomial distribution as the gating function to generate the latent variables z_t:

p(z_t = c) = π_{ct},  (12)

where it can be assumed that π_{ct} = π_c for simplicity; that is, for each component, the mixing coefficient is the same for all samples [26]. The generative GPM model in [26] denotes the set of latent variables as z = [z_1, z_2, ..., z_N]^T, and the input matrix and output vector as X and y, respectively. Then, this generative GPM model can be expressed by the following equations [26]:

Pr(z_t = c) = π_c, t = 1, 2, ..., N,  (13)

p(x_t|z_t = c, θ_c) = N(x_t|m_c, S_c), t = 1, 2, ..., N,  (14)

p(y|X, θ) = ∏_{c=1}^{C} N(y_c|0, K(X_c, X_c|θ_c) + σ²_{wc} I),  (15)

where Eq. (14) adds a Gaussian input distribution to make the model generative, and Eq. (15) is directly derived from Eqs. (10) and (11). The parameter set of the c-th GP component is θ_c = {π_c, m_c, S_c, σ_{gc}, l_c, σ_{wc}}, where σ_{gc} and l_c are the parameters of the SE covariance function for the c-th GP component. Therefore, θ = {θ_c}_{c=1}^{C} is the set of all parameters of the GPM model, which will be estimated by the learning algorithm of the model. In a GPM model, each component can concentrate on a different input region because the input means m_c of the GP components differ. GPM adopts a divide-and-conquer strategy, so its prediction is better than that of a single GP model in many cases.

3. Learning algorithms of the GPM model

3.1. The traditional Variational learning algorithm

The Variational approximation was proposed by Nguyen and Bonilla [27]; its idea is to approximate the posterior of the latent variables by variational inference during the E step of EM learning. Specifically, suppose the given input sample set is {x_t}_{t=1}^{n}; then the output distribution is:

p(y_t|x_t, D, θ) = ∫ p(y_t|x_t, Ω, θ) p(Ω|D, θ) dΩ ≈ ∫ p(y_t|x_t, Ω, θ) q(Ω) dΩ ≈ p(y_t|x_t, Ω̂, θ),  (16)

where the two approximations are used to make the calculation more efficient and reasonable. The first approximation replaces the true posterior probability with the variational posterior q(Ω). The second approximation further replaces q(Ω) with Ω̂, the posterior mean of q(Ω). Therefore, the output distribution can be further expressed as:

p(y_t|x_t, Ω̂, θ) = Σ_{k=1}^{T} p(z_t = k, y_t|x_t, Ω̂, θ)
= Σ_{k=1}^{T} p(z_t = k|x_t, Ω̂) p(y_t|x_t, z_t = k, Ω̂, θ)
= Σ_{k=1}^{T} [p(z_t = k|Ω̂) p(x_t|z_t = k, Ω̂) / Σ_{i=1}^{T} p(z_t = i|Ω̂) p(x_t|z_t = i, Ω̂)] × p(y_t|x_t, z_t = k, Ω̂, θ).  (17)

Since p(y_t|x_t, z_t = k, Ω̂, θ) is Gaussian, the predictive value for the test samples is a weighted average of Gaussian means, and the weight is determined by p(z_t = k|Ω̂) p(x_t|z_t = k, Ω̂) / Σ_{i=1}^{T} p(z_t = i|Ω̂) p(x_t|z_t = i, Ω̂).
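As a concrete illustration, the single-GP prediction of Eqs. (8)–(9) and the gating-weighted mixture prediction of Eq. (17) can be sketched as follows. This is a minimal numerical sketch, not the authors' implementation: the SE kernel hyper-parameters are fixed by hand, and the two components, their toy data and all names are illustrative.

```python
import numpy as np

def se_kernel(A, B, sigma_g=1.0, l=1.0):
    # Squared Exponential kernel, Eq. (6), for 1-D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_g**2 * np.exp(-d2 / (2 * l**2))

def gp_predict(x_star, X, y, sigma_w=0.1):
    # Predictive mean and variance of a single GP, Eqs. (8)-(9)
    K = se_kernel(X, X) + sigma_w**2 * np.eye(len(X))
    k_star = se_kernel(np.array([x_star]), X)[0]
    mu = k_star @ np.linalg.solve(K, y)
    var = se_kernel(np.array([x_star]), np.array([x_star]))[0, 0] \
          - k_star @ np.linalg.solve(K, k_star)
    return mu, var

def gauss_pdf(x, m, s2):
    # 1-D Gaussian density N(x | m, s2)
    return np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Two toy components concentrated in different input regions
X1, y1 = np.array([0.0, 0.5, 1.0]), np.array([0.00, 0.45, 0.85])
X2, y2 = np.array([3.0, 3.5, 4.0]), np.array([0.20, -0.30, -0.70])
pi = [0.5, 0.5]                                  # mixing coefficients
m = [X1.mean(), X2.mean()]                       # input means m_c
S = [X1.var() + 1e-3, X2.var() + 1e-3]           # input variances S_c

x_star = 0.7
# Gating weights pi_c * N(x*|m_c, S_c), normalized (cf. Eq. (17))
w = np.array([pi[c] * gauss_pdf(x_star, m[c], S[c]) for c in range(2)])
w /= w.sum()
mus = [gp_predict(x_star, X1, y1)[0], gp_predict(x_star, X2, y2)[0]]
y_pred = float(w @ np.array(mus))   # weighted average of GP means
```

Since x∗ = 0.7 lies in the first component's input region, the gate assigns it almost all of the weight, and the mixture prediction essentially reduces to the first GP's predictive mean.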

3.2. The traditional LOOCV learning algorithm

LOOCV, proposed by Yang and Ma [28], is a heuristic EM learning algorithm. For a given sample set, LOOCV is adopted to derive the likelihood. Specifically, suppose the given sample set is {x_t, y_t}_{t=1}^{n}, where x_t is the input sample and y_t is the corresponding output. Using multiple GPs to process this sample set, the hyper-parameters of a single GP are denoted as θ_c. Denote X_{−t} and Y_{−t} as the remaining sample sets after removing x_t and y_t. Then the predictive distribution for the t-th sample is:

p(y_t, x_t|Y_{−t}, X_{−t}, θ) = Σ_{c=1}^{C} α_{tc} p(y_t|x_t, Y_{−t}, X_{−t}, A, θ_c) p(x_t|φ_c),  (18)

and the LOOCV likelihood p(Y, X, θ) can be expressed as:

p(Y, X, θ) = ∏_{t=1}^{n} Σ_{c=1}^{C} α_{tc} p(y_t|x_t, Y_{−t}, X_{−t}, A, θ_c) p(x_t|φ_c),  (19)

where A = {α_{tc}} and α_{tc} is the probability that {x_t, y_t} belongs to the c-th GP component, θ_c is the set of GP parameters of the c-th GP component, and φ_c incorporates the mean and variance of the Gaussian inputs of the c-th GP component. In sum, the LOOCV algorithm uses each sample in turn as a validation set to ensure the accuracy of the learning algorithm.

3.3. Hidden variables posterior hard-cut iterative learning algorithm

The hidden variables posterior hard-cut (HC) iterative learning algorithm, described in [26], is adopted for air quality prediction. The HC algorithm, proposed by the authors several years ago [26], significantly reduces the computational complexity of EM algorithms by using a hard-cut strategy in the E step: the samples are allocated to multiple GP components according to the MAP criterion. Then, the parameters of these GP components are estimated by MLE in the M step. The specific procedure is as follows:

Step 1: Given a sample set {x_t}_{t=1}^{N}, divide it into C groups by the k-means clustering algorithm.

Step 2: The parameters of each GP component are estimated by MLE, where the mixing coefficient π_c, the mean m_c and the covariance matrix S_c of the input distribution are respectively solved by the following analytical formulas:

π_c = N_c / N,
m_c = (1/N_c) Σ_{t∈A_c} x_t,  (20)
S_c = (1/N_c) Σ_{t∈A_c} (x_t − m_c)(x_t − m_c)^T,

where A_c = {t : z_t = c}, and the undetermined GP parameters can be obtained by maximizing the likelihood function below:

Q[θ; θ(k)] = ln p(Ẑ|θ) + Σ_{c=1}^{C} [ln p(X_c|θ_c) + ln p(y_c|X_c, θ_c)]
           = Σ_{c=1}^{C} {N_c ln π_c + Σ_{t∈A_c} ln N(x_t|m_c, S_c) + ln N(y_c|0, K(X_c, X_c|σ²_{fc}, l_c) + σ²_{vc} I)}.  (21)

Step 3: Reassign the learning samples to the GP components according to the MAP criterion:

ẑ_t = arg max_c p(y_t, x_t, z_t = c|θ_c(k)) / p(y_t, x_t|θ(k)) = arg max_c [π_c N(x_t|m_c, S_c) N(y_t|0, σ²_{fc} + σ²_{vc})].  (22)

If the reassignment result is consistent with the last iteration, the iteration stops and the output is the final result; otherwise, return to Step 2.

Step 4: After parameter learning, a given test sample x∗ can also be assigned to an appropriate GP component according to the MAP criterion:

ẑ_t = arg max_c p(z∗ = c|x∗) = arg max_c π_c N(x∗|m_c, S_c).  (23)
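The hard-cut iteration of Steps 1–3 can be sketched as follows. This is a simplified sketch rather than the authors' implementation: the k-means initialization is replaced by quantile-based centers, the inputs are one-dimensional, and the GP marginal likelihood term of Eq. (22) is approximated by a single Gaussian on the outputs; all names and the toy data are illustrative.

```python
import numpy as np

def gauss_pdf(x, m, s2):
    # 1-D Gaussian density N(x | m, s2)
    return np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def hard_cut_em(x, y, C=2, n_iter=20):
    # Step 1: initial grouping (quantile centers stand in for k-means)
    centers = np.quantile(x, np.linspace(0.1, 0.9, C))
    z = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    for _ in range(n_iter):
        # Step 2: MLE of pi_c, m_c, S_c per component, Eq. (20)
        pi = np.array([(z == c).mean() for c in range(C)])
        m = np.array([x[z == c].mean() for c in range(C)])
        S = np.array([x[z == c].var() + 1e-6 for c in range(C)])
        # crude per-component output variance; in the paper this role is
        # played by the GP marginal likelihood of Eq. (21)
        sy = np.array([y[z == c].var() + 1e-6 for c in range(C)])
        # Step 3: hard-cut reassignment by the MAP criterion, cf. Eq. (22)
        score = np.stack([pi[c] * gauss_pdf(x, m[c], S[c])
                          * gauss_pdf(y, 0.0, sy[c]) for c in range(C)])
        z_new = score.argmax(axis=0)
        if np.array_equal(z_new, z):   # converged: grouping unchanged
            break
        z = z_new
    return z, pi, m, S

# toy data: two well-separated input regions
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
y = np.array([0.10, 0.20, 0.10, 1.00, 1.10, 0.90])
z, pi, m, S = hard_cut_em(x, y, C=2)
```

A test sample would then be routed by Eq. (23), i.e. assigned to the component c maximizing π_c N(x∗|m_c, S_c), and predicted by that component alone.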

Then, the test sample is assigned to the group ẑ_t, and its corresponding output y∗ can be predicted by the ẑ_t-th GP component. That is, the parameters, as well as the learning samples used for the prediction, are derived from the ẑ_t-th GP component.

4. GPM prediction based on phase space reconstruction theory

Once the GPM model completes its learning, it can be used for the prediction of gaseous pollutant concentration series. If GPM adopts HC iterative learning, the GP memberships of all the training samples are already known after learning. Therefore, the test samples can be assigned to the corresponding GPs according to the MAP criterion, and the prediction results are then obtained from the single assigned GP. Nguyen and Bonilla [27] indicated that this prediction is usually more accurate than the weighted prediction. If GPM adopts Variational learning, the expected value of the random hidden variables is approximated by its estimate [29]; once the estimate of the hidden variables is determined, the predictive distribution is easy to obtain. If GPM adopts MCMC learning, many samples of the hidden variables are obtained; the predictive distribution of the model output is easily obtained for each set of samples, and the true predictive distribution can be approximated by the mean of these predictive distributions [30].

Given the gaseous pollutant concentration series {s(n)}_{n=1}^{∞}, GPM can be used to predict future values of the series. Phase space reconstruction theory ensures that this kind of prediction is reliable. Assume that there exists a smooth map f: R^d → R in the reconstructed phase space such that:

s(n) = f[s(n − 1), s(n − 2), ..., s(n − d)].

If the map f is known, the value of the series s at time n is uniquely determined by its d past values. Thus, the prediction problem is converted into a regression problem of estimating the map. It is important to choose a suitable embedding dimension d and time delay τ for the prediction [31]. The false nearest neighbor method is commonly used to estimate d, and the time-delayed mutual information provides an effective tool for determining a reasonable τ. Unfortunately, these methods are time-consuming. To achieve a more effective search, a grid traversal search strategy is adopted in the following experiments.
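The delay embedding and the grid traversal over (d, τ) can be sketched as follows. The one-step-ahead predictor is deliberately left generic — a 1-nearest-neighbour regressor on the reconstructed phase space stands in for the trained GPM — and the selection criterion is the RMSE used later in the experiments; the logistic-map series and all names are illustrative.

```python
import numpy as np

def embed(s, d, tau):
    # Build regression pairs x(n) = [s(n-1*tau), ..., s(n-d*tau)], y(n) = s(n)
    X, y = [], []
    for n in range(d * tau, len(s)):
        X.append([s[n - k * tau] for k in range(1, d + 1)])
        y.append(s[n])
    return np.array(X), np.array(y)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def predict_1nn(X_train, y_train, X_test):
    # Placeholder predictor standing in for the trained GPM:
    # 1-nearest-neighbour regression in the reconstructed phase space
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d2.argmin(axis=1)]

def grid_search(s, split, ds=range(2, 9), taus=range(1, 7)):
    # Grid traversal over d = 2..8 and tau = 1..6, scored by test RMSE
    best = None
    for d in ds:
        for tau in taus:
            X, y = embed(s, d, tau)
            cut = split - d * tau          # targets before `split` are training
            yp = predict_1nn(X[:cut], y[:cut], X[cut:])
            err = rmse(yp, y[cut:])
            if best is None or err < best[0]:
                best = (err, d, tau)
    return best

# toy chaotic series: logistic map, standing in for a pollutant series
s = [0.3]
for _ in range(499):
    s.append(3.9 * s[-1] * (1 - s[-1]))
err, d, tau = grid_search(np.array(s), split=400)
```

Because the logistic map is a deterministic one-dimensional system, small embeddings with τ = 1 already predict it well; for the real pollutant series the same traversal is scored with the trained GPM instead of the placeholder predictor.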

5.1. The source of the gaseous pollutant time series

The gaseous pollutants originate from a significantly contaminated area and are collected in real time by a metal oxide chemical sensor. The sensor records the time-sharing responses of four kinds of gaseous pollutants: nitrogen dioxide (NO2), hydrocarbon (HC), ozone (O3) and nitrogen oxides (NOX). The time series of the four gaseous pollutants are stored in a database from 0:00 on March 11, 2004, to 23:00 on April 3, 2005, i.e., 389 days * 24 h, a total of 9336 sampling points, of which 42 days * 24 h (1008 points) are selected for analysis. Before the characteristics analysis and prediction, the series must be normalized. We use the min–max method; after normalization, the variables fall into [0, 1] and the units can be ignored. The normalized time series are plotted in Fig. 1.

5.2. Gaseous pollutant time series analysis

In this subsection, the predictable characteristics of the gaseous pollutant series are analyzed in terms of the autocorrelation function (ACF), the largest Lyapunov exponent, the saturation correlation dimension and the recurrence graph.

(1) Autocorrelation function. As a direct criterion for determining whether a series is linear, the ACF describes in detail the dependency among times within a time series. By setting a fixed delay, the correlation between the initial time and any time within the range of the delay can be obtained. This study computes the ACFs of the four time series and plots them in Fig. 2, where the maximum delay is 200. At the fifth time delay, the ACFs of all gaseous pollutants except O3 are below the upper limit of the confidence interval, while that of O3 approaches it. Therefore, the time series of the contaminated gases are highly nonlinear.

(2) Maximum Lyapunov exponent. The maximum Lyapunov exponent (MLE) explores the chaotic characteristics of the series. The series has a stable fixed point when its MLE is negative; when the MLE is zero, the series has a periodic solution; and when the MLE is positive, the series is chaotic. As shown in Fig. 3, the MLEs of the four gaseous pollutant series over twenty 24-hour cycles are obtained by the Wolf method. The MLE values are all positive, and thus all the time series are chaotic.

(3) Correlation dimension. The correlation dimension increases with the embedding dimension. According to whether the correlation dimension of a series reaches saturation, it is possible to distinguish whether the series is random or chaotic: when the correlation dimension gradually increases without saturating, the series has random


Fig. 1. Normalized time series of the four gaseous pollutants.

Fig. 2. ACFs of the four gaseous pollutants.

characteristics; otherwise, it is chaotic. In order to test the randomness or chaotic characteristics of the four gaseous pollutant series, their correlation dimensions are obtained with the embedding dimension varying from 2 to 8. The correlation dimensions


Fig. 3. Maximum Lyapunov exponents of the four gaseous pollutants.

of the four gaseous pollutant series in Fig. 4 all saturate as the embedding dimension increases, and thus the chaotic characteristics of these series are verified again.

(4) Recurrence graph. Based on phase space reconstruction, it is possible to extract the chaotic attractors from high-dimensional series. The series' recurrence graphs can be constructed according to the embedding dimension d and time delay τ used in the phase space reconstruction. The main diagonal of a recurrence graph reflects the predictability of the series via the number and length of the lines parallel to it. To assess the predictability of the four gaseous pollutant series, the recurrence graphs based on the first 250 samples are shown in Fig. 5, with parameters d = 8 and τ = 1. As seen in Fig. 5, only a small number of short lines are parallel to the main diagonal, which shows that the four series are suitable for short-term prediction but are not easily predicted in the long run.
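The min–max normalization of Section 5.1 and the ACF check of item (1) can be reproduced on any series. Below is a minimal sketch, using a synthetic 24-hour periodic series in place of the real sensor data; ±1.96/√n is the white-noise confidence limit, and all names are illustrative.

```python
import numpy as np

def minmax_normalize(s):
    # min-max normalization used before the analysis (Section 5.1)
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def acf(s, max_lag=200):
    # sample autocorrelation function up to max_lag
    s = s - s.mean()
    var = np.sum(s * s)
    return np.array([np.sum(s[:len(s) - k] * s[k:]) / var
                     for k in range(max_lag + 1)])

# toy series: noisy daily cycle standing in for a pollutant series
t = np.arange(1008)
noise = 0.1 * np.random.default_rng(0).standard_normal(1008)
s = minmax_normalize(np.sin(2 * np.pi * t / 24) + noise)

r = acf(s, max_lag=200)
conf = 1.96 / np.sqrt(len(s))                       # white-noise band
significant = np.where(np.abs(r[1:]) > conf)[0] + 1  # lags above the band
```

For this synthetic series the ACF stays far above the confidence band at multiples of the 24-sample period, whereas the real pollutant ACFs in Fig. 2 drop inside the band within a few lags, which is the basis for the nonlinearity claim.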

6. Experiment results

This section focuses on the NO2 and HC predictive experiments. The 101st to 500th points of each gas series are constructed as the learning samples, and the 501st to 900th points as the test samples. First, the proposed HC learning algorithm is compared with the traditional LOOCV and Variational algorithms. Then, the GPM model is compared with the Kernel Regression (K-R), Minimax Probability Machine Regression (MPMR), Linear Regression (L-R) and GP models. K-R is a kernel-based regression model; by adjusting the optimal window width, its prediction error approaches its minimum. MPMR makes no specific assumptions about the sample distribution and needs only the mean and the covariance matrix of the samples for prediction. L-R is a traditional linear prediction model with good predictive performance on linear series, but it is inaccurate on nonlinear series.

The predictive experiment is conducted on a Lenovo laptop with an Intel Core i5-2430M 2.40 GHz CPU and 4 GB memory, using Matlab 2010a. For the series prediction of gaseous pollutants, the following two indicators are used to measure the quality of the prediction results:

(1) Root Mean Square Error (RMSE):

RMSE = √[(1/n) Σ_{i=1}^{n} (yp(i) − yt(i))²].  (24)

(2) Determination Coefficient (R²):

R² = 1 − SSR/SST,  (25)

SSR = Σ_{i=1}^{n} [yp(i) − yt(i)]²,  (26)

SST = Σ_{i=1}^{n} [yp(i) − ym]².  (27)

In Eqs. (24) to (27), yp(i) and yt(i) are the prediction result and the target of the i-th test sample, respectively, and ym is the mean of the test sample targets. According to the two indicators, the larger the R² value and the smaller the RMSE, the more accurate the prediction.

6.1. NO2 series prediction

For the GPM, the number of GP components C, as well as the parameters d and τ, must be determined before the prediction. To obtain the best parameters for the NO2 series prediction, this study first fixes C = 4 and makes predictions as d varies from 2 to 8 and τ from 1 to 6. The corresponding prediction RMSE and R² are shown in Fig. 6. According to Fig. 6, both the minimum RMSE and the maximum R² are obtained at d = 7 and τ = 1. Therefore, d = 7 and τ = 1


Fig. 4. Correlation dimension of the four gaseous pollutants.

Fig. 5. Recurrence graphs of the four gaseous pollutants.

are fixed, and the prediction results are obtained as C changes from 1 to 8. The prediction RMSE and R2 are plotted in Fig. 7. As seen in Fig. 7, both the minimum RMSE and the maximum R2 occur at C = 5. Based on these results, this study sets d = 7, τ = 1 and C = 5, and makes predictions on the NO2 series using the five models. The absolute prediction errors of the five models are shown in Fig. 8(a), and the result between the 550th and 570th points is


Fig. 6. RMSE and R2 values of the GPM model varying with d and τ (C = 4 is fixed).

Fig. 7. RMSE and R2 values of the GPM model varying with C (d = 7 and τ = 1 are fixed).

Table 1
Comparison of the NO2 prediction results given by the three learning algorithms for the GPM model.

Parameters      Variational       LOOCV             HC
                RMSE     R2       RMSE     R2       RMSE     R2
d = 3, τ = 1    0.1429   0.2971   0.0994   0.6600   0.0931   0.7017
d = 6, τ = 1    0.1218   0.4925   0.1003   0.6558   0.0920   0.7107
d = 7, τ = 1    0.1204   0.5054   0.0930   0.7052   0.0868   0.7430
d = 8, τ = 1    0.1160   0.5414   0.0909   0.7187   0.0888   0.7315
d = 2, τ = 2    0.1607   0.1137   0.1587   0.1354   0.1453   0.2940
d = 6, τ = 2    0.1514   0.2210   0.1488   0.2457   0.1249   0.4698
d = 7, τ = 2    0.1498   0.2125   0.1456   0.2564   0.1277   0.4278
d = 8, τ = 2    0.1496   0.2119   0.1370   0.3393   0.1241   0.4575

Table 2
Comparison of the NO2 prediction results given by the five learning models.

Parameters      K-R               MPMR              L-R               GP                GPM
                RMSE     R2       RMSE     R2       RMSE     R2       RMSE     R2       RMSE     R2
d = 3, τ = 1    0.0962   0.6813   0.0932   0.7010   0.1025   0.6386   0.0932   0.7009   0.0931   0.7017
d = 6, τ = 1    0.1035   0.6334   0.0937   0.6998   0.1033   0.6348   0.0924   0.7078   0.0920   0.7107
d = 7, τ = 1    0.1041   0.6306   0.0932   0.7038   0.1024   0.6421   0.0895   0.7269   0.0868   0.7430
d = 8, τ = 1    0.1047   0.6263   0.0923   0.7098   0.1020   0.6452   0.0895   0.7272   0.0888   0.7315
d = 2, τ = 2    0.1501   0.2264   0.1481   0.2469   0.1533   0.1929   0.1490   0.2375   0.1453   0.2940
d = 6, τ = 2    0.1262   0.4582   0.1331   0.3975   0.1393   0.3402   0.1255   0.4642   0.1249   0.4698
d = 7, τ = 2    0.1309   0.4223   0.1349   0.3620   0.1362   0.3492   0.1290   0.4131   0.1277   0.4278
d = 8, τ = 2    0.1313   0.3929   0.1359   0.3500   0.1360   0.3488   0.1250   0.4498   0.1241   0.4575

Table 3
Comparison of the HC series prediction results given by the three learning algorithms for the GPM model.

Parameters      Variational       LOOCV             HC
                RMSE     R2       RMSE     R2       RMSE     R2
d = 7, τ = 1    0.1111   0.6499   0.1306   0.5168   0.0865   0.7878
d = 8, τ = 1    0.1318   0.5071   0.1209   0.5853   0.0885   0.7779
d = 2, τ = 2    0.1674   0.2017   0.1519   0.3427   0.1457   0.3948
d = 8, τ = 2    0.1698   0.1768   0.1241   0.5599   0.1143   0.6271
d = 6, τ = 3    0.1785   0.0944   0.2017   0.1561   0.1328   0.4989
d = 7, τ = 3    0.1737   0.1451   0.1760   0.1225   0.1288   0.5300
d = 8, τ = 3    0.1696   0.1706   0.1533   0.3218   0.1142   0.6240
d = 5, τ = 4    0.1845   0.0362   0.1970   0.0990   0.1472   0.3866

shown in Fig. 8(b). It is seen in Fig. 8 that the error peak of GPM in some periods is smaller than those of the other four models. Using GPM with the optimal parameters, the predictive values as well as the true values of the series are shown in Fig. 9. The blue line in Fig. 9(a) denotes the true value and the red line the predictive value; the closer the blue and red lines, the more accurate the prediction. In Fig. 9(b), the abscissa represents the true value while the ordinate represents the predictive value. The more concentrated the blue points are around the main diagonal line (green in the figure), the more accurate the prediction. Fig. 10 shows the GPM model's confidence intervals for the NO2 series prediction. The blue and red dashed lines in Fig. 10 represent the upper and lower bounds of the confidence interval. As seen in Fig. 10, both bounds of the confidence intervals are close to the


Fig. 8. Absolute prediction errors of the five models.

Fig. 9. True values and predictive values of NO2 series.

predictive curve itself, which shows that the prediction results have high reliability. The predictive results of GPM using the three different learning algorithms are shown in Table 1. Note that all the results are obtained with the optimal parameters for each of the three learning algorithms. It is seen from the table that the predictions of the HC algorithm are more accurate than those of the other two learning algorithms for all the selected values of d and τ, and the best prediction results, RMSE = 0.0868 and R2 = 0.7430, are obtained at d = 7 and τ = 1. The predictive results of the five models are shown in Table 2. All results are obtained with the optimal parameters for the five models. For the NO2 series prediction, the predictive results of GPM are more accurate than those of the other four models at all values of d and τ. The ranking of prediction accuracy for the five models is: GPM > GP > MPMR > K-R ≥ L-R. Generally, the prediction accuracy of the five models increases with d and decreases with τ; however, too large a d value has a negative impact on GPM prediction. It should be pointed out that the RMSE values in Table 2 belong to just one data set, for the particular air pollutant NO2; the performances of the five learning models are therefore tested on a further pollutant data set in the next subsection.

6.2. HC series prediction

As with the NO2 series, in order to obtain the best parameters for the HC series prediction, this study fixes C = 3 and makes predictions as d changes from 2 to 8 and τ from 1 to 6. Both the minimum RMSE and the maximum R2 are obtained at d = 7 and τ = 1. Therefore, d = 7 and τ = 1 can be fixed, and the prediction is made as C changes from 1 to 8, with the RMSE and R2 plotted in Figs. 11 and 12. It is seen in Fig. 11 that both the minimum RMSE and the maximum R2 occur when C = 3. Then the parameters d = 7,

τ = 1, and C = 3 are set, and the prediction is made on the HC series. The absolute prediction errors of the five models are shown in Fig. 12(a), and the results between the 520th and 540th points are shown in Fig. 12(b). It is clear that GPM is more accurate than the other models. With the optimal parameters, the prediction results of the HC gas series using GPM are shown in Fig. 13. The HC gas series prediction is more accurate than that of the NO2 series, which is consistent with the recurrence plot in Fig. 5. Fig. 14 shows the GPM model's confidence intervals for the HC series prediction. As seen in the figure, both bounds of the confidence intervals are close to the predictive curve itself, which shows that the prediction results are highly reliable.

The predictive results of GPM under the three different learning algorithms are shown in Table 3. It is clear that the prediction of the HC learning algorithm is more accurate than that of the other two learning algorithms for all the selected values of d and τ , and the best prediction results, RMSE = 0.0865 and R2 = 0.7878, are obtained at d = 7 and τ = 1. The predictive results of the five models are shown in Table 4. For the HC series prediction, the predictive results of GPM are more accurate than those of the other four models for all values of d and τ . The ranking of accuracy for the five models is: GPM > GP > K-R ≥ MPMR > L-R. As with the NO2 series prediction, the prediction accuracy of the five models generally increases with d and decreases with τ .

7. Conclusions

This study proposes the use of a hidden-variable posterior hard-cut (HC) iterative algorithm to train the GPM model, in order to predict the concentrations of gaseous pollutants for the measurement of air quality. Before prediction, the original series is normalized at


Fig. 10. Predictive confidence interval of GPM model for NO2 series.

Fig. 11. RMSE and R2 values of the GPM model as C varies (d = 7 and τ = 1 fixed).

Fig. 12. Absolute prediction errors of the five models.

Fig. 13. True values and predictive values of HC series.


Fig. 14. Predictive confidence interval of GPM model for HC series.

Table 4
Comparison of the HC prediction results given by five learning models.

Parameter       K-R             MPMR            L-R             GP              GPM
                RMSE    R2      RMSE    R2      RMSE    R2      RMSE    R2      RMSE    R2
d = 7, τ = 1    0.1008  0.7117  0.0926  0.7568  0.1031  0.6989  0.0891  0.7751  0.0865  0.7878
d = 8, τ = 1    0.0981  0.7269  0.0930  0.7547  0.1031  0.6985  0.0896  0.7721  0.0885  0.7779
d = 2, τ = 2    0.1465  0.3886  0.1493  0.3647  0.1557  0.3095  0.1464  0.3890  0.1457  0.3948
d = 8, τ = 2    0.1098  0.6559  0.1300  0.5174  0.1531  0.3459  0.1154  0.6198  0.1143  0.6271
d = 6, τ = 3    0.1334  0.4946  0.1491  0.3684  0.1733  0.1464  0.1350  0.4823  0.1328  0.4989
d = 7, τ = 3    0.1305  0.5178  0.1437  0.4148  0.1690  0.1913  0.1294  0.5260  0.1288  0.5300
d = 8, τ = 3    0.1150  0.6187  0.1213  0.5758  0.1273  0.5325  0.1180  0.5983  0.1142  0.6240
d = 5, τ = 4    0.1517  0.3484  0.1575  0.2974  0.1803  0.0796  0.1500  0.3630  0.1472  0.3866

first. Secondly, the ACF, the largest Lyapunov exponent, the recurrence plot, and the saturated correlation dimension of the normalized series are analyzed to measure the nonlinear and chaotic characteristics of the series. The analysis shows that these series all demonstrate strong nonlinear, chaotic characteristics and short-term predictability.

For time series prediction based on phase space reconstruction, it is necessary to determine the embedding dimension d and the time delay τ . For the GPM model, the number of GP components C must also be determined. In the experiments, C is fixed at first, and the optimal embedding dimension d and time delay τ are obtained by exhaustively searching all points in a given grid. After obtaining the optimal d and τ , the optimal C is obtained by testing various values. It is found in the experiments that the prediction accuracy usually increases gradually with increasing d and decreasing τ . However, too large a value of d has a negative impact on GPM prediction. The optimal C depends on the series itself, and there is no definite relationship between accuracy and C .

The HC learning is compared with two other traditional GPM learning algorithms, namely LOOCV and variational learning. According to the RMSE and the R2, the HC learning is more accurate than the other two algorithms for many combinations of the parameters d and τ . It is worth noting that for some special combinations of d and τ , the variational learning almost loses its predictive ability, since the prediction time becomes unbearable. The GPM, K-R, MPMR, L-R and GP models are compared in terms of RMSE and R2 with the same embedding dimension d and delay τ , and with the optimal parameter values for all these models. The experimental results show that the GPM is more accurate in prediction than the other four traditional models.
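The parameter search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a least-squares linear predictor stands in for the trained GPM, and the series is synthetic; only the traversal pattern over d and τ and the RMSE/R2 scoring mirror the experimental protocol.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def embed(series, d, tau):
    # Phase-space reconstruction: d lagged values -> next point
    n = len(series) - (d - 1) * tau - 1
    X = np.array([series[t:t + d * tau:tau] for t in range(n)])
    y = np.array([series[t + (d - 1) * tau + 1] for t in range(n)])
    return X, y

# Synthetic noisy series; the real experiments use normalized pollutant data
rng = np.random.default_rng(0)
x = np.sin(0.3 * np.arange(600)) + 0.05 * rng.standard_normal(600)

best = None
for d in range(2, 9):          # d traverses 2..8, as in the experiments
    for tau in range(1, 7):    # tau traverses 1..6
        X, y = embed(x, d, tau)
        split = int(0.8 * len(y))
        A = np.column_stack([X[:split], np.ones(split)])
        w, *_ = np.linalg.lstsq(A, y[:split], rcond=None)
        pred = np.column_stack([X[split:], np.ones(len(y) - split)]) @ w
        score = rmse(y[split:], pred)
        if best is None or score < best[0]:
            best = (score, d, tau)

print("best (RMSE, d, tau):", best)
```

With the best (d, τ ) fixed, the same scoring loop would then be repeated over C to select the number of GP components.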
Thus, the GPM is a useful alternative for forecasting the concentrations of gaseous pollutants, which have been shown in this study to demonstrate strong nonlinear, chaotic characteristics and short-term predictability; it can provide (1) more accurate prediction values and (2) predictive confidence intervals that make the prediction more reliable. This study is the first attempt to use GPM to forecast the concentrations of gaseous pollutants. One topic that the study did not examine is air quality prediction in noisy acquisition environments, where GPM would be compared with traditional models in terms of robustness and sensitivity to noise. This issue is worth further investigation.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105789.
