Evolving Gaussian process models for prediction of ozone concentration in the air

Simulation Modelling Practice and Theory 33 (2013) 68–80

Contents lists available at SciVerse ScienceDirect

Simulation Modelling Practice and Theory journal homepage: www.elsevier.com/locate/simpat

Dejan Petelin a, Alexandra Grancharova c,*, Juš Kocijan a,b

a Department of Systems and Control, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia
b University of Nova Gorica, Centre for Systems and Information Technologies, Vipavska 13, SI-5000 Nova Gorica, Slovenia
c Institute of System Engineering and Robotics, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Bl. 2, P.O. Box 79, Sofia 1113, Bulgaria

Article info

Article history: Available online 25 May 2012

Keywords: Ozone concentration prediction; Dynamic systems modelling; Evolving Gaussian process model

Abstract

Ozone is one of the main air pollutants, with a harmful influence on human health. Therefore, predicting the ozone concentration and informing the population when the air-quality standards are not being met is an important task. In this paper, various first- and high-order Gaussian process models for prediction of the ozone concentration in the air of Bourgas, Bulgaria are identified off-line, based on hourly measurements of the concentrations of ozone, sulphur dioxide, nitrogen dioxide, phenol and benzene in the air and of the meteorological parameters, collected at the automatic measurement stations in Bourgas. Further, an on-line updating (evolving) Gaussian process model is proposed and evaluated as an alternative approach. Such an approach is needed when the training data are not available for the whole period of interest, so that not all characteristics of the period can be trained, or when the environment that is to be modelled is constantly changing.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Ozone (O3), a form of oxygen, is a highly unstable and poisonous gas that can form and react under the action of light and that is present in two layers of the atmosphere. Ozone is a very specific air substance, present throughout the Earth's atmosphere, from ground level to the top of the atmosphere. The stratospheric ozone prevents the harmful solar ultraviolet radiation from reaching the Earth's surface. However, in the tropospheric layer, which is at ground level, ozone is an air pollutant which damages human health and the equilibrium of the ecosystem. Exposure to ozone can cause serious health problems in plants and people; thus ozone pollution is a major problem in some regions of the world. It tends to increase during periods of high temperatures and sunny skies. The changes of the ozone content in the troposphere and the complexity of the processes defining these changes are the reasons why atmospheric ozone dynamics is an object of intensive research. The most direct way to obtain accurate air-quality information is from measurements made at surface monitoring stations across countries. Fixed measurements of hourly ozone concentrations in compliance with the European Directive on ambient air quality and cleaner air for Europe [1] give continuous information about the evolution of surface ozone pollution at a large number of sites across Europe. In several Member States, they are more and more supplemented by numerical model outputs delivered at a regional or local scale, in keeping with the European Directive. The European standards that guarantee human-health protection are as follows: health protection level, 120 µg/m³ 8-h mean concentration; informing the public level,

* Corresponding author. E-mail address: [email protected] (A. Grancharova).
1569-190X/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.simpat.2012.04.005


180 µg/m³ 1-h mean concentration; and warning the public level, 240 µg/m³ 1-h mean concentration. Therefore, predicting the ozone concentration and informing the population when the air-quality standards are not being met are important tasks. The ozone concentration has a pronounced daily cycle [22], which can be modelled and forecasted using a variety of methods; methods that describe the non-linear dynamics from the available data are particularly useful. Thus, there exist a number of methods for ozone concentration prediction based on various modelling techniques, e.g. neural network NARX models [2,15,31], polynomial NARX models [26], fuzzy systems [20,21], support vector machines [7], ARIMA stochastic models [9], and Gaussian processes (GP) [15,14,13]. There are also methods based on a combination of some of the mentioned techniques, e.g. the approach in [10] combines the use of neural networks, support vector machines and genetic algorithms. In this paper, we focus on the use of GP modelling techniques for the development and comparison of various models for the prediction of the ozone concentration in the air. The GP model is a probabilistic, non-parametric model based on the principles of Bayesian probability. It differs from most of the other black-box identification approaches in that it does not try to approximate the modelled system by fitting the parameters of selected basis functions, but rather by searching for relationships among the measured data. The output of a GP model is a normal distribution, expressed in terms of the mean and the variance. The mean value represents the most likely output, and the variance can be interpreted as a measure of its confidence. The obtained variance, which depends on the amount and the quality of the available identification data, is important information when it comes to distinguishing GP models from other computational-intelligence methods.
Because of these properties, GP models are especially suitable for the modelling of uncertain processes, or when the modelling data are unreliable, noisy or missing. GP models fit well to the modelling of environmental systems, as well as to ozone-pollution modelling. Thus, GP models have been developed for the prediction of the ozone concentration in the air of Bourgas, which is among the regions in Bulgaria with the highest levels of ozone pollution in the air. In [14], first-order GP models based on measurements of the air-pollutant concentrations are identified and verified for one-step-ahead predictions of the ozone concentration in the air of Bourgas. Furthermore, in [13] high-order GP models using measurements of both the air pollutants and the meteorological parameters are identified and verified. In both cases the GP models are trained off-line, using only a subset of the available data due to the high computational burden of training GP models. However, this limitation and, consequently, the quality of the GP models can be improved with on-line updating using the most recent measurements. A noticeable drawback of system identification with GP models is the computation time necessary for the modelling. Regression based on GP models involves several matrix computations whose computational complexity increases with the third power of the number of input data, such as matrix inversion and the calculation of the log-determinant of the covariance matrix. This computational burden restricts the amount of training data to at most a few thousand cases. To overcome the computational limitations and to make the method usable for large-scale dataset applications, numerous authors have suggested various sparse approximations [27,28]. All sparse approximation methods try to retain the bulk of the information contained in the full training dataset, but reduce the size of the covariance matrix so as to facilitate a less computationally demanding implementation of the GP model.
A special kind of sparse approximation is the on-line modelling method Sparse On-line Gaussian Processes (OGP) [8], which tries to incorporate all the information in the data by projection onto the reduced covariance matrix. The OGP method has already been used for modelling the ozone concentration in the air [25]. As the weather and its characteristics are constantly changing, the model should be updated and adjusted as well. That means it should not only update the model with the information contained in the streaming data, but should concurrently optimize the hyperparameter values as well. In our experience, the OGP method has problems with numerical instability; therefore we propose what is, in our opinion, a more robust method for the on-line updating (evolving) of a GP model, and compare its performance with off-line trained GP models. The proposed method is based on the concept described in [24], but the method used in the experimental part of this paper is implemented differently.

The paper is structured as follows. In Section 2, the use and properties of Gaussian processes for modelling are reviewed. In Section 3, first- and high-order GP models for prediction of the ozone concentration in the air of Bourgas, Bulgaria are identified off-line. A method for prediction of the ozone concentration based on an on-line updated (evolving) GP model is proposed and evaluated in Section 4. Concluding remarks end the paper.

2. Modelling of dynamic systems with Gaussian processes

A GP model is a flexible, probabilistic, non-parametric model with uncertainty predictions. Its uses and properties for modelling are reviewed in [29]. The use of Gaussian processes for modelling dynamic systems is a relatively recent development [6,12,18]. A retrospective review can be found in [17]. A Gaussian process is a collection of random variables which have a joint multivariate Gaussian distribution (Fig. 1). Assuming a relationship of the form y = f(x) between input x and output y, we have [y_1, …, y_N] ~ N(0, Σ), where Σ_pq = Cov(y_p, y_q) = C(x_p, x_q) gives the covariance between the output points corresponding to the input points x_p and x_q. Thus, the mean μ(x) and the covariance function C(x_p, x_q) fully specify the Gaussian process. The value of the covariance function C(x_p, x_q) expresses the correlation between the individual outputs f(x_p) and f(x_q) with respect to the inputs x_p and x_q. Note that the covariance function C(·, ·) can be any function that generates a positive semi-definite covariance matrix. It is usually composed of two parts,

C(x_p, x_q) = C_f(x_p, x_q) + C_n(x_p, x_q),    (1)

Fig. 1. Modelling with GP: (a) Gaussian distribution of predictions at new points x1, x2 and x3, conditioned on the training points (·); (b) the predictive mean together with its 2σ error bars for three points: x1 and x3, which are close to the training points, and x2, which is more distant.

where C_f represents the functional part and describes the unknown system we are modelling, and C_n represents the noise part and describes the noise model. For the noise part, the constant covariance function is most commonly used, presuming white noise. The choice of the covariance function for the functional part depends on the stationarity of the process. For stationary data, the most commonly used covariance function is the squared exponential covariance function. The composite covariance function is therefore

" Cðxp ; xq Þ ¼ v 1 exp 

# D 1X wd ðxdp  xdq Þ2 þ dpq v 0 ; 2 d¼1

ð2Þ
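As a small illustration, the composite covariance function (2) can be implemented directly. The following is our own minimal NumPy sketch, not the authors' code; the variable names mirror the hyperparameters of Eq. (2):

```python
import numpy as np

def cov_matrix(X, w, v1, v0):
    """Covariance matrix K from Eq. (2): squared exponential part with
    ARD weights w, plus white-noise variance v0 on the diagonal (delta_pq)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for p in range(N):
        for q in range(N):
            d2 = np.sum(w * (X[p] - X[q]) ** 2)   # weighted squared distance
            K[p, q] = v1 * np.exp(-0.5 * d2) + (v0 if p == q else 0.0)
    return K
```

The double loop is written for clarity; in practice the pairwise distances would be vectorized for the dataset sizes discussed above.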

where w_d, v_1 and v_0 are the 'hyperparameters' of the covariance function, D is the input dimension, and δ_pq = 1 if p = q and 0 otherwise. In contrast, for non-stationary data the polynomial covariance function, or its special case the linear covariance function, can be used. Other forms and combinations of covariance functions suitable for various applications can be found in [29]. The hyperparameters can be written as a vector Θ = [w_1, …, w_D, v_1, v_0]^T. The parameters w_d indicate the importance of the individual inputs: if w_d is zero or near zero, the inputs in dimension d contain little information and could possibly be neglected. To accurately reflect the correlations present in the training data, the hyperparameters of the covariance function need to be optimized. Due to the probabilistic nature of GP models, the common model-optimization approach, where the model parameters and possibly also the model structure are optimized through the minimization of a cost function defined in terms of the model error (e.g. the mean square error), is not readily applicable. A probabilistic approach to the optimization of the model is more appropriate: instead of minimizing the model error, the probability of the model is maximized. Consider a set of N D-dimensional input vectors X = [x_1, x_2, …, x_N]^T and a vector of output data y = [y_1, y_2, …, y_N]. Based on the data (X, y), and given a new input vector x*, we wish to find the predictive distribution of the corresponding output y*. Based on the training set X, a covariance matrix K of size N × N is determined. The overall problem of learning unknown parameters from data corresponds to the predictive distribution p(y*|y, X, x*) of the new target y*, given the training data (y, X) and a new input x*. In order to calculate this posterior distribution, a distribution over the hyperparameters p(Θ|y, X) can first be defined, followed by the integration of the model over the hyperparameters

p(y*|y, X, x*) = ∫ p(y*|Θ, y, X, x*) p(Θ|y, X) dΘ.    (3)

The computation of such integrals can be difficult due to the intractable nature of the non-linear functions. A solution to the problem of intractable integrals is to adopt numerical integration methods such as the Monte-Carlo approach. Unfortunately, significant computational effort may be required to achieve a sufficiently accurate approximation. In addition to the Monte-Carlo approach, another standard and general practice for estimating the hyperparameters is maximum-likelihood estimation, i.e., maximizing the following log-likelihood function:

L(Θ) = −(1/2) log(|K|) − (1/2) y^T K^(−1) y − (N/2) log(2π)    (4)

As the likelihood is, in general, non-linear and multimodal, efficient optimization routines usually make use of gradient information. The derivative of L with respect to each of the hyperparameters is computed as follows:

∂L(Θ)/∂θ_i = −(1/2) trace( K^(−1) ∂K/∂θ_i ) + (1/2) y^T K^(−1) (∂K/∂θ_i) K^(−1) y.    (5)
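Eqs. (4) and (5) translate almost line-by-line into NumPy. The following is a sketch under our own naming, with dK standing for ∂K/∂θ_i for a single hyperparameter:

```python
import numpy as np

def log_likelihood(K, y):
    """Log-marginal likelihood L(Theta) of Eq. (4) for covariance K, targets y."""
    N = len(y)
    _, logdet = np.linalg.slogdet(K)      # log|K|, numerically stable
    alpha = np.linalg.solve(K, y)         # K^-1 y without an explicit inverse
    return -0.5 * logdet - 0.5 * y @ alpha - 0.5 * N * np.log(2 * np.pi)

def log_likelihood_grad(K, dK, y):
    """Derivative of L w.r.t. one hyperparameter, Eq. (5); dK = dK/dtheta_i."""
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    return -0.5 * np.trace(Kinv @ dK) + 0.5 * alpha @ dK @ alpha
```

A gradient-based optimizer (such as the conjugate-gradient routine the paper uses later) would call these two functions repeatedly while adjusting Θ.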


GP models can easily be utilized for regression. Based on the training set X, a covariance matrix K of size N × N is determined. The aim is to find the distribution of the corresponding output y* for some new input vector x* = [x_1(N+1), x_2(N+1), …, x_D(N+1)]. For the collection of random variables [y_1, …, y_N, y*] we can write:

[y, y*] ~ N(0, K*)    (6)

with the covariance matrix

K* = [ K        k(x*)
       k(x*)^T  κ(x*) ],    (7)

where y = [y_1, …, y_N] is a 1 × N vector of training targets. The predictive distribution of the output for a new test input has a normal probability distribution with mean and variance

μ(y*) = k(x*)^T K^(−1) y,    (8)

σ²(y*) = κ(x*) − k(x*)^T K^(−1) k(x*),    (9)

where k(x*) = [C(x_1, x*), …, C(x_N, x*)]^T is the N × 1 vector of covariances between the test and training cases, and κ(x*) = C(x*, x*) is the covariance of the test input with itself. The identified model thus provides, in addition to the mean value, information about the confidence of the prediction through the variance. Usually, the confidence of the prediction is depicted with the 2σ interval, which corresponds to about a 95% confidence interval. This confidence region can be seen in the example in Fig. 2 as the band bounded by dotted lines. It highlights areas of the input space where the prediction quality is poor, due to a lack of data or noisy data, by indicating a wider confidence band around the predicted mean. Gaussian processes can, like neural networks, be used to model static non-linearities and can therefore be used for the modelling of dynamic systems [6,18,19], as well as of time series, if lagged samples of the output signals are fed back and used as regressors. For the modelling of dynamic systems we consider a representation where the output at step k depends on the delayed outputs y and the exogenous inputs u:

y(k) = f(y(k−1), …, y(k−n), u(k−1), …, u(k−n)) + ε(k)    (10)

where ε(k) is white noise and the output y(k) depends on the state vector x(k) = [y(k−1), y(k−2), …, y(k−n), u(k−1), u(k−2), …, u(k−n)] at step k. Assuming the signal is known up to step k, we wish to predict the output of the system h steps ahead, i.e., we need to find the predictive distribution of y(k+h) corresponding to x(k+h). Multiple-step-ahead predictions of a system modelled by Eq. (10) can be achieved by iteratively making repeated one-step-ahead predictions up to the desired horizon [18,6].
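Eqs. (8) and (9), together with the iterated one-step-ahead scheme just described, can be sketched in NumPy as follows. This is our own minimal naming, not the authors' implementation; cov is any covariance function (e.g. Eq. (2)), and a small jitter stands in for the noise part of Eq. (1):

```python
import numpy as np

def gp_predict(X, y, x_star, cov, noise=1e-6):
    """Predictive mean (Eq. 8) and variance (Eq. 9) for a new input x_star."""
    N = len(X)
    K = np.array([[cov(X[p], X[q]) for q in range(N)] for p in range(N)])
    K += noise * np.eye(N)                     # noise part C_n of Eq. (1)
    k_star = np.array([cov(xi, x_star) for xi in X])
    mean = k_star @ np.linalg.solve(K, y)      # k(x*)^T K^-1 y
    var = cov(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mean, var

def multi_step_ahead(X, y, state, u_future, n, cov):
    """Iterated one-step-ahead prediction for the NARX model (10):
    the predicted mean is fed back as the newest delayed output.
    state = [y(k-1..k-n), u(k-1..k-n)]."""
    preds = []
    for u in u_future:
        m, _ = gp_predict(X, y, np.array(state), cov)
        preds.append(m)
        # shift the delayed outputs and inputs by one step
        state = [m] + state[:n - 1] + [u] + state[n:2 * n - 1]
    return preds
```

Feeding back only the mean ignores the uncertainty of the intermediate predictions; the references above discuss propagating the variance as well.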

Fig. 2. Using GP models: in addition to the prediction mean value (full line), we obtain a 95% confidence region (dotted lines) for the underlying function f.


The quality of the prediction mean value with GP models can be assessed by computing the average squared error (ASE)

ASE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²    (11)

where ŷ_i and y_i are the ozone concentration prediction and the ozone concentration measurement in the ith step. Additionally, the quality of the prediction variance can be assessed with the logarithm of the predictive density error (LDE)

LDE = (1/2) log(2π) + (1/(2N)) Σ_{i=1}^{N} ( log σ_i² + (ŷ_i − y_i)² / σ_i² ),    (12)

where σ_i² is the prediction variance in the ith step.
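Both measures can be written compactly in code. The following sketch (our naming) assumes NumPy arrays of predictions, measurements and prediction variances:

```python
import numpy as np

def ase(y_hat, y):
    """Average squared error, Eq. (11)."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return float(np.mean((y_hat - y) ** 2))

def lde(y_hat, y, var):
    """Logarithm of the predictive density error, Eq. (12)."""
    y_hat = np.asarray(y_hat, float)
    y = np.asarray(y, float)
    var = np.asarray(var, float)
    return float(0.5 * np.log(2 * np.pi)
                 + 0.5 * np.mean(np.log(var) + (y_hat - y) ** 2 / var))
```

Note that, unlike ASE, the LDE also penalizes over- and under-confident variance estimates, which is why it can take negative values when the variances are small.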

3. Off-line GP models for prediction of ozone concentration in the air of Bourgas


Measurement data for the year 2008, collected at the automatic measurement station in the centre of Bourgas, Bulgaria, are used. The data include hourly measurements of the concentrations of ozone, sulphur dioxide, nitrogen dioxide, phenol and benzene. The meteorological parameters have not been measured at this station. However, in order to study how these parameters would influence the prediction of the ozone concentration in the air of Bourgas, their measurements at two other stations in the region of Bourgas city (region "Dolno Ezerovo" and region "Meden rudnik") are used. It is accepted that the values of the meteorological parameters in the centre of Bourgas are represented by the average of the measurements collected at the stations in Dolno Ezerovo and Meden rudnik. It should be noted that in the Gaussian process models the mean hourly concentrations of ozone are used. In previous research [22,23], it has been shown that the ozone concentration has a strong daily cycle. Fig. 3 shows the daily cycle of the average hourly ozone concentrations for all seasons. It can be seen that the average ozone concentration has a minimum at 5–8 h, depending on the season. After 6–8 h it grows fast, i.e. the formation and accumulation of ozone in the air starts after 6–8 h. The maximum is reached at 11–16 h. Therefore, if we are able to predict correctly the ozone concentration in that hourly interval, we will be able to predict the maximal hourly concentration for the day in all cases of high health risk. As we are interested in obtaining an accurate prediction of the ozone concentration in the interval where there is a risk of exceeding the established air-quality standards, the training data include the measurements from 9 h till 16 h on every 2nd day of the year 2008. The days for which a full set of measurements in the interval 9–16 h is not available are excluded from the data set. The total number of training data points is therefore 1032, corresponding to 129 days of the year. The validation data include the measurements from 1 h till 23 h on every 9th day of the year 2008 (excluding the days which coincide with the 2nd days). The days for which measurements are available at some hours only are excluded from the validation data set. The total number of validation data points is 299, corresponding to 13 days of the year.
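The data selection just described can be sketched with pandas. The DataFrame layout, the column handling and the exact day-parity convention for "every 2nd day" are our assumptions, not specified by the paper:

```python
import pandas as pd

def split_train_val(df):
    """Sketch of the train/validation selection described above.
    'df' is assumed to be a DataFrame of hourly measurements with a
    DatetimeIndex covering the year 2008 (assumed layout)."""
    day = df.index.dayofyear
    hour = df.index.hour
    # training: hours 9-16 of every 2nd day, keeping only days with all 8 samples
    cand = df[(day % 2 == 0) & (hour >= 9) & (hour <= 16)].dropna()
    train = cand.groupby(cand.index.dayofyear).filter(lambda g: len(g) == 8)
    # validation: hours 1-23 of every 9th day not coinciding with the 2nd days,
    # keeping only days with the full 23-sample record
    vcand = df[(day % 9 == 0) & (day % 2 != 0) & (hour >= 1) & (hour <= 23)].dropna()
    val = vcand.groupby(vcand.index.dayofyear).filter(lambda g: len(g) == 23)
    return train, val
```

The groupby/filter step implements the exclusion of days with incomplete hourly records.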

Fig. 3. Average hourly ozone concentrations for each season (winter, spring, summer, fall).


3.1. First-order GP models

Four first-order GP models for prediction of the ozone concentration are identified and verified based on the preprocessed available measurement data. In addition to the log-likelihoods (L) of the identified models, the average squared errors (ASE) and the log density errors (LDE) computed for the training data and the validation data are given in Table 1. The four identified first-order GP models have the following input parameters:

Model A1:

The value of ozone concentration at the previous hour:

c_O3(k) = f_1(c_O3(k−1))    (13)

where k is the current hour of the day and c_O3 is the concentration of ozone in the air. Model B1: the values of the ozone concentration, the concentrations of the air pollutants, and the meteorological parameters at the previous hour:

c_O3(k) = f_2(c_O3(k−1), c_NO2(k−1), c_SO2(k−1), c_C6H5OH(k−1), c_C6H6(k−1), h(k−1), p(k−1), sr(k−1), temp(k−1), ws(k−1))    (14)

Here, c_NO2, c_SO2, c_C6H5OH, and c_C6H6 are the concentrations of nitrogen dioxide, sulphur dioxide, phenol and benzene in the air; h is the air humidity, p is the air pressure, sr is the sun radiation, temp is the air temperature, and ws is the wind speed. Model C1: the values of the ozone concentration and the concentrations of the air pollutants at the previous hour:

c_O3(k) = f_3(c_O3(k−1), c_NO2(k−1), c_SO2(k−1), c_C6H5OH(k−1), c_C6H6(k−1))    (15)

Model D1:

The values of ozone concentration and the meteorological parameters at the previous hour:

c_O3(k) = f_4(c_O3(k−1), h(k−1), p(k−1), sr(k−1), temp(k−1), ws(k−1))    (16)

In Table 1, the best obtained values of L, ASE_TRAIN, LDE_TRAIN, ASE_VAL, and LDE_VAL are given in bold. It can be seen that the largest log-likelihood of the model parameters, as well as the smallest errors on the training data, is obtained with model B1. The best validation results are obtained with model A1, which is also the simplest among the four identified models, as its only input is the delayed value of its own output. That is unexpected from the physical point of view; therefore, it seems sensible to test high-order models as well.

3.2. High-order GP models

Four second-order and one third-order GP models for prediction of the ozone concentration are identified and verified based on the preprocessed available measurement data. In addition to the log-likelihoods (L) of the identified models, the average squared errors (ASE) and the log density errors (LDE) computed for the training data are given in Table 2. The training and the validation data are formed in the same way as for the first-order models. The identified high-order GP models have the following input parameters:

Model A2:

The values of ozone concentration at two previous hours:

c_O3(k) = f_5(c_O3(k−1), c_O3(k−2))    (17)

Model A3:

The values of ozone concentration at three previous hours:

c_O3(k) = f_6(c_O3(k−1), c_O3(k−2), c_O3(k−3))    (18)

Model B2:

The values of ozone concentration, the concentrations of the air pollutants, and the meteorological parameters at two previous hours:

Table 1
The log-likelihoods (L), average squared errors (ASE), and log density errors (LDE).

Model | L     | ASE_TRAIN | LDE_TRAIN | ASE_VAL | LDE_VAL
A1    | 626.6 | 0.0169    | −0.6202   | 0.0282  | −0.2474
B1    | 726.3 | 0.0117    | −0.8069   | 0.0305  | −0.2314
C1    | 682.5 | 0.0144    | −0.7021   | 0.0386  | −0.0816
D1    | 694.0 | 0.0137    | −0.7253   | 0.0297  | −0.1544


Table 2
The log-likelihoods (L), average squared errors (ASE), and log density errors (LDE).

Model | L     | ASE_TRAIN | LDE_TRAIN | ASE_VAL | LDE_VAL
A2    | 622.3 | 0.0159    | −0.6519   | 0.0265  | −0.2802
A3    | 631.8 | 0.0149    | −0.6849   | 0.0268  | −0.2521
B2    | 740.2 | 0.0100    | −0.8864   | 0.0230  | −0.3447
C2    | 677.6 | 0.0128    | −0.7629   | 0.0317  | −0.1092
D2    | 711.7 | 0.0121    | −0.7914   | 0.0231  | −0.3531

c_O3(k) = f_7(c_O3(k−1), c_NO2(k−1), c_SO2(k−1), c_C6H5OH(k−1), c_C6H6(k−1), h(k−1), p(k−1), sr(k−1), temp(k−1), ws(k−1), c_O3(k−2), c_NO2(k−2), c_SO2(k−2), c_C6H5OH(k−2), c_C6H6(k−2), h(k−2), p(k−2), sr(k−2), temp(k−2), ws(k−2))    (19)

Model C2:

The values of ozone concentration and the concentrations of the air pollutants at two previous hours:

c_O3(k) = f_8(c_O3(k−1), c_NO2(k−1), c_SO2(k−1), c_C6H5OH(k−1), c_C6H6(k−1), c_O3(k−2), c_NO2(k−2), c_SO2(k−2), c_C6H5OH(k−2), c_C6H6(k−2))    (20)

Model D2:

The values of ozone concentration and the meteorological parameters at two previous hours:

c_O3(k) = f_9(c_O3(k−1), h(k−1), p(k−1), sr(k−1), temp(k−1), ws(k−1), c_O3(k−2), h(k−2), p(k−2), sr(k−2), temp(k−2), ws(k−2))    (21)

It can be noticed from Tables 1 and 2 that, for the same model type, the high-order models are more accurate than the first-order models (model A2 is more accurate than model A1, model B2 is more accurate than model B1, etc.). It can be seen from Table 2 that the largest log-likelihood of the model parameters, as well as the smallest errors on the training data and the smallest ASE_VAL error, is obtained for model B2. However, the smallest LDE_VAL error is obtained with model D2, which is much simpler than model B2. Therefore, model D2 is considered the best GP model for ozone concentration prediction. All models considered in this section are trained off-line on the data of a one-year period only. As the weather is constantly changing, an alternative approach that updates the model with incoming data is proposed in the next section.

4. Evolving GP models

In the previous section, data throughout the whole period of interest, in our case the year 2008, were available for training and validation.1 Therefore, all the characteristics of the data from the whole period could be trained. In this section we test an alternative approach to training GP models, which is needed when the training data are not available for the whole period of interest, so that not all the characteristics of the period can be trained. In this case the model should be constantly adapted to new characteristics. Such an approach can be used in environments that are constantly changing; as the weather is constantly changing, this approach seems promising. For this purpose we propose our method for the on-line adaptation of GP models, named evolving GP models [24]. A similar study was done in [25], but a different model structure and a different method were used. Evolving systems [4] are self-developing systems inspired by the idea of system-model evolution in a dynamically changing and evolving environment. They differ from other traditional adaptive systems known from control theory [5] in that they adapt both the structure and the parameter values of the model on-line from the incoming data [4]. We use the term evolving GP models in the sense of recursively adapting the structure of the GP model and the hyperparameter values. This term was introduced in the 1990s for neural networks [11], in 2001 for fuzzy rule-based systems [3], and in 2002 for neuro-fuzzy systems [16]. GP models depend on the data and on the covariance function. In more detail, the data are defined by the chosen regressors and basis vectors, and the covariance function is defined by its type and hyperparameter values. Therefore, there are four parts that can evolve: the regressors, the basis vectors, the type of the covariance function, and the hyperparameter values. As described in Section 2, the training of GP models on large amounts of data is very time consuming. To overcome this, only a subset of the most informative data, the so-called basis vectors set, is used. With a type, or a combination of various types, of covariance function, prior knowledge of the system is included in the model. With the optimization of the hyperparameter values, the model is adjusted even more closely to the real system. However, in dynamic non-linear systems, where the non-linear mapping between the input and the output cannot easily be formulated, the squared exponential covariance function is frequently used, presuming smoothness and stationarity of the system. That means the covariance function is fixed and does not need to evolve. Furthermore, the squared exponential covariance function can be used with automatic relevance determination (ARD), which is able

1 It should be noted that the same data are not used for training and validation, but rather data from the same period.


to find the influential regressors [29]. With the optimization of the hyperparameter values, the uninfluential regressors get smaller hyperparameter values and, as a consequence, a smaller influence on the result. Therefore, all the available regressors can be used. Consequently, only the basis vectors set and the hyperparameter values are left to evolve. The general concept of evolving GP models, presented in [24], contains the following steps:

Algorithm 1. MODELEVOLVING(x*, y*)

1. X̄ = ADD(X̄, x*)                    // add new input data x* to the set X̄
if LENGTH(X̄) > maxLength then
  2. iGains = INFORMATIONGAIN(X̄)     // calculate information gains
  3. X̄ = REMOVEWORST(X̄, iGains)      // remove worst data from X̄
end
4. Θ = CALCHYPVAL(X̄, Θ)              // calculate hyperparameter values Θ
5. K = UPDATECOVMAT(X̄, Θ)            // update covariance matrix K

These basic steps are repeated for every incoming sample of data until there is no more available data or until a requirement to stop the process is received. To keep the subset of the most informative data small enough to process all the steps before new data arrive, the maximum length of the subset is set with the parameter maxLength, which is therefore a design parameter. The operations in the concept (emphasized in the pseudo code) can be implemented in various ways. There are two critical operations: the calculation of the information gain and the calculation of the hyperparameter values. Both can be implemented using well-known methods: for the calculation of the information gain any on-line learning method for GP models can be used, e.g. [8,30], and for the calculation of the hyperparameter values any suitable optimization method, or even a heuristic method, can be used. Our implementation of the operations in the concept is described below. Before the main operations, the basic elements and some basic operations should be described. The core of the concept is the set of the most informative data; it is a subset of the data already considered. It is denoted as X̄ and defined as

X̄ ⊆ X.    (22)
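One pass of Algorithm 1 can be sketched in Python. This is our own illustration, not the paper's implementation: the callables predict, info_gain and optimise are placeholders for the operations the algorithm relies on (the prediction of Eq. (8), the information-gain calculation, and the hyperparameter optimization of Eq. (4)), and the threshold parameter anticipates the ADD operation's novelty test:

```python
import numpy as np

def evolve_step(X, y, x_new, y_new, theta, max_length,
                predict, info_gain, optimise, threshold=0.1):
    """One pass of Algorithm 1 (a sketch). X holds the basis vectors,
    y their targets, theta the hyperparameters."""
    # step 1: add the new point only if the model finds it "new" enough
    if abs(y_new - predict(X, y, x_new, theta)) > threshold:
        X = np.vstack([X, x_new])
        y = np.append(y, y_new)
    # steps 2-3: if the set is too long, drop the least informative point
    if len(X) > max_length:
        gains = [info_gain(np.delete(X, i, axis=0), np.delete(y, i), theta)
                 for i in range(len(X))]
        worst = int(np.argmin(gains))
        X = np.delete(X, worst, axis=0)
        y = np.delete(y, worst)
    # steps 4-5: re-optimise the hyperparameters (the covariance matrix
    # is rebuilt from X and theta wherever predictions are made)
    theta = optimise(X, y, theta)
    return X, y, theta
```

In a streaming setting this function would be called once per incoming sample, mirroring the repetition of the basic steps described above.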

To operate with the set X̄, two basic operations are needed: adding and removing elements. Both operations are very straightforward. Adding a new element x⁺ to the set X̄ is defined as

X̄⁺ = [ X̄ | x⁺ ],    (23)

where X̄⁺ is the new, extended set of the most informative data. Removing the ith element v_i from the set X̄ is defined as

X̄⁻ = [v_1, …, v_{i−1}, v_{i+1}, …, v_n],    (24)

where n is the length of the set X̄. The main operations of the algorithm are implemented as follows:

1. ADD(X̄, x*, y*): adds new data to the set of the most informative data X̄. It is implemented in such a way that the new data x* is added to X̄ only when the data is "new" to the current model, i.e., only when the absolute difference between the prediction mean value for x* given by Eq. (8), based on the current model, and the measured value y* is greater than a pre-set threshold. The threshold can be set heuristically and is therefore a design parameter. If this condition is fulfilled, x* is added to X̄ by using Eq. (23).

Procedure 1. ADD(X̄, x*, y*)
if |y* − μ(x*)| > threshold then   // prediction is calculated by (8)
  X̄ = [ X̄ | x* ]                   // adding x* to X̄ by using (23)
end
return X̄

2. INFORMATIONGAIN(X̄): calculates the information gain for each element in X̄. Actually, it calculates the log-marginal likelihood L, by using Eq. (4), for each subset of X̄ of length n − 1, where n is the length of X̄. It should be noted that this operation is performed only when the subset X̄ has exceeded the pre-set length l, so l = n − 1 always holds here. A higher log-marginal likelihood means a higher information gain. As the likelihood is calculated for the subset without the data element whose information gain is to be determined, the sign should be inverted: it is multiplied by −1, so that a higher log-marginal likelihood of the reduced subset means a lower information gain of the removed element. In the case of time series, forgetting is also used. It is implemented as exponential forgetting

L̃(X̃⁻, Θ) = −λ^(k−c) L(X̃⁻, Θ),    (25)

where λ is the forgetting factor, k is the current sequence number, c is the sequence number at which the currently considered data was added to X̃, L̃ is the negative log-marginal likelihood considering exponential forgetting, and X̃⁻ is the subset of X̃ of length n − 1. Forgetting can easily be turned off by setting λ = 1.

Procedure 2. INFORMATIONGAIN(X̃)
   for i = 1 to LENGTH(X̃) do
      X̃⁻ = [v₁, …, vᵢ₋₁, vᵢ₊₁, …, vₙ]  // Removing the ith element from X̃ by using (24)
      iGains[i] = L̃(X̃⁻, Θ)            // Calculating the information gain by using (25)
   end
   return iGains

3. REMOVEWORST(X̃, iGains): Removes the element with the worst information gain from the set X̃. It is performed simply by using Eq. (24).

Procedure 3. REMOVEWORST(X̃, iGains)
   ind = 1
   min = ∞
   for i = 1 to LENGTH(iGains) do
      if iGains[i] < min then
         ind = i
         min = iGains[i]
      end
   end
   X̃ = [v₁, …, v_ind−1, v_ind+1, …, vₙ]  // Removing the indth element from X̃ by using (24)
   return X̃
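Procedures 2 and 3 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the log-marginal likelihood uses a generic squared-exponential covariance with fixed hyperparameters (the names ell, sf and sn are assumptions), and the array `birth` records the sequence number at which each point entered the set, as required by the forgetting term λ^(k−c) of Eq. (25).

```python
import numpy as np

def log_marginal_likelihood(X, y, ell=1.0, sf=1.0, sn=0.1):
    # Log-marginal likelihood of the data under a GP with a squared-
    # exponential covariance; stands in for Eq. (4).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = sf**2 * np.exp(-0.5 * d2 / ell**2) + sn**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def information_gain(X, y, birth, k, lam=0.99985):
    # Procedure 2: leave-one-out log-marginal likelihood, sign-flipped
    # and discounted by the forgetting factor as in Eq. (25);
    # lam = 1 disables forgetting.
    gains = np.empty(len(X))
    for i in range(len(X)):
        Xi = np.delete(X, i, axis=0)
        yi = np.delete(y, i)
        gains[i] = -lam ** (k - birth[i]) * log_marginal_likelihood(Xi, yi)
    return gains

def remove_worst(X, y, birth, gains):
    # Procedure 3: drop the element with the lowest information gain.
    i = int(np.argmin(gains))
    return np.delete(X, i, axis=0), np.delete(y, i), np.delete(birth, i)
```

Because each leave-one-out likelihood requires a Cholesky factorisation, keeping the active set small (the maxLength parameter) is what makes this scoring affordable between samples.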

4. CALCHYPVAL(X̃, Θ): The hyperparameter values are calculated by maximizing the log-likelihood from Eq. (4). This can be done with any suitable optimization method; in our case, the conjugate gradient optimization method proposed in [29] is used. In detail, the Polak–Ribiere version of conjugate gradients is used to compute the search directions by using Eq. (5), and a line search using quadratic and cubic polynomial approximations and the Wolfe–Powell stopping criteria is used together with the slope-ratio method for guessing initial step sizes. Additional checks are made to make sure that exploration is taking place and that extrapolation does not become unboundedly large. The algorithm is available in [29].

5. UPDATECOVMAT(X̃, Θ): Updates the covariance matrix. It is performed only when the hyperparameter values have changed. If the covariance matrix is to be updated, the update is done by using Eq. (2), as described in Section 2.

5. Results

To assess the effectiveness of the proposed method, we divided the data into four periods: astronomical winter, astronomical spring, astronomical summer and astronomical fall,² as data is available only for the year 2008 and it can be shown (Fig. 3) that the selected seasons have different characteristics. From Fig. 3 it can be seen that the four seasons can be further divided into two main groups. The first group contains spring and summer, which have high ozone concentrations from 10 h to 17 h; the second group contains fall and winter, which have high ozone concentrations from 12 h to 15 h. The first group has a much higher and longer ozone-concentration peak than the second group.

To demonstrate the effectiveness of the on-line trained model in the case of changing characteristics, we performed both off-line and on-line (without forgetting) training of GP models on winter-season data and validated the models on summer-season data. It should be noted that the off-line trained model was used as the basis of the on-line training and was then constantly adapted through the validation data, i.e. the whole year. Considering the analysis of the regressors in Section 3, the model structure D2 is used.

² Astronomical winter lasts from 21 December to 20 March, astronomical spring from 21 March to 20 June, summer from 21 June to 22 September and fall from 23 September to 20 December.
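During this constant adaptation, steps 4 and 5 of Algorithm 1 (CALCHYPVAL and UPDATECOVMAT) are executed whenever the active set changes. A hedged sketch of these two operations is given below; it assumes log-scaled hyperparameters [log ℓ, log σf, log σn] of a squared-exponential covariance and uses SciPy's general-purpose conjugate-gradient minimizer as a stand-in for the Polak–Ribiere routine of [29], so it illustrates the operations rather than reproducing the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, y):
    # theta = [log ell, log sf, log sn]: log-scaled hyperparameters of a
    # squared-exponential covariance, so the search is unconstrained.
    ell, sf, sn = np.exp(theta)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = sf**2 * np.exp(-0.5 * d2 / ell**2) + sn**2 * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e9  # penalise hyperparameters that break the factorisation
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(0.5 * y @ alpha + np.log(np.diag(L)).sum()
                 + 0.5 * len(y) * np.log(2.0 * np.pi))

def calc_hyp_val(X, y, theta0):
    # CALCHYPVAL stand-in: SciPy's conjugate-gradient minimizer replaces
    # the Polak-Ribiere routine of [29].
    return minimize(neg_log_marginal_likelihood, theta0,
                    args=(X, y), method="CG").x

def update_cov_mat(X, theta):
    # UPDATECOVMAT: rebuild the covariance matrix K (Eq. (2)) with the
    # newly optimised hyperparameters.
    ell, sf, sn = np.exp(theta)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2) + sn**2 * np.eye(len(X))
```

Working in the log domain keeps the hyperparameters positive without explicit constraints, which is a common choice in GP implementations.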


[Fig. 4 appears here: two panels of ozone concentration [μg/m³] versus hours (0–24) for 27.7.2008, each showing the validation data, the GP model mean and the GP model mean ± 2 std; panel (a) off-line training in winter, panel (b) on-line training without forgetting.]

Fig. 4. The predicted mean value and 95% confidence interval of ozone concentration for 27 July 2008: (a) using off-line training in winter season; (b) using on-line training through the whole validation data, i.e. the whole year.

Table 3
Error measures of the ozone concentration predictions for 1 day (27 July 2008) based on the off-line trained and on-line trained models.

Model     ASE        LDE
Off-line  0.024086   0.50363
On-line   0.015686   0.33299
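For reference, error measures of this kind can be computed directly from the predictive means and variances. The sketch below assumes common definitions (ASE as the average squared error, LDE as the average minus log of the Gaussian predictive density); the paper's exact definitions, given in an earlier section, may differ in scaling.

```python
import numpy as np

def ase(y, mu):
    # Assumed definition: average squared error between the measured
    # values y and the predictive means mu.
    return float(np.mean((y - mu) ** 2))

def lde(y, mu, var):
    # Assumed definition: average minus log of the Gaussian predictive
    # density, which also penalises badly calibrated variances var.
    return float(0.5 * np.mean(np.log(2.0 * np.pi) + np.log(var)
                               + (y - mu) ** 2 / var))
```

Note that a measure like LDE depends on the predictive variance as well as on the error, so ASE and LDE need not rank models identically.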

Predictions of the ozone concentration for 1 day (27 July 2008)³ based on both models are depicted in Fig. 4 and their error measures are given in Table 3. It can be seen that the predictions of the ozone concentration based on the on-line trained model are much more accurate than the predictions based on the off-line trained model, especially in the period between 10 h and 18 h, when the ozone concentration usually reaches harmful values.

To analyse the influence of the exponential forgetting on the predictions of the ozone concentration, several evaluations of on-line trained GP models with various forgetting factors λ were performed: 1 (without forgetting), 0.99995, 0.99985, 0.99975, 0.99965 and below. Error measures of the ozone concentration predictions for all validation data (spring, summer and fall seasons) of each model are given in Table 4. For comparison, error measures of the ozone concentration predictions based on both off-line trained models⁴ are given as well.

³ This day was chosen as it is from the summer season and it reaches harmful ozone concentrations.

Table 4
Error measures of the ozone concentration predictions for all available data based on on-line trained models with forgetting factors λ: 1, 0.99995, 0.99985, 0.99975 and 0.99965, and both off-line trained models.

Model                  ASE        LDE
On-line   λ = 1        0.015686   0.63299
          λ = 0.99995  0.014668   0.68761
          λ = 0.99985  0.014277   0.72404
          λ = 0.99975  0.014351   0.72279
          λ = 0.99965  0.014329   0.72249
Off-line  winter       0.024086   0.50363
          all year     0.01838    0.61022

[Fig. 5 appears here: two panels of ozone concentration [μg/m³] versus hours (0–24) for 9.3.2008 and 9.8.2008, each showing the validation data, the GP model mean and the GP model mean ± 2 std (on-line training with forgetting).]

Fig. 5. The predicted mean value and 95% confidence interval of ozone concentration for: (a) 9 March and (b) 9 August 2008.

⁴ Model off-line trained on winter-season data and model off-line trained on whole-year data, used in Section 3.


As the on-line trained model with forgetting factor λ = 0.99985 has the minimal error measures (Table 4), its predictions of the ozone concentration for 2 days are depicted in Fig. 5. It can be seen that the predictions are very similar to the measurements; therefore, the proposed approach seems a promising alternative for training GP models for predicting the ozone concentration.

6. Conclusions

In this paper, various first- and high-order GP models for prediction of the ozone concentration in the air of Bourgas, Bulgaria are identified off-line and compared. The results show that, for the same model type, the high-order models are more accurate than the first-order models. The best model is the second-order model, whose input parameters are the values of the ozone concentration and the meteorological parameters at the two previous hours.

Furthermore, an on-line updating (evolving) GP model is proposed as an alternative approach. Such an approach is needed when the training data is not available for the whole period of interest, so that not all characteristics of the period can be trained, or when the environment that is to be modelled is constantly changing. The proposed approach is evaluated as a second-order model⁵ with various forgetting factors. The investigation shows that the predictions obtained with the proposed approach are very similar to the predictions obtained with a model trained off-line on the whole-year data. Nevertheless, the proposed approach is more suitable for changing environments and therefore seems a promising alternative for training GP models for predicting the ozone concentration. The analysis of the influence of the exponential forgetting on the predictions of the ozone concentration shows that the forgetting factor λ = 0.99985 provides the most accurate predictions.
Acknowledgements

This work was financed by the National Science Fund of the Ministry of Education, Youth and Science of the Republic of Bulgaria, Contract DO02-94/14.12.2008, and the Slovenian Research Agency, Contract BI-BG/09–10-005 ("Application of Gaussian processes to the modelling and control of complex stochastic systems").

References

[1] Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner air for Europe, 2008.
[2] S.M. Al-Alawi, S.A. Abdul-Wahab, C.S. Bakheit, Combining principal component regression and artificial neural networks for more accurate predictions of ground-level ozone, Environmental Modelling & Software 23 (2008) 396–403.
[3] P. Angelov, R. Buswell, Evolving rule-based models: a tool for intelligent adaptation, in: Proceedings of the Joint 9th IFSA World Congress and 20th NAFIPS International Conference, 2001, pp. 1062–1066.
[4] P. Angelov, D.P. Filev, N. Kasabov, Evolving Intelligent Systems: Methodology and Applications, IEEE Press Series on Computational Intelligence, Wiley-IEEE Press, 2010.
[5] K.J. Åström, B. Wittenmark, Computer Controlled Systems: Theory and Design, Prentice Hall, 1984.
[6] K. Ažman, J. Kocijan, Application of Gaussian processes for black-box modelling of biosystems, ISA Transactions 46 (4) (2007) 443–457.
[7] A.B. Chelani, Prediction of daily maximum ground ozone concentration using support vector machine, Environmental Monitoring and Assessment 162 (2010) 169–176.
[8] L. Csató, M. Opper, Sparse online Gaussian processes, Neural Computation 14 (3) (2002) 641–668.
[9] C. Dueñas, M.C. Fernández, S. Cañete, J. Carretero, E. Liger, Stochastic model to forecast ground-level ozone concentration at urban and rural areas, Chemosphere 61 (2005) 1379–1389.
[10] Y. Feng, W. Zhang, D. Sun, L. Zhang, Ozone concentration forecast method based on genetic algorithm optimized back propagation neural networks and support vector machine data classification, Atmospheric Environment 45 (2011) 1979–1985.
[11] B. Fritzke, Growing cell structures – a self-organizing network for unsupervised and supervised learning, Neural Networks 7 (9) (1994) 1441–1460.
[12] A. Grancharova, J. Kocijan, T.A. Johansen, Explicit stochastic predictive control of combustion plants based on Gaussian process models, Automatica 44 (4) (2008) 1621–1631.
[13] A. Grancharova, J. Kocijan, A. Krastev, H. Hristova, High order Gaussian process models for prediction of ozone concentration in the air, in: Proceedings of the 7th EUROSIM Congress on Modelling and Simulation, EUROSIM'10, 2010.
[14] A. Grancharova, D. Nedialkov, J. Kocijan, H. Hristova, A. Krastev, Application of Gaussian processes to the prediction of ozone concentration in the air of Bourgas, in: Proceedings of the International Conference on Automatics and Informatics, 2009, pp. IV-17–IV-20.
[15] B. Grašič, P. Mlakar, M.Z. Božnar, Ozone prediction based on neural networks and Gaussian processes, Nuovo Cimento della Società Italiana di Fisica, Section C 29 (6) (2006) 651–662.
[16] N.K. Kasabov, Evolving Connectionist Systems: Methods and Applications in Bioinformatics, Brain Study and Intelligent Machines, Springer-Verlag, New York, 2002.
[17] J. Kocijan, Gaussian process models for systems identification, in: Proceedings of the 9th International PhD Workshop on Systems and Control: Young Generation Viewpoint, 2008.
[18] J. Kocijan, A. Girard, B. Banko, R. Murray-Smith, Dynamic systems identification with Gaussian processes, Mathematical and Computer Modelling of Dynamical Systems 11 (4) (2005) 411–424.
[19] J. Kocijan, B. Likar, Gas–liquid separator modelling and simulation with Gaussian-process models, Simulation Modelling Practice and Theory 16 (8) (2008) 910–922.
[20] Y. Lin, W.G. Cobourn, Fuzzy system models combined with nonlinear regression for daily ground-level ozone predictions, Atmospheric Environment 41 (2007) 3502–3513.
[21] A. Nebot, V. Mugica, A. Escobet, Ozone prediction based on meteorological variables: a fuzzy inductive reasoning approach, Atmospheric Chemistry and Physics Discussions 8 (2008) 12343–12370.
[22] D. Nedialkov, M. Angelova, G. Baldjiev, A. Krastev, H. Hristova, Ozone concentrations in the air – standards and daily cycle, in: Proceedings of the 19th International Symposium on Bioprocess Systems, Sofia, 2006.

⁵ The best second-order model structure obtained in the first experiment.


[23] D. Nedialkov, A. Grancharova, H. Hristova, A. Krastev, Linear regression models for prediction of ozone concentration in the air of Bourgas, in: Proceedings of the International Conference on Automatics and Informatics, Sofia, 2009, pp. IV-21–IV-24.
[24] D. Petelin, J. Kocijan, Control system with evolving Gaussian process models, in: Proceedings of the 2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS), 2011, pp. 178–184.
[25] D. Petelin, J. Kocijan, A. Grancharova, On-line Gaussian process model for the prediction of the ozone concentration in the air, Comptes Rendus de l'Académie Bulgare des Sciences 64 (1) (2011) 117–124.
[26] E. Pisoni, M. Farina, C. Carnevale, L. Piroddi, Forecasting peak air pollution levels using NARX models, Engineering Applications of Artificial Intelligence 22 (2009) 593–602.
[27] J. Quiñonero-Candela, C.E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, Journal of Machine Learning Research 6 (2005) 1939–1959.
[28] J. Quiñonero-Candela, C.E. Rasmussen, C.K.I. Williams, Approximation Methods for Gaussian Process Regression, Tech. Rep., Microsoft Research, September 2007.
[29] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
[30] M. Seeger, C.K.I. Williams, N.D. Lawrence, Fast forward selection to speed up sparse Gaussian process regression, in: Ninth International Workshop on Artificial Intelligence and Statistics, Society for Artificial Intelligence and Statistics, 2003.
[31] T.A. Solaiman, P. Coulibaly, P. Kanaroglou, Ground-level ozone forecasting using data-driven methods, Air Quality, Atmosphere & Health 1 (2008) 179–193.