Proceedings of the 18th World Congress The International Federation of Automatic Control Milano (Italy) August 28 - September 2, 2011
Analysing Gaussian Processes for Stationary Black-Box Combustion Engine Modelling

Benjamin Berger*, Florian Rauscher** and Boris Lohmann*

* Institute of Automatic Control, Technische Universität München, 85748 Garching, Germany (e-mail: {benjamin.berger, lohmann}@tum.de).
** Kratzer Automation AG, 85716 Unterschleissheim, Germany (e-mail: [email protected]).
Abstract: The performance of model-based engine calibration depends strongly on the type of modelling. In this paper, a type of modelling which has not been used for engine calibration yet, the Gaussian process model, is introduced and compared to other state-of-the-art models. Starting from the requirements on the modelling, it is shown from theory that Gaussian process modelling is suitable for modelling stationary nonlinear engine mappings and has various advantages compared to other algorithms. Therefore, time and costs on the test-bed can be reduced if Gaussian process models are used for engine calibration. This theoretical result is validated by an application to a diesel engine, where NOx, consumption and soot are modelled.

Keywords: Model-Based Engine Calibration, Engine Modelling, Stochastic Processes, Black-Box Modelling, Model Comparison

1. INTRODUCTION
Due to the increasing number of actuators, legislative restrictions and the increasing requirements of customers, the task of engine calibration is time-consuming and cost-intensive, and therefore there is a need to automate and optimize this process. One of the first major steps in the automation of engine calibration was building a black-box model of the measured variables (e.g. soot, NOx, fuel consumption) and optimizing this surrogate model instead of the real engine (e.g. see Mitterer (2000) and Gschweitl et al. (2001)). This model-based engine calibration is still state of the art (e.g. see Deflorian et al. (2010)). In recent years there is a trend towards Online Optimization, where measurement, modelling and optimization are not strictly separated (see Klöpper (2008) and the references therein). A permanent interaction between the test-bed and the optimization routine leads to a better model quality in the desired regions and helps avoid measurements in undesired regions. At the end of the Online Optimization more and more measurements are taken in the area near the optimum. Compared to the Offline Optimization, no validation of the optimum is needed and no undesired loops of the process have to be made if the model quality is bad at the optimal point. Therefore this procedure leads to a substantial saving of time and costs, see Klöpper (2008).

An important question in Online and Offline Optimization of engines is which type of modelling to use. In this paper, a type of modelling which has not been used for engine calibration yet, the Gaussian process model, is introduced and compared to other state-of-the-art models. The ambition is to find a type of modelling which is suitable for an automated Online Optimization.
In section 2 the requirements on the modelling are discussed. Various types of modelling have been developed and used for engine calibration; section 3 gives an overview. In section 4 Gaussian processes are introduced, and section 5 compares this type of modelling with existing modellings for engine calibration. This theoretical study is validated by an application to a diesel engine in section 6. At the end of the paper, in section 7, an outlook is given.

2. REQUIREMENTS ON THE MODELLING

In this section, some important requirements on the modelling in model-based engine calibration are derived and discussed in detail. These requirements lead to many conclusions in the remainder of the paper. The focus is on Online Optimization, but most of the requirements also hold for Offline Optimization.
(1) The modelling must be suitable for high-dimensional problems (1-10 input dimensions). A practical example is the optimization of a diesel engine with the 6 parameters: quantity and time of the pre-injection, quantity and time of the post-injection, main injection time and injection pressure. This leads to a 6-dimensional input space.
(2) As time on the test-bed is very expensive, the number of measurements should be minimized. Therefore, the modelling should be able to build a good model with as few measurements as possible. That is why every measurement should contribute a maximum of information to the model.
(3) The modelling should be flexible enough that every nonlinear engine mapping can be approximated. A good adaptation of the model to the measurement data should always be possible.
(4) The problem of overfitting has to be solved. The model flexibility must never be too large, and overfitting on the measurement data has to be avoided. Thus, the modelling has to be robust to noise on the measurements.
(5) Items (3) and (4) have to run automatically and be robust to model parameters, so that an automated Online Optimization, with no manual interaction, is possible. This task is crucial. If, at any time in an automated Online Optimization, the modelling is not flexible enough or overfitting occurs, bad models will lead to wrong predictions and useless measurements will be taken in undesired regions. In the worst case, without manual interaction, a large part of the measurements would be meaningless and the optimization would cause high costs.
(6) The modelling must not only be able to predict an expectation of the true engine behaviour; a quantity describing the certainty and probability of the model is also important for an automated Online Optimization. Only with this quantity can measurement points be placed not only at the assumed optimum, but also where large uncertainty about the model expectation occurs.

One can formulate additional requirements for a good modelling for engine calibration (e.g. physical interpretability of the model). Some of those are discussed in section 5. A full list of possible requirements is beyond the scope of this paper.

3. OVERVIEW OF MODELLING-TYPES

The first modelling algorithms which were used in model-based engine calibration were polynomial models (e.g. see Mitterer (2000) and Gschweitl et al. (2001)). As polynomial regression suffers from some drawbacks, which are discussed in section 5, more sophisticated models were developed and used. One class of models can be regarded as tree-based models (see Bishop (2007)). This type of modelling divides the input space into smaller sub-spaces and assigns a separate model to each sub-space. In order to achieve a continuous model over the whole input space, the intersections between the separate models are smoothed. The LOLIMOT algorithm (LOcal LInear MOdel Tree) divides the input space into axis-orthogonal rectangles (see Nelles (2000)). The HHT algorithm (Hinging Hyperplane Trees) uses an optimization routine to also achieve non-axis-orthogonal intersections (see Ernst (1998)). Not only straight intersections are possible: Jakubek and Keuth (2005) suggest dividing the input space into sub-spaces with ellipsoidal contour lines. This type of modelling can also be viewed as a Takagi-Sugeno fuzzy system. Another class of models has its origin in the area of machine learning. Neural networks are widely used for model-based engine calibration, especially the MLP network (Multi-Layer Perceptron), e.g. see Mitterer (2000) and Poland (2002).
(Note: the RBF network (Radial Basis Function Network) is not suitable for high-dimensional input spaces, see requirement (1) in section 2 and Bishop (1996).) Algorithms which have not been used in engine calibration, but still have the potential to improve the process and are therefore examined here, are the support vector machine (see Smola and Schölkopf (2004)), the relevance vector machine (see Tipping (2001)) and Gaussian processes (see Rasmussen and Williams (2006)), which are the special topic of this paper. A possible arrangement of the modelling types discussed in this paper is as follows:

• Polynomial Regression
• Tree-based Models
  · LOLIMOT
  · HHT
  · Takagi-Sugeno Fuzzy Systems
• Machine Learning Algorithms
  · Neural Networks (MLP)
  · Support Vector Machines (SVM)
  · Relevance Vector Machines (RVM)
  · Gaussian Processes (GP)

4. GAUSSIAN PROCESSES FOR ENGINE CALIBRATION

The motivation for the use of Gaussian processes (GP) in engine calibration is straightforward. As we will see in section 5, polynomial regression, where the model is linear in its parameters, has some significant drawbacks. Fahrmeir et al. (2009) suggest that, if linear models cannot be used, GP models, which are a type of nonparametric regression, may be the solution. In addition, MacKay (2003) states that Gaussian processes are useful tools for automated tasks. Further, MLP networks with Bayesian regularisation are widely used in engine calibration (e.g. see Klöpper (2008)). As there is a relation between GP and MLP networks, which is also discussed in section 5, it was assumed that GP also work in practice. However, the analysis of GP models and their use for regression and prediction is far from new, MacKay (2003). Already in 1880, T.N. Thiele was analysing time series using Gaussian processes (see Lauritzen (1981)). Within the geostatistics field, regression using GP is called kriging (see Cressie (1993) and Fahrmeir et al. (2009)). Also, ARMA (autoregressive moving average) models and Kalman filters can be viewed as forms of Gaussian process models, Bishop (2007). Further, GP are used in the task of global optimization (e.g. see Jones (2001)). Because parametric models are most commonly used in engine calibration, to gain a better understanding of Gaussian process regression, the derivation of the formulas is started with the dual representation (for more detailed information see Bishop (2007)). A model which is linear in its parameters w can be written as

y(x) = w^T φ(x)                      (1)
where φ(x) is a vector of nonlinear basis functions (e.g. if we performed polynomial regression, then φ(x) would be (1, x_1, x_2, x_1 x_2, x_1^2, ...)^T). Typically the parameters of this model are determined by minimizing a sum-of-squares error function given by
J(w) = (1/2) Σ_{n=1}^{N} ( w^T φ(x_n) - t_n )^2                      (2)
where N is the number of measurements taken from the engine, x_n is the location of one measurement in the input space and t_n is the value of the measurement. Now we introduce

k(x_n, x_m) = φ(x_n)^T φ(x_m) = (K)_{n,m}                      (3)
the kernel function k(x_n, x_m) and the Gram matrix K. If we set the gradient of J(w) with respect to w equal to zero and substitute the solution for w into (1), we obtain

y(x) = k(x)^T K^{-1} t                      (4)
where we have defined the vector k(x) with elements k_n = k(x, x_n) and t = (t_1, ..., t_N)^T. From (4) we see that the solution of the least-squares problem (2) can be expressed completely by the kernel function k(x, x'). We can now work directly with the kernel functions, without explicit calculation of φ(x), which allows us to choose kernel functions for which the vector φ(x) implicitly contains many basis functions. If we use the common squared-exponential (SE) kernel

k(x, x') = exp( -||x - x'||^2 / (2 l^2) )                      (5)
then the vector φ(x) even has infinite dimensionality and therefore contains an infinite number of basis functions, which can be shown by expanding the kernel through a power series. Further, using the SE kernel leads to an infinitely differentiable function y(x).
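As an illustration, the kernel (5), the Gram matrix (3) and the dual-form prediction (4) could be computed as in the following sketch (Python with NumPy is assumed here; the code only mirrors the formulas above and is not part of the original derivation):

import numpy as np

def se_kernel(x1, x2, length_scale=1.0):
    # Squared-exponential kernel (5)
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * length_scale ** 2))

def gram_matrix(X, length_scale=1.0):
    # Gram matrix K with (K)_{n,m} = k(x_n, x_m), cf. (3); X is an (N, d) array
    N = X.shape[0]
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = se_kernel(X[n], X[m], length_scale)
    return K

def dual_prediction(x_star, X, t, length_scale=1.0):
    # Noise-free dual-form prediction y(x*) = k(x*)^T K^{-1} t, cf. (4); t is an (N,) array
    K = gram_matrix(X, length_scale)
    k_star = np.array([se_kernel(x_star, xn, length_scale) for xn in X])
    return k_star @ np.linalg.solve(K, t)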
In order to apply Gaussian processes for regression, we need to consider the noise on our measurements t_n of the engine, which are given by

t_n = y_n + ε_n                      (6)

where y_n = y(x_n) and ε_n is a random noise variable. In this paper an independent Gaussian-distributed noise is assumed, which is common in the task of engine calibration¹, so that

p(t|y) = N(t | y, σ² I)                      (7)
with y = (y_1, ..., y_N)^T. We follow the definition of Gaussian processes from Rasmussen and Williams (2006):

Definition. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

If we normalize our measurements from the engine to zero mean, then it follows from this definition that the distribution p(y) is also Gaussian, with zero mean and a covariance defined by the Gram matrix K, so that

p(y) = N(y | 0, K)                      (8)
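Equation (8) can be visualized by drawing sample functions from the GP prior; a minimal sketch (Python/NumPy assumed, with a small jitter added for numerical stability, an implementation detail not discussed in the paper) is:

import numpy as np

# 1-D input grid and SE Gram matrix, cf. (3) and (5)
x = np.linspace(0.0, 1.0, 100)
length_scale = 0.1
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * length_scale ** 2))

# draw three sample functions from p(y) = N(0, K), cf. (8)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)), size=3)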
Fig. 1. GP example: The dashed line represents the unknown function (sine), which in practice could be any nonlinear engine mapping. One only knows some measured data (circles), which is shifted by noise. With this measured data, a Gaussian process model can be calculated. The predicted mean (11) represents the estimated function value, and with the predicted variance (12) a 95% confidence interval can be calculated, which represents the degree of certainty with which the estimated function is expected. Note the widening confidence interval at the right edge, which represents increasing uncertainty due to difficult extrapolation and the lack of measurements in this area. Also, another GP mean is calculated whose parameters are not optimized, which one could interpret as overfitted on the training data. It can be seen that the probability p (calculated from (13)) of this overfitted GP is much smaller than the probability of the optimized GP. In this way, by optimizing the parameters, overfitting can be avoided.

From this we can derive the distribution p(t) if we integrate out y:

p(t) = ∫ p(t|y) p(y) dy = N(t | 0, K + σ² I)                      (9)

where we used standard formulae given in Rasmussen and Williams (2006). Given a set of measurements, we now want to predict measurements for new inputs. Using the definition again, it follows that the measurements and the prediction y* are jointly Gaussian distributed, which can be written as

[ t  ]        (     [ K + σ² I    k(x*)     ] )
[ y* ]   ~  N ( 0,  [ k(x*)^T     k(x*, x*) ] )                      (10)

Again, using standard formulae given in Rasmussen and Williams (2006), we can derive the predicted mean and variance of the Gaussian process, which are given by:

mean(y*) = k(x*)^T (K + σ² I)^{-1} t                      (11)

var(y*) = k(x*, x*) - k(x*)^T (K + σ² I)^{-1} k(x*)                      (12)
¹ If one uses a linear model (1) and applies a sum-of-squares error function (2) for the estimation of the parameters w, then one implicitly also assumes Gaussian noise, Bishop (2007). Note, however, that due to outliers in the measurements, which are quite common in the task of engine calibration, this assumption is not true. In section 7 a possible solution will be presented.
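A minimal sketch of the predictive equations (11) and (12), again assuming Python/NumPy and the SE kernel from above (illustrative only, not the authors' implementation):

import numpy as np

def gp_predict(x_star, X, t, length_scale=1.0, sigma=0.1):
    # GP predictive mean (11) and variance (12) for a single test input x_star;
    # X: (N, d) training inputs, t: (N,) zero-mean training targets
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sqdist / (2.0 * length_scale ** 2))                  # Gram matrix (3) with SE kernel (5)
    k_star = np.exp(-np.sum((X - x_star) ** 2, axis=1) / (2.0 * length_scale ** 2))
    A = K + sigma ** 2 * np.eye(len(X))                              # K + sigma^2 I
    mean = k_star @ np.linalg.solve(A, t)                            # (11)
    var = 1.0 - k_star @ np.linalg.solve(A, k_star)                  # (12); k(x*, x*) = 1 for the SE kernel
    return mean, var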
The kernel functions typically have parameters which can be tuned. E.g. if one uses the squared-exponential (SE) kernel (5), then there are d + 2 parameters to tune, where d is the dimension of the input space. Instead of fixing these
parameters or tuning them manually, these parameters can be inferred from the training data. The probability of the training data given the parameter values, p(t|Θ), is given by (Boyle (2007))

p(t|Θ) = (2π)^{-N/2} |K + σ² I|^{-1/2} exp( -(1/2) t^T (K + σ² I)^{-1} t )                      (13)
With this equation, all parameters can now be optimized on the data and the Gaussian process modelling results in a fully automatic approach. Figure 1 gives an example of Gaussian process regression.
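In practice the parameters Θ are typically found by maximizing the logarithm of (13); a minimal sketch (assuming Python with NumPy and SciPy, optimizing the log length scale and the log noise level, a parametrization chosen here and not specified in the paper):

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, t):
    # Negative logarithm of (13); theta = [log length_scale, log sigma]
    length_scale, sigma = np.exp(theta)
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sqdist / (2.0 * length_scale ** 2)) + sigma ** 2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(A)
    return 0.5 * (t @ np.linalg.solve(A, t) + logdet + len(X) * np.log(2.0 * np.pi))

# theta_opt = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(X, t)).x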
5. COMPARISON OF MODELLING-TYPES
In this section, the modelling types introduced in section 3 are compared to Gaussian process regression.

5.1 Gaussian Processes compared to polynomial regression

Polynomial regression has several advantages compared to Gaussian processes. Polynomials have a simple form, are well known and easy to understand. Further, polynomial regression is computationally cheap and easy to implement. In order to avoid overfitting (requirement (4)), statistical tests can be used (e.g. see Fahrmeir et al. (2009)). These tests remove parameters which are not significant and therefore are not needed in the model. In this way, a big set of basis functions can be chosen for regression without the fear of overfitting. However, there are some drawbacks of polynomial regression in theory and practice. One disadvantage of polynomial regression is a bad extrapolation of the data. Polynomials which are not constant over the whole input space tend very quickly to high (positive or negative) values outside the region of the measurement data, whereas Gaussian processes do not. Instead, using the SE kernel (5), Gaussian processes tend to the mean of the data when every measurement is far away from the prediction. Further, GP quickly indicate a growing uncertainty of the prediction through an increase of the variance (12) (e.g. see the right edge in figure 1). This property makes it easy for the user to distinguish which predictions one can trust. Another drawback is that low-order polynomials are not flexible enough to approximate any nonlinear engine mapping (requirement (3)) and high-order polynomials tend to waviness and 'end-effects'. Although Runge's phenomenon, which describes the problem of oscillation at the edges of an interval, is a problem of interpolation, these oscillations can also be observed in regression if the order of the polynomial increases, see figure 2. This oscillation is strongly related to the undesired effect of nonlocal behaviour in polynomial regression. When using polynomials for regression, Magee (1998) showed that measurements can have a large and undesired influence on the predicted function at a location which is very different from the location where the measurements have been made. E.g. one can show that an increase in the measurements at one location can cause a decrease at a very different location.
Fig. 2. Drawback of polynomial regression: The thick dashed line indicates Runge's function, which is given by 1/(1+x²). Training data (circles) is taken from this function for regression. For clarity, only half of the training data is plotted. One can see that polynomial regression has problems approximating this function. Whereas low-order polynomials cannot approximate this function, high-order polynomials become increasingly oscillatory at the edges of the training interval. If one continues to increase the order of the polynomial, the oscillation becomes very large, which is known as Runge's phenomenon. Note the good fit of the Gaussian process (SE kernel (5) is used).
Because of that, polynomials can only give a good local approximation of the nonlinear engine mapping. Further, these oscillating effects become even more severe if the space of regression increases. Due to the curse of dimensionality, this is typically the case if the number of inputs of the model increases. Therefore, Bishop (2007) points out that this method is not suitable for high dimensions (requirement (1)). Also, Zhou et al. (2005) discuss that, instead of polynomial regression, a Gaussian process model should be used for global modelling. The equations (11) and (12) hold for arbitrary kernel functions k(x, x'). If one uses a linear model (1) (e.g. a polynomial model), then the same results could be obtained. Therefore the Gaussian process viewpoint is a generalization of polynomial regression. The advantage of the Gaussian process viewpoint is that we can use kernel functions that can only be expressed by an infinite-dimensional vector φ(x), which is the case if one uses the common squared-exponential kernel (5). Hence we can now see the deeper reason why polynomials are bad at approximating some functions (like in figure 2): it is the limited number of basis functions. If one would add a basis function 1/(1+x²) in figure 2, then a perfect fit would be obtained. But how can one know which basis functions to use? In practice, Fahrmeir et al. (2009) therefore suggest going another way. The solution is to work with an infinite number of basis functions, which is given by the Gaussian process viewpoint if one chooses the SE kernel (5).
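The oscillation effect discussed above can be reproduced with a few lines of code; the following sketch (Python/NumPy assumed, with noise level and polynomial degrees chosen arbitrarily for illustration) fits polynomials of different order to noisy samples of Runge's function:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-5.0, 5.0, 25)
t = 1.0 / (1.0 + x ** 2) + 0.01 * rng.standard_normal(len(x))   # noisy Runge data

x_fine = np.linspace(-5.0, 5.0, 500)
for degree in (5, 10):
    coeffs = np.polyfit(x, t, degree)          # least-squares polynomial fit
    y_fit = np.polyval(coeffs, x_fine)
    # large values near the interval edges hint at the oscillation shown in figure 2
    print(degree, float(np.max(np.abs(y_fit))))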
5.2 Gaussian Processes compared to tree-based regression
As the drawbacks of polynomial regression (e.g. oscillation) become more severe if the space of regression increases, as mentioned above, one possible solution would be to divide the whole input space into smaller sub-spaces and calculate a separate polynomial model for each sub-space. This type of modelling can be regarded as tree-based regression, see Bishop (2007).
However, tree-based modelling also suffers from some drawbacks. The prediction at each intersection between the submodels is critical, which can be seen from the large error bars at the intersections in Nelles (2001) and in figure 3 (see below). Further, the interpolation behaviour of tree-based modelling in V-type situations, see Nelles (2001) page 409, can cause strange behaviour at the intersections of the model. Therefore, if d is the number of inputs, a (d-1)-dimensional sub-space results, the sub-space of intersections, which cannot be determined precisely. Another problem of tree-based modelling arises if the type of intersections does not suit the nature of the function which should be approximated. As an example, Runge's function 1/(1+x²), drawn in figure 3, and straight intersections, which are used in LOLIMOT and HHT, are considered. As polynomial regression has some drawbacks, which were discussed in the previous section, a tree-based modelling was performed by splitting x into two parts and building a separate polynomial model for each part. Clearly, this tree-based model gives a better fit than the polynomial models in figure 2. Now consider a multidimensional Runge function, like

f(x) = Π_{i=1}^{d} 1 / (1 + x_i²)                      (14)

where x_i are the different inputs. Obviously, like in the one-dimensional case in figure 2, polynomial regression will not provide a good fit on this function.
Tree-based regression has advantages. A relatively simple structure leads to a fast training speed of the model. Further, tree-based models share all the advantages of polynomial modelling, like an easy implementation. The human interpretability of a tree model is often seen as another major strength, Bishop (2007). With LOLIMOT and HHT, which are mentioned earlier in section 3, a human interpretation can be given by the particular tree structure that is learned. A human interpretation of a Gaussian process model can be obtained through automatic relevance determination (ARD), see Neal (1996) and MacKay (1994), where the degree of nonlinearity of each input can easily be determined. It is also possible to incorporate prior knowledge with ARD: if one knows the degree of nonlinearity of an input from a similar engine, then one can easily incorporate this knowledge in the measurement design for the engine of interest. A similar incorporation of prior knowledge is also possible with LOLIMOT and HHT, see Martini et al. (2003). Moreover, human interpretability and incorporation of prior knowledge are easier with tree-based models than with other algorithms from machine learning, see Nelles (2001).
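The ARD idea mentioned above corresponds to giving the SE kernel (5) one length scale per input dimension; a minimal sketch (Python/NumPy assumed, not taken from the paper) is:

import numpy as np

def se_ard_kernel(x1, x2, length_scales):
    # SE kernel with automatic relevance determination (one length scale per input dimension).
    # A large length scale for dimension i means the output depends only weakly on x_i.
    length_scales = np.asarray(length_scales, dtype=float)
    return np.exp(-0.5 * np.sum(((x1 - x2) / length_scales) ** 2))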
Fig. 3. Tree-based modelling: The input x is split up into two parts and a polynomial of degree 3 has been applied for regression on each part. This tree-based model clearly gives a better fit than the polynomial models in figure 2. In the lower graph, the membership functions Φ_1 and Φ_2 of the tree model are drawn.

To approximate the multidimensional Runge function (14), a tree-based modelling will split the whole input space into smaller sub-spaces. If only straight intersections are possible, a good fit can be obtained if every axis is split into two parts, like in figure 3. But in a d-dimensional input space, this will lead to 2^d independent submodels and therefore the number of measurements also increases exponentially with the number of inputs. Generally, if the engine mapping cannot be approximated by polynomials and the type of intersections does not suit the nature of the problem, then the number of submodels and therefore the number of measurements will grow rapidly with tree-based modelling in a high-dimensional space, which clearly contradicts requirement (2); a small sketch illustrating this growth is given below. Note that this is not the case with GP regression, since only one global model and no submodels exist. With GP the number of measurements required for approximating the multidimensional Runge function therefore increases approximately linearly. Therefore, like the MLP network, Gaussian processes can cope better with the curse of dimensionality, with fewer measurements than tree-based models, see also Nelles (2001). In addition, measurements which have been taken near an intersection will just belong to one submodel. Although these measurements are taken very close to a neighbouring submodel, they provide no information for this submodel. This clearly contradicts requirement (2), as every measurement should contribute a maximum of information to the model. This drawback can also be seen in figure 3. Because each submodel has no information about measurements outside its sub-space, the submodels do not know that the function values have to decrease outside their sub-space, and therefore the tree-based model gives high values at the intersection between the submodels. A better approximation of the intersection can only be given if more measurements are made. Note that the approximation with a GP does not have this problem, since the GP uses all information.
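To illustrate the growth argument, the following sketch (Python assumed; purely illustrative) evaluates the multidimensional Runge function (14) and the number of submodels that one axis-wise split into two parts per axis would create:

import numpy as np

def runge_multidim(x):
    # Multidimensional Runge function (14); x is a vector of inputs
    x = np.asarray(x, dtype=float)
    return float(np.prod(1.0 / (1.0 + x ** 2)))

for d in range(1, 11):
    # one split per axis into two parts -> 2^d submodels for a tree-based model
    print(d, 2 ** d, runge_multidim(np.zeros(d)))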
A further discussion on LOLIMOT, HHT and MLP (see section 5.3) can be found in Nelles (2001).

5.3 Gaussian Processes compared to other algorithms from machine learning

MLP networks. The MLP neural network is widely used in the area of machine learning. Also in the task of engine calibration, see Mitterer (2000), online engine calibration, see Klöpper (2008), and even dynamic online engine calibration, see Deflorian et al. (2010), the MLP is used for modelling. As mentioned above, the MLP can cope better with the curse of dimensionality than tree-based models, see also Nelles (2001), and therefore it needs fewer measurements for the same problem. With Bayesian regularization, see MacKay (1992), overfitting can be avoided even if the number of neurons is large, and therefore modelling with MLP networks with Bayesian regularization results in a fully automatic approach. An MLP modelling will generate different functions, due to random network initializations and nonlinear optimization of multimodal problems, which is sometimes seen as a drawback of modelling with MLPs. But exactly these different functions can be used to identify a quantity describing the certainty and probability of the model (requirement (6)). If different MLP models predict similar values, then the certainty of the prediction is expected to be high, and vice versa. This property has been used for online optimization in Klöpper (2008), where measurements are placed in areas where large uncertainty about the model expectation occurs. This approach is called query-by-committee (a small sketch is given below). There exists a simple relation between MLP networks and Gaussian processes. Neal (1996) has shown that, using standard regularization techniques, the distribution of functions generated by an MLP network will tend to a Gaussian process in the limit of an infinite number of neurons. This result raised a broad discussion in the area of machine learning about whether Gaussian processes could possibly replace neural networks, e.g. see MacKay (1998). Further, why should one use query-by-committee with MLP networks, when one can use a Gaussian process, which produces the same results as a committee of an infinite number of MLPs, each with an infinite number of neurons? A modelling with MLPs may be preferred if the number of measurements is large, because a modelling with Gaussian processes is computationally expensive, see section 5.4 and Rasmussen (1996). For more information on MLP networks, see Bishop (1996).

Support and relevance vector machines. Gaussian processes are kernel machines, whereas the support vector machine and the relevance vector machine are sparse kernel machines. Because Gaussian processes can be computationally expensive, see section 5.4, sparse kernel machines take a subset of measurements for prediction. Therefore these methods are computationally cheaper, but the predictions will not be as accurate as those of Gaussian processes, see Bishop (2007).
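A minimal sketch of the query-by-committee idea mentioned above (assuming Python with scikit-learn; committee size, network size and data handling are illustrative assumptions, not taken from the paper):

import numpy as np
from sklearn.neural_network import MLPRegressor

def committee_prediction(X_train, t_train, X_query, n_members=10):
    # Train several MLPs that differ only in their random initialization and use the
    # spread of their predictions as an uncertainty measure (query-by-committee).
    predictions = []
    for seed in range(n_members):
        net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed)
        net.fit(X_train, t_train)
        predictions.append(net.predict(X_query))
    predictions = np.array(predictions)
    # mean as the committee estimate, standard deviation as the uncertainty
    return predictions.mean(axis=0), predictions.std(axis=0)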
For a further discussion on machine learning and more methods for nonlinear regression see Rasmussen (1996).

5.4 Drawbacks of Gaussian process regression

For a modelling with Gaussian processes, a matrix inversion is needed for the optimization of the parameters (13) and for prediction (11), (12). This matrix has size N × N, where N is the number of measurements. Therefore the computational complexity scales with O(N³). Obviously, this modelling will be computationally infeasible if the number of measurements is too high, which is a significant limitation of Gaussian processes. Hence, on an average PC this method could only be tested for up to N = 10,000. Clearly, Gaussian processes are not suitable for dynamic engine calibration, where the number of measurements is high. But GP are appropriate for static engine calibration, because the number of measurements is seldom bigger than 10,000.

5.5 Summary of modelling comparison

For an automated process with little human interaction and with the primary objective of reducing time and costs at the test-bed, a solution with Gaussian processes may be preferred. If the number of measurements is too big to calculate a Gaussian process model, then other methods of machine learning can be helpful. If more measurements are available and a strong human interaction in the process is needed, then a tree-based model may be recommended. A polynomial model should only be used for small and low-complexity problems.

6. APPLICATION AND MODEL COMPARISON ON A DIESEL ENGINE

In this section a Gaussian process model is applied to NOx, consumption and soot measurements of a diesel engine and compared to a polynomial stepwise regression. In this application only a single operating point is considered and only local models are trained. The inputs of the models are the main injection time, injection pressure, quantity and time of the pre-injection and quantity of exhaust gas recirculation. This leads to a 5-dimensional input space. For reasons of confidentiality all measurements are scaled to the interval [0, 1]. From a total set of 144 measurements, 50 measurements are randomly removed for model validation and the remaining 94 are used for training. With this data a Gaussian process model was trained with (13) and predictions were made with (11). Further, a polynomial stepwise regression was calculated as a reference. A power series of order o = 5 formed the admissible set of regressors. Using a statistical test (e.g. given in Fahrmeir et al. (2009)), a stepwise regression has been performed, where the significant regressors were identified and subsequently used for prediction. Also, a power series of order o = 40 was tested, but it was found that no polynomial term of higher order than o = 5 was significant.
The results of the Gaussian process and the polynomial stepwise regression modelling are shown in figures 4 and 5. The NRMSE is given in table 1.
Fig. 4. Measured vs. predicted plot for training and validation data of NOx (top), consumption (middle) and soot (bottom) of the Gaussian process model.

Fig. 5. Measured vs. predicted plot for training and validation data of NOx (top), consumption (middle) and soot (bottom) of the polynomial stepwise regression model.

Table 1. NRMSE of NOx, consumption and soot for training and validation data.

                        training data    validation data
NOx          GP         0.234 %          0.402 %
             Poly       0.269 %          0.430 %
consumption  GP         0.540 %          0.912 %
             Poly       0.626 %          0.998 %
soot         GP         0.297 %          0.659 %
             Poly       1.022 %          1.550 %

As expected, the Gaussian process models give better results than the polynomial models. While there are only small differences between the models for NOx and consumption, there is obviously a big difference in the soot models, which indicates that this function is harder to approximate. Clearly, by increasing the number of measurements, the results of the polynomial model will also improve, because there exists a polynomial that approximates every engine mapping, which follows from the Stone-Weierstrass theorem. But for a small number of measurements it can be seen that GP allow a better regression, since they can use an infinite number of basis functions, see section 5.1. This can again be shown with the NOx measurements.

Table 2. NRMSE of NOx for reduced training and increased validation data.

            training data    validation data
GP          0.336 %          0.447 %
Poly        1.265 %          1.312 %
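The NRMSE values in tables 1 and 2 can be computed with a few lines; the exact normalization used by the authors is not stated, so the sketch below (Python/NumPy) assumes normalization by the range of the measured values:

import numpy as np

def nrmse(t_measured, t_predicted):
    # Root-mean-square error normalized by the range of the measured values (assumed definition), in percent
    t_measured = np.asarray(t_measured, dtype=float)
    t_predicted = np.asarray(t_predicted, dtype=float)
    rmse = np.sqrt(np.mean((t_measured - t_predicted) ** 2))
    return 100.0 * rmse / (np.max(t_measured) - np.min(t_measured))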
Reducing the training data to 43 measurements and increasing the validation data to 101 measurements makes it harder for the modelling algorithms to calculate correct predictions. This is shown in figure 6. The NRMSE is given in table 2. As one can see, with reduced training data the results are not as good as before, but again the Gaussian process model makes reasonable predictions with these fewer measurements, while the polynomial model cannot be used. Further, with only 43 training points, a tree-based modelling would clearly give no improvement, since the measurements have to be split up into their sub-spaces. Hence, it has been shown that Gaussian processes work in practice and that they are able to reduce the number of measurements and therefore reduce time and costs on the test-bed.
Fig. 6. Measured vs. predicted plot for reduced training and increased validation data of NOx of the Gaussian process model (top) and the polynomial stepwise regression model (bottom).

7. OUTLOOK

Because modelling with Gaussian processes is quite common, various algorithms have been developed for the online optimization of time-consuming computer simulations, see Jones (2001), and these algorithms can also be used for an online optimization of a combustion engine. Further, the assumption of Gaussian noise (6), which is made by nearly every modelling for engine calibration, is not true if outliers occur in the measurements, which is quite common in the task of engine calibration. A model which is robust to outliers can be achieved with the assumption of Student-t noise. Because the Student-t distribution has longer 'tails' than the Gaussian, outliers in the measurements will not change the shape of the model, e.g. see Neal (1997).

REFERENCES

Bishop, C.M. (1996). Neural Networks for Pattern Recognition. Oxford University Press, USA.
Bishop, C.M. (2007). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.
Boyle, P. (2007). Gaussian processes for regression and optimisation. Ph.D. thesis, University of Wellington.
Cressie, N. (1993). Statistics for Spatial Data. Wiley.
Deflorian, M., Klöpper, F., and Rückert, J. (2010). Online dynamic black box modelling and adaptive experiment design in combustion engine calibration. In 6th IFAC Symposium "Advances in Automotive Control".
Ernst, S. (1998). Hinging hyperplane trees for approximation and identification. In Proceedings of the 37th IEEE Conference on Decision and Control.
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression (Modelle, Methoden und Anwendungen). Springer.
Gschweitl, K., Pluegl, H., Fortuna, T., and Leithgoeb, R. (2001). Steigerung der Effizienz in der modellbasierten Motorenapplikation durch die neue CAMEO Online DoE-Toolbox. ATZ, 103, 636-643.
Jakubek, S. and Keuth, N. (2005). Improved neuro-fuzzy models for design processes and simulation in automotive applications. Automatisierungstechnik, 53, 425-433.
Jones, D.R. (2001). A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 345-383.
Klöpper, F. (2008). Entwicklung und Einsatz modellgestützter Online-Methoden zur Parameteroptimierung von Verbrennungsmotoren am Prüfstand. Ph.D. thesis, TU München.
Lauritzen, S.L. (1981). Time series analysis in 1880: A discussion of contributions made by T.N. Thiele. International Statistical Review, 49, 319-331.
MacKay, D.J.C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448-472.
MacKay, D.J.C. (1994). Bayesian methods for backprop networks. Models of Neural Networks, 3, 211-254.
MacKay, D.J.C. (1998). Introduction to Gaussian processes. In C.M. Bishop (ed.), Neural Networks and Machine Learning, 133-166. Springer.
MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Magee, L. (1998). Nonlocal behavior in polynomial regressions. The American Statistician, 52, 20-22.
Martini, E., Voss, H., Töpfer, S., and Isermann, R. (2003). Effiziente Motorapplikation mit lokal linearen neuronalen Netzen. MTZ, 5, 406-413.
Mitterer, A. (2000). Optimierung vielparametriger Systeme in der KFZ Antriebsentwicklung. Ph.D. thesis, TU München.
Neal, R.M. (1996). Bayesian Learning for Neural Networks. Springer, Lecture Notes in Statistics.
Neal, R.M. (1997). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification.
Nelles, O. (2000). Local linear model trees (LOLIMOT) toolbox for nonlinear system identification. In IFAC Symposium on System Identification (SYSID), Santa Barbara, USA.
Nelles, O. (2001). Nonlinear System Identification. Springer.
Poland, J. (2002). Modellgestützte und Evolutionäre Optimierungsverfahren für die Motorentwicklung. Ph.D. thesis, University of Tübingen.
Rasmussen, C.E. (1996). Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. Ph.D. thesis, University of Toronto.
Rasmussen, C.E. and Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. The MIT Press.
Smola, A.J. and Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199-222.
Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211-244.
Zhou, Z., Ong, Y.S., Nguyen, M.H., and Lim, D. (2005). A study on polynomial regression and Gaussian process global surrogate model in hierarchical surrogate-assisted evolutionary algorithm. Evolutionary Computation, 3.