Proceedings of the 15th IFAC Symposium on System Identification Saint-Malo, France, July 6-8, 2009
Global Supervised and Local Unsupervised Learning in Local Model Networks ⋆

Benjamin Hartmann ∗, Oliver Nelles ∗, Igor Škrjanc ∗∗, Anton Sodja ∗∗

∗ University of Siegen, Department of Mechanical Engineering, D-57068 Siegen, Germany (e-mail: [email protected]).
∗∗ University of Ljubljana, Department of Electrical Engineering, SI-1000 Ljubljana, Slovenia (e-mail: [email protected]).
Abstract: In this paper a new algorithm for nonlinear system identification with local linear models is proposed. The algorithm utilizes product space clustering inherently in a heuristic tree-construction algorithm. The highly flexible validity functions obtained by fuzzy clustering, combined with supervised learning in the framework of a local model network, result in an efficient partitioning algorithm. Its properties are illustrated by a demonstration example.

Keywords: Nonlinear System Identification; Neural Networks; Grey Box Modeling.

⋆ This work was supported by the German Research Foundation (Deutsche Forschungsgemeinschaft (DFG), project code NE 656/3-1).

1. INTRODUCTION

In the last two decades, architectures based on the interpolation of local models have attracted more and more interest as static function approximators and particularly as nonlinear dynamic models. Local linear models allow the transfer of many insights and methods from the mature field of linear control theory to the nonlinear world. Recent advances in the area of convex optimization and the development of efficient algorithms for the solution of linear matrix inequalities have contributed significantly to the boom of local linear model structures. The output ŷ of a local model network with p inputs u = [u_1 u_2 · · · u_p]^T can be calculated as the interpolation of M local model outputs ŷ_i(·), i = 1, . . . , M, see Fig. 1 [Nelles, 2001]:

\hat{y} = \sum_{i=1}^{M} \hat{y}_i(u)\, \Phi_i(u)    (1)

Fig. 1. Local model network: The outputs ŷ_i(·) of the local models (LM_i) are weighted with their validity function values Φ_i(·) and summed up.

where the Φ_i(·) are called interpolation, validity or weighting functions. These validity functions describe the regions where the local models are valid; they describe the contribution of each local model to the output. From the fuzzy logic point of view, (1) realizes a set of M fuzzy rules where the Φ_i(·) represent the rule premises and the ŷ_i(·) are the associated rule consequents. Because a smooth transition (no switching) between the local models is desired here, the validity functions are smooth functions between 0 and 1. For a reasonable interpretation of local model networks it is furthermore necessary that the validity functions form a partition of unity:

\sum_{i=1}^{M} \Phi_i(u) = 1    (2)

Thus, everywhere in the input space the contributions of all local models sum up to 100%. In principle, the local models can be chosen of arbitrary type. If their parameters shall be estimated from data, however, it is extremely beneficial to choose a linearly parameterized model class. The most common choice are polynomials. Polynomials of degree 0 (constants) yield a neuro-fuzzy system with singletons or a normalized radial basis function network. Polynomials of degree 1 (linear) yield local linear model structures, which is by far the most popular choice. As the degree of the polynomials increases, the number of local models required for a certain accuracy decreases. Thus, by increasing the local models' complexity, at some point a polynomial of high degree with
just one local model (M = 1) is obtained, which is in fact equivalent to a global polynomial model (Φ_1(·) = 1).
Besides the possibility of transferring parts of mature linear theory to the nonlinear world, local linear models seem to represent a good trade-off between the required number of local models and the complexity of the local models themselves. Due to their overwhelming importance, and for simplicity of notation, the rest of this paper deals only with local models of linear type:

\hat{y}_i(u) = w_{i,0} + w_{i,1} u_1 + w_{i,2} u_2 + \ldots + w_{i,p} u_p    (3)
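To make the interpolation in (1)-(3) concrete, the following minimal sketch evaluates a local linear model network for given validity values. It is an illustration only; all names are ours, not from the paper, and the validity function is assumed to be supplied by the user as a callable that satisfies the partition of unity (2).

```python
import numpy as np

def lmn_output(u, W, validity_fn):
    """Evaluate a local model network, cf. (1)-(3).
    u           : input vector of length p
    W           : (M, p+1) array; row i holds [w_i0, w_i1, ..., w_ip]
    validity_fn : callable returning the M validity values Phi_i(u)"""
    Phi = validity_fn(u)              # validity values, partition of unity (2)
    y_loc = W[:, 0] + W[:, 1:] @ u    # local linear model outputs y_i(u), cf. (3)
    return float(Phi @ y_loc)         # weighted sum of local outputs, cf. (1)

# Toy usage: two local models on one input, blended linearly over [0, 1]
W = np.array([[0.0, 1.0], [1.0, -1.0]])
Phi = lambda u: np.array([1 - u[0], u[0]])   # a valid partition of unity
y = lmn_output(np.array([0.3]), W, Phi)      # 0.7*0.3 + 0.3*0.7 = 0.42
```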
However, an extension to polynomials of higher degree or other linearly parameterized model classes is straightforward. One of the key features of local model networks is that the input spaces for the local models and for the validity functions can be chosen independently. In the fuzzy interpretation this means that the rule premises (IF) can operate on (partly) different variables than the rule consequents (THEN). With different input spaces, (1) has to be extended to, see Fig. 2:

\hat{y} = \sum_{i=1}^{M} \hat{y}_i(x)\, \Phi_i(z)    (4)

with x = [x_1 x_2 · · · x_{nx}]^T spanning the consequent input space and z = [z_1 z_2 · · · z_{nz}]^T spanning the premise input space. This feature enables the user to incorporate prior knowledge about the strength of nonlinearity from each input to the output into the model structure. Or, the other way round, the user can draw such conclusions from a black-box model that has been identified from data. Especially for dynamic models, where the model inputs include delayed versions of the physical inputs and output, the dimension nx becomes very large in order to cover all dynamic effects. In the most general case (universal approximator) this is also true for nz. However, for many practical problems a lower-dimensional z can be chosen; sometimes even one or two scheduling variables can yield sufficiently accurate models.

Fig. 2. For local model networks the inputs can be assigned to the premise and/or consequent input space according to their nonlinear or linear influence on the model output.

Once the validity functions are determined, it is easy to efficiently estimate the parameters w_{ij} of the local linear models by local or global least squares methods. The decisive difference between all proposed algorithms to construct local linear model structures is the strategy used to partition the input space spanned by z = [z_1 z_2 · · · z_{nz}]^T, i.e., to choose the validity regions and consequently the parameters of the validity functions. This strategy determines the key properties of both the construction algorithm and the finally constructed model.
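As a brief illustration of (4), the consequent input x and the premise input z can simply be different selections of the full model input. The sketch below is ours and purely schematic; index sets and the validity callable are hypothetical.

```python
import numpy as np

def lmn_output_xz(u, x_idx, z_idx, W, validity_fn):
    """Evaluate (4): local models operate on x, validity functions on z.
    u     : full input vector (e.g., including delayed inputs/outputs)
    x_idx : indices of u forming the consequent input x
    z_idx : indices of u forming the (often low-dimensional) premise input z"""
    x, z = u[x_idx], u[z_idx]
    Phi = validity_fn(z)              # Phi_i(z) on the premise space
    y_loc = W[:, 0] + W[:, 1:] @ x    # y_i(x) on the consequent space
    return float(Phi @ y_loc)
```

For a dynamic model, x would typically collect the delayed inputs and outputs, while z might contain only one or two scheduling variables, exactly the situation described above.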
This article is organized as follows. Section 2 gives an overview of the two partitioning strategies, product space clustering and heuristic tree-construction algorithms, and specifies their advantages and drawbacks. In Sect. 3 a new partitioning strategy is introduced and its properties are discussed. The incremental construction algorithm is proposed in Sect. 4. Strategies for initial cluster center placement to circumvent unfeasible splits are introduced in Sect. 5. The performance of the axes-oblique partitioning strategy is evaluated in Sect. 6. This paper ends by summarizing the important conclusions.

2. CLUSTERING VERSUS HEURISTIC TREE CONSTRUCTION

This section gives an overview of the two popular strategies proposed for input space partitioning: product space clustering and heuristic tree-construction algorithms. A theoretical investigation will analyze the key advantages and drawbacks of each approach.

2.1 Product Space Clustering
Product space clustering strategies focus on the product space that is jointly spanned by the inputs and the output, [z_1 z_2 · · · z_{nz} y]. Usually the Gustafson-Kessel [Gustafson and Kessel, 1979] or Gath-Geva [Gath and Geva, 1989] clustering algorithms are applied, which search for hyper-ellipsoids of equal or different volumes, respectively. Fig. 3 shows an example with one input and one output dimension; the process is modeled with three clusters. These algorithms are able to discover local hyperplanes in the product space by forming ellipsoids with a very small extension in one direction, which can be observed from the covariance matrix [Babuška and Verbruggen, 1996]. To date, these are the most popular partitioning strategies for building local linear model networks. Due to the high flexibility of the validity functions in size and orientation, the curse of dimensionality is a much lesser issue than for most competing strategies. However, this comes at the price of a significantly reduced interpretability in terms of fuzzy logic. Another drawback of all clustering-based partitioning strategies is that the multi-dimensional validity functions cannot be projected to one-dimensional membership functions without losing modeling accuracy. And even if a loss in accuracy is conceded, a further merging step will be necessary to significantly reduce the number of membership functions [Babuška et al., 1996]; otherwise the number of membership functions for each variable would be equal to the number of rules. Moreover, two serious restrictions of these partitioning strategies must be mentioned that limit their applicability to complex problems significantly. First, these algorithms work only for local linear models. It has been shown in [Nelles, 2001] that local quadratic models can be much superior in specific applications. Second, the input space for the rule premises and the rule consequents must be identical, because the partitioning strategy and the local model structure are inherently intertwined in the clustering approach. Thus, it is necessary to choose x = z, which is a severe limitation!
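For readers unfamiliar with the method, one iteration of Gustafson-Kessel clustering in the product space can be sketched as follows. This is the standard textbook formulation [Gustafson and Kessel, 1979], not the authors' implementation; the fuzziness exponent m = 2 and unit cluster volumes are common default assumptions.

```python
import numpy as np

def gk_step(X, U, m=2.0, eps=1e-12):
    """One Gustafson-Kessel iteration on product-space data.
    X : (N, d) data rows [z_1 ... z_nz y],  U : (M, N) fuzzy memberships."""
    Um = U ** m
    centers = (Um @ X) / Um.sum(axis=1, keepdims=True)     # weighted cluster centers
    D = np.empty_like(U)
    for i, c in enumerate(centers):
        R = X - c                                          # (N, d) residuals
        F = (Um[i, :, None] * R).T @ R / Um[i].sum()       # fuzzy covariance matrix
        # volume-normalized norm matrix: hyper-ellipsoids of arbitrary orientation
        A = np.linalg.det(F) ** (1.0 / X.shape[1]) * np.linalg.inv(F)
        D[i] = np.maximum(np.einsum('nd,de,ne->n', R, A, R), eps)  # squared distances
    # standard fuzzy c-means membership update on the GK distances
    U_new = 1.0 / (D ** (1.0 / (m - 1)) * (1.0 / D ** (1.0 / (m - 1))).sum(axis=0))
    return centers, U_new
```

The fuzzy covariance matrix F is what gives the clusters their arbitrary orientation and their ability to flatten into local hyperplanes, as discussed above.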
Fig. 3. Piece-wise linear approximation by clusters.

2.2 Heuristic Tree-Construction Algorithms

Based on the heuristic tree search algorithm CART [Breiman et al., 1984], many similar partitioning strategies like LOLIMOT [Nelles et al., 1996] have been proposed for local model networks; see also [Sugeno and Kang, 1988, Johansen, 1995]. Their key idea is to incrementally subdivide the input space by axes-orthogonal cuts. Besides their simplicity, the strict separation between the rule premise and consequent input spaces, and their low computational demand, one big advantage is their easy interpretability in terms of fuzzy logic. The axes-orthogonal partitioning always allows a projection of the validity regions to the one-dimensional input variables. Also, undesirable normalization side effects and the extrapolation behavior can be improved compared to clustering or data-based strategies [Shorten and Murray-Smith, 1997, Nelles, 2001]. Their main drawback lies inherently in the axes-orthogonal partitioning strategy: the performance of these algorithms degrades more and more with an increasing dimensionality of the premise input space. Thus, they are quite sensitive with respect to the curse of dimensionality. An example of a local model network with three local models (LMs) and the corresponding validity functions is shown in Fig. 4; a minimal sketch of the underlying splitting step is given below.
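The following sketch illustrates the LOLIMOT-style axes-orthogonal splitting step just described: the worst region is halved along every premise dimension and the best candidate split is kept. It is a generic illustration in our own notation; `fit_and_score` is a hypothetical helper that fits local models on the two candidate regions and returns a global error measure.

```python
def best_orthogonal_split(worst_region, fit_and_score):
    """Try halving the worst region along every axis of z, LOLIMOT-style.
    worst_region : list of (lo, hi) bounds, one pair per premise dimension"""
    best = None
    for axis, (lo, hi) in enumerate(worst_region):
        mid = 0.5 * (lo + hi)
        left = [(l, mid) if k == axis else (l, h)
                for k, (l, h) in enumerate(worst_region)]
        right = [(mid, h) if k == axis else (l, h)
                 for k, (l, h) in enumerate(worst_region)]
        err = fit_and_score(left, right)      # local WLS fit + global error
        if best is None or err < best[0]:
            best = (err, left, right)
    return best
```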
Fig. 4. Local model network. Local linear models (top) and validity functions (bottom).
3. AXES-OBLIQUE PARTITIONING

Both of the strategies discussed in the previous section yield flat models. Even if the algorithm is hierarchically organized, like e.g. LOLIMOT, the constructed model itself is flat in the sense that all validity functions Φ_i(·) can be calculated in parallel. This is an important feature if the network really should be realized in hardware or by parallel computers.

3.1 Motivation and Benefits of Axes-Oblique Partitioning

The heuristic tree-construction algorithms discussed in Subsect. 2.2 offer an overwhelming number of attractive features. Their only major shortcoming is their sensitivity to the curse of dimensionality, a consequence of the restriction to axes-orthogonal splits. An extension to axes-oblique splits would overcome this drawback. In fact, an optimized axes-oblique partitioning strategy would be exceptionally well suited for high-dimensional problems. It is known that multilayer perceptron (MLP) networks perform very well on high-dimensional problems because they optimize the direction of nonlinearity for every sigmoid function (neuron). This direction optimization is the key to success when dealing with many inputs, but it inherently requires nonlinear optimization techniques. The main drawbacks of MLP networks are their very poor interpretability and the high training effort, since all parameters have to be nonlinearly optimized concurrently. Moreover, the results crucially depend on the parameter initialization. The goal of this paper is to keep the advantages of MLP networks but to overcome their weaknesses by exploiting local model structures. A first step extending this direction-optimization idea to local linear models was realized with the hinging hyperplanes proposed in [Breiman, 1993]. Hinging hyperplanes are functions that look like the cover sides of a partly opened book; the direction of the hinge (the line where front and back side meet) is optimized. The next step in development was taken in [Pucar and Millnert, 1995], where the piecewise local linear models were smoothed by interpolation functions. But at this point the space in which the local linear models are defined and the space of the validity functions still had to be identical by construction. In fact, the parameters of the local linear models and the hinge directions are coupled. This feature gave rise to an efficient training algorithm, but it is also a severe limitation in the context of local model networks. These approaches are also restricted to linear local models. In [Ernst, 1998] all these restrictions were overcome by introducing so-called generalized hinging hyperplanes, where the input space partitioning is independent of the local models. An axes-oblique partitioning strategy and an efficient construction algorithm for its realization were proposed in [Nelles, 2006]. In this paper a new approach for axes-oblique partitioning is introduced where the split directions are determined with Gustafson-Kessel fuzzy clustering. A global supervised learning strategy in local model networks is combined with local unsupervised fuzzy clustering.

3.2 Fusion of Supervised and Unsupervised Learning
The aim of the fusion of product space clustering and heuristic tree-construction algorithms is to combine the advantages of supervised and unsupervised learning (see Table 1). The use of highly flexible validity functions obtained by product space clustering enables the algorithm to overcome the curse of dimensionality. This flexibility is a consequence of the fuzzy covariance matrix, which allows an arbitrary orientation and size of the clusters in the product space.
Table 1. Attributes of the algorithms to be combined.
Fig. 5. Projection of clusters (top) from product space to input space leads to membership functions (bottom).

In order to project the clusters obtained by, e.g., Gustafson-Kessel fuzzy clustering from product space to input space, the cluster dimension of the output y is kept constant at the corresponding cluster center value c_{i,y}. Therefore, the cluster rotation in the output dimension is neglected. Fig. 5 illustrates this with an example with one input u and one output y. The two cluster projections µ_i are generated by slicing the clusters at their output center coordinate c_{i,y}. The distance x_i from a data point to each center c_i = [c_{i,1} c_{i,2} · · · c_{i,nz} c_{i,y}]^T is calculated with the help of the covariance matrix Σ_i, which scales and rotates the axes:

x_i = \left\| [z^T\; c_{i,y}]^T - c_i \right\|_{\Sigma_i} = \sqrt{\left([z^T\; c_{i,y}]^T - c_i\right)^T \Sigma_i^{-1} \left([z^T\; c_{i,y}]^T - c_i\right)}    (5)

The covariance matrix Σ_i is symmetric and of size (nz+1) × (nz+1) if one output is applied:

\Sigma_i = \begin{bmatrix} \sigma_{1,1}^2 & \sigma_{1,2}^2 & \cdots & \sigma_{1,nz+1}^2 \\ \sigma_{2,1}^2 & \sigma_{2,2}^2 & \cdots & \sigma_{2,nz+1}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{nz+1,1}^2 & \sigma_{nz+1,2}^2 & \cdots & \sigma_{nz+1,nz+1}^2 \end{bmatrix}_i    (6)

with σ²_{k,l} = σ²_{l,k} ∀ k, l = 1, . . . , nz+1.

The membership functions µ_i(·) of a Gaussian basis function network are given by:

\mu_i(z) = \exp\left(-\tfrac{1}{2}\, x_i^2\right)    (7)

To achieve a partition of unity, the membership functions have to be normalized to obtain the validity functions:

\Phi_i(z) = \frac{\mu_i(z)}{\sum_{j=1}^{M} \mu_j(z)}    (8)
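A minimal sketch of the chain (5)-(8) follows. The function and argument names are ours; the projected centers and covariance matrices are assumed to be already available from the clustering step.

```python
import numpy as np

def validities(z, centers, covs, c_y):
    """Compute Phi_i(z) via (5)-(8).
    centers : (M, nz+1) product-space cluster centers c_i
    covs    : list of M (nz+1, nz+1) fuzzy covariance matrices Sigma_i
    c_y     : output coordinates c_{i,y} used to slice the clusters"""
    mu = np.empty(len(centers))
    for i, (c, S) in enumerate(zip(centers, covs)):
        v = np.append(z, c_y[i]) - c        # [z^T c_{i,y}]^T - c_i
        x2 = v @ np.linalg.solve(S, v)      # squared scaled distance, cf. (5)
        mu[i] = np.exp(-0.5 * x2)           # Gaussian membership, cf. (7)
    return mu / mu.sum()                    # normalization to unity, cf. (8)
```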
The next section gives an overview of the construction algorithm that is used for structure optimization.

4. SUPERVISED HIERARCHICAL CLUSTERING (SUHICLUST)

The SUHICLUST (SUpervised HIerarchical CLUSTering) algorithm closely follows the proposals in [Ernst, 1998] and [Nelles, 2006], which were strongly motivated by the LOLIMOT algorithm first published in [Nelles et al., 1996]. The key features of this axes-oblique construction algorithm are (a schematic sketch of the resulting loop follows the list):

• Incremental: In each iteration an additional local model is generated.
• Splitting: In each iteration the local model with the worst local error measure is split into two submodels.
• Local least squares: The parameters of the local models are estimated locally by a weighted least squares method. This is computationally extremely cheap and introduces a regularization effect which increases the robustness [Murray-Smith and Johansen, 1997].
• Adaptive resolution: The smoothness of the local model interpolation depends on the fuzzy covariance matrix obtained by fuzzy clustering and therefore on the size of the validity regions. The smaller the validity regions are, the less smooth the interpolation will be.
• Split optimization: In contrast to LOLIMOT, the splits are axes-oblique. Thus, the position and direction of each split are optimized. The application of Gustafson-Kessel fuzzy clustering in the product space determines the new split in the input space. Only the new split is optimized; all existing splits are kept unchanged. The number of parameters in the case of one output dimension is (1/2)(nz+1)(nz+2).
• Nested optimization: After evaluation of the new split by fuzzy clustering, the parameters of the two involved local models are newly estimated by a local weighted least squares method, see Fig. 6.
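Schematically, the outer loop described by the list above can be summarized as follows. This is a sketch only; the helper names (`init_model`, `gk_split`, `refit_wls`) and the `local_error` attribute are hypothetical stand-ins, not the authors' implementation.

```python
def suhiclust(data, max_models, init_model, gk_split, refit_wls):
    """Incremental SUHICLUST-style construction, cf. Fig. 6."""
    models = [init_model(data)]          # start with one global linear model
    while len(models) < max_models:
        # split the local model with the worst local error measure
        worst = max(models, key=lambda m: m.local_error)
        lm1, lm2 = gk_split(worst)       # GK clustering optimizes only the new split
        refit_wls(lm1)                   # local weighted least squares for the
        refit_wls(lm2)                   # two newly generated submodels only
        models.remove(worst)
        models += [lm1, lm2]             # all other local models stay unchanged
    return models
```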
With the exceptions and extensions described above, the axes-oblique construction algorithm closely follows LOLIMOT [Nelles, 2001]. To optimize the splitting parameters (the fuzzy covariance matrix), Gustafson-Kessel fuzzy clustering is used as a nonlinear optimization technique.
The data points that correspond to the local model that is going to be split are taken as clustering data; the validity values classify the data (validity over 0.5). All already existing splits are kept unchanged. This ensures that the computational demand does not increase over the iterations. Each time after the generation of two new local models, two local weighted least squares estimations are carried out in order to optimize the local model parameters of the two newly generated local models. The parameters of all other local models are kept unchanged because they are hardly affected by this split.

Fig. 6. Structure of SUHICLUST algorithm.
5. INITIAL CLUSTER CENTER PLACEMENT

To split the worst local model into two new local models, the approach proposed in this paper uses fuzzy clustering. One difficulty that comes along with the unsupervised learning strategy is the initialization of the two new cluster centers in each iteration of the heuristic tree-construction algorithm. This initialization can lead to some undesirable effects as a result of convergence to local minima of the clustering algorithm. In this section, examples of wrong initial cluster center placements are presented. To ensure adequate center placements, some heuristic restrictions reduce the probability of unfeasible splits.
Fig. 7. Left: Process (light) and model (solid) output with the axes-oblique partitioning strategy. Right: Example of one reasonable partitioning with the proposed algorithm.

5.1 Examples of Wrong Center Placement of Initial Clusters

In order to demonstrate wrong initial cluster center placements, the process shown in Fig. 7 will be modeled. Figure 7 (right) presents a reasonable partitioning generated with the proposed algorithm. All splits are taken orthogonal to the direction of the main process nonlinearity. Figures 8 and 9 show two examples of undesirable splits that were calculated by the Gustafson-Kessel clustering algorithm with random cluster initialization. In Fig. 8 the centers are placed nearly orthogonal to the process nonlinearity; there is no improvement of the global model although one local model was added. Another interesting effect occurs if the cluster centers get too close center coordinates. This case is illustrated in Fig. 9. The validity functions become non-unique because both membership functions have their maximum activity in the same point. To circumvent effects like this, some heuristic restrictions are applied that prevent wrong initial center placements.

Fig. 8. Modeling of the process shown in Fig. 7. Left: Membership functions. Right: Validity functions. The split is taken nearly orthogonal to the direction of the process nonlinearity.

Fig. 9. Modeling of the process shown in Fig. 7. Left: Membership functions. Right: Validity functions. Centers with too close input space coordinates produce non-unique validity regions.

5.2 Methods to Avoid Wrong Center Placement

Three methods are suggested to avoid the wrong placement of initial clusters (a sketch of the first method follows the list):
• Placement of the initial cluster centers in the direction of the eigenvector with the largest eigenvalue (approximately the direction of the process nonlinearity).
• Avoiding cluster centers with close input space coordinates.
• Ensuring that centers are near data points: with an appropriate distance measure that gives information about the data density around the cluster centers, it can be prevented that centers are placed far away from data points.

The presented methods deliver sufficient results for academic examples with undisturbed training data. However, the development of a robust and adaptive approach for initializing the cluster centers in the presence of noisy and randomly distributed data samples is still a topic for future research.
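The first heuristic in the list above can be sketched as follows. The offset factor and the spread-based step size are our own assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def initial_centers(X, offset=0.5):
    """Place the two initial cluster centers of a split along the eigenvector
    with the largest eigenvalue of the data covariance, i.e. roughly along
    the direction of the process nonlinearity. X: (N, d) clustering data."""
    mean = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))  # symmetric covariance
    v = eigvecs[:, np.argmax(eigvals)]              # dominant direction
    step = offset * np.sqrt(eigvals.max())          # assumed spread along v
    return mean + step * v, mean - step * v         # well-separated centers
```

Placing the centers symmetrically around the data mean also addresses the second and third heuristics: the centers are separated from each other and stay close to the bulk of the data.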
6. DEMONSTRATION EXAMPLES

The advantages of the axes-oblique partitioning with SUHICLUST become apparent for approximation problems where the process nonlinearity depends on a linear combination of the input dimensions. Therefore, in the following the function

y = \frac{2 f_1}{f_2 + f_3}    (9)

with

f_1 = \exp\!\left(8\left[(u_1 - 0.5)^2 + (u_2 - 0.5)^2\right]\right)
f_2 = \exp\!\left(8\left[(u_1 - 0.2)^2 + (u_2 - 0.7)^2\right]\right)
f_3 = \exp\!\left(8\left[(u_1 - 0.7)^2 + (u_2 - 0.2)^2\right]\right)

is modeled. It is the benchmark problem Mars1, which is also used in [Murray-Smith, 1994] and [Friedman, 1991]. For the approximation, 900 equally distributed, noise-free data samples are generated. This function shall be approximated with a normalized root mean squared error of less than 5%. With LOLIMOT, 21 local linear models were needed, while the SUHICLUST algorithm (axes-oblique partitioning strategy) achieved the same accuracy with 11 local linear models, see Fig. 10.
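A short script reproducing the training data for (9) might look as follows. The unit square [0, 1]² as input domain and the 30 × 30 grid layout are our assumptions; the paper only states that 900 equally distributed, noise-free samples are used.

```python
import numpy as np

# assumed 30 x 30 grid on [0, 1]^2 gives the 900 equally distributed samples
u1, u2 = np.meshgrid(np.linspace(0, 1, 30), np.linspace(0, 1, 30))

def g(a, b):
    """Exponential bump term of the benchmark, cf. f_1, f_2, f_3."""
    return np.exp(8 * ((u1 - a) ** 2 + (u2 - b) ** 2))

f1, f2, f3 = g(0.5, 0.5), g(0.2, 0.7), g(0.7, 0.2)
y = 2 * f1 / (f2 + f3)    # benchmark function (9), noise-free targets
```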
Fig. 10. Top left: Process to be modeled. Top right: Convergence behavior for the axes-orthogonal and axes-oblique partitioning strategies. Bottom left: Contours of the SUHICLUST partition drawn at 0.5. Bottom right: Partition generated with LOLIMOT.

7. CONCLUSIONS

The new algorithm (SUHICLUST) proposed in this paper is well suited for the identification of high-dimensional, nonlinear static and dynamic processes with nonlinearities that stretch along multiple inputs. Global supervised learning in the framework of a local model network meets local unsupervised learning. The axes-oblique partitioning strategy and the highly flexible validity functions enable a suitable approximation behavior in high-dimensional input spaces with a small number of local submodels. Owing to the distinction between the input spaces for rule premises and consequents, the local models can be of any polynomial degree, which allows the incorporation of prior knowledge. This and many other features make SUHICLUST a powerful identification algorithm.
REFERENCES

R. Babuška and H.B. Verbruggen. An overview of fuzzy modeling for control. Control Engineering Practice, 4(11):1593–1606, 1996.
R. Babuška, M. Setnes, U. Kaymak, and H.R. van Nauta Lemke. Simplification of fuzzy rule bases. In European Congress on Intelligent Techniques and Soft Computing (EUFIT), pages 1115–1119, Aachen, Germany, 1996.
L. Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999–1013, May 1993.
L. Breiman, J.H. Friedman, R. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.
S. Ernst. Hinging hyperplane trees for approximation and identification. In IEEE Conference on Decision and Control (CDC), pages 1261–1277, Tampa, USA, 1998.
J.H. Friedman. Multivariate adaptive regression splines (with discussion). The Annals of Statistics, 19(1):1–141, March 1991.
I. Gath and A.B. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):773–781, 1989.
D.E. Gustafson and W.C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In IEEE Conference on Decision and Control, pages 761–766, San Diego, USA, 1979.
T.A. Johansen. Identification of non-linear system structure and parameters using regime decomposition. Automatica, 31(2):321–326, 1995.
R. Murray-Smith. A Local Model Network Approach to Nonlinear Modeling. PhD thesis, University of Strathclyde, Strathclyde, UK, 1994.
R. Murray-Smith and T.A. Johansen. Local learning in local model networks. In R. Murray-Smith and T.A. Johansen, editors, Multiple Model Approaches to Modelling and Control, chapter 7, pages 185–210. Taylor & Francis, London, 1997.
O. Nelles. Nonlinear System Identification. Springer, Berlin, Germany, 2001.
O. Nelles. Axes-oblique partitioning strategies for local model networks. In International Symposium on Intelligent Control (ISIC), Munich, Germany, October 2006.
O. Nelles, S. Sinsel, and R. Isermann. Local basis function networks for identification of a turbocharger. In IEE UKACC International Conference on Control, pages 7–12, Exeter, UK, September 1996.
P. Pucar and M. Millnert. Smooth hinging hyperplanes: An alternative to neural nets. In European Control Conference (ECC), pages 1173–1178, Rome, Italy, 1995.
R. Shorten and R. Murray-Smith. Side-effects of normalising basis functions in local model networks. In R. Murray-Smith and T.A. Johansen, editors, Multiple Model Approaches to Modelling and Control, chapter 8, pages 211–229. Taylor & Francis, London, 1997.
M. Sugeno and G.T. Kang. Structure identification of fuzzy model. Fuzzy Sets and Systems, 28(1):15–33, 1988.