Recovering piecewise affine maps by sparse optimization


16th IFAC Symposium on System Identification, The International Federation of Automatic Control, Brussels, Belgium, July 11-13, 2012

Laurent Bako ∗,∗∗   Khaled Boukharouba ∗,∗∗   Stéphane Lecoeuche ∗,∗∗

∗ Univ Lille Nord de France, F-59000 Lille, France.
∗∗ EMDouai, IA, F-59500 Douai, France.
E-mails: [email protected], [email protected], [email protected]

Abstract: Piecewise affine (PWA) maps constitute an attractive modeling abstraction for nonlinear dynamic systems. An important feature of PWA model structures is that they carry the quite natural intuition according to which the operating space of a complex physical system can be decomposed into an appropriate number of pieces on which the system can be viewed as being affine. The paper presents a new method for the recovery of piecewise affine systems from observed samples of data. This problem is usually addressed through data clustering procedures which, unfortunately, are rarely guaranteed to be optimal. The ambition of this work is to overcome the need to resort to a systematic clustering step. To this end, some ideas from sparse optimization are used in combination with the local linearity property of PWA maps.

Keywords: hybrid systems, piecewise affine systems, system identification, ℓ1-norm optimization, non-smooth optimization.

1. INTRODUCTION

Piecewise affine (PWA) systems can be viewed as a modeling abstraction of nonlinear continuous systems. They are also well known for having a universal approximation capability in the sense that they can be used to describe a variety of dynamical systems, including the general class of hybrid systems Heemels et al. [2001], Sontag [1981]. How to compute piecewise affine maps from observed data is therefore an important engineering problem that has justifiably been drawing a great deal of attention in recent years Bemporad et al. [2005], Boukharouba et al. [2009], Ferrari-Trecate et al. [2003], Juloski et al. [2005], Nakada et al. [2005], Ohlsson and Ljung [2011b], Roll et al. [2004]. We will refer to this problem as the PWA regression problem. The main challenge in inferring PWA maps from input-output data lies in the fact that one does not know a priori which data pair originates from which submodel. In this respect, the PWA regression problem is very analogous to the identification of switched linear systems Bako [2011], Bako and Vidal [2008], Bako et al. [2011], Lauer et al. [2011], Ozay et al. [2008], Vidal et al. [2003] or to the problem of subspace clustering Ohlsson and Ljung [2011a], Vidal [2011]. The problem has been tackled in many different ways, the most frequent solution being to try to partition the data according to their respective generating submodels. This is generally achieved through iterative data clustering techniques which, unfortunately, are rarely optimal.

Contribution. The main contribution of this paper is to show that the parameter vectors (PVs) describing the submodels of a PWA map are, under an appropriate formulation, obtainable by convex optimization techniques. In a sense, this work can be viewed as a study of the applicability of the identification method of Bako [2011] to PWA systems. Considering the whole mixed dataset without any prior clustering, we observe that one PV of the PWA map can be directly estimated by minimizing an ℓ1-norm sparsity-aware criterion. In effect, any PV of the PWA map achieves a sparse error vector over the collected dataset, that is, an error vector containing zero entries wherever the corresponding submodel is active. For the ℓ1-norm minimization to successfully recover a PV, the corresponding subsystem must have generated a sufficiently large number of data as compared to the other subsystems. This requirement may not be met in general. We therefore propose to reinforce the sparsity of the after-minimization error vector by exploiting the local linearity of PWA maps. An appropriate weighted objective functional is hence constructed, where the weights reflect the property that two close regressors are likely to pertain to the same submodel. What distinguishes the method of this paper from the clustering procedures Bemporad et al. [2005], Boukharouba et al. [2009], Ferrari-Trecate et al. [2003], Juloski et al. [2005], Nakada et al. [2005] is its ability to return a PV from the whole mixed dataset, i.e., without prior clustering.

Outline. The remainder of the paper is organized as follows. We formally state the piecewise regression problem in Section 2. In Section 3 we present the proposed approach, which is based on sparse optimization. Numerical examples are provided throughout to illustrate the main claims.

Notation. Throughout the paper, |S| refers to the absolute value if S is a real number and to the cardinality if S is a finite set. ‖x‖2 = (x1² + … + xn²)^{1/2} is the ℓ2-norm of x ∈ R^n and ‖x‖1 = |x1| + … + |xn| is the ℓ1-norm of x.

2. THE PWA IDENTIFICATION PROBLEM

A PWA model is a mathematical representation of a physical or natural dynamical system; it is considered to be of the form




    y(t) = f(x(t)) + e(t),                                                   (1)




Fig. 1. One-dimensional example of a PWA map (plot of y(t) versus x(t)) where two PVs are equal (θ1 = θ4).

where y(t) ∈ R is the output at time t, e(t) represents possible noise or modeling error, and x(t) is what will be referred to as the regressor, defined by

    x(t) = [y(t−1) ⋯ y(t−na) u(t−1)^⊤ ⋯ u(t−nb)^⊤]^⊤                        (2)

with u(t) ∈ R^{nu} standing for the input of the system being modeled and na and nb the structural orders of the model. Note that (1) can also be a static model, in which case x(t) will be an unstructured instantaneous observation. In model (1), f is a piecewise affine map defined from the regression space (which is assumed to be a bounded polyhedron) X ⊂ R^n, n = nb·nu + na, to R by

    f(x) = θi^⊤ [x^⊤ 1]^⊤   if x ∈ Xi                                        (3)

where {Xi}_{i=1}^{s} form a polyhedral partition of X, i.e., each set Xi is a convex polyhedron defined by Xi = {x ∈ R^n : Hi x ⪯ bi}, ∪_{i=1}^{s} Xi = X and Xi ∩ Xj = ∅ whenever i ≠ j. Here, Hi ∈ R^{di×n} and bi ∈ R^{di}, where di is the number of hyperplanes delimiting the polyhedron Xi and the symbol ⪯ denotes a component-wise inequality. We call Xi the validity region of the affine submodel i, which is described by the parameter vector θi ∈ R^{n+1}. For convenience, we may also re-write model (1) in the form

    y(t) = θ_{σ(t)}^⊤ x̄(t) + e(t)                                            (4)

where x̄(t) = [x(t)^⊤ 1]^⊤ and σ(t) is referred to as the discrete state (or discrete mode), defined as σ(t) = i ∈ {1, …, s} whenever x(t) ∈ Xi. It is worth observing the following from the definition of the PWA map:

(i) The PVs θi are not necessarily all distinct. An example of a situation where two PVs are equal is represented in Figure 1. In this case, the pieces (θ1, X1) and (θ4, X4) of the PWA map differ only in their validity regions.

(ii) It may happen that some pairs (x(t), y(t)) fit more than one affine submodel. That is, it is possible (if noise is assumed nonexistent) to have y(t) = θi^⊤ x̄(t) = θj^⊤ x̄(t), while x(t) ∉ Xi, x(t) ∈ Xj and θi ≠ θj.

If one disregards the correspondence between the regions and the affine pieces, then a PWA system is structurally equivalent to a switched system where, by definition, the nature of the switching mechanism is unspecified. One tiny difference is that case (ii) above can occur in a switched system but not case (i). However, if one concentrates only on the identification of the PVs, a method designed for switched systems is readily applicable to PWA systems as well. In the latter case, information about local linearity can be exploited in order to ease the estimation process.
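To make the structure of (3) concrete, the following small sketch (ours, not from the paper; the function name and the first-match handling of region boundaries are simplifying assumptions) evaluates a PWA map given region descriptions (Hi, bi) and parameter vectors θi:

```python
import numpy as np

def pwa_output(x, regions, thetas):
    """Evaluate the PWA map (3): `regions` is a list of (H_i, b_i) pairs
    describing X_i = {x : H_i x <= b_i}, `thetas` the matching theta_i."""
    xb = np.append(x, 1.0)                    # x_bar = [x^T 1]^T
    for (H, b), theta in zip(regions, thetas):
        if np.all(H @ x <= b):                # component-wise membership test
            return float(theta @ xb)          # theta_i^T x_bar
    raise ValueError("x does not belong to any region X_i")
```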


Given observations {x(t), y(t)}_{t=1}^{N} generated by a PWA model of the form (1), with x(t) defined as in (2), we are interested here in estimating the parameter vectors {θi}_{i=1}^{s} together with the associated validity regions {Xi}_{i=1}^{s}. This involves partitioning the dataset into s groups where each group corresponds to one submodel. Note that estimation of the regions boils down to the determination of separating hyperplanes once the data are correctly partitioned. This can be efficiently handled via many quite standard SVM-based classification tools Schölkopf and Smola [2001], Vapnik [1998]. Therefore we will not treat explicitly the region estimation problem here. To proceed with the estimation of the PVs, we will assume in this paper that the orders na and nb are finite, equal for all submodels and known a priori. However, the number of submodels may be unknown. It is further assumed that each individual affine submodel is minimal in the ordinary sense.

3. THE PROPOSED SOLUTION

In this section we present the main result of the paper. The major challenge in inferring PWA models from data lies in the fact that one does not know a priori which data pair (x(t), y(t)) is associated with which submodel. For this reason, the classical idea for tackling the PWA system identification problem is to start by partitioning the data according to their most likely generating submodels. This is achieved in general through data clustering algorithms which rely on the principle that two close regressors (in the sense of the Euclidean norm) are likely to have been generated by the same affine submodel. Given such a partition, standard linear identification techniques such as the least squares method can be applied to recover each of the constituent submodels of (1). One problem, however, with this standard and intuitive approach is that data clustering algorithms are rarely guaranteed to be optimal. In effect, most of those algorithms iteratively re-assign the data to the different submodels, starting from a certain initial guess. They may hence be very sensitive to initialization.

3.1 A first introductory idea

One way to overcome the combinatorial nature of the PWA regression problem consists in identifying as many parameter vectors as there are data available Ferrari-Trecate et al. [2003]. This exploits in general the fact that the PWA map to be identified is locally linear. In effect, local linearity induces the property that two data points x(t) and x(k) that are close in the sense of the Euclidean norm are likely to pertain to the same submodel of (1). We can therefore compute the N parameter vectors by solving the following optimization problem

    min_{θ(1),…,θ(N)}  Σ_{t=1}^{N} Σ_{k=1}^{t} w(t,k) d(t,k)                 (5)

with

    d(t,k) = |y(k) − θ(t)^⊤ x̄(k)| + |y(t) − θ(k)^⊤ x̄(t)| + γ ‖θ(t) − θ(k)‖2   (6)


and

    w(t,k) = exp( −(1/σx²) ‖x(t) − x(k)‖2² − (1/σy²) ‖y(t) − y(k)‖2² ).       (7)

The rationale behind this formulation is as follows. For each pair (x(t), x(k)) of data points, we would like to find a pair of PVs θ(t) and θ(k) that fit (as far as possible) the data points (x̄(t), y(t)) and (x̄(k), y(k)) respectively, in such a way that both PVs are close whenever x(t) and x(k) are close in the sense of the Euclidean norm. This last requirement justifies the presence of the quantity ‖θ(t) − θ(k)‖2 in the cost functional. The fitting error is measured here in terms of the ℓ1-norm, which is known to offer a robust minimization capacity Candès and Randall [2006], Candès et al. [2008] with respect to potential mismatches. That is, minimizing the ℓ1-norm of the fitting error vector tends to push only the small entries (distances d(t,k) for which (x̄(t), y(t)) and (x̄(k), y(k)) are generated by the same submodel) towards zero and to leave the large entries large (distances d(t,k) for which (x̄(t), y(t)) and (x̄(k), y(k)) are generated by different submodels). The weight w(t,k) ∈ [0, 1] attached to the term d(t,k) reinforces the requirement that θ(t) and θ(k) must be all the closer as x(t) and x(k) are spatially close. Note however that if the PWA map is not continuous, then closeness of x(t) and x(k) does not necessarily imply closeness of θ(t) and θ(k). The term (1/σy²) ‖y(t) − y(k)‖2² present in (7) accounts for these situations. The parameters σx² and σy² appearing in the expression of w(t,k) can be empirically chosen as the average of the pairwise squared distances, i.e.,

    σx² = (2/(N(N−1))) Σ_{t=1}^{N} Σ_{k=1}^{t} ‖x(t) − x(k)‖2²,
    σy² = (2/(N(N−1))) Σ_{t=1}^{N} Σ_{k=1}^{t} ‖y(t) − y(k)‖2².

Note that d(t,t) = 2 |y(t) − θ(t)^⊤ x̄(t)| and w(t,t) = 1, so that the N terms d(t,t) receive the highest weight in the cost functional in (5). When the data are noise-free, σx² and σy² can be chosen such that the solution {θ(1), …, θ(N)} of problem (5) enjoys the ideal property that θ(k) = θ(t) whenever σ(k) = σ(t). This property can be enhanced by increasing the regularization parameter γ in (6).

Observing that the optimization problem (5) is convex, it can be efficiently solved by existing numerical tools Boyd and Vandenberghe [2004]. A sequence {θ(t)}_{t=1}^{N} of PVs can hence be computed from the data. Ideally, the PVs that correspond to data points generated by the same submodel of (1) are expected to be equal. However, strict equality may not hold between such vectors if, for example, the training data are contaminated by noise. In order to reduce the number of PVs, one can cluster the set of identified PVs by using for example the K-means algorithm Duda and Hart [1973]. However, convergence properties of such clustering algorithms are not well established.
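As an illustration, a minimal sketch of how (5)–(7) could be set up with the cvxpy modeling package is given below. The function names, the use of cvxpy and the default γ are our choices, not the paper's; since the objective contains on the order of N² terms, the sketch is only practical for small N.

```python
import numpy as np
import cvxpy as cp

def build_weights(X, y):
    """Weights w(t,k) of (7), with sigma_x^2 and sigma_y^2 set to the
    average pairwise squared distances as suggested in the text."""
    N = X.shape[0]
    dx2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    dy2 = (y[:, None] - y[None, :]) ** 2
    sx2 = dx2.sum() / (N * (N - 1))
    sy2 = dy2.sum() / (N * (N - 1))
    return np.exp(-dx2 / sx2 - dy2 / sy2)

def solve_problem_5(X, y, gamma=1.0):
    """Problem (5)-(6): one PV theta(t) per sample, l1 fit plus l2 coupling."""
    N, n = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])      # rows are x_bar(t)^T
    w = build_weights(X, y)
    Theta = cp.Variable((N, n + 1))           # row t is theta(t)
    cost = 0
    for t in range(N):
        for k in range(t + 1):
            d = (cp.abs(y[k] - Xb[k] @ Theta[t])
                 + cp.abs(y[t] - Xb[t] @ Theta[k])
                 + gamma * cp.norm(Theta[t] - Theta[k], 2))
            cost = cost + w[t, k] * d
    cp.Problem(cp.Minimize(cost)).solve()
    return Theta.value                        # N estimated PVs, one per sample
```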


Example 1. For illustration purposes we use the following numerical example Bemporad et al. [2005]:

    y(t) = θ1^⊤ x̄(t) + e(t)   if [4 −1 10] x̄(t) < 0,
           θ2^⊤ x̄(t) + e(t)   if [−4 1 −10; 5 1 −6] x̄(t) ⪯ 0,               (8)
           θ3^⊤ x̄(t) + e(t)   if [−5 −1 6] x̄(t) < 0,

with x̄(t) = [y(t−1) u(t−1) 1]^⊤ and the PVs given by

    θ1 = [−0.4 1.0 1.5]^⊤,  θ2 = [0.5 −1.0 −0.5]^⊤,  θ3 = [−0.3 0.5 −1.7]^⊤.

Solving (5) with noise-free and with noisy (SNR = 30 dB) data yields the sequences {θ(t)} depicted in Figure 2.

Fig. 2. Estimated sequence θ(t) = [θ1(t) θ2(t) θ3(t)]^⊤: (a) no noise, γ = 1; (b) small amount of noise, γ = 1.

The outcome of this experiment is that, when the data are completely noise-free, such an ideal result as θ(t) ∈ {θ1, …, θs} can reasonably be expected. In the presence of noise, however, the story is slightly different, hence prompting the necessity of clustering the estimated PVs.

3.2 The proposed approach

The method presented above has the inconvenience of necessitating a subsequent clustering step after a huge number of affine submodels have been identified first. In this section, a slightly different approach is taken. While this approach keeps the main basic features of the one developed in Subsection 3.1, it directly provides a number of submodels that is as small as possible. For clarity of presentation it is perhaps better to focus on the estimation of a first single PV from the entire mixed dataset. The remaining PVs will be obtained similarly over a modified dataset where the data pairs pertaining to the firstly identified PV have been removed. We start by observing as in Bako [2011] that for any θ ∈ {θ1, …, θs}, the error vector

    φ(θ) = [y(1) − θ^⊤ x̄(1), ⋯ , y(N) − θ^⊤ x̄(N)]^⊤

must be sparse in the sense that it must contain as many zero components as possible (when noise-free data are assumed). To see clearly why this is the case, let us disregard temporarily the noise in model (4). Then, if θ = θi for some i = 1, …, s, any entry of φ(θ) can be written in the form y(t) − θi^⊤ x̄(t) = (θσ(t) − θi)^⊤ x̄(t), which is equal to zero whenever


σ(t) = i. Assuming each submodel to have been sufficiently visited by the PWA system, it becomes clear that any φ(θi) is sparse to some extent. From this observation and the well-known sparsity-promoting property of the ℓ1-norm, one can search for one of the θi's by minimizing the cost functional

    JSARX(θ) = ‖φ(θ)‖1 = Σ_{t=1}^{N} |y(t) − θ^⊤ x̄(t)|.                       (9)

This solution, derived in Bako [2011], applies to general switched systems where no additional information is available about the nature of the switching mechanism. It can be seen that the method is directly applicable to PWA systems. In that particular case, we know that the switchings are such that the map f is locally linear, that is, two close regressors are very likely to have been generated by the same affine submodel. Taking advantage of this, we introduce the following cost function

    Jpwa(θ) = Σ_{t=1}^{N} Σ_{k=1}^{t} w(t,k) ( |y(k) − θ^⊤ x̄(k)| + |y(t) − θ^⊤ x̄(t)| )       (10)

in which the weights w(t,k) are defined as in (7). This last objective function resembles the one in (5) but with the fundamental difference that it can, in contrast to (5), directly provide a single θ in {θ1, …, θs} without requiring parameter clustering. To further enhance sparsity, we can, similarly as in Candès et al. [2008], iteratively minimize a weighted version of (10). The iterative scheme can be defined for a fixed number rmax of iterations as follows:

    θ^(r) = arg min_θ Jpwa^(r)(θ)                                             (11)

with

    Jpwa^(r)(θ) = Σ_{t=1}^{N} Σ_{k=1}^{t} w(t,k) [ α^(r)(k) |y(k) − θ^⊤ x̄(k)| + α^(r)(t) |y(t) − θ^⊤ x̄(t)| ]     (12)

and α^(0)(k) = 1 for any k and

    α^(r)(k) = 1 / ( |y(k) − (θ^(r−1))^⊤ x̄(k)| + ξ )

when the iteration number r ≥ 1. The number ξ > 0 is a small positive number intended essentially here for preventing division by zero. The iterative procedure above is likely to converge to a point in the true set of PVs {θ1, …, θs}. Given one such vector, say θi*, we can determine the set of data pertaining to the corresponding submodel and hence remove them from the whole dataset. To compute other PVs, we minimize again a functional like Jpwa(θ) over the remaining set of data. The process can be continued this way until all PVs are recovered.
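A minimal sketch (ours, not the paper's implementation) of the iteratively reweighted scheme (11)–(12) for extracting a single PV, again with cvxpy; the weight matrix w is assumed to have been computed as in (7), and r_max and ξ are tuning choices:

```python
import numpy as np
import cvxpy as cp

def estimate_one_pv(X, y, w, r_max=5, xi=1e-4):
    """Iteratively reweighted l1 criterion (11)-(12); returns theta^(r_max)."""
    N, n = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])      # rows are x_bar(t)^T
    alpha = np.ones(N)                        # alpha^(0)(k) = 1
    theta_val = None
    for _ in range(r_max + 1):
        # total coefficient of |y(j) - theta^T x_bar(j)| in (12)
        coeff = np.zeros(N)
        for t in range(N):
            for k in range(t + 1):
                coeff[k] += w[t, k] * alpha[k]
                coeff[t] += w[t, k] * alpha[t]
        theta = cp.Variable(n + 1)
        resid = cp.abs(y - Xb @ theta)
        cp.Problem(cp.Minimize(cp.sum(cp.multiply(coeff, resid)))).solve()
        theta_val = theta.value
        alpha = 1.0 / (np.abs(y - Xb @ theta_val) + xi)   # next alpha^(r)
    return theta_val
```

Once a PV has been obtained this way, the samples with (near-)zero residual can be removed and the procedure repeated on the remaining data, as described above.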

3.3 Further improvements

Note that the capacity of the proposed identification method to correctly recover all the affine submodels depends heavily on the richness of the data matrix

    X̄ = [ x(1) x(2) ⋯ x(N)
           1    1   ⋯   1  ].

In order to analyze this aspect, assume that the sequence of errors {e(t)} is identically zero in (1). For any θ ∈ R^{n+1}, define a partition of the set of indices I = {1, …, N} by

    I−(θ) = {t : y(t) − θ^⊤ x̄(t) < 0},
    I+(θ) = {t : y(t) − θ^⊤ x̄(t) > 0},
    I0(θ) = {t : y(t) − θ^⊤ x̄(t) = 0}.

Whenever S is a finite set, the notation |S| will refer to the cardinality of S.


A sufficient condition of recoverability. Assume that the sequence of errors {e(t)} is zero in (1) and that rank(X̄) = n + 1. Define the number

    r(X̄) = sup_{z ∈ R^N, ‖z‖1 = 1} ‖(X̄ X̄^⊤)^{−1} X̄ z‖2                       (13)

where the notation sup refers to the supremum. Indeed, r(X̄) is the norm of (X̄ X̄^⊤)^{−1} X̄ subordinate to the vector norms 1 and 2; it is sometimes denoted ‖(X̄ X̄^⊤)^{−1} X̄‖_{1,2}. Note that this norm is computable (see e.g. Tao [1984]). If

    |I0(θi)| > N − 1 / (2 M r(X̄)),                                            (14)

with M = max_{t=1,…,N} ‖x̄(t)‖2, then θi is the unique minimizer of JSARX.

Proof. For θi to be the unique minimizer of JSARX, it must hold that for any η ∈ R^{n+1}, η ≠ 0,

    Σ_{t=1}^{N} |y(t) − θi^⊤ x̄(t)| < Σ_{t=1}^{N} |y(t) − (θi + η)^⊤ x̄(t)|.

This is equivalent to

    Σ_{t ∉ I0(θi)} ( |y(t) − θi^⊤ x̄(t)| − |y(t) − θi^⊤ x̄(t) − η^⊤ x̄(t)| ) < Σ_{t ∈ I0(θi)} |η^⊤ x̄(t)|.        (15)

Observe from the triangle inequality that |a| − |a + b| ≤ |b| for any two real numbers a and b. Using this, it can be seen that the left-hand side of (15) is smaller than Σ_{t ∉ I0(θi)} |η^⊤ x̄(t)|. It then follows that for θi to be the unique minimizer of JSARX, it suffices that

    Σ_{t ∉ I0(θi)} |η^⊤ x̄(t)| < Σ_{t ∈ I0(θi)} |η^⊤ x̄(t)|

which, by adding the term Σ_{t ∉ I0(θi)} |η^⊤ x̄(t)| to both sides of the inequality, also reads as

    Σ_{t ∉ I0(θi)} |η^⊤ x̄(t)| < (1/2) ‖X̄^⊤ η‖1.

It is therefore enough that

    sup_{‖X̄^⊤ η‖1 = 1} ‖X̄_{Ic}^⊤ η‖1 < 1/2

where Ic = I \ I0(θi) and X̄_{Ic} is formed with the columns of X̄ indexed by Ic. By the Cauchy–Schwarz inequality, ‖X̄_{Ic}^⊤ η‖1 ≤ M |Ic| ‖η‖2. The conclusion now follows from the observation that sup_{‖X̄^⊤ η‖1 = 1} ‖η‖2 = r(X̄). ✷
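Since the subordinate (1, 2) norm of a matrix is the largest ℓ2-norm of its columns, r(X̄) in (13), and hence condition (14), can be evaluated directly. A small numpy sketch (ours; X̄ is stored with the regressors x̄(t) as columns):

```python
import numpy as np

def r_bar(Xbar):
    """r(X_bar) of (13): largest column l2-norm of (X_bar X_bar^T)^{-1} X_bar."""
    A = np.linalg.solve(Xbar @ Xbar.T, Xbar)
    return np.max(np.linalg.norm(A, axis=0))

def sufficient_condition_14(Xbar, n_zero):
    """Check (14): |I^0(theta_i)| > N - 1 / (2 M r(X_bar))."""
    N = Xbar.shape[1]
    M = np.max(np.linalg.norm(Xbar, axis=0))  # max_t ||x_bar(t)||_2
    return n_zero > N - 1.0 / (2.0 * M * r_bar(Xbar))
```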

It follows from (14) that it is desirable to have r(X̄) as small as possible. This is a richness property of the measured data matrix X̄. However, because the last row of X̄ is made only of ones, the matrix X̄ is likely to have worse genericity properties than X = [x(1) x(2) ⋯ x(N)]. A consequence of this constant row is as follows.


A necessary condition of recoverability. Assume that the sequence of errors {e(t)} is identically equal to zero in (1). Then, for a PV θi to minimize JSARX, it is necessary that

    | |I+(θi)| − |I−(θi)| | ≤ |I0(θi)|.                                        (16)

Here, |I+(θi)| refers to the cardinality of I+(θi).

Proof. If θi is a minimizer of JSARX, then for any η ∈ R^{n+1}, condition (15) holds with the strict inequality replaced by a non-strict one. For a sufficiently small but nonzero η, sign[y(t) − θi^⊤ x̄(t)] = sign[y(t) − θi^⊤ x̄(t) − η^⊤ x̄(t)], so that by exploiting the fact that |z| = sign(z) z, we obtain

    Σ_{t ∉ I0(θi)} sign[y(t) − θi^⊤ x̄(t)] η^⊤ x̄(t) ≤ Σ_{t ∈ I0(θi)} |η^⊤ x̄(t)|,

that is,

    Σ_{t ∈ I+(θi)} η^⊤ x̄(t) − Σ_{t ∈ I−(θi)} η^⊤ x̄(t) ≤ Σ_{t ∈ I0(θi)} |η^⊤ x̄(t)|.

Now choose η to be of the form η = [0 ε]^⊤ with ε a nonzero scalar. Then, because of the ones appended in the vectors x̄(t), the following holds:

    ( |I+(θi)| − |I−(θi)| ) ε ≤ |I0(θi)| · |ε|.

Since ε can have an arbitrary sign, condition (16) must hold. ✷

Note that |I0(θi)| is the number of regressors pertaining to submodel i. What condition (16) says is that the presence of the ones in the regressors x̄(t) can worsen the recoverability of θi by ℓ1 minimization. Indeed, it puts a requirement on the repartition of signs in the sequence {y(t) − θi^⊤ x̄(t)}_{t ∈ I \ I0(θi)}. More precisely, the ability of the ℓ1 minimization program to yield a PV of the system becomes dependent not only on how large the number of data generated by the corresponding submodel is (which is |I0(θi)|) but also on how close the numbers |I−(θi)| and |I+(θi)| are. In the worst case, i.e. when either |I+(θi)| = 0 or |I−(θi)| = 0, we must have |I0(θi)|/N ≥ 1/2. In words, this means that for a PV to be recovered by minimization of JSARX, the corresponding submodel must have generated at least 50% of the total number of available data. However, such a condition rarely holds when the considered PWA system is composed of more than two submodels.

Example 2. To illustrate numerically the necessity of condition (16), consider the example of two models, one piecewise linear of the form y(t) = ai^⊤ x(t) + v(t), and the other piecewise affine and represented by y(t) = ai^⊤ x(t) + bi + v(t), where {v(t)} ⊂ R is a sparse sequence and x(t) ∈ R³. Sparsity is measured here in terms of the percentage of samples k for which v(k) = 0, i.e.,

    Sparsity of {v(k)} = 100 · |{k ∈ I : v(k) = 0}| / N.

The sequence {v(t)} represents here the errors induced by data pertaining to submodels other than i, i.e., v(t) = (aσ(t) − ai)^⊤ x(t) for the PWL model and v(t) = (aσ(t) − ai)^⊤ x(t) + bσ(t) − bi for the PWA model. For different levels of sparsity of {v(t)} (or equivalently, for different proportions of data generated by submodel i), we run a Monte Carlo simulation of size 100 over 200 samples (x(t), y(t)) such that x(t) ∼ N(0, I3). For each simulation, the parameter vector θi = [ai^⊤ bi]^⊤ ∈ R⁴ is drawn at random. We see from Tables 1 and 2 that when condition (16) is satisfied, both the linear and affine models can be successfully recovered for a sparsity level as small as 30%. Conversely, when (16) is not met, it is impossible to recover the affine model i if it generates less than 50% of the data.

    Table 1. Percentages of successful recoveries when condition (16) is satisfied for both PWL and PWA models.
    sparsity (%)   50    45    30    20    10
    linear        100   100   100    89    31
    affine        100   100   100    69    26

    Table 2. Percentages of successful recoveries when (16) is not satisfied for both PWL and PWA models.
    sparsity (%)   50    45    30    20    10
    linear         –    100   100    82    15
    affine         –     0     0     0     0
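In practice, condition (16) can be checked for a candidate PV by simply counting residual signs. A small sketch (ours; the tolerance tol used to decide which residuals count as zero is an addition, the paper's analysis being noise-free):

```python
import numpy as np

def necessary_condition_16(Xbar, y, theta_i, tol=1e-9):
    """Check (16): | |I+| - |I-| | <= |I0| for residuals y(t) - theta_i^T x_bar(t).
    Xbar holds the regressors x_bar(t) as columns."""
    resid = y - Xbar.T @ theta_i
    n_plus = int(np.sum(resid > tol))
    n_minus = int(np.sum(resid < -tol))
    n_zero = int(np.sum(np.abs(resid) <= tol))
    return abs(n_plus - n_minus) <= n_zero
```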


In order to mitigate this problem we may replace the cost Jpwa above by a different criterion. To proceed, let us introduce the notations

    ỹtk = y(t) − y(k)   and   x̃tk = x(t) − x(k).                              (17)

Let θ ∈ R^{n+1} be partitioned as θ = [a^⊤ b]^⊤, with a ∈ R^n and b ∈ R. Then we can observe that if two data pairs (x(t), y(t)) and (x(k), y(k)) belong to the same mode i described by θ = θi, then ỹtk − a^⊤ x̃tk = 0 (when the effect of noise is disregarded). This means that when a = ai for some i, the sequence {ỹtk − a^⊤ x̃tk}_{t,k} is sparse to some extent. We may consequently consider minimizing the following criterion in place of Jpwa^(r) in (12):

    Jpwa-red^(r)(a) = Σ_{t=1}^{N} Σ_{k=1}^{t−1} w(t,k) α^(r)(t,k) |ỹtk − a^⊤ x̃tk|             (18)

where the weight α^(r)(t,k) now takes the form

    α^(r)(t,k) = 1 / ( |ỹtk − (a^(r−1))^⊤ x̃tk| + ξ ).

Iteratively optimizing (18) over the entire dataset is expected to yield a vector â (an estimate of ai) that describes one submodel of the PWA map. Given such a vector â, we can compute an estimate of the corresponding bias b, the last entry of θ = [a^⊤ b]^⊤, as

    b̂ = (1/|Iδ(â)|) Σ_{t ∈ Iδ(â)} ( y(t) − â^⊤ x(t) )

with Iδ(a) = {(t,k), t < k : |ỹtk − a^⊤ x̃tk| < δ} for some threshold δ which accounts for potential noise. Note in passing that b̂ can alternatively be computed as the median of the N samples {y(t) − â^⊤ x(t)}_{t=1}^{N}, that is,

    b̂ = arg min_b Σ_{t=1}^{N} |y(t) − â^⊤ x(t) − b|.
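A sketch (ours) of this two-step procedure: the difference-model criterion (18) solved by iterative reweighting for a, followed by the median-based estimate of the bias b. Forming all pairwise differences explicitly costs O(N²) memory, which is the complexity issue noted below.

```python
import numpy as np
import cvxpy as cp

def estimate_slope_and_bias(X, y, w, r_max=5, xi=1e-4):
    """Difference-model criterion (18) for a, then the median estimate of b."""
    N, n = X.shape
    t_idx, k_idx = np.triu_indices(N, k=1)    # all pairs t != k, taken once
    dX = X[t_idx] - X[k_idx]                  # x_tilde_tk
    dy = y[t_idx] - y[k_idx]                  # y_tilde_tk
    wk = w[t_idx, k_idx]                      # weights w(t,k) as in (7)
    alpha = np.ones(len(dy))                  # alpha^(0)(t,k) = 1
    a_val = None
    for _ in range(r_max + 1):
        a = cp.Variable(n)
        resid = cp.abs(dy - dX @ a)
        cp.Problem(cp.Minimize(cp.sum(cp.multiply(wk * alpha, resid)))).solve()
        a_val = a.value
        alpha = 1.0 / (np.abs(dy - dX @ a_val) + xi)
    b_hat = float(np.median(y - X @ a_val))   # median form of b_hat
    return a_val, b_hat
```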

Example 3. To measure the benefit of the transformation (17), let us consider a piecewise affine model similar to the one of Example 2. Again, a Monte Carlo simulation of size 100 is carried out and the results are reported in Table 3. It turns out that the differences model slightly increases the possibility of recovering the parameter vector θi when condition (16) is not fulfilled. The reason why the difference model does not do much better is as follows. While it successfully removes the parameter b of the affine model, it tends to reduce the sparsity. For example, if the sequence {v(t)} has sparsity N1/N in the initial model, then the sequence {v(t) − v(k)}_{t ≠ k} will have sparsity equal to (N1/N)·((N1 − 1)/(N − 1)), which is much smaller than N1/N for


a large N. Another problem with the difference model is that it increases the number of data from N to N(N − 1), introducing more numerical complexity. It might also amplify noise in some cases.

    Table 3. Percentages of successful recoveries for a PWA model when condition (16) is not satisfied.
    sparsity (%)        50   45   30   20   10
    affine               –    0    0    0    0
    differences model    –   77   15    2    0

4. CONCLUSION

We have presented a new identification method for the recovery of PWA maps from empirical observations. The foundation of the proposed method is the formulation of the PWA regression problem as a sparsity-aware ℓ1 optimization problem. The parameter vector associated with each piece of the map is then computed so as to make the induced vector of errors as sparse as possible. In principle, the method can successfully recover a parameter vector if the corresponding submodel has generated a sufficiently large number of data. It has also been shown that the presence of a row of ones in the regressor matrix has some limiting consequences on the performance of the algorithm.

REFERENCES

L. Bako. Identification of switched linear systems via sparse optimization. Automatica, 47:668–677, 2011.
L. Bako and R. Vidal. Algebraic identification of switched MIMO ARX models. In M. Egerstedt and B. Mishra, editors, Hybrid Systems: Computation and Control, volume 4981 of LNCS, pages 43–57. Springer Verlag, 2008.
L. Bako, K. Boukharouba, E. Duviella, and S. Lecoeuche. A recursive identification algorithm for switched linear/affine models. Nonlinear Analysis: Hybrid Systems, 5:242–253, 2011.
A. Bemporad, A. Garulli, S. Paoletti, and A. Vicino. A bounded-error approach to piecewise affine system identification. IEEE Transactions on Automatic Control, 50:1567–1580, 2005.
K. Boukharouba, L. Bako, and S. Lecoeuche. Identification of piecewise affine systems based on Dempster-Shafer theory. In IFAC Symposium on System Identification, Saint-Malo, France, 2009.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
E. Candès and P. A. Randall. Highly robust error correction by convex programming. IEEE Transactions on Information Theory, 54:2829–2840, 2006.
E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14:877–905, 2008.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari. A clustering technique for the identification of piecewise affine systems. Automatica, 39:205–217, 2003.
W.P.M.H. Heemels, B. De Schutter, and A. Bemporad. Equivalence of hybrid dynamical models. Automatica, 37:1085–1091, 2001.


A. Lj. Juloski, S. Weiland, and W.P.M.H. Heemels. A Bayesian approach to identification of hybrid systems. IEEE Transactions on Automatic Control, 50:1520–1533, 2005.
F. Lauer, G. Bloch, and R. Vidal. A continuous optimization framework for hybrid system identification. Automatica, 47:608–613, 2011.
H. Nakada, K. Takaba, and T. Katayama. Identification of piecewise affine systems based on statistical clustering technique. Automatica, 41:905–913, 2005.
H. Ohlsson and L. Ljung. A convex approach to subspace clustering. In Joint IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, USA, 2011a.
H. Ohlsson and L. Ljung. Piecewise affine system identification using sum-of-norms regularization. In IFAC World Congress, Milano, Italy, 2011b.
N. Ozay, M. Sznaier, C. Lagoa, and O. Camps. A sparsification approach to set membership identification of a class of affine hybrid systems. In Conference on Decision and Control, Cancun, Mexico, 2008.
J. Roll, A. Bemporad, and L. Ljung. Identification of piecewise affine systems via mixed-integer programming. Automatica, 40:37–50, 2004.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001.
E. D. Sontag. Nonlinear regulation: The piecewise linear approach. IEEE Transactions on Automatic Control, 26:346–357, 1981.
P. D. Tao. Convergence of a subgradient method for computing the bound norm of matrices. Linear Algebra and its Applications, 62:163–182, 1984.
V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28:52–68, 2011.
R. Vidal, S. Soatto, Y. Ma, and S. Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Conference on Decision and Control, Maui, Hawaii, USA, 2003.