Automatica 53 (2015) 94–102
Brief paper
Forward and backward least angle regression for nonlinear system identification✩ Long Zhang, Kang Li School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT9 5AH, UK
Article history: Received 19 May 2013; Received in revised form 17 July 2014; Accepted 1 December 2014.
Keywords: Least angle regression (LAR); Forward selection; Backward refinement; Nonlinear system identification
Abstract

A forward and backward least angle regression (LAR) algorithm is proposed to construct the nonlinear autoregressive model with exogenous inputs (NARX) that is widely used to describe a large class of nonlinear dynamic systems. The main objective of this paper is to improve model sparsity and generalization performance of the original forward LAR algorithm. This is achieved by introducing a replacement scheme using an additional backward LAR stage. The backward stage replaces insignificant model terms selected by forward LAR with more significant ones, leading to an improved model in terms of the model compactness and performance. A numerical example to construct four types of NARX models, namely polynomials, radial basis function (RBF) networks, neuro fuzzy and wavelet networks, is presented to illustrate the effectiveness of the proposed technique in comparison with some popular methods. © 2014 Elsevier Ltd. All rights reserved.
1. Introduction

A large class of nonlinear dynamic systems can be described by a nonlinear autoregressive model with exogenous input (NARX) (Chen, Billings, & Luo, 1989)

y(t) = f(y(t-1), \ldots, y(t-n_y), u(t-1), \ldots, u(t-n_u)) + \xi(t) = f(x(t)) + \xi(t)   (1)
where the set {u(t), y(t)} represents the real system input and output at time interval t, t = 1, 2, \ldots, N, with N being the size of the training data set. Their largest input and output lags are n_u and n_y, respectively. \xi(t) denotes the error. The set {x(t), y(t)} is the model input vector and output at time interval t. For simplicity, the model input x(t) = [y(t-1), \ldots, y(t-n_y), u(t-1), \ldots, u(t-n_u)] is rewritten as x(t) = [x_1(t), \ldots, x_r(t)] with dimension r = n_y + n_u. f(\cdot) is some unknown function.
✩ This work was supported in part by the U.K. Research Councils under Grants EP/G042594/1 and EP/L001063/1, in part by the Chinese Scholarship Council, in part by the National Natural Science Foundation of China under Grants 61271347 and 61273040, and by the Science and Technology Commission of Shanghai Municipality under Grant 11JC1404000. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Antonio Vicino under the direction of Editor Torsten Söderström. E-mail addresses:
[email protected] (L. Zhang),
[email protected] (K. Li).
Constructing such a NARX model involves three steps (Ljung, 1999): (1) model input selection. More specifically, the unknown lags n_y and n_u need to be determined. Statistical tests and regression methods are among the popular approaches (Haber & Unbehauen, 1990; Lind & Ljung, 2005, 2008); (2) choice of the mapping function f(\cdot). Polynomials (Billings & Chen, 1989), radial basis function (RBF) networks (Chen, Cowan, & Grant, 1991), neuro fuzzy networks (Harris, Hong, & Gan, 2002; Wang & Mendel, 1992) and wavelet networks (Billings & Wei, 2005; Zhang, 1997) are popular options. Though some suggestions have been made on the function selection (Sjöberg et al., 1995), no unified framework is available; (3) parameter identification in the function f(\cdot). This requires the specific expression of the model (1). One popular NARX model structure is a linear combination of nonlinear functions whose parameters are given a priori, which is formulated as (Ljung, 1999)

y(t) = \sum_{i=1}^{M} p_i(x(t), v_i)\,\theta_i + \xi(t)   (2)
where p_i is some nonlinear function with a pre-fixed nonlinear parameter vector v_i, and \theta_i, i = 1, \ldots, M, are the linear coefficients to be optimized. The model (2) is also called the linear-in-the-parameters model for pre-fixed nonlinear parameters v_i's. However, the number of nonlinear functions M is often large, the fixed values for these nonlinear parameters are not optimized, and some nonlinear functions are redundant. This is often referred to as an over-parametrization problem, and not all nonlinear functions are necessarily included in the final model; rather, a good subset is desirable (De Nicolao & Trecate, 1999).
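As an illustration of how the candidate term pool in (2) can be set up in practice, the following is a minimal sketch for a polynomial NARX model, assuming NumPy is available; the function name, lag orders and polynomial degree are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_narx_pool(u, y, n_u=2, n_y=2, degree=3):
    """Build the candidate term pool of a polynomial NARX model in the
    linear-in-the-parameters form (2): each column is one monomial
    p_i(x(t)) of the lagged inputs/outputs, each row is one time instant."""
    N = len(y)
    t0 = max(n_u, n_y)                       # first index with all lags available
    X = np.array([[y[t - j] for j in range(1, n_y + 1)] +
                  [u[t - j] for j in range(1, n_u + 1)]
                  for t in range(t0, N)])    # rows are the model inputs x(t)
    target = np.asarray(y[t0:], dtype=float)

    columns, terms = [], []
    for d in range(1, degree + 1):           # degrees 1..degree, indices i1 <= i2 <= ...
        for idx in combinations_with_replacement(range(X.shape[1]), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
            terms.append(idx)
    P = np.column_stack(columns)             # candidate matrix P in (6)/(7)
    return P, target, terms
```

With r = n_y + n_u = 4 lagged variables and degree 3, this produces 4 + 10 + 20 = 34 candidate terms, the same count as the polynomial term pool used in the numerical example of Section 4.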
Within this context, building a linear-in-the-parameters model becomes a model reduction or selection problem. This paper focuses on the model selection issue.

The main objective of model selection is to build a parsimonious model with good generalization performance. Exhaustive search over all possible subsets is only suitable for a very small number of candidate model terms (nonlinear functions), and it is computationally too demanding when the number is large (Lind & Ljung, 2008; Mao & Billings, 1997); this is known to be an NP-hard problem. To reduce the computational effort, stepwise forward selection methods (Miller, 2002), like forward orthogonal selection (Chen et al., 1991), the fast recursive algorithm (Li, Peng, & Irwin, 2005) and orthogonal matching pursuit (Pati, Rezaiifar, & Krishnaprasad, 1993), start from an empty model and add one term at a time until the model performance is satisfactory. The alternative is stepwise backward selection, which begins with the full model using all the candidates and then deletes one term at a time. All these methods are fast and greedy, and therefore suboptimal (Kump, Bai, Chan, Eichinger, & Li, 2012; Sherstinsky & Picard, 1996). Nevertheless, a parsimonious model with the smallest model size is always desirable. To improve the model compactness and the generalization performance, the combination of forward selection and backward selection has been proposed in Li, Peng, and Bai (2006) and Zhang, Li, Bai, and Wang (2012), where the backward selection is used to reselect and replace those insignificant terms produced by the forward selection. Alternatively, a number of hybrid methods combining forward selection and backward elimination (instead of backward replacement) have been reported (Haugland, 2007; Soussen, Idier, Brie, & Duan, 2011; Zhang, 2011), where the backward elimination removes insignificant terms. The elimination scheme is also referred to as model pruning. For example, term clustering (Aguirre & Billings, 1995) and simulation error (Farina & Piroddi, 2010; Piroddi & Spinelli, 2003) based pruning methods have been studied for constructing polynomial NARX models. It is noted that subset selection may fail in the following scenarios:
• The candidate terms are highly correlated and redundant, which may lead to the ill-conditioning problem (Moussaoui, Brie, & Richard, 2005). The forward selection can avoid selecting highly correlated terms but is not entirely immune to the ill-conditioning problem. The backward selection easily suffers from this problem as it has to deal with the inversion of all the terms at the beginning.
• If the training data is severely polluted by noise, these subset selection methods may fit the models to the noise, which leads to the over-fitting problem (Chen, Hong, & Harris, 2010; Poggio & Girosi, 1990). Pre-filtering and k-fold cross validation are useful to provide a tradeoff between the training accuracy and the generalization performance, but additional computations are incurred.
• If the training data does not contain sufficient information, a model with no or low bias but high variance may not have a satisfactory prediction accuracy. A biased model may be more desirable, using a good bias/variance trade-off technique (Johansen, 1997; Poggio & Girosi, 1990).
• If small changes in the data can result in a very different model, then the model is less robust and its prediction accuracy is reduced (Tibshirani, 1996).

Given the above considerations, regularization methods are popular techniques to build sparse, robust and biased models by imposing additional penalties or constraints on the solution. A general regularization algorithm is the Tikhonov regression, which adds a penalty term to the sum squared error (SSE) cost function (Bishop, 1997; Johansen, 1997; Moussaoui et al., 2005), given by

CF_{Tikhonov} = \sum_{t=1}^{N} \xi^2(t) + \lambda \sum_{i=1}^{M} DF_i   (3)
where the regularization parameter \lambda controls the fitting smoothness and the model size, and DF_i denotes function derivatives of different orders. However, this may be computationally too demanding. More recently, ridge regression and the least absolute shrinkage and selection operator (LASSO) use additional l_2 norm and l_1 norm penalties, respectively. The cost function becomes

CF_{ridge} = \sum_{t=1}^{N} \xi^2(t) + \lambda \sum_{i=1}^{M} \theta_i^2   (4)

and

CF_{lasso} = \sum_{t=1}^{N} \xi^2(t) + \lambda \sum_{i=1}^{M} |\theta_i|.   (5)
These two methods use simplified penalty terms on the weights, and they aim to minimize the sum of the SSE and a norm of the model weights. The ridge regularization can shrink the large weights towards zero but has little effect on small weights (Kump et al., 2012). Unlike the ridge regression, LASSO has the potential to shrink some weights to exactly zero and can be interpreted as a Bayesian estimator (Tibshirani, 1996). More recently, some modifications have been proposed on the penalty term, such as using the differences between adjacent coefficients (Ohlsson, Ljung, & Boyd, 2010) or the differences among all the coefficients (Ohlsson & Ljung, 2013). The difficulty is to give an explicit mathematical solution for the optimal regularization parameter \lambda. The optimal regularization parameter can be determined by cross validation; alternatively, it can be estimated in a Bayesian framework under Gaussian prior distributions. Though a number of algorithms have been proposed (Osborne, Presnell, & Turlach, 2000; Rosset & Zhu, 2007; Tibshirani, 1996), most of them are computationally inefficient compared to the forward selection.

A promising regularization scheme, the least angle regression (LAR), has been proposed and widely studied (Efron, Hastie, Johnstone, & Tibshirani, 2004). It is a variant of the forward selection as it begins with an empty model with initially no regressor and then selects one term at a time until a stop criterion is satisfied. Unlike the forward selection, where the model weights (coefficients) are identical to the least squares solution, the least angle scheme is used to determine the weights. LAR has a few distinctive advantages. First, it is computationally just as fast as the forward selection and more efficient than LASSO methods due to its complete piecewise linear path. Further, it can be easily modified to produce solutions for the LASSO estimator. However, the LAR is still a local method and may not produce a sparser model than the forward selection and LASSO methods.

The main objective of this paper is to improve the model sparsity and generalization performance of the LAR algorithm. This is achieved by introducing a replacement scheme using an additional refinement stage. The new method has a forward LAR stage and a backward LAR stage. The forward LAR is the same as the original LAR. The backward stage compares the significance of each term in the initial model with the remaining terms in the candidate pool and then replaces insignificant ones, leading to an improved model in terms of compactness and performance. The main difference with our previous work on forward and backward methods is that the least angle scheme rather than the least squares approach is employed to determine the model coefficients. Unlike other existing model pruning methods, the proposed method employs a replacement scheme instead of elimination. Further, the LAR is a computationally efficient regularization method that requires no additional computational effort to determine the regularization parameter. A more detailed analysis of the differences is given in Section 3. Extensive numerical simulations on the construction of four NARX models, including the polynomial, RBF, neuro fuzzy and wavelet models, are presented to demonstrate that the new method is able to produce sparser models than the original LAR algorithm and some alternatives.
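The contrast between (4) and (5) described above can be seen on a small synthetic problem. The sketch below assumes NumPy and scikit-learn are available; the data, penalty values \lambda and random seed are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, M = 200, 10
P = rng.standard_normal((N, M))              # candidate terms
theta_true = np.zeros(M)
theta_true[:3] = [3.0, -2.0, 1.5]            # only three terms are truly active
y = P @ theta_true + 0.1 * rng.standard_normal(N)

ridge = Ridge(alpha=1.0).fit(P, y)           # l2 penalty, cost function (4)
lasso = Lasso(alpha=0.1).fit(P, y)           # l1 penalty, cost function (5)
print("ridge:", np.round(ridge.coef_, 3))    # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 3))    # redundant coefficients driven to exactly zero
```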
2. Preliminaries

2.1. NARX models

The matrix form of the linear-in-the-parameters NARX model shown in Eq. (2) is given as

y = P\Theta + \xi   (6)

where y = [y(1), \ldots, y(N)]^T is the output vector, \Theta = [\theta_1, \ldots, \theta_M]^T is the weight vector, and \xi = [\xi(1), \ldots, \xi(N)]^T is the residual vector. The matrix P is the whole candidate term pool given by

P = [p_1, \ldots, p_M].   (7)

It is an N-by-M matrix with p_i = [p_i(x(1), v_i), \ldots, p_i(x(N), v_i)]^T. The ultimate goal of the model selection is to select a subset given by

P_m = [p_{i_1}, \ldots, p_{i_m}]   (8)

where [i_1, \ldots, i_m] denote the indexes of the selected terms, and to determine its weights

\Theta_m = [\theta_{i_1}, \ldots, \theta_{i_m}].   (9)
2.2. Initialization of different NARX models

First of all, the candidate term set needs to be initialized. For a polynomial NARX model, it is given by

y(t) = \sum_{i_1=1}^{r} x_{i_1}(t)\theta_{i_1} + \sum_{i_1=1}^{r}\sum_{i_2=i_1}^{r} x_{i_1}(t)x_{i_2}(t)\theta_{i_1 i_2} + \cdots + \sum_{i_1=1}^{r}\cdots\sum_{i_l=i_{l-1}}^{r} x_{i_1}(t)\cdots x_{i_l}(t)\theta_{i_1\cdots i_l}   (10)

where t = 1, \ldots, N. A polynomial NARX model does not have nonlinear parameters v_i's, but the different degrees of the model input variables have to be determined.

When initializing an RBF network, the basis function has to be chosen first. This paper considers the most popular Gaussian basis function

p_i(x(t), v_i) = \exp(-\|x(t) - c_i\|^2/\sigma^2)   (11)

where the nonlinear parameters v_i include the centers c_i's and the width \sigma. The width is generally chosen as a fixed value for all terms and the centers c_i are initialized with the input data samples, namely c_i = x(i), i = 1, \ldots, N, which produces a total of N candidate terms (Chen et al., 1991).

There are different types of membership functions for a neuro fuzzy model. The widely used neuro fuzzy model with the Gaussian membership function is given by

p_i(x(t), v_i) = \frac{\exp(-\|x(t) - c_i\|^2/\sigma^2)}{\sum_{i=1}^{M}\exp(-\|x(t) - c_i\|^2/\sigma^2)}.   (12)

Comparing (11) and (12), the neuro fuzzy model has the same nonlinear parameters (centers c_i's and width \sigma) as the RBF network if they both use the Gaussian function. It has been proved that the neuro fuzzy model is equivalent to the normalized RBF network (Jang & Sun, 1993; Moody & Darken, 1989). Hence, it generally shares the same initialization scheme with the RBF network (Wang & Mendel, 1992).

The wavelet network employs the Mexican hat function

p_i(x(t), v_i) = 2^{-\frac{n}{2}}\left(r - \|2^{-n}x(t) - m\|^2\right)\exp\left(-\frac{\|2^{-n}x(t) - m\|^2}{2}\right)   (13)

as the wavelet function, where r is the dimension of the input variables x(t), and n and m are the dilation and translation parameters, respectively. If the input data x(t), t = 1, \ldots, N, is normalized into [0, 1], both m = \{m_1, \ldots, m_i, \ldots, m_r\} and n are integers and can be initialized as follows:

n = \{-3, -2, -1, 0, 1, 2, 3\}, \quad -(s_2 - 1) \le m_i \le 2^n - s_1 - 1, \quad i = 1, \ldots, r   (14)

where s_1 and s_2 control the initial range. The difference between s_2 and s_1 can be set to be not less than 5 to make sure that the wavelet function is compactly supported (Billings & Wei, 2005). If the input data is not normalized, a good initialization has been given in Oussar and Dreyfus (2000).
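The RBF and neuro fuzzy candidate pools in (11) and (12) can be generated directly from the training inputs, with every sample acting as a centre. The sketch below is illustrative only (NumPy assumed; the helper name and default width are not from the paper).

```python
import numpy as np

def gaussian_candidate_pool(X, width=1.5, normalized=False):
    """Candidate term matrix for an RBF network (11); with normalized=True the
    rows are divided by their sums, giving the neuro fuzzy / normalized-RBF
    form (12).  Every training input x(i) is used as a centre c_i, so an
    N x r input matrix yields an N x N candidate matrix."""
    X = np.asarray(X, dtype=float)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # ||x(t) - c_i||^2
    P = np.exp(-sq_dist / width ** 2)
    if normalized:
        P = P / P.sum(axis=1, keepdims=True)
    return P
```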
2.3. Forward selection

The least angle regression (LAR) can be viewed as a regularized version of forward selection. To illustrate this, the original forward selection is introduced first. In both forward selection and LAR, all the regressors are first normalized to unit Euclidean length and the output vector is centered so that the resultant output has a zero mean. Let \hat{y} be the estimated vector, with initial estimate \hat{y}_0^{ls} = 0, where ls denotes the least squares solution and the subscript 0 means that no regressor is included in the model. For simplicity, the indexes of the included regressors are grouped in an active set \Lambda, which is initialized to be empty, while the remaining ones are labeled in an inactive set \Gamma, which initially is \{1, 2, \ldots, M\}. As the selection proceeds, the size of the active set \Lambda will increase while the size of the inactive set \Gamma will decrease accordingly. At the beginning (k = 0), the initial residual is

\xi_0^{ls} = y - \hat{y}_0^{ls}.   (15)

The first candidate term or regressor included in the active set is the one which has the largest correlation with the current residual, i.e.

c_0 = \max_{i \in \Gamma} |p_i^T \xi_0^{ls}|.   (16)

Suppose p_{i_1} is the first selected term; move it from the inactive set \Gamma to the active set \Lambda. The task is then to compute the coefficient \Theta_1 by the least squares method, which is given by

\Theta_1^{ls} = (P_\Lambda^T P_\Lambda)^{-1} P_\Lambda^T y   (17)

where P_\Lambda = p_{i_1}. Hence, the estimate of the system outputs using only one regression term is

\hat{y}_1^{ls} = P_\Lambda \Theta_1^{ls}.   (18)

The updated model residual is

\xi_1^{ls} = y - \hat{y}_1^{ls}.   (19)

At the second step (k = 1), compute the correlations and find the second regressor according to the largest correlation given by

c_1 = \max_{i \in \Gamma} |p_i^T \xi_1^{ls}|.   (20)

Suppose p_{i_2} is then found and the active set becomes \Lambda = \{i_1, i_2\}. With the new term added to the model, the two model regressor weights \Theta_2 need to be updated using the least squares method, which is given by

\Theta_2^{ls} = (P_\Lambda^T P_\Lambda)^{-1} P_\Lambda^T y   (21)

where P_\Lambda = [p_{i_1}, p_{i_2}]. The estimate of the system outputs is

\hat{y}_2^{ls} = P_\Lambda \Theta_2^{ls}.   (22)

The procedure continues until the desired model performance is met. In a word, the forward selection includes two steps, namely finding regressors and estimating their coefficients.
2.4. Least angle regression

The least angle regression (LAR) procedure follows the same general scheme as the forward selection, but it does not use the least squares solution for \Theta. More specifically, at the initial step (k = 0), the first regressor p_{i_1} is determined by finding the maximal correlation, which is given by

\xi_0 = y - \hat{y}_0, \quad c_0 = \max_{i \in \Gamma} |p_i^T \xi_0|   (23)

where \hat{y}_0 = 0. The coefficient of the first regressor is then updated by the least angle scheme instead of the least squares solution. This is formulated as follows:

\Theta_1 = \Theta_0 + \gamma_0 (\Theta_1^{ls} - \Theta_0)   (24)

where \Theta_0 = 0. Obviously, \Theta_1 is a combination of the least squares solution \Theta_1^{ls} given by (17) and the previous coefficient value \Theta_0. Note that as a new term is selected into the model, the number of previous model coefficients is less than that of the current model by one. To make (24) work, the dimension of the previous model is increased by one and the additional parameter is set to zero. The estimated \hat{y}_1 and residual \xi_1 therefore become

\hat{y}_1 = \hat{y}_0 + \gamma_0(\hat{y}_1^{ls} - \hat{y}_0), \quad \xi_1 = y - \hat{y}_1   (25)

respectively. The essential step here is that the least angle scheme is used to determine the unknown value \gamma_0. To achieve this, the least angle scheme needs to find another regressor p_j from the inactive set such that the angle between p_j and the residual vector is the same as that between the currently selected regressor p_{i_1} and the residual vector. A simplified method for finding the equal least angle is to use the residual correlations (Berwin, 2004), i.e.

|p_{i_1}^T \xi_1| = |p_i^T \xi_1|, \quad i \in \Gamma.   (26)

Substitute formula (25) into (26) and define

d_0 = \hat{y}_1^{ls} - \hat{y}_0.   (27)

Then \gamma_0 can be computed by (Berwin, 2004)

\gamma_0 = \min_{i \in \Gamma}^{+}\left\{ \frac{p_i^T \xi_0 - c_0}{p_i^T d_0 - c_0}, \; \frac{p_i^T \xi_0 + c_0}{p_i^T d_0 + c_0} \right\}   (28)

where 0 < \gamma_0 \le 1. The following steps are made in the same way and continue until the model performance is satisfied.

Fig. 1. The geometric comparison of the forward selection and LAR.

To better understand the differences between forward selection and LAR, a geometric interpretation in the case of only three regressors is given in Fig. 1. Three regressors p_1, p_2, p_3 and one output vector y are plotted using solid lines. In the initial step, p_1 is selected as the first regressor by both forward selection and LAR since it has the largest correlation with the initial residual \xi_0^{ls} = \xi_0 = y among the three regressors. The task is then to estimate its coefficient. The forward selection uses the least squares solution \Theta_1^{ls} with the aim of obtaining the smallest residual \xi_1^{ls}. In this case, \Theta_1^{ls} is the projection coefficient of y onto the regressor p_1, and the residual \xi_1^{ls} is orthogonal to the regressor p_1. For the LAR, the inactive regressors p_2, p_3 are involved in computing this value. For example, if p_2 is used first, let \xi_{10} and \Theta_{10} be the corresponding residual and coefficient, respectively. To have unique values for these two parameters, the bisection condition needs to be satisfied, where the inactive regressor p_2 has the same angle with the residual \xi_{10} as p_1 has. In other words, the residual \xi_{10} bisects the angle between p_1 and p_2, which is denoted as \alpha_0. The same procedure is applied to p_3, leading to the residual \xi_{11}, the coefficient \Theta_{11} and the angle \alpha_1. In this example, \alpha_0 is smaller than \alpha_1, therefore p_2 and p_1 have the least angle and the coefficient for p_1 is \Theta_1 = \Theta_{10}. In summary, the two main steps for computing the LAR coefficients are to compute the bisection equation shown in (26) and to find the regressor with the least angle using (28).
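A compact sketch of the whole LAR iteration, covering the correlation step (23)/(29) and the least angle update (24)-(28)/(30), is given below. It is a didactic re-implementation under the unit-norm and centred-output assumptions of Section 2.3, not the authors' code.

```python
import numpy as np

def forward_lar(P, y, m):
    """Least angle regression sketch following (23)-(28) (steps (29)-(30))."""
    N, M = P.shape
    theta = np.zeros(M)
    y_hat = np.zeros(N)
    active, inactive = [], list(range(M))
    for _ in range(m):
        xi = y - y_hat                                    # current residual
        corr = np.abs(P[:, inactive].T @ xi)
        ck = float(np.max(corr))                          # largest correlation c_k
        j = inactive[int(np.argmax(corr))]
        active.append(j)
        inactive.remove(j)
        PA = P[:, active]
        theta_ls = np.linalg.lstsq(PA, y, rcond=None)[0]  # least squares on active set
        d = PA @ theta_ls - y_hat                         # direction d_k in (27)
        if inactive:
            a = P[:, inactive].T @ xi
            b = P[:, inactive].T @ d
            with np.errstate(divide="ignore", invalid="ignore"):
                cand = np.concatenate([(a - ck) / (b - ck), (a + ck) / (b + ck), [1.0]])
            gamma = float(np.min(cand[cand > 0]))         # step length, eq. (28)/(30)
            gamma = min(gamma, 1.0)                       # 0 < gamma_k <= 1
        else:
            gamma = 1.0                                   # last term: full least squares step
        y_hat = y_hat + gamma * d
        theta_pad = np.zeros(M)
        theta_pad[active] = theta_ls
        theta = theta + gamma * (theta_pad - theta)       # coefficient update (24)/(30)
    return active, theta[active]
```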
3. Forward and backward least angle regression

Although LAR is a computationally efficient regularized forward algorithm, it is still a local method, and the resultant model can be further improved in terms of the model compactness and generalization performance. This paper introduces a backward refinement stage to replace those insignificant terms in the initial model built by the forward LAR and thus improve the model generalization performance and compactness.

3.1. Forward LAR

The forward LAR is identical to the original LAR. It is summarized as follows.

1. Set the initial estimated vector \hat{y}_0 = 0 and k = 0. Initialize the active set \Lambda and the inactive set \Gamma.
2. At the kth step, find the most important regressor by computing the correlation

\xi_k = y - \hat{y}_k, \quad c_k = \max_{i \in \Gamma} |p_i^T \xi_k|   (29)

and then move the selected regressor from the inactive set \Gamma to the active set \Lambda.
3. Estimate the coefficient of the regressor selected in the kth step:

\Theta_{k+1}^{ls} = (P_\Lambda^T P_\Lambda)^{-1} P_\Lambda^T y,
\hat{y}_{k+1}^{ls} = P_\Lambda \Theta_{k+1}^{ls},
d_k = \hat{y}_{k+1}^{ls} - \hat{y}_k,
\gamma_k = \min_{i \in \Gamma}^{+}\left\{ \frac{p_i^T \xi_k - c_k}{p_i^T d_k - c_k}, \; \frac{p_i^T \xi_k + c_k}{p_i^T d_k + c_k}, \; 1 \right\}, \quad 0 < \gamma_k \le 1,
\hat{y}_{k+1} = \hat{y}_k + \gamma_k(\hat{y}_{k+1}^{ls} - \hat{y}_k),
\Theta_{k+1} = \Theta_k + \gamma_k(\Theta_{k+1}^{ls} - \Theta_k)   (30)

where P_\Lambda represents the active regressor matrix and \Theta_{k+1}^{ls} denotes the least squares solution.
4. Update k = k + 1 and repeat steps 2 and 3 until a criterion is met, e.g. the number of selected regressors has reached a preset value, or an information based statistical criterion, like the Akaike information criterion (AIC) (Akaike, 1974), Bayesian information criterion (BIC) (Schwarz, 1978) and final prediction error (FPE) (Burshtein & Weinstein, 1985), is satisfied.
Suppose the model with the active set \Lambda = \{i_1, i_2, \ldots, i_m\} has been selected by the forward LAR.

3.2. Backward LAR

The objective of the backward refinement is to re-check the active regressors. This needs two steps. First, each previously selected regressor (except for the last selected regressor) in the active set is moved back to the inactive set sequentially. Then, the most important regressor in the inactive set, namely the one which produces the largest residual correlation, is identified and put into the active set. In the first step, if any active regressor is moved back to the inactive set, the coefficients of the remaining active regressors need to be re-estimated using the least angle scheme. In detail, let i_l, l = m-1, \ldots, 1, be moved back to the inactive set, so that the updated active set is \Lambda = \{i_1, \ldots, i_{l-1}, i_{l+1}, \ldots, i_m\} with m - 1 elements. According to the least angle scheme, a selected regressor has to use its following regressor to determine its coefficient. Hence, the first (l - 2) coefficients are kept unchanged and do not need to be re-calculated, while the regressor coefficients with indexes i_{l-1}, i_{l+1}, \ldots, i_{m-1} need to be re-computed by

\xi = y - \hat{y}_j, \quad c = |p_{\Lambda_{j+1}}^T \xi|,
\Lambda_s = \Lambda_{1:j+1},
\Theta_{j+1}^{ls} = (P_{\Lambda_s}^T P_{\Lambda_s})^{-1} P_{\Lambda_s}^T y,
\hat{y}_{j+1}^{ls} = P_{\Lambda_s}\Theta_{j+1}^{ls},
d = \hat{y}_{j+1}^{ls} - \hat{y}_j,
\gamma = \min^{+}\left\{ \frac{p_{\Lambda_{j+2}}^T \xi - c}{p_{\Lambda_{j+2}}^T d - c}, \; \frac{p_{\Lambda_{j+2}}^T \xi + c}{p_{\Lambda_{j+2}}^T d + c}, \; 1 \right\}, \quad 0 < \gamma \le 1,
\hat{y}_{j+1} = \hat{y}_j + \gamma(\hat{y}_{j+1}^{ls} - \hat{y}_j),
\Theta_{j+1} = \Theta_j + \gamma(\Theta_{j+1}^{ls} - \Theta_j)   (31)

where j = l-2, \ldots, m-3 for l \ge 2, while j = 0, \ldots, m-3 for l = 1. Here \Lambda_j denotes the jth element in \Lambda and \Lambda_0 = 0. In the second step, the (m-1)th regressor coefficient with index i_m is determined and the mth regressor is re-selected using the same procedure as in the forward LAR given by (30). If the mth regressor is not the original one, a new regressor has entered the refined model. The process stops when no previously selected regressor can be replaced.
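A simplified sketch of the replacement loop is given below. For brevity it re-fits the reduced model by ordinary least squares and re-selects terms by residual correlation; the paper instead re-runs the least angle recursion (31), so this is an approximation of the backward replacement logic rather than the exact backward LAR.

```python
import numpy as np

def backward_refine(P, y, active, max_sweeps=10):
    """Backward replacement sketch in the spirit of Section 3.2: each selected
    term (except the last) is put back into the candidate pool and the best
    term for that position is re-selected; stop when a full sweep changes
    nothing.  Least squares re-fits stand in for the least angle updates (31)."""
    active = list(active)
    for _ in range(max_sweeps):                       # refinement loops
        changed = False
        for pos in range(len(active) - 1):            # spare the most recent term
            reduced = active[:pos] + active[pos + 1:]
            PA = P[:, reduced]
            theta = np.linalg.lstsq(PA, y, rcond=None)[0]
            residual = y - PA @ theta
            pool = [i for i in range(P.shape[1]) if i not in reduced]
            corr = np.abs(P[:, pool].T @ residual)
            best = pool[int(np.argmax(corr))]
            if best != active[pos]:                   # a more significant term was found
                active[pos] = best
                changed = True
        if not changed:
            break
    theta = np.linalg.lstsq(P[:, active], y, rcond=None)[0]
    return active, theta
```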
Remark 1 (Difference with Existing Methods). The popular existing forward and backward methods include the LASSO solution proposed in the original LAR paper (Efron et al., 2004), the adaptive forward–backward greedy algorithm (Zhang, 2011), the bidirectional greedy method (Haugland, 2007), and the sparse Bernoulli–Gaussian deconvolution approach (Soussen et al., 2011). Though the idea of forward–backward selection is also briefly introduced in Miller (2002), the specific implementation procedure and model performance evaluation are not given. The popular model pruning methods include the term clustering algorithm (Aguirre & Billings, 1995) and the simulation error based identification algorithm (Piroddi & Spinelli, 2003). The differences between these methods and the proposed one can be summarized as follows.

• Unlike the proposed method, which replaces insignificant terms with more significant ones in the backward stage, these existing methods focus on the elimination of redundant or insignificant terms. It is not guaranteed that the simplified model is better than the original one, as some important terms may be left out. Further, it is difficult to determine the elimination criterion in some methods, such as the adaptive forward–backward greedy algorithm. It is also worth pointing out that the term clustering method is only suitable for polynomial NARX models, but not for other models.
• The proposed method starts the backward LAR only when the forward LAR terminates. In contrast, these existing forward and backward methods run the backward elimination in every forward iteration. Most of the existing methods deal with normal regression without considering regularization, and therefore the drawbacks inherent in the subset selection approach still exist in these methods. Only the LASSO solution using the LAR method and the sparse Bernoulli–Gaussian deconvolution approach employ a regularized scheme. However, the Bernoulli–Gaussian approach needs to determine an additional regularization parameter, which leads to increased computational demand.

Remark 2 (Selection Criterion). This paper follows the original LAR method and employs the correlation between a regressor and the current residual as the term selection criterion. It is worth mentioning that many popular alternatives exist in the literature. For example, the error reduction ratio (ERR) and the simulation reduction ratio (SRR) are widely used in nonlinear system identification. ERR involves minimizing the one-step-ahead prediction error and measures the least squares error reduction when a new term is included, while SRR minimizes the predicted simulation error. For polynomial NARX modeling, it was observed that the forward selection with ERR can pick up some spurious terms, while the SRR based identification method with a pruning scheme can build better models (Piroddi & Spinelli, 2003) at the expense of a significantly increased computation requirement.

Remark 3 (Improvements Over LAR). The LAR is a local method and may not produce a compact model with the smallest model size. The new method has the potential to improve the model compactness through the replacement scheme. In the worst scenario, the new method produces the same model as the original LAR. Like all greedy but fast methods, the new method cannot guarantee to find the sparsest models.

3.3. Computational complexity

The computational complexity of the forward and backward LAR mainly comes from the inversion operation. Many popular matrix computation methods, like Cholesky factorization, Gaussian LU decomposition, modified Gram–Schmidt, Householder transformation, the Givens method and singular value decomposition, can be used to compute the matrix inversion. The original LAR employs the Cholesky factorization. In this case, the forward LAR (original LAR) requires O(m^3 + Nm^3) flops (floating point operations) (Efron et al., 2004) and the backward LAR requires O(q(m^3 + Nm^3)) flops, where q represents the number of refinement loops.

4. A numerical example

Consider the following benchmark nonlinear system (Hong, Harris, & Wilson, 1999):

u(t) = \sin\left(\frac{\pi t}{20}\right),
z(t) = \frac{2.5\,z(t-1)z(t-2)}{1 + z^2(t-1) + z^2(t-2)} + 0.3\cos(0.5(z(t-1) + z(t-2))) + 0.5\,u(t-1),
y(t) = z(t) + \xi(t)   (32)

where the noise-free system input and output are u(t) and z(t) at time interval t, respectively. The output y(t) is corrupted by a Gaussian noise sequence \xi(t) with a signal-to-noise ratio of 25 dB. The equivalent noise variance is 0.0031. The noise is only added to the system output and has no effect on the system equation, which is also known as output error (Aguirre, Barbosa, & Braga, 2010). A data sequence of 1000 samples is generated. The first 500 samples are used for training and the remaining 500 samples are used for testing.
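A minimal sketch for generating the training and test data from (32) is given below; the random seed and the exact indexing convention are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
t = np.arange(1, N + 1)
u = np.sin(np.pi * t / 20)                          # input u(t) = sin(pi*t/20)
z = np.zeros(N)
for k in range(2, N):                               # noise-free output z(t) in (32)
    z[k] = (2.5 * z[k - 1] * z[k - 2] / (1.0 + z[k - 1] ** 2 + z[k - 2] ** 2)
            + 0.3 * np.cos(0.5 * (z[k - 1] + z[k - 2]))
            + 0.5 * u[k - 1])
noise = np.sqrt(0.0031) * rng.standard_normal(N)    # variance 0.0031, roughly 25 dB SNR
y = z + noise                                       # measured output y(t) = z(t) + xi(t)
u_train, y_train = u[:500], y[:500]                 # first 500 samples for training
u_test, y_test = u[500:], y[500:]                   # remaining 500 samples for testing
```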
Four types of NARX models, namely polynomials, RBF networks, neuro fuzzy and wavelet networks, are used to model this benchmark. The sum squared error (SSE) over the training data and testing data sets is used to evaluate the model performance. To test the performance of the resultant models, a free run simulation is employed on a validation test set (Aguirre et al., 2010; Piroddi & Spinelli, 2003). As shown in (1), the regressor variables of the NARX model include previous output variables. For the free run simulation, all these variables are taken from the NARX model outputs instead of the true system. In other words, when validating the NARX models, the original system is not needed in the free run simulation. As the inputs of the NARX models can affect their final performance, they should be chosen carefully, and in this paper they are determined by trial-and-error. Polynomials with degrees up to 3 for the variables x(t) = {y(t-1), y(t-2), u(t-1), u(t-2)} are used as the nonlinear functions p in (2), while the other three models use the variables x(t) = {y(t-1), y(t-2), u(t-1), u(t-2), u(t-3)} as their inputs. For the polynomial NARX model, the input variables form a total of 34 polynomial terms, with 4 first-degree terms, 10 second-degree terms, and 20 third-degree terms. The RBF network employs a Gaussian kernel function p_i(x(t), v_i) = \exp(-\|x - c_i\|^2/\sigma^2), where the c_i's and \sigma are the RBF centers and width, respectively. The width is chosen as \sigma = 1.5 by trial-and-error, while the centers are initialized on the training data points, producing a total of 500 candidate terms. The neuro fuzzy model with Gaussian membership function is given by p_i(x(t), v_i) = \exp(-\|x - c_i\|^2/\sigma^2) / \sum_{i=1}^{m}\exp(-\|x - c_i\|^2/\sigma^2). The centers and widths are initialized using the same scheme as the RBF networks to generate a total of 500 candidate terms. The wavelet network employs the Mexican hat function p_i(x(t), v_i) = 2^{-\frac{n}{2}}(r - \|2^{-n}x(t) - m\|^2)\exp(-\|2^{-n}x(t) - m\|^2/2) as the wavelet function, where m and n are the dilation and translation parameters, respectively. Both m and n are integers and can be selected within -3 \le m_i, n \le 3, where the wavelet networks converge.

To illustrate the effectiveness of the proposed forward and backward LAR (FBLAR) method, forward selection (FS), LAR, orthogonal matching pursuit (OMP), and LASSO using the LAR solution are employed for comparison. The sum squared errors (SSE) over the training data and testing data sets are reported as the training SSE and the test SSE, respectively. For polynomial NARX models, the averages of the Monte Carlo simulation results over running time, training errors and test errors with 10 trials are given in Figs. 2–4.

Fig. 2. Variation in running time of the polynomial models with different model size.
Fig. 3. Variation in train errors of the polynomial models with different model size.
Fig. 4. Variation in test errors of the polynomial models with different model size.

These results reveal that FBLAR is capable of building a more compact model than FS, LAR, OMP and LASSO, in that the model has the smallest model size while achieving a similar generalization performance. However, this is achieved at the cost of extra computation time. According to Occam's razor, a simple model with good generalization performance is always preferable. From this point of view, the model built by FBLAR is better than those constructed by the other methods. For the RBF, neuro fuzzy and wavelet NARX models, the Monte Carlo simulation results with 10 trials are given in Figs. 5–13, respectively.

Fig. 5. Variation in running time of the RBF models with different model size.
Fig. 6. Variation in train errors of the RBF models with different model size.
Fig. 7. Variation in test errors of the RBF models with different model size.
Fig. 8. Variation in running time of the neuro fuzzy models with different model size.
Fig. 9. Variation in train errors of the neuro fuzzy models with different model size.
Fig. 10. Variation in test errors of the neuro fuzzy models with different model size.
Fig. 11. Variation in running time of the wavelet network models with different model size.
Fig. 12. Variation in train errors of the wavelet network models with different model size.
Fig. 13. Variation in test errors of the wavelet network models with different model size.

These results again confirm that the new method produces sparser models than the LAR method and other popular alternatives when the original system is modeled by different types of models, though this is achieved by incurring some extra computation of the same order as the other methods. Obviously, the new method provides a good tradeoff between model compactness and computational complexity.
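For completeness, the free-run (simulation) validation described earlier in this section, in which past model outputs rather than measured outputs are fed back into the regressor, can be sketched as follows. The helper term_fn, the lag orders and the calling convention are placeholders for whatever candidate-term generator and model structure were identified, not part of the paper.

```python
import numpy as np

def free_run(theta, active, u, y_init, term_fn, n_y=2, n_u=3, N=500):
    """Free-run simulation of an identified NARX model: the regressor x(t) is
    built from previously *simulated* outputs, so the true system is not
    needed during validation.  term_fn(x) must return the full candidate row
    used at identification time (polynomial, RBF, neuro fuzzy or wavelet)."""
    t0 = max(n_y, n_u)
    y_sim = list(y_init[:t0])                    # initial conditions taken from data
    for t in range(t0, N):
        x = np.array([y_sim[t - j] for j in range(1, n_y + 1)] +
                     [u[t - j] for j in range(1, n_u + 1)])
        p = term_fn(x)                           # candidate terms evaluated at x(t)
        y_sim.append(float(p[active] @ theta))   # model output, fed back at the next step
    return np.asarray(y_sim)
```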
5. Conclusion

A forward and backward least angle regression (LAR) algorithm has been proposed for constructing a range of NARX models. The backward stage can replace the insignificant terms previously selected by the forward LAR with the most significant ones left in the candidate term pool, leading to a more compact model with improved generalization performance. Numerical results have confirmed the effectiveness of the proposed method.

References
Aguirre, L. A., Barbosa, B. H. G., & Braga, A. P. (2010). Prediction and simulation errors in parameter estimation for nonlinear systems. Mechanical Systems and Signal Processing, 24, 2855–2867.
Aguirre, L. A., & Billings, S. A. (1995). Improved structure selection for nonlinear models based on term clustering. International Journal of Control, 62, 569–587.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Berwin, A. T. (2004). Discussion of least angle regression. Annals of Statistics, 32, 481–490.
Billings, S. A., & Chen, S. (1989). Identification of non-linear rational systems using a prediction error estimation algorithm. International Journal of Systems Science, 20, 467–494.
Billings, S. A., & Wei, H. L. (2005). A new class of wavelet networks for nonlinear system identification. IEEE Transactions on Neural Networks, 16, 862–874.
Bishop, C. M. (1997). Neural networks for pattern recognition. Oxford University Press.
Burshtein, D., & Weinstein, E. (1985). Some relations between the various criteria for autoregressive model order determination. IEEE Transactions on Acoustics, Speech and Signal Processing, 33, 1071–1079.
Chen, S., Billings, S. A., & Luo, W. (1989). Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50, 1873–1896.
Chen, S., Cowan, C. F. N., & Grant, P. M. (1991). Orthogonal least squares algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2, 302–309.
Chen, S., Hong, X., & Harris, C. J. (2010). Sparse kernel regression modeling using combined locally regularized orthogonal least squares and D-optimality experimental design. IEEE Transactions on Automatic Control, 14, 477–499.
De Nicolao, G., & Trecate, G. F. (1999). Consistent identification of NARX models via regularization networks. IEEE Transactions on Automatic Control, 44, 2045–2049.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32, 407–499.
Farina, M., & Piroddi, L. (2010). An iterative algorithm for simulation error based identification of polynomial input–output models using multi-step prediction. International Journal of Control, 83, 1442–1456.
Haber, R., & Unbehauen, H. (1990). Structure identification of nonlinear dynamic systems: a survey on input/output approaches. Automatica, 26, 651–677.
Harris, C., Hong, X., & Gan, Q. (2002). Adaptive modelling, estimation, and fusion from data: a neurofuzzy approach. Springer Verlag.
Haugland, D. (2007). A bidirectional greedy heuristic for the subspace selection problem. In International conference on engineering stochastic local search algorithms: designing, implementing and analyzing effective heuristics. Vol. 4638 (pp. 162–176).
Hong, X., Harris, C. J., & Wilson, P. A. (1999). Neurofuzzy state identification using prefiltering. IEE Proceedings Control Theory and Applications, 146, 234–240.
Jang, J. S., & Sun, C. T. (1993). Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks, 4, 156–159.
Johansen, T. A. (1997). On Tikhonov regularization, bias and variance in nonlinear system identification. Automatica, 33, 441–446.
Kump, P., Bai, E. W., Chan, K., Eichinger, B., & Li, K. (2012). Variable selection via rival (removing irrelevant variables amidst lasso iterations) and its application to nuclear material detection. Automatica, 48, 2107–2115.
Li, K., Peng, J. X., & Bai, E. W. (2006). A two-stage algorithm for identification of non-linear dynamic systems. Automatica, 42, 1189–1197.
Li, K., Peng, J., & Irwin, G. W. (2005). A fast nonlinear model identification method. IEEE Transactions on Automatic Control, 8, 1211–1216.
Lind, I., & Ljung, L. (2005). Regressor selection with the analysis of variance method. Automatica, 41, 693–700.
Lind, I., & Ljung, L. (2008). Regressor and structure selection in NARX models using a structured ANOVA approach. Automatica, 44, 383–395.
Ljung, L. (1999). System identification: theory for the user (2nd ed.). Prentice Hall.
Mao, K. Z., & Billings, S. A. (1997). Algorithms for minimal model structure detection in nonlinear dynamic system identification. IEEE Transactions on Neural Networks, 68, 311–330.
Miller, A. (2002). Subset selection in regression (2nd ed.). CRC Press.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computing, 1, 281–294.
Moussaoui, S., Brie, D., & Richard, A. (2005). Regularization aspects in continuous-time model identification. Automatica, 41, 197–208.
Ohlsson, H., & Ljung, L. (2013). Identification of switched linear regression models using sum-of-norms regularization. Automatica, 49, 1045–1050.
Ohlsson, H., Ljung, L., & Boyd, S. (2010). Segmentation of ARX-models using sum-of-norms regularization. Automatica, 46, 1107–1111.
Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20, 389–403.
Oussar, Y., & Dreyfus, G. (2000). Initialization by selection for wavelet network training. Neurocomputing, 131–143.
Pati, Y. C., Rezaiifar, R., & Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In The twenty-seventh Asilomar conference on signals, systems and computers, 1993 (pp. 40–44). IEEE.
Piroddi, L., & Spinelli, W. (2003). An identification algorithm for polynomial NARX models based on simulation error minimization. International Journal of Control, 76, 1767–1781.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Rosset, S., & Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35, 1012–1030.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Sherstinsky, A., & Picard, R. W. (1996). On the efficiency of the orthogonal least squares training method for radial basis function networks. IEEE Transactions on Neural Networks, 7, 195–200.
Sjöberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P., et al. (1995). Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31, 1691–1724.
Soussen, C., Idier, J., Brie, D., & Duan, J. (2011). From Bernoulli–Gaussian deconvolution to sparse signal restoration. IEEE Transactions on Signal Processing, 59, 4572–4584.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 20, 267–288.
Wang, L. X., & Mendel, J. M. (1992). Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Transactions on Neural Networks, 3, 807–814.
Zhang, Q. H. (1997). Using wavelet network in nonparametric estimation. IEEE Transactions on Neural Networks, 8, 227–236.
Zhang, T. (2011). Adaptive forward–backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57, 4689–4708.
Zhang, L., Li, K., Bai, E. W., & Wang, S. J. (2012). A novel two-stage classical Gram–Schmidt algorithm for wavelet network construction. In 16th IFAC symposium on system identification, Vol. 16 (pp. 644–649).
Long Zhang received his B.Eng. and M.Eng. degrees as an outstanding graduate in Electrical Engineering and Automation from Harbin Institute of Technology, Harbin, China, in 2008 and 2010, respectively, and the Ph.D. degree in Electronics, Electrical Engineering and Computer Science from Queen's University Belfast, UK, in 2013. He is currently a Research Associate at the Department of Automatic Control and System Engineering, University of Sheffield, UK. His research interests include system identification, neural networks, statistical regression and fault diagnosis in both time and frequency domains.

Kang Li received the B.Sc. degree from Xiangtan University, Xiangtan, China, in 1989, the M.Sc. degree from the Harbin Institute of Technology, Harbin, China, in 1992, and the Ph.D. degree from Shanghai Jiaotong University, Shanghai, China, in 1995. He is currently a Professor of Intelligent Systems and Control with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK. He is involved in bioinformatics with applications in food safety and biomedical engineering. He is also a Visiting Professor with the Harbin Institute of Technology, Shanghai University, Shanghai, China, and the Ningbo Institute of Technology, Zhejiang University, Zhejiang, China. He held a Visiting Fellowship or Professorship with the National University of Singapore, Singapore, the University of Iowa, Iowa City, IA, USA, the New Jersey Institute of Technology, Newark, NJ, USA, Tsinghua University, Beijing, China, and the Technical University of Bari, Taranto, Italy. He has authored more than 200 papers in his areas of expertise, and edited 12 conference proceedings (Springer). His current research interests include nonlinear system modeling, identification and control, bio-inspired computational intelligence, and fault-diagnosis and detection, with recent applications in power systems and renewable energy, and polymer extrusion processes. He serves on the editorial boards of Neurocomputing, the Transactions of the Institute of Measurement and Control, Cognitive Computation, and the International Journal of Modelling, Identification and Control.