Incremental smooth support vector regression for Takagi–Sugeno fuzzy modeling


Neurocomputing 123 (2014) 281–291


Rui Ji, Yupu Yang, Weidong Zhang

Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China

Article history: Received 21 October 2012; received in revised form 12 July 2013; accepted 12 July 2013; available online 16 August 2013. Communicated by W.S. Hong.

Abstract

We propose an architecture for Takagi–Sugeno (TS) fuzzy systems and develop an incremental smooth support vector regression (ISSVR) algorithm to build the TS fuzzy system. ISSVR is based on ε-insensitive smooth support vector regression (ε-SSVR), a smoothing strategy for solving ε-SVR, and the incremental reduced support vector machine (IRSVM). The ISSVR incrementally selects representative samples from the given dataset as support vectors. We show that TS fuzzy modeling is equivalent to the ISSVR problem under certain assumptions. A TS fuzzy system can be generated from the given training data based on the ISSVR learning, with each fuzzy rule given by a support vector. Compared with other fuzzy modeling methods, more forms of membership functions can be used in our model, and the number of fuzzy rules of our model is much smaller. The performance of our model is illustrated by extensive experiments and comparisons.

Keywords: Takagi–Sugeno fuzzy systems; smooth support vector regression; reference functions; fuzzy modeling; ε-insensitive learning; incremental learning

1. Introduction

Support vector machine (SVM) [1] is one of the most promising learning algorithms for pattern classification. It is based on the structural risk minimization (SRM) principle. Vapnik introduced an ε-insensitive loss function and applied SVM to regression problems. The ε-insensitive loss function sets an ε-insensitive tube around the data, within which errors are disregarded. This problem is referred to as ε-insensitive support vector regression (ε-SVR) [2]. ε-SVR is formulated as a constrained minimization problem and is extended to the nonlinear case by using the kernel technique. A smoothing strategy for solving ε-SVR, named ε-insensitive smooth support vector regression (ε-SSVR), was proposed in [3]. The ε-SSVR approximates the ε-insensitive loss by a smooth function and converts ε-SVR to an unconstrained minimization problem. The objective function of ε-SSVR is strongly convex and infinitely differentiable for any arbitrary kernel, so it is always solvable using a fast Newton–Armijo method.

For the last decade, there has been increasing interest in incorporating support vector learning into fuzzy modeling. Chen [4,5] proposed a positive definite fuzzy system (PDFS). The membership functions for the same input variable were generated from location transformation of a reference function [6]. The fuzzy rules were determined by the support vectors (SVs) of an SVM, where the kernel was constructed from the reference functions. The kernel was proven to be an admissible Mercer kernel if the reference functions were positive definite functions [7]. Chiang [8] proposed an SVM-based modeling framework for a fuzzy basis function inference system [9]. Fuzzy rules were extracted from the training data based on the SVs. In these two models, the number of fuzzy rules equaled the number of SVs; since the number of SVs in an SVM is usually large, the number of fuzzy rules was equally large. Lin proposed an SVR-based fuzzy neural network (SVRFNN) [10]. The number of fuzzy rules in the SVRFNN was reduced by removing irrelevant fuzzy rules, but this rule reduction approach degraded the generalization performance. In all these SVM-based models, the form of the membership functions was restricted by the Mercer condition [11], i.e., the positive definiteness of the membership functions was required.

All of the above models use fuzzy rules with singletons in the consequent. A fuzzy system with a Takagi–Sugeno (TS)-type consequent, i.e., a linear combination of the input variables, has better performance than one with a singleton consequent. Researchers have also proposed methods for TS fuzzy modeling [12–21]. Leski [20] introduced Vapnik's ε-insensitive loss function into Takagi–Sugeno–Kang (TSK) fuzzy modeling. The parameters of the membership functions were determined by fuzzy c-means clustering (FCM), the number of fuzzy rules equaled the number of clusters, and the consequent parameters were obtained by solving a minimization problem.


Juang proposed a self-organizing TS-type fuzzy network with support vector learning (SOTFN-SV) [22]. The antecedent of SOTFN-SV was generated by fuzzy clustering of the data, and the consequent parameters were then tuned by SVM. A TS fuzzy system-based SVR (TSFS-SVR) model was also proposed by Juang [21]. The parameters of TSFS-SVR were learned by a combination of fuzzy clustering and linear SVR. Based on the TSFS-SVR, Juang proposed two other TS fuzzy modeling methods [18,19]. Cai proposed a Gaussian kernel-based high order fuzzy system (KHFIS) [13], which extended Leski's model to high order fuzzy systems. In all these TS fuzzy modeling methods, the antecedent fuzzy sets were estimated by fuzzy clustering, and they were only conceived for Gaussian membership functions.

The learning of SVMs can be very costly in terms of time and memory consumption, especially on large datasets, as we have to deal with a large kernel matrix. In some cases the data cannot be collected in advance; they come sequentially. To address these problems, incremental SVMs were proposed [23–26]. Juang proposed an incremental SVM-trained TS-type fuzzy classifier (ISVMFC) [27]. That study was the first design of a fuzzy classifier using an incremental SVM and can be applied to online classification problems. A fuzzy modeling method via online SVM (FSVM) was proposed in [28]. The structure identification was performed by an online SVM; fuzzy rules were then extracted and the membership functions were updated. An incremental reduced support vector machine (IRSVM) was proposed in [29]. It combined incremental learning with the reduced support vector machine (RSVM). The RSVM, proposed by Lee [30,31], selects a small random portion of the data to generate a reduced kernel. This reduced kernel technique has been applied to ε-SSVR [3]. Instead of purely random selection, IRSVM selects representative samples incrementally from the dataset in forming the reduced kernel. IRSVM achieved accuracy comparable to RSVM with a smaller number of SVs.

In this paper, we first apply the concept of IRSVM to ε-SSVR and propose incremental smooth support vector regression (ISSVR). Then we establish a connection between ISSVR and TS fuzzy systems and propose an ISSVR-based TS fuzzy modeling method. TS-type fuzzy rules are automatically generated from the given training data based on the ISSVR learning. Since ε-SSVR puts no restrictions on the kernel, our model relaxes the positive definiteness requirement on membership functions; any arbitrary form of membership function can be used. Numerical results show that our model has good generalization ability with a small number of fuzzy rules.

A brief description of our notation is as follows. All vectors are column vectors unless transposed to row vectors by a prime superscript $'$. For two vectors $x$ and $y$ in $\mathbb{R}^n$, $x'y$ denotes their inner product. The $p$-norm of $x$ is denoted by $\|x\|_p$. For a vector $x \in \mathbb{R}^n$, the plus function $x_+$ is defined as $(x_+)_i = \max\{0, x_i\}$, and the ε-insensitive loss is $(|x|_\varepsilon)_i = \max\{0, |x_i| - \varepsilon\}$, $i = 1, \ldots, n$. For a matrix $A \in \mathbb{R}^{m \times n}$, $A_i$ denotes the $i$th row of $A$. A column vector of ones of arbitrary dimension is denoted by $\mathbf{1}$. A training dataset is $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$ is the $i$th sample and $y_i \in \mathbb{R}$ is the real-valued observation associated with $x_i$. For notational convenience, the training dataset is rearranged as an $m \times n$ matrix $A$ and a vector $y \in \mathbb{R}^m$. For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times l}$, the kernel $K(A, B)$ maps $\mathbb{R}^{m \times n} \times \mathbb{R}^{n \times l}$ into $\mathbb{R}^{m \times l}$. In particular, if $x$ and $y$ are column vectors in $\mathbb{R}^n$, then $K(x', y)$ is a real number, $K(A, x) = K(x', A')'$ is a column vector in $\mathbb{R}^m$, and $K(A, A')$ is an $m \times m$ matrix.

The rest of this paper is organized as follows. Section 2 introduces the incremental smooth support vector regression. In Section 3, we introduce TS fuzzy modeling based on the ε-insensitive learning. Section 4 describes our ISSVR-based TS fuzzy modeling method in detail. Section 5 presents experimental results and comparisons. Section 6 is the conclusion.

2. Incremental smooth support vector regression

2.1. Basic ε-SSVR concepts

We consider a given dataset $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, which consists of $m$ samples in $\mathbb{R}^n$, represented by $A \in \mathbb{R}^{m \times n}$, and $m$ real-valued observations associated with the samples. The goal of a regression problem is to find a function $f(x)$ that tolerates a small error in fitting all the data. By utilizing the ε-insensitive loss function [2], the tiny errors that fall within some tolerance $\varepsilon$ are disregarded. Following the idea of SVMs, the function $f(x)$ is at the same time made as flat as possible. We begin with the case of a linear function $f(x) = x'w + b$, where $w$ is the normal vector. The problem can be formulated as the following unconstrained minimization problem:

$$\min_{w,b} \; \frac{1}{2}\|w\|_2^2 + C\,\mathbf{1}'|\xi|_\varepsilon \qquad (1)$$

where $(|\xi|_\varepsilon)_i = \max\{0, |A_i w + b - y_i| - \varepsilon\}$, $i = 1, \ldots, m$, is the ε-insensitive loss and $C$ is a positive parameter controlling the tradeoff between the flatness of $f(x)$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. Conventionally, this problem is reformulated as a convex quadratic minimization problem called ε-insensitive support vector regression (ε-SVR), and Mercer kernels [11] are used to make the algorithm nonlinear.

The ε-SSVR [3] modifies the problem slightly and solves it directly as an unconstrained minimization problem. In ε-SSVR, the square of the 2-norm of the ε-insensitive loss is minimized with weight $C/2$ instead of the 1-norm of the ε-insensitive loss as in (1). In addition, $b^2/2$ is added to the objective function, which induces strong convexity and has little or no effect on the problem. These modifications lead to the following unconstrained minimization problem:

$$\min_{w,b} \; \frac{1}{2}\left(\|w\|_2^2 + b^2\right) + \frac{C}{2}\sum_{i=1}^{m} |A_i w + b - y_i|_\varepsilon^2 \qquad (2)$$

For all $x \in \mathbb{R}$ and $\varepsilon > 0$, we have $|x|_\varepsilon^2 = (x - \varepsilon)_+^2 + (-x - \varepsilon)_+^2$, where $x_+$ is the plus function. The following $p$-function provides a very accurate smooth approximation to $x_+$:

$$p(x, \alpha) = x + \frac{1}{\alpha}\log\left(1 + \exp(-\alpha x)\right) \qquad (3)$$

where $\alpha > 0$ is the smoothing parameter. Therefore, $|x|_\varepsilon^2$ can be accurately approximated by the following $p_\varepsilon^2$-function:

$$p_\varepsilon^2(x, \alpha) = \left(p(x - \varepsilon, \alpha)\right)^2 + \left(p(-x - \varepsilon, \alpha)\right)^2 \qquad (4)$$

Replacing the square of the ε-insensitive loss in (2) by this $p_\varepsilon^2$-function yields the following ε-SSVR formulation:

$$\min_{w,b} \; \frac{1}{2}\left(\|w\|_2^2 + b^2\right) + \frac{C}{2}\,\mathbf{1}' p_\varepsilon^2(Aw + \mathbf{1}b - y, \alpha) \qquad (5)$$

where $p_\varepsilon^2(Aw + \mathbf{1}b - y, \alpha)_i = p_\varepsilon^2(A_i w + b - y_i, \alpha)$, $i = 1, \ldots, m$. The objective function of this problem is strongly convex and infinitely differentiable. Therefore, the problem has a unique solution and can be solved using a fast Newton–Armijo method. The solution of (2) can be obtained by solving (5) with $\alpha$ approaching infinity [3].
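To make the smoothing step concrete, the following minimal sketch (a Python/NumPy illustration written for this rewrite, not the authors' code) evaluates the p-function (3) and the smooth ε-insensitive loss (4); the sample residuals and parameter values are arbitrary choices for the example.

    import numpy as np

    def p_func(x, alpha):
        # Smooth approximation of the plus function max(0, x); Eq. (3).
        # np.logaddexp(0, -alpha*x) = log(1 + exp(-alpha*x)), computed stably.
        return x + np.logaddexp(0.0, -alpha * x) / alpha

    def p2_eps(x, eps, alpha):
        # Smooth approximation of the squared eps-insensitive loss |x|_eps^2; Eq. (4).
        return p_func(x - eps, alpha) ** 2 + p_func(-x - eps, alpha) ** 2

    # As alpha grows, p2_eps approaches the exact squared eps-insensitive loss.
    r = np.linspace(-2.0, 2.0, 5)                       # residuals A_i w + b - y_i
    exact = np.maximum(0.0, np.abs(r) - 0.5) ** 2
    smooth = p2_eps(r, eps=0.5, alpha=50.0)
    print(np.max(np.abs(exact - smooth)))               # small for large alpha

Because the smooth objective is differentiable everywhere, standard second-order methods such as Newton–Armijo can be applied directly, which is the point of the ε-SSVR reformulation.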

For the nonlinear case, the duality theorem of convex minimization [32,33] and the kernel technique [1] are applied. The observation $y \in \mathbb{R}^m$ is approximated by a nonlinear function of the form $y \approx K(A, A')u + \mathbf{1}b$, where $K(A, A')$ is a nonlinear kernel with $K(A, A')_{ij} = K(A_i, A'_j)$. The regression parameter $u \in \mathbb{R}^m$ and the bias $b \in \mathbb{R}$ are determined by solving the following unconstrained minimization problem:

$$\min_{u,b} \; \frac{1}{2}\left(\|u\|_2^2 + b^2\right) + \frac{C}{2}\sum_{i=1}^{m} |K(A_i, A')u + b - y_i|_\varepsilon^2 \qquad (6)$$


Repeating the same arguments as above, in going from (2) to (5), gives the nonlinear ε-SSVR:

$$\min_{u,b} \; \frac{1}{2}\left(\|u\|_2^2 + b^2\right) + \frac{C}{2}\,\mathbf{1}' p_\varepsilon^2(K(A, A')u + \mathbf{1}b - y, \alpha) \qquad (7)$$

where $p_\varepsilon^2(K(A, A')u + \mathbf{1}b - y, \alpha)_i = p_\varepsilon^2(K(A_i, A')u + b - y_i, \alpha)$, $i = 1, \ldots, m$. It is worth noting that this problem retains the strong convexity and differentiability properties for any arbitrary kernel, and it can be solved by the Newton–Armijo method. Solving (7) yields the following regression function:

$$f(x) = \sum_{i=1}^{m} u_i K(x', A'_i) + b \qquad (8)$$

The samples with corresponding nonzero $u_i$'s are SVs. It is often desirable to have fewer SVs.

2.2. Reduced support vector machine

The SVM suffers from long computational time and large memory usage when nonlinear kernels are used on large datasets. The reduced support vector machine (RSVM) [30,31] was proposed to avoid these computational difficulties and to reduce the model complexity in generating a nonlinear separating surface for a massive dataset. The RSVM randomly selects a small subset $\bar{A}$ of size $\bar{m} \times n$, where $\bar{m} \ll m$, from the entire dataset $A$ to generate a much smaller rectangular kernel matrix $K(A, \bar{A}') \in \mathbb{R}^{m \times \bar{m}}$. This reduced kernel $K(A, \bar{A}')$ is used to replace the full dense square kernel $K(A, A')$. Based on the Nyström approximation [34,35], $K(A, A')$ can be approximated as

$$K(A, A') \approx K(A, \bar{A}')\,K(\bar{A}, \bar{A}')^{-1}\,K(\bar{A}, A') \qquad (9)$$

where $K(A, \bar{A}')$ is the reduced kernel. Applying the approximation to a vector $u \in \mathbb{R}^m$, we have

$$K(A, A')u \approx K(A, \bar{A}')\,K(\bar{A}, \bar{A}')^{-1}\,K(\bar{A}, A')u = K(A, \bar{A}')\bar{u} \qquad (10)$$

where $\bar{u} = K(\bar{A}, \bar{A}')^{-1}K(\bar{A}, A')u \in \mathbb{R}^{\bar{m}}$ is an approximation of $u$ via the reduced kernel technique. The RSVM reduces the computational cost without sacrificing the generalization ability. This reduced kernel technique has been successfully applied to ε-SSVR [3].
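As an illustration of the reduced kernel idea in (9)–(10), here is a small NumPy sketch (ours, not from the paper) that compares a full Gaussian kernel with its Nyström approximation built from a random subset; the subset size and kernel width are arbitrary choices for the example.

    import numpy as np

    def gaussian_kernel(X, Z, gamma):
        # K(X, Z')_{ij} = exp(-gamma * ||X_i - Z_j||^2)
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    A = rng.uniform(0.0, 10.0, size=(200, 2))          # full dataset
    idx = rng.choice(len(A), size=20, replace=False)
    A_bar = A[idx]                                      # reduced set (random here)

    K_full = gaussian_kernel(A, A, gamma=0.5)           # m x m
    K_red  = gaussian_kernel(A, A_bar, gamma=0.5)       # m x m_bar
    K_bar  = gaussian_kernel(A_bar, A_bar, gamma=0.5)   # m_bar x m_bar

    # Nystrom approximation of the full kernel, Eq. (9).
    K_approx = K_red @ np.linalg.pinv(K_bar) @ K_red.T
    print(np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full))

In practice only the rectangular kernel K_red is formed and stored, which is what makes the reduced formulation attractive for large m.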

The ε-SSVR with a reduced kernel $K(A, \bar{A}')$ is formulated as the following approximate unconstrained minimization problem:

$$\min_{\bar{u},b} \; \frac{1}{2}\left(\|\bar{u}\|_2^2 + b^2\right) + \frac{C}{2}\,\mathbf{1}' p_\varepsilon^2(K(A, \bar{A}')\bar{u} + \mathbf{1}b - y, \alpha) \qquad (11)$$

where $p_\varepsilon^2(K(A, \bar{A}')\bar{u} + \mathbf{1}b - y, \alpha)_i = p_\varepsilon^2(K(A_i, \bar{A}')\bar{u} + b - y_i, \alpha)$, $i = 1, \ldots, m$. The reduced problem (11) with a suitable $C$ value can be regarded as an approximation to the full problem (7) [30]. This reduced problem can also be solved using the Newton–Armijo method. The solution leads to the following regression function:

$$f(x) = \sum_{i=1}^{\bar{m}} \bar{u}_i K(x', \bar{A}'_i) + b \qquad (12)$$

The reduced kernel technique constructs a compressed model, in which the samples with corresponding nonzero $\bar{u}_i$'s play a role similar to that of SVs.

2.3. Incremental smooth support vector regression (ISSVR)

Instead of choosing the reduced set by purely random selection, the incremental reduced support vector machine (IRSVM) [29] generates the reduced set by selecting informative samples. In this paper, we apply the idea of IRSVM to ε-SSVR and propose incremental smooth support vector regression (ISSVR). The regression function (12) is a linear combination of a dictionary of kernel functions $\{1, K(\cdot, \bar{A}'_1), K(\cdot, \bar{A}'_2), \ldots, K(\cdot, \bar{A}'_{\bar{m}})\}$.


If the kernel functions in this dictionary are similar, the hypothesis space spanned by them will be very limited. Intuitively, when projected onto the regression surface, similar kernel functions are likely to be highly correlated, with possibly heavy overlaps. The ISSVR starts with a very small subset $\bar{A}$ and then adds a new sample $A_i$ into the reduced set only when the extra information carried in the vector $K(\bar{A}, A'_i)$ with respect to the column space of $K(\bar{A}, \bar{A}')$ is greater than a certain threshold. The amount of extra information is measured by computing the distance from $K(\bar{A}, A'_i)$ to the column space of $K(\bar{A}, \bar{A}')$:

$$r = \left\| \bar{K}(\bar{K}'\bar{K})^{-1}\bar{K}'\,K(\bar{A}, A'_i) - K(\bar{A}, A'_i) \right\|_2 \qquad (13)$$

where we let $\bar{K} = K(\bar{A}, \bar{A}')$ for convenience. The squared distance can be written as $r^2 = \|(I - P)K(\bar{A}, A'_i)\|_2^2$, where $P = \bar{K}(\bar{K}'\bar{K})^{-1}\bar{K}'$ is the projection matrix of $\mathbb{R}^{\bar{m}}$ onto the column space of $\bar{K}$. Thus, $r^2$ measures the excess information carried in $K(\bar{A}, A'_i)$ over $K(\bar{A}, \bar{A}')$. The ISSVR algorithm is described as follows.

Algorithm 2.1. Incremental smooth support vector regression (ISSVR). Let $\delta > 0$ be a given threshold and $q$ be a given batch size. $B_i \in \mathbb{R}^{q \times n}$ denotes a data batch and $r$ denotes a distance vector.

(1) Select a very small random subset $\bar{A}_0 \in \mathbb{R}^{\bar{m} \times n}$, typically $\bar{m} = 2$, from the training dataset $A \in \mathbb{R}^{m \times n}$ as the initial reduced set, and generate the initial reduced kernel $K(A, \bar{A}'_0)$. Let $\bar{A} = \bar{A}_0$.
(2) For $A_j, \ldots, A_{j+q-1} \in A \setminus \bar{A}_0$, form a data batch $B_i$.
(3) For each $B_i$, compute the distance vector $r$, which consists of the individual distances from each of the columns of $K(\bar{A}, B'_i)$ to the column space of $K(\bar{A}, \bar{A}')$, using (13).
(4) For each $A_j \in B_i$, let $\bar{A} = \bar{A} \cup A_j$ if the corresponding distance value is larger than $\delta$.
(5) Repeat (3) and (4) to obtain the final reduced set $\bar{A} \in \mathbb{R}^{\bar{m} \times n}$ and the final reduced kernel $K(A, \bar{A}')$.
(6) Solve the reduced ε-SSVR problem defined by (11), where the reduced kernel is obtained in step (5), and obtain the regression function (12) (details can be found in [3]).

The ISSVR automatically generates a reduced set by selecting representative samples rather than by purely random selection. The algorithm combines the mathematical properties of ε-SSVR, such as strong convexity and infinite differentiability, with the advantages of RSVM for dealing with large-scale problems. Taking advantage of the ε-SSVR formulation, we only need to solve a system of linear equations iteratively instead of solving a conventional convex quadratic program. Since the size of the reduced set is very small, computing the distance $r$ in (13) does not lead to any computational difficulty, although it must be computed many times during the whole process. The incremental strategy also leads to strong sparsity. These advantages make ISSVR a competent learning approach for fuzzy modeling.
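The incremental selection in steps (2)–(4) can be pictured with the following short sketch (our own illustration, assuming a Gaussian kernel and arbitrary values for δ and q; it is not the authors' implementation).

    import numpy as np

    def gaussian_kernel(X, Z, gamma=0.5):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)

    def select_reduced_set(A, delta=0.3, q=10, gamma=0.5, seed=0):
        rng = np.random.default_rng(seed)
        start = rng.choice(len(A), size=2, replace=False)    # step (1): tiny initial set
        A_bar = A[start]
        rest = np.delete(np.arange(len(A)), start)
        for b in range(0, len(rest), q):                      # step (2): data batches
            batch = A[rest[b:b + q]]
            K_bar = gaussian_kernel(A_bar, A_bar, gamma)      # current reduced kernel
            cols = gaussian_kernel(A_bar, batch, gamma)       # K(A_bar, B_i'), one column per sample
            # step (3): distance of each column to the column space of K_bar, Eq. (13)
            P = K_bar @ np.linalg.pinv(K_bar.T @ K_bar) @ K_bar.T
            r = np.linalg.norm(cols - P @ cols, axis=0)
            # step (4): keep only sufficiently informative samples
            A_bar = np.vstack([A_bar, batch[r > delta]])
        return A_bar

    A = np.random.default_rng(1).uniform(0, 10, size=(200, 2))
    print(select_reduced_set(A).shape)   # reduced set is much smaller than 200 x 2

The threshold δ trades accuracy for sparsity: a larger δ admits fewer columns into the reduced set and therefore yields fewer rules later on.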

3. TS fuzzy modeling based on the ε-insensitive learning

A TS fuzzy system consisting of $l$ rules has the following structure:

Rule $j$: IF $x_1$ is $A_{1j}$ and $x_2$ is $A_{2j}$ and … and $x_n$ is $A_{nj}$ THEN $y_j = p'_j x$

where $x_k$ is the $k$th input variable, $x = [x_1, \ldots, x_n]'$ is the input vector, $A_{kj}$ is a fuzzy set associated with the membership function $a_{kj}: \mathbb{R} \to [0, 1]$, $j = 1, \ldots, l$, $k = 1, \ldots, n$, and $p_j = [p_{1j}, \ldots, p_{nj}]' \in \mathbb{R}^n$ contains the consequent parameters of the $j$th rule. The firing strength $M_j(x)$ of the $j$th rule is $M_j(x) = \prod_{k=1}^{n} a_{kj}(x_k)$. If we use the simple weighted sum method [36], then the output of the TS fuzzy system is

$$y(x) = \sum_{j=1}^{l} \left[ M_j(x) \cdot (x' p_j) \right] = G(x')' p \qquad (14)$$

where $p = [p'_1, \ldots, p'_l]' \in \mathbb{R}^{nl}$ and $G(x') = [M_1(x)x', \ldots, M_l(x)x']' \in \mathbb{R}^{nl}$. Given a training dataset, we seek a linear regression function in the form of (14). Leski [20] incorporated the idea of ε-insensitive learning into fuzzy modeling to obtain fuzzy models tolerant to imprecision. Based on the SRM principle, the following minimization problem was proposed:

$$\min_{p} \; \frac{\tau}{2}\|p\|_2^2 + \sum_{i=1}^{m} |G(A_i)' p - y_i|_\varepsilon \qquad (15)$$

where $|G(A_i)'p - y_i|_\varepsilon = \max\{0, |G(A_i)'p - y_i| - \varepsilon\}$ is the ε-insensitive loss. The first term in (15) corresponds to minimization of the Vapnik–Chervonenkis dimension (model complexity), and $\tau > 0$ is a constant controlling the tradeoff between the model complexity and the amount up to which errors are tolerated. It has been shown that formulation (15) is equivalent to Vapnik's ε-SVR under certain conditions [13]. The SRM principle guarantees the good generalization ability of the resulting fuzzy model. To solve problem (15), Leski proposed two approaches: one leads to a constrained quadratic programming problem, and the other leads to a problem of solving a system of linear inequalities. We could also convert the problem to the conventional ε-SVR and solve the quadratic program of ε-SVR. In this paper, we instead apply the idea of ε-SSVR and solve the problem directly as an unconstrained minimization problem.
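For readers who prefer code, the following small sketch (our illustration, not from the paper) evaluates the TS output (14) for rules whose membership functions are location-shifted copies of a Gaussian reference function; the rule locations and consequent parameters are made up for the example.

    import numpy as np

    def ts_output(x, Z, P, ref=lambda t: np.exp(-t ** 2)):
        # x: input vector (n,); Z: rule locations (l, n); P: consequent parameters (l, n).
        # Firing strength M_j(x) = prod_k a_k(x_k - z_kj); rule output x' p_j; Eq. (14).
        M = np.prod(ref(x[None, :] - Z), axis=1)        # (l,)
        return float(np.sum(M * (P @ x)))               # sum_j M_j(x) * (x' p_j)

    Z = np.array([[1.0, 2.0], [3.0, 0.5]])              # hypothetical rule locations
    P = np.array([[0.2, -0.1], [0.05, 0.3]])            # hypothetical consequent parameters
    print(ts_output(np.array([1.5, 1.0]), Z, P))

The weighted-sum inference (14) omits the usual normalization by the sum of firing strengths, which is what makes the output linear in p and lets the problem be cast as a regression.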

4. TS fuzzy modeling based on the ISSVR

In this section, we first show that the ε-insensitive fuzzy modeling problem (15) can be regarded as an ε-SVR under certain assumptions. Then we convert the problem to an ε-SSVR problem and solve it directly as an unconstrained minimization problem. The advantages of relating fuzzy modeling to ε-SSVR are described. After that, we present our ISSVR-based TS fuzzy modeling method.

4.1. TS fuzzy modeling based on the ε-SVR

First, we add the following rule to the original TS fuzzy system to introduce a bias term:

Rule 0: IF $x_1$ is $A_{10}$ and $x_2$ is $A_{20}$ and … and $x_n$ is $A_{n0}$ THEN $y_0 = b_0$

where the membership functions $a_{k0}(x_k) \equiv 1$ for $k = 1, \ldots, n$ and any $x_k \in \mathbb{R}$. Then the output of the TS fuzzy system becomes

$$y(x) = G(x')' p + b_0 \qquad (16)$$

We assume that all membership functions associated with the same input variable are generated from location transformation of a reference function [4–6], i.e., $a_{kj}(x_k) = a_k(x_k - z_{kj})$, where $a_k$ is the reference function for the $k$th input variable and $z_{kj} \in \mathbb{R}$ is the location parameter of $a_{kj}$, $k = 1, \ldots, n$. If we set

$$p_j = v_j [z_{1j}, \ldots, z_{nj}]' = v_j z_j \qquad (17)$$

where $z_j = [z_{1j}, \ldots, z_{nj}]' \in \mathbb{R}^n$, then (16) can be written as

$$y(x) = \sum_{j=1}^{l} v_j K(x', z_j) + b_0 = K(x', Z')v + b_0 \qquad (18)$$

where $v = [v_1, \ldots, v_l]' \in \mathbb{R}^l$, $Z \in \mathbb{R}^{l \times n}$ is the matrix whose rows are the $z_j$, $j = 1, \ldots, l$, and $K$ is a general kernel [32]. $K(x', z_j)$ is the kernel function defined as

$$K(x', z_j) = (x' z_j) \cdot \prod_{k=1}^{n} a_k(x_k - z_{kj}) \qquad (19)$$

The kernel $K$ implicitly defines a nonlinear mapping from $\mathbb{R}^n$ to some other space $\mathbb{R}^s$, where $s$ may be much larger than $n$. In particular, if $K$ is an admissible Mercer kernel, then it defines an inner product in the higher dimensional space and the regression model becomes

$$y(x) = \varphi(x)' \varphi(Z') v + b_0 \qquad (20)$$

where $\varphi$ is a nonlinear function from $\mathbb{R}^n$ to $\mathbb{R}^s$ and $\varphi(Z') \in \mathbb{R}^{s \times l}$ results from applying $\varphi$ to the columns of $Z'$. The regression function (20) is linear in the high dimensional space $\mathbb{R}^s$, since $y(x) = \varphi(x)'w + b_0$ with $w = \varphi(Z')v$. We can convert the unconstrained minimization problem (15) to the ε-SVR as follows:

$$\begin{aligned} \min_{w,\, b_0,\, \xi,\, \xi^*} \;\; & \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \\ \text{s.t.} \;\; & y_i - \varphi(x_i)'w - b_0 \le \varepsilon + \xi_i \\ & \varphi(x_i)'w + b_0 - y_i \le \varepsilon + \xi_i^* \\ & \xi_i,\, \xi_i^* \ge 0 \end{aligned} \qquad (21)$$

where $\xi_i$ and $\xi_i^*$ are slack variables, one for exceeding the target value by more than $\varepsilon$ and the other for being more than $\varepsilon$ below the target, and $C = \tau^{-1}$. This problem can be solved in its dual formulation [2]. In the solution of (21), $w$ is given by $w = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*)\varphi(x_i)$, where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers; the $x_i$'s with corresponding nonzero $(\alpha_i - \alpha_i^*)$'s are SVs. If we let $v_j = \alpha_i - \alpha_i^*$ and $z_j = x_i$ for those $i \in \{1, \ldots, m\}$ with $\alpha_i - \alpha_i^* \ne 0$, then a TS fuzzy system can be generated. Each fuzzy rule is parameterized by an SV $x_i$ and the associated $(\alpha_i - \alpha_i^*)$, where $x_i$ specifies the location of the membership functions and $(\alpha_i - \alpha_i^*)x_i$ gives the consequent parameters. The number of fuzzy rules equals the number of SVs and hence is irrelevant to the dimension of the input space; in this sense, we avoid the "curse of dimensionality". However, as the kernel function (19) is constructed from the membership functions, the form of the membership functions is restricted by the Mercer condition.

4.2. TS fuzzy modeling based on the ε-SSVR

The kernel function $K(x', z_j)$ defined by (19) can be regarded as the product of two kernels, i.e.,

$$K(x', z_j) = K_1(x', z_j)\,K_2(x', z_j) \qquad (22)$$

where $K_1(x', z_j) = x' z_j$ is a linear kernel and

$$K_2(x', z_j) = \prod_{k=1}^{n} a_k(x_k - z_{kj}) \qquad (23)$$

is a translation invariant kernel. The following theorem can be used to check whether a translation invariant kernel is an admissible Mercer kernel.

Theorem 4.1 (Mercer condition for translation invariant kernels, Smola et al. [37]). A translation invariant kernel $K(x, z) = K(x - z)$ is an admissible Mercer kernel if and only if the Fourier transform

$$F[K](\omega) = (2\pi)^{-n/2} \int_{\mathbb{R}^n} K(x) \exp(-i\omega' x)\, dx \qquad (24)$$

is nonnegative.

According to [4,5], a function $\mu: \mathbb{R} \to \mathbb{R}$ is a positive definite function if and only if its Fourier transform

$$F[\mu](\omega) = (2\pi)^{-1/2} \int_{-\infty}^{\infty} \mu(x) \exp(-i\omega x)\, dx \qquad (25)$$

is nonnegative. An obvious conclusion is that the translation invariant kernel $K_2(x', z_j)$ in (23) is an admissible Mercer kernel if the reference functions $a_k$, $k = 1, \ldots, n$, are positive definite functions. Furthermore, by the following theorem, the kernel $K(x', z_j)$ defined by (19) is an admissible Mercer kernel if the reference functions $a_k$, $k = 1, \ldots, n$, are positive definite functions.

Theorem 4.2 (Products of kernels, Schölkopf and Smola [38]). If $K_1$ and $K_2$ are admissible Mercer kernels, then $K(x, z) := K_1(x, z)\,K_2(x, z)$ is an admissible Mercer kernel.

The positive definiteness requirement on reference functions is restrictive. Listed below are some commonly used reference functions and their Fourier transforms.

1. Symmetric triangle: $\mu(x) = \max(1 - d|x|, 0)$, $d > 0$;
$F[\mu](\omega) = \dfrac{4d}{\sqrt{2\pi}\,\omega^2} \sin^2\!\left(\dfrac{\omega}{2d}\right)$

2. Gaussian: $\mu(x) = \exp(-d x^2)$, $d > 0$;
$F[\mu](\omega) = \dfrac{1}{\sqrt{2d}} \exp\!\left(-\dfrac{\omega^2}{4d}\right)$

3. Cauchy: $\mu(x) = \dfrac{1}{1 + d x^2}$, $d > 0$;
$F[\mu](\omega) = \sqrt{\dfrac{\pi}{2d}} \exp\!\left(-\dfrac{|\omega|}{\sqrt{d}}\right)$

4. Laplace: $\mu(x) = \exp(-d|x|)$, $d > 0$;
$F[\mu](\omega) = \sqrt{\dfrac{2}{\pi}}\,\dfrac{d}{d^2 + \omega^2}$

5. Asymmetric triangle: $\mu(x) = 1 + d_1 x$ for $-1 < d_1 x < 0$; $\mu(x) = 1 - d_2 x$ for $0 \le d_2 x < 1$; $\mu(x) = 0$ otherwise, with $d_1, d_2 > 0$, $d_1 \ne d_2$;
$F[\mu](\omega) = \dfrac{d_1 + d_2 - d_1\exp(i\omega/d_1) - d_2\exp(-i\omega/d_2)}{\sqrt{2\pi}\,\omega^2}$

6. Trapezoid: $\mu(x) = 1 + d(x + a)$ for $-\tfrac{1}{d} \le x + a < 0$; $\mu(x) = 1$ for $-a \le x \le a$; $\mu(x) = 1 - d(x - a)$ for $0 < x - a \le \tfrac{1}{d}$; $\mu(x) = 0$ otherwise, with $d > 0$;
$F[\mu](\omega) = \dfrac{\sqrt{2}\,d}{\sqrt{\pi}\,\omega^2}\left[\cos(\omega a) - \cos\!\left(\omega\left(a + \dfrac{1}{d}\right)\right)\right]$

7. Quadratic: $\mu(x) = \max(1 - d x^2, 0)$, $d > 0$;
$F[\mu](\omega) = \dfrac{4d}{\sqrt{2\pi}\,\omega^3}\left[\sin\dfrac{\omega}{\sqrt{d}} - \dfrac{\omega}{\sqrt{d}}\cos\dfrac{\omega}{\sqrt{d}}\right]$

8. Square window: $\mu(x) = 1$ for $-d \le x \le d$; $\mu(x) = 0$ otherwise, with $d > 0$;
$F[\mu](\omega) = \sqrt{\dfrac{2}{\pi}}\,\dfrac{\sin(d\omega)}{\omega}$

The Asymmetric triangle, Trapezoid, Quadratic and Square window reference functions are not positive definite functions. The Gaussian reference function corresponds to the Gaussian kernel. The translation invariant kernel $K_2(x', z_j)$ in (23) constructed from non-positive definite reference functions does not, in general, have a nonnegative Fourier transform, so $K(x', z_j)$ in (19) is not an admissible Mercer kernel. In that case, the connection between TS fuzzy modeling and ε-SVR established in Section 4.1 no longer exists. However, we can link the TS fuzzy model to ε-SSVR. Since the ε-SSVR formulation puts no restrictions on the kernel, any arbitrary reference function can be used.

Given a training dataset, the observation $y \in \mathbb{R}^m$ is approximated by (18) as $y \approx K(A, Z')v + \mathbf{1}b_0$. If we let $l = m$ and $Z = A$, then we have the following unconstrained minimization problem:

$$\min_{v,\, b_0} \; \frac{\tau}{2}\|v\|_2^2 + \sum_{i=1}^{m} |K(A_i, A')v + b_0 - y_i|_\varepsilon \qquad (26)$$

Following ε-SSVR, we can slightly modify this problem and solve it directly as an unconstrained minimization problem by using a smoothing strategy. The following unconstrained minimization problem can be defined:

$$\min_{v,\, b_0} \; \frac{1}{2}\left(\|v\|_2^2 + b_0^2\right) + \frac{C}{2}\,\mathbf{1}' p_\varepsilon^2(K(A, A')v + \mathbf{1}b_0 - y, \alpha) \qquad (27)$$

where $p_\varepsilon^2(K(A, A')v + \mathbf{1}b_0 - y, \alpha)_i = p_\varepsilon^2(K(A_i, A')v + b_0 - y_i, \alpha)$, $i = 1, \ldots, m$, and $C = 2/\tau$. This problem is strongly convex and infinitely differentiable for any arbitrary kernel $K$; it has a unique solution and is always solvable using the Newton–Armijo method. Therefore, the positive definiteness requirement on reference functions can be relaxed. Moreover, the definition of a reference function can also be relaxed.

Definition 4.3 (Reference function, Chen and Wang [5]). A function $\mu: \mathbb{R} \to [0, 1]$ is a reference function if and only if $\mu(0) = 1$.

The condition $\mu(x) = \mu(-x)$ in [5] is omitted to allow the use of asymmetric functions such as the Asymmetric triangle function, because no symmetry or positive semidefiniteness of $K(A, A')$ is required [32].

We assume the reference functions $a_k: \mathbb{R} \to [0, 1]$, $k = 1, \ldots, n$, are predetermined according to Definition 4.3. Given a training dataset, a kernel $K$ is constructed in the form of (19), and the optimization problem (27), which guarantees the generalization ability of the regression model, is defined. Since the ε-SSVR problem is always solvable, we can obtain the optimal $v$ and $b_0$ for the regression model (18). Once we have the regression model, a set of TS-type fuzzy rules can be easily generated. The number of fuzzy rules $l$ (excluding Rule 0) equals the number of nonzero $v_j$'s. The SVs determine the locations $\{z_1, \ldots, z_l\} \subset \mathbb{R}^n$ of the membership functions of each fuzzy rule, and $v_j z_j$, $j = 1, \ldots, l$, gives the consequent parameters of each fuzzy rule.

4.3. TS fuzzy modeling based on the ISSVR

In order to avoid the computational difficulties of dealing with a full dense kernel matrix and to reduce the number of fuzzy rules, we apply ISSVR to our fuzzy modeling method. In finding a fuzzy regression model (18), the ISSVR solves the following approximate unconstrained minimization problem:

$$\min_{\bar{v},\, b_0} \; \frac{1}{2}\left(\|\bar{v}\|_2^2 + b_0^2\right) + \frac{C}{2}\,\mathbf{1}' p_\varepsilon^2(K(A, \bar{A}')\bar{v} + \mathbf{1}b_0 - y, \alpha) \qquad (28)$$

where $p_\varepsilon^2(K(A, \bar{A}')\bar{v} + \mathbf{1}b_0 - y, \alpha)_i = p_\varepsilon^2(K(A_i, \bar{A}')\bar{v} + b_0 - y_i, \alpha)$, $i = 1, \ldots, m$. The reduced set $\bar{A} \in \mathbb{R}^{\bar{m} \times n}$ is determined by selecting representative samples using the incremental approach. All of the results of the previous sections still hold for this approximate problem. The following algorithm describes the procedure of the ISSVR-based TS fuzzy modeling method.

Algorithm 4.4. TS fuzzy modeling based on the ISSVR.

Inputs: $n$ reference functions $a_k(x_k)$, $k = 1, \ldots, n$, defined according to Definition 4.3, and a training dataset $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ represented by $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$.


Outputs: A set of TS-type fuzzy rules parameterized by $z_j$, $p_j$, $b_0$ and $l$. Here $z_j$, $j = 1, \ldots, l$, is the location parameter vector associated with the membership functions of the $j$th fuzzy rule; $p_j$, $j = 1, \ldots, l$, is the TS-type consequent parameter vector of the $j$th fuzzy rule; $b_0$ is the consequent constant of Rule 0; and $l + 1$ is the number of fuzzy rules.

Steps:
(1) Construct a kernel $K$ from the given reference functions according to (19).
(2) Solve the ε-SSVR problem defined by (28) using Algorithm 2.1 to get $\bar{v}$ and $b_0$.
(3) Extract fuzzy rules from the regression model:

    j ← 1
    for i = 1 to m̄
        if v̄_i ≠ 0
            z_j ← Ā'_i
            p_j ← v̄_i Ā'_i
            j ← j + 1
        end if
    end for
    l ← j − 1

The algorithm automatically generates a set of TS-type fuzzy rules directly from the given training data. Each fuzzy rule is parameterized by a training sample $(x_i, y_i)$ and the associated nonzero regression parameter $\bar{v}_i$, where $x_i$ specifies the location of the membership functions and $\bar{v}_i x_i$ contains the consequent parameters. The number of fuzzy rules is determined by the size of the reduced set. The incremental approach usually generates a much smaller reduced set than purely random selection at comparable accuracy. The mathematical properties of the formulation, such as strong convexity and infinite differentiability, establish the foundation for good generalization performance. Therefore, the resulting TS fuzzy system has good generalization ability with a small number of fuzzy rules.
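To illustrate how a rule base could be read off once the reduced set and coefficients are available, here is a compact sketch (our own illustration, assuming Gaussian reference functions; the coefficient vector below is a made-up placeholder rather than the output of the Newton–Armijo routine of ε-SSVR).

    import numpy as np

    def ts_kernel(X, Z, d=0.5):
        # Kernel of Eq. (19): K(x', z_j) = (x' z_j) * prod_k a_k(x_k - z_kj),
        # here with Gaussian reference functions a_k(t) = exp(-d t^2).
        lin = X @ Z.T                                              # (m, l) linear part
        memb = np.exp(-d * (X[:, None, :] - Z[None, :, :]) ** 2)   # (m, l, n)
        return lin * memb.prod(axis=2)

    def extract_rules(A_bar, v_bar, b0, tol=1e-8):
        # Step (3) of Algorithm 4.4: one rule per nonzero coefficient.
        rules = []
        for i, vi in enumerate(v_bar):
            if abs(vi) > tol:
                z_j = A_bar[i]          # location parameters of the membership functions
                p_j = vi * A_bar[i]     # TS-type consequent parameters
                rules.append((z_j, p_j))
        return rules, b0

    # Hypothetical reduced set and coefficients, just to exercise the functions.
    A_bar = np.array([[3.0, 5.4], [0.3, 6.3], [9.3, 7.2]])
    v_bar = np.array([-0.02, 0.0, 0.004])
    print(ts_kernel(A_bar, A_bar).shape)
    rules, b0 = extract_rules(A_bar, v_bar, b0=0.26)
    print(len(rules), "rules; first location:", rules[0][0])

The sketch makes the correspondence explicit: each reduced-set sample with a nonzero coefficient supplies both the antecedent locations and, after scaling by the coefficient, the TS consequent of one rule.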

5. Experimental results

We evaluated the performance of our model in terms of generalization, number of fuzzy rules and computational time using four examples: a function approximation problem, a chaotic time series prediction problem, a NARMAX model, and two real-world datasets. The following models were used and compared:

1. AOSVR [39]: accurate online SVR, using incremental SVR with a Gaussian kernel.
2. TSFS-SVR [21]: TS fuzzy system whose parameters are learned by a combination of fuzzy clustering and linear SVR, using Gaussian membership functions.
3. PDFS [4]: fuzzy system with singletons in the consequent, constructed from the SVs of ε-SVR, using positive definite reference functions.
4. Leski's model [20]: TS fuzzy system, using FCM to determine the antecedent parameters and a quadratic programming problem to determine the consequent parameters.
5. FSVM [28]: fuzzy modeling via online SVM, using TS-type consequents and Gaussian membership functions.
6. ε-SSVR [3]: ε-SSVR using the reduced kernel technique, where the reduced set is generated by purely random selection.

The source code of ε-SSVR can be downloaded at http://dmlab8.csie.ntust.edu.tw/ssvmtoolbox.html.

Models were designed for different values of $C$, $\varepsilon$ and $d$ (or $(d_1, d_2)$) of the reference functions, and the settings that achieved relatively good performance are reported for each example. The reference functions were chosen to be identical for different input variables. In the support vector learning, the parameter $C$ took values from $\{10, 100, 1000\}$ and the insensitivity value $\varepsilon$ took values from $\{0.1, 0.05, 0.01\}$. The parameter $d$ (or $(d_1, d_2)$) took values from $\{2^n: n = -10, \ldots, 10\}$. The parameter $\gamma$ of the Gaussian kernel $K(A_i, A'_j) = \exp(-\gamma\|A_i - A_j\|_2^2)$, $i, j = 1, \ldots, m$, took values from $\{2^n: n = -10, \ldots, 10\}$. All experiments were run on a personal computer with an Intel Core i3-540 CPU and 4 GB of memory. The reported computational time combines the training and testing time for one particular parameter setting. For the cross-validation, we report the average RMSE, number of fuzzy rules/SVs and computational time.

Example 5.1. The approximated function was

$$f(x_1, x_2) = \frac{(5 - x_2)^2}{3(5 - x_1)^2 + (5 - x_2)^2}, \qquad x_1, x_2 \in [0, 10] \qquad (29)$$
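As a concrete illustration (our own sketch, not the authors' code), the target (29) and uniformly sampled training and testing sets of 200 points each, matching the setup described next, can be generated as follows; the random seed is an arbitrary choice.

    import numpy as np

    def f(x1, x2):
        # Target function of Eq. (29).
        return (5 - x2) ** 2 / (3 * (5 - x1) ** 2 + (5 - x2) ** 2)

    rng = np.random.default_rng(42)
    X_train = rng.uniform(0.0, 10.0, size=(200, 2))    # 200 random training samples
    y_train = f(X_train[:, 0], X_train[:, 1])
    X_test = rng.uniform(0.0, 10.0, size=(200, 2))     # an independent test set of 200 samples
    y_test = f(X_test[:, 0], X_test[:, 1])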

This function was the same as that used in [4,21]. We selected 200 samples randomly from the input domain according to the uniform distribution as the training set; a different set of 200 samples was generated in the same way for testing. We chose the Gaussian reference function for the input variables $x_1$ and $x_2$. We performed five-fold cross-validation on the training set to determine the insensitivity value $\varepsilon$, the parameter $d$ of the reference function and the parameter $C$ of the support vector learning. The threshold $\delta$ was set to 0, i.e., we used the full kernel during the cross-validation. The cross-validation results for different values of $\varepsilon$, $C$ and $d$ are displayed in Fig. 1; for better illustration, the $d$ value in Fig. 1 ranges from $2^{-10}$ to $2^{3}$. As shown in Fig. 1, the generalization performance was affected by $\varepsilon$, $C$ and $d$. Generally, a smaller $\varepsilon$ led to better generalization performance. For a fixed $\varepsilon$, the generalization performance depended on the choices of $C$ and $d$; for different values of $C$, very similar generalization performance was obtained by picking a proper $d$ value.

Based on the optimal parameter set $(C, \varepsilon, d) = (100, 0.01, 0.125)$ obtained via the cross-validation, we tuned the threshold $\delta$. Fig. 2 displays the testing RMSE and the number of fuzzy rules for different values of $\delta$ on the testing set. According to Fig. 2, we selected the optimal threshold $\delta = 1.2$. The testing result of our model with $\delta = 1.2$ is shown in Fig. 3, where the outputs of our model are close to the real outputs. A TS fuzzy system consisting of 13 rules was extracted from the regression model. The location parameters of the membership functions and the consequent parameters (excluding Rule 0) are given in Table 1. The consequent of Rule 0 was the bias of the regression model, $b_0 = 0.2617$. The membership functions for the input variables $x_1$ and $x_2$ are displayed in Figs. 4 and 5, respectively, where $\mu^1(x_1)$ and $\mu^2(x_2)$ are the reference functions for $x_1$ and $x_2$, and $\mu^k_j(x_k)$ is the membership function of $A_{kj}$, $j = 1, \ldots, 12$, $k = 1, 2$. The membership functions $\mu^1_1(x_1), \ldots, \mu^1_{12}(x_1)$ belong to one location family generated by $\mu^1(x_1)$ (the thick line in Fig. 4); the membership functions $\mu^2_1(x_2), \ldots, \mu^2_{12}(x_2)$ belong to the other location family generated by $\mu^2(x_2)$ (the thick line in Fig. 5).

The testing results of the compared models are summarized in Table 2. Our model with the Gaussian reference function achieved comparable or better testing results than the other models. FSVM achieved a slightly better testing RMSE than our model, but this came at the cost of a larger number of fuzzy rules and longer computational time.


Fig. 1. Cross-validation RMSE for different values of ε, C and d using our model with Gaussian reference function in Example 5.1.

Fig. 2. Testing RMSE and number of fuzzy rules for different values of δ in Example 5.1.

Fig. 3. Testing result of our model with δ = 1.2 in Example 5.1.

The number of fuzzy rules of our model was much smaller than in the other models. AOSVR, which solves exactly the same problem as conventional SVR, had the largest number of SVs; this is a disadvantage of conventional SVR. The performance of our model was also evaluated using different reference functions, and the results were compared with PDFS.

Table 1. Location parameters and consequent parameters of the TS fuzzy system in Example 5.1.

Location parameter z_j    Consequent parameter p_j
(3.0, 5.4)                (−0.0623, −0.1122)
(0.3, 6.3)                (0, −0.0004)
(9.3, 7.2)                (0.0116, 0.009)
(4.2, 4.5)                (−0.1243, −0.1331)
(6.3, 6.0)                (−0.1302, −0.124)
(9.6, 9.6)                (−0.0064, −0.0064)
(1.8, 1.2)                (−0.1823, −0.1215)
(6.6, 7.2)                (0.013, 0.0142)
(3.9, 3.0)                (0.1678, 0.1291)
(2.7, 0)                  (0.2781, 0)
(4.8, 6.0)                (0.1935, 0.2418)
(5.1, 9.6)                (0.0178, 0.0336)

As shown in Table 3, only positive definite reference functions can be used in PDFS. Our model outperformed PDFS in generalization, number of fuzzy rules and computational time for all the selected reference functions. As a conventional SVR-based method, PDFS generated many more fuzzy rules than our model.


We can also conclude that using non-positive definite reference functions did not degrade the performance of our model. The Trapezoid reference function, which is non-positive definite, achieved the best generalization performance with the fewest fuzzy rules.

Example 5.2. The Mackey–Glass chaotic time series was generated by the following equation:

$$\frac{dx(t)}{dt} = \frac{0.2\,x(t - 30)}{1 + x^{10}(t - 30)} - 0.1\,x(t) \qquad (34)$$

We set the time step $\Delta = 1$, $\mathbf{x} = (x(t-8), x(t-7), \ldots, x(t))$ and $y = x(t+1)$. The 200 points from $t = 501$ to $700$ served as training data, and the next 300 points from $t = 701$ to $1000$ were used as testing data. This example was the same as that used in [21]. Table 4 summarizes the testing results of our model using the Gaussian reference function in comparison with the other models. All the methods achieved similar testing RMSE. AOSVR again generated the most SVs, while our model and TSFS-SVR had the fewest fuzzy rules; the other models also had relatively small numbers of fuzzy rules.

Based on this example, the robustness of the methods was tested. A noisy training set was generated by adding white Gaussian noise with mean zero and standard deviation 0.25 to the output of the clean training data. The results for the noisy training set are shown in Table 5. The number of fuzzy rules of AOSVR, TSFS-SVR and Leski's model increased sharply when noise was added to the training data, while ε-SSVR and our model were much more robust to noise. The testing RMSE and the number of fuzzy rules of our model are determined by the threshold $\delta$ in the ISSVR algorithm.
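The series can be reproduced, for example, by a simple Euler integration of (34) (our sketch; the initial condition and the constant history before t = 0 are assumptions, not stated choices of the paper beyond Δ = 1).

    import numpy as np

    def mackey_glass(n_steps=1200, tau=30, dt=1.0, x0=1.2):
        # Euler integration of Eq. (34) with delay tau; history before t=0 held at x0.
        x = np.full(n_steps + tau, x0)
        for t in range(tau, n_steps + tau - 1):
            dx = 0.2 * x[t - tau] / (1.0 + x[t - tau] ** 10) - 0.1 * x[t]
            x[t + 1] = x[t] + dt * dx
        return x[tau:]

    series = mackey_glass()
    # Regression samples: x = (x(t-8), ..., x(t)), y = x(t+1); training from t = 501..700.
    X = np.array([series[t - 8:t + 1] for t in range(501, 701)])
    y = np.array([series[t + 1] for t in range(501, 701)])
    print(X.shape, y.shape)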

Fig. 4. Membership functions for input x1 of our fuzzy model in Example 5.1.

Table 4. Testing results of different models using the clean training set in Example 5.2.

Model           Testing RMSE    No. fuzzy rules/SVs    Computational time (s)
AOSVR           0.0132          105                    0.636
TSFS-SVR        0.0128          9                      0.397
Leski's model   0.0128          10                     0.433
FSVM            0.0122          12                     0.784
ε-SSVR          0.0122          21                     0.346
Our model       0.0123          9                      0.425

Fig. 5. Membership functions for input x2 of our fuzzy model in Example 5.1.

Table 2. Testing results of different methods in Example 5.1.

Model           Testing RMSE    No. fuzzy rules/SVs    Computational time (s)
AOSVR           0.0654          95                     0.419
TSFS-SVR        0.0637          46                     0.183
Leski's model   0.0628          38                     0.264
FSVM            0.0622          18                     0.523
ε-SSVR          0.0620          31                     0.140
Our model       0.0624          13                     0.212

Table 5. Testing results of different models using the noisy training set in Example 5.2.

Model           Testing RMSE    No. fuzzy rules/SVs    Computational time (s)
AOSVR           0.0527          173                    0.668
TSFS-SVR        0.0483          48                     0.594
Leski's model   0.0508          45                     0.590
FSVM            0.0434          15                     0.829
ε-SSVR          0.0423          21                     0.361
Our model       0.0426          11                     0.447

Table 3. Testing results of our model and PDFS using different reference functions in Example 5.1.

Reference function   PDFS: RMSE / rules / time (s)    Our model: RMSE / rules / time (s)
S-triangle           0.0648 / 89 / 0.379              0.0633 / 14 / 0.208
Gaussian             0.0643 / 93 / 0.395              0.0624 / 13 / 0.212
Cauchy               0.0639 / 92 / 0.387              0.0628 / 13 / 0.202
Laplace              0.0655 / 93 / 0.392              0.0627 / 13 / 0.210
A-triangle           – / – / –                        0.0621 / 12 / 0.217
Trapezoid            – / – / –                        0.0628 / 14 / 0.220
Quadratic            – / – / –                        0.0625 / 13 / 0.201
Square window        – / – / –                        0.0630 / 12 / 0.206


Therefore, our model was more robust to noise than the clustering/SVR-based models. In those models, the number of fuzzy rules, which equals the number of clusters/SVs, increased as noise was added, and the models learned more of the noise. FSVM is based on the recursive kernel method; it collects a dictionary and adds a new sample into the dictionary only when the sample is not linearly dependent on the dictionary vectors, so FSVM also has a small number of fuzzy rules and is robust to noise. The testing results of our model and PDFS using different reference functions on both the clean and the noisy training sets are displayed in Fig. 6.

Example 5.3. In this example, the NARMAX model was used; it is one of the most popular models in the neural and fuzzy literature [28]. The model is

$$y(k) = \frac{y(k-1)\,y(k-2)\,y(k-3)\,u(k-2)\,[y(k-3) - 1] + u(k-1)}{1 + y(k-1)^2 + y(k-2)^2} \qquad (35)$$

where $u(k)$ is the input signal and $y(k)$ is the output of the model. The input sample is

$$x(k) = [y(k-1),\; y(k-2),\; y(k-3),\; u(k-1),\; u(k-2)] \qquad (36)$$

The initial states are set to zero and the training input is

$$u(k) = \begin{cases} \sin\left(\frac{k\pi}{20}\right), & k < 200 \\ 1, & 200 \le k < 400 \\ -1, & 400 \le k < 600 \\ 0.3\sin\left(\frac{k\pi}{15}\right) + 0.1\sin\left(\frac{k\pi}{20}\right) + 0.6\sin\left(\frac{k\pi}{8.5}\right), & 600 \le k < 1000 \end{cases} \qquad (37)$$
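Generating the identification data amounts to simulating (35) under the input (37); a minimal sketch (ours, with names chosen only for illustration) is:

    import numpy as np

    def u_signal(k):
        # Training input of Eq. (37).
        if k < 200:
            return np.sin(k * np.pi / 20)
        if k < 400:
            return 1.0
        if k < 600:
            return -1.0
        return 0.3 * np.sin(k * np.pi / 15) + 0.1 * np.sin(k * np.pi / 20) + 0.6 * np.sin(k * np.pi / 8.5)

    def simulate(n=1000):
        # NARMAX plant of Eq. (35), started from zero initial states.
        y, u = np.zeros(n), np.array([u_signal(k) for k in range(n)])
        for k in range(3, n):
            num = y[k-1] * y[k-2] * y[k-3] * u[k-2] * (y[k-3] - 1) + u[k-1]
            y[k] = num / (1 + y[k-1] ** 2 + y[k-2] ** 2)
        return u, y

    u, y = simulate()
    # Input samples of Eq. (36): x(k) = [y(k-1), y(k-2), y(k-3), u(k-1), u(k-2)].
    X = np.array([[y[k-1], y[k-2], y[k-3], u[k-1], u[k-2]] for k in range(3, 1000)])
    print(X.shape, y[3:].shape)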

We selected 300 samples as the training set and ran simulations using different reference functions. The model was trained incrementally. Fig. 7 displays the training and testing results with the Trapezoid reference function, which had the best generalization performance; the training error decreased as the reduced set was incrementally expanded. The performance of the other methods was also evaluated and the best results are reported in Table 6. Our model had the smallest testing RMSE and had only one more rule than FSVM, which had the fewest rules; however, FSVM took much more time than our model. AOSVR and PDFS are both based on the conventional SVR and therefore generated more fuzzy rules than the clustering-based models and the incremental models. In this example, our model achieved results close to those of ε-SSVR with purely random selection. However, ε-SSVR, as well as PDFS, TSFS-SVR and Leski's model, is not an incremental approach and hence cannot be applied to online problems where samples arrive sequentially; those models were trained in one batch after all the samples had been collected.

Example 5.4. We ran numerical tests using two popular real-world datasets. The first is the Boston housing dataset, which contains 506 samples, each with 13 attributes. The second is the Auto MPG (miles per gallon) dataset, which contains 392 samples, each with 6 attributes. We tested our model via ten-fold cross-validation. The numerical results are shown in Table 7, together with the ten-fold results of the other models.

Fig. 6. Testing results of our model and PDFS using different reference functions in Example 5.2.

As shown in this table, the SVR-based models, i.e., AOSVR and PDFS, generated very large numbers of fuzzy rules on the real-world datasets. Combinations of SVR and clustering, i.e., TSFS-SVR and Leski's model, had better generalization performance than the SVR-based models, and also had smaller numbers of fuzzy rules because the number of clusters is smaller than the number of SVs. Our model is based on the SRM principle and selects a sample only when it is sufficiently representative. Therefore, our model had the best generalization ability with the smallest number of fuzzy rules among the compared models.

The experimental results demonstrate that our ISSVR-based method is very suitable for TS fuzzy modeling. Our model generated fewer fuzzy rules than ε-SSVR with a purely random reduced set at comparable error, because ISSVR selects only the informative samples. Compared with other modeling methods, our model achieved better testing RMSE with a much smaller number of fuzzy rules. FSVM also had relatively good generalization ability, but it is more complex.


Fig. 7. Modeling in training and testing phases in Example 5.3.

Table 6. Testing results of different models in Example 5.3.

Model           Testing RMSE    No. fuzzy rules/SVs    Computational time (s)
AOSVR           0.0807          92                     1.557
TSFS-SVR        0.0814          36                     0.880
PDFS            0.0795          93                     1.282
Leski's model   0.0823          39                     0.926
FSVM            0.0773          18                     1.617
ε-SSVR          0.0769          21                     0.453
Our model       0.0762          19                     0.664

Table 7. Ten-fold cross-validation results of different methods in Example 5.4 (testing RMSE / no. fuzzy rules / computational time in s).

Model           Boston housing            Auto MPG
AOSVR           0.1589 / 438 / 0.982      1.3852 / 94 / 0.734
TSFS-SVR        0.1567 / 89 / 0.837       1.4124 / 61 / 0.775
PDFS            0.1549 / 440 / 2.818      1.4056 / 88 / 1.942
Leski's model   0.1532 / 81 / 0.793       1.4323 / 69 / 0.836
FSVM            0.1355 / 25 / 1.074       1.2811 / 23 / 1.818
ε-SSVR          0.1351 / 46 / 0.404       1.3045 / 31 / 0.449
Our model       0.1340 / 25 / 0.648       1.2734 / 20 / 0.585

As a fuzzy modeling method, it is also important that more forms of membership functions can be used in our model.

6. Conclusion

This paper proposed a TS fuzzy modeling method based on incremental smooth support vector regression. Under certain assumptions on the membership functions and the consequent parameters, a criterion function for TS fuzzy modeling based on ε-insensitive learning was formulated. We converted the criterion function to the nonlinear ε-SSVR and solved it using an incremental approach; TS-type fuzzy rules were then extracted from the regression model. Compared with other modeling methods, more forms of membership functions can be used in our model. Experiments have shown that our model has good generalization ability with a small number of fuzzy rules, and that it is more robust to noise than the compared models. The setting of $p_j$ in (17) reduces the degrees of freedom of the consequent parameter space from $n$ per rule to 1; however, the consequent parameter space can be mapped from $\mathbb{R}^n$ to $\mathbb{R}^{nl}$ and a linear SVR constructed to determine the consequent parameters. As future work, we also plan to extend our method to online modeling.

Acknowledgment This work is supported by National Science Foundation of China under Grants 61025016, 61034008 and 61273161, and by “863 Program” of China under Grant 2011AA040605. The authors would like to thank Dr. Yixin Chen and Dr. Yuh-Jye Lee for providing the source code.


References

[1] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, USA, 1995.
[2] A. Smola, B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14 (3) (2004) 199–222.
[3] Y.J. Lee, W.F. Hsieh, C.M. Huang, ε-SSVR: a smooth support vector machine for ε-insensitive regression, IEEE Transactions on Knowledge and Data Engineering 17 (5) (2005) 678–685.
[4] Y. Chen, J.Z. Wang, Kernel machines and additive fuzzy systems: classification and function approximation, in: Proceedings of the 12th IEEE International Conference on Fuzzy Systems, St. Louis, MO, USA, 2003, pp. 789–795.
[5] Y. Chen, J.Z. Wang, Support vector learning for fuzzy rule-based classification systems, IEEE Transactions on Fuzzy Systems 11 (6) (2003) 716–728.
[6] D. Dubois, H. Prade, Operations on fuzzy numbers, International Journal of Systems Science 9 (6) (1978) 613–626.
[7] R.A. Horn, C.R. Johnson, Matrix Analysis, Cambridge University Press, 1985.
[8] J.H. Chiang, P.Y. Hao, Support vector learning mechanism for fuzzy rule-based modeling: a new approach, IEEE Transactions on Fuzzy Systems 12 (1) (2004) 1–11.
[9] L.X. Wang, J.M. Mendel, Fuzzy basis functions, universal approximation, and orthogonal least-squares learning, IEEE Transactions on Neural Networks 3 (1992) 807–814.
[10] C.T. Lin, S.F. Liang, C.M. Yeh, K.W. Fan, Fuzzy neural network design using support vector regression for function approximation with outliers, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Waikoloa, 2005, pp. 2763–2768.
[11] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philosophical Transactions of the Royal Society of London 209 (1909) 415–446.
[12] W. Li, Y. Yang, A new approach to TS fuzzy modeling using dual kernel-based learning machines, Neurocomputing 71 (16–18) (2008) 3660–3665.
[13] Q. Cai, Z. Hao, X. Yang, Gaussian kernel-based fuzzy inference systems for high dimensional regression, Neurocomputing 77 (1) (2012) 197–204.
[14] C.F. Juang, S.T. Huang, F.B. Duh, Mold temperature control of a rubber injection-molding machine by TSK-type recurrent neural fuzzy network, Neurocomputing 70 (1–3) (2006) 559–567.
[15] C.F. Juang, S.J. Shiu, Using self-organizing fuzzy network with support vector learning for face detection in color images, Neurocomputing 71 (16–18) (2008) 3409–3420.
[16] C.F. Juang, W.K. Sun, G.C. Chen, Object detection by color histogram-based fuzzy classifier with support vector learning, Neurocomputing 72 (10–12) (2009) 2464–2476.
[17] C.F. Juang, I.F. Chung, Recurrent fuzzy network design using hybrid evolutionary learning algorithms, Neurocomputing 70 (16–18) (2007) 3001–3010.
[18] C.F. Juang, C.D. Hsieh, A locally recurrent fuzzy neural network with support vector regression for dynamic system modeling, IEEE Transactions on Fuzzy Systems 18 (2) (2010) 261–273.
[19] C.F. Juang, C.D. Hsieh, A fuzzy system constructed by rule generation and iterative linear SVR for antecedent and consequent parameter optimization, IEEE Transactions on Fuzzy Systems 20 (2) (2012) 372–384.
[20] J.M. Leski, TSK-fuzzy modeling based on ε-insensitive learning, IEEE Transactions on Fuzzy Systems 13 (2) (2005) 181–193.
[21] C.F. Juang, C.D. Hsieh, TS-fuzzy system-based support vector regression, Fuzzy Sets and Systems 160 (2009) 2486–2504.
[22] C.F. Juang, S.H. Chiu, S.W. Chang, A self-organizing TS-type fuzzy network with support vector learning and its application to classification problems, IEEE Transactions on Fuzzy Systems 15 (5) (2007) 998–1008.
[23] G. Cauwenberghs, T. Poggio, Incremental and decremental support vector machine learning, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, Cambridge, MA, 2000, pp. 409–415.
[24] C.P. Diehl, G. Cauwenberghs, SVM incremental learning, adaptation and optimization, in: Proceedings of the International Joint Conference on Neural Networks, 2003, pp. 2685–2690.
[25] P. Laskov, C. Gehl, S. Krüger, K.R. Müller, Incremental support vector learning: analysis, implementation and applications, Machine Learning 7 (2006) 1909–1936.
[26] N.A. Syed, H. Liu, K.K. Sung, Incremental learning with support vector machines, in: Proceedings of the International Conference on Artificial Intelligence, 1999.
[27] W.Y. Cheng, C.F. Juang, An incremental support vector machine-trained TS-type fuzzy system for online classification problems, Fuzzy Sets and Systems 163 (1) (2011) 24–44.
[28] W. Yu, Fuzzy modeling via on-line support vector machines, International Journal of Systems Science 41 (11) (2010) 1325–1335.
[29] Y.J. Lee, H.Y. Lo, S.Y. Huang, Incremental reduced support vector machines, in: International Conference on Informatics, Cybernetics and Systems (ICICS), Kaohsiung, Taiwan, 2003.
[30] Y.J. Lee, O.L. Mangasarian, RSVM: reduced support vector machines, in: First SIAM International Conference on Data Mining, Chicago, 2001.
[31] Y.J. Lee, S.Y. Huang, Reduced support vector machines: a statistical theory, IEEE Transactions on Neural Networks 18 (1) (2007) 1–13.
[32] O.L. Mangasarian, Generalized support vector machines, in: A.J. Smola, P.L. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, USA, 2000, pp. 135–146.
[33] D.R. Musicant, A. Feinberg, Active set support vector regression, IEEE Transactions on Neural Networks 15 (2) (2004) 268–275.
[34] A.J. Smola, B. Schölkopf, Sparse greedy matrix approximation for machine learning, in: Proceedings of the 17th International Conference on Machine Learning, San Francisco, USA, 2000, pp. 911–918.
[35] C.K.I. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, Cambridge, USA, 2001, pp. 682–688.
[36] C.T. Leondes, Fuzzy Theory Systems: Techniques and Applications, vol. 4, Academic Press, New York, 1999.
[37] A.J. Smola, B. Schölkopf, K.R. Müller, The connection between regularization operators and support vector kernels, Neural Networks 11 (4) (1998) 637–649.
[38] B. Schölkopf, A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, USA, 2002.
[39] J. Ma, J. Theiler, S. Perkins, Accurate on-line support vector regression, Neural Computation 15 (2003) 2683–2703.

Rui Ji received the B.S. and M.S. degrees in control theory and control engineering from Shanghai Jiao Tong University, Shanghai, China. He is currently working toward the Ph.D. degree at the same university. His current research interests include fuzzy modeling, neural networks and machine learning.

Yupu Yang received the B.S. and M.S. degrees from University of Science and Technology of China, Hefei, China, and the Ph.D. degree in control theory and control engineering from Shanghai Jiao Tong University, Shanghai, China. He is currently a professor at the Department of Automation, Shanghai Jiao Tong University. His current research interests include computational intelligence, intelligent control theory and applications and intelligent information processing.

Weidong Zhang received the B.S., M.S. and Ph.D. degrees from Zhejiang University, Hangzhou, China. He is currently a professor at the Department of Automation, Shanghai Jiao Tong University, Shanghai, China. His current research interests include modeling and optimization, intelligent control and networked control systems.