Gaussian kernel-based fuzzy inference systems for high dimensional regression


Neurocomputing 77 (2012) 197–204


Qianfeng Cai, Zhifeng Hao, Xiaowei Yang

Faculty of Applied Mathematics, Guangdong University of Technology, No. 100 Waihuan Xi Road, Guangzhou, PR China
Faculty of Computer Science, Guangdong University of Technology, No. 100 Waihuan Xi Road, Guangzhou, PR China
School of Science, South China University of Technology, Wushan Road, Guangzhou, PR China

Article history: Received 6 November 2010; received in revised form 8 June 2011; accepted 5 September 2011; available online 18 September 2011. Communicated by M. Sato-Ilic.

Abstract

We propose a novel architecture for a higher order fuzzy inference system (FIS) and develop a learning algorithm to build the FIS. The consequent part of the proposed FIS is expressed as a nonlinear combination of the input variables, obtained by introducing an implicit mapping from the input space to a high dimensional feature space. The proposed learning algorithm consists of two phases. In the first phase, the antecedent fuzzy sets are estimated by kernel-based fuzzy c-means clustering. In the second phase, the consequent parameters are identified by a support vector machine whose kernel function is constructed from the fuzzy membership functions and the Gaussian kernel. The performance of the proposed model is verified through several numerical examples generally used in fuzzy modeling. Comparative analysis shows that, compared with the zero-order fuzzy model, the first-order fuzzy model, and the polynomial fuzzy model, the proposed model exhibits higher accuracy, better generalization performance, and satisfactory robustness.

Keywords: Fuzzy systems; Clustering methods; Approximation methods; Kernel methods; Support vector machines

1. Introduction

Fuzzy inference systems (FISs), as powerful tools for dealing with vague, uncertain, and complex information, have been studied for many years. At present, the zero-order FIS and the first-order FIS remain widely used in a number of applications, including automatic control, system identification, image processing, pattern classification, and data mining. Despite this wide application, only a few studies have addressed the higher order FIS (HFIS), whose consequent part is a nonlinear function of the input variables.

The main work on the HFIS has focused on the polynomial FIS (PFIS), where the consequent is a higher order polynomial. The antecedents of the PFIS are identified by the subtractive clustering method, whereas the consequent parameters are estimated by least squares estimation (LSE), as described previously [1]. Many researchers have studied the second-order PFIS, because it has the simplest construction among higher order PFISs. In the case of the second-order PFIS, the number of consequent parameters to be computed is (n+1)(n+2)/2 for each rule, where n is the number of input variables [2]. Hence, the number of parameters of a PFIS may increase drastically with the number of input variables, the order of the polynomial, and the number of fuzzy rules. To avoid this difficulty, Oh et al. [3] and Park et al. [4] first put forward several forms of the rule consequent, and subsequently used genetic algorithms (GA) to select the number of input variables and the order of the polynomial. However, their work is mainly applicable to the second-order PFIS. In addition, only minimal information can be found in the literature on much higher order PFISs or on HFISs with other types of nonlinear functions.

Recently, kernel methods such as the support vector machine (SVM) and kernel-based clustering have attracted growing attention, because a nonlinear problem can be transformed into a linear one by an implicit kernel mapping from the original input space to a high dimensional feature space. Motivated by this idea, we exploit the kernel methods to build a relationship between the HFIS and the SVM, and thereby obtain new types of rule consequent for the HFIS.

Numerous methods have been proposed to build a connection between the SVM and the FIS. Chen and Wang [5,6] propose a positive definite fuzzy system (PDFS). The PDFS is equivalent to a Gaussian-kernel SVM [5] if Gaussian membership functions are adopted; that is, the antecedent of each fuzzy rule is obtained from a support vector (SV). Therefore, the number of fuzzy rules equals the number of SVs. As the number of SVs is generally large, the size of an FIS based on an SVM is also large. To solve this problem, researchers


[7] propose a learning algorithm that removes irrelevant fuzzy rules; in spite of this, the generalization performance is degraded. The above methods concern the zero-order FIS, which has one fuzzy singleton in the consequent of each fuzzy rule. For the first-order FIS, Leski [8] describes a method for obtaining an FIS by means of the SVM with a data-independent kernel matrix. Moreover, Juang et al. [9] use a combination of fuzzy clustering and the linear SVM to establish a fuzzy model with fewer parameters and better generalization performance. However, little effort has been devoted to establishing an HFIS with kernel methods.

The goal of the current paper is to construct an HFIS with high accuracy and good generalization performance. In contrast to the zero-order FIS and the first-order FIS, the HFIS can not only reduce the computational complexity of fuzzy models but also improve the numerical accuracy [10]. The key problem in building an HFIS is how to obtain the formulation of the nonlinear function in the consequent. We solve this problem with the help of Mercer kernels.

The rest of this paper is organized as follows. Section 2 reviews the related concepts of the FIS and the SVM. Section 3 describes how kernel techniques can be used to establish an HFIS. In Section 4, the learning algorithm of the HFIS based on kernel methods (KHFIS) is developed. The experimental results are shown in Section 5. Finally, the conclusions are presented in Section 6.

2. FIS and SVM

2.1. FIS

In a fuzzy model, the kth fuzzy rule can be expressed as

$$ R_k:\ \text{If } x_1 \text{ is } A_1^{(k)} \text{ and } x_2 \text{ is } A_2^{(k)} \text{ and } \ldots \text{ and } x_n \text{ is } A_n^{(k)},\ \text{then } y_k = f_k(x), \qquad (1) $$

where $x = [x_1, x_2, \ldots, x_n]^T$ is the input variable, $y_k$ is the local output variable, $A_j^{(k)}$ is a fuzzy set characterized by the membership function $\mu_{A_j^{(k)}}(x_j)$, and $f_k$ is a crisp function of the input vector $x$. When $f_k(x)$ is a constant, the resulting FIS is called a zero-order FIS or a singleton fuzzy model, which can be considered a special case of a Mamdani FIS [10,11]. When $f_k(x)$ is a first-order polynomial, it is called a first-order Takagi–Sugeno–Kang FIS, originally proposed in [12,13]. When $f_k(x)$ is a nonlinear function, it is called a higher order FIS (HFIS); in particular, if $f_k(x)$ is a polynomial, the model is called a polynomial FIS.

The firing strength of the kth rule is given by

$$ \mu_{A^{(k)}}(x) = t\big(\mu_{A_1^{(k)}}(x_1), \mu_{A_2^{(k)}}(x_2), \ldots, \mu_{A_n^{(k)}}(x_n)\big), \qquad (2) $$

where the operator $t(\cdot)$ is a t-norm, the fuzzy conjunction of the antecedent fuzzy sets. Considering weighted average defuzzification, we express the overall output of an FIS with $M$ fuzzy rules as

$$ f(x) = \sum_{k=1}^{M} f_k(x)\, G_k(x), \qquad (3) $$

where

$$ G_k(x) = \frac{\mu_{A^{(k)}}(x)}{\sum_{i=1}^{M} \mu_{A^{(i)}}(x)} \qquad (4) $$

is the normalized degree of fulfillment of the antecedent clause of rule $R_k$, with $\sum_{k=1}^{M} \mu_{A^{(k)}}(x) \neq 0$.
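To make the inference scheme of (1)–(4) concrete, the following sketch evaluates a small first-order TSK model (a minimal illustration in Python, standing in for the paper's MATLAB setting; the two rules and all parameter values are invented for demonstration):

```python
import numpy as np

def gaussian_mf(x, centers, widths):
    """Gaussian membership values mu_{A_j^(k)}(x_j) for every rule and input."""
    return np.exp(-0.5 * ((x - centers) / widths) ** 2)

def fis_output(x, centers, widths, consequents):
    """Weighted-average output (3)-(4) of a first-order TSK model.

    centers, widths: (M, n) antecedent parameters of the M rules.
    consequents:     (M, n+1) rows [a_k, b_k], so f_k(x) = a_k^T x + b_k.
    """
    mu = np.prod(gaussian_mf(x, centers, widths), axis=1)  # product t-norm, eq. (2)
    G = mu / mu.sum()                                      # normalized strengths, eq. (4)
    f = consequents[:, :-1] @ x + consequents[:, -1]       # local outputs f_k(x)
    return float(G @ f)                                    # eq. (3)

# Two hypothetical rules on a 2-D input (all values invented for illustration).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.ones((2, 2))
consequents = np.array([[1.0, -1.0, 0.5], [0.5, 2.0, -0.2]])
print(fis_output(np.array([0.3, 0.7]), centers, widths, consequents))
```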

2.2. SVM

Given a training set of $l$ pairs of data points $\{x_i, y_i\}_{i=1}^{l}$ for a regression problem, where $x_i \in R^n$ is the ith input data point in the input space and $y_i \in R$ is the corresponding output value, the SVM for regression can be formulated as

$$ \min_{w, b, \xi_i, \xi_i^*}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) $$
$$ \text{s.t.}\ \ y_i - w^T \varphi(x_i) - b \le \varepsilon + \xi_i,\quad w^T \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*,\quad \xi_i, \xi_i^* \ge 0,\ i = 1, \ldots, l, \qquad (5) $$

where $\varphi(\cdot)$ denotes the feature mapping associated with a given kernel $K(\cdot)$, $\varepsilon$ is the upper value of the tolerable error, $\xi_i, \xi_i^*$ are slack variables, and $C > 0$ is a cost coefficient, which determines the trade-off between the model complexity and the degree of tolerance to errors larger than $\varepsilon$. The dual form of the optimization problem (5) is the quadratic programming (QP) problem

$$ \min_{\alpha, \alpha^*}\ \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) + \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) - \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) $$
$$ \text{s.t.}\ \ \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0,\quad \alpha_i, \alpha_i^* \in [0, C],\ i = 1, 2, \ldots, l, \qquad (6) $$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers and $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. Solving the QP problem (6) yields the SVM model

$$ f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) K(x_i, x) + b. \qquad (7) $$

When $\alpha_i - \alpha_i^* \neq 0$, the training pattern $x_i$ is called a support vector. In the model (7), $f(x)$ is a sparse expansion over the training patterns $x_i$ because of the presence of numerous zero coefficients. Therefore, the complexity of the regression function $f(x)$ in (7) depends on the number of SVs rather than on the dimensionality of the input space [14]. In addition, we can construct different kernel machines by choosing different kernel functions, and improve the generalization performance of the SVM model (5) by controlling the two parameters $C$ and $\varepsilon$ [15].
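The dual (6) and the sparse expansion (7) are exactly what off-the-shelf $\varepsilon$-SVR solvers implement. A minimal usage sketch with scikit-learn (our choice of library, not the paper's; the paper's experiments run under MATLAB), in which C, epsilon, and gamma play the roles of $C$, $\varepsilon$, and the Gaussian kernel width discussed above:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X).ravel() + 0.05 * rng.standard_normal(200)  # toy regression data

# epsilon-SVR with a Gaussian (RBF) kernel: the solver handles the dual QP (6).
model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=0.5).fit(X, y)

# The fitted model is the sparse expansion (7): only support vectors contribute.
print("support vectors:", len(model.support_), "of", len(X))
print("prediction at x=0:", model.predict([[0.0]])[0])
```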

3. Higher order fuzzy systems using kernel methods

In this section, we give a detailed description of the idea of the KHFIS. The HFIS decomposes a nonlinear problem into a set of locally nonlinear submodels, each of which is represented by a fuzzy rule with a nonlinear consequent. The major concern in establishing an HFIS is how to obtain the nonlinear functions of these submodels. According to the principle of the SVM, nonlinear relations in the original input space can be reduced to linear ones in a possibly high dimensional feature space by an implicit nonlinear mapping $\varphi$. Without knowing the nature of the feature space or the explicit form of the mapping $\varphi$, we compute the inner product of the projected vectors by means of a Mercer kernel $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. Based on this property, Mercer kernels are adopted to build the HFIS.

The main idea of the KHFIS is described below. Introducing an implicit mapping $\varphi$, associated with a Mercer kernel $K$, from the original input space to a high dimensional feature space $H_1$, we map each nonlinear submodel of the original input space into a linear submodel of $H_1$. Correspondingly, the HFIS is transformed into a combination of linear submodels in the feature space $H_1$. Hence, the kth rule of the HFIS in $H_1$ is replaced by

$$ R_k:\ \text{If } x \text{ is } A^{(k)},\ \text{then } f_k(x) = p^{(k)T} \varphi(x) + p_0^{(k)}, \qquad (8) $$

where $A^{(k)}$ is the antecedent fuzzy set of the kth rule, defined by the membership function $\mu_{A^{(k)}}(x)$ in (2), and $p^{(k)}$ and $p_0^{(k)}$ denote the parameter vector and the bias element of the kth fuzzy rule in the feature space $H_1$, respectively. Consequently, the output of an HFIS in (3) is rewritten as

$$ f(x) = \sum_{k=1}^{M} G_k(x)\big(p^{(k)T} \varphi(x) + p_0^{(k)}\big). \qquad (9) $$

The projection of the input vector $x$ in $H_1$ is denoted by $\hat{x} = \varphi(x)$. Furthermore, we consider a nonlinear mapping $\psi$ from $H_1$ to a new feature space $H$ given by

$$ \psi(\hat{x}) = \big[\psi^1(\hat{x})^T, \psi^2(\hat{x})^T, \ldots, \psi^M(\hat{x})^T\big]^T, \qquad (10) $$

where $\psi^k(\hat{x}) = [G_k(x)\hat{x}^T, G_k(x)]^T$ $(1 \le k \le M)$. Noting that $\Phi(x) = \psi(\varphi(x))$, the output of the HFIS in (9) is expressed as

$$ f(x) = P^T \Phi(x), \qquad (11) $$

where $P$ is a vector in the high dimensional feature space $H$, defined as

$$ P = \big[p^{(1)\prime T}, p^{(2)\prime T}, \ldots, p^{(M)\prime T}\big]^T, $$

with $p^{(k)\prime} = [p^{(k)T}, p_0^{(k)}]^T$. As a result, the output of the HFIS is represented as a linear function of the mapped vectors in the feature space $H$. Following the spirit of the SVM, we can obtain the form of the nonlinear function in the consequent part of a fuzzy rule; the details are presented in the following section.
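The stacked map (10)–(11) can be sanity-checked numerically when $\varphi$ is explicit. The sketch below assumes the linear kernel $K(x,u) = x^T u$ (so that $\varphi$ is the identity; an illustrative assumption, not the kernel used later in the paper) and confirms that $\Phi(x_i)^T \Phi(x_j)$ equals $\sum_k G_k(x_i) G_k(x_j)(K(x_i, x_j) + 1)$, the closed form derived in Section 4.2:

```python
import numpy as np

def normalized_strengths(x, centers, widths):
    """G_k(x) of eq. (4) with Gaussian sets and the product t-norm."""
    mu = np.prod(np.exp(-0.5 * ((x - centers) / widths) ** 2), axis=1)
    return mu / mu.sum()

def big_phi(x, centers, widths):
    """Phi(x) = [G_1(x) phi(x)^T, G_1(x), ..., G_M(x) phi(x)^T, G_M(x)]^T,
    written out explicitly for the linear kernel, where phi(x) = x."""
    G = normalized_strengths(x, centers, widths)
    return np.concatenate([np.append(g * x, g) for g in G])

rng = np.random.default_rng(1)
centers, widths = rng.normal(size=(3, 2)), np.ones((3, 2))  # M = 3 rules, n = 2
xi, xj = rng.normal(size=2), rng.normal(size=2)

lhs = big_phi(xi, centers, widths) @ big_phi(xj, centers, widths)
Gi = normalized_strengths(xi, centers, widths)
Gj = normalized_strengths(xj, centers, widths)
rhs = np.sum(Gi * Gj * (xi @ xj + 1.0))
print(np.isclose(lhs, rhs))  # True: the stacked map realizes the kernel
```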

4. Learning algorithm of KHFIS

The learning algorithm of the KHFIS consists of two steps. In the first step, kernel-based fuzzy c-means clustering (KFCM) is used to establish the antecedent part of the fuzzy rules. In the second step, the SVM is applied to optimize the consequent parameters.

4.1. Construction of antecedent membership functions

The classical clustering algorithms require that each datum belong to one and only one cluster. Fuzzy clustering, however, allows each datum to belong to several clusters simultaneously, with different degrees of membership [10]. Therefore, fuzzy clustering is better suited to data where the transitions between clusters are gradual rather than abrupt. In recent years, the KFCM has attracted considerable attention. Several studies on kernel-based clustering [16–25] demonstrate that kernel clustering algorithms are more accurate and robust than conventional clustering algorithms because they can detect clusters with different volumes and shapes.

Two KFCM algorithms exist. In the first algorithm, proposed by Zhang and Chen [16,18], a fixed-point iteration of the prototypes in the original data space is presented with a Gaussian kernel; its convergence has been proven in [19], and the iterative expressions for the polynomial kernel and the sigmoid kernel have been derived in [24]. The first algorithm updates the c prototypes at each iteration, whereas the second algorithm needs no such iteration and retains the prototypes in the kernel space [20,21]; the prototypes in the original space can then be computed by an inverse mapping from the kernel space to the original input space [23]. To compare the performance of the two algorithms, the authors of [25] analyze the computational complexity. At each iteration, the computational complexity of the first algorithm is O(cln), whereas that of the second is O(cl²n) [25], where l is the number of training samples, c is the number of clusters, and n is the dimension of the training samples. In the present study, we use the first algorithm to determine the antecedent part of the KHFIS.

For the given input–output data pairs $\{x_i, y_i\}_{i=1}^{l}$, let the dataset $Z \in R^{(n+1) \times l}$ to be clustered be represented by $z_i = [x_i^T, y_i]^T$, $i = 1, 2, \ldots, l$, and let $\varphi_1$ be a nonlinear mapping from the original data space to a high dimensional feature space. The KFCM search for $M$ clusters then solves the following minimization problem [16]:

$$ J_C(U, V) = \sum_{k=1}^{M} \sum_{i=1}^{l} \mu_{ki}^{m}\, d^2(z_i, v_k), \qquad (12) $$

where $\mu_{ki}$ is the membership degree of data point $z_i$ in the kth cluster, $m$ is a fuzziness coefficient, $V$ is the set of cluster centers $(v_k \in R^{n+1})$, and $U = (\mu_{ki})_{M \times l}$ is a fuzzy partition matrix. Here $d(z_i, v_k)$ is the distance between $\varphi_1(z_i)$ and $\varphi_1(v_k)$ in the feature space, calculated as

$$ d^2(z_i, v_k) = \|\varphi_1(z_i) - \varphi_1(v_k)\|^2 = K(z_i, z_i) - 2K(z_i, v_k) + K(v_k, v_k), \qquad (13) $$

where $K(z_i, v_k) = \varphi_1(z_i)^T \varphi_1(v_k)$. If the Gaussian kernel is adopted, the updating formula for the cluster centers in the original input space [16] is given by

$$ v_k = \frac{\sum_{i=1}^{l} \mu_{ki}^{m} K(z_i, v_k)\, z_i}{\sum_{i=1}^{l} \mu_{ki}^{m} K(z_i, v_k)}. \qquad (14) $$

Fuzzy membership functions of the antecedent can be generated from the reference functions proposed in [7]. The main parameters of the jth fuzzy membership function in the antecedent of the kth rule are the center $v_{kj}$ and the width $\sigma_{kj}$, where $v_{kj}$ is the jth component of the kth cluster center $v_k$ and the width $\sigma_{kj}$ is given by

$$ \sigma_{kj}^2 = \frac{\sum_{i=1}^{l} \mu_{ki}^2 (x_{ij} - v_{kj})^2}{\sum_{i=1}^{l} \mu_{ki}^2}. \qquad (15) $$
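A compact sketch of the Gaussian-kernel KFCM iteration of (13)–(14), together with the width formula (15), is given below. The membership update is the standard FCM rule applied to the kernel-induced distance, and the random initialization is our assumption; neither detail is spelled out in the text above:

```python
import numpy as np

def kfcm(Z, M, m=2.0, sigma=1.0, n_iter=100, seed=0):
    """Gaussian-kernel fuzzy c-means (the first algorithm of [16,18]).

    Z: (l, d) data with rows z_i = [x_i^T, y_i]. Returns prototypes V (M, d)
    in the original space and the fuzzy partition matrix U (M, l).
    """
    rng = np.random.default_rng(seed)
    V = Z[rng.choice(len(Z), size=M, replace=False)]  # assumed initialization
    for _ in range(n_iter):
        K = np.exp(-((Z[None] - V[:, None]) ** 2).sum(axis=2) / (2 * sigma**2))
        d2 = np.maximum(2.0 * (1.0 - K), 1e-12)       # eq. (13) for a Gaussian kernel
        inv = d2 ** (-1.0 / (m - 1.0))                # standard FCM membership update
        U = inv / inv.sum(axis=0)
        W = U**m * K                                  # weights of eq. (14)
        V = (W @ Z) / W.sum(axis=1, keepdims=True)    # prototype update, eq. (14)
    return V, U

def antecedent_widths(X, Vx, U):
    """Widths sigma_kj of eq. (15); X is (l, n), Vx is (M, n), U is (M, l)."""
    U2 = U**2
    diff2 = (X[None] - Vx[:, None]) ** 2              # (M, l, n)
    return np.sqrt((U2[:, :, None] * diff2).sum(axis=1) / U2.sum(axis=1)[:, None])
```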

4.2. HFIS-kernel function

We now propose a new kernel $K_H$, defined as $K_H(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, which we call the HFIS-kernel. Let $\phi^k(x) = [G_k(x)\varphi(x)^T, G_k(x)]^T$; then, by the definition of the mapping $\Phi$ in (11), we have

$$ \Phi(x) = \big[\phi^1(x)^T, \phi^2(x)^T, \ldots, \phi^M(x)^T\big]^T. \qquad (16) $$

Thus, we obtain the form of the HFIS-kernel $K_H$, namely

$$ K_H(x_i, x_j) = \sum_{k=1}^{M} \phi^k(x_i)^T \phi^k(x_j) = \sum_{k=1}^{M} \big[G_k(x_i)\varphi(x_i)^T, G_k(x_i)\big]\big[G_k(x_j)\varphi(x_j)^T, G_k(x_j)\big]^T = \sum_{k=1}^{M} G_k(x_i) G_k(x_j)\big(K(x_i, x_j) + 1\big), \qquad (17) $$

where $\varphi$ denotes the feature mapping from the input space to the feature space $H_1$, associated with a given Mercer kernel $K(\cdot)$. By (17), the HFIS-kernel can be regarded as the product of two kernels, that is,

$$ K_H(x_i, x_j) = K_1(x_i, x_j)\, K_2(x_i, x_j), \qquad (18) $$
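Because (17) gives $K_H$ in closed form, the Gram matrix required later for training costs only the normalized firing strengths plus one base-kernel evaluation per pair. A sketch with Gaussian antecedent sets and the product t-norm (the choices used in the experiments); the last line numerically checks the positive semi-definiteness that Theorem 1 below establishes:

```python
import numpy as np

def strengths(X, centers, widths):
    """G[i, k] = G_k(x_i) of eq. (4), Gaussian sets and product t-norm."""
    mu = np.prod(np.exp(-0.5 * ((X[:, None] - centers[None]) / widths[None]) ** 2),
                 axis=2)
    return mu / mu.sum(axis=1, keepdims=True)

def hfis_kernel(XA, XB, centers, widths, sigma2):
    """K_H of eq. (17): sum_k G_k(x_i) G_k(x_j) (K(x_i, x_j) + 1),
    with a Gaussian base kernel of width sigma2."""
    GA, GB = strengths(XA, centers, widths), strengths(XB, centers, widths)
    sq = ((XA[:, None] - XB[None]) ** 2).sum(axis=2)
    K = np.exp(-sq / (2 * sigma2**2))                 # base Mercer kernel K
    return (GA @ GB.T) * (K + 1.0)                    # the k-sum collapses to GA GB^T

# Numerical check of Theorem 1 below: the Gram matrix is symmetric PSD.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
centers, widths = rng.normal(size=(3, 2)), np.ones((3, 2))
KH = hfis_kernel(X, X, centers, widths, sigma2=1.0)
print(np.linalg.eigvalsh(KH).min() >= -1e-10)         # True
```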

where

$$ K_1(x_i, x_j) = \sum_{k=1}^{M} G_k(x_i)\, G_k(x_j) \qquad (19) $$

is constructed from the fuzzy membership functions of the antecedent, and

$$ K_2(x_i, x_j) = K(x_i, x_j) + 1 \qquad (20) $$

is a Mercer kernel.

In order to demonstrate that the HFIS-kernel defined by (17) is suitable for use in the SVM, we need to prove that it is a valid kernel. According to the Mercer theorem [26] and its equivalent form in [27], we can prove this conclusion.

Proposition 1 ([27]). Let $K_1$ and $K_2$ be kernels over $X \times X$, $X \subseteq R^n$, and let $f(\cdot)$ be a real-valued function on $X$. Then the following functions are kernels:
(1) $K(x, u) = K_1(x, u) + K_2(x, u)$,
(2) $K(x, u) = K_1(x, u)\, K_2(x, u)$,
(3) $K(x, u) = f(x) f(u)$.

Theorem 1. If the membership functions $\mu_{A_j^{(k)}}(x)$ are continuous for $k = 1, 2, \ldots, M$, $j = 1, 2, \ldots, n$ and the t-norm is continuous, then the HFIS-kernel is a Mercer kernel.

Proof. Since $G_k(x)$ is a real-valued function, $K_1(x_i, x_j)$ given by (19) is a kernel by Proposition 1. Based on the fact that $K(\cdot)$ is a Mercer kernel, $K_2(x_i, x_j)$, the sum of 1 and $K(\cdot)$, is also a Mercer kernel. Thus, from (18) we conclude that the HFIS-kernel is a valid kernel. When the $\mu_{A_j^{(k)}}(x)$ and the t-norm are continuous, the HFIS-kernel is also a continuous function. Hence, by the Mercer theorem, the HFIS-kernel is a Mercer kernel. □

The main characterization of a valid kernel function is that it produces symmetric and positive semi-definite kernel matrices. Obviously, the kernel matrix of an HFIS-kernel is symmetric and positive semi-definite. Thus, the HFIS-kernel remains a valid kernel even if the continuity assumption is dropped.

4.3. Parameters learning

After the antecedent parts of the fuzzy rules are established by the KFCM, we can use the SVM to obtain the optimal consequent parameters of the KHFIS based on the proposed kernel $K_H$. Inspired by the structural risk minimization (SRM) inductive principle, we state the objective function as follows:

$$ \min_{P}\ \frac{1}{\tau} \sum_{i=1}^{l} \big| y_i - P^T \Phi(x_i) \big|_{\varepsilon} + \frac{1}{2} P^T P, \qquad (21) $$

where $|\cdot|_{\varepsilon}$ is the $\varepsilon$-insensitive loss function [26] and $\varepsilon$ is a user-defined parameter: points whose discrepancy between the predicted and actual values is less than $\varepsilon$ are not penalized. The first term is related to the empirical risk; the second term is related to the model complexity. Here $\tau > 0$ is a constant that controls the trade-off between the model complexity and the training errors. The traditional algorithms for identifying the optimal parameters are based on empirical risk minimization, which can lead to a fuzzy model with arbitrary approximation accuracy but cannot guarantee good generalization performance. The SRM inductive principle takes into account both the empirical risk and the complexity of the approximating function, and can guarantee a small testing error through proper selection of the parameter values $\varepsilon$ and $\tau$.

Introducing slack variables $\xi_i, \xi_i^* \ge 0$ for all data pairs, we obtain the QP problem of (21):

$$ \min_{P, \xi^{(*)}}\ \frac{1}{2} P^T P + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) $$
$$ \text{s.t.}\ \ y_i - \Phi(x_i)^T P \le \varepsilon + \xi_i,\quad \Phi(x_i)^T P - y_i \le \varepsilon + \xi_i^*,\quad \xi_i, \xi_i^* \ge 0,\ i = 1, \ldots, l, \qquad (22) $$

where $\xi^{(*)} = [\xi_1, \ldots, \xi_l, \xi_1^*, \ldots, \xi_l^*]^T$. To solve this optimization problem, we construct the Lagrange function

$$ L(P, \xi^{(*)}, \alpha, \alpha^*, \mu_i, \mu_i^*) = \frac{1}{2} P^T P + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) - \sum_{i=1}^{l} (\mu_i \xi_i + \mu_i^* \xi_i^*) - \sum_{i=1}^{l} \alpha_i \big(\varepsilon + \xi_i - y_i + \Phi(x_i)^T P\big) - \sum_{i=1}^{l} \alpha_i^* \big(\varepsilon + \xi_i^* + y_i - \Phi(x_i)^T P\big), \qquad (23) $$

where $\alpha_i, \alpha_i^*, \mu_i, \mu_i^* \ge 0$ are Lagrange multipliers and $C = \tau^{-1}$. Minimizing this Lagrange function with respect to $P$, $\xi_i$, and $\xi_i^*$ results in the following three conditions:

$$ P = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, \Phi(x_i), \qquad (24) $$
$$ \alpha_i + \mu_i = C,\quad i = 1, 2, \ldots, l, \qquad (25) $$
$$ \alpha_i^* + \mu_i^* = C,\quad i = 1, 2, \ldots, l. \qquad (26) $$

Substituting (24)–(26) into (23), we formulate the dual of (22) as the following QP problem:

$$ \min_{\alpha, \alpha^*}\ \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K_H(x_i, x_j) + \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) - \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) y_i $$
$$ \text{s.t.}\ \ \alpha_i, \alpha_i^* \in [0, C],\quad i = 1, 2, \ldots, l, \qquad (27) $$

where $K_H(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Putting (24) into (11), we describe the output of the KHFIS as

$$ f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, K_H(x_i, x). \qquad (28) $$

Comparing the optimization problem (27) with the QP problem (6), we see that problem (27) lacks the constraint $\sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0$, because there is no bias term $b$ in (11). Therefore, the parameter learning of the KHFIS can be regarded as a special SVM under the following Condition Set A:

(1) the kernel has the form $K_H(x_i, x_j) = \sum_{k=1}^{M} G_k(x_i) G_k(x_j)(K(x_i, x_j) + 1)$;
(2) the parameter $C = \tau^{-1}$;
(3) $b = 0$.

If the conditions of Condition Set A are satisfied, the output of the KHFIS is mathematically equivalent to the expected output of the SVM. As a result, we can use the SVM to generate the consequents of the HFIS.
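Under Condition Set A, consequent learning is therefore an $\varepsilon$-SVR problem over the precomputed Gram matrix of $K_H$. The sketch below shows that workflow with scikit-learn's precomputed-kernel mode (an assumption of this note, and only an approximation: sklearn's SVR still fits an intercept $b$, whereas the KHFIS formulation fixes $b = 0$, so an exact implementation would solve the QP (27) directly); `hfis_kernel` is the hypothetical helper sketched in Section 4.2:

```python
from sklearn.svm import SVR

# `hfis_kernel`, `centers`, and `widths` are the hypothetical helpers/parameters
# from the KFCM step (Section 4.1) and the kernel sketch of Section 4.2.
def train_khfis(X, y, centers, widths, sigma2, C, epsilon):
    K_train = hfis_kernel(X, X, centers, widths, sigma2)   # Gram matrix of K_H
    # kernel="precomputed" lets the SVR solve the dual over K_H directly;
    # note sklearn still fits an intercept, while Condition Set A requires b = 0.
    return SVR(kernel="precomputed", C=C, epsilon=epsilon).fit(K_train, y)

def predict_khfis(svr, X_new, X_train, centers, widths, sigma2):
    # eq. (28): f(x) = sum_i (alpha_i - alpha_i^*) K_H(x_i, x)
    K_new = hfis_kernel(X_new, X_train, centers, widths, sigma2)
    return svr.predict(K_new)
```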


Based on the analysis given above, we can obtain the explicit expression of the consequent function.

Theorem 2. Let $\alpha^{(*)} = [\alpha_1, \ldots, \alpha_l, \alpha_1^*, \ldots, \alpha_l^*]^T$ be the solution of the dual problem (27). Then the nonlinear function of the kth rule consequent has the form

$$ f_k(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, G_k(x_i)\big(K(x_i, x) + 1\big). \qquad (29) $$

Proof. According to (24), we have

$$ P = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, \Phi(x_i). $$

From the form of $\Phi(x_i)$ in (16), we rewrite $P$ as

$$ \begin{bmatrix} p^{(1)\prime} \\ \vdots \\ p^{(M)\prime} \end{bmatrix} = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) \begin{bmatrix} \phi^1(x_i) \\ \vdots \\ \phi^M(x_i) \end{bmatrix}. $$

Thus, we have

$$ p^{(k)\prime} = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, \phi^k(x_i),\quad 1 \le k \le M. $$

Since $\phi^k(x) = [G_k(x)\varphi(x)^T, G_k(x)]^T$, the nonlinear function of the kth rule consequent has the form

$$ f_k(x) = p^{(k)\prime T} \begin{bmatrix} \varphi(x) \\ 1 \end{bmatrix} = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, \phi^k(x_i)^T \begin{bmatrix} \varphi(x) \\ 1 \end{bmatrix} = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) \big[G_k(x_i)\varphi(x_i)^T, G_k(x_i)\big] \begin{bmatrix} \varphi(x) \\ 1 \end{bmatrix} = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, G_k(x_i)\big(K(x_i, x) + 1\big). \quad \square $$

Let $O_k(x_i, x) = G_k(x_i)(K(x_i, x) + 1)$. According to Theorem 2, the nonlinear function of the kth consequent $f_k(x)$ is a sparse expansion in the kernel function $O_k(x_i, x)$, the product of the fuzzy membership functions and the kernel function. In the case of the second-order PFIS, the number of consequent parameters to be computed is $M(n+1)(n+2)/2$, where $M$ is the number of fuzzy rules and $n$ is the number of input variables, whereas in the case of the KHFIS the number of consequent parameters equals the number of SVs of the KHFIS. Hence, the computational complexity of the proposed model can be lower than that of the higher order PFIS with LSE. Furthermore, if $M = 1$, the KHFIS is equivalent to the SVM for regression. Therefore, the PDFS is a special case of the KHFIS, because the PDFS with Gaussian membership functions is equivalent to the SVM with a Gaussian kernel.
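By Theorem 2, the dual coefficients of the trained machine immediately yield each rule's consequent. A sketch continuing the hypothetical scikit-learn workflow above (`dual_coef_` stores $\alpha_i - \alpha_i^*$ for the support vectors, and `strengths` is the helper from Section 4.2):

```python
import numpy as np

def rule_consequent(k, x, svr, X_train, centers, widths, sigma2):
    """f_k(x) of eq. (29), read off the support vectors of the trained SVR."""
    sv = svr.support_                                   # indices of the SVs
    coef = svr.dual_coef_.ravel()                       # alpha_i - alpha_i^* on the SVs
    Gk = strengths(X_train[sv], centers, widths)[:, k]  # G_k(x_i) for each SV
    sq = ((X_train[sv] - x[None]) ** 2).sum(axis=1)
    K = np.exp(-sq / (2 * sigma2**2))                   # base Gaussian kernel K(x_i, x)
    return float(np.sum(coef * Gk * (K + 1.0)))
```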

5. Experiment results

The performance of the proposed KHFIS is verified with the aid of several well-known synthetic and real benchmark datasets. Three aspects of the KHFIS are investigated: the approximation capability, the generalization performance, and the robustness. In these examples, the following models are used and compared:

(1) FMID: the first-order FIS, using Gustafson–Kessel (GK) clustering with LSE of the consequent parameters.
(2) ANFIS: the adaptive-network-based FIS, using subtractive clustering with back-propagation of the consequent parameters.
(3) PFIS: the second-order polynomial FIS, using subtractive clustering with LSE of the consequent parameters.
(4) PDFS: the zero-order FIS constructed from the support vectors of an SVM with a Gaussian kernel.
(5) Leski's model: the first-order FIS, using FCM with the SVM for the consequent parameters.

The parameters of the compared models are as follows: $M$ is the number of rules of the FIS; $C$ is the cost coefficient of the SVM, used by the PDFS, Leski's model, and our proposed model; $\varepsilon$ is the user-defined parameter of Vapnik's insensitive loss function of the SVM, used by the PDFS, Leski's model, and our proposed model; $\sigma$ is the kernel parameter of the Gaussian kernel used by the PDFS and Leski's model; $\sigma_1$ and $\sigma_2$ are the Gaussian-kernel parameters of the KFCM and the HFIS-kernel $K_H$, respectively; and $d_1$ and $d_2$ are the polynomial-kernel parameters of the KFCM and the HFIS-kernel $K_H$, respectively. The performance of a model is evaluated by the root mean square error (RMSE), computed as

$$ \mathrm{RMSE} = \sqrt{\frac{1}{l} \sum_{i=1}^{l} (y_i - \hat{y}_i)^2}, \qquad (30) $$

where $l$ is the number of training samples, $y_i$ is the output of the ith training sample, and $\hat{y}_i$ is the model output for the ith training sample. All experiments are run in the MATLAB environment. The kernel parameters are determined by a ten-fold cross-validation strategy. The multiplicative operator is employed for the t-norm, and Gaussian membership functions are adopted for the fuzzy models.

5.1. Approximation capability

Experiment 1. The Box–Jenkins gas furnace data is a famous example of system identification. The data consist of 296 observations $[u(t), y(t)]$, where $u(t)$ is the input gas flow rate into the furnace and $y(t)$ is the CO2 concentration in the outlet gas. Here we evaluate the approximation capability of the proposed system. To this end, we first construct the inputs $x(t) = [y(t-1), u(t-4)]$ and the outputs $y(t)$ from the original data pairs $[u(t), y(t)]$, obtaining 292 input–output data pairs. Second, we use the data pairs $[x(t), y(t)]$ to train the proposed system. Finally, we calculate the total training error and use it to evaluate the approximation capability of the proposed system.

We set the parameter values of the KHFIS to $C = 200$, $\varepsilon = 0.01$, $\sigma_1 = 8$, $\sigma_2 = 0.01$. Fig. 1 shows the influence of different cluster numbers on the KHFIS; the approximation error is lowest when the number of clusters is 10. Fig. 2 shows the modeling result with two rules. A comparison of the conventional fuzzy model and several kernel-based fuzzy models is presented in Table 1. The proposed algorithm with the Gaussian kernel achieves the best approximation accuracy, with an RMSE of 0.01 and two rules, while the best result among the other models is obtained by the FMID, with an RMSE of 0.337 and four rules. Clearly, the approximation capability of the proposed model significantly outperforms that of the previous fuzzy models.


Training time is also shown in Table 1. The kernel-based models take more time than the fuzzy models with LSE and the ANFIS. Moreover, the training time of the proposed fuzzy model is comparable with those of the PDFS and Leski's model. Thus, the computational complexity of the proposed model depends on the training time of the SVM rather than on the nonlinear characteristic of the consequent. By enhancing the sparsity of the SVM, the running speed of the SVM can be improved; this is a research topic worthy of future study.

Keeping the same parameter values of $C$ and $\varepsilon$, we also obtain the result of the KHFIS with a polynomial kernel. When $d_1 = d_2 = 2$, the best result of this model is an RMSE of 0.361 with four rules. This indicates that the selection of the kernel affects the performance of the KHFIS. The Gaussian kernel produces the best performance in the experiments of the present study; therefore, unless specified otherwise, the Gaussian kernel is used for the KHFIS in the subsequent experiments.

[Fig. 1. Effect of the number of clusters on the approximation error (RMSE versus the number of rules; $\sigma_1 = 8$, $\sigma_2 = 0.01$).]

[Fig. 2. Simulation results of the KHFIS with M = 2: real value versus estimated value (output over the data index).]

Table 1. Comparison of results obtained by the selected models for the gas furnace dataset.

| Type              | No. inputs | M   | RMSE  | Training time (s) |
|-------------------|------------|-----|-------|-------------------|
| FMID [10]         | 2          | 4   | 0.337 | 0.266             |
| ANFIS [11]        | 2          | 4   | 0.351 | 0.125             |
| PFIS [1]          | 2          | 3   | 0.360 | 0.363             |
| PDFS [6]          | 2          | 289 | 0.363 | 0.416             |
| Leski's model [8] | 2          | 2   | 0.389 | 0.400             |
| KHFIS             | 2          | 2   | 0.01  | 0.395             |

5.2. Generalization performance

Experiment 2. We adopt two real datasets in our comparative study. The first is the Sugeno–Yasukawa stock price dataset [28], which consists of 100 input–output pairs. The second is the sunspot dataset, which has long been considered a benchmark [29]; it contains the yearly average sunspot numbers recorded from 1700 to 1979. We predict the yearly average of the next year using the previous 12 yearly averages.

To test the generalization performance of these models, we use ten-fold cross-validation. First, the data are randomly split into 10 subsets of approximately equal size; each subset is used in turn for testing while the remainder is used for training. Second, the optimal parameters of these models, such as $C$, $\varepsilon$, and the kernel parameter $\sigma$, are determined by ten-fold cross-validation on the training set, and the RMSE is computed on the corresponding testing set. We repeat this process ten times. Finally, the generalization performance of each model is measured by averaging the testing RMSE across the 10 testing sets.
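The evaluation protocol just described maps onto a few lines of scikit-learn (an illustrative sketch; dataset loading and the inner parameter search over $C$, $\varepsilon$, and $\sigma$ are elided):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(X, y, fit, predict, n_splits=10, seed=0):
    """Ten-fold protocol of Section 5.2: average test RMSE, eq. (30)."""
    rmses = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = fit(X[tr], y[tr])   # inner search for C, epsilon, sigma elided
        err = y[te] - predict(model, X[te])
        rmses.append(np.sqrt(np.mean(err**2)))
    return float(np.mean(rmses))
```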

The experimental results of the ten-fold cross-validation are shown in Figs. 3 and 4. The testing errors of the KHFIS are better than those of the FMID, ANFIS, and PFIS for almost all of the training and testing data. This can be explained as follows: unlike the existing fuzzy models, whose consequent is a linear function of the input variables, the consequent of the KHFIS is a nonlinear function, so the KHFIS can express the nonlinear relationship much better. Furthermore, the proposed model obtains the consequent parameters by minimizing the structural risk, whereas the FMID, ANFIS, and PFIS obtain their parameters by minimizing the empirical risk; minimizing the training error toward zero may lead to large errors on the testing set. Thus, the proposed KHFIS shows better generalization performance than the traditional fuzzy models. The figures also show that the fuzzy models based on kernel methods give results very similar to those of the proposed fuzzy model.

[Fig. 3. Result of the stock price dataset for ten-fold cross-validation: testing RMSE on each of the ten train/test sets for FMID, ANFIS, PFIS, PDFS, Leski's model, and KHFIS.]

[Fig. 4. Result of the sunspot dataset for ten-fold cross-validation: testing RMSE on each of the ten train/test sets for FMID, ANFIS, PFIS, PDFS, Leski's model, and KHFIS.]

For further comparison, Table 2 lists the average testing RMSE values of the fuzzy models. Among the compared models, the KHFIS is the best, because it obtains the lowest RMSE with the fewest fuzzy rules.

Table 2. Comparison of the test errors on the stock price dataset and the sunspot dataset.

| Type          | Stock price M | Stock price RMSE | Sunspot M | Sunspot RMSE |
|---------------|---------------|------------------|-----------|--------------|
| FMID          | 3             | 11.557           | 3         | 0.0824       |
| ANFIS         | 4             | 11.611           | 2         | 0.0796       |
| PFIS          | 16            | 12.251           | 8         | 0.1978       |
| PDFS          | 78            | 6.906            | 102       | 0.0785       |
| Leski's model | 3             | 8.857            | 3         | 0.0763       |
| KHFIS         | 2             | 6.892            | 2         | 0.0765       |

Based on the above comparison, we can conclude that the proposed KHFIS is capable of obtaining good generalization performance using fewer rules than the other existing fuzzy models.

5.3. Robustness

Experiment 3. The robustness of the proposed model is investigated. We select an appropriate training and testing set from the 10 training and testing sets. Since the test errors of the six models are approximately equal for the ninth training and testing set of the stock price dataset, we select it for this experiment; for the same reason, we choose the first training and testing set for the sunspot dataset. Subsequently, chi-squared random noise with two degrees of freedom is added to the input and output variables of the training part to evaluate the robustness of the proposed model. Table 3 contrasts the performance of several models with that of the proposed fuzzy model, and Table 4 displays the optimal parameters of the PDFS, Leski's model, and the KHFIS obtained by ten-fold cross-validation. The comparative results reveal that the test error of the KHFIS is the lowest among the six fuzzy models. Therefore, the KHFIS demonstrates much better robustness than the others.
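The noise-injection step of Experiment 3 is straightforward to reproduce (a sketch; the scaling of the noise is not specified in the paper, so numpy's raw chi-squared draws are used as-is):

```python
import numpy as np

def add_chi2_noise(X_train, y_train, df=2, seed=0):
    """Add chi-squared noise (two degrees of freedom) to the training inputs
    and outputs, as in Experiment 3; the test set is left untouched."""
    rng = np.random.default_rng(seed)
    return (X_train + rng.chisquare(df, size=X_train.shape),
            y_train + rng.chisquare(df, size=y_train.shape))
```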

Table 3. Comparison of the robustness of several models on noisy datasets (testing RMSE).

| Data sets   | FMID   | ANFIS  | PFIS   | PDFS   | Leski's model | KHFIS  |
|-------------|--------|--------|--------|--------|---------------|--------|
| Stock price | 6.7229 | 8.2786 | 13.795 | 4.8428 | 4.6877        | 4.6886 |
| Sunspot     | 3.9281 | 2.4318 | 1.7768 | 1.9426 | 1.7262        | 1.6738 |

Table 4. Parameters of the fuzzy models.

| Data sets   | PDFS C | PDFS ε | PDFS σ | Leski's C | Leski's ε | KHFIS C | KHFIS ε | KHFIS σ1 | KHFIS σ2 |
|-------------|--------|--------|--------|-----------|-----------|---------|---------|----------|----------|
| Stock price | 16     | 0.5    | 16     | 1024      | 0.5       | 16      | 0.001   | 1024     | 16       |
| Sunspot     | 2      | 0.5    | 1      | 1024      | 0.002     | 2       | 0.5     | 128      | 16       |

6. Conclusions

In this paper, a novel algorithm that combines the KFCM with the SVM is presented to identify a higher order FIS. Unlike the polynomial FIS, the consequent of a fuzzy rule is a sparse expansion of the Gaussian kernel and the antecedent membership functions. The antecedent membership functions are built by the KFCM, and the consequent parameters are then optimized by the SVM, whose kernel is constructed from the fuzzy membership functions and the Gaussian kernel. Compared with the classical fuzzy models and several kernel-based fuzzy models, this algorithm not only avoids the high computational complexity of the polynomial FIS but also obtains better accuracy. According to the experimental results, the resulting higher order FIS also exhibits good generalization performance as well as satisfactory robustness.

Acknowledgments

The authors are grateful to the anonymous reviewers for their very helpful comments and constructive suggestions. This work was supported by the Natural Science Foundation of China (61070033), the Natural Science Foundation of Guangdong Province (9251009001000005), the Science and Technology Planning Project of Guangdong Province (2010B050400011, 2010B080701070, and 2008B080701005), the "Eleventh Five-year Plan" Program of the Philosophy and Social Science of Guangdong Province (08O-01), and the Opening Project of the State Key Laboratory of Information Security (04-01).

References

[1] K. Demirli, P. Muthukumaran, Higher order fuzzy system identification using subtractive clustering, J. Intelligent Fuzzy Syst. 9 (3) (2000) 129–158.
[2] L.J. Herrera, H. Pomares, I. Rojas, O. Valenzuela, A. Prieto, TaSe, a Taylor series-based fuzzy system model that combines interpretability and accuracy, Fuzzy Sets Syst. 153 (3) (2005) 403–427.
[3] S.K. Oh, W. Pedrycz, S.B. Roh, Genetically optimized fuzzy polynomial neural networks with fuzzy set-based polynomial neurons, Inf. Sci. 176 (23) (2006) 3490–3519.
[4] B.J. Park, W. Pedrycz, S.B. Roh, A design of genetically oriented fuzzy relation neural networks (FrNNs) based on the fuzzy polynomial inference scheme, IEEE Trans. Fuzzy Syst. 17 (6) (2009) 1310–1323.
[5] Y.X. Chen, J.Z. Wang, Support vector learning for fuzzy rule-based classification systems, IEEE Trans. Fuzzy Syst. 11 (6) (2003) 716–728.
[6] Y.X. Chen, J.Z. Wang, Kernel machines and additive fuzzy systems: classification and function approximation, in: Proceedings of the IEEE International Conference on Fuzzy Systems, St. Louis, MO, 2003, pp. 789–795.
[7] J.H. Chiang, P.Y. Hao, Support vector learning mechanism for fuzzy rule-based modeling: a new approach, IEEE Trans. Fuzzy Syst. 12 (1) (2004) 1–12.
[8] J.K. Leski, On support vector regression machines with linguistic interpretation of the kernel matrix, Fuzzy Sets Syst. 157 (2006) 1092–1113.
[9] C.F. Juang, C.D. Hsieh, S.J. Shiu, TS-fuzzy system-based support vector regression, Fuzzy Sets Syst. 160 (17) (2009) 2486–2504.
[10] R. Babuška, Fuzzy Modeling for Control, Kluwer Academic Publishers, Boston, 1998.
[11] J.R. Jang, C.T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, NJ, 1997.
[12] M. Sugeno, G.T. Kang, Structure identification of fuzzy model, Fuzzy Sets Syst. 28 (1988) 15–33.
[13] T. Takagi, M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. Syst. Man Cybern. 15 (1) (1985) 116–132.
[14] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[15] V.N. Vapnik, O. Chapelle, Bounds on error expectation for support vector machines, Neural Comput. 12 (9) (2000) 2013–2036.
[16] D.Q. Zhang, S.C. Chen, Clustering incomplete data using kernel-based fuzzy c-means algorithm, Neural Process. Lett. 18 (2003) 155–162.
[17] D.W. Kim, K.Y. Lee, D. Lee, K.H. Lee, Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern Recognition 38 (4) (2005) 607–611.
[18] D. Zhang, S. Chen, Fuzzy clustering using kernel method, in: Proceedings of the International Conference on Control and Automation, 2002, pp. 123–127.
[19] H. Shen, J. Yang, S. Wang, X. Liu, Attribute weighted Mercer kernel based fuzzy clustering algorithm for general non-spherical datasets, Soft Comput. 10 (11) (2006) 1061–1073.
[20] Z.D. Wu, W.X. Xie, J.P. Yu, Fuzzy c-means clustering algorithm based on kernel method, in: Proceedings of the International Conference on Computational Intelligence and Multimedia Applications, 2003, pp. 49–54.
[21] L. Zeyu, T. Shiwei, X. Jing, J. Jun, Modified FCM clustering based on kernel mapping, Int. Soc. Opt. Eng. 4554 (2001) 241–245.
[22] L. Zhang, C. Zhou, M. Ma, X. Liu, C. Li, C. Sun, M. Liu, Fuzzy kernel clustering based on particle swarm optimization, in: Proceedings of the IEEE International Conference on Granular Computing, 2006, pp. 428–430.
[23] S. Zhou, J. Gan, Mercer kernel fuzzy c-means algorithm and prototypes of clusters, in: Proceedings of the International Data Engineering and Automated Learning Conference, vol. 3177, 2004, pp. 613–618.
[24] Q.F. Cai, W. Liu, TSK fuzzy model using kernel-based fuzzy c-means clustering, in: Proceedings of the IEEE International Conference on Fuzzy Systems, 2009, pp. 308–312.
[25] D. Graves, W. Pedrycz, Kernel-based fuzzy clustering and fuzzy clustering: a comparative experimental study, Fuzzy Sets Syst. 161 (2010) 522–543.
[26] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. R. Soc. A 209 (1909) 415–446.
[27] S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman Scientific and Technical, 1988.
[28] M. Sugeno, T. Yasukawa, A fuzzy-logic-based approach to qualitative modeling, IEEE Trans. Fuzzy Syst. 1 (1) (1993) 7–31.
[29] W. Li, Y. Yang, A new approach to TS fuzzy modeling using dual kernel-based learning machines, Neurocomputing 71 (16–18) (2008) 3660–3665.

Qianfeng Cai received the B.E. degree in applied mathematics from Central China Normal University in 1996, the M.S. degree in applied mathematics from Beijing Normal University in 1999, and the Ph.D. degree in computer science from South China University of Technology in 2005. She is currently an associate professor at the Faculty of Applied Mathematics, Guangdong University of Technology. Her current research interests include computational intelligence, fuzzy systems, and data mining.

Zhifeng Hao received the B.Sc. degree in mathematics from Sun Yat-Sen University in 1990 and the Ph.D. degree in mathematics from Nanjing University in 1995. He is currently a professor in the Faculty of Computer Science, Guangdong University of Technology, and the School of Computer Science and Engineering, South China University of Technology. His research interests involve various aspects of algebra, machine learning, data mining, and evolutionary algorithms.

Xiaowei Yang received the B.S. degree in theoretical and applied mechanics, the M.Sc. degree in computational mechanics, and the Ph.D. degree in solid mechanics from Jilin University, Changchun, China, in 1991, 1996, and 2000, respectively. He is currently a professor in the Department of Mathematics, South China University of Technology. His current research interests include the design and analysis of algorithms for large-scale pattern recognition, imbalanced learning, semi-supervised learning, and evolutionary computation. He has published more than 80 journal and refereed international conference articles, covering the areas of structural reanalysis, interval analysis, soft computing, and support vector machines.