A GA-based fuzzy modeling approach for generating TSK models


Fuzzy Sets and Systems 131 (2002) 121 – 152

www.elsevier.com/locate/fss

S.E. Papadakis, J.B. Theocharis*

Power Systems Laboratory, Department of Electrical and Computer Engineering, Electronics and Computer Division, Aristotle University of Thessaloniki, Thessaloniki 54006, Greece

Received 29 June 2000; received in revised form 25 July 2001; accepted 12 September 2001

*Corresponding author. Tel.: +30-31-996-343; fax: +30-31-996-292. E-mail address: [email protected] (J.B. Theocharis).

Abstract

This paper proposes a new genetic-based modeling method for building simple and well-defined TSK models with scatter-type input partitions. Our approach manages all attributes characterizing the structure of a TSK model simultaneously. Particularly, it determines the number of rules, the input partition, the participating inputs in each rule and the consequent parameters. The model building process is divided into two phases. In phase one, the structure learning task is formulated as a multi-objective optimization problem which is resolved using a novel genetic-based structure learning (GBSL) scheme. Apart from the mean square error (MSE) and the number of rules, three additional criteria are introduced in the fitness function for measuring the quality of the partitions. Optimization of these measures leads to models with representative rules, small overlapping and efficient data cover. In order to obtain models with accurate data fitting and good local performance, the consequent parameters are determined using a local MSE function while the overall model is evaluated on the basis of a global MSE function. The search capabilities of the suggested structure learning scheme are significantly enhanced by including a highly effective local search operator implemented by a micro-genetic algorithm and four problem-specific operators. Finally, a genetic-based parameter learning (GBPL) scheme is suggested in phase two, which performs fine-tuning of the initial models obtained after structure learning. The performance of the proposed modeling approach is evaluated using a static example and a well-known dynamic benchmark problem. Simulation results demonstrate that our models outperform those suggested by other methods with regard to simplicity, model structure, and accuracy. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Fuzzy modeling; Genetic algorithms; Multi-objective optimization

1. Introduction

The fuzzy model suggested by Takagi et al. [29], also known as the TSK model, has gained increasing interest in theoretical analysis and


applications of fuzzy modeling and control. Let us consider a physical system with m input variables x = [x_1, x_2, ..., x_m]^T and a single output, y, with x_i ∈ D_i = [x_{i,min}, x_{i,max}] ⊆ R (i = 1,...,m) and y ∈ Y = [y_min, y_max] ⊆ R. The TSK model is composed of IF–THEN rules of the following form:

$$R^{(r)}:\ \text{IF } x_1 \text{ is } A_1^r \text{ AND } \cdots \text{ AND } x_m \text{ is } A_m^r\ \text{THEN } y^r = g^r(x) = \lambda_0^r + \lambda_1^r x_1 + \cdots + \lambda_m^r x_m \quad (r = 1,\dots,n), \qquad (1)$$




where n is the number of rules, x_i (i = 1,...,m) are the model inputs, and y^r is the output of rule R^(r). A_i^r (i = 1,...,m) are fuzzy sets in D_i characterized by membership functions μ_{A_i^r}(x_i), and λ_i^r are real coefficients of the rule polynomials g^r(x). The m-tuple (A_1^r, ..., A_m^r) of fuzzy sets in the premise part of R^(r) forms a fuzzy region A_1^r × ··· × A_m^r in D = D_1 × ··· × D_m, called the fuzzy hyper-cube (FHC) in this paper. The philosophy underlying this approach is to decompose the input space into fuzzy regions and to approximate the system output in every region by a simple sub-model. Due to their non-linearity and simple structure, the major advantage of TSK models is their capability to approximate highly complex systems using a small number of rules [29]. Furthermore, since the rule outputs are linear regression models, the consequent parameters can be calculated using conventional least-squares techniques.

Generally, the identification of TSK models can be divided into two phases, namely, the structure learning and the parameter learning phase. Structure learning entails the following sub-tasks: (a) determine the number of rules, (b) perform the input partition, that is, locate the FHCs within the premise space, select the participating input variables, and determine their membership functions, and (c) calculate the consequent parameters. Having determined an initial fuzzy model in phase one, parameter learning is employed in phase two to adjust the premise and the consequent parameters on the basis of an output error measure.

With regard to the partition issue, two partition approaches are usually considered, the grid partition and the scatter partition, typically shown in Fig. 1 for a two-dimensional space. The grid partition (Fig. 1(a)) exhibits two major drawbacks: Firstly, it suffers from the so-called "curse of dimensionality" problem, that is, the number of rules increases exponentially with the number of input variables. Secondly, since each membership function can participate in several rules, the rules are coupled to each other. This constraint aggravates the structure learning process and prevents locating the FHCs in the proper regions. In the scatter partition (Fig. 1(b)) the FHCs can be independently placed anywhere within the input space. Each FHC is regarded as a fuzzy cluster whose size and location are dictated by the input data distribution.

Fig. 1. The grid partition (a), and the scatter partition (b) of the input space.

Provided that certain quality objectives are considered, this approach can lead to fuzzy models with a small number of representative rules, good data cover and simple structure.

The objective of an effective learning method is to create an almost ideal TSK model exhibiting the following criteria:

C1. Number of rules: The rule base should contain a moderate number of fuzzy rules. This leads to simple and easily interpretable TSK models.


C2. Partition quality: An efficient partition dictates that the following issues should be properly addressed: Firstly, the fuzzy cells defined by the rules in the premise part should be concentrated on important regions in the data. Secondly, neighboring rules should not exhibit a large degree of overlapping. Finally, all data should be adequately covered so that the model is able to infer an output for each input. The above partition qualities lead to a TSK model with representative and descriptive rules.

C3. Locality: Following the philosophy of the TSK models, the consequent parameters should be determined in such a way that the input data belonging to a fuzzy region are described by a hyper-plane in the consequent part. This leads to TSK models with good local behavior, a feature that offers model transparency and easy interpretation by the user.

C4. Accuracy: The overall model should provide an accurate mapping of the input–output representations. This is achieved by minimizing a global output error measure function.

A great variety of modeling techniques are suggested in the literature for building TSK models. In [29,30,33,38] structure learning is formulated as an iterative non-linear optimization problem producing grid-type partition models. The methods suggested in [3,20] employ clustering techniques to build scatter-type models while a gradient descent algorithm is used for parameter learning. In [36] the premise and consequent identification are separately performed using a clustering algorithm and the orthogonal least-squares method, respectively. Neural networks are integrated with fuzzy logic in the form of fuzzy neural networks and used to construct fuzzy models [13,17]. Most existing algorithms determine the model structure by minimizing a global error function comprising the residuals of the overall model output and the desired outputs. Although good data fitting is attained, the resulting TSK model may exhibit a poor local performance. Most often, the rule sub-models show an erratic behavior that is difficult to interpret. In [39] the model parameters are estimated by minimizing a combined objective function that integrates both global and local learning.

Genetic algorithms (GAs) are evolutionary search techniques based on natural selection [9]. Because of their robustness and their ability to obtain global


optimum solutions, they are used to solve difficult optimization problems. Recently, GAs have been used to generate IF–THEN rules and adjust the parameters of fuzzy systems with application to control [5,11,12,18,21,34,35] and modeling [4,8,15,16,22,28]. Usually, conventional models of Mamdani's type are considered, such as the models suggested in [5,8,18,21,22,34,35], or TSK models with crisp consequents [4,16,28]. Recently, two papers [6,27] have appeared in the literature dealing with the subject of this paper, namely, the building of TSK models from examples for system identification. In [6] the learning process is divided into two stages: the generation and the refinement stages. The former stage determines a preliminary TSK model and is solved using a (μ, λ)-evolution strategy (ES) algorithm while the latter determines the final model with optimal global behavior using a hybrid GA-ES algorithm. The approach in [6] exhibits a significant drawback arising from the formulation in the generation stage. The authors employ the grid-type partition in that stage. In particular, each input variable is fuzzily partitioned by a term set containing a certain number of linguistic terms determined by the user. The candidate rules are formulated by considering all possible combinations of fuzzy sets along each variable. From this pool of candidate fuzzy rules, those rules are selected which contain certain input data within the respective fuzzy regions. For each rule, a (μ, λ)-ES-based TSK rule consequent learning method is used to determine the consequent parameters by minimizing a local error measure. Given the predetermined premise structure, the generation stage actually performs only the consequent parameter learning task. No effort is devoted towards searching for and developing the proper number of representative rules within the premise space. As a result, for all examples examined by the authors, even those with two inputs, the resulting models are composed of a large number of fuzzy rules. Hence, due to the increased complexity of the preliminary model, the quality of the attained models after the refinement stage is seriously reduced with regard to simplicity, locality, interpretability and transparency.

In [27] a two-stage procedure is suggested for generating TSK models used in identification and classification problems. An initial model with an


over-determined number of rules is derived at stage one using the fuzzy c-means clustering method (FCM). As regards the premise part, a scatter-type partition is achieved and univariate triangular fuzzy sets are obtained by projecting the rows of the fuzzy partition matrix along each axis. Furthermore, for the partition obtained by FCM, the consequent parameters are determined using the least-squares estimate (LSE) method. Next, an optimized version of the initial model is attained during the second stage (refinement stage) by applying a real-coded GA. In order to preserve the local behavior of the initial model, the GA is subjected to partition and search space constraints. The fitness function includes only the global error measure, that is, global optimization of the model is conducted. The final TSK model is decided through an iterative procedure involving the user preferences and inspection of the model quality currently available. Specifically, when fuzzy sets with large overlapping are observed, a simplification mechanism is activated based on a similarity criterion, which reduces the number of fuzzy sets and possibly the fuzzy rules. Then, the refinement stage is applied again to optimize the newly developed initial model. The above process is repeated until a final model with acceptable structure is attained. With regard to this approach the following comments are in order: Firstly, the use of triangular membership functions, having a single-point core region, is not a proper choice for TSK models. As a result, the fuzzy sets finally obtained exhibit large base values. Secondly, the determination of representative fuzzy regions and the distinguishability of the fuzzy sets are only implicitly controlled through the constrained search space and the application of the simplification procedure. Finally, the key issue of the number of rules is again heuristically handled through the repeated application of the simplification–refinement routine. However, since the optimization during the refinement stage is conducted on the basis of a global error measure and the number of rules is not part of the evolutionary process, it cannot be guaranteed that a correct TSK model with good local behavior will finally be obtained.

Generally, the GA-based modeling techniques suggested in the literature exhibit the following drawbacks and limitations. (a) Most often, they employ grid-type partitions, thus suffering from the "curse of

dimensionality" problem. (b) The majority of suggested methods treat only one or two of the structure learning sub-tasks. In most cases, the input partition is predetermined or the premise and the consequent identification are separately performed. (c) The fitness function to be optimized is usually formulated by the global/local mean square error (MSE) function or as a combination of the above with the number of rules. There is a lack of additional criteria measuring the quality of the resulting partitions. (d) Finally, simple GAs are used, including only the standard genetic operators, which are unable to cope with difficult optimization problems.

It should be noted that the learning sub-tasks are related to each other. In that respect, a separate manipulation of them may lead to sub-optimal solutions. For example, the determination of the proper number of rules is highly related to the size, the location and the overlapping of the fuzzy regions in a partition. Moreover, the locality requirement of TSK modeling suggests that a fuzzy partition should identify the linear regions of the output with respect to the inputs. Generally, there are two main alternatives for the formulation of the structure identification problem, which is the most important part of TSK modeling. The former approach is to consider a relaxed formulation of the problem by including just a few features in the fitness function, such as the MSE. In that case, a sub-optimal structure is attained or the proper structure of the model is obtained through a heuristic approach, as in [27]. The second approach is to formulate a more complex fitness function by including in some way the quality criteria C1–C4 of an ideal TSK model. This reflects the increased requirements of TSK models as compared with Mamdani-type fuzzy models. In that case, the optimization problem is aggravated and enhanced genetic schemes are required to achieve the search task.

In this paper, we propose a new modeling method specially designed for generating simple and well-defined TSK models, following the second approach mentioned above. The model building process is composed of two phases. In phase one, a genetic-based structure learning (GBSL) algorithm is suggested to identify the proper structure of the model. The GBSL algorithm is an integrated hybrid genetic scheme with the following innovations:


• The structure learning task is formulated as a multi-objective optimization problem. Apart from the global MSE used in [6,27] and the number of rules, three additional partition criteria are introduced in the fitness function: the typicality, the entropy and the cardinality functions. Optimization of these measures leads to partitions with representative rules and small overlapping between the FHCs.
• The suggested encoding scheme combines the benefits of a variable-length chromosome with those of a constant-length chromosome. The former permits evaluating models with a different number of rules while the latter allows the genetic operators to apply in a unified way. The chromosomes provide scatter-type partitions without the limitations and the assumptions imposed by other methods. In contrast to the above formulation, the ES and real-coded GA used in [6,27] prevent including the number of rules in the fitness function.
• In order to obtain TSK models with correct local behavior and good global performance, the model parameters are calculated following a global model evaluation–local parameter determination approach during structure learning. Accordingly, the consequent parameters are determined using a local MSE function while the overall model is evaluated on the basis of a global MSE function. This is opposed to [6] where only the local error measure is employed for the derivation of the preliminary model.
• The consequent parameters of each individual are determined using a weighted RLSE method. Thus, instead of using the ES-based rule consequent learning method in [6], the optimal consequent parameters are obtained after a single algorithmic pass.
• An effective weight adaptation scheme is suggested which changes the objective weights in the fitness function and dynamically determines the search direction according to a priority schedule.
• The search capabilities of GBSL are greatly enhanced by including a suggested local search operator, the micro-genetic algorithm (MGA). Furthermore, four problem-specific operators are developed which are tailored to the search needs of the problem. Extensive experimentation demonstrates the effectiveness of the suggested operators.


A genetic-based parameter-learning (GBPL) algorithm with enhanced training capabilities is proposed in phase two, which adjusts the parameters of the initial model. The algorithm includes a novel mechanism that determines the search space of the main GA dynamically during evolution, as well as the MGA operator.

This paper is organized as follows. In Section 2, the premise space representation is described in terms of the FHCs. Section 3 illustrates the partition encoding scheme. Section 4 describes the construction of the fitness function while in Section 5 the resolution of the structure learning problem is presented. In Section 6 the parameter learning algorithm is given and Section 7 presents the simulation results. Finally, Section 8 concludes the work conducted in this paper.

2. Premise space representation

2.1. Formulation of the fuzzy hyper-cube space

Each universe D_i = [x_{i,min}, x_{i,max}] (i = 1,...,m) is discretized by formulating a set D̄_i of p_i > 1 points uniformly distributed along D_i:

$$\bar D_i = \{d_{i,j} \in D_i \mid j = 1,\dots,p_i\}, \qquad d_{i,j} = x_{i,\min} + (j-1)\,\frac{x_{i,\max} - x_{i,\min}}{p_i - 1}, \qquad (2)$$

where p_i is an integer controlling the discretization accuracy. The overall set D̄ = D̄_1 × ··· × D̄_m defines a grid network over the input space D = D_1 × ··· × D_m. An example of such a grid for a two-dimensional space, D_1 = D_2 = [0,1], is shown in Fig. 2, where p_1 = p_2 = 7. In order to introduce the fuzzy hyper-cube concept we proceed along the following steps:

Step 1: Consider the set L_i comprising all possible crisp intervals ℓ_i^r uniquely defined over D̄_i:

$$L_i = \{\ell_i^r = (u_i^r, v_i^r) \mid u_i^r, v_i^r \in \bar D_i,\ u_i^r < v_i^r,\ r = 1,\dots,\mathrm{card}(L_i) = p_i(p_i - 1)/2\}, \qquad (3)$$

where u_i^r and v_i^r are the starting and the ending points of ℓ_i^r, and card(L_i) is the cardinal number of L_i determined by p_i.
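To make the construction concrete, the following Python sketch enumerates the discretization grid of (2) and the candidate crisp intervals of (3); the helper names are illustrative, not from the paper.

```python
import itertools

def discretize(x_min, x_max, p):
    """Eq. (2): p uniformly spaced points spanning D_i = [x_min, x_max]."""
    return [x_min + j * (x_max - x_min) / (p - 1) for j in range(p)]

def candidate_intervals(grid):
    """Eq. (3): every crisp interval (u, v) with u < v drawn from the grid."""
    return list(itertools.combinations(grid, 2))

grid = discretize(0.0, 1.0, p=7)       # the p_i = 7 grid of Fig. 2's example
L = candidate_intervals(grid)
assert len(L) == 7 * (7 - 1) // 2      # card(L_i) = p_i (p_i - 1) / 2 = 21
```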


Step 2: Fuzzify the crisp intervals in L_i, that is, for each crisp interval ℓ_i^r ∈ L_i (i = 1,...,m) associate a fuzzy set ℓ̃_i^r in D_i defined by ℓ̃_i^r = {(x_i, φ_i^r(x_i)) | x_i ∈ D_i}. The membership function φ_i^r(x_i) is parameterized by θ_i^r = (a_i^r, b_i^r, c_i^r) and is given by

$$\varphi_i^r(x_i) = \left[\, 1 + \left| \frac{x_i - c_i^r}{a_i^r} \right|^{2 b_i^r} \right]^{-1}, \qquad (4)$$

where c_i^r is the central value of φ_i^r(x_i), a_i^r determines the crossover points at c_i^r ± a_i^r, and b_i^r is the membership fuzzifier. θ_i^r are determined so that the crossover points of ℓ̃_i^r coincide with the end-points of the crisp interval ℓ_i^r: c_i^r − a_i^r = u_i^r and c_i^r + a_i^r = v_i^r. Accordingly, c_i^r and a_i^r are derived by

$$c_i^r = \tfrac{1}{2}(u_i^r + v_i^r) \quad\text{and}\quad a_i^r = \tfrac{1}{2}(v_i^r - u_i^r). \qquad (5)$$

The collection of all ℓ̃_i^r (i = 1,...,m) forms a set L̃_i whose cardinality equals that of L_i. The fuzzy set ℓ̃_i^r ∈ L̃_i obtained by (5) covers fuzzily the crisp interval ℓ_i^r ∈ L_i; particularly, ℓ_i^r is the 0.5-cut set of ℓ̃_i^r. Hence, there is a one-to-one mapping between the elements of L_i and those of L̃_i.

Step 3: A FHC denoted by h̃^r can be represented as a combination of m fuzzy sets, h̃^r = (h̃_1^r, ..., h̃_m^r) with h̃_i^r ∈ L̃_i (i = 1,...,m). h̃^r is a multi-dimensional fuzzy set ℓ̃_1^r × ··· × ℓ̃_m^r in D, formally described by h̃^r = {(x, μ^r(x)) | x ∈ D}. The membership function of h̃^r is determined as follows:

$$\mu^r(x) = \prod_{i=1}^{m} \varphi_i^r(x_i). \qquad (6)$$

For each x_k belonging to a given data set, (6) assigns a membership grade μ^r(x_k) ∈ [0,1] in h̃^r. The FHC h̃^r is fully parameterized by θ^r = {θ_1^r, ..., θ_m^r}, which comprises all vectors θ_i^r of the memberships pertaining to the formulation of h̃^r. Furthermore, h̃^r can be interpreted as a composite fuzzy proposition of the form

$$\tilde h^r = \tilde\ell_1^r \ \text{AND}\ \tilde\ell_2^r \ \text{AND} \cdots \text{AND}\ \tilde\ell_m^r, \qquad (7)$$

with the AND operator implemented by the algebraic product. The collection of all FHCs formulates the set H̃, regarded as the space of FHCs:

$$\tilde H = \tilde L_1 \times \cdots \times \tilde L_m = \left\{ \tilde h^r = (\tilde\ell_1^r, \dots, \tilde\ell_m^r) \ \Big|\ r = 1,\dots,\mathrm{card}(\tilde H) = \prod_{i=1}^{m} \mathrm{card}(L_i) \right\}, \qquad (8)$$

where card(H̃) denotes the maximum number of FHCs defined in D for given p_i (i = 1,...,m). The notion of FHCs provides a flexible means for representing composite fuzzy sets. The location of h̃^r is determined by selecting the participating ℓ_i^r = (u_i^r, v_i^r) while its parameters θ^r are calculated using (5). Some interesting properties of FHCs are listed below:

• h̃^r can be regarded as a fuzzy cluster located at c^r = (c_1^r, ..., c_m^r) with the possibility distribution described by μ^r(x).
• The shape of μ^r(x) is determined by b_i^r. For large values of b_i^r we have ℓ̃_i^r ≡ ℓ_i^r; hence, h̃^r becomes approximately a crisp hyper-cube, h̃^r ≅ h^r = (ℓ_1^r, ..., ℓ_m^r), ℓ_i^r ∈ L_i (i = 1,...,m).
• The core region of h̃^r, core(h̃^r) = h̃_{1.0}^r ⊂ h^r, includes all data points with μ^r(x_k) ≅ 1, namely, the most representative points of h̃^r located around c^r.
• The most informative part of h̃^r is determined by h̃_{0.5}^r, including the data points with μ^r(x_k) ≥ 0.5. It can be seen that, approximately, h̃_{0.5}^r is determined by the area enclosed by h^r.

Fig. 2. Illustrative representation of a FHC of a class-3 partition.
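A minimal sketch of (4)–(6): fuzzifying one crisp interval per axis and taking the algebraic product as the FHC membership grade. Function names are ours, and b defaults to 2 purely for illustration.

```python
import numpy as np

def phi(x, u, v, b=2.0):
    """Eqs. (4)-(5): bell-shaped membership of x in the fuzzified (u, v)."""
    c, a = 0.5 * (u + v), 0.5 * (v - u)    # centre and half-width, eq. (5)
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

def fhc_membership(x, intervals, b=2.0):
    """Eq. (6): algebraic-product AND of the per-axis memberships."""
    return float(np.prod([phi(xi, u, v, b)
                          for xi, (u, v) in zip(x, intervals)]))

# Grade of a sample in a 2-D FHC built from one crisp interval per axis.
mu = fhc_membership([0.4, 0.7], intervals=[(0.2, 0.6), (0.5, 1.0)])
```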


The above notions are illustrated in Fig. 2, showing the fuzzy sets ℓ̃_1^r, ℓ̃_2^r and the contour of h̃_{0.5}^r.

2.2. Partition of the premise space

A collection U_n = {h̃^1, h̃^2, ..., h̃^n}, h̃^i ∈ H̃ (i = 1,...,n), defines a scatter-type fuzzy partition described by the parameter vector θ_n = (θ^1, ..., θ^n). The integer n indicating the number of FHCs participating in the partition is called the partition class. Accordingly, U_n will be referred to in the following as a class-n partition. For example, the partition shown in Fig. 2 is a class-3 partition. From the pool of n_0 = card(H̃) FHCs defined in H̃ we can define $\binom{n_0}{n}$ different class-n partitions, which formulate the class-n fuzzy partition space:

$$M_f(n_0, n) = \left\{ (U_n^{(k)}, \theta_n^{(k)}) \ \Big|\ k = 1,\dots,\mathrm{card}(M_f(n_0, n)) = \binom{n_0}{n} \right\}. \qquad (9)$$
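To give a feel for the size of these spaces, a quick count under the p_1 = p_2 = 7, m = 2 setting of Fig. 2 (a sketch; the variable names are ours):

```python
from math import comb

p, m = 7, 2
card_L = p * (p - 1) // 2    # eq. (3): 21 candidate intervals per axis
n0 = card_L ** m             # eq. (8): card(H~) = 441 candidate FHCs
n_class3 = comb(n0, 3)       # eq. (9): number of distinct class-3 partitions
print(n0, n_class3)
```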

Finally, considering all possible fuzzy partitions U_n with class n = 1,...,n_0, we are led to a partition set

$$\bar M_f(n_0) = \bigcup_{n=1}^{n_0} M_f(n_0, n) \qquad (10)$$

comprising $K = \binom{n_0}{1} + \binom{n_0}{2} + \cdots + \binom{n_0}{n_0} = 2^{n_0} - 1$ fuzzy partitions defined on H̃. The partitions defined above offer a number of significant attributes: Firstly, there is no constraint with regard to the location and the size of the participating FHCs. Secondly, each FHC is independently controlled, that is, modifying a particular FHC leaves the remaining objects intact.

2.3. Interpretation of the TSK fuzzy models

It can be seen from (1) that, selecting A_i^r ≡ ℓ̃_i^r and μ_{A_i^r}(x_i) = φ_i^r(x_i) (i = 1,...,m), the TSK rule can be reformulated as follows:

$$R^{(r)}:\ \text{IF } x \text{ is } \tilde h^r\ \text{THEN } y^r = g^r(x, \lambda^r) = X^T \lambda^r, \qquad (11)$$

where X^T = [1, x_1, ..., x_m] and λ^r = [λ_0^r, λ_1^r, ..., λ_m^r]^T are the consequent parameters. The membership function μ^r(x_k) of h̃^r is given by (6). The fuzzy rule (11) provides a link R^(r): h̃^r → y^r (r = 1,...,n) between the premise and the consequent space, respectively. The output of the TSK model is a fuzzy blending of the local sub-models, y^r = g^r(x, λ^r), and is calculated using the center-of-average defuzzification method:

$$y = \sum_{r=1}^{n} \bar\mu^r(x)\, y^r, \qquad (12)$$

where μ̄^r(x) are the normalized firing strengths defined by

$$\bar\mu^r(x) = \frac{\mu^r(x)}{\sum_{j=1}^{n} \mu^j(x)}, \quad r = 1,\dots,n. \qquad (13)$$

The rule base is defined by selecting a fuzzy partition {U_n, θ_n} where each FHC corresponds to a fuzzy rule and the number of rules is equal to the partition class. It should be noted that, given {U_n, θ_n}, the model output is linear with respect to the consequent parameters. Hence, based on a training set, Λ_n can be optimally calculated, and a TSK model is functionally represented by FM{U_n, θ_n, Λ_n}. The primary task of fuzzy modeling is to find the optimal model FM{U_n*, θ_n*, Λ_n*}. The optimal partition {U_n*, θ_n*} in M̄_f(n_0) should satisfy the following properties simultaneously:

• Model accuracy: This property refers to the ability of the model to perform accurate matching of the input–output data (x_k, y_k). The partition strategy should meet the philosophy of a TSK model, which associates fuzzy clusters in the premise space with hyper-planes in the consequent part. Accordingly, h̃^r ∈ U_n* (r = 1,...,n) should be properly located so that the following condition is fulfilled: for those inputs x_k strongly exciting h̃^r, the corresponding outputs y_k will be approximated by the hyper-plane y^r = X^T λ^r.
• Partition quality: In addition to model accuracy, {U_n*, θ_n*} should exhibit certain quality criteria, such as adequate cover of the input data, minimum number of rules, minimum typicality and entropy of the partition. These are significant attributes leading to an ideal TSK model with simple structure and representative rules.
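Putting (6) and (11)–(13) together, a sketch of a forward pass through an FM{U_n, θ_n, Λ_n} model follows; the rule data layout is an assumption made for illustration.

```python
import numpy as np

def tsk_output(x, rules, b=2.0):
    """Eqs. (11)-(13): fire each rule, normalize, blend the local models.

    Each rule is (intervals, lam): the per-axis crisp intervals of its FHC
    and the consequent vector lam = [lam_0, lam_1, ..., lam_m].
    """
    x = np.asarray(x, dtype=float)
    X = np.concatenate(([1.0], x))                     # X^T = [1, x_1, ..., x_m]
    mu = []
    for intervals, _ in rules:
        g = 1.0
        for xi, (u, v) in zip(x, intervals):
            c, a = 0.5 * (u + v), 0.5 * (v - u)        # eq. (5)
            g *= 1.0 / (1.0 + abs((xi - c) / a) ** (2 * b))  # eqs. (4), (6)
        mu.append(g)
    mu = np.array(mu)
    mu_bar = mu / mu.sum()                             # eq. (13)
    y_local = np.array([X @ lam for _, lam in rules])  # eq. (11)
    return float(mu_bar @ y_local)                     # eq. (12)

rules = [([(0.0, 0.5), (0.0, 0.5)], np.array([0.1, 1.0, -0.5])),
         ([(0.5, 1.0), (0.5, 1.0)], np.array([0.3, -0.2, 0.8]))]
y = tsk_output([0.4, 0.6], rules)
```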


2.4. Model building method

The model building process can be divided into two phases: structure learning and parameter learning. In phase one, a genetic-based structure learning (GBSL) scheme is suggested for finding the optimal structure of a TSK model, namely, determining the necessary number of rules and a well-defined fuzzy partition. In phase two, the parameters of the initial model obtained by GBSL are fine-tuned using a novel genetic-based parameter learning (GBPL) algorithm.

GAs are robust search optimization algorithms based on the principles of natural selection and genetic evolution. They provide a powerful tool for solving difficult, ill-defined problems [9,26,32]. Generally, a GA implementation can be represented by GA = {P_GA(N_pop), O_GA, R_GA, Q_GA}, where P_GA is the GA population comprising N_pop individuals, R_GA is the generation run limit dictated by a termination condition, Q_GA is the fitness function, and O_GA is the set of genetic operators. For an efficient GA scheme the following major issues should be properly addressed:

• Select a scheme for encoding the candidate solutions in the population, referred to as chromosomes or genotypes.
• Formulate a fitness function that evaluates each chromosome and rates it relative to other members of the current population.
• Select the genetic operators in O_GA which are used to manipulate the population composition. Standard genetic operators include parent selection, mutation and crossover.

3. Partition encoding

In this section we present the encoding scheme used by GBSL during the structure learning phase.

3.1. Encoding scheme

Each chromosome of the population represents a fuzzy partition, that is, it encodes the locations of the participating FHCs as well as the partition class. Let θ_0 denote the maximum number of FHCs encoded in a chromosome.

θ_0 is a design parameter taking a much smaller value compared to n_0, the maximum number of FHCs included in H̃ (θ_0 ≪ n_0). The chromosome is formulated as a sequence of two binary sub-strings, S_{θ_0} = {E_{θ_0}, P_{θ_0}}, where P_{θ_0} and E_{θ_0} denote the parameter encoding part and the rule existence part, respectively. P_{θ_0} encodes θ_0 FHCs, h̃^r (r = 1,...,θ_0), while E_{θ_0} encodes the number of the FHCs included in a partition. h̃^r is indirectly described in P_{θ_0} by encoding the end-points u_i^r, v_i^r ∈ D_i of the crisp intervals ℓ_i^r ∈ L_i along each input x_i (i = 1,...,m), which corresponds to determining the location of h^r. Keeping b_i^r fixed and through (5), this is equivalent to encoding c_i^r and a_i^r. As shown in Fig. 3, the sub-string P_{θ_0} is a linear list of genes, P_{θ_0} = S^1 ··· S^r ··· S^{θ_0}, where the gene S^r encodes h̃^r, taking the form S^r = s_1^r ··· s_m^r. Finally, s_i^r (i = 1,...,m) encodes the values u_i^r and v_i^r of ℓ_i^r along x_i. Since u_i^r and v_i^r are allowed to take p_i possible values in D_i, 2 log_2(p_i) binary bits are required for the encoding of ℓ_i^r in s_i^r. The string P_{θ_0} ∈ {0,1}^{κ_0} comprises a total number of κ_0 = 2θ_0 Σ_{i=1}^m log_2(p_i) bits.

It can be seen that P_{θ_0} produces class-θ_0 partitions U_{θ_0} ∈ M_f(θ_0, θ_0), that is, partitions composed of exactly θ_0 FHCs. Apparently, this form of the genotype string does not allow us to determine the optimal number of rules, a significant design objective in our approach. Therefore, the genotype string is augmented by inserting the rule existence part E_{θ_0}, as shown in Fig. 4. E_{θ_0} consists of θ_0 binary bits, E_{θ_0} = e_1 e_2 ··· e_{θ_0}, where each e_i ∈ {0,1} (i = 1,...,θ_0), called the rule existence bit, is associated with a FHC encoded in P_{θ_0}. The FHCs with e_i = 1 belong to the resulting partition and are regularly considered in the chromosome decoding, while those with e_i = 0 are ignored. The modified genotype form provides variable-class partitions that encode θ_0 FHCs at most; in particular, all class-1, class-2, ..., class-θ_0 partitions are produced, belonging to the partition space M̄_f(θ_0). Assuming that n FHCs (1 ≤ n ≤ θ_0) are active (e_i = 1), we are led to a solution S_{θ_0}(n) = {E_{θ_0}, P_{θ_0}} that represents a class-n partition U_n = {h̃^1, h̃^2, ..., h̃^n}. The participating FHCs are encoded by P_{θ_0} while the partition class is encoded by E_{θ_0}. Note that S_{θ_0}(n) corresponds to a TSK model FM{U_n, θ_n, Λ_n} comprising n fuzzy rules of the type described in (11). Hence, the GA actually evolves a population of candidate fuzzy models which are evaluated on the basis of their fitness function value.
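A sketch of how such a chromosome could be decoded, assuming each p_i is a power of two so that n_b = log_2(p_i) is integral; the function and variable names are ours.

```python
def decode_chromosome(bits, theta0, m, p):
    """Decode S = {E, P} into the grid-index intervals of the active FHCs.

    bits: 0/1 list of length theta0 + 2*theta0*m*nb, rule existence part first.
    Returns one list of m (u, v) index pairs per active existence bit.
    """
    nb = p.bit_length() - 1                 # log2(p) when p is a power of two
    E, P = bits[:theta0], bits[theta0:]
    step = 2 * m * nb                       # bits per gene S^r
    fhcs = []
    for r in range(theta0):
        if E[r] == 0:                       # inactive FHC: skipped on decoding
            continue
        gene = P[r * step:(r + 1) * step]
        intervals = []
        for i in range(m):
            s = gene[2 * i * nb:(2 * i + 2) * nb]   # sub-string s_i^r
            u = int("".join(map(str, s[:nb])), 2)
            v = int("".join(map(str, s[nb:])), 2)
            intervals.append((u, v))        # raw indices; repairs come later
        fhcs.append(intervals)
    return fhcs

# Fig. 5's setting theta0 = 3, m = 2, p = 4 gives 3 + 24 = 27 bits in total.
```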


Fig. 3. The parameter encoding part of a chromosome.


Fig. 4. Representation of a chromosome including the rule existence part and the parameter encoding part.

3.2. Adaptation of θ_0

Initially, θ_0 is heuristically selected based on expert knowledge with regard to the necessary number of rules. In the following, as the evolution proceeds, θ_0 is dynamically adapted to the search needs of the problem. Specifically, we observe the chromosome of the best-fitness solution (elite) achieved over a succession of G (say 100) generations. If during that period the elite solution remains unchanged and a large percentage (over 80%) of its rule existence bits are active, this is a strong indication that the GA tries to find a model requiring more than θ_0 rules. In that case, θ_0 is doubled, θ_0 ← 2θ_0, and the chromosome size is accordingly modified. In our simulations, θ_0 is initially set to 10. The above schedule is dictated by the following arguments. Firstly, selecting a large θ_0 increases the chromosome size, which aggravates the search process and reduces the possibility of attaining the optimal fuzzy model. Secondly, in compliance with the fuzzy model principles, the primary optimization objective is to identify a model with a limited number of representative rules.
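A sketch of this doubling schedule, with the bookkeeping layout assumed purely for illustration:

```python
def maybe_double_theta0(theta0, elites, G=100, load=0.8):
    """Section 3.2 schedule: double theta0 when the elite looks saturated.

    elites: one (bits, E) pair per generation, where bits is the whole elite
    chromosome and E its rule-existence sub-string (this layout is assumed).
    """
    if len(elites) < G:
        return theta0
    recent = elites[-G:]
    unchanged = all(bits == recent[0][0] for bits, _ in recent)
    active = sum(recent[-1][1]) / len(recent[-1][1])
    if unchanged and active > load:    # elite stuck with >80% rules active
        return 2 * theta0              # theta0 <- 2*theta0; resize chromosome
    return theta0
```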


3.3. Chromosome repairs

In order to assure feasibility of the obtained partitions, each s_i^r ∈ P_{θ_0} should encode a crisp interval ℓ_i^r = (u_i^r, v_i^r) ∈ L_i with u_i^r < v_i^r. Therefore, during decoding we apply the following chromosome repairs:

• If u_i^r > v_i^r, the genotypes encoding u_i^r and v_i^r are mutually interchanged in s_i^r.
• If u_i^r = v_i^r, then u_i^r and v_i^r are set to u_i^r = x_{i,min} and v_i^r = x_{i,max}, respectively. As a result, we have ℓ_i^r = ℓ̃_i^r = D_i, and the fuzzy proposition "x_i is ℓ̃_i^r" is interpreted as "x_i is everything". In this case, it can be easily shown that (1) is equivalent to the form

$$R^{(r)}:\ \text{IF } x_1 \text{ is } \tilde\ell_1^r \text{ AND } \cdots \text{ AND } x_m \text{ is } \tilde\ell_m^r\ \text{THEN } y^r = g^r(x) = \lambda_0^r + \lambda_1^r x_1 + \cdots + \lambda_i^r x_i + \cdots + \lambda_m^r x_m, \qquad (14)$$

where the input x_i participates in the consequent part but is ignored in the premise part of R^(r). Assuming that the system output y is linear with respect to x_i, then "x_i is everything" holds for all fuzzy rules and x_i is excluded from the premise part of the rule base. The suggested chromosome repair is specially designed with the intention to produce solutions including the proposition "x_i is everything" in certain rules. The amount of such individuals in the population is determined by the probability with which u_i^r = v_i^r occurs. Actually, it implements a genetic operator that assists the GA to identify the linear inputs of the system, as shown in the simulation results.
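A sketch of both repairs operating on the decoded grid indices, with 0 and p − 1 standing for x_{i,min} and x_{i,max}; naming is ours.

```python
def repair_interval(u, v, p):
    """Section 3.3 repairs for one axis; returns (u, v, is_everything)."""
    if u > v:                  # wrong order: interchange the two genotypes
        return v, u, False
    if u == v:                 # degenerate interval: widen to the whole domain,
        return 0, p - 1, True  # i.e. the proposition "x_i is everything"
    return u, v, False
```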

3.4. Application of the genetic operators

In order to illustrate the function of the mutation and the crossover operators, we consider a two-dimensional space D_1 × D_2, D_1 = D_2 = [0,1], where for simplicity we select θ_0 = 3 and p_1 = p_2 = 4. Fig. 5(a) shows a parent chromosome with 27 bits encoding a class-3 partition. Fig. 5(b) and (c) show the resulting partitions when mutation takes place at the rule existence part and the parameter encoding part, respectively. In the former case, mutation affects the partition class, that is, it changes the number of FHCs in the offspring partition. In the latter case, a different partition is obtained with the same number of FHCs, that is, the class of the parent partition is kept intact.

Crossover provides a mechanism for mixing the attributes of two parent partitions and producing two new partitions. In Fig. 6 the parent chromosomes, encoding a class-3 and a class-2 partition, respectively, are combined using a two-point crossover operator. In this example, the first crossover site occurs at the rule existence part while the second one occurs at a position belonging to the parameter encoding part of the chromosomes. Under such a condition, crossover affects both the class and the topology of the offspring partitions. When both crossover sites occur either at the rule existence part or at the parameter encoding part, then crossover produces similar results to those mentioned above for the mutation operator. Particularly, in the former case, crossover changes only the partition class while the existing FHCs remain intact. In the latter case, crossover produces two new partitions with the same partition classes as those of the parents.

The suggested partition encoding method exhibits the following attributes:

• Flexibility: The encoding scheme combines the benefits of constant- and variable-size chromosomes. The former enables standard mutation and crossover operators to apply in a unified manner providing feasible solutions, while the latter permits producing models with a variable number of rules.
• Diversity: The genetic composition provides a wide variety of individuals with respect to the partition class and the location of the participating FHCs. The population diversity increases the efficiency of the genetic evolution and assists the GA to find the optimum solution, that is, identify the proper fuzzy model fulfilling the design objectives.
• Searchability: It can be easily realized that, after chromosome repairs, S_{θ_0}(n) can implement any solution belonging to the partition space M̄_f(θ_0). Hence, the search space is thoroughly investigated, provided that the GA evolves for a sufficiently large number of generations.


Fig. 5. A typical chromosome encoding a class-3 partition (a), and the resulting partitions when mutation takes place in the rule existence part (b) and the parameter encoding part (c).

4. Construction of the fitness function

The fitness function assigns a positive real value to each chromosome according to the quality of the resulting model, Q : {0,1}^{κ_0} → R^+. The fitness function should be properly designed so that solutions with high fitness values exhibit all those attributes that characterize an ideal TSK model.

4.1. Output error measure

4.1.1. Global model evaluation

Our primary concern in building a TSK model FM{U_n, θ_n, Λ_n} is how well the model can approximate a real system.

For a given partition {U_n, θ_n}, the above task is achieved by determining the consequent parameters Λ_n so that the following objective function is minimized:

$$E_G(U_n, \theta_n, \Lambda_n) = \mathrm{MSE}_{\text{Global}} = \frac{1}{q} \sum_{k=1}^{q} (y_k^d - \hat y_k)^2, \qquad (15)$$

where y_k^d is the desired output, ŷ_k is the model output when it is excited by x_k, and q is the number of training data.


Fig. 6. A two-point crossover operator acting upon two parent partitions.

E_G measures the ability of the identified model to perform efficient data matching and is expressed as a sum of the residuals between the real system outputs and the overall outputs of the fuzzy model given by (12). Hence, minimization of (15) leads to a global learning scheme. Global learning algorithms calculate the model parameters using the whole training data set in a single algorithmic pass. Most of the existing learning algorithms suggested in the literature are global learning algorithms; they try to minimize (15) using a variety of techniques such as the gradient descent method [17] and genetic algorithms [8]. Although global learning can build a TSK model with an arbitrary level of accuracy, it cannot guarantee that the model has a good local performance.

The TSK model follows the multi-model approach where each rule R^(r) is regarded as a local linear sub-model (h̃^(r)(θ^r) → g^r(x, λ^r)) defined in the fuzzy region covered by h̃^r. Accordingly, a locally well-behaved TSK model suggests that λ^r is determined so that the input points belonging to h̃^r are approximated by the hyper-plane g^r(x, λ^r) in the output space. The difference between global and local learning is illustrated in Fig. 7. In Fig. 7(a) the input–output data are approximated by a fuzzy model with two rules where the parameters of the straight lines are determined by a global learning algorithm. Apparently, the orientation of the lines does not fit the data distribution and, thus, the behavior of the local sub-models is difficult to interpret. The same data set is approximated in Fig. 7(b) by a fuzzy model with three rules determined via a local learning algorithm.


Fig. 7. Typical TSK models following the global learning (a) and the local learning (b) approach.

This model exhibits a good local behavior; the number of rules, the location of the fuzzy sets and the orientation of the straight lines are determined so that the linear sub-models cover the linear segments of the data.

4.1.2. Local parameter determination

In order to improve the interpretability of the obtained TSK models and gain insights into the local behavior of the models, Λ_n are derived by minimizing a local error measure defined as follows:

$$E_L(U_n, \theta_n, \Lambda_n) = \sum_{r=1}^{n} \sum_{k=1}^{q} w_r(x_k)\, E_{rk}(\lambda^r) = \sum_{r=1}^{n} \sum_{k=1}^{q} w_r(x_k)\, (y_k^d - y_k^r)^2, \qquad (16)$$

where E_{rk}(λ^r) is an error measure of the local sub-model g^r(x_k, λ^r) as an approximation to y_k^d. Minimization of (16) suggests that, locally, each rule is forced to produce the whole output rather than a part of it. The contribution of each data point to the derivation of λ^r is determined by the local weight w_r(x_k) = μ^r(x_k), 0 ≤ w_r(x_k) ≤ 1. Accordingly, the orientation of g^r(x_k, λ^r) is primarily determined by a subset comprising those data points with w_r(x_k) ≈ 1, namely, the data belonging to the core of h̃^r. In other words, each rule acts like an independent model that is related to the above subset of training data. Note that, for a given partition, the premise parameters of the FHCs are fixed, which determines the membership functions μ^r(x_k). Then, minimization of (16) is achieved through a repeated application of the weighted recursive least-squares (WRLS) algorithm [10]. This overcomes the need for the ES-based rule consequent learning method used in [6]. Particularly, for each sub-model y_k^r = X_k^T λ^r (r = 1,...,n) the consequent parameters λ^r are calculated by the following recursive equations:

$$\lambda_{k+1}^r = \lambda_k^r + K_k \left[ y_{k+1}^d - X_{k+1}^T \lambda_k^r \right], \qquad (17)$$

$$K_k = P_{k+1} X_{k+1} = \frac{P_k X_{k+1}}{1/w_r(x_{k+1}) + X_{k+1}^T P_k X_{k+1}}, \qquad (18)$$

$$P_{k+1} = \left[ I - K_k X_{k+1}^T \right] P_k, \qquad (19)$$

where X_k^T = [1, x_{1k}, ..., x_{mk}] and P_k ∈ R^{(m+1)×(m+1)} is a covariance matrix. The starting values of the algorithm are λ_0^r = 0 and P_0 = aI, where I ∈ R^{(m+1)×(m+1)} is an identity matrix and a is a sufficiently large number, usually larger than 300. Eqs. (17)–(19) are iterated for k = 1,...,q until all samples (X_k, y_k^d) of the training set are processed. The optimal parameters λ_opt^r are obtained at the final iteration, λ_opt^r = λ_q^r, and the hyper-plane y_k^r = X_k^T λ^r is fully defined.
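For reference, a compact sketch of the weighted recursive least-squares pass (17)–(19) for one rule; the clipping that guards the 1/w_r term is our addition.

```python
import numpy as np

def wrls(X, y, w, a=1000.0):
    """Eqs. (17)-(19): consequent parameters of one rule in a single pass.

    X: (q, m+1) regressors [1, x_1, ..., x_m]; y: (q,) desired outputs;
    w: (q,) local weights w_r(x_k) = mu^r(x_k).
    """
    d = X.shape[1]
    lam = np.zeros(d)                 # lambda_0^r = 0
    P = a * np.eye(d)                 # P_0 = aI with a sufficiently large
    for Xk, yk, wk in zip(X, y, w):
        K = (P @ Xk) / (1.0 / max(wk, 1e-12) + Xk @ P @ Xk)   # eq. (18)
        lam = lam + K * (yk - Xk @ lam)                       # eq. (17)
        P = (np.eye(d) - np.outer(K, Xk)) @ P                 # eq. (19)
    return lam                        # lambda_opt^r after the final sample
```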


In an attempt to integrate the benefits of global and local learning, we suggest a local parameter determination–global model evaluation approach, described as follows. Assuming an input partition (U_n, θ_n) encoded in S_{θ_0}(n), the consequent parameters Λ_n are calculated by (17)–(19), minimizing the local error measure E_L. The resulting TSK model FM(U_n, θ_n, Λ_n) is then evaluated using the global error measure E_G in (15). Following the above strategy, the GA is encouraged to search for TSK models which exhibit good local behavior (correct hyper-plane orientation) and accurate global fitting. Upon completion of the genetic evolution, the suggested evaluation scheme leads to a model that optimizes both the local and the global objective simultaneously.

4.2. Number of rules

Simplicity is a significant characteristic of an ideal TSK model, suggesting that the model should perform accurate data fitting using a minimum number of representative rules. Model simplicity is dictated by the following arguments. Firstly, a large number of rules increases the complexity of the model without providing additional information about the system. Moreover, due to over-parameterization, we are led to noise modeling, which reduces the generalization capabilities of the obtained model. Secondly, one of the important motivations for using fuzzy models is that they provide a linguistic description of the real system. This advantage is lost for large rule bases since the operator cannot easily interpret and gain insight into the local behavior of the rules.

4.3. Partition quality functions

4.3.1. Typicality of the partition

Typicality is a quality measure employed in several fuzzy clustering techniques such as the fuzzy C-means (FCM) algorithm [31,2]. Since a FHC h̃^r can be regarded as a fuzzy cluster with its center or prototype c^r defined by (5), the input partition essentially realizes a clustering task. Hence, we employ a clustering objective function as a means for determining the size, the shape, and the locations of the FHCs within the premise space. The typicality function J_T of a class-n partition U_n is defined by

$$J_T(U_n, \theta_n) = \sum_{r=1}^{n} J_T^r(\theta^r) = \sum_{r=1}^{n} \sum_{k=1}^{q} (\mu^r(x_k))^{m} (d_{rk})^2, \qquad (20)$$

where d_{rk} = ‖x_k − c^r‖ is the distance between the center of the rth cluster and the kth data point, and m ∈ [1,∞) is a weighting exponent, taken equal to m = 2 in this paper.

The typicality function J_T^r is closely related to an important characteristic of h̃^r, namely, the strength of core(h̃^r). Minimizing J_T^r entails that the data points belonging to h̃^r should adhere tightly to its cluster center c^r. An ideal TSK model should have a small value of J_T. In that case, the FHCs pertaining to the partition are located at high-density regions of the input space (strong cores), thus producing representative fuzzy rules. On the contrary, large values of J_T indicate that the distances of the data from the cluster centers are considerably large. Consequently, the fuzzy rules cannot describe compact data groups, which reduces the representation capabilities of the model.

4.3.2. Partition entropy

The complexity of a TSK model is determined by the rate of interaction between the local sub-systems forming the model, an issue closely related to the so-called overlapping criterion. Particularly, the FHCs participating in a fuzzy partition should exhibit a small degree of overlapping with each other, leading to TSK models with precise as well as descriptive fuzzy rules. As a measure for evaluating the structure complexity of a model we consider the entropy of the partition [38]. Considering all samples of the training set, the entropy of a partition U_n is given by

$$H_n(U_n, \theta_n) = \sum_{k=1}^{q} H_k(U_n, \theta_n, x_k) = -\sum_{k=1}^{q} \sum_{r=1}^{n} \bar\mu^r(x_k)\, \log(\bar\mu^r(x_k)), \qquad (21)$$

where μ̄^r(x_k) is the normalized firing strength of R^(r) given by (13), with Σ_{r=1}^n μ̄^r(x_k) = 1. The simplest model structure is characterized by disjunct sub-models with zero overlapping between FHCs. In that case, each x_k (k = 1,...,q) belongs to a single FHC, say h̃^i, and the model output coincides with the output of the ith sub-model, y^i = g^i(x_k, λ^i): μ̄^i(x_k) = 1 and μ̄^j(x_k) = 0 for j ≠ i (i, j = 1,...,n). Hence, from (21) the minimal entropy attained is H_n = 0. Increasing the overlapping between FHCs introduces more complexity to the model structure, thus increasing the partition entropy. In the extreme case, the entropy becomes maximal, H_{n,max} = q log(n), when all data points have the same relative degree of membership in all FHCs of the partition: μ̄^r(x_k) = 1/n (r = 1,...,n; k = 1,...,q).


Maximum entropy corresponds to a full-ignorance situation in fuzzy modeling, where the data excite all sub-models to the same degree. In order to obtain TSK models with simple structure and small overlapping, the partition entropy in (21) should be minimized.

4.3.3. Partition cardinality

Cardinality is related to the way the input samples are covered by the FHCs of U_n. An efficient partition strategy dictates that the following conditions are fulfilled. First, it should be assured that each input sample x_k (k = 1,...,q) is covered by U_n, that is, it exhibits a non-zero membership value in at least one FHC of the partition. This condition is met by assuring that 0 < Σ_{r=1}^{n} μ^r(x_k). When μ^r(x_k) = 0 for all r = 1,...,n, the sample x_k is not covered by U_n. Apparently, this is an unacceptable situation since the fuzzy inference mechanism (12)–(13) cannot be defined. Assuming that the first condition is met, the second property suggests that the degrees of membership should be as large as possible. This implies that x_k belongs highly to a FHC and, consequently, is consistently described by the corresponding fuzzy rule. Let T_n denote a composite fuzzy set in D which is derived by combining the h̃^r through the OR operator:

$$T_n = \bigcup_{r=1}^{n} \tilde h^r \quad\text{with}\quad \mu_{T_n}(x_k) = \max_{r=1,\dots,n} \mu^r(x_k). \qquad (22)$$

For a given x_k, μ_{T_n}(x_k) is the maximum degree of membership in all FHCs of U_n, determining the degree to which x_k is covered by U_n. As a measure of the quality of the data cover we employ the relative cardinality c_r of T_n:

$$c_r(U_n, \theta_n) = \frac{|T_n|}{|D|} = \frac{1}{q} \sum_{k=1}^{q} \mu_{T_n}(x_k), \qquad (23)$$

where μ_{T_n}(x_k) is given by (22) and 0 < c_r ≤ 1. The relative cardinality evaluates the proportion of input samples having the property of T_n, that is, covered by U_n. In order to increase the sensitivity of the measure with regard to the existence of uncovered data, (23) is


modified as follows:

$$C_r(U_n, \theta_n) = \frac{1 - q_u/q}{q} \sum_{k=1}^{q} \mu_{T_n}(x_k), \qquad (24)$$

where q_u denotes the number of uncovered data and q is the total number of training data. Non-zero values of q_u reduce C_r with respect to the original c_r, thus penalizing fuzzy partitions with uncovered data. Large values of C_r approaching unity indicate that all input data are sufficiently covered with large membership values. Apart from (24), in order to avoid partitions with uncovered data, we employ an additional approach that imposes an extra penalty on the chromosomes encoding such solutions. Particularly, all uncovered data x_k are brought to a complete-ignorance condition by setting μ^r(x_k) = 1/n (r = 1,...,n). As a result, the output error, the entropy and the typicality measures are increased while the relative cardinality is decreased. Thus, these solutions exhibit a small fitness value and are genetically discarded in the course of generations.
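The three partition-quality measures can be evaluated in a few vectorized lines; a sketch, with argument shapes assumed as documented:

```python
import numpy as np

def partition_quality(mu, X, centers, m_exp=2.0):
    """Typicality (20), entropy (21) and modified cardinality (24).

    mu: (n, q) grades mu^r(x_k); X: (q, m) samples; centers: (n, m) FHC centres.
    """
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)  # d_rk^2
    J_T = ((mu ** m_exp) * d2).sum()                               # eq. (20)

    tot = np.clip(mu.sum(axis=0, keepdims=True), 1e-12, None)
    mu_bar = mu / tot                                              # eq. (13)
    H_n = -(mu_bar * np.log(np.clip(mu_bar, 1e-12, None))).sum()   # eq. (21)

    q = mu.shape[1]
    q_u = int((mu.sum(axis=0) == 0).sum())             # uncovered samples
    C_r = (1.0 - q_u / q) / q * mu.max(axis=0).sum()   # eqs. (22), (24)
    return J_T, H_n, C_r
```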

and

Maximize{Cr }: (25)

It can be seen that the GA has to meet 1ve objectives simultaneously, most of them being competing to each other. For example, minimizing JT and Hn leads to representative FHCs located at high-density areas while maximizing Cr tends to spread the fuzzy partition in order to achieve better data cover. Furthermore, a close look to the objectives indicates that

136

S.E. Papadakis, J.B. Theocharis / Fuzzy Sets and Systems 131 (2002) 121 – 152

Since f(?) ∈ [0; 1], the fuzzi1cation of the objectives actually implements a normalization task. This allows us to handle all objectives in uni1ed way, regardless of their ranges. Instead of minimizing {EG ; n; JT ; Hn } the GA is encouraged to maximize the degree of certainty that these objectives are “minimum”, that is, maximize {E˜G ; n; ˜ J˜T ; H˜ n }. The optimization problem of (25) is now reformulated as

1 F1 = 50

0.9 0.8

F1 = 4

0.7 F1 = 3

f (ξ)

0.6 0.5 0.4

˜ J˜ T ; H˜ n ; Cr }: Maximize{E˜ G ; n;

0.3

For simplicity, the parameters F1 and : are 1xed to F1 = 3 and : = 2 for all cases. However, to account for the di9erent ranges of the objectives, F2; ? should be properly chosen. Since the actual range of certain objectives is not known in advance, this parameter is selected on the basis of observations during the initial generations of the GA. For example, the F2; EG of E˜G = f(EG ) takes initially a large value which is heuristically determined by the user. In the following, the GA allowed to proceed for 1ve generations while we are monitoring the values of EG in the produced chromosomes. Then, F2; EG is set equal to the maximum observed value of EG and remains constant afterwards. Given that : = 2, F2; EG is “minimum” to a degree f(F2; EG ) = 0:3 or “maximum” to a degree 1 − f(F2; EG ) = 0:7, as shown in Fig. 8. Having determined an estimate of the “maximum” F2; EG , the GA searches for better solutions, moving along the left-hand side of f(EG ) towards the optimum. A similar strategy is also applied for the typicality measure. With regard to Hn and n, we set F2; H = q log(#0 ) and F2; n = #0 , respectively, where F2; H is the maximum possible entropy and #0 is the maximum number of rules encoded in a chromosome.

0.2

F2 = 50 κ=2

0.1 0

0

10

20

30

40 50 ξ

60 70

80

90 100

Fig. 8. The shapes of the membership function f(?) for di9erent values F1 and : = 2.

their ranges are quite di9erent. With the exception of Cr that takes values in the normalized range [0; 1], EG usually varies in the range 10−2 –10−4 , n is of the order of 10s while JT and Hn are of the order of 100s. Of equal importance is that each objective may vary in a di9erent range depending on the particular problem. For example, a model with n = 5 may be adequate for one problem but quite insu8cient for another one. The above problems are consistently tackled in this paper by fuzzifying the objectives that should be minimized. For each objective ? ∈ {EG ; n; JT ; Hn } we consider a linguistic term “minimum”. The meaning of this term is interpreted through a membership function de1ned by ?˜ = f(?) =

1 ; 1 + :(?=F2; ? )F1

(26)

where ?˜ is the fuzzy objective and F1 is the fuzzi1er controlling the fuzziness of the fuzzy set “minimum”. At ? = F2; ? we obtain a membership value f(F2; ? ) = 1=1 + :. Fig. 8 shows the shape of (26) for di9erent values of F1 and : = 2. f(?) is an openleft function with f(?) → 1 as ? → 0 and f(?) → 0 as ? → ∞. For a given value of ?, f(?) provides the degree of certainty that the objective ? is “minimum”. Passing {EG ; n; JT ; Hn } through (26) we are led to a ˜ J˜T ; H˜ n } comprising the fuzzy objectives. set {E˜G ; n;

(27)

4.5. Quality function

The solution of a multi-objective optimization problem has the meaning of a set of optimal points having the property that it is not possible to improve a particular objective without simultaneously worsening some other objectives of the fitness function Q. Furthermore, a solution is sub-optimal if some objectives are over-satisfied while the rest of them are not adequately met. For instance, a sub-optimal model might have small E_G and J_T but large values of H_n and n. Several methods have been suggested in the literature dealing with multi-objective optimization problems by GAs [32].


In this paper, the fitness function is formulated as a weighted sum of the fuzzy objectives:

$$Q = W_E \cdot \tilde E_G + W_n \cdot \tilde n + W_T \cdot \tilde J_T + W_H \cdot \tilde H_n + W_C \cdot C_r, \qquad (28)$$

where the weight vector W^T = [W_E, W_n, W_T, W_H, W_C] provides the relative weight with which each objective is discriminated by the GA. W is adaptively determined using a weight adaptation scheme described in the sequel.
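Combining (26) and (28), a sketch of the complete fitness evaluation; the dictionary and ordering conventions are ours, not the paper's.

```python
import numpy as np

def fuzzify(xi, F2, F1=3.0, kappa=2.0):
    """Eq. (26): degree of certainty that objective value xi is 'minimum'."""
    return 1.0 / (1.0 + kappa * (xi / F2) ** F1)

def fitness(E_G, n, J_T, H_n, C_r, F2, W):
    """Eq. (28): weighted sum of the fuzzified objectives.

    F2: reference values F_{2,xi} per objective; W = [W_E, W_n, W_T, W_H, W_C].
    """
    terms = [fuzzify(E_G, F2["E_G"]), fuzzify(n, F2["n"]),
             fuzzify(J_T, F2["J_T"]), fuzzify(H_n, F2["H_n"]), C_r]
    return float(np.dot(W, terms))
```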

5. Resolution of the structure learning problem

Following our formulation of the fitness function, and thanks to the partition encoding scheme, the structure learning task is formulated as a multi-objective optimization problem. Among all candidate partitions included in the search space M̄_f(θ_0), the GA tries to find the global optimum one maximizing the fitness function (28). During structure learning, the GA adjusts the parameters encoded in the chromosome, namely, the number of rules and the parameters (u_i^r, v_i^r) of ℓ_i^r, while the membership fuzzifiers b_i^r are kept constant. Our approach attempts to manage all features characterizing the structure of a TSK model. Firstly, the number of rules is determined through the rule existence part while the size and the location of the FHCs are determined through the parameter encoding part. Secondly, well-defined models are obtained by introducing additional partition qualities in the fitness function. Finally, the consequent parameters are optimally calculated by (17)–(19), which completes the structure of the TSK model. The generalized framework considered above renders structure learning a very complex optimization problem that is difficult to solve using the simple genetic algorithm (SGA), which employs only the crossover and mutation operators, and elitism. Due to the complexity of the fitness function, the search landscape is severely complicated. In such cases, the search space contains very narrow paths of arbitrary direction, also known in the literature as ridges. The GA should be able to follow such ridges consistently if the global optimum is to be reached.


In our problem, the ridges are interpreted as follows. Most often, we observe that changing a single dimension of a FHC in the current solution leads to solutions with lower or equal fitness value, and the GA gets stuck at a local optimum. To attain a better solution we have to move along the ridge direction, changing two or more dimensions of several FHCs simultaneously, with the proper step size. In order to effectively address the problem difficulties and enhance the search capabilities of the SGA, we employ a set of six genetic operators that can be divided into two classes. The former class includes a conventional-type hill-climbing operator and the MGA, a generalized hill-climbing operator suitable for ridge-following tasks. These are general-purpose operators which can be applied to a wide class of difficult optimization problems [19]. The latter class includes a set of four problem-specific operators that are tailored to the search requirements of our problem.

5.1. The MGA hill-climbing operator

The major drawback of the SGA is that, although it is very efficient in locating the basins of the local optima, it is unable to explore these basins effectively in order to find the exact global optimum within a small number of generations. Therefore, several hybrid genetic schemes have been suggested in the literature, where standard GAs are combined with a hill-climbing or local search operator [26]. The aim of these operators is to explore quickly a neighborhood of the elite solution provided by the main GA at each generation. Starting from the current solution vector, conventional hill-climbers perform successive steps (perturbations) along orthogonal directions. The newly created solutions are evaluated, and when a better solution is found it usually replaces the current one. A hill-climber of this type is the phenotype mutation (PM) operator suggested in [25]. For each component of the current solution, PM performs a small and a random big step along the positive and the negative direction, successively, evaluating the resulting genotype after each step. When a better solution is attained the process is terminated; otherwise it continues until all axes are perturbed. Although the PM operator significantly enhances the performance of the GA, it lacks the ability to follow narrow ridges. This is explained by considering that PM and all other operators based on



similar principles perform orthogonal steps, which do not necessarily coincide with the direction of the ridge. The MGA suggested in this paper uses a second-level micro-genetic algorithm [7] with a small population that evolves for a small number of generations. The MGA is applied to the parameter encoding part P_{#_0} of a chromosome. It evolves a population of perturbation vectors and serves as an engine for finding better solutions in the neighborhood of the current optimum solution. Due to its genetic nature, the MGA operator is capable of evolving paths of arbitrary direction and can thus follow potential ridges in the search space. Assume that S^c_{#_0} = {E^c_{#_0}, P^c_{#_0}} is the current optimum solution and let, for simplicity, X^c = P^c_{#_0} ∈ {0,1}^{ℓ_0 × nb} denote its parameter encoding genotype vector:

X^c = [X_1^c, ..., X_{ℓ_0}^c]  with  0 ≤ X_i^c ≤ 2^{nb} − 1  (i = 1, ..., ℓ_0),   (29)

where ℓ_0 = 2m × #_0 is the number of parameters encoded in P^c_{#_0}. At the genotype level, we consider a perturbation vector X^p ∈ {0,1}^{ℓ_0 × pb} defined by

X^p = [X_1^p, ..., X_{ℓ_0}^p]  with  0 ≤ X_i^p ≤ 2^{pb} − 1  (i = 1, ..., ℓ_0),   (30)

where pb is the number of bits used to encode the perturbation of each parameter, considerably smaller than the nb of the main GA. The neighborhood N(X^c, pb) around X^c is formulated as follows:

N(X^c, pb) = {X ∈ {0,1}^{ℓ_0 × nb} | X = X^c + X^p, with −2^{pb−1} ≤ X_i^p ≤ 2^{pb−1}, i = 1, ..., ℓ_0}.   (31)
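For intuition, the following Python sketch illustrates one run of this neighborhood search: a small population of perturbation vectors is evolved inside the hypercube [−2^{pb−1}, 2^{pb−1}]^{ℓ_0} around the current elite genotype, and the elite is replaced only if a fitter perturbed solution is found. Truncation selection and uniform crossover are simplified stand-ins for the roulette-wheel and two-point operators described in the text; the names and default values are illustrative, not the authors' implementation.

import random

def mga_step(X_c, Q, pb=3, pop_size=5, generations=6, nb=6):
    # perturbation components are decoded as signed integers, eq. (31)
    lo, hi = -2 ** (pb - 1), 2 ** (pb - 1)
    clip = lambda v: max(0, min(2 ** nb - 1, v))  # keep genes in [0, 2^nb - 1]
    pop = [[random.randint(lo, hi) for _ in X_c] for _ in range(pop_size)]
    for _ in range(generations):
        # rank perturbations by the fitness of the perturbed solution X_c + X_p
        pop.sort(key=lambda xp: Q([clip(c + p) for c, p in zip(X_c, xp)]),
                 reverse=True)
        child = [random.choice(g) for g in zip(pop[0], pop[1])]  # uniform crossover
        if random.random() < 0.1:  # mutation: redraw a single component
            child[random.randrange(len(child))] = random.randint(lo, hi)
        pop[-1] = child  # replace the worst member
    best = [clip(c + p) for c, p in zip(X_c, pop[0])]
    return best if Q(best) > Q(X_c) else X_c  # keep the old elite unless improved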

The solutions X belonging to N(X^c, pb) are produced by adding an arbitrary value X_i^p to each component X_i^c (i = 1, ..., ℓ_0) of the current genotype vector. It can be seen that the X^p lie in the interior of a hypercube HC(pb) = [−2^{pb−1}, 2^{pb−1}]^{ℓ_0}, whose volume is determined by pb. Hence, pb defines the size of the local search area and is taken the same for all input variables. The aim of the MGA is to explore N(X^c, pb) by evolving a population P_MGA of perturbation vectors X^p with a length of ℓ_0 × pb bits:

P_MGA(N_pop^MGA) = {X^p(ℓ) | X^p(ℓ) ∈ HC(pb), ℓ = 1, ..., N_pop^MGA},   (32)

where N_pop^MGA is the population size of the MGA. The evolution of P_MGA is carried out by standard selection and reproduction operators, as in conventional GAs. The MGA can be functionally represented by MGA{P_MGA, O_MGA, R_MGA, Q_MGA}, where O_MGA includes the genetic operators, R_MGA is the generation run limit, and Q_MGA is the fitness function of the MGA. The MGA implementations in this paper use the roulette wheel for parent selection and the standard crossover and mutation operators for the reproduction of new genotypes. The MGA population is evaluated using the same fitness function as that of the main GA given by (28), Q_MGA(X^p) = Q_GA(X^c + X^p), where the rule encoding part E_{#_0} is not explicitly shown. Also, for simplicity's sake, the indicators GA and MGA in the fitness functions are omitted in the following. Assuming that a better solution is attained at the end of an MGA run, Q(X^c + X^p_opt) > Q(X^c), the old optimum solution is replaced by the new one: Q(X^c) ← Q(X^c + X^p_opt). In view of the above, the search task of the MGA is stated as follows: among the perturbations X^p ∈ P_MGA, find the optimum one, X^p_opt, such that the resulting solution X_opt = X^c + X^p_opt exhibits the highest fitness value, Q(X_opt) > Q(X^c). This corresponds to determining the proper size and orientation of X^p_opt leading to a better solution compared to X^c. Combining the main GA and the MGA operator, we are led to a hybrid scheme called the GA–MGA. The main GA performs the global search task, exploring the entire search space, while the MGA performs the local search task, exploring genetically the neighborhood N(X^c, pb). The population size and the generation run limit of the MGA should be selected with care, such that the MGA assists the hybrid scheme in attaining a better solution while not dominating over the main GA. The GA–MGA algorithm is an iterative procedure comprising an external and an internal loop. The external loop implements the main GA, which is evolved for k = 1, ..., R_GA generations. Assuming that an elite solution X^c(k) is obtained at the kth generation, the


internal loop is iterated for ℓ = 1, ..., R_MGA generations. This implements a complete run of the MGA, called the kth run of the MGA. The above cycles are sequentially repeated until a termination criterion is fulfilled [19].

5.2. Problem-specific operators

5.2.1. Split control operator (SCO)
At each generation, SCO is applied to the elite solution with a probability rate P_split. SCO is an iterative procedure comprising n × m steps. At each step, a single FHC, h̃_i (i = 1, ..., n), is broken along the middle of its jth dimension (j = 1, ..., m), producing two descendant FHCs of equal size, denoted by h̃¹_{i,j} and h̃²_{i,j}, respectively. Keeping the rest of the FHCs intact, we examine four alternatives: alternative-1 accepts both h̃¹_{i,j} and h̃²_{i,j}, alternative-2 accepts only h̃¹_{i,j}, alternative-3 accepts only h̃²_{i,j}, and finally, alternative-4 rejects both descendant FHCs. The above process continues until all dimensions of all active FHCs are examined. Let Q^s_{i,j} denote the fitness value of alternative s = 1, ..., 4 for the split of h̃_i along the jth dimension, and Q^e denote the fitness value of the original elite solution. We then adopt the alternative providing the highest fitness improvement, Q^s_{i,j} − Q^e. It can be seen that the SCO can insert or remove a FHC, depending on whether alternative-1 or alternative-4 is selected.

5.2.2. Rule insertion operator (RIO)
RIO is applied to the elite solution at each generation with a probability P_in. It examines whether the insertion of a new FHC (rule) leads to a better solution. An interesting feature of this operator is that the new FHC is placed at a region where a large concentration of uncovered input data is observed. Let Ω_0 ⊂ Ω denote the data subset comprising q_0 input samples which are not covered by the existing FHCs of the current partition, U_nc. RIO entails two tasks, namely, selecting the center of the new FHC, denoted by h̃_new, and determining its initial size.
Selecting the center of h̃_new: The center of h̃_new is decided on the basis of a density function used in subtractive clustering [3]. For each x_k ∈ Ω_0 (k = 1, ..., q_0), the density function P_k is


defined as

P_k = Σ_{j=1}^{q_0} exp( −‖x_k − x_j‖² / (2A²) ),   (33)

where A ∈ [0.2, 0.4] is a problem-specific constant. The density function P_k can be viewed as a data density measure. Large values of P_k indicate that a great number of uncovered data points are located around x_k, and thus x_k is a strong candidate for centering h̃_new. Following the above reasoning, h̃_new is centered at the point x_# with the highest density value, P_#, provided that P_# > P_t. P_t is a user-defined threshold which indicates that, approximately, at least P_t data points are located in the vicinity of x_#.
Determining the size of h̃_new: Initially, h̃_new takes the smallest possible size with respect to all dimensions. In the following, we perform a small number (four or five) of max_steps trials; at each trial, the size of h̃_new is gradually increased in a symmetric manner, while computing the quality of the resulting partition U_nc + {h̃_new}. We choose the partition having the highest quality Q_max; under the assumption that Q_max > Q^e, it becomes the new elite. The RIO operator, described below in pseudo-code form, assists the main GA in attaining partitions with adequate data cover, and hence a good cardinality measure.

For k = 1, ..., q_0
    Compute the density function P_k of x_k ∈ Ω_0
next-k
/* Find the data point x_# with the maximum density P_# */
(x_#, P_#) = FindMax(P_k)
If P_# > P_t Then activate h̃_new
For step = 1, ..., max_steps
    For i = 1, ..., m
        u_i^new = x_i^# − step · w_step;  v_i^new = x_i^# + step · w_step
        /* w_step is a small increment value */
    next-i
    /* Compute the quality Q_new of U_nc + {h̃_new} */
    Q[step] = Q_new
next-step
(step_max, Q_max) = FindMax(Q)
If Q_max > Q^e Then
    For i = 1, ..., m
        u_i^new = x_i^# − step_max · w_step;  v_i^new = x_i^# + step_max · w_step
    next-i
    Update the elite solution
Else disable h̃_new
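A runnable Python rendering of the RIO pseudo-code is sketched below, under a few assumptions: a partition is represented as a list of (u, v) bound vectors, `quality` evaluates the fitness (28) of a partition, and `uncovered` holds the q_0 input samples not covered by the current FHCs. It is an illustration, not the authors' implementation.

import math

def rio(partition, uncovered, quality, A=0.3, P_t=30, max_steps=5, w_step=0.05):
    if not uncovered:
        return partition
    def density(xk):
        # data density measure of eq. (33) over the uncovered samples
        return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(xk, xj)) / (2 * A ** 2))
                   for xj in uncovered)
    x_star = max(uncovered, key=density)   # candidate center with highest density
    if density(x_star) <= P_t:             # too few uncovered points nearby
        return partition
    Q_e, best = quality(partition), None
    for step in range(1, max_steps + 1):   # grow the new FHC symmetrically
        u = [x - step * w_step for x in x_star]
        v = [x + step * w_step for x in x_star]
        Q_new = quality(partition + [(u, v)])
        if best is None or Q_new > best[0]:
            best = (Q_new, (u, v))
    # accept the insertion only if it improves on the current elite quality
    return partition + [best[1]] if best[0] > Q_e else partition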



5.2.3. Rule removal operator (RRO)
RRO is applied to the elite solution at each generation with a probability P_rem. It examines whether the removal of a FHC improves the quality of the current solution. The FHC to be removed from the current partition U_nc, denoted by h̃_rem, may fall into one of the following categories:
• h̃_rem is the FHC with the largest degree of overlapping with the other FHCs in U_nc.
• h̃_rem is randomly selected among the FHCs in U_nc.
• h̃_rem is the FHC with the smallest volume.
The FHC with the largest overlapping is determined by performing an n-step trial. At the kth step (k = 1, ..., n), we remove an active FHC h̃_k from U_nc and compute the change in entropy, E_n − E_{n−1}, where E_n and E_{n−1} are the entropies of U_nc and U_nc − {h̃_k}, respectively. The FHC producing the maximum entropy decrease is a candidate for removal. The above three events occur with probabilities P_over, P_rand and P_min, respectively. The selection of a particular event is made using the roulette-wheel selection mechanism upon the set {P_over, P_rand, P_min}, with P_over + P_rand + P_min = 1. In our simulations, the probability rates are set to P_over = 0.5, P_rand = 0.3 and P_min = 0.2. After removal of h̃_rem, we apply the MGA operator for a couple of generations. The MGA adjusts the resulting partition U_nc − {h̃_rem} locally, with the aim of covering the empty space left by h̃_rem. This is achieved by tuning the neighboring FHCs of h̃_rem, leaving the remaining FHCs intact. Hence, RRO follows a remove-and-tune approach: remove a possibly useless rule and then tune the resulting partition. If, after local tuning, the obtained solution has a better quality, it replaces the current elite. The RRO operator assists the GA in attaining solutions with small overlapping between the FHCs and, consequently, with small entropy.

5.2.4. Digital hill-climbing operator (DHCO)
RIO, RRO and SCO change the partition class of a solution by inserting or removing a single rule from the corresponding model. They are local operators, acting on a specific area of the input space. An additional mechanism for changing the number of rules is through the crossover and mutation of the main GA. Although solutions with considerably different structure are produced, the effects of the above mechanisms become apparent only after a large number of generations. DHCO is a local search tool focusing on the number of rules rather than on the parameters of the FHCs. It acts upon the rule existence part E_{#_0} of the current elite with a probability rate P_dhc. Given the current partition, DHCO examines whether a better solution can be obtained through the insertion/removal of certain rules over the entire input space. From the set of #_0 digits included in E_{#_0}, we randomly select a subset of B digits, not necessarily successive ones. Considering all possible combinations of these digits and keeping the rest of the chromosome digits intact, we are led to a set of 2^B − 1 neighboring solutions with different numbers of rules. Assuming that the best combination exhibits a better quality than that of the original solution, it becomes the new elite. Fig. 9 illustrates how DHCO works for B = 2, with the new solution derived through evaluation of 2² − 1 = 3 combinations. DHCO explores the neighborhood of a current solution with discrete-type parameters, the rule existence bits in this case. The size of the neighborhood is determined by B, that is, the number of bits that can be changed simultaneously. Since DHCO evaluates all neighboring solutions, B takes a small integer value between 2 and 4.

Fig. 9. Illustration of the DHCO operator for B = 2.

5.3. Weight adaptation scheme

Each weight vector in (28) determines a search direction in the objective space [14]. Accordingly, if the


weight vector remains fixed, the GA searches for solutions along a constant search direction and the algorithm becomes trapped at sub-optimal solutions. An attempt to improve an objective that is not sufficiently satisfied aggravates the over-satisfied ones. To cope with this problem, we suggest a weight adaptation scheme that changes the search direction dynamically during evolution, with the aim of optimizing all objectives to the maximum possible degree. Let Q^e(g) denote the fitness value of the elite solution at the gth generation, and Q^e_C(g), C ∈ S_obj = {Ẽ_G, ñ, J̃_T, H̃_n, C_r}, denote the fuzzy objectives in Q^e(g). Since Q^e_C(g) indicates the degree of fulfillment of the Cth objective, its complement Q̄^e_C(g) = 1 − Q^e_C(g) can be regarded as the degree of failure of that objective. The decision of whether the weight vector W will be changed or not is made every g_u generations. As to which weight should be changed, this is decided by the selection value of the corresponding objective. The selection value Sel_C of the Cth objective is computed on the basis of two factors, namely, the relative degree of failure and a priority schedule determined by the user. Particularly, for each objective, we introduce a priority P_C ∈ [0, 1] with Σ_{C∈S_obj} P_C = 1. The priority has the meaning that, for two objectives having a similar degree of failure, the weight corresponding to the objective with the higher priority will be adapted. Priority reflects the user's preferences as to which objectives are considered more important than others, and thus should be satisfied first. The weight adaptation scheme proceeds as described in the following. The weight vector takes an initial value W_0 determined by the expert, and the GA starts searching along this direction. Assume that the last weight update occurred at the gth generation, and let the GA evolve for g_u more generations. At the (g + g_u)th generation, check the following conditions. If Q^e(g + g_u) > Q^e(g), the weight vector W remains intact and the GA keeps searching along the current direction. If Q^e(g) = Q^e(g + g_u), this indicates that the current search direction is exhausted and W is changed as follows:
• Compute the selection values Sel_C of the objectives:

Sel_C = P_C · Q̄^e_C(g + g_u) / Σ_{C∈S_obj} Q̄^e_C(g + g_u),   C ∈ S_obj.   (34)


• Select the objective C* with the maximum selection value:

C* = arg max_{C∈S_obj} {Sel_C}.   (35)

• Increase the weight W_{C*} by a portion w_p ∈ [0, 1] determined by the user:

W_{C*} ← W_{C*} + w_p · W_{C*}.   (36)
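One update cycle of (34)–(36) can be written compactly as follows; the dictionary-based bookkeeping and the function name are illustrative assumptions.

def adapt_weights(fuzzy_elite, priorities, weights, w_p=0.1):
    # degrees of failure: complements of the fuzzy objective values of the elite
    failure = {c: 1.0 - q for c, q in fuzzy_elite.items()}
    total = sum(failure.values()) or 1.0
    sel = {c: priorities[c] * failure[c] / total for c in failure}  # eq. (34)
    c_star = max(sel, key=sel.get)                                  # eq. (35)
    weights[c_star] += w_p * weights[c_star]                        # eq. (36)
    return c_star, weights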

The increase of W_{C*} forces the GA to locate those solutions satisfying the objective Q^e_{C*} to a higher degree. In order to avoid abrupt variations of the search direction, the design parameter g_u should be considerably large, so that the current direction is thoroughly explored by the GA. Suppose that after Up_max successive adaptations of W_{C*} the fitness function is not improved. In that case, W_{C*} returns to its value at the last successful adaptation and remains constant thereafter; this is achieved by setting its priority value to zero. The weight vector is dynamically changed during evolution as suggested by (34)–(36). The optimization process is terminated when the priorities of all weights are reset to zero.

5.4. Proposed structure learning algorithm

The genetic-based structure learning (GBSL) scheme suggested in this paper is an efficient hybrid algorithm that combines the following features:
• the main GA, implemented by an SGA, which performs the global search task,
• the MGA and the PM operator for exploring the neighborhood of the current optimum solution,
• the class {SCO, RIO, RRO, DHCO}, comprising four problem-specific genetic operators for local tuning of the fuzzy partitions,
• the weight adaptation scheme for adjusting the weights of the objectives in the quality function.
The GBSL algorithm is used to identify the structure of a near-optimum TSK model, that is, to determine the proper number of rules, the locations of the FHCs in the input space, and the corresponding consequent parameters. A compact sketch of one GBSL generation is given below.
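The skeleton below indicates how these components interact within one generation. The operator bodies are passed in as callables, since they are specified throughout Section 5; the names and the exact sequencing are illustrative assumptions rather than the authors' code.

import random

def gbsl_generation(population, elite, quality, global_step, local_ops,
                    specific_ops, p_op=0.3):
    # SGA step: selection, crossover, mutation and elitism over the population
    population, candidate = global_step(population)
    if quality(candidate) > quality(elite):
        elite = candidate
    for op in local_ops:        # MGA and PM explore the elite's neighborhood
        elite = op(elite)
    for op in specific_ops:     # SCO, RIO, RRO, DHCO, each applied with rate ~0.3
        if random.random() < p_op:
            elite = op(elite)
    return population, elite    # weight adaptation runs every g_u generations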



Table 1. Basic notation and parameters of the solution encoding method

#_0 (*)      Max number of FHCs encoded in a chromosome
P_{#_0}      Parameter encoding part of a chromosome
E_{#_0}      Rule existence part of a chromosome
S_{#_0}(n)   Solution encoding a class-n partition
λ_0          Max number of bits in P_{#_0}
p_i (*)      Discretization accuracy of the ith input
G (*)        Number of generations for update of #_0

(*) Set up by the user.

5.5. Parameter setting of the structure learning algorithm

The basic notation and the parameters to be selected for the GBSL are shown in Tables 1–4. Of the two parameters involved in the solution encoding (Table 1), G is of minor importance and does not influence the evolution of the algorithm. Furthermore, the p_i, i = 1, ..., m, depend upon the particular problem under consideration and not on the specific learning algorithm. Small values of p_i may lead to sub-optimal models, while large ones increase the search space and aggravate the optimization task of the GA. Since the GBSL provides only an initial model that is subsequently optimized by GBPL, p_i should take a moderate value. The quality function formulation (Table 2) involves the determination of the crossover values of the fuzzy term "minimum" and the initial weights of the objectives. With regard to the former issue, some of these parameters are determined during the initial generations, while others are preset to certain crossover values. Their purpose is to figure out the general shape of "minimum" for each objective, depending on the range of the respective objective. Experience shows that, practically, these parameters have a negligible effect on GBSL evolution. The initial weights are selected by the user according to the degree of importance of each objective. Generally, they reflect an objective ordering similar to the one depicted in the priority vector. Since the objective weights are adjusted during evolution through the weight adaptation scheme, their initial setting is of minor importance.

Table 2. Notation and parameters of the quality function formulation

{E_G, n, J_T, H_n, C_r}      Real values of the global error measure, number of rules, typicality, entropy and partition cardinality
{Ẽ_G, ñ, J̃_T, H̃_n}          Fuzzified values of the objectives
Q                            Total fitness (quality) function
{W_E, W_n, W_T, W_H, W_C}    Weights of the objectives in Q
F_{2,C}, F_1 (*)             Crossover value and fuzzifier of the fuzzy term "minimum" for the Cth objective

(*) Set up by the user.

Table 3. Basic notation and parameters of the genetic operators in GBSL

Q_MGA = Q                            Fitness function of the MGA
pb (*)                               No. of bits encoding the perturbation vectors of P_{#_0} in the MGA operator
P_MGA (*)                            Population size of the MGA
R_MGA (*)                            Generational run limit of the MGA
{P_split, P_in, P_rem, P_dhc}        Probability rates for the application of the SCO, RIO, RRO and DHCO operators
A (*)                                Standard deviation of the power function in the RIO operator
P_t (*)                              Threshold for deciding rule insertion in RIO
max-step (*)                         No. of steps for determining the size of the inserted rules in RIO
B (*)                                No. of bits defining the neighborhood in the DHCO operator

(*) Set up by the user.

Table 4. Basic notation and parameters of the weight adaptation scheme

Q^e_C        Fitness value of the Cth objective of the elite solution
Sel_C        Selection value of the Cth objective
P_C (*)      Priority of the Cth objective
g_u (*)      No. of generations for weight adaptation
Up_max (*)   No. of successive adaptations for weight reset
w_p (*)      Weight step for updating the objective weights in Q

(*) Set up by the user.

The set of genetic operators (Table 3) includes the MGA, a general-purpose hill-climbing operator, and four specific operators (SCO, RIO, RRO, DHCO) devised for the TSK modeling problem. The MGA operator involves three parameters, namely, pb, P_MGA and R_MGA. Usually, pb is taken as half the number of bits used by the main GA to encode each gene in P_{#_0}. Furthermore, in our experiments, P_MGA is composed of 4–5 genotypes and the MGA is allowed to evolve for 6–7 generations (R_MGA = 6–7). Under such conditions, the MGA is an effective local search operator and more or less insensitive to parameter settings [19]. The probability rates for the application of the four specific operators, {P_split, P_in, P_rem, P_dhc}, are set to a common value, usually 0.3. The reasoning behind this choice is that, statistically, at each generation almost one of them is applied to the elite solution. Hence, they assist in finding a better partition while not dominating over the main GA and the MGA, which perform the primary search task. Among the parameters involved in RIO, A and max-steps are of minor importance, while P_t defines a threshold for the insertion of a new rule in the partition. In our simulations, P_t is taken as a small portion (usually 10%) of the number of data included in the training data set. The sensitivity in the selection of P_t is low, as explained in the following. If a lower value of P_t is chosen, RIO is favored to insert small FHCs in uncovered data areas. However, before acceptance, the new partition is evaluated by the fitness function, which includes the number of rules as an objective. Hence, small rules are discarded, and only those covering large and representative data regions are preserved. For the weight adaptation scheme (Table 4), only g_u and the priority vector {P_C} are of concern. g_u denotes the period of generations after which checking and updating of the objective weights takes place. In our experiments, g_u ∈ [15, 30]. With regard to the determination of the P_C's, the following comments are in order.


Firstly, the GBSL tries to optimize all objectives simultaneously, with the objective priorities defining a path along the weight search space. For instance, if P_E is high while the priorities of the other objectives take relatively lower values, the GBSL puts extra pressure along this weight direction. That is, initially, the population is composed of individuals (TSK models) with low values of E_G. However, these individuals usually contain a large number of rules with large overlapping. In the following, the algorithm successively focuses on the remaining objectives, discarding the unnecessary rules and reducing the overlapping until the optimal model is attained. An alternative scenario would be to let P_n dominate. In that case, the population contains simple models that exhibit high E_G values. Apparently, in order to arrive at the optimal solution, a larger number of generations is required compared to the previous choice, thus reducing the convergence speed. Extensive experimentation shows that, irrespective of the particular choice of the priorities, the GBSL was able to obtain a near-optimum TSK model. This means that all cases provided models with almost the same number of rules, very similar partitions and a very good error measure. In an attempt to improve the convergence of GBSL, and since E_G is a very important feature of the TSK models, we have arrived at a priority vector dominated by P_E, while keeping the priorities of the remaining objectives at lower values: P = [P_E, P_n, P_T, P_H, P_C] = [0.4, 0.15, 0.1, 0.2, 0.15]. It should be noted that it is the relative degree between objectives that counts, rather than the absolute values of the priorities. This choice, derived from our experience, is adopted for all examples elaborated in the simulation results. From the above discussion, it is concluded that, of the pool of parameters shown in Tables 1–4, just a few have to be selected by the user, most of them having an insignificant effect on the learning scheme. Overall, GBSL is a robust algorithm, more or less insensitive to parameter settings.

6. Parameter learning algorithm

The purpose of the parameter learning phase can be stated as follows: given the initial model obtained by GBSL and the training data, adjust the model



parameters optimally, so that an acceptable level of accuracy is attained. A technique widely employed in the literature for the parameter learning of TSK models is the ANFIS method [17], which combines back-propagation (BP) for tuning the premise parameters with LSE for estimating the consequent ones. However, BP exhibits a number of serious drawbacks, such as lack of robustness and convergence to local minima. To overcome these problems, a novel genetic-based parameter learning (GBPL) method is suggested in this paper. GBPL is a hybrid algorithm including the following attributes: (a) a main GA encoding the premise parameters, used to adjust the memberships of the model and determine the consequent parameters, (b) the adaptive search space range (ASER) mechanism, which dynamically determines the search space of the main GA, and (c) the MGA operator for improving the search capabilities of the overall scheme.

6.1. The ASER mechanism

Let θ_pr = [θ_1, ..., θ_{3D}]^T = [a_1, ..., a_D, b_1, ..., b_D, c_1, ..., c_D]^T denote the premise parameter vector of a TSK model containing D fuzzy sets in the premise part. θ_pr includes the parameters {a, b, c} characterizing the membership functions in (4). When GAs are used for training fuzzy models, the traditional approach is to encode θ_pr directly into the chromosome of an individual. Accordingly, the chromosome is formulated as a sequence of 3D genes, each one encoding a particular premise parameter. Then, the entire search space is genetically explored until the optimum solution is found. Apparently, direct encoding decreases significantly the convergence speed of the GA and may lead to solutions that violate the structural constraints between successive memberships. In order to properly address the global search problems related to direct encoding, we suggest the ASER mechanism, which follows the local search approach of the random search method [1].

6.1.1. Search space formulation

Let θ_{pr,B}(k) denote a base vector at the kth generation of the main GA. We formulate a search area centered at θ_{pr,B}(k), defined by

N(θ_{pr,B}(k)) = [θ_{pr,B}(k) − Δ, θ_{pr,B}(k) + Δ].   (37)

The width vector Δ = [Δ_a, ..., Δ_a, Δ_b, ..., Δ_b, Δ_c, ..., Δ_c] determines the size of the local search area through the boundary values Δ_a, Δ_b and Δ_c of the parameters. The genetic search is restricted to N(θ_{pr,B}(k)): the main GA evolves a population of perturbation vectors ν^p around the current base vector θ_{pr,B}(k). For each ν^p, the actual parameter vector θ ∈ N(θ_{pr,B}(k)) is derived by θ = θ_{pr,B}(k) + ν^p. Thus, the population P_GA(k) at the kth generation is determined by

P_GA(k) = {θ(ℓ) | θ(ℓ) = θ_{pr,B}(k) + ν^p(ℓ), ℓ = 1, ..., N_pop^GA}.   (38)

At the genotype level, the perturbation vector is a string of 3D genes, V^p ∈ {0,1}^{3D×nb}, defined as V^p = [V_1^p, ..., V_{3D}^p] with −2^{nb−1} < V_i^p ≤ 2^{nb−1} (i = 1, ..., 3D). Each gene comprises nb bits and encodes the perturbation of a single parameter.

6.1.2. Fitness evaluation

As a fitness function we employ the global MSE in (15) (Q(θ(ℓ)) = E_G), which should be minimized for optimal parameter learning. In order to calculate the fitness value, we need to determine the consequent parameters Θ_n of the TSK model whose premise parameters are encoded in θ(ℓ). Based on (12), the overall output of the model can be determined by ŷ_k = X_k^T · Θ_n, where

X_k^T = [φ_1(x_k), φ_1(x_k) x_{1k}, ..., φ_1(x_k) x_{mk}, ..., φ_n(x_k), φ_n(x_k) x_{1k}, ..., φ_n(x_k) x_{mk}]   (39)

and Θ_n = [θ_1^T, ..., θ_n^T]^T. Note that, fixing θ(ℓ), the output ŷ_k is linear in the consequent parameters. Hence, Θ_n can be estimated using the WRLS algorithm described by (17)–(19), with w_r(x_k) = 1 and the regression vector X_k replaced by (39).
The choice of the area width is a key issue for the algorithm. Large values of Δ slow down the convergence process, while very small values of Δ may cause the GA to converge to local optimum solutions. Therefore, the search area is dynamically determined with the advance of generations, according to the following steps:



Table 5. Basic notation and parameters of the GBPL algorithm

θ_{pr,B}              Base vector of the GA for parameter learning
ν^p                   Perturbation vectors genetically produced in the neighborhood of θ_{pr,B}
nb (*)                No. of bits encoding the perturbation vectors ν^p
{Δ_a, Δ_b, Δ_c} (*)   Width values defining the neighborhood around θ_{pr,B} for the membership parameters {a, b, c}
H (*)                 Reduction factor for adjusting the parameter widths Δ_φ
k_G (*)               Generational period for reducing the search space around θ_{pr,B}
ε (*)                 Lower threshold limiting Δ_φ

(*) Set up by the user.

Step 1 (Initialization): The base vector is initialized with the premise parameters of the model obtained by GBSL, for example, θ_{pr,B}(0) = θ_{pr,o}. Furthermore, the widths Δ_a, Δ_b and Δ_c take relatively large values, properly chosen to preserve the structural constraints of the memberships.
Step 2 (Base update): With the search area currently centered at θ_{pr,B}(k) during the kth generation, the GA is allowed to proceed for k_G ∈ [20, 50] generations. At each generation ℓ = 1, ..., k_G, we observe the perturbation vector ν^e(ℓ) leading to the elite solution θ_pr^e(k + ℓ) = θ_{pr,B}(k) + ν^e(ℓ). If Q(θ_pr^e(k + ℓ)) < Q(θ_{pr,B}(k)) for some ℓ, the current base is updated as follows: θ_{pr,B}(k) ← θ_pr^e(k + ℓ). The width parameters are kept intact and the GA proceeds for another k_G generations.
Step 3 (Width adaptation): If after k_G generations there is no improvement in the best fitness value, this is an indication that the current search area has been sufficiently explored. Accordingly, the width parameters Δ_φ are reduced according to

Δ_{φ,new} = Δ_{φ,old} × H_φ,   φ ∈ {a, b, c},   (40)

where the H_φ are the reduction factors, taking values in the range 0 < H_φ < 1. Then, the GA proceeds as in Step 2. The reduction factors should be selected with care: small values of H_φ lead to a quick reduction of the search space, which may cause the GA to get stuck at a local optimum solution. As the training proceeds and the model parameters get closer to the global optimum, the search space is gradually reduced. This allows a dynamic increase of the encoding resolution, which assists in approaching the optimum with high precision. The above procedure continues until Δ_φ reaches a lower threshold value, Δ_φ < ε, which terminates the GA.
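The ASER loop of Steps 1–3 can be sketched as follows, assuming `run_ga` evolves the main GA for k_G generations around the current base vector and returns the best perturbation found; the names are illustrative, and Q here stands for the global MSE of (15), to be minimized.

import numpy as np

def aser(theta_gbsl, Q, run_ga, widths, H=0.85, k_G=30, eps=1e-4):
    base = np.asarray(theta_gbsl, dtype=float)  # Step 1: start from the GBSL model
    widths = np.asarray(widths, dtype=float)
    while widths.max() >= eps:                  # stop once every width falls below eps
        nu = run_ga(base, widths, k_G)          # explore N(base) = [base - w, base + w]
        if Q(base + nu) < Q(base):
            base = base + nu                    # Step 2: recenter the search area
        else:
            widths = widths * H                 # Step 3: shrink the search area
    return base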

6.2. MGA in parameter learning

While the main GA evolves perturbation vectors in the search area N(θ_{pr,B}(k)), the MGA explores the neighborhood of the best perturbation vector ν^e(ℓ) at each generation. Hence, the MGA operator acts synergistically with the ASER mechanism, improving the search capabilities of the overall scheme. The perturbations of the MGA are encoded using pb = nb/2 bits, that is, half the bits used by the main GA. The basic notation and the parameters of GBPL are shown in Table 5.

7. Simulation results

7.1. The first numerical example

In order to investigate the capabilities of the GBSL algorithm, we first examine a nonlinear static example with three inputs and a single output, defined by

y = f(x_1, x_2, x_3) = (x_1^{1.5} − 1.5 sin(3x_2))² + 5x_3,   x_1, x_2, x_3 ∈ [0, 3].   (41)
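For concreteness, the training set described below can be generated along these lines (a sketch; the noise is taken here as uniform in [−0.1, 0.1] and the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed
X = rng.uniform(0.0, 3.0, size=(300, 3))           # 300 random inputs in [0, 3]^3
y = ((X[:, 0] ** 1.5 - 1.5 * np.sin(3.0 * X[:, 1])) ** 2
     + 5.0 * X[:, 2]
     + rng.uniform(-0.1, 0.1, size=300))           # uniform noise of amplitude 0.1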

The system output y is highly nonlinear with respect to the inputs x_1 and x_2, and linear with respect to the input x_3. We generate a training data set of 300 points, with the inputs randomly selected in the interval [0, 3]. For each input sample, the output is derived by (41) with the addition of uniformly distributed noise of amplitude 0.1. For x_3 = 0, the shape of y as a function of x_1 and x_2 is shown in Fig. 10.

Fig. 10. Plot of (41) as a function of x_1 and x_2 for x_3 = 0.

We apply the proposed GBSL algorithm to identify the structure of the TSK model. The algorithm initiates with #_0 = 10, which may increase, if necessary, during evolution. For each input, we choose p_i = 64 (i = 1, ..., 3). Hence, six bits are used to encode the parameters u_{ir}, v_{ir} (i = 1, ..., 3; r = 1, ..., 10) of the participating FHCs. Based on the values of #_0 and p_i, each chromosome comprises a total of 370 bits (10 bits for the rule existence part, while the rest are used in the parameter encoding part). During this phase, the membership fuzzifiers are kept constant at b_{ir} = 2. Given an input partition encoded in a chromosome,

we formulate the corresponding TSK model, compute the objectives, and finally determine the fitness function value. As described in Section 4.4, the crossover values F_{2,E_G} and F_{2,J_T} of the fuzzy term "minimum" in (26) are determined during the first five generations, while F_1 = 3 and γ = 2 for all objectives. Experience over repeated independent runs shows that the values of F_{2,ξ} do not influence structure learning. In order to evaluate the performance of the SGA and the influence of the suggested operators included in GBSL, we examine four separate genetic schemes. Starting from the SGA, each subsequent scheme is formulated by inserting a group of operators with similar function. For each scheme, the evolution continues until 60,000 fitness evaluations are performed.

Scheme-1: SGA. This is a simple GA with elitism, having the following features: a population of 50 genotypes per generation, binary encoding of the parameters of the FHCs, roulette wheel for parent selection, and standard two-point crossover and mutation operators with adaptive probability rates.

Scheme-2: Scheme-1 + {MGA, PM, DHCO}. The SGA of scheme-1 is equipped with three local search operators, as described in Section 5. The MGA has the following features: a population of five perturbation vectors that evolves for 6 generations, two-point crossover and mutation operators with adaptive probability rates, and roulette wheel for parent selection. The number of bits encoding a perturbation is half that used for the encoding of the parameters by the main GA, that is, pb = 3. At each generation, DHCO is applied to the elite solution with a probability P_dhc = 0.3, while the number of selected bits is B = 4.

Scheme-3: Scheme-2 + {SCO, RIO, RRO}. The three special operators are applied to the current elite solution with probability P_split = P_in = P_rem = 0.3. For the RIO operator, A = 0.3 and P_t is taken as 10% of the total number of training data.

Fig. 11. Plot of the normalized fitness value Q̃ versus the number of fitness evaluations for the first numerical example (averages of 20 runs for each scheme).

Fig. 12. The optimal partition of the sub-spaces x_1–x_2 (a) and x_3–x_1 (b), obtained by the GBSL algorithm for the first numerical example. The symbol of each hyper-cube is placed at the upper-left corner.



Fig. 13. Plot of the global MSE versus the number of fitness evaluations for three different parameter learning schemes (averages of 20 runs for each scheme).

Table 6. Parameters of the TSK model for the first numerical example

R(i)   A_{1i}(a, b, c)       A_{2i}(a, b, c)       (a_{i0}, a_{i1}, a_{i2}, a_{i3})
R(1)   (0.44, 2.31, 2.58)    (0.27, 2.25, 2.88)    (−73.93, 17.25, 20.5, 3.82)
R(2)   (0.95, 2.36, 0.95)    (0.42, 2.25, 1.53)    (−17.21, 20.29, 9.22, 4.85)
R(3)   (0.74, 2.26, 2.07)    (0.27, 2.64, 0.27)    (−17.36, −18.52, 19.47, 5.60)
R(4)   (0.3, 2.43, 2.46)     (0.42, 2.41, 1.02)    (−66.99, 39.32, 25.11, 3.82)
R(5)   (0.74, 2.25, 0.74)    (0.45, 2.69, 2.25)    (−0.89, 0.2, 3.33, 4.84)
R(6)   (0.65, 2.25, 2.4)     (0.3, 2.25, 1.71)     (54.57, −44.39, 31.52, 3.82)
R(7)   (0.71, 2.43, 0.8)     (0.39, 2.36, 0.39)    (2.39, −4.57, 5.64, 4.58)

Scheme-4: GBSL. The final learning algorithm coincides with scheme-3, with the inclusion of the weight adaptation mechanism. The initial weight vector is W^T = [10, 3, 2, 4, 2]. The weights of the objectives are increased by a step w_p = 0.1 when the current optimum is not improved after g_u = 20 successive generations. Furthermore, if after five consecutive adaptations of a weight (Up_max = 5) the quality of the best solution is not improved, the weight returns to its value at the last successful adaptation and remains constant thereafter. The priority vector is P = [P_E, P_n, P_T, P_H, P_C] = [0.4, 0.15, 0.1, 0.2, 0.15], as discussed in Section 5.5.
In order to arrive at safe conclusions, for each of the above schemes we performed 20 independent experiments (genetic runs). The genetic schemes are

evaluated on the basis of a normalized fitness value Q̃ ∈ [0, 1]. For each experiment, Q̃ is derived using the actual values of the objectives as follows:

Q̃ = W̃_E (E_G / E_{G,max}) + W̃_n (n / n_max) + W̃_T (J_T / J_{T,max}) + W̃_H (H_n / H_{n,max}) + W̃_C ((1 − C_r) / (1 − C_{r,max})),   (42)

where E_{G,max}, n_max, J_{T,max}, H_{n,max} and C_{r,max} are the maximum observed values of the respective objectives over all experiments, and W̃ is a normalized weight vector with Σ_C W̃_C = 1, C ∈ {E_G, n, J_T, H_n, C_r}. In our comparison tests, W̃ is taken equal to the priority vector P defined above. Apparently, the lower the value of Q̃, the better the quality of the resulting model.



Note that the computation of the fitness function is the most time-consuming part of the genetic implementations. Furthermore, depending on the particular operator setup, each scheme performs a different number of fitness evaluations per generation. Therefore, for a fair comparison of the competing schemes, Q̃ is plotted versus the number of fitness evaluations, as shown in Fig. 11. The curve of each scheme is created by averaging over 20 similar curves, each corresponding to a separate experiment. From Fig. 11 it can be seen that the SGA (scheme-1) is unable to attain high-quality models with acceptable fuzzy partitions and to converge to the globally optimal solution. Furthermore, most often, each run led to a different partition. This indicates the difficulty of the optimization problem and justifies the inclusion of the suggested operators. Notice that the inclusion of the local search operators and of the problem-specific operators significantly improves the performance of scheme-2 and scheme-3, respectively. The complete GBSL algorithm, including the weight adaptation mechanism, exhibits the best performance among the competing schemes. For each of the 20 runs, almost the same TSK model is obtained, very close to the optimal one. The optimal partition of the input sub-spaces x_1–x_2 and x_3–x_1 consistently obtained by GBSL is depicted in Fig. 12(a) and (b), respectively, where the crisp HCs are shown for simplicity. The algorithm partitioned the input space into seven representative regions with small overlapping, leading to a TSK model with seven rules and a global MSE = 2.18. Observing the locations of the FHCs in Fig. 12(a) and the shape of the desired function in Fig. 10, it is concluded that the space x_1–x_2 is correctly partitioned. Particularly, the input points belonging to each FHC lie approximately on a hyper-plane describing locally the output y with respect to the nonlinear inputs x_1 and x_2. Furthermore, Fig. 12(b) shows that the input x_3 is not partitioned at all, which indicates that GBSL identified the linear dependence of y on x_3. In all rules, x_3 is described by the linguistic term "everything", that is, a fuzzy set approximately covering the entire universe of x_3; hence, x_3 does not participate in the premise part of the rules. Concluding, the GBSL algorithm has obtained a near-optimum model that is correctly defined with regard to both the premise and the consequent part.
The initial model attained by GBSL is fine-tuned using the proposed GBPL algorithm.

Fig. 14. The optimal partition of the sub-spaces y_{k−2}–y_{k−1} (a) and u_k–y_{k−1} (b), obtained by the GBSL algorithm for the second numerical example. The symbol of each hyper-cube is placed at the upper-left corner.

In order to investigate the influence of the suggested tools, the following parameter learning schemes are examined:
(1) SGA: This is an SGA with a population of 50 genotypes per generation, binary encoding with nb = 16 bits for each parameter, two-point crossover and mutation with adaptive probability rates, and roulette wheel for parent selection.



Table 7. Parameters of the TSK model for the second numerical example

R(i)   A_{1i}(a, b, c)       A_{2i}(a, b, c)        (a_{i0}, a_{i1}, a_{i2}, a_{i3})
R(1)   (1.15, 2.45, 1.91)    (1.38, 2.44, −0.017)   (0.19, −0.12, 1.37, 1.01)
R(2)   (1.61, 2.36, −1.7)    (3.31, 2.39, −0.24)    (−4.34, −0.13, −0.19, 0.99)
R(3)   (1.67, 2.58, 1.91)    (1.07, 2.41, 2.21)     (0.43, 0.61, 0.04, 0.97)

(2) SGA–ASER: The above scheme is supported by the ASER mechanism, with reduction factors H_a = H_b = H_c = 0.85 and k_G = 30.
(3) GBPL = SGA–ASER–MGA: The final learning scheme is further equipped with the MGA operator, having the following features: a population of 5 chromosomes that evolves for 7 generations, two-point crossover, mutation, and binary encoding with pb = 8 bits.
For each of the above schemes, we have conducted 20 independent experiments. Fig. 13 shows the averaged values of the global MSE as a function of the number of fitness evaluations. Notice that, compared to the SGA, the ASER and the MGA significantly improve the performance of the GBPL scheme with regard to convergence speed and level of accuracy. The final model attained by GBPL has an MSE = 0.072. It is composed of seven rules of the form

R(i): IF x_1 is A_{1i}(a, b, c) AND x_2 is A_{2i}(a, b, c) THEN y_i = a_{i0} + a_{i1} x_1 + a_{i2} x_2 + a_{i3} x_3,   (43)

with their parameters shown in Table 6.

7.2. The second numerical example

This example is a benchmark problem in fuzzy modeling, taken from Narendra et al. [24]. The plant to be identified is described by the second-order, highly nonlinear difference equation

y_k = y_{k−1} y_{k−2} (y_{k−1} + 2.5) / (1 + y_{k−1}² + y_{k−2}²) + u_k.   (44)
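For reproducibility, a sketch of simulating (44) to create the training set described next; zero initial conditions are an assumption, since the paper does not state them.

import numpy as np

rng = np.random.default_rng(0)
N = 500
u = rng.uniform(-2.0, 2.0, size=N)   # random input uniformly distributed in [-2, 2]
y = np.zeros(N + 2)                  # y[0] = y[1] = 0: assumed initial conditions
for k in range(2, N + 2):
    y1, y2 = y[k - 1], y[k - 2]
    y[k] = y1 * y2 * (y1 + 2.5) / (1.0 + y1 ** 2 + y2 ** 2) + u[k - 2]
X_train = np.column_stack([y[1:N + 1], y[0:N], u])  # inputs (y_{k-1}, y_{k-2}, u_k)
y_train = y[2:N + 2]                                # target y_k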

A training data set of 500 points is created from the plant model, assuming that the input u_k is a random signal uniformly distributed in [−2, 2]. This set is used to build a TSK model with three inputs, y_{k−1}, y_{k−2} and u_k, and a single output, y_k.

Fig. 15. The outputs of the real system and the TSK model for the second numerical example.

The GBSL algorithm is applied to identify the structure of the dynamic model, with its parameters being the same as those used in the previous example. Fig. 14(a) and (b) depict the partition of the sub-spaces y_{k−2}–y_{k−1} and u_k–y_{k−1}, respectively. The GBSL has led to an initial TSK model composed of three rules, with MSE_tr = 0.0374 on the training data. It can be seen from Fig. 14(b) that the fuzzy sets obtained by projecting the FHCs along the u_k axis are approximated by the term "everything" in all rules. This indicates that the algorithm recognized that the model output y_k is linear with respect to u_k. Consequently, the input u_k is excluded from the premise part of the rules. In the following, the initial model is trained by GBPL, providing a final model with MSE_tr = 0.018. The model is composed of three fuzzy rules of the form

R(i): IF y_{k−1} is A_{1i}(a, b, c) AND y_{k−2} is A_{2i}(a, b, c) THEN y_k^i = a_{i0} + a_{i1} y_{k−1} + a_{i2} y_{k−2} + a_{i3} u_k,   (45)



Table 8. Model comparison for the second numerical example

Model          [39]      [29]      [8]       Our model
No. of rules   8         12        75        3
Training MSE   0.6184    0.5072    0.037     0.018
Checking MSE   0.2037    0.2447    0.0403    0.025

with their parameters shown in Table 7. After the model building process has finished, the model is tested on a checking data set of 500 points, generated by applying the sinusoidal input signal u_k = sin(2πk/25) to the fuzzy model. The outputs of the model and the actual system are shown in Fig. 15. The fuzzy model performs a good tracking of the actual system, with MSE_chk = 0.025 on the checking data. Table 8 compares our model with two other TSK models suggested in [29,37] and a Mamdani-type model (class-B model) [23] suggested in [8]. Table 8 shows that our model outperforms the competing models with regard to model simplicity, the training MSE and the checking MSE.

8. Conclusions

This paper presents a genetic-based fuzzy modeling approach for generating accurate and well-defined TSK models. The model building process is divided into two phases, namely, the structure learning and the parameter learning phase. Structure learning is performed using the proposed GBSL algorithm, which determines the number of rules and the partition of the premise space. In the second phase, the parameters of the initial model are fine-tuned using the suggested GBPL algorithm. The performance of our approach is tested using a static example and a benchmark problem. Finally, our model is compared with those obtained by other methods.

References

[1] N. Baba, A new approach for finding the global minimum of error functions of neural networks, Neural Networks 2 (1989) 367–373.
[2] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[3] S. Chiu, Fuzzy model identification based on cluster estimation, J. Intell. Fuzzy Systems 2 (1994) 267–278.
[4] M.-C. Su, H.-T. Chang, Application of neural networks incorporated with real-valued genetic algorithms in knowledge acquisition, Fuzzy Sets and Systems 112 (2000) 85–97.
[5] I.-F. Chung, C.-J. Lin, C.-T. Lin, A GA-based fuzzy adaptive learning control network, Fuzzy Sets and Systems 112 (2000) 65–84.
[6] O. Cordon, F. Herrera, A two-stage evolutionary process for designing TSK fuzzy rule-based systems, IEEE Trans. Systems, Man, Cybern. (Part B) 29 (6) (1999) 703–715.
[7] G. Dozier, J. Brown, D. Bahler, Solving small and large scale constraint satisfaction problems using a heuristic-based micro-genetic algorithm, First IEEE International Conference on Evolutionary Computation, Vol. 1, IEEE Press, Piscataway, NJ, 1994, pp. 306–311.
[8] W.A. Farag, V.H. Quintana, G. Lambert-Torres, A genetic-based neuro-fuzzy approach for modeling and control of dynamical systems, IEEE Trans. Neural Networks 9 (5) (1998) 756–767.
[9] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[10] G.C. Goodwin, K.S. Sin, Adaptive Filtering, Prediction and Control, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[11] F. Hoffmann, G. Pfister, Automatic design of hierarchical fuzzy controllers using genetic algorithms, Proceedings of the Second Conference on Intelligent Techniques and Soft Computing (EUFIT'94), Aachen, Germany, 1994, pp. 1516–1522.
[12] A. Homaifar, E. McCormick, Simultaneous design of membership functions and rule sets for fuzzy controllers using genetic algorithms, IEEE Trans. Fuzzy Systems 3 (2) (1995) 129–139.
[13] S. Horikawa, T. Furuhashi, Y. Uchikawa, On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm, IEEE Trans. Neural Networks 3 (5) (1992) 801–806.
[14] H. Ishibuchi, T. Murata, A multi-objective genetic local search algorithm and its application to flowshop scheduling, IEEE Trans. Systems, Man, Cybern. 28 (3) (1998) 392–403.
[15] H. Ishibuchi, K. Nozaki, N. Yamamoto, H. Tanaka, Selecting fuzzy if-then rules for classification problems using genetic algorithms, IEEE Trans. Fuzzy Systems 3 (3) (1995) 260–270.
[16] H. Ishigami, T. Fukuda, T. Shibata, F. Arai, Structure optimization of fuzzy neural network by genetic algorithm, Fuzzy Sets and Systems 71 (1995) 257–264.



[17] J.-S.R. Jang, ANFIS: adaptive-network-based fuzzy inference system, IEEE Trans. Systems, Man, Cybern. 23 (1993) 665–685.
[18] C.L. Karr, E.J. Gentry, Fuzzy control of pH using genetic algorithms, IEEE Trans. Fuzzy Systems 1 (1993) 46–53.
[19] S.A. Kazarlis, S.E. Papadakis, J.B. Theocharis, V. Petridis, Micro-genetic algorithms as generalized hill-climbing operators for GA optimization, IEEE Trans. Evol. Comput. 5 (3) (2001) 204–217.
[20] E. Kim, M. Park, S. Ji, M. Park, A new approach to fuzzy modeling, IEEE Trans. Fuzzy Systems 5 (3) (1997) 328–337.
[21] M.A. Lee, H. Takagi, Integrating design stages of fuzzy systems using genetic algorithms, Proceedings of the IEEE International Conference on Fuzzy Systems, San Francisco, Vol. 1, 1993, pp. 612–617.
[22] J. Liska, S.S. Melsheimer, Complete design of fuzzy logic systems using genetic algorithms, Third IEEE Conference on Fuzzy Systems, Vol. 3, 1994, pp. 1377–1382.
[23] E. Mamdani, Advances in the linguistic synthesis of fuzzy controllers, Int. J. Man–Mach. Studies 8 (1976) 669–678.
[24] K.S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Trans. Neural Networks 1 (1) (1990) 4–27.
[25] V. Petridis, S. Kazarlis, Varying quality function in genetic algorithms and the cutting problem, First IEEE International Conference on Evolutionary Computation, Vol. 1, IEEE Press, Piscataway, NJ, 1994, pp. 166–169.
[26] J.-M. Renders, H. Bersini, Hybridizing genetic algorithms with hill-climbing methods for global optimization: two possible ways, First IEEE International Conference on Evolutionary Computation, Vol. 1, IEEE Press, Piscataway, NJ, 1994, pp. 312–317.
[27] M. Setnes, H. Roubos, GA-fuzzy modeling and classification: complexity and performance, IEEE Trans. Fuzzy Systems 8 (5) (2000) 509–522.
[28] K. Shimojima, T. Fukuda, Y. Hasegawa, Self-tuning fuzzy modeling with adaptive membership function, rules, and hierarchical structure based on genetic algorithms, Fuzzy Sets and Systems 71 (1995) 295–309.
[29] M. Sugeno, K. Tanaka, Successive identification of a fuzzy model and its application to prediction of a complex system, Fuzzy Sets and Systems 42 (1991) 315–334.
[30] M. Sugeno, T. Yasukawa, A fuzzy-logic-based approach to qualitative modeling, IEEE Trans. Fuzzy Systems 1 (1) (1993) 7–31.
[31] C.-T. Sun, J.-S. Jang, Fuzzy modeling based on generalized neural networks and fuzzy clustering objective functions, Proceedings of the 30th Conference on Decision and Control, Brighton, England, 1991, pp. 2924–2929.
[32] H. Tamaki, H. Kita, S. Kobayashi, Multi-objective optimization by genetic algorithms: a review, IEEE International Conference on Evolutionary Computation (ICEC'96), 1996, pp. 517–522.
[33] K. Tanaka, M. Sano, H. Watanabe, Modeling and control of carbon monoxide concentration using a neuro-fuzzy technique, IEEE Trans. Fuzzy Systems 3 (3) (1995) 271–279.
[34] W. Tautz, Genetic algorithms for designing fuzzy systems, Proceedings of the Second Conference on Intelligent Techniques and Soft Computing (EUFIT'94), Aachen, Germany, 1994, pp. 558–567.
[35] P. Thrift, Fuzzy logic synthesis with genetic algorithms, Proceedings of the Fourth International Conference on Genetic Algorithms (ICGA'91), 1991, pp. 509–513.
[36] L. Wang, R. Langari, Building Sugeno-type models using fuzzy discretization and orthogonal parameter estimation techniques, IEEE Trans. Fuzzy Systems 3 (4) (1995) 454–458.
[37] L. Wang, R. Langari, Complex systems modeling via fuzzy logic, IEEE Trans. Systems, Man, Cybern. 26 (1996) 100–106.
[38] R. Yager, D. Filev, Unified structure and parameter identification of fuzzy models, IEEE Trans. Systems, Man, Cybern. 23 (4) (1993) 1198–1205.
[39] J. Yen, L. Wang, C.W. Gillespie, Improving the interpretability of TSK fuzzy models by combining global learning and local learning, IEEE Trans. Fuzzy Systems 6 (4) (1998) 530–537.

[29] M. Sugeno, K. Tanaka, Successive identi1cation of a fuzzy model and its application to prediction of a complex system, Fuzzy Sets and Systems 42 (1991) 315–334. [30] M. Sugeno, T. Yasukawa, A fuzzy-logic-based approach to qualitative modeling, IEEE Trans. Fuzzy Systems 1 (1) (1993) 7–31. [31] C.-T. Sun, J.-S. Jang, Fuzzy modeling based on generalized neural networks and fuzzy clustering objective functions, Proceedings of the 30th Conference on Decision and Control, Brighton, England, 1991, pp. 2924 –2929. [32] H. Tamaki, H. Kita, S. Kabayashi, Multi-objective optimization by genetic algorithms: a review, IEEE International Conference on Evolutionary Computations (ICEC’96), 1996, pp. 517–522. [33] K. Tanaka, M. Sano, H. Watanabe, Modeling and control of carbon monoxide concentration using a neuro-fuzzy technique, IEEE Trans. Fuzzy Systems 3 (3) (1995) 271–279. [34] W. Tautz, Genetic algorithms for designing fuzzy systems, Proceedings of the Second Conference on Intelligent Techniques and Soft Computing (EUFIT’94), Aachen, Germany, 1994, pp. 558–567. [35] P. Thrift, Fuzzy logic synthesis with genetic algorithms, Proceedings of the Fourth International Conference on Genetic Algorithms (ICGA’91), 1991, pp. 509 –513. [36] L. Wang, R. Langari, Building Sugeno-type models using fuzzy discretization and orthogonal parameter estimation techniques, IEEE Trans. Fuzzy Systems 3 (4) (1995) 454–458. [37] L. Wang, R. Langari, Complex systems modeling via fuzzy logic, IEEE Trans. Systems Man, Cybern. 26 (1996) 100–106. [38] R. Yager, D. Filev, Uni1ed structure and parameter identi1cation of fuzzy models, IEEE Trans. Systems, Man, Cybern. 23 (4) (1993) 1198–1205. [39] J. Yen, L. Wang, C.W. Gillespie, Improving the interpretability of TSK fuzzy models by combining global learning and local leaning, IEEE Trans. Fuzzy Systems 6 (4) (1998) 530–537.