Fuzzy Sets and Systems 138 (2003) 363–379

Sparse data in the evolutionary generation of fuzzy models

Daniel Spiegel^a, Thomas Sudkamp^b,*

^a Department of Mathematics and Computer Science, Kutztown University of Pennsylvania, Kutztown, PA 19530, USA
^b Department of Computer Science, Wright State University, Dayton, OH 45435, USA

* Corresponding author. E-mail address: [email protected] (T. Sudkamp).

Received 6 September 2001; received in revised form 31 December 2002; accepted 24 March 2003

Abstract

Fuzzy rule bases have proven to be an effective tool for modeling complex systems and approximating functions. Two approaches, global and local rule generation, have been identified for the evolutionary generation of fuzzy models. In the global approach, the standard method of employing evolutionary techniques in fuzzy rule base generation, the fitness evaluation of a rule base aggregates the performance of the model over the entire space into a single value. A local fitness assessment utilizes the limited scope of a fuzzy rule to evaluate the performance in regions of the input space. Regardless of the method employed, the ability to construct models is inhibited when training data are sparse. In this research, a multi-criteria fitness function is introduced to incorporate a bias towards smoothness into the evolutionary selection process. Several multi-criteria fitness functions, which differ in the extent of the smoothness assessment and the range of its application, are examined. A set of experiments has been performed to demonstrate the effectiveness of the multi-criteria strategies for the evolutionary generation of fuzzy models with sparse data.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Approximate reasoning; Fuzzy models; Genetic algorithms

1. Introduction

System modeling is an important tool for analysis, evaluation, and prediction in classification, decision analysis, and automatic control. Fuzzy set theory provides a mathematical foundation for developing models when the knowledge of the underlying system is incomplete, imprecise, or too complicated for a closed form mathematical analysis. Two strategies are commonly employed for constructing fuzzy models: obtaining rules heuristically from experts or automatically generating rules from training information. The significance of learning algorithms has grown as the systems being



modeled have become increasingly sophisticated and the relationships among the system variables more complex. Algorithms for model construction generally fall within one of four categories: neural-fuzzy systems, clustering analysis, techniques based on proximity analysis, and evolutionary rule generation. When sufficient training information is available, these techniques can produce highly precise models. In this paper, we consider the generation of models using evolutionary computation when training data are sparse. A multi-criteria fitness function, incorporating both training data error and an assessment of the smoothness of the model, is employed to enhance the ability of the evolutionary algorithm to generate rules in regions without training data.

A feature shared by the majority of the applications of evolutionary approaches to fuzzy model generation is that the fitness function provides an assessment of the quality of the rule base. We will refer to such a strategy as a global evolutionary algorithm. In a global evolutionary algorithm, the fitness reflects how well the rule base matches the training data over the entire input space. With an aggregated fitness value, it is impossible to differentiate regions where the rules approximate the training data well from regions where they approximate poorly, introducing a fuzzy model version of the classic credit assignment problem. Local evolutionary learning [3,4,17] addresses this by performing multiple evolutionary searches within predefined regions of the input space. Each region has its own population of rules that are generated by independent evaluation and reproduction.

In generating rules from training information, whether using a local or global strategy, it is possible that there are regions of the input space without training data. When the support of a rule is contained in such a region, the selection of a rule degenerates to a strictly random process. The incorporation of an assessment of model smoothness into the fitness function provides a criterion for rule selection in regions with sparse data.

Since the pioneering work of Karr [11,12], there has been a considerable body of research on the generation of fuzzy rules using evolutionary techniques [10,14,13,5]. An extensive bibliography of contributions to the subject can be found in [6]. In [1], Bäck identified two general applications of evolutionary techniques to the generation of fuzzy models: the initial determination of a rule base and the optimization of an existing rule base by tuning membership functions. The focus of this paper is on the former application, with emphasis on strategies for rule generation with sparse training data. However, the method proposed for influencing rule selection in areas without training information could be used to tune rule bases that have been produced by any rule generation strategy.

We begin with a brief review of the components of fuzzy models, the rule base representation used in our evolutionary algorithms, and the global and local strategies for evolutionary rule base generation. This is followed by an examination of the impact of sparse data on evolutionary model generation and the introduction of a secondary fitness criterion based on smoothness. We conclude with the presentation of experimental results demonstrating the effectiveness of the multi-criteria approach to the generation of models in applications with sparse training data.

2. Model representation

A fuzzy model is defined by a set of fuzzy if–then rules.
The antecedent of a rule describes circumstances under which the rule is applicable and the consequent provides an approximate conclusion or response for these conditions. The two predominant types of fuzzy rules differ in the form of the consequent. In Mamdani style rules [27], the consequent is a fuzzy set over the output domain.


The consequent of a Takagi–Sugeno–Kang (TSK) rule [28] is a function of the input values. The examples and experimental results presented in this paper construct models with two input domains and one output domain that are normalized to the interval [−1, 1]. The models will be defined by TSK rules with constant consequents. The fundamental representations and strategies, however, are applicable to both Mamdani and TSK style rules with parametrically defined consequents and are extendable to systems with higher dimensions.

The antecedents of the rules are obtained from decompositions of the input domains. A decomposition of a domain U consists of a set of overlapping fuzzy sets that form a fuzzy quantization of the domain. Throughout this paper, we will decompose the input domains by a sequence of triangular fuzzy sets A_1, ..., A_n. The center points a_1, ..., a_n of the triangular membership functions completely determine the domain decomposition: A_i is defined by center point a_i, the boundary of the support to the left a_{i−1}, and the boundary of the support to the right a_{i+1}.

A TSK rule with a constant consequent for a two-input system with input space U × V has the form

if X is A_i and Y is B_j then z = c_{i,j},

where A_i and B_j are fuzzy sets from the decompositions of U and V, respectively.

A rule base defines a model, which is realized as a function f : U × V → W from the input space to the output space. In a model of a two-input system with triangular decompositions defined by center points a_1, a_2, ..., a_n and b_1, b_2, ..., b_m, the result for input (x, y) is determined by at most four rules. Let x ∈ [a_i, a_{i+1}] and y ∈ [b_j, b_{j+1}]. The rules that determine the result are

if X is A_i and Y is B_j then z = c_{i,j},
if X is A_i and Y is B_{j+1} then z = c_{i,j+1},
if X is A_{i+1} and Y is B_j then z = c_{i+1,j},
if X is A_{i+1} and Y is B_{j+1} then z = c_{i+1,j+1}.

Weighted averaging based on the membership of the input in the antecedents of the rules is used to combine the rule consequents to produce

f_{i,j}(x, y) = [ Σ_{s=i}^{i+1} Σ_{t=j}^{j+1} A_s(x) B_t(y) c_{s,t} ] / [ Σ_{s=i}^{i+1} Σ_{t=j}^{j+1} A_s(x) B_t(y) ]    (1)

over the subset [a_i, a_{i+1}] × [b_j, b_{j+1}] of U × V. The function f_{i,j} is referred to as a local approximating function.

The center points a_1, ..., a_n and b_1, ..., b_m of the triangular domain decompositions form an n × m grid on the input space U × V as seen in Fig. 1.

Fig. 1. Domain decomposition and grid points.

The four rules whose antecedents have nonzero membership values in a region [a_i, a_{i+1}] × [b_j, b_{j+1}] determine the local approximating function over that region. The function f is obtained by combining the local approximating functions over each of the nm regions of the grid.
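As a concrete illustration of this representation, the following sketch (Python with NumPy; identifiers such as evaluate_model are illustrative, not taken from the paper) builds triangular memberships from the center points and computes the weighted average of Eq. (1), in which at most four rules contribute to the output at any input.

import numpy as np

def tri_membership(x, centers, i):
    """Membership of x in the triangular fuzzy set A_i with center centers[i]
    and support bounded by the neighboring centers."""
    c = centers[i]
    left = centers[i - 1] if i > 0 else c                    # support boundary to the left
    right = centers[i + 1] if i < len(centers) - 1 else c    # support boundary to the right
    if x == c:
        return 1.0
    if left < x < c:
        return (x - left) / (c - left)
    if c < x < right:
        return (right - x) / (right - c)
    return 0.0

def evaluate_model(x, y, a_centers, b_centers, C):
    """Weighted-average output of the TSK rule base (Eq. (1)).
    C[i, j] is the constant consequent of 'if X is A_i and Y is B_j'."""
    i = int(np.clip(np.searchsorted(a_centers, x) - 1, 0, len(a_centers) - 2))
    j = int(np.clip(np.searchsorted(b_centers, y) - 1, 0, len(b_centers) - 2))
    num, den = 0.0, 0.0
    for s in (i, i + 1):          # at most four rules have nonzero membership
        for t in (j, j + 1):
            w = tri_membership(x, a_centers, s) * tri_membership(y, b_centers, t)
            num += w * C[s, t]
            den += w
    return num / den if den > 0 else 0.0

# Example: a 10 x 10 grid of evenly spaced centers on [-1, 1]
a = np.linspace(-1, 1, 10)
b = np.linspace(-1, 1, 10)
C = np.zeros((10, 10))
print(evaluate_model(0.3, -0.4, a, b, C))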

In the evolutionary generation of fuzzy models, the population consists of multiple rule bases and the members of the population are evaluated relative to some measure of desirability or fitness. In each generation, a subset of the population is selected for reproduction based on the fitness assessment. Reproductive operators, usually variations of crossover and mutation, are employed on this subset to produce the subsequent population. The objective is to improve the fitness of the population by retaining and propagating members with higher fitness values.

An initial step in the development of an evolutionary algorithm is the selection of a representation for the members of the population. Using the triangular domain decompositions and constant consequents, a complete definition of a fuzzy rule base for a two-input system is given by the center points a_1, ..., a_n and b_1, ..., b_m of the domain decomposition and the consequents c_{i,j}. This rule base is represented by the matrix

        b_1       b_2       ···    b_m
a_1     c_{1,1}   c_{1,2}   ···    c_{1,m}
a_2     c_{2,1}   c_{2,2}   ···    c_{2,m}
...     ...       ...              ...
a_n     c_{n,1}   c_{n,2}   ···    c_{n,m}

A position (a_i, b_j) is called a grid point, and the combination of the grid point and the entry c_{i,j} in the matrix at position (a_i, b_j) represents the rule 'if X is A_i and Y is B_j then z = c_{i,j}'. With this representation, a learning algorithm for a fuzzy rule base with fixed domain decompositions can be considered to be a strategy for searching the space of matrices for one whose associated model best matches the desired criteria.

3. Evolutionary model generation

An evolutionary search is guided by the evaluation of the fitness of members of the population. When a model is produced from training data, the fitness measures the degree to which the approximating function defined by the model matches the data. A training set for a two-input system is a set T = {(x_i, y_i, z_i) | i = 1, ..., N}, where (x_i, y_i) are elements of the input domains and z_i is the associated output value.

Fig. 2 shows the support of the rule 'if X is A_i and Y is B_j then z = c_{i,j}' and its associated grid point (a_i, b_j). The region around the grid point (a_i, b_j), indicated by the dotted rectangle, is referred to as the region of inclusion of the rule. The fitness of the grid point is determined by the degree to which the local approximating functions match the training data in the region of inclusion.

Fig. 2. Region of inclusion of (a_i, b_j).

Consider a training point (x_0, y_0, z_0) in the region of inclusion of (a_i, b_j) in Fig. 2. The rule corresponding to grid point (a_i, b_j) is used, along with the rules associated with grid points (a_{i−1}, b_j), (a_i, b_{j+1}), and (a_{i−1}, b_{j+1}), to compute the value f(x_0, y_0) of the model at that point. The approximation error at the point is |f(x_0, y_0) − z_0|. The fitness function of the evolutionary algorithms will attempt to minimize the sum-squared error. Thus with each grid point (a_i, b_j) we associate the local training data approximation error

ltde(i, j) = Σ (f(x_l, y_l) − z_l)^2,    (2)

where the sum is taken over all the training points (x_l, y_l, z_l) in the region of inclusion of (a_i, b_j).

In most evolutionary approaches to generating fuzzy rule bases from training data, fitness is evaluated globally. Each rule base is assigned a single value that represents the degree to which the approximating function matches the training data over the entire input domain. The global training data fitness of a rule base RB is defined by

gtdf(RB) = (1/N) Σ_{i=1}^{N} (f(x_i, y_i) − z_i)^2    (3)
         = (1/N) Σ_{i=1}^{n} Σ_{j=1}^{m} ltde(i, j),    (4)

where f is the approximating function defined by RB and the sum in (3) is taken over the N members of the training set.

Evolutionary search uses the fitness value to select elements from the population to contribute to the subsequent generation. Once the elements of the next generation are selected, they may be modified by the application of the evolutionary operations crossover and mutation. A single point crossover for the matrix representation of rule bases RB_1 and RB_2 is obtained by randomly choosing a location (s, t) and interchanging the values in entries of RB_1 after position (s, t), using row-major ordering, with the corresponding entries in RB_2.
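The sketch below (illustrative Python with NumPy, not the authors' implementation; the perturbation scale and the clipping of consequents to the normalized range [−1, 1] are assumptions) shows the row-major single-point crossover just described, together with the consequent-perturbing mutation discussed in the next paragraph.

import numpy as np

rng = np.random.default_rng(0)

def single_point_crossover(C1, C2):
    """Swap all entries after a randomly chosen position, using row-major
    ordering, between two rule-base matrices."""
    flat1, flat2 = C1.ravel().copy(), C2.ravel().copy()
    cut = rng.integers(1, flat1.size)                 # crossover point in row-major order
    flat1[cut:], flat2[cut:] = flat2[cut:].copy(), flat1[cut:].copy()
    return flat1.reshape(C1.shape), flat2.reshape(C2.shape)

def mutate(C, rate=0.01, scale=0.1):
    """Perturb each consequent c_{i,j} with probability `rate` by a small
    random amount; values are kept in the normalized output range [-1, 1]."""
    C = C.copy()
    mask = rng.random(C.shape) < rate
    C[mask] += rng.normal(0.0, scale, size=mask.sum())
    return np.clip(C, -1.0, 1.0)

# Usage on two random 10 x 10 rule bases
A = rng.uniform(-1, 1, (10, 10))
B = rng.uniform(-1, 1, (10, 10))
child1, child2 = single_point_crossover(A, B)
child1 = mutate(child1)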


When members of the population are defined by a set of real-valued features, mutation is accomplished by making a small random perturbation to the value of a feature. This type of mutation is the fundamental operation for applying evolutionary search to parameter adjustment and real-valued optimization problems [2]. In the search for a rule base, mutation is implemented by modifying a consequent value c_{i,j} in the rule base matrix.

The strategy for learning fuzzy rule bases using the global fitness assessment follows the standard evolutionary paradigm [8]:

1. randomly generate a population P(0) of rule bases
2. t = 0
3. repeat
   a) evaluate the fitness of the members in the population P(t)
   b) select a subset P′ ⊆ P(t) based on the global fitness assessment of the rule bases in P(t); this is the initial P(t + 1)
   c) apply the evolutionary operations to rule bases in P(t + 1)
   d) t = t + 1
   e) if the halting criterion is satisfied, exit

The cycle of fitness evaluation, selection, and production of the subsequent generation may terminate after a predefined number of iterations or when a rule base is produced that satisfies a desired fitness specification.

As seen in Eq. (4), the approximation errors at each training point are combined to obtain the fitness of the rule base. Since fuzzy rules are local, it may be advantageous to consider fitness locally rather than over the entire input space. A local perspective has the advantage of being able to differentiate areas that are approximated well from those that are approximated poorly, leading to the possibility of retaining the fitter rules in areas of good approximation and discarding less desirable approximations. The local training data fitness associated with grid point (a_i, b_j) is defined as

ltdf(i, j) = ltde(i, j) / N′,    (5)

where N′ is the number of training points in the region of inclusion of (a_i, b_j). If there are no training points in the region of inclusion, ltdf(i, j) is assigned the default value 0.

The local search strategy used in this paper [15,17] modifies the manner in which rule bases are generated. The local search process determines the consequent of each rule independently using a fitness evaluation based only on training points in the region of inclusion of the rule. If the population consists of p rule bases, there are p values defined for each grid point (a_i, b_j), one from each rule base. Consequently, each rule base provides nm rules that will be examined by nm local searches. The local strategy independently evaluates, selects, and regenerates the p rules for each of the nm grid points. A complete rule base is then constructed from the results of the local searches.

An iteration of the learning cycle begins by measuring the fitness of each grid point, as described in Eq. (5). Fig. 3 shows the first step in the selection process on a population of five rule bases, each with six rules.


Fig. 3. Local rule generation.

The rules associated with the first grid point compete to be included in the new population. These rules have a probability of selection commensurate with their local fitness. The right side of the diagram shows a potential outcome of the selection of values for the first rule. The process of local selection continues for all rules. The first rule base to be included in the next generation is constructed from the first rule chosen in each of the local searches, the second rule base from the second rules chosen, and so on. Following the selection of the individual rules and the construction of the rule bases, mutation is applied to the members of the new population at a specified mutation rate. Crossover is not employed in the local process because elements of the new population are already constructed from multiple parents.

The fitness of the rule defined by grid point (a_i, b_j) is determined by how well the training data within the region of inclusion of (a_i, b_j) match the local approximating functions. This measurement is local, using only the rules associated with (a_i, b_j) and the grid points adjacent to it. Although the measurement of fitness is rule oriented, the termination of the algorithm depends upon the fitness of an entire rule base, since the objective is to produce a coherent rule base and not an independent set of rules. After the complete rule bases have been constructed, the rule base fitness is obtained using Eq. (3) to retain the optimal rule base and check the termination condition.

The local strategy used in this research differs from that of Cordón and Herrera [4] in the method of compensating for the interactions among locally generated rules. As illustrated in Fig. 3, at the end of each generation the rules produced by the local searches are combined to produce an entire rule base. The elements of the entire rule base are used to produce the fitness values for the subsequent population. Thus local and global analysis are intertwined in each generation. The algorithm of Cordón and Herrera separates the rule generation process into two distinct stages: local rule generation and global refinement. A strictly local strategy is used to produce a rule in each region of the space. The resulting rules are then combined to produce a preliminary rule base. The second stage employs an evolution strategy to refine the preliminary rule base.

A set of experiments has been performed comparing the ability of the local and global techniques to generate models [15,17]. The local technique has been shown to consistently outperform the global method in tests with both sufficient and sparse training data. However, both strategies often failed to perform well when the training data were sparse. Experimental results comparing these techniques and demonstrating the effect of sparse data will be given in Section 5.
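A minimal sketch of the per-grid-point selection step described above (illustrative Python with NumPy; since the local fitness of Eq. (5) is an error to be minimized, the conversion of errors to selection probabilities via 1/(1 + error) is an assumption, not a detail given in the paper):

import numpy as np

rng = np.random.default_rng(1)

def local_selection(population, local_error):
    """population:  array of shape (p, n, m) of consequent matrices.
    local_error:    array of shape (p, n, m); local_error[k, i, j] is the
                    local training data fitness of rule (i, j) in rule base k.
    Returns a new (p, n, m) population in which, at every grid point, the
    consequents are drawn from the old population with probability
    decreasing in their local error."""
    pop_size, n, m = population.shape
    new_pop = np.empty_like(population)
    for i in range(n):
        for j in range(m):
            weights = 1.0 / (1.0 + local_error[:, i, j])   # assumed error-to-weight mapping
            probs = weights / weights.sum()
            chosen = rng.choice(pop_size, size=pop_size, replace=True, p=probs)
            new_pop[:, i, j] = population[chosen, i, j]
    return new_pop

# Usage: 25 candidate rule bases on a 10 x 10 grid
pop = rng.uniform(-1, 1, (25, 10, 10))
err = rng.random((25, 10, 10))
next_pop = local_selection(pop, err)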


4. Smoothness as a secondary fitness criterion

With sparse training data, there may be rules that have no training data in their region of inclusion. When this occurs we say that the associated grid point is uncovered. Regardless of the evolutionary strategy employed for selecting the rules, having uncovered grid points with fitness determined solely by training data may reduce the rule selection to a random exercise. For local evolutionary learning, there is no criterion upon which to base the selection of one potential consequent over another; the fitness of every such rule is the default value assigned when there is no training data. For global fitness assessment, the value for an uncovered grid point plays at most a secondary role in the fitness evaluation. Consider the uncovered grid point (a_{i+1}, b_{j−1}) in Fig. 2. The value of this grid point is used in the calculation of the fitness of grid points (a_i, b_{j−1}), (a_i, b_j), and (a_{i+1}, b_j), and consequently its value is reflected in the global fitness function. If all the neighboring grid points of an uncovered grid point (a_s, b_t) are also uncovered, then the value assigned to (a_s, b_t) has no role in the determination of the fitness and its selection is strictly random.

The preceding discussion also exhibits the detrimental effect of an uncovered grid point on neighboring regions. The value of the uncovered grid point (a_{i+1}, b_{j−1}) in Fig. 2 is used in the calculation of the fitness of grid points (a_i, b_{j−1}), (a_i, b_j), and (a_{i+1}, b_j), complicating the ability to make optimal selections at these points.

In rule base learning based on the proximity of training data to rules, an interpolative technique called completion [18,19] has been employed to produce rules in regions without training data. The intuitive basis for completion is to extend rules generated from training data to adjacent regions in a manner that ensures continuity in the resulting model. The same objective can be achieved in evolutionary learning by incorporating a secondary criterion into the fitness evaluation that measures the degree of smoothness in the transitions between local approximating functions. This criterion introduces a bias for smoothness in the rule selection process and provides a link between global and local evaluation.

In the assessment of fuzzy rule bases, we measure the smoothness of a grid point (a_i, b_j) by the difference between the grid point value c_{i,j} and the value at (a_i, b_j) of a line between the adjacent grid points in each dimension. Smoothness at (a_i, b_j) can be measured only when there are grid points on both sides. In a model with one input, no smoothness measure is associated with the endpoints. In a two-dimensional grid, there is no smoothness measure for corner points and the smoothness of non-corner border points may be computed in only one direction.

Fig. 4 shows the grid points adjacent to (a_i, b_j) in one input dimension. To estimate the smoothness, the value c′_{i,j} at (a_i, b_j) of the line between (a_{i−1}, b_j, c_{i−1,j}) and (a_{i+1}, b_j, c_{i+1,j}) is determined. The difference |c_{i,j} − c′_{i,j}| provides a measure of the degree of curvature of the approximation in that direction. The overall smoothness measure is the average of the differences between the entry in the rule matrix and the interpolations. If there are k input dimensions, the smoothness measure at grid point (a_i, b_j) is

s(i, j) = (1/k) Σ_{d=1}^{k} |c_{i,j} − c′_{i,j}(d)|,    (6)

where c′_{i,j}(d) denotes the interpolated value in dimension d.

Fig. 4. Calculation of smoothness.

The smoothness of the grid points can be aggregated to obtain a global measure of the smoothness of the entire rule base or can be used locally at each grid point. In the former case, the global smoothness measure for a rule base RB is

s(RB) = (1/nm) Σ_{i=1}^{n} Σ_{j=1}^{m} s(i, j).    (7)
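A sketch of the smoothness computation (illustrative Python with NumPy). It assumes evenly spaced center points, as used in the experiments of Section 5, so the interpolated value c′_{i,j} in each dimension reduces to the average of the two neighboring consequents; grid points lacking a neighbor in a dimension simply skip that dimension, and corner points contribute 0 to the aggregate of Eq. (7).

import numpy as np

def smoothness(C):
    """s[i, j]: average absolute difference between c_{i,j} and the linear
    interpolation of its two neighbours in each input dimension (Eq. (6),
    evenly spaced centers assumed)."""
    n, m = C.shape
    s = np.zeros((n, m))
    counts = np.zeros((n, m))
    # dimension 1: neighbours (i-1, j) and (i+1, j)
    interp = 0.5 * (C[:-2, :] + C[2:, :])
    s[1:-1, :] += np.abs(C[1:-1, :] - interp)
    counts[1:-1, :] += 1
    # dimension 2: neighbours (i, j-1) and (i, j+1)
    interp = 0.5 * (C[:, :-2] + C[:, 2:])
    s[:, 1:-1] += np.abs(C[:, 1:-1] - interp)
    counts[:, 1:-1] += 1
    # average over the dimensions in which smoothness is defined
    return np.divide(s, counts, out=np.zeros_like(s), where=counts > 0)

def global_smoothness(C):
    """s(RB): average of the grid-point smoothness values (Eq. (7))."""
    return smoothness(C).mean()

# Usage on a random 10 x 10 consequent matrix
rng = np.random.default_rng(2)
C = rng.uniform(-1, 1, (10, 10))
print(global_smoothness(C))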

Employing s(RB) in the fitness function is a global secondary criterion. Learning rule bases using the global strategy requires that a single fitness value be assigned to each rule base. Following this procedure, the value s(RB) is incorporated into the fitness measure by

gf(RB) = w · gtdf(RB) + (1 − w) · s(RB),    (8)

where w ∈ [0, 1].

There are two ways to apply a smoothness measure in local evolutionary learning: full and partial. An application is full when the smoothness measure is incorporated into the fitness of every grid point without regard to whether the grid point is covered by training data. Another strategy is to employ smoothness only when a grid point is uncovered. The intuition for this is to base decisions solely on the best information, the training data, when it is available; the use of the secondary criterion is limited to when training data is unavailable. This is referred to as a partial application, since the smoothness measure is incorporated for only a subset of the grid points.

Combining global and local assessments of smoothness with full and partial application produces four possible strategies for incorporating smoothness into the evolutionary process. Prior experimentation [16] has indicated that the full-local and partial-global combinations are the most successful strategies for compensating for sparse data in the local evolutionary algorithm. The fitness function for local evolutionary learning employing full-local smoothness is

flf(i, j) = w · ltdf(i, j) + (1 − w) · s(i, j).    (9)

The fitness of each grid point is based on both training data and the smoothness measurement for that grid point. The partial-global strategy uses the secondary criterion only in the absence of training data:

pgf(i, j) = ltdf(i, j)   if (a_i, b_j) is covered,
            s(RB)        otherwise.    (10)
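For clarity, the three combinations of Eqs. (8)–(10) can be written directly (illustrative Python; the quantities gtdf, ltdf, s(i, j), and s(RB) are assumed to have been computed as defined above, and all are error-like values to be minimized):

def global_fitness(gtdf_rb, s_rb, w=0.7):
    """Eq. (8): weighted combination of global training data fitness and
    global smoothness of the rule base."""
    return w * gtdf_rb + (1.0 - w) * s_rb

def full_local_fitness(ltdf_ij, s_ij, w=0.7):
    """Eq. (9): every grid point combines its local training data fitness
    with its own smoothness measure."""
    return w * ltdf_ij + (1.0 - w) * s_ij

def partial_global_fitness(ltdf_ij, s_rb, covered):
    """Eq. (10): use training data fitness where the grid point is covered,
    and the global smoothness measure otherwise."""
    return ltdf_ij if covered else s_rb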


Sparse data introduces a type of instability into the evolutionary generation of rule bases: rule bases that differ significantly in the uncovered regions may produce nearly identical fitness evaluations. The importance of the addition of a secondary criterion is to discriminate between these cases, with the weighting factor w determining the degree of the bias for smoothness. The objective is similar to that of the incorporation of regularization into the generation of fuzzy rule bases using least-squares optimization [9]. In the latter, a regularization term is added to stabilize an ill-posed system and the regularization parameter determines the degree to which the additional term influences the optimization.

5. Experiments with multi-criteria fitness

An experimental suite was developed to determine the effectiveness of the multi-criteria approach for generating rule bases with sparse data. To simulate learning from training data, a function f is chosen as a target function. The experimental results presented in this paper consider target functions with two input variables. The training information is obtained by randomly selecting N points (x_i, y_i) from the input domain. After this selection, the target function is used to produce the training set T = {(x_i, y_i, z_i) | i = 1, ..., N}. The evolutionary algorithms are run to produce a rule base whose associated approximating function best fits the training information. The input domain decompositions consist of 10 evenly spaced triangular fuzzy sets. Experiments were conducted with various target functions and training set sizes.

The effectiveness of the algorithms is analyzed by determining the difference between the target function and the approximation on a test set. Since the objective of the experimentation is to analyze the effectiveness of the methods of incorporating smoothness into the fitness evaluation, the test set is restricted to the region of the input domain where smoothness can be calculated in each dimension. Being an interpolative technique, the computation at grid point (a_i, b_j) requires a neighboring grid point in all directions. Fig. 5 shows several grid points for which the smoothness measure cannot be calculated, together with their regions of inclusion. The test set consists of points evenly spaced within the subspace [a_2, a_{n−1}] × [b_2, b_{m−1}]. The shaded region in Fig. 5 shows the test area of a 5 × 5 grid. The test set for the 10 × 10 grid used in these experiments contains 900 points evenly spaced throughout the test region.

Fig. 5. Test area on a grid with 2 input dimensions.
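A sketch of this evaluation step (illustrative Python with NumPy; the 30 × 30 layout of the 900 evenly spaced test points and the placeholder model are assumptions, and model stands for the approximating function f defined by a rule base):

import numpy as np

def test_errors(target, model, a_centers, b_centers, points_per_axis=30):
    """Average and maximum |model(x, y) - target(x, y)| on an evenly spaced
    grid restricted to [a_2, a_{n-1}] x [b_2, b_{m-1}], where smoothness is defined."""
    xs = np.linspace(a_centers[1], a_centers[-2], points_per_axis)
    ys = np.linspace(b_centers[1], b_centers[-2], points_per_axis)
    errs = np.array([abs(model(x, y) - target(x, y)) for x in xs for y in ys])
    return errs.mean(), errs.max()

# Usage with one of the target functions from Section 5 and a placeholder model
f2 = lambda x, y: x**2 + y**2 - 1
a = b = np.linspace(-1, 1, 10)
zero_model = lambda x, y: 0.0      # stands in for the rule-base model f
avg_err, max_err = test_errors(f2, zero_model, a, b)
print(avg_err, max_err)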


A test configuration consists of the selection of a target function, the number of training points, a fitness function, and a weighting. A single run of an algorithm consists of 1000 generations of the evolutionary cycle. Upon the completion of the run, the average and maximum error on the test set are recorded. Each configuration is run 10 times and the results reported are the average of the average errors and the average of the maximum errors over the 10 iterations. The initial population consists of 25 randomly generated rule bases. All of the experiments reported in this paper were run with a mutation rate of 0.01, and the crossover rate for the global evolutionary strategy is 0.5. These values were selected based on their performance in a set of experiments conducted to determine the impact of crossover and mutation rates, including self-adapting mutation, on the effectiveness of rule base generation using the global and local strategies.

The target functions that we will consider are

f1(x, y) = (x + y)/2,
f2(x, y) = x^2 + y^2 − 1,
f3(x, y) = sin(π(x + y)).

The first target function is a simple planar surface that should be easily approximated with minimal training data. It is included to illustrate the effects of sparse data in evolutionary approaches to rule base generation. The latter two targets provide varying degrees of curvature in the function.

The first step in an iteration of the algorithm is the construction of a training set. This is accomplished by randomly selecting points (x_i, y_i) from the input domain to produce training points (x_i, y_i, f(x_i, y_i)). For comparison purposes, each of our experiments uses the same random selection of input coordinates for the training points. Each iteration of the algorithm is run for 1000 generations, and the rule base with minimal average error on the training data generated throughout the 1000 generations is the result of the trial.

5.1. Effects of sparse data

The experimental results in Table 1 are presented to demonstrate the effect of sparse data on rule base generation and to compare the global and local evolutionary learning approaches. The first observation is the difference in error between models produced using the global and local evolutionary strategies. With 1000 training points, the training information covers the entire input space and the error can be directly attributed to the learning algorithm rather than to the distribution of data. For the planar function f1, the best rule base produced by global fitness evaluation has average test set error 0.367 and average maximal error 1.299 over the 10 iterations of the algorithm. This compares to errors of 0.003 and 0.085 for the local learning strategy. As the target functions become more complicated, the error increases for both learning strategies. The local strategy, however, continues to produce results at least an order of magnitude better than global learning. The poor performance of globally based strategies for producing precise TSK models within a restricted number of generations has also been observed in [7,4].

The difference in the results of the local and global strategies can be attributed to the degree to which the credit assignment problem affects the selection of the subsequent generation. The approximation in a local region is produced by exactly four rules.

Table 1
Effects of sparse data

Training    Uncovered    Global learning          Local learning
points      regions      Ave error   Max error    Ave error   Max error

(a) Target: f1(x, y) = (x + y)/2
1000         0.6         0.367       1.299        0.003       0.085
 100        37.5         0.351       1.311        0.161       1.175
  50        61.8         0.366       1.376        0.245       1.273

(b) Target: f2(x, y) = x^2 + y^2 − 1
1000         0.6         0.540       1.656        0.006       0.058
 100        37.5         0.578       1.686        0.212       1.479
  50        61.8         0.594       1.687        0.356       1.580

(c) Target: f3(x, y) = sin(π(x + y))
1000         0.6         0.614       1.672        0.023       0.101
 100        37.5         0.627       1.739        0.235       1.489
  50        61.8         0.638       1.789        0.396       1.606

Thus the error from a training point is attributable to at most four rules. The local search uses the error to guide the selection of the subsequent population for the nearest of those four rules. Thus the credit assignment, while not perfect in a single generation, is well focused. The global search is unable to attribute error to regions, and the selection process eliminates good approximations along with poor ones throughout the entire space. Consequently, the global approach will require many more generations than the local approach to obtain a rule base in which essentially all of the local approximators are good.

The effect of sparse data is illustrated by reducing the size of the training set. With the global evolutionary strategy the reduction of the training set has little impact on the quality of the model being produced; the results on the test set are mediocre regardless of the training set size. The reduction has a dramatic effect on the results of local learning. This is best illustrated by the growth of the average error. The average error increased from 0.003, 0.006, and 0.023 to 0.245, 0.356, and 0.396 for f1, f2, and f3, respectively, as the size of the training set decreased from 1000 to 50.

Since the output universe is [−1, 1], an error of 1.0 is 50% of the size of the domain. Only one test point out of the test set of 900 elements is needed to produce a high maximal error. With nonsparse training data, the maximum error for the global strategy varies between 65% and 84% of the size of the domain. Under the same circumstances, the maximal error for models produced by the local learning strategy is roughly 5% of the domain size.

Table 2 gives the distribution of test point error for target function f2. When the data is not sparse, the error of all of the test points falls within the range [0, 0.2) with the local strategy. For the global approach, only 19% of the test errors are within this range and 11% have error greater than half of the size of the domain. As the training data becomes sparse, there is little change in the distribution of maximal errors for the global strategy. For local learning, however, the number of test points with large error increases with the decrease in training set size.

Table 2
Distribution of errors: target f2

                         Training points
                 1000                  100                   50
Error            global    local       global    local       global    local

[0, 0.2)         170.5     900.0       162.6     594.5       152.1     404.4
[0.2, 0.4)       176.7       0.0       173.3     146.2       164.9     179.7
[0.4, 0.6)       191.0       0.0       171.5      69.5       169.7     121.0
[0.6, 0.8)       159.0       0.0       142.4      43.7       149.5      83.2
[0.8, 1.0)       107.7       0.0       116.4      25.9       125.6      57.4
[1.0, 1.2)        57.6       0.0        73.6      12.2        73.7      29.9
[1.2, 1.4)        24.5       0.0        37.5       5.9        40.6      16.4
[1.4, 1.6)         9.6       0.0        17.7       1.8        18.0       6.7
[1.6, 1.8)         2.8       0.0         4.6       0.3         5.6       1.3
[1.8, 2.0]         0.0       0.0         0.4       0.0         0.3       0.0

Table 3
Smoothness in global fitness

Training        w = 1                   w = 0.7                 w = 0.5
points          Ave error  Max error    Ave error  Max error    Ave error  Max error

(a) Target: f1(x, y) = (x + y)/2
1000            0.367      1.299        0.366      1.319        0.371      1.327
 100            0.351      1.311        0.357      1.328        0.371      1.301
  50            0.366      1.376        0.376      1.333        0.377      1.296

(b) Target: f2(x, y) = x^2 + y^2 − 1
1000            0.540      1.656        0.546      1.615        0.557      1.637
 100            0.578      1.686        0.566      1.665        0.576      1.667
  50            0.594      1.687        0.566      1.618        0.564      1.560

(c) Target: f3(x, y) = sin(π(x + y))
1000            0.614      1.672        0.588      1.683        0.612      1.731
 100            0.627      1.739        0.603      1.711        0.616      1.715
  50            0.638      1.789        0.603      1.734        0.632      1.783

5.2. Global learning and smoothness

In this section, we examine the incorporation of smoothness into the global learning algorithm using the fitness function given in Eq. (8). Table 3 gives the results with a weighting of the training data error of 1.0, 0.7, and 0.5. As noted above, the difference between the precision of models built with sufficient or sparse data is minimal; both cases yield models with high maximal and average errors.

It was hoped that the addition of smoothness in the selection of the rule base would improve the results, since it favors selecting rule bases with characteristics that are present in the target functions. The data in Table 3 does not support this expectation. Regardless of the weight employed, little or no improvement was produced by the incorporation of the secondary criterion. As we will see in the next section, the addition of smoothness produces a significant improvement in the performance of the local algorithm with sparse data. This improvement shows that useful information is being provided by the smoothness measure. Consequently, the lack of a positive effect on the performance of the global algorithm is again likely due to the difficulties imposed by the credit assignment problem when the evolutionary search is restricted to a fixed number of generations.

6. Local learning with smoothness

In this section, we examine the results of incorporating the bias for smoothness into the local strategies. We will also examine the effect of noise in the training data on the algorithm performance.

6.1. Full-local fitness

The results of experiments with the local evolutionary learning algorithm using the full-local fitness given in Eq. (9) are shown in Table 4. The experiments were run with the training data fitness comprising 100%, 70%, and 50% of the fitness function. Weight w = 1 is the baseline, having no smoothness component in the fitness evaluation.

The first observation is that when data are not sparse, the addition of smoothness may degrade the performance. As an example, the maximum error on target f3 goes from 0.107 with no smoothness to 0.336 with 50% of the fitness assigned by the degree of smoothness of the grid point.

Table 4
Local learning: full-local smoothness

Training        w = 1                   w = 0.7                 w = 0.5
points          Ave error  Max error    Ave error  Max error    Ave error  Max error

(a) Target: f1(x, y) = (x + y)/2
1000            0.003      0.085        0.002      0.042        0.002      0.039
 100            0.161      1.175        0.019      0.237        0.026      0.260
  50            0.245      1.273        0.033      0.439        0.039      0.462

(b) Target: f2(x, y) = x^2 + y^2 − 1
1000            0.006      0.058        0.012      0.063        0.024      0.082
 100            0.212      1.480        0.030      0.318        0.047      0.251
  50            0.356      1.580        0.067      0.384        0.110      0.436

(c) Target: f3(x, y) = sin(π(x + y))
1000            0.023      0.107        0.033      0.167        0.089      0.336
 100            0.235      1.489        0.098      0.508        0.171      0.671
  50            0.397      1.606        0.236      0.894        0.355      1.047

Table 5
Local learning: partial-global smoothness

Training       f1                      f2                      f3
points         Ave error  Max error    Ave error  Max error    Ave error  Max error

1000           0.002      0.046        0.005      0.041        0.024      0.136
 100           0.130      1.077        0.186      1.415        0.195      1.280
  50           0.219      1.106        0.338      1.533        0.410      1.741

This can be explained by the selection process giving a smaller weight to the direct evidence, the training data, and an increased weight to the derived measure. When data are sparse, however, the incorporation of smoothness greatly improves the performance. The average error decreased from 0.245 to 0.033, from 0.356 to 0.067, and from 0.397 to 0.236 for f1, f2, and f3, respectively, when run with a training set of size 50 and weight 0.7. In these and other experiments with sparse data, both the average and the maximum error decreased with the incorporation of smoothness. After a point, however, increasing the weighting of the contribution of smoothness increased the error. This can be seen by comparing the results for w = 0.7 and 0.5 in Table 4. Again, we attribute this behavior to moving too far from the "best information" in the fitness evaluation.

6.2. Partial fitness

In the previous section, we noted that using other than the "best information" may inhibit the performance of the algorithm. This provides the motivation for the partial application of smoothness: smoothness is used only when there is no training data. In this case, the secondary information is the best information available. The results of experiments with the partial-global application of smoothness are given in Table 5.

Unlike the case of full-local fitness, with nonsparse data the performance shows a slight improvement over the baseline case. This is because the smoothness criterion is used rarely, and in those few cases it aids in the selection of the consequent. With sparse data the results offer a small improvement over the baseline, but do not approach the quality of those obtained using the full-local fitness function. For example, the average error for f1, f2, and f3 with 50 training points is 0.245, 0.356, and 0.396, compared with 0.012, 0.030, and 0.067 for full-local smoothness with weight 0.7.

The results for local learning with a partial-local incorporation of smoothness follow a similar pattern to those with the partial-global fitness function. They do not degrade the performance when data is sufficient. When data is sparse, they do not improve the performance as significantly as full-local or even partial-global fitness.

6.3. Smoothness and noise

To determine the robustness of the algorithm to noise in the input, a set of experiments was run in which the training data had varying degrees of imprecision.

Table 6
Local learning with noise

Training        w = 1                   w = 0.7                 w = 0.5
points          Ave error  Max error    Ave error  Max error    Ave error  Max error

(a) Target: f2 with σ = 0.01 noise
1000            0.006      0.049        0.013      0.064        0.025      0.078
 100            0.200      1.340        0.034      0.310        0.049      0.380
  50            0.347      1.566        0.067      0.410        0.130      0.524

(b) Target: f2 with σ = 0.05 noise
1000            0.016      0.090        0.031      0.099        0.054      0.152
 100            0.202      1.378        0.054      0.289        0.068      0.391
  50            0.323      1.576        0.085      0.433        0.140      0.489

(c) Target: f2 with σ = 0.1 noise
1000            0.028      0.147        0.050      0.157        0.064      0.296
 100            0.223      1.406        0.083      0.474        0.094      0.454
  50            0.344      1.522        0.113      0.465        0.164      0.484

In these experiments, the training data had the form (x_i, y_i, z_i + e_i), where z_i = f(x_i, y_i) and e_i is an error term obtained from a normal distribution with mean 0 and standard deviations 0.01, 0.05, and 0.1. The results of the experiments for target function f2 with the full-local learning strategy are given in Table 6.

The patterns in the results of the experiments with noise are similar to those observed in the noiseless case. The addition of the smoothness component slightly degrades performance when there is adequate training data. Significant improvements are made, however, when the data is sparse. For 50 training examples, the average error for the weighting of 0.7 was 19%, 26%, and 38% of the average error with noise 0.01, 0.05, and 0.1, respectively, and no smoothness component. The reduction in maximum error was significantly greater. Not surprisingly, the effectiveness of the smoothness component degrades with noise in the input. As noted before, performance deteriorates when too much weight is assigned to the smoothness. Regardless of the level of noise in the data, the fitness function with weighting 0.7 for training data and 0.3 for smoothness produced better models than an even weighting between training data and smoothness.

7. Conclusions

Sparse training data presents a fundamental difficulty for algorithms that generate models based on the agreement of the model with the training information. These experiments have shown that a multi-criteria fitness function can greatly improve evolutionary algorithms that use local analysis to generate fuzzy models. The results suggest that a hybrid system that selects the fitness function based on the distribution of the data might incorporate the beneficial features of each strategy into the fitness assessment. When data are sparse, the full-local application of smoothness produces excellent results.


With nonsparse data, however, the partial-global evaluation assists in the generation of rules in the few areas without training information.

The algorithms and experiments described in this paper employed evolutionary techniques for the entire rule base generation process. However, the incorporation of smoothness into the fitness evaluation can be used as a post-processor by other learning algorithms to compensate for sparse data. The approach would be to create a population by replicating and adding minor perturbations to the rule base created by the original algorithm. The bias added by the multi-criteria fitness function would then influence the selection of the rules in regions without training information.

References

[1] T. Bäck, F. Kursawe, Evolutionary algorithms for fuzzy logic: a brief overview, in: Proc. 5th Int. Conf. IPMU: Information Processing and Management of Uncertainty in Knowledge-Based Systems, vol. 2, 1994, pp. 659–664.
[2] T. Bäck, H.P. Schwefel, An overview of evolutionary algorithms for parameter optimization, Evol. Comput. 1 (1) (1993) 1–23.
[3] O. Cordón, F. Herrera, Evolutionary design of TSK fuzzy rule-based systems using (μ, λ)-evolution strategies, in: Sixth IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE), Barcelona, 1997, pp. 509–514.
[4] O. Cordón, F. Herrera, A two-stage evolutionary process for designing TSK fuzzy rule-based systems, IEEE Trans. Systems Man Cybernet. B 29 (6) (1999) 703–715.
[5] O. Cordón, F. Herrera, F. Hoffmann, L. Magdalena, Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Rule Bases, World Scientific Publishing, Singapore, 2001.
[6] O. Cordón, F. Herrera, M. Lozano, A classified review of the combination fuzzy logic–genetic algorithms bibliography, in: E. Sanchez, T. Shibata, L. Zadeh (Eds.), Genetic Algorithms and Fuzzy Logic Systems: Soft Computing Perspectives, World Scientific Publishing, Singapore, 1997, pp. 209–241.
[7] T. Furuhashi, K. Nakaoka, Y. Uchikawa, An efficient finding of fuzzy rules using a new approach to genetic based machine learning, in: Proc. 4th IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE'95), Yokohama, March 1995, pp. 715–722.
[8] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[9] J. Haslinger, U. Bodenhofer, M. Burger, Data-driven construction of Sugeno controllers: analytical aspects and new numerical methods, in: Proc. Joint 9th IFSA Congress and NAFIPS 20th Int. Conf., Vancouver, July 2001, pp. 239–244.
[10] F. Herrera, J.L. Verdegay (Eds.), Genetic Algorithms and Soft Computing, Physica-Verlag, Berlin, 1996.
[11] C.L. Karr, Design of an adaptive fuzzy logic controller using a genetic algorithm, in: Proc. 4th Int. Conf. on Genetic Algorithms, San Diego, CA, July 1991, pp. 450–457.
[12] C.L. Karr, E.J. Gentry, Fuzzy control of pH using genetic algorithms, IEEE Trans. Fuzzy Systems 1 (1993) 46–53.
[13] W. Pedrycz (Ed.), Fuzzy Evolutionary Computation, Kluwer Academic Publishers, Boston, 1997.
[14] E. Sanchez, T. Shibata, L.A. Zadeh, Genetic Algorithms and Fuzzy Logic Systems: Soft Computing Perspectives, World Scientific, Singapore, 1997.
[15] D. Spiegel, T. Sudkamp, Evolutionary strategies for fuzzy models: local vs. global construction, in: Proc. 1999 Conf. of the North American Fuzzy Information Processing Society, New York, 1999, pp. 203–207.
[16] D. Spiegel, T. Sudkamp, Compensating for sparse data in evolutionary generation of fuzzy models, in: The 2000 IEEE Int. Conf. on Systems, Man, and Cybernetics, Atlanta, 2000, pp. 439–444.
[17] D. Spiegel, T. Sudkamp, Employing locality in the evolutionary generation of fuzzy rule bases, IEEE Trans. Systems Man Cybernet. B 32 (3) (2002) 296–305.
[18] T. Sudkamp, R.J. Hammell II, Interpolation, completion, and learning fuzzy rules, IEEE Trans. Systems Man Cybernet. 24 (2) (1994) 332–342.
[19] T. Sudkamp, R.J. Hammell II, Rule base completion in fuzzy models, in: W. Pedrycz (Ed.), Fuzzy Modeling: Paradigms and Practice, Kluwer Academic Publishers, Dordrecht, 1996, pp. 313–330.