On-line assurance of interpretability criteria in evolving fuzzy systems – Achievements, new concepts and open issues


Information Sciences 251 (2013) 22–46 Contents lists available at SciVerse ScienceDirect Information Sciences journal homepage: www.elsevier.com/loc...



Information Sciences journal homepage: www.elsevier.com/locate/ins

Edwin Lughofer

Department of Knowledge-Based Mathematical Systems, Johannes Kepler University Linz, Austria

Article info

Article history: Received 6 July 2012; received in revised form 23 May 2013; accepted 1 July 2013; available online 12 July 2013

Keywords: Evolving fuzzy system; Complexity reduction; Interpretability criteria; Knowledge expansion; On-line assurance

Abstract

In this position paper, we discuss achievements and open issues in the interpretability of evolving fuzzy systems (EFS). In addition to pure on-line complexity reduction approaches, which can be an important direction for increasing the transparency of the evolved fuzzy systems, we examine the state of the art and provide further investigations and concepts regarding the following interpretability aspects: distinguishability, simplicity, consistency, coverage and completeness, feature importance levels, rule importance levels, and interpretation of consequents. These are well-known and widely accepted criteria for the interpretability of expert-based and standard data-driven fuzzy systems in batch mode. So far, most have been investigated only rudimentarily in the context of evolving fuzzy systems, which are trained incrementally from data streams: EFS have focused mainly on precise modeling, aiming for models of high predictive quality. Only in a few cases has the integration of complexity reduction steps been handled. This paper thus seeks to close this gap by pointing out new ways of making EFS more transparent and interpretable within the scope of the criteria mentioned above. The role of knowledge expansion, a concept peculiar to EFS, will also be addressed. One key requirement in our investigations is the availability of all concepts for on-line usage, which means they should be incremental or at least allow fast processing.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

1.1. Motivation

In today’s industrial systems, the automatic adaptation of system models with new incoming data samples plays an increasingly important role. This is due to growing system complexity: changing environments, new system states, or new operation modes not contained in pre-collected training sets are often not covered by an initial model set up in an off-line stage, especially because of the high expense incurred when data needs to be manually pre-processed or even annotated (e.g., in the case of classification problems [74,26]). Evolving models, as part of the evolving intelligent systems community [5], provide methodologies for ongoing/continuous model updating. Incremental updating allows fast processing and requires minimal virtual memory, as samples that have passed the model update algorithm can be discarded. Thus, such models are applicable (a) within the context of learning from data streams [31] and (b) within learning tasks in non-stationary environments involving volatile environmental conditions [101]. A key issue here is finding a balance between stable converging solutions in a life-long learning context [108] (also referred to as stability) and the ability to react dynamically in order to

E-mail address: [email protected] 0020-0255/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ins.2013.07.002


integrate new information and to handle drifts appropriately [81,32,47,68] (also referred to as plasticity); this balance is known in the literature as the stability-plasticity dilemma [1,35].

Improved transparency and interpretability of the evolved models may be useful in several real-world applications where the operators and experts intend to gain a deeper understanding of the interrelations and dependencies in the system. This may enrich their knowledge and enable them to interpret the characteristics of the system on a deeper level. For instance, in on-line quality control applications [71], it could be very useful to know which interrelations of system variables/features were violated when faults occur; this knowledge may be a valuable input to fault diagnosis [49] for finding the root cause of the fault, and for providing appropriate automatic feedback control reactions [29]. Further examples are decision support systems [27,110] and classification systems [16], which sometimes require knowing why certain decisions have been made by classification models, especially in medical applications (see, for instance, [113]): insights into these models may provide answers to important questions (e.g., regarding the health state of a patient) and support the user in taking appropriate actions. Another field of application for model interpretation is the substitution of expensive hardware with soft sensors, referred to as eSensors in an evolving context [76,6]: the model has to be linguistically or physically understandable, reliable, and plausible to an expert before it can be substituted for the hardware. Sometimes, it is beneficial to provide further insights into the control behavior [84] or the relationships implicitly contained in a (natural) system.
An example of the latter is provided in [75], which describes how house prices result from their characteristics and location (using fuzzy rules), or in [53,57], which describe the dynamic relations between influences on and changes in stock market indices by partial granular descriptions. Thus, experts may gain deeper knowledge and understanding of processes (what is really going on), which can be used in the future design of hardware or software components for more automation, productivity and efficiency. Another important reason for improving the transparency of evolving models is to encourage richer human–machine interactions [74].

1.2. State of the art

Evolving fuzzy systems (EFS) [63], in particular evolving rule-based models [8], are a powerful tool for addressing these demands, as they offer fuzzy logic model architectures and systems that include components with a clear, linguistically interpretable meaning [15,3], unlike many other incremental learning and evolving (intelligent) systems techniques, which rely on black-box model architectures such as neural networks or support vector machines and are thus un-interpretable per se. When fuzzy systems are trained with data and not designed manually by an expert, they lose some of their interpretability, but this loss can be diminished by applying specific improvement techniques in the training stages (in the form of constraints [28], specific learning schemes [93] or post-processing operations [102]); see, for example, [116,30] for comprehensive surveys. In the evolving systems context, however, the main focus in fuzzy systems extraction has so far been on precise modeling techniques, which aim for high accuracy on new unseen data and, mathematically, for optimality in an incremental optimization context.
In fact, the well-known and widely used approaches (alphabetically ordered) DENFIS [45], EFP [111], eLMN [34], ENFM [105], ePL [60], eT2FIS [109], eTS [11], FLEXFIS(++) [66], IRSFNN [61], rGK [24], SAFIS [95], SONFIN [43], SOFNN [57] (for a comprehensive survey see [63]) pay only little attention to meeting interpretability criteria, or even to considerations on how model interpretability could be improved. Some merely incorporate pure complexity reduction steps, which can be considered a necessary but insufficient prerequisite. For instance, ePL and eTS+ [10] (the extended version of eTS [11]) contain a rule pruning methodology internally based on similarity, age, and utility criteria; the approach described in [58] as well as FLEXFIS++ [66] perform rule merging steps within incremental learning cycles whenever rules become redundant or overlapping; and eT2FIS [109] includes an on-line merging strategy for type-2 fuzzy sets with Gaussian shape. Other approaches include occasional constraints to ensure a minimum coverage of the input partition (e.g., SAFIS, SOFNN or SONFIN) and to avoid undefined input states. Hardly any approaches exist that include a comprehensive discussion of various interpretability criteria, nor do they include any mechanisms or concepts for guiding the evolved fuzzy systems to greater transparency and interpretability.

1.3. Scope of this paper

The main focus of this paper is on discussing some important interpretability aspects and criteria which have hitherto been neglected in the context of state-of-the-art EFS approaches. We concentrate on classical flat fuzzy model architectures such as Mamdani [78] and Takagi–Sugeno–Kang type systems [107], and on classical rule-based classifiers as widely used in EFS: each input feature $x_1, \dots, x_p$ (e.g., temperature, house price, etc.) is part of each rule’s antecedent, which yields full-span rule bases with rules of the form

$$\text{IF } x_1 \text{ IS } L\_TERM_1 \text{ AND } x_2 \text{ IS } L\_TERM_2 \text{ AND } \dots \text{ AND } x_p \text{ IS } L\_TERM_p \text{ THEN } y = f \qquad (1)$$

where $f$ represents either a linguistic term (Mamdani model), a polynomial function (Takagi–Sugeno–Kang model), or a consequent class label (fuzzy rule-based classifier); the linguistic terms ($L\_TERM_j$) are represented by fuzzy sets $A_1, \dots, A_p$ (e.g., ‘Large’, ‘Low’, ‘Cold’), and the conjunction operator "AND" is represented by a t-norm [46]. This also covers multiple-output systems ($\vec{y} = (y_1, \dots, y_m)$), as these can usually be decomposed into independent single-output systems. Generalized fuzzy systems, whose rules are represented directly by high-dimensional kernels resp. multivariate distributions without any


representation by linguistic terms and their conjunctions through t-norms, and as recently used in the evolving context [55,52], can be equally handled within the interpretability considerations of this paper, as long as they are equipped with projection concepts to form classical axis-parallel rules and fuzzy partitions, as recently conducted in the evolving approach GENFIS [90].

The discussion includes past achievements, new ideas and guidelines for interpretability improvement, as well as some concrete concepts to meet some criteria in an incremental learning context. Since complexity reduction can be seen as a prerequisite for interpretability, it will be considered in the context of distinguishability and simplicity of fuzzy partitions and rule bases. These two criteria are essential in EFS, as rules which originally seem to be disjoint may over time come to overlap significantly, or overlap slightly resp. touch each other while forming a homogeneous region (same joint tendency); see also Section 3. Our investigations go clearly beyond pure complexity reduction steps and include criteria towards a higher level of transparency and interpretability, such as consistency, coverage and completeness, feature importance levels, rule importance levels, interpretation of consequents and knowledge expansion. Inconsistency may arise from similar (overlapping) antecedent parts in two rules with dissimilar consequents (see also Section 4). Incompleteness or insufficient coverage may be caused by rules and fuzzy sets shifting over time (Section 5). Rules may seem important at an early stage of the learning process but become unimportant over time, and the same applies to features: in order to reflect this in the evolved fuzzy system shown to a user and to guarantee a smooth learning process, the integration of continuous rule and feature weights is necessary (Sections 6 and 7).
Interpretation of consequents mainly concerns Takagi–Sugeno–Kang models, whose consequents may become chaotic and fail to represent the real local trends if the incremental learning engine for updating the consequent parameters is wrongly designed or chosen (see Section 8). Dynamic knowledge expansion is a methodology present in all common EFS approaches and a necessary concept, as it automatically integrates new operation modes arising in the on-line system (stream) (Section 9).

Due to the position-oriented scope of our paper, no empirical evidence in the form of evaluation runs and test results is presented. In this sense, the paper differs from a survey paper, which usually includes a vast empirical evaluation and comparison of methods on various data sets; given the broad range of interpretability criteria, this would go beyond the scope of the paper in terms of its comprehensiveness and compactness. Rather, we aim to guide the reader to a deeper understanding of the state of the art and point out new and further required improvements in terms of the interpretation capabilities of EFS. In this sense, we leave it open for a broader audience to participate in the future development of these concepts for EFS and intend to provide inspiration and motivation for it.

The paper is organized as follows: in the subsequent section (Section 2), we provide a detailed problem statement discussing the current situation in EFS and describe our novel contribution to the field. Each subsequent section focuses on the investigation of one particular interpretability criterion (w.r.t. achievements, new concepts and open issues).

2. Problem statement and impact

In current EFS approaches, the main focus is placed on precise fuzzy modeling, which means that the primary goal in the incremental learning phase is to evolve fuzzy models that represent (classify, approximate) as accurately as possible the relationships implicitly contained in the data streams.
In some cases, incremental optimization techniques, such as the recursive fuzzily weighted least squares estimator [12] or RLM (Recursive Levenberg–Marquardt) [85], are applied to guide the parameters of the models to optimal solutions with respect to an optimization criterion: this is usually achieved (a) in regression settings, by minimizing the error between predicted and measured target values on on-line samples, and (b) in classification tasks, by finding reliable (or even optimal) decision boundaries that discriminate between two or several classes. In this context of data-driven learning and optimization from data, little effort has been made to contribute to linguistic modeling in the sense that distinguishable components, consistent rule bases, completeness of fuzzy partitions, small descriptive lengths of rule antecedents, etc. are guaranteed, or to achieve at least some sort of interpretability in evolving fuzzy systems. The upper part of the framework visualized in Fig. 1 shows the components which are integrated into current EFS approaches and often lead to (seemingly) illogical fuzzy partitions and confusing rule bases. The lower part illustrates fuzzy modeling from a different viewpoint, namely that of human beings. Expert knowledge, usually acquired through long-term experience with the corresponding system, is used as the basis for designing fuzzy rule bases directly by encoding this knowledge as a combination of linguistic terms and rules; see, for instance, [19] or [2] (Chapter 5). In fact, the (Mamdani) fuzzy systems architecture supports such an encoding, as vague, linguistic expressions are usually already ‘available’ either in the form of IF–THEN rules drawn from human cognition and experience or in the form of similar linguistic relations that can often be formulated as IF–THEN rules.

In this paper, we seek to narrow the gap between precise evolving fuzzy modeling and linguistic fuzzy modeling (as indicated in Fig. 1 by the double arrow) by discussing and investigating potential strategies in current EFS approaches with the aim of improving transparency, readability and interpretability of the fuzzy rule bases. To this end, we examine criteria such as

• Distinguishability and simplicity.
• Consistency.
• Coverage and completeness.
• Feature importance levels.
• Rule importance levels.
• Interpretation of consequents.
• Knowledge expansion.

Fig. 1. Framework showing precise evolving and linguistic fuzzy modeling components (based on data and human input) and the intention to narrow the gap between these two.

We address both high-level (rule-based) and low-level (fuzzy-set) interpretation in the spirit of [116] (considered there for completely off-line fuzzy models). Finally, this builds a bridge to ‘interpretable evolving fuzzy models’, which can be assumed to be closer to linguistic fuzzy models than the precise ones (also underlined by the fuzzy partitions in Fig. 1), thus narrowing the gap. Whether the gap can be closed depends on the type of application; it cannot be achieved in general. For example, experts tend to place fuzzy sets along a certain variable in a homogeneous, often equidistant manner ((close to) Ruspini partitions [99]), whereas an arbitrary data distribution from a stream does not fulfill such constraints in general. Furthermore, we keep the central spirit of evolving fuzzy systems, namely to build their knowledge completely from data from scratch, i.e. not allowing any pre-defined initial fuzzy partitions from users; we are thus not investigating hybrid modeling aspects.

We deem it important to examine criteria in terms of various facets of interpretability, which are then adapted to both the fuzzy-set and the rule-based level. Two reasons are that (i) for some facets, similar improvement concepts and techniques can be applied to both the fuzzy-set and the rule-based level and (ii) some can be seen as a step-wise procedure towards better interpretable models. Another issue in our investigations is that the criteria can be handled within an on-line learning context, so they can be applied in combination with the incremental training procedures in EFS. In the ideal case, within or after each incremental learning step, the fuzzy system updated with new incoming data is already guided towards more transparency. Therefore, many conventional off-line approaches [30,15] are neglected, because they require multiple iterations (as e.g. used in constraint-based optimization schemes [86]) or time-intensive batch calculations, and are thus slow.

In addition to the aspects discussed in Section 1.1, our investigations may also lay a solid basis for, or have an impact on, future developments in the field of enhanced human–machine interaction scenarios on the model level in cognitive science. The current situation is that humans communicate with machine learning tools either on a pure monitoring level or, in slim form, on a good/bad reward level, qualifying the model outputs so that the model can improve itself. Communication on a deeper, structural level, for instance enabling user manipulation of structural components or decision boundaries, is currently (almost) missing. Such communication, however, may also help us to achieve a deeper, evidence-based understanding of how people can interact with machine learning systems in different contexts and to enrich our models by combining objective data with subjective experience. This is a key aspect of the recently published concept of human-inspired evolving machines [64]. It was emphasized in [64] that, in principle, EFS can be applied within such a framework, although the current state of the art leaves the topic under-addressed.

3. Distinguishability and simplicity

Distinguishability and simplicity of fuzzy components can be seen as two important aspects towards achieving a transparent and interpretable fuzzy system [30]. While distinguishability requires the use of structural components (rules, fuzzy sets) that are clearly separable as non-overlapping and non-redundant, simplicity goes a step further and expects models with a trade-off between low complexity and high accuracy. From the mathematical point of view, both criteria can be defined as follows:

Definition 1. Distinguishability is guaranteed whenever

$$\nexists\, i,j,k,m:\ \big(S(R_i, R_j) > thr\big) \vee \big(S(A_{ik}, A_{jm}) > thr\big) \qquad (2)$$

with $S \in [0,1]$ the similarity between two rules $R_i$ and $R_j$ resp. two fuzzy sets $A_{ik}$ and $A_{jm}$ appearing in the antecedent parts of rules $R_i$ and $R_j$. How to measure the similarity $S$ between two fuzzy sets or rules is studied in the subsequent section. The threshold $thr$, which ideally has the same value in both parts of (2), governs the degree of similarity allowed between two components and may depend on the chosen similarity measure. Without loss of generality, a value of $S$ close to 1 points to a high similarity, whereas a value of $S$ close to 0 points to a low similarity (normalization of the input features/fuzzy partitions to $[0,1]$ may be necessary for some measures, e.g. for (4) below).

Definition 2. Let $F_1$ be a fuzzy system fully evolved from a data stream based on precise modeling concepts. Then, maximum simplicity of this system meets the following criterion:

$$\min\{\, |F| \;\mid\; (|F_1| > |F|) \wedge (acc(F) \ge acc(F_1) - \epsilon) \,\} \qquad (3)$$

with $acc$ the accuracy of the fuzzy system and $|F|$ the number of components (rules, fuzzy sets) in the simplest possible, yet still accurate enough, model $F$. $\epsilon$ expresses the allowed loss in accuracy and is application dependent, i.e. it is usually set by the user according to a maximal allowed model error.

Complexity reduction steps during the learning process are expected to improve both distinguishability and simplicity: for instance, if two rules (start to) overlap to a degree greater than $thr$, a merging procedure should be triggered, assuring the condition in (2); in the case of high redundancy, the merged system $F$ usually meets (3) (only superfluous information is discarded). Usually, we can assume that a simpler fuzzy model can be interpreted more easily than a more complex one, especially when a reduction step is applied to the same model [18].

3.1. Distinguishability

The latter assumption particularly holds when omitting or reducing unnecessary complexity, which may arise when fuzzy sets or rules become redundant and indistinguishable. In current evolving fuzzy systems, this low distinguishability may arise whenever rules and fuzzy sets are moved, adapted or reset due to new incoming data samples from an infinite stream. For instance, consider two distinct data clouds representing two local data distributions which are adequately modeled by two rules in the fuzzy system; after some time, the gap between the two clouds is filled up with new data, which makes the two data clouds indistinguishable (see Fig. 2, where rules are represented as ellipsoidal contours in the two-dimensional space with a $2\sigma$ spread in each direction). In a precise modeling context, the rules and their antecedent fuzzy sets are forced to overlap

Fig. 2. (a) Initial situation with two distinct rules; (b) updated rules (ellipsoidal contours, solid lines) after new samples (marked as ‘crosses’) have arrived: the rules move together and overlap significantly, with the arrows indicating the center movements/rule expansions.


and become indistinguishable, as some of the new samples expand one rule and others expand a nearby rule in the opposite direction, towards the first rule.

3.1.1. Measuring the degree of indistinguishability

In recent years, several attempts have been made to address this issue during incremental learning. In [60], redundant rules are detected by a similarity measure $S$ expressed by the sum of the absolute differences between the normalized coordinates of two rule (cluster) centers:

$$S(R_1, R_2) = 1 - \frac{\|\vec{c}_{R_1} - \vec{c}_{R_2}\|}{p} \qquad (4)$$

with $\vec{c}_{R_1}$ the center of rule $R_1$ and $p$ the dimensionality of the feature space; $S$ is high when two rule centers are close to each other. This approach, however, does not take into account the spread of the rules: the centers of rules with very little spread may be close, but the data clouds may nonetheless be distinct and have important (evolvable) granularity [88]. Furthermore, it acts solely on the rule level. Regarding fuzzy set similarity, the approach in [73] employs the Jaccard index to measure the degree of overlap between two fuzzy sets, but it can be expected to be relatively slow in on-line learning settings (two time-intensive integrals must be calculated). Admittedly, even time-intensive integrals can nowadays be calculated within milliseconds; however, the number of similarity calculations can be huge, depending on the dimensionality of the learning problem and the number of rules. This is the case when, ideally, the similarity of updated components (rules and fuzzy sets) to all other components is checked after each sample-wise update (in order to be able to guarantee distinguishability at any time). Ramos and Dourado investigated a faster, geometric-based similarity measure for detecting redundant fuzzy sets [92]:

$$S(A, B) = overlap(A, B) = \frac{1}{1 + d(A, B)} \qquad (5)$$

where the geometric distance between two Gaussian membership functions can be approximated by:

$$d(A, B) = \sqrt{(c_A - c_B)^2 + (\sigma_A - \sigma_B)^2} \qquad (6)$$

with $\sigma_A$ the width of fuzzy set $A$. They obtained reduced fuzzy systems with nearly the same performance as the original ones. Apart from being restricted to Gaussian sets, (5) is not scale-invariant; see [80] for a proof. In [58], the time-intensive calculation of the Jaccard index for Gaussian membership functions is avoided by transferring the Gaussian sets to equivalent triangular fuzzy sets:

$$\mathrm{a}_A = c_A - \sigma_A\sqrt{2\pi}, \qquad \mathrm{b}_A = c_A, \qquad \mathrm{c}_A = c_A + \sigma_A\sqrt{2\pi} \qquad (7)$$

with $\mathrm{b}_A$ the center of the triangular fuzzy set, $\mathrm{a}_A$ the left-most and $\mathrm{c}_A$ the right-most point (where the set cuts the axis; upright letters denote the triangle parameters, $c_A$ the Gaussian center), and then applying the union and intersection according to the formulas in [56] (four cases). In [48], the geometric-based fuzzy set similarity measure in (5) was extended to a high-dimensional measure for multivariate Gaussians in arbitrary position (generalized rules):

$$S(R_1, R_2) = \sqrt{\Psi_1(\vec{c}_2) \cdot \Psi_2(\vec{c}_1)} \qquad (8)$$

with $\vec{c}_1$ the center of the first rule and $\Psi_1$ its multivariate Gaussian defined as

$$\Psi_1(\vec{x}) = e^{-0.5\,(\vec{x} - \vec{c}_1)^T \Sigma_1^{-1} (\vec{x} - \vec{c}_1)} \qquad (9)$$

with $\Sigma_1^{-1}$ the inverse covariance matrix of $R_1$. A similar measure using (9) was defined in [77] to measure cluster (rule) compatibility, which is applied in an evolving classification context for adaptive fault detection in [54]. In the case of classical model architectures, the measure in (8) can also be applied by setting the inverse covariance matrix to a diagonal matrix containing the inverse rule spreads in each direction, thus $\Sigma_1^{-1} = \mathrm{diag}(1/\sigma_{1j}^2)$ for all $j = 1, \dots, p$.

Recently, combined generic concepts for detecting and eliminating redundant (antecedents of) rules and fuzzy sets have been presented in [69,72], which can be coupled with any popular EFS method. The approach in [69] assumes Gaussian fuzzy sets as basic structural components and therefore enables very fast calculation of the degree of overlap between two rules or fuzzy sets. For rules, this is achieved by using virtual projections of the antecedent space onto the one-dimensional axes and aggregating over the maximum component-wise membership degrees of both intersection points between two Gaussians:

$$S(R_1, R_2) = overlap(R_1, R_2) = \mathrm{Agg}_{j=1}^{p}\; overlap_{A_{1j}, A_{2j}} \qquad (10)$$

with

$$overlap_{A_{1j}, A_{2j}} = \max\big(\mu_j(inter_x(1)),\ \mu_j(inter_x(2))\big) \qquad (11)$$

where $\mu_j(x)$ is the membership degree to the univariate Gaussian $A_{1j}$ or $A_{2j}$ in the $j$th dimension, $inter_x(1)$ and $inter_x(2)$ are the two intersection points, and $\mathrm{Agg}$ is an arbitrary aggregation operator [100]. Eq. (10) ensures that overly complex and totally uninterpretable constellations, e.g., rules forming a ‘cluster cross’ in the feature space or rules completely contained in other rules, can be detected as such [69]. For fuzzy sets, a very fast kernel-based similarity measure is used in [69], which requires constant calculation time:

$$S_{ker}(A, B) = e^{-|c_A - c_B| - |\sigma_A - \sigma_B|} \qquad (12)$$

In this approach, the same weight is used for differences in centers and spreads: a difference in the specificity of two sets (representing data clouds with different distributions/spreads) should have the same effect as a shift of the centers. However, the major drawback of this measure is that it cannot distinguish the following two situations: two fuzzy sets $A$ and $B$ share the same spread, i.e. $\sigma_A = \sigma_B$, another two fuzzy sets $C$ and $D$ share a different spread, i.e. $\sigma_C = \sigma_D \ne \sigma_A$, and both pairs have the same center deviation, i.e. $c_A - c_B = c_C - c_D$. Since the spread discrepancy is 0 in both cases, Eq. (12) delivers the same value for both pairs; however, the two fuzzy sets $A$ and $B$ with large spread (e.g., $\sigma_A = 0.5$) may overlap, whereas $C$ and $D$ with smaller spread (e.g., $\sigma_C = 0.2$) may not. Hence, we may extend the above measure by a term which gives statistical evidence of the difference between two Gaussians and is able to resolve this shortcoming:

$$S_{ker}(A, B) = \frac{1}{2}\left( e^{-\frac{(c_A - c_B)^2}{\sigma_A^2 + \sigma_B^2}} + e^{-(c_A - c_B)^2 - (\sigma_A - \sigma_B)^2} \right) \qquad (13)$$
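As an illustration of the shortcoming of (12) and of the effect of the extension, the following minimal Python sketch evaluates both measures for the two situations described above (a pair of wide sets with spread 0.5 and a pair of narrow sets with spread 0.2, both with the same center deviation). It assumes that (13) is read as the mean of the Welch-type term exp(-(c_A - c_B)^2 / (σ_A^2 + σ_B^2)) and the term exp(-(c_A - c_B)^2 - (σ_A - σ_B)^2); the function names, the center values and the center deviation of 0.3 are illustrative choices and are not taken from [69].

```python
import math


def s_ker_basic(c_a, sigma_a, c_b, sigma_b):
    """Kernel similarity of Eq. (12): center and spread differences weigh equally."""
    return math.exp(-abs(c_a - c_b) - abs(sigma_a - sigma_b))


def s_ker_extended(c_a, sigma_a, c_b, sigma_b):
    """Extended similarity, following the reading of Eq. (13) stated in the lead-in."""
    d2 = (c_a - c_b) ** 2
    # Welch-type term: center distance relative to the combined variance.
    welch_term = math.exp(-d2 / (sigma_a ** 2 + sigma_b ** 2))
    # Second term: penalizes center shifts and differing spreads.
    spread_term = math.exp(-d2 - (sigma_a - sigma_b) ** 2)
    return 0.5 * (welch_term + spread_term)


# Hypothetical fuzzy set pairs as (c_A, sigma_A, c_B, sigma_B);
# both pairs have the same center deviation of 0.3.
wide = (0.2, 0.5, 0.5, 0.5)
narrow = (0.2, 0.2, 0.5, 0.2)

for name, pair in (("wide", wide), ("narrow", narrow)):
    print(name, round(s_ker_basic(*pair), 3), round(s_ker_extended(*pair), 3))
```

Under these assumptions, (12) returns the same value for both pairs, whereas the extended measure assigns the wide pair a higher similarity than the narrow one.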

The first term is motivated by statistical theory, namely by testing whether two Gaussian distributions are different or not (hypothesis test). It is based on a specific version of the t-test, the so-called Welch test [112]. When two centers are closer than the absolute value of their spreads, the exponent of the first term in (13) gets close to 0, and thus a similarity near 1 is achieved. This is in line with the expectation for overlapping fuzzy sets. The second term, which corresponds to (12), is still required, as it reduces the similarity when a smaller set appears inside a larger one with a similar center (in which case the first term gets close to 1); in this case, merging may lose specificity in certain parts of the feature space and thus accuracy, see [69] for an analysis. A statistically more profound, but also slower, possibility for $S_{ker}$ would be to apply an averaged two-sided Kullback–Leibler divergence (in order to achieve symmetry) [50], which measures the difference between two probability distributions.

Whenever the fuzzy sets $A_{1j}$ and $A_{2j}$ in all antecedent parts of the two rules $R_1$ and $R_2$ are pairwise similar, i.e.

$$S(A_{1j}, A_{2j}) \ge thresh \quad \forall j \qquad (14)$$

then the two rules automatically become similar and can thus be merged. We will see below, however, that rule merging is not necessarily triggered by fuzzy set merging.

While all former concepts are restricted to Gaussian fuzzy sets, in [72] the generalization to arbitrary fuzzy sets was achieved by defining the degree of overlap of two rules in the form of a two-sided fuzzy inclusion measure:

$$overlap(R_1, R_2) = S\big(inc(R_1, R_2),\ inc(R_2, R_1)\big) \qquad (15)$$

with

$$inc(R_1, R_2) = T_{j=1}^{p}\; inc(A_{1j}, A_{2j}) \qquad (16)$$

with $A_{1j}$ the fuzzy set in the $j$th antecedent part of rule $R_1$ and $inc(A_{1j}, A_{2j})$ denoting the degree of inclusion of the fuzzy set $A_{1j}$ in $A_{2j}$ [72]. Using a t-norm $T$ in (16) (e.g., the minimum) is justified especially by the fact that a pronounced non-overlap along one single dimension $j$ is sufficient for the clusters not to overlap at all (as they are torn apart). A feasible choice for the t-conorm $S$ in (15) is the maximum operator, as it yields the maximal inclusion of $R_1$ in $R_2$ and of $R_2$ in $R_1$.

At this stage, we want to highlight that this generalized measure is not suitable for measuring the similarity of single fuzzy sets. A small fuzzy set (with low spread) included in a large one (with high spread) could represent two different data spreads in two different regions of the feature space, which are necessary for an appropriate representation and thus interpretation of the local regions. Merging would generally end up with a misplaced fuzzy set for one of the two regions. A practical example is given in Fig. 3; examine the two distinct data clouds and the rules drawn as ellipsoids. Based on Fig. 3, it is quite clear that rule merging is not necessarily triggered by fuzzy set merging, as the two overlapping rules representing the same cloud can certainly be merged, whereas the fuzzy sets $A_1$ and $A_2$ are not supposed to be merged within the one-dimensional view ($A_2$ occurring in the antecedent parts of two rules). Another case of rule merging which is not triggered by fuzzy set merging is the situation when rules lie next to each other (touching or slightly overlapping) and form a homogeneous region (Section 3.2).

Finally, we want to emphasize that, despite the already broad range of similarity measures available within the context of EFS, some others which have been studied only within the context of batch-trained fuzzy systems may also be applicable and helpful, see e.g. [80,22,106] or [103].
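To make the two-sided inclusion measure in (15) and (16) more tangible, the following Python sketch evaluates it for rules with arbitrary membership functions. The set-level inclusion measure inc(A, B) is defined in [72] and is not reproduced here; purely as an assumption, the sketch substitutes an area-based ratio (the area of the pointwise minimum of A and B divided by the area of A), approximated on a fixed grid. The minimum is used as t-norm T and the maximum as t-conorm S, as suggested above. All function names, the grid and the example rules are hypothetical.

```python
import math


def gaussian(c, sigma):
    """Gaussian membership function with center c and spread sigma."""
    return lambda x: math.exp(-0.5 * ((x - c) / sigma) ** 2)


def triangular(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if a < x <= b:
            return (x - a) / (b - a)
        if b < x < c:
            return (c - x) / (c - b)
        return 0.0
    return mu


# Evaluation grid on [-1, 2]; an assumption covering the example sets below.
GRID = [i / 1000 for i in range(-1000, 2001)]


def inc_sets(mu_a, mu_b, grid=GRID):
    """Assumed inclusion of A in B: area(min(A, B)) / area(A) on a uniform grid."""
    num = sum(min(mu_a(x), mu_b(x)) for x in grid)
    den = sum(mu_a(x) for x in grid)
    return num / den if den > 0 else 0.0


def inc_rules(rule_1, rule_2):
    """Eq. (16) with the minimum t-norm over all antecedent dimensions."""
    return min(inc_sets(a, b) for a, b in zip(rule_1, rule_2))


def overlap_rules(rule_1, rule_2):
    """Eq. (15) with the maximum as t-conorm."""
    return max(inc_rules(rule_1, rule_2), inc_rules(rule_2, rule_1))


# Hypothetical two-dimensional rules, one membership function per dimension.
rule_a = [gaussian(0.3, 0.05), triangular(0.1, 0.3, 0.5)]
rule_b = [gaussian(0.32, 0.2), triangular(0.0, 0.3, 0.6)]   # encloses rule_a
rule_c = [gaussian(0.3, 0.05), triangular(1.0, 1.2, 1.4)]   # disjoint in dim 2

print(overlap_rules(rule_a, rule_b), overlap_rules(rule_a, rule_c))
```

With these example rules, the narrow rule is almost completely included in the wide one, so the maximum in (15) yields a high overlap, while a single dimension without overlap drives the minimum in (16), and hence the overlap, to zero. The same one-sided behavior at the fuzzy set level (a small set included in a large one scores high) illustrates why the measure should not be used for the similarity of single fuzzy sets.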
Particular attention may be given to the option of using a possibilistic measure, as defined by the degree of applicability of the soft constraint 'x is B' for $x = A$ [80], whose (monotonic) relation to similarity is provided in [80,38].

Fig. 3. Rule merging is triggered for the two ellipsoids representing one data cloud; fuzzy set merging for feature X1 is not suggested due to the differences in both centers and spreads. In fact, an equally weighted merging would end up in a fuzzy set misrepresenting the two local data cloud distributions along feature X1 (shown in dotted line style).

3.1.2. Resolving indistinguishability (distinguishability assurance)

In all approaches, the overlap, that is, the similarity degree, is normalized to $[0, 1]$, where 0 means no overlap/similarity and 1 full overlap/similarity. Thus, if $\mathrm{overlap}(R_1, R_2)$ or $S(R_1, R_2)$ is greater than a threshold (the concrete value depends on the applied measure; from our experience, a value of around 0.8 is often a good choice), the two rules should be merged. In [69] a quite generic concept was proposed (which covers and extends the merging procedures suggested e.g. in [77,92] and others): merging of rule antecedents is conducted (1) by using a weighted average of the centers, where the weights are defined by the support of the rules, assuring that the new center lies in between the two centers and is closer to the more supported rule, and (2) by updating the spread of the more significant rule with the spread of the less significant one, exploiting the recursive variance concept and including a coverage term:

$$c_j^{new} = \frac{c_j^{R_1} k_{R_1} + c_j^{R_2} k_{R_2}}{k_{R_1} + k_{R_2}}$$

$$\sigma_j^{new} = \sqrt{\frac{k_{R_1} (\sigma_j^{R_1})^2}{k_{R_1} + k_{R_2}} + (c_j^{R_1} - c_j^{new})^2 + \frac{(c_j^{new} - c_j^{R_2})^2}{k_{R_1} + k_{R_2}} + \frac{k_{R_2} (\sigma_j^{R_2})^2}{k_{R_1} + k_{R_2}}} \tag{17}$$

$$k_{new} = k_{R_1} + k_{R_2}$$

where $k_{R_1}$ denotes the number of samples falling into rule $R_1$ and $k_{R_2}$ the number of samples falling into rule $R_2$, and $j$ is the $j$th input feature. Without loss of generality, $k_{R_1} > k_{R_2}$, thus the order is fixed according to the support of the rules; if clustering is conducted in the product space [9], $j$ also covers the output variable. Merging of rule consequents requires a specific strategy in order to handle inconsistencies in the rule base appropriately, and is thus explained in Section 4. A merging example based on the growing together of the two rules shown in Fig. 2 is visualized in Fig. 4.

Intuitively, it is clear that the merging of two rules triggers the merging of all of their $p$ antecedent parts ($p$ being the input dimension), thus fuzzy set merging is conducted implicitly. If the merging process is done directly in the multi-dimensional data learning space, then one merged rule is obtained there, which is afterwards projected onto the axes to form the fuzzy sets in the antecedent part of this rule. When using generalized fuzzy systems with multivariate Gaussian functions as granules (as e.g. used in an evolving context in [55]), the projection concept is more sophisticated, as it has to be conducted for an ellipsoid in arbitrary position. Recently, in [90] an approach has been demonstrated and successfully applied to real-world problems, which integrates the eigenvectors and the maximal angle spanned between them and the single feature axes. The merging operation in (17) can be applied either after projection or by extending the second formula in (17) to the merging of inverse covariance matrices.

In terms of merging overlapping fuzzy sets from a single uni-dimensional viewpoint, alternative ways exist; for instance, in [58] two fuzzy sets are merged according to:

$$c_{new} = \frac{c_A + c_B}{2}, \qquad \sigma_{new} = \frac{1}{2\sqrt{2}} \big(\max(c_A, c_B) - \min(a_A, a_B)\big) \tag{18}$$

with $c_A$ the focal point of fuzzy set $A$; $c_A$ and $a_A$ as in (7), and in [73] merging is based on the α-cut concept to guarantee a minimal coverage:

$$c_{new} = \frac{\max(U) + \min(U)}{2}, \qquad d_{new} = \frac{\max(U) - \min(U)}{2} \tag{19}$$

where $U = \{c_A \pm d_A,\ c_B \pm d_B\}$. A strong property of (19) (as opposed to other measures in the literature) is that it can be used for arbitrary uni-modal fuzzy sets by interpreting $d_A$ as the 'characteristic' spread of a set $A$, which is the width of the set along its α-cut. Thus, when choosing $\alpha = e^{-0.5} \approx 0.61$, we obtain $d_{A,B} = \sigma_{A,B}$ in the case of Gaussian fuzzy sets. Furthermore, the merged fuzzy set will always cut the outer borders of the union of the two sets at the α-cut level, which determines $d_{new}$.
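A small sketch of the α-cut based merging in (19) may help to see the border property. It assumes that each fuzzy set is described only by its center and its characteristic spread at the α-cut; the function names and the example values are illustrative.

```python
import math

ALPHA = math.exp(-0.5)  # alpha-cut level at which d equals sigma for Gaussian sets

def merge_alpha_cut(c_a, d_a, c_b, d_b):
    """Merge two uni-modal sets (center c, characteristic spread d) as in Eq. (19)."""
    borders = [c_a - d_a, c_a + d_a, c_b - d_b, c_b + d_b]  # the set U
    c_new = (max(borders) + min(borders)) / 2.0
    d_new = (max(borders) - min(borders)) / 2.0
    return c_new, d_new

def gauss(x, c, s):
    """Gaussian membership degree of x for a set with center c and spread s."""
    return math.exp(-0.5 * ((x - c) / s) ** 2)

# Two Gaussian sets with centers 0 and 1 and spread 1 each.
c_new, d_new = merge_alpha_cut(0.0, 1.0, 1.0, 1.0)

# The merged Gaussian passes through the outer border of the union at level ALPHA.
left_border_degree = gauss(-1.0, c_new, d_new)
```

For these values the merged set has center 0.5 and spread 1.5, and its membership degree at the left outer border −1 equals the α-cut level.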


Fig. 4. Merging of overlapping rules: the merged rule is indicated by an ellipsoid with a solid thick line, the original rules by dotted lines; compare also with the initial situation in Fig. 2.

Single fuzzy set merging, however, should be taken with care during the incremental learning stages, as the so-called dragging effect may arise [69], which is a repeated fuzzy set misplacement situation as shown in Fig. 3.
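The rule antecedent merging in (17) can be sketched in a few lines. The spread update below follows one reading of (17): a support-weighted variance of both rules plus the two center-shift terms. Rule 1 is assumed to be the more supported rule, and all names and example values are hypothetical.

```python
import math

def merge_rule_antecedents(c1, s1, k1, c2, s2, k2):
    """Merge two rules feature by feature; c*, s* are lists of centers and spreads,
    k* the supports. Rule 1 is assumed to be the more supported one (k1 > k2)."""
    k_new = k1 + k2
    c_new, s_new = [], []
    for cj1, sj1, cj2, sj2 in zip(c1, s1, c2, s2):
        # Support-weighted center: lies between both centers, closer to rule 1.
        cj = (cj1 * k1 + cj2 * k2) / k_new
        # Spread update (our reading of Eq. (17)).
        var = (k1 * sj1 ** 2 / k_new
               + (cj1 - cj) ** 2
               + (cj - cj2) ** 2 / k_new
               + k2 * sj2 ** 2 / k_new)
        c_new.append(cj)
        s_new.append(math.sqrt(var))
    return c_new, s_new, k_new

centers, spreads, support = merge_rule_antecedents(
    [0.0, 0.0], [1.0, 0.5], 30,   # rule R1: centers, spreads, support
    [1.0, 0.2], [1.0, 0.5], 10,   # rule R2: centers, spreads, support
)
```

Because the update only needs the centers, spreads and supports of the two rules, it can be applied in a single pass without access to past samples.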

3.2. Simplicity

Simplicity assurance can be seen as an extension and generalization of distinguishability assurance, as it deals with complexity reduction in a wider sense, in particular with the question of which (aspects of) model complexity can be further reduced because they do not contribute to model quality. In the well-known SAFIS approach [96,95], rule pruning is integrated implicitly into the incremental learning algorithm: rules are pruned whenever their statistical influence, measured in terms of their relative membership activation level (relative to the activation levels of all rules so far), drops below a predefined threshold. In this sense, such rules can be considered irrelevant and deleted, since they have contributed very little to the final model output over past data cycles. A similar strategy is applied in GENFIS [90], extended to generalized fuzzy rules. Another attempt to delete weakly activated rules is demonstrated in simple_eTS [4], where rules with very low support are deleted. Satellite clusters or outlier clusters are thus not integrated into the model output. An extension of this idea is integrated in the eTS+ approach [10], where concepts such as rule age and utility are used to detect and eliminate unimportant rules:

$$\mathrm{Age}_i = t - \frac{\sum_{l=1}^{k_i} t_l}{k_i}, \qquad \mathrm{Utility}_i = \frac{\sum_{l=1}^{t} \Psi_l}{t - t_i} \tag{20}$$

with $k_i$ the support of the $i$th rule as used in the previous section, $t$ the current time tag, $t_l$ the time tag of the $l$th data stream sample falling into the $i$th rule, and $t_i$ the time tag when the $i$th rule was born. Rule ages have also been used for rule pruning, in a slightly different way, in [23]. A concept similar to rule ages was recently proposed by Leite et al. in [52]: the most inactive granules (rules) are removed from the system model, where the activity is measured by a monotonically decreasing exponential function (here for the $i$th granule): $U_i = 2^{-\psi (t - t_i)}$, where $\psi$ denotes a decay rate and $t_i$ the time tag at which the $i$th rule was last updated. The half-life of a rule is the time needed to reduce the factor $U_i$ by half, that is, $1/\psi$. In the well-known SOFNN approach [57], the pruning strategy is based on the optimal brain surgeon approach, building upon the research presented in [59]. The basic idea is to use second-order derivative information to find unimportant fuzzy rules (neurons). This is achieved with a Taylor series expansion subject to the least squares error. All of these approaches more or less rely on the importance of fuzzy rules in terms of their support, usage and contribution to past model outputs; no geometric criteria have been studied that indicate the equivalency of rules (together with their consequent outputs) when they are slightly overlapping, touching or close to each other. This may neglect redundancy in a wider sense (along nearby lying rules).
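The pruning indicators discussed above can be illustrated with a short sketch. It assumes that the time tags of the samples and the per-sample activation levels of a rule are available as plain lists (a real single-pass implementation would keep running sums instead); the function names are hypothetical.

```python
def rule_age(t, sample_times):
    """Age as in (20): current time minus the mean time tag of the rule's samples."""
    return t - sum(sample_times) / len(sample_times)

def rule_utility(t, t_born, activations):
    """Utility as in (20): accumulated activation divided by the rule's lifetime.
    Assumes t > t_born."""
    return sum(activations) / (t - t_born)

def rule_activity(t, t_last_update, decay):
    """Exponential activity 2^(-decay * (t - t_last_update)) as in [52]."""
    return 2.0 ** (-decay * (t - t_last_update))

age = rule_age(10, [1, 2, 3])               # samples arrived early, so the rule is old
utility = rule_utility(10, 0, [0.5] * 10)   # constant activation 0.5 over 10 steps
half_life_value = rule_activity(20, 10, decay=0.1)  # elapsed time equals 1/decay
```

With a decay rate of 0.1 the activity drops to one half after 1/0.1 = 10 time steps, which matches the half-life stated in the text.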

3.2.1. In the rule base

Thus, we present an additional possibility by following and generalizing the considerations made in the previous section to the case where rules are moving together over time, but are not becoming significantly overlapping. Examples of such situations are shown in Fig. 5. Following the similarity-based criteria in the previous paragraph, including geometric properties, we design a geometric touching criterion for ellipsoids, which acts in a single-pass manner, as it does not require any past data. A part of the merge condition between two rules A and B in the partition can be deduced from the consideration that the two rules are touching or slightly overlapping and that their joint range of influence forms a homogeneous uni-modal structure. Whenever the two rules are spherical, the condition for whether two rules overlap or touch is obviously

$$d(c_{R_1}, c_{R_2}) \le r_{R_1} + r_{R_2} \tag{21}$$

with $d(x, y)$ the Euclidean distance between $x$ and $y$, $c_{R_1}$ the center of the rule $R_1$ and $r_{R_1}$ the radius of the sphere described by rule $R_1$. Touching is the case when equality holds. In the case of rules with different spreads along different dimensions, the above condition can be extended to account for the different fuzzy set spreads ($\vec{\sigma}$):

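$$d(c_{R_1}, c_{R_2}) \le \frac{\sum_{j=1}^{p} |c_{A_{1j}} - c_{A_{2j}}| \, (fac \cdot \sigma_{A_{1j}} + fac \cdot \sigma_{A_{2j}})}{\sum_{j=1}^{p} |c_{A_{1j}} - c_{A_{2j}}|} + \epsilon \tag{22}$$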

with $A_{1j}$ the fuzzy set in the $j$th antecedent part of $R_1$, $c_{A_{1j}}$ its center and $\sigma_{A_{1j}}$ its spread. Note that we use $\sigma$ as the characteristic spread on α-cut level, as in Section 3.1.2. If evolving clustering is applied in the EFS approach, the spreads of the fuzzy sets can be directly taken from the spread of the clusters (in main position) in each dimension. Condition (22) means that the right-hand side does not simply use the sum of the spreads along all dimensions in both rules, but a weighted average according to the degree of the center shift along each direction. The symbol $\epsilon$ denotes a constant near 0 (either positive or negative) and accounts for the degree of overlap which is required to allow a merging (when $\epsilon < 0$) resp. the degree of deviation between the two rules which is tolerated for a merging (when $\epsilon > 0$). If $\epsilon$ is set to 0, a touching of the two rule ellipsoids already fulfills the condition. The factor $fac$ influences the tolerance on the contours, based on which the degree of overlap is judged (usually, $fac \in [1, 2]$, i.e. a 1- to 2-$\sigma$ band).

The second part of the merge condition is driven by the fact that two rules can be merged whenever their joint region follows the "behavior" of their single regions (homogeneity criterion). In order to illustrate this, the two rules shown in Fig. 5(a) can intuitively be merged, as they cover one joint data cloud, whereas the two rules shown in (b) represent data clouds with different orientations and shape characteristics, thus merging has to be handled with care. In fact, merging these two rules into one (shown by a dotted ellipsoid) would unnaturally blow up the two-dimensional range of influence of the new rule, yielding an imprecise representation of the two data clouds. Fig. 5(c) shows a case where a small rule (also denoted as satellite rule) is attached to a significantly bigger rule. In such a case, the blow-up effect regarding the volume of the merged rule is also quite small.

We add a smooth homogeneity criterion based on the volumes of the original and (hypothetically) merged rule antecedents, characterizing the degree of 'blow-up' of the merged rule compared to the original rules:

$$V_{merged} \le p \, (V_{R_1} + V_{R_2}) \tag{23}$$

with p the dimensionality of the feature space and V the volume of an ellipsoid in main position, which is defined by [41]:



$$V = \frac{2 \prod_{j=1}^{p} \sigma_j \; \pi^{p/2}}{p \, \Gamma(p/2)}, \qquad \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt \tag{24}$$
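The geometric part of the merge condition, i.e. the touching criterion (22) and the volume-based homogeneity criterion (23) with the ellipsoid volume (24), can be sketched as follows. The sketch assumes axis-parallel ellipsoids described by per-feature centers and spreads; the function names and the default values of `fac` and `eps` are illustrative choices, not prescribed by the text.

```python
import math

def ellipsoid_volume(spreads):
    """Volume of an axis-parallel ellipsoid with the given semi-axes, Eq. (24)."""
    p = len(spreads)
    return 2.0 * math.prod(spreads) * math.pi ** (p / 2) / (p * math.gamma(p / 2))

def touching(c1, s1, c2, s2, fac=2.0, eps=0.0):
    """Touching criterion, Eq. (22): center distance vs. shift-weighted spreads."""
    shifts = [abs(a - b) for a, b in zip(c1, c2)]
    total_shift = sum(shifts)
    if total_shift == 0.0:
        return True  # identical centers: the rules trivially overlap
    dist = math.sqrt(sum(d * d for d in shifts))
    weighted = sum(d * (fac * a + fac * b)
                   for d, a, b in zip(shifts, s1, s2)) / total_shift
    return dist <= weighted + eps

def homogeneous(v_merged, v_1, v_2, p):
    """Homogeneity criterion, Eq. (23): the merged volume must not blow up."""
    return v_merged <= p * (v_1 + v_2)

touch = touching([0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [1.0, 1.0])  # 2-sigma bands meet
apart = touching([0.0, 0.0], [1.0, 1.0], [5.0, 0.0], [1.0, 1.0])  # gap between bands
unit_disc = ellipsoid_volume([1.0, 1.0])        # area of the unit circle
unit_ball = ellipsoid_volume([1.0, 1.0, 1.0])   # volume of the unit sphere
```

Both checks use only the current rule parameters, so they fit the single-pass requirement stated above.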

Fig. 5. (a) Two rules (solid ellipsoids) which are touching each other and are homogeneous in the sense that the (volume, orientation of the) merged rule (dotted ellipsoid) is in conformity with the original two rules; (b) two rules (solid ellipsoids) which are touching each other and are not homogeneous; and (c) a rule satellite, i.e. a small rule touching a much bigger one → a merge is suggested, as the smaller rule obviously adds no extra valuable component.

The third part of the merge condition takes into account the consequents of the rules. In distinguishability assurance techniques, the consequents are considered in the merging process in order to achieve consistent rule bases, and their merging is thus handled appropriately (see Section 4). However, they do not have an effect on the merge conditions, as strongly overlapping antecedents always represent superfluous information per se. In the case of rules that have grown together but do not necessarily overlap (as shown in Fig. 5), the consequents of the rules may point in different directions, accounting for a necessary non-linearity aspect implicitly contained in the data. For instance, in classification problems, the two rules shown in Fig. 5(a) may represent two different classes. In this case, merging would be counter-productive, as one rule would wash out the clear original boundary between the classes: one class can even be totally overwhelmed by the other (in the joint region). Thus, we suggest not merging two touching rules fulfilling criteria (22) and (23) whenever the consequents of the two rules encode a different class label resp. contain a different majority class.

In a regression context, the consequents in state-of-the-art EFS are usually represented as singleton numerical values (Sugeno) or hyper-planes (Takagi–Sugeno). In the case of singleton consequents, the difference in outputs, i.e. the steepness of the approximation curve, between two touching rules is essential for deciding whether the rules can be merged or not. Thus, we suggest the following similarity measure for the consequents:

$$S_{cons}(R_1, R_2) = 1 - \frac{|w_{R_1} - w_{R_2}|}{\mathrm{range}(y)} \tag{25}$$

with $w_{R_1}$ and $w_{R_2}$ the singleton consequent values of $R_1$ and $R_2$, respectively. The closer $w_{R_1}$ and $w_{R_2}$ are, the more constant the regression model is along the local regions represented by the two rules, and the closer $S_{cons}$ is to 1. In the case of hyper-plane consequents, defined by (and used as $f$ in (1)):

$$l(\vec{x}) = \sum_{j=1}^{p} w_j x_j + w_0 \tag{26}$$

we have to catch the degree of continuation of the local trend over the two nearby lying rules. This is essential as a different trend indicates a necessary non-linearity contained in the functional relation/approximation between inputs and outputs – see Fig. 6. Thus, we propose an additional similarity criterion based on the degree of deviation in the hyper-planes’ gradient information, i.e. in the consequent parameter vectors of the two rules without the intercept; range normalization is important in order to obtain comparable impacts of variables in the consequents. We suggest a criterion based on the dihedral angle of the two hyper-planes they span, which is defined by:

$$\phi = \arccos\left(\frac{\vec{a}^T \vec{b}}{|\vec{a}| \, |\vec{b}|}\right) \tag{27}$$

with $\vec{a} = (w_{R_1,1}, w_{R_1,2}, \ldots, w_{R_1,p}, -1)^T$ and $\vec{b} = (-w_{R_2,1}, -w_{R_2,2}, \ldots, -w_{R_2,p}, +1)^T$ the normal vectors of the two planes, pointing in opposite directions with respect to the target $y$ ($-1$ and $+1$ in the last coordinate). If $\phi$ is close to 180° ($\pi$), the hyper-planes obviously represent the same trend of the approximation curve; therefore the criterion should be high (close to 1), as the rules can be merged. If it is 90° ($\pi/2$) or below, a change in the direction of the approximation functions takes place, thus the criterion should be equal to or lower than 0.5. Hence, the similarity criterion becomes

$$S_{cons}(\vec{w}_{R_1}, \vec{w}_{R_2}) = \frac{\phi}{\pi} \tag{28}$$

If the two hyper-planes are (nearly) parallel according to (27), i.e. $\phi$ is close to $\pi$, but the deviation between the intercepts is high, there is a significant gap in between the two nearby lying planes. This usually points to the modeling of a step-wise function, and thus the rules should not be merged. Hence, we may extend the similarity expressed by (28) to:

$$S_{cons}(\vec{w}_{R_1}, \vec{w}_{R_2}) = \min\left(\frac{\phi}{\pi},\ 1 - \frac{|w_{R_1,0} - w_{R_2,0}|}{\mathrm{range}(y)}\right) \tag{29}$$

Finally, the joint condition for whether two neighboring, touching rules $R_1$ and $R_2$ can be merged without losing valuable information is defined by:

$$\text{Eq. (22)} \ \wedge\ \text{Eq. (23)} \ \wedge\ S_{cons}(\vec{w}_{R_1}, \vec{w}_{R_2}) \ge thresh \tag{30}$$
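The consequent similarities in (25) and (27) to (29) can be sketched as below. The sketch assumes that each hyper-plane consequent is stored as a list `[w0, w1, ..., wp]` with the intercept first and that the inputs are already range-normalized, as the text requires; the function names are hypothetical.

```python
import math

def s_cons_singleton(w1, w2, y_range):
    """Similarity of two singleton consequents, Eq. (25)."""
    return 1.0 - abs(w1 - w2) / y_range

def s_cons_hyperplane(w1, w2, y_range):
    """Similarity of two hyper-plane consequents, Eq. (29).
    w1, w2 = [w0, w1, ..., wp] with the intercept w0 first (assumed layout)."""
    a = list(w1[1:]) + [-1.0]             # normal vector of plane 1
    b = [-v for v in w2[1:]] + [1.0]      # normal vector of plane 2, opposite direction
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos_phi = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    phi = math.acos(cos_phi)                   # Eq. (27)
    gap = 1.0 - abs(w1[0] - w2[0]) / y_range   # intercept term of Eq. (29)
    return min(phi / math.pi, gap)

same_trend = s_cons_hyperplane([0.0, 1.0], [0.0, 1.0], 10.0)     # identical planes
opposite_trend = s_cons_hyperplane([0.0, 1.0], [0.0, -1.0], 10.0)  # up- vs. down-trend
step_function = s_cons_hyperplane([0.0, 1.0], [5.0, 1.0], 10.0)    # parallel with a gap
```

Identical planes give a similarity close to 1, an up-trend against a down-trend of equal steepness gives 0.5, and parallel planes with an intercept gap of half the output range are capped at 0.5 by the second term of (29).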

Fig. 6. The corresponding two hyper-planes (consequents) of two touching rule antecedents indicating a necessary non-linearity in the approximation surface, as one hyper-plane shows an up-trend and the other a downtrend in the functional dependency between input features X1; X2 on the target Y; thus, a merged rule (indicated by dotted ellipsoids and hyper-plane) would yield an imprecise presentation and cause a low model quality in this region.


Finally, we want to point out that smoothing of consequent functions, as explained in Section 8.1, may help to increase $S_{cons}$ in (28) for nearby lying and touching rules. The merging process itself is again conducted by (17) for the antecedent parts and by (33) below for the consequents. Obviously, in each incremental learning cycle, only updated rules need to be checked as to whether they become overlapping/touching with the other existing rules; thus, the computational complexity lies in the range $O((C - 1)p)$.

3.2.2. In the rule length

Simplicity also refers to the length of rule antecedent parts, i.e. to the number of AND-connections included when reading an IF-THEN premise. For comprehensibility reasons, the number of such different entities in a rule should be as small as possible; otherwise the user may lose the meaning of a rule, i.e. the relation between the inputs it is meant to express. One possible approach to tackle this issue and to restrict the lengths of the rules is an incremental feature weighting concept, which may assign features a low weight $< \epsilon$ whenever they become unimportant for the model output (predictions). Two possibilities for such weighting concepts will be demonstrated in Section 6, one for streaming classification and one for streaming regression problems. In this context, the rule antecedent parts corresponding to such features may be eliminated when the evolved fuzzy system is shown to the user/expert, as these features do not include any valuable information required for understanding the (system) relationship/dependency the model represents.

4. Consistency

Within an incremental learning context, inconsistency of the rule base may arise whenever two rules become overlapping in the antecedents (due to the considerations made in Section 3.1, see also Fig. 2), but not in the consequents.
In fact, within an evolving regression context, independently of the type of the evolving fuzzy system architecture (Mamdani/TS/other), this case may point either to a high noise level or to an inconsistently learnt output behavior; apart from a low interpretability level, different consequents of equally firing rules may lead to highly blurred outputs, thus also affecting the accuracy of the predictions. Within an evolving classification context, such occurrences lead to a (higher) conflict case, as classes overlap within the same local region [67]. The inconsistency of the rule base in an evolving learning context can thus be measured in terms of the inconsistency levels of two or more fuzzy rules contained in the rule base. For the sake of simplicity, we only consider the inconsistency between two fuzzy rules, which can be easily generalized to multiple rule inconsistencies.

Definition 3. Rule $R_1$ is inconsistent with rule $R_2$ if and only if $S_{ante}(R_1, R_2) \ge S_{cons}(R_1, R_2)$ with $S_{ante}(R_1, R_2) \ge thr$.

Here, $S_{ante}$ denotes the similarity degree in the antecedent parts and $S_{cons}$ the similarity degree in the consequent parts, the former measured by (15) and the latter measured by (28) for TS fuzzy systems and by (15) for Mamdani fuzzy systems. As discussed at the beginning of Section 3, the threshold may depend on the similarity measure; however, a value of $S_{ante}$ close to 1 can always be assumed to point to a high similarity and a value of $S_{ante}$ close to 0 to a low similarity. This is obviously an extension of the inconsistency condition for crisp rules, where two rules with the same antecedent parts (i.e. $S_{ante}(R_1, R_2) = 1$) are inconsistent whenever their consequent parts are different (i.e. $S_{cons}(R_1, R_2) = 0$). Here, two rules with similar antecedents, i.e. $S_{ante}(R_1, R_2) \ge thr$, are inconsistent in a fuzzy sense when their consequents are dissimilar, resp. when the similarity of their consequents is lower than the similarity of their antecedents.

Fig. 7. Exponential function for consistency evaluation of a rule pair based on the similarity between rules antecedents and rules consequents.


Inconsistency in EFS has, to the best of our knowledge, only been handled in [72], where the merging of consequents was affected by the consistency level. However, the level was either set to 0 (full inconsistency) or to 1 (full consistency), depending on whether $S_{ante} \ge S_{cons}$ (= 0) or the other way round (= 1). Here, we suggest a smooth criterion based on an exponential kernel-based measure, following the definition and motivation in [42]. The modification concerns the influence of a low similarity in the antecedents, thus achieving high consistency degrees in the case of dissimilar antecedents:

$$\mathrm{Cons}(R_1, R_2) = \exp\left(-\frac{\left(\dfrac{S_{ante}(R_1, R_2)}{S_{cons}(R_1, R_2)} - 1\right)^2}{\left(\dfrac{1}{S_{ante}}\right)^7}\right) \tag{31}$$

A plot of this function is shown in Fig. 7: obviously, a low value of Sante (up to 0.4) always achieves a consistency close to 1. The consistency of the overall rule base is then given by using the aggregation of the consistencies of the single rules:

$$\mathrm{Cons}_{all} = \underset{i,j=1;\ i \ne j}{\overset{C}{\mathrm{Agg}}}\ \mathrm{Cons}(R_i, R_j) \tag{32}$$
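A small numerical sketch of the consistency degree may help to see its behavior. It assumes one reading of (31), namely the squared deviation of the similarity ratio in the numerator and the seventh power of $1/S_{ante}$ in the denominator, and it uses the minimum as aggregation operator in (32), which is an assumption since the text leaves the aggregation open. Function names are hypothetical.

```python
import math

def consistency(s_ante, s_cons):
    """Consistency of a rule pair, one reading of Eq. (31):
    exp(-(s_ante / s_cons - 1)^2 / (1 / s_ante)^7)."""
    if s_ante <= 0.0:
        return 1.0   # completely dissimilar antecedents are always consistent
    if s_cons <= 0.0:
        return 0.0   # limit for fully dissimilar consequents
    return math.exp(-((s_ante / s_cons - 1.0) ** 2) * s_ante ** 7)

def rule_base_consistency(pairs, agg=min):
    """Eq. (32) with an ASSUMED aggregation (default: minimum over all rule pairs)."""
    return agg(consistency(s_a, s_c) for s_a, s_c in pairs)

equal_similarity = consistency(0.9, 0.9)    # consequents as similar as antecedents
low_ante = consistency(0.4, 0.05)           # dissimilar antecedents
conflict = consistency(0.95, 0.1)           # similar antecedents, dissimilar consequents
overall = rule_base_consistency([(0.9, 0.9), (0.4, 0.05), (0.95, 0.1)])
```

Under this reading, an antecedent similarity of 0.4 still yields a consistency above 0.9 even for very dissimilar consequents, in line with the remark on Fig. 7, while strongly overlapping antecedents with dissimilar consequents drive the value towards 0.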

A high consistency level of the rule base in EFS within an incremental learning context can be assured by applying the merging strategy described in the previous section. In fact, whenever $S_{ante}(R_1, R_2)$ is higher than a threshold (e.g. 0.6), the two rules are merged into one (see Section 3.1) and the inconsistency between them is therefore resolved. The merging strategy for the rule consequent functions also depends on the inconsistency level and, following the idea of Yager's participatory learning concept [114], is conducted by [72]:

$$\vec{w}_{new} = \vec{w}_{R_1} + \alpha \cdot \mathrm{Cons}(R_1, R_2) \cdot (\vec{w}_{R_2} - \vec{w}_{R_1}) \tag{33}$$

where $\alpha = k_{R_2} / (k_{R_1} + k_{R_2})$. For $\mathrm{Cons}(R_1, R_2) = 0$, we obtain $\vec{w}_{new} = \vec{w}_{R_1}$, i.e., the consequent of the more relevant rule; we always assume $k_{R_1} > k_{R_2}$ for the sake of simplicity, with $k_{R_1}$ the support of rule $R_1$. In this sense, the consequent vector of the merged rule is more influenced by the more supported rule when the consistency degree is lower, thus increasing the belief in the more supported rule (assuming the other to be a more superfluous extraction from the data, e.g. due to noise in the local region). For $\mathrm{Cons}(R_1, R_2) = 1$, on the other hand, Eq. (33) reduces to the weighted averaging concept as also used in various approaches such as [92,73,60,77] or [58]. There, however, inconsistencies are not handled explicitly and adequately.

Similar considerations can be made for Mamdani-type fuzzy systems (used for evolving regression purposes), where in the case of $\mathrm{Cons}(R_1, R_2) = 1$ the fuzzy sets in the output dimension can be merged according to (19) (in the case of fuzzy sets with arbitrary shape) and to (17) (in the case of Gaussians); in the case of $\mathrm{Cons}(R_1, R_2) < 1$, (33) can be used as a participatory extension of the weighted average for merging the centers of the fuzzy sets $A$ (consequent of rule $R_1$) and $B$ (consequent of rule $R_2$), and

$$\sigma_{merged} = \sigma_A + \mathrm{Cons}(A, B) \cdot (\sigma_{new} - \sigma_A) \tag{34}$$

for merging their spreads, with $\sigma_{new}$ given by the second formula in (17); or

$$d_{merged} = d_A + \mathrm{Cons}(A, B) \cdot (d_{new} - d_A) \tag{35}$$

when using (19).

Finally, we note that in general (knowledge-based) Mamdani systems, two rules with equal antecedents $A$ and different consequents $B$ and $C$ usually provide a valid consistent contribution to the rule base, as they can be seen as an "OR" connection on the output fuzzy sets: "B OR C"; the inconsistency in Mamdani systems is seen here in the particular evolving regression context. In the case of evolving classification problems, two rules with the same or similar antecedents can be directly merged when using the extended rule form for $K$ classes, according to the definition in [14], as used in most of the conventional EFC approaches [63,12,70]:

$$R_i:\ \text{IF } x_1 \text{ IS } A_{i1} \text{ AND } \ldots \text{ AND } x_p \text{ IS } A_{ip} \text{ THEN } l_i = 1\ (conf_{i1}),\ l_i = 2\ (conf_{i2}),\ \ldots,\ l_i = K\ (conf_{iK}) \tag{36}$$

i.e. each class is represented in each rule with some confidence level, where the maximal level corresponds to the output class. Then, if, for instance, two rules $R_1$ and $R_2$ with similar antecedents have a different class label preference, i.e. $R_1$ prefers Class #1 with $conf_{11} = 1$ and $R_2$ prefers Class #2 with $conf_{22} = 1$ (and all other confidences set to 0), the same class overlap situation can be easily represented by one rule $R_{merged}$ with $conf_{merged,1} = 0.5$ and $conf_{merged,2} = 0.5$. In general,

$$conf_{mj} = \frac{conf_{1j} + conf_{2j}}{\sum_{i=1}^{K} (conf_{1i} + conf_{2i})} \tag{37}$$

In the case of imbalanced situations, i.e. when one rule is supported much more than the other similar rule, the same weighted strategy as in (33) can be applied (using $\vec{conf}_{R_{1,2}}$ instead of $\vec{w}_{R_{1,2}}$), including the support of the rules and the consistency level of the two rules, measured in terms of (31) with:

$$S_{cons}(R_1, R_2) = 1 - \frac{1}{K} \sum_{k=1}^{K} |conf_{1k} - conf_{2k}| \tag{38}$$

i.e. with the average discrepancy of the confidence levels between the two rules over all classes.
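The consequent merging steps of this section can be summarized in a short sketch covering the participatory update (33), the confidence merging (37) and the class-based consequent similarity (38). Rule 1 is assumed to be the more supported rule; consequents and confidences are plain lists, and all names are illustrative.

```python
def merge_consequents(w1, w2, k1, k2, cons):
    """Participatory merging, Eq. (33); rule 1 is assumed to be the more supported one."""
    alpha = k2 / (k1 + k2)
    return [a + alpha * cons * (b - a) for a, b in zip(w1, w2)]

def merge_confidences(conf1, conf2):
    """Merged class confidences of two rules, Eq. (37)."""
    total = sum(conf1) + sum(conf2)
    return [(a + b) / total for a, b in zip(conf1, conf2)]

def s_cons_classes(conf1, conf2):
    """Consequent similarity for classification rules, Eq. (38)."""
    return 1.0 - sum(abs(a - b) for a, b in zip(conf1, conf2)) / len(conf1)

inconsistent = merge_consequents([1.0, 2.0], [3.0, 4.0], 30, 10, cons=0.0)
consistent = merge_consequents([1.0, 2.0], [3.0, 4.0], 30, 10, cons=1.0)
merged_conf = merge_confidences([1.0, 0.0], [0.0, 1.0])
class_similarity = s_cons_classes([1.0, 0.0], [0.0, 1.0])
```

With a consistency of 0 the merged consequent stays at the more supported rule, with a consistency of 1 the update equals the support-weighted average, and two rules preferring different classes with full confidence merge into equal confidences of 0.5.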


5. Coverage and completeness

Coverage refers to the specific characteristic of fuzzy partitions (low-level coverage) and rules (high-level coverage) that they do not allow any holes in the feature space, i.e. undefined input states. In a data-driven learning context, coverage may be severely violated, as rules are usually only extracted in regions of the feature space where samples appeared. On the other hand, coverage can always be guaranteed whenever fuzzy sets with infinite support are used (such as Gaussian membership functions). This makes this type of membership function very attractive for evolving fuzzy systems methods, and therefore the most commonly used one. The only problem is that all rules may then contribute to the model inference and final output for an arbitrary input sample, making the input–output interpretation hard. However, the contribution of rules with a significant distance to the sample will be more or less negligible, especially when assuring $\epsilon$-completeness (as discussed below) with a significant value of $\epsilon > 0$: then, the rules far away from the sample will always be overwhelmed by rules lying close to it within the inference process and thus also in the final output. In this sense, they can be neglected in the interpretation of the input–output behavior. $\epsilon$-completeness with $\epsilon \in\ ]0, 1]$ can be seen as a generalization of coverage (i.e. coverage = $\epsilon$-completeness with a very small value for $\epsilon$); thus we focus on techniques for assuring or at least approaching $\epsilon$-completeness.

Definition 4. A fuzzy partition for feature $X_i$ containing the fuzzy sets $A_1, \ldots, A_M$ is said to be $\epsilon$-complete whenever there is no point $x \in [\min(X_i), \max(X_i)]$ such that $\mu_{A_i}(x) < \epsilon$ for all sets $A_i$, $i = 1, \ldots, M$, with $\epsilon > 0$ a small positive number.

Extending this to rule level requires a rule base $\mu_1, \ldots, \mu_C$ such that $\max_{i=1,\ldots,C}(\mu_i(\vec{x})) > \epsilon$ for all points $\vec{x}$ in the input feature space, i.e. for all points (samples) at least one rule fires with a significant degree. Taking into account that each fuzzy set is used in at least one rule, the $\epsilon$-completeness of rules can be directly associated with the $\epsilon$-completeness of sets through the applied t-norm: as $t(x_1, x_2, \ldots, x_p) \le \min(x_1, x_2, \ldots, x_p)$, the following holds



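$$\forall \vec{x}\ \exists i\ \Big(\mu_i = \underset{j=1,\ldots,p}{T}(\mu_{ij}) > \epsilon\Big) \ \Rightarrow\ \forall \vec{x}\ \exists i\ \big(\forall j\ \mu_{ij} > \epsilon\big) \tag{39}$$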

with $\mu_{ij}$ the membership degree of fuzzy set $A_j$ appearing in the $j$th antecedent part of the $i$th rule and $p$ the rule length. In this sense, assuring $\epsilon$-completeness on rule level automatically ensures $\epsilon$-completeness on fuzzy set level. Conversely, whenever $\epsilon$-completeness on fuzzy set level is assured, each antecedent part of a rule fires with at least $\epsilon$, thus $T_{j=1}^{p}\,\epsilon$- resp. $T(\epsilon_1, \epsilon_2, \ldots, \epsilon_p)$-completeness is assured.

The number of approaches which take this criterion into account is quite small: one of the most well-known approaches supporting this concept internally during the incremental learning scheme is the SOFNN approach [57]. In fact, the rule evolution machine itself relies on the concept of $\epsilon$-completeness on rule level: whenever a new incoming sample is not sufficiently covered by any rule, i.e. $\max_{i=1,\ldots,C}(\mu_i) < r$, the width of the rule with the highest activation level is enlarged a bit (by default by a factor of 1.2); thus, this does not guarantee $\epsilon$-completeness for the same data sample, but it may enforce a kind of rapprochement towards $\epsilon$-completeness over time. Another approach which assures a minimum of overlap between adjacent fuzzy sets is the SAFIS approach, originally introduced in [95] and later extended in [94]. There, an overlap factor $\kappa$ is defined and integrated into the "influence of a fuzzy rule" concept, which decides whether new rules are evolved or not. This indeed ensures $\epsilon$-completeness, however at the cost of additional complexity through evolving new rules, which in turn may decrease interpretability, according to the simplicity considerations in Section 3.

In this paper, we present two more generic variants for assuring this important property (not interlinked with a specific EFS approach):

- The first acts on fuzzy partition level and is based on heuristic adjustments of the ranges of influence of those components which were updated during the last incremental learning step.
- The second one acts on rule level and employs an enhanced incremental optimization procedure for non-linear antecedent parameters in rules.

Compared to the available approaches in the literature, which only provide a rapprochement towards $\epsilon$-completeness, the first new concept really enforces and ensures it. The second concept also only approaches it, but ensures a stricter convergence in the form of an implicit optimization process, thus providing an optimized tradeoff between accuracy and $\epsilon$-completeness. Furthermore, neither concept operates at the cost of a higher complexity.

5.1. Heuristic-based assurance

The heuristic-based approach relies on the enforcement of sufficient overlap between updated fuzzy sets and adjacent ones. Thus, once a rule (say the $m$th) was updated during the incremental learning stage, for each single dimension $j = 1, \ldots, p$ the fuzzy set $A_{m,j}$ appearing in the $j$th antecedent part of the rule is checked as to whether it still has significant overlap ($> \epsilon$) with any other set. Let therefore $A_{m_l,j}$ be the "next" fuzzy set to the left and $A_{m_r,j}$ the "next" fuzzy set to the right of the updated set $A_{m,j}$ (if $A_{m,j}$ is the left- resp. right-most set, then only $A_{m_r,j}$ resp. $A_{m_l,j}$ exists). The indices of the fuzzy sets which are "next to the left" resp. "next to the right" are given by:


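$$m_r = \underset{i \ne m}{\mathrm{argmax}} \big\{ \mu_{A_{i,j}}(x) \ \big|\ \mu_{A_{m,j}}(x) = \epsilon \ \wedge\ x \ge c_{A_{m,j}} \ \wedge\ c_{A_{i,j}} > c_{A_{m,j}} \big\}$$

$$m_l = \underset{i \ne m}{\mathrm{argmax}} \big\{ \mu_{A_{i,j}}(x) \ \big|\ \mu_{A_{m,j}}(x) = \epsilon \ \wedge\ x < c_{A_{m,j}} \ \wedge\ c_{A_{i,j}} < c_{A_{m,j}} \big\} \tag{40}$$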

with $c_{A_{m,j}}$ the center (resp. modal value) of fuzzy set $A_{m,j}$ and $x$ the points to the left and right of $c_{A_{m,j}}$ where the membership degree of $A_{m,j}$ is equal to $\epsilon$. In the case when $\forall i\ \mu_{A_{m,j}}(\min(X_j)) > \mu_{A_{i,j}}(\min(X_j))$, $A_{m,j}$ denotes the left-most fuzzy set and $m_l$ is set to $m$. In the case when $\forall i\ \mu_{A_{m,j}}(\max(X_j)) > \mu_{A_{i,j}}(\max(X_j))$, $A_{m,j}$ denotes the right-most fuzzy set and $m_r$ is set to $m$. Note that the definition in (40) does not necessarily imply that the fuzzy set with the closest center is "the next", but rather the one with the biggest overlap on the corresponding side. These definitions provide a kind of adjacency relationship in the context of completeness. If $\mu_{A_{m_r},j}(x)$ and $\mu_{A_{m_l},j}(x)$, for $x$ as defined in (40), are greater than or equal to $\epsilon$, no modification is performed; otherwise the updated fuzzy set $A_{m,j}$ is adjusted in such a way that $\epsilon$-completeness is achieved. If $m_r = m$ resp. $m_l = m$, then it is checked whether $\mu_{A_{m,j}}(\max(X_j)) > \epsilon$ resp. $\mu_{A_{m,j}}(\min(X_j)) > \epsilon$; if this is the case, no modification is performed. Fig. 8 visualizes two cases of overlap: sufficient overlap (a) and insufficient overlap together with the modification applied according to (42) to assure sufficient overlap (b).

We restrict the adjustment to the spread of the fuzzy set and do not adjust its center, as conducted in [21], for instance. This is important in order to circumvent the so-called dragging effect described in [69], which may lead to successive fuzzy set misplacements and furthermore to a severe downtrend in the precision of the fuzzy model over time. Adjusting only the spread keeps the focal point of the set at the natural core of the data cloud, and thus the dragging effect is prevented.

Let $A_m$ be a fuzzy set in an antecedent part belonging to a rule updated during the last incremental learning stage. Then, it is checked whether the membership degree of the intersection point lying between the centers of the two fuzzy sets, i.e. $c_{m,j} \le inter_x \le c_{m_r,j}$, is higher than $\epsilon$. In the case of Gaussian fuzzy sets (the most conventional choice [63]), the intersection points of two adjacent fuzzy sets $A_m$ and $A_{m_r}$ appearing in the $j$th fuzzy partition (for the $j$th dimension) are given by (see [69]):

$$\mathrm{inter}_x(1,2) = -\frac{c_{m_r,j}\,\sigma_{m,j}^2 - c_{m,j}\,\sigma_{m_r,j}^2}{\sigma_{m_r,j}^2 - \sigma_{m,j}^2} \pm \sqrt{\left(\frac{c_{m_r,j}\,\sigma_{m,j}^2 - c_{m,j}\,\sigma_{m_r,j}^2}{\sigma_{m_r,j}^2 - \sigma_{m,j}^2}\right)^2 - \frac{c_{m,j}^2\,\sigma_{m_r,j}^2 - c_{m_r,j}^2\,\sigma_{m,j}^2}{\sigma_{m_r,j}^2 - \sigma_{m,j}^2}} \tag{41}$$

If it is not higher, the spread $\sigma_{m,j}$ is modified such that

$$\exp\left(-0.5\,\frac{(\mathrm{inter}_x(1,2) - c_{m,j})^2}{\sigma_{m,j}^2}\right) = \epsilon \tag{42}$$

which states that the membership degree at the intersection point is equal to $\epsilon$. Substituting (41) into (42) and solving this equation for $\sigma_{m,j}$ leads to (proof is left to the reader):

$$\sigma_{m,j}(new_{m,m_r}) = \sqrt{\frac{(c_{m_r,j} - c_{m,j} - a)^2}{-2\ln(\epsilon)}} \tag{43}$$

with $a = \sqrt{-2\ln(\epsilon)\,\sigma_{m_r,j}^2}$. Furthermore, the same is checked between fuzzy sets $A_m$ and $A_{m_l}$ with the intersection as in (41), leading to a second suggested spread $\sigma_{m,j}(new_{m,m_l})$. Finally, the new spread is taken as the maximum of the suggested spreads in order to assure full $\epsilon$-completeness:

Fig. 8. (a) Two fuzzy sets guaranteeing sufficient overlap larger than $\epsilon = 0.135$ (horizontal line), which is the membership degree at $\mu \pm 2\sigma$, covering 95.45% of the data according to the two-$\sigma$ rule; (b) two adjacent fuzzy sets violating $\epsilon$-completeness, with the adjusted fuzzy set according to (42) shown in dotted lines.


$$\sigma_{m,j}(new) = \max\left(\sigma_{m,j}(new_{m,m_r}),\ \sigma_{m,j}(new_{m,m_l})\right) \tag{44}$$
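The spread adjustment can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: it assumes Gaussian fuzzy sets, and instead of evaluating the quadratic root formula (41) it computes the required spread directly from the condition that both sets reach $\epsilon$ at the intersection point lying between their centers. The function names (`gauss`, `suggested_spread`, `adjust_spread`) and the convention of never shrinking an already sufficient spread are illustrative assumptions.

```python
import math


def gauss(x, c, sigma):
    """Membership degree of x in a Gaussian fuzzy set (center c, spread sigma)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)


def suggested_spread(c_m, c_nb, sigma_nb, eps):
    """Spread of A_m for which both sets reach eps at their inner intersection."""
    k = math.sqrt(-2.0 * math.log(eps))  # |x - c| / sigma at membership eps
    a = k * sigma_nb                      # half-width of the neighbor's eps-cut
    gap = abs(c_nb - c_m) - a
    if gap <= 0.0:
        return None                       # neighbor's eps-cut already reaches c_m
    return gap / k


def adjust_spread(c_m, sigma_m, neighbors, eps=0.135):
    """Return the (possibly widened) spread of A_m; the set is never shrunk.

    neighbors: list of (center, spread) pairs of the adjacent sets
    (one pair for a left- or right-most set, two pairs otherwise).
    """
    candidates = [sigma_m]
    for c_nb, sigma_nb in neighbors:
        s = suggested_spread(c_m, c_nb, sigma_nb, eps)
        if s is not None:
            candidates.append(s)
    return max(candidates)
```

Taking the maximum over the current spread and all suggestions leaves a set untouched whenever its overlap with both neighbors is already at least $\epsilon$.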

If $A_m$ is the left- resp. right-most fuzzy set in a partition (as defined above), only $\sigma_{m,j}(new_{m,m_r})$ resp. $\sigma_{m,j}(new_{m,m_l})$ needs to be calculated. Similar considerations and adjustments of characteristic spreads can be conducted for other types of fuzzy sets. The complexity of re-setting is in the range $O(p)$, as for each antecedent part in the updated rule the intersection with the adjacent fuzzy sets needs to be calculated and, where necessary, the spread reset; both steps require constant calculation time. If there is only a very weak overlap between adjacent fuzzy sets, the width of the new fuzzy set may be severely blown up, such that the local data characteristics are represented quite inefficiently. Therefore, we propose to constrain the adjustment of $\sigma_{m,j}(new)$ by:

$$\sigma_{m,j}(new) = \min\left(\sigma_{m,j}(new),\ \kappa\,\sigma_{m,j}(old)\right) \tag{45}$$

with $\kappa$ a value in [1.2, 1.5].

5.2. Completeness by enhanced incremental optimization

This enhanced strategy relies on the possibility that non-linear antecedent parameters (defining the shape of the fuzzy sets) are adjusted in incremental optimization steps based on new incoming data. This means that learning from streams is conducted on the basis of an optimization problem for which an incremental or recursive methodology exists and through which parameters are usually forced to converge to optimal or at least sub-optimal solutions. For instance, the evolving fuzzy systems approach demonstrated in [98] relies on well-founded and analytically derived incremental optimization steps, achieving stable solutions; the EFP approach in [111] exploits an incremental version [33] of the Levenberg–Marquardt algorithm [79]. Usually, these approaches rely on the non-linear least squares optimization problem, minimizing the quadratic error between observed and predicted values:

$$J = J(\Phi_{nonlin}) = \sum_{k=1}^{N}\left(y_k - \sum_{i=1}^{C} l_i(\vec{x}_k)\,\Psi_i(\Phi_{nonlin})(\vec{x}_k)\right)^2 \tag{46}$$

with $\Phi_{nonlin}$ the set of non-linear parameters in the rule antecedent parts and $\Psi_i$ the normalized basis functions

$$\Psi_i(\vec{x}) = \frac{\mu_i(\vec{x})}{\sum_{k=1}^{C}\mu_k(\vec{x})}$$

with $\mu_i(\vec{x}) = T_{j=1}^{p}\,\mu_{i,j}(x_j)$ the membership degree of rule $i$ to the actual sample $\vec{x}$. $C$ denotes the number of rules, $N$ the number of data samples and $l_i$ the consequents. Minimizing the non-linear least squares error measure in (46) fits the non-linear parameters as closely as possible to the data stream samples, thus yielding a precise model. A punishment term on model complexity is often a good option for preventing over-fitting, see [63] (last section of Chapter 2). In order to address $\epsilon$-completeness of fuzzy partitions, the idea is to relax this hard fitting criterion by an additional term which punishes those partitions in which fuzzy sets have a low degree of overlap. Aggregating this on rule level, we suggest including the following term in the optimization problem (46):

$$Pun = \sum_{k=1}^{N}\frac{\prod_{i=1}^{C}\max(0,\ \epsilon - \mu_i(\vec{x}_k))}{\epsilon^{C}} \tag{47}$$

If $\mu_i$ is greater than $\epsilon$, sufficient coverage in all antecedent parts is achieved, and thus the punishment term becomes 0. If $\mu_i(\vec{x})$ is close to 0 for all rules, the sample is not sufficiently covered and (47) approaches 1 due to the normalization term $\epsilon^{C}$. Assuming target values $y$ normalized to $[0,1]$, both terms in (46) and (47) take the same ranges and are thus directly combinable; the combined optimization criterion becomes:

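$$J_{ext} = \alpha\left(\sum_{k=1}^{N}\left(y_k - \sum_{i=1}^{C} l_i(\vec{x}_k)\,\Psi_i(\Phi_{nonlin})(\vec{x}_k)\right)^2\right) + (1-\alpha)\left(\sum_{k=1}^{N}\frac{\prod_{i=1}^{C}\max(0,\ \epsilon - \mu_i(\vec{x}_k))}{\epsilon^{C}}\right) \tag{48}$$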

with $\alpha$ denoting the importance level of precision versus completeness, which allows accuracy and interpretability to be traded off in a continuous manner. By including the derivatives w.r.t. all non-linear parameters in the Jacobian, the incremental Levenberg–Marquardt update formulas can be applied in the same manner as in [111]. Note that the first term is differentiable as a quadratic parabola function (for derivatives refer to [63], Section 2.7, Chapter 2); owing to the maximum operator, the derivative of the second term with respect to a non-linear parameter $\Phi_n$ in the $n$th rule becomes a two-way function (proof left to the reader):

$$\frac{\partial Pun}{\partial \Phi_n} = \sum_{k=1}^{N}\begin{cases} -\dfrac{1}{\epsilon^{C}}\,\dfrac{\partial \mu_n(\vec{x}_k)}{\partial \Phi_n}\displaystyle\prod_{i=1,\,i\neq n}^{C}\max(0,\ \epsilon - \mu_i(\vec{x}_k)) & \text{if } \epsilon - \mu_n(\vec{x}_k) > 0\\[4pt] 0 & \text{else}\end{cases} \tag{49}$$

Compared to the heuristic approach and the achievements in state-of-the-art approaches (as mentioned above), the advantage of this completeness concept is that it is fully integrated into the learning process and thus prevents significant re-adjustments of spreads, which may deteriorate the precision of the model. On the other hand, $\epsilon$-completeness of fuzzy partitions and rules is approached, but not necessarily fully assured (as the LS error is minimized simultaneously).
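The following Python sketch evaluates the punishment term and the combined criterion for a given rule base. It is an illustration under stated assumptions rather than the method of any cited approach: rules are assumed to use Gaussian fuzzy sets combined by the product t-norm, the per-sample product is normalized by $\epsilon^{C}$ so that a completely uncovered sample contributes 1 (as the text states), and the names `rule_membership`, `punishment` and `extended_objective` are hypothetical.

```python
import math


def rule_membership(x, centers, sigmas):
    """Product t-norm over Gaussian fuzzy sets, one per input dimension."""
    mu = 1.0
    for xj, c, s in zip(x, centers, sigmas):
        mu *= math.exp(-0.5 * ((xj - c) / s) ** 2)
    return mu


def punishment(samples, rules, eps):
    """Sum over samples of prod_i max(0, eps - mu_i(x)) / eps**C."""
    n_rules = len(rules)
    total = 0.0
    for x in samples:
        prod = 1.0
        for centers, sigmas in rules:
            prod *= max(0.0, eps - rule_membership(x, centers, sigmas))
        total += prod / eps ** n_rules
    return total


def extended_objective(sq_error, pun, alpha):
    """Convex combination of squared error and punishment term."""
    return alpha * sq_error + (1.0 - alpha) * pun
```

A sample sitting at a rule center yields a zero factor and hence no punishment, whereas a sample far from every rule contributes a value close to 1.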


6. Feature importance levels

Feature importance levels defined through feature weights are relevant to interpretability from two viewpoints:

1. They may indicate the importance of features for the final model output, either in a global or even in a local sense, making the features and their influence interpretable for the process.
2. They may also lead to a reduction of the rule lengths (see also Section 3.2), as features with weights smaller than $\epsilon$ have a very low impact on the final model output and can therefore be eliminated from the rule antecedent and consequent parts when showing the rule base to experts/operators.

Feature weights are not only useful in terms of interpretability; they also help to smoothly mitigate the curse of dimensionality. Especially for models including localization components (for obtaining exact and robust partial localized representations), such as fuzzy systems or neural networks, it is well known that the curse of dimensionality is very severe when a high number of variables are used as model inputs [89]. This is basically because in high-dimensional spaces one can no longer speak of locality (on which these types of models rely), as all samples move to the border of the joint feature space; see for instance [37] (Chapter 1) for a well-founded analysis of this problem. As feature weights may approach 0 over time, but do not necessarily become strictly 0, one may speak of a soft dimension reduction concept [65]. Furthermore, in [17] it is emphasized that a reduced feature sub-space is necessary and sufficient to generate human-understandable fuzzy sets; thus feature out-weighting also helps to improve the understandability of fuzzy partitions. Within a soft dimension reduction concept, all features are still present, but those with low weights influence neither the model predictions nor the learning process [65].
In an incremental learning context, some out-weighted features may become important again after some time. It is therefore beneficial to prepare a reduced model for visualization to the operator, and to continue updating the equivalent full model, with all features included, on new incoming data samples/blocks; cutting out features or adjoining features on demand would lead to discontinuities in the learning process [63], as parameters were learnt on different feature spaces before (on older data stream samples). An approach addressing incremental feature weighting strategies within the context of evolving fuzzy classifiers is presented in [65] (applicable to original features or to features transformed through PCA, PLS, etc.). It operates on a global basis, hence features are seen as either important or unimportant for the whole model, and it is focussed on dynamic classification problems. The basic idea is that feature weights $\lambda_1, \ldots, \lambda_p$ for the $p$ features included in the learning problem are calculated based on a stable separability criterion [25]:

$$J = \operatorname{trace}(S_w^{-1} S_b) \tag{50}$$

with $S_w$ the within scatter matrix, modeled by the sum of the covariance matrices for each class, and $S_b$ the between scatter matrix, modeled by the sum of the degree of mean shift between classes. The criterion in (50) is applied (1) dimension-wise, to see the impact of each feature separately, and (2) to the remaining $(p-1)$-dimensional feature subspace, in order to obtain the quality of separation when excluding each feature: a low loss in quality means that the feature is not important, as the quality stays similar with the remaining features when it is excluded. In both cases, $p$ criteria $J_1, \ldots, J_p$ according to (50) are obtained. For normalization to $[0,1]$, the feature weights are finally defined by:

$$\lambda_j = 1 - \frac{J_j - \min_{j=1,\ldots,p}(J_j)}{\max_{j=1,\ldots,p}(J_j) - \min_{j=1,\ldots,p}(J_j)} \tag{51}$$

hence the feature with the weakest discriminatory power (and therefore maximal $J_j$) is assigned a weight value of 0 and the feature with the strongest discriminatory power a weight of 1. To be applicable in on-line learning environments, updating the weights in incremental mode is achieved by updating the within-scatter and between-scatter matrices. Both matrices can be incrementally updated: the between class scatter matrix by updating the mean of the classes, the within class scatter matrix by using the recursive covariance matrix formula [40]. The integration of weights depends on the classifier architecture and on the learning engine; in the prediction phase, all membership functions belonging to features with low weights are down-weighted in the inference scheme (e.g. by a weighted Euclidean distance). In [65], a more stable learning behavior with higher accumulated one-step-ahead prediction accuracy could be achieved. Another attempt at automatic feature weighting and reduction is demonstrated in [10] for on-line regression problems, exploiting the Takagi–Sugeno fuzzy systems architecture with local linear consequent functions. The basic idea is to track the relative contribution of each feature, compared with the contributions of all features, to the model output over time. The tracking can be achieved by summing up the consequent function weights of each feature $j$ for each rule $i$ separately over time [10]:

$$contrib_{ij} = \sum_{k=1}^{N}|w_{ij}(k)|, \qquad contrib_i = \sum_{j=1}^{p}\sum_{k=1}^{N}|w_{ij}(k)| \tag{52}$$


with $N$ the number of data samples seen so far. If the sum of the weights turns out to be negligibly small compared to the sum of the weights over all features (second formula in (52)), then the $j$th feature is removed from the corresponding rule. The condition whether to use the most influential rule or the sum of contributions of all features depends on the dimensionality of the learning problem [7]. Thus, different rules may end up with different dimensionality, allowing maximal flexibility. In [90], on-line feature selection is conducted based on the expected statistical contribution of the feature to the final model output over an infinite number of samples. The derivation leads to complex integrals which can be simplified for fast processing. A reactivation of features, which may turn out to be important at a later stage of the training process, is possible neither in [10] nor in [90]; this remains an open problem in EFS.
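Both weighting schemes of this section reduce to a few lines of bookkeeping, sketched below in Python. The sketch makes several assumptions that are not taken from [65] or [10]: the separability values $J_j$ are supplied by the caller (as in the leave-one-feature-out variant, where a large $J_j$ marks an unimportant feature), all weights are set to 1 when every $J_j$ is equal, and the relative threshold of 0.01 for a "negligibly small" contribution is purely illustrative. The names `feature_weights` and `ContributionTracker` are hypothetical.

```python
def feature_weights(J):
    """Normalize separability values J_1..J_p as in (51); largest J maps to 0."""
    lo, hi = min(J), max(J)
    if hi == lo:
        return [1.0] * len(J)  # assumption: no feature can be singled out
    return [1.0 - (Jj - lo) / (hi - lo) for Jj in J]


class ContributionTracker:
    """Accumulates |w_ij(k)| per rule i and feature j over the stream, as in (52)."""

    def __init__(self, n_rules, n_features):
        self.contrib = [[0.0] * n_features for _ in range(n_rules)]

    def update(self, W):
        """W[i][j]: current consequent weight of feature j in rule i."""
        for i, row in enumerate(W):
            for j, w in enumerate(row):
                self.contrib[i][j] += abs(w)

    def negligible(self, i, threshold=0.01):
        """Features of rule i whose share of the rule's total is below threshold."""
        total = sum(self.contrib[i])
        if total == 0.0:
            return []
        return [j for j, c in enumerate(self.contrib[i]) if c / total < threshold]
```

Because the tracker only accumulates sums, it needs a single pass over the stream and constant memory per rule and feature.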

7. Rule importance levels

Rule importance levels defined by rule weights lying in $[0,1]$ are controversially discussed in the fuzzy logic community. On the one hand, rule weights are viewed critically, as they constitute an unnecessary additional component that makes rule bases more complex to read and understand. In [83,82] it is shown analytically that rule weights can always be equivalently replaced by appropriate modifications of fuzzy membership functions, leading to the same model input–output behavior; thus they cannot account for higher accuracy. On the other hand, rule weights may serve as important cornerstones for a smooth rule reduction during learning procedures, as rules with low weights can be seen as unimportant and may be pruned or even re-activated at a later stage in an on-line learning process. This strategy may be beneficial when starting with an expert-based system, where originally all rules are fully interpretable (as designed by experts/users), but some may turn out to be superfluous over time for the modeling problem at hand. Furthermore, rule weights can be used to handle inconsistent rules in a rule base, see e.g. [87,20], thus offering another possibility to tackle the problem of consistency (compare with Section 4). The usage of rule weights and their updates during incremental learning phases has, to the best of our knowledge, not been studied so far in the evolving fuzzy community. Integrating the weights as multiplication factors of the rule membership degrees $\mu$ into the fuzzy model's inference scheme would yield the following functional form (in case of $C$ rules):

$$y_{fuz} = \sum_{i=1}^{C} l_i(\vec{x})\,\Psi_i(\vec{x};\rho), \qquad \Psi_i(\vec{x};\rho) = \frac{\rho_i\,\mu_i(\vec{x})}{\sum_{k=1}^{C}\rho_k\,\mu_k(\vec{x})} \tag{53}$$

with $\mu_i(\vec{x}) = T_{j=1}^{p}\,\mu_{i,j}(x_j)$ the membership degree of rule $i$ to the actual sample $\vec{x}$ and $l_i$ the consequents. In this sense, the rule weights appear as non-linear parameters, which may be optimized within incremental procedures such as recursive gradient descent (RGD) [85], recursive Levenberg–Marquardt (RLM) [111] or recursive Gauss–Newton as applied in [44], according to the least-squares error problem or using the punished measure as in (48). In this case, the punishment term in (47) can be extended by including $\rho_i\mu_i$ instead of $\mu_i$. Both algorithms, RGD and RLM, require the first derivative of the optimization functional w.r.t. the rule weights; in case of the least squares problem, this can be calculated as follows (independently of the type of fuzzy sets used):

$$\frac{\partial y_{fuz}}{\partial \rho_j} = \sum_{i=1}^{C}\sum_{m=1}^{N} e(\vec{x}_m)\,l_i(\vec{x}_m)\,\frac{\partial \Psi_i(\vec{x}_m)}{\partial \rho_j} \tag{54}$$

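with $l_i(\vec{x}_m)$ the linear rule consequent functions in sample $\vec{x}_m$, and $e(\vec{x}_m)$ the residual in the $p$-dimensional sample $\vec{x}_m$ defined by:

$$e(\vec{x}_m) = y_m - \sum_{i=1}^{C} l_i(\vec{x}_m)\,\Psi_i(\vec{x}_m;\rho) \tag{55}$$

and $\frac{\partial \Psi_i(\vec{x})}{\partial \rho_j}$ the partial derivative of the $i$th basis function (normalized membership function) $\Psi_i(\vec{x};\rho)$ with respect to $\rho_j$:

$$\frac{\partial \Psi_i(\vec{x})}{\partial \rho_j} = \frac{\mu_i}{\sum_{k=1}^{C}\rho_k\,\mu_k(\vec{x})}\left(\delta_{i,j} - \frac{\rho_i\,\mu_j}{\sum_{k=1}^{C}\rho_k\,\mu_k(\vec{x})}\right), \qquad \delta_{i,j} = \begin{cases}1 & \text{if } i = j\\ 0 & \text{if } i \neq j\end{cases} \tag{56}$$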
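The weighted basis functions in (53) and their partial derivatives in (56) can be checked numerically with a short Python sketch. It assumes that the rule membership degrees $\mu_i(\vec{x})$ have already been computed for one sample and that $\sum_k \rho_k\mu_k(\vec{x}) > 0$; the function names `weighted_basis` and `dpsi_drho` are illustrative.

```python
def weighted_basis(mu, rho):
    """Psi_i = rho_i * mu_i / sum_k rho_k * mu_k for all rules i."""
    s = sum(r * m for r, m in zip(rho, mu))
    return [r * m / s for r, m in zip(rho, mu)]


def dpsi_drho(mu, rho, i, j):
    """Partial derivative of Psi_i with respect to the rule weight rho_j."""
    s = sum(r * m for r, m in zip(rho, mu))
    delta = 1.0 if i == j else 0.0
    return (mu[i] / s) * (delta - rho[i] * mu[j] / s)


def numeric_dpsi_drho(mu, rho, i, j, h=1e-6):
    """Central finite difference, used only to cross-check the analytic form."""
    up = list(rho)
    down = list(rho)
    up[j] += h
    down[j] -= h
    return (weighted_basis(mu, up)[i] - weighted_basis(mu, down)[i]) / (2.0 * h)
```

Comparing `dpsi_drho` against `numeric_dpsi_drho` for a few weight vectors is a cheap safeguard before plugging the derivative into a recursive gradient-based update.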

Within an on-line modeling context, the interpretation of updating rule weights is the following: some rules which seem to be important at the stage when they are evolved may turn out to be less important at a later stage, without necessarily overlapping or touching other rules. Current approaches such as simple TS [4], eTS+ [10], GENFIS [90] or SAFIS [94] tackle this issue by simply deleting such rules from the rule base (based on criteria like utility, age or statistical influence) in order to reduce complexity and to improve transparency as well as computation time (see also Section 3.2). However, such rules may become important again at a later stage. Updating the rule weights may help here to find a smooth transition between total out-weighting and full activation of such rules; in fact, some rules may be only slightly important and should therefore contribute a little to the final model output (e.g. with weight 0.3). Moreover, compared to the state-of-the-art approaches mentioned above, the importance of rules can be directly linked to the approximation quality of the evolved fuzzy system as part of the error minimization problem (46) or (48): usually, rules with low accuracy will be automatically down-weighted. This direct link 'rule importance ↔ system error' is also taken into account in the recently published approach by Leng et al. [58], which, however, requires re-calculating model errors (with each rule excluded) in several iterations over training samples. Rules with low weights may be ignored in an interpretation stage when the expert/user inspects the fuzzy system, provided that they also have a low contribution level to the final model outputs (predictions) on the stream samples seen so far. The connection with a low contribution is necessary, as hidden rules (hidden due to low weights) could have a reduced effect that is nevertheless perceived by an expert in the field where the fuzzy system is used. The relative contribution level of a rule $R_i$ to the model output over $N$ past data samples can be calculated as:

$$contrib_i = \frac{1}{N}\sum_{k=1}^{N}\Psi_i(\vec{x}_k) \tag{57}$$

Thus, when showing the fuzzy system to the user, those rules are ignored for which $contrib_i < \epsilon \wedge \rho_i < \epsilon$. An extension of (57) would be the application of expected rule significance levels based on approximations of statistical contributions when $N$ goes to infinity, as used for instance in [90,91].

8. Interpretation of rule consequents

In case of Mamdani fuzzy systems, the consequents are fuzzy sets forming a fuzzy partition for the output/target variable; thus, the interpretation of consequents follows the same criteria, together with the possible approaches to tackle them, as discussed above (distinguishability, simplicity, $\epsilon$-completeness, etc.). In case of single model fuzzy classifiers, the standard form of a rule [51,63] includes a single class label as consequent, whereas the extended form [14] contains the confidence in each class, as defined in (36). Such consequents are interpretable per se (as they directly store the class label information) and are used in an evolving modeling context in [12,14]. Multi model classifiers either combine single model classifiers within an all-pairs strategy [70] (thus, the consequents of each classifier are again interpretable per se) or Takagi–Sugeno (TS) type fuzzy systems based on the regression by indicator philosophy [37] coupled with one-versus-rest classification [12]. Hence, we further investigate the consequents of Takagi–Sugeno type fuzzy systems [107], which are singletons, hyper-planes (as defined in (26)), higher order polynomials or, as recently introduced in [48], linear combinations of kernel functions. The interpretation of singletons is quite clear, as they represent single values (usually of process variables), whose meanings the experts are usually aware of. Higher order polynomials are rarely used in the fuzzy community, and the interpretation of mixture models with kernels appearing in the consequents is almost impossible from a linguistic point of view.
Thus, we focus on the most convenient and widely used hyper-planes. Indeed, in the literature, see for instance [15,30,116], the hyper-plane type consequents as defined in (26) are often seen as completely non-interpretable and are therefore excluded from any linguistic modeling point of view. However, in practical applications they may offer some important insights regarding several topics:

• Trend analysis in certain local regions: this may be important for on-line control applications to gain knowledge about the control behavior of the signal content pointing to different process states: a constant or rapidly changing behavior can be recognized simply by inspecting the consequent parameters in the corresponding local regions. For a two-dimensional example, see Fig. 9, where the consequent functions representing the partial local trends are shown as straight lines.

Fig. 9. Trend analysis (as one interpretation option) of a two-dimensional fitted curve (solid line) by partial local linear models snuggling along the real functional approximation (dotted straight lines).


• Feature importance levels: as already discussed in Section 6, a kind of local feature weighting effect can be achieved by interpreting the local gradients as sensitivity factors. The partial local influences of variables can be summed up over all rules to yield a global influence on the whole model, according to (52).
• A physical interpretation of the consequent parameters can be achieved when providing a rule centering approach [13]. In this way, the consequent part becomes:

$$l_i(\vec{x}) = \sum_{k=1}^{p} v_{i,k}\,(x_k - c_{i,k}) + v_{i,0} \tag{58}$$

with $c_{i,k}$ the $k$th coordinate of rule center $\vec{c}_i$. Comparing with (26), we obviously obtain:

$$v_{i,k} = w_{i,k} \quad \forall k > 0 \tag{59}$$

and

$$v_{i,0} = w_{i,0} + \sum_{k=1}^{p} v_{i,k}\,c_{i,k} \tag{60}$$

thus the parameters for the rule-centered consequent functions can be directly deduced from the evolved original ones. The essential point is now that the rule-centered form of the generalized Takagi–Sugeno approximator can be written as:

$$l_i(\vec{x}) = v_{i,0} + v_{i,k}(x_k - c_{i,k}) + \frac{1}{2!}(x_k - c_{i,k})^T V_i\,(x_k - c_{i,k}) + \frac{1}{3!}W_i^3(x_k - c_{i,k},\ x_k - c_{i,k},\ x_k - c_{i,k}) + \cdots \tag{61}$$

with $v_{i,0}$ a scalar, $v_{i,k}$ a column vector, $V_i$ a symmetric matrix and $W_i^3$ a symmetric multi-linear form in three vector arguments. Eq. (61) can be interpreted as a Taylor series expansion, with (58) an approximate first order form [39]. A prerequisite for all the interpretation capabilities listed above is well-organized hyper-planes, providing partial functional tendencies of the full non-linear functional approximation surface; otherwise, they may point in any direction. The data-driven technique which is able to fulfill this requirement is the so-called local learning concept, which estimates the consequent function of each rule separately: for the batch modeling case, the favorable behavior of local learning of consequent functions was verified in [115]; for the incremental evolving case, a deeper investigation of this issue was conducted in [63]. In [12], local learning was compared with global learning (learning all consequent parameters in one sweep) within the context of multi-model classifiers: it turned out that local learning not only extracted consequent functions with higher interpretability, but also yielded more robust fuzzy models in terms of accuracy and numerical stability during training, especially in case of a higher number of inputs $p$. In view of these investigations, it is recommended to regard the older global approach as more or less obsolete. Local learning conducts a partial linear regression (for each local part/rule) by employing a weighted least squares optimization criterion for each rule $i = 1, \ldots, C$ separately:

$$J_i = \sum_{k=1}^{N}\Psi_i(\vec{x}(k))\,e_i^2(k) \rightarrow \min_{\vec{w}_i}, \qquad i = 1, \ldots, C \tag{62}$$

where $e_i(k) = y(k) - \hat{y}_i(k)$ represents the error of the local linear model in the $k$th sample (out of $N$ samples in total). The weights $\Psi_i$ represent the membership degrees to the corresponding rules, thus achieving the localization effect in the linear regression. In the batch case, its regularized stable solution becomes [36]:

$$\hat{\vec{w}}_i = (R_i^T Q_i R_i + \alpha_i I)^{-1} R_i^T Q_i\,\vec{y} \tag{63}$$

with $R_i$ the regression matrix containing all data samples plus a column of ones for the intercept, $Q_i = \operatorname{diag}(\Psi_i(\vec{x}_1), \Psi_i(\vec{x}_2), \ldots, \Psi_i(\vec{x}_N))$ and $\alpha_i$ a regularization parameter. In the incremental case, the Recursive Fuzzily Weighted Least Squares (RFWLS) estimator can be deduced by integrating the fuzzy weights $\Psi$ in the update of the inverse Hessian matrix [11]:

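$$\hat{\vec{w}}_i(k+1) = \hat{\vec{w}}_i(k) + \gamma(k)\left(y(k+1) - \vec{r}^T(k+1)\,\hat{\vec{w}}_i(k)\right) \tag{64}$$

$$\gamma(k) = P_i(k+1)\,\vec{r}(k+1) = \frac{P_i(k)\,\vec{r}(k+1)}{\dfrac{1}{\Psi_i(\vec{x}(k+1))} + \vec{r}^T(k+1)\,P_i(k)\,\vec{r}(k+1)} \tag{65}$$

$$P_i(k+1) = \left(I - \gamma(k)\,\vec{r}^T(k+1)\right)P_i(k) \tag{66}$$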
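One RFWLS step per rule, following (64)–(66), can be sketched in plain Python as shown below. The sketch is illustrative rather than a reference implementation: matrices are plain nested lists, the regressor is built as $[1, x_1, \ldots, x_p]$, and skipping the update when the rule's basis function value is not positive is an added assumption (the update formula divides by it). The function name `rfwls_update` and the large initial diagonal of `P` in the usage example are hypothetical choices.

```python
def rfwls_update(w, P, x, y, psi):
    """One recursive fuzzily weighted least squares step for a single rule.

    w: consequent parameters [w_0, ..., w_p]; P: inverse weighted Hessian;
    x: input sample; y: target; psi: basis function value of the rule at x.
    """
    if psi <= 0.0:
        return w, P  # assumption: a rule that does not fire is left untouched
    r = [1.0] + list(x)
    n = len(r)
    Pr = [sum(P[i][j] * r[j] for j in range(n)) for i in range(n)]
    denom = 1.0 / psi + sum(r[i] * Pr[i] for i in range(n))
    gamma = [v / denom for v in Pr]                       # gain vector, (65)
    residual = y - sum(r[i] * w[i] for i in range(n))
    w_new = [w[i] + gamma[i] * residual for i in range(n)]  # (64)
    rP = [sum(r[i] * P[i][j] for i in range(n)) for j in range(n)]
    P_new = [[P[i][j] - gamma[i] * rP[j] for j in range(n)]
             for i in range(n)]                            # (66)
    return w_new, P_new


# Usage: recover y = 2 + 3x from a short stream with varying rule activation.
w, P = [0.0, 0.0], [[1e6, 0.0], [0.0, 1e6]]
for x_val, psi in [(0.0, 1.0), (1.0, 0.5), (2.0, 0.8), (3.0, 0.2)]:
    w, P = rfwls_update(w, P, [x_val], 2.0 + 3.0 * x_val, psi)
```

Each step costs $O(p^2)$ operations and needs no access to past samples, which is what makes the estimator suitable for single-pass stream processing.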

Here, $P_i(k) = (R_i(k)^T Q_i(k) R_i(k))^{-1}$ is the inverse weighted Hessian matrix and $\vec{r}(k+1) = [1\ x_1(k+1)\ x_2(k+1)\ \ldots\ x_p(k+1)]^T$ contains the regressor values of the $(k+1)$th data sample. RFWLS converges within each incremental learning step, as it follows one step in a Gauss–Newton optimization procedure and the function to be optimized in (62) is a hyper-parabola, for which Gauss–Newton converges within a single step [84,62]. Extensions of the RFWLS approach include the integration of a weight decay term (as suggested in [90]) and/or a forgetting factor $\lambda_i$, as conducted in [68], for increasing the flexibility in learning, especially in case of drifts: (66) is multiplied by $1/\lambda_i$, and the denominator in (65) is changed to $\frac{\lambda_i}{\Psi_i(\vec{x}(k+1))} + \vec{r}^T(k+1)\,P_i(k)\,\vec{r}(k+1)$.

8.1. Incremental smoothing

Despite enforcing the local interpretability of consequent functions, there may still be problems in case of over-fitting, especially whenever the noise level in the data is high. Then, the consequent functions may follow the trend of the noise rather than the basic tendency of the functional dependence between the variables: an example is shown in Fig. 10(a), where the real dependency between input and target is over-fitted in the lower region of the input space. A strategy for ensuring smooth consequent functions is demonstrated in [97], where a kind of incremental regularization is conducted after each learning step. This is accomplished by correcting the consequent functions of the updated rule(s) by a template $T$ measuring the violation degree subject to a meta-level property on the neighboring rules, i.e. the consequent vector of an updated rule $i$ is re-set by:

~i w

~i þ aðw ~i  T i Þ ð1  aÞw

ð67Þ

with $\alpha \in [0,1]$ the adjustment rate, steering the tradeoff between inclusion of the meta-level property and keeping the original hyper-planes. The template $T$ is designed by a weighted averaging of the consequent parameters of the neighboring rules:

~i  Ti ¼ w

X 1 ~i  w ~j ÞmðjÞ ðw ð2r þ 1Þp j2U

ð68Þ

with $m(j)$ the mask used as weights of the neighboring rules. This may be set to the rule importance levels discussed in Section 7. $r$ denotes the neighbor radius: all rules whose centers are closer than $r$ to the updated rule $i$ w.r.t. the $L_1$ distance are denoted as neighbors and used in the sum above. The effect of the smoothing strategy is presented in Fig. 10(b). Finally, it should be highlighted that this strategy assures smoother consequent hyper-planes over nearby lying or even touching rules and thus increases the likelihood of further rule reduction through the simplicity assurance concepts discussed in Section 3.2. This is because the angle between two adjacent rules is increased towards 180 degrees during smoothing.

9. Knowledge expansion

At the beginning of the past decade, when the first data-driven evolving fuzzy systems were designed, the request for automatic knowledge expansion methodologies was strongly motivated from the viewpoint of accuracy. In particular, newly arising system states or on-line operation modes have to be integrated into the model on-the-fly in order to guarantee sufficient coverage of the feature space and to reliably predict new query points falling into the new region. Thus, almost all evolving fuzzy systems approaches possess an evolving structure, i.e. their learning procedures provide methodologies for adding new rules on demand (also termed 'rule evolution'). The criteria on which the evolution is based are specific to the particular method; see [63] for a survey, comparison and analysis, and see further some recently published EFS approaches as cited throughout this position paper (e.g., [44,48,53,58,77,70,91,105]).
From the viewpoint of interpretability, knowledge expansion increases the model complexity; however, this complexity can be seen as a necessary complexity (due to the arguments above), as opposed to the unnecessary complexity which is eliminated by the concepts described within the context of distinguishability and simplicity (Section 3.1). On the other hand, it may also play an important role in terms of interpretability improvement, as without it no new rule would "describe" the new, expanded situation, and thus ignorance [67] may take place. Interpretability and the quality of output predictions may even decrease, as

1. New query points falling in an uncovered region cannot be sufficiently assigned to a linguistic fuzzy set term, thus completeness or even coverage is no longer fulfilled.
2. The uncertainty of predictions on new query points falling in an uncovered region, i.e. the level of ignorance, will increase [67,70].

Fig. 10. (a) Over-fitted relation in the lower region of the input feature space and (b) smoothed approximation.

Some of the EFS approaches include rule initialization strategies designed to fulfill $\epsilon$-completeness (small overlap with the nearest rules and fuzzy sets); most of them assign the center of a new rule either to the newest data sample (triggering the rule evolution) or to the center of a new density area. Once a new rule (the $(C+1)$th) is evolved and integrated into the system, the same interpretability concepts as discussed above can be applied to it, replacing $C$ by $C + 1$ in all former appearances of the number of rules. Another aspect of knowledge expansion, which is different from rule evolution, is the splitting of a rule into two rules. This may be beneficial in case of rules with large spans along one or more directions; a split may, however, also be beneficial from the viewpoint of interpretability, for representing more reliable components and avoiding mixed information encoded in one component. For instance, consider a large rule containing two density areas, each one representing a different class: the interpretation of such a rule is not clear and the uncertainty of its predictions is high. Splitting it into two rules according to the two density areas may give a clearer picture of the class distribution, thus also improving interpretability. In case of fuzzy set splitting, situations as demonstrated in Fig. 5(b) can be avoided (fuzzy set B1 fully covering fuzzy set B2). On the other hand, fuzzy set splitting would increase rule base complexity.
In this sense, interpretability in terms of rule base compactness is decreased, while interpretability in terms of fuzzy partition transparency is increased. A possibility for appropriately resolving this situation would be to include the user’s preference whether he/she may opt more for a compact rule base with low complexity or more for a better transparency in the fuzzy partitions. In current EFS approaches, no splitting options are provided, thus we see this as one important future challenge to be handled in order to increase both, interpretability and accuracy of the fuzzy models. Specific care has to be taken when to split and how to split in order to keep a good tradeoff between fuzzy set distinguishability, rule expressiveness and rule base compactness (complexity).

10. Conclusion

In this position paper, we have discussed achievements, new concepts and open issues regarding interpretability in evolving fuzzy systems. Our investigations are based on various criteria, most of which are established facets of conventional data-driven fuzzy systems: distinguishability, simplicity, consistency, coverage, completeness, feature importance levels, rule importance levels, interpretation of consequents, and finally the interpretability aspect of knowledge expansion. As evolving systems are updated continuously and change with new incoming data, incrementality, single-pass capability and fast computation of these criteria are essential to guarantee their applicability during model updates.

Regarding distinguishability and simplicity, many achievements have been made in the past, which are described and referenced in this paper. Additionally, a general approach for obtaining more simplicity in classification as well as regression problems has been presented, which can be used independently of the type of fuzzy sets and multi-dimensional rule kernels chosen. Feature importance levels have been addressed in the literature: a complete approach exists for classification problems, whereas for regression problems some further developments are open. Knowledge expansion is addressed in all common EFS approaches through criteria for evolving new rules on demand. Since the issues of consistency, completeness and smooth rule importance levels have so far been addressed only rudimentarily in the literature, further investigations and concrete concepts have been provided in this paper.

Some open issues that require further treatment in the context of EFS:
• Splitting of (large span) rules and fuzzy sets for increasing the expressiveness of model components, together with an appropriate treatment of distinguishability along the splitting operations, as discussed at the end of Section 9.
• Dynamic on-line feature weighting for regression problems: currently, only a hard selection is offered, based on statistical influence and contributions in consequents, which suffers from being unable to reactivate features on demand.
• Examining the relation between model-based reliability, i.e. the certainty in model outputs and in model relations in different parts of the feature space, and interpretability.
• Concepts for approaching or assuring locality or even an addition-to-one unity of fuzzy partitions in EFS: this would help with the interpretation of active rules [13] and could be considered a valuable prerequisite for:
• Interpretation of the (general) input–output behavior of an evolved fuzzy system, especially when using fuzzy sets with infinite support (all rules may fire to a certain degree).

Finally, we emphasize that all concepts treated in this paper refer to (conventional and widely used) flat fuzzy model architectures. Future work in these directions may therefore examine different forms of hierarchical structures such as fuzzy pattern trees [104] or fuzzy regression trees [55]. Multi-model architectures, such as all-pairs [70] or one-versus-rest


(evolving) fuzzy classifiers [12] are partially covered by our considerations, as the single classifiers follow the flat model architecture investigated here. However, interpretability aspects for the overall joint model structure and output are still missing.

Acknowledgements

This work was funded by the Austrian fund for promoting scientific research (FWF, contract number I328-N23, acronym IREFS) and by the research program at the Austrian Center of Competence in Mechatronics (ACCM), which is a part of the COMET K2 program of the Austrian government. This publication reflects only the authors' views.

References

[1] W.C. Abraham, A. Robins, Memory retention – the synaptic stability versus plasticity dilemma, Trends in Neurosciences 28 (2) (2005) 73–78.
[2] R. Akerkar, P. Sajja, Knowledge-Based Systems, Jones & Bartlett Learning, Sudbury, MA, 2009.
[3] J.M. Alonso, L. Magdalena, Special issue on interpretable fuzzy systems, Information Sciences 181 (2011) 4331–4339.
[4] P. Angelov, D. Filev, SimpleTS: a simplified method for learning evolving Takagi–Sugeno fuzzy models, in: Proceedings of FUZZ-IEEE 2005, Reno, Nevada, USA, 2005, pp. 1068–1073.
[5] P. Angelov, D. Filev, N. Kasabov, Evolving Intelligent Systems – Methodology and Applications, John Wiley & Sons, New York, 2010.
[6] P. Angelov, A. Kordon, Evolving inferential sensors in the chemical process industry, in: P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, 2010, pp. 313–336.
[7] P. Angelov, X. Zhou, Evolving fuzzy-rule-based classifiers from data streams, IEEE Transactions on Fuzzy Systems 16 (6) (2008) 1462–1475.
[8] P.P. Angelov, Evolving Rule-Based Models: A Tool for Design of Flexible Adaptive Systems, Springer Verlag, Berlin Germany, 2002.
[9] P.P. Angelov, An approach for fuzzy rule-base adaptation using on-line clustering, International Journal on Approximate Reasoning 35 (3) (2004) 275–289.
[10] P.P. Angelov, Evolving Takagi–Sugeno fuzzy systems from streaming data, eTS+, in: P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, 2010, pp. 21–50.
[11] P.P. Angelov, D. Filev, An approach to online identification of Takagi–Sugeno fuzzy models, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 34 (1) (2004) 484–498.
[12] P.P. Angelov, E. Lughofer, X. Zhou, Evolving fuzzy classifiers using different model architectures, Fuzzy Sets and Systems 159 (23) (2008) 3160–3182.
[13] M. Bikdash, A highly interpretable form of Sugeno inference systems, IEEE Transactions on Fuzzy Systems 7 (6) (1999) 686–696.
[14] A. Bouchachia, Incremental induction of classification fuzzy rules, in: IEEE Workshop on Evolving and Self-Developing Intelligent Systems (ESDIS) 2009, Nashville, USA, 2009, pp. 32–39.
[15] J. Casillas, O. Cordon, F. Herrera, L. Magdalena, Interpretability Issues in Fuzzy Modeling, Springer Verlag, Berlin Heidelberg, 2003.
[16] J. Casillas, O. Cordon, M.J. Del Jesus, F. Herrera, Genetic feature selection in a fuzzy rule-based classification system learning process for high-dimensional problems, Information Sciences 136 (1–4) (2001) 135–157.
[17] G. Castellano, A.M. Fanelli, C. Mencar, A neuro-fuzzy network to generate human-understandable knowledge from data, Cognitive Systems Research 3 (2002) 125–144.
[18] C. Cernuda, Experimental Analysis on Assessing Interpretability of Fuzzy Rule-Based Systems, PhD Thesis, Universidad de Oviedo, Spain, 2010.
[19] K.-S. Chin, A. Chan, J.-B. Yang, Development of a fuzzy FMEA based product design system, The International Journal of Advanced Manufacturing Technology 36 (7–8) (2008) 633–649.
[20] J.S. Cho, D.J. Park, Novel fuzzy logic control based on weighting of partially inconsistent rules using neural network, Journal of Intelligent and Fuzzy Systems 8 (2) (2000) 99–110.
[21] M.-Y. Chow, S. Altug, H.J. Trussell, Heuristic constraints enforcement for training of and knowledge extraction from a fuzzy/neural architecture Part II: Implementation and application, IEEE Transactions on Fuzzy Systems 7 (2) (1999) 151–159.
[22] V.V. Cross, T.A. Sudkamp, Similarity and Compatibility in Fuzzy Set Theory: Assessment and Applications, Springer Physica, Heidelberg New York, 2010.
[23] D. Dovzan, V. Logar, I. Skrjanc, Solving the sales prediction problem with fuzzy evolving methods, in: WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, 2012.
[24] D. Dovzan, I. Skrjanc, Recursive clustering based on a Gustafson–Kessel algorithm, Evolving Systems 2 (1) (2011) 15–24.
[25] J.G. Dy, C.E. Brodley, Feature selection for unsupervised learning, Journal of Machine Learning Research 5 (2004) 845–889.
[26] C. Eitzinger, W. Heidl, E. Lughofer, S. Raiser, J.E. Smith, M.A. Tahir, D. Sannen, H. van Brussel, Assessment of the influence of adaptive components in trainable surface inspection systems, Machine Vision and Applications 21 (5) (2010) 613–626.
[27] J. Feng, An intelligent decision support system based on machine learning and dynamic track of psychological evaluation criterion, in: J. Kacprzyk (Ed.), Intelligent Decision and Policy Making Support Systems, Springer Heidelberg, Berlin, 2008.
[28] A. Fiordaliso, A constrained Takagi–Sugeno fuzzy system that allows for better interpretation and analysis, Fuzzy Sets and Systems 118 (2) (2001) 281–296.
[29] G.F. Franklin, J.D. Powell, A. Emami-Naeini, Feedback Control of Dynamic Systems, Pearson Higher Education, Upper Saddle River, New Jersey, 2009.
[30] M.J. Gacto, R. Alcala, F. Herrera, Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures, Information Sciences 181 (20) (2011) 4340–4360.
[31] J. Gama, Knowledge Discovery from Data Streams, Chapman & Hall/CRC, Boca Raton, Florida, 2010.
[32] J. Gama, P. Medas, G. Castillo, P. Rodrigues, Learning with drift detection, Lecture Notes in Computer Science, vol. 3171, Springer, Berlin Heidelberg, 2004, pp. 286–295.
[33] J.J. Govindhasamy, S.F. McLoone, G.W. Irwin, Second-order training of adaptive critics for online process control, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 35 (2) (2006) 381–385.
[34] C. Hametner, S. Jakubek, Local model network identification for online engine modelling, Information Sciences 220 (2013) 210–225.
[35] F.H. Hamker, RBF learning in a non-stationary environment: the stability-plasticity dilemma, in: R.J. Howlett, L.C. Jain (Eds.), Radial Basis Function Networks 1: Recent Developments in Theory and Applications, Physica Verlag, Heidelberg, New York, 2001, pp. 219–251.
[36] F.E. Harrell, Regression Modeling Strategies, Springer, New York, USA, 2001.
[37] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, second ed., Springer, New York Berlin Heidelberg, 2009.
[38] H.A. Hefny, Comments on distinguishability quantification of fuzzy sets, Information Sciences 177 (2007) 4832–4839.
[39] L.J. Herrera, H. Pomares, I. Rojas, O. Valenzuela, A. Prieto, TaSe, a Taylor series-based fuzzy system model that combines interpretability and accuracy, Fuzzy Sets and Systems 153 (3) (2005) 403–427.
[40] M. Hisada, S. Ozawa, K. Zhang, N. Kasabov, Incremental linear discriminant analysis for evolving feature spaces in multitask pattern recognition problems, Evolving Systems 1 (1) (2010) 17–27.


[41] L.O. Jimenez, D.A. Landgrebe, Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data, IEEE Transactions on Systems, Man and Cybernetics, Part C: Reviews and Applications 28 (1) (1998) 39–54.
[42] Y. Jin, Fuzzy modelling of high dimensional systems: complexity reduction and interpretability improvement, IEEE Transactions on Fuzzy Systems 8 (2) (2000) 212–221.
[43] C.F. Juang, C.T. Lin, An on-line self-constructing neural fuzzy inference network and its applications, IEEE Transactions on Fuzzy Systems 6 (1) (1998) 12–32.
[44] A. Kalhor, B.N. Araabi, C. Lucas, An online predictor model as adaptive habitually linear and transiently nonlinear model, Evolving Systems 1 (1) (2010) 29–41.
[45] N.K. Kasabov, Q. Song, DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction, IEEE Transactions on Fuzzy Systems 10 (2) (2002) 144–154.
[46] E.P. Klement, R. Mesiar, E. Pap, Triangular Norms, Kluwer Academic Publishers, Dordrecht Norwell New York London, 2000.
[47] R. Klinkenberg, Learning drifting concepts: example selection vs. example weighting, Intelligent Data Analysis 8 (3) (2004) 281–300.
[48] M. Komijani, C. Lucas, B.N. Araabi, A. Kalhor, Introducing evolving Takagi–Sugeno method based on local least squares support vector machine models, Evolving Systems 3 (2) (2012) 81–93.
[49] J. Korbicz, J.M. Koscielny, Z. Kowalczuk, W. Cholewa, Fault Diagnosis – Models. Artificial Intelligence and Applications, Springer Verlag, Berlin Heidelberg, 2004.
[50] S. Kullback, R.A. Leibler, On information and sufficiency, Annals of Mathematical Statistics 22 (1) (1951) 79–86.
[51] L. Kuncheva, Fuzzy Classifier Design, Physica-Verlag, Heidelberg, 2000.
[52] D. Leite, R. Ballini, P. Costa, F. Gomide, Evolving fuzzy granular modeling from nonstationary fuzzy data streams, Evolving Systems 3 (2) (2012) 65–79.
[53] D. Leite, P. Costa, F. Gomide, Interval approach for evolving granular system modeling, in: M. Sayed-Mouchaweh, E. Lughofer (Eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012, pp. 271–300.
[54] A. Lemos, W. Caminhas, F. Gomide, Adaptive fault detection and diagnosis using an evolving fuzzy classifier, Information Sciences 220 (2013) 64–85.
[55] A. Lemos, W. Caminhas, F. Gomide, Fuzzy evolving linear regression trees, Evolving Systems 2 (1) (2011) 1–14.
[56] G. Leng, A hybrid learning algorithm with a similarity-based pruning strategy for self-adaptive neuro-fuzzy systems, Applied Soft Computing 9 (4) (2009) 1354–1366.
[57] G. Leng, T.M. McGinnity, G. Prasad, An approach for on-line extraction of fuzzy rules using a self-organising fuzzy neural network, Fuzzy Sets and Systems 150 (2) (2005) 211–243.
[58] G. Leng, X.-J. Zeng, J.A. Keane, An improved approach of self-organising fuzzy neural network based on similarity measures, Evolving Systems 3 (1) (2012) 19–30.
[59] C.S. Leung, K.W. Wong, P.F. Sum, L.W. Chan, A pruning method for the recursive least squares algorithm, Neural Networks 14 (2) (2001) 147–174.
[60] E. Lima, M. Hell, R. Ballini, F. Gomide, Evolving fuzzy modeling using participatory learning, in: P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, 2010, pp. 67–86.
[61] Y.Y. Lin, J.-Y. Chang, C.-T. Lin, Identification and prediction of dynamic systems using an interactively recurrent self-evolving fuzzy neural network, IEEE Transactions on Neural Networks and Learning Systems 24 (2) (2013) 310–321.
[62] L. Ljung, System Identification: Theory for the User, Prentice Hall PTR, Prentice Hall Inc., Upper Saddle River, New Jersey, 1999.
[63] E. Lughofer, Evolving Fuzzy Systems – Methodologies. Advanced Concepts and Applications, Springer, Berlin Heidelberg, 2011.
[64] E. Lughofer, Human-inspired evolving machines – the next generation of evolving intelligent systems?, SMC Newsletter 36 (2011).
[65] E. Lughofer, On-line incremental feature weighting in evolving fuzzy classifiers, Fuzzy Sets and Systems 163 (1) (2011) 1–23.
[66] E. Lughofer, Flexible evolving fuzzy inference systems from data streams (FLEXFIS++), in: M. Sayed-Mouchaweh, E. Lughofer (Eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012, pp. 205–246.
[67] E. Lughofer, Single-pass active learning with conflict and ignorance, Evolving Systems 3 (4) (2012) 251–271.
[68] E. Lughofer, P. Angelov, Handling drifts and shifts in on-line data streams with evolving fuzzy systems, Applied Soft Computing 11 (2) (2011) 2057–2068.
[69] E. Lughofer, J.-L. Bouchot, A. Shaker, On-line elimination of local redundancies in evolving fuzzy systems, Evolving Systems 2 (3) (2011) 165–187.
[70] E. Lughofer, O. Buchtala, Reliable all-pairs evolving fuzzy classifiers, IEEE Transactions on Fuzzy Systems 21 (3) (2013), http://dx.doi.org/10.1109/TFUZZ.2012.2226892.
[71] E. Lughofer, C. Eitzinger, C. Guardiola, On-line quality control with flexible evolving fuzzy systems, in: M. Sayed-Mouchaweh, E. Lughofer (Eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012, pp. 375–406.
[72] E. Lughofer, E. Hüllermeier, On-line redundancy elimination in evolving fuzzy regression models using a fuzzy inclusion measure, in: Proceedings of the EUSFLAT 2011 Conference, Elsevier, Aix-Les-Bains, France, 2011, pp. 380–387.
[73] E. Lughofer, E. Hüllermeier, E.P. Klement, Improving the interpretability of data-driven evolving fuzzy systems, in: Proceedings of EUSFLAT 2005, Barcelona, Spain, 2005, pp. 28–33.
[74] E. Lughofer, J.E. Smith, P. Caleb-Solly, M.A. Tahir, C. Eitzinger, D. Sannen, M. Nuttin, Human-machine interaction issues in quality control based on online image classification, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 39 (5) (2009) 960–971.
[75] E. Lughofer, B. Trawinski, K. Trawinski, O. Kempa, T. Lasota, On employing fuzzy modeling algorithms for the valuation of residential premises, Information Sciences 181 (23) (2011) 5123–5142.
[76] J. Macias-Hernandez, P. Angelov, Applications of evolving intelligent systems to the oil and gas industry, in: P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, 2010, pp. 401–421.
[77] L. Maciel, A. Lemos, F. Gomide, R. Ballini, Evolving fuzzy systems for pricing fixed income options, Evolving Systems 3 (1) (2012) 5–18.
[78] E.H. Mamdani, Application of fuzzy logic to approximate reasoning using linguistic systems, Fuzzy Sets and Systems 26 (12) (1977) 1182–1191.
[79] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM Journal on Applied Mathematics 11 (2) (1963) 431–441.
[80] C. Mencar, G. Castellano, A.M. Fanelli, Distinguishability quantification of fuzzy sets, Information Sciences 177 (2007) 130–149.
[81] F.L. Minku, A. White, X. Yao, The impact of diversity on on-line ensemble learning in the presence of concept drift, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 730–742.
[82] D. Nauck, Adaptive rule weights in neuro-fuzzy systems, Neural Computing and Applications 9 (2000) 60–70.
[83] D. Nauck, R. Kruse, How the learning of rule weights affects the interpretability of fuzzy systems, in: Proceedings of Fuzz-IEEE '98, Anchorage, Alaska, 1998, pp. 1235–1240.
[84] O. Nelles, Nonlinear System Identification, Springer, Berlin, 2001.
[85] L.S.H. Ngia, J. Sjöberg, Efficient training of neural nets for nonlinear adaptive filtering using a recursive Levenberg–Marquardt algorithm, IEEE Transactions on Signal Processing 48 (7) (2000) 1915–1926.
[86] J. Valente De Oliveira, Towards neuro-linguistic modeling: constraints for optimization of membership functions, Fuzzy Sets and Systems 106 (1999) 357–380.
[87] N.R. Pal, K. Pal, Handling of inconsistent rules with an extended model of fuzzy reasoning, Journal of Intelligent and Fuzzy Systems 7 (1999) 55–73.
[88] W. Pedrycz, J. Berezowski, I. Jamal, A granular description of data: a study in evolvable systems, in: M. Sayed-Mouchaweh, E. Lughofer (Eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012, pp. 57–76.
[89] W. Pedrycz, F. Gomide, Fuzzy Systems Engineering: Toward Human-Centric Computing, John Wiley & Sons, Hoboken, New Jersey, 2007.
[90] M. Pratama, S.G. Anavatti, E. Lughofer, GENFIS: towards an effective localist network, IEEE Transactions on Fuzzy Systems, 2013 (in press).


[91] M. Pratama, M.J. Er, X. Li, R.J. Oentaryo, E. Lughofer, I. Arifin, Data driven modeling based on dynamic parsimonious fuzzy neural network, Neurocomputing 110 (2013) 18–28.
[92] J.V. Ramos, C. Pereira, A. Dourado, The building of interpretable systems in real-time, in: P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, 2010, pp. 127–150.
[93] A. Riid, E. Rüstern, Identification of transparent, compact, accurate and reliable linguistic fuzzy models, Information Sciences 181 (20) (2011) 4378–4393.
[94] H.-J. Rong, Sequential adaptive fuzzy inference system for function approximation problems, in: M. Sayed-Mouchaweh, E. Lughofer (Eds.), Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012.
[95] H.-J. Rong, N. Sundararajan, G.-B. Huang, P. Saratchandran, Sequential adaptive fuzzy inference system (SAFIS) for nonlinear system identification and prediction, Fuzzy Sets and Systems 157 (9) (2006) 1260–1275.
[96] H.-J. Rong, N. Sundararajan, G.-B. Huang, G.-S. Zhao, Extended sequential adaptive fuzzy inference system for classification problems, Evolving Systems 2 (2) (2011) 71–82.
[97] N. Rosemann, W. Brockmann, B. Neumann, Enforcing local properties in online learning first order TS-fuzzy systems by incremental regularization, in: Proceedings of IFSA-EUSFLAT 2009, Lisbon, Portugal, 2009, pp. 466–471.
[98] J.J. Rubio, Stability analysis for an on-line evolving neuro-fuzzy recurrent network, in: P. Angelov, D. Filev, N. Kasabov (Eds.), Evolving Intelligent Systems: Methodology and Applications, John Wiley & Sons, New York, 2010, pp. 173–199.
[99] E.H. Ruspini, A new approach to clustering, Information and Control 15 (1) (1969) 22–32.
[100] S. Saminger-Platz, R. Mesiar, D. Dubois, Aggregation operators and commuting, IEEE Transactions on Fuzzy Systems 15 (6) (2007) 1032–1045.
[101] M. Sayed-Mouchaweh, E. Lughofer, Learning in Non-Stationary Environments: Methods and Applications, Springer, New York, 2012.
[102] M. Setnes, Simplification and reduction of fuzzy rules, in: J. Casillas, O. Cordón, F. Herrera, L. Magdalena (Eds.), Interpretability Issues in Fuzzy Modeling, Springer, Berlin, 2003, pp. 278–302.
[103] M. Setnes, R. Babuska, U. Kaymak, H.R.v.N. Lemke, Similarity measures in fuzzy rule base simplification, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 28 (3) (1998) 376–386.
[104] A. Shaker, R. Senge, E. Hüllermeier, Evolving fuzzy patterns trees for binary classification on data streams, Information Sciences 220 (2012) 34–45.
[105] H. Soleimani, K. Lucas, B.N. Araabi, Recursive Gath–Geva clustering as a basis for evolving neuro-fuzzy modeling, Evolving Systems 1 (1) (2010) 59–71.
[106] B. Sridevi, R. Nadarajan, Fuzzy similarity measure for generalized fuzzy numbers, International Journal for Open Problems in Computation and Mathematics 2 (2) (2009) 240–253.
[107] T. Takagi, M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Transactions on Systems, Man and Cybernetics 15 (1) (1985) 116–132.
[108] S. Thrun, Explanation-Based Neural Network Learning: A Lifelong Learning Approach, Kluwer Academic Publishers, Boston, MA, 1996.
[109] S.W. Tung, C. Quek, C. Guan, et2fis: An evolving type-2 neural fuzzy inference system, Information Sciences 220 (2013) 124–148.
[110] E. Turban, J.E. Aronson, T.-P. Liang, Decision Support Systems and Intelligent Systems, seventh ed., Prentice Hall, Upper Saddle River, New Jersey, 2004.
[111] W. Wang, J. Vrbanek, An evolving fuzzy predictor for industrial applications, IEEE Transactions on Fuzzy Systems 16 (6) (2008) 1439–1449.
[112] B.L. Welch, The generalization of ‘students’ problem when several different population variances are involved, Biometrika 34 (1–2) (1947) 28–35.
[113] T. Wetter, Medical decision support systems, in: Lecture Notes in Computer Science, Medical Data Analysis, Springer, Berlin/Heidelberg, 2000, pp. 458–466.
[114] R.R. Yager, A model of participatory learning, IEEE Transactions on Systems, Man and Cybernetics 20 (5) (1990) 1229–1234.
[115] J. Yen, L. Wang, C.W. Gillespie, Improving the interpretability of TSK fuzzy models by combining global learning and local learning, IEEE Transactions on Fuzzy Systems 6 (4) (1998) 530–537.
[116] S.M. Zhou, J.Q. Gan, Low-level interpretability and high-level interpretability: a unified view of data-driven interpretable fuzzy systems modelling, Fuzzy Sets and Systems 159 (23) (2008) 3091–3131.