Neurocomputing 47 (2002) 1–20
www.elsevier.com/locate/neucom
Rule extraction from local cluster neural nets

Robert Andrews*, Shlomo Geva

Faculty of Information Technology, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia
Abstract

This paper describes RULEX, a technique for providing an explanation component for local cluster (LC) neural networks. RULEX extracts symbolic rules from the weights of a trained LC net. LC nets are a special class of multilayer perceptrons that use sigmoid functions to generate localised functions. LC nets are well suited to both function approximation and discrete classification tasks. The restricted LC net is constrained in such a way that the local functions are 'axis parallel', thus facilitating rule extraction. This paper presents results for the LC net on a wide variety of benchmark problems and shows that RULEX produces comprehensible, accurate rules that exhibit a high degree of fidelity with the LC network from which they were extracted. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Rule extraction; Local response networks; Knowledge extraction
1. Introduction

In [8] Geva et al. describe the local cluster (LC) network, a sigmoidal perceptron with two hidden layers in which the connections are restricted so that clusters of sigmoids form local response functions similar to radial basis functions (RBFs). They give a construction and training method for LC networks and show that these networks (i) exceed the function representation capability of generalised Gaussian networks, and (ii) are suitable for discrete classification. They also describe a restricted version of the LC network and state that this version of the network is suitable for rule extraction without, however, describing how this is possible.
* Corresponding author.
E-mail addresses: [email protected] (R. Andrews), [email protected] (S. Geva).
Local function networks are attractive for rule extraction for two reasons. Firstly, it is conceptually easy to see how the weights of a local response unit can be converted to a symbolic rule. Local function units are hyper-ellipsoids in input space and can be described in terms of a reference vector that represents the centre of the hyper-ellipsoid and a set of radii that determine the effective range of the hyper-ellipsoid in each input dimension. The rule derived from the local function unit is formed by the conjunct of these effective ranges in each dimension. Rules extracted from each local function unit are thus propositional and of the form
if ∀ 1 ≤ i ≤ n: x_i ∈ [x_i^lower, x_i^upper] then pattern belongs to the target class,    (1)
where [x_i^lower, x_i^upper] represents the effective range in the ith input dimension.

Secondly, because each local function unit can be described by the conjunct of ranges of values in each input dimension, it is easy to add units to the network during training such that the added unit has a meaning that is directly related to the problem domain. In networks that employ incremental learning schemes a new unit is added when there is no significant improvement in the global error. The unit is chosen such that its reference vector, i.e., the centre of the unit, is one of the as yet unclassified points in the training set. Thus the premise of the rule that describes the new unit is the conjunction of the attribute values of the data point, with the rule consequent being the class to which the point belongs.

In recent years there has been a proliferation of methods for extracting rules from trained artificial neural networks (see [2,18] for surveys of the field). While there are many methods for extracting rules from specialised networks, the majority of techniques focus on extracting rules from MLPs. There are a small number of published techniques for extracting rules from local basis function networks. Tresp et al. [20] describe a method for extracting rules from Gaussian RBF units. Berthold and Huber [3,4] describe a method for extracting rules from a specialised local function network, the RecBF network. Abe and Lan [1] describe a recursive method for constructing hyper-boxes and extracting fuzzy rules from them. Duch et al. [7] describe a method for extraction, optimisation and application of sets of fuzzy rules from 'soft trapezoidal' membership functions which are formed using a method similar to that described in this paper.

In this paper we briefly describe the restricted LC network and introduce the RULEX algorithm for extracting symbolic rules from the weights of a trained, restricted LC neural net. The remainder of this paper is organised as follows. Section 2 describes the restricted LC net. Section 3 looks briefly at the general rule extraction task, describes the ADT taxonomy [2,18] for classifying rule extraction techniques and introduces the RULEX algorithm. Section 4 presents comparative results for the LC, nearest neighbour and C5 techniques on some benchmark problems. Section 5 presents an assessment of RULEX in terms of the rule quality criteria of the ADT taxonomy. The paper closes with Section 6, where we put our main findings into perspective.
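To make the rule format of Eq. (1) concrete, the following small sketch (our own illustration in Python with assumed names; it is not part of the LC or RULEX implementations) represents a rule as a conjunction of per-dimension intervals and shows how it classifies a pattern.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IntervalRule:
    # One (lower, upper) pair per input dimension, i.e. the antecedent of Eq. (1).
    bounds: List[Tuple[float, float]]
    target_class: int

    def covers(self, x: List[float]) -> bool:
        # The rule fires only if every antecedent interval is satisfied.
        return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, self.bounds))

# Example: a two-dimensional rule corresponding to a single local function unit.
rule = IntervalRule(bounds=[(0.2, 0.5), (0.6, 0.9)], target_class=1)
print(rule.covers([0.3, 0.7]))  # True: pattern belongs to the target class
print(rule.covers([0.3, 0.2]))  # False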
Fig. 1.
2. The restricted local cluster network

Geva et al. [8] show that a region of local response is formed by the difference of two appropriately parameterised, parallel, displaced sigmoids:

l(w, r, x) = l^+(w, r, x) − l^−(w, r, x) = σ(k_1, w^T(x − r) + 1) − σ(k_1, w^T(x − r) − 1).    (2)
The 'ridge' function l in Eq. (2) above is almost zero everywhere except in the region between the steepest parts of the two logistic sigmoid functions, where σ(k, h) = 1/(1 + e^{−kh}). The parameter r is a reference vector; the width of the ridge is given by the reciprocal of |w|; and the value of k_1 determines the shape of the ridge, which can vary from a rectangular impulse for large values of k_1 to a broad bell shape for small values of k_1 (see Fig. 1). Adding n ridge functions l with different orientations but a common centre produces a function f that peaks at the centre where the ridges intersect, but with the component ridges radiating on all sides of the centre (see Fig. 2). To make the function local these component ridges must be 'cut off' without introducing discontinuities into the derivatives of the local function (see Fig. 3). The function

f(w, r, x) = Σ_{i=1}^{n} l(w_i, r, x)    (3)
Fig. 2.
Fig. 3.
is the sum of the n ridge functions, and the function

L(w, r, x) = σ(k_2, f(w, r, x) − d)    (4)

eliminates the unwanted regions of the radiating ridge functions when d is selected to ensure that the maximum value of the function f, located at x = r, coincides
with the centre of the linear region of the output sigmoid σ. (The value of d is given by d = n(1/(1 + e^{−k_1}) − 1/(1 + e^{k_1})), where n is the input dimensionality.) The parameter k_2 determines the steepness of the output sigmoid σ. Geva et al. [8] show that a target function y*(x) can be approximated by a function y(x) which is a linear combination of m local cluster functions with centres r_j distributed over the domain of the function. The expression

y(x) = Σ_{j=1}^{m} α_j L(w_j, r_j, x)    (5)

then describes the generalised LC network, where α_j is the output weight associated with each of the individual local cluster functions L. (Network output is simply the weighted sum of the outputs of the local clusters.) In the restricted version of the network the weight matrix w is diagonal,

w_i = (0, ..., w_i, ..., 0),  i = 1, ..., n,    (6)
which simplifies the functions l and f as follows:

l(w_i, r, x) = σ(k_1, w_i(x_i − r_i) + 1) − σ(k_1, w_i(x_i − r_i) − 1),    (7)

f(w, r, x) = Σ_{i=1}^{n} l(w_i, r_i, x_i).    (8)
One further restriction is applied to the LC network in order to facilitate rule extraction, viz., the output weight α_j of each local cluster is held constant. (The maximum value of L is 0.5, hence for classification problems where the target values are {0, 1} it is appropriate to set α = 2.) This measure prevents local clusters 'overlapping' in input space, thus allowing each local cluster to be individually decompiled into a rule. The final form of the restricted LC network for binary classification tasks is given by

y(x) = Σ_{j=1}^{m} 2L(w_j, r_j, x).    (9)
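For concreteness, the following sketch (our own Python illustration with assumed parameter values; it is not the authors' implementation) computes the output of the restricted network of Eqs. (7)-(9): axis-parallel ridges are summed, the radiating arms are cut off by the output sigmoid, and the cluster outputs are combined with the fixed output weight α = 2.

import numpy as np

def sigma(k, h):
    # Logistic sigmoid sigma(k, h) = 1 / (1 + exp(-k h)).
    return 1.0 / (1.0 + np.exp(-k * h))

def ridge(x_i, r_i, w_i, k1):
    # Eq. (7): difference of two displaced sigmoids gives an axis-parallel ridge.
    return sigma(k1, w_i * (x_i - r_i) + 1.0) - sigma(k1, w_i * (x_i - r_i) - 1.0)

def local_cluster(x, r, w, k1, k2):
    # Eqs. (8) and (4): sum the n ridges, then cut off the radiating arms;
    # d places the peak of f on the linear region of the output sigmoid.
    n = len(x)
    f = sum(ridge(x[i], r[i], w[i], k1) for i in range(n))
    d = n * (1.0 / (1.0 + np.exp(-k1)) - 1.0 / (1.0 + np.exp(k1)))
    return sigma(k2, f - d)

def restricted_lc_output(x, centres, widths, k1=8.0, k2=8.0, alpha=2.0):
    # Eq. (9): fixed output weight alpha = 2 for {0, 1} classification targets.
    # The values of k1 and k2 here are illustrative assumptions only.
    return sum(alpha * local_cluster(x, r, w, k1, k2) for r, w in zip(centres, widths))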
For multiclass problems several such networks can be combined, one network per class, with the output class being the maximum of the activations of the individual networks. The LC network is trained using gradient descent on an error surface. The training equations are given in Geva et al. [8] and need not be reproduced here.

3. The general rule extraction task

Until recently neural networks were very much a 'black box' technology, i.e. a trained neural network accepts input and produces output without providing a
mechanism whereby a user of the network can explain/verify that the output is 'correct' within the province of the problem domain. Rule extraction provides such an explanation/verification mechanism by providing a symbolic link between network inputs and outputs.

The rule extraction process may be illustrated with the following simple example. A data set consisting of 2 real valued attributes plus a class label is constructed and an LC network is trained on the data, the task being to learn the classification. Fig. 4a is a plot of the data set. The horizontal axes represent input values while the vertical axis shows the target class, {0, 1}, of the data points. Fig. 4b is a contour plot of the data set. In this simple example it is clear that the data belonging to class 1 occurs in 2 distinct clusters. Fig. 5a is a plot of inputs against the outputs of the trained network. Fig. 5b is a contour plot of network outputs with the contour line set to the network classification threshold. Analysis of the trained network showed that 2 local clusters were required to learn the problem and that each local cluster covered a disjoint region of input space. From Fig. 5b it is clear that the behaviour of the network can be explained by approximating the hyper-ellipsoid decision boundaries of the local clusters with hyper-rectangular rules of the form given in Eq. (1). Fig. 6 shows the rules extracted from the local clusters. This simple and artificial problem serves to illustrate the rule extraction process in general and rule extraction from local cluster networks in particular. The example also shows how rule extraction provides an explanation/verification facility for artificial neural networks.

3.1. A taxonomy for classifying rule extraction techniques

Andrews et al. [2] describe the ADT taxonomy for describing rule extraction techniques. This taxonomy was refined in Tickle et al. [18] to better cater for the profusion of published techniques for eliciting knowledge from trained neural networks. The taxonomy consists of five primary classification criteria, viz.:
(a) the expressive power (or, alternately, the rule format) of the extracted rules;
(b) the quality of the extracted rules;
(c) the translucency of the view taken within the rule extraction technique of the underlying neural network;
(d) the complexity of the rule extraction algorithm;
(e) the portability of the rule extraction technique across various neural network architectures (i.e., the extent to which the underlying neural network incorporates specialised training regimes).

The expressive power of the rules describes the format of the extracted rules. Currently there exist rule/knowledge extraction techniques that extract rules in various formats including propositional rules [5,14,15], fuzzy rules [12,13], scientific laws [16], finite state automata [9], decision trees [6], and m-of-n rules [19].
Fig. 4.
The rule quality criterion is assessed via four characteristics, viz.:
(a) rule accuracy, the extent to which the rule set is able to classify a set of previously unseen examples from the problem domain;
(b) rule fidelity, the extent to which the extracted rules mimic the behaviour of the network from which they were extracted;
(c) rule consistency, the extent to which, under differing runs of the rule extraction algorithm, rule sets are generated which produce the same classifications of unseen examples;
(d) rule comprehensibility, the size of the extracted rule set in terms of the number of rules and the number of antecedents per rule.
Fig. 6.
The translucency criterion categorises a rule extraction technique according to the granularity of the neural network assumed by the rule extraction technique. Andrews et al. [2] use three key identifiers to mark reference points along a continuum of granularity from decompositional (rules are extracted at the level of individual hidden and output layer units) to pedagogical (the network is treated as a 'black box'; extracted rules describe global relationships between inputs and outputs; no analysis of the detailed characteristics of the neural network itself is undertaken).

The algorithmic complexity of the rule extraction technique provides a useful measure of the efficiency of the process. It should be noted, however, that few authors in the surveys [2,18] reported or commented on this issue.

The portability criterion assesses ANN rule-extraction techniques in terms of the extent to which a given technique can be applied across a range of ANN architectures and training regimes. Currently, there is a preponderance of techniques that might be termed specific purpose techniques, i.e. those where the rule extraction technique has been designed specifically to work with a particular ANN architecture. A rule extraction algorithm that is tightly coupled to a specific neural network architecture has limited use unless the architecture can be shown to be applicable to a broad cross section of problem domains.

3.2. The RULEX technique

RULEX is a decompositional technique that extracts propositional rules of the form given in (1) above. As such the imperative is to be able to determine [x_i^lower, x_i^upper] for each input dimension i of each local cluster. This section describes how these values can be determined. Eq. (7) can be rewritten as

l(w_i, r, x) = σ(k_i, (x_i − r_i + b_i)) − σ(k_i, (x_i − r_i − b_i)),    (10)
where k_i = k_1 w_i and b_i = 1/w_i. Here k_i represents the shape of an individual ridge and b_i represents the width of the individual ridge. From Eq. (4) we see that the output of a local cluster unit is determined by the sum of the activations of all its component ridges. Therefore, the minimum possible activation of an individual ridge, the ith ridge say, in a local cluster unit that has activation barely greater than its classification threshold, will occur when all ridges other than the ith ridge have maximum activation.
We define the functions min(·) and max(·) as the minimum and maximum values, respectively, of their function arguments. Then

min(l(w_i, r_i, x_i)) = max(l(w_b, r_b, x_b)) − (1/k_2) ln(1/O_T − 1),    (11)

where O_T is the activation threshold of the local cluster and max(l(w_b, r_b, x_b)) is the maximum possible activation for any ridge function in the local cluster. As k_2 and O_T are constants, and max(l(w_b, r_b, x_b)) can be calculated, the value of the minimum activation of the ith ridge can be calculated in a straightforward manner. See Appendix A for the derivation of Eq. (11).

Let θ = min(l(w_i, r_i, x_i)), m = e^{−(x_i − r_i)k_i}, and n = e^{−b_i k_i}. From Eqs. (10) and (11) we have

θ = 1/(1 + mn) − 1/(1 + m/n).    (12)

Let p = (1 − θ)e^{b_i k_i} and q = (θ + 1)e^{−b_i k_i}. Solving Eq. (12) for m and backsubstituting for m and n gives

x_i = r_i − (1/k_i) ln[(p − q ± √(p^2 + q^2 − 2(θ^2 + 1)))/(2θ)].    (13)

See Appendix B for the derivation of Eq. (13). Thus for the ith ridge function the extremities of the active range, [x_i^lower, x_i^upper], are given by the expressions

x_i^lower = r_i − ρ_lower/k_i,    (14)

x_i^upper = r_i + ρ_upper/k_i,    (15)

where ρ_lower is the negative root of the ln(·) expression in Eq. (13) above and ρ_upper is the positive root.

3.3. Simplification of extracted rules

One of the main purposes of rule extraction from neural networks is to provide an explanation facility for the decisions made by the network. As such it is clearly important that the extracted rules be as comprehensible as possible. The directly extracted rule set may contain: (a) redundant rules; (b) individual rules with redundant antecedent condition(s); and (c) pairs of rules where antecedent conditions can be combined.

Rule b is redundant and may be removed from the rule set if there exists a more general rule a such that

∀ 1 ≤ i ≤ n: [x_bi^lower, x_bi^upper] ⊆ [x_ai^lower, x_ai^upper].
A rule is also redundant and may be removed from the rule set if

∃ 1 ≤ i ≤ n: [I_i^lower, I_i^upper] ∩ [x_i^lower, x_i^upper] = ∅,
where [I_i^lower, I_i^upper] represents the entire range of values in the ith input dimension. An antecedent condition is redundant and may be removed from a rule if

∃ 1 ≤ i ≤ n: [I_i^lower, I_i^upper] ⊆ [x_i^lower, x_i^upper].
Rules a and b may be merged on the antecedent for input dimension j if

∀ 1 ≤ i ≤ n, i ≠ j: [x_ai^lower, x_ai^upper] = [x_bi^lower, x_bi^upper].
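As an illustration of the computation behind Eqs. (10)-(15), the sketch below (our own Python rendering with assumed variable names, not the authors' implementation) recovers the active range of a single ridge from its parameters k_i, b_i, r_i and the minimum required ridge activation θ of Eq. (11).

import numpy as np

def min_ridge_activation(max_l, k2, o_t):
    # Eq. (11): minimum activation the ith ridge must supply when every other
    # ridge is maximally active and the cluster output just reaches threshold O_T.
    return max_l - np.log(1.0 / o_t - 1.0) / k2

def ridge_active_range(r_i, k_i, b_i, theta):
    # Eqs. (12)-(15): the two roots of Eq. (13) bound the interval over which
    # the ridge activation is at least theta (theta must be attainable, i.e.
    # smaller than the ridge's maximum activation).
    p = (1.0 - theta) * np.exp(b_i * k_i)
    q = (theta + 1.0) * np.exp(-b_i * k_i)
    disc = np.sqrt(p * p + q * q - 2.0 * (theta * theta + 1.0))
    roots = [(p - q - disc) / (2.0 * theta), (p - q + disc) / (2.0 * theta)]
    xs = [r_i - np.log(m) / k_i for m in roots]
    return min(xs), max(xs)

# Example with assumed parameters: k_i = 8, b_i = 0.5, r_i = 0, theta = 0.5
# gives an active range of approximately [-0.5, 0.5].
print(ridge_active_range(0.0, 8.0, 0.5, 0.5))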
RULEX implements facilities for simplifying the directly extracted rule set in order to improve the comprehensibility of the rule set. The simplification is achieved without compromising the accuracy of the rule set.

4. Comparative results

The restricted LC network has been applied to a variety of datasets available from the machine learning repository at Carnegie Mellon University. These datasets were selected to show the general applicability of the network. The datasets contain missing values, noisy data, continuous and discrete valued attributes, a mixture of high and low dimensionality, and a variety of binary and multi-class classification tasks. Table 1 summarises the problem domains used in this study. A variety of methods including linear regression, cross validation nearest neighbour (XVNN), and C5 were chosen to provide comparative results for the restricted LC network. XVNN is a technique whereby 90% of the data is used as a 'codebook' for classifying the remaining 10% of the data.

Table 1
Summary of problem domains used in the study

Domain                         Cases   Number of classes   Number of attributes
Annealing processes              898          6                   38
Auto insurance                   205          6                   25
Breast cancer (Wisconsin)        699          2                    9
Horse colic                      368          2                   22
Credit screening (Australia)     690          2                   15
Pima diabetes                    768          2                    8
Glass identification             214          6                    9
Heart disease (Cleveland)        303          2                   13
Heart disease (Hungarian)        294          2                   13
Hepatitis prognosis              155          2                   19
Iris classification              150          3                    4
Labor negotiations                57          2                   16
Sick euthyroid                  3772          2                   29
Sonar classification             208          2                   60
Table 2
Summary of results for selected problem domains and methods

Domain                         NN     Linear machine   C5 boost   LC     RULEX
Annealing processes             9.9        3.2            4.2      2.5    16.2
Auto insurance                 16.0       35.5           16.0     15.6    27.0
Breast cancer (Wisconsin)       4.7        4.2            3.6      3.2     5.7
Horse colic                    18.9       16.4           15.7     13.5    14.1
Credit screening (Australia)   17.3       13.8           11.4     12.9    15.7
Pima diabetes                  29.3       22.4           24.2     22.6    27.4
Glass identification           29.5       37.6           28.9     36.7    34.4
Heart disease (Cleveland)      24.3       17.3           19.8     15.8    19.8
Heart disease (Hungarian)      22.8       16.6           22.0     14.6    18.7
Hepatitis prognosis            18.7       14.7           18.1     16.2    21.3
Iris classification             4.7        6.7            5.3      4.7     6.0
Labor negotiations             22.0       12.0           18.3      8.3    12.3
Sick euthyroid                  3.8        6.1            1.0      7.8     7.6
Sonar classification           13.9       22.5           17.8     15.4    21.5
Linear regression was used to obtain a baseline for comparison. The nearest neighbour method was chosen because it is a simple form of local function classifier. C5 was chosen because it is widely used in machine learning as a benchmarking tool. Further, C5 is an example of an 'axis parallel' classifier and as such it provides an ideal comparison for the LC network. Ten fold cross validation results are shown in Table 2 above; figures quoted are average percentage error rates.

These results show that in all of the study domains, except the sick euthyroid domain, the restricted LC network produces results that are comparable to those obtained by C5. Further, the results also show that RULEX, even though its primary purpose is explanation not classification, is able to extract accurate rules from the trained network, i.e. rules that provide a high degree of accuracy when used to classify previously unseen examples. As can be seen from the table above, the classification accuracy of RULEX is generally slightly worse than that of the LC network. This is most likely due to the combined effects of:

(i) the LC network solution benefiting from a degree of interaction between local clusters. Even though the restricted LC network has been constructed to minimise such interaction there are certain conditions under which local functions 'cooperate' in classifying data. Cooperation occurs when ridges of two local clusters 'overlap' in input space in such a way that the activation of individual local clusters in the region of overlap is less than the classification threshold, but the network output, i.e., the sum of the activations of the local clusters, is greater than the classification threshold. When the local clusters are decompiled into rules the region of overlap is not captured in the extracted rule set. This can result in a number of false negative classifications made by the rule set.
(ii) the RULEX algorithm approximating the hyper-ellipsoidal local clusters by hyper-rectangular rules. By putting a hyper-rectangle around a hyper-ellipsoid, a region of input space not covered by the hyper-ellipsoid is covered by the hyper-rectangle. In problem domains where data values are closely packed this can lead to an increase in the number of false positive classifications made by the rule set.

5. RULEX and the ADT taxonomy

This section places the RULEX algorithm into the classification framework of the ADT taxonomy presented in Section 3.

(a) Rule format

From (1) it can be seen that RULEX extracts propositional rules. In the directly extracted rule set each rule contains an antecedent condition for each input dimension as well as a rule consequent which describes the output class covered by the rule. As mentioned in Section 3.2, RULEX provides a rule simplification process which removes redundant rules and antecedent conditions from the directly extracted rules. The reduced rule set contains rules that consist of only those antecedents that are actually used by the trained LC network in discriminating between input patterns.

(b) Rule quality

As stated previously, the prime function of rule extraction algorithms such as RULEX is to provide an explanation facility for the trained network. The rule quality criteria provide insight into the degree of trust that can be placed in the explanation. Rule quality is assessed according to the accuracy, fidelity, consistency and comprehensibility of the extracted rules. Table 3 presents data that allows a quantitative measure to be applied to each of these criteria.

(i) Accuracy. Despite the mechanism employed to avoid local cluster units 'overlapping' during network training (see Eq. (9)) it is clear that there is some degree of interaction between local cluster units. (The larger the values of the parameters k_1 and k_2 the less the interaction between units but the slower the network training.) This effect becomes more apparent in problem domains with high dimension input space and in network solutions involving large numbers of local cluster units. Further, RULEX approximates the hyper-ellipsoidal local cluster functions of the LC network with hyper-rectangles. It is therefore not surprising that the classification accuracy of the extracted rules is less than that of the underlying network. It should be noted, however, that while the accuracy figures quoted for RULEX are worse than those for the LC network they are comparable to those obtained from C5.

(ii) Fidelity. Fidelity is closely related to accuracy, and the factors that affect accuracy, viz. interaction between units and approximation of hyper-ellipsoids by hyper-rectangles, also affect the fidelity of the rule sets.
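Fidelity in the sense of Section 3.1 can be quantified as the proportion of patterns on which the extracted rules and the network agree; a minimal sketch (our own illustration, assuming hypothetical network_predict and rules_predict functions) is:

def fidelity(patterns, network_predict, rules_predict):
    # Fraction of patterns for which the rule set reproduces the network's
    # classification, irrespective of the true class labels.
    agree = sum(1 for x in patterns if network_predict(x) == rules_predict(x))
    return agree / len(patterns)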
Table 3
Rule quality assessment

Domain                         LC error   RULEX error   Local clusters   Rules   Antecedents per rule   Fidelity
Annealing processes              2.5%       16.2%            16           16             20              85.9%
Auto insurance                  15.6%       27.0%            60           57             13              86.5%
Breast cancer (Wisconsin)        3.2%        5.7%             5            5             24              97.3%
Horse colic                     13.5%       14.1%             5            2.5            8              99.3%
Credit screening (Australia)    12.9%       15.7%             2            2              5              96.8%
Pima diabetes                   22.6%       27.4%             5            5              5              93.9%
Glass identification            36.7%       34.4%            22           19              6              96.6%
Heart disease (Cleveland)       15.8%       19.8%             4            3              5              95.2%
Heart disease (Hungarian)       14.6%       18.7%             3            2              5              95.2%
Hepatitis prognosis             16.2%       21.3%             6            4              8              93.9%
Iris classification              4.7%        6.0%             3            3              3              98.6%
Labor negotiations               8.3%       12.3%             2            2              7              95.6%
Sick euthyroid                   7.8%        7.6%             4            4              5              99.8%
Sonar classification            15.4%       21.5%             4            3              8              92.7%
In general, the rule sets extracted by RULEX display an extremely high degree of fidelity with the LC networks from which they were drawn.

(iii) Consistency. Rule extraction algorithms that generate rules by querying the trained neural network with patterns drawn randomly from the problem domain [6,17] have the potential to generate a variety of different rule sets from any given training run of the neural network. Such algorithms have the potential for low consistency. RULEX, on the other hand, is a deterministic algorithm that always generates the same rule set from any given training run of the LC network. Hence RULEX always exhibits 100% consistency.

(iv) Comprehensibility. In general, comprehensibility is inversely related to the number of rules and to the number of antecedents per rule. The LC network is based on a greedy, covering algorithm. Hence its solutions are achieved with relatively small numbers of training iterations and are typically compact, i.e. the trained network contains only a small number of local cluster units. Given that RULEX converts each local cluster unit into a single rule, the extracted rule set contains, at most, the same number of rules as there are local cluster units in the trained network. The rule simplification procedures built into RULEX potentially reduce the size of the rule set and ensure that only significant antecedent conditions are included in the final rule set. This leads to extracted rules with as high comprehensibility as possible.

(c) Translucency

RULEX is distinctly decompositional in that rules are extracted at the level of the hidden layer units. Each local cluster unit is treated in isolation, with the local cluster weights being converted directly into a rule.
(d) Algorithmic complexity

Golea [10,11] showed that, in many cases, the computational complexity of extracting rules from trained ANNs and the complexity of extracting the rules directly from the data are both NP-hard. Hence the combination of ANN learning and ANN rule-extraction potentially involves significant additional computational cost over direct rule-learning techniques. Table 4 in Appendix C gives an outline of the RULEX algorithm. Table 5 in Appendix C expands the individual modules of the algorithm. From these descriptions it is clear that the majority of the modules are linear in the number of local clusters (or rules) and the number of input dimensions, O(lc × n). The modules associated with rule simplification are, at worst, polynomial in the number of rules, O(lc^2). RULEX is therefore computationally efficient and has some significant advantages over rule extraction algorithms that rely on a (potentially exponential) 'search and test' strategy [14,19]. Thus the use of RULEX to include an explanation facility adds little in the way of overhead to the neural network learning phase.

(e) Portability

RULEX is non-portable, having been specifically designed to work with local cluster (LC) neural networks. This means that it cannot be used as a general purpose device for providing an explanation component for existing, trained, neural networks. However, as has been shown in the results presented in Section 4, the LC network is applicable to a broad range of problem domains (including continuous valued and discrete valued domains and domains which include missing values). Hence RULEX is also potentially applicable to a broad variety of problem domains.

6. Conclusion

This paper has described the restricted form of the LC local cluster neural network and the associated RULEX algorithm that can be used to provide an explanation facility for the trained network. Results were given for the LC network which show that the network is applicable across a broad spectrum of problem domains and produces results that are at least comparable to, and in many cases better than, C5, an accepted benchmark standard in machine learning, on the problem domains studied. The RULEX algorithm has been evaluated in terms of the guidelines laid out for rule extraction techniques. RULEX is a decompositional technique capable of extracting accurate and comprehensible propositional rules. Further, rule sets produced by RULEX show high fidelity with the network from which they were extracted. RULEX has been shown to be computationally efficient. Analysis of RULEX reveals only two drawbacks. Firstly, the technique is not portable, having been designed specifically to work with trained LC networks. Secondly, the technique is not immune from the so-called 'curse of dimensionality', i.e. in higher dimension problems accuracy and fidelity suffer. This is due to (i) approximating hyper-ellipsoid local functions with hyper-rectangular rules, and (ii) the increased likelihood of interaction between local clusters in the trained network.
Acknowledgements

The authors would like to thank the two anonymous reviewers for their helpful and constructive comments.

Appendix A. Derivation of minimum ridge activation

Let L = l(w_i, r_i, x_i). Then

O_T = σ(k_2, (n − 1) max(L) + min(L) − d).    (A.1)

Now d = n max(L), so

O_T = σ(k_2, min(L) − max(L)),    (A.2)

O_T = 1/(1 + e^{−k_2(min(L) − max(L))}),    (A.3)

1/O_T − 1 = e^{−k_2 min(L)} / e^{−k_2 max(L)},    (A.4)

e^{−k_2 max(L)} (1/O_T − 1) = e^{−k_2 min(L)},    (A.5)

ln(1/O_T − 1) − k_2 max(L) = −k_2 min(L),    (A.6)

min(L) = max(L) − (1/k_2) ln(1/O_T − 1).    (A.7)
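The result (A.7) can be checked numerically; the short sketch below (our own illustration, with assumed values for k_2, max(L) and O_T) confirms that substituting min(L) from (A.7) back into (A.2) recovers the threshold O_T.

import numpy as np

def sigma(k, h):
    return 1.0 / (1.0 + np.exp(-k * h))

k2, max_l, o_t = 8.0, 0.96, 0.4                  # assumed illustrative values
min_l = max_l - np.log(1.0 / o_t - 1.0) / k2     # Eq. (A.7)
print(np.isclose(sigma(k2, min_l - max_l), o_t))  # True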
Appendix B. Derivation of range of activation for the ith ridge function

Let θ = min(l(w_i, r_i, x_i)), m = e^{−(x_i − r_i)k_i}, and n = e^{−b_i k_i}. From Eqs. (10) and (11) we have

θ = σ(k_i, (x_i − r_i + b_i)) − σ(k_i, (x_i − r_i − b_i)),    (B.1)

θ = 1/(1 + mn) − 1/(1 + m/n),    (B.2)

θ = [(1 + m/n) − (1 + mn)] / [(1 + mn)(1 + m/n)],    (B.3)

θ(1 + mn)(1 + m/n) = (1 + m/n) − (1 + mn),    (B.4)

[θ(1 + m/n) + 1](1 + mn) = (1 + m/n),    (B.5)

(θ + θm/n + 1)(1 + mn) = (1 + m/n),    (B.6)

θm^2 + (θ + 1)mn + (θ − 1)m/n + θ = 0,    (B.7)

θm^2 + [(θ + 1)n + (θ − 1)/n]m + θ = 0.    (B.8)

Let a = θ, b = [(θ + 1)n + (θ − 1)/n], and c = θ. Solving for m gives roots at

m = [(1 − θ)/n − (θ + 1)n ± √((θ + 1)^2 n^2 + 2(θ + 1)(θ − 1) + ((θ − 1)/n)^2 − 4θ^2)] / (2θ),    (B.9)

m = [(1 − θ)/n − (θ + 1)n ± √(((θ − 1)/n)^2 + (θ + 1)^2 n^2 − 2(θ^2 + 1))] / (2θ).    (B.10)

Now n = e^{−b_i k_i}. Substituting for n into (B.10) gives

m = [(1 − θ)e^{b_i k_i} − (θ + 1)e^{−b_i k_i} ± √((1 − θ)^2 e^{2b_i k_i} + (θ + 1)^2 e^{−2b_i k_i} − 2(θ^2 + 1))] / (2θ).    (B.11)

Let p = (1 − θ)e^{b_i k_i} and q = (θ + 1)e^{−b_i k_i}. Then

m = [p − q ± √(p^2 + q^2 − 2(θ^2 + 1))] / (2θ).    (B.12)

Now m = e^{−(x_i − r_i)k_i}. This gives the following expression for x_i:

x_i = r_i − ln(m)/k_i.    (B.13)

Appendix C. Algorithmic complexity of RULEX (Tables 4 and 5)

Table 4
The RULEX algorithm

rulex()
{
    create_data_structures();
    create_domain_description();
    for each local cluster
        for each ridge function
            calculate_ridge_limits();
    while redundancies remain
        remove_redundant_rules();
        remove_redundant_antecedents();
        merge_antecedents();
    endwhile;
    feed_forward_test_set();
    display_rule_set();
} // end rulex
Table 5
Modules of the RULEX algorithm

remove_redundant_rules()
{
    for each rule a
        for each other rule b
        {
            OKtoremove = false;
            for each dimension i
                if ([x_bi^lower, x_bi^upper] ⊆ [x_ai^lower, x_ai^upper])
                   OR ([I_i^lower, I_i^upper] ∩ [x_bi^lower, x_bi^upper] = ∅)
                    then OKtoremove = true;
            if OKtoremove then remove_rule_b();
        }
} // end remove_redundant_rules

remove_redundant_antecedents()
{
    for each rule
        for each dimension i
            if [I_i^lower, I_i^upper] ⊆ [x_i^lower, x_i^upper]
                then remove_antecedent_i();
} // end remove_redundant_antecedents

merge_antecedents()
{
    for each rule a
        for each other rule b
            for each dimension j
            {
                OKtomerge = true;
                for each other dimension i (i ≠ j)
                    if NOT ([x_ai^lower, x_ai^upper] = [x_bi^lower, x_bi^upper])
                        then OKtomerge = false;
                if OKtomerge then
                {
                    [x_aj^lower, x_aj^upper] = [x_aj^lower, x_aj^upper] ∪ [x_bj^lower, x_bj^upper];
                    remove_rule_b();
                }
            }
} // end merge_antecedents

feed_forward_test_set()
{
    errors = 0; correct = 0;
    for each pattern in the test set
        for each rule
        {
            classified = true;
            for each dimension i
                if test_i ∉ [x_i^lower, x_i^upper] then classified = false;
            if classified AND (test_target = rule_class_label)
                then ++correct;
                else ++errors;
        }
} // end feed_forward_test_set
Table 5 (continued)

display_rule_set()
{
    for each rule
    {
        for each dimension i
            write([x_i^lower, x_i^upper]);
        write(rule_class_label);
    }
} // end display_rule_set
References

[1] S. Abe, M.S. Lan, A method for fuzzy rules extraction directly from numerical data and its application to pattern classification, IEEE Trans. Fuzzy Systems 3 (1) (1995) 18–28.
[2] R. Andrews, A.B. Tickle, J. Diederich, A survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge Based Systems 8 (1995) 373–389.
[3] M. Berthold, K. Huber, From radial to rectangular basis functions: a new approach for rule learning from large datasets, Technical Report 15-95, University of Karlsruhe, 1995.
[4] M. Berthold, K. Huber, Building precise classifiers with automatic rule extraction, Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 1995, Vol. 3, pp. 1263–1268.
[5] G.A. Carpenter, A.W. Tan, Rule extraction: from neural architecture to symbolic representation, Connection Sci. 7 (1) (1995) 3–27.
[6] M. Craven, Extracting comprehensible models from trained neural networks, Ph.D. Thesis, University of Wisconsin, Madison, Wisconsin, 1996.
[7] W. Duch, R. Adamczak, K. Grabczewski, Neural optimisation of linguistic variables and membership functions, Proceedings of the 6th International Conference on Neural Information Processing ICONIP '99, Perth, Australia, 1999, Vol. II, pp. 616–621.
[8] S. Geva, K. Malmstrom, J. Sitte, Local cluster neural net: architecture, training and applications, Neurocomputing 20 (1998) 35–56.
[9] L. Giles, C. Omlin, Rule revision with recurrent networks, IEEE Trans. Knowledge Data Eng. 8 (1) (1996) 183–197.
[10] M. Golea, On the complexity of rule extraction from neural networks and network querying, Proceedings of the Rule Extraction From Trained Artificial Neural Networks Workshop, Society for the Study of Artificial Intelligence and Simulation of Behaviour Workshop Series '96, University of Sussex, Brighton, UK, 1996, pp. 51–59.
[11] M. Golea, On the complexity of extracting simple rules from trained neural nets (1997), to appear.
[12] Y. Hayashi, A neural expert system with automated extraction of fuzzy If-Then rules and its application to medical diagnosis, Adv. Neural Inform. Process. Systems 3 (1990) 578–584.
[13] S. Horikawa, T. Furuhashi, Y. Uchikawa, On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm, IEEE Trans. Neural Networks 3 (5) (1992) 801–806.
[14] R. Krishnan, A systematic method for decompositional rule extraction from neural networks, Proceedings of the NIPS '97 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology, 1996, pp. 38–45.
[15] F. Maire, A partial order for the m-of-n rule extraction algorithm, IEEE Trans. Neural Networks 8 (6) (1997) 1542–1544.
[16] K. Saito, R. Nakano, Law discovery using neural networks, Proceedings of the NIPS '96 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology, 1996, pp. 62–69.
[17] S. Thrun, Extracting provably correct rules from artificial neural networks, Technical Report IAI-TR-93-5, Institut für Informatik III, Universität Bonn, Germany, 1994.
[18] A.B. Tickle, R. Andrews, M. Golea, J. Diederich, The truth will come to light: directions and challenges in extracting the knowledge embedded within trained artificial neural networks, IEEE Trans. Neural Networks 9 (6) (1998) 1057–1068.
[19] G. Towell, J. Shavlik, The extraction of refined rules from knowledge based neural networks, Mach. Learning 13 (1) (1993) 71–101.
[20] V. Tresp, J. Hollatz, S. Ahmad, Network structuring and training using rule-based knowledge, Adv. Neural Inform. Process. Systems 6 (1993) 871–878.
Robert Andrews received his B.Ed. and M.I.T. degrees from Queensland University of Technology in 1985 and 1995, respectively, and is currently in the final stages of his Ph.D. He is presently a Lecturer in the School of Information Systems at QUT. He has had practical experience in research projects involving the application of rule extraction techniques to herd improvement in the dairy cattle industry and the analysis of financial data. His interests include neurocomputing, pattern recognition, data mining and theory refinement.
Shlomo Geva received his B.Sc. degree in Chemistry and Physics from the Hebrew University in 1982. He received his M.Sc. and Ph.D. degrees from the Queensland University of Technology in 1988 and 1992, respectively. He is currently a Senior Lecturer in the School of Computing Science at QUT. His interests include neurocomputing, data mining, pattern recognition, speech and image processing.