Expert Systems With Applications, Vol. 8, No. 1, pp. 77-87, 1995. Copyright © 1994 Elsevier Science Ltd. Printed in the USA. All rights reserved. 0957-4174/95 $9.50 + .00
Evaluation of Automatic Rule Induction Systems

METKA VRTAČNIK AND D. DOLNIČAR

University of Ljubljana, Faculty of Science and Technology, Department of Chemical Education and Informatics, Ljubljana, Slovenia
Abstract--The effects of the number of attributes used to describe the data set, the number of examples included in the training set, and a post-pruning mechanism on the predictive power of classification rules for the automatic assignment of river water pollution levels were studied. In the induction experiments, the original ID3 algorithm embedded in the Knowledge Maker environment was extended with a post-pruning mechanism. To facilitate the evaluation of the resulting classification rules, the algorithm of Reingold and Tilford for tidier drawing of trees was implemented. The results showed that classification rules that are efficient in comparison with the experts' class assignment can already be derived from 500 examples of baseline data, each example being described by 5 attributes.
INTRODUCTION
ARTIFICIAL INTELLIGENCE methods and techniques, e.g., expert systems, are becoming important tools in solving chemical problems. A comparative analysis of the chemical literature obtained from the 1989-90 on-line Chemical Abstracts gave the major domains of emerging expert system applications in chemistry (Figure 1). Conceptualisation, development, and engineering of chemical and biochemical processes are the most rapidly growing fields of expert system applications in chemistry. In these domains expert systems are used in process design, control, and simulation. In analytical chemistry, expert system approaches have started to compete with standard multivariate statistical techniques, e.g., pattern recognition and cluster analysis, which used to be applied to extract value-added information from large data sets and to support routine decision making. The vast majority of expert systems in analytical chemistry are oriented toward solving structure-elucidation problems, based on the analysis and automatic synthesis of different spectroscopic data, e.g., IR, NMR, MS, GC/MS, etc., or toward predicting spectra from structural parameters.

Although the knowledge representation strategies, the description of objects or events, the rules of generalisation, the control strategies, and the generalisation levels may differ, depending on the expert system shell or environment used in the development of an expert system prototype as well as on the problem domain, the reasoning or inference processes of the expert system shells are generalised by two mechanisms: backward and forward chaining. The major bottleneck in the quicker design of expert system applications in various fields is the knowledge base, which is usually domain-specific. Two general approaches can be used for the automatic extraction of expert knowledge: deduction from theories or rules, and induction from representative sets of examples or expert experiences. In more sophisticated applications, both approaches usually have to be interlinked to cope with the multi-objective character of the problems.
PROBLEM PRESENTATION
The classification of Slovenian rivers according to their pollution level is defined by four classes:
• Class I: drinking water, or drinking water after treatment
• Class II: water that can be used for bathing, water sports, or fish farming, and could be used for drinking after adequate treatment
• Class III: water that could be used for irrigation or, after adequate treatment, for industrial purposes, except for the food industry
• Class IV: polluted water

The criteria for assigning a pollution class/level to a particular water sample at a sampling site are based on the results of (a) selected physical and chemical analyses, (b) specific analyses of heavy metals and selected organic pollutants, and (c) microbiological analyses. The overall pollution class of a water sample is assigned by an expert panel that includes chemists, biologists, and hydrologists. An attempt was made to induce an automatic classification rule for determining the pollution level of river water samples.
Requests for reprints should be sent to Metka Vrtačnik, University of Ljubljana, Faculty of Science and Technology, Department of Chemical Education and Informatics, Vegova 4, p.p. 18/I, 61001 Ljubljana, Slovenia.
FIGURE 1. Frequencies of expert system applications in different domains of chemistry (based on references from Chemical Abstracts 1989-1990). TE = technology/engineering; AC = analytical chemistry; WM = waste management; MS = material selection; OK = organic chemistry (synthesis design); BT = biotechnology; BI = biochemistry/pharmacology; DI = data base interface; CA = catalyst design.
Analysis of the problem characteristics showed that the problem can be tackled by applying inductive learning theory and tools for automatic rule induction. Automatic rule induction systems for inducing classification rules have already proved valuable as tools in supporting knowledge acquisition for expert systems. Two types of induction algorithms have been successfully applied to different types of classification problems: the ID3 and AQ algorithms (Michalski, Carbonell, & Mitchell, 1983). The major problem in the application of most of the commercially available automatic rule induction systems is that they generalise rules on the basis of the noiseless domain assumption, i.e., no errors in the training data set or in the description language of the data set (Clark & Niblett, 1987). Therefore, such systems are limited to searching for rules for which no counterexamples exist in the data set used for induction. The resulting classification rules most often have low predictive power.

In our research, the following effects on the classification results were studied: (a) the number of attributes used to describe the data set, (b) the number of examples included in the training set for induction of the classification rule, and (c) post-pruning, based on estimating the classification errors in the nodes, on a selected set of data from the monitoring of Slovenian river waters. The commercially available automatic induction system used was Knowledge Maker, which is installed within the Knowledge Pro environment.

DESCRIPTION OF THE DATA SET

The data set comprised 1261 examples, each described by 49 attributes from selected physical, chemical, and biological analyses. The original data set was split into two training sets: (a) 500 examples, (b) 300 examples; as a test set, the whole data set of 1261 examples was used. The examples were chosen in such a way that the proportions of the different pollution classes were preserved in the training and test data sets. (No examples of pollution class I were included in the training data sets, because there were only 11 such examples in the whole data set.) Among the 49 attributes describing each example, 21 were selected. The selected attributes resulted from a previously conducted Linear Discriminant Analysis (LDA) on the same data set (see Table 1). The goal of the classification is the pollution class.
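As an illustration of such a class-proportional split, the following minimal Python sketch draws a training set that preserves the class distribution of the full data set; the function and field names are our own and not part of the Knowledge Maker system.

```python
import random
from collections import defaultdict

def proportional_split(examples, train_size, class_of, seed=0):
    """Draw a training set that preserves the class proportions of the
    full data set. (Classes with very few examples may round to zero,
    which matches the exclusion of class I in the study.)"""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[class_of(ex)].append(ex)
    train = []
    for cls, members in by_class.items():
        # number of training examples proportional to the class share
        k = round(train_size * len(members) / len(examples))
        train.extend(rng.sample(members, min(k, len(members))))
    return train

# e.g., a training set of 500 examples, pollution class stored under 'class':
# train = proportional_split(data, 500, class_of=lambda ex: ex['class'])
```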
THEORETICAL BACKGROUND OF THE INDUCTION ALGORITHM

Systems for the automatic induction of knowledge use entropy (information content) as a measure of uncertainty in the classification of objects/instances. The classification problem is defined by the equation:

H(C | a_j) = -\sum_{i=1}^{N} p(c_i | a_j) \log_2 p(c_i | a_j)

where
H(C | a_j) = entropy of the classification,
N = number of classes,
c_1 ... c_N = discrete classes 1 to N,
a_1 ... a_j = values of attributes,
p(c_i | a_j) = probability of an object being in class c_i given the attribute value a_j.
The first step in calculating the entropy of classification, after deciding on a partitioning attribute, is to split the original table into subtables, each containing the examples that share the same value of the partitioning attribute. The entropy of each subtable, H(C | a_j), is calculated for each value of attribute a_j. To find the entropy of the entire table after the split, the entropy of each value of the attribute, multiplied by the probability that the value appears in the table, must be summed. The resulting entropy of the classification, H(C | A), is the weighted average of the entropies over the values a_j of the attribute:

H(C | A) = \sum_{j=1}^{M} p(a_j) H(C | a_j)

where M is the number of values of the attribute.
If the same calculation is performed for each attribute in the example set, the attribute with the smallest entropy, and thus the least uncertainty, is the best attribute to select for the initial split (Thompson & Thompson, 1986; Clark & Niblett, 1987). In the Knowledge Maker automatic induction system, this algorithm is implemented in such a way that example tables are stored as a list; a list is also used to store the attributes and their values. The tree-building procedure is performed by a function called classify.
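The entropy computation and the choice of the splitting attribute described above can be sketched compactly as follows. This is an illustrative reimplementation of the ID3 criterion, not the Knowledge Maker source; the table representation (a list of dictionaries) is an assumption of ours.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H = -sum p(c) log2 p(c) over the classes present in `labels`."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def split_entropy(table, attr, class_key='class'):
    """H(C|A): the weighted average of the subtable entropies H(C|a_j),
    one subtable per value a_j of the partitioning attribute."""
    groups = defaultdict(list)
    for row in table:
        groups[row[attr]].append(row[class_key])
    n = len(table)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def best_attribute(table, attrs):
    """The attribute with the smallest H(C|A) carries the least
    uncertainty and is chosen for the split (the ID3 criterion)."""
    return min(attrs, key=lambda a: split_entropy(table, a))
```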
TABLE 1
Attributes Used for the Induction of the Decision Tree

Attribute                          Symbol     Units
Conductivity                       PREV       µS/cm
Dissolved oxygen                   O2         mg/dm³
Saturation with oxygen             SAT_O2     mg/dm³
Suspended solids after drying      SU_SUS     mg/dm³
Suspended solids after ignition    SU_ZAR     mg/dm³
Fixed solids ignited at 500°C      ZAR_OST    mg/dm³
Total hardness                     TR_TOT     mg/dm³
NO2-                               NO2        mg/dm³
NO3-                               NO3        mg/dm³
Cl-                                CL         mg/dm³
NH4+                               NH4        mg/dm³
Na+                                NA         mg/dm³
SiO2                               SIO2       mg/dm³
Al+3                               AL         mg/dm³
Fe+3                               FE         mg/dm³
Phenol                             FENOL      mg/dm³
Detergents                         DETER      mg/dm³
COD (K2Cr2O7)                      KPK        mg O2/dm³
BOD5                               BPK5       mg O2/dm³
Bacteriological analysis           BAKT_AN    MPN/dm³*
Measured color                     S_BARVA    special scale

* MPN = most probable number.
By calculating H(C | A) for each attribute, the function classify chooses the attribute with the smallest entropy for the initial split of the example set. The classify function produces the tree in a depth-first manner. The binary decision tree is represented by a set of rules organised as if-then clauses forming a nested structure.

DEVELOPMENT OF THE PROGRAMME FOR GRAPHICAL DISPLAY AND POST-PRUNING OF THE DECISION TREE

To display graphically the decision tree developed by the Knowledge Maker induction system, the algorithm of Reingold and Tilford (RT) for "tidier drawings of trees" was implemented (Reingold & Tilford, 1981). The crucial part of this algorithm is its ability to space out subtrees. The algorithm is recursive, and its main principle consists of (1) assigning coordinates to the left and the right subtree of the current node, and (2) spacing out the subtrees as much as necessary when the right and the left subtree overlap. An important part of the algorithm is the set of demands on the drawing characteristics. They are predefined and described as "tidy drawings," and the algorithm attempts to take into account as many of them as possible. The following demands are typical for trees, and they are often trivially fulfilled (except the fifth one, which is not trivially solvable on all classes of trees):
(1) parallel levels of the tree (nodes of the same level lie on a straight line, and the lines that determine the levels are pairwise parallel);
(2) constant distance between two arbitrary levels;
(3) the smallest constant distance between two adjacent nodes of the same level;
(4) symmetric position of the father node above the son nodes (father: the current node; sons: the nodes of the next level connected to the father);
(5) minimal width of the tree (tree width: the horizontal distance between the far left and the far right node of the tree); a simplified sketch of this spacing principle follows below.

In our research, the general RT algorithm for drawing trees was implemented with the following modifications:
1. zooming out an arbitrary subtree
2. partial display of node labels (attributes with a legend)
3. complete display of node labels
4. display of the occurrences of a chosen attribute of the classification in the classification tree.

The graphical display of the classification tree offers a number of advantages for studying the logical structure of the classification tree in comparison with the rules produced by the Knowledge Maker induction algorithm. These advantages enable
1. review of the presence of a given attribute in the tree
2. display of the partition values of a given attribute
3. identification by experts of contradictions among rules
4. study of correlations among attributes
5. evaluation by experts of the classification of extreme examples.

The programme developed is also used for the classification of new examples. This is done by a tree search procedure, which is much faster than the sequential testing of rules performed by the Knowledge Maker system.
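The spacing principle behind demands 3-5 can be sketched for binary trees as follows: lay out the two subtrees independently, scan their facing contours level by level, and push them apart by the minimal shift that removes all overlaps. This is a much simplified Python rendering of the Reingold-Tilford idea, not the authors' Pascal implementation; the Node structure and the contour representation are our own.

```python
MIN_SEP = 2.0  # smallest allowed distance between adjacent nodes (demand 3)

class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.offset = 0.0   # horizontal offset relative to the father
        self.x = self.y = 0.0

def _layout(node):
    """Post-order pass: set child offsets; return the subtree contour as a
    list of (leftmost_x, rightmost_x) per level, relative to `node`."""
    children = [c for c in (node.left, node.right) if c]
    if not children:
        return [(0.0, 0.0)]
    contours = [_layout(c) for c in children]
    if len(children) == 1:
        children[0].offset = 0.0
        merged = contours[0]
    else:
        lc, rc = contours
        # the minimal shift that keeps every shared level MIN_SEP apart
        # (demand 5: no wider than necessary)
        shift = max(l_hi - r_lo + MIN_SEP
                    for (_, l_hi), (r_lo, _) in zip(lc, rc))
        children[0].offset, children[1].offset = -shift / 2, shift / 2
        merged = []
        for i in range(max(len(lc), len(rc))):
            lo = lc[i][0] - shift / 2 if i < len(lc) else rc[i][0] + shift / 2
            hi = rc[i][1] + shift / 2 if i < len(rc) else lc[i][1] - shift / 2
            merged.append((lo, hi))
    return [(0.0, 0.0)] + merged  # father centred above the sons (demand 4)

def assign_coordinates(node, x=0.0, depth=0):
    """Pre-order pass: turn relative offsets into absolute coordinates;
    equal depth means equal y (demands 1 and 2)."""
    node.x, node.y = x, float(depth)
    for c in (node.left, node.right):
        if c:
            assign_coordinates(c, x + c.offset, depth + 1)

# root = Node('BPK5', Node('NO2'), Node('KPK', Node('II'), Node('III')))
# _layout(root); assign_coordinates(root)   # then plot (node.x, node.y)
```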
In incomplete and interdisciplinary domains, classification can rarely be exact, yet the basic induction algorithm always aims to construct an exact decision tree. For splitting subsets of learning examples at the lower levels, attributes that have no classification power are used. This drawback of the basic algorithm can be partially overcome by pruning the decision tree. Pruning is a mechanism that prevents the construction of unreliable subtrees. The programme for drawing the classification trees was further developed by the inclusion of a post-pruning mechanism, based on the algorithm presented by Cestnik, Kononenko, and Bratko (1987). The major effects of the implementation of the post-pruning mechanism are (1) a decrease in the size of the decision tree (simplification of the classification) while, at the same time, (2) the accuracy of the classification rule is preserved. All programs, for (1) the conversion of rules into data for drawing, (2) the drawing of the decision tree, and (3) post-pruning, are written in Pascal; Pascal graphics is used to display the trees.
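The flavour of such bottom-up post-pruning can be sketched as follows. We assume each node stores its example count n, the count of its majority class n_majority, and its children, and we use a Laplace-style static error estimate; the exact estimator of the implemented algorithm may differ.

```python
def expected_error(n, n_majority, k):
    """Static (Laplace-style) estimate of the error made by turning a
    node with n examples, n_majority of them in the majority class,
    into a leaf; k is the number of classes. An assumed estimator."""
    return (n - n_majority + k - 1) / (n + k)

def post_prune(node, k):
    """Bottom-up pruning: replace a subtree by a leaf whenever the
    leaf's estimated error does not exceed the backed-up error of
    keeping the subtree. Returns the estimated error of the result."""
    as_leaf = expected_error(node.n, node.n_majority, k)
    if not node.children:
        return as_leaf
    backed_up = sum(post_prune(c, k) * c.n / node.n for c in node.children)
    if as_leaf <= backed_up:
        node.children = []        # the unreliable subtree becomes a leaf
        return as_leaf
    return backed_up
```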
THE DESCRIPTION OF THE EXPERIMENTAL STRATEGY AND RESULTS OF CLASSIFICATION

In the first experiment, the effect of the number of attributes on the classificatory precision and the structure of the classification tree was studied. The experiment was conducted in three parallel runs using the original ID3 algorithm implemented as the function classify in the Knowledge Maker induction system. The parallel experimental runs differ in the number of attributes used for the description of the training data set:
1. run: 21 attributes, identified as predictor variables by linear discriminant analysis conducted on the same data set;
2. run: 7 attributes, selected from the original set of 21 attributes based on the experts' evaluation (Table 2);
3. run: 5 attributes (the first 5 of the 7 used in the second run) (Table 2).

In all three parallel runs, the training set was composed of 500 examples. As a test set, the entire data set of 1255 examples was used. The goal of the classification was the overall pollution class as defined by the expert panel for each sample in the data set.

In the second experiment, the effect of a post-pruning procedure on the predictive power of the already developed classification trees was studied. The first experimental strategy was therefore repeated, but the resulting classification trees were subjected to the post-pruning algorithm. The results of the first and the second experiments, conducted in three parallel runs, are summarised in Table 3, which is composed of six subtables. Each row of a subtable gives the distribution of the classification results obtained from the induction system Knowledge Maker within one expert-assigned pollution class, while the far right column gives the total number of examples in that expert-assigned class. From the total of 1255 examples, the expert panel assigned:
-- 7 examples to pollution class I,
-- 43 to pollution class I-II,
-- 229 to pollution class II,
-- 353 to pollution class II-III,
-- 193 to pollution class III,
-- 243 to pollution class III-IV, and
-- 187 examples to pollution class IV.

Of the 43 examples assigned by the experts to pollution class I-II, the automatic induction system in the first run (based on 21 attributes) assigned:
-- 27 to the same class, namely I-II,
-- 13 to class II, and
-- 3 examples to class II-III.

The frequency distributions of the results within each pollution class are given in Figures 2-7, comparing the first (21 attributes), second (7 attributes), and third (5 attributes) runs. In the third experiment, the effect of the size of the training set used for the induction of the classification rule on the classification results was studied. The comparative distributions of the results obtained from training sets composed of 500 and 300 examples, respectively, are given in Figures 8 and 9.
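Each subtable of Table 3 is, in effect, a confusion matrix: a tally of automatically assigned classes within each expert-assigned class. A minimal sketch of this tally, with illustrative field names of our own:

```python
from collections import Counter

CLASSES = ['I', 'I-II', 'II', 'II-III', 'III', 'III-IV', 'IV']

def confusion_matrix(test_set, predict):
    """counts[expert_class][predicted_class] -> number of samples,
    i.e., one row of a Table 3 subtable per expert-assigned class."""
    counts = {c: Counter() for c in CLASSES}
    for sample in test_set:
        counts[sample['expert_class']][predict(sample)] += 1
    return counts
```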
TABLE 2
Attributes Used in the Second and Third Runs of the First Experiment

Second run:
Attribute         Symbol   Unit
BOD5              BPK5     mg O2/dm³
NO2-              NO2      mg/dm³
NO3-              NO3      mg/dm³
Phenol            FENOL    mg/dm³
COD (Cr2O7-2)     KPK      mg O2/dm³
NH4+              NH4      mg/dm³
Detergents        DETER    mg/dm³

Third run:
Attribute         Symbol   Unit
BOD5              BPK5     mg O2/dm³
NO2-              NO2      mg/dm³
NO3-              NO3      mg/dm³
Phenol            FENOL    mg/dm³
COD (Cr2O7-2)     KPK      mg O2/dm³
TABLE 3
Comparative Classification Results for Three Parallel Runs
(rows: expert-assigned pollution class E; columns: class C assigned by the induction system; subtables on the left give the results obtained from the original ID3 algorithm (Knowledge Maker), subtables on the right the results obtained by applying the post-pruning procedure)

21 attributes, original ID3 algorithm:

E \ C     I  I-II   II  II-III  III  III-IV   IV  Total
I         0    1     6     0      0      0     0      7
I-II      0   27    13     3      0      0     0     43
II        0   12   163    48      2      4     0    229
II-III    0    3    31   252     42     22     3    353
III       0    0    14    45    106     22     6    193
III-IV    0    1     3    19     28    170    22    243
IV        0    0     1     7     10     45   124    187
Total     0   44   231   374    188    263   155   1255

7 attributes, original ID3 algorithm:

E \ C     I  I-II   II  II-III  III  III-IV   IV  Total
I         0    5     2     0      0      0     0      7
I-II      0   28    13     2      0      0     0     43
II        0   13   163    48      3      2     0    229
II-III    0    1    47   241     33     30     1    353
III       0    2    20    42    104     21     4    193
III-IV    0    2     6    17     34    153    31    243
IV        0    0     0     7     11     27   142    187
Total     0   51   251   357    185    233   178   1255

5 attributes, original ID3 algorithm:

E \ C     I  I-II   II  II-III  III  III-IV   IV  Total
I         0    5     2     0      0      0     0      7
I-II      0   27    14     2      0      0     0     43
II        0   13   162    49      2      3     0    229
II-III    0    1    41   245     35     28     3    353
III       0    1    19    45    103     22     3    193
III-IV    0    2     3    20     32    159    27    243
IV        0    0     0    11      8     28   140    187
Total     0   49   241   372    180    240   173   1255

5 attributes, post-pruning procedure:

E \ C     I  I-II   II  II-III  III  III-IV   IV  Total
I         0    5     2     0      0      0     0      7
I-II      0   19    22     2      0      0     0     43
II        0    9   149    65      2      4     0    229
II-III    0    1    42   239     35     34     2    353
III       0    1    16    51     94     28     3    193
III-IV    0    2     3    21     31    148    38    243
IV        0    0     0     9      8     28   142    187
Total     0   37   234   387    170    242   185   1255

[The post-pruned subtables for the 21- and 7-attribute runs could not be reliably recovered from the scanned original and are omitted here.]
DISCUSSION
The comparative results of the automatic induction system Knowledge Maker applied to the classification of river water samples according to their overall pollution levels (Table 3 and Figures 2-7) show:
- The effect of the number of attributes on the precision of the classification frequencies is negligible (Figures 2-7). In cases where only a general orientation on the pollution level of unknown water samples is needed, the core information can be induced from the baseline data with an accuracy higher than 60% from the known values of only five attributes/measurements (BOD5, COD, NO2-, NO3-, Phenol). If one also takes into account the instances where the difference between the experts' class assignment and the result of the automatic induction system is only half a class, the classification accuracy increases by a further 20 to 30% (made precise in the sketch below).
- The post-pruning mechanism decreases the size of the decision tree (Figure 10), while at the same time the accuracy of the classification rules changes only slightly within each pollution class (right histograms in Figures 2-7).
- The number of examples used for inducing the classification rules affects the distribution of the automatic classification results on the test set, but the differences do not show the same trend for all classes (Figures 8 and 9); e.g., for class II-III, in the case of the induction rule developed from 500 examples, the automatic class assignment for 250 examples (out of 353) of the test set coincides with the expert class assignment, while in the case of the induction rule derived from 300 examples, only 170 examples coincide with the expert pollution class assignment. It can be stated that the accuracy of the classification depends on the number of examples in the training set used for the development of a decision rule.
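The "within half a class" accuracy can be made precise by counting a prediction as acceptable when it lies at most one step away on the ordered class scale. A sketch, reusing the confusion-matrix counts from the sketch in the experimental section:

```python
CLASSES = ['I', 'I-II', 'II', 'II-III', 'III', 'III-IV', 'IV']

def accuracy(counts, tolerance=0):
    """Fraction of samples whose predicted class is within `tolerance`
    steps of the expert class on the ordered scale I ... IV; tolerance=0
    is exact agreement, tolerance=1 allows a half-class difference."""
    index = {c: i for i, c in enumerate(CLASSES)}
    hit = total = 0
    for expert, row in counts.items():
        for predicted, n in row.items():
            total += n
            if abs(index[expert] - index[predicted]) <= tolerance:
                hit += n
    return hit / total

# accuracy(counts)               -> exact agreement with the experts
# accuracy(counts, tolerance=1)  -> agreement within half a class
```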
FIGURE 2. Distribution of results for samples assigned by experts in the class (I-II). (Paired histograms, UN-PRUNED and POST-PRUNED; y-axis: frequencies (%); x-axis: pollution classes; bars: 21A, 7A, 5A.)
This aspect of the induction systems will be further tested in such a way that all 1259 sets of measurements will be used for the induction of the decision rule, while the experimental results for the years 1989 and 1990 will be used as a test set. The major reason this has not yet been accomplished is that the Knowledge Maker induction algorithm for PCs under the DOS environment does not run if the number of examples exceeds 600.

The Knowledge Maker decision system represents classification knowledge in the form of if-then rules. How the number of attributes and the post-pruning procedure affect the number of rules is shown in Table 4.
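Since each leaf of the decision tree corresponds to one if-then rule, the rule counts of Table 4 are simply leaf counts. A sketch of the tree-to-rules conversion, with an assumed node structure (an internal node carries an attribute/threshold test and yes/no subtrees):

```python
def tree_to_rules(node, conditions=()):
    """Flatten a decision tree into if-then rules: one rule per leaf,
    its conditions collected along the root-to-leaf path. Assumes a
    leaf has an empty `children` list and a class `label`, while an
    internal node has `test = (attr, threshold)` and two children."""
    if not node.children:            # a leaf: emit one finished rule
        return [(list(conditions), node.label)]
    attr, thr = node.test
    yes, no = node.children
    return (tree_to_rules(yes, conditions + ((attr, '<=', thr),)) +
            tree_to_rules(no,  conditions + ((attr, '>',  thr),)))

# len(tree_to_rules(root)) reproduces the 'No. of rules' column of Table 4.
```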
FIGURE 3. Distribution of results for samples assigned by experts in the class (II). (Paired histograms, UN-PRUNED and POST-PRUNED; y-axis: frequencies (%); x-axis: pollution classes; bars: 21A, 7A, 5A.)
FIGURE 4. Distribution of results for samples assigned by experts in the class (II-III). (Paired histograms, UN-PRUNED and POST-PRUNED; y-axis: frequencies (%); x-axis: pollution classes; bars: 21A, 7A, 5A.)
Due to the unstructured presentation of the rules, the evaluation of the rules by experts is rather difficult. Therefore, the algorithm of Reingold and Tilford for the tidier drawing of trees was implemented; one example of a decision tree is presented in Figure 10. The decision trees can be used for an in-depth evaluation of the induction system results, as well as for the quick classification of new water samples. The leaves of the trees are represented by circles; each leaf corresponds to one rule, while the nodes are represented by the partition values of a given attribute. The Pascal graphics used for displaying the trees enables the use of colours for the leaves; thus, light blue circles are used for class III and red for class IV. By inspection of the displayed coloured decision tree, all misclassifications can easily be detected: on the far left branches of the tree the leaves
FIGURE 5. Distribution of results for samples assigned by experts in the class (III). (Paired histograms, UN-PRUNED and POST-PRUNED; y-axis: frequencies (%); x-axis: pollution classes; bars: 21A, 7A, 5A.)
FIGURE 6. Distribution of results for samples assigned by experts in the class (III-IV). (Paired histograms, UN-PRUNED and POST-PRUNED; y-axis: frequencies (%); x-axis: pollution classes; bars: 21A, 7A, 5A.)
should be coloured light blue and green (representing samples belonging to the less polluted water), while on the far right branches the leaves should be coloured orange and red (showing the more polluted water samples).

In our future work, the Assistant Professional classification system will be applied to the same domain, and a comparison of the results will be made and discussed. The major advantages of the Assistant Professional system are that it can handle incomplete and noisy data, continuous and multivalued attributes, and incomplete domains (where the attributes do not suffice for an exact classification). The system randomly selects a given percentage of learning examples from the whole set, builds a decision tree, and tests it on the remaining examples; it then adds the misclassified examples to the
FIGURE 7. Distribution of results for samples assigned by experts in the class (IV). (Paired histograms, UN-PRUNED and POST-PRUNED; y-axis: frequencies (%); x-axis: pollution classes; bars: 21A, 7A, 5A.)
FIGURE 8. Comparison of automatic (AC) and expert (EC) pollution class assignments. Induction of rules from a training set of 500 examples, described by 21 attributes. AC = automatic classification (Knowledge Maker induction system); EC = experts' panel classification. (Stacked bar chart; x-axis: pollution classes assigned by experts (EC); y-axis: number of samples per pollution class assigned by AC.)
learning set and repeats the step until there are no misclassified examples. The handling of missing values differs from the approach of the original ID3 algorithm: all possible values are assigned to an attribute with a missing value, each weighted with its conditional probability.
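A sketch of that weighting scheme: an example with a missing value is sent down every branch, carrying a weight equal to the conditional probability of the corresponding attribute value, and the weighted class votes are summed at the leaves. The node structure is our own illustration, not the Assistant Professional source.

```python
def classify_weighted(node, example, weight=1.0, votes=None):
    """Descend the tree; when the tested attribute is missing, follow
    every branch with weight * p(value | node), then sum the weighted
    class votes at the leaves. `node.p` is assumed to hold the value
    distribution observed at the node during training."""
    votes = {} if votes is None else votes
    if not node.branches:                      # a leaf
        votes[node.label] = votes.get(node.label, 0.0) + weight
        return votes
    value = example.get(node.attr)
    if value is None:                          # missing: split the weight
        for v, child in node.branches.items():
            classify_weighted(child, example, weight * node.p[v], votes)
    else:
        classify_weighted(node.branches[value], example, weight, votes)
    return votes

# votes = classify_weighted(root, sample)
# predicted = max(votes, key=votes.get)
```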
FIGURE 9. Comparison of automatic (AC) and expert (EC) pollution class assignments. Induction rules result from a training set of 300 examples, described by 21 attributes. AC = automatic classification (Knowledge Maker induction system); EC = experts' panel classification. (Stacked bar chart; x-axis: pollution classes assigned by experts (EC); y-axis: number of samples per pollution class assigned by AC.)
TABLE 4
Number of Rules in the Decision Trees

No. of examples   No. of attributes   No. of rules (unpruned)   No. of rules (pruned)
500               21                  119                       88
500               7                   157                       109
500               5                   180                       106
Other characteristics are the binarisation of attributes (which is unfortunately not automatic; the intervals have to be determined manually) and the filtering of learning examples: only instances that are correctly classified by the Bayesian principle are selected for learning. The biggest improvement is in dealing with noise in the data: tree pruning is applied, which prevents the construction of unreliable subtrees. Thus, the size of the decision tree can be reduced and its accuracy increased. Tree pruning is less useful in cases of complete domains and small training sets. In the latest version of Assistant Professional, a prepruning factor has been introduced: the user can specify how much noise is expected in the data. There is no general rule for determining the prepruning factor; it depends on the domain, and to find the pruning factor for which the decision tree is most accurate, the user must build decision trees with different factors and then choose the tree that suits best (Bratko & Cestnik, 1993). The results and experiences obtained by comparing both systems will be used in other domains, e.g., the classification of pesticide degradation rates based on their structural characteristics.

CONCLUSIONS
An attempt has been made to apply an automatic rule induction system for extracting knowledge from baseline data on selected physical and chemical analyses of Slovenian rivers, in order to facilitate the experts' pollution class assignment. It has been shown that reasonable classification trees can already be developed from 500 baseline examples described by 5 attributes. The reduction of measurements from 49 to as few as 7 or 5 should contribute to directing our expertise and scarce funds toward the potential sources of water pollution, i.e., toward measuring the pollution load of industrial wastewater effluents on the rivers. This approach should change the current practice, which up to now has consisted mostly of monitoring the existing situation, toward the identification of major pollution generators and the prevention of pollution
[Figure 10 shows a screen dump of the graphically displayed decision tree, with pruned nodes marked and a legend listing node numbers, their partition attributes (e.g., Na, NO2, hardness, O2 saturation, COD, BOD, SiO2, conductivity, colour) with partition values, and coloured leaves for the classes I to IV.]
FIGURE 10. Classification tree derived from 500 examples and 21 attributes.
by implementing appropriate wastewater treatment technologies or new, less polluting technologies.

From the theoretical point of view, it is necessary to study further the selection of attributes for describing data sets, and to correlate the selected attributes and their mutual effects with the predictive power of the generalised rules. The major limitation of the commercially available automatic rule induction systems, which must be borne in mind, is that they operate on the "noiseless domain" assumption. Thus, special care should be given to the structure of the training data sets as well as to the description language of the data set.
Acknowledgement--The authors wish to thank the Faculty of Science and Technology--Department of Chemical Methodology and Informatics, the Slovenian Ministry of Research and Technology, and the Environmental Protection Agency, Cincinnati, Ohio, USA, for funding the research work on the induction systems, and the Slovenian Water Work Association for the data and expertise they kindly provided.
REFERENCES

Bratko, I., & Cestnik, B. (1993). Advanced techniques of artificial intelligence; Part II: Machine learning and qualitative modelling. TEMPUS, Postgraduate Study of Water Resources Management and Sanitary Engineering (Joint European Project), Ljubljana, June 21-July 4.

Cestnik, B., Kononenko, I., & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In I. Bratko & N. Lavrač (Eds.), Progress in machine learning (pp. 31-45). Wilmslow, UK: Sigma Press.

Clark, P., & Niblett, T. (1987). Induction in noisy domains. In I. Bratko & N. Lavrač (Eds.), Progress in machine learning (pp. 11-30). Wilmslow, UK: Sigma Press.

Michalski, R.S., Carbonell, J.G., & Mitchell, T.M. (1983). Machine learning: An artificial intelligence approach. Los Altos, CA: Morgan Kaufmann.

Reingold, E.M., & Tilford, J.S. (1981). Tidier drawings of trees. IEEE Transactions on Software Engineering, SE-7(2), 223-228.

Thompson, B., & Thompson, W. (1986). Finding rules in data: An algorithm for extracting knowledge from data. Byte, 11, 149-158.