Evaluation of automatic rule induction systems


Expert Systems With Applications, Vol. 8, No. 1, pp. 77-87, 1995. Copyright © 1994 Elsevier Science Ltd. Printed in the USA. All rights reserved. 0957-4174/95 $9.50 + .00


Evaluation of Automatic Rule Induction Systems

METKA VRTAČNIK AND D. DOLNIČAR
University of Ljubljana, Faculty of Science and Technology, Department of Chemical Education and Informatics, Ljubljana, Slovenia

Abstract--The effects of the number of attributes used to describe the data set, the number of examples included in the training set, and a post-pruning mechanism on the predictive power of classification rules for the automatic assignment of river water pollution levels were studied. In the induction experiments, the original ID3 algorithm embedded in the Knowledge Maker environment was extended with a post-pruning mechanism. To facilitate the evaluation of the developed classification rules, the algorithm of Reingold and Tilford for tidier drawing of trees was implemented. The results showed that classification rules that are efficient in comparison with the experts' class assignment can already be derived from 500 examples of baseline data, each example being described by 5 attributes.

INTRODUCTION

ARTIFICIAL INTELLIGENCE methods and techniques, e.g., expert systems, are becoming important tools in solving chemical problems. Comparative analysis of chemical literature obtained from 1989-90 on-line Chemical Abstracts gave the major domains of emerging expert system applications in chemistry (Figure 1). Conceptualisation, development, and engineering of chemical and biochemical processes are the most rapidly growing fields of expert system applications in chemistry. In these domains expert systems are used in process design, control, and simulation. In analytical chemistry, expert system approaches have started to compete with standard multivariate statistical techniques, e.g., pattern recognition and cluster analysis, which used to be applied to extract value-added information and support routine decision making from large data sets. The vast majority of expert systems in analytical chemistry are oriented toward solving structure-elucidation problems, based on the analysis and automatic synthesis of different spectroscopic data, e.g., IR, NMR, MS, GC/MS, etc., or predicting spectra from structural parameters.

Although the knowledge representation strategies, description of objects or events, rules of generalisation, control strategies, and generalisation levels may differ, depending on the expert system shell or environment used in the development of an expert system prototype as well as on the problem domain, the reasoning or inference processes of the expert system shells are generalised by two mechanisms: backward and forward chaining. The major bottleneck in the quicker design of expert system applications in various fields is the knowledge base, which is usually domain-specific. Two general approaches can be used for the automatic extraction of expert knowledge: deduction from theories or rules, and induction from representative sets of examples or expert experiences. In more sophisticated applications, both approaches usually have to be interlinked to cope with the multi-objective character of the problems.

PROBLEM PRESENTATION

The classification of Slovenian rivers according to their pollution level is defined by four classes:
• Class I: drinking water, or drinking water after treatment
• Class II: water that can be used for bathing, water sports, or fish farming, and could be used for drinking after adequate treatment
• Class III: water that could be used for irrigation or, after adequate treatment, for industrial purposes other than the food industry
• Class IV: polluted water
The criteria for assigning a pollution class to a particular water sample at a sampling site are based on the results of (a) selected physical and chemical analyses, (b) specific analyses of heavy metals and selected organic pollutants, and (c) microbiological analyses. The overall pollution class of a water sample is assigned by an expert panel that includes chemists, biologists, and hydrologists. An attempt was made to induce an automatic classification rule for determining the pollution level of river water samples. Analysis of the problem characteristics

Requests for reprints should be sent to Metka Vrtačnik, University of Ljubljana, Faculty of Science and Technology, Department of Chemical Education and Informatics, Vegova 4, pp 18/l, 61001 Ljubljana, Slovenia.


FIGURE 1. Frequencies of expert system applications in different domains of chemistry (based on references from Chemical Abstracts 1989-1990). TE = technology/engineering; AC = analytical chemistry; WM = waste management; MS = material selection; OK = organic chemistry (synthesis design); BT = biotechnology; BI = biochemistry/pharmacology; DI = data base interface; CA = catalyst design.

showed that the problem can be tackled by applying inductive learning theory and tools for automatic rule induction. Automatic rule induction systems for inducing classification rules have already proved valuable as tools supporting knowledge acquisition for expert systems. Two types of induction algorithms have been successfully applied to different types of classification problems: the ID3 and AQ algorithms (Michalski, Carbonell, & Mitchell, 1983). The major problem in the application of most of the commercially available automatic rule induction systems is that they generalise rules on the basis of the noiseless-domain assumption (no errors in the training data set or in the description language of the data set) (Clark & Niblett, 1987). Therefore, such systems are limited to searching for rules for which no counterexamples exist in the data set for induction. The resulting classification rules most often have low predictive power. In our research, the following effects on the classification results were studied: (a) the number of attributes used for the description of the data set, (b) the number of examples included in the training set for induction of the classification rule, and (c) post-pruning, with classification errors in the nodes estimated on a selected set of data from the monitoring of Slovenian river waters. The commercially available automatic induction system used was Knowledge Maker, which is installed within the Knowledge Pro environment.

DESCRIPTION OF THE DATA SET

The data set comprised 1261 examples, each described by 49 attributes from selected physical, chemical and biological analyses. The original data set was split into two training sets: (a) 500 examples,

(b) 300 examples; as a test set, the whole data set of 1261 examples was used. The examples were chosen in such a way that the proportions of the different pollution classes were preserved in the training and test data sets. (No examples for pollution class I were included in the training data set, because there were only 11 examples in the whole data set.) Among the 49 attributes describing each example, 21 were selected; the selection resulted from a previously conducted Linear Discriminant Analysis (LDA) on the same data set (see Table 1). The goal of the classification is the pollution class.

THEORETICAL BACKGROUND OF THE INDUCTION ALGORITHM

Systems for the automatic induction of knowledge use entropy (information content) as a measure of uncertainty in the classification of objects/instances. The classification problem is defined by the equation:

H(C|a_j) = -Σ_{i=1..N} p(c_i|a_j) log2 p(c_i|a_j)

where
  H(C|a_j)    = entropy of the classification given attribute value a_j
  N           = number of classes
  c_1 ... c_N = the discrete classes
  a_1 ... a_M = the values of the attribute
  p(c_i|a_j)  = probability of an object being in class c_i given value a_j

The first step in calculating the entropy of classification, after deciding on a partitioning attribute, is to split the original table into subtables, each containing the examples with the same value of the partitioning attribute. The entropy of each subtable, H(C|a_j), is calculated for each value a_j of the attribute. To find the entropy of the entire table after the split, the entropy for each value of the attribute, multiplied by the probability that the value appears in the table, must be summed. The resulting entropy of the classification, H(C|A), is the weighted average of the entropies over the values a_j of the attribute:

H(C|A) = Σ_{j=1..M} p(a_j) H(C|a_j)
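The two formulas above can be sketched in a few lines of code (written here in Python for illustration; the paper's own implementation lives inside the Knowledge Maker environment, so the function names below are assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum over classes of p(c) * log2 p(c) for the labels given."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def split_entropy(examples, attr):
    """H(C|A): weighted average of the subtable entropies H(C|a_j).

    `examples` is a list of (attribute_dict, class_label) pairs.
    """
    n = len(examples)
    subtables = {}
    for attrs, label in examples:
        subtables.setdefault(attrs[attr], []).append(label)
    return sum(len(lbls) / n * entropy(lbls) for lbls in subtables.values())

def best_attribute(examples, attributes):
    """The attribute with the smallest H(C|A), i.e., the least uncertainty."""
    return min(attributes, key=lambda a: split_entropy(examples, a))
```

An attribute that separates the classes perfectly yields a split entropy of zero, so `best_attribute` selects it for the initial split, exactly as described in the text.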

If the same calculation is performed for each attribute in the example set, the attribute with the smallest entropy, and thus the least uncertainty, is the best attribute to select for the initial split (Thompson & Thompson, 1986; Clark & Niblett, 1987). In the Knowledge Maker automatic induction system, this algorithm is implemented in such a way that example tables are stored as lists; a list is also used to store attributes and their values. The tree-building procedure is performed by a function called classify. By

TABLE 1
Attributes Used for the Induction of the Decision Tree

Attribute                          Symbol     Units
Conductivity                       PREV       µS/cm
Dissolved oxygen                   O2         mg/dm3
Saturation with oxygen             SAT_O2     mg/dm3
Suspended solids after drying      SU_SUS     mg/dm3
Suspended solids after ignition    SU_ZAR     mg/dm3
Fixed solids ignited at 500°C      ZAR_OST    mg/dm3
Total hardness                     TR_TOT     mg/dm3
NO2-                               NO2        mg/dm3
NO3-                               NO3        mg/dm3
Cl-                                CL         mg/dm3
NH4+                               NH4        mg/dm3
Na+                                NA         mg/dm3
SiO2                               SO2        mg/dm3
Al3+                               AL         mg/dm3
Fe3+                               FE         mg/dm3
Phenol                             FENOL      mg/dm3
Detergents                         DETER      mg/dm3
COD (K2Cr2O7)                      KPK        mg O2/dm3
BOD5                               BPK5       mg O2/dm3
Bacteriological analysis           BAKT_AN    MPN/dm3 *
Measured color                     S_BARVA    special scale

* MPN = most probable number.

calculating H(C|A) for each attribute, the function classify chooses the attribute with the smallest entropy for the initial split of the example set. The classify function produces the tree in a depth-first manner. The binary decision tree is represented by a set of rules organised as if-then clauses forming a nested structure.

DEVELOPMENT OF THE PROGRAMME FOR GRAPHICAL DISPLAY AND POST-PRUNING OF THE DECISION TREE

To display graphically the decision tree developed by the Knowledge Maker induction system, the algorithm of Reingold and Tilford for tidier drawings of trees was implemented (Reingold & Tilford, 1981). The crucial part of this algorithm is its ability to space out subtrees. The algorithm is recursive, and its main principle consists of (1) assigning coordinates to the left and the right subtree of the current node, and (2) spacing out the subtrees as much as necessary if the right and the left subtree overlap. An important part of the algorithm is its set of drawing requirements, predefined and described as "tidy drawings"; the algorithm attempts to satisfy as many of them as possible. The following requirements are typical for trees, and most of them are trivially fulfilled (except the fifth, which is not trivially solvable on all classes of trees): (1) parallel levels of the tree (nodes of the same level lie on a straight line, and the lines that determine levels are pairwise parallel); (2) constant distance between two arbitrary levels; (3) the smallest constant distance between two adjacent nodes of the same level; (4) symmetric position of the father node above its son nodes (father: the current node; sons: nodes of the next level connected to the father); (5) minimal width of the tree (tree width: the horizontal distance between the far left and the far right node of the tree).

In our research, the general Reingold-Tilford algorithm for drawing trees was implemented with the following modifications:
1. zooming of an arbitrary subtree
2. partial display of node labels (attributes with a legend)
3. complete display of node labels
4. display of the occurrences of a chosen attribute in the classification tree.

The graphical display of the classification tree offers a number of advantages for studying the logical structure of the classification tree in comparison with the rules produced by the Knowledge Maker induction algorithm. It enables:
1. review of the presence of a given attribute in the tree
2. display of the partition values of a given attribute
3. identification by experts of contradictions among rules
4. examination of correlations among attributes
5. evaluation by experts of the classification of extreme examples.

The programme developed is also used for the classification of new examples. This is done by a tree search procedure, which is much faster than the sequential testing of rules performed by the Knowledge Maker system.

In incomplete and interdisciplinary domains classification can rarely be exact, yet the basic induction algorithm always aims to construct an exact decision tree. For splitting subsets of learning examples at lower levels, attributes that have no classification power are used. This drawback of the basic algorithm can be partially overcome by pruning the decision tree. Pruning is a mechanism that prevents the construction of unreliable subtrees. The program for drawing classification trees was further developed by the inclusion of a post-pruning mechanism based on the algorithm presented by Cestnik, Kononenko, and Bratko (1987). The major effects of the post-pruning mechanism are (1) decreasing the size of the decision tree (simplifying the classification), while at the same time (2) preserving the accuracy of the classification rule. All programs, for (1) conversion of rules into data for drawing, (2) drawing the decision tree, and (3) post-pruning, are written in Pascal. Pascal graphics is used to display the trees.
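The post-pruning used in the paper follows Cestnik, Kononenko, and Bratko (1987); the sketch below illustrates the general mechanism of bottom-up, error-based post-pruning with a Laplace error estimate, not their exact procedure, and all names in it are illustrative:

```python
class Node:
    """Decision-tree node; a leaf has children == {} and holds class counts."""
    def __init__(self, counts, children=None):
        self.counts = counts            # {class_label: n_examples reaching node}
        self.children = children or {}  # {attribute_value: Node}

def laplace_error(counts, n_classes):
    """Estimated error if the node were a leaf: 1 - (n_maj + 1) / (n + k)."""
    n = sum(counts.values())
    n_maj = max(counts.values())
    return (n - n_maj + n_classes - 1) / (n + n_classes)

def prune(node, n_classes):
    """Bottom-up post-pruning: collapse a subtree to a leaf whenever the
    leaf's estimated error does not exceed the subtree's weighted error.
    Returns the estimated error of the (possibly pruned) node."""
    if not node.children:
        return laplace_error(node.counts, n_classes)
    n = sum(node.counts.values())
    subtree_err = sum(
        sum(child.counts.values()) / n * prune(child, n_classes)
        for child in node.children.values()
    )
    leaf_err = laplace_error(node.counts, n_classes)
    if leaf_err <= subtree_err:
        node.children = {}  # the subtree is unreliable: cut it off
        return leaf_err
    return subtree_err
```

A split that barely changes the class distribution in its children is estimated to be no better than the parent leaf and is cut off, which is exactly the stated effect: a smaller tree with essentially preserved accuracy.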

THE DESCRIPTION OF THE EXPERIMENTAL STRATEGY AND RESULTS OF CLASSIFICATION

In the first experiment, the effect of the number of attributes on the classification precision and on the structure of the classification tree was studied. The experiment was conducted in three parallel runs using the original ID3 algorithm implemented as the function classify in the Knowledge Maker induction system. The parallel experimental runs differ in the number of attributes used for the description of the training data set:
1. first run: 21 attributes, identified as predictor variables by the linear discriminant analysis conducted on the same data set;
2. second run: 7 attributes, selected from the original set of 21 attributes on the basis of experts' evaluation (Table 2);
3. third run: 5 attributes, the first 5 of the 7 used in the second run (Table 2).
The training set was composed in all three parallel

runs of 500 examples. As a test set, the entire data set of 1255 examples was used. The goal of the classification was the overall pollution class as defined by an expert panel for each sample in the data set. In the second experiment, the effect of a post-pruning procedure on the predictive power of the already developed classification trees was studied. The first experimental strategy was therefore repeated, but the resulting classification trees were subjected to the post-pruning algorithm. The results of the first and the second experiments conducted on the three parallel runs are summarised in Table 3, which is composed of six subtables. Each row of the table gives the distribution of classification results obtained from the induction system Knowledge Maker within each expert-assigned pollution class, while the far right column presents the expert class assignment. From the total of 1255 examples, the expert panel assigned:
- 7 examples to pollution class I,
- 43 to pollution class I-II,
- 229 to pollution class II,
- 353 to pollution class II-III,
- 193 to pollution class III,
- 243 to pollution class III-IV, and
- 187 examples to pollution class IV.
Of the 43 examples assigned by experts to pollution class I-II, the automatic induction system in the first run (based on 21 attributes) assigned:
- 27 to the same class, namely I-II,
- 13 to class II, and
- 3 to class II-III.
Frequencies of the distribution of results within each pollution class are given in Figures 2-7, comparing the first (21 attributes), second (7 attributes), and third (5 attributes) runs. In the third experiment, the effect of the size of the training set for the induction of the classification rule on the classification results was studied. The comparative distributions of results, obtained from training sets composed of 500 and 300 examples, respectively, are given in Figures 8 and 9.
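The proportional composition of training and test sets described above is a stratified sample; it can be sketched as follows (a generic illustration in Python; the class labels and set sizes are the paper's, the code itself is an assumption):

```python
import random
from collections import defaultdict

def stratified_sample(examples, n_total, seed=0):
    """Draw roughly `n_total` examples while preserving the class
    proportions of `examples`, a list of (features, pollution_class) pairs."""
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[1]].append(ex)
    rng = random.Random(seed)
    sample = []
    for label, group in by_class.items():
        # each class contributes in proportion to its share of the data
        k = round(n_total * len(group) / len(examples))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

A class with too few members to sample (such as class I here, with only 11 examples) simply contributes a rounded-down share, which matches the paper's decision to leave class I out of the training sets.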

TABLE 2
Attributes Used in the Second and Third Runs of the First Experiment

Second run (7 attributes):

Attribute        Symbol   Unit
BOD5             BPK5     mg O2/dm3
NO2-             NO2      mg/dm3
NO3-             NO3      mg/dm3
Phenol           FENOL    mg/dm3
COD (Cr2O7^2-)   KPK      mg O2/dm3
NH4+             NH4      mg/dm3
Detergents       DETER    mg/dm3

Third run (5 attributes):

Attribute        Symbol   Unit
BOD5             BPK5     mg O2/dm3
NO2-             NO2      mg/dm3
NO3-             NO3      mg/dm3
Phenol           FENOL    mg/dm3
COD (Cr2O7^2-)   KPK      mg O2/dm3

TABLE 3
Comparative Classification Results for Three Parallel Runs

Rows (E): pollution class assigned by the expert panel; columns (C): class assigned by the induction system. Subtables are given for the 21-, 7-, and 5-attribute runs, with the original ID3 results (Knowledge Maker) and the results obtained by applying the post-pruning procedure.

21 attributes, original ID3 algorithm:

E \ C      I   I-II    II  II-III   III  III-IV    IV   Total
I          0      1     6       0     0       0     0       7
I-II       0     27    13       3     0       0     0      43
II         0     12   163      48     2       4     0     229
II-III     0      3    31     252    42      22     3     353
III        0      0    14      45   106      22     6     193
III-IV     0      1     3      19    28     170    22     243
IV         0      0     1       7    10      45   124     187
Total      0     44   231     374   188     263   155    1255

5 attributes, original ID3 algorithm:

E \ C      I   I-II    II  II-III   III  III-IV    IV   Total
I          0      5     2       0     0       0     0       7
I-II       0     27    14       2     0       0     0      43
II         0     13   162      49     2       3     0     229
II-III     0      1    41     245    35      28     3     353
III        0      1    19      45   103      22     3     193
III-IV     0      2     3      20    32     159    27     243
IV         0      0     0      11     8      28   140     187
Total      0     49   241     372   180     240   173    1255

5 attributes, post-pruning applied:

E \ C      I   I-II    II  II-III   III  III-IV    IV   Total
I          0      5     2       0     0       0     0       7
I-II       0     19    22       2     0       0     0      43
II         0      9   149      65     2       4     0     229
II-III     0      1    42     239    35      34     2     353
III        0      1    16      51    94      28     3     193
III-IV     0      2     3      21    31     148    38     243
IV         0      0     0       9     8      28   142     187
Total      0     37   234     387   170     242   185    1255

[The remaining subtables, for the 7-attribute runs and for the post-pruned 21-attribute run, are not legible in this copy.]

DISCUSSION

The comparative results of the automatic induction system Knowledge Maker applied to the classification of river water samples according to their overall pollution levels (Table 3 and Figures 2-7) show:
- the effect of the number of attributes on the precision of the classification frequencies is negligible (Figures 2-7). In cases where only a general orientation on the pollution level of unknown water samples is needed, the core information can be induced from the baseline data with an accuracy higher than 60% from known values of only five attributes/measurements (BOD5, COD, NO2-, NO3-, Phenol). If one also counts instances where the difference between the experts' class assignment and the automatic induction result is only half a class, the classification accuracy increases by a further 20 to 30%;
- the post-pruning mechanism decreases the size of the decision tree (Figure 10), while at the same time the accuracy of the classification rules changes only slightly within each pollution class (right histograms in Figures 2-7);
- the number of examples used for inducing the classification rules affects the distribution of the automatic classification results on the test set, but the differences do not show the same trend for all classes (Figures 8 and 9); e.g., for class II-III, with the induction rule developed from 500 examples the automatic class assignment coincides with the expert assignment for 250 examples (out of 353) of the test set, while with the induction rule derived from 300 examples only 170 examples coincide with the expert class assignment. It can be stated that the accuracy of the classification depends on the number of examples in the training set used for the development of a decision rule.

FIGURE 2. Distribution of results for samples assigned by experts in the class (I-II).

This aspect of the induction systems will be further tested in such a way that all 1259 sets of measurements will be used for the induction of the decision rule, while the experimental results for the years 1989 and 1990 will be used as a test set. The major reason this has not yet been accomplished is that the Knowledge Maker induction algorithm for PCs under the DOS environment does not run if the number of examples exceeds 600. The Knowledge Maker decision system represents classification knowledge in the form of if-then rules. How the number of attributes and the post-pruning procedure affect the number of rules is shown in Table 4.
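The accuracy figures quoted in the discussion (exact agreement, and agreement within half a class) can be computed directly from a confusion matrix such as Table 3; a small illustration follows (the ordered class scale and matrix layout are the paper's, the code is an assumption):

```python
# Ordered pollution classes; adjacent labels differ by half a class.
CLASSES = ["I", "I-II", "II", "II-III", "III", "III-IV", "IV"]

def accuracy(matrix, tolerance=0):
    """Fraction of examples whose predicted class is within `tolerance`
    steps of the expert class. `matrix[e][p]` counts examples with
    expert class index e and predicted class index p."""
    total = sum(sum(row) for row in matrix)
    hits = sum(
        n
        for e, row in enumerate(matrix)
        for p, n in enumerate(row)
        if abs(e - p) <= tolerance
    )
    return hits / total
```

With `tolerance=0` this is the exact agreement rate; with `tolerance=1` predictions that are off by half a class also count, which is the relaxed criterion that raises the reported accuracy by a further 20 to 30%.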

FIGURE 3. Distribution of results for samples assigned by experts in the class (II).

FIGURE 4. Distribution of results for samples assigned by experts in the class (II-III).

Due to the unstructured presentation of the rules, their evaluation by experts is rather difficult. Therefore, through the implementation of the Reingold-Tilford algorithm for tidier drawing of trees, one example of a decision tree is presented in Figure 10. The decision trees can be used for in-depth evaluation of the induction system results, as well as for quick classification of new water samples. Leaves of the trees are represented by circles; each leaf corresponds to one rule, while the nodes are labelled by partition values of a given attribute. The Pascal graphics used for displaying trees enables the application of colours to leaves: light blue circles are used for class III, and red for class IV. By inspection of the displayed coloured decision tree, all misclassifications can be easily detected: on the far left branches of the tree the leaves

FIGURE 5. Distribution of results for samples assigned by experts in the class (III).

FIGURE 6. Distribution of results for samples assigned by experts in the class (III-IV).

should be coloured light blue and green (representing samples belonging to less polluted water), while on the far right branches the leaves should be coloured orange and red (showing more polluted water samples). In our future work, the Assistant Professional classification system will be applied to the same domain, and a comparison of the results will be made and discussed. Major advantages of the Assistant Professional system are that it can handle incomplete and noisy data, continuous and multivalued attributes, and incomplete domains (where the attributes do not suffice for exact classification). The system randomly selects a given percentage of learning examples from the whole set, builds a decision tree, and tests it on the rest of the examples; it then adds the misclassified examples to the

FIGURE 7. Distribution of results for samples assigned by experts in the class (IV).

FIGURE 8. Comparison of automatic (AC) and expert (EC) pollution class assignment. Induction of rules from a training set of 500 examples, described by 21 attributes. AC = Automatic Classification (Knowledge Maker induction system); EC = Experts' panel Classification.

learning set and repeats the step until there are no misclassified examples. Handling of missing values differs from the approach of the original ID3 algorithm: all possible values are assigned to an attribute with a missing value, each weighted with its conditional probability. Other characteristics are binarisation of

FIGURE 9. Comparison of automatic (AC) and expert (EC) pollution class assignment. Induction of rules from a training set of 300 examples, described by 21 attributes. AC = Automatic Classification (Knowledge Maker induction system); EC = Experts' panel Classification.

TABLE 4
Number of Rules in the Decision Trees

No. of examples   No. of attributes   No. of rules (unpruned)   No. of rules (pruned)
500               21                  119                        88
500                7                  157                       109
500                5                  180                       106

attributes (which is unfortunately not automatic; the intervals have to be determined manually) and filtering of learning examples: only instances that are correctly classified by the Bayesian principle are selected for learning. The biggest improvement is in dealing with noise in the data: tree pruning is applied, which prevents the construction of unreliable subtrees. Thus the size of the decision tree can be reduced and its accuracy increased. Tree pruning is less useful in cases of complete domains and small training sets. In the latest version of Assistant Professional, a pre-pruning factor has been introduced: the user can specify how much noise is expected in the data. There is no general rule for determining the pre-pruning factor; it depends on the domain, and the user has to find the pruning factor for which the decision tree is most accurate. For that the user must build decision trees with different factors and then choose the tree that suits best (Bratko & Cestnik, 1993). Results and experiences obtained by comparing both systems will be used in other domains, e.g., the classification of pesticide degradation rates based on structural characteristics.

CONCLUSIONS

An attempt has been made to apply an automatic rule induction system to extract knowledge from baseline data on selected physical and chemical analyses of Slovenian rivers, in order to facilitate the experts' pollution class assignment. It has been shown that reasonable classification trees can already be developed from 500 baseline examples described by 5 attributes. The reduction of measurements from 49 to as few as 7 or 5 should help direct our expertise and scarce funds toward the potential sources of water pollution, that is, measuring the pollution load of industrial wastewater effluents on the rivers. This approach should change the practice, which up to now has consisted mostly of monitoring the existing situation, toward identification of the major pollution generators and prevention of pollution by implementing appropriate wastewater treatment technologies or new, less polluting technologies.

FIGURE 10. Classification tree derived from 500 examples and 21 attributes.

From the theoretical point of view, it is necessary to study further the selection of attributes for describing data sets, correlating the attributes selected and their mutual effects with the predictive power of the generalised rules. The major limitation of the commercially available automatic rule induction systems, which must be borne in mind, is that they operate on the "noiseless domain" assumption. Thus, special care should be given to the structure of training data sets as well as to the description language of the data set.

Acknowledgement--The authors wish to thank the Faculty of Science and Technology, Department of Chemical Methodology and Informatics, the Slovenian Ministry of Research and Technology, and the Environmental Protection Agency, Cincinnati, Ohio, USA, for funding the research work on induction systems, and the Slovenian Water Work Association for the data and expertise they kindly provided.


REFERENCES

Bratko, I., & Cestnik, B. (1993). Advanced techniques of artificial intelligence; Part II: Machine learning and qualitative modelling. TEMPUS, Postgraduate Study of Water Resources Management and Sanitary Engineering (Joint European Project), Ljubljana, June 21-July 4.
Cestnik, B., Kononenko, I., & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In I. Bratko & N. Lavrač (Eds.), Progress in machine learning (pp. 31-45). Wilmslow, UK: Sigma Press.
Clark, P., & Niblett, T. (1987). Induction in noisy domains. In I. Bratko & N. Lavrač (Eds.), Progress in machine learning (pp. 11-30). Wilmslow, UK: Sigma Press.
Michalski, R.S., Carbonell, J.G., & Mitchell, T.M. (1983). Machine learning: An artificial intelligence approach. Los Altos, CA: Morgan Kaufmann.
Reingold, E.M., & Tilford, J.S. (1981). Tidier drawings of trees. IEEE Transactions on Software Engineering, SE-7(2).
Thompson, B., & Thompson, W. (1986). Finding rules in data: An algorithm for extracting knowledge from data. Byte, 11, 149-158.