Pattern Recognition Letters 24 (2003) 2839–2848 www.elsevier.com/locate/patrec

Floating search algorithm for structure learning of Bayesian network classifiers

Franz Pernkopf a,*, Paul O'Leary b,1

a Institute of Communications and Wave Propagation, Graz University of Technology, Inffeldgasse 16c II, Graz 8010, Austria
b Institute of Automation, University of Leoben, Leoben 8700, Austria

Received 3 February 2003; received in revised form 10 June 2003

Abstract

This paper presents a floating search approach for learning the network structure of Bayesian network classifiers. A Bayesian network classifier is used which, in combination with the search algorithm, allows simultaneous feature selection and determination of the structure of the classifier. The introduced search algorithm enables conditional exclusions of previously added attributes and/or arcs from the network classifier. Hence, this algorithm is able to correct the network structure by removing attributes and/or arcs between the nodes if they become superfluous at a later stage of the search. Classification results of selective unrestricted Bayesian network classifiers are compared to naïve Bayes classifiers and tree augmented naïve Bayes classifiers. Experiments on different data sets show that selective unrestricted Bayesian network classifiers achieve a better classification accuracy estimate in two domains compared to tree augmented naïve Bayes classifiers, while in the remaining domains the performance is similar. However, the resulting network structure of selective unrestricted Bayesian network classifiers is simpler.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Bayesian network classifiers; Feature selection; Floating search method

1. Introduction

In classification problems the relevant attributes are often unknown a priori. Thus, many features are derived, and the features which do not contribute to, or even degrade, the classification performance have to be removed. This step is performed during feature selection. The main purpose of feature selection is to reduce the number of extracted features to a set of a few significant ones for classification while maintaining the classification rate. The reduction of the feature set size may even improve the classification accuracy by reducing estimation errors associated with finite sample size effects (Jain and Chandrasekaran, 1982). This behaviour of practical classification approaches is basically caused by insufficient modeling of the class-conditional probability density distributions obtained from the underlying samples.

* Corresponding author. Tel.: +43-3168734436; fax: +43-3168734432. E-mail address: [email protected] (F. Pernkopf).
1 Tel.: +43-3842-402-9031.

0167-8655/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0167-8655(03)00142-9

The best subset of features

X = {Xi | i = 1, ..., d, Xi ∈ Y}    (1)

is selected from the set

Y = {Yi | i = 1, ..., D},    (2)

where D is the number of extracted features and d ≤ D denotes the size of the feature subset (Devijver and Kittler, 1982). A feature selection criterion function J(X) evaluates a chosen subset X, whereby a higher value of J indicates a better subset. The problem of feature selection (Zongker and Jain, 1996; Jain and Zongker, 1997) is to find a subset X ⊆ Y such that the number of chosen features |X| is d and J reaches the maximum

J(Xopt) = max_{X ⊆ Y, |X| = d} J(X).    (3)

As evaluation criterion J the cross-validation classification accuracy estimate of the Bayesian network classifier (see Section 2) has been selected during the experiments (see Section 4). Other evaluation measures are available, but the literature (Kohavi and John, 1997; John et al., 1994; Kohavi, 1994; Dash and Liu, 1997) shows that selecting the subset by means of the classifier that will later be used yields the most accurate results. Unfortunately, this is computationally very demanding. John et al. (1994) divide feature selection algorithms into two major groups, the filter approach and the wrapper approach. The filter approach assesses the relevance of the features from the data set, mainly based on statistical measures. The effects of the selected features on the performance of a particular classifier are neglected. In contrast, the wrapper approach uses the classification performance of the classifier itself for evaluating the feature subsets during the search. John et al. (1994) claim that the wrapper approach is more appropriate, since the selection of a feature subset which takes the induction algorithm into account achieves a high predictive accuracy on unknown test data. However, this approach is associated with high computational costs, but advances in computer technology make

the wrapper method feasible. The filter approach is mostly used in data mining applications where huge data sets are considered. Blum and Langley (1997) present a third feature selection approach, called embedded, in which feature selection is performed implicitly while the classification algorithm is being constructed. Basically, there are two main ways to improve the classification accuracy of classification algorithms. The first is to reduce the dimensionality by discarding irrelevant features from the feature set for classification; this is performed during feature selection. The second is to model statistical dependencies between attributes to achieve an improvement of the classification accuracy. Singh and Provan (1996) combine both approaches for Bayesian network classifiers, whereby in a first step the features are selected using an information theoretic measure and in a second step the network is constructed with the selected subset in a scoring-based manner. In this work, the objective is to maximize the performance of a classification algorithm by doing both simultaneously: removing irrelevant features and relaxing independence assumptions between correlated features using a scoring-based approach. The applied classification method is restricted to Bayesian network classifiers. To this aim, a search algorithm is introduced to learn the structure (dependencies between the attributes) of the Bayesian network. The search algorithm, which is well-established in the feature selection community (Pudil et al., 1994; Jain and Zongker, 1997), enables conditional exclusions of previously added attributes and/or arcs from the classifier structure. Hence, this algorithm is able to correct the network structure by removing arcs and/or attributes if they become superfluous at a later stage of the search.
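The wrapper criterion J(X) described above, the cross-validation classification accuracy of the classifier restricted to a candidate subset X, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `classifier_factory` is a hypothetical stand-in for any classifier object with `fit`/`predict` methods.

```python
import random

def cv_accuracy(classifier_factory, X_rows, y, feature_subset, k=5):
    """Wrapper criterion J(X): k-fold cross-validation accuracy of the
    classifier trained on only the features in `feature_subset`."""
    idx = list(range(len(X_rows)))
    random.Random(0).shuffle(idx)          # fixed seed: same folds for every subset
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for f in range(k):
        test = set(folds[f])
        train = [i for i in idx if i not in test]
        clf = classifier_factory()
        clf.fit([[X_rows[i][j] for j in feature_subset] for i in train],
                [y[i] for i in train])
        for i in folds[f]:
            correct += clf.predict([X_rows[i][j] for j in feature_subset]) == y[i]
    return correct / len(X_rows)
```

A higher value of J indicates a better subset; scoring every candidate subset this way is exactly what makes the wrapper approach computationally demanding.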

2. Bayesian network classifiers

The Bayes decision rule (Duda and Hart, 1973; Langley et al., 1992) is a classification method based on Bayes' theorem. A feature vector x belongs to the class ωj according to the a posteriori probability P(ωj|x). This a posteriori probability P(ωj|x) is determined using Bayes' theorem as

P(ωj|x) = P(x|ωj)P(ωj) / P(x),    (4)

where

P(x) = Σ_{j=1}^{t} P(x|ωj)P(ωj)    (5)

and t corresponds to the number of class labels. Since the denominator of Bayes' theorem (Eq. (5)) is the same for all classes, it can be ignored. Further, it is assumed that the conditional probability density functions P(x|ωj) and the a priori probabilities P(ωj) of all classes are given. Normally, these probabilities are estimated from a data set during training of the decision rule. Once the classifier has been trained, classification of an instance x is performed by assigning a class label ωi to x. This is done in accordance with the a posteriori probability P(ωi|x) determined by Bayes' theorem, which has to be the largest among the different classes Ω = {ω1, ..., ωt}. Hence, pattern x is assigned to the class ωi as

x → ωi   if   P(ωi|x) = max_{j=1,...,t} P(ωj|x).    (6)
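The decision rule of Eqs. (4)–(6) translates directly into code. The sketch below is a hypothetical illustration: `priors` maps each class label to P(ωj) and `likelihood(x, w)` returns P(x|ωj); the denominator P(x) is dropped since it is identical for every class.

```python
def bayes_decision(x, priors, likelihood):
    """Assign x to the class with the largest a posteriori probability
    (Eq. (6)).  The unnormalized score P(x|w) * P(w) suffices, because the
    denominator P(x) of Bayes' theorem is the same for all classes."""
    scores = {w: likelihood(x, w) * p for w, p in priors.items()}
    return max(scores, key=scores.get)
```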

The naïve Bayes decision rule assumes that all the attributes are conditionally independent given the class label. As reported in the literature (Langley and Sage, 1994; Friedman et al., 1997), the performance of the naïve Bayes classifier is surprisingly good even though the independence assumption between attributes is unrealistic in most of the data sets. Independence between the features ignores any correlation among them. Kononenko (1991) extends the naïve Bayes approach by joining dependent attributes into groups, where independence is only assumed between the different groups. The idea of Friedman et al. (1997) and Friedman and Goldszmidt (1996) is to improve the classification performance by relaxing the independence assumption using tree augmented naïve Bayes classifiers. To this end, Bayesian networks (Pearl, 1988; Jensen, 1996; Heckerman, 1995) are used to represent the classifier. A Bayesian network B = ⟨G, Θ⟩ is a directed acyclic graph G which models probabilistic relationships among


a set of random variables U = {X1, ..., Xn, Ω} = {U1, ..., Un+1}, where each variable in U has specific states or values denoted by lower case letters {x1, ..., xn, ω}. The symbol n denotes the number of attributes. Each vertex (node) of the graph represents a random variable, while the edges capture the direct dependencies between the variables. The network encodes the conditional independence relationship that each node is independent of its nondescendants given its parents. These conditional independence relationships reduce the number of parameters needed to represent a probability distribution. In the naïve Bayes setup the attribute values of Xi and Xj (Xi ≠ Xj) are conditionally independent given the class label of node Ω. Hence, xi is conditionally independent of xj given class ω whenever P(xi|ω, xj) = P(xi|ω) for all xi ∈ Xi, xj ∈ Xj, ω ∈ Ω, and when P(xj, ω) > 0. The symbol Θ represents the set of parameters which quantify the network. Each node contains a local probability distribution given its parents. The joint probability distribution is uniquely determined by these local probability distributions. The parameters of the network are estimated by the maximum likelihood method. The structure of the naïve Bayes classifier represented as a Bayesian network is illustrated in Fig. 1. Feature selection is introduced to this network by removing irrelevant features by means of, e.g., a search algorithm (see Section 3). This extension of the naïve Bayes decision rule is known as the selective naïve Bayes classifier. The structure in Fig. 1 shows that each attribute is conditionally independent of the remaining attributes given the label ω of the class variable. The class variable Ω is the only parent of each attribute Xi, denoted as ΠXi = {Ω} for all 1 ≤ i ≤ n. Hence, the joint

Fig. 1. Structure of a na€ıve Bayes network.


probability distribution P(X1, ..., Xn, Ω) for the network depicted in Fig. 1 is determined to be

P(X1, ..., Xn, Ω) = P(Ω) ∏_{i=1}^{n} P(Xi|Ω),    (7)

and from the definition of conditional probability the probability for the classes in Ω given the values of the attributes is

P(Ω|X1, ..., Xn) = α P(Ω) ∏_{i=1}^{n} P(Xi|Ω),    (8)

where α is a normalization constant. Since the attributes may be correlated and the independence assumption of the naïve Bayes classifier is unrealistic, Friedman et al. (1997) and Friedman and Goldszmidt (1996) introduce the tree augmented naïve Bayes classifier. It is based on the structure of the naïve Bayes network where the class variable is the parent of each attribute. Hence, the a posteriori probability P(Ω|X1, ..., Xn) takes all the attributes into account. Additionally, edges (arcs) among the attributes are allowed which capture the correlations among them. Each attribute may have at most one other attribute as additional parent, which means that there is an arc from feature Xi to feature Xj. This implies that these two attributes Xi and Xj are not independent given the class label. The influence of Xj on the class probabilities depends also on the value of Xi. An example of a tree augmented naïve Bayes network is shown in Fig. 2. A tree augmented naïve Bayes network is initialized as a naïve Bayes network. Additional arcs between attributes are found by means of a search algorithm (see Section 3). The maximum number of arcs added to relax the independence assumption between the attributes is n − 1, where n denotes the number of attributes.

Fig. 2. Structure of a tree augmented naïve Bayes network.

The selective unrestricted Bayesian network classifier (Singh and Provan, 1996) (see Fig. 3) can be viewed as a generalization of the tree augmented naïve Bayes network. The class node is treated as the root node, which cannot be the child of any attribute. The attributes need not be connected directly to the class node, as is required for the tree augmented naïve Bayes network. After initialization the network only consists of the class node, and the search algorithm (see Section 3) adds attributes and/or arcs to the network so that the evaluation criterion is maximized. In fact, only arcs are added between the nodes, so that the class node remains a root of the network. If there is no arc between an attribute and the network structure then the attribute is not considered during classification. During the determination of the network structure irrelevant features are not included and the classifier is based on a subset of selected features. Since this network is almost unrestricted, the computational effort for determining the network structure is huge, especially if a large number of attributes is available. Additionally, the size of the conditional probability tables of the nodes increases exponentially with the number of parents. This might result in a more unreliable probability estimate for nodes which have a large number of parents. The conditional distribution of Ω given the value of all attributes is only sensitive to those attributes which form the Markov blanket of Ω (Pearl, 1988). The Markov blanket of the class node Ω consists of the direct parents of Ω, the direct successors (children) of Ω, and all the direct parents of the direct successors (children) of the class node Ω. All the features outside the Markov

Fig. 3. Structure of a selective unrestricted Bayesian network.


blanket do not have any effect on the classification performance. Introducing this knowledge into the search algorithm reduces the search space and the computational effort for determining the structure of the classifier.
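The Markov blanket pruning described above can be made concrete. In the sketch below (an illustration, not the paper's code), a DAG is given as a mapping from each node to the set of its parents; the blanket of the class node is then its parents, its children, and its children's other parents.

```python
def markov_blanket(node, parents):
    """Return the Markov blanket of `node` in a DAG described by
    {child: set of parents}: parents, children, and co-parents of children."""
    children = {c for c, ps in parents.items() if node in ps}
    blanket = set(parents.get(node, set())) | children
    for c in children:
        blanket |= parents[c]          # the children's other parents
    blanket.discard(node)
    return blanket
```

Attributes outside this set cannot influence the class posterior, so candidate structures that differ only outside the blanket need not be evaluated, which shrinks the search space.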

3. Classical floating search (CFS)

For learning the structure of the Bayesian network that matches the data best, Keogh and Pazzani (1999) propose hill climbing search and a more efficient search algorithm called SuperParent. As scoring function the cross-validation classification accuracy estimate is used to guide the search. Keogh and Pazzani use leave-one-out cross-validation to estimate the accuracy of the current network with the added arcs. In the following, the algorithm for finding the structure of the tree augmented naïve Bayes network using hill climbing search is outlined (Keogh and Pazzani, 1999).

procedure HillClimbingSearchNet
begin
  Initialize network to naïve Bayes.
  Evaluate the current network.
  while classification accuracy improves
    Consider adding every legal arc to the current structure of the classifier and evaluate the classifier.
    if there is an arc addition which improves the accuracy then
      Add the arc which gives the largest classification accuracy improvement among all the possible arcs to the current network.
    else
      Return current network.
    end
  end
end

Each possible arc from Xi to Xj (Xi ≠ Xj, Xj is a node without another attribute as parent) is evaluated within the while-loop. If the classification accuracy estimate is enhanced with respect to the current network, then the current network is updated with the arc which gives the largest improvement. Otherwise, if no arc results in an enhancement of the classification performance, the current classifier is returned. This algorithm concludes the search if there is no arc which results in an enhancement of the classification accuracy estimate or if no attribute without another attribute as parent is available.

An improvement over hill climbing search (Keogh and Pazzani, 1999) is to apply the classical sequential floating algorithm used in feature selection applications (Pudil et al., 1994). We adopt this algorithm for determining the network structure of tree augmented naïve Bayes classifiers and selective unrestricted Bayesian network classifiers. The main disadvantage of hill climbing search for determining the network structure is that once an arc has been added to the network structure, the algorithm has no mechanism for removing the arc at a later stage. Hence this algorithm suffers from the nesting effect (Kittler, 1978). To overcome this drawback, Pudil et al. (1994) present a floating search method for finding significant features which optimize the classification performance in feature selection tasks. This algorithm allows conditional exclusions of previously added attributes and/or arcs from the augmented network. Hence, it is able to correct wrong decisions made in previous steps. This results in a higher classification accuracy estimate; however, this search strategy uses more evaluations to obtain the network structure and is consequently computationally less efficient than hill climbing search. The classical floating search algorithm is based on a method for adding and a method for removing attributes and/or arcs from the network structure. In the following, the floating algorithm is presented for establishing the structure of the tree augmented naïve Bayes network.
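The HillClimbingSearchNet procedure above can be sketched in a few lines of Python. This is a hedged illustration, not Keogh and Pazzani's code: `evaluate(arcs)` stands in for the cross-validation accuracy estimate of the network with the given augmenting arcs, and `legal_arcs(arcs)` must enforce the TAN restrictions (the target is an orphan, and no cycles arise).

```python
def hill_climbing_arcs(evaluate, legal_arcs):
    """Greedy arc addition: start from the naive Bayes structure (no
    augmenting arcs) and repeatedly add the single arc that improves the
    score the most; stop when no arc improves it."""
    arcs = set()
    best = evaluate(arcs)
    while True:
        candidates = [a for a in legal_arcs(arcs) if a not in arcs]
        if not candidates:
            return arcs, best
        score, arc = max((evaluate(arcs | {a}), a) for a in candidates)
        if score <= best:          # no arc improves the accuracy estimate
            return arcs, best
        arcs.add(arc)
        best = score
```

Note that an arc, once added, is never reconsidered; this is the nesting effect that the floating search is designed to avoid.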
Subsequently, the necessary modifications of the algorithm for learning the structure of selective unrestricted Bayesian network classifiers are presented.

Adding the most significant arc (AddArc): This algorithm is highly similar to the hill climbing search. Each possible arc A from Xi to Xj (Xi ≠ Xj, Xj ∈ O) is evaluated, whereby O is the set of orphans. Keogh and Pazzani (1999) define a node


without a parent except the class node as an orphan. The current network is updated with the arc which gives the largest increase of J among the possible arcs. The potential arc set Ap consists of all arcs from Xi to Xj (Xi ≠ Xj, Xj ∈ O). A possible arc A is an element of Ap (A ∈ Ap). The adding arc (AddArc) algorithm adds at each stage k the most significant arc As to the network structure G,

Gk+1 = Gk ∪ As,    (9)

so that the graph Gk+1 maximizes the evaluation criterion J,

J(Gk+1) = max_{A ∈ Ap} J(Gk ∪ A).    (10)

Removing the least significant arc (RemArc): The removing arc method (RemArc) is the counterpart of the adding arc algorithm. Thereby, in each step the least significant arc Al is discarded from the set of previously added arcs Gk \ G0, where G0 is the structure of a naïve Bayes network. The remaining network structure

Gk−1 = Gk \ Al    (11)

achieves the largest evaluation performance

J(Gk−1) = max_{A ∈ Gk \ G0} J(Gk \ A).    (12)
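Eqs. (9)–(12) translate directly into code. In this hypothetical sketch a graph is represented simply by its set of arcs, and `J` is the evaluation criterion (e.g. the cross-validation accuracy estimate):

```python
def add_arc(graph, potential_arcs, J):
    """AddArc (Eqs. (9)-(10)): include the most significant arc, i.e. the
    candidate whose addition maximizes the criterion J."""
    best = max(potential_arcs - graph, key=lambda a: J(graph | {a}))
    return graph | {best}

def rem_arc(graph, base, J):
    """RemArc (Eqs. (11)-(12)): exclude the least significant previously
    added arc; arcs in `base` (the initial structure G0) are never removed."""
    worst = max(graph - base, key=lambda a: J(graph - {a}))
    return graph - {worst}
```

The least significant arc is the one whose removal leaves the highest-scoring graph, which is why `rem_arc` also maximizes over J.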

The classical floating search algorithm is a bottom-up search procedure introduced by Pudil et al. (1994). The algorithm includes new arcs which maximize the criterion J by means of the AddArc procedure starting from the current graph. Afterwards, conditional exclusions of the previously updated graph take place by applying the RemArc method. If no arc can be excluded anymore, the algorithm proceeds again with the AddArc algorithm. The classical floating algorithm is presented in the following.

procedure CFS
begin
  Input: U = {X1, ..., Xn, Ω}  // available variables
  Output: Gk, k = 0, 1, ..., n − 1
  Termination: Stop when k = n − 1 or when e.g. a predetermined number of arcs has been added. The evaluation values J(Gi) for all preceding graphs i = 1, 2, ..., n − 1 are stored.
  Initialization: k = 0; graph G0 is initialized to naïve Bayes; the AddArc algorithm is used to add 2 arcs, k = 2; go to Step 1
  repeat
    Step 1 (Inclusion)
      Use the AddArc method to select the most significant arc Ak+1 from the set of potential arcs Ap to form the graph Gk+1 = Gk ∪ Ak+1; go to Step 2
    Step 2 (Conditional exclusion)
      Find the least significant arc in the graph Gk+1 by using the RemArc algorithm.
      if Ak+1 is the least significant arc in the graph Gk+1 \ G0, i.e. J(Gk+1 \ Ak+1) ≥ J(Gk+1 \ Aj), ∀ j = 1, ..., k, then
        k = k + 1; go to Step 1
      else if Ar, 1 ≤ r ≤ k, is the least significant arc in the graph Gk+1 \ G0, i.e. J(Gk+1 \ Ar) > J(Gk), then
        exclude Ar from Gk+1 to form a new graph G'k = Gk+1 \ Ar. Now J(G'k) > J(Gk).
        if k = 2 then Gk = G'k; J(Gk) = J(G'k); go to Step 1
        else go to Step 3
    Step 3 (Continuation of conditional exclusion)
      Find the least significant arc At in the graph G'k.
      if J(G'k \ At) ≤ J(Gk−1) then
        set Gk = G'k, J(Gk) = J(G'k); go to Step 1
      if J(G'k \ At) > J(Gk−1) then
        exclude At from G'k to form a newly reduced graph G'k−1 = G'k \ At; k = k − 1
        if k = 2 then Gk = G'k; J(Gk) = J(G'k); go to Step 1
        else go to Step 3
  until k = n − 1 or e.g. a predetermined number of arcs has been added.
end
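The control flow of procedure CFS can be condensed into a small driver. The following is a simplified, hypothetical sketch: it tracks the best graph found at each size, as the stored J(Gi) values do, but omits the special-casing of k = 2.

```python
def floating_search(potential_arcs, J, max_arcs):
    """Simplified floating search: add the most significant arc (Step 1),
    then conditionally exclude least significant arcs while the reduced
    graph beats the best graph of the same size seen so far (Steps 2-3)."""
    arcs, best_at = set(), {0: J(set())}
    best_score, best_arcs = best_at[0], set()
    while len(arcs) < max_arcs and potential_arcs - arcs:
        # Step 1 (inclusion): most significant arc
        a = max(potential_arcs - arcs, key=lambda x: J(arcs | {x}))
        arcs = arcs | {a}
        while True:
            s = J(arcs)
            if s > best_at.get(len(arcs), float("-inf")):
                best_at[len(arcs)] = s
                if s > best_score:
                    best_score, best_arcs = s, set(arcs)
            if len(arcs) <= 1:
                break
            # Steps 2-3: least significant arc; exclude it only if the
            # reduced graph improves on the best graph of its size
            r = max(arcs, key=lambda x: J(arcs - {x}))
            if J(arcs - {r}) > best_at.get(len(arcs) - 1, float("-inf")):
                arcs = arcs - {r}
            else:
                break
    return best_arcs, best_score
```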


This algorithm concludes the search when the maximum of n − 1 correlation arcs has been added to the network. This means that the potential arc set Ap = ∅. However, the network achieving the largest classification accuracy estimate is not necessarily the network with all possible arcs added. Hence it would be feasible to terminate this search algorithm at an earlier stage. Since consecutive backward steps are possible, the algorithm should not be stopped at a stage where additional arcs do not enhance the evaluation criterion.

For determining the structure of the selective unrestricted Bayesian network the following modifications of the above presented algorithm are necessary:

• For the selective unrestricted Bayesian network classifier, arcs are added between the nodes in such a way that the class node always remains a root of the network. If there is no arc between a loose attribute and the network structure then the attribute is not considered during classification (feature selection).
• The structure of the tree augmented naïve Bayes classifier is restricted so that each attribute Xi can have only one other attribute as parent in addition to the class node. For the selective unrestricted Bayesian network classifier, on the other hand, the selection of the arcs in the function AddArc is not restricted to targets Xj in the set of orphans O. Each attribute can have several attributes as parent nodes.
• The structure is not initialized to the naïve Bayes structure. The initial structure of the selective unrestricted Bayesian network consists of the class node as root of the network without any other arcs.
• The function AddArc also adds arcs from the class node Ω to Xj.

This floating search algorithm facilitates the correction of wrong decisions made in previous steps. Therefore, it may approximate the optimal solution better than hill climbing search. However, the classical floating search method is usually more time consuming, especially for data of high complexity and dimensionality.


An efficient evaluation of the classifier may be achieved by ordering the training instances so that the misclassified samples of previous classifications are classified first (Keogh and Pazzani, 1999). The classification algorithm is terminated as soon as the number of misclassified samples exceeds the error rate of the current best classifier network.
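This instance-ordering trick can be sketched as follows (a hypothetical illustration of the idea, not Keogh and Pazzani's code): previously misclassified instances are evaluated first, and the evaluation of a candidate network is abandoned as soon as it has already made more errors than the current best network.

```python
def evaluate_with_cutoff(classify, instances, labels, best_errors, prev_errors):
    """Score a candidate classifier, stopping early once it is certain to
    be worse than the best network found so far.  `prev_errors` holds the
    indices misclassified by earlier candidates; testing them first tends
    to trigger the cutoff sooner."""
    order = sorted(range(len(instances)), key=lambda i: i not in prev_errors)
    errors, misclassified = 0, set()
    for i in order:
        if classify(instances[i]) != labels[i]:
            errors += 1
            misclassified.add(i)
            if errors > best_errors:
                return None, misclassified   # cannot beat the current best
    return errors, misclassified
```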

4. Experimental results

Experiments have been performed on eight data sets from the UCI repository (Merz et al., 1997) which are shown in Table 1. The attributes in the data sets are multinomial and continuous-valued. Since the classifiers are constructed for multinomial attributes, the feature space was discretized in the manner described in (Fayyad and Irani, 1993), whereby the partition boundaries for discretizing the continuous-valued attributes were established only through the training data set. The discretization stages were carried out with the MLC++ system (Kohavi et al., 1994). The Bayesian network was constructed using the Matlab toolbox developed by Murphy (2001). During the experiments a five-fold cross-validation accuracy estimate of the classifier was used as scoring function J for finding the optimal structure of the network. Hence, the accuracy estimate of each classifier is given by the successful predictions on the test sets of each data set. All the experiments are based on exactly the same cross-validation folds. Table 2 shows the classification accuracy estimate, the number of evaluations (#Ev) used for search, and the number of features

Table 1
Data sets used for the experiment

Data set     # Features   # Classes   # Instances
Australian   14           2           690
Flare        10           2           1066
Glass        9            7           214
Glass2       9            2           163
Heart        13           2           270
Pima         8            2           768
Vote         16           2           435
Vehicle      18           4           846


Table 2
Five-fold cross-validation classification accuracy estimate for the eight data sets

Data set     3-NN   NB     CFS-SNB           HCS-TAN           CFS-TAN           CFS-SUN
             %      %      %      #F  #Ev    %      #A  #Ev    %      #A  #Ev    %      #F  #Ev
Australian   67.33  86.38  88.14   7  158    88.43   8  1066   88.43   8  1092   89.43  11  8376
Flare        82.93  82.93  84.07   3   29    84.25   3   241   84.25   3   243   84.07   3   299
Glass        60.33  67.98  71.69   6   98    73.46   4   235   73.46   4   240   72.96   9   801
Glass2       67.52  78.96  79.58   8  104    82.71   3   190   82.71   3   192   82.66   4   487
Heart        77.78  83.33  85.92   5  182    86.29   3   430   86.29   3   432   85.92   5  1017
Pima         57.66  71.47  74.21   5   39    75.37   5   201   75.37   5   210   75.13   5   648
Vote         92.61  90.06  95.87   4   83    96.31  11  1761   96.31   9  1836   95.86   2   511
Vehicle      69.32  60.52  65.05   5   89    74.51  16  2737   74.77  16  3853   75.36  13  5477

(#F) or arcs (#A) used to achieve this classification accuracy estimate for the approaches discussed in the previous sections. The following abbreviations are used for the different classification approaches:

• 3-NN: 3-nearest neighbor classifier.
• NB: Naïve Bayes classifier.
• CFS-SNB: Selective naïve Bayes classifier using the classical floating search.
• HCS-TAN: Tree augmented naïve Bayes classifier using hill-climbing search.
• CFS-TAN: Tree augmented naïve Bayes classifier using classical floating search.
• CFS-SUN: Selective unrestricted Bayesian network using classical floating search.

The best achieved classification accuracy is emphasized by boldface letters. The selective naïve Bayes classifier (CFS-SNB) achieves a better classification accuracy estimate on all data sets than the naïve Bayes (NB) approach based on all available attributes. The computational effort to determine the CFS-SNB classifier is relatively small. Additionally, the five-fold cross-validation accuracy estimate of the 3-NN classifier using the discretized features is given for comparison in Table 2. For the data sets Australian, Flare, and Heart the tree augmented naïve Bayes classifier is only slightly better than the selective naïve Bayes classifier; however, the CFS-SNB classifier has a much simpler structure and a smaller number of parameters is necessary. This means that these domains have some attributes which do not contribute much to the accuracy estimate, e.g. the


CFS-SNB utilizes just three out of the ten available attributes for estimating the classification accuracy of the Flare data set. Additionally, the number of evaluations used for determining the TAN network is much higher than for establishing the CFS-SNB. The structure of the tree augmented naïve Bayes classifier is established either with hill-climbing search or with the proposed classical floating search algorithm. For almost every data set, both algorithms achieve exactly the same classification performance. Just for the Vehicle data set, a slightly better accuracy estimate is obtained by using the CFS approach. The floating property of CFS typically comes into play at a later stage of the search, where more attributes have been involved in the classifier structure. Since most of the data sets have just a few relevant attributes, the classical floating search algorithm does not perform backward steps. It can be seen that the network structure and the number of evaluations needed for establishing the structure are closely related for both search strategies on almost all data sets. The selective unrestricted Bayesian network achieves a better classification performance than the TAN network for the data sets Australian and Vehicle. On the remaining data sets the classification performance is slightly worse. However, the resulting network of the CFS-SUN classifier is simpler while the predictive accuracy estimate is comparable to the TAN networks (see the number of features (#F) and the number of arcs (#A) used for the classifiers in Table 2). Since the search space for the SUN structure is much larger than for the TAN structure, the


number of evaluations for finding the optimal network is larger.

5. Conclusions and future work

In this paper a classical sequential floating search algorithm for determining the network structure of a tree augmented naïve Bayes network and a selective unrestricted Bayesian network classifier has been presented. This algorithm is capable of removing previously added arcs at a later stage of the search if they turn out to be irrelevant. The experiments are performed on data sets from the UCI repository. They show that selective unrestricted Bayesian network classifiers achieve similar results compared to tree augmented naïve Bayes classifiers. However, in two domains the selective unrestricted Bayesian network achieves a better classification accuracy estimate than the tree augmented naïve Bayes network. Future work is focused on using Bayesian network classifiers for continuous-valued attributes, since it is assumed that the classification accuracy estimate of the classifiers can be increased. Additionally, special attention will be dedicated to Bayesian multinets, which are a generalization of the unrestricted Bayesian network: they allow different relations among the features for each class, i.e. a different structure of the network for each class.

References

Blum, A.L., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artif. Intell. 97 (1–2), 245–271.
Dash, M., Liu, H., 1997. Feature selection for classification. Intell. Data Anal. 1 (3), 131–156.
Devijver, P.A., Kittler, J., 1982. Pattern Recognition: A Statistical Approach. Prentice Hall International.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. Wiley-Interscience.
Fayyad, U.M., Irani, K.B., 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022–1027.


Friedman, N., Goldszmidt, M., 1996. Building classifiers using Bayesian networks. In: Proceedings of the National Conference on Artificial Intelligence, pp. 1277–1284.
Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian network classifiers. Machine Learning 29, 131–163.
Heckerman, D., 1995. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.
Jain, A.K., Chandrasekaran, B., 1982. Dimensionality and sample size considerations in pattern recognition in practice. In: Handbook of Statistics, vol. 2. North-Holland, Amsterdam.
Jain, A.K., Zongker, D., 1997. Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Machine Intell. 19 (2), 153–158.
Jensen, F.V., 1996. An Introduction to Bayesian Networks. UCL Press Limited.
John, G.H., Kohavi, R., Pfleger, K., 1994. Irrelevant features and the subset selection problem. In: Proceedings of the 11th International Conference on Machine Learning, pp. 121–129.
Keogh, E.J., Pazzani, M.J., 1999. Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In: Proceedings of the 7th International Workshop on Artificial Intelligence and Statistics, pp. 225–230.
Kittler, J., 1978. Feature set search algorithms. In: Chen, C.H. (Ed.), Pattern Recognition and Signal Processing. Sijthoff and Noordhoff, pp. 41–60.
Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artif. Intell. 97, 273–324.
Kohavi, R., John, G., Long, R., Manley, D., Pfleger, K., 1994. MLC++: A machine learning library in C++. In: Proceedings of the 6th International Conference on Tools with Artificial Intelligence, pp. 740–743.
Kohavi, R., 1994. Feature subset selection as search with probabilistic estimates. In: Proceedings of the AAAI Fall Symposium on Relevance, pp. 122–126.
Kononenko, I., 1991. Semi-naive Bayesian classifier. In: Proceedings of the Sixth European Working Session on Learning, pp. 206–219.
Langley, P., Iba, W., Thompson, K., 1992. An analysis of Bayesian classifiers. In: Proceedings of the 10th National Conference on Artificial Intelligence, pp. 223–228.
Langley, P., Sage, S., 1994. Induction of selective Bayesian classifiers. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 399–406.
Merz, C., Murphy, P., Aha, D., 1997. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine. URL: www.ics.uci.edu/~mlearn/MLRepository.html.
Murphy, K., 2001. Bayes Net Toolbox for Matlab. URL: www.ai.mit.edu/~murphyk/Software/BNT/bnt.html.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Pudil, P., Novovicova, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15, 1119–1125.


Singh, M., Provan, G.M., 1996. Efficient learning of selective Bayesian network classifiers. In: Proceedings of the International Conference on Machine Learning, pp. 453–461.

Zongker, D., Jain, A., 1996. Algorithms for feature selection: An evaluation. In: International Conference on Pattern Recognition, ICPR 96, pp. 18–22.