Pattern Recognition Letters 24 (2003) 2839–2848 www.elsevier.com/locate/patrec

Floating search algorithm for structure learning of Bayesian network classifiers

Franz Pernkopf a,*, Paul O'Leary b,1

a Institute of Communications and Wave Propagation, Graz University of Technology, Inffeldgasse 16c II, Graz 8010, Austria
b Institute of Automation, University of Leoben, Leoben 8700, Austria

Received 3 February 2003; received in revised form 10 June 2003

Abstract

This paper presents a floating search approach for learning the network structure of Bayesian network classifiers. A Bayesian network classifier is used which, in combination with the search algorithm, allows simultaneous feature selection and determination of the structure of the classifier. The introduced search algorithm enables conditional exclusions of previously added attributes and/or arcs from the network classifier. Hence, this algorithm is able to correct the network structure by removing attributes and/or arcs between the nodes if they become superfluous at a later stage of the search. Classification results of selective unrestricted Bayesian network classifiers are compared to naïve Bayes classifiers and tree augmented naïve Bayes classifiers. Experiments on different data sets show that selective unrestricted Bayesian network classifiers achieve a better classification accuracy estimate in two domains compared to tree augmented naïve Bayes classifiers, while in the remaining domains the performance is similar. However, the resulting network structure of selective unrestricted Bayesian network classifiers is simpler.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Bayesian network classifiers; Feature selection; Floating search method

1. Introduction

In classification problems the relevant attributes are often unknown a priori. Thus, many features are derived, and the features which do not contribute to, or even degrade, the classification performance have to be removed. This step is performed during feature selection. The main purpose of feature selection is to reduce the number of extracted features to a set of a few significant ones for classification while maintaining the classification rate. The reduction of the feature set size may even improve the classification accuracy by reducing estimation errors associated with finite sample size effects (Jain and Chandrasekaran, 1982). This behaviour of practical classification approaches is basically caused by insufficient modeling of the class-conditional probability density distributions obtained from the underlying samples.

* Corresponding author. Tel.: +43-3168734436; fax: +43-3168734432. E-mail address: [email protected] (F. Pernkopf).
1 Tel.: +43-3842-402-9031.

0167-8655/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0167-8655(03)00142-9

The best subset of features

X = {Xi | i = 1, ..., d, Xi ∈ Y}    (1)

is selected from the set

Y = {Yi | i = 1, ..., D},    (2)

where D is the number of extracted features and d ≤ D denotes the size of the feature subset (Devijver and Kittler, 1982). A feature selection criterion function J(X) evaluates a chosen subset X, whereby a higher value of J indicates a better subset. The problem of feature selection (Zongker and Jain, 1996; Jain and Zongker, 1997) is to find a subset X ⊆ Y such that the number of chosen features |X| is d and J reaches the maximum

J(Xopt) = max_{X ⊆ Y, |X| = d} J(X).    (3)

As evaluation criterion J the cross-validation classification accuracy estimate of the Bayesian network classifier (see Section 2) has been selected during the experiments (see Section 4). Other evaluation measures are available, but the literature (Kohavi and John, 1997; John et al., 1994; Kohavi, 1994; Dash and Liu, 1997) shows that selecting the subset by means of the classifier that will later be used yields the most accurate results. Unfortunately, this is computationally very demanding. John et al. (1994) divide feature selection algorithms into two major groups, the filter approach and the wrapper approach. The filter approach assesses the relevance of the features from the data set, mainly based on statistical measures. The effects of the selected features on the performance of a particular classifier are neglected. In contrast, the wrapper approach uses the classification performance of the classifier itself for evaluating the feature subsets during the search. John et al. (1994) claim that the wrapper approach is more appropriate, since the selection of a feature subset which takes the induction algorithm into account achieves a high predictive accuracy on unknown test data. However, this approach is associated with high computational costs, but advances in computer technology make

the wrapper method feasible. The filter approach is mostly used in data mining applications where huge data sets are considered. Blum and Langley (1997) present a third feature selection approach, called embedded, in which feature selection is performed implicitly while the classification algorithm is being constructed. Basically, there are two main ways to improve the classification accuracy of classification algorithms. The first is to reduce the dimensionality by discarding irrelevant features from the feature set for classification; this is performed during feature selection. The second is to model statistical dependencies between attributes to achieve an improvement of the classification accuracy. Singh and Provan (1996) combine both approaches for Bayesian network classifiers, whereby in a first step the features are selected using an information theoretic measure and in a second step the network is constructed with the selected subset in a scoring-based manner. In this work, the objective is to maximize the performance of a classification algorithm by doing both simultaneously: removing irrelevant features and relaxing independence assumptions between correlated features using a scoring-based approach. The applied classification method is restricted to Bayesian network classifiers. To this aim, a search algorithm is introduced to learn the structure (dependencies between the attributes) of the Bayesian network. The search algorithm, which is well-established in the feature selection community (Pudil et al., 1994; Jain and Zongker, 1997), enables conditional exclusions of previously added attributes and/or arcs from the classifier structure. Hence, this algorithm is able to correct the network structure by removing arcs and/or attributes if they become superfluous at a later stage of the search.
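The wrapper criterion J(X) described above, the cross-validation classification accuracy of the classifier restricted to a candidate subset X, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `classifier_factory` is a hypothetical stand-in for any classifier object with `fit`/`predict` methods.

```python
import random

def cv_accuracy(classifier_factory, X_rows, y, feature_subset, k=5):
    """Wrapper criterion J(X): k-fold cross-validation accuracy of the
    classifier trained on only the features in `feature_subset`."""
    idx = list(range(len(X_rows)))
    random.Random(0).shuffle(idx)          # fixed seed: same folds for every subset
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for f in range(k):
        test = set(folds[f])
        train = [i for i in idx if i not in test]
        clf = classifier_factory()
        clf.fit([[X_rows[i][j] for j in feature_subset] for i in train],
                [y[i] for i in train])
        for i in folds[f]:
            correct += clf.predict([X_rows[i][j] for j in feature_subset]) == y[i]
    return correct / len(X_rows)
```

A higher value of J indicates a better subset; scoring every candidate subset this way is exactly what makes the wrapper approach computationally demanding.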

2. Bayesian network classifiers

The Bayes decision rule (Duda and Hart, 1973; Langley et al., 1992) is a classification method based on Bayes' theorem. A feature vector x belongs to the class ωj according to the a posteriori probability P(ωj|x). This a posteriori probability P(ωj|x) is determined using Bayes' theorem as

P(ωj|x) = P(x|ωj)P(ωj) / P(x),    (4)

where

P(x) = Σ_{j=1}^{t} P(x|ωj)P(ωj)    (5)

and t corresponds to the number of class labels. Since the denominator of Bayes' theorem (Eq. (5)) is the same for all classes, it can be ignored. Further, it is assumed that the conditional probability density functions P(x|ωj) and the a priori probabilities P(ωj) of all classes are given. Normally, these probabilities are estimated from a data set during training of the decision rule. Once the classifier has been trained, classification of an instance x is performed by assigning a class label ωi to x. This is done in accordance with the a posteriori probability P(ωi|x) determined by Bayes' theorem, which has to be the largest among the different classes Ω = {ω1, ..., ωt}. Hence, pattern x is assigned to the class ωi as

x → ωi   if   P(ωi|x) = max_{j=1,...,t} P(ωj|x).    (6)
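The decision rule of Eqs. (4)–(6) translates directly into code. The sketch below is a hypothetical illustration: `priors` maps each class label to P(ωj) and `likelihood(x, w)` returns P(x|ωj); the denominator P(x) is dropped since it is identical for every class.

```python
def bayes_decision(x, priors, likelihood):
    """Assign x to the class with the largest a posteriori probability
    (Eq. (6)).  The unnormalized score P(x|w) * P(w) suffices, because the
    denominator P(x) of Bayes' theorem is the same for all classes."""
    scores = {w: likelihood(x, w) * p for w, p in priors.items()}
    return max(scores, key=scores.get)
```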

The naïve Bayes decision rule assumes that all the attributes are conditionally independent given the class label. As reported in the literature (Langley and Sage, 1994; Friedman et al., 1997), the performance of the naïve Bayes classifier is surprisingly good even though the independence assumption between attributes is unrealistic in most of the data sets. Independence between the features ignores any correlation among them. Kononenko (1991) extends the naïve Bayes approach by joining dependent attributes into groups, where independence is only assumed between the different groups. The idea of Friedman et al. (1997) and Friedman and Goldszmidt (1996) is to improve the classification performance by relaxing the independence assumption using tree augmented naïve Bayes classifiers. To this end, Bayesian networks (Pearl, 1988; Jensen, 1996; Heckerman, 1995) are used to represent the classifier. A Bayesian network B = ⟨G, Θ⟩ is a directed acyclic graph G which models probabilistic relationships among


a set of random variables U = {X1, ..., Xn, Ω} = {U1, ..., Un+1}, where each variable in U has specific states or values denoted by lower case letters {x1, ..., xn, ω}. The symbol n denotes the number of attributes. Each vertex (node) of the graph represents a random variable, while the edges capture the direct dependencies between the variables. The network encodes the conditional independence relationship that each node is independent of its nondescendants given its parents. These conditional independence relationships reduce the number of parameters needed to represent a probability distribution. In the naïve Bayes setup the attribute values of Xi and Xj (Xi ≠ Xj) are conditionally independent given the class label of node Ω. Hence, xi is conditionally independent of xj given class ω whenever P(xi|ω, xj) = P(xi|ω) for all xi ∈ Xi, xj ∈ Xj, ω ∈ Ω, and when P(xj, ω) > 0. The symbol Θ represents the set of parameters which quantify the network. Each node contains a local probability distribution given its parents. The joint probability distribution is uniquely determined by these local probability distributions. The parameters of the network are estimated by the maximum likelihood method. The structure of the naïve Bayes classifier represented as a Bayesian network is illustrated in Fig. 1. Feature selection is introduced to this network by removing irrelevant features by means of, e.g., a search algorithm (see Section 3). This extension of the naïve Bayes decision rule is known as the selective naïve Bayes classifier. The structure in Fig. 1 shows that each attribute is conditionally independent of the remaining attributes given the label ω of the class variable. The class variable Ω is the only parent of each attribute Xi, denoted as ΠXi = {Ω} for all 1 ≤ i ≤ n. Hence, the joint

Fig. 1. Structure of a na€ıve Bayes network.


probability distribution P(X1, ..., Xn, Ω) for the network depicted in Fig. 1 is determined to be

P(X1, ..., Xn, Ω) = P(Ω) ∏_{i=1}^{n} P(Xi|Ω),    (7)

and from the definition of conditional probability the probability for the classes in Ω given the values of the attributes is

P(Ω|X1, ..., Xn) = α P(Ω) ∏_{i=1}^{n} P(Xi|Ω),    (8)

where α is a normalization constant. Since the attributes may be correlated and the independence assumption of the naïve Bayes classifier is unrealistic, Friedman et al. (1997) and Friedman and Goldszmidt (1996) introduce the tree augmented naïve Bayes classifier. It is based on the structure of the naïve Bayes network where the class variable is the parent of each attribute. Hence, the a posteriori probability P(Ω|X1, ..., Xn) takes all the attributes into account. Additionally, edges (arcs) among the attributes are allowed which capture the correlations among them. Each attribute may have at most one other attribute as additional parent, which means that there is an arc from feature Xi to feature Xj. This implies that these two attributes Xi and Xj are not independent given the class label. The influence of Xj on the class probabilities depends also on the value of Xi. An example of a tree augmented naïve Bayes network is shown in Fig. 2. A tree augmented naïve Bayes network is initialized as a naïve Bayes network. Additional arcs between attributes are found by means of a search algorithm (see Section 3). The maximum number of arcs added to relax the independence assumption between the attributes is n − 1, where n denotes the number of attributes.

Fig. 2. Structure of a tree augmented naïve Bayes network.

The selective unrestricted Bayesian network classifier (Singh and Provan, 1996) (see Fig. 3) can be viewed as a generalization of the tree augmented naïve Bayes network. The class node is treated as the root node, which cannot be the child of any attribute. The attributes need not be connected directly to the class node, as is required for the tree augmented naïve Bayes network. After initialization the network only consists of the class node, and the search algorithm (see Section 3) adds attributes and/or arcs to the network so that the evaluation criterion is maximized. In fact, only arcs are added between the nodes, so that the class node remains a root of the network. If there is no arc between an attribute and the network structure then the attribute is not considered during classification. During the determination of the network structure irrelevant features are not included and the classifier is based on a subset of selected features. Since this network is almost unrestricted, the computational effort for determining the network structure is huge, especially if a large number of attributes is available. Additionally, the size of the conditional probability tables of the nodes increases exponentially with the number of parents. This might result in a more unreliable probability estimate for nodes which have a large number of parents. The conditional distribution of Ω given the value of all attributes is only sensitive to those attributes which form the Markov blanket of Ω (Pearl, 1988). The Markov blanket of the class node Ω consists of the direct parents of Ω, the direct successors (children) of Ω, and all the direct parents of the direct successors (children) of the class node Ω. All the features outside the Markov

Fig. 3. Structure of a selective unrestricted Bayesian network.


blanket do not have any effect on the classification performance. Introducing this knowledge into the search algorithm reduces the search space and the computational effort for determining the structure of the classifier.
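The Markov blanket pruning described above can be made concrete. In the sketch below (an illustration, not the paper's code), a DAG is given as a mapping from each node to the set of its parents; the blanket of the class node is then its parents, its children, and its children's other parents.

```python
def markov_blanket(node, parents):
    """Return the Markov blanket of `node` in a DAG described by
    {child: set of parents}: parents, children, and co-parents of children."""
    children = {c for c, ps in parents.items() if node in ps}
    blanket = set(parents.get(node, set())) | children
    for c in children:
        blanket |= parents[c]          # the children's other parents
    blanket.discard(node)
    return blanket
```

Attributes outside this set cannot influence the class posterior, so candidate structures that differ only outside the blanket need not be evaluated, which shrinks the search space.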

3. Classical floating search (CFS)

For learning the structure of the Bayesian network that matches the data best, Keogh and Pazzani (1999) propose hill climbing search and a more efficient search algorithm called SuperParent. As scoring function the cross-validation classification accuracy estimate is used to guide the search. Keogh and Pazzani use leave-one-out cross-validation to estimate the accuracy of the current network with the added arcs. In the following, the algorithm for finding the structure of the tree augmented naïve Bayes network using hill climbing search is outlined (Keogh and Pazzani, 1999).

procedure HillClimbingSearchNet
begin
  Initialize network to naïve Bayes.
  Evaluate the current network.
  while classification accuracy improves
    Consider adding every legal arc to the current structure of the classifier and evaluate the classifier.
    if there is an arc addition which improves the accuracy then
      Add the arc which gives the largest classification accuracy improvement among all the possible arcs to the current network.
    else
      Return current network.
    end
  end
end

Each possible arc from Xi to Xj (Xi ≠ Xj, Xj is a node without another attribute as parent) is evaluated within the while-loop. If the classification accuracy estimate is enhanced with respect to the current network, then the current network is updated with the arc which gives the largest improvement. Otherwise, if no arc results in an enhancement of the classification performance, the current classifier is returned. This algorithm concludes the search if there is no arc which results in an enhancement of the classification accuracy estimate or if no attribute without another attribute as parent is available.

An improvement over hill climbing search (Keogh and Pazzani, 1999) is to apply the classical sequential floating algorithm used in feature selection applications (Pudil et al., 1994). We adopt this algorithm for determining the network structure of tree augmented naïve Bayes classifiers and selective unrestricted Bayesian network classifiers. The main disadvantage of hill climbing search for determining the network structure is that once an arc has been added to the network structure, the algorithm has no mechanism for removing the arc at a later stage. Hence this algorithm suffers from the nesting effect (Kittler, 1978). To overcome this drawback, Pudil et al. (1994) present a floating search method for finding significant features which optimize the classification performance in feature selection tasks. This algorithm allows conditional exclusions of previously added attributes and/or arcs from the augmented network. Hence, it is able to correct wrong decisions made in previous steps. This results in a higher classification accuracy estimate; however, this search strategy uses more evaluations to obtain the network structure and is consequently computationally less efficient than hill climbing search. The classical floating search algorithm is based on a method for adding and a method for removing attributes and/or arcs from the network structure. In the following, the floating algorithm is presented for establishing the structure of the tree augmented naïve Bayes network.
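The HillClimbingSearchNet procedure above can be sketched in a few lines of Python. This is a hedged illustration, not Keogh and Pazzani's code: `evaluate(arcs)` stands in for the cross-validation accuracy estimate of the network with the given augmenting arcs, and `legal_arcs(arcs)` must enforce the TAN restrictions (the target is an orphan, and no cycles arise).

```python
def hill_climbing_arcs(evaluate, legal_arcs):
    """Greedy arc addition: start from the naive Bayes structure (no
    augmenting arcs) and repeatedly add the single arc that improves the
    score the most; stop when no arc improves it."""
    arcs = set()
    best = evaluate(arcs)
    while True:
        candidates = [a for a in legal_arcs(arcs) if a not in arcs]
        if not candidates:
            return arcs, best
        score, arc = max((evaluate(arcs | {a}), a) for a in candidates)
        if score <= best:          # no arc improves the accuracy estimate
            return arcs, best
        arcs.add(arc)
        best = score
```

Note that an arc, once added, is never reconsidered; this is the nesting effect that the floating search is designed to avoid.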
Subsequently, the necessary modifications of the algorithm for learning the structure of selective unrestricted Bayesian network classifiers are presented.

Adding the most significant arc (AddArc): This algorithm is highly similar to the hill climbing search. Each possible arc A from Xi to Xj (Xi ≠ Xj, Xj ∈ O) is evaluated, whereby O is the set of orphans. Keogh and Pazzani (1999) define a node


without a parent except the class node as an orphan. The current network is updated with the arc which gives the largest increase of J among the possible arcs. The potential arc set Ap consists of all arcs from Xi to Xj (Xi ≠ Xj, Xj ∈ O). A possible arc A is an element of Ap (A ∈ Ap). The adding arc (AddArc) algorithm adds at each stage k the most significant arc As to the network structure G,

Gk+1 = Gk ∪ As,    (9)

so that the graph Gk+1 maximizes the evaluation criterion J,

J(Gk+1) = max_{A ∈ Ap} J(Gk ∪ A).    (10)

Removing the least significant arc (RemArc): The removing arc method (RemArc) is the counterpart of the adding arc algorithm. Thereby, in each step the least significant arc Al is discarded from the set of previously added arcs Gk \ G0, where G0 is the structure of a naïve Bayes network. The remaining network structure

Gk−1 = Gk \ Al    (11)

achieves the largest evaluation performance

J(Gk−1) = max_{A ∈ Gk \ G0} J(Gk \ A).    (12)
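Eqs. (9)–(12) translate directly into code. In this hypothetical sketch a graph is represented simply by its set of arcs, and `J` is the evaluation criterion (e.g. the cross-validation accuracy estimate):

```python
def add_arc(graph, potential_arcs, J):
    """AddArc (Eqs. (9)-(10)): include the most significant arc, i.e. the
    candidate whose addition maximizes the criterion J."""
    best = max(potential_arcs - graph, key=lambda a: J(graph | {a}))
    return graph | {best}

def rem_arc(graph, base, J):
    """RemArc (Eqs. (11)-(12)): exclude the least significant previously
    added arc; arcs in `base` (the initial structure G0) are never removed."""
    worst = max(graph - base, key=lambda a: J(graph - {a}))
    return graph - {worst}
```

The least significant arc is the one whose removal leaves the highest-scoring graph, which is why `rem_arc` also maximizes over J.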

The classical floating search algorithm is a bottom-up search procedure introduced by Pudil et al. (1994). The algorithm includes new arcs which maximize the criterion J by means of the AddArc procedure starting from the current graph. Afterwards, conditional exclusions of the previously updated graph take place by applying the RemArc method. If no arc can be excluded anymore, the algorithm proceeds again with the AddArc algorithm. The classical floating algorithm is presented in the following.

procedure CFS
begin
  Input: U = {X1, ..., Xn, Ω}  // available variables
  Output: Gk, k = 0, 1, ..., n − 1
  Termination: Stop when k = n − 1 or when e.g. a predetermined number of arcs has been added. The evaluation values J(Gi) for all preceding graphs i = 1, 2, ..., n − 1 are stored.
  Initialization: k = 0; graph G0 is initialized to naïve Bayes; the AddArc algorithm is used to add 2 arcs, k = 2; go to Step 1
  repeat
    Step 1 (Inclusion)
      Use the AddArc method to select the most significant arc Ak+1 from the set of potential arcs Ap to form the graph Gk+1 = Gk ∪ Ak+1; go to Step 2
    Step 2 (Conditional exclusion)
      Find the least significant arc in the graph Gk+1 by using the RemArc algorithm.
      if Ak+1 is the least significant arc in the graph Gk+1 \ G0, i.e. J(Gk+1 \ Ak+1) ≥ J(Gk+1 \ Aj), ∀ j = 1, ..., k, then
        k = k + 1; go to Step 1
      else if Ar, 1 ≤ r ≤ k, is the least significant arc in the graph Gk+1 \ G0, i.e. J(Gk+1 \ Ar) > J(Gk), then
        exclude Ar from Gk+1 to form a new graph G'k = Gk+1 \ Ar. Now J(G'k) > J(Gk).
        if k = 2 then Gk = G'k; J(Gk) = J(G'k); go to Step 1
        else go to Step 3
    Step 3 (Continuation of conditional exclusion)
      Find the least significant arc At in the graph G'k.
      if J(G'k \ At) ≤ J(Gk−1) then
        set Gk = G'k, J(Gk) = J(G'k); go to Step 1
      if J(G'k \ At) > J(Gk−1) then
        exclude At from G'k to form a newly reduced graph G'k−1 = G'k \ At; k = k − 1
        if k = 2 then Gk = G'k; J(Gk) = J(G'k); go to Step 1
        else go to Step 3
  until k = n − 1 or e.g. a predetermined number of arcs has been added.
end
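The control flow of procedure CFS can be condensed into a small driver. The following is a simplified, hypothetical sketch: it tracks the best graph found at each size, as the stored J(Gi) values do, but omits the special-casing of k = 2.

```python
def floating_search(potential_arcs, J, max_arcs):
    """Simplified floating search: add the most significant arc (Step 1),
    then conditionally exclude least significant arcs while the reduced
    graph beats the best graph of the same size seen so far (Steps 2-3)."""
    arcs, best_at = set(), {0: J(set())}
    best_score, best_arcs = best_at[0], set()
    while len(arcs) < max_arcs and potential_arcs - arcs:
        # Step 1 (inclusion): most significant arc
        a = max(potential_arcs - arcs, key=lambda x: J(arcs | {x}))
        arcs = arcs | {a}
        while True:
            s = J(arcs)
            if s > best_at.get(len(arcs), float("-inf")):
                best_at[len(arcs)] = s
                if s > best_score:
                    best_score, best_arcs = s, set(arcs)
            if len(arcs) <= 1:
                break
            # Steps 2-3: least significant arc; exclude it only if the
            # reduced graph improves on the best graph of its size
            r = max(arcs, key=lambda x: J(arcs - {x}))
            if J(arcs - {r}) > best_at.get(len(arcs) - 1, float("-inf")):
                arcs = arcs - {r}
            else:
                break
    return best_arcs, best_score
```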


This algorithm concludes the search when the maximum of n − 1 correlation arcs has been added to the network. This means that the potential arc set Ap = ∅. However, the network achieving the largest classification accuracy estimate is not necessarily the network with all possible arcs added. Hence it would be feasible to terminate this search algorithm at an earlier stage. Since consecutive backward steps are possible, the algorithm should not be stopped at a stage where additional arcs do not enhance the evaluation criterion.

For determining the structure of the selective unrestricted Bayesian network the following modifications of the above presented algorithm are necessary:

• For the selective unrestricted Bayesian network classifier, arcs are added between the nodes in such a way that the class node always remains a root of the network. If there is no arc between a loose attribute and the network structure then the attribute is not considered during classification (feature selection).
• The structure of the tree augmented naïve Bayes classifier is restricted so that each attribute Xi can have only one other attribute as parent in addition to the class node. For the selective unrestricted Bayesian network classifier, on the other hand, the selection of the arcs in the function AddArc is not restricted to targets Xj in the set of orphans O. Each attribute can have several attributes as parent nodes.
• The structure is not initialized to the naïve Bayes structure. The initial structure of the selective unrestricted Bayesian network consists of the class node as root of the network without any other arcs.
• The function AddArc also adds arcs from the class node Ω to Xj.

This floating search algorithm facilitates the correction of wrong decisions made in previous steps. Therefore, it may approximate the optimal solution better than hill climbing search. However, the classical floating search method is usually more time consuming, especially for data of high complexity and dimensionality.


An efficient evaluation of the classifier may be achieved by ordering the training instances so that the misclassified samples of previous classifications are classified first (Keogh and Pazzani, 1999). The classification algorithm is terminated as soon as the number of misclassified samples exceeds the error rate of the current best classifier network.
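This instance-ordering trick can be sketched as follows (a hypothetical illustration of the idea, not Keogh and Pazzani's code): previously misclassified instances are evaluated first, and the evaluation of a candidate network is abandoned as soon as it has already made more errors than the current best network.

```python
def evaluate_with_cutoff(classify, instances, labels, best_errors, prev_errors):
    """Score a candidate classifier, stopping early once it is certain to
    be worse than the best network found so far.  `prev_errors` holds the
    indices misclassified by earlier candidates; testing them first tends
    to trigger the cutoff sooner."""
    order = sorted(range(len(instances)), key=lambda i: i not in prev_errors)
    errors, misclassified = 0, set()
    for i in order:
        if classify(instances[i]) != labels[i]:
            errors += 1
            misclassified.add(i)
            if errors > best_errors:
                return None, misclassified   # cannot beat the current best
    return errors, misclassified
```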

4. Experimental results

Experiments have been performed on eight data sets from the UCI repository (Merz et al., 1997) which are shown in Table 1. The attributes in the data sets are multinomial and continuous-valued. Since the classifiers are constructed for multinomial attributes, the feature space was discretized in the manner described in (Fayyad and Irani, 1993), whereby the partition boundaries for discretizing the continuous-valued attributes were established only through the training data set. The discretization stages were carried out with the MLC++ system (Kohavi et al., 1994). The Bayesian network was constructed using the Matlab toolbox developed by Murphy (2001). During the experiments a five-fold cross-validation accuracy estimate of the classifier was used as scoring function J for finding the optimal structure of the network. Hence, the accuracy estimate of each classifier is given by the successful predictions on the test sets of each data set. All the experiments are based on exactly the same cross-validation folds. Table 2 shows the classification accuracy estimate, the number of evaluations (#Ev) used for search, and the number of features

Table 1
Data sets used for the experiment

Data set     # Features   # Classes   # Instances
Australian   14           2           690
Flare        10           2           1066
Glass        9            7           214
Glass2       9            2           163
Heart        13           2           270
Pima         8            2           768
Vote         16           2           435
Vehicle      18           4           846


Table 2
Five-fold cross-validation classification accuracy estimate for the eight data sets

Data set     3-NN   NB     CFS-SNB           HCS-TAN           CFS-TAN           CFS-SUN
             %      %      %      #F  #Ev    %      #A  #Ev    %      #A  #Ev    %      #F  #Ev
Australian   67.33  86.38  88.14   7  158    88.43   8  1066   88.43   8  1092   89.43  11  8376
Flare        82.93  82.93  84.07   3   29    84.25   3   241   84.25   3   243   84.07   3   299
Glass        60.33  67.98  71.69   6   98    73.46   4   235   73.46   4   240   72.96   9   801
Glass2       67.52  78.96  79.58   8  104    82.71   3   190   82.71   3   192   82.66   4   487
Heart        77.78  83.33  85.92   5  182    86.29   3   430   86.29   3   432   85.92   5  1017
Pima         57.66  71.47  74.21   5   39    75.37   5   201   75.37   5   210   75.13   5   648
Vote         92.61  90.06  95.87   4   83    96.31  11  1761   96.31   9  1836   95.86   2   511
Vehicle      69.32  60.52  65.05   5   89    74.51  16  2737   74.77  16  3853   75.36  13  5477

(#F) or arcs (#A) used to achieve this classification accuracy estimate for the approaches discussed in the previous sections. The following abbreviations are used for the different classification approaches:

• 3-NN: 3-nearest neighbor classifier.
• NB: Naïve Bayes classifier.
• CFS-SNB: Selective naïve Bayes classifier using the classical floating search.
• HCS-TAN: Tree augmented naïve Bayes classifier using hill-climbing search.
• CFS-TAN: Tree augmented naïve Bayes classifier using classical floating search.
• CFS-SUN: Selective unrestricted Bayesian network using classical floating search.

The best achieved classification accuracy is emphasized by boldface letters. The selective naïve Bayes classifier (CFS-SNB) achieves a better classification accuracy estimate on all data sets than the naïve Bayes (NB) approach based on all available attributes. The computational effort to determine the CFS-SNB classifier is relatively small. Additionally, the five-fold cross-validation accuracy estimate of the 3-NN classifier using the discretized features is given for comparison in Table 2. For the data sets Australian, Flare, and Heart the tree augmented naïve Bayes classifier is only slightly better than the selective naïve Bayes classifier; however, the CFS-SNB classifier has a much simpler structure and a smaller number of parameters is necessary. This means that these domains have some attributes which do not contribute much to the accuracy estimate, e.g. the


CFS-SNB utilizes just three out of the ten available attributes for estimating the classification accuracy of the Flare data set. Additionally, the number of evaluations used for determining the TAN network is much higher than for establishing the CFS-SNB. The structure of the tree augmented naïve Bayes classifier is established either with hill-climbing search or with the proposed classical floating search algorithm. For almost every data set, both algorithms achieve exactly the same classification performance. Just for the Vehicle data set, a slightly better accuracy estimate is obtained by using the CFS approach. The floating property of CFS typically comes into play at a later stage of the search, where more attributes have been involved in the classifier structure. Since most of the data sets have just a few relevant attributes, the classical floating search algorithm does not perform backward steps. It can be seen that the network structure and the number of evaluations needed for establishing the structure are closely related for both search strategies on almost all data sets. The selective unrestricted Bayesian network achieves a better classification performance than the TAN network for the data sets Australian and Vehicle. On the remaining data sets the classification performance is slightly worse. However, the resulting network of the CFS-SUN classifier is simpler while the predictive accuracy estimate is comparable to the TAN networks (see the number of features (#F) and the number of arcs (#A) used for the classifiers in Table 2). Since the search space for the SUN structure is much larger than for the TAN structure, the


number of evaluations for finding the optimal network is larger.

5. Conclusions and future work

In this paper a classical sequential floating search algorithm for determining the network structure of a tree augmented naïve Bayes network and a selective unrestricted Bayesian network classifier has been presented. This algorithm is capable of removing previously added arcs at a later stage of the search if they turn out to be irrelevant. The experiments are performed on data sets from the UCI repository. They show that selective unrestricted Bayesian network classifiers achieve similar results compared to tree augmented naïve Bayes classifiers. However, in two domains the selective unrestricted Bayesian network achieves a better classification accuracy estimate than the tree augmented naïve Bayes network. Future work is focused on using Bayesian network classifiers for continuous-valued attributes, since it is assumed that the classification accuracy estimate of the classifiers can be increased. Additionally, special attention will be dedicated to Bayesian multinets, which are a generalization of the unrestricted Bayesian network: they allow different relations among the features for each class, i.e. a different structure of the network for each class.

References

Blum, A.L., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artif. Intell. 97 (1–2), 245–271.
Dash, M., Liu, H., 1997. Feature selection for classification. Intell. Data Anal. 1 (3), 131–156.
Devijver, P.A., Kittler, J., 1982. Pattern Recognition: A Statistical Approach. Prentice Hall International.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. Wiley-Interscience.
Fayyad, U.M., Irani, K.B., 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022–1027.


Friedman, N., Goldszmidt, M., 1996. Building classifiers using Bayesian networks. In: Proceedings of the National Conference on Artificial Intelligence, pp. 1277–1284.
Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian network classifiers. Machine Learning 29, 131–163.
Heckerman, D., 1995. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.
Jain, A.K., Chandrasekaran, B., 1982. Dimensionality and sample size considerations in pattern recognition in practice. In: Handbook of Statistics, vol. 2. North-Holland, Amsterdam.
Jain, A.K., Zongker, D., 1997. Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Machine Intell. 19 (2), 153–158.
Jensen, F.V., 1996. An Introduction to Bayesian Networks. UCL Press Limited.
John, G.H., Kohavi, R., Pfleger, K., 1994. Irrelevant features and the subset selection problem. In: Proceedings of the 11th International Conference on Machine Learning, pp. 121–129.
Keogh, E.J., Pazzani, M.J., 1999. Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In: Proceedings of the 7th International Workshop on Artificial Intelligence and Statistics, pp. 225–230.
Kittler, J., 1978. Feature set search algorithms. In: Chen, C.H. (Ed.), Pattern Recognition and Signal Processing. Sijthoff and Noordhoff, pp. 41–60.
Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artif. Intell. 97, 273–324.
Kohavi, R., John, G., Long, R., Manley, D., Pfleger, K., 1994. MLC++: A machine learning library in C++. In: Proceedings of the 6th International Conference on Tools with Artificial Intelligence, pp. 740–743.
Kohavi, R., 1994. Feature subset selection as search with probabilistic estimates. In: Proceedings of the AAAI Fall Symposium on Relevance, pp. 122–126.
Kononenko, I., 1991. Semi-naive Bayesian classifier. In: Proceedings of the Sixth European Working Session on Learning, pp. 206–219.
Langley, P., Iba, W., Thompson, K., 1992. An analysis of Bayesian classifiers. In: Proceedings of the 10th National Conference on Artificial Intelligence, pp. 223–228.
Langley, P., Sage, S., 1994. Induction of selective Bayesian classifiers. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 399–406.
Merz, C., Murphy, P., Aha, D., 1997. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine. URL: www.ics.uci.edu/~mlearn/MLRepository.html.
Murphy, K., 2001. Bayes Net Toolbox for Matlab. URL: www.ai.mit.edu/~murphyk/Software/BNT/bnt.html.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Pudil, P., Novovicova, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15, 1119–1125.


Singh, M., Provan, G.M., 1996. Efficient learning of selective Bayesian network classifiers. In: Proceedings of the International Conference on Machine Learning, pp. 453–461.

Zongker, D., Jain, A., 1996. Algorithms for feature selection: An evaluation. In: International Conference on Pattern Recognition, ICPR 96, pp. 18–22.