Learning Bayesian networks using genetic algorithm*


Journal of Systems Engineering and Electronics, Vol. 18, No. 1, 2007, pp. 142-147

Chen Fei, Wang Xiufeng & Rao Yimei
Coll. of Information Technical Science, Nankai Univ., Tianjin 300071, P. R. China
(Received November 23, 2005)

Abstract: A new method to evaluate the fitness of a Bayesian network according to the observed data is provided. The main advantage of this criterion is that it is suitable for both the complete and the incomplete case, while the others are not; moreover, it greatly facilitates the computation. In order to reduce the search space, the notion of the equivalent class proposed by David Chickering is adopted. Instead of using that method directly, the novel criterion, a variable ordering, and the equivalent class are combined, and the proposed method avoids some problems caused by the previous one. A genetic algorithm, which provides the global convergence that most methods for searching Bayesian networks lack, is then applied to search for a good model in this space. To speed up the convergence, the genetic algorithm is combined with a greedy algorithm. Finally, the simulation shows the validity of the proposed approach.

Keywords: Bayesian networks, Genetic algorithm, Structure learning, Equivalent class.

1. Introduction

A graphical model is a marriage between graph theory and probability theory. A Bayesian network is a graphical model, a representation of the probabilistic relations among a set of variables [7]. Over the last two decades, the Bayesian network has become a popular representation for encoding uncertain knowledge in expert systems. It has been applied to fraud detection [4], robot navigation [1], data compression [5] and many other areas.

A Bayesian network model has two parts, the qualitative part and the quantitative part. The qualitative part refers to the network structure, while the quantitative part deals with the conditional probability of each variable given its parents. Thus, when learning a Bayesian network, we should pay attention to both of them. There are four kinds of learning issues, which are listed in Table 1. The most complex one is when the structure is unknown and the observed cases are incomplete, which is what we are going to deal with. Most of the literature on learning with Bayesian networks is concerned with model selection. In these approaches, some criterion is chosen to measure the fitness of a network structure according to the data. Then a search algorithm is used to find network structures which receive a high score under this criterion. The most common criterion is the log relative posterior probability log p(D, S^h), where D denotes the observed data and S^h denotes the structure hypothesis. The computation of this criterion requires

the consideration of the parameters, which costs a lot of time. Much attention has been devoted to the study of Bayesian networks. In Ref. [2], David Chickering presented the general formula of a search space in which the states of the search correspond to the equivalent classes. Wray Buntine discussed different methods of learning Bayesian networks from data, and drew connections between the statistical, neural network and uncertainty communities [14]. Friedman et al. extended naive Bayes by allowing a slightly less restricted structure while still optimizing the likelihood [13]. R. Chen proposed a collective method for addressing the problem of learning the structure of a Bayesian network from distributed data [15]. The genetic algorithm is an evolutionary-model-based optimization algorithm [6]. Its main advantage is that it allows global convergence; moreover, it has the ability to find an optimum solution, or a set of better ones, within a reasonable time limit.

2. New computation

2.1 Preliminary knowledge

One needs two things to specify a Bayesian network: the network structure and the conditional probabilities with respect to each node, also called the parameters of this network. Learning Bayesian networks refers to the construction of both the network structure and the parameters.

* This project was supported by the National Natural Science Foundation of China (70572045).

The most common method to construct a Bayesian network is to elicit it from domain experts. This works well for small networks, but becomes helpless when the networks grow larger; sometimes we even have to confront the case when no expert is available. For this reason, the main research objective concerning the learning of Bayesian networks is how to construct them from data, i.e. from cases. A case refers to one observation of the variables in the Bayesian network. If all the variables are observed, the case is complete, otherwise it is incomplete. According to whether the structure is known and whether the cases are complete, the learning of Bayesian networks from data can be categorized into four kinds, as shown in Table 1.


Table 1  Four kinds of learning issues

    Structure    Cases
    Known        Complete
    Known        Incomplete
    Unknown      Complete
    Unknown      Incomplete

Among these four kinds, the most complex one is when the structure is unknown and the observed cases are incomplete. In the following subsections, we use D and S^h to denote the cases and the network structure, respectively.

2.2 New criterion

The criterion most often used for model selection is the log of the relative posterior probability. Instead of employing the formula mentioned in [7], we give a new method to calculate it and propose a modified version of this criterion, which allows us to deal with the situation when the cases are incomplete. Above all, we make some reasonable assumptions as follows:

(1) Before we see any case, we are indifferent to the network structure and to the parameter distribution given a network structure S^h.
(2) If we see an incomplete case, we are indifferent to the distribution of the variables unobserved in this case over the remaining feasible space. Here, the feasible space refers to the case space which does not conflict with the values of the observed variables.
(3) The cases are independent given the model.
(4) The parameters are independent; parameter independence was introduced by Spiegelhalter and Lauritzen [9].

Theorem 1

log p(D, S^h) = log p(S^h) + Σ_{i=1}^{n} Σ_{j=1}^{q_i} [ log((r_i − 1)!) + Σ_{k=1}^{r_i} log Γ(N_ijk + 1) − log Γ(N_ij + r_i) ]

where Γ(·) refers to the gamma function, r_i is an integer (the number of values of variable x_i), q_i is the number of configurations of its parent set, N_ijk denotes the number of cases in which the variable x_i takes the value x_i^k while its parents take the configuration Pa_i^j, and N_ij = Σ_k N_ijk. The gamma function satisfies Γ(1) = 1 and Γ(x + 1) = xΓ(x).
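For a concrete illustration, consider a single binary variable x_1 with no parents, so that n = 1, q_1 = 1 and r_1 = 2, and suppose the N complete cases contain N_111 occurrences of its first value and N_112 = N − N_111 of its second. Theorem 1 then gives

log p(D, S^h) = log p(S^h) + log(1!) + log Γ(N_111 + 1) + log Γ(N_112 + 1) − log Γ(N + 2)

that is, p(D | S^h) = N_111! N_112! / (N + 1)!, the familiar marginal likelihood of a binary variable under a uniform parameter prior.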

Using the chain rule, we can get from Eq. (1)

log p(D, S^h) = log ∫ p(S^h) p(θ_s | S^h) p(D | θ_s, S^h) dθ_s     (2)

Since p(S^h) is independent of θ_s, we can derive the following equation from the previous one:

log p(D, S^h) = log [ p(S^h) ∫ p(θ_s | S^h) p(D | θ_s, S^h) dθ_s ]     (3)

Due to the case independence, we may rewrite the formula as

log p(D, S^h) = log [ p(S^h) ∫ p(θ_s | S^h) Π_l p(D_l | θ_s, S^h) dθ_s ]     (4)

where D_l denotes the l-th case, θ_ijk = p(x_i^k | Pa_i^j, θ_i, S^h), and N_ijk refers to the number of cases in which the variable x_i takes the value x_i^k while its parents take the configuration Pa_i^j. It is obvious how to compute N_ijk when all the cases are complete. But the world is not always so perfect: how do we deal with the situation, which we confront most of the time, when the cases are incomplete? By assumption (2), we can derive the following algorithm to compute N_ijk.


Algorithm 1

Initialization: for all (i, j, k), set N_ijk = 0.
For each case:
    For each triple (i, j, k):
        If the observed data does not conflict with x_i = x_i^k and Pa_i = Pa_i^j:
            N_ijk = N_ijk + 1/|N|

Here, |N| is the size of the feasible space according to the case (for a complete case |N| = 1, so the matching N_ijk is simply increased by one).

Using the parameter independence (assumption (4)),

p(θ_s | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} p(θ_ij1, ..., θ_ijr_i)     (5)

Combining Eqs. (4) and (5), we get

log p(D, S^h) = log [ p(S^h) Π_{i=1}^{n} Π_{j=1}^{q_i} ∫ p(θ_ij1, ..., θ_ijr_i) Π_{k=1}^{r_i} θ_ijk^{N_ijk} dθ_ij ]

The following formula was proved by Kevin Karplus [8]:

∫ Π_{k=1}^{r_i} θ_ijk^{N_ijk} dθ_ij = Π_{k=1}^{r_i} Γ(N_ijk + 1) / Γ(N_ij + r_i)     (9)

where r_i is an integer, the number of values of x_i. For the special case when N_ijk = 0 for every (i, j, k), since r_i is an integer, Eq. (9) is converted into

∫ 1 dθ_ij = 1 / (r_i − 1)!

Moreover, by assumption (1), p(θ_ij1, ..., θ_ijr_i) is a constant with respect to θ_ij, so we get

log p(D, S^h) = log p(S^h) + Σ_{i=1}^{n} Σ_{j=1}^{q_i} [ log((r_i − 1)!) + Σ_{k=1}^{r_i} log Γ(N_ijk + 1) − log Γ(N_ij + r_i) ]

which is exactly Theorem 1. Due to assumption (1), the prior distribution is indifferent to all network structures, so we can ignore the term log p(S^h) when searching for a good model. The new criterion can now be given as

C(D, S^h) = Σ_{i=1}^{n} Σ_{j=1}^{q_i} [ log((r_i − 1)!) + Σ_{k=1}^{r_i} log Γ(N_ijk + 1) − log Γ(N_ij + r_i) ]
3. The equivalent class

At the beginning of this section, we first introduce the definition of the equivalent class given by David Chickering [2].

Definition 1  Two dags (directed acyclic graphs) g and g' are equivalent if for every Bayesian network B = (g, Θ_g) there exists a Bayesian network B' = (g', Θ_g') such that B and B' define the same probability distribution, and vice versa.

Heckerman showed that the search for the best Bayesian network model according to the cases is NP-hard, so we need a heuristic algorithm to search for a good model in the solution space; this is why we resort to the genetic algorithm. Moreover, Chickering [3] proved that most scoring metrics (including ours) return the same fitness score for network structures in the same equivalent class. Thus, if we can find a representation of the equivalent class and a scoring metric on this class, we can reduce the size of the model space greatly. This is what we do in this section.

3.1 Representation of the equivalent class

Verma and Pearl [12] gave the following characterization of equivalent classes.

Theorem 2  Two dags are equivalent if and only if they have the same skeletons and the same v-structures.

Here, the skeleton and the v-structure are defined as follows.

Definition 2  The skeleton of a dag results from this dag by removing the directions of its arcs.

Definition 3  A v-structure is a triple (x, y, z) such that there are arcs from x to y and from z to y, but no arc between x and z.


Figure 1 shows an example of v-structure.

Fig. 1  An example of a v-structure
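To make Theorem 2 concrete, here is a small sketch (illustrative representation, not from the paper: a dag is a set of (parent, child) arc pairs) that extracts the skeleton and the v-structures and tests equivalence.

```python
def skeleton(arcs):
    """Undirected edges obtained by dropping arc directions (Definition 2)."""
    return {frozenset(e) for e in arcs}

def v_structures(arcs):
    """Triples (x, y, z) with x -> y <- z and no arc between x and z (Definition 3)."""
    skel = skeleton(arcs)
    parents = {}
    for x, y in arcs:
        parents.setdefault(y, set()).add(x)
    vs = set()
    for y, pa in parents.items():
        for x in pa:
            for z in pa:
                if x < z and frozenset((x, z)) not in skel:
                    vs.add((x, y, z))
    return vs

def equivalent(arcs1, arcs2):
    """Theorem 2: equivalent iff same skeleton and same v-structures."""
    return (skeleton(arcs1) == skeleton(arcs2)
            and v_structures(arcs1) == v_structures(arcs2))

# The collider {('x','y'), ('z','y')} and the chain {('x','y'), ('y','z')} share
# a skeleton but differ in v-structures, so equivalent(...) returns False.
```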

There is a representation of equivalent classes described by Chickering [2] called completed pdags (partially directed acyclic graphs). Instead of using pdags, we propose a modified version of pdags called opdags (ordered partially directed acyclic graphs). Opdags are defined as follows.

Definition 4  Given a pdag, we construct the opdag in this way: we first give all the variables of this pdag an order; then we treat each arc as follows: if the arc is directed, we do nothing; if the arc is undirected between two nodes x and y (assuming x > y), we always treat this arc as x -> y.

The difference is that we use an ordering between variables; the reason for this will be explained later. By this definition, we can represent the equivalent classes as opdags.

Definition 5  The search space of possible network structures described by dags is called the b-space.

Definition 6  The search space of possible network structures described by opdags is called the e-space.
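A direct reading of Definition 4 could be sketched as follows (illustrative code; `order` maps each variable to its rank, a higher rank meaning "greater" in the sense used above):

```python
def to_opdag(directed, undirected, order):
    """Definition 4: keep directed arcs as they are and orient every undirected
    edge {x, y} as x -> y when x is greater than y in the given order."""
    arcs = set(directed)
    for edge in undirected:
        x, y = tuple(edge)
        if order[x] > order[y]:
            arcs.add((x, y))
        else:
            arcs.add((y, x))
    return arcs
```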

3.2 Scoring metrics

The method described by Chickering [2] for computing the scoring metric is as follows. Pdags are used to represent the equivalent classes; since the scoring metric is defined on a single network structure rather than on an equivalent class, the pdags have to be turned into dags, which represent single network structures, and later turned back. This may cause two problems. First, the transformation between pdags and dags obviously takes a lot of time; if we could apply the criterion directly to pdags, the transformation would not be needed at all. Second, a local change of a pdag may not correspond to a local change of the dags in its class, which is unfortunate, since the metric reflects local changes of the dags.

The following shows how we deal with the problems described above. In section 2, we obtained a scoring metric which, like the others, is defined on a single network structure. Moreover, in subsection 3.1, we obtained a representation of the equivalent class called the opdag. Now we can see whether the metric can be applied to opdags. The answer is yes, with a tiny modification of Algorithm 1: when we look for a variable's parents, we only consider the variables whose order is smaller than this one. Since the transformation no longer exists, the first problem is solved; and since the metric is defined directly on the representation of the equivalent class, the second problem is eliminated.

4. Greedy genetic algorithm

For the reasons mentioned above, we need a heuristic algorithm to search for good models. The reason why we consider the genetic algorithm is that it allows global convergence and is easy to implement. But the genetic algorithm also has its own drawback: the convergence speed is slow. In order to cope with this problem, we combine the genetic algorithm with a greedy operator.

4.1 Coding

The coding of the chromosome is as follows:

P = ((d_1), (u_1), (d_2), (u_2), ..., (d_n), (u_n))

Here, d_i refers to the directed parents of variable i, with d_i = (d_i1, d_i2, ..., d_im), where d_im is the m-th directed parent of variable i. u_i is defined as the undirected parents of the i-th variable; the form of u_i is the same as that of d_i. Due to the ordering, we can always ignore d_1 and u_1; thereby, if there are n variables, there are only n − 1 pairs of d_i and u_i. So we may write the chromosome as P = ((d_2), (u_2), ..., (d_n), (u_n)).
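A sketch of this coding (an illustrative Python structure, not the authors' implementation), in which each gene stores d_i and u_i for i = 2, ..., n and candidate parents are drawn from the variables preceding i in the ordering:

```python
import random

def random_chromosome(n, rng=random, max_parents=2):
    """Chromosome P = ((d_2, u_2), ..., (d_n, u_n)); variable 1 carries no gene.
    Each gene is a dict with the directed parents 'd' and undirected parents 'u'
    of variable i."""
    genes = {}
    for i in range(2, n + 1):
        earlier = list(range(1, i))
        chosen = rng.sample(earlier, rng.randint(0, min(len(earlier), max_parents)))
        split = rng.randint(0, len(chosen))
        genes[i] = {'d': chosen[:split], 'u': chosen[split:]}
    return genes
```

With n variables there are n − 1 genes, matching the reduced chromosome P = ((d_2), (u_2), ..., (d_n), (u_n)) above.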

4.2 Fitness function

Here, we directly use the metric described in section 2, with the modification mentioned in section 3.

4.3 Operator

Select operator: roulette selection, the same as in the classical genetic algorithm. In order to ensure global convergence, we use elitist selection.


Graph operator: the operations of this operator are described as follows: (1) For any undirected arc, we can delete it, which means deleting the variable from u_i. (2) For any directed arc, we can delete it, which means deleting it from d_i. (3) For any pair of nodes that are not adjacent, we can either add a directed arc (adding a variable to d_i) or an undirected arc (adding a variable to u_i).

Greedy operator: we make adjustments with the graph operator until no adjustment can improve the fitness of the model. Here a question arises: why not combine the greedy operator directly into each move of the graph operator, which seems more concise? The reason will be explained in subsection 4.5. Mutation operator: this operation is defined in Ref. [6]. Crossover operator: we use single-point crossover, which is also defined in Ref. [6].
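The three moves of the graph operator and the greedy operator built on top of them could be sketched as follows; the chromosome layout matches the sketch in subsection 4.1, `score` stands for the fitness function of subsection 4.2, and all names are illustrative.

```python
def graph_moves(genes):
    """The graph operator's moves on a chromosome {i: {'d': [...], 'u': [...]}}:
    (1) delete an undirected arc, (2) delete a directed arc,
    (3) add a directed or an undirected arc between a non-adjacent pair."""
    for i, g in genes.items():
        for v in g['u']:
            yield ('del_u', i, v)
        for v in g['d']:
            yield ('del_d', i, v)
        adjacent = set(g['d']) | set(g['u'])
        for v in range(1, i):                 # candidate parents precede i in the ordering
            if v not in adjacent:
                yield ('add_d', i, v)
                yield ('add_u', i, v)

def apply_move(genes, move):
    """Return a copy of the chromosome with one graph-operator move applied."""
    kind, i, v = move
    new = {k: {'d': list(g['d']), 'u': list(g['u'])} for k, g in genes.items()}
    if kind == 'del_u':
        new[i]['u'].remove(v)
    elif kind == 'del_d':
        new[i]['d'].remove(v)
    elif kind == 'add_d':
        new[i]['d'].append(v)
    else:
        new[i]['u'].append(v)
    return new

def greedy_operator(genes, score):
    """Apply the best single graph-operator move until none improves the fitness."""
    best, best_score = genes, score(genes)
    improved = True
    while improved:
        improved = False
        for move in graph_moves(best):
            cand = apply_move(best, move)
            s = score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
    return best
```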

4.4 Algorithm

Algorithm 2
Step 1  Choose the parameters of the genetic algorithm, such as the size of the population, the stop criterion, etc.
Step 2  Code and initialize the population.
Step 3  Apply the genetic operators.
Step 4  Stop if the stop criterion is satisfied; otherwise return to Step 3.
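A compact sketch of Algorithm 2 combining the operators of subsection 4.3: roulette selection with an elitist copy of the best individual, single-point crossover, mutation, and the greedy operator. Here `crossover(a, b)` is assumed to return the offspring pair, `mutate` and `greedy` are the operators described above, and the shift that makes the log-scores positive for roulette selection is an illustrative choice, not something specified in the paper.

```python
import random

def genetic_algorithm(init_pop, score, crossover, mutate, greedy,
                      generations=100, mutation_rate=0.05, rng=random):
    """Algorithm 2: initialize, then repeat selection / crossover / mutation /
    greedy improvement, always copying the best individual forward (elitism)."""
    population = list(init_pop)
    for _ in range(generations):
        fitness = [score(p) for p in population]
        elite = population[fitness.index(max(fitness))]
        # Roulette selection needs positive weights; the log-scores are shifted here.
        lo = min(fitness)
        weights = [f - lo + 1e-9 for f in fitness]
        parents = rng.choices(population, weights=weights, k=len(population))
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            for child in crossover(a, b):          # single-point crossover
                if rng.random() < mutation_rate:
                    child = mutate(child)
                children.append(greedy(child, score))
        children.append(elite)                     # elitist selection
        population = children[:len(init_pop)]
    return max(population, key=score)
```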

4.5 Proof

Using the theorem mentioned in [11], we can give a proof of the global convergence of this algorithm.

Theorem 3  If an evolutionary algorithm is ergodic and monotone, the algorithm converges globally with probability 1.

Theorem 4  The greedy genetic algorithm converges globally with probability 1.

Proof  Using the graph operator, we can reduce any opdag to the opdag without any arcs; then, by the graph operator again, we can construct any opdag from the opdag without arcs, so the condition of ergodicity is satisfied. Moreover, since we are using the elitist genetic algorithm, the monotone condition is also met. Now, according to Theorem 3, the greedy genetic algorithm converges globally with probability 1. From the proof, we can also get the answer to the question raised in subsection 4.3.

5. Simulation

The simulations are organized as follows. First, we apply existing Bayesian networks to generate the cases using Gibbs sampling. Gibbs sampling is an unbiased generator of cases in the sense that the probability that a case is generated equals the probability of that case according to the network. The number of cases generated is 500, and each variable has 5 possible values. The following simulations show both the validity and the efficiency of our approach. Figure 2 compares this new criterion, using the new computation method, with p(D, S^h) as computed in Ref. [7]. Figure 3 shows the efficiency of the e-space compared with the b-space; moreover, we compare the validity of the e-space and the b-space, using the criterion proposed in section 2 as the measure.
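The paper generates cases by Gibbs sampling from known networks; as a simpler stand-in that also draws each complete case with its probability under the network, here is a forward (ancestral) sampling sketch with illustrative data structures.

```python
import random

def forward_sample(order, parents, cpt, rng=random):
    """Draw one complete case from a discrete Bayesian network by sampling each
    variable after its parents. `order` is a topological order, `parents[v]` the
    parent tuple of v, and `cpt[v][pa_values]` a list of value probabilities."""
    case = {}
    for v in order:
        pa_values = tuple(case[p] for p in parents[v])
        probs = cpt[v][pa_values]
        case[v] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return case

# e.g. 500 cases, as in the simulation set-up described above:
# cases = [forward_sample(order, parents, cpt) for _ in range(500)]
```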

Fig. 2  The computation of log p(D, S^h) with and without the new method (plotted against the node number)

Table 2 gives the generation at which the genetic algorithm with the greedy operator and the regular genetic algorithm converge.

Table 2  Speed of convergence for the genetic algorithm with greedy operator and the regular genetic algorithm

    Node number    With greedy operator    Regular genetic algorithm
    2              1                       3
    3              1                       3
    4              3                       10
    5              3                       19
    6              10                      21
    7              28                      50
    8              30                      50
    9              50                      100
    10             50                      100
The simulation above shows that the improvement we have made is valid.

6. Conclusions

In this paper, we propose a new method for computing log p(D, S^h), which is used to evaluate the fitness of a network structure. Moreover, we use the notion of the equivalent class proposed by David Chickering and make a substantial improvement on it. Finally, we introduce a greedy operator into the genetic algorithm, which improves the convergence speed of our method to a considerable extent. The simulation shows the validity and efficiency of the proposed approach.

References

[1] Berler A, Shimony S E. Bayes networks for sonar sensor fusion. In: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Rhode Island: Morgan Kaufmann Publishers, 1997: 14-21.
[2] Chickering D M. Learning equivalence classes of Bayesian network structures. In: Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. Portland: Morgan Kaufmann Publishers, 1996: 150-157.
[3] Chickering D M. A transformational characterization of Bayesian network structures. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Montreal: Morgan Kaufmann Publishers, 1997: 87-98.
[4] Ezawa K J, Schuermann T. Fraud/uncollectible debt detection using a Bayesian network based learning system: a rare binary outcome with mixed data structures. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Montreal: Morgan Kaufmann Publishers, 1997: 157-166.
[5] Frey B J. Graphical models for machine learning and digital communication. Cambridge, MA: MIT Press, 1998.
[6] Goldberg D E. Genetic algorithms in search, optimization, and machine learning. Boston: Addison Wesley, 1998.
[7] Heckerman D. A tutorial on learning with Bayesian networks. Microsoft Research, Tech Rep: MSR-TR-95-06, 1995.
[8] Karplus K. Regularizers for estimating distributions of amino acids from small samples. University of California, Tech Rep: UCSC-CRL-95-11, 1995.
[9] Lauritzen S, Spiegelhalter D. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 1988, 50(2): 157-224.
[10] MacKay D J C. Introduction to Monte Carlo methods. In: Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models. M I Jordan (ed.): MIT Press, 1998: 175-204.
[11] Schleuter M G. Asparagos96 and the travelling salesman problem. In: Proceedings of the Fourth International Conference on Evolutionary Computation. Cambridge, MA: The MIT Press, 1996: 251-272.
[12] Verma T, Pearl J. Equivalence and synthesis of causal models. In: Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence. Boston, MA: Morgan Kaufmann, 1990: 220-227.
[13] Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning, 1997, 29: 131-163.
[14] Buntine W. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(2): 195-208.
[15] Chen R, Sivakumar K, Kargupta H. Learning Bayesian network structure from distributed data. In: Proceedings of the 3rd SIAM International Data Mining Conference, 2003: 284-288.

Chen Fei was born in 1982. He is now a doctoral student at Nankai University; his research interests include Bayesian networks and complex networks. E-mail: [email protected]