European Journal of Operational Research 94 (1996) 488-504

Theory and Methodology

Approximate representation of probabilistic data in expert systems

Sumit Sarkar a,*, Haim Mendelson b, Veda C. Storey c

a Department of Information Systems and Decision Sciences, College of Business Administration, Louisiana State University, 3190 CBA, Baton Rouge, LA 70803, USA
b Graduate School of Business, Stanford University, Stanford, CA, USA
c W.E. Simon Graduate School of Business Administration, University of Rochester, Rochester, NY, USA

* Corresponding author. Fax: 504-388-2511

Received 13 June 1994; revised 24 April 1995
Abstract

We propose the use of a third-order approximation for the representation of probabilistic data in expert systems and compare it to tree-structured representations. The differences are illustrated using the example of a reliability problem. We show that using the third-order representation results in significantly reduced losses as compared to tree structures, with a small increase in computational complexity. We present heuristic and exact techniques to determine the optimal third-order representation and propose a decomposition technique that allows the exact algorithm to be efficiently used for solving large problem instances.

Keywords: Artificial intelligence; Expert systems; Probabilistic reasoning; Belief networks
1. Introduction
Over the past decade, expert systems have been widely used to perform tasks previously done by human experts. These systems capture the knowledge of human experts, and reason with this knowledge in order to replicate the decision processes of experts. Many real-life decision problems require the ability to represent, manipulate, and reason with uncertain knowledge. For instance, when evaluating loan applications in a bank, a loan officer must evaluate the likelihood that a loan will be repaid based on the credit history of the applicant (Rosenberg and Gleit, 1994). Similarly, medical experts must deal with the uncertainty associated with making predictions based on observed symptoms of patients (Shortliffe, 1976; Winkler and Poses, 1993). As a result, a significant amount of research in expert systems has focused on the representation of uncertainty, and reasoning in the presence of uncertainty. In particular, network structures called belief networks have been shown to provide an effective framework for representing uncertainty using probability calculus (Duda et al., 1976; Pearl, 1986; Lauritzen and Spiegelhalter, 1988). Although important contributions towards uncertain reasoning using belief networks have been made, schemes to propagate beliefs are of exponential complexity for the general class of belief networks (Cooper, 1990). In many problem domains, an expert system must make
inferences quickly, so the efficient manipulation of beliefs is important. Pearl (1988) and Lauritzen and Spiegelhalter (1988) have shown that sparse network structures, in particular tree-structured representation schemes, provide efficient ways to manipulate beliefs. In this paper, we first examine sparse approximate representations for a reliability problem that is modelled as a belief network. We consider tree structures (which are second-order representations) and third-order approximations for this problem. Using approximate representations results in losses when the data are used for decision support. We find that while the third-order representation is only a little more complex than the tree representation, it performs significantly better when used to make predictions for the reliability problem. We then address the more general problem of finding the representation that best approximates the probability distribution underlying the problem domain. We characterize the best approximate representation using the I-Divergence measure, which has been widely applied to compare alternate approximate distributions (Chow and Liu, 1968; Rebane and Pearl, 1987; Wong and Poon, 1989; Geiger, 1992). We show that using this measure results in the maximum entropy solution among feasible representations. We formulate the problem of finding the best approximate representation as a combinatorial optimization problem. We study heuristic and exact solution techniques for solving this combinatorial problem. The performance of the best third-order approximation is compared with the best tree approximation for the reliability example over a wide range of probability distributions, and found to be consistently superior. Finally, we develop a decomposition scheme that enables us to easily obtain exact solutions for large problem sizes. This paper is divided into seven sections. Section 2 presents the reliability problem and examines losses associated with the use of a simple tree representation. This section then demonstrates how a third-order approximation can lead to much lower losses. In Section 3, we characterize the best representation using the I-Divergence measure. Solution techniques to find the best third-order approximation are discussed in Section 4. In Section 5, we compare the performance of the tree approximation and the
third-order approximation for a large number of underlying probability distributions. Section 6 presents the decomposition scheme for solving large problem instances. Section 7 summarizes the findings and concludes the paper.
2. Approximating a network using a tree structure
We describe a decision problem that is modelled as a belief network, and consider an approximate tree representation for the network. We then compare the tree representation with an approximate representation based on a third-order distribution. Although the third-order approximation is only a little more complex for making inferences than a tree structure, it performs significantly better than the tree structure.

2.1. The reliability problem

We consider a system that is comprised of five components. The system works if and only if all five components are working. Each component has a probability of failure associated with it; however, the probabilities of failure for different components are not independent. A joint distribution is provided for each possible state of success or failure of these components, and can be used to determine the probability that the system is working. The dependency network in Fig. 1a captures the probabilistic dependencies across the components. We use an undirected graph notation (called a Markov network) (Pearl, 1988) to represent the dependencies; the same dependencies could also be represented using directed graphs.
Fig. 1. Belief network topology - actual (a) and tree representation (b).
A belief tree used to approximate the decision problem is shown in Fig. 1b. Belief trees are a special class of belief networks where each node has exactly one parent (except the root node, which has none). An important feature of belief trees is that every pair of non-adjacent events becomes independent given a third variable on the path connecting the pair (Pearl, 1988). This feature allows for efficient representation of the joint distribution using the chain-rule formula. The expression for the joint distribution corresponding to the tree structure in Fig. 1b is obtained as follows. Let node A be considered the root node of a directed tree representation that is probabilistically equivalent to the undirected structure. Then node C is a child of node A, and nodes B, D and E are the children of node C. Therefore, we have:

P_t(A,B,C,D,E) = P_t(A) · P_t(C|A) · P_t(B|C) · P_t(D|C) · P_t(E|C).

Since the probability distribution for a tree-structured network consists of component distributions over at most two variables, we call them second-order approximations.

The decision problem is to predict whether the system is OK or not. If the system is predicted to be "Not OK" when it is actually OK (i.e. a Type I error), then the cost is C₁. Similarly, if the system is predicted to be "OK" when it is not (i.e. a Type II error), then the cost associated with that decision is C₂. If the prediction is correct, there is no cost associated with the decision. Expected costs are determined using the true and the approximate representations, respectively. The actual distribution for the reliability problem is as shown in Table 1. For notational convenience, we use the term P(AB) to indicate the joint probability that both components A and B are working (all other terms are interpreted in a similar manner). From the data in Table 1, the true probability that the system will work is P(System is OK) = 0.1328.

A prediction may be made after observing one or more components. For example, the decision about the system may be made after observing that components A and C are working. Then, the posterior probability that the system is OK, denoted by p, is P(System is OK | A and C are OK). Since the system is OK if and only if all five components are working, we have p = P(A,B,C,D,E)/P(A,C) = 0.1328/0.28 = 0.474. The cost associated with each possible action is:

Expected cost of predicting "¬OK" given A and C are OK = p · C₁ = 0.474 × C₁,
Expected cost of predicting "OK" given A and C are OK = (1 − p) · C₂ = 0.526 × C₂.

When 0.526 × C₂ ≤ 0.474 × C₁, we predict "OK" since it leads to a lower expected cost; otherwise we predict "¬OK". Since the optimal decision depends on the ratio of the costs C₂ and C₁ rather than their absolute magnitudes, we set C₁ to 1 and analyze how the expected costs vary with C₂. The system is predicted "OK" as long as C₂ is less than p/(1 − p), which is the "odds" of the system being OK (i.e., O* = p/(1 − p)). The expected cost when the system is predicted to be OK is (1 − p) · C₂. For values of C₂ ≥ O*, the system is predicted to be "¬OK" and has an expected cost p associated with the decision. The expected cost curve as a function of C₂ is shown in Fig. 2.
Table 1
Joint distribution for decision problem

P(A) = 0.4        P(B) = 0.6        P(C) = 0.5        P(D) = 0.4        P(E) = 0.6
P(AB) = 0.31      P(AC) = 0.28      P(AD) = 0.1877    P(AE) = 0.2785
P(BC) = 0.4       P(BD) = 0.2785    P(BE) = 0.42
P(CD) = 0.28      P(CE) = 0.4       P(DE) = 0.31
P(ABC) = 0.25     P(ABD) = 0.1585   P(ABE) = 0.2365   P(ACD) = 0.1589   P(ACE) = 0.2305
P(ADE) = 0.1585   P(BCD) = 0.2305   P(BCE) = 0.34     P(BDE) = 0.2365   P(CDE) = 0.25
P(ABCD) = 0.1441  P(ABCE) = 0.2125  P(ABDE) = 0.14    P(ACDE) = 0.1441  P(BCDE) = 0.2125
P(ABCDE) = 0.1328
Fig. 2. Expected cost using actual probabilities.

Fig. 3. Expected cost using the tree approximation (expected cost using the actual distribution vs. the tree approximation).
The probability measures for the tree representation are obtained by evaluating, from the actual distribution, each of the component distributions associated with the tree representation. Besides its intuitive appeal, this procedure is also justified in an information-theoretic sense that we discuss later in this paper. When using the tree approximation, the estimated probability that the system is OK, when A and C are observed to be true, is p_t = P_t(A,B,C,D,E)/P_t(A,C), where P_t(ABCDE) and P_t(AC) are the probability masses assigned to the corresponding realizations for the tree representation. Using the chain rule discussed earlier, we have:

p_t = [P_t(A) · P_t(C|A) · P_t(B|C) · P_t(D|C) · P_t(E|C)] / P_t(A,C)
    = P_t(B|C) · P_t(D|C) · P_t(E|C)
    = P(B|C) · P(D|C) · P(E|C) = 0.358.

Therefore, the system is predicted to work when C₂ < p_t/(1 − p_t) = O_t*, and predicted "¬OK" otherwise. When the decision made using the approximate probability distribution coincides with that made using the actual probability, the expected cost is the same for both. When the decision made is not the same, the expected cost is higher. Thus, there are losses associated with using the approximate distribution. In this case O_t* = 0.558, and the decision made is different when O_t* ≤ C₂ ≤ O*. The expected cost when using the approximate distribution is p in this interval. The expected costs when using the actual and tree-structured distributions, respectively, are shown in Fig. 3. When using the actual distribution, the expected cost curve is OBG, while that using the tree approximation is OEAG.

The region over which a loss is incurred when using the approximate distribution is characterized by two parameters. The first parameter is the range of values (R) for C₂ over which a loss occurs. Using the approximation leads to a loss over the interval 0.558 < C₂ < 0.902; therefore, in this case, R = 0.344. The second parameter is the maximum loss over this interval, which occurs at C₂ = 0.558. We define L as the maximum percentage loss when using the approximate distribution as compared with the actual distribution. In this example, L is obtained as shown below:

L = (p / ((1 − p) · O_t*) − 1) · 100 = (O*/O_t* − 1) · 100 = (0.902/0.558 − 1) · 100 = 61.6%.
2.2. Using a third-order approximate network

While the tree representation is very simple, the above analysis shows that using such structures can lead to large losses compared to using the true
distribution. We can reduce the magnitude of these losses by considering more complex approximating structures. A structure which is only a little more complex than a tree structure is one that corresponds to a third-order distribution. The component probability distributions in a third-order representation consist of at most three variables. Fig. 4 shows a third-order representation that may be used to approximate the reliability problem. The probabilistic representation for the third-order approximation, P₃(·), is:

P₃(A,B,C,D,E) = P₃(A) · P₃(B,C|A) · P₃(D,E|C).

Fig. 4. A third-order approximate structure.
We use the component distributions evaluated from the true distribution in order to estimate the complete third-order approximation (similar to the procedure for the tree structure). The losses associated with the third-order solution are evaluated as was done earlier for the second-order solution. When using the third-order approximation, the estimate for the probability that the system is OK, when components A and C are observed to be OK, is p₃ = 0.446. Therefore, O₃* = p₃/(1 − p₃) = 0.806. Using this approximation leads to a loss over the interval 0.806 < C₂ < 0.902, and therefore R = 0.096. The maximum loss occurs at C₂ = 0.806 with L = 11.9%. These values of the loss parameters are much smaller compared to the corresponding values for the second-order approximation.
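The value p₃ = 0.446 follows from the clique product form; a small sketch (ours) using Table 1 entries:

```python
# Sketch (ours): posterior under the third-order structure of Fig. 4.
# For the two cliques {A,B,C} and {C,D,E} with separator C, the product
# form gives P3(A,B,C,D,E) = P(A,B,C) * P(C,D,E) / P(C).

P_ABC, P_CDE, P_C, P_AC = 0.25, 0.25, 0.5, 0.28   # from Table 1

P3_all = P_ABC * P_CDE / P_C      # P3(A,B,C,D,E) = 0.125
p3 = P3_all / P_AC                # P3(system OK | A, C OK), ~0.446
odds3 = p3 / (1.0 - p3)           # O3*, ~0.806
print(p3, odds3)
```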
We note that the third-order distributions that are considered as approximations are those in which each third-order component corresponds to a clique in the associated graph. While more general third-order distributions could also be considered, they are less modular for representing probabilistic dependencies. In practice, efficient techniques transform general graphs to ones with cliques before performing inferences (Lauritzen and Spiegelhalter, 1988). Both the storage and computational complexity associated with third-order approximations are of order O(nr³), where n is the number of nodes in the network and r is the number of realizations that a variable may have (Lauritzen and Spiegelhalter, 1988). The computational complexity associated with tree structures is O(nr²).

3. Evaluating approximate representations

The example in Section 2 illustrates how using the third-order representation to approximate the probability distribution results in significant savings as compared to using the tree structure. When third-order structures are appropriate, it is important to obtain the best such approximate representation. In this section we address the problem of finding the optimal solution without assuming a specific cost and decision structure. Rather, we seek the third-order representation that minimizes the I-Divergence measure of approximation error.

3.1. The I-Divergence measure
To evaluate alternate approximate representations, a measure is needed to compare how close each is to the true underlying probability distribution. Chow and Liu (1968) propose an algorithm that uses the I-Divergence measure to evaluate alternative tree (second-order) approximating distributions. This measure has been frequently used to obtain approximate structures (Rebane and Pearl, 1987; Wong and Poon, 1989; Geiger, 1992). The I-Divergence measure is defined as the difference between the information contained in the actual distribution p and the information contained in the approximate distribution r about the actual distribution p:

D(p,r) = Σᵢ pᵢ log(pᵢ/rᵢ).
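For reference, a direct implementation of this measure might look as follows (a sketch; the outcome lists and names are ours):

```python
# Sketch (ours): I-Divergence (Kullback-Leibler divergence) between a
# true distribution p and an approximation r over the same outcomes.

import math

def i_divergence(p, r):
    """D(p, r) = sum_i p_i * log(p_i / r_i); zero iff p == r."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

print(i_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
print(i_divergence([0.8, 0.2], [0.5, 0.5]))   # > 0
```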
This measure is always positive when the distributions p and r are different, and zero when they are identical. The I-Divergence metric is a linear transformation of the well-known logarithmic scoring rule (Good, 1952), which belongs to the class of proper scoring rules (a scoring rule is said to be proper if assessments other than the true distribution p cannot obtain a higher score than p itself). While proper scoring rules are desirable, other factors are also important. Stael von Holstein (1970) lists three principles (relevance, invariance and strong discriminability) as important desirable properties for a scoring rule. He has shown that the logarithmic rule is the only proper scoring rule that
satisfies the relevance property for n > 2. It can be shown that the I-Divergence metric satisfies the properties of invariance and strong discriminability as well.
3.2. Characteristics of optimal solutions

We wish to determine the third-order approximation P_a that is closest to the original distribution P with respect to the I-Divergence measure D(P, P_a). We need to find two things: (i) the specific topology for the best third-order approximation, and (ii) the joint distribution for the best approximation given a topology. We first show how problem (ii) is resolved conditional on the optimal topology. Consider a true distribution P(·). Let its best third-order approximation be P_a(·) = P_a1 · P_a2 ··· P_ar, where P_a1, P_a2, ..., P_ar are third- or lower-order component distributions. Now,

D(P, P_a) = Σ_x P(·) log [P(·)/P_a(·)]
          = Σ_x P(·) log P(·) − Σ_x P(·) log [P_a1 · P_a2 ··· P_ar]
          = Σ_x P(·) log P(·) − Σ_x P(·) (Σᵢ log P_ai)
          = Σ_x P(·) log P(·) − Σᵢ Σ_x P(·) log P_ai.

The first term on the right-hand side of the above expression does not depend on the approximation. Hence, minimizing D(P, P_a) is equivalent to maximizing Σᵢ Σ_x P(·) log P_ai. Let P_a1 correspond to the distribution of the root variable X₁. Then, Σ_x P(·) log P_a1 = Σ_{X₁=0,1} P(X₁) log P_a1(X₁), which is maximized when P_a1(X₁) = P(X₁) for both realizations of X₁.

Let X_ji and X_ki be the conditioned variables for the term P_ai, and X_mi be the conditioning variable for that term. Then:

Σ_x P(·) log P_ai = Σ_{X_ji,X_ki,X_mi} P(X_ji,X_ki,X_mi) log P_a(X_ji,X_ki | X_mi)
  = P(X_mi = 0) Σ_{X_ji,X_ki} [P(X_ji,X_ki,X_mi = 0)/P(X_mi = 0)] log P_a(X_ji,X_ki | X_mi = 0)
  + P(X_mi = 1) Σ_{X_ji,X_ki} [P(X_ji,X_ki,X_mi = 1)/P(X_mi = 1)] log P_a(X_ji,X_ki | X_mi = 1).

The terms P(X_mi = 0) and P(X_mi = 1) are fixed for a given P(·), since they do not depend on the approximation. Each of the expressions Σ_{X_ji,X_ki} P(X_ji,X_ki | X_mi = 0), Σ_{X_ji,X_ki} P_a(X_ji,X_ki | X_mi = 0), Σ_{X_ji,X_ki} P(X_ji,X_ki | X_mi = 1) and Σ_{X_ji,X_ki} P_a(X_ji,X_ki | X_mi = 1) is a summation of conditional probability terms over all feasible realizations of the conditioned events, and hence each expression sums up to 1. Therefore, the expression Σ_x P(·) log P_ai is maximized when P_a(X_ji,X_ki | X_mi) = P(X_ji,X_ki | X_mi) for all realizations of the variables in the component distribution.
Furthermore, the marginal distribution for each variable is also preserved in the best third-order solution. The root variable is a parent to some node(s) in the associated third-order distribution, and therefore, it appears as a conditioning variable in one of the component distributions (i.e. for some component distribution, X_mi = X₁). For instance, if the component is a third-order term, then P_a(X_mi) = P(X_mi) and P_a(X_ji,X_ki | X_mi) = P(X_ji,X_ki | X_mi) for all realizations of the variables, and therefore we have P_a(X_ji,X_ki,X_mi) = P(X_ji,X_ki,X_mi). This means that the marginal probability for each variable included in this expression is preserved as well. One of these variables must be the conditioning variable in some other component distribution, and therefore the marginal distributions for all variables which appear in that component are also preserved. By induction, the marginal distribution for each variable must be preserved in the best third-order approximation. As a result, if the best third-order approximation for the decision problem is P₃(A,B,C,D,E) = P₃(A) · P₃(B,C|A) · P₃(D,E|C), then:

P₃(A,B,C) = P(A,B,C),   P₃(C,D,E) = P(C,D,E).
This result implies that if the topology for the best representation is known, then the probability distribution for the variables in each component of the third-order representation can be obtained by finding the appropriate marginal distribution of those variables in the actual distribution. This is easily obtained from the actual distribution by summing the probability masses corresponding to different realizations over all variables that do not appear in the component under consideration. The distribution for the complete approximate representation is obtained by using the appropriate product form. Thus, the problem of finding the best approximate representation is simplified to one of finding the topology of the network that supports the best representation. Since an analogous result holds for optimal second-order distributions as well, the probability distributions chosen for the simple tree approximation and the third-order approximation, respectively, are optimal.
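In code, obtaining a component distribution such as P₃(A,B,C) from the actual distribution is a simple marginalization. The sketch below (ours, assuming the joint is stored as a dictionary keyed by 0/1 assignment tuples) illustrates the summation described above:

```python
# Sketch (ours): given a full joint over binary variables, obtain the
# marginal of a component's variables by summing out the others.

def marginal(joint, all_vars, keep):
    """joint maps full assignments (tuples of 0/1) to probabilities."""
    idx = [all_vars.index(v) for v in keep]
    out = {}
    for assign, prob in joint.items():
        key = tuple(assign[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

# e.g. P3(A,B,C) = marginal(joint, ['A','B','C','D','E'], ['A','B','C'])
```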
A related property of the best third-order solution is that it has the maximum entropy among all feasible third-order approximations that preserve the component distributions. This easily follows from the above result. Let P_a'(·) = P'_a1 · P'_a2 ··· P'_ar be the maximum entropy solution. Then:

H(P_a') = −Σ_x P_a'(·) log P_a'(·)
        = −Σ_x P_a'(·) log (P'_a1 · P'_a2 ··· P'_ar)
        = −Σ_x P_a'(·) (Σᵢ log P'_ai).

Let P'_i(·) be the joint distribution over the variables included in each component (note that P'_ai refers to the conditional distribution, except for the root variable). Then, we have:

H(P_a') = −Σᵢ Σ_x P'_i(·) log P'_ai.

P'_i(·) and P'_ai(·) are the same as the corresponding joint and conditional distributions for the true distribution. It follows that the above expression is the same one that was maximized to obtain the best I-Divergence solution. Hence, P_a'(·) = P_a(·), where P_a minimizes D(P, P_a).

Next, we consider the problem of finding the optimal topology. Define I(Y_i,Y_j) as the mutual information across two variables Y_i and Y_j, and J(Y_i,Y_j,Y_k) as the mutual information across three variables Y_i, Y_j and Y_k:

I(Y_i,Y_j) = Σ_{y_i,y_j} P(y_i,y_j) log [P(y_i,y_j) / (P(y_i) P(y_j))],
J(Y_i,Y_j,Y_k) = Σ_{y_i,y_j,y_k} P(y_i,y_j,y_k) log [P(y_i,y_j,y_k) / (P(y_i) P(y_j) P(y_k))].
Denote the true probability distribution by P(·) and the best third-order approximation by P_a(·) = P_a1 · P_a2 ··· P_ar, where P_a1, P_a2, ..., P_ar are third- or lower-order component distributions. Let P_i correspond to the true joint distribution of the conditioned and conditioning variables included in P_ai. Then, D(P,P_a) is minimized when Σᵢ Σ_x P_i(·) log P_ai is maximized. Let P_ai be a third-order term comprising the variables X_ji and X_ki as the conditioned variables, and X_mi as the conditioning variable. Also, let T_i = Σ_x P_i(·) log P_ai. Then, writing H(X) for Σ_X P(X) log P(X), we have:

T_i = Σ_x P_i(·) log P_ai
    = Σ_{X_ji,X_ki,X_mi} P(X_ji,X_ki,X_mi) log P(X_ji,X_ki | X_mi)
    = Σ_{X_ji,X_ki,X_mi} P(X_ji,X_ki,X_mi) log [P(X_ji,X_ki,X_mi) / (P(X_ji) P(X_ki) P(X_mi))]
      + Σ_{X_ji,X_ki,X_mi} P(X_ji,X_ki,X_mi) log [P(X_ji) P(X_ki)]
    = J(X_ji,X_ki,X_mi) + H(X_ji) + H(X_ki).

Similarly, for the second-order terms:

S_i = Σ_x P_i(·) log P_ai = I(X_ji,X_mi) + H(X_ji).

Representing the second-order terms in the product approximation by SO, and the third-order terms by TO, we have:

Σᵢ Σ_x P_i(·) log P_ai = H(X₁) + Σ_{i∈SO} S_i + Σ_{i∈TO} T_i
  = H(X₁) + Σ_{i∈SO} [I(X_ji,X_mi) + H(X_ji)] + Σ_{i∈TO} [J(X_ji,X_ki,X_mi) + H(X_ji) + H(X_ki)],

where X₁ is the root variable. The terms H(X_ji) and H(X_ki) in the above expression correspond to entropies of conditioned variables in the lower-order terms in the product approximation. Each variable in the joint distribution to be approximated, other than the root variable, appears as a conditioned variable in exactly one of the second- or third-order terms. Therefore:

Σᵢ Σ_x P_i(·) log P_ai = Σ_{i∈SO} I(X_ji,X_mi) + Σ_{i∈TO} J(X_ji,X_ki,X_mi) + Σ_{q=1,...,n} H(Y_q).

The term Σ_{q=1,...,n} H(Y_q) is the sum of the entropy terms for each variable in the actual distribution, and does not depend on the second- and third-order terms chosen in the product approximation. Thus, minimizing D(P,P_a) is equivalent to maximizing:

Σ_{i∈SO} I(X_ji,X_mi) + Σ_{i∈TO} J(X_ji,X_ki,X_mi).

It follows that the optimal third-order approximation maximizes the sum of the second-order and third-order mutual information terms. In order to determine the topology for the best third-order approximation, the second-order and third-order mutual information terms must be determined for each combination of pairs and triplets of variables in the problem. The best solution corresponds to the set of pairs and triplets that maximizes the sum of the mutual information terms, while not forming cycles across components in the underlying structure. It is easy to verify that third-order solutions are always preferred to second-order solutions (which correspond to simple tree structures), in terms of the I-Divergence measure. Any two variables that share the same parent in a second-order solution, along with the parent variable itself, can be combined into a third-order term that results in a better solution. For instance, let X_j and X_k share the same parent X_m in a simple tree structure. Then, using the third-order term P(X_j,X_k | X_m) instead of the second-order terms P(X_j | X_m) · P(X_k | X_m) yields J(X_j,X_k,X_m) ≥ I(X_j,X_m) + I(X_k,X_m), with equality only if X_j and X_k are conditionally independent with respect to X_m. Thus, the third-order approximation is at least as good as the best second-order approximation, and it is strictly superior if the conditioned variables in the third-order term are not independent given the conditioning variable in the distribution P(·) that is being approximated.
4. Solution techniques

This section presents procedures to find the topology for the best third-order approximation. First, a greedy heuristic for this problem is presented. The greedy heuristic provides a way to find good, although sub-optimal, solutions very quickly.
This procedure is then used in an exact branch and bound algorithm that finds optimal solutions.

4.1. Third-order approximation

4.1.1. Problem formulation
The best third-order approximation corresponds to
a set of pairs and triplets of variables, the sum of whose mutual information terms is the maximum. The problem of finding the best third-order representation is formally stated below.

4.1.2. Problem statement
Instance: A set of binary variables X_i, i = 1,...,n. The joint distribution associated with the variables X₁,...,X_n.
Problem: To find a set of triplets and pairs that span all the variables, and are connected without forming "cycles", such that the sum of weights of the chosen triplets J(X_i,X_j,X_k) and pairs I(X_i,X_j) is maximized.

The following are considered to be "cycles" in the context of the above problem: (i) two triplets/pairs that share more than one common variable, and (ii) two triplets/pairs that are connected to each other through different paths of triplets/pairs. Examples of feasible sets of triplets and pairs are shown in Fig. 5.

In order to determine the best third-order approximation, we need the mutual information terms across all possible sets of triplets and pairs of variables. The mutual information term for a triplet or a pair of variables can be evaluated from the joint distribution
associated with that triplet or pair. Since second-order distributions are a subset of third-order distributions, it is the third-order distributions that are important. Therefore, it is not necessary to have the complete joint distribution for the entire set of variables under consideration, which is usually hard to obtain. Instead, we only need to obtain the third-order distributions. Such low-order distributions can be specified more accurately by experts or estimated from historical databases.

The number of probability parameters required for specifying our third-order approximation is clearly greater than that needed for the second-order approximation. For n = 5 variables as in our reliability example, up to 15 parameters may be required for the second-order approximation; this number increases to 25 for the third-order approximation. These numbers are obtained as follows. C(n,1) = n parameters are required to specify the first-order distributions (this is because exactly one independent parameter completely specifies the marginal distribution for each binary variable). One additional parameter is required for each of the C(5,2) nondegenerate second-order distributions. When the third-order case is considered, there are C(5,3) nondegenerate third-order distributions, with one additional parameter required for each. In general, for a problem with n variables, the second-order approximation requires C(n,1) + C(n,2) probability parameters, whereas the third-order approximation requires C(n,3) additional parameters. Thus, while the third-order approximation is quite manageable for small n, it may be hard to obtain all the required probability parameters directly from experts for large n. This calls for the use of a decomposition scheme such as the one we propose in Section 6.
Fig. 5. Feasible sets of spanning triplets and pairs for variables A, B, C, D and E. (Two examples: {(A,B,C), (C,D,E)} and {(A,B,C), (B,D), (C,E)}.)
Under this scheme, we decompose the problem into smaller subproblems and estimate the third-order terms only within a subproblem.
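The parameter counts above can be written down directly; a short check (ours):

```python
# Sketch (ours): number of probability parameters for the second- and
# third-order approximations of Section 4.1, for n binary variables.

from math import comb

def second_order_params(n):
    return comb(n, 1) + comb(n, 2)     # marginals + pairwise terms

def third_order_params(n):
    return second_order_params(n) + comb(n, 3)

print(second_order_params(5), third_order_params(5))   # 15 25
```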
4.2. Greedy heuristic

The heuristic procedure to find the best third-order solution works as follows. First, all mutual information terms associated with second- and third-order components are evaluated. Next, components are included in the solution if they contribute the largest amount of mutual information without forming cycles across components. This is checked by ensuring that new components, which are being added, do not share more than one variable with the components already selected (called the acyclicity check).

The greedy procedure:
Step 0. Sort all second- and third-order components in decreasing order of their mutual information terms.
Step 1. Find the component with the highest mutual information and include it in the solution. Include the variables of that component in the list Included-Variables.
Step 2. Repeat Steps 3 and 4 until all variables have been included in Included-Variables.
Step 3. Find the component with the highest mutual information among components not included in the solution.
Step 4. Perform an acyclicity check using Included-Variables. If the component can be included, add it to the solution, and add its new variables to Included-Variables. Return to Step 3. If it cannot be included, find the next highest component and repeat Step 4.
The algorithm is efficient and requires at most n − 1 passes over the mutual information terms that are being considered (where n is the number of variables). While a similar procedure obtains the best second-order solution (Chow and Liu, 1968), unfortunately, it is not guaranteed to find the optimal third-order solution. Determining the best third-order topology based on the mutual information terms appears to be a hard problem (however, no proofs are currently available to show that the problem is NP-Hard). Therefore, we have developed an exact procedure using the branch and bound technique.
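A compact rendering of the greedy procedure is sketched below (ours, not the authors' implementation; we interpret the acyclicity check as requiring that a new component share exactly one variable with those already included, which keeps the structure connected and cycle-free):

```python
# Sketch (ours): the greedy heuristic of Section 4.2. A component is a
# (variable-set, mutual-information) pair.

def greedy_third_order(components, variables):
    ranked = sorted(components, key=lambda c: -c[1])  # Step 0: sort by MI
    solution = [ranked[0]]                            # Step 1: best component
    included = set(ranked[0][0])
    candidates = ranked[1:]
    while included != set(variables):                 # Step 2
        for k, (comp, mi) in enumerate(candidates):   # Step 3: next best
            if len(comp & included) == 1:             # Step 4: acyclicity check
                solution.append((comp, mi))
                included |= comp
                del candidates[k]
                break
        else:
            break                                     # no feasible component left
    return solution

# Hypothetical weights (the J and I terms would be computed beforehand):
comps = [(frozenset('ABC'), 0.30), (frozenset('CDE'), 0.25),
         (frozenset('BD'), 0.10), (frozenset('CE'), 0.08)]
print(greedy_third_order(comps, list('ABCDE')))
```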
4.3. Branch and bound solution for the third-order formulation

The branch and bound algorithm obtains the optimal solution by generating partial solutions and evaluating upper and lower bounds for these partial solutions. The partial solutions are initially generated by incrementally adding feasible components to a null solution (with no components). The lower bound for each partial solution is obtained using the greedy procedure described above. The lower bound for the entire problem is the highest among the lower bounds of existing partial solutions. The upper bound for each partial solution is also obtained in a greedy manner after relaxing the acyclicity constraint. The upper bound does not usually correspond to a feasible solution. When it does, it is optimal for that partial solution. If the upper bound for a partial solution is less than the current lower bound for the problem, that path of the branch and bound tree is eliminated from further consideration. When all of the existing partial solutions have been examined, the one with the highest upper bound is explored further. This is done by generating a new partial solution for each component that may be feasibly added to the existing partial solution. Upper and lower bounds for these new partial solutions are evaluated, and depending on these values, the partial solutions are either eliminated or kept active for further analysis. This procedure continues until all active partial solutions are fathomed, i.e. either shown to provide the optimal solution, or to be dominated by some other solution.

The branch and bound procedure easily finds the best third-order solution for the reliability problem. We have also analyzed the performance of the procedure for larger problems with up to 9 variables. The statistic of primary interest is the number of partial solutions evaluated by the branch and bound algorithm before the solution is obtained, which is the "number of iterations" it takes the algorithm to find the solution. The mutual information terms used for testing were generated using the uniform, normal, and bimodal distributions, respectively, and the branch and bound program run for different problem sizes.
Table 2
Performance of the branch and bound algorithm for simulated test data

Problem size   No. of iterations for solution
               1-10   11-50   51-250   251-500   500-2500   >2500   Avg
7               83     66       1         -          -         -     13
8               31     70      46         3          -         -     53
9               11     24      71        24         17         3    323
Each problem size was tested using 50 sets of data for each distribution. The performance of the branch and bound algorithm was found to be very similar for test data generated using the different distributions. Therefore, we have consolidated the results for the different test data, and summarized them in Table 2. The performance for Local Event Groups consisting of 10 variables is less efficient, where approximately 60 percent of the data-sets (19 of 30) were solved within 500 iterations. For even larger problems, the number of iterations could grow quickly. This suggests that the branch and bound procedure is not appropriate for very large problem instances. The worst case complexity of the algorithm is high due to the fact that the number of feasible solutions grows exponentially with the number of variables in the problem.
Table 3
Median, average and maximum iterations required for convergence (nine-variable problem)

            % error: lower bound          % error: upper bound
            5%     2%     1%     0.5%     5%     2%      1%      0.5%
Median       1      1      2      3        1      25      61      143
Average     1.2    2.1    3.9    8.1      20      74     152      323
Maximum      4     70     70     76      449    1266    1805     3016
For large problem instances, we evaluate the quality of the solutions obtained when the search is terminated early. This is done by examining how the bounds converge to the optimal solution in the branch and bound solution procedure. The lower bound of the problem, as obtained after a non-terminal iteration, is also the best feasible solution identified up to that point. The upper bound does not usually correspond to a feasible solution. Convergence of the lower and upper bounds is examined for problems with 9 variables. The numbers of iterations required for the bounds to be within 5%, 2%, 1% and 0.5% of the optimal solution are identified. The median, average and maximum number of iterations required are summarized in Table 3. The lower bound converges to the optimal solution very quickly. The median number of iterations required before the lower bound converges is only 3, while the maximum number of iterations required is 76.
This is a very small fraction of the total possible number of nodes for the branch and bound tree. The above results indicate that the algorithm locates the best solution very quickly, although verifying that it is the optimal one still requires examining a large number of partial solutions. Thus, the algorithm is very likely to locate the optimal solution even when the procedure is terminated early. Another interesting result is that the median numbers of iterations for different levels of convergence are significantly lower than the averages. This is because there are a few instances where the number of iterations required for convergence is very high, and these increase the average considerably. Fig. 6 illustrates the convergence of the upper and lower bounds for the simulated data.

Fig. 6. Convergence of bounds for simulated data (problem size: nine variables).
5. Benefits of the third-order approximation

The technique presented in Sections 3 and 4 can be easily applied to the reliability problem discussed in Section 2. We compare the tree structure and the optimal third-order representation for a wide range of different true distributions, and find that the best third-order approximation outperforms the best second-order approximation by a large margin. For the example in Section 2, the loss parameters associated with the third-order approximation were R = 0.096 and L = 11.9%, while those for the second-order approximation were R = 0.344 and L = 61.6%. We examine whether such gains are to be expected in general by repeating this experiment for a large number of true distributions associated with the reliability problem. For each distribution considered, the best third-order and the best second-order approximations are obtained, respectively, and the associated losses evaluated. For convenience in comparing the two approximate representations, we have restricted the true distributions to those for which the best second-order and third-order approximate representations have the structures shown in Fig. 7. True distributions chosen for analysis are those that cannot be exactly represented by either the best second- or third-order approximations. The different true distributions are obtained in the following manner.
Fig. 7. Belief network - actual and best approximate structures (a. actual distribution; b. third-order distribution; c. second-order distribution).
The arc between B and E prevents the true distribution from being exactly represented by a third-order approximation. We vary P(BE) and P(BCE) for different experiments, since they affect the strength of the dependency between B and E. The range of values chosen for P(BE) and P(BCE) are those that lead to second-order and third-order approximate structures as shown in Fig. 7. For each set of values for P(BCE) and P(BE), we obtain the odds of the system being OK after observing none, 1, 2, 3 or 4 components, respectively. All possible combinations of components are considered for observation prior to a prediction being made. When some components are observed before prediction, the posterior odds of the system being OK are evaluated using the true and approximate representations. The values of R and L are obtained using the posterior odds for each experiment and summarized. The parameter L can be aggregated in a straightforward fashion (the average is an adequate measure for comparison). It is harder to consolidate R, which is the interval of C₂ where loss occurs. Since C₂ is the ratio of the costs associated with the two types of errors, each value of C₂ refers to all those instances of costs of making Type I and Type II errors, respectively, which have that particular ratio. Its domain is the real line [0,∞). The following transformation scheme maps the values of C₂ in the interval [0,∞) onto the interval [0,2]:

If C₂ ≤ 1 then C₂' = C₂;  if C₂ > 1 then C₂' = 2 − 1/C₂.

The above transformation centers the interval around 1, with the distance from 1 reflecting the proportionate rather than the absolute difference of the two types of costs. The results of the experiment are summarized in Table 4.
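As a sketch (ours), the transformation is:

```python
# Sketch (ours): the transformation mapping cost ratios C2 in [0, inf)
# onto [0, 2], used to aggregate the loss interval R across experiments.

def transform(c2):
    return c2 if c2 <= 1.0 else 2.0 - 1.0 / c2

print(transform(0.5), transform(1.0), transform(4.0))  # 0.5 1.0 1.75
```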
Table 4
Summary results

P(BCE)   P(BE) a       R                          L (%)
                       Third-order  Second-order  Third-order  Second-order
0.31     0.35-0.43     0.0272       0.1504         7.05        37.23
0.32     0.36-0.43     0.0058       0.1742         2.04        44.87
0.33     0.37-0.43     0.0269       0.1975         7.20        53.55
0.34     0.39-0.43     0.0478       0.2193        12.76        63.07
0.35     0.41-0.43     0.0692       0.2410        19.60        74.57

a Feasible interval.

Each row summarizes the loss parameters for different feasible values of P(BE) corresponding to a specific probability for P(BCE). For example, the second row in Table 4 summarizes the loss parameters for the case where P(BCE) = 0.32. For that value of P(BCE), the feasible range for P(BE) is [0.36, 0.43]. The loss parameters are determined for values of P(BE) incremented in steps of 0.01, and the results summarized. The average value of R is found to be 0.0058 for the third-order approximation, whereas it is 0.1742 for the second-order approximation. Similarly, the average value for L is 2.04% for the third-order approximation, whereas it is 44.87% for the second-order approximation. Figs. 8 and 9 highlight the performance differences between the two approximate representations.

Fig. 8. Average R.

Fig. 9. Average L.
The loss parameters for the third-order approximation are lowest for P(BCE) = 0.32. This is attributed to the fact that the third-order approximation also has P₃(BCE) = 0.32 for that case. However, there is still some loss associated with the third-order solution since P₃(BE) ≠ P(BE). Overall, the third-order approximation dominates the second-order approximation for the entire range of probability parameters that were examined, and leads to significantly lower losses. This is true for both loss parameters. The distributions we have examined include the entire range for P(BCE) and P(BE) that leads to the optimal approximate structures shown in Fig. 7. Therefore, the results are fairly representative for the reliability problem.
6. A decomposition technique for large problems

The algorithms developed in Sections 3 and 4 are applicable to a broad class of problems. Tree and third-order approximate representations can be used effectively for any application where the response time requirements are very stringent. In such applications, our techniques can be used to find the optimal third-order representation and to characterize its properties. However, the performance of the exact branch and bound algorithm deteriorates for problems consisting of 10 or more variables. Furthermore, as discussed in Section 4, the number of probability parameters required to specify the necessary third-order distributions also increases rapidly with the size of the problem. The question that arises, then, is how to handle problems with a large number of variables. In this section, we present a decomposition technique that helps in solving such large problems.

An intuitively appealing and practically feasible idea is to reduce the complexity of such problems by dividing the set of variables into smaller subsets that consist of highly dependent events. We use the notion of Local Event Groups (LEGs) (Duda et al., 1979) to perform such a decomposition. The belief network is interpreted as a collection of Local Event Groups that are connected to each other by common events (Fig. 10).

Fig. 10. Local Event Groups in a belief network.

We show that under certain conditions, the best third-order structure for the problem
can be obtained by finding the best third-order solution for each LEG. The conditions are:
1. LEGs are connected to each other by common variables, and any two groups that are connected share exactly one common variable.
2. A LEG is connected to any other group through exactly one series of intermediate LEGs.
Decomposing the problem into smaller subproblems captures the essence of human reasoning in problem domains with large numbers of variables (Simon, 1982). The two constraints serve as modeling guidelines to facilitate the tree-structuring process. When two LEGs share more than one common variable, it may be appropriate to merge the two groups into a single unified group. This is achieved, of course, at the cost of having a larger number of related variables in the unified group. The second condition implies that the LEGs themselves are not interconnected in such a manner that forms cycles. This is usually the case when large complex problems are decomposed into sub-problems. These sub-problems correspond to sub-hypotheses that need not be related to each other, except in contributing to explain the primary hypothesis. These features are inherent in practically all large real-life implementations and have been well-documented in the literature (Duda et al., 1979; Shortliffe, 1976). Once a problem has been decomposed into LEGs, the individual LEGs can be approximated with third-order structures. We show that under the above conditions, the optimal third-order approximation can be obtained by finding the optimal third-order approximation for each of the LEGs separately; a small validation sketch for the two conditions follows.
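A simplified feasibility check for these two conditions might look as follows (our sketch; it tests pairwise overlaps and uses the edge count k − 1 as a proxy for the tree condition, assuming the groups are connected):

```python
# Sketch (ours): checking the two LEG conditions of Section 6. LEGs are
# given as sets of variable names; we verify pairwise overlaps of at most
# one variable, and that the connection structure has tree-like edge count.

from itertools import combinations

def valid_leg_decomposition(legs):
    edges = 0
    for g1, g2 in combinations(legs, 2):
        shared = len(set(g1) & set(g2))
        if shared > 1:
            return False           # condition 1 violated
        edges += 1 if shared == 1 else 0
    # a connected, acyclic collection of k groups has exactly k-1 links
    return edges == len(legs) - 1  # condition 2 (assumes connectivity)

print(valid_leg_decomposition([{'W','X1','X2'}, {'W','Y1','Y2'}]))  # True
```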
By interpreting a belief network as a collection of LEGs, we can represent the underlying probability distribution as a product approximation with each term in the approximation corresponding to one LEG. The conditioning event in the joint distribution associated with a LEG is the variable that the LEG shares with its parent LEG. Consider a belief network that consists of two LEGs, L1 and L2. Without loss of generality, we can assume that L1 is the parent of L2. Let L1 contain n + 1 variables W, X₁,...,X_n, and L2 contain m + 1 variables W, Y₁,...,Y_m. We can represent the underlying probability distribution as:

P(W,X₁,...,X_n,Y₁,...,Y_m) = P(X₁,...,X_n | W) · P(Y₁,...,Y_m | W) · P(W) = P(X | W) · P(Y | W) · P(W).

Further, for any subset X_s of X and Y_s of Y, we have:

P(X_s,Y_s,W) = P(X_s | W) · P(Y_s | W) · P(W).

We now proceed to show that the optimal representation is obtained by minimizing the I-Divergence measure for each LEG separately. Let the best approximation using the I-Divergence measure be P_a'(·). If the proposition is false (i.e., decomposing the problem leads to sub-optimal solutions), then there must be some lower-order term such that its conditioned events are from both LEGs and its conditioning event is not W. Let the term be P_a'(X_i,Y_j | X_k), where X_k ≠ W. Consider another product approximation P_a(·) that has the same lower-order terms as P_a', except that X_i is conditioned on X_k, and Y_j is conditioned on W. If P_a' is the best approximation, we must have D(P,P_a') ≤ D(P,P_a), i.e.,

−Σ_{X_i,Y_j,X_k,W} P(X_i,Y_j,X_k,W) log P_a'(X_i,Y_j | X_k)
  ≤ −Σ_{X_i,Y_j,X_k,W} P(X_i,Y_j,X_k,W) log [P_a(X_i | X_k) · P_a(Y_j | W)].

If we set P_a(X_i | X_k) equal to P_a'(X_i | X_k), and write P_a'(X_i,Y_j | X_k) = P_a'(Y_j | X_i,X_k) · P_a'(X_i | X_k), the expression −Σ_{X_i,X_k} P(X_i,X_k) log P_a'(X_i | X_k) becomes common to both sides of the above inequality. Therefore, we must have:

−Σ_{X_i,Y_j,X_k,W} P(X_i,Y_j,X_k,W) log P_a'(Y_j | X_i,X_k) ≤ −Σ_{Y_j,W} P(Y_j,W) log P_a(Y_j | W).    (i)

By setting P_a(Y_j | W) = P(Y_j | W) for all realizations of Y_j and W, we have P_a(Y_j | W) · P(X_i,X_k,W) = P(X_i,Y_j,X_k,W), which is the marginal of the actual distribution over the variables X_i, Y_j, X_k and W. Therefore, the right-hand side corresponds to the I-Divergence D(P(X_i,Y_j,X_k,W), P_a(Y_j | W) · P(X_i,X_k,W)), which must be less than the I-Divergence measure for any other approximation over the variables X_i, Y_j, X_k and W. Thus, (i) is false, and therefore P_a'(·) is not the best approximation for the belief network.

Next, consider belief networks with k LEGs, where k > 2. By combining LEGs 2 through k into one large LEG S, we obtain the two-LEG situation considered earlier. This means that the best approximation is obtained by solving for LEG 1 and LEG S independently. The result is proved for any arbitrary k by repeating this argument for each LEG that makes up LEG S.

An important outcome of the above result is that the quality of the best approximate solution, as
obtained by approximating each LEG, is not compromised when the stated conditions hold. Subsequently, we can restrict our attention to the task of finding the best lower-order approximation for individual LEGs. In Section 3.2 we showed that in the optimal solution, the first-order marginal distribution is preserved for every variable in a LEG. It then follows that such marginal distributions are preserved in the entire network. This feature allows for seamless integration of the different LEGs to obtain the complete approximate structure. If the marginal probabilities of variables shared by different LEGs were not preserved, then the approximate distributions for these variables obtained for lower-order approximations of different LEGs would not be the same. Reconciling these distributions would require modifications to the solutions obtained for one or more LEGs sharing that variable. This means that the solution for one or more LEGs would no longer be optimal, thereby affecting the solution for the entire network. The adjustment of parameters for a LEG can further cascade onto adjacent LEGs. In the worst case, the distribution for all variables in the lower-order approximation may have to be modified. This would severely undermine the advantage of decomposing the inference network into LEGs. By preserving all first-order marginals, the solution ensures that the marginal probability of a variable is consistent (and equal to that in the actual distribution), regardless of the number of LEGs sharing that variable.

The branch and bound algorithm is tractable for problems that can be decomposed into LEGs. We rarely expect to deal with LEGs consisting of 10 or more variables. Decomposing large problem domains into Local Event Groups would ensure that the cardinality of such Local Event Groups is usually below 10. Furthermore, joint probability distributions for Local Event Groups that have a significantly larger number of variables would be very difficult to elicit from experts, or to estimate accurately from historical data. When such problems do occur, the rapid convergence of the lower bound provides optimal, or close to optimal, solutions within a relatively small number of iterations. Considering these factors, the branch and bound algorithm is adequate for finding solutions to the third-order approximation problem.

Another useful feature of our solution technique is update flexibility. It is relatively easy to modify the
approximate representation when parameters are changed, or new variables added, in the belief network. Parameters in the belief network may need to be changed to reflect a better understanding of some causal phenomena; incorporating these changed parameters requires restructuring only the LEG(s) that is (are) affected. Similarly, new variables may be added to enhance the capability of an expert system; here, too, a relatively small number of LEGs will typically require modification. This is a desirable feature in real environments.
7. Summary

We have proposed the use of a third-order approximation for the representation of probabilistic data in expert systems. Using a reliability problem as an example, we have shown that third-order representations perform significantly better than tree-structured representations (which are second-order representations), while they are only a little more complex than tree representations. We then addressed the more general problem of finding the third-order representation that best approximates the probability distribution underlying a problem domain. We characterized the best approximate representation using the I-Divergence measure, and showed that using this measure results in the maximum entropy solution among feasible representations. Heuristic and exact techniques were developed to obtain the best third-order representation and shown to perform well for problems with up to nine variables. Finally, we presented a decomposition technique that allows the exact algorithm to be efficiently used for solving large problem instances.

While this research has focused on tree and third-order structures, the theoretical results are not limited to such structures. Currently, we are examining more general structures as approximate representations. Another direction for future research is to identify similar structures for interval-valued belief functions (Kyburg, 1987; Shenoy and Shafer, 1986, 1988). Belief intervals express uncertainty about the probabilities themselves, and allow for the representation of ignorance. Such representations are particularly useful when the data provided by the expert does not completely specify the underlying joint distribution
(the “under-constrained problem” (Pearl, 1988)). The feasibility of obtaining efficient exact and approximate structures when using interval-based probabilities needs to be rigorously examined.
Acknowledgements

The authors would like to thank Professors Henry Kyburg, Greg Dobson, Marshall Friemer, Ushio Sumita and Christopher Brown from the University of Rochester for their helpful comments and suggestions at various stages of this research. Helpful comments were also provided by participants of research seminars at the University of Rochester and Louisiana State University. Finally, we wish to thank the anonymous referees whose detailed comments and suggestions have greatly enhanced the quality of the paper. Earlier findings based on some of the results reported in Section 3 of this paper have been presented in the Proceedings of the Workshop on Uncertainty in Artificial Intelligence, 1993.
References

Chow, C.K., and Liu, C.N. (1968), "Approximating discrete probability distributions with dependence trees", IEEE Transactions on Information Theory IT-14/3, 462-467.
Cooper, G.F. (1990), "The computational complexity of probabilistic inference using Bayesian belief networks", Artificial Intelligence 42, 393-405.
Duda, R.O., Hart, P.E., and Nilsson, N.J. (1976), "Subjective Bayesian methods for rule-based inference systems", Proceedings of the National Computer Conference, AFIPS Conference Proceedings 45, 1075-1082.
Duda, R.O., Hart, P.E., Konolige, K., and Reboh, R. (1979), "A computer-based consultant for mineral exploration", Final Report, SRI Project 6415, SRI International, Menlo Park, California.
Geiger, D. (1992), "An entropy-based learning algorithm of Bayesian conditional trees", Proceedings of the 8th Workshop on Uncertainty in AI, 92-97.
Good, I.J. (1952), "Rational decisions", Journal of the Royal Statistical Society, Series B 14, 107-114.
Kyburg, H.E. (1987), "Bayesian and non-Bayesian evidential updating", Artificial Intelligence 31, 271-294.
Lauritzen, S.L., and Spiegelhalter, D.J. (1988), "Local computations with probabilities on graphical structures and their application to expert systems", Journal of the Royal Statistical Society B 50/2, 157-224.
Pearl, J. (1986), "Fusion, propagation, and structuring in belief networks", Artificial Intelligence 29, 241-288.
Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, California.
Rebane, G., and Pearl, J. (1987), "The recovery of causal polytrees from statistical data", Proceedings of the 3rd Workshop on Uncertainty in AI, Seattle, 222-228.
Rosenberg, E., and Gleit, A. (1994), "Quantitative methods in credit management: A survey", Operations Research 42/4, 589-613.
Shenoy, P.P., and Shafer, G. (1986), "Propagating belief functions with local computations", IEEE Expert 1/3, 43-52.
Shenoy, P.P., and Shafer, G. (1988), "Axioms for probability and belief-function propagation", Working Paper 209, School of Business, The University of Kansas.
Shortliffe, E.H. (1976), Computer-Based Medical Consultations: MYCIN, Elsevier (North-Holland), Amsterdam.
Simon, H.A. (1982), The Sciences of the Artificial, The MIT Press, Cambridge, MA.
Stael von Holstein, C.-A.S. (1970), "Assessment and evaluation of subjective probability distributions", The Economic Research Institute at the Stockholm School of Economics, Stockholm.
Winkler, R.L., and Poses, R.M. (1993), "Evaluating and combining physicians' probabilities of survival in an intensive care unit", Management Science 39/12, 1526-1543.
Wong, S.K.M., and Poon, F.C.S. (1989), "Comments on approximating discrete probability distributions with dependence trees", IEEE Transactions on Pattern Analysis and Machine Intelligence 11/3, 333-335.