Pattern Recognition, Vol. 25, No. 11, pp. 1389-1400, 1992. Printed in Great Britain.
0031-3203/92 $5.00+.00 Pergamon Press Ltd. © 1992 Pattern Recognition Society
ON USING THE CHI-SQUARED METRIC FOR DETERMINING STOCHASTIC DEPENDENCE

R. S. VALIVETI and B. J. OOMMEN†

School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6
(Received 14 November 1991; received for publication 16 March 1992) Abstract--In several pattern recognition applications it is often necessary to approximate probability distributions/densities. In estimating this density function, it is either assumed that the form of the density function is known and that parameters that characterize the distribution are merely estimated, or it is assumed that no information about the density function is available. The latter formulation is considered with the additional constraint that even the stochastic dependence is unknown. For the case of discrete-valued features the well-known method due to Chow and Liu (IEEE Trans. Inf. Theory 14, 462-467 (May 1968)) which uses dependence trees, can be used to approximate the underlying probability distribution. This method determines the best dependence-tree based on the well-acclaimed expected mutual information measure (EMIM) metric. The suitability of a chi-squared metric is studied for the same purpose. For a restricted class of distributions, both these metrics are shown to be equivalent and stochastically optimal. For more general cases, the latter metric is almost as efficient as the optimal one, and in all cases the technique presented here is computationally almost an order of magnitude faster than the EMIM based method.
Approximation   Estimation   Closeness of approximation   Statistical information   Probability distribution approximation   Dependence trees
1. INTRODUCTION

The design of character recognition systems, communication systems and information retrieval systems involves pattern recognition. That is, the objective is to classify a specific sample (e.g. a hand-written character, a digital signal, or a book in the library) into one of a set of known categories. Let us denote the measurement made on a sample as X and the various classes by {ω_i | i = 1, 2, ..., c}. The problem of pattern recognition is indeed one in which we attempt to find the class ω_i which maximizes the probability Pr(ω_i|X). Since the quantity Pr(ω_i|X) is not readily computable, this quantity (or a quantity related to it) is computed using the well-acclaimed Bayes' rule, which relates the a priori probabilities to the a posteriori probabilities. It is worth observing that this term can be computed only by having an accurate knowledge of the distribution P(X|ω_i), referred to as the class-conditional distribution. The measurement made on the sample X is generally an N-dimensional vector [x_1, x_2, ..., x_N]^T. The components of the vector X, i.e. x_1, x_2, ..., x_N, are referred to as the features and the vector itself is referred to as the feature vector. The information about the distribution of features within a single class is typically obtained by a process referred to as "training". During this process, the information system is presented with a set of samples and is simultaneously informed of the class from which the samples originated. Typically, if
† Author to whom all correspondence should be addressed.
the pattern classifier is of the parametric family of classifiers, it assumes the model for the distribution of features and uses these samples to estimate the parameters that characterize the distribution. Owing to the fact that the number of samples may be small, and that real machines often impose limitations on available storage, it is often necessary to restrict the amount of information that can be collected from the samples. Keeping this constraint in mind, it is common to make various simplifying assumptions about the distribution P(X|ω_i). The simplest assumption made about this distribution is that the random variables or features are statistically independent. Although this assumption (i.e. that of stochastic independence) simplifies the mathematical formulation, it does not necessarily model the underlying real-life situation adequately. To accurately evaluate the joint distribution of the features, the intention of the system designer is to capture as much of the information about their stochastic dependence as possible. Thus, for example, in order to account for the dependence between normally distributed random variables, it is easy to show that the covariance matrix is adequate, the result being a consequence of the fact that a normal random vector is completely defined by its mean and the covariance matrix. In the case when the random variables cannot be assumed to be normally distributed, or when the random variables are of the discrete sort, a completely distinct method is required to capture the information about their mutual stochastic dependence. This paper deals with the problem of representing the information regarding the stochastic dependence
between the random variables when they are discrete-valued. Clearly, this constitutes a large number of cases in the field of traditional pattern recognition, and indeed includes the entire field of problems encountered in information retrieval.(6) The central issue in tackling this problem is one of approximating the joint distribution/density P(X), where X is the vector [x_1, x_2, ..., x_N]^T. Because of the large dimensionality of the vector X, it is both infeasible and impractical to store estimates of the joint density function P(X) for all possible values of X. It is infeasible because, if each feature x_i could take one of a set of k distinct values, the number of estimates that would have to be maintained is of the order of k^N. Obviously such a scheme would be impractical also. A second alternative is one of approximating P(X) by a well-defined and easily computable density function Pa(X). When using an approximating density function Pa(X), the question is now one of measuring how well this function approximates the "real" density function P(X). In order to quantify this "goodness" of approximation, we must define a distance measure between these two density functions. One such measure, which is an information theoretic measure, has been known to serve this need. It is given below:
    I(P, Pa) = Σ_X P(X) log [P(X)/Pa(X)].   (1)

It is well known that I(P, Pa) ≥ 0 and I(P, Pa) = 0 only if the approximation is exact (i.e. P(X) = Pa(X), for all X). Hence I(P, Pa) is a measure of the closeness of the approximation; the smaller this measure, the better the approximation. In other words, the intent in finding the best approximation for a given density function P(X) is to find a density function Pa(X) from the set of available approximations such that I(P, Pa) is minimized.
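For discrete distributions, the closeness measure of equation (1) is straightforward to evaluate. The following minimal Python sketch is ours, not the paper's; the dictionary representation of the distributions is an assumption made purely for illustration:

```python
import math

def closeness(P, Pa):
    """I(P, Pa) = sum_X P(X) log[P(X)/Pa(X)]; equation (1).

    P and Pa map each outcome X (e.g. a tuple of feature values)
    to its probability.  Pa must be positive wherever P is; terms
    with P(X) = 0 contribute nothing to the sum.
    """
    total = 0.0
    for x, p in P.items():
        if p > 0.0:
            total += p * math.log(p / Pa[x])
    return total

# Example: a distribution on two binary features, and an
# independence-based approximation of it.
P  = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Pa = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(closeness(P, Pa))   # strictly positive: approximation inexact
print(closeness(P, P))    # 0.0: an exact approximation
```

As the text notes, the measure vanishes exactly when Pa agrees with P everywhere, and is strictly positive otherwise.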
1.1. Approximation techniques

We have pointed out earlier that it is impractical to gather the estimates for the joint density function P(X), for all the different values which X could assume. Indeed, such a task would involve maintaining estimates of the highest-order marginals, inasmuch as these marginals also completely determine the lower-order marginals. Thus from a feasible and practical point of view, we are restricted to collecting information about the lower-order marginals (i.e. the joint density functions for subsets of the N random variables). The basic method adopted in all the approximation methods is as follows. If we have all marginals of the kth order available, the approximation Pa(X) will be chosen so that its marginals of order k (and less) agree with the corresponding marginals of the underlying density function P(X). Among all such functions which satisfy the constraint of the marginals, the best approximation is the one that minimizes the closeness measure defined in equation (1). It is natural to expect that the approximation gets better with the availability of the estimates of higher-order marginals.
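The marginal-gathering step described above can be sketched as follows. This is an illustrative fragment of our own (the function name and the tuple representation of samples are assumptions, not the authors'), restricted to the first- and second-order marginals used throughout the paper:

```python
from collections import Counter
from itertools import combinations

def estimate_marginals(samples):
    """Estimate first- and second-order marginals from a list of
    samples, each an N-tuple of discrete feature values.

    Returns (first, second) where first[i][v] estimates Pr(x_i = v)
    and second[(i, j)][(u, v)] estimates Pr(x_i = u, x_j = v), i < j.
    """
    s = len(samples)
    n = len(samples[0])
    first = [Counter() for _ in range(n)]
    second = {pair: Counter() for pair in combinations(range(n), 2)}
    for x in samples:
        for i, v in enumerate(x):
            first[i][v] += 1
        for i, j in combinations(range(n), 2):
            second[(i, j)][(x[i], x[j])] += 1
    first = [{v: c / s for v, c in f.items()} for f in first]
    second = {p: {uv: c / s for uv, c in t.items()}
              for p, t in second.items()}
    return first, second

first, second = estimate_marginals([(0, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)])
print(first[0])          # relative frequencies of x_0
print(second[(0, 2)])    # joint relative frequencies of (x_0, x_2)
```

Note that only O(N²k²) estimates are stored, in contrast to the k^N estimates the full joint density would require.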
The first solution to this problem of approximating discrete distributions was presented by Chow and Liu.(2) In this method, the approximation uses only the information gathered about the first- and second-order marginals. This constraint naturally leads to an approximation which is based on the Chain Rule, with the modification that the conditional probabilities utilized in the approximation are of the type Pr(x_i|x_j). This can conceptually be viewed as a spanning tree of a complete graph consisting of N nodes. A node x_i is said to be the parent of x_j, if the probability Pr(x_j|x_i) ≠ Pr(x_j). Approximations of this type will be called "tree-type approximations". This method, due to Chow and Liu, derives a relation between the measure of closeness I(P, Pa) defined in equation (1) and the measure of dependence between all the pairs of variables. This measure is the well-known expected mutual information measure (EMIM) which is defined for all pairs of discrete-valued variables {x_i, x_j} as
    I*(x_i, x_j) = Σ_{x_i, x_j} Pr(x_i, x_j) log [Pr(x_i, x_j)/(Pr(x_i) Pr(x_j))].   (2)
The reader will recognize that this definition for the measure of dependence between variables is very similar to the closeness measure between the two density functions (or distributions) themselves. In fact, Pr(x_i, x_j) can be viewed as the underlying distribution, and the "approximating" distribution is the one which assumes the random variables x_i, x_j are statistically independent. The essential idea in reference (2) is then to compute the EMIM for the
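As an illustrative sketch (ours, not the authors'), the EMIM of equation (2) can be computed directly from a second-order (joint) marginal, from which the first-order marginals are recovered:

```python
import math

def emim(joint):
    """Expected mutual information measure I*(x_i, x_j) of
    equation (2).  `joint` maps (u, v) -> Pr(x_i = u, x_j = v).
    """
    # First-order marginals recovered from the second-order one.
    pi, pj = {}, {}
    for (u, v), p in joint.items():
        pi[u] = pi.get(u, 0.0) + p
        pj[v] = pj.get(v, 0.0) + p
    total = 0.0
    for (u, v), p in joint.items():
        if p > 0.0:
            total += p * math.log(p / (pi[u] * pj[v]))
    return total

# Independent variables give I* = 0; dependent ones give I* > 0.
print(emim({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))
print(emim({(0, 0): 0.5, (1, 1): 0.5}))   # perfectly correlated pair
```

The second call illustrates the extreme case of two perfectly correlated binary variables, for which I* attains log 2.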
pairs of variables, and then select the most dominant dependencies. It is shown later that this approximation can be related to a tree called the dependence tree. A recently published result proves that the result of Chow and Liu yields an approximate dependence tree, which is nearly optimal, even if the criterion is to minimize the probability of misclassification.(8) This method finds immense application in all problems that include pattern classification as a sub-problem. In particular, applications to the field of information retrieval can be found in reference (6).

In the second type of approximations, proposed by Ku and Kullback,(4) the dependence between the components of the random vector is not restricted to a tree-type dependence model. Once this constraint is relaxed, a very general method results, and the paper (4) describes an extremely interesting and fascinating iterative method to obtain the best approximation possible with the available information. Since this problem is not the primary concern of our current study, the details of the iterative procedure are not included here.

To view our results in the right perspective, we shall first briefly describe the method presented in reference (2). The current study focuses on the use of an alternative new measure, Iχ, used to quantify the dependence between variables. This measure, Iχ, is obtained by computing the χ² distance between pairs of random variables and is much more easily computable when compared to the EMIM metric defined in equation (2). Section 3 defines the new metric Iχ and derives some of its properties. We note in passing that it was proved in reference (7) that the EMIM metric introduced in equation (2) possesses the chi-square property asymptotically. We however prove much more powerful relationships between the two metrics Iχ and I* even for non-asymptotic conditions, and the consequences are immediately obvious. For a restricted class of probability distributions, the metrics Iχ and I* are shown to be equivalent and thus optimal, as far as the problem of finding the dependence tree is concerned. Although the new metric Iχ is non-optimal in the most general case, experimental results clearly demonstrate its excellent accuracy. These results point out that in cases when the underlying distribution is not based on a tree, the new metric generates a tree that is only marginally non-optimal. In the case when the dependence between variables is indeed of the tree type, in addition to the theoretical properties proved, the convergence to the "real" tree is observed in all experiments conducted. Experimental studies have provided evidence that the computation of the best tree based on the Iχ metric can usually be achieved in about 20-25% of the time required to perform the same operation using the I* metric, if X is a random binary vector. The savings in computation time will be even more pronounced if we are working with multivalued discrete random variables.
In other words, the computation of the dependence tree using Iχ is almost an order of magnitude faster than the corresponding computation based on the I* metric. The experimental results are presented in Section 4.
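To make the computational contrast concrete, the following sketch (ours, not the authors') computes both metrics for a single pair of variables: I* spends one logarithm per non-zero cell of the k × k joint table, whereas Iχ uses only elementary arithmetic.

```python
import math

def emim_pair(joint, pi, pj):
    """I*(x_i, x_j): one logarithm per non-zero cell of the joint."""
    return sum(p * math.log(p / (pi[u] * pj[v]))
               for (u, v), p in joint.items() if p > 0.0)

def chi2_pair(joint, pi, pj):
    """I_chi(x_i, x_j): only additions, multiplications, divisions."""
    total = 0.0
    for u in pi:
        for v in pj:
            p = joint.get((u, v), 0.0)
            q = pi[u] * pj[v]
            total += (p - q) ** 2 / q
    return total

# A dependent pair of binary variables: both metrics are positive.
joint = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.15, (1, 1): 0.35}
pi = {0: 0.5, 1: 0.5}
pj = {0: 0.5, 1: 0.5}
print(emim_pair(joint, pi, pj) > 0.0, chi2_pair(joint, pi, pj) > 0.0)
```

The absence of transcendental function calls in `chi2_pair` is the source of the timing advantage reported in Section 4.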
2. DEPENDENCE TREE APPROXIMATION

To introduce the method due to Chow and Liu,(2) we use the following well-established rule, the Chain Rule, which expresses the joint probability distribution in terms of conditional probabilities:

    P(X) = Pr(x_1) Pr(x_2|x_1) Pr(x_3|x_1, x_2) ··· Pr(x_N|x_1, ..., x_{N-1}).   (3)

We notice that in this expression, each variable is conditioned on an increasing number of other variables. In this context, it is important to point out that estimating the kth term of this equation (i.e. a conditional probability term, in which a variable is conditioned on k − 1 other variables) requires maintaining the estimates of all the kth-order marginals. If we restrict ourselves to computing only the (first- and) second-order marginals, we are assured that we can compute all the conditional probabilities of the form Pr(x_i|x_j). Thus we explore the approximation that results if we ignore the conditioning on multiple variables, implying that we shall attempt to retain only dependencies on at most one variable. This leads us to the following approximation:

    Pa(X) = Π_{i=1}^{N} Pr(x_i | x_{j(i)})   (4)

where 0 ≤ j(i) < i.

Notice that in this equation, each variable is dependent on exactly one variable that has appeared earlier in the equation. To include the case when j(i) = 0, we use the convention that x_0 is a dummy variable, which does not influence any other variable. With this understanding, Pr(x_i|x_0) will be the same as Pr(x_i). In the above case, there is a "natural ordering" among the components {x_i | i = 1, 2, ..., N} of the vector X, such that the variable x_i was conditioned on x_{j(i)}, where j(i) < i. In the general case, the product approximation takes the form

    Pa(X) = Π_{i=1}^{N} Pr(x_{m_i} | x_{m_{j(i)}})   (5)

where m_1, m_2, ..., m_N is a permutation of the integers {1, 2, ..., N}. The dependence assumed in the above equation can be given a graph-theoretical interpretation. Consider a graph G, with N nodes, labelled as {x_1, x_2, ..., x_N}. In this graph, the edge (x_{m_i}, x_{m_{j(i)}}) represents the fact that the variable x_{m_i} is (statistically) dependent on the variable x_{m_{j(i)}}. It is easy to see that G is indeed a tree, with the node x_{m_1} as the root. This tree is completely defined by the permutation (m_1, m_2, ..., m_N) and the function j(·). It is well known that there are N^{N-2} spanning trees on a graph with N nodes. Also, each tree is associated with a unique approximation of the form given in equation (5). The problem of finding the "dependence tree" is one of finding the tree for which the associated approximation is the best. It is proved in reference (2) that the closeness measure I(P, Pa), computed for the product approximation in equation (5), can be expressed as

    I(P, Pa) = −Σ_{i=1}^{N} I*(x_{m_i}, x_{m_{j(i)}}) + Σ_{i=1}^{N} H(x_i) − H(P)   (6)

where

    H(x_i) = −Σ_{x_i} Pr(x_i) log Pr(x_i)

    H(P) = −Σ_X P(X) log P(X)

    I*(x_i, x_j) = Σ_{x_i, x_j} Pr(x_i, x_j) log [Pr(x_i, x_j)/(Pr(x_i) Pr(x_j))].
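As an illustration of equation (5), the following Python sketch (our own; the parent-array representation of the tree is an assumption made for illustration) evaluates the product approximation Pa(X) from stored first- and second-order marginals:

```python
def tree_approximation(x, parent, first, second):
    """Evaluate the product approximation Pa(X) of equation (5).

    `parent[i]` is the index of the node that x_i is conditioned on,
    or None for the root (the dummy-variable convention); `first[i][v]`
    is Pr(x_i = v) and `second[(i, j)][(u, v)]` is Pr(x_i = u, x_j = v).
    """
    prob = 1.0
    for i, v in enumerate(x):
        j = parent[i]
        if j is None:
            # The root contributes its first-order marginal alone.
            prob *= first[i][v]
        else:
            # Pr(x_i | x_j) = Pr(x_i, x_j) / Pr(x_j).
            prob *= second[(i, j)].get((v, x[j]), 0.0) / first[j][x[j]]
    return prob

# Two perfectly correlated binary features; the tree makes x_1
# depend on x_0 (x_0 is the root).
first = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
second = {(1, 0): {(0, 0): 0.5, (1, 1): 0.5}}
print(tree_approximation((0, 0), [None, 0], first, second))   # 0.5
print(tree_approximation((0, 1), [None, 0], first, second))   # 0.0
```

Because the two features always agree in this example, the tree-type approximation reproduces the joint distribution exactly.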
The problem of finding the dependence tree is one of finding the permutation (m_1, m_2, ..., m_N) and the function j(·) (or equivalently an array (j_1, j_2, ..., j_N)) which will minimize the right-hand side of equation (6). Clearly Σ_{i=1}^{N} H(x_i) and H(P) are independent of the approximation, since they depend only on the underlying distribution P(X). Therefore, in order to minimize the
right hand side of equation (6), we must maximize
Σ_{i=1}^{N} I*(x_{m_i}, x_{m_{j(i)}}). It can be seen that this is essentially the same as the operation of finding the maximum spanning tree (MST) of the graph G, with N nodes, x_1, x_2, ..., x_N, where the edge between nodes x_i, x_j is assigned the weight I*(x_i, x_j). In practice, the probabilities required for computing the edge weights of the graph are not known a priori; they must be estimated from the samples. The process of finding the best dependence tree can be algorithmically described as follows.

Algorithm Chow
Input: The set of s samples X^1, X^2, ..., X^s.
Output: The best dependence tree τ* as per the EMIM metric.
Method:
1. Estimate the first- and second-order marginals from the various samples.
2. Compute the edge weights I*(x_i, x_j) for all pairs of nodes, and create the graph G.
3. Compute the MST of G and submit it as the desired tree τ*.
End Algorithm Chow.

It is proved in reference (2) that Algorithm Chow indeed finds the maximum likelihood estimate of the dependence tree. The significance of this result is tremendous, and although the result was proved in reference (2), we believe that the power of the result was not adequately stressed. Notice that from the set of all spanning trees of the graph G (totally N^{N-2} of them), the maximum likelihood estimate of the tree is the one which maximizes the likelihood of generating the actual occurrence of samples. Thus to obtain such an estimate, the brute force modus operandi would suggest that the likelihood function be evaluated for every possible spanning tree, and then the tree which maximized this function would be the reported solution. Considering the large number of spanning trees, this is computationally infeasible except for graphs with a small number of nodes. Algorithm Chow is an elegant solution to this problem; it merely involves the straightforward computation of the MST, this being a computation of complexity O(N²). This result is stated formally in Theorem 1 and is illustrated pictorially in Fig. 1.

Theorem 1. The dependence tree produced by Algorithm Chow is the maximum likelihood tree which can be obtained from the samples X^1, X^2, ..., X^s.

Proof. A sketch of the proof of this result was presented in reference (2). A more detailed and complete proof can be found in reference (5). □

In the light of the above result, it is clear that the metric defined in equation (2) is the best possible metric for capturing dependence information between pairs of variables, if the overall approximation is to be ideal in the sense of obtaining the minimum value of I(P, Pa), as defined in equation (1). The only drawback with the I* measure is that it is computationally very expensive and time consuming (especially for large values of N), as it involves the evaluation of O(N²k²) logarithms where, as before, k is the number of values which each feature of the sample can take. We shall now present a new metric, called the Iχ metric, which essentially captures the χ²-statistic of the distributions and which can be used as a measure of dependence between the variables.
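The three steps of Algorithm Chow can be sketched in Python as follows. This is an illustrative reconstruction of ours, not the authors' code; it uses a Prim-style maximum spanning tree, whereas the paper does not prescribe any particular MST method:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_tree(samples):
    """Sketch of Algorithm Chow: estimate pairwise marginals, weight
    each edge (i, j) with I*(x_i, x_j), and return the edges of a
    maximum spanning tree of the complete graph on the N features.
    """
    s, n = len(samples), len(samples[0])
    weight = {}
    for i, j in combinations(range(n), 2):
        joint = Counter((x[i], x[j]) for x in samples)
        pi = Counter(x[i] for x in samples)
        pj = Counter(x[j] for x in samples)
        # Edge weight: the EMIM of equation (2), from sample estimates.
        weight[(i, j)] = sum(
            (c / s) * math.log((c / s) / ((pi[u] / s) * (pj[v] / s)))
            for (u, v), c in joint.items())
    # Prim-style growth: repeatedly add the heaviest crossing edge.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = max(((i, j) for (i, j) in weight
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: weight[e])
        edges.append(best)
        in_tree |= set(best)
    return edges

# x_1 copies x_0; x_2 is an independent coin: the tree links 0 and 1.
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 3
print(chow_liu_tree(samples))
```

With N − 1 = 2 edges required, the strongly dependent pair (0, 1) is always selected; the remaining edge merely attaches the independent feature to the tree.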
3. THE χ² STATISTIC

We define the metric Iχ(x_i, x_j) to quantify the degree of dependence between two discrete (random) variables as follows:

    Iχ(x_i, x_j) = Σ_{x_i, x_j} [Pr(x_i, x_j) − Pr(x_i) Pr(x_j)]² / [Pr(x_i) Pr(x_j)].   (7)

From the definition in equation (7), it is clear that Iχ has the following desirable characteristics of a metric capturing dependency information:

(1) Iχ(x_i, x_j) ≥ 0;
(2) Iχ(x_i, x_j) = 0 iff Pr(x_i, x_j) = Pr(x_i) Pr(x_j).

Observe that the latter equation represents the case when the variables x_i, x_j are statistically independent. At this juncture, it is worth recapitulating that Section 2 concluded with the very important result due to Chow and Liu, which is that the I* metric defined in equation (2) naturally appears in the process of selecting the maximum likelihood estimate of the dependence tree.

Fig. 1. Finding the maximum likelihood tree. (Samples → estimate probabilities and compute I*(·,·) → compute the MST → the maximum likelihood estimate.)

This is true, since we are constrained by the laws of probability to choose a product form approximation for the joint distribution P(X), and the only metric which minimizes the closeness measure defined in equation (1) is the EMIM given in equation (2). In this light, it is clear that the metric defined in equation (7) cannot be related to the closeness of approximation defined by equation (1). Furthermore, it does not seem to be possible to relate this metric to an analogously defined closeness of approximation metric obtained if equation (7) were generalized for the case of distributions, as shown below:

    Iχ(P, Pa) = Σ_X [P(X) − Pa(X)]² / Pa(X).

Although Iχ(x_i, x_j) cannot be related to I(P, Pa) defined in equation (1), it is not entirely futile to pursue it. Observe that for the first part, it is far easier to compute it rather than to compute I* because, as pointed out earlier, the latter involves the evaluation of numerous logarithms. Thus although we cannot prove the optimality of Iχ for all distributions, this metric is not entirely void of an underlying theoretical framework. Indeed we can prove the following properties of Iχ:

(1) In the case when the random variables x_i, x_j are binary valued, the metric Iχ(x_i, x_j) defined in equation (7) increases or decreases monotonically with I*(x_i, x_j).
(2) For a restricted class of probability distributions, i.e. for a specific type of dependence between the individual random variables, the quantities Iχ(x_i, x_j) and I*(x_i, x_j) are shown to be equivalent. Thus within this family, the Iχ metric indeed yields the maximum likelihood estimate of the underlying tree, even though the computation is achieved without computing the matrix I*(·,·).

Although Property 1 has been proved for the case of binary-valued random variables, we believe that the same result holds even if the restriction of having binary-valued random variables is relaxed. We now prove the above results formally.

Theorem 2. If x_i, x_j are distinct binary-valued random variables, Iχ(x_i, x_j) increases (decreases) iff I*(x_i, x_j) increases (decreases).

Proof. We note that both Iχ and I* are functions of the four joint probabilities Pr(x_i = α, x_j = β), for α, β = 0, 1. For the sake of brevity, we denote p_αβ = Pr(x_i = α, x_j = β). Since p00 + p01 + p10 + p11 = 1, all four values of p_αβ cannot be independently specified. Indeed, since any three of them automatically determine the fourth, in the following derivations we assume p00, p01, p10 to be the linearly independent parameters, and that p11 is implicitly specified in terms of their values. We also note that the first-order marginals for x_i and x_j can be easily expressed in terms of these second-order marginals. To do this, we introduce the following notation: a_α = Pr(x_i = α) and b_α = Pr(x_j = α). It is easy to see that:†

    a0 = p00 + p01    b0 = p00 + p10
    a1 = p10 + p11    b1 = p01 + p11.   (8)

Now, by the definition of I* in equation (2),

    I*(x_i, x_j) = Σ_{i,j} p_ij log [p_ij/(a_i b_j)]
                 = p00 log p00 − p00 log a0 b0 + p01 log p01 − p01 log a0 b1
                 + p10 log p10 − p10 log a1 b0 + p11 log p11 − p11 log a1 b1.

Observe that as a consequence of equation (8), ∂a0/∂p00 = ∂b0/∂p00 = 1 and ∂a1/∂p00 = ∂b1/∂p00 = −1. Thus, taking partial derivatives with respect to p00, we get

    ∂I*/∂p00 = (log p00 + 1) − log(a0 b0) − p00 (a0 + b0)/(a0 b0) − p01 (b1 − a0)/(a0 b1)
               − p10 (a1 − b0)/(a1 b0) − (log p11 + 1) + log(a1 b1) + p11 (a1 + b1)/(a1 b1).

Simplifying the above equation,

    ∂I*/∂p00 = log [p00/(a0 b0)] − log [p11/(a1 b1)]
               + (p10 + p11)/a1 + (p01 + p11)/b1 − (p00 + p01)/a0 − (p00 + p10)/b0
             = log [ (p00/(a0 b0)) / (p11/(a1 b1)) ]   (9)

since, by equation (8), each of the four fractional terms equals unity. We now derive an analogous expression for the derivative of Iχ. Consider

    Iχ(x_i, x_j) = Σ_{i,j} (p_ij − a_i b_j)²/(a_i b_j)
                 = Σ_{i,j} p_ij²/(a_i b_j) − 2 Σ_{i,j} p_ij + Σ_{i,j} a_i b_j
                 = Σ_{i,j} p_ij²/(a_i b_j) − 1.   (10)
Using the simplified form of Iχ in equation (10), and taking the partial derivative with respect to p00, we get

    ∂Iχ/∂p00 = 2 p00/(a0 b0) − (p00/(a0 b0))² (a0 + b0) − (p01/(a0 b1))² (b1 − a0)
               − (p10/(a1 b0))² (a1 − b0) − 2 p11/(a1 b1) + (p11/(a1 b1))² (a1 + b1).

After grouping terms and performing some elementary algebraic simplifications (e.g. a1 − b0 = b1 − a0, a1 + b1 = 2 − (a0 + b0)), we obtain

    ∂Iχ/∂p00 = 2(r00 − r11) − (a1 − b0)(r01² + r10²) + 2 r11² − (a0 + b0)(r00² + r11²)   (11)

where r_ij = p_ij/(a_i b_j). It is easy to derive the following relations among the values of {r_ij} (by purely algebraic manipulations); the details of the steps can be found in reference (5):

    r00 − r11 = (p11 − p00) D/(a0 a1 b0 b1)
    r00 − r10 = D/(a0 a1 b0)        r11 − r01 = D/(a0 a1 b1)
    r00 − r01 = D/(a0 b0 b1)        r11 − r10 = D/(a1 b0 b1)   (12)

where D = p00 p11 − p01 p10.

We note that a0 + b0 = 1 − (a1 − b0), and hence we can rewrite the expression in equation (11) for ∂Iχ/∂p00 as

    ∂Iχ/∂p00 = 2(r00 − r11) − (a1 − b0)(r01² + r10²) + (a1 − b0)(r00² + r11²)
               − (r00² + r11²) + 2 r11²
             = (r00 − r11)(2 − r00 − r11) + (a1 − b0)(r00² + r11² − r01² − r10²)
             = (r00 − r11){2 + a1(r01 − r11) + a0(r10 − r00)}   (13)

using equation (12). The second term on the right-hand side of equation (13) can be simplified using the relations between the values of {r_ij} in equation (12) to yield

    2 + a1(r01 − r11) + a0(r10 − r00) = 2 − a1 D/(a0 a1 b1) − a0 D/(a0 a1 b0)
                                      = 2 − D/(a0 b1) − D/(a1 b0)
                                      = r01 + r10   (14)

(since a0 b1 − D = p01 and a1 b0 − D = p10). Hence, combining equations (13) and (14), we get

    ∂Iχ/∂p00 = (r00 − r11)(r01 + r10).   (15)

Comparing equations (9) and (15), we see that for all distinct random variables x_i, x_j, ∂Iχ/∂p00 follows ∂I*/∂p00 in sign everywhere, and the theorem is proved with regard to the variation of the quantities with respect to p00. Similar expressions can be derived for the partial derivatives of Iχ and I* with respect to the variables p01 and p10. In the interest of brevity, these are not derived here, but the results are stated below:

    ∂Iχ/∂p01 = (r01 − r11)(r10 + r00)        ∂I*/∂p01 = log (r01/r11)
    ∂Iχ/∂p10 = (r10 − r11)(r01 + r00)        ∂I*/∂p10 = log (r10/r11).   (16)

These expressions can be derived by following the above derivation almost synchronously. An inspection of equations (16) reveals that the partial derivatives of Iχ and I* with respect to p01 and p10 have the same sign. Hence the theorem. □

† In the interest of brevity most of the involved steps in the algebra and calculus are omitted in the proofs.

Remark. A comparison of equations (9) and (15) indicates the equality of sign only if r01 and r10 are both non-zero. It is interesting to note the implication of this condition. Observe that if r01 and r10 are both zero, the definition of r_ij implies that p01 = p10 = 0, and that the values of x_i, x_j are closely related (indeed, they are never different). In this case the condition r01 = r10 = 0 is equivalent to the condition that x_i is identically equal to x_j, and thus x_i and x_j are non-distinct. Furthermore, an observation of equations (16) indicates that the derivatives of Iχ and I* with respect to p01 agree in sign everywhere, except when r10 = r00 = 0. This condition implies that p10 = p00 = 0, or that b0 = 0. The condition b0 = 0 implies that the random variable x_j is always 1, and hence is not informative. Such random variables can therefore be assumed to be absent. Similar arguments can be made with regard to the partial derivative with respect to p10. In summary, except in trivial cases, the partial derivatives of Iχ and I* agree in sign.

The result established in Theorem 2 implies that if, as a result of slightly perturbing the second-order marginals of the variables x_i and x_j, the measure I*(x_i, x_j) increases or decreases, a similar effect will be
observed in the measure Iχ(x_i, x_j). Though the result is all-encompassing and powerful, it is unfortunately quite a local result. Thus, although the theorem guarantees how the weight assigned to an edge will vary depending on the metric used, it does not guarantee how the weight of the MST will vary as a function of the metric used. In other words, this property does not ensure that these two metrics are equivalent when it concerns computing the MST. In order to guarantee that both measures of dependency find the same dependency tree, the relative ordering of the weights for all the edges meeting at a particular node must be preserved, irrespective of the metric used (i.e. Iχ or I*).

Although I* and Iχ only locally follow one another for all distributions, there is a subset of distributions for which there is a global relationship between the two. For the first part, note that I* itself yields the optimal tree only if we restrict ourselves to second-order marginals. If additionally we constrain ourselves to distributions in which the second-order marginals satisfy certain elementary constraints, we can show that I* and Iχ globally follow each other. Thus, working with a restricted set of distributions, it becomes evident that the MST chosen using the Iχ metric is exactly the maximum likelihood estimate for the underlying tree. The following sequence of lemmas proves the result.

Lemma 1. Consider the pair of features x_i, x_j and the joint probabilities defined as follows:

    p00 = a − λ    p01 = b + λ
    p10 = c + λ    p11 = d − λ   (17)

where a, b, c, d and λ satisfy

    a + b + c + d = 1,   ad = bc,   0 ≤ λ < min(a, d).   (18)

Then dI*/dλ follows dIχ/dλ in sign.

Proof. We note that from equation (10),

    Iχ(x_i, x_j) = Σ_{i,j} p_ij²/(a_i b_j) − 1.

Now, the total derivative of Iχ can be written as

    dIχ/dλ = 2 p00 (−1)/(a0 b0) + 2 p01/(a0 b1) + 2 p10/(a1 b0) + 2 p11 (−1)/(a1 b1)
           = 2[(λ − a) b1 + (b + λ) b0]/(a0 b0 b1) + 2[(c + λ) b1 + (λ − d) b0]/(a1 b0 b1).   (19)

The numerator of the first term on the right-hand side of equation (19) is

    (λ − a) b1 + (b + λ) b0 = λ(b0 + b1) − a(b + d) + b(c + a) = λ − (ad − bc) = λ

the last equality being a consequence of equation (18). Similarly, the numerator of the second term on the right-hand side of equation (19) is

    (λ − d) b0 + (c + λ) b1 = λ − d(c + a) + c(b + d) = λ.

Thus

    dIχ/dλ = 2λ/(a0 b0 b1) + 2λ/(a1 b0 b1) = 2λ(a0 + a1)/(a0 a1 b0 b1) = 2λ/(a0 a1 b0 b1)   (20)

where again the last equality follows since a0 + a1 = 1. We now proceed to evaluate the derivative of I* with respect to λ. Since

    I*(x_i, x_j) = Σ_{i,j} p_ij log [p_ij/(a_i b_j)] = Σ_{i,j} p_ij log p_ij − Σ_{i,j} p_ij log K_ij

where K_ij = a_i b_j, we have

    dI*/dλ = (1 + log p00)(−1) + (1 + log p01) + (1 + log p10) + (1 + log p11)(−1)
             + log K00 + log K11 − log K01 − log K10
           = log [p01 p10/(p00 p11)] + log [K00 K11/(K01 K10)].

Now it is easy to verify that K00 K11/(K01 K10) = 1. Hence the second term of the equation for dI*/dλ vanishes. Thus

    dI*/dλ = log [p01 p10/(p00 p11)]
           = log [(b + λ)(c + λ)/((a − λ)(d − λ))]
           = log [(λ² + (1 − α)λ + β)/(λ² − αλ + β)]   (21)

where α = a + d and β = ad. Comparing equations (20) and (21), we see that both dIχ/dλ and dI*/dλ are positive, or zero, accordingly as λ is positive, or zero, respectively. Hence we have proved Lemma 1. □
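Lemma 1 is easy to spot-check numerically. The following sketch (ours, not part of the paper) instantiates the family of equation (17) with a = b = 0.3 and c = d = 0.2, so that a + b + c + d = 1 and ad = bc as required by condition (18), and verifies that finite-difference derivatives of Iχ and I* with respect to λ agree in sign:

```python
import math

def metrics(lam, a=0.3, b=0.3, c=0.2, d=0.2):
    """Joint probabilities of the family (17): p00 = a - lam, etc.
    The default constants satisfy (18): a+b+c+d = 1 and ad = bc.
    Returns (I_chi, I_star) for the pair of binary variables.
    """
    p = {(0, 0): a - lam, (0, 1): b + lam, (1, 0): c + lam, (1, 1): d - lam}
    am = {0: p[(0, 0)] + p[(0, 1)], 1: p[(1, 0)] + p[(1, 1)]}
    bm = {0: p[(0, 0)] + p[(1, 0)], 1: p[(0, 1)] + p[(1, 1)]}
    ichi = sum((p[(u, v)] - am[u] * bm[v]) ** 2 / (am[u] * bm[v])
               for u in (0, 1) for v in (0, 1))
    istar = sum(p[(u, v)] * math.log(p[(u, v)] / (am[u] * bm[v]))
                for u in (0, 1) for v in (0, 1) if p[(u, v)] > 0)
    return ichi, istar

# Both finite-difference derivatives are positive for 0 < lam < min(a, d).
h = 1e-6
for lam in (0.01, 0.05, 0.1, 0.15):
    (c1, s1), (c0, s0) = metrics(lam + h), metrics(lam)
    print(lam, (c1 - c0) > 0, (s1 - s0) > 0)
```

Note that in this family the marginals a0 = a + b, a1 = c + d, b0 = a + c, b1 = b + d do not depend on λ, which is what makes the closed forms (20) and (21) so simple.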
Lemma 2. Let x_i be any feature. Then for all j ≠ i, let the joint distribution of the pair of features x_i, x_j satisfy conditions (17) and (18) of Lemma 1, with the additional constraint that all distributions are based on common values of the parameters a, b, c, d but differ only in the value of the parameter λ. Then the index k which maximizes I*(x_i, x_k) also maximizes Iχ(x_i, x_k).

Proof. Since all the edges have common values of the parameters a, b, c, d, but differ only in the value of λ, the edge weights (under the I* and Iχ metrics) are functions of the same single parameter λ. The stated result simply
follows from Lemma 1, as per the observation that I* and Iχ are both increasing functions of λ in the interval that λ is constrained to belong to. This implies that the index of the edge with maximum weight remains unchanged under both metrics. □
Theorem 3. If the joint distributions for every pair of variables x_i, x_j have the same parametric form defined by equations (17) and (18) with identical values of a, b, c, d, but the distributions differ only in the parameter λ, then the MST chosen by I_χ is either exactly identical to that chosen by I*, or is just as good an approximation as the latter.

Proof. Lemma 2 has established the result that, of the edges meeting at a particular node, the index of the edge with maximum weight is invariant under both the metrics I* and I_χ. Given the additional information that all edges are specified using the same parametric constants (i.e. a, b, c, d), a straightforward extension of Lemma 1 yields the result that the relative orderings among the I*(·,·) weights and the I_χ(·,·) weights are identical.

Several algorithms to solve the MST problem are known. We shall show the equivalence of the trees produced by the I* and I_χ metrics by simulating Kruskal's algorithm.(1) This algorithm essentially sorts the edges in the descending order of weights and selects the (N - 1) edges that do not form cycles. It is apparent that, since the relative orderings of the edge weights under the two metrics are identical, the MSTs computed will be identical. It must be noted, however, that the MST is unique only if all the edge weights are distinct. If two edges have identical weights under the I* metric, they can easily be shown to have identical weights under the I_χ metric (if conditions (17) and (18) hold). Thus, in these cases, the MSTs produced by the two metrics can differ only in the selection of an edge from the set of edges with equal weight. The sums of the weights of the edges included in the MSTs will indeed be identical, and hence both MSTs will be equally good approximations (see equation (6)). □

It is interesting to note that the decision to specify all the distributions identically has other implications. These conditions are indicated below.
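The Kruskal construction invoked in the proof can be sketched as follows (a minimal Python illustration under our own naming, not the authors' implementation). Edges are scanned in descending order of weight, and an edge is kept only if it joins two distinct components:

```python
def max_spanning_tree(n, weights):
    # Kruskal's algorithm on a complete graph of n nodes.
    # weights[(i, j)]: the edge weight (e.g. I* or I_chi between x_i and x_j).
    parent = list(range(n))

    def find(u):
        # Union-find representative lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                  # the edge forms no cycle: accept it
            parent[ri] = rj
            tree.append((i, j))
            if len(tree) == n - 1:    # (N - 1) edges complete the spanning tree
                break
    return tree
```

Since only the relative ordering of the weights enters the algorithm, two metrics that order the edges identically necessarily yield the same tree, which is exactly the argument used in the proof of Theorem 3.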
Lemma 3. If all the pairs of variables in the graph G use the same parameters a, b, c, d, then b = c and the values of a and d are given by {(1 - 2b) ± √(1 - 4b)}/2.

Proof. Let us consider a node i and its neighbour j. Then Pr(x_i = 0) and Pr(x_j = 0) are given by a + b and a + c, respectively. Because of the symmetry, we must have a + b = a + c, and hence b = c. Given the additional conditions a + b + c + d = 1 and ad = bc, we can solve for the two unknowns (i.e. a and d) from these two equations: they are the roots of t² - (1 - 2b)t + b² = 0. The above equation indicates that b ≤ 0.25. When b = 0.25, all the constants a, b, c, d are equal (to 0.25). □
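As a quick sanity check on Lemma 3, the following sketch (illustrative code; the value b = 0.16 used in the test is an arbitrary choice) recovers a and d from b and verifies the two defining constraints a + 2b + d = 1 and ad = b²:

```python
import math

def a_d_from_b(b):
    # With b = c, the constraints a + 2b + d = 1 and ad = bc = b^2 make
    # a and d the roots of t^2 - (1 - 2b) t + b^2 = 0, i.e.
    # {(1 - 2b) +/- sqrt(1 - 4b)} / 2; real roots require b <= 0.25.
    disc = math.sqrt(1.0 - 4.0 * b)
    return ((1 - 2 * b) + disc) / 2, ((1 - 2 * b) - disc) / 2
```

For instance, b = 0.16 yields a = 0.64 and d = 0.04, while b = 0.25 collapses to the uniform case a = b = c = d = 0.25, as stated in the lemma.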
4. EXPERIMENTAL RESULTS

In order to test the validity of the theoretical results presented in this paper, numerous simulations were carried out. The experiments conducted were of two categories. In the first set of experiments the intention was primarily to test the accuracy of the new metric I_χ; that is, the aim was to determine the number of times the MST τ_χ (obtained by using I_χ) is identical to the corresponding tree τ* (derived by I*). In the second set of experiments, the aim was to observe how quickly τ_χ and τ* converged to the "real" underlying tree.

As a consequence of Theorems 2 and 3, it is easy to see that whenever conditions (17) and (18) are satisfied and the random variable distributions have identical parameters, the I_χ and I* metrics will be equally accurate. This has, of course, been experimentally verified: in every single case in which the conditions were satisfied, τ_χ and τ* were either identical or equally efficient with respect to the EMIM metric I*, where by "equally efficient" we mean that the sums of the I* weights of all the edges in τ_χ and τ* are identical. Thus any further simulations done for this scenario can only further justify the assertion.

The quality of the estimate τ_χ can be evaluated for the case when the conditions of Theorem 3 are not satisfied. Let us suppose that some (or all) pairs of random variables x_i, x_j have a second-order marginal distribution Pr(x_i, x_j) which does not satisfy these conditions. Then the estimate τ_χ need not be identical to (or as efficient as) τ*. However, we observe that for a good proportion of the time τ_χ is identical to τ*, and in the cases when they are not identical, the relative difference between the weights of τ_χ and τ* (i.e. the sums of the weights of all their edges) as per the EMIM metric I* is extremely small, being of the order of 0.5%. These experiments were conducted for various dimensions of the random vector, as follows.
First, the number of components was selected, and each second-order marginal distribution was randomly generated using the procedure explained below. The procedure is easy to follow if we consider the following sub-problem. Let us limit ourselves to dealing with two events A and B, which occur with probabilities Pr(A) and Pr(B), respectively. The problem which we wish to address is one of randomly assigning a probability to the joint event A ∩ B.

In order to assign a probability to the event A ∩ B, we first have to establish some bounds for this quantity. Clearly 0 ≤ Pr(A ∩ B) ≤ min(Pr(A), Pr(B)). Furthermore, since Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B), we easily deduce that

Pr(A ∩ B) = Pr(A) + Pr(B) - Pr(A ∪ B) ≥ Pr(A) + Pr(B) - 1,

since Pr(A ∪ B) ≤ 1.
Table 1. This table summarizes the results of comparing the trees produced by the I_χ and I* metrics, when random distributions were used. The reported results are the accuracy of the tree weights. In each case, the random variables are binary valued.

Dimension of feature vector, N    6        8        10       12
Average relative error (%)        0.2771   0.3799   0.4470   0.4912
Combining the above results, we arrive at the following relation:

max(0, Pr(A) + Pr(B) - 1) ≤ Pr(A ∩ B) ≤ min(Pr(A), Pr(B)).    (22)

Given the lower and upper bounds established by this equation, it is easy to randomly assign a value to Pr(A ∩ B).

We now illustrate the use of the above result in assigning the second-order marginal distributions for the pair of random variables x_i and x_j. For simplicity, we assume that these variables are binary valued. The first step in this process is to assign the first-order marginals to the random variables x_i and x_j. In other words, we first randomly assign values to Pr(x_i = 0) and Pr(x_j = 0), and thus implicitly assign values to Pr(x_i = 1) and Pr(x_j = 1). Using the bounds specified by (22), we can now randomly assign a value to the joint probability Pr(x_i = 0, x_j = 0). Once this choice has been made, the remaining second-order marginals can easily be written down as follows:

Pr(x_i = 0, x_j = 1) = Pr(x_i = 0) - Pr(x_i = 0, x_j = 0),
Pr(x_i = 1, x_j = 0) = Pr(x_j = 0) - Pr(x_i = 0, x_j = 0),
Pr(x_i = 1, x_j = 1) = Pr(x_j = 1) - Pr(x_i = 0, x_j = 1).

To get the joint distribution for all the random variables, the above procedure was repeated for all pairwise combinations of variables.

We are now ready to present the results obtained for the above-mentioned case. The results represent the ensemble average, computed over 5000 random distributions, for each value of the dimension N of the random vector. In the following discussion, EMIM(τ) stands for the sum of I*(e) over the edges e of τ. We report the relative error between EMIM(τ*) and EMIM(τ_χ); this quantity is computed as the percentage average of

[EMIM(τ*) - EMIM(τ_χ)] / EMIM(τ*).

Table 1 presents our results. The results are significant. For example, when N, the number of dimensions, is 6, the percentage error between the EMIM values of τ* and τ_χ is as low as 0.277%. This value increases to 0.49% when the dimension N increases to 12.
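The generation procedure just described admits a direct implementation. The sketch below is a minimal illustration (the endpoints 0.05 and 0.95 placed on the first-order marginals are our own choice, not from the paper):

```python
import random

def random_joint(rng=random):
    # Randomly generate a second-order marginal for two binary variables.
    # First draw the first-order marginals Pr(x_i = 0) and Pr(x_j = 0),
    # then draw Pr(x_i = 0, x_j = 0) uniformly within the bounds of (22):
    # max(0, pA + pB - 1) <= Pr(A and B) <= min(pA, pB).
    pA = rng.uniform(0.05, 0.95)          # Pr(x_i = 0)
    pB = rng.uniform(0.05, 0.95)          # Pr(x_j = 0)
    lo, hi = max(0.0, pA + pB - 1.0), min(pA, pB)
    p00 = rng.uniform(lo, hi)             # Pr(x_i = 0, x_j = 0)
    p01 = pA - p00                        # Pr(x_i = 0, x_j = 1)
    p10 = pB - p00                        # Pr(x_i = 1, x_j = 0)
    p11 = 1.0 - p00 - p01 - p10           # Pr(x_i = 1, x_j = 1)
    return [[p00, p01], [p10, p11]]
```

By construction, every entry is non-negative and the four entries sum to one, so each draw is a valid joint distribution whose marginals match the first-order assignments.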
Similar experiments were conducted for the case when the random variables were not binary valued, but ternary valued (3-valued). Table 2 presents the results for this case. Notice that the percentage error between the weights of the two trees τ* and τ_χ is again small.†

Table 2. This table summarizes the results of comparing the trees produced by the I_χ and I* metrics, when random distributions were used. The reported results are the accuracy of the tree weights. In each case (i.e. for each value of N), each of the random variables takes one of three values.

Dimension of feature vector, N    6        8        10
Average relative error (%)        1.2941   1.6487   1.8448

In the second set of experiments, instead of generating the probability distributions directly, we used an underlying tree to generate the samples. Given the dependence tree, the program must generate the various components of the vector in such a way that the "parent" variable is assigned a value before the "child" (or dependent) variable. It is easy to see that this condition is satisfied if the random variables are assigned values in the order in which a breadth-first traversal of the dependence tree would visit them. The estimates of the second-order (and first-order) marginals were updated with every sample. After an initial phase, the trees τ_χ and τ* were computed as the samples arrived. As before, we maintained the number of times the trees τ* and τ_χ were different. Since the dependence in this case comes from an unknown underlying tree τ_r, it is useful to compare the convergence of τ* and τ_χ to this tree τ_r. One measure of the closeness of these trees is the closeness of the distributions generated by them, and this is precisely what we chose to monitor. If P_r, P* and P_χ are the distributions derived from the "real" underlying tree τ_r, from τ* and from τ_χ, respectively, the quantities which we measured are the closeness measures I(P_r, P_χ) and I(P_r, P*). Note that these quantities can easily be computed from the definition given in equation (1).
† As the dimension of the feature vector increases, the total number of spanning trees increases exponentially, and so does the chance that τ_χ and τ* will be different trees. The accuracy of the approximation is, however, determined solely by the sum of the EMIM weights of the edges.
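The breadth-first sampling scheme described above can be sketched as follows (illustrative Python; the function name, the tree encoding and the probabilities used in the example are our own assumptions, not the experimental setup of the paper):

```python
import random
from collections import deque

def sample_from_tree(root, children, p_root, p_cond, rng=random):
    # Draw one binary sample vector from a dependence tree.
    # children[u]: list of children of node u.
    # p_root: Pr(x_root = 1).
    # p_cond[(u, v)][k]: Pr(x_v = 1 | x_u = k) for child v of node u.
    # A breadth-first traversal guarantees that every "parent" variable
    # is assigned a value before its "child" (dependent) variables.
    x = {root: int(rng.random() < p_root)}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in children.get(u, []):
            x[v] = int(rng.random() < p_cond[(u, v)][x[u]])
            queue.append(v)
    return x
```

Repeating this draw and updating the empirical first- and second-order marginals after each sample reproduces the experimental loop described above.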
Fig. 2. Rate of convergence (N = 6). (a) Underlying dependence tree used for generating the samples to measure the rate of convergence of τ_χ. In this case N, the dimensionality of the random vector, is 6, which is equal to the number of nodes in this tree. (b) Variation of the ensemble average of the closeness of approximation I(P_r, P_χ) between the actual distribution P_r and the estimated distribution P_χ derived from the estimated dependence tree. The metric used to derive this tree is the I_χ metric. The underlying dependence tree describing P_r is given in (a).
We now present the results for the case when N, the number of dimensions, is 6. The underlying dependence tree used is shown in Fig. 2(a). In our experiments, the agreement between τ* and τ_χ was so close that the plots of I(P_r, P_χ) and I(P_r, P*) vs. n, the number of samples, can hardly be differentiated visually. For this reason, in Fig. 2(b) we have shown only I(P_r, P_χ) as a function of n. The closeness of approximation in Fig. 2(b) was obtained as the ensemble average over 50 experiments, each consisting of 5000 samples. Besides the fact that I(P_r, P_χ) and I(P_r, P*) are extremely close, two other observations are also worth mentioning. When the underlying dependence is of tree type, the agreement between τ* and τ_χ is indeed very close; this is evidenced by a very small number of mismatches, when compared to the case of random distributions.

The other observation is that, in terms of the number of samples examined before converging to the best tree, both metrics seem to be comparable. For the case illustrated, the number of samples required to reduce the "initial distance" between the real tree and the estimate to 10% of its value is 1300. It must, however, be kept in mind that even though the "real" dependency tree may have been found earlier, the closeness measure between these distributions is not exactly zero, owing to inaccurate estimates of the conditional probabilities. Figure 3 reports similar results for the case when N, the dimensionality of the random vector, is 10.

In order to quantify the ease of computing τ_χ over that of τ*, we have performed some measurements of the execution times. The experiment consisted of generating a random probability distribution for the
Fig. 3. Rate of convergence (N = 10). (a) Underlying dependence tree used for generating the samples to measure the rate of convergence of τ_χ. In this case N, the dimensionality of the random vector, is 10, which is equal to the number of nodes in this tree. (b) Variation of the ensemble average of the closeness of approximation I(P_r, P_χ) between the actual distribution P_r and the estimated distribution P_χ derived from the estimated dependence tree. The metric used to derive this tree is the I_χ metric. The underlying dependence tree describing P_r is given in (a).
case of binary-valued random variables and then computing each of τ* and τ_χ 5000 times. The figures on CPU time usage suggest that the computation of τ_χ takes approximately 22% of the time required to compute τ*. This reduction in computation time is a natural consequence of not requiring the evaluation of a large number of logarithmic terms in the computation of the edge-weight matrix. Indeed, in terms of real computations, the effort required to compute τ_χ is almost an order of magnitude less than the corresponding effort involved in computing τ*. It can be seen that the savings in terms of CPU time will be even more significant if the number of values that each feature can assume increases.

5. CONCLUSION

In this paper, we have considered the problem of approximating an unknown underlying discrete probability distribution by one derived from a dependence tree. The best dependence tree τ* is known to be the MST of a complete graph, with I*(x_i, x_j) as the edge weight between the pair of nodes x_i and x_j. This paper proposes a chi-squared based metric, I_χ, to capture the dependence information between pairs of random variables, and the tree generated using I_χ is referred to as τ_χ. The quantities I* and I_χ are shown to follow one another locally, i.e. they increase or decrease together, if we restrict ourselves to binary random variables only. We believe that this result holds in general, but we have not obtained proofs for this conjecture. By suitably restricting the domain from which the joint probabilities can be assigned to pairs of random variables, we have shown both these metrics to be equivalent (and hence optimal) in this restricted world. Experimental results clearly demonstrate that when the underlying unknown distribution is derived from
a dependence tree, both metrics I* and I_χ succeed in finding it. When the underlying dependence is not actually based on a tree, both τ* and τ_χ are estimates of the best dependence tree. Even in this case, whenever the two estimates of the best dependence tree do not match, their total weights are almost indistinguishably close. The main advantage of our metric I_χ is that it is computationally far superior to the metric I*; this is a direct consequence of not requiring the evaluation of numerous logarithms. The experimental results suggest that τ_χ can be computed in a small fraction of the time required to compute τ*. The savings in terms of CPU time will be even more significant if the number of values that each random variable can assume increases.
Acknowledgements--Both authors were partially supported by Natural Sciences and Engineering Research Council of Canada. The work of the first author was also supported by a graduate award from Bell-Northern Research. A preliminary version of some of the results found in this paper was presented at the 1989 International Symposium on Computer Architecture and Digital Signal Processing, Hong Kong, October 1989.
REFERENCES
1. A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts (1974).
2. C. K. Chow and C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory 14, 462-467 (May 1968).
3. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York (1973).
4. H. H. Ku and S. Kullback, Approximating discrete probability distributions, IEEE Trans. Inf. Theory 15(4), 444-447 (July 1969).
5. R. S. Valiveti and B. J. Oommen, The use of chi-squared statistics in determining dependence trees, Technical Report SCS-TR-153, School of Computer Science, Carleton University, Ottawa, Ontario, March (1989).
6. C. J. Van Rijsbergen, A theoretical basis for the use of co-occurrence data in information retrieval, J. Documentation 33(2), 106-119 (1977).
7. A. K. C. Wong and T. S. Liu, Typicality, diversity, and feature pattern of an ensemble, IEEE Trans. Comput. 24(2), 158-181 (February 1975).
8. S. K. M. Wong and F. C. S. Poon, Comments on approximating discrete probability distributions with dependence trees, IEEE Trans. Pattern Anal. Mach. Intell. 11(3), 333-335 (March 1989).
About the Author--RADHAKRISHNA S. VALIVETI obtained his B.Tech. (electronics) from the Indian Institute of Technology, Kharagpur, India, in 1978. He obtained his M.Sc. (computer science) from McGill University, Montreal, Canada, in 1981, and his Ph.D. (systems) from Carleton University, Ottawa, Canada, in 1990. He is currently a Member of Scientific Staff at Bell-Northern Research. His research interests are adaptive data structures, computer networks, neural networks, and parallel algorithms.
About the Author--B. JOHN OOMMEN obtained his B.Tech. degree from the Indian Institute of Technology, Madras, India, in 1975. He obtained his M.E. from the Indian Institute of Science in Bangalore, India, in 1977. He then went on for his M.S. and Ph.D., which he obtained from Purdue University, West Lafayette, Indiana, in 1979 and 1982, respectively. He joined the School of Computer Science at Carleton University in Ottawa, Canada, in the 1981-1982 academic year. He is still at Carleton and holds the rank of Full Professor. His research interests include automata learning, adaptive data structures, statistical and syntactic pattern recognition, stochastic algorithms and partitioning algorithms. He is the author of over 100 refereed journal and conference publications and is a senior member of the IEEE.