Pattern Recognition, Vol. 20, No. 2, pp. 245-255, 1987. Printed in Great Britain.
0031-3203/87 $3.00 + .00    Pergamon Journals Ltd.    Pattern Recognition Society
AN EVENT-COVERING METHOD FOR EFFECTIVE PROBABILISTIC INFERENCE*

ANDREW K. C. WONG
PAMI Laboratory, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada

and

DAVID K. Y. CHIU
Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada

(Received 7 August 1985; in revised form 16 April 1986)

Abstract - The probabilistic approach is useful in many artificial intelligence applications, especially when a certain degree of uncertainty or probabilistic variation exists in either the data or the decision process. The event-covering approach detects statistically significant event associations and can deduce a certain structure of inherent data relationships. By event-covering, we mean the process of covering or selecting statistically significant events, which are outcomes in the outcome space of variable pairs, disregarding whether the variables (with regard to the complete outcome space) are statistically significant for inference or not. This approach enables us to tackle two problems well known in many artificial intelligence applications, namely: (1) the selection of useful information inherent in the data when the causal relationship is uncertain or unknown, and (2) the necessity to discover and disregard uncertain events which are erroneous or simply irrelevant. Our proposed method can be applied to a large class of decision-support tasks. By analyzing only the useful statistically significant information extracted by the event-covering process, we can formulate an effective probabilistic inference method applicable to incomplete discrete-valued (symbolic) data. The statistical patterns detected by our method then represent important empirical knowledge gained. To demonstrate the method's effectiveness in solving pattern recognition problems with incomplete data and/or data with high "noise" content (with uncertain and irrelevant events), it has been evaluated using both simulated and real-life biomolecular data.

Probabilistic inference    Event-covering    Statistical knowledge    Incomplete probability scheme    Taxonomical classification    Discrete-valued data

*This work is supported by the Natural Sciences and Engineering Research Council, Canada.
I. INTRODUCTION
An important aspect of machine intelligence is the ability of a computer-based system to infer unknown or uncertain events in a systematic decision or reasoning process. This ability is essential in wide-ranging situations, such as assisting medical diagnostic consultation (the MYCIN system), postulating molecular structures (the DENDRAL system), or inferring computer configurations (the R1 system).(2, 7) To date, major research efforts have been directed towards transferring knowledge from human experts into rules, computational logic, frame-based or search techniques.(13, 20, 21, 26) However, one serious difficulty with these approaches is that the experts themselves are often not fully aware of the rules and the precise nature of the knowledge involved. Hence, a consistent and complete set of rules relating the relevant events may not be easily available for a particular problem domain.(6) Because of this, once a set of rules is built into a computer system, it usually cannot be easily modified to accommodate newly acquired knowledge
without introducing unnecessary logical side effects. Furthermore, when missing or unobserved events occur, most of the existing rule-based or logic-based systems are weak in making an inference unless this condition is also specified in the representation framework. In a situation where uncertainty or probabilistic variation is encountered in either the data or the decision process, the deterministic approach is clearly inadequate. In view of this, an event-covering approach is proposed. By event-covering, we mean the process of covering or selecting statistically significant events, which are outcomes in the outcome space of variable-pairs, disregarding whether the variables (with regard to the complete outcome space) are statistically significant or not for inference. This new approach tackles two well-known problems in artificial intelligence, particularly when inference is involved, namely: (1) the selection of useful information when the causal relationship is uncertain or unknown, and (2) the necessity to disregard erroneous (or simply irrelevant) data. Without assuming an exceptionally large sample size at the learning phase or a parametric form for the probability distribution of the observations, this method is able to handle a large class of decision-support problems which require simple
self-learning capability, especially when knowledge from a human expert is unavailable. In addition to acquiring necessary knowledge in a problem domain, it can also verify the information provided by human experts. In the past, Bayesian decision theory and the discriminant function approach(11) have been used for decision-support tasks. However, the Bayesian approach usually assumes either statistical independence among the components of an n-tuple of observed features or a normal probability distribution, and therefore is not applicable to many real-life situations which do not satisfy these assumptions.(14, 23) As for the discriminant function approach, except for Ref. (27), the data are assumed to take real values in a Euclidean space; thus, the method cannot be directly applied to data involving discrete values. Furthermore, most of the reported works on discrete-valued data using the discriminant approach make no distinction between variables which are more relevant and those that are less relevant or altogether irrelevant. Moreover, all the existing methods, including Refs (27) and (37), use the complete set of outcomes in the estimation process, even though some of the outcomes may be irrelevant to the problem concerned.
The proposed approach provides an effective data-directed method for probabilistic inference. In our data model, the problem is formulated as the estimation of missing or unknown discrete values in multivariate observations. With the event-covering process, the method analyzes the incomplete probability scheme(15) defined over a selected subset of statistically significant outcomes of the variables; hence inherent extraneous "noise" will not affect the estimation. Further, the inference decision is based on the information obtained from observing multiple events which are statistically relevant for inference. Because this method uses a weighting function and a decision rule to combine the selected information, a more reliable decision can be made. In addition, this method has the following desirable characteristics. (1) Because the method is based on probability estimates from an ensemble of observations, it has simple and straightforward updating properties: when new data are acquired, the estimates are simply updated. (2) It has the capability of indicating the reliability of a particular inferred event. (3) If there is not enough observed or statistically significant information, it is capable of rejecting a particular inference.
The data are represented in a very general format, as multivariate observations involving discrete values, such that some of the values may be missing or unobserved. Each value can be considered to represent an event. When the data are represented as strings,
trees or graphs, this format can still be applied if the data can be mapped to a particular ordering scheme.(31, 34, 38) This representation can also be extended to include multivariate data of the mixed type (mixed discrete and continuous values), using a minimum-loss-of-information criterion.(12, 30) These extensions will be discussed in a separate paper. In the experiments, this method is first used to estimate unknown events in the data. When it is used to estimate class membership in a pattern recognition problem, it serves as a supervised classifier. In Ref. (3), the method is extended to the cluster analysis problem.
In the following sections, we first define the data representation formally; the event-covering method is then introduced, followed by the probabilistic inference method which incorporates the information from multiple observed events. The rationale for the design of the methodology is given for each phase.

II. EVENT-COVERING
A. Data representation
Let X = (X_1, X_2, ..., X_n) be an n-tuple of related variables such that the outcomes of the variables are discrete values. Let T_j = {a_{jr} | r = 1, 2, ..., L_j} represent the set of L_j possible outcomes for X_j (1 <= j <= n). An observation can then be represented as x = (x_1, x_2, ..., x_n) such that all the x_j (1 <= j <= n) are events observed concurrently and x_j takes a value from T_j. When an ensemble of observations is obtained at the learning phase, it can be represented in tabular form (Fig. 1). This table is called an observation table; it is similar to the data representation in the relational model of databases.

B. Basic idea of event-covering
The process of event-covering is to find the interdependent relationships relating the events by observing the data in an observation table.
Fig. 1. An observation table. Each row is one observation of the ensemble (an n-tuple of concurrently observed events); each column corresponds to one of the related discrete-valued variables X_1, X_2, ..., X_n; each entry x_j is an event.
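To make the data representation concrete, the following small sketch (our illustration, not part of the original paper) stores an ensemble of incomplete observations as Python tuples, with None marking a missing event; the variable count and outcome labels are arbitrary.

```python
# A minimal sketch of an observation table (our illustration).
# Each row is one observation x = (x_1, ..., x_n); None marks a missing event.
from typing import Optional, Sequence, Tuple

Observation = Tuple[Optional[str], ...]

ensemble: Sequence[Observation] = [
    ("A", "B", None, "D", "B"),   # x_3 is unobserved in this tuple
    ("C", "B", "A", None, "B"),
    ("A", "C", "A", "D", None),
]

def sub_ensemble(observations: Sequence[Observation], j: int, k: int):
    """Return the sub-ensemble in which both X_j and X_k are observed
    (indices are 0-based here)."""
    return [x for x in observations if x[j] is not None and x[k] is not None]

print(len(sub_ensemble(ensemble, 0, 1)))   # M(X_1, X_2) for this toy ensemble = 3
```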
In the problem of probabilistic inference, the interdependent relationship will be used to select the relevant events that provide statistical information for this purpose. First, we introduce the basic idea of event-covering using a simple example. Suppose that the interdependent relationship between a variable-pair (X_k, X_j) is illustrated in the observations as

(x_k, x_j) = (A, B)   and   (x_k, x_j) = (C, D),

where the values are concurrent outcomes with a strictly deterministic relationship. From the information indicated in these observations, it may be concluded that the outcome of an unknown x_j can be determined exactly if the outcome of x_k in the same observation is observed, and vice versa. However, if the interdependent relationships indicated by the observations are

(x_k, x_j) = (A, B)
(x_k, x_j) = (C, D)
(x_k, x_j) = (F, B)
(x_k, x_j) = (F, D)
(x_k, x_j) = (G, B)
(x_k, x_j) = (G, D),

then the following remarks can be made.
Remark 1. Even if it is observed that x_k is F or G, this observed value cannot be used to infer an unknown value of x_j.
Remark 2. No matter what is observed for x_j, its value cannot be used to infer the value F or G for an unknown x_k.
Both remarks refer to the interdependency of the events. Based on these properties, the outcomes of X_k and X_j can be divided into two subsets: one containing events which have interdependency for inference, and the other containing events which cannot be used for this purpose. This process of identifying the two subsets of outcomes is called event-covering.

C. Event-covering allowing probabilistic variation
In real-life problems, probabilistic variation is often encountered; therefore the following procedure of event-covering is based on statistical tests. When an ensemble of incomplete sample observations is given, consider the sub-ensemble in which the outcomes of X_j and X_k are both observed. The probability distribution of the observations corresponding to a variable-pair (X_k, X_j) can be described by a two-dimensional contingency table with the frequency of each joint outcome as its entry. The expected frequency of an outcome (a_{ks}, a_{jr}) can be estimated from the marginal frequencies (denoted obs(a_{ks}) and obs(a_{jr})) when independence of the variables is assumed, where a_{ks} and a_{jr} are outcomes of X_k and X_j respectively. The expected frequency is expressed as

exp(a_{ks}, a_{jr}) = obs(a_{ks}) x obs(a_{jr}) / M(X_j, X_k),   j != k,

where M(X_j, X_k) represents the total number of observations in the sub-ensemble such that both the outcomes of X_j and X_k are observed. Let obs(a_{ks}, a_{jr}) represent the observed frequency of (a_{ks}, a_{jr}) in the sub-ensemble. The following expression, D_{ks}, indicates the degree to which the observed frequencies deviate from the expected frequencies; it possesses an asymptotic chi-square property with (L_j - 1) degrees of freedom:

D_{ks} = SUM_{r=1}^{L_j} [obs(a_{ks}, a_{jr}) - exp(a_{ks}, a_{jr})]^2 / exp(a_{ks}, a_{jr}).

D_{ks} can then be used in a criterion, described below, for testing the statistical interdependency between a_{ks} and X_j at a presumed significance level. The selection of an event a_{ks} of X_k can be considered as a function which maps a_{ks} and a variable X_j (j != k) into a binary decision, signifying whether or not a_{ks} has statistical interdependency with X_j. Define a function h_k such that the statistical interdependency is indicated by the chi-square test:

h_k(a_{ks}, X_j) = 1 if D_{ks} > chi^2_{L_j - 1}, and 0 otherwise,

where chi^2_{L_j - 1} is the tabulated chi-square value with (L_j - 1) degrees of freedom at the chosen significance level. With this function formulated, the two subsets of outcomes which demonstrate interdependency can be selected for a variable-pair (X_k, X_j) as

E_k^j = {a_{ks} | h_k(a_{ks}, X_j) = 1}   and   E_j^k = {a_{jr} | h_j(a_{jr}, X_k) = 1}.

E_k^j and E_j^k correspond to the selected subsets of the outcomes of X_k and X_j respectively. They are called the covered event subsets for the variable-pair (X_k, X_j), and E_k^j x E_j^k represents the statistically detected covered event subspace.

D. Interdependency between restricted variables
When the covered event subsets E_j^k and E_k^j for a variable-pair (X_j, X_k) have been identified, the statistical interdependency between the restricted variables with outcomes in E_j^k and E_k^j can be estimated. Let the restricted variables involving these subsets of outcomes be X_j' and X_k' respectively. The expected mutual information between X_j' and X_k' is calculated as

I(X_j', X_k') = SUM_{E_j^k} SUM_{E_k^j} P(x_j', x_k') log [ P(x_j', x_k') / (P(x_j') P(x_k')) ].

If this is divided by Shannon's entropy function,

H(X_j', X_k') = - SUM SUM P(x_j', x_k') log P(x_j', x_k'),
a measure known as the interdependence redundancy R(X_j', X_k') is obtained,(18, 33) defined as

R(X_j', X_k') = I(X_j', X_k') / H(X_j', X_k').

It has been shown in Ref. (33) that R(X_j', X_k') possesses an asymptotic chi-square property with (|E_j^k| - 1) x (|E_k^j| - 1) degrees of freedom, where |E_j^k| and |E_k^j| denote the cardinalities of the subsets. Mathematically, R(X_j', X_k') is distributed as

R(X_j', X_k') ~ chi^2_{(|E_j^k| - 1)(|E_k^j| - 1)} / [ 2 M(X_j', X_k') H(X_j', X_k') ].

Hence, the chi-square test can also be used for determining the statistical interdependency between the restricted variables at a presumed significance level (more precisely, the test measures the deviation of X_j' and X_k' from statistical independence). By now, the interdependency can be described by the detected covered event subsets, and the magnitude of the interdependency between the restricted variables is indicated by the interdependence redundancy value. This information represents important statistical knowledge gained, and both will be used when estimating unknown values in a probabilistic inference problem.
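As a hedged illustration of Sections II.C and II.D, the sketch below selects the covered event subsets of a variable-pair from its contingency table and computes the interdependence redundancy of the restricted variables. The function names and the toy counts are ours, numpy and scipy are assumed to be available, and the 5% test level mirrors the 95% significance level used in the experiments reported later.

```python
# Sketch of event-covering for one variable-pair (X_k, X_j); our reading of the text.
import numpy as np
from scipy.stats import chi2

def covered_rows(table: np.ndarray, alpha: float = 0.05) -> list:
    """Return indices s of the row outcomes a_ks that pass the chi-square test.

    table[s, r] = obs(a_ks, a_jr) over the sub-ensemble in which both X_k and
    X_j are observed; D_ks is compared with the chi-square value at
    (L_j - 1) degrees of freedom."""
    M = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / M
    threshold = chi2.ppf(1.0 - alpha, df=table.shape[1] - 1)
    selected = []
    for s in range(table.shape[0]):
        with np.errstate(divide="ignore", invalid="ignore"):
            d_ks = np.nansum((table[s] - expected[s]) ** 2 / expected[s])
        if d_ks > threshold:
            selected.append(s)
    return selected

def interdependence_redundancy(table: np.ndarray, rows, cols) -> float:
    """R(X'_k, X'_j) = I(X'_k, X'_j) / H(X'_k, X'_j) on the restricted table."""
    sub = table[np.ix_(rows, cols)].astype(float)
    p = sub / sub.sum()
    marginals = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    nz = p > 0
    mutual_info = np.sum(p[nz] * np.log(p[nz] / marginals[nz]))
    entropy = -np.sum(p[nz] * np.log(p[nz]))
    return mutual_info / entropy if entropy > 0 else 0.0

# Toy counts: rows are outcomes of X_k, columns are outcomes of X_j.
table = np.array([[30, 2, 1], [3, 25, 2], [5, 6, 4]])
E_k = covered_rows(table)        # covered outcomes of X_k (with respect to X_j)
E_j = covered_rows(table.T)      # covered outcomes of X_j (with respect to X_k)
print(E_k, E_j, interdependence_redundancy(table, E_k, E_j))
```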
E. Restricted observation table
The proposed method uses only the statistically significant information in an observation, during both the learning and the inference phases. During the learning phase, the information selected for estimation is acquired as follows. Given an unknown value of a variable X_j, all the observed events in the observation table (excluding those pertaining to X_j) which have interdependency with X_j can be identified, and a "reduced" observation table is obtained for further analysis. This table is called the restricted observation table pivoted on X_j. Conceptually, our method of probabilistic inference is based on the information derived from this restricted observation table. In the following section, the inference method for combining significant information in estimating the outcome of an unknown event is introduced.

III. MEASURE OF SURPRISAL FOR PROBABILISTIC INFERENCE

A. The general idea of surprisal
The proposed inference method is based on the notion of an information measure first introduced by Wiener.(29) The measure is known as the "information of an event", or a measure of "surprisal".(28) It has been used in defining information in character sequences(32) and in two-dimensional pictorial images.(36) This information measure can be formally described as follows. Given a set of distinct events T_j = {a_{jr} | r = 1, 2, ..., L_j} associated with the discrete probability distribution {P(a_{jr}) | r = 1, 2, ..., L_j}, the measure of surprisal for an event a_{jr} to occur is

I(a_{jr}) = - log P(a_{jr}).

It can be interpreted either as a measure of how unexpected the event is, or as a measure of the information conveyed by the event if it occurs. When the event a_{jr} has a high probability of occurring, the surprisal has a low value; for example, when P(a_{jr}) = 1, indicating an absolutely certain occurrence of the event, the surprisal is zero. Similarly, when the event has a low probability of occurrence, the surprisal has a high value. Further, the surprisal measure is additive and non-negative. That is: (1) the information conveyed by an event cannot be less than zero, and (2) the information obtained from the occurrence of two independent events is the sum of the information conveyed by the two individual events.

B. Measuring surprisal of an event by the conditioning events
The occurrence of an event may be estimated from the occurrence of another event which is interdependent with it. To measure the amount of "surprisal" for an event a_{jr} of X_j to occur given the occurrence of an event a_{ks} of X_k (k != j), a measure known as the conditional information is formulated as

I'(a_{jr} | a_{ks}) = - log P(a_{jr} | a_{ks}),   j != k,

where P(a_{ks}) is defined to be greater than zero. We also call this the measure of conditional surprisal, and a_{ks} is referred to as the conditioning event. When a_{jr} and a_{ks} always occur together, that is, when they are strictly interdependent, the conditional surprisal is zero; when the two events are nearly mutually exclusive, the conditional surprisal becomes extremely large. Once E_k^j and E_j^k have been identified by the event-covering process, a formulation based on the incomplete conditional probability scheme defined on E_k^j x E_j^k is adopted. The measure of conditional surprisal is then defined as

I(a_{jr} | a_{ks}) = - log [ P(a_{jr} | a_{ks}) / SUM_{a_{jt} in E_j^k} P(a_{jt} | a_{ks}) ].
It should be noted that when E_j^k is a proper subset of the outcome set of X_j and is not empty, then

0 < SUM_{a_{jt} in E_j^k} P(a_{jt} | a_{ks}) < 1.
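A small sketch (ours) of this incomplete-scheme conditional surprisal follows; the joint probabilities and event labels below are made up for illustration only.

```python
# Conditional surprisal on the incomplete scheme E_j^k (our illustration).
import math

def conditional_surprisal(p_joint, a_jr, a_ks, E_j):
    """I(a_jr | a_ks) = -log [ P(a_jr | a_ks) / sum_{a_jt in E_j^k} P(a_jt | a_ks) ].

    p_joint[(a_j, a_k)] holds P(a_j, a_k); the conditioning term P(a_ks)
    cancels, so the ratio is formed directly from the joint probabilities."""
    numerator = p_joint.get((a_jr, a_ks), 0.0)
    denominator = sum(p_joint.get((a_jt, a_ks), 0.0) for a_jt in E_j)
    if numerator == 0.0:
        return math.inf    # an event never seen with a_ks is maximally surprising
    return -math.log(numerator / denominator)

p_joint = {("B", "A"): 0.30, ("D", "A"): 0.05, ("B", "C"): 0.02, ("D", "C"): 0.28}
print(conditional_surprisal(p_joint, "B", "A", E_j={"B", "D"}))   # low: B is expected given A
print(conditional_surprisal(p_joint, "D", "A", E_j={"B", "D"}))   # higher: D is unlikely given A
```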
When probabilistic variation is considered, estimating the occurrence of an event a_{jr} based on the occurrence of multiple events will provide a more reliable estimation than one based on the occurrence of a single event alone. Thus the following expression measuring conditional surprisal based on a high-order probability estimate,

I*(a_{jr} | x') = - log P(a_{jr} | x'),

is more useful in practice, where x' = (x_1', x_2', ..., x_m') (m > 0) is the subset of events in an observation x selected for this purpose. However, there are still several difficulties with this formulation. (1) It involves probability estimation conditioned on joint events of a high-order nature; a reliable estimation therefore demands a relatively large sample size, ample computational resources and large storage. Such requirements are usually unattainable in real life. (2) When a low-order product approximation, such as that described in Refs (4, 17), is used to resolve this difficulty, one issue still remains: if there is interdependency only among a subset of the events, should the events outside this subset not be rejected for estimation? (3) If some of the events in the sample observations are unknown or missing, the missing events will degrade the estimation.
For these reasons, a different formulation for measuring the surprisal of an event based on the occurrence of multiple events is proposed. Let x' = (x_1', x_2', ..., x_m') (corresponding to the variables (X_1', X_2', ..., X_m')) be the subset of events selected to be significant for estimation. The surprisal is defined as

I(a_{jr} | x') = SUM_{k=1}^{m} W(X_j, X_k') I(a_{jr} | x_k'),

where W(X_j, X_k') is a weighting function provided by human experts or established through sampling observations. This function indicates the importance of an event x_k' in estimating the occurrence of a_{jr}. (A weighting function based on sampling observations will be discussed in the next section.) This measure is called the weighted-summed conditional surprisal (WSCS). If the events in x' are chosen only when they are found to be statistically significant, then W(X_j, X_k') > 0 for all k.
C. A normalized measure of surprisal
When the WSCS is divided by the total weight and by the number of conditioning events in x', a measure indicating the "normalized" amount of surprisal is obtained. This measure is called the normalized surprisal (NS) and is expressed as

NS(a_{jr} | x') = I(a_{jr} | x') / { m SUM_{k=1}^{m} W(X_j, X_k') }.
Since NS is a normalized measure which reflects the confidence level of an event to occur, the occurrence of different events can be compared based on their NS values. Here, a weighting function that is proportional to
the degree of interdependency is chosen: the interdependence redundancy measure R(X_j', X_k'), which is bounded by 0 and 1. It has the property that R(X_j', X_k') = 0 if the variables are independent, and R(X_j', X_k') = 1 if they are strictly interdependent. Thus the normalized measure of surprisal for an event to occur on the joint occurrence of multiple events is defined as

NS(a_{jr} | x') = I(a_{jr} | x') / { m SUM_{k=1}^{m} R(X_j', X_k') },

where

I(a_{jr} | x') = SUM_{k=1}^{m} R(X_j', X_k') I(a_{jr} | x_k')

and x' = (x_1', x_2', ..., x_m') is a selected subset of conditioning events from x. The following properties follow directly from the definition of NS and are stated without proof.
Property 1. If there is only one conditioning event, i.e. m = 1 and x' = {x_1'}, then NS(a_{jr} | x') = I(a_{jr} | x_1').
Property 2. NS(a_{jr} | x') = 0 if and only if I(a_{jr} | x_k') = 0 for all x_k' in x'.
Property 3. NS(a_{jr} | x') has a lower bound of 0 but no upper bound.
Property 4. Suppose that two normalized surprisals, NS_1 and NS_2, are obtained, with total weights W_1 and W_2, numbers of conditioning events m_1 and m_2, and weighted-summed conditional surprisals I_1 and I_2. Then the relative magnitude of NS_1 and NS_2 is governed by:
(1) the effect of the number of conditioning events: if W_1 = W_2 and I_1 = I_2 but m_1 < m_2, then NS_1 > NS_2;
(2) the effect of the total weights: if m_1 = m_2 and I_1 = I_2 but W_1 < W_2, then NS_1 > NS_2;
(3) the effect of the weighted-summed conditional surprisal: if m_1 = m_2 and W_1 = W_2 but I_1 > I_2, then NS_1 > NS_2.
Property 4(1) implies that if there are more conditioning events, the inference is more reliable. Property 4(2) satisfies the intuitive property that if the total weight is smaller, the reliability is lower. Property 4(3) indicates that if the WSCS is higher, the reliability is lower.
Property 5. Given a set of weighting function values and a set of conditional surprisal terms, NS is minimized if the lower conditional surprisal terms are weighted with the higher weighting function values, in sequential order of magnitude. Property 5 implies that both the conditional surprisal terms and the weighting function values contribute to the overall NS magnitude.
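The following sketch (ours) computes the WSCS and the normalized surprisal with the interdependence redundancy values used as weights; the numbers are illustrative only.

```python
# Weighted-summed conditional surprisal (WSCS) and normalized surprisal (NS); our sketch.
def normalized_surprisal(weights, surprisals):
    """NS(a_jr | x') = [sum_k R_k * I(a_jr | x'_k)] / [m * sum_k R_k],
    where weights[k] = R(X'_j, X'_k) and surprisals[k] = I(a_jr | x'_k)."""
    assert len(weights) == len(surprisals) and weights, "x' must be non-empty"
    m = len(weights)
    wscs = sum(w * i for w, i in zip(weights, surprisals))
    return wscs / (m * sum(weights))

# Property 1: with a single conditioning event, NS reduces to that event's surprisal.
print(normalized_surprisal([0.8], [0.4]))                      # 0.4
# Three conditioning events supporting one hypothesized value:
print(normalized_surprisal([0.8, 0.5, 0.3], [0.1, 0.4, 0.9]))
```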
D. Decision rule for probabilistic inference The selection of the components of x' from an
observation x depends on the hypothesized value assigned to an unknown x_j when the value is estimated. The reason is to consider the interdependency between this hypothesized value and the selected events. The conditions for an event x_k = a_{ks} to be selected to estimate x_j = a_{jr} are as follows:
(1) the value of x_k exists (or is observed);
(2) the interdependence redundancy R(X_j', X_k') is significant;
(3) a_{ks} is in E_k^j and a_{jr} is in E_j^k.
To render a meaningful calculation of the information measure, the following additional condition on the sample size is imposed:

SUM_{a_{jt} in E_j^k} obs(a_{jt}, a_{ks}) > T,

where T is a predefined sample size threshold larger than zero.* This subset of selected events is represented as x'(a_{jr}). Then, given T_j = {a_{jr} | r = 1, 2, ..., L_j} as the set of possible values that can be assigned to an unknown x_j, x_j assumes the value a_{jr} if

NS(a_{jr} | x'(a_{jr})) = min over a_{jt} in T_j of NS(a_{jt} | x'(a_{jt})).

If x' is an empty set then an inference cannot be made; thus a rejected inference simply means that there is not enough observed or statistically significant information for making an inference. Also, an estimated value is rejected if more than one event shares the lowest NS value.

*Since second-order statistics are required in the probability estimation, the minimum sample size for a reliable estimation can be assumed to be T = A x max_j L_j^2, j = 1, 2, ..., n, where A is a constant taken to be 5 for a liberal estimate. However, experimental evidence suggests that when the sample size is small but the interdependency pattern is reasonably clear, T can be smaller without corrupting the results.

E. Unbiased probability estimator
When estimating probabilities from an ensemble of observations, zero probability may be encountered for outcomes with no observed occurrence if the estimation is based on a direct frequency count. In order to obtain a better probability estimate in these cases, an unbiased probability estimate proposed in Refs (5, 25) is adopted. Consider a pair of restricted variables (X_j', X_k') with the incomplete probability scheme involving events in E_j^k and E_k^j. The unbiased marginal distribution of X_j' is defined as

P(X_j' = a_{jr}) = { M(a_{jr}) + |E_j^k| } / { M + |E_j^k|^2 },

where M(a_{jr}) and M are, respectively, the frequency of occurrence of a_{jr} and the sample size of the incomplete scheme of X_j'. The idea of this probability estimate can be traced back to the mathematician Pascal(19) and can be interpreted as follows: if |E_j^k|^2 more samples are observed in addition to the existing M samples, the number of occurrences of x_j = a_{jr} is estimated by M(a_{jr}) + |E_j^k|. Similarly, the unbiased joint distribution of X_j' and X_k' is defined as

P(X_j' = a_{jr}, X_k' = a_{ks}) = { M(a_{jr}, a_{ks}) + 1 } / { M + |E_j^k| x |E_k^j| },

where M(a_{jr}, a_{ks}) is the number of occurrences of the joint outcome (a_{jr}, a_{ks}) in the incomplete scheme of the ensemble. Hence the conditional surprisal I(a_{jr} | a_{ks}) is calculated as

I(a_{jr} | a_{ks}) = - log { [ P(a_{jr}, a_{ks}) / P(a_{ks}) ] / [ SUM_{a_{jt} in E_j^k} P(a_{jt}, a_{ks}) / P(a_{ks}) ] }
                   = - log { [ M(a_{jr}, a_{ks}) + 1 ] / SUM_{a_{jt} in E_j^k} [ M(a_{jt}, a_{ks}) + 1 ] }.

Note that the summation of the estimated conditional probabilities over all the events in E_j^k is unity, that is,

SUM_{a_{jt} in E_j^k} [ M(a_{jt}, a_{ks}) + 1 ] / SUM_{a_{jt} in E_j^k} [ M(a_{jt}, a_{ks}) + 1 ] = 1.
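As a hedged sketch of Sections III.D-E (ours; counts, labels and helper names are illustrative), the smoothed conditional surprisal is computed from the joint counts, and the decision rule assigns the candidate with the lowest NS while rejecting ties and the no-information case.

```python
# Smoothed (unbiased) estimates and the minimum-NS decision rule; our sketch.
import math

def smoothed_marginal(count_ajr: int, M: int, e_size: int) -> float:
    """P(X'_j = a_jr) = (M(a_jr) + |E_j^k|) / (M + |E_j^k|^2)."""
    return (count_ajr + e_size) / (M + e_size ** 2)

def smoothed_conditional_surprisal(counts, a_jr, a_ks, E_j) -> float:
    """I(a_jr | a_ks) = -log [ (M(a_jr, a_ks) + 1) / sum_{a_jt in E_j^k} (M(a_jt, a_ks) + 1) ]."""
    num = counts.get((a_jr, a_ks), 0) + 1
    den = sum(counts.get((a_jt, a_ks), 0) + 1 for a_jt in E_j)
    return -math.log(num / den)

def infer(candidates, ns_of):
    """Assign the candidate value with the lowest NS; reject ties or lack of evidence.
    ns_of(a) returns the NS of candidate a, or None when x'(a) is empty."""
    scored = sorted((ns_of(a), a) for a in candidates if ns_of(a) is not None)
    if not scored:
        return None                      # no statistically significant information
    if len(scored) > 1 and math.isclose(scored[0][0], scored[1][0]):
        return None                      # more than one event shares the lowest NS
    return scored[0][1]

counts = {("B", "A"): 12, ("D", "A"): 1}
print(smoothed_conditional_surprisal(counts, "B", "A", {"B", "D"}))   # close to zero
print(infer(["B", "D"], {"B": 0.09, "D": 1.90}.get))                  # 'B'
```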
F. Space complexity
In this section, the space required by the method in reaching an inference decision is evaluated. To simplify the analysis, first consider the space requirement when event-covering is not applied. For a variable-pair (X_k, X_j), the method has to store all the event labels and the probability estimates for each joint outcome, as well as the marginal probability estimates for the outcomes of X_k and X_j. Hence the memory size for storing all the event labels and the probability estimates is 2 x (L_k x L_j + L_j + L_k), where L_k and L_j represent the number of distinct events of X_k and X_j respectively. Let the variable with the unknown outcome be X_j. Then there are (n - 1) different X_k's to be considered in an n-tuple, and the total space requirement for a given X_j is

SUM_{k=1, k != j}^{n} 2 x (L_k x L_j + L_j + L_k).

If an inference may be required for the outcome of any variable, the space requirement is then

SUM_{j=1}^{n} SUM_{k=1, k != j}^{n} 2 x (L_k x L_j + L_j + L_k).
When event-covering is applied, the space requirement will be significantly less depending on the size of the covered event subsets and the number of variables selected.
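As an illustrative check of the formula (the numbers here are ours, not from the paper): with n = 5 variables and L_j = 6 outcomes each, one variable-pair requires 2 x (6 x 6 + 6 + 6) = 96 stored entries, a fixed pivot X_j requires (n - 1) x 96 = 384, and allowing any variable to be the unknown requires n(n - 1) x 96 = 1920 entries; event-covering reduces these figures by shrinking each outcome set to the corresponding covered event subset.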
G. Time complexity
The computational complexity is also relatively low. The number of chi-square test applications is (L_k + L_j + 1) for a variable-pair (X_k, X_j). For data represented as n-tuples there are C(n, 2) = n(n - 1)/2 different variable-pairs, and the total number of statistical test applications is

SUM_{j=1}^{n} SUM_{k=1, k != j}^{n} (L_k + L_j + 1),

or

O[n^2 (max_k L_k)],   k = 1, 2, ..., n.

Including the calculation of the probability estimates, the complexity of the event-covering process is

O[M n^2 (max_k L_k)],   k = 1, 2, ..., n,

where M is the number of samples used for probability estimation. The NS calculation is linearly proportional to the number of selected events in the estimation.

IV. EXPERIMENTAL EVALUATION
Fig. 2. Initial 30 tuples in the generation of data set 1 (5-tuples such as (ABOOB), (BCOOA), (OCAOA) and (COOBO); the zero entries mark positions later filled with chance values).
A. Experiments using simulated data
1. Data set 1. To evaluate the performance of the inference method using event-covering, a set of simulated data in the form X = (X_1, X_2, ..., X_5) is generated. For the 30 observations in Fig. 2, the non-zero values indicate the interdependency prescribed below (values listed together are concurrent outcomes):

x_1 = A, x_2 = B, x_5 = B;
x_1 = B, x_2 = C;
x_1 = C, x_4 = B;
x_2 = C, x_3 = A;
x_2 = C, x_5 = A.

To obtain a larger ensemble, the 30 tuples are first duplicated 8 times to produce 240 tuples. The zeroes are then replaced by pseudo-random outcomes generated from the set of possible outcomes {A, B, C, D, E, F} with equal probability. Then, 160 tuples consisting entirely of equally probable pseudo-random outcomes from the same set are added to make a total of 400 tuples. For convenience, these pseudo-random values are called "chance values". This set of sample observations thus consists of values which are interdependent with each other and of other values which are not. That is, the values with interdependence can be determined one from another within the same tuple, but the "chance values" cannot be determined except by chance. After the tuples are generated, 5% of the values are randomly removed to obtain an ensemble of tuples with missing values. In other words, there are 100 unknown values in this set of incomplete data (represented by a "?" in Fig. 3). This ensemble of 400 tuples is considered as the observations acquired during the learning phase.
The objective of the experiment is to estimate each of the unknown values based on the observed values in the same tuple. Since not all of the outcomes in a tuple are interdependent, event-covering is used to see whether or not it can screen out the chance values (which do not have any interdependency). Two methods are adopted in the experiments: method 1 uses event-covering (the calculation is based on the incomplete probability scheme on the covered event subspace) for estimating the unknown values, while method 2 does not (the calculation is based on the probability scheme on the complete outcome space). Hence, the only difference between the two methods is the application of event-covering, and this in turn affects the rates of correct and rejected estimation. In the estimation, a 95% significance level is used in all the chi-square tests. The two estimation methods are then compared. A summary of the results is given in Table 1. The entries in Table 1 indicate the number of values which are (i) correctly estimated, (ii) incorrectly estimated and (iii) rejected as unknown for lack of observed or statistically significant information for the estimation. It is noted that while both methods demonstrate high performance among the outcomes with interdependency, method 2 has a high error rate among the chance values. Obviously the use of event-covering in method 1 has removed many of the incorrect inferences among the chance values. Among the chance values, method 1 rejects 28 of the unknown values while method 2 does not reject any of them. It is also noted that while method 2, in comparison with method 1, has 5 more correctly estimated values, it includes 23 more incorrect ones.
A goodness-of-fit test on the numbers of correct and incorrect inferences among the chance values when method 2 is used indicates that the proportion does not deviate from 1:5, the proportion expected when estimating an unknown value by chance. In summary, as illustrated by this experiment, the method using event-covering can screen out some of the observed chance outcomes when estimating an unknown value.

Fig. 3. Examples of the sample observations for data set 1. (Note: the unknown events are indicated by "?"; some of the events are interdependent and others are just pseudo-randomly generated.)

Table 1. Experimental result using simulated data set 1

            Interdependent values      Chance values
            method 1    method 2       method 1    method 2
Correct        32          32              7          12
Incorrect       1           1             32          55
Reject          0           0             28           0

Note: method 1 uses event-covering and method 2 does not.

2. Data set 2. Further experiments are designed using different simulated data sets. The simulated data sets are based on four tuples of the form X = (X_1, X_2, ..., X_7) (Fig. 4). Each of the four tuples is generated a number of times, as indicated by the frequency column in Fig. 4; as a result, a total of 300 tuples are generated in each set. Each of the values in a tuple can be determined by the other values in the same tuple. Then a percentage of the generated values are changed to values taken from the set {D, E, F} with equal probability. (These events are irrelevant and carry no information about the other events in the tuple.) Three sets of data having 20, 40 and 60% of these irrelevant values are thus generated. Then 20% (420) of the values are taken out as missing from each data set.

Fig. 4. Original tuples for generating data set 2: (BCBCBCB), (ABBBAAB), (ABCABBA) and (CAABCCC), each generated with a prescribed frequency (totalling 300 tuples).

As before, the method using event-covering (method 1) and the method based on the complete outcome space (method 2) are applied and their results in estimating the missing values are compared.
Again, all the chi-square tests are based on the 95% significance level. The estimated values are compared with the corresponding values of the four original tuples. The overall results are tabulated in Table 2. Note that method 1 consistently yields better results than method 2: when the "noise" content (indicated by the percentage of {D, E, F} events) is low, method 1 has a lower error rate, and when the "noise" content is high and there is no observed or statistically significant information for inference, method 1 allows more rejections than method 2.

B. Estimation of unknown residues in cytochrome c in different species
Another experiment, using real-life data, involves the estimation of unknown residues in an ensemble of biomolecules known as cytochrome c. Cytochrome c is a protein found in the cells of every living organism that uses oxygen for respiration. For mammals and vertebrates, it is a long chain molecule consisting of 104 amino acid units strung together in corresponding sequential order and folded into an identical three-dimensional structure. There are, all together, 20 types of amino acid units. It is generally accepted that the ordering, that is, the position of these basic units, uniquely determines the structure and function of the molecule. Hence the interdependency of the residue unit types may reflect the structural and functional information of the molecule ensembles.(35) Generally, cytochrome c molecules of lower-form organisms are longer (from 104 to 112 units). In most of the known species, the addition or deletion of amino acid segments occurs at the end of the molecule; mutations are confined to replacement of amino acid units. An ensemble of cytochrome c molecules consisting of 67 species is adapted from Refs (8, 9). The sequences are aligned as suggested in Ref. (9), with "empty" (missing) elements represented. With the invariant sites deleted, there are only 76 cytochrome c sites in the ensemble (i.e. 76 variables); invariant sites are those sites where only one type of amino acid is observed for the whole ensemble. Previous studies on this set of data are found in Refs (33, 35). To evaluate the inference method in estimating the amino acid type of missing units, 5% of the amino acid units are randomly removed from the 67 species.
Table 2. Overall results comparing two methods (data set 2)

Noise level     Method 1                     Method 2
20%             4 errors                     7 errors
40%             16 errors                    32 errors
60%             32 errors, 46 rejections     67 errors, 0 rejections

Note: method 1 uses event-covering and method 2 does not.

Table 3. Estimation of residues in cytochrome c

             Inference result after step 1    Inference result after step 2
Correct      116 (46.0%)                      191 (75.8%)
Incorrect    20 (7.9%)                        24 (9.5%)
Reject       116 (46.0%)                      37 (14.7%)
Total        252 (100.0%)                     252 (100.0%)
Since sites with essentially one major amino acid type generally have low interdependency with other sites (though they possess high redundancy of information), a two-step inference is proposed for these sites. The first step is based on interdependence information and the second step on redundancy information. First, the amino acid type is estimated using the proposed inference method. When a rejection occurs, the amino acid type is assigned the majority value at that site if the probability of the majority value is greater than 90% (i.e. the redundancy is high(35)). The results are summarized in Table 3. It is noted that the error rate is low. Thus this experiment further confirms the interdependency and redundancy of these basic biological units.(33, 35) The experiment indicates that the proposed method, based on interdependence information and combined with a criterion based on the redundancy information of the site, can indeed estimate the missing amino acid types.
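A brief sketch (ours) of this two-step rule follows; the 90% threshold comes from the text, while the function and argument names are hypothetical.

```python
# Two-step residue estimation: interdependence first, redundancy as a fallback (our sketch).
def two_step_estimate(ns_estimate, site_majority, majority_prob, threshold=0.90):
    """Step 1: accept the minimum-NS inference if one was made.
    Step 2: on rejection, fall back to the site's majority amino acid type
    when the probability of that majority value exceeds the threshold."""
    if ns_estimate is not None:
        return ns_estimate            # step 1: interdependence information
    if majority_prob > threshold:
        return site_majority          # step 2: redundancy information
    return None                       # still rejected

print(two_step_estimate(None, "GLY", 0.93))   # 'GLY' via the redundancy fallback
```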
C. Supervised classification based on the biomolecules of cytochrome c
To demonstrate the capability of the proposed method in supervised classification, the ensemble of cytochrome c biomolecules is again used. The ensemble of 67 cytochrome c sequences can be classified into 8 taxonomical groups of species (Fig. 5). These 8 class labels are entered as supervised class information for the species. Each of the cytochrome c molecules is then used to estimate its membership based on the supervised class information of the remaining 66 samples; this corresponds to the hold-out method of evaluation. Among the 67 samples, 3 "natural" subgroups are identified: the animal subgroup, the plant subgroup and the micro-organism subgroup. All the species in the animal subgroup (Nos. 1-35) and the plant subgroup (Nos. 44-67) are correctly classified. However, in the subgroup of micro-organisms, Crithidia (No. 42) is classified with the plants, while N. Crassa (No. 39) and
Euglena (No. 43) are classified as animals based on the selected amino acid type information. The other micro-organism species (Nos. 36, 37, 38, 40, 41) are identified as a subgroup. Next, species Nos. 1-35 are classified separately using the hold-out method, despite the small sample size. Three subgroups within this animal subgroup are identified: the mammal subgroup (Nos. 1-16), the bird subgroup (Nos. 17-21) and the insect subgroup (Nos. 32-35). Eight species (Nos. 23-28, 30, 31) are misclassified and the rest (Nos. 22, 29) are not classified into any further subgroup within the animal subgroup. Even though the classifier does not label some of the data according to the class information entered, the results from the first and second steps show a meaningful taxonomical classification indicating the hierarchy of the species. The misclassifications and rejected classifications are probably due to the small class samples and random variation within the subgroups. It should be noted that this method of classification into taxonomical subgroups is based on the observed biomolecular information of the species, and the "true" input taxonomical structure may itself still be speculative. The method uses only the statistically relevant information selected by the event-covering method, without predetermining the usefulness of an amino acid type for classification.

V. CONCLUDING REMARKS
We have presented an effective tool for probabilistic inference. The fundamental technique, the measure of surprisal using event-covering, is aimed at making an estimation by utilizing joint information from multiple observed events which are themselves detected to be statistically relevant. The probabilistic inference method can perform a great variety of tasks.
Fig. 5. The different species and the taxonomical subgroups (corresponding to the cytochrome c data; species number and species name).

Mammals (1-16): Man, Chimpanzee, Rhesus, Horse, Donkey, Zebra, Cow, Pig, Sheep, Camel, Great whale, Elephant seal, Dog, Bat, Rabbit, Kangaroo.
Birds (17-22): Chicken, Turkey, Emu, King penguin, Pekin duck, Pigeon.
Reptiles and amphibians (23-25): Snapping turtle, Rattlesnake, Bullfrog.
Fishes (26-30): Tuna, Bonito, Carp, Dogfish, Pacific lamprey.
Mollusk (31): Snail.
Insects (32-35): Fruit fly, Screw worm fly, Samia cynthia, Tobacco horn worm moth.
Micro-organisms (36-43): Saccharomyces, Debaryomyces, Candida krusei, N. Crassa, Humicola, Ustilago, Crithidia, Euglena.
Higher plants (44-67): Nigela, Mung bean, Cauliflower, Rape, Pumpkin, Hemp, Elder, Abutilon seed, Cotton seed, Castor bean, Tomato, Maize, Arum, Sesame seed, Leek, Acer, Niger, Sunflower seed, Nasturtium, Parsnip, Wheat germ, Buckwheat seed, Spinach, Ginkgo.
It is flexible and can be used to infer an unknown event or the unknown class of an observation in many pattern recognition and decision-support problems. Most notably, the inference method has the following characteristics.
(1) An estimation for an unknown or uncertain event can still be made even if some of the events for drawing an inference are missing or unobserved.
(2) The method is able to combine different pieces of information by weighting them according to the restricted variables' interdependency; thus different weights can be associated with the different information available.
(3) Only a selected subset of the observed events is used for estimation, during both the learning phase and the inference phase. This selection is based on a statistical property reflecting the data's inherent interdependence relationships.
The final decision for an estimated value is derived from the information measure, which indicates the degree of uncertainty between alternative values. When the different estimations are compared, the optimum decision can be made. The event-covering method can deduce a certain structure of the inherent interdependent relationships
relating the events. To the best of our knowledge, this is the first attempt where the outcome subspace of a variable is considered for estimation and classification purposes. We have evaluated the method's capability and usefulness by experiments using simulated and real life data. The result has been very encouraging.
REFERENCES
1. A. Barr and E. A. Feigenbaum, eds, The Handbook of Artificial Intelligence, Vols 1 and 2. HeurisTech Press, Stanford, CA (1981).
2. B. G. Buchanan and R. O. Duda, Principles of rule-based expert systems, Adv. Comput. 22, 163-216 (1983).
3. D. K. C. Chiu and A. K. C. Wong, Synthesizing knowledge: a cluster analysis approach using event-covering, IEEE Trans. Syst. Man Cybernet. (to be published).
4. C. K. Chow and C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory IT-14, 462-467 (1968).
5. R. Christensen, Entropy minimax, a non-Bayesian approach to probability estimation from empirical data, Proc. IEEE Int. Conf. on Cybernetics and Society, pp. 321-325 (1973).
6. W. J. Clancey, The epistemology of a rule-based expert system - a framework for explanation, Artif. Intell. 20, 215-251 (1983).
7. P. R. Cohen and E. A. Feigenbaum, The Handbook of Artificial Intelligence, Vol. 3. William Kaufmann, Los Altos, CA (1982).
8. M. O. Dayhoff, Atlas of Protein Sequence and Structure, Vol. 5. Silver Spring, MD (1972).
9. R. E. Dickerson, The structure and history of an ancient protein, Scient. Am. 226, 58-72 (1972).
10. R. E. Dickerson and R. Timkovich, The Enzymes, P. Boyer, ed., Vol. 11, p. 397 (1975).
11. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. John Wiley, New York (1973).
12. B. Forte, M. de Lascurain and A. K. C. Wong, The best lower bound of the maximum entropy for discretized two-dimensional probability distributions, IEEE Trans. Inf. Theory (submitted).
13. P. Friedland, Introduction to special section on architectures for knowledge-based systems, Comm. ACM 28, 902-903 (1985).
14. C. Glymour, Independence assumptions and Bayesian updating, Artif. Intell. 25, 95-99 (1985).
15. S. Guiasu, Information Theory with Applications. McGraw-Hill, New York (1977).
16. R. M. Haralick, Decision making in context, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5, 417-428 (1983).
17. H. H. Ku and S. Kullback, Approximating discrete probability distributions, IEEE Trans. Inf. Theory IT-16, 368-372 (1970).
18. S. Kullback, Information Theory and Statistics. Wiley, New York (1959).
19. E. Mortimer, Blaise Pascal: the life and work of a realist, Mathematics, An Introduction to Its Spirit and Use, Readings from Scientific American. W. H. Freeman, San Francisco (1979).
20. A. Newell, The knowledge level, Artif. Intell. 18, 87-127 (1982).
21. A. Newell and H. A. Simon, Computer science as empirical inquiry: symbols and search, Comm. ACM 19, 113-126 (1976).
22. D. B. Osteyee and I. J. Good, Information, Weight of Evidence, the Singularity between Probability Measures and Signal Detection. Springer-Verlag, Berlin (1974).
23. E. P. D. Pednault, S. W. Zucker and L. V. Muresan, On the independence assumption underlying subjective Bayesian updating, Artif. Intell. 16, 213-222 (1981).
24. H. Prade, A computational approach to approximate and plausible reasoning with applications to expert systems, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-7, 260-283 (1985).
25. R. D. Smallwood, A Decision Structure for Teaching Machines. M.I.T. Press, Cambridge, MA (1962).
26. P. Szolovits and S. G. Pauker, Categorical and probabilistic reasoning in medical diagnosis, Artif. Intell. 11, 115-144 (1978).
27. D. Wang and A. K. C. Wong, Classification of discrete data with feature space transformation, IEEE Trans. Aut. Control AC-24, 434-437 (1979).
28. S. Watanabe, Knowing and Guessing, a Quantitative Study of Inference and Information. John Wiley, New York (1969).
29. N. Wiener, Cybernetics, or Control and Communication in the Animal and the Machine. Wiley, New York (1948).
30. A. K. C. Wong, D. K. Chiu and M. Lascurain, Inference and cluster analysis of incomplete mixed-mode data, Proc. Int. Symposium on New Directions in Computing, Norway, pp. 211-219 (1985).
31. A. K. C. Wong and L. Goldfarb, Pattern recognition of relational structures, Pattern Recognition Theory and Applications, J. Kittler, K. S. Fu and L. F. Pau, eds, pp. 157-175. D. Reidel, Hingham, MA (1982).
32. A. K. C. Wong and D. Ghahraman, A statistical analysis of interdependence in character sequences, Inf. Sci. 8, 173-188 (1975).
33. A. K. C. Wong and T. S. Liu, Typicality, diversity and feature pattern of an ensemble, IEEE Trans. Comput. C-24, 158-181 (1975).
34. A. K. C. Wong and T. P. Liu, Random graph mappings and distribution, Institute of Computer Research Report, University of Waterloo, Canada (1985).
35. A. K. C. Wong, T. S. Liu and C. C. Wang, Statistical analysis of residue variability in cytochrome c, J. molec. Biol. 102, 287-295 (1976).
36. A. K. C. Wong and M. Vogel, Resolution-dependent information measures for image analysis, IEEE Trans. Syst. Man Cybernet. SMC-7 (1977).
37. A. K. C. Wong and C. C. Wang, DECA - a discrete-valued clustering algorithm, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1, 342-349 (1979).
38. A. K. C. Wong and M. L. You, Distance and entropy measure of random graph with application to structural pattern recognition, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-7, 599-609 (1985).
About the Author - ANDREW K. C. WONG is currently a Professor of Systems Design Engineering and the Director of the PAMI Group, University of Waterloo, Canada. In 1984, he also assumed the responsibility of Research Director, in charge of the research portion of the Robotic Vision and Knowledge Base Project at the University. Dr. Wong received his Ph.D. in 1968 from Carnegie-Mellon University, where he taught for several years. Since then, he has authored and co-authored chapters/sections in several engineering books and published many articles in scientific journals and conference proceedings. He is currently an Associate Editor of the journal Computers in Biology and Medicine.

About the Author - DAVID CHIU was born in Hong Kong. He received the M.Sc. degree in Computing and Information Science from Queen's University, Kingston, Canada in 1979. From 1979 to 1982 he worked at NCR Canada Ltd. on unconstrained character recognition. He is currently working toward the Ph.D. degree in the Department of Systems Design Engineering, University of Waterloo, Canada. His research interests include pattern analysis, knowledge-based systems, artificial intelligence and image processing.