Pattern Recognition, Vol. 23, No. 12, pp. 1427-1439, 1990. Printed in Great Britain.
0031-3203/90 $3.00 + .00. Pergamon Press plc. © 1990 Pattern Recognition Society.
THE OUTCOME ADVISOR®†

EDWARD A. PATRICK
810 Matson Place, Cincinnati, Ohio 45204, U.S.A.

(Received 17 August 1989; in revised form 20 December 1989; received for publication 29 January 1990)

Abstract--A new classification method called The Outcome Advisor® (OA) is presented which is an outgrowth of statistical pattern recognition and the Patrick-Fischer Generalized K-nearest Neighbor Decision Rule. Involved are new definitions of relative frequency and correlation. Training examples are stored, and processing begins once findings (a focus) are presented. As an inference system an almost unlimited number of inferences can be made, and as a classification system any feature can be used to define categories. Implementable as a new neural net structure which is distribution free, multidimensional dependencies in the feature space for each category are learned utilizing a new definition of relative frequency. The new method may help explain how certain neural net structures may be estimating multidimensional dependencies. The OA has been trained and tested on established data bases and has improved performance as measured by experimental probability of error.

Pattern recognition   Artificial intelligence   Medical decision making   Outcome advisor   Outcome analysis   Decision making   Nearest neighbor   Neural nets   Learning   Estimation   Decision analysis   Classification   Statistics   Inference   Medical outcome analysis   Statistical pattern recognition
1. INTRODUCTION

Optimum classification of findings x in the feature space is achieved by computing the a posteriori probabilities of the M categories in the category space, which involves computing the category conditional probability density function (ccpdf) for each of the M categories.(1) The optimality criterion is minimum probability of error or minimum risk. When the ccpdfs are unknown, but their family is known to be characterized by a fixed but unknown set of parameters, then a minimum conditional probability of error or minimum conditional risk can be achieved. Conditioning is on training samples, which can be either supervised for each category or unsupervised. If unsupervised, a certain minimum amount of a priori knowledge is required.(1) In the process, the conditional probabilities of the fixed but unknown parameters are estimated, i.e. learned. Furthermore, decision boundaries which "separate" the categories are indirectly and optimally constructed.
† The Outcome Advisor is a registered trademark. Supported in part by contracts with FAA, Shriners Burn Institute, Patrick Consult, Inc., and the National Science Foundation. Research conducted while Dr Patrick was Professor of Electrical Engineering, Purdue University, and subsequently Research Professor of Electrical and Computer Engineering, University of Cincinnati.
‡ A neural net classification system herein is defined as a transformation from a vector of findings to two or more categories; the structure of the transformation consists of layers of processing units interconnected by link parameters, and estimators are devised for these parameters.
Neural-net approaches‡ to classification systems, with the preceding context in mind, also are characterized by fixed but unknown parameters. These parameters are learned by processing training examples (samples) using various formulas based on hill climbing or reward and punishment. These procedures are "unsupervised" or "supervised", in quotes because in the context of neural nets they may not be precisely defined mathematically. In this paper we define supervised and unsupervised learning according to precise definitions in statistical pattern recognition.(1)

In the 1960s and 1970s, studies of how to estimate the ccpdf had one direction leading away from parametric approaches, such as assuming the family of ccpdfs is multivariate Gaussian, unions of multivariate Gaussians, or multinomial, and toward approaches whereby local mathematical structures are placed according to densities of samples, such as by constructing equivalence regions.(1) There resulted the Generalized K-nearest Neighbor Decision Rule by Patrick and Fischer(2) (see reference 1, pp. 217-268), from which evolved The Outcome Advisor® (OA)(3) which is described herein. The OA can be implemented using algorithmic programming or as a new neural-net structure of processing units. Implementation as a new neural net structure has advantages arising from the potential for implementing massive numbers of identical processing units. A major difference between The OA neural-net structure and that of other neural nets is that the OA decision boundaries are learned indirectly through direct estimation of category conditional probability density functions.
In The OA, examples are stored and used for learning only after a focus at the findings is presented. The focus itself provides a great deal of information. Furthermore, The OA provides an integration of viewpoints from classical statistics, optimum classifiers, rule-based systems, and heuristics. Whereas conventional neural nets may estimate parameters using external feedback based on reward and punishment, feedback occurs naturally in The OA's processing units by comparing the findings with a candidate example and projecting the resulting error onto another training sample. The basic storage element in The OA is a training example, although preprocessing can provide estimated parameters. Alternatively, The OA can be viewed as adaptively computing parameters defined once the focus is received, where the parameters relate probabilities at the training samples to the focus. Concerns about storage and processing time requirements resulting from storing training records are rapidly disappearing as technology advances.

The Outcome Advisor (OA) learns to classify findings into categories from stored, supervised training records for each of M categories. The features of the findings need not be binary and need not have ordered feature values. A Euclidean closeness measure is not required.

Utilizing a model in L + 1 dimensional space called "In the Beginning there was Hyperspace" (reference 3, pp. 335-337), any combination of L features (AND of features and OR of feature values) can be used to define an outcome B, and any combination of the L features (feature values) can be used to define a condition A. Then p(B|A) can be estimated. In this mode The OA is an "inference system"; p(B|A) is a random variable with central limit theorem properties, and its confidence limits and convergence properties can be determined. Alternatively, any one of the L + 1 features can be used as a "category feature" whereby the values of this feature index categories r_1*, r_2*, ..., r_M* in a category space; The OA then becomes a classification system, with the remaining L features used to define features of the patterns. The system estimates a posteriori category probabilities p(r_i* | x), i = 1, 2, ..., M, and the convergence properties of these estimators can be determined. With this model The OA is a pure statistical classification system where the findings x provide a "focus". Conventional neural nets do not have such flexibility, which is especially important for outcome analysis.

Stored training records are used for the computations whether The OA is used as an inference system or a classification system.† With regard to the inference mode, if a past record was never seen which agrees with both events B and A, then the estimate of p(B|A) is 0; likewise, in the classification mode, if a past record was never seen for category r_i* and focus x, then the estimate of p(r_i* | x) is 0. In both modes, probability estimators are based on the classical relative frequency definition of probability theory.

Concentrating for now on the classification mode, a new definition of relative frequency (reference 3, pp. 330-335) can be defined and applied whereby the probability is estimated that a training record "could" have been at the focus. An alternative way of viewing this is that a probabilistic closeness measure is defined and utilized for projecting records not at the focus to the focus. Another viewpoint is that of the Generalized K-nearest Neighbor Decision Rule, which motivated the implementation. A mathematical viewpoint is that equivalence regions (reference 1, pp. 137-147) are defined through a priori defined sets of transformations (with estimated parameters), where a transformation relates the probability at a candidate point x_c somewhere in hyperspace to the focus x. Furthermore, these a priori transformations can be viewed as rules of thumb or heuristics (a discussion of heuristics as a priori models is found in Pearl(4) and is consistent with Patrick's viewpoint(1) of the need for a priori knowledge in order for learning to be possible). In fact, the focus used along with the new definition of correlation can be viewed as a powerful heuristic. A new family of ccpdfs is defined where a member of the family is characterized by parameters which are probabilities in a region at the focus x and probabilities in regions at each of the candidate training samples. In a sense these are adaptive ccpdfs, adapted to the focus and the training samples.

A model with a new neural net structure is developed to implement The OA with its probabilistic closeness measure. The structure is distribution free‡ where training records are stored in memory as opposed to storing propagation weights among neurons. Yet basic processing units, or neurons, are interconnected. The basic processing unit of the new neural net involves two vector correlations with a complement and accumulate. Computations begin once a focus (findings) is received. Viewed as a neural network, processing of a candidate training sample is in conjunction with other training samples stored as nodes in the network. These other training samples help to explain why the candidate sample could have been from a category (the category of the findings at the focus) even though the candidate sample is not identical to the focus.

† As defined here, an inference system computes p(B|A) with B + B̄ = category space, whereas for a classification system the categories would be r_1* = B_1, r_2* = B_2, ..., r_M* = B_M and a posteriori probabilities of these categories are computed.
‡ Strictly speaking, The OA is parametric in that the underlying structure of the ccpdf is multinomial, characterized by a very large number of "bin" probabilities. But the only bin probabilities (parameters) estimated are at the focus x.
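To make the inference mode concrete, the sketch below is an illustration only (the record layout and helper names are assumptions, not the author's implementation): it estimates p(B|A) by classical relative frequency over stored records, where the condition A and the outcome B are each ANDs over features of ORs over feature values, and returns 0 when no stored record agrees with both A and B.

```python
# Illustrative sketch of the OA inference mode: estimate p(B | A) from stored
# records by classical relative frequency.  Events are expressed as
# {feature: set_of_allowed_values}, i.e. an AND over features of ORs over values.
from typing import Dict, Set, List, Optional

Record = Dict[str, str]
Event = Dict[str, Set[str]]

def satisfies(record: Record, event: Event) -> bool:
    """True if the record's value for every feature in the event is allowed."""
    return all(record.get(f) in allowed for f, allowed in event.items())

def estimate_p_B_given_A(records: List[Record], B: Event, A: Event) -> Optional[float]:
    """Relative-frequency estimate of p(B | A); None if no record satisfies A."""
    in_A = [r for r in records if satisfies(r, A)]
    if not in_A:
        return None                      # no conditioning records seen
    in_A_and_B = sum(1 for r in in_A if satisfies(r, B))
    return in_A_and_B / len(in_A)        # 0 if no record agrees with both A and B

# Hypothetical records and features, for illustration only:
records = [
    {"age": "40-50", "smoker": "yes", "outcome": "disease"},
    {"age": "40-50", "smoker": "no",  "outcome": "no_disease"},
    {"age": "50-60", "smoker": "yes", "outcome": "disease"},
]
A = {"smoker": {"yes"}}                  # condition
B = {"outcome": {"disease"}}             # outcome
print(estimate_p_B_given_A(records, B, A))   # -> 1.0
```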
2. LITERATURE REVIEW

2.1. Statistical pattern recognition

In 1972 Patrick and colleagues(1) developed and presented various classification systems. Bayes, minimum risk, and minimum conditional risk (conditioned on training examples) decision rules were derived for various families of category conditional probability density functions, including those with reproducing statistics. In particular, it was shown how, for the multivariate Gaussian family, matched filters and Wiener-Hopf filters result as special cases (reference 1, pp. 193-208). Furthermore, it was shown how the Bayes minimum risk decision rule with a Gaussian multivariate family and diagonal covariance matrices results in an adaptive threshold unit (Adaline) (reference 1, pp. 268-277). Also it was shown that in these basic neural-net structures, decision boundaries were defined and estimators for the boundaries constructed. Estimators for parameters characterizing boundaries can have convergence problems which do not exist when parameters of ccpdfs are directly estimated.

As pointed out in recent papers, research on neural net structures such as Perceptrons(5,6) was continued by some researchers including Widrow and Grossberg,(7,8) and there followed renewed interest in these approaches. Recently Kohonen, Barna, and Chrisley(9) have compared a set of classification systems and utilized Bayes as a benchmark system with reference to researchers including Patrick.(1)

Alternatively, we also had approached the classification problem as that of estimating the local probability density for each category at the focus (findings). Although we were interested in engineering the category conditional probability density function as in Consult-I,(3) we visualized alternatively accomplishing the task by saving the training examples and not computing estimates of any parameters until the focus was received. This was based on the premise that the focus itself contains a great deal of correlation information among features.
The result was the Generalized K-nearest Neighbor Decision Rule.(2) This method as initially published was for situations where a distance measure existed in the feature space. The Generalized K-nearest Neighbor Decision Rule estimates the category conditional probability density at the focus according to

p(x | r_i*) = (k/n) / V_R,    (1.1)

where k is the number out of n training examples from category r_i* which are in a region R with volume V_R centered at x. The region R is adaptive since it depends on the focus and the probability density for the class at the focus. A distance measure is required to determine the k nearest neighbors. Furthermore, note that k/n is a relative frequency. Our approach was to shape the distance measure at the focus.(1,2,19,20)
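A minimal numeric sketch of equation (1.1) follows. It assumes ordered features and a plain Euclidean metric so that the region volume can be computed in closed form; the Patrick-Fischer rule instead shapes the distance measure at the focus, so this is only an illustration of the k/n over V_R idea.

```python
# Minimal numeric sketch of equation (1.1): the K-nearest-neighbour density
# estimate p(x | r_i*) = (k/n) / V_R, assuming ordered features and a Euclidean
# metric (the general rule instead shapes the distance measure at the focus).
import math
import numpy as np

def knn_density(focus: np.ndarray, samples: np.ndarray, k: int) -> float:
    """Density at `focus` from n samples of one category using the k nearest."""
    n, L = samples.shape
    dists = np.sort(np.linalg.norm(samples - focus, axis=1))
    radius = dists[k - 1]                       # region R: ball reaching the kth neighbour
    volume = (math.pi ** (L / 2) / math.gamma(L / 2 + 1)) * radius ** L
    return (k / n) / volume

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 2))             # 500 samples from a 2-D standard normal
print(knn_density(np.zeros(2), samples, k=25))  # roughly 1/(2*pi) ~ 0.16 at the origin
```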
2.2. Neural net structures

Typically, authors describe the basic building block of a neural net as a processing unit which implements a weighted sum of input feature values (the weights are parameters to be learned) and then outputs a function (transfer function) of the weighted sum. An Adaline is a typical basic processing unit where weights are adjusted based on the error between the output and a desired output. We presented (reference 1, pp. 268-277), for example, hill climbing techniques such as that by Koford and Groner for estimating these weights and discussed the relationship with regression functions for stochastic approximation. We also presented the method of using piecewise linear discriminant functions to implement a complex decision boundary in hyperspace.(1,3)

Hopfield devised a network model which in effect allows the feedback of the state in one dimension in the feature space to affect the state in another dimension; but researchers have found that a relatively large number of examples are required for training. Correlation information of this sort is important in order to separate categories of patterns. The basic Adaline structure implements a Gaussian decision rule like structure with a diagonal covariance matrix, and Hopfield's approach tends to put back the missing correlation resulting from the diagonal matrix. Alternative methods of devising estimators for network weights are back propagation(15) and counter propagation. We mention these models to emphasize that they are parametric models and parameters are estimated from training examples; the examples themselves are not stored in memory to estimate parameters at classification time.

An approach by Robert Hecht-Nielsen called "Nearest Matched Filter Classification"(13) would appear to store n supervised training samples, each along with its known category, match the focus to each stored sample, and then classify the focus as the category of the sample "nearest" to the focus. The closeness measure is the matched filter, i.e. a conventional correlation of the findings with a known category pattern. In the same sense as Cover and Hart,(14) certain convergence properties can be demonstrated. But just as for Cover and Hart, the method by Hecht-Nielsen does not provide for local category conditional probability density estimation. Furthermore, The OA considers every training example as a candidate to contribute "relative frequency" to the focus for that category, not just the training example "nearest" to the focus in a matched filter sense. Also, in The OA, examples other than a candidate sample are used to estimate the candidate's relative frequency contribution to the focus.

We developed the Generalized K-nearest Neighbor Decision Rule(2) as a method for estimating the local probability density at a focus for each of M categories. As previously discussed, The Outcome Advisor was later developed whereby local density estimation utilizes a new definition of relative frequency. With these approaches, a complex decision boundary is indirectly obtained.
3. INTRODUCTION TO THE OUTCOME ADVISOR
3.1. Conventional correlation and new viewpoints of correlation and relative frequency

The Outcome Advisor(3) is an outgrowth of the Generalized K-nearest Neighbor Decision Rule. We now consider three classical concepts: relative frequency, correlation, and distance measure, and then start over with a different viewpoint.

In an L dimensional feature space with a region B as focus, n statistically independent samples have relative frequency in B of k/n if k of these samples are in B. It does not count if an example is "almost in B", even if only one or a few dimensions do not agree. It also does not matter if we know something about one dimensional or higher dimensional (< L) probability densities, because the classical definition of relative frequency has no provision for conditioning on a priori knowledge.

Conventionally, correlation between two random variables X_j and X_k in two respective dimensions is defined

r_jk = E[(X_j - m_j)(X_k - m_k)],    (3.2a)

a general term which conveniently fits in the covariance matrix Σ for use in defining the quadratic function

(x - m)^t Σ^{-1} (x - m),    (3.2b)

which in turn is used to define the multivariate Gaussian p.d.f. But is this definition appropriate for a feature space where features do not have ordered feature values and linear regression of one dimension's variable against another dimension's variable is not the objective?

Classically, we think of distance as Euclidean, which is not an appropriate closeness measure when features can have non-ordered feature values. But there are alternative closeness measures that do not require a Euclidean metric.(1) For example, we could define the correlation between two vectors x and x_s as

corr1(x, x_s) = number of feature values† of x_s identical with those of x,    (3.3a)

or alternatively select a second definition of correlation,

corr2(x, x_s, n) = probability that x_s could have been at x given a priori knowledge of training samples x_1, x_2, ..., x_n from the same category
                 = c(x, x_s | x_1, x_2, ..., x_{s-1}, x_{s+1}, ..., x_n).    (3.3b)

† Similar to Hamming distance.

Clearly (3.3b) is different from (3.3a), and equation (3.3b) seems to use information that equation (3.3a) does not use. But is that cheating from the standpoint of classical statistics? In the 1960s(1) we discovered that learning a ccpdf evolved as follows:

p(x | r*, x_1, x_2, ..., x_n) = ∫ p(x | b_i, r*) p(b_i | x_1, x_2, ..., x_n) db_i.    (3.4)

That is, training examples can be used in estimating a ccpdf if parameters b_i characterizing the function can be identified to estimate. That means that a priori you need to know something about the family of ccpdfs you are trying to estimate.

With these preliminaries in mind, we now define a family of ccpdfs whereby a training sample x_s has a correlation with a focus x which increases as the number of dimensions in agreement increases, according to (3.3a). Also, the correlation is defined to increase if, for those features in disagreement, other training samples from the same category "say" that x_s "could have been" at x. Later we will formalize this viewpoint in terms of an a priori mathematical model of equivalence regions. Alternatively, we could justify the viewpoint as a heuristic model(4) now that this approach to modeling is more acceptable than 17 years ago when conditioning on a priori knowledge was first advocated.(1) With these ideas in mind, define a new relative frequency at x as

p(x | r_i*) = (1/n) Σ_{s=1}^{n} corr2(x, x_s, n).    (3.5)

3.2. Model for correlation at focus using one dimensional probability density estimators

Models for corr2(x, x_s, n) can be developed based on a priori defined models. The first model presented is the one with which we have had the most experience. Given a focus x, let the one dimensional probabilities in a bin at this focus be p_sj:

p_sj = 1      if x_sj = x_j,
     = p̂_j    otherwise,    (3.6)

where p̂_j is an a priori computed estimate for the jth dimension using the training samples. The correlation or closeness measure for this model thus is defined:

corr2(x, x_s, n) = Π_{j=1}^{L} p_sj.    (3.7)
Equation 3.7 can be viewed as the sth sample "being brought to the focus by the other training samples". Since the probability estimates in equation 3.7 are one dimensional (d = 1), it is reasonable that a sufficient number of training examples will be available
to obtain good estimators. A priori models can be devised utilizing two dimensional (d = 2), three dimensional (d = 3), etc., a priori estimation. More training examples n are required to obtain good estimators for two dimensional (d = 2) or three dimensional (d = 3) probability estimators than for one dimension (d = 1).

High correlation for x_s from equation 3.7 results from high relative frequency of a high number of dimensions agreeing at the focus. This is one way of discovering correlation between the focus and training examples for the category concerned. If p_sj is 0 for some feature x_j, then a "CAN'T"(3) is imposed for the focus being from the category concerned. The focus itself has provided correlation information by informing of a dependency among features for the likely category; a very large number of other dependencies have been eliminated and need not be computed because the focus was utilized.
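The following sketch illustrates the d = 1 model of equations (3.5)-(3.7) under stated assumptions: categorical feature values, a priori marginals p̂_j computed here from all n samples rather than the leave-one-out set of (3.3b), and hypothetical function names. Agreeing dimensions contribute a factor of 1, disagreeing dimensions contribute the marginal estimate of the focus value, and a zero marginal imposes a "CAN'T".

```python
# Sketch of the d = 1 model of equations (3.5)-(3.7): a training sample x_s is
# "brought to the focus" x by multiplying, over disagreeing features, the
# a priori one-dimensional estimate of the focus value in that feature.
from collections import Counter
from typing import List, Sequence

def marginals(samples: List[Sequence], L: int) -> List[Counter]:
    """A priori one-dimensional relative frequencies p_hat_j(value) per feature."""
    counts = [Counter(s[j] for s in samples) for j in range(L)]
    return [Counter({v: c / len(samples) for v, c in cj.items()}) for cj in counts]

def corr2(x: Sequence, x_s: Sequence, p_hat: List[Counter]) -> float:
    """Equations (3.6)-(3.7): product over features of 1 (agree) or p_hat_j(x_j)."""
    prod = 1.0
    for j, (xj, xsj) in enumerate(zip(x, x_s)):
        prod *= 1.0 if xsj == xj else p_hat[j][xj]   # 0 if x_j never seen: a "CAN'T"
    return prod

def new_relative_frequency(x: Sequence, samples: List[Sequence]) -> float:
    """Equation (3.5): (1/n) * sum over s of corr2(x, x_s, n) for one category."""
    p_hat = marginals(samples, len(x))
    return sum(corr2(x, s, p_hat) for s in samples) / len(samples)

# Hypothetical categorical training records for one category (L = 3 features)
samples = [("a", "p", "u"), ("a", "q", "u"), ("b", "p", "v"), ("a", "p", "v")]
focus = ("a", "p", "u")
print(new_relative_frequency(focus, samples))   # -> 0.65625
```

A leave-one-out variant, in which the marginals for sample x_s are computed from the other n - 1 samples, would follow definition (3.3b) more exactly.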
3.3. Preprocessing in higher dimensional space

One can anticipate that more correlation information is learned as d increases if n also increases to maintain a priori estimator accuracy. A sketch of this effect is shown in Fig. 1. The general nature of the curve shown in Fig. 1 was confirmed with simulations, consistent with similar experience with the Generalized K-nearest Neighbor Decision Rule (reference 1, p. 245).

Fig. 1. Probability of correct classification versus the dimensionality d of the a priori estimators, for increasing n.

3.4. Experimental results with The OA

For several years we have performed experiments with The OA on various problems. Examples include classification of calcification clusters on breast mammograms(18) as cancer versus benign; for this problem The OA performed with 72% accuracy versus an expert radiologist's 50% on test cases with 110 training samples. In our early work on diagnosing chest pain in the emergency department(10,19) (see pp. 140-161 of reference 10), the Generalized K-nearest Neighbor Decision Rule performed at an average (over three categories) of 73% versus physicians' 51%. More recently The OA has been applied to the problem with considerably increased performance.(19)

4. THE OUTCOME ADVISOR MODEL

4.1. Notation

Notation now is presented in more detail, leading to discussion of other mathematical aspects of The OA. The various feature vectors used thus far are of the general form†

x = [x_1, x_2, ..., x_L],    (4.1)

where x is in the feature space and the jth feature of x has V values,

{x_{j,v}}_{v=1}^{V},    (4.2)

not necessarily ordered. There are M categories in the category space, denoted

{r_i*}_{i=1}^{M}.    (4.3)

Supervised training samples are available for each category, and those for the ith category are denoted

X_{n_i} = {x_s}_{s=1}^{n_i},    (4.4)

and all n samples, n = Σ_{i=1}^{M} n_i, are stored in memory.

When the findings x (the focus) are presented, the objective is to compute for each category r_i* the category conditional probability density function (ccpdf), which also is conditional on the training samples X_{n_i} for that category:

p(x | r_i*, X_{n_i}) : ccpdf.    (4.5)

The focus x is considered to be a region R in the feature space. In practice, each feature of x has one of a discrete number of feature values, where a feature also can be a fuzzy set. For this reason, R often is a "bin" or union of bins as would be used in defining a multidimensional histogram. The focus region R can be enlarged by applying OR operations on feature values, and this often is done in practice with The OA. A conventional neural net does not have this facility. Enlarging the volume of the focus R can be important when the number of training examples is relatively small. Equation (4.5) for the ccpdf is incomplete until we specify the family of these functions (section 4.2). Further development of how the probability density in equation 4.5 is learned using X_{n_i} will be provided.

† Generally, x indicates the findings (i.e. focus) and other vectors are indexed by c to denote a candidate sample and s to denote another nonspecific training sample.

4.2. Family of category conditional probability density functions

The ccpdf (equation 4.5), i.e. p(x | r_i*, X_{n_i}),
is in a family described as follows:

(1) The focus x is in a region R constructed by the feature value ranges of x_1, x_2, ..., x_L, where x_j ∈ x for all j. A basic multidimensional histogram structure underlies the construction, where the feature values x_{j,v} designate either ranges or points. For example, for age, x_{j,v} may represent a range such as [40, 50], whereas for location in air traffic control it might indicate the vth sector out of V sectors.
(2) In like manner a region R_s contains training sample x_s. There exists a probability p_s = p(x ∈ R_s | r_i*) for all such regions, and p_f = p(x ∈ R | r_i*) for the focus.
(3) The ccpdf is characterized by the set {p_s}_{s=1}^{n} and p_f. The region R_s with corresponding p_s may contain more than one training sample when multiple samples fall in the same region.
(4) Once x is presented and p_f exists, there is a q_f such that p_f + q_f = 1.
(5) Each region probability p_s is related to p_f through a mathematical equivalence relationship to be specified.
(6) An estimator p̂_f can be constructed from the samples X_n.

4.3. Equivalence relationships

From statistics there is the concept of equivalence regions, which we have worked with in the past.(1,17,21) Herein we take the viewpoint that an a priori mathematical relationship can be supplied which relates the probability at one region in the feature space to another region. For example, the Gaussian p.d.f. in effect does this by being characterized completely by two variables (mean and standard deviation), with an unlimited number of probability density region pairs thus defined.

Let p_f be the probability density at the focus x and p_c the probability density at a training sample x_c, which we refer to as the candidate sample. Define a priori relationships

p_fc = g_fc(p_c)  ∀ c, R specified,    (4.6)

between the probability in the bin containing training sample x_c not at the focus x and the probability at the focus. If these relationships were known, then indeed a candidate sample x_c not at x would contribute relative frequency to x through the transformation, just as is done when studying stochastic filters based on linear or nonlinear transformations (reference 22, pp. 85-88).

The training sample x_c is called a candidate sample to distinguish it from all other training samples x_s, different from x_c, which are used a priori or in real time to learn the relationship(s) (4.6) to assist x_c in contributing relative frequency to x. However, each training sample will be used in turn as a candidate sample. This indeed is a deviation from classical statistics, and it may be desirable at this point to view equation 4.6 as a heuristic or rule of thumb, i.e. a model. It would appear that we are "boot-strapping", using other samples to "say" that x_c could have contributed to x.(3)

† The symbol ∀ denotes "for all".
‡ d indicates the dimensionality of the subspace of features with values not in agreement. Knowledge of p.d.f.s of dimensionality d is desirable, but it is difficult to store in memory a priori because of the computational explosion as d increases for large L.
What follows is further analysis of the transformation equation (4.6). Recall that the recognition sample x is from one of M categories and we are computing p(x | r_i*) for each category r_i*; we will continue to drop the r_i* notation for convenience. Training samples {x_s}_{s=1}^{n} = X_n from the same category were used to estimate certain "lower dimensional" subspaces of the feature space of dimension‡ d, d < L. For example, in practice d conveniently is one and the L marginal (one dimensional) p.d.f.s are estimated a priori. A priori, the training samples X_n were used to estimate "lower dimensional p.d.f.s", but now they are to be used for learning in the L dimensional feature space. In turn, each L dimensional training sample will be selected, with the remaining n - 1 training samples presumed used for a priori learning in the lower dimensional space of dimension d. The currently selected sample is denoted x_c and called a candidate sample. We are interested in computing

p(x | x_c, X_{n-1}).    (4.7)

The candidate sample x_c is compared with x and we find

x = (v, a), i.e. v and a are subsets of x,    (4.8a)
x_c = (v, b), i.e. v and b are subsets of x_c,    (4.8b)

where

v: denotes those features (with feature values) of x_c in agreement with x, of dimensionality L - d;    (4.9a)
a: denotes those features (with feature values) of x not in agreement with x_c; there are d features in a;    (4.9b)
b: the same features as for a but with the feature values of x_c.    (4.9c)

Substituting equation (4.8a) into equation (4.7) we obtain

p(x | x_c, X_{n-1}) = p(v | a, x_c, X_{n-1}) p(a | x_c, X_{n-1}),    (4.10)

where a in the conditioning of the first term on the right side of equation 4.10 can be
eliminated under the assumption that v is statistically independent of a given x_c and X_{n-1}.

At this point we evoke the concept of reproducing probability density functions for learning (reference 1, pp. 85-106), in particular the binomial p.d.f. The first term on the right of equation (4.10) involves updating the probability density at v by 1, since x_c contributes one sample to v and X_{n-1} can be ignored.† The second term on the right of equation (4.10) is not updated by x_c, since x_c does not contribute to a; thus the value of the second term is that contributed a priori by X_{n-1} learning (see equation 4.11 below). If equation (4.10) truly is reproducing, then as a binomial p.d.f. with probability at x, this probability is updated by a fraction of a sample (let p̂_fc = p(x | x_c, X_{n-1})), i.e.

p̂_fc = (1) · p(a | X_{n-1}).    (4.11a)

Equation (4.11a) is rewritten

p̂_fc = (1) · p_d(a | X_{n-1})    (4.11b)

to emphasize that a is in a subspace of the feature space of dimension d, where d is the number of features of x not having feature values in agreement with the candidate sample x_c. If d = 0, then the candidate sample contributes 1 sample to the relative frequency at x. Contrary to classical statistics, the candidate sample contributes a fraction (of one) to the relative frequency at x even when x ≠ x_c, as long as the a priori probability of a is nonzero. When d = L, the candidate sample contributes no additional learning to the probability at x; the a priori probability clearly is "against" the model being used(1) to estimate the d dimensional space using training samples X_n (see equations 4.13 later). We have assumed that knowledge is unavailable about statistical dependence of v on a in equation 4.10; this is yet to be investigated for when such a priori knowledge or learning is available.

A key to the practical simplicity of The OA is that learning with parametric models can be accomplished in low dimensional space, preferably one dimensional (i.e. the second term on the right of equation (4.10)); thus a reduced number of parameters need be estimated and stored. Alternatively, the first term on the right side of equation (4.10) is not computed until the focus or recognition sample x is received. For this term, the candidate sample x_c contributes multidimensional "correlation" information. But equation (4.10) is for one of the n samples selected as the candidate sample; using the sequence of n samples, multiple combinations of "multidimensional correlation" will be considered in computing the probability that x is from the category concerned. With the equivalence model, the best update of a posteriori probability at the focus x using candidate sample x_c is that x_c contributes "fully" for those dimensions in agreement and with the best estimated probability at the focus for those features in disagreement. We thus see how the focus can provide considerable correlation information, reducing the problem of calculating a contribution from x_c to x to that for only the features in disagreement.

† Given x_c, the sample is at v; thus there is a relative frequency of one in the classical sense. Given x_c, neither a nor X_{n-1} contribute to knowledge of v.

4.4. A priori estimation of subspace probability density functions

Still considering category r_i*, we will drop the notation r_i* for convenience until further noted. It is interesting that a candidate sample x_c contributes relative frequency p̂_fc to the focus as follows:

(i) Determine those features not having values in agreement between x and x_c and denote this subset a.
(ii) These features a not in agreement, d in number, define a subspace S_c(a) at the focus x, and it is this subspace probability density we wish to estimate (either a priori, if possible, or after x is known). Define the estimated subspace probability density at the focus x,

p̂_d(x, x_c) = p̂(S_c(a) | X_n),  x ∈ R,    (4.12)

where d denotes the dimensionality of S_c(a), i.e. the number of features with feature values in disagreement, and n is the number of supervised training samples for the category. Special cases of interest are, for example, where d = 1, 2, 3:

d = 1: S_c(a) = x_j,                 x_j ∈ a,    (4.13a)
d = 2: S_c(a) = (x_j, x_k),          (x_j, x_k) ∈ a,    (4.13b)
d = 3: S_c(a) = (x_j, x_k, x_l),     (x_j, x_k, x_l) ∈ a.    (4.13c)

When d = 1, preprocessing is relatively simple, requiring computing

d = 1: p̂_1(x, x_c) = {{p̂(x_{j,v} | X_n)}_{v=1}^{V}}_{j=1}^{L} = {{p̂_{j,v}}_{v=1}^{V}}_{j=1}^{L},  x_{j,v} ∈ x ∈ R.    (4.14a)

But since x is not known a priori, we would have to compute L one dimensional histograms. In practice this is not difficult for d = 1, but it does become more difficult as d increases. The alternative for d = 1 is to await the focus x and then compute only those probability densities needed in equation (4.14a). This suggests the new neural net structure presented later.

For d > 1, there are (L choose d) permutations of possible d dimensional histograms. For one permutation indexed by σ, the p.d.f.s are:

d = 2: σp̂_2(x, x_c) = p̂(x_j, x_k | X_n),  (x_j, x_k) ∈ x ∈ R,    (4.14b)
d = 3: σp̂_3(x, x_c) = p̂(x_j, x_k, x_l | X_n),  (x_j, x_k, x_l) ∈ x ∈ R.    (4.14c)

This is a considerable amount of a priori computation and storage even for just this one of the many permutations, whereas for d = 1 permutations do not enter into the problem. On the other hand, if preprocessing is not done but instead processing awaits the focus x, then S_c(a) is known for each candidate sample as per equations 4.13, and those specific probability densities can be estimated using X_n. This suggests the desirability of a neural net structure to make these computations. This structure incorporates the following steps:

(1) Focus x received.
(2) S_c(a) determined for candidate sample x_c, d selected.
(3) p̂_d(x, x_c) determined for S_c(a) of dimensionality d; there is no question as to which permutation to use since there is only one.
(4) Equations (4.11) used to update the probability density at the focus.
(5) Return to (2) for the next candidate sample.

4.5. Discussion

Uncertainty increases in p̂_d(x, x_c) as d increases for a fixed number n of training samples. Estimating at dimensionality d provides more knowledge of dependencies among the features of x_c not in agreement with x, at the expense of uncertainty in the estimation. So we wish to keep d "small" if n is not "large". Fortunately, another source of dependency information comes from the focus itself, and yet another source is the new definition of relative frequency itself, whereby frequent agreement of more candidate sample dimensions for the category at the focus "adds up". A dependency missed with one candidate sample may be seen with another, because the latter candidate sample was allowed to come to the focus.

4.6. Estimated probability density at the focus

The estimated ccpdf at the focus x for the category concerned, given the n supervised training samples, is

p̂(x | r_i*, X_n) = (1/n) Σ_{c=1}^{n} p̂_d(x, x_c),    (4.15)

where d is the dimensionality of the a priori estimate, x is the focus, x_c is the cth candidate training sample which contributes relative frequency to the focus, and n is the number of supervised training examples for the category.
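One reading of steps (1)-(5) and equations (4.11)-(4.15) is sketched below, with the subspace density p̂(S_c(a) | X_{n-1}) estimated on the fly as the relative frequency, among the other n - 1 samples, of agreement with the focus on the disagreeing features. The record format and function names are assumptions, not the author's code.

```python
# Sketch of the focus-driven estimator of equation (4.15) following steps (1)-(5):
# for each candidate sample x_c, the features of the focus x that disagree with
# x_c define the subspace S_c(a); its probability at the focus is estimated from
# the remaining n-1 samples, and the results are averaged over candidates.
from typing import List, Sequence, Tuple

def disagreement(x: Sequence, x_c: Sequence) -> Tuple[int, ...]:
    """Indices of the features of x not in agreement with x_c (the set a)."""
    return tuple(j for j, (xj, xcj) in enumerate(zip(x, x_c)) if xj != xcj)

def subspace_density(x: Sequence, a: Tuple[int, ...], others: List[Sequence]) -> float:
    """p_hat(S_c(a) | X_{n-1}): relative frequency, among the other samples,
    of agreement with the focus on every feature in a (d = len(a))."""
    if not others:
        return 0.0
    hits = sum(1 for s in others if all(s[j] == x[j] for j in a))
    return hits / len(others)

def ccpdf_at_focus(x: Sequence, samples: List[Sequence]) -> float:
    """Equation (4.15): (1/n) * sum over c of p_hat_d(x, x_c) for one category."""
    n = len(samples)
    total = 0.0
    for c, x_c in enumerate(samples):                    # step (5): loop over candidates
        a = disagreement(x, x_c)                         # step (2): S_c(a), d = len(a)
        if not a:                                        # d = 0: full classical count of 1
            total += 1.0
            continue
        others = samples[:c] + samples[c + 1:]           # X_{n-1}
        total += subspace_density(x, a, others)          # steps (3)-(4)
    return total / n

samples = [("a", "p", "u"), ("a", "q", "u"), ("b", "p", "v"), ("a", "p", "v")]
focus = ("a", "p", "u")
print(ccpdf_at_focus(focus, samples))
```

Note that this sketch estimates the joint d-dimensional density for the disagreeing features directly from the other samples, which corresponds to the single permutation known once the focus is received; the d = 1 sketch given earlier instead uses stored one dimensional marginals.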
Several extensions are possible whereby the multiple permutations are utilized when d > 1. One consideration is to average over the possible permutations,

Σ_{σ=1}^{(L choose d)} (1/n) Σ_{c=1}^{n} σp̂_d(x, x_c),    (4.16)

and then to take a weighted average over d, the dimensionality of the a priori estimate,

Σ_{d=1}^{L} a_d [ Σ_{σ=1}^{(L choose d)} (1/n) Σ_{c=1}^{n} σp̂_d(x, x_c) ],    (4.17)

where the a_d depend on the confidence in the estimation at dimensionality d. Further research is needed to investigate performance using such permutations.

5. NEURAL NET STRUCTURE FOR THE OUTCOME ADVISOR
5.1. System model for The OA

Continuing to consider computing the ccpdf for a specific category, a system model for The OA is shown in Fig. 2. Training examples X_n are taken from memory once a focus x is received, and difference vectors {a_c}_{c=1}^{n} are computed. Then the p.d.f.s p̂_d(a_c) are estimated as d increases in sequence starting from d = 1. For each c, the n - 1 training samples x_s, s ≠ c, are used for the estimation. The model shown in Fig. 2 is for probability densities of one specific dimensionality d and one specific permutation σ. To date, most of our experience(18,19,20) has been for the case d = 1, for which σ = 1.
5.2. Neural net model for The OA

Again the consideration is a single category r_i*, but we drop the index i for convenience. In this model, parallel processing at multiple nodes is utilized in place of requiring preprocessing. The model involves n² nodes, the horizontal direction corresponding to training samples x_s, s = 1, ..., n, and the vertical direction corresponding to candidate samples x_c, c = 1, 2, ..., n, to be brought to the focus x (the findings to be recognized). A typical node is shown in Fig. 3, where x_c is a candidate sample and x_s an estimation sample. Because of the nature of histograms, the computation in (4.16) is achieved piecemeal using the samples x_s, s = 1, 2, ..., n, which would have been used to construct the histogram. As shown in Fig. 3, first the components of x not in agreement with x_c are determined,† i.e. a_c, and this operation is represented by a processing unit M̄. Then each training example x_s, s = 1, 2, ..., n, is "matched" to a_c using the processing unit M. The histogram effect is obtained by proceeding horizontally in Fig. 3, summing up the matches for candidate sample x_c.

† An index c on a_c indicates that this is the disagreement vector of the cth candidate sample.
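As an illustration of the n × n array of Fig. 3, the same computation as equation (4.15) can be written so that node (c, s) reduces to a single match test. The vectorised sketch below is an interpretation of that structure, not the hardware design, and the names used are assumptions.

```python
# Sketch of the n-by-n processing-unit array of Fig. 3 as a matrix computation
# (a vectorised illustration, not the author's hardware design): node (c, s)
# outputs 1 when training sample x_s agrees with the focus on every feature of
# a_c, i.e. on every feature where the candidate x_c disagrees with the focus.
import numpy as np

def oa_node_array(focus: np.ndarray, samples: np.ndarray) -> float:
    """p_hat(x | category): mean over candidates of the row-wise relative
    frequency produced by the n x n array (the candidate itself excluded)."""
    n, L = samples.shape
    agree = samples == focus                 # (n, L): which features of each sample match x
    contrib = np.empty(n)
    for c in range(n):
        a_c = ~agree[c]                      # features where candidate c disagrees with x
        if not a_c.any():                    # d = 0: the candidate is at the focus
            contrib[c] = 1.0
            continue
        matches = agree[:, a_c].all(axis=1)  # node (c, s): match on all features in a_c
        matches[c] = False                   # exclude the candidate itself (X_{n-1})
        contrib[c] = matches.sum() / (n - 1)
    return contrib.mean()

samples = np.array([[0, 1, 2], [0, 3, 2], [4, 1, 5], [0, 1, 5]])
focus = np.array([0, 1, 2])
print(oa_node_array(focus, samples))
```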
Fig. 2. System model for The OA: once a focus x is received, the training records X_n are assembled as candidate samples x_c, c = 1, ..., n; difference vectors a_c are computed; and probability density estimates p̂_d(a_c) are formed at the focus, giving p(x | r_i*, X_n) = (1/n) Σ_{c=1}^{n} p̂_d(a_c) (equation 4.15 of dimensionality d). Most experience to date is with d = 1, for which the model reduces to one dimensional histograms.

Fig. 3. A typical node of the n × n neural net structure: the operation M̄ on x and the candidate sample x_c produces a_c; the operation M matches a_c against a training sample x_s; the training samples are used to bring each candidate sample to the focus x, with p̂_d(x, x_c) estimated at d = dim(a_c) and accumulated as (1/n) Σ_c p̂_d(x, x_c).
Fig. 4. Neural net implementation for the case where each feature's one dimensional histogram is computed a priori: the operation M̄ on the focus x and candidate sample x_c produces a_c; the operation M on a_c and the a priori computed probabilities p̂_{j,v} yields p̂_d(x, x_c), estimated at d = dim(a_c), where d can vary for different c.
The operation M compares a_c of dimension d with x_s, and the result is one if all dimensions agree, otherwise it is zero. Respective candidate samples x_c, c = 1, 2, ..., n, are brought to the focus as we proceed vertically in Fig. 3. The probabilistic relative frequencies of the n candidate samples then are brought to the focus.

A neural net implementation is shown in Fig. 4 for the case where each feature's histogram is constructed a priori. With reference to equation (4.14a), V · L probabilities are estimated a priori, i.e.

{{p̂_{j,v}}_{v=1}^{V}}_{j=1}^{L}.    (4.18a)

The M operation in Fig. 4, for candidate sample x_c, results in the product

Π_{a=1}^{d} p̂_{a(c)},    (4.18b)

where a(c) = 1, 2, ..., d indexes the d dimensions of a_c not in agreement with x_c, and the p̂_{a(c)} are the corresponding feature value probabilities. The operation M on a_c and x_s requires that all components of a_c match those of x_s if d is the dimensionality of a_c, or that only individual components of a_c match corresponding individual components of some x_s when d = 1.

5.3. Modification of the basic OA neural net structure

The following modifications of the basic structure shown in Fig. 2 are possible:

(1) Determine performance versus the number of samples n with d as a parameter.
(2) Determine performance when using multiple permutations σ when d > 1.
(3) In calculating p̂_d(x, x_c), use weights on various dimensions to reflect uncertainty in the dimension(s). These weights can be supplied a priori or learned from the samples as in Patrick and Fischer.(1,2)
(4) A candidate record x_c "far" from the focus can be eliminated from processing by an appropriate heuristic.
(5) In effect, multiple layers in a 3D structure could be used (layers of Fig. 3) for different values of d.

5.4. Alternative implementation

An alternative method for implementing equation (4.15), the estimate of a posteriori probability for a category, is with multiple OAs as shown in Fig. 5.
Fig. 5. Alternative implementation with multiple OAs: error vectors a_c = x - x_c are computed for the candidate samples; each of n OAs is trained with the error vectors {x - x_s}_{s=1}^{n} and receives a_c as its focus, producing p̂_d(a_c) and hence p(x | r_i*, X_n) = (1/n) Σ_{c=1}^{n} p̂_d(a_c).
First, the focus is received and "error vectors" a_c are computed for the candidate samples,

{a_c}_{c=1}^{n}.

Next consider n OAs as shown in Fig. 5, each trained with the same n training samples of error, {x - x_s}_{s=1}^{n}. Now let the inputs or foci for these OAs be the respective n error vectors for the candidate samples. In particular, error vector a_c is input as focus to OA_c. The dimensionality of a_c is d, and when a_c is input to OA_c, the (L - d) remaining features are treated as "missing features" and the OA estimates

p(a_c | {x_s}_{s=1}^{n})

for the category concerned, in the subspace of dimension d. This precisely is

p̂_d(x, x_c) = p̂_d(a_c), as in equation 4.15.

5.5. Neural net processing times

The basic processing unit involves two vector comparisons and accumulate, as shown in Fig. 6.

Fig. 6. Basic processing unit for The OA: the operation M̄ on the focus x and a candidate sample x_c produces a_c, involving L comparisons; the operation M on a_c and another training sample x_s also involves L comparisons; c = 1, 2, ..., n and s = 1, 2, ..., n; Ln² is the total number of comparisons with n training samples, and (3 × 10^6)·s is the number of comparisons in s seconds at a processor speed of 3 × 10^6 processings per second.

With a processor speed of 3 × 10^6 processings/s, considering L comparisons for an L dimensional vector, and that there are n candidate samples and (n - 1) training samples for each candidate sample, it follows that approximately

L n² = (3 × 10^6) s,    (5.1)

where s is in seconds. For example,

L = 20 dimensions
n = 1000 training records

takes about s = 6.6 seconds for one category. By having one processing board for each category of conflict, this becomes a reasonable computation for "near instantaneous" decisions.

When complexity is restricted to d = 1 (i.e. the system in Fig. 4), the neural network configuration of Fig. 2 simplifies, with the input from x_s, s = 1, 2, ..., n, removed and replaced by simple storage of L one dimensional histograms. Equation (5.1) is replaced approximately by

L n = (3 × 10^6) s.

For example,

L = 200 dimensions
n = 1000 training records

will take about s = 6.6 × 10^-3 seconds, or alternatively one neural net array processor can handle 1000 classifications in 1 second. For example, 10 categories could be processed for each of 10 separate tasks.
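A small arithmetic check of equation (5.1) under the stated processor speed follows; it is illustrative only, with the constant and the first example's values taken from the text.

```python
# Worked arithmetic for equation (5.1), assuming the stated processor speed of
# 3e6 comparisons per second: with L comparisons per vector, n candidate samples
# and about n training samples per candidate, the time for one category is
# s = L * n**2 / 3e6.
SPEED = 3e6                                 # comparisons per second

def seconds_per_category(L: int, n: int) -> float:
    return L * n ** 2 / SPEED               # equation (5.1): L * n^2 = SPEED * s

print(seconds_per_category(L=20, n=1000))   # ~6.7 s, matching the example in the text
```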
6. CONCLUSION

A new classification system called The Outcome Advisor® is developed as an outgrowth of the Generalized K-nearest Neighbor Decision Rule by Patrick and Fischer. Supervised training samples for each of M categories are stored awaiting a previously unseen set of findings, called the focus, which is to be classified into one of the M categories. Classification is achieved by estimating the probability density for each category at the focus. The estimate of probability density at the focus for a category is achieved with a new definition of relative frequency and a new definition of correlation. A training sample from a category can contribute "relative frequency"
at the focus even if it is not at the focus. The new relative frequency is an estimate of the probability that a candidate sample could have been at the focus. Training samples other than the candidate sample are used for this estimation. As a classification system, any feature can be used to define a set of categories, and as an inference system an almost unlimited number of pairs of conditions and outcomes can be defined.

The OA can be implemented using conventional algorithmic programming or as a new neural net structure. The neural net structure involves n² hardware processing units, where n is the number of training samples for the category. An alternative neural net implementation is presented with n hardware processing units, but preprocessing and storage of probability estimates is required. A third implementation is in terms of n processing units which themselves are OAs.

SUMMARY
A new classification system called The Outcome Advisor® is developed as an outgrowth of the Generalized K-nearest Neighbor Decision Rule by Patrick and Fischer. Supervised training records for each of M categories are stored awaiting a previously unseen set of findings called the focus. Classification is achieved by estimating the probability density for each category at the focus. The estimate of probability density at the focus for a category is achieved with a new definition of relative frequency and a new definition of correlation. A training sample from a category can contribute relative frequency at the focus even if it is not at the focus. The new relative frequency is an estimate of the probability that a candidate sample could have been at the focus. Training samples other than the candidate sample are used for this estimation. The OA considers every training example as a candidate to contribute "relative frequency" to the focus for that category, not just the training example "nearest" to the focus in a matched filter sense. Also, in The OA, samples other than a candidate sample are used to estimate the candidate's relative frequency contribution to the focus. By direct estimation of category probability densities, the convergence problems of decision boundary estimation in conventional neural nets are avoided.

The OA is implemented using conventional algorithmic programming or as a new neural net structure. The neural net structure involves n² hardware processing units, where n is the number of training samples for a category. Each specialized processing unit involves computing an error vector using the findings and a candidate training sample; the error vector is then projected onto a different training sample. The accumulation of processing unit outputs over all other training samples constitutes the candidate sample's contribution to relative frequency. Examples are stored in The OA, whereas parameters are stored in conventional neural nets. On the other hand, The OA can be viewed as defining parameters to estimate once the focus is defined by the findings. An alternative neural net structure is provided with n hardware processing units but requires some preprocessing of training samples. A third implementation is presented where there are n processing units which themselves are OAs.

As a classification system, any feature can be used as a category feature, whereby the values of that feature constitute a set of categories. As an inference system, an almost unlimited number of inferences can be made because condition and outcome events can be defined as logical operations on features and feature values. There are two modes of operation: a classical mode where conventional central limit theorem statistics apply, and an "OA mode" which utilizes the new definition of relative frequency.
REFERENCES
1. E. A. Patrick, Fundamentals of Pattern Recognition. Prentice Hall, Englewood Cliffs, NJ (1972).
2. E. A. Patrick and F. P. Fischer, Generalized k nearest neighbor decision rule, Inform. Control 16, 128-152 (April 1970).
3. E. A. Patrick and J. M. Fattu, Artificial Intelligence with Statistical Pattern Recognition. Prentice Hall, Englewood Cliffs, NJ (1986).
4. J. Pearl, Heuristics. Addison-Wesley, Reading, MA (1985).
5. F. Rosenblatt, Principles of Neurodynamics. Spartan Books, New York (1959).
6. M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA (1969).
7. B. Widrow, N. K. Gupta and S. Maitra, Punish/reward: learning with a critic in adaptive threshold systems, IEEE Trans. Syst. Man Cybern. SMC-3 (September 1973).
8. S. Grossberg, Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, I, J. Math. Mech. 19, 53-91 (1969).
9. T. Kohonen, G. Barna and R. Chrisley, Statistical pattern recognition with neural networks: benchmarking studies, IEEE International Conference on Neural Networks, pp. I-61-I-68, IEEE Catalog Number 88CH2632-8 (July 1988).
10. E. A. Patrick, Decision Analysis in Medicine. CRC Press, Cleveland, OH (1979).
11. J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci. USA 79, 2554-2558 (April 1982).
12. J. J. Hopfield and D. W. Tank, Computing with neural circuits: a model, Science 233, 625-633 (August 1986).
13. R. Hecht-Nielsen, Nearest matched filter classification of spatiotemporal patterns, Appl. Optics 26 (15 May 1987).
14. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inform. Theory IT-13, 21-27 (January 1967).
15. D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume I, Foundations. MIT Press, Cambridge, MA (1986).
16. G. G. Lorentz, The 13th problem of Hilbert, in F. E. Browder (Ed.), Mathematical Developments Arising from Hilbert Problems. American Mathematical Society, Providence, RI (1976).
17. E. A. Patrick, D. R. Anderson and F. K. Bechtel, Mapping multidimensional space to one dimension for computer output display, IEEE Trans. Comput. C-17, 949-953 (October 1968).
18. E. A. Patrick, M. Moskowitz, E. I. Gruenstein and V. T. Mansukhani, An Outcome Advisor® expert system network for diagnosis of breast cancer, University of Cincinnati Electrical and Computer Engineering Technical Report, T.D. 110/10/87/ECE (October 1987).
19. J. M. Fattu, R. J. Blomberg and E. A. Patrick, Consult Learning System applied to early diagnosis of chest pain, Proc. SCAMC 1987, pp. 171-177, IEEE Computer Society Order No. 812, Library of Congress Number 79-641114 (1987).
20. E. A. Patrick, The Outcome Advisor®: selected expert systems from ECE 638, Winter 1987, University of Cincinnati Electrical and Computer Engineering, TR 102/04/87/ECE (April 1987).
21. E. A. Patrick and F. K. Bechtel, A nonparametric recognition procedure with storage constraint, Purdue University School of Electrical Engineering, TR-EE 69-24 (August 1969).
22. G. R. Cooper and C. D. McGillem, Probabilistic Methods of Signal and Systems Analysis. Holt, Rinehart and Winston, New York (1971).
About the Author--EDWARD A. PATRICK received the B.S. and M.S. degrees from MIT in Electrical Engineering, the Ph.D. degree in Electrical Engineering from Purdue University, and the M.D. degree from Indiana University School of Medicine. Joining the faculty in E.E. at Purdue University in 1966 as Assistant Professor, he rose to Full Professor of Electrical Engineering at Purdue in 1974 and held a simultaneous appointment as Associate Professor, Indiana University School of Medicine. Subsequently he was co-founder of the Institute of Engineering and Medicine, University of Cincinnati, and Research Professor of Electrical & Computer Engineering, University of Cincinnati. He currently is president of Patrick Consult Inc., Cincinnati, Ohio 45204, Director of Informatics, Deaconess Hospital, Cincinnati, Ohio, and Co-Director of the Heimlich Institute, Cincinnati, Ohio. Dr Patrick is a Diplomate of the American Board of Emergency Medicine, Fellow of the American College of Emergency Physicians, and past president of the IEEE Systems, Man & Cybernetics Society. He is author of Fundamentals of Pattern Recognition (Prentice Hall) and Decision Analysis in Medicine (CRC Press), and coauthor (with James Fattu) of Artificial Intelligence with Statistical Pattern Recognition (Prentice Hall). Dr Patrick (with Dr Henry Heimlich) has worked over the last 14 years to develop the Heimlich maneuver as the treatment for choking, replacing back slaps, and as the first treatment for drowning.