Pattern Recognition Letters 1 (1983) 305-310
North-Holland
July 1983

Decision rules for a hierarchical classifier

Marek W. KURZYŃSKI
Institute of Control and Systems Engineering, Technical University of Wrocław, Wrocław, Poland

Received 25 January 1983
Abstract: Two hierarchical classifier strategies, which minimize the global and local probabilities of misclassification respectively, are presented. A modified version of the k-NN rule for a hierarchical classifier is proposed. Numerical examples are given.
Key words: Hierarchical classifier, probability of misclassification, decision rules, k-nearest neighbor method.
1. Introduction
It has been reported by Swain and Hauska (1977), Kulkarni (1978) and Mui and Fu (1980) that multistage recognition systems based on a decision tree scheme have many remarkable advantages in contrast to the conventional single-stage classifier. This is why this attractive and efficient approach has often been employed in many practical problems (see e.g. Bartolucci et al. (1976), Mui and Fu (1980), Sethi and Chatterjee (1977) and Kurzyński (1982)). A decision tree consists of a root-node, a number of nonterminal nodes and a number of terminal nodes. Associated with the root-node is the entire set of classes into which a pattern may be classified. A nonterminal node is an intermediate decision and its immediate descendant nodes represent the outcomes of that decision. A terminal node corresponds to a terminal decision, i.e. the decision-making procedure terminates and the unknown pattern being classified is assigned to the class associated with that node. Thus, in a decision tree or hierarchical classifier the unknown pattern to be recognized undergoes a sequence of decision rules on the path from the root-node to a terminal node. This terminal node represents the final
classification and its label indicates the class to which the pattern is assigned.
The hierarchical classifier synthesis problem, nontrivial even under restrictive assumptions, is decomposed into the following three components (Kanal (1977), Mui and Fu (1980)):
(1) the choice of a tree skeleton, i.e. the hierarchical ordering of the pattern classes,
(2) the choice of features to be used at each nonterminal node,
(3) the decision rules (strategy) for performing the classification at each nonterminal node.
In this paper we focus our attention on the third item, assuming that both the tree structure and the features used at each nonterminal node are given. In Section 3 we consider the strategy for a hierarchical classifier that is optimal with respect to the overall probability of misclassification. This so-called globally optimal strategy is compared with the locally optimal one, in which the decision rules minimize the local error probabilities associated with the nonterminal nodes of the tree. In Section 4 the concept of strategies with learning for a hierarchical classifier is discussed and the application of this idea is illustrated by a modified version of the k-nearest neighbor rule. This decision rule was trained and tested on artificial data and numerical results are presented.
2. Preliminaries and notation

The following notation is used throughout the paper:

$x \in \mathcal{X} \subseteq \mathbb{R}^n$ - the feature vector of the recognized pattern, an observed value of a continuous random variable $X$.
$M$ - the number of pattern classes ($M \ge 2$).
$\mathcal{D}^{(i)} = \{d_1^{(i)}, d_2^{(i)}, \ldots, d_{M_i}^{(i)}\}$ - the set of node labels at the $i$-th decision tree level (the tree levels are numbered from the terminal level to the root-node), $i = 0, 1, \ldots, m$, with $M_0 = M$, $M_m = 1$, $M_i \ge M_{i+1}$; $\mathcal{D}^{(0)}$ represents the set of class labels.
$w_j^{(i)}$ - the set of nodes belonging to the subtree with $d_j^{(i)}$ as the root-node.
$d^{(i)}$ - a discrete random variable taking values in the set $\mathcal{D}^{(i)}$, $i = 0, 1, \ldots, m$; $d^{(0)} = d_j^{(0)}$ iff the pattern given by the feature vector $x$ belongs to the class labeled $d_j^{(0)}$; $d^{(i)} = d_k^{(i)}$ iff $d^{(0)} = d_j^{(0)}$ and $d_j^{(0)} \in w_k^{(i)}$, $i = 1, 2, \ldots, m$.
$p_j^{(i)} = P(d^{(i)} = d_j^{(i)})$; in particular $p_j^{(0)}$ denotes the a priori probability of the class $d_j^{(0)}$.
$f_j^{(i)}(x) = f(x \mid d_j^{(i)})$ - the conditional probability density function of $x$ given that $d^{(i)} = d_j^{(i)}$, $i = 0, 1, \ldots, m$, $j = 1, 2, \ldots, M_i$.
$A_i$ - the event that a correct classification is made at the $i$-th level ($i = 1, 2, \ldots, m$).
$B_i$ - the event that an error is made at the $i$-th level ($i = 1, 2, \ldots, m$).

Let us consider an $M$-class recognition problem and organize the pattern classes into a tree under the following two constraints on its structure:
(A) Each path from the root-node to a terminal node has the same length, equal to $m$.
(B) No pattern class occurs at more than one terminal node.

As follows from the previously presented mechanics of a hierarchical classifier, each nonterminal node is associated with a decision rule for performing the classification and with a subset of features to be used. Let

$$x_j^{(i)} \in \mathcal{X}_j^{(i)} \subseteq \mathbb{R}^{n_j^{(i)}}, \quad n_j^{(i)} \le n, \tag{1}$$

denote the subset of features selected from the entire vector $x$ and used at the node $d_j^{(i)}$, and let

$$\Psi_j^{(i)}\colon \mathcal{X}_j^{(i)} \to \{d_k^{(i-1)} : d_k^{(i-1)} \in w_j^{(i)}\}, \quad i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i, \tag{2}$$

be the decision rule used at that node, which maps the observation subspace onto the set of immediate descendant nodes of the node $d_j^{(i)}$. In the next sections we consider different strategies $\pi$ of a tree classifier, i.e. sets of classifying rules at the particular nodes,

$$\pi = \{\Psi_j^{(i)} : i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i\}, \tag{3}$$

under the assumption that both the tree skeleton and the features (1) are specified.

3. Complete probabilistic information

Let us first consider the case of complete probabilistic information. This means that the a priori probabilities $p_j^{(0)}$ and the conditional density functions $f_j^{(0)}(x)$ are given for every class ($j = 1, 2, \ldots, M$).

3.1. Globally and locally optimal strategies

Let us introduce as the total performance measure of the hierarchical classifier the overall probability of error:

$$P_e(\pi) = 1 - P_\pi(A_m, A_{m-1}, \ldots, A_1) = P_\pi(B_m) + P_\pi(A_m, B_{m-1}) + \cdots + P_\pi(A_m, A_{m-1}, \ldots, A_2, B_1), \tag{4}$$

where $P_\pi(\cdot)$ denotes the probability of the respective events under strategy $\pi$. In Kurzyński (1983) the globally optimal strategy

$$\pi^* = \{\Psi_j^{*(i)} : i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i\}$$

has been derived, which minimizes $P_e(\pi)$. Its decision rules are the following:

$$\Psi_j^{*(i)}(x_j^{(i)}) = d_k^{(i-1)} \quad \text{if} \quad p_k^{(i-1)} f_k^{(i-1)}(x_j^{(i)})\, P^*(A_{i-1}, \ldots, A_1 \mid A_i, d_k^{(i-1)}) = \max_{t:\ d_t^{(i-1)} \in w_j^{(i)}} p_t^{(i-1)} f_t^{(i-1)}(x_j^{(i)})\, P^*(A_{i-1}, \ldots, A_1 \mid A_i, d_t^{(i-1)}),$$
$$i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i, \tag{5}$$

where $P^*(\cdot) = P_{\pi^*}(\cdot)$ and, by convention, $P^*(\cdot) = 1$ for $i = 1$,
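To make rule (5) concrete, the following sketch (Python with NumPy, not taken from the paper) applies it at a single nonterminal node. The Gaussian form of the densities, the node labels and all parameter values are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # 1-D Gaussian density; the class-conditional densities f_k^(i-1) are taken
    # to be Gaussian here purely for illustration.
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def globally_optimal_decision(x_node, descendants):
    """Rule (5) applied at one nonterminal node.

    `descendants` maps each immediate descendant label d_k^(i-1) to a dict with
      'prior'       : p_k^(i-1),
      'mean', 'var' : parameters of f_k^(i-1) (illustrative Gaussian model),
      'p_star'      : P*(A_{i-1},...,A_1 | A_i, d_k^(i-1)), the probability of
                      correct classification at all subsequent levels of that
                      subtree (equal to 1 at the terminal level, i = 1).
    Returns the label maximizing p_k^(i-1) * f_k^(i-1)(x) * P*(. | A_i, d_k^(i-1)).
    """
    scores = {label: d['prior'] * gaussian_pdf(x_node, d['mean'], d['var']) * d['p_star']
              for label, d in descendants.items()}
    return max(scores, key=scores.get)

# Hypothetical root node of a two-level tree; the two descendants are level-1 nodes.
root_descendants = {
    'd1_1': {'prior': 0.5, 'mean': 0.5, 'var': 2.0, 'p_star': 0.95},
    'd2_1': {'prior': 0.5, 'mean': 2.5, 'var': 2.0, 'p_star': 0.70},
}
print(globally_optimal_decision(1.6, root_descendants))
```

With all `p_star` values set to 1 the same function implements the locally optimal maximum a posteriori rule (7) discussed below; in the numerical example above the two rules actually select different descendants.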
where $d_j^{(i)}$ as an argument of $P$ denotes the event $d^{(i)} = d_j^{(i)}$.

It should be noted here that the classification rules of the globally optimal strategy emphasize the decision which leads to a greater joint probability of correct classification at the next levels. This results from the fact that the current decision is not independent but forms a step in a multistage decision process.

Another hierarchical classifier strategy,

$$\tilde{\pi} = \{\tilde{\Psi}_j^{(i)} : i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i\},$$

can be considered which does not take into account the fact that the decisions form a multistage procedure. The decision rules of $\tilde{\pi}$ are mutually independent and individually optimal with respect to the local criteria

$$Q_j^{(i)} = P_{\tilde{\pi}}(B_i \mid A_{i+1}, d_j^{(i)}), \quad i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i, \tag{6}$$

which denote the probabilities of misclassification at the particular nodes of the tree. It can readily be shown that the decision rules $\tilde{\Psi}_j^{(i)}$ reduce to the well-known maximum a posteriori probability rules, namely

$$\tilde{\Psi}_j^{(i)}(x_j^{(i)}) = d_k^{(i-1)} \quad \text{if} \quad p_k^{(i-1)} f_k^{(i-1)}(x_j^{(i)}) = \max_{t:\ d_t^{(i-1)} \in w_j^{(i)}} p_t^{(i-1)} f_t^{(i-1)}(x_j^{(i)}). \tag{7}$$

Comparing (5) and (7) it is worth noting that by using the locally optimal strategy $\tilde{\pi}$ we reduce the computational complexity (computer time used for the classification) at the sacrifice of classification accuracy, because minimization of the error probability at each individual node of a decision tree does not necessarily lead to the overall (globally) optimal hierarchical recognition system.

3.2. Comparison of error probabilities

The following lemma gives an upper bound on the difference between the probabilities of misclassification for the locally optimal strategy and the globally optimal one.

Lemma 1. For $m = 2$ the following inequality holds:

$$P_e(\tilde{\pi}) - P_e(\pi^*) \le P_{\tilde{\pi}}(B_2)\Bigl[\max_i P_{\tilde{\pi}}(B_1 \mid A_2, d_i^{(1)}) - \min_i P_{\tilde{\pi}}(B_1 \mid A_2, d_i^{(1)})\Bigr].$$

Proof. To simplify notation let $P(\cdot) = P_{\tilde{\pi}}(\cdot)$ and $P^*(\cdot) = P_{\pi^*}(\cdot)$. First notice that

$$P(B_1 \mid A_2, d_i^{(1)}) = P^*(B_1 \mid A_2, d_i^{(1)}), \tag{8}$$

since both strategies use the same (maximum a posteriori) rules at the terminal level, and

$$P(B_2) \le P^*(B_2), \tag{9}$$

since the rule of $\tilde{\pi}$ at the root-node minimizes the local error probability there. Now we have

$$P_e - P_e^* = P(B_2) + P(B_1, A_2) - P^*(B_1, A_2) - P^*(B_2)$$
$$= \sum_i \bigl[P(B_2 \mid d_i^{(1)})\, p_i^{(1)} + P(B_1 \mid A_2, d_i^{(1)})\, P(A_2 \mid d_i^{(1)})\, p_i^{(1)}\bigr] - \sum_i \bigl[P^*(B_2 \mid d_i^{(1)})\, p_i^{(1)} + P^*(B_1 \mid A_2, d_i^{(1)})\, P^*(A_2 \mid d_i^{(1)})\, p_i^{(1)}\bigr]$$
$$= \sum_i \Bigl\{ p_i^{(1)} P(B_1 \mid A_2, d_i^{(1)}) \bigl[P(A_2 \mid d_i^{(1)}) + P(B_2 \mid d_i^{(1)})\bigr] + p_i^{(1)} P(B_2 \mid d_i^{(1)}) \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr] \Bigr\}$$
$$\quad - \sum_i \Bigl\{ p_i^{(1)} P^*(B_1 \mid A_2, d_i^{(1)}) \bigl[P^*(A_2 \mid d_i^{(1)}) + P^*(B_2 \mid d_i^{(1)})\bigr] + p_i^{(1)} P^*(B_2 \mid d_i^{(1)}) \bigl[1 - P^*(B_1 \mid A_2, d_i^{(1)})\bigr] \Bigr\}$$
$$\stackrel{(8)}{=} \sum_i \Bigl\{ p_i^{(1)} P(B_2 \mid d_i^{(1)}) \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr] - p_i^{(1)} P^*(B_2 \mid d_i^{(1)}) \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr] \Bigr\}$$
$$\le P(B_2) \max_i \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr] - P^*(B_2) \min_i \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr]$$
$$\stackrel{(9)}{\le} P(B_2) \Bigl\{ \max_i \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr] - \min_i \bigl[1 - P(B_1 \mid A_2, d_i^{(1)})\bigr] \Bigr\}$$
$$= P(B_2) \Bigl[\max_i P(B_1 \mid A_2, d_i^{(1)}) - \min_i P(B_1 \mid A_2, d_i^{(1)})\Bigr]. \qquad \Box$$
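The bound of Lemma 1 can be probed numerically. The sketch below (Python with NumPy, not taken from the paper) builds an arbitrary two-level tree over four one-dimensional Gaussian classes and estimates $P_e(\tilde{\pi})$, $P_e(\pi^*)$ and the right-hand side of the lemma by Monte Carlo; the estimate of $P^*(A_1 \mid A_2, \cdot)$ ignores the conditioning on $A_2$, so the printed figures are a rough sanity check under made-up parameters rather than an exact verification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-level tree (m = 2): the root separates class groups {0, 1} and
# {2, 3}; each level-1 node then picks one class of its group.  A single scalar
# feature is used at every node; all parameter values are arbitrary.
MEANS, SIGMA, PRIORS = np.array([0.0, 1.0, 3.0, 6.0]), 1.0, np.full(4, 0.25)
GROUPS = {0: [0, 1], 1: [2, 3]}

def pdf(x, c):
    return np.exp(-0.5 * ((x - MEANS[c]) / SIGMA) ** 2) / (SIGMA * np.sqrt(2 * np.pi))

def level1(x, g):
    # Terminal-level MAP rule; identical for both strategies (hence equality (8)).
    cls = GROUPS[g]
    return cls[int(np.argmax([PRIORS[c] * pdf(x, c) for c in cls]))]

def root(x, w):
    # Root rule: w = (1, 1) gives the locally optimal MAP rule (7),
    # w = P*(A_1 | A_2, group) gives the globally optimal rule (5).
    return int(np.argmax([sum(PRIORS[c] * pdf(x, c) for c in GROUPS[g]) * w[g] for g in (0, 1)]))

def simulate(w, n=50_000):
    cls = rng.choice(4, size=n, p=PRIORS)
    grp = (cls >= 2).astype(int)
    x = rng.normal(MEANS[cls], SIGMA)
    r = np.array([root(xi, w) for xi in x])
    final = np.array([level1(xi, g) for xi, g in zip(x, r)])
    ok2 = r == grp                                    # event A_2 (correct root decision)
    p_b1_a2 = [np.mean(final[ok2 & (grp == g)] != cls[ok2 & (grp == g)]) for g in (0, 1)]
    return np.mean(final != cls), np.mean(~ok2), p_b1_a2

# Crude estimate of P*(A_1 | A_2, group): level-1 accuracy within each group
# (the conditioning on A_2 is ignored here for brevity).
p_star = []
for g in (0, 1):
    c = rng.choice(GROUPS[g], size=50_000)
    xs = rng.normal(MEANS[c], SIGMA)
    p_star.append(np.mean([level1(xi, g) == ci for xi, ci in zip(xs, c)]))

pe_local, p_b2, p_b1 = simulate((1.0, 1.0))           # strategy pi~ (rule (7) at the root)
pe_global, _, _ = simulate(tuple(p_star))             # strategy pi* (rule (5) at the root)
bound = p_b2 * (max(p_b1) - min(p_b1))
print(f"Pe(local) - Pe(global) = {pe_local - pe_global:.4f}  <=  bound = {bound:.4f}")
```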
4. Hierarchical classifier strategies with learning

In the real world there is often a lack of exact knowledge of the a priori probabilities $p_j^{(0)}$ and the conditional density functions $f_j^{(0)}(x)$. For instance, there are situations in which only a learning sequence, that is, a set of correctly classified samples, is known. In this case one obvious and conceptually simple method is to estimate the appropriate probabilities and conditional densities from the training set and then to use these estimates to calculate the discriminant functions (5) as though they were correct. As an example of this idea let us consider the k-nearest neighbor (k-NN) decision rule.

4.1. k-Nearest neighbor strategy for the hierarchical classifier

The k-NN classifying algorithm can be viewed as a Bayesian decision rule with the conditional density functions estimated by (Fukunaga (1972))

$$f(x \mid i) \approx \frac{k_i}{N_i V}, \tag{10}$$

where $N_i$ is the sample size and $k_i$ is the number of neighbors from the $i$-th class contained in a minimum volume $V$ containing $k$ neighbors. Using this estimation method and, moreover, replacing the true probabilities in the discriminant functions (5) by their approximate values, we get the following decision rules for the hierarchical classifier:

$$\hat{\Psi}_j^{(i)}(x_j^{(i)}) = d_l^{(i-1)} \quad \text{if} \quad \frac{N_l^{(i-1)}}{N_j^{(i)}} \cdot \frac{k_l^{(i-1)}}{N_l^{(i-1)} V_j^{(i)}}\, \hat{P}(A_{i-1}, \ldots, A_1 \mid A_i, d_l^{(i-1)}) = \max_{s:\ d_s^{(i-1)} \in w_j^{(i)}} \frac{N_s^{(i-1)}}{N_j^{(i)}} \cdot \frac{k_s^{(i-1)}}{N_s^{(i-1)} V_j^{(i)}}\, \hat{P}(A_{i-1}, \ldots, A_1 \mid A_i, d_s^{(i-1)}),$$
$$i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, M_i, \tag{11}$$

where $N_l^{(i-1)}$ is the sample size of the group of classes $d_l^{(i-1)}$,

$$N_j^{(i)} = \sum_{t:\ d_t^{(i-1)} \in w_j^{(i)}} N_t^{(i-1)},$$

$k_l^{(i-1)}$ is the number of neighbors from the group of classes $d_l^{(i-1)}$ contained in a minimum volume $V_j^{(i)}$ containing $k_j^{(i)}$ neighbors of $x_j^{(i)}$, and $\hat{P}(A_{i-1}, \ldots, A_1 \mid A_i, d_l^{(i-1)})$ denotes the empirical joint probability of correct classification at the next stages for the subtree generated by the node $d_l^{(i-1)}$.

Dropping the terms independent of the class labels we get the final form of the modified version of the k-nearest neighbor (k-NNM) rule for the hierarchical classifier:

$$\hat{\Psi}_j^{(i)}(x_j^{(i)}) = d_l^{(i-1)} \quad \text{if} \quad k_l^{(i-1)}\, \hat{P}(A_{i-1}, \ldots, A_1 \mid A_i, d_l^{(i-1)}) = \max_{s:\ d_s^{(i-1)} \in w_j^{(i)}} k_s^{(i-1)}\, \hat{P}(A_{i-1}, \ldots, A_1 \mid A_i, d_s^{(i-1)}). \tag{12}$$

The decision rules (12) and the empirical probabilities $\hat{P}(\cdot)$ can be calculated alternately, beginning from the terminal level of the tree, viz.

$$\{\hat{\Psi}^{(1)}\} \to \{\hat{P}(A_1 \mid A_2, d^{(1)})\} \to \{\hat{\Psi}^{(2)}\} \to \cdots \to \{\hat{P}(A_{m-1}, \ldots, A_1 \mid A_m, d^{(m-1)})\} \to \hat{\Psi}^{(m)}.$$
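A direct way to apply rule (12) at one nonterminal node is sketched below (Python, illustrative only); the grouping of the training samples by descendant node and the empirical probabilities $\hat{P}$ are assumed to be available, and all names and data in the usage example are made up.

```python
import numpy as np

def knnm_decision(x_node, train_x, train_group, k, p_hat):
    """Modified k-NN rule (12) at a single nonterminal node (illustrative sketch).

    train_x     : (N, d) array of training feature vectors observed at this node,
    train_group : length-N array giving, for each training sample, the immediate
                  descendant node d_s^(i-1) whose subtree contains its class,
    k           : number of neighbours k_j^(i) examined at this node,
    p_hat       : dict mapping each descendant label to the empirical probability
                  P^(A_{i-1},...,A_1 | A_i, d_s^(i-1)) of correct classification
                  at the subsequent levels (taken as 1 at the terminal level).
    Returns the descendant label maximizing k_s^(i-1) * P^(... | A_i, d_s^(i-1)).
    """
    dist = np.linalg.norm(train_x - x_node, axis=1)
    nearest = train_group[np.argsort(dist)[:k]]
    scores = {g: np.count_nonzero(nearest == g) * p for g, p in p_hat.items()}
    return max(scores, key=scores.get)

# Hypothetical usage at the root of a two-level tree with descendant nodes 'd1', 'd2';
# the training data and the P^ values are made up for the example.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
groups = np.where(X[:, 0] < 0.0, 'd1', 'd2')
print(knnm_decision(np.array([0.2]), X, groups, k=14, p_hat={'d1': 0.9, 'd2': 0.6}))
```

In the experiments reported below, $k$ is taken as the integer part of $\sqrt{N_{TR}}$ (cf. Figure 2).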
4.2. Experimental results

In order to study the performance of the k-NNM strategy some computer simulations were made for a four-class problem with equal a priori probabilities. The pattern classes were organized into the binary decision tree illustrated in Figure 1.

Fig. 1. Decision tree of example.
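For concreteness, the tree of Figure 1 might be encoded as a small node table; the class grouping and the per-node feature assignment below follow the description given in the next paragraph and are assumptions of this sketch, not the authors' code.

```python
# Hypothetical encoding of the tree of Fig. 1: the root d_1^(2) works on feature x1
# and separates the class groups {1, 2} and {3, 4}; the level-1 nodes d_1^(1) and
# d_2^(1) then separate the individual classes using x2 and x3, respectively.
# The exact class grouping is inferred from the experiment design, not stated
# explicitly in the text.
TREE = {
    'd_1^(2)': {'feature': 'x1', 'children': ('d_1^(1)', 'd_2^(1)')},
    'd_1^(1)': {'feature': 'x2', 'children': (1, 2)},   # terminal decisions: classes 1, 2
    'd_2^(1)': {'feature': 'x3', 'children': (3, 4)},   # terminal decisions: classes 3, 4
}
```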
The data for the experiments were computer-generated 3-dimensional random variables $x = (x_1, x_2, x_3)$. For performing the classification at the root-node $d_1^{(2)}$ the first coordinate was used; the components $x_2$ and $x_3$ were used at the nodes $d_1^{(1)}$ and $d_2^{(1)}$, respectively. Two strategies with learning for a tree classifier were tested and compared with respect to classification accuracy:
- classical k-NN rules at each nonterminal node,
- the strategy with k-NNM decision rules (in practice, in the considered example the modified version of the k-NN rule was used only at the root-node).
In order to estimate $\hat{P}(\cdot)$ in (12) from the training set, the so-called 'leave-one-out' method (Toussaint (1974)) was used.

Experiment 1. The data for the first experiment were Gaussian random variables with covariance matrices equal for every class,

$$\Sigma_i = 2 \cdot I, \quad i = 1, 2, 3, 4,$$

and with the following expected values:

$$\mu_1 = (0, 0, 0), \quad \mu_2 = (1, 0.1, 0), \quad \mu_3 = (2, 0, 0), \quad \mu_4 = (3, 0, 5).$$
The sizes of the training sets were $N_{TR} = 100, 150, 200, 250, 300, 350, 400, 450, 500$, and the size of the testing set was $N_{TS} = 500$.

Experiment 2. The computer-generated data were as in Experiment 1, with the following expected values:

$$\mu_1 = (0, 0, 0), \quad \mu_2 = (0.1, 4, 4), \quad \mu_3 = (4, 4, 4), \quad \mu_4 = (4.1, 8, 8).$$

The sizes of the training and testing sets were as before.

Figure 2 summarizes the results of both experiments. It can be noted that the use of the k-NNM rules in the hierarchical classifier leads to higher classification accuracy in comparison with the original form of the k-NN rule. This effect clearly appears in the first experiment. For the data of the second experiment the upper bound in Lemma 1 is equal to zero and hence the difference between the classification accuracies of the two strategies under question is insignificant.
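The following sketch (again illustrative Python, not the authors' program) shows how the data of Experiment 1 could be generated and how the classical k-NN tree strategy might be scored on a holdout test set; the class-to-subtree assignment and the omission of the leave-one-out estimation of $\hat{P}$ are simplifying assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Experiment 1 data: four Gaussian classes with covariance 2*I and the means listed
# above.  Classes 1, 2 are assumed to share the left level-1 node and classes 3, 4
# the right one (an assumption read off the experiment design).
MEANS = np.array([[0, 0, 0], [1, 0.1, 0], [2, 0, 0], [3, 0, 5]], dtype=float)
COV = 2.0 * np.eye(3)

def sample(n_per_class):
    X = np.vstack([rng.multivariate_normal(m, COV, n_per_class) for m in MEANS])
    y = np.repeat(np.arange(4), n_per_class)      # classes 0..3 stand for classes 1..4
    return X, y

def knn_label(x, X_tr, labels, feature, k):
    # Plain k-NN vote on one feature component at a single node.
    near = labels[np.argsort(np.abs(X_tr[:, feature] - x[feature]))[:k]]
    vals, counts = np.unique(near, return_counts=True)
    return vals[np.argmax(counts)]

def classify(x, X_tr, y_tr, k, root_weights=(1.0, 1.0)):
    # Root on x1 chooses the class group; the level-1 node then picks the class on
    # x2 or x3.  root_weights = P^ estimates turn the root vote into the k-NNM rule
    # (12); (1, 1) gives the classical k-NN strategy.
    grp_tr = (y_tr >= 2).astype(int)
    near = grp_tr[np.argsort(np.abs(X_tr[:, 0] - x[0]))[:k]]
    votes = np.array([np.sum(near == 0), np.sum(near == 1)]) * np.asarray(root_weights)
    g = int(np.argmax(votes))
    mask = grp_tr == g
    return knn_label(x, X_tr[mask], y_tr[mask], feature=1 if g == 0 else 2, k=k)

X_tr, y_tr = sample(100)                 # 400 training samples in total (illustrative)
X_te, y_te = sample(125)                 # 500 test samples, as in the paper
k = int(np.sqrt(len(y_tr)))              # k = entier sqrt(N_TR), as in Figure 2
pred = np.array([classify(x, X_tr, y_tr, k) for x in X_te])
print("empirical Pe of the classical k-NN tree strategy:", np.mean(pred != y_te))
```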
Fig. 2. Comparison of the empirical probability of misclassification ($P_e$) versus the training set size ($N_{TR}$) for the k-NN and k-NNM decision rules for the hierarchical classifier ($k = \mathrm{entier}\,\sqrt{N_{TR}}$).

5. Conclusion

In the available literature dealing with hierarchical recognition systems, several systematic and ad hoc procedures have been proposed for decision tree design and feature selection (e.g. Kanal (1977), Mui and Fu (1980)). In the present paper we focus our attention on the decision rules for performing the classification at the interior nodes of a tree, assuming that both the tree structure and the feature subsets are given. For the case of complete probabilistic information this paper is a sequel to the article by Kurzyński (1983) and it yields a comparative study of the globally and locally best strategies. Furthermore, a k-nearest neighbor strategy for the hierarchical classifier is derived and tested on artificial data. This strategy has already been employed in practice for the classification of surgical abdomen states (Kurzyński and Wilimowski (1982)) and has proved useful in computer-aided diagnosis.
References

Bartolucci, L. and P.H. Swain (1976). Selective radiant temperature mapping using a layered classifier. IEEE Trans. Geosci. Electron. 14, 101-112.
Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition. Academic Press, New York and London.
Kanal, L.N. (1977). On hierarchical classifier theory and interactive design. In: P.R. Krishnaiah, ed., Applications of Statistics. North-Holland, Amsterdam, pp. 301-322.
Kulkarni, A.V. (1978). On the mean accuracy of hierarchical classifiers. IEEE Trans. Comput. 27, 771-776.
Kurzyński, M.W. and M. Wilimowski (1982). A single-stage versus multistage classification of surgical abdomen diseases. Proc. 6th Int. Conf. on Pattern Recognition, Munich, p. 1220.
Kurzyński, M.W. (1983). The optimal strategy of a tree classifier. Pattern Recognition 16 (to appear).
Mui, J.K. and K.S. Fu (1980). Automated classification of nucleated blood cells using a binary tree classifier. IEEE Trans. Pattern Anal. Mach. Intell. 2, 429-443.
Sethi, I.K. and B. Chatterjee (1977). Efficient decision tree design for discrete variable pattern recognition problems. Pattern Recognition 10, 197-206.
Swain, P.H. and H. Hauska (1977). The decision tree classifier: design and potential. IEEE Trans. Geosci. Electron. 15, 142-147.
Toussaint, G.T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform. Theory 20, 472-479.