A characterization of nearest-neighbor rule decision surfaces and a new approach to generate them


Pattern Recognition, Pergamon Press 1978. Vol. 10, pp. 41-46. Printed in Great Britain

A CHARACTERIZATION OF NEAREST-NEIGHBOR RULE DECISION SURFACES AND A NEW APPROACH TO GENERATE THEM*

B. DASARATHY, University of South Carolina, Columbia, South Carolina, and LEE J. WHITE, Ohio State University, Columbus, Ohio

(Received 2 December 1976; in revised form 6 July 1977)

Abstract--The paper considers generating nearest-neighbor rule decision surfaces as an application of a maxmin problem. The maxmin problem is to locate a point in a given convex polyhedron which maximizes the minimum distance from a given set of points in the polyhedron. A characterization of the decision surfaces in n dimensions is given, and the difficulty involved in generating the decision surfaces in higher dimensional spaces is brought out through this characterization. However, a novel method is presented to generate the surfaces in three dimensions using the algorithm for the maxmin problem.

Pattern classification   Nearest-neighbor   Discriminant surface   Complexity   Surface generation

PROBLEM STATEMENT

A pattern or a sample is a point in an n-dimensional Euclidean space R^n, called the feature space. A pattern set is a finite set of samples or patterns. Given two or more disjoint pattern sets A1, A2, ..., Ak, let Ai be the nearest pattern set to x ∈ R^n, i.e. there exists a y ∈ Ai such that ||x - y|| ≤ ||x - z|| for every z ∈ Aj and for every j, 1 ≤ j ≤ k. Then the nearest-neighbor rule is to classify x with the pattern set Ai. The nearest-neighbor rule decision surfaces for two pattern sets are defined in the following manner: given two disjoint pattern sets A and B, find one or more surfaces dividing the space into two or more regions such that all the samples in one region are from only one pattern set and every point on the decision surfaces is equidistant from a nearest sample in A and a nearest sample in B. In a similar manner, we define the nearest-neighbor rule decision surfaces for k pattern sets as follows: given k disjoint pattern sets A1, A2, ..., Ak, find (k - 1) or more surfaces dividing the space into k or more regions such that all the samples in one region are from only one pattern set and every point on the decision surfaces is equidistant from two nearest samples, one from, say, pattern set Ai and the other from Aj, 1 ≤ i < j ≤ k. The following example in two dimensions illustrates the above definitions.

Let A = {a, b} and B = {c, d} be two pattern sets in R^2 as shown in Fig. 1. FG, GH and HI constitute the required decision surface. Every point on the decision surface is equidistant from one sample in A and one sample in B. Notice that region R1, to the left of the decision surface, contains samples only from the set A, and region R2, to the right of the decision surface, contains samples only from the set B.
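To make the rule above concrete, here is a minimal Python sketch of nearest-neighbor classification over k pattern sets. It is an illustration only, not part of the original paper; the function name and the coordinates chosen to mimic Fig. 1 are assumptions made for this example.

```python
import math

def nearest_pattern_set(x, pattern_sets):
    """Classify point x with the pattern set containing its nearest sample.

    x            -- a point in R^n, given as a tuple of coordinates
    pattern_sets -- list of disjoint pattern sets A1, ..., Ak, each a list of points
    Returns the index i of the nearest pattern set Ai.
    """
    def dist(u, v):
        return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

    best_index, best_dist = None, float("inf")
    for i, A_i in enumerate(pattern_sets):
        d = min(dist(x, a) for a in A_i)   # distance to the nearest sample in Ai
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

# Hypothetical coordinates in the spirit of Fig. 1: A = {a, b}, B = {c, d} in R^2.
A = [(0.0, 2.0), (0.0, 0.0)]
B = [(3.0, 2.0), (3.0, 0.0)]
print(nearest_pattern_set((0.5, 1.0), [A, B]))   # -> 0, i.e. the point is classified with A
```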

RELATED LITERATURE

There is a good deal of related literature on various pattern separation algorithms. If a pattern separation problem is viewed basically as a problem of obtaining a criterion for distinguishing between the elements of two disjoint sets of patterns, one way to achieve separation is to construct a hyperplane or a nonlinear hypersurface such that one set of patterns lies on one side of the hypersurface, and the other set of patterns lies on the other side. It can be easily shown that the sets can be strictly separated by a hyperplane if and only if the convex hulls of the sets do not intersect. Such a hyperplane can be efficiently constructed by the linear programming method suggested by Mangasarian.(1) When the convex hulls of the two sets intersect, Greenberg and Konheim(2) show that a unique ellipsoidal (quadratic) separation is possible by nonlinear convex programming, and Mangasarian(1) suggests a procedure to find a nonlinear separation by linear programming. Mangasarian(3) also demonstrates how the two sets can be strictly separated by

* The research reported here was supported in part by the Air Force Office of Scientific Research under Grant No. AFOSR 72-2351.




more than one hyperplane. He shows that this problem is essentially a nonconvex programming problem and suggests an iterative algorithm based on linear programming. There are several references which discuss the effectiveness of decision making by the nearest-neighbor rule. In particular, Cover and Hart(4) show that the probability of error of the nearest-neighbor rule is bounded below and above by the Bayes probability of error and twice the Bayes probability of error, respectively. Chow(5) implements a recognition scheme that makes use of the nearest-neighbor rule to recognize characters, and he reports that the recognition performance of the nearest-neighbor rule method compares favourably with other recognition schemes. The unique feature of our work, reported here, is the direct generation of the nearest-neighbor rule decision surfaces by a finite procedure as an application of a maxmin problem discussed in detail elsewhere.(6,7) The motivation behind generating these decision surfaces is the following: the number of "hyperplane segments" contained in the decision surfaces can be smaller than the number of samples in the pattern sets, as in Fig. 1, and thus the surface can be used in the decision making instead of the samples. In addition, we are motivated to generate the decision surfaces as a first step in designing a nonlinear surface or other piecewise linear discriminant surfaces with fewer hyperplane segments which approximate the nearest-neighbor rule decision surfaces. We conclude this section by noting that the nearest-neighbor rule decision surfaces separate the pattern sets as do the surfaces generated by any of the above pattern separation algorithms. In addition, the nearest-neighbor rule surfaces must satisfy one other criterion in that each point on the decision surfaces must be equidistant from nearest samples from each of two different pattern sets.

Fig. 1. Decision surface example.

MAXMIN PROBLEM AND ITS RELATION TO DECISION SURFACES

The development of an algorithm to generate the decision surfaces is based on a relationship between the decision surfaces and the following maxmin problem. The maxmin problem is to locate a point in a given convex polyhedron S so as to maximize the minimum Euclidean distance from a given set of m points ai, 1 ≤ i ≤ m, in the polyhedron. This problem is equivalent to finding the radius of the largest hypersphere such that the center of the hypersphere lies in S and no ai lies strictly interior to the hypersphere. This, in turn, results in the following nonlinear nonconvex programming formulation (here r plays the role of the squared radius of the hypersphere):

max r   (over r and x ∈ S)
subject to   r - (ai - x)'(ai - x) ≤ 0,   1 ≤ i ≤ m.

The authors of this paper consider this problem (see Ref. 6) and prove that:
1. Any point x within the convex hull of the points ai equidistant from (n + 1) nearest ai is a local optimum.
2. Any point x on a d-dimensional face of S, 0 ≤ d ≤ n - 1, equidistant from (d + 1) nearest ai is a local optimum.
Moreover, no other point is a local optimum. Based on these results, an algorithm is also suggested in the above reference for the problem in two and three dimensions. This algorithm, in particular, generates all the local optima in an efficient manner, making use of two heuristics to speed up the convergence. A brief description of the combinatorial nature of the algorithm follows. To generate the local optima within the convex hull, we consider all possible combinations of (n + 1) points ai, and for each such combination, find a point which is equidistant from these (n + 1) points. A point equidistant from (n + 1) points in n-space is the center of the hypersphere passing through these (n + 1) points. Finding this center involves solving n simultaneous linear equations in n unknowns. The center of the hypersphere is a local optimum only if it lies in the convex hull of the points ai and no other ai is strictly interior to the hypersphere. To obtain the local optima on a face of dimension d, 0 ≤ d ≤ n - 1, we consider all possible combinations of (d + 1) ai and, for each such combination, find the point x on this face which is equidistant from these (d + 1) points. Since a face of dimension d can be expressed by (n - d) linear equations, finding such a point on this face amounts to solving n simultaneous linear equations in n unknowns. The point x is a local optimum provided no other ai is closer to x than the (d + 1) ai that made up the combination. Let A = {a1, a2, ..., ap} and B = {b1, b2, ..., bq} be the two disjoint pattern sets in R^n. By definition, the decision surfaces are equidistant from a nearest ai and bj, and consist of hyperplanes, half-hyperplanes and hyperplane segments (for the sake of readability, only the term "hyperplane" will be used in subsequent discussions and the context should make it clear

which surface is intended). The dimensionality of each of these hyperplanes is (n - 1). The space at which two hyperplanes intersect is a hyperplane of dimension (n - 2) and is equidistant from one ai, one bj and one additional ai or bj. Continuing this argument, the point at which n hyperplanes on a decision surface intersect is equidistant from (n + 1) nearest samples. Thus these "intersection points" form a subset of local optima for the maxmin problem, if we utilize the patterns ai and bj for the m given points in that problem and enclose the pattern sets by a large rectangular parallelepiped, a convex polyhedron. A word here is necessary about the parallelepiped being very large. We would like to choose a priori a rectangular parallelepiped large enough to contain all the intersection points. But, in fact, a parallelepiped of infinite size will be chosen in the algorithm to be given in the fifth section, i.e. one of the optimality tests, whether a generated point equidistant from (n + 1) samples lies in the convex polyhedron, will not be carried out.
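As a concrete illustration of this combinatorial generation step (not given in the paper), here is a minimal Python sketch; the function names, the use of NumPy, and the brute-force enumeration over all combinations are assumptions made for clarity. Following the "infinite parallelepiped" remark above, the test for membership in the enclosing polyhedron is omitted.

```python
import itertools
import numpy as np

def circumcenter(points):
    """Center of the hypersphere through (n + 1) points in R^n.

    Equating squared distances to the center gives the n x n linear system
    2(p_i - p_0) . c = |p_i|^2 - |p_0|^2, i = 1..n.  Returns None when the
    points are affinely dependent (singular system).
    """
    p = np.asarray(points, dtype=float)
    A = 2.0 * (p[1:] - p[0])
    b = np.sum(p[1:] ** 2, axis=1) - np.sum(p[0] ** 2)
    try:
        return np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return None

def interior_candidates(samples, tol=1e-9):
    """Candidate local optima equidistant from (n + 1) nearest samples.

    A circumcenter is kept only if no other sample lies strictly interior
    to its hypersphere; the polyhedron-membership test is skipped, matching
    the "infinite parallelepiped" simplification described in the text.
    """
    pts = np.asarray(samples, dtype=float)
    n = pts.shape[1]
    candidates = []
    for combo in itertools.combinations(range(len(pts)), n + 1):
        c = circumcenter(pts[list(combo)])
        if c is None:
            continue
        r = np.linalg.norm(pts[combo[0]] - c)
        others = np.delete(pts, list(combo), axis=0)
        if others.size == 0 or np.linalg.norm(others - c, axis=1).min() >= r - tol:
            candidates.append(c)
    return candidates
```

Restricting the enumerated (n + 1)-tuples to those containing at least one sample from each of A and B yields the intersection-point candidates discussed below.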

CHARACTERIZATION OF DECISION SURFACES

Nearest-neighbor rule decision surfaces imply nearest-neighbor rule

As noted above, the decision surfaces consist of only hyperplanes, half-hyperplanes and hyperplane segments. Our immediate concern is to show that, in the above definition for two pattern sets, if x ∈ R^n lies in the region containing samples from pattern set A, then x is nearer to A than B and x can be classified with A by the nearest-neighbor rule, i.e. in effect we want to show that the nearest-neighbor rule decision surfaces imply the nearest-neighbor rule. The proof in the other direction, i.e. that the nearest-neighbor rule implies the nearest-neighbor rule decision surfaces, is a direct consequence of the triangle inequality. The theorem below states a slightly stronger result, which takes into account any number of pattern sets.

Theorem 1. Let S = {y1, y2, ..., yp, z1, z2, ..., zq} be a set of (p + q) points in R^n. Assume there exists a surface such that every point on the surface is equidistant from the nearest yi and the nearest zj, and each of the two regions R1 and R2 created by this surface contains only Y = {y1, y2, ..., yp} and Z = {z1, z2, ..., zq}, respectively. Then, without loss of generality, if x ∈ R1, x is closer to Y than Z. The proof of this theorem can be found in the Appendix.

Corollary 1. Let R1 be one of the regions formed in the definition for the nearest-neighbor rule decision surfaces for two pattern sets, and let R1 contain samples only from the pattern set A (R2 may contain samples from A). Then if x ∈ R1, x is closer to A than to the pattern set B.

Proof: From the definition for the nearest-neighbor rule decision surfaces, every surface that surrounds


region R1 is equidistant from one sample in A and from one other sample in B. By Theorem 1, if x ∈ R1, x is closer to pattern set A than B, since R1 contains only samples from A, and the proof is complete. □

It was claimed that Theorem 1 states a stronger result than Corollary 1. Corollary 1 is applicable only for two pattern sets, whereas Theorem 1 can be applied to any arbitrary number of pattern sets. To see this, let us say we have k pattern sets A1, A2, ..., Ak, k > 2. Let Y = A1 ∪ A2 ∪ ... ∪ Ap and let Z = Ap+1 ∪ Ap+2 ∪ ... ∪ Ak in Theorem 1. Region R1 contains samples only from the first p pattern sets and region R2 contains samples only from the remaining (k - p) pattern sets. Moreover, every point in R1 is nearer to pattern sets A1, A2, ..., Ap than to Ap+1, Ap+2, etc. We can subsequently divide region R1 into two or more regions to classify these subsets of Y without taking into account any element in Z. This is possible, since every point in R1 is closer to elements in Y than Z. Similarly, R2 can also be subdivided into two or more regions to classify pattern sets Ap+1, Ap+2, etc. without taking into account elements in Y. In effect, Theorem 1 suggests a procedure to classify more than two pattern sets by just grouping these pattern sets into two arbitrary disjoint groups, finding the nearest-neighbor rule decision surfaces by considering these two groups as two disjoint pattern sets, and then classifying each of these groups separately in the new regions. The algorithm below is given only for two pattern sets, since any number of pattern sets can be handled in a similar manner, as sketched below.
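The grouping procedure just described can be written as a simple recursion. The sketch below is illustrative only; the routine find_two_set_surfaces is a hypothetical stand-in for the two-pattern-set algorithm developed in the following sections.

```python
def build_surface_tree(pattern_sets, find_two_set_surfaces):
    """Reduce k-set decision-surface generation to repeated two-set problems.

    pattern_sets          -- list of disjoint pattern sets (each a list of samples)
    find_two_set_surfaces -- hypothetical routine returning the nearest-neighbor
                             rule decision surfaces between two disjoint sets
    Returns a binary tree: each node holds the surfaces separating two groups,
    which Theorem 1 allows us to refine independently afterwards.
    """
    if len(pattern_sets) <= 1:
        return None                          # a single pattern set needs no surface
    mid = len(pattern_sets) // 2             # any disjoint grouping will do
    Y, Z = pattern_sets[:mid], pattern_sets[mid:]
    surfaces = find_two_set_surfaces([s for A in Y for s in A],
                                     [s for B in Z for s in B])
    return {"surfaces": surfaces,
            "left": build_surface_tree(Y, find_two_set_surfaces),
            "right": build_surface_tree(Z, find_two_set_surfaces)}
```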

Convexity of decision surfaces

We state one more theorem which is very crucial to our development. The purpose of Theorem 2 below is to assert that if we are given n points on an (n - 1)-dimensional hyperplane contained in the decision surface such that each of these n points is equidistant from two nearest samples, one from pattern set A and the other from pattern set B, and the rest of the samples are further away from these n points, then any convex combination of these n points will be equidistant from the same two nearest samples and the rest of the samples will lie further away.

Theorem 2. Let a1, a2, ..., an ∈ R^n and let y, z ∈ R^n be such that ||ai - y|| ≤ ||ai - z||, 1 ≤ i ≤ n. If x = α1 a1 + α2 a2 + ... + αn an, where 0 ≤ αi ≤ 1 and α1 + α2 + ... + αn = 1, then ||x - y|| ≤ ||x - z||. The proof of this theorem can also be found in the Appendix.
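A quick numerical sanity check of Theorem 2 can be written in a few lines of Python. This is not part of the paper; the random construction below is purely illustrative of the hypothesis and conclusion of the theorem.

```python
import random
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

random.seed(0)
n = 3
y = [random.uniform(-1, 1) for _ in range(n)]
z = [random.uniform(-1, 1) for _ in range(n)]

# Draw points a_i no farther from y than from z (the hypothesis of Theorem 2).
a = []
while len(a) < n:
    p = [random.uniform(-2, 2) for _ in range(n)]
    if dist(p, y) <= dist(p, z):
        a.append(p)

# Form a random convex combination x of the a_i.
w = [random.random() for _ in range(n)]
s = sum(w)
alpha = [wi / s for wi in w]
x = [sum(alpha[i] * a[i][d] for i in range(n)) for d in range(n)]

# Conclusion of Theorem 2: x is also no farther from y than from z.
assert dist(x, y) <= dist(x, z) + 1e-12
print("Theorem 2 holds for this instance:", dist(x, y) <= dist(x, z) + 1e-12)
```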

DECISION SURFACES IN THREE DIMENSIONS

The essence of the algorithm to construct the decision surfaces is to generate the intersection points of



the decision surfaces using the maxmin algorithm discussed above. If a hyperplane on the decision surface is bounded in all directions, then according to Theorem 2, the decision surface consists of the convex hull of all the intersection points on this hyperplane. If a hyperplane on a decision surface is unbounded in one or more directions, i.e. some of its intersection points lie at infinity, then the hyperplane is to be kept track of by means of its bounding hyperplanes of lesser dimensions. It is the curse of dimensionality involved in generating the convex hull and in keeping track of the unbounded decision surfaces in higher dimensions which prevents us from suggesting an algorithm in any n-dimensional space. However, an outline of an algorithm in three dimensions is presented below to show that these tasks are manageable in smaller dimensional spaces. Let a decision surface be "decided" by the pair of samples (ai, bj), viz. ai and bj are the nearest samples to the decision surface. We maintain the following information corresponding to this surface: pointers to ai and bj, the planar equation of the decision surface, the cardinality of the set of extreme points (intersection points), and pointers to the extreme points themselves. In case a plane is not bounded in one or more directions, a special mark is made and the equations of the bounding lines (in ≤ form) are also stored. To start with, an intersection point which is equidistant from, say, ai1, bj1 and two other samples from either A or B or both, say ai2 and ai3, is generated using the maxmin algorithm, i.e. a local optimum which is equidistant from ai1, ai2, ai3 and bj1 is generated. Corresponding to this intersection point, there are three decision planes decided by (ai1, bj1), (ai2, bj1) and (ai3, bj1). (These planes, i.e. the pointers to the deciding pair of samples, the equation of the plane, and a pointer to this intersection point, are all stored for each plane.) All the remaining intersection points on the decision surface decided by (ai1, bj1) are then produced using the maxmin algorithm by generating all the local optima equidistant from ai1, bj1 and two other samples neither of which is ai2 or ai3. Each one of these intersection points will in turn give rise to two other hyperplanes on the decision surface. Before we consider any pair of samples to be a possible decision pair, we check whether there exists an entry already corresponding to this pair. We go on to generate the remaining intersection points on the decision surfaces decided by (ai2, bj1), (ai3, bj1), etc. If necessary, new planes are created that could decide decision surfaces. This process is repeated until all the intersection points for each possible plane are generated. We next observe that if a plane on the decision surface has three or more intersection points, the plane is bounded in all directions. The convex hull of these intersection points can be generated in a very efficient manner using Graham's approach.(8) If the plane has only two intersection points, the plane is bounded by three lines (two of which are parallel) and it is semi-infinite. If the plane has only one inter-

section point, it is again semi-infinite and bounded by two lines. The equations of the bounding lines are easy to obtain.
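The per-plane bookkeeping described above might be organized as in the following Python sketch. The record fields and the bisector-plane helper are illustrative choices made here, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]

@dataclass
class DecisionPlane:
    """Bookkeeping record for one plane of the 3-D decision surface.

    Field names are illustrative; they mirror the information the text says
    must be maintained for each plane decided by a pair of samples (ai, bj).
    """
    a_index: int                                   # pointer to the deciding sample in A
    b_index: int                                   # pointer to the deciding sample in B
    plane: Tuple[float, float, float, float]       # coefficients (c1, c2, c3, c4) of c1*x + c2*y + c3*z = c4
    extreme_points: List[Point] = field(default_factory=list)              # intersection points found so far
    unbounded: bool = False                        # marked when some extreme points lie at infinity
    bounding_lines: List[Tuple[float, ...]] = field(default_factory=list)  # bounding-line equations, in "<=" form

def bisector_plane(a: Point, b: Point) -> Tuple[float, float, float, float]:
    """Plane of points equidistant from samples a and b (the plane this pair 'decides')."""
    c1, c2, c3 = (bi - ai for ai, bi in zip(a, b))
    c4 = (sum(bi * bi for bi in b) - sum(ai * ai for ai in a)) / 2.0
    return (c1, c2, c3, c4)
```

A new DecisionPlane record would be created the first time its deciding pair (ai, bj) is encountered; its extreme_points list then grows as the maxmin algorithm supplies further intersection points equidistant from that pair.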

CONCLUSION

An upper bound on the growth of the algorithm is easy to obtain under the assumption that it is proportional to the time involved in generating the intersection points. There are at most pq·C(p+q-2, 2) points, equidistant from one ai, one bj and two other ai or bj, which can be candidates for the intersection points. A point which is equidistant from four points is the center of the sphere passing through these four points and can be computed by an O(n^3) algorithm, where n is the dimensionality of the space. Such a point is an intersection point if and only if the four samples which are equidistant from it are also the nearest. This can be checked by an O(p + q) algorithm. Thus the algorithm in three dimensions is bounded above by O((p + q)^5) in time complexity. However, from our computational experience with the maxmin problem, we conclude that the average growth is O((p + q)^3.5) when the samples are generated by a uniform distribution. In a manner similar to the three-dimensional case, it can be argued that an upper bound on the growth of the algorithm in two dimensions is O((p + q)^4). However, a method of lower complexity, O((p + q) log (p + q)), can be suggested as an alternative in the two-dimensional case (see Shamos(9)). Finally, it is observed that the algorithm presented here for two pattern sets can be technically extended to any number of pattern sets in view of the discussion following Theorem 1.
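To make the counting argument concrete, the short snippet below evaluates the candidate bound; the sample sizes are illustrative only and do not come from the paper.

```python
from math import comb

def candidate_intersection_points(p, q):
    """Upper bound p*q*C(p+q-2, 2) on candidate intersection points in three dimensions."""
    return p * q * comb(p + q - 2, 2)

p = q = 20                                   # hypothetical sample sizes
candidates = candidate_intersection_points(p, q)
print(candidates)                            # 281200 candidate points
print(candidates * (p + q))                  # each candidate checked in O(p + q) time
```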

The nearest-neighbor rule decision surface, a well known piecewise linear discriminant surface, is the focus of study in this paper. The paper considers generating these decision surfaces as an application of a maxmin problem. The maxmin problem is to locate a point in a given convex polyhedron which maximizes the minimum Euclidean distance from a given set of points in the polyhedron. The decision surfaces, in effect, partition the Euclidean space into regions such that each region contains samples from only one pattern set, and every point in a region is nearer to the pattern set which is contained in that region. The convexity of the decision surfaces is proved; it is shown that each "hyperplane segment" contained in a decision surface is convex. It is demonstrated that the extreme points of these convex segments form a subset of "local optima" for the maxmin problem under a suitable transformation. Thus an algorithm is suggested to generate the decision surfaces for two pattern sets as an application of the maxmin problem in two and three dimensions. It is the curse of dimensionality involved in generating the convex hull in higher dimensional spaces which



prevents us from suggesting an algorithm in any n-dimensional space. A complexity analysis of the algorithm is presented. It is also shown that the procedure suggested for two pattern sets can be technically extended to any number of disjoint pattern sets.

REFERENCES

1. O. L. Mangasarian, Linear and nonlinear separation of patterns by linear programming, Ops Res. 13, 444-452 (1965).
2. H. G. Greenberg and A. G. Konheim, Linear and nonlinear methods in pattern classification, IBM J. Res. Dev. 8, 299-307 (1964).
3. O. L. Mangasarian, Multisurface method of pattern separation, IEEE Trans. Inf. Theory 14, 801-807 (1968).
4. T. M. Cover and P. E. Hart, Nearest-neighbor pattern classification, IEEE Trans. Inf. Theory 13, 21-27 (1967).
5. C. K. Chow, Recognition method using neighbor dependence, IRE Trans. Electron. Comput. 11, 683-690 (1962).
6. B. Dasarathy, Some maxmin location and pattern separation problems: theory and algorithms, Ph.D. Dissertation, The Ohio State University (1975).
7. B. Dasarathy and L. J. White, A maxmin location problem involving a Euclidean distance: applications and algorithms, submitted for publication.
8. R. L. Graham, An efficient algorithm for determining the convex hull of a finite planar set, Inf. Proc. Lett. 1, 132-133 (1972).
9. M. I. Shamos, Geometric complexity, ACM Symp. on Theory of Computing, 224-233 (1975).
10. R. G. Busacker and T. L. Saaty, Finite Graphs and Networks: An Introduction with Applications. McGraw-Hill, New York (1965).

APPENDIX

Proof of Theorem 1: The proof is by contradiction. Suppose the theorem is false; then there exists a zj ∈ Z which is closer to x, i.e.

||x - zj|| ≤ ||x - zi|| for every zi ∈ Z, and ||x - zj|| < ||x - y|| for every y ∈ Y.   (1)

Join x to zj, as shown in Fig. 2. By Jordan's theorem (given in Busacker and Saaty(10)), the line segment x zj intersects the surface at least once. Arbitrarily select the first intersection point, and let this be x0, lying on a hyperplane H.

Fig. 2. Construction for Theorem 1.

First we claim that zj is the nearest of all zi ∈ Z to x0, i.e.

||zj - x0|| ≤ ||zi - x0|| for all zi ∈ Z.

For otherwise assume zk is closer to x0, i.e.

||x0 - zj|| > ||x0 - zk||.   (2)

Since x0 lies on the line segment x zj,

||x - zj|| = ||x - x0|| + ||x0 - zj||.

From (1) and (2), and by use of the triangle inequality, we obtain

||x - zj|| > ||x - zk||.   (3)

This is a contradiction, since in (1) we assumed zj is closer to x than all other zi ∈ Z; hence zj is the closest of all zi ∈ Z to x0.

But there exists a y ∈ Y such that ||y - x0|| = ||x0 - zj||, since x0 lies on the decision surface H. Therefore

||x - zj|| = ||x - x0|| + ||x0 - zj|| = ||x - x0|| + ||x0 - y||.

But ||x - x0|| + ||x0 - y|| ≥ ||x - y|| by the triangle inequality, implying ||x - zj|| ≥ ||x - y||. Again, this is a contradiction to the assumption that zj is closer to x than all y ∈ Y and zi ∈ Z, and the theorem is proved. □

Proof of Theorem 2: The Euclidean distance ||u - v|| between any two vectors u and v is given by ||u - v|| = (u - v, u - v)^(1/2), where (x, y) denotes the inner product of the vectors x and y. In the statement of the theorem it is given that ||ai - y|| ≤ ||ai - z||, 1 ≤ i ≤ n, i.e.

(ai - y, ai - y) ≤ (ai - z, ai - z),   1 ≤ i ≤ n.   (4)

Using the properties of the inner product space, from (4) we obtain

2(ai, y) - 2(ai, z) ≥ (y, y) - (z, z),   1 ≤ i ≤ n.   (5)

Multiplication of (5) by a nonnegative constant αi results in

2αi(ai, y) - 2αi(ai, z) ≥ αi(y, y) - αi(z, z),   1 ≤ i ≤ n.   (6)

To prove ||x - y|| ≤ ||x - z||, where x = α1 a1 + ... + αn an with α1 + ... + αn = 1 and αi ≥ 0, it needs to be shown that 2(x, y) - 2(x, z) ≥ (y, y) - (z, z), i.e.

2[α1(a1, y) + ... + αn(an, y)] - 2[α1(a1, z) + ... + αn(an, z)] ≥ (y, y) - (z, z).   (7)

The proof is completed by noting that (7) is obtained by summing each quantity in (6) over all i, using α1 + ... + αn = 1. □

About the Author--BALAKRISHNAN DASARATHY received his Bachelors Degree (First Class) in Electrical Engineering from the University of Madras, India, in 1970 and his Ph.D. in Computer and Information Science from the Ohio State University in 1975. Presently he is an Assistant Professor with the Department of Mathematics and Computer Science at the University of South Carolina. He was with the University of Oregon during the academic year 1975-76 as a Visiting Assistant Professor. His current research interests are analysis of algorithms, programming languages and compilers, and pattern classification.


About the Author--LEE J. WHITE received his Ph.D. from the University of Michigan in Electrical Engineering in 1967. He is currently an Associate Professor of Computer and Information Science and of Electrical Engineering at the Ohio State University. He has served as a consultant for the Monsanto Research Laboratory and Rockwell International Corporation, and has had extensive engineering work experience with the Dow Chemical Company, the Battelle Memorial Research Institute and the Lockheed Missile and Space Company. His current research interests are in the areas of pattern classification, analysis of algorithms, combinatorial computing and graph theory.