Pattern Recognition Letters 1 (1983) 277-285 North-Holland
July 1983
Approximative fast nearest-neighbour recognition

L. MICLET and M. DABOUZ

École Nationale Supérieure des Télécommunications, 46 Rue Barrault, 75634 Paris 13, France
Received 31 January 1983
Revised 6 April 1983

Abstract: We analyze (on pseudo-randomly generated data) the errors committed with a log(N) algorithm attempting to recognize the nearest neighbour among N vectors. We propose improvements in O(log N), and present an example of application on speech data.

Key words: Nearest-neighbour search, top-down hierarchical clustering.
Introduction

The 'nearest-neighbour' classification method is a useful technique in Pattern Recognition. It is unfortunately very time-consuming in most cases, despite several algorithms designed to avoid an exhaustive search.

This paper asks the following questions, and answers them very partially for pseudo-random data: What happens if one tries to systematically find, in O(log N) time, the 'nearest' neighbour of a test vector in a learning set of N vectors of any dimension? Is there a good probability of getting the actual nearest neighbour? In the case of errors, can we estimate their importance?

The results presented seem to lead to the conclusion that, in some pattern recognition problems, it can be a good solution to increase the size of the learning sample and use such a method to find quickly a good 'approximation' of the nearest neighbour. We give an example in speech transmission with a 'classification vocoder'.
1. Fast nearest-neighbour recognition

The scheme of the 'nearest-neighbour' classification rule is straightforward: given a learning set of vectors, partitioned into recognition classes, one chooses as the class of an incoming 'test' vector the class of its nearest neighbour among the learning set. The asymptotic statistical efficiency of this method is quite interesting, its corresponding algorithm is very simple, and its extension (to the 'k-nearest-neighbours') has good properties. For an extensive bibliography on this method, see for example Devijver and Kittler (1982).

Its main drawback, of course, is that it can be very time-consuming, since it is supposed to require the computation of N distances, where N stands for the size of the learning sample. In a space of high dimensionality, with a consequently large learning sample, the brute-force algorithm becomes unrealistic. Hence, several authors have suggested faster schemes to find the nearest neighbour, based on rather different ideas. Let us mention briefly:
(1) the 'condensation' methods, where one defines a subset of learning points which has the same classification properties as the whole set: for example Hart (1968);
(2) the 'tessellation' methods, where the space is divided into cells; the nearest neighbour has to be searched for only in a few of them: for example Delannoy (1980);
(3) the 'sorting' methods: the learning vectors are sorted along one coordinate; a property of the metric allows the search to be limited: for example Friedman et al. (1975);
(4) the 'hierarchical' methods; we base this paper on the Fukunaga and Narendra (1975) algorithm, which we shall see in more detail in the next section.
A good review of these fast methods, and of related problems, is Lehert (1982).

All these methods first use a 'preprocessing' step, which requires some additional computing time, but has to be done only once, whatever the number of test vectors to be classified. The information extracted by this preprocessing is stored, and is used to limit the number of vectors among which the nearest neighbour of a test vector has to be looked for.

As a general remark, one must notice that the efficiency of this fast search decreases when the dimensionality DIM of the space increases. This is explicitly estimated in Sethi (1981) and Friedman et al. (1975), and can be deduced from the experimental figures of Fukunaga and Narendra (1975). Some methods in O(log N) expected time are known, but only in dimension DIM = 1 (dichotomic search on sorted data) and DIM = 2 (Voronoi tessellation, as used by Shamos and Hoey (1975), with a long preprocessing). As an example, since the Friedman et al. (1975) algorithm finds the nearest neighbour in an expected time of O(DIM * N^(1 - 1/DIM)), it quickly loses its efficiency when DIM increases.
2. Hierarchical classification of the learning set
The preprocessing scheme defined by Fukunaga and Narendra (1975) is to structure the learning set as a tree. The leaves of the tree are the vectors of this set. Each node of the tree contains information summarizing the properties of its corresponding subset of leaves. This tree is obtained with a hierarchical classification method: the authors use a top-down scheme, with a fixed number of classes at each step, obtained by the 'k-means' algorithm. Each node contains the following information:
(1) Sp, the set of the vectors (leaves) under this node,
(2) Mp, a fictive vector, the gravity center of Sp,
(3) rp, the radius of Sp (i.e. the value of the maximal distance between Mp and a vector of Sp).

The nearest-neighbour search algorithm is now the following. At each step one is 'exploring' a node, i.e. examining its successors in the tree. Each of these successors can be in only one of three situations:
(1) It has already been explored; in that case, the current value B of the distance to the nearest neighbour takes all the leaves of this node into account.
(2) Its parameters are such that

    B + rp < d(x, Mp),

where x is the test point and d the metric of the space. In this case, it is clear that none of the successors of this node can decrease the value of B. It does not have to be explored.
(3) It has to be explored, which will be the next step.

This 'branch and bound' method avoids the computation of some distances, in a proportion depending on the data and on the tree-construction algorithm. No a priori information is available to choose the latter. The authors give some numerical results: with a learning set of 1000 vectors in R2, uniformly distributed, the average number of distances to be computed is 46; with 3000 vectors in R8, it increases to 451.

The most optimistic expected time for a fast method is O(log N); it could be obtained with such a hierarchical method, provided that no backtrack is necessary. In other terms, if a safe decision could be made at each node, one could find the nearest neighbour in one run from the top of the tree to the searched leaf. In the next section, we shall experimentally examine the errors committed when forcing the latter algorithm not to backtrack, for a hierarchical structure chosen to be a robust one. We shall describe a simple way to increase the quality of the results, while maintaining a logarithmic search time.
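As an illustration only (not the authors' code), the following Python sketch implements this branch-and-bound rule on a generic tree whose nodes carry the fields Sp, Mp and rp described above; the class and function names are ours.

import math

class Node:
    """A node of the hierarchy: Sp is represented by the children (internal node)
    or by a list of vectors (leaf); Mp is the gravity center and rp the radius."""
    def __init__(self, center, radius, children=None, vectors=None):
        self.center = center          # Mp
        self.radius = radius          # rp
        self.children = children or []
        self.vectors = vectors or []  # non-empty only for leaves

def dist(a, b):
    """Euclidean metric d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def branch_and_bound_nn(root, x):
    """Exact nearest neighbour of x; a subtree is skipped when B + rp < d(x, Mp)."""
    best = {"vector": None, "B": float("inf")}

    def explore(node):
        if node.vectors:                      # leaf: compare x with its vectors
            for v in node.vectors:
                d = dist(x, v)
                if d < best["B"]:
                    best["vector"], best["B"] = v, d
            return
        # explore the closer child first, so that B shrinks early
        for child in sorted(node.children, key=lambda c: dist(x, c.center)):
            if best["B"] + child.radius < dist(x, child.center):
                continue                      # elimination rule: cannot improve B
            explore(child)

    explore(root)
    return best["vector"], best["B"]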
Volume 1, Numbers 5,6
PATTERN RECOGNITION LETTERS
3. Hierarchical search with no backtrack
3.1. Experimental protocol

We have chosen to create a binary hierarchy on the vectors. Each node is in fact a fictive point: the gravity center of all the vectors under this node. At each step, the decision is to compare the distances between the test point and the two gravity centers of the child nodes, and to continue the descent with the closer one. This corresponds to the hypothesis that the set of vectors has been separated into two classes in such a good way that every vector which is on one side of the perpendicular bisector hyperplane between the gravity centers of the classes has its nearest neighbour on the same side. This is true in most cases, but certainly not always. Figure 1 shows an illustration of this assumption in a 2-dimensional space. The vectors inside the hatched zone, and only these, will lead to a mistake in a no-backtracking decision process. These zones are geometrically defined (in a Euclidean space) by polyhedra built on perpendicular bisector hyperplanes of segments having one extremity in each class. They are naturally located 'around' the decision hyperplane.

Fig. 1. Error zone in two dimensions (gravity centers, perpendicular bisector hyperplane, error zone).

The first parameter of the experimental protocol is the choice of the binary classification algorithm. It requires at least two qualities: (1) the trees must be as balanced as possible, in order to optimize the speed of the method; (2) the learning vectors, used as test points, must be their own 'nearest' neighbour. These two criteria, according to some experiments on speech data (see the last section), lead us to use at each node of the tree the Dynamic Cluster algorithm (Diday and Simon (1976)), in its simplest version: the center of a cluster is its gravity center, and the optimization criterion is to minimize the sum of the variances of the two clusters. It must be noticed that this method can converge to different local optima, according to the initial choice of the clusters; this will be a useful property in the following.

Other parameters to determine are: the dimensionality of the space, the metric, the density of the learning set and its statistical distribution, and the number of test vectors. Moreover, each experiment must be done several times in an independent manner, in order to estimate the variance of the results.

We chose the experimental protocol as follows. N pseudo-random vectors of dimension DIM are drawn according to a uniform distribution in a hypercube of size 1, or according to a DIM-variate normal distribution with a diagonal covariance matrix. Then this learning set is hierarchised with the method described above. M vectors are drawn according to the same pseudo-random law; for each of them one searches the three nearest neighbours in the learning set (denoted ppo1, at distance d, ppo2 and ppo3) with the trivial method; then the 'nearest' neighbour is searched for in the tree, with no backtracking (denoted fppo, at a distance df). For such an experiment, we compute:

p1 = (number of times that ppo1 = fppo) / M,
p2 = (number of times that ppo2 = fppo) / M,
p3 = (number of times that ppo3 = fppo) / M,
RAP = (sum of the df/d) / M,
PROP = (sum of the df) / (sum of the d).

A 'good' result is, of course, to get p1, p2, p3 as close to 1 as possible and RAP and PROP as low as possible. For a choice of DIM, N, M and of the random law, the experiment is repeated T times on independent pseudo-random data.

Table 1
Value of p1 (%) (uniform distribution, Euclidean distance, N = M = 1000)

T           DIM = 1   DIM = 2   DIM = 3
1           87.2      68.7      40.6
2           85.8      70.4      39.6
3           83.5      72.3      41.6
4           83.2      68.8      38.6
5           85.7      73.6      39.5
6           86.1      67.8      40.3
7           86.2      70.2      37.0
8           84.5      68.2      39.7
9           85.6      69.0      39.6
10          87.7      69.7      39.9

Average     85.55     69.90     39.64
Std. Dev.    1.38      1.74      1.16
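As a minimal sketch of this protocol (our own illustrative Python, not the authors' implementation), the following code builds a binary tree by repeated 2-means splits, descends it without backtracking, and computes p1 and RAP against a brute-force search. The helper names are ours, and the plain 2-means step stands in for the Dynamic Cluster algorithm.

import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def two_means(vectors, iters=10):
    """Plain 2-means split (standing in for the Dynamic Cluster algorithm)."""
    c1, c2 = random.sample(vectors, 2)
    g1, g2 = [], []
    for _ in range(iters):
        g1 = [v for v in vectors if dist(v, c1) <= dist(v, c2)]
        g2 = [v for v in vectors if dist(v, c1) > dist(v, c2)]
        if not g1 or not g2:
            break
        c1, c2 = mean(g1), mean(g2)
    return g1, g2

def build_tree(vectors):
    """Binary hierarchy: each node stores the gravity center of the vectors under it."""
    if len(vectors) == 1:
        return {"center": vectors[0], "vector": vectors[0], "children": None}
    g1, g2 = two_means(vectors)
    if not g1 or not g2:                      # degenerate split (e.g. duplicate points)
        half = len(vectors) // 2
        g1, g2 = vectors[:half], vectors[half:]
    return {"center": mean(vectors), "vector": None,
            "children": [build_tree(g1), build_tree(g2)]}

def descend(node, x):
    """No-backtrack search: always follow the child whose gravity center is closer to x."""
    while node["children"]:
        node = min(node["children"], key=lambda c: dist(x, c["center"]))
    return node["vector"]

# Toy version of the protocol: N learning vectors, M test vectors, uniform law.
DIM, N, M = 3, 200, 200
learn = [[random.random() for _ in range(DIM)] for _ in range(N)]
tests = [[random.random() for _ in range(DIM)] for _ in range(M)]
tree = build_tree(learn)

hits, rap = 0, 0.0
for x in tests:
    ppo1 = min(learn, key=lambda v: dist(x, v))   # true nearest neighbour (brute force)
    fppo = descend(tree, x)                       # 'nearest' neighbour found in the tree
    d, df = dist(x, ppo1), dist(x, fppo)
    hits += (fppo == ppo1)
    rap += df / d if d > 0 else 1.0
print("p1  =", hits / M)
print("RAP =", rap / M)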
3.2. Results

3.2.1. Variance of the results. The first conclusion we can draw is that the variance of the results is quite low for a given experiment; in other terms, the parameter T is of little influence. We have fixed it in most of the following at the value T = 10; the results given are, in each case, the average over these T similar experiments. Table 1 gives an example of this property.
3.2.2. Influence of the random law. A systematic observation is that the Gaussian distribution gives results slightly worse than the uniform distribution. This difference is statistically valid, but weak.

3.2.3. Influence of M. The number of test vectors is not an important factor in the results, provided it is large enough: the variance of the results, when the other parameters are fixed, decreases quickly as M grows. A good value is M = N, or more when N is low.

3.2.4. Influence of N. Surprisingly, the density in the learning space does not seem to be a critical parameter. The results are generally better with N lower, but these conclusions are only drawn from experiments in low-dimensional spaces, DIM = 1, 2 or 3. It is of course impossible to carry out experiments with DIM = 16 and a density comparable to that of 1000, 100, or even 10 vectors in dimension 1! All the results given in the following must be seen in this perspective.

3.2.5. Influence of the metric. We have compared three metrics: d1 (Minkowski), d2 (Euclidean) and d∞ (Chebychev). The clustering algorithm loses its properties when used with d∞. Nevertheless, the results are once again rather independent of this distance parameter. They are always better with d2, and worse with d∞, but with a rather small degradation in the latter case.

3.2.6. Influence of the dimensionality. The dimension DIM of the space is definitely the main factor in the value of p1, p2 and p3. The variations of the other parameters induce very small changes on these results, compared to those due to DIM. Table 2 and Figure 2 show characteristic examples of this observation.

Table 2
Influence of various parameters on p1

DIM   Metric   N and M   Random law   p1
1     d1       10        Gaussian     0.88
1     d1       10        Uniform      0.90
1     d∞       100       Gaussian     0.85
1     d∞       100       Uniform      0.86
1     d2       1000      Gaussian     0.84
1     d2       1000      Uniform      0.84
3     d1       100       Uniform      0.60
3     d2       100       Gaussian     0.60
3     d∞       100       Uniform      0.58
3     d2       100       Uniform      0.62
3     d2       1000      Uniform      0.58
Fig. 2. p1, p2, p3 and RAP as functions of DIM (uniform distribution, Euclidean distance, N = M = 1000).
What is more surprising, and could be an interesting property, is that RAP and PROP are not much influenced by DIM! This can be interpreted as meaning that, as DIM increases, more mistakes are committed, but these mistakes are less important: the real nearest neighbour is not found, but the vector chosen by the direct algorithm is 'close' to it. The values of RAP and PROP vary around 1.2.
3.3. Conclusions and further experiments

In Table 2 and Figure 2 we display a few figures to illustrate the influence of all these parameters. They are drawn from a large set of experiments, and chosen to be as representative as possible. What seems to appear as a conclusion is that the 'nearest' neighbour found by the direct hierarchical method is in some way a good approximation of the actual one. It can be used in classification problems, as we did on speech data (cf. the last section).

Nevertheless, further experiments should be done on data labelled with a class number, to simulate a real pattern-recognition problem. The classification errors could then be compared to those made with a real nearest-neighbour decision.

Another conclusion is that the number of distance computations for each test point is, on the average, very close to log N. Of course, the preprocessing time is comparatively important, but in no experiment was it greater than the time required by the exhaustive search of the three nearest neighbours. The extra space required by the data created in the preprocessing is about (DIM + 1) * N words, which is to be compared with the DIM * N words of the learning set itself.

Obviously, improvements could come from keeping more information at each node, so that the decision made would be safer. Some ideas from the condensation methods could perhaps be useful for that purpose. But one has to consider that only a fixed (and small) number of tests must be made at each node, if one wants to keep the logarithmic behaviour of the method. Here again, a trade-off would have to be found.
Fig. 3. Two independent hierarchisations: p1, p3 and RAP as functions of DIM (uniform distribution, Euclidean distance, N = M = 1000).
4. Improvement with multiple independent hierarchisations
The errors made in the direct hierarchical nearest-neighbour search come exclusively from the error zones shown in Figure 1. At each level of the binary tree, these zones are located 'around' the perpendicular bisector hyperplane. This suggests at least two possibilities for correcting some mistakes.

One is to use a clustering method with overlapping clusters. This would lead to finding the same vector at different leaves of the tree, and consequently a decision made at one node would, in some sense, not be definitive. We are currently working on such algorithms on randomly drawn data. An example on speech data is given later.

Another is to use the remark made in Section 3.1, namely that the Dynamic Cluster algorithm can converge to different sub-optimal partitions, depending on the initialization. This means that, on the same data, two hierarchisations made with different initializations will lead to different trees, and consequently to different error zones, with likely a small intersection.

Hence, the experiment is to perform two independent preprocessings on the same data, to search through each of the trees for the 'nearest' neighbour, and to choose, among the two answers, the one whose distance to the test vector is the shorter. Empirically, one can hope that the values of p1, p2 and p3 will rise, since the error zones can only have a smaller volume. Figure 3 gives, for M = N = 1000, T = 10, a uniform distribution and the Euclidean metric, the values of p3 for one and two independent hierarchisations. The improvement is quite significant.

There is no reason to limit the number of independent hierarchisations to two, except for what concerns the preprocessing space and time. Figure 4 gives, under the same experimental conditions as Figure 3, the results with one and three different trees. Errors made in dimension DIM = 8 are around 10% in the latter case, instead of around 40% in the first case, with a number of distance computations three times higher, but still in O(log N) time.
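A minimal sketch of this improvement, reusing the illustrative build_tree, descend and dist helpers (and the learn and tests data) from the previous sketch — again our own code, not the authors': build several trees with different random initializations and keep the candidate closest to the test vector.

def multi_tree_nn(trees, x):
    """Query several independent hierarchies and keep the best of their answers."""
    candidates = [descend(t, x) for t in trees]
    return min(candidates, key=lambda v: dist(x, v))

# Three independent hierarchisations of the same learning set: the 2-means
# initialization is random, so the trees (and their error zones) differ.
trees = [build_tree(learn) for _ in range(3)]
print(multi_tree_nn(trees, tests[0]))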
Fig. 4. Three independent hierarchisations: p1, p3 and RAP as functions of DIM (uniform distribution, Euclidean distance, N = M = 1000).
5. An example of application: Speech transmission with a 'clustering vocoder'

The principle of speech transmission with a 'clustering vocoder' (Viswanatan et al. (1982)) is first to extract from speech data a 'codebook' of vectors (e.g. of linear prediction coefficients), designed to be a good 'summary' of all such possible speech representations. Both the emitter and the receiver have the codebook. When speech is to be transmitted, each input temporal frame is converted into a vector of the same space as the codebook vectors. The nearest neighbour is selected among them, and its number is transmitted, along with additional information. The receiver takes the selected vector in its own codebook and synthesizes some speech from it and the extra information.

Two problems arise in designing a clustering vocoder: the computation of the codebook, and a fast nearest-neighbour recognition inside it. This is a good example for an application of approximative algorithms, since the size of the codebook can be entirely controlled. In a very large codebook, it can be enough to select a vector 'not far' from the input vector, and not necessarily the closest. A recent example of the design of such a vocoder, and a discussion of these problems, can be found in the paper by Wong et al. (1982).

In the speech transmission experiment we describe here, as an example among many alternative designs, we extracted a codebook for only one speaker. The algorithm is a straightforward threshold on-line method, which ensures (unless there is an important chain effect, which was not the case) that all the codebook vectors are at least at a given distance from each other. The representation space, in this example, is simply the 15 correlation coefficients of the input signal, on a temporal frame of 25.6 ms. For a given threshold, with the Euclidean distance, we extracted a codebook of 400 vectors from 2000 input frames. We then use the hierarchisation method explained in Section 3. In the following, the test vectors are those from which the codebook has been extracted, but very similar results have been obtained with other data. We measure the value of p1, and a distortion measure DIS which is very similar to RAP or PROP.
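Before reporting the figures, here is a minimal sketch of the threshold on-line codebook extraction just described (our own illustrative code and names; in the real system the frames are vectors of 15 correlation coefficients per 25.6 ms frame).

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def extract_codebook(frames, threshold):
    """On-line threshold method: a frame enters the codebook only if it is at least
    'threshold' away from every vector already selected."""
    codebook = []
    for frame in frames:
        if all(dist(frame, c) >= threshold for c in codebook):
            codebook.append(frame)
    return codebook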
The figures we measured in this experiment were: average number of distance computations = 9.1, p1 = 0.76, DIS = 1.20.
The very high value of p1, compared to random distributions in dimension 15, can be interpreted as indicating that the intrinsic dimensionality of the speech correlation space is much lower (which a principal-component analysis helps to understand: e.g. De La O (1982)).

To improve these results, we designed a simple algorithm for overlapping binary clustering. At each step in the tree, a partition into two classes is first made as before. Then a vector is also assigned to the farther class whenever the ratio of its distances to the two gravity centers does not exceed a given threshold. This leads simply to overlapping classes. For a threshold of 1.20, we built a tree of average depth 12.4 (there is an important overlapping). The results with this new hierarchisation are: p1 = 0.84, DIS = 1.15.
A good improvement is therefore made, for a slight increase of the required recognition time.
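A sketch of this overlapping split (our own illustrative code, under the interpretation above, reusing the dist helper from the earlier sketches): after a binary partition, a vector nearly equidistant from the two gravity centers is placed in both children, so that a no-backtrack descent can still reach it from either side.

def overlapping_split(vectors, c1, c2, ratio=1.20):
    """2-class partition around centers c1 and c2, duplicating borderline vectors."""
    g1, g2 = [], []
    for v in vectors:
        d1, d2 = dist(v, c1), dist(v, c2)
        near, far = (g1, g2) if d1 <= d2 else (g2, g1)
        near.append(v)
        # borderline vector: its two distances are within the given ratio
        if min(d1, d2) > 0 and max(d1, d2) / min(d1, d2) <= ratio:
            far.append(v)
    return g1, g2

In the tree construction, both children then receive the borderline vectors, which increases the average depth of the tree in exchange for fewer no-backtrack errors.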
References

Delannoy, C. (1980). Un algorithme rapide de recherche de plus proches voisins. RAIRO Informatique 14(3), 275-286.
Devijver, P. and J. Kittler (1982). Pattern Recognition: A Statistical Approach. Prentice-Hall, New York.
De La O, A. (1982). Études pour un vocodeur à classification. Thèse de docteur-ingénieur, ENST. Doc. ENST-E-82001.
Diday, E. and J.C. Simon (1976). Clustering analysis. In: K.S. Fu, Ed., Digital Pattern Recognition. Springer, Berlin-New York.
Friedman, J.H., F. Baskett and L.J. Shustek (1975). An algorithm for finding nearest neighbors. IEEE Trans. Computers (October).
Friedman, J.H., J.L. Bentley and R.A. Finkel (1977). An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3(3).
Fukunaga, K. and P.M. Narendra (1975). A branch and bound algorithm for computing k-nearest neighbours. IEEE Trans. Computers (July).
Hart, P.E. (1968). The condensed nearest neighbor rule. IEEE Trans. Inform. Theory 14 (May).
Lehert, Ph. (1982). Complexité des algorithmes en recherche associative: une synthèse. Colloque "Méthodes Mathématiques en Géographie", Besançon.
Sethi, I.K. (1981). A fast algorithm for recognizing nearest neighbors. IEEE Trans. Systems Man Cybernet. 11(3).
Shamos, M.I. and D. Hoey (1975). Closest point problems. Proc. 16th IEEE Symposium on Foundations of Computer Science.
Viswanatan, V.R., J. Makhoul and R. Schwartz (1982). Medium and low bit rate speech transmission. In: J.P. Haton, Ed., Automatic Speech Analysis and Recognition. Reidel, Dordrecht.
Wong, D.Y., B.H. Juang and A.H. Gray (1982). An 800 bit/s vector quantization LPC vocoder. IEEE Trans. Acoust. Speech Signal Process. 30(5).
Yunck, T.P. (1976). A technique to identify nearest neighbors. IEEE Trans. Systems Man Cybernet. 6(10) (October).