Pattern Recognition Letters 24 (2003) 2743–2749 www.elsevier.com/locate/patrec
Nonparametric discriminant analysis and nearest neighbor classification

M. Bressan *, J. Vitrià

Centre de Visió per Computador (CVC) and Departament d'Informàtica, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain

Received 18 July 2002; received in revised form 10 February 2003

doi:10.1016/S0167-8655(03)00117-X

* Corresponding author. Tel.: +34-93-581-3073/3841; fax: +34-93-581-1670. E-mail addresses: [email protected] (M. Bressan), [email protected] (J. Vitrià).
Abstract

Nonparametric discriminant analysis (NDA), in contrast to other nonparametric techniques, has received little or no attention within the pattern recognition community. Nearest neighbor (NN) classification, on the other hand, has a well-established position among classification techniques due to its practical and theoretical properties. In this paper we observe that when we seek a linear representation adapted to improve NN performance, what we obtain is, not surprisingly, quite close to NDA. Since a hierarchy is provided on the extracted features, the representation also serves as a dimensionality reduction technique that preserves NN performance. Experiments evaluate and compare NN classification using our proposed representation against more classical feature extraction techniques.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Nearest neighbors classifier; Nonparametric discriminant analysis; Face recognition
1. Introduction

K-nearest neighbors (K-NN) (Fix and Hodges, 1951) has long been probably the most intuitive classification rule one can think of. When Cover and Hart (1967) showed that this simple technique has an asymptotic error rate of at most twice the Bayes rate, the K-NN rule received its major boost and its popularity spread. In its simplest version, given a test sample, the K-NN rule
assigns it the most frequent class label among its K nearest neighbors in the training set, according to a distance defined on the feature space, or the label of its single nearest neighbor when K = 1. Besides the theoretical consequences of this rule (Dasarathy, 1990), its main advantage is that it can be applied to datasets with small sample size and high dimensionality, two conditions that are often impassable obstacles for most other pattern classification techniques.
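For concreteness, a minimal NumPy sketch of this rule (ours, not part of the original paper) could be:

```python
import numpy as np

def nn_classify(X_train, y_train, X_test, k=1):
    """Classify each test sample with the K-NN rule (1-NN by default)."""
    predictions = []
    for x in X_test:
        # Euclidean distance from the test sample to every training sample.
        dists = np.linalg.norm(X_train - x, axis=1)
        # Labels of the K closest training samples.
        nearest = y_train[np.argsort(dists)[:k]]
        # Most frequent label among the neighbors (the neighbor's own label when K = 1).
        labels, counts = np.unique(nearest, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)
```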
Feature extraction and selection is a frequent preprocessing stage in pattern classification. When class labels are ignored, this stage usually focuses on dimensionality reduction that preserves representation capability or enhances statistical properties (principal component analysis, independent component analysis, nonnegative matrix factorization, etc.). When labels are considered, the preprocessing is generally focused on obtaining discriminative features (discriminant analysis) as well as on reducing dimensionality. The application of any of these techniques is usually performed regardless of the classifier, even though many assumptions underlie each method and these assumptions frequently affect the classifier (Bressan et al., 2003).

In this paper we explore the nexus between nonparametric discriminant analysis (NDA) and the nearest neighbor (NN) classifier, noticing that a slight modification of NDA results in a representation very likely to improve NN performance. In Section 2 we first introduce a general framework for discriminant analysis and, in particular, NDA as presented by Fukunaga and Mantock (1983). Observing NDA from an NN perspective, we introduce a modification of the original algorithm in Section 2.1. In Section 3 some related works are mentioned. Finally, Section 4 contains experiments on artificial data, on a benchmark database, and on face and gender recognition, illustrating our technique and comparing it to well-known approaches.
2. Nonparametric discriminant analysis

From a feature extraction perspective, discriminant analysis is a tool based on a criterion $J$ and two square matrices, $S_E$ and $S_I$. These matrices generally represent the scatter of sample vectors between different classes, for $S_E$, and within a class (or even class-independent scatter information), for $S_I$. Several criteria that convert these matrices into a single statistic measuring class separability have been proposed (Devijver and Kittler, 1982; Fukunaga, 1990). These measures can be used both for feature selection and for feature extraction. For the latter task, the $D \times M$ linear transform $W$ that satisfies

$\widehat{W} = \arg\max_{W:\, W^T S_I W = I} \operatorname{trace}(W^T S_E W) \qquad (1)$
has been shown to extract the linear features that optimize several such separability measures. This problem has an analytical solution based on the eigenvectors of the scatter matrices; the algorithm presented in Table 1 obtains this solution (Fukunaga, 1990).

We can now turn to the definition of the within- and between-class scatter matrices. The most widespread approach makes use of only up to second-order statistics of the data. This was done in a classic paper by Fisher (1936), and the technique is referred to as Fisher discriminant analysis (FDA). In FDA the within-class scatter matrix is usually computed as a weighted sum of the class-conditional sample covariance matrices. If equiprobable priors are assumed for the classes $C_k$, $k = 1, \ldots, K$, then

$S_I = \frac{1}{K}\sum_{k=1}^{K} \Sigma_k \qquad (2)$
where $\Sigma_k$ is the class-conditional covariance matrix, estimated from the sample set.
Table 1
General algorithm for solving the discriminability optimization problem stated in Eq. (1), given scatter matrices $S_I$ and $S_E$

(1) Given $X$, the matrix containing the data samples placed as $N$ $D$-dimensional columns, $S_I$ the within-class scatter matrix, and $M$ the maximum dimension of the discriminant space.
(2) Compute the eigenvectors and eigenvalues of $S_I$. Make $\Phi$ the matrix with the eigenvectors placed as columns and $\Lambda$ the diagonal matrix with only the nonzero eigenvalues on the diagonal. $M_I$ is the number of nonzero eigenvalues.
(3) Whiten the data with respect to $S_I$ to obtain the $M_I$-dimensional whitened data, $Z = \Lambda^{-1/2}\Phi^T X$.
(4) Compute $S_E$ on the whitened data.
(5) Compute the eigenvectors and eigenvalues of $S_E$ and make $W$ the matrix with the eigenvectors placed as columns, sorted by decreasing eigenvalue.
(6) Preserve only the first $M_E = \min\{M_I, M, \operatorname{rank}(S_E)\}$ columns, $W_M = \{w_1, \ldots, w_{M_E}\}$ (those corresponding to the $M_E$ largest eigenvalues).
(7) The resulting optimal transformation is $\widehat{W} = W_M^T \Lambda^{-1/2}\Phi^T$ and the projected data are $Y = \widehat{W} X = W_M^T Z$.
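A possible NumPy rendering of the algorithm in Table 1 is sketched below. This is our illustration rather than the authors' code: the function name, the eigenvalue tolerance, and the use of a callable to compute $S_E$ on the whitened data are our own choices.

```python
import numpy as np

def discriminant_transform(X, S_I, compute_S_E, M):
    """Sketch of the Table 1 algorithm for the problem in Eq. (1).

    X           -- D x N data matrix, samples as columns.
    S_I         -- D x D within-class scatter matrix.
    compute_S_E -- callable returning the between-class scatter of the whitened data.
    M           -- maximum dimension of the discriminant space.
    Returns the M_E x D transform W_hat and the projected data Y = W_hat @ X.
    """
    # Step 2: eigendecomposition of S_I, keeping only nonzero eigenvalues.
    lam, Phi = np.linalg.eigh(S_I)
    keep = lam > 1e-10
    lam, Phi = lam[keep], Phi[:, keep]          # M_I nonzero eigenvalues

    # Step 3: whiten with respect to S_I: Z = Lambda^{-1/2} Phi^T X.
    whiten = np.diag(lam ** -0.5) @ Phi.T
    Z = whiten @ X

    # Step 4: between-class scatter on the whitened data.
    S_E = compute_S_E(Z)

    # Steps 5-6: eigenvectors of S_E sorted by decreasing eigenvalue, keep M_E of them.
    mu, W = np.linalg.eigh(S_E)
    order = np.argsort(mu)[::-1]
    M_E = min(Z.shape[0], M, np.linalg.matrix_rank(S_E))
    W_M = W[:, order[:M_E]]

    # Step 7: compose the optimal transform and project the data.
    W_hat = W_M.T @ whiten
    return W_hat, W_hat @ X
```

Passing the between-class scatter as a callable reflects step (4): in NDA the between-class scatter must be recomputed from the whitened data rather than merely transformed.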
The between-class scatter matrix is defined as

$S_E = \frac{1}{K}\sum_{k=1}^{K} (\mu_k - \mu_0)(\mu_k - \mu_0)^T \qquad (3)$
where $\mu_k$ is the class-conditional sample mean and $\mu_0$ is the unconditional (global) sample mean. Notice that the rank of $S_E$ is $K-1$, so the number of extracted features is, at most, one less than the number of classes. Also notice the parametric nature of this scatter matrix: the solution provided by FDA is blind beyond second-order statistics, so we cannot expect the method to accurately indicate which features should be extracted to preserve any complex classification structure.

Fukunaga and Mantock (1983) present a nonparametric method for discriminant analysis in an attempt to overcome these limitations of FDA. In nonparametric discriminant analysis the between-class scatter $S_E$ is of nonparametric nature. This scatter matrix is generally full rank, thus loosening the bound on the extracted feature dimensionality. Also, the nonparametric structure of this matrix inherently leads to extracted features that preserve structures relevant for classification. We briefly describe this technique, which is extensively detailed in (Fukunaga, 1990).

In NDA, the between-class scatter matrix is obtained from vectors locally pointing to another class. This is done as follows. The extra-class nearest neighbor for a sample $x \in C_k$ is defined as $x^E = \{x' \notin C_k : \|x' - x\| \le \|z - x\|,\ \forall z \notin C_k\}$. In the same fashion, the intra-class nearest neighbor is defined as $x^I = \{x' \in C_k : \|x' - x\| \le \|z - x\|,\ \forall z \in C_k\}$. Both definitions are extended to the K-NN case by defining $x^E$ ($x^I$) as the mean of the K nearest extra-class (intra-class) samples. From these neighbors or neighbor averages, the extra-class differences are defined as $\Delta^E = x - x^E$ and the intra-class differences as $\Delta^I = x - x^I$. Notice that $\Delta^E$ points locally to the nearest class (or classes) not containing the sample. The nonparametric between-class scatter matrix is defined as

$S_E = \frac{1}{N}\sum_{n=1}^{N} w_n\, \Delta^E_n (\Delta^E_n)^T \qquad (4)$
where $\Delta^E_n$ is the extra-class difference for sample $x_n$ and $w_n$ is a sample weight defined as

$w_n = \frac{\min\{\|\Delta^E_n\|^{\alpha}, \|\Delta^I_n\|^{\alpha}\}}{\|\Delta^E_n\|^{\alpha} + \|\Delta^I_n\|^{\alpha}} \qquad (5)$
and $\alpha$ is a control parameter between zero and infinity. This sample weight is introduced in order to deemphasize samples away from class boundaries. Such samples generally have a large extra-class difference magnitude, exerting an undesirable influence on the scatter matrix: precisely these samples are the ones that carry the least information on the direction of the nearest class. The sample weights in (5) take values close to 0.5 on class boundaries and drop to zero as we move away; the control parameter $\alpha$ adjusts how fast this happens.

A parametric form is chosen for the within-class scatter matrix $S_I$, defined as in (2). This choice is heuristically based on the observation that normalization (the first step of the algorithm in Table 1) should be as global as possible, and that it is considered appropriate to apply the Euclidean metric to whitened data when this metric is used in the NN rule. Fig. 1 illustrates the differences between NDA and FDA on two artificial datasets, one with Gaussian classes where the results are similar, and one where the FDA assumptions are not met. In the second case, the bimodality of one of the classes displaces the class mean, introducing errors in the estimate of the parametric version of $S_E$; the nonparametric version is not affected by this situation.

In the computation of both scatter matrices, for theoretical integrity, Fukunaga includes the class prior probabilities $P(C_k)$. In practice, when no such information is available, uniform priors are assumed and the resulting formulas are (4) and (2).

The average of several nearest neighbors for the intra- and extra-class differences is somewhat artificial, and experiments confirm this: including more neighbors generally does not affect performance. The main reason for this extension in the original NDA paper is that when the number of considered neighbors reaches the total number of available class samples, the features extracted by NDA coincide with those of FDA, so NDA can be considered a nonparametric extension of FDA.
Fig. 1. First directions of NDA and FDA (dashed) projections, for two artificial datasets. Observe the results in the right-hand figure, where the FDA assumptions are not met.
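As an illustration of Eqs. (4) and (5) in the 1-NN case used throughout this paper, a NumPy sketch could look as follows. The helper names and the row-per-sample data layout are our own choices, not the paper's.

```python
import numpy as np

def extra_intra_differences(X, y, n):
    """Extra- and intra-class differences (Delta_E, Delta_I) for sample n, 1-NN case."""
    dists = np.linalg.norm(X - X[n], axis=1)
    dists[n] = np.inf                              # never pick the sample itself
    intra = np.where(y == y[n])[0]
    extra = np.where(y != y[n])[0]
    x_I = X[intra[np.argmin(dists[intra])]]        # intra-class nearest neighbor
    x_E = X[extra[np.argmin(dists[extra])]]        # extra-class nearest neighbor
    return X[n] - x_E, X[n] - x_I

def nda_between_scatter(X, y, alpha=1.0):
    """Nonparametric between-class scatter of Eq. (4) with the weights of Eq. (5)."""
    N, D = X.shape
    S_E = np.zeros((D, D))
    for n in range(N):
        d_E, d_I = extra_intra_differences(X, y, n)
        nE, nI = np.linalg.norm(d_E) ** alpha, np.linalg.norm(d_I) ** alpha
        w = min(nE, nI) / (nE + nI)                # sample weight, Eq. (5)
        S_E += w * np.outer(d_E, d_E)
    return S_E / N
```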
As we will see, NDA can be directly justified as a representation adequate for nearest neighbor classification, so this extension is unnecessary. Throughout this paper we restrict ourselves to the 1-NN case.

2.1. NDA from the nearest neighbor rule perspective

In this section we use the introduced notation to examine the relationship between NN and NDA. This examination results in a modification of the within-class scatter matrix, which we also introduce.

Given a training sample $x$, the accuracy of the 1-NN rule can be directly computed by examining the ratio $\|\Delta^E\|/\|\Delta^I\|$: if this ratio is greater than one, $x$ will be correctly classified. Given the $D \times M$ linear transform $W$ of Eq. (1), the projected differences are defined as $\Delta^{E,I}_W = W^T\Delta^{E,I}$. Notice that this definition does not exactly agree with the extra- and intra-class differences computed in the projected space since, except in the orthonormal transformation case, we have no guarantee of distance preservation; equivalence of the two definitions holds only asymptotically. By the above remarks it is expected that optimizing the following objective function should improve, or at least not degrade, NN performance:

$\widehat{W} = \arg\max_{W:\ E\{\|\Delta^I_W\|^2\} = 1} E\{\|\Delta^E_W\|^2\} \qquad (6)$
This optimization problem can be interpreted as follows: find the linear transform that maximizes the distance between classes while preserving the expected distance among the members of a single class. Considering that

$E\{\|\Delta_W\|^2\} = E\{(W^T\Delta)^T (W^T\Delta)\} = \operatorname{trace}\left(W^T E\{\Delta\Delta^T\} W\right) \qquad (7)$
where $\Delta$ can be either $\Delta^I$ or $\Delta^E$, substituting (7) into (6) shows that this last problem is a particular case of Eq. (1). Additionally, the formulas for the within- and between-class scatter matrices can be read off directly from this equation. In this case the between-class scatter matrix agrees with (4), but the within-class scatter matrix is now also defined in a nonparametric fashion,

$S_I = \frac{1}{N}\sum_{n=1}^{N} \Delta^I_n (\Delta^I_n)^T \qquad (8)$

Given that we have an optimization problem of the form given in (1), the algorithm presented in Table 1 can also be applied to the optimization of our proposed objective function (6). Considerations on sample weights and class priors can be incorporated in the same fashion as in NDA. Theoretical considerations also hold: it can be seen that, if the mean criterion for the nearest neighbor choice is used with all the class members, (8) turns out to be (2).
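Reusing the helpers from the previous sketches (and the same assumptions), the modified within-class scatter of Eq. (8) and the complete modified-NDA pipeline, called NDA2 in the experiments below, might be sketched as:

```python
import numpy as np

def nda2_within_scatter(X, y):
    """Nonparametric within-class scatter of Eq. (8), 1-NN case."""
    N, D = X.shape
    S_I = np.zeros((D, D))
    for n in range(N):
        _, d_I = extra_intra_differences(X, y, n)   # helper from the earlier sketch
        S_I += np.outer(d_I, d_I)
    return S_I / N

def nda2(X, y, M):
    """Modified NDA: nonparametric S_I (Eq. (8)) fed to the Table 1 algorithm."""
    S_I = nda2_within_scatter(X, y)
    # Table 1 recomputes S_E on the whitened data; Z holds samples as columns there.
    compute_S_E = lambda Z: nda_between_scatter(Z.T, y)
    return discriminant_transform(X.T, S_I, compute_S_E, M)
```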
Fig. 2. On each row a toy dataset: (a) original data, (b) whitened data using the usual covariance-based within-class scatter matrix, and (c) whitened data using nonparametric within-class covariance matrix.
Fig. 2 illustrates the difference between choosing (2) or (8) on two simple toy datasets, each representing a single class with a complex distribution. The original data are shown in column (a). Column (b) shows the result of whitening these data with respect to the covariance matrix. This second-order statistic, which measures the average distance to the mean, fails to represent classes with complex (in this case multimodal) distributions such as those shown in Fig. 2. The interpretation of whitening with respect to (8), shown in column (c), is instead quite straightforward: in the whitened data, the distribution of the intra-class nearest neighbor distances is normalized, with the favourable consequences this has for the NN rule.
3. Related works

A close approach to ours was introduced in (Moghaddam et al., 2000). In that article the concept of intra- and extra-class differences is also used, but to obtain a Bayesian measure of face similarity through subspace analysis (PCA) of both difference spaces; the NN rule is then applied to this measure. The main differences with respect to NDA are the parametric nature of the approach (a Gaussian or Gaussian-mixture assumption on the reduced spaces) and a dual analysis of the data within a Bayesian framework instead of a joint discriminant analysis. The main similarity, besides the use of class differences, lies in the dual eigenspaces, obtained from covariance matrices of which our scatter matrices are a particular case: the case in which only the nearest neighbors are used in the computation.

Another close approach using local discriminant information is the discriminant adaptive nearest neighbor classifier introduced by Hastie and Tibshirani (1996). In this case the authors focus on an iterative scheme that obtains a local metric by modifying the local neighborhoods.

4. Experiments
In all the experiments, NDA and the modified NDA (NDA2) were learnt using a single nearest neighbor and no sample weights. A proper adjustment of these parameters to each dataset could further enhance the results; considering that the results are illustrative enough, and for the sake of comparison, we chose not to tune these settings.

A first experiment was performed on the Letter Image Recognition Data (Blake and Merz, 1998). Each of the 20,000 16-dimensional instances in this database represents a capital typewritten letter in one of twenty fonts. Training is done on the first 16,000 instances and testing on the final 4000. Fig. 3 shows the results of 1-NN classification for the different representations; classification was performed at different dimensionalities using the component hierarchy given by each representation.
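The evaluation just described might be organized as in the following sketch; it reuses the nda2 and nn_classify helpers sketched earlier, and the file name and loading step are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical local copy of the UCI letter data (class label first, then 16 integer attributes).
raw = np.genfromtxt("letter-recognition.data", delimiter=",", dtype=str)
X, y = raw[:, 1:].astype(float), raw[:, 0]
X_train, y_train = X[:16000], y[:16000]
X_test, y_test = X[16000:], y[16000:]

# Learn the full NDA2 transform once, then evaluate 1-NN on nested subspaces,
# exploiting the hierarchy the representation provides on the components.
W_hat, _ = nda2(X_train, y_train, M=16)
for d in range(1, W_hat.shape[0] + 1):
    P_train = (W_hat[:d] @ X_train.T).T     # project onto the first d components
    P_test = (W_hat[:d] @ X_test.T).T
    acc = np.mean(nn_classify(P_train, y_train, P_test) == y_test)
    print(f"dimensionality {d:2d}: accuracy {acc:.3f}")
```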
Fig. 3. Accuracies on the letter database with different subspace dimensionalities.
Fig. 4. Training (top row) and test images chosen from the AR face database.
It can be seen that NDA2 outperforms the other representations, with an accuracy of 97.1%. This is remarkable since the best previously reported result on this database, obtained with a modified NN rule, is 95.7% (Fogarty, 1992); our representation already improves on this accuracy at a subspace dimensionality of 12.

Two experiments were performed on the AR Face Database (Martínez and Benavente, 1998): one on recognition subject to strong light variations, the second on gender recognition. For the first experiment, five training and five test images were chosen for each of the 115 subjects. Training and test images were taken over different periods of time and, as can be observed in Fig. 4, the images are subject to strong lighting changes in both sets. Images were subsampled to 24 × 24 pixels. We expect NDA to learn these lighting changes in the within-class whitening stage and to reflect this normalization in the classification. Results are shown in Fig. 5(a). Here NDA2 achieves the best results at practically all dimensionalities, the highest accuracy being 87.7%. Notice that while all three discriminant analysis procedures clearly outperform PCA, classical NDA and FDA have very similar performance, below our modified NDA. This difference can only be attributed to the calculation of the within-class scatter matrix and the normalization of within-class variations: NDA and FDA share the same within-class scatter matrix, so it is NDA2's nonparametric within-class scatter that improves performance. As a reference, NN classification of this problem in the 576-dimensional domain space yields 75.1% accuracy.
Fig. 5. (a) Recognition accuracy on the AR face database with different subspace dimensionalities and no light normalization. (b) Gender recognition accuracy on the same database, using a "leave subject samples out" procedure.
For the second experiment, on gender recognition, images with strong light variation were discarded and the training and test sets were merged into a single dataset, resulting in 256 male and 204 female samples taken from 115 subjects. In this particular case a leave-one-out procedure was employed, each time leaving out all the samples of a given subject. The results shown in Fig. 5(b) follow the same procedure as the previous experiment, with subspace dimensionalities from 1 to 100; images were normalized in variance. Once again NDA2 outperforms the three other evaluated techniques, the highest accuracy being 95.61%. It is quite surprising that, using a single basis vector of this projection, accuracy already stands above 92%. Once again NDA and FDA perform similarly, due to the limitations imposed by a parametric within-class scatter matrix.

Gender learning was recently tested in (Moghaddam and Yang, 2002), where support vector machines were found superior to any other traditional classifier on this task, so we applied this technique to our particular problem in order to compare results. After extensive testing, the best result was achieved with an RBF kernel with c = 3: 93.86%. The experiment was also reproduced using the NN classifier in the domain space, resulting in an accuracy of 85.1%.
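The "leave subject samples out" protocol might be sketched as follows; the function, the subjects array (one subject identifier per sample), and the reuse of the earlier nda2 and nn_classify helpers are our own assumptions, not code from the paper.

```python
import numpy as np

def leave_one_subject_out_accuracy(X, y, subjects, M):
    """Hold out all samples of one subject at a time, train on the rest, classify them."""
    correct, total = 0, 0
    for s in np.unique(subjects):
        test = subjects == s
        train = ~test
        W_hat, _ = nda2(X[train], y[train], M)    # helpers from the earlier sketches
        P_train = (W_hat @ X[train].T).T          # project both partitions
        P_test = (W_hat @ X[test].T).T
        correct += np.sum(nn_classify(P_train, y[train], P_test) == y[test])
        total += test.sum()
    return correct / total
```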
5. Conclusions

Searching for a linear feature extraction technique that preserves or enhances nearest neighbor discriminability results in a slight modification of nonparametric discriminant analysis, affecting only the within-class scatter matrix. The resulting technique has several advantages. The fact that it works with intra- and extra-class distances allows small class sample sizes (a minimum of two samples per class) and high dimensionality of the sample space. Like all linear discriminant analysis techniques it provides a hierarchy on the extracted features and, unlike Fisher discriminant analysis, the number of classes does not constrain the number of features that can be extracted. More generally, its completely nonparametric nature implies no assumption on the class-conditional distributions.

We have tested this technique on a benchmark database, obtaining the best results published to date. We have also tested and compared the technique on the classical problems of face recognition robust to illumination variation and of gender recognition, showing the positive effect our representation has on nearest neighbor classification by comparing it to more classical approaches.
Acknowledgements

This work is supported by MCYT grant TIC2000-0399-C02-01, the Comissionat per a Universitats i Recerca del Departament de la Presidència de la Generalitat de Catalunya, and the Secretaría de Estado de Educación, Universidades, Investigación y Desarrollo of the Ministerio de Educación y Cultura de España.

References

Blake, C., Merz, C., 1998. UCI repository of machine learning databases. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bressan, M., Guillamet, D., Vitrià, J., 2003. Using an ICA representation of local color histograms for object recognition. Pattern Recognition 36 (3), 691–701.
Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27.
Dasarathy, B., 1990. NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA.
Devijver, P., Kittler, J., 1982. Pattern Recognition: A Statistical Approach. Prentice Hall, London, UK.
Fisher, R., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–188.
Fix, E., Hodges, J., 1951. Discriminatory analysis: Nonparametric discrimination: Consistency properties. Tech. Rep. 4, USAF School of Aviation Medicine, February.
Fogarty, T., 1992. First nearest neighbor classification on Frey and Slate's letter recognition problem. Mach. Learn. 9, 387–388.
Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition, second ed. Academic Press, Boston, MA.
Fukunaga, K., Mantock, J., 1983. Nonparametric discriminant analysis. IEEE Trans. PAMI 5, 671–678.
Hastie, T., Tibshirani, R., 1996. Discriminant adaptive nearest neighbor classification. IEEE Trans. PAMI 18, 607–616.
Martínez, A., Benavente, R., 1998. The AR face database. Tech. Rep. 24, Computer Vision Center, June.
Moghaddam, B., Jebara, T., Pentland, A., 2000. Bayesian face recognition. Pattern Recognition 33 (11), 1771–1782.
Moghaddam, B., Yang, M.-H., 2002. Learning gender with support faces. IEEE Trans. PAMI 24 (5), 707–711.