Pattern Recognition Letters 32 (2011) 816–823
An empirical evaluation on dimensionality reduction schemes for dissimilarity-based classifications

Sang-Woon Kim, Department of Computer Science and Engineering, Myongji University, Yongin 449-728, South Korea. Tel.: +82 31 330 6437; e-mail address: [email protected]

This work was supported by the National Research Foundation of Korea funded by the Korean Government (NRF-2010-0015829). Preliminary versions of this paper were partially presented at CIARP 2008, the 13th Iberoamerican Congress on Pattern Recognition, Havana, Cuba, September 9–12, 2008, and at ICAART 2010, the International Conference on Agents and Artificial Intelligence, Valencia, Spain, January 22–24, 2010.
Article history: Received 12 August 2009; available online 18 January 2011. Communicated by F. Roli.
Keywords: Dissimilarity-based classifications; Dimensionality reduction schemes; Prototype selection methods; Linear discriminant analysis
Abstract

This paper presents an empirical evaluation of methods for reducing the dimensionality of dissimilarity spaces in order to optimize dissimilarity-based classifications (DBCs). One problem of DBCs is the high dimensionality of the dissimilarity spaces. To address this problem, two kinds of solutions have been proposed in the literature: prototype selection (PS) based methods and dimension reduction (DR) based methods. Although PS-based and DR-based methods have been explored separately by many researchers, little work has compared the two. Therefore, this paper aims to find a suitable method for optimizing DBCs through a comparative study. Our empirical evaluation, obtained with the two approaches on an artificial data set and three real-life benchmark databases, demonstrates that DR-based methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA) based methods, generally improve the classification accuracies more than PS-based methods do. In particular, the experimental results demonstrate that PCA is more useful for well-represented data sets, while LDA is more helpful for small sample size problems.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

One of the most recent and novel developments in the field of statistical pattern recognition (PR) (Jain et al., 2000) is the concept of dissimilarity-based classifications (DBCs) proposed by Duin and his co-authors (Pekalska and Duin, 2005). DBCs are a way of defining classifiers among the classes; the process is based not on the feature measurements of individual object samples, but rather on a suitable dissimilarity measure among the individual samples. The three major questions encountered when designing DBCs are summarized as follows: (1) How can prototype subsets be selected (or created) from the training samples? (2) How can the dissimilarities between object samples be measured? (3) How can a classifier in the dissimilarity space be designed? Numerous strategies have been used to explore these questions (Kim and Oommen, 2007; Pekalska and Duin, 2005; Pekalska et al., 2006).

First of all, the prototype subsets can be selected in a simple way by using all of the input vectors as prototypes. In most cases, however, this strategy imposes a computational burden on the classifier. To select a prototype subset that is compact and capable of simultaneously representing the entire data set, Duin and his colleagues (Pekalska and Duin, 2005; Pekalska et al., 2006) discussed a number of methods in which a training set is pruned to yield a set of representation prototypes. On the other hand, by invoking a prototype reduction scheme (PRS), Kim and Oommen (2007) also obtained a representative subset, which is utilized by the DBC. Aside from using PRSs, the same authors proposed the use of the Mahalanobis distance as the dissimilarity-measurement criterion. With regard to the second question, investigations have focused on measuring the appropriate dissimilarity by using various lp norms, the modified Hausdorff norms, and traditional PR-based measures, such as those used in template matching and correlation-based analysis (Pekalska and Duin, 2005; Pekalska et al., 2006). Finally, the third question refers to the learning paradigms, especially those which deal with either parametric or nonparametric classifiers. On this subject, the use of many traditional decision classifiers, including the k-NN rule and the linear/quadratic normal-density-based classifiers, has been reported (Pekalska and Duin, 2005). In the interest of brevity, the details of the second and third questions are omitted here, but they can be found in the literature cited above.

In DBCs, a good selection of prototypes seems to be crucial for the classification algorithm to succeed in the dissimilarity space. The prototypes should avoid redundancies, in the sense of not selecting similar samples, and should include as much discriminative information as possible. However, it is difficult to find the optimal number of prototypes. Furthermore, there is a possibility of losing some useful information for discrimination when selecting prototypes (Bicego et al., 2004; Bunke and Riesen, 2007; Kim and Gao, 2008; Riesen et al., 2007).
To avoid these problems, the authors of Bicego et al. (2004), Bunke and Riesen (2007), Kim and Gao (2008), and Riesen et al. (2007) separately proposed an alternative approach in which all of the available samples are selected as prototypes and, subsequently, a dimension reduction scheme, such as principal component analysis (PCA) or linear discriminant analysis (LDA), is applied to reduce the dimensionality. That is, they preferred not to select the representation prototypes directly from the training samples; rather, they proposed applying a dimension reduction scheme after computing the dissimilarity matrix with the entire set of training samples. This approach seems to be more principled and allows us to completely avoid the problem of finding the optimal number of prototypes.

In this paper, we perform an empirical evaluation of the two approaches for reducing the dimensionality of dissimilarity spaces for optimizing DBCs: prototype selection (PS) based methods and dimension reduction (DR) based methods. In PS-based methods, we first select the representation prototypes from the training data set by resorting to one of the prototype selection methods described in Kim and Oommen (2007), Pekalska and Duin (2005), and Pekalska et al. (2006). Then, we compute the dissimilarity matrix, in which each individual dissimilarity is computed on the basis of the measures described in Pekalska et al. (2006). Finally, we perform the classification by invoking a classifier built in the dissimilarity space. In DR-based methods, on the other hand, we prefer not to select the representation prototypes directly from the training samples; rather, we apply a DR after computing the dissimilarity matrix with the entire set of training samples. A point to be addressed here is how to choose the optimal number of prototypes and the subspace dimension to which the data are reduced. In PS-based methods, we heuristically select the same number of (or twice as many) prototypes as the number of classes. In DR-based methods, on the other hand, we can use a cumulative proportion technique (Laaksonen and Oja, 1996; Oja, 1983) to choose the subspace dimensions.

The main contribution of this paper is to present an empirical evaluation of the two approaches for reducing the dimensionality of dissimilarity spaces for optimizing DBCs. This evaluation shows that DBCs can be optimized by employing a dimension reduction scheme as well as a prototype selection method. Here, the aim of using a dimension reduction scheme instead of selecting prototypes is to retain useful information for discrimination and to avoid the problem of finding the optimal number of prototypes. Our experimental results, obtained with the two approaches on an artificial data set and three real-life benchmark databases, demonstrate that DR-based methods, such as PCA and LDA based methods, can generally improve the classification accuracy of DBCs more than PS-based methods can.

The remainder of the paper is organized as follows: in Section 2, after presenting a brief overview of dissimilarity representation and prototype selection methods, we present a scheme for using dimension reduction schemes. Following this, in Section 3, we present the experimental set-up for the prototype selection based methods and the dimension reduction based methods. In Section 4, we present the experimental results for an artificial data set and three real-life image databases. Finally, in Section 5, we present our concluding remarks.
2. Dissimilarity-based classification (DBC)

2.1. Foundations of DBCs

A dissimilarity representation of a set of samples, $T = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$, is based on pairwise comparisons and is expressed, for example, as an $n \times m$ dissimilarity matrix $D_{T,Y}[\cdot,\cdot]$, where $Y = \{y_j\}_{j=1}^{m} \subset \mathbb{R}^d$, a prototype set, is extracted from $T$, and the subscripts of $D$ represent the sets of elements on which the dissimilarities are evaluated. Thus, each entry $D_{T,Y}[i,j]$ corresponds to the dissimilarity between the pair of objects $x_i$ and $y_j$, where $x_i \in T$ and $y_j \in Y$. Consequently, an object $x_i$ is represented as a column (or a row) vector $d(x_i, Y)$ as follows:

$d(x_i, Y) = [d(x_i, y_1), d(x_i, y_2), \ldots, d(x_i, y_m)]^T, \quad 1 \le i \le n. \qquad (1)$

Here, the dissimilarity matrix $D_{T,Y}[\cdot,\cdot]$ defines vectors in a dissimilarity space in which the $d$-dimensional object $x$, given in the feature space, is represented as an $m$-dimensional vector $d(x, Y)$. In this paper, the vector $d(x, Y)$ is simply denoted by $d_Y(x)$.

2.2. Prototype selection (PS) methods

The intention of selecting prototypes is to guarantee a good trade-off between the recognition accuracy and the computational complexity when the DBC is built on $D_{T,Y}[\cdot,\cdot]$ rather than on $D_{T,T}[\cdot,\cdot]$. Various PS methods, such as Random, RandomC, KCentres, ModeSeek, LinProg, FeatSel, KCentres-LP, and EdiCon, have been proposed in the literature (Pekalska and Duin, 2005; Pekalska et al., 2006). KCentres consists of a procedure that is applied to each class separately. For each class $i$, the algorithm is invoked so as to choose $m_i$ center objects, $\{y_j\}_{j=1}^{m_i}$, which are evenly distributed with respect to the dissimilarity matrix $D_{T_i,T_i}[\cdot,\cdot]$. Here, the center objects are chosen from all objects such that the maximum of the distances over all objects to the nearest center is minimized. FeatSel makes use of supervised information in the dissimilarity space, while KCentres is a k-means clustering (unsupervised) algorithm. Also, FeatSel, like a traditional feature selection method (i.e., a forward/backward or plus-L-take-away-R feature selection method; Jain et al., 2000), is capable of selecting an optimal set of prototypes from the original dissimilarity representation according to some separation measure. Therefore, using this PS method, we can reduce the distance matrix $D_{T,T}[\cdot,\cdot]$ to $D_{T,Y}[\cdot,\cdot]$ by selecting an optimal set of prototypes $Y$ according to the leave-one-out 1-NN error, which can be computed directly on the given dissimilarity space $D_{T,T}$. In the interest of compactness, the details of the other methods are omitted here, but they can be found in the related literature.

DBCs summarized in Section 2.1 in which the representation prototypes are selected with a PS are referred to as PS-based DBCs or simply as PS-based methods. An algorithm for PS-based DBCs is summarized in the following:

1. Select the representation subset, Y, from the training set, T, by resorting to one of the prototype selection methods.
2. Using Eq. (1), compute the dissimilarity matrix, $D_{T,Y}[\cdot,\cdot]$, in which each individual dissimilarity is computed on the basis of the measures described in the literature.
3. For a testing sample, z, compute a dissimilarity column vector, $d_Y(z)$, by using the same measure used in Step 2.
4. Achieve the classification by invoking a classifier built in the dissimilarity space and by operating the classifier on the vector $d_Y(z)$.

From these four steps, we can see that the performance of DBCs relies heavily on how well the dissimilarity space, which is determined by the dissimilarity matrix $D_{T,Y}[\cdot,\cdot]$, is constructed. To improve the performance, we need to ensure that the dissimilarity matrix is well designed.
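To make the PS-based pipeline above concrete, the following minimal Python sketch builds a dissimilarity space from a set of prototypes and classifies in it. It assumes the Euclidean distance as the dissimilarity measure, a RandomC-like per-class random prototype selection, and a 1-NN classifier; these are illustrative choices rather than the only ones considered in the paper, and all function names are ours.

```python
# A minimal sketch of a PS-based DBC (Euclidean dissimilarities, random per-class
# prototype selection, 1-NN classifier in the dissimilarity space).
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KNeighborsClassifier

def select_prototypes_randomc(X, y, per_class, rng):
    """Pick `per_class` prototypes at random from each class (RandomC-like)."""
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        idx.extend(rng.choice(members, size=per_class, replace=False))
    return X[np.array(idx)]

def dissimilarity_matrix(T, Y):
    """D_{T,Y}[i, j] = d(x_i, y_j); here d is the Euclidean distance (cf. Eq. (1))."""
    return cdist(T, Y, metric='euclidean')

# toy usage
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4)); y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(25, 4));   y_test = rng.integers(0, 2, size=25)

Y = select_prototypes_randomc(X_train, y_train, per_class=2, rng=rng)  # step 1: 2c prototypes
D_train = dissimilarity_matrix(X_train, Y)        # step 2: n x m dissimilarity space
clf = KNeighborsClassifier(n_neighbors=1).fit(D_train, y_train)
D_test = dissimilarity_matrix(X_test, Y)          # step 3: d_Y(z) for each test sample
print("error rate:", 1.0 - clf.score(D_test, y_test))  # step 4
```

Any of the PS methods listed above (e.g., KCentres or FeatSel) could be substituted for the random selection without changing the rest of the pipeline.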
2.3. Dimension reduction (DR) schemes
With regard to reducing the dimensionality of the dissimilarity spaces, we can use a strategy of employing DRs after computing the dissimilarity matrix with the entire set of training samples.
Numerous DRs have been proposed in the literature (among them Belhumeur et al., 1997; Howland et al., 2006; Loog and Duin, 2004; Martinez and Kak, 2001; Yang et al., 2005; Yu and Yang, 2001). The most well known of these are the classes of linear discriminant analysis (LDA) and principal component analysis (PCA) (Martinez and Kak, 2001). The details of PCA and LDA are omitted here, but we briefly explain below the Chernoff distance-based LDA (CLDA) (Loog and Duin, 2004), which is pertinent to our present study.

It is well known that LDA is incapable of dealing with heteroscedastic data in a proper way. To overcome this limitation, the Fisher criterion can be extended in the transformed space based on the Chernoff measure (Loog and Duin, 2004). In CLDA, the squared Euclidean distance between the class means, $S_E = \frac{1}{p_1 p_2} S_B$, where $S_B$ and $p_i$ are the between-class scatter matrix and the a priori probability of class $i$, respectively, is replaced with the Chernoff distance, defined as

$S_C = S^{-1/2}(m_1 - m_2)(m_1 - m_2)^T S^{-1/2} + \frac{1}{p_1 p_2}\left(\log S - p_1 \log S_1 - p_2 \log S_2\right),$

where $S_i$ and $m_i$ are the scatter matrix and the mean vector of class $i$, respectively, and $S = p_1 S_1 + p_2 S_2$. Using $S_C$, the Fisher separation criterion can be extended to the heteroscedastic Chernoff criterion

$J_H = \mathrm{tr}\Big\{(A S_W A^T)^{-1}\Big(p_1 p_2\, A(m_1 - m_2)(m_1 - m_2)^T A^T - A S_W^{1/2}\big(p_1 \log(S_W^{-1/2} S_1 S_W^{-1/2}) + p_2 \log(S_W^{-1/2} S_2 S_W^{-1/2})\big) S_W^{1/2} A^T\Big)\Big\}$

(see Eq. (5) of Loog and Duin, 2004). CLDA finds a mapping of the labeled training data onto a $q$-dimensional linear subspace ($q < c$) such that it maximizes the heteroscedastic Chernoff criterion $J_H$. Here, to simplify the computation, the within-class scatter matrix $S_W$ can be set to the identity matrix. Therefore, the labeled data set can first be centered and sphered by a Karhunen-Loève mapping, followed by scaling; a solution to the Chernoff criterion can then be determined.
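As a concrete illustration of the criterion above, the short sketch below evaluates $J_H$ for a given linear map A under the simplification $S_W = I$ mentioned in the text (i.e., assuming the data have already been centered and sphered). The function name and the toy example are ours; the actual experiments in Section 3 use the PRTools routine chernoffm instead.

```python
# Sketch: evaluate the heteroscedastic Chernoff criterion J_H for a projection A,
# assuming the data were pre-whitened so that S_W = I (as suggested in the text).
# Illustrative only; this is not the PRTools chernoffm routine.
import numpy as np
from scipy.linalg import inv, logm

def chernoff_criterion(A, m1, m2, S1, S2, p1, p2):
    """J_H = tr{(A A^T)^{-1}(p1 p2 A(m1-m2)(m1-m2)^T A^T - A(p1 log S1 + p2 log S2)A^T)}."""
    dm = (m1 - m2).reshape(-1, 1)
    between = p1 * p2 * (A @ dm @ dm.T @ A.T)            # homoscedastic (Fisher) part
    hetero = A @ (p1 * logm(S1) + p2 * logm(S2)) @ A.T   # heteroscedastic (Chernoff) part
    return float(np.real(np.trace(inv(A @ A.T) @ (between - hetero))))

# toy usage: two 2-D classes with different covariances, projected onto the x-axis
A = np.array([[1.0, 0.0]])
print(chernoff_criterion(A, np.array([1.0, 0.0]), np.array([-1.0, 0.0]),
                         np.eye(2), np.diag([2.0, 0.5]), 0.5, 0.5))
```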
DBCs in which the dimensionality of the dissimilarity space is reduced with a DR are referred to as DR-based DBCs or DR-based methods. An algorithm for DR-based DBCs is summarized in the following:

1. Select the entire set of training samples, T, as the representation subset, Y.
2. Using Eq. (1), compute the dissimilarity matrix, $D_{T,T}[\cdot,\cdot]$, in which each individual dissimilarity is computed on the basis of the measures described in the literature. After computing $D_{T,T}[\cdot,\cdot]$, reduce its dimensionality by invoking a DR.
3. This step is the same as Step 3 of the PS-based DBC (see the previous section).
4. This step is the same as Step 4 of the PS-based DBC (see the previous section).

The rationale of this strategy is presented in a later section together with the experimental results.
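A minimal sketch of this DR-based pipeline is given below, assuming Euclidean dissimilarities, PCA retaining 99% of the variance (as in Section 3.2), and a 1-NN classifier; as before, these choices and the function name are illustrative only.

```python
# A minimal sketch of a DR-based DBC: build D_{T,T} with all training samples as
# prototypes, reduce its dimensionality with PCA (99% of the variance kept), and
# classify in the reduced dissimilarity space.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def dr_based_dbc(X_train, y_train, X_test):
    D_train = cdist(X_train, X_train)          # steps 1-2: Y = T, so D is n x n
    pca = PCA(n_components=0.99).fit(D_train)  # reduce the dissimilarity space
    clf = KNeighborsClassifier(n_neighbors=1).fit(pca.transform(D_train), y_train)
    D_test = cdist(X_test, X_train)            # step 3: d_T(z) for each test sample
    return clf.predict(pca.transform(D_test))  # step 4
```

Replacing the PCA step with an LDA or CLDA projection of the dissimilarity vectors would yield the corresponding LDA/CLDA-based DBCs.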
3. Experimental set-up

3.1. Experimental data

PS-based and DR-based methods were tested and compared with each other by conducting experiments on an artificial data set (which is referred to as XOR4: 4-dimensional Exclusive OR), a digit image data set, and two well-known face databases, namely, Nist38 (Wilson and Garris, 1992), AT&T (Samaria and Harter, 1994), and Yale (Georghiades et al., 2001).

The data set named XOR4, which has been included in the experiment as a baseline data set, was generated from a mixture of four 4-dimensional Gaussian distributions as follows: $p_1(x) = \frac{1}{2}N(\mu_{11}, I_4) + \frac{1}{2}N(\mu_{12}, I_4)$ and $p_2(x) = \frac{1}{2}N(\mu_{21}, I_4) + \frac{1}{2}N(\mu_{22}, I_4)$, where $\mu_{11} = [2, 2, 0, 0]$, $\mu_{12} = [-2, -2, 0, 0]$, $\mu_{21} = [2, -2, 0, 0]$, $\mu_{22} = [-2, 2, 0, 0]$, and $I_4$ is the 4-dimensional identity matrix. XOR4 consists of 400 data points of two classes represented within a square; the 200 points at the first and third corners of the square are in class 1, while the 200 points at the opposite corners are in class 2. From this, it is clear that each class contains two clusters.
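For reference, the following sketch generates an XOR4-style data set matching the description above; the sign pattern of the four means follows the stated XOR corner layout, and the random seed is arbitrary.

```python
# Sketch: generate the XOR4 data set (400 points from four 4-D Gaussians, identity
# covariance); the mean signs follow the XOR corner layout described in the text.
import numpy as np

def make_xor4(n_per_cluster=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = {1: [[+2, +2, 0, 0], [-2, -2, 0, 0]],   # class 1: first and third corners
          2: [[+2, -2, 0, 0], [-2, +2, 0, 0]]}   # class 2: opposite corners
    X, y = [], []
    for label, means in mu.items():
        for m in means:
            X.append(rng.normal(loc=m, scale=1.0, size=(n_per_cluster, 4)))
            y.extend([label] * n_per_cluster)
    return np.vstack(X), np.array(y)

X, y = make_xor4()   # 400 samples, 200 per class
```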
The data set captioned Nist38, chosen from the NIST database (Wilson and Garris, 1992), consists of two kinds of digits, 3 and 8, for a total of 1000 binary images. The size of each image is 32 × 32 pixels, for a total dimensionality of 1024 pixels. The face database of AT&T, formerly known as the ORL database of faces, consists of ten different images of each of 40 distinct subjects, for a total of 400 images. The size of each image is 112 × 92 pixels, for a total dimensionality of 10304 pixels. The face database of Yale contains 165 gray-scale images of 15 individuals. The size of each image is 243 × 320 pixels, for a total dimensionality of 77760 pixels. To reduce the computational complexity of the experiment, the facial images of Yale were down-sampled to 178 × 236 pixels and then represented by a centered vector of normalized intensity values.

On the other hand, when designing a biometric classification system, we sometimes have difficulty collecting sufficient data for each subject. For example, in face recognition, since the images are very high-dimensional, the dimensionality easily exceeds the number of images in the database. In this case, where the number of data samples is smaller than the dimension of the data space, the scatter matrices become singular and their inverses are not defined. Therefore, the Fisher criterion cannot be applied directly for reducing the dimensionality, resulting in what is referred to as a small sample size (SSS) problem (Howland et al., 2006). From this consideration, XOR4 and Nist38 were employed as benchmarks for well-represented data sets, while AT&T and Yale were utilized as examples of the SSS problems.

3.2. Experimental method

In this experiment, the data sets are first randomly split into training sets and test sets in the ratio of 75:25. Then, the training and testing procedures are repeated 30 times and the results obtained are averaged. To measure the dissimilarity between two object samples, we used conventional measures, namely, the Euclidean distance (ED) and the regional distance (RD) (Adini et al., 1997). In RD, the dissimilarity is defined as the average of the minimum difference between the gray value of a pixel and the gray value of each pixel in a 5 × 5 neighborhood of the corresponding pixel. In this case, therefore, the regional distance compensates for a displacement of up to three pixels of the images.
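The regional distance just described can be sketched as follows; the direction of the comparison (pixels of the first image against neighborhoods of the second) and the edge padding at image borders are our assumptions about details the text leaves open.

```python
# Sketch of the regional distance (RD): for every pixel of img1, take the minimum
# absolute gray-value difference against the 5x5 neighborhood of the corresponding
# pixel in img2, then average over all pixels.
import numpy as np

def regional_distance(img1, img2, radius=2):
    padded = np.pad(img2, radius, mode='edge')          # edge padding is an assumption
    h, w = img1.shape
    # stack all (2*radius + 1)^2 shifted views of img2
    shifts = [padded[dy:dy + h, dx:dx + w]
              for dy in range(2 * radius + 1)
              for dx in range(2 * radius + 1)]
    diffs = np.abs(np.stack(shifts) - img1[None, :, :])
    return diffs.min(axis=0).mean()
```

Read this way, the measure is not symmetric in its two arguments; a symmetric variant could average the two directions.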
To construct the dissimilarity matrices, in PS-based methods, we employed RandomC, KCentres, and FeatSel,¹ which were selected, respectively, as a random-based method, a k-means clustering (unsupervised) algorithm, and a supervised algorithm that optimizes the leave-one-out 1-NN error in the dissimilarity space. In DR-based methods, on the other hand, to keep a balance in the comparison between the two approaches, we employed only three DR schemes, namely, PCA (Belhumeur et al., 1997), direct LDA (LDA) (Yu and Yang, 2001), and Chernoff distance-based LDA (CLDA) (Loog and Duin, 2004). Here, PCA and LDA were selected as the counterparts of KCentres and FeatSel, respectively, and CLDA was considered as a more advanced dimensionality reduction scheme. For the sake of researchers who would like to duplicate the results, CLDA was implemented in this experiment with the PRTools (Duin et al., 2004) function chernoffm, with the regularization variable set to 0.1.
¹ Some of the DR methods (e.g., LDA and CLDA) are supervised in the sense that they optimize the between-class vs. within-class scatter information and thereby partially explore the gap between the classes. To be fair, FeatSel (i.e., the plus-L-take-away-R feature selection method) was selected as a supervised PS method. The authors are thankful to the referee who brought this observation to our attention.
In addition, to select the dimensions for these systems, we used the cumulative proportion, $\alpha$, which is defined as follows (Laaksonen and Oja, 1996; Oja, 1983):

$\alpha(q) = \sum_{j=1}^{q} \lambda_j \Big/ \sum_{j=1}^{d} \lambda_j. \qquad (2)$

Here, the subspace dimension, $q$, of the data sets (where $d$ and $\lambda_j$ are the dimensionality and the $j$th eigenvalue) is determined by considering the cumulative proportion $\alpha(q)$. The eigenvectors and eigenvalues are computed, and the cumulative sum of the eigenvalues is compared with a fixed number, $k$. In other words, the subspace dimensions are selected by considering the following relation:

$\alpha(q) \le k \le \alpha(q+1). \qquad (3)$

The algorithm employed can be formalized as follows:

1. Compute the covariance matrix, G, with the training data set, T. From G, compute the eigenvalues, $\lambda_j$, $j = 1, \ldots, d$, and the cumulative proportion, $\alpha(q)$, as in Eq. (2).
2. Select q as the final best subspace dimension by considering Eq. (3).
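A compact sketch of this dimension-selection procedure is given below; it reads Eq. (3) as picking the smallest q whose cumulative proportion reaches the threshold k (k = 0.99 in the experiments), and the function name is ours.

```python
# Sketch of subspace-dimension selection via the cumulative proportion (Eqs. (2)-(3)).
import numpy as np

def select_subspace_dim(T, k=0.99):
    G = np.cov(T, rowvar=False)                   # covariance matrix of the training data
    eigvals = np.sort(np.linalg.eigvalsh(G))[::-1]
    alpha = np.cumsum(eigvals) / eigvals.sum()    # alpha(q), q = 1, ..., d  (Eq. (2))
    return int(np.searchsorted(alpha, k) + 1)     # smallest q with alpha(q) >= k (one reading of Eq. (3))
```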
In this experiment, the PS-based approaches run in dissimilarity spaces of either c or 2c dimensions (where c is the number of classes). The DR-based approaches, on the other hand, work in q-dimensional spaces, where q is chosen such that 99% of the variance is preserved. More specifically, the dimensionality of the dissimilarity space for the PCA-based method is q, while that of the LDA/CLDA-based methods is min(q, c − 1). Finally, to maintain diversity among the dissimilarity-based classifications, we designed different classifiers, such as the k-nearest neighbor classifier (k = 1), regularized normal-density-based linear/quadratic classifiers, and support vector machines (Fan et al., 2005). All of the classifiers mentioned above were implemented with PRTools (Duin et al., 2004) and are denoted in the next section as knnc, ldc, qdc, and libsvm, respectively. Here, ldc and qdc were regularized with the parameters (r, s) = (0.01, 0.01) when computing the covariance matrix G by: G = (1 − r − s)*G + r*diag(diag(G)) + s*mean(diag(G))*eye(size(G,1)), where the functions diag, mean, eye, and size return, respectively, the diagonal of a matrix, the mean of a vector, an identity matrix, and the size of an array. Also, libsvm (referred to as libsvc in PRTools) was implemented using the software provided in Fan et al. (2005) (i.e., svm-train and svm-predict) and a polynomial kernel function of degree 1.
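For completeness, the covariance regularization quoted above can be written out as a one-line helper; this mirrors the formula in the text rather than the internal PRTools implementation.

```python
# Sketch of the covariance regularization used for ldc/qdc:
# G <- (1 - r - s)*G + r*diag(diag(G)) + s*mean(diag(G))*I, with (r, s) = (0.01, 0.01).
import numpy as np

def regularize_covariance(G, r=0.01, s=0.01):
    d = np.diag(np.diag(G))                        # keep only the diagonal of G
    return (1 - r - s) * G + r * d + s * np.mean(np.diag(G)) * np.eye(G.shape[0])
```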
Fig. 1. Plots of the estimated error rates of PS-based DBCs for Yale, obtained with knnc, ldc, qdc, and libsvm (panels (a)-(d), respectively); x-axis: dimensions (1-128), y-axis: estimated error rates, curves: Euclidean distance (ED) and regional distance (RD). Here, the prototype subsets are selected with KCentres.
4. Experimental results

The run-time characteristics of the empirical evaluation on the four data sets are reported below in figures and tables. We first investigate the rationale of employing a PS or a DR method for reducing the dimensionality. Then, we present the classification accuracies of the PS- and DR-based methods for an artificial data set and three real-life databases. Based on the classification results, we then rank the methods. Finally, we provide a comparison of the processing CPU-times.

Firstly, the experimental results of the PS- and DR-based methods explained in Sections 2.2 and 2.3 were probed with figures. Figs. 1 and 2 show plots of the estimated error rates obtained with knnc, ldc, qdc, and libsvm for Yale. In Fig. 1, the dissimilarity matrix in which the four classifiers were evaluated was generated with the prototype subset selected with a PS (i.e., KCentres). In Fig. 2, on the other hand, after generating the dissimilarity matrix with the
entire data set, the dimensionality of the matrix was reduced by invoking a DR algorithm (i.e., PCA). In the figures, it is interesting to note that the PS- and DR-based DBCs can be optimized by choosing the number of prototypes or by reducing the dimensionality of the dissimilarity space. Consider first the results of RD for knnc in Fig. 1(a) and Fig. 2(a). In this case, the classification accuracies of the two DBCs saturate when the number of prototypes and the subspace dimension are 8 and 16, respectively. Similarly, in the cases of ldc, qdc, and libsvm, the accuracies saturate when the number of prototypes and the subspace dimensions are 16 or 32. The same characteristics could also be observed for the AT&T and Nist38 databases; the results for these databases are omitted here to avoid repetition. Here, the problem to be addressed is how to choose the optimal number of prototypes and the dimension to be reduced.
Fig. 2. Plots of the estimated error rates of DR-based DBCs for Yale, obtained with knnc, ldc, qdc, and libsvm (panels (a)-(d), respectively); x-axis: dimensions (1-128), y-axis: estimated error rates, curves: Euclidean distance (ED) and regional distance (RD). Here, the subspace dimensions are selected with PCA.

Fig. 3. Plots of the characteristics between the criterion values and the number of features selected with FeatSel for the Nist38 (2c = 4), AT&T (2c = 80), and Yale (2c = 30) data sets, obtained with the ED and RD metrics (panels (a) and (b), respectively). Here, the selected dimensions for the three data sets are 4, 80, and 30, respectively.
In PS-based methods, an experimental parameter, such as c or 2c, can be selected as the number of prototypes. For FeatSel, we selected an optimal set of prototypes by computing a criterion value based on the leave-one-out 1-NN classification applied directly to the dissimilarity matrix. To provide an argument for why such a criterion makes sense, we investigated the changes in the criterion values evaluated for the Nist38, AT&T, and Yale data sets. Fig. 3 shows a plot of the characteristics between the criterion values and the number of features selected with FeatSel. The figure shows that we can optimize the number of (selected) features by considering the criterion values for the data sets. From this consideration, it is evident that the rationale for employing this criterion works well. In DR-based methods, on the other hand, the subspace dimension can be chosen by using the cumulative proportion (Laaksonen and Oja, 1996; Oja, 1983). For example, in this experiment, by considering Eqs. (2) and (3) with k = 0.99, we chose the subspace dimensions as follows: q_ED = 7 for XOR4; q_ED = 41 and q_RD = 22 for Nist38; q_ED = 184 and q_RD = 21 for AT&T; and q_ED = 76 and q_RD = 15 for Yale.
4.1. Experimental results: artificial data sets

Although it is hard to quantitatively evaluate the various PS- and DR-based methods, we have attempted to do exactly this. First, the experimental results obtained with the artificial data set, XOR4, were examined. Table 1 shows the estimated error rates (standard deviations) of the two approaches for XOR4. In the table, the PS/DR methods in the second column refer to the prototype selection methods and/or the dimension reduction schemes, and the remaining columns give the estimated error rates obtained with the four classifiers (PS-based and DR-based DBCs). Here, the values marked with an asterisk are the lowest ones among the six error rates of each classifier (three of RandomC, KCentres, and FeatSel and three of PCA, LDA, and CLDA). As can be seen in Fig. 1, increasing the cardinality of the prototype subset decreases the average error rates of the resulting DBCs. For this reason, we repeated the experiment for the benchmark data set with two dissimilarity matrices constructed with cardinalities of c and 2c, respectively. By comparing the two error rates, we observed that the error rate achieved with the 2c cardinality is generally lower than that with the c cardinality. In the interest of brevity, however, only the error rates obtained with the cardinality of 2c are presented here.
Table 1. A numerical comparison of the estimated error rates (standard deviations) of PS/DR-based DBCs for XOR4 in the dissimilarity spaces of 2c and q cardinalities.

| Data set | PS/DR methods | knnc | ldc | qdc | libsvm |
| XOR4 | RandomC | 0.1050 (0.0329) | 0.3560 (0.1390) | 0.1087 (0.0384) | 0.4213 (0.1222) |
| | KCentres | 0.0887 (0.0286) | 0.1100 (0.0396) | 0.0687 (0.0274) | 0.1397 (0.0540) |
| | FeatSel | 0.1007 (0.0275) | 0.3200 (0.1243) | 0.1037 (0.0325) | 0.3823 (0.1055) |
| | PCA | 0.0883 (0.0241) | 0.0623 (0.0237)* | 0.0613 (0.0247)* | 0.0560 (0.0239)* |
| | LDA | 0.2973 (0.0841) | 0.3103 (0.1405) | 0.2980 (0.1284) | 0.3043 (0.1368) |
| | CLDA | 0.0878 (0.0406)* | 0.1144 (0.0516) | 0.2822 (0.0747) | 0.1122 (0.0442) |
From the results shown in Table 1, we can see that almost all of the lowest error rates are achieved with a DR-based method, such as PCA or LDA. For example, the lowest rate in the knnc column is obtained with CLDA, and the lowest rates of the ldc, qdc, and libsvm classifiers are all achieved with PCA. This observation confirms the possibility that we can improve the classification accuracy of DBCs by effectively reducing the dimensionality of dissimilarity spaces after constructing them with all of the training samples.

To observe how well the PS/DR methods work, we picked the best three among the six methods (three PSs and three DRs) for each classifier and ranked them in order from the lowest to the highest error rate.
Table 2. Top three PS/DR methods in terms of error rates obtained with the four classifiers for XOR4 in the dissimilarity spaces of 2c and q cardinalities.

| Data set | Performance rank | knnc | ldc | qdc | libsvm |
| XOR4 | 1st | CLDA | PCA | PCA | PCA |
| | 2nd | PCA | KCentres | KCentres | CLDA |
| | 3rd | KCentres | CLDA | FeatSel | KCentres |
Although this comparison is a very simplistic model of comparison, we believe that it is the easiest approach a researcher can employ when dealing with algorithms that have different characteristics. A more scientific model for comparison, like the one employed by Sohn (1999), is an avenue for future work.

Table 2 shows the rankings of the PS and DR methods for XOR4 in terms of classification accuracies (error rates). In this comparison, DBCs were also designed in the dissimilarity spaces of cardinality c, 2c, and q (or min(q, c − 1) for LDA and CLDA), respectively. In the interest of brevity, however, as in Table 1, only the rankings of the two DBCs of 2c and q cardinalities are presented here. From the table, we can clearly observe the possibility of improving the classification performance of DBCs by utilizing PCA.

4.2. Experimental results: real-life data sets

In order to further investigate the run-time characteristics gained by utilizing the PS and DR-based methods, we repeated the experiment for three real-life databases, namely, Nist38, AT&T, and Yale. Table 3 shows the estimated error rates (standard deviations) of these two approaches for the three databases. In the table, ED and RD in the second column are abbreviations for the dissimilarity measures employed, and the other entries are the same as in Table 1.

Consider the classification results obtained with the first data set, Nist38. From the results, we can see that all of the lowest error rates are achieved with a DR-based method, PCA.
Table 3. A numerical comparison of the estimated error rates (standard deviations) of PS/DR-based DBCs for the three real-life databases in the dissimilarity spaces of 2c and q cardinalities. For each database (Nist38, AT&T, and Yale) and each distance metric (ED and RD), the table lists the error rates of the six PS/DR methods (RandomC, KCentres, FeatSel, PCA, LDA, and CLDA) under the four classifiers (knnc, ldc, qdc, and libsvm).
Table 4. Top three PS/DR methods in terms of error rates obtained with the four classifiers for the three real-life databases in the dissimilarity spaces of 2c and q cardinalities.

| Data set | Metric | Rank | knnc | ldc | qdc | libsvm |
| Nist38 | ED | 1st | PCA | PCA | PCA | PCA |
| | | 2nd | FeatSel | FeatSel | FeatSel | FeatSel |
| | | 3rd | CLDA | LDA | LDA | LDA |
| | RD | 1st | PCA | PCA | PCA | PCA |
| | | 2nd | FeatSel | FeatSel | FeatSel | FeatSel |
| | | 3rd | KCentres | KCentres | KCentres | LDA |
| AT&T | ED | 1st | CLDA | PCA | LDA | CLDA |
| | | 2nd | LDA | LDA | RandomC | PCA, LDA (tie) |
| | | 3rd | PCA | CLDA | KCentres | - |
| | RD | 1st | LDA | KCentres | LDA | LDA |
| | | 2nd | CLDA | RandomC | PCA | CLDA |
| | | 3rd | FeatSel | FeatSel | RandomC | PCA |
| Yale | ED | 1st | CLDA | PCA | LDA | CLDA |
| | | 2nd | LDA | LDA | PCA | LDA |
| | | 3rd | PCA | KCentres | FeatSel | PCA |
| | RD | 1st | LDA | RandomC | LDA | LDA |
| | | 2nd | CLDA | PCA | PCA, FeatSel (tie) | CLDA |
| | | 3rd | FeatSel | LDA | - | PCA |
Similar, though not exactly the same, characteristics could also be observed for the second and third data sets. For AT&T, we can see one exception, 0.0138 for (RD - KCentres - ldc), which occurred with a PS method, i.e., KCentres. Next, for Yale, we can also see another exception, 0.0300 for (RD - RandomC - ldc). From the table, it should also be mentioned that the classification error of CLDA is sometimes greater than that of LDA, even though CLDA has been considered as a more advanced dimensionality reduction scheme. For example, for the AT&T data set, the classification error rates of (RD - LDA - qdc) and (RD - CLDA - qdc) are 0.0167 and 0.1479, respectively. This difference in classification accuracy between the two methods seems to result from the assumptions made in computing the heteroscedastic Chernoff criterion and from the choice of the regularization variable. Finding the exact causes of this difference is an avenue for future work.

To avoid the confusion caused by the exceptions encountered and to observe how well the methods work, we again picked the best three among the six methods for each classifier and ranked them as in Table 2. Table 4 shows the rankings of the PS and DR methods for Nist38, AT&T, and Yale in terms of classification accuracies (error rates). From the table, we can clearly observe the possibility of improving the classification performance of DBCs by utilizing DRs.

Consider the ranking of the first data set, Nist38, shown in the first rows of Table 4. From the rankings, we can see that almost all of the higher rankings come from the results of DRs, namely, PCA and LDA, and only a few of them come from PSs, such as FeatSel and KCentres. In the case of ED, only four among the 12 entries (three ranks × four classifiers) are the results of PS-based DBCs, and the remaining ones are those of DR-based DBCs. However, the second row of Nist38, which was obtained with the RD metric, shows a different trend for some classifiers. From these considerations, we can see that it is difficult to grade the methods as they are. Therefore, for a simple comparison, we first gave marks of 3, 2, or 1 to all entries (DR and PS methods) of Tables 2 and 4 according to their ranks: 3 marks for the 1st rank, 2 marks for the 2nd, and 1 mark for the 3rd. Then, we summed all the marks that each method earned over the four classifiers and the two measuring methods. As an example, for Nist38 in Table 4, the marks that PCA earned from the ED and RD rows are 12 (= 3 + 3 + 3 + 3) and 12 (= 3 + 3 + 3 + 3), respectively; thus, the total number of marks that PCA earned is 24. Using the same system, we graded the other DR (and PS) methods and, as a final ranking, obtained the competition result shown in Table 5.
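The 3/2/1 grading just described is simple enough to express directly; the sketch below recomputes the Nist38 grades from the Table 4 rankings as a check (function names are ours).

```python
# Sketch of the 3/2/1 grading behind Table 5: every 1st, 2nd, and 3rd place in
# Tables 2 and 4 earns 3, 2, and 1 marks, summed per method.
from collections import Counter

MARKS = {1: 3, 2: 2, 3: 1}

def grade(rankings):
    """rankings: list of (1st, 2nd, 3rd) tuples, one per classifier/metric pair."""
    totals = Counter()
    for ranked in rankings:
        for place, method in enumerate(ranked, start=1):
            totals[method] += MARKS[place]
    return totals.most_common()

nist38 = [('PCA', 'FeatSel', 'CLDA'), ('PCA', 'FeatSel', 'LDA'),
          ('PCA', 'FeatSel', 'LDA'), ('PCA', 'FeatSel', 'LDA'),       # ED: knnc, ldc, qdc, libsvm
          ('PCA', 'FeatSel', 'KCentres'), ('PCA', 'FeatSel', 'KCentres'),
          ('PCA', 'FeatSel', 'KCentres'), ('PCA', 'FeatSel', 'LDA')]  # RD: knnc, ldc, qdc, libsvm
print(grade(nist38))   # PCA: 24, FeatSel: 16, LDA: 4, ... (cf. Table 5)
```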
Table 5. Grades of the PS/DR methods obtained with the four classifiers for the four data sets, where the values in parentheses are the grades obtained.

| Rank | XOR4 | Nist38 | AT&T | Yale |
| 1st | PCA (11) | PCA (24) | LDA (18) | LDA (19) |
| 2nd | CLDA (6) | FeatSel (16) | CLDA (11) | PCA (12) |
| 3rd | KCentres (5) | LDA (4) | PCA (9) | CLDA (10) |
From the final ranking, we can see that almost all of the three highest ranks belong to DR methods. Thus, in general, it can be said that a more satisfactory optimization of DBCs can be achieved by applying a DR after building the dissimilarity space with all of the available samples, rather than by selecting a prototype subset from them.

The other observations obtained from the experiments are the following. PCA + LDA, which is one of the most popular LDA-based strategies for dealing with SSS problems, should lead to worse results than LDA in the cases of XOR4 and Nist38, simply because these data sets are well represented. On the other hand, we should expect an improvement of PCA + LDA over LDA for the other data sets, as the covariance matrices (between-class scatter and within-class scatter) are badly estimated in these enormously high-dimensional spaces.³ In addition, it is interesting to note that, among the three PS methods, we obtained the lowest error rates with KCentres and FeatSel for XOR4 and Nist38, respectively. For AT&T and Yale, however, we obtained the lowest ones with RandomC. As mentioned in Section 3.2, to be fair, we selected FeatSel as a PS method that utilizes supervised information. However, we failed to obtain an improved accuracy with FeatSel for the SSS problems. We also failed to answer the following questions: Why is the accuracy of RD poorer than that of ED?⁴ Why is the accuracy of CLDA lower than that of LDA? Answering these questions remains an open problem.

Finally, in an attempt to provide a comparison between the PS-based and DR-based methods, we present an analysis of their computational complexities. Rather than embark on another analysis of the computational complexities of the two approaches, however, we simply measured their processing CPU-times (in seconds) for the four data sets.
³ We are also thankful to the referee who informed us about this point.
⁴ Note that RD was developed to improve upon the performance of the ED metric (Adini et al., 1997).
Table 6. A comparison of the processing CPU-times (in seconds) of the PS and DR methods under the ED metric for the four data sets (sample # / dimension # / class #). Each processing time is obtained by averaging the results of ten iterations on a Windows platform (CPU: 2.67 GHz, RAM: 3.50 GB).

| PS/DR methods | XOR4 (400/4/2) | Nist38 (1000/1024/2) | AT&T (400/10304/40) | Yale (165/42008/15) |
| PS: RandomC | 0.00 | 0.00 | 0.00 | 0.00 |
| PS: KCentres | 0.09 | 0.49 | 0.36 | 0.13 |
| PS: FeatSel | 28.82 | 619.44 | 1851.83 | 83.65 |
| DR: PCA | 0.79 | 15.55 | 1.16 | 0.23 |
| DR: LDA | 0.06 | 3.11 | 0.07 | 0.00 |
| DR: CLDA | 2.03 | 47.54 | 1291.55 | 20.89 |
Table 6 shows a comparison of the averaged processing CPU-times (which do not include the time for classification) of the PS and DR methods under the ED metric for the four data sets. From Table 6, we can observe that the processing CPU-times increase considerably when FeatSel and CLDA are applied. In particular, it should be mentioned that the processing time of these two methods increases sharply as the number of classes increases. For example, consider the processing times for Nist38: the processing times for the PCA, LDA, and CLDA methods are, respectively, 15.55, 3.11, and 47.54 s, while those of RandomC, KCentres, and FeatSel are, respectively, 0.00, 0.49, and 619.44 s. The same characteristic could also be observed in the results for the RD metric.

In review, the experimental results show that when a DR-based method, namely, a PCA or LDA-based method, is applied to the dissimilarity representation, the classification accuracy of the resultant DBCs increases. More specifically, in terms of classification accuracies (error rates), PCA is clearly more useful for the well-represented data sets, such as XOR4 and Nist38, while LDA is more helpful for the SSS problems, such as AT&T and Yale.

5. Conclusions

In order to reduce the dimensionality of dissimilarity spaces for optimizing dissimilarity-based classifications (DBCs), two kinds of approaches, prototype selection (PS) based methods and dimension reduction (DR) based methods, have been explored separately in the literature. The aim of this paper is to empirically evaluate the two kinds of methods in terms of classification accuracy on an artificial data set and three real-life benchmark databases. It is a well-known fact that classification algorithms based on LDA are superior to those based on PCA, while PCA can outperform LDA when the training data set is small (Martinez and Kak, 2001). However, no empirical comparison had been made between the performances of DBCs in dissimilarity spaces reduced by PCA and by LDA. Our experimental results, obtained with these examples, demonstrate that DR-based methods, such as PCA and LDA based methods, surpass PS-based methods in terms of classification accuracy. The results also show that the PCA-based method is generally more useful for well-represented data sets, while the LDA-based method works better for the SSS problems.

Despite this success, problems remain to be addressed. First, in this evaluation, we employed a very simplistic model of comparison; developing a more scientific model, such as the one in Sohn (1999), is an avenue for future work. Besides this, we failed to obtain an improved accuracy with the FeatSel and CLDA methods, and we still have to answer the question of why the accuracy of RD is lower than that of ED. Answering these questions requires further analysis of the experimental results obtained.

Acknowledgments

The author thanks the anonymous referees for their valuable comments, which improved the quality and readability of the paper.
References

Adini, Y., Moses, Y., Ullman, S., 1997. Face recognition: The problem of compensating for changes in illumination direction. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 721-732.
Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J., 1997. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711-720.
Bicego, M., Murino, V., Figueiredo, M.A.T., 2004. Similarity-based classification of sequences using hidden Markov models. Pattern Recognit. 37, 2281-2291.
Bunke, H., Riesen, K., 2007. A family of novel graph kernels for structural pattern recognition. In: Proc. 12th Iberoamerican Congress on Pattern Recognition, Vina del Mar-Valparaiso, Chile, LNCS 4756, pp. 20-31.
Duin, R.P.W., Juszczak, P., de Ridder, D., Paclik, P., Pekalska, E., Tax, D.M.J., 2004. PRTools 4: A Matlab Toolbox for Pattern Recognition. Technical Report, Delft University of Technology, The Netherlands.
Fan, R.-E., Chen, P.-H., Lin, C.-J., 2005. Working set selection using second order information for training support vector machines. J. Machine Learning Research 6, 1889-1918.
Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J., 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Machine Intell. 23 (6), 643-660.
Howland, P., Wang, J., Park, H., 2006. Solving the small sample size problem in face recognition using generalized discriminant analysis. Pattern Recognit. 39, 277-287.
Jain, A.K., Duin, R.P.W., Mao, J., 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Machine Intell. 22 (1), 4-37.
Kim, S.-W., Gao, J., 2008. On using dimensionality reduction schemes to optimize dissimilarity-based classifiers. In: Proc. 13th Iberoamerican Congress on Pattern Recognition, Havana, Cuba, LNCS 5197, pp. 309-316.
Kim, S.-W., Oommen, B.J., 2007. On using prototype reduction schemes to optimize dissimilarity-based classification. Pattern Recognit. 40, 2946-2957.
Laaksonen, J., Oja, E., 1996. Subspace dimension selection and averaged learning subspace method in handwritten digit classification. In: Proc. ICANN, Bochum, Germany, pp. 227-232.
Loog, M., Duin, R.P.W., 2004. Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Trans. Pattern Anal. Machine Intell. 26 (6), 732-739.
Martinez, A.M., Kak, A.C., 2001. PCA versus LDA. IEEE Trans. Pattern Anal. Machine Intell. 23 (2), 228-233.
Oja, E., 1983. Subspace Methods of Pattern Recognition. Research Studies Press.
Pekalska, E., Duin, R.P.W., 2005. The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific Publishing, Singapore.
Pekalska, E., Duin, R.P.W., Paclik, P., 2006. Prototype selection for dissimilarity-based classifiers. Pattern Recognit. 39, 189-208.
Riesen, K., Kilchherr, V., Bunke, H., 2007. Reducing the dimensionality of vector space embeddings of graphs. In: Proc. 5th Internat. Conf. on Machine Learning and Data Mining, LNAI 4571, pp. 563-573.
Samaria, F., Harter, A., 1994. Parameterisation of a stochastic model for human face identification. In: Proc. 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, FL, pp. 215-220.
Sohn, S.Y., 1999. Meta analysis of classification algorithms for pattern recognition. IEEE Trans. Pattern Anal. Machine Intell. 21 (11), 1137-1144.
Wilson, C.L., Garris, M.D., 1992. Handprinted Character Database 3. Technical Report, National Institute of Standards and Technology, Gaithersburg, Maryland.
Yang, J., Zhang, D., Yong, X., Yang, J.-Y., 2005. Two-dimensional discriminant transform for face recognition. Pattern Recognit. 38, 1125-1129.
Yu, H., Yang, J., 2001. A direct LDA algorithm for high-dimensional data, with application to face recognition. Pattern Recognit. 34, 2067-2070.