Neurocomputing 74 (2011) 2176–2183
Discriminative learning by sparse representation for classification

Fei Zang a,b,*, Jiangshe Zhang a,b

a School of Science, Xi'an Jiaotong University, Xi'an 710049, PR China
b State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China

* Corresponding author at: School of Science, and State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University, Xi'an 710079, PR China. Tel.: +86 29 82665961. E-mail address: [email protected] (F. Zang).
Article history: Received 7 August 2010; received in revised form 7 December 2010; accepted 7 February 2011; available online 29 March 2011. Communicated by H. Yu.

Abstract
Recently, the sparsity preserving projections (SPP) algorithm has been proposed, which combines the l1-graph, preserving the sparse reconstructive relationship of the data, with classical dimensionality reduction. However, when applied to classification problems, SPP only focuses on the sparse structure but ignores the label information of the samples. To enhance the classification performance, a new algorithm termed discriminative learning by sparse representation projections, or DLSP for short, is proposed in this paper. The DLSP algorithm incorporates the merits of both local interclass geometrical structure and the sparsity property. This gives it the advantages of sparse reconstruction and, more importantly, a better capacity for discrimination, especially when the size of the training set is small. Extensive experimental results on several publicly available data sets show the feasibility and effectiveness of the proposed algorithm. © 2011 Elsevier B.V. All rights reserved.
Keywords: Sparse reconstruction; Classification; Dimensionality reduction; Discriminative learning by sparse representation projections
1. Introduction

High-dimensional data such as face images and gene microarrays exist in many pattern recognition and data mining applications. Such data present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments [1]. One of the problems is the so-called "curse of dimensionality" [2]: the dimensionality of the data is much larger than the number of samples. Interestingly, in many cases the number of "important" dimensions is not so large. Thus, before applying some algorithms to high-dimensional data, we need to reduce the dimensionality of the original data. Dimensionality reduction plays an increasingly important role in practical data processing and analysis tasks, and has been used successfully in many machine learning fields such as face recognition [3,4] and remote sensing data analysis [5].

Among existing dimensionality reduction algorithms, principal component analysis (PCA) [6] and linear discriminant analysis (LDA) [7] are two classical subspace learning methods. PCA seeks an optimal projection direction such that the covariance of the projected data is maximized. PCA is based on a low-dimensional representation of the original high-dimensional data and does not exploit the label information of the samples. Unlike PCA, LDA is a supervised method which seeks an optimal discriminative subspace by maximizing the between-class scatter while minimizing the within-class scatter.
Recently, a number of manifold learning algorithms have been proposed, which are useful for analyzing data that lie on or near a submanifold of the original space. Locally linear embedding (LLE) [8], Laplacian eigenmaps (LE) [9], neighborhood preserving embedding (NPE) [10] and locality preserving projections (LPP) [11] are typical representatives of manifold learning algorithms. LLE is an unsupervised learning algorithm which learns the local structure of a nonlinear manifold by computing low-dimensional, neighborhood-preserving embeddings of the original high-dimensional data. LE preserves the proximity relations of pairwise samples via manipulations on an undirected weighted graph. NPE and LPP model the local submanifold structure by maintaining the neighborhood relations of the samples before and after the transformation. In [12], PCA, LDA and LPP are unified under the graph embedding framework; that is, they differ only in the graph structure and the edge weights. As a result, the graph lies at the heart of many existing dimensionality reduction methods, and how to construct a graph is a key issue in designing different dimensionality reduction algorithms. The k-nearest-neighbor and ε-ball neighborhood criteria, both based on the Euclidean norm, are two popular ways of constructing a graph. They are simple to compute and easy to operate, but their two parameters, the neighborhood size k and the ball radius ε, are sensitive to noise and difficult to determine in many real problems. Qiao [13] and Cheng [14] make use of sparse representation to construct a novel graph, the l1-graph, which inherits many merits of sparse reconstruction and builds the graph adaptively without model parameters. However, they do not use prior knowledge of class identities; that is, they are unsupervised. In this paper, to enhance the classification performance of the SPP algorithm, we propose a new algorithm, called discriminative
learning by sparse representation projections, or DLSP for short. For each sample $x_i$, $i = 1, 2, \ldots, n$, we use the label information to divide the robust sparse representation coefficients obtained from the original data into two groups: one corresponding to samples with the same label as $x_i$ and the other to samples with a different label from $x_i$. To achieve better classification performance, the proposed algorithm incorporates the merits of both local interclass geometrical structure and the sparsity property: DLSP demands local within-class compactness, and also requires that the samples with a different label from $x_i$ lie close to their respective class centroids. This gives DLSP the advantages of sparse reconstruction and, more importantly, a better capacity for discrimination, especially when the size of the training set is small. The rest of the paper is organized as follows: in Section 2, we briefly review the original l1-graph and the SPP algorithm. We propose our new algorithm in Section 3. In Section 4, the experimental results are given, and we conclude the paper in Section 5.
2. Brief review of the l1-graph and SPP

For convenience, we first give some notation used in this paper. Suppose that $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ is a set of $m$-dimensional samples of size $n$, composed of $c$ classes (the $k$-th class contains $n_k$ samples, $k = 1, 2, \ldots, c$, with $\sum_{k=1}^{c} n_k = n$). $l(x_k)$ denotes the class label of sample $x_k$; for example, $x_k$ comes from the $c$-th class if $l(x_k) = c$. The $k$-th class centroid is $m_k = \frac{1}{n_k} \sum_{l(x_i) = k} x_i$, $k = 1, \ldots, c$. For each sample $x_i \in X$, the matrix $A_i$ denotes the sample set $A_i = [x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n] \in \mathbb{R}^{m \times (n-1)}$. The robust sparse representation coefficient vector of $x_i$ is $w_i$: $x_i = [A_i, I] w_i = B_i w_i$, where $B_i = [A_i, I] \in \mathbb{R}^{m \times (m+n-1)}$, $w_i \in \mathbb{R}^{m+n-1}$, $I \in \mathbb{R}^{m \times m}$ is an identity matrix, and $w_i(1:n-1)$ denotes the entries from $1$ to $n-1$ of the vector $w_i$.

2.1. l1-graph

The l1-graph is motivated by the observation that each sample can be reconstructed from a sparse representation of the training data, with the sparse reconstruction coefficients derived by solving an l1-norm optimization problem. In the l1-graph, the vertices are all the training samples, and the edge weights at each vertex describe its l1-norm driven reconstruction from the remaining samples. Unlike traditional reconstruction graphs, the graph adjacency structure and the graph weights are derived simultaneously during the l1-graph construction. The construction process is formally stated as follows [14].

(1) Inputs: the sample set, denoted as the matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$.

(2) Robust sparse representation: for each sample $x_i$ in the sample set, its robust sparse coding is obtained by solving the l1-norm optimization problem
$$\min_{w_i} \|w_i\|_1 \quad \text{s.t.} \quad x_i = B_i w_i.$$

(3) Graph weight setting: denote the l1-graph by $G = \{X, W\}$, with $X$ the graph vertices and $W$ the graph weight matrix, and set $W_{ij} = w_i(j)$ if $i > j$ and $W_{ij} = w_i(j-1)$ if $i < j$.
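To make the construction concrete, here is a minimal sketch (not the authors' code) of the l1-graph. The paper solves the basis-pursuit problem above with the l1-magic primal-dual linear-programming solver in MATLAB; this sketch casts the same problem as a linear program for SciPy's linprog, and the helper names basis_pursuit and build_l1_graph are illustrative. It solves one LP per sample, so it is only meant for small data sets.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(B, x):
    """Solve min ||w||_1 subject to B w = x, cast as an LP in (w, t)."""
    m, p = B.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])        # minimise sum of t
    A_eq = np.hstack([B, np.zeros((m, p))])               # B w = x
    # |w_j| <= t_j  <=>  w - t <= 0  and  -w - t <= 0
    eye = np.eye(p)
    A_ub = np.vstack([np.hstack([eye, -eye]), np.hstack([-eye, -eye])])
    b_ub = np.zeros(2 * p)
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x,
                  bounds=bounds, method="highs")
    return res.x[:p]


def build_l1_graph(X):
    """X is m x n with samples as columns; returns the n x n weight matrix W."""
    m, n = X.shape
    W = np.zeros((n, n))
    for i in range(n):
        A_i = np.delete(X, i, axis=1)                     # remaining samples
        B_i = np.hstack([A_i, np.eye(m)])                 # robust dictionary [A_i, I]
        w_i = basis_pursuit(B_i, X[:, i])[: n - 1]        # keep the sample part only
        W[i, :i] = w_i[:i]                                # W_ij = w_i(j),   j < i
        W[i, i + 1:] = w_i[i:]                            # W_ij = w_i(j-1), j > i
    return W
```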
2.2. Sparsity preserving projections

Unlike many existing techniques such as LPP and NPE, where local neighborhood information based on the Euclidean norm is preserved during the dimensionality reduction procedure, sparsity preserving projections (SPP) [13] aims to preserve the sparse reconstruction relationship of the samples. Similar to NPE, its objective function is defined as
$$\min_{G} \sum_{i=1}^{n} \|G^T x_i - G^T X w_i\|^2,$$
where $w_i$ is the sparse reconstruction coefficient vector. With simple algebraic manipulation, the optimal projections of SPP are the eigenvectors corresponding to the largest $t$ eigenvalues of the generalized eigenvalue problem
$$X(W + W^T - W^T W) X^T g = \lambda X X^T g,$$
where the matrix $W$ is defined as $W = [w_1, w_2, \ldots, w_n]$. From the objective function of SPP, we can see that SPP, which combines the merits of the LLE algorithm, demands that the reconstruction error be minimized. Meanwhile, unlike LLE and NPE, the weights in SPP are obtained by solving the l1-norm optimization problem. So we can say that SPP is an application of the l1-graph to subspace learning.
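For illustration, here is a minimal sketch of the SPP projection step under the formulation above: the eigenvectors of $X(W + W^T - W^T W)X^T g = \lambda X X^T g$ associated with the largest $t$ eigenvalues. It assumes $X$ has already been reduced by PCA so that $XX^T$ is well conditioned; the small ridge term is an illustrative safeguard, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh


def spp_projections(X, W, t, ridge=1e-8):
    """X: m x n data matrix, W: n x n sparse-reconstruction weight matrix."""
    M = W + W.T - W.T @ W                      # symmetric by construction
    S = X @ M @ X.T
    C = X @ X.T + ridge * np.eye(X.shape[0])   # keep the right-hand side positive definite
    evals, evecs = eigh(S, C)                  # generalized problem, ascending eigenvalues
    return evecs[:, -t:][:, ::-1]              # projections g_1, ..., g_t (largest first)
```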
3. Discriminative learning by sparse representation projections

The SPP algorithm preserves many merits of LLE and the l1-graph, and summarizes the overall behavior of the whole sample set in the sparse reconstruction process, but it is an unsupervised learning method which does not use prior knowledge of class identities. To enhance the classification performance of the SPP algorithm, we propose a new method, called discriminative learning by sparse representation projections, or DLSP for short, which incorporates the merits of both local geometric structure and the global sparsity property. Our algorithm preserves the sparse structure of the whole data set from the representation point of view, and possesses good discriminative ability from the classification point of view. A large number of numerical experiments in Section 4 demonstrate the superiority of the proposed algorithm. To explain the rationale for considering label information in DLSP, we first analyze the difference between the l1-graph and the traditional k-nearest-neighbor graph.
3.1. l1-graph or k-nearest-neighbor graph

In this subsection, taking the ORL database as an example, we illustrate the l1-graph and the k-nearest-neighbor graph. Fig. 1 shows the l1 reconstruction and the k nearest neighbors of a sample on the ORL database. We can see that four images share the label of the left image in the l1 reconstruction in Fig. 1(a), while three images share the label of the left image among the k nearest neighbors in Fig. 1(b). Importantly, in Fig. 1(a) the first three images, which have the same label as the left image, have larger reconstruction coefficients than the other images. That is to say, the l1-graph captures more label information than the k-nearest-neighbor graph; moreover, sparse reconstruction reflects the pairwise relations hidden in the whole data set better than the k-nearest-neighbor criterion based on the Euclidean norm. Meanwhile, there are many images in Fig. 1(a) whose labels differ from that of the left image. This provides an opportunity to enhance the classification performance of the l1-graph and the SPP algorithm. In the next subsection, we make the best of the advantages of sparse reconstruction and label information to design the DLSP algorithm.
Fig. 1. Comparison of the l1 reconstruction with the k-nearest-neighbor criterion on the ORL database. (a) The images corresponding to the 10 largest reconstruction coefficients in the sparse reconstruction of the left image. (b) The images with the smallest Euclidean distances to the left image among the remaining images.
3.2. DLSP algorithm

For the $i$-th sample $x_i \in X$, $i = 1, \ldots, n$, the robust sparse representation coefficient vector of $x_i$ is $w_i$: $x_i = B_i w_i$, where $B_i = [A_i, I]$. Let $\bar{w}_i = [w_i(1:i-1), 0, w_i(i:n-1)] \in \mathbb{R}^n$. For classification, according to the label information, we define the following two index vectors $s_i, d_i \in \mathbb{R}^n$:
$$s_i^j = \begin{cases} 1, & \bar{w}_i^j \neq 0 \text{ and } l(x_i) = l(x_j), \\ 0, & \text{otherwise}, \end{cases} \qquad d_i^j = \begin{cases} 1, & \bar{w}_i^j \neq 0 \text{ and } l(x_i) \neq l(x_j), \\ 0, & \text{otherwise}, \end{cases}$$
where $y_i^j$ denotes the $j$-th element of a vector $y_i$. Specifically, suppose that $S_i = X s_i$ and $D_i = X d_i$ denote the corresponding sample sets. To enhance the role of the samples from $S_i$, we again reconstruct $x_i$ sparsely by $S_i$: $x_i = S_i \beta_i$. For discrimination, in the low-dimensional space we demand that the reconstruction error be minimized and that the samples belonging to $D_i$ lie close to their own class centroids. Our two objective functions are then defined as follows:
$$J_1(G) = \sum_{i=1}^{n} \|G^T x_i - G^T S_i \beta_i\|^2 = \mathrm{tr}\Big(G^T \sum_{i=1}^{n} (x_i - S_i\beta_i)(x_i - S_i\beta_i)^T G\Big) = \mathrm{tr}(G^T U G),$$
$$J_2(G) = \sum_{i=1}^{n} \sum_{k=1}^{c} \sum_{\substack{x_j \in D_i \\ l(x_j)=k}} \|G^T x_j - G^T m_k\|^2 = \mathrm{tr}\Big(G^T \sum_{i=1}^{n} \sum_{k=1}^{c} \sum_{\substack{x_j \in D_i \\ l(x_j)=k}} (x_j - m_k)(x_j - m_k)^T G\Big) = \mathrm{tr}(G^T V G),$$
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and the matrices $U$ and $V$ are defined, respectively, as
$$U = \sum_{i=1}^{n} (x_i - S_i\beta_i)(x_i - S_i\beta_i)^T, \qquad V = \sum_{i=1}^{n} \sum_{k=1}^{c} \sum_{\substack{x_j \in D_i \\ l(x_j)=k}} (x_j - m_k)(x_j - m_k)^T.$$

As for classification, our algorithm's final objective function is defined as
$$\min_{G} \ \alpha J_1(G) + (1-\alpha) J_2(G) \quad \text{s.t.} \quad G^T G = I,$$
where the trade-off parameter $\alpha \in [0,1]$ and $I$ is an identity matrix of appropriate size. By basic linear algebra, the optimal projections of DLSP are the eigenvectors corresponding to the smallest $t$ eigenvalues of the eigenvalue problem
$$(\alpha U + (1-\alpha) V) g = \lambda g.$$

For high-dimensional data, the training set can first be projected onto a PCA subspace to eliminate redundant information. The procedure of the proposed DLSP algorithm is as follows:

(1) Use PCA to project the data set $X$ into a subspace to eliminate useless information; we still denote by $X$ the data set in the PCA subspace, and denote the PCA projection matrix by $G_{PCA}$.

(2) For each sample $x_i \in X$, $i = 1, \ldots, n$, obtain its robust sparse coding by solving the l1-norm optimization problem
$$\min_{w_i} \|w_i\|_1 \quad \text{s.t.} \quad x_i = B_i w_i,$$
where $B_i = [x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, I] \in \mathbb{R}^{m \times (m+n-1)}$, $I$ is an identity matrix of appropriate size and the coefficient vector $w_i \in \mathbb{R}^{m+n-1}$.

(3) Let $\bar{w}_i = [w_i(1:i-1), 0, w_i(i:n-1)] \in \mathbb{R}^n$ and set $S_i = X s_i$, $D_i = X d_i$, where $s_i$ and $d_i$ are defined, respectively, as
$$s_i^j = \begin{cases} 1, & \bar{w}_i^j \neq 0 \text{ and } l(x_i) = l(x_j), \\ 0, & \text{otherwise}, \end{cases} \qquad d_i^j = \begin{cases} 1, & \bar{w}_i^j \neq 0 \text{ and } l(x_i) \neq l(x_j), \\ 0, & \text{otherwise}. \end{cases}$$

(4) Obtain the robust sparse representation of each sample $x_i$, $i = 1, 2, \ldots, n$, over the sample set $S_i$ by solving the l1-norm optimization problem
$$\min_{\beta_i} \|\beta_i\|_1 \quad \text{s.t.} \quad x_i = S_i \beta_i.$$

(5) To demand that the reconstruction error be minimized and that the samples belonging to $D_i$ lie close to their own class centroids, form the two matrices
$$U = \sum_{i=1}^{n} (x_i - S_i\beta_i)(x_i - S_i\beta_i)^T, \qquad V = \sum_{i=1}^{n} \sum_{k=1}^{c} \sum_{\substack{x_j \in D_i \\ l(x_j)=k}} (x_j - m_k)(x_j - m_k)^T.$$

(6) Compute the optimal projection eigenvectors $G_{DLSP} = [g_1, \ldots, g_t]$ corresponding to the $t$ smallest eigenvalues of the eigenvalue problem
$$(\alpha U + (1-\alpha) V) g = \lambda g.$$

(7) The final optimal projection matrix is $G = G_{PCA} G_{DLSP}$. For a test sample $x$, its image in the low-dimensional space is given by $x \mapsto y = G^T x$.
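A minimal sketch of training steps (2)-(6) on a PCA-reduced data matrix is given below. It reuses the hypothetical basis_pursuit helper from Section 2.1, applies the zero-threshold of 0.0001 mentioned in Section 4, and, as a practical surrogate, keeps the robust augmentation $[S_i, I]$ in step (4) in case $x_i = S_i\beta_i$ has no exact solution; none of these implementation choices are prescribed by the paper.

```python
import numpy as np
from scipy.linalg import eigh


def dlsp_projections(X, y, t, alpha=0.01, tol=1e-4):
    """X: m x n PCA-reduced data (columns are samples), y: length-n label array."""
    m, n = X.shape
    centroids = {int(k): X[:, y == k].mean(axis=1) for k in np.unique(y)}
    U = np.zeros((m, m))
    V = np.zeros((m, m))
    for i in range(n):
        # step (2): robust sparse coding of x_i over [A_i, I]
        A_i = np.delete(X, i, axis=1)
        B_i = np.hstack([A_i, np.eye(m)])
        w_i = basis_pursuit(B_i, X[:, i])[: n - 1]
        w_bar = np.insert(w_i, i, 0.0)                    # \bar{w}_i in R^n
        nz = np.abs(w_bar) > tol                          # "non-zero" entries (threshold 1e-4)
        same = nz & (y == y[i])                           # index set s_i
        diff = nz & (y != y[i])                           # index set d_i
        # step (4): re-express x_i over the same-class samples S_i; the identity
        # augmentation is an illustrative surrogate in case x_i = S_i beta_i is infeasible
        S_i = X[:, same]
        if S_i.shape[1] > 0:
            beta_i = basis_pursuit(np.hstack([S_i, np.eye(m)]), X[:, i])[: S_i.shape[1]]
            r = X[:, i] - S_i @ beta_i
        else:
            r = X[:, i]
        U += np.outer(r, r)
        # step (5): pull the selected different-class samples toward their centroids
        for j in np.where(diff)[0]:
            d = X[:, j] - centroids[int(y[j])]
            V += np.outer(d, d)
    # step (6): t eigenvectors with the smallest eigenvalues of alpha*U + (1-alpha)*V
    evals, evecs = eigh(alpha * U + (1 - alpha) * V)      # ascending order
    return evecs[:, :t]
```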
Fig. 2. Some face images from the ORL database. There are 10 available facial expressions and poses for each subject.
Table 1. Average of the best recognition accuracies of six algorithms under different sizes of the training set on the ORL database; the values in parentheses are the corresponding standard deviations.

ORL     2               3               4               7
DLSP    0.8232(±0.031)  0.9068(±0.019)  0.9480(±0.015)  0.9858(±0.011)
PCA     0.7066(±0.032)  0.7914(±0.025)  0.8489(±0.021)  0.9322(±0.025)
LPP     0.6287(±0.036)  0.7224(±0.029)  0.7920(±0.026)  0.9340(±0.022)
NPE     0.7092(±0.029)  0.8033(±0.027)  0.8637(±0.020)  0.9582(±0.018)
SPP     0.7100(±0.034)  0.8067(±0.022)  0.8652(±0.022)  0.9405(±0.020)
LDA     0.7688(±0.034)  0.8681(±0.023)  0.9145(±0.019)  0.9587(±0.019)
4. Experiments

In this section, to evaluate the effectiveness of the proposed algorithm for classification, we compare the proposed discriminative learning by sparse representation projections (DLSP) algorithm with five dimensionality reduction methods, PCA, LPP, NPE, SPP and LDA, on several publicly available face databases, including ORL [15], UMIST [16] and JAFFE [17], which involve different illumination, expression and pose changes. In all experiments, we keep 98% of the information in the sense of reconstruction error, except for the LDA method. In our proposed algorithm, the non-zero entries are those elements whose values are greater than or equal to 0.0001. The sparse reconstruction coefficient vectors are obtained by a primal-dual algorithm for linear programming using publicly available packages such as l1-magic (http://www.acm.caltech.edu/l1magic/). For LPP, the kernel parameter is selected as the mean of the norms of the training samples. The trade-off parameter α in DLSP is fixed at 0.01. The nearest-neighbor classifier is employed for classification in the projected feature space.
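As an illustration of this protocol, the following sketch chains PCA (98% energy), the DLSP projection from the previous section, and a 1-nearest-neighbor classifier for a single train/test split; dataset loading and the repeated random splits are omitted, and run_split and dlsp_projections are hypothetical names, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier


def run_split(X_train, y_train, X_test, y_test, t, alpha=0.01):
    """X_* are (n_samples, n_features) arrays of vectorised images; y_* label arrays."""
    pca = PCA(n_components=0.98)                     # keep 98% of the energy
    Z_train = pca.fit_transform(X_train)
    Z_test = pca.transform(X_test)
    G = dlsp_projections(Z_train.T, np.asarray(y_train), t, alpha=alpha)
    knn = KNeighborsClassifier(n_neighbors=1)        # nearest-neighbour classifier
    knn.fit(Z_train @ G, y_train)
    return knn.score(Z_test @ G, y_test)             # recognition rate on this split
```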
4.1. ORL

The ORL face database contains 40 distinct subjects, each with 10 different images. For some subjects, the images were taken at different times, varying the lighting, facial expressions and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal
position. All the images from the ORL database are cropped, and the cropped images are normalized to 32 × 32 pixels with 256 gray levels per pixel. Some samples from this database are shown in Fig. 2. For each subject, l (= 2, 3, 4, 7) images are randomly selected as training samples and the rest are used for testing. For each l, we average the results over 50 random splits. Table 1 gives the best classification accuracy and the corresponding standard deviation of the six algorithms under the different sizes of the training set. Fig. 3(a) shows the best classification accuracy versus the size of the training set. Fig. 3(b) compares the recognition rates of the six algorithms under different reduced dimensions when the number of training samples from each class is four. Fig. 8 depicts the recognition accuracy versus the trade-off parameter α corresponding to Fig. 3(b). It can be seen that the DLSP algorithm outperforms the other algorithms, especially when the size of the training set is small.

4.2. UMIST

In this subsection, we demonstrate our algorithm's superior classification performance over other subspace learning methods under pose changes. The UMIST database [16] contains 564 images of 20 individuals, each covering a range of poses from profile to frontal views. Subjects cover a range of race, sex and appearance. We use a cropped version of the UMIST database that is publicly available at S. Roweis' Web page (http://cs.nyu.edu/~roweis/data.html). All the cropped images are normalized to 64 × 64 pixels with 256 gray levels per pixel. Fig. 4 shows some images of an individual. We randomly select l (= 6, 8, 10, 12) images from each individual for training, and the rest for testing. We repeat these trials 50 times and compute the average results. Table 2 gives the best classification accuracy and the corresponding standard deviation of the six algorithms under the different sizes of the training set. Fig. 5(a) shows the best classification accuracy versus the size of the training set. Fig. 5(b) compares the recognition rates of the six algorithms under different reduced dimensions when the number of training samples from each class is 10. The recognition rate versus the trade-off parameter α
Fig. 3. Recognition rates of six algorithms on the ORL database: (a) the best recognition rates versus the different sizes of the training set and (b) the average recognition rates versus the variation of dimensions when the size of each class is fixed as four.
Fig. 4. Some face samples from the UMIST database.
Table 2. Average of the best pose recognition accuracies of six algorithms under different sizes of the training set on the UMIST database; the values in parentheses are the corresponding standard deviations.

UMIST   6               8               10              12
DLSP    0.9196(±0.018)  0.9484(±0.011)  0.9587(±0.011)  0.9908(±0.006)
PCA     0.8474(±0.027)  0.8962(±0.019)  0.9256(±0.016)  0.9633(±0.012)
LPP     0.8329(±0.027)  0.8829(±0.018)  0.9160(±0.017)  0.9557(±0.012)
NPE     0.8594(±0.026)  0.9046(±0.017)  0.9313(±0.014)  0.9691(±0.011)
SPP     0.8467(±0.028)  0.8924(±0.020)  0.9203(±0.016)  0.9565(±0.014)
LDA     0.8754(±0.026)  0.9094(±0.021)  0.9218(±0.021)  0.9683(±0.013)
corresponding to Fig. 5(b) is shown in Fig. 8. It is found that the DLSP method significantly outperforms PCA, LPP, NPE, SPP and LDA, especially when the size of the training set is small. That is to say, the DLSP algorithm is not sensitive to pose changes.

4.3. JAFFE

In this subsection, we show that our algorithm achieves a higher recognition rate than other dimensionality reduction methods under facial expression changes. The JAFFE database contains 213 images of seven facial expressions (six basic facial expressions and one neutral) posed by 10 Japanese female models. Each image has been rated on six emotion adjectives by 60 Japanese subjects. All the cropped images are normalized to 32 × 32 pixels with 256 gray levels per pixel. Fig. 6 shows some images of an individual. For each subject, l (= 2, 3, 4, 7) images are randomly selected as training samples and the rest are used for testing. For each l, we average the results over 50 random splits. Table 3 gives the best classification accuracy and the corresponding standard deviation of the six algorithms under the different sizes of the training set. Fig. 7(a) shows the best classification accuracy versus the size of the training set. Fig. 7(b) compares the recognition rates of the six algorithms under different reduced dimensions when the number of training samples from each class is two. The recognition rate versus the trade-off parameter α corresponding to Fig. 7(b) is also depicted in Fig. 8. We see that the DLSP algorithm significantly outperforms the other methods under facial expression changes.
Fig. 5. Recognition accuracy of six algorithms on the UMIST database: (a) the best recognition rates versus the different sizes of the training set and (b) the average recognition rates versus the variation of dimensions when the size of each class is fixed as 10.
Fig. 6. Some face samples from the JAFFE database.
Table 3. Average of the best expression recognition accuracies of six algorithms under different sizes of the training set on the JAFFE database; the values in parentheses are the corresponding standard deviations.

JAFFE   2               3               4               7
DLSP    0.8594(±0.036)  0.9365(±0.024)  0.9665(±0.018)  0.9959(±0.007)
PCA     0.8046(±0.041)  0.8611(±0.036)  0.8966(±0.028)  0.9330(±0.023)
LPP     0.8029(±0.042)  0.8715(±0.040)  0.9222(±0.028)  0.9729(±0.018)
NPE     0.8233(±0.042)  0.9000(±0.034)  0.9341(±0.025)  0.9768(±0.015)
SPP     0.7258(±0.050)  0.8486(±0.045)  0.9024(±0.028)  0.9545(±0.022)
LDA     0.7881(±0.048)  0.9184(±0.025)  0.9553(±0.022)  0.9899(±0.010)
4.4. Discussion

It is worth highlighting the following points arising from the experimental results.

(1) From Tables 1–3 and Figs. 3(a), 5(a) and 7(a), we can see that our proposed algorithm significantly outperforms the other methods, especially when the size of the training set is small. That is to say, our algorithm captures more of the intrinsic information hidden in the training set than the other methods. This superior classification performance opens up a wide range of applications in which the size of the training set is small.

(2) From Figs. 3(b), 5(b) and 7(b), we can see that our algorithm achieves its best recognition rate at reduced dimensions of 44, 46 and 9 on the ORL, UMIST and JAFFE databases, respectively. These reduced dimensions are smaller than those required by the other methods to achieve their best recognition rates, which saves a great deal of computing time and storage space once the optimal projections have been obtained.

(3) The trade-off parameter α tunes the contributions of $J_1(G)$ and $J_2(G)$ in our algorithm. From Fig. 8, we can see that the DLSP algorithm achieves better classification performance when the trade-off parameter α is small. When α is small, the contribution of $J_1(G)$ is smaller than that of $J_2(G)$. That is to say, the role played by the samples from $S_i$ in the reconstruction of $x_i$ is weakened, while the role of the samples from $D_i$, which are not close to $x_i$, increases. This makes the distance between the samples sharing the class label of $x_i$ and the samples from other classes larger. As a
Fig. 7. Recognition rates of six algorithms on the JAFFE database: (a) the best recognition rates versus the different sizes of the training set and (b) the average recognition rates versus the variation of dimensions when the size of each class is fixed as two.
Fig. 8. Recognition rate of DLSP versus the trade-off parameter α on the ORL, UMIST and JAFFE databases. Note that α = linspace(0.01, 1, 11) (MATLAB) in the three experiments.
result, the DLSP algorithm achieves better classification performance. However, when α is large, the contribution of $J_2(G)$ becomes small, which is not good for classification.
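For readers who want to reproduce a curve like Fig. 8, a short sketch of the sweep is given below; it reuses the hypothetical run_split pipeline from Section 4 on a single prepared split and simply records the recognition rate for each value of α.

```python
import numpy as np


def sweep_alpha(X_train, y_train, X_test, y_test, t):
    """Recognition rate of DLSP for each alpha in linspace(0.01, 1, 11)."""
    alphas = np.linspace(0.01, 1.0, 11)              # as stated in the Fig. 8 caption
    rates = np.array([run_split(X_train, y_train, X_test, y_test, t, alpha=a)
                      for a in alphas])
    return alphas, rates
```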
5. Conclusions

In this paper, based on sparse representation, we have proposed the DLSP algorithm, a new dimensionality reduction method. DLSP incorporates the merits of both local geometric structure and the global sparsity property. The advantages of DLSP are as follows: DLSP preserves discriminative information, especially when the size of the training set is small; DLSP constructs its graph adaptively and avoids the difficulty of parameter selection found in LPP and NPE; and DLSP performs better than the SPP algorithm in classification. Experimental results have demonstrated the effectiveness of the proposed algorithm, especially when the size of the training set is small.

Acknowledgments

The authors would like to express their gratitude to the anonymous referees as well as the Editor and Associate Editor for their valuable comments, which led to substantial improvements of the paper. This work was supported by the National Basic Research Program of China (973 Program) (Grant no. 2007CB311002) and the National Natural Science Foundation of China (Grant nos. 60675013, 10531030, 61075006/F030401).

References
[1] D.L. Donoho, High-dimensional data analysis: the curses and blessings of dimensionality, Lecture delivered at the "Mathematical Challenges of the 21st Century" Conference of the American Mathematical Society, Los Angeles, August 6–11, 2000. http://www-stat.stanford.edu/~donoho/Lectures/AMS2000/AMS2000.html
[2] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, 1961.
[3] Z. Li, D. Lin, X. Tang, Nonparametric discriminant analysis for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (4) (2009) 761–775.
[4] H. Cevikalp, M. Neamtu, M. Wilkes, A. Barkana, Discriminant common vectors for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (1) (2005) 4–13.
[5] B. Mojaradi, H.A. Moghaddam, M. Zoej, R.P.W. Duin, Dimensionality reduction of hyperspectral data via spectral feature extraction, IEEE Transactions on Geoscience and Remote Sensing 47 (7) (2009) 2091–2105.
[6] I.T. Jolliffe, Principal Component Analysis, second ed., Springer-Verlag, New York, 2002.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, Boston, 1990.
[8] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[9] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (7) (2003) 1373–1396.
[10] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: IEEE International Conference on Computer Vision (ICCV), 2005.
[11] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3) (2005) 328–340.
[12] S. Yan, D. Xu, B. Zhang, Q. Yang, H. Zhang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2007) 40–51.
[13] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognition 43 (2010) 331–341.
[14] B. Cheng, J. Yang, S. Yan, Y. Fu, T. Huang, Learning with l1-graph for image analysis, IEEE Transactions on Image Processing 19 (4) (2010) 858–866.
[15] F. Samaria, A. Harter, Parameterisation of a stochastic model for human face identification, in: Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL, December 1994.
[16] D.B. Graham, N.M. Allinson, Characterizing virtual eigensignatures for general purpose face recognition, in: Face Recognition: From Theory to Applications, vol. 163, 1998, pp. 446–456.
[17] M.J. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Neural Networks 21 (12) (1999) 1357–1362.
Fei Zang received his master's degree from the School of Mathematical Sciences at the University of Electronic Science and Technology of China in 2007. He is currently a Ph.D. candidate in the School of Science at Xi'an Jiaotong University, China. He is interested in pattern recognition and image processing.

Jiangshe Zhang was born in 1962. He received his M.S. and Ph.D. degrees in Applied Mathematics from Xi'an Jiaotong University, Xi'an, China, in 1987 and 1993, respectively. He joined Xi'an Jiaotong University in 1987, where he is currently a full Professor in the Faculty of Science. Up to now, he has authored and coauthored one monograph and over 50 journal papers on robust clustering, optimization, short-term load forecasting for electric power systems, etc. His current research focuses on Bayesian learning, global optimization, computer vision and image processing, support vector machines, neural networks and ensemble learning.