Author's Accepted Manuscript
Sparse Embedded Dictionary Learning on Face Recognition
Yefei Chen, Jianbo Su
To appear in: Pattern Recognition. PII: S0031-3203(16)30351-X. DOI: http://dx.doi.org/10.1016/j.patcog.2016.11.001. Reference: PR5943.
Received 17 March 2016; revised 31 October 2016; accepted 1 November 2016.
Cite this article as: Yefei Chen and Jianbo Su, Sparse Embedded Dictionary Learning on Face Recognition, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.11.001
Sparse Embedded Dictionary Learning on Face Recognition
Yefei Chen (1), Jianbo Su (2,*)
Department of Automation, Shanghai Jiao Tong University, Shanghai, China
Abstract

In sparse dictionary learning based face recognition (FR), a discriminative dictionary is learned from the training set so that good classification performance can be achieved on the probe set. To achieve better performance with less computation, dimensionality reduction is applied to the source data before training. Most existing dictionary learning methods learn features and the dictionary separately, which may weaken dictionary learning, because the classification ability of a learned dictionary depends on the data structure of the source domain. Therefore, a sparse embedded dictionary learning (SEDL) method is proposed, in which dictionary learning and dimensionality reduction are realized jointly and the margin between the between-class and within-class coefficient distances is encouraged to be large, in order to enhance the classification ability and gain discriminative information. Moreover, orthogonality of the projection matrix, which is critical to data reconstruction, is preserved, and data reconstruction is considered to be important for sparse representation. In this paper, an extension of discriminant dictionary learning and sparse embedding is proposed and realized with novel strategies. Experiments show that our method achieves better performance than other state-of-the-art methods on face recognition.

Keywords: face recognition, dictionary learning, sparse embedding

*Corresponding author. Email addresses: [email protected] (Yefei Chen), [email protected] (Jianbo Su).
(1) Yefei Chen received his M.S. degree in Automatic Control from Shanghai Jiao Tong University, Shanghai, China, in 2013. His research interests include pattern recognition, image processing, face recognition and sparse representation.
(2) Jianbo Su received his B.S. degree in Automatic Control from Shanghai Jiao Tong University, Shanghai, China, in 1989, his M.S. degree in Pattern Recognition and Intelligent System from the Institute of Automation, Chinese Academy of Science, Beijing, China, in 1992, and his Ph.D. in Control Science and Engineering from Southeast University, Nanjing, China, in 1995. He joined the Department of Automation, Shanghai Jiao Tong University, in 1997, where he has been a Full Professor since 2000. His research interests include robotics, pattern recognition, and human-machine interaction. He has served as an Associate Editor for IEEE Transactions on Cybernetics since 2005.
Preprint submitted to Pattern Recognition, October 31, 2016.

1. Introduction

Face recognition has been studied for decades, and various methods have been proposed to handle its difficulties, such as illumination, pose, occlusion and small sample size[1]. Meanwhile, the high dimensionality of face images makes the face recognition problem more difficult, because the Euclidean distances between feature vectors become closer to each other, making the inference task harder; this is known as the concentration phenomenon[2]. High dimensionality also increases the computational complexity of recognition. Subspace learning is an efficient and simple way to obtain a projection matrix that preserves the data structure and discriminative information while realizing dimensionality reduction. Classical dimensionality reduction methods include Principal Component Analysis (PCA)[3], Linear Discriminant Analysis (LDA)[4] and Locality Preserving Projections (LPP)[5]. PCA plays some role in removing unreliable dimensions, which improves the classification accuracy[6], but it still may not remove them effectively. Hence, LDA is used to seek discriminant projections by maximizing the between-class scatter while minimizing the within-class variance. But such methods alone can hardly handle expression and illumination changes. Sparse representation (SR)[7] has attracted increasing interest in computer vision in recent years[8-15]. The basic idea of SR is to encode the original signal over a task-specific (often overcomplete) dictionary under a sparsity constraint, decreasing the effect of Laplacian noise via the l0-norm.
SR based methods such as sparse representation classification (SRC)[16] were later applied to face recognition. The query image was encoded over a fixed dictionary formed by gallery images, by solving a Lasso[17] or OMP[18] problem, and classification was achieved by finding the class with the minimal residual. However, such a fixed dictionary might not be discriminant enough for classification, which has triggered research on dictionary learning[19]
with sparsity constraints and the study of sparse embedding[20-22]. Unsupervised dictionary learning methods like MOD[23] and K-SVD[24] learn dictionary atoms that enhance the ability of signal reconstruction, but classification ability is not explicitly obtained. To address this issue, supervised dictionary learning methods[25-30] were proposed that apply constraints on the sparse coefficients to improve classification ability, minimizing both reconstruction error and classification error. In conventional sparse representation based classification, random projection and PCA are often used for dimensionality reduction[31]. However, a pre-trained projection matrix makes the source domain unchangeable, and the dictionary is learned to adapt to that domain. Dictionary learning is based on the concept that data points lie in multiple low-dimensional subspaces of a high-dimensional ambient space, and that data points of the same class cluster together as a low-dimensional subspace[13, 32]. Therefore, the data structure of the source domain is meaningful to the classification ability of a dictionary learning method, and a pre-trained projection matrix such as PCA limits that ability. Jiang[6] pointed out that PCA might not be effective for the sparse representation based classifier SRC and that dimensionality reduction should be targeted at removing the dimensions unreliable for classification. Therefore, an asymmetric PCA (APCA) was proposed in [33] to remove the unreliable dimensions more effectively. Although PCA can improve the classification performance, it is not effective for extracting a compact feature set for efficient classification[33]. A projection matrix learned before dictionary learning might not capture vital information.
To alleviate this problem, Nguyen et al.[2] proposed a framework for sparsity promoting dimensionality reduction, where orthogonality of the projection matrix was preserved to minimize the data reconstruction error. It was designed to discard the non-informative part of the signal and maintain meaningful sparse structures of the data. It also inspired the idea of sparse embedding by combining the strengths of both dimensionality reduction and dictionary learning. Other research related to sparse embedding focused on sparse projections[20], the manifold of coefficients[22] and sparsity-preserving embedding[21], but ignored the dimensionality reduction of the data space. On the other hand, Zhang et al.[34] proposed a method with dimensionality reduction on both the dictionary and the source domain. However, fixing part of the dictionary may decrease the effect of dictionary learning. Though some joint dictionary learning and dimensionality reduction methods have been developed[2, 35], orthogonality of the projection matrix, which is essential for data reconstruction[36], is no longer preserved. Data reconstruction is considered to be important for SRC based methods, and the most discriminative dictionary is considered to be the one that gives the smallest intra-class reconstruction error and the largest inter-class reconstruction error[31]. Based on this idea, Zhang et al.[31] used an orthogonal projection to minimize the reconstruction error. Hence, we consider that dimensionality reduction before SRC based methods should minimize the data reconstruction error. However, conventional PCA may not be effective for sparse representation. To alleviate these problems, a sparse embedded dictionary learning method, in which orthogonality of the projection matrix is preserved, is proposed in this paper. During the training stage, the dictionary and the projection matrix are optimized jointly, so that the data structure of the source domain and the dictionary are optimized simultaneously. The margin of the coefficients is maximized, in the sense that the intra-class distance is minimized while the inter-class distance is maximized; this enhances the discriminability of the coefficients. In the proposed sparse embedded dictionary learning method (SEDL), both the reconstruction error and the coding coefficients are discriminative. Hence, classification is based on both the reconstruction error and the distance between coefficients. The rest of the paper is organized as follows. Section 2 briefly reviews the related work. Section 3 introduces the details of the sparse embedded dictionary learning (SEDL) method. Experiments are conducted and presented in Section 4, followed by the conclusion in Section 5.

2. Related work

Dictionary learning based on sparse representation focuses on learning a dictionary that has good representational power (or signal reconstruction power) while supporting optimal discrimination of the classes.
Sparse representation classification can be viewed as a special form of dictionary learning based classification with the dictionary formed by raw data. Sparse Embedding[2] was the first framework proposed to jointly realize dimensionality reduction and dictionary learning. Though the underlying sparse structure of the data was respected, the discrimination of the sparse coefficients was not considered. Another unsupervised sparse representation based method named Metaface[37] was developed to learn the dictionary by releasing the orthogonality constraints while maintaining unit-norm constraints on each dictionary atom. Meanwhile, Zhang et al.[34] proposed a method jointly learning the projection matrix and the dictionary with the dictionary partly fixed, which decreased the classification capability of dictionary learning. A supervised sparse representation method, Discriminative K-SVD (DKSVD)[25], was proposed to learn dictionary atoms based on the traditional K-SVD algorithm. It incorporated the classification error into the objective function in order to enhance the classification ability of the coefficients, but a linear classifier might not perform well in classification. Label Consistent K-SVD (LCKSVD)[27], viewed as an extension of DKSVD, still suffered from the flaw of the linear classifier. Using the Fisher discrimination criterion and a structured dictionary whose atoms explicitly correspond to the class labels, FDDL proposed by Yang et al.[30] could learn a better dictionary for classification. JDDRDL[26], based on collaborative representation classification (CRC)[38], tried to perform dimensionality reduction and supervised dictionary learning jointly, but the solution for the projection matrix was not proper and its convergence could not be guaranteed[26]. Liu et al.[28] presented another CRC based method which utilized a bilinear discriminative term instead of a linear classifier. The approach by Lu et al.[35] presented another way to learn the dictionary and the projection matrix jointly. It claimed that learning features and dictionaries separately might not be powerful enough, because some discriminative information for dictionary learning might be compromised in the feature learning stage if they were applied sequentially, and vice versa. However, that method was applied to image sets, and the orthogonality of the projection matrix was not explicitly preserved.
Patch-based methods[39, 40] can also reduce the computation and achieve good performance by partitioning the images into smaller blocks. Lai et al.[39] proposed a modular weighted global sparse representation (WGSR) method, which uses selected patches for classification by weighting patches with a nonlinear function; however, the optimization of the weighting function remains an open issue. To exploit the flexible representation of patches, Lai et al.[40] proposed a class-wise sparse representation (CSR) for face recognition, which alleviates the drawbacks of Group Lasso. In this paper, we focus on jointly learning the projection matrix and the dictionary from holistic features.
3. Sparse embedded dictionary learning

In this section, a novel dictionary learning method, Sparse Embedded Dictionary Learning (SEDL), is proposed. Some of the dictionary learning methods mentioned above have attempted to perform dimensionality reduction jointly, but the solution for the projection matrix is not proper and its orthogonality is not preserved. The proposed SEDL performs dimensionality reduction and dictionary learning jointly; the convergence and orthogonality of the trained projection matrix are guaranteed by [41]. The classification ability of the coefficients is enhanced by enlarging the margin between the intra-class and inter-class coefficient distances. Hence, classification is based on both the reconstruction error and the coding coefficients. Suppose the training samples are composed of c classes and each sample is vectorized to a vector y with m dimensions; then the training samples of the i-th class are Y_i = [y_1, ..., y_{n_i}] in R^{m x n_i}, where n_i is the number of samples in the i-th class and i = 1, ..., c. Let Y = [Y_1, ..., Y_i, ..., Y_c] in R^{m x n} be the training data matrix, where n = n_1 + ... + n_c. Dimensionality reduction aims to learn an orthogonal projection matrix P in R^{p x m}, where p denotes the lower dimension of the data (p < m). The dictionary to be learned is denoted as D = [D_1, ..., D_i, ..., D_c] in R^{p x k}, where k = k_1 + ... + k_c is the number of atoms in the dictionary D, k_i is the number of atoms in the i-th sub-dictionary, and D_i in R^{p x k_i} is the dictionary for the i-th class. Let X = [X_1, ..., X_i, ..., X_c] denote the coefficients of Y coded over the dictionary D, and let X_i = [X_i^1; ...; X_i^j; ...; X_i^c] in R^{k x n_i} denote the coefficients of Y_i coded over D, where j = 1, ..., c and X_i^j in R^{k_j x n_i} is the j-th sub-block of X_i, i.e., the coding coefficients of Y_i over the sub-dictionary D_j. The relationship between the variables is visualized in Fig. 1.
Fig. 1: The relationship between the variables in the term \|PY - DX\|_F^2
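To make the notation concrete, the variable shapes and the term \|PY - DX\|_F^2 visualized in Fig. 1 can be sketched in NumPy (all sizes below are arbitrary toy values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

m, p = 100, 20            # original and reduced feature dimensions (p < m)
c, n_i, k_i = 3, 5, 4     # classes, samples per class, atoms per class
n, k = c * n_i, c * k_i   # total samples and total atoms

Y = rng.standard_normal((m, n))                     # training data, one column per sample
P = np.linalg.qr(rng.standard_normal((m, p)))[0].T  # orthogonal rows, so P P^T = I
D = rng.standard_normal((p, k))                     # structured dictionary [D_1, ..., D_c]
D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms
X = rng.standard_normal((k, n))                     # coding coefficients of Y over D

# The term visualized in Fig. 1: ||P Y - D X||_F^2
rec = np.linalg.norm(P @ Y - D @ X, "fro") ** 2
```

The QR factorization here is only a convenient way to produce a matrix with orthonormal rows; in SEDL, P is instead learned by the optimization of Section 3.3.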
The framework of our SEDL model is expressed as follows:

J_{P^*, D^*, X^*} = \arg\min_{P, D, X} \{ r(P, Y, D, X) + \lambda_1 \|X\|_1 + \lambda_2 f(X) + \|P^T P Y - Y\|_F \},   (1)
s.t. P P^T = I.

The model is composed of four parts: the reconstruction error term r(P, Y, D, X); the discriminative term f(X); the l1-norm regularization term \|X\|_1, which ensures that the coefficients are sparse; and the Frobenius norm regularization term, which ensures that the reconstruction error of the source domain caused by dimensionality reduction is minimized. The reconstruction error term with the l1-norm regularization term follows the basic idea of SR, and the idea of a structured dictionary[30] is adopted in the reconstruction phase. The discriminative term focuses on gaining discriminative information from the coding coefficients, following the idea proposed in [42]. Though the objective function looks similar to that of [35], the idea is different. The proposed method obtains an orthogonal projection matrix, which is vital for data reconstruction; data reconstruction is considered to be important for SRC based methods[31]. Hence, we apply data reconstruction together with dictionary learning. The term f(X) in [35] is formed according to the neighbour relationship between coefficients, but the coefficients change at each training iteration, and a fixed relationship cannot adapt to these changes. Here, the coefficient distances between the training samples are taken into consideration, as in [42].

3.1. The reconstruction error term

Motivated by the idea that not only the dictionary D but also the sub-dictionary D_i associated with the i-th class should be able to represent the data well[28, 30], we define the first reconstruction error term as

r(P, Y, D, X) = \|PY - DX\|_F^2 + \sum_{i=1}^{c} \|P Y_i - D_i X_i^i\|_F^2 + \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \|D_j X_i^j\|_F^2,   (2)

where \|\cdot\|_F denotes the Frobenius norm of a matrix. The first term \|PY - DX\|_F^2 is designed so that PY, the training samples after dimensionality reduction, can be well represented by the dictionary D; the second term \|P Y_i - D_i X_i^i\|_F^2 ensures that each sub-dictionary can represent its own data well, which enhances the discriminative ability in the classification stage; and the third term guarantees that the effect of the other sub-dictionaries is minimized. Additionally, the second reconstruction error term \|P^T P Y - Y\|_F is utilized to minimize the reconstruction error of the source domain caused by dimensionality reduction. This reconstruction error term is also used in [12, 43] to make sure that the projection is representative of the source data. It is claimed that dimensionality reduction should preserve most of the information or energy of the source data while transforming the original data domain to a lower-dimensional one. It also prevents the pathological case of mapping everything to the origin, which yields the sparsest solution but is of no interest. The data structure of the source domain is vital for dictionary learning. However, traditional methods[16, 28, 30] using a pre-trained projection matrix fix the source domain and decrease the capability of dictionary learning, because some discriminative information for dictionary learning may be eliminated in the embedding stage. Thus, embedding the data into a lower dimensionality while jointly learning a discriminative dictionary can exploit more discriminative information.

3.2. Discriminative coefficient term

In order to enhance the classification capability, constraints on the coefficients X over the dictionary D are applied. Based on the idea that the distance between coefficients with the same label should be small, while the distance between coefficients with different labels should be large, we define f(X) as follows:

f(X) = \sum_{u=1}^{n} \sum_{v=1}^{n} \|x_u - x_v\|_2^2 S_{uv} + \eta \|X\|_F^2,   (3)

where n is the total number of samples, u (v) denotes the u-th (v-th) training sample, and x_u (x_v) in R^k is the corresponding coefficient vector in the u-th (v-th) column of X. The term \|X\|_F^2 ensures that f(X) is convex and stable, and S is a matrix measuring the similarity of the coefficients x_u and x_v according to their labels. The element of S at the u-th row and v-th column is

S_{uv} = 1/n_{l(x_u)}   if l(x_u) = l(x_v),
S_{uv} = -1/(n - n_{l(x_u)})   otherwise,   (4)
where l(x_u) (l(x_v)) denotes the class label of x_u (x_v), and n_{l(x_u)} is the number of samples in the l(x_u)-th class. The idea of enlarging the margin between the intra-class and inter-class distances has been adopted in metric learning, where it has proved useful[42]. Here, the coefficients are optimized instead of finding optimal projections. If the entries of S are set to zero when l(x_u) differs from l(x_v), the formulation reduces to measuring the within-class distance, which coincides with the method proposed in [28]. If the strategy of forming S in [35] is adopted, S turns out to be a Laplacian-like matrix. Here, the class sizes are used to balance the distance measures from the same and different classes. Finally, this enhances the discriminative capability of the coefficients coded over the learned dictionary.

3.3. Optimization of SEDL

The objective function in Eq. (1) is not convex in P, D and X simultaneously, but it is convex in each of them when the others are fixed. Because alternating iterative methods are widely adopted in solving dictionary learning problems, our SEDL model uses one as well, so that P, D and X can be solved iteratively.

Firstly, we fix D and P, and update X. Eq. (1) can be rewritten as:

J_{X^*} = \arg\min_X \{ \|PY - DX\|_F^2 + \sum_{i=1}^{c} \|P Y_i - D_i X_i^i\|_F^2 + \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \|D_j X_i^j\|_F^2 + \lambda_1 \|X\|_1 + \lambda_2 \{ \sum_{u=1}^{n} \sum_{v=1}^{n} \|x_u - x_v\|_2^2 S_{uv} + \eta \|X\|_F^2 \} \}.   (5)
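The label-dependent matrix S of Eq. (4), and the matrix L = C - S that the expanded objective uses, can be formed from the labels alone. A minimal NumPy sketch (the helper name is illustrative, not from the paper; the toy labels are balanced so that S is symmetric):

```python
import numpy as np

def similarity_matrix(labels):
    """Build S of Eq. (4): S_uv = 1/n_l if samples u and v share label l,
    and -1/(n - n_l) otherwise, where n_l is the size of u's class."""
    labels = np.asarray(labels)
    n = labels.size
    S = np.empty((n, n))
    for u in range(n):
        same = labels == labels[u]
        n_l = np.sum(same)
        S[u, same] = 1.0 / n_l
        S[u, ~same] = -1.0 / (n - n_l)
    return S

labels = [0, 0, 1, 1, 2, 2]          # balanced toy labels, so S is symmetric
S = similarity_matrix(labels)
C = np.diag(S.sum(axis=1))           # diagonal of row sums of S
L = C - S                            # the matrix L used in Eq. (6)

# Sanity check: sum_{u,v} ||x_u - x_v||^2 S_uv equals 2 tr(X L X^T)
rng = np.random.default_rng(1)
n = len(labels)
X = rng.standard_normal((4, n))
pairwise = sum(np.sum((X[:, u] - X[:, v]) ** 2) * S[u, v]
               for u in range(n) for v in range(n))
assert np.isclose(pairwise, 2 * np.trace(X @ L @ X.T))
```

The final identity is what lets the pairwise-distance sum of Eq. (3) be written as the trace form that appears in Eq. (6).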
Following the work in [28, 30], we compute X_i class by class: all X_j, j different from i, are fixed when computing X_i. The objective function above is then reduced to:

J_{X_i^*} = \arg\min_{X_i} \{ \|P Y_i - D X_i\|_F^2 + \|P Y_i - D_i X_i^i\|_F^2 + \sum_{j=1, j \neq i}^{c} \|D_j X_i^j\|_F^2 + \lambda_2 \mathrm{tr}((X_{io} + X_{oi}) L (X_{io} + X_{oi})^T) + \eta \|X_i\|_F^2 + \lambda_1 \|X_i\|_1 \}
= \arg\min_{X_i} \{ \|P Y_i - D X_i\|_F^2 + \|P Y_i - D_{io} X_i\|_F^2 + \|D_{oi} X_i\|_F^2 + \lambda_2 \mathrm{tr}(X_i K L K^T X_i^T + X_i K L X_{oi}^T + X_{oi} L K^T X_i^T + X_{oi} L X_{oi}^T) + \eta \|X_i\|_F^2 + \lambda_1 \|X_i\|_1 \},   (6)
where L = C - S, and C = diag{c_1, c_2, ..., c_n} is a diagonal matrix whose diagonal elements are the sums of the rows (or columns) of S. D_{io} = [0, D_i, 0] is a matrix of the same size as D, in which only D_i, associated with X_i^i, is non-zero; D_{oi} = [D_1, ..., D_{i-1}, 0, D_{i+1}, ..., D_c] is its complement, in which only D_i is zero. Similarly, X_{io} = [0, X_i, 0] and X_{oi} = [X_1, ..., X_{i-1}, 0, X_{i+1}, ..., X_c], where X = X_{io} + X_{oi}. It is obvious that X_{io} = X_i K, where K = [0, I, 0] and I is an identity matrix corresponding to the position of X_i in X_{io}. X_{oi} can be viewed as a constant when solving for X_i. Because all the terms in Eq. (6) are differentiable in X_i except for \|X_i\|_1, we employ FISTA[44] to solve it.

Secondly, we fix P and X, and update D. The objective function becomes:

J_{D^*} = \arg\min_D \{ \|PY - DX\|_F^2 + \sum_{i=1}^{c} \|P Y_i - D_i X_i^i\|_F^2 + \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \|D_j X_i^j\|_F^2 \}.   (7)

Following the work in [28, 30, 37], the problem can be solved by updating D_i class by class, so we compute D_i while fixing D_j, j different from i:

J_{D_i^*} = \arg\min_{D_i} \{ \|PY - D_i X^i - \sum_{j=1, j \neq i}^{c} D_j X^j\|_F^2 + \|P Y_i - D_i X_i^i\|_F^2 + \sum_{j=1, j \neq i}^{c} \|D_j X_i^j\|_F^2 \},   (8)

where X^j denotes the rows of X associated with the sub-dictionary D_j.
Unlike the dictionary learned in K-SVD, whose atoms are orthogonal to each other, the atoms here are only required to have unit norm. Therefore, the method in [37] is adopted to solve the problem.

Thirdly, we fix D and X, and update P. The objective function is reduced to:

J_{P^*} = \arg\min_P \|PY - DX\|_F^2 + \sum_{i=1}^{c} \|P Y_i - D_i X_i^i\|_F^2 + \|P^T P Y - Y\|_F,
s.t. P P^T = I.   (9)

Obviously, this is an optimization problem with orthogonality constraints. We consider that important information in the data can be represented more compactly in the space of an orthogonal basis. Unlike the problems in [2, 26, 34], the objective function cannot be transformed into an SVD-like form. Here we adopt the method proposed in [41], which gives a general form for solving such problems with the help of the Cayley transform. The required gradient is:

\nabla J(W) = 2Y(Y^T W - (DX)^T) + \sum_{i=1}^{c} 2Y_i(Y_i^T W - (D_i X_i^i)^T) - 4YY^T W + 2WW^T YY^T W + 2YY^T WW^T W,   (10)

where W = P^T, which is used to form a skew-symmetric matrix A = \nabla J W^T - W (\nabla J)^T for the constraint-preserving update operation. Most parts of the proposed algorithm, such as updating X by FISTA and updating D in closed form, are based on traditional methods used for dictionary learning, whose computational complexities are similar. Updating the projection P is based on the method in [41], where a Crank-Nicolson-like update scheme preserves the constraints and a curvilinear search algorithm is applied. Unlike optimization methods[31] that use second-order information of the objective function to achieve convergence, the update scheme in [41] is an adaptation of the classical steepest descent step to the orthogonality constraints. Hence, orthogonality is preserved at a reasonable computational cost. In all our experiments, the training time is below 40 minutes. The SEDL algorithm is summarized in Algorithm 1.
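The constraint-preserving step of [41] can be sketched as follows: given a gradient G at the current W, form the skew-symmetric A = G W^T - W G^T and move W through the Cayley transform, which keeps the columns of W exactly orthonormal for any step size. A minimal sketch with a random stand-in gradient (the function name and sizes are illustrative; the real method also performs a curvilinear search over the step size):

```python
import numpy as np

def cayley_step(W, G, tau):
    """One Crank-Nicolson-like update from Wen & Yin [41]:
    W_new = (I + tau/2 * A)^{-1} (I - tau/2 * A) W with A = G W^T - W G^T.
    Since A is skew-symmetric, the Cayley factor is orthogonal and
    W^T W = I is preserved exactly."""
    m = W.shape[0]
    A = G @ W.T - W @ G.T                  # skew-symmetric by construction
    I = np.eye(m)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ W)

rng = np.random.default_rng(2)
m, p = 8, 3
W = np.linalg.qr(rng.standard_normal((m, p)))[0]   # W = P^T, orthonormal columns
G = rng.standard_normal((m, p))                    # stand-in for the gradient of Eq. (10)
W_new = cayley_step(W, G, tau=0.1)

assert np.allclose(W_new.T @ W_new, np.eye(p))     # orthogonality preserved
```

In SEDL, G would be the gradient of Eq. (10) evaluated at W = P^T rather than a random matrix.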
Algorithm 1: Sparse Embedded Dictionary Learning
Input: Training set Y = [Y_1, ..., Y_c] in R^{m x n}; parameters lambda_1, lambda_2, eta; maximum iteration number T.
Output: Projection matrix P, dictionary D, and coefficients X.
1. Initialization. Generate the matrix L according to Eq. (4), initialize P as the PCA projection matrix of Y, and initialize each atom of D as a random vector with unit norm.
2. Update the sparse coding coefficients X. Fix D and P, and compute X_i class by class via Eq. (6) with FISTA.
3. Update the dictionary D. Fix X and P, and compute D atom by atom with the method presented in [37].
4. Update the projection P. Fix X and D, and compute P as an optimization problem with orthogonality constraints, as presented in [41].
5. Go to step 2 until convergence is met or the maximum number of iterations is reached.
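Algorithm 1 can be read as an alternating minimization loop. The sketch below is a structural outline only: the three update routines are deliberately simplified stand-ins (a single ISTA step, a least-squares dictionary fit, and an SVD-based orthogonal projection), not the FISTA solver of Eq. (6), the atom-wise update of [37], or the constrained solver of [41] that the paper actually uses, and the objective is reduced to the first term of Eq. (2):

```python
import numpy as np

def ista_step(X, P, Y, D, lam, lr):
    """One proximal-gradient (ISTA) step on ||PY - DX||_F^2 + lam*||X||_1,
    a simplified stand-in for step 2 of Algorithm 1."""
    grad = 2 * D.T @ (D @ X - P @ Y)
    Z = X - lr * grad
    return np.sign(Z) * np.maximum(np.abs(Z) - lr * lam, 0.0)  # soft threshold

def update_dictionary(P, Y, X, eps=1e-12):
    """Least-squares D for ||PY - DX||_F^2 with atoms renormalized to unit
    norm, a simplified stand-in for step 3 of Algorithm 1."""
    D = P @ Y @ np.linalg.pinv(X)
    return D / np.maximum(np.linalg.norm(D, axis=0), eps)

def update_projection(Y, D, X):
    """Orthogonal P maximizing tr(P Y (DX)^T) via an SVD, a crude stand-in
    for the orthogonality-constrained solver of [41] in step 4."""
    U, _, Vt = np.linalg.svd(D @ X @ Y.T, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(3)
m, p, k, n = 30, 8, 10, 20
Y = rng.standard_normal((m, n))
P = np.linalg.qr(rng.standard_normal((m, p)))[0].T   # step 1: init P
D = update_dictionary(P, Y, rng.standard_normal((k, n)))
X = np.zeros((k, n))

for _ in range(10):                                  # steps 2-5
    X = ista_step(X, P, Y, D, lam=0.01, lr=0.01)
    D = update_dictionary(P, Y, X)
    P = update_projection(Y, D, X)
```

Each sub-update keeps the projection's rows orthonormal by construction, mirroring the constraint P P^T = I of the full method.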
3.4. Classification of SEDL

When the dictionary D and the projection matrix P have been learned, the coefficients of a testing sample are obtained by embedding it into the low-dimensional feature space and then coding it over D. Both the coefficients and the reconstruction error can be used as an index for classification[28, 30]. SRC[16], forming the dictionary from the original training samples, uses the reconstruction error associated with each class as the index for classification:

\tilde{x} = \arg\min_x \|y - Dx\|_2^2 + \lambda \|x\|_1,
identity(y) = \arg\min_i \|y - D_i \tilde{x}_i\|_2^2,   (11)

where \tilde{x}_i is composed of the entries of \tilde{x} related to D_i. Meanwhile, FDDL[30] introduced the idea of sub-dictionaries and the Fisher discrimination criterion in the training stage, which made the coefficients alone discriminative for classification. Here, both indices are used for classification, balanced by a weighting term beta:

\tilde{x} = \arg\min_x \|Py - Dx\|_2^2 + \lambda \|x\|_1,
identity(y) = \arg\min_i \|Py - D_i \tilde{x}_i\|_2^2 + \beta \|\tilde{x}_i - \alpha_i\|_2^2,   (12)
where the first term is the reconstruction error of the i-th class and the second term is the distance between the coefficient vector \tilde{x}_i and \alpha_i, the mean coefficient vector of the i-th class.

4. Experiments

Our method SEDL is verified on four publicly available face databases: AR[45], Extended Yale B[46], CMU PIE[47] and FERET[48]. Experiments are conducted on a computer with an Intel Core i5 CPU (3.2 GHz). Our method is compared with the 1-nearest-neighbor classifier (1NN), support vector machines (SVM)[49], sparse representation classification (SRC)[16] and other joint dimensionality reduction and dictionary learning methods. Recently proposed methods such as Discriminative KSVD (DKSVD)[25], Fisher Discriminant Dictionary Learning (FDDL)[30], Joint Discriminative Dimensionality Reduction and Dictionary Learning (JDDRDL)[26] and Bilinear Discriminant Dictionary Learning (BDDL)[28] are also compared. In order to isolate the contribution of joint dimensionality reduction, a variant of SEDL with the projection matrix fixed, namely SEDL*, is also compared. In all experiments, Principal Component Analysis (PCA)[3] is applied to reduce the dimensionality or to initialize the projection matrix for SEDL. All samples are preprocessed to have unit l2-norm before training. There are three parameters, lambda_1, lambda_2 and eta, in our SEDL model in Eq. (1), and one parameter beta in Eq. (12): lambda_1 is related to the sparsity of the coefficients, lambda_2 controls the weight of the discriminability of the coefficients, eta is added to make sure the function is convex, as in [30], and beta balances the two classification indices. In the AR and CMU PIE experiments, the training parameters are identical: lambda_1 = 0.005, lambda_2 = 0.2 and eta = 1.1. The parameters are set to lambda_1 = 0.005, lambda_2 = 0.1 and eta = 1.1 for FERET, and lambda_1 = 0.005, lambda_2 = 0.6 and eta = 1.1 for Extended Yale B. In all experiments, beta = 0.3.

4.1. The AR database
The AR database[45] consists of over 4,000 images of 126 subjects, varying in illumination, expression and accessories such as scarves and sunglasses that block part of the face. Sample images of the first subject are illustrated in Fig. 2. As in [26, 28, 30], a subset containing 1,400 images of 100 subjects (50 males and 50 females) is utilized. For each subject, 7 randomly selected images are used for training and the remaining 7 for testing, i.e., 700 training and 700 testing images over 100 individuals. The experiments are repeated 10 times to calculate the average recognition rate and the corresponding standard deviation. All images are resized to 60x43. The feature dimensionality is reduced to 300 by PCA for all experiments on the AR database, following [28].
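The evaluation protocol above (random 7/7 split per subject, repeated 10 times, mean recognition rate and standard deviation reported) can be sketched as follows. The data and the nearest-class-mean classifier here are synthetic stand-ins, not the AR images or the SEDL classifier; only the repeated-split bookkeeping mirrors the paper:

```python
import numpy as np

def one_run(rng, n_subjects=20, n_per=14, n_train=7, d=30):
    """One random 7/7 split per subject on synthetic Gaussian classes,
    scored with a nearest-class-mean stand-in classifier."""
    centers = 3.0 * rng.standard_normal((n_subjects, d))
    means, tests = [], []
    for s in range(n_subjects):
        samples = centers[s] + rng.standard_normal((n_per, d))
        idx = rng.permutation(n_per)                # random train/test split
        means.append(samples[idx[:n_train]].mean(axis=0))
        tests.append(samples[idx[n_train:]])
    means = np.stack(means)
    correct = total = 0
    for s, T in enumerate(tests):
        dists = np.linalg.norm(T[:, None, :] - means[None, :, :], axis=2)
        correct += np.sum(dists.argmin(axis=1) == s)
        total += len(T)
    return correct / total

def evaluate(run, n_runs=10, seed=0):
    """Repeat the random split n_runs times; report mean and std of the rate."""
    rng = np.random.default_rng(seed)
    rates = [run(rng) for _ in range(n_runs)]
    return float(np.mean(rates)), float(np.std(rates))

mean_rate, std_rate = evaluate(one_run)
```

Replacing `one_run`'s classifier with a trained model reproduces the "mean +/- std over 10 random splits" figures reported in the tables below.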
Fig. 2: Sample images of the first subject from AR database
Recognition rates on AR under different feature dimensionalities are shown in Fig. 3, with PCA+CRC, PCA+FDDL and JDDRDL compared as baselines. By performing dimensionality reduction jointly, SEDL achieves a markedly higher recognition rate than the other methods at every feature dimensionality, suggesting that joint dimensionality reduction helps learn a more discriminant dictionary for classification. The recognition rate of SEDL increases slightly when the feature dimension exceeds 550 and decreases at dimension 650, while the recognition rates of the other methods reach their maxima below 550. This implies that increasing the dimensionality may decrease the recognition rate and may not be effective for dictionary learning methods. Table 1 shows the recognition rates of the methods mentioned above. The proposed method achieves the best result of all. Among the dictionary learning methods MFL, DKSVD, FDDL and BDDL, BDDL performs best at 93.6%, which is 0.6% below the proposed method. JDDRDL, based on joint dictionary learning and dimensionality reduction, gives the second highest recognition rate, 94.0%, still below the proposed method. It was shown that the convergence of JDDRDL could not be guaranteed because of its improper solution for the projection matrix[26]; the proposed SEDL avoids this problem by using the method proposed in [41]. It can also be noted that SEDL*, which performs dictionary learning with a fixed projection matrix, achieves the third highest recognition rate, 93.7%.

Fig. 3: Recognition rates under different dimensionality of features on AR database

The performance is improved by a further 0.5% by the proposed SEDL, which performs dictionary learning and dimensionality reduction jointly. Experiments are also conducted with different numbers of training samples (4-7) on the AR database, comparing the proposed method with three classical dictionary learning methods; the results are shown in Table 2. The proposed method achieves the best results when sufficient training samples are given; the recognition rate decreases considerably with few training samples, and the proposed method is slightly worse than JDDRDL with only 4 training samples per person. Fig. 4 illustrates the convergence of SEDL. Because each alternating optimization subproblem is convex, the objective function value decreases at each iteration and converges.
Fig. 4: The convergence of SEDL on AR database
Table 1: The recognition rates on AR database
Method 1NN SVM[49] SRC[16] CRC[38] MFL[37] DKSVD[25] LC-DKSVD[27] FDDL[30] JDDRDL[26] BDDL[28] SEDL∗ SEDL
REC&STD(%) 71.6±1.75 74.0±1.41 90.1±0.81 92.4±0.88 92.3±0.70 85.5±0.85 89.0±0.75 92.2±0.56 94.0±0.53 93.6±0.58 93.7±0.52 94.2±0.50
Table 2: The recognition rates with different number of training samples on AR database
Num.   JDDRDL[26]   MFL[37]    FDDL[30]   SEDL
4      80.7±1.8     78.9±1.4   78.6±1.5   80.5±1.4
5      88.6±1.1     87.2±1.2   87.8±1.3   88.9±1.2
6      91.5±0.9     89.7±0.8   91.1±0.9   92.3±0.8
7      94.0±0.5     92.3±0.7   92.2±0.5   94.2±0.5
4.2. The Extended Yale B database
The Extended Yale B database [46] consists of 2,414 images of 38 individuals captured under various laboratory-controlled lighting conditions. Fig. 5 shows sample images of the first subject under various lighting conditions. For each subject, 20 images are randomly selected for training and the rest are used for testing; the experiments are repeated 10 times. All images are normalized to 54 × 48. In the comparison experiments on the Extended Yale B database, the dimensionality of the features is reduced to 500 by PCA. Recognition rates on Extended Yale B under different feature dimensionalities are shown in Fig. 6, with PCA+CRC, PCA+FDDL and JDDRDL as baselines. The proposed method outperforms the others under
Fig. 5: Sample images of the first subject from Extended Yale B database
different dimensionalities. As illustrated in Table 3, the recognition rate with the dimensionality reduced to 300 (95%) is still comparable to those of the other methods, indicating that performing dimensionality reduction jointly can achieve a higher recognition rate with fewer dimensions.
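The evaluation protocol used throughout these experiments — per-subject random splits repeated 10 times, PCA to a fixed dimensionality, then classification — can be sketched as below. The data here are synthetic stand-ins for face images, and the 1NN rule merely takes the place of the compared classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a face dataset: n_subjects classes with n_per images each.
# Real experiments use raw pixel vectors (e.g. 54*48 = 2592 dimensions).
n_subjects, n_per, d_raw = 10, 30, 200
means = 3.0 * rng.standard_normal((n_subjects, d_raw))
X = np.vstack([m + rng.standard_normal((n_per, d_raw)) for m in means])
y = np.repeat(np.arange(n_subjects), n_per)

def pca_fit(X_train, d):
    mu = X_train.mean(axis=0)
    # Principal directions from the SVD of the centered training data;
    # the resulting projection matrix has orthonormal columns.
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:d].T

rates = []
for run in range(10):                          # protocol: 10 random splits
    train_mask = np.zeros(len(y), dtype=bool)
    for c in range(n_subjects):                # 20 training images per subject
        idx = np.where(y == c)[0]
        train_mask[rng.choice(idx, size=20, replace=False)] = True
    mu, P = pca_fit(X[train_mask], d=50)       # reduce dimensionality by PCA
    Ztr, Zte = (X[train_mask] - mu) @ P, (X[~train_mask] - mu) @ P
    ytr, yte = y[train_mask], y[~train_mask]
    # 1NN classification in the reduced space
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(axis=-1)
    pred = ytr[np.argmin(d2, axis=1)]
    rates.append((pred == yte).mean())

print(f"mean±std over 10 runs: {np.mean(rates):.3f}±{np.std(rates):.3f}")
```

Reporting the mean and standard deviation over the 10 splits matches the REC&STD columns in the tables.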
Fig. 6: Recognition rates under different dimensionality of features on Extended Yale B database
Table 3 shows that the recognition rate of SEDL∗ (95.5%) is at least 1% higher than those of the other dictionary learning methods, which we attribute to the enhanced classification ability of the coefficients. The proposed SEDL performs 1.1% better than SEDL∗, a gain brought by the joint dimensionality reduction. To compare with JDDRDL, which claims to achieve a high recognition rate with a small number of training samples per subject, experiments are conducted with 3 different numbers of training samples, as illustrated in Table 4. The proposed method achieves better results than the others when sufficient training samples are given; when the number of training samples per subject is reduced to 5, it is 0.1% worse than JDDRDL.
Table 3: The recognition rates on Extended Yale B database
Method     REC&STD(%)
1NN        62.5±1.31
SVM[49]    88.3±0.74
SRC[16]    90.2±0.68
CRC[38]    94.5±0.76
MFL[37]    91.3±0.80
FDDL[30]   94.4±0.64
SEDL∗      95.5±0.61
SEDL       96.6±0.62
Table 4: The recognition rates with different number of training samples on Extended Yale B database
Number of training samples   JDDRDL[26]   MFL[37]    FDDL[30]   SEDL
5                            70.4±2.1     67.7±2.2   68.6±2.0   70.3±2.2
10                           85.1±1.5     81.9±1.7   83.5±1.8   86.2±1.5
20                           95.7±0.6     91.3±0.8   94.4±0.6   96.6±0.6
4.3. The CMU PIE database
The CMU PIE database [47] consists of 41,368 images of 68 individuals with different poses, illuminations and expressions. As in [28], the subset containing five near-frontal poses (C05, C07, C09, C27, C29) is used, with 170 images per subject. For each subject, 20 images are randomly selected for training and the rest are used for testing. Sample images of the first subject with different poses, illuminations and expressions are shown in Fig. 7. All images are resized to 60 × 45 and the experiments are repeated 10 times. The dimensionality of the features is reduced to 500 by PCA for all experiments on the CMU PIE database. Recognition rates on CMU PIE under different feature dimensionalities are shown in Fig. 8, with PCA+CRC, JDDRDL and PCA+FDDL as baselines. The proposed SEDL performs better as the feature dimensionality increases. This is because
Fig. 7: Sample images of the first subject from CMU PIE database
dimensionality reduction is performed simultaneously, so discriminant information is maintained during training.
Fig. 8: Recognition rates under different dimensionality of features on CMU PIE database
Table 5 shows that the best performance, a recognition rate of 93.4%, is achieved by the proposed method. The performance of SEDL∗ is almost equal to that of FDDL, which uses the Fisher discriminant criterion as a constraint to enhance classification ability. Although the performance of SEDL∗ with dictionary learning alone is 1% below that of BDDL, applying sparse embedding improves the result by 1.1%. This shows that jointly performing dictionary learning and dimensionality reduction can improve classification ability.
4.4. The FERET database
The FERET database [48] consists of 14,051 images with different poses, illuminations and expressions. As in [26], the subsets of frontal images marked with "ba", "bj" and "bk" are used, which contain 600 images of
Table 5: The recognition rates on CMU PIE database
Method         REC&STD(%)
1NN            61.1±0.6
SVM[49]        89.1±0.7
SRC[16]        93.0±0.4
CRC[38]        92.5±0.4
DKSVD[25]      90.8±0.7
LC-DKSVD[27]   70.5±1.5
FDDL[30]       92.3±0.5
BDDL[28]       93.3±0.4
SEDL∗          92.3±0.5
SEDL           93.4±0.4
200 individuals. Sample images from these subsets are given in Fig. 9. For each subject, the 2 images from "ba" and "bj" are used for training and the remaining image from "bk" is used for testing. All images are resized to 70 × 60. The dimensionality of the features is reduced to 400 by PCA for all experiments on the FERET database.
Fig. 9: Sample images from FERET database
Recognition rates on FERET under different feature dimensionalities are shown in Fig. 10, with PCA+CRC, PCA+FDDL and JDDRDL as baselines. The proposed method outperforms the others under different dimensionalities, and again joint dimensionality reduction achieves a higher recognition rate with fewer dimensions. Table 6 shows that the recognition rate of the proposed SEDL is at
Fig. 10: Recognition rates under different dimensionality of features on FERET database
least 1.5% higher than those of the other methods. SEDL∗ achieves the second highest score, which illustrates that the classification ability of the coefficients is improved by the constraints on the coefficients, and the joint dimensionality reduction in SEDL improves the result by a further 1.3%. The methods using dictionary learning alone are based on the concept that data points lie in multiple low-dimensional subspaces of a high-dimensional ambient space, with the data points of each class clustering together in a low-dimensional subspace [13, 32]. Hence, we consider the data structure vital to the classification ability of dictionary learning methods: fixing the raw data structure limits the learning ability of dictionary learning, and the proposed SEDL improves the result by jointly performing dictionary learning and dimensionality reduction. In this experiment, there are only 2 training images per subject, which may lower the overall performance of dictionary learning methods in general.
4.5. Gender classification on the AR database
The AR database is used to investigate the performance of different methods on gender classification. As in [28], 1,400 images of 100 individuals, 50 men and 50 women, are used; 25 men and 25 women are selected randomly for training and the rest for testing, and the experiments are repeated 10 times. All images are resized to 60 × 43. The dimensionality of the features is reduced to 300 by PCA for all experiments on the AR database. Table 7 shows that our method has the highest classification rate, 93.9%, which is at least 0.7% higher than the others. The performance of SEDL
Table 6: The recognition rates on FERET database
Method CRC[38] JDDRDL[26] SRC[16] MFL[37] FDDL[30] SEDL∗ SEDL
REC(%) 79.0 81.0 79.0 79.5 72.0 81.2 82.5
is 1.3% higher than that of SEDL∗, a gain brought by the joint dimensionality reduction.
Table 7: The gender classification rates on AR database
Method         REC&STD(%)
1NN            89.1±2.8
SVM[26]        92.1±2.3
CRC[38]        92.2±2.2
DKSVD[25]      91.9±1.7
LC-DKSVD[27]   91.2±1.9
FDDL[30]       92.9±1.8
BDDL[28]       93.2±2.2
SEDL∗          92.6±1.9
SEDL           93.9±2.1
4.6. Statistical significance test
To verify the effectiveness of the proposed method, a statistical significance test is performed on the results: a t-test of the null hypothesis that the improvement of SEDL over each competing method is insignificant, i.e., that the differences in recognition rates between SEDL and the other method come from a distribution with mean not greater than zero. Two quantities of the test are considered, H and P. H = 1 indicates that the null hypothesis can be rejected at the chosen significance level, and P is the corresponding p-value; a significant improvement corresponds to a small P. In all cases, H = 1 is obtained at the 5% significance level. In most cases, H = 1 is also obtained at the 1% level, except for JDDRDL (P = 2.2%) on the AR database, BDDL (P = 3.6%) on the CMU PIE database and BDDL (P = 1.9%) on gender classification. Note that the dimensionality used for the proposed method in these comparisons is not large; as illustrated in Fig. 3 and Fig. 8, the improvement achieved by the proposed method grows with sufficient dimensionality.
5. Conclusion
In this paper, a sparse embedded dictionary learning method that preserves the orthogonality of the projection matrix is proposed. It is a sparse representation based method that achieves dictionary learning and dimensionality reduction jointly. Experimental results show that the proposed SEDL enhances classification ability and gains more discriminant information, because the data structure of the source domain and the dictionary are optimized simultaneously. Moreover, the margin between intra-class and inter-class coefficient distances is enlarged, which exploits the discriminability of the coefficients. Experiments on the AR, Extended Yale B, CMU PIE and FERET databases demonstrate the classification ability of our method.
Acknowledgements
This work is partially financially supported by the National Science Foundation of China (NSFC) under grants 61533012 and 61521063.
References
[1] R. S. Ghiass, O. Arandjelovic, H. Bendada, and X. Maldague, “Infrared face recognition: A literature review,” in Neural Networks (IJCNN), The 2013 International Joint Conference on, pp. 1–10, Aug 2013.
[2] H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R.
Chellappa, “Sparse embedding: A framework for sparsity promoting dimensionality reduction,” in Computer Vision–ECCV 2012, pp. 414–427, Springer, 2012.
[3] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[4] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 711–720, Jul 1997.
[5] X. Niyogi, “Locality preserving projections,” in Neural Information Processing Systems, vol. 16, p. 153, MIT, 2004.
[6] X. Jiang, “Linear subspace learning-based dimensionality reduction,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 16–26, 2011.
[7] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, pp. 1031–1044, June 2010.
[8] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15, pp. 3736–3745, Dec 2006.
[9] J. Mairal, M. Elad, and G. Sapiro, “Sparse representation for color image restoration,” IEEE Transactions on Image Processing, vol. 17, pp. 53–69, Jan 2008.
[10] X. Jiang and J. Lai, “Sparse and dense hybrid representation via dictionary decomposition for face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, pp. 1067–1079, May 2015.
[11] C. P. Wei and Y. C. F. Wang, “Undersampled face recognition via robust auxiliary dictionary learning,” IEEE Transactions on Image Processing, vol. 24, pp. 1722–1734, June 2015.
[12] J. Lu, V. E. Liong, G. Wang, and P. Moulin, “Joint feature learning for face recognition,” IEEE Transactions on Information Forensics and Security, vol. 10, pp. 1371–1383, July 2015.
[13] Y. Sun, Q. Liu, J. Tang, and D. Tao, “Learning discriminative dictionary for group sparse representation,” IEEE Transactions on Image Processing, vol. 23, pp. 3816–3828, Sept 2014.
[14] Y. Xu, Z. Zhang, G. Lu, and J. Yang, “Approximately symmetrical face images for image preprocessing in face recognition and sparse representation based classification,” Pattern Recognition, vol. 54, pp. 68–82, 2016.
[15] Z. Zhao, Y. Cheung, H. Hu, and X. Wu, “Corrupted and occluded face recognition via cooperative sparse representation,” Pattern Recognition, 2016.
[16] J. Wright, A. Ganesh, S. Rao, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization,” submitted to Journal of the ACM, 2009.
[17] R. Tibshirani, “Regression shrinkage and selection via the lasso: a retrospective,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 3, pp. 273–282, 2011.
[18] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition,” in Signals, Systems and Computers, 1993 Conference Record of The Twenty-Seventh Asilomar Conference on, pp. 40–44, Nov 1993.
[19] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach, “Supervised dictionary learning,” in Advances in Neural Information Processing Systems (D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds.), pp. 1033–1040, Curran Associates, Inc., 2009.
[20] Z. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, “Approximate orthogonal sparse embedding for dimensionality reduction,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–1, 2015.
[21] J. Lai and X. Jiang, “Discriminative sparsity preserving embedding for face recognition,” in Image Processing (ICIP), 2013 20th IEEE International Conference on, pp. 3695–3699, Sept 2013.
[22] K. Slavakis, G. B. Giannakis, and G. Leus, “Robust sparse embedding and reconstruction via dictionary learning,” in Information Sciences and Systems (CISS), 2013 47th Annual Conference on, pp. 1–6, March 2013.
[23] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in Acoustics, Speech, and Signal Processing, 1999 IEEE International Conference on, vol. 5, pp. 2443–2446, 1999.
[24] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, pp. 4311–4322, Nov 2006.
[25] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2691–2698, June 2010.
[26] Z. Feng, M. Yang, L. Zhang, Y. Liu, and D. Zhang, “Joint discriminative dimensionality reduction and dictionary learning for face recognition,” Pattern Recognition, vol. 46, no. 8, pp. 2134–2143, 2013.
[27] Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent K-SVD: Learning a discriminative dictionary for recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 2651–2664, Nov 2013.
[28] H. Liu, M. Yang, Y. Gao, Y. Yin, and L. Chen, “Bilinear discriminative dictionary learning for face recognition,” Pattern Recognition, vol. 47, no. 5, pp. 1835–1845, 2014.
[29] Z. Jiang, Z. Lin, and L. S. Davis, “Learning a discriminative dictionary for sparse coding via label consistent K-SVD,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1697–1704, June 2011.
[30] M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 543–550, Nov 2011.
[31] H. Zhang, Y. Zhang, and T. S. Huang, “Simultaneous discriminative projection and dictionary learning for sparse representation based classification,” Pattern Recognition, vol. 46, no. 1, pp. 346–354, 2013.
[32] H. Wang and W. Zheng, “Robust sparsity-preserved learning with application to image visualization,” Knowledge and Information Systems, vol. 39, no. 2, pp. 287–304, 2014.
[33] X. Jiang, “Asymmetric principal component and discriminant analyses for pattern classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 931–937, 2008.
[34] L. Zhang, M. Yang, Z. Feng, and D. Zhang, “On the dimensionality reduction for sparse representation based face recognition,” in Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 1237–1240, Aug 2010.
[35] J. Lu, G. Wang, W. Deng, and P. Moulin, “Simultaneous feature and dictionary learning for image set based face recognition,” in Computer Vision–ECCV 2014, pp. 265–280, Springer, 2014.
[36] D. Cai, X. He, J. Han, and H. J. Zhang, “Orthogonal laplacianfaces for face recognition,” IEEE Transactions on Image Processing, vol. 15, pp. 3608–3614, Nov 2006.
[37] M. Yang, L. Zhang, J. Yang, and D. Zhang, “Metaface learning for sparse representation based face recognition,” in Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 1601–1604, Sept 2010.
[38] L. Zhang, M. Yang, X. Feng, Y. Ma, and D. Zhang, “Collaborative representation based classification for face recognition,” arXiv preprint arXiv:1204.2358, 2012.
[39] J. Lai and X. Jiang, “Modular weighted global sparse representation for robust face recognition,” IEEE Signal Processing Letters, vol. 19, no. 9, pp. 571–574, 2012.
[40] J. Lai and X. Jiang, “Class-wise sparse and collaborative patch representation for face recognition,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3261–3272, 2016.
[41] Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,” Mathematical Programming, vol. 142, no. 1-2, pp. 397–434, 2013.
[42] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Proceedings of the 10th Asian Conference on Computer Vision, Part II, pp. 709–720, Springer-Verlag, 2010.
[43] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, “ICA with reconstruction cost for efficient overcomplete feature learning,” in Advances in Neural Information Processing Systems, pp. 1017–1025, 2011.
[44] J. M. Bioucas-Dias and M. A. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Transactions on Image Processing, vol. 16, pp. 2992–3004, Dec 2007.
[45] A. M. Martinez, “The AR face database,” CVC Technical Report, vol. 24, 1998.
[46] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 643–660, Jun 2001.
[47] T. Sim, S. Baker, and M. Bsat, “The CMU Pose, Illumination, and Expression (PIE) database,” in Automatic Face and Gesture Recognition, 2002 Proceedings, Fifth IEEE International Conference on, pp. 46–51, May 2002.
[48] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1090–1104, Oct 2000.
[49] V. Vapnik, The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.