Neurocomputing 143 (2014) 134–143
Multi-feature multi-manifold learning for single-sample face recognition

Haibin Yan (a), Jiwen Lu (b), Xiuzhuang Zhou (c,*), Yuanyuan Shang (c)

(a) Department of Mechanical Engineering, National University of Singapore, Singapore 117576, Singapore
(b) Advanced Digital Sciences Center, Singapore 138632, Singapore
(c) School of Information Engineering, Capital Normal University, Beijing 100048, China

* Corresponding author. E-mail addresses: [email protected] (H. Yan), [email protected] (J. Lu), [email protected] (X. Zhou), [email protected] (Y. Shang).
Article history: Received 6 November 2013; received in revised form 15 April 2014; accepted 2 June 2014; available online 17 June 2014. Communicated by Xu Zhao.

Keywords: Single-sample face recognition; Multi-feature learning; Multi-manifold learning

Abstract

This paper presents a Multi-feature Multi-Manifold Learning (M3L) method for single-sample face recognition (SSFR). While numerous face recognition methods have been proposed over the past two decades, most of them suffer a heavy performance drop, or even fail to work, on the SSFR problem because there are not enough training samples for discriminative feature extraction. In this paper, we propose an M3L method to extract multiple discriminative features from face image patches. First, each registered face image is partitioned into several non-overlapping patches and multiple local features are extracted within each patch. Then, we formulate SSFR as a multi-feature multi-manifold matching problem, and multiple discriminative feature subspaces are jointly learned to maximize the manifold margins of different persons, so that person-specific discriminative information is exploited for recognition. Lastly, we present a multi-feature manifold–manifold distance measure to recognize the probe subjects. Experimental results on the widely used AR, FERET and LFW datasets demonstrate the efficacy of our proposed approach.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

Face recognition has received increasing attention in both academic and industrial communities in recent years, and a large number of face recognition approaches have been proposed in the literature [1–15]. Generally, these approaches can be classified into two categories: geometry-based and appearance-based. Since it is challenging to precisely localize and extract geometrical features from many real facial images, especially those captured in unconstrained environments, appearance-based methods are more popular in face recognition, and many such algorithms have been proposed over the past two decades [1,2,5,8].

The performance of appearance-based face recognition methods is heavily affected by the number of training samples per person [16]. Specifically, if the number of training samples per person is much smaller than the facial feature dimension, the intra-class and inter-class variances estimated by existing appearance-based methods [2,6,8] are usually inaccurate. In many practical face recognition applications, such as law enforcement, e-passports and ID card identification, there is only a single sample per person registered in the system, because it is generally difficult to
capture additional samples. Therefore, many existing supervised appearance-based methods cannot be directly applied to the single-sample face recognition (SSFR) problem, due to the absence of intra-class samples for estimating the within-class variance. While there have been some attempts to address this problem in recent years [17–27], there is still considerable room to improve recognition performance.

Existing SSFR methods can be classified into four categories: virtual sample generation [18,19], local feature representation [28,20,29–31], generic learning [21,22,32,33], and image partitioning [23,25–27]. In the first category, virtual samples of each training image are generated so that multiple samples per person are available for discriminative feature extraction. However, there is high correlation among the virtually generated samples, which makes the extracted features highly redundant. In the second category, each face image is represented by a discriminative feature descriptor so that different persons are separated as much as possible. These methods have high computational complexity, which may limit their effectiveness in practical applications. In the third category, an additional generic training set with multiple samples per person is employed to extract discriminative features. Although these methods work around the SSFR problem, their performance is heavily affected by the selected generic training set, which is difficult to construct in practical applications. In the last category, each face image is first partitioned into several local patches and
Fig. 1. Schematic illustration of the basic idea of our proposed M3L-based SSFR approach. Given a testing face image, we first partition it into several non-overlapping patches and extract K different local features within each patch. Then, SSFR is converted into a multi-feature manifold–manifold matching problem. Lastly, the minimal multi-feature manifold–manifold distance between the gallery samples and the probe sample is used to recognize the testing subject.
then discriminative learning techniques are used for feature extraction [24]. However, methods in this category ignore the geometrical information of local patches: when a face image is partitioned into several local patches, the different parts of the original face image cannot be modeled accurately by a simple distribution. It is more likely that these patches reside on a manifold, with each patch corresponding to a point on the manifold. Motivated by this intuition, we recently proposed a discriminative multi-manifold analysis (DMMA) method [27], which models the local patches of each face image as a manifold and performs discriminative manifold matching for SSFR. However, only the raw intensity feature was utilized for facial patch representation, which is not discriminative enough for robust face matching. Since local features are more robust than raw pixels in face recognition, it is desirable to exploit local features for SSFR under the multi-manifold matching framework.

In this paper, we propose a Multi-feature Multi-Manifold Learning (M3L) method for SSFR. Fig. 1 illustrates the basic idea of our proposed approach. Given each face image, we first extract multiple features for each patch, so that the representation is robust to variations and more complementary information can be exploited. Since different features correspond to different manifolds with different intrinsic dimensions, the SSFR problem is formulated as a multi-feature multi-manifold matching problem. To exploit the discriminative information among these manifolds, we maximize the manifold margins by learning multiple discriminative subspaces. Experimental results on three datasets show the efficacy of the proposed approach.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 details our proposed approach. Section 4 provides the experimental results, and Section 5 concludes the paper.
2. Related work

A number of manifold learning algorithms have been proposed in recent years to discover the intrinsic low-dimensional embedding of the original data [34–36], and most of them have been successfully applied to face recognition. The basic assumption of these methods is that high-dimensional data can be considered as a set of geometrically related points lying on or near a smooth low-dimensional manifold. While encouraging performance has been obtained, these methods simply assume that samples from different classes define a single manifold in the feature space, which may not hold in many practical applications, because samples from different classes could lie on different sub-manifolds [37]. Inspired by this observation, several multi-manifold learning algorithms have been proposed in recent years [38,39,27], which model samples from the same class as a manifold, so that the recognition task can be cast as a manifold–manifold matching problem. Existing multi-manifold learning methods assume that data are drawn from a single vector space and thus cannot handle multi-feature representations directly. To address this problem, we propose a new multi-feature multi-manifold learning method that learns multiple discriminative feature spaces to maximize the multi-feature manifold margins of different persons. To the best of our knowledge, this is the first attempt at multi-feature learning in the context of multi-manifold learning. In particular, our recently proposed DMMA method [27] is a special case of the proposed M3L in which only a single feature is used.

3. Proposed approach

In this section, we first formulate the SSFR problem as multi-feature multi-manifold matching, and then present the proposed M3L method.

3.1. Problem formulation

Let $X = [x_1, x_2, \ldots, x_N]$ be the training set, where $x_i$ is the training example of the $i$th person with a size of $p \times q$, $1 \le i \le N$, and $N$ is the number of subjects in the training set. Assume each face image $x_i$ is divided into $r$ non-overlapping local patches of equal size $s \times t$, where $r = (p \times q)/(s \times t)$. Let $M_i^k = [x_{i1}^k, x_{i2}^k, \ldots, x_{ir}^k]$ be the image patch set of the $i$th person extracted by the $k$th feature representation method, which constitutes a manifold $\mathcal{M}_i^k$, and let $M_i = [M_i^1, M_i^2, \ldots, M_i^K]$ contain the $K$ manifolds of the $i$th person extracted by $K$ different feature representation methods. Let $M = [M_1, M_2, \ldots, M_N]$ be the training set of $N$ persons, where $x_{ij}^k \in \mathbb{R}^{d^k}$, $1 \le i \le N$, $1 \le j \le r$, and $1 \le k \le K$.
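As a concrete illustration of the notation above, the following Python sketch (ours, not the authors' code) builds the per-feature patch sets $M_i^k$ for one registered image; the feature extractors are placeholders for, e.g., raw intensity, LBP or Gabor descriptors:

```python
import numpy as np

def extract_patches(image, s, t):
    """Partition a p x q image into r = (p*q)/(s*t) non-overlapping s x t patches."""
    p, q = image.shape
    assert p % s == 0 and q % t == 0, "patch size must tile the image exactly"
    return [image[u:u + s, v:v + t]
            for u in range(0, p, s) for v in range(0, q, t)]

def build_manifolds(image, s, t, feature_extractors):
    """Build the K per-feature patch sets M_i^k of Section 3.1 for one image.
    feature_extractors: list of K functions, each mapping an s x t patch to a
    d_k-dimensional descriptor (placeholders for intensity, LBP, Gabor)."""
    patches = extract_patches(image, s, t)
    # M[k] is a (d_k x r) matrix with one column per local patch.
    return [np.stack([f(p_) for p_ in patches], axis=1)
            for f in feature_extractors]

# Example: a 60 x 60 image with 10 x 10 patches gives r = 36 patches, matching
# the block size used in Section 4; raw intensity is simply flattening.
raw_intensity = lambda patch: patch.ravel().astype(float)
```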
Similarly, each test image $T$ is partitioned into $r$ non-overlapping local patches, and the multi-feature manifold $M_T = [M_T^1, M_T^2, \ldots, M_T^K]$ containing $K$ manifolds can be constructed in the same way. SSFR can now be formulated as the following multi-feature manifold–manifold matching problem:

$$c = \arg\min_{i} d(M_T, M_i) \tag{1}$$

where

$$d(M_T, M_i) = \sum_{k=1}^{K} \alpha_k \, d(M_T^k, M_i^k) \tag{2}$$

Here $d(M_T^k, M_i^k)$ is the manifold–manifold distance between $M_T^k$ and $M_i^k$, the $k$th manifolds of the testing sample and the $i$th training sample, respectively; $\alpha_k > 0$ is a weighting coefficient with $\sum_k \alpha_k = 1$. The multi-feature manifold–manifold distance is thus a weighted combination of the single-feature manifold–manifold distances.
Fig. 2. Illustration of the basic idea of manifold–manifold matching for SSFR, where di denotes the manifold distance between the test sample and the ith training sample.
Fig. 2 illustrates the basic idea of manifold–manifold matching for SSFR; the detailed computation of $d(M_T^k, M_i^k)$ will be given in Section 3.3.
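A minimal sketch of the matching rule in Eqs. (1) and (2), assuming the per-feature manifold–manifold distances have already been computed (their computation is deferred to Section 3.3):

```python
import numpy as np

def combined_distance(per_feature_dists, alpha):
    """Eq. (2): weighted combination of the K single-feature manifold-manifold
    distances d(M_T^k, M_i^k), with alpha_k > 0 and sum_k alpha_k = 1."""
    return float(np.dot(per_feature_dists, alpha))

def recognize(dist_matrix, alpha):
    """Eq. (1): dist_matrix is (N x K), entry (i, k) = d(M_T^k, M_i^k);
    return the index of the gallery subject with minimal combined distance."""
    return int(np.argmin(dist_matrix @ alpha))
```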
3.2. M3L

While several methods have been proposed to compute the manifold–manifold distance with a single feature descriptor [40,41,27], a multi-feature manifold margin has not been formally defined. A possible solution is to concatenate data samples from different feature spaces into a new augmented manifold and then apply an existing single-feature manifold–manifold distance method to compute the distance between two manifolds. However, this concatenation is not physically meaningful, because each feature has its own specific statistical properties; it ignores the diversity of multi-feature data and thus cannot efficiently exploit the complementary nature of different features. To address this, we propose to learn mappings that project patches in the same manifold close to one another and patches in different manifolds far apart. Specifically, the aim of M3L is to seek $N \times K$ feature matrices $W_{11}, W_{12}, \ldots, W_{1K}, \ldots, W_{ik}, \ldots, W_{NK}$, $i = 1, 2, \ldots, N$, $k = 1, 2, \ldots, K$, that project the training set into $N \times K$ low-dimensional feature spaces such that the multi-feature manifold margins are maximized, where $W_{ik}$ is the learned feature space of the $k$th view of the $i$th person.

Given a patch sample $x_{ij}^k$, the $j$th patch of the $i$th manifold under the $k$th feature, there are two types of locality neighbors in these manifolds: intra-manifold neighbors and inter-manifold neighbors. From the classification viewpoint, we aim to simultaneously minimize the intra-manifold variance and maximize the inter-manifold separability in these low-dimensional feature spaces, so that the multi-feature manifold margin is maximized for discriminative feature extraction. To achieve this goal, we formulate the proposed M3L method as the following optimization problem:

$$\max_{W_{ik},\,\alpha_k,\;1 \le i \le N,\;1 \le k \le K} A = A_1 - A_2 = \sum_{i=1}^{N}\sum_{k=1}^{K} \alpha_k \left( \frac{1}{t_1} \sum_{j=1}^{r}\sum_{c=1}^{t_1} \left\| W_{ik}^T x_{ij}^k - W_{ik}^T y_{ijc}^k \right\|^2 F_{ijc}^k \right) - \sum_{i=1}^{N}\sum_{k=1}^{K} \alpha_k \left( \frac{1}{t_2} \sum_{j=1}^{r}\sum_{d=1}^{t_2} \left\| W_{ik}^T x_{ij}^k - W_{ik}^T z_{ijd}^k \right\|^2 H_{ijd}^k \right) \tag{3}$$
The $A_1$ term ensures that the low-dimensional representations of $x_{ij}^k$ and $y_{ijc}^k$ are separated as far as possible if they are close but from different classes, while the $A_2$ term ensures that the low-dimensional representations of $x_{ij}^k$ and $z_{ijd}^k$ are pushed as close as possible if they are close and from the same class. Here $y_{ijc}^k$ denotes the $c$th of the $t_1$ nearest inter-manifold neighbors of $x_{ij}^k$, and $z_{ijd}^k$ denotes the $d$th of the $t_2$ nearest intra-manifold neighbors of $x_{ij}^k$; $F_{ijc}^k$ and $H_{ijd}^k$ are two affinity matrices characterizing the similarity between $x_{ij}^k$ and $y_{ijc}^k$ and between $x_{ij}^k$ and $z_{ijd}^k$ under the $k$th feature, respectively, defined as follows:

$$F_{ijc}^k = \begin{cases} \exp\bigl( -\|x_{ij}^k - y_{ijc}^k\|^2 / \sigma^2 \bigr), & \text{if } y_{ijc}^k \in P(x_{ij}^k) \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$$H_{ijd}^k = \begin{cases} \exp\bigl( -\|x_{ij}^k - z_{ijd}^k\|^2 / \sigma^2 \bigr), & \text{if } z_{ijd}^k \in Q(x_{ij}^k) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $P(x_{ij}^k)$ and $Q(x_{ij}^k)$ denote the $t_1$ inter-manifold neighbors and the $t_2$ intra-manifold neighbors of $x_{ij}^k$ in the learned feature subspace, respectively, and $t_1$, $t_2$ and $\sigma$ are three empirical parameters. $F$ and $H$ thus define two graphs measuring the inter-class and intra-class similarity of local patches in different manifolds. Fig. 3 presents an example of how these intra-manifold and inter-manifold neighbors are identified and used to maximize the multi-feature manifold margins, where $t_1$ and $t_2$ were set to 3 and 2, respectively.
Fig. 3. We take two features as an example to illustrate the basic idea of our multi-feature multi-manifold learning method. Points with the same color denote local patches from the same feature, and points with the same shape denote local patches from the same subject. (a) There are 3 intra-manifold neighbors and 2 inter-manifold neighbors for a local patch; points with the same color denote local patches from the same class, and points with different shapes denote local patches from different subjects. (b) The intra-manifold graphs connect intra-manifold neighbors. (c) The inter-manifold graphs connect inter-manifold neighbors. (d) After M3L, local patches from the same class in different feature spaces are mapped as close as possible, and those from different persons are mapped as far as possible, so that the multi-feature manifold margin is maximized and discriminative information can be exploited for recognition. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
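The affinity weights of Eqs. (4) and (5) are standard heat-kernel weights over selected neighbor sets. A small NumPy sketch (ours, under the assumption that patches are given as column vectors):

```python
import numpy as np

def knn_indices(x, candidates, t):
    """Indices of the t nearest columns of 'candidates' (d x n) to the patch x,
    used to select the t1 inter-manifold set P(x) or t2 intra-manifold set Q(x)."""
    sq_dists = np.sum((candidates - x[:, None]) ** 2, axis=0)
    return np.argsort(sq_dists)[:t]

def heat_kernel_weights(x, neighbors, sigma):
    """Eqs. (4)-(5): exp(-||x - n||^2 / sigma^2) for each selected neighbor;
    weights for non-neighbors are zero by construction and never computed."""
    sq_dists = np.sum((neighbors - x[:, None]) ** 2, axis=0)
    return np.exp(-sq_dists / sigma ** 2)
```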
To the best of our knowledge, there is no closed-form solution to the problem in Eq. (3): there are $N \times K$ projection matrices and one weighting vector $\alpha$ to be learned simultaneously, and each projection matrix $W_{ik}$ must be known before the nearest neighbors can be computed. In this paper, we therefore apply an alternating optimization approach. Specifically, we first initialize $W_{11}, W_{12}, \ldots, W_{NK}$ and $\alpha$ with valid solutions, solve for each $W_{ik}$ sequentially, and then update $\alpha$ iteratively.

Given $W_{11}, W_{12}, \ldots, W_{i(k-1)}, W_{i(k+1)}, \ldots, W_{NK}$ and $\alpha$, Eq. (3) can be rewritten as

$$\max_{W_{ik}} A = (A_1(W_{ik}) + G_1) - (A_2(W_{ik}) + G_2) = \left( \alpha_k \frac{1}{t_1} \sum_{j=1}^{r}\sum_{c=1}^{t_1} \left\| W_{ik}^T x_{ij}^k - W_{ik}^T y_{ijc}^k \right\|^2 F_{ijc}^k + G_1 \right) - \left( \alpha_k \frac{1}{t_2} \sum_{j=1}^{r}\sum_{d=1}^{t_2} \left\| W_{ik}^T x_{ij}^k - W_{ik}^T z_{ijd}^k \right\|^2 H_{ijd}^k + G_2 \right) \tag{6}$$

where

$$G_1 = \sum_{p=1,\,p\neq i}^{N} \sum_{q=1}^{K} \alpha_q \left( \frac{1}{t_1} \sum_{j=1}^{r}\sum_{c=1}^{t_1} \left\| W_{pq}^T x_{pj}^q - W_{pq}^T y_{pjc}^q \right\|^2 F_{pjc}^q \right) + \sum_{q=1,\,q\neq k}^{K} \alpha_q \left( \frac{1}{t_1} \sum_{j=1}^{r}\sum_{c=1}^{t_1} \left\| W_{iq}^T x_{ij}^q - W_{iq}^T y_{ijc}^q \right\|^2 F_{ijc}^q \right) \tag{7}$$

and

$$G_2 = \sum_{p=1,\,p\neq i}^{N} \sum_{q=1}^{K} \alpha_q \left( \frac{1}{t_2} \sum_{j=1}^{r}\sum_{d=1}^{t_2} \left\| W_{pq}^T x_{pj}^q - W_{pq}^T z_{pjd}^q \right\|^2 H_{pjd}^q \right) + \sum_{q=1,\,q\neq k}^{K} \alpha_q \left( \frac{1}{t_2} \sum_{j=1}^{r}\sum_{d=1}^{t_2} \left\| W_{iq}^T x_{ij}^q - W_{iq}^T z_{ijd}^q \right\|^2 H_{ijd}^q \right) \tag{8}$$

are two terms that do not depend on $W_{ik}$, while $A_1(W_{ik})$ and $A_2(W_{ik})$ measure the inter-class and intra-class variations of local patches in the learned $k$th feature space, respectively. $A_1(W_{ik})$ can be simplified as follows:

$$A_1(W_{ik}) = \alpha_k \frac{1}{t_1} \sum_{j=1}^{r}\sum_{c=1}^{t_1} \operatorname{tr}\!\left( W_{ik}^T \left[ (x_{ij}^k - y_{ijc}^k)(x_{ij}^k - y_{ijc}^k)^T F_{ijc}^k \right] W_{ik} \right) = \operatorname{tr}(W_{ik}^T B_1 W_{ik}) \tag{9}$$

where

$$B_1 \triangleq \alpha_k \frac{1}{t_1} \sum_{j=1}^{r}\sum_{c=1}^{t_1} (x_{ij}^k - y_{ijc}^k)(x_{ij}^k - y_{ijc}^k)^T F_{ijc}^k \tag{10}$$

Similarly, we simplify $A_2(W_{ik})$ as follows:

$$A_2(W_{ik}) = \operatorname{tr}(W_{ik}^T B_2 W_{ik}) \tag{11}$$

where

$$B_2 \triangleq \alpha_k \frac{1}{t_2} \sum_{j=1}^{r}\sum_{d=1}^{t_2} (x_{ij}^k - z_{ijd}^k)(x_{ij}^k - z_{ijd}^k)^T H_{ijd}^k \tag{12}$$

$B_1$ and $B_2$ capture the inter-class and intra-class variations of local patches in the original $k$th feature space, respectively. Having obtained $B_1$ and $B_2$, we obtain the bases of $W_{ik}$ by solving the following eigenvalue equation:

$$(B_1 - B_2)\, w = \lambda w \tag{13}$$

Let $\{w_1, w_2, \ldots, w_{d^{ik}}\}$ be the eigenvectors corresponding to the $d^{ik}$ largest eigenvalues $\{\lambda_j \mid j = 1, 2, \ldots, d^{ik}\}$, ordered such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{d^{ik}}$. Then $W_{ik} = [w_1, w_2, \ldots, w_{d^{ik}}]$ is the projection matrix. We solve Eq. (13) iteratively and sequentially, each time fixing the other $NK - 1$ projection matrices, to determine $W_{11}, W_{12}, \ldots, W_{NK}$.
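For reference, a NumPy sketch of this $W_{ik}$ update under the definitions above: accumulate the weighted scatter matrices $B_1$ (Eq. (10)) and $B_2$ (Eq. (12)) and take the leading eigenvectors of their difference (Eq. (13)). The array shapes are our own conventions, not the authors':

```python
import numpy as np

def solve_W(X, Y, F, Z, H, alpha_k, t1, t2, dim):
    """One W_ik update via Eq. (13).
    X: (d x r) patches of person i under feature k;
    Y: (d x r x t1) inter-manifold neighbors with weights F (r x t1);
    Z: (d x r x t2) intra-manifold neighbors with weights H (r x t2)."""
    d, r = X.shape
    B1 = np.zeros((d, d))   # Eq. (10): inter-class scatter
    B2 = np.zeros((d, d))   # Eq. (12): intra-class scatter
    for j in range(r):
        for c in range(t1):
            diff = X[:, j] - Y[:, j, c]
            B1 += F[j, c] * np.outer(diff, diff)
        for e in range(t2):
            diff = X[:, j] - Z[:, j, e]
            B2 += H[j, e] * np.outer(diff, diff)
    B1 *= alpha_k / t1
    B2 *= alpha_k / t2
    # Eq. (13): (B1 - B2) is symmetric, so eigh applies; sort descending.
    evals, evecs = np.linalg.eigh(B1 - B2)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:dim]], evals[order]
```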
Now, we update $\alpha$ via the learned projection matrices. The objective function in Eq. (3) can be rewritten as

$$\max_{\alpha} A = \sum_{k=1}^{K} \alpha_k \left[ \operatorname{tr}\!\left( W_{ik}^T (T_{1k} - T_{2k}) W_{ik} \right) \right] \quad \text{subject to} \quad \sum_{k=1}^{K} \alpha_k = 1, \;\; \alpha_k \ge 0 \tag{14}$$

where

$$T_{1k} = \frac{1}{t_1} \sum_{i=1}^{N}\sum_{j=1}^{r}\sum_{c=1}^{t_1} (x_{ij}^k - y_{ijc}^k)(x_{ij}^k - y_{ijc}^k)^T F_{ijc}^k \tag{15}$$

$$T_{2k} = \frac{1}{t_2} \sum_{i=1}^{N}\sum_{j=1}^{r}\sum_{d=1}^{t_2} (x_{ij}^k - z_{ijd}^k)(x_{ij}^k - z_{ijd}^k)^T H_{ijd}^k \tag{16}$$

are the inter-class and intra-class variations of local patches across all $K$ learned feature spaces, respectively. The solution of Eq. (14) is $\alpha_k = 1$ for the single feature attaining the maximum $\operatorname{tr}(W_{ik}^T (T_{1k} - T_{2k}) W_{ik})$, and $\alpha_k = 0$
otherwise. This solution means that only one feature is selected by our method, which is equivalent to using the single best feature; it does not exploit the complementary properties of multi-feature data to obtain better performance than any single feature. To address this, we replace $\alpha_k$ with $\alpha_k^q$, where $q > 1$, and define the new objective function as

$$\max_{\alpha} A = \sum_{k=1}^{K} \alpha_k^q \left[ \operatorname{tr}\!\left( W_{ik}^T (T_{1k} - T_{2k}) W_{ik} \right) \right] \quad \text{s.t.} \quad \sum_{k=1}^{K} \alpha_k = 1, \;\; \alpha_k \ge 0 \tag{17}$$

We construct a Lagrange function

$$L(\alpha, \zeta) = \sum_{k=1}^{K} \alpha_k^q \left[ \operatorname{tr}\!\left( W_{ik}^T (T_{1k} - T_{2k}) W_{ik} \right) \right] - \zeta \left( \sum_{k=1}^{K} \alpha_k - 1 \right) \tag{18}$$

Setting $\partial L(\alpha, \zeta)/\partial \alpha_k = 0$ and $\partial L(\alpha, \zeta)/\partial \zeta = 0$, we have

$$q \alpha_k^{q-1} \operatorname{tr}\!\left[ W_{ik}^T (T_{1k} - T_{2k}) W_{ik} \right] - \zeta = 0 \tag{19}$$

$$\sum_{k=1}^{K} \alpha_k - 1 = 0 \tag{20}$$

Combining Eqs. (19) and (20), we obtain $\alpha_k$ as follows:

$$\alpha_k = \frac{\left( 1 / \operatorname{tr}\!\left[ W_{ik}^T (T_{1k} - T_{2k}) W_{ik} \right] \right)^{1/(q-1)}}{\sum_{k'=1}^{K} \left( 1 / \operatorname{tr}\!\left[ W_{ik'}^T (T_{1k'} - T_{2k'}) W_{ik'} \right] \right)^{1/(q-1)}} \tag{21}$$
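The weight update of Eq. (21) is a one-line computation once the $K$ trace values are available; a sketch, assuming each trace is positive (which holds once only positive-eigenvalue directions are kept, as discussed below):

```python
import numpy as np

def update_alpha(traces, q):
    """Eq. (21): closed-form weight update from the K trace values
    tr[W_ik^T (T_1k - T_2k) W_ik]; larger q spreads the weights more evenly."""
    scores = (1.0 / np.asarray(traces)) ** (1.0 / (q - 1.0))
    return scores / scores.sum()

# Example: traces 2.0, 4.0 and 8.0 with q = 2 yield weights proportional to
# 1/2, 1/4 and 1/8, i.e. approximately [0.571, 0.286, 0.143].
```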
The proposed M3L method is summarized in Algorithm 1.

Algorithm 1. M3L.

Input: Multiple manifolds $M_1^1, \ldots, M_i^k, \ldots, M_N^K$, where $M_i^k = [x_{i1}^k, \ldots, x_{ij}^k, \ldots, x_{ir}^k]$ is the $k$th manifold of the $i$th person, $1 \le i \le N$, $1 \le j \le r$, $1 \le k \le K$; parameters $t_1$, $t_2$, $\sigma$; iteration number $T$; tuning parameter $q$; and convergence error $\varepsilon$.
Output: Projection matrices $W_{ik} \in \mathbb{R}^{d^k \times d^{ik}}$ and weighting vector $\alpha$.
Step 1 (Initialization):
  1.1. Set $\alpha = [1/K, 1/K, \ldots, 1/K]$.
  1.2. Set $W_{ik}^0 = I^{d^k \times d^k}$.
  1.3. For each sample $x_{ij}^k$, compute $F$ and $H$ as shown in Eqs. (4) and (5), respectively.
Step 2 (Local optimization): For $t = 1, 2, \ldots, T$, repeat:
  2.1. Compute $W_{ik}$ by solving Eq. (13).
  2.2. Compute $\alpha$ by solving Eq. (21).
  2.3. Update $x_{ij}^k$: $x_{ij}^k = W_{ik}^T x_{ij}^k$.
  2.4. For each sample $x_{ij}^k$, update the neighbors with the learned $W_{ik}$ and re-compute $F$ and $H$, respectively.
  2.5. If $t > 2$ and $|W_{ik}^t - W_{ik}^{t-1}| < \varepsilon$, go to Step 3.
Step 3 (Output projection matrices and the weight): Output $W_{ik} = W_{ik}^t$ and $\alpha = \alpha^t$.

Unlike existing manifold learning algorithms, which select the feature dimension empirically [5,8,35,34,36], we automatically compute the feature dimension $d^{ik}$ of each projection matrix $W_{ik}$, because there are $N \times K$ dimension numbers to be determined in our method. In general, there are $\prod_{i=1}^{N}\prod_{k=1}^{K} d^{ik}$ candidates to search, and it is very time-consuming, if not impossible, to select the best one. We therefore adopt the automatic dimension determination method of [27,29], which analyzes the eigenvalues of $(B_1 - B_2)$. Since $(B_1 - B_2)$ is not positive semi-definite, its eigenvalues may be positive, zero or negative. Ordering the eigenvalues as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{d^{ik}} > 0 \ge \lambda_{d^{ik}+1} \ge \cdots \ge \lambda_{d^k}$, we select the first $d^{ik}$ eigenvectors to maximize Eq. (17). This is because, when the samples are projected onto a specific eigenvector $w_t$ corresponding to an eigenvalue $\lambda_t$, Eq. (17) can be written as

$$A(w_t) = A_1(w_t) - A_2(w_t) = w_t^T (B_1 - B_2)\, w_t = w_t^T \lambda_t w_t = \lambda_t \tag{22}$$

If $\lambda_t > 0$, the inter-manifold distance is larger than the intra-manifold distance along the direction of $w_t$, and samples are likely to be classified correctly. According to this criterion, we can automatically determine the feature dimension $d^{ik}$ for $W_{ik}$.
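Under this criterion, dimension selection reduces to counting the positive eigenvalues of $(B_1 - B_2)$; a one-function sketch:

```python
import numpy as np

def select_dimension(eigenvalues):
    """Eq. (22): along an eigenvector with lambda_t > 0 the inter-manifold
    spread exceeds the intra-manifold spread, so d_ik is the number of
    positive eigenvalues of (B1 - B2)."""
    return int(np.sum(np.asarray(eigenvalues) > 0))
```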
3.3. Classification

Having obtained $W_{ik}$ and $\alpha$, we perform classification using the multi-feature manifold–manifold distance defined in Eq. (2), where

$$d(M_T^k, M_i^k) = \frac{1}{r} \sum_{j=1}^{r} d\!\left( W_{ik}^T x_j^k, \; W_{ik}^T N_k(x_j^k) \right) \tag{23}$$

Here $x_j^k$ denotes the $j$th local patch of the testing sample under the $k$th feature, $N_k(x_j^k)$ denotes the $k$-nearest neighbors of $x_j^k$ in $M_i^k$, and $d(W_{ik}^T x_j^k, W_{ik}^T N_k(x_j^k))$ is obtained by solving a constrained least squares problem similar to the locally linear embedding method discussed in [34]. Fig. 4 illustrates how to compute the distance between a testing patch and its nearest neighbors.
Fig. 4. Illustration of computing the manifold–manifold distance. For the testing patch $x_j^k$, we first identify its $k$-nearest neighbors $N_k(x_j^k)$ in the gallery manifold. Then, local patches in the neighborhood are used to reconstruct the testing patch. In this figure, the neighborhood size is 3 for ease of illustration. Finally, the average over all patches in the manifold is computed as the distance between the two manifolds in the $k$th feature space.
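A sketch of the classification distance in Eq. (23), with the neighbor-reconstruction weights obtained by the LLE-style constrained least squares mentioned above; the small regularizer is our own addition for numerical stability, and the neighborhood size $k$ is a free parameter:

```python
import numpy as np

def lle_reconstruction_distance(x, Nbrs, reg=1e-3):
    """d(x, N(x)): reconstruct the projected test patch x (d-vector) from its
    neighbors Nbrs (d x k) with weights summing to one, as in locally linear
    embedding [34], and return the residual norm."""
    k = Nbrs.shape[1]
    D = Nbrs - x[:, None]
    G = D.T @ D                                   # local Gram matrix
    G += reg * np.trace(G) * np.eye(k) / k        # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                  # sum-to-one constraint
    return float(np.linalg.norm(x - Nbrs @ w))

def manifold_distance(test_patches, gallery_patches, W, k=4):
    """Eq. (23): average reconstruction distance over the r projected test
    patches; test_patches and gallery_patches are (d x r) matrices."""
    P_test, P_gal = W.T @ test_patches, W.T @ gallery_patches
    total = 0.0
    for j in range(P_test.shape[1]):
        x = P_test[:, j]
        idx = np.argsort(np.sum((P_gal - x[:, None]) ** 2, axis=0))[:k]
        total += lle_reconstruction_distance(x, P_gal[:, idx])
    return total / P_test.shape[1]
```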
4. Experiments

We evaluated our proposed M3L method by conducting SSFR experiments on three benchmark face databases: AR [42], FERET [43] and LFW [44]. The following describes the details of the experiments and results.

4.1. Datasets and settings

The AR face database [42] contains more than 4000 color face images of 126 people (70 men and 56 women), including frontal facial images with different expressions, lighting conditions, and occlusions (sunglasses and scarves). For each person, there are 26 facial images taken in two sessions (separated by two weeks), and each session contains thirteen 768 × 576 color images. In our experiments, eight subsets of 800 images (8 images per subject) from 100 subjects (50 men and 50 women) were used, taken from the two sessions and with different expressions. Table 1 provides detailed information on each subset. Fig. 5(a) shows eight sample images of one subject from the AR database. We used the images in Subset A for training and the remaining seven subsets for testing.

The FERET database [43] consists of 13,539 facial images of 1565 subjects, who are diverse in ethnicity, gender and age. In our experiments, we used two subsets (FERET-1 and FERET-2) of the FERET database and compared our method with state-of-the-art methods that can also address the SSFR problem. Similar to [46,25,18,26], FERET-1 is a subset of the FERET database including 400 gray-level frontal face images of 200 persons (71 females and 129 males) with a size of 256 × 384; each person has two images (Fa and Fb), with different races, genders, ages, expressions, illuminations and scales. (The list of subjects whose images were selected in the FERET-1 dataset can be downloaded from http://cs.nju.edu.cn/zhouzh/.) For this database, we used the Fa images for training and the Fb images for testing. The FERET-2 dataset follows the standard FERET evaluation protocol and comprises five sets (Fa, Fb, Fc, Dup I, and Dup II). Fa, containing 1196 frontal images of 1196 subjects, is used as the gallery, while Fb (1195 images with expression variations), Fc (194 images taken under different illumination conditions), Dup I (722 images taken later in time) and Dup II (234 images, a subset of Dup I) are the probe sets. Dup II consists of images taken at least one year after the corresponding gallery images. Fig. 5(b) shows sample images of eight subjects from the FERET-1 database.

The LFW face database contains more than 13,000 face images collected from the web, each labeled with the name of the person pictured; 1680 of the people pictured have two or more distinct photos in the dataset. We selected the subjects with more than 9 images per person, 158 subjects in total. For each subject, we randomly selected two images: one for training and the other for testing. We repeated this procedure 5 times and report the average recognition accuracy. Fig. 5(c) shows sample images of eight subjects from the LFW database. Many natural variations in pose, expression, lighting and alignment are present in these face images, so this dataset is more challenging than the AR and FERET databases, which were collected under controlled conditions.

For all images in the AR and FERET datasets, the facial part of each image was manually cropped, aligned and resized to 60 × 60 pixels according to the eye positions. For the LFW database, each face image was automatically detected by the Viola-Jones face detector and aligned to 60 × 60 pixels with a commercial face alignment software (available at http://www.openu.ac.il/home/hassner/data/lfwa/). Hence, our experiments on this dataset constitute fully automatic face recognition, since no manual effort is involved.
Table 1. Detailed information of all the eight subsets of the AR database used in our experiments.

Subset   Expression   Session
A        Neutral      1
B        Smile        1
C        Anger        1
D        Scream       1
E        Neutral      2
F        Smile        2
G        Anger        2
H        Scream       2
4.2. Results and analysis

4.2.1. Comparison with existing SSFR methods

We compared our proposed M3L method with the following SSFR methods:
- Learning-based methods: (1) Block LDA [24], (2) UP (uniform pursuit) [45], and (3) DMMA (discriminative multi-manifold analysis) [27].
- Feature-based methods: (1) LBP [28], (2) LGBP [20], and (3) HGGP [29].
We carefully tuned the parameters of each method for a fair comparison. For Block LDA, DMMA and our proposed M3L method, the block size was empirically set to 10 × 10. For the UP method, the number of nearest neighbors was set to 4 and t to 100 to calculate the similarity weighting matrix. For DMMA and M3L, the parameters t1, t2, k and σ were empirically tuned to 15, 5, 4 and 100, respectively, and the feature dimensions of the manifolds vary from 15 to 25. For the feature-based methods, we followed the parameter settings in the corresponding papers [28,20,29]. For our proposed M3L method, we first extracted three kinds of feature descriptors for each patch, namely the raw intensity feature, the LBP feature and the local Gabor feature, and then used M3L to measure the similarity of two face images represented by multiple manifolds. The nearest neighbor classifier with the Euclidean distance was used for classification in all methods.

Tables 2–4 tabulate the rank-one recognition rates of different methods on the AR, FERET-1 and FERET-2 databases, respectively. The best recognition accuracy of each method was recorded for a fair comparison. As can be seen, our proposed M3L consistently outperforms the Block LDA, UP, DMMA, LBP, LGBP and HGGP methods, with gains in mean accuracy of 24.3%, 9.8%, 2.0%, 3.7%, 2.7%, and 2.0% on the AR database; 7.5%, 4.0%, 1.0%, 2.5%, 1.5%, and 0.5% on the FERET-1 database; and 16.0%, 15.2%, 1.3%, 27.8%, 13.6%, and 4.0% on the FERET-2 database, respectively.

Table 5 shows the rank-1 and rank-10 recognition accuracies obtained by different methods on the LFW database. The recognition performance of all methods is generally much lower than on the AR and FERET databases. This is because face images in the LFW database were collected under uncontrolled conditions and contain many variations, which poses more challenges than the AR and FERET databases. However, our proposed M3L still outperforms the Block LDA, UP, DMMA, LBP, LGBP and HGGP methods in terms of both the rank-1 and rank-10 recognition accuracies, which further demonstrates the effectiveness of our proposed method for unconstrained SSFR.
Fig. 5. Sample face images from the (a) AR, (b) FERET and (c) LFW datasets, respectively.
Table 2. Rank-1 recognition accuracy (%) of different methods on different subsets of the AR database.

Method           B    C    D    E    F    G    H    Mean
Block LDA [24]   85   79   29   73   59   59   18   57.4
UP [45]          98   88   59   77   74   66   41   71.9
DMMA [27]        99   93   69   88   85   79   45   79.7
LBP [28]         98   90   63   87   85   79   44   78.0
LGBP [20]        98   91   65   88   86   79   46   79.0
HGGP [29]        98   92   67   88   87   79   47   79.7
M3L              99   94   72   90   89   79   49   81.7

Table 3. Rank-1 recognition accuracy (%) of different methods on the FERET-1 database.

Method           Accuracy
Block LDA [24]   86.5
UP [45]          90.0
DMMA [27]        93.0
LBP [28]         91.5
LGBP [20]        92.5
HGGP [29]        93.5
M3L              94.0

Table 4. Rank-1 recognition accuracy (%) of different methods on different subsets of the FERET-2 database according to the standard FERET evaluation protocol.

Method           Fb     Fc     Dup1   Dup2   Mean
Block LDA [24]   92.0   82.0   75.6   52.5   75.5
UP [45]          93.6   74.5   77.8   59.5   76.4
DMMA [27]        98.1   98.5   81.6   83.2   90.3
LBP [28]         93.0   51.0   61.0   50.0   63.8
LGBP [20]        94.0   97.0   68.0   53.0   78.0
HGGP [29]        97.6   98.9   77.7   76.1   87.6
M3L              98.9   98.8   83.3   85.4   91.6
Table 5. Rank-1 and rank-10 recognition accuracies (%) of different methods on the LFW database.

Method           Rank-1   Rank-10
Block LDA [24]   15.3     25.3
UP [45]          14.2     23.5
DMMA [27]        25.1     32.2
LBP [28]         22.1     30.1
LGBP [20]        25.3     31.2
HGGP [29]        26.4     34.2
M3L              31.4     38.3
Table 6. Rank-1 recognition accuracy (%) of different methods on the FERET-2 database.

Method             Fb     Fc     Dup1   Dup2
Adapted FLD [32]   88.5   71.6   53.3   35.0
POEM [30]          97.0   95.0   77.6   76.2
LDP [47]           94.0   83.0   62.0   53.0
G-LDP [47]         97.0   95.0   71.0   69.0
RAS [48]           95.7   99.0   80.3   80.3
GV-LBP [31]        98.1   98.5   80.9   81.2
M3L                98.9   98.8   83.3   85.4

Table 7. Rank-1 recognition accuracy (%) of different methods on the FERET-2 database.

Method                  Fb     Fc     Dup1   Dup2
Feature concatenation   98.3   98.6   82.0   83.8
Individual learning     98.5   98.5   82.4   84.1
Score fusion            98.7   98.0   82.7   85.0
M3L                     98.9   98.8   83.3   85.4

Table 8. CPU time (in seconds) used by different methods on the AR database.

Method      Training   Recognition
UP          0.0936     0.0006
Block LDA   0.0780     0.0039
DMMA        39.0003    0.3900
M3L         142.1234   1.5612
4.2.2. Comparison with state-of-the-art face recognition methods

We also compared our M3L method with six other state-of-the-art face recognition methods proposed in the past two years: Adapted FLD [32], POEM [30], LDP [47], G-LDP (Gabor-based LDP) [47], RAS (robust and scalable approach) [48], and GV-LBP [31]. We followed the gallery and probe sets of the standard FERET evaluation protocol, where Fa is used as the gallery set and the other four sets as the probe sets. Table 6 shows the rank-1 recognition accuracy obtained by different methods on the FERET dataset. As shown in this table, our M3L achieves results comparable with the state-of-the-art face recognition methods under the FERET evaluation protocol.

4.2.3. Comparison with existing multi-feature learning methods

We compared the proposed M3L method with our previous DMMA method using multiple features in SSFR. DMMA originally utilizes only one feature for recognition, so it would be unfair to compare M3L with DMMA directly. For a fair comparison, we first extracted three kinds of feature descriptors for each patch: the raw intensity feature, the LBP feature and the local Gabor feature. Then, we applied DMMA with the following three settings:
- Feature concatenation: concatenate the three features extracted from the different views, and then run the standard DMMA method.
- Individual learning: perform DMMA on each feature separately, then concatenate the resulting representations into a long feature vector for recognition with the NN classifier.
- Score fusion: perform DMMA on each feature and perform recognition with the NN classifier; the per-feature recognition results are then combined by majority voting. (A brief sketch of these fusion baselines is given below.)
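To make the baselines concrete, a minimal Python sketch under our own array conventions (ours, not the authors' code); the majority-voting rule follows the description above:

```python
import numpy as np

def concatenated_descriptor(features):
    """Feature concatenation: stack the K descriptors of a sample into one
    long vector before running a single-feature method such as DMMA."""
    return np.concatenate(features)

def score_fusion_predict(per_feature_dists):
    """Score fusion: per_feature_dists is (N x K); each feature votes for its
    nearest gallery subject, and the most-voted subject is returned."""
    votes = np.argmin(per_feature_dists, axis=0)   # one vote per feature
    return int(np.bincount(votes).argmax())
```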
Table 7 shows the rank-1 recognition accuracy obtained by DMMA with multiple features and by our M3L method on the FERET-2 dataset. As shown in this table, M3L outperforms DMMA with multiple features under all three settings, which further indicates that jointly learning multiple discriminative feature spaces is better than learning them individually.
4.2.4. Computational complexity

Now, we analyze the computational complexity of the M3L method, which involves $T$ iterations; in each iteration, we solve a generalized eigenvalue equation. Let $N$ be the number of training samples, $d_{\max}$ the maximal feature dimension of an image patch, and $r$ the number of local patches per image. Letting $L = \min\{N \times r, d_{\max}\}$, the computational complexity of our proposed M3L method is $O(NTKL^3)$.

We compared the computational time of M3L with those of three other methods: Block LDA, UP and DMMA. Our hardware configuration comprises a 2.4-GHz CPU and 6 GB of RAM. Table 8 shows the time spent in the training and recognition phases by these methods, using Matlab, the AR database and the nearest neighbor classifier. From Table 8, we can see that the training cost of the proposed M3L method is much larger than that of the other methods. In practical applications, however, training is usually an offline process and only recognition needs to be performed in real time, so the recognition time is usually of more concern than the training time. As shown in Table 8, the recognition time of our proposed method is around 1.5 s; even though it is more time-consuming than the other methods, this computational cost will not limit the real-world applications of our proposed method.

5. Conclusion and future work

We have proposed in this paper a multi-feature multi-manifold learning (M3L) method to address the SSFR problem. We partitioned each face image into several local patches and constructed an image set for each enrolled sample. Then, we formulated SSFR as a multi-feature multi-manifold matching problem and learned multiple discriminative feature subspaces to maximize the multi-feature manifold margins of different persons. Experimental results were presented to demonstrate the efficacy of the proposed method. While our M3L method is developed specifically for SSFR in this work, we believe it can also be applied to other computer vision problems such as object recognition from image sets and video-based face recognition. How to effectively apply our method to these applications is an interesting direction for future work.

Acknowledgments

This work was partly supported by the research grant for the Human Sixth Sense Program (HSSP) at the Advanced Digital Sciences Center (ADSC) from the Agency for Science, Technology and Research (A*STAR) of Singapore, the National Natural Science Foundation of China under Grant 61373090, and the Beijing Natural Science Foundation of China under Grant 4132014.

References
[1] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1) (1991) 71–86.
[2] P.N. Belhumeur, J. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720. [3] C. Liu, H. Wechsler, Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition, IEEE Trans. Image Process. 11 (4) (2002) 467–476. [4] J. Yang, D. Zhang, A. Frangi, J. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2004) 131–137. [5] X. He, S. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340. [6] H. Chen, H. Chang, T. Liu, Local discriminant embedding and its variants, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2005, pp. 846–853. [7] D. Cai, X. He, J. Han, H.-J. Zhang, Orthogonal laplacianfaces for face recognition, IEEE Trans. Image Process. 15 (11) (2006) 3608–3614. [8] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 40–51. [9] J. Zou, Q. Ji, G. Nagy, A comparative study of local matching approach for face recognition, IEEE Trans. Image Process. 16 (10) (2007) 2617–2628. [10] D. Cai, X. He, J. Han, Spectral regression for efficient regularized subspace learning, in: IEEE International Conference on Computer Vision, 2007, pp. 1–8. [11] D. Cai, X. He, Y. Hu, J. Han, T. Huang, Learning a spatially smooth subspace for face recognition, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7. [12] J. Yang, D. Zhang, J. Yang, B. Niu, Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics, IEEE Trans. Pattern Anal. Mach. Intell. 29 (4) (2007) 650–664. [13] J. Lu, Y.-P. Tan, Regularized locality preserving projections and its extensions for face recognition, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 40 (3) (2010) 958–963. [14] J. Lu, Y.-P. Tan, A doubly weighted approach for appearance-based subspace learning methods, IEEE Trans. Inf. Forensics Secur. 5 (1) (2010) 71–81. [15] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, IEEE Trans. Image Process. 19 (6) (2010) 1635–1650. [16] A. Jain, B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, in: Handbook of Statistics, vol. 2, 1982, pp. 835– 855. [17] X. Tan, S. Chen, Z. Zhou, F. Zhang, Face recognition from a single image per person: a survey, Pattern Recognit. 39 (9) (2006) 1725–1745. [18] D. Zhang, S. Chen, Z. Zhou, A new face recognition method based on SVD perturbation for single example image per person, Appl. Math. Comput. 163 (2) (2005) 895–907. [19] Q. Gao, L. Zhang, D. Zhang, Face recognition using FLDA with single training image per person, Appl. Math. Comput. 205 (2) (2008) 726–734. [20] W. Zhang, S. Shan, W. Gao, X. Chen, H. Zhang, Local gabor binary pattern histogram sequence (lgbphs): A novel non-statistical model for face representation and recognition, in: IEEE International Conference on Computer Vision, 2005, pp. 786–791. [21] T. Kim, J. Kittler, Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image, IEEE Trans. Pattern Anal. Mach. Intell. 
27 (3) (2005) 318–327. [22] J. Wang, K. Plataniotis, J. Lu, A. Venetsanopoulos, On solving the face recognition problem with one training sample per subject, Pattern Recognit. 39 (9) (2006) 1746–1762. [23] A.M. Martinez, Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class, IEEE Trans. Pattern Anal. Mach. Intell. 24 (6) (2002) 748–763. [24] S. Chen, J. Liu, Z. Zhou, Making FLDA applicable to face recognition with one sample per person, Pattern Recognit. 37 (7) (2004) 1553–1555. [25] S. Chen, D. Zhang, Z. Zhou, Enhanced (PC)2A for face recognition with one training image per person, Pattern Recognit. Lett. 25 (10) (2004) 1173–1181. [26] X. Tan, S. Chen, Z. Zhou, F. Zhang, Recognizing partially occluded, expression variant faces from single training image per person with SOM and soft k-NN ensemble, IEEE Trans. Neural Netw. 16 (4) (2005) 875–886. [27] J. Lu, Y.-P. Tan, G. Wang, Discriminative multimanifold analysis for face recognition from a single training sample per person, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 39–51. [28] T. Ahonen, A. Hadid, M. Pietikäinen, Face recognition with local binary patterns, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2004, pp. 469–481. [29] W. Zhang, X. Xue, Z. Sun, Y. Guo, M. Chi, H. Lu, Efficient feature extraction for image classification, in: IEEE International Conference on Computer Vision, 2007, pp. 1–8. [30] N. Vu, A. Caplier, Face recognition with patterns of oriented edge magnitudes, in: European Conference on Computer Vision, 2010, pp. 313–326. [31] Z. Lei, S. Liao, M. Pietikainen, S. Li, Face recognition by exploring information jointly in space, scale and orientation, IEEE Trans. Image Process. 20 (1) (2011) 247–256.
[32] Y. Su, S. Shan, X. Chen, W. Gao, Adaptive generic learning for face recognition from a single sample per person, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 2699–2706. [33] S. Si, D. Tao, B. Geng, Bregman divergence based regularization for transfer subspace learning, IEEE Trans. Knowl. Data Eng. 22 (7) (2010) 929–942. [34] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326. [35] J.B. Tenenbaum, V. Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323. [36] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396. [37] A. Goldberg, X. Zhu, A. Singh, Z. Xu, R. Nowak, Multi-manifold semi-supervised learning, in: International Conference on Artificial Intelligence and Statistics, 2009, pp. 169–176. [38] X. Liu, H. Lu, W. Li, Multi-manifold modeling for head pose estimation, in: IEEE International Conference on Image Processing, 2010, pp. 3277–3280. [39] R. Xiao, Q. Zhao, D. Zhang, P. Shi, Facial expression recognition on multiple manifolds, Pattern Recognit. 44 (1) (2011) 107–116. [40] R. Wang, S. Shan, X. Chen, W. Gao, Manifold–manifold distance with application to face recognition based on image set, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8. [41] R. Wang, X. Chen, Manifold discriminant analysis, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2009, pp. 1–8. [42] A.M. Martinez, R. Benavente, The AR Face Database, Technical Report, CVC Technical Report, 1998. [43] P. Phillips, H. Moon, S. Rizvi, P. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (10) (2000) 1090–1104. [44] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments, Technical Report 07-49, University of Massachusetts, Amherst, October 2007. [45] W. Deng, J. Hu, J. Guo, W. Cai, D. Feng, Robust, accurate and efficient face recognition from a single training image: a uniform pursuit approach, Pattern Recognit. 43 (5) (2010) 1748–1762. [46] J. Wu, Z. Zhou, Face recognition with one training image per person, Pattern Recognit. Lett. 23 (14) (2002) 1711–1719. [47] B. Zhang, Y. Gao, S. Zhao, J. Liu, Local derivative pattern versus local binary pattern: face recognition with high-order local pattern descriptor, IEEE Trans. Image Process. 19 (2) (2010) 533–544. [48] W. Schwartz, H. Guo, L. Davis, A robust and scalable approach to face identification, in: European Conference on Computer Vision, 2010, pp. 476–489.
Haibin Yan received the B.Eng. and M.Eng. degrees from the Xi'an University of Technology, Xi'an, China, in 2004 and 2007, and the Ph.D. degree from the National University of Singapore, Singapore, in 2013, all in mechanical engineering. She is currently a research fellow at the Department of Mechanical Engineering, National University of Singapore, Singapore. Her research interests include human-robot interaction, social robotics, industrial robots, and computer vision.
Jiwen Lu received the B.Eng. degree in mechanical engineering and the M.Eng. degree in electrical engineering from the Xi'an University of Technology, Xi'an, China, in 2003 and 2006, respectively, and the Ph.D. degree in electrical engineering from the Nanyang Technological University, Singapore, in 2011. He is currently a research scientist at the Advanced Digital Sciences Center (ADSC), Singapore. His research interests include computer vision, pattern recognition, machine learning, multimedia, and biometrics. He has authored over 80 scientific papers in these areas, including some top venues such as PAMI, TIP, ICCV and CVPR. He is a Co-Chair of the First International Workshop on Human Identification in Multimedia (HIM'14) which is in conjunction with ICME2014, and a Co-Chair of the Biometrics Competition/Evaluation of Kinship Verification in the Wild (KVW'14, KVW'15) which are in conjunction with IJCB2014 and FG2015, respectively. He received the Best Student Paper Award from PREMIA of Singapore, the National Outstanding Student from Ministry of Education, China, and the Best Paper Award Nominations from ICME2011 and ICME 2013, respectively.
Xiuzhuang Zhou received the B.Sc. degree from the Department of Atmosphere Physics, Chengdu University of Information Technology, Chengdu, China, in 1996, and the M.Eng. and Ph.D. degrees from the School of Computer Science, Beijing Institute of Technology, Beijing, China, in 2005 and 2011, respectively. He is currently an associate professor in the College of Information Engineering, Capital Normal University, Beijing, China. His research interests include computer vision, pattern recognition, and machine learning.
Yuanyuan Shang received the Ph.D. degree from the National Astronomical Observatories, Chinese Academy of Sciences, in 2005. She is currently a Professor and vice dean (research) of the College of Information Engineering, Capital Normal University, Beijing, China. Her research interests include digital imaging sensors, image processing, and computer vision.