Journal of Visual Communication and Image Representation 53 (2018) 13–19
A set-to-set nearest neighbor approach for robust and efficient face recognition with image sets
Ling Wang a, Hong Cheng b,⁎, Zicheng Liu c
a School of Electric Engineering, University of Electronic Science and Technology of China, 2006 Xiyuan Avenue, Chengdu 611731, China
b Center for Robotics, University of Electronic Science and Technology of China, 2006 Xiyuan Avenue, Chengdu 611731, China
c Microsoft Research Redmond, One Microsoft Way, Redmond, WA 98052, USA
Keywords: Face recognition; Set-to-set; Robust analysis; Weighted correlation analysis

Abstract
Set-to-set face recognition has drawn much attention thanks to the rich information carried by image sets. We propose a robust and efficient Set-to-Set Nearest Neighbor Classification (S2S-NNC) approach for face recognition that uses the maximum weighted correlation between sets in low-dimensional projection subspaces. A pair of face sets is represented by two sets of Mutual Typical Samples (MTS) based on their maximum weighted correlation, and the S2S distance between the original sets is equivalent to that between their sets of MTS. To handle object variation within a set, the faces are partitioned into patches and projected onto a correlation subspace to find the MTS between two sets. Building on this, we develop the S2S-NNC approach for image set-based face recognition. Compared with existing approaches, S2S-NNC unifies the image-to-image, image-to-set and set-to-set recognition problems in one model. Experimental results show that the S2S-NNC approach significantly outperforms state-of-the-art approaches on large video samples and small occluded samples.
This work is supported by grants from the National Natural Science Foundation of China (NSFC) (Nos. 61603077, 61273256, 61305033 and U1233103F01). This paper has been recommended for acceptance by Zicheng Liu.
⁎ Corresponding author.
E-mail addresses: [email protected] (L. Wang), [email protected] (H. Cheng), [email protected] (Z. Liu).
https://doi.org/10.1016/j.jvcir.2018.02.004
Received 14 March 2017; Received in revised form 24 January 2018; Accepted 3 February 2018; Available online 05 February 2018.
1047-3203/ © 2018 Elsevier Inc. All rights reserved.

1. Introduction

Video information is widely used in daily life thanks to the development of High Efficiency Video Coding (HEVC) [1–5]. However, this also raises computation and accuracy problems for image set-based face recognition in video recognition and surveillance scenarios [6–9]. Because of the multiple viewpoints of camera networks and long-term observation, the images in a set are usually blurred, occluded or deformed, and traditional image set-based face recognition approaches are usually not robust enough to handle this. Furthermore, since the Euclidean distance between two vectors does not measure the distance between sets well, traditional image-to-image classifiers, such as the Support Vector Machine (SVM), Linear Discriminant Analysis (LDA) [10] and the Naive Bayesian Nearest Neighbor (NBNN) [11], cannot be used directly in image set-based face recognition.

Image set recognition consists of two basic modules: set representation and the set distance metric. In set representation, projection-based approaches model each image set as a low-dimensional subspace [12–17], manifold [7,18–22] or affine hull [8]. Subspace/manifold learning needs sufficient samples to obtain the subspace/manifold of an image set, which is very challenging for many face recognition problems; thus, insufficient gallery samples can greatly degrade the performance of recognition. To alleviate this problem, an affine hull model with image samples and their mean was used to represent a face set [23]. However, this model does not really solve the small sample size problem, especially in occluded cases. Moreover, traditional image set-based face recognition approaches suffer from high computational complexity. It is hard to construct a robust yet efficient distance metric between image sets: set-to-set distance metrics usually adopt complex optimization algorithms to evaluate the roles of different samples within a set, which increases the complexity of image set recognition approaches.

In this paper, we propose a robust and efficient Set-to-Set Nearest Neighbor Classification (S2S-NNC) approach for face recognition using the maximum weighted correlation of sets in low-dimensional projection subspaces. Here, a pair of sets is described by Mutual Typical Samples (MTS) based on their maximum mutual correlation. As shown in Fig. 1, given two sets X and Y, we learn their corresponding projection matrices W_x and W_y from the maximum correlation between the two sets. The MTS of each set are then obtained by extracting principal samples from each image within the set, and the distance metric between the two original sets is reduced to that between their sets of MTS. Given a testing sample set and the gallery sample sets of C classes, we generate their MTS set pairs. Finally, we can assign
the testing set to the class with the minimum distance between MTS set pairs. By doing so, the proposed approach works well even under heavy occlusion, without any preprocessing steps or explicit dimensionality reduction. More interestingly, we use no complex optimization algorithms in the set representation or the set-to-set distance metric, and the approach still works very well even with small sample sizes. We also unify the Image-to-Image (I2I), Image-to-Set (I2S) and Set-to-Set (S2S) distance metrics in one framework.

The rest of this paper is organized as follows. We first review related work on S2S recognition in Section 2. In Section 3, we present the image set representation and its distance metric with MTS and patched MTS, and use these MTS representations to propose an S2S nearest neighbor classification approach. Comprehensive experiments are presented in Section 4. Finally, we draw conclusions and discuss future work in Section 5.

Fig. 1. An illustration of MTS-based S2S-NNC. X^(c) and Y are two sets of given classes. By computing the maximum correlation coefficient ρ(X^(c), Y), we obtain the projection matrix pair (W_x^(c), W_y^(c)) and the MTS (X̃^(c), Ỹ). The S2S distance d_MTS(X^(c), Y) is then calculated in this MTS space. With the NN classifier, the recognition result is the class with the minimum distance among d_MTS(X^(c), Y), c ∈ C.

2. Related work

The existing techniques for image set-based face recognition can, for the most part, be divided into two classes: parametric and non-parametric approaches.

The parametric approaches usually represent an image set by a parametric model or distribution function. Lee et al. use Principal Component Analysis (PCA) to represent gallery images in low-dimensional appearance subspaces [6] and then cluster the subspaces with the K-means algorithm. Arandjelovic et al. propose a semi-parametric model for learning probability densities confined to highly non-linear but intrinsically low-dimensional manifolds [18]; this approach is based on a stochastic approximation of the Kullback-Leibler divergence between the estimated densities. Zheng et al. use kernel canonical correlation analysis to solve the facial expression recognition problem [24], applying the Gabor wavelet transformation to convert landmark points of faces into a labeled graph vector representing the facial features. The limitation of these parametric approaches is that if the image set lacks strong statistical correlation for the parameters, the estimated model cannot represent the image set well.

In contrast, the non-parametric approaches relax the assumptions on the distribution of the data set and are more flexible. One important family of non-parametric approaches is subspace/manifold based. Yamaguchi et al. propose the Mutual Subspace Method (MSM) to define the similarity between two image sequences [12]. Kim et al. represent the images by subspaces and carry out recognition by subspace-to-subspace discriminant matching [25]. Wang et al. propose the Manifold Discriminant Analysis (MDA) and Manifold-Manifold Distance (MMD) approaches, modeling the covariance matrix of an image set as a manifold [7,26,27]. Yang et al. propose a Multi-Manifold Discriminant Analysis (MMDA) method for image feature extraction and pattern recognition [28]. Kim et al. extend the concept of principal angles between linear subspaces to manifolds with arbitrary nonlinearity [29]. Cevikalp et al. represent images as points in a linear or affine feature space [8]. To overcome the limits of local linear or non-linear models, image sets are mapped into Grassmannian or Riemannian manifold spaces, and discriminant analysis or information entropy methods are then used for recognition [19–22]. Hu et al. introduce a Sparse Approximate Nearest Points (SANP) approach [23], where the nearest points between two sets are sparsely approximated from the respective sets; a joint representation of the image set is defined that includes both the sample images and their affine hull models. Yang et al. use joint regularized nearest points to represent image sets of different classes [15]. A joint feature projection matrix learning and dictionary structuring method is proposed in [17]. Chen et al. propose a multivariate sparse representation for video-to-video face recognition [30]. Cui et al. divide images into patches and sparsely encode them; whitened PCA and pairwise-constrained multiple metric learning are then used to reduce the feature dimension and integrate the descriptors [31]. Wolf et al. present patch-based Local Binary Pattern (LBP) multiple descriptors to capture the statistics of local features in a set [32,33]. Moreover, Lu et al. [16,34] and Vemulapalli et al. [35] model the image set with set-statistics based approaches. Compared with the parametric approaches, subspace/manifold representations make no assumptions about the data distribution. However, most of them need training steps and feature extraction, which require prior information about the dataset and hand-crafted feature design.

Compared with image set representation, the set distance metric seems less of a challenge. Maximum a posteriori formulations [6], geometric distance functions [8,23,26,27,31,36] and discriminative learning [7,28,32,33,37,38] are used in classification. Zhu et al. propose an I2S classification by extending the Mahalanobis distance [39]. Huang et al. propose a Euclidean-to-Riemannian metric for the I2S classification problem [40]. Zhu et al. propose convex or regularized hulls to collaboratively represent all of the gallery image sets and query sets [14]; the distance metric between the query set and the gallery sets is calculated from the representation coefficients, and this method is named RH-ISCRC. These existing techniques are usually based on the Euclidean distance or its variants, the cosine distance, the Jaccard distance, etc. A main drawback of these measures is that they cannot unify the I2I, I2S and S2S measures in one framework. Furthermore, most S2S measures need an iterative process to obtain the optimal solution, which increases computation when the number of samples is large. The notation used throughout this paper is listed in Table 1.
Table 1
Notations.
D: the number of variable dimensions; C: the number of classes;
N_x, N_y: the numbers of samples in X and Y;
M: the number of MTS, M ≤ min(rank(X), rank(Y));
K: the number of patches;
i ∈ {1, 2, …, D}; j ∈ {1, 2, …, M}; k ∈ {1, 2, …, K}; m ∈ {1, 2, …, N_x}; n ∈ {1, 2, …, N_y};
X = [X_1, X_2, …, X_{N_x}] ∈ R^{D×N_x};
Y = [Y_1, Y_2, …, Y_{N_y}] ∈ R^{D×N_y};
W_x = [w_{x,1}, w_{x,2}, …, w_{x,N_x}]^T ∈ R^{N_x×M};
W_y = [w_{y,1}, w_{y,2}, …, w_{y,N_y}]^T ∈ R^{N_y×M};
X_m: the mth image, with elements {x_{i,m}};
x_{k,m}: the kth patch of the mth image, x_{k,m} ∈ R^{d_1 d_2 × 1}, and X_m = [x^T_{1,m}, x^T_{2,m}, …, x^T_{K,m}]^T;
x̃_k = [x_{k,1}, x_{k,2}, …, x_{k,N_x}]: the kth patch set of X;
w_{x,m} = [w^x_{m,1}, w^x_{m,2}, …, w^x_{m,M}]; w_{y,n} = [w^y_{n,1}, w^y_{n,2}, …, w^y_{n,M}];
⟨X, Y⟩ = X^T Y; ‖·‖_F: the Frobenius norm of a matrix.

Fig. 2. The correlation coefficients between different classes. The testing set Y is of the same class as set X^(1) and differs from classes X^(2)–X^(5); ρ_i = ρ(X^(i), Y). (a) Results on the YouTube Celebrities dataset; (b) results on the aligned YouTube dataset.

3. Image set-based face representation and recognition

3.1. Generating mutual typical samples
We aim to construct a novel sample learning approach that not only reduces the dimensionality of the samples but is also robust to occlusion. Moreover, the samples learned from the gallery adapt to changes in the testing samples. In other words, we learn the most mutually correlated samples from each set to generate mutual typical samples. Given two face image sets X, Y, we use two projection matrices W_x and W_y to obtain their underlying low-dimensional structures, projecting the image sets onto two low-dimensional sample subspaces X̃ and Ỹ as
X̃ = XW_x,  Ỹ = YW_y,  (1)

where the elements of X̃ are x̂_{i,j} = Σ_{m=1}^{N_x} w^x_{m,j} x_{i,m}, i.e. each element of X̃ is a linear weighted combination of the same variables in the sample space. Similarly, we can obtain the set of mutual typical samples Ỹ. Among the numerous linear combinations, we calculate the projection matrix pair {W_x, W_y} by maximizing the mutual correlation

ρ(X, Y) = argmax_{W_x, W_y} ⟨XW_x, YW_y⟩ / (‖XW_x‖_F ‖YW_y‖_F),  (2)

where ρ ∈ R^{M×1}. In many cases there is correlation among the samples in a set. Consequently, the number of columns M of a projection matrix should be no greater than the rank of the set matrix, i.e. M ≤ min{rank(X), rank(Y)}. If the images within a set are highly correlated, the number of MTS will be very small. We discuss the setting of M in Section 3.2.

From Eq. (2) we can see that the projection matrices not only form weighted combinations within each set but also exploit the maximum correlation between the sets. As shown in Fig. 1, the projection matrices extract similar yet distinct samples between sets in the sense of maximum mutual correlation, and the MTS represent the most highly correlated samples. Eq. (2) is similar to Canonical Correlation Analysis (CCA) [41] in formulation and can be solved by eigen-decomposition, but it is fundamentally different from CCA: the MTS maximize the mutual correlation between samples, whereas CCA maximizes it between features. Eq. (2) is a linear subspace projection, which can be extended to a non-linear projection by kernel mapping. Suppose ϕ is a mapping from the input space to a higher-dimensional space H, ϕ: X ↦ ϕ(X). The kernel matrices for all X, Y are defined as K_x = ⟨ϕ(X), ϕ(X)⟩ and K_y = ⟨ϕ(Y), ϕ(Y)⟩; a more general form of the MTS is obtained by replacing X and Y with K_x and K_y in Eq. (2).
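To make Eqs. (1) and (2) concrete, the following is a minimal sketch, in Python/NumPy, of one standard way to solve this maximum-correlation problem by an SVD-based eigen-decomposition. The function name, the QR-based whitening and the full-column-rank assumption are our illustrative choices, not the authors' exact implementation; a kernelized variant would start from K_x and K_y instead of X and Y.

```python
import numpy as np

def mutual_typical_samples(X, Y, M=1):
    """Sketch of Eqs. (1)-(2): return the MTS pair (X W_x, Y W_y) and the
    correlation vector rho. X is D x Nx and Y is D x Ny, one image per
    column; both are assumed to have full column rank."""
    # Whiten each set with a thin QR factorization: X = Qx Rx, Y = Qy Ry.
    Qx, Rx = np.linalg.qr(X)
    Qy, Ry = np.linalg.qr(Y)
    # The singular values of Qx^T Qy are the correlation coefficients rho,
    # and the singular vectors give the optimal directions in Eq. (2).
    U, rho, Vt = np.linalg.svd(Qx.T @ Qy)
    # Recover W_x (Nx x M) and W_y (Ny x M) by back-substitution.
    Wx = np.linalg.solve(Rx, U[:, :M])
    Wy = np.linalg.solve(Ry, Vt.T[:, :M])
    return X @ Wx, Y @ Wy, rho[:M]
```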
3.2. The S2S distance metric

In S2S distance measurements, the principal angles [42], Subspace Distance (SSD) [43] and Manifold-Manifold Distance (MMD) [26] are the most commonly used measures. Considering the relationship of the correlation coefficients between different classes, we propose an S2S distance metric that is more robust than those measures,

d_MTS(X, Y) ≜ d(X̃, Ỹ) = 1 − Ω(M)^T ρ²(X, Y),  (3)

where Ω(M) is a weight vector. We use ρ²(X, Y) instead of ρ(X, Y) in Eq. (3) because the former discriminates better. We now discuss the settings of M and Ω(M). We first calculate ρ on two datasets, YouTube Celebrities [44] and aligned YouTube [45]. The testing set Y is of the same class as X^(1), while X^(2)–X^(5) are image sets from other classes; let ρ_i = ρ(X^(i), Y), i = 1, 2, …, 5. As shown in Fig. 2, differences in image resolution make the ρ_i very different; note, however, that ρ_1 > ρ_i for all i ≠ 1. We also use the more challenging YouTube Celebrities dataset to examine the recognition ratio under different M. As shown in Fig. 3, with Ω(M) = 1/M, M = 1 always gives the best result regardless of the gallery sample number N. The main reason is that, with the damped ρ, the equalization reduces the resolution of the MTS, and for lower-resolution samples a larger M/N makes the MTS less well defined. On the other hand, setting Ω to an Exponential, Pareto or Weibull distribution function brings little improvement in the recognition ratio. We therefore use only the maximally correlated MTS pair, i.e.

d_MTS(X, Y) ≜ d(X̃, Ỹ) = 1 − max ρ²(X, Y).  (4)

With this dimension reduction technique, the computational complexity drops steeply while, as the experiments show, recognition performance is preserved.
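Given ρ from the sketch above, the distances of Eqs. (3) and (4) follow directly. In the snippet below, Ω(M) = 1/M reproduces the equalized weighting discussed above, and M = 1 recovers Eq. (4); again, this is a sketch rather than the authors' code.

```python
def d_mts(X, Y, M=1, omega=None):
    """Sketch of Eq. (3); with M = 1 it reduces to Eq. (4)."""
    _, _, rho = mutual_typical_samples(X, Y, M=M)
    if omega is None:
        omega = np.full(M, 1.0 / M)  # equalized weight vector Omega(M)
    return 1.0 - float(omega @ rho**2)
```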
Fig. 3. The recognition ratio on the YouTube Celebrities dataset with different M. N is the number of images in each gallery set.
The MTS representation provides a robust yet simple distance metric for the S2S-NNC approach. We illustrate its robustness in comparison with MMD in Fig. 4. Suppose the samples are located in a cube space, X = {x_1, x_2, x_3} and Y = {y_1, y_2, y_3}, where x_2 and x_3 are most like y_2, x_1 is different from y_1 and y_3, x_1 belongs to subspace C_1, x_2 and x_3 belong to subspace C_2, y_1 and y_3 belong to subspace C_1′, and y_2 belongs to subspace C_2′. The distance d_MTS(X, Y) aims at finding the minimum distance between the space spanned by {x_1, x_2, x_3} and the space spanned by {y_1, y_2, y_3}, i.e. the red vertical line between X̃ and Ỹ. The MMD, on the other hand, can be regarded as a weighted sum of distances between the subspaces, d_MMD(X, Y) = Σ_{i=1}^{N_cx} Σ_{j=1}^{N_cy} ξ_{ij} d(C_i, C_j′), where ξ_{ij} is a weight, d(C_i, C_j′) is the distance between two subspaces, and N_cx and N_cy are the numbers of subspaces. Note that MMD only considers the weights between samples but ignores the weights within samples, i.e. the linear combination of the green lines. Usually d_MTS(X, Y) ≤ d_MMD(X, Y), which means the MTS-based distance is the minimum distance metric between sets.

Fig. 4. The robustness of the MTS distance. X̃ and Ỹ are the MTS of {x_1, x_2, x_3} and {y_1, y_2, y_3}, respectively. The red line is the distance between the two MTS, and the green lines d(C_i, C_j′) are the distances between subspaces C_i and C_j′. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.3. The patched S2S representation

To solve the occlusion and deformation problems, we propose a patched set-to-set representation (PS2S). A D = p_1 × p_2 pixel image is partitioned into K non-overlapping sub-patches of d_1 × d_2 pixels, and the sub-patches are cascaded in series to form a sequence. If two images of the same size are partitioned in the same way and the order of patches in a sequence is fixed, we can treat the corresponding patches as pairs and apply the S2S representation and distance to these pairs.

Suppose a gallery set and a testing image set of a given class are denoted by X, Y, where each X_m and Y_n is partitioned into K sub-patches, i.e. X = [x̃_1; …; x̃_K] and Y = [ỹ_1; …; ỹ_K]. For each patch pair (x̃_k, ỹ_k), we can compute the coefficient ρ(x̃_k, ỹ_k) by Eq. (2), and the distance between (X, Y) is defined as

γ(X, Y) = 1 − max ζ²(X, Y),  (5)

where ζ(X, Y) = (1/K) Σ_{k=1}^{K} ρ(x̃_k, ỹ_k). Eq. (5) states that the S2S distance is the weighted cumulative distance over corresponding patches. By doing so, we can handle the occlusion and deformation problems more accurately.
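Continuing the sketch, the patched distance of Eq. (5) can be assembled as below. The grid shape is a hypothetical parameter (for an 80 × 60 image, an (8, 4) grid gives K = 32 patches of 10 × 15 pixels), and each flattened patch dimension is assumed to be at least the number of images per set so that the QR step above remains valid.

```python
def patched_s2s_distance(X_imgs, Y_imgs, grid=(8, 4)):
    """Sketch of Eq. (5) with M = 1: average the per-patch maximum
    correlations over the K corresponding patch pairs, then 1 - zeta^2.
    X_imgs and Y_imgs are lists of equally sized 2-D (H x W) arrays."""
    rows, cols = grid

    def patch_sets(imgs):
        h, w = imgs[0].shape
        ph, pw = h // rows, w // cols
        # k-th patch set x~_k: one flattened patch column per image
        return [np.stack([im[r*ph:(r+1)*ph, c*pw:(c+1)*pw].ravel()
                          for im in imgs], axis=1)
                for r in range(rows) for c in range(cols)]

    zeta = np.mean([mutual_typical_samples(xk, yk, M=1)[2][0]
                    for xk, yk in zip(patch_sets(X_imgs), patch_sets(Y_imgs))])
    return 1.0 - zeta**2
```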
3.4. The S2S-NNC approach

Based on the MTS/patched-MTS distance metric, the nearest neighbor classifier is used to solve the S2S face recognition problem. Given a testing set Y and gallery sets X^(c), c ∈ C, the MTS-based NN classifier is

ĉ = argmin_c d_MTS(X^(c), Y),  (6)

and the patched-MTS based NN classifier is

ĉ = argmin_c γ(X^(c), Y).  (7)
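Eqs. (6) and (7) then reduce to a nearest-neighbor scan over the gallery classes, as in the sketch below. Note that a probe consisting of a single image is simply a set with one column, which is how the same rule also covers the I2I and I2S cases discussed next.

```python
def s2s_nnc(gallery, Y, patched=False):
    """Sketch of Eqs. (6)-(7): assign the testing set Y to the class whose
    gallery set has the minimum MTS (or patched-MTS) distance.
    `gallery` maps a class label c to its image set X^(c)."""
    dist = patched_s2s_distance if patched else d_mts
    return min(gallery, key=lambda c: dist(gallery[c], Y))
```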
In principle, the MTS provides the locally optimal metric between two sets and the NNC gives the globally optimal metric over the data sets; thus the proposed S2S-NNC approach is more robust than state-of-the-art approaches. Since the MTS/patched-MTS distance can be used in I2I, I2S and S2S scenarios, it unifies the I2I, I2S and S2S recognition problems in one framework.

3.5. Relationship with existing approaches

In SANP, a sparsely approximated nearest point approach measures the between-set distance via convex optimization. In MMD, a linearity-constrained hierarchical division clustering approach clusters different classes in manifold learning, and a weighted distance metric is then used in the clustered subspaces. In MSM, the sequence of images is accumulated and the principal angle is used to measure the similarity of two mutual subspaces. In DCC, linear discriminative analysis is applied in the canonical correlation projection subspace of the within-class and between-class sets. In SSDML, the Mahalanobis distance measures the distance between sets in affine- or convex-hull space, and an SVM then solves the classification problem. In RH-ISCRC, the gallery and query sets are collaboratively represented by convex or regularized hulls, and recognition works with the representation coefficients. In JRNP, joint regularized nearest points represent the image sets of different classes, and the distance metric is calculated on this representation.

Like all of these approaches, the proposed S2S-NNC uses projection techniques to represent the image sets and then applies a distance metric for classification. However, S2S-NNC has several promising properties. First, the MTS pairs are generated from both the within-set and the between-set correlations, which yields more robust image set-based face recognition; note that SANP and MMD build on affine/manifold combinations of the within-sets only. Second, S2S-NNC combines the two steps, set representation and distance metric, into one step with a closed-form solution, while the other approaches find the optimal representation and distance metric iteratively. The main computational cost of S2S-NNC lies in calculating the MTS: with a complete Cholesky decomposition algorithm, the computational complexity is O(N_y²), whereas the complexity of MMD is O(N_y³) [26] and that of SANP is about O(N_y³) [23]. Third, the proposed approach works well with both large and small numbers of samples, whereas the other approaches need large numbers of samples to learn manifolds; it also unifies I2I, I2S and S2S recognition in one framework. Furthermore, the proposed patched-MTS approach is more robust to occlusion.

4. Experimental results and analysis

To evaluate the robustness and efficiency of the proposed S2S-NNC approach and compare it with existing techniques, several typical face datasets are used: AR [46], Caltech Frontal Face 1999 [47], TFWM [48], Honda/UCSD [6], YouTube Celebrities [49] and aligned YouTube [45]. We compare the proposed S2S-NNC approach with the following image set-based recognition methods: DICW [50], SANP [23], MMD [26], DCC [51] and SSDML [39]. In DCC, the dimension of the embedding space is set to 100 and the subspace dimensions are set to 10.

4.1. Datasets description

The AR dataset contains over 2600 color images of 100 people's faces. Each image is 768 × 576 pixels, with four
scenarios of different illumination conditions and occlusions, all resized to 80 × 60 pixels. The Caltech dataset contains 450 face images of 27 persons. For each person, the face area is cropped from the background at 450 × 300 pixels and resized to 80 × 60 pixels. The TFWM dataset contains frontal-view faces of strangers on the streets under uncontrolled lighting. We select 55 persons with 10 images each and crop the face area from the background at 80 × 60 pixels. The extended YaleB dataset contains 2414 face images of 38 people [52]; each image is 192 × 168 pixels. The Honda/UCSD dataset consists of 59 video sequences of 20 persons, each video containing about 400 frames. The cascaded face detector is used to collect faces in each video, and the faces are resized to 20 × 20-pixel gray images. The YouTube Celebrities dataset contains 1910 video clips of 47 different people. On average, each person has 41 clips, divided into 3 sessions taken at different times and scenes. We use the cascaded face detector [44] to collect faces in each video; the dataset contains tracking and cropping errors. To avoid such errors, an aligned YouTube Faces dataset (the file 'aligned_images_DB' in [45]) is also used, which contains 3425 video clips of 1595 different people. We select images from 532 persons who have 3 or more different videos. Each image is resized to 30 × 30 pixels.

4.2. The effect of patch size

We first test the effect of varying patch size on recognition performance. 100 unoccluded images from the AR dataset, containing illumination and facial expression variation, are used as the gallery set, and 100 testing images containing sunglasses or scarves are used for testing. We test PS2S-NNC with patch sizes from 5 × 5 pixels to 15 × 15 pixels; the testing images and the gallery images are partitioned in the same way. The recognition ratio and the time consumption as functions of patch size are shown in Fig. 5. The recognition ratio decreases with increasing patch size, and so does the computation time, but the runtime drops far more sharply: while the recognition ratio falls from 0.93 to 0.86, the runtime falls from 207 to 3.8 min. As a result, the patch size is set to 10 × 15 pixels for most datasets (except the Honda and YouTube datasets, where it is set to 5 × 5).
Fig. 5. The recognition ratio and time consumption with respect to patch size. The recognition ratio changes little as the patch size varies, but the running time decreases sharply.

4.3. Image-to-set recognition

In this section, we test the PS2S-NNC on man-made and realistic occluded images with the image-to-class recognition method. For the cropped AR dataset, following the setting in [50], the unoccluded frontal-view images with various expressions are used (8 images per person). For each person, we select N_x = 1, 2, 4, 6 and 8 images as gallery sets, respectively. Two separate sets of images (containing sunglasses and scarves, respectively) are used as testing sets, with 200 images randomly selected from each occlusion set. For the Caltech and TFWM datasets, the number of images in the gallery sets is set to N_x = 1, 3, 5 and 8 per person, respectively, and the rest of the images form the testing set. For the Honda and YouTube Face datasets, we first randomly select 20 images from each person and then randomly select N_x = 1, 3, 5 and 8 image sets as gallery sets; the remainder is used as the testing set. Fig. 6 shows the recognition results of the proposed S2S-NNC and DICW as the number of gallery images changes. In most datasets, PS2S-NNC outperforms DICW by about 10%.

Fig. 6. The recognition ratio as the gallery number changes. In the AR dataset with sunglasses and scarf occlusion, the number of gallery images per person is varied over [1,2,4,6,8]; in the Caltech, TFWM, Honda and YouTube datasets, over [1,3,5,8]. The red line with '▵' and the blue line with '∘' are the recognition ratios of PS2S-NNC and DICW, respectively.

For the extended YaleB dataset, following the setting in PSDML [39], the face images are projected into 504-dimensional vectors. On this dataset, the proposed I2S-NNC approach is compared with NN, SVM and PSDML; when PCA is used in PSDML, the dimension is set to 50. As Table 2 shows, I2S-NNC gives a recognition result similar to PSDML.

Table 2
The recognition ratio on the extended YaleB dataset.

Method    NN       SVM      PSDML    I2S-NNC
Ratio     76.3%    78.1%    90.0%    89.6%

4.4. Set-to-set image recognition

In this section, we evaluate the performance of the S2S-NNC approach for set-to-set recognition. Ten-fold cross validation is used, i.e. 10 randomly selected gallery and testing sets. The experiments include large-sample video-based S2S recognition and small-sample robust S2S recognition.

4.4.1. Large-sample video-based recognition

We test the proposed approach on the two selected YouTube datasets.
Table 3
Recognition ratios on the YouTube Celebrities dataset.

Methods     50 (%)    100 (%)    200 (%)
MSM         54.8      57.4       56.7
DCC         57.6      62.7       65.7
MMD         57.8      62.8       64.7
SANP        57.8      63.1       65.6
SSDML       61.9      65.0       67.0
RH-ISCRC    62.3      65.6       66.7
JRNP        76.8      78.2       79.5
S2S-NNC     85.5      86.1       87.0
On the YouTube Celebrities dataset, following the setup in [23], 9 video clips are randomly chosen from each person (3 clips from each session), of which 3 clips are used for the training/gallery set and the rest for testing. Since the number of images per video varies widely, the numbers of gallery and testing images also differ. For comparison, in the MSM, DCC, MMD, SANP, SSDML, RH-ISCRC and JRNP approaches, we set the number of images in each set to the minimum of the gallery and testing numbers. Experiments with 50, 100 and 200 frames per set are conducted. The recognition results are shown in Table 3: the proposed approach performs better than MSM, DCC, MMD, SANP, SSDML, RH-ISCRC and JRNP.
4.4.2. Small-sample robust recognition

We test the proposed approach in a more challenging setting, with man-made occlusion and small samples. For the AR dataset, 3 images are randomly selected from each of the four scenarios of the same person, 1 image for the gallery set and the rest for the testing set. For the Caltech and TFWM datasets, we randomly select N_x = 3 images from each class set as the gallery set and the rest for the testing set. For the Honda dataset, following the setup in [23], 20 sequences are used for the gallery sets and the remaining 39 sequences for testing; in each sequence, all or at most 50 frames are used. For the aligned YouTube dataset, each person has 9 randomly chosen images, 3 from each of 3 videos: 1 image per video for the gallery set and the other 2 for the testing set.

The face recognition results are shown in Table 4. When the testing samples are occluded or exhibit large illumination changes, as in the AR, Caltech and TFWM datasets, the performance of the MMD and SANP approaches degrades greatly; in this case, the S2S-NNC approach significantly outperforms the state-of-the-art approaches. For testing samples with multiple viewpoints and expressions, as in the YouTube dataset, the S2S-NNC approach is more robust than MMD and SANP. The reason is that the proposed approach avoids using large amounts of training samples to learn models, while the MMD approach needs a large number of samples to model manifolds. The PS2S-NNC is more robust than the S2S-NNC only on occluded images, while the S2S-NNC deals better with deformation and rotation: the patching technique increases the number of local features, whereas the S2S-MTS refines global structure features.

As mentioned in Section 3.1, the MTS performs feature fusion and extraction, but when the testing set is very different from the gallery set, the MTS cannot represent the feature space accurately. As shown in Fig. 7, the testing samples Y are occluded by scarves while the gallery sets X consist of whole, clear faces. Block (1) contains samples of the same person as the testing set; blocks (2) and (3) contain samples of other persons. The MTS distance d over whole images cannot give the correct measurement, but the patched-MTS cumulative distance γ gives the right answer.

Fig. 7. The robustness of the patched MTS. The green block is the testing set, which is occluded by scarves; the red blocks are the gallery sets from three different classes; the purple blocks are the MTS of the sets. d is the MTS distance over whole images; γ is the cumulative distance of the patched MTS. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 8 shows the recognition ratios and run times of the four approaches on the YouTube dataset. From Fig. 8(a), we can see that the S2S-NNC approach performs especially well for small numbers of training samples; MMD, SANP and DCC reach recognition ratios similar to MTS only when the gallery number in each video is more than 4. Moreover, as the number of training samples increases, the S2S-NNC approach shows outstanding computational efficiency: in this experiment, the computational costs of MMD, SANP and DCC are more than 80 times that of S2S-NNC (Fig. 8(b)).
Table 4
The recognition ratio on different datasets.

Ratio       AR      Caltech    TFWM    Honda    YouTube
MMD         15%     37%        63%     69%      85%
SANP        65%     96%        100%    84%      88%
S2S-NNC     98%     96.3%      100%    89.5%    92%
PS2S-NNC    100%    92%        100%    78.9%    87.8%

Fig. 8. The recognition ratios and runtimes on the YouTube dataset when the gallery number changes from 1 to 8 in each video. (a) Recognition ratio with respect to gallery number; (b) runtime in minutes with respect to gallery number.

5. Conclusions and future work

We have proposed a novel similarity measure for image sets. To recognize occluded and deformed objects, we construct MTS spaces between two sets, which describe the maximum similarity of the two sets. The MTS is affine invariant and robust to occlusion and deformation. Experiments on five datasets show that the S2S-NNC achieves better performance than state-of-the-art approaches. The proposed approach can be applied directly to images without occlusion detection or training, and it can deal with I2S as well as S2S face recognition problems. Although the MTS represents an image set in a reduced dimension, it is not sparse, which leads to considerable computational complexity when D is large. In future work, we will investigate sparse solutions in detail.
References
[1] J. Lainema, F. Bossen, W.J. Han, J. Min, K. Ugur, Intra coding of the HEVC standard, IEEE Trans. Circ. Syst. Video Technol. 22 (12) (2012) 1792–1801.
[2] C. Yan, Y. Zhang, F. Dai, L. Li, Highly parallel framework for HEVC motion estimation on many-core platform, in: Data Compression Conference, 2013, pp. 63–72.
[3] C. Yan, Y. Zhang, J. Xu, F. Dai, L. Li, Q. Dai, F. Wu, A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors, IEEE Signal Process. Lett. 21 (5) (2014) 573–576.
[4] C. Yan, Y. Zhang, J. Xu, F. Dai, J. Zhang, Q. Dai, F. Wu, Efficient parallel framework for HEVC motion estimation on many-core processors, IEEE Trans. Circ. Syst. Video Technol. 24 (12) (2014) 2077–2089.
[5] C. Yan, Y. Zhang, F. Dai, X. Wang, Parallel deblocking filter for HEVC on many-core processor, Electron. Lett. 50 (5) (2014) 367–368.
[6] K.-C. Lee, J. Ho, M.-H. Yang, D. Kriegman, Video-based face recognition using probabilistic appearance manifolds, in: IEEE CVPR, vol. 1, 2003.
[7] R. Wang, X. Chen, Manifold discriminant analysis, in: IEEE CVPR, 2009.
[8] H. Cevikalp, B. Triggs, Face recognition based on image sets, in: IEEE CVPR, 2010.
[9] M. Hayat, M. Bennamoun, S. An, Learning non-linear reconstruction models for image set classification, in: IEEE CVPR, 2013.
[10] B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, in: Neural Networks for Signal Processing IX.
[11] O. Boiman, E. Shechtman, M. Irani, In defense of nearest-neighbor based image classification, in: IEEE CVPR, 2008.
[12] O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: IEEE Conference on Automatic Face and Gesture Recognition, 1998.
[13] G. Shakhnarovich, B. Moghaddam, Face recognition in subspaces, in: Handbook of Face Recognition, 2011.
[14] P. Zhu, W. Zuo, L. Zhang, S.C.K. Shiu, D. Zhang, Image set-based collaborative representation for face recognition, IEEE Trans. Inform. Forens. Secur. 9 (7) (2014) 1120–1132.
[15] M. Yang, W. Liu, L. Shen, Joint regularized nearest points for image set based face recognition, in: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1, 2015, pp. 1–7.
[16] J. Lu, G. Wang, P. Moulin, Localized multifeature metric learning for image-set-based face recognition, IEEE Trans. Circ. Syst. Video Technol. 26 (3) (2016) 529–540.
[17] J. Lu, G. Wang, J. Zhou, Simultaneous feature and dictionary learning for image set based face recognition, IEEE Trans. Image Process. 26 (8) (2017) 4042–4054.
[18] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, T. Darrell, Face recognition with image sets using manifold density divergence, in: IEEE CVPR, vol. 1, 2005.
[19] H. Hu, Sparse discriminative multimanifold grassmannian analysis for face recognition with image sets, IEEE Trans. Circ. Syst. Video Technol. 25 (10) (2015) 1599–1611.
[20] A. Yang, S. Chen, Object recognition with image set based on kernel information entropy, in: International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2015, pp. 1314–1318.
[21] J. Ma, H. Zhang, W. She, Research on robust face recognition based on depth image sets, in: International Conference on Image, Vision and Computing (ICIVC), 2017, pp. 223–227.
[22] W. Wang, R. Wang, Z. Huang, S. Shan, X. Chen, Discriminant analysis on riemannian manifold of gaussian distributions for face recognition with image sets, IEEE Trans. Image Process. PP (99) (2017) 1.
[23] Y. Hu, A.S. Mian, R. Owens, Face recognition using sparse approximated nearest points between image sets, IEEE Trans. PAMI 34 (10) (2012) 1992–2004.
[24] W. Zheng, X. Zhou, C. Zou, L. Zhao, Facial expression recognition using kernel canonical correlation analysis, IEEE Trans. Neural Networks 17 (1) (2006) 233–238.
[25] T.-K. Kim, J. Kittler, R. Cipolla, On-line learning of mutually orthogonal subspaces for face recognition by image sets, IEEE Trans. Image Process. 19 (4) (2010) 1067–1074.
[26] R. Wang, S. Shan, X. Chen, Q. Dai, W. Gao, Manifold–manifold distance and its application to face recognition with image sets, IEEE Trans. Image Process. 21 (10) (2012) 4466–4479.
[27] R. Wang, H. Guo, L.S. Davis, Q. Dai, Covariance discriminative learning: a natural and efficient approach to image set classification, in: IEEE CVPR, 2012.
[28] W. Yang, C. Sun, L. Zhang, A multi-manifold discriminant analysis method for image feature extraction, Pattern Recogn. 44 (8) (2011) 1649–1657.
[29] T.-K. Kim, O. Arandjelović, R. Cipolla, Boosted manifold principal angles for image set-based recognition, Pattern Recogn. 40 (9) (2007) 2475–2484.
[30] Y.-C. Chen, V.M. Patel, S. Shekhar, R. Chellappa, P.J. Phillips, Video-based face recognition via joint sparse representation, in: IEEE Conference and Workshops on Automatic Face and Gesture Recognition, 2013.
[31] Z. Cui, W. Li, D. Xu, S. Shan, X. Chen, Fusing robust face region descriptors via multiple metric learning for face recognition in the wild, in: IEEE CVPR, 2013.
[32] L. Wolf, T. Hassner, Y. Taigman, Effective unconstrained face recognition by combining multiple descriptors and learned background statistics, IEEE Trans. PAMI 33 (10) (2011) 1978–1990.
[33] L. Wolf, N. Levy, The SVM-minus similarity score for video face recognition, in: IEEE CVPR, 2013.
[34] J. Lu, G. Wang, P. Moulin, Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning, in: IEEE ICCV, 2013.
[35] R. Vemulapalli, J.K. Pillai, R. Chellappa, Kernel learning for extrinsic classification of manifold features, in: IEEE CVPR, 2013.
[36] M. Yang, P. Zhu, L. Van Gool, L. Zhang, Face recognition based on regularized nearest points between image sets, in: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013.
[37] T.-K. Kim, J. Kittler, R. Cipolla, Learning discriminative canonical correlations for object recognition with image sets, in: ECCV, 2006.
[38] J. Lu, Y.-P. Tan, G. Wang, Discriminative multimanifold analysis for face recognition from a single training sample per person, IEEE Trans. PAMI 35 (1) (2013) 39–51.
[39] P. Zhu, L. Zhang, W. Zuo, D. Zhang, From point to set: extend the learning of distance metrics, in: IEEE ICCV, 2013.
[40] Z. Huang, R. Wang, S. Shan, X. Chen, Learning Euclidean-to-Riemannian metric for point-to-set classification, in: IEEE CVPR, 2014.
[41] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (12) (2004) 2639–2664.
[42] Å. Björck, G.H. Golub, Numerical methods for computing angles between linear subspaces, Math. Comput. 27 (123) (1973) 579–594.
[43] L. Wang, X. Wang, J. Feng, Subspace distance analysis with application to adaptive Bayesian algorithm for face recognition, Pattern Recogn. 39 (3) (2006) 456–464.
[44] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[45] L. Wolf, T. Hassner, I. Maoz, Face recognition in unconstrained videos with matched background similarity, in: IEEE CVPR, 2011.
[46] A.M. Martínez, A.C. Kak, PCA versus LDA, IEEE Trans. PAMI 23 (2) (2001) 228–233.
[47] M. Weber, Frontal face 1999.
[48] D. Miranda, The face we make.
[49] M. Kim, S. Kumar, V. Pavlovic, H. Rowley, Face tracking and recognition with visual constraints in real-world videos, in: IEEE CVPR, 2008.
[50] X. Wei, C.-T. Li, Y. Hu, Face recognition with occlusion using dynamic image-to-class warping (DICW), in: IEEE International Conference on Automatic Face and Gesture Recognition, 2013.
[51] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Trans. PAMI 29 (6) (2007) 1005–1018.
[52] K.-C. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. PAMI 27 (5) (2005) 684–698.