
To appear in: Pattern Recognition
PII: S0031-3203(16)30220-5
DOI: http://dx.doi.org/10.1016/j.patcog.2016.08.007
Received: 6 November 2015; Revised: 17 May 2016; Accepted: 9 August 2016

Collaborative Probabilistic Labels for Face Recognition from Single Sample per Person Hong-Kun Ji a,b, Quan-Sen Sun 1,a, Ze-Xuan Ji a, Yun-Hao Yuan c, Guo-Qing Zhang a a

School of Computer Science and Engineering, Nanjing University of Science & Technology, Nanjing, Jiangsu, 210094, P.R. China b School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, 637553 c Department of Computer Science and Technology, Yangzhou University, Yangzhou, Jiangsu, 225000

Abstract: Single sample per person (SSPP) recognition is one of the most challenging problems in face recognition (FR), due to the lack of information available to predict the variations in the query sample. To address this problem, we propose a novel face recognition algorithm based on robust collaborative representation (CR) and a probabilistic graph model, called Collaborative Probabilistic Labels (CPL). First, by utilizing label propagation, we construct probabilistic labels for the samples in the generic training set corresponding to those in the gallery set, so that the discriminative information of the unlabeled data can be effectively exploited. Then, the adaptive variation type for a given test sample is automatically estimated. Finally, we propose a novel reconstruction-based classifier that combines the test sample's adaptive dictionary with the probabilistic labels. The proposed probabilistic graph based model is adaptively robust to various variations in face images, including illumination, expression, occlusion and pose, and reduces the required training images to one sample per class. Experimental results on five widely used face databases demonstrate the efficacy of the proposed approach.

1. Introduction

Over the past three decades, face recognition (FR), one of the most visible applications in computer vision, has received significant attention [1], and a large number of algorithms have been proposed in recent years [2-25]. Representative and popular algorithms include principal component analysis (PCA) [13], linear discriminant analysis (LDA) [2], locality preserving projections (LPP) [8], sparse representation based classification (SRC) [19] and their weighted, kernelized, and two-dimensional variants [10,11,15,17,25]. Recently, it has been shown that it is the collaborative representation (CR) [23-24], not the l1-norm sparsity, that makes SRC powerful for face classification; moreover, CR has significantly lower computational complexity. However, because it is usually difficult, expensive and time-consuming to collect sufficient labeled samples, many traditional methods, including collaborative representation based classification (CRC), degrade when only a few labeled samples per person are available. To alleviate this problem, several semi-supervised learning (SSL) algorithms [43-47], which exploit a large number of unlabeled data to help build a better classifier from the labeled data, have been proposed in recent years. The performance of the above methods in face recognition, however, is heavily influenced by the number of training samples per person [26-27]. In many practical FR applications (e.g., law enforcement, e-passport, surveillance, ID card identification), only a single sample per person is available. This makes FR particularly hard, since the information available for prediction is very limited while the variations in the query sample are abundant, including background illumination, pose, and facial corruption/disguise such as makeup, beard, and occlusions (glasses and scarves). This setting is commonly referred to as single sample per person (SSPP) classification. To address the SSPP problem, many specially designed FR methods have been developed, which can be broadly classified into three categories [26]: image partitioning, virtual sample generation and generic learning.

Corresponding author. Tel.: +86 13611507508; Email: [email protected]

For the first category, the image partitioning based methods [29-33] first partition each face image into several local patches and then apply discriminant learning techniques for feature extraction. For example, Lu et al. proposed the Discriminative Multi-Manifold Analysis (DMMA) method [29-30], which partitions each image into local patches used as features for training and converts face recognition into a manifold-to-manifold matching problem. Although these methods have led to improved FR results, local feature extraction and discriminative learning from local patches are sensitive to image variations (e.g., extreme illumination and occlusion). For the second category [34-37], additional training samples for each person are virtually generated so that discriminative subspace learning can be used for feature extraction. An early attempt [34] shows that images with the same expression lie on a common subspace, which can be called the expression subspace. By projecting an image with an arbitrary expression into the expression subspaces, new expression images can be synthesized. By means of the synthesized images for individuals with a single image sample, a more accurate estimate of the within-individual variability can be obtained, which contributes to a significant improvement in recognition. A common shortcoming of these methods is that the highly correlated virtually generated samples may contain much redundancy, and some linear transformations (e.g., expression subspaces) may not be suitable for all face variations, such as pose changes and uncontrollable noise. For the third category [38-42], an additional generic training set with multiple samples per person (MSPP) is used to extract discriminative features, which are subsequently used to identify the people each enrolled with a single sample. Based on the observation that face variations of different subjects share much similarity, Yang et al. [38] learned a sparse variation dictionary by taking the relationship between the gallery set and an additional generic training set into account. The generic training set with multiple samples per person can bring new and useful information to the SSPP gallery set, and the sparse variation dictionary learning (SVDL) scheme shows state-of-the-art performance in FR with SSPP.


Fig. 1. Flowchart of the proposed CPL for FR with SSPP. First, for all samples in the gallery set (shown here with the smile expression for brevity; in practice they may exhibit different variations), we seek their corresponding adaptive variation subsets, i.e., the subset in the red dashed box from the generic training set (only four variations are shown as an example). Then, we use the gallery set and the adaptive variation subsets to train the probabilistic label matrix. Given a test sample (green solid box), our algorithm automatically seeks its adaptive variation subset T, i.e., the subset in the green dashed box, which is used for classification together with the gallery set and the probabilistic label matrix.

Although many improvements have been reported, SVDL and other generic learning based methods focus only on the variations and ignore both the discriminative information of the generic training set and the essential collaborative representation relationship between the gallery set and the generic training set, which will be described in detail in Section 3.1. Motivated by this consideration, by recent advances in CR [23-24] showing that samples from other categories can also contribute to reconstructing a query sample, and by SSL techniques [43-47] that construct a faithful graph from both labeled and unlabeled samples, we propose in this paper novel Collaborative Probabilistic Labels (CPL) for FR with SSPP, whose framework is illustrated in Fig. 1. The training samples in the gallery set are used to build a gallery dictionary, and the samples in the generic training set, which contain plenty of variations, are used as adaptive probabilistic label dictionaries. Using group sparse coding [48], we first seek a specific adaptive variation subset from the generic training set for each sample in the gallery set. By extracting the collaborative reconstructive relationship between each gallery sample and its corresponding adaptive variation subset and applying label propagation, we then obtain the probabilistic label matrix. Finally, given a test sample, we construct a novel reconstruction-based classifier from the probabilistic label matrix, the gallery set and an adaptive variation dictionary corresponding to the test sample.

2. Background and related work

2.1. Collaborative representation

Suppose that we have $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m \times n}$, where $x_i$, $m$ and $n$ denote the $i$-th sample, the feature dimensionality and the total number of training samples, respectively. To collaboratively represent a query sample $y \in \mathbb{R}^m$ over $X$, we can use the regularized least squares model

$$\hat{\rho} = \arg\min_{\rho} \left\{ \|y - X\rho\|_2^2 + \lambda \|\rho\|_2^2 \right\}, \qquad (1)$$

where $\lambda$ is the regularization parameter. The solution of the CR model in Eq. (1) can be derived analytically as

$$\hat{\rho} = (X^T X + \lambda I)^{-1} X^T y, \qquad (2)$$

where $I$ is the $n \times n$ identity matrix. Thus, the collaborative reconstructive weight vector $\hat{\rho}$ is obtained simply by projecting $y$ onto $(X^T X + \lambda I)^{-1} X^T$, which is independent of $y$ and can therefore be precomputed.

2.2. Group sparse coding

Suppose $x \in \mathbb{R}^n$ is an unknown group sparse solution. Let $\{x_{p_i} \in \mathbb{R}^{n_i}, i = 1, 2, \cdots, s\}$ be the grouping of $x$, where $p_i \subseteq \{1, 2, \cdots, n\}$ is the index set of the $i$-th group and $x_{p_i}$ denotes the subvector of $x$ indexed by $p_i$. The weighted $\ell_{2,1}$-norm is defined as

$$\|x\|_{w,2,1} = \sum_{i=1}^{s} w_i \|x_{p_i}\|_2, \qquad (3)$$

where the $w_i$ are weights associated with each group.
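To make the coding step concrete, here is a minimal NumPy sketch of the closed-form CR coding of Eqs. (1)-(2) and the weighted $\ell_{2,1}$-norm of Eq. (3). The function names (cr_code, weighted_l21_norm) are ours, not from the paper.

```python
import numpy as np

def cr_code(X, y, lam=0.01):
    """Collaborative representation coding, Eqs. (1)-(2).

    X   : (m, n) matrix whose columns are training samples.
    y   : (m,) query sample.
    lam : regularization parameter lambda.
    Returns rho_hat minimizing ||y - X rho||_2^2 + lam * ||rho||_2^2.
    """
    n = X.shape[1]
    # The projection matrix depends only on X, so it can be precomputed
    # once and reused for every query sample.
    P = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T)
    return P @ y

def weighted_l21_norm(x, groups, w):
    """Weighted l_{2,1}-norm of Eq. (3): sum_i w_i * ||x_{p_i}||_2.

    groups : list of index arrays p_i; w : per-group weights w_i.
    """
    return sum(wi * np.linalg.norm(x[p]) for p, wi in zip(groups, w))
```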

We consider the following basis pursuit (BP) model:

$$\min_{x} \|x\|_{w,2,1}, \quad \text{s.t. } Ax = b, \qquad (4)$$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$. Without loss of generality, we assume $A$ has full rank. Assuming the grouping is a partition of the solution, Deng et al. [48] proposed an algorithm derived from the primal problem for group sparse optimization with mixed $\ell_{2,1}$-regularization; the algorithm has a convergence guarantee inherited from the alternating direction method (ADM). They first introduce an auxiliary variable $z$ and transform Eq. (4) into the equivalent problem

$$\min_{x,z} \|z\|_{w,2,1} = \sum_{i=1}^{s} w_i \|z_{p_i}\|_2, \quad \text{s.t. } z = x, \; Ax = b. \qquad (5)$$

The augmented Lagrangian problem of Eq. (5) used by ADM is of the form

$$\min_{x,z} \|z\|_{w,2,1} - \lambda_1^T (z - x) + \frac{\delta_1}{2}\|z - x\|_2^2 - \lambda_2^T (Ax - b) + \frac{\delta_2}{2}\|Ax - b\|_2^2, \qquad (6)$$

where $\lambda_1 \in \mathbb{R}^n$ and $\lambda_2 \in \mathbb{R}^m$ are multipliers and $\delta_1, \delta_2 > 0$ are penalty parameters. Minimizing Eq. (6) with respect to $x$ reduces to solving the linear system

$$(\delta_1 I + \delta_2 A^T A)x = \delta_1 z - \lambda_1 + \delta_2 A^T b + A^T \lambda_2, \qquad (7)$$

and minimizing Eq. (6) with respect to $z$ reduces to the one-dimensional shrinkage (soft thresholding) formula

$$z_{p_i} = \max\left\{\|r_i\|_2 - \frac{w_i}{\delta_1}, 0\right\} \frac{r_i}{\|r_i\|_2}, \quad i = 1, \cdots, s, \qquad (8)$$

where $r_i = x_{p_i} + \frac{1}{\delta_1}(\lambda_1)_{p_i}$ and the convention $0 \cdot \frac{0}{0} = 0$ is followed. Finally, the multipliers $\lambda_1$ and $\lambda_2$ are updated in the standard way:

$$\lambda_1 \leftarrow \lambda_1 - \tau_1 \delta_1 (z - x), \qquad \lambda_2 \leftarrow \lambda_2 - \tau_2 \delta_2 (Ax - b), \qquad (9)$$

where $\tau_1, \tau_2 > 0$ are step lengths.
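The following is a compact sketch of this ADM scheme. The default parameter values are taken from Section 4.2 of this paper; the function name group_bp_adm, the multiplier initialization and the fixed iteration count are our own choices, not part of [48].

```python
import numpy as np

def group_bp_adm(A, b, groups, w=None, delta1=None, delta2=None,
                 tau1=1.618, tau2=1.618, n_iter=200):
    """ADM for the group basis-pursuit model (4) via Eqs. (5)-(9).

    A      : (m, n) dictionary matrix, b : (m,) target vector.
    groups : list of index arrays p_i partitioning {0, ..., n-1}.
    w      : per-group weights w_i (default: all ones).
    """
    m, n = A.shape
    w = np.ones(len(groups)) if w is None else w
    scale = np.mean(np.abs(b))
    delta1 = 0.3 / scale if delta1 is None else delta1   # penalty parameters
    delta2 = 3.0 / scale if delta2 is None else delta2   # (Section 4.2 defaults)
    x, z = np.zeros(n), np.zeros(n)
    lam1, lam2 = np.ones(n), np.ones(m)                  # multipliers lambda_1, lambda_2
    # The coefficient matrix of the x-subproblem (7) is fixed, so build it once.
    M = delta1 * np.eye(n) + delta2 * (A.T @ A)
    for _ in range(n_iter):
        # x-step, Eq. (7)
        rhs = delta1 * z - lam1 + delta2 * (A.T @ b) + A.T @ lam2
        x = np.linalg.solve(M, rhs)
        # z-step, group-wise shrinkage, Eq. (8)
        r = x + lam1 / delta1
        for p, wi in zip(groups, w):
            nr = np.linalg.norm(r[p])
            z[p] = 0.0 if nr == 0 else max(nr - wi / delta1, 0.0) * r[p] / nr
        # multiplier update, Eq. (9)
        lam1 = lam1 - tau1 * delta1 * (z - x)
        lam2 = lam2 - tau2 * delta2 * (A @ x - b)
    return x
```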

3. Proposed approach

3.1. Motivation

In supervised learning, a test sample can be classified reliably when the dictionary contains enough labeled training samples for each category. In semi-supervised learning, the labeled samples are very limited. If we only use a few labeled data as a dictionary to represent a test sample, the representation error may be very large, even when the unlabeled test sample actually shares the same identity with the labeled training samples. One obvious remedy is to use more samples in the dictionary so that it has stronger representation ability. Thus, in semi-supervised learning, we can apply label propagation techniques to label the samples that are unlabeled but share the same identity with a labeled sample. In the SSPP problem, however, even unlabeled samples sharing the same identity with a labeled sample are far from accessible. Zhang et al. [23] revealed that it is the collaborative representation (CR) mechanism that truly improves FR accuracy, and one significant fact in FR is that face images of different classes share similarities: some samples from class $B$ may be very helpful for representing a test sample with label $A$. To clarify this point, we denote an arbitrary original face image by $f_o$, its reconstruction using images having the same label by $f_s$, and its reconstruction using all images having different labels by $f_d$. We define the variation between $f_o$ and $f_s$ as the true variation and the variation between $f_o$ and $f_d$ as the virtual variation. (Note that this definition of true and virtual variation differs from that of intra-personal and inter-personal variation described in [28, 40], which denote the intra-class and inter-class scatter, respectively.) We then define the difference between the true variation and the virtual variation as the magnification variation:

$$\sigma_{mag} = \|f_d - f_o\|_2 - \|f_s - f_o\|_2. \qquad (10)$$

Obviously, $\sigma_{mag} > 0$ means that $f_s$ is more similar to $f_o$ than $f_d$ is; otherwise, $f_d$ is the better reconstruction. Using the Multiple PIE database [53], Fig. 2 shows an example of the three kinds of face images and the magnification variations of 700 random samples, where the image gray values lie in [0, 255]. As we can see from Fig. 2(a), $f_s$ and $f_d$ are both sufficiently similar to $f_o$, and Fig. 2(b) shows that $f_d$ performs almost as well as $f_s$, although $f_s$ is slightly better on the whole. Thus, in this paper, this "lack of samples" problem is addressed by taking the face images from the other, unknown categories as possible samples of each labeled one. We apply a large number of unlabeled samples in the generic training set, together with a probabilistic label matrix, to construct an additional adaptive probabilistic dictionary for a given test sample. Fortunately, unlabeled samples with sufficient types of variations are usually abundant and can be easily obtained.

Fig. 2. (a) $f_d$, $f_o$ and $f_s$ of a sample; (b) magnification variations of 700 random samples.
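As an illustration of Eq. (10), the following sketch computes $\sigma_{mag}$ for one image, reusing the cr_code helper from the sketch in Section 2.1. The paper does not specify the exact reconstruction scheme, so using CR coding here, as well as the function name and the column-vector data layout, are our assumptions.

```python
import numpy as np

def magnification_variation(fo, X_same, X_diff, lam=0.01):
    """Eq. (10): sigma_mag = ||f_d - f_o||_2 - ||f_s - f_o||_2.

    fo     : (m,) original face image, vectorized.
    X_same : (m, n_s) images sharing fo's label (used to build f_s).
    X_diff : (m, n_d) images of all other labels (used to build f_d).
    """
    fs = X_same @ cr_code(X_same, fo, lam)   # reconstruction from same-label images
    fd = X_diff @ cr_code(X_diff, fo, lam)   # reconstruction from other-label images
    return np.linalg.norm(fd - fo) - np.linalg.norm(fs - fo)
```

A positive value indicates that the same-label reconstruction is closer to the original; Fig. 2(b) suggests this quantity remains small for most samples, i.e., $f_d$ reconstructs nearly as well as $f_s$.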

3.2. Representation of face images with variation

Suppose that we have a sufficiently large generic training set $\{D_{(v)}\}_{v=1}^{N} \in \mathbb{R}^{m \times (n \times N)}$ with a sufficient number $n$ of individuals and $N$ types of variations (e.g., illumination, pose, facial corruption, etc.), where $m$ denotes the feature dimensionality. Each individual in the generic training set has $N$ face images, one for each of the $N$ types of variations, which enhances generalization. Since high-dimensional face images can usually be well represented when the number of atoms in the dictionary is sufficiently large and the sparse representation is correctly computed [19], a normal original sample $g$ can be represented as $g = D\alpha$, where $D$ is the subset of the generic training set that shares the same type of variation as $g$ (i.e., the samples in $D$ are all normal original face images, one per individual), and $\alpha$ is the representation coefficient vector of $g$ over $D$. Let $g_{(v)}$ denote a sample that has the same identity as $g$ but differs from it by some variation of illumination, expression or pose, where the subscript $v$ denotes the type of variation. Similarly, we can represent $g_{(v)}$ as

$$g_{(v)} = D_{(v)} \alpha_{(v)}, \quad v = 1, 2, \cdots, N, \qquad (11)$$

where $D_{(v)}$ is the counterpart of $D$ with variation type $v$. Since $g_{(v)}$ and $D_{(v)}$ show the same variation relative to $g$ and $D$, the representation coefficients $\alpha$ and $\alpha_{(v)}$ should also be almost the same:

$$\alpha_{(v)} \approx \alpha, \quad v = 1, 2, \cdots, N. \qquad (12)$$

Eq. (12) is based on the fact that people with similar normal frontal appearance should also have similar appearance under other variations. This assumption has been successfully used in illumination-invariant, expression-invariant and pose-invariant FR [49-50] and in speech animation [51]. A reference subset $D_i$ for a gallery individual $g_i$ from the gallery set $G = \{g_i\}_{i=1}^{c} \in \mathbb{R}^{m \times c}$, where $c$ denotes the number of categories, can be extracted through group sparse coding [48]:

$$D_i = D_{(\hat{v})}, \quad \text{where } \hat{v} = \arg\min_{v} \|g_i - D_{(v)} \hat{\alpha}_{(v)}\|_2 \quad \text{and} \quad \hat{\alpha} = \arg\min_{\alpha_{(v)}} \sum_{(v)} \|\alpha_{(v)}\|_2 \quad \text{s.t. } g_i \approx \sum_{(v)} D_{(v)} \alpha_{(v)}.$$
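A sketch of this selection step, reusing the group_bp_adm solver sketched in Section 2.2: each variation subset $D_{(v)}$ forms one group in the concatenated dictionary, and the subset whose block best reconstructs the gallery sample on its own is chosen. The function name and interface are our assumptions.

```python
import numpy as np

def adaptive_variation_subset(g, D_subsets, **adm_kwargs):
    """Select the reference subset D_i for a gallery sample g (Section 3.2).

    g         : (m,) gallery sample.
    D_subsets : list of N arrays, each (m, n); D_subsets[v] holds the
                generic samples of variation type v.
    Returns (v_hat, D_subsets[v_hat]).
    """
    n = D_subsets[0].shape[1]
    A = np.hstack(D_subsets)                                   # concatenated dictionary
    groups = [np.arange(v * n, (v + 1) * n) for v in range(len(D_subsets))]
    alpha = group_bp_adm(A, g, groups, **adm_kwargs)           # group-sparse code of g
    # Pick the variation type whose coefficient block reconstructs g best.
    residuals = [np.linalg.norm(g - D_subsets[v] @ alpha[p])
                 for v, p in enumerate(groups)]
    v_hat = int(np.argmin(residuals))
    return v_hat, D_subsets[v_hat]
```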

3.3. Collaborative probabilistic label based classifier

3.3.1. Algorithm

Given the gallery set and the generic training set described above, for each $g_i \in G$ we reconstruct it by solving the following optimization problem:

$$\hat{\beta}_i = \arg\min_{\beta_i} \left\{ \|g_i - D_i \beta_i\|_F^2 + \lambda \|\beta_i\|_F^2 \right\}, \qquad (13)$$

where $\lambda$ is the regularization parameter as in CR. Here we use the $\ell_2$-norm to regularize the representation vector $\beta_i$, since no subject in $D_i$ has the same identity as $g_i$ and thus more subjects in $D_i$ should be involved in representing $g_i$. Denote $\beta_i = \hat{\beta}_i / \|\hat{\beta}_i\|_2$ and $\beta_i^T = [\beta_{i1}, \beta_{i2}, \cdots, \beta_{in}]$, where $\beta_{ij}$ is the $j$-th representation coefficient of $\beta_i$. Obviously, the larger $\beta_{ij}$ is, the greater the contribution/affinity of the $j$-th individual in the generic training set to the $i$-th category in the gallery set. We can construct a representation matrix from all $c$ representation vectors:

$$R = [\beta_1, \beta_2, \cdots, \beta_c]. \qquad (14)$$

Let $u_j$ denote the $j$-th unlabeled individual in the generic training set. Since $g_i$ can be linearly reconstructed, with weight vector $\beta_i$, from the samples of $U = \{u_j\}_{j=1}^{n}$ having the appropriate variation for $g_i$, there is reason to believe that the label information of $g_i$ can also be linearly propagated to each $u_j$ with the same weight coefficients. However, we cannot assign a fixed label to $u_j$, because $u_j$ contributes to almost all gallery samples. Since each $\beta_i$ is normalized, i.e., the contributions of $\{u_j\}_{j=1}^{n}$ to a specific $g_i$ have been collaboratively balanced, we can predict probabilistic label information for each $u_j$ from $G$ and obtain the following probabilistic label matrix:

$$P = [p_1, p_2, \cdots, p_c], \qquad (15)$$

where $p_{ij} = \beta_{ij} / \sum_{(i)} \beta_{ij}$ denotes the $j$-th coefficient of $p_i$. Thus we obtain the probabilistic graph $\{G, U, P\}$, which reflects the affinities between the samples in the gallery set and those in the generic training set. A diagram of this graph and a brief classification strategy are shown in Fig. 3; the detailed classifier is described in Section 3.3.2.

Fig. 3. The weighted graph (in the blue dashed box) is represented by the blue connections between the generic training set $U$ (blue nodes) and the gallery set $G$ (green and orange nodes, where different colors denote specific categories). The weight between $u_j$ and $g_i$ is the probabilistic label $p_{ij}$, where $p_{ij}$ denotes the $j$-th element of $p_i$ in Eq. (15). Given a test sample $y$, we first obtain its representation vector $\alpha$ with respect to its adaptive dictionary. We then propagate $\alpha$ through the connections and obtain a specific category for $y$. Obviously, the weight of the connection of a node $g_i$ with itself is 1.
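Putting Eqs. (13)-(15) together, a minimal sketch of how the probabilistic label matrix could be built, reusing the cr_code helper sketched earlier (names and interfaces are ours):

```python
import numpy as np

def probabilistic_label_matrix(G, D_refs, lam=0.01):
    """Probabilistic labels of Eqs. (13)-(15).

    G      : (m, c) gallery set, one column per class.
    D_refs : list of c arrays; D_refs[i] is the (m, n) reference subset D_i
             selected for gallery sample g_i (Section 3.2).
    Returns P of shape (n, c); column i is p_i.
    """
    c = G.shape[1]
    betas = []
    for i in range(c):
        beta = cr_code(D_refs[i], G[:, i], lam)     # Eq. (13): ridge coding over D_i
        betas.append(beta / np.linalg.norm(beta))   # normalize: beta_i / ||beta_i||_2
    R = np.stack(betas, axis=1)                     # Eq. (14): R = [beta_1, ..., beta_c]
    P = R / R.sum(axis=1, keepdims=True)            # Eq. (15): p_ij = beta_ij / sum_i beta_ij
    return P
```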

3.3.2. Classifier

Given a test sample $y$, we first acquire a variation dictionary $D_y$ for $y$ using group sparse coding [48], as described in Section 3.2. Together with the gallery set $G$, we construct an adaptive dictionary for $y$ as

$$D = [G, D_y]. \qquad (16)$$

Then we code $y$ as in CRC [23]:

$$\hat{\gamma} = \arg\min_{\gamma} \left\{ \|y - D\gamma\|_F^2 + \lambda \|\gamma\|_F^2 \right\}. \qquad (17)$$

Let $\hat{\gamma} = [\hat{\gamma}_1, \hat{\gamma}_2, \cdots, \hat{\gamma}_c, \hat{\gamma}_{(D_y)}]$, where $\hat{\gamma}_i$ is the coefficient associated with the $i$-th individual in the gallery set. Classification is conducted via

$$\mathrm{identity}(y) = \arg\max_{i} \left\{ \hat{\gamma}_i + p_i^T \hat{\gamma}_{(D_y)} \right\}, \quad i = 1, 2, \cdots, c. \qquad (18)$$

The complete CPL algorithm is summarized in Algorithm 1.

Algorithm 1: Collaborative Probabilistic Labels (CPL)
1: Initialization: set $\lambda$; normalize the samples in the gallery set and the generic training set.
2: Obtain projection coefficients:
   For $i = 1 : c$ do
      Extract the reference subset $D_i$ for $g_i$ through group sparse coding.
      Code $g_i$ via Eq. (13).
   End for
   Construct the representation matrix via Eq. (14).
3: Construct the probabilistic label matrix via Eq. (15).
4: Classify: code the test sample $y$ via Eq. (17) and identify $y$ via Eq. (18).
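The classification step of Eqs. (16)-(18) can be sketched as follows, reusing the cr_code, adaptive_variation_subset and probabilistic_label_matrix helpers from the earlier sketches. This is a simplified outline under our naming assumptions, not the authors' reference implementation.

```python
import numpy as np

def cpl_classify(y, G, D_subsets, P, lam=0.01):
    """CPL classification, Eqs. (16)-(18).

    y         : (m,) test sample.
    G         : (m, c) gallery set.
    D_subsets : list of N generic variation subsets, each (m, n).
    P         : (n, c) probabilistic label matrix (column i is p_i).
    Returns the predicted gallery class index in {0, ..., c-1}.
    """
    c = G.shape[1]
    _, D_y = adaptive_variation_subset(y, D_subsets)   # adaptive variation dictionary for y
    D = np.hstack([G, D_y])                            # Eq. (16): adaptive dictionary
    gamma = cr_code(D, y, lam)                         # Eq. (17): CR coding over D
    gamma_G, gamma_Dy = gamma[:c], gamma[c:]           # split gallery / generic coefficients
    scores = gamma_G + P.T @ gamma_Dy                  # Eq. (18): gamma_i + p_i^T gamma_(D_y)
    return int(np.argmax(scores))
```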

4. Experiments

In this section, to evaluate the proposed CPL method, we test its performance on face classification tasks using five widely used benchmark datasets: AR [52], CMU PIE [55], Multiple PIE [53], FERET [54] and Extended Yale-B [56]. We also compare the proposed method with several state-of-the-art methods for FR with SSPP, including SVDL [38], Adaptive Generic Learning (AGL) [40], Extended SRC (ESRC) [42], Expression Subspace Projection (ESP) [34] and Discriminative Multi-Manifold Analysis (DMMA) [29-30], as representative algorithms of the three categories described in Section 1. We also include several baseline classifiers, namely SRC [19], CRC [23], Nearest Subspace (NS) and PCA [13]. Note that NS reduces to Nearest Neighbor (NN) in the case of FR with SSPP. Among these methods, NN, PCA, SRC, CRC and DMMA do not use a generic training set; ESP generates virtual samples; and AGL, ESRC, SVDL and the proposed CPL require a generic training set.

4.1. Data preparation

4.1.1. Databases and image selection

The AR face database contains over 4000 color face images of 126 individuals (70 men and 56 women). In our experiments, we select 2600 face images of 100 individuals (50 men and 50 women). For each individual, there are 26 different grayscale images from two sessions with a resolution of 165 Γ— 120. These images are frontal views of faces with different facial expressions, lighting conditions and occlusions; the cropped color images are converted to grayscale. Thirteen sample images of one individual from the first session are shown in Fig. 4, with indices V1 to V13.

Fig. 4. 13 sample images of one individual in the AR database

The CMU PIE face database contains 41,368 images of 68 people. To test our method’s robustness to different poses, in our experiment, 20 images with different poses (frontal, left, right, up and down) under various illumination conditions from 66 individuals are selected. These face images are manually cropped and resized to 64 Γ— 64 pixels. 20 sample images of one individual are shown in Fig. 5 with indices as V1 to V20.

Fig. 5. 20 sample images of one individual in the PIE database

The large-scale Multiple PIE database contains 337 subjects, captured under 15 view points and 19 illumination conditions in four recording sessions for a total of more than 750,000 images with a resolution of 100 Γ— 82. In our experiments, we select 249 individuals’ frontal face images which were captured in four sessions with variations of illumination. For each subject in the first session and the fourth session, there are 20 variations, while there are 10 variations for each individual in the second session and the third session. As shown in Fig. 6, 20 sample images of one individual from session 1 are under different illuminations with indices from V1 to V20.

Fig. 6. 20 sample images of one individual in the Multiple PIE database

The FERET database consists of 13,539 facial images of 1565 subjects, who vary in ethnicity, gender and age. In our experiments, we use a subset of the FERET database comprising 600 gray-level frontal face images of 200 different persons (71 females and 129 males), each image with a size of 80 Γ— 80. Each person has three images, which differ in expression, illumination, etc. Fig. 7 shows sample images of four subjects from the FERET database, with variation indices V1 to V3 for each individual.

Fig. 7. 12 sample images of four individuals in the FERET database

The Extended Yale-B face database consists of over 2400 face images of 38 individuals. To evaluate the generalization ability of our model, we select for each individual 13 different grayscale images with a resolution of 96 Γ— 84, taken at different times under expression and illumination variations (left, right, or positive) similar to those of the AR database. Thirteen sample images of one individual are shown in Fig. 8, with variation indices V1 to V13.

Fig. 8. 13 sample images of one individual in the Yale-B database

4.1.2. Feature description

Because CPL, AGL, ESRC and SVDL use a generic training set, the gallery set is constructed from only a part of the categories, not all of them. In our experiments, NN uses the original image features. DMMA also uses the original image features, and the size of each block is tuned empirically. Because ESP generates virtual samples and must avoid the small sample size (SSS) problem, we apply the K-L transform to reduce the feature dimension to an appropriate value before estimating the within-subject variability (the value is governed by the number of distinct gallery categories c, since the maximal reduced dimensionality for the subsequent LDA is c - 1). Similarly, for AGL we apply the K-L transform to reduce the feature dimension to an appropriate value before estimating the within-class and between-class scatter from the generic training set (again bounded by c - 1 for the subsequent LDA). For the other methods, we reduce the feature dimensions to the same value for fairness. A brief description is listed in Table 1.

Table 1. Brief description of the data sets for classification.

| Dataset      | Classes | Training classes (c) | Variations (N) | NN        | ESP | AGL | Others | DMMA block |
|--------------|---------|----------------------|----------------|-----------|-----|-----|--------|------------|
| AR           | 100     | 30                   | 13             | 165 Γ— 120 | 200 | 200 | 100    | 33 Γ— 24    |
| PIE          | 66      | 16                   | 20             | 64 Γ— 64   | 200 | 200 | 100    | 12 Γ— 12    |
| Multiple PIE | 249     | 99                   | 20             | 100 Γ— 82  | 400 | 400 | 100    | 20 Γ— 16    |
| FERET        | 200     | 50                   | 3              | 80 Γ— 80   | 200 | 200 | 100    | 16 Γ— 16    |
| Yale-B       | 38      | –                    | 13             | –         | –   | 200 | 100    | –          |

4.2. Parameter setting

Several parameters need to be set, namely $\lambda$ in CR and the $w_i$, $\lambda_1$, $\lambda_2$, $\delta_1$, $\delta_2$, $\tau_1$, $\tau_2$ in group sparse coding. Following the suggestions in [23] and [48], we set these parameters as follows: the regularization parameter $\lambda$ is empirically selected as 0.01, with $\delta_1 = 0.3/\mathrm{mean}(\mathrm{abs}(b))$, $\delta_2 = 3/\mathrm{mean}(\mathrm{abs}(b))$, $\tau_1 = \tau_2 = 1.618$, $\lambda_1 = I_n$ and $\lambda_2 = I_m$, where $b$ denotes the current sample in the gallery set, i.e., $g_i$ in each iteration as described in Algorithm 1, and $I_n$ denotes an $n \times 1$ vector with all entries equal to 1.

4.3. Experiments on the AR database

We first test the robustness of our method against all comparison methods on the AR database, whose images contain various facial expressions, lighting conditions and occlusions (glasses and scarves). For each subject in the two sessions, denoted S1 and S2, there are 13 variations with indices V1 to V13, as shown in Fig. 4. Among the 100 subjects in S1, the first 30 are used for gallery training and the remaining 70 for generic training. In the following tests with different settings, we use the frontal images from V1, V2, V3, V5, V6 and V7 for the gallery set, because we consider them feasible and accessible in real-world applications.

1) Testing all variations: In this experiment, one fixed variation among V1, V2, V3, V5, V6 and V7 of the first 30 subjects in S1 is chosen as the gallery set, while all other samples in S1 and all samples in S2 that share the same labels with the gallery set are used for testing. The numbers of testing samples from S1 and S2 are therefore 12 Γ— 30 = 360 and 13 Γ— 30 = 390, respectively. Table 2 lists the recognition rates in the two sessions for all comparison methods.

Table 2. The recognition rates (%) on the AR database with all variations.

|    |      | NN    | PCA   | SRC   | CRC   | DMMA  | ESP   | AGL   | ESRC  | SVDL  | CPL   |
|----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| S1 | V1   | 44.17 | 44.17 | 79.17 | 65.28 | 74.44 | 74.17 | 90    | 90.28 | 91.67 | 92.22 |
| S1 | V2   | 38.61 | 38.61 | 73.33 | 55    | 66.39 | 79.72 | 86.11 | 87.22 | 85.83 | 88.06 |
| S1 | V3   | 38.33 | 38.33 | 75    | 52.78 | 66.94 | 77.78 | 81.94 | 83.06 | 86.11 | 83.61 |
| S1 | V5   | 34.72 | 34.72 | 79.44 | 50.56 | 82.78 | 90.56 | 86.94 | 92.22 | 91.39 | 94.17 |
| S1 | V6   | 31.39 | 31.39 | 80.56 | 47.78 | 74.44 | 92.78 | 85    | 91.67 | 92.22 | 93.06 |
| S1 | V7   | 20.56 | 20.56 | 76.67 | 61.11 | 48.33 | 84.17 | 77.5  | 90.28 | 91.39 | 92.78 |
| S1 | Mean | 34.63 | 34.63 | 77.36 | 55.42 | 68.89 | 83.19 | 84.45 | 89.12 | 89.77 | 90.65 |
| S2 | V1   | 34.36 | 34.36 | 65.64 | 52.31 | 68.97 | 57.18 | 72.78 | 79.49 | 80.77 | 83.59 |
| S2 | V2   | 32.56 | 32.56 | 57.95 | 48.46 | 61.54 | 54.1  | 64.72 | 73.06 | 75.9  | 77.95 |
| S2 | V3   | 30    | 30    | 54.87 | 45.64 | 63.59 | 52.56 | 69.17 | 70    | 67.18 | 72.82 |
| S2 | V5   | 31.28 | 31.28 | 67.69 | 44.36 | 72.82 | 67.69 | 63.33 | 80.26 | 81.28 | 81.79 |
| S2 | V6   | 26.15 | 26.15 | 61.28 | 37.44 | 68.97 | 75.13 | 64.44 | 79.17 | 78.46 | 80.77 |
| S2 | V7   | 20.77 | 20.77 | 66.67 | 51.28 | 51.54 | 60.77 | 56.11 | 81.94 | 80.51 | 83.08 |
| S2 | Mean | 29.19 | 29.19 | 62.35 | 46.58 | 64.57 | 61.24 | 65.09 | 77.32 | 77.35 | 80    |

From Table 2, we can draw four conclusions. First, all methods achieve better recognition rates when the gallery samples and testing samples come from the same session. Second, CPL outperforms all other comparison methods in almost all cases. Third, on V1 in S2, whose gallery images are taken under neutral expression and symmetric normal illumination (as shown in Fig. 4) and which is the most likely setting in real-world applications, CPL exceeds NN by 49.23%, PCA by 49.23%, SRC by 17.95%, CRC by 31.28%, DMMA by 14.62%, ESP by 26.41%, AGL by 10.81%, ESRC by 4.1% and SVDL by 2.82%. Finally, CPL shows a larger advantage when the gallery samples and testing samples come from different sessions: for example, its average recognition rates exceed those of SVDL by 0.88% and 2.65% in S1 and S2, respectively.

2) Testing robustness to occlusions: In this experiment, one fixed variation among V1, V2, V3, V5, V6 and V7 of the first 30 subjects in S1 is chosen as the gallery set, while all occluded samples in S1 and S2 that share the same labels with the gallery set are used for testing. The occluded samples are those from V8, V9, V10, V11, V12 and V13, so the numbers of testing samples from S1 and S2 are 6 Γ— 30 = 180 and 6 Γ— 30 = 180, respectively. Table 3 lists the recognition rates in the two sessions for all comparison methods.

Table 3. The recognition rates (%) on the AR database with all occluded samples.

|    |      | NN    | PCA   | SRC   | CRC   | DMMA  | ESP   | AGL   | ESRC  | SVDL  | CPL   |
|----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| S1 | V1   | 35    | 35    | 67.22 | 40    | 62.78 | 66.11 | 86.11 | 87.78 | 85    | 87.78 |
| S1 | V2   | 26.67 | 26.67 | 58.33 | 32.78 | 53.89 | 76.11 | 77.78 | 78.33 | 77.22 | 81.11 |
| S1 | V3   | 23.89 | 23.89 | 59.44 | 26.67 | 51.67 | 72.22 | 72.78 | 74.44 | 75.56 | 73.89 |
| S1 | V5   | 32.22 | 32.22 | 67.22 | 29.44 | 69.44 | 90.56 | 85    | 86.67 | 86.11 | 92.22 |
| S1 | V6   | 25    | 25    | 69.44 | 32.22 | 66.11 | 87.89 | 80.56 | 89.44 | 87.78 | 89.44 |
| S1 | V7   | 10    | 10    | 65.56 | 38.33 | 45.56 | 82.22 | 72.22 | 86.67 | 86.67 | 89.44 |
| S1 | Mean | 25.46 | 25.46 | 65.54 | 33.24 | 58.24 | 79.19 | 79.08 | 83.89 | 83.06 | 85.65 |
| S2 | V1   | 23.89 | 23.89 | 51.67 | 27.22 | 53.33 | 45.56 | 66.67 | 75.56 | 73.89 | 76.67 |
| S2 | V2   | 21.67 | 21.67 | 41.11 | 26.11 | 49.44 | 46.11 | 55.56 | 63.89 | 66.11 | 66.56 |
| S2 | V3   | 17.78 | 17.78 | 37.22 | 22.78 | 48.89 | 42.22 | 60.56 | 60    | 53.33 | 61.67 |
| S2 | V5   | 23.33 | 23.33 | 54.44 | 25.56 | 58.89 | 60    | 61.67 | 73.33 | 75    | 75.56 |
| S2 | V6   | 15    | 15    | 46.11 | 23.33 | 56.67 | 68.89 | 62.78 | 72.22 | 71.11 | 73.89 |
| S2 | V7   | 9.44  | 9.44  | 53.33 | 26.67 | 46.11 | 53.89 | 54.44 | 75    | 72.78 | 80    |
| S2 | Mean | 18.52 | 18.52 | 47.31 | 25.28 | 52.22 | 52.78 | 60.28 | 70    | 68.7  | 72.37 |

From Table 3 we can see that CPL performs much better than all other methods under almost all conditions. Compared with Table 2, CPL shows an even larger advantage on the occluded testing samples, with its average recognition accuracy in S2 exceeding that of NN by 53.85%, PCA by 53.85%, SRC by 25.06%, CRC by 47.09%, DMMA by 20.15%, ESP by 19.59%, AGL by 12.09%, ESRC by 2.37% and SVDL by 3.67%. This demonstrates that CPL is much more robust to image occlusions.

3) Testing robustness to random gallery variations: In real-world applications such as surveillance, the gallery samples may be taken at different moments and under different variations. Thus, in this experiment, one random variation among V1, V2, V3, V5, V6 and V7 is chosen for each of the first 30 subjects in S1 to form the gallery set, while all other samples in S1 and all samples in S2 that share the same labels with the gallery set are used for testing. The numbers of testing samples from S1 and S2 are therefore 12 Γ— 30 = 360 and 13 Γ— 30 = 390, respectively. Because there are 6^30 possible combinations, ten independent tests are performed and the average recognition rates are reported. Table 4 lists the recognition rates in the two sessions for all comparison methods.

Table 4. The average recognition rates (%) on the AR database with random training variations.

|    | NN    | PCA   | SRC   | CRC   | DMMA  | ESP   | AGL   | ESRC  | SVDL  | CPL   |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| S1 | 23.61 | 23.61 | 60.03 | 58    | 63.8  | 61.81 | 81.47 | 86.25 | 85.81 | 87.11 |
| S2 | 23.82 | 23.82 | 46.85 | 46.41 | 51.61 | 46.79 | 64.08 | 71.05 | 69.89 | 75.97 |

Once again, our method achieves the best results among all comparison methods. Moreover, whereas with fixed gallery variations CPL exceeds SVDL by 0.88% and 2.65% on average in S1 and S2, with random gallery variations the margins grow to 1.3% and 6.08%. This shows, on the one hand, that our method is more robust to random gallery variations and, on the other hand, confirms again that CPL deals better with the variations between different sessions.

4.4. Experiments on the PIE database

We next evaluate the robustness of our method to different poses together with illumination variations on the PIE database. For each subject, there are 20 variations with indices V1 to V20, as shown in Fig. 5. Among the 66 subjects, the first 16 are used for gallery training and the remaining subjects for generic training. In the following tests, we use the images from V3 for the gallery set, because we consider them feasible and accessible in real-world applications. In the first test, we select the images with different poses but similar lighting conditions (V7, V11, V15 and V19) as testing samples, so the number of testing samples is 4 Γ— 16 = 64. In the second test, all the other samples, which share the same labels but have different poses and illumination compared with the gallery set, are selected for testing, so the number of testing samples is 19 Γ— 16 = 304. Table 5 lists the recognition rates of the two tests.

Table 5. The recognition rates (%) on the PIE database.

|                     | NN    | PCA   | SRC  | CRC   | DMMA  | ESP   | AGL   | ESRC  | SVDL  | CPL   |
|---------------------|-------|-------|------|-------|-------|-------|-------|-------|-------|-------|
| Pose                | 89.06 | 89.06 | 87.5 | 85.94 | 59.38 | 35.94 | 78.13 | 93.75 | 93.75 | 96.88 |
| Pose & Illumination | 25.33 | 25.33 | 50   | 26.32 | 19.08 | 24.67 | 60.86 | 75    | 74.34 | 77.3  |

Table 5 shows that our method achieves the best recognition performance under pose variations, exceeding all other comparison methods by at least 3.13%. In addition, when dealing with different poses together with illumination variations, CPL still shows an impressive advantage over the other methods, even though the recognition accuracies of all methods decrease under this more severe experimental setting.

4.5. Experiments on the Multiple PIE database

We then conduct FR with SSPP on the Multiple PIE database. For each subject in the first session, denoted S1, there are 20 variations with indices V1 to V20, as shown in Fig. 6. Among the 249 subjects in S1, the first 99 are used for gallery training and the remaining subjects for generic training. In each test, we use the frontal images from one variation for gallery training, while all other samples in S1 and all samples in S2, S3 and S4 that share the same labels with the gallery set are used for testing. The numbers of testing samples from S1, S2, S3 and S4 are therefore 19 Γ— 99 = 1881, 10 Γ— 99 = 990, 10 Γ— 99 = 990 and 20 Γ— 99 = 1980, respectively. Table 6 lists the average recognition accuracy over the different gallery variations as well as the recognition rates obtained with the samples from V18 as the gallery set, since these are taken under symmetric normal illumination, which is the most feasible and accessible setting for real-world applications such as law enforcement and ID card identification.

Table 6. The recognition rates (%) on the Multiple PIE database with random training variation.

|      |    | NN    | PCA   | SRC   | CRC   | DMMA  | ESP   | AGL   | ESRC  | SVDL  | CPL   |
|------|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| V18  | S1 | 31.92 | 31.31 | 95.71 | 44.85 | 79.23 | 95.69 | 97.24 | 99.52 | 99.84 | 99.79 |
| V18  | S2 | 24.58 | 24.17 | 86.39 | 35.83 | 68.11 | 79.03 | 90.97 | 95.25 | 95.97 | 95.76 |
| V18  | S3 | 22.43 | 21.86 | 71.43 | 31.14 | 60.43 | 68    | 80    | 86.46 | 86.29 | 87.29 |
| V18  | S4 | 22.37 | 22.24 | 76.18 | 32.5  | 63.31 | 71.45 | 83.88 | 88.89 | 89.93 | 91.18 |
| Mean | S1 | 26.51 | 26.31 | 92.87 | 51.34 | 75.28 | 94.21 | 96.46 | 99.53 | 99.11 | 99.53 |
| Mean | S2 | 22.45 | 22.28 | 80.94 | 41.33 | 63.45 | 76.65 | 88.92 | 92.42 | 93.26 | 93.53 |
| Mean | S3 | 19.01 | 18.85 | 67.74 | 33.46 | 56.11 | 65.25 | 78.85 | 85.05 | 86.01 | 86.24 |
| Mean | S4 | 20.02 | 19.9  | 73.16 | 37.61 | 61.3  | 67.34 | 82.21 | 89.85 | 88.52 | 91.08 |

As we can see, the recognition rates of all methods are much higher than those obtained on the AR database, and our method's margin is smaller. This is because the variations and occlusions in the Multiple PIE database, which are especially challenging for FR with SSPP, are much weaker than those in the AR database. However, the proposed CPL still achieves better recognition rates than the other methods.

4.6. Experiments on the FERET database

Finally, we conduct FR with SSPP on the FERET database. For each subject, there are 3 selected variations with indices V1 to V3, as shown in Fig. 7. Among the 200 subjects, the first 50 are used for gallery training and the remaining subjects for generic training. In each test, we use the frontal images from one variation for gallery training, while all other samples that share the same labels with the gallery set are used for testing, so the number of testing samples is 2 Γ— 150 = 300. Table 7 lists the average recognition accuracy as well as the recognition rates obtained with the samples from V1 as the gallery set, since these are taken under neutral expression and symmetric normal illumination.

Table 7. The recognition rates (%) on the FERET database with random training variation.

|      | NN    | PCA   | SRC   | CRC   | DMMA  | ESP | AGL   | ESRC  | SVDL  | CPL   |
|------|-------|-------|-------|-------|-------|-----|-------|-------|-------|-------|
| V1   | 53    | 53    | 94    | 86    | 62    | 36  | 88    | 95    | 95    | 96    |
| Mean | 41.67 | 41.67 | 87.67 | 81.33 | 59.67 | 59  | 91.11 | 92.22 | 92.67 | 93.67 |

Nevertheless, the proposed CPL method still achieves better recognition rates than all other comparison methods.

4.7. Evaluation of generalization ability

In this section, we evaluate the generalization ability of our model by training on one database and testing on another. We use the AR database for generic training and test on the PIE, Multiple PIE and Yale-B databases, respectively. In the first experiment, we select the images with frontal pose, i.e., V1-V4 as shown in Fig. 5, from all individuals in the PIE database, use the V3 images to construct the gallery set, and use the remaining images for testing, giving 3 Γ— 66 = 198 testing samples. In the second experiment, we select the images from S1 of Multiple PIE, with the V18 images (as shown in Fig. 6) for gallery training and the others for testing, giving 19 Γ— 249 = 4731 testing samples. In the third experiment, we select the images from the Yale-B database, use the V1 images (as shown in Fig. 8) to construct the gallery set, and use the remaining images for testing, giving 12 Γ— 38 = 456 testing samples. In all three experiments, the images of the AR database are used to construct the generic training set. The results of the generic training based methods are shown in Table 8.

Table 8. The recognition rates (%) with PIE, Multiple PIE or Yale-B as gallery and AR as generic training set.

|          | AGL   | ESRC  | SVDL  | CPL   |
|----------|-------|-------|-------|-------|
| PIE      | 26.77 | 58.08 | 69.19 | 72.22 |
| MultiPIE | 81.59 | 89.43 | 90.83 | 86.54 |
| Yale-B   | 75.44 | 87.72 | 89.04 | 93.86 |

From Table 8 we can see that the generic training based methods, i.e., AGL, ESRC, SVDL and the proposed algorithm, have limited generalization ability. In the first and third experiments, CPL achieves the best recognition accuracies, outperforming the others by at least 3.03% and 4.82%, respectively. In the second experiment, however, although CPL still shows a considerable advantage over AGL, it achieves a relatively lower recognition rate than the variation dictionary based methods, i.e., ESRC and SVDL. This is because our model directly uses the image features of the generic training set to represent those of the gallery, and CPL is based on the assumption that the variations in the gallery and generic training sets are well controlled and that the variations in the generic training set are plentiful enough to cover those of the gallery set. Different databases, however, are captured under very different experimental conditions, such as equipment, image size and cropping, which strongly influence the performance of CPL. Variation dictionary based methods, on the other hand, focus on learning a variation dictionary from the generic training set and use the learned dictionary for classification rather than using the image features of the generic training set directly. They may therefore sometimes cope better with the discrepancy between two databases whose images are captured under very different conditions. We believe this disadvantage can be greatly mitigated by constructing a complete generic training set or by well controlled image capture conditions, as confirmed by the experimental results presented in Sections 4.3-4.6.

5. Conclusion and future work

In this paper, we developed a novel technique for FR with SSPP, called Collaborative Probabilistic Labels (CPL). The efficacy of CPL has been evaluated on multiple recognition tasks using several popular databases, where it shows stronger discriminating ability than the comparison methods. By facilitating effective probabilistic label learning of the face space, our algorithm exhibits impressive robustness to variations and occlusions. Our model relies on the assumption that the generic training set is complete enough, with sufficient variations; due to uncontrolled noise or other defects in the data, such as the situation presented in Section 4.7, this assumption may not always hold. It is therefore worthwhile to build a more robust model to deal with this problem, and we are currently exploring these issues both in theory and in practice.

6. Acknowledgments This work is supported in part by Graduate Research and Innovation Foundation of Jiangsu Province, China under Grant KYLX15_0379, in part by the National Natural Science Foundation of China under Grants 61273251, 61401209, and 61402203, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20140790, and in part by China Postdoctoral Science Foundation under Grants 2014T70525 and 2013M531364.

Reference [1] W. Zhao, R. Chellppa, P.J. Phillips, A. Rosenfeld, Face recognition: A literature survey, ACM Computing Survey, 35 (4) (2003) 399-458. [2] P.N. Belhumeur, J. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, IEEE Trans. Pattern Analysis and Machine Intelligence, 19 (7) (1997) 711-720. [3] D. Cai, X. He, J. Han, Spectral Regression for Efficient Regularized Subspace Learning, in: Proceedings of International Conference on Computer Vision, 2007, pp. 1-8. [4] D. Cai, X. He, Y. Hu, J. Han, T. Huang, Learning a Spatially Smooth Subspace for Face Recognition, in: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-7. [5] H. Chen, H. Chang, T. Liu, Local Discriminant Embedding and Its Variants, in: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2005,

pp. 846-853. [6] Y. Fu, S. Yan, T.S. Huang, Classification and Feature Extraction by Simplexization, IEEE Trans. Information Forensics and Security, 3 (1) (2008) 91-100. [7] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood Preserving Embedding, in: Proceedings of International Conference on Computer Vision, 2005, pp. 1208-1213. [8] X. He, S. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face Recognition Using Laplacianfaces, IEEE Trans. Pattern Analysis and Machine Intelligence, 27 (3) (2005) 328-340. [9] H. Hu, Orthogonal Neighborhood Preserving Discriminant Analysis for Face Recognition, Pattern Recognition, 41 (6) (2008) 2045-2054. [10] H. Lu, K. Plataniotis, A. Venetsanopoulos, MPCA: Multilinear Principal Component Analysis of Tensor Objects, IEEE Trans. Neural Networks, 19 (1) (2008) 18-39. [11] J. Lu, Y. Tan, A Doubly Weighted Approach for Appearance-based Subspace Learning Methods, IEEE Trans. Information Forensics and Security, 5 (1) (2010) 71-81. [12] J. Lu, Y. Tan, Regularized Locality Preserving Projections and Its Extensions for Face Recognition, IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, 40 (2) (2010) 958-963. [13] M. Turk, A. Pentland, Eigenfaces for Recognition, Journal of cognitive neuroscience, 3 (1) (1991) 71-86. [14] X. Wang, X. Tang, A Unified Framework for Subspace Face Recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 26 (9) (2004) 1222-1228. [15] S. Yan, J. Liu, X. Tang, T. Huang, A Parameter-Free Framework for General Supervised Subspace Learning, IEEE Trans. Information Forensics and Security, 2 (1) (2007) 69-76. [16] H.K. Ji, Q.S. Sun, Y.H. Yuan, Z.X. Ji, Fractional-Order Embedding Supervised Canonical Correlations Analysis with Applications to Feature Extraction and Recognition, Neural Processing Letters, 2016, 1-19. [17] M.H. Yang, Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Kernel Methods, in: Proceedings of International Conference on Automatic Face and Gesture Recognition, 2002, pp. 215-220. [18] W. Yu, X. Teng, C. Liu, Face Recognition Using Discriminant Locality Preserving Projections, Image and Vision Computing, 24 (3) (2006) 239-248. [19] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Analysis and Machine Intelligence, 31 (2) (2009) 210-227. [20] K. Huang, S. Aviyente, Sparse representation for signal classification, Advances in Neural Information Processing Systems (NIPS), 2006, pp. 609-616. [21] M. Davenport, M. Duarte, M. Wakin, D. Takhar, K. Kelly, R. Baraniuk, The smashed filter for compressive classification and target recognition, Electronic Imaging, International Society for Optics and Photonics, 2007, pp. 64980H-64980H. [22] M. Davenport, M. Wakin, R. Baraniuk, Detection and estimation with compressive measurements, Department of Electrical and Computer Engineering, Rice University, Technical Report, 2006. [23] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: Which helps face recognition? in: Proceedings of International Conference on Computer Vision, 2011, pp. 471-478. [24] L. Zhang, M. Yang, X. Feng, Y. Ma, D. Zhang, Collaborative Representation based Classification for Face Recognition, arXiv:1204.2358, 2012. [25] J. Yang, D. Zhang, A.F. Frangi, J.Y. Yang, Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 26 (1) (2004) 131-137. [26] X. Tan, S. Chen, Z. Zhou, F. 
Zhang, Face recognition from a single image per person: A survey, Pattern Recognition, 39 (9) (2006) 1725-1745.

[27] A. Jain, B. Chandrasekaran, Dimensionality and Sample Size Considerations in Pattern Recognition Practice, Handbook of Statistics, 2 (1982) 835-855. [28] Q. Cao, Y. Ying, P. Li, Similarity metric learning for face recognition, in: Proceedings of International Conference on Computer Vision, 2013, pp. 2408-2415. [29] J. Lu, Y.P. Tan, G. Wang, X. Zhou, Discriminative Multi-Manifold Analysis for Face Recognition from a Single Training Sample per Person, in: Proceedings of International Conference on Computer Vision, 2011, pp. 1943-1950. [30] J. Lu, Y.P. Tan, G. Wang, Discriminative multimanifold analysis for face recognition from a single training sample per person, IEEE Trans. Pattern Analysis and Machine Intelligence, 35 (1) (2013) 39-51. [31] H.H. Liu, S.C. Hsu, C.L. Huang, Single-sample-per-person-based face recognition using fast discriminative multi-manifold analysis, Asia-Pacific Signal and Information Processing Association (APSIPA), 2014, pp. 1-9. [32] H. Kanan, K. Faez, Y. Gao, Face Recognition Using Adaptively Weighted Patch PZM Array from a Single Exemplar Image per Person, Pattern Recognition, 41 (12) (2008) 37993812. [33] X. Tan, S. Chen, Z. Zhou, F. Zhang, Recognizing Partially Occluded, Expression Variant Faces from Single Training Image per Person with SOM and Soft K-NN Ensemble, IEEE Trans. Neural Networks, 16 (4) (2005) 875-886. [34] H. Mohammadzade, D. Hatzinakos, Projection into Expression Subspaces for Face Recognition from Single Sample per Person, IEEE Trans. Affective Computing, 4 (1) (2013) 69-82. [35] Q. Gao, L. Zhang, D. Zhang, Face Recognition Using FLDA with Single Training Image per Person, Applied Mathematics and Computation, 205 (2) (2008) 726-734. [36] A.M. MartΓ­nez, Recognizing Imprecisely Localized, Partially Occluded, and Expression Variant Faces from a Single Sample per Class, IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (6) (2002) 748-763. [37] D. Zhang, S. Chen, Z. Zhou, A New Face Recognition Method Based on SVD Perturbation for Single Example Image per Person, Applied Mathematics and Computation, 163 (2) (2005) 895-907. [38] M. Yang, L.V. Cool, L. Zhang, Sparse variation dictionary learning for face recognition with a single training sample per person, in: Proceedings of International Conference on Computer Vision, 2013, pp. 689-696. [39] S. Si, D. Tao, B. Geng, Bregman Divergence Based Regularization for Transfer Subspace Learning, IEEE Trans. Knowledge and Data Engineering, 22 (7) (2010) 929-942. [40] Y. Su, S. Shan, X. Chen, W. Gao, Adaptive Generic Learning for Face Recognition from a Single Sample per Person, in: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2010, pp. 2699-2706. [41] X. Wang, X. Tang, Random Sampling for Subspace Face Recognition, International Journal of Computer Vision, 70 (1) (2006) 91-104. [42] W.H. Deng, J.N. Hu, J. Guo, Extended SRC: Undersampled face recognition via intraclass variant dictionary. IEEE Trans. Pattern Analysis and Machine Intelligence, 34 (9) (2012) 1864–1870. [43] O. Chapelle, B. SchΓΆlkopf, A. Zien, Semi-supervised Learning, MIT Press, Cambridge, 2006. [44] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proceedings of International Conference on Computer Vision, 2007, pp. 1-7. [45] J. Lu, X. Zhou, Y.P. Tan, Y. Shang, J. Zhou, Cost-Sensitive Semi-supervised Discriminant Analysis for Face Recognition, IEEE Trans. Information Forensics and Security 7 (3) (2012) 944-953.

[46] C.G. Li, Z.C. Lin, H.G. Zhang, J. Guo, Learning Semi-Supervised Representation Towards a Unified Optimization Framework for Semi-Supervised Learning, in: Proceedings of International Conference on Computer Vision, 2015, pp. 2767-2775. [47] G.Q. Zhang, H.J. Sun, Z.X. Ji, Q.S. Sun, Label Propagation based on Collaborative Representation for Face Recognition, Neurocomputing, 171 (2016) 1193-1204. [48] W. Deng, W. Yin, Y. Zhang, Group sparse optimization by alternating direction method, SPIE Optical Engineering+ Applications, International Society for Optics and Photonics, 2013, pp. 88580R-88580R. [49] T. Chen, W.T. Yin, X.S. Zhou, D. Comaniciu, T.S. Huang, Total variation models for variable lighting face recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 28 (9) (2006) 1519-1524. [50] A.N. Li, S.G. Shan, W. Gao, Coupled bias-variance tradeoff for cross-pose face recognition, IEEE Trans. Image Processing, 21 (1) (2012) 305–315. [51] P. Mller, G.A. Kalberer, M. Proesmans, L.V. Gool, Realistic speech animation based on observed 3d face dynamics, Vision, Image and Signal Processing, IEE Proceedings, IET, 152 (4) (2005) 491–500. [52] A. M. Martinez, R. Benavente, The AR face database, CVC Technical Report #24, June 1998. [53] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-pie, Image and Vision Computing, 28 (5) (2010) 807–813. [54] P. Phillips, H. Moon, S. Rizvi, P. Rauss, The FERET evaluation methodology for facerecognition algorithms, IEEE Trans. Pattern Analysis and Machine Intelligence, 22 (10) (2000) 1090–1104. [55] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression database, IEEE Trans. Pattern Analysis and Machine Intelligence, 25 (12) (2003) 1615-1618. [56] A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Analysis and Machine Intelligence, 23 (6) (2001) 643-660.

Hongkun Ji received the B.E. in Network Engineering from Nanjing University of Science and Technology, Nanjing, China, in 2012. He is currently a Ph.D. Candidate in the school of Computer Science and Engineering, Nanjing University of Science and Technology, China. His research interests include pattern recognition, machine learning, image processing, and computer vision. QuanSen Sun received the Ph.D. degree in Pattern Recognition and Intelligence System from Nanjing University of Science and Technology (NUST), China, in 2006. He is a professor in the Department of Computer Science at NUST. He visited the Department of Computer Science and Engineering, The Chinese University of Hong Kong in 2004 and 2005, respectively. His current interests include pattern recognition, image processing, remote sensing information system, medicine image analysis. Zexuan Ji received B.E. in Computer Science and PhD degrees in Pattern Recognition and Intelligence System from Nanjing University of Science and Technology, Nanjing, China, in 2007 and 2012, respectively. He is currently a postdoctoral research fellow in the School of Computer Science and Engineering at the Nanjing University of Science and Technology. His research interests include medical imaging, image processing and pattern recognition. Yunhao Yuan received the M.Sc. degree in computer science and technology from Yangzhou University (YZU), China, in 2009, and the Ph.D. degree in pattern recognition and intelligence system from Nanjing University of Science and Technology (NUST), China, in

2013. He received two National Scholarships for Graduate Students and for Undergraduate Students from the Ministry of Education, China, an Outstanding Ph.D. Thesis Award and two Top-class Scholarships from the NUST. He is currently an associate professor with the Department of Computer Science and Technology, Jiangnan University (JNU). He is the author or co-author of more than 35 scientific papers. He serves as a reviewer of several international journals such as IEEE TNNLS and IEEE TSMC: Systems. He is a member of ACM, International Society of Information Fusion (ISIF), and China Computer Federation (CCF). He is also the Artificial Intelligence Technical Committee Member and Big Data Technical Committee Member of Jiangsu Computer Society, China. His research interests include pattern recognition, image processing, computer vision, and information fusion. Guoqing Zhang received his B.S. and Master degrees from the School of Information Engineering at the Yangzhou University in 2009 and 2012. He is currently a Ph.D. Candidate in the school of Computer Science and Engineering, Nanjing University of Science and Technology, China. His research interests include pattern recognition, machine learning, image processing, and computer vision.

Highlights
1) CPL explores the probabilistic graph between the gallery set and the generic training set, thereby incorporating the discriminative information of the generic training set.
2) The adaptive variation type for a given sample can be automatically estimated.
3) CPL uses a novel probabilistic label reconstruction based classifier that is adaptively robust to various variations in face images.