A Novel Approach Inspired by Optic Nerve Characteristics for Few-Shot Occluded Face Recognition

Wenbo Zheng a,b, Chao Gou b, Fei-Yue Wang b,c,d,*

a School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China
b State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
c Innovation Center for Parallel Vision, Qingdao Academy of Intelligent Industries, Qingdao 266000, China
d Institute of Systems Engineering, Macau University of Science and Technology, Macau 999078, China
* Corresponding author. Email address: [email protected] (Fei-Yue Wang)

Abstract

Although there is a growing body of work on face recognition, it remains a challenging task for faces under occlusion with limited training samples. In this work, we propose a novel framework to address the problem of few-shot occluded face recognition. In particular, inspired by the characteristics of the human optic nerves, namely that humans recognize an occluded face using contextual information rather than attending only to the facial parts, we propose an effective feature extraction approach to capture the local and contextual information for face recognition. To enhance robustness, we further introduce an adaptive fusion method to incorporate multiple features, including the proposed structural element feature, the connected-granule labeling feature, and the Reinforced Centrosymmetric Local Binary Pattern (RCSLBP). The final recognition is derived from the fusion of all classification results according to our proposed fusion method. Experimental results on three popular face image datasets, AR, Extended Yale B, and LFW, demonstrate that our method performs better than many existing ones for few-shot face recognition in the presence of occlusion.

Keywords: Sparse representation, adaptive fusion feature, few-shot learning, face recognition, face occlusion, dictionary learning

1. Introduction

With the rapid expansion of cyber-physical-social systems (CPSS) [1, 2] in the information era, effective person identification is becoming more and more urgent to ensure public security, information security, access control, and other related aspects. For person identification, biometrics such as face recognition has become an interesting and active topic in the field of pattern recognition [3, 4]. The face has many advantages over other biometric features [5, 6]. First of all, the requirements for capturing face images can be satisfied more easily than for other modalities [7, 8]. Many video surveillance [9, 10, 11, 12] and identification applications [13] can hardly do without face recognition, such as suspect detection [14], blacklisting at airports and other public places, and attention assistance. Secondly, because the face structure is richer and the face region is usually larger [15] than other biometric features, face images can be acquired in a non-intrusive way, which makes face recognition ideal and successful for human-computer interaction [16] and other humanized information technologies and applications, such as payer identity verification [17] and interactive media [18, 19, 20]. Therefore, many researchers [21] have devoted themselves to face recognition for biometric applications. If face images are captured under well-constrained circumstances, most existing algorithms can achieve satisfactory results.
However, there are occasions when only a few face samples, or even a single one, of the subject are available, for instance when searching for a subject in massive videos: a police force may have only a single face photo of a suspect. That is, there are only a few face images, or one face image, of each person available in the database, and the faces in these images are under occlusion. The face recognition problem in this situation is defined as few-shot face recognition under occlusion [22, 23]. Due to the limited number of training samples (gallery samples) in many scenarios, it is essential to study the few-shot problem for real applications [24, 23], such as gate ID identification, passport identification, and law enforcement. Many methods, such as local-feature-based methods [25], sparse-representation-based classification methods [26], subspace-mapping- and kernel-based methods [27], and deep-learning-based methods [28], have been proposed for face recognition under occlusion. However, although these methods are able to deal with slight occlusions, they are not effective in the few-shot setting. As human beings, we need not be aware of the occlusion, and we pay little attention to it when attempting to recognize a face. We also use information outside the face region to recognize occluded faces, due to the characteristics of the optic nerves in the human visual system. This is the inspiration behind the current study. Typical deep-learning-based methods [29] do not take the effect of partial occlusion into account. They represent each face as a feature vector and compute an aggregate representation with average or maximum pooling [30].
In the presence of partial occlusion, the face feature is usually corrupted because all images are treated equally, leading to severe performance degradation. Besides, the performance of deep-learning-based approaches depends heavily on the size of the training data [31]. In general, there are three main challenges for few-shot face recognition under occlusion:

• The first challenge is the heterogeneity of the shooting environments between the gallery and the probe set [28].
• The second challenge is the shortage of training samples.
• The third challenge is the loss of some key informative facial features.

Therefore, in this paper, we propose a novel few-shot learning approach inspired by optic nerve characteristics, called S2P2FR, to address the problem of few-shot occluded face recognition. We propose an effective feature extraction approach inspired by the characteristics of the human optic nerves to capture the local and contextual information for face recognition. Furthermore, we incorporate multiple features, including the proposed structural element feature, the connected-granule labeling feature, and the Reinforced Centrosymmetric Local Binary Pattern (RCSLBP). We use dictionary learning and sparse representation to build our model. For dictionary construction, we use a dual feature, the CentroSymmetric Local Binary Pattern (CSLBP) [32] and the proposed Adaptive Fusion Pattern of Local and Contextual Information (AFPLC), to encode the fused component images. As most face images under artificial occlusion still expose a few key points, this can be regarded as a few-shot learning problem under incomplete sample information. We then build the sparse representation with the dual feature and feed it to the novel fusion scheme to recognize the face. In short, the main contributions of this work are threefold. 1) Inspired by the characteristics of optic nerves, we design a novel structural element for feature extraction to capture the local and contextual information of the face. 2) To address the issue of occlusion and enhance robustness, we introduce an adaptive fusion method to incorporate multiple features, including the proposed structural element feature, the connected-granule labeling feature, and the reinforced centrosymmetric local binary pattern. 3) We introduce artificial (intentional) occlusion for data augmentation and propose to apply few-shot sparse representation learning for few-shot occluded face recognition. Experimental results show that this method achieves better recognition performance than other state-of-the-art algorithms.

The rest of this paper is arranged as follows. Section 2 reviews some classic approaches to face recognition and the theories of the centrosymmetric local binary pattern, the reinforced centrosymmetric local binary pattern, and the adaptive vector fusion model. Section 3 presents the optic nerve characteristics-inspired feature and fusion pattern. Section 4 presents a novel algorithm for few-shot occluded face recognition. Section 5 provides experimental analysis. Finally, Section 6 concludes the paper.

2. Related Work and Prerequisite Knowledge

2.1. Classic Approaches

Occluded Face Recognition. Classical face recognition systems rely on global face representations [33, 31] and conventional classifiers [34]. These methods require many training examples to capture the data structure in a lower-dimensional space using general statistical tools, such as PCA [35], LDA [35], and SDA [36]. Due to this requirement on the number of training samples, these methods are not suitable for the few-shot problem [37, 38]. Linear-regression-classifier-based (LRC) methods [39] are investigated by Naseem et al. for face classification. To obtain higher robustness, a sparse-representation-based classification (SRC) method that unifies face alignment and recognition into a single framework is presented by Wagner et al. [26, 40]. There are many SRC variants [41]. Based on SRC, the forward sparse representation (fSR) method represents each test image with training images. In the opposite direction, the backward sparse representation (bSR) method represents each training image with test images. Zhao et al. [41] present a CoSR model that combines fSR and bSR for more robust face recognition. Wu et al. [42] propose a Gradient-Direction-based Hierarchical Adaptive Sparse and Low-Rank (GD-HASLR) model based on SRC, which has the best performance among state-of-the-art methods. Yu et al. [43] present a discriminative multi-scale sparse coding (DSC) model via learned-dictionary- and sparse-representation-based classification. Linear and sparse-learning-based methods cannot solve the few-shot problem, due to the scarcity of features and samples [44]. Instead of training an occlusion-aware model with visibility annotation, Yang et al. [45] present a Robust Cascaded Pose Regression (RCPR) model based on sparse representation classification via a model adaptation scheme that uses the result of a local regression forest voting method. Cascade learning may improve the ability to recognize faces with occlusions, but this kind of method may not solve the problem effectively with few training samples [46]. Therefore, in few-shot learning, these methods can only deal with slight occlusions and are not effective against heavier ones.

Few-Shot Face Recognition. Given abundant training faces for the base classes, few-shot learning algorithms aim to learn to recognize novel classes with a limited amount of labeled faces. Many efforts have been devoted to overcoming this data efficiency issue. In the following, we discuss representative few-shot learning algorithms organized into three main categories: initialization-based, metric-learning-based, and hallucination-based methods. Initialization-based methods [47] tackle the few-shot learning problem by "learning to fine-tune". One approach aims to learn a good model initialization so that the classifiers for novel classes can be learned with a limited number of
labeled faces and a small number of gradient update steps. Another line of work focuses on learning an optimizer. For example, motivated by the close relationship between the parameters and the activations in a neural network associated with the same category, Qiao et al. [47] propose a novel method that can adapt a pre-trained neural network to novel categories by directly predicting the parameters from the activations. However, due to the occlusion of the face samples, this kind of method cannot extract effective feature maps [30, 48]. Metric-learning-based methods [49] address the few-shot recognition problem by "learning to compare". The intuition is that if a model can determine the similarity of two faces, it can classify an unseen input face with the labeled instances. This kind of method represents each face as a feature vector. In the presence of partial occlusion, the face feature is usually corrupted due to the equal treatment of all images, leading to severe performance degeneration [50, 51, 52]. Hallucination-based methods [53] directly deal with face data deficiency by "learning to augment". This class of methods learns a generator from data in the base classes and uses the learned generator to hallucinate new novel-class face data for data augmentation. One type of generator aims at transferring appearance variations exhibited in the base classes; these generators either transfer the variance in base-class face data to novel classes or use GAN models to transfer the style. Another type of generator does not explicitly specify what to transfer but directly integrates the generator into a meta-learning algorithm to improve recognition accuracy. Similarly, because this kind of method treats all faces equally, face images with partial occlusion will distort the face representation [54, 55, 56, 57]. Although many algorithms have achieved good performance in few-shot learning, their performance is limited when dealing with occlusion. In general, there are two difficulties in the issues discussed in this paper: one is the few-shot problem, and the other is face recognition [7] under occlusion. Although there are good algorithms in the respective fields of these two problems, there is almost no work that considers both of them jointly.
2.2. Centrosymmetric Local Binary Pattern and Reinforced Centrosymmetric Local Binary Pattern

Since the lack of training samples invalidates deep learning approaches, we propose to apply hand-crafted features with effective feature extraction to address the second challenge of Section 1, the limited-sample problem. Based on the differences among local pixels, the Local Binary Pattern (LBP) was proposed: each pixel is considered a center pixel and is compared with the surrounding local pixels. LBP is considered an effective feature for texture-based recognition problems [58, 59, 60], but it suffers from dimension explosion [61]. To overcome this shortcoming, CSLBP was presented [32] as a modified form of the local binary pattern. In CSLBP, considering the differences among centrosymmetric pixels, a local pattern is extracted for each pixel of the input image. Like LBP, each pixel can be regarded as the center pixel in CSLBP. The CSLBP descriptor not only has a low dimension but also shows good robustness in flat image regions. However, this descriptor only uses the feature values of the centrosymmetric pixels and does not fully consider the texture feature, while the other neighboring pixels also contribute to the description of the feature component. In order to extract more illumination-irrelevant and robust features, the reinforced centrosymmetric local binary pattern feature [62] is presented, which combines the features of the fusion component images derived from wavelet decomposition with the centrosymmetric local binary pattern. Specifically, we follow the steps in Reference [63].
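As a concrete illustration (not the exact implementation used in this paper), the following minimal sketch computes a CSLBP code map and its histogram; the comparison threshold and the radius-1 neighborhood are assumptions. The RCSLBP used later additionally applies this operator to wavelet-decomposed component images, which is omitted here.

```python
import numpy as np

def cslbp_histogram(gray, threshold=0.01):
    """Compare the four centrosymmetric pixel pairs of each 8-neighborhood,
    build a 4-bit code per pixel (16 possible values), then histogram it."""
    g = gray.astype(np.float32)
    pairs = [
        (g[:-2, 2:],   g[2:, :-2]),    # top-right vs. bottom-left
        (g[:-2, 1:-1], g[2:, 1:-1]),   # top vs. bottom
        (g[:-2, :-2],  g[2:, 2:]),     # top-left vs. bottom-right
        (g[1:-1, 2:],  g[1:-1, :-2]),  # right vs. left
    ]
    code = np.zeros(g[1:-1, 1:-1].shape, dtype=np.uint8)
    for bit, (a, b) in enumerate(pairs):
        code |= ((a - b) > threshold).astype(np.uint8) << bit
    return np.bincount(code.ravel(), minlength=16)
```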
3. Optic Nerve Characteristics-Inspired Feature and Fusion Pattern

Since we believe that the problem of few-shot face recognition with occlusion is a problem of incomplete sample information, we use contextual information to complement the limited local information. Therefore, we propose a descriptor that incorporates the features of local information and contextual information. Based on the principles of image retrieval [64, 65, 66], we redefine the optic nerve characteristics-inspired structural element for face recognition, and introduce the theory of image connected-granule labeling [67, 68] and the adaptive vector fusion model to obtain the adaptive fusion pattern of the local and contextual information.

3.1. Neuronal Visual Characteristics-Inspired Structural Element

There is no doubt that the process of visual perception is related to the characteristics of the optic nerves. Pasupathy et al. [69] find that more than one-third of the neurons in the V4 region have a strong ability to extract specific contour features. This ability of V4 neurons depends mainly on the curvature of the contour and the direction of its protrusions. Based on these characteristics of the optic nerves [70] and the definition of structural elements [65, 66, 71] used for image retrieval, we propose a new structural element to obtain the structural information (regarded as one kind of face-related information) of face images. To be specific, the original image is first divided into many 2 × 2 blocks, as shown in Figure 1. We determine the diagonal direction according to the differences of n1 and n4, and of n2 and n3, in Figure 1. If the value of n1 is greater than n4, we consider the image to have a diagonal direction toward n1; otherwise, toward n4. Nine structural elements are defined according to the change of the gradient along the diagonal direction. As shown in Figure 2(a), when the values of the diagonal pixels are equal, they are represented by a line segment without an arrow; when the values of the diagonal pixels are different, they are represented by a line segment with an arrow, and the direction of the arrow indicates the gradient direction of the values. The new structural element can describe any change in the values of the four adjacent pixels.

Figure 1: The change of gradient along the diagonal direction of the image.

The overall process is as follows:

Step 1: Starting from the origin (0, 0), move the 2 × 2 structural element from left to right and top to bottom with a step length of 2.

Step 2: If the structural element matches the values of the image (matching means that the image values at the corresponding positions of the structural element are equal), the value is preserved; otherwise the value is discarded. This yields a structural element sub-map, denoted se_i(x, y) (1 ≤ i ≤ 9), since there are nine structural elements.

Step 3: The final structural element map, denoted SE(x, y), is obtained by fusing the nine structural element sub-maps according to the following rule:

SE(x, y) = \bigcup_{i=1}^{9} se_i(x, y)
Step 4: Count the number of matches (se_1, se_2, ..., se_9) in each structural element sub-map.

Step 5: Compute SE based on (se_1, se_2, ..., se_9). Suppose we use the number 1 to code the nine structural elements shown in Figure 2(b); then we can obtain the structural element sub-map using this coding. We define the structural element vector as SE := (se_1, se_2, ..., se_9), where the structural elements from left to right in Figure 2(b) correspond to (se_1, se_2, ..., se_9), respectively. An example structural element sub-map is shown in Figure 3(a). We can use the structural element matching method [65] to obtain the final results shown in Figure 3(b); referring to Figure 2(b), the SE of this image is (0, 0, 0, 0, 1, 1, 0, 2, 0).
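The following minimal sketch illustrates Steps 1-5 in code. The mapping from the diagonal-gradient signs of a 2 × 2 block to the nine element indices is an assumption made for illustration; the paper defines the actual nine elements in Figure 2.

```python
import numpy as np

def structural_element_vector(gray):
    """Scan 2x2 blocks with stride 2 (Step 1), classify each block into one of
    nine diagonal-gradient patterns (Steps 2-3), and count them (Steps 4-5)."""
    h, w = gray.shape
    se = np.zeros(9, dtype=int)
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            n1, n2 = int(gray[y, x]), int(gray[y, x + 1])
            n3, n4 = int(gray[y + 1, x]), int(gray[y + 1, x + 1])
            d1 = np.sign(n1 - n4)          # main-diagonal gradient direction
            d2 = np.sign(n2 - n3)          # anti-diagonal gradient direction
            idx = (d1 + 1) * 3 + (d2 + 1)  # assumed coding of the nine cases
            se[int(idx)] += 1
    return se
```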
Figure 2: Novel structural elements and their code. (a) Novel structural elements; (b) the coding of the novel structural elements.
3.2. Concept and Theory of Image Connected-Granule Labeling

In the flow of visual information processing, the shape information of an object is processed by the ventral pathway, which passes from the V1 region of the visual cortex through the V2 and V4 regions to several sub-regions of the inferior temporal cortex (IT) [70]. IT neurons are usually sensitive to the shape information of complex objects and to semantic categories. At the same time, the areas these neurons perceive tend to be aggregated or separated according to the closeness of their semantic association. According to these characteristics of visual physiology, an image can be divided into different areas, which represent sensitive areas of different visual perceptions. Usually an image contains one or multiple objects/targets, represented as connected areas of similar color or texture. To obtain the color and connectedness information (another kind of face-related information) of face images, we introduce the theory of granular computing [72]. Connected areas with different shape characteristics are defined as connected-granule labelings, which are described by different attributes and can effectively represent image features.
Figure 3: Feature extraction based on structural elements. (a) The process of extracting the structural element map; (b) extraction based on structural elements.
According to the theory of image granule labeling [67, 73], we can define the image connected-granule labeling model as G = (U, A, P, V, F, V_p, F_p), where U = {X_1, X_2, ..., X_{|U|}} is the domain and |U| is the dimension of the domain U; A = {a_1, a_2, ..., a_{|A|}} is the finite property set; V = \bigcup V_a, a ∈ A, is the partially ordered set, where V_a is the value-range set of the property a; F : U × A → V is the function over U and A; P is the finite topology-property set; and F_p : U × P → V_p is the function over P. For an arbitrary topology-property T ⊆ P, we define the image connected-granule labeling as

R(T) = \left\{ (x, y) \,\middle|\, D(F_p(x), F_p(y)) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2} \le r \right\}, \quad \forall p \in T;\ x, y \in U \qquad (1)
where D(F_p(x), F_p(y)) is the Euclidean distance located at (x, y), d is the dimension of the data, and r is the threshold.

In order to describe the features of the face and the contextual information, we define the number of all connected components in the image as the number of image connected-granule labeling, the average area of the connected components as the divergence of image connected-granule labeling, and the number of pixels of the largest connected component as the face-related connected-granule labeling. The number of pixels of the second-largest connected component is called the face connected-granule labeling. We use these four statistics, which we refer to as the attributes of connected-granule labeling, to characterize the connected-granule labeling.

Suppose the size of the color image is M × N. We transform the RGB space of the image into the HSV space mentioned in the Supplemental Materials, and we denote the image in HSV space by C. We define the color set L = {0, 1, 2, ..., 255} in HSV space, and C_L is the L-th color map sub-image: when C(x, y) = L, we define C_L(x, y) = 1; otherwise, C_L(x, y) = 0. We define the connected-granule labeling of the L-th color map sub-image as C_{L_i}(x, y) = i. Similarly, in the L-th color map sub-image, we can define:

• The number N_L of connected-granule labeling is the number of connected components.
• The divergence M_L of connected-granule labeling is the average area of the connected components, that is,

M_L = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M} C_L(i, j)}{N_L}

where C_L(i, j) is the value of the connected-granule labeling of the L-th color map sub-image located at (i, j).
• The face-related connected-granule labeling S_L is the number of pixels of the largest connected component.
• The face connected-granule labeling O_L is the number of pixels of the second-largest connected component.

As shown in Figure 4, we transform the RGB space of an example image into the HSV space to get the HSV transform map image C(x, y). Suppose L = 5: when C(x, y) = 5, we define C_L(x, y) = 1; otherwise, C_L(x, y) = 0. Therefore, we get the connected-granule labeling of the 5-th color map sub-image, from which we obtain:

• The number of connected-granule labeling N_L is 2.
• The divergence of connected-granule labeling M_L is M_L = 12/2 = 6.
• The face-related connected-granule labeling S_L is 9.
• The face connected-granule labeling O_L is 8.

Finally, we get the connected-granule labeling feature of the 5-th color map sub-image as {2, 6, 9, 8}.

Figure 4: A connected-granule labeling example, where the images are the HSV transform map image, the 5-th color map sub-image, and the connected-granule image of the 5-th color map, from left to right.
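A minimal sketch of these four attributes for one color map sub-image is given below. It assumes the quantized HSV image is available as an integer array and uses scipy's connected-component labeling, which is an implementation choice rather than the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage

def connected_granule_features(hsv_indexed, L):
    """Return (N_L, M_L, S_L, O_L) for the L-th color map sub-image of the
    quantized HSV image `hsv_indexed` (values assumed to lie in {0, ..., 255})."""
    c_l = (hsv_indexed == L).astype(np.uint8)      # L-th color map sub-image
    labels, n_l = ndimage.label(c_l)               # N_L: number of components
    if n_l == 0:
        return 0, 0.0, 0, 0
    sizes = np.sort(ndimage.sum(c_l, labels, range(1, n_l + 1)))[::-1]
    m_l = float(c_l.sum()) / n_l                   # M_L: average component area
    s_l = int(sizes[0])                            # S_L: largest component
    o_l = int(sizes[1]) if n_l > 1 else 0          # O_L: second-largest component
    return n_l, m_l, s_l, o_l
```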
3.3. Adaptive Fusion Pattern

To improve the face feature and better describe the whole face image, we present a novel descriptor called the adaptive fusion pattern of the local and contextual information (AFPLC), shown in Figure 5. We first perform color space transformation and image segmentation: the original image is transformed into an HSV image using the space transform and color quantization [74], and the HSV sub-image is obtained by image segmentation. Then we use the theory of structural elements discussed in Section 3.1 and of connected-granule labeling discussed in Section 3.2 to obtain the structural element feature and the connected-granule labeling feature from the HSV sub-image. We use the DWT to get the HH, HL, LL, and LH sub-images, and further use CSLBP and the encoding method to obtain the RCSLBP feature. Finally, we use the adaptive vector fusion model to fuse the structural element feature, the connected-granule labeling feature, and RCSLBP into the final adaptive fusion feature. The complete process is as follows:

Input: The occluded face image.
Output: The fused feature.
1: Color space transform. Let the image size be M × N; convert the image color space from RGB to HSV. The image in HSV space is denoted as C(x, y), 1 ≤ x ≤ M, 1 ≤ y ≤ N.
2: Extracting the structural element feature. The image is divided into many 2 × 2 blocks, and we obtain the ⌊M/2⌋ × ⌊N/2⌋ structural element sub-map T(i, j), T ∈ {se_1, se_2, ..., se_9}, according to the nine structural elements described in Section 3.1. In accordance with the theory of structural elements, the structural element feature Y is

Y = \sum_{i=1}^{\lfloor M/2 \rfloor} \sum_{j=1}^{\lfloor N/2 \rfloor} T_{label}(i, j), \qquad T_{label}(i, j) = \begin{cases} 1 & T(i, j) = s \\ 0 & \text{else} \end{cases}, \quad s \in \{se_1, se_2, \cdots, se_9\} \qquad (2)

3: Feature extraction based on connected-granule labeling. We obtain the connected-granule labeling feature K = \bigcup_{L=0}^{255} \{N_L, M_L, S_L, O_L\}, a 1024-dimensional feature vector, in line with the theory of connected-granule labeling discussed in Section 3.2.

4: Feature fusion. On the basis of the adaptive vector fusion model, we obtain the final fusion feature H as

H = [\omega_1 \times Norm(Y),\ \omega_2 \times Norm(K),\ \omega_3 \times Norm(RCSLBP)] \qquad (3)

where ω_1 is the weight of the structural element feature, ω_2 is the weight of the connected-granule labeling feature, and ω_3 is the weight of RCSLBP. The weights are computed according to the following rules:

\omega_1 = \ell \times \sqrt{\|Y\|_1}, \quad \omega_2 = \ell \times \sqrt{\|K\|_1}, \quad \omega_3 = \ell \times \sqrt{\|RCSLBP\|_1}, \quad \ell = \ln(mean(\|Y\|_1, \|K\|_1, \|RCSLBP\|_1))

where ||·||_1 denotes the sum of the absolute values of all the elements of a matrix, and mean(·) is the average function.

5: return The fused feature H.

Figure 5: Feature extraction based on the adaptive fusion pattern of the local and contextual information.
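A minimal sketch of the fusion step (Eq. (3)) is given below; Norm(·) is assumed to be L2 normalization, which is an assumption rather than a detail stated in the paper.

```python
import numpy as np

def adaptive_fusion(Y, K, rcslbp):
    """Eq. (3): weight and concatenate the structural element feature Y, the
    connected-granule feature K, and the RCSLBP feature."""
    def norm(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    l1 = [np.abs(Y).sum(), np.abs(K).sum(), np.abs(rcslbp).sum()]   # ||.||_1 values
    ell = np.log(np.mean(l1))                                       # ln(mean(...))
    w = [ell * np.sqrt(v) for v in l1]                              # omega_1..3
    return np.concatenate([w[0] * norm(Y), w[1] * norm(K), w[2] * norm(rcslbp)])
```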
4. Classification Based on Sparse Representation with Adaptive Fusion Features

In this paper, for the problem of few-shot face recognition with occlusion, we mainly address the effects of illumination changes and partial occlusion on face recognition [75]. Our approach includes dictionary learning based on illumination-robust features and sparse representation with a dual feature. In order to suppress the effects of illumination changes on redundant data, we discuss a robust feature descriptor that combines CSLBP with the adaptive fusion pattern of the face and contextual information. Then the dual-feature-based sparse representation and its fusion algorithm are proposed. The system diagram of the proposed sparse representation method is shown in Figure 6 and Algorithm 1. To better present our scheme, we first introduce the theory of dictionary learning and then present the sparse representation method with AFPLC and CSLBP.
4.1. The Theory of Dictionary Learning

The sparse coding classification algorithm consists of two main parts: dictionary learning and classification [76, 77, 78, 79]. Applying the training data directly to build a dictionary results in high computational complexity. In order to build an effective sparse dictionary with lower computational complexity, dictionary learning (such as K-SVD) is a promising solution [80, 81, 82]. The main idea of this method is to build a sparse dictionary by minimizing the reconstruction error, e.g., by iteratively minimizing the energy function with the K-SVD algorithm [83, 84].
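As a hedged illustration of this step (not the K-SVD implementation used in the paper), the sketch below learns a reconstruction-minimizing dictionary and sparse codes; MiniBatchDictionaryLearning is used only as a stand-in for K-SVD, and the number of atoms and the sparsity level are assumed values.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_dictionary(features, n_atoms=128, sparsity=5):
    """Learn a dictionary D minimizing the reconstruction error of the training
    features under a sparsity constraint, and return the sparse codes."""
    dl = MiniBatchDictionaryLearning(
        n_components=n_atoms,
        transform_algorithm="omp",
        transform_n_nonzero_coefs=sparsity,
        random_state=0,
    )
    dl.fit(features)              # features: (n_samples, n_dims)
    D = dl.components_            # dictionary atoms, shape (n_atoms, n_dims)
    X = dl.transform(features)    # sparse codes, shape (n_samples, n_atoms)
    return D, X
```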
Figure 6: System diagram of the proposed sparse representation
Further, in order to improve the performance and discriminative ability of the dictionary, LC-KSVD is proposed by Jiang et al. [80]. In addition to using the class labels of the training data, LC-KSVD associates label information with each dictionary item (the columns of the dictionary matrix) to enforce discriminability of the sparse codes during the dictionary learning process [85]. It can be expressed as the following formula:

\langle D, A, X \rangle = \arg\min_{D, A, X} \|Y - DX\|_2^2 + \alpha \|Q - AX\|_2^2, \quad s.t.\ \forall i,\ \|x_i\|_0 \le T \qquad (4)

where X denotes the sparse coding coefficients, D is the dictionary, α controls the relative contribution between reconstruction and label-consistency regularization, ||Y − DX||_2^2 represents the reconstruction error, Q = [q_1, q_2, ..., q_N] ∈ R^{K×N} are the "discriminative" sparse codes of the input signals Y for classification, α||Q − AX||_2^2 denotes the label-consistency constraint, and T is a sparsity constraint factor (each signal has fewer than T nonzero items in its decomposition). Dictionary learning in LC-KSVD is an iterative updating process. First, the atoms of the initial dictionary D are taken from randomly selected training data, and over several iterations K-SVD generates the dictionary entries within each class. Then the dictionary D is formed from the learned items and the corresponding label of each dictionary item. This helps to avoid the local-minima problem when the parameters are learned simultaneously in LC-KSVD. Given a test image y_i and D, we can solve for x_i, its sparse coding coefficient, through the following objective function:

x_i = \arg\min_{x_i} \|y_i - D x_i\|_2^2, \quad s.t.\ \|x_i\|_0 \le T \qquad (5)

The LC-KSVD model includes the classification error as a term in the objective function for dictionary learning. So, following the recommendation of Song et al. [82], we use a linear predictive classifier W to make the dictionary optimal for classification, and we obtain the following solution by using the ridge regression model:

W = (XX^T + \lambda I)^{-1} X H^T \qquad (6)

where X denotes the sparse coding coefficients, H = [h_1, h_2, ..., h_N] ∈ R^{K×N} are the class labels of the input signals Y, h_i = [0, 0, ..., 1, ..., 0, 0]^T ∈ R^K is a label vector corresponding to an input signal y_i, where the nonzero position indicates the class of y_i, and λ is a parameter used to control the relative contribution of the corresponding terms. At the testing stage, we obtain the label of the test image through the product of the sparse coding coefficient x_i and the linear classifier W, and we regard the class of the test image as the argmax of the label vector:

\{label\} = \arg\max_{label} W \times x_i \qquad (7)
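A minimal sketch of Eqs. (5)-(7) is given below, under the assumptions that the dictionary D is stored with one atom per row (as in the earlier sketch) and that W is stored transposed so that the class scores are simply W x; the regularization value is also an assumption.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def fit_classifier(X_train, H, lam=1e-2):
    """Eq. (6): W = (X X^T + lam I)^(-1) X H^T, returned transposed so that
    scores = W @ x. X_train: (n_atoms, n_train), H: (n_classes, n_train)."""
    XXt = X_train @ X_train.T
    return np.linalg.solve(XXt + lam * np.eye(XXt.shape[0]), X_train @ H.T).T

def classify(y_test, D, W, sparsity=5):
    """Eq. (5): sparse-code the test feature over D; Eq. (7): argmax of W x."""
    x = orthogonal_mp(D.T, y_test, n_nonzero_coefs=sparsity)  # D.T: (n_dims, n_atoms)
    scores = W @ x
    return int(np.argmax(scores)), scores
```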
4.2. Sparse Representation with Dual-Feature

Many studies [86, 87] have shown that when performing object recognition, humans usually combine multiple features of an object. Multiple features provide richer discriminant information, improving the reliability and accuracy of recognition. A large number of experiments [88, 89] have also demonstrated that when a single feature is applied, the obtained class label often does not correspond to the minimum inter-class distance but to the second-minimum inter-class distance, which often results in erroneous recognition. Misclassification has several causes, e.g., the classification rules
themselves and the discriminative ability of the features. While some key informative facial features that may be beneficial for classification are missing, different feature extraction methods enhance and preserve certain information. The proposed double-fusion-features model is

\langle D^{1,2}, A^{1,2}, X_1, X_2 \rangle = \arg\min_{D, A, X_1, X_2} \|Y^{1,2} - D^{1,2}(X_1, X_2)\|_2^2 + \alpha \|Q - A^{1,2}(X_1, X_2)\|_2^2, \quad s.t.\ \forall i,\ \|x_i^{1,2}\|_0 \le T \qquad (8)

where D^{1,2} = [D_1, D_2] and A^{1,2} = [A_1, A_2]; X_1 and X_2, generated from the dual feature, are the sparse coding coefficients. We use the AFPLC feature as X_1 and the CSLBP feature as X_2 in this paper. Under severe illumination changes and occlusion, we expect the AFPLC feature to discriminate well on face images. Therefore, we select the AFPLC feature as the main feature to compensate for lost information, and we select CSLBP as an auxiliary feature. We learn a dictionary for each feature, and the corresponding sparse representation model is constructed based on the theory of dictionary learning. It is worth mentioning that, compared with multi-view learning [90, 91], which depicts the relationship between the models and the data and regards each single view as an individual task, our algorithm regards few-shot occluded face recognition as one task and pays more attention to one of the two features, namely AFPLC.

We can get the class label of the test image from Eq. (7) for the two sparse representation models. When generating the class label of the test image, instead of determining the result directly, we preserve the top three weight values of the class label vector and the three corresponding possible classification results in our proposed method, shown in Algorithm 1. The labels of the three possible classes, label_k, of the test image can be determined as follows:

label_k = getlabel(W \times x_i) \qquad (9)

where getlabel(·) represents the function that predicts three class labels, and label_k (k = 1, 2, 3) is a class prediction corresponding to the class label vector, i.e., to the top three weight values. To better represent the features, we fuse the labels of the two features as in [92, 93, 94]. Specifically, the categories corresponding to l_a (l_{a1} > l_{a2} > l_{a3} > ... > l_{an}), derived from the sparse coefficients and the linear classifier W of the main model, are label_a = (label_{a1}, label_{a2}, ..., label_{an}), and the categories corresponding to l_b (l_{b1} > l_{b2} > ... > l_{bn}), derived from the coefficient matrix of the linear classifier W of the auxiliary model, are label_b = (label_{b1}, label_{b2}, ..., label_{bn}), where n is the number of elements in label_a and label_b. For n + 1 and n + 2, we define label_{an} = label_{a(n+1)} = label_{a(n+2)} [95, 96, 97, 98, 99]. We assume l_a plays the decisive role, so the recognition result is, by default, the category corresponding to label_a. The final classification result comes from the following decision function:

policy = f\!\left(\frac{l_a}{l_b}\right) \qquad (10)

where f is a decision function,

f\!\left(\frac{l_a}{l_b}\right) = \begin{cases} label_{ai} & l_{ai} \ge \dfrac{l_{a(i+1)} + l_{a(i+2)} + l_{bi}}{3},\ i = 1, 2, \cdots, n \\ label_{bi} & \text{else} \end{cases}

and the fusion classification result is policy.

Algorithm 1 A Novel Fused Sparse Representation with Dual-Feature
Input: n, the number of images; the main class label vector l_a (l_{a1} > l_{a2} > ... > l_{an}); the auxiliary class label vector l_b (l_{b1} > l_{b2} > ... > l_{bn}); label_a = {label_{a1}, label_{a2}, ..., label_{an}} corresponding to l_a; and label_b = {label_{b1}, label_{b2}, ..., label_{bn}} corresponding to l_b. In particular, we define label_{an} = label_{a(n+1)} = label_{a(n+2)}.
Output: Final label, our final result.
1: Use the CSLBP and AFPLC features to build two LC-KSVD models, respectively, and obtain the coefficient matrix of the linear classifier W and the sparse coefficient x for each model;
2: Use the sparse coefficients and the coefficient matrices W of each model to calculate the class label vectors l_a and l_b;
3: for i = 1 : n do
4:   if l_{ai} ≥ (l_{a(i+1)} + l_{a(i+2)} + l_{bi}) / 3 then
5:     Final label = label_{ai}
6:   else
7:     Final label = label_{bi}
8:   end if
9: end for
10: return Final label
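A minimal sketch of the decision rule of Eq. (10) and Algorithm 1 is given below, assuming the score vectors are already sorted in descending order with their class labels aligned; the values used to pad positions n+1 and n+2 of the score vector are an assumption (the paper only pads the labels).

```python
def fused_label(l_a, labels_a, l_b, labels_b):
    """Fuse the main (AFPLC) and auxiliary (CSLBP) predictions."""
    n = len(l_a)
    l_a = list(l_a) + [l_a[-1]] * 2              # make l_a[i+1], l_a[i+2] available
    labels_a = list(labels_a) + [labels_a[-1]] * 2
    final = None
    for i in range(n):
        if l_a[i] >= (l_a[i + 1] + l_a[i + 2] + l_b[i]) / 3.0:
            final = labels_a[i]                  # trust the main model
        else:
            final = labels_b[i]                  # fall back to the auxiliary model
    return final
```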
5. Experimental Results
This section validates the performance of our few-shot face recognition algorithm through experiments on the AR [100], Extended Yale B [101], and LFW [102] datasets. Three comparison experiments are conducted to verify the robustness and efficiency of the proposed method. First, we use a random occlusion method near the key points of the face to compose our training and test sets. Second, we use the real occlusion images from AR and LFW. Third, we use a small-sample dataset to measure the sensitivity and robustness of the algorithms; it is constructed from the logo image dataset and the LFW dataset, with at most two images per person in the test set and only one image per person in the training set. Exemplar face samples used in our experiments are shown in Table 1. Besides, we design a control experiment using the data of the third comparison experiment to verify the reasonableness of our adaptive fusion and the effectiveness of our structural element feature and connected-granule feature. All experiments are conducted on a 4-core PC with an Intel Core i7-6700HQ, 8 GB RAM, and Windows 10. In each experiment, we implement the SRC algorithm [26], DALM algorithm [41], GD-HASLR algorithm [42], DSC algorithm [43], RNR algorithm [103], FR-LSTM algorithm [104], F-LR-IRNNLS algorithm [105], RCPR algorithm [45], KED algorithm [75], DICW algorithm [106], and Fast NMR algorithm [27] on the same data and compare the experimental results to test the effectiveness of the algorithms. Note that, to present our experiments better, the detailed runtime data of the three comparison experiments, as well as more sample images of the training and test datasets, are provided in the Supplemental Materials.
5.1. Recognition under Random Occlusions

In our first comparison experiment, we use the AR dataset, the Extended Yale B dataset, and the LFW dataset for training and testing. The face images of the three datasets are resized to 128 × 128. On the AR dataset, we choose 80 face images of 80 individuals to construct the training set (gallery data), and the remaining samples of these 80 subjects are used as the test set (probe data). On the Extended Yale B dataset, we randomly choose 30 subjects for training and testing; for each subject, the "P00A + 000E + 00" image is used as the training sample, and the remaining 63 images are used for testing. On the LFW dataset, we randomly choose a subset of 100 individuals with 14 samples per person for training and testing the model: we choose only 14 face images of 14 persons to construct the training set (gallery data), and the remaining over 2000 samples are test images (probe data). Each training or test image is corrupted by a randomly located square block with content taken from the top-logo10 dataset [107], with diverse block sizes [14]. The occlusion degree of an image is determined by the block size. We conduct the face recognition test and report the recognition accuracies of many state-of-the-art methods, including SRC [26], DALM [41], GD-HASLR [42], DSC [43], RNR [103], FR-LSTM [104], F-LR-IRNNLS [105], RCPR [45], KED [75], DICW [106], Fast NMR [27], and our S2P2FR, under different occlusion degrees in Figure 7 and Tables 2, 3, and 4.

According to Figure 7(a) and Table 2, we arrive at two findings. First, when the occlusion degree is equal to or larger than 50%, our proposed method achieves better results on the AR dataset than the other competing methods, including SRC, DALM, GD-HASLR, DSC, RNR, FR-LSTM, F-LR-IRNNLS, RCPR, KED, DICW, and Fast NMR. For the few-shot problem, when the degree of occlusion is larger, the other competing methods produce greater errors than S2P2FR, because the training set contains only one occluded face image without much critical local information. This means our method makes full use of contextual information and is robust when the degree of occlusion is large. Second, GD-HASLR, RNR, FR-LSTM, F-LR-IRNNLS, RCPR, DICW, and Fast NMR give results similar to S2P2FR when the occlusion degree is no more than 20%. The recognition accuracies of KED and F-LR-IRNNLS decline fast with the increase of occlusion degree; thus, these two methods are sensitive to this level of structural noise. For the few-shot problem, when the degree of occlusion is not high, our method has recognition accuracy similar to the other competing methods, which means our method can use local information as well as they do. Based on these two findings, it is clear that our method is more robust than the other methods for few-shot face recognition with different continuous occlusions.

According to Figure 7(b) and Table 3, our proposed method achieves the best results among all methods on the Extended Yale B dataset except when the degree of occlusion is 30%. As we can see, with the increase of the degree of occlusion, the accuracies of the other competing methods show a cliff-like decline, while our method can still maintain high accuracy. This results from using contextual information for the few-shot problem. The experiments validate that our method has strong robustness in few-shot face recognition under different occlusion conditions.

According to Figure 7(c) and Table 4, we can draw the following two points. First, when the occlusion degree is over 50%, the performance gap between our method and Fast NMR in Figure 7(c) is as remarkable as that shown in Figure 7(a). When the occlusion degree is 60%, the recognition accuracy of our method is 4.048%, 8.439%, and 9.133% higher than RNR, DALM, and Fast NMR, respectively. For the few-shot problem, when the degree of occlusion is large, the other competing methods produce greater errors than S2P2FR, because the training set contains only one occluded face image without much critical local information. This means our method makes full use of contextual information and is robust for few-shot face recognition when the degree of occlusion is large. Second, S2P2FR achieves the best results among all methods when the occlusion degree is not high. This means our method can use local information as well as the other competing methods, achieves robust performance over a wide range of occlusion degrees, and outperforms state-of-the-art methods for face recognition. From the above experiments, it is clear that our method is more robust than other face recognition methods for few-shot face recognition under different random occlusions.

5.2. Recognition Under Real-world Occlusions

In our second comparison experiment, we select frontal face images with real occlusion. We define two kinds of real occlusion sources, "sunglasses/glasses" and "scarf/topee". The face images from the AR dataset and the LFW dataset are resized to 128 × 128. On the AR dataset, we choose 480 face images with real occlusion of 80 individuals. On the LFW dataset, we choose a subset of 100 individuals, each with 14 samples with real-world occlusion, for training and testing. For "sunglasses/glasses", on the AR dataset, we use 240 face images with sunglasses to construct the training set and test set: we randomly select one image per individual for the training set, and the remaining images form the test set. For the LFW dataset, we choose 80 subjects with glasses for training and testing: we choose 14 face images with glasses of 14 persons to construct the training set (gallery data), and the remaining samples form the test set (probe data).
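For completeness, a minimal sketch of the block-based occlusion generation used to corrupt images in the first experiment (Section 5.1) is shown below. Placement near facial key points would require landmark detection and is omitted, and a square, single-channel face image is assumed, so this is a simplification rather than the exact pipeline.

```python
import numpy as np

def occlude(face, patch, occlusion_rate, rng=np.random.default_rng(0)):
    """Paste a square block cropped from an unrelated (e.g., logo) image at a
    random location so that it covers `occlusion_rate` of the face area."""
    h, w = face.shape[:2]
    side = int(np.sqrt(occlusion_rate * h * w))   # block size from the occlusion rate
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    out = face.copy()
    out[y:y + side, x:x + side] = patch[:side, :side]  # patch assumed large enough
    return out
```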
Table 1: Exemplar face samples in our experiments (sample images with random occlusion at rates of 10%-60%, with real-world occlusion by sunglasses/glasses and scarf/topee, and with artificial occlusion).
Figure 7: Recognition accuracies (percent) under different degrees of random occlusion. (a) Test images from the AR dataset occluded randomly; (b) test images from the Extended Yale B dataset occluded randomly; (c) test images from the LFW dataset occluded randomly.
Table 2: Recognition accuracies (percent) of test images occluded randomly from the AR dataset

Occlusion Rate | SRC[26] | DALM[41] | GD-HASLR[42] | DSC[43] | RNR[103] | Fast NMR[27] | FR-LSTM[104] | F-LR-IRNNLS[105] | RCPR[45] | KED[75] | DICW[106] | S2P2FR
10% | 99.568% | 99.978% | 100.000% | 98.970% | 100.000% | 100.000% | 100.000% | 100.000% | 99.854% | 95.785% | 99.956% | 100.000%
20% | 93.257% | 95.684% | 100.000% | 95.470% | 100.000% | 100.000% | 99.578% | 99.473% | 98.548% | 94.685% | 96.485% | 100.000%
30% | 90.246% | 93.657% | 100.000% | 94.785% | 96.478% | 100.000% | 94.786% | 95.714% | 94.738% | 90.486% | 93.746% | 100.000%
40% | 86.144% | 89.348% | 92.301% | 85.055% | 90.653% | 91.426% | 89.409% | 89.593% | 92.892% | 83.048% | 86.340% | 99.568%
50% | 79.646% | 87.577% | 89.108% | 82.721% | 87.708% | 87.625% | 86.370% | 79.844% | 83.155% | 74.538% | 82.590% | 95.687%
60% | 76.352% | 83.493% | 82.817% | 77.350% | 87.262% | 82.849% | 76.438% | 72.309% | 81.264% | 67.519% | 76.944% | 92.158%
Table 3: Recognition accuracies (percent) of test images occluded randomly from the Extended Yale B dataset

Occlusion Rate | SRC[26] | DALM[41] | GD-HASLR[42] | DSC[43] | RNR[103] | Fast NMR[27] | FR-LSTM[104] | F-LR-IRNNLS[105] | RCPR[45] | KED[75] | DICW[106] | S2P2FR
10% | 99.512% | 99.951% | 99.963% | 93.819% | 99.939% | 99.974% | 99.978% | 99.978% | 99.853% | 95.753% | 99.895% | 100.000%
20% | 93.205% | 95.622% | 99.966% | 91.451% | 99.920% | 99.991% | 99.537% | 99.441% | 98.508% | 94.603% | 96.404% | 100.000%
30% | 90.237% | 93.570% | 99.990% | 90.773% | 96.414% | 99.933% | 94.719% | 95.645% | 94.656% | 90.452% | 93.674% | 99.989%
40% | 86.104% | 89.282% | 92.219% | 85.032% | 90.601% | 91.353% | 89.311% | 89.561% | 92.802% | 83.047% | 86.336% | 99.506%
50% | 79.582% | 87.551% | 89.090% | 82.680% | 87.689% | 87.581% | 86.342% | 79.815% | 83.149% | 74.528% | 82.516% | 95.637%
60% | 76.327% | 83.407% | 82.775% | 77.261% | 87.224% | 82.831% | 76.426% | 72.236% | 81.193% | 67.509% | 76.928% | 92.125%
Table 4: Recognition accuracies (percent) of test images occluded randomly from the LFW dataset

Occlusion Rate | SRC[26] | DALM[41] | GD-HASLR[42] | DSC[43] | RNR[103] | Fast NMR[27] | FR-LSTM[104] | F-LR-IRNNLS[105] | RCPR[45] | KED[75] | DICW[106] | S2P2FR
10% | 99.408% | 99.794% | 99.739% | 93.372% | 99.209% | 99.495% | 99.713% | 99.215% | 99.242% | 95.515% | 99.250% | 100.000%
20% | 92.736% | 95.308% | 99.481% | 90.733% | 96.517% | 99.222% | 99.497% | 99.332% | 98.237% | 94.542% | 95.422% | 99.773%
30% | 89.673% | 93.078% | 99.064% | 90.129% | 95.932% | 97.962% | 93.920% | 95.574% | 93.770% | 90.310% | 93.334% | 99.065%
40% | 85.415% | 88.325% | 91.321% | 84.601% | 89.665% | 90.815% | 88.823% | 89.242% | 92.445% | 82.622% | 85.545% | 99.270%
50% | 79.191% | 86.554% | 88.721% | 81.914% | 87.084% | 87.343% | 85.442% | 79.099% | 82.155% | 73.984% | 82.350% | 95.127%
60% | 76.026% | 82.795% | 81.912% | 76.622% | 87.186% | 82.101% | 76.339% | 71.530% | 80.744% | 66.772% | 76.141% | 91.234%
Figure 8: Recognition accuracies (percent) under real occlusion. (a) Test images from the AR dataset occluded realistically (sunglasses/scarf); (b) test images from the LFW dataset occluded realistically (glasses/topee).
For "scarf/topee", we take a similar approach to construct the training set and test set. On the AR dataset, the classification results of SRC [26], DALM [41], GD-HASLR [42], DSC [43], RNR [103], FR-LSTM [104], F-LR-IRNNLS [105], RCPR [45], KED [75], DICW [106], Fast NMR [27], and S2P2FR are shown in Figure 8(a). From Figure 8(a) and Table 5, we find that our method has the highest recognition accuracy on each test set. For the test images with sunglasses, the occlusion level is relatively low and the sparsity hypothesis holds, so many methods achieve good results, and there is no significant performance difference between S2P2FR and the other competing methods. However, when the occlusion degree increases, as for the images with a scarf, the performance advantage of our method becomes obvious. This is because our method makes full use of face and contextual information. For few-shot face recognition with real occlusion, our method therefore has more advantages than other state-of-the-art methods. Similarly, on the LFW dataset, the classification results of SRC [26], DALM [41], GD-HASLR [42], DSC [43], RNR [103], FR-LSTM [104], F-LR-IRNNLS [105], RCPR [45], KED [75], DICW [106], Fast NMR [27], and S2P2FR are shown in Figure 8(b). As can be seen from Figure 8(b) and Table 6, the results of our method are better than the others. In addition, our algorithm, like the others, has higher recognition accuracy for sunglass-occluded images than for scarf-occluded images. In this real partial-occlusion experiment, it is also evident that the occlusion degree is an important factor affecting recognition accuracy. Based on the experimental results, we conclude that our method is able to use both local and contextual information for few-shot occluded face recognition.

5.3. Recognition Under Small Samples and Artificial Occlusions

In order to verify the effectiveness and robustness of the proposed algorithm under small samples and artificial occlusion, we design the third comparison experiment. We use the logo image dataset and the LFW dataset to generate a sub-dataset of the LFW dataset. The face images in this sub-dataset are resized to 250 × 250. The sub-dataset contains two parts: one part is the training set, and the other is the test set. There is only one image per person in the training set and at most two images per person in the test set. The training set contains 425 face images of 425 persons, and the test set contains 2642 images. From Figure 9 and Table 7, it is clear that our method once again achieves the best results among all methods. This suggests that our method is more effective than the other methods under artificial occlusion for few-shot face recognition.

Figure 9: Recognition accuracies (percent) with artificial occlusion and under small samples.

5.4. Control Experiment Under Small Samples and Artificial Occlusions

In order to verify the reasonableness of our adaptive fusion and the effectiveness of our structural element feature and connected-granule feature under small samples and artificial occlusion, we design a control experiment. We use the RCSLBP feature, our structural element feature, and our connected-granule feature separately (without feature fusion) and compare them with our proposed method. From Figure 11 and Table 8, it is clear that our method once again achieves the best results among all methods, which suggests that the design of our method is reasonable. Using only our structural element feature or only our connected-granule feature is more effective than several competing methods under artificial occlusion for few-shot face recognition, which shows that the designs of the structural element feature and the connected-granule feature are effective. Besides, using only our AFPLC feature is more effective than using only the structural element feature or only the connected-granule feature, which means the design of our AFPLC feature is effective. Moreover, we compare the computational complexity and performance of our proposed method S2P2FR with three competing dictionary learning methods: LC-KSVD [81], FDDL [108], and Nayak's [109]. The complexity of each dictionary learning method is estimated as the (approximate) number of operations required to learn the dictionary. We define c as the number of classes, k as the number of bases per class, N as the number of training patches per class, d as the data dimension, q as the number of iterations required for l1-minimization in the sparse coding step, and L as the sparsity level. From Table 10, it is clear that S2P2FR is the least expensive computationally. Besides, S2P2FR is more effective than the others. From these two points, it is clear that S2P2FR is reasonable and effective.

5.5. Verification Experiment of Generalization Performance of the Proposed Algorithm

To verify the generalization performance of our proposed algorithm, we fix the model trained on CASIA-WebFace [110] and VGG-Face [111] under random occlusions (i.e., without any extra training or fine-tuning) and test its performance on the sub-dataset of the LFW dataset mentioned in Section 5.3. From Figure 12 and Table 9, it is clear that our method once again achieves the best results among all methods. This suggests that our method is more effective than the other methods under artificial occlusion for few-shot face recognition and that it generalizes well to unseen face data.
Table 5: Recognition accuracies (percent) of test images that have real occlusions from the AR dataset

Method | Sunglasses | Scarf
SRC[26] | 94.457% | 57.486%
DALM[41] | 93.546% | 64.857%
GD-HASLR[42] | 94.758% | 61.759%
DSC[43] | 95.784% | 64.790%
RNR[103] | 94.785% | 65.784%
Fast NMR[27] | 96.574% | 73.354%
FR-LSTM[104] | 96.578% | 64.887%
F-LR-IRNNLS[105] | 95.487% | 67.480%
RCPR[45] | 95.846% | 70.846%
KED[75] | 94.860% | 62.478%
DICW[106] | 92.578% | 64.178%
S2P2FR | 98.966% | 87.574%
Table 6: Recognition accuracies (percent) of test images that have real occlusions from the LFW dataset

Method | Glasses | Topee
SRC[26] | 95.435% | 58.425%
DALM[41] | 93.630% | 65.683%
GD-HASLR[42] | 95.399% | 62.157%
DSC[43] | 96.330% | 64.950%
RNR[103] | 95.057% | 65.819%
Fast NMR[27] | 97.451% | 73.738%
FR-LSTM[104] | 97.369% | 65.438%
F-LR-IRNNLS[105] | 95.586% | 68.151%
RCPR[45] | 96.677% | 71.006%
KED[75] | 95.224% | 63.226%
DICW[106] | 92.887% | 65.056%
S2P2FR | 98.987% | 87.960%
Table 7: Recognition accuracies (percent) of test images with artificial occlusion and under small samples

Method | Recognition accuracy
SRC[26] | 55.261%
DALM[41] | 63.777%
GD-HASLR[42] | 65.556%
DSC[43] | 68.736%
RNR[103] | 87.509%
Fast NMR[27] | 89.629%
FR-LSTM[104] | 75.473%
F-LR-IRNNLS[105] | 76.457%
RCPR[45] | 80.507%
KED[75] | 74.451%
DICW[106] | 74.830%
S2P2FR | 93.528%
Table 8: Recognition accuracies (percent) of test images with artificial occlusion and under small samples in the control experiment. "CSLBP" means only the CSLBP feature is used, "RCSLBP" means only the RCSLBP feature is used, "Structural Element" means only the structural element feature is used, "Connected-Granule" means only the connected-granule feature is used, and "AFPLC" means only the AFPLC feature is used.

Method / Feature      Recognition rate
CSLBP                 45.624%
RCSLBP                45.624%
SRC [26]              55.261%
DALM [41]             63.777%
GD-HASLR [42]         65.556%
DSC [43]              68.736%
KED [75]              74.451%
DICW [106]            74.830%
Connected-Granule     75.095%
FR-LSTM [104]         75.473%
Structural Element    75.773%
F-LR-IRNNLS [105]     76.457%
RCPR [45]             80.507%
RNR [103]             87.509%
AFPLC                 89.629%
Fast NMR [27]         89.732%
Ours                  93.528%
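For reference, the sketch below computes the standard center-symmetric LBP (CSLBP) code used as one of the baseline features in this control experiment; it follows the usual 8-neighbour definition (four center-symmetric pixel pairs thresholded against a small constant) and is not the reinforced RCSLBP variant proposed in this paper. The neighbourhood layout and threshold value are common choices, stated here as assumptions.

```python
# A minimal sketch of the standard CSLBP descriptor (8 neighbours -> 4-bit code per pixel).
# This is the baseline CSLBP, not the paper's RCSLBP variant.
import numpy as np

def cslbp(image: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """image: 2-D grayscale array with values in [0, 1]. Returns per-pixel CSLBP codes (0..15)."""
    img = image.astype(np.float64)
    # 8 neighbours as (row, col) offsets; opposite pairs are (0,4), (1,5), (2,6), (3,7).
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    shifted = [np.roll(img, (-dr, -dc), axis=(0, 1)) for dr, dc in offs]
    code = np.zeros_like(img, dtype=np.uint8)
    for bit in range(4):  # compare each center-symmetric pair and set one bit
        code |= ((shifted[bit] - shifted[bit + 4]) > threshold).astype(np.uint8) << bit
    return code[1:-1, 1:-1]  # drop the wrap-around border introduced by np.roll
```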
Table 9: Recognition accuracies (percent) on the verification experiment

Method              Recognition rate
SRC [26]            54.022%
DALM [41]           63.301%
GD-HASLR [42]       64.877%
DSC [43]            67.506%
RNR [103]           87.007%
Fast NMR [27]       88.528%
FR-LSTM [104]       74.000%
F-LR-IRNNLS [105]   75.501%
RCPR [45]           79.480%
KED [75]            72.716%
DICW [106]          73.830%
Ours                92.926%
Figure 10: The runtime under different occlusion scenarios. (a)–(c) Base-10 log of runtime (seconds) versus occlusion rate (10%–60%) for test images with random occlusions from the AR, Extended Yale B, and LFW datasets, respectively; (d)–(f) per-method runtime (seconds) for test images with real-world occlusions from the AR dataset, real-world occlusions from the LFW dataset, and the occlusion scenarios of the LFW sub-dataset, respectively. Methods compared: SRC, DALM, GD-HASLR, DSC, RNR, Fast NMR, FR-LSTM, F-LR-IRNNLS, RCPR, KED, DICW, and S²P²FR.
5.6. Discussion with Deep Few-Shot Face Recognition
To compare the proposed algorithm with a traditional deep few-shot face recognition algorithm [47], we design an additional experiment in which both models are trained and tested on our small-sample dataset. As shown in Table 11, it is obvious that our method is more effective than deep few-shot learning for few-shot face recognition under artificial occlusion.
Table 10: Complexity analysis and performance for different dictionary learning methods

Method                    Recognition rate   Complexity
S²P²FR                    93.53%             c²kN(2d + 2ck + L²)
Based on FDDL [108]       88.74%             c²kN(2d + 2qck) + c²dk²
Based on LC-KSVD [81]     83.47%             c²kN(2d + 2ck + L²)
Based on Nayak's [109]    79.97%             c²kN(2d + 2qck) + c³dk²
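For intuition, the sketch below evaluates the complexity expressions of Table 10 for one arbitrary, purely illustrative choice of parameters; the values assigned to c, k, N, d, q, and L (and their interpretation in the comments) are assumptions, not figures from the paper, and the output only shows how the operation counts compare, not actual runtimes.

```python
# Illustrative comparison of the dictionary-learning complexity terms in Table 10.
# All parameter values below are assumptions chosen only for illustration.
c, k, N, d, q, L = 100, 30, 5, 512, 4, 10

complexities = {
    "S2P2FR / LC-KSVD-based": c**2 * k * N * (2 * d + 2 * c * k + L**2),
    "FDDL-based":             c**2 * k * N * (2 * d + 2 * q * c * k) + c**2 * d * k**2,
    "Nayak-based":            c**2 * k * N * (2 * d + 2 * q * c * k) + c**3 * d * k**2,
}
for name, ops in complexities.items():
    print(f"{name:>24s}: ~{float(ops):.3e} operations")
```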
Table 11: Recognition accuracies (percent) on the discussion

Method                         Recognition rate
Deep Few-shot Learning [47]    75.83%
S²P²FR                         93.53%
5.7. Discussion on Parts of The Face To verify the performance of the proposed algorithm using imperfect facial data, we present a comprehensive set of experiments we have conducted on face recognition using different parts of the face. To undertake this work, we have utilised face images from two popular face datasets, namely, the FEI [112] and LFW [102] dataset.
0.9
0.7
0.6
0.5
5.7.1. Experiments on parts of the face using the FEI dataset
In our experiments, following DFR [113], twelve test sets were generated from the FEI dataset, each corresponding to one part of the face. The parts were the eyes, nose, right cheek, mouth, and forehead. In addition, faces were generated with just the eyes and nose, the bottom half of the face, the top half of the face, the right half, and three quarters of the face, as well as the full face. From Table 12, it is clear that our method once again achieves the best results among all methods, which suggests that it is more effective than the other methods on imperfect facial data. Two further observations can be made from Table 12. Firstly, for every part of the face, our method has a higher recognition rate than the other methods, in particular higher than DFR [113], which specializes in partial face recognition; this shows that our algorithm is more robust and effective. Secondly, our method yields different recognition rates for different parts of the face. For instance, its recognition rate using only the eyes is higher than using only the nose, which suggests that our method is more sensitive to the eyes than to the nose.
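As a rough illustration of how such part-based test sets could be generated, the sketch below crops fixed rectangular regions from an aligned face image; the region coordinates, part list, and file names are assumptions for illustration, not the protocol of DFR [113] or of this paper.

```python
# A minimal sketch (assumed regions, not the exact DFR/paper protocol) for cropping
# face parts from an aligned face image to build part-based test sets.
from PIL import Image

# Fractional (left, upper, right, lower) boxes on an aligned face; all values are assumptions.
PART_BOXES = {
    "eyes":        (0.15, 0.25, 0.85, 0.45),
    "nose":        (0.35, 0.40, 0.65, 0.65),
    "mouth":       (0.30, 0.65, 0.70, 0.85),
    "forehead":    (0.15, 0.05, 0.85, 0.25),
    "right_cheek": (0.55, 0.45, 0.90, 0.75),
    "bottom_half": (0.00, 0.50, 1.00, 1.00),
    "top_half":    (0.00, 0.00, 1.00, 0.50),
    "right_half":  (0.50, 0.00, 1.00, 1.00),
}

def crop_part(image_path: str, part: str) -> Image.Image:
    """Crop one face part from an aligned face image."""
    img = Image.open(image_path)
    w, h = img.size
    l, u, r, d = PART_BOXES[part]
    return img.crop((int(l * w), int(u * h), int(r * w), int(d * h)))

# Example with a hypothetical file: crop_part("aligned_face.jpg", "eyes").save("eyes.jpg")
```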
Figure 11: Recognition accuracies (percent) of test images with artificial occlusion and under small samples in the control experiment; the feature abbreviations ("CSLBP", "RCSLBP", "Structural Element", "Connected-Granule", "AFPLC") are as defined in Table 8.
5.7.2. Experiments on parts of the face using the LFW dataset
For the partial-face experiments on LFW, we followed the same procedure as for the FEI dataset to generate twelve test sets. Similarly, Table 13 shows that, for recognition from different parts of the face, our method has a higher recognition rate than the other methods, in particular DFR [113], which is designed for partial face recognition; this shows that our algorithm is more robust and effective. Moreover, the recognition rate of our method using only the eyes is higher than using only the nose, which again suggests that our method is more sensitive to the eyes than to the nose.
Figure 12: Recognition accuracies (percent) on the verification experiment.
Table 12: Recognition rates based on parts of the face using the FEI dataset

Part              SRC[26]   DALM[41]  GD-HASLR[42]  DSC[43]   RNR[103]  Fast NMR[27]  FR-LSTM[104]  F-LR-IRNNLS[105]  RCPR[45]  KED[75]   DICW[106]  DFR[113]  Ours
Right Cheek       12.347%   12.400%   12.346%       12.362%   12.399%   12.414%       12.415%       12.366%           12.363%   12.369%   12.376%    12.435%   14.041%
Mouth             13.468%   13.546%   13.467%       13.532%   13.509%   13.516%       13.486%       13.499%           13.562%   13.546%   13.475%    13.563%   16.478%
Forehead          35.724%   35.751%   35.700%       35.755%   35.717%   35.724%       35.730%       35.705%           35.735%   35.727%   35.766%    35.786%   38.267%
Nose              13.480%   13.539%   13.542%       13.506%   13.547%   13.486%       13.514%       13.501%           13.493%   13.505%   13.553%    13.575%   16.914%
Eyes              66.737%   66.661%   66.675%       66.736%   66.686%   66.697%       66.661%       66.696%           66.692%   66.688%   66.681%    66.754%   69.296%
Eyes+Nose         90.089%   90.102%   90.086%       90.101%   90.116%   90.080%       90.052%       90.111%           90.101%   90.035%   90.060%    90.134%   93.098%
No Eyes+No Nose   88.359%   88.410%   88.444%       88.445%   88.372%   88.371%       88.455%       88.381%           88.404%   88.380%   88.384%    88.457%   90.118%
Bottom Half       79.119%   86.516%   88.660%       81.889%   87.056%   87.316%       85.411%       79.040%           82.063%   73.927%   82.287%    94.679%   94.834%
Top Half          79.126%   86.480%   88.674%       81.894%   87.074%   87.270%       85.386%       79.085%           82.138%   73.968%   82.312%    99.156%   99.456%
Right Half        79.102%   86.473%   88.624%       81.828%   87.068%   87.249%       85.385%       79.070%           82.151%   73.904%   82.256%    99.245%   99.768%
3/4               92.733%   95.283%   99.415%       90.635%   96.422%   99.211%       99.429%       99.315%           98.180%   94.504%   95.396%    99.367%   99.876%
Full              99.355%   99.778%   99.722%       93.341%   99.183%   99.454%       99.685%       99.162%           99.186%   95.461%   99.157%    99.578%   99.965%
Table 13: Recognition rates based on parts of the face using the LFW dataset

Part              SRC[26]   DALM[41]  GD-HASLR[42]  DSC[43]   RNR[103]  Fast NMR[27]  FR-LSTM[104]  F-LR-IRNNLS[105]  RCPR[45]  KED[75]   DICW[106]  DFR[113]  Ours
Right Cheek       12.414%   12.405%   12.354%       12.406%   12.420%   12.494%       12.424%       12.397%           12.425%   12.430%   12.411%    12.534%   14.070%
Mouth             13.530%   13.642%   13.559%       13.589%   13.541%   13.563%       13.490%       13.542%           13.656%   13.595%   13.499%    13.635%   16.520%
Forehead          35.729%   35.830%   35.720%       35.779%   35.797%   35.746%       35.735%       35.796%           35.825%   35.756%   35.811%    35.804%   38.348%
Nose              13.555%   13.567%   13.625%       13.528%   13.646%   13.535%       13.534%       13.560%           13.499%   13.558%   13.594%    13.644%   16.941%
Eyes              66.816%   66.749%   66.771%       66.831%   66.733%   66.753%       66.708%       66.715%           66.727%   66.741%   66.751%    66.799%   69.373%
Eyes+Nose         90.118%   90.183%   90.124%       90.125%   90.120%   90.169%       90.053%       90.171%           90.123%   90.048%   90.121%    90.166%   93.161%
No Eyes+No Nose   88.407%   88.507%   88.496%       88.470%   88.422%   88.390%       88.465%       88.410%           88.492%   88.467%   88.395%    88.536%   90.181%
Bottom Half       79.191%   86.558%   88.676%       81.916%   87.066%   87.398%       85.490%       79.057%           82.151%   74.007%   82.291%    94.689%   94.925%
Top Half          79.205%   86.513%   88.720%       81.966%   87.152%   87.279%       85.408%       79.137%           82.181%   74.045%   82.389%    99.202%   99.519%
Right Half        79.179%   86.501%   88.648%       81.905%   87.158%   87.311%       85.469%       79.106%           82.179%   73.928%   82.298%    99.310%   99.842%
3/4               92.738%   95.295%   99.478%       90.660%   96.496%   99.251%       99.461%       99.334%           98.270%   94.591%   95.492%    99.423%   99.938%
Full              99.453%   99.779%   99.816%       93.424%   99.249%   99.480%       99.753%       99.254%           99.208%   95.472%   99.227%    99.651%   100.000%
5.8. Comparison Analysis of Computation Time
In this subsection, we compare the running time of our proposed method with that of the state-of-the-art methods. Figure 10 shows the running time of each algorithm in the previous three comparison experiments. From Figure 10(a) and Figure 10(b), it can be seen that Fast NMR [27] is the fastest method and that our method ranks second on the AR and Extended Yale B datasets, perhaps because both methods only involve a kernel regression problem with a closed-form solution. The other sub-figures of Figure 10 lead to a similar conclusion. Further investigation shows that building the dictionary is the time-consuming part of our method; we will speed up dictionary construction in future work.
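Runtime curves like those in Figure 10(a)–(c) can be produced by timing per-image classification and reporting the base-10 logarithm of the elapsed seconds for each occlusion rate; the sketch below is a minimal, generic illustration in which the classify function and the test-set structure are assumptions, not the paper's own benchmarking code.

```python
# A minimal sketch (assumed classify() and data layout) of measuring per-image runtime and
# reporting it as base-10 log of seconds, as plotted in Figure 10(a)-(c).
import math
import time

def log10_runtime_per_image(classify, test_sets):
    """test_sets: {occlusion_rate: list of images}. Returns {rate: log10(avg seconds per image)}."""
    out = {}
    for rate, images in test_sets.items():
        start = time.perf_counter()
        for img in images:
            classify(img)  # hypothetical classifier call
        avg_seconds = (time.perf_counter() - start) / max(len(images), 1)
        out[rate] = math.log10(max(avg_seconds, 1e-12))  # guard against a zero measurement
    return out
```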
6. Conclusion and Future Work
Face recognition has many obvious advantages over other biometrics. For example, applications of face recognition to personal identification are promising, including video surveillance, access control, airport blacklist identification, human-computer interaction, and payer authentication. However, when encountering random occlusion, variable lighting conditions, complex facial expressions, a lack of training samples, and other unavoidable factors, the performance of face recognition algorithms in real-world applications declines severely. In this paper, a few-shot learning approach to few-shot face recognition with occlusion is proposed. Inspired by the characteristic of the human optic nerves that humans recognize a face under occlusion using contextual information, we present an effective structural element feature which captures the local and contextual information for face recognition. Besides, the adaptive fusion method introduced here incorporates multiple features, including the proposed structural element feature, the proposed connected-granule labeling feature, and RCSLBP, to enhance the robustness of few-shot face recognition under occlusion. Last but not least, the dictionary learning in our proposed method is effective for a sparse representation of the face. Experimental results validate that the method achieves good recognition performance, showing that it is robust and attains high recognition accuracy. In future research, on the one hand, we will consider combining fast variational inference [114] with dictionary learning to get rid of the limitation of building the dictionary; on the other hand, we plan to investigate efficient implementations of our algorithm, for example on supercomputing platforms.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (61806198, 61533019, U1811463).

References
[1] Y. Zhang, C. Xu, S. Yu, H. Li, X. Zhang, SCLPV: Secure certificateless public verification for cloud-based cyber-physical-social systems against malicious auditors, IEEE Transactions on Computational Social Systems 2 (4) (2015) 159–170. doi:10.1109/TCSS.2016.2517205.
[2] S. Wang, X. Wang, P. Ye, Y. Yuan, S. Liu, F. Y. Wang, Parallel crime scene analysis based on ACP approach, IEEE Transactions on Computational Social Systems 5 (1) (2018) 244–255. doi:10.1109/TCSS. 2017.2782008. [3] J. R. Pinto, J. S. Cardoso, A. Loureno, Evolution, current challenges, and future possibilities in ECG biometrics, IEEE Access 6 (2018) 34746– 34776. doi:10.1109/ACCESS.2018.2849870. [4] C. Ding, D. Tao, Robust face recognition via multimodal deep face representation, IEEE Transactions on Multimedia 17 (11) (2015) 2049– 2058. doi:10.1109/TMM.2015.2477042. [5] S. Yang, L. Zhang, L. He, Y. Wen, Sparse low-rank component based representation for face recognition with low quality images, IEEE Transactions on Information Forensics and Security. [6] B. Chen, C. Chen, W. H. Hsu, Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset, IEEE Transactions on Multimedia 17 (6) (2015) 804–815. doi:10.1109/TMM. 2015.2420374. [7] B. Xu, Q. Liu, T. Huang, A discrete-time projection neural network for sparse signal reconstruction with application to face recognition, IEEE Transactions on Neural Networks and Learning Systemsdoi:10.1109/ TNNLS.2018.2836933. [8] H. Li, J. Sun, Z. Xu, L. Chen, Multimodal 2d+3d facial expression recognition with deep fusion convolutional neural network, IEEE Transactions on Multimedia 19 (12) (2017) 2816–2831. doi:10.1109/TMM. 2017.2713408. [9] K. Wang, C. Gou, F. Y. Wang, M 4 CD : A robust change detection method for intelligent visual surveillance, IEEE Access 6 (2018) 15505– 15520. doi:10.1109/ACCESS.2018.2812880. [10] W. Zheng, K. Wang, F. Y. Wang, Background subtraction algorithm based on Bayesian generative adversarial networks, Acta Automatica Sinica 44 (5) (2018) 878–890. [11] K. Wang, Y. Liu, C. Gou, F. Y. Wang, A multi-view learning approach to foreground detection for traffic surveillance applications, IEEE Transactions on Vehicular Technology 65 (6) (2016) 4144–4158. doi: 10.1109/TVT.2015.2509465. [12] W. Zheng, K. Wang, F. Y. Wang, A novel background subtraction algorithm based on parallel vision and Bayesian GANs, Neurocomputing. [13] J. Gaston, J. Ming, D. Crookes, Matching larger image areas for unconstrained face identification, IEEE Transactions on Cybernetics (2018) 1–12doi:10.1109/TCYB.2018.2846579. [14] R. He, X. Wu, Z. Sun, T. Tan, Wasserstein CNN: Learning invariant features for nir-vis face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence. [15] H. Li, H. Hu, C. Yip, Age-related factor guided joint task modeling convolutional neural network for cross-age face recognition, IEEE Transactions on Information Forensics and Security 13 (9) (2018) 2383–2392. doi:10.1109/TIFS.2018.2819124. [16] M. Pietikinen, Computer vision for face-to-face human-computer interaction, in: 2014 4th International Conference on Image Processing Theory, Tools and Applications (IPTA), 2014, pp. 1–1. doi:10.1109/ IPTA.2014.7001915. [17] W. Zheng, K. Wang, F. Y. Wang, GAN-based key secret sharing scheme in blockchain, IEEE Transactions on Cybernetics. [18] O. Lederman, A. Mohan, D. Calacci, A. S. Pentland, Rhythm: A unified measurement platform for human organizations, IEEE MultiMedia 25 (1) (2018) 26–38. doi:10.1109/MMUL.2018.112135958. [19] L. Teijeiro-Mosquera, J. Biel, J. L. Alba-Castro, D. Gatica-Perez, What your face vlogs about: Expressions of emotion and big-five traits impressions in YouTube, IEEE Transactions on Affective Computing 6 (2) (2015) 193–205. doi:10.1109/TAFFC.2014.2370044. 
[20] C.-J. Chang, G.-J. Jong, Interactive multimedia for kanban commercial system, in: 2012 International Symposium on Information Technologies in Medicine and Education, Vol. 2, 2012, pp. 965–968. doi:10.1109/ ITiME.2012.6291463. [21] H. Li, D. Huang, J.-M. Morvan, Y. Wang, L. Chen, Towards 3d face recognition in the real: A registration-free approach using fine-grained matching of 3d keypoint descriptors, International Journal of Computer Vision 113 (2) (2015) 128–142. doi:10.1007/s11263-014-0785-6. URL https://doi.org/10.1007/s11263-014-0785-6 [22] M. Kan, S. Shan, Y. Su, X. Chen, W. Gao, Adaptive discriminant analysis for face recognition from single sample per person, in: Face and
Gesture 2011, 2011, pp. 193–199. doi:10.1109/FG.2011.5771397. [23] X. Tan, S. Chen, Z.-H. Zhou, F. Zhang, Face recognition from a single image per person: A survey, Pattern Recognition 39 (9) (2006) 1725 – 1745. doi:https://doi.org/10.1016/j.patcog.2006.03.013. URL http://www.sciencedirect.com/science/article/pii/ S0031320306001270 [24] L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 594–611. doi:10.1109/TPAMI.2006.79. [25] Z. Xiang, H. Tan, W. Ye, The excellent properties of a dense grid-based HOG feature on face recognition compared to Gabor and LBP, IEEE Access 6 (2018) 29306–29319. doi:10.1109/ACCESS.2018.2813395. [26] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227. doi:10.1109/ TPAMI.2008.79. [27] J. Yang, L. Luo, J. Qian, Y. Tai, F. Zhang, Y. Xu, Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1) (2017) 156–171. doi:10.1109/TPAMI. 2016.2535218. [28] S. Hong, W. Im, J. Ryu, H. S. Yang, Sspp-dan: Deep domain adaptation network for face recognition with single sample per person, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 825–829. doi:10.1109/ICIP.2017.8296396. [29] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, A. M. Dobaie, Facial expression recognition via learning deep sparse autoencoders, Neurocomputing 273 (2018) 643 – 649. doi:https://doi.org/10.1016/j. neucom.2017.08.043. URL http://www.sciencedirect.com/science/article/pii/ S0925231217314649 [30] J. Zhang, W. Yu, X. Yang, F. Deng, Few-shot learning for ear recognition, in: Proceedings of the 2019 International Conference on Image, Video and Signal Processing, IVSP 2019, ACM, New York, NY, USA, 2019, pp. 50–54. doi:10.1145/3317640.3317646. URL http://doi.acm.org/10.1145/3317640.3317646 [31] Y. Li, L. Jia, Z. Wang, Y. Qian, H. Qiao, Un-supervised and semisupervised hand segmentation in egocentric images with noisy label learning, Neurocomputing 334 (2019) 11 – 24. doi:https://doi. org/10.1016/j.neucom.2018.12.010. URL http://www.sciencedirect.com/science/article/pii/ S0925231218314644 [32] M. Heikkil¨a, M. Pietik¨ainen, C. Schmid, Description of interest regions with center-symmetric local binary patterns, in: Proceedings of the 5th Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP’06, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 58–69. doi:10.1007/11949619_6. URL http://dx.doi.org/10.1007/11949619_6 [33] J. Lai, X. Jiang, Modular weighted global sparse representation for robust face recognition, IEEE Signal Processing Letters 19 (9) (2012) 571– 574. doi:10.1109/LSP.2012.2207112. [34] Z. Xu, H. Chen, S. Zhu, J. Luo, A hierarchical compositional model for face representation and sketching, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (6) (2008) 955–969. doi:10.1109/ TPAMI.2008.50. [35] C. Low, A. B. Teoh, C. Ng, Multi-fold gabor, pca, and ica filter convolution descriptor for face recognition, IEEE Transactions on Circuits and Systems for Video Technology 29 (1) (2019) 115–129. doi: 10.1109/TCSVT.2017.2761829. [36] H. Wan, H. Wang, G. Guo, X. 
Wei, Separability-oriented subclass discriminant analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2) (2018) 409–422. doi:10.1109/TPAMI.2017. 2672557. [37] H. Zhang, J. Zhang, P. Koniusz, Few-shot learning via saliency-guided hallucination of samples, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [38] W. Li, J. Xu, J. Huo, L. Wang, Y. Gao, J. Luo, Distribution consistency based covariance metric networks for few-shot learning, Vol. 33, 2019, pp. 8642–8649. doi:10.1609/aaai.v33i01.33018642. URL https://www.aaai.org/ojs/index.php/AAAI/article/ view/4885
[39] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (11) (2010) 2106–2112. doi:10.1109/TPAMI.2010.128. [40] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, Y. Ma, Toward a practical face recognition system: Robust alignment and illumination by sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2) (2012) 372–386. doi:10.1109/TPAMI. 2011.112. [41] Z.-Q. Zhao, Y. ming Cheung, H. Hu, X. Wu, Corrupted and occluded face recognition via cooperative sparse representation, Pattern Recognition 56 (2016) 77 – 87. doi:https://doi.org/10.1016/j. patcog.2016.02.016. URL http://www.sciencedirect.com/science/article/pii/ S003132031600087X [42] C. Y. Wu, J. J. Ding, Occluded face recognition using low-rank regression with generalized gradient direction, Pattern Recognition 80 (2018) 256 – 268. doi:https://doi.org/10.1016/j.patcog.2018.03. 016. URL http://www.sciencedirect.com/science/article/pii/ S0031320318301079 [43] Y.-F. Yu, D.-Q. Dai, C.-X. Ren, K.-K. Huang, Discriminative multi-scale sparse coding for single-sample face recognition with occlusion, Pattern Recognition 66 (2017) 302 – 312. doi:https://doi.org/10.1016/ j.patcog.2017.01.021. URL http://www.sciencedirect.com/science/article/pii/ S0031320317300225 [44] W. Liu, X. Chang, Y. Yan, Y. Yang, A. Hauptmann, Few-shot text and image classification via analogical transfer learning, ACM Transactions on Intelligent Systems and Technology 9 (6). doi:10.1145/3230709. [45] H. Yang, X. He, X. Jia, I. Patras, Robust face alignment under occlusion via regional predictive power estimation, IEEE Transactions on Image Processing 24 (8) (2015) 2393–2403. doi:10.1109/TIP.2015. 2421438. [46] X. Wang, F. Yu, R. Wang, T. Darrell, J. E. Gonzalez, Tafe-net: Taskaware feature embeddings for low shot learning, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [47] S. Qiao, C. Liu, W. Shen, A. Yuille, Few-shot image recognition by predicting parameters from activations, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7229–7238. doi:10.1109/CVPR.2018.00755. [48] B. Li, W. Xie, W. Zeng, W. Liu, Learning to update for object tracking with recurrent meta-learner, IEEE Transactions on Image Processing 28 (7) (2019) 3624–3635. doi:10.1109/TIP.2019.2900577. [49] B. Liu, X. Yu, A. Yu, P. Zhang, G. Wan, R. Wang, Deep few-shot learning for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 57 (4) (2019) 2290–2304. doi:10.1109/ TGRS.2018.2872830. [50] J. Wu, J. Jiang, M. Qi, H. Liu, Independent metric learning with aligned multi-part features for video-based person re-identification, Multimedia Tools and Applicationsdoi:10.1007/s11042-018-7119-6. URL https://doi.org/10.1007/s11042-018-7119-6 [51] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, X. Chen, Vrstc: Occlusionfree video person re-identification, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [52] Y. Feng, Y. Yuan, X. Lu, Person reidentification via unsupervised crossview metric learning, IEEE Transactions on Cybernetics (2019) 1– 11doi:10.1109/TCYB.2019.2909480. [53] F. Pahde, M. Nabi, T. Klein, P. Jahnichen, Discriminative hallucination for multi-modal few-shot learning, in: 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 156–160. doi:10.1109/ICIP.2018.8451372. [54] Y. Shi, G. LI, Q. Cao, K. Wang, L. 
Lin, Face hallucination by attentive sequence optimization with reinforcement learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019) 1–1doi: 10.1109/TPAMI.2019.2915301. [55] H. Huang, R. He, Z. Sun, T. Tan, Wavelet domain generative adversarial network for multi-scale face hallucination, International Journal of Computer Vision 127 (6) (2019) 763–784. doi:10.1007/ s11263-019-01154-8. URL https://doi.org/10.1007/s11263-019-01154-8 [56] W.-Z. Shao, J.-J. Xu, L. Chen, Q. Ge, L.-Q. Wang, B.-K. Bao, H.-B. Li,
On potentials of regularized Wasserstein generative adversarial networks for realistic hallucination of tiny faces, Neurocomputing. doi:https://doi.org/10.1016/j.neucom.2019.07.046. URL http://www.sciencedirect.com/science/article/pii/S0925231219310203
[57] F. Cen, G. Wang, Dictionary representation of deep features for occlusion-robust face recognition, IEEE Access 7 (2019) 26595–26605. doi:10.1109/ACCESS.2019.2901376.
[58] Z. Xia, X. Ma, Z. Shen, X. Sun, N. N. Xiong, B. Jeon, Secure image LBP feature extraction in cloud-based smart campus, IEEE Access 6 (2018) 30392–30401. doi:10.1109/ACCESS.2018.2845456.
[59] P. Yang, F. Zhang, G. Yang, Fusing DTCWT and LBP based features for rotation, illumination and scale invariant texture classification, IEEE Access 6 (2018) 13336–13349. doi:10.1109/ACCESS.2018.2797072.
[60] B. Li, Z. Li, S. Zhou, S. Tan, X. Zhang, New steganalytic features for spatial image steganography based on derivative filters and threshold LBP operator, IEEE Transactions on Information Forensics and Security 13 (5) (2018) 1242–1257. doi:10.1109/TIFS.2017.2780805.
[61] Y. Fang, Z. Wang, Improving LBP features for gender classification, in: 2008 International Conference on Wavelet Analysis and Pattern Recognition, Vol. 1, 2008, pp. 373–377. doi:10.1109/ICWAPR.2008.4635807.
[62] M. Verma, B. Raman, Center symmetric local binary co-occurrence pattern for texture, face and bio-medical image retrieval, Journal of Visual Communication and Image Representation 32 (2015) 224–236. doi:https://doi.org/10.1016/j.jvcir.2015.08.015. URL http://www.sciencedirect.com/science/article/pii/S1047320315001583
[63] C. Li, S. Zhao, K. Xiao, Y. Wang, Face recognition based on enhanced CSLBP, in: J. J. J. H. Park, S.-C. Chen, K.-K. Raymond Choo (Eds.), Advanced Multimedia and Ubiquitous Engineering, Springer Singapore, Singapore, 2017, pp. 539–544.
[64] C. Q. Huang, S. M. Yang, Y. Pan, H. J. Lai, Object-location-aware hashing for multi-label image retrieval via automatic mask learning, IEEE Transactions on Image Processing 27 (9) (2018) 4490–4502. doi:10.1109/TIP.2018.2839522.
[65] X. Wang, Z. Wang, A novel method for image retrieval based on structure elements descriptor, Journal of Visual Communication and Image Representation 24 (1) (2013) 63–74. doi:https://doi.org/10.1016/j.jvcir.2012.10.003. URL http://www.sciencedirect.com/science/article/pii/S1047320312001605
[66] X. Wang, Z. Wang, The method for image retrieval based on multi-factors correlation utilizing block truncation coding, Pattern Recognition 47 (10) (2014) 3293–3303. doi:https://doi.org/10.1016/j.patcog.2014.04.020. URL http://www.sciencedirect.com/science/article/pii/S0031320314001666
[67] Z. Li, T. Huang, An image granule labeling model and its implementation, Computer Engineering 41 (3) (2015) 223. doi:10.3969/j.issn.1000-3428.2015.03.042. URL http://www.ecice06.com/EN/abstract/article_26254.shtml
[68] J. M. Rossiter, The rapid elicitation of knowledge about images using fuzzy information granules, in: 2004 IEEE International Conference on Fuzzy Systems, Vol. 2, 2004, pp. 1159–1164. doi:10.1109/FUZZY.2004.1375575.
[69] J. W. Peirce, Understanding mid-level representations in visual processing, Journal of Vision 15 (7) (2015) 5. doi:10.1167/15.7.5. URL https://doi.org/10.1167/15.7.5
[70] T. D. Oleskiw, A. Nowack, A. Pasupathy, Joint coding of shape and blur in area V4, Nature Communications 9 (1) (2018) 466. doi:10.1038/s41467-017-02438-8. URL http://europepmc.org/articles/PMC5792439
[71] Y. Chi, M. K. Leung, ALSBIR: A local-structure-based image retrieval, Pattern Recognition 40 (1) (2007) 244–261. doi:https://doi.org/10.1016/j.patcog.2006.06.009. URL http://www.sciencedirect.com/science/article/pii/S0031320306002822
[72] W. Zhu, F.-Y. Wang, Covering based granular computing for conflict analysis, in: Proceedings of the 4th IEEE International Conference on Intelligence and Security Informatics, ISI’06, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 566–571. doi:10.1007/11760146_58. URL http://dx.doi.org/10.1007/11760146_58 [73] J. T. Yao, A. V. Vasilakos, W. Pedrycz, Granular computing: Perspectives and challenges, IEEE Transactions on Cybernetics 43 (6) (2013) 1977–1989. doi:10.1109/TSMCC.2012.2236648. [74] B. S. Manjunath, J. . Ohm, V. V. Vasudevan, A. Yamada, Color and texture descriptors, IEEE Transactions on Circuits and Systems for Video Technology 11 (6) (2001) 703–715. doi:10.1109/76.927424. [75] K. K. Huang, D. Q. Dai, C. X. Ren, Z. R. Lai, Learning kernel extended dictionary for face recognition, IEEE Transactions on Neural Networks and Learning Systems 28 (5) (2017) 1082–1094. doi:10. 1109/TNNLS.2016.2522431. [76] Z. Zhang, W. Jiang, J. Qin, L. Zhang, F. Li, M. Zhang, S. Yan, Jointly learning structured analysis discriminative dictionary and analysis multiclass classifier, IEEE Transactions on Neural Networks and Learning Systems 29 (8) (2018) 3798–3814. doi:10.1109/TNNLS.2017. 2740224. [77] R. Sarkar, S. T. Acton, SDL: Saliency-based dictionary learning framework for image similarity, IEEE Transactions on Image Processing 27 (2) (2018) 749–763. doi:10.1109/TIP.2017.2763829. [78] F. Wang, T. T. Quach, J. Wheeler, J. B. Aimone, C. D. James, Sparse coding for N-gram feature extraction and training for file fragment classification, IEEE Transactions on Information Forensics and Security 13 (10) (2018) 2553–2562. doi:10.1109/TIFS.2018.2823697. [79] Z. Zhang, F. Li, T. W. S. Chow, L. Zhang, S. Yan, Sparse codes autoextractor for classification: A joint embedding and dictionary learning framework for representation, IEEE Transactions on Signal Processing 64 (14) (2016) 3790–3805. doi:10.1109/TSP.2016.2550016. [80] Z. Jiang, Z. Lin, L. S. Davis, Label consistent K-SVD: Learning a discriminative dictionary for recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11) (2013) 2651–2664. doi:10.1109/TPAMI.2013.88. [81] F. K. Coutts, D. Gaglione, C. Clemente, G. Li, I. K. Proudler, J. J. Soraghan, Label consistent K-SVD for sparse micro-doppler classification, in: 2015 IEEE International Conference on Digital Signal Processing (DSP), 2015, pp. 90–94. doi:10.1109/ICDSP.2015.7251836. [82] Y. Song, Y. Liu, Q. Gao, X. Gao, F. Nie, R. Cui, Euler label consistent K-SVD for image classification and action recognition, Neurocomputing 310 (2018) 277 – 286. doi:https://doi.org/10.1016/j.neucom. 2018.05.036. URL http://www.sciencedirect.com/science/article/pii/ S0925231218305885 [83] R. Ptucha, A. E. Savakis, LGE-KSVD: Robust sparse representation classification, IEEE Transactions on Image Processing 23 (4) (2014) 1737–1750. doi:10.1109/TIP.2014.2303648. [84] L. Zhou, W. Li, Y. Zhang, P. Ogunbona, D. T. Nguyen, H. Zhang, Discriminative key pose extraction using extended LC-KSVD for action recognition, in: 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2014, pp. 1–8. doi:10.1109/DICTA.2014.7008101. [85] W. Wang, L. Xu, A modified sparse representation method for facial expression recognition, Computational Intelligence and Neuroscience 2016 (2016) 1–12. doi:10.1155/2016/5687602. URL http://europepmc.org/articles/PMC4736316 [86] S. Althloothi, M. H. Mahoor, X. Zhang, R. M. 
Voyles, Human activity recognition using multi-features and multiple kernel learning, Pattern Recogn. 47 (5) (2014) 1800–1812. doi:10.1016/j.patcog.2013. 11.032. URL http://dx.doi.org/10.1016/j.patcog.2013.11.032 [87] N. Ikizler-Cinbis, S. Sclaroff, Object, scene and actions: Combining multiple features for human action recognition, in: Proceedings of the 11th European Conference on Computer Vision: Part I, ECCV’10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 494–507. URL http://dl.acm.org/citation.cfm?id=1886063.1886101 [88] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 499–515.
[89] Y. Em, F. Gag, Y. Lou, S. Wang, T. Huang, L. Duan, Incorporating intra-class variance to fine-grained visual recognition, in: 2017 IEEE International Conference on Multimedia and Expo (ICME), 2017, pp. 1452–1457. doi:10.1109/ICME.2017.8019371. [90] X. Mei, Z. Hong, D. Prokhorov, D. Tao, Robust multitask multiview tracking in videos, IEEE Transactions on Neural Networks and Learning Systems 26 (11) (2015) 2874–2890. doi:10.1109/TNNLS.2015. 2399233. [91] J. Zhao, X. Xie, X. Xu, S. Sun, Multi-view learning overview: Recent progress and new challenges, Information Fusion 38 (2017) 43 – 54. doi:https://doi.org/10.1016/j.inffus.2017.02.007. URL http://www.sciencedirect.com/science/article/pii/ S1566253516302032 [92] D. Wang, X. Wang, S. Kong, Integration of multi-feature fusion and dictionary learning for face recognition, Image and Vision Computing 31 (12) (2013) 895 – 904. doi:https://doi.org/10.1016/j. imavis.2013.10.002. URL http://www.sciencedirect.com/science/article/pii/ S0262885613001509 [93] X. Wu, Q. Li, L. Xu, K. Chen, L. Yao, Multi-feature kernel discriminant dictionary learning for face recognition, Pattern Recognition 66 (2017) 404 – 411. doi:https://doi.org/10.1016/j.patcog.2016.12. 001. URL http://www.sciencedirect.com/science/article/pii/ S0031320316303880 [94] J. Lu, G. Wang, J. Zhou, Simultaneous feature and dictionary learning for image set based face recognition, IEEE Transactions on Image Processing 26 (8) (2017) 4042–4054. doi:10.1109/TIP.2017.2713940. [95] Z. Jiang, Z. Lin, L. S. Davis, Learning a discriminative dictionary for sparse coding via label consistent k-svd, in: CVPR 2011, 2011, pp. 1697–1704. doi:10.1109/CVPR.2011.5995354. [96] W. Liu, Z. Yu, M. Yang, Lijia Lu, Yuexian Zou, Joint kernel dictionary and classifier learning for sparse coding via locality preserving k-svd, in: 2015 IEEE International Conference on Multimedia and Expo (ICME), 2015, pp. 1–6. doi:10.1109/ICME.2015.7177438. [97] B. D. Haeffele, R. Vidal, Structured low-rank matrix factorization: Global optimality, algorithms, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019) 1–1doi:10.1109/ TPAMI.2019.2900306. [98] X. Zhang, Q. Liu, D. Wang, L. Zhao, N. Gu, S. Maybank, Self-taught semi-supervised dictionary learning with non-negative constraint, IEEE Transactions on Industrial Informatics (2019) 1–1doi:10.1109/TII. 2019.2926778. [99] Video smoke separation and detection via sparse representation, Neurocomputingdoi:https://doi.org/10.1016/j.neucom.2019.06. 011. URL http://www.sciencedirect.com/science/article/pii/ S0925231219308380 [100] A. Martinez, R. Benavente, The AR face database (1998). URL http://www2.ece.ohio-state.edu/~aleix/ARdatabase [101] K.-C. Lee, J. Ho, D. J. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (5) (2005) 684–698. doi:10.1109/ TPAMI.2005.92. [102] G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Tech. Rep. 07-49, University of Massachusetts, Amherst (October 2007). [103] J. Qian, L. Luo, J. Yang, F. Zhang, Z. Lin, Robust nuclear norm regularized regression for face recognition with occlusion, Pattern Recogn. 48 (10) (2015) 3145–3159. doi:10.1016/j.patcog.2015.04.017. URL http://dx.doi.org/10.1016/j.patcog.2015.04.017 [104] F. Zhao, J. Feng, J. Zhao, W. Yang, S. 
Yan, Robust LSTM-autoencoders for face de-occlusion in the wild, IEEE Transactions on Image Processing 27 (2) (2018) 778–790. doi:10.1109/TIP.2017.2771408. [105] M. Iliadis, H. Wang, R. Molina, A. K. Katsaggelos, Robust and lowrank representation for fast face identification with occlusions, IEEE Transactions on Image Processing 26 (5) (2017) 2203–2218. doi: 10.1109/TIP.2017.2675206. [106] X. Wei, C. T. Li, Y. Hu, Face recognition with occlusion using dynamic image-to-class warping (DICW), in: 2013 10th IEEE International Con-
ference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1–6. doi:10.1109/FG.2013.6553747.
[107] H. Su, X. Zhu, S. Gong, Deep learning logo detection with data expansion by synthesising context, in: Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, IEEE, 2017, pp. 530–539.
[108] M. Yang, L. Zhang, X. Feng, D. Zhang, Fisher discrimination dictionary learning for sparse representation, in: 2011 International Conference on Computer Vision, 2011, pp. 543–550. doi:10.1109/ICCV.2011.6126286.
[109] N. Nayak, H. Chang, A. Borowsky, P. Spellman, B. Parvin, Classification of tumor histopathology via sparse feature learning, in: 2013 IEEE 10th International Symposium on Biomedical Imaging, 2013, pp. 410–413. doi:10.1109/ISBI.2013.6556499.
[110] D. Yi, Z. Lei, S. Liao, S. Z. Li, Learning face representation from scratch, CoRR abs/1411.7923. arXiv:1411.7923. URL http://arxiv.org/abs/1411.7923
[111] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, VGGFace2: A dataset for recognising faces across pose and age, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 67–74. doi:10.1109/FG.2018.00020.
[112] C. Thomaz, G. Giraldi, FEI face database, 2010.
[113] A. Elmahmudi, H. Ugail, Deep face recognition using imperfect facial data, Future Generation Computer Systems 99 (2019) 213–225. doi:https://doi.org/10.1016/j.future.2019.04.025. URL http://www.sciencedirect.com/science/article/pii/S0167739X18331133
[114] J. G. Serra, M. Testa, R. Molina, A. K. Katsaggelos, Bayesian K-SVD using fast variational inference, IEEE Transactions on Image Processing 26 (7) (2017) 3344–3359. doi:10.1109/TIP.2017.2681436.
Fei-Yue Wang received his Ph.D. in Computer and Systems Engineering from Rensselaer Polytechnic Institute, Troy, New York in 1990. He joined the University of Arizona in 1990 and became a Professor and Director of the Robotics and Automation Lab (RAL) and Program in Advanced Research for Complex Systems (PARCS). In 1999, he founded the Intelligent Control and Systems Engineering Center at the Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China, under the support of the Outstanding Overseas Chinese Talents Program from the State Planning Council and the "100 Talent Program" from CAS, and in 2002, was appointed as the Director of the Key Lab of Complex Systems and Intelligence Science, CAS. In 2011, he became the State Specially Appointed Expert and the Director of the State Key Laboratory for Management and Control of Complex Systems. Dr. Wang's current research focuses on methods and applications for parallel systems, social computing, and knowledge automation. He was the Founding Editor-in-Chief of the International Journal of Intelligent Control and Systems (1995-2000), Founding EiC of IEEE ITS Magazine (2006-2007), EiC of IEEE Intelligent Systems (2009-2012), and EiC of IEEE Transactions on ITS (2009-2016). Currently he is EiC of China's Journal of Command and Control. Since 1997, he has served as General or Program Chair of more than 20 IEEE, INFORMS, ACM, and ASME conferences. He was the President of the IEEE ITS Society (2005-2007), the Chinese Association for Science and Technology (CAST, USA) in 2005, and the American Zhu Kezhen Education Foundation (2007-2008), and the Vice President of the ACM China Council (2010-2011). Since 2008, he has been the Vice President and Secretary General of the Chinese Association of Automation. Dr. Wang is an elected Fellow of IEEE, INCOSE, IFAC, ASME, and AAAS. In 2007, he received the 2nd Class National Prize in Natural Sciences of China and was awarded Outstanding Scientist by ACM for his work in intelligent control and social computing. He received the IEEE ITS Outstanding Application and Research Awards in 2009 and 2011, and the IEEE SMC Norbert Wiener Award in 2014.
Wenbo Zheng received his bachelor's degree in software engineering from Wuhan University of Technology, Wuhan, China, in 2017. He is currently a Ph.D. student in the School of Software Engineering, Xi'an Jiaotong University, as well as at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and machine learning.
Chao Gou received the B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2012 and the Ph.D. degree from the University of Chinese Academy of Sciences (UCAS), Beijing, China, in 2017. From September 2015 to January 2017, he was supported by UCAS as a joint-supervision Ph.D. student at Rensselaer Polytechnic Institute, Troy, NY, USA. He is currently an Assistant Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision and machine learning.
Conflict of Interest and Authorship Conformation Form

Please check the following as appropriate:

o All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.
o This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.
o The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.
o The following authors have affiliations with organizations with direct or indirect financial interest in the subject matter discussed in the manuscript:

Author's name     Affiliation
Wenbo Zheng       School of Software Engineering, Xi'an Jiaotong University
Chao Gou          Institute of Automation, Chinese Academy of Sciences
Fei-Yue Wang      Institute of Automation, Chinese Academy of Sciences