Multi-pose face ensemble classification aided by Gabor features and deep belief nets
Yong Chen a,∗, Ting-ting Huang a, Huan-lin Liu b, Di Zhan a

a Key Laboratory of Industrial Internet of Things & Network Control, MOE, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
b Chongqing Key Laboratory of Signal and Information Processing, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

∗ Corresponding author. Tel.: +86 13983763634. E-mail address: [email protected] (Y. Chen).
Article info

Article history: Received 30 January 2015; Accepted 27 October 2015; Available online xxx

Keywords: Deep belief nets; Classification; 2D-Gabor feature; Neighborhood component analysis

Abstract
A face image with pose variations may be encoded with different representations, which can severely degrade classification performance. In this paper, we consider the problem of multi-pose face classification using 2D-Gabor features together with the deep belief nets (DBN) approach. We cast classification as a deep learning problem in which our goal is to construct Gabor feature maps with a nonlinear characterization. We extract 2D-Gabor features of multi-pose faces and combine them with the advantages of LTP features, using local spatial histograms to describe the face image; we then apply the X-means algorithm for data processing to further improve the mapping space and reduce dimensionality, which enhances the differences between the data subsets. In this way, the complex data space to be learned is automatically divided into multiple sample subspaces, and the learned features gain more discriminating power for classification. We adopt a combination neighborhood component analysis (NCA) method to map these data, linearly transforming the training samples to find a linear subspace more favorable for classification, and we use the Gabor feature maps as the input data of the deep belief nets. Experimental results on the ORL, Yale and LFW datasets show that the proposed algorithm has better discriminating power and significantly enhances classification performance, and that it is robust to the choice of training samples on both the ORL and Yale datasets. Comparisons with eight algorithms (PCA, 2DPCA, Gabor + 2DPCA, SIFT, LBP, LTP, LGBP and LGTP) show that the proposed method is better at recognizing multi-pose faces without large volumes of data. The experimental results on image classification verify the effectiveness of the proposed approach.
© 2015 Published by Elsevier GmbH.
1. Introduction
Face recognition has many important applications, including surveillance, image retrieval, access control and de-duplication of identity documents. It is one of the most fundamental problems in machine learning and has a large impact on computer vision. Many challenges still exist for face recognition in uncontrolled environments, such as partial occlusion, large pose variations and extreme ambient illumination [1]. Considerable research attention has been directed, over the past few decades, towards developing reliable automatic face recognition systems that use two-dimensional (2D) facial images. While face recognition in controlled conditions (frontal faces of cooperative users and controlled indoor illumination) has already
achieved impressive performance over large-scale galleries, recognition in uncontrolled environments remains difficult. Typical applications of face recognition in uncontrolled environments include recognizing individuals in video surveillance frames and in images captured by handheld devices. The local binary pattern (LBP) feature has emerged as a silver lining in the field of texture classification and retrieval [30,31]. Tan et al. [36] proposed the local ternary pattern (LTP) method for face recognition; LTP has a stronger discriminative capacity than LBP and is more robust to noise and illumination changes in homogeneous regions. LTP extracts information based on the distribution of edges, which are coded using three values (−1, 0, 1). The wavelet correlogram and Gabor wavelet correlogram were studied by Moghaddam et al. [40]. The Gabor wavelet has been used extensively in face recognition because its kernels are similar to the two-dimensional receptive field profiles of mammalian cortical simple cells. Gabor filters provide a good perception of local image structures and are robust to illumination variations [21]; Gabor filtering
is therefore a popular method for feature extraction and image representation.
The main contributions of this paper are as follows:
• We combine Gabor features and the LTP encoding model into a ternary pattern. LTP encodes local structures from the responses of Gabor filters in four different orientations, and the local ternary pattern provides a discriminative encoding of the four Gabor filter responses.
• All histograms (after LTP feature extraction) are concatenated to form a 65,536-dimensional (256 × 256) feature vector. We use X-means for data processing to further improve the mapping space and reduce dimensionality, which enhances the differences between the data subsets; in this way, the complex data space to be learned is automatically divided into multiple sample subspaces.
• We make use of deep belief nets to map the data and train a combination NCA classifier on the top layers to predict labels; the combination NCA classifier enhances the differences between the data subsets and achieves better classification accuracy with a smaller number of data subsets.
• We achieve better performance on the ORL, Yale and LFW datasets.
The rest of this paper is organized as follows. Section 2 reviews related work on face classification. Section 3 briefly reviews the relevant theories. Section 4 discusses the classification of pose variations in detail. Section 5 conducts extensive experiments to demonstrate the performance of NCA based on DBNs, and Section 6 concludes the paper.

2. Related work

The classification of multi-pose faces raises many problems compared with frontal or controlled faces, and many works have been proposed to address them. Lin et al. proposed a super-resolution face recognition method to handle pose variations, but the algorithm fails when it encounters a completely unseen face [2]. On this basis, Li et al. explored the synthesis of multi-view face images based on multivariate kernel analysis, which can classify pose variations by a similarity measure [3]. To assemble different kernels and achieve the best classification results for different characteristics, Cheng et al. [4] adopted Li's method and proposed an improved multi-kernel learning method for classifying eigenvectors. To improve the quality of face recognition, Wagner et al. explored affine transformations to model the pose variations of faces and estimated the optimal affine transformation parameters with the Lucas–Kanade method [6] to correct the pose variations of test images, which increases the robustness of multi-pose face recognition with the SRC algorithm [5]. To achieve better pose mapping results, Lazebnik et al. introduced spatial pyramid matching (SPM) to improve the traditional K-means clustering method for image representation [7]. Yang et al. adopted Lazebnik's method and improved SPM by sparse-coding images at different scales [8]. Zhang et al. proposed non-negative sparse coding and used a linear SVM for classification [9]. To improve the quality of sparse representations, Wang et al. adapted sparse representation to align the human face using low-rank matrix decomposition [10]. In addition, to efficiently solve the L1-norm regularized least-squares problem of sparse coding, Lee et al. proposed a feature-sign search method that reduces the non-differentiable problem to an unconstrained quadratic program (QP), which accelerates the optimization process [11].
An important class of methods that deal with data lying on multiple subspaces relies on Gabor features. Shen et al. [12] discussed a mutual-information sampling method for Gabor feature selection. Transfer learning [13], which aims to transfer knowledge
between labeled and unlabeled data sampled from different distributions, has also attracted extensive research interest. Pan et al. proposed a transfer component analysis method to reduce the maximum mean discrepancy (MMD) [14] between the labeled and unlabeled data while simultaneously minimizing the reconstruction error of the input data using PCA. Quanz et al. explored sparse coding to extract features for knowledge transfer; however, their method adopts a kernel density estimation (KDE) technique to estimate the PDFs of the distributions and then minimizes the Jensen–Shannon divergence between them [15,16]. Note that the subspace for each subject is only an approximation to the true distribution of face images; in reality, due to pose variations, the actual distribution of face images can be nonlinear or multi-modal. In this paper, we aim to find a method that represents such nonlinear problems well and is able to classify multi-pose faces. Recently there has been a surge of interest in deep learning, which has exhibited impressive results [17]. Taigman et al. [18] pointed out that deep learning structures model complex functions very efficiently and effectively. If all layers are trained simultaneously, the time complexity is high; if only one layer is trained at a time, the deviation propagates layer by layer and results in under-fitting. Therefore, Hinton et al. proposed to build a multi-layer neural network on unsupervised data and then adjust the weights using the wake-sleep algorithm; a major computational problem of deep learning is to improve the quality of the representation generated at the topmost layer while guaranteeing that the underlying bottom nodes are recovered properly [19]. The local binary pattern (LBP) [15,30,31] has achieved remarkable results in texture analysis and face recognition, and many improved methods have appeared. Hafiane et al. [32] proposed the median binary pattern (MBP) method based on [15,30,31], which achieves good discriminative performance in texture analysis; however, although the median is invariant under rotation, the coding of the labels is no longer rotation invariant, and there is not yet a good solution to the affine-invariance problem. Guo et al. [33] proposed the adaptive local binary pattern (ALBP) method for texture classification, but its robustness when matching noisy texture images has not been tested. Guo et al. [34] therefore proposed the completed local binary pattern (CLBP), which analyzes the LBP method in terms of the signs and magnitudes of the local differences; however, this method does not address the sensitivity of LBP to Gaussian noise. To solve this problem, Ahonen et al. [35] introduced soft histograms into LBP, using two fuzzy membership functions instead of the original LBP threshold function, which gives a stable and continuous output under changes of the input image; however, issues remain with the computational complexity and the sensitivity to changes in gray level. Tan et al. [36] proposed the local ternary pattern (LTP) method for face recognition, which introduces an interval of ±t (different values of t have different effects on face recognition). Although LTP has a stronger discriminative capacity than LBP and is more robust to noise and illumination changes in homogeneous regions,
the value of t greatly affects the experimental results, and LTP has some limitations in handling multi-scale image changes and partial occlusion. Zhang et al. [37] proposed the local Gabor binary pattern histogram sequence (LGBPHS) based on [30,31]; the method concatenates the local Gabor binary pattern histograms of all image regions and treats the histogram sequence as the model of the human face. However, its accuracy and speed when matching two images under pose and illumination changes still need to be improved. Tan et al. [28] proposed a method that fuses Gabor and LBP features, reduces the dimensionality through PCA and finally integrates all the scores. Shan et al. [12,29] proposed Fisher discriminant
analysis (FDA) based on local Gabor binary patterns (LGBP) and achieved good results.
There has been much work on dealing with large amounts of data in face verification. Increases in the amount of data and in the feature dimension lead to unsatisfactory training speed, and conventional classifiers such as the support vector machine (SVM) [24] are less flexible when dealing with complex data distributions. To improve the accuracy and efficiency of conventional classifiers when learning large-scale data, we process the data with X-means clustering to further improve the mapping space and reduce dimensionality. This enhances the differences between the data subsets and reduces the redundant data between subsets; X-means clustering separates the original data space into multiple clusters automatically while maintaining the original data structure [38]. Based on the above analysis, our work additionally proposes a multi-pose face classification method based on 2D-Gabor features and the deep belief nets (DBN) approach, constructing a deep learning model that classifies multi-pose images accurately. We extract 2D-Gabor features, combine them with the advantages of LTP features, use local spatial histograms to describe the human face, and then apply the X-means algorithm for data processing to further improve the mapping space and reduce dimensionality. We aim to improve feature learning with more discriminating power to benefit classification. Finally, we adopt deep learning to train the combination NCA classifier. In many application domains a large supply of unlabeled data is readily available but the amount of labeled data, which can be expensive to obtain, is very limited, so nonlinear NCA [20] may suffer from over-fitting. By using the processed data as the input of the deep belief nets and linearly transforming the training samples to find a linear subspace more favorable for classification, we aim to provide a data set large enough to estimate the model parameters. In addition, we show that the algorithms are effective for multi-pose face image classification.
3. Preliminary

In this section we briefly review the main theories used in this paper.
3.1. Gabor wavelet

The 2D-Gabor wavelet transform is a localized time(space)–frequency analysis that analyzes the signal at multiple scales through stretching and shifting operations; it ultimately achieves fine time resolution at high frequencies and fine frequency resolution at low frequencies. The Gabor transform is a special case of the short-time Fourier transform in which the window function is a Gaussian. The 2D-Gabor wavelet is defined as
$$\psi(k, z) = \frac{\|k\|^{2}}{\sigma^{2}} \exp\!\left(-\frac{\|k\|^{2}\|z\|^{2}}{2\sigma^{2}}\right)\left[\exp(ikz) - \exp\!\left(-\frac{\sigma^{2}}{2}\right)\right] \qquad (1)$$
where σ is a constant related to the bandwidth of the wavelet frequency, z = (x, y) is the spatial coordinate, and k determines the orientation and scale of the Gabor kernel. Here u and v define the orientation and scale of the Gabor kernels, respectively, and the wave vector k_{u,v} is defined as k_{u,v} = k_v e^{iϕ_u}, with k_v = k_max/f^v (sampling scale), ϕ_u = πu/8 (sampling orientation), k_max = π/2 (maximum frequency) and f = √2 [4,22].
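To make Eq. (1) and the parameter choices above concrete, the following minimal NumPy sketch builds the 40-kernel Gabor bank (5 scales × 8 orientations) with k_{u,v} = k_v e^{iϕ_u}, k_v = k_max/f^v and ϕ_u = πu/8. It is an illustrative sketch rather than the authors' implementation; the kernel window size is an assumption.

```python
import numpy as np

def gabor_kernel(u, v, sigma=2 * np.pi, kmax=np.pi / 2, f=np.sqrt(2), size=31):
    """Gabor kernel of Eq. (1): (||k||^2/sigma^2) * exp(-||k||^2 ||z||^2 / (2 sigma^2))
    * (exp(i k.z) - exp(-sigma^2/2)), with k_{u,v} = k_v * exp(i*phi_u)."""
    kv = kmax / (f ** v)                      # sampling scale k_v
    phi = np.pi * u / 8.0                     # sampling orientation phi_u
    kx, ky = kv * np.cos(phi), kv * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x ** 2 + y ** 2
    k2 = kv ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)
    return envelope * carrier

# 40-filter bank: 5 scales (v = 0..4) x 8 orientations (u = 0..7)
bank = [gabor_kernel(u, v) for v in range(5) for u in range(8)]
print(len(bank), bank[0].shape)   # 40 kernels, each 31x31 and complex-valued
```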
Fig. 1. Local ternary patterns (LTP). (a) LBP schematic and (b) LTP encoding.
3.2. Local ternary pattern
Tan et al. [36] proposed the local ternary pattern (LTP) method for face recognition. The method introduces an interval of ±t, and different values of t have different effects on face recognition. For example, when the neighborhood value lies within the interval around the center value, the encoding is 0; when the neighborhood value exceeds the center value by at least t, the encoding is 1; conversely, when the neighborhood value is smaller than the center value by at least t, the encoding is −1. The encoding process is shown in Fig. 1: Fig. 1(a) shows the LBP scheme and Fig. 1(b) shows the LTP encoding. It is calculated as follows:
$$s(u, i_c, t) = \begin{cases} 1, & u \ge i_c + t \\ 0, & |u - i_c| < t \\ -1, & u \le i_c - t \end{cases} \qquad (2)$$
where u is the pixel value of a neighborhood point, i_c is the value of the center pixel, and t is a manually set threshold (we use t = 5 in this paper); different values of t produce different encodings. As noted above, LTP has a stronger discriminative capacity than LBP and is more robust to noise and illumination changes in homogeneous regions. However, the value of t greatly affects the experimental results, and LTP has some limitations in handling multi-scale image changes and partial occlusion.
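As an illustration of Eq. (2), the sketch below applies the three-valued LTP function to the 8-neighborhood of each pixel and packs the result into the usual "upper" (+1) and "lower" (−1) binary codes whose histograms can then be concatenated. It assumes a grayscale image and t = 5 as above, and is not the authors' implementation.

```python
import numpy as np

def ltp_codes(img, t=5):
    """Eq. (2): compare each 8-neighbor u with the center i_c using threshold t,
    then pack the ternary values into an 'upper' (+1) and a 'lower' (-1) 8-bit code."""
    img = img.astype(np.int32)
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros_like(center)
    lower = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        upper |= (neigh >= center + t).astype(np.int32) << bit   # ternary value +1
        lower |= (neigh <= center - t).astype(np.int32) << bit   # ternary value -1
    return upper, lower

# usage on a random stand-in image; each code yields a 256-bin local histogram
up, lo = ltp_codes(np.random.randint(0, 256, (64, 64)))
hist_up = np.bincount(up.ravel(), minlength=256)
```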
3.3. Gabor ternary pattern

The responses of the four odd Gabor filters are combined into a ternary pattern [21,22]:
$$\mathrm{GTP}_t(x, y) = \sum_{i=0}^{3} 3^{i}\left[\left(f_i(x, y) < -t\right) + 2\left(f_i(x, y) > t\right)\right] \qquad (3)$$
where t is a threshold (we use t = 0.03 in our experiments). We call this descriptor the Gabor ternary pattern (GTP); in total there are 3^4 = 81 different GTP patterns. Fig. 3 shows the major components of feature extraction that form a GTP pattern. A histogram of GTP is calculated, and all histograms are concatenated to form a 1296-dimensional feature vector. The histogram of f(x, y) is defined as
$$h_i = \sum_{x, y} I\{f(x, y) = f_c(i)\}, \qquad i = 1, 2, \ldots, L \qquad (4)$$
where i indexes the pattern, f_c(i) is the gray value of the i-th pattern, and h_i is the number of pixels with the i-th pattern. Finally, we apply X-means to improve the mapping space.

Fig. 2. 2D-Gabor features. (a) 2D-Gabor feature points and (b) extraction of the feature vector.
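The sketch below shows, using random stand-in filter responses, how Eq. (3) combines the four odd-Gabor responses f_0, ..., f_3 into one of the 81 GTP codes and how the histogram of Eq. (4) is accumulated; it is an illustrative sketch only, with t = 0.03 as above.

```python
import numpy as np

def gtp_codes(responses, t=0.03):
    """Eq. (3): GTP_t(x,y) = sum_i 3^i [ (f_i < -t) + 2*(f_i > t) ], giving 3^4 = 81 patterns."""
    code = np.zeros(responses[0].shape, dtype=np.int32)
    for i, f in enumerate(responses):
        tern = (f < -t).astype(np.int32) + 2 * (f > t).astype(np.int32)  # 0, 1 or 2
        code += (3 ** i) * tern
    return code

def gtp_histogram(code, n_patterns=81):
    """Eq. (4): h_i = number of pixels whose GTP code equals pattern i."""
    return np.bincount(code.ravel(), minlength=n_patterns)

# stand-in for the four odd Gabor filter responses of a 32x32 face patch
responses = [np.random.randn(32, 32) for _ in range(4)]
hist = gtp_histogram(gtp_codes(responses))
print(hist.sum())   # equals the number of pixels (32*32)
```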
3.4. Deep belief nets

Deep learning builds a machine learning model by learning a deep nonlinear network structure with many hidden layers and large amounts of training data, and such a structure can approximate complex functions well. Classification accuracy is improved considerably by focusing on learning the nature of the data from a small number of samples. Suppose an n-layer system (S_1, S_2, ..., S_n); if we define the input as I and the output as O, it can be expressed as I → S_1 → S_2 → ··· → S_n → O, and by adjusting the parameters of the system we can ensure that its output O still equals its input I, so that a hierarchy of input-level characteristics S_1, S_2, ..., S_n is obtained automatically. Deep belief nets (DBNs) [19] are probabilistic generative models that establish a joint distribution between the observed data and the labels, in contrast to the traditional discriminative neural network model. Suppose F(x_i; W) is a deep belief net parameterized by the parameter vector W, where x_i represents the i-th positive image and y_i the corresponding profile image. All training images are then represented as
$$\sum_{i} D\left(F(x_i; W); y_i\right) \qquad (5)$$
where
$$D = -\sum_{i} p_i \log \hat{p}_i - \sum_{i} (1 - p_i) \log (1 - \hat{p}_i)$$
is the distance metric, p_i represents the gray pixel of the positive image, and \hat{p}_i represents the gray pixel of the positive image generated by the model. In this paper we adopt the cross-entropy to handle the probability calculation:
$$D(Q, P) = -\sum_{k=1}^{N} q_k \log_2 \frac{q_k}{p_k} \qquad (6)$$
where Q and P are vector distributions in the N-dimensional feature space.
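As a small numerical illustration of Eq. (6), the following sketch evaluates D(Q, P) for two discrete distributions; the epsilon guard is an implementation assumption to avoid taking the logarithm of zero.

```python
import numpy as np

def cross_entropy_distance(q, p, eps=1e-12):
    """Eq. (6): D(Q, P) = -sum_k q_k * log2(q_k / p_k) over an N-dimensional feature space.
    (As written, this equals the negative KL divergence of Q from P, measured in bits.)"""
    q = np.asarray(q, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(q * np.log2((q + eps) / (p + eps)))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(cross_entropy_distance(q, q))  # 0.0 when the distributions coincide
print(cross_entropy_distance(q, p))  # non-zero when they differ
```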
4. Pose variations classification

Common information should be shared when the same individual appears in different poses; that is, there should be a mapping between a frontal face and the opposite side face, and this mapping is non-linear. Deep belief nets perform well at feature representation, so we classify multi-pose images by extending the NCA algorithm to extract nonlinear features.
4.1. Extracting 2D-Gabor features of pose variations
Once the testing images are normalized to a fixed size, we first apply the Gabor filter to each image. Gabor filters provide a good perception of local image structures and are robust to illumination variations. By assembling different scales and orientations of J_k(z), the Gabor feature [22] at position z is formed; the convolution with the Gabor kernel is defined as
$$J_k(z) = I(z) * \psi(k, z) \qquad (7)$$
where A_k and ϕ_k represent the amplitude and phase of J_k(z), respectively, so that J_k(z) = A_k e^{iϕ_k}; I(z) is the original image and ψ(k, z) is the Gabor kernel at position z. Assuming the size of the face image is M × N, the dimensionality of the Gabor features reaches 40 × M × N when 40 filters are used, and the same holds for the amplitude and phase features (A_k and ϕ_k). There is a relatively high correlation between adjacent pixels and a high computational complexity, so we adopt the NCA algorithm to reduce the dimensionality of the 2D-Gabor feature coefficients. The basic flow of Gabor feature extraction consists of the following three steps.
Step 1: The 2D-Gabor feature points and the extraction of feature vectors are shown in Fig. 2; Fig. 2(a) shows the 2D-Gabor feature points and Fig. 2(b) shows the extraction of the feature vectors.
Step 2: In the experiment we take four orientations (u ∈ {0, 2, 4, 6}, corresponding to 0°, 45°, 90° and 135°) with σ = 1 [21]. Furthermore, we use only the odd Gabor kernels, which are sensitive to edges and their locations; these four Gabor filters are able to discriminate local details in the face image. Fig. 3 shows four Gabor-filtered images, which emphasize edges in the four different orientations (0°, 45°, 90° and 135°).
Step 3: For each pixel (x, y) in the normalized key-point region, the four Gabor filters are fused as follows [21,22]:
$$f_i(x, y) = J_i(x, y) * I(x, y), \qquad i = 0, 1, 2, 3 \qquad (8)$$
where J_i = Im ψ_{2i,0} is the i-th odd Gabor kernel, i.e., the imaginary part of the kernel with orientation index 2i and scale 0. Fig. 4 shows the major components of feature extraction and convolution. It can be observed that the processed image after LTP feature extraction contains less data than the original image; all histograms are concatenated to form a 65,536-dimensional (256 × 256) feature vector. Finally, we apply X-means to reduce the feature dimensionality and further improve the mapping space, and we use the histogram features of the processed image as the input data for deep learning.

Fig. 4. The major components of feature extraction.
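A minimal sketch of Step 2 and Eq. (8): the image is convolved with the four odd (imaginary-part) Gabor kernels at 0°, 45°, 90° and 135°, producing the responses f_0, ..., f_3 that feed the GTP/LTP encoding described above. The kernel size and the use of SciPy's convolve2d are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def odd_gabor_kernel(orientation, kv=np.pi / 2, sigma=1.0, size=15):
    """Imaginary (odd) part of the Gabor kernel of Eq. (1) at the given orientation (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kx, ky = kv * np.cos(orientation), kv * np.sin(orientation)
    envelope = (kv ** 2 / sigma ** 2) * np.exp(-kv ** 2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.sin(kx * x + ky * y)        # odd part is sensitive to edges

def odd_gabor_responses(image):
    """Eq. (8): f_i(x, y) = J_i(x, y) * I(x, y) for i = 0..3 (0, 45, 90, 135 degrees)."""
    angles = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    return [convolve2d(image, odd_gabor_kernel(a), mode="same", boundary="symm")
            for a in angles]

image = np.random.rand(32, 32)          # stand-in for a normalized face image
responses = odd_gabor_responses(image)  # feed these into the GTP/LTP encoding
print(len(responses), responses[0].shape)
```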
Fig. 3. Gabor filtering.

Fig. 5. The overlap between the subsets of data.
4.2. X-means
We choose X-means to further improve the mapping space and reduce the dimensionality because it is an adaptive clustering algorithm: it enhances the differences between the data subsets, reduces the redundant data between subsets, performs the optimization over a given interval of k, and automatically finds the best value of k to reflect the underlying data distribution. Assume V = [v_1, v_2, ..., v_k, ...] is the category of the mapping space W, and v_i is the i-th cluster center. The mapping of the multiple samples is defined as
$$\varphi(B_i) = \left[s(v_1, B_i), s(v_2, B_i), \ldots, s(v_k, B_i)\right] \qquad (9)$$

$$s(v_i, B_i) = \max_{j} \exp\left(-\left\|x_{ij} - v_i\right\|^{2}\right) \qquad (10)$$
Multiple samples are thus converted into a single point in the feature space, which turns the multiple examples into a single sample. This enhances the differences between the data subsets and reduces the redundant data between subsets; in this way, the complex data space to be learned is automatically divided into multiple sample subspaces. We use the overlapping rate to quantify the differences between the subsets of data:

$$\text{overlapping-rate} = \frac{\left|\mathrm{subsrt}_i \cap \mathrm{subsrt}_j\right|}{N_{fea} \cdot N_{ins}} \qquad (11)$$
where subsrt_i and subsrt_j are two data subsets, and N_fea and N_ins represent the number of features and the number of samples of each data subset, as shown in Fig. 5. Fig. 5 illustrates a data set that contains 10 samples with a 10-dimensional feature, where the sampling rate and the feature selection rate are both 50%. The left and right shaded areas are the two data subsets, each containing five samples and five-dimensional features; the overlapping region of the two data subsets is 2 × 2, so the overlapping rate is 16%.
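The sketch below illustrates the bag-to-point mapping of Eqs. (9)–(10) and the overlapping rate of Eq. (11). Because no particular X-means implementation is assumed here, ordinary KMeans with a fixed k stands in for X-means; the helper names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_mapping(bags, k=5, random_state=0):
    """Eqs. (9)-(10): cluster all instances to get centers v_1..v_k, then map each bag B_i
    to phi(B_i) = [s(v_1, B_i), ..., s(v_k, B_i)] with s(v, B) = max_j exp(-||x_j - v||^2)."""
    all_instances = np.vstack(bags)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(all_instances)
    centers = km.cluster_centers_
    feats = []
    for bag in bags:
        d2 = ((bag[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # |B| x k squared distances
        feats.append(np.exp(-d2).max(axis=0))                            # max over instances in the bag
    return np.array(feats), centers

def overlapping_rate(rows_i, cols_i, rows_j, cols_j, n_ins, n_fea):
    """Eq. (11): shared (sample, feature) cells of the two subsets divided by N_fea * N_ins."""
    shared = len(set(rows_i) & set(rows_j)) * len(set(cols_i) & set(cols_j))
    return shared / float(n_fea * n_ins)

# toy usage: 10 bags of 6 instances in a 4-D instance space
bags = [np.random.randn(6, 4) for _ in range(10)]
X, centers = bag_mapping(bags, k=3)
print(X.shape)  # (10, 3): each bag becomes a single point
# Fig. 5 example: two 5-sample x 5-feature subsets sharing a 2x2 block -> 4/25 = 0.16
print(overlapping_rate(range(0, 5), range(0, 5), range(3, 8), range(3, 8), n_ins=5, n_fea=5))
```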
4.3. Multi-pose classification based on nonlinear NCA method

Since facial images contain numerous non-linear factors, we use nonlinear NCA as the feature extraction method to guarantee that samples of the same category gather together. Inspired by recent progress in deep learning, and to solve these problems efficiently, we propose a novel algorithm that combines the NCA method with DBNs; this approach performs well at preserving the class properties of the images. Given a set of N labeled training cases (x_i, c_i), for each training image vector x_i we define the probability that point i selects one of its neighbors j in the transformed feature space as

$$p_{ij} = \frac{\exp\left(-d_{ij}^{2}\right)}{\sum_{z \neq i} \exp\left(-d_{iz}^{2}\right)}, \qquad p_{ii} = 0 \qquad (12)$$

where d_{ij} = \left\|F(x_i; W) - F(x_j; W)\right\| is the Euclidean distance metric and F(x; W) is a multi-layer neural network parameterized by the weight vector W. The probability that point i belongs to class k depends on the relative proximity of all other data points that belong to class k:

$$p(c_i = k) = \sum_{j : c_j = k} p_{ij} \qquad (13)$$

The NCA objective is to maximize the expected number of correctly classified points on the training data:

$$O_{NCA} = \sum_{i=1}^{N} \sum_{j : c_i = c_j} p_{ij} \qquad (14)$$
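For illustration, the following sketch evaluates p_ij, the per-sample class probability and the objective O_NCA of Eqs. (12)–(14) for a given embedding; the random linear map used as F(x; W) is a placeholder for the multi-layer network trained in the paper.

```python
import numpy as np

def nca_objective(embedded, labels):
    """Eqs. (12)-(14): soft neighbor probabilities p_ij, class probabilities p(c_i = c_i),
    and the objective O_NCA = sum_i sum_{j: c_i = c_j} p_ij."""
    diff = embedded[:, None, :] - embedded[None, :, :]
    d2 = (diff ** 2).sum(axis=2)                      # squared Euclidean distances d_ij^2
    logits = -d2
    np.fill_diagonal(logits, -np.inf)                 # enforces p_ii = 0
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                 # each row of p_ij sums to 1
    same = labels[:, None] == labels[None, :]
    p_correct = (p * same).sum(axis=1)                # Eq. (13) evaluated at k = c_i
    return p_correct.sum()                            # O_NCA, Eq. (14)

# toy usage: 20 samples, 5-D features, 3 classes, random linear "embedding"
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 3, size=20)
W = rng.normal(size=(5, 2))
print(nca_objective(X @ W, y))
```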
Algorithm 1. Face classification combining local Gabor ternary patterns with X-means
Input: training data L
Output: mapping space R and decision parameters w∗, b
Begin
Step 1: Divide the GTP data {(ϕ(B_i), y_i), i = 1, 2, ..., L, y_i ∈ {−1, 1}} into a positive collection bag and a negative collection bag, using the decision rule y = sign(Σ_{k∈Ω} a_k S(w∗, B_k) + b) and the overlapping-rate of Eq. (11);
Step 2: Calculate the value of DD (data subsets) for each positive collection bag, and treat every example whose DD value is greater than a threshold as an example prototype;
Step 3: Cluster the example prototypes with the X-means algorithm to form a mapping space R, and project all bags onto R, which turns the multiple examples into single samples;
Step 4: Map the data with deep learning and use the trained combination NCA classifier to predict labels; finally obtain the decision parameters w∗ and b.
End
Fig. 6. The data distribution of the NCA features. (a) The data distribution of the original face image and (b) the linear mapping of the image samples (initial and final projections).
Denoting d_{ij} = F(x_i; W) − F(x_j; W), the derivatives of O_NCA with respect to the parameter vector W for training case x_i are

$$\frac{\partial O_{NCA}}{\partial W} = \frac{\partial O_{NCA}}{\partial F(x_i; W)} \frac{\partial F(x_i; W)}{\partial W} \qquad (15)$$

where

$$\frac{\partial O_{NCA}}{\partial F(x_i; W)} = -2\left[\sum_{j : c_i = c_j} p_{ij}\left(d_{ij} - \sum_{z \neq i} p_{iz} d_{iz}\right)\right] + 2\left[\sum_{j : c_j = c_i} p_{ji} d_{ji} - \sum_{z \neq i}\left(\sum_{q : c_z = c_q} p_{zq}\right) p_{zi} d_{zi}\right] \qquad (16)$$
and ∂F(x_i; W)/∂W is computed using standard back-propagation. The distribution of the nonlinear NCA features is more concentrated for similar samples, which makes the distributions of the training data and testing data more consistent, as shown in Fig. 6, where the ordinate represents time and the abscissa represents the data. Fig. 6(a) shows the data distribution of a randomly chosen original 3D face image; Fig. 6(b) shows the linear mapping of one group of image samples according to NCA. The results show that the initial projection of the image data is scattered, while the NCA method maps these data well; we therefore use the NCA method on both the training and testing data. The face classification algorithm combines local Gabor ternary patterns with X-means, and the overall proposed method is outlined in Algorithm 1.
5. Experimental results
To demonstrate the validity of the proposed scheme, we compare our method with PCA, 2DPCA, Gabor + 2DPCA, SIFT [21], LBP [15], LTP [36], LGBP [12] and LGTP in experiments on both the ORL and Yale datasets. We set the experimental parameter σ = 2π for the Yale face database and σ = π for the ORL face database, and all images used in the experiments are resized to 32 × 32.
5.1. The structure of DBNs

The complete proposed structure of the deep belief nets is built upon Hinton's work reported in [19] combined with the NCA method, and is summarized in Fig. 7. Fig. 7(a) shows the five-layer deep network, where v_i (i = 1, 2, ...) denotes the visible-layer vectors and h_i (i = 1, 2, ...) denotes the hidden-layer vectors. The five-layer deep network is divided into a stack of restricted Boltzmann machines (RBMs) formed by adjacent layers, as shown in Fig. 7(b). Pre-training consists of learning a stack of RBMs in which the feature activations of one RBM are treated as data by the next RBM; this step initializes the DBN. Note that we adopt the histogram of the gray values of the processed image (see Fig. 4) as the input data and integrate it with the NCA algorithm. We first train the weight matrix W_5 of the first RBM (2000 units); we then fix W_5, compute the first hidden-layer vector h_1 of this RBM and treat it as the input data of the second layer. After learning the first layer of features, a second layer is learned by treating the activation probabilities of the existing features as data; we train the weight matrix W_4 of the second RBM (1000 units) and compute the hidden unit vectors and weight matrices (W_i, i = 1, ..., 5) recursively, where the 30 top units represent the probabilities of the 30 category labels. After the initial pre-training, the parameters are fine-tuned by performing gradient descent on the NCA objective function introduced in [23]. A sparse encoder sits in the final layer of the DBN, and its constraints are coupled with an L1-norm regularizer in the sparse auto-encoder, where a coordinate descent optimization strategy is adopted to update each vector (sparse code matrix) individually with the other vectors fixed [14]. Finally, we use the cross-entropy to fine-tune the entire network [2], which verifies that the proposed method can construct better representations for classifying pose-variation images accurately.
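To make the pre-training stage concrete, the sketch below greedily trains a small stack of binary RBMs with one-step contrastive divergence (CD-1), each RBM treating the hidden activation probabilities of the previous one as its data, as described above. The layer sizes, learning rate and number of epochs are illustrative assumptions, and the NCA fine-tuning and sparse auto-encoder stages are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05, rng=np.random.default_rng(0)):
    """Train one binary RBM with CD-1; `data` holds values in [0, 1]."""
    n_vis = data.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hidden))
    b_v = np.zeros(n_vis)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        h_prob = sigmoid(data @ W + b_h)                         # positive phase
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T + b_v)                  # one Gibbs step
        h_recon = sigmoid(v_recon @ W + b_h)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)   # CD-1 update
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_h

def pretrain_dbn(data, layer_sizes=(2000, 1000, 500, 30)):
    """Greedy layer-wise pre-training: each RBM's hidden probabilities feed the next RBM."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(layer_input, n_hidden)
        weights.append((W, b_h))
        layer_input = sigmoid(layer_input @ W + b_h)
    return weights

# toy usage: 100 normalized GTP histograms of dimension 81 as input data
histograms = np.random.rand(100, 81)
stack = pretrain_dbn(histograms, layer_sizes=(64, 32, 16))
print([W.shape for W, _ in stack])
```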
Fig. 7. The structure of the deep belief nets. (a) Pre-training and (b) several layers of RBM.

5.2. Experiments on the LFW dataset

The LFW [39] dataset contains 13,233 uncontrolled face images of 5749 public figures with variety in pose, lighting, expression, race, ethnicity, age, gender, clothing, hairstyle and other parameters; all of the images are collected from the Web, as shown in Fig. 8. The accuracy rates are shown in Table 1. We choose four popular representative classifiers for comparison: SVM [24], logistic regression (LR) [25], Adaboost [26] and GaussianFace-BC [27]. We treat 80% of the samples as the training set and 20% as the testing set, and the two types of data each account for 50% every time, to keep the data balanced and the distributions of the training and testing sets as similar as possible. Furthermore, to ensure the objectivity of the experiment, we apply 10-fold cross-validation. The sampling rate of the samples is 0.67, the sampling rate of the characteristics is 0.5, the number of iterations is 100 and the overlap threshold is 0.3.

Fig. 8. LFW face dataset.

Table 1
The accuracy rate (%) of our method as a binary classifier and of other competing methods on LFW using an increasing number of source-domain datasets.

Classifier             0       1       2       3       4
SVM [24]               83.21   84.32   85.06   86.43   84.75
LR [25]                81.14   81.92   82.65   83.84   84.75
Adaboost [26]          82.91   83.62   84.80   86.30   87.21
GaussianFace-BC [27]   86.25   88.24   90.01   92.22   93.73
Our method             74.67   86.67   84.00   90.67   86.67

Different classifiers give different results in this experiment. Table 1 shows that the performance of our method is better than that of the first three classifiers (SVM, LR and Adaboost) [27]; our method shows an improvement of about 4% when the LFW datasets are used for training, because the adaptive random subspace ensemble classifier enhances the diversity of the base components and determines the number of base classifiers automatically, which improves robustness and accuracy. However, our method does not perform as well as GaussianFace-BC [27]. There is still much work to do on the volume of face data, and some difficulties certainly remain in keeping the neighborhood relations between the frontal space and the side space, which is a meaningful research point. Different overlapping threshold values lead to different classification results, as shown in Fig. 9; we need to find an optimal overlap threshold value that balances the differences between data subsets against the number of data subsets. Fig. 9 shows that the accuracy rate is highest when the overlap threshold is about 0.28 and the number of data subsets is 35, since the combination NCA classifier enhances the differences between the data subsets, reduces the redundancy between them, and achieves relatively high classification accuracy with a smaller number of data subsets.

Fig. 9. Classification accuracy of the combination NCA classifier for different overlapping threshold values.

5.3. Experiments on the ORL dataset

The ORL dataset is widely adopted to evaluate computer vision and pattern recognition algorithms, as shown in Fig. 10. It contains 400 images of 40 persons with 10 poses each, generated by randomly sampling face images from near-frontal poses under different lighting and illumination conditions.
Fig. 10. ORL face dataset.
Table 2
The classification accuracy rate (%) on the ORL face dataset using SIFT, LBP, LTP, LGBP, LGTP and our proposed method.

Algorithm      Recognition rate
SIFT [21]      90.00
LBP [15]       92.86
LTP [36]       92.50
LGBP [12]      89.64
LGTP           94.29
Our method     94.98
We construct the dataset by selecting 250 face images from ORL as the training data and the other 150 images, of size 32 × 32, as the test data. Owing to the variations in lighting and illumination, the training and testing data can follow different distributions in the same feature space. We select the first three images of each subject in the ORL database as training samples and treat the remaining images as testing samples. We conduct a set of image classification experiments to evaluate the SIFT [21], LBP [15], LTP [36], LGBP [12] and LGTP algorithms on the ORL face database; the results are shown in Table 2.
5.4. Experiments on the Yale dataset
The Yale face database contains 165 images of 15 volunteers. We select the first five images of each subject as training samples and treat the remaining images as testing samples, as shown in Fig. 11. Fig. 11(a) shows examples from the Yale face dataset, and Fig. 11(b) shows examples from the Yale face dataset after pretreatment. Comparisons with PCA, 2DPCA, Gabor + 2DPCA, SIFT [21], LBP [15], LTP [36], LGBP [12] and LGTP show that the proposed method is better at recognizing multi-pose faces. The results of the experiment are shown in Fig. 12, where Fig. 12(a) shows the recognition rates on the Yale face dataset without pretreatment
and Fig. 12(b) shows the recognition rates on the Yale face dataset after light pretreatment. We conduct a set of image classification experiments to evaluate the eight approaches (PCA, 2DPCA, Gabor + 2DPCA, SIFT [21], LBP [15], LTP [36], LGBP [12] and LGTP) separately. From the results we observe that our proposed method achieves better classification results on the Yale image dataset when the number of training samples is kept constant on both individual face datasets, as shown in Fig. 12. Fig. 12(a) and (b) show that the recognition rates of the preprocessed face images are higher than those of the non-pretreated face images, because the method combines Gabor features with the advantages of LTP features, uses local spatial histograms to describe the human face, and then applies the X-means algorithm for data processing to further improve the mapping space and reduce dimensionality. Our approach clearly achieves much better performance than the other methods and is robust to the choice of training samples.

6. Conclusions

We formulated the problem of multi-pose classification as a deep learning problem using 2D-Gabor features and a DBN algorithm with a combination NCA classifier. The results show that the volume of data has less impact on pose-variation classification with the proposed method. The image classification experiments show that the average classification accuracies of our algorithm on the three datasets are better when the number of training samples is kept constant on both individual face datasets. The method combines Gabor features with the advantages of LTP features, uses local spatial histograms to describe the human face, and then applies the X-means algorithm for data processing to further improve the mapping space and reduce dimensionality; this enhances the differences between the data subsets and reduces the redundant data between subsets. Therefore, the proposed algorithm has better discriminating power and significantly enhances classification performance, and it is robust to the choice of training samples on both the ORL and Yale datasets. Comparisons with eight algorithms (PCA, 2DPCA, Gabor + 2DPCA, SIFT [21], LBP [15], LTP [36], LGBP [12] and LGTP) show that the proposed method is better at recognizing multi-pose faces without large volumes of data. Some difficulties remain in keeping the neighborhood relations between the frontal space and the side space, which is a point for further research.

Acknowledgement

The authors would like to thank the Chongqing Education Committee of China for supporting this work under foundation program No. KJ1400434.
Fig. 11. Yale face dataset. (a) Yale database and (b) Yale face dataset after pretreatment.
Fig. 12. The classification accuracy rate on Yale face dataset by using PCA, 2DPCA, Gabor + 2DPCA, SIFT, LBP, LTP, LGBP, LGTP and our proposed method. (a) Recognition rates of Yale face dataset without making the pretreatment and (b) Recognition rates of Yale face dataset after light pretreatment.
References

[1] M. Yang, L. Zhang, J. Yang, D. Zhang, Robust sparse coding for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, p. 1.
[2] Miaozhen Lin, Fan Xin, Low resolution face recognition with pose variations using deep belief networks, in: International Congress on Image and Signal Processing (CISP), 2011, pp. 1522–1526.
[3] Y. Li, Y. Du, X. Lin, Kernel-based multifactor analysis for image synthesis and recognition, in: Proceedings of the International Conference on Computer Vision (ICCV), 2005, pp. 114–119.
[4] Cheng Dong-yang, Jiang Xing-hao, Sun Tan-feng, Image classification using multiple kernel learning and sparse coding, J. Shanghai Jiao Tong Univ. 46 (11) (2012) 1789–1793.
[5] A. Wagner, J. Wright, A. Ganesh, et al., Towards a practical face recognition system: robust registration and illumination via sparse representation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 597–604.
[6] S. Baker, I. Matthews, Lucas–Kanade 20 years on: a unifying framework, Int. J. Comput. Vision 56 (3) (2004) 221–255.
[7] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2169–2178.
[8] J. Yang, K. Yu, Y. Gong, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1794–1801.
[9] C. Zhang, J. Liu, Q. Tian, Image classification by non-negative sparse coding, low-rank and sparse decomposition, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1673–1680.
[10] J.J. Wang, J.C. Yang, K. Yu, F.J. Lv, T. Huang, Y.H. Gong, Locality-constrained linear coding for image classification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010, p. 626.
[11] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems (NIPS), 2006, pp. 1–6 (20).
[12] Linlin Shen, Bai Li, Information theory for Gabor feature selection for face recognition, EURASIP J. Appl. Signal Process. (2006) 1–11.
[13] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1350.
[14] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Scholkopf, A.J. Smola, A kernel method for the two-sample problem, in: Advances in Neural Information Processing Systems (NIPS), 2006, pp. 2–3 (20).
[15] B. Quanz, J. Huan, M. Mishra, Knowledge transfer with low-quality data: a feature extraction issue, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2011, pp. 1–6.
[16] B. Quanz, J. Huan, M. Mishra, Knowledge transfer with low-quality data: a feature extraction issue, IEEE Trans. Knowl. Data Eng. 24 (10) (2012) 1–6.
[17] Adam Coates, Andrej Karpathy, Andrew Y. Ng, Emergence of object-selective features in unsupervised feature learning, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 2690–2698.
[18] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf, DeepFace: closing the gap to human-level performance in face verification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[19] Geoffrey E. Hinton, Simon Osindero, Yee-Whye Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (2006) 1527–1530.
[20] R. Salakhutdinov, Geoffrey E. Hinton, Learning a nonlinear embedding by preserving class neighborhood structure, Int. J. Comput. Math. 84 (7) (2007) 1265–1276.
[21] Shengcai Liao, Anil K. Jain, Stan Z. Li, Partial face recognition: alignment-free approach, IEEE Trans. Pattern Anal. Mach. Intell. 35 (5) (2013) 1193–1200.
[22] C. Liu, H. Wechsler, Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition, IEEE Trans. Image Process. 11 (4) (2002) 467–476.
[23] J. Goldberger, S.T. Roweis, G.E. Hinton, Ruslan Salakhutdinov, Neighborhood components analysis, in: Advances in Neural Information Processing Systems (NIPS), 2005, pp. 513–520.
[24] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM TIST 2 (3) (2011) 27.
[25] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: a library for large linear classification, JMLR 9 (2008) 1871–1874.
[26] Y. Freund, R. Schapire, N. Abe, A short introduction to boosting, J. Jpn. Soc. Artif. Intell. 14 (5) (1999) 771–780.
[27] Chaochao Lu, Xiaoou Tang, Surpassing human-level face verification performance on LFW with GaussianFace, 2014.
[28] X.Y. Tan, B. Triggs, Fusing Gabor and LBP feature sets for kernel-based face recognition, in: Proceedings of the 2007 Conference on Analysis and Modeling of Faces and Gestures, Springer, 2007, pp. 235–249.
[29] S.G. Shan, W.C. Zhang, Y. Su, X.L. Chen, W. Gao, Ensemble of piecewise FDA based on spatial histograms of local (Gabor) binary patterns for face recognition, in: Proceedings of the 18th International Conference on Pattern Recognition, IEEE, 2006, pp. 606–609.
[30] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[31] T. Ojala, K. Valkealahti, E. Oja, M. Pietikainen, Texture discrimination with multidimensional distributions of signed gray-level differences, Pattern Recognit. 34 (3) (2001) 727–739.
[32] T. Ahonen, A. Hadid, M. Pietikainen, Face recognition with local binary patterns, in: Proceedings of the European Conference on Computer Vision, 2004, pp. 469–481.
[33] Z.H. Guo, L. Zhang, D. Zhang, S. Zhang, Rotation invariant texture classification using adaptive LBP with directional statistical features, in: Proceedings of the 17th IEEE International Conference on Image Processing, 2010, pp. 285–288.
[34] Z.H. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern operator for texture classification, IEEE Trans. Image Process. 19 (6) (2010) 1657–1660.
[35] T. Ahonen, M. Pietikainen, Soft histograms for local binary patterns, in: Proceedings of the 2007 Finnish Signal Processing Symposium (FINSIG), 2007, pp. 1–4.
[36] X.Y. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, in: Proceedings of the Third International Conference on Analysis and Modeling of Faces and Gestures, Springer, 2007, pp. 168–182.
[37] W.C. Zhang, S.G. Shan, W. Gao, X.L. Chen, H.M. Zhang, Local Gabor binary pattern histogram sequence (LGBPHS): a novel non-statistical model for face representation and recognition, in: Proceedings of the 10th International Conference on Computer Vision, IEEE, 2005, pp. 786–791.
[38] Cao Peng, Li Bo, Zhao Dazhe, Adaptive random subspace ensemble classification aided by X-means clustering, J. Comput. Appl. 33 (2) (2013) 550–553.
[39] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, Tech. Rep. 07-49, University of Massachusetts, Amherst, MA, October 2007. http://vis-www.cs.umass.edu/lfw/
[40] H.A. Moghaddam, T.T. Khajoie, A.H. Rouhi, M.S. Tarzjan, Wavelet correlogram: a new approach for image indexing and retrieval, Pattern Recognit. 38 (2005) 2506–2518.