Surveillance Video Face Recognition with Single Sample per Person Based on 3D Modeling and Blurring

Xiao Hu, Shaohu Peng, Li Wang, Zhao Yang, Zhaowen Li
School of Mechanical and Electrical Engineering, Guangzhou University, Guangzhou 510006, China
Abstract

Video surveillance has attracted increasing interest in the last decade, and video-based Face Recognition (FR) has therefore become an important task. However, surveillance videos contain many blurred non-frontal faces, especially faces viewed from above or below. As a result, most FR algorithms perform worse when applied to surveillance video. Moreover, it is common in video monitoring that only a Single training Sample Per Person (SSPP) is available, e.g. from an identification card. To improve FR under both the SSPP problem and the low-quality problem, this paper proposes an approach that synthesizes face images based on 3D face modeling and blurring. In the proposed algorithm, a high-resolution 2D frontal face is first used to build a 3D face model; several virtual faces with different poses are then synthesized from the 3D model; finally, degraded face images are constructed from the original and virtual faces through a blurring process. Multiple face images can then be chosen from the frontal, virtual and degraded faces to build a training set. Both the SCface and LFW databases were employed to evaluate the proposed algorithm using PCA, FLDA, scale invariant feature transform, compressive sensing and deep learning. The results on both datasets show that the performance of these methods improves when virtual faces are generated to train the classifiers. Furthermore, on the SCface database the average recognition rates increased up to 10%, 16.62%, 13.03%, 19.44% and 23.28%, respectively, for the above-mentioned methods when virtual-view and blurred faces were used to train their classifiers. The experimental results indicate that the proposed method for generating more training samples is effective and could be applied in intelligent video monitoring systems.

Keywords: single training sample per person; video surveillance; scale invariant feature transform; compressive sensing; deep learning.
1. Introduction

With the increasing demand for social security in recent years [1][2], more and more video monitoring systems have been deployed in public squares, college campuses, official buildings, airports, railway stations, hotels and other places [1-5]. To utilize these video resources effectively and in a timely manner, much work has focused on extracting useful information (e.g. face, gait and appearance) from video sequences for applications such as behavior analysis, human tracking and abnormality detection. Among this information, the face, being a low-intrusive biometric feature [5], is one of the most important cues for identification. It is therefore usually detected in video sequences [6] for recognition [7][8] and tracking [9]. Over the last several decades, much progress has been made in face recognition, and the recognition rate is nearly 100% for static face images under certain conditions [5-10]. These techniques have naturally been applied in intelligent monitoring systems [10][11], which can not only automatically analyze video to detect, recognize and track people, but also send key information to a data processing centre, such as a police station, before an event happens. In this way, monitoring officers are spared from simply watching monotonous video. As an important application for the computer vision, biometric recognition and machine learning communities, video face detection and recognition has attracted more and more researchers [1-13]. Unfortunately, detecting and recognizing a face in a surveillance image is extremely challenging because of the unconstrained monitoring environment [12-16]: many low-quality face images with different poses are captured under different illuminations, by different cameras fixed at different distances, and with face variations such as expressions and disguises. Fig.1(a) presents a real video monitoring scenario in which a person approaches a monitoring camera within its field of view.
Fig.1

In Fig.1, L is defined as the distance between the person and the monitoring camera, and H denotes the height at which the camera is installed. After a person enters the camera's field of view, he or she may be at any position within it, so the video sequence contains many faces with different poses. It can be estimated that most are non-frontal views (e.g. single-side faces, looking-down faces and partial faces), and few are full frontal views. In order to monitor a wider field and more people, monitoring cameras are usually installed at a high location, for example at 2~3m, higher than a normal person (the average person's height is below 1.9m). Moreover, people generally do not face the camera directly while walking through its field of view. As a result, most face images captured by surveillance cameras are looking-down faces rather than other views, and the looking-down angle increases as the person walks toward the camera. The first three face images from left to right in Fig.1(b) were captured at 4.2m, 2.6m and 1m, respectively, by video surveillance cameras mounted at a height of 2.25m. The rightmost face image in Fig.1(b) is a high-resolution frontal face captured by a high-resolution camera under normal illumination and with neutral expression. Compared with the exact frontal face, all of the video monitoring faces are looking-down views: the closer the person is to the camera, the larger the looking-down angle. Fig.1 also shows that the spatial resolution of the monitored faces changes with the distance between person and camera: the farther the distance, the lower the spatial resolution, and the poorer the quality of the face image. Based on the above analysis, it remains important to improve the performance of face recognition algorithms that must deal with many looking-down and low-quality faces.
It is beyond question that video face recognition is still a sophisticated open topic. Even so, as two of the most prominent subspace approaches to feature extraction for face recognition, Principal Component Analysis (PCA) [18][19] and Fisher's Linear Discriminant Analysis (FLDA) [20][21] have long been employed in video face recognition. PCA-based face recognition relies on global face features: an Eigenface space can be constructed from the training set by singular value decomposition, even when only one training face is collected per person. However, PCA does not take within-class information into consideration, no matter how many training samples per person are available. FLDA was later proposed as another well-known method for dimensionality reduction and classification; it projects high-dimensional data onto a low-dimensional space with maximum discrimination. The optimal projection is obtained by simultaneously minimizing the within-class distance and maximizing the between-class distance. The dimension of the feature vector derived by FLDA is significantly lower than that derived by PCA, and is less than the number of face classes. Because FLDA explicitly attempts to model the differences between classes, whereas PCA ignores them, FLDA can potentially outperform PCA. However, FLDA needs at least two training faces per class [21]. Moreover, a key component of appearance-based methods is their learning mechanism, whose performance is heavily affected by the number of training samples per person.
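As a concrete illustration, the Eigenface construction described above can be sketched in a few lines of Matlab; this is a minimal sketch for exposition, not the authors' code, and the matrix names and subspace dimension are illustrative assumptions.

```matlab
% Minimal Eigenface sketch: X is D x M, one vectorized training face per
% column (names and the dimension k are illustrative assumptions).
mu = mean(X, 2);
Xc = bsxfun(@minus, X, mu);      % center the training faces
[U, S, V] = svd(Xc, 'econ');     % columns of U are the eigenfaces
k = 50;                          % assumed subspace dimension
W = U(:, 1:k);                   % PCA projection basis
features = W' * Xc;              % low-dimensional features for matching
```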
Besides, the Scale Invariant Feature Transform (SIFT) is another algorithm available for applications such as matching different views of an object or scene (e.g. in binocular vision) and identifying objects [22-26]. SIFT features are based on key points that are independent of scale and of variations in viewpoint. Based on SIFT, D. M. Massimiliano and I. Francesco [25] first established correspondences between feature points extracted from face images and then used the number of correct correspondences to measure the similarity between the images. L. Wu et al. [26] proposed a complete pose binary SIFT to address arbitrary pose variation: five frontal-view face images per subject were selected as the training set, and the binary descriptors of these images were pooled together as the SIFT features of the subject. The SIFT algorithm is often divided into several separate stages. However, several face images per subject need to be obtained in order to achieve high face recognition performance. Unfortunately, the Single Sample Per Person (SSPP) problem [27] is another challenge in intelligent monitoring, owing to the difficulty of collecting several face images per person and the limited storage space of real-world face recognition systems. For example, in some scenarios (e.g. law enforcement, driver license, passport and identification card) only one image per person can be acquired for training. The SSPP problem makes face recognition more difficult because little information can be extracted from a gallery with a single sample per person to predict the variations in the query faces [28]. Since the intra-class variations cannot be well estimated from a single training sample, traditional discriminative subspace learning methods may fail to work well. In this paper, a novel algorithm is proposed to deal with both the SSPP problem and a low-quality testing set containing many looking-down and blurred faces.

2. Literature Review
For the low-quality testing set, there are two common approaches: the degradation method [14] and Super-Resolution (SR) [16]. Generally, the degradation method is applied when the training set consists of high-quality faces while the testing set consists of low-quality ones. In this method, the high-resolution training faces are down-sampled and then blurred by convolution with a Gaussian blur function to form virtual degraded face images that match the testing faces in the low-quality space. Conversely, SR reconstructs a high-resolution face from low-resolution faces, and has therefore become an important research direction in face recognition. However, one of its limitations is that a series of low-quality faces or a video sequence has to be captured to reconstruct the high-resolution face image.

To deal with SSPP, many algorithms produce virtual training samples from the single real sample of each person [27-49]. Virtual samples aim to estimate the intra-class face variations by synthesizing extra samples for each subject, for example to provide intra-class scatter for every class in FLDA. Furthermore, more training samples can make algorithms such as CS [41] and deep neural networks (DNNs) [44] more effective. There are generally two classes of approaches: image geometric transforms [27][29] and image processing [33-36].

2.1 Image geometric transforms

Image geometric transform approaches map the pixel positions of one training image to other virtual images through geometric operations such as scaling, rotation and down-sampling, without significantly changing the pixel intensities of the training image. To construct a training matrix from a single sample per person, a bilateral projection algorithm [27] was applied to translate one 2D face image into two 1D vectors, first row-then-column and first column-then-row. X. Hu et al. [18] rotated the face image to form more training faces for every person: one face image was rotated by θ degrees in the image plane using bilinear interpolation to obtain several face images with different orientations.
H. Yin et al. [21] proposed sampled FLDA, in which each face image in the training set was partitioned into several sub-images by sampling at intervals in height and width. Y. Wang et al. [29] divided each training sample transversely into several parts and created virtual training samples before generating a sparse representation from the original and virtual samples. Among the above-mentioned algorithms, the Rotated Face Method (RFM) [18] can clearly construct virtual samples to match the multi-pose faces in the testing set.

2.2 Image processing

Unlike geometric transforms, image processing methods obtain virtual training samples by filtering, decomposition, linear combination, intensity transformation or 3D modeling; the pixel intensities of the virtual face images are therefore changed. As an extension of the standard Eigenface technique, (PC)2A first combined the original face image with its first-order projected image, and was later extended to a method named E(PC)2A [19]. In both algorithms, a projection map is first produced from the vertical and horizontal integral projections, and then virtual samples are formed from the original face and its projection map. More recently, X. Chen et al. [30] utilized multi-directional orthogonal gradient phase faces to handle illumination-invariant single-sample face recognition; an illumination-insensitive orthogonal gradient phase face is obtained from two perpendicular directional gradients of the original image. In order to constitute a variational feature representation from the single sample of every person, R. X. Ding et al. [31] employed a linear regression model to fit the variational information of a non-ideal probe sample with respect to an ideal gallery sample. To a certain extent these methods can deal with the SSPP problem; however, they do not aim at handling multi-pose faces, so their performance degrades when confronted with low-quality surveillance video faces.
Fortunately, 3D face reconstruction [33] has become an effective tool for dealing with SSPP in recent years. In the analysis-by-synthesis framework introduced by D. Jiang et al. [34], a personalized 3D face model is first constructed from a single frontal 2D face image with neutral expression and normal illumination, and realistic virtual faces with different poses, illuminations and expressions are then synthesized from the personalized 3D face. F. Abdolali et al. [35] synthesized 12 different poses of right and left profile images from one frontal image. Besides, C. Hua et al. [36] first used lower-upper decomposition to decompose the single sample into two basis image sets, then reconstructed two approximation images from them, so that the new training set consisted of the single sample and its two approximation images for each person. One function of 3D face reconstruction is to create different views of a face through its 3D shape; hence, 3D modeling is a suitable approach for dealing with looking-down faces.

2.3 Pattern Recognition

Although the methods mentioned above can to some extent handle the SSPP problem and a low-quality testing set, exploring the pattern recognition algorithm itself is also important. The nearest neighbor classifier based on Euclidean distance, Support Vector Machines (SVM) and Artificial Neural Networks (ANNs) are commonly used face recognition classifiers. In recent years, new theories have been proposed to build effective classifiers for face recognition [37-40]. B. Wang et al. [38] proposed an adaptive linear regression classifier in which a probe image is represented as a linear combination of the single class-specific gallery image and the intra-personal variations of expression, illumination and disguise. Besides, CS [41-43] and deep learning [44-50] have become two powerful
tools in pattern recognition.

Compressive Sensing

Since D. L. Donoho [51] proposed Compressive Sensing (CS) in 2006, more and more CS-based algorithms have been developed for image analysis [52]. Because one feature of CS is that sparse signals and images can be reconstructed from what was previously believed to be incomplete information, CS has been developed for face recognition that is robust to face variations [52-56]. In the Sparse Representation based Classification (SRC) proposed by J. Wright et al. [41], the sparse representation matrix consists of the face images in the training set, built as follows. Suppose di (i = 1, ..., K) is the column vector of the ith high-resolution face, where K denotes the number of classes. According to CS theory, a face dictionary Dface can be constructed as
$$D_{face} = [d_1, d_2, \ldots, d_K] \in \mathbb{R}^{N \times K} \qquad (1)$$
For dealing with face variation, most CS-based face recognition research has focused on how to design a generic dictionary [52-59]. Taking advantage of a Shearlet network, M. A. Borgi et al. [53] first extracted facial features to generate a dictionary of features associated with a range of locations, scales and orientations, and then applied an l0 regularization routine to handle the recognition task. S. Liao et al. [56] used multi-keypoint descriptors to develop an alignment-free face representation method for the partial faces frequently captured in unconstrained scenarios; the dictionary was composed of multi-keypoint descriptors constructed from holistic or partial faces. M. Yang et al. [57] proposed a sparse variation dictionary learning method that jointly learns a projection connecting the generic training set with the gallery set; the learnt sparse variation dictionary better handles the variations in face images. To represent the query face image set effectively and efficiently using the gallery face image sets, P. Zhu et al. [58] proposed a novel image-set-based collaborative representation
that models the query set as a convex or regularized hull. Because sparse signals and images can be reconstructed from incomplete information, P. Zhu et al. [43] reduced the effects of SSPP by constructing a local gallery dictionary from the neighboring patches of the gallery dataset and an intra-class variation dictionary from an external generic dataset to predict possible facial variations. Moreover, with the help of virtual faces, CS has been extended to deal with the SSPP problem and a low-quality testing set. Given virtual faces vi (i = 1, ..., K; suppose one virtual face per person) generated from the high-resolution faces, the virtual face dictionary Dvir can be constructed as
$$D_{vir} = [v_1, v_2, \ldots, v_K] \in \mathbb{R}^{N \times K} \qquad (2)$$
If an illumination compensation dictionary Dill is also taken into consideration, the whole dictionary D can be integrated as
$$D = [D_{face}, D_{vir}, D_{ill}] \qquad (3)$$
The sparse code and residual error are then computed according to Equ(4):

$$\min_{x_i} \; \| y_i - D x_i \|_2^2 \quad \text{s.t.} \quad \| x_i \|_0 \le T_0 \qquad (4)$$
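To make Equs (1)-(4) concrete, the following Matlab sketch classifies a test face by greedy sparse coding over the dictionary D and class-wise residuals. It is a minimal illustration under our own assumptions (the function names, the OMP solver and the class-residual rule follow the SRC idea of [41]), not the authors' implementation.

```matlab
function label = src_classify(y, D, classes, T0)
% y       : test face (N x 1); D: dictionary (N x M), one face per column
% classes : 1 x M class id of each dictionary atom; T0: sparsity in Equ (4)
    D = bsxfun(@rdivide, D, sqrt(sum(D.^2, 1)));  % unit-norm atoms
    x = omp(y, D, T0);                            % approximate Equ (4)
    labels = unique(classes);
    res = zeros(numel(labels), 1);
    for k = 1:numel(labels)
        xk = x;
        xk(classes ~= labels(k)) = 0;             % keep class-k coefficients
        res(k) = norm(y - D * xk);                % class-wise residual error
    end
    [~, idx] = min(res);                          % minimum residual decides
    label = labels(idx);
end

function x = omp(y, D, T0)
% Orthogonal Matching Pursuit for min ||y - Dx||_2 s.t. ||x||_0 <= T0
    r = y;  S = [];  x = zeros(size(D, 2), 1);
    for t = 1:T0
        [~, j] = max(abs(D' * r));  % atom most correlated with the residual
        S = unique([S, j]);
        x(S) = D(:, S) \ y;         % least-squares fit on the selected atoms
        r = y - D(:, S) * x(S);
    end
end
```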
Finally, the test face yi is assigned to the class whose training atoms yield the minimum residual error.

Deep Learning

With the emergence of deep learning, deep neural networks (DNNs) have achieved great successes in image classification, speech recognition and face recognition in recent years [44-50]. As a typical building block of DNNs, the Denoising Auto-Encoder (DAE) [45] can overcome some limitations of auto-encoders [46], such as overfitting, by introducing stochastic noise into the training samples. The Stacked Denoising Auto-Encoder (SDAE) allows DNNs to learn more complex mappings from the input to hidden representations [47]. Y. Kang et al. [48] used an SDAE to convert non-frontal face images
into frontal face images. Inspired by the SDAE's capability to learn patterns from noisy data, Y. Zhang et al. [49] used an iterative SDAE (ISDAE) to investigate how to recognize faces with partial occlusions; compared with state-of-the-art approaches (e.g. sparse representation), ISDAE achieves competitive results under serious occlusion. Motivated by the success of DAE-based DNNs and driven by SSPP face recognition, S. Gao et al. [50] proposed a supervised auto-encoder to build the DNNs. They treated faces with all types of variation as images contaminated by noise; with a supervised auto-encoder, the face without the variation can be recovered. Meanwhile, such a supervised auto-encoder also extracts robust features for SSPP: the features of the same person should be the same or similar, and should be stable to the intra-class variations that commonly exist in the SSPP scenario.
3. Algorithm to create the virtual training faces
Although there are many algorithms for creating virtual training faces, 3D modeling may be the most feasible and effective when virtual training faces with different poses, including looking-down poses, must be created from a single frontal training face. Fig.2 shows the flowchart for creating more virtual training samples. First, a 3D face model is created from a 2D face by 3D face modeling. Once the corresponding 3D face shape is obtained, it is rotated to several views to form virtual face poses, such as a looking-down pose. Finally, from the rotated 3D face shapes, the corresponding virtual 2D faces are segmented as candidate training faces. These segmented virtual faces serve as the training set for PCA and FLDA or as the dictionary for CS.

Fig.2
3.1 3D face modeling

Here 3D face modeling is used to transform a known 2D high-resolution frontal face image into a 3D face shape [59-62].

Space mapping relationship

In order to create a 3D face shape, three coordinate systems have to be taken into consideration to form a mapping relationship between the frontal face and the 3D face shape: the camera coordinates (x, y, z), the image/retinal coordinates (u, v) and the world coordinates (X, Y, Z). The corresponding axes of the three coordinate systems are parallel, and the z-axis is identical to the Z-axis. In camera coordinates the optical center C is the origin, while in world coordinates the origin is O (for a face, the nose tip can serve as the origin). Their space mapping relationship is shown in Fig.3.

Fig.3

In Fig.3, C represents the optical center, R the retinal plane, and f the focal distance. Suppose A is a real scene point with world coordinates (XA, YA, ZA), and a, with coordinates (ua, va), is the image point to which A is mapped on the retinal plane. Once the space mapping relationship between the retinal plane and the world coordinates is established, the 3D face shape can be modeled from its 2D face. According to similar triangles and the parallelism of the three coordinate systems, the coordinate translation between A (XA, YA, ZA) and a (ua, va) can be derived as
$$X_A = \frac{h}{f} u_a, \quad Y_A = \frac{h}{f} v_a, \quad Z_A = \frac{h}{f} d_a \qquad (5)$$
Here ZA is the Z-coordinate of A in the world coordinate system and describes the depth of the facial features of a real face. By analogy with ua→XA and va→YA, da is taken as the value corresponding to ZA. Of course, for a plane coordinate system (u, v), da cannot be measured directly; in this paper the virtual value da is regarded as the depth information of a if a were placed in 3D space. f is the focal distance of the camera and h is the distance between the optical center and the origin of the world coordinate system; h/f can be replaced by a constant β. Hence, Equ(5) becomes Equ(6):
$$[X_A, Y_A, Z_A]^T = \beta \, [u_a, v_a, d_a]^T \qquad (6)$$
Given a frontal face, every pixel (u, v) on the retinal plane is known, so the corresponding (X, Y, Z) in world coordinates can be calculated once the depth values are estimated, and the 3D face shape can then be formed. Therefore, the other important step of 3D face modeling is depth estimation.
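Under the mapping of Equ (6), lifting a frontal face to a 3D shape reduces to scaling the pixel grid and an estimated depth map by β. A minimal Matlab sketch follows; the depth map d, the value of β and all names are illustrative assumptions, since the actual depth comes from the reference-shape matching described next.

```matlab
% Minimal sketch of Equ (6): lift a 2D frontal face to a 3D point set,
% assuming a depth map d of the same size has already been estimated.
[u, v] = meshgrid(1:size(face, 2), 1:size(face, 1));
beta = 1.0;            % assumed constant h/f from Equ (5)
X = beta * u;          % world X from image column
Y = beta * v;          % world Y from image row
Z = beta * d;          % world Z from the estimated depth map
% The textured shape (X, Y, Z) can then be rotated, e.g. about the
% horizontal axis to simulate a looking-down view, and re-projected
% to obtain a virtual 2D face.
```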
Depth Estimation

Many algorithms for depth recovery have been developed over the last decades. For example, I. K. Shlizerman et al. [33] used the albedo to recover da. S. Y. Baek et al. [61] selected 18 points on the face and measured 40 different parameters to set up anthropometric data; finally, with the help of PCA, da was estimated. The choice of approach may vary according to the application for which the reconstruction is used. Usually, a 3D face database first has to be built as a set of reference shapes for the 3D model. Then, by relating the information gathered from the frontal image to the faces in the reference shape set, the best-matching reference 3D shape is chosen as the 3D shape of the frontal image, and the Z values of the chosen shape are regarded as the depth information of the input frontal face. Some algorithms have been introduced to choose a reference 3D shape automatically; some use feature vectors (such as feature points and silhouettes) and some
use image-based techniques. After a 3D shape is selected for the input frontal face, the texture is mapped onto the 3D face shape according to the locations and depth information of the face components. This is an intricate process. First, the face texture is extracted from the input image, and several points around the eyes, nose, eyebrows, mouth and boundary of the face are located in the input face image. Then, guided by these control points, the face texture is projected onto the surface of the selected 3D shape. In this paper, all 3D face shapes are simulated by the FaceGen Modeller (downloaded from http://facegen.com/index.htm), which contains facial measurements of various people.

3.2 Simulating degraded faces

Face images obtained by an outdoor surveillance camera often suffer severe degradation (e.g. low resolution, low contrast, blur and noise), so the performance of face recognition systems becomes significantly worse. Although SR [16] has been employed in the past to obtain higher-resolution probe faces, it faces difficulty here because it needs successive frames of a video. On the other hand, the gallery images, taken indoors at close range, have high resolution, and the virtual gallery images created from them by 3D modeling also have high resolution. Matching a high-resolution face directly against a low-quality face is undesirable. Therefore, in order to increase the performance of the face recognition system, corresponding degraded faces were formed from the original frontal faces and their virtual looking-down faces to better match the blurred probe faces. A degraded image can be simulated by convolving a high-resolution image f(x, y) with a point-spread function h(x, y).
h(x, y), also called the Gaussian blur function, is defined as

$$h(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/(2\sigma^2)} \qquad (7)$$
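A minimal Matlab sketch of this degradation step (down-sampling followed by convolution with the kernel of Equ (7)); the scale factor and σ are illustrative assumptions, with σ tuned as discussed in Section 4:

```matlab
sigma = 1.5;                              % assumed blur level
ksize = 2 * ceil(3 * sigma) + 1;          % kernel covering +/- 3 sigma
h = fspecial('gaussian', ksize, sigma);   % Equ (7), normalized to sum 1
low = imresize(face, 0.5);                % simulate lower spatial resolution
g = imfilter(low, h, 'replicate');        % degraded image g = f * h
```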
In order to find an optimal value of σ for the original and virtual faces, so that they match well with testing faces from different camera distances, a measure β (sum of histogram differences) was defined based on the intensity histograms of the normalized, blurred, down-sampled training images [14]:
$$\beta = \frac{1}{m_k n_k} \sum_{j=1}^{m_k} \sum_{i=1}^{n_k} \mathrm{sum}\{\, | Tr_{i,k} - Te_{j,k} | \,\} \qquad (8)$$
where

$$Tr_{i,k} = \mathrm{Hist}(\mathrm{Norm}(g_{i,k}(x, y))) \qquad (9)$$
Here g_{i,k}(x, y) is the degraded training image obtained according to Equ (7) from the ith face of the kth subject, and m_k and n_k are the numbers of testing and training faces of the kth subject, respectively. Hist() denotes the histogram and Norm() a normalization operation defined as
$$G(x, y) = \begin{cases} M_0 + \sqrt{\dfrac{Var_0 \,(g(x, y) - M)^2}{Var}}, & \text{if } g(x, y) > M \\[1ex] M_0 - \sqrt{\dfrac{Var_0 \,(g(x, y) - M)^2}{Var}}, & \text{otherwise} \end{cases} \qquad (10)$$
where M0 and Var0 are the desired mean and variance values, respectively, and M and Var are the mean and variance of g(x, y). Normalization is a pixel-wise operation; it does not change the clarity of the ridge and furrow structures. Its main purpose is to reduce the variations in gray level along ridges and furrows.
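A minimal Matlab sketch of the normalization of Equ (10) (M0 and Var0 are the desired statistics; the function name is illustrative):

```matlab
function G = norm_face(g, M0, Var0)
% Pixel-wise normalization of Equ (10) applied to a gray image g.
    g = double(g);
    M = mean(g(:));  V = var(g(:));
    d = sqrt(Var0 * (g - M).^2 / V);   % deviation term of Equ (10)
    G = M0 + d;                        % branch for g(x,y) > M
    G(g <= M) = M0 - d(g <= M);        % branch for g(x,y) <= M
end
```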
4. Experiments and results

This paper is interested in real surveillance video faces, from which it is difficult for traditional methods to obtain enough information for recognition. The chosen databases should therefore include faces captured under different illuminations, with different poses and from different distances, resulting in very ambiguous low-quality faces. Many datasets, such as FERET [63], AR, LFW [64] and MORPH [65], are available; however, the faces in most of these databases are of high resolution and clear enough to be recognized easily. In contrast, all faces in the SCface database were captured in a real surveillance situation and are ambiguous, and the faces from the cameras fixed at the farthest distance are very blurry and of low resolution. Besides, faces from FERET and AR are difficult to model into 3D shapes from 2D images because they are grayscale.
Although the color faces in the LFW [64] and MORPH [65] databases were captured in the real world, the faces in both databases are of high quality. The MORPH database was collected over a span of 5 years, with numerous images of the same subject, and contains metadata in the form of age, gender, race, height, weight and eye coordinates; on the other hand, MORPH seldom includes multi-view faces. Therefore, in this paper both the SCface and LFW databases are used to evaluate the proposed method. All program code was written in Matlab and run on an Intel(R) i7-4720HQ CPU @ 2.60GHz with 8GB memory and the 64-bit Windows 8 operating system.

Table 1
Fig.4

4.1 LFW database

The LFW database contains images of 5749 different individuals in uncontrolled environments [64]. LFW-a is a version of LFW aligned with commercial face alignment software. 45 subjects were randomly selected from LFW-a to build a database set. For each subject, all of its faces in the LFW database set were taken, and the numbers of faces are listed in Table 1. From these faces, one frontal or near-frontal face per subject was manually selected for training and the other samples were used for testing, giving 174 testing faces in all. Fig.4(a) shows the 45 frontal or near-frontal faces; clearly, their expressions are not neutral. Fig.4(b) gives 45 testing faces with different poses for the first 17 subjects, in the order of Fig.4(a). All LFW faces used for training and testing are of high resolution and high quality. This paper uses one face to match faces with different poses. In order to obtain more virtual training faces, the selected near-frontal face was employed to build a 3D face shape, from which 8 poses of virtual faces were created. All of the images, including the 8 virtual faces, were
first cropped to 60×60 and then histogram-normalized. Fig.5 shows the different poses of virtual faces formed from the frontal face. The 8 virtual poses are looking-down, looking-left-down, looking-left, looking-left-up, looking-up, looking-right-up, looking-right and looking-right-down. There are thus 9 faces in all that can be used as training faces.
Fig.5
From Fig.5 it is hard to see any difference between the original face and its 8 virtual faces apart from the view. From their histograms, however, it is easy to see that the histograms of the virtual faces differ considerably from that of the original face. Fig.6(a) shows the histogram difference between the virtual faces and the original face: the histogram of the original face lies in the middle of the horizontal axis while those of the virtual faces lie in the low segment, and the virtual faces have almost no pixels above intensity 200 whereas the original face does. So, to reduce the difference from the original face, the 8 virtual poses of every subject were first normalized according to Equ(10). The normalized virtual faces and their histograms are plotted in Fig.6(b); after normalization, the histograms are shifted right, and their mean values and variances match those of the original face. Then all training faces were processed by histogram equalization. Fig.6(c) shows these faces and their histograms, which have become similar to that of the original face. Table 2 compares the sum of histogram differences (β) between the virtual faces and the original face after histogram equalization; the values were calculated according to Equ(8).

Fig.6
Table 2
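A minimal sketch of this two-step pre-processing, reusing the norm_face sketch from Section 3.2; the file name and the cell array of virtual faces are illustrative assumptions:

```matlab
orig = im2double(rgb2gray(imread('subject01_frontal.jpg'))); % assumed file
M0 = mean(orig(:));  V0 = var(orig(:));
for p = 1:8                                  % the 8 synthesized poses
    v = im2double(virtualFaces{p});          % assumed cell of gray faces
    v = norm_face(v, M0, V0);                % Equ (10): match original stats
    v = min(max(v, 0), 1);                   % clip to [0,1] for histeq
    virtualFaces{p} = histeq(v);             % then histogram equalization
end
```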
Obviously, the similarity between the virtual faces and the original face improves after the virtual faces are normalized according to Equ(10) and then histogram-equalized. In this paper, therefore, all virtual faces were processed by normalization and histogram equalization before being used to train a classifier. Because all testing faces in this database are of high resolution and quality, the training faces did not need to be blurred. The strategies for arranging the training faces used to evaluate the proposed algorithm, together with the results on the LFW database, are shown in Table 3. From the results, it can be seen that a classifier works better after 8 poses of virtual faces join its training set, yielding higher recognition rates for PCA, FLDA and SRC.

Table 3

Table 3 shows that the recognition rates of the three methods increased by 4.59%, 8.04% (compared with PCA trained on one original sample) and 9.77% when the 8 virtual samples were used to train the classifiers. When traditional face recognition algorithms are applied to real video faces, the virtual faces produced by 3D face modeling are more helpful for improving performance than RFM, because the 3D face model can create multi-view virtual faces while RFM forms virtual faces that lie in the same plane as the original face.

4.2 SCface database

The SCface database [17] contains 4,160 static images (under uncontrolled illumination conditions) of 130 subjects. Images from cameras of different quality and resolution mimic real-world conditions and enable robust testing of face recognition algorithms, emphasizing different law enforcement and surveillance use-case scenarios. All cameras were placed slightly above the subject's head, about 2.25m from the ground. Face images were taken at three distances between camera and subject: 4.2 meters, 2.6 meters and 1 meter. At every distance, 7 cameras captured face images; however, in this paper only the 5 face images from the 5 visible-spectrum cameras at every distance were used. Hence, of the 1950 face images employed to construct the testing set, there were 15 probe image sets.
Every testing subset included 130 face images. Fig.1 shows one volunteer's face images from the SCface database. Obviously, the farther away a face image was taken, the lower its resolution. The first three face images from left to right in Fig.1(b) were captured at 4.2m, 2.6m and 1m, respectively, by video surveillance cameras at a height of 2.25m, and their corresponding resolutions are 75×100, 108×144 and 168×244. The rightmost face in Fig.1(b) is a high-resolution frontal face captured by a high-resolution camera under normal illumination and with neutral expression. The database contains in all 130 frontal high-resolution images (1200×1600), one per person; these frontal face images constituted the training set and were used to create the virtual training faces.

Fig.7
4.2.1 Experimental scheme of pre-processing

The experimental scheme is shown in Fig.7. After two virtual 2D faces per person were segmented from their two corresponding 3D faces by the method shown in Fig.2, all color faces, including the original high-resolution faces, the two virtual 2D faces and all low-quality video monitoring faces, were uniformly normalized by the following procedure [66]: first, color images were converted to gray images; then all gray images were rotated according to the eye coordinates so that the two eyes lie on the same horizontal line; finally, a double elliptical mask was used to segment the gray images and scale them to 64×64 pixels, placing the left and right eyes at positions (16,16) and (16,48), respectively. Images whose resolution was lower than 64×64 were scaled up to 64×64 by bilinear interpolation.
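A simplified Matlab sketch of this geometric normalization, assuming the eye coordinates eyeL and eyeR are known; the ellipse parameters of the mask are illustrative assumptions rather than the exact values used by [66]:

```matlab
g = rgb2gray(im);                                % color to gray
ang = atan2d(eyeR(2) - eyeL(2), eyeR(1) - eyeL(1));
g = imrotate(g, ang, 'bilinear', 'crop');        % level the eye line
g = imresize(g, [64 64], 'bilinear');            % eyes near (16,16), (16,48)
[c, r] = meshgrid(1:64, 1:64);                   % double elliptical mask
mask = ((c-32).^2/30^2 + (r-28).^2/26^2) <= 1 ...
     | ((c-32).^2/22^2 + (r-44).^2/18^2) <= 1;   % assumed ellipse parameters
g(~mask) = 0;                                    % suppress the background
```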
After geometric normalization, the three training face images per person (one original face and two virtual faces, shown in Fig.7(b)) were degraded into the corresponding blurred training faces shown in Fig.7(d) according to Equ(7). There were thus six candidate training faces per person: one original face, two virtual faces and three blurred faces. If the original face is regarded as the 0th face, the two virtual faces are the 1st and 2nd training faces, corresponding to the two poses, and the three blurred faces are likewise called the 0th, 1st and 2nd blurred faces. Finally, the original face and the virtual faces, normalized according to Equ(10), were used to match the corresponding testing faces. All face images, both training and testing, were processed by histogram equalization to reduce the effect of varying illumination.

4.2.2 Experimental scheme of face recognition and its results

All video monitoring faces captured in the visible spectrum in the SCface database were recognized in this paper. There were 130 video monitoring faces per camera, hence 1950 faces for the 15 cameras. The testing set was divided into three subsets according to the distance between subject and camera: the 4.2m, 2.6m and 1m testing sets. Testing faces from the three subsets differ in resolution; the farther the distance, the lower the resolution. The testing faces from the 1m subset show a looking-down pose with a large looking-down angle, so the 2nd virtual face and the 2nd blurred face were used as training faces together with the original face (the 0th face). The testing faces from the 2.6m subset show a slight looking-down angle, so the 1st virtual face and the 1st blurred face were used together with the original face. The testing faces from the 4.2m subset can be regarded as frontal, so their training set contained only the original face and its blurred face. The details can be seen in Table 4, Table 5 and Table 6, which also report the recognition rates (%) of the video monitoring faces captured at 1m, 2.6m and 4.2m, respectively. Cam n denotes the nth camera. All testing faces were identified by PCA, FLDA, SIFT,
SRC and SDAE.

Table 4
Table 5
Table 6
Fig.8

Fig.8 shows the matched SIFT points between training and testing faces of one subject. First, the SIFT key points and their descriptors were extracted from the two face images (one training, one testing). Second, dot products and their inverse cosines were calculated between each key point in one face and the key points in the other face, yielding a set of vector angles. Finally, if the ratio of vector angles from the nearest neighbor to the second-nearest neighbor was less than a distance ratio (set to 0.6 in this paper), the two key points corresponding to the nearest neighbor were considered matched. The green lines in Fig.8 connect the matched key points between two face images. Fig.8 (a), (b) and (d) show the matched key points between the original training face and the monitoring faces: clearly, the closer the monitoring face is to frontal, the more matched key points can be found against the frontal original face. Although the monitoring faces captured at 1m have high resolution, almost no matched points were detected between them and the original frontal face, owing to their steep looking-down view. Fig.8(c) shows the matched key points between the 1st virtual face and the monitoring faces captured at 2.6m, and Fig.8(e) those between the 2nd virtual face and the monitoring faces captured at 1m. From Fig.8(c) and (e), one encouraging conclusion can be drawn: an appropriate view change of the training face helps to find more matched key points between training face and testing face.
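A minimal sketch of this matching step, assuming VLFeat's vl_sift is available on the Matlab path (the 0.6 ratio follows the text; all other names are illustrative):

```matlab
[~, Dtr] = vl_sift(single(trainFace));     % 128-d descriptors per column
[~, Dte] = vl_sift(single(testFace));
Dtr = double(Dtr);  Dte = double(Dte);
Dtr = bsxfun(@rdivide, Dtr, sqrt(sum(Dtr.^2, 1)));  % unit norm: dot = cosine
Dte = bsxfun(@rdivide, Dte, sqrt(sum(Dte.^2, 1)));
matches = 0;
for i = 1:size(Dte, 2)
    ang = acos(min(1, Dtr' * Dte(:, i)));  % vector angles to all train keys
    s = sort(ang);
    if numel(s) >= 2 && s(1) / s(2) < 0.6  % nearest vs second-nearest ratio
        matches = matches + 1;             % accept as a matched key point
    end
end
```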
Note that an inappropriate view change can make things worse; for example, no key point was found between the monitoring faces captured at 4.2m and the 1st or 2nd virtual face. Besides, blurring the training faces did not effectively improve the SIFT algorithm, so for SIFT none of the training faces were blurred.

The 130 original frontal face images were too few to learn good weights for the SDAE algorithm. In order to train an SDAE with 400 hidden units, two schemes were designed. Scheme I contained 33 faces per subject: one original frontal face, one frontal face contaminated with salt-and-pepper noise, one contaminated with Gaussian noise, and 30 (3×10) blurred versions of these three faces produced by a Gaussian function with standard deviations [0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5]. Besides the training faces of Scheme I, Scheme II also included the two virtual looking-down faces and their 32×2 blurred or contaminated versions. The original dataset of only 130 images was thus expanded into a large training dataset of 12870 (130×99) face images for Scheme II, while Scheme I contained only 4290 (130×33). Fig.9 shows the weights of the hidden layer of the SDAE.

Fig.9
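A minimal sketch of a Scheme II style augmentation for one subject, assuming 'faces' holds the original frontal face plus the two virtual looking-down faces as gray images; names are illustrative:

```matlab
sigmas = [0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5];
aug = {};
for i = 1:numel(faces)
    f = im2double(faces{i});
    variants = {f, imnoise(f, 'salt & pepper'), imnoise(f, 'gaussian')};
    for v = 1:numel(variants)
        aug{end+1} = variants{v};                     % clean/noisy face
        for s = sigmas                                % 10 blur levels each
            aug{end+1} = imgaussfilt(variants{v}, s); % Gaussian-blurred copy
        end
    end
end
```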
Table 4 shows that, for the testing faces captured at a camera distance of 1m, the mean recognition rates increased by 1.04%, 6.58%, 4.78%, 4.77% and 4.77%, respectively, for the five methods (for FLDA, compared with PCA trained on one original sample) when the virtual-view samples were used for training, and increased further by 4%, 1.7% and 3.69% for PCA, FLDA and SRC when both the blurred face and the virtual face were employed. However, adding blurred faces to the training set did not improve the SIFT algorithm. Table 5 shows that, for the testing faces captured at 2.6 meters, the mean recognition rates increased by 5.36%, 14.59%, 4.31%, 2.46% and 4.3%, respectively, for the five methods when the virtual samples were used for training, and increased further by 0.46%, 2% and 4% for PCA, FLDA and SRC when both the blurred and virtual faces were employed. Table 6 shows that, for the testing faces captured at 4.2 meters, the mean recognition rates increased by 4.81%, 10.64%, 4.48% and 1.85%, respectively, for PCA, FLDA, SRC and SDAE when blurred or virtual-view faces were employed for training. As in Table 4 and Table 5, the SIFT algorithm was not improved by blurred training faces. Obviously, owing to the low quality, low resolution, varying illumination and varying poses, the face recognition rates on the SCface database are lower than on other static face databases. Nevertheless, after several virtual looking-down face images per person were added to the training set, the video face recognition algorithms (PCA, FLDA, SIFT, SRC and SDAE) performed better than with only one sample per person, and their recognition rates increased; adding the corresponding blurred face images improved the recognition rates further.

Fig.10
The mean recognition rates of the 5 cameras at each camera distance are listed in Table 4, Table 5 and Table 6, and the recognition rates over all test faces from the 15 cameras at the three distances were calculated from the results in Tables 4-6. When both blurred faces and virtual-view faces were employed for training, the average recognition rates increased to about 10%, 16.62%, 19.44% and 23.28%, from the 4.78% reported by M. Grgic [17], for the PCA, FLDA, SRC and SDAE methods, respectively. The recognition rate of the SIFT algorithm rose from 4.78% to 13.03%, even though only the virtual-view faces and original faces were used for training. Fig.10 visualizes the improvement over all testing faces from the three distances when PCA, FLDA, SIFT and SRC were trained with one original face, with two samples per person (original face + the 1st virtual face), and with three samples per person (original face + the 1st virtual face + the 1st blurred face). In Fig.10, the blue, green and purple bars represent one training sample, original + virtual face, and original + virtual + blurred face per person, respectively. However, the 130 original frontal face images were too few to learn appropriate weights for the SDAE algorithm, so Fig.10 reports the two special schemes (I and II) for SDAE, whose results are presented by the green and purple bars. The figure shows that the recognition rates of all methods increased when the training set included virtual faces, and increased further when it also included a blurred face per person. Table 7 lists the standard deviation (σ) of the Gaussian blur function that yielded the best recognition rates in Table 4, Table 5 and Table 6. Table 7 suggests that σ should be set larger for probe face images captured at farther positions, because the farther the camera distance, the more blurred the face. However, σ was not the same for the same distance; there were slight differences because different cameras have different characteristics. It appears that camera characteristics may affect face recognition; in any case, a larger σ should be chosen for farther distances in order to better match the test faces from farther cameras.

Table 7
5. Conclusion

In real video monitoring environments, there are many low-quality faces with different poses, especially looking-down faces. Hence, this paper employed 3D modeling to create virtual training samples similar to looking-down faces from one original face, and then degraded some training faces to match the blurred testing faces from cameras mounted at farther positions. First, because two or more samples per subject become available for training, FLDA is able to estimate the intra-class variations and explicitly model the differences between face classes, improving the recognition rate. Second, the dictionary in compressive sensing can include more samples with different poses, so compressive sensing gains in performance. Third, deep learning, with its powerful learning capacity, improved considerably when the simulated looking-down faces and their derived faces were put into its training set. Although SIFT can cope with view variations, its performance also increased when the corresponding virtual looking-down faces were used in the matching set together with the original frontal faces. The recognition rates are still not high enough; nevertheless, the proposed approach of producing virtual training faces effectively improved recognition performance, bringing video face recognition with a single training sample closer to the real video surveillance environment.

Conflict of interest

None declared.

Acknowledgments

This project was supported by the China Natural Science Foundation (No.61501177), the Guangdong Natural Science Foundation (No.S2013010013511), the Science and Technology Planning Project of Guangzhou (No.2014J4100127), the China Scholarship Council (No.201408440342) and the Guangzhou Key Laboratory (No.201605030014). The authors would like to thank Shaokang Chen and Arnold Wiliem, who helped revise the manuscript, the anonymous reviewers for their thoughtful and constructive remarks, and the designers of the FaceGen Modeller (http://facegen.com/index.htm).
References

[1] M. D. Torre, E. Granger, P. V. W. Radtke, R. Sabourin, D. O. Gorodnichy, Partially-supervised learning from facial trajectories for face recognition in video surveillance, Information Fusion. 24 (2015) 31-53.
[2] P. V. W. Radtke, E. Granger, R. Sabourin, D. O. Gorodnichy, Skew-sensitive boolean combination for adaptive ensembles - An application to face recognition in video surveillance, Information Fusion. 20 (2014) 31-48.
[3] X. Hu, S. H. Peng, J. Y. Yan, N. Zhang, Fast face detection based on skin color segmentation using single chrominance Cr, in: The 7th International Congress on Image and Signal Processing, 2014, pp. 789-794.
[4] F. Porikli, F. Bremond, S. L. Dockstader, J. Ferryman, A. Hoogs, B. C. Lovell, S. Pankanti, B. Rinner, P. Tu, P. L. Venetianer, Video surveillance: Past, present, and now the future, IEEE Signal Processing Magazine. 30(3) (2013) 190-198.
[5] X. Hu, W. Yu, J. Yao, Face recognition using binary structure-based feature selection, Journal of Applied Sciences. 28(3) (2010) 71-275.
[6] S. Huang, M. Jiau, C. Hsu, A high-efficiency and high-accuracy fully automatic collaborative face annotation system for distributed online social networks, IEEE Transactions on Circuits and Systems for Video Technology. 24(10) (2014) 1800-1813.
[7] G. Wang, F. Zheng, C. Shi, J. H. Xue, C. Liu, L. He, Embedding metric learning into set-based face recognition for video surveillance, Neurocomputing. 151 (2015) 1500-1506.
[8] S. Biswas, G. Aggarwal, P. J. Flynn, K. W. Bowyer, Pose-robust recognition of low-resolution face images, IEEE Transactions on Pattern Analysis and Machine Intelligence. 35(12) (2013) 3037-3049.
[9] V. S. Kenk, R. Mandeljc, S. Kovačič, M. Kristan, M. Hajdinjak, J. Per, Visual re-identification across large, distributed camera networks, Image and Vision Computing. 34 (2015) 11-26.
[10] D. F. Smith, A. Wiliem, B. C. Lovell, Face recognition on consumer devices: reflections on replay attacks, IEEE Transactions on Information Forensics and Security. 10(4) (2015) 736-745.
[11] X. Hu, Q. Liao, S. Peng, Video surveillance face recognition by more virtual training samples based on 3D modeling, in: 11th International Conference on Natural Computation, 2015, pp. 113-117.
[12] C. Pagano, E. Granger, R. Sabourin, G. L. Marcialis, F. Roli, Adaptive ensembles for face recognition in changing video surveillance environments, Information Sciences. 286 (2014) 75-101.
[13] S. Chen, S. Mau, M. T. Harandi, C. Sanderson, A. Bigdeli, B. C. Lovell, Face recognition from still images to video sequences: A local-feature-based framework, EURASIP Journal on Image and Video Processing. 7 (2011) 1-14.
[14] S. Rudrani, S. Das, Face recognition on low quality surveillance images by compensating degradation, Lecture Notes in Computer Science. 6754 (2011) 212-221.
[15] X. Chen, J. Zhang, Illumination robust single sample face recognition using multi-directional orthogonal gradient phase faces, Neurocomputing. 74 (2011) 2291-2298.
[16] M. Jian, K. Lam, Simultaneous hallucination and recognition of low-resolution faces based on singular value decomposition, IEEE Transactions on Circuits and Systems for Video Technology. 25(11) (2015) 1761-1772.
[17] M. Grgic, K. Delac, S. Grgic, SCface - surveillance cameras face database, Multimedia Tools and Applications. 51 (2011) 863-879.
[18] X. Hu, W. Yu, J. Yao, Multi-oriented 2DPCA for face recognition with one training face image per person, Journal of Computational Information Systems. 6(5) (2010) 1563-1570.
[19] S. C. Chen, D. Q. Zhang, Z. H. Zhou, Enhanced (PC)2A for face recognition with one training image per person, Pattern Recognition Letters. 25(10) (2004) 1173-1181.
[20] M. Koc, A. Barkana, A new solution to one sample problem in face recognition using FLDA, Applied Mathematics and Computation. 217 (2011) 10368-10376.
[21] H. Yin, P. Fu, S. Meng, Sampled FLDA for face recognition with single training image per person, Neurocomputing. 69 (2006) 2443-2445.
[22] E. Sadeghipour, N. Sahragard, Face recognition based on improved SIFT algorithm, International Journal of Advanced Computer Science and Applications. 7(1) (2016) 547-551.
[23] A. Vinay, K. Ganesh, A. M. Durga, H. N. Nandan, C. Sureka, K. N. B. Murthy, S. Natarajan, Face recognition using filtered EOH-SIFT, Procedia Computer Science. 79 (2016) 543-552.
[24] C. Zhang, Y. Gu, K. Hu, Y. Wang, Face recognition using SIFT features under 3D meshes, J. Cent. South Univ. 22 (2015) 1817-1825.
[25] D. M. Massimiliano, I. Francesco, Face recognition from robust SIFT matching, Lecture Notes in Computer Science. 9280 (2015) 299-308.
[26] L. Wu, P. Zhou, Y. Hou, H. Cao, X. Ma, X. Zhang, Complete pose binary SIFT for face recognition with pose variation, Lecture Notes in Computer Science. 8232 (2013) 71-80.
[27] X. Tan, S. Chen, Z. H. Zhou, F. Zhang, Face recognition from a single image per person: A survey, Pattern Recognition. 39 (2006) 1725-1745.
[28] C. Wang, J. Zhang, G. Chang, Q. Ke, Singular value decomposition projection for solving the small sample size problem in face recognition, J. Vis. Commun. Image R. 26 (2015) 265-274.
[29] Y. Wang, M. Wang, Y. Chen, Q. Zhu, A novel virtual samples-based sparse representation method for face recognition, Optik. 125 (2014) 3908-3912.
[30] X. Chen, J. Zhang, Illumination robust single sample face recognition using multi-directional orthogonal gradient phase faces, Neurocomputing. 74 (2011) 2291-2298.
[31] R. X. Ding, D. K. Dua, Z. H. Huang, Z. M. Li, K. Shang, Variational feature representation-based classification for face recognition with single sample per person, J. Vis. Commun. Image R. 30 (2015) 35-45.
[32] J. Wu, Z. H. Zhou, Face recognition with one training image per person, Pattern Recognition Letters. 23(14) (2002) 1711-1719.
[33] I. K. Shlizerman, R. Basri, 3D face reconstruction from a single image using a single reference face shape, IEEE Transactions on Pattern Analysis and Machine Intelligence. 33 (2011) 394-405.
[34] D. Jiang, Y. Hu, S. Yan, L. Zhang, H. Zhang, W. Gao, Efficient 3D reconstruction for face recognition, Pattern Recognition. 38 (2005) 787-798.
[35] F. Abdolali, S. A. Seyyedsalehi, Improving face recognition from a single image per person via virtual images produced by a bidirectional network, Procedia - Social and Behavioral Sciences. 32 (2012) 108-116.
[36] C. Hua, M. Ye, S. Ji, W. Zeng, X. Lu, A new face recognition method based on image decomposition for single sample per person problem, Neurocomputing. 160 (2015) 287-299.
[37] F. Hafiz, A. A. Shafie, Y. M. Mustafah, Face recognition from single sample per person by learning of generic discriminant vectors, Procedia Engineering. 41 (2012) 465-472.
[38] B. Wang, W. Li, Z. Li, Q. Liao, Adaptive linear regression for single-sample face recognition, Neurocomputing. 115 (2013) 186-191.
[39] M. Kafai, L. An, B. Bhanu, Reference face graph for face recognition, IEEE Transactions on Information Forensics and Security. 9(12) (2014) 2132-2143.
[40] W. Deng, J. Hu, X. Zhou, J. Guo, Equidistant prototypes embedding for single sample based face recognition with generic learning and incremental learning, Pattern Recognition. 47 (2014) 3738-3749.
[41] J. Wright, A. Ganesh, A. Yang, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (2009) 210-227.
[42] M. Yang, L. V. Gool, L. Zhang, Sparse variation dictionary learning for face recognition with a single training sample per person, in: IEEE International Conference on Computer Vision, 2013, pp. 689-696.
[43] P. Zhu, M. Yang, L. Zhang, I.-Y. Lee, Local generic representation for face recognition with single sample per person, in: NIPS, 2014, pp. 1-16.
[44] B. Stephen, Deep learning and face recognition: the state of the art, in: Proceedings of SPIE, 2015, Vol. 9457(0B), pp. 1-8.
[45] P. Vincent, H. Larochelle, Y. Bengio, P. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the International Conference on Machine Learning, 2008, pp. 1096-1103.
[46] S. Becker, Unsupervised learning procedures for neural networks, The International Journal of Neural Systems. 2(1&2) (1991) 17-33.
[47] R. B. Palm, Prediction as a candidate for learning deep hierarchical models of data, Technical University of Denmark, DTU Informatics, 2012.
[48] Y. Kang, K. T. Lee, J. Eun, S. E. Park, S. Choi, Stacked denoising autoencoders for face pose normalization, Lecture Notes in Computer Science. 8228(3) (2013) 241-248.
[49] Y. Zhang, M. Zhu, R. Liu, S. Zhang, Occlusion-robust face recognition using iterative stacked denoising autoencoder, Lecture Notes in Computer Science. 8228(3) (2013) 352-359.
[50] S. Gao, Y. Zhang, K. Jia, J. Lu, Y. Zhang, Single sample face recognition via learning deep supervised autoencoders, IEEE Transactions on Information Forensics and Security. 10(10) (2015) 2108-2118.
[51] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory. 52(4) (2006) 1289-1306.
[52] S. Chen, C. Sanderson, M. T. Harandi, B. C. Lovell, Improved image set classification via joint sparse approximated nearest subspaces, in: 26th IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 23-28.
[53] M. A. Borgi, D. Labate, M. E. Arbi, C. B. Amar, Sparse multi-stage regularized feature learning for robust face recognition, Expert Systems with Applications. 42 (2015) 269-279.
[54] H. Zhang, J. Yang, Close the loop: Joint blind image restoration and recognition with sparse representation prior, in: IEEE International Conference on Computer Vision, 2011, pp. 770-777.
[55] G. An, J. Wu, Q. Ruan, An illumination normalization model for face recognition under varied lighting conditions, Pattern Recognition Letters. 31 (2010) 1056-1067.
[56] S. Liao, A. K. Jain, S. Z. Li, Partial face recognition: Alignment-free approach, IEEE Transactions on Pattern Analysis and Machine Intelligence. 35(5) (2013) 1193-1205.
[57] M. Yang, L. Zhang, S. C. K. Shiu, D. Zhang, Gabor feature based robust representation and classification for face recognition with Gabor occlusion dictionary, Pattern Recognition. 46 (2013) 1865-1878.
[58] P. Zhu, W. Zuo, L. Zhang, Image set based collaborative representation for face recognition, IEEE Transactions on Information Forensics and Security. 9(7) (2014) 1-13.
[59] T. Zhang, X. Li, R. Z. Guo, Producing virtual face images for single sample face recognition, Optik. 125 (2014) 5017-5024.
[60] Z. Zhang, A flexible new technique for camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence. 22(11) (2000) 1330-1334.
[61] S. Y. Baek, B. Y. Kim, K. Lee, 3D face model reconstruction from single 2D frontal image, in: Proceedings of the 8th International Conference on Virtual Reality Continuum and its Applications in Industry, 2009, pp. 14-15.
[62] X. Gong, G. Wang, L. Xiong, Single 2D image-based 3D face reconstruction and its application in pose estimation, Fundamenta Informaticae. 94 (2009) 179-195.
[63] P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence. 22(10) (2000) 1090-1104.
[64] G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments, University of Massachusetts, Amherst, Technical Report 07-49, 2007.
[65] K. Ricanek Jr., T. Tesafaye, MORPH: A longitudinal image database of normal adult age-progression, in: IEEE 7th International Conference on Automatic Face and Gesture Recognition, 2006, pp. 341-345.
Fig.1 Surveillance camera capturing distances (a) and the corresponding faces (b). The first three face images in (b), from left to right, were captured at 4.2 m, 2.6 m and 1 m by surveillance cameras mounted at a height of 2.25 m. The rightmost image is a high-resolution frontal face captured by a separate camera [17].
Fig.2 Flowchart of creating additional virtual training samples: a single 2D frontal face is used to build a 3D face shape, which is rotated into different poses (the 1st pose, the 2nd pose, etc.) and re-projected into 2D virtual faces.
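For readers who prefer code to flowcharts, the pipeline of Fig.2 can be summarized by the following Python sketch. The helpers build_3d_model() and render_pose() are hypothetical placeholders for the single-image 3D reconstruction and re-projection steps (following [61][62]); only the Gaussian blurring call is a real OpenCV function, and the overall structure is an illustration rather than the paper's implementation.

    # A minimal sketch of the virtual-sample pipeline of Fig.2; build_3d_model()
    # and render_pose() are hypothetical placeholders, not the paper's code.
    import cv2

    def generate_training_set(frontal_face, pose_angles, sigma):
        model = build_3d_model(frontal_face)           # hypothetical: 3D face shape from one 2D image
        samples = [frontal_face]
        for angle in pose_angles:                      # e.g. looking up and looking down
            samples.append(render_pose(model, angle))  # hypothetical: re-project to a 2D virtual face
        # degrade every view by Gaussian blurring to mimic surveillance quality
        blurred = [cv2.GaussianBlur(s, (0, 0), sigma) for s in samples]
        return samples + blurred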
Fig.3 Space mapping relationship among the optical center C, the retinal plane R and the virtual face plane F (f: focal distance).
Fig.4 (a) All 45 selected frontal faces used for training; (b) a subset of the other-pose faces used for testing, drawn from the first 17 subjects shown in (a).
Fig.5 The pre-processing procedure for an LFW face: (a) the frontal face and its 8 virtual poses; (b) the grayscale images of (a); (c) the images after histogram normalization. In (a), the middle face is the original, and the surrounding images are virtual faces synthesized from the 3D model.
Fig.6 Histogram comparison between an original face and its 8 virtual faces: (a) the original face, its 8 virtual faces and their histograms; (b) the original face, its 8 normalized virtual faces and their histograms; (c) the faces from (b) after histogram equalization, and their histograms. The top plot in each panel corresponds to the original face; the remaining plots in each column correspond to the virtual faces, denoted Vi (i = 1, ..., 8).
Fig.7 The experimental pre-processing scheme. The faces in (a) are the original or virtual color images, and (h) shows their corresponding monitoring faces; the images in (b) and (g) are the normalized images obtained as in Fig.8; (c) and (f) are the images of (b) and (g) after the normalization of Equ(6) and histogram equalization; the images in (b) were first blurred into the images in (d) and then histogram-equalized into the images in (e) according to Equ(10).
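As a rough illustration of this pre-processing chain (not the paper's exact Equ(6) and Equ(10), whose definitions appear in the main text), a minimal OpenCV version might read as follows; the target size and sigma are illustrative values, not the paper's settings.

    # A minimal pre-processing sketch, assuming a Gaussian blur and standard
    # histogram equalization stand in for Equ(10) and the equalization step.
    import cv2

    def preprocess(face_bgr, size=(64, 64), sigma=1.0):
        gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)  # color -> gray, as in Fig.7(b)
        gray = cv2.resize(gray, size)                      # normalize resolution
        blurred = cv2.GaussianBlur(gray, (0, 0), sigma)    # degrade, as in Fig.7(d)
        return cv2.equalizeHist(blurred)                   # equalize, as in Fig.7(e)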
Fig.8 Matched SIFT keypoints between the training face and the testing face of one subject; green lines connect the matched keypoints between the two face images. (a), (b) and (d) show the matches between the original high-resolution training face and the monitoring faces captured at 4.2 m, 2.6 m and 1 m respectively; (c) shows the matches between the 1st virtual face and the monitoring faces captured at 2.6 m; (e) between the 2nd virtual face and the monitoring faces captured at 1 m. From top to bottom, the monitoring faces were captured by Cam1, Cam2, Cam3, Cam4 and Cam5.
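The matching illustrated in Fig.8 can be reproduced in spirit with OpenCV. In the sketch below, the file names are hypothetical and the 0.75 ratio threshold is Lowe's common default, assumed here rather than taken from the paper, whose matching rule may differ.

    # A sketch of SIFT keypoint matching between a training face and a probe face.
    import cv2

    train_gray = cv2.imread("train_face.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
    probe_gray = cv2.imread("probe_face.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(train_gray, None)
    kp2, des2 = sift.detectAndCompute(probe_gray, None)

    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
    score = len(good)  # more surviving matches -> stronger identity evidence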
Fig.9 The weights learned by the hidden layer of the SDAE with 400 hidden units. (a) Scheme I, trained on only the original face and its derived faces; (b) Scheme II, trained on the original face, the 1st and 2nd virtual faces, and their derived faces.
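Weight images like those in Fig.9 are typically rendered by reshaping each hidden unit's incoming weight vector back to the input image size and tiling the results. A small sketch follows; the 32x32 input size and the weight matrix W of shape 400 x 1024 are assumptions for illustration, not the paper's configuration.

    # A sketch of the hidden-weight visualization; input size and W are assumed.
    import numpy as np

    def tile_weights(W, img_shape=(32, 32), grid=(20, 20)):
        tiles = np.zeros((grid[0] * img_shape[0], grid[1] * img_shape[1]))
        for k, w in enumerate(W):                  # one incoming weight vector per hidden unit
            r, c = divmod(k, grid[1])
            patch = w.reshape(img_shape)
            patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-8)  # rescale to [0, 1]
            tiles[r * img_shape[0]:(r + 1) * img_shape[0],
                  c * img_shape[1]:(c + 1) * img_shape[1]] = patch
        return tiles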
Fig.10 Comparison of the improvement in recognition rate achieved by the proposed method for PCA, FLDA, SIFT, SRC and SDAE. The blue, green and purple bars represent the different training strategies; the panels correspond to camera distances of (a) 1 meter, (b) 2.6 meters and (c) 4.2 meters. The vertical axis in each panel is the recognition rate (%), ranging from 0 to 30.
Table 1 Number of faces selected per subject from the LFW database

No. of subject     1   2   3   4   5   6   7   8   9
Number of faces    4   2   2   4   3   4   4   3   5
No. of subject    10  11  12  13  14  15  16  17  18
Number of faces    5   3   3   3   8   2   2   4   4
No. of subject    19  20  21  22  23  24  25  26  27
Number of faces    2   2   7   3   3  15  20   3  12
No. of subject    28  29  30  31  32  33  34  35  36
Number of faces    2   2   5   4   2   2  13   4   6
No. of subject    37  38  39  40  41  42  43  44  45
Number of faces    3   5   2   2   3   4   2   2  24
Table 2 Comparison of the sum of histogram differences (β) between each virtual face and its original face

                        V1      V2      V3      V4      V5      V6      V7      V8
Without normalization   1.2294  1.3733  1.2833  1.3222  1.2233  1.2428  1.3344  1.2778
With normalization      1.1911  1.2589  1.1350  1.2589  1.2233  1.2333  1.2717  1.1394
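A plausible reading of β, assumed here to be the sum of absolute differences between normalized gray-level histograms (the formal definition is given in the main text and may differ in normalization), is sketched below; it reproduces the kind of comparison reported in Table 2.

    # A sketch of the histogram difference beta between two grayscale faces.
    import cv2
    import numpy as np

    def histogram_difference(face_a, face_b, bins=256):
        ha = cv2.calcHist([face_a], [0], None, [bins], [0, 256]).ravel()
        hb = cv2.calcHist([face_b], [0], None, [bins], [0, 256]).ravel()
        ha /= ha.sum()                     # normalize so each histogram sums to 1
        hb /= hb.sum()
        return float(np.abs(ha - hb).sum())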
Table 3 Recognition rate (%) of video monitoring faces from the LFW database

Training samples                               Algorithm   Recognition rate (%)
Only one original sample                       PCA         15.52
One original + 8 virtual samples by RFM        PCA         16.67
One original + 8 virtual samples by 3D model   PCA         20.11
One original + 8 virtual samples by RFM        FLDA        17.24
One original + 8 virtual samples by 3D model   FLDA        23.56
Only one original sample                       SRC         16.67
One original + 8 virtual samples by RFM        SRC         18.97
One original + 8 virtual samples by 3D model   SRC         26.44
Table 4 Recognition rate (%) of video monitoring faces captured at 1 meter

Training samples                                                   Algorithm  Cam1   Cam2   Cam3   Cam4   Cam5   Mean
One original face [17]                                             PCA        6.18   3.90   7.70   8.50   5.40   6.34
One original face + the 2nd virtual face                           PCA        7.69   6.92   6.92   6.15   9.23   7.38
One original face + the 2nd virtual face + the 2nd blurred face    PCA        10.77  10.77  10.77  15.38  9.23   11.38
One original face + the 2nd virtual face                           FLDA       12.31  15.38  10.77  12.31  13.85  12.92
One original face + the 2nd virtual face + the 2nd blurred face    FLDA       13.85  16.92  13.08  12.31  16.92  14.62
One original face                                                  SIFT       7.69   10.00  8.46   7.69   9.23   8.61
One original face + the 2nd virtual face                           SIFT       12.31  13.85  13.08  14.62  13.08  13.39
One original face                                                  SRC        7.69   8.46   7.69   12.31  3.85   8.00
One original face + the 2nd virtual face                           SRC        13.85  11.54  12.31  15.38  10.77  12.77
One original face + the 2nd virtual face + the 2nd blurred face    SRC        16.15  16.92  15.38  19.23  14.62  16.46
Scheme I (original faces and their derived faces)                  SDAE       13.85  13.08  15.38  16.92  12.31  14.31
Scheme II (original face, the 1st and 2nd virtual faces, and their derived faces)  SDAE  16.92  19.23  20.77  20.00  18.46  19.08
Table 5 Recognition rate (%) of video monitoring faces captured at 2.6 meters

Training samples                                                   Algorithm  Cam1   Cam2   Cam3   Cam4   Cam5   Mean
One original face [17]                                             PCA        7.70   7.70   3.90   3.90   7.70   6.18
One original face + the 1st virtual face                           PCA        14.62  7.69   11.54  12.31  11.54  11.54
One original face + the 1st virtual face + the 1st blurred face    PCA        14.62  7.69   11.54  13.08  13.08  12.00
One original face + the 1st virtual face                           FLDA       23.85  16.92  20.77  23.85  18.46  20.77
One original face + the 1st virtual face + the 1st blurred face    FLDA       26.92  18.46  22.31  26.92  19.23  22.77
One original face                                                  SIFT       13.08  12.31  8.46   15.38  9.23   11.69
One original face + the 1st virtual face                           SIFT       16.92  18.46  11.54  20.00  13.08  16.00
One original face                                                  SRC        29.23  16.15  12.31  25.38  13.08  19.23
One original face + the 1st virtual face                           SRC        31.54  18.46  13.84  26.92  17.69  21.69
One original face + the 1st virtual face + the 1st blurred face    SRC        34.62  20.77  20.00  29.23  23.85  25.69
Scheme I (original faces and their derived faces)                  SDAE       30.77  23.08  26.15  30.00  27.69  27.54
Scheme II (original face, the 1st and 2nd virtual faces, and their derived faces)  SDAE  36.15  31.54  26.92  33.08  31.54  31.84
Table 6 Recognition rate (%) of video monitoring faces captured at 4.2 meters

Training samples                                                   Algorithm  Cam1   Cam2   Cam3   Cam4   Cam5   Mean
One original face [17]                                             PCA        2.30   3.10   1.50   0.70   1.50   1.82
One original face + the 0th blurred face                           PCA        5.38   6.15   7.69   7.69   6.23   6.63
One original face + the 0th blurred face                           FLDA       13.85  10.77  11.53  16.15  10.00  12.46
One original face                                                  SIFT       10.77  10.00  9.23   10.77  7.69   9.69
One original face                                                  SRC        17.69  10.00  13.08  10.00  7.69   11.69
One original face + the 0th blurred face                           SRC        22.31  14.62  16.15  17.69  10.08  16.17
Scheme I (original faces and their derived faces)                  SDAE       20.77  18.46  17.69  16.92  11.54  17.08
Scheme II (original face, the 1st and 2nd virtual faces, and their derived faces)  SDAE  22.31  20.78  18.46  18.46  14.62  18.93
Table 7 Standard deviation (σ) used in PCA, FLDA and SRC

Distance     Cam1   Cam2   Cam3   Cam4   Cam5
1 meter      0.75   0.25   0.50   0.25   0.25
2.6 meters   0.75   2.50   1.25   0.75   0.25
4.2 meters   0.75   0.50   1.50   2.50   3.50
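Assuming the blurring of training faces is a Gaussian blur parameterized by the σ values in Table 7 (the paper's Equ(10) gives the exact degradation model), one way to apply them per camera and distance is sketched below; the Gaussian form of the blur is an assumption based on the table header.

    # A sketch applying the per-camera sigma values of Table 7.
    import cv2

    SIGMA = {
        1.0: [0.75, 0.25, 0.50, 0.25, 0.25],  # 1 meter, Cam1..Cam5
        2.6: [0.75, 2.50, 1.25, 0.75, 0.25],  # 2.6 meters
        4.2: [0.75, 0.50, 1.50, 2.50, 3.50],  # 4.2 meters
    }

    def blur_training_face(face_gray, distance, cam_index):
        sigma = SIGMA[distance][cam_index - 1]             # cam_index runs from 1 to 5
        return cv2.GaussianBlur(face_gray, (0, 0), sigma)  # kernel size derived from sigma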
Xiao Hu received the M.S. degree in communication and information system from Yunnan University, Kunming, China, in 2003, and the Ph.D. degree in biomedical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2006. He is currently a Professor with the School of Mechanical and Electrical Engineering, Guangzhou University, and was a visiting scholar at the University of Queensland from Jan. 2016 to Jan. 2017. His current research interests include machine vision, image and biomedical signal processing, and pattern recognition.
[email protected]
Shaohu Peng received the master's degree in signal and information processing from Guangdong University of Technology, China, in 2005, and the Ph.D. degree in control and signal processing from Dankook University, Korea, in 2013. He is now with the School of Mechanical and Electrical Engineering, Guangzhou University. His research interests include machine vision, image processing, and pattern recognition.
[email protected]
Li Wang received the bachelor's degree in electronic engineering from Southeast University, China, in 2009, and the Ph.D. degree in physical electronics from Southeast University, China, in 2015. He is now with the School of Mechanical and Electrical Engineering, Guangzhou University. His research interests include brain-computer interfaces, biomedical signal processing, and pattern recognition.
[email protected]
Zhao Yang received the Ph.D. degree from South China University of Technology in 2014. He is currently a lecturer in the School of Mechanical and Electrical Engineering, Guangzhou University. His research interests include machine learning, pattern recognition, and computer vision.
[email protected]
Zhaowen Li received the bachelor's degree in electronics and communication engineering from Guangzhou University in 2015. He is currently a master's degree candidate in the School of Mechanical and Electrical Engineering, Guangzhou University. His research interest is face recognition.
[email protected]