Pattern Recognition Letters 45 (2014) 145–153
Head pose estimation using image abstraction and local directional quaternary patterns for multiclass classification

ByungOk Han, Suwon Lee, Hyun S. Yang (corresponding author)
Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
Article info
Article history: Received 29 August 2013; Available online 6 April 2014.
Keywords: Head pose estimation; Image abstraction; Multiclass classification
Abstract

This study treats the problem of coarse head pose estimation from a facial image as a multiclass classification problem. Head pose estimation continues to be a challenge for computer vision systems because extraneous characteristics and factors that lack pose information can change the pixel values in facial images. Thus, to ensure robustness against variations in identity, illumination conditions, and facial expressions, we propose an image abstraction method and a new representation method (local directional quaternary patterns, LDQP), which can remove unnecessary information and highlight important information during facial pose classification. We verified the efficacy of the proposed methods in experiments, which demonstrated their effectiveness and robustness against different types of variation in the input images.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

The natural interaction between people and computers is an important research topic, which has recently attracted considerable attention. This research area addresses natural user interfaces (NUIs) between humans and computers. An NUI is a human–machine interface that does not employ input devices; it is a natural interaction method that resembles communication between people. Various related research areas aim to achieve non-intrusive and natural human–computer interaction (HCI), such as face recognition, facial expression recognition, activity recognition, and gesture recognition. As a starting point for these techniques, head pose estimation is a crucial technology that aims to predict human intentions and thereby facilitate the use of non-verbal cues for communication via NUIs. For example, people estimate the orientation of another person's head to understand whether that person wants to interact with them. Head pose estimation is a technique that aims to determine three-dimensional (3D) orientation properties from an image of a human head. In 3D space, objects have geometric properties with six degrees of freedom for rigid body motion, i.e., three rotations and three translations. Head pose estimation methods are usually designed to extract head angular information in terms of the pitch and yaw rotations of a facial image.
This paper has been recommended for acceptance by S. Wang.
Corresponding author: H.S. Yang, [email protected]. Tel.: +82 42 350 7727; fax: +82 42 867 3567.
http://dx.doi.org/10.1016/j.patrec.2014.03.017
0167-8655/© 2014 Elsevier B.V. All rights reserved.
The pitch and yaw are more difficult to estimate than other properties, such as the roll angle, two-dimensional (2D) translation, and scale, which can be calculated easily using 2D face detection techniques, because of occlusions by features such as glasses, beards, and hair, and because of changes in head angle. Different identity, illumination, and facial expression conditions are also serious hindrances when extracting the angular properties of head images.

2. Related works

In recent years, many methods have been developed for estimating 3D human head poses using facial RGB images. These studies can be categorized into methods based on classification and those based on regression with machine learning techniques; the distinction simply depends on whether the pose space is treated as discrete or continuous. The advantages of the classification approaches are that they are comparatively simple and that coarse pose training datasets can be used for the training sessions. These methods can then be expanded to a larger pose set by users at any time [1]. Furthermore, the training dataset only requires human head images with corresponding labels for the head angle information. However, this approach can only estimate designated and discrete head poses. By contrast, the regression methods used for human head pose estimation can obtain continuous information related to head poses. However, it is difficult to develop an exact function for robust head pose estimation because of the complexity of the non-linear and linear mappings that connect the facial images and pose labels. From an image representation perspective, head
pose estimation can be divided into two categories: appearance-based methods and geometric feature-based methods. Hence, methods can be classified based on the characteristics of the description vectors used for training. Appearance-based approaches obtain texture information from a facial image, whereas geometric feature-based approaches manipulate positional information related to the facial features, such as the eyes, eyebrows, nose, and mouth. The first method exploits pixel values in the actual facial image, so it is greatly affected by various factors that change images, and it is necessary to employ an effective noise removal algorithm, such as illumination normalization or face alignment. The second method finds facial features using model-based algorithms such as the active shape model [2], the active appearance model [3], or the constrained local model [4]. Feature vectors can be generated from location information using several facial feature detectors, which are trained using another training set to facilitate 2D location detection. The feature vectors produced from positional information can be used in a supervised learning framework. They can also be employed directly to estimate facial poses. For example, a triangle obtained from three points, i.e., the two eyes and the nose, can be used for pose estimation simply by analyzing how the triangle is projected onto the image plane. However, the geometric approach has the disadvantage that the facial feature locations in all images must be labeled manually to generate the training dataset. Nevertheless, it is an intuitive method for estimating head poses because it uses location information. The approach we develop in the present study is based on concepts derived from multiclass classification and an appearance-based method. To compress useful visual content and to remove unnecessary information, we propose a new approach based on image abstraction. Image abstraction was originally developed for artistic purposes based on automatic stylization. It was also used to communicate information in a previous study [5]. Thus, image abstraction can provide important perceptual information during the recognition process by simplifying and compressing the visual content. A related study [6] applied image abstraction methods to coarse head pose estimation algorithms, which were simple and accurate. To the best of our knowledge, that was the first study to apply image abstraction to head pose estimation. However, the authors only considered an estimation process based on various head poses in terms of variations in identity; they did not consider variations in illumination or facial expressions, whereas we aimed to develop a technique that was robust to such variations. Moreover, they only described the binary image produced by their image abstraction algorithm. To ensure that the method is more accurate and more robust to variations in the images, we propose a novel binary representation method, referred to as local directional quaternary patterns (LDQP), which describes the binary images produced by our image abstraction method. In this study, we extend our previous research using an image abstraction method, which generates a facial sketch image from a contour image with a cartoon-like effect [7]. We explain our new facial sketch image representation method, and we evaluate the effectiveness of the image abstraction method and representation methods in various experiments. The remainder of this paper is organized as follows.
Section 3 provides an overview of the framework of our system. Section 4 presents the details of our image abstraction method and Section 5 explains our representation method. Our research results are given in Section 6 and we present our conclusions in Section 7.

3. System overview

Humans can recognize head poses by detecting simple sets of edges, similar to cartoon faces. People are capable of identifying
simple head poses because they can abstract the features of faces intuitively. In particular, people innately recognize the shapes, configurations, or contours of trained features such as the eyes, nose, mouth, eyebrows, forehead, and chin. Thus, people can remember abstracted images of heads by inference from trained data. The basic concept of our system was designed and implemented from this perspective (Fig. 1).

To classify facial poses, it is necessary to train a classifier with facial images and their corresponding pose labels using a facial pose database created during the training session. First, the Viola–Jones face detection algorithm detects coarse frontal and profile faces from images in the facial pose database. If an image does not contain a face, it is removed from the training data. The facial images are normalized after this exception handling process, and the image abstraction and representation processes are applied. A classifier is trained for multiclass classification using the binary images produced by the image abstraction algorithm. After the training session, a facial test image is vectorized using the same methods as those employed in the training session and the output pose is estimated.

4. Image abstraction

Image abstraction removes unnecessary information and emphasizes the main contents by reinterpreting scene information. This process can help viewers to capture specific visual information. We use an image abstraction method to interpret facial images. Our image abstraction method is shown in Fig. 2. The proposed algorithm performs GrabCut segmentation [8] using the rectangular area of a face. Next, to generate a cartoon-like effect, bilateral filtering [9] is applied to remove some of the noise, after which a difference of Gaussian (DoG) method [10] extracts the contours from the face image.

4.1. GrabCut algorithm

To obtain a cartoon-like representation of a face, it is necessary to extract the face region. Thus, segmentation of the face region from the background region is required for image abstraction. A rough rectangular region containing the face can be obtained using a face detection algorithm. However, the rectangular region generated by the face detection algorithm still includes background pixels, which need to be separated from the facial image. Thus, another algorithm is required to obtain more precise results. In this case, we use the GrabCut algorithm, which segments the foreground from the background using some of the pixels in an image. The GrabCut algorithm only requires that some of the pixels are labeled as foreground or background pixels. This algorithm is widely used for extracting the foreground by partial labeling. The only input required by the algorithm is a rough segmentation of the foreground and the background, which can be obtained from the rectangular region produced by the face detection algorithm. The inputs of the GrabCut algorithm are the face region that corresponds to a target object, Rf, the background region, Rbg, and the unknown region, Ru. The algorithm aims to separate the face region (Rf) from the unknown region (Ru) using information from the background region (Rbg), as shown in Fig. 3. This is achieved using the Graph Cut algorithm [11] and a Gaussian mixture model (GMM). The basic procedure of the GrabCut algorithm is as follows:

(i) A user input is obtained that contains a face region (Rf), a background region (Rbg), and an unknown region (Ru). The unknown region can contain face or background information.
(ii) The pixels in the background region and the face region are modeled using the GMM.
Fig. 1. Overall procedure used by the proposed algorithm.
Fig. 2. Image abstraction pipeline.
(iii) Each face and background GMM component is selected by choosing the most probable component. Next, each pixel is assigned to the corresponding GMM component.
(iv) Two new GMMs are trained using the pixel sets created in the previous step.
(v) A graph is built and the Graph Cut algorithm is used to segment the face and background pixels.
(vi) Steps (iii)–(v) are repeated until the segmentation procedure converges on the solution.

A minimal implementation sketch of this procedure is given below (after Fig. 3).

Fig. 3. Rbg represents the background region of an image. The GrabCut algorithm separates the unknown region, Ru, from the face region, Rf, using Rbg.
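The sketch below shows one way this step could be realized with OpenCV; it is an illustration rather than the authors' exact implementation, and the image file name, cascade path, and iteration count are placeholder assumptions.

```python
import cv2
import numpy as np

# Hypothetical input image and cascade path (placeholders).
image = cv2.imread("face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Viola-Jones face detection gives the rough rectangular face region.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
x, y, w, h = faces[0]  # assumes at least one face was found

# GrabCut: pixels outside the rectangle are background (Rbg);
# pixels inside form the unknown region (Ru) to be split into face/background.
mask = np.zeros(image.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(image, mask, (x, y, w, h), bgd_model, fgd_model,
            iterCount=5, mode=cv2.GC_INIT_WITH_RECT)

# Keep only pixels labeled as (probable) foreground, i.e., the face region Rf.
face_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
face_only = image * face_mask[:, :, None]
```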
4.2. Bilateral filtering

The bilateral filter is an edge-preserving smoothing filter. This is a non-linear technique that blurs images while preserving components with strong edges. This approach is very simple and local. Each pixel is modified using a weighted average of its closest pixels, which are designated by the spatial filter. The weight can be calculated using a Gaussian distribution. This method can also be used in a non-iterative manner, which keeps it simple because the effect of the filter on an image can be controlled with two parameters: the window size of the spatial kernel and the range of the pixel intensities [12]. The bilateral filter is defined by

$$ I^{BF}(x, y) = \frac{\sum_{(x_i, y_i) \in \Omega} I(x_i, y_i)\, G_{\sigma_r}(I^D_i)\, G_{\sigma_s}(C^D_i)}{\sum_{(x_i, y_i) \in \Omega} G_{\sigma_r}(I^D_i)\, G_{\sigma_s}(C^D_i)} \tag{1} $$
$$ I^D_i = \left| I(x_i, y_i) - I(x, y) \right|, \qquad C^D_i = \left\| (x_i, y_i) - (x, y) \right\| \tag{2} $$
where $I^{BF}$ is the output image, $I$ is the input image, $(x_i, y_i)$ denotes the coordinates of a pixel in $\Omega$, the window centered on $(x, y)$, and $G_{\sigma_r}$ and $G_{\sigma_s}$ represent the range and spatial kernels used to smooth the differences in the intensities and the coordinates, respectively. The denominator of the equation is the normalization factor used to guarantee that the sum of the weights is 1.0. The two parameters, $\sigma_r$ and $\sigma_s$, are used to control the smoothness and sharpness of images. As the range parameter, $\sigma_r$, increases, the bilateral filter steadily approximates Gaussian convolution. As the other parameter, $\sigma_s$, increases, larger features are smoothed. Noise reduction is performed after obtaining a facial image using the GrabCut method, while the facial contours are preserved using the bilateral filter.

4.3. Difference of Gaussian

DoG is a simple and robust method for extracting edges. DoG is a band-pass filter that preserves useful information in an image, such as edge information, and discards unnecessary information [13]. DoG utilizes the differences between two Gaussian images, which have been blurred by a Gaussian convolution operation using different standard deviations. Subtracting one image from the other preserves spatial information. The DoG algorithm functions in a similar way to the retina, which reads information from an image. Two Gaussian kernels with different standard deviations in two dimensions, $G_{\sigma_1}$ and $G_{\sigma_2}$, can be described as follows:
$$ G_{\sigma_1}(x, y) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\!\left( -\frac{x^2 + y^2}{2\sigma_1^2} \right), \qquad G_{\sigma_2}(x, y) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\!\left( -\frac{x^2 + y^2}{2\sigma_2^2} \right) \tag{3} $$
Gaussian images can be obtained using a convolution operation:
$$ I_1(x, y) = G_{\sigma_1}(x, y) * I(x, y), \qquad I_2(x, y) = G_{\sigma_2}(x, y) * I(x, y) \tag{4} $$
where $I$ denotes the input image and $I_1$ and $I_2$ denote the two Gaussian-blurred images. The DoG is defined by:
$$ I^{DoG}(x, y) = I_1(x, y) - I_2(x, y) = (G_{\sigma_1} - G_{\sigma_2}) * I(x, y) \tag{5} $$
where $I^{DoG}$ denotes the output image from the DoG algorithm. The thickness of the edges is controlled by adjusting the two standard deviations of the Gaussian kernels, $\sigma_1$ and $\sigma_2$. Our approach emphasizes the facial contours obtained from facial images, and a binary image made of edges is generated.
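As an illustration only, the bilateral filtering and DoG steps of the abstraction pipeline can be sketched with OpenCV as follows; the function name, filter parameters, and threshold are placeholder assumptions, not the values used in the paper.

```python
import cv2

def abstract_face(face_only_bgr):
    """Sketch of the bilateral filtering + DoG steps of the image abstraction pipeline."""
    gray = cv2.cvtColor(face_only_bgr, cv2.COLOR_BGR2GRAY)

    # Edge-preserving smoothing: d is the window size of the spatial kernel,
    # sigmaColor ~ sigma_r (range) and sigmaSpace ~ sigma_s (spatial).
    smoothed = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

    # Difference of Gaussian: subtract two blurred images with different sigmas (Eq. (5)).
    g1 = cv2.GaussianBlur(smoothed, (0, 0), sigmaX=1.0)
    g2 = cv2.GaussianBlur(smoothed, (0, 0), sigmaX=2.0)
    dog = cv2.subtract(g1, g2)

    # Threshold the DoG response to obtain the binary contour (cartoon-like) image.
    _, binary = cv2.threshold(dog, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return binary
```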
5. Representation

The image abstraction method acquires pose information from a facial image, but it also removes color and texture information, which could be valuable for pose estimation. This information could be regarded as noise during pose estimation, but it may be useful for obtaining location information related to the eyes, nose, or mouth. To overcome this problem, we propose our new LDQP representation method for binary images. LDQP is a variant of local binary patterns (LBP) [14] that captures the local structures in binary images. In the original LBP representation procedure, each pixel in a gray image is compared with its neighbors, which are determined on the basis of their radius and number. By subtracting the center pixel value from the pixel values of its neighbors, the resulting negative values are encoded as 0 and the others as 1. However, unlike the LBP input, the LDQP input is a binary image generated by the image abstraction procedure, and it extracts a final feature vector, which comprises LDQP codes. The LDQP codes are constructed simply by placing the pixel values of the center pixel and its neighbors in a binary image, as shown in Fig. 4(b). This representation method divides a facial binary image into several grids, which contain regional information. This information is important for estimating facial poses. After applying the image abstraction method to facial images, the locations of pixels that should be in the same region might not be fixed in each binary image because of changes in illumination, variable backgrounds, or occlusions caused by hair. For example, the positions of the eyes, nose, and mouth may be different in each facial image. LDQP is useful for overcoming slight misalignments. Furthermore, it allows our system to distinguish between strong edges and weak edges in the abstracted image because it contains local directional information and the LDQP codes form a histogram on a grid. Therefore, this method can overcome the classification disruption caused by the textures of wrinkles, speckles, or shade. The LDQP representation procedure is as follows.

(i) After image abstraction, a binary image is divided into several grids, which are designated by the user. For example, 5 × 7 grids are used in Fig. 4(a).
(ii) Perform LDQP coding as shown in Fig. 4(b).
(iii) A histogram with 512 bins is calculated for each grid in the image using the results obtained from the LDQP codes. For example, the 5 × 7 grid layout yields 35 grids, so 35 histograms are produced.
(iv) A feature vector is produced by concatenating the histograms. A feature vector with dimensions of 17,920 (35 × 512) is obtained from the 35 histograms.

The LDQP code can represent directional changes in the pixels in a binary image. The LDQP describes the center pixel value and the target pixel values, which are defined by a specific radius and the number of neighbors. For example, as shown in Fig. 4(b), the radius is three and the number of neighbors is eight, so the LDQP code is "10 10 10 11 11 11 11 10," which is simply a sequence of consecutive binary values based on the center pixel and the target pixels. This code can be described using nine bits because the center pixel can be represented by one bit. This approach is very simple, as shown in Fig. 4(c). If the radius is one and the number of neighbors is eight, the code "00" represents zero pixels, while the code "11" represents a line. The code "01" represents two pixels where the center pixel is darker than the other pixel, whereas the code "10" represents two pixels where the center pixel is brighter than the other pixel. The LDQP code is described as:

$$ \mathrm{LDQP}_{r,n}(x_i, y_i) = \sum_{k=0}^{n-1} I(x_{r,k}, y_{r,k})\, 2^k + I(x_i, y_i)\, 2^n \tag{6} $$

where $r$ represents the radius and $n$ is the number of neighbors defined by the user.
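To make the coding concrete, the following is a small sketch, under our own reading of Eq. (6) and Fig. 4, of how a 9-bit LDQP code and the per-grid 512-bin histograms could be computed; the circular neighbor sampling, the grid layout, and the function names are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def ldqp_code(binary, x, y, r=3, n=8):
    """9-bit LDQP code at (x, y): n neighbor bits on a circle of radius r plus the center bit (Eq. (6))."""
    h, w = binary.shape
    code = 0
    for k in range(n):
        theta = 2.0 * np.pi * k / n
        xk = int(round(x + r * np.cos(theta)))
        yk = int(round(y + r * np.sin(theta)))
        if 0 <= xk < w and 0 <= yk < h and binary[yk, xk]:
            code |= (1 << k)          # neighbor bit contributes 2^k (out-of-image neighbors treated as 0)
    if binary[y, x]:
        code |= (1 << n)              # center bit contributes 2^n
    return code                       # value in [0, 512) for n = 8

def ldqp_feature(binary, grid_rows=7, grid_cols=5, r=3, n=8):
    """Concatenated per-grid histograms, e.g., 5 x 7 grids x 512 bins = 17,920 dimensions."""
    binary = (binary > 0).astype(np.uint8)
    h, w = binary.shape
    hists = []
    for gy in range(grid_rows):
        for gx in range(grid_cols):
            hist = np.zeros(2 ** (n + 1), dtype=np.float32)
            for y in range(gy * h // grid_rows, (gy + 1) * h // grid_rows):
                for x in range(gx * w // grid_cols, (gx + 1) * w // grid_cols):
                    hist[ldqp_code(binary, x, y, r, n)] += 1
            hists.append(hist)
    return np.concatenate(hists)      # feature vector fed to the multiclass classifier
```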
6. Experiments

We evaluated our approach using the Labeled Faces in the Wild (LFW) database [15], the Multi-PIE face database [16], and the Pointing database [17] in qualitative and quantitative experiments. The Viola–Jones face detector, GrabCut, bilateral filtering, and support vector machine (SVM) algorithms were implemented using the OpenCV library [18].
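For reference, a minimal sketch of the classification stage using OpenCV's SVM module is given below; the linear multiclass SVM matches the setup described in Section 6.2, but the variable names, file names, and the C value are assumptions for illustration.

```python
import cv2
import numpy as np

# Assumed inputs: LDQP feature vectors (one row per training image, float32)
# and integer pose labels (e.g., indices of the 13 discrete yaw angles).
train_features = np.load("train_ldqp.npy").astype(np.float32)   # placeholder file names
train_labels = np.load("train_labels.npy").astype(np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)        # multiclass C-SVC
svm.setKernel(cv2.ml.SVM_LINEAR)     # linear SVM, as used in the experiments
svm.setC(1.0)                        # assumed regularization value
svm.train(train_features, cv2.ml.ROW_SAMPLE, train_labels)

# Pose prediction for test images: vectorize them the same way, then classify.
test_features = np.load("test_ldqp.npy").astype(np.float32)
_, predicted = svm.predict(test_features)
predicted_pose_index = predicted.ravel().astype(np.int32)
```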
Fig. 4. Local directional quaternary patterns. (a) Overall process of the LDQP. (b) The LDQP coding. (c) Meaning of the LDQP codes.
6.1. Qualitative analysis of image abstraction

The LFW and Multi-PIE facial databases were used in the qualitative evaluation of our method to determine how well the image abstraction method generates cartoon-like binary images from numerous facial images. The images in the LFW dataset are extremely diverse, with complex backgrounds and numerous occlusions. They are useful for qualitative evaluations of image abstraction, both for analyzing the quality of segmentation using the GrabCut algorithm and for assessing whether the binary images depict facial pose changes in an uncontrolled environment. Fig. 5(a) shows that our method can represent the facial characteristics in images with various backgrounds. We also performed experiments using the Multi-PIE database to assess the quality of image abstraction with a diverse range of poses. These images contained similar patterns depending on the pose type, as shown in Fig. 5(b). However, there was a problem with hair in these images because the contours of the hair were not regular in our binary images. Thus, the quality of our method is expected to improve if the problem related to hair can be overcome.

6.2. Experimental comparisons using representative local descriptors in various conditions

These experiments compared the performance, in various conditions, of representative local descriptors, i.e., LBP and local ternary patterns (LTP) [19], with our LDQP method using facial images from the Multi-PIE database. Using these methods, we conducted performance evaluations to compare the accuracy of the classified results and the mean absolute angular error (MAAE) obtained with our head pose estimation algorithms. Moreover, to identify the methods that were robust in specific conditions, these descriptors were evaluated using four datasets, i.e., identity (I), identity with a small pose set (II), illumination (III), and facial expressions (IV), as shown in Table 1. All of these experiments were
based on k-fold cross-validation, which means that the facial images used in each experiment were partitioned, according to the factor of variation, into k sets, with k−1 sets used for training and one set for testing. For simplicity, we fixed the parameters of the local descriptors: the radius was three, the number of neighbors was eight, and the number of grids was 80 (8 × 10). We used a linear SVM as the classification method in every experiment.

Based on dataset (I), we conducted experiments with identity variation using facial images of 250 people. For each person, there were facial images derived from cameras placed at 13 different poses that varied in the yaw direction from −90° to +90° at intervals of 15°, i.e., the total number of facial images was 3250. To analyze the effects of the variation in identity, we fixed the illumination conditions as normal with a frontal light source and the facial expression as a neutral expression. All of the images were partitioned into five sets, for which the subjects were selected randomly, for five-fold cross-validation. The first row in Table 2 shows that our LDQP method delivered the best performance in terms of both the classification accuracy and the MAAE. To verify that our algorithm was effective with a small pose set for variation in identity, we also conducted experiments using dataset (II), which tested seven discrete yaw angles at intervals of 30° with five-fold cross-validation. The other conditions were fixed as described above. Each set comprised 350 facial images of 50 people and the total number of facial images was 1750. The results are shown in the second row of Table 2. All of the description methods produced comparatively good results, which demonstrated that the image abstraction method was applicable to the small pose set. Furthermore, we performed experiments with dataset (III), where we used five-fold cross-validation to demonstrate the robustness of our method under variable illumination. We used 3250 images in these experiments, which comprised facial images of 50 people from 13 yaw angles with neutral facial expressions, illuminated by five light sources. As shown in the third row of Table 2, all of the representation methods delivered good results without requiring any illumination compensation techniques.
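The evaluation protocol can be sketched as follows; the subject-wise splitting and the MAAE/accuracy computation mirror the description above, while the helper names and the use of scikit-learn's GroupKFold and LinearSVC are our own illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC

def evaluate(features, yaw_labels_deg, subject_ids, n_splits=5):
    """Subject-wise k-fold cross-validation reporting classification accuracy and MAAE (degrees)."""
    accuracies, maaes = [], []
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(features, yaw_labels_deg, groups=subject_ids):
        clf = LinearSVC()  # linear multiclass SVM (one-vs-rest here; an assumption)
        clf.fit(features[train_idx], yaw_labels_deg[train_idx])
        pred = clf.predict(features[test_idx])
        truth = yaw_labels_deg[test_idx]
        accuracies.append(np.mean(pred == truth))
        maaes.append(np.mean(np.abs(pred - truth)))   # mean absolute angular error
    return np.mean(accuracies), np.std(accuracies), np.mean(maaes), np.std(maaes)
```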
Fig. 5. Qualitative evaluation. (a) Result images from the LFW database. (b) Result images from the Multi-PIE database.
Table 1
Experimental design used for evaluations in various conditions for images from the Multi-PIE database.

No  | Independent variable | Database  | Poses  | Fixed variables       | Cross-validation | # of images
I   | ID (250)             | Multi-PIE | 13 yaw | Illum. (F), Expr. (N) | 5-fold           | 3250
II  | ID (250)             | Multi-PIE | 7 yaw  | Illum. (F), Expr. (N) | 5-fold           | 1750
III | Illum. (5)           | Multi-PIE | 13 yaw | ID (50), Expr. (N)    | 5-fold           | 3250
IV  | Expr. (2)            | Multi-PIE | 13 yaw | ID (50), Illum. (F)   | 2-fold           | 1300

Identity (ID), Illumination condition (Illum.), Facial expression (Expr.), Frontal light source (F), Neutral facial expression (N).
Table 2
Experimental results obtained using various representation methods with different conditions. "IA" indicates pure binary images obtained with the image abstraction method, "MAAE" is the mean absolute angular error (degrees), and "C.A." is the classification accuracy.

No            | IA only: MAAE | IA only: C.A.  | LBP: MAAE   | LBP: C.A.      | LTP: MAAE   | LTP: C.A.      | LDQP: MAAE  | LDQP: C.A.
Dataset (I)   | 3.20 ± 0.73   | 86.05% ± 1.58% | 1.80 ± 0.63 | 92.11% ± 1.69% | 1.61 ± 0.44 | 93.27% ± 1.55% | 1.42 ± 0.40 | 93.28% ± 1.23%
Dataset (II)  | 2.37 ± 0.95   | 95.46% ± 1.32% | 1.16 ± 0.49 | 98.18% ± 0.81% | 1.03 ± 0.47 | 98.71% ± 0.41% | 0.85 ± 0.56 | 98.62% ± 0.73%
Dataset (III) | 4.34 ± 2.04   | 83.07% ± 7.29% | 2.20 ± 1.53 | 91.58% ± 5.50% | 1.47 ± 0.78 | 94.08% ± 3.87% | 1.47 ± 0.90 | 93.64% ± 4.53%
Dataset (IV)  | 4.09 ± 0.31   | 83.40% ± 1.02% | 2.20 ± 0.57 | 92.62% ± 1.02% | 1.75 ± 0.34 | 93.14% ± 0.31% | 1.71 ± 0.14 | 93.85% ± 0.41%
Table 3
Experimental design employed in the comparisons with the other approaches.

No  | Training set | Testing set | Database  | Poses   | Fixed variables  | Included variables | Cross-validation | # of images
V   | 169 people   | 168 people  | Multi-PIE | 13 yaw  | Illum. (F)       | ID, Expr.          | –                | 32,682
VI  | 12 people    | 3 people    | Pointing  | 13 yaw  | Pitch angle (0°) | ID, Illum., Expr.  | 5-fold           | 390
VII | 12 people    | 3 people    | Pointing  | 9 pitch | Yaw angle (0°)   | ID, Illum., Expr.  | 5-fold           | 270
In particular, the LTP and our LDQP outperformed the other description methods because they could express more specific information under changes in the illumination conditions than the other methods based on binary images. In the experiments with facial expression changes in dataset (IV), we performed two-fold cross-validation; the total number of facial images of 50 people was 1300, with 13 discrete yaw angles and two facial expressions (neutral and smile), and the illumination came from a frontal light source. The last row in Table 2 shows the results.

We obtained the following findings based on the results of these experiments. (i) The results obtained with the LBP variants were superior to those produced using the pure binary images obtained with the image abstraction method. This demonstrates that local information is crucial not only for RGB images, but also for the binary images derived using the image abstraction algorithm. (ii) In particular, with illumination or facial expression changes, it was difficult to obtain good results using the image abstraction method alone. Thus, local representation methods were required to obtain better results. (iii) With all the datasets, the results obtained using LDQP were the best of all the representation methods in terms of the MAAE. In terms of the classification accuracy, LTP and LDQP both performed well in these experiments. The LDQP occasionally yielded more misclassifications than the LTP, but its results were closer to the ground truth than those obtained with the LTP. The LDQP is also memory-efficient compared with the LTP because it builds histograms using the LDQP code, which comprises only 9 bits, whereas the LTP code comprises 16 bits. Thus, using the LDQP could reduce the memory consumption by 44%. In summary, the LDQP is more suitable for binary images than the LTP. (iv) Most of the results misclassified by the LDQP were caused by confusing similar angles. The majority of these misclassifications were images of profile faces, which were not close to frontal (particularly angles close to 45° or −45°). (v) The LDQP occasionally produced incorrect results with wide differences, such as 90° to 45° and 60° to 45°. This was related to the face normalization procedure rather than to the representation methods: the Viola–Jones profile face detector could not produce consistent rectangular facial regions. (vi) When the location of the illumination changed, greater differences in the location of the light source resulted in more misclassified results with the LDQP. (vii) The LDQP produced robust results when the facial expression changed. In addition, there was little difference in accuracy between neutral expressions and smile expressions in the test phases.

6.3. Experimental comparisons with other approaches

In this section, we describe experiments demonstrating that the proposed algorithm outperformed state-of-the-art approaches based on quantitative comparisons. We used a common informative metric, i.e., the mean absolute angular error, to evaluate the
head pose estimation systems using three datasets from the Multi-PIE database and the Pointing database. Table 3 shows the experimental design employed in these experiments. In particular, for coarse head pose estimation, we evaluated the approaches based on the classification error, which is useful for comparing the performance of algorithms that estimate specific discrete poses. We acquired 32,682 facial images from the Multi-PIE database, which comprised 337 people illuminated by a frontal light source, with 13 discrete poses that varied in the yaw direction and six expressions (neutral, smile, surprise, squint, disgust, and scream), to form dataset V. We divided the images into a training set and a testing set by subject: 169 people for training and 168 people for testing. We also distributed the number of facial images equally with respect to each pose to balance the data in the training and testing sets. We compared our proposed algorithms with two other approaches. As shown in Table 4, the LDQP with the image abstraction method outperformed the state-of-the-art algorithms in terms of the MAAE.

We also performed experiments to compare other approaches based on the Pointing database, which contains head poses that vary in the pitch direction as well as the yaw direction. For the yaw angles (dataset VI), we fixed the pitch angle at 0° and extracted 390 facial images from 15 people with 13 discrete yaw angles at intervals of 15°. Similarly, for the pitch angles (dataset VII), we fixed the yaw angle at 0° and extracted 270 facial images from 15 people with nine discrete pitch angles, which comprised −90°, −60°, −30°, −15°, 0°, +15°, +30°, +60°, and +90°. We manually cropped all of the facial images from the Pointing database along the head boundary because the Viola–Jones face detector is vulnerable to faces with movement in the vertical direction. We separated these images into five sets by subject and conducted five-fold cross-validation. Table 5 shows the results of comparisons with five other algorithms. Our proposed approach outperformed the others in cases with horizontal axial rotation. In particular, the LDQP with image abstraction delivered the best performance in these experiments. In cases with vertical axial rotation, however, the proposed approaches were not the best because the image abstraction method could not represent the variation attributable to different hair styles, which is essential information for the accurate estimation of head poses that vary in the pitch direction. The GrabCut method used for image abstraction usually classifies the hair as part of the background, so this information is removed even though it is relevant to the pose.

Table 4
Experimental results obtained using dataset V from the Multi-PIE database.
Method                        | MAAE | C.A. (%)
SL2 [20]                      | 4.33 | –
Normalized Gabor [21]         | 2.99 | –
Proposed approach (IA only)   | 3.35 | 85.00
Proposed approach (IA + LDQP) | 1.25 | 94.04
Table 5
Experimental results obtained using datasets VI and VII from the Pointing database.

Methods                       | MAAE (VI)   | MAAE (VII)   | C.A. (VI)       | C.A. (VII)
ANN-based approach [22]       | 9.5         | 9.7          | 52.0%           | 66.3%
Human performance [23]        | 11.8        | 9.4          | 40.7%           | 59.0%
Associative memories [23]     | 10.1        | 15.9         | 50.0%           | 43.9%
High-order SVD [24]           | 12.9        | 17.97        | 49.25%          | 54.84%
PCA [24]                      | 14.11       | 14.98        | 55.2%           | 57.99%
Proposed approach (IA only)   | 9.12 ± 3.10 | 14.72 ± 2.73 | 54.36% ± 13.24% | 50.37% ± 5.90%
Proposed approach (IA + LDQP) | 7.23 ± 2.51 | 10.06 ± 3.92 | 58.97% ± 10.66% | 58.52% ± 10.64%
Fig. 6. Feature vectors reduced by multi-dimensional scaling. (a) Binary images produced by image abstraction. (b) Feature vectors obtained using the LDQP method.
6.4. Analysis of feature vectors

To classify facial poses correctly, it is necessary to analyze the feature vectors used to produce input vectors for a classifier, because a facial pose dataset can be domain-specific. In general, a feature vector derived from an image has very high dimensionality. With our LDQP method, the dimensions of a feature vector reached 40,960 (80 × 512) when using 80 (8 × 10) grids. Thus, a dimensionality reduction technique was required to visualize the feature vectors. We used a classical multi-dimensional scaling technique to reduce them to two dimensions. Fig. 6 shows the feature vectors from the seven viewpoints used in the experiments. As shown in Fig. 6(a), the feature clusters from the binary facial images were linearly separable. After the LDQP process, the feature clusters were more scattered than the original clusters, as shown in Fig. 6(b). This is because the LDQP allows the feature vectors to avoid overlapping, which makes them more suitable than the facial binary images when larger facial datasets are used. However, a problem arises if only a small number of samples remain for a certain viewpoint, e.g., because face detection fails, which could cause an imbalanced dataset problem between the classes.
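For completeness, classical multi-dimensional scaling of the LDQP feature vectors can be sketched as below; this is a generic implementation written for illustration, not the authors' code, and the input array name is an assumption.

```python
import numpy as np

def classical_mds(features, n_components=2):
    """Classical (Torgerson) MDS: embed high-dimensional feature vectors in 2D for visualization."""
    # Squared Euclidean distance matrix between feature vectors.
    sq_norms = np.sum(features ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T

    # Double centering: B = -1/2 * J * D^2 * J with J = I - (1/m) * 11^T.
    m = d2.shape[0]
    j = np.eye(m) - np.ones((m, m)) / m
    b = -0.5 * j @ d2 @ j

    # Top eigenvectors of B scaled by the square roots of their eigenvalues give the embedding.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Example: project LDQP feature vectors (one row per image) to 2D for a scatter plot.
# coords_2d = classical_mds(ldqp_features, n_components=2)
```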
7. Conclusion and future works

In this study, we proposed an image abstraction and representation method for head pose estimation, which we applied successfully to the multiclass classification problem. Cartoon-like facial contour images were used to abstract the characteristics of common facial poses. These images reduced the noise caused by variations in identity, illumination conditions, and facial expressions. The representation method, LDQP, combined with the image abstraction, outperformed state-of-the-art methods in terms of the MAAE on the Multi-PIE and Pointing databases. Our experimental results showed that the edges in a facial image could be used to facilitate visual communication between humans and computers. Thus, our image abstraction method can be used to compress visual information in many research areas, such as object recognition, gesture recognition, and hand posture estimation. Moreover, a binary facial image can be interpreted intuitively by humans, which provides the opportunity to identify visual cues to determine the meaning of a feature vector. In future research, we plan to expand this work based on the batch alignment of faces by image abstraction and to apply dimensionality reduction techniques that are appropriate for our framework and the head pose estimation domain.

Acknowledgments

This work was supported by the IT R&D program of MKE & KEIT [10041610, The development of the recognition technology for user identity, behavior and location that has a performance approaching recognition rates of 99% on 30 people by using perception sensor network in the real environment]. This work was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (MEST).

References

[1] E. Murphy-Chutorian, M. Trivedi, Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 31 (4) (2009) 607–626.
[2] T. Cootes, C. Taylor, D. Cooper, J. Graham, Active shape models – their training and application, Comput. Vision Image Underst. 61 (1) (1995) 38–59.
[3] T. Cootes, G. Edwards, C. Taylor, Active appearance models, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 681–685.
[4] D. Cristinacce, T. Cootes, Automatic feature localisation with constrained local models, Pattern Recognit. 41 (10) (2008) 3054–3067.
[5] H. Winnemöller, S.C. Olsen, B. Gooch, Real-time video abstraction, ACM Trans. Graph. 25 (3) (2006) 1221–1226.
[6] A. Puri, B. Lall, Exploiting perception for face analysis: image abstraction for head pose estimation, in: Computer Vision – ECCV 2012, Workshops and Demonstrations, Lecture Notes in Computer Science, vol. 7584, 2012, pp. 319–329.
[7] B. Han, Y. Chae, Y.H. Seo, H. Yang, Head pose estimation based on image abstraction for multiclass classification, in: Information Technology Convergence, Lecture Notes in Electrical Engineering, vol. 253, 2013, pp. 933–940.
[8] C. Rother, V. Kolmogorov, A. Blake, GrabCut: interactive foreground extraction using iterated graph cuts, ACM Trans. Graph. 23 (3) (2004) 309–314.
[9] C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, in: Sixth International Conference on Computer Vision, 1998, pp. 839–846.
[10] D. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1150–1157.
[11] Y. Boykov, M.P. Jolly, Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images, in: Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 1, 2001, pp. 105–112.
[12] S. Paris, P. Kornprobst, J. Tumblin, Bilateral Filtering, 2009.
[13] M. Basu, Gaussian-based edge-detection methods – a survey, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 32 (3) (2002) 252–260.
[14] T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (12) (2006) 2037–2041.
[15] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments, Tech. Rep. 07-49, University of Massachusetts, Amherst, 2007.
[16] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, Image Vision Comput. 28 (5) (2010) 807–813.
[17] N. Gourier, D. Hall, J.L. Crowley, Estimating face orientation from robust detection of salient facial structures, in: FG Net Workshop on Visual Observation of Deictic Gestures, 2004, pp. 1–9.
[18] G. Bradski, The OpenCV library, Dr. Dobb's J. Softw. Tools (2000).
[19] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, in: S. Zhou, W. Zhao, X. Tang, S. Gong (Eds.), Analysis and Modeling of Faces and Gestures, Lecture Notes in Computer Science, vol. 4778, 2007, pp. 168–182.
[20] D. Huang, M. Storer, F. De la Torre, H. Bischof, Supervised local subspace learning for continuous head pose estimation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 2921–2928.
[21] F. Jiang, H.K. Ekenel, B.E. Shi, Efficient and robust integration of face detection and head pose estimation, in: 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012, pp. 1578–1581.
[22] R. Stiefelhagen, Estimating head pose with neural networks – results on the Pointing04 ICPR workshop evaluation data, in: Pointing04 ICPR Workshop of the International Conference on Pattern Recognition, 2004.
[23] N. Gourier, J. Maisonnasse, D. Hall, J.L. Crowley, Head pose estimation on low resolution images, in: Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 270–280.
[24] J. Tu, Y. Fu, Y. Hu, T. Huang, Evaluation of head pose estimation for studio data, in: Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 281–290.