Face Alignment Using a Deep Neural Network with Local Feature Learning and Recurrent Regression

Byung-Hwa Park, Se-Young Oh, Ig-Jae Kim

Accepted Manuscript, Expert Systems With Applications
PII: S0957-4174(17)30487-6
DOI: 10.1016/j.eswa.2017.07.018
Reference: ESWA 11434
Received: 11 February 2017; Revised: 11 June 2017; Accepted: 12 July 2017

Please cite this article as: Byung-Hwa Park, Se-Young Oh, Ig-Jae Kim, Face Alignment Using a Deep Neural Network with Local Feature Learning and Recurrent Regression, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.07.018

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

- We generate a global feature map using a network trained on local facial features.
- Using a local feature extraction layer, local features are selectively investigated.
- Landmark positions are estimated via regression on the extracted features, applied recurrently.
- Features extracted from the generated global feature map show a distinctive property.
- Face alignment via the proposed method shows state-of-the-art results on public datasets.

Face Alignment Using a Deep Neural Network with Local Feature Learning and Recurrent Regression

Byung-Hwa Park (POSTECH), Se-Young Oh* (POSTECH), Ig-Jae Kim (KIST)

*Corresponding author: [email protected]

Abstract. We propose a face alignment method that uses a deep neural network employing both local feature learning and recurrent regression. This method is primarily based on a convolutional neural network (CNN), which automatically learns local feature descriptors from a local facial landmark dataset that we created. Our research is motivated by the belief that investigating a face from its low-level component features produces more competitive face alignment results, just as a CNN is normally trained to automatically learn a feature hierarchy from the lowest to the highest levels of abstraction. Moreover, by separately training the feature extraction layers and the regression layers, we impose an explicit functional discrimination between the feature extraction and regression tasks. First, we train a feature extraction network to classify the landmark patches in the dataset. Using this pretrained feature extraction network, we build a face alignment network, which uses an entire face image rather than a local landmark patch as input, thus generating global facial features. The subsequent local feature extraction layer extracts the local feature set from these global features, finally generating the local feature descriptors, in whose space the network learns a generic descent direction from the currently estimated landmark positions to the ground truth via linear regression applied recurrently. A head pose estimation network is also applied to provide a good initial estimate to the local feature extraction layer for accurate convergence. We found that learning good local landmark features in pursuit of good landmark classification also leads to higher face alignment accuracy, achieving state-of-the-art performance on several public benchmark datasets. This signifies the importance of learning not only the global features but also the local features for face alignment. We further verify our method's effectiveness when applied to related problems such as head pose estimation, facial landmark tracking, and invisible landmark detection. We believe that good local learning enables a deeper understanding of the face or object, resulting in higher performance.

Keywords: face alignment · deep neural network · convolutional neural network · local feature learning · head pose estimation · facial landmark tracking

Byung-Hwa Park: Department of Creative IT Engineering, POSTECH (Pohang University of Science and Technology), 77 Cheongam-Ro, Nam-Gu, Pohang, Gyeongbuk, Korea, 37673. E-mail: [email protected]
Se-Young Oh*: Department of Creative IT Engineering, POSTECH. E-mail: [email protected]. Phone: +82-10-4040-3214. Fax: +82-54-279-2099
Ig-Jae Kim: Image and Media Research Center, KIST (Korea Institute of Science and Technology), 5 Hwarang-Ro, Seongbuk-Gu, Seoul, Korea, 02792. E-mail: [email protected]

1 Introduction

Face alignment, or facial landmark detection, involves locating the facial landmarks that describe the shape of a face. This is a fundamental task for face-related applications, such as face recognition, facial expression recognition, and face motion capture. Face alignment has improved tremendously with the advent of discriminative cascaded regression approaches (Burgos-Artizzu et al., 2013; Cao et al., 2014; Xiong et al., 2014) that train a cascaded regressor, which estimates facial landmark positions in feature space starting from initial estimates (such as the center of the face region and its perturbed shapes) and then refines them in stages. Recently, deep neural networks (DNNs), which can automatically learn robust feature descriptors from a large amount of training data, have been successfully applied to a wide range of problems, such as object classification, object detection, and face recognition (Lawrence et al., 1997; Krizhevsky et al., 2012; Sermanet et al., 2013; Girshick et al., 2014; Jia et al., 2014; Malagavi et al., 2014; Sermanet et al., 2014; Redmon et al., 2015; Szegedy et al., 2015). DNNs have also been applied to face alignment problems in various ways; however, most of them perform the alignment task on a rather low-resolution face image (Zhou et al., 2013; Yi et al., 2013; Zhang et al., 2014).

In this study, we propose an architecture that combines the strengths of these two approaches (Fig. 1), utilizing both a hierarchy of features trained for landmark classification purposes and recurrent regression for face alignment refinement. We first train a feature extraction network to classify each facial landmark using a facial landmark patch dataset that we created from the original facial dataset. For alignment, this network takes the full facial image as input and produces the global feature descriptors. Then, the local feature extraction layer extracts only the set of local feature descriptors corresponding to the estimated landmark locations from the global feature descriptor space. This local feature set drives the succeeding regression layer, which is trained to learn a generic descent direction from the current landmark estimate toward the ground truth. The facial shape is then updated, and the process repeats from the local feature extraction layer until convergence. The whole face alignment system consists of the feature extraction network at the front end, followed by the local feature extraction layer and the regression layer. In addition, to provide a good initial estimate of the facial shape to the local feature extraction layer, for faster convergence of the regression layer as well as for the final head pose, we use a head pose estimation network, which first predicts five keypoint positions that eventually lead to the head pose parameters. To demonstrate the expressive power of the trained local feature descriptors, we also applied these features to the problems of facial landmark tracking and invisible landmark detection.

1.1 Contributions

Previous face alignment algorithms were either based on hand-crafted features (Zhu et al., 2015; Xiong et al., 2013) or on deep learning that utilizes the features of the entire face (Sun et al., 2013; Zhou et al., 2013; Zhang et al., 2014; Zhang et al., 2015). The main contribution of our research is to apply deep learning to extract facial landmark features with a gradually increasing level of abstraction for face alignment, in line with representation learning (Bengio et al., 2013). Since a face is composed of several facial parts, by learning the facial components or landmarks, we can also optimize the face alignment process. Second, we separate the training of feature extraction and face alignment regression to ensure that they remain independent. In contrast, previous work (Zhou et al., 2013; Yi et al., 2013; Zhang et al., 2014) trained a deep neural network for regression via backpropagation in an end-to-end learning mode, which led neither to explicit feature extraction nor to explicit regression for final landmark detection, since the error signal used in backpropagation may be diluted in the process. Recent object detection algorithms, such as RCNN (Girshick et al., 2014) and OverFeat (Sermanet et al., 2013), applied their regression process to a network pretrained for classification. Finally, we use the local feature extraction layer in the middle of the network in order to process only the local feature descriptors in the regions of interest in the global feature descriptor space, which has the same width and height as the input data. We believe these approaches can also be used in various detection problems with greater efficiency. We verify that our system accurately locates the facial landmarks of the 300-W facial landmark localization challenge dataset (Sagonas et al., 2013), achieving state-of-the-art performance. We further verify the effectiveness of our method on the related problems of head pose estimation, facial landmark tracking, and invisible landmark detection. The remainder of the paper is structured as follows. The following section reviews related work. In Section 3, we discuss the motivation of our study and high-level considerations. Section 4 describes the details of the proposed architecture and its implementation. Section 5 shows the experimental results obtained on public datasets. Finally, we draw our conclusions and explore directions for future study.


Fig. 1 Proposed architecture of the face alignment network. (a) The head pose estimation network examines a low-resolution input to provide an initial facial shape with a head pose for the local feature extraction layer, which then extracts the desired local features in the global feature descriptor space. (b) The feature extraction network is trained to classify facial landmarks. (c) The face alignment network uses the trained feature extraction network with the entire face as input to produce a global feature set. The regression layer generates a vector that indicates a generic descent direction for improving the current facial shape toward the ground truth.

2 Related Work

2.1 Face Alignment Based on Hand-crafted Features

Face alignment, or facial landmark detection, has been studied for a long time in computer vision. First, Active Shape Model (ASM, Cootes et al., 1993) approaches were used to fit or align a facial shape to the input face by training the gradient pattern of the profile normal at each facial landmark and minimizing its Mahalanobis distance to the target face. Then, Active Appearance Model (AAM, Cootes et al., 2001) approaches, which considered not only the shape model but also the appearance (texture) model, were extensively studied. This approach learned parameterized shape and appearance models via principal component analysis and cast the problem as optimizing these parameters. Although various facial shapes could be expressed with these models by adjusting their parameters, their expressive power was limited to their parameter space. Moreover, they were not robust to local occlusion and they used the appearance of the whole face. The Constrained Local Model (CLM, Asthana et al., 2013) tried to address this problem by using only local facial parts. More recently, numerous nonparametric cascaded regression methods have been studied with success in solving the face alignment problem. These methods estimate landmark locations explicitly by regression on image features. For instance, Xiong et al. (2013) estimated landmark locations using the SIFT (Ke et al., 2004) features of the landmark patch image. They learned a generic descent direction from initial estimates to the ground truth landmark positions using a set of linear regressors. Cao et al. (2014) then used a cascaded fern regression with pixel-difference features. Although these methods showed reliable accuracy in facial landmark detection, they did not utilize facial attributes but simply relied on hand-crafted features in a statistical framework.

2.2 Face Alignment Using Convolutional Neural Networks


Convolutional neural networks (CNNs) have been widely studied and applied with success to various problems in computer vision, including image classification, object detection, and recognition (Krizhevsky et al., 2012; Girshick et al., 2014; Lawrence et al., 1997; Redmon et al., 2015; Malagavi et al., 2014; Sermanet et al., 2013; Sermanet et al., 2014; Szegedy et al., 2015). The basis of CNN training is to obtain a meaningful set of convolution filters at different levels from a given set of training data to represent the feature hierarchy. In the case of facial landmark detection, the network can be trained more easily when the candidate dimensionality is small. If the input dimension is large, the filter size should also be large to effectively capture its spatial features, requiring a significant amount of computation. Therefore, previous work tried to reduce the problem domain by reducing the image size. For instance, Sun et al. (2013), Zhou et al. (2013), and Zhang et al. (2014) used 39 × 39, 50 × 50, and 60 × 60 images, respectively. All of them successfully employed a coarse-to-fine cascaded neural network for face alignment. Sun et al. (2013) and Zhou et al. (2013) pre-partitioned faces into different parts that were each processed by different CNNs. Meanwhile, Zhang et al. (2014) trained a CNN by jointly optimizing landmark detection together with auxiliary facial attributes, such as gender, expression, and appearance. However, since they only used small face images to train their networks, these approaches did not provide accurate landmark locations on a high-resolution face image, which is important for future applications. Recently, Lai et al. (2015) successfully aligned a facial shape over a 250 × 250 face image. They used a deconvolutional neural network to represent an input image in a deep feature space that has the same width and height as the input, and used cascade regression on shape-indexed features to find the facial landmarks over this feature space. In this work, we detect facial landmarks on 250 × 250 images, which provides accurate results even for a high-resolution image using a deep neural network. This network simply learns the features of facial landmark patches instead of using the full face image. This is in contrast to previous work (Sun et al., 2013; Zhou et al., 2013; Zhang et al., 2014; Zhang et al., 2015), which tried to achieve end-to-end learning by simultaneously training the convolutional filters and regression layers and therefore could not guarantee functional independence between feature extraction and landmark-locating regression within the system. On the other hand, recent object detection and localization approaches, such as RCNN (Girshick et al., 2014), discriminate between classification and regression by using a network pre-trained on the ImageNet classification dataset at the front end. Following this philosophy, we separate the training of the feature extraction and regression tasks. Furthermore, while previous work used the final facial landmark positions as the target output, our target is the direction vector from the current estimate to the ground truth for updating the current facial shape estimate. This enables the system to refine the current facial shape estimate even if it cannot find the exact landmark positions in a single step.

2.3 Regression Using a Deep Neural Network

The main difference between classification and regression is whether the target output of the predictor function takes discrete class labels or continuous values. As a trainable, discriminative feature extraction tool, CNNs have been widely used in feature learning for classification. Regression problems have often been reformulated as classification problems as well; for instance, RCNN (Girshick et al., 2014) and OverFeat (Sermanet et al., 2013) applied a CNN trained for classification to selective search or sliding windows to score the class of the current window and detect an object. YOLO (Redmon et al., 2015) also applied a CNN to score a predefined rectangular grid over an image, detecting objects in a single pass. Szegedy et al. (2013) replaced the final softmax layer of a pre-trained network with a regression layer that generates a binary mask describing the portion of the object in the region of interest. Malagavi et al. (2014) discretized different head profiles (right, left, straight, up, and down) to estimate the head pose in a quantized manner, although its original representation involves continuous roll, pitch, and yaw angles. In this work, we estimate facial landmark positions by applying recurrent linear regression to gradually update the current landmark estimate, using a CNN-based feature set learned from our landmark patch dataset.

Table 1. Comparison between previous work and ours

Method | Approach | Pros | Cons
ASM (1993) | Shape PCA parameter optimization with profile model fitting | Describes the shape of the object using a constrained statistical shape model | Learns only the profile model, not the facial texture
AAM (2001) | Shape and appearance PCA parameter optimization | Describes the face using both shape and appearance models | Weak to local occlusion; slow because it calculates Hessian and Jacobian matrices
SDM (2013) | Cascade regression with SIFT features | Fast because it does not calculate Hessian and Jacobian matrices | No deep analysis of theoretical convergence
Cao et al., 2014 | Cascade regression using ferns with pixel-difference features | Very fast in training and test | Uses hand-crafted features
Sun et al. 2013 | Cascade CNNs | Learns features from the dataset using CNNs | Validated only on 5 distinctive landmarks on 39 × 39 resolution images
Zhou et al. 2013 | Coarse-to-fine cascade CNNs | Learns features from the dataset using CNNs | Validated only on 60 × 60 resolution images
Zhang et al. 2014 | Learning facial landmark positions with auxiliary attributes using CNNs | Learns features that describe facial landmarks as well as auxiliary attributes | Validated only on 60 × 60 resolution images
Lai et al. 2015 | Deconvolutional NN for the feature map and cascade regression using ferns | Combination of Cao et al., 2014 and CNN features | CNN processing is not real time
Ours | Local feature learning and recurrent regression on the global feature space | Uses a very accurate local feature set that learns to classify local facial landmarks | Requires parallel processing for real-time local feature extraction on the global feature space

3 Motivation and High-Level Considerations

As an end-to-end learning model, an intuitive way to train a neural network to detect facial landmarks is to use a Euclidean loss layer for the landmark-locating regression and apply backpropagation to the whole system (Sun et al., 2013; Zhou et al., 2013; Zhang et al., 2014). This implies learning both the feature extraction layers and the regression layers at the same time. This approach, however, has two major drawbacks. The vanishing gradient problem tends to make it hard for the feature extraction layers to learn the landmark features. Moreover, it is not possible to fully discriminate between the roles of the feature extraction and regression components, as they are jointly learned with the same error signal. Normally, a neural network that employs backpropagation learns filters from low- to high-level features (Krizhevsky et al., 2012). That is, in the first layer it learns filters that extract lines and edges, and in subsequent layers it extracts shapes at increasing levels of abstraction. Based on this observation, we trained the feature extraction network to classify the landmarks contained in the face image, and used it as the front end of the face alignment network. Using this network, which learned local landmarks but operates on the whole face image, we can generate a global feature set containing all the important local features, signifying that we are indeed proceeding from the lower to the higher levels of abstraction.

4 Architectural Details


Our system is composed of two networks: one for head pose estimation and one for face alignment. The head pose estimation network provides an initial estimate of the head pose as well as the final head pose. This network processes a low-resolution facial image (60 × 60) and generates the head scale, translation, and rotation information (s, t, and R) for the local feature extraction layer of the face alignment network. The face alignment network contains the feature extraction network, which is trained to classify facial landmarks. By attaching a local feature extraction layer and a regression layer to the feature extraction network, which provides a rich local descriptor, we establish a face alignment network that detects facial landmarks in an image. Table 2 shows the details of this architecture.
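As an illustration of how these pieces fit together at inference time, the following sketch (our pseudocode-style rendering, not the authors' implementation; all components are injected as callables, and names such as head_pose_net or local_descriptor are hypothetical) traces the flow from pose estimation to the final averaged shape:

```python
def align_face(image, head_pose_net, feature_net, local_descriptor,
               initial_shapes, template_3d, descent_maps):
    """Sketch of the full inference flow; every component is passed in.

    head_pose_net: returns (s, t, R) from a 60 x 60 face image.
    feature_net: returns the global feature map of a 250 x 250 image.
    local_descriptor: returns the stacked local descriptors at a shape.
    descent_maps: the learned linear regressors, one per recurrence.
    """
    s, t, R = head_pose_net(image)                 # head pose estimate
    shapes = initial_shapes(template_3d, s, R, t, n=10)
    global_map = feature_net(image)                # computed only once
    for W in descent_maps:                         # typically 4 loops
        shapes = [x + local_descriptor(global_map, x) @ W for x in shapes]
    return sum(shapes) / len(shapes)               # average the 10 results
```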


Table 2. Summary of the network structures used. The convolution layer is described by its filter size; the stride length of all convolution layers is 1. The pooling layer is expressed in terms of its Type (Size, Stride). We also describe the different learning methods, as we separated the training of the feature extraction layer and the regression layer.

Layer | Head Pose Estimation Network | Face Alignment Network
Input | 60 × 60 | 250 × 250
Conv. 1 | 4 × 4 × 20 | 7 × 7 × 68
Pooling | Max(2, 2) | -
Conv. 2 | 3 × 3 × 40 | 3 × 3 × 34
Pooling | Max(2, 2) | -
Conv. 3 | 3 × 3 × 60 | 3 × 3 × 34
Pooling | Max(2, 2) | -
Unshared Conv. | 2 × 2 × 80 | -
Inception: Conv. 4_1 | - | 3 × 3 × 32
Inception: Conv. 4_2 | - | 5 × 5 × 16
Concat. | - | Yes
Local Feature Extraction | - | Yes
Pooling | - | Average(5, 3)
Conv. 5 | - | 1 × 1 × 32
Fully Connected | 120 | 68
Fully Connected | 10 (distinctive landmarks) | -
Fully Connected | 100 | -
Fully Connected | 3 (head pose) | -
Linear Regression | - | Number of Landmarks × 2

Training method: backpropagation for the convolutional and fully connected layers; linear regression (Moore-Penrose solution) for the final regression layer of the face alignment network.


Fig. 2 Creation of the facial landmark dataset. (a) Original image from the 300-W dataset (Sagonas et al., 2013) and its annotated landmarks. (b) Resized to 250 × 250 using the rectangle of the face. 300-W provides hand-labeled landmarks and face detection rectangles from their own face detector. (c) 31 × 31 facial landmark patch dataset images extracted from the landmark positions in (b). Each landmark patch contains its landmark index for classification.

4.1 Feature Extraction Network

Our architecture is designed to engineer an optimal hierarchy of facial features in order to facilitate an effective high-level analysis. We first construct a landmark dataset consisting of 31 × 31 facial landmark patches extracted from full facial images resized to 250 × 250 using the detected rectangle (Fig. 2). We then train the network to classify these landmark patches as one of 68 reference landmarks using logistic regression. We utilize the output of the network just before the softmax as the feature set and call these features "local feature descriptors", since they represent local information from the whole face. The output dimension is the number of facial landmarks. When the feature extraction network is used to detect facial landmarks for face alignment, however, we feed the full 250 × 250 face image as input to the network rather than the 31 × 31 landmark patches on which the network was originally trained. In this way, the network produces facial descriptors in the global domain, which we name the global features, rather than the local features in the landmark domain. Therefore, the local feature extraction layer (Fig. 3) is attached to the end of the network before pooling, as shown in Fig. 4, to selectively extract the desired local feature windows from the global features. The extracted local features are then fed to the remaining layers to produce the final local feature descriptors, enabling the network to extract local feature descriptors within the local regions of interest in the input image. Although the batch size of the test input data is one (since we process one image at a time), the batch size resulting from local feature extraction is the number of local regions of interest. Our network can thereby analyze several local regions of the image in parallel with GPU batch processing.
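To make the mechanics of this layer concrete, the following minimal sketch (our illustration, not the authors' implementation; a PyTorch (1, C, H, W) tensor layout and integer landmark coordinates are assumed) crops a fixed window around each estimated landmark and stacks the crops into a batch:

```python
import torch
import torch.nn.functional as F

def extract_local_features(global_map, landmarks, window=31):
    """Crop a window around each estimated landmark from the global
    feature map and stack the crops into one batch.

    global_map: (1, C, H, W) global features of a single face image.
    landmarks:  (L, 2) tensor of (x, y) positions in map coordinates.
    """
    half = window // 2
    # Zero-pad so windows near the border keep a fixed size; after
    # padding, the window centered at (x, y) starts at index (x, y).
    padded = F.pad(global_map, (half, half, half, half))
    crops = [padded[0, :, y:y + window, x:x + window]
             for x, y in landmarks.long().tolist()]
    # (L, C, window, window): one batch entry per landmark, so the
    # remaining layers process all local regions in parallel on the GPU.
    return torch.stack(crops)
```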


Fig. 3 Local feature extraction layer. From the single-batch global feature representation, the local feature extraction layer extracts several local features at the target local positions and concatenates them into a multi-entry batch. These extracted local features are then sent to the remaining layers to produce the final local feature descriptors.

For accurate local feature extraction on the desired local feature window (its position and size), which is indexed in the raw image, the width and height of each layer before the local feature extraction layer should not be resized. Therefore, we set the stride of the convolution blocks to one and set the zero-padding size so as not to resize the image width and height. Moreover, we did not use a pooling layer before the local feature extraction layer, even though pooling is meant both to reduce dimension and to ensure translation invariance. Note that the images in the landmark patch dataset (Fig. 2) are already available in translation-normalized format because they were extracted from the exact desired points on the face. Fig. 4 illustrates this procedure in detail.
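A minimal sketch of this size-preserving constraint (ours; the filter sizes are taken from Table 2, and the 3-channel input is an assumption):

```python
import torch
import torch.nn as nn

# Stride-1 convolutions with zero padding of (kernel - 1) // 2 keep the
# width and height of every block equal to the input, so a landmark's
# (x, y) position in the 250 x 250 image indexes the same location in
# the global feature map.
conv1 = nn.Conv2d(3, 68, kernel_size=7, stride=1, padding=3)
conv2 = nn.Conv2d(68, 34, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 250, 250)          # a full face image
assert conv2(conv1(x)).shape[-2:] == (250, 250)
```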

4.2 Face Alignment Network

To use a full face image as input to the feature extraction network, we changed its input dimensions to 250 × 250 and attached the local feature extraction layer before the convolution-pooling block when constructing the face alignment network. Using the face translation and rotation information from the head pose estimation network, we obtain an initial estimate of the landmark positions in the image using equation (3). The local feature extraction layer extracts local features at these positions over the global face feature space and then generates the local feature descriptors. The local feature descriptors feed into the linear regression layer, which has a Euclidean loss function. The target value the regression layer has to learn is the generic descent direction map, which is the vector from the currently estimated landmark position to the ground truth. Xiong et al. (2013) extensively studied this mechanism as a cascade linear regressor. We found that the regressor amounts to learning its weights \hat{W}_k such that

\hat{W}_k = \arg\min_{W} \left\| (x^{*} - x^{k}) - W \, \phi(f_l(I, x^{k})) \right\|^2,    (1)

which is equivalent to minimizing the cost function

\sum_{i=1}^{N} \left\| (x_i^{*} - x_i^{k}) - \hat{W}_k \, \phi(f_l(I_i, x_i^{k})) \right\|^2,    (2)

where N is the number of sample images I_i; f_m refers to the mth layer and l denotes the local feature extraction layer; x_i^{*} is the ground truth landmark position of the ith sample image; x_i^{k} is the estimated landmark position at the kth iteration; \phi(\cdot) is the local feature descriptor including a bias node; and \hat{W}_k is the generic descent map, which is the regressor. The cost function minimizes the sum over all landmarks of the differences between the estimated and target difference vectors, in which the difference is between the estimated and ground-truth landmark positions at the current iteration k. This is a linear system, which can be expressed as A\hat{W}_k = B. The minimum-norm least-squares solution of this linear system is \hat{W}_k = A^{+}B, where A^{+} is the Moore-Penrose generalized inverse of the matrix A (Rao et al., 1971). We thus computed the linear regressor \hat{W}_k for each kth iteration, whereas \phi(\cdot) was trained via stochastic gradient descent with hand-tuned hyperparameters. The regressor provides a generic descent direction toward the ground truth position and the network updates the estimated position along this direction; however, it does not reach this goal in one step, so it recurrently loops through the local feature extraction layer until it converges to the final position. It typically converged in four iterations in our experiments.
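The following sketch (ours; NumPy is assumed, shapes are flattened (N, 2L) arrays, and the feature function is supplied by the caller) illustrates learning one descent map per iteration via the pseudoinverse and applying the maps recurrently at test time:

```python
import numpy as np

def fit_descent_maps(images, shapes, ground_truth, features, n_iters=4):
    """Learn one linear regressor per recurrence (eqs. (1)-(2)) using
    the minimum-norm least-squares (Moore-Penrose) solution.

    features(I, x): concatenated local feature descriptors (with a bias
        entry) for image I at shape estimate x, returned as a (d,) vector.
    shapes, ground_truth: (N, 2L) arrays of flattened landmark positions.
    """
    shapes = shapes.copy()
    regressors = []
    for _ in range(n_iters):
        Phi = np.stack([features(I, x) for I, x in zip(images, shapes)])
        Delta = ground_truth - shapes       # generic descent targets
        W = np.linalg.pinv(Phi) @ Delta     # W_k = Phi^+ Delta, (d, 2L)
        shapes = shapes + Phi @ W           # refine every training shape
        regressors.append(W)
    return regressors

def align(image, shape0, features, regressors):
    """Test time: recurrently apply the learned descent maps."""
    x = shape0
    for W in regressors:
        x = x + features(image, x) @ W
    return x
```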


Fig. 4 Feature extraction network and face alignment network. The feature extraction network is trained to classify the facial landmarks in the dataset (Fig. 2). The network does not reduce the width and height of the data blocks before the point where the local feature extraction layer is applied in the face alignment network. This architecture is motivated by GoogLeNet (Szegedy et al., 2015), though we use a small version of the inception module without the 1 × 1 convolution layer; we verified that this increases network accuracy. The face alignment network is composed of the trained feature extraction network, the local feature extraction layer, and additional regression layers.

4.3 Head Pose Estimation Network

To extract the important local information in the local feature extraction layer with fast and accurate convergence, we use the head pose estimation network. Using the estimated pose information from this network, the initial shape for the local feature extraction layer is calculated following equation (3). The network is also used for the final head pose estimation. We first train the network to estimate the five facial keypoint locations and then estimate the head pose from this information. The architecture is given in Table 2. The main reason we estimate the head pose from the five facial keypoint locations is to be able to estimate the head pose directly from an estimated face alignment result, especially in a tracking situation, which can use the face alignment result of the previous frame. We used error backpropagation with gradient descent to train this network. The scale and translation parameters can be calculated directly from the five distinctive facial keypoints.
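One way to compute these two parameters, sketched below under the assumption of a mean 2-D keypoint template (our illustration; the rotation parameters, which the paper obtains from a trained regression layer, are not shown):

```python
import numpy as np

def scale_translation_from_keypoints(keypoints, template):
    """Estimate the head scale and translation directly from the five
    detected facial keypoints, relative to a mean keypoint template.

    keypoints, template: (5, 2) arrays of (x, y) positions.
    """
    t = keypoints.mean(axis=0)                # translation: centroid
    centered = keypoints - t
    ref = template - template.mean(axis=0)
    # Ratio of RMS radii gives the scale of the face in the image.
    s = np.sqrt((centered ** 2).sum() / (ref ** 2).sum())
    return s, t
```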

4.4 Assigning Initial Shapes to the Local Feature Extraction Layer

We improve the robustness of generic application performance by generating several initial shapes obtained from the head pose information. Because the training landmark dataset contains 2D images, we used a predefined z-value for the 3D landmark face template model, \bar{S}. The x- and y-values of the 3D model were the mean of the training dataset landmarks. We generated the initial shapes as

x_0^{(i)} = s_i \, P \, R_i \, \bar{S} + t_i, \quad s_i = s \, U(1, \delta_s), \quad R_i = R(\theta + U(0, \delta_\theta)), \quad t_i = t + U(0, \delta_t),    (3)

where s, R, and t are the scale, rotation, and translation parameters estimated by the head pose estimation network; R(\cdot) is a rotation matrix; P is an orthogonal projection matrix; U(a, b) is a value sampled from the uniform distribution unif(a - b, a + b); i is the initial shape index; and \delta_s, \delta_\theta, and \delta_t indicate the perturbation widths of the uniform distributions for scale, rotation, and translation (the last in pixels). Since the initial prediction of the head pose estimation network may contain errors, obtaining multiple initial shapes from the uniform distribution may help to enhance the robustness of this estimation. We obtain the final estimate by averaging the results of multiple initial shapes (in this case, ten).
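A sketch of this sampling (ours; the perturbation widths are placeholders, since the paper's specific values were lost in extraction):

```python
import numpy as np

def sample_initial_shapes(S3d, s, R, t, n=10,
                          d_s=0.05, d_theta=0.1, d_t=5.0, rng=None):
    """Sample n perturbed initial shapes around the estimated pose,
    following eq. (3): x0_i = s_i * P @ R_i @ S3d + t_i.

    S3d: (3, L) 3-D landmark template; s: scale; R: (3, 3) rotation;
    t: (2, 1) translation. d_s, d_theta (radians), and d_t (pixels) are
    placeholder perturbation widths, not the paper's values.
    """
    rng = np.random.default_rng() if rng is None else rng
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])            # orthogonal projection
    shapes = []
    for _ in range(n):
        s_i = s * rng.uniform(1.0 - d_s, 1.0 + d_s)
        a = rng.uniform(-d_theta, d_theta)     # in-plane angle perturbation
        R_a = np.array([[np.cos(a), -np.sin(a), 0.0],
                        [np.sin(a),  np.cos(a), 0.0],
                        [0.0,        0.0,       1.0]])
        t_i = t + rng.uniform(-d_t, d_t, size=(2, 1))
        shapes.append(s_i * (P @ R_a @ R @ S3d) + t_i)   # (2, L) shape
    return shapes
```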

4.5 Training Procedure

Overall, based on the information provided in Table 2, we use stochastic gradient descent for feature training and the Moore-Penrose generalized inverse for linear regression training. First, we train the head pose estimation network to estimate the positions of the five distinctive facial features using stochastic gradient descent. These feature positions enable the scale and translation parameters to be computed directly. We then train a regression layer to estimate the rotation parameters from the estimated five distinctive facial features. Next, we train the feature extraction network to classify facial landmarks using the landmark patch dataset (Fig. 2) with gradient descent. We trained this network from scratch: we set the initial weights of the convolution and fully connected layers using Xavier initialization and set their bias parameters to 0.2. Finally, using the pre-trained feature extraction network and head pose estimation network, we train the regression layer of the face alignment network.
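In PyTorch-style code, the described initialization could look like this (a rendering of the stated scheme, not the authors' code; feature_net is a hypothetical module instance):

```python
import torch.nn as nn

def init_weights(m):
    # Xavier initialization for weights and a constant 0.2 for biases,
    # as described for training the feature extraction network.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        nn.init.constant_(m.bias, 0.2)

# feature_net.apply(init_weights)   # applied before training from scratch
```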


While a conventional neural network preprocesses its input, including normalization to zero mean and unit variance together with a hyperbolic tangent activation, we did not preprocess the input image for two reasons: 1) if we normalize the input to zero mean and unit variance, the statistics of the local landmark patches on the global representation could change; and 2) the relative scales of the pixels are already normalized in the range from 0 to 255.

4.6 Landmark Tracking for Video Sequences

Some differences arise when we discard the assumption that we are working with a still image in order to handle a sequence of video images. First, the initial head pose estimation is no longer required for tracking, because the initialization comes directly from the pose obtained in the previous frame. Furthermore, although the landmark detection procedure runs four rounds of recurrent regression using the local feature descriptors for each of the 10 initial shapes processed in parallel, we can decrease the number of recurrent loops along with the number of initial shapes for tracking, because we again start from the shape obtained in the previous frame. This enables our system to track facial landmarks in real-time video with a GPU. Our tracker uses one initial shape with four recurrent loops.

5 Experimental Results

5.1 Dataset


We evaluated our proposed architecture by conducting experiments on the 300 Faces in the Wild (300-W) challenge provided by Sagonas et al. (2013). The images and annotations consist of AFW (Zhu et al., 2012), LFPW (Belhumeur et al., 2013), HELEN (Le et al., 2012), and IBUG. 300-W uses two performance metrics. The first is the average point-to-point Euclidean error normalized by the interocular distance,

\mathrm{err} = \frac{1}{N} \sum_{i=1}^{N} \frac{\frac{1}{M} \sum_{j=1}^{M} \left\| \hat{x}_{i,j} - x_{i,j} \right\|}{\left\| x_{i,l} - x_{i,r} \right\|},    (4)


where N is the number of samples, M is the number of landmarks, \hat{x}_{i,j} is an estimated landmark position, x_{i,j} is the corresponding ground truth landmark position, and x_{i,l} and x_{i,r} are the positions of the left and right eye corners, respectively. The second metric is the cumulative error curve corresponding to the percentage of test images for which the error was less than a specified value. We made a fair comparison by following the experimental setting of Ren et al. (2014). Table 3 shows detailed information about the dataset specification and experimental settings. We experimented with 68 landmarks (including face contour, two eyes, two eyebrows, and nose and mouth points) and 49 landmarks (including two eyes, two eyebrows, and nose and mouth points), respectively.
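A direct implementation of this metric (our sketch, assuming (N, M, 2) landmark arrays):

```python
import numpy as np

def interocular_error(est, gt, left_idx, right_idx):
    """Average point-to-point error normalized by the interocular
    distance (eq. (4)). est, gt: (N, M, 2) landmark arrays;
    left_idx, right_idx: indices of the two eye-corner landmarks."""
    per_point = np.linalg.norm(est - gt, axis=2)             # (N, M)
    interocular = np.linalg.norm(gt[:, left_idx] - gt[:, right_idx],
                                 axis=1)                     # (N,)
    return float(np.mean(per_point.mean(axis=1) / interocular))
```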


Table 3. Dataset specification and experimental settings

Dataset | Training set | Test set | Note
HELEN | 2000 | 330 | Large dataset with various situations
LFPW | 811 (1100) | 122 (150) | -
AFW | 337 | - | Medium dataset with large pose variation
IBUG | - | 135 | Small dataset with various situations and local occlusion
300-W | HELEN + LFPW + AFW | Common set: HELEN + LFPW; Challenging set: IBUG; Full set: HELEN + LFPW + IBUG | Same experimental setup as Ren et al. (2014)

5.2 Facial Feature Classification

Table 4. Classification accuracy of our feature extraction network. When the system classifies all landmark patch images perfectly, the score is one.

Dataset | 49 Landmarks: Training Top-1 | Training Top-5 | Test Top-1 | Test Top-5 | 68 Landmarks: Training Top-1 | Training Top-5 | Test Top-1 | Test Top-5
HELEN | 0.8749 | 0.9882 | 0.8180 | 0.9734 | 0.7903 | 0.9689 | 0.7350 | 0.9549
LFPW | 0.8346 | 0.9806 | 0.7685 | 0.9653 | 0.8291 | 0.9813 | 0.6787 | 0.9353


First, we evaluated the performance of the feature extraction network (Fig. 4). We used the HELEN and LFPW facial landmark datasets from 300-W (Fig. 2). Table 4 presents the results. The Top-1 test accuracy obtained for all datasets is approximately 0.7, whereas the training accuracy is approximately 0.8. In the case of the Top-5 accuracy, every value is above 0.95. Unfortunately, we could not compare our results to those obtained by others because, to the best of our knowledge, this is the first attempt to learn facial landmark features. Overall, the HELEN dataset produced more accurate results than LFPW in terms of test accuracy because it contains more training samples. In addition, the results for 49 landmarks are more robust than those for 68 landmarks because the facial contour landmark points are less well defined. Although we extract the facial landmark patches from the whole face image, some contain local occlusion that renders the landmark invisible. As the 300-W dataset contains labeled faces in natural settings, it includes occlusions of various kinds. Facial landmark annotation datasets such as 300-W, however, annotate these landmarks over the occlusion. Therefore, we concluded that the results in Table 4 are acceptable. Figure 5 shows the global features obtained by the feature extraction network using the whole face as input. The local feature extraction layer then processes these global features for face alignment. The global features show that the network not only maintains the facial shape of the input image, but also responds to distinctive facial features. This indicates that the DNN, after learning the facial landmark images, can generate facial features while maintaining the facial context of the full input image.


Fig. 5 Examples of an input image and its global features, which are fed to the local feature extraction layer. This shows that the global feature space represents the facial shape of the input image and also reacts to its various features.

5.3 Face Alignment


Table 5 presents the results of the face alignment experiments. We compared the accuracy of our results with those previously reported by other researchers. We could have further augmented the training dataset via rotation, extension, and warping, but in our experiment we simply augmented the training data by flipping the images. We report the results on the 250 × 250 input image dimension as well as on the restored original image dimension. Note that we were unable to establish whether the accuracy of previously reported experiments was measured on original or resized input images. Table 5 shows that our approach achieves state-of-the-art performance and that our performance improves as the size of the dataset increases. The error increases in the order of the 300-W common set, HELEN, and LFPW, which is also the order in which the size of the datasets decreases. This reflects that a deep neural network learns a discriminative feature descriptor from the dataset. Therefore, the performance of the network increases with dataset size, revealing the importance of learning discriminative facial landmark features from a large dataset. Our work, therefore, predominantly depends on the size of the training dataset. Lai et al. (2015) also used regression based on deep learning features and exhibited a similar error trend.


Table 5. Experimental results on the 300-W landmark dataset (%). These results show the average landmark distance from the ground truth landmarks, normalized by the interocular distance. We report the error for whole-landmark detection (68 points, including the outer points describing the facial contour line) as well as for the set of inner points describing the eyebrows, nose, eyes, and mouth (49 points). Dashes mark results not reported for a given set.

Method | LFPW 68-pts | LFPW 49-pts | HELEN 68-pts | HELEN 49-pts | 300-W Common Subset | 300-W Challenging Subset | 300-W Full Set
Zhu et al. 2012 | 8.29 | 7.78 | 8.16 | 7.43 | 8.22 | 18.33 | 10.20
DRMF (Asthana et al. 2013) | 6.57 | - | 6.70 | - | 6.65 | 19.79 | 9.22
ESR (Cao et al. 2014) | - | - | - | - | 5.28 | 17.00 | 7.58
RCPR (Burgos-Artizzu et al. 2013) | 6.56 | 5.48 | 5.93 | 4.64 | 6.18 | 17.26 | 8.35
SDM (Xiong et al. 2013) | 5.67 | 4.47 | 5.50 | 4.25 | 5.57 | 15.40 | 7.50
Smith et al. 2014 | - | - | - | - | - | 13.30 | -
Zhao et al. 2014 | - | - | - | - | - | - | 6.31
GN-DPM (Tzimiropoulos et al. 2014) | 5.92 | 4.43 | 5.69 | 4.06 | 5.78 | - | -
CFAN (Zhang et al. 2014) | 5.44 | - | 5.53 | - | 5.50 | - | -
JLRL (Ge et al. 2016) | 5.41 | - | 5.12 | - | 4.87 | 11.23 | 6.15
ERT (Kazemi et al. 2014) | - | - | - | - | - | - | 6.40
LBF (Ren et al. 2014) | - | - | - | - | 4.95 | 11.98 | 6.32
LBF fast (Ren et al. 2014) | - | - | - | - | 5.38 | 15.50 | 7.37
CFSS (Zhu et al. 2015) | 4.87 | 3.78 | 4.63 | 3.47 | 4.73 | 9.98 | 5.76
TCDCN (Zhang et al. 2015) | - | - | 4.60 | - | 4.80 | 8.60 | 5.54
Lai et al. 2015 | 4.57 | - | 4.25 | - | 4.19 | 8.42 | 5.02
OURS (Original Image Size) | 4.10 | 3.16 | 3.74 | 2.81 | 3.70 | 8.05 | 4.55
OURS (250 × 250 Normalized Size) | 3.86 | 2.88 | 3.71 | 2.77 | 3.59 | 7.79 | 4.40


Fig. 7 shows the cumulative error curves from our experiments, showing the error of the whole set of facial landmarks as well as the error for each part of the face. It also shows error ellipse distributions on the mean shape. The graph reveals that the error of the face contour is worse than that of the other parts, whereas the eyes show the best result for both datasets. Although the face contour feature points are located on a distinctive edge line, they are positioned ambiguously along the face contour (Fig. 6). Therefore, although the network trained the face contour feature points along the face contour line, it effectively learned only a relative position along the line, rather than the exact position. This phenomenon also occurs for other facial landmarks, as the dataset is human-annotated, which itself introduces error. Of the parts of the face, the eyes have the most distinctive features: the two eye corners and four further points along the eye lines and the pupil. The eyes therefore show the best results among the facial parts. The error ellipse distributions on the mean shape (Fig. 7) illustrate this phenomenon: the long axes of the error ellipses of the facial contour landmarks follow the facial contour. Figure 14 shows some examples from the validation set. Our system processes images from situations with considerable variation in pose and lighting conditions. Cases of significant error occur when the head pose estimation network fails to obtain a good initial estimate.


Fig. 6 Ground truth (red) vs. estimated landmark positions (yellow). Although the estimated face contour features indicate the face contour line, their exact positions do not match the ground truth.


Fig. 7 Cumulative error curves on the test set (left, middle) and error ellipse distribution on the mean shape of the LFPW test set (right). Errors are shown for the whole face and for each facial component: face contour, eyebrows, nose, eyes, and mouth. The larger the size of an ellipse, the greater the error variance.


Fig. 8 shows the test error trend over training iterations on the HELEN dataset. We trained the regression layer of the face alignment network in Fig. 4 with the feature extraction network weights fixed at each iteration. Notice that the face alignment error follows the general trend of the facial landmark classification error. This led us to believe that the alignment accuracy is highly correlated with the quality of the local feature set aimed at landmark classification. The more effective the local facial features are, the more accurate the resulting face alignment. This signifies the importance of learning not only the global features but also the local features in expert and intelligent systems.


Fig. 8 Landmark classification test error at each landmark classification training iteration, together with the face alignment normalized error. The landmark classification test error and the face alignment test error show a similar trend over the training iterations. This reveals that when the network learns better landmark features, the regression layer, given these good features, can learn more accurate face alignment.


Fig. 9 shows the correlation between the final and initial face alignment normalized errors on the 300-W common set and challenging set. The results indicate that the final error remains at a similar value across various initial errors when the initial error is below about 20%. However, when the initial error grows beyond 20%, the final error also increases. This signifies that the accuracy of the first estimate influences the accuracy of the final estimate, and underscores the importance of head pose estimation for providing a better initial estimate of the face shape.


Fig. 9 Final versus initial face alignment normalized error in the test process. It shows the importance of a good initial estimate for the accuracy of the final result.

5.4 Head Pose Estimation


Here, we report the results of head pose estimation. The 300-W dataset does not provide head pose information, only the 68 landmark annotations. In addition, to our knowledge, no public 2D image dataset provides 3D head pose annotations with continuous angle values. Therefore, most previous work used different evaluation metrics. For instance, Zhu et al. (2012) only evaluated the yaw angle in steps of 15°. The Pointing04 benchmark dataset also provides head pose in 15° steps. Malagavi et al. (2014) simply assessed whether the face in an image is positioned up, down, left, or right. Xiong et al. (2014) evaluated head pose estimation using 2D face images projected from a 3D face model with known pose parameters. This led us to create a 300-W 3D pose dataset from the 300-W dataset using public software (Torre et al., 2015) to evaluate head pose estimation. It is noteworthy that Xiong et al. (2014) claimed that the algorithm in this software shows an error of less than 1° for 3D face model pose estimation. We also compare our result with that obtained using Kinect face tracking, which also estimates head pose. Table 6 contains the experimental results. We report two cases of pose estimation based on: 1) the ground truth facial landmarks from the 300-W test set, and 2) the estimated facial landmarks. Fig. 10 compares our results with those obtained using a Kinect on a face motion video captured by the Kinect RGB-D camera. Although we used our algorithm only on the RGB images while Kinect uses RGB-D images, our results still correctly follow the trend of the results obtained with the Kinect (the ground truth is not available for exact comparison). The difference in the pitch angle estimation between our result and that of the Kinect is larger than those of the roll and yaw estimates, since the head pitch angle is ambiguous in a 2D image. Fig. 14 shows practical examples in which a 3D pose axis is drawn at the top left of each image.

Table 6. Head pose estimation results. Pose error is given in degrees (°).

Test Set | HELEN 68-pts | HELEN 49-pts | LFPW 68-pts | LFPW 49-pts | 300-W Common Subset | 300-W Challenging Subset | 300-W Full Set
Pose Error (°) | 2.59 | 2.55 | 3.55 | 3.29 | 1.97 | 5.41 | 2.62

Fig. 10 Comparison of Results with the Kinect SDK Face Tracking Algorithm

5.5 Application to Facial Landmark Tracking

Using the estimated head pose from the previous frame as an initial estimate for the current frame, we tracked facial landmarks in video. To evaluate this performance, we followed the 300-VW challenge scenario. Its training set contains 50 videos; the test set consists of three scenarios that contain 31, 19, and 14 videos, respectively. We trained our system using the training sets from HELEN, LFPW, and AFW in the 300-W dataset. We also included some video frames from the 300-VW training set. We initialized the face detection rectangle of the first frame in each test video with its ground truth to provide an accurate initial detection result. We also used the interocular normalized error in equation (4). Fig. 11 shows the result. Our algorithm produced results comparable to those of previous work (Chehra Tracker from Asthana et al., 2014; Wu et al., 2015). Our method outperforms previous methods when the normalized error is larger than 0.09. However, when the normalized error is less than 0.09, the results obtained in previous studies are more accurate. The 300-VW dataset contains numerous videos that show a small face compared to the frame size. When the size of the face is too small, it is difficult for our algorithm to generate distinctive local feature descriptors. Therefore, our system may have difficulty describing the facial shape precisely when the face is too small. An additional limitation of our approach is that we used a GPU for parallel processing when conducting local feature extraction on the global feature. The fitting process required around 32 ms per frame (8 ms for global feature map generation, 6 ms for local feature descriptor generation and regression per recurrent iteration) using an NVIDIA TITAN X GPU. When we use an i7-2600 CPU, it takes around 820 ms per frame (540 ms for global feature map generation, 70 ms for local feature descriptor generation and regression per recurrent iteration). It takes a relatively long time to generate the global feature map with one CPU since it takes a full-size 250 × 250 image, while previous CNN research (Zhang et al., 2014; Zhang et al., 2015) achieved real-time performance with low-resolution images. Fig. 12 shows the results on some sample sequences.


Fig. 11 Cumulative error curves of the proposed and comparison algorithms on the 300-VW dataset with three scenarios. We use the interocular normalized error. (a) 49 landmark case. (b) 68 landmark case.


Fig. 12 Facial landmark tracking result on some samples of the 300-VW dataset.

5.6 Application to Invisible Facial Landmark Detection


Since our system learns facial landmarks by training on local features, we found that it can also detect local invisibility using the local feature descriptors. When the local feature descriptor does not yield a confident score indicating that the current local region of interest belongs to a certain facial landmark, we can assume that the current local region is invisible in the final stage of face alignment. We addressed this problem by developing a support vector machine (SVM) classifier. We annotated the landmarks of the 300-W dataset: -1 if the landmark is invisible, and 1 if it is visible. In reality, the dataset contains ambiguity in the case of facial hair, beards, and whiskers. We only calculated invisible landmark estimates for facial contour landmarks because of the lack of invisible inner landmark data for SVM training. Table 7 presents our results. Clearly, the accuracy mainly depends on the positive (occlusion) sample size of the training data. We assume that there are two main reasons for this somewhat degraded performance: first, the local feature descriptor in the feature extraction network is trained to consider an invisible patch as belonging to its original landmark index; and second, the lack of an invisible landmark dataset. This suggests the need to improve the dataset design for future training and experimentation. Fig. 13 shows examples of the face alignment result under occlusion.
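A minimal sketch of this classifier (ours, using scikit-learn; the descriptor extraction is assumed to be done by the face alignment network, and the function names are illustrative):

```python
import numpy as np
from sklearn import svm

def train_occlusion_svm(descriptors, labels):
    """descriptors: (n, d) local feature descriptors of facial contour
    landmarks; labels: +1 (visible) or -1 (invisible) annotations."""
    clf = svm.SVC(kernel="linear")
    return clf.fit(descriptors, labels)

def invisible_landmarks(clf, final_descriptors):
    """Indices of contour landmarks whose descriptors, taken at the
    final alignment stage, are classified as invisible."""
    return np.where(clf.predict(final_descriptors) == -1)[0]
```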


Table 7. Landmark occlusion experimental results. Performance, especially sensitivity, predominantly depends on the number of positive (occluded) samples.

SVM training result on the training set with ground truth landmarks

Landmark Index | True Pos. | False Pos. | False Neg. | True Neg. | Positive Predict | Negative Predict | Sensitivity | Specificity | Positive Samples | Positive Ratio | SVM Classification
1 | 975 | 115 | 305 | 2605 | 0.894 | 0.895 | 0.762 | 0.958 | 1280 | 0.320 | 0.895
2 | 626 | 20 | 82 | 3272 | 0.969 | 0.976 | 0.884 | 0.994 | 708 | 0.177 | 0.975
3 | 451 | 7 | 57 | 3485 | 0.985 | 0.984 | 0.888 | 0.998 | 508 | 0.127 | 0.984
4 | 347 | 5 | 71 | 3577 | 0.986 | 0.981 | 0.830 | 0.999 | 418 | 0.105 | 0.981
5 | 207 | 40 | 190 | 3563 | 0.838 | 0.949 | 0.521 | 0.989 | 397 | 0.099 | 0.943
6 | 214 | 36 | 168 | 3582 | 0.856 | 0.955 | 0.560 | 0.990 | 382 | 0.096 | 0.949
7 | 435 | 11 | 30 | 3524 | 0.975 | 0.992 | 0.935 | 0.997 | 465 | 0.116 | 0.990
8 | 578 | 18 | 13 | 3391 | 0.970 | 0.996 | 0.978 | 0.995 | 591 | 0.148 | 0.992
9 | 617 | 8 | 19 | 3356 | 0.987 | 0.994 | 0.970 | 0.998 | 636 | 0.159 | 0.993
10 | 583 | 18 | 8 | 3391 | 0.970 | 0.998 | 0.986 | 0.995 | 591 | 0.148 | 0.994
11 | 436 | 11 | 29 | 3524 | 0.975 | 0.992 | 0.938 | 0.997 | 465 | 0.116 | 0.990
12 | 232 | 46 | 150 | 3572 | 0.835 | 0.960 | 0.607 | 0.987 | 382 | 0.096 | 0.951
13 | 206 | 37 | 191 | 3566 | 0.848 | 0.949 | 0.519 | 0.990 | 397 | 0.099 | 0.943
14 | 358 | 6 | 60 | 3576 | 0.984 | 0.983 | 0.856 | 0.998 | 418 | 0.105 | 0.984
15 | 460 | 8 | 48 | 3484 | 0.983 | 0.986 | 0.906 | 0.998 | 508 | 0.127 | 0.986
16 | 638 | 15 | 70 | 3277 | 0.977 | 0.979 | 0.901 | 0.995 | 708 | 0.177 | 0.979
17 | 1237 | 42 | 41 | 2680 | 0.967 | 0.985 | 0.968 | 0.985 | 1278 | 0.320 | 0.979

SVM prediction result on the test set with ground truth landmarks

Landmark Index | True Pos. | False Pos. | False Neg. | True Neg. | Positive Predict | Negative Predict | Sensitivity | Specificity | Positive Samples | Positive Ratio | SVM Classification
1 | 74 | 14 | 20 | 222 | 0.841 | 0.917 | 0.787 | 0.941 | 94 | 0.285 | 0.897
2 | 29 | 7 | 21 | 273 | 0.806 | 0.929 | 0.580 | 0.975 | 50 | 0.152 | 0.915
3 | 14 | 7 | 11 | 298 | 0.667 | 0.964 | 0.560 | 0.977 | 25 | 0.076 | 0.945
4 | 9 | 3 | 13 | 305 | 0.750 | 0.959 | 0.409 | 0.990 | 22 | 0.067 | 0.952
5 | 5 | 3 | 12 | 310 | 0.625 | 0.963 | 0.294 | 0.990 | 17 | 0.052 | 0.955
6 | 10 | 4 | 10 | 306 | 0.714 | 0.968 | 0.500 | 0.987 | 20 | 0.061 | 0.958
7 | 12 | 3 | 13 | 302 | 0.800 | 0.959 | 0.480 | 0.990 | 25 | 0.076 | 0.952
8 | 22 | 3 | 9 | 296 | 0.880 | 0.970 | 0.710 | 0.990 | 31 | 0.094 | 0.964
9 | 23 | 6 | 11 | 290 | 0.793 | 0.963 | 0.676 | 0.980 | 34 | 0.103 | 0.948
10 | 20 | 8 | 11 | 291 | 0.714 | 0.964 | 0.645 | 0.973 | 31 | 0.094 | 0.942
11 | 15 | 7 | 8 | 300 | 0.682 | 0.974 | 0.652 | 0.977 | 23 | 0.070 | 0.955
12 | 8 | 6 | 12 | 304 | 0.571 | 0.962 | 0.400 | 0.981 | 20 | 0.061 | 0.945
13 | 11 | 4 | 14 | 301 | 0.733 | 0.956 | 0.440 | 0.987 | 25 | 0.076 | 0.945
14 | 17 | 5 | 17 | 291 | 0.773 | 0.945 | 0.500 | 0.983 | 34 | 0.103 | 0.933
15 | 25 | 6 | 17 | 282 | 0.806 | 0.943 | 0.595 | 0.979 | 42 | 0.127 | 0.930
16 | 41 | 12 | 19 | 258 | 0.774 | 0.931 | 0.683 | 0.956 | 60 | 0.182 | 0.906
17 | 83 | 25 | 29 | 193 | 0.769 | 0.869 | 0.741 | 0.885 | 112 | 0.339 | 0.836


Fig. 13 Landmark occlusion detection example on the HELEN dataset with facial landmark detection. Green: occlusion estimation result on contour landmarks. Yellow: landmark estimation result.

6 Conclusion and Future Work


This research involves an extensive investigation into designing deep neural networks for face alignment based on local facial landmark learning and recurrent regression. Our research was motivated by the belief that investigating a face from lowlevel facial component features to higher ups in the feature hierarchy of CNN would produce more meaningful results for successful facial landmark detection. Based on this, we created a new facial landmark DB (which was not available for our use) from the original face public data and trained our CNN on this landmark DB to classify landmarks. In this process, a meaningful set of local feature descriptors were obtained for later use as global features when applied to the whole face image. The local feature extraction layer extracts local feature descriptors around the landmark positions. In this local descriptor space, the network learns a generic descent direction from the currently estimated landmark positions to the ground truth via a linear regression loop. Furthermore, by separating the feature extraction part and the regression part at the training stage, we could optimize each component separately for better performance similar to the previous object detection framework utilizing a pre-trained network for classification and then applying regression to the result. The effectiveness of our face alignment algorithm was validated by showing that it outperforms most state-of-the-art methods on the 300-W dataset. We further compared our pose estimation performance with the result from the Kinect for verification. As an extension to our work, we applied and further validated our system to facial tracking on the 300-VW video dataset. Furthermore, the use of local feature descriptors also lends itself to detecting even invisible landmarks, as has been experimentally verified on the 300-W dataset although the shortage of invisible landmarks in our DB has led to somewhat poor sensitivity. The feasibility study of handling invisible landmarks should be further investigated in the future using a sufficient DB. However, the proposed method, which first generates the global feature map based on the local feature extraction network and from this, extracts the desired local feature

However, the proposed method, which first generates the global feature map with the locally trained feature extraction network and then extracts the desired local feature maps from it for further processing, shows some limitations compared to other methods (Sun et al., 2013; Zhou et al., 2013; Zhang et al., 2014; Zhang et al., 2015) that learn from the whole facial image using CNNs. For instance, extracting a multitude of local feature descriptors from the global feature space by executing the local feature extraction layer in parallel requires parallel processing on a GPU for real-time performance, and a good way to speed up the algorithm itself should be investigated. In addition, when the resolution of the input face region is quite small, as in the 300-VW dataset, the local landmark features become blurred and the accuracy of the proposed method decreases. Another remaining issue is fine-tuning the feature extraction network. Although some object detection algorithms (Sermanet et al., 2013; Girshick et al., 2014) successfully separated feature extraction from regression training, recent work such as Fast R-CNN (Girshick, 2015) has shown more robust performance by jointly fine-tuning the pre-trained feature extraction and regression layers; this latter approach may also be useful for our research. Conventional applications of neural networks to vision problems usually process and train on the whole image. Since we instead propose a way to train on parts of the object as well as on the object itself, we expect our system to develop a deeper understanding of faces or objects, as shown in this paper. To this end, the use of the local feature extraction layer in obtaining the global feature map helps to selectively analyze local features in an effective way according to the given task. We expect that this 'generating a global feature map based on the trained local feature map' paradigm can also be applied to a variety of other related fields.
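As a side note on this parallelism issue, the window-gathering step itself reduces to a single batched indexing operation, which is exactly the kind of computation a GPU executes efficiently. The sketch below, under the same assumptions as the earlier illustration, gathers all landmark windows at once:

```python
import numpy as np

def extract_descriptors_batched(fm_padded, landmarks, half=2):
    """Gather every landmark window with one advanced-indexing expression;
    the same expression runs in parallel on a GPU array library (e.g. CuPy).
    fm_padded is the feature map already padded by `half` on each border."""
    offs = np.arange(-half, half + 1)
    xy = np.round(landmarks).astype(int) + half        # shift for the padding
    rows = xy[:, 1, None, None] + offs[None, :, None]  # (N, K, 1) row indices
    cols = xy[:, 0, None, None] + offs[None, None, :]  # (N, 1, K) col indices
    return fm_padded[rows, cols].reshape(len(landmarks), -1)  # (N, K*K*C)
```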


Fig. 14 Face alignment results from the 300-W validation set. Top: HELEN; Bottom: IBUG (challenging subset). Red rectangle: examples containing significant error. The head pose estimation result is also illustrated in the top left corner of each image. The results show that our system is robust to pose variation, illumination, and unusual facial expressions.


ACKNOWLEDGEMENTS


This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the "IT Consilience Creative Program" (NIPA-2013-H0203-13-1001), supervised by the NIPA (National IT Industry Promotion Agency).

References

1. Asthana, A., Zafeiriou, S., Cheng, S., & Pantic, M. (2013). Robust discriminative response map fitting with constrained local models. CVPR, 3444–3451.
2. Asthana, A., Zafeiriou, S., Cheng, S., & Pantic, M. (2014). Incremental face alignment in the wild. CVPR, 1859–1866.
3. Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2013). Localizing parts of faces using a consensus of exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2930–2940.
4. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
5. Burgos-Artizzu, X., Perona, P., & Dollár, P. (2013). Robust face landmark estimation under occlusion. Proceedings of the IEEE International Conference on Computer Vision, 1513–1520.
6. Cao, X., Wei, Y., Wen, F., & Sun, J. (2014). Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2), 177–190.
7. Cootes, T. F., Hill, A., Taylor, C. J., & Haslam, J. (1993, June). The use of active shape models for locating structures in medical images. Information Processing in Medical Imaging, 33–47. Berlin, Germany: Springer-Verlag.
8. Cootes, T. F., Edwards, G. J., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 681–685.
9. Ge, Y., Peng, C., Hong, M., Huang, S., & Yang, D. (2016). Joint local regressors learning for face alignment. Neurocomputing.
10. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587.
11. Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 1440–1448.
12. Hecht-Nielsen, R. (1989, June). Theory of the backpropagation neural network. IEEE/IEE International Joint Conference on Neural Networks, 593–605.
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., & Darrell, T. (2014, November). Caffe: Convolutional architecture for fast feature embedding. Proceedings of the ACM International Conference on Multimedia, 675–678.
14. Kazemi, V., & Sullivan, J. (2014). One millisecond face alignment with an ensemble of regression trees. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1867–1874.
15. Ke, Y., & Sukthankar, R. (2004, June). PCA-SIFT: A more distinctive representation for local image descriptors. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 2, II-506–II-513.
16. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105.


17. Lai, H., Xiao, S., Cui, Z., Pan, Y., Xu, C., & Yan, S. (2015). Deep cascaded regression for face alignment. arXiv preprint arXiv:1510.09083.
18. Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1), 98–113.
19. Le, V., Brandt, J., Lin, Z., Bourdev, L., & Huang, T. S. (2012). Interactive facial feature localization. Computer Vision–ECCV 2012, 679–692. Berlin, Germany: Springer-Verlag.
20. Malagavi, N., Hemadri, V., & Kulkarni, U. P. (2014). Head pose estimation using convolutional neural networks. International Journal of Innovative Science, Engineering & Technology, 1(6).
21. Rao, C. R., & Mitra, S. K. (1971). Generalized inverse of matrices and its applications (Vol. 7). New York: Wiley.
22. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640.
23. Ren, S., Cao, X., Wei, Y., & Sun, J. (2014). Face alignment at 3000 fps via regressing local binary features. CVPR, 1685–1692.
24. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013). 300 faces in-the-wild challenge: The first facial landmark localization challenge. Proceedings of the IEEE International Conference on Computer Vision Workshops, 397–403.
25. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
26. Sermanet, P., Kavukcuoglu, K., Chintala, S., & LeCun, Y. (2013). Pedestrian detection with unsupervised multi-stage feature learning. CVPR, 3626–3633.
27. Smith, B., Brandt, J., Lin, Z., & Zhang, L. (2014). Nonparametric context modeling of local appearance for pose- and expression-robust facial landmark localization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1741–1748.
28. Sun, Y., Wang, X., & Tang, X. (2013). Deep convolutional network cascade for facial point detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3476–3483.
29. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
30. Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. Advances in Neural Information Processing Systems, 2553–2561.
31. De la Torre, F., Chu, W. S., Xiong, X., Vicente, F., Ding, X., & Cohn, J. (2015, May). IntraFace. 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 1, 1–8.
32. Tzimiropoulos, G., & Pantic, M. (2014). Gauss-Newton deformable part models for face alignment in-the-wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1851–1858.
33. Wu, Y., & Ji, Q. (2015). Shape augmented regression method for face alignment. Proceedings of the IEEE International Conference on Computer Vision Workshops, 26–32.
34. Xiong, X., & De la Torre, F. (2014). Supervised descent method for solving nonlinear least squares problems in computer vision. arXiv preprint arXiv:1405.0601.
35. Zhang, J., Shan, S., Kan, M., & Chen, X. (2014). Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. Computer Vision–ECCV 2014, 1–16. Springer International Publishing.


36. Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2015). Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5), 918–930.
37. Zhao, X., Kim, T. K., & Luo, W. (2014). Unified face analysis by iterative multi-output random forests. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1765–1772.
38. Zhou, E., Fan, H., Cao, Z., Jiang, Y., & Yin, Q. (2013). Extensive facial landmark localization with coarse-to-fine convolutional network cascade. Proceedings of the IEEE International Conference on Computer Vision Workshops, 386–391.
39. Zhu, S., Li, C., Loy, C. C., & Tang, X. (2015). Face alignment by coarse-to-fine shape searching. CVPR, 4998–5006.
40. Zhu, X., & Ramanan, D. (2012, June). Face detection, pose estimation, and landmark localization in the wild. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2879–2886.