Sparse fully convolutional network for face labeling


Communicated by Yang Tang

Accepted Manuscript

Sparse Fully Convolutional Network for Face Labeling

Minghui Dong, Shiping Wen, Zhigang Zeng, Zheng Yan, Tingwen Huang

PII: S0925-2312(18)31426-7
DOI: https://doi.org/10.1016/j.neucom.2018.11.079
Reference: NEUCOM 20216

To appear in: Neurocomputing

Received date: 8 September 2018
Revised date: 29 October 2018
Accepted date: 27 November 2018

Please cite this article as: Minghui Dong, Shiping Wen, Zhigang Zeng, Zheng Yan, Tingwen Huang, Sparse Fully Convolutional Network for Face Labeling, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.11.079

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Minghui Dong a,b, Shiping Wen a,b, Zhigang Zeng a,b, Zheng Yan c, Tingwen Huang d

a School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China.
b Key Laboratory of Image Processing and Intelligent Control of Education Ministry of China, Wuhan 430074, China.
c Centre for Artificial Intelligence, University of Technology Sydney, Australia.
d Texas A&M University at Qatar, Doha 5825, Qatar.

Abstract


This paper proposes a sparse fully convolutional network (FCN) for face labeling. FCN has demonstrated strong capabilities in learning representations for semantic segmentation. However, it often suffers from heavy redundancy in parameters and connections. To ease this problem, group Lasso regularization and intra-group Lasso regularization are utilized to sparsify the convolutional layers of the FCN. In this framework, parameters that correspond to the same output channel are grouped into one group, and these parameters are zeroed out simultaneously during training. For the parameters in groups that are not zeroed out, intra-group Lasso provides further regularization. The essence of the regularization framework lies in its ability to offer better feature selection and higher sparsity. Moreover, a fully connected conditional random field (CRF) model is used to refine the output of the sparse FCN. The proposed approach is evaluated on the LFW face dataset and achieves state-of-the-art performance. Compared with a non-regularized FCN, the sparse FCN reduces the number of parameters by 91.55% while improving the segmentation performance by an 11% relative error reduction.

Keywords: Fully convolutional network, face labeling, group Lasso

1. Introduction


Image semantic segmentation is an important topic in computer vision, as it plays a major role in image understanding, object recognition, medical diagnosis, etc. Face labeling is a sub-domain of image semantic segmentation. Given an image containing a human face, the task of face labeling is to classify every pixel in the image as background, hair, nose, mouth, eye, etc. Face labeling has widespread applications in areas such as expression understanding [1] and face editing [2, 3].

Face labeling is challenging because illumination, head pose, hairstyle, skin color and background differ significantly across images. Therefore, deep discriminative representations are needed for robust face labeling. Traditional methods based on hand-crafted features are unable to provide satisfactory results on challenging face images such as occluded faces. Recently, with the development of deep learning, many works have investigated the use of deep neural networks to learn features and representations [4, 5, 6, 7]. However, these approaches commonly use deep structures with an enormous number of parameters, and they require large computational resources and storage space. Meanwhile, applications related to face labeling are typically deployed in resource-constrained scenarios, such as mobile devices for virtual makeup and customs clearance devices for face verification. In these scenarios, a lightweight and efficient model is needed for fast inference to meet real-time requirements. Heavy models severely limit the applicability of face labeling. This is the main motivation for this work.

A fully convolutional network (FCN), a convolutional neural network in which all layers are convolutional, produces an output of the same size as the original input image. Each pixel is labeled as one of C categories, where C is the number of categories to be segmented. Due to its end-to-end segmentation property and strong ability to learn representations, FCN has become the mainstream algorithm in semantic segmentation. Many image processing works have used FCN models, for tasks such as boundary detection [8] and visual tracking [9]. In [4], the FCN-8s model adapted from VGG16 [10] achieved the highest accuracy in the experiments. This model combines shallow layers with deep layers to refine its segmentation results, which have the same size as the input images. Inspired by this result, we take advantage of this model to conduct face labeling. We train a model similar to FCN-8s on the LFW dataset and classify every pixel into three classes: background, hair, and skin.

Although FCN has a strong ability to learn representations, its model size is usually too large to be practical in resource-constrained environments. Many works have attempted to obtain more compact architectures to accelerate computing [11, 12]. FCN-8s has about 140M parameters; however, not all of these parameters are effective for the final segmentation in specific tasks. Certain parameters may even have a negative impact on the final result. Moreover, it is hard to deploy such a huge model in many situations. Motivated by these observations, we utilize the group Lasso, which was originally used in linear regression [13, 14], to regularize the model. The parameters that correspond to the same output channel are grouped into one group, and they are zeroed out simultaneously during training. With the help of group Lasso, channels that are useless or harmful to the final result are discarded. This can be viewed as feature selection. The limitation of this method is that the retained channels receive no effective regularization. To address this issue, inspired by [15], intra-group Lasso (called shape-wise regularization in [15]) is introduced into the regularization process. The combination of group Lasso and intra-group Lasso greatly reduces the parameters of the FCN model: our final sparse FCN model has only about 9% of the parameters of the original FCN model. Meanwhile, the overall accuracy is increased by 0.46%. Moreover, since group Lasso and intra-group Lasso can be applied not only to convolutional layers but also to fully connected layers, the proposed method is not limited to fully convolutional networks for face labeling. Classification or regression tasks based on deep neural networks can use this method to further compress the model and improve its accuracy. Since this work focuses on fast face labeling, these tasks are not discussed in the paper and are left for future work. In addition, a fully connected CRF [16] can be utilized to refine the output of the sparse FCN model to further improve the performance. The CRF-based sparse FCN achieves state-of-the-art results on the LFW dataset. In summary, the contributions are as follows:

1) A novel FCN model with a lightweight structure is proposed for face labeling. The proposed model can be widely applied in resource-constrained scenarios.
2) Group Lasso and intra-group Lasso are introduced to regularize the FCN and obtain a sparse model. This approach reduces the overall number of parameters by 91.55% and improves the segmentation performance by an 11% relative error reduction.
3) With the help of a fully connected CRF, the proposed approach achieves state-of-the-art results on the LFW dataset.

The rest of this paper is organized as follows: Section 2 provides an overview of related works in the field of face labeling and sparse regularization. Section 3 describes the details of the proposed algorithm. Section 4 shows comparative experimental results on the LFW dataset. Finally, the conclusion is drawn in Section 5.

∗ Corresponding author. Tel.: +86 18971124190; fax: +86-27-87543630. Email address: [email protected] (Shiping Wen)
Preprint submitted to Neurocomputing, December 4, 2018

2. Related Works

In this section, we review recent works on face labeling and on sparse regularization of deep neural networks.

Face labeling has been a research focus due to its strong connections with many real-world applications. Kae et al. first modeled a face shape prior using a restricted Boltzmann machine and then utilized a CRF model to label three classes (skin, background and hair) [17]. Lee et al. modeled the color and location distributions of the face and combined them with Graph-Cut and Loopy Belief Propagation to conduct face labeling [18]. These methods are based on hand-crafted features, so they generally cannot obtain semantic representations and have poor robustness. Recently, many researchers have developed deep learning methods to perform face segmentation. Luo et al. trained several different deep learning models to conduct face labeling; however, the hair was not labeled by their methods [19]. Liu et al. utilized a single unified convolutional neural network to output the unary and pairwise potentials of a conditional random field simultaneously [20]. This method took into account not only the unary label likelihoods but also the pairwise label dependencies when labeling a face image. Güçlü et al. proposed an end-to-end trainable model which combined the advantages of CNN, RNN, CRF and adversarial networks for face labeling [21]. This method achieved excellent results on both the LFW and Helen datasets. Unlike the existing deep learning based methods, our approach focuses on optimizing the compactness of the segmentation model by exploring novel sparse regularization techniques. As a result, a more compact FCN model is obtained that retains the advantages of the original model.

Extensive work has been done on reducing the parameters of deep networks. In [22], network parameters were reduced by pruning redundant connections. However, little reduction was achieved on convolutional layers. As the FCN is composed of convolutional layers only, this pruning method cannot contribute much to a compact FCN model. Low Rank Approximation (LRA) was employed to sparsify parameters and accelerate computing [23, 24, 25]. A flaw of this approach is that the structure of the optimized network is fixed. Moreover, in [26], a knowledge distillation based method was proposed to compress ensemble models. The ibpCNN (Indian Buffet Process Convolutional Neural Network) model was proposed to automatically learn an adaptive structure from the training data [27]. This approach is similar to ours; however, we utilize group-level regularization to sparsify our model, which yields a more compact structure.

Several Lasso-based sparse methods have been proposed to compress models or further improve their accuracy [28, 29, 30, 31]. In [28], L1-norm and L2-norm constraints were both adopted to make specific group characteristics represent specific test samples, with the goal of improving the expressive ability of the model. Similarly, an adaptive class-preserving representation method based on group Lasso was proposed for classification in [29]. In [30], sparse-group Lasso was utilized to regularize a linear regression model. These methods all focus on classification or regression problems rather than semantic segmentation, and they did not explore deep neural networks. In a recent work [31], deep neural networks were optimized by group Lasso; however, this work mainly focused on fully connected layers. A comprehensive method named Structured Sparsity Learning (SSL) was proposed to regularize deep neural networks [15]. The filters, channels, filter shapes and layer depth can be optimized simultaneously by the corresponding methods, but that work did not explore deep fully convolutional networks. To the best of our knowledge, there is no previous work on FCN sparsification, especially in the field of face labeling. For the reasons above, this paper tries to improve the performance of face labeling by sparsifying the FCN model.

[Figure 1 diagram: input image, conv1/pool1 through conv5/pool5, conv6, conv7; score_conv7 is upsampled by 2 and added to score_pool4 (fuse_pool4), which is upsampled by 2 and added to score_pool3 (fuse_pool3), which is upsampled by 8 to give the output image.]

Figure 1: The FCN segmentation architecture for face labeling. Each rectangular block represents a convolutional layer or a pooling layer.

3. Theory Details

3.1. Segmentation Architecture

Inspired by [4], this paper adapts a model from VGG16, similar to FCN-8s, as the basic segmentation architecture. The first fully connected layer of VGG16 is replaced with a convolutional layer named conv6, which has a shape of 7 × 7 × 512 × 4096 in the order of kernel height, kernel width, number of input channels, and number of output channels. Similarly, the second fully connected layer of VGG16 is replaced by a convolutional layer named conv7, whose shape is 1 × 1 × 4096 × 4096. A score layer with a shape of 1 × 1 × 4096 × 3 is then appended to conv7 to output the scores of the three classes. To obtain more detailed results, we combine the output of deep layers with the output of shallow layers before the final upsampling operation. The score layer of conv7 is first upsampled by a deconvolutional layer with factor 2 and then fused with the score layer of pool4. This fused layer is named fuse_pool4. Similarly, the fuse_pool4 layer is upsampled by a deconvolutional layer with factor 2 and then fused with the score layer of pool3. This fused layer is named fuse_pool3. Finally, fuse_pool3 is upsampled by a deconvolutional layer with factor 8 to output the final segmentation result. Five pooling layers with factor 2 are used in VGG16, so the score layer of conv7 is upsampled by a total factor of 32 (2 × 2 × 8) to produce a map of the same size as the input image. All deconvolutional layers are initialized with bilinear interpolation kernels. The parameters in the deconvolutional layers can also be learned. The proposed segmentation architecture is depicted in Fig. 1.

3.2. Group Level Sparsity and Feature Selection

Although FCN has strong abilities in representation learning and feature extraction, the huge number of parameters severely limits the applicability of the algorithm in real-world scenarios. In [32], Denil et al. showed that deep convolutional neural networks have very serious parameter redundancy. In their approach, only a small number of weights were learned and the others were predicted. The experiments showed that up to 95% of the weights could be predicted without any loss in accuracy. Moreover, the FCN model we developed is adapted from VGG16, which was designed for a task with many more categories than face labeling. Therefore, this model has a huge potential to be sparsified. This paper attempts to sparsify the parameters of this FCN model with group-level regularization and to prune the useless or harmful channels. Lasso-based regularization techniques (L1 regularization) have proven effective for introducing sparsity in the domain of facial image recognition. However, conventional Lasso cannot help to obtain a more compact FCN model, because a channel can be pruned only if all the parameters related to it have been zeroed out. To overcome this disadvantage, this paper clusters all parameters related to the same output channel into one group. During the training stage, the parameters of one group are either zeroed out together or retained together, with the help of a group-level penalty term. We call this convolutional group Lasso. In this case, some groups (channels) can be discarded during the testing stage, which cannot be achieved by L1 regularization. This process can also be viewed as feature selection: the model learns to choose more useful kernel features during the training stage. To further reduce the number of parameters in the retained channels, we introduce intra-group Lasso to regularize them. Intra-group Lasso means that parameters at the same location in different output channels are grouped into one group and are zeroed out simultaneously, in the same way as in group Lasso. Four different regularization methods are shown in Fig. 2 to better illustrate this idea.
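The two groupings described above can be made concrete with a small sketch. It is an illustration only, not the authors' Caffe implementation: the function names, the nested-list weight layout `weights[n][c][h][w]` (input channel, output channel, kernel row, kernel column) and the toy values are assumptions made for the example. The default pruning threshold mirrors the value ξ = 1e-4 that the paper uses when measuring sparsity in Section 4.2.

```python
import math


def group_lasso_penalty(weights):
    """Sum of L2 norms with one group per output channel."""
    n_in, n_out = len(weights), len(weights[0])
    total = 0.0
    for c in range(n_out):
        squared = sum(v * v
                      for n in range(n_in)
                      for row in weights[n][c]
                      for v in row)
        total += math.sqrt(squared)
    return total


def intra_group_lasso_penalty(weights):
    """Sum of L2 norms with one group per (input channel, row, column)
    position, each group running across all output channels."""
    n_in, n_out = len(weights), len(weights[0])
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    total = 0.0
    for n in range(n_in):
        for h in range(k_h):
            for w in range(k_w):
                squared = sum(weights[n][c][h][w] ** 2 for c in range(n_out))
                total += math.sqrt(squared)
    return total


def prunable_channels(weights, threshold=1e-4):
    """Indices of output channels whose weights are all below threshold."""
    n_in, n_out = len(weights), len(weights[0])
    return [c for c in range(n_out)
            if all(abs(v) < threshold
                   for n in range(n_in)
                   for row in weights[n][c]
                   for v in row)]


# Hypothetical layer: 1 input channel, 2 output channels, 1 x 2 kernels.
toy_weights = [[[[3.0, 4.0]], [[0.0, 0.0]]]]
group_penalty = group_lasso_penalty(toy_weights)        # sqrt(9 + 16) + 0
intra_penalty = intra_group_lasso_penalty(toy_weights)  # 3 + 4
removable = prunable_channels(toy_weights)              # channel 1 is all zero
```

In the toy layer the second output channel contributes nothing to the group penalty and can be removed outright, while the intra-group penalty still charges each kernel position of the surviving channel separately. This is the sense in which the second term keeps regularizing channels that group Lasso alone leaves untouched.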

Fig. 2 shows a simplified convolutional operation (three input channels and two output channels). Fig. 2a shows the kernels without sparse regularization. Different colors represent different convolutional kernels. In this paper, all values of kernels without sparse regularization are assumed to be nonzero. Fig. 2b shows a possible outcome for kernels penalized by Lasso regularization. Grey indicates that the value at that location is zero. The Lasso penalty can zero out elements in different channels without taking the channel level into account, which makes it hard to prune channels or kernels. Fig. 2c shows a possible outcome for kernels penalized by group Lasso regularization. All parameters in the second output channel are zeroed out simultaneously, so the second output channel can be removed safely. However, no parameters are zeroed out in the first output channel. Fig. 2d shows a possible outcome for kernels penalized by group Lasso with intra-group Lasso regularization. Compared with kernels regularized by group Lasso only, some elements in the first output channel are also zeroed out, which gives more sparsity. Compared with L1 regularization, intra-group Lasso can also help to learn a more suitable kernel size.

Figure 2: Four different regularization methods: (a) no regularization, (b) Lasso, (c) group Lasso, (d) group Lasso with intra-group Lasso. Each panel shows input channels against output channels. Different colors represent different convolutional kernels. Grey indicates that the value at that location is zero.

In the following, the formulation of the algorithm is introduced. Denote $W^{(l)} \in \mathbb{R}^{N_l \times C_l \times H_l \times W_l}$ as the weights of the $l$th ($1 \le l \le L$) convolutional layer in the FCN model, where $N_l$, $C_l$, $H_l$, $W_l$ represent the number of input channels, the number of output channels, the height of the kernels, and the width of the kernels, respectively. With the combination of group Lasso and intra-group Lasso, the objective function to be minimized is formulated as follows:

$$L(W) = L_o(W) + \lambda R(W) + \lambda_1 \sum_{l=1}^{L} R_{g1}(W^{(l)}) + \lambda_2 \sum_{l=1}^{L} R_{g2}(W^{(l)}) \tag{1}$$

where $L_o(W)$ is the loss function; $R(W)$ is the L2-norm, which is used to prevent overfitting; $R_{g1}(W^{(l)})$ is the penalty term for group Lasso; $R_{g2}(W^{(l)})$ is the penalty term for intra-group Lasso.

The group Lasso regularization term is of the following form:

$$R_{g1}(W^{(l)}) = \sum_{g1=1}^{G_1^l} \left\| w^{(g1)} \right\|_{g1} = \sum_{g1=1}^{G_1^l} \sqrt{\sum_{i=1}^{K_1} \left( w_i^{(g1)} \right)^2} \tag{2}$$

where $G_1^l$ represents the number of group Lasso groups in the $l$th convolutional layer (i.e. the number of output channels $C_l$); $K_1$ represents the number of weights in every group (i.e. the product of the number of input channels, the height of the kernels and the width of the kernels, $N_l \times H_l \times W_l$).

The intra-group Lasso regularization term is of the following form:

$$R_{g2}(W^{(l)}) = \sum_{g2=1}^{G_2^l} \left\| w^{(g2)} \right\|_{g2} = \sum_{g2=1}^{G_2^l} \sqrt{\sum_{i=1}^{K_2} \left( w_i^{(g2)} \right)^2} \tag{3}$$

where $G_2^l$ represents the number of intra-group Lasso groups in the $l$th convolutional layer (i.e. the product of the number of input channels, the height of the kernels and the width of the kernels, $N_l \times H_l \times W_l$); $K_2$ represents the number of weights in every group (i.e. the number of output channels $C_l$).

3.3. Fully Connected Conditional Random Field

Although the shallow layers are combined with deep layers in our FCN model to refine the segmentation result, the score maps of the FCN are still very smooth and the classification results are homogeneous. However, detailed results are needed in the field of face labeling. To ease this problem, we utilize a fully connected CRF model [16] for post-processing. This model uses the following energy function:

$$E(x) = \sum_{i \in \nu} u_i(x_i) + \sum_{(i,j) \in \nu} p_{ij}(x_i, x_j) \tag{4}$$

where $\nu$ is the set of all pixels in one input image; $(i, j)$ is any pair of pixels in $\nu$, regardless of the distance between $i$ and $j$; $x_i$ is the label assignment of the $i$th pixel. $u_i(x_i)$ is the unary potential, and this term is of the following form:

$$u_i(x_i) = -\log P(x_i) \tag{5}$$

where $P(x_i)$ is the output of the FCN, which indicates the probability of background, skin and hair for every pixel. $p_{ij}(x_i, x_j)$ is the pairwise potential, and this term is of the following form:

$$p_{ij}(x_i, x_j) = \omega_1 \exp\left( -\frac{|p_i - p_j|^2}{2\sigma_\alpha^2} - \frac{|I_i - I_j|^2}{2\sigma_\beta^2} \right) + \omega_2 \exp\left( -\frac{|p_i - p_j|^2}{2\sigma_\gamma^2} \right) \tag{6}$$

where $p$ represents the positions of pixels and $I$ represents the color intensities of pixels. This pairwise potential is composed of two Gaussian kernels. The first Gaussian kernel depends on the position and color intensity differences of two pixels. The second Gaussian kernel depends on the position differences of two pixels only. $\sigma_\alpha$, $\sigma_\beta$, $\sigma_\gamma$ control the scale of the two Gaussian kernels. More details about the fully connected CRF can be found in [16].

4. Experiments

4.1. Datasets, Experimental Setup and Evaluation Metric

4.1.1. Datasets

The proposed model was trained and tested on the LFW part labels dataset [17], a standard face labeling benchmark. This dataset has been widely used, so it is easy to compare with previous face labeling works. The LFW part labels dataset contains 2927 face images taken in unconstrained environments and is divided into a training set of 1500 images, a validation set of 500 images, and a test set of 927 images. Pixels in all images are manually labeled as background, skin or hair. In our experiments, the training set is used to train the basic face labeling FCN model and to regularize the model. The validation set is used to choose appropriate hyper-parameters (i.e. the learning rate, the coefficients of the regularization terms, the coefficients of the CRF model, etc.). The FCN model we adopted has a receptive field of 404 × 404, which is bigger than the original image size in the LFW part labels dataset (250 × 250). Therefore, we resize the images to 500 × 500 to fit the receptive field of our model. The training set of the LFW part labels dataset contains only 1500 images, which can easily result in overfitting, especially when training a deep neural network. For this reason, we flip the original images horizontally to perform data augmentation. Because the face images have been aligned to a canonical position, we do not rotate the images. A total of 3000 images are used to train and regularize the FCN model.
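The horizontal-flip augmentation of the training set can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' data pipeline: images and label maps are represented as plain nested lists, the function names are hypothetical, and the label coding 0 = background, 1 = skin, 2 = hair follows the convention of the evaluation metric section. The one point worth showing is that the label map must be mirrored together with the image so that pixels and labels stay aligned.

```python
def flip_horizontal(grid):
    """Mirror a 2-D grid (a list of rows) left to right."""
    return [list(reversed(row)) for row in grid]


def augment_with_flips(samples):
    """Return every (image, labels) pair followed by its mirrored copy."""
    augmented = []
    for image, labels in samples:
        augmented.append((image, labels))
        augmented.append((flip_horizontal(image), flip_horizontal(labels)))
    return augmented


# Hypothetical 2 x 3 sample: pixel values and per-pixel class labels.
toy_image = [[1, 2, 3],
             [4, 5, 6]]
toy_labels = [[0, 1, 2],
              [0, 1, 1]]

augmented = augment_with_flips([(toy_image, toy_labels)])
```

Applied to the 1500 training images, doubling each sample in this way gives the 3000 images used for training and regularization.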

4.1.2. Experimental Setup For the consideration of guaranteeing accuracy, a two-step training strategy is adopted. At first, the FCN model is trained without group Lasso and intra-group Lasso regularization terms. When the test accuracy no longer increases, two regularization terms are added into the objective function to conduct the second phase of training. During the first phase train5

ACCEPTED MANUSCRIPT

ing, convolutional layers are initialized from conv1 1 to layer conv7 with the pre-trained weights of VGG16. The deconvolutional layers are initialized by bilinear interpolation kernel. The other layers are initialized by Xavier method. Batch size is set to be 8 because of the memory restriction of graphics card. The Adam learning method with parameters β1 =0.9 , β2 = 0.999 , ε = 1e − 8 are deployed to train our model. The initial learning rate (i.e. lr ) is 1e − 12 and it is decayed with a factor of 0.1 when the test loss no longer decreases. The weight decay coefficient (i.e. λ ) is 5e − 4. During the second phase training, the model is initialized with the pre-trained weights of first training phase. The objective function is replaced. Adam learning method is also deployed with the same parameters as used in first training phase. The coefficients of regularization terms are set as λ1 = 2, λ2 = 1 (high values are set to get high sparsity). The initial learning rate is 1e − 12. The proposed approach is implemented by the open source computing platform Caffe [33] with cuda and cudnn. The configuration of our workstation is Intel Xeon CPU E5-2620 v3 CPU, Nvidia GTX Titan X GPU and 32G RAM. The system is Ubuntu 16.04 64bit. 4.1.3. Evaluation metric In order to facilitate comparison with other works, per-pixel overall accuracy is used to evaluate the proposed method, which was widely used in previous works. Assume that Ni, j denotes the number of pixel whose ground truth label is i and the predicted label is j in one mini-batch, in which i, j ∈ {0, 1, 2} represents background, skin and hair respectively. Then the overall accuracy can be implemented as:

1

0.968 0.966

0.8

0.964 0.962

0.6

0.96 0.4

0.958 0.956

0.2

0.954

0

0.952 0

50

90

1 3 0 1 7 0 2 1 0 2 3 0 2 4 0 2 7 0 3 0 0 4 5 0 6 0 0 7 2 0 EPOCH

accuracy

CR IP T

sparsity

Figure 3: The training trend of sparsity and accuracy. In order to better reflect the overall variation trend, we don’t display sparsity and accuracy in a homogeneous fashion, and carry out more intensive sampling display in the areas where the sparsity changed dramatically. When the epoch reached 700, the sparsity was almost no longer changed, so we stopped the second phase of training in the epoch 720.

i=0

Ni,i

2 P 2 P

i=0 j=0

Ni, j

ED

P=

2 P

M

AN US

directly compare the sparsity of each layer, Fig. 4 shows a histogram of the sparsity of each layer. During the second training phase, we didn’t specially adjust the sparsity factor of each layer and all layers in the FCN model are trained under the same conditions. The conv6 layer with the most parameters has the highest sparsity of 98.06%. The conv1 1 layer has the lowest sparsity of 7%. This is intuitive that the low-level features contain a lot of detail features, which are necessary of high-level features, so the conv1 1 has a lower sparsity. The overall parameters of main convolutional layers decrease from 134M to 11M. The variation of kernels before and after sparse regularization. In order to show the variation of kernels in a more intuitive fashion, we select conv1 2 with less output channels (64) to illustrate the power of sparse regularization. All the parameters in conv1 2 layers are shown in Fig. 5. Fig. 5a shows the kernels of conv1 2 layer before sparse regularization. Fig. 5b shows the kernels has been regularized by group Lasso penalty. We can see that parameters are zeroed out at the level of channels (i.e. groups). However parameters in reserved channels still have high redundancy. Fig. 5c shows the kernels has been regularized by group Lasso penalty and intragroup Lasso penalty. With the help of intra-group Lasso, some parameters in reserved channels are also zeroed out. Feature selection. In order to further explore the effect of

(7)

AC

CE

PT

4.2. The Sparsity Result The training trend of sparsity and accuracy. This work analyzes the variation trend of sparsity and test accuracy as the number of epochs increases during the second training phase. As shown in Fig. 3, when computing the sparsity of networks, a threshold value is set as ξ=1e − 4 and weight w will be zeroed out if |w| < ξ. Then the sparsity is the ratio of zero weights to all parameters. Note that the biases and parameters of deconvolutional layers are ignored. Meanwhile, parameters in score layers only account for a fraction of the all. The accuracy is overall per-pixel accuracy. The test accuracy was 95.79% after the first training phase. With the group Lasso and intra-group Lasso regularization terms, test accuracy further increased and the maximum test accuracy (96.62%) of the entire training phase was achieved at the epoch of 210. At the end of second training phase, the sparsity ratio was 91.55% and the accuracy was 96.28% which was improved by 11% relative error reduction compared with the unsparsely optimized model. Table 1 shows the variation of different convolutional layers’ parameters before and after second training phase. In order to

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.7684 0.6079 0.5317

0.8157 0.7574 0.6871 0.7105 0.7119 0.6703

0.9806 0.8262 0.8256 0.6491

0.5149

0.07

Figure 4: The sparsity of each layer in the final sparsity FCN model

6

ACCEPTED MANUSCRIPT

(a)

(b)

(c)

CR IP T

Figure 5: The variation of conv1 2 kernels before and after sparse regularization. Parameters in one output channel (one group) are flatten into one row (i.e. each row in figures shows parameters associated with an output channel). The darker the color, the smaller the value of the corresponding position.

Table 1: Parameters Comparison Before and After Second Training Phase

| Layer name | Weight | Sparse weight | Sparsity |
|---|---|---|---|
| conv1_1 | 1728 | 1607 | 0.07 |
| conv1_2 | 36864 | 17263 | 0.5317 |
| conv2_1 | 73728 | 28908 | 0.6079 |
| conv2_2 | 147456 | 34150 | 0.7684 |
| conv3_1 | 294912 | 143061 | 0.5179 |
| conv3_2 | 589824 | 170754 | 0.7105 |
| conv3_3 | 589824 | 194464 | 0.6703 |
| conv4_1 | 1179648 | 369111 | 0.6871 |
| conv4_2 | 2359296 | 572365 | 0.7574 |
| conv4_3 | 2359296 | 679713 | 0.7119 |
| conv5_1 | 2359296 | 434818 | 0.8157 |
| conv5_2 | 2359296 | 410045 | 0.8262 |
| conv5_3 | 2359296 | 411461 | 0.8256 |
| conv6 | 102760448 | 1993552 | 0.9806 |
| conv7 | 16777216 | 5887125 | 0.6491 |
| total | 134248128 | 11348403 | 91.55% (sparse proportion) |

Figure 6: Comparison of feature maps in conv1_2. The eight feature maps occupy the same locations in their respective channels. The first row shows the results of the model without regularization, and the second row shows the results of the model regularized by our method.

sparse regularization on the improvement of accuracy and feature selection, we extract eight feature maps of the conv1_2 layer for comparison, as shown in Fig. 6. The comparison shows that the feature maps without sparse optimization are more obscure, whereas the feature maps with sparse optimization are clearer (for example, the 5th feature map). The group of sparse feature maps no longer contains feature maps that carry little characteristic information (such as the 2nd, 4th, 7th, and 8th feature maps). Group Lasso sparse regularization can therefore force the network to learn more useful features during training, which not only reduces the number of parameters but also eliminates the interference of useless information.

Table 2: The Overall Accuracy of Different Methods

| Method | Overall accuracy |
|---|---|
| GLOC [17] | 94.95% |
| MO-GC with prior [20] | 95.12% |
| CnnRnnGan [21] | 96.67% |
| Ours without regularization | 95.79% |
| Ours with regularization (the highest accuracy) | 96.62% |
| Ours with regularization (the highest accuracy) and CRF | 96.82% |
| Ours with regularization (the highest sparsity) | 96.25% |
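The figures in Table 1 and Table 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the "sparse weight" column counts the weights that remain nonzero, so that sparsity = 1 - sparse weight / weight, and that relative error reduction is measured on the error rate (100% minus overall accuracy). Both readings are assumptions made here for illustration, and the names `TABLE1`, `sparsity`, and `relative_error_reduction` are hypothetical. Only a few rows of Table 1 are copied in.

```python
# (dense weights, remaining nonzero weights) for selected rows of Table 1
TABLE1 = {
    "conv1_1": (1728, 1607),
    "conv6": (102760448, 1993552),
    "conv7": (16777216, 5887125),
    "total": (134248128, 11348403),
}

def sparsity(dense, remaining):
    """Fraction of weights removed, assuming `remaining` counts nonzero weights."""
    return 1.0 - remaining / dense

def relative_error_reduction(acc_base, acc_new):
    """Relative drop in error rate, with accuracies given in percent."""
    err_base = 100.0 - acc_base
    err_new = 100.0 - acc_new
    return (err_base - err_new) / err_base

for name, (dense, remaining) in TABLE1.items():
    print(f"{name}: {sparsity(dense, remaining):.4f}")

# Table 2: unregularized model (95.79%) against the two regularized variants
print(f"{relative_error_reduction(95.79, 96.25):.3f}")  # highest sparsity, about 0.109
print(f"{relative_error_reduction(95.79, 96.62):.3f}")  # highest accuracy, about 0.197
```

Under these assumptions the total row reproduces the 91.55% sparse proportion, and the highest-sparsity model corresponds to a relative error reduction of roughly 11% over the unregularized model.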

4.3. The Segmentation Results

Segmentation results on the LFW Part Labels dataset obtained with the proposed method are presented and compared with previous work: the RBM and CRF based method for face labeling [17]; the CNN and CRF based method for face labeling [20]; and the end-to-end face labeling method that combines the advantages of CNN, RNN, CRF and adversarial networks [21]. Table 2 shows the overall accuracy of the different methods. Note that the results in Table 2 are obtained by averaging four experiments with the same hyperparameter settings but different random states, to eliminate the bias caused by randomness. Since the differences between the four experiments are minor, Figs. 3-5 and Table 1 show the results of one of them for ease of presentation. The proposed sparse model combined with the fully connected CRF achieves state-of-the-art results with the highest accuracy.

Six segmentation results of challenging face images are selected to illustrate the proposed method; they are shown in Fig. 7. As the segmentation results show, the proposed method is highly robust to various complex situations. Under large variations in hair color and shape (the 1st sample), multiple faces in one image (the 2nd and 6th samples), face occlusion (the 4th and 5th samples), beards (the 3rd sample), and skin color similar to hair color (the 6th sample), our method still captures the correct category information. Moreover, the output of the sparse model is closer to the ground truth in its overall distribution. The fully connected CRF model partially compensates for the detail missed by the FCN.

Figure 7: Selected challenging segmentation results. From the first column to the sixth column: input images, output of FCN without sparse regularization, output of FCN with highest sparsity, output of FCN with highest accuracy, output of FCN with highest accuracy combined with fully connected CRF, and the ground truth.

The above results stem from two advantages of the proposed method. First, pruning a large number of invalid weights effectively alleviates overfitting, especially for face labeling, which has only 1500 training images. The model obtained by the proposed method therefore has better generalization and representation ability and achieves higher accuracy. More importantly, as shown in Fig. 6, the proposed method not only filters out harmful and invalid feature maps but also makes the learned feature maps clearer and more expressive. This characteristic helps the model further improve accuracy compared with models that are not regularized.

5. Conclusion

In this work, group Lasso and intra-group Lasso were developed to regularize an FCN model for face labeling. The introduction of the regularization terms resulted in a highly sparse model with a significantly reduced number of parameters and improved segmentation performance. With the help of a fully connected CRF, the proposed sparse FCN achieved state-of-the-art segmentation accuracy on the LFW dataset. This work makes it easier to deploy an FCN in a variety of settings, especially on mobile devices. More structures and tasks will be explored with the proposed method in future work.

Acknowledgment

This work was supported by the Natural Science Foundation of China under Grants 61673187 and 61673188. This publication was made possible by NPRP grant: NPRP 8-274-2-107 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the author[s].

References

[1] S. L. Happy and A. Routray, “Automatic facial expression recognition using features of salient facial patches,” IEEE Trans. Affect. Comput., vol. 6, no. 1, pp. 1-12, 2015.
[2] S. Liu, X. Ou, R. Qian, W. Wei, and X. Cao, “Makeup like a superstar: Deep localized makeup transfer network,” in International Joint Conference on Artificial Intelligence, 2016, vol. 2016-Jan, pp. 2568-2575.
[3] I. Korshunova, W. Shi, J. Dambre, and L. Theis, “Fast face-swap using convolutional neural networks,” CoRR, vol. abs/1611.0, 2016.
[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[5] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85-117, 2015.
[6] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A. R. Mohamed, G. Dahl, and B. Ramabhadran, “Deep convolutional neural networks for large-scale speech tasks,” Neural Networks, vol. 64, pp. 39-48, 2015.
[7] S. Basu, M. Karki, S. Mukhopadhyay, S. Ganguly, R. Nemani, R. Dibiano, and S. Gayaka, “A theoretical analysis of deep neural networks for texture classification,” Neural Networks, pp. 992-999, 2016.
[8] I. Kokkinos, “Pushing the boundaries of boundary detection using deep learning,” Comput. Sci., 2016.
[9] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in IEEE International Conference on Computer Vision. IEEE Computer Society, 2015, pp. 3119-3127.
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Comput. Sci., pp. 1-10, 2014.
[11] W. Wen, T. Afzal, Y. Zhang, Y. Chen, and H. Li, “A compact DNN: approaching GoogLeNet-level accuracy of classification and domain adaptation,” CoRR, vol. abs/1703.0, 2017.
[12] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li, “Coordinating filters for faster deep neural networks,” CoRR, vol. abs/1703.0, 2017.
[13] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. R. Stat. Soc., vol. 68, no. 1, pp. 49-67, 2006.
[14] L. Meier, S. Van De Geer, and P. Bühlmann, “The group lasso for logistic regression,” J. R. Stat. Soc., vol. 70, no. 1, pp. 53-71, 2008.
[15] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016.
[16] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” in Advances in Neural Information Processing Systems, 2011, pp. 109-117.
[17] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller, “Augmenting CRFs with Boltzmann machine shape priors for image labeling,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 2019-2026.
[18] K.C. Lee, D. Anguelov, B. Sumengen, and S.B. Gokturk, “Markov random field models for hair and face segmentation,” in IEEE International Conference on Automatic Face Gesture Recognition, 2008, pp. 1-6.
[19] P. Luo, X. Wang, and X. Tang, “Hierarchical face parsing via deep learning,” IEEE Trans. Affect. Comput., pp. 2480-2487, 2012.
[20] S. Liu, J. Yang, C. Huang, and M.-H. Yang, “Multi-objective convolutional learning for face labeling,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015, pp. 3451-3459.
[21] U. Güçlü et al., “End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks,” CoRR, vol. abs/1703.0, 2017.

[22] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1135-1143.
[23] M. Guechtouli, “Training CNNs with low-rank filters for efficient image classification,” J. Asian Stud., vol. 62, no. 3, pp. 952-953, 2010.
[24] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, “Convolutional neural networks with low-rank regularization,” Comput. Sci., vol. 1, no. 2014, pp. 1-10, 2015.
[25] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” CoRR, vol. abs/1405.3866, 2014.
[26] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
[27] J. Feng and T. Darrell, “Learning the structure of deep convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2749-2757.
[28] S. Huang, Y. Yang, D. Yang, F. Huang, W. Lu, and X. Zhang, “Class specific sparse representation for classification,” Signal Processing, vol. 116, pp. 38-42, 2015.
[29] J. Mi, Q. Fu, and W. Li, “Adaptive class preserving representation for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7427-7435.
[30] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, “A sparse-group lasso,” Journal of Computational and Graphical Statistics, vol. 22, no. 2, pp. 231-245, 2013.
[31] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, “Group sparse regularization for deep neural networks,” Neurocomputing, pp. 1-10, 2016.
[32] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2148-2156.
[33] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014, pp. 675-678.


Minghui Dong: received his B.Eng. degree from the School of Aeronautics and Astronautics, Xiamen University, Xiamen, China, in 2017. He is currently working towards the M.Eng. degree at Huazhong University of Science and Technology, Wuhan, China. His research interests include computer vision and deep learning.


Shiping Wen: received the M.S. degree in Control Science and Engineering from the School of Automation, Wuhan University of Technology, Wuhan, China, in 2010, and the Ph.D. degree in Control Science and Engineering from the School of Automation, Huazhong University of Science and Technology, Wuhan, China, in 2013. He is currently an Associate Professor at the School of Automation, Huazhong University of Science and Technology, and also with the Key Laboratory of Image Processing and Intelligent Control of Education Ministry of China, Wuhan, China. His current research interests include memristor-based circuits and systems, networked control systems, machine learning and data mining.


Zhigang Zeng: received his B.S. degree from Hubei Normal University, Huangshi, China, and his M.S. degree from Hubei University, Wuhan, China, in 1993 and 1996, respectively, and his Ph.D. degree from Huazhong University of Science and Technology, Wuhan, China, in 2003. He is a professor in the School of Automation, Huazhong University of Science and Technology, Wuhan, China, and also with the Key Laboratory of Image Processing and Intelligent Control of Education Ministry of China, Wuhan, China. His current research interests include neural networks, switched systems, computational intelligence, stability analysis of dynamic systems, pattern recognition and associative memories.


Zheng Yan: received the B.Eng. degree in automation and computer-aided engineering and the Ph.D. degree in mechanical and automation engineering from The Chinese University of Hong Kong, Hong Kong, in 2010 and 2014, respectively. Dr. Yan was a recipient of the Graduate Research Grant from the IEEE Computational Intelligence Society in 2014.


Tingwen Huang: received his B.S. degree from Southwest Normal University (now Southwest University), China, in 1990, his M.S. degree from Sichuan University, China, in 1993, and his Ph.D. degree from Texas A&M University, College Station, Texas, USA, in 2002. He is a Professor of Mathematics at Texas A&M University at Qatar. His current research interests include dynamical systems, memristors, neural networks, complex networks, optimization and control, and traveling wave phenomena.