FC-RCCN: Fully Convolutional Residual Continuous CRF Network for Semantic Segmentation


To appear in: Pattern Recognition Letters
DOI: https://doi.org/10.1016/j.patrec.2018.08.030 (PII: S0167-8655(18)30499-9)
Received 14 May 2018; revised 15 August 2018; accepted 23 August 2018


Highlights

• A semantic segmentation framework composed of three subnetworks is proposed.
• The framework relies on a multi-scale feature fusion based network architecture.
• The pairwise network is effective in learning the affinity matrices.
• The C-CRF based learning strategy can boost the segmentation performance.
• The C-CRF network is very useful in refining the segmentation masks.


Lei Zhou a,**, Xiangyong Kong a, Chen Gong b, Fan Zhang c, Xiaoguo Zhang d

a University of Shanghai for Science and Technology, China
b School of Computer Science and Engineering, Nanjing University of Science and Technology, China
c University of Melbourne, Australia
d Jiangnan University, China

** Corresponding author. Tel.: +86-021-55271119; fax: +86-021-55271119; e-mail: [email protected] (Lei Zhou). The first two authors share first authorship.

ABSTRACT


Enlarging the spatial resolution of the features generated by fully convolutional networks (FCNs) can improve the performance of semantic segmentation. To achieve this goal, a deeper network with a deconvolutional structure can be applied. However, as the network architecture becomes more complex, training efficiency may degrade. To address the joint problem of improving spatial resolution through deeper networks and training such networks effectively, we propose a Fully Convolutional Residual Continuous CRF Network (FC-RCCN) for semantic segmentation. FC-RCCN is composed of three subnetworks: a unary network, a pairwise network, and a superpixel based continuous conditional random field (C-CRF) network. To generate high-quality predictions at full spatial resolution, a residual block based unary network with multi-scale feature fusion is proposed. Even though the unary network is deep, the whole framework can be trained effectively in an end-to-end way using a joint pixel-level and superpixel-level supervised learning strategy, optimized with a pixel-level softmax cross-entropy loss and a superpixel-level log-likelihood loss. Besides, C-CRF inference is fused with the pixel-level prediction during testing, which makes the method robust to superpixel errors. In the experiments, we comprehensively evaluate the power of the three subnetworks and the learning strategy. Experiments on three benchmark datasets demonstrate that the proposed FC-RCCN outperforms previous segmentation methods and obtains state-of-the-art performance.

1. Introduction

Over the last few years, visual analysis techniques have achieved great improvements Zhou et al. (2018); Zhu et al. (2015); Xie et al. (2016); Zhu et al. (2014); Wang et al. (2016, 2017); Ma et al. (2017). In particular, the success of deep convolutional neural network (CNN) models Krizhevsky et al. (2012); Simonyan and Zisserman (2014); He et al. (2016) has brought dramatic progress to the task of pixel-wise semantic segmentation using rich hierarchical features J. Long and Darrel (2015); Badrinarayanan et al. (2015); S. Zheng (2015); Chen et al. (2016); Lin et al. (2017); Chen et al. (2017). Most current semantic segmentation methods are derived from the fully convolutional network (FCN), first introduced in J. Long and Darrel (2015). In FCN, the last few fully connected layers are replaced by convolutional layers, enabling effective end-to-end learning and inference. However, most current methods are based on classification networks whose downsampled feature maps are of coarse spatial resolution. Segmentation results obtained at reduced resolution therefore tend to show poor object delineation and small isolated regions in the output.

There are mainly two categories of methods for generating high-quality segmentation masks. The first category generates features of higher resolution by recovering the spatial resolution with an upsampling or deconvolutional

path Fu et al. (2017a); Noh et al. (2015); Chen et al. (2017). The low-resolution feature maps are recovered to the input resolution by learning the upsampling process for pixel-wise classification, so that more accurate boundaries can be localized. In this upsampling process, deconvolutional and unpooling layers are usually arranged in a structure symmetric to the convolutional and pooling layers. The parameter size of the resulting network is therefore roughly twice that of the original convolutional structure. Furthermore, semantic segmentation networks become more difficult to train as the network depth increases.

The second kind of methods combines the powerful classification capabilities of a fully convolutional network with probabilistic graphical models, such as the conditional random field (CRF), to improve semantic segmentation performance with deep learning. Many current state-of-the-art methods Chen et al. (2014); Lin et al. (2016); S. Zheng (2015) have integrated graphical models into a deep learning framework. One of the first works combining a deep learning framework with structured prediction was proposed in Chen et al. (2014); it applied the densely connected conditional random field (DenseCRF) Krähenbühl and Koltun (2011) as a post-process on the FCN output to generate better segmentations with refined image boundaries. Zheng et al. S. Zheng (2015) combined DenseCRF with a CNN by transforming the DenseCRF post-processing into a single Recurrent Neural Network (RNN) trained in an end-to-end procedure.

There exist inherent drawbacks in these two solutions. On the one hand, network architectures with deconvolutional components are difficult to optimize. On the other hand, most existing CRF based methods infer on low-resolution pixel-level features to reduce the computational cost Chen et al. (2014); Lin et al. (2016); S. Zheng (2015), which is not effective in capturing precise boundary information. But inferring a CRF on full-resolution features is time-consuming and may degrade training efficiency, so there is a trade-off between segmentation quality and training efficiency. Current research has shown that CRFs consistently improve the training efficiency of complex segmentation networks S. Zheng (2015); Chandra and Kokkinos (2016); Liu et al. (2015b).

In this paper, we investigate the combination of these two solutions to design a more powerful segmentation model. Note that when incorporating CRFs and spatial resolution recovery into a unified framework, the trade-off between segmentation quality, computational cost and training efficiency needs to be taken into consideration. To overcome the above issues, we present a novel superpixel based semantic segmentation framework, FC-RCCN, which combines a C-CRF with a deeper neural network. FC-RCCN is composed of three subnetworks: a unary network, a pairwise network and a C-CRF network. Firstly, to recover image structure details from downsampled features, we propose a residual block based network architecture with multi-scale feature fusion in the unary network. Secondly, a pairwise network is designed to learn superpixel-wise similarities, so as to capture the spatial relationships between superpixels by metric learning. Finally, the inference of a superpixel based continuous conditional random field (C-CRF) is implemented as a continuous CRF layer in the C-CRF network. Moreover, an effective joint training strategy based on softmax cross-entropy and negative log-likelihood optimization is designed to train the whole framework in an end-to-end way.

Our contributions are summarized as follows: (1) A novel residual block based unary network with multi-scale feature fusion is proposed for generating high-quality segmentation probabilities. (2) We propose a novel formulation of pairwise potentials defined on the C-CRF. Such a formulation enables learning pairwise parameters for different spatial ranges of graph connections. (3) A novel joint superpixel and pixel supervised training strategy is proposed. The label consistency constraint is enforced by the superpixel based C-CRF to improve training efficiency. Joint training and inference combining superpixel-level and pixel-level supervision is shown to be effective in decreasing the effect of superpixel errors. (4) We show from experimental results that FC-RCCN achieves better performance than a range of state-of-the-art methods. Moreover, integrating the C-CRF into the whole segmentation framework improves the training efficiency of FC-RCCN and generates segmentation masks of better quality.

2. Related Work

Although CNNs have achieved good performance on the task of semantic segmentation, they cannot directly model the relationships between the output variables. Combining discrete CRFs with CNNs can overcome this issue. Various conditional random field models based on unary and pairwise clique potentials Russell et al. (2009); Ladicky et al. (2010); Ladický et al. (2010) have been used to improve segmentation performance Chen et al. (2014); Lin et al. (2016); S. Zheng (2015); Arnab et al. (2016). The combination has also been applied to other tasks such as depth estimation Liu et al. (2015a) and saliency detection Fu et al. (2017b). One of the first works combining a deep learning framework with structured prediction was proposed in Chen et al. (2014), in which the densely connected conditional random field (DenseCRF) Krähenbühl and Koltun (2011) was applied to post-process the FCN outputs so as to generate better segmentation masks with refined image boundaries. In S. Zheng (2015), DenseCRF was combined with a CNN by transforming the DenseCRF post-processing into a single Recurrent Neural Network (RNN) trained in an end-to-end procedure. In order to incorporate object detection and superpixel information into CRFs, higher-order CRF potentials were defined in Arnab et al. (2016) and trained with CNNs in an end-to-end way. Gaussian CRFs have been applied to semantic segmentation as well Chandra and Kokkinos (2016); Vemulapalli et al. (2016). In Chandra and Kokkinos (2016), a structured prediction technique that combines a Gaussian CRF with CNNs was proposed for fast, exact and multi-scale inference in the semantic image segmentation task. In Vemulapalli et al. (2016), a Gaussian mean-field network was proposed and combined with CNNs to construct a new end-to-end trainable Gaussian conditional random field network for semantic segmentation. Different from these approaches, in this work a continuous CRF constructed on superpixels is used to train the proposed deeper segmentation network with multi-scale feature fusion and to refine the segmentation masks by learning the dependencies between the nodes in the superpixel graph.

3. The Proposed Method

Fig. 1. The flowchart of the proposed FC-RCCN for semantic segmentation. In the unary network, yellow rectangles represent the transformation block; green rectangles represent the upsampling block, which is composed of two convolution layers and an upsampling layer. The green circles represent the multi-feature fusion unit. Convolution parameters are denoted as [kernel height × kernel width, number of filters] × repeated times. Z is the output of the unary network. Q, P and W are the outputs of the pairwise network. S is the output of the SP-LAYER. A is the affinity matrix. (Please refer to Section 3.3.2 for more details.)

Table 1. Explanation of the symbols and notations.

$\Theta_u, \Theta_p$ — Parameters of the unary and pairwise networks.
$Z$ — Output of the unary network (Section 3.3.1).
$S$ — Transformed representation of $Z$ (Eq. (12)).
$Q$ — Output of the pairwise network (Section 3.3.2).
$A$ — $A = I + \lambda(D^k - W^k)$, with $D^k_{ii} = \sum_j W^k_{ij}$ (Eq. (9)).
$\Upsilon^k_{i,j}$ — Similarity between superpixels for class $k$ (Eq. (14)).
$\phi^k_e$ — Distance kernel for class $k$ (Eq. (15)).
$W^k$ — Affinity matrix for class $k$ (Eq. (16)).

3.1. Brief Review of Conditional Random Fields (CRFs)

The conditional random field (CRF) was first proposed by Lafferty et al. (2001) for sequence data labelling. For the task of image segmentation, the conditional probability distribution of a label configuration $y$ given an input $x$ on the CRF can be defined as:

$$p(y|x) = \frac{1}{\Psi(x)} \exp\{-E(y, x)\} \quad (1)$$

where $E(y, x)$ is the energy function and $\Psi(x)$ is the partition function. The energy function consists of unary terms and pairwise terms and is defined as:

$$E(y, x) = \sum_i U(y_i, x; \Theta_u) + \sum_{i,j,\, i \sim j} V(y_i, y_j; \Theta_p) \quad (2)$$

where $y_i$ is the $i$-th element in the vector $y$, and $U$ and $V$ are the unary and pairwise potentials with parameters $\Theta_u$ and $\Theta_p$. A CRF is often defined on an undirected graph $G(\Lambda, E)$, where $\Lambda$ is the set of graph nodes and $E$ is the set of graph edges. We define the label assigned to node $\Lambda_i$ as $y_i$. Hence the notation "$i \sim j$" in (2) indicates that $\Lambda_i$ and $\Lambda_j$ are graph neighbors. The unary term $U$ models the dependency between a label $y_i$ and the image $x$ at a given node. The pairwise term $V$ enforces label consistency between neighboring graph nodes.

3.2. C-CRF Formulation for Semantic Segmentation

The basic idea for defining the unary term $U$ in (2) is to measure the dependency between $y_i$ and a corresponding feature vector $S_i$ learned from the input $x$. In our formulation, $S_i$ is a prediction vector that captures the semantic information for segmentation.

The unary term: Assuming a unary prediction feature $S_i^k$ for node $\Lambda_i$ corresponding to class $k$, the unary term is defined as a weighted sum of quadratic costs:

$$U(y_i, x; \Theta_u) = \sum_{k=1}^{C} \left(y_i^k - S_i^k(\Theta_u)\right)^2 \quad (3)$$

where $S_i^k$ is the $k$-th component of $S_i$, which depends on $x$ ($x$ is omitted for simplicity) and indicates the probability that node $\Lambda_i$ belongs to class $k$. $C$ is the number of classes. If $\Lambda_i$ belongs to class $k$, $y_i^k = 1$; otherwise $y_i^k = 0$. From the perspective of optimization, the overall cost becomes smaller if the assigned

probability $S_i^k$ is closer to the label $y_i^k$. $\Theta_u$ denotes the parameters for generating the unary term.

The pairwise term: The pairwise term is also defined as a weighted sum of quadratic costs:

$$V(y_i, y_j, x; W) = \frac{1}{2} \sum_{k=1}^{C} W_{ij}^k \left(y_i^k - y_j^k\right)^2 \quad (4)$$

where $W_{ij}^k$ is also learned from $x$ and is the pairwise similarity defined between the nodes $\Lambda_i$ and $\Lambda_j$ for class $k$. In the proposed method, $W_{ij}^k$ is defined as a positive similarity function between $\Lambda_i$ and $\Lambda_j$. If $\Lambda_i$ and $\Lambda_j$ are similar, $W_{ij}^k$ is larger and they tend to be assigned similar labels by inference on the C-CRF.

The energy function: The energy function $E(y, x)$ described in (2) can then be written in the following form:

$$E(y, x) = \sum_{i=1}^{N} \sum_{k=1}^{C} \left(y_i^k - S_i^k(\Theta_u)\right)^2 + \lambda \frac{1}{2} \sum_{i,j,\, i \sim j} \sum_{k=1}^{C} W_{ij}^k \left(y_i^k - y_j^k\right)^2 \quad (5)$$

To improve training efficiency, the energy functions for the individual segmentation classes are optimized independently, without considering the correlations between different classes. Likewise, our analysis will focus on the partition function $\Psi_k(x)$ and the conditional probability $p_k(y|x)$ for class $k$. So (5) can be rewritten as:

$$E(y, x) = \sum_{k=1}^{C} E_k(y, x) \quad (6)$$

where $E_k(y, x)$ is the cost function for optimizing the C-CRF corresponding to class $k$ and has the following form:

$$E_k(y, x) = \sum_{i=1}^{N} \left(y_i^k - S_i^k(\Theta_u)\right)^2 + \lambda \frac{1}{2} \sum_{i,j,\, i \sim j} W_{ij}^k \left(y_i^k - y_j^k\right)^2 \quad (7)$$

With some mathematical derivation, the matrix form of (7) can be expressed equivalently as:

$$E_k(y, x) = (y^k)^T \left(I + \lambda D^k - \lambda W^k\right) y^k - 2 (S^k)^T y^k + (S^k)^T S^k \quad (8)$$

where $I$ is an $N \times N$ identity matrix; $D^k$ is a diagonal matrix with $D_{ii}^k = \sum_j W_{ij}^k$; $W^k$ is the affinity matrix composed of the $W_{ij}^k$; and $S^k$ is the probability vector with elements $S_i^k$. We then introduce the notations $A$ and $\beta$ to simplify the expressions:

$$A = I + \lambda \left(D^k - W^k\right), \qquad \beta = (S^k)^T S^k \quad (9)$$

The partition function: Considering the continuous property of the C-CRF, the partition function $\Psi_k(x)$ can be defined as an integration. Then we have the following equation according to the Gaussian integral:

$$\Psi_k(x) = \int \exp(-E_k(y, x))\, dy^k = \exp(-\beta) \int \exp\left(-(y^k)^T A y^k + 2 (S^k)^T y^k\right) dy^k = \exp(-\beta) \int \exp\left(-\tfrac{1}{2} (y^k)^T 2A\, y^k + (2 S^k)^T y^k\right) dy^k = \frac{\pi^{N/2}}{|A|^{1/2}} \exp\left((S^k)^T A^{-1} S^k - \beta\right) \quad (10)$$

where $N$ is the dimension of $A$ and $|A|$ is the matrix determinant. In our formulation, the invertibility of $A$ is guaranteed.

The negative log-likelihood: The negative log-likelihood can then be written as:

$$-\log p_k(y|x) = (y^k)^T A y^k - 2 (S^k)^T y^k + (S^k)^T A^{-1} S^k - \frac{1}{2} \log(|A|) + \frac{N}{2} \log(\pi) \quad (11)$$

We now describe how the unary term ($S_i^k$) and the pairwise term ($W_{ij}^k$) in the proposed C-CRF model are generated.

3.3. Features for the Unary and Pairwise Terms

3.3.1. Unary Network
The recently proposed residual network (ResNet) He et al. (2016) has shown impressive improvements over other architectures such as VGG Simonyan and Zisserman (2014). ResNet models pre-trained on ImageNet recognition tasks are publicly available and are widely used for computer vision tasks. Based on this, ResNet-101 He et al. (2016) is selected for constructing the unary network.

The proposed architecture of the unary network is composed of two parts, and the detailed network configuration is described in Figure 1. The first part of the unary network is designed based on ResNet-101 for feature extraction and is initialized with pre-trained weights. For example, an input image of size 320 × 320 is fed into ResNet-101 and the last convolutional layers produce feature maps of spatial resolution 10 × 10 with 2048 channels. In the forward process of ResNet-101, large receptive fields are generated for capturing high-level information effectively. However, large receptive fields may miss important details for full-resolution segmentation. As noted previously, we aim to exploit multi-level features for high-resolution prediction with high precision. Hence, in the second part of our unary network, the activation maps are upscaled through a sequence of transformation, up-convolution, and upsampling layers.

As shown in Figure 1, we divide the pre-trained ResNet-101 into 5 blocks, and the activations of the ResNet blocks are transformed and upsampled before being fused with the up-convolution blocks. The transformation block consists of two 3 × 3 convolutions. The upsampling block consists of two 3 × 3 convolutions and an upsampling layer. The activation maps from the deeper layers are transformed and upsampled via the upsampling blocks. For example, the output of the 1st ResNet block is of size 160 × 160 and the output of the 2nd ResNet block is of size 80 × 80. The activation maps of the 2nd ResNet block are transformed and upsampled to 160 × 160. Then, the transformed activation map of the 1st ResNet block and the transformed and upsampled activation map of the 2nd ResNet block are fused with the output of the 3rd up-convolution block. Finally, the unary network produces a prediction output of spatial resolution 320 × 320. The outputs are represented as $Z \in \mathbb{R}^{h \times w \times C}$ with elements $Z_{ij}^k$ ($i = 1, .., h$, $j = 1, .., w$, $k = 1, .., C$), where $h$, $w$ and $C$ stand for the height and width of the prediction maps and the number of classes.

Then a superpixel pooling layer (SP-LAYER) is designed to transform the pixel-level features to the superpixel level, where the features are aggregated by spatially aligning them with superpixels using average pooling. To simplify the notation, we assume the image is divided into $N$ superpixels after the over-segmentation step. Hence the superpixel features for superpixel $t$ can be represented as $S_t$ ($t = 1, .., N$) with elements $S_t^k$ ($k = 1, .., C$). We define a frequency counting matrix $\Delta_t \in \mathbb{R}^{h \times w}$ for the $t$-th superpixel, whose element $\Delta_{ijt}$ indicates the weight of the $(i, j)$-th feature vector in the convolution maps associated with the $t$-th superpixel. It is calculated by counting the occurrences of the $(i, j)$-th convolutional feature vector inside the $t$-th superpixel region and then normalizing. Let $S$ stand for the transformed output of the unary network through the SP-LAYER (Figure 2). The transformed confidence $S_t^k$ of superpixel $t$ can be represented by:

$$S_t^k = \sum_{(i,j) \in R_t} \Delta_{ijt} Z_{ij}^k \quad (12)$$

where $(i, j) \in R_t$ denotes the $(i, j)$-th convolution feature lying inside superpixel $t$. The outputs of the SP-LAYER can then be represented by an $N \times C$ matrix $S = [S^1, ..., S^C]$.
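As an illustration of the SP-LAYER in (12), here is a minimal NumPy sketch assuming a toy 4 × 4 prediction map and a given superpixel index map; the shapes and inputs are hypothetical, and with a uniform, normalized $\Delta_{ijt}$ the pooling reduces to a per-region average.

```python
import numpy as np

h, w, C, N = 4, 4, 3, 2                       # toy sizes: 4x4 map, 3 classes, 2 superpixels
rng = np.random.default_rng(0)
Z = rng.random((h, w, C))                     # pixel-level predictions from the unary network
seg = (np.arange(h * w).reshape(h, w) % N)    # hypothetical superpixel index for every pixel

S = np.zeros((N, C))
for t in range(N):
    mask = (seg == t)                         # region R_t
    # Delta_ijt is uniform inside R_t and normalized, so Eq. (12) is average pooling
    S[t] = Z[mask].mean(axis=0)
print(S)                                      # N x C superpixel-level confidences
```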


3.3.2. Pairwise Network
As illustrated in Figure 1, the overall pairwise network is composed of a new branch of up-convolution blocks, a superpixel pooling layer and an affinity layer. The up-convolution blocks are applied to recover the feature maps to the same resolution as the original image, generating features $Q \in \mathbb{R}^{h \times w \times d}$. Then the SP-LAYER transforms the pixel-level features $Q$ to superpixel-level representations $p \in \mathbb{R}^{N \times d}$. The affinity layer is designed to compute the similarity $W \in \mathbb{R}^{N \times N \times C}$ for every pair of connected superpixels with respect to each class. Similar to the role of the SP-LAYER in the unary network (12), the transformed feature $p_t$ with elements $p_t^d$ for superpixel $t$ can be represented by

$$p_t^d = \sum_{(i,j) \in R_t} \Delta_{ijt} Q_{ij}^d \quad (13)$$

where $(i, j) \in R_t$ denotes the $(i, j)$-th convolution feature lying inside superpixel $t$, and $p_t = [p_t^1, .., p_t^d]^T$ is a vector of dimension $d$. In the affinity layer, the similarity $\Upsilon_{i,j}^k$ between superpixels $i$ and $j$ for class $k$ is computed using

$$\Upsilon_{i,j}^k = e^{-(p_i - p_j)^T \Pi^k (p_i - p_j)} \quad (14)$$

where $\Pi^k = \sum_{e=1}^{M} \phi_e^k (\phi_e^k)^T$ is the parameter matrix defining a Mahalanobis distance, and we can rewrite the exponent of $\Upsilon_{i,j}^k$ as:

$$-(p_i - p_j)^T \Pi^k (p_i - p_j) = -\sum_{e=1}^{M} \left((\phi_e^k)^T p_i - (\phi_e^k)^T p_j\right)^2 \quad (15)$$

Hence, the computation of the Mahalanobis distance can be implemented as convolutions of $p_i$ with the kernels $\phi_e^k$. Finally, the affinity layer generates the matrices $W \in \mathbb{R}^{N \times N \times C}$ with elements $W_{i,j}^k$, which are used in the pairwise term of the energy function:

$$W_{i,j}^k = \tau \Upsilon_{i,j}^k \quad (16)$$

where $\tau$ is a coefficient.
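A minimal NumPy sketch of (14)–(16) follows, assuming toy superpixel features and random kernels $\phi_e^k$ for a single class; $M$, $d$, $N$ and $\tau$ are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M, tau = 4, 8, 3, 1.0          # toy sizes: 4 superpixels, 8-dim features, 3 kernels
p = rng.random((N, d))               # superpixel-level features from Eq. (13)
phi = rng.random((M, d))             # kernels phi_e^k for one class k

proj = p @ phi.T                     # (phi_e^k)^T p_i for all i and e -> N x M
# Eq. (15): the Mahalanobis distance as a sum of squared projection differences
dist = ((proj[:, None, :] - proj[None, :, :]) ** 2).sum(axis=2)
Upsilon = np.exp(-dist)              # Eq. (14): pairwise similarities
W = tau * Upsilon                    # Eq. (16): affinity matrix for class k
print(W)
```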

3.3.3. C-CRF Network
Once the unary network and pairwise network are learned, the semantic segmentation inference of the C-CRF on a new test image is achieved by minimizing (11) based on $S^k$ and $W^k$. The optimization problem is formulated as:

$$(y^k)^* = \arg\min_{y^k} -\log(p_k(y|x)) = \arg\min_{y^k} (y^k)^T A y^k - 2 (S^k)^T y^k \quad (17)$$

where $A$ is defined in (9), and it is apparent that $A$ is symmetric. By setting $\partial\left((y^k)^T A y^k - 2 (S^k)^T y^k\right) / \partial y^k = 0$, there exists a closed-form solution:

$$(y^k)^* = A^{-1} S^k \quad (18)$$

To solve the above optimization problem, the sequential mean-field method using the Gauss–Seidel algorithm Press et al. (1982) can also be applied.
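The following NumPy sketch illustrates (17)–(18) on a toy problem, comparing the closed-form solve with a few Gauss–Seidel sweeps; the sizes, inputs and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 6, 1.0
W = rng.random((N, N))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
S = rng.random(N)
A = np.eye(N) + lam * (np.diag(W.sum(axis=1)) - W)   # Eq. (9)

y_direct = np.linalg.solve(A, S)                     # Eq. (18): closed-form MAP solution

y = np.zeros(N)                                      # Gauss-Seidel sweeps (sequential updates)
for _ in range(50):
    for i in range(N):
        # solve row i of A y = S for y_i, using the latest values of the other entries
        y[i] = (S[i] - A[i] @ y + A[i, i] * y[i]) / A[i, i]

print(np.allclose(y, y_direct, atol=1e-6))           # both solvers agree on the toy problem
```

Because $A$ is strictly diagonally dominant by construction, the Gauss–Seidel iteration converges; for large superpixel graphs it avoids forming $A^{-1}$ explicitly.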

4. End-to-end Training for Semantic Segmentation

In this section, we describe how to train the network in an end-to-end way. Figure 2 illustrates the rules for gradient backpropagation; the methods for training the continuous CRF network, the unary network and the pairwise network are introduced in the following subsections.

Fig. 2. The details of gradient backpropagation under superpixel-level supervision. Z is the output of the unary network. Q, P and W are the outputs of the pairwise network. S is the output of the SP-LAYER. A is the affinity matrix. Θu, Θp and φ are the parameters of the unary network, the pairwise network and the C-CRF network, respectively. Please refer to Section 4 for more details about the symbols presented in the figure.

4.1. Loss Functions
Given $M$ training images $x_1, x_2, ..., x_M$ with ground truth labels $y_1, y_2, ..., y_M$, the learning task is to learn the network parameters $\Theta_u$ and $\Theta_p$. A joint learning strategy is designed in which two kinds of loss functions are used: pixel-level supervision and superpixel-level supervision are applied to optimize the whole network jointly.

4.1.1. Softmax Cross Entropy for Pixel-level Supervision
The output of the unary network is a score map $Z_{ij}^k$, $i \in [1, .., h]$, $j \in [1, .., w]$, where $h$ and $w$ are the height and width of the features. The softmax probability corresponding to class $k$ is defined as $P_{ij}^k = \frac{e^{Z_{ij}^k}}{\sum_l e^{Z_{ij}^l}}$. Then the cross-entropy loss is:

$$L_p^k(ij) = -\sum_l y_{ij}^l \log P_{ij}^l \quad (19)$$

where $y_{ij}^l$ is the ground truth indicator for the pixel at location $(i, j)$ with ground truth label $l_{ij}^*$: if $l == l_{ij}^*$, then $y_{ij}^l = 1$; otherwise $y_{ij}^l = 0$.
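As a sanity check of (19) and of the pixel-level gradient used later in Section 4.2, here is a small NumPy sketch on one hypothetical pixel; the class count and score values are illustrative.

```python
import numpy as np

Z_ij = np.array([2.0, -1.0, 0.5])        # unary scores at one pixel, C = 3 classes (toy values)
y_ij = np.array([0.0, 0.0, 1.0])         # one-hot ground truth indicator

P_ij = np.exp(Z_ij) / np.exp(Z_ij).sum() # softmax probabilities
L_p = -(y_ij * np.log(P_ij)).sum()       # Eq. (19): cross-entropy at pixel (i, j)
grad = P_ij - y_ij                       # dL_p/dZ_ij, the pixel-level gradient of Section 4.2
print(L_p, grad)
```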

4.1.2. Log-Likelihood for Superpixel-level Supervision
Regularized maximum conditional likelihood training by gradient descent is used for parameter learning with superpixel-level supervision:

$$\min_{\Theta_u, \Theta_p, \phi} L_s = \sum_{k=1}^{C} L_s^k + \frac{\lambda_1}{2} \|\Theta_u\|_2^2 + \frac{\lambda_2}{2} \|\Theta_p\|_2^2, \qquad \text{where } L_s^k = \sum_{i=1}^{N} -\log p_k(y_i^k | x_i) \quad (20)$$

where $\lambda_1$ and $\lambda_2$ are pre-tuned regularization parameters and $N$ is the number of superpixels. We separate the loss function $L_s$ into $C$ functions $L_s^1, L_s^2, .., L_s^C$, which are optimized independently. The parameters of the whole network are learned by solving the above optimization problem using stochastic gradient descent (SGD) based backpropagation. In the following, we take $L_s^k$ as an example to explain how to obtain the optimal parameters $\Theta_u$, $\Theta_p$, $\phi_e^k$ via superpixel based supervision.

4.2. Rules for Gradient Backpropagation
Training the continuous CRF network: According to (20), the partial derivative of $L_s^k$ with respect to $S^k$ is:

$$\frac{\partial L_s^k}{\partial S^k} = \frac{\partial \left(-2 (S^k)^T y^k + (S^k)^T A^{-1} S^k\right)}{\partial S^k} = 2 \left(A^{-1} S^k - y^k\right)^T \quad (21)$$

The derivatives are then backpropagated to the unary network.

Training the unary network: For training the unary network, a joint learning strategy is used; the gradients are backpropagated via two branches. In the pixel-level gradient backpropagation branch, $\frac{\partial L_p^l(ij)}{\partial Z_{ij}^l} = P_{ij}^l - y_{ij}^l$, $i \in [1, ..., h]$, $j \in [1, ..., w]$. The gradients are then backpropagated by calculating $\frac{\partial L_p}{\partial \Theta_u}$, and the parameters of the unary network are updated. In the unary network, a linear transformation is performed by the SP-LAYER to generate $S$ from $Z$. In the superpixel-level gradient backpropagation branch, the gradients can be easily calculated since we have:

$$\frac{\partial S_t^k}{\partial Z_{ij}^k} = \Delta_{ijt}, \quad \text{if } (i, j) \in R_t \quad (22)$$

According to the chain rule, the partial derivative of $L_s^k$ with respect to $\Theta_u$ is:

$$\frac{\partial L_s^k}{\partial \Theta_u} = \frac{\partial L_s^k}{\partial S^k} \frac{\partial S^k}{\partial Z^k} \frac{\partial Z^k}{\partial \Theta_u} + \lambda_1 \Theta_u \quad (23)$$

Training the pairwise network: The pairwise term is learned through the pairwise branch. The derivative of the loss with respect to the similarity matrix $W^k$ is:

$$\frac{\partial L_s^k}{\partial W^k} = \frac{\partial \left\{(y^k)^T A y^k + (S^k)^T A^{-1} S^k - \frac{1}{2} \log(|A|)\right\}}{\partial W^k} = (y^k)^T \frac{\partial A}{\partial W^k} y^k - (S^k)^T A^{-1} \frac{\partial A}{\partial W^k} A^{-1} S^k - \frac{1}{2} \mathrm{Tr}\!\left(A^{-1} \frac{\partial A}{\partial W^k}\right) \quad (24)$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix and, from (9), $\frac{\partial A}{\partial W_{ij}^k} = \lambda \frac{\partial (D^k - W^k)}{\partial W_{ij}^k}$. The gradients are then backpropagated to the affinity layer. Given the derivatives $\frac{\partial L_s^k}{\partial W^k}$ of the loss function with respect to the output of the affinity layer, we can compute $\frac{\partial L_s^k}{\partial \Upsilon_{i,j}^k}$ as:

$$\frac{\partial L_s^k}{\partial \Upsilon_{i,j}^k} = \mathrm{tr}\!\left(\left(\frac{\partial L_s^k}{\partial W^k(i,j)}\right)^T \tau\right) \quad (25)$$

Then the derivative of the loss with respect to $p_i$ is:

$$\frac{\partial L_s^k}{\partial p_i} = 2 \left(\sum_{m=1}^{M} \phi_m^k (\phi_m^k)^T\right) \left(\sum_j \Upsilon_{i,j}^k \frac{\partial L_s^k}{\partial \Upsilon_{i,j}^k} (p_j - p_i)\right) \quad (26)$$

Hence the derivative of the loss with respect to $\phi_m^k$ is:

$$\frac{\partial L_s^k}{\partial \phi_m^k} = -2 \left(\sum_{ij} \Upsilon_{i,j}^k \frac{\partial L_s^k}{\partial \Upsilon_{i,j}^k} (p_j - p_i)(p_j - p_i)^T\right) \phi_m^k \quad (27)$$

The parameters $\phi$ of the affinity layer can then be updated using stochastic gradient descent. Similar to the SP-LAYER in the unary network, the gradients can be backpropagated through the SP-LAYER in the pairwise network:

$$\frac{\partial p_t^d}{\partial Q_{ij}^d} = \Delta_{ijt}, \quad \text{if } (i, j) \in R_t \quad (28)$$

Finally, according to the chain rule, the partial derivative of $L_s^k$ with respect to $\Theta_p$ is:

$$\frac{\partial L_s^k}{\partial \Theta_p} = \frac{\partial L_s^k}{\partial p^d} \frac{\partial p^d}{\partial Q^d} \frac{\partial Q^d}{\partial \Theta_p} + \lambda_2 \Theta_p \quad (29)$$
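To illustrate (21), the following NumPy sketch compares the analytic gradient of the $S$-dependent part of the superpixel-level loss with a finite-difference estimate on a toy problem; all sizes and inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 5, 1.0
W = rng.random((N, N))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
A = np.eye(N) + lam * (np.diag(W.sum(axis=1)) - W)   # Eq. (9)
y = (rng.random(N) > 0.5).astype(float)
S = rng.random(N)

def nll(S):
    # S-dependent part of Eq. (11): -2 S^T y + S^T A^{-1} S
    return -2 * S @ y + S @ np.linalg.solve(A, S)

grad_analytic = 2 * (np.linalg.solve(A, S) - y)      # Eq. (21)

eps = 1e-6                                           # central finite differences
grad_fd = np.array([(nll(S + eps * np.eye(N)[i]) - nll(S - eps * np.eye(N)[i])) / (2 * eps)
                    for i in range(N)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))
```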

5. Experimental Results

The proposed FC-RCCN architecture is implemented using the popular MatConvNet Vedaldi and Lenc (2015) deep learning library. ResNet-101 He et al. (2016) is selected for constructing the unary network. The overall network architecture is illustrated in Figure 1. The first part of the unary network is designed based on ResNet-101 for feature extraction and is initialized with pre-trained weights; please refer to Section 3.3.1 for more details. The overall pairwise network is composed of a new branch of up-convolution blocks, a superpixel pooling layer and an affinity layer. In the implementation of the pairwise network, the ResNet-101 part is shared with the unary stream. Four up-convolution and up-sampling blocks are then included to upsample the activation maps. In the learning process, we set the batch size to 10 and the base learning rate to 0.0001, and the learning rate is decreased gradually during training. The regularization parameters are set to λ1 = 1 and λ2 = 5. The algorithm converges in nearly 20k training iterations in our experimental setting. Three challenging datasets, LFW-PL Kae et al. (2013), HELEN Smith et al. (2013) and PASCAL VOC 2012 Everingham et al. (2010), are used to compare FC-RCCN with state-of-the-art semantic segmentation methods.

5.1. Evaluation on LFW-PL Dataset
The LFW-PL dataset contains 2927 face images of 250 × 250 pixels acquired in unconstrained environments Kae et al. (2013). All of them are manually annotated with skin, hair and background labels using superpixels. In the first experiment, we directly compare the potential advantage of FC-RCCN with respect to state-of-the-art methods on the task of labeling facial components such as skin, hair and background. In the preprocessing procedure of our experiment, the faces are extracted from the original images via face detection and facial landmark detection methods Cao et al. (2014) to facilitate the training and testing process. The face regions are located and cropped first. All the cropped facial images are then resized to adapt to the different deep models. In the evaluation process, the segmentation labels are transformed back to the original sizes. In this task, the superpixels are generated using LSC Li and Chen (2015). Note that our method can be used with any over-segmentation algorithm. We directly compare the proposed FC-RCCN with current face parsing methods S. F. Liu and Yang (2015); Liu et al. (2017) and other state-of-the-art deep learning based semantic segmentation methods, including FCN J. Long and Darrel (2015), CRFASRNN S. Zheng (2015), DEEPLAB Chen et al. (2014), DEEPLAB-DT Chen et al. (2016) and SEGNET Badrinarayanan et al. (2015). FC-RCCN is trained on the training and validation sets of LFW-PL and is evaluated on the test images. All training images are cropped to an input resolution of 320 × 320 to adapt to our network architecture. The models of FCN, CRFASRNN, DEEPLAB and DEEPLAB-DT reported were trained using the open-source implementations ourselves, and we cropped the images to adapt to the different deep learning models. For example, the images are padded to the size of 500 × 500 for FCN and CRFASRNN. For SEGNET, images are transformed to the size of 512 × 512 by padding the border regions with zeros. For DEEPLAB and DEEPLAB-DT, the images are cropped to the size of 353 × 353.

The quantitative results of the proposed method and the competitors are presented in Table 2. We can see that FC-RCCN achieves the highest accuracies on background, facial skin and hair segmentation compared to current face parsing methods such as RNN-G Liu et al. (2017) and MO-GC S. F. Liu and Yang (2015). Compared to other fully convolutional segmentation methods, FC-RCCN also achieves the highest accuracies over almost all three classes. Figure 3 demonstrates the effectiveness of pixel-wise prediction for facial parsing, and we can observe that FC-RCCN shows better performance than the compared methods by recovering detailed image information. (Please refer to Lei Zhou (2018) for more visual and quantitative results on the LFW-PL dataset.)

Fig. 3. The visual results of the proposed FC-RCCN and the compared methods on the LFW-PL dataset. The results of SEGNET, DEEPLAB-DT, DEEPLAB, CRFASRNN, FCN and the proposed FC-RCCN are listed in each row from left to right.

Table 2. Face parsing accuracies on the LFW-PL dataset. The F-measures of skin (F-skin), hair (F-hair) and background (F-bg) are presented.

Method | F-skin | F-hair | F-bg | Overall accuracy
FCN J. Long and Darrel (2015) | 92.91 | 82.69 | 96.32 | 94.13
CRFASRNN S. Zheng (2015) | 92.79 | 82.75 | 96.32 | 94.12
DEEPLAB Chen et al. (2014) | 92.54 | 80.14 | 95.65 | 93.44
DEEPLAB-DT Chen et al. (2016) | 91.17 | 78.85 | 94.95 | 92.49
SEGNET Badrinarayanan et al. (2015) | 93.15 | 84.18 | 95.25 | 93.56
MO-GC S. F. Liu and Yang (2015) | 93.93 | 80.70 | 97.10 | 95.12
RNN-G Liu et al. (2017) | 94.37 | 83.43 | 97.55 | 95.46
FC-RCCN wo C-CRF | 92.82 | 84.42 | 97.36 | 96.27
FC-RCCN | 94.39 | 87.61 | 97.75 | 96.80

5.2. Evaluation on HELEN Dataset
The HELEN dataset Smith et al. (2013), used for the second set of experiments, contains face labels with 11 classes. It is composed of 2330 face images of 400 × 400 pixels with labeled facial components generated through manually-annotated contours along the eyes, eyebrows, nose, lips and jawline. The same data split setting as S. F. Liu and Yang (2015) is adopted for the LFW-PL and HELEN datasets. Different from the LFW-PL dataset, the labels of the HELEN images consist of two eyebrows, upper and lower lips, two eyes, inner mouth, facial skin, nose and hair. In this experiment, all the cropped facial images are resized to 320 × 320 for FC-RCCN, and superpixels are generated using LSC Li and Chen (2015) as well. The ground truth hair label is merged with the background label to train a 7-class network. In this way, the proposed FC-RCCN is compared fairly with the works of S. F. Liu and Yang (2015) and B. Smith (2013). The experimental results on the HELEN dataset are reported in Table 3, using the same subset of images for evaluation and the same criteria. The models of FCN, CRFASRNN, DEEPLAB, DEEPLAB-CRF, DEEPLAB-GCRF and DEEPLAB-DT reported were trained using the open-source implementations.

Table 3. Face parsing accuracies on the HELEN dataset. The F-measures of seven categories, the mouth-all accuracy and the overall accuracy are presented.

Methods | brows | eyes | nose | upper lip | in mouth | lower lip | mouth all | facial skin | overall
FCN J. Long and Darrel (2015) | 0.677 | 0.743 | 0.886 | 0.624 | 0.764 | 0.751 | 0.853 | 0.880 | 0.819
CRFASRNN S. Zheng (2015) | 0.682 | 0.769 | 0.885 | 0.627 | 0.769 | 0.774 | 0.859 | 0.896 | 0.823
DEEPLAB Chen et al. (2014) | 0.661 | 0.704 | 0.878 | 0.585 | 0.701 | 0.724 | 0.833 | 0.881 | 0.798
DEEPLAB-CRF Chen et al. (2014) | 0.401 | 0.728 | 0.807 | 0.460 | 0.702 | 0.717 | 0.798 | 0.910 | 0.716
DEEPLAB-DT Chen et al. (2016) | 0.700 | 0.754 | 0.901 | 0.638 | 0.738 | 0.762 | 0.863 | 0.901 | 0.830
DEEPLAB-GCRF Chandra and Kokkinos (2016) | 0.701 | 0.736 | 0.886 | 0.624 | 0.719 | 0.749 | 0.863 | 0.889 | 0.830
SEGNET Badrinarayanan et al. (2015) | 0.757 | 0.823 | 0.904 | 0.718 | 0.791 | 0.813 | 0.879 | 0.911 | 0.858
Smith et al. B. Smith (2013) | 0.722 | 0.785 | 0.922 | 0.651 | 0.713 | 0.700 | 0.857 | 0.882 | 0.804
MO-GC S. F. Liu and Yang (2015) | 0.713 | 0.768 | 0.909 | 0.623 | 0.808 | 0.694 | 0.841 | 0.912 | 0.847
RNN-G Liu et al. (2017) | 0.77 | 0.868 | 0.930 | 0.743 | 0.792 | 0.817 | 0.891 | 0.921 | 0.886
FC-RCCN wo C-CRF | 0.777 | 0.853 | 0.918 | 0.770 | 0.848 | 0.835 | 0.901 | 0.943 | 0.886
FC-RCCN | 0.810 | 0.869 | 0.922 | 0.779 | 0.852 | 0.841 | 0.912 | 0.955 | 0.901

Compared with the latest face parsing methods such as RNN-G and MO-GC, FC-RCCN outperforms them on most of the classes. We want to highlight that the segmentation accuracies for classes such as the brows, upper lip, inner mouth, lower lip and facial skin are improved by more than 3%. It is noted that RNN-G achieves better per-class performance for classes such as the nose, possibly due to the

sub-networks designed for refining the segmentation accuracy of small regions in RNN-G. In future work, we will integrate such sub-networks into FC-RCCN to improve the performance further. We also compare FC-RCCN with a number of recent fully convolutional semantic segmentation methods, and better segmentation performance is achieved. Architectures such as DEEPLAB and FCN are not specially designed for full-resolution segmentation, whereas the facial component regions are small. We can see that the primary advantage of our model comes from the refined objects and improved segmentation details obtained through the up-convolutions and multi-scale feature fusion. An interesting observation is that SEGNET, with its up-convolutions, achieves better performance than DEEPLAB, FCN and CRFASRNN on the HELEN dataset for segmentation with small objects, even though its performance is worse than those methods on PASCAL VOC 2012. This indicates the importance of up-convolutions for face parsing. The qualitative results of FC-RCCN and the compared methods are presented in Figure 4. Overall, FC-RCCN produces better segmentation masks compared to the current state of the art, and it handles small objects (such as the eyes, mouth and eyebrows) better by being trained with the C-CRF in an end-to-end way. (Please refer to Lei Zhou (2018) for more visual and quantitative results on the HELEN dataset.)


Fig. 4. The visual results of the proposed FC-RCCN and compared methods on the HELEN dataset. The results of CRFASRNN, FCN, DEEPLAB, DEEPLAB-DT, SEGNET, FC-RCCN wo C-CRF, and FC-RCCN are listed in each row from left to right.

5.3. Evaluation on PASCAL VOC Dataset
PASCAL VOC 2012 Everingham et al. (2010) is a well-known segmentation dataset which contains a training set, a validation set and a test set, with 1464, 1449 and 1456 images respectively. Following common practice, we augment the dataset with the extra annotations provided by Hariharan et al. (2011). This gives a total of 10,582 training images. The dataset provides annotations with 20 object categories and one background class. Since the labels of the test set are not publicly available, all reported results have been obtained from the VOC evaluation server. For evaluating the proposed method, two settings are compared. In the setting of FC-RCCN without C-CRF, only the unary network is trained, while the pairwise network and C-CRF network are not used. A mean IoU of 81.6 [1] is obtained, which is better than many state-of-the-art methods such as DPN Liu et al. (2015b), BoxSup Dai et al. (2015), WSSL Papandreou et al. (2015) and CRF-RNN S. Zheng (2015). In the full version of FC-RCCN, the unary network, pairwise network and C-CRF network are trained jointly with pixel-level and superpixel-level supervision. Compared to the baseline network, the segmentation performance is improved by 1.6 points, reaching a mean IoU of 83.2 [2]. To evaluate the C-CRF component of FC-RCCN more comprehensively, we also compare FC-RCCN with other CRF based models initialized from the ResNet-101 network, such as DeepLabv2-CRF Chen et al. (2018) and GCRF (also denoted DeepLabv2-GCRF) Chandra and Kokkinos (2016).

Table 4. Performance on the PASCAL VOC 2012 test set. The IoU of the twenty categories and the mean IoU are presented.

Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU
FCN J. Long and Darrel (2015) | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1 | 62.2
WSSL Papandreou et al. (2015) | 89.2 | 46.7 | 88.5 | 63.5 | 68.4 | 87.0 | 81.2 | 86.3 | 32.6 | 80.7 | 62.4 | 81.0 | 81.3 | 84.3 | 82.1 | 56.2 | 84.6 | 58.3 | 76.2 | 67.2 | 73.9
CRF-RNN S. Zheng (2015) | 90.4 | 55.3 | 88.7 | 68.4 | 69.8 | 88.3 | 82.4 | 85.1 | 32.6 | 78.5 | 64.4 | 79.6 | 81.9 | 86.4 | 81.8 | 58.6 | 82.4 | 53.5 | 77.4 | 70.1 | 74.7
BoxSup Dai et al. (2015) | 89.8 | 38.0 | 89.2 | 68.9 | 68.0 | 89.6 | 83.0 | 87.7 | 34.4 | 83.6 | 67.1 | 81.5 | 83.7 | 85.2 | 83.5 | 58.6 | 84.9 | 55.8 | 81.2 | 70.7 | 75.2
DPN Liu et al. (2015b) | 89.0 | 61.6 | 87.7 | 66.8 | 74.7 | 91.2 | 84.3 | 87.6 | 36.5 | 86.3 | 66.1 | 84.4 | 87.8 | 85.6 | 85.4 | 63.6 | 87.3 | 61.3 | 79.4 | 66.4 | 77.5
DeepLabv2-CRF Chen et al. (2018) | 92.6 | 60.4 | 91.6 | 63.4 | 76.3 | 95.0 | 88.4 | 92.6 | 32.7 | 88.5 | 67.6 | 89.6 | 92.1 | 87.0 | 87.4 | 63.3 | 88.3 | 60.0 | 86.8 | 74.5 | 79.7
GCRF Chandra and Kokkinos (2016) | 92.9 | 61.2 | 91.0 | 66.3 | 77.7 | 95.3 | 88.9 | 92.4 | 33.8 | 88.4 | 69.1 | 89.8 | 92.9 | 87.7 | 87.5 | 62.6 | 89.9 | 59.2 | 87.1 | 74.2 | 80.2
FC-RCCN wo C-CRF | 94.4 | 60.8 | 93.0 | 76.3 | 80.6 | 95.0 | 86.7 | 92.1 | 39.3 | 88.0 | 72.3 | 91.8 | 91.3 | 85.7 | 87.7 | 69.2 | 90.3 | 63.1 | 84.0 | 67.9 | 81.6
FC-RCCN | 94.6 | 69.8 | 93.6 | 75.6 | 84.0 | 94.3 | 89.4 | 92.9 | 42.9 | 91.7 | 72.6 | 88.9 | 92.7 | 88.6 | 87.0 | 77.1 | 92.0 | 67.5 | 87.1 | 77.4 | 83.2

[1] http://host.robots.ox.ac.uk:8080/anonymous/A0QEIO.html
[2] http://host.robots.ox.ac.uk:8080/anonymous/SCXNBN.html

The detailed results for each category and the mean IoU scores are shown in Table 4. FC-RCCN outperforms the competing methods in most of the categories. In particular, it significantly outperforms DeepLabv2-GCRF, which is built on a pixel-level CRF. Compared with DeepLabv2-CRF, DeepLabv2-GCRF achieves a 0.5 mIoU gain, while FC-RCCN achieves a larger gain of 1.6 mIoU. This indicates that the joint learning strategy and superpixel guided prediction are more effective in boosting the segmentation performance.

5.4. Ablation Study
In order to evaluate the role of the C-CRF component in FC-RCCN, we compare the performance of FC-RCCN with and without the C-CRF. In the experiment on the LFW-PL dataset, the per-class accuracies and the overall accuracy are improved by using the C-CRF compared to FC-RCCN without C-CRF. In particular, about 2% performance gain is achieved for classes such as facial skin and hair. In the experiment on the HELEN dataset, FC-RCCN achieves an overall accuracy of 90.1, which is better than the 88.6 of the method without C-CRF, and the per-class accuracies of all seven classes are higher than those of FC-RCCN without C-CRF, with an average accuracy gain of about 1.5%. In the experiments on the PASCAL VOC 2012 dataset, the mean IoU is improved by 1.6 with C-CRF guided training and inference. The experimental results indicate that the C-CRF works well in improving the training efficiency of a complex network and in generating high-quality segmentation results.

5.5. Discussion
Once the network architecture is fixed, the CRF parameter λ and the superpixel number are two crucial parameters that affect the segmentation performance. First we validate the impact of λ with the superpixel number set to 1000. DeepLabv2-ResNet is selected as the baseline. The quantitative evaluation on the PASCAL VOC 2012 validation dataset, obtained by varying λ from 0 to 100, is shown in Table 5. One can find that incorporating the local consistency penalty is useful for improving the mIoU. However, a too large λ (e.g., 5 and 100) in turn causes performance degeneration. This is primarily because the linear model may be contradictory to the purpose of label consistency. For example, for two adjacent but distinctive regions to both of which we want to assign the same label, the linear model may cause counteraction. According to the evaluation, we select λ = 1 in our experiments.

Table 5. The impact of the parameter λ on the segmentation performance on the PASCAL VOC 2012 validation dataset.

λ | 0 | 0.1 | 1 | 5 | 20 | 100
mIoU | 80.01 | 80.08 | 81.13 | 81.10 | 80.90 | 80.78

Then the effect of the superpixel number is evaluated. As shown in Table 6, the segmentation performance under different superpixel configurations is compared. Even when the superpixel number is as small as 300, around 0.6 points of mean IoU gain can be achieved compared with the baseline. This is primarily because the combination with the pixel-level learning and inference strategy can reduce the effect of superpixel errors by compensating for the information lost in coarse superpixels. As the superpixel number increases, the segmentation performance becomes better. We can find that when the superpixel number is larger than 1500, the mIoU changes only slightly. This indicates that too many superpixels may not be beneficial for further segmentation refinement. However, the inference cost may increase as the dimension of the graph becomes larger. Hence, we select NoS = 1500 in our experiments.

Table 6. The impact of the superpixel number on the segmentation performance on the PASCAL VOC 2012 validation dataset. The mean IoU and pixel accuracy under different superpixel configurations are presented. NoS represents the number of superpixels. Baseline: DeepLabv2-ResNet, mIoU = 79.4.

NoS | Mean IoU | Pixel Accuracy
300 | 0.8026 | 0.8723
500 | 0.8075 | 0.8803
700 | 0.8099 | 0.8823
900 | 0.8113 | 0.8834
1100 | 0.8118 | 0.8838
1300 | 0.8119 | 0.8839
1500 | 0.8124 | 0.8844
1700 | 0.8124 | 0.8845
1900 | 0.8124 | 0.8845

As to the runtime for segmenting an image of size 320 × 320 on a GTX 1080Ti GPU, the unary network takes 0.07 s, the pairwise network takes 0.008 s, and the C-CRF inference takes 0.01 s. The most time-consuming step is superpixel generation, which takes about 0.16 s. In summary, the total time cost of the proposed method is around 0.25 s for parsing an image of size 320 × 320.

6. Conclusions
In this paper, we propose a Fully Convolutional Residual Continuous CRF Network (FC-RCCN) for semantic segmentation. FC-RCCN first applies a residual block based unary network with multi-scale feature fusion to generate full-resolution predictions. The superpixel based continuous conditional random field (C-CRF) is implemented as a pairwise network and a C-CRF network. The pairwise network is designed to learn the pairwise relationships between superpixels. In the C-CRF network, a differentiable continuous CRF layer is designed to generate the final segmentation masks by combining the outputs of the unary network and the pairwise network. The power of the proposed method in integrating the various subnetworks is tested and evaluated comprehensively. In addition, we propose a novel joint pixel-level and superpixel-level supervised learning strategy to optimize the whole framework in an end-to-end way. The effectiveness of the learning strategy is validated with reasonable learning outcomes. Experimental results and comparisons with existing methods show that the proposed method achieves state-of-the-art performance. Since the proposed method achieves a significant performance gain, in the future a more powerful unary network may be incorporated into the proposed method to achieve better performance.


References


Arnab, A., Jayasumana, S., Zheng, S., Torr, P.H., 2016. Higher order conditional random fields in deep neural networks, in: European Conference on Computer Vision, Springer, pp. 524–540.
B. Smith, L. Zhang, J. Brandt, J. Yang, 2013. Exemplar-based face parsing, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), IEEE.
Badrinarayanan, V., Kendall, A., Cipolla, R., 2015. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561.
Cao, X.D., Wei, Y.C., Wen, F., Sun, J., 2014. Face alignment by explicit shape regression. Int. J. Comput. Vis. 107, 177–190.
Chandra, S., Kokkinos, I., 2016. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs, in: Euro. Conf. Comput. Vis., Springer, pp. 402–418.
Chen, L., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
Chen, L.C., Barron, J.T., Papandreou, G., Murphy, K., Yuille, A.L., 2016. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4545–4554.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 834–848.
Dai, J., He, K., Sun, J., 2015. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1635–1643.
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88, 303–338.
Fu, J., Liu, J., Wang, Y., Lu, H., 2017a. Stacked deconvolutional network for semantic segmentation. arXiv preprint arXiv:1708.04943.
Fu, K., Gu, I.Y.H., Yang, J., 2017b. Saliency detection by fully learning a continuous conditional random field. IEEE Transactions on Multimedia 19, 1531–1544.
Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J., 2011. Semantic contours from inverse detectors, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, pp. 991–998.
He, K.M., Zhang, X.Y., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778.
J. Long, E. Shelhamer, T. Darrel, 2015. Fully convolutional networks for semantic segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), IEEE.
Kae, A., Sohn, K., Lee, H., Learned-Miller, E., 2013. Augmenting CRFs with Boltzmann machine shape priors for image labeling, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2019–2026.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097–1105.
Krähenbühl, P., Koltun, V., 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials, in: NIPS.
Ladicky, L., Russell, C., Kohli, P., Torr, P.H., 2010. Graph cut based inference with co-occurrence statistics, in: Euro. Conf. Comput. Vis., Springer, pp. 239–253.
Ladický, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H., 2010. What, where and how many? Combining object detectors and CRFs, in: Euro. Conf. Comput. Vis., Springer, pp. 424–437.
Lafferty, J., McCallum, A., Pereira, F.C., 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proc. Int. Conf. Mach. Learn. (ICML).
Li, Z.Q., Chen, J.S., 2015. Superpixel segmentation using linear spectral clustering, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), IEEE.
Lin, G., Milan, A., Shen, C., Reid, I., 2017. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lin, G., Shen, C., van den Hengel, A., Reid, I., 2016. Efficient piecewise training of deep structured models for semantic segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3194–3203.
Liu, F., Shen, C., Lin, G., 2015a. Deep convolutional neural fields for depth estimation from a single image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162–5170.
Liu, S., Shi, J., Liang, J., Yang, M.H., 2017. Face parsing via recurrent propagation. arXiv preprint arXiv:1708.01936.
Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X., 2015b. Semantic image segmentation via deep parsing network, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, pp. 1377–1385.
Ma, Z., Chang, X., Xu, Z., Sebe, N., Hauptmann, A.G., 2017. Joint attributes and event analysis for multimedia event detection. IEEE Transactions on Neural Networks and Learning Systems.
Noh, H., Hong, S., Han, B., 2015. Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528.
Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L., 2015. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734.
Press, W., Teukolsky, S., Vetterling, W., Flannery, B., 1982. Numerical Recipes in C. Volume 2. Cambridge University Press.
Russell, C., Kohli, P., Torr, P.H., et al., 2009. Associative hierarchical CRFs for object class image segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., IEEE, pp. 739–746.
S. F. Liu, J. Yang, C. Huang, M.H. Yang, 2015. Multi-objective convolutional learning for face labeling, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), IEEE.
S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P.H.S. Torr, 2015. Conditional random fields as recurrent neural networks, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), IEEE.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Smith, B.M., Zhang, L., Brandt, J., Lin, Z., Yang, J., 2013. Exemplar-based face parsing, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3484–3491.
Vedaldi, A., Lenc, K., 2015. MatConvNet: Convolutional neural networks for MATLAB, in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM, pp. 689–692.
Vemulapalli, R., Tuzel, O., Liu, M.Y., Chellapa, R., 2016. Gaussian conditional random field network for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3233.
Wang, S., Chang, X., Li, X., Long, G., Yao, L., Sheng, Q., 2016. Diagnosis code assignment using sparsity-based disease correlation embedding. IEEE Transactions on Knowledge & Data Engineering, 1–1.
Wang, S., Li, X., Yao, L., Sheng, Q.Z., Long, G., et al., 2017. Learning multiple diagnosis codes for ICU patients with local disease correlation mining. ACM Transactions on Knowledge Discovery from Data (TKDD) 11, 31.
Xie, L., Zhu, L., Chen, G., 2016. Unsupervised multi-graph cross-modal hashing for large-scale multimedia retrieval. Multimedia Tools and Applications 75, 9185–9204.
Lei Zhou, 2018. The experiment results on the HELEN and LFW datasets. https://github.com/zmbhou/dataset-PR.git.
Zhou, L., Cai, C., Gao, Y., Su, S., Wu, J., 2018. Variational autoencoder for low bit-rate image compression, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2617–2620.
Zhu, L., Jin, H., Zheng, R., Feng, X., 2014. Effective naive Bayes nearest neighbor based image classification on GPU. The Journal of Supercomputing 68, 820–848.
Zhu, L., Shen, J., Xie, L., 2015. Topic hypergraph hashing for mobile image retrieval, in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM, pp. 843–846.
