Journal Pre-proof

Wenliang Qiu, Xinbo Gao, Bing Han, Saliency Detection using a Deep Conditional Random Field Network, Pattern Recognition (2020). doi: https://doi.org/10.1016/j.patcog.2020.107266

Received 24 February 2019; Revised 2 February 2020; Accepted 8 February 2020.

Highlights

• We design a multi-scale backward optimization network that holds both rich inherent features from shallower layers and semantic features from deeper layers; high-level features are then transmitted backward to guide the low-level features.

• A deep CRF network is introduced to model the relationships between adjacent pixels, which is crucial for improving the quality of saliency maps.


Saliency Detection using a Deep Conditional Random Field Network

Wenliang Qiu, Xinbo Gao*, Bing Han

Visual Information Processing Laboratory, School of Electronic Engineering, Xidian University, Xi'an, China

*Corresponding author. Email address: [email protected] (Xinbo Gao)

Abstract

Saliency detection has made remarkable progress along with the development of deep learning. However, how to integrate low-level intrinsic context with high-level semantic information so as to keep object boundaries sharp and suppress background noise remains a challenging problem. Many network structures and refinement strategies have been explored; for example, a Conditional Random Field (CRF) is often used to improve the accuracy of the saliency map, but it is independent of the deep network and cannot be trained end-to-end. To tackle this issue, we propose a novel Deep Conditional Random Field network (DCRF) that takes both deep features and neighbor information into consideration. First, a Multi-scale Feature Extraction Module (MFEM) is adopted to capture low-level texture and high-level semantic features, and multiple stacks of deconvolution layers are employed to improve the spatial resolution of the deeper layers. Then we employ a Backward Optimization Module (BOM) to guide the shallower layers with the high-level location and shape information derived from the deeper layers, which intrinsically enhances the representational capacity of the low-level features. Finally, a Deep Conditional Random Field Module (DCRFM) with unary and pairwise potentials is designed to exploit spatial neighbor relations and obtain a compact and uniform saliency map. Extensive experimental results on 5 datasets in terms of 6 evaluation metrics demonstrate that the proposed method achieves state-of-the-art performance.

Keywords: saliency detection, conditional random field, convolutional neural network

1. Introduction

With an influx of visual information every second, the Human Visual System (HVS) can still effectively focus on the most crucial elements and process them further with limited computing resources.

The key behind this phenomenon is the Selective Attention (SA) mechanism. Saliency detection aims at mimicking this HVS characteristic, which distinguishes foreground objects from the background.

With the rapid development of deep learning, convolutional neural networks have surpassed numerous traditional models in computer vision tasks such as image classification[1][2][3], visual target tracking[4][5][6], and semantic image segmentation[7][8][9][10]. However, this does not mean that the core ideas of traditional models are out of date; how to transfer their essence to deep neural networks and build an end-to-end network that combines the advantages of both deep learning and traditional machine learning models is therefore of great significance. In this paper, we introduce a conditional random field into the proposed multi-scale backward optimization neural network to preserve object borders.

Deep learning based saliency detection methods have achieved exciting performance. Some methods[11][12][13][14][15] directly apply networks trained for image classification and use the high-level semantic features extracted from deeper layers to detect salient objects, which is beneficial for locating and estimating the approximate saliency region. However, owing to the absence of low-level texture information and high spatial resolution, this leads to blurred edges and a noisy background. Therefore, some methods[16][17][18][19] take the side outputs of the backbone network into consideration to enrich the detailed information and enhance the sharpness of boundaries. Moreover, in order to keep the original spatial resolution of a scene, some works[12][15][20] start from the input image, which is zoomed and cropped into several fixed-resolution sub-images and then fed into the networks. Furthermore, to counteract the blurred boundaries of deep learning based saliency detection methods, a CRF[21][22] is usually introduced to refine saliency maps as a post-processing stage. Similar to traditional CRF-based saliency detection methods[23][24][25], the estimated saliency map generated by the deep neural network is fed into the CRF as the unary potential. However, this CRF procedure is typically independent of the whole network and cannot provide guidance to the network.

To overcome the aforementioned problems, the goal of this work is to design a multi-scale backward optimization network that extracts saliency-oriented low-level context and high-level semantic features with high spatial resolution, and then develops relationships between shallow and deep features. A deep CRF network can therefore be built on top of these features. In other words, we estimate the saliency value of a pixel by considering all of its neighbors, which is conducive to sharpening the boundary and restraining the noise within an object. The contributions of this work are two-fold. Firstly, we design a multi-scale backward optimization network that holds both rich inherent features from shallower layers and semantic features from deeper layers, and then transmits high-level features backward to guide the low-level features. Secondly, a deep CRF network is introduced to model the relationships among adjacent pixels, which is crucial for improving the quality of saliency maps.

The rest of this paper is organized as follows. We introduce the related work on deep learning based saliency detection methods in Section 1.1. Section 2 presents the proposed DCRF saliency detection algorithm.

Section 3 gives experimental results and analysis on five widely used datasets, and the paper is concluded in Section 4.

1.1. Related Work

Convolutional Neural Networks (CNNs) have substantially improved the performance of many computer vision tasks. Their powerful feature representation ability has lifted Deep Learning (DL) based salient object detection models to unprecedented heights, especially in complex scenes. Most early DL-based methods[12][15][14] are patch-based: each pre-segmented region is fed into the neural network to estimate its saliency value. Most of the latest methods instead prefer to input the whole image to the network, i.e., image-based methods. The reshape or resize operations and the inharmony between low-level and high-level features of deep neural networks still lead to blurred saliency and inaccurate boundaries, so, inspired by the success of traditional saliency detection methods[26][24][27] that employ superpixels as a prior segmentation, some methods incorporate pre-computed regions[15][14][28] or a post-processing stage[21][29][30], such as superpixels or a CRF, to keep spatial coherence and accurate boundaries. In this section, we give a brief introduction to the above three kinds of models.

Patch-based models. Li et al. [12] extract features from three different scales, namely object-level, region-level and image-level information, within a deep neural network, and further boost the deep features with fully connected layers. In [15], a multi-context deep network with two branches is proposed, in which global and local context are utilized and integrated into a unified network to estimate the saliency values. In [14], two deep neural networks, namely Deep Neural Network Local (DNN-L) and Deep Neural Network Global (DNN-G), are designed to learn local patch features and to explore the relationships among global saliency cues and saliency values. Patch-based methods are time-consuming due to the feature extraction of patches and the fully connected layers, so more and more researchers tend to use the whole image as input and obtain pixel-wise saliency maps.

Image-based models. Other CNN based methods[31][32][33][34] introduce a stage-by-stage strategy to obtain more accurate saliency maps. For instance, Wang et al. [34] proposed a stage-wise neural network with a pyramid pooling module and a multi-stage refinement mechanism to detect salient objects. Islam et al. [31] presented a hierarchical representation of relative ranking and stage-wise refinement to achieve better performance. In order to eliminate noisy background information, Wang et al. [35] proposed a global recurrent localization network to better localize salient objects via a weighted response map and a novel recurrent structure. In [36], Zhang et al. proposed a bi-directional message passing model that takes multi-level context-aware features into consideration, with a gate function to control the contribution of features from different levels. Luo et al. [18] proposed a grid-structured neural network to combine local and global information; inspired by the Mumford-Shah functional, a boundary loss function is designed to penalize errors on the boundary.

Refinement-based models. In order to obtain compact saliency maps with accurate boundaries, Hou et al. [21] provide a skip-layer structure with a holistically-nested edge detector (HED) based deep neural network; to further improve spatial coherence, a fully connected CRF is introduced to correct wrong predictions. However, the HED-based network and the fully connected CRF are two independent modules, and the whole model cannot be trained end-to-end. In [29], two complementary streams are employed: the first stream produces pixel-level saliency, and the second stream extracts segment-wise features by spatial pooling; a fully connected CRF is also introduced to improve contour localization. In [30], a novel deep convolutional network is designed to predict human eye fixations and segment the salient object simultaneously, and a CRF is used to refine the result and overcome the drawback of coarse resolution. It is worth mentioning that the CRF components in [21], [29] and [30] are independent of the neural network.

Different from the aforementioned refinement-based models, we propose an end-to-end deep conditional random field network that estimates the saliency value of each pixel according to high spatial resolution multi-level deep features and neighbor relationships. The proposed three modules guide the weight learning and convergence to generate compact and uniform saliency maps.

2. Model Description

Figure 1: Framework of the proposed DCRF network (multi-scale feature extraction, unary and pairwise nets, graph construction, and mean-field inference).

In this work, we propose a deep conditional random field network to detect salient objects in a scene.


The whole framework, as shown in Fig. 1, consists of three parts: a multi-scale feature extraction module (MFEM), a backward optimization module (BOM) and a deep conditional random field module (DCRFM). More specifically, MFEM is designed to capture the low-level intensity, color and texture features as well as high-level semantic information with high spatial resolution. Then, BOM delivers the semantic location and shape information to the shallower layers for more precise representation. Finally, we build a grid-shaped graph on the feature space and introduce DCRFM to calculate the saliency value of each pixel according to its neighbors.

Fig. 2 shows the main goal of the feature extraction parts of the proposed method: MFEM and BOM fuse more high-level location and shape information into the shallower layers and improve the spatial resolution of the deeper layers. Generally speaking, interpolation, unpooling and deconvolution are three widely used strategies to enhance spatial resolution. Simple bilinear interpolation computes the value of each pixel from the nearest four neighbors with a linear map that depends only on the relative positions[8]. Unpooling places each activation back to its original pooled location, which is useful for reconstructing structure[37]. The parameters of these two strategies are fixed and do not need to be learnt during training, which makes them easy to compute. Deconvolution, in contrast, reconstructs the original size of the activations in a learnable way and is easily performed in-network for end-to-end learning, which is crucially important for the proposed DCRF.

Figure 2: The role of MFEM and BOM. Shallow layers have high spatial resolution but little semantic information, while deep layers have high-level knowledge but low spatial resolution; the multi-scale features produced by MFEM and BOM combine high spatial resolution with high-level knowledge.

We therefore prefer deconvolution to recover the spatial resolution of the deeper layers in MFEM. Different from the simple deconvolution in [8] and [33], we use multiple stacks of deconvolution layers and activation functions to predict finer details while retaining high-level semantic information. On the one hand, a stack of deconvolution layers can learn a nonlinear upsampling that preserves details.

On the other hand, inspired by the HED[38] architecture, which uses side outputs to enhance edge detection, multiple side-output deconvolution layers are introduced to capture different levels of inherent detail and enhance them with higher resolution. Fig. 3 shows the saliency maps generated by different parts of VGG-16: each side output has the potential to provide a satisfactory saliency map, and shallower layers preserve more details while deeper layers locate the object better and contain less noise. Overall, deeper outputs are better than shallow outputs, which is consistent with the results in the ablation study. By connecting deeper layers to shallower layers in BOM, high-level semantic features can be transmitted to the shallower layers and thus help them better enrich the low-level textural features. Based on MFEM and BOM, multi-scale features from the different refined layers are extracted and combined for DCRFM.

Figure 3: Saliency maps generated by different parts of the VGG-16 network. (a) Input, (b) ground truth, (c)-(g) saliency maps generated by VGG POOL1 to VGG POOL5, from shallow to deep.

2.1. Multi-Scale Feature Extraction Module

The feature description process of a deep neural network is progressive: the shallower layers concentrate on extracting inherent information and have relatively high spatial resolution and rich details, and these low-level features can give the salient object a clearer boundary. The low-level features are then delivered to the deeper layers, where semantic features are distilled. The high-level features can better locate the salient object and give a rough description of its shape; however, the spatial resolution is reduced by pooling. For salient object detection, a good saliency map needs both sharp object boundaries and complete salient regions, and a clean, pure background is also required. Based on these facts, MFEM is developed to capture low-level and high-level features with high spatial resolution.

In consideration of effectiveness and simplicity, we use VGG-16[2] as the backbone network in this paper. For convenience, X_l is defined as the input of a convolution layer, and W_l and b_l are the weights and bias of that layer, where l = {1, 2, ..., L} indexes the l-th layer of the network. The output of the l-th layer is defined as F_l(X_l, {W_l, b_l}) = \mathcal{A}(W_l \otimes X_l + b_l), where \mathcal{A}(\cdot) is the activation function and \otimes is the convolution operator. We fix the size of the input image to 352 × 352 and the size of the output to 176 × 176; the final saliency maps are then reverted to the initial size by bilinear interpolation.

The structure of MFEM is illustrated in Fig. 4: the fully connected layers of VGG-16 are removed and the first five convolution blocks are retained. All these convolution blocks are further processed to obtain both high spatial resolution and precision.

Figure 4: Multi-scale feature extraction module.

Table 1: Details of the Multi-scale Feature Extraction Module.

| Block    | Details | Number | Kernel | Stride | Padding | Output      |
|----------|---------|--------|--------|--------|---------|-------------|
| VGG1     | Conv    | 2      | 3×3    | 1      | Yes     | 352×352×64  |
|          | maxPool | 1      | 2×2    | 2      | Yes     | 176×176×64  |
| VGG2     | Conv    | 2      | 3×3    | 1      | Yes     | 176×176×128 |
|          | maxPool | 1      | 2×2    | 2      | Yes     | 88×88×128   |
| VGG3     | Conv    | 3      | 3×3    | 1      | Yes     | 88×88×256   |
|          | maxPool | 1      | 2×2    | 2      | Yes     | 44×44×256   |
| VGG4     | Conv    | 3      | 3×3    | 1      | Yes     | 44×44×512   |
|          | maxPool | 1      | 2×2    | 2      | Yes     | 22×22×512   |
| VGG5     | Conv    | 3      | 3×3    | 1      | Yes     | 22×22×512   |
|          | maxPool | 1      | 2×2    | 2      | Yes     | 11×11×512   |
| MFEM1    | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×128 |
| MFEM_2_1 | Conv    | 1      | 3×3    | 1      | Yes     | 88×88×128   |
| MFEM_3_1 | Conv    | 1      | 3×3    | 1      | Yes     | 44×44×128   |
| MFEM_4_1 | Conv    | 1      | 3×3    | 1      | Yes     | 22×22×128   |
| MFEM_5_1 | Conv    | 1      | 3×3    | 1      | Yes     | 11×11×128   |
| MFEM_2_2 | deConv  | 1      | 3×3    | 1      | Yes     | 176×176×128 |
| MFEM_3_2 | deConv  | 2      | 3×3    | 1      | Yes     | 176×176×128 |
| MFEM_4_2 | deConv  | 3      | 3×3    | 1      | Yes     | 176×176×128 |
| MFEM_5_2 | deConv  | 4      | 3×3    | 1      | Yes     | 176×176×128 |

Concretely speaking, the shallowest block contains texture information and already has high resolution, so we only use a 3×3 convolution kernel with 128 channels to further refine the low-level features.

For the rest of the blocks, 3 × 3 convolution kernels with 128 channels are utilized to refine and unify the middle- and high-level features, and different numbers of 5 × 5 deconvolution kernels with 128 channels follow to improve the spatial resolution of the semantic features. The detailed settings of MFEM are given in Table 1. Five sets of features with high spatial resolution and multi-level representational capacity are thus obtained from MFEM.
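To make the multi-stack deconvolution idea concrete, the sketch below builds one MFEM branch with tf.keras layers. It follows the layer counts and feature sizes of Table 1, but the function name, the ReLU activation and the stride-2 upsampling in each deconvolution are our assumptions for illustration, not the authors' released implementation.

```python
import tensorflow as tf

def mfem_branch(x, num_deconv):
    """Sketch of one MFEM branch: a 3x3 convolution mapping the VGG block
    output to 128 channels, followed by `num_deconv` stacked transposed
    convolutions, each assumed to double the spatial resolution."""
    x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    for _ in range(num_deconv):
        # A stride-2 transposed convolution is a learnable, nonlinear
        # upsampling step, unlike fixed bilinear interpolation or unpooling.
        x = tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding='same',
                                            activation='relu')(x)
    return x

# Example: the deepest block (11x11x512 from VGG5) needs four deconvolutions
# to reach the common 176x176x128 resolution (MFEM_5_1 -> MFEM_5_2).
vgg5 = tf.keras.Input(shape=(11, 11, 512))
mfem_5_2 = mfem_branch(vgg5, num_deconv=4)   # shape (None, 176, 176, 128)
```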

2.2. Backward Optimization Module

Figure 5: Backward optimization module.

Table 2: Details of the Backward Optimization Module.

| Block   | Input       | Details | Number | Kernel | Stride | Padding | Output                 |
|---------|-------------|---------|--------|--------|--------|---------|------------------------|
| BOM_1_1 | 176×176×640 | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×128            |
| BOM_2_1 | 176×176×512 | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×128            |
| BOM_3_1 | 176×176×384 | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×128            |
| BOM_4_1 | 176×176×256 | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×128            |
| BOM_5_1 | 176×176×128 | Conv    | 1      | 3×3    | 1      | No      | 176×176×128            |
| BOM_1_2 | 176×176×128 | Conv    | 1      | 3×3    | 1      | No      | 176×176×2              |
| BOM_2_2 | 176×176×128 | Conv    | 1      | 3×3    | 1      | No      | 176×176×2              |
| BOM_3_2 | 176×176×128 | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×2              |
| BOM_4_2 | 176×176×128 | Conv    | 1      | 3×3    | 1      | No      | 176×176×2              |
| BOM_5_2 | 176×176×128 | Conv    | 1      | 3×3    | 1      | Yes     | 176×176×2              |
| FUSE    | 176×176×640 | Conv    | 2      | 3×3    | 1      | No      | 176×176×512, 176×176×2 |

In the last subsection, MFEM endowed the high-level semantic features with high spatial resolution; however, the features from the shallower blocks have not yet been guided and enhanced by the deeper blocks. Hence, in this subsection, we optimize the feature extraction of the shallower blocks by introducing the location and shape information provided by the deeper blocks. The structure of BOM is illustrated in Fig. 5. We define the output of the m-th (1 ≤ m ≤ 5) MFEM block and n-th (2 ≤ n ≤ 5) part as F_{MFEM_{m,n}}, and the output of the m-th BOM block as F_{BOM_m}; the (m − i)-th block of BOM can then be described as:

F_{BOM_{m-i}} = \mathcal{A}\big( W_{BOM_{m-i}} \otimes ( F_{MFEM_{m,2}} \oplus F_{MFEM_{m-1,2}} \oplus \cdots \oplus F_{MFEM_{m-i,2}} ) + b_{BOM_{m-i}} \big),   (1)

where ⊕ denotes concatenation along the feature channel. The detailed structure of BOM is shown in Table 2. All five sets of convolutional features refined by MFEM and BOM contain both texture and semantic properties; the differences between them lie in the priority of the features: the enhanced shallower blocks focus more on low-level features and high resolution, while the deeper blocks pay more attention to high-level shape and location information. To make the most of the five feature sets, we reduce their dimensionality with a 1 × 1 convolution kernel with 2 channels (see Table 2 for details). The weighted features are obtained by weighting the five feature sets:

F_{weighted} = \sum_{m} k_m \cdot F_{BOM_{m,2}},   (2)

where k_m is the weight of the m-th feature set. Finally, we feed F_{weighted} and the fused BOM features F_{fuse} to a softmax function, which maps the features into the binary saliency space. The estimated saliency map \hat{y} is defined as:

\hat{y}_{*} = p_{*}\big(\hat{y}_{*}(F_{*}) = c\big) = \frac{e^{W_{*}^{c} \otimes F_{*} + b_{*}^{c}}}{\sum_{c' \in \{0,1\}} e^{W_{*}^{c'} \otimes F_{*} + b_{*}^{c'}}}.   (3)

In this work, we compute the pixel-level loss between the estimated saliency values and the ground truth with the cross-entropy function, which is defined as:

L_{*} = \sum_{i} \big[ y(x_i) \log\big(\hat{y}_{*}(x_i) = 1\big) + \big(1 - y(x_i)\big) \log\big(\hat{y}_{*}(x_i) = 0\big) \big],   (4)

where y(x_i) is the label of pixel i and (\hat{y}_{*}(x_i) = 1) is the probability that pixel i belongs to the salient region; * stands for weighted or fuse depending on the situation. Thus, two feature-level losses, L_weighted and L_fuse, are obtained.
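The sketch below illustrates how the backward concatenation of Eq. (1), the weighted fusion of Eq. (2) and the per-pixel cross-entropy of Eqs. (3)-(4) fit together, using tf.keras layers. The function and variable names, the ReLU activation and the use of fixed fusion weights k_m are illustrative assumptions, not the authors' code.

```python
import tensorflow as tf

def bom_head(mfem_feats, k_weights):
    """Sketch of the backward optimization head following Eqs. (1)-(2).
    `mfem_feats` lists the five 176x176x128 MFEM outputs ordered from
    shallow (m=1) to deep (m=5); `k_weights` are the fusion weights k_m."""
    side_logits = []
    for m in range(5):
        # Eq. (1): concatenate the current block with all deeper refined
        # features, then refine with a 3x3 convolution (BOM_m_1).
        feats = mfem_feats[m:]
        cat = feats[0] if len(feats) == 1 else tf.keras.layers.Concatenate()(feats)
        f_bom = tf.keras.layers.Conv2D(128, 3, padding='same',
                                       activation='relu')(cat)
        # BOM_m_2: a 1x1 convolution with 2 channels gives per-pixel saliency logits.
        side_logits.append(tf.keras.layers.Conv2D(2, 1, padding='same')(f_bom))
    # Eq. (2): weighted sum of the five side outputs.
    weighted = tf.add_n([k * s for k, s in zip(k_weights, side_logits)])
    return weighted

def pixelwise_ce(labels, logits):
    """Per-pixel two-class cross-entropy used for L_weighted and L_fuse
    (softmax of Eq. (3) followed by the summed loss of Eq. (4))."""
    return tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))
```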

2.3. Deep Conditional Random Field Module

In the last two subsections, saliency was estimated from the multi-scale features of each single pixel, which effectively describe the inherent and extrinsic content of a scene. However, this feature-level estimation ignores the influence of the neighbors of each pixel, which may lead to blurred boundaries and a fragmented foreground, as shown in Fig. 7. To overcome this problem, we incorporate a CRF into the network and build a grid-shaped, eight-connected graph G = (V, E): every pixel is regarded as a vertex in V, and two adjacent vertices constitute an edge in E.
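For illustration, the following sketch enumerates the edge set of such an eight-connected grid graph; the function name and edge representation are our own, and only the grid size (22 × 22, as used below) comes from the paper.

```python
def grid_graph_edges(height, width):
    """List the edges E of an eight-connected grid graph G = (V, E) whose
    vertices are the cells of a height x width grid. Each undirected edge
    is listed once as ((r, c), (r2, c2))."""
    # Right, down, down-right and down-left offsets cover all eight
    # neighbour relations without repeating the reverse direction.
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
    edges = []
    for r in range(height):
        for c in range(width):
            for dr, dc in offsets:
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < height and 0 <= c2 < width:
                    edges.append(((r, c), (r2, c2)))
    return edges

# The DCRFM graph in this paper is built on a 22 x 22 grid.
edges = grid_graph_edges(22, 22)   # 1806 undirected edges
```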

In consideration of the complexity and effectiveness of the conditional random field, we build the graph on a 22 × 22 grid. Specifically, the features generated by VGG4 and by the first deconvolution layer of MFEM_5_2 (see Table 1) compose the input of DCRFM, with a size of 22 × 22 × 640. Then, unary and pairwise networks are designed to refine and shape the deep features, which are further fed into the unary and pairwise potential functions. As in a traditional CRF, for each vertex i both the unary feature information P_U(y_i | x_i) and the pairwise context constraints P_V(y_i, y_j | x_i, x_j) are taken into consideration. Optimizing the CRF is equivalent to maximizing the likelihood function P(Y | X; \omega_{crf}), or to minimizing the energy function

E = \sum_{i} P_U(y_i | x_i; \omega_{crf}) + \sum_{i} \sum_{j} P_V(y_i, y_j | x_i, x_j; \omega_{crf}).

Since the graph G contains cycles rather than being tree-structured, exact optimization is time-consuming, and the convergence of the deep neural network also requires many iterations, so traditional Stochastic Gradient Descent (SGD) is not directly suitable for training DCRFM. For this reason, piecewise learning [39][40] is introduced to optimize the deep conditional random field, and the conditional likelihood is defined as a product of independent potential functions:

P(Y | X; \omega_{crf}) = \prod_{U \in \mathcal{U}} \prod_{c \in \{0,1\}} P_U(y_c | X; \omega_{crf}) \prod_{V \in \mathcal{V}} \prod_{(c,\dot{c}) \in L_V} P_V(y_c, y_{\dot{c}} | X; \omega_{crf}),   (5)

where (c, \dot{c}) is a label pair, L_V = {(0,0), (0,1), (1,0), (1,1)}, and the unary and pairwise potential functions are defined as:

P_U(y_c | X; \omega_{crf}) = \frac{e^{-U(y_c | X; \omega_{crf})}}{\sum_{c'} e^{-U(y_{c'} | X; \omega_{crf})}},   (6)

P_V(y_c, y_{\dot{c}} | X; \omega_{crf}) = \frac{e^{-V(y_c, y_{\dot{c}} | X; \omega_{crf})}}{\sum_{c', \dot{c}'} e^{-V(y_{c'}, y_{\dot{c}'} | X; \omega_{crf})}},   (7)

where U(y_c | X; \omega_{crf}) and V(y_c, y_{\dot{c}} | X; \omega_{crf}) are the outputs of the unary net and the pairwise net, respectively. It can be seen that P_U and P_V are classical softmax functions, which makes it convenient to compute the gradients and to integrate DCRFM into the whole model as an end-to-end network. The loss of DCRF, L_{crf}, is obtained by minimizing the negative log-likelihood:

L_{crf} = -\sum_{i} \Big[ \sum_{U \in \mathcal{U}} \sum_{c \in \{0,1\}} \log P_U(y_c | x_i; \omega_{crf}) + \sum_{V \in \mathcal{V}} \sum_{(c,\dot{c}) \in L_V} \log P_V(y_c, y_{\dot{c}} | x_c, x_{\dot{c}}; \omega_{crf}) \Big].   (8)
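A minimal sketch of this piecewise objective is given below, under our reading of Eqs. (6)-(8): the unary net scores each vertex over the two labels, the pairwise net scores each edge over the four label pairs, and both terms reduce to softmax cross-entropies. The tensor shapes and the label-pair encoding are assumptions for illustration.

```python
import tensorflow as tf

def piecewise_crf_loss(unary_scores, pair_scores, labels, edges):
    """Piecewise CRF loss in the spirit of Eq. (8).
    unary_scores: [N, 2] unary-net outputs U(.) for the N graph vertices.
    pair_scores:  [E, 4] pairwise-net outputs V(.) for the E edges, ordered
                  by the label pairs (0,0), (0,1), (1,0), (1,1).
    labels:       [N] int ground-truth labels; edges: [E, 2] vertex indices."""
    # Eq. (6): the unary potential is a softmax over the two labels, so its
    # negative log-likelihood is a standard per-vertex cross-entropy on -U.
    unary_nll = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=-unary_scores)
    # Eq. (7): the pairwise potential is a softmax over the four label pairs;
    # the target index encodes the labels of both edge endpoints.
    pair_labels = 2 * tf.gather(labels, edges[:, 0]) + tf.gather(labels, edges[:, 1])
    pair_nll = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=pair_labels, logits=-pair_scores)
    # Eq. (8): summed negative log-likelihood of the independent potentials.
    return tf.reduce_sum(unary_nll) + tf.reduce_sum(pair_nll)
```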

To sum up, the final loss of the proposed deep conditional random field network is defined as:

L_{DCRF} = L_{weighted} + L_{fuse} + L_{crf}.   (9)

3. Experimental Results

3.1. Datasets and Evaluation Metrics

In this work, we employ five widely used datasets for evaluation: DUT-OMRON[41], ECSSD[42], HKU-IS[15], PASCAL-S[43] and SOD[44]. DUT-OMRON consists of 5168 high quality images containing one or more objects and is challenging due to its complex scenarios. ECSSD contains 1000 images with structurally complex scenes and rich semantic information. HKU-IS consists of 4447 images in which the objects are widely distributed around the scene, especially near the image borders. PASCAL-S consists of 850 images chosen from the validation set of the PASCAL VOC2010 segmentation challenge, with ground truth masks labeled by 12 subjects. The images in SOD are part of the BSD (Berkeley Segmentation Dataset) and comprise 300 images with one or more objects. All these datasets provide pixel-wise, manually labeled ground truth.

Six universally agreed metrics are applied to comprehensively evaluate the effectiveness of the proposed DCRF network: the precision-recall curve (PR), the F-measure curve (Fm), the weighted F-measure score (wFm)[45], the max F-measure score (maxFm), the mean F-measure score (meanFm), and the mean absolute error (MAE). The first five metrics are based on the overlapping area between the ground truth and the saliency map. Precision and recall can be regarded as measures of quality and quantity, and a good saliency map requires both to be high. To give a harmonic mean of precision and recall, we thus report Fm, wFm, maxFm, and meanFm for a more comprehensive view. Concretely, the F-measure is defined as:

F\text{-}measure = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall},   (10)

where \beta^2 is set to 0.3 as suggested in [46] to stress the importance of precision; maxFm and meanFm are the maximum and mean of the Fm curve, respectively. Moreover, wFm offers a unified way to evaluate both non-binary and binary maps (see [45] for details) and provides a more intuitive evaluation result. MAE gives a unified assessment of both foreground and background by calculating the error rate between the saliency map S and the ground truth Gt. Given a W × H image, MAE is simply defined as:

MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| Gt(x, y) - S(x, y) \right|.   (11)
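For reference, a minimal NumPy sketch of these two metrics is shown below; the fixed binarization threshold is an illustrative choice, whereas the curves reported in the paper sweep the threshold instead.

```python
import numpy as np

def f_measure(sal, gt, beta2=0.3, threshold=0.5):
    """F-measure of Eq. (10) for a saliency map `sal` in [0, 1] and a
    boolean ground-truth mask `gt`, binarised at a fixed threshold."""
    pred = sal >= threshold
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

def mae(sal, gt):
    """Mean absolute error of Eq. (11); both maps are scaled to [0, 1]."""
    return np.abs(gt.astype(np.float64) - sal).mean()

# Example on synthetic data: a random map versus a rectangular object mask.
sal = np.random.rand(352, 352)
gt = np.zeros((352, 352), dtype=bool)
gt[100:200, 120:220] = True
print(f_measure(sal, gt), mae(sal, gt))
```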

3.2. Implementation

The proposed DCRF is deployed on Ubuntu 14 with TensorFlow 1.2. The blocks VGG1-VGG5 shown in Fig. 4 and Table 1 are initialized from the VGG-16 model[2], and the remaining convolution and deconvolution layers are initialized with a truncated normal distribution.


We use Adam with a learning rate of 10^{-6} to train the network. The training set contains 3000 images from the MSRA-B dataset, as suggested in [47]. To further augment the training samples, we rotate each training sample to four angles (0°, 90°, 180°, 270°), so that 12000 samples are obtained as the final training set. The number of iterations in DCRFM is set to 3. Training DCRF for 20 epochs takes about 5.5 hours on an NVIDIA Titan X GPU, and testing takes about 0.14 seconds per image.
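A brief sketch of this four-angle augmentation, under the assumption that the four angles correspond to 90-degree rotations of each image-mask pair:

```python
import tensorflow as tf

def augment_four_angles(image, mask):
    """Return the (image, mask) pair rotated by 0, 90, 180 and 270 degrees,
    which turns 3000 training samples into 12000."""
    return [(tf.image.rot90(image, k), tf.image.rot90(mask, k)) for k in range(4)]
```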

Figure 6: Ablation study of different modules.

3.3. Ablation Study

Module analysis. To verify the effectiveness of each module, we evaluate MFEM, MFEM+BOM, and MFEM+BOM+DCRFM (namely the proposed DCRF), together with several baseline models including VGG_POOL1-5, VGG_POOL5+CRF, MFEM+CRF and MFEM+BOM+CRF, on the ECSSD, DUT-OMRON and SOD datasets. Specifically, for MFEM we concatenate the outputs of MFEM_t_2 (t ∈ [2, 5]) and MFEM1 into one feature map; a 1 × 1 convolution with 2 channels and a softmax function then map this feature map into saliency space. The configuration remains the same as in Section 3.2; the whole training time is about 3.25 hours, and the testing speed is about 41 fps. For MFEM+BOM, the loss function is L_weighted + L_fuse; it takes about 3.55 hours to train the model, and testing runs at 37 fps.


Figure 7: Qualitative comparison of different modules. (a) Input, (b) GT, (c) MFEM, (d) MFEM+BOM, (e) MFEM+BOM+DCRFM.

The experimental results are shown in Fig. 6. In general, each module plays a critical role in improving the performance of DCRF. With the participation of BOM, the multi-scale deep features with high spatial resolution are guided and refined by the semantic location and shape information of the objects, which helps to obtain more representative features and more competitive saliency maps. With the assistance of DCRFM, the saliency value of each pixel depends not only on its own features but also on its neighborhood, which contributes to filling holes in an object, sharpening the boundary and removing background noise. Additionally, DCRFM achieves better and more stable improvements on top of the various feature extraction modules. The subjective results shown in Fig. 7 confirm the effectiveness of the three modules.

Different sizes of graphs. The computational complexity of DCRFM increases with the size of the graph, so it is desirable to reduce the graph while preserving accuracy. Here we test graphs of size 22 × 22, 88 × 88 and 176 × 176 on the ECSSD dataset. For the 88 × 88 graph, we use the output of MFEM_2_1 and of MFEM_t_2 after t − 2 (t ∈ [3, 5]) deconvolutions as the input of DCRFM. Similarly, for 176 × 176, the outputs of MFEM1 and MFEM_t_2 with t ∈ [3, 5] are fed into DCRFM. As shown in Fig. 8, the models with the three graph sizes achieve close performance in terms of PR curves. From the perspective of the Fm curve, wFm, meanFm and maxFm, the performance decreases slightly as the graph size increases, while MAE remains about the same for the different graph sizes. In DCRFM, the unary potential of the conditional likelihood focuses on the relation between deep features and labels, while the pairwise potential pays more attention to the discrepancy between neighboring pixels. As illustrated in Fig. 8, the features derived from the deeper layers of MFEM play a more important role in DCRFM.

Figure 8: Objective evaluation of different graph sizes (22×22, 88×88 and 176×176) on ECSSD in terms of PR curves, F-measure curves, and wFm/meanFm/maxFm/MAE scores.

Comparison and analysis between standard CRF and DCRFM. In order to provide a more comprehensive explanation of the differences between a standard CRF and DCRFM, we combine different parts of the proposed method with a post-processing CRF and with DCRFM on the baseline dataset ECSSD and on the challenging datasets DUT-OMRON and SOD, which contain several objects per image. Here we employ DenseCRF[48] with its default parameter settings. The comparison results are shown in Table 3, where the left and right parts represent the improvements over MFEM and MFEM+BOM, respectively. Overall, DCRFM is better than the standard CRF on the three datasets. The improvement of the standard CRF is not stable and can even reduce the performance. One reason is that the parameters of a traditional CRF must be learned or adjusted manually for different styles of saliency maps, whereas DCRFM is learned in-network and achieves a stable enhancement. Additionally, the running time of the standard CRF is on average 0.8 seconds per image, which is much more costly than the proposed method.

Table 3: Comparison of standard CRF and the proposed DCRFM with different feature extraction modules on ECSSD, DUT-OMRON and SOD datasets.

| Dataset | Metric | MFEM  | MFEM+CRF | MFEM+DCRFM | MFEM+BOM | MFEM+BOM+CRF | MFEM+BOM+DCRFM |
|---------|--------|-------|----------|------------|----------|--------------|----------------|
| ECSSD   | wFm    | 0.811 | 0.803    | 0.818      | 0.820    | 0.804        | 0.825          |
| ECSSD   | meanFm | 0.842 | 0.858    | 0.852      | 0.847    | 0.860        | 0.854          |
| ECSSD   | maxFm  | 0.881 | 0.886    | 0.891      | 0.888    | 0.885        | 0.896          |
| ECSSD   | MAE    | 0.070 | 0.073    | 0.068      | 0.067    | 0.073        | 0.065          |
| DUT     | wFm    | 0.586 | 0.582    | 0.590      | 0.596    | 0.593        | 0.606          |
| DUT     | meanFm | 0.653 | 0.682    | 0.666      | 0.656    | 0.683        | 0.663          |
| DUT     | maxFm  | 0.723 | 0.724    | 0.730      | 0.723    | 0.725        | 0.732          |
| DUT     | MAE    | 0.089 | 0.084    | 0.084      | 0.086    | 0.078        | 0.082          |
| SOD     | wFm    | 0.700 | 0.672    | 0.704      | 0.700    | 0.656        | 0.710          |
| SOD     | meanFm | 0.739 | 0.750    | 0.737      | 0.731    | 0.732        | 0.737          |
| SOD     | maxFm  | 0.811 | 0.803    | 0.818      | 0.810    | 0.791        | 0.814          |
| SOD     | MAE    | 0.094 | 0.095    | 0.092      | 0.093    | 0.097        | 0.091          |

3.4. Objective Evaluation

In this subsection, we compare the proposed method with eleven state-of-the-art algorithms, divided into two groups: traditional methods and deep learning based methods. Four traditional methods, namely HS[42], MR[41], BSCA[49], and wCtr[50], are selected, and seven deep learning based methods with outstanding performance are chosen: C2SNet[51], MWS[52], DS[53], Amulet[16], MDF[12], KSR[54], and UCF[55]. To be fair, we run the authors' codes with default settings or directly use the released saliency maps for all of the above methods.

Table 4: Comparison of wFm, meanFm, maxFm and MAE on five datasets.

| Dataset  | Metric | DCRF  | C2SNet | MWS   | DS    | Amulet | MDF   | KSR   | UCF   | HS    | MR    | BSCA  | wCtr  |
|----------|--------|-------|--------|-------|-------|--------|-------|-------|-------|-------|-------|-------|-------|
| DUT      | wFm    | 0.606 | 0.639  | 0.522 | 0.487 | 0.626  | 0.565 | 0.486 | 0.573 | 0.350 | 0.381 | 0.370 | 0.427 |
| DUT      | meanFm | 0.663 | 0.662  | 0.606 | 0.603 | 0.639  | 0.640 | 0.591 | 0.615 | 0.513 | 0.528 | 0.509 | 0.528 |
| DUT      | maxFm  | 0.732 | 0.720  | 0.709 | 0.739 | 0.712  | 0.694 | 0.677 | 0.693 | 0.596 | 0.609 | 0.600 | 0.622 |
| DUT      | MAE    | 0.082 | 0.079  | 0.108 | 0.120 | 0.098  | 0.092 | 0.131 | 0.120 | 0.227 | 0.187 | 0.191 | 0.144 |
| ECSSD    | wFm    | 0.825 | 0.838  | 0.704 | 0.667 | 0.840  | 0.705 | 0.630 | 0.573 | 0.806 | 0.454 | 0.496 | 0.513 |
| ECSSD    | meanFm | 0.854 | 0.848  | 0.838 | 0.825 | 0.847  | 0.794 | 0.782 | 0.834 | 0.627 | 0.691 | 0.702 | 0.676 |
| ECSSD    | maxFm  | 0.896 | 0.888  | 0.874 | 0.876 | 0.904  | 0.829 | 0.817 | 0.887 | 0.725 | 0.732 | 0.757 | 0.705 |
| ECSSD    | MAE    | 0.065 | 0.059  | 0.099 | 0.122 | 0.059  | 0.105 | 0.132 | 0.069 | 0.228 | 0.186 | 0.183 | 0.171 |
| HKU-IS   | wFm    | 0.819 | -      | 0.674 | -     | 0.817  | -     | 0.586 | 0.799 | 0.422 | 0.456 | 0.467 | 0.515 |
| HKU-IS   | meanFm | 0.853 | -      | 0.812 | -     | 0.837  | -     | 0.746 | 0.820 | 0.634 | 0.662 | 0.654 | 0.677 |
| HKU-IS   | maxFm  | 0.891 | -      | 0.854 | -     | 0.882  | -     | 0.779 | 0.866 | 0.705 | 0.703 | 0.718 | 0.718 |
| HKU-IS   | MAE    | 0.051 | -      | 0.085 | -     | 0.051  | -     | 0.120 | 0.062 | 0.215 | 0.174 | 0.175 | 0.142 |
| PASCAL-S | wFm    | 0.727 | 0.758  | 0.606 | 0.533 | 0.734  | 0.586 | 0.565 | 0.693 | 0.420 | 0.415 | 0.434 | 0.451 |
| PASCAL-S | meanFm | 0.753 | 0.750  | 0.715 | 0.655 | 0.712  | 0.693 | 0.700 | 0.721 | 0.524 | 0.584 | 0.596 | 0.596 |
| PASCAL-S | maxFm  | 0.825 | 0.825  | 0.795 | 0.763 | 0.817  | 0.756 | 0.761 | 0.801 | 0.644 | 0.641 | 0.668 | 0.654 |
| PASCAL-S | MAE    | 0.097 | 0.087  | 0.136 | 0.176 | 0.100  | 0.145 | 0.157 | 0.116 | 0.164 | 0.231 | 0.223 | 0.201 |
| SOD      | wFm    | 0.710 | -      | 0.554 | 0.526 | 0.662  | -     | 0.507 | 0.651 | 0.369 | 0.383 | 0.400 | 0.413 |
| SOD      | meanFm | 0.737 | -      | 0.698 | 0.665 | 0.671  | -     | 0.644 | 0.685 | 0.492 | 0.549 | 0.554 | 0.569 |
| SOD      | maxFm  | 0.814 | -      | 0.765 | 0.769 | 0.767  | -     | 0.704 | 0.768 | 0.612 | 0.607 | 0.623 | 0.607 |
| SOD      | MAE    | 0.091 | -      | 0.139 | 0.161 | 0.124  | -     | 0.163 | 0.131 | 0.270 | 0.236 | 0.230 | 0.197 |

The PR curves on the five datasets are illustrated in Fig. 9. DCRF surpasses the other methods on the HKU-IS, PASCAL-S, and SOD datasets, and achieves competitive performance on the remaining two datasets.

Figure 9: Comparison of precision-recall curves on five datasets. (a) DUT-OMRON, (b) ECSSD, (c) HKU-IS, (d) PASCAL-S, (e) SOD.

Figure 10: Comparison of F-measure curves on five datasets. (a) DUT-OMRON, (b) ECSSD, (c) HKU-IS, (d) PASCAL-S, (e) SOD.

In Fig. 10, the Fm curve of DCRF is slightly below Amulet on the ECSSD dataset, but on the other four datasets DCRF is preferable to the state-of-the-art methods. Moreover, the Fm curves of DCRF stay high and stable over a wide range of thresholds, because DCRFM can effectively sharpen object boundaries and filter out background noise.


Figure 11: Qualitative comparison of the state-of-the-art methods. (a) Input, (b) GT, (c) DCRF, (d) C2SNet, (e) MWS, (f) DS, (g) Amulet, (h) MDF, (i) KSR, (j) UCF, (k) HS, (l) MR, (m) BSCA, (n) wCtr.

The stationarity of the Fm curves is vital for subsequent visual tasks. Table 4 shows the numerical results for wFm, meanFm, maxFm, and MAE. DCRF consistently achieves better rates on the HKU-IS, PASCAL-S, and SOD datasets, and also attains better scores on both the DUT-OMRON and ECSSD datasets. It is worth mentioning that MDF uses the same 3000 training images as DCRF and KSR uses the whole MSRA-B dataset, while the training sets of C2SNet, Amulet and UCF are composed of the MSRA10K dataset, which contains 10000 images, and MWS uses 10553 images from the DUTS[56] dataset for its second training stage. Nevertheless, DCRF significantly outperforms UCF on five metrics and Amulet in most cases. What is more, DS uses a model pre-trained on the PASCAL VOC segmentation dataset, which overlaps with PASCAL-S and HKU-IS. On the SOD dataset, DCRF substantially outperforms the state-of-the-art methods, improving wFm, meanFm, and maxFm by 7.2, 7.6, and 5.9 percent, respectively, while MAE drops by 36.3 percent compared with the other methods, which indicates that DCRF is capable of handling multi-object scenes.


3.5. Subjective Evaluation

Eleven testing samples from the aforementioned five datasets are shown in Fig. 11. On the whole, the saliency maps generated by DCRF preserve much clearer edges and highlight the foreground more uniformly with a pure background. In row 7, most methods fail to detect both the human and the cat, whereas DCRF not only detects both objects but also preserves details such as the gap between them. In row 10, DCRF, C2SNet, DS and MDF successfully highlight the sea gull in the upper right corner; however, the saliency map yielded by MDF has a noisy background around the object and a blurry boundary, and noise also exists in the background regions of DS. For multi-object scenes such as row 5, four methods, namely DCRF, C2SNet, DS and UCF, achieve promising results. But the saliency values of DS within the object are not homogeneous and a blocking effect remains, while the three objects in the saliency maps of C2SNet and UCF stick together and lose the detail information near the edges. By contrast, DCRF obtains an exact detection.

3.6. Statistical Analysis

In order to give a more intuitive view of the five datasets, we briefly list some of their properties in Table 5 (see [57] for details). Most of the five datasets contain multiple objects per image; DUT-OMRON and HKU-IS have relatively small salient objects, while SOD consists of relatively large objects with a wide size range. DCRF attains top performance on the five datasets and is insensitive to object size. Additionally, many images in SOD have low color contrast with the background or objects that touch the image boundaries; as shown in Table 4, DCRF achieves better performance on SOD, which means DCRF is not sensitive to low-contrast images and is good at detecting large objects. Images in HKU-IS contain multiple disconnected objects with a relatively diverse spatial distribution and similar foreground-background appearance, so DCRF also has the ability to extract object regions from complex scenes.

Table 5: Statistics of the five datasets.

| Dataset   | #Img. | #Obj. | Obj. Area (%) |
|-----------|-------|-------|---------------|
| DUT-OMRON | 5168  | 1-4+  | 14.85±12.15   |
| ECSSD     | 1000  | ~1    | 23.51±14.02   |
| HKU-IS    | 4447  | 1-4+  | 19.13±10.90   |
| PASCAL-S  | 850   | 1-4+  | 24.23±16.70   |
| SOD       | 300   | 1-4+  | 27.99±19.36   |

3.7. Computational Complexity

In this section, we compare DCRF with the aforementioned state-of-the-art methods under the same settings as described in Section 3.2, taking both running time and model size into consideration. In Table 6, VGG_POOLx denotes a model trained on the output of the x-th pooling block of the VGG-16 backbone network.

The model size of DCRF increases by 50% compared with the backbone network, while the performance is significantly improved, as demonstrated in Fig. 6. As illustrated in Table 7, DCRF stays at an upper-middle level with relatively fast speed and a relatively small model size. Compared with the latest method C2SNet, although C2SNet achieves a faster testing time, it has the most trainable parameters because of the fully convolutional encoder-decoder network designed for both contour and saliency detection.

Table 6: Average training time, testing time and model size on the ECSSD, DUT-OMRON and SOD datasets.

| Model             | VGG_POOL1 | VGG_POOL2 | VGG_POOL3 | VGG_POOL4 | VGG_POOL5 | MFEM  | MFEM+BOM | DCRF  |
|-------------------|-----------|-----------|-----------|-----------|-----------|-------|----------|-------|
| Training Time (h) | 0.70      | 1.20      | 1.55      | 1.92      | 2.37      | 3.25  | 3.55     | 5.50  |
| Testing Time (s)  | 0.005     | 0.008     | 0.012     | 0.013     | 0.015     | 0.024 | 0.027    | 0.140 |
| Model Size (MB)   | 60        | 68        | 86        | 142       | 203       | 246   | 267      | 312   |

Table 7: Comparison of average running time and model size on the ECSSD, DUT-OMRON and SOD datasets.

| Model           | DCRF | C2SNet | MWS  | DS   | Amulet | MDF   | KSR  | UCF  | HS   | MR   | BSCA | wCtr |
|-----------------|------|--------|------|------|--------|-------|------|------|------|------|------|------|
| Time (s)        | 0.14 | 0.04   | 0.02 | 0.21 | 0.06   | 21.73 | 51.3 | 0.13 | 1.95 | 0.16 | 1.42 | 0.15 |
| Model Size (MB) | 312  | 606    | 113  | 510  | 133    | 462   | 456  | 118  | -    | -    | -    | -    |

4. Conclusion and Future Work

An end-to-end deep conditional random field network is proposed to detect salient objects in a scene; it mainly consists of a multi-scale feature extraction module, a backward optimization module, and a deep conditional random field module. Benefiting from MFEM and BOM, the low-level features yielded by the shallower layers are guided by semantic information and reinforced in the target area, and, correspondingly, the high-level features generated by the deeper layers are endowed with high spatial resolution. DCRFM is then introduced to further refine the saliency map toward clearer boundaries, compact objects and a pure background. Additionally, DCRFM can be easily deployed on other network structures to further refine their saliency maps and enhance the representation ability of the whole network, so future research will extend this work by investigating the use of DCRFM in other computer vision tasks. Experimental results on five widely used datasets in terms of six metrics demonstrate the effectiveness and robustness of the proposed method.

In this paper, only unary and pairwise potentials are introduced in DCRFM, which achieves relatively good results in most scenes, but some limitations remain.

For instance, a CRF with only unary and pairwise potentials cannot effectively correct the recognition errors[58] caused by poor unary classification. In this situation, higher-order potentials, such as higher-order pixel correlation[59] or superpixel-based potentials[60], can provide important information and help reduce salient object detection errors. Superpixel segmentation has been widely used in non-deep-learning based saliency detection approaches, as it can provide a reliable pre-segmentation and reduce the computational complexity of saliency detection. In the deep learning era, how to improve neural networks with traditional machine learning models has significant potential value, and fusing the idea of superpixels into DCRFM as a higher-order potential is expected to further improve the performance.

Declaration of interest statement

The authors declare that they have no conflicts of interest with respect to this work, and that they do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: arXiv preprint arXiv:1409.1556, 2014.

[3] X. Bu, Y. Wu, Z. Gao, Y. Jia, Deep convolutional network with locality and sparsity constraints for texture classification, Pattern Recognition 91 (2019) 34–46.

[4] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3119–3127.

[5] L. Wang, W. Ouyang, X. Wang, H. Lu, Stct: sequentially training convolutional networks for visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1373–1381.

[6] P. Li, D. Wang, L. Wang, H. Lu, Deep visual tracking: Review and experimental comparison, Pattern Recognition 76 (2018) 323–338.

[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, Semantic image segmentation with deep convolutional nets and fully connected crfs, in: arXiv preprint arXiv:1412.7062, 2014.


[8] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. [9] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4) (2017) 640–651. [10] F. Liu, G. Lin, C. Shen, Crf learning with cnn features for image segmentation, Pattern Recognition 48 (10) (2015) 2983–2992. [11] G. Lee, Y. W. Tai, J. Kim, Deep saliency with encoded low level distance map and high level features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 660–668. [12] G. Li, Y. Yu, Visual saliency based on multiscale deep features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455–5463. [13] N. Liu, J. Han, Dhsnet: Deep hierarchical saliency network for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 678–686. [14] L. Wang, H. Lu, X. Ruan, M.-H. Yang, Deep networks for saliency detection via local estimation and global search, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3183– 3192. [15] R. Zhao, W. Ouyang, H. Li, X. Wang, Saliency detection by multi-context deep learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274. [16] P. Zhang, D. Wang, H. Lu, H. Wang, X. Ruan, Amulet: aggregating multilevel convolutional features for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 202–211. [17] S. He, J. Jiao, X. Zhang, G. Han, R. W. Lau, Delving into salient object subitizing and detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1059–1067. [18] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, P.-M. Jodoin, Nonlocal deep features for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6593– 6601. [19] H. Chen, Y. Li, D. Su, Multi-modal fusion network with multi-scale multipath and cross-modal interactions for rgb-d salient object detection, Pattern Recognition 86 (2019) 376–385.


[20] J. Kim, V. Pavlovic, A shape-based approach for salient object detection using deep learning, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 455–470. [21] Q. Hou, M. M. Cheng, X. W. Hu, A. Borji, Z. Tu, P. Torr, Deeply supervised salient object detection with short connections, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5300–5309. [22] P. Hu, B. Shuai, J. Liu, G. Wang, Deep level sets for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 540–549. [23] T. Liu, Z. Yuan, J. Sun, N. N. Zheng, X. Tang, H. Y. Shum, Learning to detect a salient object, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2) (2011) 353–367. [24] W. Qiu, X. Gao, B. Han, A superpixel-based crf saliency detection approach, Neurocomputing 244 (2017) 19–32. [25] L. Mai, Y. Niu, F. Liu, Saliency aggregation: a data-driven approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1131–1138. [26] A. Aksac, T. Ozyer, R. Alhajj, Complex networks driven salient region detection based on superpixel segmentation, Pattern Recognition 66 (2017) 268–279. [27] J. Zhang, K. A. Ehinger, H. Wei, K. Zhang, J. Yang, A novel graph-based optimization framework for salient object detection, Pattern Recognition 64 (2017) 39–50. [28] N. Liu, J. Han, D. Zhang, S. Wen, T. Liu, Predicting eye fixations using convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 362–370. [29] G. Li, Y. Yu, Deep contrast learning for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478–487. [30] S. S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, R. V. Babu, Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5781–5790. [31] M. A. Islam, M. Kalash, N. D. B. Bruce, Revisiting salient object detection: simultaneous detection, ranking, and subitizing of multiple salient objects, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.


[32] M. A. Islam, M. Rochan, N. D. B. Bruce, W. Yang, Gated feedback refinement network for dense image labeling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4877– 4885. [33] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528. [34] T. Wang, A. Borji, L. Zhang, P. Zhang, H. Lu, A stagewise refinement model for detecting salient objects in images, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4019–4028. [35] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, A. Borji, Detect globally, refine locally: a novel approach to saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3127–3135. [36] L. Zhang, J. Dai, H. Lu, Y. He, G. Wang, A bi-directional message passing model for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1741–1750. [37] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 818–833. [38] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1395–1403. [39] C. Sutton, A. Mccallum, Piecewise training for undirected models, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2005, pp. 568–575. [40] G. Lin, C. Shen, A. V. D. Hengel, I. Reid, Efficient piecewise training of deep structured models for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3194–3203. [41] C. Yang, L. Zhang, H. Lu, X. Ruan, M.-H. Yang, Saliency detection via graph-based manifold ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173. [42] Q. Yan, L. Xu, J. Shi, J. Jia, Hierarchical saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155–1162. [43] Y. Li, X. Hou, C. Koch, J. M. Rehg, A. L. Yuille, The secrets of salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 280–287.


[44] V. Movahedi, J. H. Elder, Design and perceptual validation of performance measures for salient object segmentation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 49–56.

[45] R. Margolin, L. Zelnik-Manor, A. Tal, How to evaluate foreground maps?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 248–255.

[46] A. Borji, M. Cheng, H. Jiang, J. Li, Salient object detection: a benchmark, IEEE Transactions on Image Processing 24 (12) (2015) 5706–5722.

[47] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, S. Li, Salient object detection: a discriminative regional feature integration approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2083–2090.

[48] P. Krähenbühl, V. Koltun, Efficient inference in fully connected crfs with gaussian edge potentials, in: Proceedings of the Advances in Neural Information Processing Systems, 2011, pp. 109–117.

[49] Y. Qin, H. Lu, Y. Xu, H. Wang, Saliency detection via cellular automata, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 110–119.

[50] W. Zhu, S. Liang, Y. Wei, J. Sun, Saliency optimization from robust background detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.

[51] X. Li, F. Yang, H. Cheng, W. Liu, D. Shen, Contour knowledge transfer for salient object detection, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 355–370.

[52] Y. Zeng, Y. Zhuge, H. Lu, L. Zhang, M. Qian, Y. Yu, Multi-source weak supervision for saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6074–6083.

[53] X. Li, L. Zhao, L. Wei, M. H. Yang, F. Wu, Y. Zhuang, H. Ling, J. Wang, Deepsaliency: multi-task deep neural network model for salient object detection, IEEE Transactions on Image Processing 25 (8) (2016) 3919–3930.

[54] T. Wang, L. Zhang, H. Lu, C. Sun, J. Qi, Kernelized subspace ranking for saliency detection, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 450–466.

[55] P. Zhang, D. Wang, H. Lu, H. Wang, B. Yin, Learning uncertain convolutional features for accurate saliency detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 212–221.


[56] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, X. Ruan, Learning to detect salient objects with image-level supervision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 136–145.

[57] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, Salient object detection in the deep learning era: an in-depth survey, in: arXiv preprint arXiv:1904.09146, 2019.

[58] J. Dai, K. He, J. Sun, Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1635–1643.

[59] M. S. Barbosa, A. Bubna-Litic, T. Maddess, Locally countable properties and the perceptual salience of textures, JOSA A 30 (8) (2013) 168–1697.

[60] A. Arnab, S. Jayasumana, S. Zheng, P. H. Torr, Higher order conditional random fields in deep neural networks, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 524–540.

Wenliang Qiu received the B.Eng. degree in intelligent science and technology and the Ph.D. degree in intelligent information processing from Xidian University, Xi'an, China, in 2011 and 2018, respectively. His research interests include computer vision and image and video saliency analysis.

Xinbo Gao received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 1994, 1997, and 1999, respectively. From 1997 to 1998, he was a research fellow at the Department of Computer Science, Shizuoka University, Shizuoka, Japan. From 2000 to 2001, he was a post-doctoral research fellow at the Department of Information Engineering, the Chinese University of Hong Kong, Hong Kong. Since 2001, he has been at the School of Electronic Engineering, Xidian University. He is currently a Cheung Kong Professor of the Ministry of Education of P. R. China, a Professor of Pattern Recognition and Intelligent Systems, and the Dean of the Graduate School of Xidian University. His current research interests include image processing, computer vision, multimedia analysis, machine learning and pattern recognition. He has published six books and around 300 technical articles in refereed journals and proceedings. Prof. Gao is on the Editorial Boards of several journals, including Signal Processing (Elsevier) and Neurocomputing (Elsevier). He has served as the General Chair/Co-Chair, Program Committee Chair/Co-Chair, or PC Member for around 30 major international conferences. He is a Fellow of the Institution of Engineering and Technology and a Fellow of the Chinese Institute of Electronics.

Bing Han received the B.Eng. degree in automatic control, the M.Sc. degree in signal and information processing, and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an, China, in 2001, 2004, and 2007, respectively. She is currently an Associate Professor of signal and information processing with Xidian University. Her research interests include pattern recognition, computer vision, and aurora image and video analysis.