Deepside: A general deep framework for salient object detection


Neurocomputing 356 (2019) 69–82


Keren Fu a, Qijun Zhao a,∗, Irene Yu-Hua Gu b, Jie Yang c

a College of Computer Science, Sichuan University, Sichuan, China
b Department of Electrical Engineering, Chalmers University of Technology, Gothenburg, Sweden
c Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China

Article info

Article history: Received 22 November 2018; Revised 17 April 2019; Accepted 18 April 2019; Available online 10 May 2019. Communicated by Dr. Y. Liu.

Keywords: Salient object detection; Convolutional neural network; Side structure; Deep supervision

Abstract: Deep learning-based salient object detection techniques have shown impressive results compared to conventional saliency detection by handcrafted features. Integrating hierarchical features of Convolutional Neural Networks (CNNs) to achieve fine-grained saliency detection is a current trend, and various deep architectures have been proposed by researchers, including the "skip-layer" architecture, the "top-down" architecture, the "short-connection" architecture and so on. While these architectures have achieved progressive improvements in detection accuracy, the underlying distinctions and connections between these schemes remain unclear. In this paper, we review and draw underlying connections between these architectures, and show that they can actually be unified into a general framework, which simply has side structures of different depths. Based on the idea of designing deeper side structures for better detection accuracy, we propose a unified framework called Deepside that can be deeply supervised to incorporate hierarchical CNN features. Additionally, to fuse multiple side outputs from the network, we propose a novel fusion technique based on segmentation-based pooling, which serves as a built-in component in the CNN architecture and guarantees more accurate boundary details of detected salient objects. The effectiveness of the proposed Deepside scheme against state-of-the-art models is validated on 8 benchmark datasets. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

Detecting salient objects that humans attend to in images is very useful for various computer vision applications, such as object extraction and recognition [1–5], image and video compression [6], video summarization [7,8], content-based image editing [9–12], informative common object discovery [13,14] and image retrieval [15–17]. Recent advances in this field have been pushed forward by the use of deep neural networks, especially fully convolutional neural networks (FCNs) [18]. Thanks to FCNs, end-to-end prediction can be made directly from an input color image to a final saliency map. The earliest works (e.g., [19]) involving FCNs for saliency detection typically resort to a single forward network path, e.g., the VGG net [20] with its last several fully-connected layers modified to convolutional layers, to make a final detection. Unfortunately, due to repeated convolution and pooling in the VGG net, the straightforward output is usually a coarse map of very low resolution, highlighting merely rough saliency locations or regions with spatial and edge details missing.



Corresponding author. E-mail address: [email protected] (Q. Zhao).

https://doi.org/10.1016/j.neucom.2019.04.062 0925-2312/© 2019 Elsevier B.V. All rights reserved.

A way to tackle the above problem is to integrate hierarchical convolutional features to achieve fine-grained saliency detection. This is because deeper convolutional features tend to encode high-level knowledge and can better locate salient objects, while shallower convolutional features are more likely to capture rich spatial information [21,22]. In light of this, various improved deep architectures have been proposed by researchers, including the "skip-layer" architecture [18,23], the "top-down" architecture [24,25], and the "short-connection" architecture [22]. While these diverse architectures have achieved consistent progress on detection accuracy, the underlying distinctions and connections between these schemes are still unclear. In this paper, we take a step further towards a more general salient object detection model equipped with deep hierarchical features. First, we review and draw underlying connections between the above classical architectures [22–24], and show that they can actually be unified into a general framework, which simply has side structures of different depths. Based on the idea of designing deeper side structures for better detection accuracy, we propose a unified framework called Deepside. Specifically, the backbone of the Deepside network comprises not only the classical backbone network, e.g., VGG, but also "deeper" side structures that provide richer hierarchical features.


Here, "deeper" means that the side structures of our model are deeper than those of existing methods, leading to the name "Deepside". The reason for using deeper side structures will be explained in Section 3.2. We present two ways of integrating hierarchical features, namely a linear and a nonlinear way. Meanwhile, we show how the linear and nonlinear versions of Deepside connect to previous architectures. In summary, the main novelties of this paper are three-fold:

1. We propose a general framework for salient object detection called Deepside, to unify the various previous schemes mentioned above. The key of Deepside is to enhance the performance by using deeper side structures, which also enable effective and straightforward deep supervision. We show how the linear and nonlinear versions of Deepside connect to previous schemes.

2. To fuse multiple side outputs from the network, a novel fusion technique based on segmentation-based average pooling is proposed. It guarantees better detection accuracy of fine object boundaries and achieves more uniform emphasis of entire salient objects.

3. Comprehensive experimental comparisons are conducted, and the proposed Deepside surpasses state-of-the-art deep learning-based salient object detection models on various benchmark datasets and evaluation metrics.

The remainder of the paper is organized as follows. Section 2 describes related work on deep salient object detection. Section 3 describes the proposed Deepside in detail. Experimental results, performance evaluation and comparisons are included in Section 4. Finally, conclusions are drawn in Section 5.

2. Related work

Before the utilization of deep neural networks, salient object detection models were designed based on various handcrafted features. Inspired by the cognitive fact that human eyes are sensitive to the center-surround difference [26], contrast (local/global) [27–30] is a widely employed saliency feature. Other features include focusness [31,32], center priors [28,33,34], background priors [35–37], objectness [38] and depth [39]. For these non-deep-learning models of salient object detection, please refer to [40] for a survey. More recently, deep learning, especially convolutional neural networks (CNNs), has been introduced to saliency estimation and has made substantial improvements. In early work, salient object detection was handled by deep learning as a classification problem, and a deep neural network was utilized to judge whether an object part is salient or not. Zhao et al. [41] integrate the global and local context of an image into a multi-context network. Li and Yu [42] extract deep features via CNNs at three different patch scales and then feed these features to fully connected layers. A two-tier saliency classification strategy is proposed by Wang et al. [43]; it employs two DNNs on sliding windows and object proposals, respectively. Lee et al. [44] utilize both deep and handcrafted features for saliency detection, and these two types of features are integrated for salient region classification. A general drawback of the above methods is that they are not end-to-end, namely requiring independent CNN inference multiple times for every image patch/region, leading to redundancy. In contrast, a more elegant way is given by the fully convolutional neural network (FCNN) [18], which can achieve end-to-end detection. As one of the first works that employ an FCNN for salient object detection, Li et al.
[19] develop a multi-task deep neural network model, whose fully convolutional layers are shared across saliency detection and semantic image segmentation. However, in [19] a single forward network path (Fig. 1(a)) is designed, which leads to coarse saliency maps with low resolution (16 × 16). A natural way to recover the loss of resolution is to integrate multi-level/hierarchical features, namely combining low-level features that characterize fine edge details with high-level features that tell object location [18]. Aiming at this, the "skip-layer" architecture [18,23], the "top-down" architecture [24,25], and the "short-connection" architecture [22] have recently been proposed. The typical "skip-layer" architecture (Fig. 1(b)) concatenates deep coarse convolutional features with features from shallower layers, and further uses a convolution operation to obtain a detailed saliency map. Note that to unify the resolution of features from different depths, a common way is to up-sample low-resolution features. Modifications may be made to this architecture by inserting some intermediate convolutional layers after features from shallower layers [23,45]. However, in Li's work [23], convolution layers with stride are inserted after features from different depths. This method still leads to some loss of resolution (the output map is of 1/8 input resolution). Another scheme to recover fine spatial details is the "top-down" architecture [24,25] (Fig. 1(c)). This scheme was proposed by Pinheiro et al. [25], and uses a top-down refinement approach to augment feedforward nets for object segmentation. Unlike the aforementioned skip architecture, which attempts to output independent predictions at each path, this scheme outputs a coarse "mask encoding" in a feed-forward pass, and then refines this mask encoding in a top-down pass utilizing features at successively lower layers. It is worth mentioning that such a top-down architecture resembles the concurrent hourglass network [46], where readers may find a similar spirit. Recently, the "short-connection" architecture [22] (Fig. 1(d)) was proposed by Hou et al. This architecture enhances the previous Holistically-Nested Edge Detector (HED) [21], which provides a skip-layer structure with deep supervision, by introducing short connections to the skip-layer structures within HED. Such short connections enable a reasonable usage of multi-level convolutional features for object segmentation and can be seen as a combination of the "skip-layer" architecture (Fig. 1(b)) and deep supervision [47]. In addition to the above architectures, other deep models for salient object detection have also been proposed. Liu and Han [48] propose a deep hierarchical saliency network (DHSNet). It first employs the VGG net to make a coarse prediction and then recurrent convolutional layers (RCL) [49] to refine the prediction results. Wang et al. [50] introduce a recurrent fully convolutional architecture that incorporates saliency prior knowledge from the beginning and then progressively improves the inference through the recurrent network. Wang et al. [51] propose a stagewise refinement model that leverages residual blocks [52] and a multi-stage refinement mechanism. Chen et al. [53] propose a two-stream fixation-semantic architecture. It mimics the salient object annotation process, namely selecting the semantic objects that receive the highest fixation. He et al. [54] investigate the joint deep learning of the salient object subitizing task and the detection task in a multi-task fashion. Hu et al. [55] propose a deep level set network that incorporates a guided superpixel filtering layer and a level set-based segmentation layer. A non-local deep structure is proposed by Luo et al. [56], which learns non-local features to minimize a Mumford-Shah functional-inspired loss.
To reduce the effort of human annotation, image tags for weak supervision [57] and supervision by classical unsupervised saliency approaches [58] have also been exploited. Wang et al. [59] propose an Attentive Saliency Network that learns to detect salient objects from fixation maps, and the network is supervised with both ground-truth fixation locations and object masks. Different from existing works, in this paper we delve into the three typical architectures, the "skip-layer" architecture [18,23], the "top-down" architecture [24,25], and the "short-connection" architecture [22], and propose a general and unified framework named Deepside. Deepside inherits and unifies the advantages of the three architectures and is able to achieve state-of-the-art performance.


Fig. 1. Previous structures employed by researchers for salient object detection: (a) Typical bottom-up network [19], which predicts saliency maps using only upper-layer CNN features, resulting in coarse pixel prediction. (b) "Skip-layer" architecture [23], which fuses multi-level features by concatenation, immediately followed by prediction with a score layer. (c) "Top-down" architecture [24], which refines a coarse prediction in a top-down pass utilizing features at successively lower layers. (d) "Short-connection" architecture [22], which introduces a series of short connections from deeper side outputs to shallower ones.

Fig. 2. Equalization of “concatenation and convolution” operation, attributed to the linear property of convolution.

3. Methodology

3.1. Review of previous architectures

Fig. 1(b)–(d) show three typical architectures that aim at leveraging hierarchical convolutional features, namely the "skip-layer" architecture (b) [18,23], the "top-down" architecture (c) [24,25], and the "short-connection" architecture (d) [22]. These architectures generally comprise three types of layers: concatenation layers, scoring layers, and hidden layers. In Fig. 1, we denote these three types of layers by special symbols. Descriptions are as follows. Concatenation layers are widely used for feature fusion [22,23], where features are concatenated along their channel dimension. Scoring layers refer to one or several pure convolution layers whose final output leads to prediction scores. In saliency detection, the final output channel number of scoring layers is one (regression case) or two (binary classification case). Scoring layers compress CNN features into a final activation, which is then fed to an activation function (sigmoid, softmax, etc.) to obtain the final detection maps. Since a convolution layer itself is a linear layer, scoring layers can be deemed linear. Fig. 2 shows an important aspect of this linear property, namely that concatenation and its subsequent convolution can be refactored into an equivalent element-wise summation of two independent convolutional branches. Mathematically, this can be expressed by:

$W \ast \mathrm{Cat}(A, B) + b = (W_1 \ast A + b_1) \oplus (W_2 \ast B + b_2)$, where $W = \mathrm{Cat}(W_1, W_2)$ and $b = b_1 + b_2$    (1)

where A and B are the inputs, Cat(·,·) indicates the concatenation operation along the channel dimension, W denotes the filter kernel, and b is the bias; ⊕ denotes element-wise summation, and ∗ denotes the convolution operation. It can be seen from (1) that, given a specific combination of W and b, one can easily find a combination of W1, W2, b1, b2 that achieves exactly the same behavior. Note the independence of the separated filters W1 and W2. Hidden layers generally refer to the various layers used in a CNN prior to the output, such as convolution, deconvolution, activation (e.g., ReLU, Sigmoid), batch normalization, and interpolation. Since nonlinear activation layers are usually used after convolution layers, a series of connected hidden layers can be deemed a nonlinear unit. Note that in Fig. 1, a hidden layer symbol is used to denote one or a series of hidden layers in practice. Keeping the above three types of layers in mind helps clarify the following architecture of the proposed Deepside, together with its connections to the typical architectures in Fig. 1.
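The following short numerical check (a sketch, not part of the paper's implementation; the tensor shapes and the 50/50 bias split are arbitrary illustrative choices) verifies the equalization in (1): convolving the concatenation of A and B with a kernel W equals the element-wise sum of two independent convolutions whose kernels are the corresponding channel slices of W.

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: two 16-channel feature maps and a kernel acting on
# their 32-channel concatenation.
A = torch.randn(1, 16, 32, 32)
B = torch.randn(1, 16, 32, 32)
W = torch.randn(8, 32, 3, 3)
b = torch.randn(8)

# Left-hand side of Eq. (1): concatenate along channels, then convolve.
lhs = F.conv2d(torch.cat([A, B], dim=1), W, b, padding=1)

# Right-hand side: split W along its input-channel axis into W1, W2, split the
# bias so that b1 + b2 = b, then sum the two branch responses element-wise.
W1, W2 = W[:, :16], W[:, 16:]
b1, b2 = 0.5 * b, 0.5 * b
rhs = F.conv2d(A, W1, b1, padding=1) + F.conv2d(B, W2, b2, padding=1)

print(torch.allclose(lhs, rhs, atol=1e-5))  # True, up to floating-point error
```

This is exactly why a concatenation followed by a scoring convolution can be redrawn as independent side branches whose scored outputs are summed.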

3.2. Network architecture of Deepside

The architecture of our Deepside is given in Fig. 3(a) and (b). The motivation is that, with the equalization given in Fig. 2, the typical architectures in Fig. 1 can to some extent be transformed into architectures having independent side paths of different depths. It is well known that, due to the deep convolutional structure and successive pooling operations employed in a classical backbone network, e.g., VGG-16, the backbone network only outputs very coarse but high-level prediction results [18,19]. By using the equalization in Fig. 2 to transform the networks in Fig. 1, one can see that the outputs of different side paths are accumulated (i.e., summed) at the end to obtain the final activation. The key fact is that a side structure branching out from a shallow layer of the backbone generates an activation that is directly used to modify the high-level activation. This means the side structure activation should contain information about the finer boundary details of salient objects, indicating where salient object boundaries are located. In this sense, such information is a kind of "high-level" information and therefore may require a relatively deeper structure to capture it. In contrast, the previous "skip-layer" architecture (Fig. 1(b)) is less satisfactory for such a task due to its shallow side paths. In Fig. 3, we use multiple hidden layer symbols to clarify our idea of designing "deeper" side paths. Details will be given in Section 3.5. In Deepside, the side structures are included as parts of the backbone, namely the backbone of Deepside includes not only the classical backbone network, e.g., VGG/ResNet, but also deep side structures that provide richer hierarchical features.


Fig. 3. Deepside architecture. (a) Deepside-linear: due to the linear way of feature fusion, deep supervision and feature combination are more straightforward. (b) Deepside-nonlinear: deep side features are leveraged in a nonlinear top-down path as in [25] and Fig. 1(c).

Deepside leverages side features in two ways, namely linear and nonlinear. Fig. 3(a) shows the linear case, where each side feature is first compressed by a scoring layer. Next, the resulting activations from different side paths are sent to the corresponding weighting layers before element-wise summation. We introduce weighting layers to facilitate deep supervision of the network [47]. Weighting layers increase adjustability and avoid supervision conflict1. We implement weighting layers by Scale layers (performing scaling and shift) in Caffe [60]. Experimentally, we find that adding weighting layers notably improves training and prediction accuracy. Fig. 3(b) shows the nonlinear case of Deepside, where the backbone is exactly the same as in Fig. 3(a), but the side features are decoded in a "top-down" manner similar to Fig. 1(c) [24,25]. In this top-down manner, features are sequentially concatenated and fused by convolution and ReLU activation. Since the ReLU activation is nonlinear, the network cannot be refactored into an architecture similar to Fig. 3(a). Connections between Deepside and previous architectures: In summary, all the architectures in Fig. 1 can be deemed simplified versions of the Deepside shown in Fig. 3. In the "skip-layer" architecture of Fig. 1(b), since the scoring layer can be refactored according to Fig. 2, Fig. 1(b) is a degenerate version of Deepside-linear with only one side supervision. Also, it has shallower side paths. For the "top-down" architecture of Fig. 1(c), if the nonlinear hidden layers after concatenation are replaced by linear layers such as convolution layers, Fig. 1(c) can also be transformed into Deepside-linear with only one side supervision, where side paths are independent and have different convolutional depths. However, if the nonlinear hidden layers are deployed, the network cannot be refactored. This makes Fig. 1(c) a degenerate version of Deepside-nonlinear. The difference is that Deepside-nonlinear has deeper side structures compared to Fig. 1(c), which may bring improved detection performance.

1 Mathematically, let h(·) denote the activation function that transforms an input activation into the final prediction, and Gt denote the desired ground truth. Without weighting layers in Fig. 3(a), deep supervision amounts to requiring h(a) = Gt, h(a + b) = Gt, h(a + b + c) = Gt, where a denotes the base activation and b, c denote the activations from side paths. This may lead to the learned results b = 0, c = 0 because the supervision signals are contradictory. By introducing independent scalings parameterized by k, p, l, m, n, q, deep supervision is modified into requiring h(ka) = Gt, h(pa + lb) = Gt, h(ma + nb + qc) = Gt. This allows b and c to have their own contributions. Note that such weighting layers are only necessary for Fig. 3(a). If there is only one supervision signal, as in Fig. 1(b), weighting layers are not needed.
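The toy sketch below spells out the three deeply supervised predictions of the footnote with the weighting parameters in place; the names a, b, c, k, p, l, m, n, q mirror the footnote's notation and are purely illustrative, and the sigmoid plus Euclidean-style loss is only one possible choice of h and supervision.

```python
import torch
import torch.nn.functional as F

def deeply_supervised_loss(a, b, c, gt, k, p, l, m, n, q):
    """Toy version of the footnote: three predictions, all supervised by Gt.

    a is the base activation, b and c are side-path activations, and the
    scalars k, p, l, m, n, q play the role of the weighting (Scale) layers.
    """
    h = torch.sigmoid  # maps an activation map to a prediction
    preds = [h(k * a), h(p * a + l * b), h(m * a + n * b + q * c)]
    return sum(F.mse_loss(pred, gt) for pred in preds)
```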

Regarding the "short-connection" architecture proposed in [22] (Fig. 1(d)), it can actually be refactored into Deepside-linear. As additional work compared to [22], we apply a transformation of the concatenation operation to show the underlying essence of "short-connection", and explicitly associate the activations from different scoring layers with different supervision signals (corresponding to different colors in Fig. 3(a)). One can clearly see that in Fig. 3(a) such associations are indeed independent.

3.3. Fusion of multiple side outputs

Although rich hierarchical features are introduced by deep side structures, there is no explicit constraint on object region homogeneity. This may lead to unsatisfactory results in which some visually homogeneous regions are rendered different saliency levels in the output prediction. To enforce region homogeneity and, meanwhile, to fuse multiple side outputs, we propose an edge-aware segmentation-based average pooling module:



$\hat{F}_k = \dfrac{\sum_{I_i \in R_k} F_i}{|R_k|}$    (2)

where $F_i$ is the input feature vector at pixel i, $I_i$ is a pixel in segmented region $R_k$, $|R_k|$ denotes the area of $R_k$, and $\hat{F}_k$ indicates the output feature after segmentation-based pooling, which is the same for all pixels in the same region. Fig. 4 shows the block diagram of this fusion module. The feature map after pooling, denoted by $\hat{F}$, is concatenated with the original feature map F. A convolution layer with a 3 × 3 kernel is used to fuse these features, and the result is fed to a sigmoid activation for the final output. During training, a fusion loss function is computed at the end of this module. Besides, as shown in Fig. 4, this module can adapt to any input channel dimension. The input can be a single-channel map or multi-channel maps obtained by concatenation. This module serves as a built-in component in the CNN architecture and helps enhance the saliency homogeneity. A minimal sketch of this pooling-and-fusion step is given below.
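The sketch below implements Eq. (2) and the fusion step of Fig. 4. It is written in PyTorch-style pseudocode rather than the Caffe implementation used in the paper, the region labels are assumed to come from an external segmentation (mean-shift in the paper), and all module and argument names are illustrative.

```python
import torch
import torch.nn as nn

def segmentation_average_pooling(features, labels):
    """Eq. (2): replace each pixel's feature with the mean over its region.

    features: (C, H, W) activation map; labels: (H, W) integer region indices.
    """
    C, H, W = features.shape
    flat_feat = features.reshape(C, -1)                     # (C, H*W)
    flat_lab = labels.reshape(-1).long()                    # (H*W,)
    n_regions = int(flat_lab.max()) + 1

    # Sum features per region, then divide by the region area |R_k|.
    sums = torch.zeros(C, n_regions).index_add_(1, flat_lab, flat_feat)
    areas = torch.zeros(n_regions).index_add_(
        0, flat_lab, torch.ones_like(flat_lab, dtype=torch.float))
    region_means = sums / areas.clamp(min=1.0)              # (C, n_regions)
    return region_means[:, flat_lab].reshape(C, H, W)       # broadcast back

class SegmentationFusion(nn.Module):
    """Fig. 4: concatenate pooled and original maps, fuse by 3x3 conv, sigmoid."""
    def __init__(self, in_channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feature_map, labels):
        pooled = torch.stack([segmentation_average_pooling(f, labels)
                              for f in feature_map])        # per-sample pooling
        return torch.sigmoid(self.fuse(torch.cat([feature_map, pooled], dim=1)))
```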


Fig. 4. The proposed CNN module that can fuse multiple side outputs (represented as feature maps) by segmentation-based average pooling.

3.4. Integration of Deepside-linear and Deepside-nonlinear

Deepside-linear (Fig. 3(a)) and Deepside-nonlinear (Fig. 3(b)) share the same Deepside backbone architecture, so these two schemes can be further fused by sharing just one Deepside backbone. In this case, the Deepside-linear and Deepside-nonlinear branches have their own outputs, and these outputs can be effectively fused by the fusion module in Section 3.3. As a result, this fused scheme resembles the widely used multi-task training framework of a CNN [19,45], where the backbone network is shared across different tasks. However, in this paper it differs in that the tasks for the linear and nonlinear branches are the same, namely salient object detection. In the experimental Section 4.2, we compare the performance of Deepside-linear and Deepside-nonlinear together with their integration, and examine the extent to which joint training impacts the overall performance.

3.5. Implementation details of Deepside

Deepside is implemented based on the publicly available Caffe library [60]. We choose the VGG-16 net [20] as our classical backbone for better comparison with other works, since most existing works (e.g., [22,23,48,55,56,61]) use VGG-16 as the backbone. The VGG-16 net has 13 convolution layers as its first part, respectively Conv1_1 ∼ 1_2, Conv2_1 ∼ 2_2, Conv3_1 ∼ 3_3, Conv4_1 ∼ 4_3, Conv5_1 ∼ 5_3. Similar to previous work [19], we modify the last three fully connected layers (Fc1 ∼ 3) of VGG-16 into three convolution layers (Conv6 ∼ 8) to predict a two-dimensional activation map as output. Conv6 ∼ 8 respectively have a 7 × 7 kernel with filter number 512, a 7 × 7 kernel with filter number 512, and a 1 × 1 kernel with filter number 1. The network input size is fixed as 320 × 320. Regarding side structures, five side paths branch out from the VGG-16 net, respectively from Conv1_2, Conv2_2, Conv3_3, Conv4_3, and Conv5_3. Each side path has three convolution layers plus corresponding ReLU units. Details of the Deepside backbone are given in Fig. 5; note the differences in feature resolution between different side paths. For Deepside-linear, as shown in Fig. 3(a), we use a convolution layer with a 1 × 1 kernel to compress each deep side feature into a one-channel activation. Since the feature resolution of different sides varies (Fig. 5), to unify the feature resolution for subsequent summation, we use bilinear interpolation (deconvolution as upsampling can also be used; we find only minor differences) to scale all the activation maps to the size of the input, namely 320 × 320. Then we use Scale layers in Caffe to weigh the resized activation maps. Finally, as shown in Fig. 3(a), the weighted activation maps are combined through element-wise summation. More specifically, the activation from a shallow side path is combined with the activations from all the side paths that are deeper than it, together with the activation from VGG Conv8 (the 10 × 10 × 512 output in Fig. 5).
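To make the linear fusion concrete, the sketch below mirrors the description above: 1 × 1 scoring convolutions per side, bilinear upsampling to the input size, learnable scale/shift weighting (the role played by the Caffe Scale layers), and cumulative deep-to-shallow summation that yields one deeply supervised output per side. The channel numbers, class name, and argument names are illustrative assumptions, not taken from the released Caffe model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearFusionHead(nn.Module):
    """Sketch of the Deepside-linear head (Fig. 3(a), Section 3.5)."""
    def __init__(self, side_channels=(64, 128, 256, 512, 512), backbone_channels=512):
        super().__init__()
        self.score = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1) for c in side_channels])
        self.backbone_score = nn.Conv2d(backbone_channels, 1, kernel_size=1)
        # Per-side learnable scale and shift, playing the role of Caffe Scale layers.
        self.scale = nn.ParameterList([nn.Parameter(torch.ones(1)) for _ in side_channels])
        self.shift = nn.ParameterList([nn.Parameter(torch.zeros(1)) for _ in side_channels])

    def forward(self, side_feats, backbone_feat, out_size=(320, 320)):
        up = lambda x: F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        acc = up(self.backbone_score(backbone_feat))   # coarse high-level activation
        outputs = [acc]                                 # each entry is deeply supervised
        # Accumulate from the deepest side path to the shallowest one: a shallow
        # side activation is added on top of all deeper ones plus the backbone.
        for i in reversed(range(len(side_feats))):
            s = up(self.score[i](side_feats[i]))
            acc = acc + self.scale[i] * s + self.shift[i]
            outputs.append(acc)
        return outputs
```

In training, each element of `outputs` would be passed through the sigmoid activation and compared against the ground-truth mask, reproducing the multiple supervision signals of Fig. 3(a).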


For Deepside-nonlinear, we design a similar scheme according to the top-down architecture proposed in [25]. Convolution layers with 3 × 3 kernels (plus ReLU, which is nonlinear) are first used to compress the different feature channel numbers (i.e., 512, 256, 128) to an identical channel number of 64. The compressed feature of lower resolution is then up-sampled by a factor of 2 via bilinear interpolation and concatenated with the side feature of the same resolution. The concatenated feature, which has 128 channels, is further convolved by 64 filter kernels of size 3 × 3 (plus ReLU again), leading to a 64-channel output. As such, up-sampling, concatenation, and convolution continue iteratively in a top-down manner, and the loss of resolution is finally recovered. At last, a 1 × 1 scoring layer is used to make a one-channel prediction as in Fig. 3(b). For more details and the motivation of this scheme, readers are referred to [25].
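One top-down decoding step can be sketched as follows, using the channel numbers stated above (compress to 64, concatenate to 128, fuse back to 64); the class name and exact layer arrangement are illustrative assumptions following [25], not the authors' Caffe definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownStep(nn.Module):
    """One refinement step of the Deepside-nonlinear top-down path (Fig. 3(b))."""
    def __init__(self, side_channels):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Conv2d(side_channels, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True))

    def forward(self, top, side_feat):
        side = self.compress(side_feat)                  # 64-channel side feature
        top = F.interpolate(top, scale_factor=2,         # x2 bilinear upsampling
                            mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([top, side], dim=1))  # 128 -> 64 channels
```

Applying such a step repeatedly, from the deepest side feature to the shallowest one, recovers the lost resolution; a final 1 × 1 scoring convolution then produces the one-channel prediction.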

For the segmentation-based pooling module, the input features are the activation signals fed to the final activation/loss layers. Regarding the segmentation algorithm, we use classical mean-shift segmentation [62], since many publicly available implementations exist and it is very simple to use. Note that other, more sophisticated segmentation algorithms could be applied as well. Since the segmentation index maps provided by the segmentation algorithm are needed by the pooling module (Fig. 4) for feature averaging, and since we find that conducting fusion at a high resolution can substantially improve the quality of saliency maps, we perform image segmentation at a fixed resolution of 640 × 640 on the resized original image instead of 320 × 320, and meanwhile upsample the input features to be fused to this resolution. There are multiple loss layers when training Deepside. Note that the Deepside network generally outputs 320 × 320 results, except for the fusion module, whose output is 640 × 640 as mentioned above. The final loss is a weighted sum of all the individual losses, and all loss weights are the same (with value 1.0), except that the weight of the fusion loss is further multiplied by a factor of 1/4 (i.e., it has value 0.25) due to the doubled resolution. Regarding the loss function, we have tested the Euclidean loss [19] and the class-balanced cross-entropy loss [21,22]; however, we find only a minor difference. So, in all our experiments, we use the sigmoid function as the final activation, and the Euclidean loss for training:

$L(X, Y, \theta) = \sum_{i=1}^{N} \lVert f(X_i, \theta) - Y_i \rVert_F^2$    (3)

where $f(X_i, \theta)$ denotes the network output taking image $X_i$ as input, $\theta$ denotes the network parameters, N is the total number of training images, $X = \{X_1, X_2, \ldots, X_N\}$ denotes the training set, and Y denotes the binary ground-truth set $Y = \{Y_1, Y_2, \ldots, Y_N\}$ (adjusted to the same resolution as $f(X_i, \theta)$).

4. Experimental results

4.1. Setup

4.1.1. Datasets

Seven saliency benchmark datasets were mainly used for validating Deepside, including: ASD [63] (1000 images), MSRA-B [64] (5000 images), ECSSD [33] (1000 images), DUT-OMRON [36] (5168 images), DUTS [57] (testing set, 5019 images), PASCAL-S [65] (850 images), and HKU-IS [42] (testing set, 1447 images). In addition, we also make an early attempt on the new SOC dataset [66] (2400 images).2 Training images for our Deepside were chosen from MSRA-B.

2 SOC is a recently released saliency dataset [66]. Since the ground truth of its testing set had not been released at the time of writing, we use its training set plus validation set for testing, resulting in a salient-objects sub-dataset with 2400 images.


Fig. 5. The backbone of Deepside. It includes both the classical VGG-16 backbone and the deep side structures (Side1 ∼ 5). The side structures output side features of different resolutions.

Since the ASD dataset is a subset of MSRA-B, in order to evaluate the performance on ASD, we first exclude the images that belong to ASD from MSRA-B, leaving 4000 images. Then we randomly select 3000 images for training and leave the other 1000 images as the MSRA-B test set. Besides, to enhance the generalization ability of the network, we additionally use the DUTS training set [57], which consists of 10,553 images, for fine-tuning. For fair comparisons, we apply the trained Deepside to all the other datasets.

4.1.2. Network training

Our Deepside is implemented on a desktop equipped with an Intel i7-8700K CPU (3.7 GHz) and 16 GB RAM. The fully convolutional neural network is implemented on the basis of Caffe [60]. During training, the Deepside backbone is initialized by the parameters pre-trained in the VGG-16 net [20]. Then we fine-tune the parameters on the MSRA-B and DUTS training sets, where all the training images are resized to 320 × 320 to match the network input. We perform data augmentation by mirror reflection, generating twice the amount of training data. The momentum parameter is chosen as 0.99, the learning rate is set to $10^{-9}$, and the weight decay is 0.0005. The SGD learning procedure is accelerated using an NVIDIA 1080Ti GPU, with batch size 1, and takes a fixed 180,000 iterations until the losses converge nicely. Training the standard Deepside-linear and Deepside-nonlinear networks for the fixed 180,000 iterations takes about 10 h 40 min and 8 h 40 min, respectively. The slightly longer training time of Deepside-linear is caused by the multiple backward propagation paths due to deep supervision.
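The paper trains with Caffe's SGD solver; the following PyTorch-style sketch only mirrors the reported hyperparameters and loss weighting, and assumes that `deepside`, the data loader, and the resized 640 × 640 ground truth for the fusion branch are defined elsewhere.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(deepside.parameters(),
                            lr=1e-9, momentum=0.99, weight_decay=5e-4)

for it, (image, gt_320, gt_640) in enumerate(train_loader):   # batch size 1
    side_outputs, fusion_output = deepside(image)             # 320x320 and 640x640 maps
    # All loss weights are 1.0 except the fusion loss, which is weighted by 0.25.
    loss = sum(F.mse_loss(torch.sigmoid(o), gt_320, reduction='sum')
               for o in side_outputs)
    loss = loss + 0.25 * F.mse_loss(torch.sigmoid(fusion_output), gt_640,
                                    reduction='sum')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if it >= 180_000:                                          # fixed iteration budget
        break
```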

To establish fair benchmarking of the different Deepside configurations, we do the following: (1) initialize Deepside-linear and Deepside-nonlinear with the same backbone network parameters; (2) use equal loss weights (with value 1.0, as mentioned in Section 3.5) for the loss layers in Deepside-linear and Deepside-nonlinear; (3) first train Deepside-linear and Deepside-nonlinear for a fixed 180,000 iterations; then, if the fusion module is incorporated, freeze the previous network parameters and train only the fusion module (which has very few parameters) for 60,000 iterations on the small MSRA-B training set.

4.1.3. Metrics for performance evaluation

Given a saliency map Smap and the ground truth map Gt, 3 universally agreed, standard metrics [40] and also 3 recently proposed metrics [67–69] are used for the evaluation of the proposed method. They are briefly introduced as follows (a small sketch computing the first three appears after this list):

1. Precision-Recall (PR) [27,63] is defined as:

$\mathrm{Precision}(T) = \dfrac{|M(T) \cap Gt|}{|M(T)|}, \qquad \mathrm{Recall}(T) = \dfrac{|M(T) \cap Gt|}{|Gt|}$    (4)

where M(T) is the binary mask obtained by directly thresholding the saliency map Smap with the threshold T, and | · | is the total area of the mask(s) inside the map. By varying T, a precision-recall curve can be obtained. 2. F-measure (Fβ ) [27,63] is defined as:

$F_\beta = \dfrac{(1+\beta^2)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$    (5)

where β is the weight between the precision and the recall. β² = 0.3 is usually set, since precision is often weighted more than recall [63]. In order to get a single-valued score, a threshold is often applied to binarize a saliency map into a foreground mask map. In this paper, we report the maximum F-measure (see also [19,22]) computed from the precision-recall curve by running all threshold values. 3. Mean Absolute Error (MAE) [29,70] is defined as:

$\mathrm{MAE} = \dfrac{1}{W \cdot H}\sum_{x=1}^{W}\sum_{y=1}^{H} |S_{map}(x, y) - Gt(x, y)|$    (6)


Table 1. F-measure (Fβ) of Deepside with different settings on 7 benchmark datasets. The best performance is highlighted in bold, and the last column shows the average performance.

Settings \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS | Average
Deepside-linear | 0.9297 | 0.9221 | 0.9229 | 0.7711 | 0.8389 | 0.9104 | 0.8359 | 0.8759
Deepside-nonlinear | 0.9394 | 0.9237 | 0.9318 | 0.7728 | 0.8449 | 0.9151 | 0.8450 | 0.8818
Deepside-joint-fusion | 0.9400 | 0.9310 | 0.9301 | 0.7777 | 0.8483 | 0.9204 | 0.8473 | 0.8850
Deepside-joint-linear | 0.9302 | 0.9217 | 0.9215 | 0.7691 | 0.8391 | 0.9092 | 0.8395 | 0.8758
Deepside-joint-nonlinear | 0.9374 | 0.9285 | 0.9275 | 0.7785 | 0.8455 | 0.9173 | 0.8467 | 0.8831

Table 2. MAE of Deepside with different settings on 7 benchmark datasets. The best performance is highlighted in bold, and the last column shows the average performance.

Settings \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS | Average
Deepside-linear | 0.0309 | 0.0479 | 0.0390 | 0.0620 | 0.0828 | 0.0411 | 0.0521 | 0.0508
Deepside-nonlinear | 0.0269 | 0.0470 | 0.0346 | 0.0608 | 0.0789 | 0.0383 | 0.0485 | 0.0479
Deepside-joint-fusion | 0.0257 | 0.0403 | 0.0329 | 0.0543 | 0.0746 | 0.0341 | 0.0439 | 0.0437
Deepside-joint-linear | 0.0308 | 0.0481 | 0.0388 | 0.0603 | 0.0817 | 0.0409 | 0.0498 | 0.0501
Deepside-joint-nonlinear | 0.0277 | 0.0444 | 0.0360 | 0.0577 | 0.0783 | 0.0376 | 0.0472 | 0.0470

Table 3. Maximum Fβ of Deepside and state-of-the-art methods on 7 benchmark datasets (higher is better). The 1st, 2nd, and 3rd best are in bold, bold italic, and italic, respectively. The notation "T" means the dataset was used for training the corresponding model.

Methods \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS
NCS [75] | .9075 | .6480 | .7694 | .5735 | .5686 | .6564 | .5149
DRFI [74] | .8955 | .6899 | .8165 | .6237 | .6382 | .7177 | .5857
DCL [23] | T | .8820 | T | .6993 | .8220 | .8849 | .7820
DSS [22] | T | .9062 | T | .7369 | .8111 | .9011 | .7773
DHS [48] | T | .8937 | T | T | .7984 | .8772 | .7813
Amulet [61] | T | .9050 | T | .7154 | .8165 | .8888 | .7504
DLS [55] | T | .8257 | T | .6448 | .7200 | .8074 | –
NLDF [56] | T | .8887 | T | .6993 | .8027 | .8876 | .8120
SRM [51] | .9023 | .9048 | .8925 | .7253 | .8250 | .8915 | .7976
BRN [73] | .8983 | .9137 | .8917 | .7389 | .8382 | .9003 | .8050
Deepside-nonlinear | .9394 | .9237 | .9318 | .7728 | .8449 | .9151 | .8450
Deepside-nonlinear-fusion | .9441 | .9271 | .9345 | .7732 | .8478 | .9191 | .8461
Deepside-joint-fusion | .9400 | .9310 | .9301 | .7777 | .8483 | .9204 | .8473

Fig. 6. Visual comparisons of Deepside-linear, Deepside-nonlinear, Deepside-joint-linear, Deepside-joint-nonlinear and Deepside-joint-fusion. GT indicates the ground truth masks. Note that Deepside-joint-fusion is achieved by fusing the results of Deepside-joint-linear and Deepside-joint-nonlinear.


Fig. 7. Qualitative comparisons of the effectiveness of deep supervision between Deepside-linear and DSS [22]. The numbers in the top-right corner of the side saliency maps (from Side5 to Side1) show the forward MAE (mean absolute error), which is computed between a given side saliency map (e.g., Side3) and the preceding side saliency map (e.g., Side4).

Fig. 8. Comparing precision-recall curves of Deepside (Deepside-joint-fusion, Deepside-nonlinear, and Deepside-nonlinear-fusion) to state-of-the-art methods.


Table 4. MAE scores of Deepside and state-of-the-art methods on 7 benchmark datasets (lower is better). The 1st, 2nd, and 3rd best are in bold, bold italic, and italic, respectively. The notation "T" means the dataset was used for training the corresponding model.

Methods \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS
NCS [75] | 0.0640 | 0.1861 | 0.1314 | 0.1705 | 0.2304 | 0.1659 | 0.1853
DRFI [74] | 0.0833 | 0.1639 | 0.1207 | 0.1554 | 0.2034 | 0.1394 | 0.1453
DCL [23] | T | 0.0679 | T | 0.0797 | 0.1080 | 0.0481 | 0.0880
DSS [22] | T | 0.0517 | T | 0.0628 | 0.0977 | 0.0401 | 0.0618
DHS [48] | T | 0.0588 | T | T | 0.0959 | 0.0519 | 0.0651
Amulet [61] | T | 0.0589 | T | 0.0976 | 0.0992 | 0.0501 | 0.0841
DLS [55] | T | 0.0859 | T | 0.0894 | 0.1328 | 0.0696 | –
NLDF [56] | T | 0.0626 | T | 0.0796 | 0.1007 | 0.0480 | 0.0660
SRM [51] | 0.0442 | 0.0543 | 0.0509 | 0.0694 | 0.0867 | 0.0457 | 0.0583
BRN [73] | 0.0407 | 0.0408 | 0.0454 | 0.0618 | 0.0726 | 0.0360 | 0.0495
Deepside-nonlinear | 0.0269 | 0.0470 | 0.0346 | 0.0608 | 0.0789 | 0.0383 | 0.0485
Deepside-nonlinear-fusion | 0.0235 | 0.0416 | 0.0305 | 0.0560 | 0.0740 | 0.0334 | 0.0441
Deepside-joint-fusion | 0.0257 | 0.0403 | 0.0329 | 0.0543 | 0.0746 | 0.0341 | 0.0439

Table 5. Fβw scores of Deepside and state-of-the-art methods on 7 benchmark datasets (higher is better). The 1st, 2nd, and 3rd best are in bold, bold italic, and italic, respectively. The notation "T" means the dataset was used for training the corresponding model.

Methods \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS
NCS [75] | 0.7879 | 0.4300 | 0.6111 | 0.3978 | 0.4299 | 0.4887 | 0.2952
DRFI [74] | 0.7014 | 0.4629 | 0.6256 | 0.3720 | 0.4695 | 0.5284 | 0.3261
DCL [23] | T | 0.8387 | T | 0.6392 | 0.7332 | 0.8411 | 0.6927
DSS [22] | T | 0.8928 | T | 0.7118 | 0.7815 | 0.8891 | 0.7377
DHS [48] | T | 0.8422 | T | T | 0.7397 | 0.8411 | 0.6965
Amulet [61] | T | 0.8321 | T | 0.5928 | 0.7174 | 0.8022 | 0.6306
DLS [55] | T | 0.7933 | T | 0.5966 | 0.6734 | 0.7750 | –
NLDF [56] | T | 0.8547 | T | 0.6407 | 0.7484 | 0.8502 | 0.7103
SRM [51] | 0.8515 | 0.8628 | 0.8456 | 0.6622 | 0.7617 | 0.8455 | 0.7252
BRN [73] | 0.8651 | 0.8877 | 0.8692 | 0.7002 | 0.7918 | 0.8739 | 0.7654
Deepside-nonlinear | 0.9004 | 0.8758 | 0.8851 | 0.6878 | 0.7793 | 0.8589 | 0.7536
Deepside-nonlinear-fusion | 0.9148 | 0.8972 | 0.9018 | 0.7198 | 0.8008 | 0.8843 | 0.7904
Deepside-joint-fusion | 0.9053 | 0.8967 | 0.8926 | 0.7234 | 0.7996 | 0.8788 | 0.7855

Table 6. Sα scores of Deepside and state-of-the-art methods on 7 benchmark datasets (higher is better). The 1st, 2nd, and 3rd best are in bold, bold italic, and italic, respectively. The notation "T" means the dataset was used for training the corresponding model.

Methods \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS
NCS [75] | 0.8868 | 0.6958 | 0.7755 | 0.6687 | 0.6173 | 0.6836 | 0.6305
DRFI [74] | 0.8743 | 0.7202 | 0.8006 | 0.6978 | 0.6487 | 0.7277 | 0.6725
DCL [23] | T | 0.8684 | T | 0.7710 | 0.7855 | 0.8770 | 0.7891
DSS [22] | T | 0.8821 | T | 0.7899 | 0.7926 | 0.8783 | 0.8106
DHS [48] | T | 0.8842 | T | T | 0.8045 | 0.8698 | 0.8201
Amulet [61] | T | 0.8941 | T | 0.7805 | 0.8193 | 0.8860 | 0.8039
DLS [55] | T | 0.8066 | T | 0.7250 | 0.7198 | 0.7986 | –
NLDF [56] | T | 0.8747 | T | 0.7704 | 0.8012 | 0.8782 | 0.8163
SRM [51] | 0.9082 | 0.8952 | 0.8969 | 0.7977 | 0.8306 | 0.8871 | 0.8356
BRN [73] | 0.9036 | 0.9026 | 0.8936 | 0.8058 | 0.8372 | 0.8947 | 0.8417
Deepside-nonlinear | 0.9385 | 0.9118 | 0.9286 | 0.8310 | 0.8438 | 0.9057 | 0.8668
Deepside-nonlinear-fusion | 0.9403 | 0.9131 | 0.9295 | 0.8313 | 0.8432 | 0.9077 | 0.8680
Deepside-joint-fusion | 0.9368 | 0.9172 | 0.9255 | 0.8344 | 0.8445 | 0.9085 | 0.8693

where Smap(x, y) and Gt(x, y) correspond to the saliency value and the ground truth value at pixel location (x, y), and W and H are the width and height of Smap. 4. Weighted F-measure (Fβw) was recently proposed by Margolin et al. [67]:

$F_\beta^w = \dfrac{(1+\beta^2)\,\mathrm{Precision}^w \times \mathrm{Recall}^w}{\beta^2 \times \mathrm{Precision}^w + \mathrm{Recall}^w}$    (7)

where Precision^w and Recall^w are the weighted precision and recall. The difference between (7) and (5) is that Precision^w and Recall^w in (7) can directly compare a non-binary map against a binary ground truth without thresholding, thus avoiding the interpolation flaw.

Likewise, β² = 0.3 is set to weigh precision more than recall. For more details about this metric, readers are referred to [67]. 5. S-measure (Sα) was proposed in [68] to measure the spatial structure similarity of saliency maps:

$S_\alpha = \alpha \cdot S_o + (1-\alpha) \cdot S_r$    (8)

where α is a balance parameter between the object-aware structural similarity So and the region-aware structural similarity Sr. We set α = 0.5 as suggested in [68,71,72]. 6. E-measure (Em) was proposed in [69] as an enhanced alignment measure for comparing two binary maps. This metric first aligns the two binary maps according to their global means and then computes a local pixel-wise correlation.


Table 7. Maximum Em of Deepside and state-of-the-art methods on 7 benchmark datasets (higher is better). The 1st, 2nd, and 3rd best are in bold, bold italic, and italic, respectively. The notation "T" means the dataset was used for training the corresponding model.

Methods \ Datasets | ASD | ECSSD | MSRA-B(test) | DUT-OMRON | PASCAL-S | HKU-IS | DUTS
NCS [75] | 0.9385 | 0.6973 | 0.8326 | 0.7213 | 0.6708 | 0.7369 | 0.6861
DRFI [74] | 0.9408 | 0.7631 | 0.8892 | 0.7938 | 0.7450 | 0.8329 | 0.7620
DCL [23] | T | 0.9163 | T | 0.8261 | 0.8490 | 0.9318 | 0.8448
DSS [22] | T | 0.9306 | T | 0.8450 | 0.8560 | 0.9414 | 0.8717
DHS [48] | T | 0.9279 | T | T | 0.8592 | 0.9311 | 0.8802
Amulet [61] | T | 0.9315 | T | 0.8339 | 0.8653 | 0.9344 | 0.8507
DLS [55] | T | 0.8726 | T | 0.7978 | 0.7941 | 0.8769 | –
NLDF [56] | T | 0.9221 | T | 0.8200 | 0.8591 | 0.9344 | 0.8716
SRM [51] | 0.9439 | 0.9371 | 0.9342 | 0.8438 | 0.8787 | 0.9442 | 0.8910
BRN [73] | 0.9407 | 0.9462 | 0.9330 | 0.8533 | 0.8922 | 0.9489 | 0.8982
Deepside-nonlinear | 0.9677 | 0.9467 | 0.9607 | 0.8753 | 0.8929 | 0.9541 | 0.9160
Deepside-nonlinear-fusion | 0.9686 | 0.9474 | 0.9616 | 0.8737 | 0.8935 | 0.9547 | 0.9157
Deepside-joint-fusion | 0.9671 | 0.9510 | 0.9574 | 0.8786 | 0.8922 | 0.9550 | 0.9172

Fig. 9. Visual comparisons of Deepside on testing images to state-of-the-art models.

Table 8. Quantitative evaluations on the new SOC dataset [66]. Comparisons are made between Deepside and state-of-the-art deep models. The 1st, 2nd, and 3rd best are in bold, bold italic, and italic, respectively.

Methods \ Metrics | Fβ ↑ | MAE ↓ | Fβw ↑ | Sα ↑ | Em ↑
DCL [23] | 0.6440 | 0.1373 | 0.5570 | 0.6960 | 0.7712
DSS [22] | 0.6284 | 0.1411 | 0.5625 | 0.6726 | 0.7593
DHS [48] | 0.6844 | 0.1123 | 0.6103 | 0.7354 | 0.8005
RFCN [49] | 0.6581 | 0.1276 | 0.5797 | 0.7180 | 0.7811
NLDF [56] | 0.6663 | 0.1285 | 0.5774 | 0.7234 | 0.7843
SRM [51] | 0.7071 | 0.1074 | 0.6147 | 0.7632 | 0.8184
GLN [73] | 0.7200 | 0.1012 | 0.6211 | 0.7728 | 0.8265
BRN [73] | 0.7124 | 0.0968 | 0.6409 | 0.7635 | 0.8230
Deepside-nonlinear | 0.7342 | 0.0993 | 0.6404 | 0.7784 | 0.8320
Deepside-nonlinear-fusion | 0.7365 | 0.0941 | 0.6638 | 0.7778 | 0.8321
Deepside-joint-fusion | 0.7391 | 0.0933 | 0.6648 | 0.7813 | 0.8328

The range of Em lies in the interval [0, 1]. To exploit the E-measure for comparing a non-binary saliency map against a binary ground truth map, we follow a procedure similar to the maximum F-measure mentioned above, namely we first binarize the saliency map into foreground maps by running all possible threshold values. For each foreground mask we calculate an Em value, and finally we report the maximum E-measure. In summary, among the 6 metrics above, higher precision-recall curves, Fβ, Fβw, Sα, and Em, and lower MAE, indicate better performance.
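As referenced above, the sketch below computes the first three metrics, MAE and the maximum F-measure over a swept precision-recall curve (Eqs. (4)–(6)), for a saliency map normalized to [0, 1] and a binary ground-truth mask; it is an illustrative reimplementation, not the evaluation code behind the reported numbers.

```python
import numpy as np

def mae(smap, gt):
    """Eq. (6): mean absolute error between saliency map and binary ground truth."""
    return np.abs(smap - gt.astype(np.float64)).mean()

def max_f_measure(smap, gt, beta2=0.3, num_thresholds=256):
    """Maximum F-measure over the PR curve (Eqs. (4)-(5)), with beta^2 = 0.3."""
    gt = gt.astype(bool)
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds, endpoint=False):
        mask = smap >= t                       # binarize with threshold T
        tp = np.logical_and(mask, gt).sum()
        if mask.sum() == 0 or tp == 0:
            continue
        precision = tp / mask.sum()
        recall = tp / gt.sum()
        best = max(best, (1 + beta2) * precision * recall
                         / (beta2 * precision + recall))
    return best
```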


Fig. 10. Visual comparisons on fine-grained spatial and object boundary details. From left to right: input images, Amulet (ICCV 2017 [61]), DHS (CVPR 2016 [48]), SRM (ICCV 2017 [51]), BRN (CVPR 2018 [73]), our Deepside (deepside-nonlinear-fusion), and GT (ground truth).

Fig. 11. Comparing precision-recall curves of Deepside (Deepside-joint-fusion, Deepside-nonlinear, and Deepside-nonlinear-fusion) to state-of-the-art methods on the new SOC dataset [66].

4.2. Deepside with different settings

First, we compare three kinds of settings of Deepside, namely Deepside-linear (Fig. 3(a)), Deepside-nonlinear (Fig. 3(b)), and Deepside-joint (Section 3.4). Specifically, for Deepside-linear, as there are multiple coarse-to-fine outputs from the network as illustrated in Fig. 3(a), we use the finest output, which integrates all side paths, to infer the final saliency map. Deepside-joint has both Deepside-linear and Deepside-nonlinear outputs, and only the Deepside backbone is shared. To fuse such multiple outputs into a single result, we use the segmentation-based pooling module proposed in Section 3.3, but we also report the performance of the individual results of the Deepside-linear and Deepside-nonlinear branches. The fused results produced by the segmentation-based pooling module are abbreviated as "Deepside-joint-fusion", and the individual results of the Deepside-linear and Deepside-nonlinear branches are abbreviated as "Deepside-joint-linear" and "Deepside-joint-nonlinear", respectively. Note that all three kinds of Deepside configurations are learned from the identical network initialization for fair comparison, as in Section 4.1.2. Quantitative comparisons on F-measure and MAE are shown in Tables 1 and 2. One can see that most of the best performance on Fβ and MAE is achieved by Deepside-joint-fusion, validating the effectiveness of the proposed fusion module in Section 3.3.

Besides, consistently slightly better performance of Deepside-nonlinear over Deepside-linear can be observed. This is probably attributable to the nonlinear top-down path of the former, which makes back propagation and learning more efficient. On the other hand, comparing Deepside-joint-linear with Deepside-linear, and also Deepside-joint-nonlinear with Deepside-nonlinear, one can find that jointly learning the two makes only a trivial difference. The network architecture tends to learn parameters that are suitable for both branches. The overall gain of Deepside-joint-fusion is contributed by "fusion" rather than "joint learning". In general, the performance gaps between different Deepside settings are not significant. This reveals that the dominant performance of Deepside is indeed determined by its backbone capability, and better performance could be expected by using deep side structures on more advanced backbones such as the ResNet and Inception architectures. Fig. 6 shows several visual comparisons for these settings. One can see that Deepside-joint-fusion retains sharper object boundaries and a cleaner background. This is also reflected by the fact that Deepside-joint-fusion achieves the lowest MAE in Table 2. Besides, to validate the effectiveness of deep supervision in Deepside-linear, we visualize the outputs of the individual side paths in Fig. 7, and compare them with the corresponding outputs from DSS [22], which employs short connections. An interesting observation is that, visually, both models tend to tune saliency in a coarse-to-fine manner because of the successive incorporation of side features from shallower layers. To better see the differences, we compute the forward MAE for each side output to see its extent of tuning. As a result, for both models the forward MAE values decrease monotonically, while Deepside-linear tunes from coarse to fine much faster than DSS.

4.3. Comparing Deepside to state-of-the-art methods

According to the validation results above, we further compare three good Deepside settings, namely Deepside-nonlinear, Deepside-joint-fusion, and Deepside-nonlinear-fusion, to state-of-the-art models in a comprehensive manner. Here, Deepside-nonlinear-fusion refers to the results obtained by further boosting the Deepside-nonlinear outputs using the fusion module proposed in Section 3.3. Comparisons are made to 10 existing models, including 8 deep learning-based models and 2 conventional saliency models.


Fig. 12. Visual comparisons of Deepside on the new SOC dataset [66] to state-of-the-art deep models.

The deep learning-based models include: DCL (Deep Contrast Learning) [23], DSS (Deeply Supervised model with Short connections) [22], DHS (Deep Hierarchical Saliency) [48], SRM (Stagewise Refinement Model) [51], Amulet (Aggregating Multi-Level Convolutional Features) [61], DLS (Deep Level Sets) [55], NLDF (Non-Local Deep Feature) [56], and BRN (Boundary Refinement Network) [73]. Besides, we compare with two traditional saliency models that resort to handcrafted features: DRFI (Discriminative Regional Feature Integration) [74] and NCS (Normalized Cut Saliency) [75]. Note that all the above deep learning-based models are built upon fully convolutional networks (FCNs), where DCL, DSS, DHS, Amulet, DLS, and NLDF use VGG-16 as the backbone, while SRM and BRN use ResNet-50 as the backbone. Precision-recall curves on 6 datasets are shown in Fig. 8, while quantitative evaluations on Fβ, MAE, Fβw, Sα, and Em are shown in Tables 3 to 7, respectively.

From these results, the Deepside settings generally outperform state-of-the-art methods by a notable margin on all evaluation metrics. Regarding the precision-recall curves, Deepside achieves the highest curves. Among the three Deepside settings, Deepside-joint-fusion achieves slightly better performance, though all three settings behave very similarly on precision-recall. Regarding the remaining four metrics, one can see that they sometimes do not agree with each other. For example, the BRN model has the best MAE on the PASCAL-S dataset (Table 4) but does not rank in the top 3 on Fβ and Sα on this same dataset (Tables 3 and 6). A common observation, however, is that the proposed Deepside consistently ranks in the top 3 on those metrics, while the Deepside settings with the fusion module generally rank in the top 2. This further validates the proposed fusion scheme. From our observations, we found that MAE and Fβw are more likely to favor saliency maps with sharp contrast rather than those with many medium saliency values. Therefore, existing models such as DSS, which employs a dense random field as post-processing, and BRN, which specially designs a boundary refinement network, achieve the best results on some datasets in Tables 4 and 5, respectively. This is because their saliency outputs are either white or black (see also Figs. 9 and 12). Since the proposed segmentation-based fusion module also helps retain fine-grained object spatial layouts and emphasize entire salient objects, the improvement of Deepside-nonlinear-fusion over Deepside-nonlinear on the Fβw score (Table 5) is more obvious than on other metrics.

Several visual examples of saliency maps are shown in Fig. 9, where the proposed Deepside settings detect entire salient objects favorably compared with state-of-the-art methods. Another advantage of Deepside is its use of deep convolutional side features together with the segmentation-based pooling module. This helps maintain fine-grained spatial layouts in the resulting saliency maps. Fig. 10 shows several examples. Compared to some leading methods, Deepside achieves clearer (rather than blurred) and more accurate object boundaries with sharp contrast. Besides, we conduct an evaluation on the SOC dataset (Salient Objects in Clutter) [66], which is a recently proposed challenging dataset for salient object detection. Fig. 11 shows comparisons of precision-recall curves on 2400 images to several state-of-the-art models3: DCL [23], DSS [22], DHS [48], RFCN [49], SRM [51], NLDF [56], GLN [73], BRN [73]. The proposed Deepside generalizes well to such a new challenging dataset and surpasses state-of-the-art performance. Table 8 shows the remaining 5 metrics on the SOC dataset, where Deepside-joint-fusion consistently ranks 1st. Several visual examples of saliency maps on SOC can be found in Fig. 12.

4.4. Discussion and limitation

From the experimental results, we have reached the following three observations:

i. Richer hierarchical side features are proven beneficial for achieving fine-grained salient object detection. Simply using a VGG-based architecture allows us to achieve and even surpass state-of-the-art performance. As evidence, the proposed Deepside-nonlinear is already superior to existing methods.

ii. We hypothesize that the dominant performance of Deepside is indeed contributed by the capability of the backbone.

3 Since SOC is a brand-new dataset and most authors have not yet provided their results on it, we only compare methods whose runnable code has been released.


As evidence, the impact of the linear and nonlinear heads is less significant. In this sense, better performance could be expected by using deep side structures on more advanced backbones such as ResNet and Inception. In addition, despite the minor impact, Deepside-nonlinear performs slightly better and also learns faster than Deepside-linear.

iii. The proposed segmentation-based fusion module is shown to be effective in keeping sharp object boundaries and achieving uniform emphasis of entire salient objects. It consistently improves various evaluation metrics.

Although our segmentation-based fusion module is shown to be effective, it currently resorts to traditional image segmentation algorithms, which only run on the CPU and take much more time compared to GPU-based neural network computation. The forward inference time of the Deepside networks is only about 0.08 s and 0.07 s for Deepside-linear and Deepside-nonlinear, respectively. The segmentation process takes another 1.2 s. From the above, one can see that the chosen segmentation algorithm has a high impact on the time efficiency of the segmentation-based pooling module, and therefore on the entire method. By choosing faster segmentation algorithms or implementations (e.g., a GPU version [76]) in the future, a further efficiency boost is feasible. For real applications, as a tradeoff between performance and efficiency, Deepside-nonlinear is potentially a good choice. In contrast, for users to whom performance matters more than efficiency, Deepside with the fusion module can be considered.

5. Conclusion

The proposed Deepside has been tested and evaluated comprehensively on 8 benchmark datasets. We show that such a VGG-based architecture with deep side structures allows us to achieve and even surpass state-of-the-art performance. The effectiveness of the proposed segmentation-based fusion module as a means of aggregating multiple side outputs has also been validated. It consistently brings improvements on various universally agreed evaluation protocols. In the future, we may further exploit the idea of designing deep side structures in other network architectures, such as recurrent neural networks and generative adversarial networks.

Conflicts of interest

None.

Acknowledgment

This research is partly supported by the National Science Foundation, China, under Nos. 61703077, 61773270, 61572315, 61876107, the Fundamental Research Funds for the Central Universities No. YJ201755, and the National Key Research and Development Program of China (2017YFB0802300, 2016YFC0801100). The authors would like to thank the anonymous reviewers for their constructive comments.

References

[1] Z. Liu, R. Shi, L. Shen, Y. Xue, K. Ngan, Z. Zhang, Unsupervised salient object segmentation based on kernel density estimation and two-phase graph cut, IEEE Trans. Multimed. (MM) 14 (4) (2012) 1275–1289.
[2] L. Ye, Z. Liu, L. Li, L. Shen, C. Bai, Y. Wang, Salient object segmentation via effective integration of saliency and objectness, IEEE Trans. Multimed. 19 (8) (2017) 1742–1756.
[3] K.R. Jerripothula, J. Cai, J. Yuan, Image co-segmentation via saliency co-fusion, IEEE Trans. Multimed. 18 (9) (2016) 1896–1909.
[4] U. Rutishauser, D. Walther, C. Koch, P. Perona, Is bottom-up attention useful for object recognition, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[5] J. Han, K.N. Ngan, M. Li, H.-J. Zhang, Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circuits Syst. Video Technol. 16 (1) (2006) 141–145.

5. Conclusion

The proposed Deepside has been tested and evaluated comprehensively on 8 benchmark datasets. We show that such a VGG-based architecture with deep side structures allows us to match and even surpass state-of-the-art performance. The effectiveness of the proposed segmentation-based fusion module as a means of aggregating multiple side outputs has also been validated: it consistently brings improvements under various widely adopted evaluation protocols. In the future, we may further exploit the idea of designing deep side structures in other network architectures, such as recurrent neural networks and generative adversarial networks.

Conflicts of interest

None.

Acknowledgment

This research is partly supported by the National Science Foundation, China, under Nos. 61703077, 61773270, 61572315 and 61876107, the Fundamental Research Funds for the Central Universities No. YJ201755, and the National Key Research and Development Program of China (2017YFB0802300, 2016YFC0801100). The authors would like to thank the anonymous reviewers for their constructive comments.

References

[1] Z. Liu, R. Shi, L. Shen, Y. Xue, K. Ngan, Z. Zhang, Unsupervised salient object segmentation based on kernel density estimation and two-phase graph cut, IEEE Trans. Multimed. 14 (4) (2012) 1275–1289.
[2] L. Ye, Z. Liu, L. Li, L. Shen, C. Bai, Y. Wang, Salient object segmentation via effective integration of saliency and objectness, IEEE Trans. Multimed. 19 (8) (2017) 1742–1756.
[3] K.R. Jerripothula, J. Cai, J. Yuan, Image co-segmentation via saliency co-fusion, IEEE Trans. Multimed. 18 (9) (2016) 1896–1909.
[4] U. Rutishauser, D. Walther, C. Koch, P. Perona, Is bottom-up attention useful for object recognition? in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[5] J. Han, K.N. Ngan, M. Li, H.-J. Zhang, Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circuits Syst. Video Technol. 16 (1) (2006) 141–145.

[6] C. Guo, L. Zhang, A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, IEEE Trans. Image Process. 19 (1) (2010) 185–198.
[7] Y. Ma, X. Hua, L. Lu, H. Zhang, A generic framework of user attention model and its application in video summarization, IEEE Trans. Multimed. 7 (5) (2005) 907–919.
[8] D. Fan, W. Wang, M. Cheng, J. Shen, Shifting more attention to video salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[9] F. Stentiford, Attention based auto image cropping, in: Proceedings of the Workshop on Computational Attention and Applications, ICVS, 2007.
[10] L. Marchesotti, C. Cifarelli, G. Csurka, A framework for visual saliency detection with applications to image thumbnailing, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009.
[11] Y. Ding, X. Jing, J. Yu, Importance filtering for image retargeting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[12] S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[13] D. Zhang, J. Han, C. Li, J. Wang, X. Li, Detection of co-salient objects by looking deep and wide, Int. J. Comput. Vis. 120 (2) (2016) 215–232.
[14] D. Zhang, D. Meng, J. Han, Co-saliency detection via a self-paced multiple-instance learning framework, IEEE Trans. Pattern Anal. Mach. Intell. 39 (5) (2017) 865–878.
[15] T. Chen, M. Cheng, P. Tan, A. Shamir, S. Hu, Sketch2photo: internet image montage, ACM Trans. Graph. 28 (5) (2006) 1–10.
[16] Y. Gao, M. Shi, D. Tao, C. Xu, Database saliency for fast image retrieval, IEEE Trans. Multimed. 17 (3) (2015) 359–369.
[17] G. Liu, D. Fan, A model of visual attention for natural image retrieval, in: Proceedings of the International Conference on Information Science and Cloud Computing Companion, IEEE, 2013, pp. 728–733.
[18] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 640–651.
[19] X. Li, L. Zhao, L. Wei, M. Yang, F. Wu, Y. Zhuang, H. Ling, J. Wang, Deepsaliency: multi-task deep neural network model for salient object detection, IEEE Trans. Image Process. 25 (8) (2016) 3919–3930.
[20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2014).
[21] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1395–1403.
[22] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, P. Torr, Deeply supervised salient object detection with short connections, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5300–5309.
[23] G. Li, Y. Yu, Deep contrast learning for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 478–487.
[24] G. Li, Y. Xie, L. Lin, Y. Yu, Instance-level salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 247–256.
[25] P.O. Pinheiro, T.-Y. Lin, R. Collobert, P. Dollár, Learning to refine object segments, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2016, pp. 75–91.
[26] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[27] M. Cheng, N. Mitra, X. Huang, P. Torr, S. Hu, Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 569–582.
[28] K. Fu, C. Gong, J. Yang, Y. Zhou, I. Gu, Superpixel based color contrast and color distribution driven salient object detection, Signal Process. Image Commun. 28 (10) (2013) 1448–1463.
[29] F. Perazzi, P. Krähenbühl, et al., Saliency filters: contrast based filtering for salient region detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[30] G.-H. Liu, J.-Y. Yang, Exploiting color volume and color difference for salient region detection, IEEE Trans. Image Process. 28 (1) (2019) 6–16.
[31] P. Jiang, H. Ling, J. Yu, J. Peng, Salient region detection by UFO: uniqueness, focusness and objectness, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1976–1983.
[32] K. Fu, I.Y.-H. Gu, J. Yang, Saliency detection by fully learning a continuous conditional random field, IEEE Trans. Multimed. 19 (7) (2017) 1531–1544.
[33] J. Shi, Q. Yan, L. Xu, J. Jia, Hierarchical image saliency detection on extended CSSD, IEEE Trans. Pattern Anal. Mach. Intell. 38 (4) (2016) 717–729.
[34] X. Shen, Y. Wu, A unified approach to salient object detection via low rank matrix recovery, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[35] Y. Wei, F. Wen, W. Zhu, J. Sun, Geodesic saliency using background priors, in: Proceedings of the European Conference on Computer Vision (ECCV), 2012.
[36] C. Yang, L. Zhang, et al., Saliency detection via graph-based manifold ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[37] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, F. Wu, Background prior-based salient object detection via deep reconstruction residual, IEEE Trans. Circuits Syst. Video Technol. 25 (8) (2015) 1309–1321.
[38] Y. Jia, M. Han, Category-independent object-level saliency detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1761–1768.


[39] J. Zhao, Y. Cao, D. Fan, M. Cheng, X. Li, L. Zhang, Contrast prior and fluid pyramid integration for RGBD salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[40] A. Borji, M. Cheng, H. Jiang, J. Li, Salient object detection: a benchmark, IEEE Trans. Image Process. 24 (12) (2015) 5706–5722.
[41] R. Zhao, W. Ouyang, H. Li, X. Wang, Saliency detection by multi-context deep learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1265–1274.
[42] G. Li, Y. Yu, Visual saliency based on multiscale deep features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5455–5463.
[43] L. Wang, H. Lu, X. Ruan, M.H. Yang, Deep networks for saliency detection via local estimation and global search, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3183–3192.
[44] G. Lee, Y. Tai, J. Kim, Deep saliency with encoded low level distance map and high level features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 660–668.
[45] S. Kruthiventi, V. Gudisa, J. Dholakiya, R. Venkatesh Babu, Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5781–5790.
[46] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2016, pp. 483–499.
[47] C. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, in: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015, pp. 562–570.
[48] N. Liu, J. Han, DHSNet: deep hierarchical saliency network for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 678–686.
[49] M. Liang, X. Hu, Recurrent convolutional neural network for object recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3367–3375.
[50] L. Wang, L. Wang, H. Lu, P. Zhang, X. Ruan, Saliency detection with recurrent fully convolutional networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 825–841.
[51] T. Wang, A. Borji, L. Zhang, P. Zhang, H. Lu, A stagewise refinement model for detecting salient objects in images, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4039–4048.
[52] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[53] X. Chen, A. Zheng, J. Li, F. Lu, Look, perceive and segment: finding the salient objects in images via two-stream fixation-semantic CNNs, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[54] S. He, J. Jiao, X. Zhang, G. Han, R.W. Lau, Delving into salient object subitizing and detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1059–1067.
[55] P. Hu, B. Shuai, J. Liu, G. Wang, Deep level sets for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[56] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, P.-M. Jodoin, Non-local deep features for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[57] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, X. Ruan, Learning to detect salient objects with image-level supervision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 136–145.
[58] D. Zhang, J. Han, Y. Zhang, Supervision by fusion: towards unsupervised learning of deep salient object detector, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4048–4056.
[59] W. Wang, J. Shen, X. Dong, A. Borji, Salient object detection driven by fixation prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1711–1720.
[60] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the Twenty-Second ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.
[61] P. Zhang, D. Wang, H. Lu, H. Wang, X. Ruan, Amulet: aggregating multi-level convolutional features for salient object detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[62] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[63] R. Achanta, S. Hemami, F. Estrada, S. Süsstrunk, Frequency-tuned salient region detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[64] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 353–367.
[65] Y. Li, X. Hou, C. Koch, J. Rehg, A. Yuille, The secrets of salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[66] D. Fan, M. Cheng, J. Liu, S. Gao, Q. Hou, A. Borji, Salient objects in clutter: bringing salient object detection to the foreground, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2018.
[67] R. Margolin, L. Zelnik-Manor, A. Tal, How to evaluate foreground maps? in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[68] D. Fan, M. Cheng, Y. Liu, T. Li, A. Borji, Structure-measure: a new way to evaluate foreground maps, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[69] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, A. Borji, Enhanced-alignment measure for binary foreground map evaluation, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 698–704.
[70] M. Cheng, J. Warrell, W. Lin, S. Zheng, V. Vineet, N. Crook, Efficient salient region detection with soft image abstraction, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
[71] P. Zhang, H. Lu, C. Shen, Troy: give attention to saliency and for saliency, arXiv preprint arXiv:1808.02373 (2018).
[72] P. Zhang, H. Lu, C. Shen, HyperFusion-Net: densely reflective fusion for salient object detection, arXiv preprint arXiv:1804.05142 (2018).
[73] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, A. Borji, Detect globally, refine locally: a novel approach to saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3127–3135.
[74] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, S. Li, Salient object detection: a discriminative regional feature integration approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[75] K. Fu, C. Gong, I. Gu, J. Yang, Normalized cut-based saliency detection by adaptive multi-level region merging, IEEE Trans. Image Process. 24 (12) (2015) 5671–5683.
[76] M. Cheng, Y. Liu, Q. Hou, J. Bian, P. Torr, S. Hu, Z. Tu, HFS: hierarchical feature selection for efficient image segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2016, pp. 867–882.

Keren Fu received the dual Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, and Chalmers University of Technology, Gothenburg, Sweden, in 2016, under the joint supervision of Prof. Jie Yang and Prof. Irene Yu-Hua Gu. He is currently a research associate professor with the College of Computer Science, Sichuan University, Chengdu, China. His current research interests include visual computing, saliency analysis, and machine learning.

Qijun Zhao is currently an associate professor in the College of Computer Science at Sichuan University. He obtained his B.Sc. and M.Sc. degrees in computer science from Shanghai Jiao Tong University, and his Ph.D. degree in computer science from the Hong Kong Polytechnic University. He worked as a post-doctoral research fellow in the Pattern Recognition and Image Processing lab at Michigan State University from 2010 to 2012. His research interests lie in biometrics, particularly fingerprint recognition, face perception and affective computing, with applications to forensics, intelligent video surveillance, mobile security, healthcare, and human-computer interaction. Dr. Zhao has published more than 60 papers in academic journals and conferences, and has participated in many research projects either as principal investigator or as a primary researcher. He served as a program committee co-chair in organizing the 11th Chinese Conference on Biometric Recognition (CCBR 2016) and the 2018 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA 2018), and as an area co-chair for the 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018).

Irene Yu-Hua Gu received the Ph.D. degree in electrical engineering from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1992. From 1992 to 1996, she was a Research Fellow at Philips Research Institute IPO, Eindhoven, The Netherlands, a postdoctoral researcher at Staffordshire University, Staffordshire, U.K., and a Lecturer at the University of Birmingham, Birmingham, U.K. Since 1996, she has been with the Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden, where she has been a full Professor since 2004. Her research interests include statistical image and video processing, object tracking and video surveillance, pattern classification, and signal processing with applications to electric power systems. Dr. Gu was an Associate Editor for the IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, and Part B: Cybernetics from 2000 to 2005, and an Associate Editor for the EURASIP Journal on Advances in Signal Processing from 2005 to 2016. She was the Chair of the IEEE Swedish Signal Processing Chapter from 2001 to 2004. She has been on the Editorial Board of the Journal of Ambient Intelligence and Smart Environments since 2011.

Jie Yang received his Ph.D. from the Department of Computer Science, Hamburg University, Germany, in 1994. Currently, he is a professor at the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China. He has led many research projects (e.g., National Science Foundation, 863 National High Tech. Plan), had one book published in Germany, and authored more than 200 journal papers. His major research interests are object detection and recognition, data fusion and data mining, and medical image processing.