Neural Networks 121 (2020) 308–318
Embedding topological features into convolutional neural network salient object detection
Lecheng Zhou, Xiaodong Gu*
Department of Electronic Engineering, Fudan University, 200433, Shanghai, China
* Corresponding author. E-mail addresses: [email protected] (L. Zhou), [email protected] (X. Gu).
Article info
Article history: Received 22 December 2018; Received in revised form 2 September 2019; Accepted 6 September 2019; Available online 25 September 2019.
Keywords: Conditional random field; Convolutional neural network; Salient object detection; Topological feature
https://doi.org/10.1016/j.neunet.2019.09.009
Abstract: Salient object detection can be applied as a critical preprocessing step in many computer vision tasks. Recent studies of salient object detection mainly employed convolutional neural networks (CNNs) for mining high-level semantic properties. However, the existing methods can still be improved to find precise semantic information in different scenarios. In particular, in the two main methods employed for salient object detection, the patchwise detection models might ignore spatial structures among regions and the fully convolution-based models mainly consider semantic features in a global manner. In this paper, we propose a salient object detection framework that embeds topological features into a deep neural network for extracting semantics. We segment the input image and compute a weight for each region using low-level features. The weighted segmentation result is called a topological map and it provides an additional channel for the CNN to emphasize the structural integrity and locality during the extraction of semantic features. We also utilize the topological map for saliency refinement based on a conditional random field at the end of our model. Experimental results on six benchmark datasets demonstrated that our proposed framework achieves competitive performance compared to other state-of-the-art methods. © 2019 Elsevier Ltd. All rights reserved.
1. Introduction
Salient object detection (or saliency detection) aims to automatically discover the most visually distinctive regions in an image. This task has been investigated by many researchers owing to the rapidly increasing scale of image data that needs to be processed. Locating salient objects can facilitate the extraction of the most critical information from an image, thereby improving the efficiency of subsequent image processing tasks, including image semantic segmentation (Wei et al., 2017, 2017), object recognition (Wang, Chen, Li, Xu, & Lin, 2017), video tracking (Wu, Li, & Luo, 2014), image retrieval (Gordo, Almazán, Revaud, & Larlus, 2016; Zhao, Huang, Wang and Tan, 2015), image manipulation (Margolin, Zelnik-Manor, & Tal, 2013), video compression (Itti, 2004), and classification (Murabito et al., 2018).
The conventional methods for salient object detection focus on low-level features that are manually extracted from images. Based on our understanding of human saliency perception, various computational models (Achanta, Hemami, Estrada, & Susstrunk, 2009; Cheng, Mitra, Huang, Torr, & Hu, 2015; Gu, Fang, & Wang, 2013; Itti, Koch, & Niebur, 1998; Lecun, Bottou, Bengio, & Haffner, 1998; Liu et al., 2011; Xie, Lu, & Yang, 2013; Yang,
Zhang, Lu, Ruan, & Yang, 2013) have been built to combine these hand-crafted features and produce saliency maps. Most of these algorithms work well for simple images but they cannot handle images with complex semantic structures because the traditional methods mainly apply manual rules and parameters for converting low-level features into saliency scores. Some learning-based approaches (Jiang et al., 2013; Jiang, Zhang, Lu, Yang and Yang, 2013; You, Zhang, Qi, & Lu, 2016) achieve better performance at feature combination and saliency computation, but they are mainly bottom-up models based on manually derived features and they can obtain little semantic information from images. Thus, failures may occur when detecting salient objects in low contrast regions or a cluttered background, and it is difficult to ensure the robustness of the models employed, as shown in Fig. 1(b). Recent studies of computer vision tasks have demonstrated the superior performance of deep neural networks for mining specific high-level features in images. In particular, convolutional neural networks (CNNs) (Lecun et al., 1998) have achieved outstanding performance at object detection (Girshick, 2015; Girshick, Donahue, Darrell, & Malik, 2014) and image classification (Krizhevsky, Sutskever, & Hinton, 2012; Simonyan & Zisserman, 2015), where they are efficient and robust at these tasks because the large number of network parameters can allow the formulation of complex mappings between the original data and extracted features. In contrast to the models based
on hand-crafted features, CNN-based models are data-driven and supervised, and thus they are capable of locating particular objects in more complex scenes. Consequently, recent research into salient object detection with CNNs (Cai, Huang, Zeng, Xinghao, & Paisley, 2018; Chen, Lin, Liu, Luo, & Li, 2016; He, Lau, Liu, Huang, & Yang, 2015; Hou et al., 2017; Lee, Tai, & Kim, 2016; Li, Chen, Lu, & Chi, 2017; Li & Yu, 2018; Li et al., 2016; Liu & Han, 2016; Wang, Lu, Ruan, & Yang, 2015; Wang, Ma, & Chen, 2016; Zhao, Ouyang, Li and Wang, 2015) has yielded significant performance improvements. Furthermore, fully convolutional neural networks (FCNNs) (Shelhamer, Long, & Darrell, 2015) are more efficient at computing pixel-level saliency scores with end-to-end networks. Deep neural network based approaches are powerful for mining semantic features (internal properties that determine whether a region is salient) from images, but these features are data dependent and little attention is given to the local context. The extraction of exceptional or weak semantic information from salient regions may have negative effects on the final detection results, as illustrated in Fig. 1(c). In addition, the upsampling operations in the deconvolution layers of an FCNN may lead to a scale-space problem, thereby making the saliency results fuzzy in terms of the details and contours.

Fig. 1. Salient object detection results obtained using features represented at different levels. (a) Input images; (b) saliency maps obtained with only low-level features (Cheng et al., 2015); (c) saliency maps obtained with only high-level features by FCNN; (d) saliency maps obtained by combining high-level features with low-level features using our method; and (e) ground truth.

Human attention can be stimulated by low-level features such as color contrast in our vision and it can also be guided by high-level semantics related to our anticipation. The former procedure can respond rapidly to local boundaries and details, and thus it is memory-free and has a similar form to prior-based, bottom-up computational models. The latter process analyzes the visual input in a global manner by associating information with the existing memory and it can accurately identify salient regions in complex scenes. This memory-dependent mechanism is developed by learning the semantic features of salient objects under different conditions. Inspired by these human processes, we propose a salient object detection model based on deep neural networks, where we combine the critical low-level features of salient objects to complement the top-down procedure. Considering that CNNs focus on the global context of the image, we emphasize the topological properties, including the object structures and integrity, to maintain locality during feature extraction. The aim is to establish low-level feature descriptors that are suitable for use as CNN inputs based on saliency priors. We perform image segmentation by weighting contrast and spatial features to produce a so-called topological map, which represents these low-level features for the subsequent CNN. The topological map provides an additional input channel for the neural network, and the original image and the topological map are simultaneously fed into the neural network to produce a coarse-level saliency result. This additional channel allows the network to treat objects as integral wholes during the feature extraction process and to focus more on the local semantics. The topological map contains low-level features of the object spatial structures and contour details, and we further refine the final saliency map through a conditional random field (CRF) at the end of the network, which alleviates the scale-space problem caused by high-level feature extraction. The main contributions of our proposed model are as follows.
(1) We propose a novel framework that embeds topological features into deep neural networks and a CRF through image segmentation in order to promote the network's capacity for preserving structure and locality. The model utilizes both the topological features obtained from image segmentation and the semantic features from a neural network. Results and analyses based on six benchmark datasets demonstrated that the proposed architecture achieves competitive performance compared to state-of-the-art methods.
(2) We design a topological map for feature integration, which is computed using region-based contrast and spatial information from the segmentation results. This is an efficient method for introducing critical prior-based features into the CNN as an input channel.
(3) We perform saliency refinement based on a CRF for the saliency maps with combinations of topological features. More confident predictions and well-defined contours are obtained using this procedure.
The remainder of this paper is organized as follows. In Section 2, we present a brief review of related work. In Section 3, we introduce the overall architecture and details of each component in the proposed model. Performance comparisons with state-of-the-art approaches and ablation studies are presented in Section 4. Finally, we briefly summarize our conclusions in Section 5.

2. Related work
Salient object detection approaches can be generally divided into two categories: conventional methods based on hand-crafted features and deep learning based methods. The methods in the first category include computational models based on local contrast (Itti et al., 1998; Xie et al., 2013), global contrast (Achanta et al., 2009; Cheng et al., 2015; Liu et al., 2011), background priors (Yang et al., 2013; Zhu, Liang, Wei, & Sun, 2014), or combinations of these features (Jiang, Wang et al., 2013; Jiang, Zhang et al., 2013; Li, Lu, Zhang, Ruan, & Yang, 2013). Some saliency detection approaches also learn distance metrics (You et al., 2016) to classify the foreground and background. Previous studies (Gu et al., 2013) in this category introduced topological features using a unit-linking pulse coupled neural network based hole filter. Most of the state-of-the-art salient object detection models based on hand-crafted features were discussed in previous surveys (Borji, Cheng, Hou, Jiang, & Li, 2014, 2015) for reference.

2.1. Deep learning-based salient object detection
Recent approaches to salient object detection are mostly based on deep learning because representing the saliency semantics in a complex scene using only hand-crafted features is not adequate. CNN-based methods have greatly improved saliency detection tasks. Two subnetworks extracted encoded low-level features and high-level features from the image in a previous study (Lee
et al., 2016). These two types of features were then integrated for salient region classification. Zhao, Ouyang et al. (2015) jointly modeled the global context and local context in a deep learning framework by upsampling and downsampling. These methods employ fully connected layers at the end of the network, which involve the calculation of a large number of parameters. Instead of using fully connected layers to generate feature vectors, FCNN produces heat maps and pixel-level prediction results for image semantic segmentation (Li et al., 2016; Shelhamer et al., 2015). The superior properties of this method resulted in the application of FCNN to salient object detection. Liu and Han (2016) developed a network architecture in a global to local manner using FCNN and recurrent convolution layers. Another study of salient object detection utilized two sequential neural networks for coarse and fine representation learning (Chen et al., 2016), where the saliency maps generated by the first network guide the second in a local manner for fine-level saliency detection. Li et al. (2016) achieved synchronous salient object detection and semantic segmentation by using a shared fully convolutional architecture, and they employed graph-based Laplacian regularized nonlinear regression for refinement. Most of these FCNN-based methods rely on the feature maps generated from the last convolutional layer and pay less attention to object structure preservation.
2.2. Saliency computation with image segmentation
Many of the existing methods for salient object detection are based on image segmentation to predict saliency for different regions. He et al. (2015) extracted two feature sequences from each superpixel in an image and fed them into the CNN. The model learned the hierarchical contrast features and considered the global context by one-dimensional convolution. Wang et al. (2015) presented a saliency detection algorithm that uses two deep neural networks to learn local patch features simultaneously based on object proposal and global features. Wang et al. (2016) segmented images into multi-scale regions and employed fast R-CNN to detect the saliency of each region. Wang, Dai, Cai, Sun, and Chen (2018) proposed a salient object detection model based on a residual network where superpixel segmentation was used to simplify the input image. This model also involved fully connected layers, and a support vector machine was employed to calculate the saliency scores for the regions. These previous approaches performed segmentation of the input images and then computed the patchwise saliency scores. This required the neural networks to extract features from each region, thereby consuming considerable time and memory. In addition, separate regionwise prediction ignores the spatial structures of salient objects. By contrast, we combine the contrast and spatial features into the segmentation result and generate a topological map. This weighted representation allows the CNN to obtain regionwise low-level features as an input channel. Our model considers the spatial relationships between regions to produce a full saliency map by a single forward propagation, thereby reducing the computational cost.

2.3. Combination of multi-level features
CNN-based saliency detection approaches mostly focus on the features extracted from higher layers, which may result in the loss of object details. To overcome this weakness, some methods introduce combinations of multi-level features into their models. Li et al. (2017) designed a small-scale CNN to learn high-level features from images and combined the low-level features to produce patch-level saliency scores. SCNet (Cao, Liu, & Wang, 2018) involved combining feature maps from different layers of the CNN, and the network selects features by calculating the matching values between feature maps and the ground truth substitute. Cai et al. (2018) combined a topological metric space into a U-net to learn semantic distance expressions for saliency regions. Different scales of the feature maps were concatenated to generate the final prediction. These methods fuse low-level descriptors at the end of their models, whereas we implement topological maps as inputs to focus the network on the local context and preserve the structural integrity. Li and Yu (2018) introduced skip layers into deep neural networks to fuse multi-level feature maps and superpixel segmentation in order to optimize the final saliency maps. Hou et al. (2017) produced a novel deep neural network with skip layer structures and short connections, where the additional connections fused multi-level side outputs to achieve high accuracy salient object detection. The skip layer strategy requires supervision of each side output and a redesign of the loss function, thereby increasing the complexity of the training process. Moreover, retraining low-level feature maps with supervision may affect the feature extraction hierarchy in the shallower layers. By contrast, our model extracts prior-based features directly from the original image and combines them at the front and back ends of the model to enhance the multi-level features.

3. Salient object detection framework
As described in Section 1, both high-level features and low-level features are critical for detecting salient objects. Semantic features can be used to discover the intrinsic properties of salient objects and to locate regions in a global manner, whereas topological features focus on appearance and preserve the object structures and details. To obtain more accurate detection results, we propose the salient object detection framework shown in Fig. 2. This model comprises topological feature extraction, semantic feature extraction, and saliency refinement. The model first segments the input image into regions and computes their weights according to the contrast and spatial features. The weighted segmentation result provides a region-based representation of the low-level features, which we call a topological map. We then apply a deep neural network based on VGG-16 (Simonyan & Zisserman, 2015) to extract semantic features from the image. The original image is taken together with its topological map as the inputs to produce a saliency map with full resolution. Finally, in order to recover the details lost during the pooling process by the deep neural network, saliency refinement is performed by a CRF. We can obtain more precise predictions of salient objects by using the low-level features from the topological map.

3.1. Extraction of topological features based on segmentation
We first explain the motivation for producing a topological map in our salient object detection framework. We note that using a neural network with the original image as the input pays less attention to the spatial structure and integrity of the objects, which may lead to spurious or incomplete salient regions. Extra topological properties (Gu et al., 2013) need to be included to ensure that a CNN can extract appropriate semantic features from an image. Many studies of image segmentation (Achanta et al., 2012; Felzenszwalb & Huttenlocher, 2004) have demonstrated the effectiveness of separating the different regions in an image and we combine these techniques in our salient object detection model. To emphasize the differences in the features of objects, we further consider the contrast and spatial information as region-based weights. This representation of the topological features can be regarded as an initial feature map and simply implemented as an input channel for the neural network.
Fig. 2. Overall architecture of our proposed model. The topological map provides an additional channel for the subsequent CNN and helps to refine the saliency map through the CRF.
Fig. 3. Pipeline for topological feature extraction in our model.
Fig. 3 illustrates how we generate the topological map from an image. We employ a graph-based image segmentation algorithm (Felzenszwalb & Huttenlocher, 2004) to segment the image into N regions {R_i}_{i=1}^{N}, but the segmentation results contain no extra information for each region. Given that contrast plays an important role in attention (Parkhurst, Law, & Niebur, 2002) and that the Lab color space is perceptually relevant to human vision, we calculate the average Lab color distance D_c(i, j) between each pair of regions R_i and R_j as follows:

D_c(i, j) = \sqrt{(L_i - L_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2}.   (1)

The contrast weight W_c(i) for a region R_i can then be calculated using:

W_c(i) = \sum_{j \neq i} D_c(i, j).   (2)

Regions with higher color contrast compared with other regions will receive a larger weight in the whole image during this step, so the segmentation result can represent contrast features for different objects and alleviate the effects due to some regions with infrequent color. Spatial structure is another critical factor used for distinguishing objects. As shown in Fig. 3, we cannot distinguish objects with similar colors with only contrast features. Naturally, neighboring regions tend to greatly influence each other due to the local manner of attention selection, whereas distant regions have low interactivity. Therefore, we consider the relative positions of different regions during contrast weighting to describe their spatial relevance. The contrast weight in Eq. (2) can be optimized by spatial correlation:

W_{cs}(i) = \sum_{j \neq i} D_s(i, j) D_c(i, j),   (3)

where D_s(i, j) is the average spatial distance between pixels in regions R_i and R_j normalized to [0, 1]:

D_s(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}.   (4)

Next, we perform normalization with W'_{cs}(i) = W_{cs}(i) / \max_k W_{cs}(k) to ensure that the contrast weight is in the range [0, 1]. We also consider center bias as a global spatial feature for individual objects. The topological features of objects in an image should be distinct according to their absolute spatial position. To address this issue, we calculate the absolute spatial weight for region R_i using:

W_{AS}(i) = \exp\left(-\frac{d(i)}{\sigma_s^2}\right),   (5)

where d(i) is the normalized spatial distance between the image center and the center of R_i, and \sigma_s^2 is a parameter for controlling the strength of the absolute spatial weight, which was set to 0.2 in our implementation. By combining Eqs. (3) and (5), we compute the topological features of each region R_i using:

W_T(i) = \exp\left(-\frac{d(i)}{\sigma_s^2}\right) \sum_{j \neq i} D_s(i, j) D_c(i, j).   (6)
The generated topological map W_T is then normalized to [0, 1], and the topological feature value of each pixel is encoded as an intensity. In addition, the topological features are represented in the form of a two-dimensional feature map, which can be submitted directly to a CNN with the original image. We expect the topological map to preserve the object structure, but irregular contours and holes will affect the quality of the following semantic feature extraction process. Thus, we apply superpixel-based (Achanta et al., 2012) smoothing to better preserve the locality in objects and remove small, scattered regions, as shown in Fig. 3. Both qualitative and quantitative performance analyses in Section 4.3 illustrate how the topological map improves the salient object detection results.
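The following is a minimal Python sketch of Eqs. (1)–(6), assuming scikit-image provides the graph-based (Felzenszwalb & Huttenlocher, 2004) and SLIC (Achanta et al., 2012) segmentations; for brevity it approximates the average inter-region distances with distances between region means, and the function name topological_map is illustrative rather than the authors' MATLAB implementation.

```python
import numpy as np
from skimage import color, segmentation

def topological_map(rgb, k=500, sigma_s2=0.2, n_superpixels=200):
    """Sketch of Eqs. (1)-(6): weight graph-based segments by contrast,
    spatial correlation, and center bias, then smooth with SLIC superpixels."""
    # Graph-based segmentation (Felzenszwalb & Huttenlocher, 2004).
    labels = segmentation.felzenszwalb(rgb, scale=k, min_size=50)
    lab = color.rgb2lab(rgb)
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w] / max(h, w)          # normalized pixel coordinates
    regions = np.unique(labels)

    # Mean Lab color and mean position of each region (approximation of pixelwise averages).
    mean_lab = np.array([lab[labels == r].mean(axis=0) for r in regions])
    mean_pos = np.array([[ys[labels == r].mean(), xs[labels == r].mean()] for r in regions])

    # Pairwise color distance D_c (Eq. 1) and spatial distance D_s (Eq. 4).
    d_c = np.linalg.norm(mean_lab[:, None] - mean_lab[None], axis=-1)
    d_s = np.linalg.norm(mean_pos[:, None] - mean_pos[None], axis=-1)
    d_s /= d_s.max() + 1e-8                          # normalize to [0, 1]

    # Spatially weighted contrast (Eq. 3) combined with the center bias (Eq. 5) -> Eq. (6).
    center = np.array([ys.mean(), xs.mean()])
    d_center = np.linalg.norm(mean_pos - center, axis=1)
    d_center /= d_center.max() + 1e-8
    w_t = np.exp(-d_center / sigma_s2) * (d_s * d_c).sum(axis=1)
    w_t /= w_t.max() + 1e-8                          # normalize the map to [0, 1]

    t_map = w_t[np.searchsorted(regions, labels)]    # per-pixel topological feature value

    # Superpixel-based smoothing (SLIC) to remove small, scattered regions.
    sp = segmentation.slic(rgb, n_segments=n_superpixels, compactness=10)
    for s in np.unique(sp):
        t_map[sp == s] = t_map[sp == s].mean()
    return t_map
```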
Fig. 4. Architecture and parameters of the CNN used in our model. Note that the input size is enlarged to 500 × 500 × 4 for salient object detection with topological features. The network can be considered a fully convolutional neural network (FCNN).
3.2. Extraction of semantic features based on a CNN
Discovering saliency semantics in images requires a massive number of parameters for feature mapping, especially in images containing complex scenery. We apply a CNN in our model as a layer-upon-layer hierarchy for learning high-level features. Given the semantic similarity between salient object detection and the image classification task, we employ VGG-16 (Simonyan & Zisserman, 2015) as the backbone because of its high precision at image classification. Some adjustments are made to adapt the network for salient object detection. We show the detailed architecture of our CNN in Fig. 4. The original image and the topological map are resized to 500 × 500 and fed into the network synchronously without patch-level segmentation. The first 13 convolutional layers are the same as those used in VGG-16 for feature extraction, except for the size of the feature maps, and thus we can initialize our network with most of the parameters in VGG-16. Each convolutional layer is followed by a rectified linear unit (ReLU) to introduce nonlinear operations. The next two convolutional layers can be regarded as saliency regression, where they replace the fully connected layers to predict salient objects using high-level features. We also apply a previously suggested dropout technique (Shelhamer et al., 2015) to alleviate overfitting during training. In contrast to classification or semantic segmentation, salient object detection is a single-label prediction problem. We apply the last convolutional layer with a 1 × 1 kernel to combine the 4096 feature maps into a prediction map with a single channel. To obtain a saliency map with the same full resolution as the input image, a deconvolution layer is introduced to learn an upsampling function during training. Instead of manually selecting the upsampling function, learning-based parameters are employed in the deconvolution layer to increase the precision of the prediction results. In addition, a sigmoid function is added before the output to normalize the saliency map to [0, 1]. Finally, we apply cross entropy as the loss function to learn the saliency semantics during the training phase because of its suitability for probability prediction.
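To make the architecture above concrete, here is a minimal sketch in PyTorch (the paper's implementation uses Caffe). The 4-channel input, the 13 VGG-16 convolutional layers, the convolutional saliency regression replacing the fully connected layers, the 1 × 1 prediction layer, the learned deconvolution, and the sigmoid output follow the description in this section; the kernel sizes of the regression layers, the initialization of the extra input channel, and the class name TopoSaliencyNet are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class TopoSaliencyNet(nn.Module):
    """Sketch of the 4-channel (RGB + topological map) fully convolutional network."""
    def __init__(self):
        super().__init__()
        backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features  # 13 conv layers + pools
        # Widen the first convolution to accept the extra topological channel (assumption).
        first = nn.Conv2d(4, 64, kernel_size=3, padding=1)
        with torch.no_grad():
            first.weight[:, :3] = backbone[0].weight                 # copy the pre-trained RGB filters
            first.weight[:, 3:] = backbone[0].weight.mean(1, keepdim=True)
            first.bias.copy_(backbone[0].bias)
        self.features = nn.Sequential(first, *list(backbone.children())[1:])
        # Convolutional "saliency regression" replacing the fully connected layers.
        self.regress = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=7, padding=3), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Conv2d(4096, 1, kernel_size=1),                        # 1 x 1 single-channel prediction
        )
        # Learned upsampling (deconvolution) back towards the input resolution.
        self.upsample = nn.ConvTranspose2d(1, 1, kernel_size=32, stride=32)

    def forward(self, rgb, topo):
        x = torch.cat([rgb, topo], dim=1)                             # 500 x 500 x 4 input
        x = self.upsample(self.regress(self.features(x)))
        x = F.interpolate(x, size=rgb.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(x)                                       # saliency map in [0, 1]

# Training would minimize the cross-entropy loss against the ground-truth mask, e.g.
# loss = nn.BCELoss()(net(rgb, topo), gt)
```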
3.3. Saliency refinement with low-level features
As shown in Fig. 1(c), the saliency maps produced directly by the FCNN are fuzzy, especially at the object boundaries, thereby making the saliency predictions uncertain. In order to mine semantic features from an image, we need to increase the receptive field by using pooling operations. As the process moves to the deeper layers, the progressively smaller feature maps are inevitably insufficient to represent the saliency properties in detail. To overcome this problem with deep neural networks, we need to include the structural integrity of objects to complement the existing saliency results generated by high-level feature maps. We compute a pixelwise topological map of the image and utilize it to refine our saliency map. The neural network can approximately locate the salient region through deeper semantics, based on which the topological map can recover accurate object details and boundaries. To combine these two maps, we employ a fully connected CRF (Krähenbühl & Koltun, 2011) as a post-processing step for the fusion of low-level features. The CRF optimizes the saliency prediction of each pixel x_i with the following energy function:

E(x) = \sum_i \psi_i(x_i) + \sum_{i,j} \psi_{ij}(x_i, x_j),   (7)

where x denotes the saliency label prediction (salient or non-salient) for the pixels. At the beginning of each iteration, the unary potential \psi_i(x_i) is computed using the normalized saliency map S_i produced by the CNN:

\psi_i(x_i) = -\log S_i.   (8)

The pairwise potential \psi_{ij}(x_i, x_j) can be considered a penalty term during optimization to emphasize low-level features for saliency prediction, and it is defined as:

\psi_{ij}(x_i, x_j) = \mu(x_i, x_j)\Big[\omega_1\Big(\exp\Big(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\Big) + \exp\Big(-\frac{(t_i - t_j)^2}{2\sigma_\gamma^2}\Big)\Big) + \omega_2 \exp\Big(-\frac{\|p_i - p_j\|^2}{2\sigma_\theta^2}\Big)\Big],   (9)

where \mu(x_i, x_j) = 1 if x_i \neq x_j, and 0 otherwise. The first two Gaussian kernels indicate that two pixels with similar low-level features tend to have similar saliency scores. The conventional CRF only uses the first kernel, which includes the relative positions (p) and intensities (I) of the two pixels. Our topological map is based on image segmentation, so it imposes a stronger constraint on regions. Moreover, the spatial correlation and contrast difference are integrated in this region-based map, thereby making it a powerful criterion for clustering pixels during saliency refinement. Thus, we introduce an extra term t_i = W_T(r_i) from the topological map to emphasize the topological features and improve the performance of the CRF. The last Gaussian kernel is a smoothness kernel for removing small isolated regions. \omega_1, \omega_2, \sigma_\alpha, \sigma_\beta, \sigma_\theta, and \sigma_\gamma are control parameters for adjusting the importance of each kernel and descriptor. We use an efficient previously proposed implementation (Krähenbühl & Koltun, 2011), which requires an average of 0.4 s to process a 300 × 400 image. The effectiveness of the CRF and the embedded topological features is demonstrated by the visual comparisons in Fig. 5, which indicate that the contours of salient objects become clear and well positioned after CRF optimization. The topological features also promote the consistency of the saliency prediction, with more accurate object boundaries and fewer incorrect salient regions. Quantitative results for our saliency refinement are presented in Section 4.3.
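As an illustration of the refinement step, the sketch below uses the pydensecrf package, a public implementation of the fully connected CRF of Krähenbühl and Koltun (2011); it is not the authors' exact configuration. The unary term follows Eq. (8) for both labels, the topological term t_i = W_T(r_i) is folded into an additional bilateral kernel as an approximation of Eq. (9), and scaling the topological values to the [0, 255] intensity range is an assumption.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import (unary_from_softmax, create_pairwise_bilateral,
                              create_pairwise_gaussian)

def crf_refine(saliency, image, topo, w1=3.0, w2=3.0,
               sa=50.0, sb=10.0, st=15.0, sg=6.0, iters=10):
    """Refine a coarse CNN saliency map with a fully connected CRF.

    saliency: HxW float in [0, 1]; image: HxWx3 uint8; topo: HxW float in [0, 1].
    """
    h, w = saliency.shape
    sal = np.clip(saliency, 1e-6, 1.0 - 1e-6)
    probs = np.stack([1.0 - sal, sal]).astype(np.float32)            # background / salient
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))                      # unary: -log S_i (Eq. 8)

    # Appearance kernel over positions p and intensities I (sigma_alpha, sigma_beta).
    d.addPairwiseEnergy(create_pairwise_bilateral(sdims=(sa, sa), schan=(sb, sb, sb),
                                                  img=image, chdim=2), compat=w1)

    # Extra kernel on the topological map t_i, scaled like an intensity channel (assumption).
    topo_img = (topo * 255.0)[..., None].astype(np.float32)
    d.addPairwiseEnergy(create_pairwise_bilateral(sdims=(sa, sa), schan=(sg,),
                                                  img=topo_img, chdim=2), compat=w1)

    # Smoothness kernel over positions only (sigma_theta) to remove small isolated regions.
    d.addPairwiseEnergy(create_pairwise_gaussian(sdims=(st, st), shape=(h, w)), compat=w2)

    q = np.array(d.inference(iters))
    return q[1].reshape(h, w)                                        # refined saliency map
```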
4. Experimental results We evaluated our approaches based on six benchmark datasets and comparisons with other state-of-the-art approaches. We also performed ablation studies and analyses of our proposed model to test different components of the model.
Fig. 5. Examples of CRF refinement and topological feature enhancement. (a) Input image; (b) coarse saliency prediction obtained by the CNN; (c) CRF refinement without topological features; (d) topological map; (e) CRF refinement with topological features; and (f) ground truth.

4.1. Experimental settings
4.1.1. Datasets
Our experiments were conducted on six representative datasets comprising MSRA-B (Liu et al., 2011), ECSSD (Yan, Xu, Shi, & Jia, 2013), PASCAL-S (Li, Hou, Koch, Rehg, & Yuille, 2014), DUT-OMRON (Jiang, Zhang et al., 2013), SED2 (Alpert, Galun, Basri, & Brandt, 2007), and HKU-IS (Li & Yu, 2015). All of these datasets have pixelwise saliency annotations. These datasets contain various types of natural images and they have been used widely in previous studies. MSRA-B contains 5000 images, most of which show a single salient object in a simple scene. The dataset is divided into 2500 training images, 500 validation images, and 2000 testing images by default. Many previous studies used the training set to train their models because of its semantic generality. ECSSD is an extension of CSSD (Yan et al., 2013) that contains 1000 images with complex scenes. Each image contains a single salient object with meaningful semantics. PASCAL-S is a more challenging dataset that contains 850 images selected from the PASCAL VOC 2010 dataset (Everingham, Gool, Williams, Winn, & Zisserman, 2010), where many of the images have complex backgrounds and confusing foreground objects. DUT-OMRON contains 5168 images that show one or more salient objects in cluttered scenes. SED2 is a two-object dataset containing 100 images, where some images show salient objects with different scales. HKU-IS is another large, challenging dataset of 4447 images, many of which contain low contrast or multiple salient objects.

4.1.2. Implementation details
We trained and tested our model using a desktop computer with an Intel Core i7-6700K CPU (4.00 GHz) and 32 GB of memory. We also used an NVIDIA GeForce GTX1080 Ti GPU to accelerate our computations. The low-level feature extraction algorithm in our model was implemented in MATLAB and it required about 0.25 s to produce a topological map for a 300 × 400 image when we performed image weighting and superpixel segmentation in parallel. The number of superpixels in the SLIC (Achanta et al., 2012) segmentation was set to 200 and the compactness factor was set to 10 during our experiments. We implemented the CNN in our model based on the Caffe framework (Jia et al., 2014). Initialization is crucial when training a neural network, so we initialized our network with the pre-trained model from VGG-16 (Simonyan & Zisserman, 2015), which shares similar convolutional layers with our CNN. We used the training set from the MSRA-B dataset (2500 images) to train our network with a step learning policy, where we set the basic learning rate to 10^-8, the step size to 7500, and gamma to 0.1. The momentum and weight decay were set to 0.9 and 0.0005, respectively. Before training, we resized all of the input images to 500 × 500 and set the batch size to five. Furthermore, we augmented our training set by flipping the images horizontally, so the training set was twice as large as the original set. Training our model using the GPU mentioned above required around 12 h with a total of 100,000 iterations. The parameters of the fully connected CRF were determined by validation as suggested previously (Krähenbühl & Koltun, 2011). According to the discussion in Section 4.3.2, we eventually set ω1, ω2, σα, σβ, σθ, and σγ to 3.0, 3.0, 50.0, 10.0, 15.0, and 6.0, respectively.
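For reference, the training schedule above corresponds to a standard SGD setup. The sketch below mirrors the stated solver settings (step policy, base learning rate 10^-8, step size 7500, gamma 0.1, momentum 0.9, weight decay 0.0005, batch size 5) in PyTorch; the model and data loader are simple stand-ins, since the paper trains its Caffe network on MSRA-B.

```python
import torch
import torch.nn as nn

# Stand-in model and data so the schedule can be exercised; the actual network and
# MSRA-B training data are described in Sections 3.2 and 4.1.1.
model = nn.Sequential(nn.Conv2d(4, 1, kernel_size=3, padding=1), nn.Sigmoid())
loader = [(torch.rand(5, 4, 500, 500), torch.rand(5, 1, 500, 500).round()) for _ in range(10)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-8, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7500, gamma=0.1)  # step policy
criterion = nn.BCELoss()                                   # cross-entropy saliency loss

for inputs, gt in loader:                                  # batch size 5; RGB + topological channel
    optimizer.zero_grad()
    loss = criterion(model(inputs), gt)
    loss.backward()
    optimizer.step()
    scheduler.step()                                       # Caffe-style step applied per iteration
```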
4.1.3. Evaluation metrics
We utilized three universally accepted metrics to quantitatively evaluate the performance of our model, i.e., the precision–recall curve (P–R curve), F-measure, and mean absolute error (MAE). The P–R curve indicates the model's capacity for detecting salient objects under different circumstances. Saliency map values from 0 to 255 can be converted into binary masks using different thresholds. For each binary mask B, we compute the precision and recall using:

Precision = \frac{|B \cap G|}{|B|}, \quad Recall = \frac{|B \cap G|}{|G|},   (10)

where G is the binary ground truth and |·| denotes the sum of the non-zero entries. We obtained the complete P–R curve for a dataset by computing the P–R rates for all of the images with different binary thresholds. The F-measure is the weighted harmonic average of the precision and recall, and it is used as a comprehensive indicator of a model's performance. The F-measure is defined as:

F_{\alpha^2} = \frac{(1 + \alpha^2)\,Precision \times Recall}{\alpha^2 \times Precision + Recall},   (11)

where \alpha^2 is typically set to 0.3 to emphasize the importance of precision. The maximum F-measure (Fmax) (Martin, Fowlkes, & Malik, 2004) was computed from the whole P–R curve and the average F-measure (Favg) (Achanta et al., 2009) was obtained using an adaptive binary threshold (twice the average intensity of the saliency map in this paper). Fmax and Favg evaluate the model's performance from different perspectives. The MAE denotes the average pixel distance between a saliency map and its ground truth, which is calculated using:

MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)|,   (12)

where S(x, y) refers to the saliency map normalized to [0, 1].
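The three metrics can be computed directly from Eqs. (10)–(12); the following numpy sketch evaluates a single saliency map for simplicity, whereas the paper aggregates the P–R curve and Fmax over a whole dataset. The function names are illustrative.

```python
import numpy as np

def precision_recall(sal, gt, threshold):
    """Eq. (10): precision/recall of the binarized saliency map against a binary GT."""
    b = sal >= threshold
    inter = np.logical_and(b, gt).sum()
    return inter / max(b.sum(), 1), inter / max(gt.sum(), 1)

def f_measure(precision, recall, alpha2=0.3):
    """Eq. (11) with alpha^2 = 0.3 to emphasize precision."""
    denom = alpha2 * precision + recall
    return (1 + alpha2) * precision * recall / denom if denom > 0 else 0.0

def evaluate(sal, gt):
    """sal in [0, 1], gt boolean. Returns per-image F_max, F_avg (adaptive threshold), and MAE (Eq. 12)."""
    f_max = max(f_measure(*precision_recall(sal, gt, t / 255.0)) for t in range(256))
    f_avg = f_measure(*precision_recall(sal, gt, 2.0 * sal.mean()))   # twice the mean intensity
    mae = np.abs(sal - gt.astype(float)).mean()
    return f_max, f_avg, mae
```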
4.2. Comparisons with state-of-the-art methods
Using the six datasets mentioned above, we compared the performance of our proposed model with 10 other state-of-the-art approaches comprising RC (Liu et al., 2011), GMR (Yang et al., 2013), DRFI (Jiang, Wang et al., 2013), MCDL (Zhao, Ouyang et al., 2015), MDF (Li & Yu, 2015), ELD (Lee et al., 2016), DISC (Chen et al., 2016), DS (Li et al., 2016), NLDF (Luo et al., 2017), and DCL+ (Li & Yu, 2018). RC and GMR are two classical methods based on hand-crafted features, DRFI is a learning-based method with supervision, and the other seven methods all apply deep neural networks to detect salient objects. We generated the saliency maps for these methods using their open source code or obtained their released saliency maps, where all of the saliency maps were resized to the original resolution and converted to [0, 255] to ensure fair comparisons.
Fig. 6. Visual comparison between saliency maps obtained using different methods.
Fig. 7. P–R curves obtained using different methods based on three representative datasets. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
4.2.1. Qualitative comparisons We present intuitive comparisons of our method and the state-of-the-art approaches based on the generated saliency maps in Fig. 6. We selected images of different types from the datasets. The saliency maps obtained with our approach are presented in the third column. Clearly, we obtained favorable detection results and they were closer to the ground truth compared with the other approaches. As shown in these examples, our model can produce precise detection results under different circumstances, including low contrast (e.g., rows 1, 2, and 5), center bias (e.g., rows 4 and 5), multiple objects (e.g., rows 6, 7, and 8), large scale salient objects (e.g., rows 2 and 3), and small scale objects (e.g., rows 4 and 5). Furthermore, our saliency maps preserved more accurate contours on the salient objects compared with the other approaches. We also achieved more confident detection results as the intensity inside the salient objects was much higher in our saliency maps. 4.2.2. Quantitative comparisons We present the P–R curves obtained by all of the methods on the ECSSD, DUT-OMRON, and HKU-IS datasets in Fig. 7. The red lines represent our results, and it can be seen that our curves are relatively short with high precision and recall. As mentioned above, our saliency maps were more confident, with greater contrast between the salient objects and background. Our method
achieved high precision and recall under different binary thresholds [0, 255], and the precision–recall rates were distributed centrally in the upper left areas according to Fig. 7. Compared with the other CNN-based approaches such as NLDF (Luo et al., 2017) and DCL+ (Li & Yu, 2018), we achieved competitive or better precision when the recall was close to 1. Thus, our model produced fewer false predictions during salient object detection. Fig. 8 compares the Fmax (orange bars) results and the corresponding precision and recall for our method and the other approaches. We obtained a higher recall rate (red bars) when the F-measure reached the maximum, which indicates that our saliency maps contained fewer false salient regions. The superior ability of our model to reduce the likelihood of false detection was also demonstrated by the highest Favg (purple bars) values among these datasets. Detailed quantitative comparisons of the F-measure (higher is better) and MAE (lower is better) results based on six datasets are presented in Table 1. Our approach achieved the best or second best Fmax, Favg, and MAE values with five datasets and competitive scores with the DUT-OMRON dataset. We achieved 0.5% to 3.4% higher Favg values for all of these datasets, thereby demonstrating the stability and accuracy of our model. The ECSSD dataset contains complex backgrounds and meaningful semantics, and our method outperformed the second best approach by 0.5%, 1.5%, and 4.4% in terms of Fmax, Favg, and MAE, respectively.
Table 1 Quantitative results obtained by different methods based on six datasets. The best two results in each case are shown in bold red and bold blue.
Fig. 8. Fmax, precision, recall, and Favg obtained using different methods based on three representative datasets.
4.3. Ablation studies and analyses
To further demonstrate the effectiveness of the topological map applied in our model, we performed ablation studies on our proposed model. We also investigated the effect of data augmentation in our experiments.
4.3.1. Contributions of embedded topological features
Fig. 9 presents visual comparisons of the saliency maps generated using the conventional CNN and the CNN with embedded topological features. These results indicate that the extraction of incorrect semantic information by the CNN could have negative effects on the saliency prediction results. The additional topological channel can enhance the locality and structural features when extracting semantic features, thereby enhancing the preservation of the integrity of salient objects (as shown in the first row) and avoiding the incorrect detection of salient regions (as shown in the second and third rows). Moreover, the performance gains obtained due to the embedded topological features are quantitatively illustrated in Fig. 10. The precision and recall rates both improved as the proportion of correct predictions increased. Fmax and Favg were 0.4% and 2.6% higher, respectively, compared with the conventional network based on the ECSSD dataset. Further quantitative comparisons based on three datasets are shown in Table 2, which also demonstrates the effect of superpixel smoothing in the topological map. Clearly, the topological map (TM) helped to improve the F-measure and MAE with the different datasets.
Fig. 9. Visual comparisons to illustrate the effectiveness of topological features. (a) Input images; (b) saliency maps obtained from CNN without a topological channel; (c) topological maps; (d) saliency maps obtained from CNN with a topological channel; and (e) ground truth.
We evaluated the effect of the parameter k in the segmentation algorithm and the spatial weight factor σs² in Eq. (5). Increasing k reduces the number of segmented regions N, as demonstrated in a previous study (Felzenszwalb & Huttenlocher, 2004), and σs² affects the differences between the central and edge regions. We trained the model and tested it using the ECSSD and SED2 datasets with the different parameters shown in Table 3. It should be noted that the training process for the neural network could introduce randomness and lead to fluctuations in the final results. The results showed that selecting appropriate parameters could improve the final results to various degrees depending on the dataset. For the SED2 dataset, the results were highly sensitive to the parameters and the performance in terms of Favg differed by at most 2%. We set k to 500 and σs² to 0.2 in our implementation.
Fig. 10. Comparisons of P–R curves and F-measures according to the different components of our model.

Table 2
Ablation studies using our proposed model based on three datasets.

| Method                 | ECSSD Fmax | ECSSD Favg | ECSSD MAE | PASCAL-S Fmax | PASCAL-S Favg | PASCAL-S MAE | SED2 Fmax | SED2 Favg | SED2 MAE |
|------------------------|------------|------------|-----------|---------------|---------------|--------------|-----------|-----------|----------|
| Raw CNN                | 0.8888     | 0.8313     | 0.0362    | 0.8107        | 0.7343        | 0.0576       | 0.8583    | 0.7762    | 0.0257   |
| TM without smoothing   | 0.8923     | 0.8675     | 0.0317    | 0.8104        | 0.7474        | 0.0462       | 0.8753    | 0.7649    | 0.0245   |
| TM with smoothing      | 0.8926     | 0.8531     | 0.0290    | 0.8131        | 0.7634        | 0.0458       | 0.8850    | 0.7668    | 0.0165   |
| CRF refinement with TM | 0.9099     | 0.8941     | 0.0153    | 0.8190        | 0.7993        | 0.0344       | 0.8993    | 0.8475    | 0.0071   |
Table 3
Comparisons of the saliency results obtained with different values of k and σs² for topological maps based on two datasets.

| k    | σs² | ECSSD Fmax | ECSSD Favg | ECSSD MAE | SED2 Fmax | SED2 Favg | SED2 MAE |
|------|-----|------------|------------|-----------|-----------|-----------|----------|
| 500  | 0.2 | 0.9099     | 0.8941     | 0.0153    | 0.8993    | 0.8475    | 0.0071   |
| 250  | 0.2 | 0.9102     | 0.8933     | 0.0160    | 0.8961    | 0.8321    | 0.0084   |
| 1000 | 0.2 | 0.9098     | 0.8832     | 0.0203    | 0.8963    | 0.8322    | 0.0075   |
| 500  | 0.1 | 0.9098     | 0.8936     | 0.0160    | 0.8991    | 0.8357    | 0.0080   |
| 500  | 0.5 | 0.9106     | 0.8940     | 0.0153    | 0.8976    | 0.8372    | 0.0071   |

Table 4
Comparisons of CRF refinement with and without topological maps as the pairwise potential based on the PASCAL-S dataset.

| Method     | Fmax   | Favg   | MAE    |
|------------|--------|--------|--------|
| Without TM | 0.8134 | 0.7973 | 0.0346 |
| With TM    | 0.8190 | 0.7993 | 0.0344 |

4.3.2. Contribution of CRF refinement
As discussed above, the saliency maps obtained using the high-level features from the CNN were insufficient to deal with the object details. Fig. 1(c) shows that a CNN can only produce a coarse prediction with fuzzy contours and possibly incorrect salient regions. The performance gains due to CRF refinement are illustrated in Fig. 10 and Table 2. The precision was relatively higher with this method based on the ECSSD dataset, as shown by the red and blue lines in the P–R curves, and Fmax and Favg improved by 1.9% and 4.8%, respectively. More importantly, applying CRF refinement significantly reduced the prediction error as the MAE decreased by 47.2%, 25.1%, and 57.0% for the three datasets shown in Table 2. We also conducted an experiment based on the PASCAL-S dataset to demonstrate the effectiveness of the additional pairwise potential introduced into the CRF. Table 4 shows that Fmax, Favg, and MAE improved by 0.7%, 0.3%, and 0.6%, respectively, when using the combination of topological features. Fig. 5(c) and (e) also provide qualitative comparisons to show that the object details were further enhanced by the topological features.

Table 5
Comparisons of the saliency results obtained with different parameters in the CRF based on two datasets.

| ω1 | ω2 | σα | σβ | σθ | σγ | ECSSD Fmax | ECSSD Favg | ECSSD MAE | SED2 Fmax | SED2 Favg | SED2 MAE |
|----|----|----|----|----|----|------------|------------|-----------|-----------|-----------|----------|
| 3  | 3  | 50 | 10 | 15 | 6  | 0.9099     | 0.8941     | 0.0153    | 0.8993    | 0.8475    | 0.0071   |
| 1  | 3  | 50 | 10 | 15 | 6  | 0.9102     | 0.8906     | 0.0146    | 0.8989    | 0.8410    | 0.0067   |
| 1  | 1  | 50 | 10 | 15 | 6  | 0.9080     | 0.8928     | 0.0153    | 0.8904    | 0.8303    | 0.0075   |
| 3  | 3  | 10 | 10 | 15 | 6  | 0.9034     | 0.8856     | 0.0186    | 0.8933    | 0.8390    | 0.0086   |
| 3  | 3  | 20 | 10 | 15 | 6  | 0.9065     | 0.8890     | 0.0178    | 0.8957    | 0.8398    | 0.0082   |
| 3  | 3  | 50 | 5  | 15 | 6  | 0.9078     | 0.8926     | 0.0169    | 0.8982    | 0.8307    | 0.0077   |
| 3  | 3  | 50 | 20 | 15 | 6  | 0.9088     | 0.8909     | 0.0164    | 0.8981    | 0.8290    | 0.0073   |
| 3  | 3  | 50 | 10 | 5  | 6  | 0.9097     | 0.8924     | 0.0172    | 0.8985    | 0.8335    | 0.0077   |
| 3  | 3  | 50 | 10 | 30 | 6  | 0.9099     | 0.8928     | 0.0161    | 0.8984    | 0.8405    | 0.0072   |
| 3  | 3  | 50 | 10 | 15 | 3  | 0.9098     | 0.8927     | 0.0170    | 0.8920    | 0.8349    | 0.0077   |
| 3  | 3  | 50 | 10 | 15 | 10 | 0.9098     | 0.8925     | 0.0163    | 0.8978    | 0.8326    | 0.0074   |
The CRF implementation in our model involves six control parameters that could potentially affect the results obtained with the saliency map. Some of the results in Table 5 indicate that the quality of the saliency maps differed slightly when the parameters of the CRF changed. In order to achieve better performance with our model, we set ω1, ω2, σα, σβ, σθ, and σγ to 3.0, 3.0, 50.0, 10.0, 15.0, and 6.0, respectively, according to our validation.

4.3.3. Data augmentation
The preparation of the training set also influenced the performance of the trained network. In our experiments, we applied horizontal flipping to the training images to augment the training data, which improved the final detection performance by about 1% based on the MSRA-B dataset.

Table 6
Approximate runtimes using different approaches. Deep learning-based approaches were accelerated by using a GPU.

| Method   | RC   | GMR  | DRFI  | MCDL  | MDF   | ELD   | DS    | DISC  | NLDF | DCL   | Ours  |
|----------|------|------|-------|-------|-------|-------|-------|-------|------|-------|-------|
| Platform | C    | C    | M + C | Caffe | Caffe | Caffe | Caffe | Caffe | TF   | Caffe | Caffe |
| Time (s) | 0.25 | 0.87 | 44.50 | 2.25  | 12.24 | 0.82  | 0.25  | 0.50  | 0.36 | 0.68  | 0.77  |

We padded the input image in the first convolutional layer and generated 698 × 698 feature
maps to compare with the 500 × 500 maps. The larger-scale input required an additional 0.1 s (80% runtime) to generate a saliency map but little improvement was obtained. In addition, we resized all of the input images to 321 × 321, as suggested in a previous study (Li & Yu, 2018), and the results obtained based on the MSRA-B dataset indicated a performance decrease of about 1%, possibly because the scale of the high-level feature maps was unsuitable for extracting the detailed semantic properties. In addition, some information may have been lost from the images due to size compression. 4.3.4. Runtime We replaced the fully connected layers in our CNN with convolutional layers and only about 0.12 s was required to process a 500 × 500 image during the testing phase. As mentioned above, computing the topological map required about 0.25 s and the CRF required another 0.4 s. Thus, our model required less than 0.8 s to produce a saliency map. Compared with other state-ofthe-art methods based on neural networks, we achieved a much lower runtime than the patchwise detection models (e.g., MDF Li & Yu, 2015) and a comparable runtime to the fully convolutional models (e.g., DCL Li & Yu, 2018), as shown in Table 6. 5. Conclusion In this paper, we proposed a salient object detection model based on CNN. We designed a topological map to provide a representation of the topological features in an image in order to highlight the critical integrity and locality during the feature extraction process by a neural network. The topological map comprises the weighted segmentation results obtained from an image to combine region-based contrast and spatial properties. Furthermore, a CRF is employed to refine the saliency predictions produced using high-level features from the CNN with the previously extracted topological features. We demonstrated the effectiveness of these methods in ablation studies and analyses. The experimental results obtained based on six benchmark datasets showed that our proposed framework achieved comparable or better performance compared with other state-of-the-art methods according to different metrics. Acknowledgment This research was supported by the National Natural Science Foundation of China under Grant numbers 61771145 and 61371148. References Achanta, R., Hemami, S., Estrada, F., & Susstrunk, S. (2009). Frequency-tuned salient region detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1597–1604). Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC Superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274–2282. Alpert, S., Galun, M., Basri, R., & Brandt, A. (2007). Image segmentation by probabilistic bottom-up aggregation and cue integration. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–8). Borji, A., Cheng, M.-M., Hou, Q., Jiang, H., & Li, J. (2014). Salient object detection: a survey. arXiv preprint arXiv:1411.5878. Borji, A., Cheng, M.-M., Hou, Q., Jiang, H., & Li, J. (2015). Salient object detection: a benchmark. IEEE Transactions on Image Processing, 24(12), 5706–5722.
Cai, S., Huang, J., Zeng, D., Xinghao, D., & Paisley, J. (2018). MEnet: A metric expression network for salient object segmentation. In International joint conference of artificial intelligence (IJCAI). Cao, F., Liu, Y., & Wang, D. (2018). Efficient saliency detection using convolutional neural networks with feature selection. Information Sciences, 456, 34–49. Chen, T., Lin, L., Liu, L., Luo, X., & Li, X. (2016). DISC: deep image saliency computing via progressive representation learning. IEEE Transactions on Neural Networks and Learning Systems, 27(6), 1135–1149. Cheng, M.-M., Mitra, N. J., Huang, X., Torr, P. H. S., & Hu, S.-M. (2015). Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 569–582. Everingham, M., Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338. Felzenszwalb, P., & Huttenlocher, D. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167–181. Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1449). Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587). Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2016). Deep image retrieval: learning global representations for image search. In Proceedings of the European conference on computer vision (pp. 241–257). Gu, X., Fang, Y., & Wang, Y. (2013). Attention selection using global topological properties based on pulse coupled neural network. Computer Vision of Image Understanding, 117(10), 1400–1411. He, S., Lau, R. W. H., Liu, W., Huang, Z., & Yang, Q. (2015). SuperCNN: A superpixelwise convolutional neural network for salient object detection. International Journal of Computer Vision, 115(3), 330–344. Hou, Q., Cheng, M.-M., Hu, X., Borji, A., Tu, Z., & Torr, P. (2017). Deeply supervised salient object detection with short connections. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5300–5309). Itti, L. (2004). Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13(10), 1304–1318. Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM MM. Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., & Li, S. (2013). Salient object detection: a discriminative regional feature integration approach. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2083–2090). Jiang, B., Zhang, L., Lu, H., Yang, C., & Yang, M. H. (2013). Saliency detection via absorbing Markov chain. In Proceedings of the IEEE international conference on computer vision (pp. 1665–1672). Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In Adv. Neural Inform. Process. Syst.. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. 
In Proceedings of the advances in neural information processing system (pp. 1097–1105). Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. Lee, G., Tai, Y.-W., & Kim, J. (2016). Deep saliency with encoded low level distance map and high level features. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 660–668). Li, H., Chen, J., Lu, H., & Chi, Z. (2017). CNN For saliency detection with low-level feature integration. Neurocomputing, 226, 212–220. Li, Y., Hou, X., Koch, C., Rehg, J., & Yuille, A. (2014). The secrets of salient object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 280–287). Li, X., Lu, H., Zhang, L., Ruan, X., & Yang, M.-H. (2013). Saliency detection via dense and sparse reconstruction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2976–2983). Li, G., & Yu, Y. (2015). Visual saliency based on multiscale deep features. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5455–5463).
Li, G., & Yu, Y. (2018). Contrast-oriented deep neural networks for salient object detection. IEEE Transactions on Neural Networks and Learning Systems, 29(12), 6038–6051. Li, X., Zhao, L., Wei, L., Yang, M.-H., Wu, F., Zhuang, Y., et al. (2016). Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE Transactions on Image Processing, 25(8), 3919–3930. Liu, N., & Han, J. (2016). DHSnet: deep hierarchical saliency network for salient object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 678–686). Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., et al. (2011). Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 353–367. Luo, Z., Mishra, A., Achkar, A., Eichel, J., Li, S.-Z., & Jodoin, P.-M. (2017). Non-local deep features for salient object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6593–6601). Margolin, R., Zelnik-Manor, L., & Tal, A. (2013). Saliency for image manipulation. Visual Computer, 29(5), 381–392. Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5), 530–549. Murabito, F., Spampinatoa, C., Palazzoa, S., Giordanoa, D., Pogorelovb, K., & Rieglerb, M. (2018). Top-down saliency detection driven by visual classification. Computer Vision of Image Understanding, 172, 67–76. Parkhurst, D., Law, K., & Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107–123. Shelhamer, E., Long, J., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440). Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the international conference on learning representation. Wang, Z., Chen, T., Li, G., Xu, R., & Lin, L. (2017). Multi-label image recognition by recurrently discovering attentional regions. In Proceedings of the IEEE international conference on computer vision (pp. 464–472). Wang, H., Dai, L., Cai, Y., Sun, X., & Chen, L. (2018). Salient object detection based on multi-scale contrast. Neural Networks, 101, 47–56.
Wang, L., Lu, H., Ruan, X., & Yang, M.-H. (2015). Deep networks for saliency detection via local estimation and global search. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3183–3192). Wang, X., Ma, H., & Chen, X. (2016). Salient object detection via fast R-CNN and low-level cues. In Proceedings of the IEEE international conference on image processing (pp. 1042–1046). Wei, Y., Feng, J., Liang, X., Cheng, M.-M., Zhao, Y., & Yan, S. (2017). Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6488–6496). Wei, Y., Liang, X., Chen, Y., Shen, X., Cheng, M.-M., Feng, J., et al. (2017). STC: A simple to complex framework for weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2314–2320. Wu, H., Li, G., & Luo, X. (2014). Weighted attentional blocks for probabilistic object tracking. Visual Computer, 30(2), 229–243. Xie, Y., Lu, H., & Yang, M.-H. (2013). Bayesian saliency via low and mid level cues. IEEE Transactions on Image Processing, 22(5), 1689–1698. Yan, Q., Xu, L., Shi, J., & Jia, J. (2013). Hierarchical saliency detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, (pp. 1155–1162). Yang, C., Zhang, L., Lu, H., Ruan, X., & Yang, M.-H. (2013). Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3166–3173). You, J., Zhang, L., Qi, J., & Lu, H. (2016). Salient object detection via point-to-set metric learning. Pattern Recognition Letters, 84(C), 85–90. Zhao, F., Huang, Y., Wang, L., & Tan, T. (2015). Deep semantic ranking based hashing for multi-label image retrieval. In Proceedings of the IEEE international conference on computer vision (pp. 1556–1564). Zhao, R., Ouyang, W., Li, H., & Wang, X. (2015). Saliency detection by multicontext deep learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1265–1274). Zhu, W., Liang, S., Wei, Y., & Sun, J. (2014). Saliency optimization from robust background detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2814–2821).