Complex Networks Driven Salient Region Detection based on Superpixel Segmentation

Alper Aksac∗†, Tansel Ozyer†, Reda Alhajj∗
∗ University of Calgary, Calgary, AB, CANADA. {aaksa, alhajj}@ucalgary.ca
† TOBB University of Economics and Technology, Ankara, TURKEY. {ozyer}@etu.edu.tr
Abstract
In this paper, we propose an efficient method for salient region detection. First, the image is decomposed using superpixel segmentation, which groups similar pixels and generates compact regions. Based on the generated superpixels, the similarity between regions is calculated from the color, location, histogram, intensity, and area information of each region, as well as from community identification via complex networks theory in the over-segmented image. Then, contrast, distribution and complex networks based saliency maps are generated using these features and combined into a final saliency map. The applicability, effectiveness and consistency of the proposed approach are illustrated by experiments on publicly available datasets, which compare the proposed method with several state-of-the-art methods. The reported results cover qualitative and quantitative assessments and demonstrate that our approach outputs high quality saliency maps and mostly achieves the highest precision rate among the compared methods.
Keywords superpixel; segmentation; salient region detection; saliency map; complex networks
I. INTRODUCTION
The Human Visual System (HVS) [1] is capable of easily detecting and separating the important parts of a given image from the remainder, which usually contains background information. Implementing this advanced mechanism in autonomous systems can be addressed as a multi-disciplinary problem and has been broadly investigated by researchers. To achieve realistic results, it is necessary to consider approaches from several research fields such as the human neural system [2], psychology [3], and computer vision [4, 5], among others. Visual saliency helps to quickly select and identify important objects that immediately attract human attention, even in complex scenes. Detecting salient regions has become an increasingly popular topic in computer vision [6, 7]. In the last decade, considerable research has been carried out and a wide variety of techniques have been developed on this topic. Detected regions play an important role in many computer vision applications such as salient object detection [8, 9], image retrieval [10], image retargeting (seam carving) [11], image/video compression [12], image resizing [13], collage [14], etc. Research on the HVS provides more information about visual saliency and shows that visual neurons are sensitive to contrast [15]. Contrast is therefore a leading cue in previous works, exploited through image features such as color, orientation, location, histogram, etc. Approaches that mimic the HVS for detecting salient regions can be grouped into two main categories [16]: top-down and bottom-up. Further, recent advances in complex networks theory have led to noticeable progress in image segmentation methods. The underlying idea is to identify communities among individually segmented regions and to connect these regions according to their similarity so that they form a unique community in the network. In this work, we exploit the bottom-up approach to detect salient regions using contrast, distribution and complex networks information. Based on the conducted research and existing knowledge, a salient object generally has the following distinctive features:
• A salient object has high contrast with respect to its surrounding background, and the distribution of its colors in the image is rare; such unique objects are especially attractive to the eye.
• A salient object derives its contrast value mostly from closer regions rather than farther ones.
• A salient object is commonly located near the image center; indeed, a commonly accepted assumption about object location is that human attention first focuses on the center area of an image.
• A salient object appears in a smaller area than background objects; in other words, salient regions occupy a more compact place, whereas the mostly unimportant remaining parts of the image are usually spatially distributed over the whole image.
• A salient object exhibits a uniform color distribution, meaning that all parts of a salient object are homogeneously highlighted.
Fig. 1: General overview. From left to right: (a) original image, (b) superpixel segmentation, (c) saliency map, and (d) ground truth.
A general overview of the problem is shown in Figure 1; in particular, the following items are depicted from left to right: the source image, the segmented image, the proposed method's saliency map, and the ground truth. Next we summarize the main contributions of our proposed approach and its differences from existing methods.
• The proposed algorithm does not expect pre-specified information and does not require training data or learning steps to start with.
• The execution time is decreased by using a superpixel image segmentation method; as a result, the proposed framework can also work with high-resolution image data.
• The compactness of salient objects is exploited: a salient object usually appears in a smaller area compared to the rest of the image (i.e., background objects), so salient regions occupy a more compact place than the remaining parts of the image, which are spatially distributed over the whole image.
• Contrast information of regions is generated using color, location and histogram features.
• Complex network construction and analysis is also used in the proposed framework; the pathway followed is different from most traditional graph based methods for image segmentation.
• The saliency map output by the proposed method has high resolution and is highlighted more uniformly, and the results reported later in this paper confirm that our method generally outperforms other state-of-the-art methods.
The rest of this paper is organized as follows: Section II is a review of the most popular previous studies reported on salient region detection. Section III introduces the proposed method. Evaluation of the proposed approach is presented in Section IV. Conclusions and future work are discussed in Section V.
II. RELATED WORK
Several salient region detection methods are described in the literature, and these can be grouped into two main categories: top-down and bottom-up, with most of the popular ones belonging to the bottom-up category. In this section, we review some popular methods based on these techniques. Top-down methods [17, 18] are generally goal driven, slow, and task dependent. They need pre-specified information to analyze and process saliency; consequently, this type of approach is memory based, requires more memory to process the whole data, and the information gathered in the training phase drives the test phase. The other trend, bottom-up methods, are data driven, fast and pre-attentive. In this case no pre-specified information is needed; instead, saliency is detected using basic image characteristics such as color, edge, texture, brightness, etc. Our proposed approach falls into this type of methodology. The methods incorporating these principal techniques are described next, together with the methods derived from them. Some approaches in the bottom-up category exploit color, brightness and orientation to detect saliency. Bottom-up saliency detection can further be divided into two categories: local and global analysis. Local methods consider only neighboring pixels/regions in order to find saliency. Itti et al. [5] developed this idea with a center-surround operation using Difference of Gaussians (DoG). This work was extended by Harel et al. [19] using a graph based approach to increase attention to salient regions. Achanta et al. [4] and Hou and Zhang [20] proposed saliency algorithms in which the frequency domain is used to extract salient information from regions. The resulting visual saliency maps improve on previous works: salient pixels are represented more easily and are well structured. However, these methods blur the image and make the edges of salient regions visually more apparent than their interiors (Guo et al. [21]). Among the local approaches, several models have been proposed, such as pixel-based analysis (Ma and Zhang [22]), multi-dimensional DoG (Itti and Baldi [23]), and histogram analysis (Liu et al. [24]). These studies achieve good results with less blurry outputs; on the other hand, the edges are strongly affected by different frequency values and the outputs include noise (Achanta et al. [4]). Global methods, e.g., [25–33], compute contrast by comparing all pixels/regions in the image with each other. Such approaches achieve better results than the previous methods, but processing all pixels has a high time complexity. Lastly, improvements in segmentation algorithms provide important support for salient region detection. Comparing pixel-based [25, 34, 35] and region-based [27–33] approaches with respect to complexity and accuracy clearly shows that region based methods are coming forward; high dimensional images can be easily managed using region based methods. Region based methods can be realized in various ways, such as multi-dimensional analysis, histogram based contrast, color distribution, etc. Some methods work with both local and global considerations, e.g., Goferman et al. [36] and Zhang et al. [37]. Peng et al. [38] propose a saliency cut based on a high-order energy term for stereo images. Salient regions in an image share some common features, which are listed next:
• Regions that belong to salient objects have high contrast compared to background objects; rare objects are also more salient and distinctive.
• The HVS starts to search for and focus on objects from the center of an image; indeed, salient regions are generally close to the image center.
• Background regions are distributed all over the image and occupy a larger area, whereas the spatial distribution of foreground (salient) regions is more centralized and compact.
• Foreground regions contribute more to the overall brightness of a region's contrast than far-away background regions do.
Different from the aforementioned works, in the proposed framework saliency maps are generated uniformly by considering both local and global region based approaches, highlighting foreground regions which attract human visual attention, and suppressing background regions. To sum up, some studies generate saliency maps using image color, orientation, texture, luminance and location information; the results of these algorithms are blurry and only indicate the direction where attention should focus by extracting the salient regions' edges. Other approaches are better candidates for content based applications in computer vision, such as object segmentation and detection, compression, etc. No input is needed from the user side in our method; it is a fully autonomous method which effortlessly recognizes and extracts foreground objects from a given image. In other words, our parameters are set and fixed for all the experiments rather than tuned to create a model for each dataset, so the method is effectively non-parametric. In other methods, some parameters must be tuned by hand to fit the model to the data (see [5, 39, 40]). Our algorithm follows the latter, content oriented trend; the focus of this paper is on a region-based approach.
III. METHODOLOGY
Figure 2 shows the process flow diagram of the proposed approach. In the pre-processing stage, we apply a superpixel segmentation algorithm to the input image. Then the saliency map is generated using the color, complex network and spatial distribution descriptors, followed by their combination. The steps composing the algorithm are detailed in the following sections.
Fig. 2: The framework of the proposed approach.

A. Pre-Processing
Unsupervised image segmentation is a hard computer vision problem because even manual segmentation of images is quite subjective, depending on image complexity. Although many interactive segmentation algorithms perform better than automatic segmentation methods, they require user input [41]; in our system no input is needed from the user side. Additionally, most segmentation algorithms produce segments that cover more than one object in the image; this is known as the under-segmentation problem. The natural structure of the image is then not preserved, which may cause undesired output for segmentation applications or subsequent processing elements. On the other hand, over-segmentation occurs when objects in the image are divided into more than one segment. Obtaining over-segments is acceptable since objects are still represented correctly by groups of visual elements. In addition, the number of elements constituting an image is drastically reduced from pixels to regions, which decreases the computational complexity of subsequent steps compared to pixel-based algorithms. For these reasons, image over-segmentation algorithms have become increasingly popular as a pre-processing step for computer vision problems. Lately, over-segmentation algorithms have received more attention, with researchers specifically focusing on grouping similar pixels into approximately same-sized pixel groups called superpixels. A good review of superpixel algorithms can be found in [42], where the authors demonstrate that their algorithm is the most powerful according to the criteria they list. In the approach described in this paper, we use this superpixel algorithm as a pre-processing step in order to decrease the complexity of the ensuing steps. Although Shen et al. [43] report better performance than SLIC, SLIC has lower complexity, which is why we chose SLIC for our system. The SLIC (Simple Linear Iterative Clustering) algorithm generates compact and regular-sized superpixels by clustering pixels located close to each other based on their color similarity and spatial information. For this, it uses a five-dimensional space, namely labxy, where lab represents pixel color values in the CIELAB color space, which is considered both device independent and suitable for color distance calculations, and xy represents the pixel position coordinates. This methodology also removes unnecessary details and noise from the input image. The main algorithm and a detailed explanation can be found in [44]. We set the number of superpixels to 500 as the SLIC parameter in our algorithm.
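As a rough illustration of this pre-processing step, the sketch below decomposes an image into roughly 500 SLIC superpixels and collects per-region color, position and size statistics. It is not the authors' C++/OpenCV implementation; the scikit-image API, the compactness value and the variable names are assumptions made for illustration only.

```python
# Illustrative sketch of the pre-processing step: decompose the image into
# ~500 compact SLIC superpixels and gather per-region statistics.
# Assumes scikit-image; this is not the authors' C++/OpenCV code.
import numpy as np
from skimage import io, color
from skimage.segmentation import slic

def preprocess(image_path, n_superpixels=500):
    rgb = io.imread(image_path)                       # H x W x 3, RGB
    lab = color.rgb2lab(rgb)                          # CIELAB, device independent
    # SLIC clusters pixels in the 5-D labxy space (color + position).
    labels = slic(rgb, n_segments=n_superpixels, compactness=10, start_label=0)
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    regions = []
    for r in range(labels.max() + 1):
        mask = labels == r
        regions.append({
            "color": lab[mask].mean(axis=0),                              # c_i
            "pos": np.array([xs[mask].mean() / w, ys[mask].mean() / h]),  # p_i in [0, 1]
            "size": int(mask.sum()),                                      # pixel count w_i
        })
    return labels, regions
```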
B. Color Descriptor
In this part, we try to find salient regions using color information from the segmented regions. Previous pixel-based approaches have drawbacks (e.g., sensitivity to noise and to rapid changes in contrast) when detecting saliency in images. With a region-based methodology, the abstraction into compact, similarly sized homogeneous regions makes it easy to overcome these problems. To highlight salient regions, we look for color features such as uniqueness, scarcity, and singularity among the regions. We build on the approach described by Perazzi et al. [32], but use different metrics to measure the distances and similarity between regions. In our method, we use the Bhattacharyya distance [45]:
$$d_{ij}^{Bha} = \sqrt{1 - \sum_{u=1}^{4096} \sqrt{Hist_i^u \, Hist_j^u}} \qquad (1)$$
where d_ij^Bha gives the color histogram similarity between regions i and j, Hist_i^u and Hist_j^u are the normalized histograms of the compared regions, and u indexes their u-th bin. The main reason for choosing the Bhattacharyya distance is that it is better suited than the Euclidean distance for histogram comparison, since it follows a statistical approach. A further advantage of the Bhattacharyya distance is its relation to the Chernoff bound, which in turn tends to minimize the Bayes error.
The CIELAB color space is preferred, and each channel of this color space (L, a, b) is quantized into 16 bins, so each region's histogram contains 16 × 16 × 16 = 4096 bins in total. If the result of Equation 1 is close to zero, the color similarity between i and j is high; otherwise the two regions are dissimilar.
$$w_{ij}^{C} = \|c_i - c_j\|_2^2 \, d_{ij}^{Bha} \qquad (2)$$
$$w_{ij}^{P} = \exp\left(-\frac{1}{2\sigma_p^2}\,\|p_i - p_j\|_2^2\right) \qquad (3)$$
$$Sal_i^{Cont} = \sum_{j=0,\, j \neq i} w_j\, w_{ij}^{C}\, w_{ij}^{P} \qquad (4)$$

Here w_ij^C represents the color similarity between regions i and j; c_i and c_j hold the averaged color information of regions i and j; d_ij^Bha is computed in Equation 1; w_ij^P is a Gaussian spatial weight that increases the contribution and effect of adjacent regions; p_i and p_j are the averaged locations of the compared regions; w_j is the total number of pixels in region j; and ||·||_2 denotes the L_2 norm (Euclidean distance). Larger values of w_j generate a higher contrast difference between regions and thus help to discriminate regions from each other. In the equations, σ_p^2 = 0.4 is used (this value is selected empirically) and the p values are normalized to the range [0, 1]. If a larger σ_p^2 is selected in Equation 3, the effect of farther regions increases and the contribution of the spatial weight decreases.
$$d_i^{Cent} = \exp\left(-\frac{1}{2\sigma_p^2}\,\|p_i - c\|_2^2\right) \qquad (5)$$
$$Sal_i^{Cont} = d_i^{Cent} \sum_{j=0,\, j \neq i} w_j\, w_{ij}^{C}\, w_{ij}^{P} \qquad (6)$$

Here c is the assumed saliency center, i.e., the center of the image in our algorithm; d_i^Cent is the Gaussian spatial weight between region i and the image center; and Sal_i^Cont is the saliency value of region i. The main reason for choosing the image center as the saliency center is that previously reported research shows the HVS has a higher probability of focusing on the center of an image. The suitability of the center points for the datasets used here can be seen in Figure 3, which considers the summation of all images in each dataset.
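The following sketch is a direct, unoptimized reading of Equations 1 to 6. It assumes each region dictionary from the pre-processing sketch additionally carries a normalized 16x16x16 LAB histogram under the key "hist"; the function names and the final normalization of the map are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the color (contrast) descriptor, Equations 1-6.
import numpy as np

def bhattacharyya(hist_i, hist_j):
    # Eq. 1: histogram dissimilarity (0 means identical distributions).
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(hist_i * hist_j))))

def contrast_saliency(regions, sigma_p2=0.4):
    n = len(regions)
    center = np.array([0.5, 0.5])                     # assumed saliency center c
    sal = np.zeros(n)
    for i in range(n):
        # Eq. 5: Gaussian weight between region i and the image center.
        d_cent = np.exp(-np.sum((regions[i]["pos"] - center) ** 2) / (2 * sigma_p2))
        acc = 0.0
        for j in range(n):
            if i == j:
                continue
            # Eq. 2: color contrast weighted by histogram dissimilarity.
            w_c = np.sum((regions[i]["color"] - regions[j]["color"]) ** 2) \
                  * bhattacharyya(regions[i]["hist"], regions[j]["hist"])
            # Eq. 3: Gaussian spatial weight favoring adjacent regions.
            w_p = np.exp(-np.sum((regions[i]["pos"] - regions[j]["pos"]) ** 2) / (2 * sigma_p2))
            # Eq. 4/6: size-weighted contrast accumulation.
            acc += regions[j]["size"] * w_c * w_p
        sal[i] = d_cent * acc
    return sal / (sal.max() + 1e-12)                  # map to [0, 1] for display
```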
Fig. 3: Some illustrative datasets, from left to right: (a) the MSRA-1000 dataset, (b) the Berkley-300 dataset, (c) the CSSD dataset, and (d) the ECSSD dataset. From top to bottom: the mean over all ground truths for each dataset, and the scatter plot of the centroid locations across all images of each dataset [46].
C. Complex Network Descriptor
After the calculation in the previous steps, some parts of the salient region can still be unclear and may have different brightness levels, because the superpixel segmentation causes over-segmentation in the produced saliency map. Naturally, regions belonging to the same object should have the same brightness and saliency value. In the second step of the proposed algorithm, the complex network descriptor is therefore used so that the overall object receives the same saliency value. A graph is first generated from the superpixels, and a weight function between the graph nodes measures the color differences between regions. Using these, communities are finally extracted from the complex network. An overview of this process is drawn in Figure 4; the details are given below.
$$w_{ij}^{C} = \|c_i - c_j\|_2 \qquad (7)$$
$$w_{ij}^{P} = \|p_i - p_j\|_2 \qquad (8)$$
$$f_{ij}^{adj} = \begin{cases} 1, & \text{if } w_{ij}^{C} < t_c \,\cap\, w_{ij}^{P} < t_p \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$
Fig. 4: Overview of the complex network based saliency map generation.

where w_ij^C is the color difference and w_ij^P is the location difference between regions, t_c is a color threshold, t_p is a location threshold, f_ij^adj denotes the adjacency relation between regions, and ||·||_2 denotes the L_2 norm (Euclidean distance). The thresholds t_c and t_p eliminate links between neighboring regions that either do not have similar colors, or show a similar color distribution but lie far from each other, which yields more robust community relations. For the proposed algorithm, the parameter values in Equation 9 are t_c = 0.08 and t_p = 0.2; here, color and location values are normalized to the interval [0, 1]. After the graph construction is completed, the communities between regions are identified from the edge weights of the complex network. The choice of community extraction algorithm is a trade-off between time complexity and accuracy. In our case, the fast greedy algorithm was the best selection based on our test runs; neither the label propagation algorithm nor the Girvan and Newman algorithm is a good choice for this process. The label propagation algorithm is very fast but has low accuracy, while the Girvan and Newman model is slow but reports higher accuracy than the others. The Girvan and Newman algorithm is closely related to the concept of "modularity", which in turn allows "normalized cuts" to be interpreted in a different way; cut algorithms, in fact, play a key role in segmentation. In the end, community clusters are extracted from the superpixel regions using the complex network technique. We used the approach suggested by [47] to detect saliency between regions, and the saliency map is created from these communities. Using a split-and-merge approach and benefiting from the complex network technique, over-segmented regions can be unified inside a community. Figure 5 shows the difference between the two cases, i.e., before and after applying these steps.
Fig. 5: Combination of color and complex network based saliency maps. From left to right: (a) original image, (b) color descriptor, (c) complex network descriptor, and (d) color+complex network combination.
$$Sal_i^{CNCont} = d_i^{Cent} \sum_{j=0,\, j \neq i} w_j\, w_{ij}^{C}\, w_{ij}^{P} \qquad (10)$$

where all the variables and constraints are the same as in Equation 6. In this equation, w_j has a large effect on the result, because the superpixel regions used in the previous calculation were almost the same size, whereas the merged community regions are not.
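The sketch below illustrates one possible realization of the graph construction in Equations 7 to 9 and the subsequent community extraction. It uses NetworkX's greedy modularity routine as a stand-in for the fast greedy algorithm; the edge weighting, the normalization of the region colors, and the way communities are merged into larger regions are assumptions made for illustration only.

```python
# Illustrative sketch of the complex network descriptor (Eqs. 7-10):
# build a region graph, keep edges passing the color/location thresholds,
# extract communities with a greedy modularity algorithm, and merge
# over-segmented regions into community regions.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def community_regions(regions, t_c=0.08, t_p=0.2):
    n = len(regions)
    colors = np.array([r["color"] for r in regions])
    colors = (colors - colors.min(axis=0)) / (np.ptp(colors, axis=0) + 1e-12)  # [0, 1]
    pos = np.array([r["pos"] for r in regions])
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            w_c = np.linalg.norm(colors[i] - colors[j])        # Eq. 7
            w_p = np.linalg.norm(pos[i] - pos[j])              # Eq. 8
            if w_c < t_c and w_p < t_p:                        # Eq. 9
                g.add_edge(i, j, weight=1.0 - w_c)             # assumed edge weight
    communities = greedy_modularity_communities(g, weight="weight")
    # Merge each community into one region; Eq. 10 can then be evaluated on
    # the merged regions, where the region sizes w_j now differ substantially.
    merged = []
    for comm in communities:
        idx = list(comm)
        sizes = [regions[k]["size"] for k in idx]
        merged.append({
            "color": np.average([regions[k]["color"] for k in idx], axis=0, weights=sizes),
            "pos": np.average(pos[idx], axis=0, weights=sizes),
            "size": int(sum(sizes)),
            "members": idx,
        })
    return merged
```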
D. Spatial Distribution Descriptor
In this part, the saliency map is generated in a way similar to that described in [24]: we exploit the spatial distribution of color over the whole image. Colors belonging to the salient object are more compact and have a low spatial variance, so these regions should be the brightest, while colors belonging to the background are spread over the whole picture and show high variance, so these regions should be less bright.
$$\mu_i = \sum_{j=0,\, j \neq i} w_{ij}^{C}\, p_j \qquad (11)$$
$$w_{ij}^{P} = \|\mu_i - p_j\|_2^2 \qquad (12)$$
$$N^{C} = \sum_{j=0,\, j \neq i} \exp\left(-\frac{1}{2\sigma_c^2}\,\|c_i - c_j\|_2^2\right) d_{ij}^{Bha} \qquad (13)$$
$$w_{ij}^{C} = \frac{1}{N^{C}} \exp\left(-\frac{1}{2\sigma_c^2}\,\|c_i - c_j\|_2^2\right) d_{ij}^{Bha} \qquad (14)$$
$$Sal_i^{Dist} = \sum_{j=0,\, j \neq i} w_{ij}^{C}\, w_{ij}^{P} \qquad (15)$$

where w_ij^C is the color similarity between regions i and j; p_j is the averaged spatial position of region j; μ_i is the color-weighted position of region i's color, computed from its color similarity to the other regions and their positions; c_i and c_j hold the averaged color information of regions i and j; d_ij^Bha is explained in Equation 1; w_ij^P is the spatial weight of color; N^C is a normalization term; and ||·||_2 denotes the L_2 norm (Euclidean distance). The normalization term ensures that the weights w_ij^C in Equation 14 sum to 1, as suggested in [32]. When σ_c^2 = 0, w_ij^C = 1/k, where k is the number of regions in the segmentation. The weight w_ij^C controls the color similarity distribution between regions. In the algorithm, σ_c^2 = 400 is used and p is normalized to the interval [0, 1]. In the end, this step separates the salient regions from noisy background regions, i.e., color values that are spread over the whole image and show high variance.
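A minimal sketch of Equations 11 to 15 follows, under the reading that w_ij^P measures the squared distance between the color-weighted mean position μ_i and the position p_j of the compared region; this reading, the reuse of the bhattacharyya helper from the earlier sketch, and the final normalization are assumptions for illustration rather than a reproduction of the authors' code.

```python
# Illustrative sketch of the spatial distribution descriptor (Eqs. 11-15).
# Regions whose colors are spread over the whole image (high spatial variance)
# get a high Sal^Dist value and are suppressed later in the combination step.
import numpy as np

def distribution_saliency(regions, sigma_c2=400.0):
    n = len(regions)
    positions = np.array([r["pos"] for r in regions])
    sal = np.zeros(n)
    for i in range(n):
        # Eqs. 13-14: color-similarity weights w_ij^C, normalized to sum to 1.
        w_c = np.array([
            0.0 if j == i else
            np.exp(-np.sum((regions[i]["color"] - regions[j]["color"]) ** 2) / (2 * sigma_c2))
            * bhattacharyya(regions[i]["hist"], regions[j]["hist"])
            for j in range(n)
        ])
        w_c /= (w_c.sum() + 1e-12)
        mu_i = (w_c[:, None] * positions).sum(axis=0)         # Eq. 11
        # Eqs. 12 and 15: weighted spatial spread of region i's color.
        sal[i] = np.sum(w_c * np.sum((positions - mu_i) ** 2, axis=1))
    return sal / (sal.max() + 1e-12)
```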
E. Combination
The final saliency map is generated by combining the aforementioned steps, as in [32]. In this merging, the color and complex network descriptors are combined in a linear way; the result and the spatial distribution descriptor are then combined in a non-linear way. The linear part yields more compact salient regions with similar brightness, while the non-linear part suppresses and removes the effect of background regions on the salient region. This is expressed by the following equations:
$$\overline{Sal}_i^{Cont} = \alpha\, Sal_i^{Cont} + \beta\, Sal_i^{CNCont} \qquad (16)$$
$$Sal_i = \overline{Sal}_i^{Cont} \exp\left(-\gamma\, Sal_i^{Dist}\right) \qquad (17)$$
where α = 0.5, β = 0.5, and γ = 2 are used in the equations. Samples from each step are shown in Figure 6.
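The combination step of Equations 16 and 17 reduces to a few lines; the final min-max normalization in the sketch below is an assumption added only for display purposes.

```python
# Illustrative sketch of the combination step (Eqs. 16-17): linear blend of the
# color-contrast and complex-network maps, then non-linear suppression of
# spatially distributed (background) regions. Parameter values from the text.
import numpy as np

def combine(sal_cont, sal_cn_cont, sal_dist, alpha=0.5, beta=0.5, gamma=2.0):
    blended = alpha * sal_cont + beta * sal_cn_cont           # Eq. 16
    final = blended * np.exp(-gamma * sal_dist)               # Eq. 17
    return (final - final.min()) / (final.max() - final.min() + 1e-12)
```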
IV. EXPERIMENTS AND RESULTS
In this section, we test and compare the results and execution time of the proposed method against some state-of-the-art methods, namely FT [4], LC [25], MSSS [35], SR [20], HC [27], RC [27], CA [36] and SF [32]. The empirical analysis has been conducted on four popular saliency datasets: MSRA-1000 [4], CSSD [48], ECSSD [48] and Berkley-300 [49]. The MSRA-1000¹ dataset contains 1000 images with ground truth for the salient regions; background structures are simple and smooth, and salient regions are generally located in the center of the images. CSSD² (complex scene saliency dataset) contains 200 images with multiple objects and diversified patterns across the images, which makes detection more challenging. ECSSD² (extended complex scene saliency dataset) is an extended version of CSSD containing 1000 images; ground truth images are provided for both of them. Berkley-300³ (also known as SOD, the Salient Object Dataset) contains 300 images with complex scenes and multiple objects at different positions. We ran our experiments on these datasets without changing the original size of the images. The saliency maps of the state-of-the-art works used in the comparison are either provided on the authors' web pages or generated using the source codes provided by the authors. For quantitative comparison, we use the PR (precision-recall) and ROC (receiver operating characteristic) curves; to measure the performance, Precision, Recall, F-measure, AUC (area under curve) and MAE (mean absolute error) scores are computed.

¹ http://ivrgwww.epfl.ch/supplementary material/RK CVPR09/ (last visited Aug. 30, 2016)
² http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency/dataset.html (last visited Aug. 30, 2016)
³ http://elderlab.yorku.ca/~vida/SOD/index.html (last visited Aug. 30, 2016)
Fig. 6: Combination of color and complex network based saliency maps. From left to right: (a) original image, (b) color descriptor, (c) complex network descriptor, (d) spatial distribution descriptor, and (e) final saliency map.
A. Evaluation
For the quantitative evaluation, we plot the PR curves in Figure 7 and the ROC curves in Figure 8 for the datasets used in the experiments. Additionally, AUC, MAE, precision, recall and F-measure scores are reported in Figure 9. We follow the same testing structure and parameters as suggested in [28, 32]. For plotting the curves, the saliency maps are binarized using thresholds in the range 0 to 255.
The precision value is the ratio between the correctly assigned salient pixels and all pixels marked salient in the generated saliency map, while the recall value captures the lost information, i.e., the ratio between correctly detected salient pixels and all salient pixels in the ground truth. Precision and recall rates are calculated as follows:
$$Precision = \frac{1}{k}\sum_{i=1}^{k} \frac{|M_i \cap G_i|}{|M_i|} = \frac{TP}{TP + FP} \qquad (18)$$
$$Recall = \frac{1}{k}\sum_{i=1}^{k} \frac{|M_i \cap G_i|}{|G_i|} = \frac{TP}{TP + FN} \qquad (19)$$
where k is the number of images in a dataset, M_i is the i-th binary mask (the prediction) generated from the saliency map by applying a threshold in the range [0, 255], and G_i is the corresponding ground truth (the actual value). To obtain an overall performance measure, the F-measure is computed using the following equation:
$$F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall} \qquad (20)$$
where β² = 0.3 to emphasize precision over recall [4]. The FPR (false positive rate) and TPR (true positive rate) are important evaluation inputs for showing the performance results; to illustrate them, we use ROC curves with thresholds between 0 and 255. To summarize the performance of an ROC curve, the AUC is computed using the following equation:
$$AUC = \sum_{i=1}^{N} \frac{1}{2}\,\left(TPR_i + TPR_{i-1}\right)\left|FPR_i - FPR_{i-1}\right| \qquad (21)$$
where N is 255 for gray levels, TPR is the true positive rate and FPR is the false positive rate. As mentioned in [28], for a more balanced comparison we also used the MAE as an evaluation criterion. The MAE between a continuous saliency map S and ground truth G is calculated as:
$$MAE = \frac{1}{N}\sum_{i=1}^{N} |S_i - G_i| \qquad (22)$$
where N is the number of image pixels and i is the pixel index; S_i is the saliency value at pixel i, and G_i is the corresponding ground-truth value.
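For completeness, the sketch below evaluates Equations 18 to 22 for a single image by sweeping the 256 binarization thresholds; averaging over a dataset is left out, and the trapezoidal AUC is an assumption about how Equation 21 is evaluated in practice.

```python
# Illustrative sketch of the evaluation measures (Eqs. 18-22) for one image:
# precision/recall over thresholds, F-measure (beta^2 = 0.3), ROC AUC, and MAE.
import numpy as np

def evaluate(sal_map, gt, beta2=0.3):
    # sal_map: float saliency in [0, 1]; gt: binary ground-truth mask.
    sal255 = (sal_map * 255).astype(np.uint8)
    gt = gt.astype(bool)
    precisions, recalls, tprs, fprs = [], [], [], []
    for t in range(256):
        pred = sal255 >= t
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        tn = np.logical_and(~pred, ~gt).sum()
        precisions.append(tp / (tp + fp + 1e-12))              # Eq. 18
        recalls.append(tp / (tp + fn + 1e-12))                 # Eq. 19
        tprs.append(tp / (tp + fn + 1e-12))
        fprs.append(fp / (fp + tn + 1e-12))
    p, r = np.array(precisions), np.array(recalls)
    f_measure = (1 + beta2) * p * r / (beta2 * p + r + 1e-12)  # Eq. 20
    order = np.argsort(fprs)                                   # Eq. 21 (trapezoidal)
    auc = np.trapz(np.array(tprs)[order], np.array(fprs)[order])
    mae = np.abs(sal_map - gt.astype(float)).mean()            # Eq. 22
    return p, r, f_measure, auc, mae
```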
Fig. 7: Quantitative comparison of PR results on the datasets: (a) MSRA-1000, (b) Berkley-300, (c) CSSD, and (d) ECSSD.

Fig. 8: Quantitative comparison of ROC results on the datasets: (a) MSRA-1000, (b) Berkley-300, (c) CSSD, and (d) ECSSD.

Fig. 9: Experimental results for AUC, MAE, precision, recall and F-measure on the four datasets utilized in the testing: (a) MSRA-1000, (b) Berkley-300, (c) CSSD, and (d) ECSSD.
B. Quantitative Comparison
As can be seen in Figures 7 and 8, our method achieves the best performance in most cases, both in terms of the plotted curves and the calculated average scores, on the MSRA-1000, CSSD and ECSSD datasets. However, our approach is only the second best on the Berkley-300 dataset, although its ROC curve is close to the best one. The reason might be the existence of multiple objects that are positioned close to the edges or cover the whole picture. In Figure 9, we compare our AUC, MAE, precision, recall and F-measure scores with the other methods. In terms of AUC scores, our algorithm improves by 2.27% and 2.01% over the second best algorithms, and by 3.09% and 3.21% over the third best algorithms, on the MSRA-1000 and ECSSD datasets, respectively. Our AUC score is not very good on the Berkley-300 dataset, while it is very close to the best on the CSSD dataset. The precision scores of our approach are consistently higher than the others on all datasets. However, our recall scores are not as good as the others; intuitively, our approach has a limitation in finding whole salient objects completely in the image. The reason might be that the saliency map is not homogeneous and is incomplete when there are multiple objects in the scene, or when some parts of a salient object have low contrast. As discussed in [24] and elsewhere, the recall score is not as important as the precision score for attention detection; for example, a 100% recall rate can be achieved by simply selecting the whole image. Because of our low recall scores, our F-measure scores are lower than the others except on the MSRA-1000 dataset. However, precision and recall do not consider true negative saliency assignments, i.e., the number of pixels correctly marked as non-salient. The MAE score shows how successful a method is at detecting non-salient background pixels, and our approach outperforms the other approaches in terms of MAE, obtaining the lowest error rate on all datasets. Figure 10 shows (a) the rationale for the chosen number of superpixel regions and (b) the combination and comparison of the individual steps of the suggested method. Segmentations with more than 500 superpixels give nearly the same results, so 500 superpixels are sufficient given the trade-off between accuracy and time complexity.
Fig. 10: Analysis of the proposed method. From left to right: (a) different numbers of segmentations and (b) different design options, using the PR curves on the MSRA-1000 dataset.

C. Qualitative Comparison
In the qualitative comparison, our method is visually compared with eight existing methods; some results from the experiments are reported in Figure 11. As can be seen, our approach can handle challenging cases where the background is complex and the objects are not in the center, even though this runs against our assumption that salient objects mainly reside in the center of the image. For example, the first, seventh and eighth rows are good examples in which the objects are not in the center. In the seventh and tenth rows, where the color of the objects is very close to the background, our approach clearly detects the whole salient object while the others either mix it with the background or find it only partially. In the other rows, our approach finds almost the whole objects while the others are confused by the cluttered background.
D. Limitation
Although our approach reveals significant results, it is not applicable to all situations. Figure 12 shows some failure cases of the proposed algorithm. The common characteristics of these images are multiple objects, complex background scenes, and object positions that are not close to the image center. As previously shown in Figure 3, the averaged locations of the ground truths of the datasets indicate that objects are generally close to the center of the images, but there are exceptions such as the Berkley-300 dataset; because of this, our approach cannot easily detect the salient regions on that dataset. One option to overcome this drawback is to estimate the possible positions of salient regions in the pre-processing step using an interest point detection algorithm, e.g., Shi-Tomasi corner detection; finding possible edges and corners in the image content gives a guess of possible object locations. Another solution for complex scenes is to apply a different approach to detect communities between salient regions.

Fig. 11: Qualitative comparison of the saliency maps. Columns from left to right: (a) original, (b) FT, (c) LC, (d) MSSS, (e) SR, (f) HC, (g) RC, (h) CA, (i) SF, (j) ours, and (k) ground truth.

We also have a problem detecting salient objects homogeneously and completely in the image, even though we obtain better MAE and precision scores than the others on all datasets; we do not have a problem with true negatives and false positives, only with false negatives. As can be seen in Figure 12, the objects are not in the center of the image in the first and third rows; in the first row, two objects and the image center are highlighted. In the second row, there is a homogeneous background in the center of the image, which may lead to a wrong outcome; our output may also be distracted by texture, since the salient object and the background are not homogeneous and cannot be grouped under one contrast value. Our approach is not able to highlight the whole salient objects completely in the fourth and fifth rows.
Fig. 12: Some problems. From left to right: (a) original image, (b) color descriptor, (c) complex network descriptor, (d) spatial distribution descriptor, (e) final saliency map, and (f) ground truth.
E. The Run Time
We implemented our method in C++ using the OpenCV library in VS 2013 running on Windows 8.1. All results were obtained on a machine with an Intel Core i7 2.3 GHz processor and 8 GB of memory. For a fair comparison with the other methods, we used the source codes provided by their authors. As can be seen from the results reported in Table I, our method is not the fastest algorithm. However, its running time is under half a second, which means that our approach still executes fast enough for real-time applications. The superpixel segmentation that finds the regions and the complex network step that extracts the communities are the most time-consuming parts of the suggested approach.
TABLE I: Comparison of the average run times

Method    | FT    | LC    | MSSS   | SR     | HC    | RC    | CA     | SF    | Ours
Time (s)  | 0.013 | 0.013 | 0.987  | 0.059  | 0.012 | 0.145 | 51.1   | 0.158 | 0.213
Code      | C++   | C++   | Matlab | Matlab | C++   | C++   | Matlab | C++   | C++
V. CONCLUSIONS AND FUTURE WORK
In the last decade, a large number of researchers have concentrated on various aspects of computer vision, such as object recognition, image segmentation, image re-targeting, image retrieval, etc. The use of the developed algorithms in commercial applications is rapidly increasing, and the input to these applications is critical since it directly affects the output. In this study, we propose a saliency detection algorithm which is more consistent and efficient than others and has an acceptable running time, as demonstrated in the comparison and evaluation described in the experimental section. We evaluated our approach from different perspectives and compared it with state-of-the-art methods by conducting experiments on publicly available datasets, where our results showed clear improvements. First of all, by following the trend of using a region-based approach rather than a pixel-by-pixel comparison, we achieved good results regarding time complexity and the quality of the output saliency map. We also showed how to benefit from the complex network field to improve the result by producing a more compact and clear saliency map. For future work, we will concentrate on handling the existing limitations of the proposed approach, as discussed in the limitations section: first reducing the time complexity, and then increasing the accuracy. To decrease the time complexity, we can choose the GPU architecture as the working environment and improve the algorithm with parallelism. Our algorithm assumes that the salient object is in the center of the image, but this assumption is not always true and can cause wrong detections. To increase the accuracy, we can in particular obtain a better approximation for salient region detection by using convex hull detection of objects at the beginning. We can also test different community detection approaches from the complex network area.
REFERENCES
[1] Stephen E Palmer. Vision science: Photons to phenomenology, volume 1. MIT press Cambridge, MA, 1999.
[2] Sabira K Mannan, Christopher Kennard, and Masud Husain. The role of visual salience in directing eye movements in visual object agnosia. Current biology, 19(6):R247–R248, 2009.
[3] Jeremy M Wolfe and Todd S Horowitz. What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience, 5(6):495–501, 2004.
[4] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. Frequency-tuned salient region detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1597–1604. IEEE, 2009.
[5] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
[6] Xiujun Zhang, Chen Xu, Min Li, and Robert KF Teng. Study of visual saliency detection via nonlocal anisotropic diffusion equation. Pattern Recognition, 48(4):1315–1327, 2015.
[7] Zuoyong Li, Guanghai Liu, David Zhang, and Yong Xu. Robust single-object image segmentation based on salient transition region. Pattern Recognition, 52:317–331, 2016.
[8] Ueli Rutishauser, Dirk Walther, Christof Koch, and Pietro Perona. Is bottom-up attention useful for object recognition? In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–37. IEEE, 2004.
[9] Zhen Liang, Zheru Chi, Hong Fu, and Dagan Feng. Salient object detection using content-sensitive hypergraph representation and partitioning. Pattern Recognition, 45(11):3886–3901, 2012.
[10] Nishant Shrivastava and Vipin Tyagi. Content based image retrieval based on relative locations of multiple regions of interest using selective regions matching. Information Sciences, 259:212–224, 2014.
[11] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. Sketch2photo: internet image montage. In ACM Transactions on Graphics (TOG), volume 28, page 124. ACM, 2009.
[12] Chenlei Guo and Liming Zhang. A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. Image Processing, IEEE Transactions on, 19(1):185–198, 2010.
[13] Guo-Xin Zhang, Ming-Ming Cheng, Shi-Min Hu, and Ralph R Martin. A shape-preserving approach to image resizing. In Computer Graphics Forum, volume 28, pages 1897–1906. Wiley Online Library, 2009.
[14] Stas Goferman, Ayellet Tal, and Lihi Zelnik-Manor. Puzzle-like collage. In Computer Graphics Forum, volume 29, pages 459–468. Wiley Online Library, 2010.
[15] Wolfgang Einhäuser and Peter König. Does luminance-contrast contribute to a saliency map for overt visual attention? European Journal of Neuroscience, 17(5):1089–1097, 2003.
[16] Wei Zhang, QM Jonathan Wu, Guanghui Wang, and Haibing Yin. An adaptive computational model for salient object detection. Multimedia, IEEE Transactions on, 12(4):300–316, 2010.
[17] Robert Fergus, Pietro Perona, and Andrew Zisserman. Object class recognition by unsupervised scale-invariant learning. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–264. IEEE, 2003.
[18] Jimei Yang and Ming-Hsuan Yang. Top-down visual saliency via joint CRF and dictionary learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2296–2303. IEEE, 2012.
[19] Jonathan Harel, Christof Koch, and Pietro Perona. Graph-based visual saliency. In Advances in neural information processing systems, pages 545–552, 2006.
[20] Xiaodi Hou and Liqing Zhang. Saliency detection: A spectral residual approach. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.
[21] Chenlei Guo, Qi Ma, and Liming Zhang. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[22] Yu-Fei Ma and Hong-Jiang Zhang. Contrast-based image attention analysis by using fuzzy growing. In Proceedings of the eleventh ACM international conference on Multimedia, pages 374–381. ACM, 2003.
[23] Laurent Itti and Pierre F Baldi. Bayesian surprise attracts human attention. In Advances in neural information processing systems, pages 547–554, 2005.
[24] Tie Liu, Zejian Yuan, Jian Sun, Jingdong Wang, Nanning Zheng, Xiaoou Tang, and Heung-Yeung Shum. Learning to detect a salient object. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):353–367, 2011.
[25] Yun Zhai and Mubarak Shah. Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th annual ACM international conference on Multimedia, pages 815–824. ACM, 2006.
[26] Yichen Wei, Fang Wen, Wangjiang Zhu, and Jian Sun. Geodesic saliency using background priors. In Computer Vision–ECCV 2012, pages 29–42. Springer, 2012.
[27] Ming-Ming Cheng, Guo-Xin Zhang, Niloy J Mitra, Xiaolei Huang, and Shi-Min Hu. Global contrast based salient region detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 409–416. IEEE, 2011.
[28] Weining Wang, Dong Cai, Xiangmin Xu, and Alan Wee-Chung Liew. Visual saliency detection based on region descriptors and prior knowledge. Signal Processing: Image Communication, 29(3):424–433, 2014.
[29] Lei Zhou, Keren Fu, Yijun Li, Yu Qiao, Xiangjian He, and Jie Yang. Bayesian salient object detection based on saliency driven clustering. Signal Processing: Image Communication, 29(3):434–447, 2014.
[30] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3166–3173. IEEE, 2013.
[31] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, and Shipeng Li. Salient object detection: A discriminative regional feature integration approach. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2083–2090. IEEE, 2013.
[32] Federico Perazzi, Philipp Krahenbuhl, Yael Pritch, and Alexander Hornung. Saliency filters: Contrast based filtering for salient region detection. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 733–740. IEEE, 2012.
[33] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Tie Liu, Nanning Zheng, and Shipeng Li. Automatic salient object segmentation based on context and shape prior. In BMVC, volume 3, page 7, 2011.
[34] Radhakrishna Achanta, Francisco Estrada, Patricia Wils, and Sabine Süsstrunk. Salient region detection and segmentation. In Computer Vision Systems, pages 66–75. Springer, 2008.
[35] Radhakrishna Achanta and S Susstrunk. Saliency detection using maximum symmetric surround. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 2653–2656. IEEE, 2010.
[36] Stas Goferman, Lihi Zelnik-Manor, and Ayellet Tal. Context-aware saliency detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(10):1915–1926, 2012.
[37] Yongdong Zhang, Zhendong Mao, Jintao Li, and Qi Tian. Salient region detection for complex background images using integrated features. Information Sciences, 2014.
[38] Jianteng Peng, Jianbing Shen, Yunde Jia, and Xuelong Li. Saliency cut in stereo images. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pages 22–28. IEEE, 2013.
[39] Dashan Gao, Vijay Mahadevan, and Nuno Vasconcelos. On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of vision, 8(7):13, 2008.
[40] Lingyun Zhang, Matthew H Tong, and Garrison W Cottrell. SUNDAy: Saliency using natural statistics for dynamic analysis of scenes. In Proceedings of the 31st Annual Cognitive Science Conference, pages 2944–2949. AAAI Press Cambridge, MA, 2009.
[41] Jianbing Shen, Yunfan Du, and Xuelong Li. Interactive segmentation using constrained Laplacian optimization. Circuits and Systems for Video Technology, IEEE Transactions on, 24(7):1088–1100, 2014.
[42] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2274–2282, 2012.
[43] Jianbing Shen, Yunfan Du, Wenguan Wang, and Xuelong Li. Lazy random walks for superpixel segmentation. Image Processing, IEEE Transactions on, 23(4):1451–1462, 2014.
[44] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels. Technical report, 2010.
[45] Jifeng Ning, Lei Zhang, David Zhang, and Chengke Wu. Interactive image segmentation by maximal similarity based region merging. Pattern Recognition, 43(2):445–456, 2010.
[46] Boris Schauerte and Rainer Stiefelhagen. How the distribution of salient objects in images influences salient object detection.
[47] Oscar Cuadros, Glenda Botelho, Francisco Rodrigues, and Joao Batista Neto. Segmentation of large images with complex networks. In Graphics, Patterns and Images (SIBGRAPI), 2012 25th SIBGRAPI Conference on, pages 24–31. IEEE, 2012.
[48] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical saliency detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1155–1162. IEEE, 2013.
[49] Vida Movahedi and James H Elder. Design and perceptual validation of performance measures for salient object segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pages 49–56. IEEE, 2010.