Pattern Recognition Letters 34 (2013) 1270–1278
Optimal contrast based saliency detection

Xiaoliang Qian, Junwei Han *, Gong Cheng, Lei Guo

School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, PR China

* Corresponding author. Tel./fax: +86 29 88431318. E-mail address: [email protected] (J. Han).
Article info

Article history: Received 13 June 2012; Available online 24 April 2013; Communicated by S. Wang.

Keywords: Saliency detection; Sparse coding; Optimal contrast; Multi-scale; Eye tracking
Abstract

Saliency detection has been gaining increasing attention in recent years since it can significantly boost many content-based multimedia applications. Most traditional approaches adopt a predefined local contrast, a global contrast, or a heuristic combination of the two to measure saliency. In this paper, based on the underlying premises that human visual attention mechanisms work adaptively across scales and that salient objects pop out maximally with respect to the background within a specific surrounding area, we propose a novel saliency detection method built on a new concept of optimal contrast. A number of contrast hypotheses are first calculated with various surrounding areas by means of sparse coding principles. These hypotheses are then compared using an entropy-based criterion, and the optimal contrast, which is treated as the core factor for building the saliency map, is selected. Finally, a multi-scale enhancement is performed to further refine the results. Comprehensive evaluations on three publicly available benchmark datasets and comparisons with many up-to-date algorithms demonstrate the effectiveness of the proposed work.

© 2013 Elsevier B.V. All rights reserved.
1. Introduction

When browsing cluttered visual scenes, human vision systems can focus on salient locations rapidly. In this process, visual attention mechanisms play a key role. The study of visual attention is one of the important topics in computer vision since it can facilitate a wide range of applications such as object recognition (Rutishauser et al. 2004), image segmentation (Han et al. 2006), image and video compression (Wang et al. 2003), and seam carving (Rubinstein et al. 2008). Over the past decades, visual attention has been extensively studied. Koch and Ullman (1985) developed the first computational model. Its key contribution is to define a 2D topographical map, called the saliency map, that encodes the conspicuity at every location of the image. One of the most successful computational models was presented by Itti et al. (1998). It basically assumes that contrast is the central factor determining human visual attention under free viewing. The center-surround (C–S) operation was introduced to model three contrasts, in intensity, color, and orientation, which were finally linearly combined into a saliency map. Following Itti's framework, a variety of computational models have been proposed. In general, existing approaches can be broadly classified into three categories: (1) local contrast based approaches, (2) global contrast based approaches, and (3) local-global contrast based approaches, in which the local and global context are separately considered and then integrated.
[email protected] (J. Han). 0167-8655/$ - see front matter Ó 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.patrec.2013.04.009
A majority of previous saliency models fall into the first category, which examines the distinctiveness of image locations with respect to their local, small neighborhoods. Itti et al. (1998) employed a "Difference of Gaussians" (DoG) across multiple scales to calculate local contrasts. Likewise, Ma and Zhang (2003) proposed a simple idea of calculating the distance between the color distributions of a location and its surrounding locations within a window. Harel et al. (2007) followed Itti's method to generate feature maps and performed feature normalization using a graph based algorithm. Bruce and Tsotsos (2009) proposed to modulate the local contrast based on information maximization. Seo and Milanfar (2009) explored local regression kernels to build a "self-resemblance" map, which measures the similarity between the feature vector at a location and those in its neighborhood. Garcia-Diaz et al. (2009) proposed a biologically plausible model called Adaptive Whitening Saliency (AWS), which is based on the decorrelation and the distinctiveness of local responses. Other representative works using local contrast to account for visual saliency include (Han et al., 2011; Itti and Baldi, 2005; Li et al., 2009) and so on. Despite the evidence from the cognitive science literature that local contrast directs attention (Nothdurft 2005; Zetzsche 2005), this view has recently been questioned by a growing body of research (Achanta et al. 2009; Cheng et al. 2011), which argues that local contrast based methods tend to yield excessively high saliency values near object boundaries instead of uniformly highlighting whole salient objects. Accordingly, global contrast based solutions, which calculate the rarity of locations over the entire image,
have been suggested. Zhang et al. (2008) proposed a model called SUN using a Bayesian framework. In this framework, bottom-up saliency is measured by inferring generalized Gaussian distributions over the entire image. Other influential global contrast based saliency detection algorithms include Bruce and Tsotsos (2005) and Hou and Zhang (2008). Alternatively, modeling the global contrast in the frequency domain has become popular in recent years. In Hou and Zhang (2007), the Spectral Residual, defined as the difference between the raw and the smoothed log Fourier amplitude spectrum of an image, was used. Nevertheless, Guo et al. (2008) argued that the amplitude spectrum is not decisive and explored the saliency map by focusing on the phase spectrum of the Fourier Transform. Achanta et al. (2009) provided a frequency-tuned approach relying on DoG band-pass filters. Lately, Hou et al. (2012) utilized the sign of each Discrete Cosine Transform component, which preferentially contains the foreground information, to detect saliency.

Instead of relying on only one component (local or global contrast), the third category of methods computes the two components separately and then integrates them to generate a local-global contrast. Liu et al. (2011) proposed three features, namely multi-scale contrast, C–S histogram, and color spatial distribution, to describe salient objects locally, regionally, and globally, respectively. A conditional random field model was then learned to effectively combine these features for salient object detection. Goferman et al. (2010) presented an algorithm based on four principles observed in the psychological literature, where local and global information correspond to the first and second principle, respectively. Torralba et al. (2006) proposed a Bayesian framework that combines local saliency based features and global scene based features to predict salient regions. More recently, Borji and Itti (2012) and He et al. (2011) developed frameworks that calculate local and global rarities of each image patch separately using sparse coding and finally fuse them to indicate saliency. Cheng et al. (2011) first adopted a histogram-based approach for measuring the global contrast. A region based contrast was then applied to refine the results. This method achieves superior performance for extracting salient objects. However, how to optimally fuse the local and global information in a way that is supported by biological evidence is still an open problem.

Essentially, both the local and global contrast can be implemented by the C–S operation. When the surrounding region is small, the local contrast is obtained, whereas the global contrast is obtained when the surround approaches the entire image. The ultimate objective of saliency detection is to discover meaningful entities or objects that can attract human attention. As shown in Einhauser et al. (2008) and Elazary and Itti (2008), interesting objects in a scene are more likely to capture human attention. Every interesting object in an image has its own size and can visually pop out maximally from the contextual background at a specific scale. This scale can be local, global, or any other. As shown in the first row of Fig. 1, the wall clock can be highlighted using the local contrast (Itti et al.
1998) while the global contrast (Hou and Zhang 2007) preferentially detects the boundary of the wall instead of the clock, because the clock is not salient with respect to the entire image (the color of the wall clock is quite similar to that of the roof). In this example, the local contrast can be considered the best feature to model the saliency. As shown in the second row of Fig. 1, both the rider and the background are highlighted using local contrast. In contrast, the rider is detected using global contrast because of its rarity over the entire image. In this example, the global contrast is more appropriate for detecting visual saliency. In the final example, shown in the third row of Fig. 1, since the traffic sign and the sky have similar colors, neither the local contrast nor the global contrast can capture the salient object properly. Furthermore, the method of Goferman et al. (2010),
which combines the local and global context, does not achieve correct results for this case either. As can be observed from these examples, the saliency of each object can be optimally detected by measuring the contrast between the object itself and a surrounding area of a specific size. We call this contrast the "optimal contrast". The optimal contrast of a salient object can be the local contrast, the global contrast, or some other contrast. This naturally motivates us to propose a novel method that detects saliency by calculating the optimal contrast instead of directly using the local or global contrast as in most previous approaches. Our underlying premise is that an optimal contrast adaptively estimated for every interesting object enables visual saliency to be detected more precisely.

In this paper, we propose an optimal contrast framework for saliency detection in which the size of the surrounding region is determined adaptively by an entropy-based criterion. First, we adopt the weighted sparse coding residual (WSCR) algorithm (Han et al. 2011) to calculate the contrast. A set of contrast hypotheses is obtained by varying the size of the surrounding area in WSCR. Afterwards, we propose an entropy based scheme to select the optimal contrast from these contrast hypotheses. The optimal contrast is subsequently utilized to measure the visual saliency.

Based on their tasks, existing saliency models can also be broadly divided into two classes: fixation prediction methods (Bruce and Tsotsos 2009; Judd et al. 2009; Hou et al. 2012; Borji and Itti 2012) and salient object detection methods (Achanta et al. 2009; Cheng et al. 2011; He et al. 2011; Liu et al. 2011). As demonstrated in Borji et al. (2012), fixation prediction models, on average, perform worse than salient object detection models on human-annotated datasets while performing better on eye-tracking datasets. This is because fixation prediction requires distinguishing local regions, whereas salient object detection or segmentation requires merging them. Therefore, it is not easy to develop an algorithm that achieves good performance on both types of benchmarks. The principal purpose of the proposed work is to solve the task of fixation prediction. However, in order to also achieve reasonably good performance for the task of salient object detection, we adopt a scheme of multi-scale saliency enhancement in which the scale of the center area can be changed. When the scale of the center area is small, small objects or the boundaries of large objects are highlighted. On the contrary, the inner regions of large objects are highlighted when the scale is large. The summation of multi-scale saliency maps can deal with salient objects of various scales.

The major contributions of the proposed work can be summarized as follows. (1) An optimal contrast scheme is proposed, which implements the C–S operation with a surrounding region whose size is inferred adaptively by an entropy-based criterion. (2) A multi-scale enhancement scheme is introduced to further refine the performance of the optimal contrast for the salient object detection task. (3) Quantitative comparisons of our method with 13 state-of-the-art approaches on two public eye-tracking datasets and with 12 state-of-the-art approaches on a human-annotated benchmark dataset are performed. The results demonstrate that the proposed work is promising.

The remainder of this paper is organized as follows.
Section 2 presents the proposed optimal contrast based saliency detection approach. Section 3 reports the experimental results. Finally, we conclude with a discussion in Section 4.
2. The optimal contrast based saliency model

2.1. Contrast modulation based on the sparse coding principle

Experimental studies (Olshausen and Field 1996) have shown that the receptive fields of simple cells in the primary visual cortex
Fig. 1. The saliency maps computed by using various contrasts. (a) Original images. (b) Saliency maps using local contrast (Itti et al. 1998). (c) Saliency maps using global contrast (Hou and Zhang 2007). (d) Saliency maps using local-global contrast (Goferman et al. 2010). (e) Saliency maps using proposed optimal contrast. (f) Ground truth given by humans. The first two examples are selected from Bruce’s eye-tracking dataset (Bruce and Tsotsos 2005) and the final one is from Achanta’s human-annotated dataset (Achanta et al. 2009).
produce a sparse representation, which attempts to represent a high-dimensional original signal using a few representative atoms on a low-dimensional manifold. Accordingly, this paper defines the contrast by following the WSCR (Han et al. 2011), which is based on the sparse coding principle. The underlying rationale behind this idea is to encode a center location using its surrounding locations. The features used for sparse coding are the raw data (RGB values) of the image. The combination of the coding sparseness and the residual can reflect the difference between the center location and its context. Compared with traditional approaches, the contrast calculated using sparse codes reveals the structural difference between patches; therefore, this method is effective and robust. Given a patch $x_i^l$ of size $l \times l$ in an image of size $M \times N$, it is sparsely coded with its surrounding patches within a surrounding area, where overlapping is allowed (Li et al. 2009):
$x_i^l \approx D_{i,s}^l \alpha_{i,s}^l$    (1)

where the surrounding area is set to $s \times s$, $\alpha_{i,s}^l$ is the sparse code of $x_i^l$, and $D_{i,s}^l$ is the dictionary constructed from all surrounding patches. Here, the size of the surrounding area can be defined as:
$s = (2m + 1)l$, where $m$ is an integer and $1 \le m \le \dfrac{\max(M, N)}{2l} - \dfrac{1}{2}$    (2)
Eq. (2) ensures that the surrounding area is symmetrically centered at the center patch and does not exceed the image boundary. Fig. 2(a) illustrates the center patch and its overlapping surrounding patches. The rationale behind Eq. (1) is to represent $x_i^l$ approximately by its surrounding patches. According to Li et al. (2009), the sparse coding length $\|\alpha_{i,s}^l\|_0$ can indicate the similarity between the patch $x_i^l$ and its surroundings. Furthermore, Han et al. (2011) demonstrated that the residual also plays an important role in measuring the difference between $x_i^l$ and its surroundings:
$r_{i,s}^l = x_i^l - D_{i,s}^l \alpha_{i,s}^l$    (3)
where $r_{i,s}^l$ is the residual of the sparse coding. Thus, the combination of $\|\alpha_{i,s}^l\|_0$ and $r_{i,s}^l$ is used to modulate the contrast with respect to the surrounding area of size $s$ (Han et al. 2011):
$C_s^l(x_i^l) = \|\alpha_{i,s}^l\|_0 \cdot \|r_{i,s}^l\|_1$    (4)
It is worth noting that the surrounding area of patches located near the image boundary is incomplete, so the dictionaries of these patches contain fewer atoms than normal. Similar to Han et al. (2011), we compensate for this by multiplying by the size of the dictionary, and Eq. (4) is modified as:
$C_s^l(x_i^l) = \|\alpha_{i,s}^l\|_0 \cdot \|r_{i,s}^l\|_1 \cdot |D_{i,s}^l|$    (5)
where $|D_{i,s}^l|$ is the cardinality of the dictionary. The calculation of $\alpha_{i,s}^l$ and $r_{i,s}^l$ can be converted to an $\ell_1$-norm minimization problem (Donoho 2006):
$\min_{\alpha_{i,s}^l} \dfrac{1}{2}\|r_{i,s}^l\|_2^2 + \lambda\|\alpha_{i,s}^l\|_1$    (6)
where $\lambda$ is a parameter controlling the tradeoff between sparsity and distortion. The optimization problem in Eq. (6) is essentially the linear regression known as the Lasso (Tibshirani 1996), which can be solved with the LARS algorithm (Efron et al. 2004). Similar to Li et al. (2009), the pixel-level contrast is generated by accumulating the patch-level contrasts at every pixel, since overlapping of surrounding patches is allowed. It is denoted by $C_s^l(I_i)$, where $I_i$ indicates a pixel in the image.
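For illustration, the patch-level contrast of Eqs. (1)-(6) can be sketched as follows. This is a minimal Python sketch, not the implementation used in our experiments (which relies on the SPAMS toolbox, see Section 3); it uses scikit-learn's LassoLars solver, whose regularization parameter is scaled differently from the $\lambda$ in Eq. (6), and the function and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def patch_contrast(center_patch, surround_patches, lam=0.1):
    """Contrast of one center patch with respect to its surround dictionary, Eq. (5)."""
    # Dictionary D: one column per surrounding patch (flattened raw RGB values).
    D = np.stack([p.ravel() for p in surround_patches], axis=1).astype(float)
    x = center_patch.ravel().astype(float)
    # Eq. (6): min_a 0.5*||x - D a||_2^2 + lam*||a||_1, solved along the LARS path.
    solver = LassoLars(alpha=lam, fit_intercept=False)
    solver.fit(D, x)
    a = solver.coef_                      # sparse codes alpha of Eq. (1)
    r = x - D @ a                         # sparse coding residual, Eq. (3)
    # Eq. (5): coding length * l1 norm of the residual * dictionary cardinality.
    return np.count_nonzero(a) * np.sum(np.abs(r)) * D.shape[1]
```

The pixel-level contrast $C_s^l(I_i)$ would then be obtained by accumulating these patch-level values at every pixel covered by an overlapping patch.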
2.2. Selection of the optimal contrast

As described before, there may exist a specific size of surrounding area for each object within which the object pops out maximally with respect to the background. This paper develops an entropy-based criterion to compare the contrast hypotheses generated using different sizes of the surrounding area and to select the optimal one. As we vary the size of the surrounding area $s$ in Eq. (1), a number of contrast hypotheses are obtained, as illustrated in Fig. 2(b). Based on each contrast hypothesis, we can build a candidate saliency map. We utilize a quantized histogram to approximate the probability distribution of the saliency values $C_s^l(I_i)$ in the candidate saliency map. The Shannon entropy is calculated from the histogram as:

$H_s^l = -\sum_i p_i \log p_i$    (7)
Fig. 2. (a) Overlapping surrounding patches of the center patch. (b) Illustration of finding the optimal contrast.
where $p_i$ indicates the probability of the $i$th bin of the histogram. If we treat saliency detection as a classification problem, the image can be classified into salient regions and non-salient regions. A good saliency map should uniformly highlight the salient regions while suppressing the non-salient regions. This implies that the saliency values in the histogram of the saliency map will form a few clusters instead of being distributed uniformly over all bins, and in this case the corresponding entropy is generally small. Therefore, the contrast with minimal entropy is preferably selected as the optimal contrast. Fig. 3 provides a few examples of saliency maps corresponding to different contrasts computed by varying $s$. As can be seen from the examples, the saliency maps with the minimum entropy value show a clear difference between the salient regions and non-salient regions. The corresponding surrounding-area sizes for these three examples are $3l$, $5l$, and $7l$, respectively. However, determining the optimal contrast based on minimum entropy alone may not be effective for all data, so prior knowledge of the probability distribution of entropy is incorporated to improve performance. Based on observations from the eye-tracking dataset of Judd et al. (2009), we assume the prior probability distribution of entropy is a Gaussian function. We used the entropy data of this dataset to fit the Gaussian and learn its parameters. Afterwards, given a saliency map with entropy $h$, its prior probability is estimated as:
$P(h) = \dfrac{1}{\sqrt{2\pi}\,\delta} \exp\left(-\dfrac{(h - \mu)^2}{2\delta^2}\right)$    (8)
where $\mu$ and $\delta$ denote the mean and standard deviation, respectively. The fitted Gaussian distribution can be regarded as prior knowledge of the real data. Thus, a candidate saliency map should be preferred if its prior probability $P(H_s^l)$ is high, which implies it is sufficiently in accordance with the real data. We define the criterion for selecting the optimal contrast by combining the entropy and its prior probability as follows:
$s_{optimal} = \arg\min_s \left\{ \dfrac{H_s^l}{P(H_s^l)} \right\}$    (9)
Finally, given an image, we use the optimal contrast to yield the single-scale saliency map:
$S^l(I_i) = C_{s_{optimal}}^l(I_i)$    (10)
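As a rough sketch of the selection step, the following Python outline implements Eqs. (7)-(10) for a set of candidate contrast maps. The Gaussian parameters mu and delta are placeholders for the values fitted on the entropy data of the Judd et al. (2009) dataset, which are not reported numerically here, and the function names are illustrative.

```python
import numpy as np

def map_entropy(contrast_map, bins=256):
    """Shannon entropy of the quantized histogram of saliency values, Eq. (7)."""
    hist, _ = np.histogram(contrast_map, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return -np.sum(p * np.log(p))

def entropy_prior(h, mu, delta):
    """Gaussian prior probability of an entropy value, Eq. (8)."""
    return np.exp(-(h - mu) ** 2 / (2 * delta ** 2)) / (np.sqrt(2 * np.pi) * delta)

def select_optimal_contrast(candidates, mu, delta):
    """candidates: dict mapping each surround size s to its contrast map C_s^l."""
    def criterion(c):                     # Eq. (9): entropy divided by its prior
        h = map_entropy(c)
        return h / entropy_prior(h, mu, delta)
    s_opt = min(candidates, key=lambda s: criterion(candidates[s]))
    return candidates[s_opt], s_opt       # Eq. (10): the selected map is the saliency map
```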
2.3. Multi-scale saliency enhancement

A scheme of multi-scale saliency enhancement is adopted to further refine the results and to achieve reasonably good performance for the task of salient object detection. In this scheme, the scale of the center area can be changed. Small scales are more sensitive to small objects or object boundaries, whereas large scales are more likely to highlight the inner regions of large objects. The summation of multi-scale saliency maps can deal with salient objects of various scales. Specifically, let $L = \{l, 1.5l, 2l, \ldots, 0.5(K + 1)l\}$, $K = 1, 2, \ldots$ be the set of patch sizes to be considered for a center. Based on each
Fig. 3. The results generated by Eq. (5) and the corresponding entropy values shown under each saliency map. The sample images are from Bruce’s eye-tracking dataset (Bruce and Tsotsos 2005). Here, l = 8, s = 3l, 5l, and 7l respectively.
patch size, we generate a saliency map. The multi-scale saliency map is calculated as follows:
$S(I_i) = \dfrac{1}{K}\sum_{j \in L} S^j(I_i)$    (11)
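A minimal sketch of Eq. (11), assuming a hypothetical single_scale_saliency routine that wraps Eqs. (1)-(10) for a given patch size, is given below.

```python
import numpy as np

def multiscale_saliency(image, single_scale_saliency, l=4, K=4):
    """Eq. (11): average the single-scale maps over patch sizes {l, 1.5l, ..., 0.5(K+1)l}."""
    # With l = 4 and K = 4, the patch sizes are 4, 6, 8 and 10, as in Fig. 4.
    sizes = [int(0.5 * (k + 1) * l) for k in range(1, K + 1)]
    maps = [single_scale_saliency(image, patch_size=s) for s in sizes]
    return np.mean(maps, axis=0)
```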
To illustrate that the multi-scale saliency enhancement can improve the performance of the proposed algorithm for the task of salient object detection, a few examples containing large salient objects are selected from Achanta's human-annotated dataset (Achanta et al. 2009) and shown in Fig. 4. The boundary of the salient object is highlighted at small scales, whereas the inner region is detected at large scales. The final results can uniformly highlight the whole object to some extent.

3. Experiments

The proposed work has been comprehensively evaluated using three publicly available datasets: Bruce's eye-tracking dataset (BET) (Bruce and Tsotsos 2005), Judd's eye-tracking dataset (JET) (Judd et al. 2009), and Achanta's human-annotated dataset (AHA) (Achanta et al. 2009). We compared our method with 13 state-of-the-art saliency detection methods: AIM (Bruce and Tsotsos 2009), SUN (Zhang et al. 2008), SR (Hou and Zhang 2007), FT (Achanta et al. 2009), IS (Lab-Signature) (Hou et al. 2012), ITTI (Itti et al. 1998) (using the Matlab code provided by Harel et al. (2007)), CA (Goferman et al. 2010), RC (Cheng et al. 2011), GBVS (Harel et al. 2007), JUDD (Judd et al. 2009), ISSD (Li et al. 2009), WSCR (Han et al. 2011), and AWS (Garcia-Diaz et al. 2009). These approaches are selected for comparison mainly because (1) they were published during the last few years; (2) they were published in major computer vision conferences or journals, for example CVPR, ICCV, NIPS, MM, and IEEE PAMI; and (3) their source code, executable code, or results on popular benchmarks were provided by the authors themselves. In addition, to demonstrate the contribution of the prior knowledge used in our criterion (Eq. (9)), we also quantitatively compared the approach using the prior knowledge, denoted Ours, with the one without the prior knowledge, denoted NPK (No Prior Knowledge). For fast computation, similar to Han et al. (2011), the original images are down-sampled to 0.1 times the original size for the BET
and AHA datasets and to 0.08 times for the JET dataset. Every image is divided into patches of $l \times l$ pixels with an overlap of $l/2$ pixels in each direction. Both $l$ and $K$ (the number of scales) in Eq. (11) are set to 4. The sparse coding parameter $\lambda$ in Eq. (6), which controls the tradeoff between sparsity and distortion, is set to 0.1. To implement the LARS algorithm, we use the recently proposed SPAMS toolbox (Mairal et al. 2010). It is worth mentioning that the algorithm works well on a large set of images using a fixed set of parameter values, without any parameter tuning on individual images.

3.1. Comparisons on eye-tracking datasets

In this experiment, we compared our method with 13 state-of-the-art algorithms on two publicly available eye-tracking datasets. Fig. 5(a) and (b) provide a set of qualitative comparison results. It can be seen that our method yields saliency maps that are closer to the ground-truth human fixation maps than those of most state-of-the-art algorithms. Most existing works used the Receiver Operating Characteristic (ROC) curve and the area under the ROC curve (AUC) as the metrics for quantitative evaluation. However, as pointed out by Borji et al. (2013), ROC and AUC cannot properly deal with the problem of center-bias. For a fair comparison, Borji et al. (2013), Hou et al. (2012), and Borji and Itti (2012) proposed to use the shuffled AUC score (Zhang et al. 2008) as the metric, which eliminates the effect of center-bias. Accordingly, this paper also adopts this metric. Smoothing the final saliency map of a model also affects the AUC score (Hou et al. 2012). As pointed out by Hou et al. (2012), for a fair comparison it is necessary to parameterize the standard deviation of the blurring kernel and evaluate the performance of algorithms under different blurring conditions. Fig. 6(a) presents the shuffled AUC scores of the models over a range of standard deviations $\sigma$ of the Gaussian kernel, expressed as a fraction of the image width (from 0.01 to 0.13 in steps of 0.005). It is worth noting that we did not compare our method with CA (Goferman et al. 2010), AWS (Garcia-Diaz et al. 2009), and WSCR (Han et al. 2011) in this map smoothing experiment because their authors did not provide the full source code (CA and AWS only provide the pcode, and WSCR only provides the final results); we only list their shuffled AUC scores with their default settings in Table 1.
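For reference, the evaluation protocol just described can be sketched as follows. This is an illustrative Python outline, not the evaluation code actually used; it assumes fixations are given as (row, column) pixel coordinates and that negatives for the shuffled AUC are fixation locations borrowed from other images, following Zhang et al. (2008).

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def shuffled_auc(saliency, fixations, other_fixations):
    """Positives: saliency at this image's fixations; negatives: at other images' fixations."""
    pos = saliency[fixations[:, 0], fixations[:, 1]]
    neg = saliency[other_fixations[:, 0], other_fixations[:, 1]]
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, np.concatenate([pos, neg]))

def blur_sweep(saliency, fixations, other_fixations, image_width):
    """Shuffled AUC as a function of the blur std, given as a fraction of image width."""
    return {round(float(r), 3): shuffled_auc(gaussian_filter(saliency, sigma=r * image_width),
                                             fixations, other_fixations)
            for r in np.arange(0.01, 0.135, 0.005)}
```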
Fig. 4. Comparisons between the single-scale saliency maps and multi-scale saliency maps. The sample images are from Achanta’s human-annotated dataset (Achanta et al. 2009). The single-scale saliency maps are generated using Eq. (10) in which l = 4, 6, 8 and 10, respectively. The multi-scale saliency maps are generated using Eq. (11).
Fig. 5. Qualitative comparisons on BET, JET and AHA dataset.
Fig. 6. Quantitative model comparisons. (a) Comparisons on BET and JET dataset. Y-axis indicates the shuffled AUC score, X-axis indicates the STD (standard deviation) of Gaussian kernel (in image width) by which final saliency maps are smoothed. The maximum score of each method is labeled as a dot on the plot. (b) Comparisons of models using precision-recall curves on AHA dataset.
The maximum shuffled AUC score for each model is shown in Table 1. As shown in Fig. 6(a), the shuffled AUC scores of the models on the JET dataset are noticeably lower than those on the BET dataset, which might be due to the diversity of images in the JET dataset. As shown in Fig. 6(a), our saliency model achieves the best performance in the map smoothing experiment (please note that we did not compare with AWS in this experiment because only the pcode of AWS is available and we were not able to tune its parameters). However, as shown in Table 1, on the BET and JET datasets the highest shuffled AUC scores of our model are slightly lower than those of AWS. As reported in recently published literature (Borji and Itti 2013; Borji et al. 2013), AWS has been shown to perform very well and is the best-performing model on eye-tracking
datasets in terms of the shuffled AUC metric. The performance of our model is comparable with that of AWS on the eye-tracking datasets, which demonstrates the effectiveness of our work.

3.2. Comparisons on the human-annotated dataset

We also conducted evaluations on the AHA dataset (Achanta et al. 2009), whose ground truth takes the form of accurate human-marked indications of salient objects. It is worth noting that we did not compare our method with WSCR (Han et al. 2011) in this experiment because its authors did not provide their experimental results on the AHA dataset. Fig. 5(c) provides three comparison examples. It can be seen that the saliency maps yielded
Table 1. Maximum performance of the models shown in Fig. 6(a): for each model, the highest shuffled AUC score on the BET and JET datasets and the STD (σ) value at which that maximum is attained ("–": score at the method's default setting, as no smoothing sweep was possible).

Model   BET max AUC   BET σ   JET max AUC   JET σ
AIM     0.685         0.045   0.674         0.035
SUN     0.666         0.04    0.653         0.035
SR      0.690         0.035   0.655         0.05
FT      0.607         0.035   0.590         0.045
IS      0.712         0.04    0.670         0.05
ITTI    0.637         0.035   0.639         0.045
CA      0.692         –       0.668         –
RC      0.619         0.02    0.598         0.025
GBVS    0.647         0.02    0.642         0.015
JUDD    0.684         0.025   0.661         0.02
ISSD    0.683         0.015   0.652         0.015
WSCR    0.677         –       0.647         –
AWS     0.722         –       0.685         –
NPK     0.714         0.045   0.679         0.035
Ours    0.721         0.04    0.681         0.045
by our model are closer to the human-marked maps than those of most state-of-the-art algorithms. Following representative works on the task of salient object detection (Achanta et al. 2009; Cheng et al. 2011), we adopted the precision-recall curve as the performance metric. The comparison results are shown in Fig. 6(b). As can be seen, RC achieves the best performance on this dataset and our method is the second best. The $\sigma$ of the Gaussian kernel for the final smoothing of our method was set to 4% of the image width (0.04) on the AHA dataset. The effect of center-bias on the AHA dataset should also be considered. Through comprehensive experiments, Borji et al. (2012) concluded that adding center-bias improves the accuracy of low-performing models while reducing the accuracy of good models; nevertheless, this accuracy change is insignificant and does not alter model rankings. Therefore, this paper compares the various models directly using the normal precision-recall curve as the metric. In summary, according to our experimental results shown in Fig. 6 and Table 1, the performance of the proposed model is slightly lower than that of AWS (the best model on the eye-tracking datasets) on the eye-tracking datasets, but better than AWS on the human-annotated dataset. RC is unsatisfactory on the eye-tracking datasets although it performs best on the human-annotated dataset. The proposed model obtains generally promising results on both types of datasets. It is worth noting that NPK (without the prior knowledge) is worse than the proposed algorithm (with the prior knowledge), as shown in Fig. 6 and Table 1, which demonstrates that the prior knowledge is useful.
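The precision-recall protocol on the AHA dataset can be sketched as follows; this is an illustrative outline of the standard fixed-thresholding procedure of Achanta et al. (2009), not the authors' evaluation code.

```python
import numpy as np

def precision_recall_curve(saliency_maps, masks):
    """saliency_maps: maps scaled to [0, 255]; masks: boolean human-marked ground truth."""
    precisions, recalls = [], []
    for t in range(256):                              # one curve point per fixed threshold
        tp = fp = fn = 0
        for s, m in zip(saliency_maps, masks):
            detected = s >= t
            tp += np.sum(detected & m)
            fp += np.sum(detected & ~m)
            fn += np.sum(~detected & m)
        precisions.append(tp / max(tp + fp, 1))       # guard against empty detections
        recalls.append(tp / max(tp + fn, 1))
    return np.array(precisions), np.array(recalls)
```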
4. Conclusions

In this paper, we have proposed a novel and easy-to-implement visual saliency detection framework based on the optimal contrast rather than a fixed local contrast, global contrast, or their combination. The optimal contrast is selected from a number of C–S contrast hypotheses by using an entropy-based criterion. These hypotheses are generated by varying the size of the surrounding area. The entropy-based criterion takes into consideration both the entropy and prior knowledge of the probability distribution of entropy calculated on an eye-tracking database. We have evaluated the proposed algorithm on three publicly available benchmarks and two conclusions can be drawn. First, the performance of the proposed algorithm is comparable to that of state-of-the-art methods (for example, AWS for the eye-tracking datasets and RC for the human-annotated dataset). Second, the prior knowledge used in the entropy-based criterion is useful for enhancing the accuracy of our model. Additionally, another important achievement of this paper is that we conducted fair evaluations for a variety of saliency detection models across three testing databases by considering the effect of center-bias. Our experimental results have shown that AWS and RC perform best on the eye-tracking datasets and the human-annotated dataset, respectively. The rankings of many models differ across datasets. However, we still observe that a few approaches obtain promising results on all three datasets, including the proposed method, JUDD (Judd et al. 2009), and AWS (Garcia-Diaz et al. 2009). We will extend and improve the proposed work in the following directions. (1) In the proposed work, we used the WSCR of Han et al. (2011) to generate the contrast hypotheses due to its good performance. In principle, any other C–S mechanism that can calculate contrast information can be used in our framework, and the performance may be improved by finding a more appropriate C–S contrast scheme to generate the contrast hypotheses. (2) Our model
is a pure bottom-up computational model. Essentially, high-level knowledge plays an important role in guiding visual attention, even in the free-viewing condition. Therefore, the combination of a number of top-down factors, such as meaningful objects, text, actions, interactions, and context, is believed to be able to improve the accuracy of fixation prediction. (3) The proposed work detects visual saliency in static images. An important future task is to extend this model to saliency detection in dynamic videos by integrating temporal information. (4) We will also attempt to apply our saliency model to many real-world multimedia applications, including image retrieval, image categorization, and image collection visualization.

Acknowledgements

We thank the two anonymous reviewers for their valuable suggestions to improve the quality of this paper. J. Han was supported by the National Science Foundation of China under Grants 61005018 and 91120005, NPU-FFR-JC20120237, and NCET-100079.

References

Achanta, R., Hemami, S., Estrada, F., Susstrunk, S., 2009. Frequency-tuned salient region detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), Miami, FL, USA, pp. 1597–1604.
Borji, A., Itti, L., 2012. Exploiting local and global patch rarities for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'12), Providence, Rhode Island, pp. 478–485.
Borji, A., Itti, L., 2013. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), 185–207.
Borji, A., Sihite, D., Itti, L., 2013. Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study. IEEE Trans. Image Process. 22 (1), 55–69.
Borji, A., Sihite, D.N., Itti, L., 2012. Salient object detection: a benchmark. In: Proceedings of the European Conference on Computer Vision (ECCV'12), Firenze, Italy, pp. 414–429.
Bruce, N., Tsotsos, J., 2005. Saliency based on information maximization. In: Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS'05), Vancouver, British Columbia, Canada, pp. 155–162.
Bruce, N.D.B., Tsotsos, J.K., 2009. Saliency, attention, and visual search: an information theoretic approach. J. Vis. 9 (3), 5:1–24.
Cheng, M.M., Zhang, G.X., Mitra, N.J., Huang, X., Hu, S.M., 2011. Global contrast based salient region detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), Colorado Springs, USA, pp. 409–416.
Donoho, D.L., 2006. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Commun. Pure Appl. Math. 59 (7), 907–934.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Ann. Statist. 32 (2), 407–499.
Einhauser, W., Spain, M., Perona, P., 2008. Objects predict fixations better than early saliency. J. Vis. 8 (14), 18:1–26.
Elazary, L., Itti, L., 2008. Interesting objects are visually salient. J. Vis. 8 (3), 3:1–15.
Garcia-Diaz, A., Fdez-Vidal, X., Pardo, X., Dosil, R., 2009. Decorrelation and distinctiveness provide with human-like saliency. In: Proceedings of the Advanced Concepts for Intelligent Vision Systems, Bordeaux, France, pp. 343–354.
Goferman, S., Zelnik-Manor, L., Tal, A., 2010. Context-aware saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'10), San Francisco, CA, USA, pp. 2376–2383.
Guo, C., Ma, Q., Zhang, L., 2008. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, Alaska, USA, pp. 1–8.
Han, B., Zhu, H., Ding, Y., 2011. Bottom-up saliency based on weighted sparse coding residual. In: Proceedings of the ACM International Conference on Multimedia (MM'11), Scottsdale, Arizona, USA, pp. 1117–1120.
Han, J.W., Ngan, K.N., Li, M.J., Zhang, H.H., 2006. Unsupervised extraction of visual attention objects in color images. IEEE Trans. Circuits Syst. Video Technol. 16 (1), 141–145.
Harel, J., Koch, C., Perona, P., 2007. Graph-based visual saliency. In: Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS'07), Vancouver, Canada, pp. 545–552.
He, S., Han, J.W., Hu, X.T., Xu, M., Guo, L., Liu, T.M., 2011. A biologically inspired computational model for image saliency detection. In: Proceedings of the ACM International Conference on Multimedia (MM'11), Scottsdale, Arizona, USA, pp. 1465–1468.
Hou, X., Zhang, L., 2007. Saliency detection: a spectral residual approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), Minneapolis, Minnesota, USA, pp. 1–8.
Hou, X., Zhang, L., 2008. Dynamic visual attention: searching for coding length increments. In: Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS'08), Vancouver, B.C., Canada, pp. 681–688.
Hou, X.D., Harel, J., Koch, C., 2012. Image signature: highlighting sparse salient regions. IEEE Trans. Pattern Anal. Mach. Intell. 34 (1), 194–201.
Itti, L., Baldi, P., 2005. Bayesian surprise attracts human attention. In: Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS'05), Vancouver, British Columbia, Canada, pp. 547–554.
Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20 (11), 1254–1259.
Judd, T., Ehinger, K., Durand, F., Torralba, A., 2009. Learning to predict where humans look. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV'09), Kyoto, Japan, pp. 2106–2113.
Koch, C., Ullman, S., 1985. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiol. 4 (4), 219–227.
Li, Y., Zhou, Y., Xu, L., Yang, X., Yang, J., 2009. Incremental sparse saliency detection. In: Proceedings of the IEEE International Conference on Image Processing (ICIP'09), Cairo, Egypt, pp. 3093–3096.
Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y., 2011. Learning to detect a salient object. IEEE Trans. Pattern Anal. Mach. Intell. 33 (2), 353–367.
Ma, Y.F., Zhang, H.J., 2003. Contrast-based image attention analysis by using fuzzy growing. In: Proceedings of the ACM International Conference on Multimedia (MM'03), New York, NY, USA, pp. 374–381.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., 2010. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60.
Nothdurft, H., 2005. Salience of feature contrast. In: Itti, L., Rees, G., Tsotsos, J.K. (Eds.), Neurobiology of Attention. Elsevier, pp. 233–239.
Olshausen, B.A., Field, D.J., 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), 607–609.
Rubinstein, M., Shamir, A., Avidan, S., 2008. Improved seam carving for video retargeting. ACM Trans. Graph. 27 (3), 16:1–9.
Rutishauser, U., Walther, D., Koch, C., Perona, P., 2004. Is bottom-up attention useful for object recognition? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'04), Washington, DC, USA, pp. 37–44.
Seo, H.J., Milanfar, P., 2009. Nonparametric bottom-up saliency detection by self-resemblance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), Miami, FL, USA, pp. 45–52.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B (Methodological) 58 (1), 267–288.
Torralba, A., Oliva, A., Castelhano, M.S., Henderson, J.M., 2006. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychol. Rev. 113 (4), 766–786.
Wang, Z., Lu, L.G., Bovik, A.C., 2003. Foveation scalable video coding with automatic fixation selection. IEEE Trans. Image Process. 12 (2), 243–254.
Zetzsche, C., 2005. Natural scene statistics and salient visual features. In: Itti, L., Rees, G., Tsotsos, J.K. (Eds.), Neurobiology of Attention. Elsevier, pp. 226–232.
Zhang, L.Y., Tong, M.H., Marks, T.K., Shan, H.H., Cottrell, G.W., 2008. SUN: a Bayesian framework for saliency using natural statistics. J. Vis. 8 (7), 32:1–20.