A saliency prediction model on 360 degree images using color dictionary based sparse representation

Jing Ling, Kao Zhang, Yingxue Zhang, Daiqin Yang, Zhenzhong Chen (corresponding author)
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Email addresses: [email protected] (Jing Ling), [email protected] (Kao Zhang), [email protected] (Yingxue Zhang), [email protected] (Daiqin Yang), [email protected] (Zhenzhong Chen)
Abstract

In this paper, a model using color dictionary based sparse representation for 360° image saliency prediction is proposed, referred to as CDSR. The proposed model simulates human color perception by extracting image features through color dictionary based sparse representation and combining them with weighted center-surround differences between image patches. Additionally, a partitioning operation and a latitude-bias enhancement are integrated into the proposed model to adapt it to 360° image saliency prediction. Experimental results on both natural images and 360° images show the superior performance of the proposed saliency prediction model.

Keywords: 360° Images, Saliency Prediction, Color Dictionary, Sparse Representation, Latitude-bias

1. Introduction

When facing the enormous amount of visual information arriving from the retina, which is beyond the processing ability of the human central nervous system, the human visual system focuses on particularly important visual information while ignoring less important information in the visual scene [1]. Visual saliency prediction is a process that detects salient regions, which are
different from their surroundings (often referred to as bottom-up saliency) [2]. It simulates human visual perception characteristics through intelligent algorithms. More attention, and usually more complex operations, can then be focused only on the selected areas [3]. Recently, modeling visual saliency has drawn a great amount of research interest. It has been shown that saliency prediction models are beneficial in many applications, including image and video compression [4], image segmentation, object recognition [5], etc. Most existing models compute the center-surround differences between image patches by extracting image features such as color, intensity, and orientation [6] to generate a saliency map, while rarely considering other useful characteristics of human visual perception and processing.

In this paper, a saliency prediction model based on sparse representation and human-acuity-weighted center-surround differences is proposed, referred to as CDSR. The main idea of the sparse representation is to measure color differences between image patches in a holistic manner, using an overcomplete color dictionary trained from natural color images. This method is closely related to human visual perception, as sparse representation behaves similarly to the simple cells in the primary visual cortex [7]. Human visual acuity, an important characteristic of human visual perception, is adopted to weight the differences between image patches calculated from the sparse features. Sparse representation is also adopted as an important tool in the work of [8]. However, it is used there to compute three filter response maps for scanpath generation, whereas this paper utilizes it to extract sparse features for better saliency prediction. Furthermore, Independent Component Analysis (ICA) is adopted in [8] to learn color sparse filter functions, whereas the K-SVD and OMP methods are employed in this paper to train an overcomplete color dictionary and extract sparse features.

Recently, 360° images have been gaining popularity with the increasing availability of cameras and displays [9]. Although a large number of models have been developed to generate saliency maps for flat 2D images, saliency prediction studies on 360° images are still limited. In this paper, we focus on 360° images, which differ from flat 2D images: they cover the full 360° range of the scenery and can be manipulated by viewers from any angle to interactively observe the scene. Their huge amount of information, coming with extremely high resolution and large file size, results in a very large computational problem [10]. Partitioning and
splicing, and latitude-bias enhancement are thus adopted in this work to better adapt the model to 360° images.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the proposed CDSR model in detail. Section 4 presents the experiments and results used to assess the performance of the proposed model. The final section concludes the paper.

2. Related work

As visual attention plays a vital role in computer vision, graphics, and other fields [11] [12] [13], many computational models of visual attention have been proposed during the past decades [14]. Visual attention results from both involuntary, task-independent, bottom-up visual saliency of the retinal input and voluntary, task-dependent, top-down visual attention [15]. In particular, top-down visual attention is more accurately described as perceived interest based on a semantic interpretation of the scene, rather than visual saliency [16].

The study of visual saliency prediction can be traced back to the "feature integration theory" proposed by Treisman and Gelade [17], which discusses how several visual features are integrated with each other and affect the attention of the human visual system. Koch and Ullman [18], on the basis of this characteristic information, put forward a biologically plausible model and, for the first time, the concept of a saliency map. Later, Itti [19] proposed a classic saliency prediction model, which combines a variety of features to extract the saliency map. An increasing number of researchers then established saliency models by simulating human visual behavior.

To simplify such biological models, later researchers began to seek new ways to predict image saliency. Among them, some methods convert the image to the frequency domain for processing. For example, Hou et al. [20] established a visual attention model based on the concept of spectral residuals. Although saliency prediction methods based on the image frequency domain are simpler and less time-consuming than the previous models, their prediction accuracy still needs improvement, as they are not combined with the mechanisms of human visual perception.

Since then, on the basis of several biologically relevant studies [21, 22, 11, 23], researchers have found that feature contrast plays a very
important role in determining whether an image region can attract visual attention, in other words, whether it is salient or not. Biologically relevant saliency prediction methods then began to achieve better prediction results. There are two ways to calculate feature contrast. One is based on local center-surround contrast, which considers the rarity of an image region relative to its local neighborhood. A method proposed by Achanta et al. [24], which contrasts an image patch with its surrounding area at multiple scales to calculate saliency values, is a typical local contrast method. The classical Itti model [19] also computes normalized local center-surround differences of individual features to achieve saliency prediction. Harel et al. [25] constructed fully connected graphs based on low-level feature maps; nodes with high dissimilarity to surrounding nodes are then assigned large values in the equilibrium distribution of the graphs, yielding a normalized saliency map. The other way is based on global contrast, which considers the contrast of an image region relative to the entire image. For instance, Zhai and Shah [26] defined pixel-level saliency according to the contrast of a pixel to all other pixels. At present, it is still difficult to achieve the same level of saliency prediction as human eyes, since our understanding of the visual mechanism is not deep enough. To make progress in this respect, research on the psychology and neurobiology of human visual cognition is important. In fact, an increasing number of saliency prediction methods have begun to incorporate principles of human visual perception in an attempt to improve existing methods [27] [28].

Unlike natural image saliency prediction, little has been done for 360° image saliency prediction over the last years. Bogdanova et al. [29, 30] proposed bottom-up methods to obtain saliency maps for 360° images in static and dynamic cases, where features are computed and fused in a spherical domain. Then, in [31], they investigated how the spherical approach applies to real scenes. The spherical approach processes images in the spherical (non-Euclidean) space and produces attention maps with a direction-independent homogeneous response. However, these studies do not provide any detailed interpretation of experimental visual attention data for 360° images. Sitzmann et al. [32] analyzed how visual attention and saliency in 360° images differ from natural images by recording user observations of the same scenes as 360° images and as natural images, respectively. Based on those data, they designed a method to transfer natural image saliency to 360° images. Recently, Abreu et al. [33]
considered the problem of estimating saliency maps for 360° images viewed with head-mounted displays, where eye tracking data are difficult to obtain. They proposed a post-processing method, namely fused saliency maps (FSM), to adapt current saliency models to 360° image saliency prediction.

3. The Proposed CDSR Model

Fig. 1 shows the framework of the proposed saliency prediction model CDSR. In the proposed model, the input image is first divided into image patches, and the sparse features of the image patches are extracted by color dictionary based sparse representation. After that, the center-surround differences between image patches are calculated based on the extracted sparse features, and human visual acuity is used to weight the patch differences. Partitioning and splicing, and latitude-bias enhancement are also added to better adapt the model to 360° images.

3.1. Color dictionary training

Recent research [34] reveals that information is represented by a relatively small number of simultaneously active neurons. For example, the retina receives a large amount of information, while only a small amount of effective information is transmitted to the nerve cells of the visual cortex. This helps to improve the processing efficiency of the brain's visual system, and is often referred to as "sparse representation" [35]. It is generally assumed that sparse representation can capture the underlying structure of the image, which can be used to extract higher level features [7].

The principle of sparse representation is to represent a signal by a linear combination of basis vectors from an overcomplete dictionary [36], while requiring the linear combination to be sparse. In other words, only a few basis vectors are activated in the combination to represent the signal. Given an overcomplete dictionary $D = \{d_i\}_{i=1}^{k} \in \mathbb{R}^{n \times k}$ [37], where n is the dimension of the basis vectors and k is the number of basis vectors, a signal y can be expressed as a sparse linear combination of a small number of basis vectors in the dictionary D:

$$y = Dx, \quad \text{s.t.} \ \|y - Dx\|_2 \leq \epsilon, \quad x \in \mathbb{R}^k \qquad (1)$$
Figure 1: The proposed CDSR model

where the vector x contains the representation coefficients of the signal y, $\|\cdot\|_2$ denotes the $l_2$ norm, used for measuring the deviation, and $\epsilon$ is the representation error. Within the feasible set, the solution with the fewest nonzero coefficients is an appealing representation. This most sparse representation is the solution of:

$$\min_x \|x\|_0, \quad \text{s.t.} \ \|y - Dx\|_2 \leq \epsilon \qquad (2)$$
where $\|\cdot\|_0$ is the $l_0$ norm, counting the nonzero entries of a vector. In sparse representation, the overcomplete dictionary is often trained on a large number of natural image patches. In this work, a color dictionary is trained for image saliency prediction. Based on the color
dictionary, the color characteristics of an image can be better expressed for saliency analysis. As shown in Fig. 2, fifty natural images are randomly selected from the MIT1003 dataset [38] for training the color dictionary. In our experiment, 100,000 local patches of size 8 × 8 × 3 are randomly selected from the training images, and each patch is rearranged into a 192 × 1 column vector in the order of the R, G, B color channels. These 100,000 patches are then combined into a 192 × 100000 array, where each column represents an image patch. Using this array as the training set, a color dictionary of size 192 × 256 is obtained. As shown in Fig. 3, the color dictionary contains 256 basis vectors of size 192 × 1, each of which can be rearranged into an 8 × 8 × 3 image patch for visualization.
Figure 2: Examples of training images from MIT1003 [38]

These color dictionary basis vectors can capture the correlation between different color channels [39]; hence, color feature differences can be effectively represented based on the color dictionary. Sparse representation can naturally mimic the human visual system by extracting the sparse structure from the image [40]. Based on these factors, it is expected that sparse representation will provide better perceptual features for image saliency prediction, thereby improving prediction performance. In this paper, the K-SVD algorithm is used to train the color dictionary, and the orthogonal matching pursuit (OMP) algorithm is employed to obtain the sparse coefficients based on this color dictionary [7]. The extracted sparse coefficients are then used as features to calculate the differences between image patches and generate the final saliency map.
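For concreteness, the sketch below illustrates this training stage. The paper uses K-SVD, which is not part of scikit-learn, so the sketch substitutes MiniBatchDictionaryLearning as a readily available dictionary learner, with OMP as the coding algorithm; the helper names, the sparsity level, and the exact patch vectorization order are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

def build_training_patches(images, n_patches=100_000, patch_size=8, seed=0):
    """Sample 8x8x3 patches from the RGB training images and flatten them to 192-d rows."""
    rng = np.random.RandomState(seed)
    per_image = n_patches // len(images)
    rows = []
    for img in images:  # img: HxWx3 float array in [0, 1]
        p = extract_patches_2d(img, (patch_size, patch_size),
                               max_patches=per_image, random_state=rng)
        rows.append(p.reshape(p.shape[0], -1))        # -> (per_image, 192)
    return np.vstack(rows)                            # -> (~100000, 192)

def train_color_dictionary(patches, n_atoms=256, sparsity=5):
    """Learn an overcomplete color dictionary with 256 atoms of dimension 192.
    The paper uses K-SVD; MiniBatchDictionaryLearning is used here as a substitute."""
    dico = MiniBatchDictionaryLearning(n_components=n_atoms,
                                       transform_algorithm='omp',
                                       transform_n_nonzero_coefs=sparsity,
                                       random_state=0)
    dico.fit(patches)          # patches: (n_samples, 192)
    return dico                # dico.components_ has shape (256, 192)
```

The learned dico.components_ array (256 × 192) then plays the role of the color dictionary D used in the following subsections.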
Figure 3: Color dictionary

3.2. Sparse feature extraction

In this work, the sparse feature of each image patch is extracted by sparse representation. An input image in the RGB color space is first divided into non-overlapping local patches of size 8 × 8 × 3, and each patch is rearranged into a 192 × 1 column vector. The trained overcomplete color dictionary is then used to decompose the patches, producing the sparse coefficients of each patch. Given an image patch y, it is represented using the color dictionary D as:

$$y = Dx \qquad (3)$$
where the vector x is the sparse coefficient of the image patch y, which is also used as the sparse feature of y. Using sparse features instead of R, G, B color features is beneficial, as it reduces redundancy, prevents information interference, and decreases the computational complexity, improving efficiency.
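As an illustration of this decomposition step, the sketch below splits an image into non-overlapping 8 × 8 × 3 patches and computes their sparse coefficients with scikit-learn's SparseCoder using OMP. The dictionary argument is assumed to be the 256 × 192 atom matrix from the previous sketch (e.g., dico.components_), and the sparsity level is an assumed value.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def image_to_patch_grid(img, patch=8):
    """Split an RGB image into non-overlapping 8x8x3 patches, flattened to 192-d rows."""
    h, w, _ = img.shape
    h, w = h - h % patch, w - w % patch               # crop to a multiple of the patch size
    grid = img[:h, :w].reshape(h // patch, patch, w // patch, patch, 3)
    grid = grid.transpose(0, 2, 1, 3, 4)              # (rows, cols, 8, 8, 3)
    return grid.reshape(-1, patch * patch * 3), (h // patch, w // patch)

def sparse_features(img, dictionary, sparsity=5):
    """Sparse coefficients (eq. 3) of every patch w.r.t. the 256-atom color dictionary."""
    patches, grid_shape = image_to_patch_grid(img)
    coder = SparseCoder(dictionary=dictionary,        # expects shape (256, 192)
                        transform_algorithm='omp',
                        transform_n_nonzero_coefs=sparsity)
    x = coder.transform(patches)                      # (n_patches, 256)
    return x, grid_shape
```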
3.3. Initial saliency calculation

After generating the sparse features of the image patches, the center-surround differences between image patches are calculated, and human visual acuity, as proposed in [1], is used to weight the patch differences, yielding the saliency value of each patch. Human visual acuity can be represented as a function of spatial frequency and retinal eccentricity:

$$W(f, e) = \frac{1}{T_0 \exp\left(\alpha f \frac{e + e_2}{e_2}\right)} \qquad (4)$$

where f is the spatial frequency (cycles/degree), e is the retinal eccentricity (degrees), $T_0$ is the minimum contrast threshold, $\alpha$ is the spatial frequency decay constant, and $e_2$ is the half-resolution eccentricity. According to the experimental results in [41], these parameters are set to $T_0 = 1/64$, $\alpha = 0.106$, and $e_2 = 2.3$.

The center-surround difference $D_{pq}$ between image patches p and q is calculated based on their sparse features as follows:

$$D_{pq} = \sum_{i} \|x_i(p) - x_i(q)\|_2 \qquad (5)$$

where $x_i(p)$ and $x_i(q)$ ($i \in \{0, 1, 2, \cdots, n\}$) represent the sparse features of the image patches p and q, respectively, and $\|\cdot\|_2$ denotes the $l_2$ norm. Given an image patch p, its saliency is calculated as in equation (6):

$$S_p = \sum_{p \neq q} W(f, e_{pq}) D_{pq} \qquad (6)$$

where $e_{pq}$ is the retinal eccentricity of image patch q from the fixation image patch p, $D_{pq}$ is the patch difference between image patches p and q, and $W(f, e_{pq})$ is the visual acuity between image patches p and q, used as the weighting factor of the patch difference $D_{pq}$.
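A minimal sketch of this initial saliency computation is given below, assuming the per-patch sparse features from the previous step. The spatial frequency f and the pixels-per-degree conversion used to obtain retinal eccentricities are not specified in the paper, so the values here are placeholders; as written, equation (5) reduces to an L1 distance between coefficient vectors, which is what the code computes.

```python
import numpy as np

T0, ALPHA, E2 = 1.0 / 64, 0.106, 2.3      # parameters from [41], as given in Sec. 3.3

def visual_acuity(f, e):
    """Eq. (4): W(f, e) = 1 / (T0 * exp(alpha * f * (e + e2) / e2))."""
    return 1.0 / (T0 * np.exp(ALPHA * f * (e + E2) / E2))

def patch_saliency(features, grid_shape, f=4.0, pixels_per_degree=30.0, patch=8):
    """Acuity-weighted center-surround saliency (eqs. 5-6) for every patch.
    `f` and `pixels_per_degree` are assumed values; the paper does not state them."""
    rows, cols = grid_shape
    n = rows * cols
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    centers = np.stack([ys.ravel(), xs.ravel()], axis=1) * patch + patch / 2.0

    saliency = np.zeros(n)
    for p in range(n):
        d = np.abs(features - features[p]).sum(axis=1)             # eq. (5)
        ecc = np.linalg.norm(centers - centers[p], axis=1) / pixels_per_degree
        w = visual_acuity(f, ecc)                                   # eq. (4)
        w[p] = 0.0                                                  # exclude q == p
        saliency[p] = np.sum(w * d)                                 # eq. (6)
    return saliency.reshape(rows, cols)
```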
3.4. Enhancement for 360° images

3.4.1. Partitioning and splicing

360° images have a wider view and higher resolution than natural images, which means more information and a larger file size, resulting in a very large computational problem. Some pre-processing steps are therefore introduced for 360° images. When watching a 360° image, viewers can only see a small part of the whole image at any one time. Therefore, 360° images are split into tiles of sub-images, as the size of the sub-images is more in line with human visual perception characteristics, and the partitioning operation greatly reduces computational complexity.
A viewport of 90° × 60° visual degrees is selected as the size of the sub-images in our computations, mainly for two reasons. Firstly, this size of the rendered viewport is used as a default option by most available 360° image devices [9]. Secondly, this size in visual degrees corresponds to exactly one-quarter and one-third of the width and height of the original 360° image, respectively, which greatly reduces the complexity of partitioning and splicing. In particular, an image patch at the edge of a sub-image loses some of its neighborhood patches, which would contribute to its saliency. To avoid inaccurate saliency values for these image patches, each sub-image is extended 100 pixels outward at the time of partitioning. Then, after the saliency value of each sub-image is calculated, only the original parts of the sub-images before extension are taken and spliced together to get the whole feature-based saliency map.

3.4.2. Latitude-bias enhancement

Studies on natural images have pointed out that a center bias of human eye fixations exists, as the center is the optimal viewpoint for screen viewing [42]. Inspired by this, the bias of human eye fixations is analyzed for 360° images. Training images from the Salient360 dataset [43] [44] provided by the University of Nantes [45] are examined, using the ground truth overlaid on the original 360° images, as shown in Fig. 4. It can be found that there is no significant difference in the distribution of human eye fixations in the horizontal direction. However, the distribution of human eye fixations does show a very obvious difference in the vertical direction. The probability distribution of human eye fixations on the training images of the Salient360 dataset is modeled as shown in Fig. 5. This latitude-bias probability model is used later to generate a latitude-bias map.

To keep the relative importance between different maps and avoid promoting irrelevant information when combining saliency maps, [46] proposed a cue combination strategy. The experimental comparison of several fusion methods in [1] further proves the effectiveness of this strategy. We adopt it in equation (7), incorporating the latitude-bias map into the feature-based saliency map after normalizing both to the same dynamic range (between 0 and 1), obtaining the final saliency map:

$$S = S_f \cdot S_l + S_f + S_l \qquad (7)$$

where $S_l$ denotes the latitude-bias map generated by the latitude-bias probability model and $S_f$ denotes the feature-based saliency map.
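A minimal sketch of this fusion step is shown below; the normalization to [0, 1] follows the description above, while the construction of the latitude-bias map from a 1-D latitude prior (lat_prior) is an assumption about one possible implementation.

```python
import numpy as np

def normalize01(m):
    """Rescale a map to the [0, 1] dynamic range."""
    m = m.astype(np.float64)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def fuse_with_latitude_bias(feature_map, latitude_map):
    """Eq. (7): S = Sf * Sl + Sf + Sl, after normalizing both maps to [0, 1]."""
    sf, sl = normalize01(feature_map), normalize01(latitude_map)
    return sf * sl + sf + sl

# One way to build the latitude-bias map is to repeat a 1-D per-row fixation prior
# (lat_prior, a hypothetical array fitted on the training data) across all columns:
# latitude_map = np.tile(lat_prior[:, None], (1, feature_map.shape[1]))
```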
Figure 4: The ground truth overlaying the original 360° images
Figure 5: Latitude-bias probability model
4. Experimental Results

To illustrate the performance of CDSR, experiments are conducted on both natural images and 360° images.

4.1. Experiment on natural images

Firstly, the performance of CDSR on natural images is evaluated by comparing it with ten commonly used saliency prediction algorithms [47]. Unlike 360° images, natural images are not resized or split, and a simple Gaussian function is used to model the center bias instead of a latitude-bias function. Each input image in the RGB color space is divided into non-overlapping local image patches of size 8 × 8 × 3, and the saliency value of each patch is calculated based on its center-surround differences with neighboring image patches, weighted by visual acuity. A simple 1-D measure of the distance of each patch to the image center is further incorporated to handle the center bias [48].

Two benchmark eye tracking datasets are tested: the MIT1003 dataset [38] and the Toronto dataset [49]. The MIT1003 dataset consists of 1003 natural indoor and outdoor scene images, including 779 landscape images and 228 portrait images; the eye movement data were collected from 15 observers who freely browsed these images. The Toronto dataset was collected from 20 subjects who freely browsed 120 natural indoor and outdoor scene images. Ten state-of-the-art saliency models are evaluated for comparison: AC [24], BMS [50], CA [51], FT [15], GBVS [25], ICL [52], IT [19], SDSR [53], SR [20], and SUN [42]. A qualitative comparison of CDSR with these ten methods on the Toronto and MIT1003 datasets with binary ground truth is shown in Fig. 6. It can be observed that, compared with most other models, CDSR generates saliency maps more similar to the ground truth.

To compare these saliency prediction models quantitatively, four common measures are used: Similarity, Linear Correlation Coefficient (CC), AUC Borji, and AUC Judd [54], which have been used to validate a wide variety of state-of-the-art saliency prediction models and provide a reliable basis for performance evaluation.
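For reference, the sketch below shows common definitions of two of these measures, CC (Pearson correlation between the two maps) and Similarity (histogram intersection of the two maps after normalizing each to sum to one), as used in saliency benchmarking; the AUC variants are omitted for brevity, and the exact implementations behind Table 1 and Table 2 may differ.

```python
import numpy as np

def cc(sal, gt):
    """Linear correlation coefficient between predicted and ground-truth saliency maps."""
    s = (sal - sal.mean()) / (sal.std() + 1e-12)
    g = (gt - gt.mean()) / (gt.std() + 1e-12)
    return float(np.mean(s * g))

def similarity(sal, gt):
    """Similarity (histogram intersection): sum of element-wise minima of the two maps,
    each first normalized to sum to 1."""
    s = sal / (sal.sum() + 1e-12)
    g = gt / (gt.sum() + 1e-12)
    return float(np.minimum(s, g).sum())
```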
Figure 6: Comparison of different saliency prediction models: first row to the final row: original images, the ground truth, saliency maps from CDSR, AC, BMS, CA, FT, GBVS, ICL, IT, SDSR, SR, and SUN, respectively.
Table 1: Evaluation of different saliency prediction models on the Toronto dataset

Model   AUC Judd   AUC Borji   CC     Similarity
AC      0.66       0.62        0.18   0.35
BMS     0.74       0.71        0.37   0.40
CA      0.79       0.77        0.45   0.42
FT      0.57       0.56        0.10   0.31
GBVS    0.82       0.81        0.53   0.44
ICL     0.69       0.58        0.23   0.35
IT      0.77       0.76        0.39   0.40
SDSR    0.77       0.76        0.44   0.43
SR      0.74       0.73        0.35   0.39
SUN     0.68       0.67        0.23   0.35
CDSR    0.84       0.70        0.54   0.47

The scores of the compared saliency prediction models on the two datasets are shown in Table 1 and Table 2. It can be seen that CDSR achieves the highest scores on AUC Judd, CC, and Similarity, which shows that the proposed method obtains significantly better saliency prediction performance than the other methods evaluated.

4.2. Experiment on 360° images

The proposed CDSR model is evaluated on the Salient360 Test dataset [43] [44], which consists of 25 images with Head+Eye ground truth, obtained from the "movement of the head" as well as the "movement of the eye within the viewport". In addition, the ICME Salient360! Challenge [43] [44] also provides Head ground truth, derived from the "movement of the head" only.

In this work, the input 360° image is first resized to 1200 × 2400 pixels and then split into 12 sub-images. Each sub-image is divided into non-overlapping local image patches of size 8 × 8 × 3, and the saliency value of each patch is calculated based on its center-surround differences with neighboring image patches, weighted by visual acuity. After that, the 12 sub-images are spliced together to generate the feature-based saliency map, which is then combined with the latitude-bias map to obtain the final saliency map.
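The sketch below illustrates one possible implementation of this partitioning-and-splicing step for a 1200 × 2400 equirectangular image: each of the 3 × 4 tiles is extended by 100 pixels, and only the original tile areas are spliced back. The horizontal wrap-around and vertical clamping at the borders are assumptions, as the paper does not specify how the 100-pixel extension is handled at the image boundary.

```python
import numpy as np

def split_with_margin(img, rows=3, cols=4, margin=100):
    """Split a 1200x2400 equirectangular image into 3x4 tiles (~60 x 90 degrees each),
    extending every tile by `margin` pixels so border patches keep their neighborhood.
    Columns wrap around (360 degrees); rows are clamped at the poles (an assumption)."""
    h, w = img.shape[:2]
    th, tw = h // rows, w // cols
    tiles, boxes = [], []
    for r in range(rows):
        for c in range(cols):
            ys = np.clip(np.arange(r * th - margin, (r + 1) * th + margin), 0, h - 1)
            xs = np.arange(c * tw - margin, (c + 1) * tw + margin) % w
            tiles.append(img[np.ix_(ys, xs)])
            boxes.append((r * th, c * tw))
    return tiles, boxes, (th, tw)

def splice(tile_maps, boxes, tile_size, out_shape, margin=100):
    """Crop the extended borders off each per-tile saliency map and splice the originals back."""
    th, tw = tile_size
    out = np.zeros(out_shape)
    for m, (y, x) in zip(tile_maps, boxes):
        out[y:y + th, x:x + tw] = m[margin:margin + th, margin:margin + tw]
    return out
```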
Table 2: Evaluation of different saliency prediction models on the MIT1003 dataset

Model   AUC Judd   AUC Borji   CC     Similarity
AC      0.62       0.59        0.12   0.24
BMS     0.75       0.70        0.29   0.33
CA      0.76       0.75        0.31   0.32
FT      0.55       0.53        0.10   0.21
GBVS    0.81       0.80        0.38   0.34
ICL     0.65       0.52        0.13   0.21
IT      0.73       0.71        0.25   0.29
SDSR    0.72       0.70        0.26   0.31
SR      0.72       0.70        0.25   0.30
SUN     0.67       0.66        0.19   0.26
CDSR    0.85       0.72        0.46   0.40

Fig. 7 shows three original images, the corresponding ground truth, and the saliency maps predicted by CDSR. In particular, the ground truth provided by the Salient360 Test dataset and the predicted saliency maps are overlaid on the original images to better illustrate the human fixation locations and the prediction accuracy of CDSR.

A number of models have been submitted to the Salient360! Challenge for 360° images [55] [56] [57]. Table 3 compares CDSR with the other saliency prediction models presented in the challenge [43] [44]; these figures are provided by the organizer of the challenge. The evaluation metrics set by the organizer are Kullback-Leibler divergence (KL), Linear Correlation Coefficient (CC), Normalized Scanpath Saliency (NSS), and Receiver Operating Characteristic (ROC) for the Head+Eye ground truth [43]. In addition, we also provide KL and CC evaluations on the Head ground truth, to be consistent with the other models. In particular, the model used in the challenge was an initial version of CDSR, which did not include the partitioning step and used a simple Gaussian function to measure the latitude bias. This initial version is named "ICME CDSR" in Table 3. The upgraded version of "ICME CDSR" proposed in this paper is evaluated as "CDSR" in the table. This upgraded version adopts the pre-processing steps and measures the latitude bias by modeling the probability distribution of fixations on the training images of the Salient360 dataset.
Figure 7: Ground truth and predicted saliency maps for three 360° images. First row to the final row: original images, ground truth maps overlaying the original images, and saliency maps from CDSR overlaying the original images.
Table 3: Comparison of CDSR with other saliency prediction models presented in the Salient360! Challenge on 360° images ("CDSR" is an upgraded version of "ICME CDSR", enhanced with partitioning and the latitude-bias function.)

                                        Head+Eye                       Head
Model                                   KL     CC     NSS    ROC      KL     CC
TU Munich 3 [55]                        0.449  0.579  0.806  0.726    0.737  0.600
SJTU [56]                               0.481  0.532  0.918  0.735    0.654  0.668
TU Munich 1 [55]                        0.501  0.554  0.915  0.747    0.745  0.620
TU Munich 5 [55]                        0.475  0.559  0.696  0.713    0.640  0.556
Zhejiang University [57]                0.698  0.527  0.851  0.714    0.444  0.692
University of Science and Technology    2.017  0.507  0.918  0.695    1.046  0.667
TU Munich 2 [55]                        0.576  0.412  0.686  0.691    0.912  0.503
Xidian University                       0.588  0.413  0.683  0.686    0.919  0.578
TU Munich 4 [55]                        0.636  0.402  0.631  0.678    1.086  0.436
IRISA                                   0.585  0.448  0.506  0.644    1.072  0.412
ICME CDSR                               0.508  0.538  0.936  0.736    0.515  0.715
CDSR                                    0.477  0.550  0.939  0.736    0.498  0.730
Figure 8: Head+Eye ground truth and predicted "CDSR" and "ICME CDSR" saliency maps. First row to the final row: original images, Head+Eye ground truth maps overlaying the original images, saliency maps from "ICME CDSR" overlaying the original images, and saliency maps from "CDSR" overlaying the original images.

Fig. 8 and Fig. 9 compare the saliency maps of "ICME CDSR" and "CDSR" against the Head+Eye ground truth and the Head ground truth, respectively. Although for some images, such as the third one, the difference between the "ICME CDSR" and "CDSR" saliency maps is not obvious, the first and second images clearly show that the "CDSR" saliency map is more in line with the ground truth than that of "ICME CDSR", which may be owing to the more accurate latitude-bias enhancement. It can be concluded from Table 3 that "CDSR" outperforms the other models in terms of the NSS metric on the Head+Eye ground truth and closely approaches the best scores on the other metrics, which indicates the excellent performance of the proposed model for 360° images. Moreover, to better illustrate the distribution of the metrics of each model, histograms are shown in Fig. 10 and Fig. 11 for a more intuitive presentation.
Figure 9: Head ground truth and predicted "CDSR" and "ICME CDSR" saliency maps. First row to the final row: original images, Head ground truth maps overlaying the original images, saliency maps from "ICME CDSR" overlaying the original images, and saliency maps from "CDSR" overlaying the original images.
Figure 10: The distribution of the metrics of each model on Head+Eye dataset.
Figure 11: The distribution of the metrics of each model on Head dataset.

5. Conclusion

In this work, a model capable of predicting saliency maps for 360° images has been presented, which can also be used for natural images. In the proposed model, the saliency maps are calculated mainly based on the sparse features of images, extracted by sparse representation based on an overcomplete color dictionary, and the center-surround differences between image patches, weighted by human visual acuity. In addition, a partitioning and splicing operation and a latitude-bias enhancement are added to better adapt the model to 360° images. Experiments on both natural images and 360° images show that the proposed CDSR model consistently outperforms the other models evaluated. However, it is still difficult to reach the performance of human eyes. In the future, we will further study the cognitive psychology and neurobiology of human vision and integrate the findings into the image saliency prediction framework, aiming at greater improvements in the efficiency and accuracy of existing saliency prediction methods.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61471273 and 61771348, and by the Wuhan Morning Light Plan of Youth Science and Technology under Grant 2017050304010302.
References

[1] Y. Fang, W. Lin, Z. Fang, Z. Chen, C. W. Lin, C. Deng, Visual acuity inspired saliency detection by using sparse features, Information Sciences 309 (2015) 1–10.
[2] L. Itti, A. Borji, Exploiting local and global patch rarities for saliency detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 478–485.
[3] X. Hou, J. Harel, C. Koch, Image signature: Highlighting sparse salient regions, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1) (2012) 194–201.
[4] L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Transactions on Image Processing 13 (10) (2004) 1304–1318.
[5] D. Walther, C. Koch, Modeling attention to salient proto-objects, Neural Networks 19 (9) (2006) 1395–1407.
[6] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, S.-M. Hu, Global contrast based salient region detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 569–582.
[7] L. Li, W. Xia, Y. Fang, K. Gu, J. Wu, W. Lin, J. Qian, Color image quality assessment based on sparse representation and reconstruction residual, Journal of Visual Communication and Image Representation 38 (2016) 550–560.
[8] W. Wang, C. Chen, Y. Wang, T. Jiang, F. Fang, Y. Yao, Simulating human saccadic scanpaths on natural images, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 441–448.
[9] M. Rerabek, E. Upenik, T. Ebrahimi, JPEG backward compatible coding of omnidirectional images, in: Proceedings of Applications of Digital Image Processing XXXIX, SPIE, 2016, pp. 99711O–1–99711O–12.
[10] Z. Chen, Y. Li, Y. Zhang, Recent advances in omnidirectional video coding for virtual reality: Projection and evaluation, Signal Processing 146 (2018) 66–78.
[11] J. Wolfe, What attributes guide the deployment of visual attention and how do they do it?, Nature Reviews Neuroscience 5 (6) (2004) 1–7.
[12] N. Ejaz, I. Mehmood, S. W. Baik, Efficient visual attention based framework for extracting key frames from videos, Signal Processing: Image Communication 28 (1) (2013) 34–44.
[13] A. Li, Y. Zhang, Z. Chen, Scanpath mining of eye movement trajectories for visual attention analysis, in: IEEE International Conference on Multimedia and Expo, IEEE, 2017, pp. 535–540.
[14] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 185–207.
[15] R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned salient region detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 1597–1604.
[16] Y. Rai, P. Le Callet, G. Cheung, Quantifying the relation between perceived interest and visual salience during free viewing using trellis based optimization, in: IEEE International Conference on Image, Video, and Multidimensional Signal Processing Workshop, IEEE, 2016, pp. 1–5.
[17] A. M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychology 12 (1) (1980) 97–136.
[18] C. Koch, S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry, in: Matters of Intelligence, Springer, 1987, pp. 115–141.
[19] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.
[20] X. Hou, L. Zhang, Saliency detection: A spectral residual approach, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.
[21] A. Leventhal, The neural basis of visual function: Vision and visual dysfunction (1991).
[22] R. Desimone, J. Duncan, Neural mechanisms of selective visual attention, Annual Review of Neuroscience 18 (1) (1995) 193–222.
[23] S. K. Mannan, C. Kennard, M. Husain, The role of visual salience in directing eye movements in visual object agnosia, Current Biology 19 (6) (2009) R247–R248.
[24] R. Achanta, F. Estrada, P. Wils, S. Susstrunk, Salient region detection and segmentation, Computer Vision Systems (2008) 66–75.
[25] J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: Proceedings of Advances in Neural Information Processing Systems, NIPS, 2007, pp. 545–552.
[26] Y. Zhai, M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in: Proceedings of ACM International Conference on Multimedia, ACM, 2006, pp. 815–824.
[27] J. Zhang, S. Sclaroff, Exploiting surroundedness for saliency detection: a boolean map approach, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (5) (2016) 889–902.
[28] Y. Wu, Z. Chen, Saliency map generation based on saccade target theory, in: IEEE International Conference on Multimedia and Expo, IEEE, 2017, pp. 529–534.
[29] I. Bogdanova, A. Bur, H. Hügli, Visual attention on the sphere, IEEE Transactions on Image Processing 17 (11) (2008) 2000–2014.
[30] I. Bogdanova, A. Bur, H. Hügli, P.-A. Farine, Dynamic visual attention on the sphere, Computer Vision and Image Understanding 114 (1) (2010) 100–110.
[31] I. Bogdanova, A. Bur, H. Hügli, The spherical approach to omnidirectional visual attention, in: IEEE International Conference on Signal Processing, IEEE, 2008, pp. 1–5.
[32] V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, G. Wetzstein, Saliency in VR: How do people explore virtual environments?, arXiv preprint arXiv:1612.04335.
[33] D. A. Abreu, C. Ozcinar, A. Smolic, Look around you: Saliency maps for omnidirectional images in VR applications, in: IEEE International Conference on Quality of Multimedia Experience, IEEE, 2017, pp. 1–6.
[34] A. Garcia-Diaz, V. Leboran, X. R. Fdez-Vidal, X. M. Pardo, On the relationship between optical variability, visual saliency, and eye fixations: A computational approach, Journal of Vision 12 (6) (2012) 17–17.
[35] B. A. Olshausen, D. J. Field, Sparse coding of sensory inputs, Current Opinion in Neurobiology 14 (4) (2004) 481–487.
[36] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of ACM Annual International Conference on Machine Learning, ACM, 2009, pp. 689–696.
[37] M. Aharon, M. Elad, A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing 54 (11) (2006) 4311–4322.
[38] T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: Proceedings of IEEE International Conference on Computer Vision, IEEE, 2009, pp. 2106–2113.
[39] J. Mairal, M. Elad, G. Sapiro, Sparse representation for color image restoration, IEEE Transactions on Image Processing 17 (1) (2008) 53–69.
[40] X. Zhang, S. Wang, K. Gu, T. Jiang, S. Ma, W. Gao, Sparse structural similarity for objective image quality assessment, in: Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, IEEE, 2015, pp. 1561–1566.
[41] W. S. Geisler, J. S. Perry, Real-time foveated multiresolution system for low-bandwidth video communication, in: Proceedings of Human Vision and Electronic Imaging, SPIE/IS&T, 1998, pp. 294–305.
[42] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, G. W. Cottrell, SUN: A Bayesian framework for saliency using natural statistics, Journal of Vision 8 (7) (2008) 32–32.
[43] J. Gutiérrez, E. David, Y. Rai, P. Le Callet, Toolbox and dataset for the development of saliency and scanpath models for omnidirectional / 360° still images, Signal Processing: Image Communication (2018) 1–12.
[44] Y. Rai, P. Le Callet, P. Guillotel, Which saliency weighting for omnidirectional image quality assessment?, in: IEEE International Conference on Quality of Multimedia Experience, IEEE, 2017, pp. 1–6.
[45] Y. Rai, J. Gutiérrez, P. Le Callet, A dataset of head and eye movements for 360 degree images, in: Proceedings of ACM Conference on Multimedia Systems, ACM, 2017, pp. 205–210.
[46] C. Chamaret, J.-C. Chevet, O. Le Meur, Spatio-temporal combination of saliency maps and eye-tracking assessment of different strategies, in: Proceedings of IEEE International Conference on Image Processing, IEEE, 2010, pp. 1077–1080.
[47] F. Perazzi, P. Krähenbühl, Y. Pritch, A. Hornung, Saliency filters: Contrast based filtering for salient region detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 733–740.
[48] E. Vig, M. Dorr, D. Cox, Large-scale optimization of hierarchical features for saliency prediction in natural images, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 2798–2805.
[49] N. Bruce, J. Tsotsos, Saliency based on information maximization, in: Proceedings of Advances in Neural Information Processing Systems, NIPS, 2006, pp. 155–162.
[50] J. Zhang, S. Sclaroff, Saliency detection: A boolean map approach, in: Proceedings of IEEE International Conference on Computer Vision, IEEE, 2013, pp. 153–160.
[51] S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (10) (2012) 1915–1926.
[52] X. Hou, L. Zhang, Dynamic visual attention: Searching for coding length increments, in: Proceedings of Advances in Neural Information Processing Systems, NIPS, 2009, pp. 681–688.
[53] H. J. Seo, P. Milanfar, Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9 (12) (2009) 15–15.
[54] M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, A deep multi-level network for saliency prediction, in: Proceedings of IEEE International Conference on Pattern Recognition, IEEE, 2016, pp. 3488–3493.
[55] M. Startsev, M. Dorr, 360-aware saliency estimation with conventional image saliency predictors, Signal Processing: Image Communication (2018) 1–12.
[56] Y. Zhu, G. Zhai, X. Min, The prediction of head and eye movement for 360 degree images, Signal Processing: Image Communication (2018) 1–12.
[57] P. Lebreton, A. Raake, GBVS360, BMS360, ProSal: Extending existing saliency prediction models from 2D to omnidirectional images, Signal Processing: Image Communication (2018) 1–12.
Highlights

• The proposed saliency prediction model achieves strong performance on 360° images.
• Color dictionary based sparse representation provides better perceptual features.
• Latitude-bias enhancement has a significant impact on 360° image saliency prediction.