Stereoscopic saliency model using contrast and depth-guided-background prior


Fangfang Liang (a), Lijuan Duan (a), Wei Ma (a), Yuanhua Qiao (b), Zhi Cai (a), Laiyun Qing (c,*)

(a) Faculty of Information Technology, Beijing University of Technology, China
(b) Applied Science of Beijing University of Technology, China
(c) University of Chinese Academy of Sciences, China

Abstract


Many successful saliency models have been proposed to detect salient regions in 2D images. Because stereopsis, with its distinctive depth information, influences human viewing, stereoscopic saliency detection needs to consider depth information as an additional cue. In this paper, we propose a 3D stereoscopic saliency model based on both contrast and a depth-guided-background prior. First, a depth-guided-background prior is detected from the disparity map, in addition to the conventional prior that assumes boundary super-pixels to be background. Then, saliency based on disparity, aided by the proposed prior, is introduced to prioritize the contrasts among super-pixels. In addition, a scheme to combine the contrast of disparity and the contrast of color is presented. Finally, 2D spatial dissimilarity features are employed to refine the saliency map. Experimental results on the PSU stereo saliency benchmark dataset (SSB) show that the proposed method performs better than existing saliency models.


Keywords: Stereoscopic saliency analysis, 3D images, Depth-guided-background prior, Saliency based on disparity

1. Introduction


The human visual system (HVS) is remarkably effective at determining the important aspects of a scene as the eyes acquire visual information. To simulate the HVS mechanism, two distinct classes of saliency models are applied: bottom-up (stimulus-driven) and top-down (task-driven) [1, 2, 3, 4]. Bottom-up models extract center-surround difference features (local or global) from low-level information in different channels and then combine these feature maps to build a saliency map [5, 6, 7]. Although many image and video saliency models are used in vision tasks [8, 9, 10, 11, 12], with the development of 3D technologies, stereoscopic visual saliency models have attracted increasing attention for diverse applications such as synthetic vision [13], retargeting [14], rendering [15], quality assessment [16, 17], visual discomfort [18, 19], stereoscopic thumbnail creation [20] and disparity control [21]. However, research on stereoscopic saliency models [22, 23, 24] remains scarce compared with the rapid expansion of saliency models for 2D images. Overall, much work remains to be done before 3D stereoscopic saliency models approach the capabilities of the human visual system. Because the sensation of depth [25] is enhanced by the binocular parallax generated by a stereo-channel-separated display, the binocular depth cues introduced by 3D displays have changed human viewing behavior [26, 27]. Therefore, this additional depth cue should be considered in a stereoscopic saliency model. Although current bottom-up stereoscopic saliency models are effective, several questions remain to be addressed: * How can a salient object be detected in a scene where the colors are similar? Sometimes, the colors of salient and non-salient objects are similar in natural images; therefore, methods based on color contrast tend to obtain low values inside salient objects. In this situation, disparity cues can help to complete the whole salient object (e.g., the first row in Fig. 1(e)-(g)). Color and disparity are both useful in computing saliency and are,

* Corresponding author. Email address: [email protected] (Laiyun Qing)



Figure 1: Two saliency map examples: (a) Left image; (b) Right image; (c) Disparity image; (d) Our method; (e) RC [31]; (f) HC [31]; (g) SS [32].


to some extent, relevant to each other. Because the interaction between disparity and 2D information is usually ignored, we introduce compactness [28] computed from the color map to measure this relationship while calculating contrast (e.g., the first row in Fig. 1(d)).


* How can a salient object be detected in a scene with a cluttered background? Current bottom-up saliency methods have difficulty coping with images in which the non-salient part of the image is cluttered (e.g., the second row of Fig. 1(a)). Previous models sometimes render non-salient parts as salient because of their high contrast with the surroundings (e.g., the second row in Fig. 1(e) and 1(f)). This phenomenon may be avoided by using a disparity map (e.g., Fig. 1(g)). Although the boundary background prior [29, 30] has been verified to be effective, we find that an additional prior can be exploited from the disparity map based on two observations: 1) the non-salient areas distant from viewers tend to form a smooth surface on a disparity map; and 2) there is an obvious discontinuity between the salient and non-salient parts. Hence, taking multiple cues into consideration, including compactness and background priors, is useful for forming a complete stereoscopic saliency analysis (e.g., the second row in Fig. 1(d)).


* How can a salient object be detected according to the disparity cue in a cluttered scene? For a simple scene, a good saliency map (the first row in Fig. 1(g)) can be obtained from the disparity map alone (the first row in Fig. 1(c)). However, disparity-based detection copes poorly with cluttered scenes (the second row in Fig. 1(g)): some background regions receive high saliency values because they are close to the salient object in depth. When the color feature (compactness [28]) and the background priors are considered together, the non-salient background can be suppressed to a certain degree (e.g., Fig. 1(d)).


In this paper, we propose a unified stereoscopic saliency model, called Saliency with Contrast and Depth-Guided-Background (SCDGB), which combines the saliency obtained from a disparity map with low-level contrast. Saliency based on disparity takes advantage of compactness and background priors not only to highlight the salient object but also to eliminate non-salient objects. As for background priors, a depth-guided-background prior is specifically explored on the disparity map in addition to the conventional boundary background prior [29]. The contrasts are composed of color contrast and disparity contrast, the latter weighted by compactness. Moreover, a 2D saliency map and the center-bias preference of human vision are also employed to refine the final saliency map. Experimental results show that the proposed stereoscopic saliency model achieves superior performance compared with existing saliency approaches. The contributions of this paper are as follows:
1. We propose a stereoscopic saliency model that unites contrast and saliency based on disparity.
2. We develop saliency based on disparity using the proposed depth-guided-background prior.
3. We present a strategy to represent the contrast of stereoscopic images by fusing the multichannel contrasts.


The rest of this paper is organized as follows. A review of the related saliency models is given in Section 2. The proposed stereoscopic saliency detection model is elaborated in Section 3. Experiments and an evaluation of the proposed model are presented in Section 4. Conclusions are provided in Section 5.

2. Related Work


During the past few decades, much work has been devoted to saliency models for images. Because our model is based on low-level contrast, we first review 2D saliency models based on color contrast. Then, we discuss video saliency and co-saliency detection, 3D saliency models with disparity, and high-level priors.


In this section, we mainly discuss stimulus-driven saliency models, because the proposed model belongs to the bottom-up category. A salient pixel or patch draws more attention because it has a high contrast with its surroundings [33]. Depending on the extent of the contrast context, saliency models can be based on local contrast [34], region contrast or global contrast [31]. The earliest saliency model was proposed by Itti [34]; in this model, features from color, intensity and orientation were computed using the "center-surround" difference and subsequently combined into a single topographical saliency map. Because saliency models based on local contrast tend to obtain higher saliency around object edges, the resulting saliency is not uniformly distributed. Therefore, models based on regional contrast and global contrast were proposed. Cheng [31] presented a global contrast method (HC) that uses the color statistics of the input image and a Region-based Contrast (RC) method integrated with spatial information. Zhang [35] calculated color contrast by considering both local and global differences and then computed color distribution features to detect salient regions. However, these methods have difficulties in handling images with cluttered and complex textured backgrounds. Hence, the spatial distribution cue [28] can be used to enhance saliency maps, based on the assumption that salient objects have lower spatial variances than other areas.


2.2. Video Saliency Detection and Co-Saliency Detection


Unlike traditional saliency detection approaches, whose goal is to discover salient objects in a single image, other methods have emerged to select salient objects across multiple images. Video saliency models [9, 10, 36] combine spatial and temporal features to detect the salient object in each video frame. In [9], spatiotemporal saliency detection using geodesic distance was proposed to serve object segmentation. In [10], a contrast measure incorporating intra-frame boundary and inter-frame motion information was presented. In contrast, co-saliency detection methods [11, 37, 38, 39, 40] aim to detect the salient objects shared by two or more related images. In [11], wide and deep information was explored in a Bayesian framework. A model linearly combining a single-image saliency map and a multi-image saliency map was presented in [37]. Multiple instance learning and self-paced learning were integrated to design the co-saliency model in [40]. Similar to co-saliency detection, stereoscopic saliency detection is also provided with a pair of images; the essential difference is that disparity information, signifying a depth cue, is available in stereoscopic images.


In building a stereoscopic saliency model, depth (binocular disparity or disparity contrast) is considered an attentional cue similar to other features such as color and orientation [26, 41, 42]. Jansen [26] investigated the effect of binocular disparity on 2D and 3D natural and noise images and showed that the influence of disparity also exists in natural images. They suggested that, in addition to the disparity feature, 2D stimuli such as luminance contrast could be extended to a 3D saliency model. In contrast to the view of [26], Liu [43] demonstrated that the disparity contrasts of salient parts are generally lower than those of randomly selected patches. The reason for these two different findings is that Liu [43] performed experiments with naturalistic stereo images. These findings imply that contrast from color and contrast from disparity should not carry the same weight in a saliency model. Maki [22] presented a computational model to calculate saliency from an image sequence acquired by a stationary binocular camera head, using the criterion that a target closer in depth has higher priority. Niu [32] explored stereopsis for saliency analysis and developed an approach that leverages stereoscopic photography rules. One rule is that small disparity magnitudes are associated with large saliency values; the other is that either maximal positive disparities or minimal negative disparities are associated with the lowest saliency values. However, this method fails on misleading disparity maps because it does not take the influence of color on depth into account. Many models regard disparity knowledge as an additional, extended channel [20, 23, 24], and some take a disparity map as an additional input image along with a color image. Wang [20] presented a stereo saliency model integrating edges, disparity boundaries and a saliency bias computed from stereoscopic perception. Fang [23] proposed a stereoscopic saliency detection framework based on feature contrast. These features, which include color, luminance, texture and depth obtained from disparity, are extracted from discrete cosine transform coefficients and then used to calculate feature maps based on a Gaussian model of the spatial distance between image patches. They also designed a new fusion method that combines the feature maps using their spatial variances. Nevertheless, the spatial variance [28] used for the depth feature map is not always accurately estimated. Li [24] built a saliency dictionary by selecting a group of potential foreground objects, pruning the outliers in the dictionary and running iterative tests on the remaining super-pixels to refine the dictionary; for stereoscopic saliency detection, disparity was appended directly to the feature vector.

2.4. High-level Priors


High-level priors are effective complements to saliency models. A center-bias prior, modeled by a Gaussian function [44, 45], is used to emphasize the center regions of an image. To complement such approaches, a boundary prior [29, 46, 47] was proposed; Li [47] computed saliency with an inner propagation scheme in which background labels are provided by some boundary super-pixels. Similarly, in [48], corner super-pixels were selected as background labels. Based on the boundary prior, Wei [29] proposed the connectivity prior, in which the saliency of an image patch is defined as the length of its shortest path to the image boundaries, and another boundary connectivity measure was proposed in [49]. Zhao [50] proposed background-based distribution spaces, in which the distance of every patch from the background is measured. Different from the boundary prior, background was explored by Han [51] with deep learning architectures, and Wang [52] exploited a learning-based background method that inverts the predicted saliency map into a background map. The priors listed above are still in use, but other priors for stereoscopic images are worth exploring based on the way humans view 3D stereoscopic images.

3. Proposed Stereoscopic Saliency Detection


3.1. Proposed Model Description


Our study is related to low-level contrast features, including color and disparity. In contrast to previous methods, we propose a scheme to fuse color and disparity contrasts instead of handling color and disparity maps separately. Additionally, we present a saliency based on disparity that mimics the manner in which humans view images at differing depth levels to weight the contrast feature. Given a color image and a corresponding disparity map, we provide an overall expression to define saliency as follows. X S i = S Di ∗ ( (CFi, j ∗ GDS T i, j )), (1) j,i


where S_i denotes the saliency value assigned to the ith element, which can be a pixel, patch or super-pixel, SD_i denotes the saliency based on disparity of the ith element, and CF_{i,j} and GDST_{i,j} denote the feature distance and the Gaussian spatial distance between elements i and j, respectively. The diagram in Fig. 2 shows that we first compute the 3D feature contrasts and the saliency based on disparity. Then, an initial stereoscopic saliency map is acquired as the sum of the weighted feature contrasts. The final saliency is obtained by enhancing the initial saliency with 2D features. We provide a brief overview of the main components below: Saliency based on disparity SD_i. When viewing 3D stereoscopic still images, people usually weigh contrasts at different depth levels unequally; they rarely pay as much attention to the parts of an image perceived as distant from themselves. When perceiving a disparity map, the non-salient areas of the image that have a large disparity generally form a smooth surface that is distinct from the area including the salient part. This surface, denoted as B_d, is extracted by the proposed depth-guided-background prior.


Figure 2: Overview of the process in our model. First, the saliency based on disparity is obtained and features including color and disparity are presented. Then, the final saliency is obtained by enhancing the initial saliency.


In this paper, the color spatial distribution is called compactness and is abbreviated as CP. The compactness cue and the background priors are incorporated to improve the saliency obtained from disparity. Contrast feature CF_{i,j}. The contrast measure includes two factors: color and disparity. Because humans are sensitive to color, the different color channels are considered to contribute equally to contrast, while the disparity contribution is weighted by the compactness cue. Gaussian distance GDST_{i,j}. The distance between i and j is calculated with a Euclidean metric and then weighted by a Gaussian function to emphasize the influence of nearer elements. Saliency assignment. To retain the integrity of a salient object, two conventional practices are taken into account: the center-bias factor, which reflects the human preference for gazing at central positions, and the treatment of a pixel's saliency as a combination of the saliency of its surrounding regions. We additionally introduce saliency obtained from spatial dissimilarity using PCA (principal component analysis) to ensure the integrity of the salient regions. In brief, we mainly leverage contrast and disparity to generate an initial saliency map: SD_i weights the contribution of contrast, and the contrast of an element is measured by the sum of its feature differences CF_{i,j} from the other elements, weighted by the Gaussian distance GDST_{i,j}. After the initial saliency map is computed, saliency assignment reassigns the saliency values to improve saliency consistency.
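The following minimal sketch illustrates how Eq. 1 could be evaluated for all regions at once. It is not the authors' implementation: it assumes that the disparity-based saliency SD, the pairwise feature distances CF and the normalized region centroids are already available, and the exact Gaussian form of GDST (here exp(-d^2/2*sigma^2) with sigma^2 = 0.4, the value quoted in Section 4.1) is an assumption made for illustration.

```python
import numpy as np

def overall_saliency(SD, CF, centers, sigma2=0.4):
    """Eq. 1: S_i = SD_i * sum_{j != i} CF_{i,j} * GDST_{i,j}.

    SD:      (N,)  disparity-based saliency per region
    CF:      (N,N) pairwise feature distances
    centers: (N,2) region centroids, normalized to [0, 1]
    """
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    GDST = np.exp(-d2 / (2.0 * sigma2))   # Gaussian spatial weight (assumed form)
    np.fill_diagonal(GDST, 0.0)           # the sum runs over j != i
    return SD * (CF * GDST).sum(axis=1)
```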


3.2. Depth-Guided-Background Prior

When taking photographs, people tend to arrange objects at different depth levels, placing the most important objects closer to themselves. This tendency is consistent with previous reports that the distance from an object to an observer determines the extent of human visual attraction. Therefore, pixels that are close in depth have higher attractiveness, whereas those further away are less attractive; the pixels furthest from the viewer are assumed to have the least saliency. Because humans perceive depth from disparity [53], the pixel farthest away in depth is identified from the disparity map. Inspired by the results of [43], we attempt to extract a smooth surface on the disparity map to distinguish non-salient pixels.

Prior Definition. According to the stereoscopic perception discussed in the related work, humans preferentially notice objects with disparities in a certain range, namely those in the comfort zone (small disparities near zero) or in front of the screen (negative disparities). This preference is related to a prior on salient objects in depth. In our work, by contrast, we explicitly look for the rarely attended background using disparity. Given a disparity map, the pixel farthest away in depth is denoted as the FBD (farthest-background-depth) pixel. Then, a surface that contains the FBD and shares an approximately constant disparity in its interior is extracted as a prior. We treat this as the depth-guided-background prior to differentiate it from priors extracted from a 2D image.

Surface Energy Function. A smooth surface does not necessarily lie entirely at the same depth level, and sharp disparity discontinuities can exist on surface boundaries. Thus, we formulate the construction of the smooth surface on a disparity map as an energy minimization problem. The energy function, comprising a smooth-region term and a boundary term, is written as follows:

F(\mu_1, \mu_2, \phi) = \mathrm{Length}(\phi) + \int_{\phi > 0} |I_d(x, y) - \mu_1|^2 \, dx\,dy + \int_{\phi < 0} |I_d(x, y) - \mu_2|^2 \, dx\,dy, \qquad (2)

where \mu_1 and \mu_2 are the average disparities inside (\phi \geq 0) and outside (\phi < 0) the curve, respectively, and I_d denotes the disparity map computed following the method of [54]. The length of the curve, Length(\phi), serves as a regularizing term. The two main steps to compute \phi are initialization and propagation. First, the pixels belonging to the FBD constitute \phi_0; when initializing \phi_0, the InitCurve function computes the signed distance from each pixel to the FBD. Second, the smooth surface \phi is completed by propagating \phi_0 along the curvature as described in [55]. Here, the smooth depth-guided-background surface is bounded by \phi, and the area inside the surface is taken as B_d. After detecting B_d, we use it to suppress the saliency of unattractive regions when developing the saliency based on disparity SD. A description of how to use this prior and an example of the proposed prior (Fig. 3(g)) are provided in the following section.

3.3. Saliency Estimation


3.3.1. Saliency Based on Disparity

Inspired by the results of [22], which reported that humans use depth knowledge to select salient objects that are closer, we include a factor concerning prominence at a depth level to influence the saliency of a region. Besides the prior based on stereoscopic perception, which focuses on regions close to the observer and with disparities around zero, other priors are considered. We assume that the compactness factor effectively enhances some regions, because salient objects tend to have a relatively concentrated distribution. In addition, we propose a prior to decrease the saliency value of regions located at image edges or on the depth-guided background. Given a disparity image I_d and a color image I, the saliency of the ith region on the disparity map is computed as follows. First, the compactness factor is employed to enhance regions with high compactness:

Sd_i = Ss_i \cdot CP_i, \qquad (3)

where Ss_i is the initial saliency based on stereoscopic perception as described by [32], which assigns high saliency values to regions close to the viewer and regions with small disparity. Then, CP_i, the color spatial distribution of the ith region of image I, is written as follows:

CP_i = \exp\left( -k \sum_{j=1}^{N} \| p_j - \nu_i \|^2 \, dis^{clr}_{i,j} \right), \qquad (4)

where dis^{clr}_{i,j} = \exp\left( -\frac{1}{2\sigma_c^2} \| clr_i - clr_j \|^2 \right) denotes the similarity of the two colors clr_i and clr_j, p_j is the centroid of region j, and \nu_i = \sum_{j=1}^{N} dis^{clr}_{i,j} \, p_j gives the weighted position of color clr_i. Second, because existing studies suggest that the boundary background prior is effective, the saliency values of boundary regions are set to 0 during the saliency computation. In practice, the regions at the image border constitute a set of background nodes, symbolized as B_b, and the saliency Sd_i of any region belonging to B_b defaults to non-salient.
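The compactness cue of Eqs. 3-4 can be sketched as below. The sketch assumes mean CIELab colors and normalized centroids per region; normalizing the color-similarity weights before computing nu_i (so that it remains a convex combination of positions, in the spirit of [28]) and the value of sigma_c are assumptions, while k = 4 follows Section 4.1.

```python
import numpy as np

def compactness(clr, pos, k=4.0, sigma_c=0.25):
    """CP_i of Eq. 4. clr: (N,3) mean Lab colors; pos: (N,2) centroids in [0,1]."""
    diff2 = ((clr[:, None, :] - clr[None, :, :]) ** 2).sum(-1)
    dis_clr = np.exp(-diff2 / (2.0 * sigma_c ** 2))        # colour similarity dis^clr_{i,j}
    w = dis_clr / dis_clr.sum(axis=1, keepdims=True)       # row-normalized weights (assumption)
    nu = w @ pos                                            # weighted position nu_i
    spread = ((pos[None, :, :] - nu[:, None, :]) ** 2).sum(-1)  # ||p_j - nu_i||^2
    return np.exp(-k * (spread * w).sum(axis=1))            # CP_i

def saliency_on_disparity(Ss, CP):
    return Ss * CP                                          # Eq. 3: Sd_i = Ss_i * CP_i
```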



Figure 3: Saliency on the disparity map: (a) Input image; (b) Disparity map; (c) Original saliency based on disparity Ss; (d) Compactness map CP; (e) Saliency Sd enhanced with CP; (f) Saliency Sd excluding the boundary background prior B_b; (g) The proposed depth-guided background B_d (the white pixels); (h) Saliency SD combined with our prior using Eq. 5.


Moreover, we assume that the salient region is not included in any region belonging to the depth-guided background. The measured saliency combined with our prior is

SD_i = \begin{cases} Sd_i, & \bar{d}_i < \zeta, \\ 0, & \text{otherwise}, \end{cases} \qquad (5)

where \bar{d}_i represents the average disparity of the ith region and \zeta is a threshold that classifies a region as salient or non-salient. We define \zeta = \min(I_d(q)), where q is a pixel belonging to the depth-guided background (q \in B_d). In particular, because the proposed prior B_d is obtained at the pixel level while saliency based on disparity is assigned at the region level, we provide a process to determine whether a given region is included in B_d. Because disparity is inversely proportional to depth level, this process uses the mean disparity of a region to decide whether it is part of the non-salient region, and the value of \zeta is determined by the disparity of the depth-guided background. In contrast to Niu's [32] approach, which primarily considers saliency as a nearly linear function of disparity, our method includes two adaptations to develop saliency based on disparity. First, the influence of compactness is introduced to guarantee that non-salient regions with low compactness are scattered, while regions with high compactness are concentrated. Second, any region in the background priors (boundary background and depth-guided background) is assumed to be non-salient, which eliminates noise. Using this approach, we can better distinguish the salient regions. Fig. 3 shows an example of saliency based on disparity. Fig. 3(c) shows the initial saliency from Niu [32], where the saliency values of some regions have low contrast compared with those of the statue. Fig. 3(d) shows the compactness map (CP) of the image. After incorporating compactness into our method, the saliency of the sky is largely suppressed owing to its low color compactness, as shown in Fig. 3(e). In Fig. 3(f), boundary regions are assigned the lowest saliency values. Fig. 3(g) displays the depth-guided-background prior computed as described in Section 3.2, and a sample result of fusing the depth-guided background is shown in Fig. 3(h). Compared to Fig. 3(c), the saliency of regions belonging to the background priors (B_b or B_d) is successfully inhibited, and the statue is more salient, as displayed in Fig. 3(h).
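A compact sketch of how Eqs. 3 and 5 and the boundary prior combine into SD is given below; it is a sketch under the stated assumptions, not the authors' code, and it assumes all per-region quantities have already been computed.

```python
import numpy as np

def saliency_based_on_disparity(Ss, CP, mean_disp, is_boundary, zeta):
    """Ss: stereoscopic-perception saliency [32]; CP: compactness (Eq. 4);
    mean_disp: per-region mean disparity; is_boundary: region touches the image
    border (B_b); zeta: minimum disparity over the depth-guided background B_d."""
    Sd = Ss * CP                                   # Eq. 3
    Sd = np.where(is_boundary, 0.0, Sd)            # boundary background prior B_b
    return np.where(mean_disp < zeta, Sd, 0.0)     # Eq. 5: depth-guided background
```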

3.3.2. Contrast Feature

Based on reports that disparity contrast and color contrast influence saliency unequally, we treat the color and disparity spaces differently when defining the features for each region. Because a salient object tends to be more compact than objects in the background, the compactness of color is used to weight the contrast of disparity. Given a color image I, we compute each region's color feature in the CIELab color space. Because the disparity between the left and right images has an impact on saliency, we also acquire a feature from the disparity map I_d. The average value of the features in every region forms a vector, denoted as f = [l, a, b, d], that represents the features of the region.


Figure 4: Effect of the feature contrast: (a) Left image; (b) Right image; (c) Disparity map; (d) Compactness map (CP) used as the fourth element of u; (e) Feature distance calculated by Eq. 6, in which CF_{i,j} uses u' = [1, 1, 1, 1] instead of u; (f) Feature distance from Eq. 6, where CF_{i,j} considers CP as the fourth component of u.


We define a vector u = [1, 1, 1, CP] to weight the factors of color contrast and disparity contrast. The element CF_{i,j}, which denotes the feature distance from region i to region j, is defined as follows:

CF_{i,j} = \| u_i f_i - u_j f_j \|_2, \quad i, j \in \{1, \dots, N\}. \qquad (6)


Here, f_i and f_j represent the feature vectors of regions i and j, respectively, N is the total number of regions, and u_i and u_j indicate the weights of these two feature vectors. The fourth component of u_i is the compactness CP_i of the ith region's color, computed by Eq. 4. The total contrast of the ith region to its surrounding (N − 1) regions is then given by CF_i = \sum_{j=1}^{N} CF_{i,j}. Unlike the color contrast, the weight of the disparity contrast relies on the color spatial distribution, which means that highly compact regions contribute more to the contrast on the disparity map, while regions with low compactness contribute less. Given two regions with equal disparity values, the disparity contrast of a highly compact region is likely to be higher than that of one with low compactness. Because a salient object tends to be compact in color space, the disparity contrast of a salient region will thus be high, whereas the contribution of cluttered regions with scattered color is reduced. With the help of the compactness factor, this contrast feature works well even on scenes with relatively low-quality disparity. The effects of the feature distances among regions (CF_{i=1\cdots N}) are shown in Fig. 4. As shown in Fig. 4(e), when the compactness factor is ignored by setting all elements of u to 1, the contrasts of background regions pop out on the disparity map compared with those of their surrounding regions. However, when the disparity contrast is combined with compactness, the contribution of the scattered background regions decreases. As Fig. 4(f) shows, when the feature distances are computed with CP (Eq. 4) as the fourth element of u, the saliency of the flowers is preserved owing to their higher compactness.
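The weighted feature distance of Eq. 6 can be sketched as follows, assuming f holds the mean [l, a, b, d] vector of every region and CP the compactness of Eq. 4.

```python
import numpy as np

def feature_contrast(f, CP):
    """CF_{i,j} = ||u_i f_i - u_j f_j||_2 with u_i = [1, 1, 1, CP_i] (Eq. 6).

    f:  (N,4) mean [l, a, b, d] per region
    CP: (N,)  compactness weights
    """
    f = np.asarray(f, dtype=float)
    u = np.ones_like(f)
    u[:, 3] = CP                                   # weight only the disparity channel
    uf = u * f
    diff = uf[:, None, :] - uf[None, :, :]
    return np.linalg.norm(diff, axis=-1)           # (N, N) matrix of CF_{i,j}
```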


3.4. Proposed Model Implementation


In this paper, we assume that the contrast between salient regions and their surrounding regions is high. First, the left view is smoothed using [56] and then divided into super-pixels with the approach of [57] to reduce the computational complexity; the super-pixels are treated as regions. Then, the feature contrasts between regions are computed. After computing the saliency based on disparity, which considers multiple cues to prioritize each region's contrasts, the saliency of each region is estimated as its weighted contrast to all the other regions. Following the approach of [28] for emphasizing the effect of the compactness measure, a saliency value is assigned to each region by rewriting Eq. 1 as

S^r_i = CP_i \cdot \left( \sum_{j=1}^{N} CF_{i,j} \cdot GDST_{i,j} \right) \cdot SD_i. \qquad (7)
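The pre-processing can be reproduced approximately with off-the-shelf tools, as in the sketch below. The file name is hypothetical, the edge-preserving smoothing of [56] is replaced by a plain Gaussian blur, and N = 105 super-pixels follows Section 4.1; the remaining SLIC parameters are assumptions.

```python
import numpy as np
from scipy import ndimage
from skimage import io, segmentation, color

left = io.imread("left_view.png").astype(float) / 255.0        # hypothetical left view
smoothed = ndimage.gaussian_filter(left, sigma=(1.0, 1.0, 0))  # stand-in for the smoothing of [56]
labels = segmentation.slic(smoothed, n_segments=105, compactness=10.0, start_label=0)

# Mean CIELab colour per super-pixel: the [l, a, b] part of the feature vector f.
lab = color.rgb2lab(smoothed)
n_regions = labels.max() + 1
region_lab = np.array([lab[labels == r].mean(axis=0) for r in range(n_regions)])
```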

To eliminate errors from super-pixel segmentation, [47] introduced the idea that every pixel's saliency can be interpreted as a weighted linear combination of the saliency of its surrounding regions. The pixel-wise saliency is defined as

S_p(i) = \sum_{j=1}^{N} \exp\left( -\left( \alpha \| clr_i - clr_j \|^2 + \beta \| p_i - p_j \|^2 \right) \right) S^r_j, \qquad (8)

where \alpha and \beta are two parameters that control the sensitivity to the color distance \| clr_i - clr_j \| and the position distance \| p_i - p_j \|, respectively, and N is the number of regions surrounding the ith region. The color value clr is an RGB color vector, and the position p is the coordinate vector of a pixel in the ith region. Choosing a Gaussian weight ensures that this up-sampling process is both local and color sensitive. As reported in [58], the center-bias preference is a factor that influences attention and is modeled there by a Gaussian function. We supplement the CBW (center bias weight) with the background priors, as given in the following equation:

CBW(i) = \begin{cases} 0, & p_i \in B, \\ \exp\left( -DstToCt(i) / (2\sigma_{xy}^2) \right), & \text{otherwise}, \end{cases} \qquad (9)


where DstToCt(i) is the distance from the point (x_i, y_i) to the central point (x_{ct}, y_{ct}) of image I, and B = (B_b \cup B_d) is the set of background pixels. The standard deviation is \sigma_{xy} = \sqrt{H^2 + W^2}/2, where H and W are the height and width of image I, respectively. Although the model based on color and disparity contrast achieves good performance, other 2D structure-based features can still boost the saliency map. In [59], PCA is employed to calculate spatial dissimilarity. In our model, the saliency maps from the 3D and 2D features are mixed together; the merged saliency map is the product of the center bias and the original 3D and 2D saliency,

S(i) = CBW(i) \cdot S_{rpca}(i) \cdot S_p(i), \qquad (10)

We denote the region-level 2D saliency map as S_{rpca}. For stereoscopic images, this operation is performed on the left-view image. The region-level distinctness map regarding structure is taken as the average of the pixel-level saliency in each region, S_{rpca}(i) = \sum_{p \in r_i} S_{pca}(p) / N_i, where N_i is the number of pixels in the ith region and S_{pca}(p) is the saliency value [59] of the pth pixel in the ith region. Because the region-level map inherits inaccuracies from the Simple Linear Iterative Clustering (SLIC) segmentation, we obtain the final pixel-wise saliency using a weighted linear combination of the surrounding pixels' saliency on the saliency map S calculated by Eq. 10.
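A minimal sketch of Eqs. 9-10 follows. It assumes S_rpca and S_p have already been computed as pixel-level maps in [0, 1] and that background_mask marks the pixels of B = B_b ∪ B_d; the distance term is used unsquared, exactly as written in Eq. 9.

```python
import numpy as np

def center_bias_weight(height, width, background_mask):
    """CBW of Eq. 9: zero on background pixels, Gaussian-like fall-off elsewhere."""
    ys, xs = np.mgrid[0:height, 0:width]
    dst_to_ct = np.hypot(xs - (width - 1) / 2.0, ys - (height - 1) / 2.0)
    sigma_xy = np.sqrt(height ** 2 + width ** 2) / 2.0
    cbw = np.exp(-dst_to_ct / (2.0 * sigma_xy ** 2))
    cbw[background_mask] = 0.0
    return cbw

def final_saliency(S_rpca, S_p, background_mask):
    """Eq. 10: product of centre bias, 2D structure saliency and 3D saliency."""
    h, w = S_p.shape
    return center_bias_weight(h, w, background_mask) * S_rpca * S_p
```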


4. Experimental Results

To evaluate the proposed model, we performed experiments on the SSB dataset provided by Niu [32]. This publicly available dataset consists of 1,000 pairs of stereoscopic images along with the corresponding masks of the salient objects in the left images. Niu followed the procedure designed by Liu [60] to build the benchmark: the most salient object in each image is marked with a rectangle by three users, the images with the least consistent labels are removed, and a mask of the salient object in each remaining image is manually segmented using Adobe Photoshop. We evaluated the results on the left images and compared our model (SCDGB) with other saliency detection approaches: SWD [59], wCtr [49], RC [31], LPS [47], CDST [61], SS [32], WSC [24], SSD [20], 3D [23] and GTDS [62]. Of these models, the first four are designed for 2D images; the others, except the co-saliency method CDST [61], are intended for stereoscopic images. For the comparison, we used the published code for all the algorithms except SS, which we implemented on the basis of the super-pixel segmentation in Matlab R2012b (64-bit).

4.1. Evaluation Metrics and Parameter Setup

We used three general measures [63] to assess the performance of the saliency detection model: precision-recall (PR), the receiver operating characteristic (ROC) and the mean absolute error (MAE). The first two measures are related to the overlap between salient objects and the prediction; the third takes both salient and non-salient assignments into account. In addition, we report two scores: the F-measure and the AUC. A good saliency detection model yields high F_\beta and AUC scores and low MAE values.

PR and ROC are both based on the saliency map binarized by a threshold that ranges from 0 to 255. Given this binary map, precision is defined as the proportion of correctly assigned salient pixels among all detected salient pixels, whereas recall is defined as the proportion of correctly assigned salient pixels among the non-zero pixels of the ground-truth map. The final precision-recall curve is formed from the set of precision and recall scores. For a binary saliency map, the ROC curve reports the false positive rate (FPR) against the true positive rate (TPR). The F-measure is computed as the weighted harmonic mean of precision and recall, F_\beta = (\beta^2 + 1) P \times R / (\beta^2 \times P + R), where P and R denote precision and recall, respectively, and \beta^2 is set to 0.3 as in [31] to emphasize precision. The AUC score is the area under the ROC curve. Because the metrics based on a binary saliency map ignore the effects of non-salient pixels, we also adopt the MAE [28]. Given a continuous saliency map and the ground truth, the MAE is the mean absolute error between the two maps, \mathrm{MAE} = \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)| / (W \times H), where W and H denote the width and height of the saliency map, S and G indicate the saliency map and the ground-truth map, respectively, and x, y are pixel coordinates.

In our implementation, we suggest that the number of super-pixels N should range from 50 to 200; in this paper, we set N to 105. The variance of the Gaussian distance (\sigma^2 in Eq. 1) is set to 0.4 [31]. Following the setting in [47], \alpha and \beta are set to 0.33. As discussed in Section 3.3.2, we empirically set k in CP_i to 4 to compute the compactness factor. We also conducted an experiment to assess the influence of k over a range of values, choosing MAE as the metric because it does not require an additional binarization. The MAE behaves as a slightly concave function of k: the lowest MAE of 0.1635 is reached when k is set to 4, the highest MAE of 0.1647 when k is set to 6, and when k is set to 3 and 5 the corresponding MAE values are 0.1635 and 0.1643, respectively. In our implementation of the depth-guided-background prior, the propagation parameters are as in [55].
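These scores can be computed directly from a saliency map and its binary ground truth, as in the sketch below (maps assumed to be normalized to [0, 1]; the small epsilon guarding against division by zero is an implementation detail, not from the paper).

```python
import numpy as np

def precision_recall(sal, gt, thresh):
    """Precision/recall of the map binarized at `thresh`; gt is a boolean mask."""
    pred = sal >= thresh
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-12)
    recall = tp / (gt.sum() + 1e-12)
    return precision, recall

def f_measure(precision, recall, beta2=0.3):
    return (beta2 + 1.0) * precision * recall / (beta2 * precision + recall + 1e-12)

def mae(sal, gt):
    return np.abs(sal - gt.astype(float)).mean()
```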


4.2. Validation of Priors

We first give two examples of the depth-guided-background prior B_d, as shown in Fig. 5. Given a disparity map (Fig. 5(b)), a smooth surface with little-changing disparity is obtained and marked as the non-salient area (the white pixels in Fig. 5(d)). In the area consisting of black pixels in Fig. 5(d), the disparity changes relatively sharply. Although this prior does not cover the entire non-salient area, it works well at delimiting the black pixels that could possibly contain a salient object; in other words, the salient object almost certainly does not belong to B_d in Fig. 5(d). In addition, we performed three experiments to demonstrate the effect of the background priors. A prior is converted into a saliency map by treating the area filled with white pixels (B_d) as the detected non-salient portion, while the remaining portion is treated as possibly salient. In addition to B_d, the boundary prior B_b and the combination of the two priors also form saliency maps; the three saliency maps are abbreviated as S_Bd, S_Bb and S_Bdb. The F-measure, the average precision and the average recall are listed in Table 1.


Figure 5: Proposed background prior: (a) Input left image; (b) Disparity image; (c) Smoothed disparity image; (d) Depth-guided-background (the white pixels denote Bd ).


Table 1: Results of validating priors.

Name                                   Fβ     Precision   Recall
S_Bd  (with our proposed prior B_d)    0.26   0.24        0.53
S_Bb  (with boundary prior B_b)        0.33   0.33        0.52
S_Bdb (using B_d and B_b)              0.36   0.34        0.78


Based on the average recall (the last column in Table 1), the rate increases by approximately 25% with the two priors combined, while the performances of the single priors are similar. Considering precision and the F-measure (the first two columns in Table 1), although the S_Bd map has somewhat lower rates (F_{0.3}: 0.26 and Precision: 0.24) than the S_Bb map (F_{0.3}: 0.33 and Precision: 0.33), the two-prior version boosts the united priors' effect, with F_{0.3} and Precision raised by approximately 1% and 2%, respectively. Because the prior's saliency provides only a rough area that includes the salient object, the precision rate is relatively low. Overall, S_Bdb from the two combined priors improves the performance compared with using B_b or B_d alone; moreover, our proposed prior is clearly effective in assisting the saliency computation.

4.3. Baseline Method


A saliency map is obtained by combining the feature contrast and the feature weights assigned by other cues. To investigate our scheme of fusing features and weighting feature contrast, we devised three baseline methods to compare with the proposed method (SCDGB). The first baseline, named BL, implements the traditional contrast-based strategy, while the other two baselines, named BL + CF and BL + SD, alter individual components of the first baseline. The properties of the baseline models are briefly summarized in Table 2. We first give a detailed description of BL and then describe how the other methods modify it.

Method 1: BL. We first designed a conventional contrast-based method extended from 2D (color) to 3D (color and disparity), which we call BL. The elements of its feature are the values from the L*a*b* and disparity spaces, represented as f' = [l, a, b, d]. The feature contrast is computed according to Eq. 11, and the saliency is obtained through Eq. 12:

C_{i,j} = \| f'_i - f'_j \|_2, \qquad (11)

S^1_i = \sum_{j=1}^{N} GDST_{i,j} \cdot C_{i,j}. \qquad (12)


In Eq. 12, GDST_{i,j} represents the Gaussian spatial distance between two regions, and N is the number of regions.

Method 2: BL + CF. This method is mainly designed to analyze the fusion scheme based on the compactness factor; BL + CF means that CF (Eq. 6) is applied. In contrast to BL, this method calculates a region's feature contrast weighted by the extent of compactness, as detailed in Section 3.3.2. The saliency value is then written as follows:

S^2_i = \sum_{j=1}^{N} GDST_{i,j} \cdot CF_{i,j}. \qquad (13)


Method 3: BL + SD. This method investigates the influence of saliency based on disparity. In BL + SD, SD (Eq. 5) is utilized, and the saliency computation is completed by rewriting S^1_i with the additional disparity-based saliency factor:

S^3_i = \left( \sum_{j=1}^{N} C_{i,j} \cdot GDST_{i,j} \right) \cdot SD_i. \qquad (14)
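The ablation of Table 2 reduces to a few array operations, sketched below under the assumption that C, CF, GDST, CP and SD have already been computed (shapes (N, N) for the pairwise terms and (N,) for the rest).

```python
import numpy as np

def baseline_saliencies(C, CF, GDST, CP, SD):
    S1 = (GDST * C).sum(axis=1)                    # BL,      Eq. 12
    S2 = (GDST * CF).sum(axis=1)                   # BL + CF, Eq. 13
    S3 = (GDST * C).sum(axis=1) * SD               # BL + SD, Eq. 14
    S_full = CP * (GDST * CF).sum(axis=1) * SD     # SCDGB,   Eq. 7
    return S1, S2, S3, S_full
```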


Table 2: Our model and baseline methods.

Model    Feature                        Saliency
SCDGB    CF (Eq. 6), considering CP     S^r (Eq. 7), with SD
BL       C (Eq. 11), discarding CP      S^1 (Eq. 12), without SD
BL+CF    CF (Eq. 6), considering CP     S^2 (Eq. 13), without SD
BL+SD    C (Eq. 11), discarding CP      S^3 (Eq. 14), with SD


Figure 6: Quantitative results of our model and the baseline methods. (a) PR curves (SCDGB and BL); (b) ROC curves (SCDGB and BL); (c) Average precision, recall and F-measure; (d) PR curves (SCDGB and BL+CF); (e) ROC curves (SCDGB and BL+CF); (f) MAE; (g) PR curves (SCDGB and BL+SD); (h) ROC curves (SCDGB and BL+SD); (i) AUC.


4.4. Single Component Analysis


Fig. 6 depicts the performance of SCDGB and the three baselines (BL, BL + CF and BL + SD). From the PR and ROC curves, SCDGB obtains the best performance among the four. Specifically, SCDGB and BL + SD improve by approximately 10% over BL and BL + CF when the recall rate ranges from 0 to 80% according to the PR curves. The ROC curves show a similar picture: the TPRs of SCDGB and BL + SD are approximately 20% higher than those of the other two methods, while the FPR is about 10% lower. To assess the holistic PR performance, we calculated the average precision and recall; the average precision of our model and BL + SD is higher than that of the other baselines by approximately 58%. In addition, regarding the F_{0.3} and AUC scores, SCDGB and BL + SD obtain values approximately 5% and 7% higher than BL and BL + CF, which reach approximately 53% and 82%, respectively. Given the differences in the calculations (see Section 4.3), this analysis verifies that the SD factor is highly effective. Comparing SCDGB with BL + SD, we find that SCDGB provides superior performance; its precision and F-measure are more than 5% higher. In addition, SCDGB yields MAE scores approximately 1% lower and AUC scores approximately 1% higher than BL + SD, which implies that SCDGB detects more salient pixels and fewer non-salient pixels owing to the scheme of fusing color contrast and disparity contrast. Similarly, when comparing BL with BL + CF, the latter achieves higher AUC, precision and F-measure values and lower MAE values, although the differences do not exceed 1%. Thus, the results show that the contrast fusion considering compactness has a stronger effect in the models that include SD (SCDGB and BL + SD) than in those without it (BL and BL + CF). Overall, a model in which contrast is weighted with saliency based on disparity is more likely to succeed in saliency detection. Specifically, we can conclude that a strategy that takes SD into account plays an important role in improving saliency, whereas the fusion of color contrast and disparity contrast has only a slight effect. Nevertheless, it is worth reiterating that, when a model uses only the contrast from color and disparity, the best performance is obtained by combining SD with CF.

4.5. Quantitative Comparisons


We illustrate the performance comparison with other models in Fig. 7. Based on the PR curves in Fig. 7(a) and (d), the proposed SCDGB has a precision comparable to GTDS [62] and approximately 5% higher than LPS [47]; both models achieve an accuracy above 85%, higher than the other models (around 80%). To evaluate the overall precision, Fig. 7(c) shows the average precision rate, where SCDGB achieves above 72%, approximately 6% higher than the second highest (WSC [24] at 66%); the other models obtain accuracies below 60%. In addition, regarding the F_{0.3} metric, both our method and WSC [24] achieve excellent performance, with SCDGB obtaining the highest value, approximately 1% higher than WSC [24]. GTDS [62] achieves a high recall rate because it produces a high number of false positive salient detections; however, high precision and F-measure are more important indicators for saliency detection. As the ROC curves in Fig. 7(b) and (e) show, the proposed model ranks second: SCDGB performs only slightly worse than GTDS [62] but consistently outperforms all the other compared methods. We calculated the area under the ROC curves and display the resulting scores in Fig. 7(f), which shows that SCDGB outperforms most of the traditional models on the AUC score; SCDGB's AUC score is only 2% lower than that of GTDS [62] and 1% higher than that of LPS [47]. Moreover, when ranked by MAE in Fig. 7(f), SCDGB, WSC [24] and LPS [47] perform nearly the same, while our method achieves excellent performance among the models regarding F_\beta and precision. Overall, from these three scores, the proposed model performs considerably better than the existing approaches.

We also calculated the average time each model consumes when processing an image of the SSB dataset. Except for RC [31], which was implemented in C++, all the models were executed in MATLAB. For the algorithms that require a disparity map as an additional input, the time required to calculate the disparity map is not included in the total time. The four fastest models are RC [31], GTDS [62], SWD [59] and wCtr [49], which all require less than 0.1 s, while 3D [23], SS [32] and SSD [20] require approximately 0.3 s, 0.5 s and 1.7 s, respectively. The remaining methods (LPS [47], WSC [24], SCDGB and CDST [61]) require no more than 3.5 s (2.5 s, 2.6 s, 3.2 s and 3.3 s) to obtain the saliency map. Although these methods require more time than the others, they yield higher accuracy. With N being the number of regions in the input image, algorithms based on contrast, such as RC [31] and SS [32], contain a step that calculates the contrast between each region and its surroundings, whose complexity is approximately O(N^2); SS [32] additionally involves a domain-knowledge analysis of complexity almost O(N). Our model spends extra time on the depth-guided-background prior computation, a step that is absent from previous saliency detection models. The proposed SCDGB model is therefore approximately 0.6 s slower than either WSC [24] or LPS [47]; however, holistically, our model achieves superior performance. We also provide a detailed cost breakdown for the specific processing steps: the background prior estimation from disparity consumes 1.4 s (approximately 43% of the total), region contrast (pre-processing, the segmentation algorithm and the contrast calculation) consumes 1.7 s (approximately 53%), and the other processes require only 0.1 s (about 4%).

4.6. Qualitative Comparisons

Some visual comparisons of the saliency maps from different models are provided in Fig. 8. The stereoscopic saliency model with disparity GTDS [62] is better at detecting completely uniform objects; however, in some cases its saliency map is not compact, and it is often disturbed by the background. The maps generated by SWD [59] clearly show blob-like salient objects; this model is more suitable for locating areas of human visual fixation.



Figure 7: Performance of the proposed model compared with previous models: (a) PR curves of saliency maps for stereoscopic images; (b) ROC curves computed with models for stereoscopic images; (c) Average precision, recall and F-measure; (d) PR curves of saliency maps for 2D images; (e) ROC curves of saliency maps computed with models for 2D images; (f) MAE and AUC.



Figure 8: Visual comparisons of saliency detection models: (a) Left image; (b) Right image; (c) Disparity map; (d) Ground truth map; (e) SCDGB; (f) WSC [24]; (g) CDST [61];(h) LPS [47]; (i) wCtr [49]; (j) SSD [20]; (k) SS [32]; (l) RC [31]; (m) 3D [23]; (n) SWD [59]; (o) GTDS [62].


The other models yield clearer salient object edges. The salient regions from 3D [23] are more accurate than those from SWD [59]: the salient areas are assigned higher values and the other regions smaller values, and the distribution of saliency resembles a Gaussian. The LPS [47] model effectively separates the complete salient object from the background with the help of a foreground proposal scheme; however, it takes background noise as part of the object when multiple objects exist at the same time. Saliency analysis with SS [32] detects the most salient content successfully; however, it sometimes assigns lower values to small parts of the background than to the salient region. The models wCtr [49] and SSD [20] cannot detect salient objects very well in complex scenes. CDST [61] is likely to fail when the color of the salient object is similar to that of the boundary regions. The RC [31] method is not able to handle images with complex textured backgrounds. Because the RC model depends on disparity, a misleading disparity map results in failure. The saliency maps computed by WSC [24] are characteristically uniform and provide complete contours for the salient objects; however, errors occur when the foreground is similar to the background. SCDGB extracts more accurate objects while including few components from non-salient background areas. Many of the models detect good saliency maps when only one object exists in a simple scene. By contrast, because the proposed model relies on the weighted contrasts of the color and disparity maps, it handles challenging scenes well, such as when the foreground is similar to the background or when the disparity map has low contrast or is inaccurate. In addition, our proposed model works well on naturalistic scenes with complex backgrounds. By adding saliency based on disparity, we obtain lower saliency values for non-salient areas belonging to the priors and higher values for compact salient objects. However, when a large part of the disparity map is misleading or the background is highly complex, our method has difficulty performing well. Overall, as shown in the second- and third-to-last rows of Fig. 8, our saliency detection results are comparatively better than those of the other models. Nevertheless, as explained, our model depends on disparity and contrast; in a challenging scene where both low contrast and a misleading disparity map exist, the proposed method fails. A case in point is the last row of Fig. 8, where the salient object spans a different depth range and the contrast between the salient and non-salient regions is extremely low; this is a truly challenging case for our approach. One reason is that the saliency analysis based on disparity cannot separate the salient object from the background, because its depth range is wider than that of its surroundings, which means that the proposed prior detects only a small part of the background. Another reason is that color compactness cannot help, owing to the similar colors of the salient and non-salient regions, which means that it fails to improve the saliency detection based on disparity or contrast.

5. Conclusions


In this paper, we propose a saliency detection model for stereoscopic 3D images. We combine two background priors and compactness to develop saliency based on disparity. The background priors include a depth-guided-background prior explored through the disparity map and a boundary background prior. The saliency based on disparity is used to assign a priority weight to each region's contrast. To measure the contrast of each region, we build a scheme that fuses the respective contrasts of disparity and color. The initial saliency map is estimated from the global contrast and the saliency based on disparity. Then, to refine the initial saliency map, the spatial dissimilarity features under reduced dimensions and the central preference are both used to improve the saliency map. Experimental results demonstrate that the proposed model performs well on stereoscopic 3D still images.

Acknowledgements

This research is partially sponsored by the National Natural Science Foundation of China [grant numbers 61370113, 61472387, 61572004 and 61771026], the Beijing Municipal Natural Science Foundation [grant numbers 4152005 and 4152006], the Science and Technology Program of Tianjin [grant number 15YFXQGX0050], and the Science and Technology Planning Project of Qinghai Province [grant number 2016-ZJ-Y04].

References

[1] C. E. Connor, H. E. Egeth, S. Yantis, Visual attention: bottom-up versus top-down, Current Biology 14 (19) (2004) R850–R852.

[2] J. Theeuwes, Top–down and bottom–up control of visual selection, Acta Psychologica 135 (2) (2010) 77–99.
[3] J. Li, Y. Tian, T. Huang, W. Gao, Probabilistic multi-task learning for visual saliency estimation in video, International Journal of Computer Vision 90 (2) (2010) 150–165.
[4] A. Borji, Boosting bottom-up and top-down visual features for saliency estimation, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 438–445.
[5] L. Itti, G. Rees, J. Tsotsos, Models of bottom-up attention and saliency, Neurobiology of Attention, 582.
[6] X. Hou, L. Zhang, Saliency detection: A spectral residual approach, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.
[7] Q. Yan, L. Xu, J. Shi, J. Jia, Hierarchical saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155–1162.
[8] J. Shen, Y. Du, W. Wang, X. Li, Lazy random walks for superpixel segmentation, IEEE Transactions on Image Processing 23 (4) (2014) 1451–1462.
[9] W. Wang, J. Shen, F. Porikli, Saliency-aware geodesic video object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3395–3402.
[10] W. Wang, J. Shen, L. Shao, Consistent video saliency using local gradient flow optimization and global refinement, IEEE Transactions on Image Processing 24 (11) (2015) 4185–4196.
[11] D. Zhang, J. Han, C. Li, J. Wang, X. Li, Detection of co-salient objects by looking deep and wide, International Journal of Computer Vision 120 (2) (2016) 215–232.
[12] D. Zhang, J. Han, L. Jiang, S. Ye, X. Chang, Revealing event saliency in unconstrained video collection, IEEE Transactions on Image Processing 26 (4) (2017) 1746–1758.
[13] N. Courty, E. Marchand, B. Arnaldi, A new application for saliency maps: Synthetic vision of autonomous actors, in: Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, Vol. 3, IEEE, 2003, pp. III–1065.
[14] J. W. Yoo, S. Yea, I. K. Park, Content-driven retargeting of stereoscopic images, IEEE Signal Processing Letters 20 (5) (2013) 519–522.
[15] C. Chamaret, S. Godeffroy, P. Lopez, O. Le Meur, Adaptive 3D rendering based on region-of-interest, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2010, pp. 75240V–75240V.
[16] F. Shao, W. Lin, S. Gu, G. Jiang, T. Srikanthan, Perceptual full-reference quality assessment of stereoscopic images by considering binocular visual characteristics, IEEE Transactions on Image Processing 22 (5) (2013) 1940–1953.
[17] J. Yang, Y. Wang, B. Li, W. Lu, Q. Meng, Z. Lv, D. Zhao, Z. Gao, Quality assessment metric of stereo images considering cyclopean integration and visual saliency, Information Sciences 373 (2016) 251–268.
[18] Q. Huynh-Thu, M. Barkowsky, P. Le Callet, The importance of visual attention in improving the 3D-TV viewing experience: Overview and new perspectives, IEEE Transactions on Broadcasting 57 (2) (2011) 421–431.
[19] H. Sohn, Y. J. Jung, S.-i. Lee, H. W. Park, Y. M. Ro, Attention model-based visual comfort assessment for stereoscopic depth perception, in: 2011 17th International Conference on Digital Signal Processing (DSP), IEEE, 2011, pp. 1–6.
[20] W. Wang, J. Shen, Y. Yu, K.-L. Ma, Stereoscopic thumbnail creation via efficient stereo saliency detection, IEEE Transactions on Visualization and Computer Graphics 23 (8) (2017) 2014–2027.
[21] J. Lei, S. Li, B. Wang, K. Fan, C. Hou, Stereoscopic visual attention guided disparity control for multiview images, Journal of Display Technology 10 (5) (2014) 373–379.
[22] A. Maki, P. Nordlund, J.-O. Eklundh, Attentional scene segmentation: integrating depth and motion, Computer Vision and Image Understanding 78 (3) (2000) 351–373.
[23] Y. Fang, J. Wang, M. Narwaria, P. Le Callet, W. Lin, Saliency detection for stereoscopic images, IEEE Transactions on Image Processing 23 (6) (2014) 2625–2636.
[24] N. Li, B. Sun, J. Yu, A weighted sparse coding framework for saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5216–5223.
[25] J. Geng, Three-dimensional display technologies, Advances in Optics and Photonics 5 (4) (2013) 456–535.
[26] L. Jansen, S. Onat, P. König, Influence of disparity on fixation and saccades in free viewing of natural scenes, Journal of Vision 9 (1) (2009) 29–29.
[27] J. Häkkinen, T. Kawai, J. Takatalo, R. Mitsuya, G. Nyman, What do people look at when they watch stereoscopic movies?, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2010, pp. 75240E–75240E.
[28] F. Perazzi, P. Krähenbühl, Y. Pritch, A. Hornung, Saliency filters: Contrast based filtering for salient region detection, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 733–740.
[29] Y. Wei, F. Wen, W. Zhu, J. Sun, Geodesic saliency using background priors, in: European Conference on Computer Vision, Springer, 2012, pp. 29–42.
[30] C. Yang, L. Zhang, H. Lu, X. Ruan, M.-H. Yang, Saliency detection via graph-based manifold ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[31] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, S.-M. Hu, Global contrast based salient region detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 569–582.
[32] Y. Niu, Y. Geng, X. Li, F. Liu, Leveraging stereopsis for saliency analysis, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 454–461.
[33] W. Einhäuser, P. König, Does luminance-contrast contribute to a saliency map for overt visual attention?, European Journal of Neuroscience 17 (5) (2003) 1089–1097.
[34] L. Itti, C. Koch, E. Niebur, et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.
[35] Y. Zhang, F. Zhang, L. Guo, Saliency detection by selective color features, Neurocomputing 203 (2016) 34–40.
[36] L. Duan, T. Xi, S. Cui, H. Qi, A. C. Bovik, A spatiotemporal weighted dissimilarity-based method for video saliency detection, Signal Processing: Image Communication 38 (2015) 45–56.
[37] H. Li, K. N. Ngan, A co-saliency model of image pairs, IEEE Transactions on Image Processing 20 (12) (2011) 3365–3375.
[38] H. Fu, X. Cao, Z. Tu, Cluster-based co-saliency detection, IEEE Transactions on Image Processing 22 (10) (2013) 3766–3778.
[39] D. Zhang, J. Han, J. Han, L. Shao, Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining, IEEE Transactions on Neural Networks and Learning Systems 27 (6) (2016) 1163–1176.
[40] D. Zhang, D. Meng, J. Han, Co-saliency detection via a self-paced multiple-instance learning framework, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (5) (2017) 865–878.
[41] L. Itti, C. Koch, Computational modelling of visual attention, Nature Reviews Neuroscience 2 (3) (2001) 194–203.
[42] D. A. Ruff, R. T. Born, Feature attention for binocular disparity in primate area MT depends on tuning strength, Journal of Neurophysiology 113 (5) (2015) 1545–1555.
[43] Y. Liu, L. K. Cormack, A. C. Bovik, Dichotomy between luminance and disparity features at binocular fixations, Journal of Vision 10 (12) (2010) 23–23.
[44] T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2106–2113.
[45] M. Wang, J. Li, T. Huang, Y. Tian, L. Duan, G. Jia, Saliency detection based on 2D log-Gabor wavelets and center bias, in: Proceedings of the International Conference on Multimedia, ACM, 2010, pp. 979–982.
[46] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, S. Li, Salient object detection: A discriminative regional feature integration approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2083–2090.
[47] H. Li, H. Lu, Z. Lin, X. Shen, B. Price, Inner and inter label propagation: salient object detection in the wild, IEEE Transactions on Image Processing 24 (10) (2015) 3176–3186.
[48] C. Xia, H. Zhang, X. Gao, Combining multi-layer integration algorithm with background prior and label propagation for saliency detection, Journal of Visual Communication and Image Representation 48 (2017) 110–121.
[49] W. Zhu, S. Liang, Y. Wei, J. Sun, Saliency optimization from robust background detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
[50] T. Zhao, L. Li, X. Ding, Y. Huang, D. Zeng, Saliency detection with spaces of background-based distribution, IEEE Signal Processing Letters 23 (5) (2016) 683–687.
[51] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, F. Wu, Background prior-based salient object detection via deep reconstruction residual, IEEE Transactions on Circuits and Systems for Video Technology 25 (8) (2015) 1309–1321.
[52] Z. Wang, D. Xiang, S. Hou, F. Wu, Background-driven salient object detection, IEEE Transactions on Multimedia 19 (4) (2017) 750–762.
[53] N. Qian, Binocular disparity and the perception of depth, Neuron 18 (3) (1997) 359–368.
[54] C. Liu, J. Yuen, A. Torralba, SIFT flow: Dense correspondence across scenes and its applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5) (2011) 978–994.
[55] T. F. Chan, L. A. Vese, Active contours without edges, IEEE Transactions on Image Processing 10 (2) (2001) 266–277.
[56] P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (7) (1990) 629–639.
[57] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11) (2012) 2274–2282.
[58] J. Wang, M. P. Da Silva, P. Le Callet, V. Ricordel, Study of center-bias in the viewing of stereoscopic image and a framework for extending 2D visual attention models to 3D, in: IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2013, pp. 865114–865114.
[59] L. Duan, C. Wu, J. Miao, L. Qing, Y. Fu, Visual saliency detection by spatially weighted dissimilarity, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 473–480.
[60] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, H.-Y. Shum, Learning to detect a salient object, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2) (2011) 353–367.
[61] W. Wang, J. Shen, L. Shao, F. Porikli, Correspondence driven saliency transfer, IEEE Transactions on Image Processing 25 (11) (2016) 5025–5034.
[62] F. Qi, D. Zhao, S. Liu, X. Fan, 3D visual saliency detection model with generated disparity map, Multimedia Tools and Applications (2016) 1–17.
[63] A. Borji, M.-M. Cheng, H. Jiang, J. Li, Salient object detection: A benchmark, IEEE Transactions on Image Processing 24 (12) (2015) 5706–5722.

Fangfang Liang received the M.S. degree in Computer Science from Three Gorges University, Yichang, China, in 2010. She is currently pursuing the Ph.D. degree at the Faculty of Information Technology, Beijing University of Technology, China. Her current research interests include Image Processing, Machine Learning and Computer Vision.

Lijuan Duan received the B.Sc. and M.Sc. degrees in computer science from Zhengzhou University of Technology, Zhengzhou, China, in 1995 and 1998, respectively. She received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2003. She is currently a Professor at the Faculty of Information Technology, Beijing University of Technology, China. Her research interests include Artificial Intelligence, Image Processing, Computer Vision and Information Security. She has published more than 70 research articles in refereed journals and proceedings on artificial intelligence, image processing and computer vision.

Wei Ma received her Ph.D. degree in Computer Science from Peking University in 2009. She is currently an Associate Professor at the Faculty of Information Technology, Beijing University of Technology, China. Her research interests include Image Processing, Computer Vision and their applications in the protection and exhibition of Chinese ancient paintings.

Yuanhua Qiao received the B.S. degree in mathematics from Qilu Normal University, Jinan, Shandong, in 1992, the M.S. degree in Applied Science from Beijing University of Technology, Beijing, in 1999, and the Ph.D. degree in fluid mechanics from the College of Life Science and Biotechnology, Beijing, in 2005. Since 1999, she has served successively as a Research Assistant, an Associate Professor and a Professor in Applied Science at Beijing University of Technology, China. Her research interests include dynamic analysis of neuronal networks, synchronization analysis of neuronal networks, differential equations and dynamical systems. Professor Qiao holds a membership in Mathematics Education; in 2006, she received a project completion certificate from the Ministry of Education.

Zhi Cai is a lecturer in the College of Computer Science, Beijing University of Technology, China. He obtained his M.Sc. in 2007 from the School of Computer Science at the University of Manchester and his Ph.D. in 2011 from the Department of Computing and Mathematics of Manchester Metropolitan University, U.K. His research interests include Information Retrieval, Ranking in Relational Databases, Keyword Search, Data Mining, Big Data Management & Analysis and Ontology Engineering.

Laiyun Qing is with the School of Information Science and Engineering, Graduate University of the Chinese Academy of Sciences, China. She received her Ph.D. in computer science from the Chinese Academy of Sciences in 2005. Her research interests include pattern recognition, image processing and statistical learning. Her current research focuses on neural information processing.
