Automatic tag saliency ranking for stereo images

Cao Yang (a), Kang Kai (a), Zhang Shijie (a), Zhang Jing (a), and Wang Zengfu (b)*

(a) Department of Automation, University of Science and Technology of China
(b) Institute of Intelligent Machines, Chinese Academy of Sciences
Abstract: With the rapid advances in 3D capture and display technology, tag ranking for stereo images is becoming a potential application in web image retrieval. Directly extending existing approaches to stereo images is problematic, as it may neglect the representative 3D elements. In this paper, a novel automatic tag saliency ranking algorithm for stereo images is presented. Specifically, a novel method of interacting with stereo images is proposed to segment the two views into regions simultaneously. Tags annotated at the image level are then propagated to the region level via an improved multi-instance learning algorithm. Next, a new 3D saliency detection algorithm is proposed that uses occlusion cues along with the contrast in color and disparity. Finally, tags are re-ranked according to the saliency values of the corresponding regions. Moreover, to evaluate the performance of our approach, an annotated stereo image dataset is set up. The experimental results on this dataset demonstrate the effectiveness of our approach.

1. INTRODUCTION

Stereo images and videos can provide an immersive 3D viewing experience, and accordingly have been introduced into many multimedia applications, such as 3D TV, free-viewpoint video, immersive teleconferencing and video game systems. With the development and maturation of 3D capture and display technology in recent years, an abundance of stereo images is being generated. Retrieving stereo images from enormous collections has therefore become an important research topic and practical problem. Although Content-Based Image Retrieval (CBIR) has been widely investigated [1,2], directly extending existing 2D image retrieval methods to stereo images is problematic, as it may neglect the representative elements in the extra dimension of depth and consequently lead to misleading results. A typical example is shown in Figure 1. The image in Figure 1 is annotated with the tags "water, dog, grass, bird". From the perspective of 2D viewing, the image is highly visually representative of both "water" and "dog". From the perspective of depth perception, however, the 3D salient regions of the image generally attract more attention, so the tag "dog" should be ranked ahead of "water" to facilitate image retrieval.

* Corresponding Author, mail to [email protected]
Figure 1. The first column is an exemplar stereo image pair. The second column is the consistent segmentation result, where each region is labeled with an assigned color acting as a tag from the list in the last row (Water, Dog, Grass, Bird, Void). The third column shows the 3D salient regions of the given stereo images.
The depth information can be estimated from the stereo correspondence between the left and right views of a stereo image pair. However, it is not always available even with stereo correspondence: occluded regions and areas near the image borders have no corresponding regions in the other view. Moreover, since a perfect stereo matching algorithm has yet to be invented, errors will exist in the estimated depth map. For these reasons, the depth cues cannot be used directly for image retrieval.

This paper focuses on the task of tag ranking for stereo images, a fundamental step for retrieving stereo images from large collections. In developing the tag ranking algorithm, the primary objective is to rank the existing tags according to their relevance scores with respect to the visual content of the given image [3]. This means measuring the visual representativeness of the tags with respect to the corresponding contents of the stereo images.

Tag ranking, which aims to order image tags according to their semantic relevance to the image content, has emerged as a hot topic recently [4]. Existing tag ranking methods can be roughly classified into two categories, i.e., tag relevance ranking and tag saliency ranking [5]. Tag relevance ranking attracted more attention among the earliest works. Li et al. [6] introduced a neighborhood voting method which learns tag relevance by accumulating votes from visual neighbors. Recently, Liu et al. [7] applied Kernel Density Estimation (KDE) to estimate the relevance of each tag individually, and further performed a random-walk based refinement to boost tag ranking performance. Tag saliency ranking, in which the annotated tags are re-ordered according to the saliency of the corresponding visual content, was first proposed in [8]. This method integrates a visual attention model to measure the importance of the regions of the given image. Therefore, it can provide more comprehensive information and is more consistent with human perception. Intuitively, the tag saliency ranking approach is more suitable for stereo images, since most stereo images contain distinct objects in order to achieve a better 3D viewing experience.
Therefore, the saliency values of the objects in the images can be utilized as importance measures to facilitate tag ranking.

Accordingly, we propose in this paper a tag saliency ranking approach for stereo images. Specifically, the two stereo views are first segmented simultaneously. Tags annotated at the image level are then propagated to the region level via an improved multi-instance learning algorithm. Next, the 3D saliency map extracted from the stereo pair is used to measure the importance of the regions of the given image. Finally, tags are ranked according to the saliency values of the corresponding regions. To evaluate the performance of our approach, we conduct experiments on our own stereo image dataset, which is composed of 424 photographs of 16 object classes. We labeled each image pair with assigned colors acting as tags from the list of object classes. The experimental results demonstrate the effectiveness of our approach. To the best of our knowledge, this is the first tag ranking approach proposed for stereo images.

The rest of the paper is organized as follows. Related work is presented in Section 2. The proposed tag saliency ranking approach is presented in Section 3. The experimental results are shown in Section 4 and conclusions are drawn in Section 5.

2. RELATED WORK

There has been significant previous work on tag refinement, including tag annotation [9–13], tag ranking [4,5], tag de-noising [14], etc. Although these previous techniques could be applied to multi-view images simply by processing the views separately, such an approach would neglect the valuable features [15,16] that represent the depth perception or other useful cues. Feng et al. [17] proposed a generic framework for stereo image retrieval in which disparity features extracted from the stereo pairs offer complementary clues to refine the retrieval results provided by the visual features. However, since a perfect stereo matching algorithm has yet to be invented, errors will exist in the estimated disparity, and directly using the disparity features would cause misleading results.

The visual attention model has been shown to improve tag ranking performance for images with multiple tags. Feng et al. [8] introduced the concept of tag saliency, in which visual saliency is used to investigate the ranking order of tags with respect to the given image; the annotated tags are then re-ordered according to the saliency of the corresponding visual content. This approach is based on the observation that users usually pay more attention to the visually salient regions of an image, and is therefore more consistent with human perception. Moreover, it can provide more comprehensive information when an image is relevant to multiple tags, such as those describing different objects in the image. In particular, the tag saliency ranking approach is well suited to stereo images: to achieve a better 3D viewing experience, stereo images tend to contain distinct objects instead of cluttered scenes, so the corresponding saliency properties can clearly emphasize the content of the given images. In this paper, we aim at extending the tag saliency ranking approach to stereo images. To accomplish this task, the annotated tags should first be propagated from the image level to the region level. This involves two steps, stereo image segmentation and region labeling. There is only limited literature on stereo image segmentation.
Although previous work on single-image segmentation [18–20] could be applied to stereo image pairs simply by joining the two images into an image volume, such an approach cannot handle the fact that corresponding pixels may have large disparities. A recent study [21] proposes a method for interactively selecting objects in stereo image pairs via graph cuts; however, this method requires a lot of user input. Levin et al. propose a closed-form solution to natural image matting [22], which can effectively extract a foreground object from the input image with a small amount of user input. In our work, we extend this matting method to stereo images by introducing stereo consistency between the mattes of the two views.

Region labeling is a weakly-supervised learning problem, since tags are usually associated with images rather than individual regions. Multi-instance learning (MIL) has proven to be an effective way to model such problems [23]. In MIL, an individual example is called an instance and a set of instances is called a bag. Training labels are associated with bags rather than instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is negative [24]. A lot of work has employed MIL for content-based image retrieval and automatic image annotation [23,25–27]. In this paper, we also utilize MIL to accomplish the label propagation from the image level to the region level. In our work, each segmented region is treated as an instance, and the segmented regions that form an image are grouped as a bag of instances. The tags are initially attached at the bag level instead of the instance level; an image is then annotated with keyword w if at least one region in the image has the semantic meaning of w.

3D saliency detection also plays an important role in our task. Although a wide variety of computational methods have been developed to detect saliency using 2D image features such as color, shape, orientation and texture, these methods show their limits in 3D saliency detection when salient objects do not exhibit visual uniqueness with respect to one or more of these features. Li and Ngan applied a linear combination of the single-image and multi-image saliency maps to detect co-saliency from an image pair [28]. Their method exploited local appearance, e.g., color and texture properties, to construct a co-multilayer graph describing the similarity between the two single-image saliency maps. Since their method did not consider the geometric correspondence between the image pair, its performance on stereoscopic images is limited. Niu et al. computed the contrast in disparity to identify the salient regions of stereoscopic images [29]. However, they did not discuss how to incorporate the stereo disparity cues with the 2D image features, and errors in the estimated disparity map may cause wrong detection results. To deal with the above problems, this paper proposes a novel algorithm to extract the saliency map from stereo images by using occlusion cues along with the contrast in color and depth. A detailed introduction of our algorithm is presented in Section 3.4.
3. ALGORITHM

In this section, we present our proposed approach to tag saliency ranking for stereo images. We first give an overview of the approach and then introduce the proposed co-segmentation algorithm for stereo images. Next, the label propagation from the image level to the region level and the 3D saliency detection algorithm are introduced in detail.
Figure 2. The flowchart of the proposed method: co-segmentation of the stereo pair, region labeling, 3D saliency detection, and tag ranking by saliency.
3.1. Overview

Before elaborating our method, we briefly describe how the disparity map is estimated from the left and right views of a stereo image pair. In this paper, the dense disparity map is obtained by stereo matching, one of the most thoroughly researched areas in computer vision. We refrain from reviewing the literature on stereo matching algorithms and instead build upon a recent study, fast cost volume filtering [30], which achieves good stereo matching performance in real time. Throughout this paper, we apply the fast cost volume filtering method for disparity estimation.

The flowchart of our proposed approach is illustrated in Figure 2. As can be seen, our approach consists of three steps: co-segmentation of the stereo images, region labeling, and 3D saliency detection. Given a stereo image pair and its associated tags, we first segment the two stereo images into consistent regions using our proposed co-segmentation algorithm. Then, we build the relationship between each tag and each segmented region pair individually by using a multi-instance learning (MIL) algorithm. We further perform 3D saliency detection to measure the importance of each region pair. Finally, the tags of the image are ranked according to the saliency values of the corresponding regions.

3.2. Co-segmentation for stereo images

In [22], a closed-form solution to natural image matting was proposed, in which the matting Laplacian matrix was introduced to evaluate the quality of a matte. The matte extraction problem thus becomes one of finding the alpha matte that minimizes the following function:

\alpha^* = \arg\min_{\alpha} \; \alpha^T L \alpha + \lambda (\alpha - \alpha_0)^T D (\alpha - \alpha_0)    (1)
Here L is the matting Laplacian, a sparse symmetric positive semidefinite N × N matrix whose entries are a function of the input image in local windows. L(i, j) is defined as:

L(i,j) = \sum_{q \mid (i,j) \in w_q} \left( \delta_{ij} - \frac{1}{|w_q|} \left( 1 + (I_i - \mu_q)^T \left( \Sigma_q + \frac{\varepsilon}{|w_q|} E \right)^{-1} (I_j - \mu_q) \right) \right)    (2)

where δ_ij is the Kronecker delta, μ_q is the 3 × 1 mean color vector in the window w_q around pixel q, Σ_q is the 3 × 3 covariance matrix in the same window, |w_q| is the number of pixels in the window, and E is the 3 × 3 identity matrix. ε is a small constant used for numerical stability; in our implementation we set it to 0.1^3. In addition, α_0 is a vector containing the alpha values for the constrained pixels specified by user input, and D is a diagonal matrix whose diagonal element is one for constrained pixels and zero for all other pixels. λ is a regularization parameter; it is generally set to a large number when minimal user input is available.
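For illustration, the following Python sketch constructs the matting Laplacian of Eq. (2) for a small RGB image. It is not the authors' MATLAB implementation; window handling at the image border and efficiency are simplified for brevity.

```python
import numpy as np
from scipy import sparse

def matting_laplacian(img, eps=1e-3, r=1):
    """Sparse matting Laplacian L of Eq. (2) for an H x W x 3 float image in [0, 1].
    eps is the regularization constant; windows are (2r+1) x (2r+1) (border windows skipped)."""
    H, W, _ = img.shape
    n = H * W
    m = (2 * r + 1) ** 2
    pix = np.arange(n).reshape(H, W)
    rows, cols, vals = [], [], []
    for y in range(r, H - r):
        for x in range(r, W - r):
            w_idx = pix[y - r:y + r + 1, x - r:x + r + 1].ravel()
            I = img[y - r:y + r + 1, x - r:x + r + 1].reshape(m, 3)
            mu = I.mean(axis=0)
            cov = I.T @ I / m - np.outer(mu, mu)               # Sigma_q
            inv = np.linalg.inv(cov + (eps / m) * np.eye(3))   # (Sigma_q + eps/|w_q| E)^-1
            Ic = I - mu
            G = (1.0 + Ic @ inv @ Ic.T) / m                    # affinity term inside Eq. (2)
            # accumulate delta_ij - G(i, j) over every window containing pixels i and j
            rows.append(np.repeat(w_idx, m))
            cols.append(np.tile(w_idx, m))
            vals.append((np.eye(m) - G).ravel())
    return sparse.coo_matrix((np.concatenate(vals),
                              (np.concatenate(rows), np.concatenate(cols))),
                             shape=(n, n)).tocsr()
```

The minimizer of Eq. (1) then satisfies the sparse linear system (L + λD)α = λDα_0, which can be solved with a standard sparse solver such as scipy.sparse.linalg.spsolve.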
In this paper, we extend the above method to stereo images by introducing a stereo consistency constraint on the alpha mattes of the left and right views. The co-segmentation problem for a stereo image pair can be written mathematically as

(\alpha_l, \alpha_r)^* = \arg\min_{(\alpha_l, \alpha_r)} \; \alpha_l^T L_l \alpha_l + \alpha_r^T L_r \alpha_r + \lambda_1 \sum_{(i,j) \in \Lambda} c_{ij} \left( \alpha_l(i,j) - \alpha_r(i, j - d(i,j)) \right)^2 + \lambda_2 \left( (\alpha_l - \alpha_{l0})^T D_l (\alpha_l - \alpha_{l0}) + (\alpha_r - \alpha_{r0})^T D_r (\alpha_r - \alpha_{r0}) \right)    (3)
where the subscripts l and r denote the left and right views, respectively, d(i, j) is the disparity value at position (i, j) in the left view, Λ denotes the position index set, and c_ij is a binary confidence value that evaluates the reliability of the corresponding disparity value d(i, j) via a left-right consistency test. α_{l0} and α_{r0} are the initial alpha mattes obtained from minimal user input. The first two terms are called the matte cost terms, the third term is called the stereo consistency constraint term, and the last two terms are called the user-specified constraint terms. In implementation, the user-specified constraint terms of both views can be given simultaneously, or only that of one view. If there is more than one object in an image, the objects can be extracted one by one. λ_1 and λ_2 are regularization parameters; in our implementation we set λ_1 to 0.01 and λ_2 to 100. Co-segmentation results are shown in Section 4.2.
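To make the optimization concrete, the sketch below assembles the quadratic objective of Eq. (3) into a single sparse linear system and solves it. It reuses the matting_laplacian helper from the previous sketch, treats the disparity d, the confidence mask c, the scribble masks and the λ values as given, and is an illustrative reading of Eq. (3) rather than the authors' implementation.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def co_segment(img_l, img_r, d, c, alpha0_l, alpha0_r, mask_l, mask_r,
               lam1=0.01, lam2=100.0):
    """Jointly solve Eq. (3) for the left and right alpha mattes.
    d: left-view disparity map, c: 0/1 confidence from the left-right check,
    alpha0_*: user-specified alpha values, mask_*: 1 where the user gave scribbles."""
    H, W, _ = img_l.shape
    n = H * W
    Ll = matting_laplacian(img_l)        # helper from the previous sketch
    Lr = matting_laplacian(img_r)

    # warp matrix P with (P a_r)(i, j) = a_r(i, j - d(i, j)), for confident, in-bounds pixels
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    jr = jj - d.astype(int)
    ok = (c > 0) & (jr >= 0) & (jr < W)
    P = sparse.coo_matrix((np.ones(ok.sum()),
                           (ii[ok] * W + jj[ok], ii[ok] * W + jr[ok])),
                          shape=(n, n)).tocsr()
    C = sparse.diags(ok.astype(float).ravel())       # diagonal of c_ij
    Dl = sparse.diags(mask_l.ravel().astype(float))
    Dr = sparse.diags(mask_r.ravel().astype(float))

    # setting the gradient of Eq. (3) to zero gives a symmetric block sparse system
    A = sparse.bmat([[Ll + lam1 * C + lam2 * Dl, -lam1 * C @ P],
                     [-lam1 * P.T @ C, Lr + lam1 * P.T @ C @ P + lam2 * Dr]]).tocsr()
    b = np.concatenate([lam2 * Dl @ alpha0_l.ravel(), lam2 * Dr @ alpha0_r.ravel()])
    alpha = spsolve(A, b)
    return alpha[:n].reshape(H, W), alpha[n:].reshape(H, W)
```

Setting D_r and α_{r0} to zero corresponds to giving scribbles only on the left view, which roughly reproduces the first of the three variants discussed in Section 4.2.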
3.3. Region labeling based on multi-instance learning

As mentioned above, the second task is to propagate tags from the image level to the region level. A lot of work has employed multi-instance learning (MIL) for such a task. From the perspective of MIL, an image is labeled positive for a tag only if it contains at least one instance prototype for that tag; otherwise, it is labeled negative. However, the training set is highly ambiguous, since the number of negative instances is much larger than the number of positive instances in a positive bag. Therefore, how to build a direct semantic mapping between the tags and the corresponding regions becomes the key issue of the MIL problem. To represent the semantic concept, the notion of instance prototypes (IPs), which correspond to the representative visual patterns from the positive bags for a given tag, has been proposed. Diverse Density (DD) and its extensions are very popular for selecting instance prototypes [31–34]. The DD method measures the co-occurrence of similar instances from different bags with the same tag: a feature point with a large DD value is close to at least one instance from every positive bag and far away from every negative instance. In this paper, we also utilize the Diverse Density method to accomplish the label propagation from the image level to the region level.

The notation of our method is as follows. Given a specific tag ω ∈ V, the positive and negative bags for tag ω are denoted as B_i^{ω+} and B_i^{ω−}. Similarly, the corresponding j-th instances (regions) are denoted as r_{ij}^{ω+} and r_{ij}^{ω−}, where j = 1, . . . , n and n is the number of instances in the bag. According to the tags annotated on the images, the total training set is denoted as L = L^{ω+} ∪ L^{ω−} = {B_1^{ω+}, B_2^{ω+}, . . . , B_{l^{ω+}}^{ω+}} ∪ {B_1^{ω−}, B_2^{ω−}, . . . , B_{l^{ω−}}^{ω−}}, where l^{ω+} and l^{ω−} are the numbers of positive and negative bags for tag ω. In [31], for each bag B_i^{ω+} ∈ L^{ω+}, the DD value of each instance r_{ij}^{ω+} is defined as:

DD(r_{ij}^{ω+}, L) \propto \Pr(L \mid r_{ij}^{ω+}) = \sum_{m=1}^{|L|} \Pr((B_m, y_m) \mid r_{ij}^{ω+})    (4)

where

\Pr((B_m, y_m) \mid r_{ij}^{ω+}) = \max_n \left\{ 1 - \left| y_m - \exp\left(-\mathrm{dist}^2(r_{mn}, r_{ij}^{ω+})\right) \right| \right\}    (5)
Here y_m equals 1 when bag B_m is a positive bag and 0 otherwise, and r_{mn} is the n-th instance of bag B_m. The above algorithm combines the advantages of the methods proposed in [32] and [33], so it is more robust and computationally efficient. However, it also has a limitation which may degrade its performance. To illustrate the problem, we rewrite Eq. (4) in its equivalent form:

DD(r_{ij}^{ω+}, L) \propto \sum_{m=1}^{l^{ω+}} \Pr((B_m, y_m) \mid r_{ij}^{ω+}) + \sum_{n=1}^{l^{ω−}} \Pr((B_n, y_n) \mid r_{ij}^{ω+})    (6)
The training set is extremely imbalanced, since the number of negative bags is usually much larger than the number of positive bags, i.e., l^{ω+} ≪ l^{ω−}. Therefore, the second part of Eq. (6) contributes much more to the DD value, whereas intuitively the positive bags, rather than the negative bags, should contribute more to finding the instance prototypes. Moreover, the part from the negative bags also accounts for a large proportion of the computation of the DD values. To deal with this problem, we propose a modified DD computation method by introducing a leave-one-out technique. Mathematically,

DD(r_{ij}^{ω+}, L) \propto \sum_{m=1}^{|\Lambda^+|} \Pr((B_m, y_m) \mid r_{ij}^{ω+}) + \sum_{n=1}^{|\Lambda^-|} \Pr((B_n, y_n) \mid r_{ij}^{ω+})    (7)

Here Λ^+ is a set selected from the positive bags by randomly leaving one sample out, and Λ^− is a set randomly sampled from the negative bags with its size equal to that of Λ^+. The whole procedure is repeated l^{ω+} − 1 times, and the average of the results is taken as the final DD value. The advantages of our method lie in two aspects. First, it constructs a balanced training set in a more reasonable way. Second, it achieves a more robust result, since the value is obtained through a cross-validation-like process.

Once the DD values of all the instances in the positive bags are obtained, the next step is to determine a threshold for instance prototype selection. We use a straightforward strategy: we search for the maximal value such that every positive bag B_i^{ω+} ∈ L^{ω+} contains at least one instance whose DD value is larger than or equal to it, and set this value as the threshold. All the instances whose DD values are no smaller than this threshold are then selected as instance prototypes.
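The sketch below illustrates one possible reading of the modified DD computation of Eq. (7), with the leave-one-out resampling and the threshold-based prototype selection described above. The bag representation (a list of per-bag feature matrices) and helper names are hypothetical, and the summation form follows Eqs. (4)-(7) as reconstructed here.

```python
import numpy as np

def pr_bag(bag, y, r):
    """Pr((B_m, y_m) | r) of Eq. (5): most-likely-cause estimate over the bag's instances."""
    d2 = np.sum((bag - r) ** 2, axis=1)
    return np.max(1.0 - np.abs(y - np.exp(-d2)))

def modified_dd(r, pos_bags, neg_bags, rng):
    """Modified DD of Eq. (7): average over l+ - 1 rounds of leave-one-out sampling.
    Assumes there are at least as many negative bags as positive bags."""
    l_pos = len(pos_bags)
    scores = []
    for _ in range(max(1, l_pos - 1)):
        keep_out = rng.choice(l_pos)                     # randomly leave one positive bag out
        pos = [b for i, b in enumerate(pos_bags) if i != keep_out]
        neg = rng.choice(len(neg_bags), size=len(pos), replace=False)
        s = sum(pr_bag(b, 1, r) for b in pos)
        s += sum(pr_bag(neg_bags[i], 0, r) for i in neg)
        scores.append(s)
    return np.mean(scores)

def select_prototypes(pos_bags, neg_bags, seed=0):
    """Score every instance of every positive bag, then keep the instances above the
    maximal threshold that still leaves at least one instance per positive bag."""
    rng = np.random.default_rng(seed)
    dd = [[modified_dd(r, pos_bags, neg_bags, rng) for r in bag] for bag in pos_bags]
    thr = min(max(vals) for vals in dd)
    return [bag[i] for bag, vals in zip(pos_bags, dd)
            for i in range(len(bag)) if vals[i] >= thr]
```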
After obtaining the collection of instance prototypes for a tag ω, we use them to model the semantic class density of ω. The X-means method [35] is applied to construct the projection space. This method efficiently searches the space of cluster locations and the number of clusters by optimizing the Bayesian information criterion. We therefore apply it to group the instance prototypes into the best clusters, and the central points of the resulting k clusters construct the projection space. Let V = {v_1, v_2, . . . , v_k} be the projection space for tag ω, where v_i is the i-th dimension of the projection feature. The class density of a given tag ω based on the IPs is defined as:

P(v \mid ω) = \sum_{i=1}^{k} \pi_i v_i    (8)

where v_i = \frac{1}{\sqrt{2\pi |\Sigma_i|}} \exp\left\{ -\frac{1}{2} (v - \mu_i)^T \Sigma_i^{-1} (v - \mu_i) \right\}, π_i are the relative weights satisfying \sum_{i=1}^{k} \pi_i = 1, and μ_i and Σ_i are the mean and covariance matrix of the i-th dimension of the projection feature. The parameters of the above model can be calculated using the X-means method. Then the probability that region r_{ij}^{ω+} is assigned to the given tag ω is P(r_{ij}^{ω+} | ω) according to the Bayesian criterion.
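As a small illustration of Eq. (8), the sketch below fits the class density from the instance prototypes and assigns a region feature to the tag with the highest density. X-means is replaced here by scikit-learn's KMeans with a fixed k, and π_i is taken as the fraction of prototypes in cluster i; both choices are stand-ins, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_density(prototypes, k):
    """Fit the mixture of Eq. (8) from the instance prototypes of one tag.
    X-means [35] would choose k automatically; here k is given and each cluster
    is assumed to hold several prototypes."""
    from sklearn.cluster import KMeans      # stand-in for X-means
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(prototypes)
    comps = []
    for i in range(k):
        pts = prototypes[labels == i]
        pi = len(pts) / len(prototypes)                      # relative weight pi_i
        mu = pts.mean(axis=0)
        cov = np.cov(pts.T) + 1e-6 * np.eye(pts.shape[1])    # Sigma_i, regularised
        comps.append((pi, mu, cov))
    return comps

def class_density(v, comps):
    """P(v | omega) = sum_i pi_i N(v; mu_i, Sigma_i)."""
    return sum(pi * multivariate_normal.pdf(v, mean=mu, cov=cov) for pi, mu, cov in comps)

def label_region(v, models):
    """Assign a region feature v to the tag with the largest class density
    (Bayesian criterion with uniform tag priors)."""
    return max(models, key=lambda tag: class_density(v, models[tag]))
```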
By utilizing the above algorithm, we can label each segmented region pair with a unique tag according to the probability that it belongs to that tag. Example results of region labeling on our own stereo image dataset are given in Figure 3. These results validate the effectiveness of our proposed method.

3.4. 3D saliency detection

We now describe our 3D saliency detection method. Since the role of stereo disparity in pre-attentive processing is still an open research question, our method is mainly based on observations from stereoscopic photography. According to stereoscopic perception, an object with negative disparity, which is perceived as popping out of the screen, often catches a viewer's attention [36]. Our method therefore starts from disparity analysis. Furthermore, when taking a stereoscopic photo, people tend to place the important object in the region common to both the left and right views; in other words, the important object should not be occluded in either view. Accordingly, our method further exploits occlusion cues for saliency detection. In addition, spatial relationships play an important role in human attention [37], so our method also includes a spatial filtering procedure for refinement.
Figure 3. Example results of region labeling. The images with the corresponding tags are selected from our stereo image dataset. Note that all the tags appearing in the dataset are shown here except the tag sky, due to the absence of depth cues.
The saliency computation is performed at the region level. Studies on stereoscopic photography have shown that an object perceived as popping out of the screen often catches a viewer's attention [36], which provides a useful cue for saliency analysis. A region r_k's disparity saliency value is calculated from the disparities in r_k as follows:

S_d(r_k) = 1 - \frac{\bar{d}_k - \bar{d}_{\min}}{\bar{d}_{\max} - \bar{d}_{\min}}    (9)
where \bar{d}_k is the mean disparity value in region r_k, k ∈ Seg with Seg the region index set, and \bar{d}_{\min} and \bar{d}_{\max} are the minimum and maximum of \bar{d}_k over the image. Obviously, it is not judicious to determine an object's saliency from its disparity alone. For instance, in Figure 4(b)(c)(e), parts of the ground are closer to the camera yet are obviously less salient than the objects in the scene. To compensate for this limitation, we further introduce occlusion cues into the saliency computation for each region.
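A direct implementation of Eq. (9) on a region label map could look as follows (a sketch; seg is assumed to be an integer segmentation map and disp the estimated disparity map).

```python
import numpy as np

def disparity_saliency(disp, seg):
    """S_d of Eq. (9): regions with the smallest mean disparity (perceived closest,
    i.e. popping out of the screen) receive the largest saliency."""
    labels = np.unique(seg)
    mean_d = np.array([disp[seg == k].mean() for k in labels])
    d_min, d_max = mean_d.min(), mean_d.max()
    s = 1.0 - (mean_d - d_min) / (d_max - d_min + 1e-12)
    return dict(zip(labels, s))
```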
Figure 4. Half occlusion detection examples. The first row shows the left views of the original image pairs. The second and third rows show the half occlusion detection results on the left and right images, respectively. The dark points represent the detected half occlusion pixels.

As illustrated in Figure 5, pixels that are visible in only one view are called half occlusion or monocular occlusion pixels in stereoscopic photography. In most cases, viewers seldom pay pre-attention to these pixels. Obviously, the more occlusion pixels a region contains, the less likely it is to be salient. Accordingly, we exploit this cue to analyze the saliency of each region. Since there is no direct way to accurately identify the half occlusion pixels, we use a left-right consistency test for detection.
Figure 5. Illustration of half occlusion. Top view of a typical example in which background points visible to one eye are not visible to the other. The blue wavy line represents regions visible in the left eye but invisible in the right eye, and the yellow wavy line represents the opposite case. The green rectangles in the RE and LE images stand for the foreground regions. The yellow rectangles in the RE image stand for regions visible only to the right eye, and the blue rectangles in the LE image stand for regions visible only to the left eye.
For a given pixel, if we cannot find a matching pixel in the other view, it is counted as a half occlusion pixel. Occlusion detection examples are shown in Figure 4. It can be seen that the non-salient regions are separated from the input images. Meanwhile, a few non-occluded pixels are detected as occlusion pixels, because the stereo matching algorithm sometimes fails in textureless and repetitively textured regions, and these mismatched pixels are then identified as occlusion pixels. Fortunately, such pixels usually have low contrast to their surroundings, and according to human attention theory people usually do not pay pre-attention to them, so they have little impact on our method.

We count the number of occlusion pixels in each region of the left and right views, sort all regions in descending order according to their numbers of occlusion pixels, and select the regions ranked in the top 30% to construct a half occlusion region pool P. Inspired by a recent color contrast-based saliency detection method by Cheng et al. [37], we assign a saliency value to each region by measuring its color similarity to the regions in the half occlusion region pool: the more similar a region is to the regions in P, the less salient it is, and vice versa. For a region r_k in either the left or the right image, its half occlusion contrast value S_o is formulated as:

S_o(r_k) = \sum_{i \in P} D_c(r_k, r_i)\, |\Lambda_i|    (10)

where D_c(r_k, r_i) denotes the chi-square distance between regions r_k and r_i according to their color intensities. The saliency values of the regions in the half occlusion region pool are all set to the minimum saliency value.
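The sketch below shows one way to realize the half occlusion cue: a left-right consistency test marks half occluded pixels, the 30% of regions with the most occluded pixels form the pool P, and the remaining regions are scored with Eq. (10) using a chi-square distance between color histograms. The histogram representation and the reading of |Λ_i| as the pixel count of region r_i are assumptions, not details stated in the paper.

```python
import numpy as np

def half_occlusion_mask(disp_l, disp_r, tol=1.0):
    """Left-right consistency test: a left-view pixel is half occluded if the disparity
    of its match in the right view disagrees, or the match falls outside the image."""
    H, W = disp_l.shape
    jj = np.arange(W)[None, :] - np.rint(disp_l).astype(int)
    ok = (jj >= 0) & (jj < W)
    ii = np.repeat(np.arange(H)[:, None], W, axis=1)
    occ = np.ones((H, W), dtype=bool)
    occ[ok] = np.abs(disp_l[ok] - disp_r[ii[ok], jj[ok]]) > tol
    return occ

def occlusion_contrast_saliency(seg, occ, hists):
    """S_o of Eq. (10): chi-square color distance to the regions in the occlusion pool P,
    weighted by region size; pool regions themselves get the minimum score.
    hists maps a region label to its (normalised) color histogram."""
    labels = list(np.unique(seg))
    occ_cnt = {k: int(occ[seg == k].sum()) for k in labels}
    size = {k: int((seg == k).sum()) for k in labels}
    pool = set(sorted(labels, key=lambda k: -occ_cnt[k])[:max(1, int(0.3 * len(labels)))])
    def chi2(h1, h2):
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12))
    s = {k: sum(chi2(hists[k], hists[i]) * size[i] for i in pool)
         for k in labels if k not in pool}
    lo = min(s.values()) if s else 0.0
    s.update({k: lo for k in pool})
    return s
```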
The combined saliency S_c is obtained by multiplying the disparity saliency and the half occlusion contrast saliency together. Mathematically,

S_c = S_d \cdot S_o    (11)
Since spatial relationships play an important role in human attention, our method also includes a collaborative smoothing filter on S_c for refinement. This refinement procedure tends to assign similar saliency values to neighboring regions that share similar disparity and color distributions. The collaborative smoothing filter can be formulated as:

S(r_k) = \sum_{i \in Seg} D_c(r_k, r_i)\, D_d(r_k, r_i)\, D_s(r_k, r_i)\, S_c(r_i)    (12)

where D_d(r_k, r_i) denotes the disparity distance between regions r_k and r_i according to their disparities, and D_s(r_k, r_i) denotes the spatial distance between regions r_k and r_i according to their spatial positions.
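For illustration, the collaborative smoothing of Eq. (12) and the final ranking step described in the next paragraph can be sketched as follows; the per-pair weights D_c · D_d · D_s are assumed to be precomputed by the caller, since the paper does not specify their normalization.

```python
import numpy as np

def smooth_saliency(S_c, weights):
    """Eq. (12): S(r_k) = sum_i D_c(k,i) * D_d(k,i) * D_s(k,i) * S_c(r_i).
    weights[(k, i)] is assumed to already hold the product of the three terms."""
    regions = list(S_c)
    return {k: sum(weights[(k, i)] * S_c[i] for i in regions) for k in regions}

def rank_tags(region_tags, saliency):
    """Rank tags by the averaged saliency of the regions labeled with each tag (descending)."""
    scores = {}
    for region, tag in region_tags.items():
        scores.setdefault(tag, []).append(saliency[region])
    return sorted(scores, key=lambda t: -np.mean(scores[t]))
```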
After the saliency map of the given stereo image pair is obtained, the tags are ranked in descending order by the averaged saliency scores of their corresponding regions.

4. EXPERIMENT

Since there was no existing public annotated stereo image dataset, we first set up a labeled stereo image dataset and then tested the performance of our proposed approach on it. The proposed algorithm was implemented in MATLAB on a platform with an Intel Core4 2.4 GHz CPU and 4 GB of memory. In this section, we first introduce the setup of the dataset used in the experiments. We then show segmentation results to discuss the effect of the stereo consistency constraint, present a comparison of our 3D saliency detection algorithm with existing methods, and finally present the tag saliency ranking results.

4.1. IMAGE DATASET

To evaluate the validity of tag saliency ranking, we constructed a labeled stereo image dataset, which includes 424 stereo images crawled from Flickr (http://www.flickr.com/), the Stereoscopic Image Gallery (http://www.stereophotography.com/), and NVIDIA 3D Vision Live (http://photos.3dvisionlive.com/), together with the ground truth for 16 concepts: building, grass, tree, sky, airplane, water, face, car, flower, bird, sculpture, road, cat, dog, body, boat (shown in Figure 6). We used our co-segmentation method (introduced in detail in Section 3.2) to segment the two stereo views simultaneously; this method enabled us to segment the stereo images with minimal user input. We then labeled each region pair with an assigned color acting as a tag from the list of object classes. Each image is typically represented by 3-5 regions and is associated with 2-5 tags describing the main semantic objects; on average each image has around 3 tags. The images are 320 × 221 pixels. For each segmented region, we extracted 12-dimensional HSV and Lab color moment features and 48-dimensional Gabor coefficient moment features with 4 resolutions and 6 orientations.
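One plausible implementation of these region features is sketched below, assuming the 12 color dimensions are the mean and standard deviation of each channel in the HSV and Lab spaces and the 48 Gabor dimensions are the mean and standard deviation of each of the 4 × 6 filter responses; the exact moment definitions and filter parameters are not stated in the paper.

```python
import numpy as np
from skimage.color import rgb2hsv, rgb2lab
from skimage.filters import gabor_kernel
from scipy.ndimage import convolve

def region_features(img, mask):
    """60-D feature for one region of an RGB float image in [0, 1]:
    12 color-moment dims + 48 Gabor-moment dims (assumed layout)."""
    feats = []
    for to_space in (rgb2hsv, rgb2lab):
        cs = to_space(img)
        for c in range(3):
            vals = cs[..., c][mask]
            feats += [vals.mean(), vals.std()]     # 2 moments x 3 channels x 2 spaces = 12
    gray = img.mean(axis=2)
    for scale in range(4):                         # 4 resolutions
        freq = 0.25 / (2 ** scale)
        for k in range(6):                         # 6 orientations
            kern = np.real(gabor_kernel(freq, theta=k * np.pi / 6))
            resp = convolve(gray, kern, mode='nearest')[mask]
            feats += [resp.mean(), resp.std()]     # 2 moments x 24 filters = 48
    return np.array(feats)
```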
Figure 6. The labeled image database. A selection of images in our stereo image dataset and their corresponding ground-truth annotations. The second row shows the 3D salient regions of the stereo images. Colors map uniquely to object class labels. All images are approximately 320 × 221 pixels.
Our database is split randomly into roughly 45% training, 10% validation and 45% test sets, while ensuring approximately proportional contributions from each class. Note that regions that are smaller than 1/20 of the original image and do not belong to a database class are labeled as void. Void pixels are ignored for both training and testing.

4.2. Co-segmentation results

Here we illustrate how our co-segmentation method is applied and present some results. Our method only requires the user to give a sparse set of scribbles; white scribbles indicate the foreground and black ones indicate the background. To begin with, we placed the scribbles on the left image and minimized the corresponding matte cost and user-specified constraint terms in Eq. (3) to obtain the left view's matte. Then, we computed the right view's matte in three ways. First, we took into account two additional terms, the right view's matte cost and the stereo consistency constraint term. Second, we additionally placed some scribbles on the right image, i.e., we added the extra right-view user-specified constraint term. Third, we only minimized the right view's matte cost and user-specified constraint term. The three resulting right-view mattes are shown in Figure 7(f)-(h), respectively. From these results, we can see that the right-view mattes shown in columns (g) and (f) are obviously better than those shown in column (h). For example, some background appears in (h)'s "building" matte, and the edges of (h)'s "car", "cat" and "dog" mattes are not as clear as those in (g) and (f). The reason is that the stereo consistency constraint provides additional matting information.
14
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
Figure 7. Segmentation results: (a) Left view input image, (b) Right view input image, (c) Left image with sparse constraints, (d) Right image with sparse constraints, (e) Left view matte by minimizing the left view's matte cost and user-specified constraint terms, (f) Right view matte by our proposed method, (g) Right view matte by our proposed method with rough user interaction, (h) Right view matte by the traditional matting method [22].
In addition, we can also see that the mattes in (f) are very close to those in (g). This means that the extra user interaction can be omitted when the stereo consistency constraint is reliable.

4.3. 3D saliency detection performance comparison

We evaluated the saliency maps produced by our proposed method in Section 3.4 against eight state-of-the-art saliency detection methods, including FT [38], SR [39], HC [37], GB [40], RC [37], CO [28], CA [41], and SS [29]. The test images were randomly sampled from our stereo image dataset, and the corresponding saliency maps generated by the above algorithms are shown in Figure 8. Since the SS method has not published its code, we only compare with it on the first two cases (the results are cited from [29]). In the first two cases, the salient objects protrude from the background and have unique colors, and our method achieves results comparable to those of the SS method. Next, we tested our method on more challenging cases. For the third image, the salient object (a stone deer) has colors similar to the ground. For the fourth image, the front handrail pops out of the screen while the salient object (a red stone) is located a little farther away. In these two cases, our method can still effectively detect the salient objects by using the half occlusion cues. In general, our results establish well-defined boundaries of salient objects and uniformly highlight
Figure 8. Stereo saliency maps from our method and the other saliency detection methods.
Figure 9. Precision-recall curves for naive thresholding of saliency maps using 100 benchmark stereoscopic images. Different options of our method compared with HC, FT, SR, RC, GB, CA, SS.
Figure 10. Several exemplary tag saliency ranking results on our stereo image dataset. From left to right: the original stereo image, saliency map, most salient region, and final ranked tag list.
the whole salient regions. To reliably compare how well the various saliency detection methods highlight salient regions, we used a varying threshold T_f from 0 to 255 to obtain a binary segmentation of the salient objects and computed the resulting precision vs. recall curves. As shown in Figure 9, our method outperforms all of the classical 2D saliency detection algorithms, because these methods do not take any depth cues into account. Compared with the 3D saliency detection method SS, our method also yields higher precision and better recall rates.

4.4. Tag saliency ranking performance

We first give some tag ranking results obtained with the proposed approach, as shown in Figure 10. As can be seen, the tag annotated in red in each image reveals the most salient content of the given image. Our method thus achieves favorable ranking results that fit well with human perception. Then, we compared the proposed method with the visual neighbor voting technique [6], which employs a naive kNN algorithm to perform the tag relevance ranking task. We chose k = 10 in our experiment, i.e., the 10 nearest images were used to learn tag relevance by neighbor voting. Since the visual neighbor voting technique is applied here to a relatively small image dataset, its performance cannot compete with that of our method. Moreover, we compared the ranking performance of our algorithm with three other prevalent approaches, i.e., SR [39], FT [38] and RC [37], which differ from ours only in the generation of the saliency map. Table 1 gives the comparison results of these methods in terms of average precision over the 16 tags, while Figure 11 illustrates the detailed results for each individual tag. As can be seen, since these approaches neglect the depth cues contained in the stereo images, their ranking performance is not very satisfying compared with our method.
Figure 11. The detailed precision comparison for each individual tag between our method and the other algorithms.
5. CONCLUSION

In this paper, to the best of our knowledge, we have proposed the first tag ranking approach for stereo images. Directly extending existing 2D methods to stereo images may cause misleading results, since they neglect the 3D cues contained in the stereo images. In contrast, our approach better captures the 3D visually representative elements with respect to the corresponding contents of the stereo images. Our approach makes two main contributions. The first is a novel 3D saliency detection algorithm that uses occlusion cues along with the contrast in color and disparity; this algorithm can measure the saliency degree of each region in the stereo images even when there are errors in stereo matching. The second is a two-step method for propagating the tags annotated on the stereo images to the segmented regions: an effective co-segmentation algorithm for stereo images is first presented, which can simultaneously segment the two stereo views with minimal user input, and an improved multi-instance learning algorithm is then applied to label each segmented region. Benefiting from these two contributions, tag ranking can be performed simply by re-ranking the tags according to the saliency values of the corresponding regions. It is worth noting that we have evaluated our approach only on our own dataset, which is rather small-scale. Since there is no existing public database of annotated stereo images, we plan to set up a large-scale annotated stereo image dataset and evaluate our proposed approach on it in future work.
Table 1. Precision comparison with different approaches.

Algorithm                Precision
SR [39]                  49.49%
FT [38]                  58.55%
RC [37]                  66.61%
kNN [6]                  46.48%
The proposed approach    72.76%
REFERENCES

1. Z.-J. Zha, L. Yang, T. Mei, et al., Visual query suggestion: Towards capturing user intent in internet image search, ACM Transactions on Multimedia Computing, Communications, and Applications 6.
2. Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, Visual query suggestion, in: ACM Multimedia, 2009.
3. A. Sun, S. Bhowmick, Image tag clarity: in search of visual-representative tags for social images, in: ACM Multimedia Workshop on Social Media, 2009.
4. D. Liu, X.-S. Hua, H.-J. Zhang, Content-based tag processing for internet social images, Multimedia Tools and Applications 51 (8) (2011) 723–738.
5. S. Feng, C. Lang, H. Liu, Adaptive all-season image tag ranking by saliency-driven image pre-classification, Journal of Visual Communication and Image Representation 24 (2013) 1031–1039.
6. X. R. Li, C. G. M. Snoek, M. Worring, Learning social tag relevance by neighbor voting, IEEE Transactions on Multimedia 11 (7) (2009) 1310–1322.
7. D. Liu, X.-S. Hua, L.-J. Yang, M. Wang, H.-J. Zhang, Tag ranking, in: WWW, 2009.
8. S. Feng, C. Lang, Beyond tag relevance: integrating visual attention model and multi-instance learning for tag saliency ranking, in: CIVR, 2010.
9. M. Wang, X. Hua, R. Hong, J. Tang, G. Qi, Unified video annotation via multigraph learning, IEEE Transactions on Circuits and Systems for Video Technology 19 (5) (2009) 733–746.
10. M. Wang, X. Hua, J. Tang, R. Hong, Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Transactions on Multimedia 11 (3) (2009) 465–476.
11. M. Wang, R. Hong, G. Li, Z. Zha, Event driven web video summarization by tag localization and key-shot identification, IEEE Transactions on Multimedia 12 (4) (2012) 975–985.
12. Z.-J. Zha, T. Mei, J. Wang, et al., Graph based semi-supervised learning with multiple labels, Journal of Visual Communication and Image Representation 20 (2) (2009) 97–103.
13. Z.-J. Zha, M. Wang, Y.-T. Zheng, Y. Yang, R. Hong, T.-S. Chua, Interactive video indexing with statistical active learning, IEEE Transactions on Multimedia 14 (2012) 17–27.
14. J. Liu, Y. Zhang, Z. Li, H. Lu, Correlation consistency constrained probabilistic matrix factorization for social tag refinement, Neurocomputing 25 (4) (2013) 172–180.
15. Z.-J. Zha, H. Zhang, et al., Detecting group activities with multi-camera context, IEEE Transactions on Circuits and Systems for Video Technology 23 (5) (2013) 856–869.
16. Z.-J. Zha, Y. Yang, J. Tang, M. Wang, T.-S. Chua, Robust multi-view feature learning for RGB-D image understanding, ACM Transactions on Intelligent Systems and Technology.
17. Y. Feng, J. Ren, J. Jiang, Generic framework for content-based stereo image/video retrieval, Electronics Letters 47 (2) (2011) 97–98.
18. Y. Boykov, M.-P. Jolly, Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images, in: ICCV, 2001.
19. Y. Deng, B. Manjunath, Unsupervised segmentation of color-texture regions in images and video, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (8) (2001) 800–810.
20. C. Rother, V. Kolmogorov, A. Blake, GrabCut: interactive foreground extraction using iterated graph cuts, in: SIGGRAPH, 2004.
21. B. L. Price, S. Cohen, StereoCut: Consistent interactive object selection in stereo image pairs, in: ICCV, 2011.
22. A. Levin, D. Lischinski, Y. Weiss, A closed-form solution to natural image matting, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3) (2008) 228–242.
23. Y. Chen, J. Wang, MILES: Multiple-instance learning via embedded instance selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (12) (2006) 1931–1947.
24. O. Maron, T. Lozano-Pérez, A framework for multiple-instance learning, in: Advances in Neural Information Processing Systems (NIPS), 1998.
25. Y. Chen, J. Wang, Image categorization by learning and reasoning with regions, Journal of Machine Learning Research 5 (2004) 913–939.
26. J. Tang, X. Hua, G. Qi, X. Wu, Typicality ranking via semi-supervised multiple-instance learning, in: ACM International Conference on Multimedia (ACM MM), 2007.
27. Z.-J. Zha, T. Mei, R. Hong, Z. Gu, Marginalized multi-layer multi-instance kernel for video concept detection, Signal Processing 93 (8) (2013) 2119–2125.
28. H. Li, K. Ngan, A co-saliency model of image pairs, IEEE Transactions on Image Processing 20 (12) (2011) 3365–3375.
29. Y. Niu, Y. Geng, X. Li, Leveraging stereopsis for saliency analysis, in: CVPR, 2012.
30. A. Hosni, C. Rhemann, M. Bleyer, Fast cost-volume filtering for visual correspondence and beyond, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2) (2013) 504–511.
31. R. Rahmani, S. A. Goldman, MISSL: multiple-instance semi-supervised learning, in: International Conference on Machine Learning (ICML), 2006.
32. X. Qi, Y. Han, Incorporating multiple SVMs for automatic image annotation, Pattern Recognition 40 (4) (2007) 728–741.
33. S. Feng, D. Xu, Transductive multi-instance multi-label learning algorithm with application to automatic image annotation, Expert Systems with Applications 37 (1) (2010) 661–670.
34. Z.-J. Zha, X.-S. Hua, T. Mei, et al., Joint multi-label multi-instance learning for image classification, in: CVPR, 2008.
35. D. Pelleg, A. Moore, X-means: Extending k-means with efficient estimation of the number of clusters, in: International Conference on Machine Learning (ICML), 2000.
36. J. Hakkinen, T. Kawai, J. Takatalo, R. Mitsuya, G. Nyman, What do people look at when they watch stereoscopic movies?, in: SPIE, 2010.
37. M. Cheng, G. Zhang, N. Mitra, X. Huang, S. Hu, Global contrast based salient region detection, in: CVPR, 2011.
38. R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned salient region detection, in: CVPR, 2009.
39. X. Hou, L. Zhang, Saliency detection: A spectral residual approach, in: CVPR, 2007.
40. L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.
41. S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, in: CVPR, 2010.