Unsupervised segmentation of multiview feature semantics by hashing model


Signal Processing 160 (2019) 106–112


Short communication

Jia Cui a,∗, Yanhui Zhao a, Xinfeng Dong a, Mingxi Tang a,b

a School of Information Science and Engineering, Shandong Normal University, Jinan 250014, PR China
b Shaanxi Fashion Engineering University, Xi'an 712046, PR China

Article history: Received 22 October 2018; Revised 13 February 2019; Accepted 14 February 2019; Available online 15 February 2019

Keywords: Unsupervised image segmentation; Multiview feature projection; Similarity distance; Hashing model

Abstract

The segmentation of finely detailed objects in images is becoming increasingly important in various image applications. Herein, multiview features such as color, spatial features, texture, saliency, and depth are used to calculate the feature semantics that differentiate the image foreground and background at the pixel level. For different image types, the importance of each single image feature varies. We assume that combining multiple features will produce improved results compared to using any single feature, and propose an unsupervised learning strategy to learn the optimal feature projection onto the final solution surface of the binary segmentation. To reduce computational complexity, a hashing model is used to represent the projected features using binary codes, as its Hamming distance efficiently retrieves the pixel similarity. Comparative experiments using current state-of-the-art unsupervised segmentation algorithms and a deep learning approach, i.e., a fully convolutional network, demonstrate the advantages of the proposed method.

© 2019 Published by Elsevier B.V.

1. Introduction

Image segmentation is a challenging problem in computer vision. It divides an image into numerous discrete regions such that pixels exhibit high similarity within each region and high contrast between regions. Image segmentation is the foundation of many applications, including robot vision, object recognition, and medical image processing. The process is challenging because regions may appear against various backgrounds and under different visual conditions. Image segmentation methods can be classified into two categories, interactive and automatic, depending on whether prior information is provided interactively by users. We herein focus on the automatic approach. Automatic approaches group pixels that are homogeneous in features at different levels, such as color, texture, and even semantics, into non-overlapping regions. Cluster-based approaches borrow from feature-space analysis in the clustering community, representing the image pixels in a feature vector space. The major clustering algorithms operate on local features, which leads to a degree of sensitivity to local statistical changes [1]. Ban et al. [2] proposed a pixel-related Gaussian mixture model (GMM) to segment images into superpixels. The GMM is a weighted sum of Gaussian functions, each corresponding to a superpixel, that describes the density of each pixel represented by a random variable. Graph-based approaches model an image as a graph to produce segmentations using a relative dissimilarity measure and a global grouping matrix [3]. Identifying the proper optimization function is essential for defining the graph globally. Superpixels are homogeneous regions that are smaller than the presented objects or their parts. The simple linear iterative clustering (SLIC) algorithm proposed by Achanta et al. [4] is a popular method for general image segmentation. It significantly improves superpixel efficiency by starting from an initial regular grid and growing by estimating the distances between pixels and nearby, localized cluster centers [5]. Recently, various methods based on deep learning structures have demonstrated exciting results on several semantic segmentation benchmarks. Fully convolutional networks (FCNs) [6–8] reformulate the image segmentation task as pixel-level classification based on their multiscale features. The dynamic programming strategy used in FCNs increases the time required to train the classifier on a large database. There are also many new deep semantic segmentation works, such as W-net [9], Deeplab [10], Mask-RCNN [11], and Segnet [12]. From the perspective of semantic segmentation, the results are promising; however, in terms of the traditional evaluation by segmentation accuracy, improvement is still required.

∗ Corresponding author. E-mail address: [email protected] (J. Cui).
https://doi.org/10.1016/j.sigpro.2019.02.015


Fig. 1. Workflow of this paper. After feature extraction, the input image is represented in multiple views, such as spatial, color, saliency, and depth. The projection of these features onto the potential solution surface then helps locate the image contours via the hashing model for binary segmentation.

We herein propose a novel unsupervised binary segmentation algorithm based on multiview features, such as spatial features, color, saliency, and depth, inspired by the multilevel convolutional operations of FCNs. The multiview features depict images from multiple perspectives at the pixel level. Several works on multiview learning algorithms exist. Yu et al. [13,14] proposed ranking models for image retrieval in which multiple hypergraphs constructed from different visual features are used to obtain a linear model by fast alternating optimization. The multimodal deep autoencoder (MDA) [15] uses a multiview hypergraph low-rank representation to fuse multiple features into a unified manifold representation for human pose recovery. Inspired by these works, in the segmentation process, instead of concatenating vectors from different views into a new vector, we propose a new algorithm that learns a low-dimensional and sufficiently smooth embedding over all views simultaneously. A feature selection matrix is proposed to project the multiple features toward the potential problem space. The multiview pixel features can achieve improved results for differentiating the foreground and background areas. The primary framework is shown in Fig. 1. The calculation of multiple features is more complex than that of a single feature by dynamic programming; thus, we introduce a hashing model to increase the computation speed and reduce the cost. The contributions of our paper are as follows:

1) Instead of learning image features from a large-scale database, inspired by recent studies of human perceptual attention [16,17], we propose a novel algorithm to learn a low-dimensional and smooth embedding over multiview feature perspectives (saliency, depth, spatial, and color information) in one image simultaneously. The participation of more features in the future is also supported by the current framework. The computed similarities at the pixel level help locate the image contours between foreground and background.

2) The image segmentation model is reformulated as a multiview semantic similarity computation at the pixel level. During the optimization process, instead of alternating optimization algorithms, we propose a novel way to approximate the globally optimal solution with a spectral hashing model. The binary computation of the multiview feature embedding helps reduce the computational complexity. To the best of our knowledge, this is the first work that uses a hashing model for image segmentation.

This paper is organized as follows: Section 2 formalizes the proposed problem. The details of our method are discussed in Section 3. We present our experiments and conclusions in Sections 4 and 5, respectively.

2. Problem formulation

In segmentation, identifying and locating image objects or object parts is challenging, as what makes an "object" or a "part" meaningful can be ambiguous. The ambiguous definitions of objects render image segmentation an ill-posed problem. Herein, we identify the optimal pixel groups with the same or similar feature "semantics" to form the foreground and background. Thus, the segmentation problem can be reformulated as a pixel similarity computation and a minimum-energy group retrieval. The objective function can be represented as follows:

$$F = \arg\min \sum_{n=1}^{k} sd\left(p_i, p_j\right) \cdot \left(\omega_{i,j}\right)_n \tag{1}$$

$$sd\left(p_i, p_j\right) = \left\|\Phi_{p_i} - \Phi_{p_j}\right\|^2 \tag{2}$$

$$\Phi_{p_i} = \begin{pmatrix} f^{d_1}_{i-m/2,\, j-m/2} & \cdots & f^{d_1}_{i-m/2,\, j+m/2} \\ \vdots & f^{d}_{i,j} & \vdots \\ f^{d_k}_{i+m/2,\, j-n/2} & \cdots & f^{d_k}_{i+m/2,\, j+n/2} \end{pmatrix} \tag{3}$$

$$\omega_{i,j} = \exp\left(-\frac{\left\|p_i - p_j\right\|^2}{2l^2}\right) \tag{4}$$

In the objective function, sd is the proposed similarity distance, and k is the feature dimension. In other words, we propose a global minimum optimization of feature similarities from multiple feature perspectives. p_i and p_j are two pixels in the image, and Φ_{p_i} is a feature matrix that represents the pixel within a local patch (with window-size variable m) in the form of multiple views. f^d_{i,j} represents the feature vector at position (i, j) from the d-th dimension. ω_{i,j} is a k-dimensional column vector following a Gaussian distribution between p_i and p_j, and l is a deviation variable. Hence, this mathematical model transforms the image segmentation problem into an optimization problem in a high-dimensional space.
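For concreteness, the following is a minimal NumPy sketch of Eqs. (2)–(4) under simplifying assumptions: `feature_maps` is a hypothetical H × W × k array stacking the k per-view feature maps, the window size m is odd, and the border handling is an illustrative choice rather than the authors' implementation.

```python
import numpy as np

def patch_features(feature_maps, i, j, m):
    """Phi_{p_i} of Eq. (3): the k per-view feature values in an
    m x m window centered on pixel (i, j); feature_maps is (H, W, k).
    Interior pixels only; border handling is a simplifying choice."""
    h = m // 2
    return feature_maps[i - h:i + h + 1, j - h:j + h + 1, :]

def similarity_distance(phi_i, phi_j):
    """sd(p_i, p_j) of Eq. (2): squared distance between two patches."""
    return float(np.sum((phi_i - phi_j) ** 2))

def gaussian_weight(pix_i, pix_j, l):
    """omega_{i,j} of Eq. (4): k-dimensional vector of Gaussian
    weights between the feature vectors of two pixels."""
    return np.exp(-((pix_i - pix_j) ** 2) / (2.0 * l ** 2))
```

The objective in Eq. (1) then accumulates `similarity_distance` weighted by each of the k components of `gaussian_weight`.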


3. Proposed method

Many feature descriptors can be used to represent an image from various perspectives, including the saliency, depth, spatial, and color features. In image segmentation, the saliency feature is helpful when images contain prominent foreground objects. When a clear distance of sight exists, the depth feature is the primary contributor. In Fig. 2, we use the saliency, depth, spatial, and color features together to represent the images from multiple perspectives. Every individual feature can effectively segment its own group of images, as observed in Fig. 2(a) and (b); however, this is not guaranteed to apply to other image groups. Therefore, we assume that combining multiple features will produce improved results compared to using any single feature for binary image segmentation.

Fig. 2. Image features from multiple perspectives. From top to bottom: the saliency feature, depth feature, spatial feature (superpixel), and color histogram. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The traditional approach combines several feature representations into a high-dimensional vector during computation. However, the contribution of each individual image feature is not uniform across different image categories. For example, in a large, open scene, the depth feature is useful for differentiating the foreground and background, whereas in a portrait, the color and saliency features are more useful. We herein propose "feature semantics" (FS) as a novel approach to learn the projection of multiple features toward the potential solution surface. The term "semantics" does not share the meaning used in semantic segmentation. The proposed FS is the distribution of the multiview feature projection in a high-dimensional feature space, and the "semantics" of the features can be used to separate the pixels of the foreground from those of the background. Instead of obtaining the object and part contours directly, we learn a feature selection matrix S, as the "mode," for every object image, which can be used to project the multiview features to a low dimension in a semantically separable manner. Hence, (1) can be reformulated as

$$\begin{aligned}
\arg\min_{\Phi} \sum_{n=1}^{k} \left\|\Phi_{p_i} - \Phi_{p_j}\right\|^2 \cdot \left(\omega_{i,j}\right)_n
&= \arg\min_{\Phi} \operatorname{tr}\!\left(\left[\Phi_i^1 - \Phi_j^1,\, \ldots,\, \Phi_i^n - \Phi_j^n\right] \operatorname{diag}\!\left(\omega_{i,j}\right) \left[\Phi_i^1 - \Phi_j^1,\, \ldots,\, \Phi_i^n - \Phi_j^n\right]^{T}\right) \\
&= \arg\min_{\Phi} \operatorname{tr}\!\left(\Phi_{p_i} \left[-e_n, I_n\right]^{T} \operatorname{diag}\!\left(\omega_{i,j}\right) \left[-e_n, I_n\right] \Phi_{p_j}^{T}\right) \\
&= \arg\min_{\Phi} \operatorname{tr}\!\left(\Phi_{p_i} L_n \Phi_{p_j}^{T}\right),
\end{aligned} \tag{5}$$

where e_n = [1, 1, ..., 1]^T, I_n is an n × n identity matrix, tr(·) is the trace operation, and L_n ∈ R^{(n+1)×(n+1)} encodes the objective function for the similarity distance in n dimensions. According to (5), it is

$$L_n = \begin{bmatrix} \sum_{n=1}^{k} \omega_{i,j} & -\omega_{i,j}^{T} \\ -\omega_{i,j} & \operatorname{diag}\left(\omega_{i,j}\right) \end{bmatrix}. \tag{6}$$
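As a sketch of how the alignment matrix in Eq. (6) can be assembled for one view, the snippet below builds L_n from a vector of Gaussian weights; the dense representation and the function name are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def patch_laplacian(omega):
    """L_n of Eq. (6): omega holds the n Gaussian weights between the
    center pixel and its n neighbors; returns the (n+1) x (n+1) matrix
    [[sum(omega), -omega^T], [-omega, diag(omega)]]."""
    n = omega.shape[0]
    L = np.empty((n + 1, n + 1))
    L[0, 0] = omega.sum()
    L[0, 1:] = -omega
    L[1:, 0] = -omega
    L[1:, 1:] = np.diag(omega)
    return L
```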

This remains a difficult problem to solve. Because different views contribute differently to the final solution surface, we assume that the contribution degree is non-negative. For each patch p_i, there is a low-dimensional embedding Θ = [Θ^1_{i,j}, Θ^2_{i,j}, ..., Θ^k_{i,j}], and the potential feature mapping matrix S can be used to map the multiview features globally, i.e., Φ_{p_i} = Θ · S_n, where S_n ∈ R^{n×k}. Therefore, (5) can be rewritten as

$$\arg\min_{\Theta,\, S_n} \sum_{n=1}^{k} \operatorname{tr}\left(\Theta S_n L_n S_n^{T} \Theta^{T}\right) = \arg\min_{\Theta,\, S_n} \operatorname{tr}\left(\Theta L \Theta^{T}\right), \tag{7}$$

$$L = \sum_{n=1}^{k} S_n L_n S_n^{T}. \tag{8}$$

Inserting (6) into (8), we have

$$L_n = D_n - W_n \tag{9}$$

where W_n ∈ R^{n×n} and (W_n)_{i,j} = exp(−‖x_i^n − x_j^n‖²/ε_n) if the similarity between pixel i and pixel j is close, and (W_n)_{i,j} = 0 otherwise; D_n is a diagonal matrix with (D_n)_{i,i} = Σ_j (W_n)_{i,j}. Therefore, L_n is an

unnormalized graph Laplacian matrix [18]. There is no direct way to optimize the solution defined in Eq. (7); the traditional approach is an iterative algorithm using alternating optimization [19]. However, we found that the definition in Eq. (7) is similar to the spectral relaxation utilized in [20].


Fig. 3. Comparison with other unsupervised segmentation algorithms: (a) target image; (b) results by Gaussian mixture model (GMM); (c) results by level set; (d) results by linear spectral clustering (LSC); (e) results from proposed method.

Therefore, we propose to use the spectral hashing strategy to approximate the globally optimal solution. In the case of a uniform distribution on [a, b], the eigenfunctions of the Laplacian L_n are well established in mathematics; they correspond to the fundamental modes of vibration of a metallic plate. The eigenfunctions Φ_k(x) and eigenvalues λ_k are

$$\Phi_k(x) = \sin\left(\frac{\pi}{2} + \frac{k\pi}{b-a}\, x\right), \tag{10}$$

$$\lambda_k = 1 - e^{-\frac{\varepsilon^2}{2}\left|\frac{k\pi}{b-a}\right|^2}. \tag{11}$$

The learned modes can be used to map the high-dimensional feature Θ onto a low-dimensional feature surface via the user-specified threshold θ. The binary representation of the feature projection helps reduce the computational complexity, as shown in (12):

$$\arg\min\, sd\left(p_i, p_j\right) = \operatorname{Hamdist}\left(\operatorname{Binary}\left(\text{modes}[16] \cdot \Theta^{n}_{i,j}\right)\right) \tag{12}$$
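In closed form, Eqs. (10) and (11) are straightforward to evaluate; the sketch below assumes a feature component approximately uniformly distributed on [a, b] and an illustrative kernel width eps, and binarizes a mode response against zero to produce one hash bit.

```python
import numpy as np

def eigenfunction(x, k, a, b):
    """Phi_k(x) of Eq. (10) for a uniform distribution on [a, b]."""
    return np.sin(np.pi / 2.0 + (k * np.pi / (b - a)) * x)

def eigenvalue(k, a, b, eps):
    """lambda_k of Eq. (11); spectral hashing [20] keeps the modes
    with the smallest eigenvalues as hash bits."""
    return 1.0 - np.exp(-(eps ** 2 / 2.0) * abs(k * np.pi / (b - a)) ** 2)

def hash_bit(x, k, a, b):
    """One binary code bit: the sign of the k-th mode response."""
    return eigenfunction(x, k, a, b) > 0.0
```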

The proposed algorithm is listed as follows.

Input: image I, feature dimension n, distance threshold θ
Output: segmentation result output
1: extract the feature matrix Θ^n_{i,j};
2: based on Θ^n_{i,j}, calculate L_n by Eq. (9);
3: [eigenvector, eigenvalue] ← Laplacian L_n;
4: learn the modes according to Eqs. (10)–(12);
5: obtain the pixel set P by Eq. (12) and θ;
6: calculate the segmentation result output.
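The listing can be strung together into a minimal end-to-end sketch. Here, `pixel_features`, the choice of the first pixel as a foreground seed, and the default of 16 modes are illustrative assumptions, not the authors' exact implementation of Eqs. (9)–(12).

```python
import numpy as np

def segment(pixel_features, theta, n_modes=16):
    """Binary segmentation sketch following the listed steps.
    pixel_features: (N, k) array with one row of k view responses
    per pixel (step 1 is assumed done by upstream extractors)."""
    # Steps 2-4: for (roughly) uniformly distributed feature values,
    # the Laplacian eigenfunctions are analytical (Eqs. (10)-(11)),
    # so no explicit N x N eigendecomposition is required.
    a = pixel_features.min(axis=0)
    b = pixel_features.max(axis=0)
    bits = []
    for mode in range(1, n_modes + 1):
        phi = np.sin(np.pi / 2.0
                     + (mode * np.pi / (b - a + 1e-12)) * pixel_features)
        bits.append(phi.sum(axis=1) > 0.0)   # one binary bit per mode
    codes = np.stack(bits, axis=1)           # (N, n_modes) hash codes
    # Step 5: Hamming distance (Eq. (12)) to a reference pixel; the
    # first pixel stands in here for a salient foreground seed.
    hamming = np.count_nonzero(codes != codes[0], axis=1)
    # Step 6: pixels within theta bits of the seed are foreground.
    return hamming <= theta
```

Thresholding the Hamming distance against θ mirrors the similarity cut analyzed in Section 4.4, where θ ∈ [9, 11] is found to work well.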

The time complexity of the proposed algorithm comprises two parts. The first is the construction of the alignment matrices for the multiple views, i.e., the computation of L_n, with time complexity O((Σ_{i=1}^{k} i) × n²). The second is the optimization process. Based on reference [19], alternating optimization requires updating Θ by the eigenvalue decomposition of an n × n matrix (O(n³)) and updating S (O(n × k)). Therefore, the entire time complexity of alternating optimization is O((Σ_{i=1}^{k} i) × n² + (n³ + n × k) × T), where T is the number of training iterations. With the spectral hashing model, however, the time complexity of the second part is reduced to O(n), so the time complexity of our algorithm is O((Σ_{i=1}^{k} i) × n² + n).

Fig. 4. Comparison with other deep learning models on the PASCAL VOC dataset. From top to bottom: original image, segmentation label, results by FCN-8s, FCN-16s, FCN-32s, Deeplab v3, Mask-RCNN, W-net, and ours. Five image categories are used in the comparison; from left to right: people, airplane, car, animal, and multi-object.

4. Experiments

4.1. Experiment setup

The experimental environment is as follows: an Intel Core i7-6700K CPU @ 4.00 GHz, 64 GB of RAM, an NVIDIA GTX 1070, and a 64-bit Windows system. Four image features were used to generate the multiview image feature: the saliency feature [21], depth feature [22], spatial feature [23], and color histogram [24]. Because depth information is not always available in general images, we use the algorithm proposed in reference [14] to estimate the depth cue from the slight blur around contours via sparse representation and image decomposition. The Berkeley Segmentation Dataset (BSDS500) [25] and the Pascal Visual Object Classes (Pascal VOC) [26] dataset were used to test the performance of our proposed algorithm against other state-of-the-art algorithms. Three groups of experiments were executed. The first group was a qualitative comparison among the Gaussian mixture model (GMM) [27], the level set model [28], a new superpixel algorithm, the linear spectral clustering (LSC) algorithm [29], and the proposed method. As deep learning is now widespread, comparisons with FCN, W-net, Deeplab v3, and Mask-RCNN form the second group. Four segmentation evaluation metrics, the probabilistic rand index (PRI), global consistency error (GCE), variation of information (VOI), and intersection over union (IoU), were used to quantitatively evaluate the performance of the methods above. The comparisons with recent segmentation algorithms were obtained on the BSDS500 database; those with deep learning approaches were obtained on the Pascal VOC 2012 dataset.

4.2. Qualitative experiment

In the first group, we compared several recent unsupervised segmentation algorithms with ours, as shown in Fig. 3. The spatial feature used herein was calculated based on the superpixel algorithm; therefore, the recent LSC algorithm was chosen for comparison of the segmentation results. In Fig. 3(b), the foreground objects are selected using the GMM approach; however, influenced by the complex background, some regions, such as the clouds and grass, are misclassified. In Fig. 3(c), the results from the level set algorithm are relatively sensitive to the initial position. If the initial position is suitable, the results are satisfactory, as shown in the image of the ostrich; however, a poorly chosen initial position produces poor results. Fig. 3(d) shows the results obtained from the LSC superpixel algorithm; it is noteworthy that some regions near the object contours are lost. Fig. 3(e) shows the results of our algorithm with the contribution of the multiview features; it clearly outperforms the other approaches, and the foreground and background are clearly segmented. Although the experimental results are not presented in the same modality, we prefer to show them in their original forms produced by the source codes, rather than post-processed. Subsequently, we compare the proposed algorithm with deep learning approaches, i.e., FCN, W-net, Deeplab v3, and Mask-RCNN.


Table 1. Performance comparison of the proposed method (mean ± standard deviation).

| Method    | Probabilistic rand index (PRI) | Global consistency error (GCE) | Variation of information (VOI) |
|-----------|--------------------------------|--------------------------------|--------------------------------|
| FCN-8s    | 0.8615 ± 0.0465                | 0.1566 ± 0.1214                | 1.5146 ± 0.6037                |
| FCN-16s   | 0.8725 ± 0.0648                | 0.1629 ± 0.1107                | 1.6249 ± 0.5589                |
| FCN-32s   | 0.8736 ± 0.0658                | 0.1623 ± 0.1325                | 1.4701 ± 0.5855                |
| DeepLab   | 0.8669 ± 0.0704                | 0.1499 ± 0.0955                | 1.0755 ± 0.4592                |
| Mask-RCNN | 0.8577 ± 0.0052                | 0.1904 ± 0.2040                | 0.7228 ± 0.4295                |
| W-net     | 0.8289 ± 0.0531                | 0.2014 ± 0.2504                | 1.5321 ± 0.4058                |
| Ours      | 0.9056 ± 0.0431                | 0.0956 ± 0.0552                | 0.9824 ± 0.4239                |

Table 2. Intersection over union (IoU) metrics comparison.

| Method    | People | Airplane | Car  | Animal | Multi-object | Average |
|-----------|--------|----------|------|--------|--------------|---------|
| FCN-8s    | 0.69   | 0.82     | 0.87 | 0.86   | 0.81         | 0.81    |
| FCN-16s   | 0.69   | 0.81     | 0.84 | 0.86   | 0.78         | 0.796   |
| FCN-32s   | 0.69   | 0.81     | 0.86 | 0.85   | 0.79         | 0.8     |
| Deeplab   | 0.72   | 0.79     | 0.83 | 0.81   | 0.70         | 0.764   |
| Mask-RCNN | 0.63   | 0.85     | 0.88 | 0.86   | 0.82         | 0.808   |
| W-net     | 0.65   | 0.84     | 0.83 | 0.84   | 0.79         | 0.79    |
| Ours      | 0.73   | 0.86     | 0.93 | 0.91   | 0.79         | 0.848   |

We used the FCN [6,30] models in the AlexNet (CaffeNet) architecture, scoring 48.0 mIU on seg11valid, with a three-stream, 8-pixel prediction stride net (FCN-8s); a two-stream, 16-pixel prediction net (FCN-16s); and a single-stream, 32-pixel prediction net (FCN-32s). This network was trained with gradient accumulation, normalized loss, and standard momentum. We trained our W-net [9] on the PASCAL VOC2012 dataset. The images are resized to 128 × 128; the W-net is first trained with a dropout rate of 0.65 for 50,000 iterations and then retrained with a rate of 0.3 for another 50,000 iterations, and the learning rate is reduced by half after every 10,000 iterations. We fine-tuned the Deeplab v3 model [10] and the Mask-RCNN model [11] from models pre-trained on the MS-COCO dataset [31]. All the comparison details can be found in Fig. 4. Five categories, namely people, airplane, car, animal, and multi-object (more than one object), are used in the segmentation comparison. As observed in Fig. 4, the FCN-8s results are the best of the three FCNs. The other three deep learning models, i.e., Deeplab v3, Mask-RCNN, and W-net, segment different objects in one image, as in the multi-object class. As the primary task of semantic segmentation by deep learning is to segment all the different objects and assign them different labels, which is the extent of pixel-level classification, their segmentation accuracy is not better than that of the proposed method.

4.3. Quantitative experiment

For region-based evaluation, we chose the GCE, PRI, and VOI as the standards with which to assess the segmentation quality of digital images. The GCE determines the extent to which the regions in one segmentation are subsets of the regions in another segmentation. If one segmentation is a proper subset of the other, its pixels lie in an area of refinement, and the error is expected to be close to zero. The PRI measures the consistency of labels between two segmentation data sources compared to the ground truth, where a stronger consistency indicates better-quality results. The VOI defines the distance between two segmentations as the average conditional entropy of one region given the other; lower distance values indicate better results. A total of 150 images were selected from five image categories (people, airplane, car, animal, and multi-object) in the PASCAL VOC2012 dataset. The averaged results and their standard deviations, delineated in Table 1, clearly demonstrate the potential of our algorithm. The Jaccard index, also known as intersection over union (IoU), is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures the similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

where A is the segmentation result produced by the proposed or comparative algorithms, and B is the segmentation label in the PASCAL VOC dataset. The details are shown in Table 2.
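A minimal sketch of this IoU computation on binary masks, assuming `pred` and `label` are boolean arrays of identical shape:

```python
import numpy as np

def iou(pred, label):
    """Jaccard index J(A, B) = |A & B| / |A | B| for binary masks."""
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union > 0 else 1.0
```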

Fig. 5. Results of parameter analysis: (a) feature points detected; (b) preliminary segmentation results by simple linear iterative clustering (SLIC); (c) final segmentation results, from left to right, θ = 8, 9, 10, 11, 12.


In Table 2, the IoU of our algorithm is better than those of the other six deep learning models in the 'people', 'airplane', 'car', and 'animal' categories. However, in the 'multi-object' class, the proposed algorithm cannot segment all the objects, only the salient and less deep ones. In terms of the average IoU value, the proposed algorithm is experimentally more promising than the other deep learning models.

4.4. Parameter analysis

In the proposed algorithm, users must specify one variable in advance, i.e., the distance threshold θ mentioned in Section 3, which represents the degree of similarity. We evaluated a set of θ values on the "3063.jpg" file from the BSDS500. The results are shown in Fig. 5. We observed that if θ was small, indicating that the similarity distance was close to zero, the detected feature points were too numerous and contained several artifacts, leading to over-segmentation of the results. Conversely, if θ was large, fewer feature points were detected, producing incomplete segments. Thus, the proper selection of the similarity distance threshold θ is essential for the quality of the results. Herein, θ ∈ [9, 11].

5. Conclusions

We herein proposed an unsupervised segmentation algorithm that globally optimizes the similarity distance at the pixel level for binary image segmentation into foreground and background. Owing to the computational complexity, the objective function was simplified to a computable form, and the incorporation of a hashing model provided an efficient solution to this problem. The comparison between the proposed unsupervised algorithm and the FCNs showed promising results that support the feasibility of our work. Some problems remain to be solved in future work. The parameter sensitivity of θ should be learned automatically using an advanced learning strategy. Although the current algorithm is appropriate for images with an obvious sense of distance, its performance on more complex natural images with rich, textured backgrounds can be improved.

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript. Dr. Jia Cui, Shandong Normal University.

Acknowledgments

The authors would like to thank Prof. Yair Weiss and Dr. Evan Shelhamer for their codes and models. This work was sponsored by Chinese NSFC funding 61502285 and Shandong Province NSF funding ZR2014FQ013.

References

[1] H. Zhu, et al., Beyond pixels: a comprehensive survey from bottom-up to semantic image segmentation and cosegmentation, J. Visual Commun. Image Represent. 34 (2016) 12–27.
[2] Z. Ban, J. Liu, L. Cao, Superpixel segmentation using Gaussian mixture model, IEEE Trans. Image Process. 27 (8) (2018) 4105–4117.
[3] S. Guimarães, et al., Hierarchizing graph-based image segmentation algorithms relying on region dissimilarity, Math. Morphol. Theory Appl. 2 (1) (2017) 55–75.
[4] R. Achanta, et al., SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2274–2282.
[5] D. Stutz, A. Hermans, B. Leibe, Superpixels: an evaluation of the state-of-the-art, Comput. Vision Image Underst. 166 (2018) 1–27.
[6] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[7] L.-C. Chen, et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[8] Z. Feng, et al., Deep retinal image segmentation: a FCN-based architecture with short and long skip connections for retinal image segmentation, in: International Conference on Neural Information Processing, Springer, 2017.
[9] X. Xia, B. Kulis, W-Net: a deep model for fully unsupervised image segmentation, arXiv:1711.08506, 2017.
[10] L.-C. Chen, et al., Rethinking atrous convolution for semantic image segmentation, arXiv:1706.05587, 2017.
[11] K. He, et al., Mask R-CNN, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017.
[12] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder–decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (12) (2017) 2481–2495.
[13] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. Image Process. 23 (5) (2014) 2019–2032.
[14] J. Yu, et al., Learning to rank using user clicks and visual features for image retrieval, IEEE Trans. Cybern. 45 (4) (2015) 767–779.
[15] C. Hong, et al., Multimodal deep autoencoder for human pose recovery, IEEE Trans. Image Process. 24 (12) (2015) 5659–5670.
[16] C. Liu, et al., Attention correctness in neural image captioning, in: AAAI, 2017.
[17] A. Das, et al., Human attention in visual question answering: do humans and deep networks look at the same regions? Comput. Vision Image Underst. 163 (2017) 90–100.
[18] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[19] J.C. Bezdek, R.J. Hathaway, Some notes on alternating optimization, in: AFSS International Conference on Fuzzy Systems, Springer, 2002.
[20] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems, 2009.
[21] Q. Yan, et al., Hierarchical saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[22] J. Shi, L. Xu, J. Jia, Just noticeable defocus blur detection and estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[23] Y. Liu, et al., Manifold SLIC: a fast method to compute content-sensitive superpixels, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[24] P. Liu, et al., Fusion of color histogram and LBP-based features for texture image retrieval and classification, Inf. Sci. 390 (2017) 95–111.
[25] D. Martin, et al., A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), IEEE, 2001.
[26] M. Everingham, et al., The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[27] S. Ragothaman, et al., Unsupervised segmentation of cervical cell images using Gaussian mixture model, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016.
[28] Y. Wu, C. He, A convex variational level set model for image segmentation, Signal Process. 106 (2015) 123–133.
[29] J. Chen, Z. Li, B. Huang, Linear spectral clustering superpixel, IEEE Trans. Image Process. 26 (7) (2017) 3317–3330.
[30] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 640–651.
[31] T.-Y. Lin, et al., Microsoft COCO: common objects in context, in: European Conference on Computer Vision, Springer, 2014.