Joint entropy based learning model for image retrieval

Hao Wu1, Yueli Li2, Xiaohan Bi3, Linna Zhang4, Rongfang Bie1, Yingzhuo Wang5

1 College of Information Science and Technology, Beijing Normal University
2 College of Information Science and Technology, Hebei Agricultural University
3 School of Computer and Information Technology, Beijing Jiaotong University
4 College of Mechanical Engineering, Guizhou University
5 Wendeng Technician College of WeiHai
Abstract

As a classic technique of computer vision, image retrieval can effectively retrieve target images from hundreds of thousands of images, and with the rapid development of deep learning the quality of retrieval has increased markedly. Under normal conditions, however, high-quality retrieval rests on a large number of learning instances, which demand considerable human effort in selection and considerable computing resources in computation. More importantly, for some special categories it is difficult to obtain a large number of learning instances at all. Aiming at this problem, we propose a joint entropy based learning model that reduces the number of learning instances required by optimizing their distribution. First, candidate learning instances are pre-selected using an improved watershed segmentation method. Then, a joint entropy model is used to reduce the chance of duplicate, useless or even mistaken instances being retained. Finally, a database of a large number of images is built. Extensive experiments on this database show the model's superiority: it reduces the number of learning instances while preserving retrieval accuracy.
Keywords: joint entropy, learning instance, image retrieval, watershed segmentation, precision-recall curve, AP value, AUC value
1. Introduction

Image retrieval is a computer technique for browsing, searching and retrieving images from a large database of digital images. Originally, keywords attached to images were used to retrieve target images. However, manual image annotation is time-consuming, expensive and error-prone. To address these problems, more and more researchers turned to content-based image retrieval (CBIR) [1]. At the outset, CBIR aimed at avoiding label descriptions, applying similarities based on content (color, shape, texture, etc.). As CBIR developed, the semantic concept [2] was incorporated, and classic features combined with SVMs [3] were used for semantic-based retrieval. Besides the classic SVM models, decision trees [4], Bayesian models [5] and other models are also used for retrieval. At the same time, image classification [6], semantic-based image composition [7] and semantic-based image completion [8] are supported by these models.

In recent years, deep learning models [9], as classic models of machine learning, have shown overwhelming advantages over traditional models. CNN-based methods [10-12], RBM-based methods [13-15], autoencoder-based methods [16-18] and sparse coding-based methods [19-21] are the classic models in the field of computer vision. Convolutional Neural Networks (CNNs) are trained through multiple layers, usually convolutional layers, pooling layers, and fully connected layers. The Restricted Boltzmann Machine (RBM), proposed by Hinton in 1986, can be considered a generative stochastic neural network, usually comprising visible units and hidden units. The autoencoder-based model can be considered a special type of artificial neural network that learns efficient encodings. The sparse coding-based model
describes the image by learning an adequate set of basis functions. These four classic families of deep learning methods have brought revolutionary progress to computer vision, and many improved models based on them have been applied. However, the high-quality retrieval results of traditional models depend heavily on sufficient learning instances: if the number of learning instances is reduced, retrieval accuracy declines markedly. For deep learning models in particular, it is difficult to learn a complete model from insufficient learning instances, let alone obtain high-quality results. In most cases, collecting a large number of learning instances consumes substantial computing and human resources, and for some special categories, even our best collection efforts cannot meet the requirement. How to reduce the number of learning instances while preserving retrieval accuracy has therefore become a meaningful problem. In this paper, aiming at this problem, we propose a joint entropy based learning model that optimizes the distribution of learning instances. In the process of optimization, duplicate, useless or even mistaken instances are deleted, so that a small number of high-quality learning instances can maintain a high level of retrieval accuracy.

The rest of the paper is organized as follows: Section 2 introduces work related to our method. Section 3 introduces our model in detail. Section 4 presents the experimental process and results. Section 5 gives concluding remarks.
2. Related work

Much work related to our joint entropy based learning model has already been done. Among it, feature descriptors, SVM models, learning-instance reduction models and entropy-based models are the essential lines of work: they have achieved remarkable results and directly and effectively support this paper's model.

2.1 Feature descriptor

Originally, RGB colour histograms [22] and texture histograms [23] were often used to evaluate the similarity between images. With the development of semantic content, famous feature descriptors were proposed: SIFT [24], rgSIFT [25], PHOG [26] and GIST [27] are classic descriptors that extract semantic content effectively. Later, optimized feature descriptors [28, 29] were also used. In computer vision, image classification, image retrieval and image annotation often rely on them as important means of semantic content extraction. Deep learning models (e.g., CNN-based models [11, 30], RBM-based models [13, 31], autoencoder-based models [18, 32], sparse coding-based models [20, 33]) can learn a retrieval model without the feature descriptors above and, in most cases, represent images even more effectively. In CNN-based models especially, multiple layers of nonlinear processing units perform feature extraction and transformation, with convolutional layers, pooling layers and fully connected layers playing different roles. Compared to CNN-based models, RBM-based, autoencoder-based and sparse coding-based models are used less frequently, but they too can learn high-level semantic features from data using hierarchical architectures. However, nearly all high-quality deep learning models rest on at least tens of thousands of learning instances.
2.2 SVM model

In machine learning, support vector machines (SVMs) [34] are classic supervised learning models that train a classification model from labeled learning instances. Risk minimization [35] is the essential part, seeking a low expected risk for the trained model. However, during model training it is difficult to escape the curse of dimensionality [36]. Kernel functions [37, 38] reduce the complexity of the calculation; classic kernels such as the Gaussian kernel [39] and the hyperbolic tangent (sigmoid) kernel [40] are widely used. In most cases the learning instances are labeled by hand. Once the classification model is trained, target images can be retrieved with it. Compared to deep learning models, the number of learning instances required can be reduced considerably.

2.3 Learning instance reduction model

Many models for reducing the number of learning instances have been proposed, and they fall broadly into two kinds. On the one hand, researchers take full advantage of all existing instances, including those that do not belong to the category: given a few learning instances from one category, they treat similar learning instances from other categories as candidate learning instances. Paper [41] proposed a model that reduces the required learning instances using similar or dissimilar instances, and a semantic distance model [42] is used to evaluate the candidates. On the other hand, learning-instance optimization models have been used to reduce the number of learning instances; paper [43] proposed a model that optimizes the learning instances effectively. In general, these approaches reduce the number of learning instances to a certain extent by incorporating candidate instances or optimizing existing ones. However, the remaining number of learning instances is still a burden on human and computing resources.

2.4 Entropy-based model

In information theory, joint entropy is a measure of the uncertainty associated with a set of variables. In computer vision, joint entropy [44], like Shannon entropy [45], has been widely used, and improved models are increasingly applied to image annotation, image classification, image composition and image completion. For instance, paper [46] used a joint entropy model to label unknown regions, constructing the semantic structure of an image with joint entropy. Until now, however, there has been no model that applies joint entropy to learning instance optimization; our model makes full use of joint entropy to select optimized learning instances.

Although related work has studied aspects of this topic, some of it in depth and with remarkable results, a large number of learning instances remains a burden on human and computing resources, and there is no accepted theory describing the distribution of learning instances. To solve this problem, building on previous work, this paper proposes a joint entropy based learning model that improves image retrieval quality by optimizing the distribution of learning instances.
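For readers unfamiliar with the measure, the following minimal sketch (plain NumPy, with an illustrative joint distribution of our own choosing, not data from the paper) shows how the joint entropy of two discrete variables is computed:

```python
import numpy as np

def joint_entropy(p):
    """Shannon joint entropy H(X, Y) = -sum p(x, y) * log2 p(x, y)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                 # normalize to a valid distribution
    nz = p[p > 0]                   # 0 * log 0 is treated as 0
    return -np.sum(nz * np.log2(nz))

# Illustrative joint distribution over two binary variables (made-up numbers)
p_xy = [[0.25, 0.25],
        [0.25, 0.25]]
print(joint_entropy(p_xy))  # 2.0 bits: maximal for two fair, independent bits
```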
3. Algorithm
As discussed above, optimizing the distribution of learning instances has become a challenging and meaningful problem. To solve it, we mainly use joint entropy to optimize the learning instances. To avoid selecting low-quality learning instances by hand, we first pre-select them with an improved watershed segmentation method; on this foundation, joint entropy is then used to optimize their distribution. As shown in figure 1, a set of selected learning instances (about 20) serves as reference learning instances, which are processed by improved watershed segmentation. For each new input learning instance, after watershed segmentation, the joint entropy between it and the optimized learning instances is calculated. If the joint entropy is higher than expected, the instance is considered valid; otherwise it is considered invalid. This process repeats until enough valid learning instances are obtained. To make the algorithm and experiments more convincing, the joint entropy of the reference learning instances is calculated ahead of time, and the joint entropy of the final set of valid learning instances is calculated as well; if it falls outside expectations, the valid learning instances are re-selected.
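To make the flow just described (and summarized in figure 1) concrete, the sketch below restates it as schematic Python. Here `watershed_preselect`, `joint_entropy_with` and the threshold `tau` are placeholders for the components described in sections 3.1 and 3.2, not the authors' actual implementation:

```python
def select_learning_instances(candidates, references, tau, n_required,
                              watershed_preselect, joint_entropy_with):
    """Greedy selection loop following figure 1 (schematic only)."""
    selected = [img for img in references if watershed_preselect(img)]
    for img in candidates:
        if len(selected) >= n_required:
            break
        if not watershed_preselect(img):      # section 3.1: drop invalid instances
            continue
        # section 3.2: accept only instances that raise the joint entropy enough
        if joint_entropy_with(selected, img) > tau:
            selected.append(img)
    return selected
```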
Figure 1: The flow chart of selecting and optimizing the learning instances.

3.1 Pre-selection step

Although our model selects optimized learning instances, invalid learning instances would still waste computing resources. For instance, in figure 2(a) a target object and an interfering object exist in the same image, and the interfering object occupies a non-negligible part of it. In another case, figure 2(b), the target object occupies only a negligible part of the image. If such images were selected, they would contribute little or nothing to the training model while wasting computing resources. So in this step we first select valid learning instances.
Figure 2: Some learning instances in different conditions. Image (a) shows an example where the image contains more than one object, image (b) shows an example where the target object occupies only a small part of the whole image, and image (c) shows a realistic example.
Aiming at the problem above, we use a WLS (weighted least squares) filter based decomposition model [47, 48] to pre-process the image. The WLS filter is an edge-preserving, region-smoothing filter: it keeps the filtered image as similar as possible to the original while making it sharper along edges and smoother within regions. The WLS filter minimizes the following energy:

$$\sum_i \left( (u_i - g_i)^2 + k \left( a_{x,i}(g) \left(\frac{\partial u}{\partial x}\right)_i^2 + a_{y,i}(g) \left(\frac{\partial u}{\partial y}\right)_i^2 \right) \right) \quad (1)$$

where i indexes a pixel, the data term $(u_i - g_i)^2$ minimizes the difference between the original image g and the filtered image u, and the second term smooths the regions while preserving the edges. The smoothness weights $a_x$ and $a_y$ are determined by the image g, and k balances the two terms. Increasing k increases the smoothness of each version: the larger k is, the smoother the new image. The degree of smoothness can be tuned by changing k (0.3-4) and the weight exponent c (1.4-1.8). Figures 3 and 4 show two groups of edge-preserving decomposition based images with different degrees of smoothness. The edge-preserving decomposition based image can then be used for segmentation, which helps to evaluate whether the target images are realistic or not.
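The following is a minimal sketch of a WLS-style smoother corresponding to equation (1), assuming a grayscale image g in [0, 1]. The weights are computed here directly from the gradients of g (some implementations use log-luminance instead), and the exact weight form is our simplification rather than a reproduction of [47, 48]:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def wls_filter(g, k=1.0, c=1.6, eps=1e-4):
    """Sketch of a WLS edge-preserving smoother: solve (I + k * L) u = g,
    where L is a Laplacian whose weights shrink across strong edges of g."""
    rows, cols = g.shape
    n = rows * cols
    idx = np.arange(n).reshape(rows, cols)

    # Smoothness weights a_x, a_y from gradients of g: a larger gradient
    # gives a smaller weight, i.e., less smoothing across that edge.
    gx = np.abs(np.diff(g, axis=1))          # horizontal gradients, (rows, cols-1)
    gy = np.abs(np.diff(g, axis=0))          # vertical gradients, (rows-1, cols)
    ax = 1.0 / (gx ** c + eps)
    ay = 1.0 / (gy ** c + eps)

    # Assemble the weighted Laplacian from horizontal and vertical neighbor pairs.
    i = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    j = np.concatenate([idx[:, 1:].ravel(),  idx[1:, :].ravel()])
    w = np.concatenate([ax.ravel(), ay.ravel()])
    W = sp.coo_matrix((w, (i, j)), shape=(n, n))
    W = W + W.T
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W

    A = sp.eye(n) + k * L                    # data term + k * smoothness term
    u = spsolve(A.tocsc(), g.ravel())
    return u.reshape(rows, cols)
```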
Figure 3: Edge-preserving decomposition based images with different smoothness levels. Image (d) is often considered the most suitable optimized image.
Figure 4: Edge-preserving decomposition based images with different smoothness levels. Image (d) is often considered the most suitable optimized image.
In this paper, we use watershed segmentation to process the filtered image. Since the watershed segmentation technique [49] has been widely used for many years, we present it only briefly. Equation (2) expresses the process mathematically:

$$g(x, y) = \operatorname{grad}(f(x, y)) = \sqrt{\big(f(x, y) - f(x-1, y)\big)^2 + \big(f(x, y) - f(x, y-1)\big)^2} \quad (2)$$

where grad is the gradient operator, f(x, y) is the original image, and g(x, y) is the image processed by watershed segmentation. After segmentation, we can judge whether the target image can be used as a realistic learning instance: if the image contains more than one object, or the target object occupies only a small part of the whole image, it is deleted. Figures 5 and 6 show two groups of segmentation results using the improved watershed segmentation model. This preprocessing step deletes invalid learning instances and directly reduces wasted computation.
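A schematic sketch of this step with scikit-image is shown below; the marker choice and the area thresholds are illustrative assumptions, not the paper's actual values:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import sobel
from skimage.segmentation import watershed

def watershed_segment(image_gray):
    """Sketch: watershed on the gradient magnitude, in the spirit of equation (2)."""
    gradient = sobel(image_gray)              # g(x, y) = |grad f(x, y)|
    # Markers from low-gradient regions (one simple choice of seeds)
    markers, _ = ndi.label(gradient < gradient.mean() * 0.5)
    return watershed(gradient, markers)

def looks_like_valid_instance(labels, min_frac=0.2, max_frac=0.9):
    """Heuristic for section 3.1: the dominant segment should occupy a
    substantial (but not total) fraction of the image; thresholds are
    illustrative assumptions."""
    areas = np.bincount(labels.ravel())[1:]   # segment sizes (labels start at 1)
    frac = areas.max() / labels.size
    return min_frac <= frac <= max_frac
```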
Figure 5: Segmentation result using the improved watershed segmentation model. Image (a) is the edge-preserving decomposition based image and image (b) is the segmentation result based on image (a).
Figure 6: Segmentation result using the improved watershed segmentation model. Image (a) is the edge-preserving decomposition based image and image (b) is the segmentation result based on image (a).
3.2 Construction of joint entropy based model

After the pre-selection step, the learning instances need further optimization. In this step, we use a joint entropy based model to evaluate the validity of the distribution of learning instances. The joint entropy of a set of learning instances can be considered the combined entropy of the individual instances, calculated from the following probabilities:

$$H(x_1, \ldots, x_m) = -\sum p(x_1, \ldots, x_m) \log p(x_1, \ldots, x_m) \quad (3)$$

where $p(x_1, \ldots, x_m)$ is the probability of the learning instances existing together, taking the contextual semantic relationships between them into consideration, and $x_1, \ldots, x_m$ are the different learning instances. Most learning instances are already known with high confidence; only the newly incorporated learning instance has many possibilities.
Calculating the full joint entropy is a heavy burden on computing resources, so it needs to be approximated. The first-order entropy is an effective approximation: the sum of the entropies of each learning instance considered individually,

$$H_1 = \sum_{i=1}^{m} H(x_i) = -\sum_{i=1}^{m} \sum p(x_i) \log p(x_i) \quad (4)$$

where $H_1$ is the first-order entropy and $p(x_i)$ is the existence probability of each learning instance. However, this approximation ignores the contextual semantic relationships between learning instances, which affect the entropy considerably. Aiming at this problem, the second-order approximation of the joint entropy, known as the Bethe entropy approximation [50], can be calculated. For a fully connected set of instances it takes the form

$$H_B(x_1, \ldots, x_m) = \sum_{i<j} H(x_i, x_j) - (m-2) \sum_{i=1}^{m} H(x_i) \quad (5)$$

where m is the number of learning instances, $H(x_i, x_j)$ is the pairwise entropy computed from the joint probability $p(x_i, x_j \mid f_i, f_j)$ of learning instances $x_i$ and $x_j$ (which assumes that only instances $x_i$ and $x_j$ exist), and $f_i$, $f_j$ are the appearance features of $x_i$ and $x_j$. This approximation takes the contextual semantic relationships between learning instances fully into consideration. When a new learning instance is to be incorporated, the last term is essentially stable, because the majority of learning instances already exist; the first term, however, changes continually, because the contextual semantic relationships change during incorporation. How to calculate the contextual semantic relationship between learning instances has therefore become the key problem. Previous research [51, 52] and extensive experiments indicate that this relationship largely depends on similarity under different feature descriptors: two learning instances that are similar under the SIFT descriptor are likely to be semantically similar, and the same holds for the GIST descriptor. So in this paper the joint semantic probability is defined from the two similarity probabilities:

$$p(x_i, x_j) = p_{GIST}(x_i, x_j) \cdot p_{SIFT}(x_i, x_j) \quad (6)$$

where the first term is the probability of similarity based on the GIST descriptor and the second term is the probability of similarity based on the SIFT descriptor.
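Since equations (4) and (5) carry the main computational logic, the following minimal sketch (our illustration, assuming the pairwise joint tables and singleton marginals are given, and using the complete-graph Bethe form above) shows how the approximation would be evaluated:

```python
import numpy as np
from itertools import combinations

def entropy(p):
    """Shannon entropy of a discrete distribution (array of probabilities)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def bethe_entropy(pairwise, marginals):
    """Second-order (Bethe) approximation of the joint entropy, equation (5):
    sum of pairwise joint entropies minus (m - 2) times the sum of singleton
    entropies, for a fully connected instance graph. `pairwise[(i, j)]` is the
    2-D joint table for instances i and j; `marginals[i]` is instance i's
    marginal (both assumed given here)."""
    m = len(marginals)
    pair_term = sum(entropy(pairwise[(i, j)])
                    for i, j in combinations(range(m), 2))
    single_term = sum(entropy(p) for p in marginals)
    return pair_term - (m - 2) * single_term
```

With m instances, the first term runs over m(m-1)/2 pairs, which is what makes this second-order form tractable compared to the full joint of equation (3).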
We first calculate the probability of similarity based on the GIST descriptor. In this study, the GIST descriptor gathers oriented edge responses at multiple scales into very coarse spatial bins. The GIST descriptor used here is built from six oriented edge responses at five scales aggregated over a 4 x 4 spatial grid, which provides maximum effectiveness. The probability of similarity based on GIST, $p_{GIST}(x_i, x_j)$, is then defined from the Euclidean distance between the GIST descriptors $G_i$ and $G_j$ of learning instances $x_i$ and $x_j$. Assuming the dimension of the feature vector is n, the GIST-based similarity is defined in equation (7):

$$p_{GIST}(x_i, x_j) = \sqrt{\sum_{k=1}^{n} \big(G_i(k) - G_j(k)\big)^2} \quad (7)$$
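A small sketch of equation (7) follows. The GIST vectors are assumed precomputed (six orientations at five scales on a 4 x 4 grid gives 480-dimensional vectors), and the final exponential mapping from distance to a (0, 1] similarity score is our assumption rather than a definition from the paper:

```python
import numpy as np

def gist_similarity(g_i, g_j):
    """Equation (7): Euclidean distance between two GIST descriptors;
    a smaller distance means two more similar learning instances."""
    g_i, g_j = np.asarray(g_i, float), np.asarray(g_j, float)
    d = np.sqrt(np.sum((g_i - g_j) ** 2))
    # One common way (our assumption) to turn the distance into a
    # similarity-style probability in (0, 1]:
    return np.exp(-d)
```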
Next, we calculate the similarity based on SIFT descriptors between learning instances $x_i$ and $x_j$. Figure 7 illustrates the SIFT extraction process and the construction of histograms over visual words: for each image we compute a SIFT-based feature histogram using the bag-of-words model, and these histograms are then used to evaluate the similarity between images. Concretely, the feature vector consists of SIFT features computed on a regular grid across the image (dense SIFT). The features are quantized into visual words, the frequency of each visual word is recorded in a histogram for each tile of the spatial tiling, and the final feature vector for the image is the concatenation of these histograms. After this process, $p_{SIFT}$ is defined as the similarity between the histograms of two learning instances:

$$p_{SIFT}(I_p, I_s) = \sum_{i} \min\big(H_1(i), H_2(i)\big) \quad (8)$$

where $I_p$ and $I_s$ are different learning instances, $H_1$ and $H_2$ are their respective histograms, i indexes the histogram bins, and $H_1(i)$, $H_2(i)$ are the heights of each bin.
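The bag-of-words step and the histogram comparison can be sketched as follows (our illustration: the vocabulary is assumed to come from k-means over training SIFT descriptors, and the intersection form follows equation (8) as reconstructed above):

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local SIFT descriptors against a visual-word vocabulary
    (rows of `vocabulary`, e.g., k-means centers) and return a normalized
    word-frequency histogram, as in figure 7."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                 # nearest visual word per feature
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

def sift_similarity(h1, h2):
    """Equation (8): histogram intersection between two bag-of-words
    histograms (1.0 = identical, 0.0 = disjoint, for normalized inputs)."""
    return np.minimum(h1, h2).sum()
```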
Figure 7: Process of SIFT extraction and histogram construction based on visual words.
With the probabilities of similarity based on the GIST and SIFT descriptors in hand, the joint semantic probability can be evaluated. As we know, a learning instance is only valuable if its similarity to the existing learning instances is low; otherwise duplicate, useless or even mistaken learning instances would be incorporated, which clearly harms the effectiveness and efficiency of the learning model. So when we incorporate a new learning instance, we calculate the joint semantic probability between it and the other learning instances and choose the instance that maximizes the joint entropy. (Special notes: for a new positive learning instance, we calculate the joint semantic probability between it and the known positive learning instances; similarly, for a new negative instance, between it and the known negative learning instances.) We continue the process until the number of learning instances is sufficient. After the learning instances are selected with our model, they are used to train the retrieval model; the related experiments follow in the next section.
4. Experiments
4.1 Database

We selected 198,175 images in total from the Caltech 256 [53], LabelMe [54], Berkeley segmentation [55] and CIFAR-10 [56] databases, Google Images, and personal images; 112,762 images are used for training and 85,413 for testing. To make our results more convincing, we selected images from as many categories, and of as much complexity, as possible. The database contains 189 categories; many categories have more than 200 instances, and only 41 categories have fewer than 100 instances. The experiments ran at 195 ms per image on an Nvidia Tesla M40 GPU (plus 15 ms of CPU time for resizing the outputs to the original resolution). For SIFT descriptor extraction, 1000 visual words were used and the final histogram dimension is 4000. The GIST descriptor is built from six oriented edge responses at five scales aggregated over a 4 x 4 spatial grid.

First, we used the precision-recall curve to evaluate our model, with a CNN based method [57], a visual feature based method [58], a retrieval practical based method [59] and a spatial pyramid matching based method [60] as baselines. For these classic methods, we supplied enough learning instances to preserve their accuracy: we kept adding learning instances until the accuracy no longer changed appreciably. For our model, we supplied a limited number of learning instances (50 positive and 100 negative). Figure 8 shows that there is no significant difference between our model and the baselines; table 1 lists AP values for specific categories, and the AP and AUC values in table 2 summarize the results more directly.

Next, we tested the models given the same number of learning instances, with the optimized learning based model [61], the semantic-distance based model [42], the visual feature based method, the retrieval practical based method and the spatial pyramid matching based method as baselines; each model, including ours, was trained on the same limited set (50 positive and 100 negative learning instances). The precision-recall curves in figure 9 show that our method is clearly better than the baselines. Table 3 lists AP values for specific categories, where our model performs clearly better for the majority of categories, and the AP and AUC values in table 4 confirm that our model has a special advantage when learning instances are scarce. Then, fixing AP at 0.8, we counted the learning instances each method needed; table 5 shows clearly that our model uses the fewest learning instances at the same accuracy.

More importantly, to make the final results more convincing, we ran further groups of experiments to show the superiority of the joint entropy model and the optimization step through comparison, with all models again trained on the same limited set (50 positive and 100 negative learning instances). As shown in table 6, the joint entropy model and the optimization step improve retrieval effectiveness significantly. Finally, four groups of retrieval results in figure 10 illustrate our method's superiority.

From the experimental results above, we can see that for the majority of categories the evaluation criteria show no obvious disadvantage for our method, and for the average criteria (precision-recall curve, AP value, AUC value) our method improves the experimental results significantly. Even when the AP value is fixed across methods, the number of learning instances used by our method is the smallest among all baselines.

We must recognize that more and more high-performance models have achieved realistic retrieval results in recent years; deep learning based models in particular improve accuracy significantly. However, our experiments show that nearly all related models come at the cost of a large number of learning instances and computing resources: if the number of learning instances is reduced even slightly, their accuracy drops significantly. In contrast, there is no significant difference between our model and the baselines under the accuracy criteria (precision-recall curve, AP value, AUC value), while our model reduces the number of learning instances significantly. The essential value of our model is that it reduces the number of learning instances without an obvious loss of accuracy; in other words, it achieves realistic retrieval accuracy with just a few learning instances.
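For reference, the average criteria reported in tables 2 and 4 can be computed from per-image relevance labels and retrieval scores. The snippet below (scikit-learn, with made-up scores; the paper does not state whether its AUC refers to the ROC or the precision-recall curve, so ROC is shown as one common choice) illustrates the computation:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, roc_auc_score,
                             precision_recall_curve)

# Toy example (made-up values): y_true marks relevant images, y_score is the
# retrieval model's confidence for each test image.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.1])

ap  = average_precision_score(y_true, y_score)   # area under the P-R curve
auc = roc_auc_score(y_true, y_score)             # area under the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(f"AP = {ap:.3f}, AUC = {auc:.3f}")
```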
Figure 8: Precision-recall curves using different methods. Pink: our method. Blue: baseline 1. Red: baseline 2. Black: baseline 3. Green: baseline 4. Baseline 1: CNN based method [57]. Baseline 2: visual feature based method [58]. Baseline 3: retrieval practical based method [59]. Baseline 4: spatial pyramid matching based method [60].
Table 1: AP values of specific categories for different methods. Baseline 1: CNN based method [57]. Baseline 2: visual feature based method [58]. Baseline 3: retrieval practical based method [59]. Baseline 4: spatial pyramid matching based method [60].

            Computer  tennis  badminton  bike   cellphone  lion   building  flower  tree
Our method  0.833     0.809   0.842      0.792  0.844      0.799  0.840     0.774   0.839
Baseline 1  0.904     0.922   0.917      0.874  0.928      0.865  0.918     0.857   0.906
Baseline 2  0.881     0.892   0.892      0.854  0.900      0.841  0.877     0.832   0.892
Baseline 3  0.862     0.863   0.850      0.821  0.875      0.829  0.877     0.805   0.885
Baseline 4  0.841     0.856   0.850      0.802  0.870      0.803  0.849     0.795   0.839

            fish   people  desk   apple  duck   panda  penguin  shoes  T-shirt
Our method  0.808  0.847   0.803  0.806  0.849  0.821  0.837    0.806  0.819
Baseline 1  0.913  0.942   0.890  0.904  0.924  0.897  0.929    0.898  0.901
Baseline 2  0.884  0.894   0.874  0.884  0.901  0.869  0.893    0.871  0.872
Baseline 3  0.880  0.873   0.851  0.873  0.875  0.841  0.867    0.864  0.855
Baseline 4  0.822  0.856   0.809  0.834  0.857  0.832  0.848    0.806  0.843
Table 2: AP and AUC values for different methods. Baseline 1: CNN based method [57]. Baseline 2: visual feature based method [58]. Baseline 3: retrieval practical based method [59]. Baseline 4: spatial pyramid matching based method [60].

       Our method  Baseline 1  Baseline 2  Baseline 3  Baseline 4
AP     0.839       0.906       0.881       0.864       0.847
AUC    0.829       0.901       0.872       0.859       0.839
Figure 9: Precision-recall curves using different methods. Blue: our method. Red: baseline 1. Black: baseline 2. Green: baseline 3. Pink: baseline 4. Yellow: baseline 5. Baseline 1: optimized learning based model [61]. Baseline 2: semantic-distance based model [42]. Baseline 3: visual feature based method [58]. Baseline 4: retrieval practical based method [59]. Baseline 5: spatial pyramid matching based method [60].
Table 3: AP values of specific categories for different methods. Baseline 1: optimized learning based model [61]. Baseline 2: semantic-distance based model [42]. Baseline 3: visual feature based method [58]. Baseline 4: retrieval practical based method [59]. Baseline 5: spatial pyramid matching based method [60].

            Computer  tennis  badminton  bike   cellphone  lion   building  flower  tree
Our method  0.802     0.809   0.822      0.818  0.791      0.811  0.822     0.782   0.811
Baseline 1  0.783     0.771   0.790      0.802  0.779      0.790  0.793     0.766   0.776
Baseline 2  0.783     0.742   0.764      0.757  0.747      0.777  0.769     0.732   0.722
Baseline 3  0.669     0.719   0.650      0.702  0.656      0.691  0.677     0.669   0.645
Baseline 4  0.621     0.607   0.617      0.644  0.603      0.643  0.591     0.600   0.639
Baseline 5  0.589     0.590   0.557      0.601  0.544      0.622  0.543     0.582   0.617

            fish   people  desk   apple  duck   panda  penguin  shoes  T-shirt
Our method  0.819  0.822   0.779  0.810  0.799  0.800  0.781    0.810  0.784
Baseline 1  0.788  0.791   0.751  0.766  0.773  0.771  0.746    0.773  0.734
Baseline 2  0.764  0.755   0.744  0.729  0.751  0.744  0.709    0.732  0.708
Baseline 3  0.659  0.659   0.692  0.644  0.707  0.679  0.662    0.667  0.667
Baseline 4  0.608  0.612   0.629  0.620  0.662  0.629  0.613    0.601  0.601
Baseline 5  0.593  0.612   0.584  0.588  0.579  0.593  0.549    0.533  0.549
Table 4: AP and AUC values for different methods. Baseline 1: optimized learning based model [61]. Baseline 2: semantic-distance based model [42]. Baseline 3: visual feature based method [58]. Baseline 4: retrieval practical based method [59]. Baseline 5: spatial pyramid matching based method [60].

           Our method  Baseline 1  Baseline 2  Baseline 3  Baseline 4  Baseline 5
AP value   0.811       0.784       0.766       0.665       0.613       0.590
AUC value  0.804       0.779       0.745       0.639       0.608       0.574
Table 5: Number of learning instances (positive/negative) used by different methods with AP fixed at 0.8. Baseline 1: optimized learning based model [61]. Baseline 2: semantic-distance based model [42]. Baseline 3: visual feature based method [58]. Baseline 4: retrieval practical based method [59]. Baseline 5: spatial pyramid matching based method [60].

            Computer  tennis  badminton  bike    cellphone  lion    building  flower  tree
Our method  35/72     43/84   56/89      26/51   41/71      37/66   39/74     41/66   47/81
Baseline 1  39/86     52/91   63/94      34/63   53/81      41/75   45/87     55/73   54/89
Baseline 2  54/89     66/102  82/117     45/81   62/97      55/84   54/99     64/89   62/97
Baseline 3  77/137    84/139  106/145    51/89   74/102     71/113  71/135    82/147  71/111
Baseline 4  84/141    97/169  126/169    62/95   89/144     82/141  86/147    93/169  82/129
Baseline 5  112/219   99/211  141/235    88/131  129/256    98/206  92/186    93/188  95/149

            fish     people  desk     apple   duck     panda   penguin  shoes   T-shirt
Our method  35/67    28/47   42/67    24/54   37/67    37/51   57/89    44/104  34/66
Baseline 1  42/77    31/53   55/82    30/60   44/89    44/62   71/105   54/129  48/78
Baseline 2  55/89    39/64   64/97    41/72   57/104   54/80   82/134   62/149  54/91
Baseline 3  71/119   50/77   77/119   55/88   66/119   69/85   99/168   88/189  58/97
Baseline 4  86/141   72/98   89/137   64/91   89/141   79/111  119/191  97/241  64/112
Baseline 5  102/189  89/141  116/229  84/129  106/189  91/149  133/214  97/287  77/149
Table 6: AP and AUC values for different variants of our model. Baseline 1: the model without the joint entropy model to pre-select and optimize the learning instances. Baseline 2: the model that only pre-selects the learning instances without optimizing them.

            AP value  AUC value
Baseline 1  0.622     0.617
Baseline 2  0.712     0.696
Our model   0.839     0.829
Figure 10: Four groups of retrieval results using our model.
5. Conclusion

In this paper, we proposed a joint entropy based learning model for high-quality retrieval. In the process, an improved watershed segmentation model is used to pre-process the learning instances and, more importantly, the distribution of learning instances is optimized by our joint entropy model. The experimental results show our method's superiority: the number of learning instances declines compared to traditional models while, notably, the accuracy of our model is still preserved. However, our model still has a few drawbacks. For instance, if an invalid learning instance is incorporated, its error accumulates through the subsequent calculations. Moreover, the joint entropy only approximates the distribution of learning instances, and some details are ignored. We therefore need further research to make the results more realistic and convincing; in particular, some further work has been done in a paper [62] that is under review.
Acknowledgement

This research is sponsored by the Fundamental Research Funds for the Central Universities (No. 2016NT14), the National Natural Science Foundation of China (No. 61601033, No. 61571049, No. 61401029) and the Beijing Advanced Innovation Center for Future Education (BJAICFE2016IR-004). We particularly appreciate XiaoYu and Junqi Guo for their contributions to data collection and optimization.
References

[1] Smeulders A W M, Worring M, Santini S, et al. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(12): 1349-1380.
[2] Berners-Lee T, Hendler J, Lassila O. The semantic web. Scientific American, 2001, 284(5): 28-37.
[3] Suykens J A K, Vandewalle J. Least squares support vector machine classifiers. Neural Processing Letters, 1999, 9(3): 293-300.
[4] MacArthur S D, Brodley C E, Shyu C R. Relevance feedback decision trees in content-based image retrieval. In: IEEE Workshop on Content-based Access of Image and Video Libraries, 2000: 68-72.
[5] Bernardo J M, Smith A F M. Bayesian theory. 2001.
[6] Yang J, Yu K, Gong Y, et al. Linear spatial pyramid matching using sparse coding for image classification. In: CVPR, 2009: 1794-1801.
[7] Wu H, Li Y, Miao Z, et al. Creative and high-quality image composition based on a new criterion. Journal of Visual Communication and Image Representation, 2016, 38: 100-114.
[8] Wu H, Miao Z, Wang Y, et al. Image completion with multi-image based on entropy reduction. Neurocomputing, 2015, 159: 157-171.
[9] Guo Y, Liu Y, Oerlemans A, et al. Deep learning for visual understanding: a review. Neurocomputing, 2016, 187: 27-48.
[10] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012: 1097-1105.
[11] Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. In: ECCV, 2014: 818-833.
[12] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916.
[13] Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets. Neural Computation, 2006, 18(7): 1527-1554.
[14] Salakhutdinov R, Hinton G E. Deep Boltzmann machines. In: AISTATS, 2009.
[15] Ngiam J, Chen Z, Koh P W, et al. Learning deep energy models. In: ICML, 2011: 1105-1112.
[16] Poultney C, Chopra S, Cun Y L. Efficient learning of sparse representations with an energy-based model. In: Advances in Neural Information Processing Systems, 2006: 1137-1144.
[17] Vincent P, Larochelle H, Bengio Y, et al. Extracting and composing robust features with denoising autoencoders. In: ICML, 2008: 1096-1103.
[18] Rifai S, Vincent P, Muller X, et al. Contractive auto-encoders: explicit invariance during feature extraction. In: ICML, 2011: 833-840.
[19] Konda K, Memisevic R, Krueger D. Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint, 2014.
[20] Zhou X, Yu K, Zhang T, et al. Image classification using super-vector coding of local image descriptors. In: ECCV, 2010: 141-154.
[21] Gao S, Tsang I W H, Chia L T, et al. Local features are not lonely: Laplacian sparse coding for image classification. In: CVPR, 2010: 3555-3561.
[22] Berens J, Finlayson G D, Qiu G. Image indexing using compressed colour histograms. IEE Proceedings - Vision, Image and Signal Processing, 2000, 147(4): 349-355.
[23] Van Ginneken B, Koenderink J J, Dana K J. Texture histograms as a function of irradiation and viewing direction. International Journal of Computer Vision, 1999, 31(2-3): 169-184.
[24] Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91-110.
[25] van de Sande K, Gevers T, Snoek C. Evaluation of color descriptors for object and scene recognition. In: CVPR, 2008: 1-8.
[26] Bosch A, Zisserman A, Munoz X. Representing shape with a spatial pyramid kernel. In: ACM International Conference on Image and Video Retrieval, 2007: 672-679.
[27] Hays J, Efros A A. Scene completion using millions of photographs. ACM Transactions on Graphics, 2007, 26(3).
[28] Zheng Y T, et al. Toward a higher-level visual representation for object-based image retrieval. The Visual Computer, 2009, 25(1): 13-23.
[29] Farhadi A, et al. Describing objects by their attributes. In: CVPR, 2009: 1778-1785.
[30] Mollahosseini A, Chan D, Mahoor M H. Going deeper in facial expression recognition using deep neural networks. In: WACV, 2016: 1-10.
[31] Wu Q, Diao W, Dou F, et al. Shape-based object extraction in high-resolution remote-sensing images using deep Boltzmann machine. International Journal of Remote Sensing, 2016, 37(24): 6012-6022.
[32] Alain G, Bengio Y. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 2014, 15(1): 3563-3593.
[33] Zhang T, Ghanem B, Liu S, et al. Low-rank sparse coding for image classification. In: ICCV, 2013: 281-288.
[34] Hearst M A, Dumais S T, Osman E, et al. Support vector machines. IEEE Intelligent Systems and their Applications, 1998, 13(4): 18-28.
[35] Vapnik V. Principles of risk minimization for learning theory. In: NIPS, 1991: 831-838.
[36] Keogh E, Mueen A. Curse of dimensionality. In: Encyclopedia of Machine Learning. Springer, 2011: 257-258.
[37] Min J H, Lee Y C. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications, 2005, 28(4): 603-614.
[38] Sahami M, Heilman T D. A web-based kernel function for measuring the similarity of short text snippets. In: WWW, 2006: 377-386.
[39] Keerthi S S, Lin C J. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 2003, 15(7): 1667-1689.
[40] Venu N, Anuradha B. Integration of hyperbolic tangent and Gaussian kernels for fuzzy c-means algorithm with spatial information for MRI segmentation. In: Fifth International Conference on Advanced Computing (ICoAC), 2013: 280-285.
[41] Wang G, Forsyth D, Hoiem D. Comparative object similarity for improved recognition with few or no examples. In: CVPR, 2010: 3525-3532.
[42] Wu H, Miao Z, Wang Y, et al. Optimized recognition with few instances based on semantic distance. The Visual Computer, 2015, 31(4): 367-375.
[43] Wu H, Miao Z, Chen J, et al. Recognition improvement through the optimisation of learning instances. IET Computer Vision, 2015, 9(3): 419-427.
[44] Tang J, Rahmim A. Bayesian PET image reconstruction incorporating anato-functional joint entropy. Physics in Medicine and Biology, 2009, 54(23): 7063.
[45] Liu S. On the relationship between densities of Shannon entropy and Fisher information for atoms and molecules. The Journal of Chemical Physics, 2007, 126(19): 191107.
[46] Siddiquie B, Gupta A. Beyond active noun tagging: modeling contextual interactions for multi-class active learning. In: CVPR, 2010: 2979-2986.
[47] Subr K, Soler C, Durand F. Edge-preserving multiscale image decomposition based on local extrema. ACM Transactions on Graphics, 2009, 28(5): 147.
[48] Wu H, Li Y, Miao Z, et al. A new sampling algorithm for high-quality image matting. Journal of Visual Communication and Image Representation, 2016, 38: 573-581.
[49] Shafarenko L, Petrou M, Kittler J. Automatic watershed segmentation of randomly textured color images. IEEE Transactions on Image Processing, 1997, 6(11): 1530-1544.
[50] Yedidia J S, Freeman W T, Weiss Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 2005, 51(7): 2282-2312.
[51] Zhang Y, Chen T. Efficient kernels for identifying unbounded-order spatial features. In: CVPR, 2009: 1762-1769.
[52] Zhou X, Yu K, Zhang T, et al. Image classification using super-vector coding of local image descriptors. In: ECCV, 2010: 141-154.
[53] Griffin G, Holub A, Perona P. Caltech-256 object category dataset. Technical report, 2007.
[54] Russell B C, Torralba A, Murphy K P, et al. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 2008, 77(1-3): 157-173.
[55] Arbelaez P, Fowlkes C, Martin D. The Berkeley segmentation dataset and benchmark. http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds, 2007.
[56] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Technical report, 2009.
[57] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012: 1097-1105.
[58] Yu J, et al. Learning to rank using user clicks and visual features for image retrieval. IEEE Transactions on Cybernetics, 2015, 45(4): 767-779.
[59] Vedaldi A, Zisserman A. Image classification practical. http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htm, 2011.
[60] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR, 2006.
[61] Li Y, Bie R, Zhang C, et al. Optimized learning instance-based image retrieval. Multimedia Tools and Applications, 2016: 1-18.
[62] Wu H, et al. Weighted-learning-instance-based retrieval model using instance distance. Machine Vision and Applications. (Under review)