Joint entropy based learning model for image retrieval


Accepted Manuscript

Hao Wu, Yueli Li, Xiaohan Bi, Linna Zhang, Rongfang Bie, Yingzhuo Wang

PII: S1047-3203(18)30146-9
DOI: https://doi.org/10.1016/j.jvcir.2018.06.021
Reference: YJVCI 2225
To appear in: J. Vis. Commun. Image R.
Received Date: 8 December 2017
Revised Date: 20 April 2018
Accepted Date: 24 June 2018

Please cite this article as: H. Wu, Y. Li, X. Bi, L. Zhang, R. Bie, Y. Wang, Joint entropy based learning model for image retrieval, J. Vis. Commun. Image R. (2018), doi: https://doi.org/10.1016/j.jvcir.2018.06.021

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Hao Wu1, Yueli Li2, Xiaohan Bi3, Linna Zhang4, Rongfang Bie1, Yingzhuo Wang5

1 College of Information Science and Technology, Beijing Normal University
2 College of Information Science and Technology, Hebei Agricultural University
3 School of Computer and Information Technology, Beijing Jiaotong University
4 College of Mechanical Engineering, Guizhou University
5 Wendeng Technician College of WeiHai

Abstract

Image retrieval, a classic computer vision technique, retrieves target images from collections of hundreds of thousands of images. With the rapid development of deep learning, retrieval quality has improved markedly. However, high-quality retrieval normally depends on a large number of learning instances. Collecting them requires considerable human effort for selection and considerable computing resources for training. More importantly, for some special categories it is difficult to obtain a large number of learning instances at all. To address this problem, we propose a joint entropy based learning model that reduces the number of learning instances by optimizing their distribution. First, the learning instances are pre-selected using an improved watershed segmentation method. Then, a joint entropy model is used to reduce the likelihood that duplicate, useless, or even mistaken instances are included. Finally, a database containing a large number of images is built, and extensive experiments on it show that our model reduces the number of learning instances while preserving retrieval accuracy.

Keywords: joint entropy, learning instance, image retrieval, watershed segmentation, precision-recall curve, AP value, AUC value

1. Introduction

Image retrieval is a computer technique for browsing, searching and retrieving images from a large database of digital images. Originally, images were retrieved by matching keywords attached to them. However, manual image annotation is time-consuming, expensive, and sometimes wrong. To address these problems, more and more researchers have turned to content-based image retrieval (CBIR)[1]. At the outset, CBIR aimed to avoid label descriptions by relying on content-based similarities (color, shape, texture, etc.). As CBIR developed, the semantic concept[2] was incorporated, and classic features combined with SVM[3] were used for semantic-based retrieval. Besides SVM models, decision trees[4], Bayesian models[5] and other models have also been used for retrieval. These models also support image classification[6], semantic-based image composition[7] and semantic-based image completion[8].

In recent years, deep learning models[9] have shown overwhelming advantages over traditional models. CNN-based methods[10-12], RBM-based methods[13-15], autoencoder-based methods[16-18] and sparse coding-based methods[19-21] are classic models in the field of computer vision. A Convolutional Neural Network (CNN) is trained as a stack of multiple layers, usually convolutional layers, pooling layers, and fully connected layers. The Restricted Boltzmann Machine (RBM) was introduced by Smolensky in 1986 under the name Harmonium and later popularized by Hinton; it is a generative stochastic neural network that usually comprises visible units and hidden units. An autoencoder is a special type of artificial neural network used for learning efficient encodings. A sparse coding-based model describes an image by learning an adequate set of basis functions. These four classic families of deep learning methods have brought revolutionary progress to computer vision, and many improved models based on them have also been applied.

However, the high-quality retrieval results of traditional models largely depend on sufficient learning instances. If the number of learning instances is reduced, retrieval accuracy declines markedly. For deep learning models in particular, it is difficult to learn a complete model from insufficient learning instances, let alone obtain high-quality results. In most cases, collecting a large number of learning instances substantially increases the computing and human resources required. For some special categories, even with our best efforts, the number of learning instances that can be collected still does not meet the requirement.

As discussed above, how to reduce the number of learning instances while preserving retrieval accuracy has become a meaningful problem. In this paper, we propose a joint entropy based learning model that optimizes the distribution of learning instances. During optimization, duplicate, useless, and even mistaken instances are deleted, so that a small number of high-quality learning instances can maintain high retrieval accuracy.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 describes our model in detail. Section 4 presents the experimental process and results. Section 5 gives concluding remarks.

2. Related work

Many works are related to our joint entropy based learning model. Among them, feature descriptor selection, the SVM model, learning instance reduction models and entropy-based models are essential: they have achieved remarkable results and directly support the model in this paper.

2.1 Feature descriptor

Originally, RGB colour histograms[22] and texture histograms[23] were often used to evaluate the similarity between images. With the development of semantic content analysis, several well-known feature descriptors were proposed. SIFT [24], rgSIFT [25], PHOG [26], and GIST [27] are classic feature descriptors that extract semantic content effectively. Later, some optimized feature descriptors[28,29] were also used. In computer vision, image classification, image retrieval and image annotation often rely on them for semantic content extraction.

Deep learning models (e.g., CNN-based models[11,30], RBM-based models[13,31], autoencoder-based models[18,32], sparse coding-based models[20,33]) can learn a retrieval model without using the feature descriptors above, and in most cases they represent images more effectively. In CNN-based models in particular, multiple layers of nonlinear processing units are used for feature extraction and transformation, with convolutional, pooling, and fully connected layers playing different roles. RBM-based, autoencoder-based and sparse coding-based models are used less frequently than CNN-based models, but they too can learn high-level semantic features from data by using hierarchical architectures. However, nearly all high-quality deep learning models are trained on at least tens of thousands of learning instances.

2.2 SVM model

In machine learning, support vector machines (SVMs)[34] are classic supervised learning models that train a classification model from learning instances. Risk minimization[35] is the essential part: the optimized model is trained to have a low expected risk. However, during model training it is difficult to overcome the curse of dimensionality[36]. Kernel functions[37,38] address this problem by reducing the computational complexity. Classic kernel functions such as the Gaussian kernel[39] and the hyperbolic tangent (sigmoid) kernel[40] are widely used. In most cases, the learning instances are labeled by hand. Once the classification model has been trained, the target images can be retrieved with it. Compared with deep learning models, the number of learning instances required is markedly smaller.

2.3 Learning instance reduced model

Many models have been proposed for reducing the number of learning instances. In general, there are two kinds. On the one hand, researchers take full advantage of all existing instances, including those that do not belong to the category. The main insight is that, given a few learning instances from one category, similar learning instances from other categories can serve as candidate learning instances. Paper[41] proposed a model that reduces the number of learning instances by using similar or dissimilar instances, and a semantic distance model[42] is used to evaluate the candidate instances. On the other hand, learning instance optimization models have been used to reduce the number of learning instances. Paper[43] proposed one such model, which optimizes the learning instances effectively. In general, these approaches reduce the number of learning instances to a certain extent, either by incorporating candidate learning instances or by optimizing the existing ones. However, the number of learning instances they require is still a burden on human and computing resources.

2.4 Entropy-based model

In information theory, joint entropy is a measure of the uncertainty associated with a set of variables. In computer vision, entropy measures such as joint entropy[44] and Shannon entropy[45] have been widely used. Improved entropy models are increasingly used for image annotation, image classification, image composition and image completion. For instance, paper[46] used a joint entropy model to label unknown regions; there, joint entropy is used to construct the semantic structure of an image. To our knowledge, no existing model applies joint entropy to learning instance optimization, and our model uses it to select optimized learning instances.

Although related work has studied this topic, in some cases in depth and with remarkable results, the large number of learning instances remains a burden on human and computing resources. Moreover, there is no established theory that describes and summarizes the distribution of learning instances. To solve these problems, building on previous work, this paper proposes a joint entropy based learning model that improves image retrieval quality by optimizing the distribution of learning instances.

3. Algorithm

As discussed above, how to optimize the distribution of learning instances is a challenging and meaningful problem. To solve it, we mainly use joint entropy to optimize the learning instances. To avoid low-quality learning instances selected by hand, we first pre-select them with an improved watershed segmentation method. Then, on the basis of the pre-selection, joint entropy is used to optimize the distribution of learning instances.

As shown in figure 1, a set of selected learning instances (about 20) serves as reference learning instances, which are optimized by improved watershed segmentation. For each new input learning instance, after watershed segmentation, the joint entropy between it and the optimized learning instances is calculated. If the joint entropy is higher than the expected value, the instance is considered valid; otherwise, it is considered invalid. This process is repeated until enough valid learning instances are obtained. To make the algorithm and experiments more convincing, the joint entropy of the reference learning instances is calculated in advance, and the joint entropy of the final valid learning instances is also calculated. If the latter falls outside the expected range, the valid learning instances need to be re-obtained.
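The selection loop described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `select_instances`, the callables `passes_preselection` and `joint_entropy`, and the parameters `expected_entropy` and `target_count` are all hypothetical placeholders for the segmentation check, the entropy computation and the thresholds that the paper describes only in prose.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def select_instances(
    reference: Sequence[T],
    candidates: Iterable[T],
    passes_preselection: Callable[[T], bool],
    joint_entropy: Callable[[Sequence[T]], float],
    expected_entropy: float,
    target_count: int,
) -> List[T]:
    """Grow the reference set with candidates that keep the joint entropy high."""
    selected = list(reference)
    for candidate in candidates:
        if len(selected) >= target_count:
            break
        # Pre-selection step: skip images rejected by the segmentation check.
        if not passes_preselection(candidate):
            continue
        # Accept the candidate only if the joint entropy of the enlarged set
        # is higher than the expected value; otherwise treat it as invalid.
        if joint_entropy(selected + [candidate]) > expected_entropy:
            selected.append(candidate)
    return selected


# Toy usage: instances are numbers and the "entropy" is the fraction of
# distinct values, so duplicates lower the score and are rejected.
toy_entropy = lambda items: len(set(items)) / len(items)
print(select_instances([1, 2], [2, 3, 3, 4, 5], lambda x: True, toy_entropy, 0.99, 4))
```

In the toy run the duplicates 2 and 3 are rejected and the loop stops once four instances are held; a real run would plug in the entropy approximation of section 3.2.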

Figure 1: The flow chart of selecting and optimizing the learning instances.

3.1 Pre-selection step

Although our model helps select optimized learning instances, invalid learning instances still waste computing resources. For instance, in figure 2(a), the target object and an interfering object appear in the same image, and the interfering object occupies a non-negligible part of it. In another case, in figure 2(b), the target object occupies only a negligible part of the image. Such images contribute little or nothing to the trained model and waste computing resources. So in this step, we first select valid learning instances.

Figure 2: Learning instances in different conditions. Image (a) contains more than one object, image (b) shows a target object that occupies only a small part of the whole image, and image (c) shows a realistic example.

To address this problem, we use a WLS (Weighted Least Squares) filter based decomposition model[47,48] to pre-process the image. The WLS filter is an edge-preserving smoothing filter: it keeps the filtered image as close as possible to the original image while smoothing within regions and keeping the edges sharp. The WLS filter minimizes the objective in equation (1):

$$\sum_{i}\left( (u_i - g_i)^2 + k\left( a_{x,i}(g)\left(\frac{\partial u}{\partial x}\right)_i^2 + a_{y,i}(g)\left(\frac{\partial u}{\partial y}\right)_i^2 \right) \right) \qquad (1)$$

where i indexes the pixels of the image, the term $(u_i - g_i)^2$ minimizes the difference between the original image g and the filtered image u, and the term $a_{x,i}(g)(\partial u/\partial x)_i^2 + a_{y,i}(g)(\partial u/\partial y)_i^2$ smooths the regions while preserving the edges. The smoothness weights $a_x$ and $a_y$ are determined by the image g, and k balances the two terms. Successively smoother versions are obtained by increasing k by the same factor each time: the larger k is, the smoother the new image. The degree of smoothness is set by varying k (0.3 to 4) and c (1.4 to 1.8). Figures 3 and 4 show two groups of edge-preserving decomposition based images with different degrees of smoothness. The edge-preserving decomposition based image is then used for segmentation, which helps evaluate whether the target image is a realistic learning instance.

Figure 3: Edge-preserving decomposition based images with different degrees of smoothness. Image (d) is often considered the most suitable optimized image.

Figure 4: Edge-preserving decomposition based images with different degrees of smoothness. Image (d) is often considered the most suitable optimized image.

In this paper, we use watershed segmentation to process the filtered image. Since the watershed segmentation technique[49] has been widely used for many years, it is described only briefly here. Equation (2) states the underlying computation:

$$g(x, y) = \mathrm{grad}(f(x, y)) = \left\{ [f(x, y) - f(x-1, y)]^2 + [f(x, y) - f(x, y-1)]^2 \right\}^{0.5} \qquad (2)$$

In equation (2), grad is the gradient operator, f(x, y) is the original image and g(x, y) is the gradient image processed by watershed segmentation. After segmentation, we judge whether the target image can be used as a realistic learning instance. If the image contains more than one object, or the target object occupies only a small part of the whole image, the image is deleted. Figures 5 and 6 show two groups of segmentation results obtained with the improved watershed segmentation model. This pre-processing step deletes invalid learning instances, which directly reduces wasted computing resources.
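The gradient computation and the deletion rule can be sketched as below. This is a hedged illustration: `gradient_magnitude` follows the backward-difference form of equation (2), while `passes_preselection`, its label-map input (0 for background, one positive integer per segmented object) and the threshold `min_area_fraction` are hypothetical, since the paper does not give numeric criteria for "one small part of the whole image".

```python
import numpy as np


def gradient_magnitude(f):
    """Backward-difference gradient magnitude in the form of equation (2)."""
    f = np.asarray(f, dtype=float)
    dx = np.zeros_like(f)
    dy = np.zeros_like(f)
    # Assumed convention: x indexes columns and y indexes rows. The first
    # column and row have no backward neighbour and keep a difference of 0.
    dx[:, 1:] = f[:, 1:] - f[:, :-1]
    dy[1:, :] = f[1:, :] - f[:-1, :]
    return np.sqrt(dx ** 2 + dy ** 2)


def passes_preselection(labels, min_area_fraction=0.2):
    """Keep an image only if it holds exactly one sufficiently large object."""
    labels = np.asarray(labels)
    object_ids = np.unique(labels[labels > 0])  # label 0 is background
    if len(object_ids) != 1:
        return False  # no object, or more than one object
    area_fraction = np.count_nonzero(labels == object_ids[0]) / labels.size
    return bool(area_fraction >= min_area_fraction)


# Toy usage: one object covering half of a 4 x 4 label map is accepted.
one_object = np.zeros((4, 4), dtype=int)
one_object[:, :2] = 1
print(passes_preselection(one_object))
```

In practice the label map would come from running a watershed transform on the gradient image; that step is left out here to keep the sketch independent of any particular library.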

Figure 5: Segmentation result using the improved watershed segmentation model. Image (a) is the edge-preserving decomposition based image and image (b) is the segmentation result based on image (a).

Figure 6: Segmentation result using the improved watershed segmentation model. Image (a) is the edge-preserving decomposition based image and image (b) is the segmentation result based on image (a).

3.2 Construction of joint entropy based model

After the pre-selection step, the learning instances need further optimization. In this step, we use a joint entropy based model to evaluate how valid the distribution of learning instances is. The joint entropy of the learning instances can be regarded as the combined entropy of all learning instances, and it is calculated from the following probabilities:

$$H(I_1, I_2, \ldots, I_m) = -\sum P(I_1, I_2, \ldots, I_m) \log P(I_1, I_2, \ldots, I_m) \qquad (3)$$

where $P(I_1, I_2, \ldots, I_m)$ is the probability that the learning instances exist together, taking the contextual semantic relationships between them into consideration, and $I_1, I_2, \ldots, I_m$ are the different learning instances. Most of the learning instances are known with high confidence; only the newly incorporated learning instance has many possibilities.

Calculating the joint entropy exactly is a heavy computational burden, so it needs to be approximated. The first order entropy is one effective approximation: it is the sum of the entropies of the learning instances considered individually:

$$H_1 = -\sum_{p=1}^{m} \sum P(I_p) \log P(I_p) \qquad (4)$$

where $H_1$ is the first order entropy and $P(I_p)$ is the probability that learning instance $I_p$ exists. However, this approximation ignores the contextual semantic relationships between learning instances, which can affect the entropy considerably. To address this, a second order approximation of the joint entropy, called the Bethe entropy approximation[50], can be calculated. It is defined as follows:

$$H \approx -\sum_{p<s} \sum P(I_p, I_s) \log P(I_p, I_s) + (m-2) \sum_{p=1}^{m} \sum P(I_p) \log P(I_p) \qquad (5)$$

where m is the number of learning instances and $P(I_p, I_s)$ is the joint probability of learning instances $I_p$ and $I_s$, computed from their appearance features as if only $I_p$ and $I_s$ were present. Equation (5) takes the contextual semantic relationships between learning instances fully into consideration. When a new learning instance is incorporated, the last term is stable because the majority of learning instances already exist. The first term, however, keeps changing, because during incorporation the joint entropy changes with the contextual semantic relationships between learning instances. How to calculate these relationships therefore becomes a new challenging problem.

Previous research [51, 52] and our experiments indicate that the contextual semantic relationship between learning instances largely depends on their similarity under different feature descriptors. We therefore use the SIFT descriptor and the GIST descriptor. If two learning instances are similar under the SIFT descriptor, they are likely to be semantically similar; likewise, if they are similar under the GIST descriptor, they are likely to be semantically similar. So in this paper, the joint semantic probability is defined as follows:

(6)

The first term is the probability of similarity based on the GIST descriptor and the second term is the probability of similarity based on the SIFT descriptor. We next calculate each of these two probabilities in turn.
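The difference between the first order entropy and the Bethe (second order) approximation can be checked numerically with a small sketch. It uses the standard Bethe form on a fully connected graph, that is, the sum of pairwise entropies minus (m - 2) times the sum of individual entropies. The function names and the representation of each distribution as a flat list of probabilities are assumptions made for illustration, not the authors' code.

```python
import itertools
import math


def entropy(table):
    """Shannon entropy (natural log) of a flat list of probabilities."""
    return -sum(p * math.log(p) for p in table if p > 0)


def first_order_entropy(marginals):
    """Sum of the entropies of the variables considered individually."""
    return sum(entropy(p) for p in marginals)


def bethe_entropy(marginals, pairwise):
    """Bethe approximation of the joint entropy on a fully connected graph.

    pairwise[(p, s)] holds the flattened joint table of variables p and s
    for every pair with p < s.
    """
    m = len(marginals)
    pair_term = sum(
        entropy(pairwise[(p, s)])
        for p, s in itertools.combinations(range(m), 2)
    )
    return pair_term - (m - 2) * first_order_entropy(marginals)


# Two perfectly correlated binary variables: the pairwise term captures the
# redundancy that the first order approximation misses.
marginals = [[0.5, 0.5], [0.5, 0.5]]
pairwise = {(0, 1): [0.5, 0.0, 0.0, 0.5]}
print(first_order_entropy(marginals), bethe_entropy(marginals, pairwise))
```

For independent variables the two quantities coincide, while for correlated (redundant) instances the Bethe value is lower, which is the behaviour the selection step relies on when it prefers instances that keep the entropy high.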

In this study, the GIST descriptor gathers oriented edge responses at multiple scales into very coarse spatial bins. The GIST descriptor used in our study is built from six oriented edge responses at five scales gathered onto a 4 x 4 spatial resolution, thereby providing maximum effectiveness. The GIST-based distance between learning instances $I_p$ and $I_s$ is then defined as the Euclidean distance between their GIST descriptors. Assuming that the dimension of the learning instance feature is m, the GIST-based similarity is defined in equation (7):

(7)
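As an illustration of a distance-based similarity of this kind, the sketch below computes the Euclidean distance between two descriptor vectors and maps it to a value in (0, 1]. The mapping `exp(-distance / scale)` and the parameter `scale` are assumptions chosen only so that identical descriptors score 1 and distant ones approach 0; the paper's own equation (7) is not recoverable from the text and may differ. For reference, six orientations at five scales on a 4 x 4 grid would give 6 x 5 x 16 = 480 values per image channel.

```python
import math


def gist_distance(gist_p, gist_s):
    """Euclidean distance between two descriptors of equal dimension."""
    if len(gist_p) != len(gist_s):
        raise ValueError("descriptors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gist_p, gist_s)))


def gist_similarity(gist_p, gist_s, scale=1.0):
    """Map the distance to (0, 1]; the exponential form is an assumption."""
    return math.exp(-gist_distance(gist_p, gist_s) / scale)


# Toy usage with 2-dimensional stand-ins for real GIST descriptors.
print(gist_distance([0.0, 0.0], [3.0, 4.0]))
print(gist_similarity([0.2, 0.4], [0.2, 0.4]))
```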

Next, we calculate the similarity based on SIFT descriptors between learning instances $I_p$ and $I_s$. Figure 7 shows the SIFT extraction process and the construction of histograms based on visual words. For each image, a SIFT-based feature histogram is calculated using the bag of words model, and the histograms are then used to evaluate the similarity between images. Concretely, the feature vector consists of SIFT features computed on a regular grid across the image ("dense image"). The features are then quantized into visual words. The frequency of each visual word is recorded in a histogram for each tile of the spatial tiling, and the final feature vector for the image is the concatenation of these histograms. The SIFT-based probability of similarity is then defined as the similarity between the histograms of the two learning instances:

(8)

where $I_p$ and $I_s$ are different learning instances, H1(i) and H2(j) are their two histograms respectively, i is a bin of the histogram, H1(i) and H2(j) are the heights of each bin, and the left-hand side is the probability of similarity based on SIFT descriptors.
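The histogram construction and comparison can be sketched as follows. The sketch assumes that the SIFT features have already been quantized, so each feature is represented only by a visual word index and the index of the spatial tile it falls in. The comparison uses histogram intersection on normalized histograms, which is one common choice and an assumption here; the measure in equation (8) is not recoverable from the text. All function names are hypothetical.

```python
def bow_histogram(word_ids, tile_ids, n_words, n_tiles):
    """Concatenate one visual-word histogram per spatial tile."""
    hist = [0] * (n_words * n_tiles)
    for word, tile in zip(word_ids, tile_ids):
        # Tile t owns the slice [t * n_words, (t + 1) * n_words).
        hist[tile * n_words + word] += 1
    return hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two histograms after L1 normalization."""
    total1, total2 = sum(h1), sum(h2)
    if total1 == 0 or total2 == 0:
        return 0.0  # an empty histogram shares nothing with any other
    return sum(min(a / total1, b / total2) for a, b in zip(h1, h2))


# Toy usage: 3 visual words and 2 tiles give a 6-bin feature vector.
h_a = bow_histogram([0, 0, 2], [0, 1, 1], n_words=3, n_tiles=2)
h_b = bow_histogram([1, 1, 2], [0, 0, 1], n_words=3, n_tiles=2)
print(h_a, h_b, histogram_intersection(h_a, h_b))
```

With 1000 visual words and 4 tiles, the same layout would produce a 4000-dimensional vector.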

Figure 7: Process of SIFT extraction and histogram construction based on visual words.

After the process above, the probability of similarity based on the GIST descriptor and the probability of similarity based on the SIFT descriptor have both been obtained using our model.

The value of each learning instance increases only if the similarity between learning instances is reduced. Otherwise, duplicate, useless, or even mistaken learning instances are incorporated, which clearly harms the effectiveness and efficiency of the learning model. So before incorporating a new learning instance, we calculate the joint semantic probability between it and the other learning instances, and choose the instance that maximizes the joint entropy. (Special note: for a new positive learning instance, the joint semantic probability is calculated between it and the other known positive learning instances. Similarly, for a new negative learning instance, it is calculated between it and the other known negative learning instances.) The process continues until the number of learning instances is sufficient. The selected learning instances are then used to train the model; the related experiments are presented in the next section.
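The special note about keeping positive and negative instances separate can be illustrated with a simple greedy sketch. It is a proxy and not the entropy maximization itself: at each step it adds the candidate whose highest similarity to the already selected instances of the same label is lowest. The function `grow_by_label`, the `similarity` callable and `target_per_label` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def grow_by_label(
    pools: Dict[str, List[T]],
    selected: Dict[str, List[T]],
    similarity: Callable[[T, T], float],
    target_per_label: int,
) -> Dict[str, List[T]]:
    """Greedily add the least redundant candidate within each label."""
    result = {label: list(items) for label, items in selected.items()}
    for label, pool in pools.items():
        remaining = list(pool)
        chosen = result.setdefault(label, [])
        while remaining and len(chosen) < target_per_label:
            # Compare each candidate only with instances of the same label.
            best = min(
                remaining,
                key=lambda c: max((similarity(c, s) for s in chosen), default=0.0),
            )
            chosen.append(best)
            remaining.remove(best)
    return result


# Toy usage: instances are numbers and similarity decays with distance.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
pools = {"positive": [0.1, 5.0, 0.2], "negative": [10.1, 20.0]}
selected = {"positive": [0.0], "negative": [10.0]}
print(grow_by_label(pools, selected, sim, target_per_label=2))
```

In the toy run the near-duplicates 0.1, 0.2 and 10.1 are passed over in favour of 5.0 and 20.0.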

4. Experiments

4.1 Database

In this paper, we selected 198175 images in total from the Caltech 256[53], LabelMe[54], Berkeley segmentation[55] and CIFAR-10[56] databases, Google Images and personal images. 112762 images are used as training images and 85413 images are used as test images. To make our results more convincing, we selected images from as many categories as possible and with as much complexity as possible. The database contains 189 categories. Many categories have more than 200 instances, and only 41 categories have fewer than 100 instances. The experiments ran at 195 ms per image on an Nvidia Tesla M40 GPU (plus 15 ms of CPU time for resizing the outputs to the original resolution). For SIFT descriptor extraction, 1000 visual words were used and the final dimension of the histograms is 4000. The GIST descriptor used in our study is built from six oriented edge responses at five scales gathered onto a 4 x 4 spatial resolution.

First, we used the precision-recall curve to evaluate the performance of our model. A CNN based method[57], a visual feature based method[58], a retrieval practical based method[59] and a spatial pyramid matching based method[60] are used as baselines. For these classic methods, we supplied enough learning instances to maintain their accuracy: to make sure the number was sufficient, we kept adding learning instances until the accuracy no longer changed noticeably. For our model, in contrast, we supplied a limited number of learning instances (50 positive learning instances and 100 negative learning instances). Figure 8 shows that our model remains close to the baselines despite using far fewer learning instances. Table 1 lists AP values of specific categories for the different methods, and table 2 summarizes the AP and AUC values.

Next, we compared the models using the same number of learning instances. An optimized learning based model[61], a semantic-distance based model[42], the visual feature based method, the retrieval practical based method and the spatial pyramid matching based method are used as baselines. In this step, our model and the baseline models are each trained on the same limited number of images (50 positive learning instances and 100 negative learning instances). The precision-recall curves in figure 9 show that our method is clearly better than the baselines. Table 3 lists AP values of specific categories for the different methods; our model performs clearly better for the majority of categories. The AP and AUC values in table 4 show that our model has a particular advantage when learning instances are scarce. Then, with AP fixed at 0.8, we counted the number of learning instances each method needs. The results in table 5 show that our model uses the fewest learning instances at the same accuracy.

More importantly, to make the final results more convincing, we ran several groups of experiments that show the contributions of the joint entropy model and the optimization model through comparison. The baseline models and our model are each trained on a limited number of images (50 positive learning instances and 100 negative learning instances). As shown in table 6, the joint entropy model and the optimization model both improve retrieval effectiveness significantly. Finally, 4 groups of retrieval results in figure 10 illustrate the advantage of our method.

The experimental results above show that, for the majority of categories, our method has clear advantages on the evaluation criteria. On the average evaluation criteria (precision-recall curve, AP value, AUC value), our method improves the experimental results significantly. Even when the AP value is set equal across methods, our method uses the fewest learning instances among the compared methods.

We must recognize that more and more high-performance models have achieved realistic retrieval results in recent years. Deep learning based models in particular have improved accuracy significantly. However, our experiments show that nearly all of these models come at the cost of a large number of learning instances and substantial computing resources: if the number of learning instances is reduced even slightly, their accuracy drops significantly. In contrast, our model shows no significant difference from the baselines on several accuracy criteria (precision-recall curve, AP value, AUC value) while using significantly fewer learning instances. The essential value of our model is that it reduces the number of learning instances without an obvious loss of accuracy. In other words, our model achieves realistic retrieval accuracy using only a small number of learning instances.
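For readers who want to reproduce the evaluation criteria, the sketch below computes precision-recall points, the AP value and an area-under-curve value from a ranked list of relevance flags (1 for a correct retrieval, 0 otherwise). It follows the standard definitions; taking the AUC value to be the trapezoidal area under the precision-recall curve is an assumption, since the text does not say which curve the AUC refers to. The function names are hypothetical.

```python
def precision_recall(relevance):
    """Precision and recall after each position of a ranked result list."""
    total = sum(relevance)
    if total == 0:
        raise ValueError("the ranked list contains no relevant item")
    precision, recall, hits = [], [], 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        precision.append(hits / rank)
        recall.append(hits / total)
    return precision, recall


def average_precision(relevance):
    """Mean of the precision values at the ranks of the relevant items."""
    precision, _ = precision_recall(relevance)
    return sum(p for p, rel in zip(precision, relevance) if rel) / sum(relevance)


def pr_auc(relevance):
    """Trapezoidal area under the precision-recall curve (assumed AUC)."""
    precision, recall = precision_recall(relevance)
    area, prev_r, prev_p = 0.0, 0.0, precision[0]
    for p, r in zip(precision, recall):
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area


# Toy usage: relevant items retrieved at ranks 1 and 3 out of 4 results.
ranked = [1, 0, 1, 0]
print(average_precision(ranked), pr_auc(ranked))
```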

Figure 8: Precision-recall curve using different methods. Pink: Our method. Blue: Baseline 1. Red: Baseline 2. Black: Baseline 3. Green: Baseline 4. Baseline 1: CNN based method[57]. Baseline 2: visual feature based method[58]. Baseline 3: retrieval practical based method[59]. Baseline 4: spatial pyramid matching based method[60].

Table 1: AP values of specific categories between different methods. Baseline 1: CNN based method[57]. Baseline 2: visual feature based method[58]. Baseline 3: retrieval practical based method[59]. Baseline 4: spatial pyramid matching based method[60].

| Method | Computer | tennis | badminton | bike | cellphone | lion | building | flower | tree |
|---|---|---|---|---|---|---|---|---|---|
| Our method | 0.833 | 0.809 | 0.842 | 0.792 | 0.844 | 0.799 | 0.840 | 0.774 | 0.839 |
| Baseline1 | 0.904 | 0.922 | 0.917 | 0.874 | 0.928 | 0.865 | 0.918 | 0.857 | 0.906 |
| Baseline2 | 0.881 | 0.892 | 0.892 | 0.854 | 0.900 | 0.841 | 0.877 | 0.832 | 0.892 |
| Baseline3 | 0.862 | 0.863 | 0.850 | 0.821 | 0.875 | 0.829 | 0.877 | 0.805 | 0.885 |
| Baseline4 | 0.841 | 0.856 | 0.850 | 0.802 | 0.870 | 0.803 | 0.849 | 0.795 | 0.839 |

| Method | fish | people | desk | apple | duck | panda | penguin | shoes | T-shirt |
|---|---|---|---|---|---|---|---|---|---|
| Our method | 0.808 | 0.847 | 0.803 | 0.806 | 0.849 | 0.821 | 0.837 | 0.806 | 0.819 |
| Baseline1 | 0.913 | 0.942 | 0.890 | 0.904 | 0.924 | 0.897 | 0.929 | 0.898 | 0.901 |
| Baseline2 | 0.884 | 0.894 | 0.874 | 0.884 | 0.901 | 0.869 | 0.893 | 0.871 | 0.872 |
| Baseline3 | 0.880 | 0.873 | 0.851 | 0.873 | 0.875 | 0.841 | 0.867 | 0.864 | 0.855 |
| Baseline4 | 0.822 | 0.856 | 0.809 | 0.834 | 0.857 | 0.832 | 0.848 | 0.806 | 0.843 |

Table 2: AP and AUC values calculated using different methods. Baseline 1: CNN based method[57]. Baseline 2: visual feature based method[58]. Baseline 3: retrieval practical based method[59]. Baseline 4: spatial pyramid matching based method[60].

| | Our method | Baseline1 | Baseline2 | Baseline3 | Baseline4 |
|---|---|---|---|---|---|
| AP | 0.839 | 0.906 | 0.881 | 0.864 | 0.847 |
| AUC | 0.829 | 0.901 | 0.872 | 0.859 | 0.839 |

Figure 9: Precision-recall curve using different methods. Blue: Our method. Red: Baseline 1. Black: Baseline 2. Green: Baseline 3. Pink: Baseline 4. Yellow: Baseline 5. Baseline 1: optimized learning based model[61]. Baseline 2: semantic-distance based model[42]. Baseline 3: visual feature based method[58]. Baseline 4: retrieval practical based method[59]. Baseline 5: spatial pyramid matching based method[60].

Table 3: AP values of specific categories between different methods. Baseline 1: optimized learning based model[61]. Baseline 2: semantic-distance based model[42]. Baseline 3: visual feature based method[58]. Baseline 4: retrieval practical based method[59]. Baseline 5: spatial pyramid matching based method[60].

| Method | Computer | tennis | badminton | bike | cellphone | lion | building | flower | tree |
|---|---|---|---|---|---|---|---|---|---|
| Our method | 0.802 | 0.809 | 0.822 | 0.818 | 0.791 | 0.811 | 0.822 | 0.782 | 0.811 |
| Baseline1 | 0.783 | 0.771 | 0.790 | 0.802 | 0.779 | 0.790 | 0.793 | 0.766 | 0.776 |
| Baseline2 | 0.783 | 0.742 | 0.764 | 0.757 | 0.747 | 0.777 | 0.769 | 0.732 | 0.722 |
| Baseline3 | 0.669 | 0.719 | 0.650 | 0.702 | 0.656 | 0.691 | 0.677 | 0.669 | 0.645 |
| Baseline4 | 0.621 | 0.607 | 0.617 | 0.644 | 0.603 | 0.643 | 0.591 | 0.600 | 0.639 |
| Baseline5 | 0.589 | 0.590 | 0.557 | 0.601 | 0.544 | 0.622 | 0.543 | 0.582 | 0.617 |

| Method | fish | people | desk | apple | duck | panda | penguin | shoes | T-shirt |
|---|---|---|---|---|---|---|---|---|---|
| Our method | 0.819 | 0.822 | 0.779 | 0.810 | 0.799 | 0.800 | 0.781 | 0.810 | 0.784 |
| Baseline1 | 0.788 | 0.791 | 0.751 | 0.766 | 0.773 | 0.771 | 0.746 | 0.773 | 0.734 |
| Baseline2 | 0.764 | 0.755 | 0.744 | 0.729 | 0.751 | 0.744 | 0.709 | 0.732 | 0.708 |
| Baseline3 | 0.659 | 0.659 | 0.692 | 0.644 | 0.707 | 0.679 | 0.662 | 0.667 | 0.667 |
| Baseline4 | 0.608 | 0.612 | 0.629 | 0.620 | 0.662 | 0.629 | 0.613 | 0.601 | 0.601 |
| Baseline5 | 0.593 | 0.612 | 0.584 | 0.588 | 0.579 | 0.593 | 0.549 | 0.533 | 0.549 |

Table 4: AP values calculated using different methods. Baseline 1: optimized learning based model[61]. Baseline 2: semantic-distance based model[42]. Baseline 3: visual feature based method[58]. Baseline 4: retrieval practical based method[59]. Baseline 5: spatial pyramid matching based method[60]

           Our method  Baseline1  Baseline2  Baseline3  Baseline4  Baseline5
AP value   0.811       0.784      0.766      0.665      0.613      0.59
AUC value  0.804       0.779      0.745      0.639      0.608      0.574

Table 5: Number of learning instances used in different methods. Baseline 1: optimized learning based model[61]. Baseline 2: semantic-distance based model[42]. Baseline 3: visual feature based method[58]. Baseline 4: retrieval practical based method[59]. Baseline 5: spatial pyramid matching based method[60]

Entries are given as positive/negative (instance number).

            Computer  tennis  badminton  bike    cellphone  lion    building  flower  tree
Our method  35/72     43/84   56/89      26/51   41/71      37/66   39/74     41/66   47/81
Baseline1   39/86     52/91   63/94      34/63   53/81      41/75   45/87     55/73   54/89
Baseline2   54/89     66/102  82/117     45/81   62/97      55/84   54/99     64/89   62/97
Baseline3   77/137    84/139  106/145    51/89   74/102     71/113  71/135    82/147  71/111
Baseline4   84/141    97/169  126/169    62/95   89/144     82/141  86/147    93/169  82/129
Baseline5   112/219   99/211  141/235    88/131  129/256    98/206  92/186    93/188  95/149

            fish     people  desk     apple   duck     panda   penguin  shoes   T-shirt
Our method  35/67    28/47   42/67    24/54   37/67    37/51   57/89    44/104  34/66
Baseline1   42/77    31/53   55/82    30/60   44/89    44/62   71/105   54/129  48/78
Baseline2   55/89    39/64   64/97    41/72   57/104   54/80   82/134   62/149  54/91
Baseline3   71/119   50/77   77/119   55/88   66/119   69/85   99/168   88/189  58/97
Baseline4   86/141   72/98   89/137   64/91   89/141   79/111  119/191  97/241  64/112
Baseline5   102/189  89/141  116/229  84/129  106/189  91/149  133/214  97/287  77/149
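To make the comparison in Table 5 concrete, the following sketch sums the positive and negative counts for the 'Computer' category and reports how many fewer learning instances the proposed method uses relative to each baseline. The entries are copied from the table; the helper names `total_instances` and `relative_saving` are hypothetical and only serve this illustration.

```python
def total_instances(entry: str) -> int:
    """Sum the positive and negative counts of a 'positive/negative' entry."""
    positive, negative = (int(part) for part in entry.split("/"))
    return positive + negative


def relative_saving(ours: str, other: str) -> float:
    """Fraction of instances saved by `ours` relative to `other`."""
    return 1 - total_instances(ours) / total_instances(other)


# 'Computer' column of Table 5 (positive/negative instance numbers).
our_method = "35/72"
baselines = {
    "Baseline1": "39/86",
    "Baseline2": "54/89",
    "Baseline3": "77/137",
    "Baseline4": "84/141",
    "Baseline5": "112/219",
}

print(f"Our method: {total_instances(our_method)} instances")
for name, entry in baselines.items():
    saving = relative_saving(our_method, entry)
    print(f"{name}: {total_instances(entry)} instances, "
          f"our method uses {saving:.1%} fewer")
```

The same helpers can be applied to any other column of the table by swapping in the corresponding entries.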

Table 6: AP value and AUC value using different models. Baseline 1: our model without the pre-selection and joint entropy optimization of the learning instances. Baseline 2: our model with pre-selection of the learning instances but without optimization.

            AP value  AUC value
Baseline 1  0.622     0.617
Baseline 2  0.712     0.696
Our model   0.839     0.829

Figure 10: Four groups of retrieval results using our model.

5. Conclusion

In this paper, we proposed a joint entropy based learning model for high-quality retrieval. An improved watershed segmentation model is used to pre-process the learning instances. More importantly, the distribution of learning instances is optimized by our joint entropy model. The experimental results show that our method requires fewer learning instances than several traditional models while still preserving the accuracy of retrieval. However, our model still has a few drawbacks. For instance, if one invalid learning instance is incorporated, the error accumulates throughout the calculation. Moreover, the joint entropy only approximates the distribution of learning instances, and some details are ignored. Further research on the model is therefore needed to make the results more realistic and convincing. In particular, some further work has been done in a paper[] that is under review.
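As a reminder of the quantity the model is named after, the joint Shannon entropy of two discrete variables is H(X, Y) = -Σ p(x, y) log2 p(x, y). The sketch below estimates it from paired observations. It illustrates only this generic definition, not the authors' optimization procedure; the function name and the assumption that instance attributes have already been discretized into labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def joint_entropy(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical joint Shannon entropy H(X, Y), in bits, of paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if not xs:
        return 0.0
    n = len(xs)
    # Count how often each (x, y) pair occurs to estimate p(x, y).
    counts = Counter(zip(xs, ys))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical discretized attributes of four instances:
# all four (x, y) pairs are distinct and therefore equally likely.
print(joint_entropy([0, 0, 1, 1], [0, 1, 0, 1]))
```

A set of instances in which many pairs repeat yields a lower value than a set in which the pairs are spread evenly, which is the intuition behind using entropy to reason about how instances are distributed.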

Acknowledgement

This research is sponsored by the Fundamental Research Funds for the Central Universities (No.2016NT14), the National Natural Science Foundation of China (No.61601033, No.61571049, No.61401029) and the Beijing Advanced Innovation Center for Future Education (BJAICFE2016IR-004). We particularly thank XiaoYu and Junqi Guo for their contributions to data collection and optimization.

References

[1] Smeulders A W M, Worring M, Santini S, et al. Content-based image retrieval at the end of the early years[J]. IEEE Transactions on pattern analysis and machine intelligence, 2000, 22(12): 1349-1380.
[2] Berners-Lee T, Hendler J, Lassila O. The semantic web[J]. Scientific american, 2001, 284(5): 28-37.
[3] Suykens J A K, Vandewalle J. Least squares support vector machine classifiers[J]. Neural processing letters, 1999, 9(3): 293-300.
[4] MacArthur S D, Brodley C E, Shyu C R. Relevance feedback decision trees in content-based image retrieval[C]//Content-based Access of Image and Video Libraries, 2000. Proceedings. IEEE Workshop on. IEEE, 2000: 68-72.
[5] Bernardo J M, Smith A F M. Bayesian theory[J]. 2001.
[6] Yang J, Yu K, Gong Y, et al. Linear spatial pyramid matching using sparse coding for image classification[C]//Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009: 1794-1801.
[7] Wu H, Li Y, Miao Z, et al. Creative and high-quality image composition based on a new criterion[J]. Journal of Visual Communication and Image Representation, 2016, 38: 100-114.
[8] Wu H, Miao Z, Wang Y, et al. Image completion with multi-image based on entropy reduction[J]. Neurocomputing, 2015, 159: 157-171.
[9] Guo Y, Liu Y, Oerlemans A, et al. Deep learning for visual understanding: A review[J]. Neurocomputing, 2016, 187: 27-48.
[10] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.
[11] Zeiler M D, Fergus R. Visualizing and understanding convolutional networks[C]//European Conference on Computer Vision. Springer International Publishing, 2014: 818-833.
[12] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(9): 1904-1916.
[13] Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets[J]. Neural computation, 2006, 18(7): 1527-1554.
[14] Salakhutdinov R, Hinton G E. Deep Boltzmann Machines[C]//AISTATS. 2009, 1: 3.
[15] Ngiam J, Chen Z, Koh P W, et al. Learning deep energy models[C]//Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011: 1105-1112.
[16] Poultney C, Chopra S, Cun Y L. Efficient learning of sparse representations with an energy-based model[C]//Advances in neural information processing systems. 2006: 1137-1144.
[17] Vincent P, Larochelle H, Bengio Y, et al. Extracting and composing robust features with denoising autoencoders[C]//Proceedings of the 25th international conference on Machine learning. ACM, 2008: 1096-1103.
[18] Rifai S, Vincent P, Muller X, et al. Contractive auto-encoders: Explicit invariance during feature extraction[C]//Proceedings of the 28th international conference on machine learning (ICML-11). 2011: 833-840.
[19] Memisevic R, CA U, Krueger D. Zero-bias autoencoders and the benefits of co-adapting features[J]. stat, 2014, 1050: 13.
[20] Zhou X, Yu K, Zhang T, et al. Image classification using super-vector coding of local image descriptors[C]//European conference on computer vision. Springer Berlin Heidelberg, 2010: 141-154.
[21] Gao S, Tsang I W H, Chia L T, et al. Local features are not lonely–Laplacian sparse coding for image classification[C]//Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 3555-3561.
[22] Berens J, Finlayson G D, Qiu G. Image indexing using compressed colour histograms[J]. IEE Proceedings-Vision, Image and Signal Processing, 2000, 147(4): 349-355.
[23] Van Ginneken B, Koenderink J J, Dana K J. Texture histograms as a function of irradiation and viewing direction[J]. International Journal of Computer Vision, 1999, 31(2-3): 169-184.
[24] Lowe D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[25] van de Sande K, Gevers T, Snoek C. Evaluation of color descriptors for object and scene recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE, 2008: 1-8.
[26] Bosch A, Zisserman A, Munoz X. Representing shape with a spatial pyramid kernel[C]//ACM international conference on Image and video retrieval. New York, NY: ACM, 2007: 672-679.
[27] Hays J, Efros A A. Scene completion using millions of photographs[J]. ACM Transactions on Graphics, 2007, 26(3).
[28] Zheng Y T, et al. Toward a higher-level visual representation for object-based image retrieval[J]. The Visual Computer, 2009, 25(1): 13-23.
[29] Farhadi A, et al. Describing objects by their attributes[C]//IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL: IEEE, 2009: 1778-1785.
[30] Mollahosseini A, Chan D, Mahoor M H. Going deeper in facial expression recognition using deep neural networks[C]//Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016: 1-10.
[31] Wu Q, Diao W, Dou F, et al. Shape-based object extraction in high-resolution remote-sensing images using deep Boltzmann machine[J]. International journal of remote sensing, 2016, 37(24): 6012-6022.
[32] Alain G, Bengio Y. What regularized auto-encoders learn from the data-generating distribution[J]. The Journal of Machine Learning Research, 2014, 15(1): 3563-3593.
[33] Zhang T, Ghanem B, Liu S, et al. Low-rank sparse coding for image classification[C]//Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013: 281-288.
[34] Hearst M A, Dumais S T, Osman E, et al. Support vector machines[J]. IEEE Intelligent Systems and their Applications, 1998, 13(4): 18-28.
[35] Vapnik V. Principles of risk minimization for learning theory[C]//NIPS. 1991: 831-838.
[36] Keogh E, Mueen A. Curse of dimensionality[M]//Encyclopedia of Machine Learning. Springer US, 2011: 257-258.
[37] Min J H, Lee Y C. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters[J]. Expert systems with applications, 2005, 28(4): 603-614.
[38] Sahami M, Heilman T D. A web-based kernel function for measuring the similarity of short text snippets[C]//Proceedings of the 15th international conference on World Wide Web. ACM, 2006: 377-386.
[39] Keerthi S S, Lin C J. Asymptotic behaviors of support vector machines with Gaussian kernel[J]. Neural computation, 2003, 15(7): 1667-1689.
[40] Venu N, Anuradha B. Integration of hyperbolic tangent and Gaussian kernels for Fuzzy C-Means algorithm with spatial information for MRI segmentation[C]//2013 Fifth International Conference on Advanced Computing (ICoAC). IEEE, 2013: 280-285.
[41] Wang G, Forsyth D, Hoiem D. Comparative object similarity for improved recognition with few or no examples[C]//Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 3525-3532.
[42] Wu H, Miao Z, Wang Y, et al. Optimized recognition with few instances based on semantic distance[J]. The Visual Computer, 2015, 31(4): 367-375.
[43] Wu H, Miao Z, Chen J, et al. Recognition improvement through the optimisation of learning instances[J]. IET Computer Vision, 2015, 9(3): 419-427.
[44] Tang J, Rahmim A. Bayesian PET image reconstruction incorporating anato-functional joint entropy[J]. Physics in medicine and biology, 2009, 54(23): 7063.
[45] Liu S. On the relationship between densities of Shannon entropy and Fisher information for atoms and molecules[J]. The Journal of chemical physics, 2007, 126(19): 191107.
[46] Siddiquie B, Gupta A. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning[C]//Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 2979-2986.
[47] Subr K, Soler C, Durand F. Edge-preserving multiscale image decomposition based on local extrema[J]. ACM Transactions on Graphics (TOG), 2009, 28(5): 147.
[48] Wu H, Li Y, Miao Z, et al. A new sampling algorithm for high-quality image matting[J]. Journal of Visual Communication and Image Representation, 2016, 38: 573-581.
[49] Shafarenko L, Petrou M, Kittler J. Automatic watershed segmentation of randomly textured color images[J]. IEEE transactions on Image Processing, 1997, 6(11): 1530-1544.
[50] Yedidia J S, Freeman W T, Weiss Y. Constructing free-energy approximations and generalized belief propagation algorithms[J]. IEEE Transactions on Information Theory, 2005, 51(7): 2282-2312.
[51] Zhang Y, Chen T. Efficient kernels for identifying unbounded-order spatial features[C]//Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009: 1762-1769.
[52] Zhou X, Yu K, Zhang T, et al. Image classification using super-vector coding of local image descriptors[C]//European conference on computer vision. Springer Berlin Heidelberg, 2010: 141-154.
[53] Griffin G, Holub A, Perona P. Caltech-256 object category dataset[J]. 2007.
[54] Russell B C, Torralba A, Murphy K P, et al. LabelMe: a database and web-based tool for image annotation[J]. International journal of computer vision, 2008, 77(1-3): 157-173.
[55] Arbelaez P, Fowlkes C, Martin D. The berkeley segmentation dataset and benchmark[J]. see http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds, 2007.
[56] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images[J]. 2009.
[57] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.
[58] Yu J, et al. Learning to rank using user clicks and visual features for image retrieval[J]. IEEE transactions on cybernetics, 2015, 45(4): 767-779.
[59] Vedaldi A, Zisserman A. Image Classification Practical. http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htm, 2011.
[60] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories[C]//Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.
[61] Li Y, Bie R, Zhang C, et al. Optimized learning instance-based image retrieval[J]. Multimedia Tools and Applications, 2016: 1-18.
[62] Wu H, et al. Weighted-learning-instance-based retrieval model using instance distance[J]. Machine Vision and Applications. (Under review)