A Hierarchal BoW for Image Retrieval by Enhancing Feature Salience1

Fan Jiang1, Hai-Miao Hu1,2,†, Jin Zheng1,2, Bo Li1,2
1. Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
2. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
† Corresponding Author: Hai-Miao Hu ([email protected])

Abstract: Retrieving images with multiple features is an active research topic for boosting the performance of existing content-based image retrieval methods. The promising bag-of-words (BoW) models involve multiple features by applying feature fusion strategies in the early stage of image indexing. However, because features take different data forms, simply joining them does not guarantee high retrieval performance. Moreover, a fused feature is not flexible enough to adapt to the variety of images. To avoid the submergence of feature salience, this letter proposes a hierarchal BoW that represents each feature in an individual codebook, so that each feature yields undisturbed ranks. Furthermore, to enhance feature salience, a query model based on ordinary least squares (OLS) regression is established for rank aggregation. The query model weighs each feature according to its retrieval performance and then selects the target images. The experimental results demonstrate that the proposed method improves accuracy over the state of the art while maintaining stability.

Index Terms— Image retrieval; Hierarchal BoW; Feature salience enhancement; Query model


1 This work was partially supported by the National Natural Science Foundation of China (No. 61370121), the National Hi-Tech Research and Development Program (863 Program) of China (No. 2014AA015102), and Outstanding Tutors for doctoral dissertations of S&T project in Beijing (No. 20131000602).

1. Introduction

Content-based image retrieval (CBIR) is an emerging technology that provides effective access to retrieval targets among ever-increasing image data. In the past decade, numerous research works [1-7] have focused on CBIR in various applications. Brand-new features [8-11] and fused features [1, 12-20] have been proposed to exploit more information from different aspects. Recently, researchers have also attempted CBIR tasks with deep learning approaches [21, 22]. As image data grow, simple features (e.g., colors, textures and shapes) are not sufficient to represent the complexity of images. Thus, in practical image retrieval applications, multiple features are usually employed to represent the complex characteristics of an image.

In CBIR, the bag-of-words (BoW) model is a promising image representation technique for image retrieval and recognition, due to its simplicity and good performance. The conventional BoW models [21, 23-25] quantize and index images with feature vectors. To generate the codebook, clustering techniques are applied to subsample the feature vectors; the centroids of the clusters then serve as the vocabularies of the codebook that describe the indexed images. However, the conventional BoW models handle multiple features via a feature fusion process in the early stage [16, 20], which maps multiple features into a single feature space before codebook generation. Note that much information is lost during image quantization, which is a known disadvantage of the conventional BoW [26]. A simple joint of features may result in a submergence of salient features [16]. Therefore, it is difficult for a single feature space to accurately describe the semantic variety of images. Based on our observation from extensive experiments, there are two major reasons for this problem.


On one hand, the data forms of different features may differ significantly in dimension, value range and density (as shown in Fig. 1(a)). The salience of the features can be submerged when different features are clustered together, so the retrieval method cannot take full advantage of the information they contain. In particular, the significance of sparse features may be ignored when they are joined with dense features. An example is shown in Fig. 1. The local binary patterns (LBP) [27] and the pyramid histogram of oriented gradients (PHOG) [28] are extracted from an image of the CAVIAR4REID database [23]. These two kinds of features are widely used in image retrieval, yet they differ significantly in data form: the LBP feature is a sparse vector while the PHOG feature is a dense vector. The retrieval performance of the two features is displayed in Fig. 1(b). The experiment demonstrates that simply joining the two features performs almost the same as PHOG alone; the salience of LBP is submerged.

Fig. 1 The Evaluation of the Simple Joint of LBP and PHOG.
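The submergence effect can be made concrete with a small synthetic sketch (random histograms standing in for LBP and PHOG, not the paper's actual descriptors): when two vectors are simply concatenated, the squared Euclidean distance decomposes into per-feature terms, so whichever feature produces distances of larger magnitude dominates the induced ranking.

```python
# A synthetic sketch (not the real LBP/PHOG descriptors) of why simple
# concatenation lets one feature dominate: for joined vectors,
# d_joint^2 = d_sparse^2 + d_dense^2, so the feature whose distances have
# the larger magnitude decides the ranking of the joint feature.
import numpy as np

rng = np.random.default_rng(0)

def l1_normalize(v):
    return v / v.sum()

def sparse_hist(n_bins=256, n_active=20):
    # mass concentrated in a few bins, LBP-like in spirit
    v = np.zeros(n_bins)
    v[rng.choice(n_bins, size=n_active, replace=False)] = rng.random(n_active)
    return l1_normalize(v)

def dense_hist(n_bins=680):
    # mass spread over all bins, PHOG-like in spirit
    return l1_normalize(rng.random(n_bins))

a = {"sparse": sparse_hist(), "dense": dense_hist()}
b = {"sparse": sparse_hist(), "dense": dense_hist()}

d_sparse = np.linalg.norm(a["sparse"] - b["sparse"])
d_dense = np.linalg.norm(a["dense"] - b["dense"])
d_joint = np.linalg.norm(np.concatenate([a["sparse"], a["dense"]])
                         - np.concatenate([b["sparse"], b["dense"]]))

# d_joint**2 == d_sparse**2 + d_dense**2 (up to floating point), so the
# larger-scale component dominates the ranking of the joint feature.
print(d_sparse, d_dense, d_joint)
```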


Fig. 2 Ineffective Features Perform as Noise.

On the other hand, different features differ in how effectively they describe the same category of images, and different categories of images may be suited to only a subset of features. Feature fusion [12, 14, 17, 29] is used to select different features, and it requires prior knowledge to combine the basic features well. However, it is difficult to obtain adequate prior knowledge for selecting relevant features, and simply joining features fails to guarantee performance. In particular, when effective and ineffective features are simply joined, the significance of the effective features is diluted, resulting in poor performance. An example is given in Fig. 2. CEDD [12], a compact descriptor for image retrieval, and LBP are joined together. The experiment demonstrates that LBP strongly affects the salience of CEDD: the performance of the joint feature is higher than neither LBP nor CEDD.

To solve the above problems, many researchers working on BoW focus on rank aggregation, which employs criteria to select the target images from a set of candidates. Yeh et al. [30] and Bosch et al. [31] apply voting based methods to resolve the irregular partitions produced by a single feature. Jégou et al. [32] introduce a rank aggregation strategy that selects the median rank as the final rank of a candidate image. Although these methods achieve good performance, the final rank is computed without considering the characteristics of the individual features. The term frequency-inverse document frequency (TF-IDF) is used in a BoW with an inverted index to measure feature salience [33, 34]. Zheng et al. [34] improve performance by considering the frequency of co-occurring words; however, word co-occurrence cannot accurately describe the target distribution when ineffective features infiltrate. In short, feature salience is important for improving retrieval performance, and the above methods fail to consider it fully.

Therefore, the hierarchal BoW is proposed to avoid feature fusion in the early stage; it eliminates feature submergence by creating independent codebooks for the multiple features. In the hierarchal BoW, each feature provides ranks for the candidate images independently. A query model is then proposed to retrieve images in parallel across the multiple features and aggregate the ranks accordingly. In the query model, the salience of the features is measured by a series of weighting factors learned by ordinary least squares (OLS) regression according to the retrieval performance of each feature. According to the feature salience, the hierarchal BoW aggregates the retrieval ranks provided by each feature to generate a global score, and the target images are then selected from the candidates. The experimental results demonstrate that the hierarchal BoW is both effective and stable over the tested datasets. In particular, on CAVIAR4REID [23], the proposed hierarchal BoW improves image retrieval accuracy by 13.57% and 5.42% on average compared with the conventional BoW and the state of the art, respectively.


2. The Proposed Hierarchal BoW

The conventional BoW fuses multiple features into a single feature space (as shown in Fig. 3(a), one vocabulary in the codebook contains different features). Because of the variety of data forms, a single feature space suffers from the submergence of feature salience. Thus, the proposed hierarchal BoW (as shown in Fig. 3(b)) decomposes the multi-feature codebook of the conventional BoW models and generates an individual codebook for each feature, avoiding feature fusion in the early stage. Each feature in the hierarchal BoW then provides its own partition and ranks for its candidate images, and the hierarchal BoW selects the target images by aggregating the ranks from each feature (a minimal sketch of this per-feature indexing is given below). There are two main benefits to representing multiple features independently. Firstly, an ensemble of independent features can reduce the quantization effects near the partition boundaries produced by a single feature space; the hierarchal BoW can thus achieve a potential enhancement by eliminating the interactions among different features. Secondly, the method is flexible enough to adapt to the variety of images. For instance, when a new class of images is introduced, the method can involve a new feature without changing the previous constructions. Note that, in a BoW based model, the clustering procedure consumes most of the image indexing time; the hierarchal BoW therefore benefits from reusable components when the training data change.
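The following is a minimal sketch of the per-feature indexing idea, assuming each image is described by one global vector per feature type (e.g., CEDD, LBP, PHOG). The function names, the vocabulary size, and the Euclidean-distance scoring are illustrative assumptions, not the paper's exact implementation; K-Means is, however, the clustering method the paper's experiments use.

```python
# A minimal sketch, not the paper's implementation: one independent K-Means
# codebook per feature type, so no feature is fused with another before
# quantization, and each feature produces its own undisturbed ranking.
import numpy as np
from sklearn.cluster import KMeans

def build_codebooks(features_by_type, n_words=64):
    """features_by_type: dict feature_name -> (M, D_f) array, one row per image."""
    return {name: KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(X)
            for name, X in features_by_type.items()}

def rank_per_feature(codebooks, features_by_type, query_by_type):
    """Each feature independently proposes candidates (images falling in the
    query's visual word) and ranks them by distance to the query."""
    ranks = {}
    for name, km in codebooks.items():
        X = features_by_type[name]
        q = query_by_type[name].reshape(1, -1)
        words = km.predict(X)                  # word id of every database image
        d = np.linalg.norm(X - q, axis=1)      # distance to query, this feature
        d[words != km.predict(q)[0]] = np.inf  # keep the query's word only
        ranks[name] = np.argsort(d)            # best match first
    return ranks
```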



Fig. 3 The Construction of the Hierarchal BoW.

The hierarchal BoW avoids the feature fusion process in the early stage and provides a final partition through rank aggregation (the analysis of rank aggregation is given in the next section). Since the multiple features are represented independently, an individual codebook may produce an irregular partition due to the differing effectiveness of the features. With an ensemble of preparatory partitions, the hierarchal BoW has the potential to resolve the irregular partitions produced by the independent features in Fig. 3.

Table 1 Features Involved in the Experiment
No.  Feature
1    CEDD
2    LBP
3    Color Layout
4    PHOG
5    Color Histogram
6    FCTH
7    Gabor Texture


Fig. 4 Comparison between BoW and H-BoW (1.CEDD, 2.LBP, 3.Color Layout, 4.PHOG, 5. Color Histogram, 6.FCTH, 7. Gabor Texture).

To validate the effectiveness of the multiple codebooks in the hierarchal BoW, we carry out an experiment retrieving images from the CAVIAR4REID dataset with different combinations of features. As shown in Fig. 4, the multiple codebooks avoid the submergence of salient features. The features CEDD [12], Color Layout [35], Color Histogram, FCTH [14], Gabor Texture [36], LBP [27] and PHOG [28] are employed for testing single features; the numbers in Fig. 4 are shorthand for these features. An equal-weighted voting scheme, namely Borda count voting, is employed in the proposed hierarchal BoW (H-BoW) as a simple testing method to aggregate the ranks provided by each feature (a minimal sketch follows). The experimental results demonstrate that both the simple joint and our method improve the retrieval accuracy when combining features 1, 3 and 6 (namely CEDD, Color Layout and FCTH). However, when more features are involved, the simple joint of all features performs worse than the baseline (the best single feature, in this case Color Layout). Thanks to the multiple codebooks, our method still provides an improved retrieval performance.
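As an illustration, here is a minimal sketch of the equal-weighted Borda count used as the simple aggregation baseline above, consuming the per-feature rankings produced by the earlier sketch (the helper name is illustrative):

```python
# A minimal sketch of Borda count voting: each feature's ranking awards points
# by position, and candidates are reordered by their total score.
import numpy as np

def borda_aggregate(ranks, n_images):
    """ranks: dict feature_name -> array of image ids, best first."""
    scores = np.zeros(n_images)
    for order in ranks.values():
        points = np.empty(n_images)
        points[order] = np.arange(n_images, 0, -1)  # 1st place gets n points
        scores += points
    return np.argsort(-scores)  # highest total Borda score first
```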


3. The Rank Aggregation with OLS Regression

Rank aggregation techniques are employed to select target images from the candidates under specified criteria. The effectiveness of the rank aggregation therefore depends on whether the criteria can discriminate the true target images. With multiple features involved in the hierarchal BoW, noisy, irrelevant or redundant features will deteriorate the retrieval performance. Criteria based on equal weighting (e.g., Borda count voting) are unable to aggregate the ranks while making full use of the feature salience. Note that feature salience is important for improving retrieval performance, and it can be exploited in the rank aggregation to weight the features. An ideal rank aggregation method enhances the effect of an effective feature and reduces the interference of an ineffective one. However, most weighting techniques [37-41] estimate the weights via low-level correlations among the features or the overall performance of the fused features, without considering the statistical performance of each individual feature. Furthermore, in multiple-information querying, an ideal estimator should be both effective, in terms of high precision over the overall queries, and stable, in terms of low variance across individual queries [42, 43]. The bias plus variance decomposition [44] has proven to be a powerful tool for analyzing supervised learning scenarios that employ quadratic loss functions. Under this decomposition, the squared error between the target value y and its estimate is divided into three components, as shown in (1).

$$E[(y-\hat{y})^2] = E[(y-E(y))^2] + (E(\hat{y})-E(y))^2 + E[(E(\hat{y})-\hat{y})^2] = \sigma^2 + \mathrm{bias}^2 + \mathrm{variance} \qquad (1)$$

where σ² is the target noise generated by the observed samples; it is a lower bound on the expected cost of any learning algorithm. The squared bias, bias², measures the overall precision of the learning algorithm, while the variance measures how much the estimate of the learning algorithm fluctuates across different samples. Since σ² is not determined by the learning algorithm, the squared error is governed by bias² + variance. Moreover, there is a reciprocal relationship between bias² and variance: it is impossible to decrease both at the same time [44]. Therefore, a tradeoff between bias² and variance is the goal for the learning algorithm to achieve. In the MMSE sense, the least squares estimator, the well-known best linear unbiased estimator for the parameters of a linear regression, minimizes the squared error through a quadratic decomposition, and the tradeoff is thereby established.

The goal of the proposed rank aggregation method is to reach the maximum precision under the trade-off between precision and stability. According to the bias plus variance decomposition [44], precision and stability are in a dilemma in that they cannot both be improved at the same time; however, a trade-off between them can be established by a least squares estimator. Therefore, in the hierarchal BoW, a query model based on OLS regression is proposed to aggregate the ranks by weighting the features according to their retrieval performance. The query model is formulated in a linear form as follows:

$$y_i = Q(x_i; \beta) = \beta_0 + \sum_{j=1}^{N} \beta_j x_{ij} \qquad (2)$$

where N denotes the number of features involved and $x_i = \{x_{i1}, \dots, x_{iN}\}^T$ denotes the retrieval scores provided by each feature for the i-th image, i = 1, 2, ..., M. The retrieval scores can be generated by any strategy that measures the similarity between the candidate image and the query image. In the formula, $\beta = \{\beta_0, \beta_1, \dots, \beta_N\}^T$ is an (N+1)-dimensional vector consisting of the weights for each feature $\{\beta_1, \dots, \beta_N\}$ and a global bias $\beta_0$; the final score $y_i$ for the i-th image is computed accordingly. To illustrate the training process for β, the training data and target values are arranged as an overdetermined problem with M degrees of freedom:

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{M1} & \cdots & x_{MN} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} \qquad (3)$$

Thus, the problem takes the form

$$Y = X\beta \qquad (4)$$

Consider the retrieval under each feature as a zero-one classification problem, which regards the result of image retrieval as a binary status; accordingly, the target value in the training data is $y_i = 0$ or $1$, $i = 1, 2, \dots, M$. The estimated weights and bias $\hat{\beta} = \{\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_N\}^T$ are obtained by:

$$\hat{\beta} = (X^T X)^{-1} X^T Y \qquad (5)$$

which provides a simple and fast procedure for estimating $\hat{\beta}$ via a quadratic decomposition. The hierarchal BoW then employs the function $Q(x_i; \hat{\beta})$ to aggregate the ranks from each feature and generate the final score for a candidate image. Based on the above elaboration, the hierarchal BoW includes three main steps: multiple codebook generation, query model training, and rank-aggregated querying. Firstly, in multiple codebook generation, the features are indexed independently for the images to eliminate the mutual submergence of different features. Secondly, the query model is trained by OLS regression using sample queries; the sample queries provide the retrieval scores for each feature, and the OLS regression enhances the salience of effective features while reducing the interference of ineffective ones. Finally, for a true query image, the multiple features are extracted and retrieved in their respective codebooks, and the final rank is aggregated by $Q(x_i; \hat{\beta})$ with the weighted retrieval scores obtained from each feature.
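A minimal sketch of the learning and scoring steps of this pipeline, assuming per-feature retrieval scores and 0/1 relevance labels for the sample queries are already available (function names are illustrative):

```python
# A minimal sketch of the OLS query model of Eqs. (2)-(5).
import numpy as np

def fit_query_model(scores, labels):
    """scores: (M, N) per-feature retrieval scores for M training candidates;
    labels: (M,) 0/1 relevance targets. Returns beta = [b0, b1, ..., bN]."""
    X = np.hstack([np.ones((scores.shape[0], 1)), scores])  # bias column, Eq. (3)
    # Least-squares solution of Y = X beta; equivalent to Eq. (5),
    # beta = (X^T X)^{-1} X^T Y, whenever X^T X is invertible.
    beta, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return beta

def aggregate(scores, beta):
    """Final score y_i = Q(x_i; beta) of Eq. (2) for every candidate image."""
    return beta[0] + scores @ beta[1:]
```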

4. Experiment

To evaluate the performance of the hierarchal BoW model, several experiments are carried out to compare the hierarchal BoW with feature fusion and several recent rank aggregation methods. Our experiments are run over two public datasets, CAVIAR4REID [23] and ZuBuD [45], which cover different image retrieval tasks. CAVIAR4REID includes 1220 small-scale pedestrian images (image size around 30 × 70) and ZuBuD contains 1005 images of Zurich city buildings (image size 640 × 480). Samples from the two datasets are shown in Fig. 5.

Fig. 5 Samples of the Involved Datasets

With a conventional BoW using the best-performing single feature (BSF) as the baseline, the experiments compare the hierarchal BoW with five similar works, namely PCA [46], MidRank [32], TF-IDF [34], BoW+SVM [47] and LRFF [15]. For negative evidence reduction, Jégou et al. [46] apply principal component analysis (PCA) whitening in the early stage of image indexing to fuse multiple features. MidRank [32] takes the median rank among the ranks provided by the multiple features as the final rank, and TF-IDF [34] employs the frequency of word co-occurrence as the feature weights. The discriminative methods are BoW+SVM [47], which couples the BoW model with an SVM classifier, and LRFF [15], which performs feature fusion with a discriminative method, logistic regression. The experiment uses the same features as in Table 1 and gradually increases the number of involved features to verify the precision and stability of the proposed hierarchal BoW. The hierarchal BoW model employs K-Means clustering as the image indexing method, and the weights and bias of the query model are learned by OLS regression. The experiment applies cross-validation to calculate the final accuracy at each rank.

The experiment demonstrates that the hierarchal BoW model is both effective and stable over different datasets and different numbers of features. In particular, the hierarchal BoW model using all seven features raises the accuracy over CAVIAR4REID by 13.57% and 5.42% on average, compared to the baseline and the best-performing similar work (TF-IDF), respectively.

For high performance, the interference among different features ought to be eliminated. Since ineffective features are present in every experiment shown in Fig. 6 and Fig. 7, PCA is unable to show its advantage of dimension and correlation reduction: the ineffective features remain because PCA only removes correlations. Furthermore, the salience of individual features should be fully considered. MidRank performs well when there are sufficient candidate images, so its performance is poor when the query uses a low rank (e.g., rank = 1). The TF-IDF method effectively enhances performance when sufficiently many effective features are involved; however, because of its co-occurrence based measurement, TF-IDF performs poorly when ineffective features occupy a competitive proportion (as shown in Fig. 6(1) and Fig. 7(1)). The discriminative methods BoW+SVM [47] and LRFF [15] give an unstable performance as the number of features increases, because both rely on the adaptability of each individual feature; the two datasets used in the experiments are distinct from each other and suited to different features, so the accuracy enhancement of these two methods differs greatly between the two datasets. With the elimination of feature interference and the consideration of feature salience, our method achieves an outstanding performance that is both discriminative and stable over both the CAVIAR4REID and ZuBuD datasets. Moreover, as more features are involved, the hierarchal BoW respects the trade-off between precision and stability, so the proposed method steadily delivers high precision (as shown in Fig. 6 and Fig. 7; the detailed performances are displayed in Table 2 and Table 3).
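For concreteness, here is a minimal sketch of a MidRank-style aggregation as discussed above, under the assumption that each feature's ranking is available as the rank position of every image (the function name is illustrative):

```python
# A minimal sketch of median-rank aggregation in the spirit of MidRank [32]:
# each image's final rank is the median of its positions across features.
import numpy as np

def median_rank_aggregate(rank_positions):
    """rank_positions: (n_features, n_images) array; entry [f, i] is the
    position of image i in feature f's ranking (0 = best match)."""
    med = np.median(rank_positions, axis=0)
    return np.argsort(med)  # final ordering, best candidate first
```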

Fig. 6 The Performance of HBoW over CAVIAR4REID with different numbers of features


Table 2 Detailed Performance of HBoW over CAVIAR4REID with All Features

Rank  BSF     Borda   PCA     MidRank  TF-IDF  BoW+SVM  LRFF    Our Method
1     40.98%  40.16%  38.93%  20.66%   40.33%  11.07%   31.48%  48.28%
2     47.70%  50.49%  47.46%  32.71%   49.00%  20.13%   40.87%  58.77%
3     49.84%  56.31%  50.33%  44.37%   56.57%  25.90%   47.46%  63.61%
4     55.08%  61.07%  54.92%  51.76%   61.13%  30.02%   52.98%  66.97%
5     57.87%  63.85%  57.05%  56.74%   63.44%  32.91%   56.12%  69.67%
6     59.34%  65.74%  59.75%  59.48%   65.93%  36.20%   59.49%  72.79%
7     61.89%  68.77%  61.72%  61.88%   68.82%  37.03%   61.88%  74.43%
8     60.57%  69.75%  63.20%  64.11%   70.73%  41.55%   63.86%  76.07%
9     63.20%  71.39%  64.10%  65.69%   72.22%  42.38%   66.41%  78.36%
10    64.84%  73.11%  67.13%  67.92%   74.28%  45.67%   68.39%  78.85%
11    67.30%  73.77%  66.56%  69.33%   76.02%  47.32%   69.95%  79.67%
12    67.05%  76.48%  69.26%  70.98%   76.28%  48.96%   71.35%  80.90%
13    68.11%  77.30%  68.77%  72.14%   77.67%  49.38%   72.01%  81.15%
14    67.38%  77.38%  71.15%  73.71%   78.83%  53.07%   73.41%  82.38%
15    68.77%  78.52%  70.00%  74.95%   78.68%  53.50%   74.32%  83.28%
16    68.69%  79.18%  71.15%  76.35%   79.66%  54.72%   76.04%  84.51%
17    71.56%  79.10%  72.30%  77.27%   80.49%  55.14%   76.95%  85.34%
18    69.75%  79.92%  72.13%  78.91%   81.48%  56.38%   78.10%  86.25%
19    72.30%  80.66%  71.64%  79.50%   82.14%  58.02%   79.01%  86.80%
20    71.48%  79.84%  75.25%  79.83%   83.13%  58.85%   79.59%  87.05%
Avg.  62.69%  70.14%  63.64%  63.91%   70.84%  42.91%   64.98%  76.26%

Fig. 7 The Performance of HBoW over ZuBuD with different numbers of features

Table 3 Detailed Performance of HBoW over ZuBuD with All Features

Rank  BSF     Borda   PCA     MidRank  TF-IDF  BoW+SVM  LRFF    Our Method
1     81.47%  94.44%  78.11%  38.31%   87.89%  78.23%   78.61%  94.24%
2     84.08%  96.38%  80.51%  60.99%   92.76%  83.83%   83.66%  96.26%
3     85.46%  96.93%  81.30%  79.52%   93.52%  86.20%   85.71%  97.01%
4     86.21%  97.13%  82.09%  87.02%   94.43%  87.33%   87.12%  97.30%
5     86.46%  97.67%  82.51%  89.20%   94.81%  88.24%   88.08%  97.51%
6     86.83%  97.67%  82.84%  90.94%   95.22%  88.87%   89.20%  97.76%
7     87.33%  97.84%  83.18%  91.36%   95.51%  89.49%   89.74%  97.92%
8     87.41%  98.09%  83.76%  91.69%   95.89%  89.74%   90.44%  98.26%
9     87.50%  98.09%  84.17%  91.44%   95.97%  90.44%   90.61%  98.26%
10    87.58%  98.30%  84.46%  91.82%   96.43%  90.82%   90.70%  98.34%
11    87.58%  98.46%  84.71%  92.02%   96.35%  91.15%   91.11%  98.50%
12    87.74%  98.46%  84.80%  91.94%   96.22%  91.40%   91.48%  98.50%
13    87.75%  98.46%  85.34%  91.57%   96.22%  91.77%   91.69%  98.59%
14    87.75%  98.46%  85.54%  91.57%   96.47%  92.11%   92.02%  98.59%
15    87.83%  98.46%  85.75%  91.24%   96.47%  92.65%   92.23%  98.59%
16    87.83%  98.55%  85.75%  90.32%   96.55%  93.06%   92.23%  98.59%
17    87.83%  98.63%  85.75%  89.87%   96.72%  93.31%   92.73%  98.59%
18    87.83%  98.63%  85.75%  89.91%   96.72%  93.40%   93.10%  98.59%
19    87.83%  98.71%  85.75%  90.28%   96.72%  93.48%   93.27%  98.80%
20    87.83%  98.71%  85.75%  90.66%   96.80%  93.56%   93.44%  98.88%
Avg.  86.91%  97.90%  83.89%  86.08%   95.38%  89.95%   89.86%  97.95%

5. Discussion

The experiments in the above section compare our method with feature fusion methods and other rank aggregation methods. The results demonstrate that the parametric least squares estimator is effective for estimating the query model. In this section, further improvement of the query model is discussed and a comparison with deep learning based approaches is analyzed.

A. The Further Improvement of the Query Model

Compared to other parametric models, when estimating parameters that come from an exponential family subject to mild conditions, the least squares estimator and the maximum likelihood estimator are both optimal [48]. However, unlike the maximum likelihood estimator, the least squares estimator is a more stable method because it approximates an overdetermined system without any assumptions on the distributions of the random variables. The limitation of the method mentioned above is that there is actually no particular parametric model; any parametric method presumes that X in (3) can be divided by decision hyperplanes in a linear or quadratic form, like (2). However, compared to non-parametric models (e.g., feedforward neural networks, SVM), the proposed ordinary least squares (OLS) regression based query model is efficient in the following two respects. On one hand, much less data is required to train accurate parametric models than non-parametric models, which approximate the target regression function only asymptotically, with relatively larger sample requirements and a slower convergence process. On the other hand, according to the bias-variance decomposition analysis, parametric models can provide a statistically optimal solution in the MMSE sense even when the true separations differ widely from the presumption.

Fig. 8 Comparison between OLS Regression and SVM Regression

We carry out an experiment training the above query model with OLS regression and with L2-regularized support vector regression (L2-SVR). The L2-SVR is implemented with LIBLINEAR [31], a popular framework for large-scale linear regression, under its default configuration (parameter C = 1, p = 0.01) in Fig. 8. The experimental results demonstrate that, compared to L2-SVR, OLS regression provides a competitive and more stable performance. Moreover, OLS regression takes 75% of the training time of L2-SVR (1981 ms versus 2677 ms over 5526 samples). This result does not imply that OLS regression is a better method than L2-SVR; rather, the simple OLS regression remains very effective in a practical sense despite its limitations. For further improvement, the limitations of parametric models should be regarded as an opportunity to adopt a more flexible assumption on the decision function or a new, appropriate loss function. Meanwhile, dynamic constructions and weighting strategies may be applied to the proposed method to meet the demands of retrieval tasks over a growing number of images, as in an online CBIR system. These two points will be our future work.
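As a point of reference, here is a hedged sketch of the L2-SVR side of this comparison using scikit-learn's LinearSVR, which wraps LIBLINEAR; mapping LIBLINEAR's C = 1, p = 0.01 onto C=1.0, epsilon=0.01, as well as the helper name, are assumptions for illustration rather than the paper's exact setup.

```python
# A sketch, not the paper's implementation: train the same query model with
# L2-regularized support vector regression via scikit-learn's LinearSVR.
import numpy as np
from sklearn.svm import LinearSVR

def fit_l2_svr_query_model(scores, labels):
    """scores: (M, N) per-feature retrieval scores; labels: (M,) 0/1 targets.
    Returns a weight vector playing the same role as beta in the OLS sketch."""
    X = np.hstack([np.ones((scores.shape[0], 1)), scores])  # explicit bias column
    svr = LinearSVR(C=1.0, epsilon=0.01,          # assumed LIBLINEAR C=1, p=0.01
                    fit_intercept=False, max_iter=10000)
    svr.fit(X, labels)
    return svr.coef_
```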

B. The Further Improvement of Feature Representation

Besides involving multiple features in CBIR, deep learning is another promising approach to improving retrieval performance [21, 22]. Deep learning architectures, such as the deep belief network (DBN), the auto-encoder and the deep convolutional neural network (CNN), have been proposed for unsupervised feature learning. The recent successes of deep learning rely mostly on CNNs, and Wan et al. [22] recently attempted to apply CNNs to image retrieval tasks. Different from conventional image retrieval methods, which extract fixed features, CNNs work as special image filters that can learn effective features through unsupervised procedures. In practice, however, the main problem of deep learning is the design. For instance, in a CNN, the design of the architecture and the parameter settings are still not a matter of routine [49]; the design of a deep learning framework still depends on the user's experience, and parameter setting is a complicated fine-tuning procedure. Moreover, a fixed design and fixed parameter settings are obviously not an adaptive approach for image retrieval or image classification. Researchers [49-53] are working toward a deeper understanding of the behavior of deep learning frameworks and attempting to provide a simple procedure for training them.

6. Conclusion

This letter proposes a hierarchal BoW model, which organizes multiple features in a hierarchal way with multiple codebooks. With this hierarchal representation, the proposed method avoids the potential submergence of feature salience and can make full use of it. Additionally, the hierarchal BoW can adapt to the variety of images by adopting new features. To improve precision, the proposed query model aggregates the ranks from each feature, measuring the salience of each feature by its retrieval performance. Considering the trade-off between precision and stability, the hierarchal BoW is built in the MMSE sense. The experimental results demonstrate that the proposed method is practical, with effective and stable performance; the enhanced feature salience significantly improves accuracy over different datasets. In particular, on CAVIAR4REID the accuracy is improved by 13.57% and 5.42% on average compared with the conventional BoW and the state of the art, respectively. Moreover, as the number of involved features increases, the proposed method provides a stably rising accuracy. For further improvement, our future work will focus on a flexible decision function for the query model and a proper dynamic weighting strategy.

References

[1] V. F. Arguedas, Q. Zhang, and E. Izquierdo, “Bayesian multimodal fusion in forensic applications,” in Computer Vision - ECCV 2012. Workshops and Demonstrations, 2012, pp. 466-475.
[2] W. Voravuthikunchai, B. Crémilleux, and F. Jurie, “Image re-ranking based on statistics of frequent patterns,” in Proceedings of International Conference on Multimedia Retrieval, 2014, pp. 129.
[3] P. Sahu, R. Batham, and N. Chourasia, “A Survey on Re-Ranking of Image by Analyzing various Features and Multi-Modality,” International Journal of Scientific Progress and Research (IJSPR), vol. 5, no. 3, 2014.
[4] J. Cai, Q. Liu, F. Chen et al., “Scalable Image Search with Multiple Index Tables,” in Proceedings of International Conference on Multimedia Retrieval, 2014, pp. 407.
[5] G. Tolias, Y. Avrithis, and H. Jégou, “To aggregate or not to aggregate: Selective match kernels for image search,” in Computer Vision (ICCV), 2013 IEEE International Conference on, 2013, pp. 1401-1408.
[6] C. Wengert, M. Douze, and H. Jégou, “Bag-of-colors for improved image search,” in Proceedings of the 19th ACM international conference on Multimedia, 2011, pp. 1437-1440.
[7] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” Computer Vision - ECCV 2008, pp. 304-317, Springer, 2008.
[8] G. Zeng, H.-M. Hu, Y. Geng et al., “A person re-identification algorithm based on color topology,” in Image Processing (ICIP), 2014 IEEE International Conference on, 2014, pp. 2447-2451.
[9] S. Tang, Y.-D. Zhang, J.-T. Li et al., “TRECVID 2007 High-Level Feature Extraction By MCG-ICTCAS,” in TRECVID, 2007.
[10] G.-H. Liu, and J.-Y. Yang, “Content-based image retrieval using color difference histogram,” Pattern Recognition, vol. 46, no. 1, pp. 188-198, 2013.
[11] Y. Peng, Z. Yang, J. Yi et al., “Peking University at TRECVID 2008: High Level Feature Extraction,” in TRECVID, 2008.
[12] S. A. Chatzichristofis, and Y. S. Boutalis, “CEDD: Color and edge directivity descriptor: a compact descriptor for image indexing and retrieval,” in Proceedings of the 6th international conference on Computer vision systems, Santorini, Greece, 2008, pp. 312-322.
[13] A. E. Abdel-Hakim, and A. A. Farag, “CSIFT: A SIFT descriptor with color invariant characteristics,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2006, pp. 1978-1983.
[14] S. A. Chatzichristofis, and Y. S. Boutalis, “FCTH: Fuzzy color and texture histogram - a low level feature for accurate image retrieval,” in Image Analysis for Multimedia Interactive Services, 2008. WIAMIS'08. Ninth International Workshop on, Klagenfurt, 2008, pp. 191-196.
[15] Y. Yang, J. Song, Z. Huang et al., “Multi-feature fusion via hierarchical regression for multimedia analysis,” Multimedia, IEEE Transactions on, vol. 15, no. 3, pp. 572-581, 2013.
[16] Y. Geng, H. Hu, J. Zheng et al., “A person re-identification algorithm by using region-based feature selection and feature fusion,” in Image Processing (ICIP), 2013 20th IEEE International Conference on, Melbourne, VIC, 2013, pp. 3363-3366.
[17] J. Yang, and X. Zhang, “Feature-level fusion of fingerprint and finger-vein for personal identification,” Pattern Recognition Letters, vol. 33, no. 5, pp. 623-628, Apr., 2012.
[18] P. Natarajan, S. Wu, S. Vitaladevuni et al., “Multimodal feature fusion for robust event detection in web videos,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1298-1305.
[19] U. G. Mangai, S. Samanta, S. Das et al., “A survey of decision fusion and feature fusion strategies for pattern classification,” IETE Technical Review, vol. 27, no. 4, pp. 293, 2010.
[20] C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late fusion in semantic video analysis,” in Proceedings of the 13th annual ACM international conference on Multimedia, 2005, pp. 399-402.
[21] Y. Bai, W. Yu, T. Xiao et al., “Bag-of-Words Based Deep Neural Network for Image Retrieval,” in Proceedings of the ACM International Conference on Multimedia, Orlando, Florida, USA, 2014, pp. 229-232.
[22] J. Wan, D. Wang, S. C. H. Hoi et al., “Deep learning for content-based image retrieval: A comprehensive study,” in Proceedings of the ACM International Conference on Multimedia, 2014, pp. 157-166.
[23] L. Bazzani, M. Cristani, A. Perina et al., “Multiple-shot person re-identification by chromatic and epitomic analyses,” Pattern Recognition Letters, vol. 33, no. 7, pp. 898-903, May, 2012.
[24] L. Yang, and A. Hanjalic, “Supervised reranking for web image search,” in Proceedings of the international conference on Multimedia, Firenze, Italy, 2010, pp. 183-192.
[25] L. Zhou, Z. Zhou, and D. Hu, “Scene classification using a multi-resolution bag-of-features model,” Pattern Recognition, vol. 46, no. 1, pp. 424-433, Jan., 2013.
[26] L. Wu, S. C. Hoi, and N. Yu, “Semantics-preserving bag-of-words models and applications,” Image Processing, IEEE Transactions on, vol. 19, no. 7, pp. 1908-1920, Mar., 2010.
[27] T. Ojala, M. Pietikainen, and D. Harwood, “Performance evaluation of texture measures with classification based on Kullback discrimination of distributions,” Pattern Recognition, vol. 29, no. 1, pp. 51-59, Jan., 1996.
[28] A. Bosch, A. Zisserman, and X. Munoz, “Representing shape with a spatial pyramid kernel,” in Proceedings of the 6th ACM international conference on Image and video retrieval, Amsterdam, The Netherlands, 2007, pp. 401-408.
[29] X.-Y. Wang, Y.-J. Yu, and H.-Y. Yang, “An effective image retrieval scheme using color, texture and shape features,” Computer Standards & Interfaces, vol. 33, no. 1, pp. 59-68, Jan., 2011.
[30] T. Yeh, J. Lee, and T. Darrell, “Adaptive vocabulary forests for dynamic indexing and category learning,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, Rio de Janeiro, 2007, pp. 1-8.
[31] M. Bosch, F. Zhu, N. Khanna et al., “Combining global and local features for food identification in dietary assessment,” in Image Processing (ICIP), 2011 18th IEEE International Conference on, Brussels, 2011, pp. 1789-1792.
[32] H. Jégou, C. Schmid, H. Harzallah et al., “Accurate image search using the contextual dissimilarity measure,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 1, pp. 2-11, Dec., 2010.
[33] Y. Xia, K. He, F. Wen et al., “Joint inverted indexing,” in Computer Vision (ICCV), 2013 IEEE International Conference on, Sydney, NSW, 2013, pp. 3416-3423.
[34] L. Zheng, S. Wang, Z. Liu et al., “Packing and padding: Coupled multi-index for accurate image retrieval,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, Columbus, OH, 2014, pp. 1947-1954.
[35] E. Kasutani, and A. Yamada, “The MPEG-7 color layout descriptor: a compact image feature description for high-speed image/video segment retrieval,” in Image Processing, 2001, Thessaloniki, 2001, pp. 674-677.
[36] I. Fogel, and D. Sagi, “Gabor filters as texture discriminator,” Biological Cybernetics, vol. 61, no. 2, pp. 103-113, Jun., 1989.
[37] J. A. Sáez, J. Derrac, J. Luengo et al., “Statistical computation of feature weighting schemes through data estimation for nearest neighbor classifiers,” Pattern Recognition, vol. 47, no. 12, pp. 3941-3948, Dec., 2014.
[38] M. Qian, and C. Zhai, “Unsupervised Feature Selection for Multi-View Clustering on Text-Image Web News Data,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, Shanghai, China, 2014, pp. 1963-1966.
[39] J. J.-Y. Wang, J. Yao, and Y. Sun, “Semi-supervised local-learning-based feature selection,” in Neural Networks (IJCNN), 2014 International Joint Conference on, Beijing, China, 2014, pp. 1942-1948.
[40] J. J.-Y. Wang, H. Bensmail, and X. Gao, “Feature selection and multi-kernel learning for sparse representation on a manifold,” Neural Networks, vol. 51, pp. 9-16, Mar., 2014.
[41] Y. Ko, “New feature weighting approaches for speech-act classification,” Pattern Recognition Letters, vol. 51, pp. 107-111, Jan., 2015.
[42] P. Zhang, D. Song, J. Wang et al., “Bias-variance analysis in estimating true query model for information retrieval,” Information Processing & Management, vol. 50, no. 1, pp. 199-217, Jan., 2014.
[43] C. Moulin, C. Largeron, C. Ducottet et al., “Fisher linear discriminant analysis for text-image combination in multimedia information retrieval,” Pattern Recognition, vol. 47, no. 1, pp. 260-269, Jan., 2014.
[44] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Computation, vol. 4, no. 1, pp. 1-58, Jan., 1992.
[45] H. Shao, T. Svoboda, and L. Van Gool, ZuBuD - Zurich buildings database for image based recognition, Technical Report No. 260, Swiss Federal Institute of Technology, Switzerland, 2004.
[46] H. Jégou, and O. Chum, “Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening,” in Computer Vision - ECCV 2012, Florence, Italy, 2012, pp. 774-787.
[47] J. Zhang, L. Sui, L. Zhuo et al., “An approach of bag-of-words based on visual attention model for pornographic images recognition in compressed domain,” Neurocomputing, vol. 110, pp. 145-152, Jun., 2013.
[48] A. Charnes, E. Frome, and P.-L. Yu, “The equivalence of generalized least squares and maximum likelihood estimates in the exponential family,” Journal of the American Statistical Association, vol. 71, no. 353, pp. 169-171, Oct., 1976.
[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[50] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[51] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[52] A. A. Cruz-Roa, J. E. A. Ovalle, A. Madabhushi et al., “A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection,” Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013, pp. 403-410, Springer, 2013.
[53] Z. Wu, Y.-G. Jiang, J. Wang et al., “Exploring inter-feature and inter-class relationships with deep neural networks for video classification,” in Proceedings of the ACM International Conference on Multimedia, 2014, pp. 167-176.

Fan Jiang received the B.S. degree in software engineering from the College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China, in 2013, and he is currently pursuing the M.S. degree in computer science and engineering at Beihang University, Beijing, China. His current research interests include multimedia retrieval and object recognition.

Hai-Miao Hu received the B.S. degree from Central South University, Changsha, China, in 2005, and the Ph.D. degree from Beihang University, Beijing, China, in 2012, both in computer science. He was a visiting student at the University of Washington from 2008 to 2009. Currently, he is an assistant professor of Computer Science and Engineering at Beihang University. His research interests include video coding and networking, image/video processing, and video analysis and understanding.

Jin Zheng was born in Sichuan, China, on October 15, 1978. She received the B.S. degree in applied mathematics and informatics from the College of Science in 2001, the M.S. degree from the School of Computer Science, Liaoning Technical University, Fuxin, China, in 2004, and the Ph.D. degree from the School of Computer Science and Engineering, Beihang University, Beijing, China, in 2009. She is currently a teacher with Beihang University. Her current research interests include moving object detection and tracking, object recognition, image enhancement, video stabilization, and video mosaic.

Bo Li received the B.S. degree in computer science from Chongqing University in 1986, the M.S. degree in computer science from Xi'an Jiaotong University in 1989, and the Ph.D. degree in computer science from Beihang University in 1993. Now he is a professor of Computer Science and Engineering at Beihang University and the Director of the Beijing Key Laboratory of Digital Media, and has published over 100 conference and journal papers in diversified research fields including digital video and image compression, video analysis and understanding, remote sensing image fusion and embedded digital image processors.