Kernelized sparse hashing for scalable image retrieval


Neurocomputing 172 (2016) 207–214


Yin Zhang, Weiming Lu, Yang Liu, Fei Wu
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Article history: Received 30 October 2013; Received in revised form 3 February 2015; Accepted 7 February 2015; Available online 13 May 2015

Keywords: Image retrieval; Hashing; Sparse coding; Kernel methods

Abstract

Recently, hashing has been widely applied to large-scale image retrieval applications due to its appealing query speed and low storage cost. The key idea of hashing is to learn a hash function that maps high-dimensional data into compact binary codes while preserving the similarity structure of the original feature space. In this paper, we propose a new method called Kernelized Sparse Hashing, which generates sparse hash codes with ℓ1 and non-negative regularizations. Compared to traditional hashing methods, our method activates only a small number of relevant bits in each hash code and hence provides a more compact and interpretable representation of the data. Moreover, the kernel trick is introduced to capture the nonlinear similarity of features, and the local geometrical structure of the data is explicitly considered in our method to improve retrieval accuracy. Extensive experiments on three large-scale image datasets demonstrate the superior performance of our proposed method over the examined state-of-the-art techniques.

1. Introduction

With the explosive growth of online images and videos, much effort has been devoted to developing data-dependent hashing methods that aim to learn similarity-preserving compact binary codes for representing high-dimensional visual content [1–11]. This type of hashing technique features two common traits: the encoded data are compact enough to be loaded into main memory, and the Hamming distance between codes can be computed efficiently with the bitwise XOR operation on the CPU. As a result, approximate similarity computation over such binary codes is very fast, which significantly scales up nearest neighbor search to large-scale image collections.

One common paradigm of the aforementioned hashing methods is to map the input data into a low-dimensional space by some embedding technique, e.g. principal component analysis (PCA), and then binarize the embedded data into hash codes. Accordingly, there are two feasible paths to further boost the performance of learning-based hashing methods: one is to introduce a more advanced dimension reduction technique to obtain a more informative embedded representation of the data [1,3,5,8]; the other is to develop smarter quantization schemes for transforming embedded representations into binary codes [9,12–14]. In this paper, we go down the former path and propose a Kernelized Sparse Hashing (KSpH) method, inspired by recent advances in sparse dictionary learning [6,8,15] that bring more flexibility to adapt embedded representations to the data than principal components.

Fig. 1 shows the main process of dictionary learning, code generation and binarization in our framework. Our proposed KSpH method learns hash codes that minimize the reconstruction error on the pre-trained dictionary. For each data point, KSpH activates the most relevant bits to 1 and sets the others to 0 by imposing sparsity and non-negativity constraints on the reconstruction coefficients. This characteristic enables all hashing bits to be fully utilized, since each hashing bit only needs to be effective for certain data points. Furthermore, the generated sparse and non-negative coefficients can be naturally mapped into binary codes without losing too much information. Finally, inspired by previous works [3,7,16,17], our method is capable of exploiting a non-linear kernel function to measure the similarity of features and of explicitly preserving the local geometrical structure of the data, which leads to better hash codes for image retrieval.
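As a concrete illustration of the XOR-based Hamming distance computation mentioned above, the sketch below packs binary codes into bytes and ranks database items by Hamming distance. It is a minimal NumPy example for exposition only; the packing layout and function names are ours and not part of the paper.

```python
import numpy as np

def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack an (n, l) array of {0,1} hash bits into bytes (8 bits per byte)."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distances between one packed query code and all packed database codes,
    computed with a bitwise XOR followed by counting the differing bits."""
    xor = np.bitwise_xor(database, query)            # differing bits, byte by byte
    return np.unpackbits(xor, axis=1).sum(axis=1)    # count the set bits per row

# toy usage: 4 database items and one query, 16-bit codes
rng = np.random.default_rng(0)
db = pack_codes(rng.integers(0, 2, size=(4, 16)))
q = pack_codes(rng.integers(0, 2, size=(1, 16)))[0]
print(hamming_distances(q, db))
```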

The main contributions of this paper are as follows:

- By generating sparse hash codes with both ℓ1 and non-negative regularizations, only a small number of relevant hashing bits are activated, which provides a more compact and interpretable representation of the data.
- The kernel trick is introduced to capture the nonlinear similarity of features, which makes our method more adaptable to different data distributions and non-linear similarity metrics.
- The local geometrical structure of the data is explicitly utilized when learning the dictionary, which further improves the performance of nearest-neighbor retrieval.
- We conduct experiments on the NUS-WIDE, ANN-GIST and CIFAR datasets, which consist of 270,000, 1 million, and 60,000 images, respectively. The experimental results show that the proposed method outperforms the tested state-of-the-art techniques.

Fig. 1. Image $x_q$, $x_1$ and $x_2$ are first represented in their original space. Then the three images are encoded by the dictionary bases learnt from the training data. Finally, the nonzero coefficients w.r.t. each data point are binarized into 1.

The remainder of this paper is organized as follows: Section 2 briefly reviews related work on hashing methods for scalable image retrieval. Section 3 presents the formal formulation of our proposed KSpH framework. Section 4 gives the details of the alternating optimization algorithm for learning the dictionary and the sparse coefficients. Section 5 provides the experimental setup and results. Finally, we draw our conclusions in Section 6.

2. Related work

Fast nearest neighbor search technologies with sub-linear or even constant time complexity are highly desirable for scalable image retrieval and processing applications [1,7,9,10,18,19]. A promising way to expedite nearest neighbor search is to hash the high-dimensional data into compact binary codes, under the condition that similar data points are mapped into similar codewords within a small Hamming distance. Locality Sensitive Hashing (LSH) [20] is a representative hashing method that uses simple random projections as hash functions. By selecting hash functions that satisfy the locality-sensitive property, similar data have a high probability of being mapped into the same hash code in LSH. In [2], KLSH is proposed to generalize LSH with kernel functions, which makes LSH more flexible in settings where the embedded feature space is unknown or incomputable. However, a common weakness of LSH and its extensions is that a large number of hash tables are needed to attain good search performance due to the randomness of the hash functions.

To better exploit the data distribution, many data-aware hashing methods have been proposed by means of machine learning technologies. In [4], stacked Restricted Boltzmann Machines (RBMs) are utilized to capture the data distribution and generate compact binary codes to represent the data points. Spectral Hashing (SH) [1] uses a separable Laplacian eigenfunction formulation that ends up assigning more bits to higher-variance PCA directions. Binary Reconstructive Embedding (BRE) [3] learns the hash function by explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary codes. In [17], Self-Taught Hashing (STH) is proposed to preserve the local similarity of data and learn the hash function in a self-taught manner. Spline Regression Hashing (SRH) [7] simultaneously considers the local and global similarities of data by combining local spline functions with a global kernel hashing function.

After obtaining the embedded representations, smart quantization schemes [9,12–14] are able to further improve the quality of the final compact binary codes. In iterative quantization [9], learning a good binary code is formulated as the problem of directly minimizing the quantization error of mapping the PCA-projected data to vertices of the binary hypercube.

In [12], Manhattan Hashing (MH) is proposed to address the limitations of Hamming-distance-based hashing; the basic idea of MH is to encode each projected dimension with multiple bits of natural binary code (NBC). In [13,14], adaptive multi-bit quantization methods are proposed to quantize each projected dimension with a variable number of bits: more bits are adaptively allocated to dimensions with larger dispersion, while fewer bits are used for dimensions with smaller dispersion.

Modeling data vectors as sparse linear combinations of basis elements [6,8,10,11,15] is widely used in machine learning, signal processing and related fields. Dictionary learning [15] focuses on learning a basis set adapted to specific data; novel data are then represented by a sparse linear combination of a few atoms of the learned dictionary. Robust Sparse Hashing (RSH) [8] assumes that the input vectors themselves are perturbed or uncertain, and learns dictionaries on robustified counterparts of the uncertain data points. The difference between RSH and our method is that RSH models data uncertainty from a robust optimization perspective, whereas our method exploits the local geometrical structure of the data and the kernel trick to model complex data distributions.

3. The proposed framework

Given a collection of $n$ $d$-dimensional data points $X=[x_1,x_2,\dots,x_n]\in\mathbb{R}^{d\times n}$, hashing aims to learn a hash function that maps these data points into $l$-dimensional binary codes $Y=[y_1,y_2,\dots,y_n]\in\{0,1\}^{l\times n}$. Here, $d$ is the dimensionality of the original feature space and $l$ is the length of the obtained hash codes. As mentioned above, we expect each hash code $y_i$ to be the sparse representation of its corresponding data point $x_i$ under the given dictionary bases $D=[d_1,d_2,\dots,d_l]\in\mathbb{R}^{d\times l}$, while minimizing the reconstruction error. Inspired by sparse coding and dictionary learning models [6,15], we write our objective function as follows:

$$\operatorname*{arg\,min}_{\{y_i\}_{i=1}^{n},\,D}\ \sum_{i=1}^{n}\Big(\|x_i-Dy_i\|_2^2+\lambda\|y_i\|_1\Big)\qquad \text{s.t.}\ \ y_i\in\{0,1\}^{l},\ i=1,\dots,n;\quad \|d_j\|_2^2\le c,\ j=1,\dots,l \tag{1}$$

The constraint on $y_i$ enforces the learned hash codes to be binary, and the constraint $\|d_j\|_2^2\le c$ removes the scaling ambiguity. However, the objective function in Eq. (1) can be shown to be NP-hard. Following [1,7], we remove the constraint $y_i\in\{0,1\}^{l}$ to make the problem computationally tractable. It should be noted that when $y_i$ is enforced to be binary, only dictionary bases that are positively correlated with $x_i$ have the chance to be activated to 1. Thus, we add a non-negative constraint to the relaxed coefficients to eliminate the negatively correlated bases, which keeps the consistency with the original binary codes. Now the objective function in Eq. (1) can be revised as follows:

$$\operatorname*{arg\,min}_{\{s_i\}_{i=1}^{n},\,D}\ \sum_{i=1}^{n}\Big(\|x_i-Ds_i\|_2^2+\lambda\|s_i\|_1\Big)\qquad \text{s.t.}\ \ s_i\ge 0,\ i=1,\dots,n;\quad \|d_j\|_2^2\le c,\ j=1,\dots,l \tag{2}$$

where $s_i$ denotes the vector of sparse coefficients of $x_i$.
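For concreteness, the following minimal NumPy sketch evaluates the relaxed objective of Eq. (2) for a given dictionary and coefficient matrix. It is only a sanity-check helper under our own naming, not the paper's implementation.

```python
import numpy as np

def ksph_relaxed_objective(X, D, S, lam):
    """Objective of Eq. (2): sum_i ||x_i - D s_i||_2^2 + lam * ||s_i||_1,
    with X of shape (d, n), D of shape (d, l) and non-negative S of shape (l, n)."""
    residual = X - D @ S                 # column i holds x_i - D s_i
    reconstruction = np.sum(residual ** 2)
    sparsity = lam * np.abs(S).sum()     # for S >= 0 this is simply S.sum()
    return reconstruction + sparsity

# toy usage
rng = np.random.default_rng(0)
X, D = rng.normal(size=(5, 10)), rng.normal(size=(5, 8))
S = np.abs(rng.normal(size=(8, 10)))
print(ksph_relaxed_objective(X, D, S, lam=0.1))
```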

After obtaining $S=[s_1,\dots,s_n]\in\mathbb{R}^{l\times n}_{+}$, the $j$-th ($j=1,\dots,l$) bit of the binary code $y_i$ can be obtained by the following thresholding operation:

$$(y_i)_j=\begin{cases}1&\text{if }(s_i)_j>0,\\ 0&\text{if }(s_i)_j=0.\end{cases} \tag{3}$$

As shown in [2,7], by applying the kernel trick to the raw features, the learned hash codes can be made more adaptable to different distributions and similarity metrics. In general, suppose there exists a feature mapping function $\phi$ that maps the data points and the dictionary bases from their original space into the kernel space, i.e., $X=[x_1,\dots,x_n]\rightarrow\phi(X)=[\phi(x_1),\dots,\phi(x_n)]$ and $D=[d_1,\dots,d_l]\rightarrow[\phi(d_1),\dots,\phi(d_l)]$. Denoting the mapped dictionary again by $D$ for brevity, our objective can be further written as

$$\operatorname*{arg\,min}_{\{s_i\}_{i=1}^{n},\,D}\ \sum_{i=1}^{n}\Big(\|\phi(x_i)-Ds_i\|_2^2+\lambda\|s_i\|_1\Big) \tag{4}$$
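The thresholding of Eq. (3) is straightforward to implement. The sketch below is a minimal version in which the small tolerance for deciding "non-zero" is our own numerical safeguard, not something specified in the paper.

```python
import numpy as np

def binarize_codes(S: np.ndarray, tol: float = 1e-12) -> np.ndarray:
    """Eq. (3): bit j of y_i is 1 wherever the non-negative coefficient (s_i)_j
    is (numerically) positive, and 0 otherwise. S has shape (l, n)."""
    return (S > tol).astype(np.uint8)

# toy usage: three 4-bit codes from sparse non-negative coefficients
S = np.array([[0.0, 0.7, 0.0],
              [0.3, 0.0, 0.0],
              [0.9, 0.0, 0.2],
              [0.0, 0.1, 0.0]])
print(binarize_codes(S).T)   # each row is a binary code y_i
```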

The objective function in Eq. (4) still fails to consider the local geometrical structure of the data, which is critical in nearest-neighbor retrieval applications. To capture the intrinsic geometrical structure, we construct the nearest neighbor graph $G=\langle X,E\rangle$, where $X$ denotes the set of data points and $E$ denotes the set of edges. Furthermore, we obtain $w_{ik}$ by calculating the weight of the edge that connects the $i$-th and $k$-th data points ($w_{ik}$ is set to 0 if there is no edge between the two points). Then we can impose the following regularization on $S$ in the objective function to guarantee that the geometrical structure is kept after hashing:

$$\Omega(S)=\sum_{i=1}^{n}\sum_{k=1}^{n}w_{ik}\,\|s_i-s_k\|_1 \tag{5}$$

Eq. (5) ensures that locally similar data points have similar reconstruction coefficients. In particular, we use the ℓ1 loss to measure the similarity between coefficients, which encourages similar data to have a similar sparsity structure (i.e., the positions of the non-zero dictionary elements). Even after thresholding into binary codes, such a shared sparsity structure is still preserved in terms of Hamming distance.
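The sketch below shows one way to build the nearest neighbor graph weights $w_{ik}$ and to evaluate the regularizer $\Omega(S)$ of Eq. (5). The RBF edge weights and the neighborhood size are illustrative assumptions; the paper does not commit to these particular choices here.

```python
import numpy as np

def knn_graph_weights(X, k=5, sigma=1.0):
    """Symmetric k-NN graph with RBF edge weights w_ik = exp(-||x_i - x_k||^2 / (2 sigma^2)).
    X has shape (d, n); returns an (n, n) weight matrix with zeros where there is no edge."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                                       # no self-loops
    W = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, :k]                                 # k nearest neighbors of each point
    for i in range(n):
        W[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                                          # symmetrize

def graph_regularizer(S, W):
    """Omega(S) = sum_{i,k} w_ik * ||s_i - s_k||_1 as in Eq. (5); S has shape (l, n)."""
    pairwise_l1 = np.abs(S[:, :, None] - S[:, None, :]).sum(axis=0)    # (n, n) of ||s_i - s_k||_1
    return float((W * pairwise_l1).sum())
```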

By incorporating Eq. (5) into our objective function, we formulate the objective function of our proposed KSpH as follows:

$$\operatorname*{arg\,min}_{\{s_i\}_{i=1}^{n},\,D}\ \sum_{i=1}^{n}\Big(\|\phi(x_i)-Ds_i\|_2^2+\lambda\|s_i\|_1\Big)+\gamma\sum_{i=1}^{n}\sum_{k=1}^{n}w_{ik}\,\|s_i-s_k\|_1\qquad \text{s.t.}\ \ s_i\ge 0,\ i=1,\dots,n;\quad \|d_j\|_2^2\le c,\ j=1,\dots,l \tag{6}$$

where $\lambda$ and $\gamma$ are the two regularization parameters that control the weight of the sparsity term and of the local geometry preserving term, respectively. The details of the optimization algorithm for Eq. (6) are given in the next section. After obtaining each sparse coefficient $s_i$, we can binarize it according to Eq. (3). Since the local geometrical structure has already been integrated into the dictionary $D$ by the joint optimization, calculating the hash code for novel data can be viewed as a special case of Eq. (6) in which $D$ is fixed and $\gamma$ is set to zero.

4. Optimization

Following the iterative optimization method in [21], the optimization of Eq. (6) can be divided into two steps: learning the dictionary $D$ (with $S$ fixed) and learning the sparse coefficients $S$ (with $D$ fixed).

4.1. Learning the dictionary D

Similar to [22,23], by assuming that the dictionary elements lie within the subspace spanned by the training data $\phi(X)$, we can write

$$D=\phi(X)W \tag{7}$$

where $W=[w_1,w_2,\dots,w_l]\in\mathbb{R}^{n\times l}$ is a matrix that defines the transformation from the training data points to the dictionary elements. By fixing $S$, the objective function in Eq. (6) can then be written as

$$\operatorname*{arg\,min}_{W}\ \|\phi(X)-\phi(X)WS\|_F^2\qquad \text{s.t.}\ \ \|\phi(X)w_j\|_2^2\le c,\ j=1,\dots,l \tag{8}$$

Since

$$\|\phi(X)-\phi(X)WS\|_F^2=\|\phi(X)(I-WS)\|_F^2=\mathrm{tr}\big((I-WS)^{\mathsf T}\phi(X)^{\mathsf T}\phi(X)(I-WS)\big) \tag{9}$$

where $I$ is the $n\times n$ identity matrix, by introducing the kernel matrix $K=\phi(X)^{\mathsf T}\phi(X)$ the objective function can be further written as

$$\operatorname*{arg\,min}_{W}\ \mathrm{tr}\big((I-WS)^{\mathsf T}K(I-WS)\big)\qquad \text{s.t.}\ \ w_j^{\mathsf T}Kw_j\le c,\ j=1,\dots,l \tag{10}$$

In this paper, a Lagrange dual method [24] is adopted to optimize the objective function in Eq. (10). Let $\beta=[\beta_1,\dots,\beta_l]$, where $\beta_j$ is the Lagrange multiplier associated with the $j$-th inequality constraint $w_j^{\mathsf T}Kw_j-c\le 0$; the Lagrangian of Eq. (10) is then given by

$$L(W,\beta)=\mathrm{tr}\big((I-WS)^{\mathsf T}K(I-WS)\big)+\sum_{j=1}^{l}\beta_j\big(w_j^{\mathsf T}Kw_j-c\big) \tag{11}$$

Let $B$ be the $l\times l$ diagonal matrix whose diagonal entries are $B_{j,j}=\beta_j$ for all $j$; then $L(W,\beta)$ can be written as

$$L(W,B)=\mathrm{tr}\big((I-WS)^{\mathsf T}K(I-WS)\big)+\mathrm{tr}\big(W^{\mathsf T}KWB\big)-c\,\mathrm{tr}(B) \tag{12}$$

Setting the derivative of $L$ with respect to $W$ to 0, the optimal solution $W^{*}$ satisfies

$$K\big(WSS^{\mathsf T}-S^{\mathsf T}+WB\big)=0 \tag{13}$$

By selecting a proper kernel function, $K$ can be guaranteed to be positive definite, and $W^{*}$ can then be solved as

$$W^{*}=S^{\mathsf T}\big(SS^{\mathsf T}+B\big)^{-1} \tag{14}$$
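The closed-form update of Eq. (14) is easy to state in code. The sketch below is a minimal version under our own naming; following the paper's footnote, a pseudoinverse is used in case $SS^{\mathsf T}+B$ is singular.

```python
import numpy as np

def dictionary_coefficients(S: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Eq. (14): W* = S^T (S S^T + B)^{-1} with B = diag(beta).
    S has shape (l, n); returns W of shape (n, l)."""
    B = np.diag(beta)
    return S.T @ np.linalg.pinv(S @ S.T + B)   # pseudoinverse for numerical safety

# toy usage: the kernel-space dictionary is then D = phi(X) W, as in Eq. (7)
rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(8, 20)))           # l = 8 bits, n = 20 training points
W = dictionary_coefficients(S, beta=np.full(8, 0.1))
print(W.shape)                                  # (20, 8)
```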

Substituting Eq. (14) into Eq. (12), the Lagrange dual function becomes

$$L(B)=\mathrm{tr}\big(K-KS^{\mathsf T}(SS^{\mathsf T}+B)^{-1}S\big)-c\,\mathrm{tr}(B) \tag{15}$$

with the constraint that $\beta_j\ge 0$, $j=1,\dots,l$. This problem can be solved by Newton's method. (In practice, since $SS^{\mathsf T}+B$ is not guaranteed to be invertible, the pseudoinverse can be used instead of directly computing the inverse.) Denoting $Q=(SS^{\mathsf T}+B)^{-1}$, the gradient and Hessian of $L(B)$ are computed as follows:


$$\frac{\partial L(B)}{\partial\beta_j}=(QS)_j\,K\,(S^{\mathsf T}Q)_j-c \tag{16}$$

$$\frac{\partial^2 L(B)}{\partial\beta_j\,\partial\beta_k}=-2\big(QSKS^{\mathsf T}Q\big)_{j,k}\,(Q)_{j,k} \tag{17}$$

where $(QS)_j$ denotes the $j$-th row of $QS$ and $(S^{\mathsf T}Q)_j$ the $j$-th column of $S^{\mathsf T}Q$.
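To make the dual update concrete, the following sketch performs one damped Newton step on $\beta$ using the gradient and Hessian of Eqs. (16) and (17), followed by projection onto $\beta\ge 0$. The step size, the ridge term and the default bound $c$ are our own illustrative choices, not settings taken from the paper.

```python
import numpy as np

def dual_newton_step(S, K, beta, c=1.0, step=1.0, ridge=1e-8):
    """One damped Newton update of the dual variables beta for L(B) in Eq. (15).
    S: (l, n) coefficients, K: (n, n) kernel matrix, beta: (l,) current multipliers."""
    Q = np.linalg.pinv(S @ S.T + np.diag(beta))     # Q = (S S^T + B)^{-1}
    M = Q @ S @ K @ S.T @ Q                         # Q S K S^T Q appears in both (16) and (17)
    grad = np.diag(M) - c                           # Eq. (16)
    hess = -2.0 * M * Q                             # Eq. (17), elementwise product of the two matrices
    # Newton step for maximizing the concave dual: beta <- beta - H^{-1} grad
    direction = np.linalg.solve(hess - ridge * np.eye(len(beta)), grad)
    return np.maximum(beta - step * direction, 0.0) # keep the multipliers non-negative
```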

Let $B^{*}$ be the optimal solution of Eq. (15); then the optimal $W^{*}=S^{\mathsf T}(SS^{\mathsf T}+B^{*})^{-1}$. Thus, we can represent the dictionary $D$ in the kernel space as

$$D=\phi(X)W^{*}=\phi(X)S^{\mathsf T}\big(SS^{\mathsf T}+B^{*}\big)^{-1} \tag{18}$$

4.2. Computing the sparse coefficient S

When the dictionary $D$ is fixed, the first term of the objective function in Eq. (6) can be written as

$$L(S)=\|\phi(X)-DS\|_F^2=\|\phi(X)-\phi(X)WS\|_F^2=\mathrm{tr}\big(K+S^{\mathsf T}W^{\mathsf T}KWS-2KWS\big) \tag{19}$$

where $K=\phi(X)^{\mathsf T}\phi(X)$ and the definition of $W$ has already been given in Eq. (14). Now we consider the geometrical structure penalty $\Omega(S)$. According to [25,26], $\Omega(S)$ can be reformulated as a linear transformation of $S$ via the dual norm as follows:

$$\Omega(S)=\max_{A\in\mathcal{Q}}\ \mathrm{tr}\big(ACS^{\mathsf T}\big) \tag{20}$$

where $A\in\mathcal{Q}=\{A\mid \|A\|_\infty\le 1,\ A\in\mathbb{R}^{l\times|E|}\}$ is an auxiliary matrix associated with $\|CS^{\mathsf T}\|_1$, and $C\in\mathbb{R}^{|E|\times n}$ is the edge–vertex incidence matrix of the form

$$C_{e=(i,k),\,p}=\begin{cases}w_{ik}&\text{if } p=i,\\ -w_{ik}&\text{if } p=k,\\ 0&\text{otherwise,}\end{cases} \tag{21}$$

where $i,k,p=1,\dots,n$. Since the regularization $\Omega(S)$ is still non-smooth, we further introduce a smooth approximation $f_\mu(S)$ as follows:

$$f_\mu(S)=\max_{A\in\mathcal{Q}}\ \Big(\mathrm{tr}\big(ACS^{\mathsf T}\big)-\mu\|A\|_F^2\Big) \tag{22}$$

where $\mu\ge 0$ is the smoothness parameter. Substituting Eqs. (19) and (22) back into Eq. (6) and fixing $D$, the objective function can be written as

$$\operatorname*{arg\,min}_{S}\ L(S)+\gamma f_\mu(S)+\lambda\|S\|_1\qquad \text{s.t.}\ \ S\ge 0 \tag{23}$$

Because the first two terms of Eq. (23) are convex and smooth, Eq. (23) can be efficiently solved by the FISTA algorithm [26,27]. In particular, to guarantee that $S$ is non-negative, the soft-thresholding operation used in [25] should be modified to threshold all negative coefficients to 0 in each iteration step.

We summarize the KSpH procedure for learning the dictionary $D$ and the hash codes $Y$ in Algorithm 1. The while loop terminates when the value of the objective function in Eq. (6) no longer decreases (or decreases too slowly). To initialize the dictionary $D$, we randomly sample $l$ data points from $\phi(X)$ as the dictionary bases.

Algorithm 1. The algorithm of kernelized sparse hashing.
Input: X, the matrix of the given data points.
Initialization: randomly initialize the dictionary D.
while not converged do
    Compute f_mu(S) according to Eq. (22);
    Fix D and compute S by solving Eq. (23) via the FISTA algorithm [26,27];
    Compute B by solving Eq. (15) while fixing S;
    Compute W = S^T (S S^T + B)^{-1};
    Represent D as D = phi(X) W;
end while
Map S into binary codes Y according to Eq. (3).
Output: Y, the matrix of the generated hash codes; D, the trained dictionary in the kernel space.
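As a concrete illustration of the modified proximal step, the sketch below shows the non-negative soft-thresholding operator together with a bare-bones proximal-gradient loop for Eq. (23). The gradient of the smooth part $L(S)+\gamma f_\mu(S)$ is passed in as a function, and the fixed step size and iteration count are our own simplifications; the paper uses FISTA [26,27], whose momentum term is omitted here.

```python
import numpy as np

def nonneg_soft_threshold(Z, tau):
    """Proximal operator of tau*||S||_1 restricted to S >= 0: soft-thresholding
    with all negative entries clipped to zero, i.e. the modification described above."""
    return np.maximum(Z - tau, 0.0)

def update_S(S0, smooth_grad, lam, step, n_iter=100):
    """Bare-bones proximal gradient loop for Eq. (23).
    smooth_grad(S) should return the gradient of L(S) + gamma * f_mu(S)."""
    S = S0.copy()
    for _ in range(n_iter):
        S = nonneg_soft_threshold(S - step * smooth_grad(S), step * lam)
    return S
```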

Once the dictionary has been trained, the corresponding coefficient $s_q$ of a novel data point $x_q$ can be obtained via the following objective function:

$$\operatorname*{arg\,min}_{s_q}\ \|\phi(x_q)-\phi(X)Ws_q\|_2^2+\lambda\|s_q\|_1\qquad \text{s.t.}\ \ s_q\ge 0 \tag{24}$$

which is a special case of Eq. (23).
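For out-of-sample coding, the residual in Eq. (24) can be expanded so that only kernel values are needed: $\|\phi(x_q)-\phi(X)Ws_q\|^2 = k(x_q,x_q) - 2\,s_q^{\mathsf T}W^{\mathsf T}k_q + s_q^{\mathsf T}W^{\mathsf T}KWs_q$ with $k_q=\phi(X)^{\mathsf T}\phi(x_q)$. The sketch below solves Eq. (24) with simple projected proximal-gradient steps and then binarizes the result via Eq. (3); the RBF kernel, step size and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix between the columns of A (d, m) and B (d, n)."""
    sq = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def encode_query(xq, X, W, K, lam=0.1, step=1e-2, n_iter=200, sigma=1.0):
    """Solve Eq. (24) for a novel point x_q and binarize the result as in Eq. (3).
    X: (d, n) training data, W: (n, l) from Eq. (14), K: (n, n) training kernel matrix."""
    kq = rbf_kernel(X, xq[:, None], sigma)[:, 0]           # k_q = phi(X)^T phi(x_q)
    sq = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * (W.T @ K @ W @ sq - W.T @ kq)         # gradient of the smooth residual term
        sq = np.maximum(sq - step * (grad + lam), 0.0)     # non-negative soft-threshold step
    return (sq > 1e-12).astype(np.uint8)                   # binary code y_q
```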

5. Experiments and results

5.1. Datasets

To evaluate the efficacy of our proposed Kernelized Sparse Hashing (KSpH) method, we perform experiments on three real-world image datasets, namely the NUS-WIDE, ANN-GIST and CIFAR datasets. All of them have been widely used in the evaluation of content-based image indexing and retrieval applications. The details of the three datasets are as follows:

- NUS-WIDE [28]: The NUS-WIDE dataset is a real-world web image database with 269,648 images crawled from Flickr. We randomly sample 2000 images for testing, and the remaining images form the retrieval set on which the search is performed. The provided 500-dimensional bag-of-words features based on SIFT descriptors are used to represent these images in our experiments.
- ANN-GIST [29]: The ANN-GIST dataset consists of 1 million images as the retrieval set for similarity search and 1000 images as the test set. These images are represented by 960-dimensional GIST descriptors, which describe the global patterns of the images.
- CIFAR [5]: The CIFAR dataset consists of 60,000 32 × 32 tiny color images in 10 classes, with 6000 images per class. We randomly sample 50,000 of the provided images as the retrieval set and use the remaining images as the test set. For each image, we extract a 512-dimensional GIST feature vector for similarity calculation.


5.2. Experiment setup

We first train the dictionary according to Algorithm 1. For all datasets, 3000 images are randomly sampled from the retrieval set to train our model. During the training phase, we use the radial basis function (RBF) kernel to compute the similarity between the training data points and the dictionary elements in kernel space. For constructing the nearest neighbor graph $G$, we select the top 5000 edges by their weights to constitute the edge set $E$. For KSpH and the other compared algorithms, we use cross-validation on the training data to find the best parameter settings for the different datasets.

After the hash codes for both the retrieval set and the test set are generated, we use each image in the test set to retrieve images from the retrieval set by calculating the Hamming distance between their corresponding hash codes. To determine whether the retrieved images are relevant to the query, we take the top 5% nearest neighbors of each query image in the original space, in terms of Euclidean distance, as the ground truth. In addition, since the CIFAR dataset also provides the category information of each image, we also adopt the category information as ground truth on this dataset to show that our method is suitable and flexible for different retrieval tasks. We evaluate the retrieval performance by Precision and Recall [30] as well as the F1-score, defined as follows:

$$F_1=2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} \tag{26}$$

5.3. Comparison with the state of the art

We compare the performance of our proposed Kernelized Sparse Hashing (KSpH) method with the methods listed below:

- Locality Sensitive Hashing (LSH) [20]
- Spectral Hashing (SH) [1]
- Kernelized Locality Sensitive Hashing (KLSH) [2]
- Spline Regression Hashing (SRH) [7]
- binarized Sparse Coding (ScH)

In particular, in our implemented binarized Sparse Coding (ScH) baseline, we simply binarize the sparse coefficients obtained from ℓ1-regularized regression for comparison. All our experiments are conducted on a Linux workstation with an Intel 2.67 GHz Xeon CPU and 32 GB RAM using Matlab 7.10.

The precision–recall curves on the NUS-WIDE and ANN-GIST datasets are shown in Fig. 2. For the CIFAR dataset, the precision–recall curves obtained with the two different ground truths are shown in Fig. 3. Here, each data point is compressed into a 256-bit hash code and the Hamming radius (the maximum distance between a query image and the retrieved images) is varied from 2 to 120. The experimental results indicate that our method outperforms all the compared methods. In Fig. 4, we show a real retrieval example on the CIFAR dataset. It can be seen that the retrieval results of our method are closer to the ground-truth results than those of the other considered methods.
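The evaluation protocol above reduces to simple set operations per query. The sketch below computes precision, recall and the F1-score of Eq. (26), assuming the retrieved set (items within the chosen Hamming radius) and the ground-truth relevant set (the top 5% Euclidean neighbors, or same-category items on CIFAR) are already available; the helper name is ours.

```python
def precision_recall_f1(retrieved: set, relevant: set):
    """Per-query precision, recall and the F1-score of Eq. (26)."""
    if not retrieved or not relevant:
        return 0.0, 0.0, 0.0
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy usage
print(precision_recall_f1({3, 7, 9, 21, 42}, {7, 9, 11, 42, 56, 77}))
```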


Fig. 2. Comparison of precision–recall curves on the NUS-WIDE and ANN-GIST datasets. (a) NUS-WIDE. (b) ANN-GIST.


Fig. 3. Comparison of precision–recall curves on the CIFAR dataset with different ground truths. (a) Nearest Neighbor as Ground-Truth. (b) Category Information as Ground-Truth.


[Fig. 4 rows, top to bottom: query; Ground-Truth; KSpH; LSH; KLSH; SH; SRH; ScH.]

Fig. 4. Examples of retrieval results on CIFAR dataset. We select images from the testing set as queries and use the 16 nearest neighbor images in the retrieval set as the Ground-Truth.


Fig. 5. Comparison of F1-scores with different γ and λ settings on the three datasets. (a) NUS-WIDE. (b) ANN-GIST. (c) CIFAR.

5.4. Parameter tuning and efficiency

Besides introducing kernel functions for the test data and the dictionary, the superior performance of our KSpH also comes from the non-negative sparsity constraint on the hash codes as well as the local geometry preserving constraint. How do these two constraints affect the performance of our KSpH? To study this further, we vary the sparsity-inducing weight λ, the local geometry preserving weight γ and the binary code length l, and evaluate the resulting performance. The parameter tuning experiments are conducted on subsets of all three datasets with 2000 training images and 500 test images, and we use the average F1-score for performance evaluation. Fig. 5 reports the experimental results of KSpH with λ = 0.01, 0.02, 0.05, 0.1, 0.2, 0.5 and γ = 0.01, 0.02, 0.05, 0.1, 0.2, 0.5. First, we observe that λ should be set to an appropriate value in order to obtain good results. Empirically, the optimal value of λ is 0.1 on the NUS-WIDE and ANN-GIST datasets, which produces approximately 25% non-zero bits in the hash code. With a larger λ, the obtained hash codes might be too sparse to preserve the original similarity information; in contrast, when λ is too small, some unrelated bits might be involved and thus degrade the retrieval performance. Fig. 5 also indicates that the performance can be further improved by selecting a proper γ, which integrates the local geometrical structure into the dictionary and hash codes. In general, these two parameters influence each other and should be tuned jointly on different datasets.

We further compare the F1-score with different code lengths. For a fair evaluation, we select, on each dataset, the Hamming radius that yields the highest F1-score. Taking the NUS-WIDE dataset as an example, we report the experimental results in terms of F1-score in Fig. 6, with code lengths varying from 64 to 256. The figure shows that the performance of KSpH continues to improve and outperforms the other methods as the code length increases.

Our KSpH method is also very efficient in calculating the hash codes for novel data points. The computational overhead mainly comes from two aspects: calculating the kernel matrix and obtaining the sparse codes. The former computation is similar to that of other kernel-based methods and can be further accelerated by approximation techniques; the latter is also efficient thanks to the FISTA method used to obtain the sparse coefficients. Moreover, for large-scale datasets, multi-core tools such as SPAMS (http://spams-devel.gforge.inria.fr) can also be introduced to speed up the computation. In our current implementation, calculating the 256-bit code for a query image takes less than 0.01 s, and an exhaustive nearest neighbor retrieval over a database with 1 million items can be finished within 1 s.
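Since λ and γ are tuned jointly, a plain grid search over the values reported in Fig. 5 is the simplest strategy. The sketch below assumes a hypothetical `evaluate(lam, gamma)` helper that trains KSpH on the tuning subset and returns the average F1-score; such a helper is ours, not part of the paper.

```python
def grid_search(evaluate,
                lams=(0.01, 0.02, 0.05, 0.1, 0.2, 0.5),
                gammas=(0.01, 0.02, 0.05, 0.1, 0.2, 0.5)):
    """Joint grid search over the sparsity weight lambda and the geometry weight gamma."""
    best = (None, None, -1.0)
    for lam in lams:
        for gamma in gammas:
            f1 = evaluate(lam, gamma)      # train on the tuning subset, return average F1
            if f1 > best[2]:
                best = (lam, gamma, f1)
    return best                            # (best lambda, best gamma, best F1-score)
```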



Fig. 6. Comparison of F1-scores with different code lengths on the NUS-WIDE dataset.

6. Conclusions

In this paper, we propose a hashing method for similarity search based on kernelized sparse coding. The hash codes generated by our method are made more effective and flexible by introducing sparsity constraints and a kernel function. Furthermore, the local geometrical structure of the data is well preserved when learning the dictionary. Extensive experiments on three large-scale image datasets demonstrate the superior performance of our proposed method over the tested state-of-the-art techniques.


Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments on this paper. This work was supported in part by the 973 Program (No. 2012CB316400), NSFC (Nos. 61402403 and 61103099), the 863 Program (2012AA012505), the Chinese Knowledge Center for Engineering Sciences and Technology, Zhejiang Provincial NSFC (No. LQ13F020001), and the Fundamental Research Funds for the Central Universities (2014QNA5008).

References

[1] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: NIPS, 2008, pp. 1753–1760.
[2] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: ICCV, 2009, pp. 2130–2137.
[3] B. Kulis, T. Darrell, Learning to hash with binary reconstructive embeddings, in: NIPS, 2009, pp. 1042–1050.
[4] R. Salakhutdinov, G.E. Hinton, Semantic hashing, Int. J. Approx. Reason. 50 (2009) 969–978.
[5] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, Department of Computer Science, University of Toronto, 2009.
[6] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[7] Y. Liu, F. Wu, Y. Yang, Y. Zhuang, A.G. Hauptmann, Spline regression hashing for fast image search, IEEE Trans. Image Process. 21 (10) (2012) 4480–4491.
[8] A. Cherian, V. Morellas, N. Papanikolopoulos, Robust sparse hashing, in: ICIP, 2012, pp. 2417–2420.
[9] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2916–2929.
[10] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, Y. Zhuang, Sparse multi-modal hashing, IEEE Trans. Multimed. 16 (2) (2014) 427–439.
[11] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, Y. Zhuang, Discriminative coupled dictionary hashing for fast cross-media retrieval, in: SIGIR, 2014, pp. 395–404.
[12] W. Kong, W.-J. Li, M. Guo, Manhattan hashing for large-scale image retrieval, in: SIGIR, 2012, pp. 45–54.
[13] C. Deng, H. Deng, X. Liu, Y. Yuan, Adaptive multi-bit quantization for hashing, Neurocomputing 151 (2015) 319–326. http://dx.doi.org/10.1016/j.neucom.2014.09.033.
[14] Q. Guo, Z. Zeng, S. Zhang, Adaptive bit allocation hashing for approximate nearest neighbor search, Neurocomputing 151 (2015) 719–728. http://dx.doi.org/10.1016/j.neucom.2014.10.042.
[15] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: ICML, 2009, pp. 689–696.
[16] W. Liu, J. Wang, R. Ji, Y. Jiang, S. Chang, Supervised hashing with kernels, in: CVPR, 2012, pp. 2074–2081.
[17] D. Zhang, J. Wang, D. Cai, J. Lu, Self-taught hashing for fast similarity search, in: SIGIR, 2010, pp. 18–25.
[18] L. Ma, J. Deng, H. Yang, Y. Hong, K. Wang, Urban landscape classification using Chinese advanced high-resolution satellite imagery and an object-oriented multi-variable model, Front. Inf. Technol. Electron. Eng. 16 (3) (2015) 238–248. http://dx.doi.org/10.1631/FITEE.1400083.
[19] J.D. Udayan, H. Kim, J. Kim, An image-based approach to the reconstruction of ancient architectures by extracting and arranging 3D spatial components, Front. Inf. Technol. Electron. Eng. 16 (1) (2015) 12–27. http://dx.doi.org/10.1631/FITEE.1400141.
[20] M. Datar, P. Indyk, Locality-sensitive hashing scheme based on p-stable distributions, in: Symposium on Computational Geometry, 2004, pp. 253–262.
[21] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: NIPS, 2006, pp. 801–808.
[22] D. Zhang, Z.-H. Zhou, S. Chen, Non-negative matrix factorization on kernels, in: PRICAI, 2006, pp. 404–412.
[23] H.V. Nguyen, V.M. Patel, N.M. Nasrabadi, R. Chellappa, Kernel dictionary learning, in: ICASSP, 2012, pp. 2021–2024.
[24] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
[25] X. Chen, S. Kim, Q. Lin, J. Carbonell, E. Xing, Graph-structured multi-task regression and an efficient optimization method for general fused lasso, arXiv preprint arXiv:1005.3579.
[26] X. Chen, Q. Lin, S. Kim, J.G. Carbonell, E.P. Xing, Smoothing proximal gradient method for general structured sparse learning, in: UAI, 2011, pp. 105–114.
[27] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202.
[28] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng, NUS-WIDE: a real-world web image database from National University of Singapore, in: CIVR, Santorini, Greece, 2009.
[29] H. Jégou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 117–128.
[30] C.D. Manning, P. Raghavan, H. Schuetze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2008.

Yin Zhang received his B.S. degree in Electrical Engineering and Ph.D. degree in Computer Science from Zhejiang University, Hangzhou, China, in 2002 and 2009, respectively. He is currently a faculty member of the College of Computer Science at Zhejiang University. His research interests lie in collaborative filtering, multimedia retrieval and their applications in digital libraries.

Weiming Lu received his B.Sc. and Ph.D. degrees in computer science from Zhejiang University in 2003 and 2009, respectively. He is currently an assistant professor with the College of Computer Science, Zhejiang University. His research interests mainly include digital libraries, unstructured data management and distributed systems.

Liu Yang received his B.Sc. and Ph.D. degrees from the Software College and the College of Computer Science at Zhejiang University, Hangzhou, China, in 2008 and 2013, respectively. He is currently working at Tencent's Shenzhen headquarters as a researcher. His current research interests include data mining, information retrieval and social network analysis.


Fei Wu received the B.S. degree from Lanzhou University, Lanzhou, Gansu, China, the M.S. degree from Macao University, Taipa, Macau, and the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently a professor with the College of Computer Science, Zhejiang University. His current research interests include multimedia analysis, retrieval, statistic learning, and pattern recognition.