Margin-based two-stage supervised hashing for image retrieval


Ye Liu (a,b), Yan Pan (a,c,*), Hanjiang Lai (a,b,f), Cong Liu (a,e), Jian Yin (a,b,d,*)

a School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
b School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China
c School of Software, Sun Yat-sen University, Guangzhou, China
d SYSU-CMU Shunde International Joint Research Institute, Foshan, China
e School of Advanced Computing, Sun Yat-sen University, Guangzhou, China
f Department of Electronic and Computer Engineering, National University of Singapore, Singapore

* Corresponding authors at: School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. E-mail addresses: [email protected] (Y. Pan), [email protected] (J. Yin).

Article history: Received 13 July 2015; received in revised form 10 March 2016; accepted 8 July 2016. Communicated by Liang Lin.

Keywords: Deep learning; Image retrieval; Image hashing; Neural network; Optimization algorithm

Abstract

Similarity-preserving hashing is a widely used method for nearest neighbor search in large-scale image retrieval. Recently, supervised hashing methods have become appealing in that they learn compact hash codes with fewer bits by incorporating supervised information. In this paper, we propose a new two-stage supervised hashing method that decomposes the hash learning process into a stage of learning approximate hash codes followed by a stage of learning hash functions. In the first stage, we propose a margin-based objective to find approximate hash codes such that a pair of hash codes associated with a pair of similar (dissimilar) images has a sufficiently small (large) Hamming distance. This objective results in a challenging optimization problem, and we develop a coordinate descent algorithm to solve it efficiently. In the second stage, we use convolutional neural networks to learn the hash functions. We conduct extensive evaluations on several benchmark datasets with different kinds of images. The results show that the proposed margin-based hashing method substantially improves upon state-of-the-art supervised and unsupervised hashing methods.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Large-scale image retrieval, the task of finding images that contain an object or scene similar to that in a query image, has attracted increasing interest due to the explosive growth of image data available on the web. Approximate nearest neighbor (ANN) search has become a popular technique for image retrieval on datasets with millions or billions of images.

A notable stream of efficient ANN search methods is learning-to-hash, i.e., learning to compress data points (e.g., images) into binary representations such that semantically similar data points have nearby binary codes. Existing learning-to-hash methods can be divided into two main categories: unsupervised methods and supervised methods. Unsupervised methods (e.g., [1-3]) learn a set of hash functions from unlabeled data without any side information. Supervised methods (e.g., [4-7]) try to learn compact hash codes by leveraging supervised information on data points (e.g., similarities on pairs of images).

Among the various supervised learning-to-hash methods for images, an emerging stream is the two-stage methods [4,7], in which the hash learning process is divided into two stages: (1) learning approximate hash codes that preserve the similarities on pairs of images, and (2) learning a set of hash functions from the input images so that these hash functions can generate the learned approximate hash codes. Such methods are appealing in that the learning problem in their second stage is a standard multi-task binary classification problem that can be solved by off-the-shelf classifiers (e.g., kernel methods [7] or deep neural networks [4]). In particular, with the advance of deep learning in the last few years, convolutional neural networks have made dramatic progress in image recognition and detection, and the two-stage methods provide a way to boost the performance of hashing by leveraging these successful deep models.

One of the main challenges in two-stage methods is how to find approximate hash codes (each of which is associated with an input image) that accurately preserve the similarities on pairs of images. The existing two-stage hashing methods (e.g., [7,4]) usually try to learn the approximate hash codes by pursuing the minimum (maximum) Hamming distance between a pair of hash codes associated with a pair of semantically similar (dissimilar) images. More specifically, for a pair of images x and y whose


q-bit approximate hash codes are H_x and H_y, if x and y are semantically similar, the existing two-stage methods try to make ∥H_x − H_y∥_H → 0 (the minimum distance, where ∥·∥_H denotes the Hamming distance); if x and y are dissimilar, they try to make ∥H_x − H_y∥_H → q (the maximum distance). However, pursuing the minimum (maximum) Hamming distance for similar (dissimilar) pairs may be too harsh and can degrade the quality of hashing.

In this paper, we propose to learn approximate hash codes that satisfy a more flexible setting: for two similar (dissimilar) data points, the Hamming distance between their corresponding hash codes only needs to be sufficiently small (large), i.e., if x and y are similar, one requires ∥H_x − H_y∥_H ≤ r_1; if x and y are dissimilar, one requires ∥H_x − H_y∥_H ≥ r_2, where 0 ≤ r_1 < r_2 ≤ q. This new setting is a generalization of the setting used in [5,7,4] (which corresponds to r_1 = 0 and r_2 = q). Under this new setting, the corresponding objective for approximate hash code learning is difficult to optimize, due to its non-convexity and the presence of an element-wise multiplication operator. To address this issue, we propose a novel random coordinate descent algorithm that solves the corresponding optimization problem efficiently. After obtaining the approximate hash codes, we learn a set of hash functions via deep convolutional networks.

We conduct extensive evaluations of the proposed method on several benchmark datasets with different kinds of images. The experimental results indicate that the proposed method has superior performance compared with several state-of-the-art supervised and unsupervised methods.
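To make the margin-based setting concrete, the following toy NumPy snippet (the variable names are ours) checks the two conditions for a pair of 8-bit codes; with r_1 = 0 and r_2 = q it degenerates to the strict setting used by the earlier two-stage methods.

```python
import numpy as np

def hamming(hx, hy):
    """Hamming distance between two {+1, -1} code vectors."""
    return int(np.sum(hx != hy))

# Toy 8-bit codes and margins (q = 8): a similar pair should satisfy
# hamming <= r1, and a dissimilar pair should satisfy hamming >= r2.
hx = np.array([1, 1, -1, 1, -1, -1, 1, 1])
hy = np.array([1, 1, -1, -1, -1, -1, 1, 1])
r1, r2 = 2, 5
print(hamming(hx, hy) <= r1)   # True: acceptable as a similar pair
print(hamming(hx, hy) >= r2)   # False: not separated enough for a dissimilar pair
```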

2. Related work

Due to its encouraging efficiency in search speed, hashing has become a popular ANN search technique in large-scale image retrieval. Hashing methods can be divided into data-independent hashing and data-dependent hashing.

Early efforts mainly focused on data-independent hashing. For example, the notable Locality-Sensitive Hashing (LSH) [8] methods construct hash functions by random projections or random permutations that are independent of the data points. The main limitation of data-independent methods is that they usually require long hash codes to obtain good performance; however, long hash codes lead to inefficient search due to large storage requirements and low recall rates.

Data-dependent hashing, a.k.a. learning-to-hash, pursues a compact binary representation with fewer bits learned from the training data. According to whether side information is used, learning-to-hash methods can be divided into two sub-categories: unsupervised methods and supervised methods.

Unsupervised methods try to learn a set of similarity-preserving hash functions only from unlabeled data. Some representative methods in this sub-category are the following. Kernelized LSH (KLSH) [2] generalizes LSH to accommodate arbitrary kernel functions, making it possible to learn hash functions that preserve the similarity of data points in a kernel space. Semantic hashing [9] generates hash functions by a deep auto-encoder built by stacking multiple restricted Boltzmann machines (RBMs). Graph-based hashing methods, such as Spectral Hashing [10] and Anchor Graph Hashing [3], learn non-linear mappings as hash functions that try to preserve the similarities within the data neighborhood graph. In order to reduce quantization errors, Iterative Quantization (ITQ) [1] seeks an orthogonal rotation matrix that is applied to the data matrix after principal component analysis projection.

Supervised methods aim to learn better bitwise representations by incorporating supervised information. The commonly used supervised information is in the form of pairwise labels

that indicate the semantic similarity/dissimilarity between data points. Binary Reconstruction Embedding (BRE) [6] learns hash functions by explicitly minimizing the reconstruction error between the original distances of data points and the Hamming distances of the corresponding binary codes. Minimal Loss Hashing (MLH) [11] learns similarity-preserving hash codes by minimizing a hinge-like loss function formulated as structured prediction with latent variables. Supervised Hashing with Kernels (KSH) [5] is a kernel-based supervised method which learns to hash data points to compact binary codes whose Hamming distances are minimized on similar pairs and maximized on dissimilar pairs. Column Generation Hash (CGHash) [12] is a column-generation-based method that learns hash functions with proximity comparison information. Semi-Supervised Hashing (SSH) [13] learns hash functions by minimizing similarity errors on the labeled data while simultaneously maximizing the entropy of the learned hash codes over the unlabeled data.

Within supervised hashing, an emerging stream is the two-stage methods [7,4], in which the learning process is decomposed into two stages: (1) learning approximate hash bits (each piece of hash bits is associated with a data point in the training set) from the supervised information, and (2) learning hash functions that can generate the learned approximate hash bits from the training data. The main advantage of two-stage methods is that the learning problem in their second stage can be regarded as multi-task binary classification, which can be solved by off-the-shelf classifiers. For example, Two-Step Hashing (TSH) [7] uses kernel methods to learn hash functions in its second stage, while CNNH [4] uses deep convolutional networks to learn hash functions for images in its second stage. However, the existing two-stage methods [7,4] learn the approximate hash codes by minimizing (maximizing) the Hamming distance on similar (dissimilar) data points, which may be overly restrictive and degrade the hashing performance.

The proposed method in this paper belongs to the family of two-stage methods. Compared to the existing two-stage methods, it learns approximate hash codes by minimizing a margin-based objective that only pursues sufficiently small (large) Hamming distances on similar (dissimilar) data points. We empirically show that this new setting can boost the quality of hashing.

3. The approach

In this paper, we follow the notation used in [4]. We are given a training set of n images \mathcal{I} = \{I_1, I_2, \ldots, I_n\} and its corresponding pairwise similarity matrix S, defined by

S_{ij} = \begin{cases} +1, & I_i, I_j \text{ are semantically similar}, \\ -1, & I_i, I_j \text{ are semantically dissimilar}, \end{cases}

where S_ij ∈ {−1, 1} indicates the similarity between the ith and the jth images. The goal of supervised hashing is to learn a set of q hash functions based on S and \mathcal{I}. Each hash function maps an input image to {1, −1} and accounts for generating one binary hash bit.

Some supervised hashing methods formulate the whole learning problem as a single objective, resulting in a complex and difficult optimization problem. In contrast, the two-stage methods (e.g., [4,7]) decompose the learning process into two stages, an approximate hash code learning stage followed by a hash function learning stage, where the corresponding optimization problem in each stage is relatively simple. The proposed method belongs to the two-stage methods.
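As a concrete illustration (not part of the original formulation), the similarity matrix S defined above can be built from class labels with a few lines of NumPy; for multi-label data, two images sharing at least one label are marked as similar:

```python
import numpy as np

# Single-label case: images are similar iff they share the same class label.
labels = np.array([0, 0, 1, 2, 1])                       # toy labels for n = 5 images
S = np.where(labels[:, None] == labels[None, :], 1, -1)  # S_ij in {+1, -1}

# Multi-label case: Y is an (n, num_labels) binary indicator matrix;
# two images are similar if they share at least one label.
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
S_multi = np.where((Y @ Y.T) > 0, 1, -1)
```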


Table 1
MAP of Hamming ranking w.r.t. different numbers of bits on three datasets. The results of MBH and CNNH are the average of 10 trials.

           MNIST (MAP)                      CIFAR-10 (MAP)                   ILSVRC-2012 (MAP)
Method     12 bits 24 bits 32 bits 48 bits  12 bits 24 bits 32 bits 48 bits  12 bits 24 bits 32 bits 48 bits
MBH        0.969   0.971   0.971   0.970    0.481   0.514   0.518   0.536    0.239   0.280   0.298   0.291
CNNH       0.958   0.961   0.956   0.956    0.449   0.471   0.485   0.495    0.198   0.255   0.256   0.258
KSH        0.871   0.890   0.895   0.899    0.301   0.335   0.343   0.353    N/A     N/A     N/A     N/A
ITQ-CCA    0.658   0.692   0.712   0.723    0.262   0.281   0.287   0.296    N/A     N/A     N/A     N/A
MLH        0.473   0.665   0.653   0.655    0.181   0.196   0.208   0.212    N/A     N/A     N/A     N/A
BRE        0.516   0.595   0.612   0.635    0.158   0.182   0.192   0.195    N/A     N/A     N/A     N/A
SH         0.266   0.268   0.258   0.252    0.132   0.136   0.135   0.131    N/A     N/A     N/A     N/A
ITQ        0.389   0.435   0.423   0.428    0.163   0.168   0.173   0.176    N/A     N/A     N/A     N/A
LSH        0.188   0.208   0.236   0.242    0.122   0.128   0.121   0.122    N/A     N/A     N/A     N/A

Table 2
MAP of Hamming ranking w.r.t. different numbers of bits on the ILSVRC-2012 dataset: sensitivity to the margin parameter r1. The results of MBH and CNNH are the average of 10 trials.

                      ILSVRC-2012 (MAP)
Method                12 bits  24 bits  32 bits  48 bits
CNNH                  0.198    0.255    0.256    0.258
MBH (r1=0, r2=5)      0.239    0.266    0.269    0.268
MBH (r1=1, r2=5)      0.223    0.267    0.261    0.262
MBH (r1=2, r2=5)      0.219    0.263    0.258    0.257

Table 3
MAP of Hamming ranking w.r.t. different numbers of bits on the ILSVRC-2012 dataset: sensitivity to the margin parameter r2. The results of MBH and CNNH are the average of 10 trials.

                      ILSVRC-2012 (MAP)
Method                12 bits  24 bits  32 bits  48 bits
CNNH                  0.198    0.255    0.256    0.258
MBH (r1=0, r2=5)      0.239    0.266    0.269    0.268
MBH (r1=0, r2=10)     0.189    0.280    0.281    0.262
MBH (r1=0, r2=15)     N/A      0.263    0.270    0.291
MBH (r1=0, r2=20)     N/A      0.267    0.298    0.275

Table 4
MAP of Hamming ranking w.r.t. different numbers of bits on the large SVHN dataset. The results of MBH and CNNH are the average of 10 trials.

           SVHN (MAP)
Method     12 bits  24 bits  32 bits  48 bits
MBH        0.901    0.916    0.928    0.925
CNNH       0.893    0.899    0.901    0.895
KSH        0.458    0.538    0.561    0.576
ITQ-CCA    0.426    0.479    0.480    0.506
MLH        0.145    0.242    0.251    0.272
BRE        0.163    0.201    0.220    0.235
SH         0.139    0.136    0.132    0.139
ITQ        0.129    0.130    0.125    0.138
LSH        0.107    0.121    0.119    0.126

3.1. Step 1: Learning approximate hash codes

3.1.1. Margin-based objective

Suppose S is the given similarity matrix on the images \mathcal{I}, and H_{i\cdot}, H_{j\cdot} ∈ {1, −1}^q are the q-bit approximate hash codes for the images I_i and I_j, respectively. We define H as an n × q binary matrix whose kth row is H_{k\cdot}. The existing two-stage methods [7,4] try to find similarity-preserving approximate hash codes by minimizing the following objective:

    \min_{H} \left\| S - \frac{1}{q} H H^{\top} \right\|_F^2,    (1)

which is known to be equivalent [5] to

    \min_{H_{i\cdot}, H_{j\cdot}} \sum_{i,j} \left( \|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}} - \frac{q}{2}(1 - S_{ij}) \right)^2,    (2)

where \|\cdot\|_{\mathcal{H}} represents the Hamming distance. The objective in (2) tries to minimize (maximize) the Hamming distance between the ith and jth hash codes when I_i and I_j are semantically similar (dissimilar), i.e.,

    \|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}} \to 0 \ \text{ if } S_{ij} = 1, \qquad \|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}} \to q \ \text{ if } S_{ij} = -1.    (3)

However, in such a formulation, it is arguable that pursuing the minimum (maximum) Hamming distance on similar (dissimilar) pairs may be overly restrictive. A natural and more flexible setting is that, with I_i and I_j being similar (dissimilar), the Hamming distance of their corresponding approximate hash codes only needs to be sufficiently small (large), i.e.,

    \|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}} \le r_1 \ \text{ if } S_{ij} = 1, \qquad \|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}} \ge r_2 \ \text{ if } S_{ij} = -1,    (4)

where r_1 and r_2 are margin parameters that satisfy 0 ≤ r_1 < r_2 ≤ q. Obviously, the setting in (3) can be regarded as a special case of (4), with r_1 = 0 and r_2 = q.

Under the margin-based setting in (4), we learn the approximate hash codes for the training images in \mathcal{I} by minimizing the following objective:

    \min_{H \in \{-1,1\}^{n \times q}} \sum_{(i,j) \in \mathcal{D}} \max\big(0,\; r_2 - 2\|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}}\big)^2 + \sum_{(i,j) \in \mathcal{S}} \max\big(0,\; 2\|H_{i\cdot} - H_{j\cdot}\|_{\mathcal{H}} - r_1\big)^2
    = \min_{H \in \{-1,1\}^{n \times q}} \sum_{(i,j) \in \mathcal{D}} \max\big(0,\; r_2 - \tfrac{1}{2}\|H_{i\cdot} - H_{j\cdot}\|_F^2\big)^2 + \sum_{(i,j) \in \mathcal{S}} \max\big(0,\; \tfrac{1}{2}\|H_{i\cdot} - H_{j\cdot}\|_F^2 - r_1\big)^2
    = \min_{H \in \{-1,1\}^{n \times q}} \sum_{(i,j) \in \mathcal{D}} \max\big(0,\; H_{i\cdot} H_{j\cdot}^{\top} - (q - r_2)\big)^2 + \sum_{(i,j) \in \mathcal{S}} \max\big(0,\; (q - r_1) - H_{i\cdot} H_{j\cdot}^{\top}\big)^2
    = \min_{H \in \{-1,1\}^{n \times q}} \big\| \max\big(0,\; D \odot (H H^{\top} - M)\big) \big\|_F^2,    (5)

where \|\cdot\|_F is the Frobenius norm and \odot is the element-wise multiplication operator. \mathcal{S} (\mathcal{D}) represents the set of similar (dissimilar) image pairs, respectively. D is an indicator matrix in which D_ij = −1 if the images I_i and I_j are semantically similar and D_ij = 1 if I_i and I_j are semantically dissimilar. M is a matrix with M_ij = q − r_1 if I_i and I_j are similar and M_ij = q − r_2 if I_i and I_j are dissimilar. We denote by 0 the n × n matrix with all zeros. Note that we use the fact that \|x - y\|_{\mathcal{H}} = \frac{1}{4}\|x - y\|_F^2 for x, y ∈ {−1, 1}^q.
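To make the matrix form in Eq. (5) concrete, the following NumPy sketch (the function name is ours, and the sign convention for M follows the derivation above) builds D and M from S and evaluates the objective for a given code matrix H:

```python
import numpy as np

def margin_loss(H, S, q, r1, r2):
    """Evaluate the matrix form of the margin-based objective, Eq. (5).

    H : (n, q) array of (relaxed) hash codes with entries in [-1, 1]
    S : (n, n) pairwise similarity matrix with entries in {+1, -1}
    """
    D = np.where(S > 0, -1.0, 1.0)        # -1 for similar pairs, +1 for dissimilar pairs
    M = np.where(S > 0, q - r1, q - r2)   # margin matrix (assumed sign convention)
    E = np.maximum(0.0, D * (H @ H.T - M))
    return float(np.sum(E ** 2))          # squared Frobenius norm
```

For a binary H the diagonal terms vanish (since r_1 ≥ 0), so this value reduces to the pairwise sums over similar and dissimilar pairs in the first line of Eq. (5), taken over ordered pairs.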


Fig. 1. The results on MNIST. (a) Precision curves within Hamming radius 2 w.r.t. the number of bits; (b) precision-recall curves of Hamming ranking with 48 bits; (c) precision curves with 48 bits w.r.t. different numbers of top returned samples. (Compared methods: MBH, CNNH, KSH, ITQ-CCA, MLH, BRE, ITQ, SH, LSH.)

Fig. 2. The results on CIFAR-10. (a) Precision curves within Hamming radius 2 w.r.t. the number of bits; (b) precision-recall curves of Hamming ranking with 48 bits; (c) precision curves with 48 bits w.r.t. different numbers of top returned samples. (Compared methods: MBH, CNNH, KSH, ITQ-CCA, MLH, BRE, ITQ, SH, LSH.)

Fig. 3. The results on ILSVRC-2012. (a) Precision curves within Hamming radius 2 w.r.t. the number of bits; (b) precision-recall curves of Hamming ranking with 48 bits; (c) precision curves with 48 bits w.r.t. different numbers of top returned samples. (Compared methods: MBH, CNNH.)

Table 5
Comparison of average testing time (milliseconds per query image) on the large SVHN dataset. The results of MBH and CNNH are the average of 10 trials.

           SVHN (ms)
Method     12 bits  24 bits  32 bits  48 bits
MBH        3.253    3.569    3.762    3.985
CNNH       3.262    3.558    3.812    3.762


3.1.2. Optimization algorithm

It is difficult to directly optimize the objective in (5) due to the integer constraints on H. A commonly used relaxation is to replace the binary constraints H ∈ {−1, 1}^{n×q} by the range constraints H ∈ [−1, 1]^{n×q}:

    \min_{H} \; \big\| \max\big(0,\; D \odot (H H^{\top} - M)\big) \big\|_F^2 \quad \text{s.t.} \quad H \in [-1, 1]^{n \times q}.    (6)

It is still challenging to optimize (6) due to the non-convex term HH^T and the element-wise max and multiplication operators. To address this issue, we propose to solve (6) by a random coordinate descent algorithm using Newton directions. This algorithm sequentially or randomly chooses one entry of H to update while keeping the other entries fixed. Specifically, in each iteration we update only one entry (e.g., H_ij) of H with the other entries fixed. Let H = [H_{\cdot 1}, H_{\cdot 2}, \ldots, H_{\cdot q}],


where H_{\cdot j} represents the jth column of H. The objective in (6) can be rewritten as:

    \big\| \max\big(0,\; D \odot \big( H_{\cdot j} H_{\cdot j}^{\top} - ( M - \textstyle\sum_{c \ne j} H_{\cdot c} H_{\cdot c}^{\top} ) \big) \big) \big\|_F^2
    = \big\| \max\big(0,\; D \odot ( H_{\cdot j} H_{\cdot j}^{\top} - R ) \big) \big\|_F^2
    = \sum_{l=1}^{n} \sum_{k=1}^{n} \big( \max\big(0,\; D_{lk} ( H_{lj} H_{kj} - R_{lk} ) \big) \big)^2,

where we set R = M - \sum_{c \ne j} H_{\cdot c} H_{\cdot c}^{\top}. Since M is symmetric, it is easy to verify that R is also symmetric. By fixing the other entries of H, the objective w.r.t. H_ij can be rewritten as:

    g(H_{ij}) = \sum_{l=1}^{n} \sum_{k=1}^{n} \big( \max\big(0,\; D_{lk} ( H_{lj} H_{kj} - R_{lk} ) \big) \big)^2
    = \max\big(0,\; D_{ii} ( H_{ij}^2 - R_{ii} ) \big)^2 + 2 \sum_{k \ne i} \max\big(0,\; D_{ik} ( H_{ij} H_{kj} - R_{ik} ) \big)^2 + \text{constant},

where we use the fact that R is symmetric. Suppose we update H_ij to H_ij + d; the corresponding optimization problem is:

    \min_{d} \; g(H_{ij} + d) \quad \text{s.t.} \quad -1 \le H_{ij} + d \le 1.

To simplify the search for d, we approximate g(H_ij + d) by a quadratic function via Taylor expansion:

    \hat{g}(H_{ij} + d) = g(H_{ij}) + g'(H_{ij})\, d + \frac{1}{2} g''(H_{ij})\, d^2,    (7)

where g'(H_ij) (g''(H_ij)) is the first (second) order derivative of g at H_ij:

    g'(H_{ij}) = 4 \sum_{k=1}^{n} \max\big(0,\; D_{ik} ( H_{ij} H_{kj} - R_{ik} ) \big)\, D_{ik} H_{kj},    (8)

    g''(H_{ij}) = D_{ii}^2 (12 H_{ij}^2 - 4 R_{ii})\, I_{D_{ii}(H_{ij} H_{ij} - R_{ii}) > 0} + 4 \sum_{k \ne i} D_{ik}^2 H_{kj}^2\, I_{D_{ik}(H_{ij} H_{kj} - R_{ik}) > 0}
    = (12 H_{ij}^2 - 4 R_{ii})\, I_{D_{ii}(H_{ij} H_{ij} - R_{ii}) > 0} + 4 \sum_{k \ne i} H_{kj}^2\, I_{D_{ik}(H_{ij} H_{kj} - R_{ik}) > 0}
    = (8 H_{ij}^2 - 4 R_{ii})\, I_{D_{ii}(H_{ij} H_{ij} - R_{ii}) > 0} + 4 \sum_{k=1}^{n} H_{kj}^2\, I_{D_{ik}(H_{ij} H_{kj} - R_{ik}) > 0},    (9)

where I_x is an indicator function with I_x = 1 if x is true and I_x = 0 otherwise. Note that we use the fact that D_ij ∈ {1, −1} to simplify g''(H_ij). By setting the derivative of (7) w.r.t. d to zero, we have d = -g'(H_{ij}) / g''(H_{ij}). Hence, the solution to the objective

    \min_{d} \; \hat{g}(H_{ij} + d) \quad \text{s.t.} \quad -1 \le H_{ij} + d \le 1    (10)

is

    d = \max\left( -1 - H_{ij},\; \min\left( -\frac{g'(H_{ij})}{g''(H_{ij})},\; 1 - H_{ij} \right) \right),    (11)

which can be used in the update rule H_ij ← H_ij + d. The sketch of the proposed coordinate descent algorithm is shown in Algorithm 1. Since the objective (6) is non-convex w.r.t. H, Algorithm 1 is a greedy algorithm and cannot guarantee to decrease (6) in each iteration. Hence, similar to existing methods [5,7,6], if some iteration of the outer loop increases the value of (6), the algorithm does not update H (and L, in the accelerated variant) and continues the loop (see the IF-ELSE block in the outer loop of Algorithm 1).

Algorithm 1. Coordinate descent algorithm for hash bit learning.
Input: S, D ∈ {−1, 1}^{n×n}, the number of bits q in a target hash code, the tolerance error ϵ, maximum iterations T.
Initialize: randomly initialize H ∈ [−1, 1]^{n×q}; t ← 0, H(0) ← H.
for t = 1, ..., T do
    Decide the order of the n × q indices (i, j) by random permutation (i = 1, ..., n, j = 1, ..., q).
    for each of the n × q indices (i, j) do
        Select the entry H_ij to update.
        Calculate g'(H_ij) and g''(H_ij) by (8) and (9).
        Calculate d by (11) and update H_ij by H_ij ← H_ij + d.
    end for
    t ← t + 1, calculate the objective value F(t).
    if F(t) ≤ F(t − 1) then H(t) ← H, else H(t) ← H(t − 1), continue.
    if the relative change (F(t − 1) − F(t)) / F(t − 1) ≤ ϵ, then break.
end for
Output: the sign matrix of H, each of whose elements is either 1 or −1.

Algorithm 2. Accelerated coordinate descent algorithm for hash bit learning.
Input: S, D ∈ {−1, 1}^{n×n}, the number of bits q in a target hash code, the tolerance error ϵ, maximum iterations T.
Initialize: randomly initialize H ∈ [−1, 1]^{n×q}; L ← HH^T − M, t ← 0, F(0) ← ∥max(0, D ⊙ L)∥_F^2, H(0) ← H, L(0) ← L.
for t = 1, ..., T do
    Decide the order of the n × q indices (i, j) by random permutation (i = 1, ..., n, j = 1, ..., q).
    for each of the n × q indices (i, j) do
        Select the entry H_ij to update.
        Calculate g'(H_ij) and g''(H_ij) by (12).
        Calculate d by (11) and update H_ij by H_ij ← H_ij + d.
        Update L by (13).
    end for
    t ← t + 1, F(t) ← ∥max(0, D ⊙ L)∥_F^2.
    if F(t) ≤ F(t − 1) then H(t) ← H, L(t) ← L, else H(t) ← H(t − 1), L(t) ← L(t − 1), continue.
    if the relative change (F(t − 1) − F(t)) / F(t − 1) ≤ ϵ, then break.
end for
Output: the sign matrix of H, each of whose elements is either 1 or −1.
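For concreteness, the single-entry update repeated inside Algorithms 1 and 2 can be written out directly from Eqs. (8), (9) and (11). The following NumPy sketch forms R explicitly, which costs O(n^2) per call and is exactly what the accelerated variant in Section 3.1.3 avoids; the function name and the zero-curvature guard are ours:

```python
import numpy as np

def newton_step_entry(H, D, M, i, j):
    """One coordinate update of H[i, j] following Eqs. (8), (9) and (11).

    H : (n, q) real matrix with entries in [-1, 1]
    D : (n, n) indicator matrix (-1 for similar, +1 for dissimilar pairs)
    M : (n, n) margin matrix
    """
    # R = M - sum_{c != j} H[:, c] H[:, c]^T
    R = M - (H @ H.T - np.outer(H[:, j], H[:, j]))
    resid = D[i, :] * (H[i, j] * H[:, j] - R[i, :])   # D_ik (H_ij H_kj - R_ik)
    active = resid > 0

    # Eq. (8): g'(H_ij) = 4 * sum_k max(0, resid_k) * D_ik * H_kj
    g1 = 4.0 * np.sum(np.maximum(0.0, resid) * D[i, :] * H[:, j])

    # Eq. (9): g''(H_ij) = (8 H_ij^2 - 4 R_ii) I[...] + 4 sum_k H_kj^2 I[...]
    g2 = (8.0 * H[i, j] ** 2 - 4.0 * R[i, i]) * float(active[i]) \
         + 4.0 * np.sum((H[:, j] ** 2) * active)
    if g2 <= 0.0:
        return 0.0   # added guard: no active terms, leave the entry unchanged

    # Eq. (11): clip the Newton step so that H_ij + d stays in [-1, 1]
    return max(-1.0 - H[i, j], min(-g1 / g2, 1.0 - H[i, j]))
```

Applying this step over a random permutation of all n × q entries of H gives one outer iteration of Algorithm 1.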


3.1.3. Efficient implementation

Note that directly calculating the matrix R = M - \sum_{c \ne j} H_{\cdot c} H_{\cdot c}^{\top} \in \mathbb{R}^{n \times n} needs O(n^2) time, which is time-consuming when the number n of training images is large. To tackle this issue, we do not calculate R explicitly. Specifically, we maintain a matrix L = HH^T − M and update L in each iteration at low cost. Given L, g'(H_ij) and g''(H_ij) can be calculated by:

    g'(H_{ij}) = 4 \max\big(0,\; D_{i\cdot} \odot L_{i\cdot}\big) \big( D_{\cdot i} \odot H_{\cdot j} \big),
    g''(H_{ij}) = 4 \sum_{k=1}^{n} H_{kj}^2\, I_{D_{ik} L_{ik} > 0} + 4 \big( H_{ij}^2 + L_{ii} \big)\, I_{D_{ii} L_{ii} > 0},    (12)

where L_{i\cdot} is the ith row of L. Hence, we can calculate g'(H_ij) and g''(H_ij) in O(n) time. After H_ij is updated, it is easy to verify that only the ith row and the ith column of L are affected. Specifically, given the update H_ij ← H_ij + d, we update L by:

    L_{i\cdot} \leftarrow L_{i\cdot} + d\, H_{\cdot j}^{\top}, \qquad L_{\cdot i} \leftarrow L_{\cdot i} + d\, H_{\cdot j}, \qquad L_{ii} \leftarrow L_{ii} + d^2.    (13)

Hence, the time complexity of updating L is also O(n). The sketch of the accelerated algorithm is shown in Algorithm 2.
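Putting the clipped Newton step (11) together with the O(n) derivative formulas (12) and the update rule (13), the accelerated procedure of Algorithm 2 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' released code: the function name, the default hyperparameters, the random initialization details, the assumed sign convention of M, and the small numerical guards are ours.

```python
import numpy as np

def learn_hash_bits(S, q, r1, r2, T=50, eps=1e-4, seed=0):
    """Sketch of the accelerated coordinate descent (Algorithm 2).

    S : (n, n) pairwise similarity matrix with entries in {+1, -1}
    q : number of hash bits; r1, r2 : margin parameters, 0 <= r1 < r2 <= q
    Returns an (n, q) matrix of approximate hash bits in {+1, -1}.
    """
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    D = np.where(S > 0, -1.0, 1.0)              # -1 for similar, +1 for dissimilar pairs
    M = np.where(S > 0, q - r1, q - r2)         # margin matrix (assumed sign convention)
    H = rng.uniform(-1.0, 1.0, size=(n, q))
    L = H @ H.T - M                             # maintained residual matrix (Section 3.1.3)
    F_prev = np.sum(np.maximum(0.0, D * L) ** 2)

    for _ in range(T):
        H_old, L_old = H.copy(), L.copy()
        for idx in rng.permutation(n * q):      # random order over all entries of H
            i, j = divmod(idx, q)
            active = (D[i, :] * L[i, :]) > 0
            # Eq. (12): first and second derivatives of g at H_ij, in O(n) time
            g1 = 4.0 * np.sum(np.maximum(0.0, D[i, :] * L[i, :]) * D[i, :] * H[:, j])
            g2 = 4.0 * np.sum((H[:, j] ** 2) * active) \
                 + 4.0 * (H[i, j] ** 2 + L[i, i]) * float(active[i])
            if g2 <= 0.0:
                continue                        # added guard: no active terms for this entry
            # Eq. (11): clipped Newton step keeping H_ij inside [-1, 1]
            d = max(-1.0 - H[i, j], min(-g1 / g2, 1.0 - H[i, j]))
            # Eq. (13): O(n) update of L using the column H[:, j] before changing H_ij
            L[i, :] += d * H[:, j]
            L[:, i] += d * H[:, j]
            L[i, i] += d * d
            H[i, j] += d
        F = np.sum(np.maximum(0.0, D * L) ** 2)
        if F > F_prev:                          # objective increased: discard this sweep
            H, L = H_old, L_old
            continue
        if (F_prev - F) / max(F_prev, 1e-12) <= eps:
            break
        F_prev = F
    return np.where(H > 0, 1, -1)
```

The returned sign matrix plays the role of the approximate hash codes used as training targets in the second stage.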

3.2. Step 2: Learning the hash functions

With the approximate hash codes H learned in Step 1, in the second step we learn a set of hash functions that can encode an image I_i into its hash code H_{i\cdot}. Each hash function is a mapping of the form h: \mathcal{X} → {−1, 1}, where \mathcal{X} represents the space of input images. Hence, each hash function can be regarded as a binary classifier on images, and learning a set of q hash functions is a multi-task binary classification problem to which many off-the-shelf classifiers can be applied, such as kernel methods [7] or deep neural networks [4].

In the last few years we have witnessed a revolution in computer vision, mainly due to the dramatic progress of deep convolutional neural networks (CNNs) [14]. Methods based on deep CNNs have achieved state-of-the-art performance in object categorization, object detection, and other image recognition tasks. In this paper, we adopt deep convolutional networks to learn the hash functions, using the architecture of [14] as our basic framework. Our network has three convolution-pooling layers with rectified linear activation, max pooling and local contrast normalization, a standard fully connected layer, and an output layer with softmax activation. We use 32, 64 and 128 filters (of size 5 × 5) in the 1st, 2nd and 3rd convolutional layers, respectively, and apply dropout [15] with a rate of 0.5 to the fully connected layer.

After the deep network is trained, it can be used to generate a q-bit hash code for an input image. Specifically, in prediction, an input image x is first encoded into a q-dimensional feature vector \mathcal{F}(x). Then one can obtain a q-bit binary code by the simple quantization b = sign(\mathcal{F}(x)), where sign(v) is the element-wise sign function on vectors: for i = 1, 2, ..., q, sign(v_i) = 1 if v_i > 0.5, and sign(v_i) = −1 otherwise.
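The following PyTorch-style sketch illustrates the layer pattern described above together with the 0.5-threshold quantization (the paper's implementation is in Caffe and follows the convolution-pooling stack of [4]); the class and function names, the padding, the width of the fully connected layer, and the sigmoid output used in place of the paper's softmax layer are assumptions made only to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class HashNet(nn.Module):
    """Illustrative three conv-pooling-layer network with a q-bit output head."""
    def __init__(self, q: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(), nn.Dropout(0.5),  # fully connected layer, dropout 0.5
            nn.LazyLinear(q), nn.Sigmoid(),                  # q outputs in (0, 1)
        )

    def forward(self, x):                      # x: (batch, 3, 32, 32) RGB images
        return self.head(self.features(x))

def binarize(outputs: torch.Tensor) -> torch.Tensor:
    """Quantize network outputs into {+1, -1}: a bit is +1 iff the output exceeds 0.5."""
    return torch.where(outputs > 0.5,
                       torch.ones_like(outputs),
                       -torch.ones_like(outputs))
```

In use, a trained network produces codes via something like `codes = binarize(HashNet(q=48)(images))`, mirroring the quantization b = sign(\mathcal{F}(x)) above.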

4. Experiments

4.1. Experimental settings

We conduct extensive evaluations of the proposed method on four benchmark datasets:

(1) The MNIST dataset (http://yann.lecun.com/exdb/mnist/) for handwritten digit recognition, which consists of 70K 28 × 28 greyscale images of the digits '0' to '9'.
(2) The CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html), which consists of 60,000 32 × 32 color images in 10 classes.
(3) A subset of the ILSVRC-2012 [16] dataset that contains 40,500 images in 30 classes (we choose the first 30 classes of ILSVRC-2012 by alphabetical order of the class names). Each of these images is associated with one or multiple labels; two images are regarded as similar if they share at least one label. We resize the images of this subset to 64 × 64.
(4) The SVHN dataset (http://ufldl.stanford.edu/housenumbers/), a large real-world image dataset for recognizing digits and numbers in natural scene images, which consists of over 600,000 32 × 32 color images in 10 classes.

We call the proposed margin-based hashing method MBH. We test and compare the search accuracies of MBH with eight state-of-the-art hashing methods, including three unsupervised methods, LSH [8], SH [10] and ITQ [1], and five supervised methods, CNNH [4], KSH [5], MLH [11], BRE [6] and ITQ-CCA [1]. In each dataset, we randomly select 1000 images (100 images per class) as the test query set. For the unsupervised methods, we use the remaining images as training samples. For the supervised methods, we randomly select 5000 images (500 images per class) from the remaining images as the training set. For a fair comparison, all of the methods use identical training and test sets.

For the proposed MBH and the most closely related competitor CNNH, we directly use the image pixels as input. For the other baselines, we follow [4,11,5] and represent each image in MNIST by a 784-dimensional greyscale vector and each image in CIFAR-10 by a 512-dimensional GIST [17] vector.

To evaluate the quality of hashing, we use four evaluation metrics: mean average precision (MAP), precision-recall curves, precision within Hamming distance 2, and precision w.r.t. different numbers of top returned samples.

We implement the proposed MBH based on the open-source Caffe [18] framework. In all experiments, our convolutional networks are trained by stochastic gradient descent with 0.9 momentum [19]. The mini-batch size is 128 images, the base learning rate is 0.01, and the weight decay parameter is 0.004. The results of BRE, ITQ, ITQ-CCA, KSH, MLH and SH are obtained with the implementations provided by their authors; the results of LSH and CNNH are obtained from our own implementations. For a fair comparison to CNNH, in the second stage of the proposed method we use the same stack of convolution-pooling layers as in [4]. The implementation of MBH will be released as open source upon publication. For the margin parameters in MBH, we choose r1 from {0, 1, 2} and r2 from {5, 10, 15, 20} by cross validation.

4.2. Results of search accuracies

Tables 1-4 and Figs. 1-3 show the comparison results of search accuracies on the four datasets. (On ILSVRC-2012 we only report the results of MBH and CNNH and omit the other baselines, because the baseline methods with hand-crafted features, e.g., 512-dimensional GIST, perform poorly on this dataset; for example, with 32-bit hash codes the second best baseline, KSH with GIST features, obtains a MAP of less than 0.05.) Two observations can be made from these results:

(1) On MNIST and CIFAR-10, the proposed MBH achieves substantially better search accuracies (w.r.t. MAP, precision within Hamming distance 2, precision-recall, and precision with varying numbers of top returned samples) than the baseline methods using traditional hand-crafted visual features. For example, compared to the best competitor KSH, the MAP results of the proposed method indicate a relative increase of 7.78-11.12% on MNIST and 49.71-58.75% on CIFAR-10.

(2) On all four datasets, the proposed MBH performs consistently better than the most closely related competitor CNNH. For example, with respect to MAP, compared to CNNH the proposed MBH shows a relative increase of 1.01-1.02% on MNIST, 6.80-9.13% on CIFAR-10, 9.80-20.70% on ILSVRC-2012, and 0.90-3.35% on SVHN. Since the implementation of MBH's second stage is the same as that of CNNH, these results verify that the proposed margin-based objective leads to better hashing performance than the objectives used in [4,7,5].

4.3. Comparison of computation time

Table 5 shows the comparison of average testing time on the large SVHN dataset. The computation time of the proposed method is competitive with the benchmark approaches: the approximate hash codes in the first stage can be obtained in just several minutes, and the deep network trained in the second stage is similar to that of CNNH. In testing, the experimental results show that the average per-image encoding time of MBH is very close to that of CNNH.
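For reference, the MAP of Hamming ranking reported in the tables above can be computed as in the following single-label NumPy sketch; the function name is ours, and details such as tie-breaking among equal Hamming distances may differ from the evaluation scripts actually used.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """MAP of Hamming ranking for {+1, -1} codes (single-label formulation)."""
    q = query_codes.shape[1]
    average_precisions = []
    for code, label in zip(query_codes, query_labels):
        # Hamming distance via inner product: dist = (q - <x, y>) / 2 for +/-1 codes
        dists = (q - db_codes @ code) / 2
        order = np.argsort(dists, kind="stable")
        relevant = (db_labels[order] == label).astype(float)
        if relevant.sum() == 0:
            continue
        cum_rel = np.cumsum(relevant)
        precision_at_k = cum_rel / np.arange(1, len(relevant) + 1)
        average_precisions.append(np.sum(precision_at_k * relevant) / relevant.sum())
    return float(np.mean(average_precisions))
```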

5. Conclusion

In this paper, we developed a two-stage supervised hashing method. In the first stage, which learns approximate hash codes, we proposed a margin-based objective that makes the Hamming distance between the hash codes of two similar (dissimilar) images sufficiently small (large), and we developed a random coordinate descent algorithm to efficiently solve the corresponding optimization problem. In the second stage, we use convolutional neural networks to learn the hash functions. Empirical evaluations on image retrieval show that the proposed method yields substantial performance gains over the state-of-the-art.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (61033010, 61272065, 61370021, 61472453, U1401256, U1501252), the Natural Science Foundation of Guangdong Province (S2011020001182, S2012010009311, S2013010011905), and the Research Foundation of Science and Technology Plan Project in Guangdong Province (2011B031700004, 2011B040200007, 2012A010701013). This work is also supported by the Industry-University Cooperation and Joint Research Program between Flamingo (Huolieniao) Network (Guangzhou) Co., Ltd. and the Guangdong Provincial Key Laboratory of Big Data Analysis and Processing in China.


References

[1] Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817-824.
[2] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 2130-2137.
[3] W. Liu, J. Wang, S. Kumar, S.-F. Chang, Hashing with graphs, in: Proceedings of the International Conference on Machine Learning, 2011, pp. 1-8.
[4] R. Xia, Y. Pan, H. Lai, C. Liu, S. Yan, Supervised hashing for image retrieval via image representation learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2014, pp. 2156-2162.
[5] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, S.-F. Chang, Supervised hashing with kernels, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074-2081.
[6] B. Kulis, T. Darrell, Learning to hash with binary reconstructive embeddings, in: Proceedings of the Advances in Neural Information Processing Systems, 2009, pp. 1042-1050.
[7] G. Lin, C. Shen, D. Suter, A. van den Hengel, A general two-step approach to learning-based hashing, in: Proceedings of the IEEE Conference on Computer Vision, Sydney, Australia, 2013.
[8] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the International Conference on Very Large Data Bases, vol. 99, 1999, pp. 518-529.
[9] R. Salakhutdinov, G. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, in: Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 11, 2007.
[10] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Proceedings of the Advances in Neural Information Processing Systems, 2008, pp. 1753-1760.
[11] M. Norouzi, D.M. Blei, Minimal loss hashing for compact binary codes, in: Proceedings of the International Conference on Machine Learning, 2011, pp. 353-360.
[12] X. Li, G. Lin, C. Shen, A. van den Hengel, A. Dick, Learning hash functions using column generation, in: Proceedings of the International Conference on Machine Learning, 2013.
[13] J. Wang, S. Kumar, S.-F. Chang, Semi-supervised hashing for scalable image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424-3431.
[14] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1106-1114.
[15] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211-252.
[17] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (3) (2001) 145-175.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675-678.
[19] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1139-1147.

Ye Liu received the M.Sc. degree in computer science from Sun Yat-sen University, Guangzhou, China. He is currently a Department Director in the Data Center of Flamingo (Huolieniao) Network (Guangzhou) Co., Ltd. His major research interests include social network mining, large-scale machine learning and deep learning.

Yan Pan received the Ph.D. degree in computer science from Sun Yat-sen University, Guangzhou, China. He is currently an Associate Professor at Sun Yat-sen University, China. His major research interests include machine learning, data mining and information retrieval.

Hanjiang Lai received the Ph.D. degree in computer science from Sun Yat-sen University, Guangzhou, China. He was a Research Fellow in the Learning and Vision Research Group at the Department of Electrical and Computer Engineering, National University of Singapore. He is currently a Researcher at Sun Yat-sen University, China. His main research interests include machine learning algorithms and learning to rank.


Cong Liu received the Ph.D. degree from the Department of Computer Science and Engineering, Florida Atlantic University. He is currently an Associate Professor at Sun Yat-sen University, China. His major research interests include routing in mobile ad hoc networks and machine learning.

Jian Yin received the Ph.D. degree in computer science from Wuhan University, Wuhan, China. He is currently a Professor and Ph.D. supervisor at Sun Yat-sen University, China. His major research interests include big data and data mining.
