Semi-supervised constraints preserving hashing


Neurocomputing 167 (2015) 230–242


Di Wang, Xinbo Gao*, Xiumei Wang
VIPS Lab, School of Electronic Engineering, Xidian University, Xi'an 710071, China
* Corresponding author. E-mail address: [email protected] (X. Gao).
http://dx.doi.org/10.1016/j.neucom.2015.04.072
0925-2312/© 2015 Elsevier B.V. All rights reserved.

Article history: Received 17 March 2015; Received in revised form 14 April 2015; Accepted 26 April 2015; Available online 8 May 2015. Communicated by Luming Zhang.

Abstract

With the ever-increasing amount of multimedia data on the web, hashing-based approximate nearest neighbor search methods have attracted significant attention due to their remarkable efficiency gains and storage reductions. Traditional unsupervised hashing methods are designed to preserve distance-metric similarity, which may leave a semantic gap with respect to high-level semantic similarities. Recently, attention has been paid to semi-supervised hashing methods, which can preserve the few available semantic similarities of the data (usually given in terms of labels, pairwise constraints, tags, etc.). However, these methods often preserve semantic similarities only for low-dimensional embeddings. When converting low-dimensional embeddings into binary codes, the quantization error accumulates, resulting in performance deterioration. To this end, we propose a novel semi-supervised hashing method which preserves pairwise constraints for both low-dimensional embeddings and binary codes. It first represents data points by cluster centers to preserve the data neighborhood structure and reduce the dimensionality. Then the constraint information is fully utilized to embed the derived data representations into a discriminative low-dimensional space by maximizing the discriminative Hamming distance and the data variance. After that, optimal binary codes are obtained by further preserving the semantic similarities while quantizing the low-dimensional embeddings. By utilizing constraint information in the quantization process, the proposed method can fully preserve pairwise semantic similarities for binary codes, leading to better retrieval performance. Thorough experiments on standard databases show the superior performance of the proposed method. © 2015 Elsevier B.V. All rights reserved.

Keywords: Hashing; Binary codes; Semi-supervised learning; Nearest neighbor search; Pairwise similarity

1. Introduction

With the explosive growth of data on the web, there is an urgent demand for approximate nearest neighbor (ANN) search methods that can efficiently exploit user-intent information from large web databases. The main challenges for ANN methods are fast query response and low storage requirements. To meet this goal, various hashing methods have been proposed recently. Hashing aims to map high-dimensional data, such as documents, images, or videos, into a set of low-dimensional compact binary codes while preserving the underlying similarity relationships of the original data. Pairwise similarity comparisons between these binary codes can be measured by Hamming distances, which only involve efficient bit-count operations and can therefore be computed very quickly. Furthermore, only a small number of bits are sufficient for binary codes to maintain the information required for retrieval, which brings enormous storage savings. Due to these merits, hashing methods have been successfully used in various applications such as large-scale retrieval [1–3], feature descriptor learning [4,5], and near-duplicate detection [6–8].
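For illustration, the Hamming distance between two binary codes reduces to an XOR followed by a bit count, which is why these comparisons are so cheap. Below is a minimal NumPy sketch of this idea; the packing into bytes and the function names are our own illustrative choices, not part of the paper.

```python
import numpy as np

def pack_bits(codes):
    """Pack an (n, r) array of 0/1 bits into bytes, one row per code."""
    return np.packbits(codes.astype(np.uint8), axis=1)

def hamming_distance(packed_a, packed_b):
    """Hamming distance between two packed codes via XOR + bit count."""
    xor = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(xor).sum())

# toy example: two 8-bit codes differing in 3 positions
a = pack_bits(np.array([[1, 0, 1, 1, 0, 0, 1, 0]]))
b = pack_bits(np.array([[1, 1, 1, 0, 0, 1, 1, 0]]))
print(hamming_distance(a, b))  # -> 3
```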


According to how much supervised information is used, hashing methods can be classified into unsupervised, semi-supervised, and supervised methods. Most representative hashing methods are unsupervised and seek to exploit the data distribution itself to compute binary codes. Early endeavors in unsupervised hashing concentrated on data-independent methods, which use random projections to construct hash functions without exploiting any knowledge of the training data. Notable examples include Locality-Sensitive Hashing (LSH) [9], Kernelized LSH (KLSH) [10], and Shift-Invariant KLSH (SKLSH) [11]. However, due to the limitations of the random hash-function generation scheme, LSH-like hashing methods need long binary codes to achieve reasonable performance, and therefore suffer from long query times and high storage costs. To overcome this limitation of data-independent methods, recent research has focused on data-dependent hashing methods, which generate hash functions by exploiting the correlations between data points. Typical approaches are PCA-like hashing [12–15], manifold-like hashing [16–19], and K-means-like hashing [20–22]. To preserve semantic similarities, supervised hashing algorithms have been employed to design more effective hash functions.


Fig. 1. Flowchart of the proposed semi-supervised constraints preserving hashing method: Data Set → Clustering → Sparse Embedding → Constraints Preserving Low-Dimensional Embedding → Constraints Preserving Quantization.

Linear discriminant analysis hashing (LDAHash) obtains hash functions by maximizing the between-class scatter among binary codes associated with different classes [4]. Binary reconstructive embedding (BRE) learns hash functions by minimizing the reconstruction error between the original semantic similarities and the Hamming distances of the corresponding binary embeddings [23]. Minimal loss hashing (MLH) introduces a hinge-like loss function depending on semantic similarity information and learns binary codes based on structured prediction [24]. Supervised hashing with kernels (KSH) maps data to hash codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs [25]. Although LDAHash handles supervision via an easy optimization, its performance is inadequate. BRE, MLH, and KSH obtain better search accuracy than LDAHash, but they rely on sophisticated optimization and have expensive training costs, which greatly diminishes their applicability to large-scale tasks.

Usually, labeling samples for a large-scale data set requires much human expertise, so the available supervised information can be very limited. Semi-supervised methods are helpful in this case: they take advantage of both the supervised information and the data's underlying similarity information. The label-regularized max-margin partition (LAMP) method uses kernel-based similarity and a small number of additional pairwise constraints as side information to generate hash codes [26]. The encoding process is formulated within a regularized maximum-margin framework which can be solved as a series of convex quadratic-programming sub-problems. However, the optimization is so complex that it leads to a lengthy training process and reduces the method's serviceability. Semi-supervised hashing (SSH) minimizes the empirical error on the labeled data while maximizing the entropy of the hash bits over the labeled and unlabeled data [27]. Depending on the optimization algorithm, SSH has three variants: SSH-Orthogonal, SSH-Nonorthogonal, and sequential projection learning for hashing (SPLH). SSH-Orthogonal and SSH-Nonorthogonal first project the original data into low-dimensional embeddings and then quantize the embeddings into binary codes by thresholding. However, the quantization error produced by this conversion decreases the hashing performance. To reduce the quantization error, SPLH is designed as a boosting-style learning method: each hash function is intended to correct the errors produced by the previous ones. The hash functions are learned iteratively, with the pairwise label matrix updated by imposing higher weights on point pairs that violated the preceding hash function. However, SPLH judges every previous bit separately when deciding which errors of the obtained projections to penalize with higher weights. Since the similarity of hash codes should take all bits into account holistically, SPLH may incur more errors. To solve this problem, the bootstrap sequential projection learning for semi-supervised nonlinear hashing (Bootstrap-NSPLH) method was proposed [28]. It utilizes bootstrap sequential projection learning to rectify the quantization errors by taking all the previously learned bits into consideration holistically. However, Bootstrap-NSPLH is mainly designed for label information. When only weaker supervised information such as pairwise constraints is available, it needs to calculate the pairwise

Hamming similarity matrix for the whole data set (for N training data points, the size of this matrix is N × N), because pairwise constraint information is often scattered. This is intolerable for large data sets due to the enormous storage space and the large amount of computation involved.

To effectively reduce the quantization error accumulated when converting low-dimensional embeddings into binary codes after relaxation, and to use the pairwise constraints efficiently, we propose the semi-supervised constraints preserving hashing (SCPH) method in this paper. Fig. 1 illustrates the flowchart of the proposed method. It first partitions the data points into clusters and represents the data points by the cluster centers to preserve the data neighborhood structure and reduce the dimensionality. Then the constraint information is fully utilized to embed the derived data representations into a discriminative low-dimensional space by maximizing the discriminative Hamming distance and the data variance. After that, optimal binary codes are obtained by preserving the semantic similarities while quantizing the low-dimensional embeddings. The main contributions are summarized as follows:

• A discriminative low-dimensional space is learned in which points with similar constraints are pushed as close together as possible, while points with dissimilar constraints are pulled as far apart as possible.
• Constraint information is further utilized to guide the quantization process. The obtained discriminative low-dimensional embeddings are quantized into Hamming space by jointly maximizing the binary codes' discriminability and minimizing the quantization error. This makes the similarities of the binary codes consistent with the pairwise constraints as much as possible.
• The proposed method is widely applicable because it only requires the weaker form of supervised information, namely pairwise constraints. When other supervised information such as labels or tags is available, the method remains useful, since pairwise constraints can be readily derived from it (a minimal sketch of such a derivation is given after this list).
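As noted in the last point above, pairwise constraints can be derived directly from labels whenever labels happen to be available. The following is a minimal sketch of one way to sample such constraints; the sampling scheme and the function name are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def sample_constraints(labels, n_pairs, rng=None):
    """Sample a similar pair set M (same label) and a dissimilar pair set C (different label)."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    M, C = [], []
    while len(M) < n_pairs or len(C) < n_pairs:
        i, j = rng.integers(0, len(labels), size=2)
        if i == j:
            continue
        if labels[i] == labels[j] and len(M) < n_pairs:
            M.append((i, j))
        elif labels[i] != labels[j] and len(C) < n_pairs:
            C.append((i, j))
    return M, C

# e.g. M, C = sample_constraints(train_labels, n_pairs=200_000)  # train_labels is hypothetical
```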

The rest of this paper is organized as follows. The proposed SCPH method is presented in Section 2. Section 3 gives its computational complexity analysis. Section 4 reports the experimental results and analysis on popular image data sets. Finally, Section 5 concludes the paper.

2. Semi-supervised constraints preserving hash codes learning

We are given a training set of n data points {x_1, x_2, …, x_n} with x_i ∈ R^d, which form the rows of the data matrix X ∈ R^{n×d}. The data set also includes a fraction of pairwise constraint information: a similar pair set M, in which (x_i, x_j) ∈ M implies that x_i and x_j belong to the same class, and a dissimilar pair set C, in which (x_i, x_j) ∈ C implies that x_i and x_j belong to different classes. Our goal is to learn binary codes Y ∈ {1, −1}^{n×r} of the data set¹ by building hash functions H = [h_1, h_2, …, h_r], where r is the code length and h_k denotes the binary encoding function for bit k = 1, 2, …, r. Table 1 lists the main notations used in this paper. Generally, uppercase symbols denote matrices and lowercase symbols denote vectors.

¹ Converting −1/1 codes to 0/1 codes is a trivial shift and scaling operation.


Table 1. Notations and descriptions.

M: similar pair set
C: dissimilar pair set
X: training set
H: hash functions
Y: hash codes
Z: sparse embeddings
U: cluster centers
W: orthogonal vectors
V: projected data points
R: rotation matrix
x_i: the i-th training point
h_i: the i-th hash function
y_i: the i-th hash code
z_i: the i-th sparse embedding
u_i: the i-th cluster center
w_i: the i-th orthogonal vector
n: number of training points
n_M: number of similar pairs
n_C: number of dissimilar pairs
d: dimension of training points
r: length of hash codes
m: number of cluster centers
t: bandwidth parameter
s: number of nearest cluster centers
λ: regularization parameter
η: regularization parameter

2.1. Sparse embedding

Recent years have witnessed the great success of sparse coding for data analysis [29–31]. The basic idea of sparse coding is that each data point is represented by a small number of atoms of a dictionary, where the dictionary is over-complete and can thus provide sufficient descriptive power for the data set. The advantage of sparse coding for retrieval is that it reduces the dimensionality while capturing the salient structure. Motivated by sparse coding, in which each data point is represented as a linear combination of a small number of basis vectors, we derive a small set of bases to approximate each training data point, aiming to preserve the data's neighborhood structure and reduce its dimensionality. To compute the bases, the K-means clustering method is performed on the n data points to obtain m cluster centers U = {u_j ∈ R^d}_{j=1}^m, which act as bases to represent the training set X. The sparse embedding z_ij, which represents the similarity between the i-th data point and the j-th cluster center, is defined as

    z_ij = exp(−‖x_i − u_j‖² / t) / Σ_{j' ∈ ⟨i⟩} exp(−‖x_i − u_{j'}‖² / t)   if j ∈ ⟨i⟩,
    z_ij = 0   otherwise,                                                    (1)

where ⟨i⟩ ⊂ [1 : m] denotes the indices of the s (s ≪ m) nearest cluster centers of point x_i in U, and t denotes the bandwidth parameter. The training set X can then be represented in terms of the cluster centers as Z = [z_ij] ∈ R^{n×m}. Note that a similar data representation has also been used in many hashing methods [16,17,28,32].
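For concreteness, the following is a minimal sketch of the sparse embedding of Eq. (1), assuming the cluster centers U have already been obtained by K-means (e.g. from scikit-learn or any other implementation). The function and variable names are ours, and no attempt is made to exploit sparsity or to scale to large n.

```python
import numpy as np

def sparse_embedding(X, U, s=5, t=0.5):
    """Eq. (1): represent each point by its s nearest cluster centers.

    X: (n, d) data, U: (m, d) cluster centers, s: number of nearest centers,
    t: bandwidth parameter. Returns the (n, m) row-sparse matrix Z.
    """
    n, m = X.shape[0], U.shape[0]
    # squared Euclidean distances between all points and all centers
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ U.T + (U ** 2).sum(1)[None, :]
    Z = np.zeros((n, m))
    idx = np.argsort(d2, axis=1)[:, :s]      # indices of the s nearest centers
    for i in range(n):
        w = np.exp(-d2[i, idx[i]] / t)
        Z[i, idx[i]] = w / w.sum()           # normalized kernel weights, Eq. (1)
    return Z
```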

2.2. Constraints preserving low-dimensional embedding

The proposed method aims to map the data set X into Hamming space to obtain discriminative hash codes Y. After converting the training data X into the new sparse representations Z, we intend to learn hash functions H = [h_1, h_2, …, h_r] that convert Z into binary codes Y. We use hash functions based on linear projection coupled with mean thresholding; this type of hash function has been widely used in data-dependent hashing methods [14,25,28]. Given r orthogonal vectors {w_1, w_2, …, w_r} ⊂ R^m, the k-th hash function is defined as

    h_k(z_i) = sgn(z_i w_k − t_k),                                            (2)

where t_k is the mean of the projected data, i.e., t_k = (1/n) Σ_{i=1}^n z_i w_k. Without loss of generality, let Z be normalized to have zero mean; the hash function then becomes

    h_k(z_i) = sgn(z_i w_k).                                                  (3)

One important characteristic of good hash functions is their discriminative ability [12]: the functions should map similar data points to binary codes with low Hamming distances and dissimilar data points to binary codes with high Hamming distances. To pursue this property, the proposed method learns effective hash codes under the supervision of the similar and dissimilar constraints by maximizing the discriminative Hamming distance δ, defined as

    δ = (1/(2 n_C)) Σ_{(x_i, x_j) ∈ C} ‖y_i − y_j‖² − (1/(2 n_M)) Σ_{(x_i, x_j) ∈ M} ‖y_i − y_j‖²,    (4)

where n_C and n_M are the numbers of dissimilar and similar pairs, respectively, and y_i, the i-th row of Y, is the hash code of x_i. The first term on the right-hand side of Eq. (4) measures the average Hamming distance between pairs of hash codes known to belong to different classes, while the second term measures the average Hamming distance between pairs known to belong to the same class. The intuition behind Eq. (4) is thus to make Hamming distances between instances of different classes as large as possible, while keeping distances between instances of the same class as small as possible. In this sense, δ measures the discriminability of the hash codes: the larger δ is, the more discriminative the codes are. Eq. (4) can be rewritten as

    δ = (1/2) Σ_{i,j} ‖y_i − y_j‖² S̃_ij = tr(Yᵀ L Y),                         (5)

where

    S̃_ij = 1/n_C    if (x_i, x_j) ∈ C,
    S̃_ij = −1/n_M   if (x_i, x_j) ∈ M,
    S̃_ij = 0        otherwise.                                               (6)

Here tr(·) denotes the trace of a matrix, L = D − S̃, and D is a diagonal matrix whose entries are the column sums of S̃, i.e., d_ii = Σ_j S̃_ij. Recall that the hash codes Y are obtained by the hash functions, that is,

    Y = [h_1(Z), h_2(Z), …, h_r(Z)] = sgn(ZW),                                (7)

where W = [w_1, w_2, …, w_r]. Substituting Eq. (7) into Eq. (5), the discriminative Hamming distance δ becomes

    δ = tr(sgn(ZW)ᵀ L sgn(ZW)).                                               (8)

The discriminative Hamming distance is designed for the constrained data points. When there are abundant unconstrained data points, these examples should also be considered to enhance performance. Following the formulation of [27], for the unconstrained data we maximize the variance σ of the hash codes, that is,

    σ = (1/n) tr(sgn(ZW)ᵀ sgn(ZW)).                                           (9)

We then obtain the following objective function for learning the hash functions:

    F(W) = δ + λσ
         = tr(sgn(ZW)ᵀ L sgn(ZW)) + (λ/n) tr(sgn(ZW)ᵀ sgn(ZW))
         = tr(sgn(ZW)ᵀ (L + (λ/n) I) sgn(ZW)),                                (10)

where I is an identity matrix and λ is a scaling parameter that balances the contribution of the unconstrained data points. In summary, we intend to learn optimal hash functions H by maximizing the objective function F. Unfortunately, this optimization problem is NP-hard due to the sign function. We therefore replace the sign of the low-dimensional embedding ZW in (10) with its signed magnitude to obtain a relaxed version F̃ of the objective function F:

    F̃(W) = tr(Wᵀ Zᵀ (L + (λ/n) I) Z W) = tr(Wᵀ M W),   s.t.  Wᵀ W = I,        (11)

where M = Zᵀ (L + (λ/n) I) Z. The orthogonality constraint on W makes the bits uncorrelated. Maximizing the objective function F̃ is a typical eigen-problem, which can be solved easily and efficiently by computing the eigenvectors of M corresponding to the r largest eigenvalues. After obtaining the optimal solution W, we get the low-dimensional embedding ZW. Instead of obtaining the binary codes Y by simply quantizing ZW, i.e., sgn(ZW), we propose the constraints preserving quantization method to preserve the semantic similarity of the binary codes.

2.3. Constraints preserving quantization

It is worth noticing that the solution for F̃ in the previous subsection is not unique. Suppose W is an optimal solution; then so is W̃ = WR for any orthogonal r × r matrix R. Therefore, we are free to orthogonally transform the projected data V = ZW so as to preserve the pairwise constraints for the binary codes Y. That is, we intend to learn a suitable R which maximizes the discriminative Hamming distance δ. We rewrite δ as

    δ = (1/2) Σ_{i,j} ‖y_i − y_j‖² S̃_ij
      = (1/2) Σ_{i,j} (‖y_i‖² + ‖y_j‖² − 2 y_i y_jᵀ) S̃_ij
      = (1/2) Σ_{i,j} (2r − 2 y_i y_jᵀ) S̃_ij
      = r Σ_{i,j} S̃_ij − Σ_{i,j} y_i y_jᵀ S̃_ij
      = 0 − tr(Yᵀ S̃ Y)
      = −tr(sgn(VR)ᵀ S̃ sgn(VR)).                                             (12)

Therefore, we get the following objective function:

    min_R  tr(sgn(VR)ᵀ S̃ sgn(VR))   s.t.  RᵀR = RRᵀ = I.                      (13)

As Eq. (13) is difficult to solve, we approximate it by learning the binary codes Y and R simultaneously:

    min_{Y,R}  tr(sgn(VR)ᵀ S̃ V R) = tr(Yᵀ S̃ V R)   s.t.  RᵀR = RRᵀ = I.       (14)

For the unconstrained data points, we would like to minimize the quantization error as in [14]:

    min_{Y,R}  ‖Y − VR‖²_F.                                                   (15)

The objective function for constraints preserving quantization is then

    min_{Y,R}  Q(Y, R) = ‖Y − VR‖²_F + 2η tr(Yᵀ S̃ V R)   s.t.  RᵀR = RRᵀ = I,  (16)

where η is a scaling parameter that balances the contributions of the two parts. Although this problem is NP-hard, its sub-problems with respect to each of Y and R are convex. Therefore, we can minimize it by the following alternating procedure.

Fix R and update Y: Expanding Q, we have

    Q = ‖Y − VR‖²_F + 2η tr(Yᵀ S̃ V R)
      = ‖Y‖²_F + ‖VR‖²_F − 2 tr(Yᵀ V R) + 2η tr(Yᵀ S̃ V R)
      = nr + ‖V‖²_F − 2 tr(Yᵀ (I − η S̃) V R).                                 (17)

Because the low-dimensional embedding V is fixed, minimizing (17) is equivalent to

    max  tr(Yᵀ (I − η S̃) V R) = Σ_{i=1}^n Σ_{j=1}^r y_ij Ṽ_ij,                (18)

where Ṽ_ij denotes the elements of Ṽ = (I − η S̃) V R. To maximize (18) with respect to Y, we only need to set y_ij = 1 whenever Ṽ_ij ≥ 0 and y_ij = −1 otherwise. In other words, Y = sgn((I − η S̃) V R).

Fix Y and update R: For a fixed Y, the first term of Q corresponds to the classic orthogonal Procrustes problem [33]. It can be rewritten as

    ‖Y − VR‖²_F = tr(Yᵀ Y + Rᵀ Vᵀ V R) − 2 tr(Yᵀ V R).                        (19)

Its solution can be obtained by first computing the SVD of the matrix Yᵀ V as U Λ Vᵀ and then letting R = V Uᵀ. Adding the constraint term 2η tr(Yᵀ S̃ V R) to (19), Q is equivalent to

    Q = tr(Yᵀ Y + Rᵀ Vᵀ V R) − 2 tr(Yᵀ (I − η S̃) V R).                        (20)

Following the proof of [33], we can easily obtain the solution of Eq. (20) by computing the SVD of the matrix Yᵀ (I − η S̃) V as U Λ Vᵀ and letting R = V Uᵀ. Y and R are updated iteratively until they no longer change or a preset maximum number of iterations is reached. We summarize the procedure for SCPH in Algorithm 1.

Algorithm 1. Semi-supervised constraints preserving hashing.

Training stage
Input: Training data X ∈ R^{n×d}, pairwise constraints M and C, cluster number m, nearest points number s, bandwidth parameter t, regularization parameters λ and η, code length r.
Output: Binary codes Y ∈ R^{n×r}, cluster centers U ∈ R^{m×d}, matrices W ∈ R^{m×r} and R ∈ R^{r×r}.
1. Perform K-means on X to obtain the cluster centers U.
2. Compute the sparse embedding Z by Eq. (1) and remove its mean.
3. Compute W by Eq. (11).
4. Repeat
   4.1 Update Y by Eq. (18).
   4.2 Update R by Eq. (20).
   until the stopping criterion is met.

Testing stage
Input: Test data X_t ∈ R^{n_t×d}, cluster centers U ∈ R^{m×d}, matrices W ∈ R^{m×r} and R ∈ R^{r×r}.
Output: Binary codes Y_t ∈ R^{n_t×r}.
1. Compute the sparse embedding Z_t by Eq. (1) and remove its mean.
2. Obtain Y_t as sgn(Z_t W R).
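To make the training stage of Algorithm 1 concrete, the sketch below mirrors Eqs. (6), (11), (18) and (20) with dense matrices and plain NumPy; it is an illustrative reading of the algorithm under these simplifying assumptions (dense n × n S̃, generic pair lists), not the authors' implementation.

```python
import numpy as np

def train_scph(Z, M_pairs, C_pairs, r, lam=1.0, eta=1.0, n_iter=20):
    """Learn projections W and rotation R from zero-mean sparse embeddings Z.

    Z: (n, m) sparse embeddings (mean removed); M_pairs / C_pairs: lists of
    similar / dissimilar index pairs; r: code length.
    """
    n = Z.shape[0]

    # S-tilde from Eq. (6) (dense here for clarity; it is very sparse in practice)
    S = np.zeros((n, n))
    for i, j in C_pairs:
        S[i, j] = S[j, i] = 1.0 / len(C_pairs)
    for i, j in M_pairs:
        S[i, j] = S[j, i] = -1.0 / len(M_pairs)
    L = np.diag(S.sum(axis=1)) - S                      # graph Laplacian of S-tilde

    # Eq. (11): W = top-r eigenvectors of M = Z^T (L + (lam/n) I) Z
    Mmat = Z.T @ (L + (lam / n) * np.eye(n)) @ Z
    vals, vecs = np.linalg.eigh(Mmat)
    W = vecs[:, np.argsort(vals)[::-1][:r]]

    # Constraints preserving quantization: alternate Eq. (18) and Eq. (20)
    V = Z @ W
    R = np.eye(r)
    Y = np.sign(V); Y[Y == 0] = 1
    for _ in range(n_iter):
        Y = np.sign((np.eye(n) - eta * S) @ V @ R)      # Eq. (18)
        Y[Y == 0] = 1
        U_, _, Vt = np.linalg.svd(Y.T @ (np.eye(n) - eta * S) @ V)
        R = Vt.T @ U_.T                                 # Eq. (20), Procrustes solution
    return W, R, Y
```

In practice S̃ would be kept sparse and the dominant cost is the small m × m eigen-problem, as discussed in Section 3.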

3. Computational complexity analysis

In this section, the computational complexity analysis of the proposed method is given.


The training cost of SCPH is determined by three main steps: (1) sparse embedding; (2) constraints preserving low-dimensional embedding; (3) constraints preserving quantization. In the first step the computational cost is dominated by K-means, which is O(dmnl), where l is the number of K-means iterations. Typically, a few iterations are sufficient (set to 10 in our experiments), which makes the clustering very fast. After obtaining the cluster centers, constructing the similarity graph Z costs O(dmn). The main cost of step 2 is computing the r eigenvectors of the m × m matrix M, which is O(rm²). In step 3, we alternately update Y and R for several iterations to find a locally optimal solution. In each iteration, fixing R to update Y costs O(rsn), since S̃ is highly sparse, and fixing Y to update R costs O(r³), dominated by the SVD of the r × r matrix Yᵀ(I − ηS̃)V. The overall cost of step 3 is therefore O((rsn + r³)t), where t is the number of iterations. In order to preserve the sparsity of the similarity graph Z, the nearest points number s is usually very small (set to 5 in our experiments). In practice, we have found that a few iterations achieve superior performance, so we do not need to iterate until convergence; we set t to 20 in our experiments. Considering that m (normally a few hundred) and d are much smaller than n, the total training time of SCPH is in principle linear in the size of the training set. For large-scale data processing, the training of SCPH can easily be implemented in parallel, for example on GPUs or with MapReduce [34], which would greatly reduce the training time.

The cost of the testing process, i.e., compressing a test sample into a binary code, is of great importance for a hashing method. The overall testing cost of SCPH is O(dm + rs), in which representing the test sample by the cluster centers costs O(dm) and compressing the new representation into a binary code costs O(rs). The testing complexity of the proposed method hardly increases with the code length, which makes it very fast and scalable to large-scale data sets.
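To make the O(dm + rs) testing cost concrete, here is a minimal sketch of the testing stage for a single query. The names are ours, mean-centering with the training mean is omitted for brevity, and the dense matrix product does not exploit the s-sparsity that the stated cost assumes.

```python
import numpy as np

def encode_query(x, U, W, R, s=5, t=0.5):
    """Testing stage of Algorithm 1: map one query x to an r-bit code.

    O(dm) goes into comparing x with the m centers U; the projection is
    O(rs) when the s-sparse embedding is exploited (this sketch stays dense).
    """
    d2 = ((U - x) ** 2).sum(axis=1)          # distances to the m cluster centers
    idx = np.argsort(d2)[:s]                 # s nearest centers
    w = np.exp(-d2[idx] / t)
    z = np.zeros(U.shape[0])
    z[idx] = w / w.sum()                     # Eq. (1), row-sparse embedding
    code = np.sign(z @ W @ R)                # sgn(z_t W R)
    code[code == 0] = 1
    return code
```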

4. Experiments

To evaluate the performance of the proposed semi-supervised hashing method, we conduct comparison experiments with several state-of-the-art approaches, including AGH [16], ITQ [14], SSH [27], and Bootstrap-NSPLH [28], on four benchmark image data sets: MNIST,² CIFAR10,³ Caltech101,⁴ and NUS-WIDE.⁵ Note that the SSH framework has three variants, SPLH, SSH-Orthogonal, and SSH-Nonorthogonal; here we use SSH-Orthogonal and perform the ITQ refinement as in [14]. The reason for choosing this variant is that SSH-Orthogonal with ITQ refinement works better than plain SSH-Orthogonal and SSH-Nonorthogonal. SPLH was compared with Bootstrap-NSPLH in [28], where Bootstrap-NSPLH achieved better performance, so in the following experiments we do not compare SCPH with SPLH. In the following, Bootstrap-NSPLH is referred to as BTNSPLH for the sake of simplicity.

4.1. Data sets and settings

4.1.1. Data sets

MNIST: The MNIST handwritten digits data set consists of 70K 784-dimensional samples associated with the digits '0' to '9'. Each image is represented by its pixel values. The entire data set is partitioned into two subsets: a training set with 69K samples and a query set with 1K samples. Because this data set is fully annotated, the true semantic neighbors are defined according to the associated labels.

² http://yann.lecun.com/exdb/mnist/
³ http://www.cs.toronto.edu/kriz/cifar.html
⁴ http://www.vision.caltech.edu/Image_Datasets/Caltech101/
⁵ http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

For SCPH and SSH, we randomly generate 200K similar constraints and 200K dissimilar constraints (about 0.01% of all possible constraints). For the BTNSPLH hashing method, the label information of 4K samples from the training set is used and the remaining data are treated as unlabeled.

CIFAR10: The CIFAR10 data set is a labeled subset of the 80 million tiny images data set, containing 60K 1024-dimensional color images in 10 classes (6K images per class). As in [17], we represent each image by a GIST feature vector of dimension 512. GIST is a global feature that provides a rough description of an image [35]: it convolves the image with several oriented filters at different scales and orientations to characterize important image statistics, through which images can be matched. Similarly, this data set is randomly split into two subsets: a training set with 59K samples and a query set with 1K samples. The true semantic neighbors are again defined according to the associated labels, and the supervised information is generated in the same way as for the MNIST data set.

Caltech101: The Caltech101 data set contains 9144 pictures of objects belonging to 101 categories, with 40–800 images per category; most categories have about 50 images. We also represent each image by a 512-dimensional GIST feature vector. We randomly choose 10 pictures from each category to form the test set and treat the remaining pictures as the training set. For SCPH and SSH, we randomly generate 50K similar constraints and 50K dissimilar constraints (about 0.3% of all possible constraints). For BTNSPLH, label information of 3K samples from the training set is used and the remaining data are treated as unlabeled.

NUS-WIDE: NUS-WIDE is a web image data set containing 269,684 images downloaded from Flickr. Tagging ground truth for 81 semantic concepts is provided for evaluation. After removing the images without tags, 209,347 images remain. The images are represented by 500-dimensional bag-of-words vectors based on SIFT features. SIFT is a local feature that generates a large number of keypoints densely covering the image over the full range of scales and locations [36]; the orientations of the keypoints are determined by the local image appearance and are covariant to image rotations, and images can be matched by finding candidate matching keypoints based on Euclidean distance. We randomly choose 1K images with their tags to serve as the query set and the remaining images serve as the training set. True semantic neighbors are defined by whether two images share at least one common tag. For SCPH and SSH, we randomly generate 5 similar constraints and 5 dissimilar constraints for each data point. For BTNSPLH, the label information of 10K samples from the training set is used and the remaining data are treated as unlabeled.

4.1.2. Parameter settings

• BTNSPLH has six essential parameters: anchor number m, nearest points number s, bandwidth parameter t, regularization parameter λ, and two key learning-scheme parameters α and β. We choose m = 300, s = 2, t = 0.5, λ = 8, α = 0.9, and β = 0, as in [28].
• AGH has three essential parameters: anchor number m, nearest points number s, and bandwidth parameter t. We set m = 300, s = 5, and t = 0.5, as in [16].
• SSH has a regularization parameter λ; we set λ = 2, as in [27].
• The proposed SCPH method has five parameters: cluster number m, nearest points number s, bandwidth parameter t, and regularization parameters λ and η. We set m = 300, t = 0.5, and s = 5; λ and η are both set to one.
• ITQ has no parameters.



4.2. Evaluation metrics

Two commonly used search schemes, Hamming ranking and hash lookup, are adopted to perform real-time search. Hamming ranking ranks all the data points in the database according to their Hamming distance from the query and returns the desired neighbors from the top of the ranked list. Hash lookup constructs a lookup table from the database codes and returns all the points in the buckets that fall within a small Hamming radius of the query. The complexity of Hamming ranking is linear, while hash lookup runs in constant time. These two schemes focus on different characteristics of hashing methods. Hash lookup emphasizes practical search speed, but it often fails because the Hamming space becomes increasingly sparse and very few samples fall into the same bucket. Hamming ranking provides a better quality measurement of the binary codes at the sacrifice of some speed; thanks to the binary representation, it is still very fast in practice.

Based on these two schemes, we use several metrics to measure the quantitative performance of the different methods. For the Hamming ranking-based evaluation, we compute recall by progressively scanning the ranked list and calculate the retrieval precision within the top R neighbors returned from Hamming ranking. We also compute the mean average precision (MAP), which is the mean of the average precision over all queries. Average precision (AP) is defined as

    AP = (1 / Σ_{i=1}^n v_i) Σ_{i=1}^n ( v_i · (Σ_{j=1}^i v_j) / i ),          (21)

where v_i is equal to one if the i-th point in the ranked list has the same label as the query and zero otherwise, and n is the size of the entire data set. In the hash lookup-based evaluation, we vary the Hamming radius r from 0 to its maximum and compute the precision and recall of the returned samples falling inside the Hamming ball with that radius.
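A small sketch of Eq. (21) and of MAP as used here, where the input is a 0/1 relevance vector over the ranked list (the function names are ours):

```python
import numpy as np

def average_precision(v):
    """Eq. (21): v[i] = 1 if the i-th ranked item is relevant, else 0."""
    v = np.asarray(v, dtype=float)
    if v.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(v) / np.arange(1, len(v) + 1)
    return float((precision_at_i * v).sum() / v.sum())

def mean_average_precision(relevance_lists):
    """MAP: mean of AP over all queries."""
    return float(np.mean([average_precision(v) for v in relevance_lists]))

# e.g. average_precision([1, 0, 1, 0]) -> (1/1 + 2/3) / 2 = 0.8333...
```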

4.3. Results

Fig. 2 shows the MAP results for code lengths from 16 to 80 bits for all the methods. The proposed method always achieves better performance than the comparison methods on all four data sets, which shows that it efficiently preserves the constraint information in the encoding process. It can also be seen that although SSH makes use of the constraint information, it obtains almost the same results as ITQ on all four data sets. Because SSH only preserves constraint information for the low-dimensional embedding, its performance is reduced by the quantization error when relaxing the embedding to binary codes. Moreover, the proposed method outperforms BTNSPLH, which uses the stronger supervised information of labels. This may be because BTNSPLH only preserves label information for a small number of labeled data points while ignoring the quantization error of the overwhelming majority of unlabeled data points. Furthermore, the proposed method shows a much higher MAP on CIFAR10 and Caltech101, whose semantic gap is larger than that of the other two data sets; this suggests that the proposed method efficiently preserves the semantic pairwise similarity.

Additionally, none of the methods achieve good performance on Caltech101. This is probably because there are relatively few, and unevenly distributed, samples in each category. As hashing methods are mainly designed for large-scale retrieval problems, each category is supposed to have sufficient samples from which to learn its characteristics. However, the number of samples per category on Caltech101 is uneven, which causes the binary codes to capture more of the characteristics of the large categories while failing to represent the small ones. Therefore, when retrieving samples from a small category, the false retrieval rate is high, leading to the unsatisfactory performance on Caltech101. To improve hashing performance on Caltech101, more representative features should be extracted and the uneven nature of the data set should be taken into account.

Fig. 2. MAP varies with the number of bits. (a) MNIST, (b) CIFAR10, (c) Caltech101, and (d) NUS-WIDE.


Fig. 3. Precision of the top 100 ranked images varies with the number of bits. (a) MNIST, (b) CIFAR10, (c) Caltech101, and (d) NUS-WIDE.

Fig. 4. Precision of the top 250 ranked images varies with the number of bits. (a) MNIST, (b) CIFAR10, (c) Caltech101, and (d) NUS-WIDE.

Figs. 3 and 4 show, respectively, the average precision of the top 100 and top 250 retrieved images for code lengths from 16 to 80 bits for all the methods. It is obvious that the proposed SCPH method achieves the highest precision on all four data sets. We can also note that BTNSPLH does not perform well and is even worse than the unsupervised methods AGH and ITQ on NUS-WIDE. This may again be explained by BTNSPLH's drawback of ignoring the quantization error of the overwhelming majority of unlabeled data points.

Fig. 5. Recall curve with 32-bit. (a) MNIST, (b) CIFAR10, (c) Caltech101, and (d) NUS-WIDE.

Fig. 6. Recall precision curve with 32-bit using hash lookup. (a) MNIST, (b) CIFAR10, (c) Caltech101, and (d) NUS-WIDE.

Fig. 5 shows the recall curves with 32-bit codes for each method. As the recall results show, the proposed method outperforms the others, which indicates that it learns binary codes more effectively.

To illustrate the hash lookup results, we plot the recall-precision curves obtained by varying the Hamming radius in Fig. 6. It can be seen that the proposed SCPH method achieves better performance than the other methods on MNIST, CIFAR10, and Caltech101.


Fig. 7. Top 50 nearest neighbors of one example digit ’5’ returned by the different methods on the MNIST data set. The leftmost digit is the query sample. From the top to bottom, the results are returned by ITQ, AGH, SSH, BTNSPLH, and SCPH using 32-bit code.

Fig. 8. Top 15 nearest neighbors of the query image returned by the different methods on CIFAR10 data set. The leftmost image is the query sample. From the top to bottom, the results are returned by ITQ, AGH, SSH, BTNSPLH, and SCPH using 32-bit code. Red rectangle denotes false returned sample. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Fig. 9. Training time varies with the number of bits. (a) MNIST, (b) CIFAR10, (c) Caltech101, and (d) NUS-WIDE.

On NUS-WIDE, SCPH performs worse than BTNSPLH under hash lookup. We find that the precision of SCPH is comparable to that of BTNSPLH, so the poorer performance is due to its lower recall. For many retrieval tasks on large data sets, precision is more important than recall; hence, the proposed method still has an advantage for retrieval tasks.

Fig. 7 shows qualitative results with 32-bit codes on the MNIST data set: the top 50 neighbors of one example digit '5' returned by each method.


Fig. 10. Effect of cluster number m. (a) MAP varies with cluster number m with 32-bit, (b) precision of top 100 ranked images varies with cluster number m with 32-bit, (c) MAP varies with cluster number m with 64-bit, and (d) precision of top 100 ranked images varies with cluster number m with 64-bit.

Fig. 11. Effect of the number of nearest points s. (a) MAP varies with the number of nearest points s with 32-bit, (b) precision of top 100 ranked images varies with the number of nearest points s with 32-bit, (c) MAP varies with the number of nearest points s with 64-bit, and (d) precision of top 100 ranked images varies with the number of nearest points s with 64-bit.


Fig. 12. Effect of bandwidth parameter t. (a) MAP varies with bandwidth parameter t with 32-bit, (b) precision of top 100 ranked images varies with bandwidth parameter t with 32-bit, (c) MAP varies with bandwidth parameter t with 64-bit, and (d) precision of top 100 ranked images varies with bandwidth parameter t with 64-bit.

Fig. 13. Effect of regularization parameter λ. (a) MAP varies with regularization parameter λ with 32-bit, (b) precision of top 100 ranked images varies with regularization parameter λ with 32-bit, (c) MAP varies with regularization parameter λ with 64-bit, and (d) precision of top 100 ranked images varies with regularization parameter λ with 64-bit.


Fig. 14. Effect of regularization parameter η. (a) MAP varies with regularization parameter η with 32-bit, (b) precision of top 100 ranked images varies with regularization parameter η with 32-bit, (c) MAP varies with regularization parameter η with 64-bit, and (d) precision of top 100 ranked images varies with regularization parameter η with 64-bit.

As shown in Fig. 7, SCPH tends to return more digits '5' than the other methods. Fig. 8 shows qualitative results with 32-bit codes on the CIFAR10 data set: the top 15 neighbors of a given query image returned by each method. It can be seen that SCPH retrieves images that are more semantically related to the test query than the other methods.

We also empirically study the training time of each method. Fig. 9 shows the training time for code lengths from 16 to 80 bits. It can be observed that BTNSPLH needs more training time than the other methods and scales linearly with the code length. The proposed SCPH needs much less training time than BTNSPLH, and its training time hardly increases with the code length. Although SCPH is somewhat slower than AGH, ITQ, and SSH, it achieves much better performance than they do.

4.4. Parameter effects

In this section, the effects of the parameters of the proposed method are evaluated. We use the CIFAR10 data set to conduct a thorough study of the main parameters of SCPH. When studying one parameter, the other parameters are kept the same as in the previous experiments.

Cluster number m: Fig. 10 plots the results as the cluster number m varies from 100 to 1500 for 32-bit and 64-bit codes. The performance of the proposed method tends to increase slightly as m increases. Although a larger m may yield better performance, it also increases the amount of computation; in practice a small number of clusters is sufficient to obtain good performance.

Nearest points number s: Fig. 11 shows the performance of SCPH versus the parameter s. The precision tends to increase as s increases up to 5 and then decreases slightly as s grows further. Generally, s can be set to an integer between 5 and 20.

Bandwidth parameter t: Fig. 12 shows how the performance of SCPH varies with t. Both MAP and precision remain consistent as t varies. Since the bandwidth parameter t scales the distances in the construction of the similarity graph, it affects all data points in the same way; consequently, it has a consistent influence on the performance of SCPH as long as it is neither too large nor too small. Usually, t can be chosen in the range [0.01, 10].

Regularization parameter λ: Fig. 13 shows that SCPH remains steady as λ varies, so λ can be set to a value in the range [1, 20].

Regularization parameter η: Fig. 14 shows how the performance of SCPH varies with η. MAP remains consistent as η varies, whereas the precision is higher when η is larger than 3. Generally, η can be chosen in the range [4, 20].

From the above analyses, we conclude that the proposed method is stable with respect to these parameters in general, so there is a wide range from which to choose them.

5. Conclusion

This paper proposes a novel semi-supervised hashing method which makes full use of limited constraint information to preserve semantic similarities in binary codes. It enforces the pairwise constraints in both the low-dimensional embedding process and the quantization process, and thus gives the binary codes more discriminating power. Thorough experiments on four data sets show the superior performance of the proposed method over existing algorithms. In the future, we will extend our method to the multimodal case.

Acknowledgments

This paper was supported partially by the National Natural Science Foundation of China (Grant nos. 61125204, 61432014, 61472304, and 61172146), the Fundamental Research Funds for the Central Universities (Grant nos. BDZ021403 and JB149901), the


Program for Changjiang Scholars and Innovative Research Team in University of China (No. IRT13088), and the Shaanxi Innovative Research Team for Key Science and Technology (No. 2012KCT-02).

References

[1] D. Zhang, J. Wang, D. Cai, J. Lu, Self-taught hashing for fast similarity search, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2010, pp. 18–25.
[2] C. Hong, J. Yu, D. Tao, M. Wang, Image-based three-dimensional human pose recovery by multi-view locality sensitive sparse retrieval, IEEE Trans. Ind. Electron. 62 (6) (2014) 3742–3751. http://dx.doi.org/10.1109/TIE.2014.2378735.
[3] H. Fu, X. Kong, J. Lu, Large-scale image retrieval based on boosting iterative quantization hashing with query-adaptive reranking, Neurocomputing 122 (2013) 480–489.
[4] C. Strecha, A.M. Bronstein, M.M. Bronstein, P. Fua, LDAHash: improved matching with smaller descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (2012) 66–78.
[5] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, J. Attenberg, Feature hashing for large scale multitask learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, NY, USA, 2009, pp. 1113–1120.
[6] J. Song, Y. Yang, Z. Huang, H.T. Shen, R. Hong, Multiple feature hashing for real-time large scale near-duplicate video retrieval, in: Proceedings of the 19th ACM International Conference on Multimedia, ACM, New York, NY, USA, 2011, pp. 423–432.
[7] M. Henzinger, Finding near-duplicate web pages: a large-scale evaluation of algorithms, in: Proceedings of the 29th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2006, pp. 284–291.
[8] F. Zou, H. Feng, H. Ling, C. Liu, L. Yan, P. Li, D. Li, Nonnegative sparse coding induced hashing for image copy detection, Neurocomputing 105 (2013) 81–89.
[9] M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in: Proceedings of the 20th Symposium on Computational Geometry, ACM, New York, NY, USA, 2004, pp. 253–262.
[10] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: Proceedings of the 12th International Conference on Computer Vision, IEEE, Piscataway, NJ, USA, 2009, pp. 2130–2137.
[11] M. Raginsky, S. Lazebnik, Locality-sensitive binary codes from shift-invariant kernels, in: Advances in Neural Information Processing Systems, 2009, pp. 1509–1517.
[12] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems, 2009, pp. 1753–1760.
[13] W. Kong, W.-J. Li, Isotropic hashing, in: Advances in Neural Information Processing Systems, 2012, pp. 1646–1654.
[14] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2916–2929.
[15] Z. Bodó, L. Csató, Linear spectral hashing, Neurocomputing 141 (2014) 117–123.
[16] W. Liu, J. Wang, S. Kumar, S.-F. Chang, Hashing with graphs, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 1–8.
[17] F. Shen, C. Shen, Q. Shi, A. Van Den Hengel, Z. Tang, Inductive hashing on manifolds, in: Proceedings of the 26th International Conference on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ, USA, 2013, pp. 1562–1569.
[18] W. Liu, C. Mu, S. Kumar, S.-F. Chang, Discrete graph hashing, in: Advances in Neural Information Processing Systems, 2014, pp. 3419–3427.
[19] G. Irie, Z. Li, X.-M. Wu, S.-F. Chang, Locally linear hashing for extracting nonlinear manifolds, in: Proceedings of the 27th International Conference on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ, USA, 2014, pp. 2123–2130.
[20] K. He, F. Wen, J. Sun, K-means hashing: an affinity-preserving quantization method for learning binary compact codes, in: Proceedings of the 26th International Conference on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ, USA, 2013, pp. 2938–2945.
[21] M. Norouzi, D.J. Fleet, Cartesian k-means, in: Proceedings of the 26th International Conference on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ, USA, 2013, pp. 3017–3024.
[22] J. Wang, J. Song, X. Xu, H. Shen, S. Li, Optimized cartesian k-means, IEEE Trans. Image Process. 27 (1) (2015) 180–192.
[23] B. Kulis, T. Darrell, Learning to hash with binary reconstructive embeddings, in: Advances in Neural Information Processing Systems, 2009, pp. 1042–1050.
[24] M. Norouzi, D.M. Blei, Minimal loss hashing for compact binary codes, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 353–360.
[25] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, S.-F. Chang, Supervised hashing with kernels, in: Proceedings of the 25th International Conference on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ, USA, 2012, pp. 2074–2081.
[26] Y. Mu, J. Shen, S. Yan, Weakly-supervised hashing in kernel space, in: Proceedings of the 23rd International Conference on Computer Vision and Pattern Recognition, IEEE, Piscataway, NJ, USA, 2010, pp. 3344–3351.
[27] J. Wang, S. Kumar, S.-F. Chang, Semi-supervised hashing for large-scale search, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2393–2406.
[28] C. Wu, J. Zhu, D. Cai, C. Chen, J. Bu, Semi-supervised nonlinear hashing using bootstrap sequential projection learning, IEEE Trans. Knowl. Data Eng. 25 (6) (2013) 1380–1393.
[29] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, 2006, pp. 801–808.
[30] M. Yang, D. Zhang, J. Yang, Robust sparse coding for face recognition, in: Proceedings of the 24th International Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 625–632.
[31] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview hessian discriminative sparse coding for image annotation, Comput. Vis. Image Underst. 118 (2014) 50–60.
[32] X. Zhu, L. Zhang, Z. Huang, A sparse embedding and least variance encoding approach to hashing, IEEE Trans. Image Process. 23 (9) (2014) 3737–3750.
[33] P.H. Schönemann, A generalized solution of the orthogonal procrustes problem, Psychometrika 31 (1) (1966) 1–10.
[34] W. Liu, H. Zhang, D. Tao, Y. Wang, K. Lu, Large-scale paralleled sparse principal component analysis, Multimed. Tools Appl. (2014) 1–13.
[35] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (3) (2001) 145–175.
[36] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the 7th International Conference on Computer Vision, vol. 2, IEEE, 1999, pp. 1150–1157.

Di Wang received the B.S. degree in computer science from Changan University, Xi'an, China, in 2011. She is currently working toward the Ph.D. degree in the School of Electronic Engineering at Xidian University. Her research interests include machine learning and multimedia information retrieval.

Xinbo Gao (M'02–SM'07) received the B.Eng., M.Sc. and Ph.D. degrees in signal and information processing from Xidian University, China, in 1994, 1997 and 1999, respectively. From 1997 to 1998, he was a research fellow in the Department of Computer Science at Shizuoka University, Japan. From 2000 to 2001, he was a postdoctoral research fellow in the Department of Information Engineering at the Chinese University of Hong Kong. Since 2001, he joined the School of Electronic Engineering at Xidian University. Currently, he is a Cheung Kong Professor of Ministry of Education, China, a Professor of Pattern Recognition and Intelligent System, and Director of the State Key Laboratory of Integrated Services Networks. His research interests are computational intelligence, machine learning, computer vision, pattern recognition and wireless communications. In these areas, he has published 5 books and around 200 technical articles in refereed journals and proceedings including IEEE TIP, TCSVT, TNN, TSMC, IJCV, and Pattern Recognition. He is on the editorial boards of several journals including Signal Processing (Elsevier), and Neurocomputing (Elsevier). He served as General Chair/Co-chair or program committee chair/co-chair or PC member for around 30 major international conferences. Now, he is a Fellow of IET and Senior Member of IEEE.

Xiumei Wang received the Ph.D. degree from Xidian University, in 2010. She is currently an Associate Professor at the School of Electronic Engineering in Xidian University. Her research interests mainly involve nonparametric statistical models and machine learning. In these areas, she has published several scientific articles including IEEE TSMC-B, Pattern recognition and Neurocomputing.