Signal Processing: Image Communication 76 (2019) 31–40
Neighborhood kinship preserving hashing for supervised learning

Yan Cui a,d, Jielin Jiang b,c,∗, Zuojin Hu a, Xiaoyan Jiang a, Wuxia Yan e,a, Min-ling Zhang d

a College of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
b School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
c Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China
d School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
e Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou 350121, China

∗ Corresponding author.
E-mail addresses: [email protected] (Y. Cui), [email protected] (J. Jiang).
ARTICLE INFO

Keywords:
Hashing learning
Kinship preserving hashing
Discriminant information
Discriminant hashing
Robust distance metric
ABSTRACT

Most existing hashing methods rarely utilize label information to learn the hashing function. However, the label information of the training data is very important for classification. In this paper, we develop a new neighborhood kinship preserving hashing based on a learned robust distance metric, which pulls intra-class neighborhood samples as close as possible and pushes inter-class neighborhood samples as far as possible, so that the discriminant information of the training data is incorporated into the learning framework. Furthermore, this discriminant information is inherited by the Hamming space through the proposed neighborhood kinship preserving hashing, which obtains highly similar binary representations for kinship neighbor pairs and highly different binary representations for non-kinship neighbor pairs. Moreover, the proposed priori intervention iterative optimization algorithm can better apply the learned discriminant information to classification and matching. Experimental results clearly demonstrate that our method achieves leading performance compared with state-of-the-art supervised hashing methods.
1. Introduction

With the rapid growth of web data, including text documents and vision data, hashing techniques have recently become a hot research topic for huge databases. Hashing-based approximate nearest neighbor search methods have been successfully applied in many domains such as image retrieval [1,2], image matching [3,4], classification [5,6], object detection [7,8] and recommendation [9,10]. Hashing methods, which aim to preserve some notion of similarity in the hashing space, can be categorized into three groups: unsupervised hashing, semi-supervised hashing and supervised hashing.

Unsupervised hashing methods try to design hash functions that preserve the similarity of the original feature space using unlabeled data. Locality sensitive hashing (LSH) [11] is the seminal unsupervised hashing method and has been successfully applied in information retrieval. It learns linear hashing functions to convert the unlabeled input data into binary hash codes. Due to the asymptotic theoretical guarantees of random projection-based LSH, many LSH-based unsupervised hashing methods have been developed [12–15]. Spectral hashing [16] is another popular unsupervised hashing method, which generates compact hash codes based on the low-dimensional manifold structure of the data. Multidimensional spectral hashing [17] and spectral multimodal hashing [18] are its updated extension and multimodal extension, respectively. Other graph-based hashing methods include isotropic hashing [19], hashing with graphs [20], discrete graph hashing [21], self-taught hashing [22] and anchor graph hashing [23]. Iterative quantization (ITQ) [1] is a well-known unsupervised hashing approach, which refines the initial projection matrix learned by PCA to minimize the quantization error of mapping the data to the vertices of a binary hypercube.

Semi-supervised hashing incorporates the information of a few labeled samples into the construction of hash functions. Semi-Supervised Hashing (SSH) [24,25] is well known for preserving semantic similarity by utilizing the pairwise information of labeled samples. Semi-Supervised Discriminant Hashing (SSDH) [26] is based on Fisher's discriminant analysis; it learns hash codes by maximizing the separability between labeled data in different classes while the unlabeled data are used for regularization. Semi-supervised hashing with semantic confidence [27] incorporates the pairwise and triplet relationships of the image examples into hash function learning based on semantic confidence. Semi-supervised nonlinear hashing using bootstrap sequential projection learning (Bootstrap-NSPLH) [28]
takes into account all the previously learned bits to correct the errors accumulated when converting the real-valued embedding into binary codes, without incurring extra computational cost. Weakly supervised hashing [29] maximizes the label-regularized margin partition (LAMP) to enhance hashing quality by using kernel-based similarity and additional pairwise constraints as side information. Semi-Supervised Tag Hashing (SSTH) [30] can fully utilize tag information to design hash functions by exploring the correlation between tags and hashing bits.

In contrast, when label information is available, supervised hashing tries to incorporate the information of the labeled data into the construction of hash functions. Linear discriminant analysis hashing (LDAH) [31] learns hash functions for which the expectation of the Hamming metric over the set of positive pairs is minimized and that over the negative pairs is maximized. Kernel-Based Supervised Hashing (KSH) [32] utilizes pairwise supervision information to achieve high-quality hashing. Binary reconstructive embedding (BRE) [33] designs hash functions by minimizing the reconstruction error between the input distances and the reconstructed Hamming distances. Minimal Loss Hashing (MLH) [34] aims to learn similarity-preserving binary codes by preserving the similarity between pairs of training exemplars. Ranking-based supervised hashing (RSH) [35] was proposed to leverage listwise supervision in the hash function learning framework. Besides RSH, the tree-based methods [36,37] and the Hamming metric learning framework presented in [38] also aim to preserve ranking orders. Supervised hashing using graph cuts and boosted decision trees [39] proposed a flexible hashing framework which is able to incorporate various kinds of loss functions and hash functions. Supervised discrete hashing (SDH) [40] formulates the hashing framework in terms of linear classification, where the learned binary codes are expected to be optimal for classification. Supervised hashing for image retrieval via image representation learning [41] automatically learns a good image representation tailored to hashing as well as a set of hash functions. Fast supervised discrete hashing (FSDH) [42] uses a very simple yet effective regression of the class labels of training examples onto the corresponding hash codes to accelerate the algorithm. Fast scalable supervised hashing (FSSH) [43] learns the hash codes based on both the semantic information and the features of the data. Jointly sparse hashing (JSH) [44] minimizes the locality information loss and obtains a jointly sparse hashing method which integrates locality, joint sparsity and a rotation operation in a seamless formulation.

Although unsupervised hashing and semi-supervised hashing have been successfully applied to large-scale data retrieval, they have the following limitations. (1) For unsupervised hashing, no label information is available that might benefit the construction of hash functions. Thus, in order to obtain satisfying retrieval results, LSH-based methods typically require long codes and multiple hash tables, which leads to considerable storage cost and longer query time. (2) For semi-supervised hashing, only a few labeled training samples can be utilized to design the hash functions.
The existing semi-supervised hashing methods usually design independent objective functions for the few labeled data and for the unlabeled data, respectively. Thus, the objective functions based on the unlabeled training data do not take any prior information from the labeled data into account.

Supervised hashing aims to learn hashing functions that generate compact binary codes which preserve label-based similarity in the Hamming space. In this paper, we propose a novel supervised hashing framework which aims to preserve the neighborhood kinship in the Hamming space. The learned hashing framework pulls kinship neighborhoods as close as possible and simultaneously pushes non-kinship neighborhoods as far as possible. To this end, a robust distance metric is learned by leveraging the label information of the training data in the original space. As we know, inter-class samples with higher similarity usually lie in a neighborhood and are more easily misclassified than those with lower similarity. Thus, we expect the learned metric to push neighborhood samples from different classes as far as possible while pulling same-class neighborhood samples as close as possible, so that more discriminant information can be exploited for classification. In real-world applications, the discriminability of the learned metric is expected to be inherited by the Hamming space. To fulfill this idea, the learned distance metric is redefined in the Hamming space. Based on the redefined distance metric, the learned hashing function gives same-class neighborhoods kinship binary codes and gives neighborhoods from different classes obviously different binary codes.

To emphasize the main contributions of this paper, the advantages of our new supervised hashing framework can be summarized as follows. (1) A novel supervised hashing approach is developed based on a learned distance metric which is optimal for classification. We leverage the label information of the original training data to learn a robust distance metric that pulls intra-class neighborhood samples as close as possible and pushes inter-class neighborhood samples as far as possible, so that the discriminant information of the training data is incorporated into the learning framework. The learned distance metric is redefined in the Hamming space, so that the discriminant information of the original space is inherited by the Hamming space. Based on the redefined distance, we design a hashing function to learn the binary codes of the training data. The learned hashing function gives same-class samples similar binary codes and gives samples from different classes obviously different binary codes. (2) The proposed method greatly outperforms other state-of-the-art supervised hashing methods in terms of retrieval accuracy.

The organization of this paper is as follows. We present the related work in Section 2. In Section 3, the novel supervised hashing framework is proposed for large-scale data. Experiments are presented in Section 4. Conclusions are summarized in Section 5.

2. Preliminary

Hashing has been widely applied to approximate nearest neighbor (ANN) search in large-scale information retrieval due to its encouraging efficiency in both speed and storage. In this section, we briefly review some representative unsupervised, semi-supervised and supervised hashing approaches.

2.1. Locality sensitive hashing

Locality Sensitive Hashing (LSH) [11] is a well-known unsupervised hashing method which uses random linear projections to map the data into a low-dimensional Hamming space. A typical category of LSH functions is constructed from random projections and thresholds as

h(x) = \mathrm{sign}(w^T x + b)    (1)
where \mathrm{sign}(z) is the sign function, which equals 1 for z > 0 and −1 otherwise, w is a random projection vector and b is a random intercept. The random projection vector w is data independent, and each of its entries is drawn from a p-stable distribution (including the Gaussian distribution) [11,12]. The random intercept b = -\frac{1}{N}\sum_{i=1}^{N} w^T x_i is a bias term; for centered data, b = 0. In order to preserve locality in the Hamming space, the hash function h(x) satisfies the following locality sensitive hashing property:

p\{h(x) = h(y)\} = \mathrm{sim}(x, y)    (2)

where p\{h(x) = h(y)\} is the probability of collision and \mathrm{sim}(x, y) is a similarity measure, which can be defined as \mathrm{sim}(x, y) = \exp(-\|x - y\|^2 / \sigma^2). Let H(x) = [h_1(x), h_2(x), \ldots, h_K(x)] be the K-bit binary embedding of x; then the probability of collision for x and y is

p\{H(x) = H(y)\} \propto \left[1 - \frac{\cos^{-1}(x^T y)}{\pi}\right]^K    (3)
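For concreteness, Eqs. (1)–(3) can be prototyped in a few lines of NumPy. The sketch below is only an illustration of the random-projection construction: the function names, the choice of a Gaussian projection and the {0,1} bit storage are our own conventions, not part of any released LSH implementation.

```python
import numpy as np

def fit_lsh(X, n_bits, rng=None):
    """Draw K random Gaussian projections and the mean-based intercepts (Eq. (1))."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.standard_normal((d, n_bits))    # data-independent projection vectors w
    b = -X.mean(axis=0) @ W                 # b = -(1/N) * sum_i w^T x_i; zero for centered data
    return W, b

def lsh_encode(X, W, b):
    """K-bit binary embedding H(x) = [h_1(x), ..., h_K(x)] with h(x) = sign(w^T x + b)."""
    return (X @ W + b > 0).astype(np.uint8)  # bits stored as {0,1}; equivalent to the +/-1 convention

# toy usage: similar points collide in most bits, dissimilar points do not
X = np.random.default_rng(0).standard_normal((1000, 512))
W, b = fit_lsh(X, n_bits=32, rng=1)
codes = lsh_encode(X, W, b)
hamming = np.count_nonzero(codes[0] != codes[1])  # Hamming distance between two codes
```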
In practice, a large value of K can reduce false collisions (i.e. the number of non-neighbor sample pairs falling into the same bucket). However, a large K also decreases the collision probability between similar samples. In real-world applications, more hash tables are usually constructed to overcome this drawback, but this leads to extra storage cost and longer query time.

2.2. Semi-supervised hashing for large scale search

Recently, some semi-supervised hashing methods have been developed to leverage the information of both labeled and unlabeled data. Semi-supervised hashing for large scale search is one of the most representative semi-supervised hashing methods; it integrates both labeled and unlabeled data to learn binary codes. For a training data set X = \{x_1, x_2, \ldots, x_n\} with each sample x_i \in R^t (i = 1, 2, \ldots, n), two categories of label information, M and C, are defined. M denotes the set of neighbor pairs: if (x_i, x_j) \in M, then x_i and x_j are neighbors in a metric space or share common labels. C denotes the set of non-neighbor pairs: if (x_i, x_j) \in C, then x_i and x_j are far away in the metric space or have different class labels. In order to compute the K-bit Hamming embedding Y \in B^{K \times n}, the kth hash function is defined as

h_k(x_i) = \mathrm{sgn}(w_k^T x_i + b_k)    (4)

where b_k = -\frac{1}{N}\sum_{i=1}^{N} w_k^T x_i is the mean of the projected data; for centered data, b_k = 0. In order to obtain the same binary bits for (x_i, x_j) \in M and different binary bits for (x_i, x_j) \in C, the linear projections w_k can be learned by maximizing the following objective function, which measures the empirical accuracy of a family of hash functions on the labeled data:

J(H) = \sum_{k} \Big\{ \sum_{(x_i, x_j) \in M} h_k(x_i) h_k(x_j) - \sum_{(x_i, x_j) \in C} h_k(x_i) h_k(x_j) \Big\}    (5)

To rewrite the above objective function in a compact matrix form, the following notation is defined:

H(X_l) = [h(x_1), h(x_2), \ldots, h(x_n)] \in R^{M \times n}, \quad h(x_i) = [h_1(x_i), h_2(x_i), \ldots, h_M(x_i)] \in R^{M}    (6)

and W = [w_1, w_2, \ldots, w_M] \in R^{D \times M}. With the above representation, the objective function for SSH can be rewritten as

J(H) = \frac{1}{2}\,\mathrm{tr}\{H(X_l)\, S\, H(X_l)^T\} = \frac{1}{2}\,\mathrm{tr}\{\mathrm{sgn}(W^T X_l)\, S\, \mathrm{sgn}(W^T X_l)^T\}    (7)

where S \in R^{l \times l} is an indicator matrix whose entries incorporate the pairwise label information from X_l as

S_{ij} = \begin{cases} 1 & \text{if } (x_i, x_j) \in M \\ -1 & \text{if } (x_i, x_j) \in C \\ 0 & \text{otherwise} \end{cases}    (8)

and \mathrm{sgn}(W^T X_l) is the matrix of signs of the individual elements. Since the optimization problem in Eq. (7) is non-differentiable, a simple relaxation of the empirical fitness is used in SSH: H(X_l) = \mathrm{sgn}(W^T X_l) is replaced by its signed magnitude W^T X_l. Thus the optimization problem can be written directly as

J(W) = \frac{1}{2}\,\mathrm{tr}\{W^T X_l S X_l^T W\}    (9)

Thus the embedding solution is composed of the leading eigenvectors of X_l S X_l^T; more details can be found in [25].
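The relaxed SSH problem in Eq. (9) reduces to an eigenvalue problem on X_l S X_l^T. The following sketch, with illustrative names and the column-wise sample layout assumed from the text above, builds S from the neighbor set M and non-neighbor set C and returns the leading eigenvectors as the projection W; it is not the SSH authors' implementation.

```python
import numpy as np

def ssh_projections(X_l, neighbor_pairs, nonneighbor_pairs, n_bits):
    """X_l: d x l labeled data (columns are samples).
    neighbor_pairs / nonneighbor_pairs: lists of (i, j) index pairs (the sets M and C).
    Returns W (d x n_bits), the leading eigenvectors of X_l S X_l^T (Eq. (9))."""
    l = X_l.shape[1]
    S = np.zeros((l, l))
    for i, j in neighbor_pairs:          # S_ij = 1 for (x_i, x_j) in M
        S[i, j] = S[j, i] = 1.0
    for i, j in nonneighbor_pairs:       # S_ij = -1 for (x_i, x_j) in C
        S[i, j] = S[j, i] = -1.0
    M = X_l @ S @ X_l.T                  # d x d matrix whose leading eigenvectors give W
    eigval, eigvec = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvec[:, np.argsort(eigval)[::-1][:n_bits]]

def ssh_encode(X, W):
    """H(X) = sgn(W^T X) for (already centered) data, mapped to {0,1} bits."""
    return (W.T @ X > 0).astype(np.uint8)
```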
2.3. Supervised hashing using graph cuts and boosted decision trees

Supervised hashing aims to learn compact binary codes of the original features that preserve label-based similarity in the binary Hamming space. Lin et al. proposed a supervised, flexible and efficient hashing optimization framework in [39], which decomposes the hash learning procedure into binary code inference and hash function learning. Consider a set of training points X = \{x_1, x_2, \ldots, x_n\} \in R^t and an affinity matrix Y describing the label-based similarity information, whose element y_{ij} measures the similarity between data points x_i and x_j, i.e. y_{ij} = 1 if x_i and x_j are similar (relevant), y_{ij} = -1 if x_i and x_j are dissimilar (irrelevant), and y_{ij} = 0 if the pairwise relation is undefined. In order to learn a set of hash functions that preserve this similarity in the Hamming space, the following optimization framework for hash function learning is proposed:

\min_{\Phi(\cdot)} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(y_{ij} \neq 0)\, L(\Phi(x_i), \Phi(x_j), y_{ij})    (10)

where \delta(y_{ij} \neq 0) \in \{0, 1\} denotes the relation of two data points: \delta(y_{ij} \neq 0) = 0 indicates that the pairwise relation is undefined, and \delta(y_{ij} \neq 0) = 1 indicates that it is defined. L(\cdot) is a loss function that measures how well the binary codes match the similarity ground truth y_{ij}. \Phi(x) = [h_1(x), h_2(x), \ldots, h_M(x)] is the output of x under M hash functions, where each h_i is a hash function with binary output h_i(x) \in \{1, -1\}. The learning procedure is decomposed into two steps: binary code inference and hash function learning. For binary code inference, auxiliary variables Z_{r,i} \in \{-1, 1\} (r = 1, 2, \ldots, M) are introduced to denote the output of the rth hash function on x_i:

Z_{r,i} = h_r(x_i)    (11)

With these auxiliary variables, the optimization framework in Eq. (10) can be decomposed into two subproblems:

\min_{Z} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(y_{ij} \neq 0)\, L(z_i, z_j; y_{ij}), \quad \mathrm{s.t.}\ Z \in \{-1, 1\}^{m \times n}    (12)

and

\min_{\Phi(\cdot)} \sum_{r=1}^{m} \sum_{i=1}^{n} \delta(z_{r,i} \neq h_r(x_i))    (13)

where Z is the matrix of M-bit binary codes for all training data points, z_i is the binary code vector for x_i, and \delta(\cdot) is a relation indicator function. Obviously, the hashing learning framework in Eq. (10) is turned into two steps, binary code inference and hash function learning, by solving Eqs. (12) and (13), respectively. For binary code inference, Eq. (12) can be rewritten as a standard quadratic problem based on the proposition in [39]. For hash function learning, AdaBoost is applied to solve Eq. (13): a decision tree is learned at each boosting iteration, and each node of the learned binary decision tree is a decision stump found by minimizing the weighted classification error. More details on the two subproblems can be found in [39].
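As a rough illustration of the hash-function-learning step in Eq. (13), the sketch below fits one decision tree per bit to binary codes that are assumed to have been inferred already by solving Eq. (12); a single scikit-learn tree per bit stands in for the boosted trees of [39], and all function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_hash_functions(X, Z, max_depth=4):
    """X: n x d features; Z: m x n inferred binary codes in {-1, +1} (from Eq. (12)).
    Step two of the two-step framework: fit the r-th hash function to the r-th code row."""
    return [DecisionTreeClassifier(max_depth=max_depth).fit(X, Z[r]) for r in range(Z.shape[0])]

def encode(X, hash_functions):
    """Apply the learned per-bit classifiers to new data; rows are bits."""
    return np.vstack([h.predict(X) for h in hash_functions])
```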
3. Neighborhood kinship preserving hashing

For supervised learning, though the learned hashing functions can preserve some structure of the data well, they cannot exploit the discriminative information of the data, which may weaken the performance of ANN search. To handle this problem, we propose neighborhood kinship preserving hashing, which exploits the discriminative information by incorporating the label information of the training data into the hash learning model. In this section, we present the motivation and formulation of the proposed approach.

3.1. Motivation

The nearest neighbor search problem is defined as the task of identifying similar samples in a data set for a given test sample. In practice, samples from different classes with higher similarity usually lie in a neighborhood and are more easily misclassified than those with lower similarity. Thus we tend to learn a distance metric which pushes these samples (lying in a neighborhood but with different labels) as far as possible while pulling same-class neighborhood samples as close as possible.

Let X = \{x_1, x_2, \ldots, x_n\} be the whole training data set with each sample x_i \in R^d (i = 1, 2, \ldots, n). For each x_i \in X, its k nearest neighbors form the neighborhood set N_k\{x_i\}, whose members can be divided into two categories, K and NK, by using the label information of x_j \in N_k\{x_i\} and x_i. Specifically, (x_i, x_j) \in K is a kinship neighbor pair if x_j and x_i share the same class label; similarly, (x_i, x_j) \in NK is a non-kinship neighbor pair if x_j and x_i have different class labels. As we know, non-kinship neighbor pairs are more easily misclassified because they are as similar as the kinship neighbor pairs. Therefore, to improve classification performance, we learn a robust distance metric in which the non-kinship neighbor pairs are emphasized and pushed as far away as possible, while the kinship neighbor pairs are pulled as close as possible; the learned distance metric thus incorporates more discriminative information for verification. As discussed above, the learned distance metric, which pulls kinship neighbor pairs close and pushes non-kinship neighbor pairs away, is defined as

d(x_i, x_j) = \sqrt{(x_i - x_j)^T A (x_i - x_j)}    (14)

where A is an m \times m square matrix. Since d(x_i, x_j) must satisfy the properties of nonnegativity, symmetry and the triangle inequality, A is symmetric and positive semi-definite. Furthermore, the kinship degree of the kinship neighbor pairs can be measured by

J_k(A) = \frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} d^2(x_i, x_j) = \frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} (x_i - x_j)^T A (x_i - x_j)    (15)

Meanwhile, the non-kinship degree of the non-kinship neighbor pairs can be measured by

J_{nk}(A) = \frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} d^2(x_i, x_j) = \frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} (x_i - x_j)^T A (x_i - x_j)    (16)

In order to pull kinship neighbor pairs close and push non-kinship neighbor pairs away, the following problem is proposed:

\min_{A} J(A) = J_k(A) - J_{nk}(A) = \frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} (x_i - x_j)^T A (x_i - x_j) - \frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} (x_i - x_j)^T A (x_i - x_j)    (17)

where k_1 = \mathrm{card}(K) and k_2 = \mathrm{card}(NK) are the cardinalities of K and NK, respectively. To solve the optimization criterion in Eq. (17), the distance metric d must first be known in order to compute the k-nearest neighbors of x_i. Therefore, we use an iterative scheme to solve this problem: we first use the Euclidean metric to search for the neighborhood set N_k\{x_i\} of x_i, and then obtain d. Since A is symmetric and positive semidefinite, A can be decomposed as

A = W W^T    (18)

where W \in R^{m \times l}. Then, the distance metric in Eq. (14) can be rewritten as

d(x_i, x_j) = \sqrt{(x_i - x_j)^T A (x_i - x_j)} = \sqrt{(x_i - x_j)^T W W^T (x_i - x_j)}    (19)

With the above definition, J_k(A) and J_{nk}(A) can be further written as

J_k(A) = \frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} (x_i - x_j)^T W W^T (x_i - x_j) = \mathrm{tr}\Big(W^T \Big(\frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} (x_i - x_j)(x_i - x_j)^T\Big) W\Big)    (20)

and

J_{nk}(A) = \frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} (x_i - x_j)^T W W^T (x_i - x_j) = \mathrm{tr}\Big(W^T \Big(\frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} (x_i - x_j)(x_i - x_j)^T\Big) W\Big)    (21)

Therefore, the optimization problem in Eq. (17) is turned into

\min_{A} J(A) = J_k(A) - J_{nk}(A) = \mathrm{tr}\Big(W^T \Big(\frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} (x_i - x_j)(x_i - x_j)^T\Big) W\Big) - \mathrm{tr}\Big(W^T \Big(\frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} (x_i - x_j)(x_i - x_j)^T\Big) W\Big) = \mathrm{tr}\big(W^T (M_k - M_{nk}) W\big)    (22)

where M_k = \frac{1}{k_1} \sum_{i=1}^{n} \sum_{x_j \in K} (x_i - x_j)(x_i - x_j)^T and M_{nk} = \frac{1}{k_2} \sum_{i=1}^{n} \sum_{x_j \in NK} (x_i - x_j)(x_i - x_j)^T are the kinship matrix and non-kinship matrix, respectively. In order to make the proposed optimization well posed with respect to W, the constraint W^T W = I is introduced to restrict the scale of W. Thus, we can further formulate the optimization problem in Eq. (17) as follows:

\min_{A} J(A) = \mathrm{tr}\big(W^T (M_k - M_{nk}) W\big), \quad \mathrm{s.t.}\ W^T W = I    (23)

The above optimization problem can be solved as a conventional eigenvalue–eigenvector problem. Having obtained W, A can be computed by Eq. (18), and the neighborhood set N_k\{x_i\} of x_i can be updated by the learned distance metric. The optimal W can be obtained by the iterative Algorithm 1.
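The iterative scheme above (Algorithm 1) can be realized roughly as follows. This is a minimal NumPy sketch under our own assumptions about the neighborhood size, the target dimension and the number of iterations; it is not the authors' implementation, and the dense pairwise-distance computation is kept only for clarity.

```python
import numpy as np

def kinship_metric(X, labels, k=10, dim=32, n_iter=5):
    """X: n x d data, labels: length-n class labels (NumPy array).
    Iteratively solves Eq. (23): W = eigenvectors of (M_k - M_nk) with the smallest
    eigenvalues (minimizing tr(W^T (M_k - M_nk) W) under W^T W = I), then A = W W^T."""
    n, d = X.shape
    A = np.eye(d)                                        # start from the Euclidean metric
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]             # pairwise differences, n x n x d
        dist = np.einsum('ijd,de,ije->ij', diff, A, diff)  # squared distances under current metric
        np.fill_diagonal(dist, np.inf)
        neighbors = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbors under current metric
        M_k = np.zeros((d, d)); M_nk = np.zeros((d, d)); k1 = k2 = 0
        for i in range(n):
            for j in neighbors[i]:
                outer = np.outer(X[i] - X[j], X[i] - X[j])
                if labels[i] == labels[j]:               # kinship neighbor pair
                    M_k += outer; k1 += 1
                else:                                    # non-kinship neighbor pair
                    M_nk += outer; k2 += 1
        M = M_k / max(k1, 1) - M_nk / max(k2, 1)
        eigval, eigvec = np.linalg.eigh(M)               # ascending eigenvalues
        W = eigvec[:, :dim]                              # smallest eigenvalues minimize the trace
        A = W @ W.T                                      # updated metric (Eq. (18))
    return W, A
```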
3.2. Formulation

From Section 3.1, we know that the distance metric is learned based on the label information of the training data, so that the discriminant information is incorporated for classification. Therefore, this discriminant information is expected to be inherited when mapping the original data samples into the Hamming space. Let X = \{x_1, x_2, \ldots, x_n\} be the training data set with each sample x_i \in R^t (i = 1, 2, \ldots, n), and let L = \{l_1, l_2, \ldots, l_n\} be the corresponding label set. Our aim is to map x \in X to a Hamming space so as to obtain its compact binary representation. Since linear hashing functions are very efficient to compute, linear functions are utilized as hashing functions in this work. Suppose the lth hashing function is defined as

h_l(x_i) = \mathrm{sgn}(w_l^T x_i + b_l)    (24)

where w_l \in R^d is a linear projection that maps x_i \in X to -1 or 1, and b_l is the mean of the projected data. The lth corresponding binary representation can be obtained by

y_{il} = \frac{1}{2}\big(1 + h_l(x_i)\big) = \frac{1}{2}\big(1 + \mathrm{sgn}(w_l^T x_i)\big)    (25)

Let H = [h_1, h_2, \ldots, h_K] be the sequence of the K hashing functions; thus, for every x \in X, we have

H(x) = [h_1(x), h_2(x), \ldots, h_K(x)] = \mathrm{sgn}(W^T x + b)    (26)

where W = [w_1, w_2, \ldots, w_K] \in R^{d \times K} is the linear projection matrix and b = [b_1, b_2, \ldots, b_K]^T \in R^K. To obtain satisfactory matching results, W is learned to produce the same binary representation for kinship neighbor pairs and different binary representations for non-kinship neighbor pairs. Specifically, the Hamming distance between kinship neighbor pairs should be 0, i.e. if (x_i, x_j) \in K, then h_k(x_i) = h_k(x_j) (k = 1, 2, \ldots, K; i, j = 1, 2, \ldots, n); in other words, the following formulation can be maximized to make the similarity relational degree between kinship neighbor pairs as large as possible:

\max_{W} \sum_{k=1}^{K} \sum_{(x_i, x_j) \in K} h_k(x_i)\, h_k(x_j)    (27)

Meanwhile, the Hamming distances between non-kinship pairs should be as large as possible, i.e. their similarity relational degree should be as small as possible; the following sub-problem is proposed to minimize the similarity relational degree between the non-kinship neighbor pairs:

\min_{W} \sum_{k=1}^{K} \sum_{(x_i, x_j) \in NK} h_k(x_i)\, h_k(x_j)    (28)

From an information-theoretic point of view, each bit of the learned binary hashing code should provide as much information as possible, and the redundancy among all the hashing bits should be reduced. As in [26], we use the maximum entropy principle to make each learned binary hashing bit give a balanced partitioning of X, i.e.

\sum_{i=1}^{n} h_k(x_i) = 0    (29)

In fact, it is hard to find hashing functions that implement the maximum entropy partitioning exactly. Fortunately, it has been proved in [26] that the maximum entropy partitioning is equivalent to maximizing the variance of a bit, i.e.

\max \sum_{k=1}^{K} \mathrm{var}(h_k(X)) = \sum_{k=1}^{K} \mathrm{var}\big[\mathrm{sgn}(w_k^T X + b_k)\big]    (30)

From Eq. (29), we have

\sum_{i=1}^{n} (w_k^T x_i + b_k) = 0 \iff b_k = -\frac{1}{n} \sum_{i=1}^{n} w_k^T x_i = 0    (31)

for centered data. Thus, the optimization problem in Eq. (30) is turned into

\max \sum_{k=1}^{K} \mathrm{var}(h_k(X)) = \sum_{k=1}^{K} \mathrm{var}\big[\mathrm{sgn}(w_k^T X)\big]    (32)

As discussed above, we maximize the following optimization problem to make the similarity relational degree between kinship neighbor pairs as large as possible and that between non-kinship neighbor pairs as small as possible:

\max_{W} J(H) = \sum_{k=1}^{K} \sum_{(x_i, x_j) \in K} h_k(x_i)\, h_k(x_j) - \sum_{k=1}^{K} \sum_{(x_i, x_j) \in NK} h_k(x_i)\, h_k(x_j)    (33)

Meanwhile, combining the regularizer term Eq. (32) with Eq. (33) to ensure that the learned hashing bits are highly informative, the overall objective function of the neighborhood kinship preserving hashing is given as

\max_{W} J(H) = \sum_{k=1}^{K} \sum_{(x_i, x_j) \in K} h_k(x_i)\, h_k(x_j) - \sum_{k=1}^{K} \sum_{(x_i, x_j) \in NK} h_k(x_i)\, h_k(x_j) + \lambda \sum_{k=1}^{K} \mathrm{var}\big[\mathrm{sgn}(w_k^T X)\big]    (34)

In order to reduce the redundancy in the learned binary codes, the learned binary codes are expected to be uncorrelated. Thus the learned linear functions are restricted to be orthogonal, i.e.

W^T W = I_K    (35)

Incorporating the orthogonality constraint into the overall objective function, the following unified formulation is obtained:

\max_{W} \sum_{k=1}^{K} \sum_{(x_i, x_j) \in K} h_k(x_i)\, h_k(x_j) - \sum_{k=1}^{K} \sum_{(x_i, x_j) \in NK} h_k(x_i)\, h_k(x_j) + \lambda \sum_{k=1}^{K} \mathrm{var}\big[\mathrm{sgn}(w_k^T X)\big], \quad \mathrm{s.t.}\ W^T W = I_K    (36)

As stated above, the proposed unified formulation searches for an orthogonal W that yields compact and highly informative hashing codes. The learned hashing codes of kinship neighbor pairs should be highly similar, while those of the non-kinship neighbor pairs should be highly different.
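To make Eqs. (24)–(26) concrete, the following sketch (illustrative names only) encodes data with a given projection matrix W, uses the mean of the projected data for the intercepts so that each bit is roughly balanced as in Eqs. (29) and (31), and maps the ±1 outputs to {0,1} bits as in Eq. (25).

```python
import numpy as np

def nkph_encode(X, W):
    """X: n x d data, W: d x K projection matrix.
    H(x) = sgn(W^T x + b) with b_l the negative mean of the projected data (Eqs. (24), (26));
    y_il = (1 + h_l(x_i)) / 2 maps the codes to {0, 1} (Eq. (25))."""
    proj = X @ W
    b = -proj.mean(axis=0)             # balanced bits: sum_i h_k(x_i) is approximately 0 (Eq. (29))
    h = np.where(proj + b > 0, 1, -1)  # +/-1 codes
    y = (1 + h) // 2                   # {0,1} codes
    return h, y
```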
3.3. Optimization and algorithms

In order to find W, the first term of the objective function can be rewritten in compact matrix form as follows:

\sum_{k=1}^{K} \sum_{(x_i, x_j) \in K} h_k(x_i)\, h_k(x_j) - \sum_{k=1}^{K} \sum_{(x_i, x_j) \in NK} h_k(x_i)\, h_k(x_j) = \mathrm{tr}\{H(X) S_K H(X)^T\} - \mathrm{tr}\{H(X) S_{NK} H(X)^T\} = \mathrm{tr}\{\mathrm{sgn}(W^T X) S_K \mathrm{sgn}(W^T X)^T\} - \mathrm{tr}\{\mathrm{sgn}(W^T X) S_{NK} \mathrm{sgn}(W^T X)^T\}    (37)

where S_K \in R^{n \times n} and S_{NK} \in R^{n \times n} are the neighbor kinship relationship matrices incorporating the pairwise kinship information from X, defined as

S_{K_{ij}} = \begin{cases} 1 & \text{if } (x_i, x_j) \in K \\ 0 & \text{otherwise} \end{cases}    (38)

and

S_{NK_{ij}} = \begin{cases} 1 & \text{if } (x_i, x_j) \in NK \\ 0 & \text{otherwise} \end{cases}    (39)
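A minimal sketch of Eqs. (37)–(39): the kinship and non-kinship indicator matrices are built from the current neighborhoods and the trace form of the first term is evaluated. The helper names and the neighborhood representation are assumptions made for illustration.

```python
import numpy as np

def kinship_matrices(labels, neighbors):
    """labels: length-n class labels (NumPy array); neighbors: n x k neighbor indices.
    Returns S_K and S_NK as in Eqs. (38)-(39) (symmetrized indicator matrices)."""
    n = len(labels)
    S_K = np.zeros((n, n)); S_NK = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            if labels[i] == labels[j]:
                S_K[i, j] = S_K[j, i] = 1.0     # kinship neighbor pair
            else:
                S_NK[i, j] = S_NK[j, i] = 1.0   # non-kinship neighbor pair
    return S_K, S_NK

def first_term(H, S_K, S_NK):
    """Trace form of Eq. (37); H is the K x n matrix of +/-1 codes H(X)."""
    return np.trace(H @ S_K @ H.T) - np.trace(H @ S_NK @ H.T)
```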
Table 1
Results of the compared methods in precision (%), recall (%) and F-measure (%) of Hamming distance 2 on CIFAR-10 with code lengths 16, 32, 64, 96, 128 and 256.

                Bits   LSH     ITQ     KSH     SDH     FSDH    FSSH    NKPH
Precision       16     0.66    0.69    32.02   59.65   55.71   50.17   55.12
                32     3.17    5.37    33.80   63.41   62.24   54.87   65.45
                64     9.43    23.04   12.16   53.50   55.41   56.39   57.17
                96     3.30    7.38    3.97    44.52   46.80   59.05   46.20
                128    2.10    3.43    1.00    40.06   42.08   61.81   40.83
                256    0.20    0.60    0.10    31.68   35.50   63.67   44.07
Recall          16     9.90    9.07    5.74    41.45   43.68   48.59   90.01
                32     9.31    9.64    0.47    28.46   28.71   60.57   90.01
                64     1.01    6.38    0.02    20.34   22.08   62.45   90.04
                96     0.16    0.77    0.00    17.46   19.74   64.32   90.01
                128    0.07    0.27    0.00    15.44   18.09   66.19   90.02
                256    0.01    0.01    0.00    11.62   15.20   67.33   89.99
F-measure       16     1.23    1.28    9.73    48.91   48.97   49.37   68.37
                32     4.72    6.89    0.93    39.29   39.29   57.58   75.79
                64     1.82    9.99    0.04    29.47   31.58   59.27   69.94
                96     0.31    1.39    0.00    25.08   27.77   61.57   61.05
                128    0.14    0.50    0.00    22.29   25.30   63.93   56.18
                256    0.02    0.02    0.00    17.00   21.29   65.45   59.17
Table 2
Results of the compared methods in MAP (%) on CIFAR-10 with code lengths from 16 to 256.

Bits   LSH     ITQ     KSH     SDH     FSDH    FSSH    NKPH
16     12.61   16.23   22.88   57.85   54.38   39.69   61.88
32     13.45   16.92   23.54   61.16   59.54   56.92   63.29
64     15.04   17.58   23.63   62.02   62.24   58.79   62.79
96     15.62   17.92   25.49   63.30   62.75   60.05   60.88
128    16.01   18.11   25.62   63.27   63.42   62.21   61.44
256    17.02   18.62   25.48   64.12   64.16   64.36   62.04

As proved in [20], the maximum variance of a hashing function is lower bounded by the scaled variance of the projected data. Thus the second term of the objective function can be rewritten as

\lambda \sum_{k=1}^{K} \mathrm{var}\big[\mathrm{sgn}(w_k^T X)\big] = \frac{\lambda}{n} \sum_{k} w_k^T X X^T w_k = \frac{\lambda}{n} \mathrm{tr}\{W^T X X^T W\}    (40)

By jointly exploiting the above matrix forms, the objective function in Eq. (36) is turned into

\max_{W} \mathrm{tr}\{\mathrm{sgn}(W^T X) S_K \mathrm{sgn}(W^T X)^T\} - \mathrm{tr}\{\mathrm{sgn}(W^T X) S_{NK} \mathrm{sgn}(W^T X)^T\} + \frac{\beta}{2} \mathrm{tr}\{W^T X X^T W\}, \quad \mathrm{s.t.}\ W^T W = I_K    (41)

Furthermore, we convert the orthogonality constraint into a penalty term and add it to the objective function; thus, the above optimization problem can be written as follows:

\max_{W} \mathrm{tr}\{\mathrm{sgn}(W^T X) S_K \mathrm{sgn}(W^T X)^T\} - \mathrm{tr}\{\mathrm{sgn}(W^T X) S_{NK} \mathrm{sgn}(W^T X)^T\} + \frac{\beta}{2} \mathrm{tr}\{W^T X X^T W\} - \frac{\rho}{2} \|W^T W - I\|_F^2
\iff \max_{W} \mathrm{tr}\{\mathrm{sgn}(W^T X) S_K \mathrm{sgn}(W^T X)^T\} - \mathrm{tr}\{\mathrm{sgn}(W^T X) S_{NK} \mathrm{sgn}(W^T X)^T\} + \frac{\beta}{2} \mathrm{tr}\{W^T X X^T W\} - \frac{\rho}{2} \mathrm{tr}\{(W^T W - I)^T (W^T W - I)\}    (42)

As in [45], we relax the sign of the projection by its signed magnitude, leading to H(X) \approx \mathrm{sgn}(W^T X) \approx W^T X. This relaxation leads to the objective function

\max_{W} \mathrm{tr}\{W^T X S_K X^T W\} - \mathrm{tr}\{W^T X S_{NK} X^T W\} + \frac{\beta}{2} \mathrm{tr}\{W^T X X^T W\} - \frac{\rho}{2} \mathrm{tr}\{(W^T W - I)^T (W^T W - I)\}    (43)

Because the optimization problem in Eq. (42) is not convex in W, most approaches tackle this kind of problem by using a block-coordinate descent strategy. Similarly, we propose a priori intervention iterative optimization strategy to optimize the objective function. Taking the derivative of J(W) with respect to W, we have

\frac{\partial J(W)}{\partial W} = X S_K X^T W - X S_{NK} X^T W + \beta X X^T W - \rho W (W^T W - I)    (44)

We update W in the following form:

W^{t+1} = W^{t} - \mu \frac{\partial J(W)}{\partial W}    (45)

In order to inherit the discriminant information learned in the original space into the Hamming space, the initial W^0 is taken from the decomposition of the learned distance metric A. Meanwhile, during the iterations, we use the learned metric to update the k-neighbors of each x_i \in X and the neighbor kinship relationship matrices S_K and S_{NK}. The proposed priori intervention iterative optimization algorithm is summarized in Algorithm 2.

The proposed optimization algorithm can effectively utilize the learned discriminant information by updating the neighborhood set with Eq. (19). Thus the learned hashing function makes the hashing codes of kinship neighbor pairs highly similar, while those of the non-kinship neighbor pairs are highly different.
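The update in Eqs. (43)–(45) can be sketched as below under the relaxation H(X) ≈ W^T X. The step size, the outer-loop schedule and the helper names are assumptions; the neighborhood set and the indicator matrices S_K and S_NK would be refreshed with the learned metric as described above.

```python
import numpy as np

def nkph_step(W, X, S_K, S_NK, beta=1.0, rho=1.0, mu=1e-3):
    """One update of Eq. (45). X: d x n (columns are samples), W: d x K.
    grad is the derivative of the relaxed objective in Eq. (43), cf. Eq. (44)."""
    grad = (X @ S_K @ X.T @ W - X @ S_NK @ X.T @ W
            + beta * (X @ X.T) @ W
            - rho * W @ (W.T @ W - np.eye(W.shape[1])))
    return W - mu * grad   # update form stated in the paper (Eq. (45))

# Outer loop (sketch): W0 is taken from the decomposition of the learned metric A = W W^T,
# and S_K / S_NK are rebuilt from neighborhoods under that metric between updates.
```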
4. Experiments

In this section, we apply the proposed NKPH algorithm to three large-scale image data sets: the CIFAR-10, MNIST and NUS-WIDE databases. In order to evaluate the effectiveness of the proposed NKPH algorithm, we compare it with LSH, ITQ, KSH, SDH, FSDH and FSSH; the codes are downloaded from the homepages of the corresponding authors and the parameters of these methods are set to the values suggested in the corresponding papers. For KSH, 10,000 randomly sampled labeled points are utilized, and for SDH and FSDH, 10,000 randomly sampled points are utilized as anchors. Experiments are performed on a workstation with an Intel Xeon processor (2.20 GHz), 128 GB RAM and MATLAB 2012b, and the experimental results are reported in terms of precision, recall and F-measure of Hamming radius 2, MAP, training time and testing time.

Table 3
Training and testing time (s) of the compared methods on CIFAR-10 with code lengths from 16 to 256.

                 Bits   LSH      ITQ        KSH         SDH         FSDH       FSSH       NKPH
Training time    16     3.6630   6.4606     192.9817    898.5350    609.5820   87.9100    164.6486
                 32     3.6948   8.0386     382.4065    920.5384    605.4918   100.3000   162.9716
                 64     3.7211   11.2882    765.9763    986.4606    609.2840   114.4602   165.3330
                 96     3.8429   14.7635    1157.4150   1125.7313   641.4631   116.5118   168.2964
                 128    4.5769   19.9459    1546.8089   1310.5128   642.9565   121.0665   174.8755
                 256    4.1858   33.5742    3118.6482   2279.1723   664.8502   134.9637   194.9494
Testing time     16     0.0011   0.0558     0.0487      0.5643      0.5691     0.0264     0.0450
                 32     0.0004   0.0870     0.0539      0.6904      0.6826     0.0391     0.0518
                 64     0.0006   0.1690     0.0638      0.9608      0.9190     0.0617     0.0817
                 96     0.0011   0.2877     0.0651      1.2507      1.2564     0.0914     0.1211
                 128    0.0015   0.3572     0.0575      1.5087      1.6013     0.1189     0.1575
                 256    0.0023   0.7288     0.0701      2.7396      2.7790     0.1544     0.2046

Table 4
Results of the compared methods in precision (%), recall (%) and F-measure (%) of Hamming distance 2 on MNIST with code lengths from 16 to 256.

                Bits   LSH     ITQ     KSH     SDH     FSDH    FSSH    NKPH
Precision       16     0.46    0.45    83.50   95.59   95.03   94.12   71.18
                32     4.79    3.96    86.20   96.01   96.21   95.06   84.67
                64     49.14   45.04   68.03   95.22   95.26   96.25   92.56
                96     21.74   55.76   55.50   93.56   93.63   97.58   94.82
                128    8.50    29.30   46.82   93.26   93.28   98.03   96.35
                256    0.10    8.60    34.75   91.53   92.02   98.41   98.74
Recall          16     0.95    95.00   49.56   92.74   92.87   94.34   90.14
                32     3.12    94.89   35.51   89.68   90.24   91.67   90.14
                64     8.79    51.75   17.23   87.65   88.03   89.43   90.16
                96     0.65    7.84    8.94    86.30   87.12   88.50   90.15
                128    0.04    1.95    3.49    85.44   86.81   79.11   90.15
                256    0.00    0.05    1.94    82.44   86.09   78.45   90.15
F-measure       16     0.62    0.90    62.20   94.14   93.94   94.23   79.55
                32     3.78    7.60    50.30   92.74   93.13   93.33   87.32
                64     14.91   48.16   27.50   91.28   91.50   92.71   91.34
                96     1.26    13.75   15.40   89.78   90.26   92.82   92.43
                128    0.08    3.66    6.50    89.18   89.93   87.56   93.15
                256    0.00    0.10    3.67    86.75   88.96   87.37   94.25
4.1. Experiments on CIFAR-10

The CIFAR-10 database is composed of 60,000 images which are manually labeled into 10 classes with 6000 samples per class. Each image in this database is represented by a 512-dimensional GIST feature vector. In our experiment, every class was randomly split into a training set and a test set at a ratio of 5:1, i.e. the training data set contains 59,000 samples and the testing data set contains 1000 samples. For KSH, 2000 samples are randomly selected as labeled points, and for SDH and FSDH, 10,000 samples are randomly selected as anchor points. The results of the hash lookup can be found in Table 1.

From Table 1, it is clear that on this large data set the best performance of the proposed NKPH is better than the best performance of LSH, ITQ, KSH, SDH and FSDH. Specifically, the best precision rate of the proposed algorithm is 65.45% at 32-bit code length, compared with 63.41% for SDH, 62.24% for FSDH and 63.67% for FSSH, and it obviously outperforms LSH, ITQ and KSH. In terms of recall, the performance of the proposed algorithm is much more robust and outperforms the other hashing learning methods. Similar to the precision and recall indices, the F-measure shows that the proposed NKPH outperforms all the other hashing learning methods.

Next, we compare the proposed NKPH with the other methods in terms of the Hamming ranking index MAP and the training and testing time; the results can be found in Tables 2 and 3. From Tables 2 and 3, we see that the MAP performance of the proposed algorithm outperforms SDH, FSDH and FSSH at code lengths 16, 32 and 64; as the code length increases, the MAP of the proposed algorithm is slightly lower than that of SDH, FSDH and FSSH. In contrast to LSH, ITQ and KSH, the performance of the proposed NKPH is much better. Regarding the training time, the training time of LSH, FSDH, FSSH and the proposed NKPH is very stable as the code length increases, compared with ITQ, KSH and SDH. The training time of the proposed NKPH is obviously shorter than that of SDH and FSDH, but slightly longer than that of FSSH. Similarly, the testing time of the proposed NKPH is obviously shorter than that of SDH and FSDH, but slightly longer than that of FSSH. Overall, however, the hash lookup performance outperforms SDH, FSDH and FSSH. Thus, the overall performance of the proposed NKPH outperforms all the other hash methods.
4.2. Experiments on MNIST

The MNIST data set consists of 70,000 images of handwritten digits from 0 to 9, each of 784 dimensions. In our experiment, every class was randomly split into a training set and a test set at a ratio of 69:1, i.e. the training data set contains 69,000 samples and the testing data set contains 1000 samples. The training data were used to train a robust hashing function, and this hashing function was applied to the testing data to obtain the corresponding hashing codes. We then used a Hamming distance of 2 to search for the class labels of the testing data. For KSH, 2000 samples are randomly selected as labeled points, and for SDH and FSDH, 10,000 samples are randomly selected as anchor points. The results of the hash lookup on MNIST can be found in Table 4.

From Table 4, it is clear that the proposed NKPH achieves the best precision rate of 98.74% at 256-bit code length, compared with 86.20% for KSH, 96.01% for SDH, 96.21% for FSDH and 98.41% for FSSH, and it is obviously higher than that of LSH and ITQ. The recall rate of the proposed NKPH is very robust, although it is slightly lower than that of some other methods at short code lengths. In terms of the F-measure, the proposed NKPH still achieves promising results. Moreover, the MAP values of the compared methods can be found in Table 5. From Table 5, it is clear that the performance of the proposed NKPH is comparable to or slightly better than that of SDH, FSDH and FSSH, and obviously better than that of LSH, ITQ and KSH.

Table 5
Results of the compared methods in MAP (%) on MNIST with code lengths from 16 to 256.

Bits   LSH     ITQ     KSH     SDH     FSDH    FSSH    NKPH
16     23.81   40.75   78.36   96.29   95.29   94.43   93.12
32     27.75   43.49   83.68   96.39   96.60   95.73   96.59
64     31.28   45.43   83.91   96.66   96.71   95.85   96.90
96     35.21   46.26   83.97   96.93   96.81   96.64   97.70
128    38.88   46.59   84.55   96.91   96.93   96.76   97.26
256    39.99   47.16   85.11   97.13   97.06   97.27   98.07
Fig. 1. Comparison of the training and testing time with code lengths 16, 32, 64, 96, 128 and 256.

Fig. 2. Results of the compared methods in MAP on NUS-WIDE with code lengths from 16 to 256.

Furthermore, the training and testing time results can be found in Fig. 1. From Fig. 1, we see that the training time of the proposed NKPH is obviously higher than that of LSH and ITQ, but obviously lower than that of KSH, SDH and FSDH. The testing time of the proposed NKPH is also lower than that of ITQ, SDH and FSDH.

4.3. Experiments on NUS-WIDE

The NUS-WIDE database is composed of 270,000 images collected from Flickr. The 270,000 images are associated with the 1000 most frequent tags, and 81 ground-truth concept labels are annotated by humans for all images. The image content is described by local SIFT descriptors; a 500-dimensional bag-of-words representation based on the SIFT descriptors was used in our experiments, and the 21 most frequent labels were used for testing, as in [20]. For each label, 100 images were randomly selected to constitute the query set and the remaining images were used as the training set. For this large data set, we define (x_i, x_j) as a kinship neighbor pair if x_j and x_i share at least one class label, and the hashing function is trained based on these kinship neighbor pairs. For KSH, 10,000 samples are randomly selected as labeled points (only the 16- and 32-bit experiments are reported because of the training time), and for SDH and FSDH, 10,000 samples are randomly selected as anchor points. The results of precision, recall and F-measure on NUS-WIDE can be found in Table 6.
Table 6
Results of the compared methods in precision (%), recall (%) and F-measure (%) of Hamming distance 2 on NUS-WIDE with code lengths from 16 to 256 (♯: not reported because of the training time).

                Bits   LSH      ITQ       KSH     SDH     FSDH    FSSH    NKPH
Precision       16     6.88     7.55      36.10   36.05   36.00   36.31   36.76
                32     12.81    18.48     36.10   36.04   36.15   36.46   36.59
                64     5.05     27.41     ♯       36.15   36.15   36.39   36.68
                96     1.50     6.20      ♯       36.15   36.15   36.32   36.60
                128    0.89     2.58      ♯       36.15   36.15   36.40   36.69
                256    0.14     0.97      ♯       36.15   36.15   36.38   36.67
Recall          16     81.56    84.00     57.53   95.00   95.60   90.33   84.17
                32     21.90    47.89     57.53   95.89   95.14   89.23   84.98
                64     2.53     12.33     ♯       95.14   95.14   89.45   91.74
                96     1.14     6.78      ♯       95.14   95.14   90.25   95.01
                128    0.49     5.06      ♯       95.14   95.14   91.76   97.22
                256    0.04     3.01      ♯       95.14   95.14   92.34   92.78
F-measure       16     12.69    13.85     44.36   53.00   52.30   51.79   52.13
                32     16.16    26.67     44.36   52.97   52.39   51.76   52.12
                64     3.37     17.01     ♯       52.39   52.39   51.73   53.42
                96     1.30     6.48      ♯       52.39   52.39   51.79   53.88
                128    0.63     3.42      ♯       52.39   52.39   52.12   54.32
                256    0.06     1.47      ♯       52.39   52.39   52.18   53.58
From Table 6, we can see that the proposed NKPH achieves the best performance for each code length. With code length 16, the proposed NKPH achieves 36.76% precision, compared with the best performance of the other hash methods, i.e. FSSH achieves 36.46% at 32-bit code length, FSDH achieves 36.15% at 32-bit code length, SDH achieves 36.15% at 64-bit code length, KSH achieves 36.10% at 16-bit code length, ITQ achieves 27.41% at 64-bit code length and LSH achieves 12.81% at 32-bit code length. For the recall, the proposed NKPH, SDH, FSDH and FSSH achieve better performance than LSH, ITQ and KSH, and a similar conclusion can be drawn from the F-measure, although the proposed NKPH is a little less stable than SDH, FSDH and FSSH. Next, we compare the proposed NKPH with LSH, ITQ, KSH, SDH and FSDH in terms of the Hamming ranking index, i.e. the MAP, which can be found in Fig. 2.
Table 7
Results of the compared methods in training time (s) and testing time (s) on NUS-WIDE with code lengths from 16 to 256 (♯: not reported because of the training time).

                 Bits   LSH       ITQ        KSH         SDH         FSDH        NKPH
Training time    16     17.1018   25.0403    1.7451e+5   0.9530e+3   2.7612e+3   176.3348
                 32     16.9591   29.1926    3.4400e+5   3.0472e+3   2.7729e+3   179.1979
                 64     17.4955   38.7489    ♯           3.6881e+3   2.9441e+3   187.0121
                 96     17.7218   48.3837    ♯           4.2835e+3   2.7327e+3   194.3062
                 128    17.7847   60.8721    ♯           4.3056e+3   2.7116e+3   209.8337
                 256    18.8356   103.3139   ♯           8.8223e+3   2.8098e+3   248.0264
Testing time     16     0.0011    0.1171     0.8528      1.8420      1.8689      0.1632
                 32     0.0017    0.2509     0.9974      2.6390      2.2026      0.1695
                 64     0.0047    0.5323     ♯           4.1375      3.3166      0.2444
                 96     0.0034    0.7782     ♯           5.3704      4.2039      0.3416
                 128    0.0068    0.9862     ♯           5.5422      5.6229      0.4559
                 256    0.0155    2.1522     ♯           9.7808      9.2399      0.6443
From Fig. 2, it is clear that the MAP of the proposed NKPH, SDH, FSDH and FSSH is obviously higher than that of LSH, ITQ and KSH. Moreover, the proposed NKPH outperforms SDH and FSDH at code lengths 16, 32, 64, 96 and 128. The training time and testing time can be found in Table 7. From Table 7, we see that the training time of the proposed NKPH is shorter than that of KSH, SDH and FSDH; meanwhile, the testing time of the proposed NKPH is shorter than that of ITQ, KSH, SDH and FSDH as well.
4.4. Discussion

For the proposed supervised hashing method, we performed experiments on CIFAR-10, MNIST and NUS-WIDE. As concluded in Section 4, in terms of the Hamming lookup indices (precision, recall and F-measure), the proposed NKPH outperforms the other hashing methods. In terms of the Hamming ranking index MAP, the proposed NKPH achieves better results than the other hash methods. Furthermore, the training time of the proposed NKPH is obviously shorter than that of KSH, SDH and FSDH, and its testing time is shorter than that of these hash methods as well; however, the training and testing time of the proposed NKPH are slightly longer than those of FSSH. In contrast to SDH, FSDH and FSSH, the proposed NKPH can update the neighbor information in each iteration and incorporate the learned neighbor information into the hash code learning. Thus, the proposed NKPH can update the discriminant information during hash function learning.

5. Conclusions

In this paper, we propose the supervised hashing algorithm neighborhood kinship preserving hashing. To leverage the label information of the training data, a robust learned distance metric pulls the intra-class neighborhood samples as close as possible and pushes the inter-class neighborhood samples as far as possible, so that the discriminant information of the training data is incorporated into the learning framework. Meanwhile, when searching for the optimal hashing transform, the neighbor pairs are updated by the learned robust distance metric, so that the learned hashing transform makes the hashing codes of kinship neighbor pairs highly similar, while those of the non-kinship neighbor pairs are highly different. The experimental results on three public image data sets show the efficacy of the proposed NKPH for large-scale image retrieval.

Acknowledgments

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work is supported by the National Science Foundation of China (Grant No. 61601235, 61573248, 61773328, 61732011), the Natural Science Foundation of Jiangsu Province of China (Grant No. BK20170768, BK20160972), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 17KJB520019, 16KJB520031, 16KJB520026, 16KJB520027), the China Postdoctoral Science Foundation (Grant No. 2018M640441) and the Startup Foundation for Introducing Talent of Nanjing University of Information Science and Technology, China (Grant No. 2243141601019).

Conflict of interest

The authors declare that they have no conflicts of interest with respect to this work.

References

[1] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval, IEEE TPAMI 35 (12) (2013) 2916–2929.
[2] R. Xia, Y. Pan, H. Lai, Supervised hashing for image retrieval via image representation learning, in: Proc. AAAI, 2014, pp. 1–2.
[3] J. Cheng, C. Leng, J. Wu, Fast and accurate image matching with cascade hashing for 3d reconstruction, in: Proc. CVPR, 2014, pp. 1–8.
[4] B. Li, D. Ming, W. Yan, Image matching based on two-column histogram hashing and improved RANSAC, IEEE Geosci. Remote Sens. Lett. 11 (8) (2014) 1433–1437.
[5] Y. Mu, G. Hua, W. Fan, Hash-SVM: Scalable kernel machines for large-scale visual classification, in: Proc. CVPR, 2014, pp. 979–986.
[6] Y. Xu, Z. Liu, Z. Zhang, High-throughput and memory-efficient multimatch packet classification based on distributed and pipelined hash tables, IEEE/ACM Trans. Netw. 22 (3) (2014) 982–995.
[7] W. Kehl, F. Tombari, N. Navab, Hashmod: A hashing method for scalable 3D object detection, arXiv preprint arXiv:1607.06062, 2016.
[8] R. Zhang, F. Wei, B. Li, E2LSH based multiple kernel approach for object detection, Neurocomputing 124 (2014) 105–110.
[9] A. Tomar, F. Godin, B. Vandersmissen, Towards Twitter hashtag recommendation using distributed word representations and a deep feed forward neural network, in: Proc. ICACCI, 2014, pp. 362–368.
[10] F. Godin, V. Slavkovikj, W. De Neve, Using topic models for twitter hashtag recommendation, in: Proc. ACM, 2013, pp. 593–596.
[11] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proc. VLDB, 1999.
[12] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing, IEEE TPAMI 34 (6) (2012) 1092–1104.
[13] L. Paulevé, H. Jégou, L. Amsaleg, Locality sensitive hashing: A comparison of hash function types and querying mechanisms, Pattern Recognit. Lett. 31 (11) (2010) 1348–1358.
[14] D. Gorisse, M. Cord, F. Precioso, Locality-sensitive hashing for chi2 distance, IEEE TPAMI 34 (2) (2012) 402–409.
[15] Y. Mu, S. Yan, Non-metric locality-sensitive hashing, in: Proc. AAAI, 2010, pp. 539–544.
[16] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Proc. NIPS, 2009, pp. 1753–1760.
[17] Y. Weiss, R. Fergus, A. Torralba, Multidimensional spectral hashing, in: Proc. ECCV, 2012, pp. 340–353.
[18] Y. Zhen, Y. Gao, D.Y. Yeung, Spectral multimodal hashing and its application to multimedia retrieval, IEEE Trans. Cybern. 46 (1) (2016) 27–38.
[19] W. Kong, W.J. Li, Isotropic hashing, in: Proc. NIPS, 2012, pp. 1646–1654.
[20] W. Liu, J. Wang, S. Kumar, Hashing with graphs, in: Proc. ICML, 2011, pp. 1–8.
[21] W. Liu, C. Mu, S. Kumar, Discrete graph hashing, in: Proc. NIPS, 2014, pp. 3419–3427.
[22] D. Zhang, J. Wang, D. Cai, Self-taught hashing for fast similarity search, in: Proc. ACM, 2010, pp. 18–25.
[23] H. Takebe, Y. Uehara, S. Uchida, Efficient anchor graph hashing with data-dependent anchor selection, IEICE TIS 98 (11) (2015) 2030–2033.
[24] J. Wang, S. Kumar, S.F. Chang, Semi-supervised hashing for large-scale search, IEEE TPAMI 34 (12) (2012) 2393–2406.
[25] J. Wang, S. Kumar, S.F. Chang, Semi-supervised hashing for scalable image retrieval, in: Proc. CVPR, 2010, pp. 3424–3431.
[26] S. Kim, S. Choi, Semi-supervised discriminant hashing, in: Proc. ICDM, 2011, pp. 1122–1127.
[27] Y. Pan, T. Yao, H. Li, Semi-supervised hashing with semantic confidence for large scale visual search, in: Proc. ACM, 2015, pp. 53–62.
[28] C. Wu, J. Zhu, D. Cai, Semi-supervised nonlinear hashing using bootstrap sequential projection learning, IEEE TKDE 25 (6) (2013) 1380–1393.
[29] Y. Mu, J. Shen, S. Yan, Weakly-supervised hashing in kernel space, in: Proc. CVPR, 2010, pp. 3344–3351.
[30] Q. Wang, L. Si, D. Zhang, Learning to hash with partial tags: Exploring correlation between tags and hashing bits for large scale image retrieval, in: Proc. ECCV, 2014.
[31] C. Strecha, A. Bronstein, M. Bronstein, LDAHash: Improved matching with smaller descriptors, IEEE TPAMI 34 (1) (2012) 66–78.
[32] W. Liu, J. Wang, R. Ji, et al., Supervised hashing with kernels, in: Proc. CVPR, 2012, pp. 2074–2081.
[33] B. Kulis, T. Darrell, Learning to hash with binary reconstructive embeddings, in: Proc. NIPS, 2009, pp. 1042–1050.
[34] M. Norouzi, D.M. Blei, Minimal loss hashing for compact binary codes, in: Proc. ICML, 2011, pp. 353–360.
[35] J. Wang, W. Liu, A.X. Sun, Y. Jiang, Learning hash codes with listwise supervision, in: Proc. ICCV, 2013, p. 2.
[36] L. Jiang, G. Misherghi, Z. Su, Deckard: Scalable and accurate tree-based detection of code clones, in: Proceedings of the 29th International Conference on Software Engineering, 2007, pp. 96–105.
[37] K. Nohl, D. Evans, Quantifying information leakage in tree-based hash protocols (short paper), in: International Conference on Information and Communications Security, 2006, pp. 228–237.
[38] M. Norouzi, D.J. Fleet, R.R. Salakhutdinov, Hamming distance metric learning, in: Proc. NIPS, 2012, pp. 1061–1069.
[39] G. Lin, C. Shen, A. van den Hengel, Supervised hashing using graph cuts and boosted decision trees, IEEE TPAMI 37 (11) (2015) 2317–2331.
[40] F. Shen, C. Shen, W. Liu, Supervised discrete hashing, in: Proc. CVPR, 2015, pp. 37–45.
[41] R. Xia, Y. Pan, H. Lai, Supervised hashing for image retrieval via image representation learning, in: Proc. AAAI, Vol. 1, 2014, p. 2.
[42] J. Gui, T. Liu, Z. Sun, et al., Fast supervised discrete hashing, IEEE TPAMI 40 (2) (2018) 490–496.
[43] L. Xin, L. Nie, X. He, et al., Fast scalable supervised hashing, in: Proc. SIGIR, 2018, pp. 735–744.
[44] Z. Lai, Y. Chen, J. Wu, et al., Jointly sparse hashing for image retrieval, IEEE Trans. Image Process. 27 (12) (2018) 6147–6158.
[45] J. Tang, Z. Li, M. Wang, Neighborhood discriminant hashing for large-scale image retrieval, IEEE TIP 24 (9) (2015) 2827–2840.