Unsupervised Deep Cross-modal Hashing with Virtual Label Regression


Tong Wang^a, Lei Zhu^a, Zhiyong Cheng^b, Jingjing Li^c, Zan Gao^d

^a School of Information Science and Engineering, Shandong Normal University
^b Shandong Computer Science Center (National Supercomputer Center in Jinan)
^c School of Computer Science and Engineering, University of Electronic Science and Technology of China
^d Qilu University of Technology (Shandong Academy of Sciences)

Abstract

Unsupervised cross-modal hashing learns hash codes without depending on semantic labels. It has the desirable advantage of good scalability and thus can effectively support large-scale cross-media retrieval. However, how to directly learn discriminative discrete hash codes under the unsupervised learning paradigm is still an open and challenging problem. In this paper, we aim to address this problem by proposing an Unsupervised Deep Cross-modal Hashing with Virtual Label Regression (UDCH-VLR) method. We propose a novel unified learning framework to jointly perform deep hash function training, virtual label learning and regression. Specifically, we learn unified hash codes via collaborative matrix factorization on the multi-modal deep representations to preserve the shared multi-modal semantics. Further, we incorporate virtual label learning into the objective function and simultaneously regress the learned virtual labels to the hash codes. Finally, instead of simply exploiting existing shallow features and relaxing the binary constraints, we devise an alternative optimization strategy to directly update the deep hash functions and discrete binary codes. Under these circumstances, the discriminative capability of the hash codes can be progressively enhanced with iterative learning. Extensive experiments on three publicly available cross-media retrieval datasets demonstrate that our approach outperforms state-of-the-art methods.

Keywords: Cross-modal retrieval, Deep discrete hashing, Virtual labels

Email address: [email protected] (Lei Zhu)


1. Introduction

Nowadays, with the explosive growth of multimedia data, cross-modal retrieval has become a hot research topic in multimedia computing, information retrieval and data mining [1, 2, 3]. However, large-scale cross-modal retrieval is confronted with great challenges in storage cost and retrieval speed. To address these problems, cross-modal hashing [4, 5, 6, 7, 8] has been developed to transform heterogeneous data (e.g. texts, audios, images, videos) into a common low-dimensional Hamming space that preserves both the inter-media and intra-media consistency of the original feature space. With respect to the exploitation of supervised information, existing cross-modal hashing methods can be roughly divided into two categories: unsupervised [9, 10, 11] and supervised [12, 13, 14] cross-modal hashing. The former mainly learns binary hash codes and functions by preserving the intrinsic data structure, while the latter performs hash learning under the guidance of supervised semantic labels. Supervised cross-modal hashing has made significant progress and achieved state-of-the-art retrieval performance. Nevertheless, its success relies heavily on large amounts of semantic labels, which require massive labor to collect. In contrast, unsupervised cross-modal hashing is developed without this limitation. Thus, it can support large-scale cross-media retrieval well.

Although many unsupervised cross-modal hashing methods have achieved promising results, there are still two important problems to be solved: 1) Most existing methods focus on shallow models that simply adopt linear or nonlinear projections for hash function learning. Under such circumstances, the generated hash codes may suffer from limited representation capability. 2) Considering that it is difficult to learn binary codes from the feature representations directly, most existing methods discard the discrete binary constraints to avoid the NP-hard hash optimization problem. This strategy indeed simplifies the problem, but it causes significant quantization loss when dealing with the continuous relaxation [15, 16, 17, 18]. Recently, deep hashing [19, 20, 21] has demonstrated that both feature representation and hash coding can be learned more effectively with an end-to-end deep learning

architecture. Compared to shallow methods, deep models are reported to achieve superior performance by deeply capturing the nonlinear correlations of heterogeneous modalities. However, only a few research works have applied deep learning to unsupervised cross-modal hashing. Among them, a tricky problem remains unsolved: a large number of neural network parameters must be learned, but no semantic labels can be exploited for parameter optimization.

In light of the above analyses, in this paper, we present a novel unsupervised deep cross-modal hashing framework, termed Unsupervised Deep Cross-modal Hashing with Virtual Label Regression (UDCH-VLR). Our method makes one of the first attempts to deeply model the heterogeneous modality correlations and directly preserve them in binary hash codes with the automatically learned auxiliary virtual labels. The main contributions of the proposed method are summarized as below:

• We propose a new unsupervised deep cross-modal hashing model to jointly perform deep hash function learning, virtual label learning and regression. The devised learning framework ensures the semantic consistency between hash codes and virtual labels, and thus enhances the discriminative capability of unsupervised hash codes. To the best of our knowledge, there is still no similar work.

• We develop a discrete alternative optimization strategy with guaranteed convergence to iteratively update the deep hash functions and binary hash codes. By directly exploiting discrete optimization instead of relaxing the discrete constraints, the quantization errors are effectively reduced.

• We conduct extensive experiments on three standard cross-media retrieval datasets. The results demonstrate that our approach outperforms state-of-the-art unsupervised cross-modal hashing methods.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes our proposed scheme in detail. Section 4 reports the experimental configuration. Section 5 presents the experimental results. Finally, we conclude this paper in Section 6.


2. Related Work

2.1. Supervised Cross-modal Hashing

Supervised cross-modal hashing mainly leverages supervised semantic labels to guide the hash code and hash function learning. Semantic Correlation Maximization (SCM) [22] implicitly reconstructs the pairwise similarity matrix by preserving the maximum semantic correlation of the hash codes, which reduces the computation and storage requirements. Semantics Preserving Hashing (SePH) [23] minimizes the Kullback-Leibler divergence between the pairwise similarity matrix and the hash codes, so that the learned hash codes maintain the pairwise sample relations. Discrete Cross-modal Hashing (DCH) [24] jointly learns discriminative binary codes and modality-specific hash functions with supervised information for multiple modalities.

2.2. Unsupervised Cross-modal Hashing

Although great progress has been achieved by supervised methods, their performance largely depends on the amount of semantic labels. Unsupervised methods mainly learn hash codes and functions from unlabeled data. They significantly reduce the required labeling effort. Existing unsupervised cross-modal hashing methods can be divided into two categories: graph-based and matrix factorization-based methods.

Graph-based methods generally construct similarity graphs [25, 26, 27] to model the heterogeneous modality correlations. They maintain the intra- and inter-modality similarity by minimizing the errors between the hash codes of paired data. Cross-view Hashing (CVH) [9] aims to minimize the weighted average Hamming distances between hash codes, which are obtained by solving a generalized eigenvalue problem. Inter-media Hashing (IMH) [28] learns hash codes and functions simultaneously by constructing a similarity graph for each modality to preserve the inter-media and intra-media consistency. Linear Cross-modal Hashing (LCMH) [29] first transforms the representation of each instance into its distances to cluster centroids. Then, it approximates the distances between instances and clusters with the to-be-learned hash codes. Fusion Similarity Hashing (FSH) [30] directly preserves the fusion similarity across multiple modalities in a common Hamming space by constructing an undirected asymmetric


graph. Cross-modal Discrete Hashing (CMDH) [31] proposes a joint discrete optimization framework that transforms different modalities into unified hash codes, bridging the semantic gap between multiple modalities.

Matrix factorization-based methods seek a low-dimensional latent semantic space to reconstruct the multi-modal data, and the hash codes are learned by quantizing the projected values. Collective Matrix Factorization Hashing (CMFH) [32] learns unified hash codes by collective matrix decomposition with a latent factor model. It not only supports cross-view search, but also improves the search accuracy by merging multi-view information. Latent Semantic Sparse Hashing (LSSH) [33] employs sparse coding and matrix factorization to capture the image structures and text concepts respectively. It maps the latent semantic features into a common binary space for cross-modal similarity search. Semantic Topic Multimodal Hashing (STMH) [34] presents a novel semantic topic multi-modal hashing method based on matrix decomposition. It first learns semantic topics for texts and semantic concepts for images by semantic modeling. Then, it projects the learned semantic features into a common latent space and generates hash codes by determining whether topics and concepts are contained in the texts and images.

2.3. Deep Neural Networks Based Hashing

Inspired by the advancement of deep learning, several works take advantage of deep neural networks for cross-media retrieval. This research direction has attracted considerable attention due to the powerful representation capability of deep learning. Deep Binary Reconstruction for Cross-modal Hashing (DBRC) [35] proposes a multi-modal deep reconstruction network based on an adaptive tanh hash function to directly encode the binary hash codes instead of using a separate binarization procedure. Unsupervised Deep Hashing via Binary Latent Factor Models (UDCMH) [36] designs a multi-modal hashing framework that combines deep learning and matrix factorization to generate unified hash codes with discrete binary optimization. Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) [37] learns hash functions by exploiting a generative adversarial network based on a correlation graph, which captures the intrinsic manifold structures of multi-modal data. This adversarial learning process is shown to have strong nonlinear modeling capability.

Although these unsupervised deep cross-modal hashing methods achieve promising performance, they still fail to address the tricky problem that a large number of neural network parameters must be optimized in the deep learning paradigm while no semantic labels can be exploited for supervision. In this paper, we propose a novel UDCH-VLR model to simultaneously perform collaborative matrix factorization, virtual label learning and regression. With such a learning framework, the discriminative capability of hash codes can be enhanced by guiding the collaborative cross-modal hash code learning with discriminative labels.

Table 1: Main notations used in this paper.

Notation            Description
X_t ∈ R^{n×d_t}     feature matrix of the t-th modality
L ∈ R^{n×n}         graph Laplacian matrix
Z_t ∈ R^{D_t×n}     deep feature representation of the t-th modality
U_t ∈ R^{D_t×r}     latent factor matrix
B ∈ R^{n×r}         binary hash codes
G ∈ R^{n×c}         virtual label matrix
α_t                 the weight of the t-th modality
d_t                 the dimension of the t-th modality feature
D_t                 the dimension of the t-th deep feature
c                   the number of virtual labels
r                   the length of hash codes
t                   the number of modalities
m                   the number of anchors
P ∈ R^{c×r}         semantic projection matrix

3. The Proposed Method

3.1. Notations and Problem Definition

Suppose that the training set is represented as O = {o_i}_{i=1}^{n}, which consists of n image-text pairs. Let X_1 = [x_1^1, x_2^1, ..., x_n^1]^T ∈ R^{n×d_1} denote the image feature matrix with d_1 dimensions, and X_2 = [x_1^2, x_2^2, ..., x_n^2]^T ∈ R^{n×d_2} denote the text feature matrix with d_2 dimensions. Without loss of generality, we further assume that the data points are normalized with zero mean, i.e., ∑_{i=1}^{n} x_i^t = 0, t = 1, 2. Our method learns unified binary codes B ∈ {0, 1}^{n×r} for the training data, where r is the length of the hash codes. In this work, sgn(·) is the sign function, which outputs 1 if the input is positive and −1 otherwise. ||·||_F is the Frobenius norm of a matrix. Tr(·) is the trace of a matrix, which is computed as the sum of the diagonal elements of the matrix. Table 1 summarizes the main notations used in this paper.

3.2. Framework Overview

The basic framework of the proposed UDCH-VLR method is illustrated in Figure 1. The framework contains two stages: offline learning and online retrieval. The offline learning stage is further comprised of three main components: deep feature learning, latent discriminative hash code learning and hash function learning. These three parts are shown in purple, peru and blue respectively. In the first part, features are extracted from the heterogeneous multi-modal data with deep neural networks to capture the nonlinear structure of multiple modalities. In the second part, the hash codes are learned by the joint learning framework, which consists of collaborative feature factorization, virtual label learning and regression. In the last part, the hash codes are generated through the learned deep neural networks for online query retrieval. At the online stage, a query sample is first projected into the Hamming space to generate its hash code by the hash functions. Then, the relevant samples are retrieved from the database by performing Hamming distance computation and similarity ranking.

3.3. Model Formulation

In this paper, we develop a unified unsupervised deep cross-modal hashing model. In our learning framework, we extract the multi-modal feature representations with deep neural networks, learn the virtual labels with nonnegative spectral analysis, and supervise the collaborative matrix binary factorization with the virtual labels to learn discriminative hash codes.


Figure 1: The basic framework of the proposed Unsupervised Deep Cross-modal Hashing with Virtual Label Regression (UDCH-VLR).

3.3.1. Deep Feature Learning

The deep feature learning part consists of two deep neural networks for the image and text modalities respectively. The deep neural network for the image modality is based on VGG-16 and initialized with the weights pre-trained on the large-scale ImageNet dataset [38]. The last fully-connected layer (Fc8_I) is slightly modified so that its number of hidden units equals the hash code length. Moreover, hyperbolic tangent activation functions are adopted to generate quantizable representations from the whole network. To perform feature learning for text, we first represent the texts with Bag-of-Words (BOW) vectors and then use them as the input of a network with three fully-connected layers (Fc6_T - Fc8_T). These three fully-connected layers are similar to those in the image network. We also adjust the number of hidden units in the last fully-connected layer of this network and adopt the tangent function as its activation function. F_t(X_t; W_t) denotes the outputs of the last fully-connected layer for data X_t, where W_t is the weight parameter of the t-th deep neural network. The deep feature representations Z_t |_{t=1}^{2} extracted by the Fc7 layers (Fc7_I and Fc7_T) are the input to the latent discriminative hash code learning.
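To make the two branches concrete, the following is a minimal PyTorch-style sketch (the paper's experiments use Caffe). The class names, the hidden width of the text branch and the intermediate ReLU activations are illustrative assumptions; only the VGG-16 backbone, the r-unit last layer and the tanh output follow the description above. Each forward pass returns both the quantizable output and the Fc7 feature Z_t that feeds the next component.

```python
# Illustrative sketch only, not the authors' implementation.
import torch
import torch.nn as nn
from torchvision import models


class ImageHashNet(nn.Module):
    """VGG-16 backbone; the modified last FC layer (Fc8_I) outputs r units with tanh."""
    def __init__(self, code_length: int):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        self.fc67 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # keep Fc6_I, Fc7_I
        self.fc8 = nn.Linear(4096, code_length)                           # modified Fc8_I

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        z = self.fc67(x)                      # Fc7_I output -> deep feature Z_1
        return torch.tanh(self.fc8(z)), z


class TextHashNet(nn.Module):
    """Three FC layers (Fc6_T - Fc8_T) on a BOW input, tanh on the last layer."""
    def __init__(self, bow_dim: int, code_length: int, hidden: int = 4096):
        super().__init__()
        self.fc6 = nn.Sequential(nn.Linear(bow_dim, hidden), nn.ReLU())   # activation assumed
        self.fc7 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.fc8 = nn.Linear(hidden, code_length)                         # modified Fc8_T

    def forward(self, x):
        z = self.fc7(self.fc6(x))             # Fc7_T output -> deep feature Z_2
        return torch.tanh(self.fc8(z)), z
```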

3.3.2. Latent Discriminative Hash Code Learning

Relevant heterogeneous multi-modal data share the same semantics [39, 40, 41, 42, 43]. Thus, it is natural to transform them into a common collaborative subspace. Accordingly, we can assume that heterogeneous modalities with semantic correlations share the same hash codes based on collaborative matrix binary factorization. In this paper, we propose a collaborative latent representation model to capture the correlations of modalities. This part can be formulated as

min_{α_t, B, U_t} ∑_{t=1}^{2} α_t^η ||Z_t − U_t B^T||_F^2,  s.t. ∑_{t=1}^{2} α_t^η = 1, α_t^η ≥ 0, U_t^T U_t = I, B ∈ {−1, 1}^{n×r},    (1)

where α_t |_{t=1}^{2} are the weights of the two modalities, which are assigned automatically to measure the importance of the corresponding modality. The parameter η is used to control the weight distribution. U_t ∈ R^{D_t×r} is the latent factor matrix; it is constrained to be orthogonal to avoid the trivial solution. B ∈ {−1, 1}^{n×r} is the shared binary hash code matrix to be learned, on which the discrete constraint is directly imposed. With the above formulation, the deep feature of each instance is approximated by a linear combination of latent factors weighted by the binary hash codes. The hash codes B can be directly optimized to reduce the quantization loss of the intermediate representation process.

Supervised hashing methods can achieve superior performance by leveraging semantic labels to guide hash learning. However, a major limitation of these methods is that their performance largely depends on the amount of semantic labels, and providing these labels is labor-consuming for large-scale applications. In this paper, we explore a new strategy to learn virtual labels which can distinguish the deep hash codes. Specifically, we adopt nonnegative spectral clustering [44] to learn the virtual labels, which can be further integrated with the above collaborative matrix binary factorization. Besides, we directly regress the virtual labels G to the hash codes, so that binary codes from the same class are close and those from different classes are far apart. Under such circumstances, the discriminative capability of the hash codes can be enhanced. Mathematically, we formulate this unified learning framework as

min_{G, P, B} λ||B − GP||_F^2 + βTr(G^T LG) + δ||P||_F^2,  s.t. G^T G = I, G ≥ 0, B ∈ {−1, 1}^{n×r},    (2)

where β, λ and δ are regularization parameters. G ∈ R^{n×c} is the virtual label matrix, which is kept consistent with the corresponding hash codes. This process guarantees that deep feature representations sharing the same semantic labels have the same hash codes. In this way, the learned hash codes can be empowered with more discriminative capability. P ∈ R^{c×r} is the semantic projection matrix, which directly projects the virtual labels G into the lower-dimensional Hamming space. Besides, considering that the computation complexity of spectral clustering is high, we adopt the anchor graph strategy to construct an approximate kNN graph [45]. In particular, we define the graph Laplacian matrix L ∈ R^{n×n} in the second term as

L = I − S = I − AΛ^{−1}A^T.    (3)

Here, S is the similarity matrix, Λ = diag(A^T 1), and A ∈ R^{n×m} is the anchor graph matrix between the deep features and the anchors [46, 47, 48]. It is computed as

A_ij = exp(−||Z_i − θ_j||_2^2 / σ) / ∑_{j′∈[i]} exp(−||Z_i − θ_j′||_2^2 / σ),    (4)

where [i] denotes the s (s ≪ m) nearest anchors of Z_i, σ is the Gaussian kernel parameter, and the m cluster centers {θ_j}_{j=1}^{m} are anchor points randomly selected from the training set.
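As a concrete illustration of Eq.(3) and Eq.(4), the following is a minimal NumPy sketch of the anchor-graph construction. It assumes the deep features are stacked row-wise in a matrix Z ∈ R^{n×D} (the paper stores Z_t as D_t×n), the function name is hypothetical, and the Laplacian is returned as a dense matrix for clarity, whereas an efficient implementation would keep the factored form I − AΛ^{−1}A^T.

```python
import numpy as np

def anchor_graph_laplacian(Z, m=300, s=5, sigma=1.0, seed=0):
    """Anchor graph A (Eq. 4) and Laplacian L = I - A diag(A^T 1)^{-1} A^T (Eq. 3).

    Z: (n, D) deep features, one instance per row; m anchors; s nearest anchors kept.
    """
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    anchors = Z[rng.choice(n, size=m, replace=False)]      # theta_j, sampled from the training set

    # (n, m) squared Euclidean distances between instances and anchors
    d2 = ((Z ** 2).sum(axis=1, keepdims=True)
          - 2.0 * Z @ anchors.T
          + (anchors ** 2).sum(axis=1)[None, :])

    A = np.zeros((n, m))
    for i in range(n):
        nearest = np.argsort(d2[i])[:s]                    # the set [i] of s nearest anchors
        w = np.exp(-d2[i, nearest] / sigma)
        A[i, nearest] = w / w.sum()                        # Eq. (4): each row sums to one
    Lam_inv = np.diag(1.0 / (A.sum(axis=0) + 1e-12))       # diag(A^T 1)^{-1}, guarded against 0
    return np.eye(n) - A @ Lam_inv @ A.T                   # Eq. (3): dense L, for clarity only
```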

Hence, the time complexity of our method decreases from O(n^2) to O(nm^2).

3.3.3. Hash Function Learning

The hash codes B of the training set are learned by optimizing Eq.(1) and Eq.(2). However, they cannot be generalized to query instances directly. For hash function learning, we attempt to minimize the quantization loss between the outputs of the deep neural networks and the binary codes B. We adopt the following formulation to achieve this aim:

min_{W_t, B} ∑_{t=1}^{2} ||F_t(X_t; W_t) − B||_F^2,  s.t. B ∈ {−1, 1}^{n×r},    (5)

where W_t is the weight parameter of the t-th deep neural network; it can be optimized by employing a standard Stochastic Gradient Descent (SGD) optimizer [49]. When a new query arrives, its hash code can be directly generated with (sgn(F_t(X_t^q; W_t)) + 1)/2.

3.3.4. Overall Formulation

By integrating Eq.(1), Eq.(2) and Eq.(5) into a unified learning framework, we derive the overall objective function of our proposed method as

min_{α_t, B, U_t, G, P, W_t} ∑_{t=1}^{2} α_t^η ||Z_t − U_t B^T||_F^2 + λ||B − GP||_F^2 + βTr(G^T LG) + µ ∑_{t=1}^{2} ||F_t(X_t; W_t) − B||_F^2 + δ||P||_F^2,
s.t. ∑_{t=1}^{2} α_t = 1, α_t ≥ 0, U_t^T U_t = I, B ∈ {−1, 1}^{n×r}, G^T G = I, G ≥ 0.    (6)

The objective function in Eq.(6) has a clear interpretation: the first term learns the multi-modal shared hash codes; the second term regresses the corresponding virtual labels to the hash codes B; the third term learns the virtual labels by nonnegative spectral analysis; the fourth term learns the deep hash functions for each modality; and the last term is a regularization on P to avoid overfitting.

3.4. Optimization Scheme

The objective function in Eq.(6) is not jointly convex in all the involved variables. Fortunately, it is convex with respect to any one of the variables when the others are fixed. In this paper, we introduce an iterative algorithm to solve the problem. The detailed optimization steps are presented as follows.

Step 1. Update α_t |_{t=1}^{2}. The optimization problem with respect to α_t |_{t=1}^{2} can be written as

min_{α_t} ∑_{t=1}^{2} α_t^η ||Z_t − U_t B^T||_F^2,  s.t. ∑_{t=1}^{2} α_t = 1, α_t ≥ 0.    (7)

Inspired by [50], taking the derivative of Eq.(7) with respect to α_t |_{t=1}^{2} and setting it to zero, we obtain

α_t = 1 / √(||Z_t − U_t B^T||_F^2).    (8)

Step 2. Update U_t. The optimization problem for updating U_t can be formulated as

min_{U_t} ||Z_t − U_t B^T||_F^2,  s.t. U_t^T U_t = I.

This problem can be simplified by considering the orthogonal constraint U_t^T U_t = I:

max_{U_t} Tr(U_t^T Z_t B) = Tr(U_t^T F_t),    (9)

where F_t = Z_t B. The optimal U_t can be obtained using the Singular Value Decomposition (SVD) [51]:

U_t = Q_t I_{D_t×r} V_t^T,    (10)

where Q_t ∈ R^{D_t×D_t} is the matrix of left-singular vectors of F_t, V_t ∈ R^{r×r} is the matrix of right-singular vectors of F_t, and I_{D_t×r} ∈ R^{D_t×r} is the (truncated) identity matrix.

Step 3. Update P. The problem in Eq.(6) can be rewritten as

min_P λ||B − GP||_F^2 + δ||P||_F^2.    (11)

Calculating the derivative of Eq.(11) with respect to P and setting it to zero, we can derive the following updating rule for P:

P = (λG^T G + δI)^{−1} λG^T B.    (12)

Step 4. Update G. The optimization problem becomes

min_G λ||B − GP||_F^2 + βTr(G^T LG),  s.t. G^T G = I, G ≥ 0.    (13)

The objective function in Eq.(13) is not convex. To make the problem solvable, we relax the orthogonal constraint and rewrite it as

min_G (λ/β)||B − GP||_F^2 + Tr(G^T LG) + (γ/2β)||G^T G − I||_F^2,  s.t. G ≥ 0.    (14)

From Eq.(14), we can see that γ is the tuning parameter of the orthogonality condition. In our experiments, γ is set large enough for the training data. We use Theorem 1 to solve the above problem. The update rule is related to Nonnegative Matrix Factorization (NMF) [52] and is similar to the update rule in [53].

Theorem 1. For the G-subproblem, we can calculate G approximately with the following updating rule:

G_ij ← G_ij ((λ/β)BP^T + (γ/β)G)_ij / (LG + (λ/β)GPP^T + (γ/β)GG^T G)_ij.

Proof. Suppose that φ_ij is the Lagrange multiplier for the constraint G_ij ≥ 0 and Φ = [φ_ij] ∈ R^{n×c}. The Lagrangian function L is defined as

L = Tr(G^T LG) − (2λ/β)Tr(P^T G^T B) + (λ/β)Tr(P^T G^T GP) + (γ/2β)||G^T G − I||_F^2 + Tr(ΦG^T).

Setting the derivative of L with respect to G to zero, we obtain

2(LG − (λ/β)BP^T + (λ/β)GPP^T + (γ/β)GG^T G − (γ/β)G) + Φ = 0.    (15)

According to the Karush-Kuhn-Tucker (KKT) conditions [54], we have φ_ij G_ij = 0, ∀i, j. Multiplying Eq.(15) element-wise by G_ij, we get

(LG + (λ/β)GPP^T + (γ/β)GG^T G − (λ/β)BP^T − (γ/β)G)_ij G_ij = 0.

Motivated by NMF, we can derive the updating rule as

G_ij ← G_ij ((λ/β)BP^T + (γ/β)G)_ij / (LG + (λ/β)GPP^T + (γ/β)GG^T G)_ij.    (16)
∎
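The multiplicative rule in Eq.(16) can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions: the small epsilon guard and the nonnegativity of the denominator are simplifications (as written, the rule presumes the denominator stays positive; a full implementation might split L into its positive and negative parts), and the function name is hypothetical.

```python
import numpy as np

def update_G(G, B, P, L, lam, beta, gamma, eps=1e-12):
    """One multiplicative update of the nonnegative virtual label matrix G (Eq. 16).

    G: (n, c) virtual labels, B: (n, r) hash codes, P: (c, r) projection, L: (n, n) Laplacian.
    """
    numer = (lam / beta) * B @ P.T + (gamma / beta) * G
    denom = L @ G + (lam / beta) * G @ (P @ P.T) + (gamma / beta) * G @ (G.T @ G)
    return G * numer / (denom + eps)     # element-wise; keeps G nonnegative when denom > 0
```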

Step 5. Update B. In order to solve B, Eq.(6) can be rewritten as

min_B ∑_{t=1}^{2} α_t^η ||Z_t − U_t B^T||_F^2 + λ||B − GP||_F^2 + µ ∑_{t=1}^{2} ||F_t(X_t; W_t) − B||_F^2,  s.t. B ∈ {−1, 1}^{n×r}.

In our approach, the binary codes B are directly optimized without discarding the binary constraint B ∈ {−1, 1}^{n×r}. We obtain the solution as

B′ = ∑_{t=1}^{2} (α_t^η Z_t^T U_t + λGP + µF_t(X_t; W_t)),
B = sgn(B′).    (17)
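Below is a minimal NumPy sketch of the closed-form discrete update in Eq.(17). It assumes Z_t is stored as D_t×n as in Table 1 and that the network outputs F_t(X_t; W_t) have already been computed as n×r matrices; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def update_B(Z_list, U_list, F_list, alpha, G, P, lam, mu, eta=1.0):
    """Closed-form discrete update of the hash codes B (Eq. 17).

    Z_list[t]: (D_t, n) deep features, U_list[t]: (D_t, r) latent factors,
    F_list[t]: (n, r) network outputs, alpha[t]: modality weights.
    """
    B_prime = sum(
        (alpha[t] ** eta) * Z_list[t].T @ U_list[t] + lam * G @ P + mu * F_list[t]
        for t in range(len(Z_list))
    )
    return np.sign(B_prime)     # B in {-1, 1}^{n x r}; exact zeros (if any) map to 0
```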

Step 6. Update W_t. The corresponding optimization problem reduces to

min_{W_t} ∑_{t=1}^{2} ||F_t(X_t; W_t) − B||_F^2.    (18)

The parameters W_t of our networks are trained by fine-tuning the deep networks with SGD. The hash functions are learned according to the above updating rules until convergence. When a new query arrives, its hash code is obtained with (sgn(F_t(X_t^q; W_t)) + 1)/2. The key steps of the proposed discrete hash optimization method are summarized in Algorithm 1.

Algorithm 1 Unsupervised Deep Cross-modal Hashing with Virtual Label Regression
Input: Feature matrices X_t of the t-th modality, the number of virtual labels c, the hash code length r, and the parameters β, λ, γ, α_t, η, δ, µ.
Output: Hash code matrix B, hash functions F_t(X_t; W_t).
Extract the deep feature representations Z_t from the deep networks.
Randomly initialize U_t, G, P and B.
Construct the Laplacian matrix L by Eq.(3).
repeat
  Update α_t by Eq.(8).
  Update U_t by Eq.(10).
  Update P by Eq.(12).
  Update G by Eq.(16).
  Update B by Eq.(17).
  Update the deep model parameters W_t by Eq.(18).
until convergence
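The alternating loop of Algorithm 1 can be outlined in NumPy as below, with the SGD fine-tuning of the networks (Eq.(18)) left as an optional callback. The helper names, the random initialization scheme and the fixed iteration count are assumptions made for illustration; the per-step formulas follow Eqs.(8), (10), (12), (16) and (17).

```python
import numpy as np

def udch_vlr_train(Z_list, F_list, L, r, c, lam, beta, gamma, delta, mu, eta=1.0,
                   n_iter=15, fine_tune=None, seed=0):
    """Sketch of the alternating optimization in Algorithm 1.

    Z_list[t]: (D_t, n) deep features, F_list[t]: (n, r) network outputs,
    L: (n, n) anchor-graph Laplacian, fine_tune: optional callback returning new F_list.
    """
    rng = np.random.default_rng(seed)
    n, T = Z_list[0].shape[1], len(Z_list)
    U_list = [np.linalg.qr(rng.standard_normal((Z.shape[0], r)))[0] for Z in Z_list]
    G = np.abs(rng.standard_normal((n, c)))           # nonnegative init for the virtual labels
    P = rng.standard_normal((c, r))
    B = np.sign(rng.standard_normal((n, r)))
    alpha = np.full(T, 1.0 / T)                       # equal weights; overwritten in Step 1

    for _ in range(n_iter):
        # Step 1: modality weights (Eq. 8)
        alpha = np.array([1.0 / np.sqrt(np.linalg.norm(Z_list[t] - U_list[t] @ B.T) ** 2)
                          for t in range(T)])
        # Step 2: latent factors via SVD of F_t = Z_t B (Eq. 10)
        for t in range(T):
            Q, _, Vt = np.linalg.svd(Z_list[t] @ B, full_matrices=False)
            U_list[t] = Q @ Vt
        # Step 3: projection matrix, closed form (Eq. 12)
        P = np.linalg.solve(lam * G.T @ G + delta * np.eye(c), lam * G.T @ B)
        # Step 4: virtual labels, multiplicative rule (Eq. 16)
        numer = (lam / beta) * B @ P.T + (gamma / beta) * G
        denom = L @ G + (lam / beta) * G @ (P @ P.T) + (gamma / beta) * G @ (G.T @ G)
        G = G * numer / (denom + 1e-12)
        # Step 5: discrete hash codes (Eq. 17)
        B = np.sign(sum((alpha[t] ** eta) * Z_list[t].T @ U_list[t]
                        + lam * G @ P + mu * F_list[t] for t in range(T)))
        # Step 6: fine-tune the deep networks towards B (Eq. 18), outside this sketch
        if fine_tune is not None:
            F_list = fine_tune(B)
    return B, U_list, G, P, alpha
```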


3.5. Computation Complexity Analysis

In this subsection, we discuss the time complexity of the proposed method. We assume that n ≫ l, n ≫ c, n ≫ D_t, n ≫ r, n ≫ T. In the training phase, the U_t-subproblem takes O(D_t(D_t + r)r). For the B-subproblem, the time complexity of hash code generation is O(n(s + c)r). For the G-subproblem, the time complexity is O((rc + cr + r^2 + c^2)n). Moreover, we compute L by constructing the anchor graph instead of the kNN graph, which greatly reduces the computation cost. For the P-subproblem, the time complexity of the optimization process is O(c(c + r)n). Hence, the overall training complexity of UDCH-VLR is O(T(rc + cr + r^2 + c^2)n), where T is the number of iterations. This process scales linearly with the size of the training dataset.

4. Experimental Configuration

4.1. Evaluation Datasets

In this paper, we conduct extensive experiments on three publicly available multi-modal datasets: LabelMe [55], MIR Flickr [56] and NUS-WIDE [57]. All these datasets are comprised of two modalities: image and text. They have been widely adopted for evaluating cross-modal hashing performance [10, 13]. Specifically, the statistics of the three datasets are introduced as follows:

LabelMe. This dataset is comprised of 2,688 images downloaded with the LabelMe toolbox. Each image is manually associated with several object textual tags and annotated with one of 8 semantic categories. In our experiments, we randomly take 2,016 image-text pairs as the training set, and the remaining 672 pairs are treated as the testing set. For the shallow methods, an image is described by a 4,096-dimensional feature extracted by the Caffe [58] implementation of VGGNet [59], and a text is represented by a 245-dimensional BOW feature vector. For the deep learning based method, we take the images and texts as inputs of our deep hashing model.

MIR Flickr. This dataset is collected from the Flickr website and contains 25,000 image-text pairs, each of which is annotated with at least one of 24 provided labels. We construct a query set with 100 images randomly selected from each category


and delete the repeated instances to get the final 20,015 image-text pairs. In our experiments, we randomly choose 2,266 image-tag pairs as the query set and the remaining 17,749 pairs as the retrieval set. From the retrieval set, we randomly select 5,325 pairs as the training set. For the shallow methods, the image contents are represented by a 4,096-dimensional feature extracted by the Caffe implementation of VGGNet, while the text contents are represented by a 1,386-dimensional BOW feature vector. For the deep learning based method, we use the raw images and texts as the input of our deep feature learning model.

NUS-WIDE. This dataset is comprised of 269,648 image-tag pairs crawled from Flickr. Each image is associated with textual tags and is manually annotated with at least one of 81 concept labels. Since several concept labels are scarce, we select the 21 most frequently-used concepts and the corresponding 195,834 instances for the experiments. Further, we randomly select 2,084 image-text pairs as the query set, the remaining 193,749 pairs are treated as the retrieval set, and 5,000 image-text pairs from the retrieval set are randomly selected as the training set. For the shallow methods, the images are described by a 4,096-dimensional feature extracted by the Caffe implementation of VGGNet, and the texts are described by a 1,000-dimensional BOW feature vector. For the deep learning based method, we extract the deep features from images and texts with our deep feature learning model.

4.2. Evaluation Baselines

Our approach is unsupervised. To evaluate the retrieval performance, we compare it with state-of-the-art unsupervised cross-modal hashing methods, including: Cross-view Hashing (CVH) [9], Inter-media Hashing (IMH) [28], Linear Cross-modal Hashing (LCMH) [29], Fusion Similarity Hashing (FSH) [30], Collective Matrix Factorization Hashing (CMFH) [32], Latent Semantic Sparse Hashing (LSSH) [33], Cross-modal Discrete Hashing (CMDH) [31] and Unsupervised Deep Hashing via Binary Latent Factor Models (UDCMH) [36]. All these methods are tuned to report their best results according to the descriptions in the corresponding papers.


4.3. Evaluation Metrics

In the experiments, we adopt two standard evaluation metrics, Mean Average Precision (MAP) [13] and topK-precision [60]. The definitions of these two metrics are as follows:

1) MAP: Given a query with a group of R retrieved results, the average precision (AP) is defined as

AP = (1/N) ∑_{r=1}^{R} P(r)δ(r),    (19)

where N is the number of relevant instances in the retrieval set, P(r) denotes the precision of the top r retrieved instances, and δ is the indicator function, whose value is 1 if the r-th retrieved instance is a true neighbour of the query and 0 otherwise. The MAP is defined as

MAP = (1/Q) ∑_{i=1}^{Q} AP(q_i),    (20)

where Q is the size of the query set. Obviously, larger MAP values indicate better retrieval performance of the learning model.

2) TopK-precision: This metric reflects the change of precision with respect to the number of retrieved instances. Given the K returned instances, we calculate the proportion of correct items retrieved from the database as the topK-precision.
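For concreteness, a small Python sketch of both metrics is given below, assuming each query is represented by a 0/1 relevance vector ordered by the Hamming-distance ranking; the function names are illustrative, not part of a standard library.

```python
import numpy as np

def average_precision(relevant_sorted, n_relevant=None, R=None):
    """AP of one query (Eq. 19).

    relevant_sorted: 0/1 relevance of the ranked retrieval list.
    n_relevant: N in Eq. (19); defaults to the number of relevant items in the list.
    """
    rel = np.asarray(relevant_sorted, dtype=float)[:R]
    N = rel.sum() if n_relevant is None else float(n_relevant)
    if N == 0:
        return 0.0
    precision_at_r = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P(r)
    return float((precision_at_r * rel).sum() / N)                # (1/N) * sum P(r) * delta(r)

def topk_precision(relevant_sorted, k):
    """Proportion of true neighbours among the top K retrieved instances."""
    return float(np.asarray(relevant_sorted, dtype=float)[:k].mean())

def mean_average_precision(relevance_lists, R=None):
    """MAP over a query set (Eq. 20)."""
    return float(np.mean([average_precision(r, R=R) for r in relevance_lists]))
```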

4.4. Implementation Details

There are several parameters in the proposed UDCH-VLR: λ, β, γ, δ, µ, η, the number of virtual labels c, and the number of anchors m. The regularization parameter λ controls the regression from the labels to the binary hash codes, β supports the learning of the virtual labels, the two parameters γ and δ are regularization parameters to avoid overfitting, and µ is the penalty parameter. Specifically, our method achieves the best performance with {λ = 10^−2, β = 10^5, γ = 10^4, δ = 10^−2, µ = 10^−1}, {λ = 10^4, β = 10^−2, γ = 10^−2, δ = 10^−2, µ = 10^−4} and {λ = 10^4, β = 5, γ = 10^−2, δ = 10^4, µ = 10^−3} on LabelMe, MIR Flickr and NUS-WIDE respectively. The parameter c is also crucial to the experiment; we set it to 35 on LabelMe and NUS-WIDE, and 100 on MIR Flickr. On all three datasets, the parameter η, which controls the weight of each modality, is set to 1, the number of anchors is set to 300, and the maximum number of iterations is set to 15. Moreover, our experiments are conducted on the open-source Caffe framework. The learning rates are 0.01 and 0.001 for the image and text networks, and the batch size is fixed as 15 for both networks. For the compared approach UDCMH, as no implementation code has been published, we reproduce it ourselves according to the corresponding paper and carefully tune the parameters to fit the datasets. For the other baseline methods, the codes are released by the authors. All the methods are run on a computer with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz and a 64-bit Ubuntu 16.04.6 LTS operating system.

5. Experimental Results

5.1. Retrieval Accuracy Comparison

In Tables 2, 3 and 4, we report the MAP values of all methods with code lengths varied from 16 to 128 bits on LabelMe, MIR Flickr and NUS-WIDE respectively. From the MAP results, we have the following observations: 1) UDCH-VLR significantly outperforms the state-of-the-art baselines on both the image-to-text and text-to-image retrieval tasks with varied code lengths. The superior performance can be attributed to two advantages of our method: firstly, UDCH-VLR simultaneously performs deep learning and hash code learning, which improves the representation capability of the hash codes; secondly, UDCH-VLR learns the virtual labels of heterogeneous multi-modal data as semantic guidance, which enhances the discriminative capability of the hash codes. 2) It is worth noting that the MAP values of our method keep increasing as the hash code length becomes longer. The reason is that the discrete optimization in our method successfully avoids the binary quantization loss, and more hash bits carry more valuable information. 3) The deep cross-modal hashing method UDCMH achieves the second best results compared with the shallow hashing methods; this experimental phenomenon validates the powerful representation capability of deep models. 4) The performance of the above unsupervised methods on the text-to-image retrieval task is better than that on the image-to-text retrieval task. This is because the text modality can better describe the topic of the instance pairs.


Figure 2: The topK-precision curves on LabelMe, MIR Flickr, NUS-WIDE when the hash code length is set as 32.

Table 2: MAP comparison on LabelMe. The best result in each column is marked with bold.

Methods      Image → Text                              Text → Image
             16 bits   32 bits   64 bits   128 bits    16 bits   32 bits   64 bits   128 bits
CVH          0.2100    0.2117    0.2067    0.1974      0.2302    0.2231    0.2154    0.2100
IMH          0.3286    0.3486    0.3536    0.3622      0.4076    0.4291    0.4847    0.5012
FSH          0.3901    0.5485    0.5602    0.5954      0.5612    0.5802    0.6005    0.6149
LCMH         0.3305    0.3729    0.3995    0.4247      0.3907    0.4445    0.4851    0.5201
CMFH         0.5480    0.5474    0.5601    0.5988      0.5699    0.5832    0.6059    0.6273
LSSH         0.5311    0.5571    0.5845    0.6077      0.5741    0.5905    0.6157    0.6491
CMDH         0.5571    0.5799    0.6022    0.6102      0.5813    0.6245    0.6449    0.6570
UDCMH        0.5532    0.5901    0.6358    0.6435      0.5914    0.6311    0.6501    0.6592
UDCH-VLR     0.5614    0.6247    0.6807    0.7368      0.6053    0.6717    0.7073    0.7579


Figure 3: The topK-precision curves on LabelMe, MIR Flickr and NUS-WIDE when the hash code length is set as 128.

Table 3: MAP comparison on MIR Flickr. The best result in each column is marked with bold.

Methods      Image → Text                              Text → Image
             16 bits   32 bits   64 bits   128 bits    16 bits   32 bits   64 bits   128 bits
CVH          0.5727    0.5699    0.5606    0.5544      0.5936    0.5819    0.5778    0.5723
IMH          0.5735    0.5782    0.5862    0.5973      0.5779    0.5848    0.5988    0.6013
FSH          0.5746    0.5756    0.5833    0.5865      0.5809    0.5740    0.5763    0.5978
LCMH         0.5813    0.5922    0.5926    0.5935      0.5884    0.5952    0.6088    0.6157
CMFH         0.5916    0.6075    0.6204    0.6223      0.5929    0.6174    0.6261    0.6339
LSSH         0.5932    0.6221    0.6389    0.6319      0.5947    0.6232    0.6359    0.6491
CMDH         0.5935    0.6323    0.6479    0.6573      0.5964    0.6399    0.65071   0.6703
UDCMH        0.6645    0.6824    0.6933    0.7132      0.6710    0.6936    0.7088    0.7187
UDCH-VLR     0.7400    0.7406    0.7463    0.7485      0.7582    0.7609    0.7616    0.7695

Table 4: MAP comparison on NUS-WIDE. The best result in each column is marked with bold.

Methods      Image → Text                              Text → Image
             16 bits   32 bits   64 bits   128 bits    16 bits   32 bits   64 bits   128 bits
CVH          0.2100    0.3537    0.3487    0.3218      0.2100    0.3549    0.3442    0.3231
IMH          0.4086    0.4562    0.4856    0.5262      0.4247    0.4586    0.4953    0.5326
FSH          0.3286    0.3559    0.4150    0.4460      0.4354    0.4912    0.5152    0.5526
LCMH         0.4405    0.5184    0.5395    0.5748      0.4507    0.5384    0.5442    0.5765
CMFH         0.5280    0.5384    0.5628    0.5970      0.5741    0.5835    0.5973    0.6218
LSSH         0.5311    0.5420    0.5671    0.6032      0.5864    0.5920    0.6053    0.6272
CMDH         0.5338    0.5564    0.5782    0.6085      0.5913    0.6018    0.6021    0.6358
UDCMH        0.5401    0.5686    0.5985    0.6131      0.5954    0.6100    0.6264    0.6435
UDCH-VLR     0.5571    0.5983    0.6322    0.6448      0.6053    0.6559    0.6742    0.7348

The topK-precision curves of all methods at 32 and 128 bits are shown in Figures 2 and 3. It is easy to observe that our method consistently achieves higher precision than the other compared methods on both retrieval tasks. These results also validate the effectiveness of UDCH-VLR. Besides, we also find that the topK-precision of all baselines is better on the text-to-image task, which is consistent with the MAP results on the three datasets. Moreover, the matrix factorization-based methods perform better than the graph-based methods. These results further verify the effectiveness of collaborative matrix decomposition for hash code learning. The topK-precision curves are calculated based on the Hamming ranking. When performing retrieval, the top retrieved instances receive more attention from users in practice. From this point of view, the proposed method has advantages over the baselines for cross-modal retrieval. To sum up, UDCH-VLR achieves superior performance over the compared unsupervised methods on all of the evaluation metrics and benchmark datasets. We can conclude that the joint learning framework of discrete matrix decomposition and virtual label learning is an effective strategy for learning discriminative unsupervised hash codes.


5.2. Time Cost Analysis

To validate the efficiency of UDCH-VLR, we further evaluate the training and query time (in seconds) with the hash code length fixed as 64 bits on the three datasets. The experimental results are summarized in Table 5. From this table, we can observe that the proposed method consumes much less training and query time than the deep method UDCMH on all three datasets. On MIR Flickr and NUS-WIDE, the training time is also much shorter than that of the shallow methods IMH and FSH. The reasons for the high efficiency of our approach can be summarized as follows: 1) UDCH-VLR performs hash learning based on collaborative matrix factorization to transform multi-modal data into unified hash codes. This strategy reduces the computational complexity of similarity preservation. 2) The Laplacian matrix L is computed by constructing the anchor graph. This avoids directly computing the similarity matrix of the training data and thus reduces the training time. 3) The discrete optimization algorithm of our method generates the hash codes with a closed-form operation rather than iterative Discrete Cyclic Coordinate (DCC) descent, which further enhances the training efficiency.

Table 5: Comparison of training and query time on LabelMe, MIR Flickr and NUS-WIDE.

Methods      Training time (s)                         Query time (s)
             LabelMe   MIR Flickr   NUS-WIDE           LabelMe   MIR Flickr   NUS-WIDE
CVH          0.42      5.56         71.50              0.21      4.80         9.92
IMH          0.84      25.96        113.22             0.22      9.79         11.36
LCMH         1.61      4.97         17.18              0.23      3.45         9.72
FSH          2.19      36.43        212.14             0.14      3.39         4.88
LSSH         0.95      11.49        12.39              1.56      14.05        15.31
CMFH         0.07      1.56         2.57               0.09      1.35         1.60
CMDH         2.20      4.16         4.85               1.30      3.05         3.74
UDCMH        5.82      40.55        94.51              2.13      14.57        25.63
UDCH-VLR     3.50      23.01        90.00              1.84      10.92        21.07

5.3. Effects of Virtual Labels

In our method, we propose to learn virtual labels which distinguish multi-modal hash codes from different categories. The virtual labels are learned to effectively enhance the underlying semantic associations and the discriminative capability of the hash codes. To validate their effectiveness, we conduct additional experiments to compare our method with the variant UDCH-VLR-I on the three datasets with the code length varied from 16 to 128 bits. UDCH-VLR-I denotes the variant that removes the virtual labels from our formulation. The performance comparison results on both cross-modal retrieval tasks are shown in Figure 4. From these figures, we can observe that our method achieves superior retrieval performance compared with the variant UDCH-VLR-I. These results prove that the discriminative capability of hash codes can be improved with virtual labels.


Figure 4: Comparison results of UDCH-VLR, UDCH-VLR-I and UDCH-VLR-II on three datasets.


5.4. Effects of Discrete Optimization

To show the effectiveness of the discrete optimization in UDCH-VLR, we design a variant method denoted as UDCH-VLR-II for comparison. Specifically, we adopt the optimization strategy which first relaxes the discrete constraints and then transforms the real-valued solution to generate the binary hash codes. The objective function of UDCH-VLR-II is formulated as

min_B ∑_{t=1}^{2} α_t^η ||Z_t − U_t B^T||_F^2 + λ||B − GP||_F^2 + βTr(G^T LG) + µ ∑_{t=1}^{2} ||F_t(X_t; W_t) − B||_F^2 + δ||P||_F^2.

The relaxed hash codes can be calculated as

B = (λGP + ∑_{t=1}^{2} α_t^η U_t Z_t)((∑_{t=1}^{2} α_t^η U_t^T U_t) + (∑_{t=1}^{2} µF_t(X_t; W_t)))^{−1}.

Figure 4 shows the comparison results of UDCH-VLR-II and UDCH-VLR on LabelMe, MIR Flickr, and NUS-WIDE respectively. As we can observe, UDCH-VLR achieves superior performance compared with UDCH-VLR-II. These results validate that the proposed discrete hash optimization can effectively reduce binary quantization errors, especially for the text modality.

5.5. Parameter Sensitivity Analysis

In this subsection, we conduct experiments to analyze the parameter sensitivity on the two cross-modal retrieval tasks. Specifically, we record the MAP variations with the involved parameters λ, β, γ, δ and µ on MIR Flickr when the hash code length is set to 64, and with the number of virtual labels c on LabelMe, MIR Flickr and NUS-WIDE when the hash code length is set to 64. We tune the involved parameters λ, β, γ, δ and µ in {10^−4, 10^−2, 1, 10^2, 10^4} while fixing the other parameters on the image-to-text and text-to-image retrieval tasks. The results are shown in Figure 5. It can be seen

number of virtual labels c, which is chosen from a range of {5, 35, 65, 100, 200, 500}. The results in Figure 6 show that UDCH-VLR can achieve the best performance when c

is in the range of {35, 100, 35} on LabelMe, MIR Flickr, and NUS-WIDE, respectively.

24


Figure 5: Performance variations with the parameters λ, β, γ, δ and µ on MIR Flickr when the hash code length is set to 64.


Figure 6: Performance variations with the parameter c on MIR Flickr when the hash code length is set to 64.


5.6. Convergence Analysis

In this subsection, empirical experiments are conducted to verify the convergence of the proposed iterative algorithm on the LabelMe, MIR Flickr and NUS-WIDE datasets with the code length fixed as 64. The convergence curves are illustrated in Figure 7. They show that the objective function value of the proposed method decreases rapidly within the first 5 iterations and finally becomes steady after 15 iterations on the three datasets. These results validate the convergence of the proposed discrete optimization method.


Figure 7: The convergence of the proposed iterative algorithm on LabelMe, MIR Flickr and NUS-WIDE datasets with the code length fixed as 64.

6. Conclusion

In this paper, we propose an effective Unsupervised Deep Cross-modal Hashing with Virtual Label Regression (UDCH-VLR) method to jointly perform unsupervised feature learning and binary projection. In our model, the deep feature learning preserves the nonlinear relations of multi-modal samples with powerful deep neural networks, and the binary mapping learns the unified hash codes with collaborative matrix factorization on the deep features and latent factors. Further, our model learns the virtual labels via nonnegative spectral analysis and exploits them to enhance the discriminative capability of the hash codes. To solve the formulated optimization problem, we propose an effective alternative optimization scheme to simultaneously learn the discrete binary codes and hash functions. Extensive experiments on three publicly available datasets demonstrate the superior performance of UDCH-VLR over state-of-the-art unsupervised cross-modal hashing methods.

Author Contribution

Tong Wang: Data curation, Writing - original draft preparation, Software, Validation. Lei Zhu: Investigation, Conceptualization, Methodology, Reviewing and Editing, Supervision, Funding acquisition. Zhiyong Cheng: Visualization, Reviewing and Editing. Jingjing Li: Reviewing and Editing, Project administration. Zan Gao: Reviewing and Editing.

Declaration of Competing Interest We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property. We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].


References

[1] X. Lu, L. Zhu, J. Li, H. Zhang, H. T. Shen, Efficient supervised discrete multi-view hashing for large-scale multimedia search, IEEE Transactions on Multimedia, doi:10.1109/TMM.2019.2947358. [2] L. Zhu, Z. Huang, X. Liu, X. He, J. Sun, X. Zhou, Discrete multimodal hashing with canonical views for robust mobile landmark search, IEEE Transactions on Multimedia 19 (9) (2017) 2066–2079. [3] L. Zhu, Z. Huang, Z. Li, L. Xie, H. T. Shen, Exploring auxiliary context: Discrete semantic transfer hashing for scalable image retrieval, IEEE Transactions on Neural Networks and Learning Systems 29 (11) (2018) 5264–5276. [4] K. Wang, R. He, L. Wang, W. Wang, T. Tan, Joint feature selection and subspace learning for cross-modal retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10) (2015) 2010–2023. [5] J. Tang, K. Wang, L. Shao, Supervised matrix factorization hashing for cross-modal retrieval, IEEE Transactions on Image Processing 25 (7) (2016) 3157–3166. [6] F. Zheng, Y. Tang, L. Shao, Hetero-manifold regularisation for cross-modal hashing, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (5) (2016) 1059–1071. [7] L. Zhu, J. Shen, L. Xie, Z. Cheng, Unsupervised visual hashing with semantic assistant for content-based image retrieval, IEEE Transactions on Knowledge and Data Engineering 29 (2) (2017) 472–486. [8] L. Zhu, J. Shen, L. Xie, Z. Cheng, Unsupervised topic hypergraph hashing for efficient mobile image retrieval, IEEE Transactions on Cybernetics 47 (11) (2017) 3941–3954. [9] S. Kumar, R. Udupa, Learning hash functions for cross-view similarity search, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011, pp. 1360–1365.

[10] M. Hu, Y. Yang, F. Shen, N. Xie, R. Hong, H. T. Shen, Collective reconstructive embeddings for cross-modal hashing, IEEE Transactions on Image Processing 28 (6) (2018) 2770–2784. [11] D. Wang, Q. Wang, X. Gao, Robust and flexible discrete hashing for cross-modal similarity search, IEEE Transactions on Circuits and Systems for Video Technology 28 (10) (2017) 2703–2715. [12] D. Wang, X.-B. Gao, X. Wang, L. He, Label consistent matrix factorization hashing for large-scale cross-modal similarity search, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (10) (2018) 2466–2479. [13] C. Zhang, W.-S. Zheng, Semi-supervised multi-view discrete hashing for fast image search, IEEE Transactions on Image Processing 26 (6) (2017) 2604–2617. [14] X. Luo, P.-F. Zhang, Y. Wu, Z.-D. Chen, H.-J. Huang, X.-S. Xu, Asymmetric discrete cross-modal hashing, in: Proceedings of the ACM on International Conference on Multimedia Retrieval (ICMR), 2018, pp. 204–212. [15] S. He, B. Wang, Z. Wang, Y. Yang, F. Shen, Z. Huang, H. T. Shen, Bidirectional discrete matrix factorization hashing for image search, IEEE Transactions on Cyberneticsdoi:10.1109/TCYB.2019.2941284. [16] F. Shen, C. Shen, Q. Shi, A. Van Den Hengel, Z. Tang, Inductive hashing on manifolds, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1562–1569. [17] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search., in: Proceedings of International Conference on Computer Vision (ICCV), 2009, pp. 2130–2137. [18] M. Hu, Y. Yang, F. Shen, N. Xie, R. Hong, H. T. Shen, Collective reconstructive embeddings for cross-modal hashing, IEEE Transactions on Image Processing 28 (6) (2018) 2770–2784.


[19] Y. Zhuang, Z. Yu, W. Wang, F. Wu, S. Tang, J. Shao, Cross-media hashing with neural networks, in: Proceedings of the ACM International Conference on Multimedia (MM), 2014, pp. 901–904. [20] Y. Cao, M. Long, J. Wang, Q. Yang, P. S. Yu, Deep visual-semantic hashing for cross-modal retrieval, in: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2016, pp. 1445–1454. [21] Q.-Y. Jiang, W.-J. Li, Deep cross-modal hashing, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3232–3240. [22] D. Zhang, W.-J. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in: Proceedings of AAAI Conference on Artificial Intelligence (AAAI), 2014, pp. 2177–2183. [23] Z. Lin, G. Ding, M. Hu, J. Wang, Semantics-preserving hashing for cross-view retrieval, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3864–3872. [24] X. Xu, F. Shen, Y. Yang, H. T. Shen, X. Li, Learning discriminative binary codes for large-scale cross-modal retrieval, IEEE Transactions on Image Processing 26 (5) (2017) 2494–2507. [25] Y. Han, L. Zhu, Z. Cheng, J. Li, X. Liu, Discrete optimal graph clustering, IEEE Transactions on Cybernetics (2018) 1–14doi:10.1109/TCYB.2018.2881539. [26] J. Li, K. Lu, Z. Huang, L. Zhu, H. T. Shen, Transfer independently together: A generalized framework for domain adaptation, IEEE Transactions on Cybernetics 49 (6) (2019) 2144–2155. [27] J. Li, K. Lu, Z. Huang, L. Zhu, H. T. Shen, Heterogeneous domain adaptation through progressive alignment, IEEE Transactions on Neural Networks and Learning Systems 30 (5) (2019) 1381–1391.

30

[28] J. Song, Y. Yang, Y. Yang, Z. Huang, H. T. Shen, Inter-media hashing for largescale retrieval from heterogeneous data sources, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2013, pp. 785– 796. [29] X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for efficient multimedia search, in: Proceedings of the ACM International Conference on Multimedia (MM), 2013, pp. 143–152. [30] H. Liu, R. Ji, Y. Wu, F. Huang, B. Zhang, Cross-modality binary code learning via fusion similarity hashing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7380–7388. [31] V. E. Liong, J. Lu, Y.-P. Tan, Cross-modal discrete hashing, Pattern Recognition 79 (2018) 114 – 129. [32] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2075–2082. [33] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, in: Proceedings of the ACM International Conference on Research on Development in Information Retrieval (SIGIR), 2014, pp. 415–424. [34] D. Wang, X. Gao, X. Wang, L. He, Semantic topic multimodal hashing for crossmedia retrieval, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 3890 – 3896. [35] D. Hu, F. Nie, X. Li, Deep binary reconstruction for cross-modal hashing, IEEE Transactions on Multimedia 21 (4) (2018) 973–985. [36] G. Wu, Z. Lin, J. Han, L. Liu, G. Ding, B. Zhang, J. Shen, Unsupervised deep hashing via binary latent factor models for large-scale cross-modal retrieval, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 2854–2860.

31

[37] J. Zhang, Y. Peng, M. Yuan, Unsupervised generative adversarial cross-modal hashing, in: Proceedings of AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 539–546. [38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A largescale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255. [39] C. Cui, H. Liu, T. Lian, L. Nie, L. Zhu, Y. Yin, Distribution-oriented aesthetics assessment with semantic-aware hybrid network, IEEE Transactions on Multimedia 21 (5) (2019) 1209–1220. [40] Y. Yang, J. Zhou, J. Ai, Y. Bin, A. Hanjalic, H. T. Shen, Video captioning by adversarial lstm, IEEE Transactions on Image Processing 27 (11) (2018) 5600– 5611. [41] Z. Cheng, X. Chang, L. Zhu, R. Catherine Kanjirathinkal, M. S. Kankanhalli, MMALFM: explainable recommendation by leveraging reviews and images, ACM Transactions on Information Systems 37 (2) (2019) 16:1–16:28. [42] A. Liu, Y. Su, W. Nie, M. Kankanhalli, Hierarchical clustering multi-task learning for joint human action grouping and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1) (2017) 102–114. [43] N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, M. Kankanhalli, Dual-stream recurrent neural network for video captioning, IEEE Transactions on Circuits and Systems for Video Technology 29 (8) (2019) 2482–2493. [44] A. Y. Ng, M. I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2002, pp. 849–856. [45] W. Liu, J. Wang, S. Kumar, S. Chang, Hashing with graphs, in: Proceedings of the International Conference on Machine Learning (ICML), 2011, pp. 1–8.

32

[46] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, Q. Dai, Stat: spatialtemporal attention mechanism for video captioning, IEEE Transactions on Multimedia. doi:10.1109/TMM.2019.2924576. [47] C. Yan, L. Li, C. Zhang, B. Liu, Y. Zhang, Q. Dai, Cross-modality bridging and knowledge transferring for image understanding, IEEE Transactions on Multimedia 21 (10) (2019) 2675–2685. [48] C. Yan, H. Xie, J. Chen, Z. Zha, X. Hao, Y. Zhang, Q. Dai, A fast uyghur text detector for complex background images, IEEE Transactions on Multimedia 20 (12) (2018) 3389–3398. [49] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of the International Conference on Computational Statistics (COMPSTAT), 2010, pp. 177–186. [50] F. Nie, J. Li, X. Li, et al., Parameter-free auto-weighted multiple graph learning: A framework for multiview clustering and semi-supervised classification, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 1881–1887. [51] J. Huang, F. Nie, H. Huang, C. Ding, Robust manifold nonnegative matrix factorization, ACM Transactions on Knowledge Discovery from Data 8 (3) (2014) 1–21. [52] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2001, pp. 556–562. [53] X. Shen, F. Shen, L. Liu, Y.-H. Yuan, W. Liu, Q.-S. Sun, Multiview discrete hashing for scalable multimedia search, ACM Transactions on Intelligent Systems and Technology 9 (5) (2018) 53:1–53:21. [54] H.-C. Wu, The karushkuhntucker optimality conditions in an optimization problem with interval-valued objective function, European Journal of Operational Research (2007) 46 – 59. 33

[55] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–175. [56] M. J. Huiskes, M. S. Lew, The mir flickr retrieval evaluation, in: Proceedings of the ACM International Conference on Multimedia Information Retrieval (MIR), 2008, pp. 39–43. [57] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, Nus-wide: a real-world web image database from national university of singapore, in: Proceedings of the ACM International Conference on Image and Viedo Retrieval (CIVR), 2009, p. 48. [58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia (MM), 2014, pp. 675–678. [59] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, H. T. Shen, Unsupervised deep hashing with similarity-adaptive and discrete optimization, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12) (2018) 3034–3044. [60] X.-S. Xu, Dictionary learning based hashing for cross-modal retrieval, in: Proceedings of the ACM International Conference on Multimedia (MM), 2016, pp. 177–181.

34

Biography

Tong Wang is with the School of Information Science and Engineering, Shandong Normal University, China. Her research interests are in the area of large-scale multimedia content analysis and retrieval.

Lei Zhu received the bachelor's degree from Shandong Normal University in 2016. She is currently a Ph.D. student at the School of Information Science and Engineering, Shandong Normal University, China, under the supervision of Prof. Huaxiang Zhang (since 2016). Her research interests are in the area of large-scale multimedia content analysis and retrieval.


Zhiyong Cheng is currently a Professor with the Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences). He received the Ph.D. degree in computer science from Singapore Management University in 2016 and then worked as a Research Fellow at the National University of Singapore. His research interests mainly focus on large-scale multimedia content analysis and retrieval. His work has been published in a number of top forums, including ACM SIGIR, MM, WWW, TOIS, IJCAI, TKDE, and TCYB. He has served as a PC member for several top conferences, such as MM and MMM, and as a regular reviewer for journals including TKDE, TIP, and TMM.

Jingjing Li received his M.Sc. and Ph.D. degrees in Computer Science from the University of Electronic Science and Technology of China in 2013 and 2017, respectively. He is now a research fellow of the National Postdoctoral Program for Innovative Talents with the School of Computer Science and Engineering, University of Electronic Science and Technology of China. His research interests include machine learning, especially transfer learning, subspace learning, and recommender systems.

Zan Gao is a full professor with the Shandong AI Institute, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences). From 2011 to 2018, he worked at the School of Computer Science and Engineering, Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology. From September 2009 to September 2010, he was a visiting scholar in the School of Computer Science, Carnegie Mellon University, USA, working with Prof. Alexander G. Hauptmann. From July 2016 to January 2017, he was a visiting scholar at the School of Computing, National University of Singapore, working with Prof. Tat-Seng Chua. He received his Ph.D. degree from Beijing University of Posts and Telecommunications in 2011. His research interests include machine learning, computer vision, and multimedia analysis and retrieval.
