Author's Accepted Manuscript

Low-rank Image Tag Completion with Dual Reconstruction Structure Preserved
Xue Li, Yu-Jin Zhang, Bin Shen, Bao-Di Liu

PII: S0925-2312(15)01266-7
DOI: http://dx.doi.org/10.1016/j.neucom.2014.12.121
Reference: NEUCOM16026
To appear in: Neurocomputing
Received date: 30 June 2014; Revised date: 14 December 2014; Accepted date: 15 December 2014

Cite this article as: Xue Li, Yu-Jin Zhang, Bin Shen and Bao-Di Liu, Low-rank Image Tag Completion with Dual Reconstruction Structure Preserved, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2014.12.121

This is a PDF file of an unedited manuscript that has been accepted for publication. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Low-rank Image Tag Completion with Dual Reconstruction Structure Preserved

Xue Li†, Yu-Jin Zhang†, Bin Shen‡1, Bao-Di Liu§

† Electronic Engineering, Tsinghua University, Beijing 100084, China
‡ Computer Science, Purdue University, West Lafayette, IN 47907, USA
§ Information and Control Engineering, China University of Petroleum, Qingdao 266580, China

[email protected], [email protected], [email protected]
Abstract

User-provided tags, although they play an essential role in image annotation, may also inhibit accurate annotation, since they are potentially incomplete. To address this problem, a novel tag completion method is proposed in this paper. In order to exploit as much information as possible, the proposed method is designed with the following features: 1) Low rank and error sparsity: the initial tag matrix D is decomposed into the complete tag matrix A and a sparse error matrix E, where A is further factorized into a basis matrix U and a sparse coefficient matrix V, i.e., D = UV + E. With K ≪ M, information sharing between related tags and similar samples can be achieved via subspace construction. 2) Local reconstruction structure consistency: the local linear reconstruction structures obtained in the original feature and tag spaces are preserved in both the low-dimensional feature subspace and the tag subspace. 3) Basis diversity: the pair-wise dot products between the columns of U are minimized, in order to obtain more representative basis vectors. Experiments conducted on the Corel5K dataset and the newly issued Flickr30Concepts dataset demonstrate the effectiveness and efficiency of the proposed method.

Keywords: Tag completion, Image annotation, Low-rank factorization, LLE
1 This author is currently with Google Research, New York.

Preprint submitted to Neurocomputing, September 8, 2015.
Figure 1: Sample image in Corel5K dataset, where several additional tags can be added.
1. Introduction

One of the most significant signs of the big data era lies in the explosive growth of visual data that need to be well organized, analyzed, and retrieved. Nevertheless, though many more images are available, it still seems difficult to obtain a sufficient number of accurate annotations, since manual labeling remains expensive. A natural alternative is automatic image annotation (AIA), which performs labeling by machines trained with well-labeled data, and thus relieves humans from the tedious labeling work. For this reason, high-performance AIA [1, 2, 3] methods are in demand for both content-based image retrieval (CBIR) [4, 5, 6] and tag-based image retrieval (TBIR).

In practice, however, high-quality training data are so scarce that we are usually faced with imperfectly or incompletely labeled images. Take the image in Figure 1 as an example, which is extracted from the Corel5K dataset and has 4 user-provided tags, i.e., sky, grass, house, and gate. However, several other words could serve as tags as well, such as clouds, roof, and tree, just to name a few. Due to the existence of synonyms and user preference, it is nearly impossible for users to manually provide a complete list of tags. Such incomplete tags pose a threat to the annotation process, since the resulting relevance between the visual content and the semantic tags is inadequate or even harmful for AIA systems when predicting accurate labels. Therefore, if we can perform accurate tag completion on the samples before feeding them into the AIA module, a significant improvement of the annotation performance can be expected, which in turn will benefit many visual applications.
Now that all the data we have are incomplete, how can we recover the missing labels? First of all, we exploit the insight that labels are not independent; they are, as a matter of fact, correlated with each other. For instance, if an image is labeled with gate and roof, then house would be a reasonable guess. Hence, inspired by the information sharing mechanism of Multi-task Learning (MTL), our method encourages related labels to share their information, and achieves this goal by constructing a low-dimensional tag subspace. This is also consistent with the conclusion of [7, 8, 9] that features and tags often reside on low-dimensional subspaces. Specifically, we assume that the initial label matrix (whose columns correspond to tags and rows represent samples) can be factorized into a low-dimensional basis matrix and a sparse coefficient matrix. Such a framework encourages related tags to be linearly reconstructed by common basis vectors, and thus information gets shared. Meanwhile, by switching the roles of the basis matrix and the coefficient matrix, we encourage similar samples to choose common basis vectors as well; hence, information at the instance level also gets shared.

To further improve the performance, in addition to the above low-rank factorization, which implicitly exploits useful information, we also explicitly promote content consistency and tag correlation by following the idea of LSR [10] and adopting the LLE assumption. The LSR method assumes that each sample can be linearly reconstructed by several other samples, and each tag can be linearly reconstructed by several other tags, while the local geometry structures in both the original feature space and the tag space are preserved. As an extension, our strategy adopts the LLE assumption not only in the original feature space but also in the low-dimensional subspaces [11, 12]. To achieve this goal, the local geometry structures are computed in the original spaces and preserved while learning the subspaces.

Finally, to obtain more representative basis vectors in our low-rank factorization scheme, which encapsulates sparse coding [13, 14, 15], we follow the methods of [16] and [17] and promote basis diversity by minimizing the pair-wise dot products of all the basis vectors. This simple constraint costs us little, but endows our
formulation with extra robustness and significant performance improvement.

Contributions. The main contribution of the proposed formulation lies in the combination and extension of low-rank factorization and local reconstruction structure consistency. For the former, by constructing the low-dimensional feature subspace and the low-dimensional tag subspace, we seek to recover the missing labels through information sharing among similar samples and related tags. For the latter, the local reconstruction structures obtained in the original spaces are preserved in both the compressed feature subspace and the tag subspace, which allows us to explicitly pursue content consistency and tag correlation, and to integrate visual cues and semantic information seamlessly into our factorization framework. The computational complexity of our method grows linearly with the number of samples; therefore, the proposed method is able to handle much larger datasets.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on image annotation and tag completion. The proposed tag completion method is formalized in Section 3, followed by detailed optimization methods in Section 4. Experimental results on both the Corel5K dataset and the Flickr30Concepts dataset are presented in Section 5, and Section 6 concludes this paper.

2. Related Work

Image tag completion, which aims at adding missing tags to images, is actually a special case of image annotation. Their main difference lies in the provided tags for the training and test sets. For image annotation, tags associated with the training samples are assumed to be complete, whereas tags for both the training and test samples are incomplete in the completion scenario. To some extent, the initial tags for test images provide essential clues for the true semantic contents; however, these incomplete tags may also introduce risks. To draw a parallel between automatic image annotation and image tag completion, recent efforts on the two topics are briefly reviewed in this section.
Annotation. Previous image annotation methods can be divided into three categories: generative methods, discriminative methods, and retrieval-based methods. As the most interpretable models, generative methods tried to model the joint probability of tags and image features, by which their association can be explained. Mixture models such as CMRM [18], CRM [19], MBRM [20], and SML [21] either treated test images as mixtures of training samples or estimated the conditional feature distribution using Gaussian Mixture Models (GMM). Topic models, on the other hand, such as mmLDA, cLDA [22], and trmmLDA [23], extended LDA [24, 25] by incorporating topics of text and specifying their relatedness with visual topics in various patterns. Discriminative methods instead tackled annotation as a multi-label classification problem and built classifiers for each tag, such as SVM [26, 27, 28]. Due to the lack of correspondence between tags and image regions, Multi-instance Learning (MIL) [26, 29, 30] was naturally introduced by defining the entire image as a bag. Recently, some retrieval-based methods approached annotation as a label-transfer problem and achieved state-of-the-art performance, including JEC [1], TagProp [2], and 2PKNN [31]. In addition, several recent studies have been conducted to address the issue of missing labels in AIA problems, such as [32] and [33].

Tag completion. As mentioned above, a stream of studies has shown that rich information can be exploited from initial tags in a completion task; therefore, the pursuit of maintaining tag correlation as well as content consistency has always been a key component in nearly every algorithm, albeit in different formulations [7, 34, 10, 17, 35, 36, 37]. The users' tagging behavior was first analyzed in [38], followed by a tag recommendation method which comprised the generation and aggregation of candidate tags. In [35], tag recommendation was approached as a maximum a posteriori (MAP) problem, by making use of a folksonomy. The method of [39] cast images and their tags into compositions of semantic unities, and measured their relevance via the similarity of their semantic unities, using a semantic unity graph. G. Zhu et al. [7] decomposed the user-provided tag matrix into a low-rank completed matrix and a sparse error matrix, and utilized two distance matrices to maintain content consistency and tag correlation. Another matrix factorization method, [17], formulated tag completion as nonnegative data factorization, and decomposed the global image representations into label representations. Alternatively, the TMC method [34] directly searched for the optimal tag matrix which preserves correlation structures for both images and tags. Finally, the recently proposed LSR method [10] conducted linear sparse reconstruction for each image and each tag, respectively,
and achieved state-of-the-art performance. 3. Proposed Method In this section, some notations are introduced first, followed by a detailed description of the proposed method. 3.1. Outline Denote the initial user-provided tag matrix as D ∈ RN ×M , with M and N specifying the number of tags and images, respectively. Entries of D have binary values, that is, ⎧ ⎨ 1, Dij = ⎩ 0,
125
in case image i is associated with label j; otherwise.
Our objective is to recover the latent complete tag matrix A ∈ R^(N×M). Following the idea of [7], D is decomposed into the true tag matrix A and a sparse error matrix E ∈ R^(N×M). As explained in Section 1, with information sharing in mind, a low-dimensional tag subspace and a low-dimensional feature subspace are constructed in our method, by further factorizing A into a basis matrix and a sparse coefficient matrix, i.e., A = UV. Thus we have

    D = UV + E,    (1)

where U ∈ R^(N×K) and V ∈ R^(K×M) are the basis matrix and the sparse coefficient matrix, respectively. Since K ≪ M, the formulation in Eq.(1) implies that related tags will be encouraged to be linearly reconstructed by common basis vectors in U, thus
tag-level information gets shared. Furthermore, by switching the roles of U and V, similar samples are encouraged to choose common basis vectors within V as well; in this way, instance-level information also gets shared.

In addition, our method achieves low rank using a sparse coding scheme, preserves local reconstruction structures in the compressed low-dimensional feature subspace and tag subspace, and promotes basis diversity. Therefore, the proposed objective consists of four items, as defined in Eq.(2):

    min_{U,V,E} L(U,V,E) + γ R_u(U) + λ R_v(V) + ω R_b(U),
    s.t. ||U_•k||_2 = 1, ∀k ∈ {1, 2, ..., K}.    (2)
Specifically, L(U,V,E) denotes the empirical error with sparsity constraints for both V and E; R_u(U) and R_v(V) measure the local reconstruction structure consistency in the low-dimensional image subspace and tag subspace, respectively; and R_b(U) represents the similarities between all the columns of U. Concrete definitions of these items are presented in the following subsections.

3.2. Low Rank and Error Sparsity

This subsection presents the first item in Eq.(2). Here L(U,V,E) measures the reconstruction error based on Eq.(1). Combining our factorization scheme and the sparsity constraints with trade-off parameters, we arrive at a loss function defined as follows:

    L(U,V,E) = ||D − E − UV||_F^2 + 2η ||V||_1 + β ||E||_1.    (3)
Note that Eq.(3) can be interpreted as sparse coding, with U ∈ R^(N×K) being the basis matrix and V ∈ R^(K×M) the sparse coefficient matrix. Here V can be regarded as the tag representations in the new low-dimensional tag subspace, and U as the image representations in the new low-dimensional image subspace. In the next two subsections, local geometry structures are explored in the original spaces and preserved in the subspaces.

3.3. Local Reconstruction Structure in Feature Space

Denote X ∈ R^(N×L) as the feature matrix in the original feature space, where each row of X is the feature vector of an individual sample. In the new low-dimensional subspace, each row of U is a compressed representation of an image. Similar to the idea of LLE, the local geometry structure is believed to be important and should be preserved while compressing the representation. Therefore, the original data X is first explored for the structure information, which is encoded in matrix S:
    S* = arg min_S ||X − SX||_F^2 + α ||S||_1,
    s.t. S_nn = 0, ∀n ∈ {1, 2, ..., N},    (4)

where S ∈ R^(N×N) is the local linear reconstruction coefficient matrix in the original feature space. The j-th row of S contains the corresponding weights to reconstruct the features of the j-th image using those of the other samples. Eq.(4) can be efficiently solved using the feature-sign method [40].

Next, assume that the tags of the j-th image can be equally reconstructed by the tags of the other images; thus we have A ∼ SA. The local linear reconstruction structure specified by S should be robust to the sparse coding procedure in Eq.(3), which means this reconstruction structure should apply to U as well, i.e., U ∼ SU. Therefore, the second item in Eq.(2) can be instantiated as:

    R_u(U) = ||U − SU||_F^2.    (5)
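Eq.(4) is a set of N independent L1-regularized regressions. The paper solves it with the feature-sign method [40]; as a rough, self-contained stand-in, the same problem can be approximated with plain ISTA (soft-thresholded gradient steps). The function name, step-size rule, and iteration budget below are our own illustrative choices, not the paper's:

```python
import numpy as np

def reconstruction_weights(X, alpha=1.0, n_iter=300):
    """Approximate Eq.(4): min_S ||X - S X||_F^2 + alpha ||S||_1, s.t. S_nn = 0.
    Plain ISTA is used here as a simple stand-in for the feature-sign method [40]."""
    N = X.shape[0]
    S = np.zeros((N, N))
    # Step size 1/L, where L bounds the Lipschitz constant of the smooth part
    L = 2.0 * np.linalg.norm(X @ X.T, 2) + 1e-12
    for _ in range(n_iter):
        grad = 2.0 * (S @ X - X) @ X.T                           # gradient of ||X - S X||_F^2
        S = S - grad / L
        S = np.sign(S) * np.maximum(np.abs(S) - alpha / L, 0.0)  # soft threshold (L1 prox)
        np.fill_diagonal(S, 0.0)                                 # enforce the constraint S_nn = 0
    return S

X = np.random.default_rng(1).random((8, 5))
S = reconstruction_weights(X, alpha=0.05)
assert np.allclose(np.diag(S), 0.0)
```

Because the L1 prox and the zero-diagonal constraint are both entrywise-separable, the combined proximal step is exact, and each iteration does not increase the objective of Eq.(4).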
Note that Eq.(5) resembles the LSR method at first glance, in the spirit of linearly reconstructing each sample from other samples. However, the underlying methodology is different, since our method imposes a stronger constraint by preserving local geometry in the low-dimensional subspace, and the variable to be optimized is now the basis matrix U, not the reconstruction coefficients as in LSR.
3.4. Local Reconstruction Structure in Tag Space

In this section, the regularizer for V is designed in a manner similar to that for U. As explained before, each column of V can be deemed a compressed feature vector for a tag, and the local reconstruction structure in the original tag space should be preserved in this new subspace as well. Hence, the structural information, encoded in T, is explored first by leveraging the original data D:

    T* = arg min_T ||D − DT||_F^2 + μ ||T||_1,
    s.t. T_mm = 0, ∀m ∈ {1, 2, ..., M},    (6)

where T ∈ R^(M×M) is the local linear reconstruction coefficient matrix in the tag space, with the i-th column of T containing the corresponding weights to reconstruct the distribution of the i-th tag using those of the other tags. Eq.(6) can be solved analogously to Eq.(4).

Then the reconstruction relationship specified by T should also apply to V. Therefore, the third item in Eq.(2) is:

    R_v(V) = ||V − VT||_F^2.    (7)
Eq.(5) and Eq.(7) offer a convenient way to integrate both visual cues and semantic information into our factorization framework, which is essential for successful tag completion.

3.5. Minimize Basis Similarity

The last component of Eq.(2) aims to improve the sparse coding process itself. Inspired by the success of [16] and [17], our method also incorporates a regularizer to keep the basis matrix from containing identical or similar columns. In other words, the proposed formulation encourages column diversity in the basis matrix. Intuitively, if two columns of the basis matrix U are orthogonal, i.e., their dot product is zero, these two columns can be considered entirely different. Thus, we have

    R_b(U) = Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} u_i^T u_j = tr{U^T U B},    (8)

where B ∈ R^(K×K) is a matrix that satisfies B_ij = 0 in case i = j, and B_ij = 1 otherwise.
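The equality between the double-sum and trace forms of Eq.(8) is easy to verify numerically (the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 5
U = rng.random((N, K))

B = np.ones((K, K)) - np.eye(K)   # B_ij = 0 if i == j, 1 otherwise

# Double-sum form of Eq.(8): pair-wise dot products of distinct columns
rb_sum = sum(U[:, i] @ U[:, j] for i in range(K) for j in range(K) if j != i)
# Trace form of Eq.(8)
rb_tr = np.trace(U.T @ U @ B)

assert np.isclose(rb_sum, rb_tr)
```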
The regularizer specified in Eq.(8) is very easy to implement, as shown in Section 4.

Finally, our overall objective function is presented below:

    min_{U,V,E} ||D − E − UV||_F^2 + 2η ||V||_1 + β ||E||_1
                + γ ||U − SU||_F^2 + λ ||V − VT||_F^2 + ω tr{U^T U B},
    s.t. ||U_•k||_2 = 1, ∀k ∈ {1, 2, ..., K}.    (9)
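For later reference, Eq.(9) transcribes directly into numpy. A helper of this kind (the name is ours, not the paper's) is useful for checking that each update step in Section 4 actually decreases the objective:

```python
import numpy as np

def objective(D, U, V, E, S, T, gamma, lam, eta, beta, omega):
    """Direct transcription of Eq.(9); variable names follow the text."""
    K = U.shape[1]
    B = np.ones((K, K)) - np.eye(K)          # B from Eq.(8)
    fro2 = lambda A: float(np.sum(A * A))    # squared Frobenius norm
    return (fro2(D - E - U @ V)
            + 2.0 * eta * np.abs(V).sum()
            + beta * np.abs(E).sum()
            + gamma * fro2(U - S @ U)
            + lam * fro2(V - V @ T)
            + omega * np.trace(U.T @ U @ B))
```

With V = 0 and E = 0, the value reduces to ||D||_F^2 plus the two U-dependent terms, which makes a convenient starting point for monitoring the alternating solver.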
Eq.(9) incorporates all the essential components we have come up with: 1) low-rank factorization with error sparsity, which aims to share information among related tags and similar samples; 2) local reconstruction structure consistency in the obtained low-dimensional tag and feature subspaces; and 3) diversity among basis vectors, which helps to improve the sparse coding process. Despite being a seemingly complex objective function with multiple items and variables, the optimization of Eq.(9) is fairly simple and efficient, with a computational complexity growing linearly with the number of samples, as shown in Section 4.

4. Optimization
In this section, we focus on solving the minimization problem of the proposed objective function in Eq.(9). Although it is not jointly convex in all three variables, it is separately convex in U, V, and E with the remaining variables fixed. Thus, we devise an efficient solver for Eq.(9), by decoupling it into three subproblems and optimizing them alternately. The proposed algorithm has a computational complexity growing linearly with the number of samples, which indicates that our method is capable of handling much larger datasets. The optimization procedure is summarized in Algorithm 1.
Algorithm 1 Low-rank Tag Completion with Dual Reconstruction.
Require: Initial tag matrix D, reconstruction matrices S and T, matrix B, and parameters γ, λ, η, β, ω, N, M, K, and maxIter.
 1: U ← rand(N, K), V ← zeros(K, M), E ← zeros(N, M), H ← λ(T − I)(T − I)^T, G ← γ(S − I)^T(S − I)
 2: iter ← 0
 3: while iter < maxIter do
 4:   iter ← iter + 1
 5:   D̃ ← D − E
 6:   Update V:
 7:   Compute C = D̃^T U, Z = U^T U
 8:   for k = 1 to K do
 9:     for m = 1 to M do
10:       P_km ← C_mk − Σ_{l=1, l≠k}^{K} V_lm Z_lk − Σ_{r=1, r≠m}^{M} H_mr V_kr
11:       V_km ← (max{P_km, η} + min{P_km, −η}) / (Z_kk + H_mm)
12:     end for
13:   end for
14:   Update U:
15:   Compute F = V D̃^T, W = V V^T + ωB
16:   for k = 1 to K do
17:     for n = 1 to N do
18:       Q_nk ← F_kn − Σ_{l=1, l≠k}^{K} W_kl U_nl − Σ_{r=1, r≠n}^{N} U_rk G_rn
19:       U_nk ← Q_nk / (W_kk + G_nn)
20:       U_•k ← U_•k / max{1, ||U_•k||_2}
21:     end for
22:   end for
23:   Update E
24: end while
25: return U and V
4.1. Optimizing Coefficient V

Here the method in [41, 42] is used. Define D̃ = D − E and H = λ(T − I)(T − I)^T. When U and E are kept fixed, Eq.(9) reduces to:

    f(V) = ||D̃ − UV||_F^2 + λ ||V − VT||_F^2 + 2η ||V||_1
         = tr{D̃^T D̃} − 2 tr{V D̃^T U} + tr{V V^T U^T U} + tr{V H V^T} + 2η ||V||_1.    (10)

Ignoring the constant term tr{D̃^T D̃}, the objective function for V_km reduces to

    f(V_km) = 2 V_km ( Σ_{l=1, l≠k}^{K} V_lm Z_lk + Σ_{r=1, r≠m}^{M} H_mr V_kr − C_mk )
              + V_km^2 (Z_kk + H_mm) + 2η |V_km|,    (11)

where C = D̃^T U and Z = U^T U. Note that Eq.(11) is a piece-wise parabolic function opening upwards, which is convex, and its optimal point is easy to obtain:

    V_km = (max{P_km, η} + min{P_km, −η}) / (Z_kk + H_mm),

where

    P_km = C_mk − Σ_{l=1, l≠k}^{K} V_lm Z_lk − Σ_{r=1, r≠m}^{M} H_mr V_kr.
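Dropping indices, Eq.(11) is the one-dimensional function f(v) = (Z_kk + H_mm) v² − 2 P_km v + 2η|v|, and the update above is exactly soft-thresholding of P_km at η. A quick grid search confirms that the closed form attains the minimum:

```python
import numpy as np

def f(v, P, denom, eta):
    """One-coordinate objective from Eq.(11), up to an additive constant."""
    return denom * v**2 - 2.0 * P * v + 2.0 * eta * np.abs(v)

def v_star(P, denom, eta):
    """Closed-form minimizer from Section 4.1 (soft thresholding)."""
    return (max(P, eta) + min(P, -eta)) / denom

grid = np.linspace(-5.0, 5.0, 100001)
for P in (-3.0, -0.2, 0.0, 0.4, 2.5):
    denom, eta = 1.7, 0.5
    v = v_star(P, denom, eta)
    # The closed form should be at least as good as the best grid point
    assert f(v, P, denom, eta) <= f(grid, P, denom, eta).min() + 1e-6
```

In particular, |P_km| ≤ η gives V_km = 0, which is how the L1 term keeps V sparse.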
4.2. Optimizing Basis U

The optimal U can be obtained by alternating between a procedure similar to that for V and a Euclidean projection. Define G = γ(S − I)^T(S − I). When V and E are fixed, U can be solved analogously to V; the only modification is to remove the L1 regularizer:

    f(U_nk) = 2 U_nk ( Σ_{l=1, l≠k}^{K} W_kl U_nl + Σ_{r=1, r≠n}^{N} U_rk G_rn − F_kn )
              + U_nk^2 (W_kk + G_nn),    (12)

where F = V D̃^T and W = V V^T + ωB. The optimal U_nk is:

    U_nk = Q_nk / (W_kk + G_nn),

where

    Q_nk = F_kn − Σ_{l=1, l≠k}^{K} W_kl U_nl − Σ_{r=1, r≠n}^{N} U_rk G_rn.

Then, a Euclidean projection is performed to ensure that the L2 norm of each column of U is no greater than 1. Note that this is a coordinate descent approach, and the projection is conducted after each coordinate update whenever the L2 norm of the updated column of U is greater than 1. Thus, both convergence and a decrease in the objective function are guaranteed. The constraint ||U_•k||_2 = 1 is relaxed to ||U_•k||_2 ≤ 1, since the relaxation results in a convex optimization problem while keeping the global optimum unchanged, i.e., the optimal U will always satisfy ||U_•k||_2 = 1 even if the explicit constraint is ||U_•k||_2 ≤ 1.
4.3. Optimizing Sparse Error E

Finally, when U and V are fixed, obtaining E reduces to solving the following sparse coding problem:

    E* = arg min_E ||D − UV − E||_F^2 + β ||E||_1,    (13)

which can be solved similarly to S and T.

4.4. Implementation Issues

A kNN strategy is adopted when calculating the matrices S and T, where we set k = 200 (the same as in [10]; note that the lowercase letter k denotes the number of neighboring samples, while the uppercase letter K represents the number of basis vectors), in order to speed up the computation. For the number of basis vectors, we set K = 100 for the Corel5K dataset, and K = 500 for the much larger Flickr30Concepts dataset. Meanwhile, similar to [7], D is re-initialized as D = (SD + DT)/2 before being fed into the completion process. Also, tags are treated as features when obtaining S for the Flickr30Concepts dataset, but not for Corel5K, since the remaining tags for images in Corel5K are very sparse (fewer than 4), and using tags as features would then cause performance deterioration.

As indicated by Eq.(11) and Eq.(12), the total computational complexity per iteration is O(K^2 × N + K × (nnz(V) + nnz(H) + nnz(G))), where nnz(·) denotes the number of non-zero entries of a sparse matrix. This indicates that the running time of our proposed method is directly determined by N (the number of samples) and K (the number of basis vectors). It also means that our method is capable of handling large datasets, since its running time scales linearly with N.
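As a side note on the E-subproblem: Eq.(13) decouples over the entries of E, so (equivalently to running any L1 solver on this separable problem) the minimizer has the closed form of elementwise soft-thresholding of the residual R = D − UV at threshold β/2, obtained by setting the subgradient to zero. A sketch:

```python
import numpy as np

def update_e(D, U, V, beta):
    """Minimize Eq.(13) entrywise: E* = sign(R) * max(|R| - beta/2, 0), R = D - U V."""
    R = D - U @ V
    return np.sign(R) * np.maximum(np.abs(R) - beta / 2.0, 0.0)

# Residual entries below beta/2 in magnitude are zeroed; larger ones are shrunk
E = update_e(np.array([[1.0, -0.1]]), np.zeros((1, 1)), np.zeros((1, 2)), 0.4)
assert np.isclose(E[0, 0], 0.8) and E[0, 1] == 0.0
```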
Table 1: Statistics of Corel5K and Flickr30Concepts. Counts of tags are given in the format "mean/maximum".

                       Corel5K    Flickr30Concepts
Vocabulary Size        260        2,513
Nr. of Images          4,918      27,838
Tags per Image         3.4/5      8.3/70
Del. Tags per Image    1.4 (40%)  3.3 (40%)
Test Set               492        2,807
5. Experiments

To validate our approach, extensive empirical studies are carried out. In this section, the experimental setup is first outlined, followed by an analysis of some parameters. In Section 5.3, the influence of the number K of basis vectors is investigated. Finally, the performance of the proposed formulation is evaluated and compared with several prior methods. Some representative failure cases are examined in Section 5.4 as well, which helps us gain a more in-depth understanding of the impact of unbalanced samples, and sheds light on future research directions.

5.1. Experimental Setup

To facilitate comparison between our method and previous ones, the same datasets and features as in [10] are adopted. Two datasets are used: the well-established benchmark dataset Corel5K and the real-world Flickr30Concepts. Statistics of both datasets are given in Table 1.

For Corel5K, we randomly delete 40% of the tags and ensure that each image has at least one tag removed and one tag remaining.2 We perform the random deletion 8 times and report the averaged performance. Furthermore, a validation set containing 491 images is extracted randomly for parameter tuning.

2 The 1000-dimensional SIFT BoW feature for the Corel5K dataset is downloaded from http://lear.inrialpes.fr/people/guillaumin/data.php.
For Flickr30Concepts [10], the data provided by the authors are used directly, including the ground truth, the initial tag matrix, and two types of features: the 1000-dimensional SIFT BoW feature and the composite features consisting of a set of 10 kinds of basic features.3

In addition, the evaluation method in [10] is adopted, along with the same measurements: average precision@N (AP@N), average recall@N (AR@N), and coverage@N (C@N), defined as:

    AP@N = (1/m) Σ_{i=1}^{m} N_c(i) / N,
    AR@N = (1/m) Σ_{i=1}^{m} N_c(i) / N_mg(i),
    C@N  = (1/m) Σ_{i=1}^{m} I{N_c(i) > 0},

where m is the number of test images, N_c(i) is the number of correctly recovered tags for the i-th image, N_mg(i) is the number of missing ground-truth tags for the i-th image, and I{·} is an indicator function that returns 1 when the condition is true and 0 otherwise. Note that here N denotes the number of tags added to an image (not the total number of samples as defined in Section 3). Also, evaluations are performed only on the test set, and neighbors are extracted only from the training set.

5.2. Parameter Settings
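The three measurements defined above transcribe directly into a few lines (the function and argument names are our own):

```python
def completion_metrics(n_correct, n_missing, N):
    """AP@N, AR@N and C@N over a test set of m images.
    n_correct[i]: correctly recovered tags for image i (N tags added per image);
    n_missing[i]: number of missing ground-truth tags for image i."""
    m = len(n_correct)
    ap = sum(c / N for c in n_correct) / m          # average precision@N
    ar = sum(c / g for c, g in zip(n_correct, n_missing)) / m  # average recall@N
    cov = sum(1 for c in n_correct if c > 0) / m    # coverage@N
    return ap, ar, cov
```

For example, two test images where the first recovers both of its two missing tags and the second recovers none of its single missing tag give AP@2 = AR@2 = C@2 = 0.5.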
Altogether 7 parameters are involved in the proposed method, hence it is necessary to tune each parameter in order to achieve better performance and to analyze its respective influence on the completion process. The control variable method is adopted, which means only one parameter is modified at a time while the others remain unchanged. Since α and μ have little influence

3 The features include: Color Correlogram, Color Layout, CEDD, Edge Histogram, FCTH, JCD, Jpeg Coefficient Histogram, RGB Color Histogram, Scalable Color, and SURF with Bag-of-Words model.
Figure 2: Influences of λ, γ, ω and η on validation set of Corel5K.
when the number of neighbors is large enough (here 200), we simply set them to a large value (here 1) to make the feature-sign method faster. Also, as mentioned in [43], if β is smaller than 0.2, a severe negative effect can be observed; however, if β is larger than 1, the returned matrix E will have all-zero entries. On the other hand, values within [0.2, 1] do not affect the performance much. Therefore, only 4 parameters are discussed here, namely λ, γ, ω, and η. The obtained results are shown in Fig.2.

As illustrated in Fig.2, the value of γ should not be too large, since a larger γ means a higher degree of confidence in the assumption that SA ∼ A, which may be questionable due to the semantic gap. Similarly, performance also degrades as λ gets larger, since T is obtained from incomplete initial tags.
The performance of the proposed method gradually improves as ω becomes larger, until very large values are used; as indicated in Eq.(12), entries of U would be close to zero under such conditions. For η, as it approaches 0, the L1 regularizer is effectively disabled; as it gets larger, V might become too sparse and the reconstruction error would grow. The final values adopted in our experiments are λ = 5, γ = 5, β = 0.7, η = 0.5, ω = 0.3. All the following results are obtained using this group of parameters.

5.3. Influence of K

Since a matrix factorization framework is adopted in the proposed method, it is necessary to examine the influence of the number of basis vectors and determine an appropriate value for K. As illustrated in Fig.3, when K is given a small value, the basis vectors available to reconstruct the true tag matrix are insufficient, and thus the obtained performance is far from satisfactory due to a large reconstruction error. As K gets larger, the performance increases gradually. However, the performance does not improve when K is increased further, which means a larger K does not necessarily lead to a better result. This is consistent with our intuition of information sharing, since when K ≪ M does not hold, related tags or similar samples can choose different basis vectors, and thus information can no longer propagate between tags and samples.

Recall that the computational complexity of the proposed method is directly related to the value of K (cf. Section 4.4). Therefore, K should be set to a smaller value in order to make our method more efficient.

5.4. Tag Completion Results

To demonstrate the effectiveness of our method, we compare its performance with state-of-the-art annotation methods (JEC [1] and TagProp [2]) and several tag completion algorithms, namely LR [7], Vote+ [38], Folksonomy [35], DLC [17], TMC [34], and LSR [10]. Note that JEC and TagProp are designed for multi-channel features, while TMC and DLC are more suitable for the SIFT BoW feature (DLC requires non-negative features, and TMC prefers the dot product of
Figure 3: Influence of K on Corel5K. Note that a larger K does not necessarily lead to a better result, since information sharing cannot be achieved when K ≪ M no longer holds.

Table 2: Experimental results on Corel5K and Flickr30Concepts using only the SIFT BoW feature.
            Corel5K (N = 2)      Flickr30Concepts (N = 4)
            AP    AR    C        AP    AR    C
TMC         0.23  0.33  0.40     0.19  0.21  0.37
DLC         0.09  0.13  0.18     0.07  0.09  0.23
LSR         0.28  0.42  0.50     0.30  0.36  0.60
LRDLR       0.32  0.49  0.57     0.32  0.39  0.64
LRDLR+      0.33  0.51  0.60     0.34  0.41  0.67
feature vectors to be non-negative), whereas the LSR method, as well as the proposed one, can handle both multiple features and the SIFT BoW feature. For these baseline methods, the evaluation results reported in [10] are directly cited. Our previous method [43] is referred to as "LRDLR" (short for low-rank dual linear reconstruction), while the extended version which promotes basis diversity is referred to as "LRDLR+". Experimental results using only the SIFT BoW feature on both datasets are shown in Table 2, and results using composite features on Flickr30Concepts are presented in Table 3.

For the Corel5K dataset, our method outperforms previous methods by a large
Table 3: Experimental results on Flickr30Concepts using 10 types of features.
Flickr30Concepts (N = 4) AP
AR
C
JEC
0.25
0.30
0.49
TagProp
0.23
0.29
0.50
Vote+
0.23
0.27
0.48
Folksonomy
0.21
0.26
0.47
LR
0.27
0.34
0.51
LSR
0.37
0.45
0.67
LRDLR
0.39
0.48
0.72
LRDLR+
0.38
0.48
0.74
margin. Meanwhile, the extended version (“LRDLR+”) shows even better per350
formance, especially for the metric C@N , which demonstrates the significance of promoting basis diversity. Note that the pre-processing steps of obtaining S and T in our method basically correspond to the LSR method, which is far more delicate in the design of group-sparsity regularizer and soft fusion of coefficients. However, the LSR method is highly dependent on initial labels, which means, if
some critical tags are removed, the sparse reconstruction guided by the available tags may fail completely. Our method, on the other hand, exhibits enhanced robustness thanks to the incorporation of low-rank factorization. For the Flickr30Concepts dataset, our method still achieves the best performance, but the improvement over LSR is not as significant
as for the Corel5K dataset, since images in Flickr30Concepts have richer initial labels than those in Corel5K. Under such circumstances, even if one or two critical tags are removed, the content of an image can still be depicted by the remaining tags, so robustness towards noisy initial tags is less essential for Flickr30Concepts than for Corel5K.
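The AP, AR, and C values reported above follow an @N protocol. Assuming the common definitions from the tag-completion literature (average precision and recall of the top-N completed tags, and coverage as the fraction of images with at least one correct completion), the evaluation can be sketched as follows; this is an illustrative sketch, not the paper's evaluation code.

```python
def evaluate_at_n(predicted, ground_truth, n):
    """predicted: ranked tag lists per image; ground_truth: tag sets per image.

    Returns (AP@N, AR@N, C@N) averaged over all images, under the assumed
    definitions: precision/recall of the top-n tags, and coverage = share of
    images with at least one correctly recovered tag.
    """
    precisions, recalls, covered = [], [], []
    for ranked, truth in zip(predicted, ground_truth):
        top_n = ranked[:n]
        hits = sum(1 for t in top_n if t in truth)
        precisions.append(hits / n)
        recalls.append(hits / len(truth) if truth else 0.0)
        covered.append(1.0 if hits > 0 else 0.0)
    m = len(predicted)
    return sum(precisions) / m, sum(recalls) / m, sum(covered) / m

# Tiny usage example with hypothetical tags.
preds = [["sky", "plane", "tree"], ["water", "boat", "sun"]]
truths = [{"plane", "jet"}, {"grass"}]
ap, ar, c = evaluate_at_n(preds, truths, n=2)
print(round(ap, 2), round(ar, 2), round(c, 2))  # 0.25 0.25 0.5
```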
Furthermore, compared with the results using only the SIFT BoW feature, performance with 10 types of features is substantially improved, both for the LSR method and the proposed one, which once more demonstrates the advantage of multiple features. Finally, some representative failure cases are illustrated in Fig. 4, where the
images are shown along with their corresponding initial tags, missing ground truth tags, and tags recovered by the proposed method.
Figure 4: Some failure samples from Corel5K (from top to bottom: initial tags, missing ground truth tags, and recovered tags, respectively). Note that the recovered tags still reveal some semantic information.
As shown in Fig. 4, most failure samples contain high-frequency tags, such as sky, tree, and water, as initial labels.

Figure 5: Comparison between the coefficients of the frequent tag sky and the normal tag plane (top 15 related tags each). Note that the coefficients for sky decrease more slowly and exhibit a smaller variance.

Such frequent tags are related to many images; hence their reconstruction coefficients obtained by Eq. (6) are likely to be inaccurate. To make this more intuitive, the coefficients of sky and plane are illustrated in Fig. 5, where sky is an example of a high-frequency label (in fact, the most frequent tag in Corel5K) while plane is a tag related to sky with moderate frequency. Only the 15 most related tags are displayed. After being sorted in descending order, both sets of coefficients follow an
exponential-like trend; however, the coefficients for sky decrease more slowly and have a smaller variance, which means that identifying the most related labels for high-frequency tags such as sky is relatively difficult. Therefore, when sky occurs as an initial tag, multiple related candidate tags share similar chances of being chosen as completed tags, and which one is chosen depends on the tags
from the related samples of the image. However, if the candidate tags themselves are correlated, their frequencies of occurrence in those related samples may still be very close, resulting in an essentially random selection from multiple candidates. Since sample imbalance is a serious challenge for AIA and tag completion problems, this may indicate some directions for future improvements.
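This flat-versus-peaked contrast behind the Fig. 5 analysis can be mimicked with synthetic coefficient profiles; the numbers below are illustrative assumptions, not the actual coefficients from Eq. (6).

```python
import numpy as np

# Synthetic sketch of the Fig. 5 diagnostic: a frequent tag co-occurs with
# many tags almost equally, so its sorted reconstruction coefficients are
# flat; a moderate-frequency tag has a few strongly related tags, so its
# coefficients are peaked.
def sorted_top_coefficients(coeffs, k=15):
    """Take the k largest coefficients by magnitude, sorted descending,
    normalized to sum to one for a fair variance comparison."""
    c = np.sort(np.abs(coeffs))[::-1][:k]
    return c / c.sum()

# Nearly uniform profile (a sky-like frequent tag, illustrative values).
frequent = sorted_top_coefficients(np.full(40, 0.2) + 0.01 * np.arange(40))
# Peaked profile (a plane-like specific tag, illustrative values).
specific = sorted_top_coefficients(np.array([2.0, 1.0] + [0.05] * 38))

# The flat profile makes the truly related tags hard to single out.
print(frequent.var() < specific.var())  # True
```

A smaller variance over the sorted coefficients is exactly the symptom the text attributes to high-frequency tags: no small subset of related tags stands out, so the completion degenerates toward a near-random choice among candidates.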
Meanwhile, according to Fig. 4, albeit not identical to the missing ground truth tags, the recovered tags still provide some clues about the true content of the images, such as buildings for A1, plane for B1, house for E1, people for C3, giraffe for E3, and herd for A4.

6. Conclusions
A novel tag completion algorithm is proposed in this paper, characterized by low-rankness, error sparsity, basis diversity, and the ability to preserve local linear reconstruction structures in the compressed low-dimensional feature space and tag space. Extensive experiments conducted on the well-known Corel5K dataset and the real-world Flickr30Concepts dataset demonstrate the
effectiveness and efficiency of the proposed algorithm, which outperforms prior methods by a large margin. In addition, an analysis of some representative failure cases is presented, which sheds some light on the impact of unbalanced samples and indicates directions for future improvements.

Acknowledgment
This work was supported by the National Natural Science Foundation (NNSF: 61171118) and the Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP-20110002110057).

References

[1] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: European Conference on Computer Vision, Springer, 2008, pp. 316–329.
[2] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, in: International Conference on Computer Vision, 2009, pp. 309–316.
[3] Y. Zheng, Y.-J. Zhang, H. Larochelle, Topic modeling of multimodal data: an autoregressive approach, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2014.
[4] L. Zheng, S. Wang, Z. Liu, Q. Tian, Lp-norm IDF for large scale image search, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 1626–1633.
[5] L. Zheng, S. Wang, W. Zhou, Q. Tian, Bayes merging of multiple vocabularies for scalable image retrieval, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2014.
[6] L. Zheng, S. Wang, Z. Liu, Q. Tian, Packing and padding: Coupled multi-index for accurate image retrieval, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2014.
[7] G. Zhu, S. Yan, Y. Ma, Image tag refinement towards low-rank, content-tag prior and error sparsity, in: ACM International Conference on Multimedia, 2010, pp. 461–470.
[8] Y. Pang, Y. Yuan, X. Li, Iterative subspace analysis based on feature line distance, IEEE Transactions on Image Processing 18 (4) (2009) 903–907.
[9] X. Li, Y. Pang, Deterministic column-based matrix decomposition, IEEE Transactions on Knowledge and Data Engineering 22 (1) (2010) 145–149.
[10] Z. Lin, G. Ding, M. Hu, J. Wang, X. Ye, Image tag completion via image-specific and tag-specific linear sparse reconstructions, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 1618–1625.
[11] B. Shen, L. Si, Non-negative matrix factorization clustering on multiple manifolds, in: AAAI, 2010.
[12] B. Shen, B.-D. Liu, Q. Wang, Y. Fang, J. Allebach, SP-SVM: Large margin classifier for data on multiple manifolds, in: AAAI, 2015.
[13] B.-D. Liu, Y.-X. Wang, Y.-J. Zhang, Y. Zheng, Discriminant sparse coding for image classification, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 2193–2196.
[14] B.-D. Liu, Y.-X. Wang, B. Shen, Y.-J. Zhang, M. Hebert, Self-explanatory sparse representation for image classification, in: European Conference on Computer Vision, Springer, 2014, pp. 600–616.
[15] B. Shen, W. Hu, Y. Zhang, Y.-J. Zhang, Image inpainting via sparse representation, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 697–700.
[16] L. Bo, X. Ren, D. Fox, Multipath sparse coding using hierarchical matching pursuit, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 660–667.
[17] X. Liu, S. Yan, T.-S. Chua, H. Jin, Image label completion by pursuing contextual decomposability, ACM Transactions on Multimedia Computing, Communications, and Applications 8 (2) (2012) 21.
[18] J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and retrieval using cross-media relevance models, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 119–126.
[19] V. Lavrenko, R. Manmatha, J. Jeon, A model for learning the semantics of pictures, in: Advances in Neural Information Processing Systems, Vol. 1, 2003, p. 2.
[20] S. Feng, R. Manmatha, V. Lavrenko, Multiple Bernoulli relevance models for image and video annotation, in: IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, 2004, pp. 1002–1009.
[21] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 394–410.
[22] D. M. Blei, M. I. Jordan, Modeling annotated data, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 127–134.
[23] D. Putthividhy, H. T. Attias, S. S. Nagarajan, Topic regression multi-modal latent Dirichlet allocation for image annotation, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 3408–3415.
[24] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[25] Y. Pang, S. Wang, Y. Yuan, Learning regularized LDA by clustering, IEEE Transactions on Neural Networks and Learning Systems 25 (12) (2014) 2191–2201.
[26] C. Yang, M. Dong, J. Hua, Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning, in: IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp. 2057–2063.
[27] Y. Chen, J. Z. Wang, Image categorization by learning and reasoning with regions, Journal of Machine Learning Research 5 (2004) 913–939.
[28] B. Li, K. Goh, Confidence-based dynamic ensemble for image annotation and semantics discovery, in: ACM International Conference on Multimedia, 2003, pp. 195–206.
[29] R. Jin, S. Wang, Z.-H. Zhou, Learning a distance metric from multi-instance multi-label data, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2009, pp. 896–902.
[30] Z.-H. Zhou, M.-L. Zhang, Multi-instance multi-label learning with application to scene classification, Advances in Neural Information Processing Systems 19 (2007) 1609.
[31] Y. Verma, C. Jawahar, Image annotation using metric learning in semantic neighbourhoods, in: European Conference on Computer Vision, Springer, 2012, pp. 836–849.
[32] Q. Wang, B. Shen, S. Wang, L. Li, L. Si, Binary codes embedding for fast image tagging with incomplete labels, in: European Conference on Computer Vision, Springer, 2014, pp. 425–439.
[33] M. Chen, A. Zheng, K. Weinberger, Fast image tagging, in: International Conference on Machine Learning, ACM, 2013, pp. 1274–1282.
[34] L. Wu, R. Jin, A. Jain, Tag completion for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) 716–727.
[35] S. Lee, W. De Neve, K. N. Plataniotis, Y. M. Ro, Map-based image tag recommendation using a visual folksonomy, Pattern Recognition Letters 31 (9) (2010) 976–982.
[36] D. Liu, S. Yan, X.-S. Hua, H.-J. Zhang, Image retagging using collaborative tag propagation, IEEE Transactions on Multimedia 13 (4) (2011) 702–712.
[37] D. Liu, X.-S. Hua, M. Wang, H.-J. Zhang, Image retagging, in: ACM International Conference on Multimedia, 2010, pp. 491–500.
[38] B. Sigurbjörnsson, R. Van Zwol, Flickr tag recommendation based on collective knowledge, in: ACM International Conference on World Wide Web, 2008, pp. 327–336.
[39] Y. Liu, F. Wu, Y. Zhang, J. Shao, Y. Zhuang, Tag clustering and refinement on semantic unity graph, in: IEEE International Conference on Data Mining, 2011, pp. 417–426.
[40] H. Lee, A. Battle, R. Raina, A. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, 2006, pp. 801–808.
[41] B.-D. Liu, Y.-X. Wang, Y.-J. Zhang, B. Shen, Learning dictionary on manifolds for image classification, Pattern Recognition 46 (7) (2013) 1879–1890.
[42] B.-D. Liu, Y.-X. Wang, B. Shen, Y.-J. Zhang, Y.-J. Wang, Blockwise coordinate descent schemes for sparse representation, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 5267–5271.
[43] X. Li, Y.-J. Zhang, B. Shen, B.-D. Liu, Image tag completion by low-rank factorization with dual reconstruction structure preserved, in: IEEE International Conference on Image Processing, 2014.