Author's Accepted Manuscript

Low-rank Image Tag Completion with Dual Reconstruction Structure Preserved
Xue Li, Yu-Jin Zhang, Bin Shen, Bao-Di Liu

PII: S0925-2312(15)01266-7
DOI: http://dx.doi.org/10.1016/j.neucom.2014.12.121
Reference: NEUCOM16026
To appear in: Neurocomputing
Received date: 30 June 2014; Revised date: 14 December 2014; Accepted date: 15 December 2014

Cite this article as: Xue Li, Yu-Jin Zhang, Bin Shen and Bao-Di Liu, Low-rank Image Tag Completion with Dual Reconstruction Structure Preserved, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2014.12.121

This is a PDF file of an unedited manuscript that has been accepted for publication. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Low-rank Image Tag Completion with Dual Reconstruction Structure Preserved

Xue Li†, Yu-Jin Zhang†, Bin Shen‡1, Bao-Di Liu§

† Electronic Engineering, Tsinghua University, Beijing 100084, China
‡ Computer Science, Purdue University, West Lafayette, IN 47907, USA
§ Information and Control Engineering, China University of Petroleum, Qingdao 266580, China

[email protected], [email protected], [email protected]
Abstract

User-provided tags, although they play an essential role in image annotation, may also inhibit accurate annotation, since they are potentially incomplete. To address this problem, a novel tag completion method is proposed in this paper. In order to exploit as much information as possible, the proposed method is designed with the following features: 1) Low rank and error sparsity: the initial tag matrix D is decomposed into the complete tag matrix A and a sparse error matrix E, where A is further factorized into a basis matrix U and a sparse coefficient matrix V, i.e., D = UV + E. With K ≪ M, information sharing between related tags and similar samples can be achieved via subspace construction. 2) Local reconstruction structure consistency: the local linear reconstruction structures obtained in the original feature and tag spaces are preserved in both the low-dimensional feature subspace and the tag subspace. 3) Basis diversity: the pair-wise dot products between the columns of U are minimized, in order to obtain more representative basis vectors. Experiments conducted on the Corel5K dataset and the newly issued Flickr30Concepts dataset demonstrate the effectiveness and efficiency of the proposed method.

Keywords: Tag completion, Image annotation, Low-rank factorization, LLE
1 This author is currently with Google Research, New York.

Preprint submitted to Neurocomputing, September 8, 2015.
Figure 1: Sample image in Corel5K dataset, where several additional tags can be added.
1. Introduction

One of the most significant signs of the big data era lies in the explosive growth of visual data that need to be well organized, analyzed, and retrieved. Nevertheless, though many more images are available, it still seems difficult to obtain a sufficient number of accurate annotations, since manual labeling remains expensive. A natural alternative is automatic image annotation (AIA), which performs labeling by machines trained with well-labeled data, and thus relieves humans from the tedious labeling work. For this reason, high-performance AIA [1, 2, 3] methods are in demand for both content-based image retrieval (CBIR) [4, 5, 6] and tag-based image retrieval (TBIR).

In practice, however, high-quality training data are so scarce that we are usually faced with imperfectly or incompletely labeled images. Take the image in Figure 1 as an example, which is extracted from the Corel5K dataset and has 4 user-provided tags, i.e., sky, grass, house, and gate. However, several other words could serve as tags as well, such as clouds, roof, and tree, just to name a few. Due to the existence of synonyms and user preference, it is nearly impossible for users to manually provide a complete list of tags. Such incomplete tags pose a threat to the annotation process, since the resulting relevance between the visual content and the semantic tags is inadequate or even harmful for AIA systems when predicting accurate labels. Therefore, if we can perform accurate tag completion on the samples before feeding them into the AIA module, a significant improvement of the annotation performance can be expected, which in turn will benefit many visual applications.
Now that all the data we have are incomplete, how can we recover the missing labels? First of all, we exploit the insight that labels are not independent; they are, as a matter of fact, correlated with each other. For instance, if an image is labeled with gate and roof, then house would be a reasonable guess. Hence, inspired by the information sharing mechanism of Multi-task Learning (MTL), our method encourages related labels to share their information, and achieves this goal by constructing a low-dimensional tag subspace. This is also consistent with the conclusion of [7, 8, 9] that features and tags often reside on low-dimensional subspaces. Specifically, we assume that the initial label matrix (whose columns correspond to tags and rows represent samples) can be factorized into a low-dimensional basis matrix and a sparse coefficient matrix. Such a framework encourages related tags to be linearly reconstructed by common basis vectors, and thus information gets shared. Meanwhile, by switching the roles of the basis matrix and the coefficient matrix, we encourage similar samples to choose common basis vectors as well; hence, information at the instance level also gets shared.

To further improve the performance, in addition to the above low-rank factorization, which implicitly exploits useful information, we also explicitly promote content consistency and tag correlation by following the idea of LSR [10] and adopting the LLE assumption. The LSR method assumes that each sample can be linearly reconstructed by several other samples, and each tag can be linearly reconstructed by several other tags, while the local geometry structures in both the original feature space and the tag space are preserved. As an extension, our strategy adopts the LLE assumption not only in the original feature space but also in the low-dimensional subspaces [11, 12]. To achieve this goal, the local geometry structures are computed in the original spaces and preserved while learning the subspaces.

Finally, to obtain more representative basis vectors in our low-rank factorization scheme, which encapsulates sparse coding [13, 14, 15], we follow the methods of [16] and [17] and promote basis diversity by minimizing the pair-wise dot products of all the basis vectors. This simple constraint costs us little, but endows our
formulation with extra robustness and significant performance improvement.

Contributions. The main contribution of the proposed formulation lies in the combination and extension of low-rank factorization and local reconstruction structure consistency. For the former, by constructing the low-dimensional feature subspace and the low-dimensional tag subspace, we seek to recover the missing labels through information sharing among similar samples and related tags. For the latter, the local reconstruction structures obtained in the original spaces are preserved in both the compressed feature subspace and the tag subspace, which allows us to explicitly pursue content consistency and tag correlation, and to integrate visual cues and semantic information seamlessly into our factorization framework. The computational complexity of our method grows linearly with the number of samples; therefore, the proposed method is able to handle much larger datasets.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on image annotation and tag completion. The proposed tag completion method is formalized in Section 3, followed by detailed optimization methods in Section 4. Experimental results on both the Corel5K dataset and the Flickr30Concepts dataset are presented in Section 5, and Section 6 concludes this paper.

2. Related Work

Image tag completion, which aims at adding missing tags to images, is actually a special case of image annotation. Their main difference lies in the provided tags for the training and test sets. For image annotation, tags associated with the training samples are assumed to be complete, whereas tags for both the training and test samples are incomplete in the completion scenario. To some extent, the initial tags for test images provide essential clues for the true semantic contents; however, these incomplete tags may also introduce risks. To draw a parallel between automatic image annotation and image tag completion, recent efforts on the two topics are briefly reviewed in this section.
Annotation. Previous image annotation methods can be divided into three categories: generative methods, discriminative methods, and retrieval-based methods. As the most interpretable models, generative methods tried to model the joint probability of tags and image features, by which their association can be explained. Mixture models such as CMRM [18], CRM [19], MBRM [20], and SML [21] either treated test images as mixtures of training samples or estimated the conditional feature distribution using Gaussian Mixture Models (GMM). Topic models, on the other hand, such as mmLDA, cLDA [22], and trmmLDA [23], extended LDA [24, 25] by incorporating topics of text and specifying their relatedness with visual topics in various patterns. Discriminative methods instead tackled annotation as a multi-label classification problem and built classifiers for each tag, such as SVM [26, 27, 28]. Due to the lack of correspondence between tags and image regions, Multi-instance Learning (MIL) [26, 29, 30] was naturally introduced by defining the entire image as a bag. Recently, some retrieval-based methods approached annotation as a label-transfer problem and achieved state-of-the-art performance, including JEC [1], TagProp [2], and 2PKNN [31]. In addition, several recent studies have been conducted to address the issue of missing labels in AIA problems, such as [32] and [33].

Tag completion. As mentioned above, a stream of studies has shown that rich information can be exploited from initial tags in a completion task; therefore, the pursuit of maintaining tag correlation as well as content consistency has always been a key component in nearly every algorithm, albeit in different formulations [7, 34, 10, 17, 35, 36, 37]. The users' tagging behavior was first analyzed in [38], followed by a tag recommendation method which comprised the generation and aggregation of candidate tags. In [35], tag recommendation was approached as a maximum a posteriori (MAP) problem, by making use of a folksonomy. The method of [39] cast images and their tags into compositions of semantic unities, and measured their relevance via the similarity of their semantic unities, using a semantic unity graph. G. Zhu et al. [7] decomposed the user-provided tag matrix into a low-rank completed matrix and a sparse error matrix, and utilized two distance matrices to maintain content consistency and tag correlation. Another matrix factorization method, [17], formulated tag completion as nonnegative data factorization, and decomposed the global image representations into label representations. Alternatively, the TMC method [34] directly searched for the optimal tag matrix which preserves correlation structures for both images and tags. Finally, the recently proposed LSR method [10] conducted linear sparse reconstruction for each image and each tag, respectively,
and achieved state-of-the-art performance. 3. Proposed Method In this section, some notations are introduced first, followed by a detailed description of the proposed method. 3.1. Outline Denote the initial user-provided tag matrix as D ∈ RN ×M , with M and N specifying the number of tags and images, respectively. Entries of D have binary values, that is, ⎧ ⎨ 1, Dij = ⎩ 0,
125
in case image i is associated with label j; otherwise.
Our objective is to recover the latent complete tag matrix A ∈ R^(N×M). Following the idea of [7], D is decomposed into the true tag matrix A and a sparse error matrix E ∈ R^(N×M). As explained in Section 1, with information sharing in mind, a low-dimensional tag subspace and a low-dimensional feature subspace are constructed in our method, by further factorizing A into a basis matrix and a sparse coefficient matrix, i.e., A = UV. Thus we have

    D = UV + E,    (1)

where U ∈ R^(N×K) and V ∈ R^(K×M) are the basis matrix and the sparse coefficient matrix, respectively. Since K ≪ M, the formulation in Eq.(1) implies that related tags will be encouraged to be linearly reconstructed by common basis vectors in U, thus
tag-level information gets shared. Furthermore, by switching the roles of U and V, similar samples are encouraged to choose common basis vectors within V as well; in this way, instance-level information also gets shared.

In addition, our method achieves low rank using a sparse coding scheme, preserves local reconstruction structures in the compressed low-dimensional feature subspace and tag subspace, and promotes basis diversity. Therefore, the proposed objective consists of four items, as defined in Eq.(2):

    min_{U,V,E} L(U,V,E) + γ R_u(U) + λ R_v(V) + ω R_b(U),
    s.t. ||U_•k||_2 = 1, ∀k ∈ {1, 2, ..., K}.    (2)
Specifically, L(U,V,E) denotes the empirical error with sparsity constraints for both V and E; R_u(U) and R_v(V) measure the local reconstruction structure consistency in the low-dimensional image subspace and tag subspace, respectively; and R_b(U) represents the similarities between all the columns of U. Concrete definitions of these items are presented in the following subsections.

3.2. Low Rank and Error Sparsity

This subsection presents the first item in Eq.(2). Here L(U,V,E) measures the reconstruction error based on Eq.(1). Combining our factorization scheme and the sparsity constraints with trade-off parameters, we arrive at a loss function defined as follows:

    L(U,V,E) = ||D − E − UV||_F^2 + 2η ||V||_1 + β ||E||_1.    (3)
Note that Eq.(3) can be interpreted as sparse coding, with U ∈ R^(N×K) being the basis matrix and V ∈ R^(K×M) the sparse coefficient matrix. Here V can be regarded as the tag representations in the new low-dimensional tag subspace, and U as the image representations in the new low-dimensional image subspace. In the next two subsections, local geometry structures are explored in the original spaces and preserved in the subspaces.

3.3. Local Reconstruction Structure in Feature Space

Denote X ∈ R^(N×L) as the feature matrix in the original feature space, where each row of X is the feature vector of an individual sample. In the new low-dimensional subspace, each row of U is a compressed representation of an image. Similar to the idea of LLE, the local geometry structure is believed to be important and should be preserved while compressing the representation. Therefore, the original data X is first explored for the structure information, which is encoded in matrix S:
    S* = arg min_S ||X − SX||_F^2 + α ||S||_1,
    s.t. S_nn = 0, ∀n ∈ {1, 2, ..., N},    (4)

where S ∈ R^(N×N) is the local linear reconstruction coefficient matrix in the original feature space. The j-th row of S contains the corresponding weights to reconstruct the features of the j-th image using those of the other samples. Eq.(4) can be efficiently solved using the feature-sign method [40].

Next, assume that the tags of the j-th image can be equally reconstructed by the tags of the other images; thus we have A ∼ SA. The local linear reconstruction structure specified by S should be robust to the sparse coding procedure in Eq.(3), which means this reconstruction structure should apply to U as well, i.e., U ∼ SU. Therefore, the second item in Eq.(2) can be instantiated as:

    R_u(U) = ||U − SU||_F^2.    (5)
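Eq.(4) is a set of N independent L1-regularized regressions. The paper solves it with the feature-sign method [40]; as a rough, self-contained stand-in, the same problem can be approximated with plain ISTA (soft-thresholded gradient steps). The function name, step-size rule, and iteration budget below are our own illustrative choices, not the paper's:

```python
import numpy as np

def reconstruction_weights(X, alpha=1.0, n_iter=300):
    """Approximate Eq.(4): min_S ||X - S X||_F^2 + alpha ||S||_1, s.t. S_nn = 0.
    Plain ISTA is used here as a simple stand-in for the feature-sign method [40]."""
    N = X.shape[0]
    S = np.zeros((N, N))
    # Step size 1/L, where L bounds the Lipschitz constant of the smooth part
    L = 2.0 * np.linalg.norm(X @ X.T, 2) + 1e-12
    for _ in range(n_iter):
        grad = 2.0 * (S @ X - X) @ X.T                           # gradient of ||X - S X||_F^2
        S = S - grad / L
        S = np.sign(S) * np.maximum(np.abs(S) - alpha / L, 0.0)  # soft threshold (L1 prox)
        np.fill_diagonal(S, 0.0)                                 # enforce the constraint S_nn = 0
    return S

X = np.random.default_rng(1).random((8, 5))
S = reconstruction_weights(X, alpha=0.05)
assert np.allclose(np.diag(S), 0.0)
```

Because the L1 prox and the zero-diagonal constraint are both entrywise-separable, the combined proximal step is exact, and each iteration does not increase the objective of Eq.(4).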
Note that Eq.(5) resembles the LSR method at first glance, in the spirit of linearly reconstructing each sample from other samples. However, the underlying methodology is different, since our method imposes a stronger constraint by preserving local geometry in the low-dimensional subspace, and the variable to be optimized is now the basis matrix U, not the reconstruction coefficients as in LSR.
3.4. Local Reconstruction Structure in Tag Space

In this section, the regularizer for V is designed in a manner similar to that for U. As explained before, each column of V can be deemed a compressed feature vector for a tag, and the local reconstruction structure in the original tag space should be preserved in this new subspace as well. Hence, the structural information, encoded in T, is explored first by leveraging the original data D:

    T* = arg min_T ||D − DT||_F^2 + μ ||T||_1,
    s.t. T_mm = 0, ∀m ∈ {1, 2, ..., M},    (6)

where T ∈ R^(M×M) is the local linear reconstruction coefficient matrix in the tag space, with the i-th column of T containing the corresponding weights to reconstruct the distribution of the i-th tag using those of the other tags. Eq.(6) can be solved analogously to Eq.(4).

Then the reconstruction relationship specified by T should also apply to V. Therefore, the third item in Eq.(2) is:

    R_v(V) = ||V − VT||_F^2.    (7)
Eq.(5) and Eq.(7) offer a convenient way to integrate both visual cues and semantic information into our factorization framework, which is essential for successful tag completion.

3.5. Minimize Basis Similarity

The last component of Eq.(2) aims to improve the sparse coding process itself. Inspired by the success of [16] and [17], our method also incorporates a regularizer to keep the basis matrix from containing identical or similar columns. In other words, the proposed formulation encourages column diversity in the basis matrix. Intuitively, if two columns of the basis matrix U are orthogonal, i.e., their dot product is zero, these two columns can be considered entirely different. Thus, we have

    R_b(U) = Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} u_i^T u_j = tr{U^T U B},    (8)

where B ∈ R^(K×K) is a matrix that satisfies B_ij = 0 in case i = j, and B_ij = 1 otherwise.
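The equality between the double-sum and trace forms of Eq.(8) is easy to verify numerically (the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 5
U = rng.random((N, K))

B = np.ones((K, K)) - np.eye(K)   # B_ij = 0 if i == j, 1 otherwise

# Double-sum form of Eq.(8): pair-wise dot products of distinct columns
rb_sum = sum(U[:, i] @ U[:, j] for i in range(K) for j in range(K) if j != i)
# Trace form of Eq.(8)
rb_tr = np.trace(U.T @ U @ B)

assert np.isclose(rb_sum, rb_tr)
```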
The regularizer specified in Eq.(8) is very easy to implement, as shown in Section 4.

Finally, our overall objective function is presented below:

    min_{U,V,E} ||D − E − UV||_F^2 + 2η ||V||_1 + β ||E||_1
                + γ ||U − SU||_F^2 + λ ||V − VT||_F^2 + ω tr{U^T U B},
    s.t. ||U_•k||_2 = 1, ∀k ∈ {1, 2, ..., K}.    (9)
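For later reference, Eq.(9) transcribes directly into numpy. A helper of this kind (the name is ours, not the paper's) is useful for checking that each update step in Section 4 actually decreases the objective:

```python
import numpy as np

def objective(D, U, V, E, S, T, gamma, lam, eta, beta, omega):
    """Direct transcription of Eq.(9); variable names follow the text."""
    K = U.shape[1]
    B = np.ones((K, K)) - np.eye(K)          # B from Eq.(8)
    fro2 = lambda A: float(np.sum(A * A))    # squared Frobenius norm
    return (fro2(D - E - U @ V)
            + 2.0 * eta * np.abs(V).sum()
            + beta * np.abs(E).sum()
            + gamma * fro2(U - S @ U)
            + lam * fro2(V - V @ T)
            + omega * np.trace(U.T @ U @ B))
```

With V = 0 and E = 0, the value reduces to ||D||_F^2 plus the two U-dependent terms, which makes a convenient starting point for monitoring the alternating solver.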
Eq.(9) incorporates all the essential components we have come up with: 1) low-rank factorization with error sparsity, which aims to share information among related tags and similar samples; 2) local reconstruction structure consistency in the obtained low-dimensional tag and feature subspaces; and 3) diversity among basis vectors, which helps to improve the sparse coding process. Despite being a seemingly complex objective function with multiple items and variables, the optimization of Eq.(9) is fairly simple and efficient, with a computational complexity growing linearly with the number of samples, as shown in Section 4.

4. Optimization
In this section, we focus on solving the minimization problem of the proposed objective function in Eq.(9). Although it is not jointly convex in all three variables, it is separately convex in U, V, and E with the remaining variables fixed. Thus, we devise an efficient solver for Eq.(9), by decoupling it into three subproblems and optimizing them alternately. The proposed algorithm has a computational complexity growing linearly with the number of samples, which indicates that our method is capable of handling much larger datasets. The optimization procedure is summarized in Algorithm 1.
Algorithm 1 Low-rank Tag Completion with Dual Reconstruction.
Require: Initial tag matrix D, reconstruction matrices S and T, matrix B, and parameters γ, λ, η, β, ω, N, M, K, and maxIter.
 1: U ← rand(N, K), V ← zeros(K, M), E ← zeros(N, M), H ← λ(T − I)(T − I)^T, G ← γ(S − I)^T(S − I)
 2: iter ← 0
 3: while iter < maxIter do
 4:   iter ← iter + 1
 5:   D̃ ← D − E
 6:   Update V:
 7:   Compute C = D̃^T U, Z = U^T U
 8:   for k = 1 to K do
 9:     for m = 1 to M do
10:       P_km ← C_mk − Σ_{l=1, l≠k}^{K} V_lm Z_lk − Σ_{r=1, r≠m}^{M} H_mr V_kr
11:       V_km ← (max{P_km, η} + min{P_km, −η}) / (Z_kk + H_mm)
12:     end for
13:   end for
14:   Update U:
15:   Compute F = V D̃^T, W = V V^T + ωB
16:   for k = 1 to K do
17:     for n = 1 to N do
18:       Q_nk ← F_kn − Σ_{l=1, l≠k}^{K} W_kl U_nl − Σ_{r=1, r≠n}^{N} U_rk G_rn
19:       U_nk ← Q_nk / (W_kk + G_nn)
20:       U_•k ← U_•k / max{1, ||U_•k||_2}
21:     end for
22:   end for
23:   Update E
24: end while
25: return U and V
4.1. Optimizing Coefficient V

Here the method in [41, 42] is used. Define D̃ = D − E and H = λ(T − I)(T − I)^T. When U and E are kept fixed, Eq.(9) reduces to:

    f(V) = ||D̃ − UV||_F^2 + λ ||V − VT||_F^2 + 2η ||V||_1
         = tr{D̃^T D̃} − 2 tr{V D̃^T U} + tr{V V^T U^T U} + tr{V H V^T} + 2η ||V||_1.    (10)

Ignoring the constant term tr{D̃^T D̃}, the objective function for V_km reduces to

    f(V_km) = 2 V_km ( Σ_{l=1, l≠k}^{K} V_lm Z_lk + Σ_{r=1, r≠m}^{M} H_mr V_kr − C_mk )
              + V_km^2 (Z_kk + H_mm) + 2η |V_km|,    (11)

where C = D̃^T U and Z = U^T U. Note that Eq.(11) is a piece-wise parabolic function opening upwards, which is convex, and its optimal point is easy to obtain:

    V_km = (max{P_km, η} + min{P_km, −η}) / (Z_kk + H_mm),

where

    P_km = C_mk − Σ_{l=1, l≠k}^{K} V_lm Z_lk − Σ_{r=1, r≠m}^{M} H_mr V_kr.
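Dropping indices, Eq.(11) is the one-dimensional function f(v) = (Z_kk + H_mm) v² − 2 P_km v + 2η|v|, and the update above is exactly soft-thresholding of P_km at η. A quick grid search confirms that the closed form attains the minimum:

```python
import numpy as np

def f(v, P, denom, eta):
    """One-coordinate objective from Eq.(11), up to an additive constant."""
    return denom * v**2 - 2.0 * P * v + 2.0 * eta * np.abs(v)

def v_star(P, denom, eta):
    """Closed-form minimizer from Section 4.1 (soft thresholding)."""
    return (max(P, eta) + min(P, -eta)) / denom

grid = np.linspace(-5.0, 5.0, 100001)
for P in (-3.0, -0.2, 0.0, 0.4, 2.5):
    denom, eta = 1.7, 0.5
    v = v_star(P, denom, eta)
    # The closed form should be at least as good as the best grid point
    assert f(v, P, denom, eta) <= f(grid, P, denom, eta).min() + 1e-6
```

In particular, |P_km| ≤ η gives V_km = 0, which is how the L1 term keeps V sparse.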
4.2. Optimizing Basis U

The optimal U can be obtained by alternating between a procedure similar to that for V and a Euclidean projection. Define G = γ(S − I)^T(S − I). When V and E are fixed, U can be solved analogously to V; the only modification is to remove the L1 regularizer:

    f(U_nk) = 2 U_nk ( Σ_{l=1, l≠k}^{K} W_kl U_nl + Σ_{r=1, r≠n}^{N} U_rk G_rn − F_kn )
              + U_nk^2 (W_kk + G_nn),    (12)

where F = V D̃^T and W = V V^T + ωB. The optimal U_nk is:

    U_nk = Q_nk / (W_kk + G_nn),

where

    Q_nk = F_kn − Σ_{l=1, l≠k}^{K} W_kl U_nl − Σ_{r=1, r≠n}^{N} U_rk G_rn.

Then, a Euclidean projection is performed to ensure that the L2 norm of each column of U is no greater than 1. Note that this is a coordinate descent approach, and the projection is conducted after each coordinate update whenever the L2 norm of the updated column of U is greater than 1. Thus, both convergence and a decrease in the objective function are guaranteed. The constraint ||U_•k||_2 = 1 is relaxed to ||U_•k||_2 ≤ 1, since the relaxation results in a convex optimization problem while keeping the global optimum unchanged, i.e., the optimal U will always satisfy ||U_•k||_2 = 1 even if the explicit constraint is ||U_•k||_2 ≤ 1.
4.3. Optimizing Sparse Error E

Finally, when U and V are fixed, obtaining E reduces to solving the following sparse coding problem:

    E* = arg min_E ||D − UV − E||_F^2 + β ||E||_1,    (13)

which can be solved similarly to S and T.

4.4. Implementation Issues

A kNN strategy is adopted when calculating the matrices S and T, where we set k = 200 (the same as in [10]; note that the lowercase letter k denotes the number of neighboring samples, while the uppercase letter K represents the number of basis vectors), in order to speed up the computation. For the number of basis vectors, we set K = 100 for the Corel5K dataset, and K = 500 for the much larger Flickr30Concepts dataset. Meanwhile, similar to [7], D is re-initialized as D = (SD + DT)/2 before being fed into the completion process. Also, tags are treated as features when obtaining S for the Flickr30Concepts dataset, but not for Corel5K, since the remaining tags for images in Corel5K are very sparse (fewer than 4), and using tags as features would then cause performance deterioration.

As indicated by Eq.(11) and Eq.(12), the total computational complexity per iteration is O(K^2 × N + K × (nnz(V) + nnz(H) + nnz(G))), where nnz(·) denotes the number of non-zero entries of a sparse matrix. This indicates that the running time of our proposed method is directly determined by N (the number of samples) and K (the number of basis vectors). It also means that our method is capable of handling large datasets, since its running time scales linearly with N.
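As a side note on the E-subproblem: Eq.(13) decouples over the entries of E, so (equivalently to running any L1 solver on this separable problem) the minimizer has the closed form of elementwise soft-thresholding of the residual R = D − UV at threshold β/2, obtained by setting the subgradient to zero. A sketch:

```python
import numpy as np

def update_e(D, U, V, beta):
    """Minimize Eq.(13) entrywise: E* = sign(R) * max(|R| - beta/2, 0), R = D - U V."""
    R = D - U @ V
    return np.sign(R) * np.maximum(np.abs(R) - beta / 2.0, 0.0)

# Residual entries below beta/2 in magnitude are zeroed; larger ones are shrunk
E = update_e(np.array([[1.0, -0.1]]), np.zeros((1, 1)), np.zeros((1, 2)), 0.4)
assert np.isclose(E[0, 0], 0.8) and E[0, 1] == 0.0
```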
Table 1: Statistics of Corel5K and Flickr30Concepts. Counts of tags are given in the format "mean/maximum".

                       Corel5K    Flickr30Concepts
Vocabulary Size        260        2,513
Nr. of Images          4,918      27,838
Tags per Image         3.4/5      8.3/70
Del. Tags per Image    1.4 (40%)  3.3 (40%)
Test Set               492        2,807
5. Experiments

To validate our approach, extensive empirical studies are carried out. In this section, the experimental setup is first outlined, followed by an analysis of some parameters. In Section 5.3, the influence of the number K of basis vectors is investigated. Finally, the performance of the proposed formulation is evaluated and compared with several prior methods. Some representative failure cases are examined in Section 5.4 as well, which helps us gain a more in-depth understanding of the impact of unbalanced samples, and sheds light on future research directions.

5.1. Experimental Setup

To facilitate comparison between our method and previous ones, the same datasets and features as in [10] are adopted. Two datasets are used: the well-established benchmark dataset Corel5K and the real-world Flickr30Concepts. Statistics of both datasets are given in Table 1.

For Corel5K, we randomly delete 40% of the tags and ensure that each image has at least one tag removed and one tag remaining.2 We perform the random deletion 8 times and report the averaged performance. Furthermore, a validation set containing 491 images is extracted randomly for parameter tuning.

2 The 1000-dimensional SIFT BoW feature for the Corel5K dataset is downloaded from http://lear.inrialpes.fr/people/guillaumin/data.php.
For Flickr30Concepts [10], the data provided by the authors are used directly, including the ground truth, the initial tag matrix, and two types of features: the 1000-dimensional SIFT BoW feature and the composite features consisting of a set of 10 kinds of basic features.3

In addition, the evaluation method in [10] is adopted, along with the same measurements: average precision@N (AP@N), average recall@N (AR@N), and coverage@N (C@N), defined as:

    AP@N = (1/m) Σ_{i=1}^{m} N_c(i) / N,
    AR@N = (1/m) Σ_{i=1}^{m} N_c(i) / N_mg(i),
    C@N  = (1/m) Σ_{i=1}^{m} I{N_c(i) > 0},

where m is the number of test images, N_c(i) is the number of correctly recovered tags for the i-th image, N_mg(i) is the number of missing ground-truth tags for the i-th image, and I{·} is an indicator function that returns 1 when the condition is true and 0 otherwise. Note that here N denotes the number of tags added to an image (not the total number of samples as defined in Section 3). Also, evaluations are performed only on the test set, and neighbors are extracted only from the training set.

5.2. Parameter Settings
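The three measurements defined above transcribe directly into a few lines (the function and argument names are our own):

```python
def completion_metrics(n_correct, n_missing, N):
    """AP@N, AR@N and C@N over a test set of m images.
    n_correct[i]: correctly recovered tags for image i (N tags added per image);
    n_missing[i]: number of missing ground-truth tags for image i."""
    m = len(n_correct)
    ap = sum(c / N for c in n_correct) / m          # average precision@N
    ar = sum(c / g for c, g in zip(n_correct, n_missing)) / m  # average recall@N
    cov = sum(1 for c in n_correct if c > 0) / m    # coverage@N
    return ap, ar, cov
```

For example, two test images where the first recovers both of its two missing tags and the second recovers none of its single missing tag give AP@2 = AR@2 = C@2 = 0.5.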
Altogether 7 parameters are involved in the proposed method, hence it is necessary to tune each parameter in order to achieve better performance and to analyze its respective influence on the completion process. The control variable method is adopted, which means only one parameter is modified at a time while the others remain unchanged. Since α and μ have little influence

3 The features include: Color Correlogram, Color Layout, CEDD, Edge Histogram, FCTH, JCD, Jpeg Coefficient Histogram, RGB Color Histogram, Scalable Color, and SURF with Bag-of-Words model.
Figure 2: Influences of λ, γ, ω and η on validation set of Corel5K.
when the number of neighbors is large enough (here 200), we simply set them to a large value (here 1) to make the feature-sign method faster. Also, as mentioned in [43], if β is smaller than 0.2, a severe negative effect can be observed; however, if β is larger than 1, the returned matrix E will have all-zero entries. On the other hand, values within [0.2, 1] do not affect the performance much. Therefore, only 4 parameters are discussed here, namely λ, γ, ω, and η. The obtained results are shown in Fig.2.

As illustrated in Fig.2, the value of γ should not be too large, since a larger γ means a higher degree of confidence in the assumption that SA ∼ A, which may be questionable due to the semantic gap. Similarly, performance also degrades as λ gets larger, since T is obtained from incomplete initial tags.
The performance of the proposed method gradually improves as ω becomes larger, until very large values are used; as indicated in Eq.(12), entries of U would be close to zero under such conditions. For η, as it approaches 0, the L1 regularizer is effectively disabled; as it gets larger, V might become too sparse and the reconstruction error would grow. The final values adopted in our experiments are λ = 5, γ = 5, β = 0.7, η = 0.5, ω = 0.3. All the following results are obtained using this group of parameters.

5.3. Influence of K

Since a matrix factorization framework is adopted in the proposed method, it is necessary to examine the influence of the number of basis vectors and determine an appropriate value for K. As illustrated in Fig.3, when K is given a small value, the basis vectors available to reconstruct the true tag matrix are insufficient, and thus the obtained performance is far from satisfactory due to a large reconstruction error. As K gets larger, the performance increases gradually. However, the performance does not improve when K is increased further, which means a larger K does not necessarily lead to a better result. This is consistent with our intuition of information sharing, since when K ≪ M does not hold, related tags or similar samples can choose different basis vectors, and thus information can no longer propagate between tags and samples.

Recall that the computational complexity of the proposed method is directly related to the value of K (cf. Section 4.4). Therefore, K should be set to a smaller value in order to make our method more efficient.

5.4. Tag Completion Results

To demonstrate the effectiveness of our method, we compare its performance with state-of-the-art annotation methods (JEC [1] and TagProp [2]) and several tag completion algorithms, namely LR [7], Vote+ [38], Folksonomy [35], DLC [17], TMC [34], and LSR [10]. Note that JEC and TagProp are designed for multi-channel features, while TMC and DLC are more suitable for the SIFT BoW feature (DLC requires non-negative features, and TMC prefers the dot product of
Figure 3: Influence of K on Corel5K. Note that a larger K does not necessarily lead to a better result, since information sharing cannot be achieved when K ≪ M no longer holds.

Table 2: Experimental results on Corel5K and Flickr30Concepts using only the SIFT BoW feature.
            Corel5K (N = 2)      Flickr30Concepts (N = 4)
            AP    AR    C        AP    AR    C
TMC         0.23  0.33  0.40     0.19  0.21  0.37
DLC         0.09  0.13  0.18     0.07  0.09  0.23
LSR         0.28  0.42  0.50     0.30  0.36  0.60
LRDLR       0.32  0.49  0.57     0.32  0.39  0.64
LRDLR+      0.33  0.51  0.60     0.34  0.41  0.67
feature vectors to be non-negative), whereas the LSR method, as well as the proposed one, can handle both multiple features and the SIFT BoW feature. For these baseline methods, the evaluation results reported in [10] are directly cited. Our previous method [43] is referred to as "LRDLR" (short for low-rank dual linear reconstruction), while the extended version which promotes basis diversity is referred to as "LRDLR+". Experimental results using only the SIFT BoW feature on both datasets are shown in Table 2, and results using composite features on Flickr30Concepts are presented in Table 3.

For the Corel5K dataset, our method outperforms previous methods by a large
Table 3: Experimental results on Flickr30Concepts using 10 types of features.
Flickr30Concepts (N = 4) AP
AR
C
JEC
0.25
0.30
0.49
TagProp
0.23
0.29
0.50
Vote+
0.23
0.27
0.48
Folksonomy
0.21
0.26
0.47
LR
0.27
0.34
0.51
LSR
0.37
0.45
0.67
LRDLR
0.39
0.48
0.72
LRDLR+
0.38
0.48
0.74
margin. Meanwhile, the extended version (“LRDLR+”) shows even better per350
formance, especially for the metric C@N , which demonstrates the significance of promoting basis diversity. Note that the pre-processing steps of obtaining S and T in our method basically correspond to the LSR method, which is far more delicate in the design of group-sparsity regularizer and soft fusion of coefficients. However, the LSR method is highly dependent on initial labels, which means, if
some critical tags are removed, the sparse reconstruction guided by the available tags may fail completely. Our method, on the other hand, exhibits enhanced robustness thanks to the incorporation of low-rank factorization. For the Flickr30Concepts dataset, our method still achieves the best performance, but the improvement over LSR is not as significant
as for the Corel5K dataset, since images in Flickr30Concepts have richer initial labels than those in Corel5K. Under such circumstances, even if one or two critical tags are removed, the content of an image can still be depicted by the remaining tags, so robustness towards noisy initial tags is less essential for Flickr30Concepts than for Corel5K.
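The AP, AR, and C values reported above follow an @N protocol. Assuming the common definitions from the tag-completion literature (average precision and recall of the top-N completed tags, and coverage as the fraction of images with at least one correct completion), the evaluation can be sketched as follows; this is an illustrative sketch, not the paper's evaluation code.

```python
def evaluate_at_n(predicted, ground_truth, n):
    """predicted: ranked tag lists per image; ground_truth: tag sets per image.

    Returns (AP@N, AR@N, C@N) averaged over all images, under the assumed
    definitions: precision/recall of the top-n tags, and coverage = share of
    images with at least one correctly recovered tag.
    """
    precisions, recalls, covered = [], [], []
    for ranked, truth in zip(predicted, ground_truth):
        top_n = ranked[:n]
        hits = sum(1 for t in top_n if t in truth)
        precisions.append(hits / n)
        recalls.append(hits / len(truth) if truth else 0.0)
        covered.append(1.0 if hits > 0 else 0.0)
    m = len(predicted)
    return sum(precisions) / m, sum(recalls) / m, sum(covered) / m

# Tiny usage example with hypothetical tags.
preds = [["sky", "plane", "tree"], ["water", "boat", "sun"]]
truths = [{"plane", "jet"}, {"grass"}]
ap, ar, c = evaluate_at_n(preds, truths, n=2)
print(round(ap, 2), round(ar, 2), round(c, 2))  # 0.25 0.25 0.5
```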
Furthermore, compared with the results using only the SIFT BoW feature, performance with 10 types of features is substantially improved, both for the LSR method and the proposed one, which once more demonstrates the advantage of multiple features. Finally, some representative failure cases are illustrated in Fig. 4, where the
images are shown along with their corresponding initial tags, missing ground truth tags, and tags recovered by the proposed method.
Figure 4: Some failure samples from Corel5K (from top to bottom: initial tags, missing ground truth tags, and recovered tags, respectively). Note that the recovered tags still reveal some semantic information.
As shown in Fig. 4, most failure samples contain high-frequency tags, such as sky, tree, and water, as initial labels.

Figure 5: Comparison between the coefficients of the frequent tag sky and the normal tag plane (top 15 related tags each). Note that the coefficients for sky decrease more slowly and exhibit a smaller variance.

Such frequent tags are related to many images; hence their reconstruction coefficients obtained by Eq. (6) are likely to be inaccurate. To make this more intuitive, the coefficients of sky and plane are illustrated in Fig. 5, where sky is an example of a high-frequency label (in fact, the most frequent tag in Corel5K) while plane is a tag related to sky with moderate frequency. Only the 15 most related tags are displayed. After being sorted in descending order, both sets of coefficients follow an
exponential-like trend; however, the coefficients for sky decrease more slowly and have a smaller variance, which means that identifying the most related labels for high-frequency tags such as sky is relatively difficult. Therefore, when sky occurs as an initial tag, multiple related candidate tags share similar chances of being chosen as completed tags, and which one is chosen depends on the tags
from the related samples of the image. However, if the candidate tags themselves are correlated, their frequencies of occurrence in those related samples may still be very close, resulting in an essentially random selection from multiple candidates. Since sample imbalance is a serious challenge for AIA and tag completion problems, this may indicate some directions for future improvements.
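This flat-versus-peaked contrast behind the Fig. 5 analysis can be mimicked with synthetic coefficient profiles; the numbers below are illustrative assumptions, not the actual coefficients from Eq. (6).

```python
import numpy as np

# Synthetic sketch of the Fig. 5 diagnostic: a frequent tag co-occurs with
# many tags almost equally, so its sorted reconstruction coefficients are
# flat; a moderate-frequency tag has a few strongly related tags, so its
# coefficients are peaked.
def sorted_top_coefficients(coeffs, k=15):
    """Take the k largest coefficients by magnitude, sorted descending,
    normalized to sum to one for a fair variance comparison."""
    c = np.sort(np.abs(coeffs))[::-1][:k]
    return c / c.sum()

# Nearly uniform profile (a sky-like frequent tag, illustrative values).
frequent = sorted_top_coefficients(np.full(40, 0.2) + 0.01 * np.arange(40))
# Peaked profile (a plane-like specific tag, illustrative values).
specific = sorted_top_coefficients(np.array([2.0, 1.0] + [0.05] * 38))

# The flat profile makes the truly related tags hard to single out.
print(frequent.var() < specific.var())  # True
```

A smaller variance over the sorted coefficients is exactly the symptom the text attributes to high-frequency tags: no small subset of related tags stands out, so the completion degenerates toward a near-random choice among candidates.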
Meanwhile, according to Fig. 4, albeit not identical to the missing ground truth tags, the recovered tags still provide some clues about the true content of the images, such as buildings for A1, plane for B1, house for E1, people for C3, giraffe for E3, and herd for A4.

6. Conclusions
A novel tag completion algorithm is proposed in this paper, characterized by low-rankness, error sparsity, basis diversity, and the ability to preserve local linear reconstruction structures in the compressed low-dimensional feature space and tag space. Extensive experiments conducted on the well-known Corel5K dataset and the real-world Flickr30Concepts dataset demonstrate the
effectiveness and efficiency of the proposed algorithm, which outperforms prior methods by a large margin. In addition, an analysis of some representative failure cases is presented, which sheds some light on the impact of unbalanced samples and indicates directions for future improvements.

Acknowledgment
This work was supported by the National Natural Science Foundation (NNSF: 61171118) and the Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP-20110002110057).

References

[1] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: European Conference on Computer Vision, Springer, 2008, pp. 316–329.
[2] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, in: International Conference on Computer Vision, 2009, pp. 309–316.
[3] Y. Zheng, Y.-J. Zhang, H. Larochelle, Topic modeling of multimodal data: an autoregressive approach, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2014.
[4] L. Zheng, S. Wang, Z. Liu, Q. Tian, Lp-norm IDF for large scale image search, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 1626–1633.
[5] L. Zheng, S. Wang, W. Zhou, Q. Tian, Bayes merging of multiple vocabularies for scalable image retrieval, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2014.
[6] L. Zheng, S. Wang, Z. Liu, Q. Tian, Packing and padding: Coupled multi-index for accurate image retrieval, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2014.
[7] G. Zhu, S. Yan, Y. Ma, Image tag refinement towards low-rank, content-tag prior and error sparsity, in: ACM International Conference on Multimedia, 2010, pp. 461–470.
[8] Y. Pang, Y. Yuan, X. Li, Iterative subspace analysis based on feature line distance, IEEE Transactions on Image Processing 18 (4) (2009) 903–907.
[9] X. Li, Y. Pang, Deterministic column-based matrix decomposition, IEEE Transactions on Knowledge and Data Engineering 22 (1) (2010) 145–149.
[10] Z. Lin, G. Ding, M. Hu, J. Wang, X. Ye, Image tag completion via image-specific and tag-specific linear sparse reconstructions, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 1618–1625.
[11] B. Shen, L. Si, Non-negative matrix factorization clustering on multiple manifolds, in: AAAI, 2010.
[12] B. Shen, B.-D. Liu, Q. Wang, Y. Fang, J. Allebach, SP-SVM: Large margin classifier for data on multiple manifolds, in: AAAI, 2015.
[13] B.-D. Liu, Y.-X. Wang, Y.-J. Zhang, Y. Zheng, Discriminant sparse coding for image classification, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 2193–2196.
[14] B.-D. Liu, Y.-X. Wang, B. Shen, Y.-J. Zhang, M. Hebert, Self-explanatory sparse representation for image classification, in: European Conference on Computer Vision, Springer, 2014, pp. 600–616.
[15] B. Shen, W. Hu, Y. Zhang, Y.-J. Zhang, Image inpainting via sparse representation, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 697–700.
[16] L. Bo, X. Ren, D. Fox, Multipath sparse coding using hierarchical matching pursuit, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 660–667.
[17] X. Liu, S. Yan, T.-S. Chua, H. Jin, Image label completion by pursuing contextual decomposability, ACM Transactions on Multimedia Computing, Communications, and Applications 8 (2) (2012) 21.
[18] J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and retrieval using cross-media relevance models, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 119–126.
[19] V. Lavrenko, R. Manmatha, J. Jeon, A model for learning the semantics of pictures, in: Advances in Neural Information Processing Systems, Vol. 1, 2003, p. 2.
[20] S. Feng, R. Manmatha, V. Lavrenko, Multiple Bernoulli relevance models for image and video annotation, in: IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, 2004, pp. 1002–1009.
[21] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 394–410.
[22] D. M. Blei, M. I. Jordan, Modeling annotated data, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 127–134.
[23] D. Putthividhy, H. T. Attias, S. S. Nagarajan, Topic regression multi-modal latent Dirichlet allocation for image annotation, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 3408–3415.
[24] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[25] Y. Pang, S. Wang, Y. Yuan, Learning regularized LDA by clustering, IEEE Transactions on Neural Networks and Learning Systems 25 (12) (2014) 2191–2201.
[26] C. Yang, M. Dong, J. Hua, Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning, in: IEEE International Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp. 2057–2063.
[27] Y. Chen, J. Z. Wang, Image categorization by learning and reasoning with regions, Journal of Machine Learning Research 5 (2004) 913–939.
[28] B. Li, K. Goh, Confidence-based dynamic ensemble for image annotation and semantics discovery, in: ACM International Conference on Multimedia, 2003, pp. 195–206.
[29] R. Jin, S. Wang, Z.-H. Zhou, Learning a distance metric from multi-instance multi-label data, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2009, pp. 896–902.
[30] Z.-H. Zhou, M.-L. Zhang, Multi-instance multi-label learning with application to scene classification, Advances in Neural Information Processing Systems 19 (2007) 1609.
[31] Y. Verma, C. Jawahar, Image annotation using metric learning in semantic neighbourhoods, in: European Conference on Computer Vision, Springer, 2012, pp. 836–849.
[32] Q. Wang, B. Shen, S. Wang, L. Li, L. Si, Binary codes embedding for fast image tagging with incomplete labels, in: European Conference on Computer Vision, Springer, 2014, pp. 425–439.
[33] M. Chen, A. Zheng, K. Weinberger, Fast image tagging, in: International Conference on Machine Learning, ACM, 2013, pp. 1274–1282.
[34] L. Wu, R. Jin, A. Jain, Tag completion for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) 716–727.
[35] S. Lee, W. De Neve, K. N. Plataniotis, Y. M. Ro, Map-based image tag recommendation using a visual folksonomy, Pattern Recognition Letters 31 (9) (2010) 976–982.
[36] D. Liu, S. Yan, X.-S. Hua, H.-J. Zhang, Image retagging using collaborative tag propagation, IEEE Transactions on Multimedia 13 (4) (2011) 702–712.
[37] D. Liu, X.-S. Hua, M. Wang, H.-J. Zhang, Image retagging, in: ACM International Conference on Multimedia, 2010, pp. 491–500.
[38] B. Sigurbjörnsson, R. Van Zwol, Flickr tag recommendation based on collective knowledge, in: ACM International Conference on World Wide Web, 2008, pp. 327–336.
[39] Y. Liu, F. Wu, Y. Zhang, J. Shao, Y. Zhuang, Tag clustering and refinement on semantic unity graph, in: IEEE International Conference on Data Mining, 2011, pp. 417–426.
[40] H. Lee, A. Battle, R. Raina, A. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, 2006, pp. 801–808.
[41] B.-D. Liu, Y.-X. Wang, Y.-J. Zhang, B. Shen, Learning dictionary on manifolds for image classification, Pattern Recognition 46 (7) (2013) 1879–1890.
[42] B.-D. Liu, Y.-X. Wang, B. Shen, Y.-J. Zhang, Y.-J. Wang, Blockwise coordinate descent schemes for sparse representation, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 5267–5271.
[43] X. Li, Y.-J. Zhang, B. Shen, B.-D. Liu, Image tag completion by low-rank factorization with dual reconstruction structure preserved, in: IEEE International Conference on Image Processing, 2014.