Pattern Recognition Letters 34 (2013) 292–298
Max-margin embedding for multi-label learning

Sunho Park (a), Seungjin Choi (a, b, c, *)

(a) Department of Computer Science and Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 790-784, Republic of Korea
(b) Division of IT Convergence Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 790-784, Republic of Korea
(c) Department of Creative IT Excellence Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 790-784, Republic of Korea
(*) Corresponding author. E-mail: [email protected] (S. Park), [email protected] (S. Choi).
Article info

Article history: Received 13 August 2012; available online 2 November 2012. Communicated by S. Sarkar.
Keywords: Multi-label learning; Label embedding; Max-margin learning; Cost-sensitive multi-label hinge loss; One versus all (OVA)
Abstract

Multi-label learning refers to methods for learning a classification function that predicts a set of relevant labels for an instance. Label embedding seeks a transformation which maps labels into a latent space where regression is performed to predict a set of relevant labels. The latent space is often a low-dimensional space, so computational and space complexities are reduced. However, the choice of an appropriate transformation to a latent space is not clear. In this paper we present a max-margin embedding method where both instances and labels are mapped into a low-dimensional latent space. In contrast to existing label embedding methods, the pair of instance and label embeddings is determined by minimizing a cost-sensitive multi-label hinge loss, in which a label-dependent cost is applied to penalize the misclassification of positive examples more heavily. For implementation, we employ the limited memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) method to determine the instance and label embeddings by a joint optimization. Numerical experiments on a few datasets demonstrate the high performance of our method compared to existing embedding methods in the case where the dimensionality of the latent space is much smaller than that of the original label space. © 2012 Elsevier B.V. All rights reserved.
1. Introduction

Multi-label learning seeks a classification function that predicts a set of relevant labels for an instance, whereas in multi-class problems a single label is assigned to an instance. Multi-label problems arise in various applications, including text mining, scene classification, image annotation, and bioinformatics, to name a few. Existing approaches to multi-label learning can be divided into two categories: algorithm adaptation and problem transformation (Tsoumakas et al., 2010). The algorithm adaptation approach directly extends existing methods to solve multi-label problems; examples include AdaBoost.MR (Schapire and Singer, 2000), Rank-SVM (Elisseeff and Weston, 2002) and ML-KNN (Zhang and Zhou, 2007). In the problem transformation approach, a multi-label problem is transformed into one or more reduced tasks (such as regression or binary classification). One-versus-all (OVA) and label embedding methods fall into this approach. OVA, which is also known as binary relevance (BR), assumes that labels are not correlated, so that binary classifiers are independently trained for each label. Label embedding seeks a transformation which maps labels into a latent space (reduced label space) where regression is performed to
predict a set of relevant labels. The compressed sensing-based method (Hsu et al., 2009) assumes sparsity in the label space, so that a reduced label space is obtained by a random projection. Principal label space transformation (PLST) (Tai and Lin, 2010, 2012) exploits the singular value decomposition of the label matrix to construct the transform to a reduced label space. Label embedding methods are efficient when the dimensionality of the reduced label space is much smaller than that of the original label space. However, the choice of a proper transformation to a reduced label space is not clear: minimizing a loss function in a reduced label space is not necessarily correlated with minimizing the desired multi-label loss. Moreover, most label embedding methods assume a uniform misclassification cost for positive and negative examples, whereas in practice each example suffers from label imbalance, since only a few relevant labels are assigned to each example.

In this paper we present a max-margin method where both instances and labels are mapped into a low-dimensional latent space. In contrast to existing label embedding methods, the pair of instance and label embeddings is determined by minimizing a cost-sensitive multi-label hinge loss, in which a label-dependent cost is applied to penalize the misclassification of positive examples more heavily. Thus, the latent space obtained by our method is directly related to optimizing the multi-label loss, in the sense that the cost-sensitive multi-label hinge loss is a continuous upper bound on the empirical multi-label loss. Due to this difference, our method outperforms existing embedding methods even in
the case where the dimensionality of the latent space is much smaller than that of the original label space, which is a desirable property of an embedding method for multi-label problems under limited computational resources. In addition, cost-sensitive learning makes our method more effective when the number of positive examples for each label is much smaller than the number of negative examples, as commonly arises in multi-label datasets with a large number of labels. In order to find the optimal pair of embeddings, we employ the limited memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, which is efficient for large-scale optimization problems. Numerical experiments on a few datasets confirm the useful behavior of our method, compared to existing embedding methods as well as OVA.

The rest of this paper is organized as follows. A unifying view of existing embedding methods for multi-label learning is provided in Section 2. Section 3 presents our max-margin embedding method, where the shared latent space is determined by minimizing the cost-sensitive multi-label hinge loss. Numerical experiments are provided in Section 4 and conclusions are drawn in Section 5.

2. Label embedding: a unifying view

Label embedding finds a transformation which maps labels into a reduced label space, such that regression is applied, treating transformed labels as targets, to predict a set of relevant labels. Each instance is assigned the set of labels that it is nearest to in the embedding space. Existing methods include the compressed sensing-based method (Hsu et al., 2009), principal label space transformation (PLST) (Tai and Lin, 2010, 2012), and canonical correlation analysis (CCA) (Sun et al., 2011). We first present a unifying view of label embedding, in which multi-label learning is described as seeking a pair of embeddings which map both instances and labels into a shared latent space such that the similarity between the embedded instance and label points is maximized.

Suppose that we are given a set of N instance-label pairs, {(x_i, y_i)}_{i=1}^N, where x_i ∈ X ⊆ R^D are instances and y_i ∈ Y = {+1, -1}^K are label vectors, with |Y| = 2^K. Denote by X ∈ R^{D×N} and Y ∈ R^{K×N} the instance and label matrices, respectively, where Y_{k,i} = +1 if instance x_i is assigned label k and Y_{k,i} = -1 otherwise. The shared latent space L ⊆ R^L (L ≪ min{D, K}) is a low-dimensional subspace common to both instances and labels. A pair of embeddings W ∈ R^{L×D} (which maps instances to L) and V ∈ R^{L×K} (which maps labels to L) is determined such that Wx_i is as close as possible to Vy_i and as far as possible from Vy for all y ∈ Y \ {y_i}, as shown in Fig. 1. To this end, we consider a discriminant function f(Wx, Vy): X × Y → R over instance-label
pairs, which measures how similar the pair (Wx, Vy) is. A set of relevant labels for instance x_i is predicted by maximizing the discriminant function f over label vectors y, given embeddings W and V:

\hat{y}_i = \arg\max_{y \in Y} f(W x_i, V y).    (1)
Alternatively, f can be thought of as a loss, where the minimum of f(Wx_i, Vy) over y ∈ Y is achieved at the desired label of x_i, given embeddings W and V. Existing methods (Hsu et al., 2009; Tai and Lin, 2010; Sun et al., 2011), summarized below, determine the label embedding V according to a pre-specified criterion, and then estimate the instance embedding W by a regression on the shared latent space, treating Vy as the response. In order to describe existing methods, we define the 0/1 binary label matrix by Ỹ = (1/2)(Y + 1_K 1_N^T), where 1_N ∈ R^N is the vector of all ones.

Compressed sensing (Hsu et al., 2009): Assuming that the label space is sparse, i.e., each 0/1 binary label vector ỹ_i has only a few 1's in its entries, this method determines the shared latent space by a random projection V ∈ R^{L×K}, each entry of which is drawn, for instance, from an i.i.d. Gaussian distribution. It employs greedy algorithms such as orthogonal matching pursuit (Mallat and Zhang, 1993) and CoSaMP (Needell and Tropp, 2008) to recover labels from the predictions of the compressed labels obtained by regression.

PLST (Tai and Lin, 2010, 2012): The shared latent space is determined by projecting labels onto the space spanned by the top L left singular vectors of the centered 0/1 label matrix. Define the centering matrix by H = I_{N×N} - (1/N) 1_N 1_N^T. Then V = U_L^T, where ỸH ≈ U_L D_L V_L^T. The label for x_i is predicted by

\hat{\tilde{y}}_i = \mathrm{round}\left( U_L W x_i + \frac{1}{N} \tilde{Y} 1_N \right),

where round(·) is the element-wise rounding operator, which maps each entry to the nearest vertex of the hypercube.

CCA (Sun et al., 2011): The common subspace is determined by maximizing the correlation between Wx_i and Vy_i, i.e., argmax_W tr((WXH)(VYH)^T), subject to W X X^T W^T = I. It was shown in (Sun et al., 2011) that CCA can also be solved by least squares regression, where the label embedding is given by V = (ỸHH^TỸ^T)^{-1/2} ∈ R^{K×K}. Instances are projected onto the shared latent space by the instance embedding W, and a binary classifier is applied for each label, treating Wx_i as input features.

We would like to point out that all of the aforementioned label embedding methods determine the instance embedding W and the label embedding V separately. In other words, learning the label embedding V is not necessarily correlated with minimizing a multi-label loss. Moreover, an appropriate choice of V is not clear, since only dimensionality reduction of the label space is considered. Finally, most label embedding methods assign the same cost to misclassification of positive and negative examples, although in practice label imbalance is severe, since only a few relevant labels are assigned to each instance (leading to a small number of positive examples and a large number of negative examples).
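To make the label-embedding-then-regression pipeline concrete, below is a minimal NumPy sketch in the spirit of PLST as described above (top-L left singular vectors of the centered 0/1 label matrix, ridge regression onto the latent codes, rounding for prediction). The function names, the ridge solver, and the assumed shapes X ∈ R^{D×N} and Ỹ ∈ {0,1}^{K×N} are our own illustration, not the authors' code; the default ridge parameter follows the value reported for PLST in Section 4.

import numpy as np

def plst_fit(X, Y01, L, ridge=1e-2):
    """PLST-style label embedding: SVD of the centered 0/1 label matrix,
    then ridge regression from instances onto the L-dimensional latent codes."""
    D, N = X.shape
    y_mean = Y01.mean(axis=1, keepdims=True)               # (1/N) Ytilde 1_N
    U, s, Vt = np.linalg.svd(Y01 - y_mean, full_matrices=False)
    U_L = U[:, :L]                                         # label embedding V = U_L^T
    T = U_L.T @ (Y01 - y_mean)                             # latent regression targets, (L, N)
    # ridge regression W = T X^T (X X^T + ridge I)^{-1}
    W = np.linalg.solve(X @ X.T + ridge * np.eye(D), X @ T.T).T
    return W, U_L, y_mean

def plst_predict(X, W, U_L, y_mean):
    """Prediction: round U_L W x_i + mean label back to the nearest 0/1 vertex."""
    return (U_L @ (W @ X) + y_mean > 0.5).astype(int)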
Fig. 1. A pictorial view of the embedding approach to multi-label learning. Instances and labels are projected onto the shared latent space L, where the embedded point Wx_i is close to Vy_i and far from the other embedded points Vy, for y ∈ Y \ {y_i}. Distinct symbols mark the embedded instances and the distinct label vectors in L.
3. Max-margin embedding

In this section we present the main contribution of this paper: the max-margin formulation of the label embedding approach to multi-label problems introduced in Section 2. Max-margin embedding jointly determines both instance and label embeddings by minimizing the cost-sensitive multi-label hinge loss.
3.1. Formulation
Throughout this paper, we choose the following discriminant function:

f(Wx, Vy) = x^\top W^\top V y = \mathrm{vec}(W^\top)^\top (x \otimes Vy),

which is linear in the joint feature map x ⊗ Vy, where ⊗ denotes the Kronecker product and vec(·) is the operator that stacks the columns of its matrix argument into a long vector. The goal of max-margin embedding is to learn the discriminant function f(Wx, Vy) = x^T W^T V y from a set of N instance-label pairs {(x_i, y_i)}_{i=1}^N, such that the parameters W and V are estimated by maximizing the margin γ = min_i γ_i, where γ_i is the difference between the score of the correct label y_i and that of the closest runner-up ŷ = argmax_{y ≠ y_i} x_i^T W^T V y. Invoking the prediction rule (1), the condition of zero training misclassification error can be written as a set of N(2^K - 1) constraints:

x_i^\top W^\top V y_i - x_i^\top W^\top V y \geq 0, \quad \forall y \in Y \setminus \{y_i\}, \; i = 1, \ldots, N.    (2)

Assume that the set of inequalities in (2) is feasible. As in the max-margin methods employed in support vector machines (Vapnik, 1998; Tsochantaridis et al., 2005), we estimate W and V such that the margin γ is maximized. Regularizing W and V separately, as done in multi-class learning (Amit et al., 2007), to make the problem well posed yields the following optimization problem:

\max_{\|W\|_F^2 + \|V\|_F^2 = 1} \gamma, \quad \text{subject to} \quad x_i^\top W^\top V y_i - x_i^\top W^\top V y \geq \gamma, \quad \forall y \in Y \setminus \{y_i\}, \; i = 1, \ldots, N,    (3)

where ||·||_F denotes the Frobenius norm. We consider the multi-label Hamming loss E(y, y_i), whose value varies with y = [y_1, ..., y_K]^T given the label vector y_i = [Y_{1,i}, ..., Y_{K,i}]^T, defined as

E(y, y_i) = \frac{1}{2} \sum_{k=1}^{K} e(y_k, Y_{k,i}),    (4)

where e(y_k, Y_{k,i}) = Y_{k,i}(Y_{k,i} - y_k). Introducing slack variables ξ_i ≥ 0 to allow non-zero error on training examples, together with margin re-scaling (Taskar et al., 2004), in which each inequality in (3) is re-scaled by the multi-label Hamming loss E(y, y_i),

x_i^\top W^\top V y_i - x_i^\top W^\top V y \geq \gamma \, E(y, y_i), \quad \forall y \in Y \setminus \{y_i\}, \; i = 1, \ldots, N,

(3) can be equivalently expressed as

\min_{W, V, \xi_i \geq 0} \; \frac{\lambda}{2}\left(\|W\|_F^2 + \|V\|_F^2\right) + \frac{1}{2KN} \sum_{i=1}^{N} \xi_i,    (5)

subject to x_i^T W^T V y_i - x_i^T W^T V y ≥ E(y, y_i) - ξ_i for all y ∈ Y \ {y_i}, i = 1, ..., N, where λ ≥ 0 is a regularization parameter. In general, (5) involves N(2^K - 1) constraints, which grow exponentially in the number of labels K. However, since the multi-label Hamming loss E(y, y_i) defined in (4) decomposes over individual labels (Taskar et al., 2004; Hariharan et al., 2010b), the problem reduces to only NK constraints, so that the slack variable for each instance, ξ_i, in (5) can be expressed as

\xi_i = \max_{y \in Y} \left[ E(y, y_i) - \left( x_i^\top W^\top V y_i - x_i^\top W^\top V y \right) \right]
      = \sum_{k=1}^{K} \max_{y_k \in \{\pm 1\}} \left[ e(y_k, Y_{k,i}) - (Y_{k,i} - y_k) \, x_i^\top W^\top v_k \right]
      = \sum_{k=1}^{K} \max\left\{ 0, \; 2 - 2 Y_{k,i} \, x_i^\top W^\top v_k \right\}.    (6)

Now we introduce the misclassification cost c_k(·), for k = 1, ..., K, defined as

c_k(+1) = N_k^- / N_k^+, \qquad c_k(-1) = 1,

where N_k^+ denotes the number of instances assigned label k (i.e., positive examples) and N_k^- = N - N_k^+ denotes the number of negative examples. In general, the label space is sparse in multi-label problems, so c_k(+1) > c_k(-1) and the misclassification of positive examples is penalized more heavily. Invoking (6) and incorporating the misclassification cost c_k(·), we formulate max-margin embedding as the minimization of the cost-sensitive multi-label hinge loss with regularization on W and V:

\min_{W, V} \; \frac{\lambda}{2}\left(\|W\|_F^2 + \|V\|_F^2\right) + \frac{1}{KN} \sum_{k=1}^{K} \sum_{i=1}^{N} c_k(Y_{k,i}) \max\left\{ 0, \; 1 - Y_{k,i} \, x_i^\top W^\top v_k \right\}.    (7)

One can easily see that minimizing the cost-sensitive multi-label hinge loss implies minimizing the empirical cost-sensitive multi-label loss Ψ(W, V), which is defined to include the label-dependent cost for each label:

\Psi(W, V) = \frac{1}{KN} \sum_{i=1}^{N} E_{cs}(\hat{y}_i, y_i),    (8)

where ŷ_i is the label vector for x_i predicted by (1) and E_cs(·, ·) is a cost-sensitive version of the Hamming loss, given by

E_{cs}(\hat{y}_i, y_i) = \frac{1}{2} \sum_{k=1}^{K} c_k(Y_{k,i}) \, e(\hat{Y}_{k,i}, Y_{k,i}).    (9)

Note that the hinge loss is an upper bound on the zero-one loss. It follows that the cost-sensitive multi-label hinge loss is also an upper bound on the empirical multi-label loss (8). Thus, in contrast to existing embedding methods, the shared latent space obtained by our method is directly related to minimizing the empirical multi-label loss.

The formulation (7) includes cost-sensitive OVA as a special case. Fixing the embedding dimension at L = D and choosing the identity matrix for the instance embedding, i.e., W = I_{D×D}, our formulation (7) becomes a set of K independent minimization problems, each of which is a prediction problem for label k:

\min_{v_k} \; \frac{\lambda}{2} \|v_k\|^2 + \frac{1}{N} \sum_{i=1}^{N} c_k(Y_{k,i}) \max\left\{ 0, \; 1 - Y_{k,i} \, v_k^\top x_i \right\}.    (10)

Setting c_k(+1) = 1 for k = 1, ..., K, (10) reduces to OVA. In fact, the optimization problem (10) is a realization of the cost-sensitive binary SVM (Brefeld et al., 2003). OVA enjoys solving K independent problems, but the dependency between labels is completely ignored. In our embedding formulation, on the other hand, task (prediction) relatedness is encoded by the instance embedding W, which is shared across labels. That is, the predictor for label k, defined as W^T v_k, is represented as a linear combination of shared bases. This idea is identical to the shared predictor subspace in multi-task learning (Rai and Daumé, 2010), where task parameters are assumed to share an underlying basis space to account for task relatedness.

We also note that our formulation (7) can be transformed into a convex optimization problem by employing a trace norm regularization on the matrix variable M = W^T V. Instead of limiting the embedding dimension L, we constrain the norms of W and V, as in max-margin matrix factorization (Srebro et al., 2005), leading to the trace norm regularization of M:

\|M\|_* = \min_{M = W^\top V} \; \frac{1}{2}\left(\|W\|_F^2 + \|V\|_F^2\right),    (11)
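For concreteness, the following minimal NumPy sketch (our own illustration, not the authors' implementation) evaluates the cost-sensitive multi-label hinge objective (7) and the per-label prediction implied by (1); the shapes X ∈ R^{D×N} and Y ∈ {+1,-1}^{K×N} follow the notation above, and the helper names are hypothetical.

import numpy as np

def label_costs(Y):
    """Label-dependent costs: c_k(+1) = N_k^- / N_k^+ and c_k(-1) = 1, per entry of Y."""
    K, N = Y.shape
    n_pos = np.maximum((Y == 1).sum(axis=1), 1)        # N_k^+ (guarded against empty labels)
    c_pos = (N - n_pos) / n_pos                        # N_k^- / N_k^+
    return np.where(Y == 1, c_pos[:, None], 1.0)       # (K, N)

def cs_hinge_objective(W, V, X, Y, lam):
    """Cost-sensitive multi-label hinge objective of Eq. (7)."""
    K, N = Y.shape
    S = V.T @ (W @ X)                                  # (K, N), entries v_k^T W x_i
    hinge = np.maximum(0.0, 1.0 - Y * S)               # per-(k, i) hinge terms
    reg = 0.5 * lam * (np.sum(W ** 2) + np.sum(V ** 2))
    return reg + (label_costs(Y) * hinge).sum() / (K * N)

def predict_labels(W, V, X):
    """Prediction rule (1): since the score is linear in y, the argmax over y
    decomposes per label into sign(v_k^T W x_i)."""
    return np.where(V.T @ (W @ X) > 0, 1, -1)

Because the discriminant function is linear in y, the exponentially large argmax in (1) decomposes over labels, which is what predict_labels exploits.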
where ||M||_* denotes the trace norm of M, i.e., the sum of the singular values of M. Note that the trace norm ||M||_* is a convex envelope of the rank function of the matrix M (Fazel et al., 2001). Let m_k = W^T v_k be the kth column of the matrix M; it is the predictor for class k. As a result, the problem (7) yields a trace norm minimization:

\min_{M} \; \lambda \|M\|_* + \frac{1}{KN} \sum_{k=1}^{K} \sum_{i=1}^{N} c_k(Y_{k,i}) \max\left\{ 0, \; 1 - Y_{k,i} \, m_k^\top x_i \right\}.    (12)

The formulation (12) is convex and thus guarantees a global solution. In addition, trace norm regularization is useful for finding shared structure among related functions in multi-class learning (Amit et al., 2007) and multi-task learning (Argyriou et al., 2008). Note, however, that the optimization of (12) involves a high computational cost. Although we can apply first-order optimization to solve (12), using an extended gradient method (Ji and Ye, 2009) or a smooth approximation of the trace norm (Amit et al., 2007), both methods require computing a singular value decomposition (SVD) of the matrix variable M at each iteration, which may be prohibitive for large-scale multi-label datasets. Recently, Ying and Li (2012) proposed an eigenvalue optimization formulation for learning a Mahalanobis metric, and also provided first-order optimization methods for solving a trace norm minimization problem which require only the largest eigenvector of a gradient matrix of the same size as the parameter matrix at each iteration. These methods can be directly applied to our trace norm minimization formulation (12) (see Appendix A in Ying and Li (2012)) with less computational cost than the two approaches above, which need a full SVD. However, they still suffer from a high computational burden for datasets with high-dimensional features, because the outer product of the largest eigenvector (whose dimensionality is D + K; see also Appendix A in Ying and Li (2012)) is involved in each iteration.
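The variational characterization (11) is easy to verify numerically; the short NumPy check below (our own illustration, not part of the paper) factors a random matrix M through its SVD and compares (1/2)(||W||_F^2 + ||V||_F^2) at that factorization with the nuclear (trace) norm of M.

import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 5
M = rng.standard_normal((D, K))

# Factor M = W^T V from its SVD: W^T = U sqrt(S), V = sqrt(S) Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
W = np.sqrt(s)[:, None] * U.T          # (min(D, K), D)
V = np.sqrt(s)[:, None] * Vt           # (min(D, K), K)
assert np.allclose(W.T @ V, M)

# At this factorization, (1/2)(||W||_F^2 + ||V||_F^2) equals the trace norm of M.
print(np.linalg.norm(M, ord='nuc'),
      0.5 * (np.linalg.norm(W, 'fro') ** 2 + np.linalg.norm(V, 'fro') ** 2))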
3.2. Implementation

In this paper we focus on solving the optimization problem (7) instead of the trace norm regularization problem (12). With a fixed embedding dimension L, we directly estimate the pair of embeddings W and V by minimizing (7). Although the joint optimization over W and V is non-convex, this approach requires less computational cost than solving the trace norm regularization. In addition, the memory required for the parameters is reduced from DK to L(D + K); if L ≪ K, even more memory is saved. Note that our approach is similar to fast max-margin matrix factorization (Rennie and Srebro, 2005a), where the pair of factor and coefficient matrices is estimated jointly, instead of solving the original trace norm minimization problem (Srebro et al., 2005) over the matrix variable that is the product of the factor and coefficient matrices.

A main difficulty in the formulation (7) is the non-differentiability of the cost-sensitive multi-label hinge loss. Define

\phi_k(z) = c_k(Y_{k,i}) \max\{0, 1 - z\},    (13)

where z = Y_{k,i} x_i^T W^T v_k. The function φ_k is identical to the binary hinge loss φ_b(z) = max{0, 1 - z}, except that it has a steeper negative slope (-c_k(+1)) for a positive example (see Fig. 2). Thus, we can extend the smooth approximation of the binary hinge loss proposed in (Rennie and Srebro, 2005b) to include the label-dependent cost:

\tilde{\phi}_k(z) =
\begin{cases}
0 & \text{if } z \geq 1, \\
c_k(Y_{k,i}) (1 - z)^2 / 2 & \text{if } 0 < z < 1, \\
c_k(Y_{k,i}) (1/2 - z) & \text{if } z \leq 0.
\end{cases}    (14)

Fig. 2. The cost-sensitive multi-label hinge loss (13) and its smooth approximation (14). The misclassification cost for a positive example is set to c_k(+1) = 3.

Note that the smooth function φ̃_k is zero for z ≥ 1 and has a constant negative slope (-c_k(Y_{k,i})) for z ≤ 0, the same as φ_k in (13); in the region 0 < z < 1 it changes smoothly between the zero slope and the constant negative slope. Fig. 2 plots the two functions φ_k and φ̃_k for the case where the misclassification costs for a positive and a negative example are c_k(+1) = 3 and c_k(-1) = 1, respectively. Replacing the cost-sensitive multi-label hinge loss in (7) with its smooth approximation (14), we obtain the following unconstrained smooth minimization problem:

\min_{W, V} \; \frac{\lambda}{2}\left(\|W\|_F^2 + \|V\|_F^2\right) + \frac{1}{KN} \sum_{i=1}^{N} \sum_{k=1}^{K} \tilde{\phi}_k\left( Y_{k,i} \, v_k^\top W x_i \right).    (15)

In the experiments, we use the limited memory BFGS method (Nocedal, 1980; Liu and Nocedal, 1989), an efficient gradient-based optimization tool for large-scale problems; we used the open-source implementation available at http://www.chokkan.org/software/liblbfgs/.
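As an illustration of how (15) can be minimized with an off-the-shelf quasi-Newton routine, here is a minimal SciPy sketch (our own, using scipy.optimize rather than liblbfgs); it packs W and V into one parameter vector and returns the smooth objective together with its gradient, under the same assumed shapes X ∈ R^{D×N} and Y ∈ {+1,-1}^{K×N} as before.

import numpy as np
from scipy.optimize import minimize

def smoothed_objective(theta, X, Y, L, lam):
    """Objective (15) with the smooth hinge (14), and its gradient w.r.t. (W, V)."""
    D, N = X.shape
    K = Y.shape[0]
    W = theta[:L * D].reshape(L, D)
    V = theta[L * D:].reshape(L, K)
    # label-dependent costs c_k(+1) = N_k^- / N_k^+, c_k(-1) = 1, arranged as an (N, K) array
    n_pos = np.maximum((Y == 1).sum(axis=1), 1)
    C = np.where(Y.T == 1, (N - n_pos) / n_pos, 1.0)
    WX = W @ X                                    # (L, N)
    Z = Y.T * (WX.T @ V)                          # (N, K) margins Y_{k,i} v_k^T W x_i
    # piecewise value and derivative of the smooth hinge (14)
    val = np.where(Z >= 1, 0.0, np.where(Z > 0, 0.5 * C * (1 - Z) ** 2, C * (0.5 - Z)))
    dz = np.where(Z >= 1, 0.0, np.where(Z > 0, -C * (1 - Z), -C))
    obj = 0.5 * lam * (np.sum(W ** 2) + np.sum(V ** 2)) + val.sum() / (K * N)
    A = (dz * Y.T) / (K * N)                      # (N, K) chain-rule weights
    grad_W = lam * W + V @ A.T @ X.T              # (L, D)
    grad_V = lam * V + WX @ A                     # (L, K)
    return obj, np.concatenate([grad_W.ravel(), grad_V.ravel()])

# Example usage (shapes as in the text):
# theta0 = 0.01 * np.random.randn(L * (D + K))
# result = minimize(smoothed_objective, theta0, args=(X, Y, L, lam),
#                   method='L-BFGS-B', jac=True)

The piecewise derivative used here matches (14): zero for z ≥ 1, -c_k(1 - z) on (0, 1), and the constant -c_k for z ≤ 0.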
4. Numerical experiments
We evaluated the performance of our method in comparison with OVA and the existing embedding methods, including compressed sensing (CS) (Hsu et al., 2009), CCA (Sun et al., 2011) and PLST (Tai and Lin, 2010). Note that the multi-label classification algorithm in (Hariharan et al., 2010b) is also formulated in the framework of max-margin learning, where label correlations are used as prior information in order to implicitly capture the dependency among labels. However, this type of prior is useful when the label correlations found in the training data are very different from those found in the test data, which is not common for most multi-label datasets; in fact, the method aims at solving a specific type of problem in computer vision, such as the zero-shot problem (Hariharan et al., 2010a). In addition, that paper shows that the max-margin formulation for multi-label problems without the label correlation prior reduces to OVA (Hariharan et al., 2010b). Thus, we instead treat OVA as a max-margin multi-label classifier.

We considered 5 multi-label datasets from the mulan repository (http://mulan.sourceforge.net/datasets.html). Detailed descriptions of each dataset are summarized in Table 1. Note that in most datasets each example has only a few relevant labels, i.e., the label density (LD), the average number of 1's in a label vector divided by K, is much less than one. For all experiments, we used 5-fold cross validation: the dataset is randomly divided into 5 subsets, of which one subset is used as test data and the remaining subsets are used as training data.
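The evaluation protocol can be sketched with a standard splitter; the snippet below is a hypothetical illustration using scikit-learn's KFold (not the authors' scripts), with placeholder data in the shapes used throughout the paper.

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
D, K, N = 294, 6, 2407                                  # e.g. the scene dataset sizes in Table 1
X = rng.standard_normal((D, N))                         # placeholder feature matrix
Y = np.where(rng.random((K, N)) < 0.18, 1, -1)          # placeholder +/-1 label matrix

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(np.arange(N)):
    X_tr, Y_tr = X[:, train_idx], Y[:, train_idx]       # four folds for training
    X_te, Y_te = X[:, test_idx], Y[:, test_idx]         # one held-out fold for testing
    # ... fit the model on (X_tr, Y_tr) and evaluate the F measure on (X_te, Y_te) ...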
Table 1. Data description: K is the number of labels, N is the number of instances, D is the input dimensionality, LC is the label cardinality (the average number of 1's in a label vector) and LD is the label density (LC/K).

Dataset     # Labels (K)   # Examples (N)   # Features (D)   LC      LD
Scene       6              2407             294              1.074   0.179
Medical     45             978              1449             1.245   0.028
Rcv1v2      101            6000             47,236           2.880   0.029
Bibtex      159            7395             1836             2.402   0.015
Corel5K     374            5000             499              3.522   0.009

In order to evaluate the performance of each method, we considered the F measure (Tsoumakas et al., 2010), defined as the harmonic mean of recall and precision:

F = \frac{2RP}{R + P},    (16)

where recall (R) and precision (P) are given by

R = \frac{\sum_{k=1}^{K} p(\hat{Y}_{k,i} = 1) \, p(Y_{k,i} = 1)}{\sum_{k=1}^{K} p(Y_{k,i} = 1)}, \qquad P = \frac{\sum_{k=1}^{K} p(\hat{Y}_{k,i} = 1) \, p(Y_{k,i} = 1)}{\sum_{k=1}^{K} p(\hat{Y}_{k,i} = 1)},

where Ŷ_{k,i} is the label predicted by the method and the predicate p(t) is 1 if t is true and 0 otherwise. Given the test data, the (example-based) F measure is averaged across all test examples:

F = \frac{2}{T} \sum_{i=1}^{T} \frac{\sum_{k=1}^{K} p(\hat{Y}_{k,i} = 1) \, p(Y_{k,i} = 1)}{\sum_{k=1}^{K} p(\hat{Y}_{k,i} = 1) + \sum_{k=1}^{K} p(Y_{k,i} = 1)},    (17)

where T is the number of test examples. The larger the F value, the better the prediction performance. The Hamming loss, defined in (4), is also a representative performance measure for multi-label problems. However, the Hamming loss is less informative when LD is close to zero, as in our experiments (see Table 1): a naive classifier that always returns the all -1 label vector (i.e., predicts no relevant labels) produces undesirable predictions, yet achieves a considerably small Hamming loss (equal to LD). Thus, in the experiments we focused on the performance of each method evaluated by the F measure.

The detailed experimental settings for each method are as follows. For OVA, we used a linear SVM (liblinear (Fan et al., 2008)) as the individual predictor for each label. In the case of CS (Hsu et al., 2009), the label embedding V was constructed from randomly chosen rows of the Hadamard matrix, and the orthogonal matching pursuit algorithm (Mallat and Zhang, 1993) was employed to recover the label vector from the prediction of the compressed labels obtained by regression. For CCA, instances were mapped onto the common subspace, where a linear SVM is applied for classification for each label.

Fig. 3. Performance comparison of each embedding method on (a) medical, (b) rcv1v2, (c) bibtex and (d) corel5K datasets in terms of F1 measure, varying the embedding dimension L from 10 to 50.
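As a small worked helper (ours, not from the paper), the example-based F measure of (16) and (17) can be computed from predicted and ground-truth label matrices of shape (K, T); only equality with 1 is tested, so either +/-1 or 0/1 coding works.

import numpy as np

def example_based_f_measure(Y_pred, Y_true):
    """Example-based F measure, Eqs. (16)-(17): per-example F, averaged over examples."""
    tp = np.logical_and(Y_pred == 1, Y_true == 1).sum(axis=0)   # true positives per example
    pred_pos = (Y_pred == 1).sum(axis=0)                        # predicted relevant labels
    true_pos = (Y_true == 1).sum(axis=0)                        # ground-truth relevant labels
    denom = pred_pos + true_pos
    f = np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1), 0.0)
    return f.mean()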
Table 2. Performance comparison in terms of Recall and F1 measure, where the results are the average values with the standard deviations in parentheses; L is the embedding dimension used for each dataset.

Recall:
Dataset    L     CS             OVA            CCA            PLST           Our method
Scene      6     0.643 (0.020)  0.649 (0.006)  0.491 (0.083)  0.557 (0.006)  0.880 (0.014)
Medical    20    0.730 (0.026)  0.700 (0.023)  0.491 (0.083)  0.557 (0.006)  0.888 (0.056)
Rcv1v2     50    0.147 (0.005)  0.491 (0.004)  0.308 (0.010)  0.366 (0.005)  0.702 (0.013)
Bibtex     50    0.350 (0.006)  0.448 (0.008)  0.363 (0.009)  0.247 (0.005)  0.522 (0.007)
Corel5K    50    0.086 (0.005)  0.145 (0.007)  0.081 (0.008)  0.054 (0.005)  0.436 (0.010)

F1:
Dataset    L     CS             OVA            CCA            PLST           Our method
Scene      6     0.617 (0.019)  0.610 (0.006)  0.499 (0.084)  0.539 (0.006)  0.698 (0.008)
Medical    20    0.732 (0.020)  0.404 (0.030)  0.499 (0.084)  0.539 (0.006)  0.808 (0.032)
Rcv1v2     50    0.170 (0.006)  0.084 (0.002)  0.320 (0.010)  0.375 (0.007)  0.538 (0.007)
Bibtex     50    0.372 (0.005)  0.404 (0.006)  0.332 (0.007)  0.283 (0.007)  0.403 (0.005)
Corel5K    50    0.112 (0.006)  0.150 (0.006)  0.086 (0.009)  0.074 (0.006)  0.178 (0.004)
Note that, as in (Sun et al., 2011), the embedding dimension in CCA is fixed to the number of labels K (see Section 2). For PLST, we used a ridge linear regression model to solve the estimation problem in the reduced label space, with the regularization parameter set to 10^{-2} as recommended in (Tai and Lin, 2010). For both OVA and CCA, the regularization parameter of the linear SVM, which trades off the loss and the margin, was chosen from {2^{-2}, 2^{-1}, ..., 2^4} by maximizing the F measure on a validation dataset. Similarly, in our method we chose the regularization parameter λ from {10^{-6}, 10^{-5}, ..., 10^0}.

In the experiments, we mainly focused on comparing our embedding approach with the existing label embedding methods in the case where the embedding dimension L is much smaller than the dimensionality of the original label space, K. The purpose of embedding methods is to reduce computational and space complexities by mapping instances and labels into a common low-dimensional space. Note, however, that in most existing embedding methods the label embedding is obtained by some kind of unsupervised dimensionality reduction of the label vectors, e.g., random projection for CS and principal component analysis (PCA) for PLST. Thus, if the embedding dimension is not large enough to capture the information contained in the original label space, the performance of the existing embedding methods is significantly degraded. On the other hand, our method finds an optimal latent space in the sense that both label and instance embeddings are obtained by minimizing the multi-label loss. Thus, we can expect our method to outperform existing embedding methods even for a much lower-dimensional latent space.

To verify this point, we evaluated the performance of each embedding method on the last four datasets in Table 1, varying the embedding dimension in the range from 5 to 50. As shown in Fig. 3, our method is superior to the other embedding methods in most cases. Especially for the corel5K dataset, the performance of our method becomes better than that of all other methods at L = 30, where the embedding dimension L is much smaller than the original dimensionality of the label space, K = 374. All results demonstrate that our method is superior to the other embedding methods, CS and PLST, at low embedding dimensions. Thus, our method is more useful when only limited memory space is available (the memory required for the parameters in our embedding method is L(K + D)).

We also include experimental results which support the effectiveness of the cost-sensitive multi-label loss in our formulation. In most multi-label datasets in Table 1, the LD value is much smaller than 1, which yields an imbalance between positive and negative examples for each label. Since the existing embedding methods assume a uniform misclassification cost for positive and negative examples, they might fail to correctly
separate a few positive examples from a large number of negative examples. In our method, on the other hand, the heavier penalization of misclassified positive examples helps to improve the true positive rate (recall). Note that, since a naive classifier with a high false alarm rate can obtain a high recall value, we additionally report the F measure to evaluate the performance of each method accurately. We provide the recall and the F measure of each method in Table 2, where we chose the smallest embedding dimension L that guarantees reasonable prediction performance. As shown in Table 2, our method significantly improves the recall of the embedding methods as well as of OVA. Especially for the corel5K dataset, whose LD value is less than 0.01 (see Table 1), OVA and the other embedding methods fail to correctly predict the positive examples (see their recall values in Table 2). On the other hand, our method is superior to the other methods in most cases in terms of both recall and F measure. As a result, we can confirm the effectiveness of the cost-sensitive multi-label loss in our method.

5. Conclusions

We have proposed a max-margin embedding method where instances and labels are mapped into a low-dimensional latent space. In contrast to existing embedding methods, the pair of instance and label embeddings is determined by minimizing the cost-sensitive multi-label hinge loss, in which a label-dependent cost is applied to penalize the misclassification of positive examples more heavily. Thus, the shared latent space obtained by our method is directly related to minimizing the empirical multi-label loss. We have also shown that our method outperforms the existing embedding methods in the case where the dimensionality of the latent space is much smaller than that of the original label space. In addition, cost-sensitive learning makes our method more effective when the number of positive examples for each label is much smaller than the number of negative examples. The cost-sensitive multi-label hinge loss function itself is not able to capture the dependency among labels. In the future, we plan to extend our embedding framework to include more general multi-label losses which explicitly model the label dependency, such as the hierarchical loss function in (Cesa-Bianchi et al., 2006).

Acknowledgments

This work was supported by National Research Foundation (NRF) of Korea (2012-0005032), MEST Converging Research Center Program (2012K001343), NIPA ITRC Program (NIPA-2012-H030112-3002), MKE and NIPA IT Consilience Creative Program (C1515-1121-0003), and NRF World Class University Program (R31-10100).
References

Amit, Y., Fink, M., Srebro, N., Ullman, S., 2007. Uncovering shared structures in multiclass classification. In: Proceedings of the International Conference on Machine Learning (ICML), Corvallis, OR, USA.
Argyriou, A., Evgeniou, T., Pontil, M., 2008. Convex multi-task feature learning. Machine Learning 73 (3), 243-272.
Brefeld, U., Geibel, P., Wysotzki, F., 2003. Support vector machines with example dependent costs. In: Proceedings of the European Conference on Machine Learning (ECML), Cavtat-Dubrovnik, Croatia, pp. 492-502.
Cesa-Bianchi, N., Gentile, C., Zaniboni, L., 2006. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research 7, 31-54.
Elisseeff, A., Weston, J., 2002. A kernel method for multi-labeled classification. In: Advances in Neural Information Processing Systems (NIPS), Vol. 14. MIT Press, pp. 681-687.
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J., 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871-1874.
Fazel, M., Hindi, H., Boyd, S.P., 2001. A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the 2001 American Control Conference.
Hariharan, B., Vishwanathan, S.V.N., Varma, M., 2010a. Efficient max-margin multi-label classification with applications to zero-shot learning. Tech. Rep. MSR-TR-2010-141, Microsoft Research.
Hariharan, B., Zelnik-Manor, L., Vishwanathan, S.V.N., Varma, M., 2010b. Large scale max-margin multi-label classification with priors. In: Proceedings of the International Conference on Machine Learning (ICML), Haifa, Israel.
Hsu, D., Kakade, S.K., Langford, J., Zhang, T., 2009. Multi-label prediction via compressed sensing. In: Advances in Neural Information Processing Systems (NIPS), Vol. 22. MIT Press.
Ji, S., Ye, J., 2009. An accelerated gradient method for trace norm minimization. In: Proceedings of the International Conference on Machine Learning (ICML), Montreal, Canada.
Liu, D.C., Nocedal, J., 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (3), 503-528.
Mallat, S.G., Zhang, Z., 1993. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41 (12), 3397-3415.
Needell, D., Tropp, J.A., 2008. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis 26 (3), 301-321.
Nocedal, J., 1980. Updating quasi-Newton matrices with limited storage. Mathematics of Computation 35 (151), 773-782.
Rai, P., Daumé III, H., 2010. Infinite predictor subspace models for multitask learning. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy.
Rennie, J.D.M., Srebro, N., 2005a. Fast maximum margin matrix factorization for collaborative prediction. In: Proceedings of the International Conference on Machine Learning (ICML), Bonn, Germany.
Rennie, J.D.M., Srebro, N., 2005b. Loss functions for preference levels: Regression with discrete ordered labels. In: Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
Schapire, R.E., Singer, Y., 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning 39, 135-168.
Srebro, N., Rennie, J.D.M., Jaakkola, T., 2005. Maximum-margin matrix factorization. In: Advances in Neural Information Processing Systems (NIPS), Vol. 17. MIT Press.
Sun, L., Ji, S., Ye, J., 2011. Canonical correlation analysis for multilabel classification: A least squares formulation, extension, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), 194-200.
Tai, F., Lin, H.T., 2010. Multi-label classification with principal label space transformation. In: Proceedings of the Second International Workshop on Learning from Multi-Label Data, Haifa, Israel.
Tai, F., Lin, H.T., 2012. Multi-label classification with principal label space transformation. Neural Computation (to appear).
Taskar, B., Guestrin, C., Koller, D., 2004. Max-margin Markov networks. In: Advances in Neural Information Processing Systems (NIPS), Vol. 16. MIT Press.
Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6, 1453-1484.
Tsoumakas, G., Katakis, I., Vlahavas, I., 2010. Mining multi-label data. In: Maimon, O., Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook. Springer, pp. 667-685.
Vapnik, V., 1998. Statistical Learning Theory. John Wiley and Sons, Inc., New York.
Ying, Y., Li, P., 2012. Distance metric learning with eigenvalue optimization. Journal of Machine Learning Research 13, 1-26.
Zhang, M.L., Zhou, Z.H., 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038-2048.