ARTICLE IN PRESS
JID: INS
[m3Gsc;September 28, 2015;8:35]
Information Sciences xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Information Sciences journal homepage: www.elsevier.com/locate/ins
Multi-label Lagrangian support vector machine with random block coordinate descent method Jianhua Xu∗
Q1
School of Computer Science and Technology, Nanjing Normal University, Nanjing, Jiangsu 210023, China
a r t i c l e
i n f o
Article history: Received 27 November 2014 Revised 15 August 2015 Accepted 11 September 2015 Available online xxx Keywords: Multi-label classification Support vector machine Kernel function Block coordinate descent method Quadratic programming
a b s t r a c t When all training instances and labels are considered all together in a single optimization problem, multi-label support and core vector machines (i.e., Rank-SVM and Rank-CVM) are formulated as quadratic programming (QP) problems with equality and bounded constraints, whose training procedures have a sub-linear convergence rate. Therefore it is highly desirable to design and implement a novel efficient SVM-type multi-label algorithm. In this paper, through applying pairwise constraints between relevant and irrelevant labels, and defining an approximate ranking loss, we generalize binary Lagrangian support vector machine (LSVM) to construct its multi-label form (Rank-LSVM), resulting into a strictly convex QP problem with non-negative constraints only. Particularly, each training instance is associated with a block of variables and all variables are divided naturally into manageable blocks. Consequently we build an efficient training procedure for Rank-LSVM using random block coordinate descent method with a linear convergence rate. Moreover a heuristic strategy is applied to reduce the number of support vectors. Experimental results on twelve data sets demonstrate that our method works better according to five performance measures, and averagely runs 15 and 107 times faster and has 9 and 15% fewer support vectors, compared with Rank-CVM and Rank-SVM. © 2015 Published by Elsevier Inc.
1
1. Introduction
2
Multi-label classification is a special supervised learning task, where any single instance possibly belongs to several classes simultaneously, and thus the classes are not mutually exclusive [2,36,37,48]. In the past ten years, such a classification issue has received a lot of attention because of many real world applications, e.g., text categorization [29,46], scene annotation [1,47], bioinformatics [43,46], and music emotion categorization [33]. Currently, there mainly are four strategies to design various multilabel classification algorithms: data decomposition, algorithm extension, hybrid, and ensemble strategies. Data decomposition strategy divides a multi-label data set into either one or more single-label (binary or multi-class) subsets, learns a sub-classifier for each subset using an existing classifier, and then integrates all sub-classifiers into an entire multi-label classifier. There exist two popular decomposition ways: one-versus-rest (OVR) or binary relevance (BR), and label powerset (LP) or label combination (CM) [2,36,37,48]. It is rapid to construct a data decomposition method since many popular single-label classifiers and their free software are available. But the label correlations are not characterized explicitly in OVR-type methods, and lots of new classes with a few training instances and no new predicting label combination are created in LP-type methods.
3 4 5 6 7 8 9 10 11 12
∗
Tel.: +86 25 96306209; fax: +86 25 85891990. E-mail address:
[email protected],
[email protected]
http://dx.doi.org/10.1016/j.ins.2015.09.023 0020-0255/© 2015 Published by Elsevier Inc.
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
JID: INS 2
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
ARTICLE IN PRESS
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
Algorithm extension strategy generalizes some specific multi-class algorithm to consider all training instances and classes (or labels) of training set all together. But this strategy possibly induces some complicated optimization problems, e.g., a largescale unconstrained problem in multi-label back-propagation neural networks (BP-MLL) [46] and two large-scale quadratic programming (QP) ones in multi-label support and core vector machines (Rank-SVM [9] and Rank-CVM [43]). Inspiringly, the label correlations of individual instance are depicted sufficiently via pairwise relations between relevant labels and irrelevant ones. Hybrid strategy not only extends or modifies an existing single-label method but also splits a multi-label data set into a series of subsets implicitly or explicitly. After the OVR trick is embedded, the famous k-nearest neighbor algorithm (kNN) is cascaded with discrete Bayesian rule in ML-kNN [47], logistic regression in IBLR-ML [5], fuzzy similarity in FSKNN [19] and casebased reasoning in MlCBR [25]. After executing feature extraction with principal component analysis and feature selection with genetic algorithm, multi-label naive Bayes (MLNB) utilizes OVR trick to estimate prior and conditional probabilities for each label [45]. Besides a relatively low computational cost, such a strategy weakly characterizes the label correlations either explicitly or implicitly. Ensemble strategy either generalizes an existing multi-class ensemble classifier, or realizes a new ensemble of the aforementioned three kinds of multi-label techniques. Two boosting-type multi-label classifiers (Adaboost.MH and Adaboost.MR) are derived from famous Adaboosting method [29] via minimizing Hamming loss and ranking one, respectively. Random k-labelsets (RAkEL) method splits an entire label set into several subsets of size k, trains LP classifiers and then constructs an ensemble multi-label algorithm [38]. Classifier chair (CC) builds an OVR classifier in a cascade way rather than a parallel one, and then its ensemble form (ECC) alleviates the possible effect of classifier order [27]. Variable pairwise constraint projection for multi-label ensemble (VPCME) combines feature extraction based on variable pairwise constraint projection with boosting-type ensemble strategy [20]. In [21], random forest of predictive cluster trees (RF-PCT) is strongly recommended due to its good performance in an extensive experimental comparison, including ECC, RAkEL, ML-kNN and etc. Generally, these ensemble methods spend more training and testing time to improve their classification performance. Now it is widely recognized that algorithm extension strategy considers as many label correlations as possible, which is an optimal way to enhance multi-label classification performance further [6]. But, its corresponding methods have a relatively high computational cost, which limits their usability for many real world applications. Consequently, it is still imperative to design and implement some novel efficient multi-label classifiers. In this paper, we focus on SVM-type multi-label techniques. Generally, SVM-type multi-label methods are formulated as QP problems with some different constraint conditions which further are associated with distinct optimization procedures. Rank-SVM [9] includes several equality and many box constraints. When Rank-SVM is solved by Frank-Wolfe method (FWM) [10,15], a large-scale linear programming is dealt with at each iteration. 
As a variant of Rank-SVM , Rank-CVM [43] involves a unit simplex and many non-negative constraints. Because of its special constraints, at each iteration in FWM, there exist a closed solution and several efficient recursive formulae for Rank-CVM. Although FWM has a sub-linear convergence rate, the experimental results demonstrate that Rank-CVM has a lower computational cost than Rank-SVM. In [44], random block coordinate descent method (RBCDM) with a sub-linear convergence rate is used for Rank-SVM, where at each iteration a small-scale QP sub-problem with equality and box constraints is still solved by FWM due to equality constraint limit. The experimental results show that RCBDM runs three times faster than FWM for Rank-SVM. The success of these methods inspires us to design a special QP problem with some particular constraints and then to construct its efficient optimization solution procedure. In the past twenty years, it is widely accepted that binary SVM [39] is one of the most successful classification techniques [42]. The success of SVM attracts many researchers to develop its variants with special optimization forms to reduce as many computational costs as possible, e.g., Lagrangian support vector machine (LSVM) [22], proximal SVM (PSVM)[11] and its privacy preserving version (P3SVM) [31], twin SVM (TWSVM) [18] and its sparse and Laplacian smooth forms [4,26], support tensor machine (STM) [32] and its multiple rank multi-linear kernel version [13]. Particularly, binary LSVM [22] is formulated as a QP problem with non-negative constraints only, and then is solved by an iterative procedure with inverse matrix, which has a linear convergence rate. In this paper, we generalize LSVM to construct its multi-label version: Rank-LSVM, which has the same non-negative constraints as those in LSVM and the same number of variables to be solved as that in Rank-SVM and Rank-CVM. Since each training instance is associated with a block of variables and thus all variables are split naturally into many manageable blocks in Rank-LSVM, we implement an efficient training procedure using random block coordinate descent method (RBCDM) with a linear convergence rate [24,28]. Further, to reduce the number of support vectors to speed up the training and testing procedures, we apply a heuristic strategy to avoid updating some training instances which result in no ranking loss. Experimental results on 12 benchmark data sets illustrate that our method is a competitive candidate for multi-label classification according to five performance measures, compared with four existing techniques: Rank-CVM [43], Rank-SVM [9], BP-MLL [46], and RF-PCT [21]. Moreover, our Rank-CVM runs averagely 15 and 107 times faster, has 17 and 9% fewer support vectors than Rank-SVM and Rank-CVM in the training phase. The rest of this paper is organized as follows. Multi-label classification setting is introduced in Section 2. In Sections 3 and 4, our Rank-LSVM is proposed after binary LSVM is reviewed briefly and then an efficient training algorithm is constructed and analyzed. In Section 5, we formally analyze the relation between our Rank-LSVM and two previous SVM-type algorithms (i.e., Rank-SVM and Rank-CVM). Section 6 is devoted to experiments with 12 benchmark data sets. This paper ends with some conclusions finally. Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS
J. Xu / Information Sciences xxx (2015) xxx–xxx
[m3Gsc;September 28, 2015;8:35] 3
Fig. 1. Schematic illustration of LSVM.
74 75 76 77 78
2. Multi-label classification setting Let Rd be a d-dimensional input real space and Q = {1, 2, . . . , q} a finite set of q class labels. Further, assume that each instance x ∈ Rd can be associated with a set of relevant labels L ⊆ 2Q (all possible subsets of Q). Additionally, the complement of L, i.e., L¯ = Q \L, is referred to as a set of irrelevant labels of x. Given a training data set of size l drawn identically and independently from an unknown probability distribution (i.i.d.) on Rd × 2Q , i.e.,
{(x1 , L1 ), . . ., (xi , Li ), . . ., (xl , Ll )}, 79 80 81 82 83 84 85 86
(1) Rd
2Q
the multi-label classification problem is to learn a classifier f(x): → that generalizes well on both these training instances and unseen ones in the sense of optimizing some expected risk functional with respect to a specific empirical loss function [9,43,46]. In many traditional q-class single-label classification methods, a widely used trick is to learn q discriminant functions fi (x) : Rd → R, i = 1, . . ., q such that fk (x) > fi (x), i = k, if x ∈ class k [8]. For multi-label classification, as an extension of ¯ which implies that any relevant label should multi-class classification, such a trick is generalized as fk (x) > fi (x), k ∈ L, i ∈ L, be ranked higher than any irrelevant one [9,43]. In this case, the multi-label prediction can be performed through a proper threshold function t(x),
f (x) = {k| fk (x) ≥ t (x), k = 1, . . ., q}.
(2)
89
Now there mainly are three kinds of thresholds: a constant (e.g., 0 for -1/+1 setting) [1], a linear regression model associated with q discriminant function values [9,43,46], and an additional discriminant function for a calibrated label [12]. In the last two cases, t(x) is dependent on x either directly or implicitly.
90
3. Multi-label Lagrangian support vector machine
91 92
In this section, we review binary Lagrangian support vector machine (LSVM) briefly, and then propose our multi-label Lagrangian support vector machine (Rank-LSVM).
93
3.1. Binary Lagrangian support vector machine
94
Binary Lagrangian support vector machine (LSVM) is introduced in [22], as a variant of traditional binary SVM [39]. Let an i.i.d. binary training set of size l be,
87 88
95
{(x1 , y1 ), . . ., (xi , yi ), . . ., (xl , yl )}, 96 97
f (x) = wT x + b, 98 99 100 101
(3)
where xi ∈ Rd and yi ∈ {+1, −1} denote the ith training instance vector and its binary label. In the original space, a linear discriminant function is defined as, (4)
Rd
where w ∈ and b ∈ R represent the weight vector and bias term respectively, as shown in Fig. 1. For the linearly separable case, the hyperplane f (x) = 0 separates positive instances from negative ones with a margin 1/[wT , b]T 2 in LSVM, where ·2 indicates 2-norm of vector. This margin is associated with both w and b, and is maximized via minimizing wT w + b2 . The primary form of LSVM for the nonlinearly separable case is formulated as,
1 T 1 (w w + b2 ) + C ξi2 , 2 2 l
min
i=1
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS 4
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
Fig. 2. Three possible relations between a pair of labels in Rank-LSVM.
s. t. yi (wT xi + b) ≥ 1 − ξi , i = 1, . . ., l, 102
(5)
where C > 0 denotes a regularization constant to control the tradeoff between classification errors and model complexity, and
103
ξ i is a slack variable to indicate classification error of xi . Note that here ξ i ≥ 0 is redundant due to its squared hinge loss in the
104
objective function. The dual form of the above problem (5) is,
1 T 1 α K ◦ yyT + yyT + I 2 C s.t. α ≥ 0,
min
1 2
α − uT α , = α T α − uT α , (6)
110
where K = [xTi x j |i, j = 1, . . ., l] and I denote the positive semi-definite kernel and unit matrices of size l × l, respectively, y = [y1 , . . ., yl ]T indicates the binary label vector, the operator ’◦’ represents the Hadamard product, α = [α1 , . . ., αl ]T is the l Lagrangian multipliers to be solved, and u = [1, . . ., 1]T and 0 = [0, . . ., 0]T are two column vectors with l ones and zeros, respectively. It is attractive that = K ◦ yyT + yyT + C1 I is a positive definite matrix. Since LSVM (6) involves a strictly convex objective function and l non-negative constraints, it is solved by an iterative procedure with inverse matrix, whose convergence rate is linear [22].
111
3.2. Multi-label Lagrangian support vector machine
112
In this sub-section, we extend the above LSVM to construct its multi-label version: multi-label Lagrangian support vector machine or Rank-LSVM simply. Now, in the original input space, q linear discriminant functions are defined as,
105 106 107 108 109
113
fk (x) = wTk x + bk , k = 1, . . ., q, 114 115 116 117
where wk and bk denote the weight vector and bias term for the kth class or label. As mentioned in Section 2, it is highly desirable that any relevant label should be ranked higher than any irrelevant one, which could be depicted using the following pairwise constraint for a pair of relevant and irrelevant labels (m, n) ∈ (Li × L¯ i ) of the ith training instance xi from the multi-label training set (1),
fm (xi ) − fn (xi ) = (wm − wn )T xi + (bm − bn ) > 0. 118 119
121 122 123 124
(8)
In case such a perfect case does not occur, a slack variable ξ imn is considered as in binary LSVM. Therefore, the above inequality constraint (8) can be rewritten as,
fm (xi ) − fn (xi ) = (wm − wn )T xi + (bm − bn ) ≥ 1 − ξimn , 120
(7)
(9)
where 1 is added to force the difference to be more than 1, just as in binary LSVM (5). As shown in Fig. 2, there exist three cases to illustrate the possible relations between any relevant label and any irrelevant one: (a) fm (xi ) − fn (xi ) ≥ 1 and ξimn = 0; (b) 0 < fm (xi ) − fn (xi ) < 1 and 0 < ξ imn < 1; (c) fm (xi ) − fn (xi ) ≤ 0 and ξ imn ≥ 1. Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
125 126 127
5
According to the ranking loss (refer to its definition in Section 6), the first two cases do not result in any ranking loss value, whereas the last one does induce a ranking loss 1. But, regardless of any case, its ranking loss value is less than or equal to its 2 . Therefore, an approximate ranking loss function for Rank-LSVM can be defined as, squared slack variable ξimn
Approximate ranking loss =
l 1 1 ¯ l | L || | L i i (m,n)∈ i=1
2 ξimn ,
(10)
(Li ×L¯ i )
128 129 130
which is regarded as an empirical loss term when 1/l is removed in our Rank-LSVM. On the other hand, we constitute a model q regularization term k=1 (wTk wk + b2k )/2 for our Rank-LSVM in terms of binary LSVM principle. Now we formulate the primary form of our Rank-LSVM as, q l 1 T 1 1 (wk wk + b2k ) + C 2 2 |Li ||L¯ i | (m,n)∈ i=1 k=1
min
2 ξimn ,
(Li ×L¯ i )
s.t. 131 132 133
(wm − wn ) xi + (bm − bn ) ≥ 1 − ξimn , (m, n) ∈ (Li × L¯ i ), i = 1, . . ., l,
+1, if k = m, −1, if k = n, ; k = 1, 2, . . ., q. 0, otherwise,
⎡ −1
⎢ +1 k cimn =⎢ 0 ⎣ 0 0 135
(12)
Here we give an example to detect such a quantity. When q = 5, Li = {2, 3} and L¯ i = {1, 4, 5}, we have
mn = { 21
136
(11)
where C represents a regularization constant to control the tradeoff between model complexity and ranking loss. Note that ξ imn ≥ 0 holds true naturally due to its squared hinge loss in (11), as seen below. To simplify the inequality constraints in (11), we define a new notion, k cimn =
134
T
24
25
0 +1 0 −1 0
31 −1 0 +1 0 0
0 +1 0 0 −1
35 },
34 0 0 +1 −1 0
⎤ ⎧ ⎫
0 ⎪1⎪ ⎨2⎪ ⎬ 0 ⎥ ⎪ +1 ⎥, 3 = k, ⎦ ⎪ ⎪ ⎪ 0 ⎩4⎪ ⎭ −1 5
(13)
where each column indicates relevant and irrelevant labels with +1 and −1, respectively. Therefore a concise inequality constraint form for (11) could be rewritten as,
(wm − wn )T xi + (bm − bn ) =
q
k cimn (wTk xi + bk ) ≥ 1 − ξimn .
(14)
k=1
137 138
The dual problem of (11) can be derived using the standard Lagrangian technique. Let α imn ≥ 0 be the Lagrangian multipliers for all inequality constraints (14). The Lagrangian function for (11) becomes, q l 1 T 1 L= (wk wk + b2k ) + Ci 2 2 (m,n)∈ i=1
k=1
139 140
ξ
2 imn
−
(Li ×L¯ i )
l i=1 (m,n)∈
(Li ×L¯ i )
αimn
q
k cimn
wTk xi
+ bk − 1 + ξimn ,
(15)
k=1
where Ci = C/|Li ||L¯ i |. The Karush–Kuhn–Tucker (KKT) conditions for this primary problem require the following relations to be true,
⎛
∂L = 0 ⇒ wk = ∂ wk 141
l i=1
⎞
⎝
k cimn αimn ⎠xi =
(m,n)∈ (Li ×L¯ i )
l
βki xi ,
(16)
i=1
l l ∂L k = 0 ⇒ bk = cimn αimn = βki , ∂ bk i=1 (m,n)∈ i=1
(17)
∂L = 0 ⇒ αimn = Ci ξimn , ∂ξimn
(18)
(Li ×L¯ i )
142
143
where,
βki =
k cimn αimn .
(19)
(m,n)∈ (Li ×L¯ i )
144
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS 6
145 146
J. Xu / Information Sciences xxx (2015) xxx–xxx
The relation (18) shows that ξ imn ≥ 0 naturally holds owing to α imn ≥ 0 and Ci > 0. By introducing the above KKT conditions (16)–(18) into the Lagrangian function (15), the dual form is,
⎛
q l 1 min 2
k=1 i, j=1
q l 1 βki βk j (xTi x j ) + 2
k=1 i, j=1
⎞
l 1 1 ⎝ βki βk j + 2 Ci (m,n)∈ i=1
2 ⎠ αimn −
(Li ×L¯ i )
s.t. αimn ≥ 0, (m, n) ∈ (Li × L¯ i ), i = 1, . . . , l. 147 148
l
βki (xT xi ) + bk =
i=1
150 151 152 153 154
l
αimn ,
i=1 (m,n)∈
(Li ×L¯ i )
(20)
Therefore our Rank-LSVM is formulated as a quadratic programming problem with non-negative constraints only, as in LSVM (6). According to (16) and (17), the discriminant functions (7) could be rewritten as,
fk (x) = 149
[m3Gsc;September 28, 2015;8:35]
l
βki (xT xi + 1), k = 1, . . ., q,
(21)
i=1
which depend directly on β ki . For the ith instance xi , its βki (k = 1, . . ., q) are determined by its corresponding original variables αimn , (m, n) ∈ (Li × L¯ i ) via (19). This implies that each training instance is associated with a block of variables and thus all variables in (20) are naturally divided into l blocks. Now we transform the above (20) into a concise representation with a block of variables from each training instance. For k the ith instance, we construct a column vector αi with α imn , a row vector hki with cimn and a unit vector ui , whose length is li = |Li | × L¯ i , and a matrix Hi of size q × li using hki , i.e.,
T αi = [αimn |(m, n) ∈ (Li × L¯ i )]T = αij | j = 1, . . ., li , k hki = cimn |(m, n) ∈ (Li × L¯ i ) = hkij | j = 1, . . ., li , ui = [1, . . ., 1]T ,
T
T
H i = h1i , . . ., hqi 155
T
.
(22)
According to (22), the formulae (19) and (20) could be simplified into,
βki = hki αi , βi = [β1i , ..., βqi ]T = H i αi , 156
and
min F (α) =
l 1 2
αTi i j α j −
i, j=1
s.t. 157
159 160 161
l
uTi αi ,
i=1
αi ≥ 0, i = 1, . . ., l,
where,
i j = 158
(23)
(24)
(H Ti H j )(xTi x j + 1), if i = j, (H Ti H i )(xTi xi + 1) + Ii /Ci , if i = j,
(25)
with a unit matrix Ii of size li × li . Note that ii is positive definite. In order to validate whether the above objective function F(α) in (24) is strictly convex, we construct a more concise representation with an entire solution vector. In terms of (22), we define an entire solution vector α and a unit vector u, whose length is lt = li=1 li , a new matrix H of size q × lt , and a kernel matrix K and a diagonal matrix D of size lt × lt ,
T α = αT1 , . . ., αTi , . . ., αTl , T T T T u = u1 , . . ., ui , . . ., ul
,
H = [H 1 , . . ., H i , . . ., H l ], K =
xTi x j
ui uTj |i, j = 1, . . ., l ,
D = diag(u1 /C1 , . . . , ui /Ci , . . . , ul /Cl ). 162
In this case, the above QP problem (24) could be rewritten as the following form,
min F (α) = s.t. 163
(26)
α ≥ 0,
1 T α α − uT α , 2 (27)
where,
= (H T H ) ◦ K + (H T H ) + D.
(28)
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS
J. Xu / Information Sciences xxx (2015) xxx–xxx
164 165 166 167 168 169 170 171 172 173 174 175 176 177
[m3Gsc;September 28, 2015;8:35] 7
The diagonal matrix D is positive definite since its diagonal elements all are positive. For any vector α, αT (H T H )α = (H α)T (H α) = H T α22 ≥ 0, so the matrix HT H is positive semi-definite [14]. For three widely-used kernel functions (linear, polynomial and RBF ), their corresponding kernel matrix K = [xTi x j |i, j = 1, . . ., l] is positive semi-definite [39], therefore the K is also positive semi-positive. Because the Hadamard product of two semi-positive matrices is still semi-definite [14], the first matrix at the right side in (28) is positive semi-definite. Therefore the matrix is positive definite, which means that the objective function F(α) in (27) is strictly convex. To determine the threshold function t(x) in (2), we utilize a linear regression technique in Rank-SVM [9] and Rank-CVM [43]. Each training instance xi is converted into a q-dimensional vector [ f1 (xi ), . . .., fq (xi )]T using (21), and then an optimal threshold t∗ (xi ) is determined via minimizing the Hamming loss. A linear regression threshold function is trained, i.e., q t (x) = k=1 sk fk (x) + s0 , where si (i = 0, 1, . . ., q) represent the regression coefficients. Finally we use this threshold form to detect the relevant labels in the testing phase. Additionally, we could build the non-linear kernel version of Rank-LSVM through substituting various Mercer kernels K(xi , xj ) for the dot product (xTi x j ) between two vectors in (21) and (25), as in binary SVM and LSVM [22,39]. According to (23), we define a proxy solution vectors β with its original solution vector α in (26),
β = βT1 , . . ., βTi , . . ., βTl
!T
,
(29)
186
whose length is lq only. In this paper, the ith training instance xi becomes a support vector if only its βi = 0, which stems from its αi = 0. Reversely, any non-zero component of αi results in a non-zero βi , which induces a support vector xi . In principle, our Rank-LSVM (24) could be still solved using the iterative procedure with inverse matrix in [22]. But there are two drawbacks: to calculate a large-scale inverse matrix of size lt × lt , and to derive an almost dense original solution vector α. An alternative optimization method is random coordinate descent method (RCDM) with a shrinking strategy used in [17]. However, RCDM is to update all variables according to a random order, and the shrinking procedure is applied to the components of α , thus the original solution vector α is possibly sparse, but the proxy solution vector β is not sparse yet. To achieve an as sparse proxy solution vector as possible, we will solve our Rank-LSVM using random block coordinate descent method with a heuristic shrinking strategy in the next section.
187
4. An efficient training algorithm for Rank-LSVM
188 189
In this section, we design and implement an efficient training procedure based on random block coordinate descent method with a shrinking strategy for our Rank-LSVM (24).
190
4.1. Random Block Coordinate Descent Method for Rank-LSVM
191
The block coordinate descent method (BCDM) splits all variables to be solved into some manageable blocks, and then updates a single block only at each iteration when the remaining blocks are fixed [41]. Concretely speaking, BCDM consists of outer epochs and inner iterations. At its each epoch all blocks of variables are updated correspondingly in terms of cyclic or random order, and at its each iteration a single block of variables is optimized. Currently, there exist two widely used versions: cyclic and random BCDM, simply CBCDM and RBCDM, in this paper. A number of experiments show that RBCDM is more efficient than CBCDM for large-scale optimization problems [3,17,24,30]. Hence, we apply RBCDM to solve our Rank-LSVM (24) in this study. In the above section, it is pointed out that all variables to be solved in Rank-LSVM (24) are naturally divided into l blocks, i.e., αi (i = 1, . . ., l ). Let the initial indexes of l instances be W (0) = {1, 2, . . ., l }. At the pth epoch, according to uniform distribution law, we permute randomly W ( p−1) to W ( p) = {π (1), . . ., π (t ), . . ., π (l )}( p = 1, 2, . . .). At the tth iteration, the ith block of variables (i.e., αi , i = π (t )) is selected and optimized, and the other blocks are fixed, i.e.,
178 179 180 181 182 183 184 185
192 193 194 195 196 197 198 199 200 201
( p,t )
αj 202
=
α(j p,t−1) , if j = i, α(i p,t−1) + αi , if j = i,
(30)
In this case, the optimization problem (24) could be simplified as,
min F (α( p,t ) ) = F (α( p,t−1) ) + s.t. 203
j = 1, . . ., l.
T
( α i ) ,
α(i p,t−1) + αi ≥ 0,
( p,t−1)
where gi
1 (αi )T ii (αi ) + g(i p,t−1) 2
(31) ( p,t−1)
indicates the gradient vector with respect to αi
, i.e.,
l α(i p,t−1) ∂F ( p,t−1) T T = H H α x x + 1 + − ui j j i i j Ci ∂α(i p,t−1) j=1 l α( p,t−1) β(j p,t−1) xTi x j + 1 + i = H Ti − ui ,
g(i p,t−1) =
j=1
Ci
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS 8
J. Xu / Information Sciences xxx (2015) xxx–xxx
= H Ti [ f1 (xi ), . . ., fq (xi )]T + 204
where β
( p,t−1)
( p,t )
βj 205 206 207
[m3Gsc;September 28, 2015;8:35]
Ci
− ui ,
(32)
corresponds to α( p,t−1) via (23). After solving (31), only βi needs to be updated, i.e.,
=
α(i p,t−1)
β(j p,t−1) , if j = i, j = 1, . . ., l. β(i p,t−1) + H i αi , if j = i,
(33)
When all l sub-problems (t = 1, . . ., l ) are solved sequentially, we obtain the original and proxy solution vectors at the end ( p)
of the pth epoch, i.e., α( p) = α( p,l ) and β Theorem 1 on its linear convergence rate.
=β
( p,l )
. According to [28] work, the RBCDM for our Rank-LSVM has the following
212
Theorem 1. For Rank-LSVM, its solution vector α(p, t) in (30) from random block coordinate descent method (RBCDM) converges linearly. When the size of block is 1, the above BCDM is reduced into coordinate descent method (CDM) [16]. In this study, we use cyclic CDM (i.e.,CCDM) to solve the QP sub-problem (31) since CCDM runs slightly faster than its corresponding random version experimentally for a small-scale problem. On the linear convergence rate of CCDM, we have the following Theorem 2.
213
Theorem 2. For Rank-LSVM, its solution αi in (31) from cyclic coordinate descent method (CCDM) converges linearly.
208 209 210 211
214 215
The above two theorems will be proved in Appendices Appendix A and Appendix B. 4.2. A heuristic shrinking strategy for Rank-LSVM (0)
218
= 0 (i = 1, . . ., l ). For the ith instance xi , if αi = 0 ( p = 1, 2, . . .) during the entire optimization procedure, such a vector xi is not a support vector due to its corresponding βi = 0. However, in the above RBCDM for Rank-LSVM, it is possible that all training instances become support vectors in practice.
219
To enhance the sparseness of the original and proxy solution vectors, we design a heuristic shrinking rule as follows,
216 217
In our Rank-LSVM, the original solution vector is initialized to be zero, i.e., αi
( p)
αi = 0, if α(i p,t−1) = 0 and gmin > −τ , i 220 221 222 223
where gmin is the minimum component of the gradient vector gi in (32) and τ represents a gradient threshold. When such i a rule is satisfied, we do not solve the QP sub-problem (31) associated with the training instance xi . ( p,t−1) Now we analyze (31) and (32) to determine a proper interval for τ . If αi = 0, its corresponding gradient vector (32) could be simplified into,
g(i p,t−1) = H Ti [ f1 (xi ), . . ., fq (xi )]T − ui , 224 225 226 227 228 229 230 231 232 233 234 235 236 237
(34) ( p,t−1)
(35)
where the first term measures the difference of discriminant function values between any relevant label and any irrelevant one, i.e., fm (xi ) − fn (xi ) (m, n) ∈ (Li × L¯ i ). This gradient vector also decides whether αi = 0 or not. Next we consider three cases, ( p,t−1) (a) gi ≥ 0, which means that gmin ≥ 0. According to (35), we have fm (xi ) − fn (xi ) ≥ 1, which represents that all relevant i labels are ranked 1 higher than all irrelevant ones and thus the constraints (9) hold true with ξimn = 0. This is a perfect case with αi = 0 and without a ranking loss. On the other hand, from (31), we also obtain αi = 0 since Qii is positive definite. ( p,t−1) (b) gi > −ui , i.e., gmin > −1. At least, we obtain fm (xi ) > fn (xi ), which implies that all possible relevant-irrelevant label i pairs are correctly ranked and the constraints (8) are true. On the basis of (31), αi = 0. But no ranking loss occurs. (c) The others, i.e., gmin ≤ −1. Some fm (xi ) ≤ fn (xi ) happen, therefore αi = 0 and further a ranking loss would be derived. i Therefore the τ value in (31) can belong to three disjoint intervals: (i) τ ≤ 0. According to (31) and (34), the shrinking rule does not work for the last two cases (b) and (c), so the number of support vectors could not be reduced. (ii) 0 < τ ≤ 1. Some instances whose relevant-irrelevant label pairs are ranked correctly become non-support vectors, which can improve the sparseness of original and proxy solution vectors. (0) (iii) τ > 1. Since the initial gradient vector gi = −ui , our training procedure is terminated after one epoch and
242
α(i 1) = 0 (i = 1, . . ., l ) are outputted. But it is necessary to solve (31) to build the satisfactory entire original and proxy solution vectors α and β. Based on the above analysis, we determine τ ∈ [0, 1], where τ = 0 implies that no shrinking strategy is applied. Theoretically, the larger the threshold τ is, the sparser the initial and proxy solution vectors α and β are. Experimentally, our rule (34) could force many training instances to be non-support vectors (βi = 0) and thus could reduce the computational cost in the training
243
and testing stages effectively.
244
4.3. An efficient training algorithm for Rank-LSVM
245
Now, we summarize our efficient training algorithm based on RBCDM with a heuristic shrinking strategy for Rank-LSVM as the Algorithm 1. Here, two stopping indexes (M and ε ) are adopted simultaneously, in which the former indicates the maximal
238 239 240 241
246
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS
J. Xu / Information Sciences xxx (2015) xxx–xxx
[m3Gsc;September 28, 2015;8:35] 9
Algorithm 1 A training procedure with RBCDM for Rank-LSVM. (0)
1. Set the initial α(0) = 0 and β = 0, and choose random seed s, gradient threshold τ , maximal epochs M, and stopping tolerance ε . 2. Calculate all small matrixes Q ii 3. Let the instance index set be W (0) = {1, 2, . . ., l }. 4. For p = 1, 2, . . ., M. 4-1. Permute W ( p−1) to W ( p) using uniform distribution law. 4-2. For t = 1, . . ., l. 4-2-1. i = π (t ). 4-2-2. Construct a QP sub-problem (31). 4-2-3. If (34) holds, then next t. 4-2-4. Solve (31) using CCDM to obtain αi . 4-2-5. Update βi using (33). 4-3. If (36) is satisfied, then stop.
247
number of epochs and the latter is the relative tolerance for the proxy solution vector β, i.e.,
β( p) − β( p−1) 2 ≤ ε, β( p) 2
(36)
248
Additionally, the random seed is denoted by s.
249
5. Relation with two existing multi-label SVM-type algorithms
250
In this section, we briefly review two existing multi-label SVM-type algorithms: multi-label support vector machine (RankSVM) [9] and multi-label core vector machine (Rank-CVM) [43], which are closely related to our Rank-LSVM. The Rank-SVM [9] extends multi-class SVM in [39], whose original form based on our notions in the above two sections is formulated as,
251 252 253
min
q l 1 T 1 wk wk + C ¯ 2 |Li ||Li | (m,n)∈ i=1 k=1
ξimn ,
(Li ×L¯ i )
s.t.
q
k cimn wTk xi + bk ≥ 1 − ξimn ,
ξimn ≥ 0, (m, n) ∈ (Li × L¯ i ), i = 1, . . ., l.
(37)
k=1
254 255 256
Compared with our Rank-LSVM, the bias terms bk in Rank-SVM are not be regularized, and the slack variables ξ imn ≥ 0 are applied due to their linear hinge loss form in (37). The dual version of Rank-SVM is derived using the Lagrangian technique as the following concise vector-matrix representation,
1 T α ((H T H ) ◦ K )α − uT α, 2 s.t. H α = 0, 0 ≤ α ≤ C,
min F (α) =
257 258 259
(38)
and Ci = C/|Li ||L¯ i |. It is observed that Rank-SVM (38) includes q equality and lt box where the column vector C = constraints, and the same linear term as our Rank-LSVM. Note that the objective function in (38) is only convex theoretically. The Rank-CVM [43] is derived from binary core vector machine [34,35], whose original version is described as, [C1 uT1
min
. . . Cl uTl]T
q l 1 T 1 1 (wk wk + b2k ) − νρ + C ¯ 2 2 |Li ||Li | (m,n)∈ i=1 k=1
2 ξimn ,
(Li ×L¯ i )
q
s.t.
k cimn (wTk xi + bk ) ≥ ρ − ξimn ,
ρ > 0, (m, n) ∈ (Li × L¯ i ), i = 1, . . ., l,
(39)
k=1
260 261 262
where ρ is to replace 1 in Rank-SVM and our Rank-LSVM to depict the margin, and ν is to combine the model regularization term 1 q (wTk wk + b2k ) with the margin ρ . As in our Rank-LSVM, here ξ imn ≥ 0 hold naturally due to their squared hinge loss forms 2 k=1 in (39). The dual problem of Rank-CVM becomes,
1 T α ((H T H )K + H T H + D)α, 2 s.t. uT α = 1, α ≥ 0,
min F (α) =
(40)
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS 10
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
277
The Rank-CVM has a unit simplex constraint and lt non-negative ones, and the same quadratic term as our Rank-LSVM. It is noted that the objective function in (40) is strictly convex theoretically. Additionally, both Rank-SVM and Rank-CVM need to learn a threshold function as introduced in our Rank-LSVM. Therefore we observe that there are three aspects: (a) the number of variables to be solved (i.e., lt ), (b) linear or kernel discriminant functions, and (c) threshold function form to be trained, in which Rank-SVM, Rank-CVM and our Rank-LSVM are identical. The main distinction among such three multi-label methods is their constraints, which could result in different optimization procedures. The Rank-SVM [9] is primarily solved by Frank-Wolfe method (FWM) with a sub-linear convergence rate [10,15], in which a large-scale linear programming problem is dealt with at its each iteration. In [44], a random block coordinate descent method (RBCDM) is to cope with Rank-SVM, where a small-scale QP sub-problem of size (q+1) at least is still solved by FWM due to the q equality constraints. According to the theoretical analysis in [23], this RBCDM for Rank-SVM has only a sub-linear convergence rate because of the q equality constraints and lt box ones. The Rank-CVM is also solved by FWM in [43], where due to its unit simplex constraint and non-negative ones a closed solution vector for a large-scale LP problem is achieved and several recursive formulae are derived. In order to achieve an -accuracy solution, the iteration complexity or the number of iterations depends on O(1/ ) in Rank-SVM and Rank-CVM , which is greater than O(log(1/ )) in our Rank-LSVM.
278
6. Experiments
279 281
In this section, we compare our Rank-LSVM with four existing multi-label classification approaches including Rank-CVM, Rank-SVM, BP-MLL and RF-PCT experimentally. Before presenting our experimental results, we briefly introduce five performance evaluation measures, four existing methods and twelve benchmark data sets.
282
6.1. Performance evaluation measures
283
Since it is more complicated to evaluate a multi-label classification algorithm than a single-label one, more than ten performance evaluation measures have been collected in [2,37,48]. In this paper, we choose the same five popular and indicative measures: coverage, one error, average precision, ranking loss and Hamming loss, as in [5,43,46,47]. Assume a testing data set of size m to be {(x1 , L1 ), . . ., (xi , Li ), . . ., (xm , Lm )}. Given the ith instance xi , its q discriminant function values and predicted set of relevant labels from some multi-label algorithm are denoted by fkP (xi )(k = 1, . . ., q) and LPi ⊆ 2Q , respectively. The coverage estimates how far we need, on average, to go down the list of labels to cover all relevant labels of the instance, which is divided by the number of labels to define a normalized form in this study,
263 264 265 266 267 268 269 270 271 272 273 274 275 276
280
284 285 286 287 288 289
Normalized coverage ( ↓ ) =
m 1 |C (xi )| − 1 , m q
(41)
i=1
290 291
where C (xi ) = {k| fkP (xi ) ≥ fkP (xi ), k ∈ Q } and k = {k| min fkP (xi ), k ∈ Li }. The one error evaluates how many times that the topranked label is not one of relevant labels,
" .
m 1 P One error( ↓ ) = fkP (xi ) k ∈/ Li | fk (xi ) = max m k ∈Q i=1
292 293
The average precision calculates the average fraction of labels ranked above a specific relevant label k ∈ Li , which actually are in Li , i.e., m 1 Average precision( ↑ ) = m i=1
294 295
1 |Li | k∈Li
# $ k ∈ Li | f P (xi ) ≥ f P (xi ) k k # $ . k ∈ Q | f P (xi ) ≥ f P (xi ) k k
(43)
The ranking loss computes the average fraction of labels pairs (a relevant label versus an irrelevant one) that are not correctly ordered for the instance,
# $ (k, k ) ∈ (Li × L¯ i )| f P (xi ) ≤ f P (xi ) k k . |Li ||L¯ i | i=1
m 1 Ranking loss( ↓ ) = m
296
(42)
(44)
The Hamming loss estimates the percentage of labels, whose relevance is predicted incorrectly,
m P 1 Li Li Hamming loss( ↓ ) = , m q
(45)
i=1
297 298 299 300
where is the symmetric difference between two sets. It is desirable that a multi-label algorithm should achieve a higher value for the average precision measure, and lower values for the other four ones, as indicated by the up and down arrows in their definitions (41)–(45). These measures (41)–(45) originally range from 0 to 1, and then are converted into percentage in the next experiments. Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
11
Table 1 Statistics of 12 data sets in our experiments. Data set
Domain
#Train
#Test
#Feature
#Class
LC
LD(%)
#Variable
Emotions Image Birds Scene Plant Genbase Human Medical Slashdot Yeast Langlog Enron
Music Scene Audio Scene Biology Biology Biology Text Text Biology Text Text
391 1200 175 1211 588 463 1862 645 2269 1500 751 1123
202 800 172 1196 390 199 1244 333 1513 917 502 579
72 294 260 294 440 1185 440 1449 1079 103 1004 1001
6 5 19 6 12 27 14 45 22 14 75 53
1.87 1.24 1.91 1.07 1.08 1.35 1.19 1.25 1.18 4.24 1.40 3.38
31.17 24.80 10.05 17.83 9.00 5.00 8.50 2.78 5.36 30.29 1.87 6.38
2793 5342 5638 6278 6959 14808 28036 34878 55306 58248 76838 186018
304
To sort the performance of multiple methods for a given measure, we compare two methods (A and B) using paired Wilcoxon sign ranked test with 5% significance level [7]. When two methods perform statistically equally, they are assigned to 0.5 score. If the algorithm A is better than the B, they are allocated to 1 and 0 score, respectively. Finally, we accumulate the score (denoted by W-test) of each method according to all possible pairwise test results.
305
6.2. Four existing multi-label methods
306
325
In this paper, we selected four existing multi-label classification methods: Rank-CVM [43], Rank-SVM [9], BP-MLL [46], and RF-PCT [21], which will be compared with our Rank-LSVM experimentally. It is noted that RF-PCT represents tree-type ensemble approach, which performs the best in an extensive experimental comparison including 12 different methods and 11 diverse data sets in [21] and then is strongly recommended to be compared with any novel proposed method. The other three methods and our Rank-LSVM belong to algorithm extension methods considering all training instances and all labels via all possible pairwise constraints or comparisons between relevant labels and irrelevant ones at the same time. BP-MLL is coded in Matlab from1 , whose recommended parameter settings are that the learning rate is fixed at 0.05, the number of hidden neurons is 0.2d, the training epochs is fixed to be 100 and the regularization constant is set to be 0.1. We downloaded the Java package of RF-PCT from2 , and then added an additional C/C++ program to calculated some performance measures. The strongly recommended parameters are that the size of feature subsets is to round off 0.1d + 1, the number of iterations is 100, and the trees are fully grown, in RF-PCT. Three SVM-type algorithms: our Rank-LSVM, Rank-CVM, Rank-SVM are coded using C/C++ language. The source codes of Rank-SVM and Rank-CVM have been involved in our free software package MLC-SVM 3 . Additionally, their performance depends on the kernel forms and parameters, and regularization constant C. The RBF kernel K (x, y) = exp (−γ x − y22 ) is validated only in this section, where γ denotes the kernel scale factor. In this study, Rank-SVM is solved by RBCDM with a size of block 40(q + 1) [44] and Rank-CVM by FWM [43]. For two optimization techniques (FWM for Rank-CVM, and RBCDM for Rank-SVM and RankLSVM), we have to set two stopping indexes: the maximal number of iterative epochs M and tolerance ε . We will tune these key parameters: γ , C, M and ε below. Our computational platform is a HP workstation with two Intel Xeon5675 CPUs and 32G RAM, and C++6.0, Java6.0 and Matlab 2010a.
326
6.3. Twelve data sets
327
333
To compare our Rank-LSVM with the aforementioned four classification methods, we collected 12 benchmark data sets: Emotions, Image, Birds, Scene, Plant, Genbase, Human, Medical, Slashdot, Yeast, Langlog and Enron, as summarized in Table 1, according to the number of variables (#Variable) to be solved in the training procedures of three SVM-type methods in the last column. Table 1 also shows some useful statistics of these data sets, where the sizes of training and testing sets are denoted by #Train and #Test, the number of features by #Feature, the number of labels or classes by #Class, and the label cardinality and density (%) by LC and LD, respectively. These data sets cover five diverse domains: text, scene, music, audio and biology, all of which with a sparse format for three SVM-type methods are also available at our lab homepage3 .
334
6.4. Effect of gradient threshold and random seed
301 302 303
307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
328 329 330 331 332
335 336
In this sub-section, the effect of gradient threshold (τ ) and random seed (s) is investigated on Yeast data set for our RankLSVM, when an enough large number of epochs M = 1000 is set to satisfy the tolerance ε = 10−3 . 1 2 3
http://cse.seu.edu.cn/people/zhangml. http://clus.sourceforge.net. http://computer.njnu.edu.cn/Lab/LABIC/LABIC_index.html.
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS 12
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
(a) Index (%)
100 95 90 Training time Testing time Support vectors
85 80
0
0.1
0.2
0.3
0.4 0.5 0.6 Gradient threshold
0.7
0.8
0.9
1
80
(b) Index (%)
Average precision 60 Normalized coverage 40 One error 20 Ranking loss
Hamming loss 0
0
0.1
0.2
0.3
0.4 0.5 0.6 Gradient threshold
0.7
0.8
0.9
1
Fig. 3. The effect of gradient threshold τ for Rank-LSVM on Yeast.
Table 2 The effect of random seed for Rank-LSVM on Yeast.
337 338 339 340 341 342 343 344 345 346 347
Index
Unit
Result
Actual training time Actual testing time Support vectors Normalized coverage One error Average precision Ranking loss Hamming loss
Seconds Seconds % % % % % %
7.0 ± 0.2 1.1 ± 0.1 88.61 ± 0.29 44.51 ± 0.04 22.06 ± 0.07 76.18 ± 0.03 16.59 ± 0.02 19.93 ± 0.06
Given the kernel scale factor γ = 1, regularization constant C = 1 and random seed s = 1, we set the gradient threshold τ from 0.0 to 1.0 with the step 0.1 to conduct 3-fold cross validation on the training set of Yeast. The experimental results are shown in Fig. 3, in which support vectors indicate the ratio of the number of support vectors to the number of training instances, and the training and testing time represents the ratio of actual training and testing time to their maximums, respectively. It is found out that as the gradient threshold τ increases, the training and testing time, and support vectors decrease gradually in Fig. 3(a), but five classification performance measures remain stable in Fig. 3(b). When the gradient threshold is to be 1, the sparsest solution can be achieved. Further, for the gradient threshold τ = 1.0, we investigate the effect of random seed using 10 different values. Table 2 shows the average values and their standard deviations for actual training and testing time, support vectors and five performance measures. It is observed that all eight distinct indexes are stable and robust due to extremely small standard deviations. Therefore, we only consider the gradient threshold τ = 1.0 and the random seed s = 1 in the next experiments. Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
JID: INS
ARTICLE IN PRESS J. Xu / Information Sciences xxx (2015) xxx–xxx
[m3Gsc;September 28, 2015;8:35] 13
Fig. 4. The tuning procedure of kernel scale factor γ for three SVM-type methods.
348
6.5. Tuning key parameters for three SVM-type methods
349
As mentioned the above, the good performance of SVM-type classifiers is associated with the following key parameters: kernel scale factor γ , regularization constant C, the maximal number of epochs M and stopping tolerance ε . In this sub-section, we tune these key parameters using three-fold cross validation on training sets only for three SVM-type methods. The following tradeoff criterion between ranking loss and Hamming one, as in [43], is used in this paper,
350 351 352
RHL2 = (Ranking loss + Hamming loss)/2. 353 354 355 356 357 358 359 360 361 362 363 364 365
(46)
We regard this criterion (46) as a function of these tunable parameters, and then tune them using a three-step lazy procedure: (1) To tune γ from 2−10 , 2−9 , . . . ,22 given C = 1, M = 1000 and ε = 10−3 , as shown in Fig. 4, and then to select an optimal γ value for each data set, as shown in Table 3. It is shown that all curves but two from Rank-CVM on Slashdot and Yeast are smooth and have a minimum. To satisfy ε = 10−3 , Rank-LSVM, Rank-CVM and Rank-SVM at most execute 20 epochs on Plant, 2.14 on Birds, and 49 on Enron, respectively. Note that for Rank-CVM, since the lt iterations are executed in each epoch and all lt variables are updated at each iteration, the iterative procedure can be terminated during one epoch and the number of true executed epochs is possibly not integer. (2) To tune C from 2−2 , 2−1 , . . .,210 given the optimal γ values with the above step, M = 1000 and ε = 10−3 , as shown in Fig. 5, and then to choose an optimal C value for each data set, as shown in Table 3. It is observed that tuning C could improve the performance further since all optimal C values are not equal to 1. For ε = 10−3 , Rank-LSVM, Rank-CVM and Rank-SVM at most run 424 epochs on Enron, 5.22 on Birds, and 577 on Scene, respectively. (3) At the above two steps, three SVM-type methods are terminated by ε = 10−3 . On the basis of these optimal combinations (γ , C) from the aforementioned two steps, we further investigate whether the criterion (46) could be improved via adjustPlease cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS 14
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx Table 3 The optimal criterion values (%) and their (γ , C) from three SVM-type methods on 12 data sets under M = 1000 and ε = 10−3 . Data set
Rank-LSVM
Rank-CVM
Rank-SVM
Emotions Image Birds Scene Plant Genbase Human Medical Slashdot Yeast Langlog Enron W-test
17.59(2−3 , 21 ) 15.74(2−2 , 21 ) 15.55(2−2 , 27 ) 7.36(2−3 , 22 ) 14.23(2−6 , 22 ) 0.25(2−2 , 27 ) 11.29(2−5 , 22 ) 1.89(2−5 , 28 ) 6.74(2−4 , 24 ) 17.87(20 , 22 ) 7.97(2−8 , 25 ) 6.07(2−6 , 25 ) 1.0
17.51(2−2 , 21 ) 15.96(2−2 , 21 ) 15.43(2−2 , 26 ) 7.32(2−4 , 24 ) 14.21(2−6 , 24 ) 0.22(2−2 , 25 ) 11.55(2−5 , 21 ) 1.72(2−5 , 27 ) 6.98(2−3 , 22 ) 18.16(20 , 21 ) 8.54(2−7 , 23 ) 6.46(2−7 , 23 ) 1.0
17.91(2−1 , 21 ) 15.55(2−2 , 21 ) 15.55(2−2 , 25 ) 7.38(2−3 , 22 ) 14.46(2−6 , 23 ) 0.15(2−2 , 210 ) 11.35(2−5 , 22 ) 1.76(2−4 , 24 ) 7.19(2−4 , 22 ) 17.83(20 , 22 ) 8.82(2−8 , 23 ) 6.32(2−6 , 24 ) 1.0
RHL2(%)
(1) Emotions
(2) Image
(3) Birds
23
20
21
22
19
20
21
19
18
20
Rank−LSVM Rank−CVM Rank−SVM
9
18 17
19
17
16
18 17
(4) Scene 10
0
5
10
15
8
16 0
(5) Plant
5
10
15
0
(6) Genbase
18
5
7
10
0
(7) Human
3
5
10
(8) Medical
17 16
8
15
6
RHL2(%)
17 2 16
14 1
13
4
12
2
15 14
0
5
10
0
0
(9) Slashdot
5
10
11
0
(15) Yeast
5
10
0
(10) Langlog
5
10
(12) Enron 15
26
RHL2(%)
12
14
24 10
12
22
10 10
20
8
8
18 6
0
5 log2(C)
10
0
5 log2(C)
10
0
5 log2(C)
10
5
0
5 log2(C)
10
Fig. 5. The tuning procedure of regularization constant C for three SVM-type methods.
366 367 368 369 370 371
ing M and ε , and observe how the optimization procedure works, simultaneously. Given a smaller tolerance ε = 10−5 , we set M = 1, 5, 10, 15, 20, 25, 30, 40, 50, 100, 200, 500, 1000 for Rank-LSVM and Rank-SVM, and M/100 for Rank-CVM. Additionally, the special M satisfying ε = 10−3 is recorded. Fig. 6 illustrates that the criterion (46) is a function of M, where the circle and square symbols indicate the ε = 10−3 and 10−5 are satisfied, respectively. For Rank-LSVM and Rank-SVM, the curves of the criterion (46) smoothly decrease as the M increases, and achieve a stable value at ε = 10−3 denoted by circles. On Rank-CVM, its ten curves behave decreasingly. On Slashdot and Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
ARTICLE IN PRESS
JID: INS
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
15
Fig. 6. The tuning procedure of the maximal number of epochs M for three SVM-type methods.
378
Yeast, their curves increase after ε = 10−3 is achieved. But at ε = 10−3 , Rank-CVM obtain a stable value on 12 data sets. Therefore, M = 1000 and ε = 10−3 are an appropriate stopping setting for three SVM-type methods. In Table 3, we also list the minimal values of the criterion (46), where the best value of each data set among different methods is highlighted in boldface. Rank-LSVM, Rank-CVM and Rank-SVM achieve the lowest criterion values on four, five and three data sets, respectively. In the last row, we show the Wilcoxon test scores for three methods. It is observed that three SVM-type multilabel classifiers obtain the same Wilcoxon test scores. This implies that these key parameters (γ , C, M and ε ) are tuned to be optimal for such three methods.
379
6.6. Convergence rate analysis
380
In this sub-section, to analyze the convergence rate for three SVM-type methods, we regard the criterion (46) as a function of training time, as shown in Fig. 7, where the RHL2 values and training time are from each three-fold cross validation with different epochs. We observe that (a) our Rank-LSVM has the fastest convergence rate on all data sets but Image and Scene. Particularly, on the last four data sets with more than 50,000 variables, our Rank-LSVM achieves ε = 10−3 extremely quickly; (b) Rank-CVM converges faster than our Rank-LSVM on Image and Scene; (c) Rank-SVM runs the slowest, compared with Rank-LSVM and Rank-CVM. This result is consistent with the theoretical analysis, since Rank-LSVM with RBCDM has a linear convergence rate, whereas Rank-CVM with FWM and Rank-SVM with RBCDM hold a sub-linear one.
372 373 374 375 376 377
381 382 383 384 385 386
Please cite this article as: J. Xu, Multi-label Lagrangian support vector machine with random block coordinate descent method, Information Sciences (2015), http://dx.doi.org/10.1016/j.ins.2015.09.023
JID: INS 16
ARTICLE IN PRESS
[m3Gsc;September 28, 2015;8:35]
J. Xu / Information Sciences xxx (2015) xxx–xxx
Fig. 7. The convergence rate analysis for three SVM-type methods.
387
6.7. Performance comparison on twelve testing data sets
388
In this sub-section, we compare our Rank-LSVM with Rank-CVM, Rank-SVM, BP-MLL, and RF-PCT using a train-test mode. According to those optimal parameter combinations in Table 3 for three SVM-type methods, and those recommended parameter settings mentioned above for BP-MLL and RF-PCT, we retrained all multi-label methods on 12 training sets, and then verified their performance using 12 testing sets listed in Table 1. The detailed experimental results are shown in Tables 4–8 according to five different measures, where the best value of each data set among five methods is highlighted in boldface, and the Wilcoxon test scores are listed in the last rows. In Table 4 for normalized coverage measure, Rank-LSVM, Rank-SVM, RF-PCT and Rank-CVM achieve five, four, three and one lowest values, respectively. According to Wilconxon test scores, such four methods perform well equally, and better than BP-MLL. In Table 5 of one error measure, on Genbase, all five methods obtain the zero value. Except for Genbase, Rank-SVM works the best on seven data sets, but much worse than Rank-LSVM on Langlog. On the basis of test scores, our Rank-LSVM is superior to the other four methods. As to average precision measure in Table 6, Rank-LSVM, Rank-SVM and RF-PCT obtain the highest values on six, five and one data sets. In terms of Wilcoxon test scores, three SVM-type techniques behave equally well, and better than BP-MLL and RF-PCT statistically . In Table 7 of ranking loss measure, our Rank-LSVM, Rank-SVM and Rank-CVM work the best on six, three and two data sets. Further, our Rank-LSVM achieves the highest Wilcoxon test score.
Table 4
Normalized coverage (%) from 12 data sets.

Data set    Rank-LSVM   Rank-CVM   Rank-SVM   BP-MLL   RF-PCT
Emotions        31.11       30.86      30.86    32.92    30.28
Image           15.48       15.85      15.58    33.03    16.43
Birds           24.14       23.29      24.66    31.03    22.80
Scene            7.16        7.32       7.11    34.49     7.79
Plant           14.79       16.71      15.39    51.56    17.78
Genbase          1.75        1.75       1.10     2.25     1.84
Human           13.61       13.72      13.58    40.29    15.65
Medical          2.90        2.76       3.09     4.49     3.63
Slashdot         9.80       10.75      10.63    10.59    11.61
Yeast           44.70       48.51      44.42    45.87    44.42
Langlog         16.95       17.72      17.53    21.25    21.26
Enron           20.91       26.35      24.49    21.54    23.32
W-test           2.5         2.5        2.5      0.0      2.5
Table 5
One error (%) from 12 data sets.

Data set    Rank-LSVM   Rank-CVM   Rank-SVM   BP-MLL   RF-PCT
Emotions        31.19       29.21      29.70    30.69    26.73
Image           24.38       25.88      24.38    53.88    27.00
Birds           40.12       41.86      40.12    65.12    41.28
Scene           19.65       19.40      19.31    82.69    20.74
Plant           59.74       60.51      59.23    94.87    65.13
Genbase          0.00        0.00       0.00     0.00     0.00
Human           52.01       51.61      50.24    71.79    57.07
Medical         13.51       13.51      12.61    36.04    14.71
Slashdot        39.72       38.93      40.25    40.45    44.88
Yeast           23.66       25.74      23.45    23.77    24.65
Langlog         70.92       73.11      75.50    77.29    77.89
Enron           21.42       29.71      25.56    23.32    21.24
W-test           3.0         2.5        2.5      0.0      2.0
Table 6
Average precision (%) from 12 data sets.

Data set    Rank-LSVM   Rank-CVM   Rank-SVM   BP-MLL   RF-PCT
Emotions        79.32       80.11      79.70    77.62    80.85
Image           84.39       83.54      84.27    63.31    82.41
Birds           63.17       62.87      62.84    46.80    62.15
Scene           88.23       88.31      88.44    46.38    87.43
Plant           60.09       58.29      59.84    23.65    55.37
Genbase         99.62       99.62      99.69    99.14    99.50
Human           65.04       65.17      65.95    40.62    60.58
Medical         89.87       89.90      90.11    75.91    87.61
Slashdot        70.53       70.45      70.01    69.72    65.42
Yeast           76.69       73.94      76.85    75.01    70.86
Langlog         38.81       37.15      36.17    33.25    30.56
Enron           71.33       66.07      67.67    69.40    63.61
W-test           3.0         3.0        3.0      0.5      0.5
As for the Hamming loss measure in Table 8, RF-PCT, Rank-SVM and Rank-LSVM achieve the lowest values on six, four and two data sets, respectively. According to the Wilcoxon test, all methods except BP-MLL obtain the same test scores. In order to draw a comprehensive comparison, we calculate the sum of test scores over the five measures for each method, which ranks the five classifiers as our Rank-LSVM (14), Rank-SVM (13), Rank-CVM (13), RF-PCT (9.5), and BP-MLL (0.5). Overall, our Rank-LSVM performs slightly better than the others, while Rank-SVM and Rank-CVM also achieve competitive performance.
6.8. Computational cost comparison
In this subsection, we analyze and compare the computational costs of the five methods. During the above training and testing stages, for all 12 data sets, we also recorded the training and testing time of all five methods, and the number of support vectors of the three SVM-type techniques. For RF-PCT, the reported induction time is regarded as training time in this paper; it essentially includes a small amount of testing time spent calculating the discriminant function values and detecting the predicted label subsets. In this case, the testing time of RF-PCT covers only the calculation of the performance measures.
Table 7
Ranking loss (%) from 12 data sets.

Data set    Rank-LSVM   Rank-CVM   Rank-SVM   BP-MLL   RF-PCT
Emotions        16.36       15.75      15.91    18.36    15.14
Image           12.65       13.01      12.65    34.93    13.78
Birds           15.81       14.84      16.05    21.98    15.02
Scene            6.56        6.70       6.48    39.51     7.29
Plant           15.19       17.20      15.77    54.44    18.60
Genbase          0.41        0.41       0.11     0.78     0.46
Human           12.36       12.45      12.37    40.03    14.60
Medical          1.81        1.71       1.91     3.05     2.38
Slashdot         8.26        9.09       8.98     9.01    10.17
Yeast           16.18       18.04      16.08    17.50    16.78
Langlog         13.46       14.23      13.96    17.48    16.69
Enron            7.08        9.36       8.87     7.37     8.02
W-test           3.0         2.5        2.5      0.0      2.0
Table 8
Hamming loss (%) from 12 data sets.

Data set    Rank-LSVM   Rank-CVM   Rank-SVM   BP-MLL   RF-PCT
Emotions        21.37       20.13      20.71    22.11    20.05
Image           15.45       16.05      15.30    23.85    15.70
Birds            8.29        8.42       8.69    11.32     8.60
Scene            8.50        8.84       8.00    29.08     9.21
Plant           10.92       10.75      10.56    13.97     8.78
Genbase          0.13        0.15       0.19     0.32     0.24
Human            9.09        9.04       8.97     9.71     8.21
Medical          1.06        1.07       1.02     2.00     1.27
Slashdot         4.46        4.37       4.52     4.54     4.19
Yeast           19.40       22.80      19.13    20.84    19.70
Langlog          2.35        2.44       2.52     2.51     1.77
Enron            4.62        5.09       4.78     5.30     4.59
W-test           2.5         2.5        2.5      0.0      2.5
The detailed training and testing time is listed in Table 1S in the Supplementary Material. According to the sum of the average training and testing time (in seconds), the five classifiers are ordered as Rank-LSVM (23.8+2.9), RF-PCT (38.0+0.8), Rank-CVM (377.9+3.0), Rank-SVM (2576.4+3.1), and BP-MLL (45543.8+11.4). Overall, our Rank-LSVM runs the fastest among the five methods. Now we further compare the computational costs of the three SVM-type methods. For SVM-type algorithms, the testing time is mainly determined by the number of support vectors, since evaluating kernel functions accounts for the largest fraction of the testing time. In Fig. 8, we show the training time and support vectors of the three SVM-type methods on the 12 data sets. From the training time in Fig. 8(a), Rank-SVM runs the slowest on all 12 data sets, and our Rank-LSVM spends much less time than Rank-CVM except for Image and Scene. On average, our Rank-LSVM runs about 15 and 107 times faster than Rank-CVM and Rank-SVM, respectively. According to the percentage of support vectors in Fig. 8(b) and Table 2S in the Supplementary Material, Rank-SVM produces the most support vectors on all 12 data sets, and our Rank-LSVM produces the fewest except for Enron. The average percentage of support vectors is 79.44 (Rank-LSVM), 88.36 (Rank-CVM) and 96.03 (Rank-SVM), respectively. Overall, our Rank-LSVM obtains the sparsest solution among the three SVM-type methods, which is also confirmed by the testing time mentioned above. In Table 2S, we also list the percentage of support vectors from our Rank-LSVM with τ = 0.0, that is, with no shrinking strategy applied. The corresponding average percentage is 96.23%, which is 16.79 percentage points higher than that with τ = 1.0. This illustrates that our shrinking strategy improves the solution sparseness significantly. According to the training and testing time, our Rank-LSVM runs much faster than Rank-CVM and Rank-SVM in both the training and testing phases.
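As a rough illustration of how the reported support-vector percentages can be derived from the learned dual solution, the snippet below counts a training instance as a support vector when its block of dual variables contains at least one coefficient above a small threshold. The threshold-based pruning is only loosely analogous to the role of τ and is an assumption made for illustration, not the paper's exact shrinking heuristic.

```python
import numpy as np

def support_vector_percentage(alpha_blocks, tol=1e-8):
    """alpha_blocks: one 1-D array of dual variables per training instance.
    An instance counts as a support vector when any coefficient in its block
    exceeds tol; coefficients at (numerical) zero contribute nothing."""
    n_sv = sum(1 for a in alpha_blocks if np.any(np.asarray(a) > tol))
    return 100.0 * n_sv / len(alpha_blocks)

# Toy usage: three instances, the second one has an all-zero block.
blocks = [np.array([0.0, 0.7]), np.array([0.0, 0.0]), np.array([1.2, 0.0])]
print(round(support_vector_percentage(blocks), 2))   # 66.67
```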
7. Conclusions
The two existing multi-label SVM-type algorithms (i.e., Rank-CVM and Rank-SVM) are typical algorithm extension methods, which consider all instances and all labels simultaneously to characterize label correlations sufficiently, resulting in two relatively complicated quadratic programming problems with equality and bounded constraints. In this paper, we extend the binary Lagrangian support vector machine (LSVM) to construct its multi-label version, Rank-LSVM, which is formulated as a quadratic programming problem with non-negative constraints only.
Fig. 8. (a) Training time (in seconds) and (b) support vectors (%) from three SVM-type methods (Rank-LSVM, Rank-CVM, Rank-SVM) on the 12 data sets.
Based on the random block coordinate descent method with a heuristic shrinking strategy, our efficient training procedure for Rank-LSVM has a linear convergence rate theoretically. On the 12 data sets, our novel method on average runs about 15 and 107 times faster, and produces about 9 and 17% fewer support vectors, than Rank-CVM and Rank-SVM, respectively, in the training phase. The experimental study also demonstrates that our Rank-LSVM achieves rather competitive performance compared with four typical multi-label classification methods (Rank-CVM, Rank-SVM, BP-MLL and RF-PCT), according to five indicative performance measures. In the future, we will conduct a detailed model selection study to search for optimal key parameters more efficiently, and will enhance the solution sparsity of the three SVM-type multi-label classifiers.
Acknowledgments
This work was supported by the Natural Science Foundation of China under grants no. 61273246 and 61432008.
Appendix A. Proof of Theorem 1
In Algorithm 1, at each outer epoch, the random block coordinate descent method (RBCDM) is used to solve the entire large-scale QP problem (24) of our Rank-LSVM. Regarding the convergence rate of RBCDM, a strongly convex function with box constraints was investigated first in [24], and a more general composite function was then considered in [28]; our Rank-LSVM is a special case of their work. In this study, we use some recent theoretical results from [28] to prove the linear convergence rate of RBCDM for our Rank-LSVM. Our Rank-LSVM can be converted into the following strongly convex unconstrained minimization problem with a composite objective function,
\[
G(\alpha) = F(\alpha) + I(\alpha), \tag{A.1}
\]
where I(α) is the indicator function of a block-separable convex set, which describes the non-negative constraints in our Rank-LSVM,
\[
I(\alpha) = \begin{cases} 0, & \text{if } \alpha_i \ge 0,\ i = 1, \ldots, l, \\ +\infty, & \text{otherwise.} \end{cases} \tag{A.2}
\]
Firstly, we need to validate the following assumptions for the three functions F(α), I(α) and G(α).

Assumption 1. The gradient of the function F(α) is block coordinate-wise Lipschitz continuous with positive constants $L_i^B = 2l_i(\|x_i\|_2^2 + 1) + \sqrt{l_i}/C_i$, $i = 1, \ldots, l$.

Validation. According to the gradient vector (32) with respect to the ith block of variables α_i, we have
\[
\begin{aligned}
\|g_i(\alpha_i + \Delta\alpha_i) - g_i(\alpha_i)\|_2
&= \big\|\big[(\|x_i\|_2^2 + 1)H_i^T H_i + I_i/C_i\big]\Delta\alpha_i\big\|_2 \\
&\le \big\|(\|x_i\|_2^2 + 1)H_i^T H_i + I_i/C_i\big\|_F\,\|\Delta\alpha_i\|_2 \\
&\le \big((\|x_i\|_2^2 + 1)\|H_i^T H_i\|_F + \|I_i\|_F/C_i\big)\|\Delta\alpha_i\|_2 \\
&\le \big((\|x_i\|_2^2 + 1)\|H_i\|_F^2 + \|I_i\|_F/C_i\big)\|\Delta\alpha_i\|_2 \\
&\le \big(2l_i(\|x_i\|_2^2 + 1) + \sqrt{l_i}/C_i\big)\|\Delta\alpha_i\|_2
 = L_i^B\|\Delta\alpha_i\|_2,
\end{aligned}
\tag{A.3}
\]
where $L_i^B = 2l_i(\|x_i\|_2^2 + 1) + \sqrt{l_i}/C_i$ is referred to as the Lipschitz constant.

Assumption 2. The indicator function $I(\alpha): \mathbb{R}^{l_t} \to \mathbb{R} \cup \{+\infty\}$ is block separable.
Validation. We can split I(α) into,
\[
I(\alpha) = \sum_{i=1}^{l} I(\alpha_i), \tag{A.4}
\]
where
\[
I(\alpha_i) = \begin{cases} 0, & \text{if } \alpha_{ij} \ge 0,\ j = 1, \ldots, l_i, \\ +\infty, & \text{otherwise,} \end{cases} \tag{A.5}
\]
in which $I(\alpha_i): \mathbb{R}^{l_i} \to \mathbb{R} \cup \{+\infty\}$ are convex and closed. In order to validate the (strong) convexity of the three functions F(α), I(α) and G(α), we introduce a pair of conjugate norms,
\[
\|\alpha\|_{[L]} = \left(\sum_{i=1}^{l} L_i^B\,\|\alpha_i\|_2^2\right)^{1/2}, \qquad
\|\alpha^*\|_{[L]}^* = \left(\sum_{i=1}^{l} (L_i^B)^{-1}\,\|\alpha_i^*\|_2^2\right)^{1/2}. \tag{A.6}
\]
Assumption 3. With respect to the above norm $\|\cdot\|_{[L]}$, the function I(α) is only convex, while the functions F(α) and G(α) are strongly convex.

Validation. For two arbitrary vectors α^1 and α^2, where the superscripts 1 and 2 are used only to differentiate the two vectors, and the positive definite matrix Φ of the quadratic term in F(α), we have
\[
\begin{aligned}
F(\alpha^2) - F(\alpha^1) - \langle\nabla F(\alpha^1), \alpha^2 - \alpha^1\rangle
&= \tfrac{1}{2}(\alpha^2)^T\Phi\alpha^2 - u^T\alpha^2 - \tfrac{1}{2}(\alpha^1)^T\Phi\alpha^1 + u^T\alpha^1 - (\Phi\alpha^1 - u)^T(\alpha^2 - \alpha^1) \\
&= \tfrac{1}{2}(\alpha^2)^T\Phi\alpha^2 - (\alpha^1)^T\Phi\alpha^2 + \tfrac{1}{2}(\alpha^1)^T\Phi\alpha^1 \\
&= \tfrac{1}{2}(\alpha^1 - \alpha^2)^T\Phi(\alpha^1 - \alpha^2)
 \ \ge\ \tfrac{1}{2}(\alpha^1 - \alpha^2)^T D(\alpha^1 - \alpha^2) \\
&= \sum_{i=1}^{l}\frac{1}{2C_i}\,\|\alpha_i^1 - \alpha_i^2\|_2^2
 = \sum_{i=1}^{l}\frac{1}{2L_i^B C_i}\,L_i^B\|\alpha_i^1 - \alpha_i^2\|_2^2 \\
&\ge \tfrac{1}{2}\mu_F(L)\sum_{i=1}^{l}L_i^B\|\alpha_i^1 - \alpha_i^2\|_2^2
 = \tfrac{1}{2}\mu_F(L)\,\|\alpha^1 - \alpha^2\|_{[L]}^2,
\end{aligned}
\tag{A.7}
\]
where μ_F(L) denotes the strong convexity parameter of the function F(α), which is defined as
\[
\mu_F(L) = \frac{1}{\max_{i=1,\ldots,l} L_i^B C_i} > 0. \tag{A.8}
\]
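As a small numerical check of the quantities just defined, the sketch below evaluates the block constants L_i^B of Assumption 1 and the strong convexity parameter μ_F(L) of (A.8) for made-up inputs; the helper and its arguments are illustrative only and do not belong to the paper's implementation.

```python
import numpy as np

def block_constants(X, block_sizes, C):
    """Evaluate L_i^B = 2*l_i*(||x_i||_2^2 + 1) + sqrt(l_i)/C_i (Assumption 1)
    and mu_F(L) = 1 / max_i(L_i^B * C_i) (A.8) for toy inputs.
    X: instances row-wise; block_sizes: the l_i; C: the per-instance C_i."""
    X = np.asarray(X, dtype=float)
    l = np.asarray(block_sizes, dtype=float)
    C = np.asarray(C, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    L_B = 2.0 * l * (sq_norms + 1.0) + np.sqrt(l) / C
    mu_F = 1.0 / np.max(L_B * C)
    return L_B, mu_F

# Made-up example: three instances with blocks of size 2, 3 and 1, all C_i = 1.
L_B, mu_F = block_constants([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]],
                            block_sizes=[2, 3, 1], C=[1.0, 1.0, 1.0])
print(L_B, mu_F)
```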
Similarly, the convexity parameter of I(α) is μ_I(L) = 0, which means that I(α) is only convex. The strong convexity of G(α) may come from F(α) or I(α) or both, and its convexity parameter μ_G(L) satisfies
\[
\mu_G(L) \ge \mu_F(L) + \mu_I(L) = \mu_F(L) > 0. \tag{A.9}
\]
Since the optimization problem of our Rank-LSVM satisfies the above three assumptions, the following detailed Theorem A.1, derived from Theorem 7 in [28], shows that the RBCDM for our Rank-LSVM has a linear convergence rate.

Theorem A.1. Assume μ_F(L) + μ_I(L) > 0, and choose an initial vector α^(0) = 0. If α^(p,t) is a random vector produced by the RBCDM for Rank-LSVM, then we have
\[
E\big(G(\alpha^{(p,t)})\big) - G^* \le \left(1 - \frac{\mu_F(L)}{l}\right)^{k}\big(G(\alpha^{(0)}) - G^*\big), \tag{A.10}
\]
where E(G(α^(p,t))) is the expected value of G(α^(p,t)) and G* is the optimal objective value. It is worth noting that Theorem 1 in Section 4 is its simplified form.
Appendix B. Proof of Theorem 2
In Algorithm 1, at each inner iteration, the cyclic coordinate descent method (CCDM) is used to solve each small-scale QP sub-problem (31), which can be rewritten in the following form for convenience,
\[
\begin{aligned}
\min\ \ & F(\Delta\alpha_i) = \tfrac{1}{2}(\Delta\alpha_i)^T\Phi_{ii}(\Delta\alpha_i) + g_i^T(\Delta\alpha_i), \\
\text{s.t.}\ \ & \alpha_i + \Delta\alpha_i \ge 0.
\end{aligned}
\tag{B.1}
\]
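To illustrate the kind of update CCDM performs on a sub-problem of this form, the self-contained sketch below runs cyclic coordinate descent with a non-negativity projection on a tiny strictly convex example; Q and g stand in for the block matrix Φ_ii and block gradient g_i, the sweep count is an arbitrary illustrative choice rather than a principled stopping rule, and none of this is the paper's implementation.

```python
import numpy as np

def ccd_subproblem(Q, g, alpha, n_sweeps=50):
    """Cyclic coordinate descent sketch for
        min_d 0.5 * d^T Q d + g^T d   s.t.  alpha + d >= 0,
    i.e. the shape of sub-problem (B.1). Each coordinate takes its exact
    1-D minimizer and is then clipped so that alpha + d stays non-negative."""
    d = np.zeros_like(g, dtype=float)
    for _ in range(n_sweeps):
        for j in range(len(g)):
            grad_j = Q[j] @ d + g[j]          # partial derivative w.r.t. d[j]
            d_j = d[j] - grad_j / Q[j, j]     # unconstrained coordinate minimizer
            d[j] = max(d_j, -alpha[j])        # project back onto the feasible set
    return d

# Tiny usage example with a strictly convex 2x2 block.
Q = np.array([[2.0, 0.5], [0.5, 1.5]])
g = np.array([-1.0, 2.0])
alpha = np.array([0.3, 0.1])
d = ccd_subproblem(Q, g, alpha)
print(d, alpha + d)    # alpha + d remains elementwise non-negative
```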
In [40], the linear convergence rate of feasible descent methods for convex optimization is proved, and our optimization sub-problem (B.1) solved with CCDM is a special case. In order to show that our CCDM for (B.1) has a linear convergence rate, we validate that the function F satisfies the following assumption.
Assumption 4. The gradient of F is Lipschitz continuous with a positive constant $L_i^C$, and F is strongly convex with a positive parameter μ.
Validation. For two arbitrary vectors α_i^1 and α_i^2, we have
\[
\begin{aligned}
\|\nabla F(\alpha_i^1) - \nabla F(\alpha_i^2)\|_2
&= \|\Phi_{ii}(\alpha_i^1 - \alpha_i^2)\|_2
 \le \|\Phi_{ii}\|_F\,\|\alpha_i^1 - \alpha_i^2\|_2 \\
&\le \big(2l_i(\|x_i\|_2^2 + 1) + \sqrt{l_i}/C_i\big)\|\alpha_i^1 - \alpha_i^2\|_2
 = L_i^C\|\alpha_i^1 - \alpha_i^2\|_2,
\end{aligned}
\tag{B.2}
\]
where $L_i^C = 2l_i(\|x_i\|_2^2 + 1) + \sqrt{l_i}/C_i$ is the Lipschitz constant, which is the same as $L_i^B$ in the above Appendix A. Additionally, we obtain
\[
\big(\nabla F(\alpha_i^1) - \nabla F(\alpha_i^2)\big)^T\big(\alpha_i^1 - \alpha_i^2\big)
= \big(\alpha_i^1 - \alpha_i^2\big)^T\Phi_{ii}\big(\alpha_i^1 - \alpha_i^2\big)
\ge \frac{1}{C_i}\,\|\alpha_i^1 - \alpha_i^2\|_2^2
= \mu\,\|\alpha_i^1 - \alpha_i^2\|_2^2, \tag{B.3}
\]
where μ = 1/C_i represents the convexity parameter. According to Corollary 3.3 in [40], the following detailed Theorem A.2 shows that the CCDM for solving the sub-problem (B.1) has a linear convergence rate.
Theorem A.2. The CCDM for the sub-problem (B.1) has a global linear convergence rate if Assumption 4 is satisfied.
Note that Theorem 2 in Section 4 is its concise version.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.ins.2015.09.023.

References

[1] M.R. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recognit. 37 (9) (2004) 1757–1771.
[2] A.C. de Carvalho, A.A. Freitas, A tutorial on multi-label classification techniques, in: Function Approximation and Classification, 5, Springer, 2009, pp. 177–195.
[3] K.-W. Chang, C.-J. Hsieh, C.-J. Lin, Coordinate descent method for large-scale l2-loss linear support vector machines, J. Mach. Learn. Res. 9 (2008) 1369–1398.
[4] W.-J. Chen, Y.-H. Shao, N. Hong, Laplacian smooth twin support vector machine for semi-supervised classification, Int. J. Mach. Learn. Cybern. 5 (3) (2014) 459–468.
[5] W. Cheng, E. Hullermeier, Combining instance-based learning and logistic regression for multi-label classification, Mach. Learn. 76 (2-3) (2009) 211–225.
[6] K. Dembczynski, W. Waegeman, W. Cheng, E. Hullermeier, On label dependence and loss minimization in multi-label classification, Mach. Learn. 88 (1-2) (2012) 5–45.
[7] J. Demsar, Statistical comparison of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[8] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., John Wiley and Sons, New York, USA, 2001.
[9] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Proceedings of the 14th Conference on Neural Information Processing Systems (NIPS2001), Vancouver, British Columbia, Canada, 2001, pp. 681–687.
[10] M. Frank, P. Wolfe, An algorithm for quadratic programming, Nav. Res. Logist. Q. 3 (1-2) (1956) 95–110.
[11] G. Fung, O.L. Mangasarian, Proximal support vector machine classifiers, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2001), San Francisco, CA, USA, 2001, pp. 78–86.
[12] J. Furnkranz, E. Hullermeier, E.L. Mencia, K. Brinker, Multi-label classification via calibrated label ranking, Mach. Learn. 73 (2) (2008) 133–153.
[13] X. Gao, L. Fan, H. Xu, Multiple rank multi-linear kernel support vector machine for matrix data classification, Int. J. Mach. Learn. Cybern. (2015), in press.
[14] J.E. Gentle, Matrix Algebra: Theory, Computations and Applications in Statistics, Springer, New York, USA, 2007.
[15] J. Guelat, P. Marcotte, Some comments on Wolfe's away step, Math. Program. 35 (1) (1986) 110–119.
[16] C. Hildreth, A quadratic programming procedure, Nav. Res. Logist. Q. 4 (1) (1957) 79–85.
[17] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S.S. Keerthi, S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning (ICML2008), Helsinki, Finland, 2008, pp. 408–415.
[18] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern recognition, IEEE Trans. Patt. Anal. Mach. Intell. 29 (5) (2007) 901–910.
[19] J.-Y. Jiang, S.-C. Tsai, S.-J. Lee, FSKNN: multi-label text categorization based on fuzzy similarity and k nearest neighbors, Expert Syst. Appl. 39 (3) (2012) 2813–2821.
[20] P. Li, H. Li, M. Wu, Multi-label ensemble based on variable pairwise constraint projection, Inform. Sci. 222 (2013) 269–281.
[21] G. Madjarov, D. Kocev, D. Gjorgjevikj, S. Dzeroski, An extensive experimental comparison of methods for multi-label learning, Pattern Recognit. 45 (9) (2012) 3084–3104.
[22] O.L. Mangasarian, D.R. Musicant, Lagrangian support vector machine, J. Mach. Learn. Res. 1 (2001) 161–171.
[23] I. Necoara, A. Patrascu, A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints, Comput. Optim. Appl. 57 (2) (2014) 307–337.
[24] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Optim. 22 (2) (2012) 341–362.
[25] R. Nicolas, A. Sancho-Asensio, E. Golobardes, A. Fornells, A. Orriols-Puig, Multi-label classification based on analog reasoning, Expert Syst. Appl. 40 (15) (2013) 5924–5931.
[26] X. Peng, Building sparse twin support vector machine classifiers in primal space, Inform. Sci. 181 (18) (2011) 3967–3980.
[27] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Mach. Learn. 85 (3) (2011) 333–359.
[28] P. Richtarik, M. Takac, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Math. Program. 144 (1-2) (2014) 1–38.
[29] R.E. Schapire, Y. Singer, Boostexter: a boosting-based system for text categorization, Mach. Learn. 39 (2-3) (2000) 135–168.
[30] S. Shalev-Shwartz, A. Tewari, Stochastic methods for l1-regularized loss minimization, J. Mach. Learn. Res. 12 (2011) 1865–1892.
[31] L. Sun, W.-S. Mu, B. Qi, Z.-J. Zhou, A new privacy-preserving proximal support vector machine for classification of vertically partitioned data, Int. J. Mach. Learn. Cybern. 6 (1) (2015) 109–118.
[32] D. Tao, X. Li, X. Wu, W. Hu, S.J. Maybank, Supervised tensor learning, Knowl. Inform. Syst. 13 (1) (2007) 1–42.
[33] K. Trohidis, G. Tsoumakas, G. Kalliris, I.P. Vlahavas, Multi-label classification of music into emotions, in: Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR2008), Philadelphia, PA, USA, 2008, pp. 325–330.
[34] I.W. Tsang, J.T. Kwok, P.-M. Cheung, Core vector machines: fast SVM training on very large data sets, J. Mach. Learn. Res. 6 (2005) 363–392.
[35] I.W. Tsang, J.T. Kwok, J.M. Zurada, Generalized core vector machines, IEEE Trans. Neural Netw. 17 (5) (2006) 1126–1140.
[36] G. Tsoumakas, I. Katakis, Multi-label classification: an overview, Int. J. Data Warehous. Min. 3 (3) (2007) 1–13.
[37] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook, 2nd ed., Springer, 2010, pp. 667–685.
[38] G. Tsoumakas, I. Vlahavas, I. Katakis, Random k-labelsets for multi-label classification, IEEE Trans. Knowl. Data Eng. 23 (7) (2011) 1079–1089.
[39] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, USA, 1998.
[40] P.-W. Wang, C.-J. Lin, Iteration complexity of feasible descent methods for convex optimization, J. Mach. Learn. Res. 15 (2014) 1523–1548.
[41] Z. Wen, D. Goldfarb, K. Scheinberg, Block coordinate descent methods for semidefinite programming, in: Handbook on Semidefinite, Conic and Polynomial Optimization, 166, Springer, 2012, pp. 533–564.
[42] X. Wu, V. Kumar, The Top Ten Algorithms in Data Mining, Chapman and Hall/CRC, Boca Raton, FL, USA, 2009.
[43] J. Xu, Fast multi-label core vector machine, Pattern Recognit. 46 (3) (2013) 885–898.
[44] J. Xu, A random block coordinate descent method for multi-label support vector machine, in: Proceedings of the 20th International Conference on Neural Information Processing (ICONIP2013), Daegu, Korea, 2013, pp. 281–290.
[45] M.-L. Zhang, J.M. Pena, V. Robles, Feature selection for multi-label naive Bayes classification, Inform. Sci. 179 (19) (2009) 3218–3229.
[46] M.-L. Zhang, Z.-H. Zhou, Multilabel neural networks with application to functional genomics and text categorization, IEEE Trans. Knowl. Data Eng. 18 (10) (2006) 1338–1351.
[47] M.-L. Zhang, Z.-H. Zhou, ML-kNN: a lazy learning approach to multi-label learning, Pattern Recognit. 40 (5) (2007) 2038–2048.
[48] M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (8) (2014) 1837–1919.