Multi-dimensional classification via a metric approach

Zhongchen Ma, Songcan Chen∗

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106, China
Abstract

Multi-dimensional classification (MDC) refers to learning an association between individual inputs and their multiple output discrete variables, and is thus more general than multi-class classification (MCC) and multi-label classification (MLC). One of the core goals of MDC is to model output structure in order to improve classification performance. To this end, one effective strategy is to first transform the output space and then learn in the transformed space. However, existing transformation approaches are all rooted in the label power-set (LP) method and thus inherit its drawbacks (e.g., class imbalance and class overfitting). In this study, we first analyze the drawbacks of LP and then propose a novel transformation method which not only overcomes these drawbacks but also constructs a bridge from MDC to MLC. As a result, many off-the-shelf MLC methods can be adapted to the newly formed problem. However, instead of adapting these methods, we propose a novel metric learning based method which yields a closed-form solution for the newly formed problem. Interestingly, our metric learning based method is also naturally applicable to MLC, so it is of independent interest as well. Extensive experiments justify the effectiveness of both our transformation approach and our metric learning based method.

Keywords: Multi-dimensional classification, problem transformation, distance metric learning, closed-form solution
∗Corresponding author. Email address: [email protected] (Songcan Chen)
1. Introduction

In supervised learning, binary classification (BC), multi-label classification (MLC) and multi-class classification (MCC) have been extensively studied in past years. As a more general learning task, MDC is relatively less studied up to now, which can partly be attributed to its more complex output space. Figure 1 displays the relationships among the different classification paradigms in terms of m class variables with K possible values each. As shown in the figure, BC has only a single class variable whose range is {1, 0} or {1, −1}, corresponding to m = 1 and K = 2; MCC also has only a single class variable, but its range can take a number of class values, corresponding to m = 1 and K > 2; MLC has multiple class variables whose range is also {1, 0} or {1, −1}, corresponding to m > 1 and K = 2. More generally, MDC allows multiple class variables that can each take a number of class values, corresponding to m > 1 and K > 2.
Figure 1: Relationship between different classification paradigms, where m is the number of class variables and K is the number of values each of these variables may take.
A wide range of applications corresponds to this task. For example, in computer vision [1], a landscape image may convey information such as the month, the season, or the type of subject; in information retrieval [2][3], documents can be classified into different kinds of categories such as mood or topic; in computational advertising [4], a piece of social media information may reveal the user's gender, age, personality, happiness or political polarity.

Like MLC, the core goal of MDC is to achieve effective classification performance by modeling output structure. In modeling, the simplest assumption is
that the class variables are completely unrelated, so that it is sufficient to design a separate, independent model for each class. However, such an ideal assumption is hardly applicable to real-world problems in general, as correlation (structure) often exists among class variables; for example, a user's age can have a strong impact on his political polarity, where the young are generally more radical and elders are often more conservative. Even within each output dimension, there exists an explicit within-dimension relationship among its values, namely that only one value of a class variable can be activated at a time. Therefore, one key to effective learning lies in how to take sufficient advantage of the explicit and/or implicit relationships both among output dimensions and among the values within each output dimension.
In order to model such output structures, two main strategies have been proposed: (i) explicitly modeling the dependence structures between class variables, e.g., via imposing a chain structure [5][6][7], using a multi-dimensional Bayesian network structure [8][9], or adopting a Markov random field [10]; (ii) implicitly modeling the output structure by transformation approaches [11][12][13].

A major limitation of the former strategy lies in requiring a pre-defined output structure (e.g., a chain or a Bayesian network), which partly sacrifices flexibility in characterizing the structure. In contrast, the transformation approach of the latter strategy enjoys more flexibility due to its ability to model various structures. Moreover, such a transformation method has demonstrated convincing performance in [13]. Therefore, in this paper, we follow the transformation strategy to model the output structures of MDC.
To the best of our knowledge, all the existing transformation methods can be classified as label power-set (LP)-based transformation approaches. LP [11] transforms the MDC problem into a corresponding multi-class classification problem by defining a new compound class variable whose range exactly contains all the possible combinations of values of the original class variables. Although it implicitly considers the interaction between different classes, LP suffers from class imbalance and class overfitting, where class imbalance refers to the great differences in the total number of instances for different combinations of the class variables, and class overfitting refers to zero instances for some combinations of the class variables. To address these issues, [13] proposed to first form super-class partitions by modeling the dependence between class variables and then make each super-class partition correspond to a compound class variable defined by LP. Although this super-class partitioning reduces the original problem to a set of subproblems, these newly formed subproblems still need to be transformed by LP, so the approach naturally suffers from the same problems.
In this study, we analyze the drawbacks of LP and propose a novel transformation method which not only overcomes these drawbacks but also constructs a bridge from MDC to MLC. Specifically, our transformation approach forms a new output space with all binary variables by a carefully designed binarization of the original output space of MDC. Since the newly formed problem resembles MLC (e.g., the class variables of both problems are all binary), our transformation approach is named the Multi-Label liKe Transformation approach (MLKT), and consequently many off-the-shelf MLC methods can be adapted to the newly formed problem. However, instead of adapting these methods, we also propose a novel metric-based method which aims to make the predictions of an instance in the learned metric space close to its true class values while far away from the others. Moreover, our metric-based method yields a closed-form solution, so its learning is more efficient than that of its competing methods. Interestingly, our metric learning method is also naturally applicable to MLC, so it is of independent interest as well. Finally, extensive experimental results justify that our approach combining the above two procedures achieves better classification performance than the state-of-the-art MDC methods, while our metric learning method itself also obtains competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC.

The rest of the paper is structured as follows. We first introduce the required background in the field of multi-dimensional classification in Section 2. Then we introduce MLKT in Section 3. Next, we present the details of the distance metric learning method in Section 4. We then experimentally evaluate the proposed schemes in Section 5. Finally, we give concluding remarks in Section 6.
2. Background

In this section, we review basic multi-dimensional classifiers.

In MDC, we have $N$ labeled instances $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ from which we wish to build a classifier that associates multiple class values with each data instance. A data instance is represented by a vector of $d$ values $\mathbf{x} = (x^1, \ldots, x^d)$, each drawn from some input domain $\mathcal{X}^1 \times \cdots \times \mathcal{X}^d$, and the classes are represented by a vector of $m$ values $\mathbf{y} = (y^1, \ldots, y^m)$ from the domain $\mathcal{Y}^1 \times \cdots \times \mathcal{Y}^m$, where each $\mathcal{Y}^j = \{1, \ldots, K^j\}$ is the set of possible values for the $j$th class variable $Y^j$. Specifically, we seek to build a classifier $f$ that assigns each instance $\mathbf{x}$ a vector $\mathbf{y}$ of class values:
$$f : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \to \mathcal{Y}^1 \times \cdots \times \mathcal{Y}^m, \qquad (x^1, \ldots, x^d) \mapsto (y^1, \ldots, y^m).$$

Binary Relevance (BR) is a straightforward method for MDC. It trains $m$ classifiers $f := (f^1, \ldots, f^m)$, one for each class variable. Specifically, a standard multi-class classifier $f^j$ learns to associate one of the values $y^j \in \mathcal{Y}^j$ with each data instance, where $f^j : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \to \mathcal{Y}^j$. However, BR is unable to capture the dependencies among classes and suffers low accuracies, as illustrated in [5, 14, 15].
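To make the BR baseline concrete, here is a hypothetical scikit-learn-style sketch (illustrative class and method names, not the paper's implementation) that fits one independent multi-class classifier per class variable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevanceMDC:
    """One independent multi-class classifier per class variable (BR baseline)."""
    def fit(self, X, Y):                      # Y: (N, m) matrix of class values
        self.models_ = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # each column is predicted independently, ignoring class dependencies
        return np.column_stack([m.predict(X) for m in self.models_])
```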
MDC has attracted much attention recently, and many multi-dimensional classifiers for modeling the output structure of MDC have been proposed. As presented in the introduction, there are two main strategies:

1. Explicit representation of the dependence structure between class variables.
Classifier chains (CC) [5, 6, 7], classifier trellises (CT) [14] and multi-dimensional Bayesian network classifiers (MBCs) [8][9] are recently proposed methods following this strategy for MDC.
The classifier chains model (CC) learns $m$ classifiers, one for each class variable. These classifiers are linked in a random order, such that the $j$th classifier uses as input features not only the instance but also the output predictions of the previous $j-1$ classifiers, namely $\hat{y}^j = f^j(\mathbf{x}, \hat{y}^1, \ldots, \hat{y}^{j-1})$ for any test instance $\mathbf{x}$. Specifically,
$$f^j : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \times \mathcal{Y}^1 \times \cdots \times \mathcal{Y}^{j-1} \to \mathcal{Y}^j.$$
This method has demonstrated high performance in multi-label domains and is directly applicable to MDC. However, a drawback is that the class variable ordering in the chain has a strong effect on predictive accuracy, and the greedy structure raises the concern of error propagation along the chain, since an incorrect estimate $\hat{y}^j$ will negatively affect all subsequent class variables. Naturally, the ensemble strategy (ECC) [7], which trains several CC classifiers with random chain orders, can be used to alleviate these problems.
Classifier trellises (CT) capture dependencies among class variables by considering a predefined trellis structure. Each vertex of the trellis corresponds to one of the class variables. Fig. 2 shows a simple example of the structure, where the parents of each class variable are the class variables lying on the vertices above and to the left in the trellis. Specifically,
$$f^d : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \times \mathcal{Y}^b \times \mathcal{Y}^c \to \mathcal{Y}^d.$$
CT can scale to large data sets with reasonable complexity. However, just like CC, the artificially defined greedy structure may falsely reflect the real dependency among class variables, limiting its classification performance in real-world applications unless the predefined structure coincides with the given problem.

Figure 2: A simple example of Classifier Trellises [14].

Multi-dimensional Bayesian network classifiers (MBCs) are a family of probabilistic graphical models which organize class and feature variables into three different subgraphs: a class subgraph, a feature subgraph, and a bridge (from classes to features) subgraph. Different graphical structures for the class and feature subgraphs lead to different families of MBCs; a simple tree-tree structure of an MBC is shown in Fig. 3. In recent years, various MBCs have been proposed and have become useful tools for modeling the output structures of MDC [8, 9, 16]. However, their exponential computational complexity remains a problem.

Figure 3: A simple tree-tree structure of MBC [8][9].
2. Implicit incorporation of output structure by transforming the output space.
To the best of our knowledge, all the existing transformation methods for MDC can be classified as label power-set (LP)-based transformation approaches.
Label power-set (LP) [11] is a typical transformation approach for MLC and can also be directly applied to MDC. It first forcefully assumes that all the class variables are dependent and then defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. Specifically,
$$f : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \to \mathrm{CartesianProduct}(\mathcal{Y}^1, \ldots, \mathcal{Y}^m).$$
As a result, the original problem is turned into a multi-class classification problem for which many off-the-shelf methods are available. In this way, the output structure of MDC is implicitly considered. However, LP easily suffers from the class overfitting and class imbalance problems mentioned in the introduction.
Random k-labelsets (RAkEL) [17] and the Super-class Classifier (SCC) [13] are LP-based transformation approaches, where RAkEL uses multiple LP classifiers, each trained on a random subset of $\mathcal{Y}$, while SCC's LP classifiers are trained on subsets of class variables with strong dependency. Formally, given a subset $S$ of the class variables of $\mathcal{Y}$, a corresponding LP classifier is learned, namely
$$f_S : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \to \mathrm{CartesianProduct}(\mathcal{Y}^S).$$
Note that RAkEL is specially designed for MLC but can be straightforwardly applied to MDC, while SCC has demonstrated convincing classification performance for MDC. However, both methods need to resort to LP, and therefore naturally suffer from the same problems to some extent.
In summary, the above two strategies have their own advantages and drawbacks. Relatively speaking, however, the latter enjoys more flexibility in modeling output structure. Therefore, in this paper, we follow the latter to model the output structure of MDC. Unfortunately, existing transformation approaches are all based on LP and thus naturally inherit its class overfitting and class imbalance problems. Motivated by this, we analyze the drawbacks of LP and propose a novel transformation approach to overcome them.
3. Transformation for MDC

3.1. Analysis of LP

We now give an analysis of LP and detail the causes of its drawbacks.
LP is a typical transformation approach for MDC. It defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. In this way, the original MDC problem is turned into a multi-class classification problem which is relatively easier for subsequent learning. However, such a transformation has two serious drawbacks.
The first drawback is class overfitting, which is caused by the great reduction of the number of instances per class after the transformation. Specifically, consider a dataset $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{N}$ of an MDC problem with $m$ class variables, the $i$th having $K^i$ class values. Clearly, the number of instances for each class of the $i$th output dimension is on the order of $\frac{1}{K^i}N$ in the original problem, and in the worst case it is $\frac{1}{K^{\max}}N$, where $K^{\max} = \max_i K^i$. However, after the transformation, the number of instances for each class becomes $\frac{1}{\prod_{i=1}^{m} K^i}N$, which is far less than $\frac{1}{K^{\max}}N$ because $\prod_{i=1}^{m} K^i \gg K^{\max}$. Hence, this drawback makes learning in the formed problem prone to class overfitting.
The second drawback is class imbalance, which is caused by the reduction of the balance degree (the smallest ratio of the total numbers of instances between classes) after the transformation. More specifically, given a dataset of an imbalanced MDC problem with $m$ class variables $Y^1, \ldots, Y^m$, assume the balance degrees of the class variables are $p^1, \ldots, p^m$, respectively. After the LP transformation, the balance degree changes to $p^1 \times \cdots \times p^m$. Since $p^i < 1$ for all $i \in \{1, \ldots, m\}$, we get $p^1 \times \cdots \times p^m < \min(p^1, \ldots, p^m)$; for example, two class variables with balance degrees 0.5 and 0.4 yield a balance degree of only 0.2 after LP. Thus the balance degree in the formed problem is worse than that in the original problem. In essence, although LP does not transform a totally balanced MDC problem into an imbalanced one, it can indeed transform an almost balanced MDC problem into an imbalanced one.
Moreover, from the above analysis, we find that the more class variables the LP compound class variable includes, the more serious the class overfitting and class imbalance problems become.
To overcome the drawbacks of LP, we propose a transformation approach that 1) makes the number of instances for each class in the transformed problem as similar as possible to that in the original MDC problem, and 2) keeps the balance degree of the formed problem as consistent as possible with that of the original MDC problem.
3.2. The procedure of MLKT

In this subsection, we present a novel transformation approach, namely MLKT, which transforms $\mathcal{Y}$ into a subspace of $\{0,1\}^L$ (where $L$ is the dimensionality of the formed problem). Our approach inherits the following favorable characteristics of LP:
• it keeps the output space size invariant;
• it is easy for subsequent modeling in the transformed space;
• it can reflect the explicit within-dimensional relationship.
In addition, it possesses two extra key characteristics:
• it can overcome the class overfitting and class imbalance problems suffered by LP;
• it is decomposable for each class variable of MDC.
Among the above characteristics, we make the transformation decomposable in order to ease its implementation and to allow distinctive learning for each output variable. By doing so, we avoid the unnecessary computation cost of LP when the correlations between some output variables
are not strong. Now, let us detail the procedure of MLKT. For each individual class variable of MDC:
• if $K^i \ge 3$, then for each $y^i \in \mathcal{Y}^i = \{1, \ldots, K^i\}$, construct a new $K^i$-dimensional class vector $\hat{\mathbf{z}}^i$ where $\hat{z}^i_j = 1$ if $j = y^i$ and 0 otherwise;
• if $K^i = 2$, then for each $y^i \in \mathcal{Y}^i = \{1, 2\}$, construct a 1-dimensional class vector $\hat{\mathbf{z}}^i$ where $\hat{\mathbf{z}}^i = 0$ if $y^i = 1$ and $\hat{\mathbf{z}}^i = 1$ if $y^i = 2$.
With the above transformation, the original output class vector $\mathbf{y} = (y^1, \ldots, y^m) \in \mathcal{Y}$ is converted to a corresponding class vector $\hat{\mathbf{z}} = [\hat{\mathbf{z}}^1; \ldots; \hat{\mathbf{z}}^m]$, i.e., $\hat{\mathbf{z}}$ is obtained by concatenating the corresponding $m$ vectors in ascending index order. Thus, we form a new output domain $\hat{\mathcal{Z}} = \{\hat{\mathbf{z}} \leftarrow \mathbf{y} \mid \mathbf{y} \in \mathcal{Y}\}$, where "$\leftarrow$" denotes "transformed by MLKT". Clearly, $\hat{\mathcal{Z}} \subset \{0,1\}^L$, where $L$ is the dimensionality of $\hat{\mathbf{z}}$. Let us give an example to help understand the MLKT transformation.
Example 1. Assume the output space of MDC is $\mathcal{Y} = \mathcal{Y}^1 \times \mathcal{Y}^2$, where $\mathcal{Y}^1 := \{1, 2, 3\}$ and $\mathcal{Y}^2 := \{1, 2\}$.
1. Transformation of each individual class domain:
$\mathcal{Y}^1 := \{1, 2, 3\} \to \{(1,0,0)^T, (0,1,0)^T, (0,0,1)^T\}$, $\mathcal{Y}^2 := \{1, 2\} \to \{0, 1\}$.
2. Concatenation:
$\hat{\mathcal{Z}} = \{(1,0,0,0)^T, (1,0,0,1)^T, (0,1,0,0)^T, (0,1,0,1)^T, (0,0,1,0)^T, (0,0,1,1)^T\}$.
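A minimal sketch of the MLKT encoding just described (illustrative function name, assuming 1-based class values as in the paper):

```python
import numpy as np

def mlkt_encode(y, K):
    """Encode one MDC label vector y = (y^1, ..., y^m), given K = (K^1, ..., K^m)."""
    parts = []
    for yi, Ki in zip(y, K):
        if Ki >= 3:                        # one-vs-all block of length K^i
            block = np.zeros(Ki)
            block[yi - 1] = 1.0
            parts.append(block)
        else:                              # K^i = 2: a single 0/1 bit
            parts.append(np.array([float(yi == 2)]))
    return np.concatenate(parts)

# Example 1 from the text: Y^1 = {1,2,3}, Y^2 = {1,2}
print(mlkt_encode((2, 1), K=(3, 2)))       # -> [0. 1. 0. 0.]
print(mlkt_encode((3, 2), K=(3, 2)))       # -> [0. 0. 1. 1.]
```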
220
M
However, we observe that the new formed space is equivalent to {0, 1}L with some additionally-imposed constraints which is relatively easier for subsequent learning. These additionally-imposed constraints can be obtained based on a
ED
deep insight as follows:
Let us define an integer set φi = {φi1 , φi2 , . . . , φij , . . . } for each i = 1, . . . , m,
ˆ(φi ) = z ˆi where the elements of set φi represent the indices of ˆ such that z z
PT
corresponding to class variable domain Y i . Based on the transformation charˆ(φi ) has one and acteristics of MLKT, we find that when K i ≥ 3, the vector z
AC
CE
only one element to be 1, thus, we have X
ˆ(φij ) = 1, z
j∈φi
ˆ ∀ˆ z ∈ Z.
(1)
In fact, we can use ET z = t to formulate all the equalities, where E ∈ RL×m
is an indicator matrix whose element 1, if k = φi ∧ K i ≥ 3 j Eki = 0, otherwise 11
(2)
ACCEPTED MANUSCRIPT
and t is a m dimensional vector whose element 1, if K i ≥ 3 ti = 0, otherwise
(3)
problem Zˆ is in fact equivalent to: Z = {ET z = t | z ∈ {0, 1}L } 225
CR IP T
Next, we prove in Proposition 1 that the output domain of the newly-formed
(4)
where E and t are defined in Eq.(2) and Eq.(3), respectively. As a result, leading to the following Proposition 1:
Proposition 1. The output domain $\hat{\mathcal{Z}}$ formed by MLKT is equivalent to $\mathcal{Z} = \{\mathbf{z} \in \{0,1\}^L \mid E^T\mathbf{z} = \mathbf{t}\}$, where $E$ is the predefined indicator matrix and $\mathbf{t}$ is the predefined $m$-dimensional vector.

Proof. Based on Eq. (1), we have $\forall \mathbf{z} \in \hat{\mathcal{Z}}, \mathbf{z} \in \mathcal{Z}$. Therefore, we just need to prove that the size of $\mathcal{Z}$ is consistent with that of $\mathcal{Y}$ (which implies consistency with that of $\hat{\mathcal{Z}}$ as well). Assume the original MDC problem has $m$ class variables with $K^1, K^2, \ldots, K^m$ possible values, respectively. Since the MLKT transformation is decomposable for each class variable, we just need to prove that the size of the space $\mathcal{Z}(\phi^i)$ is also $K^i$. We give a proof according to two cases.
Case 1: if $K^i = 2$, the class variable domain $\mathcal{Y}^i$ is transformed to $\mathcal{Z}(\phi^i) = \{0,1\}^1$, so its size is also 2.
Case 2: if $K^i \ge 3$, the class variable domain $\mathcal{Y}^i$ is transformed to $\mathcal{Z}(\phi^i) = \{\mathbf{z}(\phi^i) \in \{0,1\}^{K^i} \mid \langle \mathbf{1}, \mathbf{z}(\phi^i)\rangle = 1\}$ in terms of our setup in Eq. (1), where $\mathbf{1}$ is a $K^i$-dimensional vector of all ones and $\langle\cdot,\cdot\rangle$ is the inner product. Because the equality constraint ensures that the vector $\mathbf{z}(\phi^i)$ has one and only one element equal to 1, the size of $\mathcal{Z}(\phi^i)$ is also $K^i$.

To further help understand the MLKT transformation from $\mathcal{Y}$ to $\mathcal{Z} = \{\mathbf{z} \in \{0,1\}^L \mid E^T\mathbf{z} = \mathbf{t}\}$, we give an example as follows.
Example 2. Assume the output space of MDC is $\mathcal{Y} = \mathcal{Y}^1 \times \mathcal{Y}^2$, where $\mathcal{Y}^1 := \{1,2,3\}$ and $\mathcal{Y}^2 := \{1,2\}$. The MLKT approach follows two steps:
1. Define the index sets $\phi^i$ and $L$.
2. Define the indicator matrix $E$ and the vector $\mathbf{t}$.
According to Eq. (4), we get $\phi^1 = \{1,2,3\}$, $\phi^2 = \{4\}$, $L = 4$, $E_1 = (1,1,1,0)^T$, $E_2 = (0,0,0,0)^T$, $E = [E_1, E_2]$ and $\mathbf{t} = [1, 0]^T$. Thus, $\mathcal{Z}$ can be defined as $\mathcal{Z} = \{\mathbf{z} \in \{0,1\}^4 \mid E^T\mathbf{z} = \mathbf{t}\}$.
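A companion sketch that builds the index sets $\phi^i$, the indicator matrix $E$ and the vector $\mathbf{t}$ of Eqs. (2)-(3) (same assumptions as the encoding sketch above; indices are 0-based here, whereas the paper's $\phi^i$ are 1-based):

```python
import numpy as np

def mlkt_constraints(K):
    """Return (phi, E, t) for class-value counts K = (K^1, ..., K^m)."""
    L = sum(Ki if Ki >= 3 else 1 for Ki in K)
    E = np.zeros((L, len(K)))
    t = np.zeros(len(K))
    phi, start = [], 0
    for i, Ki in enumerate(K):
        width = Ki if Ki >= 3 else 1
        phi.append(list(range(start, start + width)))
        if Ki >= 3:                        # the one-hot block must sum to 1
            E[start:start + width, i] = 1.0
            t[i] = 1.0
        start += width
    return phi, E, t

# Example 2 from the text: K = (3, 2) gives L = 4, E_1 = (1,1,1,0)^T, E_2 = 0, t = (1,0)^T
phi, E, t = mlkt_constraints((3, 2))
print(phi)   # [[0, 1, 2], [3]]
print(E.T)   # [[1. 1. 1. 0.], [0. 0. 0. 0.]]
print(t)     # [1. 0.]
```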
In terms of such a transformation, the number of instances for each class in the formed problem seems to be $\frac{N}{2}$, which is far more than that in the problem formed by LP, i.e., $\frac{1}{\prod_{i=1}^{m} K^i}N$. Moreover, on the surface it also seems to be more than that in the original problem, i.e., $\frac{1}{K^i}N$, but this is not so. Based on the decomposability of MLKT and the consistency of the size of $\mathcal{Z}(\phi^i)$ with that of $\mathcal{Y}^i$, if the original MDC problem has $N_i$ instances categorized as $y^i$, then the formed problem also has $N_i$ instances categorized as the corresponding $\mathbf{z}(\phi^i)$, meaning that the formed problem is consistent with the original MDC problem in both the number of instances for each class and the balance degree. Naturally, MLKT avoids the class overfitting and class imbalance problems. Moreover, the explicit within-dimensional relationship is reflected by the commonly used one-vs-all coding [18]. In this way, MLKT guarantees all the desired transformation characteristics.
From now on, we just need to focus on learning the structure of the problem formed by MLKT in order to reveal the original structure.

4. Learning for the transformed problem

Notation. For a matrix $A \in \mathbb{R}^{p \times q}$, $\|A\|_F = \sqrt{\sum_i\sum_j A_{ij}^2}$ denotes its Frobenius norm. For a positive definite matrix $B \succ 0$, $B^{-1}$ denotes its inverse. We use $\|x_1 - x_2\|_B^2 = (x_1 - x_2)^T B (x_1 - x_2)$ to denote the squared Mahalanobis distance between vectors $x_1$ and $x_2$.
4.1. Model construction and optimization

Given the training instances $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$ ($\mathbf{y}_i \in \mathcal{Y}$), we can obtain the corresponding re-labeled instances $D' = \{(\mathbf{x}_i, \mathbf{z}_i)\}_{i=1}^N$ ($\mathbf{z}_i \in \mathcal{Z}$) by the MLKT transformation. The remaining problem is then to build a classifier $g$ that assigns each instance $\mathbf{x}$ a vector $\mathbf{z}$ of class values: $\mathbf{x}_i = (x_i^1, \ldots, x_i^d) \mapsto \mathbf{z}_i = (z_i^1, \ldots, z_i^L)$.
Let $X \in \mathbb{R}^{N \times d}$ denote the input matrix and $Z \in \mathcal{Z}^{N \times L}$ denote the output matrix. To solve the transformed problem, a simple linear regression model learns the matrix $P$ through the following formulation:
$$\arg\min_{P \in \mathbb{R}^{d \times L}} \frac{1}{2}\|Z - XP\|_F^2 + \gamma\|P\|_F^2, \qquad (5)$$
where $\gamma \ge 0$ is a regularization parameter. However, this method usually yields low classification performance because it ignores the correlations in the output space [19]. Considering the correlations, [20, 19] proposed to learn a discriminative Mahalanobis distance metric which makes the distance between $P^T\mathbf{x}_i$ and $\mathbf{z}_i$ less than that between $P^T\mathbf{x}_i$ and any other output $\mathbf{z}$ in the output space. Unfortunately, neither [20] nor [19] is directly applicable to our transformed problem; we instead develop an alternative metric learning method well suited to our scenario which admits a closed-form solution (our Mahalanobis metric learning method is similar to [20, 19]; we detail the connections in Section 4.3). Its formulation is
$$\arg\min_{\Omega \succ 0} \sum_{\forall i,\, \tilde{\mathbf{z}}_i \in \mathcal{Z}\setminus \mathbf{z}_i} \|P^T\mathbf{x}_i - \mathbf{z}_i\|_\Omega^2 + \frac{1}{|\mathcal{Z}\setminus \mathbf{z}_i|}\|P^T\mathbf{x}_i - \tilde{\mathbf{z}}_i\|_{\Omega^{-1}}^2, \qquad (6)$$
where $P$ is the solution of the linear regression model (5) and $\Omega$ is a positive definite matrix. In the above, the first term aims to make the distance between $P^T\mathbf{x}_i$ and $\mathbf{z}_i$ smaller, and the second term to make the distance between $P^T\mathbf{x}_i$ and any other output $\mathbf{z}$ larger. The idea of using $\Omega^{-1}$ is motivated by [21], where $\Omega^{-1}$ is used to measure the distances between dissimilar points: the goal is to increase the Mahalanobis distance between $P^T\mathbf{x}_i$ and any other output $\mathbf{z}$ by decreasing $\|P^T\mathbf{x}_i - \mathbf{z}\|_{\Omega^{-1}}^2$ (see Proposition 1 of [21]).
Because the size of $\mathcal{Z}$ grows exponentially with the dimension $L$, we only consider the $k$ nearest neighbors (kNN) of $\mathbf{z}_i$ in the training dataset instead of all other outputs in the whole output space. Moreover, a regularization term is used to avoid overfitting. Therefore, we present the formulation of the distance metric learning method as follows:
$$\arg\min_{\Omega \succ 0} \lambda D_{sld}(\Omega, I) + \sum_{\forall i,\, \tilde{\mathbf{z}}_i \in kNN(\mathbf{z}_i)\setminus \mathbf{z}_i} \|P^T\mathbf{x}_i - \mathbf{z}_i\|_\Omega^2 + \frac{1}{k}\|P^T\mathbf{x}_i - \tilde{\mathbf{z}}_i\|_{\Omega^{-1}}^2, \qquad (7)$$
where $\lambda \ge 0$, $P$ is fixed to the solution of (5), $I$ is the identity matrix, and $D_{sld}(\Omega, I)$ is the symmetrized LogDet divergence
$$D_{sld}(\Omega, I) := \mathrm{tr}(\Omega) + \mathrm{tr}(\Omega^{-1}) - 2L.$$
Further define
$$S := \sum_{\forall i} (P^T\mathbf{x}_i - \mathbf{z}_i)(P^T\mathbf{x}_i - \mathbf{z}_i)^T, \qquad (8)$$
$$D := \sum_{\forall i,\, \tilde{\mathbf{z}}_i \in kNN(\mathbf{z}_i)\setminus \mathbf{z}_i} \frac{1}{k}(P^T\mathbf{x}_i - \tilde{\mathbf{z}}_i)(P^T\mathbf{x}_i - \tilde{\mathbf{z}}_i)^T. \qquad (9)$$
Using both $S$ and $D$, the minimization problem (7) can be recast as
$$\arg\min_{\Omega \succ 0} \lambda D_{sld}(\Omega, I) + \mathrm{tr}(\Omega S) + \mathrm{tr}(\Omega^{-1} D). \qquad (10)$$
Interestingly, the minimization problem (10) is the same as problem (13) of [21], and is both strictly convex and strictly geodesically convex (Theorem 3 of [21]), thus having a global optimal solution. What is more, it has the closed-form solution
$$\Omega = (S + \lambda I)^{-1} \,\sharp_{1/2}\, (D + \lambda I), \qquad (11)$$
where $A \,\sharp_{1/2}\, B := A^{1/2}(A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}$. In fact, the solution is given by the midpoint of the geodesic joining $(S+\lambda I)^{-1}$ and $(D+\lambda I)$. The geodesic viewpoint is important for making a tradeoff between $(S+\lambda I)^{-1}$ and $(D+\lambda I)$. Note that $\Omega := (S+\lambda I)^{-1} \,\sharp_{1/2}\, (D+\lambda I)$ is also the minimizer of problem (12) according to [21]:
$$\arg\min_{\Omega \succ 0} \delta_R^2(\Omega, (S+\lambda I)^{-1}) + \delta_R^2(\Omega, (D+\lambda I)), \qquad (12)$$
where $\delta_R$ denotes the Riemannian distance $\delta_R(U, V) := \|\log(V^{-1/2} U V^{-1/2})\|_F$ for $U, V \succ 0$. Thus, we can obtain a version of problem (10) that balances $S$ and $D$:
$$\arg\min_{\Omega \succ 0} (1-t)\,\delta_R^2(\Omega, (S+\lambda I)^{-1}) + t\,\delta_R^2(\Omega, (D+\lambda I)), \qquad t \in [0, 1]. \qquad (13)$$
Interestingly, it can be shown (see [22], Ch. 6) that the unique solution to problem (13) is
$$\Omega = (S + \lambda I)^{-1} \,\sharp_t\, (D + \lambda I), \qquad (14)$$
where $A \,\sharp_t\, B := A^{1/2}(A^{-1/2} B A^{-1/2})^{t} A^{1/2}$. The solution connects to the Riemannian geometry of symmetric positive definite (SPD) matrices, and thus we denote the method as gMML. We detail the whole learning procedure in Algorithm 1.
Algorithm 1 MLKT-gMML
Input: the MDC training set $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$ ($\mathbf{y}_i \in \mathcal{Y}$); the preset hyper-parameters $k$, $\lambda$, $\gamma$ and $t$.
Output: regression matrix $P$ and distance metric $\Omega$.
1: Transform $D$ to $D'$ by the MLKT approach: $D' = \{(\mathbf{x}_i, \mathbf{z}_i)\}_{i=1}^N$ ($\mathbf{z}_i \in \mathcal{Z}$)
2: Set $P := \arg\min_{P \in \mathbb{R}^{d \times L}} \frac{1}{2}\|Z - XP\|_F^2 + \gamma\|P\|_F^2$
3: Compute $S$ and $D$ by Eq. (8) and Eq. (9)
4: Set $\Omega := (S + \lambda I)^{-1} \,\sharp_t\, (D + \lambda I)$
5: return $P$ and $\Omega$
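A compact numpy sketch of Algorithm 1 under simplifying assumptions (dense matrices, Euclidean kNN over the encoded outputs, helper names of our own choosing); it is meant to illustrate Eqs. (8), (9) and (14), not to reproduce the authors' implementation:

```python
import numpy as np
from scipy.linalg import sqrtm

def spd_geodesic(A, B, t=0.5):
    """Weighted geometric mean A #_t B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}."""
    A_half = np.real(sqrtm(A))
    A_half_inv = np.linalg.inv(A_half)
    M = A_half_inv @ B @ A_half_inv
    M = (M + M.T) / 2                               # enforce symmetry numerically
    w, V = np.linalg.eigh(M)
    return A_half @ (V @ np.diag(w ** t) @ V.T) @ A_half

def mlkt_gmml(X, Z, k=3, lam=1.0, gamma=0.1, t=0.5):
    N, d = X.shape
    L = Z.shape[1]
    # Step 2: ridge regression of Eq. (5) (constants folded into gamma)
    P = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Z)
    R = X @ P - Z                                   # rows are P^T x_i - z_i
    S = R.T @ R                                     # Eq. (8)
    # Step 3: D from the k nearest neighbours of each z_i among the training outputs, Eq. (9)
    D = np.zeros((L, L))
    for i in range(N):
        dist = np.linalg.norm(Z - Z[i], axis=1)
        nn = np.argsort(dist)[1:k + 1]              # exclude z_i itself
        diff = (X[i] @ P) - Z[nn]
        D += diff.T @ diff / k
    # Step 4: closed-form metric of Eq. (14)
    Omega = spd_geodesic(np.linalg.inv(S + lam * np.eye(L)), D + lam * np.eye(L), t)
    return P, Omega
```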
Note that $P$ has an impact on learning $\Omega$ and, conversely, $\Omega$ has an impact on learning $P$ as well. Thus, $P$ can also be obtained by optimizing the following problem:
$$P := \arg\min_{P \in \mathbb{R}^{d \times L}} \frac{1}{2}\|Z - XP\|_\Omega^2 + \gamma\|P\|_F^2. \qquad (15)$$
Its solution boils down to solving the Sylvester equation $(X^T X)P + \gamma P \Omega^{-1} = X^T Z$; a classical algorithm for solving such an equation is the Bartels-Stewart algorithm [23]. In a nutshell, an iterative algorithm for learning $P$ and $\Omega$, called gMML-I, is detailed in Algorithm 2.
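The $P$-update of Algorithm 2 below can be delegated to a standard Sylvester solver; a small sketch under the same assumptions as before (hypothetical helper name, assuming $X$, $Z$ and $\Omega$ are already available):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_P(X, Z, Omega, gamma=0.1):
    """Solve (X^T X) P + gamma * P * Omega^{-1} = X^T Z via Bartels-Stewart."""
    A = X.T @ X                       # d x d
    B = gamma * np.linalg.inv(Omega)  # L x L
    Q = X.T @ Z                       # d x L
    return solve_sylvester(A, B, Q)
```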
Algorithm 2 gMML-I
Input: the MDC training set $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$ ($\mathbf{y}_i \in \mathcal{Y}$); the number of iterations $\eta$; the preset hyper-parameters $k$, $\lambda$, $\gamma$ and $t$.
Output: regression matrix $P$ and distance metric $\Omega$.
1: Transform $D$ to $D'$ by the MLKT approach: $D' = \{(\mathbf{x}_i, \mathbf{z}_i)\}_{i=1}^N$ ($\mathbf{z}_i \in \mathcal{Z}$)
2: Set $\Omega_{init} = I$
3: repeat
4:   Set $P := \arg\min_{P \in \mathbb{R}^{d \times L}} \frac{1}{2}\|Z - XP\|_\Omega^2 + \gamma\|P\|_F^2$
5:   Compute $S$ and $D$ by Eq. (8) and Eq. (9)
6:   Set $\Omega := (S + \lambda I)^{-1} \,\sharp_t\, (D + \lambda I)$
7: until $\eta$ iterations are reached
8: return $P$ and $\Omega$
4.2. Prediction for a new instance
PT
295
AN US
Output: Regression matrix P and distance metric Ω;
Based on the learning procedure, the output z of a new instance x can be
CE
predicted by solving the following optimization problem: min z∈Z
1 || z − PT x ||2Ω 2
(16)
It is equivalent to solving a quadratic binary optimization problem with
AC
equality constraints, namely,
min z
1 2
|| z − PT x ||2Ω
s.t. ET z = t z ∈ {0, 1}L 17
(17)
ACCEPTED MANUSCRIPT
The optimization problem (17) is very difficult to solve due to its NPhardness. Instead we replace the binary constraints with 0 ≤ z ≤ 1, then the NP-hard optimization problem is converted to a simple box-constrained
min v
1 2
|| v − PT x ||2Ω
s.t. ET v = t v ∈ [0, 1]L
CR IP T
quadratic programming as follows
AN US
Now, for each set φi , the prediction z(φi ) of x can be made in terms of 1, if j = arg maxk v(φi ) , k z(φi ) = if | φi |≥ 3, k = 1, . . . , Ki j 0, otherwise z i = round(v i ), if | φi |= 1 (φ ) (φ )
(18)
(19)
Where round() means rounding their predictions into 0/1 assignments. In turn,
M
the prediction yi of x in the original output space is yi = j, if | φi |≥ 3
(20)
ED
yi = (z i + 1), if | φi |= 1 (φ )
Algorithm 3 details the predicting procedures.
PT
Algorithm 3 Predict new instance x Input: The learned regression matrix P and distance metric Ω; The new instance x
CE
Output: The prediction class vector y Solve z := arg minz∈Z || z − PT x ||2Ω ;
2:
Inverse transformation: y ← z according to Eq.(19) and Eq.(20);
3:
return y;
AC
1:
4.3. Connections between existing metric learning methods and ours
300
Two works are most closely related to our metric learning method, namely maximum margin output coding (MMOC) [20] and large margin metric learning with kNN constraints (LM-kNN) [19]. These two methods likewise use a Mahalanobis distance metric (a symmetric positive semidefinite matrix in $S_+$) to model the output structure of MLC, where the metric is used to learn a lower-dimensional space.
MMOC aims to learn a discriminative Mahalanobis metric which makes the distance between $P^T\mathbf{x}_i$ and its real class vector $\mathbf{z}_i$ as close to 0 as possible and smaller than the distance between $P^T\mathbf{x}_i$ and any other output by some margin. Specifically, its formulation is
$$\arg\min_{\Omega \in S_+,\{\xi_i\}_{i=1}^n} \frac{1}{2}\mathrm{trace}(\Omega) + \frac{C}{n}\sum_{i=1}^n \xi_i \quad \text{s.t. } \varphi_{i\mathbf{z}_i}^T\Omega\varphi_{i\mathbf{z}_i} + \Delta(\mathbf{z}_i, \mathbf{z}) - \xi_i \le \varphi_{i\mathbf{z}}^T\Omega\varphi_{i\mathbf{z}}, \ \forall \mathbf{z} \in \{0,1\}^L, \forall i, \qquad (21)$$
where $C$ is a positive constant, $\varphi_{i\mathbf{z}} = P^T\mathbf{x}_i - \mathbf{z}$ and $\varphi_{i\mathbf{z}_i} = P^T\mathbf{x}_i - \mathbf{z}_i$. MMOC has proved to have good classification accuracy for MLC. However, it also carries a heavy burden: it has to handle an exponentially large number of constraints for each instance during training, which can make it computationally infeasible.
Like MMOC, LM-kNN also adopts Mahalanobis metric learning for MLC, but it involves only $k$ constraints per instance. Its metric learning attempts to bring instances with similar class vectors closer, so that the class vector of each instance can be predicted from its nearest neighbors. LM-kNN is thus much simpler than MMOC and is established by minimizing the following objective:
$$\arg\min_{\Omega \in S_+,\{\xi_i\}_{i=1}^n} \frac{1}{2}\mathrm{trace}(\Omega) + \frac{C}{n}\sum_{i=1}^n \xi_i \quad \text{s.t. } \varphi_{i\mathbf{z}_i}^T\Omega\varphi_{i\mathbf{z}_i} + \Delta(\mathbf{z}_i, \mathbf{z}) - \xi_i \le \varphi_{i\mathbf{z}}^T\Omega\varphi_{i\mathbf{z}}, \ \forall \mathbf{z} \in Nei(i), \forall i, \qquad (22)$$
where $C$, $\varphi_{i\mathbf{z}}$ and $\varphi_{i\mathbf{z}_i}$ are defined as in MMOC and $Nei(i)$ is the output set of the $k$ nearest neighbors of the input $\mathbf{x}_i$.
For LM-kNN, the prediction for a testing instance is obtained from its $k$ nearest neighbors in the learned metric space. Specifically, for a testing
input $\mathbf{x}$, we find its $k$ nearest instances $\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}$ in the training set; then a set of scores for each class of $\mathbf{x}$ is obtained from the distances between $\mathbf{x}$ and $\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}$; lastly, these scores are thresholded to predict its class vector.
Clearly, neither MMOC nor LM-kNN can be applied to our transformed problem directly, because the output space of our transformed problem is not equivalent to that of MLC. Although they could be adapted to our scenario, the required effort is non-trivial because their training and/or prediction procedures would have to be re-designed; moreover, this is not our present focus. As a
AN US
result, we choose an alternative design way for our Mahalanobis distance metric learning, where our method is formally close to MMOC but has a closed form solution as described in section 4.1. 4.4. Complexity Analysis
The time complexity of regularized least square regression is basically the
330
M
complexity of computing the matrix multiplication with O(N d2 + N dL) plus the complexity of the inverse computation with O(d3 ). The complexity of com-
ED
puting the geometric mean of two matrices by Cholesky-Schur method [24] is O(L3 ). The complexity of computing a Sylvester equation is O(d3 + L3 + N d2 + N dL). The complexity of computing a box-constrained quadratic problem is
PT
O(L3 + Ld). And the time complexity of kNN is O(kN ). Algorithm 1 involves solving a regularized least square regression problem
335
and computing the geometric mean of two matrices. Therefore, its total time
CE
complexity is O(N d2 + N dL + d3 + L3 + kN ). Algorithm 2 involves solving a Sylvester equation and computing the geometric mean of two matrices with η
AC
iterations (where η is usually fixed to a pre-set small integer). Therefore, its
340
time complexity is O(η(d3 + L3 + N d2 + N dL + kN )). Algorithm 3 involves
solving a box-constrained quadratic problem, its time complexity is O(L3 +Ld). Based on [19], the training and predicting time complexities of LM-kNN are
respectively O( √1 (N d2 + N dL + L3 + d3 + kN dL2 )) and O(LN + Ld), where is the accuracy met by its solution. The training and predicting time complexities 20
ACCEPTED MANUSCRIPT
345
of MMOC are respectively O(θ(N d2 + N dL + d3 + N L3 + N 4 )) and O(L3 ), where θ is its iterations. In comparison with our metric learning counterparts used here, our Algo-
CR IP T
rithm 1 and Algorithm 2 have an advantage over MMOC and LM-kNN in terms of training time due to η O( √1 ) and η θ. The predicting time complexity 350
of our Algorithm 3 is comparable to MMOC and higher than LM-kNN.
5. Experiments
In this section, we discuss the experiments conducted on two publicly avail-
AN US
able real-world datasets for MDC. The two datasets for MDC are respectively ImageCLEF2014 1 and Bridges. ImageCLEF2014 comes from a real world chal355
lenge in the field of robot vision [25]. Bridges dataset comes from the UCI collection [26]. Unfortunately, there are not yet many publicly available standardized multi-dimensional datasets, so we boost our collections with eight most
M
commonly used multi-label datasets which can be accessed from Mulan 2 . The characteristics of these datasets are shown in Table 1.
ED
Table 1: Datasets used in the evaluation class variables
Features
Instances
birds
19
260
645
emotions
6
72
593
medical
45
1449
978
scene
6
294
2407
yeast
14
103
2417
flags
7
19
194
genbase
27
1186
662
CAL500
174
68
502
AC
CE
PT
Dataset
360
bridges
5
7
107
CLEF2014
9
264
6500
We consider two commonly used evaluation criteria for MDC, namely Ham-
ming accuracy and Example accuracy. These evaluation criteria can be calcu1 http://www.imageclef.org/2014/robot 2 http://mulan.sourceforge.net/datasets-mlc.html
21
ACCEPTED MANUSCRIPT
Table 2: Hamming Accuracy (Part A) classifier
birds
emotions
medical
scene
yeast
MLKT-RR
0.9511 ± 0.0083
0.7771 ± 0.0263
0.9509 ± 0.0086
0.8477 ± 0.0282
0.7048 ± 0.0042
♠0.9562 ± 0.0089
♠0.7932 ± 0.0254
♠0.9736 ± 0.0068
♠0.8528 ± 0.0324
♠0.7065 ± 0.0077
0.7904 ± 0.0256
0.9545 ± 0.0091
lated as follows.
1. Hamming accuracy:
$$\mathrm{Acc} = \frac{1}{m}\sum_{j=1}^{m} \mathrm{Acc}^j = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{N}\sum_{i=1}^{N}\delta(y_i^j, \hat{y}_i^j),$$
0.8463 ± 0.0291
where $\delta(y_i^j, \hat{y}_i^j) = 1$ if $\hat{y}_i^j = y_i^j$, and 0 otherwise. Note that $\hat{y}_i^j$ denotes the $j$th class value predicted by the classifier for instance $i$ and $y_i^j$ is its true value.
2. Example accuracy:
$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\delta(\mathbf{y}_i, \hat{\mathbf{y}}_i),$$
where $\delta(\mathbf{y}_i, \hat{\mathbf{y}}_i) = 1$ if $\hat{\mathbf{y}}_i = \mathbf{y}_i$, and 0 otherwise.
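For concreteness, a small sketch of the two criteria (hypothetical helper names), where Y_true and Y_pred are N x m matrices of class values:

```python
import numpy as np

def hamming_accuracy(Y_true, Y_pred):
    # mean, over output dimensions, of the per-dimension accuracy
    return float(np.mean(Y_true == Y_pred))

def example_accuracy(Y_true, Y_pred):
    # an example counts only if every output dimension is predicted correctly
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))
```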
ED
Before the experiments, some parameters need to be set in advance. The parameter η for gMML-I algorithm is always set to 3 throughout our experiments (because when η > 3, we find it has no changes for Ω and P). The parameters λ and t associated with Ω are respectively tuned from the range {100 , 101 , 102 }
PT
370
and {0.3, 0.5, 0.7}. The parameter γ for P is tuned from the range {0, 0.1, 0.2}.
CE
All the following experimental results are the average results of 10-fold cross validation experiments. And, we use the notation ♠ to denote the best results.
AC
5.1. Comparison with our baseline methods
375
We firstly verify the classification accuracy of MLKT-gMML-I in comparison
with that of both ridge regression model (namely Ω = I) and the algorithm without iteration procedure (namely MLKT-gMML). We show the results in Tables 2, 3, 4 and 5.
22
0.7054 ± 0.0034
CR IP T
0.9519 ± 0.0081
MLKT-gMML MLKT-gMML-I
ACCEPTED MANUSCRIPT
Table 3: Hamming Accuracy (Part B) classifier
flags
genbase
CAL500
ImageCLEF2014
bridges
MLKT-RR
♠0.7376 ± 0.0384
0.9554 ± 0.0549
0.8624 ± 0.0061
♠0.8430 ± 0.0191
0.7000 ± 0.1222
MLKT-gMML-I
0.7353 ± 0.0294
♠0.9896 ± 0.0038
♠0.8625 ± 0.0063
0.8429 ± 0.0254
classifier
birds
emotions
medical
scene
yeast
MLKT-RR
0.4781 ± 0.0599
0.2136 ± 0.0572
0.5962 ± 0.0507
0.3000 ± 0.1343
0.0041 ± 0.0044
♠0.5188 ± 0.0796
♠0.2661 ± 0.0614
♠0.7782 ± 0.0432
0.2917 ± 0.1607
♠0.0051 ± 0.0046
0.7323 ± 0.0295
0.9698 + 0.0166
0.8526 ± 0.0063
Table 4: Example Accuracy (Part A)
MLKT-gMML MLKT-gMML-I
0.4813 ± 0.0584
0.2593 ± 0.0654
0.8429 ± 0.0254
0.6943 ± 0.1264
♠0.7286 ± 0.1071
CR IP T
MLKT-gMML
0.6667 ± 0.0580
♠0.3008 ± 0.1334
0.0051 ± 0.0064
380
AN US
From the results, we can see that MLKT-gMML-I nearly achieves the best
accuracy on all these datasets in regard to both evaluation criteria. To verify whether the differences are significant, two non-parametric Friedman tests among these methods for Hamming accuracy and Example accuracy are conducted respectively.
385
M
In Hamming accuracy, the Friedman test renders a F value of 8.6471 (> F(α,k−1,(b−1)(k−1)) = F(0.05,2,18) = 3.555)3 . Thus, the null hypothesis that all the methods have identical effects is rejected and a post-hoc test needs to be
ED
conducted for further testing their differences. To this end, a commonly-used post-hoc test, Nemenyi test, is conducted. The result is shown in Figure 4 from which we can see that: 1) MLKT-gMML-I has a significant difference from our other two methods; 2) MLKT-gMML achieves a comparable performance with
PT
390
CE
MLKT-RR. 3 Here,
F is the percent point function of the F distributuion, α is the significance level, b
AC
is the number of datasets and k is the number of algorithms for test.
Table 5: Example Accuracy (Part B) classifier
flags
genbase
CAL500
ImageCLEF2014
bridges
MLKT-RR
0.1737 ± 0.0994
0.8379 ± 0.2966
0.0000 ± 0.0000
0.2250 ± 0.0348
0.1714 ± 0.1475
♠0.1842 ± 0.0568
♠0.9273 ± 0.0333
♠0.2256 ± 0.0480
♠0.2429 ± 0.1656
MLKT-gMML
MLKT-gMML-I
0.1789 ± 0.0666
0.9242 ± 0.0357
23
0.0000 ± 0.0000
0.0000 ± 0.0000
♠0.2256 ± 0.0480
0.1714 ± 0.1475
ACCEPTED MANUSCRIPT
MLKT−gMML
MLKT−gMML−I
0.5
1
1.5
2
2.5
3
CR IP T
MLKT−RR
3.5
Figure 4: Friedman Test of our methods in terms of Hamming accuracy. In the graph, the
horizontal axis represents the values of mean rank, the vertical axis represents the different methods for test. For each method, the • represents its mean rank value, the line segment
AN US
represents the critical range of Nemenyi test. Two methods have a significant difference, if their line segments are not overlapped; not so, otherwise.
In Example accuracy, the Friedman test renders a F value of 10.85 (> F(α,k−1,(b−1)(k−1) = F(0.05,2,18) = 3.555), meaning the null hypothesis is also rejected. Then, A post-hoc Nemenyi test is again conducted. Its result is shown in Figure 5 and indicates that: 1) MLKT-gMML-I is significantly differ-
M
395
ent from MLKT-RR; 2) MLKT-gMML also achieves a comparable performance
ED
with MLKT-RR.
On the whole, we can conclude that MLKT-gMML-I achieves the best classification performance while MLKT-gMML achieves a comparable performance with MLKT-RR. Therefore, in the following, we just concentrate on the com-
PT
400
parison between MLKT-gMML-I and the other competitive MDC methods.
CE
5.2. Comparison with several competitive MDC methods We then compare MLKT-gMML-I with several competitive methods for
AC
MDC from the literature: Binary-Relevance (BR), Classifier Chains (CC) , En-
405
semble of Classifier Chains (ECC), RAkEL and Super-Class Classifier (SCC).
Since the above methods are only designed for modeling output structure, Naive Bayesian classifier is used as their base classifier in our experiments. We use an open-source Java framework, namely the MEKA [27] library, for the experiments. Regarding the parameterization of these approaches, ECC is 24
ACCEPTED MANUSCRIPT
MLKT−RR
MLKT−gMML−I
0.5
1
1.5
2
2.5
3
CR IP T
MLKT−gMML
3.5
Figure 5: Friedman Test of our methods in terms of Example accuracy. In the graph, the
horizontal axis represents the values of mean rank, the vertical axis represents the different methods for test. For each method, the • represents its mean rank value, the line segment
AN US
represents the critical range of Nemenyi test. Two methods have a significant difference, if their line segments are not overlapped; not so, otherwise.
Table 6: Hamming Accuracy (Part A) classifier
birds
emotions
medical
scene
yeast
BR-NB
0.6472 ± 0.0354
0.7479 ± 0.0204
0.9747 ± 0.0016
0.7581 ± 0.0101
0.7162 ± 0.0122
0.7102 ± 0.0442
0.7544 ± 0.0221
RAkEL SCC
410
0.7859 ± 0.0163
0.7530 ± 0.0213
0.9751 ± 0.0011
0.7632 ± 0.0094
0.7460 ± 0.0228
♠0.9753 ± 0.0015
0.8385 ± 0.0134
0.8770 ± 0.0110
♠0.9562 ± 0.0089
♠0.7932 ± 0.0254
0.9736 ± 0.0068
0.7660 ± 0.0180
0.9751 ± 0.0016
0.9752 ± 0.0010
0.7634 ± 0.0096
♠0.8630 ± 0.0100 0.8528 ± 0.0324
ED
MLKT-gMML-I
0.6411 ± 0.0375
M
CC-NB ECC-NB
configured to learn 10 different models for the ensemble, for RAkEL we use the recommended configuration with 2m models having triplets of class combi-
PT
nations, and for SCC we use a nearest neighbour replacement filter (NNR) to identify all p = 1 infrequent class-values and replace them with their n = 2 most-
CE
frequent nearest neighbours. Their Hamming accuracy and Example accuracy
415
are shown in Tables 6, 7, 8 and 9 respectively. From the results of these tables, we see that MLKT-gMML-I can achieve
AC
better performance on most of the datasets than its competitive MDC methods (BR, CC, ECC, RAkEL and SCC) in terms of both evaluation criteria. To
verify the performance differences, two non-parametric Friedman tests among
420
these methods for Hamming accuracy and Example accuracy are respectively conducted.
25
0.6965 ± 0.0131 0.7018 ± 0.0128 0.6760 ± 0.0142
♠0.7574 ± 0.0081 0.7065 ± 0.0077
ACCEPTED MANUSCRIPT
Table 7: Hamming Accuracy (Part B) flags
genbase
CAL500
BR-NB
0.6645 ± 0.0443
0.9702 ± 0.0063
0.6813 ± 0.0094
0.7226 ± 0.0314
0.9661 ± 0.0036
RAkEL SCC
0.6890 ± 0.0454
0.6562 ± 0.0594 0.7041 ± 0.0496
MLKT-gMML-I
♠0.7353 ± 0.0294
classifier
birds
BR-NB
0.0326 ± 0.0191
0.5965 ± 0.0083
bridges 0.7200 ± 0.0470
0.9661 ± 0.0044
0.6060 ± 0.0131
0.6050 ± 0.0090
0.7060 ± 0.0660
0.9653 ± 0.0047
0.7141 ± 0.0069
0.6560 ± 0.0120
0.7120 ± 0.0250
0.9460 + 0.0081 ♠0.9896 ± 0.0038
0.7124 ± 0.0068 0.7050 ± 0.0120
♠0.8625 ± 0.0063
AN US
CC-NB ECC-NB
ImageCLEF2014
CR IP T
classifier
0.6150 ± 0.0096 0.8000 ± 0.0070
♠0.8429 ± 0.0254
0.7114 ± 0.0899 0.7140 ± 0.0520
♠0.7286 ± 0.1071
Table 8: Example Accuracy (Part A)
SCC
scene
yeast
0.1691 ± 0.0225
0.0980 ± 0.0212
0.0481 ± 0.0237
0.2260 ± 0.0497
0.2730 ± 0.0331
0.1753 ± 0.0212
0.0327 ± 0.0237
0.2141 ± 0.0348
0.2649 ± 0.0391
0.3020 ± 0.0209
0.2020 ± 0.0480
♠0.5188 ± 0.0796
0.2360 ± 0.0439 0.2650 ± 0.0490
♠0.2661 ± 0.0614
0.2771 ± 0.0400 0.2795 ± 0.0320
♠0.7782 ± 0.0432
0.1795 ± 0.0216 ♠0.5370 ± 0.0260 0.2917 ± 0.1607
0.1080 ± 0.0246 0.1109 ± 0.0224 0.0765 ± 0.0173
♠0.1910 ± 0.0240 0.0051 ± 0.0046
CE
PT
MLKT-gMML-I
medical
0.2659 ± 0.0377
M
RAkEL
0.0326 ± 0.0191
ED
CC-NB ECC-NB
emotions
0.2057 ± 0.0549
flags
genbase
CAL500
ImageCLEF2014
bridges
BR-NB
0.0361 ± 0.0344
0.5110 ± 0.0588
0.0000 ± 0.0000
0.0292 ± 0.0101
0.2000 ± 0.1680
0.0363 ± 0.05270
0.2749 ± 0.0635
CC-NB
AC
Table 9: Example Accuracy (Part B)
classifier
ECC-NB RAkEL SCC
MLKT-gMML-I
0.0771 ± 0.0603
0.0780 ± 0.0755
♠0.1927 ± 0.1220 0.1842 ± 0.0568
0.2810 ± 0.0732
0.0000 ± 0.0000
0.0320 ± 0.0100
0.1860 ± 0.1790
0.2761 ± 0.0724
0.0000 ± 0.0000
0.0280 ± 0.0320
♠0.2570 ± 0.0610
♠0.2256 ± 0.0480
0.2429 ± 0.1656
0.3906 ± 0.0827
♠0.9273 ± 0.0333
26
0.0000 ± 0.0000
0.0000 ± 0.0000
0.0000 ± 0.0000
0.0335 ± 0.0117
0.2160 ± 0.0270
0.2286 ± 0.2231
0.2290 ± 0.1300
ACCEPTED MANUSCRIPT
In Hamming accuracy, the Friedman test renders a F value of 5.2663 (> F(α,k−1,(b−1)(k−1)) = F(0.05,9,45) = 2.422). Thus, the null hypothesis that all the methods are identical is rejected and a post-hoc test needs to be conducted for further testing their differences. To this end, a commongly-used post-hoc test,
CR IP T
425
Nemenyi test, is conducted. The results is shown in Figure 6, from which we can
see that: 1) MLKT-gMML-I has a significant difference from two methods (BRNB and CC-NB); 2) There is no significant difference among these methods
except MLKT-gMML-I. Therefore, MLKT-gMML-I achieves a slightly better classification performance than its competitive MDC methods.
AN US
430
BR−NB
CC−NB
ECC−NB
RKkEL
SCC
0
M
MLKT−gMML−I
1
2
3
4
5
6
7
Figure 6: Friedman Test in terms of Hamming accuracy. In the graph, the horizontal axis
ED
represents the values of mean rank, the vertical axis represents the different methods for test. For each method, the • represents its mean rank value, the line segment represents the critical range of Nemenyi test. Two methods have a significant difference, if their line segments are
PT
not overlapped; not so, otherwise.
In Example accuracy, the Friedman test renders a F value of 5.4961 (>
CE
F(α,k−1,(b−1)(k−1)) = F(0.05,10,45) = 2.422). Thus, the null hypothesis that all the methods are identical is rejected and a post-hoc test needs to be conducted for further testing their differences. To this end, the Nemenyi test as above is conducted. The results is shown in Figure 6 from which we can see that:
AC
435
1) MLKT-gMML-I has a significant difference from BR-NB. 2) There is not a significant difference among these methods except BR-NB. Therefore, MLKTgMML-I can achieve a comparable Example accuracy with its competitive methods.
440
On the whole, we can conclude that MLKT-gMML-I achieves comparable 27
ACCEPTED MANUSCRIPT
BR−NB
CC−NB
ECC−NB
SCC
MLKT−gMML−I
0
1
2
3
4
5
6
CR IP T
RAkEL
7
Figure 7: Friedman Test in terms of Example accuracy. In the graph, the horizontal axis represents the values of mean rank, the vertical axis represents the different methods for test.
For each method, the • represents its mean rank value, the line segment represents the critical not overlapped; not so, otherwise.
AN US
range of Nemenyi test. Two methods have a significant difference, if their line segments are
(or even slightly better) classification performance with (or than) its competitive MDC methods.
M
5.3. Comparison with LM-kNN on MLC task
Note that gMML-I is closely related to MMOC and LM-kNN and the latter two are designed for MLC specially, thus we just conduct experiments on MLC
ED
445
datasets to verify their classification performance. However, since MMOC has to deal with exponentially large number of constraints for each instance in training
PT
procedure, it is infeasible even for the CAL500 dataset with 68 features and 174 labels [19]. Therefore, we only compare gMML-I with LM-kNN. We show the 450
results in figure 8 and figure 9.
CE
We can see from the figures that gMML-I achieves better performance on
six datasets in terms of Hamming accuracy, while four datasets in terms of
AC
Example accuracy4 than LM-kNN. So on the whole, gMML-I achieves better classification on most of the datasets. To verify their difference, the Friedman
455
tests of differences between gMML-I and LM-kNN are conducted and render F -values of 2.3333 for Hamming accuracy and 0.4667 for Example accuracy re4 Both
methods achieve zero accuracy on CAL500 for Example accuracy.
28
CR IP T
ACCEPTED MANUSCRIPT
1 0.9 0.8 0.7 0.6 0.5 0.4
gMML−I LM−kNN
0.3 0.2
0
birds
emotions medical
AN US
0.1 scene
yeast
flags
genbase CAL500
ED
M
Figure 8: Hamming Accuracy (HA).
1
gMML−I LM−kNN
0.9
PT
0.8 0.7 0.6
AC
CE
0.5 0.4 0.3 0.2 0.1 0
birds
emotions medical
scene
yeast
flags
genbase CAL500
Figure 9: Example Accuracy (EA).
29
ACCEPTED MANUSCRIPT
spectively, both are not significant (< F(α,k−1,(b−1)(k−1)) = F(0.05,7,7) = 5.595). So, gMML-I can obtain competitive classification accuracy on MLC task with LM-kNN, but has a lower learning complexity than LM-kNN as analysed in section 4.4.
CR IP T
460
6. Conclusions
In this paper, we proposed a new transformation approach, namely MLKT, for MDC, which possesses the following favorable characteristics: i) it keeps the output space size of MDC invariant, ii) it reflects the explicit within-dimensional relationships, iii) it is easy for subsequent modeling in the transformed space, iv) it overcomes the class overfitting and class imbalance problems suffered by LP-based transformation approaches, and v) it is decomposable for each output dimension of MDC. Moreover, we also presented a novel metric learning based method for the transformed problem, which is of independent interest and has a closed-form solution. Extensive experimental results justified that our approach combining the above two procedures can achieve better classification performance than the competitive MDC methods, while our metric learning based method itself can also obtain competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC. As mentioned in the introduction, a future direction is to adapt existing MLC methods to develop alternatives well suited to our transformed problem.
AC
Acknowledgements
480
This work is supported in part by the National Natural Science Foundation
of China under the Grant Nos. 61672281 and in part by the Funding of Jiangsu Innovation Program for Graduate Education under Grant KYLX16 0383. And we would like to express our appreciation for the valuable comments from reviewers and editors.
30
ACCEPTED MANUSCRIPT
References 485
[1] M. Elhoseiny, T. El-Gaaly, A. Bakry, A. Elgammal, Convolutional models for joint object categorization and pose estimation, arXiv preprint
CR IP T
arXiv:1511.05175.
[2] T. Theeramunkong, V. Lertnattee, Multi-dimensional text classification,
in: Proceedings of the 19th international conference on Computational linguistics-Volume 1, Association for Computational Linguistics, 2002, pp.
490
1–7.
AN US
[3] J. Ortigosa-Hern´ andez, J. D. Rodr´ıguez, L. Alzate, M. Lucania, I. Inza, J. A. Lozano, Approaching sentiment analysis by using semi-supervised
learning of multi-dimensional classifiers, Neurocomputing 92 (2012) 98– 115.
495
[4] C. Tu, Z. Liu, M. Sun, Prism: Profession identification in social media
M
with personal information and community structure, in: Chinese National Conference on Social Media Processing, Springer, 2015, pp. 15–27.
ED
[5] J. H. Zaragoza, L. E. Sucar, E. F. Morales, C. Bielza, P. Larranaga, Bayesian chain classifiers for multidimensional classification, in: IJCAI,
500
PT
Vol. 11, Citeseer, 2011, pp. 2192–2197. [6] W. Cheng, E. H¨ ullermeier, K. J. Dembczynski, Bayes optimal multilabel classification via probabilistic classifier chains, in: Proceedings of the 27th
CE
international conference on machine learning (ICML-10), 2010, pp. 279– 286.
505
AC
[7] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multilabel classification, Machine learning 85 (3) (2011) 333–359.
[8] L. C. Van Der Gaag, P. R. De Waal, et al., Multi-dimensional Bayesian network classifiers., in: Probabilistic graphical models, Citeseer, 2006, pp.
510
107–114.
31
ACCEPTED MANUSCRIPT
[9] C. Bielza, G. Li, P. Larranaga, Multi-dimensional classification with Bayesian networks, International Journal of Approximate Reasoning 52 (6) (2011) 705–727.
CR IP T
[10] J. Arias, J. A. Gamez, T. D. Nielsen, J. M. Puerta, A scalable pairwise class interaction framework for multidimensional classification, Interna-
515
tional Journal of Approximate Reasoning 68 (2016) 194–210.
[11] M. R. Boutell, J. Luo, X. Shen, C. M. Brown, Learning multi-label scene classification, Pattern recognition 37 (9) (2004) 1757–1771.
AN US
[12] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, International Journal of Data Warehousing and Mining 3 (3) (2007) 1.
520
[13] J. Read, C. Bielza, P. Larra˜ naga, Multi-dimensional classification with super-classes, IEEE Transactions on knowledge and data engineering 26 (7) (2014) 1720–1733.
M
[14] J. Read, L. Martino, P. M. Olmos, D. Luengo, Scalable multi-output label prediction: From classifier chains to classifier trellises, Pattern Recognition
525
ED
48 (6) (2015) 2096–2109.
[15] J. D´ıez, J. J. del Coz, O. Luaces, A. Bahamonde, Using tensor products to
PT
detect unconditional label dependence in multilabel classifications, Information Sciences 329 (2016) 20–32. [16] Inference and learning in multi-dimensional Bayesian network classi-
CE
530
fiers, author=De Waal, Peter R and Van Der Gaag, Linda C, book-
AC
title=European Conference on Symbolic and Quantitative Approaches
535
to Reasoning and Uncertainty, pages=501–511, year=2007, organization=Springer.
[17] G. Tsoumakas, I. Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: European Conference on Machine Learning, Springer, 2007, pp. 406–417.
32
ACCEPTED MANUSCRIPT
[18] Y. Anzai, Pattern Recognition & Machine Learning, Elsevier, 2012. [19] W. Liu, I. W. Tsang, Large margin metric learning for multi-label predic-
CR IP T
tion., in: AAAI, 2015, pp. 2800–2806.
540
[20] Y. Zhang, J. G. Schneider, Maximum margin output coding, in: Proceed-
ings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 1575–1582.
[21] P. H. Zadeh, R. Hosseini, S. Sra, Geometric mean metric learning, in:
Proceedings of The 33rd International Conference on Machine Learning,
545
AN US
2016, pp. 2464–2471.
[22] R. Bhatia, Positive definite matrices, Princeton university press, 2009. [23] R. H. Bartels, G. Stewart, Solution of the matrix equation ax+ xb= c [f4], Communications of the ACM 15 (9) (1972) 820–826.
[24] B. Iannazzo, The geometric mean of two matrices from a computational
M
550
viewpoint, arXiv preprint arXiv:1201.0101.
ED
[25] B. Caputo, H. M¨ uller, J. Martinez-Gomez, M. Villegas, B. Acar, N. Patri¨ udarlı, R. Paredes, M. Cazorla, et al., Imageclef cia, N. Marvasti, S. Usk¨ 2014: Overview and analysis of the results, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer,
PT
555
2014, pp. 192–211.
CE
[26] M. Lichman, UCI machine learning repository (2013). URL http://archive.ics.uci.edu/ml
AC
[27] J. Read, P. Reutemann, B. Pfahringer, G. Holmes, MEKA: A multi-
560
label/multi-target extension to Weka, Journal of Machine Learning Research 17 (21) (2016) 1–5. URL http://jmlr.org/papers/v17/12-164.html
33
Zhongchen Ma received his B.S. degree in Information and Computing Science from Qingdao Agricultural University in 2012. In 2015, he completed his M.S. degree in computer science and technique at Nanjing University of Aeronautics and Astronautics. He is currently pursuing the Ph.D. degree with the College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics. His research interests include pattern recognition and machine learning.
Songcan Chen received his B.S. degree in mathematics from Hangzhou University (now merged into Zhejiang University) in 1983. In 1985, he completed his M.S. degree in computer applications at Shanghai Jiaotong University and then worked at NUAA in January 1986. There he received a Ph.D. degree in communication and information systems in 1997. Since 1998, as a full-time professor, he has been with the College of Computer Science & Technology at NUAA. His research interests include pattern recognition, machine learning and neural computing.