Multi-dimensional classification via a metric approach


Accepted manuscript, to appear in Neurocomputing. Communicated by Zhaohong Deng.

PII: S0925-2312(17)31576-X. DOI: 10.1016/j.neucom.2017.09.057. Reference: NEUCOM 18939.

Received 19 June 2017; revised 4 September 2017; accepted 18 September 2017.

Please cite this article as: Zhongchen Ma, Songcan Chen, Multi-dimensional classification via a metric approach, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.09.057


Zhongchen Ma, Songcan Chen∗

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106, China

Abstract

Multi-dimensional classification (MDC) refers to learning an association between individual inputs and their multi-dimensional discrete output variables, and is thus more general than multi-class classification (MCC) and multi-label classification (MLC). One of the core goals of MDC is to model output structure for improving classification performance. To this end, one effective strategy is to first transform the output space and then learn in the transformed space. However, existing transformation approaches are all rooted in the label power-set (LP) method and thus inherit its drawbacks (e.g., class imbalance and class overfitting). In this study, we first analyze the drawbacks of LP and then propose a novel transformation method which not only overcomes these drawbacks but also constructs a bridge from MDC to MLC. As a result, many off-the-shelf MLC methods can be adapted to the newly formed problem. However, instead of adapting these methods, we propose a novel metric learning based method which yields a closed-form solution for the newly formed problem. Interestingly, our metric learning based method is also naturally applicable to MLC, and thus can be of independent interest as well. Extensive experiments justify the effectiveness of our transformation approach and our metric learning based method.

Keywords: Multi-dimensional classification, problem transformation, distance metric learning, closed-form solution

∗ Corresponding author. Email address: [email protected] (Songcan Chen)


1. Introduction

In supervised learning, binary classification (BC), multi-label classification (MLC) and multi-class classification (MCC) have been extensively studied in past years. As a more general learning task, MDC is relatively less studied up to now, which can partly be attributed to its more complex output space. Figure 1 displays the relationships among the different classification paradigms in terms of m class variables with K possible values each. As shown in the figure, BC has only a single class variable whose range is {1, 0} or {1, −1}, corresponding to m = 1 and K = 2; MCC also has only a single class variable, but one that can take a number of class values, corresponding to m = 1 and K > 2; MLC has multiple class variables whose ranges are also {1, 0} or {1, −1}, corresponding to m > 1 and K = 2. More generally, MDC allows multiple class variables that can each take a number of class values, corresponding to m > 1 and K > 2.

Figure 1: Relationship between different classification paradigms, where m is the number of class variables and K is the number of values each of these variables may take.

A wide range of applications corresponds to this task. For example, in computer vision [1], a landscape image may convey several kinds of information such as the month, the season, or the type of subject; in information retrieval [2][3], documents can be classified into different kinds of categories such as mood or topic; and in computational advertising [4], a piece of social media content may reveal the user's gender, age, personality, happiness or political polarity.

Like MLC, the core goal of MDC is to achieve effective classification performance by modeling output structure. In modeling, the simplest assumption is


that the class variables are completely unrelated, so that it is sufficient to design a separate, independent model for each class. However, such an ideal assumption is hardly applicable to real-world problems in general, as correlation (structure) often exists among class variables; for example, a user's age can have a strong impact on his or her political polarity, where the young are generally more radical and elders are often more conservative. Even within each output dimension, there exists an explicit within-dimension relationship among its values, namely that only one value of a class variable can be activated. Therefore, one key to effective learning lies in how to take sufficient advantage of the explicit and/or implicit relationships both among output dimensions and among values within each output dimension.

In order to model such output structures, two main strategies have been proposed: (i) explicitly modeling the dependence structures between class variables, e.g., by imposing a chain structure [5][6][7], using a multi-dimensional Bayesian network structure [8][9], or adopting a Markov random field [10]; (ii) implicitly modeling output structure by transformation approaches [11][12][13].

A major limitation of the former strategy lies in requiring a pre-defined output structure (e.g., a chain or a Bayesian network), thus partly losing flexibility in characterizing structure. In contrast, the transformation approach of the latter strategy enjoys more flexibility owing to its ability to model various structures. What's more, such a transformation method has demonstrated convincing performance in [13]. Therefore, in this paper, we follow the transformation strategy to model the output structures of MDC.

To the best of our knowledge, all the existing transformation methods can be classified as label power-set (LP)-based transformation approaches. LP [11] transforms the MDC problem into a corresponding multi-class classification problem by defining a new compound class variable whose range exactly contains all the possible combinations of values of the original class variables. Though implicitly considering the interaction between different classes, LP suffers from class imbalance and class overfitting problems, where class imbalance refers to the great differences in the total numbers of instances for different combinations of the class variables, and class overfitting refers to zero instances for some combinations of the class variables. To address these issues of LP, [13] proposed to first form super-class partitions by modeling the dependence between class variables and then make each super-class partition correspond to a compound class variable defined by LP. Although this super-class partitioning can reduce the original problem to a set of subproblems, these newly formed subproblems still need to be transformed by LP, so the approach naturally inherits LP's problems.

In this study, we analyze the drawbacks of LP and propose a novel transformation method which can not only overcome these drawbacks but also construct a bridge from MDC to MLC. Specifically, our transformation approach forms a new output space with all binary variables through a carefully designed binarization of the original output space of MDC. Since the newly formed problem is similar to MLC (e.g., the class variables of both problems are all binary), our transformation approach is named the Multi-Label liKe Transformation approach (MLKT); subsequently, many off-the-shelf MLC methods can be adapted to the newly formed problem. However, instead of adapting these methods, we also propose a novel metric-based method aiming to make the predictions of an instance in the learned metric space close to its true class values while far away from the others. Moreover, our metric-based method yields a closed-form solution, so its learning is more efficient than that of its competitors. Interestingly, our metric learning method is also naturally applicable to MLC, and thus can be of independent interest as well. Finally, extensive experimental results justify that our approach combining the above two procedures achieves better classification performance than state-of-the-art MDC methods, while our metric learning method itself also obtains competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC.

The rest of the paper is structured as follows. We first introduce the required background in the field of multi-dimensional classification in Section 2. Then we introduce MLKT in Section 3. Next, we present the details of the distance metric learning method in Section 4. We then experimentally evaluate the proposed schemes in Section 5. Finally, we give concluding remarks in Section 6.

2. Background

In this section, we review basic multi-dimensional classifiers.

In MDC, we have $N$ labeled instances $D = \{(x_i, y_i)\}_{i=1}^{N}$ from which we wish to build a classifier that associates multiple class values with each data instance. A data instance is represented by a vector of $d$ values $x = (x^1, \ldots, x^d)$, each drawn from some input domain $\mathcal{X}^1 \times \cdots \times \mathcal{X}^d$, and the classes are represented by a vector of $m$ values $y = (y^1, \ldots, y^m)$ from the domain $\mathcal{Y}^1 \times \cdots \times \mathcal{Y}^m$, where each $\mathcal{Y}^j = \{1, \ldots, K^j\}$ is the set of possible values for the $j$th class variable $Y^j$. Specifically, we seek to build a classifier $f$ that assigns each instance $x$ to a vector $y$ of class values:

$f : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \rightarrow \mathcal{Y}^1 \times \cdots \times \mathcal{Y}^m,$
$x = (x^1, \ldots, x^d) \mapsto y = (y^1, \ldots, y^m).$

Binary Relevance (BR) is a straightforward method for MDC. It trains $m$ classifiers $f := (f^1, \ldots, f^m)$, one for each class variable. Specifically, a standard multi-class classifier $f^j$ learns to associate one of the values $y^j \in \mathcal{Y}^j$ with each data instance, where $f^j : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \rightarrow \mathcal{Y}^j$. However, BR is unable to capture the dependencies among classes and suffers from low accuracy, as illustrated in [5, 14, 15].
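As a reference point, a minimal sketch of this baseline is given below. The use of scikit-learn's LogisticRegression as the per-dimension base learner is our own assumption, not something prescribed by the paper.

```python
# A minimal Binary-Relevance-style baseline for MDC: one independent
# multi-class classifier per output dimension (base learner is an assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevanceMDC:
    def fit(self, X, Y):
        # Y has shape (N, m); column j holds the values of the j-th class variable.
        self.models_ = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Each column of the prediction comes from its own classifier,
        # so dependencies between class variables are ignored.
        return np.column_stack([m.predict(X) for m in self.models_])
```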


MDC has attracted much attention recently, and many multi-dimensional classifiers for modeling the output structure of MDC have been proposed. As presented in the introduction, there are two main strategies:

1. Explicit representation of the dependence structure between class variables.


The classifier chains model (CC) [5, 6, 7], classifier trellises (CT) [14] and multi-dimensional Bayesian network classifiers (MBCs) [8][9] are recently proposed methods following this strategy for MDC. Specifically, the classifier chains model (CC) learns m classifiers, one for each class variable. These classifiers are linked in a random order, such that the jth classifier uses as input features not only the instance but also the output predictions of the previous j − 1 classifiers, namely $\hat{y}^j = f^j(x, \hat{y}^1, \ldots, \hat{y}^{j-1})$ for any test instance x. Specifically:

$f^j : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \times \mathcal{Y}^1 \times \cdots \times \mathcal{Y}^{j-1} \rightarrow \mathcal{Y}^j.$

This method has demonstrated high performance in multi-label domains and is directly applicable to MDC. However, a drawback is that the ordering of the class variables in the chain has a strong effect on predictive accuracy, and the greedy structure raises the concern of error propagation along the chain, because an incorrect estimate $\hat{y}^j$ will negatively affect all subsequent class variables. Naturally, the ensemble strategy (ECC) [7], which trains several CC classifiers with random-order chains, can be used to alleviate these problems.


Classifier trellises (CT) capture dependencies among class variables through a predefined trellis structure. Each vertex of the trellis corresponds to one of the class variables. Fig. 2 shows a simple example of the structure, where the parents of each class variable are the class variables lying on the vertices above and to its left in the trellis. Specifically:

$f^d : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \times \mathcal{Y}^b \times \mathcal{Y}^c \rightarrow \mathcal{Y}^d.$

CT can scale to large data sets with reasonable complexity. However, just like CC, the artificially defined greedy structure may falsely reflect the real dependency among class variables, which limits its classification performance in real-world applications unless the predefined structure coincides with the given problem.

Figure 2: A simple example of classifier trellises [14].

Multi-dimensional Bayesian network classifiers (MBCs) are a family of probabilistic graphical models which organize class and feature variables as three different subgraphs: the class subgraph, the feature subgraph, and the bridge (from classes to features) subgraph. Different graphical structures for the class and feature subgraphs lead to different families of MBCs. We show a simple tree-tree structure of an MBC in Fig. 3. In recent years, various MBCs have been proposed and have become useful tools for modeling the output structures of MDC [8, 9, 16]. However, their exponential computational complexity remains a problem.

Figure 3: A simple tree-tree structure of MBC [8][9].

2. Implicit incorporation of output structure by transforming the output space.

To the best of our knowledge, all the existing transformation methods for MDC can be classified as label power-set (LP)-based transformation approaches.

The label power-set (LP) method [11] is a typical transformation approach for MLC and can also be directly applied to MDC. It first forcefully assumes that all the class variables are dependent and then defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. Specifically,

$f : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \rightarrow \mathcal{Y}^1 \times \cdots \times \mathcal{Y}^m,$

i.e., the range of the compound class variable is the Cartesian product of $\mathcal{Y}^1, \ldots, \mathcal{Y}^m$.


As a result, the original problem is turned into a multi-class classification problem, for which many off-the-shelf methods are available. In this way, the output structure of MDC is implicitly considered. However, LP easily suffers from the class overfitting and class imbalance problems mentioned in the introduction.

Random k-labelsets (RAkEL) [17] and the super-class classifier (SCC) [13] are LP-based transformation approaches, where RAkEL uses multiple LP classifiers, each trained on a random subset of $\mathcal{Y}$, while SCC's LP classifiers are trained on subsets of class variables with strong dependency. Formally, given a subset $S$ of the class variables of $\mathcal{Y}$, a corresponding LP classifier is learned, namely

$f_S : \mathcal{X}^1 \times \cdots \times \mathcal{X}^d \rightarrow \text{Cartesian product}(\mathcal{Y}^S).$

Note that RAkEL is specially designed for MLC but can be directly applied to MDC, while SCC has demonstrated convincing classification performance for MDC. However, both methods need to resort to LP, and therefore naturally suffer from its problems to some extent.


In summary, the above two strategies have their own advantages and drawbacks. Relatively speaking, however, the latter enjoys more flexibility in modeling output structure. Therefore, in this paper we follow the latter to model the output structure of MDC. Unfortunately, existing transformation approaches are all based on LP and thus naturally inherit its class overfitting and class imbalance problems. Motivated by this, we analyze the drawbacks of LP and propose a novel transformation approach to overcome them.

3. Transformation for MDC

3.1. Analysis for LP

We now give an analysis for LP and detail the causes of its drawbacks.


LP is a typical transformation approach for MDC. It defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. In this way, the original MDC problem is turned into a multi-class classification problem which is relatively easier for subsequent learning. However, such a transformation has two serious drawbacks.

The first drawback is class overfitting, which is caused by the great reduction in the number of instances for each class after transformation. Specifically, given a dataset $\{x_i, y_i\}_{i=1}^{N}$ of an MDC problem which has $m$ class variables, each having $K^i$ class values, the number of instances for each class of the $i$-th output dimension is roughly $\frac{1}{K^i}N$ in the original problem, and in the worst case $\frac{1}{K^{\max}}N$, where $K^{\max} = \max_i(K^i)$. However, after the transformation, the number of instances for each class becomes $\frac{1}{\prod_{i=1}^{m} K^i}N$, which is far less than $\frac{1}{K^{\max}}N$ because $\prod_{i=1}^{m} K^i \gg K^{\max}$. Hence, this drawback makes learning for the formed problem prone to class overfitting.

The second drawback is class imbalance, which is caused by the reduction of the balance degree (the smallest ratio of the total numbers of instances between classes) after transformation. More specifically, given a dataset of an imbalanced MDC problem which has $m$ class variables $Y^1, \ldots, Y^m$, assume the balance degrees of the class variables are respectively $p^1, \ldots, p^m$. After LP transformation, the balance degree changes to $p^1 \times \cdots \times p^m$. Since $p^i < 1$ for all $i \in \{1, \ldots, m\}$, we get $p^1 \times \cdots \times p^m < \min(p^1, \ldots, p^m)$. Thus the balance degree in the formed problem is worse than that in the original problem. In essence, although LP does not turn a perfectly balanced MDC problem into an imbalanced one, it can indeed turn an almost balanced MDC problem into an imbalanced one.

What's more, from the above analysis we find that the more class variables the compound class variable of LP includes, the more serious the class overfitting and class imbalance problems become.


To overcome the drawbacks of LP, we propose a transformation approach designed to 1) make the number of instances for each class in the transformed problem as similar as possible to that in the original MDC problem, and 2) keep the balance degree in the formed problem as consistent as possible with that in the original MDC problem.


3.2. The procedure of MLKT

In this subsection, we present a novel transformation approach, namely MLKT, which transforms $\mathcal{Y}$ to a subspace of $\{0, 1\}^L$ (where $L$ is the dimensionality of the formed problem). Our approach inherits the following favorable characteristics of LP:


• it keeps the output space size invariant;
• it is easy for subsequent modeling in the transformed space;
• it can reflect the explicit within-dimension relationship.

In addition, it also possesses two extra key characteristics:

• it can overcome the class overfitting and class imbalance problems suffered by LP;
• it is decomposable for each class variable of MDC.


Among the above characteristics, we make the transformation decomposable in order to ease its implementation and to allow distinct learning for each output variable. By doing so, we avoid the unnecessary computational cost of LP when the correlations between some output variables are not strong. Now, let us detail the procedure of MLKT.

For each individual class variable of MDC,

AC

• if K i ≥ 3, then for each y i ∈ Y i = {1, . . . , K i }, construct a new K i -

205

ˆij = 1 if j = y i , 0 otherwise. dimensional class vector ˆ zi where z

• if K i = 2, then for each y i ∈ Y i = {1, 2}, construct a 1-dimensional class vector ˆ zi where ˆ zi = 0 if y i = 1 and ˆ zi = 1 if y i = 2

With the above transformation, the original output class vector y = (y 1 , . . . , y m ) ∈

ˆ = [ˆ ˆm ], i.e. z ˆ is obtained Y is converted to a corresponding class vector z z1 ; . . . ; z 10


by concatenating the corresponding $m$ vectors in ascending index order. Thus, we form a new output domain $\hat{\mathcal{Z}} = \{\hat{z} \leftarrow y \mid y \in \mathcal{Y}\}$, where "$\leftarrow$" denotes "transformed by MLKT". Clearly, $\hat{\mathcal{Z}} \subset \{0, 1\}^L$, where $L$ is the dimensionality of $\hat{z}$. Let us give an example to help understand the MLKT transformation.

Example 1. Assume the output space of MDC is $\mathcal{Y} = \mathcal{Y}^1 \times \mathcal{Y}^2$, where $\mathcal{Y}^1 := \{1, 2, 3\}$ and $\mathcal{Y}^2 := \{1, 2\}$.

1. Transformation for each individual class domain:
$\mathcal{Y}^1 := \{1, 2, 3\} \rightarrow \{(1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T\}$,
$\mathcal{Y}^2 := \{1, 2\} \rightarrow \{0, 1\}$.

2. Concatenating procedure:
$\hat{\mathcal{Z}} = \{(1, 0, 0, 0)^T, (1, 0, 0, 1)^T, (0, 1, 0, 0)^T, (0, 1, 0, 1)^T, (0, 0, 1, 0)^T, (0, 0, 1, 1)^T\}$.

Clearly, it is not easy to learn directly in this newly formed space.
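Before turning to the constraint-based view below, the following minimal sketch (our own code, with hypothetical function names) shows the binarization and concatenation steps of MLKT on Example 1.

```python
# A minimal sketch of the MLKT binarization: one-hot encode a class variable with
# K >= 3 values, use a single bit when K = 2, then concatenate the per-variable codes.
import numpy as np

def mlkt_encode(y, K):
    """y: list of m class values (1-based); K: list of the m domain sizes."""
    parts = []
    for y_i, K_i in zip(y, K):
        if K_i >= 3:
            code = np.zeros(K_i, dtype=int)
            code[y_i - 1] = 1            # one and only one active value per dimension
        else:                            # K_i == 2 -> a single binary variable
            code = np.array([y_i - 1], dtype=int)
        parts.append(code)
    return np.concatenate(parts)

# Example 1 revisited: Y^1 = {1, 2, 3}, Y^2 = {1, 2}
print(mlkt_encode([2, 1], K=[3, 2]))     # -> [0 1 0 0]
print(mlkt_encode([3, 2], K=[3, 2]))     # -> [0 0 1 1]
```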

However, we observe that the newly formed space is equivalent to $\{0, 1\}^L$ with some additionally imposed constraints, which is relatively easier for subsequent learning. These additionally imposed constraints can be obtained from the following observation.

Let us define an integer set $\phi^i = \{\phi^i_1, \phi^i_2, \ldots, \phi^i_j, \ldots\}$ for each $i = 1, \ldots, m$, such that $\hat{z}(\phi^i) = \hat{z}^i$, where the elements of $\phi^i$ are the indices of $\hat{z}$ corresponding to the class variable domain $\mathcal{Y}^i$. Based on the transformation characteristics of MLKT, we find that when $K^i \geq 3$ the vector $\hat{z}(\phi^i)$ has one and only one element equal to 1; thus we have

$\sum_{j \in \phi^i} \hat{z}(\phi^i_j) = 1, \quad \forall \hat{z} \in \hat{\mathcal{Z}}. \qquad (1)$

In fact, we can use $E^T z = t$ to formulate all these equalities, where $E \in \mathbb{R}^{L \times m}$ is an indicator matrix whose elements are

$E_{ki} = \begin{cases} 1, & \text{if } k = \phi^i_j \text{ and } K^i \geq 3 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$

is an indicator matrix whose element   1, if k = φi ∧ K i ≥ 3 j Eki =  0, otherwise 11

(2)

ACCEPTED MANUSCRIPT

and t is a m dimensional vector whose element   1, if K i ≥ 3 ti =  0, otherwise

(3)

problem Zˆ is in fact equivalent to: Z = {ET z = t | z ∈ {0, 1}L } 225

CR IP T

Next, we prove in Proposition 1 that the output domain of the newly-formed

(4)

where E and t are defined in Eq.(2) and Eq.(3), respectively. As a result, leading to the following Proposition 1:

AN US

Proposition 1. The output domain Zˆ formed by MLKT is equivalent to Z = {ET z = t | z ∈ {0, 1}L } where E is a predefined indicator matrix and t is a

predefined m-dimensional vector. 230

ˆ z ∈ Z. Therefore, we just need to Proof. Based on Eq.(1), we find that ∀z ∈ Z, with that of Zˆ as well).

M

prove that the size of Z is consistent with that of Y (which means the consistency Assume the original MDC problem has m class variables, with each having 1

K , K 2 , . . . , K m possible values. Note that the MLKT transformation is decomposable for each class variable, thus we just need to prove that the space of

ED

235

Zφi is also K i . In the following, we give a proof according to several cases:

PT

Case 1: if K i = 2, the class variable domain Y i is transformed to Z(φi ) =

{0, 1}1 , thus its space size is also 2.

Case 2: if K i ≥ 3, the class variable domain Y i is transformed to Z(φi ) = i

{h1, z(φi ) i = 1 | z(φi ) ∈ {0, 1}K } in terms of our setup in Eq.(1), where 1 is

CE

240

a K i -dimensional vector with all elements being 1 and h, i represents the inner

AC

product between two vectors. Because the equality constraint can make sure that the vector z(φi ) has one and only one element to be 1, the space size of Z(φi ) is also K i .

245

To further help understanding MLKT transformation from Y to Z = {ET z =

t | z ∈ {0, 1}L }, we give an example as follows: 12

ACCEPTED MANUSCRIPT

Example 2. Assume the output space of MDC is Y = Y 1 × Y 2 , where Y 1 := {1, 2, 3}, Y 2 := {1, 2}. The MLKT approach follows two steps: 2. Define an indicator matrix E and a vector t.

CR IP T

1. Define φi and L.

250

According to Eq.(4), we can get: φ1 = {1, 2, 3}, φ2 = {4}, L = 4, E1 = (1, 1, 1, 0)T , E2 = (0, 0, 0, 0)T , E = [E1 , E2 ], t = [1, 0]T , respectively. Thus, Z

can be defined as Z = {ET z = t | z ∈ {0, 1}4 }.
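For illustration, a small sketch (our own helper, not code from the paper) builds the indicator matrix E and vector t of Eqs. (2)-(3) directly from the domain sizes and reproduces Example 2.

```python
# Build E and t of Eqs. (2)-(3) from the domain sizes K; K = (3, 2) reproduces Example 2.
import numpy as np

def mlkt_constraints(K):
    L = sum(k if k >= 3 else 1 for k in K)       # dimensionality of the transformed space
    E = np.zeros((L, len(K)), dtype=int)
    t = np.zeros(len(K), dtype=int)
    start = 0
    for i, k in enumerate(K):
        width = k if k >= 3 else 1               # block phi^i occupies these rows
        if k >= 3:                               # sum-to-one constraint only when K^i >= 3
            E[start:start + width, i] = 1
            t[i] = 1
        start += width
    return E, t

E, t = mlkt_constraints([3, 2])
print(E.T)   # [[1 1 1 0], [0 0 0 0]]
print(t)     # [1 0]
```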

In terms of such a transformation, the number of instances for each class in

255

formed by LP,

i.e., Qd 1

i=1

Ki

N 2,

which is far more than that in the problem

AN US

the formed problem seems to be

N . Moreover, on the surface, although it is also more

than that in the original problem, i.e.,

1 Ki N ,

this is not so. Based on the

decomposability of MLKT and the consistent size of Z(φi ) with that of Y i , so if 260

the original MDC problem has Ni instances categorized as y i , then the formed

problem also has Ni instances categorized as z(φi ) , meaning that the formed

M

problem keeps consistent with the original MDC problem not only the number of instances for each class but also the balance degree. Naturally, MLKT avoids

265

ED

the class overfitting and the class imbalance problems. Moreover, the explicit within-dimensional relationship is reflected by common-used one-vs-all coding [18]. In this way, MLKT can make all the desired transformation characteristics

PT

guaranteed.

From now on, we just need to focus on learning the structure of the problem

CE

formed by MLKT to reveal original structure. 4. Learning for the transformed problem

AC

270

Notations

Firstly, we give the notations which will be used in the following section. qP P 2 For a matrix A ∈ Rp×q , we use || A ||F = i j Aij to denotes its Frobenius

norm. For a positive definite matrix B  0, B−1 denotes the inverse of matrix

275

B. And, we use || x1 −x2 ||B = (x1 −x2 )T B(x1 −x2 ) to denote the Mahalanobis distance between vectors x1 and x2 . 13

ACCEPTED MANUSCRIPT

4.1. Model construction and Optimization Given the training data instances D = {(xi , yi )}N i=1 (yi ∈ Y), we can obtain

the corresponding re-labeled instances D0 = {(xi , zi )}N i=1 (z ∈ Z) by MLKT each instance x a vector z of class values: xi : (x1i , . . . , xdi ) 7→ zi : (zi1 , . . . ziL )

CR IP T

transformation. Then, the left problem is to build a classifier g that assigns

Let X ∈ RN ×d denote the input matrix and Z ∈ Z N ×L denote the output

matrix. To solve the transformed problem, a simple linear regression model is

arg minP∈Rd×L

AN US

to learn matrix P through the following formulation:

1 || Z − XP ||2F +γ || P ||2F , 2

(5)

Here γ ≥ 0 is a regularization parameter. However, this method usually yields low classification performance due to lack of consideration of correlations in

M

output space [19]. Considering the correlations, [20, 19] proposed to learn a discriminative Mahalanobis distance metric which can make the distance between PT xi and zi less than that between PT xi and any other output z in

ED

the output space. Unfortunately, both [20, 19] can not be directly applicable to our transformed problem, we, instead, develop an alternative novel metric learning method well suited to our scenario and it can nicely obtain a closed

PT

form solution (Our Mahalanobis metric learning method is similar to [20, 19] and we detail the connections in section 4.5.). Its formulation is presented as

CE

follows:

X

∀i,z˜i ∈Z\zi

|| PT xi − zi ||2Ω +

1 ˜i ||2Ω−1 , || PT xi − z | Z\zi |

(6)

AC

arg minΩ0

where P is the solution of the linear regression model (5), Ω is a positive definite

matrix. In the above, the first term aims to make the distance between PT xi

280

and zi smaller and the second term the distance between PT xi and any other output z larger. The main idea of using Ω−1 is motivated by [21], where Ω−1 is used to measure the distances between dissimilar points. The goal is to increase 14

ACCEPTED MANUSCRIPT

the Mahalanobis distance between PT xi and any other output z by decreasing || PT xi − z ||2Ω−1 (see Proposition 1. [21]). Because the space size of Z grows exponentially with dimension L, we only

CR IP T

consider the k nearest neighbors (kNN) of zi in the training dataset instead of any other outputs in the whole output space. Moreover, a regularizer term is used for avoiding overfitting. Therefore, we present the formulation of the distance metric learning method as follows: arg minΩ0

X

1 ˜i ||2Ω−1 || PT xi − z k

(7)

AN US

∀i,z˜i ∈kN N (zi )\zi

λDsld (Ω, I)+ || PT xi − zi ||2Ω +

where γ ≥ 0, P is fixed and the solution of (5), I is the identity matrix and Dsld (Ω, I) is the symmetrized LogDet divergence:

Dsld (Ω, I) := tr(Ω) + tr(Ω−1 ) − 2L. Further define:

X (PT xi − zi )(PT xi − zi )T

M

S :=

(8)

∀i

X

ED D :=

∀i,z˜i ∈kN N (zi )\zi

(9)

Using both S and D, the minimization problem (7) can be recast as

PT

285

1 T ˜i )(PT xi − z ˜i )T (P xi − z k

(10)

CE

arg minΩ0 λDsld (Ω, I) + tr(ΩS) + tr(Ω−1 D).

Interestingly, the minimization problem (10) is the same as the problem (13)

of [21], and is both strictly convex and strictly geodesically convex (Theorem 3

AC

of [21]), thus having global optimal solution. What’s more, it can have a closed form solution below: Ω = (S + λI)−1 ]1/2 (D + λI),

where A]1/2 B := A1/2 (A−1/2 BA−1/2 )1/2 A1/2 .

15

(11)

ACCEPTED MANUSCRIPT

It is this fact that the solution is given by the midpoint of the geodesic joining (S+λI)−1 and (D+λI). The geodesic viewpoint is important to make a tradeoff between (S + λI)−1 and (D + λI). Note that Ω := (S + λI)−1 ]1/2 (D + λI) is

CR IP T

also the minimum of problem (12) according to [21]: 2 2 arg min δR (Ω, (S + λI)−1 ) + δR (Ω, (D + λI)),

where δR denotes the Riemannian distance

(12)

δR (U, V) :=|| log(V−1/2 UV−1/2 ) ||F , f or U, V  0

AN US

Thus, we can get the balanced version between S and D of problem (10): 2 2 arg min := (1 − t)δR (Ω, (S + λI)−1 ) + δR (Ω, (D + λI)), t ∈ [0, 1]. Ω0

(13)

Interestingly, it can be shown (see [22], ch.6) that the unique solution to problem (13) is

Ω = (S + λI)−1 ]t (D + λI)

(14)

M

where A]t B := A1/2 (A−1/2 BA−1/2 )t A1/2 .

The solution connects to the Riemannian geometry of symmetric positive

290

ED

definite (SPD) matrices, and thus we denote it as gMML. Totally, we detail the learning procedure in Algorithm 1.

PT

Algorithm 1 MLKT-gMML algorithm Input: The MDC training data set D = {(xi , yi )}N i=1 (yi ∈ Y); The preset hyper-parameters k, λ, γ and t

CE

Output: Regression matrix P and distance metric Ω; 1:

Transform D to D0 by MLKT approach: D0 = {(xi , zi )}N i=1 (z ∈ Z)

Set P := arg minP∈Rd×L 21 || Z − XP ||2 +γ || P ||2F ;

3:

Compute S and D by Eq.(8) and Eq.(9)

4:

Set Ω := (S + λI)−1 ]t (D + λI);

5:

return P and Ω;

AC

2:

Note that P has an impact on learning Ω, conversely, Ω has an impact on learning P as well. Thus, P can be obtained by optimizing the following 16

ACCEPTED MANUSCRIPT

problem: P := arg minP∈Rd×L

1 || Z − XP ||2Ω +γ || P ||2F . 2

(15)

Its solution can be boiled down to solving the Sylvester equation: (XT X)P +

CR IP T

γPΩ−1 = XT Y. A classical algorithm for solving such equation is the Bartels-

Stewart algorithm [23]. In a nutshell, an iteration algorithm for learning P and Ω is detailed in Algorithm 2 called as gMML-I.

Algorithm 2 gMML-I algorithm Input: The MDC training data set D = {(xi , yi )}N i=1 (yi ∈ Y); The number of iterations η; The preset hyper-parameters k, λ, γ and t

1:

Transform D to D0 by MLKT approach: D0 = {(xi , zi )}N i=1 (z ∈ Z)

2:

Set Ωinit = I

3:

repeat

Set P := arg minP∈Rd×L 21 || Z − XP ||2Ω +γ || P ||2F ;

5:

Compute S and D by Eq.(8) and (9)

6:

Set Ω := (S + λI)−1 ]t (D + λI);

M

4:

until (reach to η iterations)

8:

return P and Ω;

ED

7:

4.2. Prediction for a new instance

PT

295

AN US

Output: Regression matrix P and distance metric Ω;

Based on the learning procedure, the output z of a new instance x can be

CE

predicted by solving the following optimization problem: min z∈Z

1 || z − PT x ||2Ω 2

(16)

It is equivalent to solving a quadratic binary optimization problem with

AC

equality constraints, namely,

min z

1 2

|| z − PT x ||2Ω

s.t. ET z = t z ∈ {0, 1}L 17

(17)

ACCEPTED MANUSCRIPT

The optimization problem (17) is very difficult to solve due to its NPhardness. Instead we replace the binary constraints with 0 ≤ z ≤ 1, then the NP-hard optimization problem is converted to a simple box-constrained

min v

1 2

|| v − PT x ||2Ω

s.t. ET v = t v ∈ [0, 1]L

CR IP T

quadratic programming as follows

AN US

Now, for each set φi , the prediction z(φi ) of x can be made in terms of     1, if j = arg maxk v(φi ) ,   k  z(φi ) = if | φi |≥ 3, k = 1, . . . , Ki j  0, otherwise     z i = round(v i ), if | φi |= 1 (φ ) (φ )

(18)

(19)

Where round() means rounding their predictions into 0/1 assignments. In turn,

M

the prediction yi of x in the original output space is   yi = j, if | φi |≥ 3

(20)

ED

 yi = (z i + 1), if | φi |= 1 (φ )

Algorithm 3 details the predicting procedures.

PT

Algorithm 3 Predict new instance x Input: The learned regression matrix P and distance metric Ω; The new instance x

CE

Output: The prediction class vector y Solve z := arg minz∈Z || z − PT x ||2Ω ;

2:

Inverse transformation: y ← z according to Eq.(19) and Eq.(20);

3:

return y;

AC

1:

4.3. Connections between existing metric learning methods and ours

300

Two works are mostly related to our metric learning method, namely maximum margin output coding (MMOC) [20] and large margin metric learning 18

ACCEPTED MANUSCRIPT

with kNN constraints (LM-kNN) [19]. These two methods likewise use a Mahalanobis distance metric (a symmetric positive semidefinite matrix denoted by S + ) to model output structure of MLC, where the Mahalanobis distance metric is used to learn a lower dimensional space.

CR IP T

305

MMOC aims to learn a discriminative Mahalanobis metric which can make

the distance between PT xi and its real class vector zi as close to 0 as possible

and less than the distance between PT xi and any other outputs with some margin. Specifically, its formulation is as follows n CX 1 ξi trace(Ω) + 2 n i=1 Ω∈S + ,{ξi }n i=1

(21)

AN US

arg min

s.t. ϕTizi Ωϕizi + ∆(zi , z) − ξi 6 ϕTiz Ωϕiz , ∀z ∈ {0, 1}, ∀i

where C is a positive constant, ϕiz = PT xi − z and ϕizi = PT xi − zi . It proved to have good classification accuracy for MLC task. However, it also suffers from a big burden, i.e. it has to treat the exponentially large num-

310

M

ber of constraints for each instance during training, leading to computational infeasibility.

ED

Like MMOC, LM-kNN also adopts a Mahalanobis metric learning method for MLC, which just involves k constraints for each instance. Its distance metric learning attempts to make instances with similar class vectors closer. Thus, the

PT

class vector of each instance can be predicted by their nearest neighbors. In fact, LM-kNN can be much simpler than MMOC and is established by minimizing

CE

the following objective: n 1 CX trace(Ω) + ξi 2 n i=1 Ω∈S + ,{ξi }n i=1

AC

arg min

(22)

s.t. ϕTizi Ωϕizi + ∆(zi , z) − ξi 6 ϕTiz Ωϕiz , ∀z ∈ N ei(i), ∀i

where C, ϕiz and ϕizi are similarly defined to those of MMOC. N ei(i) is the

output set of k nearest neighbors of input xi . For LM-kNN, its prediction for a testing instance can be obtained based on its k nearest neighbors in the learned metric space. Specifically, for the testing

19

ACCEPTED MANUSCRIPT

315

input x, we find its k nearest instances {x1 , . . . , xk } in the training set, then, a set of scores for each class vector of x can be obtained from the distances between x and {x1 , . . . , xk }, lastly, using these scores to predict its class vector

CR IP T

by thresholding. Clearly, neither MMOC nor LM-kNN can be applied to our transformed 320

problem due to that the output space of our transformed problem is not equiv-

alent to that of MLC. Although they can be adapted to our scenario by some efforts, these efforts are non-trivial because their corresponding training and/or predicting have to be re-designed. Moreover, at present it is not our focus. As a

325

AN US

result, we choose an alternative design way for our Mahalanobis distance metric learning, where our method is formally close to MMOC but has a closed form solution as described in section 4.1. 4.4. Complexity Analysis

The time complexity of regularized least square regression is basically the

330

M

complexity of computing the matrix multiplication with O(N d2 + N dL) plus the complexity of the inverse computation with O(d3 ). The complexity of com-

ED

puting the geometric mean of two matrices by Cholesky-Schur method [24] is O(L3 ). The complexity of computing a Sylvester equation is O(d3 + L3 + N d2 + N dL). The complexity of computing a box-constrained quadratic problem is

PT

O(L3 + Ld). And the time complexity of kNN is O(kN ). Algorithm 1 involves solving a regularized least square regression problem

335

and computing the geometric mean of two matrices. Therefore, its total time

CE

complexity is O(N d2 + N dL + d3 + L3 + kN ). Algorithm 2 involves solving a Sylvester equation and computing the geometric mean of two matrices with η

AC

iterations (where η is usually fixed to a pre-set small integer). Therefore, its

340

time complexity is O(η(d3 + L3 + N d2 + N dL + kN )). Algorithm 3 involves

solving a box-constrained quadratic problem, its time complexity is O(L3 +Ld). Based on [19], the training and predicting time complexities of LM-kNN are

respectively O( √1 (N d2 + N dL + L3 + d3 + kN dL2 )) and O(LN + Ld), where  is the accuracy met by its solution. The training and predicting time complexities 20

ACCEPTED MANUSCRIPT

345

of MMOC are respectively O(θ(N d2 + N dL + d3 + N L3 + N 4 )) and O(L3 ), where θ is its iterations. In comparison with our metric learning counterparts used here, our Algo-

CR IP T

rithm 1 and Algorithm 2 have an advantage over MMOC and LM-kNN in terms of training time due to η  O( √1 ) and η  θ. The predicting time complexity 350

of our Algorithm 3 is comparable to MMOC and higher than LM-kNN.

5. Experiments

In this section, we discuss the experiments conducted on two publicly avail-

AN US

able real-world datasets for MDC. The two datasets for MDC are respectively ImageCLEF2014 1 and Bridges. ImageCLEF2014 comes from a real world chal355

lenge in the field of robot vision [25]. Bridges dataset comes from the UCI collection [26]. Unfortunately, there are not yet many publicly available standardized multi-dimensional datasets, so we boost our collections with eight most

M

commonly used multi-label datasets which can be accessed from Mulan 2 . The characteristics of these datasets are shown in Table 1.

ED

Table 1: Datasets used in the evaluation class variables

Features

Instances

birds

19

260

645

emotions

6

72

593

medical

45

1449

978

scene

6

294

2407

yeast

14

103

2417

flags

7

19

194

genbase

27

1186

662

CAL500

174

68

502

AC

CE

PT

Dataset

360

bridges

5

7

107

CLEF2014

9

264

6500

We consider two commonly used evaluation criteria for MDC, namely Ham-

ming accuracy and Example accuracy. These evaluation criteria can be calcu1 http://www.imageclef.org/2014/robot 2 http://mulan.sourceforge.net/datasets-mlc.html

21

ACCEPTED MANUSCRIPT

Table 2: Hamming Accuracy (Part A) classifier

birds

emotions

medical

scene

yeast

MLKT-RR

0.9511 ± 0.0083

0.7771 ± 0.0263

0.9509 ± 0.0086

0.8477 ± 0.0282

0.7048 ± 0.0042

♠0.9562 ± 0.0089

♠0.7932 ± 0.0254

♠0.9736 ± 0.0068

♠0.8528 ± 0.0324

♠0.7065 ± 0.0077

0.7904 ± 0.0256

0.9545 ± 0.0091

lated as follows: 1. Hamming accuracy:

m m N 1 X 1 X 1 X Accj = δ(yij , yˆij ) m j=1 m j=1 N i=1

AN US

Acc =

0.8463 ± 0.0291

where δ(yij , yˆij ) = 1 if yˆij = yij , and 0 otherwise. Note that yˆij denotes the jth class value predicted by the classifier for instance i and yij is its true value.

365

2. Example accuracy

N 1 X δ(yi , yˆi ) N i=1

M

Acc =

where δ(yi , yˆi ) = 1 if yˆi = yi , and 0 otherwise.

ED

Before the experiments, some parameters need to be set in advance. The parameter η for gMML-I algorithm is always set to 3 throughout our experiments (because when η > 3, we find it has no changes for Ω and P). The parameters λ and t associated with Ω are respectively tuned from the range {100 , 101 , 102 }

PT

370

and {0.3, 0.5, 0.7}. The parameter γ for P is tuned from the range {0, 0.1, 0.2}.

CE

All the following experimental results are the average results of 10-fold cross validation experiments. And, we use the notation ♠ to denote the best results.

AC

5.1. Comparison with our baseline methods

375

We firstly verify the classification accuracy of MLKT-gMML-I in comparison

with that of both ridge regression model (namely Ω = I) and the algorithm without iteration procedure (namely MLKT-gMML). We show the results in Tables 2, 3, 4 and 5.

22

0.7054 ± 0.0034

CR IP T

0.9519 ± 0.0081

MLKT-gMML MLKT-gMML-I

ACCEPTED MANUSCRIPT

Table 3: Hamming Accuracy (Part B) classifier

flags

genbase

CAL500

ImageCLEF2014

bridges

MLKT-RR

♠0.7376 ± 0.0384

0.9554 ± 0.0549

0.8624 ± 0.0061

♠0.8430 ± 0.0191

0.7000 ± 0.1222

MLKT-gMML-I

0.7353 ± 0.0294

♠0.9896 ± 0.0038

♠0.8625 ± 0.0063

0.8429 ± 0.0254

classifier

birds

emotions

medical

scene

yeast

MLKT-RR

0.4781 ± 0.0599

0.2136 ± 0.0572

0.5962 ± 0.0507

0.3000 ± 0.1343

0.0041 ± 0.0044

♠0.5188 ± 0.0796

♠0.2661 ± 0.0614

♠0.7782 ± 0.0432

0.2917 ± 0.1607

♠0.0051 ± 0.0046

0.7323 ± 0.0295

0.9698 + 0.0166

0.8526 ± 0.0063

Table 4: Example Accuracy (Part A)

MLKT-gMML MLKT-gMML-I

0.4813 ± 0.0584

0.2593 ± 0.0654

0.8429 ± 0.0254

0.6943 ± 0.1264

♠0.7286 ± 0.1071

CR IP T

MLKT-gMML

0.6667 ± 0.0580

♠0.3008 ± 0.1334

0.0051 ± 0.0064

380

AN US

From the results, we can see that MLKT-gMML-I nearly achieves the best

accuracy on all these datasets in regard to both evaluation criteria. To verify whether the differences are significant, two non-parametric Friedman tests among these methods for Hamming accuracy and Example accuracy are conducted respectively.

385

M

In Hamming accuracy, the Friedman test renders a F value of 8.6471 (> F(α,k−1,(b−1)(k−1)) = F(0.05,2,18) = 3.555)3 . Thus, the null hypothesis that all the methods have identical effects is rejected and a post-hoc test needs to be

ED

conducted for further testing their differences. To this end, a commonly-used post-hoc test, Nemenyi test, is conducted. The result is shown in Figure 4 from which we can see that: 1) MLKT-gMML-I has a significant difference from our other two methods; 2) MLKT-gMML achieves a comparable performance with

PT

390

CE

MLKT-RR. 3 Here,

F is the percent point function of the F distributuion, α is the significance level, b

AC

is the number of datasets and k is the number of algorithms for test.
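As an aside, an F value of this form can be reproduced from the per-dataset accuracies by computing the chi-square Friedman statistic and applying the Iman-Davenport correction; the sketch below assumes this is the procedure used and is illustrative only.

```python
# F-form Friedman test (Iman-Davenport correction): scores has shape (b, k),
# one accuracy per (dataset, algorithm) pair; b datasets, k algorithms.
import numpy as np
from scipy.stats import friedmanchisquare, f

def friedman_f_test(scores, alpha=0.05):
    b, k = scores.shape
    chi2, _ = friedmanchisquare(*[scores[:, j] for j in range(k)])
    F = (b - 1) * chi2 / (b * (k - 1) - chi2)            # Iman-Davenport statistic
    critical = f.ppf(1 - alpha, k - 1, (k - 1) * (b - 1))  # e.g. F(0.05, 2, 18) = 3.555
    return F, critical
```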

Table 5: Example Accuracy (Part B) classifier

flags

genbase

CAL500

ImageCLEF2014

bridges

MLKT-RR

0.1737 ± 0.0994

0.8379 ± 0.2966

0.0000 ± 0.0000

0.2250 ± 0.0348

0.1714 ± 0.1475

♠0.1842 ± 0.0568

♠0.9273 ± 0.0333

♠0.2256 ± 0.0480

♠0.2429 ± 0.1656

MLKT-gMML

MLKT-gMML-I

0.1789 ± 0.0666

0.9242 ± 0.0357

23

0.0000 ± 0.0000

0.0000 ± 0.0000

♠0.2256 ± 0.0480

0.1714 ± 0.1475

ACCEPTED MANUSCRIPT

MLKT−gMML

MLKT−gMML−I

0.5

1

1.5

2

2.5

3

CR IP T

MLKT−RR

3.5

Figure 4: Friedman Test of our methods in terms of Hamming accuracy. In the graph, the

horizontal axis represents the values of mean rank, the vertical axis represents the different methods for test. For each method, the • represents its mean rank value, the line segment

AN US

represents the critical range of Nemenyi test. Two methods have a significant difference, if their line segments are not overlapped; not so, otherwise.

In Example accuracy, the Friedman test renders a F value of 10.85 (> F(α,k−1,(b−1)(k−1) = F(0.05,2,18) = 3.555), meaning the null hypothesis is also rejected. Then, A post-hoc Nemenyi test is again conducted. Its result is shown in Figure 5 and indicates that: 1) MLKT-gMML-I is significantly differ-

M

395

ent from MLKT-RR; 2) MLKT-gMML also achieves a comparable performance

ED

with MLKT-RR.

On the whole, we can conclude that MLKT-gMML-I achieves the best classification performance while MLKT-gMML achieves a comparable performance with MLKT-RR. Therefore, in the following, we just concentrate on the com-

PT

400

parison between MLKT-gMML-I and the other competitive MDC methods.

CE

5.2. Comparison with several competitive MDC methods We then compare MLKT-gMML-I with several competitive methods for

AC

MDC from the literature: Binary-Relevance (BR), Classifier Chains (CC) , En-

405

semble of Classifier Chains (ECC), RAkEL and Super-Class Classifier (SCC).

Since the above methods are only designed for modeling output structure, Naive Bayesian classifier is used as their base classifier in our experiments. We use an open-source Java framework, namely the MEKA [27] library, for the experiments. Regarding the parameterization of these approaches, ECC is 24

ACCEPTED MANUSCRIPT

MLKT−RR

MLKT−gMML−I

0.5

1

1.5

2

2.5

3

CR IP T

MLKT−gMML

3.5

Figure 5: Friedman Test of our methods in terms of Example accuracy. In the graph, the

horizontal axis represents the values of mean rank, the vertical axis represents the different methods for test. For each method, the • represents its mean rank value, the line segment

AN US

represents the critical range of Nemenyi test. Two methods have a significant difference, if their line segments are not overlapped; not so, otherwise.

Table 6: Hamming Accuracy (Part A) classifier

birds

emotions

medical

scene

yeast

BR-NB

0.6472 ± 0.0354

0.7479 ± 0.0204

0.9747 ± 0.0016

0.7581 ± 0.0101

0.7162 ± 0.0122

0.7102 ± 0.0442

0.7544 ± 0.0221

RAkEL SCC

410

0.7859 ± 0.0163

0.7530 ± 0.0213

0.9751 ± 0.0011

0.7632 ± 0.0094

0.7460 ± 0.0228

♠0.9753 ± 0.0015

0.8385 ± 0.0134

0.8770 ± 0.0110

♠0.9562 ± 0.0089

♠0.7932 ± 0.0254

0.9736 ± 0.0068

0.7660 ± 0.0180

0.9751 ± 0.0016

0.9752 ± 0.0010

0.7634 ± 0.0096

♠0.8630 ± 0.0100 0.8528 ± 0.0324

ED

MLKT-gMML-I

0.6411 ± 0.0375

M

CC-NB ECC-NB

configured to learn 10 different models for the ensemble, for RAkEL we use the recommended configuration with 2m models having triplets of class combi-

PT

nations, and for SCC we use a nearest neighbour replacement filter (NNR) to identify all p = 1 infrequent class-values and replace them with their n = 2 most-

CE

frequent nearest neighbours. Their Hamming accuracy and Example accuracy

415

are shown in Tables 6, 7, 8 and 9 respectively. From the results of these tables, we see that MLKT-gMML-I can achieve

AC

better performance on most of the datasets than its competitive MDC methods (BR, CC, ECC, RAkEL and SCC) in terms of both evaluation criteria. To

verify the performance differences, two non-parametric Friedman tests among

420

these methods for Hamming accuracy and Example accuracy are respectively conducted.

25

0.6965 ± 0.0131 0.7018 ± 0.0128 0.6760 ± 0.0142

♠0.7574 ± 0.0081 0.7065 ± 0.0077

ACCEPTED MANUSCRIPT

Table 7: Hamming Accuracy (Part B) flags

genbase

CAL500

BR-NB

0.6645 ± 0.0443

0.9702 ± 0.0063

0.6813 ± 0.0094

0.7226 ± 0.0314

0.9661 ± 0.0036

RAkEL SCC

0.6890 ± 0.0454

0.6562 ± 0.0594 0.7041 ± 0.0496

MLKT-gMML-I

♠0.7353 ± 0.0294

classifier

birds

BR-NB

0.0326 ± 0.0191

0.5965 ± 0.0083

bridges 0.7200 ± 0.0470

0.9661 ± 0.0044

0.6060 ± 0.0131

0.6050 ± 0.0090

0.7060 ± 0.0660

0.9653 ± 0.0047

0.7141 ± 0.0069

0.6560 ± 0.0120

0.7120 ± 0.0250

0.9460 + 0.0081 ♠0.9896 ± 0.0038

0.7124 ± 0.0068 0.7050 ± 0.0120

♠0.8625 ± 0.0063

AN US

CC-NB ECC-NB

ImageCLEF2014

CR IP T

classifier

0.6150 ± 0.0096 0.8000 ± 0.0070

♠0.8429 ± 0.0254

0.7114 ± 0.0899 0.7140 ± 0.0520

♠0.7286 ± 0.1071

Table 8: Example Accuracy (Part A)

SCC

scene

yeast

0.1691 ± 0.0225

0.0980 ± 0.0212

0.0481 ± 0.0237

0.2260 ± 0.0497

0.2730 ± 0.0331

0.1753 ± 0.0212

0.0327 ± 0.0237

0.2141 ± 0.0348

0.2649 ± 0.0391

0.3020 ± 0.0209

0.2020 ± 0.0480

♠0.5188 ± 0.0796

0.2360 ± 0.0439 0.2650 ± 0.0490

♠0.2661 ± 0.0614

0.2771 ± 0.0400 0.2795 ± 0.0320

♠0.7782 ± 0.0432

0.1795 ± 0.0216 ♠0.5370 ± 0.0260 0.2917 ± 0.1607

0.1080 ± 0.0246 0.1109 ± 0.0224 0.0765 ± 0.0173

♠0.1910 ± 0.0240 0.0051 ± 0.0046

CE

PT

MLKT-gMML-I

medical

0.2659 ± 0.0377

M

RAkEL

0.0326 ± 0.0191

ED

CC-NB ECC-NB

emotions

0.2057 ± 0.0549

flags

genbase

CAL500

ImageCLEF2014

bridges

BR-NB

0.0361 ± 0.0344

0.5110 ± 0.0588

0.0000 ± 0.0000

0.0292 ± 0.0101

0.2000 ± 0.1680

0.0363 ± 0.05270

0.2749 ± 0.0635

CC-NB

AC

Table 9: Example Accuracy (Part B)

classifier

ECC-NB RAkEL SCC

MLKT-gMML-I

0.0771 ± 0.0603

0.0780 ± 0.0755

♠0.1927 ± 0.1220 0.1842 ± 0.0568

0.2810 ± 0.0732

0.0000 ± 0.0000

0.0320 ± 0.0100

0.1860 ± 0.1790

0.2761 ± 0.0724

0.0000 ± 0.0000

0.0280 ± 0.0320

♠0.2570 ± 0.0610

♠0.2256 ± 0.0480

0.2429 ± 0.1656

0.3906 ± 0.0827

♠0.9273 ± 0.0333

26

0.0000 ± 0.0000

0.0000 ± 0.0000

0.0000 ± 0.0000

0.0335 ± 0.0117

0.2160 ± 0.0270

0.2286 ± 0.2231

0.2290 ± 0.1300

ACCEPTED MANUSCRIPT

In Hamming accuracy, the Friedman test renders a F value of 5.2663 (> F(α,k−1,(b−1)(k−1)) = F(0.05,9,45) = 2.422). Thus, the null hypothesis that all the methods are identical is rejected and a post-hoc test needs to be conducted for further testing their differences. To this end, a commongly-used post-hoc test,

CR IP T

425

Nemenyi test, is conducted. The results is shown in Figure 6, from which we can

see that: 1) MLKT-gMML-I has a significant difference from two methods (BRNB and CC-NB); 2) There is no significant difference among these methods

except MLKT-gMML-I. Therefore, MLKT-gMML-I achieves a slightly better classification performance than its competitive MDC methods.

AN US

430

BR−NB

CC−NB

ECC−NB

RKkEL

SCC

0

M

MLKT−gMML−I

1

2

3

4

5

6

7

Figure 6: Friedman Test in terms of Hamming accuracy. In the graph, the horizontal axis

ED

represents the values of mean rank, the vertical axis represents the different methods for test. For each method, the • represents its mean rank value, the line segment represents the critical range of Nemenyi test. Two methods have a significant difference, if their line segments are

PT

not overlapped; not so, otherwise.

In Example accuracy, the Friedman test renders a F value of 5.4961 (>

CE

F(α,k−1,(b−1)(k−1)) = F(0.05,10,45) = 2.422). Thus, the null hypothesis that all the methods are identical is rejected and a post-hoc test needs to be conducted for further testing their differences. To this end, the Nemenyi test as above is conducted. The results is shown in Figure 6 from which we can see that:

AC

435

1) MLKT-gMML-I has a significant difference from BR-NB. 2) There is not a significant difference among these methods except BR-NB. Therefore, MLKTgMML-I can achieve a comparable Example accuracy with its competitive methods.

440

On the whole, we can conclude that MLKT-gMML-I achieves comparable 27

ACCEPTED MANUSCRIPT

BR−NB

CC−NB

ECC−NB

SCC

MLKT−gMML−I

0

1

2

3

4

5

6

CR IP T

RAkEL

7

Figure 7: Friedman Test in terms of Example accuracy. In the graph, the horizontal axis represents the values of mean rank, the vertical axis represents the different methods for test.

For each method, the • represents its mean rank value, the line segment represents the critical not overlapped; not so, otherwise.

AN US

range of Nemenyi test. Two methods have a significant difference, if their line segments are

(or even slightly better) classification performance with (or than) its competitive MDC methods.

M

5.3. Comparison with LM-kNN on MLC task

Note that gMML-I is closely related to MMOC and LM-kNN and the latter two are designed for MLC specially, thus we just conduct experiments on MLC

ED

445

datasets to verify their classification performance. However, since MMOC has to deal with exponentially large number of constraints for each instance in training

PT

procedure, it is infeasible even for the CAL500 dataset with 68 features and 174 labels [19]. Therefore, we only compare gMML-I with LM-kNN. We show the 450

results in figure 8 and figure 9.

CE

We can see from the figures that gMML-I achieves better performance on

six datasets in terms of Hamming accuracy, while four datasets in terms of

AC

Example accuracy4 than LM-kNN. So on the whole, gMML-I achieves better classification on most of the datasets. To verify their difference, the Friedman

455

tests of differences between gMML-I and LM-kNN are conducted and render F -values of 2.3333 for Hamming accuracy and 0.4667 for Example accuracy re4 Both

methods achieve zero accuracy on CAL500 for Example accuracy.

28

CR IP T

ACCEPTED MANUSCRIPT

1 0.9 0.8 0.7 0.6 0.5 0.4

gMML−I LM−kNN

0.3 0.2

0

birds

emotions medical

AN US

0.1 scene

yeast

flags

genbase CAL500

ED

M

Figure 8: Hamming Accuracy (HA).

1

gMML−I LM−kNN

0.9

PT

0.8 0.7 0.6

AC

CE

0.5 0.4 0.3 0.2 0.1 0

birds

emotions medical

scene

yeast

flags

genbase CAL500

Figure 9: Example Accuracy (EA).

29

ACCEPTED MANUSCRIPT

spectively, both are not significant (< F(α,k−1,(b−1)(k−1)) = F(0.05,7,7) = 5.595). So, gMML-I can obtain competitive classification accuracy on MLC task with LM-kNN, but has a lower learning complexity than LM-kNN as analysed in section 4.4.

CR IP T

460

6. Conclusions

In this paper, we proposed a new transformation approach, namely MLKT, for MDC, which possesses the following favorable characteristics: i) it can keep

the space size of MDC invariant, ii) it can reflect the explicit within-dimensional relationships, iii) it is easy for subsequent modeling in the transformed space,

AN US

465

iv) it can overcome the class overfitting and class imbalance problems suffered by LP-based transformation approach, v) it is decomposable for each output dimension of MDC. Moreover, we also presented a novel metric learning based method for the transformed problem, which itself can be of independent interest and also has a closed form solution. Extensive experimental results justified that

M

470

our approach combined the above two procedures can achieve better classification performance than the competitive MDC methods, while our metric learning

ED

based method itself can also obtain competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC. And, as mentioned in the introduction section, we can refer to many

PT

475

MLC methods to develop alternatives well suited to our transformed problem

CE

as our future direction.

AC

Acknowledgements

480

This work is supported in part by the National Natural Science Foundation

of China under the Grant Nos. 61672281 and in part by the Funding of Jiangsu Innovation Program for Graduate Education under Grant KYLX16 0383. And we would like to express our appreciation for the valuable comments from reviewers and editors.

30

ACCEPTED MANUSCRIPT

References 485

[1] M. Elhoseiny, T. El-Gaaly, A. Bakry, A. Elgammal, Convolutional models for joint object categorization and pose estimation, arXiv preprint

CR IP T

arXiv:1511.05175.

[2] T. Theeramunkong, V. Lertnattee, Multi-dimensional text classification,

in: Proceedings of the 19th international conference on Computational linguistics-Volume 1, Association for Computational Linguistics, 2002, pp.

490

1–7.

AN US

[3] J. Ortigosa-Hern´ andez, J. D. Rodr´ıguez, L. Alzate, M. Lucania, I. Inza, J. A. Lozano, Approaching sentiment analysis by using semi-supervised

learning of multi-dimensional classifiers, Neurocomputing 92 (2012) 98– 115.

495

[4] C. Tu, Z. Liu, M. Sun, Prism: Profession identification in social media

M

with personal information and community structure, in: Chinese National Conference on Social Media Processing, Springer, 2015, pp. 15–27.

ED

[5] J. H. Zaragoza, L. E. Sucar, E. F. Morales, C. Bielza, P. Larranaga, Bayesian chain classifiers for multidimensional classification, in: IJCAI,

500

PT

Vol. 11, Citeseer, 2011, pp. 2192–2197. [6] W. Cheng, E. H¨ ullermeier, K. J. Dembczynski, Bayes optimal multilabel classification via probabilistic classifier chains, in: Proceedings of the 27th

CE

international conference on machine learning (ICML-10), 2010, pp. 279– 286.

505

AC

[7] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multilabel classification, Machine learning 85 (3) (2011) 333–359.

[8] L. C. Van Der Gaag, P. R. De Waal, et al., Multi-dimensional Bayesian network classifiers., in: Probabilistic graphical models, Citeseer, 2006, pp.

510

107–114.

31

ACCEPTED MANUSCRIPT

[9] C. Bielza, G. Li, P. Larranaga, Multi-dimensional classification with Bayesian networks, International Journal of Approximate Reasoning 52 (6) (2011) 705–727.

CR IP T

[10] J. Arias, J. A. Gamez, T. D. Nielsen, J. M. Puerta, A scalable pairwise class interaction framework for multidimensional classification, Interna-

515

tional Journal of Approximate Reasoning 68 (2016) 194–210.

[11] M. R. Boutell, J. Luo, X. Shen, C. M. Brown, Learning multi-label scene classification, Pattern recognition 37 (9) (2004) 1757–1771.

AN US

[12] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, International Journal of Data Warehousing and Mining 3 (3) (2007) 1.

520

[13] J. Read, C. Bielza, P. Larra˜ naga, Multi-dimensional classification with super-classes, IEEE Transactions on knowledge and data engineering 26 (7) (2014) 1720–1733.

M

[14] J. Read, L. Martino, P. M. Olmos, D. Luengo, Scalable multi-output label prediction: From classifier chains to classifier trellises, Pattern Recognition

525

ED

48 (6) (2015) 2096–2109.

[15] J. D´ıez, J. J. del Coz, O. Luaces, A. Bahamonde, Using tensor products to

PT

detect unconditional label dependence in multilabel classifications, Information Sciences 329 (2016) 20–32. [16] Inference and learning in multi-dimensional Bayesian network classi-

CE

530

fiers, author=De Waal, Peter R and Van Der Gaag, Linda C, book-

AC

title=European Conference on Symbolic and Quantitative Approaches

535

to Reasoning and Uncertainty, pages=501–511, year=2007, organization=Springer.

[17] G. Tsoumakas, I. Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: European Conference on Machine Learning, Springer, 2007, pp. 406–417.

32

ACCEPTED MANUSCRIPT

[18] Y. Anzai, Pattern Recognition & Machine Learning, Elsevier, 2012. [19] W. Liu, I. W. Tsang, Large margin metric learning for multi-label predic-

CR IP T

tion., in: AAAI, 2015, pp. 2800–2806.

540

[20] Y. Zhang, J. G. Schneider, Maximum margin output coding, in: Proceed-

ings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 1575–1582.

[21] P. H. Zadeh, R. Hosseini, S. Sra, Geometric mean metric learning, in:

Proceedings of The 33rd International Conference on Machine Learning,

545

AN US

2016, pp. 2464–2471.

[22] R. Bhatia, Positive definite matrices, Princeton university press, 2009. [23] R. H. Bartels, G. Stewart, Solution of the matrix equation ax+ xb= c [f4], Communications of the ACM 15 (9) (1972) 820–826.

[24] B. Iannazzo, The geometric mean of two matrices from a computational

M

550

viewpoint, arXiv preprint arXiv:1201.0101.

ED

[25] B. Caputo, H. M¨ uller, J. Martinez-Gomez, M. Villegas, B. Acar, N. Patri¨ udarlı, R. Paredes, M. Cazorla, et al., Imageclef cia, N. Marvasti, S. Usk¨ 2014: Overview and analysis of the results, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer,

PT

555

2014, pp. 192–211.

CE

[26] M. Lichman, UCI machine learning repository (2013). URL http://archive.ics.uci.edu/ml

AC

[27] J. Read, P. Reutemann, B. Pfahringer, G. Holmes, MEKA: A multi-

560

label/multi-target extension to Weka, Journal of Machine Learning Research 17 (21) (2016) 1–5. URL http://jmlr.org/papers/v17/12-164.html

33

ACCEPTED MANUSCRIPT

M

AN US

CR IP T

Zhongchen Ma received his B.S. degree in Information and Computing Science from Qingdao Agricultural University in 2012. In 2015, he completed his M.S. degree in computer science and technique at Nanjing University of Aeronautics and Astronautics. He is currently pursuing the Ph.D. degree with the College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics. His research interests include pattern recognition and machine learning.

AC

CE

PT

ED

Songcan Chen received his B.S. degree in mathematics from Hangzhou University (now merged into Zhejiang University) in 1983. In 1985, he completed his M.S. degree in computer applications at Shanghai Jiaotong University and then worked at NUAA in January 1986. There he received a Ph.D. degree in communication and information systems in 1997. Since 1998, as a full-time professor, he has been with the College of Computer Science & Technology at NUAA. His research interests include pattern recognition, machine learning and neural computing.