Ordinal classification using Pareto fronts

Expert Systems with Applications xxx (2015) xxx–xxx (ESWA 9932, 7 April 2015)

Journal homepage: www.elsevier.com/locate/eswa

M.M. Stenina (a), M.P. Kuznetsov (a, corresponding author), V.V. Strijov (a, b)

(a) Moscow Institute of Physics and Technology, Institutskiy Lane 9, Dolgoprudny, Moscow 141700, Russia
(b) Dorodnicyn Computing Center of Russian Academy of Sciences, Vavilov St. 40, 119333 Moscow, Russia

Article info

Article history: Available online xxxx

Keywords: Ordinal classification; Pareto front; Expert estimations; Binary relation

Abstract

The paper presents an ordinal classification method using Pareto fronts. An object is described by a set of ordinal features assigned by experts. We describe the class boundaries by the set of Pareto fronts. We propose to predict the object class using the nearest Pareto front boundary. The proposed method is illustrated by a real-world application to the International Union for Conservation of Nature Red List species categorization. © 2015 Published by Elsevier Ltd.


1. Introduction


We investigate the supervised ordinal classification problem (Frank & Hall, 2001; Har-Peled, Roth, & Zimak, 2003), a setting also referred to as learning to rank or ordinal regression. Several key assumptions are made about the structure of the investigated data within the ordinal classification setting. First, the target variable is assumed to have an ordinal nature. As formulated in Chu and Sathiya Keerthi (2007), this setting bridges the gap between regression and classification problems. The second assumption concerns the structure of the features describing the given data. We consider input variables of an ordinal nature, that is, each feature provides a non-strict ranking over the set of objects (Ben-David, Sterling, & Pao, 1989; Furnkranz & Hullermeier, 2003; Kotlowski & Slowinski, 2009). We also make the following assumptions about the feature structure (Strijov, Granic, Juric, Jelavic, & Maricic, 2011):

• the given set of features is sufficient to construct an adequate ordinal classification model;
• the rule "bigger is better" applies, that is, larger feature values imply greater object preference.

In the current research we consider the important requirement of monotonicity between the input and output ordinal variables. Monotone relationships are frequently encountered in machine learning (Kotlowski & Slowinski, 2009). Recent investigations (Duivesteijn & Feelders, 2008) showed that satisfying the monotonicity requirements leads to better prediction quality.

(Footnote to the title: This project was supported by the Ministry of Education and Science of the Russian Federation, RFMEFI60414X0041, and by the Russian Foundation for Basic Research, Grant 14-07-31042. Corresponding author: M.P. Kuznetsov. Tel.: +7 (926) 832 01 97; fax: +7 (499) 783 33 27.)

The problem of monotone ordinal classification arises in the area of information retrieval (Xia, Liu, Wang, Zhang, & Li, 2008). The most common methods are based on pairwise comparisons of objects. To solve problems of this kind, the modified support vector machine (Yue, Finley, Radlinski, & Joachims, 2007) and boosting (Freund, Iyer, Schapire, & Singer, 2003) are used. However, an important shortcoming of such boosting-like models is their complexity and non-interpretability in the context of the investigated field.

In this paper we propose a new ordinal classification method that operates on ordinal input and target variables. The method constructs a classification model based on the concept of object dominance. We propose to describe the ordinal class boundaries using the Pareto front idea (Nogin, 2003). We propose to construct the set of Pareto fronts corresponding to the ordinal classes and to predict the object class using the nearest Pareto front boundary. To build a correct Pareto front model, we introduce the concept of a separable sample and propose a new efficient method for its construction. The main advantage of the method is that it yields simple, interpretable models at a comparable level of prediction accuracy.

We use the proposed classification method to categorize threatened animal and plant species included in the International Union for Conservation of Nature (IUCN) Red List. Each species on this list belongs to one of seven possible categories: extinct, extinct in the wild, critically endangered, endangered, vulnerable, near threatened, and least concern. This categorization is monotone with respect to the risk of extinction.
The object-feature matrix for the categorization problem consists of ordinal species descriptions assigned by the experts and class labels corresponding to the species categories. The problem is to construct a model that estimates a class label from the species description. Table 1 shows four features within the object-feature matrix: "Population condition", "Population trend", "Population structure condition", "Population structure trend".

Table 1. The part of ordinal IUCN data in the object-feature matrix.
Feature | Condition | Trend
Population | 3 – big; 2 – small; 1 – critically small | 4 – grows; 3 – stable; 2 – reduces; 1 – reduces fast
Population structure | 2 – complex; 1 – simple | 2 – stable; 1 – disappear

We show that the proposed Pareto front method makes it possible to construct an accurate, stable and interpretable ordinal classification model for the problem of the IUCN categorization. In addition to the IUCN data, we test the method on benchmark datasets. We compare the method with two ordinal classification algorithms from Frank and Hall (2001) and one method of classification with monotonicity constraints from Duivesteijn and Feelders (2008). Furthermore, to emphasize the ordinal nature of the algorithm, we apply an ordinal feature transformation to the benchmark datasets and compare the methods on the transformed data.

http://dx.doi.org/10.1016/j.eswa.2015.03.021
0957-4174/© 2015 Published by Elsevier Ltd.

Please cite this article in press as: Stenina, M. M., et al. Ordinal classification using Pareto fronts. Expert Systems with Applications (2015), http://dx.doi.org/10.1016/j.eswa.2015.03.021

2. Problem setting

Consider the set of pairs

$$D = \{(\mathbf{x}_i, y_i)\}, \quad i \in I = \{1, \dots, m\},$$

consisting of objects $\mathbf{x}_i$ and class labels $y_i$. Each object

$$\mathbf{x}_i = [x_{i1}, \dots, x_{ij}, \dots, x_{id}]^{\mathsf{T}}, \quad j \in J = \{1, \dots, d\},$$

is described by ordinal-scaled measurements. This means that the set of values for feature $j$ is a finite ordered set $L_j$ with a binary relation $\succ$. In this paper, we consider only strict total orders, i.e., total, non-reflexive, asymmetric and transitive binary relations. However, the proposed methods can be generalized to the case of partial orders. The set of values $\mathbb{Y}$ for the class labels $y_i$ is also a finite strictly ordered set $\mathbb{Y} = \{l_1, \dots, l_Y\}$ with a binary relation $l_1 \prec \dots \prec l_Y$. The problem is to find a monotonic function

$$\varphi : \mathbf{x} \mapsto \hat{y}, \tag{1}$$

where $\mathbf{x} \in \mathbb{X} = L_1 \times \dots \times L_d$ and $\hat{y} \in \mathbb{Y}$. This function should minimize the error $S(\varphi)$,

$$S(\varphi) = \sum_{i \in I} r(y_i, \hat{y}_i), \tag{2}$$

where $\hat{y}_i = \varphi(\mathbf{x}_i)$, and $r(\cdot, \cdot)$ is the loss function between elements of the ordered set $\mathbb{Y}$. To define a loss function between elements of the set $\mathbb{Y} = \{l_1, \dots, l_Y \mid l_1 \prec \dots \prec l_Y\}$, we introduce a binary matrix $\mathbf{Y}$ (Table 2) describing the binary relations between elements of $\mathbb{Y}$. If $l_i \succ l_k$, then the $(i, k)$-th element of the matrix is 1. For strict total orders, the matrix is lower triangular with zeros on the diagonal. Element $l_i$ of set $\mathbb{Y}$ corresponds to row $i$ of the matrix. We define the loss value $r(\cdot, \cdot)$ as the difference between rows of the matrix $\mathbf{Y}$,

$$r(l_i, l_{i'}) = \sum_{k=1}^{Y} \left| \mathbf{Y}(i, k) - \mathbf{Y}(i', k) \right|. \tag{3}$$

Table 2. Matrix of an ordered set.
Labels | l_1 | l_2 | ... | l_{Y-1} | l_Y
l_1 | 0 | 0 | ... | 0 | 0
l_2 | 1 | 0 | ... | 0 | 0
... | ... | ... | ... | ... | ...
l_{Y-1} | 1 | 1 | ... | 0 | 0
l_Y | 1 | 1 | ... | 1 | 0

3. Two-class Pareto classification

Consider a special two-class case of problem (1) such that $\mathbb{Y} = \{l_1, l_2\} = \{0, 1\}$, $0 \prec 1$. That is, the sample $D$ consists of objects with class labels denoted by 0 or 1. Let $f$ be the monotonic function minimizing error (2) in the two-class case. To construct $f$ in this case, we first define $f(\mathbf{x})$ on the separable sample $\hat{D}$,

$$\hat{D} = \{(\mathbf{x}_i, y_i)\}, \quad i \in \hat{I} \subseteq I,$$

such that $\hat{D}$ is a subset of the entire sample $D$. The "separable sample" concept for ordered sets means that there exists a hull (or Pareto optimal front) corresponding to each class defined by the binary relation "$\succ$" such that the hulls for two classes do not intersect. First, the function $f$ will be defined on the separable sample set $\hat{D}$ such that the error function (2) is zero on $\hat{D}$. Second, the definition of the mapping $f$ will be extended to the entire sample $D$ and to the whole set of objects $\mathbb{X}$. For the two-class classification problem, we split the set of object indices $\hat{I}$ of the separable sample $\hat{D}$ into two subsets:

$$\hat{I} = \mathcal{N} \sqcup \mathcal{P}$$

such that $y_n = 0$ for $n \in \mathcal{N}$, and $y_p = 1$ for $p \in \mathcal{P}$.

3.1. Dominance relation

We now introduce the concept of a dominance relation. This includes n-domination and p-domination. We say that object $\mathbf{x}_n = [x_{n1}, \dots, x_{nd}]^{\mathsf{T}}$ n-dominates object $\mathbf{x}_i = [x_{i1}, \dots, x_{id}]^{\mathsf{T}}$,

$$\text{or } \mathbf{x}_n \succ_n \mathbf{x}_i, \quad \text{if } x_{nj} \succeq x_{ij} \text{ for each } j = 1, \dots, d.$$

We say that object $\mathbf{x}_p = [x_{p1}, \dots, x_{pd}]^{\mathsf{T}}$ p-dominates object $\mathbf{x}_k = [x_{k1}, \dots, x_{kd}]^{\mathsf{T}}$,

$$\text{or } \mathbf{x}_p \succ_p \mathbf{x}_k, \quad \text{if } x_{pj} \preceq x_{kj} \text{ for each } j = 1, \dots, d.$$

We can assume that an object neither n-dominates nor p-dominates itself,

$$\mathbf{x} \nsucc_n \mathbf{x}, \quad \mathbf{x} \nsucc_p \mathbf{x}.$$


Fig. 1 illustrates the dominance relation in the case of two features; the x-axis denotes feature values from set L1 , and the y-axis denotes feature values from set L2 . The yellow region indicates the n-dominance space for object xn and the p-dominance space for object xp . Object xn n-dominates each object xi from the corresponding n-dominance space, and object xp p-dominates each object xk from the corresponding p-dominance space.
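The two dominance relations can be illustrated with a minimal sketch. It assumes that ordinal feature values are encoded as integers whose numeric order matches the order on each set L_j; the function names `n_dominates` and `p_dominates` are hypothetical and introduced only for this illustration.

```python
from typing import Sequence


def n_dominates(a: Sequence[int], b: Sequence[int]) -> bool:
    """a n-dominates b: every feature of a is not less than that of b.

    Assumption: an object does not dominate itself, so equal vectors
    are excluded explicitly.
    """
    return tuple(a) != tuple(b) and all(aj >= bj for aj, bj in zip(a, b))


def p_dominates(a: Sequence[int], b: Sequence[int]) -> bool:
    """a p-dominates b: every feature of a is not greater than that of b."""
    return tuple(a) != tuple(b) and all(aj <= bj for aj, bj in zip(a, b))


if __name__ == "__main__":
    print(n_dominates((5, 4), (3, 4)))  # (5, 4) covers (3, 4) from above
    print(p_dominates((3, 4), (5, 4)))  # (3, 4) covers (5, 4) from below
    print(n_dominates((5, 4), (6, 1)))  # incomparable objects
```

Objects that are better in one feature and worse in another are incomparable, so neither relation holds for them in either direction.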


3.2. Pareto front construction


Let us define the Pareto fronts, the sets describing class boundaries for the separable sample.

Fig. 1. Dominance relation. (Axes: Feature 1, Feature 2.)

Fig. 3. Two-class classification example. (Axes: Feature 1, Feature 2.)

Definition 1. A set of objects $\mathbf{x}_n$, $n \in \mathcal{N}$, is called the Pareto front $POF_n$ if, for each element $\mathbf{x}_n \in POF_n$, there does not exist any $\mathbf{x}$ such that $\mathbf{x} \succ_n \mathbf{x}_n$.

Definition 2. A set of objects $\mathbf{x}_p$, $p \in \mathcal{P}$, is called the Pareto front $POF_p$ if, for each element $\mathbf{x}_p \in POF_p$, there does not exist any $\mathbf{x}$ such that $\mathbf{x} \succ_p \mathbf{x}_p$.

Fig. 2 illustrates Pareto fronts for the two-class separable sample. Each object is described by two features. The x-axis denotes feature values from set $L_1$, and the y-axis denotes feature values from set $L_2$. The green triangles and blue squares are the objects from different classes. The objects forming the Pareto fronts are denoted by red circles. The dotted line indicates the n-front class boundary, and the solid line indicates the p-front class boundary.
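Definitions 1 and 2 can be sketched in a few lines. The sketch assumes integer-coded ordinal features and assumes that the candidate dominating object x in each definition ranges over the objects of the same class; the helper names and the toy data are hypothetical.

```python
def n_dominates(a, b):
    """a n-dominates b: a is not less than b in every feature, and a != b."""
    return tuple(a) != tuple(b) and all(x >= y for x, y in zip(a, b))


def p_dominates(a, b):
    """a p-dominates b: a is not greater than b in every feature, and a != b."""
    return tuple(a) != tuple(b) and all(x <= y for x, y in zip(a, b))


def pareto_fronts(class0, class1):
    """Return (POF_n, POF_p) for a two-class sample.

    POF_n keeps the class-0 objects that no other class-0 object n-dominates
    (the maximal elements); POF_p keeps the class-1 objects that no other
    class-1 object p-dominates (the minimal elements).
    """
    pof_n = [a for a in class0 if not any(n_dominates(b, a) for b in class0)]
    pof_p = [a for a in class1 if not any(p_dominates(b, a) for b in class1)]
    return pof_n, pof_p


if __name__ == "__main__":
    class0 = [(1, 1), (2, 3), (3, 2), (1, 3)]
    class1 = [(4, 4), (5, 3), (3, 5), (5, 5)]
    pof_n, pof_p = pareto_fronts(class0, class1)
    print("POF_n:", pof_n)
    print("POF_p:", pof_p)
```

The pairwise scan is quadratic in the class size, which is consistent with the cost estimate given later in the paper.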


3.3. Two-class classification


We now use the constructed Pareto fronts and the corresponding class boundaries to define a monotone classifier. The function $f : \mathbf{x} \mapsto \hat{y}$ assigns the class label 0 to an object $\mathbf{x} \in \mathbb{X}$ if there exists an object $\mathbf{x}_n \in POF_n$ that $\tilde{n}$-dominates $\mathbf{x}$. The function $f$ assigns the class label 1 to an object $\mathbf{x} \in \mathbb{X}$ if there exists an object $\mathbf{x}_p \in POF_p$ that $\tilde{p}$-dominates $\mathbf{x}$. Thus,

$$f(\mathbf{x}) = \begin{cases} 0, & \text{if there exists } \mathbf{x}_n \in POF_n : \mathbf{x}_n \succ_{\tilde{n}} \mathbf{x}; \\ 1, & \text{if there exists } \mathbf{x}_p \in POF_p : \mathbf{x}_p \succ_{\tilde{p}} \mathbf{x}. \end{cases} \tag{4}$$

If the set of such elements is empty, we extend the definition of $f$ to the entire set $\mathbb{X}$ according to the nearest Pareto front:

$$f(\mathbf{x}) = f\Bigl( \arg\min_{\mathbf{x}' \in POF_n \cup POF_p} \rho(\mathbf{x}, \mathbf{x}') \Bigr),$$

where the sets $POF_n$, $POF_p$ include Pareto fronts and boundary points corresponding to the imaginary objects. The function $\rho$ is defined by function (3) applied to the feature values:

$$\rho(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{d} r(x_j, x'_j). \tag{5}$$
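The following sketch puts together the loss (3), the distance (5) and the classification rule (4) with its nearest-front extension. It rests on several assumptions that are not taken from the paper: feature levels are coded as integers 0..K-1 with the same K for every feature, domination in rule (4) is checked component-wise and allows equality, the fronts contain only sample objects (the imaginary boundary points are omitted), and distance ties go to the first candidate. All names and the toy fronts are hypothetical.

```python
def order_matrix(size):
    """Binary matrix of a strict total order: entry (i, k) is 1 if level i is above level k."""
    return [[1 if i > k else 0 for k in range(size)] for i in range(size)]


def loss(i, k, matrix):
    """Loss (3): number of positions in which rows i and k of the matrix differ."""
    return sum(abs(a - b) for a, b in zip(matrix[i], matrix[k]))


def rho(x, z, matrix):
    """Distance (5): sum of the per-feature losses between objects x and z."""
    return sum(loss(a, b, matrix) for a, b in zip(x, z))


def classify(x, pof_n, pof_p, matrix):
    """Two-class rule: dominated objects first, nearest front object otherwise."""
    if any(all(a >= b for a, b in zip(xn, x)) for xn in pof_n):
        return 0
    if any(all(a <= b for a, b in zip(xp, x)) for xp in pof_p):
        return 1
    labelled = [(xn, 0) for xn in pof_n] + [(xp, 1) for xp in pof_p]
    nearest = min(labelled, key=lambda pair: rho(x, pair[0], matrix))
    return nearest[1]


M = order_matrix(7)                  # seven levels, coded 0..6
POF_N = [(2, 3), (3, 2)]             # toy n-front of class 0
POF_P = [(4, 4), (5, 3), (3, 5)]     # toy p-front of class 1

if __name__ == "__main__":
    for obj in [(1, 2), (5, 5), (4, 2), (4, 3)]:
        print(obj, "->", classify(obj, POF_N, POF_P, M))
```

For a strict total order the loss (3) reduces to the absolute difference of the level indices, so the distance (5) is a sum of per-feature index gaps.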

In other words, $f$ classifies an object $\mathbf{x}$ according to the rule of the nearest POF if the object $\mathbf{x}$ is not dominated by any POF. Fig. 3 shows model data consisting of two-class objects. Objects in the first class are indicated by green triangles, and those in the second class are denoted by blue squares. Each object is described by two features. The x-axis shows feature values from set $L_1$, the y-axis represents those from set $L_2$. The classified objects are indicated by the black circles. Table 3 gives the classification results. The first column contains the object number, the second column contains the object coordinates, and the third column contains the classifier output. The label 0 implies that the object was classified as a green triangle, whereas a 1 denotes that the object was classified as a blue square.

Table 3. Two-class classifier example.
Object | x | f(x)
1 | (4, 5) | 0
2 | (6, 7) | 1
3 | (9, 6) | 1

Fig. 2. Pareto fronts. (Axes: Feature 1, Feature 2.)

3.4. Separable sample construction

Consider the method of constructing a set $\hat{I}$ such that the function $f : \mathbf{x} \mapsto \hat{y}$ is monotone over the corresponding subsample. Split the set of indices $I$ into two subsets

$$I = \mathcal{N} \sqcup \mathcal{P}$$

such that

$$y_n = 0, \; n \in \mathcal{N}; \quad \text{and} \quad y_p = 1, \; p \in \mathcal{P}.$$

Consider the power $\mu$ of the set of objects dominated by object $\mathbf{x}_i$ and belonging to the foreign class:

$$\mu(\mathbf{x}_i) = \begin{cases} \#\{\mathbf{x}_j \mid \mathbf{x}_i \succ_n \mathbf{x}_j, \; j \in \mathcal{P}\}, & \text{if } i \in \mathcal{N}; \\ \#\{\mathbf{x}_j \mid \mathbf{x}_i \succ_p \mathbf{x}_j, \; j \in \mathcal{N}\}, & \text{if } i \in \mathcal{P}, \end{cases}$$

where $\#$ denotes the power (cardinality) of the set. To find the set $\hat{I}$, we consecutively eliminate defective objects from the entire sample $D$.

Algorithm. Separable sample construction.
Input: $I = \mathcal{P} \sqcup \mathcal{N}$.
Return: $\hat{I} = \hat{\mathcal{P}} \sqcup \hat{\mathcal{N}}$.
$\hat{I} := I$; $\hat{\mathcal{P}} := \mathcal{P}$; $\hat{\mathcal{N}} := \mathcal{N}$; {initialization}
while the sample has objects $\mathbf{x}_i$, $i \in \hat{I}$, such that $\mu(\mathbf{x}_i) > 0$ do
    $\hat{i} := \arg\max_{i \in \hat{I}} \mu(\mathbf{x}_i)$;
    $\hat{I} := \hat{I} \setminus \{\hat{i}\}$;
    if $\hat{i} \in \hat{\mathcal{P}}$ then $\hat{\mathcal{P}} := \hat{\mathcal{P}} \setminus \{\hat{i}\}$;
    if $\hat{i} \in \hat{\mathcal{N}}$ then $\hat{\mathcal{N}} := \hat{\mathcal{N}} \setminus \{\hat{i}\}$.

Fig. 4 shows the model sample set, where objects are described with two features. This sample set consists of two classes (green triangles and blue squares). Fig. 4(a) shows the sample set with defective objects at coordinates (1, 1), (5, 4), and (8, 6). The defective objects dominate objects in the opposite class. These objects are indicated by red circles. Fig. 4(b) shows the separable subsample obtained once the defective objects have been eliminated.
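The elimination of defective objects can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the authors' implementation: features are integer-coded, domination between objects of different classes is checked component-wise and allows equality, the defect count is recomputed over the objects still in the sample, and ties are broken by the lowest index. The function name and the toy data are hypothetical.

```python
def separable_sample(objects, labels):
    """Return the sorted indices of a separable subsample of a two-class sample.

    objects: list of integer feature tuples; labels: list of 0/1 class labels.
    """
    keep = set(range(len(objects)))

    def mu(i):
        # Number of remaining foreign-class objects dominated by object i.
        if labels[i] == 0:  # class 0 should lie below class 1
            return sum(
                1 for j in keep
                if labels[j] == 1
                and all(a >= b for a, b in zip(objects[i], objects[j]))
            )
        return sum(
            1 for j in keep
            if labels[j] == 0
            and all(a <= b for a, b in zip(objects[i], objects[j]))
        )

    while keep:
        worst = max(sorted(keep), key=mu)  # first index with the largest count
        if mu(worst) == 0:
            break                          # no defective objects remain
        keep.remove(worst)
    return sorted(keep)


if __name__ == "__main__":
    objects = [(1, 1), (2, 3), (6, 6), (4, 4), (5, 5), (1, 2)]
    labels = [0, 0, 0, 1, 1, 1]
    print(separable_sample(objects, labels))
```

Recomputing the counts on every pass keeps the sketch short; storing the dominated sets for each object, as the cost discussion in the paper suggests, trades memory for speed.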


4. Ordinal classification


4.1. Ordinal classifier construction

Fig. 4. Eliminating defective objects. (Axes: Feature 1, Feature 2.)

Consider a general case of the problem:

$$\mathbb{Y} = \{l_1, \dots, l_u, l_{u+1}, \dots, l_Y\}, \quad l_1 \prec \dots \prec l_u \prec l_{u+1} \prec \dots \prec l_Y.$$

We denote the class label indices by $\{1, \dots, u, u+1, \dots, Y\}$. The Pareto two-class classifier can be constructed as

$$f_{u,u+1} : \mathbf{x} \mapsto \hat{y} \in \{0, 1\}, \quad \mathbf{x} \in \mathbb{X},$$

for each pair of adjacent classes $u, u+1$. To construct the two-class classifier, we split the sample into two parts: objects with class labels $\preceq l_u$ and objects with class labels $\succeq l_{u+1}$:

$$\hat{I} = \mathcal{N}_u \sqcup \mathcal{P}_{u+1},$$

where $n \in \mathcal{N}_u$ if $y_n \preceq l_u$, and $p \in \mathcal{P}_{u+1}$ if $y_p \succeq l_{u+1}$.

The ordinal classifier

$$\varphi(\mathbf{x}) = \varphi(f_{1,2}, \dots, f_{Y-1,Y})(\mathbf{x}), \quad \varphi : \mathbb{X} \to \mathbb{Y},$$

is defined as follows:

$$\varphi(\mathbf{x}) = \begin{cases} \min\limits_{l_u \in \mathbb{Y}} \{l_u \mid f_{u,u+1}(\mathbf{x}) = 0\}, & \text{if } \{l_u \mid f_{u,u+1}(\mathbf{x}) = 0\} \neq \emptyset; \\ l_Y, & \text{if } \{l_u \mid f_{u,u+1}(\mathbf{x}) = 0\} = \emptyset. \end{cases} \tag{6}$$

Table 4 illustrates the application of Eq. (6). The output of the ordinal classifier $\varphi(\mathbf{x})$ is the label $l_u$ of the first class for which the classifier $f_{u,u+1}$ equals 0, whereas if every classifier $f_{u,u+1}$ outputs 1, we assign the label $l_Y$ to the ordinal classification result.

Table 4. Monotone classifier illustration.
Classifier (u, u+1) | 1, 2 | u−1, u | ... | u, u+1 | ... | Y−1, Y
Output | 1 | 1 | ... | 0 | ... | 0

Fig. 5 shows an example for a set of objects from three different classes. The axes denote feature values describing the objects. The various objects are indicated by the red circles, green triangles and blue squares. The class boundaries corresponding to the n-fronts are indicated by the dotted line, and the solid line indicates the p-fronts. The classified objects are shown by black circles. Table 5 shows an example of the set of two-class classifiers $f_{1,2}$, $f_{2,3}$ included in the ordinal classifier $\varphi(\mathbf{x})$ for the illustrated sample. The first column contains object numbers, the second their coordinates, and the third and fourth columns contain the two-class classification results for the adjacent classes. A label of 0 in the third column means that the classifier $f_{1,2}$ assigned the object to the first class, and a label of 0 in the fourth column means that the classifier $f_{2,3}$ assigned the object to the second class. The final column contains the results of ordinal classification. The values of this column correspond to the output of the ordinal classifier.

Fig. 5. Pareto front, feature 1 is preferable to feature 2. (Axes: Feature 1, Feature 2.)

Table 5. Ordinal classifier example.
Number | Object, x | f_{1,2}(x) | f_{2,3}(x) | φ(x)
1 | (1, 1) | 0 | 0 | 1
2 | (5, 4) | 1 | 0 | 2
3 | (9, 9) | 1 | 1 | 3

4.2. Extension of Pareto front definition for ordinal classification

To construct the fronts between classes labeled as $l_u$ and $l_{u+1}$, we use objects with class labels $l_1, \dots, l_u$ to construct an n-front for class $l_u$ and objects with class labels $l_{u+1}, \dots, l_Y$ to construct a p-front for class $l_{u+1}$. Therefore, the same objects can belong to fronts for different classes, and a front for one class can contain objects of different classes. Fig. 6 illustrates a sample containing three-class model data. Red circles denote objects in the first class, and green triangles denote those in the second. The figure shows that object (7, 2) from the first class belongs to the n-fronts of both the first and second classes. This implies that the definition of the n-front should be extended to objects whose class label is not greater than the n-front class label; the definition of the p-front should be extended to objects whose class label is not less than the p-front class label.

Fig. 6. An example of a common object for the two fronts. (Axes: Feature 1, Feature 2.)

4.3. Admissible classifiers

In this section, we introduce the important property of admissibility for an ordinal classifier. We prove a theorem about the admissibility conditions of the constructed classifier $\varphi$.

Definition 3. The classifier $\varphi$ in Eq. (6) is said to be admissible if, for each function $f_{u,u+1}$, the transitivity condition holds:

$$\begin{cases} \text{if } f_{u,u+1}(\mathbf{x}) = 0, \text{ then } f_{(u+s),(u+1+s)}(\mathbf{x}) = 0 \text{ for each } s : (u + 1 + s) \leq Y; \\ \text{if } f_{u,u+1}(\mathbf{x}) = 1, \text{ then } f_{(u-s),(u+1-s)}(\mathbf{x}) = 1 \text{ for each } s : (u - s) \geq 1. \end{cases} \tag{7}$$

Definition 4. We say that Pareto fronts $POF_n(u)$ and $POF_p(u+1)$ do not intersect,

$$POF_n(u) \cap POF_p(u+1) = \emptyset,$$

if the boundaries of their dominance spaces do not intersect.

Fig. 2 shows an example of non-intersecting Pareto fronts.

Theorem 1. If the Pareto fronts do not intersect,

$$POF_n(u) \cap POF_p(u+1) = \emptyset, \quad u = 1, \dots, Y - 1,$$

then the transitivity condition (7) holds for any classified object.

Proof. We prove the theorem for the case of three classes, $Y = 3$. For more classes, the proof is similar. Suppose that the Pareto fronts do not intersect,

$$POF_n(u) \cap POF_p(u+1) = \emptyset, \quad u = 1, 2,$$

and that there exists an object $\mathbf{x}$ such that the transitivity condition does not hold,

$$f_{1,2}(\mathbf{x}) = 0, \quad f_{2,3}(\mathbf{x}) = 1.$$

(The case $f_{1,2}(\mathbf{x}) = 1$, $f_{2,3}(\mathbf{x}) = 0$ is similar.) The result $f_{1,2}(\mathbf{x}) = 0$ can be obtained if one of the two following conditions holds:

1. $\exists\, \mathbf{y} \in POF_n(1) : \mathbf{y} \succ_n \mathbf{x}$. If $\mathbf{y} \in POF_n(2)$, then it follows that $f_{2,3}(\mathbf{x}) = 0$, which contradicts the assumption that $f_{2,3}(\mathbf{x}) = 1$. If $\mathbf{y} \notin POF_n(2)$, then

$$\exists\, \mathbf{w} \in POF_n(2) : \mathbf{w} \succ_n \mathbf{y},$$

and it follows that

$$POF_n(2) \ni \mathbf{w} \succ_n \mathbf{y} \succ_n \mathbf{x} \;\Rightarrow\; f_{2,3}(\mathbf{x}) = 0.$$

2. The fronts $POF_n(1)$ and $POF_p(2)$ do not dominate object $\mathbf{x}$. In this case,

$$\exists\, \mathbf{y}' \in POF_n(1) \text{ such that } \mathbf{y}' = \arg\min_{\mathbf{y} \in POF_n(1) \cup POF_p(2)} \rho(\mathbf{x}, \mathbf{y}),$$

where $\rho$ is the distance function (5). The assumption $f_{2,3}(\mathbf{x}) = 1$ holds in one of two possible cases.

(a) $\exists\, \mathbf{t} \in POF_p(3) : \mathbf{t} \succ_p \mathbf{x}$. The object $\mathbf{t}$ does not belong to $POF_p(2)$, because $POF_p(2)$ does not dominate $\mathbf{x}$. Then, it follows that

$$\exists\, \mathbf{t}' \in POF_p(2) : \mathbf{t}' \succ_p \mathbf{t}.$$

Therefore, we obtain a chain of domination inequalities,

$$\mathbf{t}' \succ_p \mathbf{t} \succ_p \mathbf{x},$$

and it follows that $\mathbf{t}' \succ_p \mathbf{x}$. This contradicts the assumption that front $POF_p(2)$ does not dominate object $\mathbf{x}$.

(b) Object $\mathbf{x}$ is not dominated by fronts $POF_n(2)$ and $POF_p(3)$. In this case, object $\mathbf{x}$ is not dominated by any front $POF_n(u)$, $POF_p(u+1)$ with $u = 1, 2$. Note that there exists an object $\mathbf{y}_1 \in POF_n(1)$ whose dominance space boundary contains the point $\mathbf{y}'$, where $\mathbf{y}'$ is the nearest point to $\mathbf{x}$ in the sense of the Hamming distance (5). Object $\mathbf{y}_1$ can belong to front $POF_n(2)$. In this case, the distance between $\mathbf{x}$ and $POF_n(2)$ is less than the distance between $\mathbf{x}$ and $POF_n(1)$. (Here, the distance between a point and a front means the distance between a point and the nearest point of a front.) If object $\mathbf{y}_1$ does not belong to front $POF_n(2)$, there exists an object

$$\mathbf{y}_2 \in POF_n(2) : \mathbf{y}_2 \succ_n \mathbf{y}_1.$$

However, $\mathbf{y}_2 \nsucc_n \mathbf{x}$, because the object $\mathbf{x}$ is not dominated by any front. Hence, it follows that the distance between object $\mathbf{x}$ and front $POF_n(2)$ is less than the distance between object $\mathbf{x}$ and point $\mathbf{y}'$ on front $POF_n(1)$.

The proof for the pair of fronts $POF_p(2)$, $POF_p(3)$ is similar. There exists an object $\mathbf{w}_1 \in POF_p(2)$ such that the boundary of its dominance space contains a point $\mathbf{w}'$, where $\mathbf{w}'$ is the nearest point to $\mathbf{x}$ in the sense of the Hamming distance (5). The distance between object $\mathbf{x}$ and front $POF_p(3)$ is not less than the distance between $\mathbf{x}$ and $\mathbf{w}' \in POF_p(2)$.

We have proved that the distance between $\mathbf{x}$ and $POF_n(2)$ is not greater than the distance between $\mathbf{x}$ and $POF_n(1)$, and the distance between $\mathbf{x}$ and $POF_p(3)$ is not less than the distance between $\mathbf{x}$ and $POF_p(2)$. From $f_{1,2}(\mathbf{x}) = 0$, it follows that the distance between $\mathbf{x}$ and $POF_n(1)$ is less than the distance between $\mathbf{x}$ and $POF_p(2)$. Thus, $\mathbf{x}$ is nearer to $POF_n(2)$ than to $POF_p(3)$. This contradicts the assumption that $f_{2,3}(\mathbf{x}) = 1$, and concludes the proof. □

Given that the method of Pareto fronts uses only separable samples, all fronts are disjoint. Therefore the ordinal classifier (6) is admissible and the transitivity condition (7) holds for any classified object.

4.4. Computational cost

To estimate the computational cost, we must compute the cost for the basic stages of the algorithm: the two-object comparison, the Pareto front construction, the separable sample construction, and the ordinal-class case.

1. The cost of a two-object comparison is $O(n)$, where $n$ is the number of features (denoted $d$ above).
2. The Pareto front construction method involves comparison of all pairs of objects. For balanced-size classes of size $O(m)$, the comparison costs $O(m^2 n)$, where $m$ is the total number of objects. After the pairwise comparison, the method finds the non-dominated objects using $O(m^2)$ operations.
3. The separable sample construction procedure involves a similar comparison of all object pairs between the different classes, which also costs $O(m^2 n)$ operations for balanced-size classes. The iterative procedure of eliminating the defective objects is also quadratic, due to the memorization of all dominant elements for each object. However, this memorization makes the memory costs quadratic as well.
4. In the ordinal-class case the total complexity is multiplied by the number of classes, so that the final complexity estimate is $O(m^2 n K)$, where $K$ is the number of classes (denoted $Y$ above).

We see that the Pareto front construction procedure is quite costly, and one of our further directions is to reduce the cost. A possible solution is to use the class transitivity property to construct the Pareto fronts for all classes together, eliminating the multiplier $K$ from the complexity estimate.
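Returning to Eq. (6) and the transitivity condition (7), a minimal sketch shows how the outputs of the two-class classifiers combine into an ordinal label. It assumes that the outputs of f_{1,2}, ..., f_{Y-1,Y} for one object are already available as a list of 0/1 values and that classes are numbered 1..Y; the function names are hypothetical.

```python
def ordinal_class(outputs):
    """Eq. (6): index of the first classifier f_{u,u+1} that outputs 0, else Y."""
    for u, out in enumerate(outputs, start=1):
        if out == 0:
            return u
    return len(outputs) + 1  # every classifier voted for the upper class


def is_transitive(outputs):
    """Condition (7): the outputs never switch from 0 back to 1."""
    return all(a >= b for a, b in zip(outputs, outputs[1:]))


if __name__ == "__main__":
    # Classifier outputs taken from the three rows of Table 5.
    for outputs in ([0, 0], [1, 0], [1, 1]):
        print(outputs, "->", ordinal_class(outputs), is_transitive(outputs))
    # An inadmissible combination: f_{1,2} says "class 1", f_{2,3} says "above class 2".
    print([0, 1], is_transitive([0, 1]))
```

With K classes the sketch calls K-1 two-class classifiers per object, which is where the multiplier K in the cost estimate comes from.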


5. Experimental results


5.1. Benchmark datasets


We verify the proposed method on several benchmark datasets. To test the method on ordinal data, we apply an additional monotone feature transformation to each dataset. We compare the method with two ordinal classification algorithms and one method of classification with monotonicity constraints.

We used the following datasets from the UCI repository: Pyrimidines, Machine CPU, Housing, Computer Activity, Abalone and Car (listed in the tables as Pyrimidines, MachineCPU, Boston, Computer, Abalone and Cars). All datasets except the last one correspond to regression problems, so we discretized the target variable into five levels containing equal numbers of objects. The experiment scheme duplicates the one from Chu and Sathiya Keerthi (2007). We randomly partitioned each dataset into training and test splits, as shown in Table 6. The partition was repeated 100 times independently. To measure the quality, we estimated the mean zero-one loss and the mean absolute loss on the test datasets.

For comparison we used two classification algorithms, the decision tree J48 (Trees) and the Support Vector Machine with the linear kernel (SVM), which were combined with the ordinal classification scheme from Frank and Hall (2001). Furthermore, we used the nearest-neighbor classification method with monotonicity constraints (KNN) from Duivesteijn and Feelders (2008).

The results for the five original datasets (all except "Cars") are given in Table 7. Bold numbers indicate that the corresponding method was statistically significantly better than the others. We see that the ordinal SVM achieved the lowest or near-lowest loss on every dataset, which we attribute to the linear nature of the features in the considered datasets.

To investigate the properties of the method on ordinal-scaled data, we applied an ordinal transformation to the dataset features. As for the target variable, we discretized all features into five levels containing equal numbers of objects. The results for the transformed datasets (and for "Cars", whose features were initially ordinal) are given in Table 8.
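The preprocessing and the two quality measures described above can be sketched as follows. The paper does not specify how the equal-frequency discretization treats tied values, so the rank-based binning below is an assumption (tied values may fall into different levels); the function names are hypothetical.

```python
def equal_frequency_levels(values, n_levels=5):
    """Map numeric values to ordinal levels 1..n_levels of (nearly) equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = rank * n_levels // len(values) + 1
    return levels


def mean_zero_one_loss(y_true, y_pred):
    """Share of objects whose predicted class differs from the true class."""
    return sum(1 for a, b in zip(y_true, y_pred) if a != b) / len(y_true)


def mean_absolute_loss(y_true, y_pred):
    """Average distance, in class indices, between predicted and true class."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    target = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(equal_frequency_levels(target))
    print(mean_zero_one_loss([1, 2, 3, 4], [1, 3, 3, 1]))
    print(mean_absolute_loss([1, 2, 3, 4], [1, 3, 3, 1]))
```

The absolute loss penalizes a prediction by how many classes it is off, which is why it can exceed 1 in Tables 7 and 8 while the zero-one loss cannot.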

Table 6. Description of the datasets.
Dataset | #Features | #Objects | Training/Test
Pyrimidines | 27 | 74 | 50/24
MachineCPU | 6 | 209 | 150/59
Boston | 13 | 506 | 300/206
Computer | 21 | 8182 | 4000/4182
Abalone | 8 | 4177 | 1000/3177
Cars | 6 | 1728 | 1000/728
RedBook | 101 | 102 | 100/1

Table 7. Experimental results: ordinal target, linear features. Columns 2–5: mean zero-one loss (±0.01); columns 6–9: mean absolute loss (±0.01).
Dataset | SVM | POF | Trees | KNN | SVM | POF | Trees | KNN
Pyrimidines | 0.50 | 0.62 | 0.61 | 0.55 | 0.64 | 0.90 | 0.84 | 0.75
MachineCPU | 0.44 | 0.44 | 0.47 | 0.51 | 0.53 | 0.53 | 0.53 | 0.61
Boston | 0.38 | 0.48 | 0.41 | 0.47 | 0.46 | 0.65 | 0.47 | 0.62
Computer | 0.32 | 0.71 | 0.38 | 0.60 | 0.35 | 1.36 | 0.41 | 0.90
Abalone | 0.53 | 0.59 | 0.57 | 0.60 | 0.78 | 0.92 | 0.77 | 0.88

Table 8. Experimental results: ordinal target, ordinal features. Columns 2–5: mean zero-one loss (±0.01); columns 6–9: mean absolute loss (±0.01).
Dataset | SVM | POF | Trees | KNN | SVM | POF | Trees | KNN
Pyrimidines | 0.57 | 0.58 | 0.60 | 0.61 | 0.71 | 0.77 | 0.79 | 0.76
MachineCPU | 0.51 | 0.39 | 0.47 | 0.43 | 0.65 | 0.45 | 0.56 | 0.51
Boston | 0.40 | 0.48 | 0.40 | 0.41 | 0.49 | 0.68 | 0.46 | 0.51
Computer | 0.44 | 0.69 | 0.41 | 0.45 | 0.53 | 1.38 | 0.45 | 0.55
Abalone | 0.78 | 0.59 | 0.58 | 0.59 | 1.78 | 0.92 | 0.76 | 0.89
Cars | 0.19 | 0.19 | 0.08 | 0.06 | 0.24 | 0.26 | 0.08 | 0.07
RedBook | 0.66 | 0.47 | 0.48 | 0.62 | 0.85 | 0.60 | 0.52 | 0.79

6. Summary and further research


We have proposed an ordinal classification method based on the construction of Pareto fronts, which describe the boundaries between ordinal classes. The ordinal classifier is constructed as a superposition of two-class Pareto classifiers. Our algorithm was compared with some well-known algorithms and demonstrated adequate results. It was used to solve the IUCN Red List categorization problem. The main advantage of the method is the simplicity and interpretability of the obtained models. At the same time, the computational experiments showed that the obtained models achieve prediction quality comparable with state-of-the-art ordinal classification methods.

Further investigations will be devoted to extending the scope of the proposed algorithm. The classification method will be extended to the case of partial orders defined over the set of features and over the set of feature values. From a practical point of view, the partial-order case corresponds to incomplete information given by the experts. To describe the partial orders, we will use the idea of a partial order cone proposed in Kuznetsov and Strijov (2014). Another way to improve the proposed classification algorithm is to take into account the expert information about feature preferences. A preference relation defined over the set of features makes it possible to restrict the set of admissible models and to achieve better prediction quality.


References


Ben-David, Arie, Sterling, Leon, & Pao, Yoh-Han (1989). Learning and classification of monotonic ordinal concepts. Computational Intelligence, 5(1), 45–49.
Chu, Wei, & Sathiya Keerthi, S. (2007). Support vector ordinal regression. Neural Computation, 19(3), 792–815.
Duivesteijn, Wouter, & Feelders, Ad (2008). Nearest neighbour classification with monotonicity constraints. In Walter Daelemans, Bart Goethals, & Katharina Morik (Eds.), Machine learning and knowledge discovery in databases. Lecture notes in computer science (Vol. 5211, pp. 301–316). Berlin Heidelberg: Springer. ISBN 978-3-540-87478-2.
Frank, Eibe, & Hall, Mark (2001). A simple approach to ordinal classification. Springer.
Freund, Yoav, Iyer, Raj, Schapire, Robert E., & Singer, Yoram (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969. ISSN 1532-4435.
Furnkranz, Johannes, & Hullermeier, Eyke (2003). Pairwise preference learning and ranking. Machine Learning ECML, 2003, 145–156.
Har-Peled, S., Roth, D., & Zimak, D. (2003). Constraint classification for multiclass classification and ranking. In NIPS (pp. 785–792).
Kotlowski, Wojciech, & Slowinski, Roman (2009). Rule learning with monotonicity constraints. In Proceedings of the 26th annual international conference on machine learning. ICML ’09 (pp. 537–544). New York, NY, USA: Dover. ISBN 978-1-60558-516-1.
Kuznetsov, M. P., & Strijov, V. V. (2014). Methods of expert estimations concordance for integral quality estimation. Expert Systems with Applications, 41(4), 1988–1996.
Nogin, V. D. (2003). The Edgeworth-Pareto principle and relative importance of criteria in the case of a fuzzy preference relation. Computational Mathematics and Mathematical Physics, 43(11), 1666–1676.
Strijov, Vadim, Granic, Goran, Juric, Jeljko, Jelavic, Branka, & Maricic, Sandra Antecevic (2011). Integral indicator of ecological impact of the Croatian thermal power plants. Energy, 36(7), 4144–4149.
Xia, Fen, Liu, Tie-Yan, Wang, Jue, Zhang, Wensheng, & Li, Hang (2008). Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th international conference on machine learning. ICML ’08 (pp. 1192–1199). New York, NY, USA: ACM. ISBN 978-1-60558-205-4.
Yue, Yisong, Finley, Thomas, Radlinski, Filip, & Joachims, Thorsten (2007). A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR ’07 (pp. 271–278). New York, NY, USA: ACM. ISBN 978-1-59593-597-7.


Fig. 7. Comparison of the computed and expert estimated categories. [Scatter plot; both axes are marked with categories 1, 2, 3.]


SVM method is significantly worse due to the linearity of the kernel, while the ‘‘Trees’’ and the ‘‘POF’’ methods demonstrate sufficiently good results. Another interesting observation is that the Trees method outperforms the other methods according to the mean absolute loss.


5.2. IUCN Red List dataset


The results for the IUCN Red List categorization are shown in the last row (RedBook) of Table 8. Unlike the other datasets, this dataset was evaluated with the leave-one-out partition scheme. The POF and the Trees methods demonstrate the best results according to the zero-one loss (0.47 and 0.48, respectively). Fig. 7 compares the categories computed by the POF algorithm under the leave-one-out scheme with those assigned by the experts. The x-axis denotes the computed categories, and the y-axis represents the categories determined by the experts. The radius of each point is proportional to the number of objects with the corresponding computed and expert categories. For a significant number of objects, the computed category was the same as that assigned by the experts.
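The evaluation protocol described above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind Table 8: the toy data are invented, the helper names are hypothetical, and a simple nearest-neighbour rule stands in for the POF classifier, whose details are given in the earlier sections. The sketch computes leave-one-out predictions, the mean zero-one and mean absolute losses, and the (computed, expert) category counts that would determine the point radii in a plot such as Fig. 7.

```python
from collections import Counter


def zero_one_loss(y_true, y_pred):
    """Fraction of objects whose computed category differs from the expert one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def absolute_loss(y_true, y_pred):
    """Mean absolute difference between computed and expert category indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def leave_one_out(X, y, fit_predict):
    """Predict each object's category from a model built on all other objects."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(fit_predict(X_train, y_train, X[i]))
    return preds


def nearest_neighbour(X_train, y_train, x):
    """Hypothetical stand-in classifier (1-NN, Manhattan distance); not POF."""
    dists = [sum(abs(a - b) for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]


# Invented toy data: two ordinal features, expert categories 1..3.
X = [[1, 1], [1, 2], [2, 2], [3, 2], [3, 3], [2, 3]]
y = [1, 1, 2, 3, 3, 2]

y_pred = leave_one_out(X, y, nearest_neighbour)
counts = Counter(zip(y_pred, y))  # (computed, expert) -> number of objects
print(zero_one_loss(y, y_pred), absolute_loss(y, y_pred))
print(counts)
```

The absolute loss treats the category labels as consecutive integers, which is an assumption about how the ordinal classes are encoded.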


Please cite this article in press as: Stenina, M. M., et al. Ordinal classification using Pareto fronts. Expert Systems with Applications (2015), http://dx.doi.org/10.1016/j.eswa.2015.03.021
