Ordinal classification using Pareto fronts*

M.M. Stenina (a), M.P. Kuznetsov (a,**), V.V. Strijov (a,b)

(a) Moscow Institute of Physics and Technology, Institutskiy Lane 9, Dolgoprudny, Moscow 141700, Russia
(b) Dorodnicyn Computing Center of Russian Academy of Sciences, Vavilov St. 40, 119333 Moscow, Russia

* This project was supported by the Ministry of Education and Science of the Russian Federation, RFMEFI60414X0041, and by the Russian Foundation for Basic Research, Grant 14-07-31042.
** Corresponding author. Tel.: +7 (926) 832 01 97; fax: +7 (499) 783 33 27. E-mail addresses: [email protected] (M.M. Stenina), [email protected] (M.P. Kuznetsov), [email protected] (V.V. Strijov).

Keywords: Ordinal classification; Pareto front; Expert estimations; Binary relation

Abstract. The paper presents an ordinal classification method using Pareto fronts. An object is described by a set of ordinal features assigned by experts. We describe the class boundaries by a set of Pareto fronts and propose to predict the object class using the nearest Pareto front boundary. The proposed method is illustrated by a real-world application to the International Union for Conservation of Nature Red List species categorization.

© 2015 Published by Elsevier Ltd.
1. Introduction
We investigate the supervised ordinal classification problem (Frank & Hall, 2001; Har-Peled, Roth, & Zimak, 2003), a setting also referred to as learning to rank or ordinal regression. Several key assumptions are made about the data structure within the ordinal classification setting. First, the target variable is assumed to have an ordinal nature. As formulated in Chu and Sathiya Keerthi (2007), this setting bridges the gap between regression and classification problems. The second assumption concerns the structure of the features describing the given data. We consider input variables of an ordinal nature, that is, each feature provides a non-strict ranking over the set of objects (Ben-David, Sterling, & Pao, 1989; Furnkranz & Hullermeier, 2003; Kotlowski & Slowinski, 2009). We also make the following assumptions about the feature structure (Strijov, Granic, Juric, Jelavic, & Maricic, 2011):

- the given set of features is sufficient to construct an adequate ordinal classification model;
- the rule that "bigger is better" applies, that is, larger feature values imply greater object preference.

In the current research we consider the important requirement of monotonicity between the input and output ordinal variables. Monotone relationships are frequently encountered in the machine learning area (Kotlowski & Slowinski, 2009). Recent investigations (Duivesteijn & Feelders, 2008) showed that satisfying the monotonicity requirements leads to better prediction quality.

The problem of monotone ordinal classification arises in the area of information retrieval (Xia, Liu, Wang, Zhang, & Li, 2008). The most common methods are based on pairwise object comparisons. To solve such problems, modified support vector machines (Yue, Finley, Radlinski, & Joachims, 2007) and boosting (Freund, Iyer, Schapire, & Singer, 2003) are used. However, an important shortcoming of such boosting-like models is their complexity and non-interpretability in the context of the investigated field.

In this paper we propose a new ordinal classification method that operates with ordinal input and target variables. The method constructs a classification model based on the object dominance concept. We propose to describe the ordinal class boundaries using the Pareto front idea (Nogin, 2003): we construct a set of Pareto fronts corresponding to the ordinal classes and predict the object class using the nearest Pareto front boundary. To build a correct Pareto front model, we introduce the concept of a separable sample and propose a new efficient method for its construction. The main advantage of the method is that it allows us to construct simple interpretable models at the same level of prediction accuracy.

We use the proposed classification method to categorize threatened animal and plant species included in the International Union for Conservation of Nature (IUCN) Red List. Each species on this list belongs to one of seven possible categories: extinct, extinct in the wild, critically endangered, endangered, vulnerable, near threatened, and least concern. This categorization is monotone with respect to the risk of extinction. The object-feature matrix for the categorization problem consists of ordinal species descriptions assigned by the experts and class labels corresponding to the species categories.
The problem is to construct a model that estimates a class label from the species description. Table 1 shows four features from the object-feature matrix: "Population condition", "Population trend", "Population structure condition", and "Population structure trend". We show that the proposed Pareto front method allows us to construct an accurate, stable and interpretable ordinal classification model for the IUCN categorization problem.

In addition to the IUCN data, we test the method on benchmark datasets. We compare the method with two ordinal classification algorithms from Frank and Hall (2001) and one method of classification with monotonicity constraints from Duivesteijn and Feelders (2008). Furthermore, to emphasize the ordinal nature of the algorithm, we apply an ordinal feature transformation to the benchmark datasets and compare the methods on the transformed data.

Table 1. A part of the ordinal IUCN data in the object-feature matrix.

| Feature              | Condition                                | Trend                                                |
|----------------------|------------------------------------------|------------------------------------------------------|
| Population           | 3 – big; 2 – small; 1 – critically small | 4 – grows; 3 – stable; 2 – reduces; 1 – reduces fast |
| Population structure | 2 – complex; 1 – simple                  | 2 – stable; 1 – disappears                           |
2. Problem setting
Consider the set of pairs

$$D = \{(x_i, y_i)\}, \quad i \in I = \{1, \ldots, m\},$$

consisting of objects $x_i$ and class labels $y_i$. Each object

$$x_i = [x_{i1}, \ldots, x_{ij}, \ldots, x_{id}]^{\top}, \quad x \in X = L_1 \times \cdots \times L_d, \quad j \in J = \{1, \ldots, d\}, \quad y \in Y,$$

is described by ordinal-scaled measurements. This means that the set of values for feature $j$ is a finite ordered set $L_j$ with a binary relation $\succ$. In this paper, we consider only strict total orders, i.e., total, non-reflexive, asymmetric and transitive binary relations. However, the proposed methods can be generalized to the case of partial orders. The set of values $Y$ for the class labels $y_i$ is also a finite strictly ordered set $Y = \{l_1, \ldots, l_Y\}$ with a binary relation $l_1 \prec \cdots \prec l_Y$. The problem is to find a monotonic function

$$\varphi: x \mapsto \hat{y}. \qquad (1)$$
This function should minimize the error $S(\varphi)$,

$$S(\varphi) = \sum_{i \in I} r(y_i, \hat{y}_i), \qquad (2)$$

where $\hat{y}_i = \varphi(x_i)$ and $r(\cdot,\cdot)$ is the loss function between elements of the ordered set $Y$.

To define a loss function between elements of the set $Y = \{l_1, \ldots, l_Y \mid l_1 \prec \cdots \prec l_Y\}$, we introduce a binary matrix $\mathbf{Y}$ (Table 2) describing the binary relations between elements of $Y$: if $l_i \succ l_k$, then the $(i,k)$-th element of the matrix is 1. For strict total orders, the matrix is lower triangular with zeros on the diagonal. Element $l_i$ of the set $Y$ corresponds to row $i$ of the matrix. We define the loss value $r(\cdot,\cdot)$ as the difference between rows of the matrix $\mathbf{Y}$,

$$r(l_i, l_{i'}) = \sum_{k=1}^{Y} |\mathbf{Y}(i,k) - \mathbf{Y}(i',k)|. \qquad (3)$$

Table 2. Matrix of an ordered set.

| Labels  | l_1 | l_2 | ... | l_{Y-1} | l_Y |
|---------|-----|-----|-----|---------|-----|
| l_1     | 0   | 0   | ... | 0       | 0   |
| l_2     | 1   | 0   | ... | 0       | 0   |
| ...     | ... | ... | ... | ...     | ... |
| l_{Y-1} | 1   | 1   | ... | 0       | 0   |
| l_Y     | 1   | 1   | ... | 1       | 0   |
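For concreteness, the relation matrix and the loss (3) can be computed directly; the following is a minimal sketch assuming labels are encoded by their 0-based indices (the helper names `order_matrix` and `label_loss` are ours, not the paper's):

```python
import numpy as np

def order_matrix(num_labels: int) -> np.ndarray:
    """Binary relation matrix Y for a strict total order l_1 < ... < l_Y:
    Y[i, k] = 1 iff l_i succeeds l_k, a lower-triangular 0/1 matrix (Table 2)."""
    return np.tril(np.ones((num_labels, num_labels), dtype=int), k=-1)

def label_loss(i: int, i2: int, Y: np.ndarray) -> int:
    """Loss r(l_i, l_{i'}) of Eq. (3): Hamming distance between rows of Y.
    For a strict total order this reduces to |i - i'|."""
    return int(np.abs(Y[i] - Y[i2]).sum())

Y = order_matrix(5)
assert label_loss(0, 3, Y) == 3  # r(l_1, l_4) = 3
```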
3. Two-class Pareto classification

Consider a special two-class case of problem (1) such that $Y = \{l_1, l_2\} = \{0, 1\}$, $0 \prec 1$. That is, the sample $D$ consists of objects with class labels denoted by 0 or 1. Let $f$ be the monotonic function minimizing error (2) in the two-class case. To construct $f$, we first define $f(x)$ on the separable sample $\hat{D}$,

$$\hat{D} = \{(x_i, y_i)\}, \quad i \in \hat{I} \subseteq I,$$

such that $\hat{D}$ is a subset of the entire sample $D$. The "separable sample" concept for ordered sets means that there exists a hull (or Pareto optimal front) corresponding to each class, defined by the binary relation $\succ$, such that the hulls for the two classes do not intersect. First, the function $f$ will be defined on the separable sample $\hat{D}$ such that the error function (2) is zero on $\hat{D}$. Second, the definition of the mapping $f$ will be extended to the entire sample $D$ and to the whole set of objects $X$. For the two-class classification problem, we split the set of object indices $\hat{I}$ of the separable sample $\hat{D}$ into two subsets,

$$\hat{I} = N \sqcup P,$$

such that $y_n = 0$ for $n \in N$, and $y_p = 1$ for $p \in P$.

3.1. Dominance relation

We now introduce the concept of a dominance relation, which includes n-domination and p-domination. We say that object $x_n = [x_{n1}, \ldots, x_{nd}]^{\top}$ n-dominates object $x_i = [x_{i1}, \ldots, x_{id}]^{\top}$, or $x_n \succ_n x_i$, if $x_{nj} \succeq x_{ij}$ for each $j = 1, \ldots, d$. We say that object $x_p = [x_{p1}, \ldots, x_{pd}]^{\top}$ p-dominates object $x_k = [x_{k1}, \ldots, x_{kd}]^{\top}$, or $x_p \succ_p x_k$, if $x_{pj} \preceq x_{kj}$ for each $j = 1, \ldots, d$. We assume that an object neither n-dominates nor p-dominates itself,

$$x \nsucc_n x, \qquad x \nsucc_p x.$$

Fig. 1 illustrates the dominance relation in the case of two features; the x-axis denotes feature values from the set $L_1$, and the y-axis denotes feature values from the set $L_2$. The yellow region indicates the n-dominance space for object $x_n$ and the p-dominance space for object $x_p$. Object $x_n$ n-dominates each object $x_i$ from the corresponding n-dominance space, and object $x_p$ p-dominates each object $x_k$ from the corresponding p-dominance space.

[Fig. 1. Dominance relation.]
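A minimal sketch of the two dominance tests, assuming each ordinal level is encoded as an integer with larger values more preferred; the no-self-domination convention above is approximated by excluding coordinate-wise equal vectors:

```python
import numpy as np

def n_dominates(x_n: np.ndarray, x_i: np.ndarray) -> bool:
    """x_n n-dominates x_i if every coordinate of x_n is >= that of x_i
    (ordinal levels encoded as integers). Equal vectors are excluded
    to reflect the convention that an object does not dominate itself."""
    return bool(np.all(x_n >= x_i)) and not np.array_equal(x_n, x_i)

def p_dominates(x_p: np.ndarray, x_k: np.ndarray) -> bool:
    """x_p p-dominates x_k if every coordinate of x_p is <= that of x_k."""
    return bool(np.all(x_p <= x_k)) and not np.array_equal(x_p, x_k)

assert n_dominates(np.array([3, 2]), np.array([1, 2]))
assert p_dominates(np.array([1, 1]), np.array([2, 3]))
```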
3.2. Pareto front construction
Let us define the Pareto fronts, the sets describing class boundaries for the separable sample.

Definition 1. A set of objects $x_n$, $n \in N$, is called the Pareto front $POF_n$ if, for each element $x_n \in POF_n$, there does not exist any $x$ such that $x \succ_n x_n$.

Definition 2. A set of objects $x_p$, $p \in P$, is called the Pareto front $POF_p$ if, for each element $x_p \in POF_p$, there does not exist any $x$ such that $x \succ_p x_p$.

Fig. 2 illustrates Pareto fronts for a two-class separable sample. Each object is described by two features. The x-axis denotes feature values from the set $L_1$, and the y-axis denotes feature values from the set $L_2$. The green triangles and blue squares are the objects from the two classes. The objects forming the Pareto fronts are denoted by red circles. The dotted line indicates the n-front class boundary, and the solid line indicates the p-front class boundary.

[Fig. 2. Pareto fronts.]
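Definitions 1 and 2 suggest a straightforward quadratic construction; the sketch below reuses the hypothetical `n_dominates`/`p_dominates` helpers from Section 3.1 and returns the indices of the non-dominated objects of one class:

```python
import numpy as np

def pareto_front(objects: np.ndarray, mode: str = "n") -> np.ndarray:
    """Indices of the front objects (Definitions 1 and 2).
    mode "n": keep objects not n-dominated by any other object;
    mode "p": keep objects not p-dominated by any other object."""
    dominates = n_dominates if mode == "n" else p_dominates
    front = [
        i for i, x in enumerate(objects)
        if not any(dominates(z, x) for j, z in enumerate(objects) if j != i)
    ]
    return np.array(front)
```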
3.3. Two-class classification
We now use the constructed Pareto fronts and the corresponding class boundaries to define a monotone classifier. The function $f: x \mapsto \hat{y}$ assigns the class label 0 to an object $x \in X$ if there exists an object $x_n \in POF_n$ that $\tilde{n}$-dominates $x$, and it assigns the class label 1 to an object $x \in X$ if there exists an object $x_p \in POF_p$ that $\tilde{p}$-dominates $x$. Thus,

$$f(x) = \begin{cases} 0, & \text{if there exists } x_n \in POF_n: x_n \succ_{\tilde{n}} x, \\ 1, & \text{if there exists } x_p \in POF_p: x_p \succ_{\tilde{p}} x. \end{cases} \qquad (4)$$

If the set of such elements is empty, we extend the definition of $f$ to the entire set $X$ according to the nearest Pareto front:

$$f(x) = f\Big(\arg\min_{x' \in POF_n \cup POF_p} \rho(x, x')\Big),$$

where the sets $POF_n$, $POF_p$ include the Pareto fronts and the boundary points corresponding to the imaginary objects. The function $\rho$ is defined by the loss (3) applied to the feature values:

$$\rho(x, x') = \sum_{j=1}^{d} r(x_j, x'_j). \qquad (5)$$

In other words, $f$ classifies an object $x$ according to the rule of the nearest POF if the object $x$ is not dominated by any POF. Fig. 3 shows model data consisting of two classes of objects. Objects in the first class are indicated by green triangles, and those in the second class are denoted by blue squares. Each object is described by two features; the x-axis shows feature values from the set $L_1$, and the y-axis represents those from the set $L_2$. The classified objects are indicated by black circles.

[Fig. 3. Two-class classification example.]

Table 3 gives the classification results. The first column contains the object number, the second the object coordinates, and the third the classifier output. The label 0 implies that the object was classified as a green triangle, whereas 1 denotes that the object was classified as a blue square.

Table 3. Two-class classifier example.

| Object | x      | f(x) |
|--------|--------|------|
| 1      | (4, 5) | 0    |
| 2      | (6, 7) | 1    |
| 3      | (9, 6) | 1    |
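A compact sketch of rules (4) and (5), reusing the dominance helpers above; for integer-coded levels the loss $r(x_j, x'_j)$ reduces to $|x_j - x'_j|$, and the handling of the imaginary boundary points is omitted for brevity:

```python
import numpy as np

def classify_two_class(x, pof_n, pof_p):
    """Two-class Pareto classifier, Eqs. (4)-(5): dominance test first,
    then the nearest-front rule with the distance rho."""
    if any(n_dominates(z, x) or np.array_equal(z, x) for z in pof_n):
        return 0
    if any(p_dominates(z, x) or np.array_equal(z, x) for z in pof_p):
        return 1
    # nearest Pareto front: rho(x, x') = sum_j |x_j - x'_j| for integer levels
    candidates = [(0, z) for z in pof_n] + [(1, z) for z in pof_p]
    label, _ = min(candidates, key=lambda t: np.abs(t[1] - x).sum())
    return label
```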
3.4. Separable sample construction
Consider the method of constructing a set $\hat{I}$ such that the function $f: x \mapsto \hat{y}$ is monotone over the corresponding subsample. Split the set of indices $I$ into two subsets,

$$I = N \sqcup P,$$

such that $y_n = 0$ for $n \in N$, and $y_p = 1$ for $p \in P$. Consider the power $l$ of the set of objects dominated by object $x_i$ and belonging to the foreign class:

$$l(x_i) = \begin{cases} \#\{x_j \mid x_i \succ_n x_j,\ j \in P\}, & \text{if } i \in N, \\ \#\{x_j \mid x_i \succ_p x_j,\ j \in N\}, & \text{if } i \in P, \end{cases}$$

where $\#$ denotes the power of the set. To find the set $\hat{I}$, we consecutively eliminate defective objects from the entire sample $D$:

1: $\hat{I} := I$; $\hat{P} := P$; $\hat{N} := N$; {initialization}
2: while the sample has objects $x_i$, $i \in \hat{I}$, such that $l(x_i) > 0$ do
3:   $\hat{i} := \arg\max_{i \in \hat{I}} l(x_i)$;
4:   $\hat{I} := \hat{I} \setminus \{\hat{i}\}$;
5:   if $\hat{i} \in \hat{P}$ then $\hat{P} := \hat{P} \setminus \{\hat{i}\}$;
6:   if $\hat{i} \in \hat{N}$ then $\hat{N} := \hat{N} \setminus \{\hat{i}\}$;
7: end while
8: return $\hat{I} = \hat{P} \sqcup \hat{N}$.

Fig. 4 shows a model sample set whose objects are described by two features. The sample consists of two classes (green triangles and blue squares). Fig. 4(a) shows the sample with defective objects at coordinates (1, 1), (5, 4), and (8, 6); these objects dominate objects in the opposite class and are indicated by red circles. Fig. 4(b) shows the separable subsample obtained once the defective objects have been eliminated.

[Fig. 4. Eliminating defective objects.]
4. Ordinal classification
4.1. Ordinal classifier construction
Consider the general case of the problem:

$$Y = \{l_1, \ldots, l_u, l_{u+1}, \ldots, l_Y\}, \qquad l_1 \prec \cdots \prec l_u \prec l_{u+1} \prec \cdots \prec l_Y.$$

We denote the class label indices by $\{1, \ldots, u, u+1, \ldots, Y\}$. The Pareto two-class classifier can be constructed as

$$f_{u,u+1}: x \mapsto \hat{y} \in \{0, 1\}, \qquad x \in X,$$

for each pair of adjacent classes $u$, $u+1$. To construct the two-class classifier, we split the sample into two parts, objects with class labels $\preceq l_u$ and objects with class labels $\succeq l_{u+1}$:

$$\hat{I} = N_u \sqcup P_{u+1},$$

where $n \in N_u$ if $y_n \preceq l_u$, and $p \in P_{u+1}$ if $y_p \succeq l_{u+1}$.

The ordinal classifier

$$\varphi(x) = \varphi(f_{1,2}, \ldots, f_{Y-1,Y})(x), \qquad \varphi: X \to Y,$$

is defined as follows:

$$\varphi(x) = \begin{cases} \min_{l_u \in Y} \{l_u \mid f_{u,u+1}(x) = 0\}, & \text{if } \{l_u \mid f_{u,u+1}(x) = 0\} \neq \emptyset, \\ l_Y, & \text{if } \{l_u \mid f_{u,u+1}(x) = 0\} = \emptyset. \end{cases} \qquad (6)$$

Table 4 illustrates the application of Eq. (6). The output of the ordinal classifier $\varphi(x)$ is the label $l_u$ of the first adjacent-class pair whose classifier $f_{u,u+1}$ equals 0; if every classifier $f_{u,u+1}$ outputs 1, we assign the label $l_Y$.

Table 4. Monotone classifier illustration.

| Pair $u, u+1$  | 1, 2 | ... | $u-1, u$ | $u, u+1$ | ... | $Y-1, Y$ |
|----------------|------|-----|----------|----------|-----|----------|
| $f_{u,u+1}(x)$ | 1    | ... | 1        | 0        | ... | 0        |

Fig. 5 shows an example for a set of objects from three different classes. The axes denote the feature values describing the objects. The objects are indicated by red circles, green triangles and blue squares. The class boundaries corresponding to the n-fronts are indicated by dotted lines, and solid lines indicate the p-fronts. The classified objects are shown by black circles.

[Fig. 5. Pareto front, feature 1 is preferable to feature 2.]
Table 5 shows an example of the set of two-class classifiers $f_{1,2}$, $f_{2,3}$ included in the ordinal classifier $\varphi(x)$ for the illustrated sample. The first column contains the object numbers, the second their coordinates, and the third and fourth columns contain the two-class classification results for the adjacent classes. A label of 0 in the third column means that the classifier $f_{1,2}$ assigned the object to the first class, and a label of 0 in the fourth column means that the classifier $f_{2,3}$ assigned the object to the second class. The final column contains the result of ordinal classification, the output of the ordinal classifier.

Table 5. Ordinal classifier example.

| Number | Object, x | $f_{1,2}(x)$ | $f_{2,3}(x)$ | $\varphi(x)$ |
|--------|-----------|--------------|--------------|--------------|
| 1      | (1, 1)    | 0            | 0            | 1            |
| 2      | (5, 4)    | 1            | 0            | 2            |
| 3      | (9, 9)    | 1            | 1            | 3            |
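Rule (6) amounts to scanning the adjacent-pair classifiers in order; a minimal sketch, where the callables in `two_class_fs` are assumed to wrap the two-class classifier above for each pair of adjacent classes:

```python
def ordinal_classify(x, two_class_fs, labels):
    """Ordinal classifier of Eq. (6). labels is the ordered list
    [l_1, ..., l_Y]; two_class_fs[u] implements f_{u,u+1}.
    Return the first label whose adjacent-pair classifier outputs 0,
    or l_Y if all of them output 1."""
    for u, f in enumerate(two_class_fs):
        if f(x) == 0:
            return labels[u]
    return labels[-1]
```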
4.2. Extension of Pareto front definition for ordinal classification

To construct the fronts between the classes labeled $l_u$ and $l_{u+1}$, we use objects with class labels $l_1, \ldots, l_u$ to construct an n-front for class $l_u$, and objects with class labels $l_{u+1}, \ldots, l_Y$ to construct a p-front for class $l_{u+1}$. Therefore, the same objects belong to fronts for different classes, and a front for one class contains objects of different classes. Fig. 6 illustrates a sample containing three-class model data. Red circles denote objects in the first class, and green triangles denote those in the second. The figure shows that object (7, 2) from the first class belongs to the n-fronts of both the first and second classes. This implies that the definition of the n-front should be extended to objects whose class label is not greater than the n-front class label; likewise, the definition of the p-front should be extended to objects whose class label is not less than the p-front class label.

[Fig. 6. An example of a common object for the two fronts.]
4.3. Admissible classifiers
In this section, we introduce the important property of admissibility for an ordinal classifier and prove a theorem about the admissibility conditions of the constructed classifier $\varphi$.

Definition 3. The classifier $\varphi$ in Eq. (6) is said to be admissible if, for each function $f_{u,u+1}$, the transitivity condition holds:

$$\begin{cases} \text{if } f_{u,u+1}(x) = 0, & \text{then } f_{(u+s)(u+1+s)}(x) = 0 \text{ for each } s: (u+1+s) \le Y, \\ \text{if } f_{u,u+1}(x) = 1, & \text{then } f_{(u-s)(u+1-s)}(x) = 1 \text{ for each } s: (u-s) \ge 1. \end{cases} \qquad (7)$$

Definition 4. We say that the Pareto fronts $POF_n(u)$ and $POF_p(u+1)$ do not intersect,

$$POF_n(u) \cap POF_p(u+1) = \emptyset,$$

if the boundaries of their dominance spaces do not intersect. Fig. 2 shows an example of non-intersecting Pareto fronts.

Theorem 1. If the Pareto fronts do not intersect,

$$POF_n(u) \cap POF_p(u+1) = \emptyset, \qquad u = 1, \ldots, Y-1,$$

then the transitivity condition (7) holds for any classified object.

Proof. We prove the theorem for the case $Y = 3$; for more classes, the proof is similar. Suppose that the Pareto fronts do not intersect,

$$POF_n(u) \cap POF_p(u+1) = \emptyset, \qquad u = 1, 2,$$

and that there exists an object $x$ such that the transitivity condition does not hold,

$$f_{1,2}(x) = 0, \qquad f_{2,3}(x) = 1.$$

(The case $f_{1,2}(x) = 1$, $f_{2,3}(x) = 0$ is similar.) The result $f_{1,2}(x) = 0$ can be obtained if one of the two following conditions holds:

1. $\exists y \in POF_n(1): y \succ_n x$. If $y \in POF_n(2)$, then it follows that $f_{2,3}(x) = 0$, which contradicts the assumption that $f_{2,3}(x) = 1$. If $y \notin POF_n(2)$, then

$$\exists w \in POF_n(2): \quad w \succ_n y,$$

and it follows that

$$POF_n(2) \ni w \succ_n y \succ_n x \quad \Rightarrow \quad f_{2,3}(x) = 0.$$

2. The fronts $POF_n(1)$ and $POF_p(2)$ do not dominate the object $x$. In this case,

$$\exists y' \in POF_n(1), \text{ such that } y' = \arg\min_{y \in POF_n(1) \cup POF_p(2)} \rho(x, y),$$

where $\rho$ is the distance function (5). The assumption $f_{2,3}(x) = 1$ holds in one of two possible cases.

(a) $\exists t \in POF_p(3): t \succ_p x$. The object $t$ does not belong to $POF_p(2)$, because $POF_p(2)$ does not dominate $x$. Then it follows that

$$\exists t' \in POF_p(2): \quad t' \succ_p t.$$

Therefore, we obtain a chain of domination inequalities,

$$t' \succ_p t \succ_p x,$$

and it follows that $t' \succ_p x$. This contradicts the assumption that the front $POF_p(2)$ does not dominate the object $x$.

(b) The object $x$ is not dominated by the fronts $POF_n(2)$ and $POF_p(3)$. In this case, the object $x$ is not dominated by any front $POF_n(u)$, $POF_p(u+1)$ with $u = 1, 2$. Note that there exists an object $y_1 \in POF_n(1)$ whose dominance space boundary contains the point $y'$, where $y'$ is the nearest point to $x$ in the sense of the Hamming distance (5). The object $y_1$ can belong to the front $POF_n(2)$; in this case, the distance between $x$ and $POF_n(2)$ is not greater than the distance between $x$ and $POF_n(1)$. (Here, the distance between a point and a front means the distance between the point and the nearest point of the front.) If the object $y_1$ does not belong to the front $POF_n(2)$, there exists an object

$$y_2 \in POF_n(2): \quad y_2 \succ_n y_1.$$

However, $y_2 \nsucc_n x$, because the object $x$ is not dominated by any front. Hence, it follows that the distance between the object $x$ and the front $POF_n(2)$ is not greater than the distance between the object $x$ and the point $y'$ on the front $POF_n(1)$.

The proof for the pair of fronts $POF_p(2)$, $POF_p(3)$ is similar. There exists an object $w_1 \in POF_p(2)$ such that the boundary of its dominance space contains a point $w'$, where $w'$ is the nearest point to $x$ in the sense of the Hamming distance (5). The distance between the object $x$ and the front $POF_p(3)$ is not less than the distance between $x$ and $w' \in POF_p(2)$.

We have proved that the distance between $x$ and $POF_n(2)$ is not greater than the distance between $x$ and $POF_n(1)$, and the distance between $x$ and $POF_p(3)$ is not less than the distance between $x$ and $POF_p(2)$. From $f_{1,2}(x) = 0$, it follows that the distance between $x$ and $POF_n(1)$ is less than the distance between $x$ and $POF_p(2)$. Thus, $x$ is nearer to $POF_n(2)$ than to $POF_p(3)$. This contradicts the assumption that $f_{2,3}(x) = 1$, and concludes the proof. □
Given that the method of Pareto fronts uses only separable samples, all fronts are disjoint. Therefore the ordinal classifier (6) is admissible and the transitivity condition (7) holds for any classified object.
4.4. Computational cost
To estimate the computational cost, we must compute the cost for the basic stages of the algorithm: the two-object comparison, the Pareto front construction, the separable sample construction, and the ordinal-class case.
1. The cost of a two-object comparison is $O(n)$, where $n$ is the number of features.
2. The Pareto front construction method involves comparison of all pairs of objects. For balanced class sizes $O(m)$, the comparison costs $O(m^2 n)$ operations, where $m$ is the total number of objects. After the pairwise comparison, the method finds the non-dominated objects using $O(m^2)$ operations.
3. The separable sample construction procedure involves a similar comparison of all object pairs between the different classes, which also costs $O(m^2 n)$ operations for balanced class sizes. The iterative procedure of eliminating the defective objects is also quadratic due to the memorization of all dominant elements for each object; however, this memorization makes the memory cost quadratic as well.
4. In the ordinal-class case the total complexity is multiplied by the number of classes, so that the final complexity estimate is $O(m^2 n K)$, where $K$ is the number of classes.
We see that the Pareto front construction procedure is quite costly, and one of our further directions is to reduce this cost. A possible solution is to use the class transitivity property to construct the Pareto fronts for all classes together, eliminating the multiplier $K$ from the complexity estimate.
5. Experimental results
5.1. Benchmark datasets
We verify the proposed method on several benchmark datasets. To test the method on ordinal data, we apply an additional monotone feature transformation to each dataset. We compare the method with two ordinal classification algorithms and one method of classification with monotonicity constraints.

We used the following datasets from the UCI repository: Pyrimidines, Machine CPU, Housing, Computer Activity, Abalone and Car. All datasets except the last one correspond to regression problems, so we discretized the target variable into five levels containing equal numbers of objects. The experiment scheme duplicates the one from Chu and Sathiya Keerthi (2007). We randomly partitioned each dataset into training and test splits, as shown in Table 6. The partitioning was repeated 100 times independently. To measure quality, we estimated the mean zero-one loss and the mean absolute loss on the test datasets.

For comparison we used two classification algorithms, the decision tree J48 (Trees) and the support vector machine with a linear kernel (SVM), combined with the ordinal classification scheme from Frank and Hall (2001). Furthermore, we used the nearest-neighbor classification method with monotonicity constraints (KNN) from Duivesteijn and Feelders (2008).

The results for the five original datasets (all except "Cars") are given in Table 7. The bolded numbers indicate that the corresponding method was statistically significantly better than the others. We see that the ordinal SVM outperformed all other methods due to the linear nature of the features in the considered datasets.

To investigate the method's properties on ordinal-scaled data, we applied the ordinal transformation to the dataset features: as for the target variable, we discretized all features into five levels containing equal numbers of objects. The results for the transformed datasets (and for "Cars", whose features were initially ordinal) are given in Table 8.
Table 6. Description of the datasets.

| Dataset     | #Features | #Objects | Training/Test |
|-------------|-----------|----------|---------------|
| Pyrimidines | 27        | 74       | 50/24         |
| MachineCPU  | 6         | 209      | 150/59        |
| Boston      | 13        | 506      | 300/206       |
| Computer    | 21        | 8182     | 4000/4182     |
| Abalone     | 8         | 4177     | 1000/3177     |
| Cars        | 6         | 1728     | 1000/728      |
| RedBook     | 101       | 102      | 100/1         |
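The equal-frequency discretization described above can be sketched in a few lines; this is our illustration of the five-level binning, not the authors' code:

```python
import numpy as np

def equal_frequency_levels(values: np.ndarray, k: int = 5) -> np.ndarray:
    """Discretize a real-valued column into k ordinal levels 1..k,
    each containing approximately the same number of objects."""
    ranks = values.argsort().argsort()       # ranks 0..m-1, ties broken arbitrarily
    return ranks * k // len(values) + 1
```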
Table 7. Experimental results: ordinal target, linear features. The first four columns report the mean zero-one loss (±0.01), the last four the mean absolute loss (±0.01).

| Dataset     | SVM  | POF  | Trees | KNN  | SVM  | POF  | Trees | KNN  |
|-------------|------|------|-------|------|------|------|-------|------|
| Pyrimidines | 0.50 | 0.62 | 0.61  | 0.55 | 0.64 | 0.90 | 0.84  | 0.75 |
| MachineCPU  | 0.44 | 0.44 | 0.47  | 0.51 | 0.53 | 0.53 | 0.53  | 0.61 |
| Boston      | 0.38 | 0.48 | 0.41  | 0.47 | 0.46 | 0.65 | 0.47  | 0.62 |
| Computer    | 0.32 | 0.71 | 0.38  | 0.60 | 0.35 | 1.36 | 0.41  | 0.90 |
| Abalone     | 0.53 | 0.59 | 0.57  | 0.60 | 0.78 | 0.92 | 0.77  | 0.88 |
Table 8. Experimental results: ordinal target, ordinal features. The first four columns report the mean zero-one loss (±0.01), the last four the mean absolute loss (±0.01).

| Dataset     | SVM  | POF  | Trees | KNN  | SVM  | POF  | Trees | KNN  |
|-------------|------|------|-------|------|------|------|-------|------|
| Pyrimidines | 0.57 | 0.58 | 0.60  | 0.61 | 0.71 | 0.77 | 0.79  | 0.76 |
| MachineCPU  | 0.51 | 0.39 | 0.47  | 0.43 | 0.65 | 0.45 | 0.56  | 0.51 |
| Boston      | 0.40 | 0.48 | 0.40  | 0.41 | 0.49 | 0.68 | 0.46  | 0.51 |
| Computer    | 0.44 | 0.69 | 0.41  | 0.45 | 0.53 | 1.38 | 0.45  | 0.55 |
| Abalone     | 0.78 | 0.59 | 0.58  | 0.59 | 1.78 | 0.92 | 0.76  | 0.89 |
| Cars        | 0.19 | 0.19 | 0.08  | 0.06 | 0.24 | 0.26 | 0.08  | 0.07 |
| RedBook     | 0.66 | 0.47 | 0.48  | 0.62 | 0.85 | 0.60 | 0.52  | 0.79 |
The SVM method is significantly worse due to the linearity of the kernel, while the Trees and POF methods demonstrate sufficiently good results. Another interesting observation is that the Trees method outperforms the other methods according to the mean absolute loss.

5.2. IUCN Red List dataset

The results for the IUCN Red List categorization are shown in the last row of Table 8. Unlike the other datasets, we used the leave-one-out partitioning scheme. The POF and Trees methods demonstrate the best results according to the zero-one loss.

Fig. 7 compares the categories computed by the POF algorithm using the leave-one-out results. The x-axis denotes the computed categories, and the y-axis represents the categories determined by the experts. The radius of each point is proportional to the number of objects with the corresponding computed and expert categories. For a significant number of objects, the computed category was the same as that assigned by the experts.

[Fig. 7. Comparison of the computed and expert estimated categories.]

6. Summary and further research

We have proposed an ordinal classification method based on the construction of Pareto fronts, which describe the boundaries between ordinal classes. The ordinal classifier is constructed as a superposition of two-class Pareto classifiers. Our algorithm was compared with several well-known algorithms and demonstrated adequate results; it was used to solve the IUCN Red List categorization problem. The main advantage of the method is the simplicity and interpretability of the obtained models. At the same time, the computational experiments showed that the obtained models achieve prediction quality comparable with state-of-the-art ordinal classification methods.

Further investigations will be devoted to extending the scope of the proposed algorithm. The classification method will be extended to the case of partial orders defined over the set of features and over the set of feature values. From a practical point of view, the partial-order case corresponds to incomplete information given by the experts. To describe the partial orders, we will use the idea of a partial order cone proposed in Kuznetsov and Strijov (2014). Another way to improve the proposed classification algorithm is to take into account expert information about feature preferences. A preference relation defined over the set of features allows us to restrict the set of admissible models and to achieve better prediction quality.

References

Ben-David, Arie, Sterling, Leon, & Pao, Yoh-Han (1989). Learning and classification of monotonic ordinal concepts. Computational Intelligence, 5(1), 45–49.
Chu, Wei, & Sathiya Keerthi, S. (2007). Support vector ordinal regression. Neural Computation, 19(3), 792–815.
Duivesteijn, Wouter, & Feelders, Ad (2008). Nearest neighbour classification with monotonicity constraints. In Walter Daelemans, Bart Goethals, & Katharina Morik (Eds.), Machine learning and knowledge discovery in databases. Lecture notes in computer science (Vol. 5211, pp. 301–316). Berlin Heidelberg: Springer.
Frank, Eibe, & Hall, Mark (2001). A simple approach to ordinal classification. Springer.
Freund, Yoav, Iyer, Raj, Schapire, Robert E., & Singer, Yoram (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Furnkranz, Johannes, & Hullermeier, Eyke (2003). Pairwise preference learning and ranking. Machine Learning: ECML 2003, 145–156.
Har-Peled, S., Roth, D., & Zimak, D. (2003). Constraint classification for multiclass classification and ranking. In NIPS (pp. 785–792).
Kotlowski, Wojciech, & Slowinski, Roman (2009). Rule learning with monotonicity constraints. In Proceedings of the 26th annual international conference on machine learning, ICML '09 (pp. 537–544). New York, NY, USA.
Kuznetsov, M. P., & Strijov, V. V. (2014). Methods of expert estimations concordance for integral quality estimation. Expert Systems with Applications, 41(4), 1988–1996.
Nogin, V. D. (2003). The Edgeworth–Pareto principle and relative importance of criteria in the case of a fuzzy preference relation. Computational Mathematics and Mathematical Physics, 43(11), 1666–1676.
Strijov, Vadim, Granic, Goran, Juric, Jeljko, Jelavic, Branka, & Maricic, Sandra Antecevic (2011). Integral indicator of ecological impact of the Croatian thermal power plants. Energy, 36(7), 4144–4149.
Xia, Fen, Liu, Tie-Yan, Wang, Jue, Zhang, Wensheng, & Li, Hang (2008). Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th international conference on machine learning, ICML '08 (pp. 1192–1199). New York, NY, USA: ACM.
Yue, Yisong, Finley, Thomas, Radlinski, Filip, & Joachims, Thorsten (2007). A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '07 (pp. 271–278). New York, NY, USA: ACM.