Indices for rough set approximation and the application to confusion matrices

International Journal of Approximate Reasoning

www.elsevier.com/locate/ijar

Ivo Düntsch a,b,1, Günther Gediga c,1

a College of Maths. Informatics, Fujian Normal University, Fuzhou, China
b Dept. of Computer Science, Brock University, St Catharines, Canada
c Institut für Evaluation und Marktanalysen, Brinkstr. 19, 49143 Jeggen, Germany

Article history: Received 18 July 2019; received in revised form 2 November 2019; accepted 15 December 2019; available online xxxx.

Keywords: Rough set approximation; Confusion matrix; Precision indices; Odds ratios; Error estimation

Abstract

Confusion matrices and their associated statistics are a well established tool in machine learning to evaluate the accuracy of a classifier. In the present study, we define a rough confusion matrix based on a very general classifier, and derive various statistics from it which are related to common rough set estimators. In other words, we perform a rough set-like analysis on a confusion matrix, which is the converse of the usual procedure; in particular, we consider upper approximations. A suitable index for measuring the tightness of the upper bound uses a ratio of odds. Odds ratios offer a symmetric interpretation of lower and upper precision, and remove the bias in the upper approximation. We investigate rough odds ratios of the parameters obtained from the confusion matrix; to guard against undue random influences, we also approximate their standard errors.

© 2019 Elsevier Inc. All rights reserved.


1. Introduction


The philosophy of rough sets is based on the assumption that knowledge about the world depends on the granularity of representation [1]. Rough set methods are concerned with the approximation of sets (whose extent may be unknown) by suitable functions, and measured by associated indices which are defined by the data at hand, in particular, frequency measures such as counting.

Confusion matrices are a frequently used tool in machine learning to evaluate the prediction quality of a classifier such as neural networks, tree-like classifiers, discriminant analysis, Bayesian methods, support vector machines, and many more, including classifiers based on rough sets, see e.g. [2–4]. Usually, confusion matrices are analyzed by statistical techniques, even if the method to derive rules to predict outcomes is based on rough set technology. Nevertheless, a rough set based interpretation of the "confusion" is directly possible, if the information about the granules used for the prediction process is given. If there are granules for which the predicted class equals the class given by, say, a gold standard, we may interpret this as a lower bound of the approximation of the class given by the gold standard. Any granule which intersects the class given by the gold standard in at least one element contributes to the upper bound of the gold standard class. This results in a simple application of the rough set model.


E-mail addresses: [email protected] (I. Düntsch), [email protected] (G. Gediga). The ordering of authors is alphabetical and equal authorship is implied.

https://doi.org/10.1016/j.ijar.2019.12.008 0888-613X/© 2019 Elsevier Inc. All rights reserved.


Table 1
The granule frequency matrix.

Granule      Y_1    ...  Y_i    ...  Y_k    Granule size
X_1          c_11   ...  c_1i   ...  c_1k   c_1
...          ...    ...  ...    ...  ...    ...
X_j          c_j1   ...  c_ji   ...  c_jk   c_j
...          ...    ...  ...    ...  ...    ...
X_m          c_m1   ...  c_mi   ...  c_mk   c_m
Class size   n_1    ...  n_i    ...  n_k    n

In this paper we shall explore how confusion matrices fit into the rough set approach, if the information about the granularity of the prediction process is unknown. The reason for discarding the basic information is twofold:

1. It is a standard procedure to publish confusion matrices to demonstrate the validity of a data model.
2. Other data analysis techniques do not use the information granules directly, but only summary statistics. In this case information about the granularity does not describe the prediction process.

Particular attention will be given to investigate how we can use a suitably defined rough confusion matrix to approximate the accuracy α_i of rough approximation and the rough approximation quality γ. To measure relationships among the estimates, we introduce rough odds ratios and show how they are related to the previously defined concepts. An error theory for rough odds ratios is also provided.

This paper may be regarded as a continuation of our earlier [5] in which we have investigated confidence bounds for standard rough set approximation quality and accuracy, and of [6], where we have proposed a parameter free alternative to the variable precision model [7] in the context of a variant of confusion matrices. Some material of Section 4 is taken from the preliminary [8].

2. Basic rough set concepts

Mathematically, granularity may be expressed by an equivalence relation θ on a nonempty finite set U, up to the classes of which membership in a subset of U can be determined. Thus, the knowledge we have are the classes X, Y of two equivalence relations, and our aim is to approximate the classes in Y by the classes in X. We assume that the reader has a working knowledge how to obtain predictor classes and decision classes from a fixed decision system, see e.g. [9] or [10]. Several operators are defined on 2^U in the following way [9]: Let P = {X_1, ..., X_m} be the set of classes of an equivalence relation θ. If Y ⊆ U, then

low(Y) = ⋃{Z ∈ P : Z ⊆ Y},        Lower approximation,    (2.1)
upp(Y) = ⋃{Z ∈ P : Z ∩ Y ≠ ∅},    Upper approximation,    (2.2)
bnd(Y) = upp(Y) \ low(Y),          Boundary.               (2.3)

Note that

U \ upp(Y) = low(U \ Y).    (2.4)

Throughout, we suppose that Y = {Y_1, ..., Y_k} is the set of decision classes, also called descriptor classes, and that lower and upper approximations are taken with respect to θ for a fixed equivalence relation θ, having the set of predictor classes X = {X_1, ..., X_m}; predictor classes are also called granules. The aim is to approximate the classes of Y by the granules in X. To avoid trivialities we assume that |U| ≥ 2, and that there are at least two decision classes, i.e. k > 1. We use n = |U| for the number of observations, and set

n_i = |Y_i|,  nl_i = |low(Y_i)|,  nu_i = |upp(Y_i)|,  nb_i = |bnd(Y_i)| = nu_i − nl_i.

Since both X and Y are partitions of U, we can build the granule frequency matrix with marginal sums shown in Table 1; there, c_ji = |X_j ∩ Y_i|. Note that a granule frequency matrix is a contingency table, see e.g. [11].
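The operators (2.1)–(2.3) and the duality (2.4) are easy to state in code. A minimal sketch in Python (the universe, the partition, and the set Y below are invented for illustration):

```python
def low(P, Y):
    """Lower approximation (2.1): union of all granules contained in Y."""
    return {x for Z in P if Z <= Y for x in Z}

def upp(P, Y):
    """Upper approximation (2.2): union of all granules meeting Y."""
    return {x for Z in P if Z & Y for x in Z}

def bnd(P, Y):
    """Boundary (2.3)."""
    return upp(P, Y) - low(P, Y)

U = {1, 2, 3, 4, 5, 6}
P = [{1, 2}, {3}, {4, 5}, {6}]   # classes of an equivalence relation on U
Y = {1, 2, 3, 4}

assert low(P, Y) == {1, 2, 3}
assert upp(P, Y) == {1, 2, 3, 4, 5}
assert bnd(P, Y) == {4, 5}
assert U - upp(P, Y) == low(P, U - Y)   # the duality (2.4)
```

The last assertion checks (2.4): the complement of the upper approximation of Y is the lower approximation of the complement of Y.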

Consider the vector X̂_j = ⟨c_ji : 1 ≤ i ≤ k⟩ belonging to granule X_j. Since Y is a partition of U and X_j ≠ ∅, we have c_ji ≠ 0 for at least one 1 ≤ i ≤ k. If X̂_j contains only one non-zero entry, say, c_ji, then X_j ⊆ Y_i, and we call the granule deterministic; otherwise the granule is called indeterministic. The granule X_j is said to belong to the upper bound upp(Y_i) of the decision class Y_i if c_ji ≠ 0. Thus, we obtain

nu_i = Σ_{j=1}^{m} Ind(c_ji) · c_j.

Here, Ind is the indicator function defined by

Ind(a) = 0, if a = 0;  1, otherwise.    (2.5)

For the basic philosophy and tools of rough set data analysis the reader is invited to consult [9,10]. For recent developments and more advanced methods the overview [12] is an excellent source.

3. Approximation and precision indices

Various indices play a role in rough set theory, such as accuracy α, approximation quality γ and R-roughness ρ [9, Ch 2.6]. We suggest to distinguish between precision indices and approximation indices: A precision index measures the precision of membership in a class Y_i, and may contain in its simplest form the cardinality n_i of Y_i. A (rough) approximation index is related to the approximation of some set, usually a decision class, by the classes of X, and contains only one or more of nl_i, nu_i or n. This is meaningful in the context of rough sets, since it can happen that the rough set ⟨low(X), upp(X)⟩ equals the rough set ⟨low(Y), upp(Y)⟩, and |X| ≠ |Y|. Thus, in a rough set measure the cardinality of Y_i is not directly involved – indeed, the parameters nl_i, nu_i or n do not determine n_i = |Y_i| so that, in some sense, an approximation index may be called context free – or, more precisely, decision context free, as the context of the granules may vary. It may also be noted that an approximation index does not involve the complement of a decision class, since a rough set – up to which we know the world – does not generally have a complement.

Traditionally, rough statistics are the local statistic for a decision class Y_i, defined by

α_i = nl_i / nu_i,    Accuracy of approximation of Y_i,    (3.1)

and the global statistic

γ = (1/n) · Σ_{i=1}^{k} nl_i,    Quality of approximation of Y by X.    (3.2)

We observe that α_i depends on the approximation of −Y_i as well, since

α_i = nl_i / nu_i = nl_i / (n − |low(−Y_i)|).

In [5] we have introduced two indices, namely, p^low_i, the lower precision, and p^upp_i, the upper precision of determining membership in Y_i, and have investigated some of their properties; these will play a major role in our subsequent considerations. Given a decision class Y_i, the relative precision of deterministic membership in Y_i is defined by

p^low_i = nl_i / n_i = 1 − (n_i − nl_i) / n_i,    (3.3)

i.e. p^low_i is the percentage of correctly predicted elements of Y_i relative to n_i. Then, 1 − p^low_i is the percentage of elements of Y_i which cannot be predicted by the deterministic rules. Clearly, 0 ≤ p^low_i ≤ 1, and

• p^low_i = 0 if and only if the lower bound of Y_i is empty,
• p^low_i = 1 if and only if Y_i is a deterministic class, i.e. n_i = nl_i = nu_i.

We note in passing that the precision indices require complete knowledge of Y_i, while α_i is a "true" rough set measure, as it involves only the approximations.
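A quick numerical check of (3.1)–(3.3) in Python (the counts nl, nu, ni and n below are invented for illustration):

```python
# invented counts for two decision classes in a universe of n = 100 elements
n = 100
nl = [15, 40]    # nl_i = |low(Y_i)|
nu = [45, 60]    # nu_i = |upp(Y_i)|
ni = [30, 70]    # n_i  = |Y_i|

alpha = [l / u for l, u in zip(nl, nu)]    # accuracy (3.1)
gamma = sum(nl) / n                        # approximation quality (3.2)
p_low = [l / c for l, c in zip(nl, ni)]    # lower precision (3.3)

assert alpha == [1 / 3, 2 / 3]
assert gamma == 0.55
assert p_low == [0.5, 4 / 7]
```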

Table 2
Results of a rough set analysis.

        Y_1   Y_2   Y_3   Sum
n_i      30    50   120   200
nl_i     15    40    90   145
nu_i     45    60   150     -

Table 3
Indices for the system of Table 2.

          Y_1     Y_2     Y_3
p^low_i   0.5     0.8     0.75
p^upp_i   0.667   0.833   0.8
α_i       0.333   0.667   0.6
γ         0.725
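Table 3 can be recomputed from the counts of Table 2; a short check in Python:

```python
# counts of Table 2
n = 200
ni = [30, 50, 120]    # class sizes of Y_1, Y_2, Y_3
nl = [15, 40, 90]     # lower bound sizes
nu = [45, 60, 150]    # upper bound sizes

p_low = [l / c for l, c in zip(nl, ni)]    # (3.3)
p_upp = [c / u for c, u in zip(ni, nu)]    # upper precision
alpha = [l / u for l, u in zip(nl, nu)]    # (3.1)
gamma = sum(nl) / n                        # (3.2)

assert [round(x, 3) for x in p_low] == [0.5, 0.8, 0.75]
assert [round(x, 3) for x in p_upp] == [0.667, 0.833, 0.8]
assert [round(x, 3) for x in alpha] == [0.333, 0.667, 0.6]
assert gamma == 0.725
```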

Since γ = Σ_{i=1}^{k} (n_i/n) · (nl_i/n_i), we see that γ is a weighted sum of the p^low_i values. Similarly, we define a precision index for an arbitrary x being in the upper approximation of Y_i by

p^upp_i = n_i / nu_i = 1 − (nu_i − n_i) / nu_i,    (3.4)

which is the percentage of elements possibly in Y_i captured by indeterministic rules. By (2.4) and the fact that Y_1, ..., Y_k partition U, we also have

p^upp_i = n_i / (n − Σ_{k≠i} nl_k).    (3.5)

Clearly, 1 − p^upp_i is the percentage of those elements predicted not to be possibly in Y_i by indeterministic rules. Note that

n_i/n ≤ p^upp_i ≤ 1,    (3.6)

and

p^upp_i = 1 ⟺ n_i/nu_i = 1 ⟺ n_i = nu_i ⟺ nl_i = nu_i ⟺ p^low_i = 1.    (3.7)

The reciprocal 1/p^upp_i of p^upp_i is of interest as well, since 1/p^upp_i − 1 is the percentage of extra elements in the upper bound with respect to the size n_i of Y_i;

1/p^upp_i    (3.8)

may be called an overshoot factor (OSF).

Example 1. As a running example for basic rough set and precision parameters we shall use an example from [5] shown in Tables 2 and 3. □

Empirically one can observe that in many cases p^low_i is much smaller than p^upp_i.

4. Confusion matrices

Recall that Y = {Y_1, ..., Y_k} is a set of nonempty disjoint decision classes. A classifier is a mapping f : U → Y, where f(x) is interpreted as the decision class in which x is predicted to be. This leads to a set Ŷ_1, ..., Ŷ_i, ..., Ŷ_k of disjoint predictor classes, where

Ŷ_i = {x ∈ U : f(x) = Y_i} = f^{−1}(Y_i).    (4.1)

We may think of the sets Y_i as classes prescribed by a gold standard such as a classification by an expert, and of the sets Ŷ_i as observed values predicted to be in Y_i by the classifier f. True and predicted values may be cross-tabulated in a confusion matrix or error matrix. Perhaps the earliest occurrence of confusion matrices is in Pearson [13], who calls them contingency tables; see also [14,15] for early uses of a confusion matrix. In the matrix, the class names serve as column labels, respectively, row labels of a (k, k) matrix: An entry ⟨Ŷ_i, Y_j⟩ in the matrix is the number of elements of Y_j which are predicted to be in Y_i. The simplest confusion matrix has type (2, 2).
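Tallying a confusion matrix in the sense of (4.1) takes only a few lines; a minimal sketch in Python (the toy universe and the two mappings are invented for illustration):

```python
def confusion_matrix(U, truth, pred, k):
    """N[i][j] = |Yhat_i ∩ Y_j|: elements of true class j predicted as class i."""
    N = [[0] * k for _ in range(k)]
    for x in U:
        N[pred[x]][truth[x]] += 1
    return N

truth = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # gold standard classes
pred  = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0}   # classifier output
N = confusion_matrix(range(6), truth, pred, 2)

assert N == [[2, 1], [1, 2]]
assert sum(N[j][j] for j in range(2)) == 4     # correctly classified elements
```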

In this case, there are two decision classes P and N, and the entries in the table are called true positives, false positives, true negatives, or false negatives:

                  True value
Predicted value   P                     N
P̂                 True Positive (TP)    False Positive (FP)
N̂                 False Negative (FN)   True Negative (TN)

Confusion matrices are a common tool in machine learning to gauge the quality of a classifier [16], and for an introduction to confusion matrices and associated topics we invite the reader to consult [17]. Many statistics are associated with a confusion matrix, such as sensitivity TP/(TP+FN), specificity TN/(TN+FP), precision TP/(FP+TP), and the area AUC under the ROC curve which shows the dependence of sensitivity on 1 − specificity. The AUC is a widely used measure for the quality of supervised classification and diagnostic rules; it is, however, not without pitfalls, as Hand [18] demonstrates. Confusion matrices have been used in rough set theory, for example, in [19–24]. We have investigated sensitivity and specificity in the rough set context in [6], and in the context of concept lattices in [25]. In all previous work, a rough set classifier is constructed in the first instance along with the usual indices such as α and γ, and its classification quality is estimated by the usual statistics of the confusion matrix. Our proposed procedure is, in some sense, the reverse: Using a confusion matrix obtained from a general classifier, the only restriction of which is that it is compatible with a fixed equivalence relation, we consider rough set indices obtained from the confusion matrix. These are approximations of e.g. α and γ which we call α*, γ* etc.

A general confusion matrix has the form shown in Table 4. There, the entry n_ij is the number of elements of class Y_j which are predicted to be in class Y_i, i.e. n_ij = |Ŷ_i ∩ Y_j|. Thus, Σ_{j=1}^{k} n_jj is the number of correctly classified elements according to the classifier f.

Table 4
A general confusion matrix.

                  True value
Predicted value   Y_1    ...  Y_j    ...  Y_k    Sum
Ŷ_1               n_11   ...  n_1j   ...  n_1k   n_1•
...               ...    ...  ...    ...  ...    ...
Ŷ_j               n_j1   ...  n_jj   ...  n_jk   n_j•
...               ...    ...  ...    ...  ...    ...
Ŷ_k               n_k1   ...  n_kj   ...  n_kk   n_k•
Sum               n_•1   ...  n_•j   ...  n_•k   n

Example 2. As our running example for a confusion matrix, we have chosen data obtained from an interpreter of aerial photographs, shown in Table 5. In the experiment, experts were asked to identify types of trees in Yosemite Valley, California, USA. The data were reported by Congalton and Mead [26], who credit them to Lauer et al. [27]. □

Table 5
An example confusion matrix [26].

                 Y_1 Pine   Y_2 Cedar   Y_3 Oak   Y_4 Cottonwood   Total
Ŷ_1 Pine             35          4         12            2            53
Ŷ_2 Cedar            14         11          9            5            39
Ŷ_3 Oak              11          3         38           12            64
Ŷ_4 Cottonwood        1          0          4            2             7
Total                61         18         63           21           163

As a next step we shall use the rough set philosophy of indistinguishability up to equivalence classes to define a classifier. In this situation, a classifier cannot assign decision classes to elements of U, but has to act on whole equivalence classes. Given partitions X and Y, we call a function f : X → Y a rough classifier with respect to X and Y, or, simply, a classifier if X and Y are understood.
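For instance, collapsing Table 5 to class Y_1 (Pine) versus the rest gives TP = 35, FP = 53 − 35 = 18, FN = 61 − 35 = 26 and TN = 163 − 53 − 26 = 84, so the statistics named above read:

```python
def binary_stats(tp, fp, fn, tn):
    """Sensitivity, specificity and precision of a (2, 2) confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Pine vs. rest, derived from Table 5
s = binary_stats(tp=35, fp=18, fn=26, tn=84)

assert round(s["sensitivity"], 3) == 0.574   # 35 / 61
assert round(s["specificity"], 3) == 0.824   # 84 / 102
assert round(s["precision"], 3) == 0.660     # 35 / 53
```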

Table 6
The granule frequency matrix.

       Y_1   Y_2   Sum
X_1     1     1     2
X_2     0     1     1
X_3     0     1     1
X_4     2     0     2
Sum     3     3     6

The meaning of the function f is that each element of X_i is predicted to be in f(X_i). A classifier in the rough set sense is a specialization of a general classifier, where all elements of a class X_i are predicted to be in the same decision class. As in (4.1), we obtain the predictor classes Ŷ_j = f^{−1}(Y_j). In the rough set situation, each nonempty Ŷ_j is a union of equivalence classes of X. If Ŷ_j = ∅, then no element of U is predicted to be in Y_j by any class X_i using f. Arguably, the simplest classifier is a maximum classifier given by f(X_i) = Y_j, where |Y_j| = max{|Y_r| : 1 ≤ r ≤ k}. Note that such Y_j need not be unique, and a choice needs to be made. Since f is a function, the collection {f^{−1}(Y_j) : 1 ≤ j ≤ k, f^{−1}(Y_j) ≠ ∅} is a partition of X, and therefore, keeping in mind that X partitions U, it follows that

⋃⋃{f^{−1}(Y_j) : 1 ≤ j ≤ k, f^{−1}(Y_j) ≠ ∅} = U.

We are now in position to define the (rough) confusion matrix M_f of the rough classifier f. It has row labels Ŷ_i, column labels Y_j and entries

⟨Ŷ_i, Y_j⟩ = Σ{c_sj : f(X_s) = Y_i}, if f^{−1}(Y_i) ≠ ∅;  0, otherwise.    (4.2)

Thus,

n_ij = Σ_{f(X_s)=Y_i} |X_s ∩ Y_j|.    (4.3)

Given the partitions X and Y, the corresponding confusion matrix can be obtained in three steps:

1. Write the granule frequency matrix M obtained from X and Y as in Table 1.
2. Relabel the rows of M by f(X_i):

Predicted    Y_1    ...  Y_i    ...  Y_k    Sum
f(X_1)       c_11   ...  c_1i   ...  c_1k   c_1
...          ...    ...  ...    ...  ...    ...
f(X_j)       c_j1   ...  c_ji   ...  c_jk   c_j
...          ...    ...  ...    ...  ...    ...
f(X_m)       c_m1   ...  c_mi   ...  c_mk   c_m
Class size   n_1    ...  n_i    ...  n_k    n

3. Aggregate the frequencies of the rows with the same label according to (4.2). If f^{−1}(Y_j) = ∅, fill the row with label Ŷ_j with 0s. Sort the rows according to the indices of their labels.

The result will have the form shown in Table 4. Note that the column sums in Table 4 are equal to the corresponding values in Table 1, as the collection of nonempty sets Ŷ_j is a partition of U.

The procedure is illustrated with an example.

Example 3. Suppose that X = {X_1, ..., X_4} and Y = {Y_1, Y_2}. Define f : X → Y by f(X_1) = f(X_4) = Y_1, and f(X_2) = f(X_3) = Y_2. The values and the construction process are shown in Tables 6, 7, and 8. Note that f classifies five of the six elements of U correctly, so that its success ratio is 5/6, whereas γ = 4/6. □

5. Refining the rough classifier from the confusion matrix

Thus far, we have put no restrictions on the rough classifier f. In a more optimistic spirit, we shall assume that f satisfies the condition

X_i ∩ f(X_i) ≠ ∅.    (5.1)

Table 7
The relabeled matrix.

       Y_1   Y_2   Sum
Y_1     1     1     2
Y_2     0     1     1
Y_2     0     1     1
Y_1     2     0     2
Sum     3     3     6

Table 8
The confusion matrix.

       Y_1   Y_2   Sum
Ŷ_1     3     1     4
Ŷ_2     0     2     2
Sum     3     3     6
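The aggregation step of Example 3 (Tables 6–8) can be replayed in a few lines of Python:

```python
# granule frequency matrix of Table 6: rows X1..X4, columns Y1, Y2
C = [[1, 1], [0, 1], [0, 1], [2, 0]]
f = [0, 1, 1, 0]          # f(X1) = f(X4) = Y1, f(X2) = f(X3) = Y2

# aggregate rows with the same predicted label, as in step 3
M = [[0, 0], [0, 0]]
for row, label in zip(C, f):
    for j, c in enumerate(row):
        M[label][j] += c

assert M == [[3, 1], [0, 2]]       # the confusion matrix of Table 8
correct = M[0][0] + M[1][1]
assert correct == 5                # success ratio 5/6, while gamma = 4/6
```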

In this case, at least one element of X_i is classified correctly. If X_i is a rough deterministic class, say, X_i ⊆ Y_j, then X_i ∩ Y_n = ∅ if n ≠ j, and therefore, (5.1) forces f(X_i) = Y_j.

Since (5.1) is equivalent to X_i ⊆ upp(f(X_i)) by (2.2), we can think of Ŷ_j as a lower bound of the rough upper approximation of Y_j. Again using the fact that X and Y are sets of classes of equivalence relations, namely, that Y_j = ⋃{Y_j ∩ X_i : 1 ≤ i ≤ m}, we can sharpen this bound: Suppose that Ŷ_j = X_s1 ∪ ... ∪ X_sn and define

nu*_j = n_jj + Σ_{i≠j} (n_ij + n_ji) ≤ nu_j.    (5.2)

If n_ij ≠ 0, then n_ii ≠ 0 by (5.1), and therefore, there is some X_n such that f(X_n) = Y_i and X_n ∩ Y_j ≠ ∅. The latter implies X_n ⊆ upp(Y_j), and we obtain an even sharper bound as

nu**_j = n_jj + Σ_{i≠j} (n_ij + n_ji + Ind(n_ij)) ≤ nu_j.    (5.3)

Turning to the lower approximation, we first suppose that Y_j is a rough definable class, say, Y_j = X_s1 ∪ ... ∪ X_sn. Then, Ŷ_j = X_s1 ∪ ... ∪ X_sn, and we set

nl*_j = n_jj = |Y_j|.    (5.4)

In this case, the classifier f coincides with the rough lower approximation by correctly predicting the decision class membership for each y ∈ Ŷ_j. We can use this observation to define approximations γ* and α*_i of the standard rough set measures γ and α_i:

γ* = (Σ_i n_ii) / n ≥ (Σ_i nl_i) / n = γ,

and

α*_i = n_ii / (n_ii + Σ_{j≠i} (n_ij + n_ji)) = nl*_i / nu*_i ≥ α_i.

Similarly, Ŷ_j is a crude upper bound of the rough lower approximation:

low(Y_j) = ⋃{X_i : X_i ⊆ Y_j} ⊆ ⋃{X_i : f(X_i) = Y_j} ⊆ Ŷ_j.

An argument similar to the one used for upper bounds leads to a refined upper bound of the number of rough lower bound elements:

nl**_i = n_ii − Ind(Σ_{j≠i} n_ij) ≥ nl_i,

and

Σ_{i=1}^{k} nl**_i ≥ Σ_{i=1}^{k} nl_i.
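The bounds (5.2)–(5.4) can be read off any confusion matrix; a sketch for the matrix of Table 5 (the function names are ours):

```python
# confusion matrix of Table 5 (rows: predicted, columns: true)
N = [[35, 4, 12, 2],
     [14, 11, 9, 5],
     [11, 3, 38, 12],
     [1, 0, 4, 2]]
k, n = 4, 163

def nu_star(j):
    """Crude upper bound estimate (5.2)."""
    return N[j][j] + sum(N[i][j] + N[j][i] for i in range(k) if i != j)

def nu_star2(j):
    """Sharpened estimate (5.3): additionally add Ind(n_ij) for each i != j."""
    return N[j][j] + sum(N[i][j] + N[j][i] + (1 if N[i][j] else 0)
                         for i in range(k) if i != j)

nl_star = [N[j][j] for j in range(k)]                 # (5.4)
alpha_star = [nl_star[j] / nu_star(j) for j in range(k)]
gamma_star = sum(nl_star) / n

assert [nu_star(j) for j in range(k)] == [79, 46, 89, 26]
assert [nu_star2(j) for j in range(k)] == [82, 48, 92, 29]
assert [round(a, 3) for a in alpha_star] == [0.443, 0.239, 0.427, 0.077]
assert round(gamma_star, 3) == 0.528
```

The assertions reproduce the nl*, nu*, nu**, α* and γ* rows of Table 9.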


Next, we shall restrict the classifier further. For each 1 ≤ i ≤ m let

M_i = {j : 1 ≤ j ≤ k and c_ij = max{c_ir : 1 ≤ r ≤ k}}.    (5.5)

M_i is the set of all indices which have a maximum value in row i of the granule frequency matrix of Table 1. Since X_i ≠ ∅ and Y partitions U, it follows that c_ij ≠ 0 for at least one j, and that M_i ≠ ∅. A max-row classifier is defined by

f(X_i) = Y_j for some j ∈ M_i.

Then, |X_i ∩ Y_j| = max{c_ir : 1 ≤ r ≤ k}. Thus, X_i ⊆ Ŷ_j implies that c_ij is a maximum in row i. We can use this observation to establish a sharper upper bound of nl_j: Suppose that Ŷ_j = ⋃{X_s1, ..., X_sn} and consider the partial granule matrix

Granule in Ŷ_j   Y_1      ...  Y_j      ...  Y_k      Granule size
X_s1             c_s1,1   ...  c_s1,j   ...  c_s1,k   c_s1
...              ...      ...  ...      ...  ...      ...
X_si             c_si,1   ...  c_si,j   ...  c_si,k   c_si
...              ...      ...  ...      ...  ...      ...
X_sn             c_sn,1   ...  c_sn,j   ...  c_sn,k   c_sn
Confusion size   n_j1     ...  n_jj     ...  n_jk

Since a maximum of each row is in column Y_j, it follows that for all 1 ≤ t ≤ k, t ≠ j, we have n_jt ≤ n_jj, and therefore, max{n_jt : 1 ≤ t ≤ k, t ≠ j} ≤ n_jj. None of the elements of U which contribute to some n_jt for j ≠ t can contribute to n_jj, and thus,

nl^m_j = n_jj − max{n_jt : 1 ≤ t ≤ k, t ≠ j} ≥ nl_j,

and

Σ_{j=1}^{k} nl^m_j ≥ Σ_{j=1}^{k} nl_j.

Finally, we estimate the rough upper bound of Y_j using the max-row classifier. We know already that nu*_j = n_jj + Σ_{i≠j} (n_ji + n_ij) ≤ nu_j. Since for each 1 ≤ i ≤ k, i ≠ j, the max rule implies n_ii ≥ n_ij, we can add n_ij to the cardinality of upp(Y_j). Thus, we obtain

nu^m_j = n_jj + Σ_{i≠j} (n_ji + 2·n_ij) ≤ nu_j.    (5.6)

All point estimators for Example 2 are shown in Table 9 on the following page. We have used the row-max classifier for the nl^m_i, ..., α^m_i, γ^m values, using as entries max{0, nl^m_i}, respectively, max{0, α^m_i}.

In summary, we have presented three different types of indices related to α_i and γ, indicated by the exponents *, ** and m, whereby in the indices α†_i and γ† below, the exponent † equals * or **.

• The values α*_i and γ* correspond to the "classical" procedure, where γ* is identical to the "percentage of correctly classified classes". Both values exceed or are equal to the (unknown granule based) values α_i and γ, and form optimistic bounds for the analysis of confusion matrices.
• The values α**_i and γ** take advantage of a simple property of rough sets. These values thus form an even better estimate than the indices α*_i and γ*, which are based on the classic rough estimates α_i and γ.
• The values α^m_i and γ^m are valid estimates if the classifier in use is a maximum classifier. In this case, the estimates α†_i and γ† of a confusion matrix can be further tightened.

6. Interpretation of α and γ based on the confusion matrix

In case of the "classical" analysis of confusion matrices, there is a nice relationship of a mean α value and the γ value using nl*_i and nu*_i as lower and upper bound frequencies. The α-accuracy (3.1) is connected to the confusion matrix by


Table 9
Estimators for Table 5.

                  Y_1     Y_2     Y_3     Y_4     Sum

Basic rough set parameters
n_i                61      18      63      21     163
p^low_i            0.574   0.611   0.603   0.095
p^upp_i            0.772   0.391   0.708   0.808
α_i                0.443   0.239   0.427   0.077   γ = 0.528

Estimators obtained from the confusion matrix
nl*_i              35      11      38       2      86
nu*_i              79      46      89      26     240
nb*_i              44      35      51      24     154
nl**_i             34      10      37       1      82
nu**_i             82      48      92      29     251
nb**_i             48      38      55      28     169
nl^m_i             23       0      26       0
nu^m_i            105      64     114      45
nb^m_i             82      64      88      45
α*_i               0.443   0.239   0.427   0.077   γ* = 0.528
α**_i              0.415   0.208   0.402   0.034   γ** = 0.503
α^m_i              0.219   0.0     0.228   0.0     γ^m = 0.300
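The max-row based entries of Table 9 can likewise be recomputed from Table 5; a sketch (helper names ours):

```python
# confusion matrix of Table 5 (rows: predicted, columns: true)
N = [[35, 4, 12, 2],
     [14, 11, 9, 5],
     [11, 3, 38, 12],
     [1, 0, 4, 2]]
k, n = 4, 163

def nl_m(j):
    """Lower bound estimate under a max-row classifier, floored at 0."""
    return max(0, N[j][j] - max(N[j][t] for t in range(k) if t != j))

def nu_m(j):
    """Upper bound estimate (5.6) under a max-row classifier."""
    return N[j][j] + sum(N[j][i] + 2 * N[i][j] for i in range(k) if i != j)

assert [nl_m(j) for j in range(k)] == [23, 0, 26, 0]
assert round(nl_m(0) / nu_m(0), 3) == 0.219    # alpha^m for Y_1
assert round(nl_m(2) / nu_m(2), 3) == 0.228    # alpha^m for Y_3
```

With these counts, γ^m = Σ_i nl^m_i / n = 49/163 ≈ 0.30, matching the last row of the table.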


α_i = nl*_i / nu*_i = n_ii / (n_i• + n_•i − n_ii).

A mean value of the α_i measures is therefore given by

ᾱ = Σ_i n_ii / Σ_i (n_i• + n_•i − n_ii) = γ·n / (2n − γ·n) = γ / (2 − γ).    (6.1)

Note that ᾱ is a measure of approximation quality as well as γ, but it uses additionally the information about the upper bounds. It is interesting that the mean α-value is a function of γ only. As ᾱ and γ are strictly monotone connected, any procedure which maximizes γ will maximize ᾱ as well. It is easy to see that ᾱ ≤ γ, and ᾱ < γ if γ ∈ (0, 1). A simple calculation shows that the maximal difference of ᾱ and γ occurs when γ = 2 − √2, with a difference of 3 − 2√2.

The mean overshoot factor ŌSF is given by

ŌSF = Σ_i (n_i• + n_•i − n_ii) / n = (2n − γ·n) / n = 2 − γ.

As in case of a single decision class, the mean α value is given by the fraction of the (lower) approximation quality and the (upper) overshoot factor; the same holds for the mean α, since ᾱ = γ / ŌSF.

Since the mapping f(γ) = γ/(2 − γ) is strictly monotone, we observe that ᾱ and γ are exchangeable measurements of roughness, which means that including the upper bounds in the computation of ᾱ does not contain further information. This relationship only holds in the "classical" analysis of confusion matrices.

At times, the decision attribute is too finely grained with respect to the predictor in order to obtain significant results. It will be helpful to have a framework to decide whether decision categories should be merged in order to result in rough prediction rules with better accuracy. The classical approach of confusion matrix analysis offers us the key for such a framework: The index

γ_ij = (n_ii + n_jj) / (n_ii + n_jj + n_ij + n_ji)

exhibits the conditional approximation quality of the joined decision category Y_i ∪ Y_j. If the decision attribute has only two classes, it is easy to see that γ_ij = γ. Otherwise, if γ_ij is very low with respect to the overall γ, we obtain a reasonable candidate for joining categories. Therefore, we should inspect


f_ij = γ − γ_ij = (Σ_l n_ll)/n − (n_ii + n_jj) / (n_ii + n_jj + n_ij + n_ji).

Parameterization using the Delta method [28] results in

f_ij = a + b + c − (a + b) / (a + b + d + e).

We have suggested in [5] to explore

√(p^low_i · p^upp_i) = √α_i

as the geometric mean of lower and upper approximation precision. A closer look, however, reveals that, once again, p^upp_i causes trouble. The best interpretation of a mean is given when there is no variance, which means that p^low_i = p^upp_i holds. Even though this situation may occur (and is easily computed), it seems quite rare for empirical data; for example, if p^low_i ≈ n_i/n, then p^low_i ≪ p^upp_i.

A suitable index for measuring the tightness of the upper bound uses a ratio of odds. Odds ratios offer a symmetric interpretation of lower and upper precision, and remove the bias in the upper approximation. Furthermore, products of odds (or sums of log-odds) are odds as well, which is a nice property for building logistic type models based on odds. For details on the properties of odds ratios we invite the reader to consult [29, Chapter 2.2.3] or [30]. Odds ratios are often used to estimate the relationship between two binary variables, in particular, in medical statistics, e.g. to determine whether "a particular exposure is a risk factor for a particular outcome" [31]. Their role in linear regression models is highlighted by Menard [32], and in multivariate analysis of discrete data by Clogg and Shockey [33].

First, define

O^low_i = (nl_i/n) / (1 − nl_i/n) = nl_i / (n − nl_i),

O^upp_i = (nu_i/n) / (1 − nu_i/n), if nu_i ≠ n, and 0 otherwise,

O_i = (n_i/n) / (1 − n_i/n) = n_i / (n − n_i).

O^low_i (O^upp_i, O_i) is the ratio of the chance that an arbitrarily drawn x ∈ U is in low(Y_i) (upp(Y_i), Y_i) to the chance that it is not in low(Y_i) (upp(Y_i), Y_i). We now obtain the range corrected value as the ratio

OR^upp_i =

upper approximation precision. A closer look, however, reveals that, once again, p i causes trouble. The best interpretation upp of a mean is given, when there is no variance, which means that p low = p holds. Even though this situation may occur i i ni (and is easily computed), it seems quite rare for empirical data; for example, if p low  , then p low  p upp . i i i n A suitable index for measuring the tightness of the upper bound uses a ratio of odds. Odds ratios offer a symmetric interpretation of lower and upper precision, and remove the bias in the upper approximation. Furthermore, products of odds (or sums of log–odds) are odds as well, which is a nice property for building logistic type models based on odds. For details on the property of odds ratios we invite the reader to consult [29, Chapter 2.2.3] or [30]. Odds ratios are often used to estimate the relationship between two binary variables, in particular, in medical statistics, e.g. to determine whether “a particular exposure is a risk factor for a particular outcome” [31]. Their role in linear regression models is highlighted by Menard [32], and in multivariate analysis of discrete data by Clogg and Shockey [33]. First, define

41 43

12

18

40 42

11

17

upp

and p i

10

16

The interpretation of the precision index p i faces a problem, since we need additional information to conclude that upp a precision index is “low”: While p i = 1 reflects a perfect prediction with deterministic rules, we will always have ni ≤ p upp by (3.6). A welcome property of an upper bound precision measure μ would signal upp(Y i ) = U by μ(Y i ) = 0. i n To connect p i

9

15

7. Odds ratios as approximation measures

upp

7

14

Start with the confusion matrix M(k,k). Compute γ (k) and γi j (k). Choose g (i , j ) = min(γi j (k)). If γk − g (i , j ) does not differ significantly from 0, stop. Merge category i and j, resulting in the confusion matrix M (k − 1, k − 1). Goto step 1.

low

6 8



Here, aˆ = nii /n, bˆ = n j j /n, cˆ = l=i ,l= j nll /n, dˆ = ni j /n and eˆ = n ji /n. The quite complicated term for the computation of the standard error of f i j is determined in C.1. A simple application of the result is the computation of an optimal (smaller and less finely grained) confusion matrix, which optimizes the γ -index by the following algorithm:

23 24

(6.2)

.

53 54 55

Oi upp Oi

if upp(Y i )  U

56

0,

otherwise.

58

upp

=

52

pi



1−

ni n

ni n

57 59

.

60 61

Noting that

    OR_i^upp = [(n_i/n) / (1 − n_i/n)] / [(nu_i/n) / (1 − nu_i/n)],

we see that 0 ≤ OR_i^upp ≤ 1, since n_i ≤ nu_i and 1 − nu_i/n ≤ 1 − n_i/n. Furthermore,

    OR_i^upp = 1 ⟺ upp(Y_i) = Y_i,
    OR_i^upp = 0 ⟺ upp(Y_i) = U,

the latter, since there are at least two decision classes. Thus, the smaller OR_i^upp, the larger is upp(Y_i) relative to Y_i. Our odds ratio OR_i^upp is special, since it does not exceed 1, which is in general not true for arbitrary odds ratios.

Example 4. Continuing Example 2,

    OR_1^upp = (0.667 − 30/200) / (1 − 30/200) = 0.608,
    OR_2^upp = (0.833 − 50/200) / (1 − 50/200) = 0.778,
    OR_3^upp = (0.800 − 120/200) / (1 − 120/200) = 0.5.  □
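The values of Example 4 can also be obtained directly from the counts. Below is a small sketch in our own naming; reading the upper approximation sizes |upp(Y_i)| = 45, 60, 150 off the precisions 0.667 = 30/45, 0.833 = 50/60 and 0.800 = 120/150 is our assumption about the underlying example data.

```python
# Odds and the range-corrected upper odds ratio of Section 7 (our naming).
def odds(x, n):
    """Odds of drawing an element of a subset of size x from n elements."""
    return x / (n - x)

def or_upp(ni, nui, n):
    """OR_i^upp = O_i / O_i^upp, with the convention 0 if upp(Y_i) = U."""
    if nui == n:
        return 0.0
    return odds(ni, n) / odds(nui, n)

n = 200
for ni, nui in [(30, 45), (50, 60), (120, 150)]:
    print(round(or_upp(ni, nui, n), 3))   # 0.608, 0.778, 0.5
```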

An analogous index can be given for the lower bound by defining

    OR_i^low = O_i^low / O_i                                                  (7.1)
             = [(nl_i/n) / (1 − nl_i/n)] / [(n_i/n) / (1 − n_i/n)].           (7.2)

OR_i^low is well defined since there are at least two decision classes, and thus 0 ≤ nl_i ≤ n_i < n. Clearly, 0 ≤ OR_i^low ≤ 1 holds as well, with OR_i^low = 0 if and only if nl_i = 0, and OR_i^low = 1 if and only if nl_i = n_i.

In case of confusion matrix data, we can utilize odds by replacing nl_i and nu_i by their counterparts nl*_i, nl**_i, nl^m_i and nu*_i, nu**_i, nu^m_i, respectively. The product of the odds ratios, called the rough odds ratio (ROR), is meaningful as well: The expression

    ROR_i = OR_i^low · OR_i^upp = [(nl_i/n) / (1 − nl_i/n)] / [(nu_i/n) / (1 − nu_i/n)], if upp(Y_i) ≠ U, and ROR_i = 0 otherwise,

is an odds ratio with the property 0 ≤ ROR_i ≤ 1. Moreover, ROR_i is a “truly” rough measure, as it is defined only by the cardinalities of the approximations, and not of Y_i itself. The square root √(ROR_i), as the geometric mean of the lower and upper odds ratio, shows that rough approximation in some sense may be thought of as a quadratic function.

Both OR_i^low and OR_i^upp measure roughness, and it may be interesting to characterize the dependency of nl_i and nu_i when OR_i^upp = OR_i^low holds. First, we consider the extreme cases: If OR_i^low = OR_i^upp = 1, then low(Y_i) = upp(Y_i) = Y_i. If OR_i^low = OR_i^upp = 0, then nl_i = 0 and nu_i = n. Otherwise, assuming 0 < nl_i ≤ nu_i < n, it is straightforward to compute that

    OR_i^low = OR_i^upp ⟺ O_i^upp = O_i² / O_i^low ⟺ O_i^low · O_i^upp = O_i².

As binary decision variables play a prominent role in data analysis, it is interesting to see how ROR_i behaves in this situation. Given two outcomes with corresponding classes Y_1 and Y_2, we first note that

    nl_1 + nu_2 = n = nl_2 + nu_1.                                            (7.3)

Suppose that nu_i ≠ n. Then ROR_1 can be written as

    ROR_1 = [(nl_1/n) / (1 − nl_1/n)] / [(nu_1/n) / (1 − nu_1/n)]
          = [nl_1 / (n − nl_1)] / [nu_1 / (n − nu_1)]
          = [nl_1 / nu_2] / [nu_1 / nl_2]            by (7.3),
          = (nl_1 / nu_1) · (nl_2 / nu_2)
          = α_1 · α_2.

Obviously, ROR_2 = ROR_1, and we obtain a single measure of approximation for this situation. This makes sense, because the lower and upper approximation of Y_1 depends on the upper and lower approximation of Y_2 and vice versa. Since there is no degree of freedom, the index for the approximation should not vary if we change from category Y_2 to category Y_1; this is the case when using ROR, but not for the classical α-values.

In summary, this section presents three indices based on odds ratios (ORs) that can be given over known rough set indices (which are based on probabilities); each of these ORs is a transformation that takes into account the finite number of elements in the respective valid population:

• OR_i^low is associated with the probability p_i^low,
• OR_i^upp is associated with the probability p_i^upp accordingly,
• and ROR_i is the counterpart of α_i.

Since the ORs can be formed in rough set analysis, the corresponding indices for the estimates of the approximation quality in confusion matrices are also possible; in particular, ROR*_i, ROR**_i and ROR^m_i are the counterparts of α*_i, α**_i and α^m_i.

8. Standard errors of rough odds ratios

As above, we will use parameterization and the Delta method [28] to approximate the standard errors of the odds ratios. As a first step, we notice that nl_i ≤ n_i ≤ nu_i, and that the observations accounting for nl_i and n_i − nl_i, respectively n_i and nu_i − n_i, as well as nl_i and nu_i − nl_i, do not overlap. Therefore, we may assume that the observations in different samples are independent realizations of their populations. Our basis in this section is the generic parameterized OR

    OR(a, b) = [a / (1 − a)] · [(1 − (a + b)) / (a + b)].

Using OR(a, b) enables us to derive a generic estimator for the standard error of any OR we have described earlier. For example, â = nl_i/n and b̂ = (n_i − nl_i)/n are used in case of analyzing OR_i^low.

To facilitate the computation of the standard error, we take logarithms and obtain the parameterized function

    f(a, b) = ln(OR(a, b)) = ln(a) − ln(1 − a) + ln(1 − (a + b)) − ln(a + b).

From f we can estimate the variance as

    Vâr(f̂) = b(1 − b) / [n · a(1 − a)(a + b)(1 − (a + b))].                    (8.1)

Details are given in C.2. A nice feature of the parameterized variance estimator given by (8.1) is its flexibility. It is not only applicable for lower and upper precision, but also for measures in confusion matrices. It is only necessary to use proper substitutions for the parameters a and b in the estimate of OR(a, b).
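The generic estimator is easy to code and to check. A short sketch with our own function names; the final assertion verifies numerically that the generic formula agrees with the closed form obtained by substituting â = nl_i/n and b̂ = (n_i − nl_i)/n.

```python
# Generic parameterized odds ratio OR(a, b) and the Delta-method variance
# of its logarithm, eq. (8.1); a sketch in our own naming.
def odds_ratio(a, b):
    return (a / (1 - a)) * ((1 - (a + b)) / (a + b))

def var_log_or(a, b, n):
    return b * (1 - b) / (n * a * (1 - a) * (a + b) * (1 - (a + b)))

# With a = nl_i/n and b = (n_i - nl_i)/n this agrees with the closed form
# n(n_i - nl_i)(n - n_i + nl_i) / (n_i(n - n_i) nl_i(n - nl_i)):
n, ni, nl = 200, 30, 15
generic = var_log_or(nl / n, (ni - nl) / n, n)
closed = n * (ni - nl) * (n - ni + nl) / (ni * (n - ni) * nl * (n - nl))
assert abs(generic - closed) < 1e-12
```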

Substituting the frequencies n_i and nl_i, we obtain the estimate

    Vâr(ln(OR_i^low)) ≈ n(n_i − nl_i)(n − n_i + nl_i) / [n_i(n − n_i) nl_i(n − nl_i)].    (8.2)

As OR_i^upp and α*_i are odds ratios of a similar structure as OR_i^low, we redefine the parameters a and b and use the same approximation formula as in the case of OR_i^low. Regarding OR_i^upp, we define â = n_i/n and b̂ = (nu_i − n_i)/n, resulting in

    Vâr(ln(OR_i^upp)) = n(nu_i − n_i)(n − nu_i + n_i) / [n_i(n − n_i) nu_i(n − nu_i)].    (8.3)

Finally, ROR_i is parameterized by â = nl_i/n and b̂ = (nu_i − nl_i)/n, and we obtain

    Vâr(ln(ROR_i)) = n(nu_i − nl_i)(n − nu_i + nl_i) / [nl_i(n − nl_i) nu_i(n − nu_i)].   (8.4)

A summary of the substitutions is presented in the following table:

                      n·â        n·b̂
    OR_i^low          nl_i       n_i − nl_i
    OR_i^upp          n_i        nu_i − n_i
    ROR_i             nl_i       nu_i − nl_i
    OR*_i^low         nl*_i      n_i − nl*_i
    OR*_i^upp         n_i        nu*_i − n_i
    ROR*_i            nl*_i      nu*_i − nl*_i
    OR**_i^low        nl**_i     n_i − nl**_i
    OR**_i^upp        n_i        nu**_i − n_i
    ROR**_i           nl**_i     nu**_i − nl**_i
    OR^{m,low}_i      nl^m_i     n_i − nl^m_i
    OR^{m,upp}_i      n_i        nu^m_i − n_i
    ROR^m_i           nl^m_i     nu^m_i − nl^m_i

Example 5. Continuing Example 1, we have the following values:

                 Y_1              Y_2              Y_3
    OR_i^low     0.459            0.750            0.545
    SE           0.213            0.094            0.120
    95% CI       [0.312, 0.677]   [0.630, 0.893]   [0.445, 0.668]
    OR_i^upp     0.608            0.778            0.500
    SE           0.123            0.078            0.119
    95% CI       [0.476, 0.776]   [0.668, 0.906]   [0.396, 0.631]
    ROR_i        0.279            0.583            0.273
    SE           0.230            0.116            0.150
    95% CI       [0.178, 0.438]   [0.465, 0.732]   [0.203, 0.366]
    √ROR_i       0.528            0.764            0.522
    95% CI       [0.422, 0.662]   [0.682, 0.855]   [0.451, 0.605]

As √ROR_i is the geometric mean of the lower bound OR_i^low and the upper bound OR_i^upp,

• it is a compromise of the two approximation measures, and
• the width of the 95% CI of √ROR_i is smaller than that of OR_i^low and OR_i^upp.

The latter is due to the fact that √ROR_i is a mean of the other odds ratios, and that standard errors of means are smaller than standard errors of the single values. □
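The confidence intervals in the tables are formed on the log scale and transformed back. A small helper sketch (the function name and the use of the 1.96 normal quantile for a 95% interval are ours):

```python
import math

def or_ci(or_value, se_log, z=1.96):
    """95% confidence interval for an odds ratio, given the standard
    error of its logarithm (interval built on the log scale)."""
    return (or_value * math.exp(-z * se_log),
            or_value * math.exp(z * se_log))

lo, hi = or_ci(0.500, 0.119)       # OR_3^upp of Example 5
print(round(lo, 3), round(hi, 3))  # 0.396 0.631
```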

We close this section with the values for our running example of data from a confusion matrix.

Example 6. Continuing Example 2, we have the following values:

                  Y_1              Y_2              Y_3              Y_4
    OR*_i^low     0.457            0.583            0.483            0.084
    SE            0.144            0.202            0.137            0.682
    95% CI        [0.345, 0.607]   [0.392, 0.866]   [0.369, 0.631]   [0.022, 0.319]
    OR*_i^upp     0.636            0.316            0.524            0.779
    SE            0.119            0.141            0.116            0.174
    95% CI        [0.504, 0.803]   [0.240, 0.416]   [0.417, 0.658]   [0.554, 1.097]
    ROR*_i        0.291            0.184            0.253            0.065
    SE            0.169            0.285            0.173            0.689
    95% CI        [0.209, 0.405]   [0.105, 0.322]   [0.180, 0.354]   [0.017, 0.252]
    √ROR*_i       0.539            0.429            0.503            0.256
    95% CI        [0.457, 0.636]   [0.324, 0.567]   [0.424, 0.595]   [0.130, 0.502]
    OR**_i^low    0.441            0.527            0.466            0.042
    SE            0.148            0.225            0.141            0.982
    95% CI        [0.330, 0.589]   [0.339, 0.818]   [0.354, 0.614]   [0.006, 0.286]
    OR**_i^upp    0.591            0.297            0.486            0.683
    SE            0.121            0.147            0.118            0.177
    95% CI        [0.466, 0.748]   [0.223, 0.397]   [0.386, 0.613]   [0.483, 0.966]
    ROR**_i       0.260            0.157            0.227            0.029
    SE            0.176            0.303            0.178            0.989
    95% CI        [0.184, 0.367]   [0.087, 0.283]   [0.160, 0.321]   [0.004, 0.198]
    √ROR**_i      0.510            0.396            0.476            0.169
    95% CI        [0.429, 0.606]   [0.295, 0.532]   [0.400, 0.567]   [0.063, 0.445]
    OR^{m,low}_i  0.275            0.000            0.301            0.000
    SE            0.197            0.000            0.184            0.000
    95% CI        [0.187, 0.404]   —                [0.210, 0.432]   —
    OR^{m,upp}_i  0.330            0.192            0.271            0.388
    SE            0.140            0.189            0.138            0.179
    95% CI        [0.251, 0.435]   [0.133, 0.278]   [0.206, 0.355]   [0.273, 0.551]
    ROR^m_i       0.091            0.000            0.082            0.000
    SE            0.235            0.000            0.233            0.000
    95% CI        [0.057, 0.144]   —                [0.052, 0.129]   —
    √ROR^m_i      0.301            0.000            0.286            0.000
    95% CI        [0.239, 0.379]   —                [0.228, 0.359]   —

□

9. Summary and future work

In this paper we have investigated the connection between rough sets and confusion matrices, and have discussed the classical rough set measures α and γ in this context. In particular, we have proposed a rough classifier and a rough confusion matrix, and have shown how upper and lower bounds of rough approximation measures can be obtained. We have also presented a measure and an algorithm for merging two finely grained categories, which may be used to improve the rough approximation quality γ. Since the precision measures p_i^low and p_i^upp which make up α_i lack symmetry, we propose to use odds ratios as range corrected values. Since estimates based on empirical data may not be accurate, we have provided an error theory for the odds ratios. Examples for the concepts are given throughout.

In future work we shall investigate the usefulness of the indices for decision making, using test data and simulations. In particular, we shall compare the performance of rough set classifiers with other machine learning methods based on our rough confusion matrix indices. We shall also incorporate and position the results into our general approach to statistics in the rough set model [34,35].

Declaration of competing interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgements

We thank the anonymous reviewers for careful reading and helpful suggestions. I. Düntsch gratefully acknowledges support by the National Natural Science Foundation of China, Grant No. 61976053.

Appendix A. Table of simple rough measures

    Name                           Notation      Estimate

Generic rough set measures
    Rough lower precision          p_i^low       nl_i / n_i
    Rough upper precision          p_i^upp       n_i / nu_i
    Rough alpha precision          α_i           nl_i / nu_i
    Rough approximation quality    γ             Σ_i nl_i / n
    Rough alpha                    α             γ / (2 − γ)

Standard measures for evaluation of confusion matrices
    Largest lower precision        p*_i^low      n_ii / n_i
    Smallest upper precision       p*_i^upp      n_i / (n_ii + Σ_{i≠j}(n_ij + n_ji))
    Largest alpha precision        α*_i          n_ii / (n_ii + Σ_{i≠j}(n_ij + n_ji))
    Largest approximation quality  γ*            Σ_i n_ii / n
    Largest alpha                  α*            γ* / (2 − γ*)

Sharper rough bounds for evaluation of confusion matrices
    Upper lower precision          p**_i^low     (n_ii − Ind(Σ_{j≠i} n_ij)) / n_i
    Lower upper precision          p**_i^upp     n_i / (n_ii + Σ_{i≠j}(n_ij + n_ji + Ind(n_ij)))
    Upper alpha precision          α**_i         (n_ii − Ind(Σ_{j≠i} n_ij)) / (n_ii + Σ_{i≠j}(n_ij + n_ji + Ind(n_ij)))
    Upper approximation quality    γ**           Σ_i (n_ii − Ind(Σ_{j≠i} n_ij)) / n
    Upper alpha                    α**           γ** / (2 − γ**)

Bounds for max-row classifier for evaluation of confusion matrices
    m-Lower precision              p^{m,low}_i   max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}} / n_i
    m-Upper precision              p^{m,upp}_i   n_i / (n_ii + Σ_{j≠i}(n_ji + 2·n_ij))
    m-Alpha precision              α^m_i         max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}} / (n_ii + Σ_{j≠i}(n_ji + 2·n_ij))
    m-Approximation quality        γ^m           Σ_i max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}} / n
    m-Alpha                        α^m           γ^m / (2 − γ^m)

Appendix B. Table of odds ratio measures

Throughout, odd(x) abbreviates (x/n) / (1 − x/n) = x / (n − x), and the confusion matrix counterparts of nl_i and nu_i are

    nl*_i = n_ii,                                      nu*_i = n_ii + Σ_{i≠j}(n_ij + n_ji),
    nl**_i = n_ii − Ind(Σ_{j≠i} n_ij),                 nu**_i = n_ii + Σ_{i≠j}(n_ij + n_ji + Ind(n_ij)),
    nl^m_i = max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}},   nu^m_i = n_ii + Σ_{j≠i}(n_ji + 2·n_ij).

    Name                Notation        Estimate

Generic rough set measures
    Rough lower OR      OR_i^low        odd(nl_i) / odd(n_i)
    Rough upper OR      OR_i^upp        odd(n_i) / odd(nu_i)
    Rough OR            ROR_i           odd(nl_i) / odd(nu_i)

Standard measures for evaluation of confusion matrices
    Largest lower OR    OR*_i^low       odd(nl*_i) / odd(n_i)
    Smallest upper OR   OR*_i^upp       odd(n_i) / odd(nu*_i)
    Largest ROR         ROR*_i          odd(nl*_i) / odd(nu*_i)

Sharper rough bounds for evaluation of confusion matrices
    Upper lower OR      OR**_i^low      odd(nl**_i) / odd(n_i)
    Lower upper OR      OR**_i^upp      odd(n_i) / odd(nu**_i)
    Upper ROR           ROR**_i         odd(nl**_i) / odd(nu**_i)

Bounds for max-row classifier for evaluation of confusion matrices
    m-Lower OR          OR^{m,low}_i    odd(nl^m_i) / odd(n_i)
    m-Upper OR          OR^{m,upp}_i    odd(n_i) / odd(nu^m_i)
    m-ROR               ROR^m_i         odd(nl^m_i) / odd(nu^m_i)
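The first two blocks of the table can be computed directly from a confusion matrix. A sketch under the assumption (ours, for illustration) that rows hold the true decision classes, so that n_i is the i-th row sum:

```python
# Confusion-matrix counterparts of the rough odds ratios (Appendix B):
# nl*_i = n_ii and nu*_i = n_ii + sum_{j != i}(n_ij + n_ji).
def odds(x, n):
    return x / (n - x)

def star_ors(m, i):
    n = sum(sum(row) for row in m)
    ni = sum(m[i])                       # n_i as the i-th row sum (assumption)
    nl = m[i][i]
    nu = m[i][i] + sum(m[i][j] + m[j][i] for j in range(len(m)) if j != i)
    return (odds(nl, n) / odds(ni, n),   # OR*_i^low
            odds(ni, n) / odds(nu, n),   # OR*_i^upp
            odds(nl, n) / odds(nu, n))   # ROR*_i

m = [[40, 10, 0], [5, 30, 5], [0, 10, 100]]
low, upp, ror = star_ors(m, 0)
print(round(low, 3), round(upp, 3), round(ror, 3))   # 0.75 0.879 0.659
```

Note that ROR*_i is by construction the product of the other two, which the test below also checks.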

Appendix C. Details for Δ method computations

C.1. Computation for (6.2)

Consider

    ∂f_ij/∂a = 1 − (d + e)/(a + b + d + e)²,
    ∂f_ij/∂b = 1 − (d + e)/(a + b + d + e)²,
    ∂f_ij/∂c = 1,
    ∂f_ij/∂d = (a + b)/(a + b + d + e)²,
    ∂f_ij/∂e = (a + b)/(a + b + d + e)².

Here, â = n_ii/n, b̂ = n_jj/n, ĉ = Σ_{l≠i, l≠j} n_ll/n, d̂ = n_ij/n and ê = n_ji/n. Writing S = a + b + d + e, the multinomial Delta method gives

    Vâr(f̂) = [a(1 − a)/n]·(1 − (d + e)/S²)²
            + [b(1 − b)/n]·(1 − (d + e)/S²)²
            + c(1 − c)/n
            + [d(1 − d)/n]·((a + b)/S²)²
            + [e(1 − e)/n]·((a + b)/S²)²
            − (2ab/n)·(1 − (d + e)/S²)²
            − (2ac/n)·(1 − (d + e)/S²)
            − (2ad/n)·(1 − (d + e)/S²)·((a + b)/S²)
            − (2ae/n)·(1 − (d + e)/S²)·((a + b)/S²)
            − (2bc/n)·(1 − (d + e)/S²)
            − (2bd/n)·(1 − (d + e)/S²)·((a + b)/S²)
            − (2be/n)·(1 − (d + e)/S²)·((a + b)/S²)
            − (2cd/n)·((a + b)/S²)
            − (2ce/n)·((a + b)/S²)
            − (2de/n)·((a + b)/S²)².
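The expanded sum above is just the quadratic form gᵀΣg with the multinomial covariance matrix Σ. A compact sketch of that computation (naming is ours), which may be easier to audit than the fifteen-term expansion:

```python
# Delta-method variance of f_ij = a + b + c - (a+b)/(a+b+d+e) under a
# multinomial model, written as gradient * covariance * gradient.
def var_fij(p, n):
    a, b, c, d, e = p
    s2 = (a + b + d + e) ** 2
    g = [1 - (d + e) / s2,       # df/da
         1 - (d + e) / s2,       # df/db
         1.0,                    # df/dc
         (a + b) / s2,           # df/dd
         (a + b) / s2]           # df/de
    var = 0.0
    for u in range(5):
        for v in range(5):
            cov = p[u] * (1 - p[u]) / n if u == v else -p[u] * p[v] / n
            var += g[u] * g[v] * cov
    return var

# The relative frequencies a, b, c, d, e must sum to 1:
print(var_fij([0.2, 0.15, 0.5, 0.1, 0.05], 200))
```

The variance scales as 1/n, and the roles of d (= n_ij/n) and e (= n_ji/n) are interchangeable, which gives two easy sanity checks.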

C.2. Computation for (8.1)

The partial derivatives are

    ∂f/∂a = 1/a + 1/(1 − a) − 1/(a + b) − 1/(1 − (a + b)),
    ∂f/∂b = −1/(a + b) − 1/(1 − (a + b)),

and we can estimate the variance as

    Vâr(f̂) = [a(1 − a)/n]·(1/a + 1/(1 − a) − 1/(a + b) − 1/(1 − (a + b)))²
            + [b(1 − b)/n]·(1/(a + b) + 1/(1 − (a + b)))²
            + (2ab/n)·(1/a + 1/(1 − a) − 1/(a + b) − 1/(1 − (a + b)))·(1/(a + b) + 1/(1 − (a + b)))
            = b(1 − b) / [n · a(1 − a)(a + b)(1 − (a + b))].                   (C.1)

References

[1] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341–356.
[2] J. Novaković, A. Veljović, S. Ilić, Ž. Papić, M. Tomović, Evaluation of classification models in machine learning, Theory Appl. Math. Comput. Sci. 7 (2017) 39–46.
[3] D.J. Hand, Supervised classification and tunnel vision, Appl. Stoch. Models Bus. Ind. 21 (2005) 97–109.
[4] O. Caelen, A Bayesian interpretation of the confusion matrix, Ann. Math. Artif. Intell. 81 (2017) 429–450.
[5] G. Gediga, I. Düntsch, Standard errors of indices in rough set data analysis, in: Transactions on Rough Sets 17, 2014, pp. 33–47.
[6] I. Düntsch, G. Gediga, PRE and variable precision models in rough set data analysis, in: Transactions on Rough Sets 19, 2015, pp. 17–37.
[7] W. Ziarko, Variable precision rough set model, J. Comput. Syst. Sci. 46 (1993) 39–59.
[8] I. Düntsch, G. Gediga, Confusion matrices and rough set data analysis, CoRR, arXiv:1902.01487, 2019, https://doi.org/10.1088/1742-6596/1229/1/012055.
[9] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, System Theory, Knowledge Engineering and Problem Solving, vol. 9, Kluwer, Dordrecht, 1991.
[10] I. Düntsch, G. Gediga, Rough Set Data Analysis: A Road to Non-invasive Knowledge Discovery, Methoδos Publishers (UK), Bangor, 2000, http://www.cosc.brocku.ca/~duentsch/archive/nida.pdf.
[11] B. Everitt, The Analysis of Contingency Tables, Springer Verlag, 1977, originally published by Chapman and Hall.
[12] H. Nguyen, A. Skowron, Rough sets: from rudiments to challenges, in: A. Skowron, Z. Suraj (Eds.), Rough Sets and Intelligent Systems - Professor Zdzisław Pawlak in Memoriam, vol. 1, Springer Verlag, 2013, pp. 75–173.
[13] K. Pearson, Mathematical Contributions to the Theory of Evolution: On Contingency and Its Relation to Association and Normal Correlation, Draper's Company Research Memoirs, Biometric Series, vol. 13, Department of Applied Mathematics, University College, University of London, 1904.
[14] M.H. Hodge, I. Pollack, Confusion matrix analysis of single and multidimensional auditory displays, J. Exp. Psychol. 63 (1962) 129–142.
[15] J. Townsend, Theoretical analysis of an alphabetic confusion matrix, Percept. Psychophys. 9 (1971) 40–50.
[16] R. Kohavi, F. Provost, Glossary of terms, Mach. Learn. 30 (1998) 271–274.
[17] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861–874.
[18] D. Hand, Measuring classifier performance: a coherent alternative to the area under the ROC curve, Mach. Learn. 77 (2009) 103–123.
[19] L. Cheng, X. Chen, M. Wei, J. Wuan, X. Hou, Modeling mode choice behavior incorporating household and individual sociodemographics and travel attributes based on rough sets theory, Comput. Intell. Neurosci. 2014 (2014) 560919.
[20] H. Al-Qaheri, A.E. Hassanien, A. Abraham, Discovering stock price prediction rules using rough sets, Neural Netw. World 18 (2008).
[21] A. Hassanien, M. Abdelhafez, H. Own, Rough sets data analysis in knowledge discovery: a case of Kuwaiti diabetic children patients, Adv. Fuzzy Syst. 2008 (2008) 528461.
[22] M. Sudha, A. Kumaravel, Students performance prediction based on rough sets, Indian J. Comput. Sci. Eng. 8 (2017) 584–589.
[23] J. Xu, Y. Zhang, D. Miao, Three-way confusion matrix for classification: a measure driven view, Inf. Sci. 507 (2020) 772–794.
[24] I. Düntsch, G. Gediga, Approximation operators in qualitative data analysis, in: H. de Swart, E. Orłowska, G. Schmidt, M. Roubens (Eds.), Theory and Application of Relational Structures as Knowledge Instruments, Lecture Notes in Computer Science, vol. 2929, Springer-Verlag, Heidelberg, 2003, pp. 214–230.
[25] I. Düntsch, G. Gediga, Simplifying contextual structures, in: Proceedings of the 6th International Conference on Pattern Recognition and Machine Intelligence, Lecture Notes in Computer Science, vol. 9124, Springer Verlag, Heidelberg, 2015, pp. 23–32.
[26] R. Congalton, R.A. Mead, A quantitative method to test for consistency and correctness in photointerpretation, Photogramm. Eng. Remote Sens. 49 (1983) 69–74.
[27] D. Lauer, C. Hay, A. Benson, Quantitative evaluation of multiband photographic techniques, Final report NAS9-957, Earth Observation Division, Manned Spacecraft Center, NASA, 1970.
[28] G. Oehlert, A note on the Delta method, Am. Stat. 46 (1992) 27–29.
[29] A. Agresti, Categorical Data Analysis, 3rd ed., Wiley Series in Probability and Statistics, Wiley, Hoboken, NJ, 2013.

[30] J. Bland, D. Altman, The odds ratio, BMJ 320 (2000) 1468.
[31] M. Szumilas, Explaining odds ratios, J. Can. Acad. Child Adolesc. Psychiatry 19 (2010) 227–229.
[32] S. Menard, Applied Logistic Regression Analysis, 2nd ed., Quantitative Applications in the Social Sciences, vol. 106, Sage Publications, 2002.
[33] C.C. Clogg, J.W. Shockey, Multivariate analysis of discrete data, in: J.R. Nesselroade, R.B. Cattell (Eds.), Handbook of Multivariate Experimental Psychology, Perspectives on Individual Differences, 2nd ed., Plenum Press, 1988, pp. 337–365.
[34] I. Düntsch, G. Gediga, Probabilistic granule analysis, in: C.-C. Chan, J.W. Grzymala-Busse, W.P. Ziarko (Eds.), Proceedings of the Sixth International Conference on Rough Sets and Current Trends in Computing, RSCTC 2008, Lecture Notes in Computer Science, vol. 5306, Springer Verlag, 2008, pp. 223–231.
[35] G. Gediga, I. Düntsch, Statistical techniques for rough set data analysis, in: L. Polkowski, S. Tsumoto, T.Y. Lin (Eds.), Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, Physica Verlag, Heidelberg, 2000, pp. 545–565.