International Journal of Approximate Reasoning
Contents lists available at ScienceDirect: www.elsevier.com/locate/ijar

Indices for rough set approximation and the application to confusion matrices

Ivo Düntsch a,b,1, Günther Gediga c,1

a College of Maths. Informatics, Fujian Normal University, Fuzhou, China
b Dept. of Computer Science, Brock University, St Catharines, Canada
c Institut für Evaluation und Marktanalysen, Brinkstr. 19, 49143 Jeggen, Germany

Article history: Received 18 July 2019; Received in revised form 2 November 2019; Accepted 15 December 2019.

Keywords: Rough set approximation; Confusion matrix; Precision indices; Odds ratios; Error estimation

Abstract. Confusion matrices and their associated statistics are a well established tool in machine learning to evaluate the accuracy of a classifier. In the present study, we define a rough confusion matrix based on a very general classifier, and derive various statistics from it which are related to common rough set estimators. In other words, we perform a rough set-like analysis on a confusion matrix, which is the converse of the usual procedure; in particular, we consider upper approximations. A suitable index for measuring the tightness of the upper bound uses a ratio of odds. Odds ratios offer a symmetric interpretation of lower and upper precision, and remove the bias in the upper approximation. We investigate rough odds ratios of the parameters obtained from the confusion matrix; to guard against undue random influences, we also approximate their standard errors. © 2019 Elsevier Inc. All rights reserved.
1. Introduction
The philosophy of rough sets is based on the assumption that knowledge about the world depends on the granularity of representation [1]. Rough set methods are concerned with the approximation of sets (whose extent may be unknown) by suitable functions, and are measured by associated indices which are defined by the data at hand, in particular, frequency measures such as counting. Confusion matrices are a frequently used tool in machine learning to evaluate the prediction quality of classifiers such as neural networks, tree-like classifiers, discriminant analysis, Bayesian methods, support vector machines, and many more, including classifiers based on rough sets, see e.g. [2–4]. Usually, confusion matrices are analyzed by statistical techniques, even if the method to derive rules to predict outcomes is based on rough set technology. Nevertheless, a rough set based interpretation of the "confusion" is directly possible if the information about the granules used for the prediction process is given. If there are granules for which the predicted class equals the class given by, say, a gold standard, we may interpret this as a lower bound of the approximation of the class given by the gold standard. Any granule which intersects the class given by the gold standard in at least one element contributes to the upper bound of the gold standard class. This results in a simple application of the rough set model.
E-mail addresses: [email protected] (I. Düntsch), [email protected] (G. Gediga).
1 The ordering of authors is alphabetical and equal authorship is implied.
https://doi.org/10.1016/j.ijar.2019.12.008
0888-613X/© 2019 Elsevier Inc. All rights reserved.
Table 1
The granule frequency matrix.

Granule      Y_1    ...  Y_i    ...  Y_k    Granule size
X_1          c_11   ...  c_1i   ...  c_1k   c_1
...          ...    ...  ...    ...  ...    ...
X_j          c_j1   ...  c_ji   ...  c_jk   c_j
...          ...    ...  ...    ...  ...    ...
X_m          c_m1   ...  c_mi   ...  c_mk   c_m
Class size   n_1    ...  n_i    ...  n_k    n
In this paper we shall explore how confusion matrices fit into the rough set approach if the information about the granularity of the prediction process is unknown. The reason for discarding the basic information is twofold:

1. It is a standard procedure to publish confusion matrices to demonstrate the validity of a data model.
2. Other data analysis techniques do not use the information granules directly, but only summary statistics. In this case information about the granularity does not describe the prediction process.

Particular attention will be given to investigating how we can use a suitably defined rough confusion matrix to approximate the accuracy α_i of rough approximation and the rough approximation quality γ. To measure relationships among the estimates, we introduce rough odds ratios and show how they are related to the previously defined concepts. An error theory for rough odds ratios is also provided.

This paper may be regarded as a continuation of our earlier paper [5], in which we investigated confidence bounds for standard rough set approximation quality and accuracy, and of [6], where we proposed a parameter free alternative to the variable precision model [7] in the context of a variant of confusion matrices. Some material of Section 4 is taken from the preliminary report [8].

2. Basic rough set concepts
Mathematically, granularity may be expressed by an equivalence relation θ on a nonempty finite set U, up to the classes of which membership in a subset of U can be determined. Thus, the knowledge we have consists of the classes X, Y of two equivalence relations, and our aim is to approximate the classes in Y by the classes in X. We assume that the reader has a working knowledge of how to obtain predictor classes and decision classes from a fixed decision system, see e.g. [9] or [10]. Several operators are defined on 2^U in the following way [9]: Let P = {X_1, ..., X_m} be the set of classes of an equivalence relation θ. If Y ⊆ U, then

low(Y) = ⋃{Z ∈ P : Z ⊆ Y},      Lower approximation,  (2.1)
upp(Y) = ⋃{Z ∈ P : Z ∩ Y ≠ ∅},  Upper approximation,  (2.2)
bnd(Y) = upp(Y) \ low(Y),       Boundary.             (2.3)

Note that

U \ upp(Y) = low(U \ Y).  (2.4)
Throughout, we suppose that Y = {Y_1, ..., Y_k} is the set of decision classes, also called descriptor classes, and that lower and upper approximations are taken with respect to a fixed equivalence relation θ whose set of classes is X = {X_1, ..., X_m}; predictor classes are also called granules. The aim is to approximate the classes of Y by the granules in X. To avoid trivialities we assume that |U| ≥ 2, and that there are at least two decision classes, i.e. k > 1. We use n = |U| for the number of observations, and set

n_i = |Y_i|,  nl_i = |low(Y_i)|,  nu_i = |upp(Y_i)|,  nb_i = |bnd(Y_i)| = nu_i − nl_i.

Since both X and Y are partitions of U, we can build the granule frequency matrix with marginal sums shown in Table 1; there, c_ji = |X_j ∩ Y_i|. Note that a granule frequency matrix is a contingency table, see e.g. [11].
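The operators (2.1)–(2.3) can be computed directly from the partition. The following sketch is illustrative only; the function and variable names are ours, not the paper's:

```python
def approximations(granules, Y):
    """Return (low(Y), upp(Y), bnd(Y)) with respect to the partition
    'granules', following (2.1)-(2.3)."""
    Y = set(Y)
    low, upp = set(), set()
    for Z in granules:
        if Z <= Y:        # Z contained in Y -> contributes to the lower approximation
            low |= Z
        if Z & Y:         # Z meets Y -> contributes to the upper approximation
            upp |= Z
    return low, upp, upp - low

# A small hypothetical universe U = {1,...,5} partitioned into three granules
U = {1, 2, 3, 4, 5}
P = [{1, 2}, {3}, {4, 5}]
low, upp, bnd = approximations(P, {1, 2, 3, 4})
```

For Y = {1, 2, 3, 4} this yields low(Y) = {1, 2, 3}, upp(Y) = U and bnd(Y) = {4, 5}; the duality (2.4) can be checked by comparing U \ upp(Y) with low(U \ Y).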
Consider the vector X̃_j = ⟨c_j1, ..., c_jk⟩ of the frequencies c_ji belonging to granule X_j. Since Y is a partition of U and X_j ≠ ∅, we have c_ji ≠ 0 for at least one 1 ≤ i ≤ k. If X̃_j contains only one non-zero entry, say, c_ji, then X_j ⊆ Y_i, and we call the granule deterministic; otherwise the granule is called indeterministic. The granule X_j is said to belong to the upper bound upp(Y_i) of the decision class Y_i if c_ji ≠ 0. Thus, we obtain

nu_i = Σ_{j=1}^{m} Ind(c_ji) · c_j.  (2.5)

Here, Ind is the indicator function defined by

Ind(a) = 0, if a = 0; 1, otherwise.

For the basic philosophy and tools of rough set data analysis the reader is invited to consult [9,10]. For recent developments and more advanced methods the overview [12] is an excellent source.

3. Approximation and precision indices

Various indices play a role in rough set theory, such as accuracy α, approximation quality γ and R-roughness ρ [9, Ch 2.6]. We suggest distinguishing between precision indices and approximation indices: A precision index measures the precision of membership in a class Y_i, and may contain in its simplest form the cardinality n_i of Y_i. A (rough) approximation index is related to the approximation of some set, usually a decision class, by the classes of X, and contains only one or more of nl_i, nu_i or n. This is meaningful in the context of rough sets, since it can happen that the rough set ⟨low(X), upp(X)⟩ equals the rough set ⟨low(Y), upp(Y)⟩, while |X| ≠ |Y|. Thus, in a rough set measure the cardinality of Y_i is not directly involved; indeed, the parameters nl_i, nu_i or n do not determine n_i = |Y_i|, so that, in some sense, an approximation index may be called context free (or, more precisely, decision context free, as the context of the granules may vary). It may also be noted that an approximation index does not involve the complement of a decision class, since a rough set, up to which we know the world, does not generally have a complement. Traditionally, rough statistics are the local statistic for a decision class Y_i, defined by

α_i = nl_i / nu_i,  Accuracy of approximation of Y_i,  (3.1)

and the global statistic

γ = (1/n) · Σ_{i=1}^{k} nl_i,  Quality of approximation of Y by X.  (3.2)

We observe that α_i depends on the approximation of −Y_i as well, since

α_i = nl_i / nu_i = nl_i / (n − |low(−Y_i)|).

In [5] we have introduced two indices, namely, p_i^low, the lower precision, and p_i^upp, the upper precision of determining membership in Y_i, and have investigated some of their properties; these will play a major role in our subsequent considerations. Given a decision class Y_i, the relative precision of deterministic membership in Y_i is defined by

p_i^low = nl_i / n_i = 1 − (n_i − nl_i) / n_i,  (3.3)

i.e. p_i^low is the percentage of correctly predicted elements of Y_i relative to n_i. Then, 1 − p_i^low is the percentage of elements of Y_i which cannot be predicted by the deterministic rules. Clearly, 0 ≤ p_i^low ≤ 1, and

• p_i^low = 0 if and only if the lower bound of Y_i is empty,
• p_i^low = 1 if and only if Y_i is a deterministic class, i.e. n_i = nl_i = nu_i.

We note in passing that the precision indices require complete knowledge of Y_i, while α_i is a "true" rough set measure, as it involves only the approximations.
Table 2
Results of a rough set analysis.

        Y_1    Y_2    Y_3    Sum
n_i     30     50     120    200
nl_i    15     40     90     145
nu_i    45     60     150    -

Table 3
Indices for the system of Table 2.

          Y_1     Y_2     Y_3
p_i^low   0.5     0.8     0.75
p_i^upp   0.667   0.833   0.8
α_i       0.333   0.667   0.6
γ = 0.725
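The indices of Table 3 can be recomputed from the counts of Table 2; a minimal sketch (variable names are ours):

```python
# Counts from Table 2 (columns Y_1, Y_2, Y_3)
n_cls = [30, 50, 120]
nl    = [15, 40, 90]
nu    = [45, 60, 150]
n     = sum(n_cls)                            # |U| = 200

p_low = [l / c for l, c in zip(nl, n_cls)]    # (3.3): nl_i / n_i
p_upp = [c / u for c, u in zip(n_cls, nu)]    # upper precision n_i / nu_i
alpha = [l / u for l, u in zip(nl, nu)]       # (3.1): nl_i / nu_i
gamma = sum(nl) / n                           # (3.2)
```

This reproduces Table 3, e.g. p_low = (0.5, 0.8, 0.75) and γ = 0.725; note also the identity α_i = p_i^low · p_i^upp used later in Section 7.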
Since γ = Σ_{i=1}^{k} (n_i/n) · p_i^low, we see that γ is a weighted sum of the p_i^low values. Similarly, we define a precision index for an arbitrary x being in the upper approximation of Y_i by

p_i^upp = n_i / nu_i = 1 − (nu_i − n_i) / nu_i,  (3.4)

which is the percentage of elements possibly in Y_i captured by indeterministic rules. By (2.4) and the fact that Y_1, ..., Y_k partition U, we also have

p_i^upp = n_i / (n − |low(−Y_i)|).  (3.5)

Clearly, 1 − p_i^upp is the percentage of those elements predicted not to be possibly in Y_i by indeterministic rules. Note that

n_i/n ≤ p_i^upp ≤ 1,  (3.6)

and

p_i^upp = 1 ⟺ n_i/nu_i = 1 ⟺ n_i = nu_i ⟺ nl_i = nu_i ⟺ p_i^low = 1.  (3.7)

The reciprocal 1/p_i^upp of p_i^upp is of interest as well, since 1/p_i^upp − 1 is the percentage of extra elements in the upper bound with respect to the size n_i of Y_i;

1/p_i^upp  (3.8)

may be called an overshoot factor (OSF).

Example 1. As a running example for basic rough set and precision parameters we shall use an example from [5] shown in Tables 2 and 3. □

Empirically one can observe that in many cases p_i^low is much smaller than p_i^upp.

4. Confusion matrices

Recall that Y = {Y_1, ..., Y_k} is a set of nonempty disjoint decision classes. A classifier is a mapping f : U → Y, where f(x) is interpreted as the decision class in which x is predicted to be. This leads to a set Ŷ_1, ..., Ŷ_i, ..., Ŷ_k of disjoint predictor classes, where

Ŷ_i = {x ∈ U : f(x) = Y_i} = f^{−1}(Y_i).  (4.1)

We may think of the sets Y_i as classes prescribed by a gold standard such as a classification by an expert, and of the sets Ŷ_i as observed values predicted to be in Y_i by the classifier f. True and predicted values may be cross-tabulated in a confusion matrix or error matrix. Perhaps the earliest occurrence of confusion matrices is in Pearson [13], who calls them contingency tables; see also [14,15] for early uses of a confusion matrix. In the matrix, the class names serve as column labels, respectively, row labels of a (k, k) matrix: An entry ⟨Ŷ_i, Y_j⟩ in the matrix is the number of elements of Y_j which are predicted to be in Y_i. The simplest confusion matrix has type (2, 2).
Table 4
A general confusion matrix.

                  True value
Predicted value   Y_1    ...  Y_j    ...  Y_k    Sum
Ŷ_1               n_11   ...  n_1j   ...  n_1k   n_1•
...               ...    ...  ...    ...  ...    ...
Ŷ_j               n_j1   ...  n_jj   ...  n_jk   n_j•
...               ...    ...  ...    ...  ...    ...
Ŷ_k               n_k1   ...  n_kj   ...  n_kk   n_k•
Sum               n_•1   ...  n_•j   ...  n_•k   n
Table 5
An example confusion matrix [26].

                 Y_1 Pine   Y_2 Cedar   Y_3 Oak   Y_4 Cottonwood   Total
Ŷ_1 Pine              35           4        12                2       53
Ŷ_2 Cedar             14          11         9                5       39
Ŷ_3 Oak               11           3        38               12       64
Ŷ_4 Cottonwood         1           0         4                2        7
Total                 61          18        63               21      163
In this case, there are two decision classes P and N, and the entries in the table are called true positives, false positives, true negatives, or false negatives:

                  True value
Predicted value   P                    N
P̂                 True Positive (TP)   False Positive (FP)
N̂                 False Negative (FN)  True Negative (TN)

Confusion matrices are a common tool in machine learning to gauge the quality of a classifier [16], and for an introduction to confusion matrices and associated topics we invite the reader to consult [17]. Many statistics are associated with a confusion matrix, such as sensitivity TP/(TP+FN), specificity TN/(TN+FP), precision TP/(FP+TP), and the area AUC under the ROC curve, which shows the dependence of sensitivity on 1 − specificity. The AUC is a widely used measure for the quality of supervised classification and diagnostic rules; it is, however, not without pitfalls, as Hand [18] demonstrates.

Confusion matrices have been used in rough set theory, for example, in [19–24]. We have investigated sensitivity and specificity in the rough set context in [6], and in the context of concept lattices in [25]. In all previous work, a rough set classifier is constructed in the first instance along with the usual indices such as α and γ, and its classification quality is estimated by the usual statistics of the confusion matrix. Our proposed procedure is, in some sense, the reverse: Using a confusion matrix obtained from a general classifier, the only restriction of which is that it is compatible with a fixed equivalence relation, we consider rough set indices obtained from the confusion matrix. These are approximations of e.g. α and γ which we call α*, γ* etc.

A general confusion matrix has the form shown in Table 4. There, the entry n_ij is the number of elements of class Y_j which are predicted to be in class Y_i, i.e. n_ij = |Ŷ_i ∩ Y_j|. Thus, Σ_{j=1}^{k} n_jj is the number of correctly classified elements according to the classifier f.

Example 2. As our running example for a confusion matrix, we have chosen data obtained from an interpreter of aerial photographs, shown in Table 5. In the experiment, experts were asked to identify types of trees in Yosemite Valley, California, USA. The data were reported by Congalton and Mead [26], who credit them to Lauer et al. [27]. □

As a next step we shall use the rough set philosophy of indistinguishability up to equivalence classes to define a classifier. In this situation, a classifier cannot assign decision classes to elements of U, but has to act on whole equivalence classes. Given partitions X and Y, we call a function f : X → Y a rough classifier with respect to X and Y, or, simply, a classifier if X and Y are understood.
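For the (2, 2) case, the statistics just listed can be read off directly. A small illustrative helper; the function name and the numbers in the example call are invented, not taken from the paper:

```python
def binary_stats(tp, fp, fn, tn):
    """Sensitivity, specificity and precision of a (2,2) confusion matrix."""
    sensitivity = tp / (tp + fn)   # TP/(TP+FN)
    specificity = tn / (tn + fp)   # TN/(TN+FP)
    precision   = tp / (fp + tp)   # TP/(FP+TP)
    return sensitivity, specificity, precision

# hypothetical counts for illustration only
sens, spec, prec = binary_stats(tp=8, fp=2, fn=4, tn=6)
```

With these counts, sensitivity ≈ 0.667, specificity = 0.75 and precision = 0.8.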
Table 6
The granule frequency matrix.

      Y_1   Y_2   Sum
X_1   1     1     2
X_2   0     1     1
X_3   0     1     1
X_4   2     0     2
Sum   3     3     6
The meaning of the function f is that each element of X_i is predicted to be in f(X_i). A classifier in the rough set sense is a specialization of a general classifier, where all elements of a class X_i are predicted to be in the same decision class. As in (4.1), we obtain the predictor classes Ŷ_j = ⋃ f^{−1}(Y_j). In the rough set situation, each nonempty Ŷ_j is a union of equivalence classes of X. If Ŷ_j = ∅, then no element of U is predicted to be in Y_j by any class X_i using f. Arguably, the simplest classifier is a maximum classifier given by f(X_i) = Y_j, where |Y_j| = max{|Y_r| : 1 ≤ r ≤ k}. Note that such Y_j need not be unique, and a choice needs to be made. Since f is a function, the collection {f^{−1}(Y_j) : 1 ≤ j ≤ k, f^{−1}(Y_j) ≠ ∅} is a partition of X, and therefore, keeping in mind that X partitions U, it follows that

⋃{⋃ f^{−1}(Y_j) : 1 ≤ j ≤ k, f^{−1}(Y_j) ≠ ∅} = U.

We are now in a position to define the (rough) confusion matrix M_f of the rough classifier f. It has row labels Ŷ_i, column labels Y_j, and entries

⟨Ŷ_i, Y_j⟩ = n_ij = Σ{c_sj : f(X_s) = Y_i}, if f^{−1}(Y_i) ≠ ∅; 0, otherwise.  (4.2)

Thus,

n_ij = Σ_{f(X_s)=Y_i} |X_s ∩ Y_j|.  (4.3)
Given the partitions X and Y, the corresponding confusion matrix can be obtained in three steps:

1. Write the granule frequency matrix M obtained from X and Y as in Table 1.
2. Relabel the rows of M by f(X_i):

Predicted     Y_1    ...  Y_i    ...  Y_k    Sum
f(X_1)        c_11   ...  c_1i   ...  c_1k   c_1
...           ...    ...  ...    ...  ...    ...
f(X_j)        c_j1   ...  c_ji   ...  c_jk   c_j
...           ...    ...  ...    ...  ...    ...
f(X_m)        c_m1   ...  c_mi   ...  c_mk   c_m
Class size    n_1    ...  n_i    ...  n_k    n

3. Aggregate the frequencies of the rows with the same label according to (4.2). If f^{−1}(Y_j) = ∅, fill the row with label Ŷ_j with 0s. Sort the rows according to the indices of their labels.

The result will have the form shown in Table 4. Note that the column sums in Table 4 are equal to the corresponding values in Table 1, as the collection of nonempty sets Ŷ_j is a partition of U.
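The three steps amount to a single aggregation over the granule frequency matrix. A sketch, using the data of Tables 6 and 8 (function and variable names are ours):

```python
def rough_confusion(C, f, k):
    """Aggregate the rows of the granule frequency matrix C (m rows, k columns)
    whose granules receive the same label under f, cf. (4.2).
    f[s] is the (0-based) index j with f(X_{s+1}) = Y_{j+1}."""
    M = [[0] * k for _ in range(k)]
    for row, j in zip(C, f):
        for i, c in enumerate(row):
            M[j][i] += c          # rows with equal label are summed
    return M

# Granule frequency matrix of Table 6 with
# f(X_1) = f(X_4) = Y_1 and f(X_2) = f(X_3) = Y_2
C = [[1, 1], [0, 1], [0, 1], [2, 0]]
M = rough_confusion(C, [0, 1, 1, 0], k=2)
```

The result is the matrix of Table 8, [[3, 1], [0, 2]].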
The procedure is illustrated with an example.

Example 3. Suppose that X = {X_1, ..., X_4} and Y = {Y_1, Y_2}. Define f : X → Y by f(X_1) = f(X_4) = Y_1, and f(X_2) = f(X_3) = Y_2. The values and the construction process are shown in Tables 6, 7, and 8. Note that f classifies five of the six elements of U correctly, so that its success ratio is 5/6, whereas γ = 4/6. □

5. Refining the rough classifier from the confusion matrix

Thus far, we have put no restrictions on the rough classifier f. In a more optimistic spirit, we shall assume that f satisfies the condition

X_i ∩ f(X_i) ≠ ∅.  (5.1)
Table 7
The relabeled matrix.

      Y_1   Y_2   Sum
Y_1   1     1     2
Y_2   0     1     1
Y_2   0     1     1
Y_1   2     0     2
Sum   3     3     6

Table 8
The confusion matrix.

      Y_1   Y_2   Sum
Ŷ_1   3     1     4
Ŷ_2   0     2     2
Sum   3     3     6
In this case, at least one element of X_i is classified correctly. If X_i is a deterministic class, say, X_i ⊆ Y_j, then X_i ∩ Y_t = ∅ for t ≠ j, and therefore, (5.1) forces f(X_i) = Y_j.

Since (5.1) is equivalent to X_i ⊆ upp(f(X_i)) by (2.2), we can think of Ŷ_j as a lower bound of the rough upper approximation of Y_j. Again using the fact that X and Y are sets of classes of equivalence relations, namely, that

Y_j = ⋃{Y_j ∩ X_i : 1 ≤ i ≤ m},

we can sharpen this bound: Suppose that Ŷ_j = X_s1 ∪ ... ∪ X_sn and define

nu*_j = n_jj + Σ_{i≠j} (n_ij + n_ji) ≤ nu_j.  (5.2)

If n_ij ≠ 0, then n_ii ≠ 0 by (5.1), and therefore, there is some X_t such that f(X_t) = Y_i and X_t ∩ Y_j ≠ ∅. The latter implies X_t ⊆ upp(Y_j), and we obtain an even sharper bound as

nu**_j = n_jj + Σ_{i≠j} (n_ij + n_ji + Ind(n_ij)) ≤ nu_j.  (5.3)

Turning to the lower approximation, we first suppose that Y_j is a rough definable class, say, Y_j = X_s1 ∪ ... ∪ X_sn. Then, Ŷ_j = X_s1 ∪ ... ∪ X_sn, and we set

nl*_j = n_jj = |Y_j|.  (5.4)

In this case, the classifier f coincides with the rough lower approximation by correctly predicting the decision class membership for each y ∈ Ŷ_j. We can use this observation to define approximations γ* and α*_i of the standard rough set measures γ and α_i:

γ* = Σ_i nl*_i / n = Σ_i n_ii / n ≥ γ,

α*_i = nl*_i / nu*_i = n_ii / (n_ii + Σ_{j≠i} (n_ij + n_ji)) ≥ α_i.

Similarly, Ŷ_j is a crude upper bound of the rough lower approximation:

low(Y_j) = ⋃{X_i : X_i ⊆ Y_j} ⊆ ⋃{X_i : f(X_i) = Y_j} ⊆ Ŷ_j.

An argument similar to the one used for upper bounds leads to a refined upper bound of the number of rough lower bound elements:

nl**_i = n_ii − Ind(Σ_{j≠i} n_ij) ≥ nl_i,

and

Σ_{i=1}^{k} nl**_i ≥ Σ_{i=1}^{k} nl_i.
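The bounds (5.2)–(5.4) and the refined lower bound can be computed from the confusion matrix alone. A sketch (names ours), checked against the Table 5 data and the corresponding rows of Table 9:

```python
def star_estimators(M):
    """nl* and nu* per (5.2)/(5.4), nl** and nu** per (5.3) and the refined
    lower bound, from a k x k confusion matrix M (rows = predicted)."""
    k = len(M)
    n = sum(map(sum, M))
    row = [sum(r) for r in M]
    col = [sum(M[i][j] for i in range(k)) for j in range(k)]
    nl_s  = [M[j][j] for j in range(k)]                          # nl*_j = n_jj
    nu_s  = [row[j] + col[j] - M[j][j] for j in range(k)]        # (5.2)
    nl_ss = [M[j][j] - (1 if row[j] > M[j][j] else 0)            # Ind over row j
             for j in range(k)]
    nu_ss = [nu_s[j] + sum(1 for i in range(k) if i != j and M[i][j] > 0)
             for j in range(k)]                                  # (5.3)
    alpha_s = [nl_s[j] / nu_s[j] for j in range(k)]
    gamma_s = sum(nl_s) / n
    return nl_s, nu_s, nl_ss, nu_ss, alpha_s, gamma_s

# Confusion matrix of Table 5
M5 = [[35, 4, 12, 2], [14, 11, 9, 5], [11, 3, 38, 12], [1, 0, 4, 2]]
nl_s, nu_s, nl_ss, nu_ss, alpha_s, gamma_s = star_estimators(M5)
```

For Table 5 this yields nl* = (35, 11, 38, 2), nu* = (79, 46, 89, 26), nl** = (34, 10, 37, 1), nu** = (82, 48, 92, 29) and γ* ≈ 0.528.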
Next, we shall restrict the classifier further. For each 1 ≤ i ≤ m let

M_i = {j : 1 ≤ j ≤ k and c_ij = max{c_ir : 1 ≤ r ≤ k}}.

M_i is the set of all indices which have a maximum value in row i of the granule frequency matrix of Table 1. Since X_i ≠ ∅ and Y partitions U, it follows that c_ij ≠ 0 for at least one j, and that M_i ≠ ∅. A max-row classifier is defined by

f(X_i) = Y_j for some j ∈ M_i.  (5.5)

Then, |X_i ∩ Y_j| = max{c_ir : 1 ≤ r ≤ k}. Thus, X_i ⊆ Ŷ_j implies that c_ij is a maximum in row i. We can use this observation to establish a sharper upper bound of nl_j: Suppose that Ŷ_j = ⋃{X_s1, ..., X_sn} and consider the partial granule matrix

Granule in Ŷ_j    Y_1      ...  Y_j      ...  Y_k      Granule size
X_s1              c_s1,1   ...  c_s1,j   ...  c_s1,k   c_s1
...               ...      ...  ...      ...  ...      ...
X_si              c_si,1   ...  c_si,j   ...  c_si,k   c_si
...               ...      ...  ...      ...  ...      ...
X_sn              c_sn,1   ...  c_sn,j   ...  c_sn,k   c_sn
Confusion size    n_j1     ...  n_jj     ...  n_jk

Since a maximum of each row is in column Y_j, it follows that for all 1 ≤ t ≤ k, t ≠ j, we have n_jt ≤ n_jj, and therefore, max{n_jt : 1 ≤ t ≤ k, t ≠ j} ≤ n_jj. None of the elements of U which contribute to some n_jt for j ≠ t can contribute to n_jj, and thus,

nl^m_j = n_jj − max{n_jt : 1 ≤ t ≤ k, t ≠ j} ≥ nl_j,

and

Σ_{j=1}^{k} nl^m_j ≥ Σ_{j=1}^{k} nl_j.

Finally, we estimate the rough upper bound of Y_j using the max-row classifier. We know already that nu*_j = n_jj + Σ_{i≠j} (n_ji + n_ij) ≤ nu_j. Since for each 1 ≤ i ≤ k, i ≠ j, the max-rule implies n_ii ≥ n_ij, we can add n_ij to the cardinality of upp(Y_j). Thus, we obtain

nu^m_j = n_jj + Σ_{i≠j} (n_ji + 2 · n_ij) ≤ nu_j.  (5.6)

All point estimators for Example 2 are shown in Table 9. We have used the max-row classifier for the nl^m_i, ..., α^m_i, γ^m values, using as entries max{0, nl^m_i}, respectively, max{0, α^m_i}.

In summary, we have presented three different types of indices related to α_i and γ, indicated by the exponents *, ** and m, whereby the indices α_i^† and γ^† exceed or are equal to the (unknown granule based) values α_i and γ; here and below, the exponent † equals * or **.

• The values α*_i and γ* correspond to the "classical" procedure, where γ* is identical to the "percentage of correctly classified classes". Both values form optimistic bounds for the analysis of confusion matrices.
• The values α**_i and γ** take advantage of a simple property of rough sets. These values thus form an even better estimate than the indices α*_i and γ*, which are based on the classic rough estimates α_i and γ.
• The values α^m_i and γ^m are valid estimates if the classifier in use is a maximum classifier. In this case, the estimates α_i^† and γ^† of a confusion matrix can be further tightened.

6. Interpretation of α and γ based on the confusion matrix

In case of the "classical" analysis of confusion matrices, there is a nice relationship between a mean α value and the γ value, using nl*_i and nu*_i as lower and upper bound frequencies. The α-accuracy (3.1) is connected to the confusion matrix by
Table 9
Estimators for Table 5.

Basic rough set parameters
           Y_1     Y_2     Y_3     Y_4     Sum
n_i        61      18      63      21      163
p_i^low    0.574   0.611   0.603   0.095
p_i^upp    0.772   0.391   0.708   0.808
α_i        0.443   0.239   0.427   0.077   γ = 0.528

Estimators obtained from the confusion matrix
nl*_i      35      11      38      2       86
nu*_i      79      46      89      26      240
nb*_i      44      35      51      24      154
nl**_i     34      10      37      1       82
nu**_i     82      48      92      29      251
nb**_i     48      38      55      28      169
nl^m_i     23      0       26      0
nu^m_i     105     64      114     45
nb^m_i     82      64      88      45
α*_i       0.443   0.239   0.427   0.077   γ* = 0.528
α**_i      0.415   0.208   0.402   0.034   γ** = 0.503
α^m_i      0.219   0.0     0.228   0.0     γ^m = 0.300
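For a max-row classifier, the bounds nl^m and nu^m of the previous section can be sketched as follows (names ours; as in Table 9, negative nl^m values are truncated at 0):

```python
def maxrow_estimators(M):
    """nl^m_j = n_jj - max_{t != j} n_jt (truncated at 0) and
    nu^m_j = n_jj + sum_{i != j} (n_ji + 2 n_ij), cf. (5.6)."""
    k = len(M)
    n = sum(map(sum, M))
    col = [sum(M[i][j] for i in range(k)) for j in range(k)]
    nl_m = [max(0, M[j][j] - max(M[j][t] for t in range(k) if t != j))
            for j in range(k)]
    nu_m = [M[j][j] + (sum(M[j]) - M[j][j]) + 2 * (col[j] - M[j][j])
            for j in range(k)]
    alpha_m = [nl_m[j] / nu_m[j] for j in range(k)]
    gamma_m = sum(nl_m) / n
    return nl_m, nu_m, alpha_m, gamma_m

# Confusion matrix of Table 5
M5 = [[35, 4, 12, 2], [14, 11, 9, 5], [11, 3, 38, 12], [1, 0, 4, 2]]
nl_m, nu_m, alpha_m, gamma_m = maxrow_estimators(M5)
```

For the Table 5 data this gives nl^m = (23, 0, 26, 0) and γ^m = 49/163 ≈ 0.30, in line with the nl^m and γ^m rows of Table 9 up to rounding.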
α_i = nl*_i / nu*_i = n_ii / (n_i• + n_•i − n_ii).

A mean value of the α_i measures is therefore given by

ᾱ = Σ_i n_ii / Σ_i (n_i• + n_•i − n_ii),

and since

Σ_i (n_i• + n_•i − n_ii) = 2n − γn = (2 − γ)n,

we obtain

ᾱ = γ / (2 − γ).  (6.1)

Note that ᾱ is a measure of approximation quality as well as γ, but it additionally uses the information about the upper bounds. It is interesting that the mean α-value is a function of γ only. As ᾱ and γ are strictly monotonically connected, any procedure which maximizes γ will maximize ᾱ as well. It is easy to see that ᾱ ≤ γ and ᾱ ≠ γ, if γ ∈ (0, 1). A simple calculation shows that the maximal difference of ᾱ and γ occurs when γ = 2 − √2, with a difference of 3 − 2√2. The mean overshoot factor is given by

OSF̄ = Σ_i (n_i• + n_•i − n_ii) / n = 2 − γ.

As in the case of a single decision class, the mean α value is given by the fraction of the (lower) approximation quality and the (upper) overshoot factor: ᾱ = γ / OSF̄.

Since the mapping f(γ) = γ/(2 − γ) is strictly monotone, we observe that ᾱ and γ are exchangeable measurements of roughness, which means that including the upper bounds in the computation of ᾱ does not contain further information. This relationship only holds in the "classical" analysis of confusion matrices.

At times, the decision attribute is too finely grained with respect to the predictor in order to obtain significant results. It will be helpful to have a framework to decide whether decision categories should be merged in order to arrive at rough prediction rules with better accuracy. The classical approach of confusion matrix analysis offers us the key for such a framework: The index

γ_ij = (n_ii + n_jj) / (n_ii + n_jj + n_ij + n_ji)

exhibits the conditional approximation quality of the joined decision category Y_i ∪ Y_j. If the decision attribute has only two classes, it is easy to see that γ_ij = γ. Otherwise, if γ_ij is very low with respect to the overall γ, we obtain a reasonable candidate for joining categories. Therefore, we should inspect
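Both the mean accuracy (6.1) and the merge index γ_ij are simple functions of the confusion matrix; a sketch (names ours):

```python
def mean_alpha(M):
    """Weighted mean of the alpha_i; equals gamma/(2 - gamma) by (6.1)."""
    k = len(M)
    n = sum(map(sum, M))
    diag = sum(M[i][i] for i in range(k))
    return diag / (2 * n - diag)

def gamma_join(M, i, j):
    """Conditional approximation quality of the joined category Y_i and Y_j."""
    return (M[i][i] + M[j][j]) / (M[i][i] + M[j][j] + M[i][j] + M[j][i])

# Confusion matrix of Table 5
M5 = [[35, 4, 12, 2], [14, 11, 9, 5], [11, 3, 38, 12], [1, 0, 4, 2]]
gamma = sum(M5[i][i] for i in range(4)) / 163
```

For Table 5, γ ≈ 0.528 and ᾱ = γ/(2 − γ) ≈ 0.358; the pairwise γ_ij then indicate which categories are candidates for merging.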
l nll
f i j = γ − γi j =
2
−
n
nii + n j j
1
nii + n j j + ni j + n ji
2 3
3 4
Parameterization using the Delta method [28] results in

f_ij = a + b + c − (a + b)/(a + b + d + e).   (6.2)

Here, â = n_ii/n, b̂ = n_jj/n, ĉ = Σ_{l≠i, l≠j} n_ll/n, d̂ = n_ij/n and ê = n_ji/n. The quite complicated term for the computation of the standard error of f_ij is determined in C.1. A simple application of the result is the computation of an optimal (smaller and less finely grained) confusion matrix, which optimizes the γ-index by the following algorithm:

1. Start with the confusion matrix M(k, k).
2. Compute γ(k) and γ_ij(k).
3. Choose g(i, j) = min(γ_ij(k)).
4. If γ(k) − g(i, j) does not differ significantly from 0, stop.
5. Merge categories i and j, resulting in the confusion matrix M(k − 1, k − 1).
6. Go to step 1.

7. Odds ratios as approximation measures

The interpretation of the precision index p_i^upp faces a problem, since we need additional information to conclude that a precision index is "low": While p_i^low = 1 reflects a perfect prediction with deterministic rules, we will always have n_i/n ≤ p_i^upp by (3.6). A welcome property of an upper bound precision measure μ would be to signal upp(Y_i) = U by μ(Y_i) = 0.

To connect p_i^low and p_i^upp, we have suggested in [5] to explore √(p_i^low · p_i^upp) = √α_i as the geometric mean of lower and upper approximation precision. A closer look, however, reveals that, once again, p_i^upp causes trouble. The best interpretation of a mean is given when there is no variance, which means that p_i^low = p_i^upp holds. Even though this situation may occur (and is easily computed), it seems quite rare for empirical data; for example, if p_i^low < n_i/n, then p_i^low < p_i^upp.

A suitable index for measuring the tightness of the upper bound uses a ratio of odds. Odds ratios offer a symmetric interpretation of lower and upper precision, and remove the bias in the upper approximation. Furthermore, products of odds (or sums of log-odds) are odds as well, which is a nice property for building logistic type models based on odds. For details on the properties of odds ratios we invite the reader to consult [29, Chapter 2.2.3] or [30]. Odds ratios are often used to estimate the relationship between two binary variables, in particular in medical statistics, e.g. to determine whether "a particular exposure is a risk factor for a particular outcome" [31]. Their role in linear regression models is highlighted by Menard [32], and in multivariate analysis of discrete data by Clogg and Shockey [33].

First, define

O_i^low = (nl_i/n) / (1 − nl_i/n) = nl_i/(n − nl_i),

O_i^upp = (nu_i/n) / (1 − nu_i/n) if nu_i < n, and O_i^upp = 0 otherwise,

O_i = (n_i/n) / (1 − n_i/n) = n_i/(n − n_i).

O_i^low (respectively O_i^upp, O_i) is the ratio of the chance that an arbitrarily drawn x ∈ X is in low(Y_i) (respectively upp(Y_i), Y_i) to the chance that it is not in low(Y_i) (respectively upp(Y_i), Y_i). We now obtain the range corrected value as the ratio

OR_i^upp = O_i / O_i^upp = (p_i^upp − n_i/n) / (1 − n_i/n) if upp(Y_i) ≠ U, and OR_i^upp = 0 otherwise.
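As a quick numerical sanity check, the two forms of OR_i^upp given above (the ratio of odds O_i/O_i^upp and the range-corrected precision) can be compared in a few lines of Python. The specific counts below are illustrative assumptions, not taken from the paper's examples:

```python
def odds(p):
    """Odds of a proportion p, i.e. p / (1 - p)."""
    return p / (1.0 - p)

def or_upp(n, n_i, nu_i):
    """Upper odds ratio OR_i^upp = O_i / O_i^upp (0 if upp(Y_i) = U)."""
    if nu_i == n:                     # upp(Y_i) = U, i.e. nu_i = n
        return 0.0
    return odds(n_i / n) / odds(nu_i / n)

def or_upp_precision_form(n, n_i, nu_i):
    """Equivalent form (p_i^upp - n_i/n) / (1 - n_i/n) with p_i^upp = n_i/nu_i."""
    if nu_i == n:
        return 0.0
    p_upp = n_i / nu_i
    return (p_upp - n_i / n) / (1.0 - n_i / n)

# Illustrative counts (assumed): n = 100 objects, |Y_i| = 20, |upp(Y_i)| = 35
print(or_upp(100, 20, 35))                   # ratio-of-odds form
print(or_upp_precision_form(100, 20, 35))    # range-corrected precision form
```

The boundary behaviour matches the text: for nu_i = n_i (exact upper approximation) the value is 1, and for nu_i = n it is 0 by convention.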
Noting that

(n_i/n) / (1 − n_i/n) ≤ (nu_i/n) / (1 − nu_i/n),

since n_i ≤ nu_i and 1 − nu_i/n ≤ 1 − n_i/n, we see that 0 ≤ OR_i^upp ≤ 1. Furthermore,

OR_i^upp = 1 ⇐⇒ upp(Y_i) = Y_i,
OR_i^upp = 0 ⇐⇒ upp(Y_i) = U,

the latter, since there are at least two decision classes. Thus, the smaller OR_i^upp, the larger is upp(Y_i) relative to Y_i. Our odds ratio OR_i^upp is special, since it does not exceed 1, which is in general not true for arbitrary odds ratios.
Example 4. Continuing Example 2,

OR_1^upp = (0.667 − 30/200) / (1 − 30/200) = 0.608,
OR_2^upp = (0.833 − 50/200) / (1 − 50/200) = 0.778,
OR_3^upp = (0.800 − 120/200) / (1 − 120/200) = 0.5. □
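The values of Example 4 can be reproduced from counts. The upper approximation sizes nu_i below are assumptions chosen to be consistent with the rounded precisions 0.667, 0.833 and 0.800 reported above (the raw counts appear earlier in the paper):

```python
n = 200
n_i  = [30, 50, 120]    # class sizes n_i from Example 4
nu_i = [45, 60, 150]    # assumed |upp(Y_i)|, consistent with p_i^upp = n_i/nu_i

ors = []
for ni, nui in zip(n_i, nu_i):
    p_upp = ni / nui                                        # upper precision p_i^upp
    ors.append(round((p_upp - ni / n) / (1 - ni / n), 3))   # range-corrected odds ratio
print(ors)   # [0.608, 0.778, 0.5]
```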
An analogous index can be given for the lower bound by defining

OR_i^low = O_i^low / O_i   (7.1)
 = [(nl_i/n) / (1 − nl_i/n)] / [(n_i/n) / (1 − n_i/n)].   (7.2)

OR_i^low is well defined since there are at least two decision classes, and thus 0 ≤ nl_i ≤ n_i < n. Clearly, 0 ≤ OR_i^low ≤ 1 holds as well, with OR_i^low = 0 if and only if nl_i = 0, and OR_i^low = 1 if and only if nl_i = n_i.

In case of confusion matrix data, we can utilize odds by replacing nl_i and nu_i by their counterparts nl_i^*, nl_i^**, nl_i^m and nu_i^*, nu_i^**, nu_i^m, respectively.

The product of the odds ratios, called the rough odds ratio (ROR), is meaningful as well: The expression

ROR_i = OR_i^low · OR_i^upp = [(nl_i/n) / (1 − nl_i/n)] / [(nu_i/n) / (1 − nu_i/n)] if upp(Y_i) ≠ U, and ROR_i = 0 otherwise,

is an odds ratio with the property 0 ≤ ROR_i ≤ 1. Moreover, ROR_i is a "truly" rough measure, as it is defined only by the cardinalities of the approximations, and not of Y_i itself. The square root √ROR_i, as the geometric mean of the lower and upper odds ratio, shows that rough approximation in some sense may be thought of as a quadratic function.

Both OR_i^low and OR_i^upp measure roughness, and it may be interesting to characterize the dependency of nl_i and nu_i when OR_i^low = OR_i^upp holds. First, we consider the extreme cases: If OR_i^low = OR_i^upp = 1, then low(Y_i) = upp(Y_i) = Y_i. If OR_i^low = OR_i^upp = 0, then nl_i = 0 and nu_i = n. Otherwise, assuming 0 < nl_i ≤ nu_i < n, it is straightforward to compute that

OR_i^low = OR_i^upp ⇐⇒ O_i^upp = O_i² / O_i^low ⇐⇒ O_i^low · O_i^upp = O_i².

As binary decision variables play a prominent role in data analysis, it is interesting to see how ROR_i behaves in this situation. Given two outcomes with corresponding classes Y_1 and Y_2, we first note that
nl_1 + nu_2 = n = nl_2 + nu_1.   (7.3)

Suppose that nu_i ≠ n. Then ROR_1 can be written as

ROR_1 = [(nl_1/n) / (1 − nl_1/n)] / [(nu_1/n) / (1 − nu_1/n)]
 = [nl_1/(n − nl_1)] / [nu_1/(n − nu_1)]
 = [nl_1/nu_2] / [nu_1/nl_2]   (by (7.3))
 = (nl_1/nu_1) · (nl_2/nu_2)
 = α_1 · α_2.

Obviously, ROR_2 = ROR_1, and we obtain a single measure of approximation for this situation. This makes sense, because the lower and upper approximation of Y_1 depend on the upper and lower approximation of Y_2 and vice versa. Since there is no degree of freedom, the index for the approximation should not vary if we change from category Y_2 to category Y_1; this is the case when using ROR, but not for the classical α-values.

In summary, this section presents three indices based on odds ratios (ORs) that can be placed alongside the known rough set indices (which are based on probabilities); each of these ORs is a transformation that takes into account the finite number of elements in the respective valid population:
• OR_i^low is associated with the probability p_i^low,
• OR_i^upp is associated with the probability p_i^upp accordingly,
• and ROR_i is the counterpart of α_i.

Since the ORs can be formed in rough set analysis, the corresponding indices for the estimates of the approximation quality in confusion matrices are also possible; in particular, ROR_i^*, ROR_i^** and ROR_i^m are the counterparts of α_i^*, α_i^** and α_i^m.

8. Standard errors of rough odds ratios

As above, we will use parameterization and the Delta method [28] to approximate the standard errors of the odds ratios. As a first step, we notice that nl_i ≤ n_i ≤ nu_i and that the observations accounting for nl_i and n_i − nl_i, respectively n_i and nu_i − n_i, as well as nl_i and nu_i − nl_i, do not overlap. Therefore, we may assume that the observations in different samples are independent realizations of their populations. Our basis in this section is the generic parameterized OR

OR(a, b) = a/(1 − a) · (1 − (a + b))/(a + b).

Using OR(a, b) enables us to derive a generic estimator for the standard error of any OR we have described earlier. For example, â = nl_i/n and b̂ = (n_i − nl_i)/n are used in case of analyzing OR_i^low.

To facilitate the computation of the standard error, we take logarithms and obtain the parameterized function

f(a, b) = ln(OR(a, b)) = ln(a) − ln(1 − a) + ln(1 − (a + b)) − ln(a + b).

From f we can estimate the variance as

Var(f̂) = b(1 − b) / (n · a(1 − a)(a + b)(1 − (a + b))).   (8.1)

Details are given in C.2. A nice feature of the parameterized variance estimator given by (8.1) is its flexibility: it is not only applicable for lower and upper precision, but also for measures in confusion matrices. It is only necessary to use proper substitutions for the parameters a and b in the estimate of OR(a, b).
Substituting the frequencies n_i and nl_i we obtain the estimate

Var(ln(OR_i^low)) ≈ n(n_i − nl_i)(n − n_i + nl_i) / (n_i(n − n_i) · nl_i(n − nl_i)).   (8.2)

As OR_i^upp and α_i^* are odds ratios of a similar structure as OR_i^low, we redefine the parameters a and b and use the same approximation formula as in the case of OR_i^low. Regarding OR_i^upp, we define â = n_i/n and b̂ = (nu_i − n_i)/n, resulting in

Var(ln(OR_i^upp)) = n(nu_i − n_i)(n − nu_i + n_i) / (n_i(n − n_i) · nu_i(n − nu_i)).   (8.3)

Finally, ROR_i is parameterized by â = nl_i/n and b̂ = (nu_i − nl_i)/n, and we obtain

Var(ln(ROR_i)) = n(nu_i − nl_i)(n − nu_i + nl_i) / (nu_i(n − nu_i) · nl_i(n − nl_i)).   (8.4)

A summary of the substitutions is presented in the following table:
              â          b̂
OR_i^low      nl_i       n_i − nl_i
OR_i^upp      n_i        nu_i − n_i
ROR_i         nl_i       nu_i − nl_i
OR_i^*low     nl_i^*     n_i − nl_i^*
OR_i^*upp     n_i        nu_i^* − n_i
ROR_i^*       nl_i^*     nu_i^* − nl_i^*
OR_i^**low    nl_i^**    n_i − nl_i^**
OR_i^**upp    n_i        nu_i^** − n_i
ROR_i^**      nl_i^**    nu_i^** − nl_i^**
OR_i^m,low    nl_i^m     n_i − nl_i^m
OR_i^m,upp    n_i        nu_i^m − n_i
ROR_i^m       nl_i^m     nu_i^m − nl_i^m

(each entry to be divided by n, as in the parameterizations above)
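Each row of the table plugs into the generic variance (8.1). The sketch below (with assumed illustrative counts) checks that the first three substitutions reproduce the closed forms (8.2)–(8.4):

```python
def var_generic(a, b, n):
    """Equation (8.1): Var(ln OR(a, b)) = b(1-b) / (n a(1-a)(a+b)(1-(a+b)))."""
    return b * (1 - b) / (n * a * (1 - a) * (a + b) * (1 - (a + b)))

# Illustrative counts (assumed), with nl_i <= n_i <= nu_i
n, nl, ni, nu = 200, 20, 30, 45

# Closed forms (8.2)-(8.4)
v_low = n * (ni - nl) * (n - ni + nl) / (ni * (n - ni) * nl * (n - nl))
v_upp = n * (nu - ni) * (n - nu + ni) / (ni * (n - ni) * nu * (n - nu))
v_ror = n * (nu - nl) * (n - nu + nl) / (nu * (n - nu) * nl * (n - nl))

# Generic (8.1) with the table's substitutions (each count divided by n)
g_low = var_generic(nl / n, (ni - nl) / n, n)
g_upp = var_generic(ni / n, (nu - ni) / n, n)
g_ror = var_generic(nl / n, (nu - nl) / n, n)
print(v_low, v_upp, v_ror)
```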
Example 5. Continuing Example 1, we have the following values:

            Y_1             Y_2             Y_3
OR_i^low    0.459           0.750           0.545
SE          0.213           0.094           0.120
95% CI      [0.312, 0.677]  [0.630, 0.893]  [0.445, 0.668]
OR_i^upp    0.608           0.778           0.500
SE          0.123           0.078           0.119
95% CI      [0.476, 0.776]  [0.668, 0.906]  [0.396, 0.631]
ROR_i       0.279           0.583           0.273
SE          0.230           0.116           0.150
95% CI      [0.178, 0.438]  [0.465, 0.732]  [0.203, 0.366]
√ROR_i      0.528           0.764           0.522
95% CI      [0.422, 0.662]  [0.682, 0.855]  [0.451, 0.605]

As √ROR_i is the geometric mean of the lower bound OR_i^low and the upper bound OR_i^upp,

• it is a compromise of the two approximation measures, and
• the width of the 95% CI of √ROR_i is smaller than that of OR_i^low and OR_i^upp.

The latter is due to the fact that √ROR is a mean of the other odds ratios, and that standard errors of means are smaller than standard errors of the single values. □
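The table can be cross-checked for internal consistency: each ROR_i should equal the product OR_i^low · OR_i^upp, and √ROR_i its square root, up to the rounding of the reported values:

```python
or_low   = [0.459, 0.750, 0.545]   # OR_i^low, from the table above
or_upp   = [0.608, 0.778, 0.500]   # OR_i^upp
ror      = [0.279, 0.583, 0.273]   # ROR_i
sqrt_ror = [0.528, 0.764, 0.522]   # sqrt(ROR_i)

for lo, up, r, s in zip(or_low, or_upp, ror, sqrt_ror):
    assert abs(lo * up - r) < 1e-3     # ROR_i = OR_i^low * OR_i^upp (up to rounding)
    assert abs(r ** 0.5 - s) < 1e-3    # geometric-mean relation
print("table is internally consistent")
```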
We close this section with the values for our running example of data from a confusion matrix.
Example 6. Continuing Example 2, we have the following values:

              Y_1             Y_2             Y_3             Y_4
OR_i^*low     0.457           0.583           0.483           0.084
SE            0.144           0.202           0.137           0.682
95% CI        [0.345, 0.607]  [0.392, 0.866]  [0.369, 0.631]  [0.022, 0.319]
OR_i^*upp     0.636           0.316           0.524           0.779
SE            0.119           0.141           0.116           0.174
95% CI        [0.504, 0.803]  [0.240, 0.416]  [0.417, 0.658]  [0.554, 1.097]
ROR_i^*       0.291           0.184           0.253           0.065
SE            0.169           0.285           0.173           0.689
95% CI        [0.209, 0.405]  [0.105, 0.322]  [0.180, 0.354]  [0.017, 0.252]
√ROR_i^*      0.539           0.429           0.503           0.256
95% CI        [0.457, 0.636]  [0.324, 0.567]  [0.424, 0.595]  [0.130, 0.502]
OR_i^**low    0.441           0.527           0.466           0.042
SE            0.148           0.225           0.141           0.982
95% CI        [0.330, 0.589]  [0.339, 0.818]  [0.354, 0.614]  [0.006, 0.286]
OR_i^**upp    0.591           0.297           0.486           0.683
SE            0.121           0.147           0.118           0.177
95% CI        [0.466, 0.748]  [0.223, 0.397]  [0.386, 0.613]  [0.483, 0.966]
ROR_i^**      0.260           0.157           0.227           0.029
SE            0.176           0.303           0.178           0.989
95% CI        [0.184, 0.367]  [0.087, 0.283]  [0.160, 0.321]  [0.004, 0.198]
√ROR_i^**     0.510           0.396           0.476           0.169
95% CI        [0.429, 0.606]  [0.295, 0.532]  [0.400, 0.567]  [0.063, 0.445]
OR_i^m,low    0.275           0.000           0.301           0.000
SE            0.197           0.000           0.184           0.000
95% CI        [0.187, 0.404]  —               [0.210, 0.432]  —
OR_i^m,upp    0.330           0.192           0.271           0.388
SE            0.140           0.189           0.138           0.179
95% CI        [0.251, 0.435]  [0.133, 0.278]  [0.206, 0.355]  [0.273, 0.551]
ROR_i^m       0.091           0.000           0.082           0.000
SE            0.235           0.000           0.233           0.000
95% CI        [0.057, 0.144]  —               [0.052, 0.129]  —
√ROR_i^m      0.301           0.000           0.286           0.000
95% CI        [0.239, 0.379]  —               [0.228, 0.359]  —
9. Summary and future work

In this paper we have investigated the connection between rough sets and confusion matrices, and have discussed the classical rough set measures α and γ in this context. In particular, we have proposed a rough classifier and a rough confusion matrix, and have shown how upper and lower bounds of rough approximation measures can be obtained. We have also presented a measure and an algorithm for merging two finely grained categories which may be used to improve the rough approximation quality γ. Since the precision measures p_i^low and p_i^upp which make up α_i lack symmetry, we propose to use odds ratios as range corrected values. Since estimates based on empirical data may not be accurate, we have provided an error theory for the odds ratios. Examples for the concepts are given throughout.

In future work we shall investigate the usefulness of the indices for decision making, using test data and simulations. In particular, we shall compare the performance of rough set classifiers with other machine learning methods based on our rough confusion matrix indices. We shall also incorporate and position the results into our general approach to statistics in the rough set model [34,35].

Declaration of competing interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgements

We thank the anonymous reviewers for careful reading and helpful suggestions. I. Düntsch gratefully acknowledges support by the National Natural Science Foundation of China, Grant No. 61976053.
Appendix A. Table of simple rough measures

Generic rough set measures

Name                           Notation     Estimate
Rough lower precision          p_i^low      nl_i / n_i
Rough upper precision          p_i^upp      n_i / nu_i
Rough alpha precision          α_i          nl_i / nu_i
Rough approximation quality    γ            (Σ_i nl_i) / n
Rough alpha                    α            γ / (2 − γ)

Standard measures for evaluation of confusion matrices

Largest lower precision        p_i^*low     n_ii / n_i
Smallest upper precision       p_i^*upp     n_i / (n_ii + Σ_{i≠j} (n_ij + n_ji))
Largest alpha precision        α_i^*        n_ii / (n_ii + Σ_{i≠j} (n_ij + n_ji))
Largest approximation quality  γ^*          (Σ_i n_ii) / n
Largest alpha                  α^*          γ^* / (2 − γ^*)

Sharper rough bounds for evaluation of confusion matrices

Upper lower precision          p_i^**low    (n_ii − Ind(Σ_{j≠i} n_ij)) / n_i
Lower upper precision          p_i^**upp    n_i / (n_ii + Σ_{i≠j} (n_ij + n_ji + Ind(n_ij)))
Upper alpha precision          α_i^**       (n_ii − Ind(Σ_{j≠i} n_ij)) / (n_ii + Σ_{i≠j} (n_ij + n_ji + Ind(n_ij)))
Upper approximation quality    γ^**         (Σ_i (n_ii − Ind(Σ_{j≠i} n_ij))) / n
Upper alpha                    α^**         γ^** / (2 − γ^**)

Bounds for max-row classifier for evaluation of confusion matrices

m-Lower precision              p_i^m,low    max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}} / n_i
m-Upper precision              p_i^m,upp    n_i / (n_ii + Σ_{j≠i} (n_ji + 2·n_ij))
m-Alpha precision              α_i^m        max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}} / (n_ii + Σ_{j≠i} (n_ji + 2·n_ij))
m-Approximation quality        γ^m          (Σ_i max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}}) / n
m-Alpha                        α^m          γ^m / (2 − γ^m)
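The generic measures in the table translate directly into code; a small sketch with toy counts (assumed), which also exercises the identity α_i = p_i^low · p_i^upp implicit in the definitions:

```python
def p_low(nl_i, n_i): return nl_i / n_i     # rough lower precision
def p_upp(n_i, nu_i): return n_i / nu_i     # rough upper precision
def alpha(nl_i, nu_i): return nl_i / nu_i   # rough alpha precision
def gamma(nl, n): return sum(nl) / n        # rough approximation quality
def alpha_of(g): return g / (2 - g)         # rough alpha

# Toy counts for three classes (assumed), with nl_i <= n_i <= nu_i
nl, ni, nu, n = [20, 40, 60], [30, 50, 80], [45, 60, 110], 200
g = gamma(nl, n)
print(g, alpha_of(g))
print([round(alpha(l, u), 3) for l, u in zip(nl, nu)])
```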
Appendix B. Table of odds ratio measures

In the estimates below, the counts nl_i^* = n_ii, nu_i^* = n_ii + Σ_{i≠j} (n_ij + n_ji), nl_i^** = n_ii − Ind(Σ_{j≠i} n_ij), nu_i^** = n_ii + Σ_{i≠j} (n_ij + n_ji + Ind(n_ij)), nl_i^m = max{0, n_ii − max{n_it : 1 ≤ t ≤ k, t ≠ i}} and nu_i^m = n_ii + Σ_{j≠i} (n_ji + 2·n_ij) are those underlying the corresponding estimates in Appendix A.

Generic rough set measures

Name               Notation     Estimate
Rough lower OR     OR_i^low     [nl_i / (n − nl_i)] / [n_i / (n − n_i)]
Rough upper OR     OR_i^upp     [n_i / (n − n_i)] / [nu_i / (n − nu_i)]
Rough OR           ROR_i        [nl_i / (n − nl_i)] / [nu_i / (n − nu_i)]

Standard measures for evaluation of confusion matrices

Largest lower OR   OR_i^*low    [nl_i^* / (n − nl_i^*)] / [n_i / (n − n_i)]
Smallest upper OR  OR_i^*upp    [n_i / (n − n_i)] / [nu_i^* / (n − nu_i^*)]
Largest ROR        ROR_i^*      [nl_i^* / (n − nl_i^*)] / [nu_i^* / (n − nu_i^*)]

Sharper rough bounds for evaluation of confusion matrices

Upper lower OR     OR_i^**low   [nl_i^** / (n − nl_i^**)] / [n_i / (n − n_i)]
Lower upper OR     OR_i^**upp   [n_i / (n − n_i)] / [nu_i^** / (n − nu_i^**)]
Upper ROR          ROR_i^**     [nl_i^** / (n − nl_i^**)] / [nu_i^** / (n − nu_i^**)]

Bounds for max-row classifier for evaluation of confusion matrices

m-Lower OR         OR_i^m,low   [nl_i^m / (n − nl_i^m)] / [n_i / (n − n_i)]
m-Upper OR         OR_i^m,upp   [n_i / (n − n_i)] / [nu_i^m / (n − nu_i^m)]
m-ROR              ROR_i^m      [nl_i^m / (n − nl_i^m)] / [nu_i^m / (n − nu_i^m)]
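For confusion-matrix data, the starred odds ratios in the table can be computed directly from the matrix entries. A sketch with a toy 3×3 matrix (assumed, with rows taken as the actual classes, which is an assumption about orientation), checking the product relation ROR_i^* = OR_i^{*low} · OR_i^{*upp}:

```python
def odds(x, n):
    """Odds x / (n - x) of a count x out of n."""
    return x / (n - x)

def starred_ors(M, i):
    """OR_i^*low, OR_i^*upp and ROR_i^* for class i of confusion matrix M."""
    n = sum(sum(row) for row in M)
    n_i = sum(M[i])                                          # row sum: class size n_i
    off = sum(M[i][j] + M[j][i] for j in range(len(M)) if j != i)
    nu_star = M[i][i] + off                                  # nu_i^* from Appendix B
    or_low = odds(M[i][i], n) / odds(n_i, n)                 # nl_i^* = n_ii
    or_upp = odds(n_i, n) / odds(nu_star, n)
    return or_low, or_upp, or_low * or_upp

M = [[50, 5, 3],    # toy confusion matrix (assumed)
     [4, 60, 6],
     [2, 8, 62]]
print(starred_ors(M, 0))
```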
Appendix C. Details for method computations

C.1. Computation for (6.2)

Consider

∂f_ij/∂a = 1 − (d + e)/(a + b + d + e)²,
∂f_ij/∂b = 1 − (d + e)/(a + b + d + e)²,
∂f_ij/∂c = 1,
∂f_ij/∂d = (a + b)/(a + b + d + e)²,
∂f_ij/∂e = (a + b)/(a + b + d + e)².

Here, â = n_ii/n, b̂ = n_jj/n, ĉ = Σ_{l≠i, l≠j} n_ll/n, d̂ = n_ij/n and ê = n_ji/n. With the multinomial variances and covariances, the Delta method yields

Var(f̂) = a(1 − a)/n · (1 − (d + e)/(a + b + d + e)²)²
 + b(1 − b)/n · (1 − (d + e)/(a + b + d + e)²)²
 + c(1 − c)/n
 + d(1 − d)/n · ((a + b)/(a + b + d + e)²)²
 + e(1 − e)/n · ((a + b)/(a + b + d + e)²)²
 − 2ab/n · (1 − (d + e)/(a + b + d + e)²)²
 − 2ac/n · (1 − (d + e)/(a + b + d + e)²)
 − 2ad/n · (1 − (d + e)/(a + b + d + e)²) · (a + b)/(a + b + d + e)²
 − 2ae/n · (1 − (d + e)/(a + b + d + e)²) · (a + b)/(a + b + d + e)²
 − 2bc/n · (1 − (d + e)/(a + b + d + e)²)
 − 2bd/n · (1 − (d + e)/(a + b + d + e)²) · (a + b)/(a + b + d + e)²
 − 2be/n · (1 − (d + e)/(a + b + d + e)²) · (a + b)/(a + b + d + e)²
 − 2cd/n · (a + b)/(a + b + d + e)²
 − 2ce/n · (a + b)/(a + b + d + e)²
 − 2de/n · ((a + b)/(a + b + d + e)²)².
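The closed-form partial derivatives above can be checked against a numerical gradient of f_ij = a + b + c − (a + b)/(a + b + d + e); a sketch with assumed cell proportions:

```python
def f(a, b, c, d, e):
    """f_ij = a + b + c - (a + b)/(a + b + d + e), equation (6.2)."""
    return a + b + c - (a + b) / (a + b + d + e)

def grad_closed(a, b, c, d, e):
    """Closed-form partials from C.1."""
    s2 = (a + b + d + e) ** 2
    g = 1 - (d + e) / s2
    h = (a + b) / s2
    return [g, g, 1.0, h, h]

def grad_numeric(a, b, c, d, e, step=1e-6):
    """Central finite differences in each of the five coordinates."""
    x = [a, b, c, d, e]
    out = []
    for k in range(5):
        xp, xm = x[:], x[:]
        xp[k] += step
        xm[k] -= step
        out.append((f(*xp) - f(*xm)) / (2 * step))
    return out

p = (0.30, 0.25, 0.20, 0.15, 0.10)   # illustrative cell proportions (assumed)
print(grad_closed(*p))
print(grad_numeric(*p))
```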
C.2. Computation for (8.1)

The partial derivatives are

∂f/∂a = 1/a + 1/(1 − a) − 1/(a + b) − 1/(1 − (a + b)),
∂f/∂b = −1/(a + b) − 1/(1 − (a + b)),

and we can estimate the variance as

Var(f̂) = a(1 − a)/n · (1/a + 1/(1 − a) − 1/(a + b) − 1/(1 − (a + b)))²
 + b(1 − b)/n · (1/(a + b) + 1/(1 − (a + b)))²
 + 2ab/n · (1/a + 1/(1 − a) − 1/(a + b) − 1/(1 − (a + b))) · (1/(a + b) + 1/(1 − (a + b)))
 = b(1 − b) / (n · a(1 − a)(a + b)(1 − (a + b))).   (C.1)
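That the three Delta-method terms collapse to the closed form (C.1) can be verified numerically; a sketch with assumed values of a, b and n:

```python
def var_sum(a, b, n):
    """Delta-method variance of ln OR(a, b) as the three-term sum from C.2."""
    ga = 1/a + 1/(1 - a) - 1/(a + b) - 1/(1 - (a + b))   # df/da
    x = 1/(a + b) + 1/(1 - (a + b))                      # -df/db
    return (a*(1 - a)/n) * ga**2 + (b*(1 - b)/n) * x**2 + (2*a*b/n) * ga * x

def var_closed(a, b, n):
    """The simplified form (C.1), identical to (8.1)."""
    return b*(1 - b) / (n * a*(1 - a)*(a + b)*(1 - (a + b)))

a, b, n = 0.30, 0.20, 150   # assumed proportions and sample size
print(var_sum(a, b, n), var_closed(a, b, n))
```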
References

[1] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341–356.
[2] J. Novaković, A. Veljović, S. Ilić, Ž. Papić, M. Tomović, Evaluation of classification models in machine learning, Theory Appl. Math. Comput. Sci. 7 (2017) 39–46.
[3] D.J. Hand, Supervised classification and tunnel vision, Appl. Stoch. Models Bus. Ind. 21 (2005) 97–109.
[4] O. Caelen, A Bayesian interpretation of the confusion matrix, Ann. Math. Artif. Intell. 81 (2017) 429–450.
[5] G. Gediga, I. Düntsch, Standard errors of indices in rough set data analysis, in: Transactions on Rough Sets 17, 2014, pp. 33–47.
[6] I. Düntsch, G. Gediga, PRE and variable precision models in rough set data analysis, in: Transactions on Rough Sets 19, 2015, pp. 17–37, MR3618228.
[7] W. Ziarko, Variable precision rough set model, J. Comput. Syst. Sci. 46 (1993) 39–59.
[8] I. Düntsch, G. Gediga, Confusion matrices and rough set data analysis, CoRR, arXiv:1902.01487, 2019, https://doi.org/10.1088/1742-6596/1229/1/012055, CC BY 3.0 licence.
[9] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, System Theory, Knowledge Engineering and Problem Solving, vol. 9, Kluwer, Dordrecht, 1991.
[10] I. Düntsch, G. Gediga, Rough Set Data Analysis: A Road to Non-invasive Knowledge Discovery, Methoδos Publishers (UK), Bangor, 2000, http://www.cosc.brocku.ca/~duentsch/archive/nida.pdf.
[11] B. Everitt, The Analysis of Contingency Tables, Springer Verlag, 1977, originally published by Chapman and Hall.
[12] H. Nguyen, A. Skowron, Rough sets: from rudiments to challenges, in: A. Skowron, Z. Suraj (Eds.), Rough Sets and Intelligent Systems – Professor Zdzisław Pawlak in Memoriam, vol. 1, Springer Verlag, 2013, pp. 75–173.
[13] K. Pearson, Mathematical Contributions to the Theory of Evolution: On Contingency and Its Relation to Association and Normal Correlation, Draper's Company Research Memoirs, Biometric Series, vol. 13, Department of Applied Mathematics, University College, University of London, 1904.
[14] M.H. Hodge, I. Pollack, Confusion matrix analysis of single and multidimensional auditory displays, J. Exp. Psychol. 63 (1962) 129–142.
[15] J. Townsend, Theoretical analysis of an alphabetic confusion matrix, Percept. Psychophys. 9 (1971) 40–50.
[16] R. Kohavi, F. Provost, Glossary of terms, Mach. Learn. 30 (1998) 271–274.
[17] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861–874.
[18] D. Hand, Measuring classifier performance: a coherent alternative to the area under the ROC curve, Mach. Learn. 77 (2009) 103–123.
[19] L. Cheng, X. Chen, M. Wei, J. Wuan, X. Hou, Modeling mode choice behavior incorporating household and individual sociodemographics and travel attributes based on rough sets theory, Comput. Intell. Neurosci. 2014 (2014) 560919.
[20] H. Al-Qaheri, A.E. Hassanien, A. Abraham, Discovering stock price prediction rules using rough sets, Neural Netw. World 18 (2008).
[21] A. Hassanien, M. Abdelhafez, H. Own, Rough sets data analysis in knowledge discovery: a case of Kuwaiti diabetic children patients, Adv. Fuzzy Syst. 2008 (2008) 528461.
[22] M. Sudha, A. Kumaravel, Students performance prediction based on rough sets, Indian J. Comput. Sci. Eng. 8 (2017) 584–589.
[23] J. Xu, Y. Zhang, D. Miao, Three-way confusion matrix for classification: a measure driven view, Inf. Sci. 507 (2020) 772–794.
[24] I. Düntsch, G. Gediga, Approximation operators in qualitative data analysis, in: H. de Swart, E. Orłowska, G. Schmidt, M. Roubens (Eds.), Theory and Application of Relational Structures as Knowledge Instruments, in: Lecture Notes in Computer Science, vol. 2929, Springer-Verlag, Heidelberg, 2003, pp. 214–230.
[25] I. Düntsch, G. Gediga, Simplifying contextual structures, in: Proceedings of the 6th International Conference on Pattern Recognition and Machine Intelligence, in: Lecture Notes in Computer Science, vol. 9124, Springer Verlag, Heidelberg, 2015, pp. 23–32.
[26] R. Congalton, R.A. Mead, A quantitative method to test for consistency and correctness in photointerpretation, Photogramm. Eng. Remote Sens. 49 (1983) 69–74.
[27] D. Lauer, C. Hay, A. Benson, Quantitative evaluation of multiband photographic techniques, Final report NAS9-957, Earth Observation Division, Manned Spacecraft Center, NASA, 1970.
[28] G. Oehlert, A note on the Delta method, Am. Stat. 46 (1992) 27–29.
[29] A. Agresti, Categorical Data Analysis, 3rd ed., Wiley Series in Probability and Statistics, Wiley, Hoboken, NJ, 2013.
[30] J. Bland, D. Altman, The odds ratio, BMJ 320 (2000) 1468.
[31] M. Szumilas, Explaining odds ratios, J. Can. Acad. Child Adolesc. Psychiatry 19 (2010) 227–229.
[32] S. Menard, Applied Logistic Regression Analysis, 2nd ed., Quantitative Applications in the Social Sciences, vol. 106, Sage Publications, 2002.
[33] C.C. Clogg, J.W. Shockey, Multivariate analysis of discrete data, in: J.R. Nesselroade, R.B. Cattell (Eds.), Handbook of Multivariate Experimental Psychology, Perspectives on Individual Differences, 2nd ed., Plenum Press, 1988, pp. 337–365.
[34] I. Düntsch, G. Gediga, Probabilistic granule analysis, in: C.-C. Chan, J.W. Grzymala-Busse, W.P. Ziarko (Eds.), Proceedings of the Sixth International Conference on Rough Sets and Current Trends in Computing, RSCTC 2008, in: Lecture Notes in Computer Science, vol. 5306, Springer Verlag, 2008, pp. 223–231.
[35] G. Gediga, I. Düntsch, Statistical techniques for rough set data analysis, in: L. Polkowski, S. Tsumoto, T.Y. Lin (Eds.), Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, Physica Verlag, Heidelberg, 2000, pp. 545–565, MR1858668.