Pattern Recognition Letters 19 (1998) 1–6

Evaluation of pattern classifiers – Applying a Monte Carlo significance test to the classification efficiency

Edgard Nyssen

Department of Electronics, Brussels Free University (VUB), Pleinlaan 2, B-1050 Brussels, Belgium

Received 25 April 1996; accepted 2 October 1997

Abstract

The exact probability technique for testing the significance of the efficiency of a multiclass classifier is hardly applicable to medium- or large-size problems, due to combinatorial explosion. The present paper presents an alternative method, based on a Monte Carlo test. © 1998 Elsevier Science B.V.

Keywords: Decision functions; Classifier evaluation; Significance testing; Monte Carlo test

E-mail: [email protected].

0167-8655/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0167-8655(97)00147-5

1. Introduction

The application of a classifier to a set of patterns yields a frequency distribution N = (n_11, n_12, ..., n_kk), where n_ij represents the number of elements belonging to class Ω_i which have been assigned by the classifier to class Ω_j (i, j ∈ {1, ..., k}). Assuming that the set is a representative sample from the pattern population under study, an important quality criterion is the relative frequency of correctly classified patterns – the classification efficiency – which is defined as

$$e = e(N) = \frac{\sum_{i=1}^{k} n_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij}}.$$

In this paper we consider the problem of interpreting the results obtained by applying a classifier to a test set of patterns. The experiment with the test set yields a frequency distribution N^T, where "T" refers to the test set. Let us first introduce the following notations:

$$n_{i[t]} = \sum_{j=1}^{t} n_{ij}, \qquad n_{[s]j} = \sum_{i=1}^{s} n_{ij}, \qquad n_{[s][t]} = \sum_{i=1}^{s} \sum_{j=1}^{t} n_{ij}.$$

Testing the significance of the efficiency of a multiclass classifier can be performed by applying the following procedure (Nyssen, 1996):
1. search for the distributions N, belonging to the set N*, defined as follows:

$$\mathcal{N} = \left\{\, N \mid N = (n_{11}, \ldots, n_{kk}) \in \mathbb{N}^{k^2},\ n_{ij} \text{ satisfy (1) and (2)} \,\right\},$$

$$\mathcal{N}^* = \left\{\, N \mid N \in \mathcal{N},\ n_{ij} \text{ satisfy (3)} \,\right\},$$

referring to the constraints:

$$\forall i \in \{1, \ldots, k\}: \quad n_{i[k]} = n^T_{i[k]}, \qquad (1)$$

$$\forall j \in \{1, \ldots, k\}: \quad n_{[k]j} = n^T_{[k]j}, \qquad (2)$$

$$\sum_{i=1}^{k} n_{ii} \geqslant \sum_{i=1}^{k} n^T_{ii}; \qquad (3)$$

2. calculate

$$p^* = \sum_{N \in \mathcal{N}^*} \prod_{s=2}^{k} \prod_{t=2}^{k} \frac{C_{n_{s[t]}}^{n_{st}}\, C_{n_{[s-1][t]}}^{n_{[s-1]t}}}{C_{n_{[s][t]}}^{n_{[s]t}}}, \qquad (4)$$

where C_n^{n'} = n!/((n − n')! n'!) denotes the number of combinations in which n' objects can be selected from a set of n distinct objects;
3. compare p* with α, a chosen significance level; if p* < α, the null hypothesis H_0 that the distribution N^T was obtained through random classification is rejected in favour of the alternative hypothesis H_1 that the efficiency of the classifier under study is higher than expected under H_0.

This procedure is based on the idea that distribution N^T is a member of the population of distributions N with the same marginal totals in the rows and columns of the k × k contingency table (i.e., a square table containing the n_ij values, where i is the row index and j the column index) – which is expressed by constraints (1) and (2). The terms in Eq. (4) are the exact probabilities, under H_0, of those distributions showing the same or a higher classification efficiency than N^T – which is expressed by constraint (3).

The bottleneck in the application of the procedure may be the exhaustive search of the distributions N ∈ N. Even an optimised implementation still requires, in step (2), the calculation of a number of terms equal to the size of N*. This size increases dramatically for an increasing number of classes, an increasing test set size and a decreasing classification efficiency. We designed an alternative approach, based on a statistical experiment, in which on the one hand the classifier under study is applied to the test set, yielding N^T, and on the other hand a random sample of f distributions N^R_q (q ∈ {1, ..., f}) is taken from a population of distributions with the same marginal totals as N^T and generated by a random classifier; superscript R refers to the random character of this classification. The logic behind this approach corresponds to that of the Monte Carlo test (Ripley, 1987) and is explained in detail in Section 2.

2. Theoretical considerations

A testing procedure can be obtained from the following reasoning. We consider a sample of size one, containing the distribution N^T obtained by applying the studied classifier to the test set. The second sample is the one consisting of the distributions N^R_q. Probability theory provides us with theoretical instruments allowing us to test whether the two samples have been taken from the same population. Since we assume that the members of the second sample have been obtained by random classification, this hypothesis is equivalent to the original null hypothesis that N^T was obtained through random classification. The purpose is to test this null hypothesis H_0 against the alternative hypothesis H_1 that the sample containing N^T is taken from a population of distributions with a higher expected classification efficiency. This classification efficiency can be calculated for both samples, yielding e^T = e(N^T) and e^R_q = e(N^R_q), where q ∈ {1, ..., f}. The most appropriate method is the randomisation test (e.g. (Siegel, 1956)):
• it is a nonparametric technique, maximally exploiting all available information and having therefore, in that sense, a power-efficiency of 100%;
• the efficiencies are on the one hand interval-level scores (the randomisation test tests the significance of the difference between the means of two samples), but on the other hand the efficiencies satisfy a discrete distribution, thus violating the necessary assumptions of many other nonparametric techniques like the Mann–Whitney U test.

Consider the frequency distribution f_l of the efficiencies e(N), calculated for the elements of sample N^R, i.e.,

$$\forall l \in \{1, \ldots, n_{[k][k]}\}: \quad f_l = \#\left\{\, N \mid N \in \mathcal{N}^R,\ \sum_{i=1}^{k} n_{ii} = l \,\right\}$$

– since each value of Σ_{i=1}^{k} n_ii corresponds to one value of e(N). Consider also the set of distributions N^U = N^R ∪ N^T. The randomisation test, applied to our problem, is based on the idea that the result of the experiment is considered as one instance of the f + 1 distinct assignments of one element of N^U to the sample containing N^T and the remaining f elements to sample N^R; these assignments, as possible outcomes of the experiment, are equiprobable under H_0. Only part of the assignments lead to an efficiency score for the element of the first sample which is equal to or greater than the observed e^T; these assignments correspond to the most extreme possible values of the test variable in the randomisation test procedure – the difference of the sums of the scores in the two samples. Hence – still under H_0 – the probability of such an assignment is

$$p'^* = \frac{\left(\sum_{l=l^T}^{n_{[k][k]}} f_l\right) + 1}{f + 1}, \qquad (5)$$

with

$$l^T = \sum_{i=1}^{k} n^T_{ii}.$$

This probability can be compared with a selected significance level α.
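The test described above is compact enough to sketch directly. The following Python sketch is illustrative only (the paper's implementation is in C, and the name `monte_carlo_p`, the margin arguments and the seed are assumptions); it simulates the random classifier by shuffling the label array, as detailed in Section 3, and then applies Eq. (5):

```python
import random

def monte_carlo_p(row_totals, col_totals, l_obs, f, seed=0):
    """Monte Carlo estimate p'* of Eq. (5): draw f random classifications
    with the given marginal totals and count those whose number of
    correctly classified patterns reaches the observed diagonal sum l_obs."""
    rng = random.Random(seed)
    # true class labels, one per pattern (row margins n^T_i[k])
    c = [i for i, r in enumerate(row_totals) for _ in range(r)]
    # class assigned to each position (column margins n^T_[k]j)
    assigned = [j for j, s in enumerate(col_totals) for _ in range(s)]
    count = 0
    for _ in range(f):
        rng.shuffle(c)  # random permutation = one random classification
        l = sum(ct == at for ct, at in zip(c, assigned))
        if l >= l_obs:
            count += 1
    return (count + 1) / (f + 1)

# Toy 2-class table with all margins equal to 2 and a perfect diagonal;
# the exact p* for this toy case works out to 1/6, and p'* approaches it.
print(monte_carlo_p([2, 2], [2, 2], l_obs=4, f=100000))
```

Note the built-in floor of the estimator: if no simulated distribution reaches l_obs, the result is 1/(f + 1), which is consistent with the lower limit discussed in Section 5.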

3. Methodology

The N^R_q ∈ N^R (q ∈ {1, ..., f}) are obtained through stochastic simulation. A sample of n_[k][k] patterns ω_r (r ∈ {1, ..., n_[k][k]}), of which the class membership is known, is represented by an array c ∈ {1, ..., k}^{n_[k][k]} of class labels c_r. Obviously, since we only deal with samples yielding distributions from N and hence satisfying constraints (1) and (2), the labels c_r have to satisfy

$$\forall i \in \{1, \ldots, k\}: \quad \sum_{r=1}^{n_{[k][k]}} \delta_{i c_r} = n^T_{i[k]},$$

corresponding to constraint (1). The classification of a sample element is given by the index r of its label c_r. Putting n'_1 = 0 and n'_t = n_[k][t−1] for t > 1, we have

$$r > n'_t,\ r \leqslant n_{[k][t]} \ \Rightarrow\ \omega_r \text{ assigned to } \Omega_t.$$

In this way, the classification is consistent with constraint (2). Based on this representation, random classification can be simulated by applying a random permutation to the components of c, since in this way all sample elements have the same probability of being assigned to a given class, independently of their real class membership.

Every distribution N^R_q (q ∈ {1, ..., f}) is generated by applying the following algorithm, using a function ran0() as pseudo-random number generator (cf. (Press et al., 1995)): set l = 0, which will be used to count the correctly classified patterns; given c, apply the following processing steps for each value of t = 1, ..., (n_[k][k] − 1):
1. generate a random number u, uniformly distributed in [0, 1);
2. calculate r = t + ⌊u × (n_[k][k] − t + 1)⌋; this yields for r a random value which satisfies a discrete uniform distribution between t and n_[k][k];
3. interchange c_r and c_t;
4. verify whether the new c_t value equals the label of the class corresponding to t; if so, increment l.
One can easily prove that this algorithm applies a random permutation to the components of c. At the end, we readily obtain the number of correctly classified patterns (l), which can be used to update frequency f_l in the frequency distribution of the efficiencies e^R_q (with q ∈ {1, ..., f}).

4. Case studies

The author has implemented the technique described in this paper in the C programming language on a SPARCstation 1, under the Solaris UNIX operating system. Source code can be obtained from the author (via electronic mail only).

4.1. Effect of the value of f on the value of p'*

Nyssen (1996) describes the application of the exact probability test to the distribution shown in Fig. 1 and taken from (Jennrich and Sampson, 1988). The technique described in the present paper was applied to the same example to verify the effect of the choice of f – the size of the sample of distributions N^R_q (q ∈ {1, ..., f}) – on its performance. We carried out different experiments, choosing for f the values from the series 1, 3, 10, 30, 100, ..., 3 × 10^6, 10^7. The resulting p'* values are shown in Fig. 2,

together with the exact probability p* and the 1% and 5% significance levels. Note already the relatively high values of p'* for low values of f, corresponding to the well-known phenomenon that the power of a two-sample test is limited when the involved samples are small. It is also worth noting that for high f values, p'* approximates p*, which is due to the fact that the limiting distribution of the relative frequencies of e^R_q (with q ∈ {1, ..., f}) is the exact probability distribution, and therefore:

$$\lim_{f \to \infty} p'^* \overset{(5)}{=} \lim_{f \to \infty} \frac{\left(\sum_{l=l^T}^{n_{[k][k]}} f_l/f\right) + 1/f}{1 + 1/f} = p^*.$$

A more detailed discussion of the influence of f on the p'*-value will be given in Section 5.

Fig. 1. Contingency table, showing classification results for the data of 102 ulcer patients; the three classes correspond to three types of pathology evolution.

4.2. Time behaviour

In order to observe the time behaviour of the Monte Carlo test algorithm, we took an example from (Nyssen, 1996) for which the time required by the exact probability technique is high: we consider the case of k = 5, n^T_i[k] = 5 and n^T_[k]j = 5 (for all values of i and j). For e^T = 40%, the number of terms in Eq. (4) is 1254250. In order to calculate the exact probability value (p* = 0.0195), the mentioned computer program needs about 393 s of computing time; the Monte Carlo testing program needed about 3.4 s to obtain a p'* value of 0.0204 with f = 30000.

We also considered the case of k = 6, n^T_i[k] = 4 and n^T_[k]j = 4 (for all values of i and j). As expected, due to the increase of the number of classes with respect to the previous example (although the number of elements – n^T_[k][k] = 24 – in the distribution under test is slightly lower and the number of correctly classified patterns – i.e., 10 – was not changed), the number of terms in Eq. (4) increased to 15897850, which is considerable; the time needed to calculate p* = 0.00415 is 10792 s, while it takes the Monte Carlo test only 3.9 s to obtain p'* = 0.00393 with f = 30000.

In order to verify the usefulness of the Monte Carlo test for larger problems, which may be encountered in practice, we applied it to a problem involving 10000 test patterns: k = 10, n^T_i[k] = 1000 and n^T_[k]j = 1000 (for all values of i and j). We applied the technique with f = 30000, for classification efficiencies of 10%, 10.1%, 10.5%, 11%, 20%, 95% and 100%. In all cases, computing times ranged between 1426 and 1451 s. The calculated p'*-values for the first four efficiencies were, respectively, 0.508, 0.376, 0.0509 and 7.67 × 10^-4; for the remaining efficiency values, p'* = 3.33 × 10^-5.
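For comparison with the exhaustive approach, p* can also be obtained by brute-force enumeration of all contingency tables with the given margins, summing the standard fixed-margins (generalised hypergeometric) probabilities. The following Python sketch is an illustration only, not the author's optimised C program, and `exact_p` and `bounded_compositions` are names introduced here:

```python
from math import factorial, prod

def bounded_compositions(total, caps):
    """All tuples with the given sum, cell j at most caps[j]."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for x in range(min(total, caps[0]) + 1):
        for rest in bounded_compositions(total - x, caps[1:]):
            yield (x,) + rest

def exact_p(row_totals, col_totals, l_obs):
    """Sum the fixed-margins probabilities of all k x k tables with the
    given margins whose diagonal sum is at least l_obs."""
    k, n = len(row_totals), sum(row_totals)
    # P(N) = (prod_i r_i!)(prod_j c_j!) / (n! prod_ij n_ij!)
    norm = prod(map(factorial, row_totals)) * prod(map(factorial, col_totals)) / factorial(n)
    p = 0.0
    def fill(i, cols_left, rows):
        nonlocal p
        if i == k:
            if all(c == 0 for c in cols_left) and sum(rows[t][t] for t in range(k)) >= l_obs:
                p += norm / prod(factorial(x) for row in rows for x in row)
            return
        for cells in bounded_compositions(row_totals[i], cols_left):
            fill(i + 1, tuple(a - b for a, b in zip(cols_left, cells)), rows + [cells])
    fill(0, tuple(col_totals), [])
    return p

print(exact_p([2, 2], [2, 2], 4))  # about 1/6 for the toy 2x2 case
```

For a toy 2 × 2 table with all margins 2 and a perfect diagonal this gives 1/6; for cases like the k = 5 example above, the number of tables makes enumeration impractical, which is precisely the motivation for the Monte Carlo test.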

Fig. 2. Values of p'*, obtained by applying a Monte Carlo test to the distribution of Fig. 1 for different values of f; both coordinates are log scaled.

5. Discussion

In order to ensure a sufficiently high power efficiency of the Monte Carlo test, the statistical experiment described in this paper should involve a sufficiently large sample N^R of distributions. These f distributions are generated through stochastic simulation (as described in Section 3) and therefore f is an essential parameter of the simulation, which must be chosen carefully to optimise the success of the test procedure. An absolute lower limit can be derived from Eq. (5): it is clear that p'* cannot be less than 1/(1 + f), and since rejecting the null hypothesis requires that p'* ⩽ α, we conclude that f ⩾ α^-1 − 1. As illustrated by the example of Section 4.1, if p* is a few orders of magnitude smaller than α, it would be sufficient to select f in the order of magnitude of (p*)^-1 to obtain almost certainly a p'* value smaller than α. Normally one does not have any clue about the value of p* (since the Monte Carlo test is actually executed to avoid its calculation); in order to avoid repeated experiments with the same test set data to find a "good" f value (because this practice would falsify the chosen significance level), we designed a criterion to select an acceptable minimum f value which is sufficiently high to deal with the worst-case possibility of p* being close to α.

Consider therefore that we would like to be able to prove the significance of classification efficiency for a test set distribution for which p* ⩽ α − Δp, by applying the Monte Carlo test (Δp is a chosen threshold). Assume also that we are willing, in case of p* = α − Δp, to accept a risk of not being able to reject the null hypothesis (i.e., a type II error) with probability β_R. For high f values, the right-hand side of Eq. (5) satisfies a normal distribution N(μ, σ²) in good approximation, where

$$\mu = \left( f(\alpha - \Delta p) + 1 \right)(f + 1)^{-1},$$

$$\sigma^2 = f(\alpha - \Delta p)(1 - \alpha + \Delta p)(f + 1)^{-2}.$$

Let z_{β_R} be the value of a standard normally distributed variable z satisfying P(z < z_{β_R}) = 1 − β_R. The minimal f value searched for (let us call it f_min) satisfies μ + z_{β_R} σ = α. Table 1 shows f_min values for a few useful combinations of values for α and Δp. A rough and conservative approximation of f_min can also be calculated using the following equation:

$$f_{\min} \approx \frac{z_{\beta_R}^2 (\alpha - \Delta p)}{(\Delta p)^2}. \qquad (6)$$
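As a cross-check of Table 1, the defining condition μ + z_{β_R}σ = α can be solved for f numerically. The sketch below is an illustration (the name `f_min`, the bisection scheme and the bracket are assumptions; the normal quantile comes from the Python standard library):

```python
from statistics import NormalDist

def f_min(alpha, dp, beta_r=0.05, lo=1.0, hi=1e8):
    """Solve mu(f) + z * sigma(f) = alpha for f by bisection,
    with mu and sigma as defined above (writing a = alpha - dp)."""
    z = NormalDist().inv_cdf(1.0 - beta_r)  # P(z < z_beta) = 1 - beta_R
    a = alpha - dp
    def g(f):
        mu = (f * a + 1.0) / (f + 1.0)
        sigma = (f * a * (1.0 - a)) ** 0.5 / (f + 1.0)
        return mu + z * sigma - alpha
    # g is positive at small f and negative at large f; bisect the crossing
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(f_min(0.05, 0.005)))  # close to the tabulated 5024
print(round(f_min(0.01, 0.001)))  # close to the tabulated 26074
```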

Table 1
Values of f_min for different values of α and Δp; in all cases β_R = 0.05

α        Δp        f_min
0.05     0.005     5024
0.01     0.001     26074
0.005    0.0005    52386
0.001    0.0001    262880

The choice of f_min is mainly meant to tune the power of the Monte Carlo test. Choosing e.g. β_R = 0.05 for α = 0.05 and Δp = 0.005 means that we accept that in 1 out of 20 similar Monte Carlo test experiments where f = f_min, the null hypothesis would not be rejected at α = 0.05, while the exact probability test would reject it at a significance level of 0.045. Keep in mind that this is a worst-case scenario; the probability of a type II error with this chosen f value will even decrease to the extent that the exact probability p* is smaller than 0.045. Note also that sometimes it may happen that p'* is smaller than p* (cf. e.g. the result of the second experiment described in Section 4.2). The reason is that p'* is conditioned by the specific result N^R of the stochastic simulation experiment; on average, however, p'* is greater than p*, and consequently the power of the Monte Carlo test is expected to be smaller than the power of the exact probability test.

It is obvious that in the Monte Carlo test method, the most time-consuming part of the algorithm is the repeated step where one random number is generated for each of the (n^T_[k][k] − 1) components of c; this is performed f times (i.e., once per member of the distribution sample N^R). Consequently, the time consumed by the execution of the algorithm is roughly proportional to f × (n^T_[k][k] − 1) or – for a given f value – linear in n^T_[k][k] (the size of the test set) and independent of both the number of classes k and the classification efficiency e^T! This contrasts sharply with the time behaviour of the exact probability algorithm which, as shown by Nyssen (1996), suffers from combinatorial explosion, especially for increasing values of k or e^T. The examples of Section 4.2 illustrate how useful the Monte Carlo testing method is when dealing with a relatively high number of classes or test set patterns; the second example also illustrates that the time behaviour of the Monte Carlo test is hardly affected by increasing the value of k.
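The f × (n_[k][k] − 1) count can be made concrete by instrumenting a sketch of the Section 3 permutation loop. The `CountingRandom` wrapper below is illustrative, not part of the paper; it simply counts calls to the generator:

```python
import random

class CountingRandom(random.Random):
    """random.Random subclass that counts calls to random()."""
    def __init__(self, seed=None):
        super().__init__(seed)
        self.calls = 0
    def random(self):
        self.calls += 1
        return super().random()

def shuffle_once(c, rng):
    # Steps 1-3 of the Section 3 algorithm (0-indexed): exactly one
    # random number per position t = 0, ..., n - 2.
    n = len(c)
    for t in range(n - 1):
        u = rng.random()            # step 1: u uniform in [0, 1)
        r = t + int(u * (n - t))    # step 2: r uniform on {t, ..., n - 1}
        c[r], c[t] = c[t], c[r]     # step 3: interchange c_r and c_t
    return c

rng = CountingRandom(0)
f, n = 1000, 25
for _ in range(f):
    shuffle_once(list(range(n)), rng)
print(rng.calls)  # 24000 = f * (n - 1)
```

Since each simulated classification consumes exactly n − 1 draws, the total work grows linearly in the test set size, as stated above.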

The last experiment, which involves n^T_[k][k] = 10000 patterns, shows that even for larger test sets, the Monte Carlo testing technique can be applied within acceptable computing time constraints. Note that the computing time is hardly influenced by the classification efficiency. The values of p'* are consistent with the expectations, considering that for a problem with 10 classes, with an equal number of test set patterns in the different classes, a random classifier has an expected efficiency of 10%. The p'*-values drop very quickly for e^T values exceeding 10% (with, however, a lower limit of (1 + f)^-1 = 3.33 × 10^-5, as mentioned at the beginning of this section). This behaviour is explained by the large size of the test set, which favourably influences the power of the hypothesis testing technique. It can be observed that in our example, a classification efficiency of 11% is already statistically significant at the α = 0.1% level.
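The 10% expectation quoted above can be checked with a scaled-down simulation (a sketch with n = 100 rather than 10000; the names and the seed are illustrative). The mean simulated efficiency of a random classifier with equal margins settles at 1/k:

```python
import random

rng = random.Random(42)
k, per_class = 10, 10            # 10 classes, 10 test patterns each
n = k * per_class
true_labels = [i for i in range(k) for _ in range(per_class)]
assigned = [j for j in range(k) for _ in range(per_class)]

f = 20000
total_correct = 0
for _ in range(f):
    c = true_labels[:]
    rng.shuffle(c)               # one random classification
    total_correct += sum(ct == at for ct, at in zip(c, assigned))

mean_eff = total_correct / (f * n)
print(mean_eff)                  # close to 0.10 = 1/k
```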

6. Conclusions

In conclusion, we would like to propose the following approach to the problem of showing the significance of a multiclass classification efficiency:
• if feasible, apply the exact probability test as described in Section 1 and discussed by Nyssen (1996);
• if, however, the number of classes or test patterns involved in the procedure causes an unacceptable consumption of computing time when the exact probability testing technique is used, apply the Monte Carlo test as described in Sections 2 and 3; the application of this method implies the choice of f; a minimum value f_min for this parameter can be chosen using Table 1 or the approximating formula (6).

Acknowledgements

The author would like to thank Dr Hichem Sahli for providing him some useful program source code.

References

Jennrich, R., Sampson, P., 1988. Stepwise discriminant analysis. In: Dixon, W.J. (Ed.), BMDP Statistical Software Manual. University of California Press.
Nyssen, E., 1996. Evaluation of pattern classifiers – Testing the significance of classification efficiency using an exact probability technique. Pattern Recognition Letters 17 (11), 1125–1129.
Press, W.H., Teukolsky, S.A., Vetterling, W.T., 1995. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Ripley, B.D., 1987. Stochastic Simulation. Wiley, New York.
Siegel, S., 1956. Nonparametric Statistics for the Behavioural Sciences. International Student Ed., McGraw-Hill Kogakusha.