
Vistas in Astronomy Vol. 41, No. 3, pp. 411-417, 1997 © 1997 Elsevier Science Ltd. Printed in Great Britain. All rights reserved 0083-6656/97 $15.00 + 0.00


PII: S0083-6656(97)00046-9

CLASSIFICATION AND KERNEL DENSITY ESTIMATION

CHARLES TAYLOR

Department of Statistics, University of Leeds, Leeds LS2 9JT, UK

Abstract - The method of kernel density estimation can be readily used for the purposes of classification, and an easy-to-use package (ALLOC80) is now in wide circulation. It is known that this method performs well (at least in relative terms) in the case of bimodal or heavily skewed distributions. In this article we first review the method, and describe the problem of choosing $h$, an appropriate smoothing parameter. We point out that the usual approach of choosing $h$ to minimize the asymptotic integrated mean squared error is not entirely appropriate, and we propose an alternative estimate of the classification error rate, which is the target of interest. Unfortunately, it seems that analytic results are hard to come by, but simulations indicate that the proposed estimator has smaller mean squared error than the usual cross-validation estimate of error rate. A second topic which we briefly explore is that of classification of drifting populations. In this case, we outline two general approaches to updating a classifier based on new observations. One of these approaches is limited to parametric classifiers; the other relies on weighting of observations, and is more generally applicable. We use an example from the credit industry as well as some simulated data to illustrate the methods. © 1997 Elsevier Science Ltd. All rights reserved.

1. INTRODUCTION

Suppose that we have observed $n$ observations $x_i$, $i = 1, \ldots, n$, with corresponding known classes $c_i$, where $x_i$ is a $p$-dimensional feature vector and $c_i$ denotes the class membership. To simplify notation, we will assume throughout this paper that there are only two classes. The objective now is to learn a "rule", say $\phi$, so that we can assign a new observation $x^*$ to a class by a mapping $\phi(x^*) \to c$. In the case of the kernel density estimate, suppose that we have $n_j$ observations from class $C_j$; then we can estimate the probability density function by

$$\hat f_j(x) = \frac{1}{n_j} \sum_{x_i \in C_j} K(x; x_i, h), \qquad (1)$$


where $K(\cdot)$ is a kernel function such that $\int K(x)\,dx = 1$ and $h$ is the smoothing parameter (which can take the form of a vector, or even a matrix, in the case of multi-dimensional data). The classification rule is then to allocate $x^*$ to class $C_m$ if $m = \arg\max_j \hat f_j(x^*)$. Note that if we use a normal kernel function, then the limiting case $h \to 0$ gives a nearest neighbour classifier. This suggests that, with careful choice of the smoothing parameters, we should always do better than the 1-NN classifier. However, there can be numerical difficulties in using very small values of $h$ naively. If density estimation per se is the goal, then it is widely recognized that the choice of smoothing parameter is much more important than the choice of kernel function. However, as will be seen in Section 2, there are many issues which are distinctive when classification is the end target. For example, there is no reason why the kernel function should itself be a density: if we relax the condition that $K(x) \ge 0$ then better properties may ensue.
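To make the rule concrete, here is a minimal sketch of a classifier based on Eq. (1), using a Gaussian product kernel and one scalar smoothing parameter per class. This is our own illustration, not the ALLOC80 package; all function and variable names are ours.

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate as in Eq. (1), with a Gaussian product
    kernel and a single scalar smoothing parameter h for the class."""
    z = (x - data) / h                        # shape (n_j, p)
    k = np.exp(-0.5 * np.sum(z**2, axis=1))   # unnormalized Gaussian kernel
    p = data.shape[1]
    norm = (h * np.sqrt(2.0 * np.pi)) ** p    # kernel normalizing constant
    return k.sum() / (len(data) * norm)

def classify(x, class_data, class_h):
    """Allocate x to the class C_m with m = argmax_j f_j(x) (equal priors)."""
    scores = [kde(x, d, h) for d, h in zip(class_data, class_h)]
    return int(np.argmax(scores))

# Example with the two univariate normal classes used in Section 2.
rng = np.random.default_rng(0)
c1 = rng.normal(0.0, 1.0, size=(100, 1))
c2 = rng.normal(2.0, 0.5, size=(100, 1))
print(classify(np.array([0.3]), [c1, c2], [0.72, 0.215]))  # -> 0, i.e. class 1
```

With unequal priors, the scores would simply be multiplied by the estimated $\pi_j$ before taking the argmax.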

1.1. Shifting populations

In classical supervised learning, the available examples (the training data) are usually used to learn a classifier. In many practical situations in which the environment changes, this procedure ceases to work. In the StatLog project [2, Ch. 9] this situation was encountered in a credit-scoring application. Generally speaking, applying a discrimination algorithm to classify new, unseen examples will be problematic if either the number of attributes changes, or the number of attributes remains the same but the interpretation of the records of the datasets changes over time. In this situation the distribution of at least one class is gradually changing. To solve this problem one could simply relearn the rule, provided that there are enough new examples with known class, but this is wasteful. An alternative is incremental learning - see Refs. [1, 5, 6, 7] for example - in which one of the design goals is that the decision tree that is produced should depend only on the set of instances, without regard to the sequence in which those instances were presented. However, if there is population drift, some of the old data should be downweighted as no longer representative.

To deal with dynamic aspects there are essentially two problems (in addition to those normally associated with classification): the first is to detect any change in the situation; the second is how to react to any detected change. In Section 3 this paper discusses ideas for adaptive learning which can capture dynamic aspects of real-world datasets. Although some of these ideas have a general character and could be applied to any supervised algorithm, here we focus attention on kernel density methods, which use a weighted average of kernel functions with the weights determined by the age of the observations. A final section applies some of the methods and ideas to simulated data and an example from the credit industry.

2. ESTIMATION OF ERROR RATES

Numerous researchers have tackled the problem of choosing an appropriate bandwidth based on the data - see, for example, Ref. [8] for references. However, most of the results relate to minimizing the (asymptotic) integrated mean squared error (IMSE), which is given by $E \int (\hat f - f)^2$. It is worth noting that such a policy for choosing $h$ may not work very well in a classification setting, where we want to minimize the expected misclassification rate, which for two classes is given by

$$\pi_1 \int_{\hat f_{h_2}(x) > \hat f_{h_1}(x)} f_1(x)\,dx \;+\; \pi_2 \int_{\hat f_{h_1}(x) > \hat f_{h_2}(x)} f_2(x)\,dx \;=\; \pi_1 I_1 + \pi_2 I_2, \qquad (2)$$

where $\pi_i$ is the prior probability that the data belong to class $C_i$. The usual approach of taking a Taylor series expansion does not work here, and the fact that the limits of the integrals are random variables makes this look intractable. We illustrate by simulation that the optimal $h_1, h_2$ which minimize IMSE can be very different from those which minimize the expected error rate. In this example, 100 observations were simulated from each of $N(0, 1)$ and $N(2, 0.5^2)$. For each of 100 samples we calculated $\hat f_1$ and $\hat f_2$ using a range of different smoothing parameters. For these distributions, $(h_1, h_2) = (0.422, 0.211)$ minimize the asymptotic IMSE, whereas the actual error rate is minimized for $(h_1, h_2) = (0.720, 0.215)$.

Suppose that the integrals in Eq. (2) are over a connected region. Then we need to estimate the point $t$ such that $f_1(t) = f_2(t)$. Let $\hat t(h_1, h_2)$ estimate $t$, defined by $\hat f_{h_1}(\hat t) = \hat f_{h_2}(\hat t)$, and write $\sigma_K^2 = \int K(u)\,u^2\,du$. It can be shown that, approximately (as $h \downarrow 0$ and $nh \to \infty$),

$$E(\hat t) = t + \frac{h^2 \sigma_K^2}{2} \cdot \frac{f_1''(t) - f_2''(t)}{f_2'(t) - f_1'(t)}, \qquad \operatorname{var}(\hat t) = \frac{\int K^2}{nh} \cdot \frac{f_1(t) + f_2(t)}{\left[ f_2'(t) - f_1'(t) \right]^2}.$$

Attempting to minimize bias$^2$ + variance leads to quintic simultaneous equations, so plug-in solutions are not readily available. The usual estimate of the expected error rate is obtained by leave-one-out cross-validation which, up to a multiplicative constant, is given by (assuming from now on that $\pi_1 = \pi_2 = 0.5$)

$$\sum_{x_i \in C_1} I\left[\hat f_2(x_i) - \hat f_1^{(i)}(x_i)\right] \;+\; \sum_{x_i \in C_2} I\left[\hat f_1(x_i) - \hat f_2^{(i)}(x_i)\right], \qquad (3)$$

where f(‘)(x) denotes the kernel estimate of f(x) (using Eq. (1)) using all of the data except the ith observation, and Z(x) = 1 if x > 0; 0 otherwise. An alternative is to use a smoothed version

of (3), which is more obviously an estimate of (2) (again omitting a multiplicative constant), given by

$$\sum_{x_i \in C_1} \int_{\hat f_2(x) > \hat f_1^{(i)}(x)} K(x; x_i, h)\,dx \;+\; \sum_{x_i \in C_2} \int_{\hat f_1(x) > \hat f_2^{(i)}(x)} K(x; x_i, h)\,dx. \qquad (4)$$

In this case $h = 0$ gives the usual leave-one-out, or cross-validation, estimate of the error rate, since, for example, the first integral in (4) will be 1 if $\hat f_2(x_i) > \hat f_1^{(i)}(x_i)$ and 0 otherwise. So although $h = 0$ will give an unbiased estimate of the error, a value of $h > 0$ can give a better estimate (in terms of mean squared error). Some progress can be made in computing the approximate bias and variance of the estimator derived from Eq. (4), and simulations confirm that, in general, it does lead to a better estimator than Eq. (3). Figs. 1 and 2 show the estimated mean squared error over 100 samples of size 10 from each of $N(0, 1)$ and $N(1, 1)$. Note that (4) requires three or four choices of smoothing parameter, whereas (3) requires only two. However, although (4) can lead to a smaller mean squared error - the minimum of the curve is 0.004, compared with 0.015 for (3) - the extra computation will rarely be worth the effort.


Fig. 1. Mean squared error for the estimate of error rate using Eq. (4), as a function of the smoothing parameter $h$; a dashed reference line marks the mean squared error for $(h_1, h_2) = (1.89, 1.89)$.

Fig. 2. Mean squared error of the cross-validation estimate given by Eq. (3), as a function of $h_1, h_2$.

Moreover, if classification of future observations is to be carried out, then only good choices of $h_1, h_2$ need to be found; estimation of the error rate may then be of secondary importance.
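As a computational illustration of the two estimators, the sketch below (our own; the function names and the grid-based numerical integration are our assumptions) computes the leave-one-out count of Eq. (3) and the smoothed estimate of Eq. (4) for one-dimensional data with Gaussian kernels.

```python
import numpy as np
from scipy.stats import norm

def kde1d(x, data, h):
    """Gaussian kernel density estimate, vectorized over query points x."""
    z = (np.atleast_1d(x)[:, None] - data) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def loo_error_count(d1, d2, h1, h2):
    """Eq. (3): count points misclassified against the leave-one-out density."""
    err = 0
    for i, x in enumerate(d1):
        err += kde1d(x, d2, h2)[0] > kde1d(x, np.delete(d1, i), h1)[0]
    for i, x in enumerate(d2):
        err += kde1d(x, d1, h1)[0] > kde1d(x, np.delete(d2, i), h2)[0]
    return err

def smoothed_error(d1, d2, h1, h2, h, grid):
    """Eq. (4): integrate K(x; x_i, h) over the region where the competing
    density beats the leave-one-out density, approximated on a grid."""
    dx = grid[1] - grid[0]
    total = 0.0
    for i, xi in enumerate(d1):
        region = kde1d(grid, d2, h2) > kde1d(grid, np.delete(d1, i), h1)
        total += (norm.pdf(grid, loc=xi, scale=h) * region).sum() * dx
    for i, xi in enumerate(d2):
        region = kde1d(grid, d1, h1) > kde1d(grid, np.delete(d2, i), h2)
        total += (norm.pdf(grid, loc=xi, scale=h) * region).sum() * dx
    return total
```

Setting $h = 0$ in (4) corresponds, in this sketch, to the Gaussian centred at $x_i$ collapsing to a point mass, which recovers the indicator of (3).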

3. DEALING WITH SHIFTING POPULATIONS

Nakhaeizadeh et al. [4] discuss ways of updating the classifier either by modifying the rule which has been learned, or by modifying the training data. Suppose that we examine the data in batches of size m and that, at time t + mk we detect a change which requires adaptation in the learned rule. Any algorithm can be totally relearned from recent observations after a change in one of the classes has been detected. More interestingly, we can consider how best to re-use previously learned information.


We can use a "similarity" rule to throw away observations in the current training set which are different from those recently observed, i.e. which are close in feature space but have a different class label. Alternatively, the older observations can be eliminated, or a kind of moving window with a predefined number of observations, possibly representative and new ones, could be used. We try a possible implementation whereby old data are discarded and new data are included in the "template" used for establishing a rule, according to their perceived usefulness. As presented, this system will have some limitations. For example, if there are drifting populations then new data will be incorrectly classified, but should nevertheless be included in the template set. This point is taken up in a nearest neighbour implementation in Ref. [3]. Note that this approach can be used for any (not just similarity-based) algorithms.

We now focus attention on a dynamic version of the kernel estimator, which in the simple case is given by

$$\hat f_T(x) = \frac{1}{N_T} \sum_{t \le T} w_t\, K(x; x_t, h) \qquad (5)$$

as an estimate of the density $f_T(x)$ at time $T$, when we have previously observed $x_t$ in the class of interest. Here the $w_t$ are weights and $N_T$ is a normalizing constant. For example, we could choose $w_t = e^{-\lambda(T-t)}$, in which case $N_T = (1 - e^{-\lambda T})/(1 - e^{-\lambda})$, or

$$w_t = \begin{cases} 1 & \text{for } T \ge t > T - W, \\ 0 & \text{otherwise,} \end{cases}$$

in which case $N_T = W$. Using either of these parameterizations for $w_t$ requires the choice of either $\lambda$ or $W$, in addition to the smoothing parameter $h$; of course both parameters must be chosen for each class. Again, analytic calculations appear to be intractable, even for IMSE and very simple dynamic models. However, numerical calculations are simplified by noting that simple updating formulae can be derived expressing $\hat f_T(x)$ in terms of $\hat f_{T-1}(x)$ and a kernel function of $x_T$. In this paper we consider experiments on real and simulated dynamic data in which we train on an initial set of data (ordered by time), choosing any parameters by cross-validation, and then test on observations in the second part of the data.
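To make the updating formula concrete: with exponential weights, the unnormalized sum $g_T(x) = \sum_{t \le T} e^{-\lambda(T-t)} K(x; x_t, h)$ satisfies $g_T(x) = e^{-\lambda} g_{T-1}(x) + K(x; x_T, h)$, and likewise $N_T = e^{-\lambda} N_{T-1} + 1$. The sketch below (our own illustration, evaluated on a fixed grid of query points) implements this recursion for one class.

```python
import numpy as np

class DynamicKDE:
    """Exponentially weighted kernel density estimate as in Eq. (5),
    maintained incrementally on a fixed grid of query points."""

    def __init__(self, grid, h, lam):
        self.grid = np.asarray(grid, dtype=float)
        self.h = h
        self.lam = lam
        self.g = np.zeros_like(self.grid)  # unnormalized weighted kernel sum g_T
        self.n = 0.0                       # normalizing constant N_T

    def update(self, x_t):
        """Fold in the newest observation x_t and return the new estimate."""
        decay = np.exp(-self.lam)
        z = (self.grid - x_t) / self.h
        kern = np.exp(-0.5 * z**2) / (self.h * np.sqrt(2.0 * np.pi))
        self.g = decay * self.g + kern     # downweight the past, add the new kernel
        self.n = decay * self.n + 1.0      # N_T = (1 - e^{-lam*T}) / (1 - e^{-lam})
        return self.g / self.n

# One estimator per class; classification compares the per-class grids at x*.
est = DynamicKDE(grid=np.linspace(-4, 4, 201), h=0.5, lam=0.08)
for x in np.random.default_rng(2).normal(0.0, 1.0, 50):
    f_hat = est.update(x)
```

The moving-window version admits a similar update, except that the kernel of the observation leaving the window is subtracted rather than geometrically decayed.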

4. RESULTS

In this section we describe some results of our adaptive updating ideas and compare them with conventional statistical classification methods in an example. The simplest, non-adaptive approach is to use the classification rule that was learned from the training data and apply it to all batches, with no updating. A small modification is to update the priors according to the new data, and a further modification is to use priors estimated from only the last batch. An alternative method is to completely re-learn the rule at each time point.

We tried out some of the above ideas on two simulated datasets. At each of 1000 time points we generate an example with three variables $(X_1, X_2, X_3)$ from each of two classes. We use the first 500 observations as the training data and the remaining 1500 as test data. The distribution of each class has two independent normal variables (with unit standard deviation) and a uniformly distributed (on [0, 1]) "noise" variable. The mean of the noise variable, $\mu_3 = 0.5$, was independent of time; the means of the normal variables vary with time as follows:

- dat1: Class 1 has $\mu_1, \mu_2 = 0$ for $t \le 750$ and $\mu_1, \mu_2 = t/1000$ for $751 \le t \le 1000$, whereas Class 2 differs in that $\mu_2 = 2$ for $t \le 750$ and $\mu_2 = 2 + t/1000$ for $751 \le t \le 1000$, so that there is no change to the distributions until two thirds of the way through the testing phase, when there is a sudden jump followed by a slow drift.
- dat2: Class 1 has $\mu_1, \mu_2 = 0$ and Class 2 has $\mu_1 = 2t/1000$, $\mu_2 = 2 - 2t/1000$ for $1 \le t \le 1000$. In this case there is a gradual shift, throughout both the training and test phases, in the mean of $(X_1, X_2)$ for the second group, from $(0, 2)$ to $(2, 0)$.

Table 1. Error rates for the kernel classifier on simulated data; see text for (i)-(iv).

                                   Error rate         Estimated parameters
  Method                           dat1     dat2      dat1                    dat2
  (i)   no-learn (h1, h2)          0.201    0.357     (0.75, 0.8)             (1.5, 1.5)
  (ii)  all-data (h1, h2)          0.167    0.274     (0.75, 0.8)             (1.5, 1.5)
  (iii) (h1, h2, λ1, λ2)           0.173    0.239     (3, 3, 0.08, 0.08)      (1, 1, 0.06, 0.06)
  (iv)  (h1, h2, W1, W2)           0.165    0.231     (1.5, 1.5, 210, 245)    (1.6, 1.6, 95, 250)

Since we split the testing data into batches of 50 observations (which always corresponds to 25 observations from each class), the change should happen in batch 21 for dat1, and should run through the whole training and testing phase for dat2. Note that for both datasets the observations were ordered so that the priors (however estimated) were always equal.

For the kernel classifier we tried four approaches: (i) $h_1, h_2$ were trained on the training data, but the classifier was not updated (no-learn); (ii) the classifier was updated using all observations observed so far (with no change in the smoothing parameters); and the dynamic kernel estimator given by Eq. (5) with (iii) exponential decay weights ($\lambda$), and (iv) a window of width $W$ learned from the training data. In the latter two cases, the two parameters for each of the two classes were chosen by cross-validation. The results are given in Table 1.

The credit data cover a two-year period and consist of 156,273 observations, the first 5000 of which were used as the initial training data. Initially there were 15 attributes (all categorical) in one year, and 14 attributes in the second year. Since we are not dealing with the problem of a change in the number of attributes, the extra variable was discarded. We coded all the variables into a list of 0/1 attributes, and stepwise selection in linear discriminant analysis was used to select 15 of these binary variables. We display the error rates in each batch in Fig. 3. The kernel classifier which was kept fixed gave an overall error rate of over 20%, whereas updating the prior to reflect the proportions in the last batch gave a small improvement, to 18.2%. The dynamic version (moving window) gave a large improvement, to 10.4%. Note that we neither scaled the binary data nor considered a different smoothing parameter in each dimension; in effect we assumed that the variables were independent with a common variance, which was certainly not the case.

Stationarity is a key issue which will affect the performance of any dynamic classification algorithm. For example, if the changes which occur in the training phase are very different in nature from the changes which take place during testing, then the way the parameters are updated is likely to be deficient. For this reason it seems that any method should include a monitoring process, even if this monitoring is not normally used to update the rules.
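For reference, the following sketch gives one reading of the simulated-data design described in this section (our own; the paper does not give its generator), producing one observation per class at each time point.

```python
import numpy as np

def simulate(dataset, rng):
    """Generate dat1 or dat2: at each time point t, one observation per
    class, with (X1, X2) normal (unit sd) and X3 uniform on [0, 1]."""
    rows = []
    for t in range(1, 1001):
        if dataset == "dat1":
            m = 0.0 if t <= 750 else t / 1000            # shared drift after t = 750
            c1_mean = (m, m)
            c2_mean = (m, 2.0 if t <= 750 else 2.0 + t / 1000)
        else:  # dat2: Class 2 drifts from (0, 2) to (2, 0)
            c1_mean = (0.0, 0.0)
            c2_mean = (2 * t / 1000, 2 - 2 * t / 1000)
        for cls, mean in ((1, c1_mean), (2, c2_mean)):
            x1, x2 = rng.normal(mean, 1.0)               # two independent normals
            rows.append((t, cls, x1, x2, rng.uniform())) # plus the noise variable
    return rows

data = simulate("dat1", np.random.default_rng(0))
train, test = data[:500], data[500:]   # first 500 observations train, 1500 test
```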


Fig. 3. Error rates for two kernel classifiers. The '*' points are for a non-dynamic classifier in which the priors were updated according to the proportions observed in the previous batch. The remaining points are for a moving window classifier, Eq. (5), with $(W_1, W_2) = (400, 3000)$ and $(h_1, h_2) = (0.25, 0.25)$.

References

[1] S.L. Crawford, Extensions to the CART algorithm, International Journal of Man-Machine Studies 31 (1989) 197-217.
[2] D. Michie, D.J. Spiegelhalter, C.C. Taylor, Eds., Machine Learning, Neural and Statistical Classification (Ellis Horwood, Chichester, 1994).
[3] G. Nakhaeizadeh, C.C. Taylor, G. Kunisch, Dynamic Aspects of Statistical Classification. In: Intelligent Adaptive Agents, AAAI Technical Report No. WS-96-04 (AAAI Press, Menlo Park, CA, 1996) pp. 55-64.
[4] G. Nakhaeizadeh, C.C. Taylor, G. Kunisch, Dynamic Supervised Learning: Some Basic Issues and Application Aspects. In: Classification, Data Analysis, and Knowledge Organisation, R. Klar, O. Opitz, Eds. (Springer, Berlin, 1997).
[5] J.C. Schlimmer, R. Granger, Incremental learning from noisy data, Machine Learning 1 (1986) 317-354.
[6] P.E. Utgoff, Incremental induction of decision trees, Machine Learning 4 (1989) 161-186.
[7] P.E. Utgoff, An improved algorithm for incremental induction of decision trees. In: Proceedings of the Eleventh Machine Learning Conference, Rutgers University (Morgan Kaufmann, 1994).
[8] M.P. Wand, M.C. Jones, Kernel Smoothing (Chapman and Hall, London, 1995).