Novelty detection employing an L2 optimal non-parametric density estimator


Pattern Recognition Letters 25 (2004) 1389–1397 www.elsevier.com/locate/patrec

Chao He, Mark Girolami *

Department of Computing Science, Bioinformatics Research Centre, University of Glasgow, Glasgow G12 8QQ, UK

Received 5 January 2004; received in revised form 16 April 2004; available online 8 June 2004

Abstract

This paper considers the application of a recently proposed L2 optimal non-parametric reduced set density estimator to novelty detection and binary classification, and provides empirical comparisons with other forms of density estimation as well as support vector machines. © 2004 Elsevier B.V. All rights reserved.

Keywords: Reduced set density estimator (RSDE); Novelty detection; Binary classification

1. Introduction

Novelty detection (Roberts, 2000; Schölkopf et al., 2001; Campbell and Bennett, 2001), one-class classification (Tax and Duin, 1999) or outlier detection (Wang et al., 1997; Barnett and Lewis, 1977; Anderson, 1958) is a problem of significant theoretical (Schölkopf et al., 2001) and practical interest. The parametric statistical approaches to outlier detection, which impose the strong assumption that the data follow a Gaussian distribution, are hypothesis tests based on a form of $T^2$-statistic (Anderson, 1958; Barnett and Lewis, 1977).

* Corresponding author. Tel.: +44-141-330-8628; fax: +44-141-330-8627. E-mail addresses: [email protected] (C. He), [email protected] (M. Girolami).

By relaxing the Gaussian assumption, semi-parametric approaches have been proposed which combine mixture modelling, extreme value theory (Roberts, 2000) and the bootstrap (Wang et al., 1997) in defining tests of the outlying nature of a newly observed datum. Methods for outlier testing based on the support vector method have also been proposed, and have been shown to be highly effective in cases where there is a paucity of data drawn from the underlying distribution and/or where an associated density function may not exist (Tax and Duin, 1999; Schölkopf et al., 2001; Campbell and Bennett, 2001). This paper considers the case where data scarcity is not a constraint and where the continuous distributional characteristics of the data suggest the existence of a well formed density function. Such situations are quite the norm in the majority of real applications,
such as continuous monitoring of the condition of a machine or process; indeed, the reverse 'problem' is often experienced, with an overwhelming amount of data being logged. In situations where the volume of data to be processed is large, a semi-parametric mixture model can provide a reduced representation of the reference data sample, in the form of the estimated mixing coefficients and component sufficient statistics, for testing further observed data for novelty. Non-parametric approaches such as the K-nearest neighbour or Parzen window density estimators, on the other hand, require the full reference set at test time, which in such practical circumstances can be prohibitively expensive. The support vector approach to novelty detection and density estimation has also been observed to provide a sparse or condensed density representation (Vapnik and Mukherjee, 2000; Tax and Duin, 1999; Schölkopf et al., 2001).

A recently proposed reduced set density estimator (RSDE) (Girolami and He, 2003) addresses this problem by providing a kernel density estimator which employs only a small subset of the available data sample. It is optimal in the L2 sense in that the integrated squared error between the unknown true density and the RSDE is minimised in devising the estimator. Whilst requiring only $O(N^2)$ optimisation routines to estimate the required kernel weighting coefficients, the RSDE provides similar levels of accuracy and sparseness of representation to support vector machine (SVM) density estimation (Vapnik and Mukherjee, 2000), which requires $O(N^3)$ optimisation routines. An additional advantage of the RSDE is that no extra free parameters are introduced, such as regularisation terms (Weston et al., 1999), bin width (Holmström, 2000; Scott and Sheather, 1985) or number of nearest neighbours (Mitra et al., 2002), making it a very simple and straightforward way of obtaining a reduced set density estimator with accuracy comparable to that of the full sample Parzen density estimator.

Given these advantages for density estimation, it is of interest to investigate the performance achieved when the RSDE is applied to novelty detection and classification (the latter term is used throughout to mean two-class, i.e. binary, classification). After the RSDE density estimation method (Girolami and He, 2003) is introduced in Section 2, a statistical hypothesis testing based novelty (outlier) detector and a Bayesian classifier are devised in Sections 3 and 4 respectively. The performance of both the RSDE novelty detector and the RSDE classifier is assessed in Section 5. Experimental results indicate that the RSDE based novelty detector and binary classifier achieve statistically similar accuracy to those based on the full sample Parzen window density estimator while reducing computational costs for testing by 65–80% on average.

2. Reduced set density estimator

2.1. L2 distance based density estimation

Given a data sample $S = \{x_1, \ldots, x_N\} \subset \mathbb{R}^d$, the general form of a kernel density estimator is $\hat{p}(x; h, c) = \sum_{n=1}^{N} c_n K_h(x, x_n)$. For a given kernel of width $h$, the maximum likelihood estimation (MLE) criterion (McLachlan and Peel, 2000) can be employed to estimate the weighting coefficients subject to the constraints $\sum_n c_n = 1$ and $c_n \geq 0\ \forall n$, which yields $c_n = \frac{1}{N}\ \forall x_n \in S$, i.e. the Parzen window density estimator (Girolami and He, 2003). Alternative distance based criteria have been considered for the purposes of density estimation when employing mixture models (Scott, 1999). In particular the L2 criterion, based on the integrated squared error (ISE), has been investigated as a robust error criterion which is less influenced by outliers in the sample and by model mismatch than the MLE criterion (Scott, 1999). For a density estimate with parameters $\theta$, denoted $\hat{p}(x; \theta)$, the argument which minimises the ISE, defined as $\int_{\mathbb{R}^d} |p(x) - \hat{p}(x; \theta)|^2\, dx$, can be written as

$\hat{\theta} = \arg\min_{\theta} \int_{\mathbb{R}^d} \hat{p}^2(x; \theta)\, dx - 2 E_{p(x)}\{\hat{p}(x; \theta)\}$    (1)


where $E_{p(x)}\{\cdot\}$ denotes expectation with respect to the unknown density $p(x)$.

2.2. Plug-in estimation of weighting coefficients

An unbiased estimate of the right-hand expectation in (1) can be obtained as a $c_i$-weighted sum of full Parzen density estimates $\hat{p}_h(x_i)$ at each point $x_i$. The left-hand term $\int_{\mathbb{R}^d} \hat{p}^2(x; \theta)\, dx$ can be computed in the quadratic form $\sum_{i,j=1}^{N} c_i c_j C(x_i, x_j)$, where $C(x_i, x_j) = \int_{\mathbb{R}^d} K_h(x, x_i) K_h(x, x_j)\, dx$. Combining both terms, the minimisation of a plug-in estimate of the ISE (1) for a kernel density estimator $\hat{p}(x; \theta) = \hat{p}(x; h, c)$ can be written as a constrained quadratic optimisation (refer to Girolami and He (2003) for further details), which in familiar matrix form is

$\arg\min_{c}\ \tfrac{1}{2} c^T C c - c^T p \quad \text{subject to} \quad c^T \mathbf{1} = 1 \ \text{and} \ c_i \geq 0\ \forall i$    (2)

where $C$ and $K$ denote the $N \times N$ matrices with elements $C(x_i, x_j)$ and $K_h(x_i, x_j)$ respectively. The $N \times 1$ vector of Parzen density estimates at each sample point, $\hat{p}_h(x_i) = \frac{1}{N} \sum_{j=1}^{N} K_h(x_i, x_j)$, is written $p = K \mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times 1$ vector whose elements all equal $\frac{1}{N}$. The above minimisation of a plug-in estimate of the ISE for a general kernel density estimator yields a sparse representation in the weighting coefficients (refer to Girolami and He (2003) for a detailed discussion). The minimisation specified by (2) can be solved by standard constrained quadratic programming, which has $O(N^3)$ scaling. Girolami and He (2003) proposed appropriate forms of multiplicative updating (Sha et al., 2002) and sequential minimal optimisation (SMO) (Schölkopf et al., 2001) for (2), which further reduce the scaling to $O(N^2)$. The RSDE was shown to provide a sparse representation and to reduce the computational cost of full Parzen window density estimation without degrading accuracy (Girolami and He, 2003). In the following sections, RSDE based novelty detection and binary classification approaches are devised.
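For concreteness, the construction above can be sketched in a few lines of Python. The sketch assumes a spherical Gaussian kernel, for which the convolution $\int K_h(x, x_i) K_h(x, x_j)\, dx$ is itself a Gaussian kernel of width $\sqrt{2}h$, and solves (2) with a generic SLSQP routine rather than the paper's $O(N^2)$ multiplicative-update or SMO schemes; all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(X, Y, h):
    """Normalised spherical Gaussian kernel matrix K_h(x_i, y_j)."""
    d = X.shape[1]
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2.0)

def rsde_weights(X, h, tol=1e-6):
    """Plug-in ISE minimisation (2): min_c 0.5 c'Cc - c'p, c >= 0, sum(c) = 1."""
    N = len(X)
    K = gauss_kernel(X, X, h)
    p = K.mean(axis=1)                        # p = K 1_N: Parzen estimate at each x_i
    C = gauss_kernel(X, X, np.sqrt(2.0) * h)  # Gaussian-kernel closed form for C
    res = minimize(
        lambda c: 0.5 * c @ C @ c - c @ p,
        np.full(N, 1.0 / N),                  # start from the Parzen weights
        jac=lambda c: C @ c - p,
        bounds=[(0.0, None)] * N,
        constraints=[{"type": "eq", "fun": lambda c: c.sum() - 1.0}],
        method="SLSQP",
    )
    c = np.where(res.x < tol, 0.0, res.x)     # the minimiser is typically sparse
    return c / c.sum()

def rsde_density(x_new, X, c, h):
    """Evaluate the estimate on the reduced set only (the RSDE pay-off)."""
    keep = c > 0
    return gauss_kernel(np.atleast_2d(x_new), X[keep], h) @ c[keep]
```

Only the points with non-zero $c_i$ need be stored and evaluated at test time, which is the source of the test-cost reductions reported in Section 5.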


3. Hypothesis testing and novelty detection

Novelty detection is, in most situations, characterised as the problem of identifying examples in a data sample which are possibly discordant with the rest of the sample (Barnett and Lewis, 1977). Such a problem can be posed as a statistical hypothesis test in the following manner: a finite sample of instances of a random vector $x \in \mathbb{R}^d$ has a probability density such that the sample $\{x_1, \ldots, x_N\}$ is independently and identically distributed as $p(x; \theta)$, where $\theta$ denotes the parameters of the appropriate distributional form. For an additional example $x_{N+1}$, the null hypothesis is that sample $N+1$ is drawn from the same distribution, i.e. $x_{N+1} \sim p(x; \theta)$. The alternative hypothesis is that the example is drawn from another distribution characterised by, for example, a different location parameter. In the absence of any prior information about the alternative distribution, note that an $\alpha$-level significance test will define a region, say $C_d$, bounded by a constant level of the density $p(x; \theta)$. The alternative distribution therefore has finite support, and the least committal (maximum entropy) distributional form is the uniform distribution over the region of support. The alternative hypothesis is written $x_{N+1} \sim U_S$, where $U_S$ denotes the uniform distribution over the region of support $S_d = \mathbb{R}^d \setminus C_d$. Formally, for a given data sample $x_1, \ldots, x_N$, the test for a new point $x_{N+1}$ is $H_0: x_{N+1} \in C_d$ vs. $H_1: x_{N+1} \notin C_d$. The standard test statistic employed is the likelihood ratio (Anderson, 1958)

$\lambda = \dfrac{\sup_{\theta_{N+1}} \prod_{n=1}^{N+1} p(x_n; \theta)}{\sup_{\theta_N} U_S \cdot \prod_{n=1}^{N} p(x_n; \theta)}.$    (3)

For the case where $p(x; \theta)$ is multivariate normal with mean $\mu$ and covariance $C$, whose $N$-sample estimates are $\hat{\mu}_N$ and $\hat{C}_N$, the test statistic emerging from the above likelihood ratio takes the form of a modified $T^2$ statistic (Anderson, 1958). Considering an additional $(N+1)$th example $x_{N+1}$, it can be shown that the associated test statistic $D^2 = \left(\frac{N}{N+1}\right)^2 (x_{N+1} - \hat{\mu}_N)^T \hat{C}_N^{-1} (x_{N+1} - \hat{\mu}_N)$ is related to a central $F$-distribution with $d$ and $N-d$ degrees of freedom by $F = (N-d)(N+1)D^2 / [dN^2 - (N+1)dD^2]$. As the null hypothesis states that all the points, including the $(N+1)$th sample, have a common mean and covariance, the point is rejected at the $\alpha$ significance level if $F > F_{\alpha; d, N-d}$. The $\alpha$-level test defines an elliptical region bounded by the value of $D^2$ corresponding to the $\alpha$ significance level given the estimated parameters $\hat{\mu}_N$ and $\hat{C}_N$. Whilst appealing to the central limit theorem to justify an assumption of data normality provides an elegant closed-form null distribution on which to threshold a novelty (outlier) detector, this strong assumption on the parametric form of the data distribution is often too restrictive in practical applications (Schölkopf et al., 2001; Tax and Duin, 1999).

For an arbitrary, non-Gaussian density the likelihood ratio test statistic no longer has a closed-form representation of its null distribution. However, asymptotically $\lambda \to p(x_{N+1} \mid \hat{\theta}_N) + O_p(N^{-1})$, so the test statistic which emerges is simply the probability density estimate, based on the original $N$ samples, of the test point (Wang et al., 1997). The distribution of the test statistic under the null hypothesis can be estimated using the bootstrap (Efron and Tibshirani, 1993), and the critical values $\lambda_{\mathrm{crit}}$ which define specific significance levels of the test can be established from the numerically obtained empirical distribution (Wang et al., 1997). Employing this test statistic to provide an $\alpha$-level significance test of data novelty requires an estimate of the probability density $p(x; \theta)$. In Section 5 we employ the RSDE, and compare it with the Parzen window density estimation method, to provide the required test statistic for a novelty detector. In addition, the support vector data description method (Tax and Duin, 1999; Schölkopf et al., 2001), which is specifically designed for one-class classification (novelty detection), is also employed for comparison.
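The bootstrap calibration of $\lambda_{\mathrm{crit}}$ can be sketched as follows, here for the plain Parzen statistic; the loop imitates the procedure of rebuilding the density estimate on each bootstrap reference set (gauss_kernel is the helper sketched in Section 2, and the names are ours).

```python
import numpy as np

def novelty_threshold(X, h, alpha=0.05, B=1000, seed=0):
    """Bootstrap the null distribution of the test statistic -- the estimated
    density of an (N+1)-th datum -- and return the alpha-level critical value."""
    rng = np.random.default_rng(seed)
    N = len(X)
    stats = np.empty(B)
    for b in range(B):
        ref = X[rng.integers(0, N, size=N)]   # bootstrap N-sample reference set
        x_new = X[rng.integers(0, N)]         # a further datum from the sample
        stats[b] = gauss_kernel(x_new[None, :], ref, h).mean()
    return np.quantile(stats, alpha)          # flag x as novel if p_hat(x) < this
```

A new point $x_{N+1}$ is then declared novel at level $\alpha$ whenever its estimated density falls below the returned threshold.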

4. Bayesian classification

So far we have discussed using the RSDE or other density estimators to build a novelty (outlier) detector for one-class classification. In this section the multi-class situation is considered. Classically, for multi-class classification it is desired to predict the posterior probability of membership of each class given the input $x$. To obtain a probabilistic classifier from a density estimator we train an estimator $\hat{p}_c(x; \theta) = \hat{p}(x; \theta \mid c)$ for each class $c$ and apply Bayes' rule to obtain the posterior probability of class membership

$\hat{P}(c \mid x; \theta) = \dfrac{\hat{p}(x; \theta \mid c) \hat{P}(c)}{\sum_{c'} \hat{p}(x; \theta \mid c') \hat{P}(c')}$,    (4)

and the test sample $x$ is assigned to the class with the largest posterior probability. When applying the RSDE as the estimator to build the classifier, parameters $\theta = \{h, c\}$ are estimated for each class. During training, the kernel width (free parameter) $h$ is tuned by cross-validation on the classification error of the validation set, and the weighting coefficients $c$ are obtained by optimising (2) over the training samples. The same procedure is used to choose the $h$ value of the Parzen window estimator.
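A minimal sketch of the classifier just described, with each class density supplied as a fitted function (for instance the rsde_density of Section 2 with that class's weights); the class name and its methods are ours.

```python
class DensityBayesClassifier:
    """Bayes' rule (4) on top of per-class density estimates."""

    def __init__(self, class_densities, priors):
        # class_densities: {label: callable, x -> estimated p(x | class)}
        # priors: {label: P(class)}, e.g. empirical class frequencies
        self.class_densities = class_densities
        self.priors = priors

    def posteriors(self, x):
        joint = {c: float(f(x)) * self.priors[c]
                 for c, f in self.class_densities.items()}
        z = sum(joint.values())
        return {c: v / z for c, v in joint.items()}

    def predict(self, x):
        post = self.posteriors(x)
        return max(post, key=post.get)  # class with the largest posterior
```

With two classes this is exactly the binary classifier evaluated in Section 5.2: fit one RSDE (or Parzen window) per class, tune each width $h$ by cross-validated classification error, and predict by (4).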

5. Experiments

5.1. Novelty detection experiments

In this section the RSDE and the Parzen window (PW) non-parametric density estimators are employed to provide the required test statistic for the novelty detector defined by (3). The novelty detection results are then compared with those of the support vector data description (SVDD) method, which was specifically designed for one-class classification (novelty detection) (Tax and Duin, 1999). The distribution of the test statistic under the null hypothesis is obtained by using individual bootstrap samples to define the $N$-sample reference data set on which the density estimate is built; a further single $(N+1)$th datum is then drawn from the available data sample and the test statistic for the bootstrap sample, $\hat{p}_b(x_{N+1}; \hat{h})$, is computed.


5.1.1. Handwritten digits dataset

The handwritten digits dataset¹ contains 200 examples of each of the digits 0–9, with six different feature sets available: Fourier (76-dimensional), Profile (216-dimensional), Karhunen-Loève (64-dimensional), Pixel (240-dimensional), Zernike (47-dimensional) and Morphological (6-dimensional). In (Tax and Duin, 1999), the Fourier, Profile, Zernike and Pixel feature sets were investigated individually. In the following experiment, the Fourier, Zernike and Morphological feature sets are combined into a single feature set with greater sample diversity than any individual one. Each digit 0–9 in turn is chosen as the 'normal class' against which all other digits are measured for novelty. Repeating the approach taken in (Tax and Duin, 1999), the 200 samples of the target object are split into a training set and a test set (for evaluating the false rejection performance of the novelty detector) of 100 samples each. The remaining 1800 samples of all other digits are used as outlier test objects (for evaluating the false acceptance performance of the novelty detector). Since only 100 target objects are available for parameter estimation, a 5-dimensional principal component subspace, which retains over 90% of the variance in the data, is used in this experiment. The RSDE and the Parzen window density estimators were used to estimate the density of the feature set for each target digit, and the null distribution of the test statistic was then obtained by a 1000-step bootstrap (Wang et al., 1997). The kernel width for the Parzen window density estimate was selected by 10-fold cross-validation; the kernel width for the RSDE was selected by 10-fold estimates of the integrated squared error against the Parzen window estimator. By setting different thresholds (false rejection rates) to define significance level tests of 1%, 5%, 10%, 15% and 25%, the integrated receiver operating characteristic (ROC) errors (Metz, 1978; Tax and Duin, 1999) are calculated and shown in Table 1, in which RR indicates the RSDE's remaining sample ratio, i.e. the percentage of the size of the reduced set obtained by the RSDE relative to the size of the original reference set used by the full Parzen window estimator.

¹ Multiple Features Dataset: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mfeat/
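The width selection just described can be read as choosing $h$ on held-out data; the sketch below scores each candidate width by 10-fold held-out log-likelihood, which is one reasonable reading of the cross-validation used here (the criterion and names are our assumptions, and gauss_kernel is the Section 2 helper).

```python
import numpy as np

def parzen_width_cv(X, widths, k=10, seed=0):
    """Pick a Parzen kernel width by k-fold cross-validated log-likelihood."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for h in widths:
        ll = 0.0
        for f in range(k):
            held = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            dens = gauss_kernel(X[held], X[train], h).mean(axis=1)
            ll += np.log(dens + 1e-300).sum()   # guard against log(0)
        scores.append(ll)
    return widths[int(np.argmax(scores))]
```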

Table 1
Integrated ROC errors (1–25%) of the novelty detection tests on the handwritten digits dataset

Class      RSDE (%)   PW (%)   SVDD (%)   RR (%)
0          3.23       3.44     2.14       4
1          1.90       5.44     10.93      75
2          2.19       6.68     4.06       13
3          3.52       10.37    6.94       59
4          3.28       8.20     8.59       30
5          2.69       7.44     2.94       20
6          5.19       9.01     7.71       22
7          0.82       2.17     8.32       65
8          1.44       2.05     5.46       70
9          5.14       9.13     7.89       11
Average    2.94       6.39     6.50       36.9
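For concreteness, the integrated ROC error reported above can be computed roughly as follows; this is our reading of the metric (averaging the outlier acceptance rate over the stated false rejection levels), a sketch rather than the authors' exact code.

```python
import numpy as np

def integrated_roc_error(target_dens, outlier_dens,
                         levels=(0.01, 0.05, 0.10, 0.15, 0.25)):
    """Average false-acceptance rate of outliers over fixed false-rejection
    levels on the target class; densities come from the fitted estimator."""
    errs = []
    for a in levels:
        thr = np.quantile(target_dens, a)          # rejects a fraction a of targets
        errs.append(np.mean(outlier_dens >= thr))  # outliers wrongly accepted
    return float(np.mean(errs))
```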

The last row of the table shows the performance of each method averaged over all 10 target classes. The results of the one-class classifier SVDD are also listed for comparison. In Table 1 the RSDE novelty detector shows the best performance on this single split for almost all 10 target classes, and has superior novelty detection performance for every target class compared with the Parzen window estimator, which (Tax and Duin, 1999) reported to provide the best overall performance for well sampled data after extensive testing of various 'one-class classifiers' on a large number of diverse data sets. On average, the RSDE achieves better performance than the Parzen window and the SVDD on this data set, while reducing the sample size by approximately 63%; that is, only about 37% of the target sample is employed in the novelty assessment of all other samples. The RSDE has previously been shown to produce smoother density estimates than the Parzen window density estimator, which can overfit (Girolami and He, 2003), and this accounts for the larger test error of the Parzen window. The performance of the SVDD is poorer because no estimate of a density is made. The above results provide only some indication of the significance of the measured performance; as they give no indication of the variability in the performance figures, a more extensive study is presented in Section 5.1.2, where statistical tests are used to give more complete comparisons.

5.1.2. Rätsch's real dataset collections

Further extensive experiments were executed on a subset of Rätsch's real dataset collections comprising the data sets Banana, Titanic, Diabetes, Breast Cancer and German. Utilised in (Rätsch et al., 2001), these data sets are available at http://ida.first.gmd.de/~raetsch, which provides 100 training and test splits of two classes for each data set. In our experiments the first 10 splits were used. Within each split, $N_1^R$ training samples from the first class were utilised as the reference set to train the RSDE, Parzen window and SVDD novelty detectors; $N_1^V$ test samples from the same class were utilised as the validation set to evaluate the false rejection performance; and all $N_2$ training and test samples from the second class were measured for novelty. Experimental results are shown in Table 2, where the integrated ROC errors (1–25%) of the three novelty detectors over the 10 splits are reported as mean and standard deviation (STD) values. As with the previous set of results, the RSDE performance appears superior on average. The distribution of test errors is Gaussian (assessed using the Jarque–Bera test (Judge et al., 1988) ($\alpha = 0.01$)), so a T-test is used to compare performance between the different novelty detectors. Although in Table 2 the RSDE novelty detector continues to show the lowest test errors (mean) for all data sets among the three methods investigated,
the T-test results indicate that there is no significant difference ($\alpha = 0.01$) between the performance of the RSDE and the Parzen window approaches on any of the five data sets. However, on average the sample size utilised for testing by the RSDE approach is only 19.87% of that used by the Parzen window approach; i.e. the RSDE novelty detector reduces the test computational cost by approximately 80%. Apart from Banana, there is also no statistically significant difference ($\alpha = 0.01$) between the performance of the RSDE and SVDD approaches on the other data sets. Plotting the data points of the Banana data set (whose reference set is distributed approximately in a ring) reveals a large number of outliers located inside the boundary defined by the support points obtained by the SVDD, which results in a very high false acceptance rate. Because the SVDD only defines a closed boundary around the reference set rather than estimating the density, it has limitations for data sets with particular distributions such as Banana. Furthermore, unlike the RSDE approach, whose sample points used for testing (and hence sample reduction rate) are fixed across all significance levels, those of the SVDD are variable and depend on the significance level set for the test. Therefore, whenever the significance level changes, the SVDD has to be rerun, which is inconvenient when controlling test levels.
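The statistical comparison just described is straightforward to reproduce; a sketch, assuming per-split error arrays for two detectors and a paired test across the 10 splits (the pairing is our assumption).

```python
from scipy import stats

def compare_detectors(errors_a, errors_b, alpha=0.01):
    """Jarque-Bera normality check on each error sample, then a paired
    t-test on the per-split errors of the two detectors."""
    for e in (errors_a, errors_b):
        if stats.jarque_bera(e).pvalue < alpha:
            raise ValueError("errors not plausibly Gaussian; t-test unsuitable")
    t, p = stats.ttest_rel(errors_a, errors_b)
    return p < alpha   # True if the difference is significant at level alpha
```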

Table 2
Novelty detection results of Rätsch's data collections

Data set        N_1^R   N_1^V   N_2    d       Integrated ROC error (mean ± STD)%                  RR (%) (mean ± STD)
                                               RSDE           PW             SVDD
Banana          183     2741    2376   2       4.46 ± 1.39    5.60 ± 0.88    12.21 ± 1.97          13.52 ± 2.18
Titanic         107     1383    711    3       7.46 ± 2.12    7.89 ± 4.65    7.92 ± 2.49           66.15 ± 6.88
Diabetes        298     202     268    8       10.15 ± 3.18   14.56 ± 4.06   10.88 ± 2.24          2.21 ± 0.88
Breast cancer   142     54      81     9       10.01 ± 3.49   13.81 ± 4.21   10.85 ± 4.18          11.18 ± 1.27
German          483     217     300    20 (a)  12.64 ± 5.05   13.80 ± 3.57   13.28 ± 2.92          6.28 ± 0.90

(a) In the experiments, the 20-dimensional German data set was reduced to 7 dimensions by applying the Kolmogorov–Smirnov test to remove dimensions without discriminative power.


5.2. Two-class classification experiments

It is well known that class-conditional density estimation may not be optimal for the purpose of classification (Vapnik, 1998). However, to make the investigation more complete, this section further assesses the RSDE's performance for binary classification.

5.2.1. Ripley's synthetic data

To illustrate the classification results visually, Ripley's synthetic 2-D data (Ripley, 1996) are employed to test the performance of the density estimation based two-class classifier specified in Section 4. Ripley's data comprise two classes with 250 training samples and 1000 test samples in total, generated from mixtures of two Gaussians with the classes overlapping to the extent that the Bayes error is around 8%. In the experiments, the RSDE and the Parzen window density estimators are utilised and compared, with kernel widths selected using the same method as in the novelty detection experiments. The classification results for the training set are shown in Fig. 1.

Fig. 1. Left-hand plot: RSDE density estimation based classifier, whose reduced set samples are encircled; right-hand plot: Parzen density estimation based classifier. In both cases the decision boundary is shown as a thick solid line, and iso-contours of the density estimates for both classes are shown.

The test error for the RSDE is 9.2%, only slightly better than the 9.4% of the Parzen window, but the RSDE has a remarkable advantage in test complexity: only 13 samples in the reduced set are needed for the RSDE classifier, whereas all 250 training samples are required for the Parzen window classifier.

5.2.2. Rätsch's real dataset collections

Further experiments with the RSDE classifier are carried out on the subset of Rätsch's real dataset collections used in the novelty detection experiments of Section 5.1.2. RSDE classification results are shown in Table 3, where the test results of the SVM classifier are quoted from Rätsch's online repository and the Parzen window classifier is employed for comparison. The average numbers of non-zero points (points used for testing) of the Parzen window and the RSDE classifiers over the 10 splits are also listed.

Table 3
Classification experimental results on real data sets

Data set        N     d    Test error (mean ± STD)%                       Non-zero points
                           RSDE           PW             SVM              RSDE    PW
Banana          400   2    11.18 ± 0.62   10.82 ± 0.48   11.68 ± 0.79     65.7    400
Titanic         150   3    22.75 ± 0.41   22.16 ± 0.42   22.10 ± 0.61     77.3    150
Diabetes        468   8    28.53 ± 1.81   25.80 ± 1.66   23.50 ± 1.49     5.4     468
Breast cancer   200   9    30.65 ± 4.51   26.49 ± 3.07   28.57 ± 4.29     7.9     200
German          700   20   28.80 ± 1.91   23.80 ± 2.64   22.50 ± 1.41     28      700


As in Section 5.1.2, the Jarque–Bera test and T-test are carried out to compare the test errors of the three classifiers. For all data sets, the T-test ($\alpha = 0.01$) results show no significant difference between the RSDE and the Parzen window classifiers, while the SVM is statistically superior on the Diabetes and German data sets. Meanwhile, on average the RSDE classifier reduces test computational costs by approximately 76%.

6. Discussion

This section offers some intuition as to why the RSDE achieves better relative performance in novelty detection than in binary classification. In novelty detection, an outlier is tested by setting a threshold on the target density estimate. Because the reduced set obtained by the RSDE represents the target density well, being optimal in the L2 sense with respect to the true distribution, the RSDE based novelty detector achieves good results. In the classification case, the distance between the two classes is more important than exactly how an individual class is distributed. The SVM classifier, which concentrates on measuring the distance between the two classes, therefore generally achieves better classification performance than finite-sample density estimate based classifiers such as the Parzen window and RSDE classifiers.

7. Conclusion

This paper investigated the application of the recently proposed reduced set density estimator (RSDE) (Girolami and He, 2003) to novelty detection and binary classification, and provided empirical comparisons with the Parzen window and SVM (Vapnik and Mukherjee, 2000; Tax and Duin, 1999) approaches. Experimental results indicate that the RSDE based novelty detector and binary classifier both achieve statistically similar accuracy to those based on the full-sample Parzen window density estimator while reducing computational costs for testing by 65–80% on average. The RSDE density estimation based novelty detector also outperforms the non-density-based SVDD (Tax and Duin, 1999) method, which has limitations on some particular data sets and increases test computational cost and inconvenience when performing tests at different significance levels. The distance-measure based SVM classifier (Vapnik and Mukherjee, 2000) shows slightly superior performance to the two density estimate based classifiers considered.

Acknowledgements

This work is supported by Scottish Higher Educational Funding Council Research Development grant 'INCITE' (http://www.incite.org.uk). A full Matlab implementation of the RSDE and example data sets are available at http://cis.paisley.ac.uk/giro-ci0/reddens.

References

Anderson, T., 1958. An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Barnett, V., Lewis, T., 1977. Outliers in Statistical Data. Wiley, New York.
Campbell, C., Bennett, K., 2001. A linear programming approach to novelty detection. In: Leen, T.K. et al. (Eds.), Advances in Neural Information Processing Systems, vol. 13. MIT Press, pp. 395–401.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Girolami, M., He, C., 2003. Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Machine Intell. 25 (10), 1253–1264.
Holmström, L., 2000. The error and the computational complexity of a multivariate binned kernel density estimator. J. Multivariate Anal. 72 (2), 264–309.
Judge, G.G., Hill, R.C., Griffiths, W.E., et al., 1988. Introduction to the Theory and Practice of Econometrics. Wiley, New York.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Metz, C., 1978. Basic principles of ROC analysis. Semin. Nucl. Med. 8 (4), 283–298.
Mitra, P., Murthy, C., Pal, S., 2002. Density based multiscale data condensation. IEEE Trans. Pattern Anal. Machine Intell. 24 (6), 734–747.
Rätsch, G., Onoda, T., Müller, K., 2001. Soft margins for AdaBoost. Mach. Learn. 42 (3), 287–320.
Ripley, B., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.
Roberts, S., 2000. Extreme value statistics for novelty detection in biomedical signal processing. IEE Proc. Sci. Technol. Meas. 47 (6), 363–367.
Schölkopf, B., Platt, J., Shawe-Taylor, J., 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13, 1443–1471.
Scott, D., 1999. Remarks on fitting and interpreting mixture models. Comput. Sci. Stat. 31, 104–109.
Scott, D., Sheather, S., 1985. Kernel density estimation with binned data. Commun. Stat.: Theor. Meth. 14, 1353–1359.
Sha, F., Saul, L., Lee, D.D., 2002. Multiplicative updates for non-negative quadratic programming in support vector machines. Technical Report MS-CIS-02-19, University of Pennsylvania.
Tax, D., Duin, R., 1999. Support vector data description. Pattern Recognition Lett. 20 (11–13), 1191–1199.
Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.
Vapnik, V., Mukherjee, S., 2000. Support vector method for multivariate density estimation. In: Solla, S. et al. (Eds.), Advances in Neural Information Processing Systems. MIT Press, pp. 659–665.
Wang, S., Woodward, W., Gray, H., et al., 1997. A new test for outlier detection from a multivariate mixture distribution. J. Comput. Graph. Stat. 6, 285–299.
Weston, J., Gammerman, A., Stitson, M.O., et al., 1999. Support vector density estimation. In: Advances in Kernel Methods. MIT Press.