Pattern Recognition Letters 24 (2003) 2479–2487 www.elsevier.com/locate/patrec
Modified support vector novelty detector using training data with outliers

Li Juan Cao a,*, Heow Pueh Lee a, Wai Keong Chong b

a Institute of High Performance Computing, #01-01 The Capricorn, Singapore Science Park II, 1 Science Park Road, Singapore 117528, Singapore
b The Singapore MIT Alliance, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260, Singapore

Received 14 August 2002; received in revised form 10 March 2003
Abstract

This paper proposes the modified support vector novelty detector (SVND) for novelty detection, which addresses the problem of detecting outliers among normal data patterns. While the original SVND [Neural Comput. 13 (2001) 1443] attempts to estimate a function that separates the region of normal data patterns from that of outliers using the normal data patterns alone, the modified SVND generalizes it to take into account the outliers in the training set by separating both the normal data patterns and the outliers from the origin with maximal margin. Experiments on several artificial and real data sets show a significant improvement in the performance of the modified SVND in comparison with the original SVND. Furthermore, the original SVND is sensitive to the outliers, with its performance deteriorating when outliers are included in the training set.
© 2003 Published by Elsevier B.V.

Keywords: Novelty detection; Support vector novelty detector (SVND)
* Corresponding author. Present address: Singapore Management University, 469 Bukit Timah Road, Singapore 259756, Singapore. Tel.: +65-6822-0801. E-mail address: [email protected] (L.J. Cao).

1. Introduction

Novelty detection addresses the problem of detecting outliers among otherwise normal data patterns. The outliers could be caused by abnormal behavior of a system or by measurement errors in a feature vector. Usually, there are
far fewer outliers than normal data patterns. Many approaches are available for novelty detection, such as the Gaussian mixture model (Lauer, 2001), the Parzen density estimation model (Parzen, 1962) and the nearest neighbor method (Ridder et al., 1998). They are all based on the idea of first estimating the distribution of the normal data patterns and then judging a new data pattern by its distribution level. Following the same idea, the support vector novelty detector (SVND) was recently developed. The first SVND was proposed by Tax and Duin (1999); it estimates a sphere that contains all the normal data
patterns with the smallest radius. The ``smallest radius'' means that the outliers will lie outside this sphere; the outliers can therefore be identified by calculating the distance of a data pattern to the center of the sphere. An alternative SVND is proposed by Schölkopf et al. (2001). Instead of a sphere, a function is estimated that separates the region of normal data patterns from that of the outliers with maximal margin, thus detecting the outliers. As demonstrated in (Schölkopf et al., 2001), the two SVNDs, based on the sphere and on the separating function, are very close in spirit.

One limitation of the distribution estimation methods is that the outliers are assumed to have distribution levels different from those of the normal data patterns. For example, in Schölkopf's SVND all the outliers are assumed to lie in a region different from that of the normal data patterns. This may hold for simple problems, but for complex problems some outliers can have distribution levels similar to those of the normal data patterns, which means that outliers may lie inside the region of the normal data patterns. The distribution estimation methods will fail to identify such outliers. To improve the performance of SVND, Tax (2001) proposes incorporating the known outliers into the SVND algorithm to improve the separation of the normal data patterns and the outliers. By requiring not only that the normal data patterns lie inside the sphere but also that the available outliers lie outside it, Tax finds that a more efficient description of the sphere can be obtained.

The purpose of this paper is to generalize Schölkopf's SVND to take the available outliers into account and thereby improve the separation of the normal data patterns and the outliers, especially in regions where the outliers lie close to the normal data patterns. A generalization of Schölkopf's SVND that exploits the outliers has also been discussed in (Hayton et al., 2000), where the function is estimated by separating the normal data patterns from the center of the outliers with maximal margin. This paper instead proposes exploiting the outliers by separating both the normal data patterns and the outliers from the origin with maximal margin. Using this simple idea, a more efficient separation can be obtained by the modified SVND than by the original SVND, as illustrated in Fig. 1.

Fig. 1. The separating function. Normal data patterns are represented using circles. Outliers are represented using stars. The dotted line is the solution of the original SVND. The solid line is the solution of the modified SVND.

This paper is organized as follows. Section 2 introduces the basic theory of the original SVND. The modified SVND is described in Section 3. Section 4 presents the experimental results, followed by conclusions in the last section.
2. Support vector novelty detector

Given a set of training data patterns $\{X_i\}_{i=1}^{l}$, most of which are normal, the idea of SVND is to estimate a function which takes positive values in the region of normal data patterns and negative values in the region of outliers. For this purpose, the original input space $X$ is first mapped into a high dimensional feature space $\phi(X)$, and the high dimensional feature vectors $\{\phi(X_i)\}_{i=1}^{l}$ are then separated from the origin with maximal margin. The estimated function takes the following form:

$$
f(X) = \mathrm{sign}\bigl(W \cdot \phi(X) - b\bigr)
\tag{1}
$$

To separate $\{\phi(X_i)\}_{i=1}^{l}$ from the origin with maximal margin, $W$ and $b$ are estimated by
$$
\begin{aligned}
\text{minimize:}\quad & \frac{1}{2}\|W\|^2 + C\sum_{i=1}^{l}\xi_i - b \\
\text{subject to:}\quad & W\cdot\phi(X_i) - b \ge -\xi_i, \quad i = 1,\ldots,l \\
& \xi_i \ge 0
\end{aligned}
\tag{2}
$$

In (2), the first term $\frac{1}{2}\|W\|^2$ is the regularization term, which attempts to maximize the margin. The second term $C\sum_{i=1}^{l}\xi_i$ is the training error, which penalizes the data patterns lying on the negative side of the function. $C$ is a trade-off between the regularization term and the training error. By introducing Lagrange multipliers $\alpha_i$ and exploiting the optimality constraints, the decision function (1) takes the following explicit form:

$$
f(X) = \mathrm{sign}\Bigl(\sum_{i=1}^{l}\alpha_i K(X_i, X) - b\Bigr)
\tag{3}
$$

2.1. Lagrange multipliers and support vectors

In (3), $\alpha_i$ is the so-called Lagrange multiplier. It is non-negative and is obtained by solving the dual problem of (2), which has the following form:

$$
\begin{aligned}
\text{minimize:}\quad & R(\alpha_i) = \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j K(X_i, X_j) \\
\text{subject to:}\quad & \sum_{i=1}^{l}\alpha_i = 1 \\
& 0 \le \alpha_i \le C, \quad i = 1,\ldots,l
\end{aligned}
\tag{4}
$$

The linearly constrained quadratic programming (QP) problem (4) can be solved using a modified version of the sequential minimal optimization (SMO) algorithm, as described in (Schölkopf et al., 2001). According to the Karush–Kuhn–Tucker (KKT) conditions (Kuhn and Tucker, 1951), only some of the $\alpha_i$ in (3) assume non-zero values. The training data patterns associated with them are referred to as support vectors. According to (3), only the support vectors determine the decision function, since the other training data patterns have $\alpha_i = 0$.

2.2. Kernel function

$K(X_i, X_j)$ is defined as the kernel function. The value of each element of $K$ is equal to the inner product of two high dimensional feature vectors $\phi(X_i)$ and $\phi(X_j)$, that is, $K(X_i, X_j) = \phi(X_i)\cdot\phi(X_j)$. The use of the kernel function provides considerable computational efficiency, as the calculations of $\phi(X_i)\cdot\phi(X_j)$ are all replaced with $K(X_i, X_j)$. It also has the additional advantage of keeping the mapping $\phi(X)$ of $X$ implicit. Any function satisfying Mercer's condition (Mercer, 1909) can be used as the kernel function. The most commonly used kernel function is the Gaussian function $K(X_i, X_j) = e^{-(X_i - X_j)^2/\sigma^2}$, where $\sigma^2$ is the kernel parameter. For a new test data pattern $Z$, the value of $f(Z)$ is calculated according to (3). If $f(Z)$ is equal to 1, $Z$ is classified as a normal data pattern; otherwise, if $f(Z)$ is equal to $-1$, it is classified as an outlier.
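As an aside for readers who wish to experiment with the original SVND without implementing the QP solver, the one-class SVM available in scikit-learn implements Schölkopf's formulation; note that it is parameterized by ν rather than by the C of (2)–(4), so the following minimal Python sketch is only an approximate stand-in for the method above, and the data in it are placeholders.

```python
# A minimal sketch of the original SVND using scikit-learn's one-class SVM,
# which implements Schölkopf et al. (2001). It uses the nu parameterization
# instead of the C of Eqs. (2)-(4); nu plays a comparable trade-off role.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=2.0, size=(250, 2))   # normal patterns only (placeholder data)

sigma2 = 1.0                                               # Gaussian kernel parameter sigma^2
detector = OneClassSVM(kernel="rbf", gamma=1.0 / sigma2, nu=0.05)
detector.fit(X_train)

Z = np.array([[0.5, -0.3], [6.0, 6.0]])                    # new test data patterns
print(detector.predict(Z))                                 # +1 = normal data pattern, -1 = outlier
```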
3. Modified support vector novelty detector
When outliers are available, the modified SVND uses them to improve the estimation. In contrast with the original SVND, which separates only the normal data patterns from the origin with maximal margin, the modified SVND separates both the normal data patterns and the outliers from the origin with maximal margin. Suppose the training set $\{X_i\}_{i=1}^{l}$ consists of $k$ ($k < l$) normal data patterns and $(l - k)$ outliers. Using the same functional form (1), $f(X)$ is estimated by

$$
\begin{aligned}
\text{minimize:}\quad & \frac{1}{2}\|W\|^2 + C_1\sum_{i=1}^{k}\xi_i + C_2\sum_{i=k+1}^{l}\xi_i - b \\
\text{subject to:}\quad & W\cdot\phi(X_i) - b \ge -\xi_i, \quad i = 1,\ldots,k \\
& W\cdot\phi(X_i) - b \le \xi_i, \quad i = k+1,\ldots,l \\
& \xi_i \ge 0, \quad i = 1,\ldots,l
\end{aligned}
\tag{5}
$$

In (5), the second term $C_1\sum_{i=1}^{k}\xi_i$ is the training error of the normal data patterns, and the third term $C_2\sum_{i=k+1}^{l}\xi_i$ is the training error of the outliers. $C_1$ and $C_2$ are, respectively, the penalty for the error of the
first type (normal data patterns classified as outliers) and the error of the second type (outliers classified as normal data patterns). Let the normal data patterns be labeled $y_i = 1.0$ and the outliers be labeled $y_i = -1.0$; then (5) can be rewritten as

$$
\begin{aligned}
\text{minimize:}\quad & \frac{1}{2}\|W\|^2 + C_1\sum_{i=1}^{k}\xi_i + C_2\sum_{i=k+1}^{l}\xi_i - b \\
\text{subject to:}\quad & y_i\bigl(W\cdot\phi(X_i) - b\bigr) \ge -\xi_i, \quad i = 1,\ldots,l \\
& \xi_i \ge 0
\end{aligned}
\tag{6}
$$

By introducing the Lagrange multipliers $\alpha_i$ and exploiting the optimality constraints, the dual problem of (6) has the following form:

$$
\begin{aligned}
\text{minimize:}\quad & R(\alpha_i) = \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j y_i y_j K(X_i, X_j) \\
\text{subject to:}\quad & \sum_{i=1}^{l}\alpha_i y_i = 1 \\
& 0 \le \alpha_i \le C_1, \quad i = 1,\ldots,k \\
& 0 \le \alpha_i \le C_2, \quad i = k+1,\ldots,l
\end{aligned}
\tag{7}
$$
It can be observed that when $k = l$ (there is no outlier in the training set), (7) reduces to (4) by setting all $y_i$ to 1.0 and letting $C_1 = C$. The linearly constrained QP problem (7) can also be solved by modifying the standard SMO, as described in Appendix A. After obtaining the optimal $\alpha_i$ from (7), $f(X)$ is calculated by

$$
f(X) = \mathrm{sign}\Bigl(\sum_{i=1}^{l}\alpha_i y_i K(X_i, X) - b\Bigr)
\tag{8}
$$

For a new test data pattern $Z$, the value of $f(Z)$ is calculated according to (8). The same rule as in the original SVND is used to determine the class label of $Z$, as described in Section 2.
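To make the optimization in (7) concrete, the following sketch solves the dual with a generic convex QP solver (cvxopt) instead of the modified SMO of Appendix A, and recovers the offset b from a non-bound support vector in the spirit of (A.10) and (A.11). The helper names and the choice of cvxopt are illustrative assumptions; the implementation used in this paper was written in VC++ (see Section 4.2).

```python
# Sketch: solve dual (7) of the modified SVND with a generic QP solver (cvxopt),
# then recover the offset b from a non-bound support vector. Illustrative only;
# the paper itself uses the modified SMO of Appendix A.
import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X, Y, sigma2):
    # K(Xi, Xj) = exp(-||Xi - Xj||^2 / sigma2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def fit_modified_svnd(X, y, C1, C2, sigma2):
    """X: (l, d) training patterns; y: numpy array with +1 for normal, -1 for outlier."""
    l = len(y)
    K = gaussian_kernel(X, X, sigma2)
    P = matrix(np.outer(y, y) * K)                      # quadratic term of (7)
    q = matrix(np.zeros(l))
    A = matrix(y.reshape(1, -1).astype(float))          # equality: sum_i alpha_i y_i = 1
    b_eq = matrix(1.0)
    upper = np.where(y > 0, C1, C2)                     # box constraints 0 <= alpha_i <= C1 or C2
    G = matrix(np.vstack([np.eye(l), -np.eye(l)]))
    h = matrix(np.hstack([upper, np.zeros(l)]))
    sol = solvers.qp(P, q, G, h, A, b_eq)
    alpha = np.array(sol["x"]).ravel()
    O = K @ (alpha * y)                                 # O_i = sum_j alpha_j y_j K(X_j, X_i)
    non_bound = (alpha > 1e-6) & (alpha < upper - 1e-6)
    b = np.mean(O[non_bound])                           # b = O_i at a non-bound SV, cf. (A.10)/(A.11)
    return alpha, b

def predict(X_train, y_train, alpha, b, Z, sigma2):
    # decision function (8): sign(sum_i alpha_i y_i K(X_i, Z) - b)
    Kz = gaussian_kernel(Z, X_train, sigma2)
    return np.sign(Kz @ (alpha * y_train) - b)
```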
4. Experiment

4.1. Performance measure criterion

To compare the original SVND and the modified SVND, the receiver operating characteristic (ROC) curve (Zou et al., 1997) is used to illustrate the results. In the ROC curve, the x-axis represents the percentage of outliers in the testing set that are correctly identified, called the true negative (TN), and the y-axis represents the percentage of normal test data patterns that are correctly classified, called the true positive (TP). A perfect ROC curve is the solid line AB illustrated in Fig. 2, where for any given value of TP the value of TN is always equal to 1.

Fig. 2. A perfect ROC curve AB.

To generate the ROC curve, $C$ and the Gaussian kernel parameter $\sigma^2$ must be chosen appropriately. The following two-step procedure is used to choose the two free parameters. Firstly, as discussed in (Tax, 2001), $C$ can be approximated by

$$
C \le \frac{1}{N_{SV}}
\tag{9}
$$
where $N_{SV}$ is the number of support vectors lying on the negative side of the function. The best value of $C$ is chosen by trial and error. If there is no outlier in the training set, the maximal value of $C$ can be chosen as 1.0. In the modified SVND, $C_1$ is also chosen according to (9) together with experimental investigation, while $C_2$ is determined by estimating the error of the second type. After determining $C$ in the original SVND and $C_1$ and $C_2$ in the modified SVND, $\sigma^2$ is then varied from 0.0001 to 1000, multiplying by 10 at each step, to obtain the ROC curve.
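A minimal sketch of this sweep is given below; it assumes the hypothetical fit_modified_svnd and predict helpers sketched after Section 3 and simply records a (TN, TP) point for each value of σ².

```python
# Sketch of the sigma^2 sweep used to trace the ROC curve: for each kernel
# width, train the detector and record the true-negative and true-positive
# rates on the test set. Assumes the hypothetical helpers from Section 3.
import numpy as np

def roc_points(X_tr, y_tr, X_te, y_te, C1, C2):
    points = []
    for sigma2 in [10.0 ** p for p in range(-4, 4)]:       # 0.0001 ... 1000
        alpha, b = fit_modified_svnd(X_tr, y_tr, C1, C2, sigma2)
        pred = predict(X_tr, y_tr, alpha, b, X_te, sigma2)
        tp = np.mean(pred[y_te == 1] == 1)                 # normal patterns kept
        tn = np.mean(pred[y_te == -1] == -1)               # outliers rejected
        points.append((tn, tp, sigma2))
    return points
```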
4.2. Artificial data set

An artificial data set is examined in the first series of experiments. This data set consists of 500 normal bivariate data patterns generated from a Gaussian distribution with zero mean vector and diagonal covariance matrix with both entries equal to 4, and 100 bivariate outliers generated from a Gaussian distribution with mean vector having both entries equal to 4 and diagonal covariance matrix with both entries equal to 4. Fig. 3 illustrates this data set, where the normal data patterns are represented using circles and the outliers are represented using stars. Obviously, there is an overlap between the normal data patterns and the outliers.

Fig. 3. Artificial data set. The normal data patterns are represented using circles and the outliers are represented using stars.

The experimental setup is as follows: 250 normal data patterns and 20 outliers, randomly selected from the whole data set, are used as the training set. The remaining 250 normal data patterns and 80 outliers are used as the testing set. Furthermore, to investigate the influence of outliers on the original SVND, the training set with the outliers removed is also used to train the original SVND. In both the original SVND and the modified SVND, the Gaussian function is used as the kernel function. According to the procedure described in Section 4.1, the value of $C$ in the original SVND is chosen as 0.05 for the training set with the outliers included and 1.0 for the training set without the outliers. In the modified SVND, $C_1$ and $C_2$ are both set to 0.05. The modified versions of SMO described in (Schölkopf et al., 2001) and in Appendix A are implemented in this experiment, and the program is developed in VC++.

The obtained ROC curves are illustrated in Fig. 4. As described in Section 4.1, the closer a curve is to the ideal ROC curve AB, the better the performance. Fig. 4 shows that the original SVND using the training set with the outliers included performs worse than that using the training set without the outliers. The reason is that there is an overlap between the normal data patterns and the outliers, and the inclusion of the outliers in the training set makes the estimation of the function less accurate. However, the modified SVND, by making use of the outliers, performs the best among all the methods.

Fig. 4. The obtained ROC curves on the artificial data set.
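The artificial data set described above is straightforward to reproduce; a possible numpy sketch (random seed and shuffling chosen arbitrarily) is:

```python
# Sketch reproducing the artificial data set of Section 4.2: 500 normal
# patterns from N((0,0), 4I) and 100 outliers from N((4,4), 4I), split into
# a 250+20 training set and a 250+80 testing set. The seed is arbitrary.
import numpy as np

rng = np.random.default_rng(42)
normal = rng.multivariate_normal(mean=[0, 0], cov=4 * np.eye(2), size=500)
outlier = rng.multivariate_normal(mean=[4, 4], cov=4 * np.eye(2), size=100)

idx_n = rng.permutation(500)
idx_o = rng.permutation(100)
X_train = np.vstack([normal[idx_n[:250]], outlier[idx_o[:20]]])
y_train = np.hstack([np.ones(250), -np.ones(20)])           # +1 normal, -1 outlier
X_test = np.vstack([normal[idx_n[250:]], outlier[idx_o[20:]]])
y_test = np.hstack([np.ones(250), -np.ones(80)])
```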
4.3. Biomed data set

The Biomed data set obtained from the StatLib data archive (Campbell and Bennett, 2001) is also investigated. This data set consists of 127 normal data patterns and 67 outliers. Each data pattern is composed of 4 features corresponding to measurements made on blood samples. For this data set, 80 normal data patterns and 10 outliers are used as the training set. The remaining data patterns are used as the testing set.
Fig. 5. The obtained ROC curves on the Biomed data set.
The training set with the outliers removed is also used to train the original SVND to see the influence of the outliers. The Gaussian function is again used as the kernel function. For this data set, the value of $C$ in the original SVND is set to 0.1 for the training set without the outliers and 0.05 for the training set with the outliers included. In the modified SVND, $C_1$ and $C_2$ are both set to 0.1. The obtained ROC curves are illustrated in Fig. 5. The same conclusion is reached as for the artificial data set: the original SVND using the training set with the outliers included performs worse than that using the training set without the outliers, and a better performance is obtained by the modified SVND, which makes use of the outliers.
4.4. UCI data set

The Breast-Cancer-Wisconsin, Monk-1 and Pima-Indians-Diabetes data sets taken from the UCI machine learning dataset repository (Murphy and Aha, 1992) are studied next. All three data sets belong to two-class classification problems. For use in novelty detection, in each data set the dominant class is used as the normal data patterns and the other class is used as the outliers. The details of the training and testing sets are given in Table 1. The Gaussian function is used as the kernel function for all types of SVND. The values of $C$, $C_1$ and $C_2$ used in each data set are also given in Table 1. The obtained ROC curves are illustrated in Figs. 6–8. In all three data sets, the original SVND using the training set with the outliers included performs worse than that using the training set without the outliers, and the modified SVND has the best performance among all the methods.

Fig. 6. The obtained ROC curves on the Breast-Cancer-Wisconsin data set.
Table 1
Details of the training and testing sets and the parameters (n denotes the total number of data patterns)

Name               n     Training             Testing              C (with      C (without     (C1, C2)
                         Normal   Outliers    Normal   Outliers    outliers)    outliers)
Breast-Wisconsin   683   222      50          222      189         0.01         0.1            0.5, 0.5
Monk-1             432   182      18          182      50          0.01         0.5            0.01, 0.01
Pima-Diabetes      768   250      68          250      200         0.05         0.1            0.1, 0.1
Fig. 7. The obtained ROC curves on the Monk-1 data set.
Fig. 8. The obtained ROC curves on the Pima-Indians-Diabetes data set.

5. Conclusion

This paper proposes a modified SVND to deal with outliers in the training set. The key idea of the modified SVND is to separate both the normal data patterns and the outliers from the origin with maximal margin. This is different from the original SVND, which separates only the normal data patterns from the origin with maximal margin. By examining artificial and real data sets, the experiments demonstrate that the modified SVND can obtain a more accurate estimation of the function than the original SVND, eventually resulting in higher generalization performance in detecting the outliers.

The experiments also show that the original SVND is sensitive to the outliers in the training set, with the performance deteriorating when outliers are used in the training set. This is because the original SVND is established on the assumption that the outliers and the normal data patterns lie in different regions. In reality, however, the outliers can lie close to the normal data patterns or even inside the region of the normal data patterns. As such, the inclusion of the outliers in the training set makes the estimation of the function less accurate, thus deteriorating the generalization performance of the original SVND.

One limitation of this paper is that, in the generation of the ROC curve, the parameters of SVND are chosen based on a two-step procedure. Future work will address the development of a structured approach for choosing the optimal parameters of SVND for any given values of TP and TN.

Appendix A. Modified SMO

The linearly constrained QP problem (7) is solved by modifying the standard SMO, as described below. The basic idea of the standard SMO is to break a large QP problem into a series of smallest possible QP problems. Since a linear equality constraint must be obeyed, the minimum number of $\alpha_i$ that can be optimized at each step is two. As only two $\alpha_i$ are optimized at a time, the solution can be obtained analytically. There are also different heuristics for choosing the two $\alpha_i$ for optimization.
A.1. Analytically solving for the two Lagrange multipliers

Let the two multipliers be $\alpha_1$ and $\alpha_2$. In order to solve for $\alpha_1$ and $\alpha_2$, the bounds on these multipliers need to be computed first. Let $s = y_1 y_2$ and $\alpha_1 + s\alpha_2 = \gamma$. According to the constraints in (7), the bounds $L$ and $H$ for $\alpha_2$ are given in Table 2.

Table 2
Bounds for $\alpha_2$

              y1 = 1                          y1 = -1
y2 = 1        L = max(γ - C1, 0)              L = max(-γ, 0)
              H = min(γ, C1)                  H = min(C2 - γ, C1)
y2 = -1       L = max(-γ, 0)                  L = max(γ - C2, 0)
              H = min(C1 - γ, C2)             H = min(γ, C2)
Let $K_{ij}$ denote $K(X_i, X_j)$. $R(\alpha_i)$ can then be written as

$$
\begin{aligned}
R(\alpha_1, \alpha_2) = {} & \frac{1}{2}\alpha_1^2 K_{11} + \frac{1}{2}\alpha_2^2 K_{22} + \alpha_1\alpha_2 s K_{12} \\
& + \alpha_1 y_1 \sum_{j=3}^{l}\alpha_j y_j K_{1j} + \alpha_2 y_2 \sum_{j=3}^{l}\alpha_j y_j K_{2j} \\
& + \sum_{i=3}^{l}\sum_{j=3}^{l}\alpha_i\alpha_j y_i y_j K_{ij}
\end{aligned}
\tag{A.1}
$$

According to $\alpha_1 + s\alpha_2 = \gamma$, (A.1) can be expressed in terms of $\alpha_2$:

$$
\begin{aligned}
R(\alpha_2) = {} & \frac{1}{2}(\gamma - s\alpha_2)^2 K_{11} + \frac{1}{2}\alpha_2^2 K_{22} + (\gamma - s\alpha_2)\alpha_2 s K_{12} \\
& + (\gamma - s\alpha_2) y_1 S_1 + \alpha_2 y_2 S_2 + S
\end{aligned}
\tag{A.2}
$$

where $S_i = \sum_{j=3}^{l}\alpha_j y_j K_{ij}$ and $S = \sum_{i=3}^{l}\sum_{j=3}^{l}\alpha_i\alpha_j y_i y_j K_{ij}$. As $S$ is independent of $\alpha_2$, and also $s^2 = 1$ and $s y_1 = y_2$, the derivative of $R(\alpha_2)$ with respect to $\alpha_2$ is equal to

$$
\frac{dR}{d\alpha_2} = -s(\gamma - s\alpha_2)K_{11} + \alpha_2 K_{22} + (\gamma - s\alpha_2)s K_{12} - \alpha_2 K_{12} - y_2 S_1 + y_2 S_2
\tag{A.3}
$$

Setting $dR/d\alpha_2 = 0$, $\alpha_2$ is solved by

$$
\alpha_2 = \frac{\gamma s(K_{11} - K_{12}) + y_2(S_1 - S_2)}{\eta}
\tag{A.4}
$$

where $\eta = K_{11} + K_{22} - 2K_{12}$. Let $O_i = W\cdot\phi(X_i) = \sum_{j=1}^{l}\alpha_j y_j K_{ij}$; thus $O_1$ and $O_2$ are equal to

$$
O_1 = \alpha_1 y_1 K_{11} + \alpha_2 y_2 K_{12} + S_1
\tag{A.5}
$$

$$
O_2 = \alpha_1 y_1 K_{12} + \alpha_2 y_2 K_{22} + S_2
\tag{A.6}
$$

Substituting (A.5) and (A.6) into (A.4), (A.4) can be expressed in terms of $O_1$ and $O_2$:

$$
\alpha_2 = \alpha_2^{\mathrm{old}} + \frac{y_2(O_1 - O_2)}{\eta}
\tag{A.7}
$$

Then, $\alpha_2$ is clipped according to the bounds described in Table 2:

$$
\alpha_2^{\mathrm{new}} =
\begin{cases}
H & \text{if } \alpha_2 \ge H \\
\alpha_2 & \text{if } L < \alpha_2 < H \\
L & \text{if } \alpha_2 \le L
\end{cases}
\tag{A.8}
$$

Finally, $\alpha_1$ can be calculated from $\alpha_2^{\mathrm{new}}$:

$$
\alpha_1^{\mathrm{new}} = \alpha_1^{\mathrm{old}} + s\bigl(\alpha_2^{\mathrm{old}} - \alpha_2^{\mathrm{new}}\bigr)
\tag{A.9}
$$

According to the KKT conditions, the threshold $b$ can be calculated as follows.

If $y_1 = 1$ and $0 < \alpha_1^{\mathrm{new}} < C_1$, or $y_1 = -1$ and $0 < \alpha_1^{\mathrm{new}} < C_2$:

$$
b = O_1
\tag{A.10}
$$

If $y_2 = 1$ and $0 < \alpha_2^{\mathrm{new}} < C_1$, or $y_2 = -1$ and $0 < \alpha_2^{\mathrm{new}} < C_2$:

$$
b = O_2
\tag{A.11}
$$
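To make the analytic step concrete, a minimal sketch of the two-multiplier update (A.7)–(A.9), with the bounds of Table 2 and the threshold update (A.10)/(A.11), is given below. Variable names are illustrative; the degenerate case η ≤ 0 is simply skipped, and the threshold is re-evaluated with the updated multipliers.

```python
# Sketch of the analytic SMO step for the modified SVND dual (7):
# bounds from Table 2, unconstrained update (A.7), clipping (A.8),
# update of alpha_1 (A.9) and of the threshold b (A.10)/(A.11).
def smo_step(i1, i2, alpha, y, K, C1, C2, b):
    l = len(alpha)
    s = y[i1] * y[i2]
    gamma = alpha[i1] + s * alpha[i2]                # alpha_1 + s*alpha_2 stays constant
    bound = lambda i: C1 if y[i] == 1 else C2        # box bound depends on the label

    # Bounds L, H for alpha_2 (Table 2)
    if s == 1:
        L, H = max(gamma - bound(i1), 0.0), min(gamma, bound(i2))
    else:
        L, H = max(-gamma, 0.0), min(bound(i1) - gamma, bound(i2))
    if L >= H:
        return alpha, b                              # no feasible progress on this pair

    def O(i):                                        # O_i = sum_j alpha_j y_j K_ij
        return sum(alpha[j] * y[j] * K[j][i] for j in range(l))

    eta = K[i1][i1] + K[i2][i2] - 2.0 * K[i1][i2]
    if eta <= 0.0:
        return alpha, b                              # degenerate pair, skipped in this sketch
    a2 = alpha[i2] + y[i2] * (O(i1) - O(i2)) / eta   # unconstrained optimum (A.7)
    a2 = min(max(a2, L), H)                          # clip to [L, H], (A.8)
    a1 = alpha[i1] + s * (alpha[i2] - a2)            # (A.9)
    alpha[i1], alpha[i2] = a1, a2

    # Threshold from a non-bound multiplier, cf. (A.10)/(A.11),
    # re-evaluated here with the updated multipliers.
    if 0.0 < a1 < bound(i1):
        b = O(i1)
    elif 0.0 < a2 < bound(i2):
        b = O(i2)
    return alpha, b
```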
A.2. Heuristics for selecting $\alpha_1$ and $\alpha_2$

There are different heuristics for choosing $\alpha_1$ and $\alpha_2$. The following heuristics are used to choose $\alpha_2$:

(i) Scan over the entire data set to find a multiplier violating the KKT conditions. That is, $\alpha_2$ is a multiplier satisfying $\alpha_i > 0$ and $y_i(O_i - b) > 0$, or satisfying $\bigl[(y_i = 1$ and $\alpha_i < C_1)$ or $(y_i = -1$ and $\alpha_i < C_2)\bigr]$ and $y_i(O_i - b) < 0$. Once $\alpha_2$ is obtained, $\alpha_1$ is chosen using its own heuristic and the two are jointly optimized. After the optimization, the values of $\alpha_1$ and $\alpha_2$ are updated, and the search for $\alpha_2$ is continued over the rest of the data set.

(ii) Same as (i), but the scan is performed only over the multipliers with non-zero and non-bound values.

In the implementation, a single scan of (i) is followed by multiple scans of (ii) until there is no KKT violation among the multipliers with non-zero and non-bound values. This procedure is then repeated until there is no KKT violation among all the multipliers. The heuristic for choosing $\alpha_1$ is

$$
j = \arg\max_{n \in SV} |O_2 - O_n|
\tag{A.12}
$$

$$
SV = \bigl\{\, i : i = 1,\ldots,l,\;\; 0 < \alpha_i < C_1 \text{ if } y_i = 1,\;\; 0 < \alpha_i < C_2 \text{ if } y_i = -1 \,\bigr\}
\tag{A.13}
$$
That is, $\alpha_1$ is the multiplier with a non-zero and non-bound value having the largest absolute value of $O_2 - O_n$. If this heuristic cannot make positive progress, the other two heuristics used in the standard SMO (Platt, 1999a,b) are applied to obtain $\alpha_1$.

A.3. Initialization of $\alpha_i$ and $b$

To satisfy the linear equality constraint of (7), all $\{\alpha_i\}_{i=k+1}^{l}$ associated with outliers are initialized to 0. $\mathrm{Int}(1/C_1)$ of the $\{\alpha_i\}_{i=1}^{k}$ associated with normal data patterns are initialized to $C_1$. If $1/C_1$ is not an integer, one further $\alpha_i$ from $\{\alpha_i\}_{i=1}^{k}$ is initialized to the value $1 - \mathrm{Int}(1/C_1)\,C_1$. Moreover, $b$ is initialized to the value $\max\{\,O_i : i = 1,\ldots,k,\ \alpha_i > 0\,\}$.
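As a quick illustration of this initialization (function and variable names are hypothetical):

```python
# Sketch of the initialization of Appendix A.3: outlier multipliers start at 0,
# Int(1/C1) normal-pattern multipliers start at C1, one more takes the remainder
# so that sum_i alpha_i y_i = 1, and b starts at the largest O_i over the
# normal patterns with alpha_i > 0.
import math

def initialize(k, l, C1, K, y):
    alpha = [0.0] * l                          # outliers (i >= k) stay at 0
    m = int(math.floor(1.0 / C1))
    for i in range(min(m, k)):
        alpha[i] = C1
    remainder = 1.0 - m * C1
    if remainder > 0.0 and m < k:
        alpha[m] = remainder                   # makes the equality constraint exact
    O = [sum(alpha[j] * y[j] * K[j][i] for j in range(l)) for i in range(k)]
    b = max(O[i] for i in range(k) if alpha[i] > 0.0)
    return alpha, b
```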
References

Campbell, C., Bennett, K.P., 2001. Linear programming techniques for novelty detection. In: Advances in Neural Information Processing Systems 13.
Hayton, P., Schölkopf, B., Tarassenko, L., Anuzis, P., 2000. Support vector novelty detection applied to jet engine vibration spectra. In: NIPS, London, pp. 946–952.
Kuhn, H.W., Tucker, A.W., 1951. Nonlinear programming. In: Proc. 2nd Berkeley Symposium on Mathematical Statistics and Probabilistics, Berkeley, pp. 481–492.
Lauer, M., 2001. A mixture approach to novelty detection using training data with outliers. Lect. Notes Comput. Sci. 2167, 300–316.
Mercer, J., 1909. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A 209, 415–446.
Murphy, P.M., Aha, D.W., 1992. UCI Machine Learning Database Repository.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065–1076.
Platt, J.C., 1999a. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (Eds.), Advances in Kernel Methods––Support Vector Learning. MIT Press, pp. 185–208.
Platt, J.C., 1999b. Using analytic QP and sparseness to speed training of support vector machines. In: NIPS-11: Advances in Neural Information Processing Systems 11, Denver, Colorado.
Ridder, D.D., Tax, D.M.J., Duin, R.P.W., 1998. An experimental comparison of one-class classification methods. In: Proc. Fourth Annual Conf. of the Advanced School for Computing and Imaging. ASCI, Delft.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A.J., Williamson, R.C., 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13 (7), 1443–1472.
Tax, D.M.J., 2001. One-class classification. Ph.D. Thesis, Delft University of Technology.
Tax, D.M.J., Duin, R.P.W., 1999. Support vector domain description. Pattern Recognition Lett. 20 (11–13), 1191–1199.
Zou, K.H., Hall, W.J., Shapiro, D.E., 1997. Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic tests. Statist. Med. 16, 2143–2156.