Computational Statistics & Data Analysis 23 (1996) 153-168
Computing depth contours of bivariate point clouds

Ida Ruts (a), Peter J. Rousseeuw (b)
(a) Faculty of Applied Economic Sciences, UFSIA, Prinsstraat 13, B-2000 Antwerp, Belgium
(b) Department of Mathematics, UIA, Universiteitsplein 1, B-2610 Antwerp, Belgium
Abstract
In this paper we construct an exact algorithm for computing depth contours of a bivariate data set. For this we use the half-space depth introduced by Tukey. The depth contours form a nested collection of convex sets. The deeper the contour, the more robust it is with respect to outliers in the point cloud. The proposed algorithm has been implemented in a program called ISODEPTH, which needs little computation time and is illustrated on some real data examples. Finally, it is shown how depth contours can be used to construct robustified versions of classification techniques based on convex hulls.
Keywords: Algorithms; Classification; Graphical display; Ranks; Robustness
1. Introduction
In 1975, Tukey introduced the notion of half-space depth of a point relative to a multivariate data set. Let us first consider the univariate case. The depth of a point z relative to a one-dimensional data set X = {x_1, x_2, ..., x_n} is the minimum of the number of data points to the left of z and the number of data points to the right of z:

  depth_1(z; X) = min( #{i; x_i ≤ z}, #{i; x_i ≥ z} ).

The half-space depth of a point z ∈ ℝ^p relative to a p-dimensional data cloud X = {x_1, x_2, ..., x_n} is defined as the smallest depth of z in any one-dimensional
projection of the data set. The half-space depth can also be seen as the minimal number of data points contained in a closed half-space whose boundary plane passes through z. A survey of other depth functions can be found in Small (1990) and Niinimaa (1993).

The half-space depth is affine invariant: if z and the data X are translated or linearly transformed, the depth remains the same (Donoho, 1982; Donoho and Gasko, 1992). This implies that the concept is independent of the chosen coordinate system.

Depth is closely related to rank (Tukey, 1977), as can be seen in the univariate case. When we rank data points in one dimension, we give the extreme points (those with the lowest or highest rank) depth one. The data values with the second lowest and second highest rank have depth two, and so on. Since the median is halfway along the ordered list, the median is the point with maximal depth. In higher dimensions, the depth of a point gives an indication of how "deep" the point is inside the cloud. A point with maximal depth can then be thought of as a multidimensional median (Donoho, 1982; Donoho and Gasko, 1992).

Note that depth is not equivalent to density. The depth of z is a global notion, in the sense that it depends on the entire X, whereas the density at z is more local because it depends only on the points of X in a neighborhood of z. Also note that there exists a population version of depth, the properties of which are studied in Rousseeuw and Ruts (1997).

Consider for a data set X ⊂ ℝ^p the set

  D_k = {x ∈ ℝ^p; depth(x; X) ≥ k}.

The interior points of D_k have depth at least k, and the boundary points have depth equal to k. We call D_k the contour of depth k, although a stricter usage might reserve this phrase for the boundary of D_k. Note that D_k is the intersection of all half-spaces that contain at least n + 1 − k points of the cloud, hence it is convex. The different depth contours form a nested sequence, because D_{k+1} is contained in D_k. Note that the outermost contour D_1 is the usual convex hull of X, and points outside the convex hull of X have depth equal to zero. Some distribution theory for the sets D_k can be found in Eddy (1983, 1985). Depth contours of population distributions were studied by Massé and Theodorescu (1994).

The number of depth contours of a given data set X, and hence the maximal depth, depends on the shape of X. If the data set is nearly symmetric there can be as many as ⌈n/2⌉ depth contours, but there will be far fewer for highly asymmetric data sets. If X is in general position (meaning that no more than p points lie in any (p − 1)-dimensional subspace), the maximal depth lies between ⌈n/(p + 1)⌉ and ⌈n/2⌉ (Donoho, 1982; Donoho and Gasko, 1992). So for k greater than ⌈n/2⌉ the set D_k is empty. For p = 2, we know that D_k is always nonempty for k ≤ ⌈n/3⌉. From now on, we shall assume that 1 ≤ k ≤ ⌈n/3⌉.

This paper is organized as follows. In Section 2 we describe a primitive approach to actually compute bivariate depth contours. In Section 3 we develop a more refined algorithm. The performance of the latter is studied in Section 4, followed by some examples in Section 5. Finally, the last section explores applications of depth contours to classification problems.
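To make the definition concrete, here is a minimal brute-force sketch in Python (our illustration, not the O(n log n) algorithm of Rousseeuw and Ruts (1996) used later in the paper). It exploits the fact that the projected count #{i; ⟨x_i − z, u⟩ ≥ 0} only changes when the direction u is perpendicular to some x_i − z, so it suffices to evaluate the count on the arcs between these critical directions:

```python
import numpy as np

def halfspace_depth(z, X, eps=1e-9):
    """Brute-force bivariate half-space depth of z: the smallest number of
    points of X in a closed half-plane whose boundary line passes through z."""
    X = np.asarray(X, dtype=float)
    d = X - np.asarray(z, dtype=float)
    on_z = np.hypot(d[:, 0], d[:, 1]) < eps   # points coinciding with z lie in
    d = d[~on_z]                              # every closed half-plane through z
    if len(d) == 0:
        return int(on_z.sum())
    theta = np.arctan2(d[:, 1], d[:, 0])
    # Critical directions: perpendicular to some x_i - z.
    crit = np.sort(np.concatenate([theta + np.pi / 2, theta - np.pi / 2]) % (2 * np.pi))
    mids = (crit + np.roll(crit, -1)) / 2     # midpoints of the arcs in between
    mids[-1] += np.pi                         # wrap-around arc
    depth = len(d)
    for a in mids:
        u = np.array([np.cos(a), np.sin(a)])
        depth = min(depth, int(np.sum(d @ u >= 0)))
    return depth + int(on_z.sum())
```

For a roughly symmetric cloud, halfspace_depth at a central point is close to n/2, while any point outside the convex hull gets depth 0, in line with the properties above.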
2. Naive algorithm

A naive algorithm to compute depth contours of a bivariate data set proceeds as follows. As mentioned above, D_k is the intersection of a collection of half-planes, of which it can be assumed that their boundary line passes through two data points. Each edge of D_k is thus part of a line through two data points, and hence each vertex of the desired contour lies on the intersection of two such lines. So we can consider all lines through two data points and compute all their intersection points, of which there are O(n⁴). In each of these intersection points we then compute the half-space depth by means of an algorithm of Rousseeuw and Ruts (1996). This algorithm, which can also be used to compute the simplicial depth (Liu, 1990), combines geometric properties with certain sorting and updating mechanisms and achieves a time complexity of O(n log n). We only need to store the intersection points with depth equal to k; let us denote their number by N_k. The convex hull of these N_k points is the desired depth contour D_k. The vertices of the convex hull can be obtained by the algorithms of Eddy (1977) or Preparata and Hong (1977), which use a divide-and-conquer approach to provide an array with the indices of the vertices of the convex hull in counterclockwise order. The algorithm of Preparata and Hong (1977) takes O(N_k log N_k) time. In our case N_k < n⁴, so computing the convex hull takes at most O(n⁴ log n) operations. Finally, we can use a software package with graphics capabilities, such as GAUSS or S-PLUS, to draw the desired depth contour based on the ordered vertices of the convex hull. The computation of the depth in all of the O(n⁴) points brings the total time complexity of the naive algorithm to O(n⁵ log n). Therefore, we only used it for checking the results of faster algorithms for small n.
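The following Python sketch (our illustration, feasible only for very small n) spells out this naive procedure, reusing halfspace_depth from the sketch above; scipy's ConvexHull plays the role of the Preparata-Hong routine:

```python
from itertools import combinations
import numpy as np
from scipy.spatial import ConvexHull   # stands in for the Preparata-Hong routine

def naive_depth_contour(X, k):
    """O(n^5 log n) naive contour: intersect all pairs of lines through two
    data points, keep the intersections of depth exactly k, and return the
    vertices of their convex hull (counterclockwise).  Assumes k <= ceil(n/3)
    so that D_k is nonempty."""
    X = np.asarray(X, dtype=float)
    lines = []                                    # a*x + b*y = c through x_i, x_j
    for i, j in combinations(range(len(X)), 2):
        (x1, y1), (x2, y2) = X[i], X[j]
        a, b = y2 - y1, x1 - x2
        lines.append((a, b, a * x1 + b * y1))
    pts = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:                      # parallel or identical lines
            continue
        p = ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)
        if halfspace_depth(p, X) == k:            # brute-force depth from above
            pts.append(p)
    pts = np.array(pts)
    return pts[ConvexHull(pts).vertices]          # vertices of D_k
```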
3. Proposed algorithm

A more efficient algorithm can be obtained by means of some geometry. The depth region D_k is often referred to as the k-hull of X. A directed (oriented) line L is called a k-divider if it has at most k − 1 points of X strictly to its right and at most n − k points strictly to its left. A special k-divider is a k-divider which contains at least two data points. The closed half-plane to the left of (and including) a special k-divider is called a special half-plane. So a special half-plane contains at least n + 1 − k points of the cloud, and the k-hull D_k of X is the intersection of all the special half-planes of X. We apply this fact in our algorithm: in a first step we compute the special k-dividers, and then we compute the intersection of the corresponding special half-planes.
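In code, the defining property of a k-divider is a simple counting test; a minimal sketch (ours, using a cross-product sign test):

```python
import numpy as np

def is_k_divider(p, q, X, k, eps=1e-9):
    """Test whether the directed line from p to q is a k-divider of X:
    at most k-1 points strictly to its right and at most n-k points strictly
    to its left.  When p and q are two data points, this makes the line a
    special k-divider."""
    X = np.asarray(X, dtype=float)
    # cross > 0 means the point lies strictly to the left of the line p -> q
    cross = ((q[0] - p[0]) * (X[:, 1] - p[1])
             - (q[1] - p[1]) * (X[:, 0] - p[0]))
    return (np.sum(cross < -eps) <= k - 1) and (np.sum(cross > eps) <= len(X) - k)
```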
Circular sequences

The definition of special k-dividers resembles the notion of projection pursuit, and seems to indicate that we have to consider all one-dimensional projections of the data set in order to find the smallest univariate depth. However, it turns out that we can restrict the search to a finite number of directions. We achieve this by using the concept of circular sequences (Goodman and Pollack, 1980; Edelsbrunner, 1987, p. 29).

Consider a bivariate data set X = {x_1, x_2, ..., x_n} in general position and in which no two lines through two data points are parallel. Let L be a directed line which is not perpendicular to any line through two points of X. Then the orthogonal projection of X onto L determines a permutation of N = {1, 2, ..., n}. When L rotates counterclockwise, it defines a sequence of permutations of N in an obvious way. This periodic sequence of permutations of N is called a circular sequence. It has period n(n − 1). When L rotates over an angle of π, the resulting permutation of N is exactly the reverse of the initial permutation, so a rotation over a half-circle suffices to construct the entire circular sequence.

As we go from one permutation to the next, only a few "moves" are possible: two successive permutations of a circular sequence differ only by switching two adjacent numbers. The permutation changes whenever there exists a line through two data points which is perpendicular to the current direction of L. Indeed, suppose that the line through data points x_i and x_j is perpendicular to a line which forms an angle α with the horizontal axis. Just before our rotating line L reaches the angle α, suppose that i precedes j in the permutation of N (see Fig. 1(a)). If we would project on the direction which forms an angle α with the horizontal axis, the projections of x_i and x_j would coincide. When the rotating line passes this direction, the order of i and j will have changed: j then precedes i (see Fig. 1(b)).

The circular sequence allows us to detect the special k-dividers. If, when going from one permutation to the next, the values i and j trade places, and the switch occurs at places k and k + 1 or at n − k and n − k + 1, we know that the line through x_i and x_j is a special k-divider. Indeed, consider the two open half-planes with this line as their boundary. One of them then contains only k − 1 data points. Therefore, we can orient the line through x_i and x_j in such a way that these k − 1 points are to its right.
Fig. 1. In (a), i precedes j, but in (b), j precedes i, because L has passed the direction perpendicular to the line through x_i and x_j.
Step 1: In our algorithm, we start by testing whether the data points are in general position. The x-coordinates of the data points are stored in an array X, and the y-coordinates in an array Y. Both arrays are of length n. First, we test whether any two data points coincide. Naively, this would require O(n²) operations, but we can attain O(n log n) as follows. We sort the data points with respect to the x-coordinate by means of a standard O(n log n)-time subroutine. Then we consider the sorted points successively. Whenever two x-coordinates are equal, we first test whether the corresponding y-coordinates are also equal. If two coinciding points are detected, the program halts. Otherwise, we also test whether the next x-coordinate is equal to these two. If so, we have found three points which lie on a vertical line and the program also halts.

Next, we test whether any three data points are collinear. The naive approach would require O(n³) operations, but we can achieve O(n² log n) as follows. Let L_ij be the line through data points (X(I), Y(I)) and (X(J), Y(J)). First we compute the angles α_ij between L_ij and the horizontal axis. These angles all lie between 0 and π. We store them in an array ANGLE and sort them. We alter two arrays IND1 and IND2 in the same way, to keep track of the indices of the two data points that have yielded a certain angle. We consider the sorted angles successively, and whenever some angles are equal we test whether any two of the corresponding lines also have a point in common. To do this, we compare the indices in IND1 and IND2. Whenever we detect three collinear points, the program halts. (Note that we do not have to worry about situations where α_ij = α_i'j' and L_ij ∩ L_i'j' = ∅, because this would not pose a problem for the remainder of the algorithm.) Because we need to sort O(n²) angles, this first part of the program has time complexity O(n² log n) and storage O(n²).

Step 2: In a next step, we develop the circular sequence and detect all the special k-dividers in turn. At any time, we store the permutation which corresponds to the projection on the current direction of the rotating line L in an array NCIRQ of length n. We start with the projection on the horizontal axis, so in order to initialize NCIRQ, we consider the indices of the data points sorted with respect to the x-coordinate. This sorting operation was already done in Step 1 of the algorithm, where we also made sure that whenever two x-coordinates were equal, the point with the largest y-coordinate was put first. (Note that there can be at most two points with that x-coordinate, otherwise the data are not in general position.)

Now that NCIRQ contains the result of the orthogonal projection on the horizontal direction, we search for the first switch. First we look for lines through two data points which are perpendicular to the projection line as it rotates from 0 to ANGLE(1). This means that for every line L_ij we check whether 0 ≤ α' ≤ ANGLE(1), where α' equals α_ij − π/2 if α_ij > π/2 and α_ij + π/2 if α_ij < π/2. If so, we switch i and j in the array NCIRQ. To be able to do this, we have to know their positions in NCIRQ. For this purpose we have an array NRANK, also of length n: for each i, NRANK(i) gives the position of i in the array NCIRQ. Of course, if values are switched, we also have to update the array NRANK. Next, we check whether any α' lies between ANGLE(1) and ANGLE(2), then between ANGLE(2) and ANGLE(3), and so on. Finally, we search for lines perpendicular to directions between ANGLE(n(n − 1)/2) and the horizontal direction.
In the algorithm, a pointer JFLAG keeps track of the angle α' we have to test next. In this way we consider all the angles one by one, so updating the array NCIRQ requires O(n²) operations.

Each time we switch two numbers, say i and j, we check whether the switch occurs at positions k and k + 1 or at n − k and n − k + 1. If so, L_ij is a "candidate" edge of D_k, and we store the indices i and j in the arrays KAND1 and KAND2. Depending on whether α_ij lies between 0 and π/2 or between π/2 and π, and on whether the switch occurred at places k and k + 1 or at n − k and n − k + 1, we add π to the corresponding value in the array ANGLE. In this way, we give L_ij the correct orientation. We store the angles of the oriented special k-dividers in a new array ALPHA; these angles now lie between 0 and 2π. We also keep track of the equation of L_ij, by storing the constant term d from the equation

  sin(ALPHA(KOUNT)) x − cos(ALPHA(KOUNT)) y = d,

where KOUNT indicates the number of the currently found special k-divider. We store these d in an array D. Later on, we can then easily check whether a certain point (x₀, y₀) lies to the left of the line through the data points with indices KAND1(KOUNT) and KAND2(KOUNT) by comparing sin(ALPHA(KOUNT)) x₀ − cos(ALPHA(KOUNT)) y₀ to D(KOUNT). Each time we find a special k-divider, we increase KOUNT by 1, so when the rotation is concluded we know the total number of special k-dividers. We call this number NUM. It was proved by Erdős et al. (1973) that NUM is bounded by O(n√k), which yields O(n^{3/2}) in our case since k is at most O(n). The overall complexity of Step 2 thus remains O(n²), hence the total computation time to obtain the special k-dividers is still O(n² log n), due to the initial sort of the array ANGLE in Steps 1 and 2.

Step 3: Having found all the special k-dividers, we now compute the intersection of the corresponding special half-planes. This can be achieved by an algorithm like that of Shamos and Hoey (1976), which computes the intersection of N half-planes in O(N log N) operations. In our case this becomes at most O(n^{3/2} log n), which brings the total complexity of the algorithm to O(n² log n). Due to the unavailability of actual source code for the Shamos-Hoey algorithm, we proceeded in a rather different way.

In Fig. 2, all the special 3-dividers are shown for a bivariate data set of 10 points. The bold lines indicate the intersection D_3 of the special half-planes. We will find the vertices of D_k one by one by tracing the edges of the convex figure in a counterclockwise direction. First we sort the array ALPHA of the oriented lines and permute the arrays KAND1, KAND2 and D in the same way. We denote the line through the data points with indices KAND1(i) and KAND2(i) by L_i, and the intersection point of two k-dividers L_i and L_j (with i < j) by P_ij. Now suppose that the intersection point P_ij is a vertex of D_k. We then move in the direction defined by L_j until we encounter the first intersection point of L_j with the special k-dividers that form a greater angle with the horizontal direction than L_j does (see Fig. 3). The special k-divider which is responsible for this first intersection point defines the next direction to move in; in Fig. 3, this is L_{j+3}. The special k-dividers are oriented in such a way that we trace the convex contour in the counterclockwise direction.
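Before turning to the details of the tracing step, here is a compact Python sketch of the Step 2 sweep just described (our simplified stand-in for the NCIRQ/NRANK bookkeeping; it assumes distinct x-coordinates and no two parallel lines through data pairs, cases the actual program handles explicitly):

```python
from itertools import combinations
import numpy as np

def special_k_dividers(X, k):
    """Sweep the circular sequence of X (assumed in general position, with
    distinct x-coordinates and no two parallel connecting lines) and return
    the pairs (i, j) of data points that span special k-dividers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    events = []
    for i, j in combinations(range(n), 2):
        dx, dy = X[j] - X[i]
        alpha = np.arctan2(dy, dx) % np.pi        # angle of line L_ij in [0, pi)
        # projections of x_i and x_j coincide when the projection direction
        # is perpendicular to L_ij:
        events.append(((alpha + np.pi / 2) % np.pi, i, j))
    events.sort()
    order = list(np.argsort(X[:, 0]))             # NCIRQ: projection on the x-axis
    rank = {v: pos for pos, v in enumerate(order)}  # NRANK
    found = []
    for _, i, j in events:                        # rotating over a half-circle suffices
        pi_, pj = rank[i], rank[j]
        assert abs(pi_ - pj) == 1, "data not in general position"
        lo = min(pi_, pj)                         # 0-based; switch at places lo+1, lo+2
        if lo + 1 == k or lo + 1 == n - k:        # i.e. at (k, k+1) or (n-k, n-k+1)
            found.append((i, j))
        order[pi_], order[pj] = order[pj], order[pi_]
        rank[i], rank[j] = pj, pi_
    return found
```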
Fig. 2. Special 3-dividers and D_3 for 10 data points, generated according to a standard bivariate Gaussian distribution.
Fig. 3. To find D_k we subsequently move along the lines L_j, L_{j+3}, L_{j+4}, ...
At each step, we intersect the current line with the successive special k-dividers in KAND1 and KAND2 until we find the first intersection point of the current line with a k-divider with a greater ALPHA-value. So we need a way to test whether an intersection point is indeed the first. Let us go back to Fig. 3. If we are moving on L_j, we first test whether the intersection point P_{j,j+1} of L_j and L_{j+1} is the correct one. To do this, we check whether the point lies to the left of L_{j+2} by using the array D.
Fig. 4. Here P_{j,j+1} lies to the left of L_{j+2}, but P_{j,j+3} is the point where we want to stop moving along L_j.
If P_{j,j+1} lies to the right of L_{j+2}, we forget about moving on L_{j+1}, because P_{j,j+1} cannot possibly lie on D_k (as in Fig. 3). We then test whether P_{j,j+2} lies to the left of L_{j+3}, and so on. When eventually we find a point P_{j,m} which lies on the correct side of L_{m+1} (in Fig. 3 this is for m = j + 3), we cannot yet be sure that this is the point we are looking for. For instance, in Fig. 4 we see that P_{j,j+1} lies to the correct side of L_{j+2} and still it is not the point we are looking for. Therefore, we compute the depth of every intersection point which lies to the left of the next special k-divider, using the algorithm of Rousseeuw and Ruts (1996). If P_{j,m} has depth k, we have found the correct point and we can start moving on L_m; we then test whether P_{m,m+1} lies to the left of L_{m+2}, and so on. If P_{j,m} has depth lower than k, our search for the first intersection point on L_j continues: we intersect L_j and L_{m+1} and test whether P_{j,m+1} lies to the left of L_{m+2}, and so on. In the algorithm, two pointers IW1 and IW2 keep track of the special k-dividers which are currently being intersected.

When the intersection point of two special k-dividers happens to be a data point, we have to alter the procedure slightly. In that case, we test whether the intersection point lies to the left of the next special k-divider which does not contain the data point, because just testing whether the point lies to the left of the next special k-divider would be rather meaningless when it really lies on that line. If the test is affirmative, we will move along the last special k-divider which contains this data point. For instance, in Fig. 5 the current intersection point P_{m,m+1} is equal to the data point x_i. Instead of testing whether the point lies to the left of L_{m+2}, as we normally would, we now test whether it lies to the left of L_{m+4}, which is the first special k-divider not containing x_i. This test is affirmative, so the next edge to trace will be part of L_{m+3}, which is the last of the special k-dividers containing the data point.

Fig. 5. Here P_{m,m+1} lies to the left of L_{m+4}, and the next line to move on is L_{m+3}.

By means of the previous procedure, we obtain an array with the coordinates of the vertices of D_k in the correct order, which allows us to draw D_k. The above algorithm was implemented in a program called ISODEPTH, the code of which is available from the authors by e-mail ([email protected]). The program
can compute several contours at the same time, which may then be plotted using a graphical package.
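The half-plane intersection itself can also be sketched compactly. The following illustration (ours, not the ISODEPTH tracing procedure; it is O(NUM³) rather than O(NUM log NUM)) orients each special k-divider, intersects all pairs, and keeps the intersection points lying on or to the left of every special k-divider; it reuses special_k_dividers from the previous sketch and assumes 1 ≤ k ≤ ⌈n/3⌉ so that D_k is nonempty:

```python
from itertools import combinations
import numpy as np
from scipy.spatial import ConvexHull

def depth_contour(X, k, eps=1e-9):
    """Vertices of D_k in counterclockwise order, computed as the
    intersection of the special half-planes; a slow but simple stand-in
    for the tracing step of ISODEPTH."""
    X = np.asarray(X, dtype=float)
    hp = []                                           # half-plane a*x + b*y <= c
    for i, j in special_k_dividers(X, k):
        a, b = X[j, 1] - X[i, 1], X[i, 0] - X[j, 0]   # a normal of the line L_ij
        c = a * X[i, 0] + b * X[i, 1]
        if np.sum(a * X[:, 0] + b * X[:, 1] > c + eps) > k - 1:
            a, b, c = -a, -b, -c                      # flip: at most k-1 points outside
        hp.append((a, b, c))
    verts = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(hp, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:
            continue
        x, y = (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det
        # keep only intersection points inside every special half-plane
        if all(a * x + b * y <= c + eps for a, b, c in hp):
            verts.append((x, y))
    verts = np.array(verts)
    return verts[ConvexHull(verts).vertices]
```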
4. Performance of the algorithm

Table 1 lists some execution times of the naive algorithm and of the algorithm ISODEPTH proposed here. The times are in seconds on an 80486 PC, for various sample sizes n.

Table 1
Execution times (in seconds on an 80486 PC) of the naive algorithm and the proposed algorithm ISODEPTH, for sample size n

    n       naive     ISODEPTH
   10         0.2        0.01
   20         6.4        0.03
   30        56.4        0.06
   40       239.9        0.12
   50       773.0        0.19
   60      1946.2        0.27
   70      4301.0        0.37
   80      8602.4        0.51
   90                    0.63
  100                    0.84
  200                    3.65
  400                   16.39
  600                   38.56
  800                   70.65
 1000                  115.00
Fig. 6. Computation time of the algorithm ISODEPTH versus n² log n, both plotted on a cube root scale.
The observations were generated according to a standard bivariate Gaussian distribution, and the depth k was taken equal to ⌈n/3⌉. The smaller times (in particular, those of ISODEPTH) were obtained as averages over 100 runs. As mentioned before, the naive algorithm is O(n⁵ log n). For n ≤ 1000 the time of the proposed algorithm behaves as a multiple of n² log n. In Fig. 6 the execution time has been plotted against n² log n, both on a cube root scale, and the resulting points do lie on a straight line through the origin.
5. Examples

Let us look at some examples in which the depth contours provide information on the shape of bivariate data sets. In a first example, we consider the brain weight (in g) and the body weight (in kg) of 28 animals, given in Table 7 of Rousseeuw and Leroy (1987, p. 57). In Fig. 7 we plotted the logarithms (to the base 10) of these measurements, and the contours of depths 1 to 10. The contour of depth 1 is simply the convex hull of the whole data cloud. Most of the other depth contours have some corners which are not data points; these corners do occur on intersections of lines through two data points. Note that the shape of the outer depth contours, say for depths 1 to 3, is very different from the shape of the inner depth contours. The shape of the inner contours seems to be roughly elliptical, whereas the outer contours are affected by a few atypical points at the outskirts of the cloud. Note that for population distributions the inner contours are typically more "rounded" than the outer contours, but they are not necessarily elliptical (Rousseeuw and Ruts, 1997).
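Such a display can be reproduced with the sketches above and any plotting package; for instance (assuming a hypothetical array animals holding the 28 measurement pairs):

```python
import numpy as np
import matplotlib.pyplot as plt

# `animals` is assumed to be a (28, 2) array: (body weight in kg, brain weight in g)
XY = np.log10(animals)
plt.scatter(XY[:, 0], XY[:, 1], s=12, color="black")
for k in range(1, 11):                       # contours of depths 1 to 10
    V = depth_contour(XY, k)
    plt.fill(V[:, 0], V[:, 1], fill=False)   # closed polygon through the vertices
plt.xlabel("logarithmic body weight")
plt.ylabel("logarithmic brain weight")
plt.show()
```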
Fig. 7. Contours of depths 1 to 10, based on logarithmic brain weight versus logarithmic body weight of 28 animals.
Fig. 8. Contours of depths 1 to 8, in a plot of the weight and the cost of 23 single-engine aircraft.

A second example considers the weight (in pounds) and the cost (in units of $100,000) of 23 single-engine aircraft built over the years 1947-1979. The data were taken from Gray (1985). In Fig. 8 we plotted the contours of depths 1 to 8. In this example, we clearly see an outlier in the top right corner of the figure.
Fig. 9. Depth contours of two clusters. The points represent the concentrations of the isotopes 239Pu and 240Pu in 45 batches of plutonium.
Only the depth contour for depth 1 is influenced by this outlier; the inner contours are robust against such outliers.

It is also possible to display the result of a clustering algorithm graphically, by plotting the contours of the various clusters in a single display. We consider a real data set, provided by E. Trauwaert, from a Belgian factory of nuclear fuel. The data are concentrations of five isotopes in 45 batches of plutonium. Based on all the variables, the data set was clustered into two groups. In order to make a graphical display, we consider the concentrations of the two most important isotopes, namely 240Pu and 239Pu, listed in Table 2. For each of the two clusters, we plotted the different depth contours in Fig. 9. We see that the clusters do not overlap in this plot, and the depth contours allow us to visualize their shapes and relative position.
Table 2
Concentrations of the two isotopes in 45 batches of plutonium

  240Pu   239Pu     240Pu   239Pu     240Pu   239Pu
  21.204  75.804    18.640  69.147    25.100  61.182
  21.408  75.515    18.869  68.294    18.488  78.244
  21.668  75.175    18.122  71.076    18.629  78.166
  18.428  78.872    20.750  75.714    21.515  74.254
  20.223  73.317    20.345  76.150    17.872  79.840
  18.548  79.116    19.108  77.845    24.656  62.455
  21.162  75.751    22.754  62.382    18.285  73.189
  21.557  75.326    23.320  60.112    20.794  75.968
  24.493  63.287    23.128  60.519    20.867  75.957
  25.576  59.553    23.133  61.585    21.718  72.885
  25.719  58.688    23.239  61.332    21.721  72.907
  25.692  58.758    21.761  72.291    21.713  72.919
  25.146  59.728    21.429  73.451    20.225  76.089
  25.126  59.544    20.939  74.888    18.573  70.129
  25.128  59.877    23.603  60.507    18.633  69.273

Fig. 10. The new point x will be assigned to E_j if the shaded area between C_j and H(C_j, x) is minimal.

6. Applications to discriminant and cluster analysis

Depth contours can be used to construct a robustified version of the hypervolume classification technique, which was introduced by Baufays and Rasson (1985). Applications in image remote sensing are discussed in Rasson et al. (1993). Consider n points which are independently and uniformly distributed within an unknown domain E ⊂ ℝ^p. Assume that E is the union of g mutually disjoint convex sets E_1, E_2, ..., E_g. For each of these sets E_j we have a training set X_j at our disposal (1 ≤ j ≤ g). The aim is then to classify a new observation x. For each j, 1 ≤ j ≤ g, we denote the convex hull of X_j by C_j, and H(C_j, x) is the minimal homothetic
expansion of C_j (about its centroid) which contains the point x. The classification rule is: classify x into E_j if the difference between the volumes after and before the expansion is minimal. So x is assigned to the class E_j with the smallest value of Vol(H(C_j, x)) − Vol(C_j). Fig. 10 shows a picture for the bivariate case, where we want to minimize the shaded area between C_j and H(C_j, x).

We propose a modified version of this technique in which the convex hull C_j is replaced by a depth contour D_k of X_j with depth k = ⌊α(#X_j)⌋ (the integer part of α times the number of data points in X_j). The parameter α has to be chosen by the user. The new point x is then assigned to the class E_j that minimizes the difference between the volumes after and before the expansion of its depth contour. This method is more robust against outliers in X_j, because the latter affect the size and the shape of C_j (1 ≤ j ≤ g). If a data set contains m outliers, only the m outermost depth contours can be strongly affected by them (Donoho and Gasko, 1992).
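A small two-dimensional sketch of this proposal (our illustration, not the authors' implementation; the function names and the default α are hypothetical): the area of a contour follows from the shoelace formula, and the minimal homothetic expansion factor about the centroid can be read off from the edge half-planes.

```python
import numpy as np

def polygon_area(V):
    """Shoelace area of a polygon with vertices V in order."""
    x, y = V[:, 0], V[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def classify(x, training_sets, alpha=0.1):
    """Assign x to the class whose depth contour (depth floor(alpha * n_j))
    needs the smallest extra area when inflated about its centroid to
    cover x.  Reuses depth_contour() from Section 3's sketch."""
    x = np.asarray(x, dtype=float)
    best_cost, best_j = np.inf, None
    for j, Xj in enumerate(training_sets):
        V = depth_contour(np.asarray(Xj), max(1, int(alpha * len(Xj))))
        c = V.mean(axis=0)                         # centroid of the contour's vertices
        lam = 1.0                                  # minimal homothety factor covering x
        for p, q in zip(V - c, np.roll(V - c, -1, axis=0)):
            a = np.array([q[1] - p[1], p[0] - q[0]])   # outward normal (V is ccw)
            d = float(a @ p)                       # edge half-plane: a.(v - c) <= d
            if a @ (x - c) > d:                    # x outside this edge: inflate
                lam = max(lam, float(a @ (x - c)) / d)
        cost = (lam ** 2 - 1.0) * polygon_area(V)  # Vol(H(D_k, x)) - Vol(D_k)
        if cost < best_cost:
            best_cost, best_j = cost, j
    return best_j
```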
Fig. 11. We want to find the partition of X into those clusters for which the sum of the areas of the convex hulls of the clusters is minimal.
So the convex hull of the data set will be affected by outliers, whereas sufficiently deep contours remain robust. This can also be seen in the second example of the previous section, where one outlier only influences D_1. In two dimensions, the proposed technique can be carried out by replacing the points of the training sets by the extremal points of their depth contours, obtained with the algorithm of Section 3, and then applying the existing algorithm of the hypervolume technique (Granville et al., 1991).

Let us also consider the convex hull technique in cluster analysis (Hardy and Rasson, 1982). Suppose we have a data set X = {x_1, x_2, ..., x_n} ⊂ ℝ^p. The aim is then to find the partition of X into g clusters X_1, X_2, ..., X_g that minimizes the sum of all the volumes of the convex hulls of the clusters (see Fig. 11 for a two-dimensional picture). So if we denote the convex hull of X_j by C_j (1 ≤ j ≤ g), we want to minimize Σ_{j=1}^{g} Vol(C_j). Again we could robustify this method by replacing the convex hulls of the clusters by depth contours (for a certain α) of the clusters, and then minimizing the sum of all the volumes of these depth contours; a sketch of this objective is given at the end of this section. A disadvantage here is the increase in computation time, since depth contours have to be computed for many candidate clusters, instead of just for the g training sets as in the discriminant analysis case.

Note that there exist other classification methods which minimize a total volume, but using other types of sets. For instance, one can partition the data such that the total volume of the classical 95% tolerance ellipsoids of the clusters is minimized. This corresponds to minimizing the sum of the square roots of the determinants of the covariance matrices. This approach has been extended in several ways. For instance, a corresponding fuzzy clustering method has been developed (see Rousseeuw et al., 1996), by incorporating continuous memberships and fuzzy covariance matrices. Also, robustified versions have been constructed (Davies, 1988; Jolion et al., 1991) in which the classical covariance matrix is replaced by the MVE estimator of Rousseeuw (1985), which is based on the minimum volume ellipsoid covering a given fraction of the data. A similar method has been developed for robust discriminant analysis (Chork and Rousseeuw, 1992), by replacing the covariance matrices in the classical method by MVE estimates based on the training sets.
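To make the clustering objective concrete, here is a sketch (our illustration, with hypothetical function names) of the quantity to be minimized over candidate partitions, reusing depth_contour and polygon_area from the previous sketches:

```python
import numpy as np
from scipy.spatial import ConvexHull

def clustering_cost(X, labels, alpha=0.0):
    """Total area of the convex hulls (alpha = 0) or of the depth contours
    of depth floor(alpha * n_j) (alpha > 0) of the clusters encoded by labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        Xj = X[labels == j]
        if alpha == 0.0:
            V = Xj[ConvexHull(Xj).vertices]       # the convex hull C_j
        else:
            V = depth_contour(Xj, max(1, int(alpha * len(Xj))))
        total += polygon_area(V)
    return total
```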
Acknowledgements

The authors wish to thank Babs Denissen, David Dobkin, Herbert Edelsbrunner, Bill Eddy, Dan Hoey and Franco P. Preparata for providing helpful information.

References

Baufays, P. and J.-P. Rasson, A new geometric discriminant rule, Comput. Statist. Quart., 2 (1985) 15-30.
Chork, C.Y. and P.J. Rousseeuw, Integrating a high-breakdown option into discriminant analysis in exploration geochemistry, J. Geochem. Exploration, 43 (1992) 191-203.
Davies, P.L., Consistent estimates for finite mixtures of well separated elliptical distributions, in: H.H. Bock (Ed.), Classification and related methods of data analysis (North-Holland, Amsterdam, 1988) 195-202.
Donoho, D.L., Breakdown properties of multivariate location estimators, Ph.D. Qualifying Paper (Harvard University, 1982).
Donoho, D.L. and M. Gasko, Breakdown properties of location estimates based on halfspace depth and projected outlyingness, Ann. Statist., 20 (1992) 1803-1827.
Eddy, W.F., A new convex hull algorithm for planar sets, ACM Trans. Math. Software, 3 (1977) 398-403.
Eddy, W.F., Set-valued orderings for bivariate data, in: R.V. Ambartzumian and W. Weil (Eds.), Stochastic geometry, geometric statistics, and stereology (Teubner, Leipzig, 1983) 79-90.
Eddy, W.F., Ordering of multivariate data, in: L. Billard (Ed.), Computer science and statistics: Proc. 16th Symp. on the Interface (North-Holland, Amsterdam, 1985) 25-30.
Edelsbrunner, H., Algorithms in combinatorial geometry (Springer, Berlin, 1987).
Erdős, P., L. Lovász, A. Simmons and E.G. Straus, Dissection graphs of planar point sets, in: J.N. Srivastava, F. Harary, C.R. Rao, G.-C. Rota and S.S. Shrikhande (Eds.), A survey of combinatorial theory (North-Holland, Amsterdam, 1973) 139-149.
Goodman, J.E. and R. Pollack, On the combinatorial classification of nondegenerate configurations in the plane, J. Combin. Theory A, 29 (1980) 220-235.
Granville, V., M. Křivánek and J.-P. Rasson, Remark on computational complexity of hypervolume classification method, Comput. Statist. Quart., 4 (1991) 315-319.
Gray, J.B., Graphics for regression diagnostics, in: ASA Proc. Statistical Computing Section (1985) 102-107.
Hardy, A. and J.-P. Rasson, Une nouvelle approche des problèmes de classification automatique, Statist. Anal. Données, 7 (1982) 41-56.
Jolion, J.-M., P. Meer and S. Bataouche, Robust clustering with applications in computer vision, IEEE Trans. Pattern Anal. Machine Intelligence, 13 (1991) 791-802.
Liu, R.Y., On a notion of data depth based on random simplices, Ann. Statist., 18 (1990) 405-414.
Massé, J.-C. and R. Theodorescu, Halfplane trimming for bivariate distributions, J. Multivariate Anal., 48 (1994) 188-202.
Niinimaa, A., Bivariate generalizations of the median, Technical Report (University of Oulu, Finland, 1993).
Preparata, F.P. and S.J. Hong, Convex hulls of finite sets of points in two and three dimensions, Comm. ACM, 20 (1977) 87-93.
Rasson, J.-P., F. Orban-Ferauge and V. Granville, From a natural to a behavioral classification rule, Technical Report (FUNDP, Namur, 1993).
Rousseeuw, P.J., Multivariate estimation with high breakdown point, in: W. Grossmann, G. Pflug, I. Vincze and W. Wertz (Eds.), Mathematical statistics and applications (Reidel, Dordrecht, 1985) 283-297.
Rousseeuw, P.J., L. Kaufman and E. Trauwaert, Fuzzy clustering using scatter matrices, Comput. Statist. Data Analysis, this issue (1996).
Rousseeuw, P.J. and A. Leroy, Robust regression and outlier detection (Wiley, New York, 1987).
Rousseeuw, P.J. and I. Ruts, Bivariate location depth, Applied Statistics, to appear (1996).
Rousseeuw, P.J. and I. Ruts, The depth function of a distribution, manuscript in preparation (1997).
Shamos, M.I. and D. Hoey, Geometric intersection problems, in: Proc. 17th Ann. IEEE Symp. on Foundations of Computer Science (1976) 208-215.
Small, C.G., A survey of multidimensional medians, Internat. Statist. Rev., 58 (1990) 263-277.
Tukey, J.W., Mathematics and the picturing of data, in: Proc. Internat. Congress of Mathematicians, Vancouver, Vol. 2 (1975) 523-531.
Tukey, J.W., Exploratory data analysis (Addison-Wesley, Reading, MA, 1977).