Journal of Statistical Planning and Inference 77 (1999) 247–262
Asymptotics for trimmed k-means and associated tolerance zones¹

L.A. García-Escudero, A. Gordaliza, C. Matrán*

Departamento de Estadística e Investigación Operativa, Facultad de Ciencias, Universidad de Valladolid, 47002 Valladolid, Spain

Received 27 May 1997; accepted 22 September 1998
Abstract

Impartial trimming procedures with respect to general 'penalty' functions, Φ, have been recently introduced in Cuesta-Albertos et al. (1997. Ann. Statist. 25, 553–576) in the (generalized) k-means framework. Under regularity assumptions, for real-valued samples, we obtain the asymptotic normality both of the impartial trimmed k-mean estimator (Φ(x) = x²) and of the impartial trimmed k-median estimator (Φ(x) = x). In spite of the additional complexity coming from the several groups setting, the empirical quantile methodology used in Butler (1982. Ann. Statist. 10, 197–204) for the LTS estimator (and subsequently in Tableman (1994. Statist. Probab. Lett. 19, 387–398) for the LTAD estimator) also works in our framework. Besides their relevance for the robust estimation of quantizers, our results open the possibility of considering asymptotic distribution-free tolerance regions, constituted by unions of intervals, for predicting a future observation, generalizing the idea in Butler (1982). © 1999 Elsevier Science B.V. All rights reserved.

AMS classifications: Primary 62G20; 62G15; secondary 62G35

Keywords: Asymptotics; Clustering methods; Distribution freeness; Robustness; Tolerance zones; Trimmed k-means
¹ Research partially supported by the DGICYT grants PB95-0715-C02-00,01 and by Consejería de Educación y Cultura de la Junta de Castilla y León grant VA08/97.
* Corresponding author.

1. Introduction

The k-means method is the most widely used (nonhierarchical) procedure in cluster analysis. Although it is a classical method (Cox, 1957; Fisher, 1958; MacQueen, 1967), the interest it continues to attract is made evident by the abundant literature concerned
with different aspects of this topic: Sverdrup-Thygeson (1981), Pollard (1981) and Cuesta-Albertos and Matrán (1988) deal with consistency in different settings, while Hartigan (1975, 1978), Pollard (1982) and Serinko and Babu (1992) provide weak limit theorems. A special issue of IEEE (1982) is devoted to the population version as a quantization procedure; Stute and Zhu (1995) study the method combined with projection pursuit; Tarpey et al. (1995) analyze the population versions as principal points from the self-consistency viewpoint; and so on.

Cuesta-Albertos et al. (1997) pointed out the bad performance of k-means in the presence of outliers and introduced trimmed k-means with the aim of robustifying k-means and the associated cluster analysis through an 'impartial' trimming procedure. This methodology (related to techniques employed in Gordaliza (1991a), Rousseeuw (1983) and Rousseeuw and Yohai (1984)) can be succinctly described as follows. Let X₁, X₂, ..., Xₙ be a sample of independent identically distributed random variables (r.v.'s) in ℝᵖ with common distribution F. For a suitable nondecreasing (penalty) function Φ: ℝ⁺ → ℝ and a trimming level 1 − α ∈ (0, 1), a (generalized) trimmed k-mean of {X₁, X₂, ..., Xₙ} is a k-set (a set with k points) {T̂₁, T̂₂, ..., T̂ₖ} ⊂ ℝᵖ solving the following double minimization problem:
  min_Y  min_{{T₁,…,Tₖ} ⊂ ℝᵖ}  ([nα]*)⁻¹ Σ_{Xᵢ ∈ Y} Φ( inf_{j=1,…,k} ||Xᵢ − Tⱼ|| ),    (1.1)
where Y ranges over the subsets of {X₁, X₂, ..., Xₙ} with [nα]* data points ([nα]* is the smallest integer greater than nα). Namely, it is the (generalized) k-mean of the subsample containing the [nα]* points with the smallest mean deviation measured through the penalty function Φ. We emphasize that this kind of procedure can be considered an 'ad hoc' method for distributions with shapes adequate to be split into a given number of relevant zones. Therefore, mixtures of unimodal distributions constitute a main goal in the applications.

In order to obtain the empirical minimizer {T̂₁, T̂₂, ..., T̂ₖ} in the multivariate setting, the main difficulty arises from the nonexistence of exact nonexhaustive algorithms. This difficulty can be avoided through the use of simulated-annealing-based algorithms, as was shown in Cuesta-Albertos et al. (1997). In that paper, several simulated bivariate data sets were analyzed, revealing the ability of the procedure to catch the cluster structure of the data in an efficient way. For univariate data sets, taking advantage of the natural order of the real line, an exhaustive (and not excessively 'computer intensive') search is possible (see the sketch below). Notice that, in general, the minimizer of Eq. (1.1) is not necessarily unique. For instance, if Φ is not a strictly convex function, then even with k = 1 and the set Y fixed, the minimizer of Eq. (1.1) need not be unique (recall the median, with Φ(x) = x). Throughout this paper we will consider the trimmed k-mean of {X₁, X₂, ..., Xₙ} to be any minimizer of expression (1.1).
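As an illustration of this univariate exhaustive search (ours, not the authors' code; the function name, the use of NumPy and the choice Φ(x) = x² with k = 2 are assumptions of the sketch), one can exploit the fact that an optimal retained subsample may be taken as a union of at most two runs of consecutive order statistics:

```python
import numpy as np

def trimmed_2means_1d(x, alpha):
    """Exhaustive univariate trimmed 2-means for Phi(x) = x^2: keep m points forming
    two runs of consecutive order statistics and minimise the total within-run sum
    of squared deviations about the run means (illustrative sketch only)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = int(np.ceil(n * alpha))                  # roughly [n*alpha]* retained points
    best_value, best_centres = np.inf, None
    for m1 in range(1, m):                       # size of the first run
        m2 = m - m1
        for i in range(n - m1 + 1):              # first run: x[i], ..., x[i + m1 - 1]
            x1 = x[i:i + m1]
            s1 = ((x1 - x1.mean()) ** 2).sum()
            for j in range(i + m1, n - m2 + 1):  # second run starts after the first one
                x2 = x[j:j + m2]
                s2 = ((x2 - x2.mean()) ** 2).sum()
                if s1 + s2 < best_value:
                    best_value = s1 + s2
                    best_centres = (x1.mean(), x2.mean())
    return best_centres, best_value / m          # centres and the criterion (1.1)

# e.g. trimmed_2means_1d(sample, alpha=0.8) retains roughly 80% of the observations
```

The triple loop costs roughly O(n³) criterion evaluations, which is affordable for moderate univariate samples; replacing run means and squared deviations by run medians and absolute deviations gives the trimmed 2-medians of Section 4.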
The population analogue of Eq. (1.1) for a distribution F is defined in the obvious way by

  min_{A ∈ ℬ, F(A) = α}  min_{{T₁,…,Tₖ} ⊂ ℝᵖ}  α⁻¹ ∫_A Φ( inf_{j=1,…,k} ||x − Tⱼ|| ) dF(x),    (1.2)
where ℬ stands for the Borel σ-algebra in ℝᵖ. When the distribution F is absolutely continuous we can always select a Borel set with probability α; for general distributions an adequate alternative is provided by the trimming functions introduced in Gordaliza (1991a). The Borel set A where the minimum of Eq. (1.2) is attained will be called an optimal zone, and may be viewed as the 'most adequate' set containing a given amount α of probability mass. Cuesta-Albertos et al. (1997) provided a detailed study of the associated optimal zones as well as proofs of the existence and consistency of these estimators. The performance of the method was also studied there from an applied point of view by analyzing some simulated data sets.

In this work we study the asymptotic distribution of the minimizer of Eq. (1.1), but for technical reasons we must restrict our development to the trimmed k-means (Φ(x) = x²) and trimmed k-medians (Φ(x) = x) in the univariate case. Our analysis focuses on the case k = 2 to avoid notational complexity without mathematical relevance. The technical background for our objective is the approximation of the quantile process by Brownian bridges. Such a procedure has been used previously in connection with trimming procedures by Butler (1982) for the LTS (least trimmed squares) estimator and by Tableman (1994) for the LTAD (least trimmed absolute deviations) estimator. We also refer to Gordaliza (1991a, b) for the introduction of impartial trimming procedures based on other functions Φ. We emphasize that Butler's technique adequately supports the complexity added by the simultaneous consideration of several groups, but it seems inherently unable to handle either penalty functions different from Φ(x) = x² and Φ(x) = x (see Remark 5) or the general framework of ℝᵖ-valued r.v.'s (the available multivariate versions of the quantile function do not fit this problem). However, since the k-means methods and, consequently, the trimmed k-means methods are procedures intended especially for multivariate data, the asymptotics obtained in this paper should be considered a very first step in the study of the asymptotics of trimmed k-means procedures. On the other hand, it is worth stressing that our results permit the construction of asymptotic distribution-free tolerance zones which clearly improve on tolerance intervals when the parent distribution is a mixture (see Section 5).

The next lemma, showing that the optimal zone must be a union of intervals with the same length (although not necessarily disjoint), will be used often throughout the paper. In order to improve readability, its proof, as well as those of most results in the paper, is deferred to the appendix.
Lemma 1.1. Let F be a probability distribution on ℝ. Then, for any fixed T₁, T₂ ∈ ℝ such that there exists a zone (T₁ − S, T₁ + S) ∪ (T₂ − S, T₂ + S) with probability mass α, we have

  min_{A ∈ ℬ, F(A) = α} α⁻¹ ∫_A Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x)
    = α⁻¹ ∫_{(T₁−S, T₁+S) ∪ (T₂−S, T₂+S)} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x).
Consequently, for F having a differentiable and strictly positive density f (as assumed throughout the next sections), the double minimization problem (1.2) is equivalent to

  min_{−∞ < T₁ ≤ T₂ < +∞} α⁻¹ ∫_{(T₁−S, T₁+S) ∪ (T₂−S, T₂+S)} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x),    (1.3)

where S := S(T₁, T₂) ∈ (0, ∞) is uniquely determined by F((T₁ − S, T₁ + S) ∪ (T₂ − S, T₂ + S)) = α. Moreover, expression (1.3) can be rewritten as

  min_{−∞ < T₁ ≤ T₂ < +∞} α⁻¹ [ ∫_{C₁} Φ(|x − T₁|) dF(x) + ∫_{C₂} Φ(|x − T₂|) dF(x) ],    (1.4)
where Cᵢ, i = 1, 2, denotes the cluster associated with each center Tᵢ, i = 1, 2. The cluster Cᵢ consists of all points in (T₁ − S, T₁ + S) ∪ (T₂ − S, T₂ + S) closer to Tᵢ than to the other Tⱼ, j ≠ i; i.e., C₁ = (T₁ − S, (T₁ + S) ∧ (T₁ + T₂)/2) and C₂ = ((T₂ − S) ∨ (T₁ + T₂)/2, T₂ + S). Our analysis in the next sections will depend on the relative position (separated or overlapped) of the clusters in the optimal region. Hence, if {T₁*, T₂*} is a generalized trimmed 2-Φ-mean with associated clusters C₁* ∪ C₂*, then from Eq. (1.4) we have that each Tᵢ* must be a Φ-mean (see Brøns et al., 1969) of the corresponding cluster Cᵢ*, i.e.,

  min_{Tᵢ ∈ ℝ} ∫_{Cᵢ*} Φ(|x − Tᵢ|) dF(x) = ∫_{Cᵢ*} Φ(|x − Tᵢ*|) dF(x),    (1.5)
because otherwise we could decrease the expression in Eq. (1.4).

The next lemma is a particular case of Theorem 3.6 in Cuesta-Albertos et al. (1997), to which only the consistency of the radii associated with the optimal zones has been added. The consistency of the optimal radii is implicit in the proof of that theorem.

Lemma 1.2. Let [T̂₁ − Ŝ, T̂₁ + Ŝ] ∪ [T̂₂ − Ŝ, T̂₂ + Ŝ] be the optimal zone obtained through a trimmed 2-Φ-mean {T̂₁, T̂₂} based on a random sample of size n from an absolutely continuous distribution F (we suppose T̂₁ ≤ T̂₂). Let T₁*, T₂* and S* (T₁* ≤ T₂*) be the solution, which we assume to be unique, of problem (1.2) (with k = 2) for the distribution F. Then

  T̂₁ → T₁*,  T̂₂ → T₂*  and  Ŝ → S*  P-a.e.
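To make the population problem (1.2)–(1.4) concrete, the following sketch (ours; the normal mixture, the trimming level and all function names are illustrative assumptions) computes the radius S(T₁, T₂) from F((T₁ − S, T₁ + S) ∪ (T₂ − S, T₂ + S)) = α by root-finding and evaluates the criterion (1.4) for Φ(x) = x²:

```python
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

ALPHA = 0.8                                                        # trimming level 1 - alpha = 0.2
F = lambda x: 0.5 * norm.cdf(x, 0, 1) + 0.5 * norm.cdf(x, 5, 1)    # assumed mixture c.d.f.
f = lambda x: 0.5 * norm.pdf(x, 0, 1) + 0.5 * norm.pdf(x, 5, 1)    # assumed mixture density

def zone_mass(T1, T2, S):
    """F((T1-S, T1+S) U (T2-S, T2+S)), without double-counting a possible overlap."""
    lo2 = max(T2 - S, T1 + S)
    return F(T1 + S) - F(T1 - S) + max(F(T2 + S) - F(lo2), 0.0)

def radius(T1, T2, alpha=ALPHA):
    """S(T1, T2): the unique S > 0 whose zone carries probability mass alpha."""
    return brentq(lambda S: zone_mass(T1, T2, S) - alpha, 1e-8, 50.0)

def criterion(T1, T2, alpha=ALPHA):
    """Expression (1.4) with Phi(x) = x^2, integrating over the clusters C1 and C2."""
    S, mid = radius(T1, T2, alpha), 0.5 * (T1 + T2)
    c1 = (T1 - S, min(T1 + S, mid))              # C1 = (T1 - S, (T1 + S) ^ (T1 + T2)/2)
    c2 = (max(T2 - S, mid), T2 + S)              # C2 = ((T2 - S) v (T1 + T2)/2, T2 + S)
    i1 = quad(lambda x: (x - T1) ** 2 * f(x), *c1)[0]
    i2 = quad(lambda x: (x - T2) ** 2 * f(x), *c2)[0]
    return (i1 + i2) / alpha

# e.g. criterion(0.0, 5.0) evaluates the trimmed criterion near the component centres
```

Scanning criterion(T1, T2) over a grid of centre pairs is one simple way to check numerically for competing local minima, in the spirit of Figs. 1 and 2 below.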
Fig. 1. Trimmed 2-means criterion (20% trimming level) for a mixture of two normal distributions; no competitive local minima appear besides the global minimum.
Notice that a uniqueness condition on the population trimmed 2-Φ-mean will be assumed throughout the paper. Giving general sufficient conditions is a hard problem: the uniqueness condition is usually just assumed in most of the papers dealing with the asymptotics of (untrimmed) k-means, and only a few papers tackle this difficulty, and then only for distributions that are unrealistic in clustering (see Li and Flury, 1995). Gordaliza (1991a) proved, in the k = 1 case and for unimodal distributions, the uniqueness of the trimmed 1-Φ-mean, so we can hope that this property is inherited by mixtures of sufficiently distant unimodal distributions. Our attempts to obtain general sufficient conditions led to the formulation of somewhat unnatural and complicated conditions, and to the conclusion that it is better to use a simple numerical treatment when a specific distribution is considered. For instance, we have observed that for mixtures of two normal distributions with reasonably similar weights no uniqueness problem occurs. We include Figs. 1 and 2 to support this view by showing that no competitive local minima appear in relation to the only global minimum in two particular cases. Both situations involve mixtures of normal distributions, not far enough apart to avoid influence between them, and consider the 2-means problem for a 20% trimming level (i.e., α = 0.8) and symmetric and asymmetric mixtures. Moreover, we also think that the trimming procedure helps to smooth out the possible lack of uniqueness of the usual (untrimmed) k-mean method created by the presence of the tails of the distributions in the mixture.

On the other hand, we notice that the uniqueness requirement can even be removed from the technical point of view. It is not difficult to prove a variation of Lemma 1.2 without the assumption of uniqueness of the population trimmed k-mean, in the sense that every subsequence of (T̂₁, T̂₂, Ŝ) contains a convergent subsequence and
Fig. 2. As Fig. 1, for a second mixture of two normal distributions (20% trimming level).
every convergent subsequence converges to a population trimmed k-mean. Since consistency is the only requisite needed for the reasoning given throughout the paper, our results easily generalize to any situation through the corresponding subsequence statements. So, we believe that the uniqueness of the solution of problem (1.2) should not be a drawback for the application of these procedures, especially when a proper choice of k and the trimming level 1 − α has been made (notice also that instability in obtaining the empirical minimizer suggests an inadequate choice of the parameters k or α).
2. Trimmed 2-means. Separated population clusters

From now on, we suppose that the distribution F has a density f which is differentiable and strictly positive on ℝ. Let us consider the case k = 2 and Φ(x) = x². Assume that we are in the case of separated population clusters and that there exist uniquely determined T₁*, T₂* ∈ ℝ minimizing Eq. (1.4), i.e.,

  min_{−∞ < T₁ ≤ T₂ < +∞} α⁻¹ [ ∫_{C₁} (x − T₁)² dF(x) + ∫_{C₂} (x − T₂)² dF(x) ]
    = α⁻¹ [ ∫_{(T₁*−S*, T₁*+S*)} (x − T₁*)² dF(x) + ∫_{(T₂*−S*, T₂*+S*)} (x − T₂*)² dF(x) ],

where S* = S(T₁*, T₂*) and T₁* + S* < T₂* − S*.
In this case of separated population clusters, we have by continuity that Eq. (1.4) is equivalent to

  min_{−∞ < T₁ ≤ T₂ < +∞, T₁+S < T₂−S} α⁻¹ [ ∫_{T₁−S}^{T₁+S} (x − T₁)² dF(x) + ∫_{T₂−S}^{T₂+S} (x − T₂)² dF(x) ].    (2.1)
We rewrite this minimization problem using the quantile function q(·) = F⁻¹(·), via the change of variables β₁ = F(T₁ − S), β₂ = F(T₂ − S), π = F((T₁ − S, T₁ + S)) and x = q(t) inside the integrals (note that T₁ is the midpoint of (T₁ − S, T₁ + S), so that T₁ = ½[q(β₁) + q(β₁ + π)], and analogously for T₂). Then we obtain that problem (2.1) is equivalent to the minimization of

  H(β₁, β₂, π) := α⁻¹ [ ∫_{β₁}^{β₁+π} ( q(t) − (q(β₁) + q(β₁+π))/2 )² dt
                        + ∫_{β₂}^{β₂+α−π} ( q(t) − (q(β₂) + q(β₂+α−π))/2 )² dt ],

where (β₁, β₂, π) satisfies

  0 < β₁ < β₁ + π < β₂ < β₂ + α − π < 1.    (2.2)

(Notice that we search for optimal sets having the form I(β₁, β₂, π) = [q(β₁), q(β₁ + π)] ∪ [q(β₂), q(β₂ + α − π)], and that π will be the probability mass in the first cluster and α − π the probability mass in the second cluster.) Hence, if (T₁*, T₂*) is the (by assumption) unique minimizer of Eq. (2.1), then β₁* = F(T₁* − S*), β₂* = F(T₂* − S*), π* = F((T₁* − S*, T₁* + S*)), with S* = S(T₁*, T₂*), is the unique minimizer of H among the (β₁, β₂, π) verifying Eq. (2.2). Lemma 1.1, the fact that the population clusters are separated and the strict positivity of the density on ℝ imply the following relation:

  0 < β₁* < β₁* + π* < β₂* < β₂* + α − π* < 1.    (2.3)
Either from Eq. (1.5) with Φ(x) = x², or from Remark 2, we can see that ξ* = (β₁*, β₂*, π*) is also the unique minimizer of the function D defined as

  D(β₁, β₂, π) := α⁻¹ [ ∫_{β₁}^{β₁+π} q²(t) dt + ∫_{β₂}^{β₂+α−π} q²(t) dt
                        − π⁻¹ ( ∫_{β₁}^{β₁+π} q(t) dt )² − (α − π)⁻¹ ( ∫_{β₂}^{β₂+α−π} q(t) dt )² ].    (2.4)
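For computations, D can be evaluated numerically from any quantile function q; a minimal sketch of ours (the function name and the trapezoidal integration grid are assumptions):

```python
import numpy as np

def D(beta1, beta2, pi, alpha, q, grid=400):
    """Target function (2.4): alpha^(-1) [ int q^2 dt - pi^(-1) (int q dt)^2 over
    (beta1, beta1 + pi), plus the analogous term over (beta2, beta2 + alpha - pi) ],
    for any quantile function q, using a trapezoidal rule on `grid` points."""
    def block(a, length):
        t = np.linspace(a, a + length, grid)
        qt = q(t)
        return np.trapz(qt ** 2, t) - np.trapz(qt, t) ** 2 / length
    return (block(beta1, pi) + block(beta2, alpha - pi)) / alpha
```

Minimizing this over the region (2.2), e.g. by a coarse grid search on (β₁, β₂, π), approximates ξ*; supplying the empirical quantile function qₙ introduced below yields the plug-in version D̂.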
So, from now on, we will use the target function D. By differentiating D we see that ξ* must be a root of the function Ψ(·) = (Ψ₁(·), Ψ₂(·), Ψ₃(·))ᵗ defined as

  Ψ₁(β₁, β₂, π) = ½[q(β₁) + q(β₁ + π)] − π⁻¹ ∫_{β₁}^{β₁+π} q(t) dt,

  Ψ₂(β₁, β₂, π) = ½[q(β₂) + q(β₂ + α − π)] − (α − π)⁻¹ ∫_{β₂}^{β₂+α−π} q(t) dt

and

  Ψ₃(β₁, β₂, π) = ( q(β₂ + α − π) − (α − π)⁻¹ ∫_{β₂}^{β₂+α−π} q(t) dt ) − ( q(β₁ + π) − π⁻¹ ∫_{β₁}^{β₁+π} q(t) dt ).
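Equivalently, ξ* can be located as a root of Ψ. A sketch of ours (the helper names, the use of quad/fsolve and the starting point are assumptions), valid for any quantile function q, population or empirical:

```python
from scipy.integrate import quad
from scipy.optimize import fsolve

def make_psi(q, alpha):
    """Return Psi(beta1, beta2, pi) = (Psi1, Psi2, Psi3) from Section 2 for a quantile function q."""
    def mean_q(a, b):                              # mean of q over (a, b)
        return quad(q, a, b)[0] / (b - a)
    def psi(xi):
        b1, b2, p = xi
        m1, m2 = mean_q(b1, b1 + p), mean_q(b2, b2 + alpha - p)
        return [0.5 * (q(b1) + q(b1 + p)) - m1,
                0.5 * (q(b2) + q(b2 + alpha - p)) - m2,
                (q(b2 + alpha - p) - m2) - (q(b1 + p) - m1)]
    return psi

# e.g. fsolve(make_psi(q, alpha=0.8), x0=[0.05, 0.55, 0.40]), with a starting point
# taken, say, from a preliminary grid search on D
```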
Remark 1. Notice that, as ξ* is a root of the equations Ψ₁ and Ψ₂, the mean of the distribution F restricted to each optimal interval must be the midpoint of that interval. The equation Ψ₃(ξ*) = 0 forces both intervals to have the same length.

Remark 2. Note that H and D have the same minimizers. Let ξ* = (β₁*, β₂*, π*) be the unique minimizer of H and ξ = (β₁, β₂, π) any minimizer of D. Differentiating H (respectively D) with respect to β₁ and β₂ we obtain Ψ₁(β₁*, β₂*, π*) = 0 and Ψ₂(β₁*, β₂*, π*) = 0 (respectively Ψ₁(β₁, β₂, π) = 0 and Ψ₂(β₁, β₂, π) = 0), which trivially forces

  H(β₁, β₂, π) = D(β₁, β₂, π)  and  H(β₁*, β₂*, π*) = D(β₁*, β₂*, π*).    (2.5)

Thus, we have

  H(β₁, β₂, π) = D(β₁, β₂, π)      (by Eq. (2.5))
               ≤ D(β₁*, β₂*, π*)    (since ξ minimizes D)
               = H(β₁*, β₂*, π*)    (by Eq. (2.5))
               ≤ H(β₁, β₂, π)       (since ξ* minimizes H),
so that H(β₁, β₂, π) = H(β₁*, β₂*, π*). But this implies ξ = ξ*, since (by assumption) the minimizer of H is unique.

Let us define D̂, the empirical version of D, by the 'plug-in' principle, replacing q by qₙ, the empirical quantile function of Parzen (1979) (qₙ is the piecewise linear function between the order statistics X(1) ≤ ⋯ ≤ X(n) such that qₙ((i − ½)/n) = X(i) for i = 1, …, n, qₙ(t) = X(1) for 0 ≤ t ≤ (2n)⁻¹ and qₙ(t) = X(n) for 1 − (2n)⁻¹ ≤ t ≤ 1). Let ξ̂* = (β̂₁*, β̂₂*, π̂*) be a vector that minimizes D̂(·). The zone Î(ξ̂*) = [qₙ(β̂₁*), qₙ(β̂₁* + π̂*)] ∪ [qₙ(β̂₂*), qₙ(β̂₂* + α − π̂*)] is an estimator of the optimal zone. From now on the sample analogues of population functionals will be denoted with the hat symbol '^'. Assuming the conditions of Lemma 1.2 with Φ(x) = x², these consistencies may be translated into the following ones:

  β̂₁* → β₁*,  β̂₂* → β₂*  and  π̂* → π*  P-a.e.    (2.6)
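A minimal sketch (ours) of the empirical quantile function qₙ just described; supplying it as q to the sketches above gives D̂ and Ψ̂, and a search over (β₁, β₂, π) then produces ξ̂* and the zone Î(ξ̂*):

```python
import numpy as np

def parzen_quantile(x):
    """Parzen (1979) empirical quantile: piecewise linear with q_n((i - 1/2)/n) = X_(i),
    equal to X_(1) on [0, 1/(2n)] and to X_(n) on [1 - 1/(2n), 1]."""
    xs = np.sort(np.asarray(x, dtype=float))
    knots = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)
    return lambda t: np.interp(t, knots, xs)       # np.interp is constant outside the knots

# Illustrative use: qn = parzen_quantile(sample); minimise D(b1, b2, p, alpha, qn) over a grid
# of (b1, b2, p) with 0 < b1 < b1 + p < b2 < b2 + alpha - p < 1, then report the zone
# [qn(b1), qn(b1 + p)] U [qn(b2), qn(b2 + alpha - p)].
```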
(For example, if Fₙ is the empirical distribution function, then β̂₁* is asymptotically equivalent to Fₙ(T̂₁ − Ŝ). From β₁* = F(T₁* − S*), the classical Glivenko–Cantelli theorem and the continuity of F it is trivial to prove that β̂₁* → β₁* P-a.e.)

Applying the previous consistency results, the following lemma can be proved.

Lemma 2.1. Under the assumptions of Lemma 1.2 with Φ(x) = x²,

  lim_{n→∞} P( Ψ̂(ξ̂*) = 0 ) = 1.

Butler's (1982) Lemma 2.2 can be slightly modified, taking into account the initial Bickel (1967) result from which it is obtained, by replacing the unimodality condition with the strict positivity of the density on ℝ. Thus, if F is an absolutely continuous distribution with differentiable and strictly positive density on ℝ, and Qₙ(t) = √n [qₙ(t) − q(t)] (0 ≤ t ≤ 1) is the quantile process, then there exists a probability space on which a Brownian bridge {B(t): 0 ≤ t ≤ 1} is defined such that an equivalently distributed version of Qₙ(t) satisfies

  sup_{c ≤ t ≤ d} |Qₙ(t) − q′(t)B(t)| →_P 0    (2.7)
as n → ∞, for any [c, d] ⊂ (0, 1).

The next theorem establishes the asymptotic normality of ξ̂*.

Theorem 2.2. Let F be a distribution with differentiable and strictly positive density on ℝ such that ξ* is the unique vector which minimizes Eq. (2.4). Assume that the matrix Ψ′(ξ*) (the matrix of derivatives of Ψ evaluated at ξ*) is nonsingular. Then

  √n (ξ̂* − ξ*) →_L [Ψ′(ξ*)]⁻¹ A,

where A = (A₁, A₂, A₃)ᵗ is the random vector defined by

  A₁ = (π*)⁻¹ ∫_{β₁*}^{β₁*+π*} q′(t)B(t) dt − ½[q′(β₁*)B(β₁*) + q′(β₁* + π*)B(β₁* + π*)],

  A₂ = (α − π*)⁻¹ ∫_{β₂*}^{β₂*+α−π*} q′(t)B(t) dt − ½[q′(β₂*)B(β₂*) + q′(β₂* + α − π*)B(β₂* + α − π*)],

  A₃ = ( q′(β₁* + π*)B(β₁* + π*) − (π*)⁻¹ ∫_{β₁*}^{β₁*+π*} q′(t)B(t) dt )
       − ( q′(β₂* + α − π*)B(β₂* + α − π*) − (α − π*)⁻¹ ∫_{β₂*}^{β₂*+α−π*} q′(t)B(t) dt )

and {B(t): 0 ≤ t ≤ 1} denotes a Brownian bridge.

Now, we consider the midpoints of the population optimal intervals, c₁*(ξ*) = ½{q(β₁*) + q(β₁* + π*)} and c₂*(ξ*) = ½{q(β₂*) + q(β₂* + α − π*)}, and the midpoints of the
sample optimal intervals, ĉ₁*(ξ̂*) and ĉ₂*(ξ̂*). Note that the population trimmed 2-means are

  μ₁*(ξ*) = (π*)⁻¹ ∫_{β₁*}^{β₁*+π*} q(t) dt  and  μ₂*(ξ*) = (α − π*)⁻¹ ∫_{β₂*}^{β₂*+α−π*} q(t) dt,

and the corresponding sample estimators are μ̂₁*(ξ̂*) and μ̂₂*(ξ̂*). As a consequence of Lemma 2.1, we have √n Ψ̂ᵢ(ξ̂*) = oP(1), i = 1, 2, and this convergence implies the following asymptotic equivalence:

  √n [ (ĉ₁*(ξ̂*) − c₁*(ξ*), ĉ₂*(ξ̂*) − c₂*(ξ*))ᵗ − (μ̂₁*(ξ̂*) − μ₁*(ξ*), μ̂₂*(ξ̂*) − μ₂*(ξ*))ᵗ ] = oP(1).

As a consequence of this equivalence we get the asymptotic normality of μ̂₁*(ξ̂*) and μ̂₂*(ξ̂*).

Theorem 2.3. Let F be as in Theorem 2.2. Then

  √n ( μ̂₁*(ξ̂*) − μ₁*(ξ*), μ̂₂*(ξ̂*) − μ₂*(ξ*) )ᵗ
    →_L ( ½[q′(β₁*)B(β₁*) + q′(β₁* + π*)B(β₁* + π*)],
          ½[q′(β₂*)B(β₂*) + q′(β₂* + α − π*)B(β₂* + α − π*)] )ᵗ + H [Ψ′(ξ*)]⁻¹ A,

where A is the random vector defined in Theorem 2.2 and H is the 2 × 3 matrix of constants

  H = ( ½[q′(β₁*) + q′(β₁* + π*)]    0    ½ q′(β₁* + π*) ;
        0    ½[q′(β₂*) + q′(β₂* + α − π*)]    −½ q′(β₂* + α − π*) ).
As already remarked in the introduction, one of the main applications of these procedures is the construction of tolerance zones more general than isolated intervals. We will consider the zone

  Î(ξ̂*) = [qₙ(β̂₁*), qₙ(β̂₁* + π̂*)] ∪ [qₙ(β̂₂*), qₙ(β̂₂* + α − π̂*)]

as a tolerance zone. After defining Î(ξ̂*), in the next theorem we study the large sample properties of its probability coverage Pₙ, which may be seen as the conditional probability of covering Xₙ₊₁ with Î(ξ̂*) given X₁, …, Xₙ.

Theorem 2.4. Suppose the conditions in Theorem 2.2 hold, and let

  Pₙ = F(qₙ(β̂₁* + π̂*)) − F(qₙ(β̂₁*)) + F(qₙ(β̂₂* + α − π̂*)) − F(qₙ(β̂₂*))

be the coverage of the tolerance zone Î(ξ̂*). Then

  √n (Pₙ − α) →_L N(0, α(1 − α)).

Remark 3. Notice that the approximation (2.7) still holds if the strict positivity of the density on ℝ is replaced by strict positivity on the set {x: 0 < F(x) < 1}, and then the cases β₁* = 0 or β₂* + α − π* = 1 (similar to Case II in Butler, 1982) can also be handled with the additional use of the classical theory of sample quantiles.
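The convergence in Theorem 2.4 can be inverted to choose the trimming level numerically: for a target coverage a and guarantee level b, solve a = α + z_{1−b} √(α(1 − α)/n) for α under the normal approximation. A sketch of ours (the function name is illustrative); its output is consistent with the choice α = 0.882 for n = 272 in Section 5:

```python
from math import sqrt
from scipy.optimize import brentq
from scipy.stats import norm

def trim_level_for_coverage(a, b, n):
    """Trimming level alpha with P(Pn >= a) approximately b, using the normal
    approximation sqrt(n) (Pn - alpha) ~ N(0, alpha (1 - alpha)) of Theorem 2.4."""
    z = norm.ppf(1.0 - b)                          # z_{1-b}, negative when b > 1/2
    g = lambda alpha: alpha + z * sqrt(alpha * (1.0 - alpha) / n) - a
    return brentq(g, a, 1.0 - 1e-9)                # a root lies between a and 1 when b > 1/2

print(round(trim_level_for_coverage(a=0.85, b=0.95, n=272), 3))   # about 0.882
```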
Remark 4. The length of the intervals whose union gives the optimal zone may be used as a measure of the dispersion around the 'centers' of the distribution. The asymptotic normality of this estimator can also be derived.

3. Trimmed 2-means. Overlapped population clusters

If α is close enough to 1, or if the distributions involved in the population mixture are close, then the population clusters can overlap, i.e. (T₁* − S*, T₁* + S*) ∩ (T₂* − S*, T₂* + S*) ≠ ∅. To deal with the large sample properties of the estimators in this case, we will use β₁ = F(T₁ − S), β₂ = F(T₂ − S), β₃ = F(T₁ + S) and h = h(β₂, β₃) = F(½{q(β₂) + q(β₃)}). Arguing as in Section 2, the optimal ξ* = (β₁*, β₂*, β₃*) can be found as the root of Ψ(·) = (Ψ₁(·), Ψ₂(·), Ψ₃(·))ᵗ, now defined by

  Ψ₁(β₁, β₂, β₃) = ½[q(β₁) + q(β₃)] − (h − β₁)⁻¹ ∫_{β₁}^{h} q(t) dt,

  Ψ₂(β₁, β₂, β₃) = ½[q(β₂) + q(β₁ + α)] − (β₁ + α − h)⁻¹ ∫_{h}^{β₁+α} q(t) dt

and

  Ψ₃(β₁, β₂, β₃) = ( q(β₃) − q(β₁) ) − ( q(β₁ + α) − q(β₂) ).

The techniques used to study the problems in the previous section can be adapted to study the large sample properties of this procedure and to derive the asymptotic normality of these estimators. Now, typically, the optimal sample zone will be the interval Î(ξ̂*) = [qₙ(β̂₁*), qₙ(β̂₁* + α)]. Considering Pₙ = F(qₙ(β̂₁* + α)) − F(qₙ(β̂₁*)), the coverage of this tolerance zone, we also have

  √n (Pₙ − α) →_L N(0, α(1 − α)).

So, choosing a suitable α and taking the optimal zone Î(ξ̂*) associated with the trimmed 2-means as the tolerance zone, the convergence of Theorem 2.4 together with this last convergence allows us to construct a-coverage tolerance zones with b-guarantee of attaining the a-coverage.

4. Trimmed 2-medians

The penalty function Φ(x) = x leads to the trimmed 2-medians. Now, we have to minimize (in the separated population clusters case) the target function

  D(β₁, β₂, π) = α⁻¹ [ ∫_{β₁}^{β₁+π} |q(t) − q(β₁ + ½π)| dt + ∫_{β₂}^{β₂+α−π} |q(t) − q(β₂ + ½(α − π))| dt ].
Let us define the functions Ψ₁, Ψ₂ and Ψ₃ by

  Ψ₁(β₁, β₂, π) = ½[q(β₁) + q(β₁ + π)] − q(β₁ + ½π),

  Ψ₂(β₁, β₂, π) = ½[q(β₂) + q(β₂ + α − π)] − q(β₂ + ½(α − π)),

  Ψ₃(β₁, β₂, π) = ( q(β₂ + α − π) − q(β₂ + ½(α − π)) ) − ( q(β₁ + π) − q(β₁ + ½π) ).

Working just as in the trimmed 2-means case (repeating the reasoning almost line by line), we obtain the asymptotic normality of the trimmed 2-medians and, moreover, that if Pₙ is the coverage of the zone Î(ξ̂*) (the optimal zone associated with the trimmed 2-medians, viewed as a tolerance zone), then the convergence √n(Pₙ − α) →_L N(0, α(1 − α)) also holds. We remark that the asymptotic behavior of Pₙ depends neither on k nor on the penalty function, Φ(x) = x² or Φ(x) = x.

5. Example

The data set under consideration is the 'Old Faithful geyser eruption lengths'. This data set contains 272 observations and has been used before in several publications (e.g. Silverman, 1986). For n = 272, to achieve a 0.95-guarantee of attaining a 0.85-coverage with Î(ξ̂*) we must choose α = 0.882. Fig. 3 shows the tolerance zones associated with the trimmed 2-means (solid line) and the trimmed 2-medians (dotted line) and also shows
Fig. 3. Old Faithful geyser eruption lengths: tolerance zones based on the trimmed 2-means (solid line) and the trimmed 2-medians (dotted line), together with the LTS (dashed line) and LTAD (dot-dashed line) tolerance intervals.
Butler's tolerance interval (dashed line) and Tableman's interval (dot-dashed line). We can see how a high-density zone is kept out of the LTS and LTAD tolerance intervals, whereas this zone enters the tolerance zones constructed using trimmed 2-means and trimmed 2-medians. This shows the inadequacy of tolerance intervals for a good performance under the mixture model. Moreover, the ability of trimmed 2-means and trimmed 2-medians to catch the two lumps of mass of the distribution provides tolerance zones that are much shorter than the LTS and LTAD tolerance intervals. Fig. 3 also shows the trimmed 2-means ({2.012, 4.369}) and the trimmed 2-medians ({1.967, 4.367}), and the LTS (3.714) and LTAD (4.105) estimators for this data set.

Acknowledgements

The authors gratefully acknowledge many helpful suggestions from J.A. Cuesta-Albertos during the preparation of the paper. We also thank the associate editor and two anonymous referees for their careful reading of the manuscript and for valuable suggestions improving the paper, especially the referee who kindly provided us with the details of how Lemma 1.1 leads to the function D in Section 2.

Appendix

Proof of Lemma 1.1. The complement of a set A will be denoted by Aᶜ. Write B = (T₁ − S, T₁ + S) ∪ (T₂ − S, T₂ + S) and let A be another Borel set with probability α. Note that F(A ∩ Bᶜ) = F(Aᶜ ∩ B), since α = F(A ∩ B) + F(A ∩ Bᶜ) = F(A ∩ B) + F(Aᶜ ∩ B). Thus,

  ∫_A Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x)
    = ∫_{A∩B} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x) + ∫_{A∩Bᶜ} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x)
    ≥ ∫_{A∩B} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x) + Φ(S) F(A ∩ Bᶜ)
    = ∫_{A∩B} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x) + Φ(S) F(Aᶜ ∩ B)
    ≥ ∫_{A∩B} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x) + ∫_{Aᶜ∩B} Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x)
    = ∫_B Φ( inf_{j=1,2} |x − Tⱼ| ) dF(x).
Proof of Lemma 2.1. We want to minimize D̂(ξ) when ξ = (β₁, β₂, π) ranges over the polytope

  {(β₁, β₂, π): 0 ≤ β₁ ≤ β₁ + π ≤ β₂ ≤ β₂ + α − π ≤ 1}.    (A.1)
D̂ attains its minimum at ξ̂* = (β̂₁*, β̂₂*, π̂*). If ξ̂* satisfies 0 < β̂₁* < β̂₁* + π̂* < β̂₂* < β̂₂* + α − π̂* < 1 (i.e., it belongs to the interior of the polytope), then ξ̂* must be a critical point, hence D̂′(ξ̂*) = 0 and, in particular, Ψ̂(ξ̂*) = 0. Therefore, if Ψ̂(ξ̂*) ≠ 0, then the minimum is attained at a ξ̂* on the boundary of this polytope, and consequently the coordinates of ξ̂* must satisfy, as an equality, at least one of the inequalities defining the polytope (A.1). However, with limiting probability 1, no such equality is possible, by Eq. (2.6) and the assumed relation (2.3). (Suppose, for instance, that eventually β̂₁* + π̂* = β̂₂* with probability not tending to 0. This contradicts the consistency result β̂₂* − (β̂₁* + π̂*) → β₂* − (β₁* + π*) > 0 P-a.e., which follows from Eq. (2.6) and the assumed strict inequality β₁* + π* < β₂*.)

Proof of Theorem 2.2. Lemma 2.1 implies the convergence
  √n Ψ̂(ξ̂*) →_P 0.

We can rewrite this last convergence as

  oP(1) = ( ½[Qₙ(β̂₁*) + Qₙ(β̂₁* + π̂*)] − (π̂*)⁻¹ ∫_{β̂₁*}^{β̂₁*+π̂*} Qₙ(t) dt,
            ½[Qₙ(β̂₂*) + Qₙ(β̂₂* + α − π̂*)] − (α − π̂*)⁻¹ ∫_{β̂₂*}^{β̂₂*+α−π̂*} Qₙ(t) dt,
            ( Qₙ(β̂₂* + α − π̂*) − (α − π̂*)⁻¹ ∫_{β̂₂*}^{β̂₂*+α−π̂*} Qₙ(t) dt )
              − ( Qₙ(β̂₁* + π̂*) − (π̂*)⁻¹ ∫_{β̂₁*}^{β̂₁*+π̂*} Qₙ(t) dt ) )ᵗ    (A.2)

          + √n ( Ψ(ξ̂*) − Ψ(ξ*) ).    (A.3)

It follows from the consistency of ξ̂*, the approximation (2.7) and the continuity of q′(t)B(t) that Eq. (A.2) converges to the random vector −A. Then, by a multivariate δ-method (see, e.g., Pollard, 1984, p. 63) applied to Eq. (A.3), we obtain the desired result.

Proof of Theorem 2.3. The convergence of the midpoints of the intervals may be derived by taking into account that

  √n ( ĉ₁*(ξ̂*) − c₁*(ξ*), ĉ₂*(ξ̂*) − c₂*(ξ*) )ᵗ
    = ( ½[Qₙ(β̂₁*) + Qₙ(β̂₁* + π̂*)], ½[Qₙ(β̂₂*) + Qₙ(β̂₂* + α − π̂*)] )ᵗ
      + ( ½ √n [ {q(β̂₁*) − q(β₁*)} + {q(β̂₁* + π̂*) − q(β₁* + π*)} ],
          ½ √n [ {q(β̂₂*) − q(β₂*)} + {q(β̂₂* + α − π̂*) − q(β₂* + α − π*)} ] )ᵗ.

The consistency of ξ̂* and Eq. (2.7) give the limit distribution of the first term of the last equality and, once we know the limit distribution of ξ̂*, applying the δ-method to the second term, the convergence follows.
Proof of Theorem 2.4. We use a reasoning analogous to that in Butler (1982). Let us rewrite

  √n (Pₙ − α) = √n ( [F(qₙ(β̂₁* + π̂*)) − F(q(β₁* + π*))] − [F(qₙ(β̂₁*)) − F(q(β₁*))] )
                + √n ( [F(qₙ(β̂₂* + α − π̂*)) − F(q(β₂* + α − π*))] − [F(qₙ(β̂₂*)) − F(q(β₂*))] ).

The limit distribution of ξ̂*, δ-methods and the fact that q′(·) = 1/f(q(·)) now show that

  √n (Pₙ − α) = B(β₁* + π*) − B(β₁*) + B(β₂* + α − π*) − B(β₂*) + oP(1),

so that √n(Pₙ − α) →_L N(0, α(1 − α)).

Remark 5. We have commented that the Φ(x) = x case (trimmed k-medians) is just a straightforward adaptation of the Φ(x) = x² case (trimmed k-means). The handling of more general Φ's, like Φ(x) = xᵖ, p > 1, is not so simple. First, there is no explicit expression of the 'Φ-mean' for general Φ-functions and, secondly, in the proof of Theorem 2.2 we use relations like

  √n ( (α − π)⁻¹ ∫_{β₂}^{β₂+α−π} qₙ(t) dt − (α − π)⁻¹ ∫_{β₂}^{β₂+α−π} q(t) dt ) = (α − π)⁻¹ ∫_{β₂}^{β₂+α−π} Qₙ(t) dt

or

  √n ( qₙ(β₂ + ½(α − π)) − q(β₂ + ½(α − π)) ) = Qₙ(β₂ + ½(α − π)),
which have no analogue when dealing with a general Φ.

References

Bickel, P.J., 1967. Some contributions to the theory of order statistics. Proc. 5th Berkeley Symp. Math. Statist. Prob. 1, 575–591.
Brøns, H.D., Brunk, H.D., Frank, W.E., Hanson, D.L., 1969. Generalized means and associated families of distributions. Ann. Math. Statist. 40, 339–435.
Butler, R.W., 1982. Nonparametric interval and point prediction using data trimmed by a Grubbs-type outlier rule. Ann. Statist. 10, 197–204.
Cox, D.R., 1957. Note on grouping. J. Amer. Statist. Assoc. 52, 543–547.
Cuesta-Albertos, J.A., Matrán, C., 1988. The strong law of large numbers for k-means and best possible nets of Banach valued random variables. Probab. Theory Related Fields 78, 523–534.
Cuesta-Albertos, J.A., Gordaliza, A., Matrán, C., 1997. Trimmed k-means: an attempt to robustify quantizers. Ann. Statist. 25, 553–576.
Fisher, W.D., 1958. On grouping for maximum homogeneity. J. Amer. Statist. Assoc. 53, 789–798.
Gordaliza, A., 1991a. Best approximations to random variables based on trimming procedures. J. Approx. Theory 64, 162–180.
Gordaliza, A., 1991b. On the breakdown point of multivariate location estimators based on trimming procedures. Statist. Probab. Lett. 11, 387–394.
Hartigan, J.A., 1975. Clustering Algorithms. Wiley, New York.
Hartigan, J.A., 1978. Asymptotic distribution for clustering criteria. Ann. Statist. 6, 117–131.
IEEE, 1982. Transactions on Information Theory, vol. IT-28, No. 2.
Li, L., Flury, B., 1995. Uniqueness of principal points for univariate distributions. Statist. Probab. Lett. 25, 323–327.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, vol. 1, pp. 281–297.
Parzen, E., 1979. Nonparametric statistical data modelling. J. Amer. Statist. Assoc. 74, 105–131.
Pollard, D., 1981. Strong consistency of k-means clustering. Ann. Statist. 9, 135–140.
Pollard, D., 1982. A central limit theorem for k-means clustering. Ann. Probab. 10, 919–926.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York.
Rousseeuw, P.J., 1983. Multivariate estimation with high breakdown point. Proc. 4th Pannonian Symp. on Math. Statist., Bad Tatzmannsdorf, Austria. In: Grossman, W., Pflug, G., Vincze, I., Wertz, W. (Eds.), Mathematical Statistics and Applications, vol. B. Reidel, Dordrecht, pp. 283–297.
Rousseeuw, P.J., Yohai, V., 1984. Robust regression by means of S-estimators. In: Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics, vol. 26. Springer, New York, pp. 283–297.
Serinko, R.J., Babu, G.J., 1992. Weak limit theorems for univariate k-means clustering under nonregular conditions. J. Multivariate Anal. 49, 188–203.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Stute, W., Zhu, L.X., 1995. Asymptotics of k-means clustering based on projection pursuit. Sankhyā 57, 462–471.
Sverdrup-Thygeson, H., 1981. Strong law of large numbers for measures of central tendency and dispersion of random variables in compact metric spaces. Ann. Statist. 9, 141–145.
Tableman, M., 1994. The asymptotics of the least trimmed absolute deviation (LTAD) estimator. Statist. Probab. Lett. 19, 387–398.
Tarpey, T., Li, L., Flury, B., 1995. Principal points and self-consistent points of elliptical distributions. Ann. Statist. 23, 103–112.