Concentration of measure and cluster analysis


Statistics & Probability Letters 65 (2003) 65 – 70

Daniel Z. Zanger∗
Booz Allen Hamilton, Inc., 3133 Connecticut Ave., NW No. 521, Washington, DC 20008, USA
Received February 2003; accepted July 2003

Abstract

An estimate on measure concentration for the problem of cluster analysis, posed as a suitable integer linear program, is stated and proved.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Asymptotic analysis of algorithms; Measure concentration; Clustering methods

1. Introduction

Concentration of measure is by now a well-known phenomenon, often encountered within high-dimensional structures or within the context of large numbers of variables, that involves the propensity of some functions defined on certain probability spaces to concentrate around their medians. This measure concentration effect has been carefully studied by a number of authors (Donoho, 2000; Ledoux, 2001; McDiarmid, 1998; Milman, 1988; Milman and Schechtman, 1986; Pestov, 2000; Steele, 1996; Talagrand, 1995), due in no small measure to its relationship to applications within combinatorial optimization, statistical learning, data analysis, and data mining.

In this paper, we examine the relationship between the concentration of measure phenomenon and the problem of cluster analysis, posed as an appropriate integer linear program. Indeed, given any positive integer $N$ as well as two finite disjoint sets $I$ and $J$ of respective cardinalities $N$ and $M$, where $0 < M < N$, let $c_{i,j} \in [0,1]$ represent the cost of assigning $i$ to $j$, $\forall i \in I$, $j \in J$. Letting $|\cdot|$ denote the cardinality of a set, consider the following problem, equivalent to a corresponding integer linear program:

Minimize $\sum_{i=1}^{N} c_{i,\pi(i)}$   (1)

∗ Fax: +1-202-797-0597. E-mail address: [email protected] (D.Z. Zanger).

0167-7152/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.spl.2003.07.001


over all

$\pi \colon I \to J, \quad \pi \text{ surjective},$   (2)

$|\pi^{-1}(j)| = s_j, \quad j = 1,\dots,M,$   (3)

wherein the $s_j > 0$, $j = 1,\dots,M$, constitute a set of predetermined integers designating the desired cardinalities of $M$ clusters, and we essentially identify $I = \{1,\dots,N\}$ and $J = \{1,\dots,M\}$. The problem above models, for example, the situation in which a set $I$ of $N$ objects is given together with some dissimilarity measure between them and a distinguished subset of cardinality $M$ of designated cluster representatives or "centers," and it is required to partition $I$ into clusters of prescribed size, each of which contains exactly one of the centers and as many as possible of the other objects least dissimilar to it. Such a situation is closely related to a $K$-means or $K$-medoid clustering problem (see Hastie et al., 2001) for which it is desired to exert some control over the sizes of the respective clusters.

To prove a concentration result corresponding to the problem (1)–(3), we will generalize and extend results of Talagrand (1995) on concentration inequalities for the (linear) assignment problem. In order to establish a framework for this, we replace the deterministic costs $c_{i,j}$, $i = 1,\dots,N$, $j = 1,\dots,M$, in (1) with corresponding uniform, independent random variables $X_{i,j}$, $i = 1,\dots,N$, $j = 1,\dots,M$, over $[0,1]$. Given fixed constants $C_0 > 0$, $\gamma \in [0, \tfrac{1}{2})$, we choose any desired positive integers $s_j$, $j = 1,\dots,M$, with

$s_j \le C_0 N^{\gamma}, \qquad \forall j = 1,\dots,M,$   (4)

and consider the r.v.

$L_N = \inf\Big\{ \sum_{i \in I} X_{i,\pi(i)} \;\Big|\; \pi \colon I \to J \text{ onto},\ |\pi^{-1}(j)| = s_j,\ j = 1,\dots,M \Big\}.$   (5)
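For concreteness, the following minimal sketch (not from the paper; all parameter values are purely illustrative) evaluates one realization of $L_N$ in (5) for a tiny instance by brute force, enumerating every surjection $\pi \colon I \to J$ whose fibers have the prescribed sizes $s_j$.

```python
# Illustrative brute-force evaluation of L_N in (5) for a tiny instance.
from itertools import permutations
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 2
s = [3, 3]                            # prescribed cluster sizes, sum(s) = N
X = rng.uniform(size=(N, M))          # i.i.d. uniform costs X_{i,j} on [0, 1]

# Each admissible pi corresponds to an arrangement of the multiset
# {j repeated s_j times, j = 1..M} over the objects 1..N.
labels = [j for j in range(M) for _ in range(s[j])]
L_N = min(sum(X[i, pi[i]] for i in range(N))
          for pi in set(permutations(labels)))
print(L_N)
```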

The growth condition (4) turns out to be necessary due to the dependence relationships that will obtain among the $X_{i,j}$ as the problem is transformed into one involving instead bijective assignments, as in Talagrand (1995). Our principal result, Theorem 2, will imply in part that, essentially, the random variable $L_N$ concentrates to within $O(N^{-1/2+\gamma})$ (up to logarithmic factors) of a given median with probability $1 - 2/N$. The strategy of the proof will involve first invoking the theory of $\theta$-expanding digraphs to suitably bound the cost of an optimal assignment with sufficiently high probability (Propositions 2 and 3) and then appealing to a result (Theorem 1 of the present paper) of Talagrand's on concentration of extrema of linear functionals of random variables to extract our main theorem.

2. Main results and proofs

In order to establish our measure concentration inequality for $L_N$, we restate the problem (1)–(3) as one involving bijective assignments

$\phi \colon I \to K$   (6)

between disjoint sets $I$ and $K$ with $|I| = |K| = N$. For this we define, using the $X_{i,j}$, random variables $Y_{i,k}$ giving the (randomized) measure of the cost of assigning $i \in I$ to $k \in K$ via

$Y_{i,k} \equiv X_{k,j}, \quad \text{where } j \text{ is such that } i \in [S_{j-1}+1,\, S_j],$   (7)


and $S_j \equiv \sum_{l=1}^{j} s_l$, $S_0 \equiv 0$, $\forall j = 1,\dots,M$, with the $s_l$, $l = 1,\dots,M$, as in (1)–(3). (Once again we essentially identify $I = K = \{1,\dots,N\}$ and $J = \{1,\dots,M\}$.) Of course, each of the $N^2$ random variables $Y_{i,k}$ is also uniform over $[0,1]$, but clearly they do not form an independent set. In any case, the relevant observation for us is that if $L_N$ is defined using the $X_{i,j}$ as before in (5), then in fact we also have

$L_N = \inf\Big\{ \sum_{i \in I} Y_{i,\phi(i)} \;\Big|\; \phi \text{ bijective} \Big\}.$   (8)
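The reformulation (7)–(8) also gives a practical way to compute $L_N$ at scale: expanding center $j$ into the block of $s_j$ slots $[S_{j-1}+1, S_j]$ yields an ordinary $N \times N$ assignment problem. The sketch below (not from the paper; it assumes NumPy and SciPy are available, and it transposes the roles of $I$ and $K$ relative to (7), which leaves the optimum unchanged) computes $L_N$ with the Hungarian algorithm and repeats the experiment to exhibit the concentration around the empirical median that Theorem 2 quantifies.

```python
# Illustrative computation of L_N through the bijective reformulation (7)-(8).
import numpy as np
from scipy.optimize import linear_sum_assignment

def L_N(X, s):
    # Center j contributes s_j identical columns, mirroring the dependent
    # costs Y_{i,k}; the optimum over bijections equals L_N by (8).
    Y = np.repeat(X, s, axis=1)                # N x N cost matrix
    rows, cols = linear_sum_assignment(Y)      # optimal bijective assignment
    return Y[rows, cols].sum()

rng = np.random.default_rng(1)
N, M = 500, 25
s = np.full(M, N // M)                         # s_j = 20, small relative to N, in the spirit of (4)
draws = np.array([L_N(rng.uniform(size=(N, M)), s) for _ in range(40)])
med = np.median(draws)
print(med, np.abs(draws - med).max())          # spread is small, as Theorem 2 predicts
```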

In order to introduce the required result of Talagrand's on concentration of extrema of linear functionals, assume that $(Z_i)_{i \le N}$ is an independent sequence of random variables and $\mathcal{F}$ any family of $N$-tuples $\alpha = (\alpha_i)_{i \le N}$ of real numbers, and consider the random variable

$Z = \inf_{\alpha \in \mathcal{F}} \sum_{i \le N} \alpha_i Z_i.$   (9)

We set $\sigma = \sup_{\alpha \in \mathcal{F}} \|\alpha\|_2$, where $\|\alpha\|_2 = \big(\sum_{i \le N} \alpha_i^2\big)^{1/2}$. Recall that a median of a random variable such as $Z$ is defined to be a number $M$ with

$\Pr(Z \le M) \ge \tfrac{1}{2} \quad \text{and} \quad \Pr(Z \ge M) \ge \tfrac{1}{2}.$   (10)

Theorem 1 (Talagrand, 1995). Consider the random variable $Z$ given by (9) and a median $M$ of $Z$. Also assume that for each $i$ there is a number $r_i$ for which $r_i \le Z_i \le r_i + 1$. Then for all $w > 0$ we have

$\Pr(|M - Z| \ge w) \le 4 \exp\Big(-\frac{w^2}{4\sigma^2}\Big).$   (11)

We will also require the concept of a $\theta$-expanding digraph. For this, we first note, more generally, that by a digraph $D$ we will simply mean a subset of $I \times K$. Consider such a digraph and $S \subseteq I$. Set

$D(S) = \{ k \in K \mid \exists i \in S,\ (i,k) \in D \}.$   (12)

The digraph $D$ is $\theta$-expanding ($\theta \ge 2$) if, for all subsets $S$ of $I$:

$|S| \ge \frac{N}{2} \;\Rightarrow\; |D(S)| \ge N - \frac{1}{\theta}(N - |S|),$   (13)

$|S| \le \frac{N}{2} \;\Rightarrow\; |D(S)| \ge \min\Big(\theta|S|, \frac{N}{2}\Big).$   (14)
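As an aside, conditions (13)–(14) can be checked directly for small digraphs by enumerating all subsets $S$ of $I$; the sketch below (not from the paper, purely illustrative) does exactly that.

```python
# Illustrative brute-force check of the expansion conditions (13)-(14).
from itertools import combinations

def image(D, S):                      # the set D(S) of (12)
    return {k for (i, k) in D if i in S}

def is_theta_expanding(D, N, theta):
    for r in range(1, N + 1):
        for S in combinations(range(N), r):
            sz = len(image(D, set(S)))
            if r >= N / 2 and sz < N - (N - r) / theta:
                return False          # (13) violated
            if r <= N / 2 and sz < min(theta * r, N / 2):
                return False          # (14) violated
    return True

# The complete digraph I x K satisfies both conditions for any theta >= 2.
N = 6
D = {(i, k) for i in range(N) for k in range(N)}
print(is_theta_expanding(D, N, theta=2))       # True
```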

Proposition 1 (Talagrand, 1995). Consider a $\theta$-expanding digraph $D$ and an integer $m$ such that $\theta^m \ge N/2$. Consider a one-to-one map $\phi$ from $I$ to $K$. Then, given any $i \in I$, we can find $n \le 2m$ and distinct points $i_1 = i, i_2, \dots, i_n$, with $i_{n+1} = i$, such that for $1 \le l \le n$ we have $(i_l, \phi(i_{l+1})) \in D$.

Consider $u > 0$ and the digraph $D_u$ given by $(i,k) \in D_u \Leftrightarrow Y_{i,k} \le (2u\log N)/N^{1-\gamma}$.


Proposition 2. Assume that the digraph $D_u$ is $\theta$-expanding, and consider an integer $m$ such that $\theta^m \ge N/2$. Then for an optimal assignment $\phi$ we have $Y_{i,\phi(i)} \le 4mu(\log N)/N^{1-\gamma}$ for all $i \le N$.

Proof. Consider any $i \in I$, and consider $i = i_1, \dots, i_{n+1} = i$ as in Proposition 1, used for $D = D_u$. Define $\phi'(i_l) = \phi(i_{l+1})$ for $1 \le l \le n$, and $\phi'(i') = \phi(i')$ if $i' \notin \{i_1, \dots, i_n\}$. Since $\phi$ is optimal, we have

$\sum_{i' \le N} Y_{i',\phi(i')} \le \sum_{i' \le N} Y_{i',\phi'(i')},$   (15)

so that

$Y_{i,\phi(i)} \le \sum_{1 \le l \le n} Y_{i_l,\phi'(i_l)} \le 2nu N^{-1+\gamma} \log N,$   (16)

and the claim follows since $n \le 2m$.

Proposition 3. There are constants $C_1, C_2 > 0$ such that if $u \ge C_1$, $2u\log N \le N^{1-\gamma}$, $0 \le \gamma < \tfrac{1}{2}$, the random digraph $D_u$ is $C_2 u$-expanding with probability $\ge 1 - N^{-C_2 u}$.

Proof. We explain why (13) with $\theta = C_2 u$ is satisfied with probability $\ge 1 - N^{-C_2 u}$; the case for (14) is similar. Given $s$ for which $N/2 \le s \le N$, consider a subset $S$ of $I$ with $s = |S|$. Since

$|\{ i \in S : i \in [S_{j-1}+1,\, S_j] \}| \le s_j \le C_0 N^{\gamma},$   (17)

$\forall j = 1,\dots,M$, at least $s/(C_0 N^{\gamma})$ of the $Y_{i,k}$, $i \in S$, for any given $k \in K$ are independent. Hence we have, for any $k \in K$,

$\Pr(k \in D(S)^c) \le \Big(1 - \frac{2u\log N}{N^{1-\gamma}}\Big)^{s/(C_0 N^{\gamma})}$   (18)

$\le \exp\Big(-\frac{2u(\log N)s}{C_0 N}\Big).$   (19)

Then, appealing to elementary properties of the binomial distribution, we have

$\Pr\Big(|D(S)| < N - \frac{N-s}{\theta}\Big) \le N^{\frac{N-s}{\theta}+3}\Big[\exp\Big(-\frac{2u(\log N)s}{C_0 N}\Big)\Big]^{\frac{N-s}{\theta}+1}$   (20)

$= N^{3}\exp\Big(-\frac{2us(\log N)(N-s+\theta)}{C_0 N \theta} + \frac{(N-s)(\log N)}{\theta}\Big)$   (21)

$\le N^{3}\exp\Big(-\frac{u(\log N)(N-s+\theta)}{C_0 \theta} + \frac{(N-s)(\log N)}{\theta}\Big).$   (22)


There are no more than $N^{(N-s)}$ subsets of $I$ of cardinality $s$. Consequently,

$\Pr\Bigg(\bigcup_{N/2 \le s \le N}\ \bigcup_{\{S \subseteq I :\, |S| = s\}} \Big\{ |D(S)| < N - \frac{N-s}{\theta} \Big\}\Bigg) \le \sum_{N/2 \le s \le N} N^{(N-s)}\, N^{3}\exp\Big(-\frac{u(\log N)(N-s+\theta)}{C_0\theta} + \frac{(N-s)(\log N)}{\theta}\Big)$   (23)

$= \sum_{N/2 \le s \le N} N^{3}\exp\Big(-\frac{u(\log N)(N-s+\theta)}{C_0\theta} + (N-s)(\log N)\Big(1 + \frac{1}{\theta}\Big)\Big)$   (24)

$\le \exp(-C_2 u \log N)$   (25)

with $u \ge C_1$, $\theta = C_2 u$ and a sagacious choice of $C_1, C_2 > 0$. The proposition immediately follows.

Theorem 2. Denote by $M_N$ a median of $L_N$. Then, for all $N$ ($N \ge 3$) and any $C_0 > 0$, $0 \le \gamma < \tfrac{1}{2}$ as in (4), there is $K = K(C_0)$ such that

$t \le \sqrt{\log N} \;\Rightarrow\; \Pr\Big(|L_N - M_N| \ge \frac{Kt(\log N)^2}{N^{(1/2-\gamma)}}\Big) \le 2\exp(-t^2),$   (26)

$t \ge \sqrt{\log N} \;\Rightarrow\; \Pr\Big(|L_N - M_N| \ge \frac{Kt^3(\log N)^2}{N^{(1/2-\gamma)}\log(t^2\log N)}\Big) \le 2\exp(-t^2).$   (27)
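To get a sense of the rate in (26), the short sketch below (illustrative only; the constant $K = K(C_0)$ is unspecified in the theorem and is simply set to 1 here, with $\gamma = 0$ and $t = 2$) tabulates the deviation radius $Kt(\log N)^2/N^{1/2-\gamma}$, which shrinks like $N^{-1/2}$ up to logarithmic factors.

```python
# Illustrative evaluation of the deviation radius in (26); K, gamma, t are
# placeholder values, not constants from the paper.
import math

K, gamma, t = 1.0, 0.0, 2.0
for N in (10**3, 10**4, 10**5, 10**6):
    radius = K * t * math.log(N) ** 2 / N ** (0.5 - gamma)
    print(f"N = {N:>7d}   radius ~ {radius:.4f}")
```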

Proof. Consider $u \le N^{1-\gamma}/(2\log N)$, $\theta = C_2 u$ (where $C_2$ is as in Proposition 3), and the smallest integer $m$ such that $\theta^m \ge N/2$. Set

$v = \frac{5mu(\log N)}{N^{1-\gamma}}, \qquad Z_{i,k} = \min(Y_{i,k}, v).$

Consider the r.v. $L_N^u$ defined as $L_N$ is on the right-hand side of (8) but using the costs $Z_{i,k}$ instead of the $Y_{i,k}$. To show that $L_N^u \ge L_N$, note that $(i,k) \in D_u \Leftrightarrow Y_{i,k} \le 2uN^{-1+\gamma}(\log N) \Leftrightarrow Z_{i,k} \le 2uN^{-1+\gamma}(\log N)$. Assuming that $D_u$ is $\theta$-expanding, the analogue of Proposition 2 for the $Z_{i,k}$ implies that we have $Z_{i,\phi(i)} \le 4mu(\log N)/N^{1-\gamma}$, $1 \le i \le N$, for an optimal assignment $\phi$ with respect to $L_N^u$. Since $v \ge 4mu(\log N)/N^{1-\gamma}$, it follows that, for such $\phi$, $Z_{i,\phi(i)} = Y_{i,\phi(i)}$, so that $L_N^u \ge L_N$. Alternately, that $L_N^u \le L_N$ is clear from Proposition 2 applied to the $Y_{i,k}$. Hence,

$\Pr(L_N = L_N^u) \ge 1 - N^{-C_2 u}.$   (28)

Notice that, of course, we may write

$\frac{L_N^u}{v} = \inf\Big\{ \sum_{i \in I} \frac{Z_{i,\phi(i)}}{v} \;\Big|\; \phi \colon I \to K,\ \phi \text{ bijective} \Big\}.$   (29)


Since $0 \le Z_{i,\phi(i)}/v \le 1$, we can take $r_i \equiv 0$ to apply Theorem 1. It then follows that, for all $w > 0$,

$\Pr(|L_N^u - M(L_N^u)| \ge w) \le 2\exp\Big(-\frac{w^2}{4Nv^2}\Big),$   (30)

wherein $M(L_N^u)$ denotes the median of $L_N^u$. Combining with (28), we obtain

$\Pr(|L_N - M| \ge w) \le \Pr(|L_N^u - M(L_N^u)| \ge w) + \Pr(L_N \ne L_N^u)$   (31)

$\le 2\exp\Big(-\frac{w^2}{4Nv^2}\Big) + N^{-C_2 u}.$   (32)

Finally, to obtain (26)–(27), we select the requisite parameters. Take $w = 3\sqrt{N}\, t v$. If $t^2 \le \log N$, take $u = K = K(C_0) \ge \max\{C_1, C_2^{-1}\}$, and, if $t^2 \ge \log N$, select $u = Kt^2/\log N$, where once again $K = K(C_0) \ge \max\{C_1, C_2^{-1}\}$.

References

Donoho, D.L., 2000. High-dimensional data analysis: the curses and blessings of dimensionality. Technical Report, Stanford University Department of Statistics.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Berlin.
Ledoux, M., 2001. The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI.
McDiarmid, C., 1998. Concentration. In: Probabilistic Methods for Algorithmic Discrete Mathematics. Springer, Berlin, pp. 195–248.
Milman, V.D., 1988. The heritage of P. Lévy in geometric functional analysis. Astérisque 157–158, 273–302.
Milman, V.D., Schechtman, G., 1986. Asymptotic Theory of Finite Dimensional Normed Spaces. Lecture Notes in Mathematics, Vol. 1200. Springer, Berlin, New York.
Pestov, V., 2000. On the geometry of similarity search: dimensionality curse and concentration of measure. Inform. Process. Lett. 73, 47–51.
Steele, J.M., 1996. Probability Theory and Combinatorial Optimization. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 69. SIAM.
Talagrand, M., 1995. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'I.H.É.S. 81, 73–205.