1997,17(3):309-318
GLIVENKO-CANTELLI TYPE THEOREMS AND THEIR APPLICATIONS¹
Zhu Lixing
Probability Laboratory, Institute of Applied Mathematics, Chinese Academy of Sciences, Beijing 100080, China
Abstract In this paper, we present a sufficient condition for the uniform convergence of means to their expectations over classes of real functions. The proof is very simple: it reduces this convergence to uniform convergence over a class of indicator functions. Furthermore, we obtain a convergence result for a non-VC class, with an application to variable transformation for fitting regression models.
Key words Uniform convergence, non-VC class of functions, variable transformation method.
1 Introduction
Let $(S, \mathcal{S})$ be a measurable space and let $P$ be a probability measure on $(S, \mathcal{S})$. Let $\{X_i, i \ge 1\}$ be i.i.d. $S$-valued random variables with common law $P$, and let $P_n$ be the empirical measure based on $X_1, \ldots, X_n$. One of the most persistent problems in probability and statistics is that of recovering the law $P$ from the random law $P_n$. The a.s. convergence of $P_n$ to $P$ uniformly over classes of functions is a subject of particular interest; the fundamental result is the celebrated Glivenko-Cantelli theorem. The work of Vapnik and Cervonenkis[5] (1971) revitalized the study of this question. Measurability considerations are set aside in what follows. Steele[4], who refined an earlier result of Vapnik and Cervonenkis, characterized the classes of sets $\mathcal{C}$ whose indicator functions satisfy
$$\sup_{A \in \mathcal{C}} |P_n I(A) - P I(A)| \to 0 \quad \text{a.s.,} \tag{1.1}$$
where $I(A)$ is the indicator function of the set $A \in \mathcal{C}$ and $Pf = \int f\,dP$. A necessary and sufficient condition for (1.1) is
$$\frac{1}{n} V_n^{\mathcal{C}}(X_1, \ldots, X_n) \to 0 \quad \text{in Pr.,} \tag{1.2}$$
where $V_n^{\mathcal{C}} - 1$ is the cardinality of the largest subset of $\{X_1, \ldots, X_n\}$ whose every subset (the empty set and the whole subset included) has the form $A \cap \{X_1, \ldots, X_n\}$ for some $A \in \mathcal{C}$.

¹ Received Jun.12, 1995; revised May 20, 1996. Research partly supported by NNSF of China.
ACTA MATHEMATICA SCIENTIA Vol.17
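For a general class of functions $\mathcal{F}$, Vapnik and Cervonenkis and Giné and Zinn[1] obtained, respectively, the necessary and sufficient conditions for
$$\sup_{\mathcal{F}} |P_n f - P f| \to 0 \quad \text{a.s.} \tag{1.3}$$
to be
$$\lim_{n \to \infty} P H_{n,\infty}(\epsilon, \mathcal{F}, P_n)/n = 0 \quad \text{for all } \epsilon > 0 \tag{1.4}$$
and
$$\lim_{n \to \infty} P H_{n,2}(\epsilon, \mathcal{F}_M, P_n)/n = 0 \quad \text{for all } M > 0,\ \epsilon > 0, \tag{1.5}$$
where $\mathcal{F}_M = \{f I(F \le M) : f \in \mathcal{F}\}$, $\sup_{\mathcal{F}} |f(s)| \le F(s)$ for all $s$ and some $F$ satisfying $PF < \infty$, and
$$H_{n,p}(\epsilon, \mathcal{F}, P_n) = \ln\Bigl\{\min\Bigl\{m : \exists f_1, \ldots, f_m \text{ such that } \sup_{f \in \mathcal{F}} \min_{i \le m} \bigl(P_n |f - f_i|^p\bigr)^{1/p} \le \epsilon\Bigr\}\Bigr\}$$
for $1 \le p \le \infty$. The above results are all beautiful. But we see that either the condition (1.4) or (1.5)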
is hard to check, since $\mathcal{F}$ is an abstract class whether $S$ is Euclidean or not. According to statistical folklore, it is important to seek an easily checked sufficient condition for (1.3). Since $\mathcal{F}$ is in general uncountable, some conditions are needed to guard against possible measurability difficulties. We leave these unspecified and assume throughout this paper that $\mathcal{F}$ is permissible in the sense of Pollard (1984, p.196). In this paper, we present another sufficient condition for classes of real functions, as follows. Let $\xi = (X, t)$, where $X$ has probability measure $P$ and $t$, an auxiliary variable independent of $X$, is distributed uniformly on $[-k, k]$ for some $k > 0$. Suppose that $\mathcal{F} = \{f\}$ is a uniformly bounded class of real functions on the space $S$, and let
$$\mathcal{G} = \{G_f = \{(x, t) : f(x) \le t \le 0 \text{ or } 0 \le t \le f(x)\} : f \in \mathcal{F}\}.$$
Let $\xi_1, \ldots, \xi_n$ be i.i.d. copies of $\xi$.
Theorem 1 If
$$\frac{1}{n} V_n^{\mathcal{G}}(\xi_1, \ldots, \xi_n) \to 0 \quad \text{in Pr.,} \tag{1.6}$$
then
$$\sup_{\mathcal{F}} |P_n f - P f| \to 0 \quad \text{a.s.,} \tag{1.7}$$
where $V_n^{\mathcal{G}}(\xi_1, \ldots, \xi_n)$ is defined as in (1.2).
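For intuition about the kind of convergence asserted in (1.1) and (1.7), the classical Glivenko-Cantelli case takes the class of half-lines $(-\infty, x]$. The following is a minimal simulation sketch, assuming $P$ is the uniform law on $[0, 1]$; the function name `sup_deviation_uniform` and the sample sizes are illustrative choices, not part of the paper.

```python
import random

def sup_deviation_uniform(sample):
    """Return sup_x |F_n(x) - x| for a sample from the uniform law on [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps from i/n to (i+1)/n at the i-th order statistic (0-based),
    # so the supremum is attained just before or at an order statistic.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)  # fixed seed so the run is reproducible
deviations = {
    n: sup_deviation_uniform([rng.random() for _ in range(n)])
    for n in (100, 1000, 10000)
}
for n, d in deviations.items():
    print(n, round(d, 4))
```

The printed deviations should shrink roughly like $n^{-1/2}$ as the sample size grows, which is the behaviour that the theorems extend from half-lines to classes of real functions.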
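In the unbounded case, we adopt the notation of (1.5), $\mathcal{F}_M$, and let
$$\mathcal{G}_M = \{G_f^M = \{(x, t) : f(x) \le t < 0 \text{ or } 0 < t \le f(x)\} : f \in \mathcal{F}_M\}.$$
Let $\xi_1^M, \ldots, \xi_n^M$ be i.i.d. copies of $\xi^M = (X, t)$, where $t$ is distributed uniformly on $[-M, M]$.
Theorem 2 A sufficient condition for
$$\sup_{\mathcal{F}} |P_n f - P f| \to 0 \quad \text{a.s.} \tag{1.8}$$
is
$$\frac{1}{n} V_n^{\mathcal{G}_M}(\xi_1^M, \ldots, \xi_n^M) \to 0 \quad \text{in Pr. for all } M > 0. \tag{1.9}$$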
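The shattering quantity in (1.2), (1.6), and (1.9) can be computed by brute force for very small samples. The sketch below does this for the class of closed intervals on the real line, which is an illustrative assumption (the paper works with the graph regions $G_f$); the helper names `interval_traces` and `largest_shattered_size` are hypothetical.

```python
from itertools import combinations

def interval_traces(points):
    """All subsets of `points` of the form [a, b] ∩ points, plus the empty set."""
    pts = sorted(points)
    traces = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            traces.add(frozenset(pts[i:j + 1]))
    return traces

def largest_shattered_size(points, traces_of):
    """Size of the largest subset of `points` all of whose subsets are traces."""
    traces = traces_of(points)
    best = 0
    for r in range(1, len(points) + 1):
        found = False
        for subset in combinations(points, r):
            sub = frozenset(subset)
            picked = {t & sub for t in traces}  # subsets of `sub` picked out by the class
            if len(picked) == 2 ** r:
                found = True
                break
        if not found:
            break  # subsets of shattered sets are shattered, so no larger set can work
        best = r
    return best

sample = [0.3, 1.7, 2.2, 4.0, 5.5, 6.1]
size = largest_shattered_size(sample, interval_traces)
print(size, size + 1)  # shattered size, and V_n under the convention of (1.2)
```

For intervals the largest shattered subset never grows with the sample, so the ratio in (1.2) tends to zero trivially; the conditions of Theorems 1 and 2 allow the shattered size to grow, provided it grows more slowly than $n$.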
It is worth noting that we establish a connection between the case of classes of indicator functions and that of general classes of functions, which makes the proofs very simple. Moreover, (1.9) is relatively easy to check when $S$ is Euclidean. As an example, we present the following result. Suppose that the distribution function of $P$ is continuous and, for a given positive integer $N$, let
$$\mathcal{F} = \{f : R^1 \to R^1;\ f \text{ has a piecewise continuous second derivative, and the number of points where } f' \text{ does not exist or } f'(x) = 0 \text{ or } f''(x) = 0 \text{ is at most } N\}.$$
Call a point where $f'(x) = 0$ an extreme point and a point where $f''(x) = 0$ a turning point.
Theorem 3 Suppose that $P$ is continuous. Then
$$\sup_{\mathcal{F}} |P_n f - P f| \to 0 \quad \text{a.s.} \tag{1.10}$$
Remark 1 When $\mathcal{F}$ is a class of step functions with at most $N$ points of discontinuity, the conclusion of Theorem 3 still holds.
Remark 2 In view of Steele's result, it is an interesting problem whether (1.6) and (1.9) are also necessary conditions.
In view of Lemma 3.1 below, we shall see that $\mathcal{F}$ in Theorem 3 is not a VC class of functions, the type of class that has received most attention so far; hence Theorem 3 is a nontrivial and interesting conclusion. Indeed, it can be applied to variable transformation for fitting regression models, and we now present such an application. Suppose that $(Y, Z)$ is a random vector with distribution $P$, where $Y$ is one-dimensional and $Z$ is $d$-dimensional. In many applications, one needs to inspect the variable $Y$ of primary interest and its relationship with $Z$. But when the dimension $d$ is large, the sparsity of the sample points in $R^d$ makes it hard to choose an effective method; see Huber (1985). Therefore, dimension reduction techniques have to be developed. Li (1989, 1991) proposed a method based on projection pursuit for detecting nonlinear structures, including heteroscedasticity, curve-blurring, and clustering, with the aim of finding those projection directions for which the scatterplot of $Y$ against the projected variable suggests a suitable transformation of $Y$ giving a good linear least squares fit by the projected variable. His projection index, which measures the interestingness of the data structure hidden in the scatter diagram of $Y$ against $b^T Z$, is
$$R^2(b) = \max_h \mathrm{Corr}^2(h(Y), b^T Z), \tag{1.11}$$
where $\mathrm{Corr}(\cdot, \cdot)$ denotes the correlation coefficient between two variables and the maximum is taken over all transformations $h$. As pointed out by Li (1991), the index $R^2(b)$ is affine invariant with respect to $b$. However, since we are interested only in the projection direction $b$, we can restrict ourselves to normalized $b$, that is, $\|b\| = 1$; hence the affine
invariance is not an essential statistical requirement here. For this reason, we choose a simpler projection index, namely
$$R(h, b) = P(h(Y) - b^T Z)^2, \qquad R(b) = \inf_{h \in L^2(P)} P(h(Y) - b^T Z)^2 = b^T E\{(Z - E(Z|Y))(Z - E(Z|Y))^T\} b. \tag{1.12}$$
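The last equality in (1.12) is stated without derivation. A short sketch of the standard orthogonal decomposition behind it, assuming $P\|Z\|^2 < \infty$ so that every term is finite, is:

```latex
\begin{aligned}
P\bigl(h(Y) - b^T Z\bigr)^2
  &= P\bigl(h(Y) - b^T E(Z|Y)\bigr)^2
     + P\bigl(b^T (Z - E(Z|Y))\bigr)^2 \\
  &\ge b^T E\{(Z - E(Z|Y))(Z - E(Z|Y))^T\}\, b .
\end{aligned}
```

The cross term vanishes because $h(Y) - b^T E(Z|Y)$ is a function of $Y$ and $E(Z - E(Z|Y) \mid Y) = 0$. Equality holds for $h(Y) = b^T E(Z|Y)$, which belongs to $L^2(P)$ under the moment assumption.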
Among all possible directions, an attractive choice for projection is the direction $b_0$ that minimizes $R(b)$. To estimate this direction, we first need to estimate $\inf_{\|b\|=1} R(b)$, and we now give a consistent estimate of it. For each integer $N$, define
$$\mathcal{F}_N = \{h : R^1 \to R^1 :\ h \text{ has a piecewise continuous second derivative, the number of points where } h' \text{ does not exist or } h'(y) = 0 \text{ or } h''(y) = 0 \text{ is at most } N, \text{ and there exists an } L^2(P) \text{ function } F(\cdot) \text{ such that } |h(y)| \le F(y) \text{ for all } y \in R^1\}.$$
Denote
$$R_{nN}(b) = \inf_{h \in \mathcal{F}_N} \frac{1}{n} \sum_{j=1}^{n} (h(Y_j) - b^T Z_j)^2. \tag{1.13}$$
Theorem 4 Suppose that the marginal distribution of $Y$ under $P$ is continuous and that $P\|Z\|^2 < \infty$. Assume further that there exists a known function $F(y)$ with $P F^2(Y) < \infty$ for which $\|E(Z|Y = y)\| \le F(y)$ for all $y \in R^1$, and that each component of $E(Z|Y)$ is continuous. Then
$$\lim_{N \to \infty} \lim_{n \to \infty} \inf_{\|b\|=1} R_{nN}(b) = \inf_{\|b\|=1} R(b) \quad \text{a.s.} \tag{1.14}$$
Furthermore, assume that the direction $b_0$ is unique (except for the sign change $b_0 \to -b_0$), and let $b_{nN}$ be the value that minimizes $R_{nN}(b)$. Then
$$\lim_{N \to \infty} \lim_{n \to \infty} b_{nN} = b_0 \quad \text{a.s.} \tag{1.15}$$
Remark 3 The class of step functions does not satisfy the restrictions defining $\mathcal{F}_N$. But, as pointed out in Remark 1, the assertion of Theorem 3 remains true for it, so Theorem 3 still yields the above conclusion when $\mathcal{F}_N$ is a class of step functions with at most $N$ points of discontinuity. In this case the estimate is constructed, as described by Li (1991), by sliced inverse regression (SIR) with a fixed number of slices. For more details on SIR, see Li (1989, 1991).
The proofs of Theorem 1 and Theorem 2 are given in Section 2; Section 3 contains the proofs of Theorem 3 and Theorem 4.
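The step-function case of Remark 3 can be illustrated numerically. The sketch below is a simplification, not the paper's estimator: it assumes the slices are fixed in advance and hold nearly equal numbers of observations ordered by $Y$, whereas the infimum in (1.13) also ranges over the jump locations. The function name `sliced_index` and the toy data are hypothetical.

```python
import random

def sliced_index(y, z, b, n_slices):
    """Empirical index when h ranges over step functions constant on fixed slices of Y.

    Assumption (not from the paper): slices hold nearly equal numbers of
    observations ordered by Y. On each slice the best constant h is the slice
    mean of b^T Z, so the infimum equals the pooled within-slice variance.
    """
    n = len(y)
    proj = [sum(bi * zi for bi, zi in zip(b, zj)) for zj in z]  # b^T Z_j
    order = sorted(range(n), key=lambda j: y[j])
    total = 0.0
    for s in range(n_slices):
        idx = order[s * n // n_slices:(s + 1) * n // n_slices]
        if not idx:
            continue
        mean = sum(proj[j] for j in idx) / len(idx)
        total += sum((proj[j] - mean) ** 2 for j in idx)
    return total / n

rng = random.Random(0)
n = 2000
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Toy model: the first coordinate of Z is a noisy function of Y, the second is pure noise.
z = [(yj + 0.1 * rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for yj in y]
r_informative = sliced_index(y, z, (1.0, 0.0), n_slices=20)
r_noise = sliced_index(y, z, (0.0, 1.0), n_slices=20)
print(r_informative, r_noise)
```

In this toy model the index is much smaller along the direction in which $Z$ depends on $Y$, which is why the minimizer of the index is the direction of interest.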
2 The Proofs of Theorems 1 and 2
The proof of Theorem 1 Our idea is to transform the supremum over the class of functions into a supremum over a class of indicator functions, and then to apply Steele's result (1.2) or Pollard's (Pollard, 1984).
Let $G_{f+} = \{(x, t) : 0 \le t \le f(x)\}$ and $G_{f-} = \{(x, t) : f(x) \le t < 0\}$ for $f \in \mathcal{F}$. Recalling the notation $Pf = \int f\,dP$, we have $Pf = 2k\,PU\{I(G_{f+}) - I(G_{f-})\}$ and $P_n f = 2k\,P_n U\{I(G_{f+}) - I(G_{f-})\}$, where $U$ denotes the uniform distribution on $[-k, k]$ and $k$ is a uniform bound of $\mathcal{F}$, that is, $\sup_{\mathcal{F}} \sup_x |f(x)| \le k$. Hence
$$\sup_{\mathcal{F}} |P_n f - P f| = 2k \sup_{\mathcal{F}} \bigl|P_n U\{I(G_{f+}) - I(G_{f-})\} - PU\{I(G_{f+}) - I(G_{f-})\}\bigr|. \tag{2.1}$$
We see that $\xi$ has distribution $PU$, which is the joint distribution of two independent random variables. Let $\{\xi_i, i = 1, \ldots, n\}$ be an i.i.d. sample from $PU$, and construct the empirical measure $Q_n$ based on $\xi_1, \ldots, \xi_n$. It is then easy to see that $P_n U\{I(G_{f+}) - I(G_{f-})\}$ is the expectation of $Q_n\{I(G_{f+}) - I(G_{f-})\}$ with respect to the uniform distribution $U$, namely,
$$P_n U(I(G_{f+}) - I(G_{f-})) - PU(I(G_{f+}) - I(G_{f-})) = U\{Q_n(I(G_{f+}) - I(G_{f-})) - PU(I(G_{f+}) - I(G_{f-}))\}. \tag{2.2}$$
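The identity $Pf = 2k\,PU\{I(G_{f+}) - I(G_{f-})\}$ used above can be checked numerically. The following is a Monte Carlo sketch under stated assumptions: $P$ is taken to be the uniform law on $[0, 1]$, $f(x) = \sin 3x$, and $k = 1$; none of these choices come from the paper.

```python
import math
import random

def f(x):
    """Illustrative bounded function with |f| <= 1 (an assumption, not from the paper)."""
    return math.sin(3.0 * x)

K = 1.0  # uniform bound k on |f|

def signed_indicator(x, t):
    """I(G_{f+})(x, t) - I(G_{f-})(x, t)."""
    fx = f(x)
    if 0.0 <= t <= fx:
        return 1.0
    if fx <= t < 0.0:
        return -1.0
    return 0.0

rng = random.Random(1)
n = 200_000
xs = [rng.random() for _ in range(n)]        # X ~ P, here uniform on [0, 1]
ts = [rng.uniform(-K, K) for _ in range(n)]  # auxiliary t ~ U[-k, k], independent of X

lhs = sum(f(x) for x in xs) / n  # P_n f
rhs = 2.0 * K * sum(signed_indicator(x, t) for x, t in zip(xs, ts)) / n  # 2k Q_n{I(G_f+) - I(G_f-)}
exact = (1.0 - math.cos(3.0)) / 3.0  # P f, the integral of sin(3x) over [0, 1]
print(lhs, rhs, exact)
```

Both empirical quantities estimate the same number $Pf$, which is what lets the proof replace a class of functions by a class of sets.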
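Consequently, we need only show that
$$\sup_{\mathcal{F}} |Q_n I(G_{f+}) - PU\,I(G_{f+})| \to 0 \quad \text{a.s.}$$
and
$$\sup_{\mathcal{F}} |Q_n I(G_{f-}) - PU\,I(G_{f-})| \to 0 \quad \text{a.s.} \tag{2.3}$$
But (1.6) implies (2.3) by Steele's result.
The proof of Theorem 2 For any fixed $\epsilon > 0$, choose $M$ large enough that $P F I(F > M) < \epsilon$. By Theorem 1, (1.9) implies that
$$\sup_{\mathcal{F}_M} |P_n f - P f| \to 0 \quad \text{a.s.} \tag{2.4}$$
On the other hand,
$$\sup_{\mathcal{F}} |P_n f - P f| \le \sup_{\mathcal{F}} |P_n f I(F \le M) - P f I(F \le M)| + P F I(F > M) + P_n F I(F > M) \le \sup_{\mathcal{F}_M} |P_n f - P f| + 2\epsilon \tag{2.5}$$
for large $n$, since $P_n F I(F > M) \to P F I(F > M)$ a.s. by the strong law of large numbers. The proof is completed by letting $n \to \infty$ and then $M \to \infty$.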
3 The Proofs of Theorem 3 and Theorem 4
For convenience, we adopt Pollard's definition (1984).
Definition A class of sets $\mathcal{C}$ is said to shatter a set of points $V$ if each of the subsets of $V$ has the form $C \cap V$ for some $C \in \mathcal{C}$.
We first introduce some lemmas.
Lemma 3.1 Suppose $H = \{h\}$ is a class of real functions on $R^1$ whose second derivatives are piecewise continuous. Let $V = \{(x_i, t_i) : i = 1, \ldots, L\}$ be any set of at least 7 points lying in the upper half plane. If the class of sets $\mathcal{G}_H = \{G_h = \{(x, t) : 0 \le t \le h(x)\} : h \in H\}$ shatters $V$, then either
i) there exists an $h \in H$ which has at least one point of non-differentiability, an extreme point, or a turning point in the interval $[\min\{x_i\}, \max\{x_i\}]$, or
ii) each point of $V$ is an extreme point of the convex hull of $V$.
Proof Without loss of generality, assume that $V$ is a set of 7 points. If (ii) holds, there is nothing to prove. Now suppose that at least one point of $V$ is a convex combination of the others. Without loss of generality assume that $x_i = i$, $i = 1, \ldots, 7$. If some $h$ has a point of non-differentiability in $[1, 7]$, the proof is finished. So suppose that no $h$ has a point of non-differentiability in $[1, 7]$; we investigate the various cases separately.
i)* $t_i$ is decreasing or increasing in $i$ $(2 \le i \le 6)$. We can assume without loss of generality that $t_2 \ge t_3 \ge \cdots \ge t_6$. If there is an extreme point, the proof is complete, so we assume further that there is no extreme point in $[1, 7]$. We shall find at least one turning point in $[1, 7]$. If there is no turning point, then on $[1, 7]$ the derivative $h'(\cdot)$ of $h(\cdot)$ is non-decreasing or non-increasing, that is, $h$ is convex or concave. Since $\mathcal{G}_H$ shatters $V$, there are $h_1, h_2 \in H$ for which
$$G_{h_1} \cap V = \{(x_i, t_i) : i = 2, 4, 6\}, \qquad G_{h_2} \cap V = \{(x_i, t_i) : i = 1, 3, 5, 7\}. \tag{3.1}$$
Suppose first that one of $h_1$, $h_2$ is concave and the other is convex on $[1, 7]$; without loss of generality, $h_1$ is concave and $h_2$ is convex. Then $G_{h_1} \cap \{[1, 7] \times R^1\}$ is a convex set and $G_{h_2}^c \cap \{[1, 7] \times R^1\}$ is a convex set. Hence $G_{h_1} \cap \{[1, 7] \times R^1\}$ contains the segment between $(x_2, t_2)$ and $(x_4, t_4)$,
$$t = \frac{t_4 - t_2}{2}(x - 4) + t_4 \quad (2 \le x \le 4), \tag{3.2}$$
and then contains the region $\{(x, t) : 0 \le t \le \frac{t_4 - t_2}{2}(x - 4) + t_4\} \cap \{[2, 4] \times R^1\}$. Meanwhile, $(x_3, t_3) \notin G_{h_1} \cap \{[1, 7] \times R^1\}$; thus $(x_3, t_3)$ lies above the segment (3.2). On the other hand, $G_{h_2} \cap V = \{(x_i, t_i) : i = 1, 3, 5, 7\}$, so $G_{h_2}^c \cap \{[1, 7] \times R^1\} \cap V \supset \{(x_i, t_i) : i = 2, 4, 6\}$. Since $G_{h_2}^c \cap \{[1, 7] \times R^1\}$ is a convex set, it contains the segment between $(x_2, t_2)$ and $(x_4, t_4)$, and hence contains $(x_3, t_3)$, because $(x_3, t_3)$ lies above the segment. This contradicts $G_{h_2} \cap V = \{(x_i, t_i) : i = 1, 3, 5, 7\}$. So there is an $h \in H$ which has at least one turning point in $[1, 7]$.
Next, suppose that $h_1$ and $h_2$ are both convex or both concave; without loss of generality, both are concave. Invoking the concavity of both $h_1$ and $h_2$ and arguing as above, we see that $(x_i, t_i)$ lies above the segment between $(x_{i-1}, t_{i-1})$ and $(x_{i+1}, t_{i+1})$, $i = 2, \ldots, 6$. This implies that each point of $V$ is an extreme point of the convex hull of $V$, which contradicts the assumption made at the beginning of the proof.
i)** $t_2, \ldots, t_6$ is neither decreasing nor increasing. It is easy to see that there are indices $2 \le j < i < k \le 6$ such that either $t_j < t_i$ and $t_k < t_i$, or $t_j > t_i$ and $t_k > t_i$. We assume without loss of generality that $t_j < t_i$ and $t_k < t_i$. Since $\mathcal{G}_H = \{G_h\}$ shatters $V$, there is a function $h \in H$ for which
$$(x_i, t_i) \in G_h, \qquad (x_j, t_j) \notin G_h, \qquad (x_k, t_k) \notin G_h.$$
Consequently, on the interval $[j, k]$, $h$ has the following properties:
$$h(i) \ge t_i, \qquad h(j) < t_j \ \text{ and } \ h(k) < t_k.$$
Denote $h^* = \sup_{[j,k]} h(x)$. By the continuity of $h(\cdot)$ on $[j, k]$ we can find $x_0 \in [j, k]$ for which $h(x_0) = h^*$. Since $h^* \ge h(i) \ge t_i > \max(t_j, t_k) > \max(h(j), h(k))$, the point $x_0$ lies in the open interval $(j, k)$. This shows that $x_0$ is an extreme point of $h$, completing the proof of the lemma.
Lemma 3.2 Let $V = \{(x_i, t_i), i = 1, 2, \ldots, L\}$ be a set of at least 7 points in the upper half plane or in the lower half plane. Suppose that at least one point of $V$ is a convex combination of the others. Then there exists a subset $V'$ of $V$ such that if $G_h \cap V = V'$ for some $h \in H$, then $h$ must have at least one point of non-differentiability, a turning point, or an extreme point.
Proof This is a direct consequence of the proof of Lemma 3.1.
Now let $\mathcal{F}$ be the class defined in Theorem 3.
Lemma 3.3 For any positive integer $k \ge 14(3N + 1)$ let $V = \{(x_i, t_i) : i = 1, \ldots, k\}$. If $\mathcal{G}_{\mathcal{F}}$ shatters $V$, then there exist, in the upper half plane or in the lower half plane, $k/[2(3N + 1)]$ adjacent points which are all extreme points of their convex hull.
Proof Divide $V$, ordered by abscissa, into $(3N + 1)$ groups of adjacent points, each containing at least $k/(3N + 1)$ points. Obviously, in each group there are at least $[k/(3N + 1)]/2$ points which lie in the upper half plane or in the lower half plane; write $V_i$ for this set of points in the $i$-th group. According to the partition of $V$, we split the axis of abscissas into $(3N + 1)$ intervals. (Since $\mathcal{G}_{\mathcal{F}}$ shatters $V$, the $x_i$'s are necessarily different from each other.) If the assertion of the lemma is not true, then by Lemma 3.2 there are subsets $V_i'$ of $V_i$, $i = 1, \ldots, 3N + 1$, such that if $G_f \cap \bigcup_{i=1}^{3N+1} V_i = \bigcup_{i=1}^{3N+1} V_i'$ for some $f \in \mathcal{F}$, then $f$ has at least one point of non-differentiability, turning point, or extreme point in each interval. But this is impossible for $f \in \mathcal{F}$, since such an $f$ has at most $N$ points of this kind while there are $3N + 1$ intervals. This completes the proof of Lemma 3.3.
We now come to the final step of the proof of Theorem 3, using Lemma 3.3. By the sufficiency part of Theorem 2, we only need to show that, for all $M$,
$$\frac{1}{n} V_n^{\mathcal{G}_M}(\xi_1^M, \ldots, \xi_n^M) \to 0 \quad \text{in Pr.} \tag{3.3}$$
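Suppose that (3.3) does not hold, that is, for some $M$ and some $\epsilon > 0$,
$$P\Bigl\{\frac{1}{n} V_n^{\mathcal{G}_M}(\xi_1^M, \ldots, \xi_n^M) \ge \epsilon\Bigr\} > \epsilon \quad \text{for infinitely many } n. \tag{3.4}$$
The following idea is similar to that of Steele (see also Pollard, 1984). Divide $R^1 \times [-M, M]$ into a patchwork of $m^2$ subrectangles in the following way: first divide $R^1$ into $m$ intervals of equal probability under $P$, which is possible by the continuity of the distribution; next, slice $[-M, M]$ in the same way with respect to the uniform distribution on $[-M, M]$. The value of $m$ will be specified shortly. Because the class $\mathcal{A}$ of all possible unions of these subrectangles is finite,
$$P\Bigl\{\sup_{\mathcal{A}} |Q_n I(A) - PU\,I(A)| \ge \epsilon/[4(3N + 1)]\Bigr\} \le \tfrac{1}{2}\epsilon \tag{3.5}$$
for all $n$ large enough. Then for infinitely many $n$,
$$P\Bigl\{\frac{1}{n} V_n^{\mathcal{G}_M}(\xi_1^M, \ldots, \xi_n^M) \ge \epsilon \ \text{ and } \ \sup_{\mathcal{A}} |Q_n I(A) - PU\,I(A)| < \epsilon/[4(3N + 1)]\Bigr\} > \tfrac{1}{2}\epsilon. \tag{3.6}$$
So there must exist a sample configuration for which $\mathcal{G}_M$ shatters some collection of at least $n\epsilon$ sample points and for which
$$|Q_n I(A) - PU\,I(A)| < \epsilon/[4(3N + 1)] \quad \text{for every } A \in \mathcal{A}. \tag{3.7}$$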
By Lemma 3.3, write $O$ for the convex hull of the subset of $n\epsilon/[2(3N + 1)]$ points in the shattered set, and $A_0$ for the union of those subrectangles that intersect the boundary of $O$. The set $A_0$ contains all the extreme points of $O$, so $Q_n I(A_0) \ge \epsilon/[2(3N + 1)]$, and it belongs to $\mathcal{A}$, so $|Q_n I(A_0) - PU\,I(A_0)| \le \epsilon/[4(3N + 1)]$. Consequently, $PU\,I(A_0) \ge \epsilon/[4(3N + 1)]$, which will give the desired contradiction if we make $m$ large enough. Write $\partial O$ for the boundary of $O$. We first divide $R^1 \times [-M, M]$ into 9 subrectangles, that is, $m = 3$. Clearly $\partial O$ can intersect at most 8 of the 9 subrectangles, say $E_1, \ldots, E_8$; in other words, $\partial O \subset \bigcup_{i=1}^{8} E_i$, and
$$PU\Bigl(\bigcup_{i=1}^{8} E_i\Bigr) = \tfrac{8}{9}.$$
Subdivide each $E_i$, $i = 1, \ldots, 8$, into 9 subrectangles. By the same reasoning, for each $i$, $\partial O$ intersects only 8 of these subrectangles, say $E_{i1}, \ldots, E_{i8}$. Thus
$$\partial O \subset \bigcup_{i=1}^{8} \bigcup_{j=1}^{8} E_{ij} = \bigcup_{j=1}^{8^2} E_j^{(2)}.$$
Repeating the above argument, we obtain a union of subrectangles $\bigcup_{j=1}^{8^l} E_j^{(l)} = A_l$ for which $\partial O \subset A_l$, where $PU(A_l) = (\tfrac{8}{9})^l$. Stop this procedure when $(\tfrac{8}{9})^{l_0} < \epsilon/[4(3N + 1)]$ for some $l_0$. At this point $m = 3^{l_0}$ and $A_{l_0}$ is the desired set $A_0$, which contradicts the lower bound on $PU\,I(A_0)$ obtained above. This completes the proof.
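The stopping rule in the subdivision argument is easy to evaluate. The sketch below computes the smallest $l_0$ with $(8/9)^{l_0} < \epsilon/[4(3N + 1)]$ and the resulting $m = 3^{l_0}$; the function name `refinement_depth` and the sample values of $\epsilon$ and $N$ are illustrative assumptions.

```python
def refinement_depth(eps, n_critical):
    """Smallest l0 with (8/9)**l0 < eps / (4 * (3 * n_critical + 1)), and m = 3**l0.

    `n_critical` plays the role of N, the bound on the number of critical points.
    """
    threshold = eps / (4 * (3 * n_critical + 1))
    l0 = 0
    mass = 1.0  # PU-mass of the current union of subrectangles, (8/9)**l0
    while mass >= threshold:
        l0 += 1
        mass *= 8.0 / 9.0
    return l0, 3 ** l0

l0, m = refinement_depth(eps=0.1, n_critical=1)
print(l0, m)
```

The depth grows only logarithmically in $1/\epsilon$ and in $N$, although the grid size $m = 3^{l_0}$ itself is large; the proof only needs $m$ to be finite and fixed before $n$ tends to infinity.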
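The proof of Theorem 4 We first show that for each integer $N$,
$$\sup_{\|b\|=1} \sup_{\mathcal{F}_N} \Bigl|\frac{1}{n} \sum_{j=1}^{n} (h(Y_j) - b^T Z_j)^2 - P(h(Y) - b^T Z)^2\Bigr| \to 0 \quad \text{a.s., as } n \to \infty. \tag{3.8}$$
Clearly, for (3.8), it is enough to show that
$$J_{nN}^{(1)} = \sup_{\mathcal{F}_N} \Bigl|\frac{1}{n} \sum_{j=1}^{n} \bigl(h^2(Y_j) - P h^2(Y)\bigr)\Bigr| \to 0 \quad \text{a.s.,} \tag{3.9}$$
$$J_{nN}^{(2)} = \sup_{\mathcal{F}_N} \Bigl\|\frac{1}{n} \sum_{j=1}^{n} \bigl(h(Y_j) Z_j - P h(Y) Z\bigr)\Bigr\| \to 0 \quad \text{a.s.,} \tag{3.10}$$
$$J_{nN}^{(3)} = \sup_{\|b\|=1} \Bigl|\frac{1}{n} \sum_{j=1}^{n} \bigl((b^T Z_j)^2 - P(b^T Z)^2\bigr)\Bigr| \to 0 \quad \text{a.s.} \tag{3.11}$$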
By Theorem 3, (3.9) is easily shown. Moreover, since $\frac{1}{n} \sum_{j=1}^{n} (Z_j Z_j^T - P Z Z^T) \to 0$ a.s., (3.11) is obvious. To show (3.10), we combine Theorem 3 and (1.5). Assume without loss of generality that $Z$ is scalar and that $h$ is nonnegative. Similarly to the proof of Theorem 2, we only need to show that
$$\sup_{\mathcal{F}_N} |Q_n I(G_h) Z - PU(I(G_h) Z)| \to 0 \quad \text{a.s.,} \tag{3.12}$$
where $Q_n$ is the empirical measure determined by $(Z_j, Y_j, t_j)$, $j = 1, \ldots, n$. By the condition $P|Z|^2 < \infty$, it is enough to show that for each $M > 0$
$$\sup_{\mathcal{F}_N} |Q_n I(G_h) Z I(|Z| \le M) - PU(I(G_h) Z I(|Z| \le M))| \to 0 \quad \text{a.s.} \tag{3.13}$$
Furthermore, by (1.5), it suffices to prove that
$$\lim_{n \to \infty} PU\,H_{n,2}(\epsilon, \mathcal{F}_{NM}, Q_n)/n = 0 \quad \text{for all } \epsilon > 0, \tag{3.14}$$
where $\mathcal{F}_{NM} = \{I(G_h) Z I(|Z| \le M) : h \in \mathcal{F}_N\}$.
For convenience, write $I(G_h(y, t))$ for $I(G_h)$. For any two functions $h_1, h_2 \in \mathcal{F}_{NM}$, we have
(3.15)
It implies that
$$H_{n,2}(\epsilon, \mathcal{F}_{NM}, Q_n) \le H_{n,\infty}(\epsilon/M, \mathcal{F}_{NM}, Q_n). \tag{3.16}$$
Thus we only need to show that
(3.17)
Since for any two functions $h_1, h_2 \in \mathcal{F}_N$
$$|I(G_{h_1}) - I(G_{h_2})| = I(G_{h_1} \Delta G_{h_2}),$$
where the notation $\Delta$ means the symmetric difference of two sets, then for any fixed $\delta > 0$ if and only if
(3.18)
By Theorem 16 of Pollard (1984, p.19; see also Steele, 1975) and the definition of $V_n^{\mathcal{G}_M}(\xi_1, \ldots, \xi_n)$, for any fixed $\delta > 0$, we have
(3.19)
where $V$ stands for $V_n^{\mathcal{G}_M}((Y_1, t_1), \ldots, (Y_n, t_n))$. By the proof of Theorem 3, we have
$$\frac{1}{n} V \to 0 \quad \text{in Pr.,}$$
then
$$\frac{1}{n} H_{n,\infty}(\delta, \mathcal{F}_{NM}, Q_n) \le \frac{1}{n}\bigl\{(V - 1) \log(n/(V - 1)) + \log 2 + (V - 1)\bigr\} \to 0 \quad \text{in Pr.,}$$
which implies that
$$\frac{1}{n} PU\,H_{n,\infty}(\delta, \mathcal{F}_{NM}, Q_n) \to 0.$$
Consequently, (3.12) is shown, and hence so is (3.10). The proof of (3.8) is completed.
From (3.8), it is clear that
$$\lim_{n \to \infty} \inf_{\|b\|=1} R_{nN}(b) = \inf_{\|b\|=1} \inf_{\mathcal{F}_N} P(h(Y) - b^T Z)^2 \quad \text{a.s.} \tag{3.20}$$
Now all we need to show is that
$$\lim_{N \to \infty} \inf_{\|b\|=1} \inf_{\mathcal{F}_N} P(h(Y) - b^T Z)^2 = \inf_{\|b\|=1} \inf_{h \in L^2(P)} P(h(Y) - b^T Z)^2. \tag{3.21}$$
By (1.12), $P F^2 < \infty$, and the continuity of $E(Z|Y)$, (3.21) can be shown by a standard method of approximation theory; the details of the proof are omitted here. This completes the proof of Theorem 4.
References
1 Giné E, Zinn J. Some limit theorems for empirical processes. Ann. Probab., 1984, 12: 929-989
2 Li K C. Data visualization with SIR: a transformation based projection pursuit method. UCLA statistical series, 1989, 24
3 Li K C. Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association, 1991, 86: 316-342
4 Pollard D. Limit theorems for empirical processes. Z. Wahrsch. Verw. Gebiete, 1981, 57: 181-185
5 Pollard D. Convergence of stochastic processes. Springer-Verlag, New York, 1984
6 Pollard D. Asymptotics via empirical processes. Statistical Science, 1989, 4: 341-366
7 Steele J M. Empirical discrepancies and subadditive processes. Ann. Probab., 1978, 6: 118-127