III
General Cramér Theory
3.1 Preliminary Formulation
We want in this section to extend the CRAMÉR Theorem (cf. Theorem 1.2.6) to a more general setting. In order to describe the setting which we have in mind, it will be necessary to introduce the following embellished form of the hypothesis (C) made at the beginning of Section 2.2.
$(\tilde{C})$
$E$ and $X$ are the same as they were in (C). In addition there is a metric $\rho$ on $E$ which is compatible with the topology on $E$ induced by the topology on $X$ and a measurable norm $\|\cdot\|$ on $X$ (which need not be compatible with the topology on $X$) such that: $(E,\rho)$ is Polish; $\|\cdot\|$ is bounded on $\rho$-bounded subsets of $E$;
for all $\alpha \in [0,1]$ and all elements $p_1, p_2, q_1, q_2 \in E$; and
Without further mention, we will be working in this section with the situation which we now describe. $E$, $X$, $\rho$, and $\|\cdot\|$ are as in $(\tilde{C})$, and $\Omega = E^{\mathbb{Z}^+}$ is given the product topology. Note that, since $E$ is a separable metric space, the BOREL field $\mathcal{B}_\Omega$ over $\Omega$ coincides with the product $\sigma$-algebra $(\mathcal{B}_E)^{\mathbb{Z}^+}$. Next, for $n \in \mathbb{Z}^+$, we use $X_n : \Omega \to E$ to denote the $n$th coordinate map (i.e., $X_n(\omega) = \omega_n$). In view of the preceding remark about $\mathcal{B}_\Omega$, one sees that not only is each of the maps $X_n$ measurable from $(\Omega, \mathcal{B}_\Omega)$ into $(X, \mathcal{B}_X)$ but so are linear combinations of these maps. Given $0 \le m \le n$, we will use $S_n^m$ to denote $\sum_{k=m+1}^n X_k$ ($\equiv 0$ when $m = n$) and $\bar{S}_n^m$ to stand for $S_n^m/(n-m)$; and when $m = 0$, we will usually drop the superscript. Finally, $\mu \in M_1(E)$, $P \equiv \mu^{\mathbb{Z}^+}$ (again using the remark about $\mathcal{B}_\Omega$, one sees that $P \in M_1(\Omega)$) and $\mu_n \in M_1(E)$ is the distribution of $\bar{S}_n$ under $P$. Our purpose will be to study the large deviation theory for the family $\{\mu_n : n \ge 1\}$. Obviously, to whatever extent we succeed, we will have generalized CRAMÉR's Theorem. Our approach is an amalgam of ideas coming from D. RUELLE via O. LANFORD and the results obtained in Section 2.2. In particular, we will first use LANFORD's argument to show, in complete generality, that $\{\mu_n : n \ge 1\}$ satisfies a weak large deviation principle with a convex rate function. We will then do our best to replace the weak principle with the full large deviation principle and to identify the governing rate function. The main reason for our needing to make the assumptions in $(\tilde{C})$ is that we will want to use the technical facts proved in the following lemma.
3.1.1 Lemma. Let $A$ be a non-empty, open convex subset of $E$. Then for any $K \subset\subset A$, the closed convex hull $\hat{K}$ of $K$ is also a compact subset of $A$. In particular, if $\nu \in M_1(E)$, then, for each $\delta \in (0,1)$, there is a convex $K \subset\subset A$ such that $\nu(K) \ge (1-\delta)\nu(A)$.
PROOF: Suppose that $K \subset\subset A$. Given $0 < \delta < \rho(K, A^c)$, choose $p_1, \dots, p_M \in K$ so that
$$K \subseteq \bigcup_{m=1}^M B(p_m, \delta)$$
and denote by $\Gamma(\delta)$ the set of points $\sum_1^M \alpha_m q_m$, where $\{\alpha_m\}_1^M \subseteq [0,1]$ with $\sum_1^M \alpha_m = 1$ and $q_m \in B(p_m, \delta)$, $1 \le m \le M$. Clearly, $\Gamma(\delta) \subseteq A$ and is closed in $E$. Moreover, because $(\tilde{C})$ implies that $\rho$-balls are convex, it is easy to show that $\Gamma(\delta)$ is convex. Hence, $\hat{K} \subseteq \Gamma(\delta)$. This not only proves that $\hat{K} \subseteq A$, but it also gives us an easy way to see that $\hat{K}$ is compact. Indeed, again using $(\tilde{C})$, one sees that
where $\widehat{\{p_1, \dots, p_M\}}$ is the convex hull of $\{p_1, \dots, p_M\}$ and, as such, is compact. Since $\hat{K} \subseteq \Gamma(\delta)$ and $\delta$ can be taken arbitrarily small, it follows immediately that $\hat{K}$ is totally bounded and therefore, since it is closed in $E$, compact.
Large Deviations
Given the first part, the second part of the lemma is an immediate consequence of the well-known ULAM's Lemma which says that, because $E$ and therefore $A$ are Polish spaces, there is a $K \subset\subset A$ such that $\nu(K) \ge (1-\delta)\nu(A)$; and obviously, the first part says that we may as well take $K$ to be convex. ∎
Our first application of Lemma 3.1.1 occurs already in the second part of the next key result.
3.1.2 Lemma. For each convex $C \in \mathcal{B}_E$, $n \in \mathbb{Z}^+ \mapsto \mu_n(C)$ is supermultiplicative. In addition, if $A$ is an open convex subset of $E$, then either $\mu_n(A) = 0$ for all $n \in \mathbb{Z}^+$ or there exists an $N \in \mathbb{Z}^+$ such that $\mu_n(A) > 0$ for all $n \ge N$.
PROOF: To prove the first assertion, observe that, by convexity,
$$\{\bar{S}_{m+n} \in C\} \supseteq \{\bar{S}_m \in C\} \cap \{\bar{S}^m_{m+n} \in C\},$$
and therefore, by shift invariance and independence,
$$\mu_{m+n}(C) \ge \mu_m(C)\,\mu_n(C).$$
We next turn to the second assertion. Suppose that $\mu_m(A) > 0$ for some $m \in \mathbb{Z}^+$, and, using Lemma 3.1.1, choose a convex $K \subset\subset A$ so that $\mu_m(K) > 0$. Let $0 < 2\delta < \rho(K, A^c)$, take $G = \{q \in E : \|q - K\| < \delta\}$, and set $M = \sup\{\|q\| : q \in K\}$. Then, for $n = sm + r$, where $0 \le r < m$,
as long as $mM < n\delta$. Thus, if we choose $N$ so that $mM < N\delta$ and
then, since $K$ is convex, we have that
for all $n \ge N$. ∎
Before we can use Lemma 3.1.2, we recall the following simple fact about sub-additive functions.
3.1.3 Lemma. Let $f : \mathbb{Z}^+ \to [0,\infty]$ be a sub-additive function and assume that there is an $N \in \mathbb{Z}^+$ such that $f(n) < \infty$ for all $n \ge N$. Then
$$\lim_{n\to\infty} \frac{f(n)}{n} = \inf_{n \ge N} \frac{f(n)}{n} \in [0,\infty).$$
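The interplay between Lemma 3.1.2 and Lemma 3.1.3 can be checked numerically in the simplest possible case. The following is only an illustrative sketch, not part of the original argument: it assumes $E = \mathbb{R}$, takes $\mu$ to be the fair-coin law on $\{0,1\}$, and uses the convex set $C = [0.6, 1]$. The names `count_in_interval`, `counts`, and `ratios` are hypothetical helpers introduced here.

```python
from fractions import Fraction
from math import comb, log


def count_in_interval(n, lo, hi):
    """Count 0/1 strings of length n whose average k/n lies in [lo, hi]."""
    return sum(comb(n, k) for k in range(n + 1) if lo * n <= k <= hi * n)


LO, HI = Fraction(3, 5), Fraction(1)  # the convex set C = [0.6, 1]
N_MAX = 60

# mu_n(C) = counts[n] / 2**n when mu is the fair-coin law on {0, 1}.
counts = {n: count_in_interval(n, LO, HI) for n in range(1, 2 * N_MAX + 1)}

# Supermultiplicativity mu_{m+n}(C) >= mu_m(C) * mu_n(C), checked in exact
# integer arithmetic (the factors 2**(m+n) = 2**m * 2**n cancel).
supermultiplicative = all(
    counts[m + n] >= counts[m] * counts[n]
    for m in range(1, N_MAX + 1)
    for n in range(1, N_MAX + 1)
)

# f(n) = -log mu_n(C) is therefore sub-additive; the ratios f(n)/n are the
# quantities whose limit Lemma 3.1.3 identifies with their infimum.
ratios = {n: (n * log(2) - log(counts[n])) / n for n in counts}

if __name__ == "__main__":
    print("supermultiplicative:", supermultiplicative)
    for n in (1, 10, 30, 60, 120):
        print(n, round(ratios[n], 4))
```

Exact integer counts are used for the supermultiplicativity check so that no floating-point tolerance is needed; only the ratios $f(n)/n$ are computed in floating point.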
PROOF: For $m \ge N$, set $M_m = \max\{f(n) : m \le n \le 2m\}$. For $n \ge m \ge$
where $s = [n/m]$ and $r = n - ms$. Hence,
By combining Lemma 3.1.2 with Lemma 3.1.3, we know that if $\mathcal{C}^\circ$ denotes the collection of all non-empty, convex open sets $A$ in $E$, then
$$\mathcal{L}(A) \equiv \mathcal{L}_\mu(A) \equiv -\lim_{n\to\infty} \frac{1}{n} \log\big(\mu_n(A)\big) \in [0,\infty] \tag{3.1.4}$$
exists for every $A \in \mathcal{C}^\circ$. Noting that if $I$ is the rate function governing the large deviations of $\{\mu_n : n \ge 1\}$, then (cf. the proof of Lemma 2.1.1)
$$I(q) = I_\mu(q) \equiv \lim_{r \searrow 0} \mathcal{L}_\mu\big(B(q,r)\big) = \sup\{\mathcal{L}_\mu(A) : q \in A \in \mathcal{C}^\circ\}, \tag{3.1.5}$$
we see that there is no alternative to our adopting (3.1.5) as the definition of $I$. Of course, we still have to check that this $I$ does indeed give rise to a large deviation principle.
3.1.6 Theorem. The function $I_\mu$ in (3.1.5) is a convex rate function on $E$ and $\{\mu_n : n \ge 1\}$ satisfies the weak large deviation principle with rate function $I_\mu$. Furthermore, if $G$ is a finite union of elements from $\mathcal{C}^\circ$, then
$$\lim_{n\to\infty} \frac{1}{n} \log\big(\mu_n(G)\big) = -\inf_G I_\mu.$$
PROOF: The lower semi-continuity of $I_\mu$ is an immediate consequence of its definition. To prove that $I_\mu$ is convex, let $q_1, q_2 \in E$ be given, and set $q = \frac{q_1 + q_2}{2}$. Given an $A \in \mathcal{C}^\circ$ containing $q$, choose $A_i \in \mathcal{C}^\circ$ so that $q_i \in A_i$ and $A \supseteq \frac{A_1 + A_2}{2}$. Then
$$\begin{aligned}
\mathcal{L}(A) &= -\lim_{n\to\infty} \frac{1}{2n} \log\big(\mu_{2n}(A)\big) \\
&\le -\lim_{n\to\infty} \frac{1}{2n} \log P\big(\{\omega : \bar{S}_n(\omega) \in A_1 \text{ and } \bar{S}^n_{2n}(\omega) \in A_2\}\big) \\
&= \frac{1}{2}\Big(-\lim_{n\to\infty} \frac{1}{n} \log\big[\mu_n(A_1)\big] - \lim_{n\to\infty} \frac{1}{n} \log\big[\mu_n(A_2)\big]\Big) \le \frac{I_\mu(q_1) + I_\mu(q_2)}{2};
\end{aligned}$$
and from this we conclude that $I_\mu(q) \le \big(I_\mu(q_1) + I_\mu(q_2)\big)/2$. Because we already know that $I_\mu$ is lower semi-continuous, the convexity of $I_\mu$ is now proved by a familiar iteration argument followed by a passage to the limit.
The fact that $\varliminf_{n\to\infty} \frac{1}{n} \log\big(\mu_n(G)\big) \ge -\inf_G I_\mu$ for arbitrary open $G$ in $E$ is built into the definition of $I_\mu$. Next, suppose that $K \subset\subset E$ and let $\ell < \inf_K I_\mu$. Then, there is a finite cover $\{A_1, \dots, A_M\} \subseteq \mathcal{C}^\circ$ of $K$ with the property that $\mathcal{L}(A_m) > \ell$ for each $1 \le m \le M$. Hence,
and so we have proved that $\varlimsup_{n\to\infty} \frac{1}{n} \log\big(\mu_n(K)\big) \le -\inf_K I_\mu$ and therefore that the weak large deviation principle holds.
To complete the proof, suppose that $G = \bigcup_1^M A_m$, where $\{A_m\}_1^M \subseteq \mathcal{C}^\circ$. Then an easy argument shows that
$$-\lim_{n\to\infty} \frac{1}{n} \log\big(\mu_n(G)\big) = \min_{1 \le m \le M} \mathcal{L}(A_m).$$
Hence, it suffices for us to check that $\mathcal{L}(A) = \inf_A I_\mu$ for every $A \in \mathcal{C}^\circ$; and since we already know that $\mathcal{L}(A) \le \inf_A I_\mu$, this comes down to checking that $\mathcal{L}(A) \ge \inf_A I_\mu$ when $\mathcal{L}(A) < \infty$. To this end, let $\delta \in (0,1)$ be given and choose $N$ so that $-\frac{1}{n} \log\big(\mu_n(A)\big) \le \mathcal{L}(A) + \delta$ for $n \ge N$. Next, we use Lemma 3.1.1 to find a convex $K \subset\subset A$ such that
$$\frac{1}{N} \log\big(\mu_N(A)\big) - \frac{1}{N} \log\big(\mu_N(K)\big) < \delta.$$
Then, by sub-additivity and the preceding paragraph,
$$\inf_A I_\mu \le \inf_K I_\mu \le \varliminf_{n\to\infty} -\frac{1}{n} \log\big(\mu_n(K)\big) \le -\frac{1}{N} \log\big(\mu_N(K)\big) \le \mathcal{L}(A) + 2\delta.$$
3.1.7 Corollary. If for each $L \ge 0$ there is a $K_L \subset\subset E$ such that
$$\varlimsup_{n\to\infty} \frac{1}{n} \log\big(\mu_n(K_L^c)\big) \le -L, \tag{3.1.8}$$
then $I_\mu$ is a good rate function and $\{\mu_n : n \ge 1\}$ satisfies the full large deviation principle with rate function $I_\mu$. If, in addition,
for every $\lambda \in X^*$, then
and
PROOF: The first assertion is no more than the conjunction of Theorem 3.1.6 and Lemma 2.1.5. The rest is an immediate consequence of the first part together with Theorem 2.2.21. ∎
3.1.11 Exercise.
In the case when $E = X$ is finite dimensional and $\rho(p,q) = \|q - p\|$, show that the whole of Corollary 3.1.7 applies as soon as $\Lambda_\mu(\lambda) < \infty$ for every $\lambda \in X^*$. This is, of course, the CRAMÉR Theorem in the general finite dimensional setting.
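In the one-dimensional case of Exercise 3.1.11 the rate function is the LEGENDRE transform $\Lambda^*(x) = \sup_\lambda\big(\lambda x - \Lambda(\lambda)\big)$ of the logarithmic moment generating function, and it can be evaluated numerically. The sketch below is an illustration only: it assumes $E = \mathbb{R}$, uses a plain ternary search (valid because $\lambda \mapsto \lambda x - \Lambda(\lambda)$ is concave), and restricts $\lambda$ to a bounded window, which is an assumption that must contain the maximizer. All function names are hypothetical.

```python
import math


def legendre_transform(log_mgf, x, lam_lo=-50.0, lam_hi=50.0, iters=200):
    """Approximate sup over lam in [lam_lo, lam_hi] of lam*x - log_mgf(lam).

    The objective is concave in lam, so ternary search converges to its
    maximum on the interval.
    """
    def objective(lam):
        return lam * x - log_mgf(lam)

    lo, hi = lam_lo, lam_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)


def gaussian_log_mgf(lam):
    """Lambda(lam) for the standard normal law: lam**2 / 2."""
    return 0.5 * lam * lam


def bernoulli_log_mgf(lam):
    """Lambda(lam) = log((1 + e**lam) / 2) for a fair coin, computed stably."""
    return max(lam, 0.0) + math.log1p(math.exp(-abs(lam))) - math.log(2.0)


if __name__ == "__main__":
    print(legendre_transform(gaussian_log_mgf, 1.5))   # closed form: x**2 / 2
    print(legendre_transform(bernoulli_log_mgf, 0.7))  # closed form: relative entropy
```

For the standard normal law the closed form is $x^2/2$, and for the fair coin it is $x\log(2x) + (1-x)\log\big(2(1-x)\big)$ on $(0,1)$, which gives two independent checks of the search.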
3.2 Sanov's Theorem
In this section we will specialize to the situation described in the second example of Remark 2.2.1. That is, $E = M_1(\Sigma)$ and $X = M(\Sigma)$, where $\Sigma$ is a Polish space and $M(\Sigma)$ is given the topology which is generated by the sets in (2.2.2). Clearly, the topology inherited by $M_1(\Sigma)$ as a subset of $M(\Sigma)$ is the weak topology (i.e., the topology corresponding to convergence against bounded continuous test functions). In order to show that $M_1(\Sigma)$ and $M(\Sigma)$ satisfy the hypothesis $(\tilde{C})$ at the beginning of Section 3.1, we must produce the metric $\rho$ on $M_1(\Sigma)$ and the norm $\|\cdot\|$ on $M(\Sigma)$. The latter is easy; namely, we take $\|\alpha\|$ to be the total variation $\|\alpha\|_{\mathrm{var}}$ of $\alpha \in M(\Sigma)$. Since
we see that $\|\cdot\|_{\mathrm{var}}$ is lower semi-continuous and therefore certainly measurable on $M(\Sigma)$; and clearly $\|\cdot\|_{\mathrm{var}}$ is bounded on $M_1(\Sigma)$. We now turn to the metric for $M_1(\Sigma)$. Following LÉVY and PROHOROV, define the LÉVY metric
$$\rho(\alpha,\nu) = \inf\big\{\delta > 0 : \alpha(F) \le \nu(F^{(\delta)}) + \delta \text{ and } \nu(F) \le \alpha(F^{(\delta)}) + \delta \text{ for every closed } F \text{ in } \Sigma\big\} \tag{3.2.1}$$
for $\alpha, \nu \in M_1(\Sigma)$, where $F^{(\delta)}$ is defined relative to a complete metric on $\Sigma$. An easy argument shows that $\rho$ is a metric and that it satisfies the convexity property required in $(\tilde{C})$. Since it is clear that $\rho(\alpha,\nu) \le \|\nu - \alpha\|$, all that remains is to show that $\rho$ is compatible with the weak topology and that $(M_1(\Sigma),\rho)$ is Polish. Before proving this, we will need to recall some elementary properties of the weak topology.
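(i) The weak topology is second countable.
(ii) $\alpha_n \Rightarrow \nu$ if and only if $\varlimsup_{n\to\infty} \alpha_n(F) \le \nu(F)$ for every closed $F$ in $\Sigma$.
(iii) $\Gamma \subseteq M_1(\Sigma)$ is relatively compact if and only if for each $\delta > 0$ there is a $K \subset\subset \Sigma$ such that $\alpha(K) \ge 1 - \delta$ for every $\alpha \in \Gamma$. (Such a subset $\Gamma$ is said to be tight.)
(iv) If $\mathcal{F} \subseteq C_b(\Sigma;\mathbb{R})$ is uniformly bounded on all of $\Sigma$ and is equi-continuous on each compact subset of $\Sigma$, then $\alpha_n \Rightarrow \nu$ implies that
$$\sup_{\varphi \in \mathcal{F}} \left| \int_\Sigma \varphi\, d\alpha_n - \int_\Sigma \varphi\, d\nu \right| \to 0.$$
All of these facts are well-known, and their proofs can be found in any standard text in which the modern theory of weak convergence is discussed. We will now use them to check that the LÉVY metric possesses the properties which we want.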
3.2.2 Lemma. (LÉVY & PROHOROV) The metric $\rho$ in (3.2.1) is compatible with the weak topology on $M_1(\Sigma)$, and $(M_1(\Sigma),\rho)$ is Polish.
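For measures with finite support on the real line, the infimum in (3.2.1) can be approximated by brute force, which makes the definition concrete. The sketch below rests on two assumptions that are not taken from the text: $F^{(\delta)}$ is read as the open $\delta$-neighbourhood $\{x : \mathrm{dist}(x,F) < \delta\}$, and measures are represented as dictionaries from points to masses. It uses the observation that, for finitely supported $\alpha$, it is enough to test the first inequality on subsets $F$ of the support of $\alpha$ (and symmetrically for $\nu$). The function names are hypothetical.

```python
from itertools import combinations


def _mass_near(measure, points, delta):
    """Mass `measure` gives to the open delta-neighbourhood of `points`."""
    return sum(
        weight
        for x, weight in measure.items()
        if any(abs(x - p) < delta for p in points)
    )


def _one_sided_ok(a, b, delta, tol=1e-12):
    """Check a(F) <= b(F^(delta)) + delta for every non-empty F in supp(a)."""
    support = list(a)
    for size in range(1, len(support) + 1):
        for subset in combinations(support, size):
            mass_a = sum(a[p] for p in subset)
            if mass_a > _mass_near(b, subset, delta) + delta + tol:
                return False
    return True


def levy_prokhorov(a, b, iters=60):
    """Approximate the metric in (3.2.1) for finitely supported measures on R.

    The admissible deltas form an up-set and delta = 1 is always admissible,
    so bisection on [0, 1] locates the infimum.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if _one_sided_ok(a, b, mid) and _one_sided_ok(b, a, mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    print(levy_prokhorov({0.0: 1.0}, {0.3: 1.0}))
    print(levy_prokhorov({0.0: 1.0}, {0.0: 0.5, 5.0: 0.5}))
```

The subset enumeration is exponential in the support size, so this is meant for hand-sized examples only. In both examples the value is dominated by the total variation norm of the difference, in line with the bound $\rho(\alpha,\nu) \le \|\nu - \alpha\|$.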
PROOF: In view of property (ii) above, it is obvious that $\alpha_n \Rightarrow \nu$ if $\rho(\alpha_n,\nu) \to 0$. To prove the opposite implication, let $\delta > 0$ be given and for each closed $F$ in $\Sigma$ define
$$\psi_F(\sigma) = \frac{\mathrm{dist}\big(\sigma, (F^{(\delta)})^c\big)}{\mathrm{dist}(\sigma, F) + \mathrm{dist}\big(\sigma, (F^{(\delta)})^c\big)}, \quad \sigma \in \Sigma,$$
where "dist" is measured with the same metric on $\Sigma$ as the one used in the definition of $F^{(\delta)}$. It is then an easy matter to check that $\{\psi_F : F \text{ closed in } \Sigma\}$ is uniformly bounded and equi-continuous on $\Sigma$. Hence, by property (iv), if $\alpha_n \Rightarrow \nu$, then $\int \psi_F\, d\alpha_n \to \int \psi_F\, d\nu$ at a rate which is independent of $F$; and, since $\chi_F \le \psi_F \le \chi_{F^{(\delta)}}$, we conclude from this that $\rho(\alpha_n,\nu) \to 0$ if $\alpha_n \Rightarrow \nu$. (We use the notation $\chi_\Gamma$ to denote the indicator (or characteristic) function of a set $\Gamma$.) We have therefore proved that $\rho$ is compatible with the weak topology on $M_1(\Sigma)$.
To prove that $\rho$ is a complete metric on $M_1(\Sigma)$, suppose that
$$\sup_{n > m} \rho(\alpha_n, \alpha_m) \to 0 \quad \text{as } m \to \infty.$$
We must show that $\{\alpha_n\}_1^\infty$ is relatively compact. To this end, let $\delta > 0$ be given and, for $\ell \in \mathbb{Z}^+$, choose $m_\ell \in \mathbb{Z}^+$ so that $\sup_{n \ge m_\ell} \rho(\alpha_n, \alpha_{m_\ell}) \le \delta/2^\ell$, and then (using property (iii)) choose $K_\ell \subset\subset \Sigma$ so that $\alpha_k(K_\ell) \ge 1 - \delta/2^\ell$ for $1 \le k \le m_\ell$. It follows that $\alpha_n\big(K_\ell^{(\delta/2^\ell)}\big) \ge 1 - \delta/2^{\ell-1}$ for all $n \in \mathbb{Z}^+$. Finally, set
$$K = \bigcap_{\ell=1}^\infty K_\ell^{(\delta/2^\ell)},$$
note that $K$ is closed and totally bounded with respect to a complete metric and is therefore compact, and check that $\alpha_n(K^c) \le 2\delta$ for all $n \in \mathbb{Z}^+$. Thus, by property (iii), $\{\alpha_n\}_1^\infty$ is indeed relatively compact. ∎
Before getting down to the main business of this section, there is one more general fact about the space $M(\Sigma)$ which it will be useful to have at our disposal. Namely, we want a good representation of $M(\Sigma)^*$.
3.2.3 Lemma. The duality relation
(3.2.4)
determines a representation of $M(\Sigma)^*$ as $C_b(\Sigma;\mathbb{R})$.
PROOF: Clearly, for each $\varphi \in C_b(\Sigma;\mathbb{R})$, $\alpha \in M(\Sigma) \mapsto \int_\Sigma \varphi\, d\alpha$ determines a unique element of $M(\Sigma)^*$. Thus, all that we have to show is that every element of $M(\Sigma)^*$ arises in this way. Let $\lambda \in M(\Sigma)^*$ be given and define $\varphi(\sigma) = \lambda(\delta_\sigma)$, $\sigma \in \Sigma$. Clearly, $\varphi$ is continuous. Moreover, because of the way in which the topology on $M(\Sigma)$ is defined, we can find a finite set $\{\psi_m\}_1^M \subseteq C_b(\Sigma;\mathbb{R})$ such that
and from this it is clear that $\varphi$ is bounded. Finally, it is obvious that $\lambda(\alpha) = \int_\Sigma \varphi\, d\alpha$ if $\alpha$ is a linear combination of point masses; and, because such $\alpha$'s are dense in $M(\Sigma)$, it follows that this equation holds for all $\alpha \in M(\Sigma)$. ∎
Returning to the problem of large deviations, let $Q \in M_1(M_1(\Sigma))$ be given and define $Q_n \in M_1(M_1(\Sigma))$ to be the distribution of
$$\boldsymbol{\nu} = (\nu_1, \dots, \nu_n) \in M_1(\Sigma)^n \longmapsto \frac{1}{n} \sum_{k=1}^n \nu_k \in M_1(\Sigma)$$
under $Q^n \in M_1(M_1(\Sigma)^n)$. By the Weak Law of Large Numbers combined with the second countability of the weak topology on $M_1(\Sigma)$, one can easily check that $Q_n \Rightarrow \delta_{\mu_Q}$, where $\mu_Q \in M_1(\Sigma)$ is defined by
$$\mu_Q = \int_{M_1(\Sigma)} \nu\, Q(d\nu).$$
Thus, it is reasonable to inquire about the large deviations of $\{Q_n : n \ge 1\}$. In fact, by the results which we proved in Section 3.1, we will know that the large deviations of $\{Q_n : n \ge 1\}$ are governed by the rate function
$$I_Q(\nu) = \Lambda_Q^*(\nu) \equiv \sup\left\{\int_\Sigma \varphi\, d\nu - \Lambda_Q(\varphi) : \varphi \in C_b(\Sigma;\mathbb{R})\right\} \tag{3.2.5}$$
for $\nu \in M_1(\Sigma)$, where
and that $I_Q$ is good, as soon as we show that $\{Q_n : n \ge 1\}$ is exponentially tight. In order to do so, we employ the following remarkably useful general observation which will serve us well not only here but also later on.
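3.2.7 Lemma. Let $\mu \in M_1(\Sigma)$ be fixed and suppose that $\{V_m\}_{m=1}^\infty$ is a bounded sequence of non-negative, measurable functions on $\Sigma$ which tends to 0 in $\mu$-measure as $m \to \infty$. Then, for each $M \in [1,\infty)$ and $\beta \in [1,\infty)$ there is a subsequence $\{V_{m_\ell}\}_{\ell=1}^\infty$ with the property that
for $0 < \epsilon \le 1$ and $L \in [1,\infty)$, whenever $\{R_\epsilon : \epsilon > 0\} \subseteq M_1(M_1(\Sigma))$ is a family which satisfies
$$\sup_{0 < \epsilon \le 1} \left(\int_{M_1(\Sigma)} \exp\left[\frac{1}{\epsilon} \int_\Sigma V\, d\nu\right] R_\epsilon(d\nu)\right)^{\epsilon} \le M \int_\Sigma \exp[\beta V]\, d\mu \tag{3.2.9}$$
for every bounded measurable $V : \Sigma \to [0,\infty]$. In particular, for each $L \in [1,\infty)$ there is a $C_L \subset\subset M_1(\Sigma)$ such that
$$R_\epsilon(C_L^c) \le \exp[-L/\epsilon], \quad 0 < \epsilon \le 1, \tag{3.2.10}$$
for any $\{R_\epsilon : \epsilon > 0\}$ satisfying (3.2.9).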
PROOF:Obviously, for any 6 > 0 and measurable V : E
-+
[0, GO) ,
for every 0 < 6 5 1 and T E (0,m). Now let .4! E Z+ be given, take S = l / Y , T = .4!(e log(2M) + 1),and choose me E Z' so that
+
One then has that
R,
({v
E
M ~ ( c :)
cS,vm, dv 2 1)) I
e x p [ - ( ~+ I ) / € ]
for 0 < E 5 1 and I E Z+; and (3.2.8) is an immediate consequence of this. To get (3.2.10), choose { K , } to be a sequence of compact subsets of C for which p ( K & ) 0 as m -+ 00. Next, apply the preceding (with V, = x K , ) to see that there is a subsequence { K m f } z for l which (3.2.10)
To see how the exponential tightness of $\{Q_n : n \ge 1\}$ follows from Lemma 3.2.7, set $\mu_Q = \int_{M_1(\Sigma)} \nu\, Q(d\nu)$ and note that
where we have used JENSEN's inequality at the final step. Obviously, exponential tightness for $\{Q_n : n \ge 1\}$ is now a trivial consequence of (3.2.10); and, as we mentioned just before Lemma 3.2.7, this is all that we needed in order to know that the $I_Q$ in (3.2.5) is good and that it governs the large deviations of $\{Q_n : n \ge 1\}$.
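A particularly interesting case of this general result is the one in which $Q$ is the distribution $\tilde{\mu}$ of $\sigma \in \Sigma \mapsto \delta_\sigma \in M_1(\Sigma)$ under some $\mu \in M_1(\Sigma)$. In this case, $Q_n$ is the distribution $\tilde{\mu}_n$ of the empirical distribution functional
$$\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_n) \in \Sigma^n \longmapsto L_n(\boldsymbol{\sigma}) = \frac{1}{n} \sum_{m=1}^n \delta_{\sigma_m} \tag{3.2.11}$$
under $\mu^n$ and the measure $\mu_Q$ introduced above coincides with $\mu$. Specializing the preceding to this case, we see that
and therefore that the large deviations of $\{\tilde{\mu}_n : n \ge 1\}$ are governed by the good rate function
$$I_{\tilde{\mu}}(\nu) = \Lambda_{\tilde{\mu}}^*(\nu) \tag{3.2.12}$$
for $\nu \in M_1(\Sigma)$. However, before stating this as a theorem, we want to develop a more tractable expression for $I_{\tilde{\mu}}$.
3.2.13 Lemma. For $\nu \in M_1(\Sigma)$, define
$$H(\nu|\mu) = \begin{cases} \int_\Sigma f \log f\, d\mu & \text{if } \nu \ll \mu \text{ and } f = \frac{d\nu}{d\mu} \\ \infty & \text{otherwise.} \end{cases} \tag{3.2.14}$$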
Then the $I_{\tilde{\mu}}$ in (3.2.12) is equal to $H(\cdot|\mu)$.
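For measures with finite support, the quantity in (3.2.14) reduces to a finite sum, since $\int f \log f\, d\mu = \sum_x \nu(x) \log\big(\nu(x)/\mu(x)\big)$ when $f = d\nu/d\mu$. The sketch below illustrates this special case only; the dictionary representation of measures and the function name `relative_entropy` are assumptions made for the example.

```python
import math


def relative_entropy(nu, mu):
    """H(nu|mu) for finitely supported probability measures.

    `nu` and `mu` map points to masses. Returns math.inf when nu is not
    absolutely continuous with respect to mu, as in the second case of the
    definition.
    """
    total = 0.0
    for x, p in nu.items():
        if p == 0.0:
            continue  # convention: 0 log 0 = 0
        q = mu.get(x, 0.0)
        if q == 0.0:
            return math.inf  # nu charges a mu-null point
        total += p * math.log(p / q)
    return total


if __name__ == "__main__":
    fair = {"heads": 0.5, "tails": 0.5}
    biased = {"heads": 0.7, "tails": 0.3}
    print(relative_entropy(biased, fair))
    print(relative_entropy(fair, fair))
    print(relative_entropy({"heads": 0.5, "edge": 0.5}, fair))
```

The three calls exercise, in order, a strictly positive value, the value 0 at $\nu = \mu$, and the value $\infty$ when absolute continuity fails.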
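PROOF: We first show that if $\nu \ll \mu$ and $\nu_\theta \equiv \theta\mu + (1-\theta)\nu$ for $\theta \in [0,1]$, then $H(\nu|\mu) = \lim_{\theta \searrow 0} H(\nu_\theta|\mu)$. To this end, set $f = \frac{d\nu}{d\mu}$ and $f_\theta = \theta + (1-\theta)f$. Since $x \in [0,\infty) \mapsto x \log x$ is convex, JENSEN's inequality says that
At the same time, since $x \in [0,\infty) \mapsto \log x$ is non-decreasing and concave, $\log f_\theta \ge (\log \theta) \vee \big((1-\theta) \log f\big)$; and therefore
After combining these two, one clearly gets the asserted convergence.
We next show that if $\nu \ll \mu$, then $I_{\tilde{\mu}}(\nu) \le H(\nu|\mu)$. In view of the preceding and the obvious fact that $\nu \mapsto \Lambda_{\tilde{\mu}}^*(\nu)$ is lower semi-continuous, we may and will assume that $f = \frac{d\nu}{d\mu} \ge \theta$ for some $\theta \in (0,1)$. In particular, by JENSEN's inequality, we then have
from which it is clear that $I_{\tilde{\mu}}(\nu) \le H(\nu|\mu)$.
As a consequence of the preceding, all that remains is to show that if $I_{\tilde{\mu}}(\nu) < \infty$, then $d\nu = f\, d\mu$ and
$$I_{\tilde{\mu}}(\nu) \ge \int_\Sigma f \log f\, d\mu. \tag{3.2.15}$$
Given $\nu$ with $I_{\tilde{\mu}}(\nu) < \infty$, one has
$$\int_\Sigma \varphi\, d\nu - \log \int_\Sigma e^{\varphi}\, d\mu \le I_{\tilde{\mu}}(\nu) \tag{3.2.16}$$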
for every bounded continuous $\varphi$. Since the class of $\varphi$'s for which (3.2.16) holds is closed under bounded point-wise convergence, (3.2.16) continues to be true for every bounded $\mathcal{B}_\Sigma$-measurable $\varphi$. In particular, we can now show that $\nu \ll \mu$. Indeed, suppose that $\Gamma \in \mathcal{B}_\Sigma$ with $\mu(\Gamma) = 0$. Then, by (3.2.16) with $\varphi = r\chi_\Gamma$, $r\nu(\Gamma) \le I_{\tilde{\mu}}(\nu)$, $r > 0$; and therefore $\nu(\Gamma) = 0$. Knowing that $\nu \ll \mu$, set $f = \frac{d\nu}{d\mu}$. If $f$ is uniformly positive and uniformly bounded, then (3.2.15) is an immediate consequence of (3.2.16) with $\varphi = \log f$. If $f$ is uniformly positive but not necessarily uniformly bounded, set $f_n = f \wedge n$ and use (3.2.16) together with FATOU's Lemma to justify
Finally, to treat the general case, define $\nu_\theta$ and $f_\theta = \theta + (1-\theta)f$ for $\theta \in [0,1]$ as in the first paragraph of this proof. By the preceding, $\int_\Sigma f_\theta \log f_\theta\, d\mu \le I_{\tilde{\mu}}(\nu_\theta)$ as long as $\theta \in (0,1)$. Moreover, since $\theta \in [0,1] \mapsto I_{\tilde{\mu}}(\nu_\theta)$ is bounded, lower semi-continuous, and convex on $[0,1]$, it is continuous there. In conjunction with the result obtained in the first paragraph, this now completes the proof. ∎
The quantity $H(\nu|\mu)$ in (3.2.14) is called the relative entropy of $\nu$ with respect to $\mu$. As we will see in the sequel, the relative entropy functional plays a central role in the theory of large deviations. We have now proved the following theorem which, at least when $\Sigma = \mathbb{R}$, was originally derived by SANOV.
3.2.17 Theorem. (SANOV) Let $\mu$ be a probability measure on the Polish space $\Sigma$ and let $\tilde{\mu}_n \in M_1(M_1(\Sigma))$ be the distribution under $\mu^n$ of the function $L_n$ in (3.2.11). Also, define $H(\cdot|\mu)$ as in (3.2.14). Then $H(\cdot|\mu)$ is a good, convex rate function on $M_1(\Sigma)$ and $\{\tilde{\mu}_n : n \ge 1\}$ satisfies the full large deviation principle with rate function $H(\cdot|\mu)$.
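SANOV's Theorem can be checked by hand in the smallest non-trivial case. The sketch below is an illustration under stated assumptions, not part of the original text: $\Sigma = \{0,1\}$, $\mu$ is the fair-coin law, and $\Gamma = \{\nu : \nu(\{1\}) \ge 0.7\}$, for which the infimum of $H(\cdot|\mu)$ over both $\Gamma$ and its interior is the relative entropy of the Bernoulli(0.7) law with respect to the Bernoulli(0.5) law. The probability $\tilde{\mu}_n(\Gamma)$ is computed exactly from binomial coefficients. The function names are hypothetical.

```python
from math import comb, log


def empirical_rate(n, threshold_num, threshold_den):
    """-(1/n) log P(L_n({1}) >= threshold) for n fair coin flips, exactly.

    The threshold is the rational number threshold_num / threshold_den.
    """
    k_min = -(-n * threshold_num // threshold_den)  # ceil(n * threshold)
    favourable = sum(comb(n, k) for k in range(k_min, n + 1))
    return (n * log(2) - log(favourable)) / n


def bernoulli_entropy(q, p):
    """Relative entropy of Bernoulli(q) with respect to Bernoulli(p)."""
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))


if __name__ == "__main__":
    target = bernoulli_entropy(0.7, 0.5)
    for n in (20, 200, 2000):
        print(n, empirical_rate(n, 7, 10), target)
```

The exact rates stay above the limiting value for every $n$ (this is the CHEBYSHEV-type exponential bound) and approach it slowly, with a correction of order $(\log n)/n$.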
Before dropping this topic, it seems appropriate to address a deficiency in the preceding result. Namely, the Weak Law of Large Numbers tells us that
for every bounded measurable $\varphi : \Sigma \to \mathbb{R}$; not just for bounded continuous $\varphi$'s. Thus, $\{\tilde{\mu}_n\}_1^\infty$ actually tends to $\delta_\mu$ in the strong topology on $M_1(\Sigma)$ and not just the weak topology. It is therefore reasonable to ask whether one cannot also develop the corresponding large deviation theory relative to the strong topology. As we are about to see, not only is it possible to do so, but it is even a rather easy exercise to transform what we already have into a statement about the strong topology.
The strong topology (or $\tau$-topology) on $M_1(\Sigma)$ is the topology $\mathcal{U}$ generated by the sets
as $\varphi$ runs over the space $B(\Sigma;\mathbb{R})$ of bounded measurable functions on $\Sigma$ into $\mathbb{R}$. Obviously, the strong topology is stronger than the weak one. In addition, it is clear that, for each $\alpha \in M_1(\Sigma)$, the sets
$$U(\alpha; A_1, \dots, A_N; \epsilon) \equiv \bigcap_{k=1}^N U(\alpha; \chi_{A_k}; \epsilon) \tag{3.2.18}$$
for $N \in \mathbb{Z}^+$, $A_1, \dots, A_N \in \mathcal{B}_\Sigma$, and $\epsilon > 0$ constitute a $\mathcal{U}$-neighborhood basis at $\alpha$. In particular, $M_1(\Sigma)$ with the strong topology is usually not even first countable! The key which will allow us to transform the SANOV Theorem to the strong topological setting is contained in the following, whose proof turns on another application of Lemma 3.2.7.
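3.2.19 Lemma. Let $\mu \in M_1(\Sigma)$ and assume that $\{R_\epsilon : \epsilon > 0\} \subseteq M_1(M_1(\Sigma))$ satisfies (3.2.9) for some $\beta, M \in [1,\infty)$ and all measurable $V : \Sigma \to [0,\infty]$. Further, assume that $\{R_\epsilon : \epsilon > 0\}$ satisfies the weak large deviation principle with the rate function $I$. Then,
$$H(\nu|\mu) \le \beta\big(I(\nu) + \log(2M)\big), \quad \nu \in M_1(\Sigma). \tag{3.2.20}$$
Moreover, given $N \in \mathbb{Z}^+$ and $A_1, \dots, A_N \in \mathcal{B}_\Sigma$, define $f : M_1(\Sigma) \to \mathbb{R}^N$ by
$$f(\nu) = \big(\nu(A_1), \dots, \nu(A_N)\big), \quad \nu \in M_1(\Sigma).$$
Then $f$ is measurable and
$$-\inf\{I(\nu) : f(\nu) \in \Delta^\circ\} \le \varliminf_{\epsilon \to 0} \epsilon \log\big(R_\epsilon(f^{-1}(\Delta))\big) \le \varlimsup_{\epsilon \to 0} \epsilon \log\big(R_\epsilon(f^{-1}(\Delta))\big) \le -\inf\{I(\nu) : f(\nu) \in \bar{\Delta}\}$$
for every $\Delta \in \mathcal{B}_{\mathbb{R}^N}$.
PROOF: First observe that, by Lemmas 2.1.5 and 3.2.7, our hypotheses imply that $I$ must be a good rate function and that $\{R_\epsilon : \epsilon > 0\}$ satisfies the full large deviation principle with rate $I$.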
To prove (3.2.20), use Theorem 2.1.10 and (3.2.9) to obtain
for all $\nu \in M_1(\Sigma)$ and $V \in C_b(\Sigma;\mathbb{R})$; and so, for each $\nu \in M_1(\Sigma)$,
for every bounded measurable $V : \Sigma \to [0,\infty)$. In particular, just as in the proof of Lemma 3.2.13, $I(\nu) = \infty$ if $\nu$ is not absolutely continuous with respect to $\mu$. On the other hand, if $\nu \ll \mu$ and $f = \frac{d\nu}{d\mu}$, take $V_n = \log(1 + f \wedge n)$ and conclude that
$$I(\nu) \ge \frac{1}{\beta} \int_\Sigma \log(1 + f \wedge n)\, d\nu - \log(2M).$$
Finally, let $n \to \infty$ and thereby arrive at (3.2.20).
We now turn to the proof of the second assertion. In view of Lemma 2.1.4, all that we have to do is produce a sequence $\{f_\ell\}_1^\infty \subseteq C(M_1(\Sigma);\mathbb{R}^N)$ with the properties that
and
To this end, choose for each $1 \le k \le N$ a sequence
in $\mu$-measure as $m \to \infty$. Next, apply Lemma 3.2.7 to find a subsequence $\{V_{m_\ell}\}_{\ell=1}^\infty$ for which
for $0 < \epsilon \le 1$ and $L \in [1,\infty)$. Finally, set
Clearly, $f_\ell \in C(M_1(\Sigma);\mathbb{R}^N)$ for each $\ell \in \mathbb{Z}^+$. Moreover, since (cf. part (i) of Exercise 3.2.23 below) $\{\nu : H(\nu|\mu) \le K\}$ is uniformly absolutely continuous with respect to $\mu$ for every $K \in (0,\infty)$, one sees from (3.2.20) that $\int_\Sigma V_m\, d\nu \to 0$ uniformly over $\nu$'s with $I(\nu) \le L$; and therefore $f_\ell(\nu) \to f(\nu)$ uniformly for $\nu$'s in such sets. It is therefore clear that these $f_\ell$'s will serve. ∎
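3.2.21 Theorem. Let $\{R_\epsilon : \epsilon > 0\}$ and $I$ be as in Lemma 3.2.19 above. For each $L \ge 0$ the set $\{\nu \in M_1(\Sigma) : I(\nu) \le L\}$ is compact in the strong topology. Moreover, if $\Gamma$ is a measurable subset of $M_1(\Sigma)$ and $\Gamma^{\circ\tau}$ and $\bar{\Gamma}^{\tau}$ denote, respectively, the interior and closure of $\Gamma$ in the strong topology, then
$$-\inf_{\nu \in \Gamma^{\circ\tau}} I(\nu) \le \varliminf_{\epsilon \to 0} \epsilon \log\big(R_\epsilon(\Gamma)\big) \le \varlimsup_{\epsilon \to 0} \epsilon \log\big(R_\epsilon(\Gamma)\big) \le -\inf_{\nu \in \bar{\Gamma}^{\tau}} I(\nu).$$
In particular, for any $\mu \in M_1(\Sigma)$ and all $\Gamma \in \mathcal{B}_{M_1(\Sigma)}$,
PROOF: To see that $\{\nu : I(\nu) \le L\}$ is strongly compact, simply observe that the strong and weak topologies coincide on subsets of $M_1(\Sigma)$ consisting of measures which are uniformly absolutely continuous with respect to a fixed element of $M_1(\Sigma)$ and use (3.2.20) together with part (i) of Exercise 3.2.23 below.
To prove the lower bound, note that if $\alpha \in \Gamma^{\circ\tau}$, then there exist $N \in \mathbb{Z}^+$, $A_1, \dots, A_N \in \mathcal{B}_\Sigma$,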
and an open set $\Delta$ in $\mathbb{R}^N$ such that
$$\alpha \in \big\{\nu : \big(\nu(A_1), \dots, \nu(A_N)\big) \in \Delta\big\} \subseteq \Gamma;$$
and therefore Lemma 3.2.19 shows that
To prove the upper bound, let $\Gamma \in \mathcal{B}_{M_1(\Sigma)}$ be given and suppose that $F \supseteq \Gamma$ is a strongly closed subset of $M_1(\Sigma)$. Using $\mathcal{P}$ to denote the collection of all finite partitions $P$ of $\Sigma$ into measurable sets and given $P = \{A_1, \dots, A_N\} \in \mathcal{P}$, let $\Delta(P)$ be the closure in $\mathbb{R}^N$ of
and set $F(P) = \big\{\nu : \big(\nu(A_1), \dots, \nu(A_N)\big) \in \Delta(P)\big\}$. Then, by Lemma 3.2.19, we know that
$$\varlimsup_{\epsilon \to 0} \epsilon \log\big(R_\epsilon(\Gamma)\big) \le \varlimsup_{\epsilon \to 0} \epsilon \log\big[R_\epsilon\big(F(P)\big)\big] \le -\inf_{\nu \in F(P)} I(\nu).$$
Thus, we will have the upper bound once we show that
$$\inf_{\nu \in F} I(\nu) = \sup_{P \in \mathcal{P}} \inf_{\nu \in F(P)} I(\nu).$$
In proving this, we may and will assume that there is an $L \in (0,\infty)$ such that
$$\inf_{\nu \in F(P)} I(\nu) \le L \quad \text{for every } P \in \mathcal{P}.$$
Noting that (because the level sets of $I$ are strongly compact) there is, for each $P \in \mathcal{P}$, a $\nu_P \in F(P)$ such that $I(\nu_P) = \inf_{\nu \in F(P)} I(\nu)$ and using the fact that $\{\nu_P : P \in \mathcal{P}\} \subseteq \{\nu : I(\nu) \le L\}$, choose any subnet $\{\nu_P : P \in \mathcal{P}'\}$ of $\{\nu_P : P \in \mathcal{P}\}$ (think of $\mathcal{P}$ being partially ordered by "refinement") so that $\{\nu_P : P \in \mathcal{P}'\}$ converges strongly to some $\alpha$. Clearly, all that we have to do is check that $\alpha \in F$. To this end, let $N \in \mathbb{Z}^+$, $A_1, \dots, A_N \in \mathcal{B}_\Sigma$, and $\delta > 0$ be given; and denote by $U$ the corresponding set defined by (3.2.18). Next, choose $P \in \mathcal{P}'$ so that $\{A_1, \dots, A_N\}$ is contained in the algebra over $\Sigma$ which is generated by $P$ and Σ_{A∈P} |ν_P(A) − α(
3.2.22 Exercise.
(i) On the basis of the reader's previous information about the weak topology, derive the four properties listed at the beginning of this section.
(ii) Suppose that $\{(\Sigma_n, \rho_n)\}_1^\infty$ is a sequence of complete separable metric spaces and that, for each $n \in \mathbb{Z}^+$, $\pi_{n+1,n} : \Sigma_{n+1} \to \Sigma_n$ is a mapping with the property that
Let $(\Sigma, \rho)$ be the projective limit of $\{(\Sigma_n, \pi_{n+1,n}, \rho_n) : n \in \mathbb{Z}^+\}$, and show that $M_1(\Sigma)$ is homeomorphic to the projective limit of $\{M_1(\Sigma_n)\}_1^\infty$ when $\mu_{n+1} \mapsto \mu_{n+1} \circ \pi_{n+1,n}^{-1}$ is the mapping from $M_1(\Sigma_{n+1})$ into $M_1(\Sigma_n)$ for each $n \in \mathbb{Z}^+$.
Hint: The only difficult step here is to check that if $\mu_n \in M_1(\Sigma_n)$ and $\mu_n = \mu_{n+1} \circ \pi_{n+1,n}^{-1}$ for each $n \in \mathbb{Z}^+$, then there is a $\mu \in M_1(\Sigma)$ such that $\mu_n = \mu \circ \pi_n^{-1}$, $n \in \mathbb{Z}^+$. However, the existence of such a $\mu$ can be seen as a consequence of the KOLMOGOROV Extension Theorem as presented in Theorem 1.1.10 of [104].
3.2.23 Exercise.
In this exercise we outline an approach to the SANOV Theorem which avoids the estimate in Lemma 3.2.7 and resembles, to a much greater extent, the ideas behind our proof of the classical CRAMÉR Theorem. Even though the proof avoids Lemma 3.2.7, it nonetheless uses the equation $\Lambda_{\tilde{\mu}}^* = H(\cdot|\mu)$ provided by Lemma 3.2.13.
(i) It turns out (cf. Corollary 5.1.11) that both the goodness of $\Lambda_{\tilde{\mu}}^*$ as well as the upper bound (in terms of $\Lambda_{\tilde{\mu}}^*$) for all closed sets $C \subseteq M_1(\Sigma)$ follow once one knows that $\Lambda_{\tilde{\mu}}$ has the property: for each $M \in (0,\infty)$, there is a $K(M) \subset\subset \Sigma$ such that $\Lambda_{\tilde{\mu}}(V) \le 1$ whenever $V \in C_b(\Sigma;\mathbb{R})$ vanishes on $K(M)$ and is bounded by $M$. Prove that $\Lambda_{\tilde{\mu}}$ has this property. Although it is true that the goodness of $\Lambda_{\tilde{\mu}}^*$ follows from the general principles which will be derived in Section 5.1, it is actually very easy to check that the level sets $\{\nu : H(\nu|\mu) \le L\}$ are not only weakly but even strongly compact. Indeed, it is clear that the set $\{f \in L^1(\mu)^+ : \int_\Sigma f \log f\, d\mu \le L\}$ is uniformly $\mu$-integrable and is therefore weakly compact as a subset of $L^1(\mu)$. Since weak convergence in $L^1(\mu)$ gives rise to strong convergence in $M(\Sigma)$ of the associated measures, it follows that $\{\nu : H(\nu|\mu) \le L\}$ is strongly compact.
(ii) We now outline a proof of the lower bound in Theorem 3.2.21 which is based on the same principle as the one used to prove the lower bound in the classical CRAMÉR Theorem. Since this lower bound is better than the lower bound in SANOV's Theorem, this will complete the program for this exercise. Let $H(\nu|\mu) < \infty$ and suppose that $G \in \mathcal{B}_{M_1(\Sigma)}$ is a strongly open neighborhood of $\nu$. Set $f = \frac{d\nu}{d\mu}$, define $F_n(\boldsymbol{\sigma}) = \prod_{m=1}^n f(\sigma_m)$ for $\boldsymbol{\sigma} \in \Sigma^n$, and let $A_n = \{\boldsymbol{\sigma} \in \Sigma^n : L_n(\boldsymbol{\sigma}) \in G \text{ and } F_n(\boldsymbol{\sigma}) > 0\}$. Using the Law of Large Numbers, check that $\nu^n(A_n) \to 1$ as $n \to \infty$. Next, using JENSEN's inequality and the fact that $x \log x \ge -e^{-1}$, $x \in [0,\infty)$, verify the following steps:
as long as $\nu^n(A_n) > 0$. Combining this with $\nu^n(A_n) \to 1$, arrive at the lower bound in Theorem 3.2.21.
3.2.24 Exercise.
Prove that
$$\|\nu - \mu\|_{\mathrm{var}}^2 \le 2H(\nu|\mu), \quad \mu, \nu \in M_1(\Sigma). \tag{3.2.25}$$
A proof of (3.2.25) can be based on the observation that $3(x-1)^2 \le (4 + 2x)(x \log x - x + 1)$, $x \in [0,\infty)$, the fact that $\|\nu - \mu\|_{\mathrm{var}} = \|f - 1\|_{L^1(\mu)}$ if $\nu \ll \mu$ and $f = \frac{d\nu}{d\mu}$, and SCHWARZ's inequality.
3.2.26 Exercise.
For $R \in M_1(M_1(\Sigma))$, define $\mu_R \in M_1(\Sigma)$ by
and show that $R \in M_1(M_1(\Sigma)) \mapsto \mu_R \in M_1(\Sigma)$ is a continuous mapping. Next, let $Q \in M_1(M_1(\Sigma))$ be given, and show that
$$I_Q(\nu) = \Lambda_Q^*(\nu) = \inf\{H(R|Q) : \mu_R = \nu\}, \quad \nu \in M_1(\Sigma).$$
(Hint: Either use the variational formula for relative entropy, or combine the results of the present section with Lemma 2.1.4.) Finally, apply JENSEN's inequality to check that
Conclude that
and therefore that $\nu \ll \mu_Q$ if $I_Q(\nu) < \infty$.
3.2.28 Exercise.
Let $\{Q_\epsilon : \epsilon > 0\}$ be a family of probability measures on $M_1(\Sigma)$ and $I : M_1(\Sigma) \to [0,\infty]$ a rate function with the property that $\{\nu : I(\nu) \le L\}$ is compact in the strong topology for each $L > 0$. Assume, in addition, that
$$-\inf_{\Gamma^{\circ\tau}} I \le \varliminf_{\epsilon \to 0} \epsilon \log\big(Q_\epsilon(\Gamma)\big) \le \varlimsup_{\epsilon \to 0} \epsilon \log\big(Q_\epsilon(\Gamma)\big) \le -\inf_{\bar{\Gamma}^{\tau}} I$$
(see Theorem 3.2.21 for the notation here) for all $\Gamma \in \mathcal{B}_{M_1(\Sigma)}$. Given a $\mathcal{B}_{M_1(\Sigma)}$-measurable function $\Phi : M_1(\Sigma) \to \mathbb{R}$ which satisfies
for some $\alpha \in (1,\infty)$, show that
if $\Phi$ is continuous with respect to the strong topology.
3.3 Cramér's Theorem for Banach Spaces
In this section, we will be assuming that $E = X$ is a separable real BANACH space with norm $\|\cdot\|_E$, and we will be attempting to prove the analogue of CRAMÉR's Theorem (cf. Theorem 1.2.6) in this setting. That is, we want to show that if $\mu$ is a probability measure on $(E, \mathcal{B}_E)$ for which
(3.3.1)
and if $\mu_n$ denotes the distribution of $\bar{S}_n$ under $\mu^n$ (cf. the second paragraph of Section 3.1), then $\{\mu_n : n \ge 1\}$ satisfies the full large deviation principle with rate function $\Lambda_\mu^*$, where
and
However, before getting into the details, it might be helpful to know what it is that "non-deviant behavior" means in this situation. For this reason, we begin with a proof of the Strong Law of Large Numbers for $E$-valued random variables.
3.3.4 Theorem. (RANGA RAO) Let $\{X_n\}_1^\infty$ be a sequence of independent, identically distributed, $E$-valued random variables on some probability space $(\Omega, \mathcal{M}, P)$; and let $\mu$ be the distribution of $X_1$. If $\int_E \|x\|_E\, \mu(dx) < \infty$, then there is an $m(\mu) \in E$ such that
$$\frac{1}{n} \sum_{k=1}^n X_k(\omega) \to m(\mu)$$
for $P$-almost every $\omega \in \Omega$. Moreover, $m(\mu)$ is the unique element $y$ of $E$ with the property that
$${}_{E^*}\langle \lambda, y \rangle_E = \int_E {}_{E^*}\langle \lambda, x \rangle_E\, \mu(dx), \quad \lambda \in E^*. \tag{3.3.5}$$
PROOF:First, by KOLMOGOROV’S Zero-One Law, if X,(w)has a limit for P-almost every w E s2, then that limit is, P-almost surely, independent of w E R. Knowing this and using the Classical Strong Law of Large Numbers, one can easily check that, if it exists, then the P-almost
III General Cram& Theory
79
xy
sure limit of xi, must satisfy (3.3.5). Hence, all that we have to do is show that convergence takes place. Note that the Classical Strong Law together with the second countability of the weak topology on M1(E) lead to the conclusion that
( 3.3.6) for P-almost every w E R. As we are about to see, this turns to be surprisingly close to the result which we are seeking. Indeed, suppose7 for the moment, that there is an R E (0,m) such that IIXn(w)lI~5 R for all n E Z+ and w E R. For A E E*, define * g r ’ ( x ) = q R ( x ) E * ( A 7 r )El
(3.3.7)
xEE
where QR E C ( E ;[0,1]) satisfies
{gy’
Then : IIAIIE. 5 l} is uniformly bounded and equi-continuous. Hence, by property iv) of the weak topology (cf. the first paragraph of Section 3.2) and (3.3.6)
and therefore
for P-almost every w E 0. Thus, for P-almost every w E R,
is a CAUCHY sequence in E ; and, as we pointed out in the preceding paragraph, that is all that we need. In other words, our result has been proved in the case of bounded random variables. To handle the general case, define, for R E (0,oo),
xLR’( w) = x 10,R] (11x n ( w )11E ) x n ( w )
Large Deviations
80
and
Y,‘R’(w) = IlX,(w) - Xy)(W)(IE’ w E a.
Given E > 0, choose R E ( 0 , ~so) that
together with the Clasand use the preceding result applied to {XLR’}f” sical Strong Law applied to {Y,(“)}: to conclude that
xy
X k ( w ) } r is CAUCHY in E for P-almost Since this clearly shows that { every w E 0, the proof is complete. I The quantity m ( p ) in (3.3.5) is called the mean of the measure p. Obviously, a consequence of Theorem 3.3.4 is the fact that pn 6m(p). In order to study the large deviations from this, we again want to use Corollary 3.1.7. Thus, we must show that there are compact K L Is for which (3.1.8) holds. Actually, most of the work required for their construction has already been done in Lemma 3.2.7; all that we need in addition is the following relatively simple observation.
3.3.8 Lemma. If $\Gamma\subseteq M_1(E)$ satisfies

then $\nu\in\Gamma\longmapsto m(\nu)$ is a continuous map. In particular, if $f:E\longrightarrow[0,\infty]$ is a lower semi-continuous function satisfying

and if

then, for each $L\in[0,\infty)$, $\Gamma(f,L)$ is closed and $\nu\in\Gamma(f,L)\longmapsto m(\nu)$ is continuous. Thus,
(3.3.9)
is a measurable subset of $M_1(E)$ on which $\nu\longmapsto m(\nu)$ is a measurable mapping.
PROOF: The second assertion is just an application of the first assertion once one notes that, for $\nu\in\Gamma(f,L)$,

Thus, we need only prove the first assertion. Moreover, since $\nu\longmapsto\int_{\{\|x\|_E>R\}}\|x\|_E\,\nu(dx)$ is lower semi-continuous for each $R\in(0,\infty)$, we may and will assume from the outset that the set $\Gamma$ is closed in $M_1(E)$. Now, suppose that $\{\nu_n\}_1^\infty\subseteq\Gamma$ and that $\nu_n\Longrightarrow\nu$. Then

where the functions $g_\lambda^{(R)}$ are the ones defined in (3.3.7). Hence, for each $R\in(0,\infty)$,

from which the desired result is clear. ∎

The preceding enables us to prove the following variant of Lemma 3.2.7.
3.3.10 Lemma. Let $\mu\in M_1(E)$ be given, assume that

Then $f(0)=0$; $x\in E\longmapsto f(x)\in[0,\infty]$ is lower semi-continuous;
$$\lim_{\|x\|_E\to\infty}\frac{f(x)}{\|x\|_E}=\infty;$$
and
$$\int_E\exp[\delta f]\,d\mu\le\frac{1}{1-\delta},\quad\delta\in[0,1).$$
Finally, if $\{R_\epsilon:\epsilon>0\}\subseteq M_1(M_1(E))$ satisfies

for some $\rho,M\in[1,\infty)$ and all measurable $V:E\longrightarrow[0,\infty]$, then for each $L\in[1,\infty)$ there is a $K_L\subset\subset E$ (whose choice depends only on $\mu$, $\rho$, and $M$ and is otherwise independent of $\{R_\epsilon:\epsilon>0\}$) such that
$$\mu_\epsilon(K_L^c)\le\exp[-L/\epsilon],\quad0<\epsilon\le1,$$
where $\mu_\epsilon\in M_1(E)$ denotes the distribution of $\nu\in\Gamma(f)\longmapsto m(\nu)\in E$ under $R_\epsilon$ and $\Gamma(f)$ is the set in (3.3.9).

PROOF: Define
$$h(T)=\sup\{\alpha T-g(\alpha):\alpha\in[0,\infty)\},\quad T\in[0,\infty).$$
It is then clear that $h(0)=0$ and that $T\in[0,\infty)\longmapsto h(T)\in[0,\infty]$ is a lower semi-continuous, non-decreasing function for which
$$\lim_{T\to\infty}\frac{h(T)}{T}=\infty.$$
Furthermore, just as in the proof of Lemma 1.2.5 (only even more easily), one sees that
$$\mu(\{x\in E:\|x\|_E\ge T\})\le e^{-h(T)},\quad T\in[0,\infty);$$
from which it is easy to show that

Now let $\{R_\epsilon:\epsilon>0\}$ satisfying our hypothesis be given. By Lemma 3.2.7, we know that, for each $L\in[1,\infty)$, there is a $C_L\subset\subset M_1(E)$ (depending only on $\mu$, $\rho$, and $M$) such that
$$R_\epsilon(C_L^c)\le e^{-L/\epsilon},\quad0<\epsilon\le1.$$
In addition, since

we have, when $T_L=2\rho(L+\log(4M))$, that
$$R_\epsilon\bigl(\Gamma(f,T_L)^c\bigr)\le e^{-T_L/(2\rho\epsilon)}\int_{M_1(E)}\exp\Bigl[\frac{1}{2\rho\epsilon}\int_E f\,d\nu\Bigr]R_\epsilon(d\nu)\le\frac{e^{-L/\epsilon}}{2}$$
for all $L\in[1,\infty)$ and $0<\epsilon\le1$. Hence, if

then, as the continuous image of a compact set, $K_L\subset\subset E$ and

for all $L\in[1,\infty)$ and $0<\epsilon\le1$. ∎

We now have all the machinery which we need to prove the following extension of the CRAMÉR Theorem.
3.3.11 Theorem. (DONSKER & VARADHAN) Assume that (3.3.1) holds. Then, for each $L\in[0,\infty)$, there is a $K_L\subset\subset E$ such that
$$\mu_n(K_L^c)\le e^{-nL},\quad n\in\mathbb{Z}^+\text{ and }L\in[1,\infty).$$
In particular, $\{\mu_n:n\in\mathbb{Z}^+\}$ is exponentially tight and the function $\Lambda_\mu^*$ in (3.3.2) is a good rate function which governs the large deviations of $\{\mu_n:n\ge1\}$.

PROOF: In view of Corollary 3.1.7, all that we need to check is the first assertion. To this end, define $\tilde\mu_n\in M_1(M_1(E))$, as in Section 3.2, to be the distribution of
$$L_n=\frac{1}{n}\sum_{m=1}^{n}\delta_{x_m}$$
under $\mu^n$. It is then clear that the measures $R_{1/n}\equiv\tilde\mu_n$ satisfy the hypotheses of Lemma 3.3.10 with respect to $\mu$; and therefore the existence of the required $K_L$'s is an immediate consequence of that lemma. ∎
3.3.12 Exercise.

It is amusing and instructive to prove the large deviation result of this section as a corollary of the SANOV Theorem. To this end, let $f:E\longrightarrow[0,\infty]$ be the function in the proof of Lemma 3.3.10 and define the sets $\Gamma(f,L)$, $L\ge0$, accordingly. Noting that the mean value functional $m$ is continuous on each $\Gamma(f,L)$, use part (i) of Exercise 2.1.20 together with the SANOV Theorem to show that $I:E\longrightarrow[0,\infty]$ given by

is a good rate function on $E$ and that it governs the large deviations of $\{\mu_n:n\ge1\}$. Conclude, in particular, that $I=\Lambda_\mu^*$.
A direct proof of the equality in (3.3.13) is not easy, even in special cases. For example, suppose that $E=\Theta$ and that $\mu$ is WIENER'S measure $W$ as in Section 1.3. As we pointed out in the discussion (1.3.7), the large deviation principle proved in SCHILDER'S Theorem can be thought of as an example of CRAMÉR'S Theorem; and so (3.3.13) gives us another variational formula for $I_W=\Lambda_W^*$. To give a direct proof of (3.3.13) in this case, one can use the fact that $P\in M_1(\Theta)$ is absolutely continuous with respect to $W$ if and only if there is a $\{\mathcal{B}_t:t\in[0,\infty)\}$-progressively measurable map $b:[0,\infty)\times\Theta\longrightarrow\mathbb{R}^d$ such that
$$\int_0^\infty|b(t,\theta)|^2\,dt<\infty\quad(\text{a.s., }W)$$
and
$$P(d\theta)=\exp\Bigl[\int_0^\infty\bigl(b(s,\theta),d\theta(s)\bigr)_{\mathbb{R}^d}-\frac{1}{2}\int_0^\infty|b(s,\theta)|^2\,ds\Bigr]W(d\theta);$$
in which case the distribution of
$$\theta\in\Theta\longmapsto\theta-\int_0^{\cdot}b(s,\theta)\,ds$$
under $P$ is $W$. Using this, one can check that, when $P\ll W$,

Finally, for a given $\psi\in H^1$, show that, among $P\in M_1(\Theta)$ satisfying $\int_\Theta\|\theta\|_\Theta\,P(d\theta)<\infty$ and $m(P)=\psi$, the one which minimizes $H(P|W)$ is $W_\psi$.
3.4 Large Deviations for Gaussian Measures
In this section, we will generalize SCHILDER’S Theorem to cover all centered Gaussian measures on a separable, real BANACH space. That is, we will be assuming that $(E,\|\cdot\|_E)$ is a separable, real BANACH space and that $\mu$ is a probability measure on $(E,\mathcal{B}_E)$ with the property that

for some symmetric, bilinear map $Q_\mu:E^*\times E^*\longrightarrow[0,\infty)$. The bilinear map $Q_\mu$ is called the covariance of $\mu$. At least when $E$ is infinite dimensional, the archetype for this sort of measure is WIENER’S measure $W$ on the space $\Theta$ (cf. Section 1.3), in which case

and SCHILDER’S Theorem gave us a large deviation principle for the family $\{W_\epsilon:\epsilon>0\}$. Our aim here will be to come as close as possible to duplicating SCHILDER’S result in general.

3.4.2 Lemma. There exists an $\alpha\in(0,\infty)$ such that
$$\int_E\exp\bigl[\alpha\|x\|_E^2\bigr]\,\mu(dx)<\infty. \tag{3.4.3}$$
In particular,
$$2\Lambda_\mu(\lambda)=Q_\mu(\lambda,\lambda)=\int_E{}_{E^*}\langle\lambda,x\rangle_E^2\,\mu(dx)\le B\|\lambda\|_{E^*}^2 \tag{3.4.4}$$
for $\lambda\in E^*$, where $B=\int_E\|x\|_E^2\,\mu(dx)\in(0,\infty)$. Finally, $\Lambda_\mu^*$ is a good rate function; and, for all $x\in E$ and $t\in\mathbb{R}$, $\Lambda_\mu^*(tx)=t^2\Lambda_\mu^*(x)$. (See Exercise 3.4.15 below for a little more information.)
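To make the homogeneity statement concrete, here is a minimal numerical sketch in the simplest case $E=\mathbb{R}$ with $\mu=N(0,\sigma^2)$, where (3.4.4) gives $\Lambda_\mu(\lambda)=\sigma^2\lambda^2/2$. It assumes that $\Lambda_\mu^*$ is the Legendre transform $\sup_\lambda[\lambda x-\Lambda_\mu(\lambda)]$; the function names, the search bracket, and the values of $\sigma$, $x$, and $t$ are hypothetical choices made only for illustration.

```python
def log_mgf(lam, sigma):
    """Lambda_mu(lam) = sigma^2 * lam^2 / 2 for mu = N(0, sigma^2), cf. (3.4.4)."""
    return 0.5 * sigma * sigma * lam * lam


def legendre(x, sigma, lo=-1e3, hi=1e3, iters=200):
    """Approximate sup over lam of lam*x - Lambda_mu(lam) by ternary search.

    The objective is strictly concave in lam, so ternary search applies.
    Assumption: the bracket [lo, hi] contains the maximizer x / sigma^2.
    """
    def objective(lam):
        return lam * x - log_mgf(lam, sigma)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1   # the maximizer lies to the right of m1
        else:
            hi = m2   # the maximizer lies to the left of m2
    return objective(0.5 * (lo + hi))


sigma = 1.5          # hypothetical standard deviation
x, t = 0.8, -3.0     # hypothetical point and scaling factor

rate_x = legendre(x, sigma)
rate_tx = legendre(t * x, sigma)
closed_form = x * x / (2.0 * sigma * sigma)   # x^2 / (2 sigma^2)
```

In this one-dimensional case the numerical supremum agrees with $x^2/(2\sigma^2)$, and scaling $x$ by $t$ multiplies the rate by $t^2$, which is the homogeneity asserted in the lemma.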
PROOF: The existence of an $\alpha>0$ for which (3.4.3) holds is a consequence of FERNIQUE’S Theorem (cf. Theorem 1.3.24). Furthermore, the equalities in (3.4.4) are all obtained from consideration of the $\mathbb{R}$-valued, centered Gaussian random variable $x\in E\longmapsto{}_{E^*}\langle\lambda,x\rangle_E$; the inequality is trivial; and the finiteness of $B$ follows trivially from (3.4.3). Finally, given (3.4.3), the fact that $\Lambda_\mu^*$ is a good rate function is covered in the statement of Theorem 3.3.11; and the homogeneity of $\Lambda_\mu^*$ is an immediate consequence of the homogeneity of $\Lambda_\mu$. ∎

Following the pattern in SCHILDER’S Theorem, we now define $\mu_\epsilon$ to be the distribution of $x\in E\longmapsto\epsilon^{1/2}x\in E$ under $\mu$; and, as a first approximation to his result, we present the following.
3.4.5 Theorem. The family $\{\mu_\epsilon:\epsilon>0\}$ satisfies the full large deviation principle with the good rate function $\Lambda_\mu^*$.
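The bookkeeping with $n(\epsilon)=[1/\epsilon]\vee1$ and $\gamma(\epsilon)=\epsilon\,n(\epsilon)$ used in the proof of Theorem 3.4.5 can be checked mechanically. The following sketch reads $[\,\cdot\,]$ as the integer part and uses exact rational arithmetic; the helper name `discretize` and the sample values of $\epsilon$ are hypothetical.

```python
import math
from fractions import Fraction


def discretize(eps):
    """Return (n, gamma) with n = max(floor(1/eps), 1) and gamma = eps * n."""
    n = max(math.floor(1 / eps), 1)
    return n, eps * n


# Hypothetical sample values of eps in (0, 1), kept as exact fractions.
samples = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 7),
           Fraction(2, 9), Fraction(1, 1000)]

for eps in samples:
    n, gamma = discretize(eps)
    # gamma / n recovers eps exactly, which is why rescaling mu_{1/n}
    # by gamma^{1/2} produces mu_eps.
    assert gamma / n == eps
    # gamma stays in [1 - eps, 1] for 0 < eps < 1.
    assert 1 - eps <= gamma <= 1
```

Since $\gamma(\epsilon)\in[1-\epsilon,1]$, the rescaling factor $\gamma(\epsilon)^{1/2}$ tends to $1$ as $\epsilon\to0$, which is what allows one to pass from the sequence $\{\mu_{1/n}\}$ to the full family $\{\mu_\epsilon\}$.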
PROOF: We have already pointed out that, as a consequence of Theorem 3.3.11, $\Lambda_\mu^*$ is a good rate function. Furthermore, since $\mu_{1/n}$ is the distribution under $\mu^n$ of $x\in E^n\longmapsto\frac{1}{n}\sum_{1}^{n}x_k$ (i.e., $\mu_{1/n}$ here is the same measure as the one which we denoted by $\mu_n$ in Section 3.3), Theorem 3.3.11 allows us to also conclude from (3.4.3) that $\{\mu_{1/n}:n\ge1\}$ satisfies the full large deviation principle with rate $\Lambda_\mu^*$. In order to pass from this statement to the desired one, set $n(\epsilon)=[1/\epsilon]\vee1$ and $\gamma(\epsilon)=\epsilon\,n(\epsilon)$ for $\epsilon>0$. It is then clear that $x\longmapsto\gamma(\epsilon)^{1/2}x$ under $\mu_{1/n(\epsilon)}$ has distribution $\mu_\epsilon$ and that $\gamma(\epsilon)\in[1-\epsilon,1]$ for $0<\epsilon<1$. Now suppose that $F$ is a closed subset of $E$ and set $\tilde F=\{\gamma^{-1/2}x:\gamma\in[\tfrac{1}{2},1]\text{ and }x\in F\}$. Then $\tilde F$ is also closed, and so

Since

this proves the upper bound in the large deviation principle. To prove the lower bound, let $G$ be an open set in $E$ and suppose that $x\in G$. Then we can find an open neighborhood $U$ of $x$ and an $\epsilon_0\in(0,1/2]$ such that $U\subseteq\gamma(\epsilon)^{-1/2}G$ for all $0<\epsilon<\epsilon_0$. Hence
$$\liminf_{\epsilon\to0}\epsilon\log\bigl(\mu_\epsilon(G)\bigr)=\liminf_{\epsilon\to0}\frac{\gamma(\epsilon)}{n(\epsilon)}\log\bigl(\mu_{1/n(\epsilon)}(\gamma(\epsilon)^{-1/2}G)\bigr)\ge\liminf_{n\to\infty}\frac{1}{n}\log\bigl(\mu_{1/n}(U)\bigr)\ge-\inf_U\Lambda_\mu^*\ge-\Lambda_\mu^*(x).$$
Thus, the lower bound is also proved. ∎

As a dividend of Theorem 3.4.5, we get the following sharpening of the estimate in (3.4.3).

3.4.6 Corollary. (DONSKER & VARADHAN) Set
$$a=\inf\{\Lambda_\mu^*(x):\|x\|_E=1\}\quad\text{and}\quad b=\sup\{Q_\mu(\lambda,\lambda):\|\lambda\|_{E^*}=1\}. \tag{3.4.7}$$
Then
$$\lim_{R\to\infty}\frac{1}{R^2}\log\bigl[\mu(\{x\in E:\|x\|_E\ge R\})\bigr]=-a=-\frac{1}{2b}. \tag{3.4.8}$$
In particular, (3.4.3) holds for $\alpha\in(0,a)$, and it fails for $\alpha\in(a,\infty)$.
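PROOF: It suffices to prove (3.4.8). To prove the first equality, set $B=B(0,1)$ and note that
$$\inf_{B^c}\Lambda_\mu^*=\inf_{r\ge1}\inf_{x\in\partial B}\Lambda_\mu^*(rx)=a\inf_{r\ge1}r^2=a$$
and similarly that $\inf_{\overline{B}^c}\Lambda_\mu^*=a$. Hence, by Theorem 3.4.5, we see that
$$\lim_{R\to\infty}\frac{1}{R^2}\log\bigl[\mu(B(0,R)^c)\bigr]=\lim_{\epsilon\to0}\epsilon\log\bigl[\mu_\epsilon(B^c)\bigr]=-\inf_{B^c}\Lambda_\mu^*=-a.$$
To prove the second equality in (3.4.8), first observe that
$$\Lambda_\mu^*(x)=\sup\Bigl\{\xi\,{}_{E^*}\langle\lambda,x\rangle_E-\frac{\xi^2}{2}Q_\mu(\lambda,\lambda):\xi\in\mathbb{R}\text{ and }\|\lambda\|_{E^*}=1\Bigr\}=\sup\bigl\{\Lambda_{\mu_\lambda}^*\bigl({}_{E^*}\langle\lambda,x\rangle_E\bigr):\|\lambda\|_{E^*}=1\bigr\},$$
where $\mu_\lambda$ denotes the distribution under $\mu$ of $x\in E\longmapsto{}_{E^*}\langle\lambda,x\rangle_E\in\mathbb{R}$, and we have used (iv) of Exercise 1.2.11 to see that
$$\Lambda_{\mu_\lambda}^*(\xi)=\frac{\xi^2}{2Q_\mu(\lambda,\lambda)}. \tag{3.4.9}$$
In particular, if $\|x\|_E=1$, then

To prove the opposite inequality, suppose $\|\lambda\|_{E^*}=1$ and note that

where we have applied the first equality in (3.4.8) to both $\mu$ and to $\mu_\lambda$, and we again used (3.4.9) to get the final equality. Since this shows that $a\le\frac{1}{2Q_\mu(\lambda,\lambda)}$ whenever $\|\lambda\|_{E^*}=1$, we have now shown that $a=\frac{1}{2b}$.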
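In one dimension both constants in (3.4.7) are explicit, so the limit (3.4.8) can be observed numerically. The sketch below takes $E=\mathbb{R}$ and $\mu=N(0,\sigma^2)$, for which $b=\sigma^2$ and $a=1/(2\sigma^2)$, and evaluates $R^{-2}\log\mu(\{|x|\ge R\})$ through the complementary error function. The function name and the chosen values of $\sigma$ and $R$ are hypothetical, and $R$ is kept small enough that the tail probability does not underflow in double precision.

```python
import math


def scaled_log_tail(R, sigma):
    """Return R^{-2} * log mu({|x| >= R}) for mu = N(0, sigma^2) on the real line."""
    # For a centered Gaussian, P(|x| >= R) = erfc(R / (sigma * sqrt(2))).
    tail = math.erfc(R / (sigma * math.sqrt(2.0)))
    return math.log(tail) / (R * R)


sigma = 2.0                # hypothetical standard deviation
b = sigma ** 2             # sup of Q_mu(lam, lam) over |lam| = 1
a = 1.0 / (2.0 * b)        # inf of Lambda*_mu(x) = x^2 / (2 sigma^2) over |x| = 1

# The printed values approach -a = -1/(2b) as R grows.
for R in (10.0, 20.0, 40.0, 60.0):
    print(R, scaled_log_tail(R, sigma), -a)
```

The convergence is slow because the Gaussian tail carries a polynomial prefactor of order $1/R$, which contributes a correction of order $R^{-2}\log R$ to the scaled logarithm.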
Before leaving the topic of centered Gaussian measures $\mu$, we want to show that one can always develop a representation of $\Lambda_\mu^*$ analogous to the one for $\Lambda_W^*$ given in (1.3.12). For this purpose, it will be convenient to introduce a new notion. Namely, we will say that $(E,H,S,\mu)$ is a Wiener quadruple if

(i) $E$ is a separable, real BANACH space,
(ii) $H$ is a separable, real HILBERT space,
(iii) $S$ is a continuous, linear injection from $H$ into $E$,
(iv) $\mu$ is a probability measure on $(E,\mathcal{B}_E)$ with the property that
$$\int_E\exp\bigl[\sqrt{-1}\,{}_{E^*}\langle\lambda,x\rangle_E\bigr]\,\mu(dx)=\exp\Bigl[-\frac{1}{2}\|S^*\lambda\|_H^2\Bigr], \tag{3.4.10}$$
where $S^*:E^*\longrightarrow H$ is the adjoint map to $S$.

Obviously, if $(E,H,S,\mu)$ is a WIENER quadruple, then $\mu$ is a centered Gaussian measure on $E$ and the covariance of $\mu$ is $Q_\mu(\lambda,\lambda')=(S^*\lambda,S^*\lambda')_H$. In particular, $S^*$ is bounded from $E^*$ to $H$ with operator norm given by

Hence, the norm of $S$ also satisfies
(3.4.11)
3.4.12 Theorem. If $\mu$ is a centered Gaussian measure on the separable, real BANACH space $E$, then there exist a separable, real HILBERT space $H$ and a continuous, linear injection $S:H\longrightarrow E$ such that $(E,H,S,\mu)$ is a WIENER quadruple. Moreover, if $(E,H,S,\mu)$ is any WIENER quadruple, then $S$ is a compact map, $S$ satisfies (3.4.11), and
$$\Lambda_\mu^*(x)=\begin{cases}\frac{1}{2}\|S^{-1}x\|_H^2&\text{if }x\in SH\\[2pt]\infty&\text{if }x\in E\setminus SH.\end{cases} \tag{3.4.13}$$
PROOF: To prove the first statement, let $H$ denote the closure in $L^2(\mu)$ of the subspace spanned by the functions ${}_{E^*}\langle\lambda,\cdot\rangle_E$, $\lambda\in E^*$; and set $\|h\|_H=\|h\|_{L^2(\mu)}$ for $h\in H$. In order to define $S$, we must first define (cf. Theorem 3.3.4) “$m(f\mu)\in E$” for $f\in L^2(\mu)$. To this end, assume that $f\in L^2(\mu)$ is non-negative and that $\int_E f\,d\mu=1$. Then, since

we can define $m(f\mu)$, where $f\mu$ is the probability measure $\nu$ given by $\nu(dx)=f(x)\mu(dx)$. Now, extend $f\longmapsto m(f\mu)$ to the whole of $L^2(\mu)$ by linearity; and let $S$ be the restriction of this map to $H$. Note that
$${}_{E^*}\langle\lambda,Sh\rangle_E=\bigl({}_{E^*}\langle\lambda,\cdot\rangle_E,h\bigr)_H,\quad\lambda\in E^*\text{ and }h\in H. \tag{3.4.14}$$
In particular, if $h\in H$ and $Sh=0$, then $h\perp_H\{{}_{E^*}\langle\lambda,\cdot\rangle_E:\lambda\in E^*\}$, and therefore $h=0$. That is, $S$ is an injection. Finally, to complete the proof that $(E,H,S,\mu)$ is a WIENER quadruple, let $\lambda\in E^*$ be given and, using (3.4.14), check that $S^*\lambda={}_{E^*}\langle\lambda,\cdot\rangle_E$ and therefore that $\|S^*\lambda\|_H^2=Q_\mu(\lambda,\lambda)$.

Now let $(E,H,S,\mu)$ be any WIENER quadruple. We have already seen that (3.4.11) holds. Moreover, since $\Lambda_\mu^*$ is a good rate function, the compactness of $S$ will follow as soon as we show that (3.4.13) is true. To prove (3.4.13), first suppose that $x=Sh$ for some $h\in H$. Then
$$\begin{aligned}\Lambda_\mu^*(x)&=\sup\Bigl\{{}_{E^*}\langle\lambda,Sh\rangle_E-\frac{1}{2}\|S^*\lambda\|_H^2:\lambda\in E^*\Bigr\}\\&=\sup\Bigl\{(S^*\lambda,h)_H-\frac{1}{2}\|S^*\lambda\|_H^2:\lambda\in E^*\Bigr\}\\&=\sup\Bigl\{(h',h)_H-\frac{1}{2}\|h'\|_H^2:h'\in H\Bigr\}=\frac{1}{2}\|h\|_H^2,\end{aligned}$$
since $S$ is an injection and therefore $S^*E^*$ is dense in $H$. Conversely, suppose that $x\in E$ and that $\Lambda_\mu^*(x)<\infty$. Then, since

we have that
$$\bigl|{}_{E^*}\langle\lambda,x\rangle_E\bigr|\le\bigl(2\Lambda_\mu^*(x)\bigr)^{1/2}\|S^*\lambda\|_H,\quad\lambda\in E^*.$$
Hence, because $S^*E^*$ is dense in $H$, there is a unique, continuous, linear functional $F$ on $H$ such that $F(S^*\lambda)={}_{E^*}\langle\lambda,x\rangle_E$, $\lambda\in E^*$; and therefore, by the RIESZ Representation Theorem, we know that there is a unique $h\in H$ with the property that
$${}_{E^*}\langle\lambda,Sh\rangle_E=(S^*\lambda,h)_H={}_{E^*}\langle\lambda,x\rangle_E,\quad\lambda\in E^*.$$
Thus, we conclude that $x=Sh$ and therefore that (3.4.13) holds.
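A finite-dimensional instance may help in reading (3.4.13). In the sketch below we take $E=H=\mathbb{R}^2$ with Euclidean norms and let $S$ be an invertible $2\times2$ matrix, so that $S^*$ is the transpose, $Q_\mu(\lambda,\lambda)=|S^*\lambda|^2$, and $SH=E$. The matrix `S`, the point `x`, and all helper names are hypothetical; the code only checks that the supremum defining $\Lambda_\mu^*(x)$ is attained at the $\lambda$ with $S^*\lambda=S^{-1}x$ and equals $\frac{1}{2}|S^{-1}x|^2$.

```python
def matvec(A, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]


def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]


def inverse(A):
    """Inverse of an invertible 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


S = [[2.0, 0.0], [1.0, 0.5]]   # hypothetical injection from H = R^2 into E = R^2
x = [1.0, -0.7]                # hypothetical point of E


def objective(lam):
    """<lam, x> - (1/2)|S^* lam|^2, the quantity maximized in Lambda*_mu(x)."""
    s_star_lam = matvec(transpose(S), lam)
    return dot(lam, x) - 0.5 * dot(s_star_lam, s_star_lam)


h = matvec(inverse(S), x)                      # h = S^{-1} x
rate = 0.5 * dot(h, h)                         # right-hand side of (3.4.13)
lam_star = matvec(inverse(transpose(S)), h)    # solves S^* lam = h

# Perturbing lam_star in a few directions never increases the objective.
perturbed = [objective([lam_star[0] + dx, lam_star[1] + dy])
             for dx, dy in ((0.1, 0.0), (0.0, -0.1), (0.05, 0.05))]
```

In infinite dimensions $SH$ is a proper subspace of $E$ and the second branch of (3.4.13) becomes relevant; that case is not captured by this sketch.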
3.4.15 Exercise.

Let $\mu$ be a centered Gaussian measure on the separable, real BANACH space $E$. Show that $(\lambda,\lambda')\in E^*\times E^*\longmapsto Q_\mu(\lambda,\lambda')$ is continuous with respect to the weak* topology; and conclude from this that there is a $\lambda_0\in E^*$ with $\|\lambda_0\|_{E^*}=1$ and $a=\frac{1}{2Q_\mu(\lambda_0,\lambda_0)}$, where $a$ is defined as in (3.4.7). In particular, use this to show that

3.4.16 Exercise. Let $E=C(C;\mathbb{R})$, where $C$ is a compact metric space and we think of $E$ as a BANACH space with the uniform norm. Given a centered Gaussian measure $\mu$ on $E$, define $q_\mu(s,t)=Q_\mu(\delta_s,\delta_t)$ for $s,t\in C$. Show that $q_\mu\in C(C^2;[0,\infty))$ and that

Next, show that $q_\mu(s,t)^2\le q_\mu(s,s)\,q_\mu(t,t)$ for all $s,t\in C$, and use this to conclude that $b=\sup_{s\in C}q_\mu(s,s)$, where $b$ is defined as in (3.4.7).
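The last assertion of Exercise 3.4.16 can be explored numerically when $C$ is a finite set, in which case $E=\mathbb{R}^n$ with the maximum norm and $E^*$ is $\mathbb{R}^n$ with the $\ell^1$ norm. The sketch below is not a solution of the exercise; it only checks, on a coarse grid of the unit sphere of $E^*$, that $Q_\mu(\lambda,\lambda)$ never exceeds the largest diagonal entry of $q_\mu$. The matrix `A` (so that $x=Ag$ with $g$ standard normal), the grid spacing, and all names are hypothetical.

```python
import itertools

# Hypothetical model: x = A g with g standard normal in R^3, so C has three points.
A = [[1.0, 0.0, 0.0],
     [0.5, 1.5, 0.0],
     [0.3, 0.2, 0.7]]
n = len(A)

# q[s][t] = Q_mu(delta_s, delta_t) is the covariance of the coordinates x(s) and x(t).
q = [[sum(A[s][k] * A[t][k] for k in range(n)) for t in range(n)]
     for s in range(n)]


def Q(lam):
    """Q_mu(lam, lam) for the measure lam = sum_s lam[s] * delta_s on C."""
    return sum(lam[s] * lam[t] * q[s][t] for s in range(n) for t in range(n))


# Cauchy-Schwarz inequality q(s,t)^2 <= q(s,s) q(t,t) for every pair of points.
cauchy_schwarz = all(q[s][t] ** 2 <= q[s][s] * q[t][t] + 1e-12
                     for s in range(n) for t in range(n))

b_candidate = max(q[s][s] for s in range(n))

# Coarse grid on the unit sphere of E^* (total variation norm = l^1 norm here).
steps = [k / 8.0 for k in range(-8, 9)]
sphere = [lam for lam in itertools.product(steps, repeat=n)
          if abs(sum(abs(c) for c in lam) - 1.0) < 1e-12]
b_grid = max(Q(lam) for lam in sphere)
```

On this grid the maximum of $Q_\mu(\lambda,\lambda)$ is attained at a point mass $\pm\delta_s$, in agreement with the formula $b=\sup_{s\in C}q_\mu(s,s)$.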