CHAPTER VI
Introduction to Probability Theory
1. Probability

This exposition is inspired by Nelson's approach in [71]. Thus, we only discuss finite probability spaces, the continuous case being dealt with via the use of infinitesimal (nonstandard) analysis.

Probability theory is usually applied when we perform an experiment whose outcome is not known in advance. In order to use the theory, however, we must know what the set of possible outcomes is. This set of possible outcomes is formalized in the theory by a set Ω, called the sample or probability space. For examples of probability spaces, see Section 2 in the Preliminaries (Examples 1, 2, 3). We shall refer to these examples in the sequel. We associate with each outcome ω a weight pr(ω), a real number between 0 and 1, measuring the possibility of its occurrence. The only condition we impose on pr is that the total probability be 1. That is

∑_{ω∈Ω} pr(ω) = 1.
If the coin of Example 1 is a fair coin, then we would assign pr(H) = 1/2 = pr(T). If it is not a fair coin, however, the assignment might be different. If the dice of Example 3 are fair, then we would assign

pr((i, j)) = 1/36

for every pair, if we take Ω as our sample space. If we take Ω₁, however, the
assignments are different. For instance
pr(2) = 1/36,    pr(7) = 6/36,    pr(12) = 1/36.

Once we have the distribution of probabilities, pr, defined on elements of Ω, we define a probability measure Pr over all subsets of Ω. As is easily seen, for our finite spaces, Pr is completely determined by pr. We adopt the convention that the distribution of probabilities over Ω is written with lower-case letters and the corresponding probability measure over subsets of Ω begins with a capital letter. For instance, pr and Pr, pr₁ and Pr₁, or pr′ and Pr′. Thus, we define:

DEFINITION VI.1 (PROBABILITY SPACES).
(1) A (finite) probability space or (finite) sample space is a finite set Ω and a function pr, called a distribution of probabilities, or, simply, a distribution, on Ω, i.e., a pair (Ω, pr), such that
    (a) pr is strictly positive, that is, pr(ω) > 0 for every ω ∈ Ω.
    (b) ∑_{ω∈Ω} pr(ω) = 1.
We call (Ω, pr) a weak probability space, if conditions (b) and (c): pr is positive, that is, pr(ω) ≥ 0 for every ω ∈ Ω, are satisfied.
(2) An event A is any subset of Ω, and the probability of A is

Pr A = ∑_{ω∈A} pr(ω).
Since pr determines Pr, in order to check whether something is a probability space, we need only check (1) in the definition. For instance, if E is the event E = {(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)} in Example 3 with Ω as a sample space, then E is the event that the sum of the two dice equals seven, and its probability is

Pr E = 6/36 = 1/6.
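To make the computation concrete, here is a minimal Python sketch (the variable and function names are ours, not the book's) that builds the sample space of Example 3 as the 36 ordered pairs, assigns each outcome the weight 1/36, and evaluates Pr E for the event that the sum is seven.

```python
from fractions import Fraction

# Sample space of Example 3: ordered pairs of faces of two fair dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# Distribution pr: each of the 36 outcomes gets weight 1/36.
pr = {w: Fraction(1, 36) for w in omega}

def Pr(event):
    """Probability measure determined by pr: Pr A = sum of pr(w) for w in A."""
    return sum(pr[w] for w in event)

# The event E that the sum of the two dice equals seven.
E = {w for w in omega if w[0] + w[1] == 7}

assert sum(pr.values()) == 1      # pr is a distribution
print(Pr(E))                      # 1/6
```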
A weak probability space can always be replaced by a probability space that preserves most of its properties. If we have a space (Ω′, pr′) where pr′ is only positive, but satisfies (b), we can always replace it by a probability space (Ω, pr) where pr is strictly positive, as follows. Let Ω = {ω ∈ Ω′ | pr′(ω) > 0} and pr = pr′ ↾ Ω. Then (Ω, pr) is a probability space with pr strictly positive. Any event A′ ⊆ Ω′ can be replaced by A = A′ ∩ Ω. Then Pr′ A′ = Pr A, so that any experiment that is modeled by (Ω′, pr′) can also be modeled by (Ω, pr). Thus, there is no loss of generality in requiring pr to be strictly positive, and it simplifies some of the developments.

We say that an event E occurs if an outcome belonging to E occurs. For any two events E and F, their union E ∪ F occurs if either E occurs or F occurs or both occur. The intersection of E and F, written EF, occurs if and only if both E and F occur. The empty or impossible event, ∅, is the event that never occurs, and the universal or sure event is Ω. If EF = ∅, we say that E and F are mutually exclusive or incompatible. Finally, for any two events E and F, their difference, E − F, occurs when E occurs and F does not occur. In particular, the complement (relative to Ω) of E, written Eᶜ, occurs when E does not occur.

From the definition of probability on events, we immediately obtain, for any pair of events E and F:
(1) 0 ≤ Pr E ≤ 1.
(2) Pr Ω = 1.
(3) If E and F are mutually exclusive, then Pr(E ∪ F) = Pr E + Pr F.
(4) Pr E > 0, for every E ≠ ∅.

Kolmogorov's axioms for the finite case, which were introduced in the Preliminaries, Section 2, have two main elements: an algebra of events, that is, a family B of subsets of Ω such that Ω ∈ B and B is closed under finite unions and complements (and hence, under differences and intersections), and a function Pr defined on B satisfying (1), (2), and (3). Thus, a Kolmogorov probability space can be represented by a triple (Ω, B, Pr) satisfying the properties indicated above. As for probability spaces, a space which satisfies (1), (2), (3) can be replaced by one satisfying (1), (2), (3), (4). This replacement is done as follows. Let A be the union of all B ∈ B such that Pr B = 0. Then A is an event and Pr A = 0. We take Ω′ = Ω − A, B′ to be the algebra consisting of Ω′ ∩ C for C ∈ B, and, for C ∈ B, we take Pr′(C ∩ Ω′) = Pr C. Then (Ω′, B′, Pr′) satisfies (1), (2), (3), and (4). We also have that B′ ⊆ B and, if C ∈ B′, then Pr′ C = Pr C. Thus, we define:
DEFINITION VI.2 (KOLMOGOROV PROBABILITY SPACES). A triple (Ω, B, Pr), Ω finite, is a (finite) Kolmogorov probability space if it satisfies
(1) B is an algebra of subsets of Ω.
(2) Pr : B → [0, 1].
(3) Pr Ω = 1.
(4) If E and F are mutually exclusive elements of B, then

Pr(E ∪ F) = Pr E + Pr F.
(5) Pr E > 0, for every E ≠ ∅, E ∈ B.
We say that (Ω, B, Pr) is a weak Kolmogorov space if it only satisfies (1), (2), (3), and (4).

It is clear that if we have a (weak) finite probability space (Ω, pr) in our sense, then (Ω, P(Ω), Pr) is a (weak) finite Kolmogorov probability space. It is not difficult to prove (see Section 7, Corollary VI.9) that for any finite (weak) Kolmogorov probability space (Ω, B, Pr), one can find a finite (weak) probability space (Ω′, pr′) and a one-one onto function f : B → P(Ω′) such that Pr A = Pr′ f(A), for every A ∈ B. Thus, our axioms are enough for dealing with finite probability spaces. Infinite probability spaces are another matter. We shall deal with them, however, by reducing them to finite (or, more properly, hyperfinite) spaces using nonstandard analysis.

The following properties are not difficult to prove, either from our original axioms or from Kolmogorov's axioms.
(1) Pr ∅ = 0.
(2) Pr E + Pr Eᶜ = 1.
(3) Pr(E ∪ F) = Pr E + Pr F − Pr EF.

2. Conditional probability
Suppose that we are in the case of Example 3 with sample space Ω and fair dice. Suppose that we observe that the first die is three. Then, given this information, what is the probability that the sum of the two dice is seven? To calculate this probability, we reduce our sample space to those outcomes where the first die gives three, that is, to {(3,1), (3,2), (3,3), (3,4), (3,5), (3,6)}. Since each of the outcomes in this sample space had originally the same probability of occurring, they should still have equal probabilities. That is, given that the first die is a three, the (conditional) probability of each of these outcomes is 1/6, while the (conditional) probability of the outcomes not in this set is 0. Hence, the desired probability is 1/6.

If we let E and F denote the events of the sum being seven and the first die being three, respectively, then the conditional probability just obtained is denoted Pr(E|F). In order to obtain a general formula, we reason as follows. If the event F occurs, then E occurs just in case an outcome both in E and F occurs, that is, an outcome in EF occurs. Since we know that F has occurred, we must take F as our new sample space and define a new distribution pr_F. We must set the
probability of points outside of F to 0 and, hence, ∑_{ω∈F} pr_F(ω) must be equal to 1. So we set, for ω ∈ F,

pr_F(ω) = pr(ω) / Pr F,

and then pr_F is a distribution of probabilities over F, i.e., (F, pr_F) is a sample space. Thus, we get

Pr(E|F) = ∑_{ω∈EF} pr_F(ω) = Pr EF / Pr F.
Notice that this equation is only well defined when Pr F > 0, that is, when F ≠ ∅. The equation defines Pr(·|F) as the probability determined by the distribution pr_F. Thus, Pr(·|F) has all the properties described above for a probability. We then define formally:
DEFINITION VI.3 (CONDITIONAL PROBABILITY). Let F be an event such that Pr F ≠ 0. Then we define, for any event E,

Pr(E|F) = Pr EF / Pr F.

The equation can also be written in the useful form

Pr EF = Pr(E|F) Pr F.

Bayes formula. Let E and F be events. We have
E = EF ∪ EFᶜ

with EF and EFᶜ mutually exclusive. Then

Pr E = Pr EF + Pr EFᶜ
     = Pr(E|F) Pr F + Pr(E|Fᶜ) Pr Fᶜ
     = Pr(E|F) Pr F + Pr(E|Fᶜ)(1 − Pr F).
This formula is what is called the total probability formula, which can be generalized as follows. Let F₁, F₂, ..., Fₙ be pairwise mutually exclusive events with positive probability such that their union is the whole sample space (i.e., the sure event). Then, for any event E,

Pr E = ∑_{j=1}^{n} Pr(E|Fⱼ) Pr Fⱼ.
With the same assumptions about the sequence of F's, we can deduce, for any i with 1 ≤ i ≤ n,

Pr(Fᵢ|E) = Pr EFᵢ / Pr E = Pr(E|Fᵢ) Pr Fᵢ / ∑_{j=1}^{n} Pr(E|Fⱼ) Pr Fⱼ.

This is Bayes formula. In particular, we have
Pr(F|E) = Pr(E|F) Pr F / (Pr(E|F) Pr F + Pr(E|Fᶜ) Pr Fᶜ).
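The two-event form of Bayes formula can be checked mechanically on the dice example. The following Python sketch (the names are ours; it assumes the fair-dice space of Example 3) computes Pr(F|E) both from the formula above and directly from the definition of conditional probability.

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}
Pr = lambda A: sum(pr[w] for w in A)

def cond(A, B):
    """Pr(A | B) = Pr AB / Pr B, defined when Pr B > 0."""
    return Pr(A & B) / Pr(B)

E = {w for w in omega if w[0] + w[1] == 7}   # sum is seven
F = {w for w in omega if w[0] == 3}          # first die shows three
Fc = set(omega) - F

# Bayes formula in the two-event form:
bayes = cond(E, F) * Pr(F) / (cond(E, F) * Pr(F) + cond(E, Fc) * Pr(Fc))
assert bayes == cond(F, E)                   # both equal 1/6
print(bayes)
```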
Example 4 is an example of the use of Bayes formula.

Independent events. We conclude this section with a discussion of independence as it is defined inside probability theory. I shall introduce later another notion of independence that does not depend on the probability measure. The notion of independence introduced here may be called, more precisely, probabilistic independence. Two events E and F are independent if the occurrence of one of them, say F, does not affect the probability of the other, E. That is, if

Pr(E|F) = Pr E.

This is equivalent to Pr EF = Pr E Pr F. Since this last formula does not have the restriction that Pr F > 0, we adopt it as a definition. The definition of independence can be extended to more than two events. Thus, we have:
DEFINITION VI.4 (PROBABILISTIC INDEPENDENCE). The events E₁, E₂, ..., Eₙ are said to be (probabilistically) independent if

Pr E_{i₁} E_{i₂} ⋯ E_{iₘ} = ∏_{j=1}^{m} Pr E_{iⱼ}

for every i₁, i₂, ..., iₘ between 1 and n such that i_p ≠ i_q for p ≠ q.
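As an illustration of the definition, the following Python sketch (our own construction, not the Example 5 referred to below) checks every subfamily of two or more events; it exhibits three events on the fair-dice space that are pairwise independent but not independent.

```python
from fractions import Fraction
from itertools import combinations

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}
Pr = lambda A: sum(pr[w] for w in A)

def independent(*events):
    """Definition VI.4: every subfamily of two or more events must factorize."""
    for m in range(2, len(events) + 1):
        for sub in combinations(events, m):
            inter = set(omega)
            prod = Fraction(1)
            for A in sub:
                inter &= A
                prod *= Pr(A)
            if Pr(inter) != prod:
                return False
    return True

F1 = {w for w in omega if w[0] == 3}          # first die shows three
F2 = {w for w in omega if w[1] == 5}          # second die shows five
E  = {w for w in omega if w[0] + w[1] == 7}   # the sum is seven

print(independent(F1, F2), independent(F1, E), independent(F2, E))  # all True
print(independent(F1, F2, E))                                       # False
```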
Example 5 shows that pairwise independent events might not be independent.

3. Random variables
Most frequently, when performing an experiment, we are interested in a property of the outcome and not in the outcome itself, although we may need the particular sample space in order to determine the probabilities. For instance, in casting two dice as in Example 3, we would usually adopt the sample space Ω and not Ω₁, because the function pr is easily determined for it. We may only be interested, however, in knowing the sum of the numbers that appear on the dice, and not the particular outcome in Ω. In order to deal with this situation we introduce random variables.
DEFINITION VI.5 (RANDOM VARIABLES). A random variable X is a function

X : Ω → ℝ.

We write, for any property of real numbers φ(x),

[φ(X)] = {ω ∈ Ω | φ(X(ω))}.

In particular, for any set of real numbers S,

[X ∈ S] = {ω ∈ Ω | X(ω) ∈ S},

and, for a real number r,

[X = r] = {ω ∈ Ω | X(ω) = r}.

We shall use for some examples the random variables X, Y, and Z introduced in Preliminaries, Section 2, for the case of the two dice (Example 3). Namely, X is the face occurring on the first die, Y is the face occurring on the second die, and Z is the sum of the two faces.

Let X be an arbitrary random variable. Once we have the probabilities Pr[X = i], we can usually forget about the sample space. In fact, it is only used for determining the following probability space:
DEFINITION VI.6 (PROBABILITY SPACE OF A RANDOM VARIABLE). The range of the random variable X, A_X, is defined by

A_X = {X(ω) | ω ∈ Ω}.

For i ∈ A_X, we define pr_X(i) = Pr[X = i]. The function pr_X is called the mass function or distribution of X. We shall occasionally use the same symbol pr_X for the function defined on all of ℝ by the same definition. The pair (A_X, pr_X) is called the probability space of X. The probability determined by pr_X, namely Pr_X, is called the probability distribution of X.

Since Ω is finite, A_X is a finite set of real numbers. We clearly have that pr_X is strictly positive on A_X and

∑_{i∈A_X} pr_X(i) = 1.

Thus, pr_X is a distribution of probabilities over A_X and, hence, (A_X, pr_X) is a probability space, the space induced by X on Ω. It is also clear that pr_X(x) = 0 for x ∉ A_X.

Any function pr : S → ℝ, such that S is a finite set of real numbers, pr is strictly positive, and

∑_{x∈S} pr(x) = 1,
can be considered as the distribution of a random variable. We just take Ω = S and X(x) = x for every x ∈ S. Then pr = pr_X.
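For instance, the distribution of the sum of two fair dice can be computed directly from its definition. The sketch below (the names are ours) tabulates pr_Z(i) = Pr[Z = i] and checks that (A_Z, pr_Z) is itself a probability space.

```python
from fractions import Fraction
from collections import defaultdict

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}

Z = lambda w: w[0] + w[1]          # the random variable: sum of the two faces

# Mass function pr_Z(i) = Pr[Z = i], supported on the range A_Z.
pr_Z = defaultdict(Fraction)
for w in omega:
    pr_Z[Z(w)] += pr[w]

print(dict(pr_Z))                  # pr_Z(2) = 1/36, ..., pr_Z(7) = 6/36, ...
assert sum(pr_Z.values()) == 1     # (A_Z, pr_Z) is a probability space
```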
DEFINITION VI.7 (EQUIVALENCE). We call two random variables X and Y (possibly defined on different sample spaces) equivalent if X and Y have the same probability spaces, that is, if (A_X, pr_X) = (A_Y, pr_Y).

Probability theory is concerned only with those properties of random variables that are shared by all equivalent random variables. Let X be any random variable. Let X′ be the identity restricted to A_X, i.e., X′(x) = x for x ∈ A_X. Then X′ is a random variable on (A_X, pr_X), A_{X′} = A_X, and pr_{X′} = pr_X. Thus, X′ is equivalent to X. Therefore, in studying random variables, there is no loss of generality in assuming that they are defined on (A_X, pr_X).

If X₁ and X₂ are random variables and a and b are real numbers, then we define

(aX₁ + bX₂)(ω) = aX₁(ω) + bX₂(ω),
(X₁X₂)(ω) = X₁(ω)X₂(ω).

For instance, in the case of the two fair dice, with the definitions introduced above, we have Z = X + Y.
Next, we introduce the notion of independence for random variables:
DEFINITION VI.8 (INDEPENDENT RANDOM VARIABLES). We say that the random variables X and Y, defined over the same sample space, are independent if, for any real numbers i and j,

Pr[X = i, Y = j] = Pr[X = i] Pr[Y = j].

For instance, in the case of two fair dice, with X, Y and Z as before, X and Y are independent, but Z and X are not.

4. Expectation, variance and covariance
We now define some functions on random variables.
DEFINITION VI.9 (EXPECTATION AND VARIANCE).
(1) The mean or expectation of the random variable X is

E X = ∑_{ω∈Ω} X(ω) pr(ω).

(2) The variance of X is

Var X = E(X − E X)² = ∑_{x∈A_X} (x − E X)² pr_X(x).
The square root of the variance of X is called the standard deviation of X.
(3) The covariance of the random variables X and Y is

Cov(X, Y) = E((X − E X)(Y − E Y)).

The correlation coefficient of X and Y is

Cov(X, Y) / √(Var X · Var Y).

The mean is the weighted average of the values of the random variable, weighted according to their probabilities. The variance is the weighted average of the squared differences from the mean. We note that
E X = ∑_{i∈A_X} i pr_X(i).
The right-hand side of this equation is also the expectation of the identity function on (A_X, pr_X). Hence, the expectation of a random variable depends only on its probability space. The same is true for the variance and covariance. We can easily get the following formulas:
E(aX₁ + bX₂) = a E X₁ + b E X₂,

Var X = E(X − E X)²
      = E X² − 2(E X)² + (E X)²
      = E X² − (E X)²,

Var aX = E a²X² − (E aX)²
       = a²(E X² − (E X)²)
       = a² Var X,

Cov(X, Y) = E(X − E X)(Y − E Y)
          = E(XY − Y E X − X E Y + E X E Y)
          = E XY − E Y E X − E X E Y + E X E Y
          = E XY − E X E Y,

and

Var(X + Y) = E(X + Y − E(X + Y))²
           = E((X − E X) + (Y − E Y))²
           = E((X − E X)² + (Y − E Y)² + 2(X − E X)(Y − E Y))
           = Var X + Var Y + 2 Cov(X, Y).
If X and Y are independent, then we have
E XY = ∑_{i∈A_X} ∑_{j∈A_Y} ij Pr[X = i, Y = j]
     = ∑_{i∈A_X} ∑_{j∈A_Y} ij Pr[X = i] Pr[Y = j]
     = (∑_{i∈A_X} i Pr[X = i]) (∑_{j∈A_Y} j Pr[Y = j])
     = E X · E Y.

Thus, still assuming independence, we get

Cov(X, Y) = E XY − E X E Y = E X E Y − E X E Y = 0,

and

Var(X + Y) = Var X + Var Y + 2 Cov(X, Y) = Var X + Var Y.
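These identities are easy to verify numerically. The following Python sketch (our own code, assuming the fair-dice space) computes expectations, variances and covariances for X, Y and Z = X + Y and checks the formulas above.

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}

X = lambda w: w[0]                 # first die
Y = lambda w: w[1]                 # second die
Z = lambda w: w[0] + w[1]          # their sum

def E(R):
    return sum(R(w) * pr[w] for w in omega)

def Var(R):
    m = E(R)
    return E(lambda w: (R(w) - m) ** 2)

def Cov(R, S):
    return E(lambda w: R(w) * S(w)) - E(R) * E(S)

assert Var(Z) == Var(X) + Var(Y) + 2 * Cov(X, Y)
assert Cov(X, Y) == 0              # X and Y are independent
assert Cov(X, Z) == Var(X)         # X and Z are not independent
print(E(Z), Var(Z))                # 7 and 35/6
```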
Where it makes sense, the formulas can be extended by induction to n variables. We mention here a few important types of random variables.
Constant random variables. These are variables X such that X(ω) = c for every ω ∈ Ω, where c is a real number. We clearly have Pr[X = c] = 1 and Pr[X = r] = 0 for any r ≠ c. Also E X = c and Var X = 0. We have that A_X = {c} and pr_X(c) = 1, so that we are justified in identifying the random variable X with the real number c.

Indicator or Bernoulli random variables. If we have any event E, the indicator of E is the function defined by

χ_E(ω) = 1, if ω ∈ E,
χ_E(ω) = 0, if ω ∉ E.
For these random variables we have

pr_{χ_E}(r) = Pr E, if r = 1,
pr_{χ_E}(r) = Pr Eᶜ, if r = 0,
pr_{χ_E}(r) = 0, otherwise.

In statistical practice, an outcome of an experiment or trial in E is called a success and one in Eᶜ, a failure. So we define a Bernoulli random variable as a variable that is 1 in case of success and 0 in case of failure. We also have

E χ_E = Pr E and Var χ_E = Pr E Pr Eᶜ.
Binomial random variables. Suppose that one performs n independent Bernoulli trials, each of which results in a "success" with probability p and in a "failure" with probability 1 − p. Let the random variable X measure the number of successes in the n trials. Then X is called a binomial random variable having parameters (n, p). The distribution of X is given by

pr_X(i) = C(n, i) pⁱ(1 − p)ⁿ⁻ⁱ, for i = 0, 1, ..., n,
where C(n, i) = n!/((n − i)! i!) equals the number of sets (called, in this case, combinations) of i objects that can be chosen from n objects. This equation can be proved by first noticing that, since the trials are supposed to be independent, the probability of any particular sequence containing i successes and n − i failures is pⁱ(1 − p)ⁿ⁻ⁱ. The equation follows since there are C(n, i) different sequences of the n outcomes with i successes and n − i failures. Note that, by the binomial theorem, the probabilities pr_X(i) sum to one, so that this is an adequate distribution.

The expectation and the variance of a binomial random variable X can be calculated as follows. Let Xᵢ be the Bernoulli random variable that is one for success in the ith trial, and zero for failure. Then

X = ∑_{i=1}^{n} Xᵢ.
Since the sequence X₁, X₂, ..., Xₙ is assumed to be independent, by the formulas obtained above

E X = ∑_{i=1}^{n} E Xᵢ = np

and

Var X = ∑_{i=1}^{n} Var Xᵢ = np(1 − p).
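A short Python check of these facts, under the assumption that p is rational so the arithmetic is exact (the helper name binomial_pmf is ours):

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """pr_X(i) = C(n, i) p**i (1-p)**(n-i), for i = 0, ..., n."""
    return {i: comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)}

n, p = 10, Fraction(1, 3)
pmf = binomial_pmf(n, p)

assert sum(pmf.values()) == 1                 # binomial theorem
mean = sum(i * pmf[i] for i in pmf)
var = sum((i - mean) ** 2 * pmf[i] for i in pmf)
assert mean == n * p                          # E X = np
assert var == n * p * (1 - p)                 # Var X = np(1 - p)
print(mean, var)                              # 10/3 and 20/9
```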
Expectation of a function of a random variable. Suppose f : ℝ → ℝ and X is a random variable on Ω. Then we can consider the random variable f(X) defined by

f(X)(ω) = f(X(ω))
for every ω ∈ Ω. We shall also have occasion later to consider functions of several variables. That is, if f : ℝⁿ → ℝ and X₁, X₂, ..., Xₙ are random variables over the same space Ω, then

f(X₁, X₂, ..., Xₙ)(ω) = f(X₁(ω), X₂(ω), ..., Xₙ(ω)),

for every ω ∈ Ω. We now want to calculate E f(X). We have
E f(X) = ∑_{z} z Pr[f(X) = z],

where the sum is over the range of f(X). But, we have

Pr[f(X) = z] = ∑_{x: f(x)=z} pr_X(x).

Hence

E f(X) = ∑_{z} z ∑_{x: f(x)=z} pr_X(x) = ∑_{x∈A_X} f(x) pr_X(x).

Therefore, we also get

E f(X₁, X₂, ..., Xₙ) = ∑ f(x₁, x₂, ..., xₙ) Pr[X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ],

where the sum is over all tuples of values of the variables.
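The identity E f(X) = ∑_x f(x) pr_X(x) can be verified by computing the expectation both ways, as in the following sketch (our own code, using the dice sum Z and the function f(z) = (z − 7)²):

```python
from fractions import Fraction
from collections import defaultdict

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}

Z = lambda w: w[0] + w[1]
f = lambda z: (z - 7) ** 2

# Directly on the sample space: E f(Z) = sum over omega of f(Z(w)) pr(w).
direct = sum(f(Z(w)) * pr[w] for w in omega)

# Through the distribution of Z: E f(Z) = sum over x of f(x) pr_Z(x).
pr_Z = defaultdict(Fraction)
for w in omega:
    pr_Z[Z(w)] += pr[w]
via_distribution = sum(f(x) * p for x, p in pr_Z.items())

assert direct == via_distribution == Fraction(35, 6)   # also equals Var Z
print(direct)
```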
5. Inequalities
We shall use the following inequalities.
THEOREM VI.1 (JENSEN'S INEQUALITY). Let X be a random variable and p ≥ 1. Then

|E X|ᵖ ≤ E |X|ᵖ.

In particular we have

|E X| ≤ E |X|.

PROOF. This is obtained because, for p ≥ 1, we have

|∑_{ω∈Ω} X(ω) pr(ω)|ᵖ ≤ ∑_{ω∈Ω} |X(ω)|ᵖ pr(ω). □
THEOREM VI.2 (MARKOV'S INEQUALITY). Let f be a positive real function (i.e., f(x) ≥ 0 for every x), let X be a random variable, and let ε > 0. Then

Pr[f(X) ≥ ε] ≤ E f(X) / ε.
PROOF. We have

E f(X) = ∑_{ω∈Ω} f(X(ω)) pr(ω) ≥ ∑_{ω: f(X(ω))≥ε} f(X(ω)) pr(ω) ≥ ε Pr[f(X) ≥ ε],

so that

Pr[f(X) ≥ ε] ≤ E f(X) / ε. □
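A small numerical check of Markov's inequality on the fair-dice space (our own code; f(x) = x² is an arbitrary positive function):

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}
Pr = lambda A: sum(pr[w] for w in A)

Z = lambda w: w[0] + w[1]
f = lambda x: x * x                          # a positive function
E_fZ = sum(f(Z(w)) * pr[w] for w in omega)

# Markov: Pr[f(Z) >= eps] <= E f(Z) / eps, for every eps > 0.
for eps in (10, 50, 100, 150):
    lhs = Pr({w for w in omega if f(Z(w)) >= eps})
    bound = E_fZ / eps
    assert lhs <= bound
    print(eps, lhs, bound)
```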
THEOREM VI.3 (CHEBYSHEV'S INEQUALITY). Let p > 0, ε > 0, and X be a random variable. Then

Pr[|X| ≥ ε] ≤ E |X|ᵖ / εᵖ.

In particular, we have

Pr[|X − E X| ≥ ε] ≤ Var X / ε².

PROOF. We have, for ε > 0, p > 0 and ω ∈ Ω,

|X(ω)| ≥ ε  if and only if  |X(ω)|ᵖ ≥ εᵖ.

Hence

[|X| ≥ ε] = [|X|ᵖ ≥ εᵖ].

By Markov's inequality

Pr[|X|ᵖ ≥ εᵖ] ≤ E |X|ᵖ / εᵖ.

Thus, we obtain the theorem. In order to obtain the last clause, we apply the theorem with X − E X instead of X and p = 2. □

6. Algebras of random variables
The set of all random variables on Ω is the set ^Ω ℝ of all functions from Ω to ℝ. We know that this set is closed under linear combinations, i.e., if a, b ∈ ℝ and X, Y ∈ ^Ω ℝ, then aX + bY ∈ ^Ω ℝ. That is, ^Ω ℝ is a vector space. But also, X, Y ∈ ^Ω ℝ imply XY ∈ ^Ω ℝ. Thus, ^Ω ℝ is also an algebra, i.e., a set closed under linear combinations and products. We must consider other algebras of random variables. So we define:
DEFINITION VI.10 (ALGEBRAS). (1) By an algebra of random variables V we mean a subset of ^Ω ℝ satisfying the following properties:
    (a) If a, b ∈ ℝ and X, Y ∈ V, then aX + bY ∈ V.
    (b) If X and Y ∈ V, then XY ∈ V.
    (c) If X is constant (i.e., X(ω) = c for a c ∈ ℝ and all ω ∈ Ω), then X ∈ V.
(2) An atom of a subset V of ^Ω ℝ, in particular of an algebra, is a maximal event A such that each random variable in V is constant on A. That is, for every X ∈ V, there is a real number c such that X(ω) = c for all ω ∈ A, and if ω ∉ A, then there is an X ∈ V such that X(ω) ≠ X(ω′), for any ω′ ∈ A. We denote by at(V) the set of all atoms of V.

PROPOSITION VI.4. Let V be a set of random variables, a subset of ^Ω ℝ. Then Ω is partitioned into atoms, that is, Ω is the union of the atoms and they are pairwise disjoint.
PROOF. We first prove that for any ω ∈ Ω there is an atom A such that ω ∈ A. Let Ω = {ω₁, ω₂, ..., ωₙ}. Define the sequence Aᵢ of subsets of Ω by recursion as follows:
(1) A₀ = {ω}.
(2) Suppose that Aᵢ is defined. Then

Aᵢ₊₁ = Aᵢ ∪ {ωᵢ₊₁}, if for every X ∈ V, X(ωᵢ₊₁) = X(η) for each η ∈ Aᵢ,
Aᵢ₊₁ = Aᵢ, otherwise.

Then take A = Aₙ. It is clear that A is an atom and that ω ∈ A.

We prove, now, that if A and B are atoms such that A ∩ B ≠ ∅, then A = B. Suppose that A and B are atoms with A ∩ B ≠ ∅, and let ω ∈ A ∩ B. Let ω′ ∈ A and ω″ ∈ B and let X be any random variable in V. Since ω, ω″ ∈ B, we have that X(ω) = X(ω″). Since ω, ω′ ∈ A, we also have X(ω) = X(ω′). Thus, for any element ω′ of A, any element ω″ of B, and any X ∈ V, we have X(ω′) = X(ω″). Thus, since A and B are atoms, A = B. □
DEFINITION VI.ll (GENERATEDALGEBRA). The algebra V generated by the random variables XI, X a , . . . , X , on Q is the smallest algebra containing these variables and all constant random variables. As an example consider the case of two fair dice with the random variable 2 defined above, i.e., Z((z,j)) = i + j . Consider the algebra V generated by 2 . This algebra contains all constants and linear combinations of these constants with 2. There are 1 1 atoms, which are:
6. ALGEBRAS OF A7
A8 A9
Aio A11 An
RANDOM VARIABLES
143
={(I,6 ) (6, ~ I), (2,5), (5,2), (3,4), (4, 3 ) ) ~ ={(2,6),(6,2), (3,5),(5,3),(4,4)), ={(3,6),(6,3), (4,5),(5,4)), ={(4,6), (6,4)1(5,5)), ={(5,6)(6,5)), ={(6,6)).
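These atoms can be recomputed mechanically: the sketch below (our own code) groups the outcomes of the fair-dice space by the value of Z.

```python
from collections import defaultdict

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
Z = lambda w: w[0] + w[1]

# The atoms of the algebra generated by Z are the sets [Z = i] for i in A_Z.
atoms = defaultdict(set)
for w in omega:
    atoms[Z(w)].add(w)

print(len(atoms))           # 11 atoms, A_2 through A_12
print(sorted(atoms[7]))     # the atom A_7 listed above
```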
The subscript of A denotes the value of Z at the elements of the atom. If A is an atom of V and X ∈ V, we denote by X(A) the value of X at the elements of the atom A. Thus, in the example above, Z(Aᵢ) = i. Thus, we see that, in the case that the algebra V is generated by one random variable X, its atoms are the subsets of Ω of the form [X = i], for the numbers i ∈ A_X.

Algebras of random variables have the following properties:
THEOREM VI.5. Let V be an algebra of random variables. Then
(1) The indicator of any atom of V is in V.
(2) If W is any subset of ^Ω ℝ, the algebra generated by W, V, is the set of all linear combinations of indicators of atoms of W, i.e., X ∈ V if and only if there are indicators of atoms Y₁, Y₂, ..., Yₘ and real numbers a₁, a₂, ..., aₘ such that

X = a₁Y₁ + a₂Y₂ + ⋯ + aₘYₘ.

Moreover, the atoms of V are the same as the atoms of W.
(3) V consists of all random variables which are constant on the atoms of V.
(4) Let f be an arbitrary function from ℝⁿ into ℝ and X₁, X₂, ..., Xₙ ∈ V. Then f(X₁, X₂, ..., Xₙ) ∈ V.
PROOF OF (1). Let A be an atom. If A = Ω, then the indicator of A is the constant function of value 1 and, hence, it is in V. Suppose that there is an ω ∉ A. Then, by definition of atom, there is an X ∈ V such that X(A) ≠ X(ω). Let

X_ω = (X − X(ω)) / (X(A) − X(ω)).

Then X_ω ∈ V, X_ω(A) = 1, and X_ω(ω) = 0. We have that

Y = ∏_{ω∉A} X_ω

is in V. If η ∈ A, then X_ω(η) = 1 for every ω ∉ A. Hence, Y(η) = 1. On the other hand, if η ∉ A, then X_η(η) = 0. Hence, Y(η) = 0. Thus, Y is χ_A, the indicator of A. □
PROOF OF (2). First, let U be the set of all linear combinations of indicators of the atoms of W. We shall prove that U is an algebra. It is clear that U is closed under linear combinations. Suppose, now, that X, Y ∈ U. Then

X = a₁χ_{A₁} + a₂χ_{A₂} + ⋯ + aₙχ_{Aₙ},
Y = b₁χ_{B₁} + b₂χ_{B₂} + ⋯ + bₘχ_{Bₘ},

where A₁, ..., Aₙ and B₁, ..., Bₘ are atoms of W, and, hence

XY = ∑_{i=1}^{n} ∑_{j=1}^{m} aᵢbⱼ χ_{Aᵢ}χ_{Bⱼ} = ∑_{i=1}^{n} ∑_{j=1}^{m} aᵢbⱼ χ_{AᵢBⱼ}.

But, since Aᵢ and Bⱼ are atoms, AᵢBⱼ = ∅ or AᵢBⱼ = Aᵢ. Thus, XY is a linear combination of indicators of atoms and, hence, it belongs to U. Therefore, we have proved that U is an algebra.

Let, now, Z ∈ W and let A₁, A₂, ..., Aₙ be the atoms of W, and Z(ω) = uᵢ for ω ∈ Aᵢ, i = 1, ..., n. Then

Z = u₁χ_{A₁} + ⋯ + uₙχ_{Aₙ},

and, hence, Z ∈ U. (Note also that the constant functions are in U, since the indicators of all the atoms add up to 1.) Thus, we have proved that V ⊆ U. By (1), V contains all the indicators of atoms in V. Since W ⊆ V, all the atoms in W are atoms in V. Thus, V contains all indicators of atoms in W. Therefore, U ⊆ V and, hence, U = V. □
PROOF OF (3). Now, let X be a function on Ω which is constant on the atoms of V, let A₁, A₂, ..., Aₘ be the atoms of V, and let X(Aᵢ) = aᵢ. Then

X = a₁χ_{A₁} + a₂χ_{A₂} + ⋯ + aₘχ_{Aₘ}.

Thus, X ∈ V. □
PROOF OF (4). Let f be a real function of n variables, and X₁, X₂, ..., Xₙ ∈ V. Then, since X₁, ..., Xₙ are constant on the atoms, f(X₁, ..., Xₙ) is also constant on the atoms. Hence, by (3), we get (4). □

We have the following corollary:
COROLLARY VI.6. Let W be a subset of ^Ω ℝ. If V is the algebra of random variables over Ω generated by W, then V can be considered as the algebra of all random variables over at(W), ^{at(W)} ℝ, by identifying the random variable X ∈ V with X_V on at(V) defined by

X_V(A) = X(A),

where A is an atom of V. That is, the function F : V → ^{at(W)} ℝ, defined by F(X) = X_V, is one-one and onto and satisfies: if X, Y ∈ V and a, b are real numbers, then

F(aX + bY) = aF(X) + bF(Y) and F(XY) = F(X)F(Y).

Moreover, we have
(1) If X ∈ V is an indicator, then X_V is also an indicator.
(2) If (Ω, pr) is a probability space, then (at(W), pr′), where pr′(A) = Pr A for any A ∈ at(W), is also a probability space (with expectation E′_V) satisfying

E X = E′_V X_V

for any X ∈ V.
PROOF. We must prove that the function F(X) = X_V, defined on V, is one-one and onto ^{at(W)} ℝ, and that it preserves the operations of the algebra. Suppose that F(X) = F(Y). Since, by the previous theorem, the atoms of V are the same as the atoms of W, X(A) = Y(A) for any atom A of V. Since, for any ω ∈ Ω, ω ∈ A_ω for a certain atom A_ω, we have that X = Y. Thus, F is one-one. Let now Z be an arbitrary random variable on at(W). Define X on Ω by X(ω) = Z(A_ω), where A_ω is the atom to which ω belongs. It is clear that X ∈ V and F(X) = Z. Thus F is onto. It is also clear that F preserves the operations of the algebra V.

Proof of (1). It is clear from the definition that an indicator χ_C is sent into the indicator χ_D, where D = {A ∈ at(V) | A ∩ C ≠ ∅}. Thus, we have (1).

Proof of (2). It is easy to show that (at(W), pr′) = (at(V), pr′) is a probability space. We also have, for X ∈ V,

E X = ∑_{ω∈Ω} X(ω) pr(ω)
    = ∑_{A∈at(V)} ∑_{ω∈A} X(ω) pr(ω)
    = ∑_{A∈at(V)} X_V(A) Pr A
    = E′_V X_V. □
7. Algebras of events

The family of all events, P(Ω), is closed under unions, intersections, and complements, and it contains Ω. Thus, it is an algebra of events. We define:
DEFINITION VI.12 (ALGEBRAS OF EVENTS). A nonempty subset B of P(Ω) satisfying:
(1) If A and B ∈ B, then A ∪ B ∈ B,
(2) If A ∈ B, then Aᶜ ∈ B,
is called an algebra of events. An element A ∈ B such that A ≠ ∅ and there is no B ∈ B with B ≠ ∅ and B a proper subset of A is called an atom of B.
Notice that we have two notions of atoms: atom of an algebra of random variables and atom of an algebra of events. The duality between these two kinds of algebras implies, as we shall see later (Theorem VI.8), that, in a sense, the two notions of atoms coincide. As an example of atoms of an algebra of events, in the algebra P(Ω) the atoms are the singletons. The trivial algebra of events is the algebra of two elements, {Ω, ∅}, whose only atom is Ω.

Let B be an algebra of events. Since
AB = (Aᶜ ∪ Bᶜ)ᶜ and A − B = ABᶜ,

B is closed under intersections and differences, i.e., A, B ∈ B imply AB, A − B ∈ B. Also Ω = A ∪ Aᶜ; thus, Ω ∈ B. We also have the following property:
PROPOSITION VI.7. Let B be an algebra of events. Then
(1) Suppose that A = ⋃_{i=1}^{m} Aᵢ, where Aᵢ ∈ B for i = 1, ..., m. Then there are pairwise disjoint A′₁ ⊆ A₁, ..., A′ₘ ⊆ Aₘ in B such that A = ⋃_{i=1}^{m} A′ᵢ.
(2) For every A₁, ..., Aₘ ∈ B and a₁, ..., aₘ ∈ ℝ, there are pairwise disjoint B₁, ..., Bₖ ∈ B and b₁, ..., bₖ ∈ ℝ such that

⋃_{i=1}^{m} Aᵢ = ⋃_{i=1}^{k} Bᵢ and a₁χ_{A₁} + ⋯ + aₘχ_{Aₘ} = b₁χ_{B₁} + ⋯ + bₖχ_{Bₖ}.

(3) If χ_A = a₁χ_{A₁} + ⋯ + aₘχ_{Aₘ}, where aᵢ ≠ 0 and AᵢAⱼ = ∅ for i, j = 1, ..., m, i ≠ j, then A = A₁ ∪ ⋯ ∪ Aₘ.
PROOF OF (1). We just define by recursion the A′ᵢ as follows:
(1) A′₁ = A₁.
(2) Suppose that A′ⱼ is defined for j = 1, ..., i. Then A′ᵢ₊₁ = Aᵢ₊₁ − ⋃_{j=1}^{i} A′ⱼ.

PROOF OF (2). The proof is by induction on m. Let A₁, ..., A_{m+1} and a₁, ..., a_{m+1} be given, and suppose that B₁, ..., Bₖ with b₁, ..., bₖ work for A₁, ..., Aₘ with a₁, ..., aₘ. Let A = ⋃_{i=1}^{m} Aᵢ, and let Cᵢ = (A − A_{m+1}) ∩ Bᵢ, Dᵢ = A ∩ A_{m+1} ∩ Bᵢ, E = A_{m+1} − A, cᵢ = bᵢ, dᵢ = bᵢ + a_{m+1}, and e = a_{m+1}, for i = 1, ..., k. Then, we have

⋃_{i=1}^{m+1} Aᵢ = ⋃_{i=1}^{k} Cᵢ ∪ ⋃_{i=1}^{k} Dᵢ ∪ E
and

a₁χ_{A₁} + ⋯ + a_{m+1}χ_{A_{m+1}} = c₁χ_{C₁} + ⋯ + cₖχ_{Cₖ} + d₁χ_{D₁} + ⋯ + dₖχ_{Dₖ} + eχ_E. □
PROOF OF (3). Let

χ_A = a₁χ_{A₁} + ⋯ + aₘχ_{Aₘ}.

Let x ∈ A. Then χ_A(x) = 1. Since the Aᵢ's are pairwise disjoint, there is at most one Aᵢ such that x ∈ Aᵢ. Thus

a₁χ_{A₁}(x) + ⋯ + aₘχ_{Aₘ}(x) = aᵢχ_{Aᵢ}(x).

Hence, aᵢχ_{Aᵢ}(x) = 1, and so χ_{Aᵢ}(x) = 1. Thus, x ∈ Aᵢ and

A ⊆ ⋃_{i=1}^{m} Aᵢ.

Assume, now, that x ∈ ⋃_{i=1}^{m} Aᵢ. Then there is exactly one i = 1, ..., m such that x ∈ Aᵢ. Thus

a₁χ_{A₁}(x) + ⋯ + aₘχ_{Aₘ}(x) = aᵢχ_{Aᵢ}(x) ≠ 0.

Hence, χ_A(x) ≠ 0, and so x ∈ A. □
We have the following relationship between an algebra of random variables and an algebra of events. We first introduce a definition. Notice, for this definition, that, given any partition A₁, A₂, ..., Aₙ of Ω, the family consisting of all the unions of members of the partition is an algebra of events. Recall, also, that the atoms of an algebra of random variables form a partition of Ω.
DEFINITION VI.13. (1) Let V be an algebra of random variables over Ω. We define V_e to be the algebra of events in Ω consisting of the unions of the atoms of V.
(2) Let B be an algebra of events in Ω. We define B_rv to be the algebra of random variables generated by the indicators of sets in B.

If B is the trivial algebra of events {Ω, ∅}, then B_rv is the trivial algebra containing the constant random variables. On the other hand, if V is the trivial algebra of constant random variables, then V_e is the trivial algebra of events, {Ω, ∅}.
THEOREM VI.8 (DUALITY). (1) Let V be an algebra of random variables over Ω. Then V_e is an algebra of events and (V_e)_rv = V. Therefore, the atoms of V (as an algebra of random variables) are the same as the atoms of V_e (as an algebra of events).
(2) Let B be an algebra of events in Ω. Then (B_rv)_e = B. Therefore, the atoms of B (as an algebra of events) are the same as the atoms of B_rv (as an algebra of random variables) and, hence, there are atoms A₁, ..., Aₙ of B such that every element of B is a union of some of these atoms.
PROOF OF (1). The fact that V_e is an algebra of events is deduced from the easy-to-prove statement that the unions of elements of a partition of Ω form an algebra of events. We have that (V_e)_rv contains all indicators of the atoms of V. Then, by Corollary VI.6, V ⊆ (V_e)_rv. On the other hand, if C ∈ V_e, with C = A₁ ∪ ⋯ ∪ Aₘ, where A₁, A₂, ..., Aₘ are distinct (and hence pairwise disjoint) atoms, then

χ_C = χ_{A₁} + ⋯ + χ_{Aₘ} ∈ V.

Thus, (V_e)_rv ⊆ V and, hence, (V_e)_rv = V. □
PROOF OF (2). Suppose first that A ∈ B. Then χ_A ∈ B_rv. Thus

χ_A = a₁χ_{A₁} + ⋯ + aₘχ_{Aₘ},

where A₁, ..., Aₘ are distinct atoms of B_rv. By Proposition VI.7, (3), A = A₁ ∪ ⋯ ∪ Aₘ and, hence, A ∈ (B_rv)_e. Thus, B ⊆ (B_rv)_e.

Suppose, now, that A is an atom of B_rv. Then, by Theorem VI.5, χ_A ∈ B_rv. Thus,

χ_A = a₁χ_{A₁} + ⋯ + aₘχ_{Aₘ},

where A₁, ..., Aₘ ∈ B. Then, by Proposition VI.7, (2), there are B₁, ..., Bₖ ∈ B, pairwise disjoint, and b₁, ..., bₖ (which clearly can be taken ≠ 0) such that

χ_A = b₁χ_{B₁} + ⋯ + bₖχ_{Bₖ}.

By Proposition VI.7, (3), A = B₁ ∪ ⋯ ∪ Bₖ and, hence, A ∈ B. Since every atom of B_rv is in B, (B_rv)_e ⊆ B. □
We now prove that every finite Kolmogorov probability space can be replaced by a finite probability space.
COROLLARY VI.9. For every (weak) Kolmogorov probability space (Ω₁, B, Pr₁), there is a (weak) probability space (Ω, pr) and a function F such that
(1) F is a one-one function of B onto P(Ω).
(2) For every A ∈ B, we have Pr₁ A = Pr F(A).

PROOF. By the previous theorem, there are atoms A₁, ..., Aₙ ∈ B such that every element of B is a union of these atoms. Let Ω = {A₁, ..., Aₙ} and define pr(A) = Pr₁ A, for A ∈ Ω. Then (Ω, pr) is a (weak) probability space.
Let, now, A ∈ B. Then A = B₁ ∪ ⋯ ∪ Bₘ, where B₁, ..., Bₘ ∈ Ω. The B's are pairwise disjoint. Let F(A) = {B₁, ..., Bₘ}. Thus

Pr₁ A = ∑_{i=1}^{m} Pr₁ Bᵢ = ∑_{i=1}^{m} pr(Bᵢ) = Pr F(A). □
We have the following theorem that extends the duality between algebras of events and algebras of random variables:
THEOREM VI.10. Let V and W be algebras of random variables. Then the following conditions are equivalent:
(1) W ⊆ V.
(2) Every atom in W is the union of atoms in V.
(3) W_e ⊆ V_e.

PROOF. Suppose that W ⊆ V and let A be an atom of V. Every random variable in W is constant on A and, thus, there is an atom of W, say B, such that A ⊆ B. So, if A is an atom in V and B an atom in W, we have that A ∩ B = ∅ or A ∩ B = A. We then get (2), i.e., every atom in W is the union of the atoms of V that it contains.

Suppose, now, (2). We have that every element of W_e is the union of atoms in W and, hence, the union of atoms in V. Thus, we get (3).

Finally, suppose (3). We have that W = (W_e)_rv and V = (V_e)_rv. Since all the generators of (W_e)_rv are generators of (V_e)_rv, we get (1). □

8. Conditional distributions and expectations
Let V be an algebra of random variables and A an atom of V. As in Section 2, we can take A as a probability space with respect to pr_A defined by

pr_A(ω) = pr(ω) / Pr A,

for all ω ∈ A. It is clear that (A, pr_A) is a sample space, that is,

∑_{ω∈A} pr_A(ω) = 1.

The conditional expectation of a random variable X with respect to V (or to V_e) is a random variable E_V X that at each ω ∈ Ω gives the expectation of X if the atom A_ω, such that ω ∈ A_ω, were to occur. That is, we define:
DEFINITION VI.14 (CONDITIONAL EXPECTATION). Let V be an algebra of random variables over Ω, and X an arbitrary random variable over Ω. For ω ∈ Ω, we define

E_V X(ω) = ∑_{η∈A_ω} X(η) pr_{A_ω}(η),

where A_ω is the atom containing ω. In case V is the algebra of random variables generated by the random variables X₁, X₂, ..., Xₘ, we also write

E_V X = E(X|X₁, X₂, ..., Xₘ).

When B is an algebra of events, we write E_B for E_{B_rv}.

We have that E_V X ∈ V, and E_V X on each atom A is the expectation of X with respect to pr_A. Using the convention indicated above that, if A is an atom, then, for any X ∈ V, X(A) = X(ω) for ω ∈ A, we can write

E_V X(A) = ∑_{η∈A} X(η) pr_A(η).

As an example, consider the trivial algebra, T, of the constant random variables; then, since its only atom is Ω, it is easy to check that E_T = E, i.e., E_T X is the constant random variable with value E X. From the fact that, for each atom A, E_V is an expectation, we can easily prove the following properties.
(1) If Y ∈ V, then E_V Y = Y.
(2) The conditional expectation is V-linear, i.e., for Y₁, Y₂ ∈ V and X₁, X₂ arbitrary random variables, we have

E_V(Y₁X₁ + Y₂X₂) = Y₁ E_V X₁ + Y₂ E_V X₂.

Thus, the elements of V act as constants for E_V.
(3) If W is an algebra of random variables such that W ⊆ V, then E_W E_V = E_W. In particular, since the trivial algebra T is a subset of any algebra, we have E E_V = E, that is, for any random variable X, E E_V X = E X. The proof of (3) depends on Theorem VI.10, (2).
EXAMPLE VI.1. We continue with the example of the two fair dice, and the random variables

X((i, j)) = i,    Y((i, j)) = j,    Z((i, j)) = i + j,

for every (i, j) ∈ Ω. Let V be the algebra generated by Z. Let ω = (3,4). Then ω is in the atom

A₇ = {(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)}.
Then

E(X|Z)(ω) = (1 + 6 + 2 + 5 + 3 + 4)(1/6) = 7/2.

Thus, the expectation of X knowing that seven has appeared is 7/2. On the other hand, by property (1) mentioned above, E(Z|Z) = Z. By symmetry, we can see that E(X|Z) = E(Y|Z). Since, by property (2),

E(X|Z) + E(Y|Z) = E(X + Y|Z) = E(Z|Z) = Z,

we have that

E(X|Z) = E(Y|Z) = Z/2.
We can also define the conditional probability of an event B with respect to an algebra of random variables V. This is a random variable such that, for each ω ∈ Ω, its value is the conditional probability of B given that the atom A_ω, with ω ∈ A_ω, has occurred. That is:
DEFINITION VI.15 (CONDITIONAL PROBABILITY). Let B be an event, V an algebra of random variables, and ω ∈ Ω. Then

Pr_V B(ω) = Pr(B|V)(ω) = E_V(χ_B)(ω) = Pr(BA_ω) / Pr A_ω,

where A_ω is the atom of V containing ω. If X₁, X₂, ..., Xₘ are random variables, we also write

Pr(B|X₁, X₂, ..., Xₘ) = Pr(B|V),

where V is the algebra generated by X₁, X₂, ..., Xₘ.

We have, then, that Pr(B|V) is a random variable which is constant on the atoms of V and, hence, it belongs to V. The old definition of conditional probability is a special case of this one. Let E and F be events with Pr F > 0. Consider the indicator of F, χ_F, and let V be the algebra generated by χ_F. If Pr Fᶜ ≠ 0, this algebra has exactly two atoms, namely, F and Fᶜ. Thus

Pr(E|χ_F)(ω) = Pr(EF) / Pr F = Pr(E|F), if ω ∈ F,
Pr(E|χ_F)(ω) = Pr(EFᶜ) / Pr Fᶜ = Pr(E|Fᶜ), if ω ∈ Fᶜ.

Thus, we have that Pr(E|F) = Pr(E|χ_F)(F).
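Both the conditional expectation E(X|Z) of the example above and the conditional probability Pr(B|V) are averages over atoms, which the following Python sketch (our own code, on the fair-dice space) computes directly; it checks that E(X|Z)((3,4)) = 7/2 and that E E(X|Z) = E X.

```python
from fractions import Fraction
from collections import defaultdict

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}

X = lambda w: w[0]
Z = lambda w: w[0] + w[1]

# Atoms of the algebra V generated by Z: the sets [Z = i].
atoms = defaultdict(list)
for w in omega:
    atoms[Z(w)].append(w)

def E_V(R, w):
    """Conditional expectation E_V R at w: the expectation of R on the atom of w."""
    A = atoms[Z(w)]
    PrA = sum(pr[v] for v in A)
    return sum(R(v) * pr[v] for v in A) / PrA

def Pr_V(B, w):
    """Conditional probability Pr(B | V)(w) = E_V(indicator of B)(w)."""
    return E_V(lambda v: 1 if v in B else 0, w)

assert E_V(X, (3, 4)) == Fraction(7, 2)           # E(X | Z) on the atom with Z = 7
E_total = sum(E_V(X, w) * pr[w] for w in omega)   # E E_V X = E X
assert E_total == sum(X(w) * pr[w] for w in omega)

E = {w for w in omega if w[0] + w[1] == 7}        # the event "sum is seven"
assert Pr_V(E, (3, 4)) == 1 and Pr_V(E, (1, 1)) == 0
```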
We can also relativize Jensen's and Markov's inequalities to V. Jensen's inequality is relativized to

|E_V X|ᵖ ≤ E_V |X|ᵖ, for p ≥ 1,

and Markov's inequality to

Pr_V[f(X) ≥ Y] ≤ E_V f(X) / Y,

for f positive and Y > 0, Y ∈ V, that is,

Pr_V[f(X) ≥ Y](ω) ≤ E_V f(X)(ω) / Y(ω),

for every ω ∈ Ω.

9. Constructions of probability spaces
We first introduce formally the notation used in Corollary VI.6:
DEFINITION VI.16. Let (Ω, pr) be a probability space and let V be an algebra of random variables on this space. Then we define the probability space (at(V), pr′), where pr′(A) = Pr A for any A ∈ at(V). We write Pr′ for the probability measure induced by pr′. We say that the original probability space (Ω, pr) is fibered over (at(V), pr′) with fibers (A, pr_A), where A ∈ at(V).

The two constructions that we shall frequently use in the sequel, and that we shall discuss in this section, are special cases of "fibering". We have the following easy proposition:
PROPOSITION VI.11. Let V be an algebra of random variables. Then
(1) For any ω ∈ Ω, pr(ω) = pr′(A_ω) pr_{A_ω}(ω), where A_ω is the atom to which ω belongs.
(2) Let X be a random variable. Then

E X = E′_V E_V X,

where E′_V is the expectation relative to (at(V), pr′).

PROOF. Part (1) is immediate from the definitions. We now prove (2). We have

E′_V E_V X = ∑_{A∈at(V)} (E_V X)(A) Pr′ A
           = ∑_{A∈at(V)} ∑_{ω∈A} X(ω) pr_A(ω) Pr A
           = ∑_{A∈at(V)} ∑_{ω∈A} X(ω) pr(ω)
           = E X. □
Part (2) of this proposition is a generalization of the total probability formula: let B be the algebra of events generated by the partition F₁, F₂, ..., Fₙ of Ω, and let E be any event. Then the atoms of B are precisely F₁, ..., Fₙ. We have

Pr E = E χ_E = E′_B E_B χ_E = E′_B(Pr(E|B)) = ∑_{i=1}^{n} (Pr(E|B))(Fᵢ) Pr Fᵢ.

It is easy to show that Pr(E|B)(ω) = Pr(E|Fⱼ), if ω ∈ Fⱼ. Thus, Pr(E|B)(Fᵢ) = Pr(E|Fᵢ). Therefore, we get the total probability formula

Pr E = ∑_{i=1}^{n} Pr(E|Fᵢ) Pr Fᵢ.
We now proceed to define the average space. This is the general construction of fibered spaces. That is, a space (Ω₀, pr₀) is given, and for each ω ∈ Ω₀ we have a space (Ω_ω, pr_ω). Then we want to get a space (Ω, pr) that is fibered over (Ω₀, pr₀) (or, more precisely, over a copy of this space) with fibers (Ω_ω, pr_ω).
DEFINITION VI.17 (AVERAGE SPACES). Let (Ω₀, pr₀) be a finite sample space and, for each ω ∈ Ω₀, let (Ω_ω, pr_ω) be a finite probability space. The average probability space of ((Ω_ω, pr_ω) | ω ∈ Ω₀) with respect to (Ω₀, pr₀) is the space (Ω, pr), where pr is called the average measure, defined as follows:

Ω = ⋃{{ω} × Ω_ω | ω ∈ Ω₀} = {(ω, η) | ω ∈ Ω₀, η ∈ Ω_ω},

and, if ω ∈ Ω₀ and η ∈ Ω_ω, then pr((ω, η)) = pr₀(ω) pr_ω(η).

We now obtain the average measure as a fibered measure. Let Ω be as in the definition, and let V be the algebra of random variables on Ω which depend only on their first argument. That is, X ∈ V means that X(ω, η) = X(ω, η′) for every η, η′ ∈ Ω_ω. The atoms of V are the sets {ω} × Ω_ω for ω ∈ Ω₀. (Instead of this algebra, we could consider just the algebra of events generated by the partition of Ω given by these atoms.) We define pr′({ω} × Ω_ω) = pr₀(ω) and pr_{{ω}×Ω_ω} = pr_ω. Then, it is easy to check that, for any random variable X on Ω,
E X = E′ E_V X,

where E is the expectation determined by the average measure and E′ is that determined by pr′. Thus the average measure can be said to be fibered over (Ω₀, pr₀) with fibers (Ω_ω, pr_ω).

It is not difficult to show that the average measure of any subset A of the average space Ω can be obtained as follows. For any ω ∈ Ω₀, let A_ω = {η |
(ω, η) ∈ A}. Then A_ω ⊆ Ω_ω. Let X_A be the random variable defined on Ω₀ by X_A(ω) = Pr_ω(A_ω). We can calculate

Pr A = ∑_{ω∈Ω₀} Pr_ω(A_ω) pr₀(ω),

i.e., Pr A is the expectation of X_A with respect to pr₀ (or, what is the same, the weighted average with respect to the measure Pr₀) of the probabilities Pr_ω(A_ω), for ω ∈ Ω₀. We now define the product spaces:
DEFINITION VI.18 (PRODUCT SPACES). Let (Ωᵢ, prᵢ) be probability spaces for 1 ≤ i ≤ n. Then the product probability space of ((Ωᵢ, prᵢ) | 1 ≤ i ≤ n) is defined as follows:
(1) Ω = ∏_{i=1}^{n} Ωᵢ.
(2) pr((x₁, ..., xₙ)) = ∏_{i=1}^{n} prᵢ(xᵢ), for each (x₁, ..., xₙ) ∈ Ω.

pr is called the product distribution. In case we have two spaces (Ω₁, pr₁) and (Ω₂, pr₂), we also write the product space (Ω₁ × Ω₂, pr₁ × pr₂). When (Ωᵢ, prᵢ) = (Ω, pr) for all i with 1 ≤ i ≤ n, we write ∏_{i=1}^{n} (Ωᵢ, prᵢ) = (^n Ω, ^n pr).
It is clear that if the (Ωᵢ, prᵢ) are probability spaces, then their product is also a probability space.

Let us see why the product space is a case of fibering. Consider the case of the product of two spaces, (Ω₁ × Ω₂, pr₁ × pr₂). Let V₁ be the algebra of random variables on Ω = Ω₁ × Ω₂ which depend only on the first variable, that is, the variables X such that X(ω₁, η) = X(ω₁, η′) for every η, η′ ∈ Ω₂. The atoms of this algebra are the sets

A_{ω₁} = {ω₁} × Ω₂,

for a certain ω₁ ∈ Ω₁. Then we have the probability distribution pr′_{V₁} on at(V₁) defined by pr′_{V₁}(A_{ω₁}) = pr₁(ω₁), and pr_{A_{ω₁}} on each atom A_{ω₁} by

pr_{A_{ω₁}}((ω₁, ω₂)) = pr₂(ω₂).
We then show, for any random variable X on Ω, that E X = E′_{V₁} E_{V₁} X:

E′_{V₁}(E_{V₁} X) = ∑_{ω₁∈Ω₁} (E_{V₁} X)(A_{ω₁}) pr′_{V₁}(A_{ω₁})
                 = ∑_{ω₁∈Ω₁} ∑_{ω₂∈Ω₂} X(ω₁, ω₂) pr₂(ω₂) pr₁(ω₁)
                 = E X.
So the product measure is fibered over (Ω₁, pr₁) with fibers (Ω₂, pr₂), one for each ω₁ ∈ Ω₁.

We turn, now, to products with an arbitrary finite number of factors. It is clear that if A ⊆ Ω with A = ∏_{i=1}^{n} Aᵢ and Aᵢ ⊆ Ωᵢ, then

Pr A = ∏_{i=1}^{n} Prᵢ Aᵢ.
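The product construction is easy to carry out explicitly. The following Python sketch (the helper name product_space is ours) forms the product of two copies of a fair die and checks that the product measure of a "rectangle" A₁ × A₂ factors as Pr₁ A₁ · Pr₂ A₂.

```python
from fractions import Fraction
from itertools import product

def product_space(spaces):
    """Product of finitely many finite probability spaces (Omega_i, pr_i)."""
    omegas = [s[0] for s in spaces]
    prs = [s[1] for s in spaces]
    omega = list(product(*omegas))
    pr = {}
    for w in omega:
        p = Fraction(1)
        for x, pri in zip(w, prs):
            p *= pri[x]           # product distribution
        pr[w] = p
    return omega, pr

die = ({1, 2, 3, 4, 5, 6}, {i: Fraction(1, 6) for i in range(1, 7)})
omega, pr = product_space([die, die])

assert sum(pr.values()) == 1
# Pr(A1 x A2) = Pr_1 A1 * Pr_2 A2 for "rectangles":
A1, A2 = {1, 2}, {5}
rect = {w for w in omega if w[0] in A1 and w[1] in A2}
assert sum(pr[w] for w in rect) == Fraction(len(A1), 6) * Fraction(len(A2), 6)
print(len(omega))                 # 36
```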
10. Stochastic processes
A stochastic process is a function whose values are random variables:

DEFINITION VI.19 (STOCHASTIC PROCESSES). Let T be a finite set and let (Ω, pr) be a finite probability space.
(1) A stochastic process indexed by T and defined over (Ω, pr) is a function ξ : T → ^Ω ℝ. We write ξ(t, ω) for the value of ξ(t) at ω ∈ Ω, and we write ξ(·, ω) for the function from T to ℝ which associates to each t ∈ T the value ξ(t, ω) ∈ ℝ. Thus each ξ(t), written also ξ_t, is a random variable, each ξ(t, ω) is a real number, and each ξ(·, ω) is a real-valued function on T called a trajectory or sample path of the process.
(2) Let A_ξ be the set of all trajectories of the stochastic process ξ. Then A_ξ is a finite subset of ^T ℝ, the set of all functions from T to ℝ. We define the distribution of ξ, pr_ξ, on elements of A_ξ by

pr_ξ(λ) = Pr[ξ(t) = λ(t) for all t ∈ T].

The corresponding probability measure, Pr_ξ, is called the probability distribution of ξ.

Decoding the definition, we have that

pr_ξ(λ) = Pr{ω | ξ(t, ω) = λ(t), for all t ∈ T}.

From the definition, we get that (A_ξ, pr_ξ) is a probability space. As usual, we have denoted by Pr_ξ the probability measure determined by pr_ξ, that is, for B ⊆ A_ξ,

Pr_ξ B = ∑_{λ∈B} pr_ξ(λ).
DEFINITION VI.20. Two stochastic processes ξ and ξ′ indexed by the same finite set T are called equivalent in case (A_ξ, pr_ξ) = (A_{ξ′}, pr_{ξ′}); that is, in case they have the same trajectories with the same probabilities.

Notice that the evaluation function η : T → ^{A_ξ} ℝ, defined by η(t, λ) = λ(t), is a stochastic process defined over (A_ξ, pr_ξ) which is equivalent to ξ. In order to see this, we must prove that A_η = A_ξ and pr_η = pr_ξ. We have λ ∈ A_ξ if and only if λ = η(·, λ); hence, λ ∈ A_ξ if and only if λ ∈ A_η. On the other hand,

pr_η(λ) = Pr_ξ{μ ∈ A_ξ | η(t, μ) = λ(t) for all t ∈ T}
        = Pr_ξ{μ ∈ A_ξ | μ(t) = λ(t) for all t ∈ T}
        = pr_ξ(λ).

Thus, in studying a stochastic process there is no loss of generality in assuming that it is defined over the space of its trajectories.

If T consists of a single element t, then we can identify a stochastic process ξ indexed by T with the random variable ξ(t). That is, a random variable is a special case of a stochastic process. If X = ξ(t) is such a random variable, then the trajectories of X are functions {(t, λ)} where λ ∈ ℝ. So they can be identified with real numbers, i.e., {(t, λ)} can be identified with the real number λ. Thus, the distribution of X is pr_X(λ) = Pr[X = λ], where λ is in

A_X = {λ ∈ ℝ | Pr[X = λ] ≠ 0}.

This is exactly what we had before in Section 3.

When T = {1, 2, ..., n}, we write a process ξ over T as X₁, X₂, ..., Xₙ, where ξ(i) = Xᵢ for i = 1, ..., n. We also use the vector notation, writing ξ = X⃗, and we call X⃗ a random vector. The distribution pr_ξ is called the joint distribution of X₁, X₂, ..., Xₙ, or of the random vector X⃗, and is also written, if λ(1) = x₁, λ(2) = x₂, ..., λ(n) = xₙ,

pr_ξ(λ) = pr_{X₁,...,Xₙ}(x₁, x₂, ..., xₙ).

In particular, we write pr_{X,Y} for the joint distribution of two variables X, Y. That is, pr_{X,Y}(x, y) = pr_ξ(λ) = Pr[X = x, Y = y], where ξ(1) = X, ξ(2) = Y, λ(1) = x, and λ(2) = y. We now extend the notion of independence to stochastic processes.
DEFINITION VI.21. The random variables ξ(t), for t ∈ T, of a stochastic process ξ indexed by T are called independent if

pr_ξ(λ) = ∏_{t∈T} Pr[ξ(t) = λ(t)],

for all λ ∈ A_ξ.

The construction of the product space gives the standard example of independent variables. Suppose, for instance, that T = {1, 2, ..., ν} and that X is a random variable on a space (Ω, pr). Then the random variables Xₙ defined on the product space (^ν Ω, ^ν pr) by
Xₙ(ω₁, ..., ω_ν) = X(ωₙ)

are independent. The stochastic process ξ indexed by T and defined by ξ(n) = Xₙ describes ν independent observations of the random variable X.

If X₁, ..., Xₙ are random variables over the same probability space, then they are called independent if they are independent as a random process indexed by n. In particular, two random variables X₁ and X₂ over the same probability space (Ω, pr), considered as a process X defined on {1, 2}, are independent if and only if
Pr[X₁ = x₁, X₂ = x₂] = pr_X((x₁, x₂)) = pr_{X₁}(x₁) pr_{X₂}(x₂) = Pr[X₁ = x₁] Pr[X₂ = x₂],

for all reals x₁ and x₂. Thus, the new notion coincides with the old one introduced in Section 3.

If A₁, ..., Aₙ are events, then their indicator functions χ_{A₁}, ..., χ_{Aₙ} are a stochastic process indexed by n. It is easy to check that the events A₁, ..., Aₙ are independent exactly in case their indicator functions are independent.
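The following Python sketch (our own code) treats the pair X₁ = first die, X₂ = second die as a process indexed by T = {1, 2}: it tabulates the distribution pr_ξ over trajectories and checks the factorization that defines independence.

```python
from fractions import Fraction
from collections import defaultdict

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = {w: Fraction(1, 36) for w in omega}

# A process indexed by T = {1, 2}: xi(1) = first die, xi(2) = second die.
T = (1, 2)
xi = {1: (lambda w: w[0]), 2: (lambda w: w[1])}

# Distribution over trajectories: pr_xi(lam) = Pr[xi(t) = lam(t) for all t].
pr_traj = defaultdict(Fraction)
for w in omega:
    lam = tuple(xi[t](w) for t in T)          # the trajectory xi(., w)
    pr_traj[lam] += pr[w]

# The variables xi(1), xi(2) are independent: the joint distribution factorizes.
marg = {t: defaultdict(Fraction) for t in T}
for lam, p in pr_traj.items():
    for k, t in enumerate(T):
        marg[t][lam[k]] += p
assert all(pr_traj[lam] == marg[1][lam[0]] * marg[2][lam[1]] for lam in pr_traj)
print(len(pr_traj))                           # 36 trajectories, each of weight 1/36
```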