Chapter 1
Randomness and probability
Probability has long been closely associated with games of chance. It seems that C. de Méré, a French nobleman living in the seventeenth century, was the first person to explicitly formulate the problem of winning chances in a game. He was interested in a then-popular dice game, which consisted in throwing a pair of dice 24 times. People believed that betting on a double six appearing in 24 throws would be profitable. The calculations made by de Méré, however, indicated the opposite. This counter-intuitive result, together with other problems posed by de Méré, attracted the attention of two famous French mathematicians of the time, B. Pascal and P. de Fermat. Through an exchange of correspondence, they discovered the fundamental principles of probability theory. Soon thereafter, the Dutch scientist C. Huygens published the first book on probability. Probability theory quickly became popular due to the inherent appeal of games of chance, and this led to rapid development of the subject during the 18th century. P. de Laplace (1749-1827) made important contributions to probability theory by applying probabilistic ideas to a variety of scientific and practical problems; examples are the theory of errors, actuarial mathematics, and statistical mechanics. The wide applicability of probability theory did not, however, remove the difficulty of defining probability precisely. The search for a widely acceptable definition took nearly three centuries and was marked by much controversy. In 1933, the Russian mathematician A. Kolmogorov put a stop to the controversy by using axiomatic methods to found modern probability theory (Apostol, 1969). Books on probability are numerous, and thus only a brief introduction is given herein. The interested reader is referred to Yeh (1973), Baht (1995), Milton et al. (1997), and Utts & Heckard (2002), or even some good websites on probability.
1.1 Randomness

1.1.1 Random phenomena
Our mindset is used to deterministic phenomena, which we are certain will occur or will not occur. Examples are numerous: the sun rises every morning and sets every evening; water flows from high to low; there is a direct flight from New York to Tokyo. Deterministic phenomena help us to make sound plans in advance so as to be well prepared for the event to occur. But this is not always the case. More often than not, unpredictable things popping up without any warning will change, hinder or even destroy our plans. Many of us have had such experiences. My daughter, then 10 years old, described how she felt on hearing the news of the tsunami that struck the island of Phuket in 2004: "The plan we had for our holiday to Phuket, Thailand started brewing months ago, and we were sure that nothing could stop us from getting the utmost fun in the sun. The excitement that had built up over months had reached its highest level. However, a good holiday plan isn't always enough for a good holiday. The tsunami that found its way all the way to Phuket denied our plan completely."

Opposite to deterministic phenomena, these unexpected occurrences, called random phenomena, are in fact all around us, and they are studied in the framework of probability theory. More precisely, a phenomenon is called random if, under certain conditions, no single outcome always occurs. Here, "certain conditions" and "all elementary outcomes" are the basic elements of random phenomena. The "certain conditions" may be man-made or natural, but they must be repeatable; for this reason random phenomena are also called random experiments. "All elementary outcomes" refers to those outcomes which are simple and cannot be divided further. Take the quality checkup of a TV set for instance. The outcomes are either "pass" or "fail"; these are the two elementary outcomes, but it is unpredictable which will occur.
1.1.2 Sample space and random events

Suppose we are to perform an experiment whose possible outcomes are either finitely many or infinitely many, but not unique. We further suppose, however, that the set of all possible outcomes is known. This set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by $\Omega$.
Example 1.1: Tossing a coin
The experiment consists of flipping a coin. The outcome is either heads (denoted by H) or tails (denoted by T), so the sample space is

$\Omega = \{H, T\}$.  (1.1)
Example 1.2: Rolling a die
If the experiment consists of rolling a die, then the sample space is

$\Omega = \{1, 2, 3, 4, 5, 6\}$.  (1.2)
Example 1.3" Gasoline consumed in a month Consider driving a car for a month, with the outcome being the number of gallons of gasoline consumed. In principle this outcome could be any nonnegative number, and so the sample space is ~ :[0,~).
(1.3)
Events are collections of outcomes in the sample space; that is, any subset E of the sample space $\Omega$ is known as an event. We say the event A occurs if the outcome of the experiment is an element of the set A.

Example 1.4: Events in Examples 1.1 and 1.2
Reexamine Example 1.1 above. Set $E = \{H\}$; then E is the event that a head appears on the flip of the coin. Similarly, $E = \{T\}$ is the event that a tail appears. In Example 1.2, $E = \{2\}$ is the event that a two appears on the toss of the die, and $E = \{1, 2, 3, 4, 5\}$ is the event that one of 1, 2, 3, 4 and 5 appears on the toss.

Events and their relationships can be described using the language of sets. Suppose E and F are two events of a sample space $\Omega$. The operation $E \cup F$, called the union of E and F, defines a new event consisting of all points which are either in E or in F or in both. In other words, $E \cup F$ occurs if either E or F occurs. Suppose in Example 1.1 that $E = \{H\}$ and $F = \{T\}$; then $E \cup F = \{H, T\}$ defines an event that definitely occurs, namely the whole sample space. The operation $E \cap F$, or simply $EF$, defines a new event consisting of all points which are in both E and F. In other words, the event $EF$ occurs if and only if both E and F occur. Suppose in Example 1.1 that $E = \{H\}$ and $F = \{T\}$. Then the intersection $EF$ does not contain any points and hence cannot occur. Such an event is given a special name, the null event, and is denoted by $\emptyset$. If $EF = \emptyset$, E and F are said to be mutually exclusive.

The definitions above can be extended to more than two events. If $E_1, E_2, \ldots$ are events, then the union of these events, denoted $\bigcup_{n=1}^{\infty} E_n$, is defined to be the event consisting of all points which are in at least one of $E_1, E_2, \ldots$. We use $\bigcap_{n=1}^{\infty} E_n$ to denote the intersection of $E_1, E_2, \ldots$, which is defined as the event consisting of the points which are in all of the events $E_n$, $n = 1, 2, \ldots$.

All points in the sample space $\Omega$ which are not in E define a new event, the complement of the event E, denoted by $E^c$. Back in Example 1.1, $E^c = \{T\}$ if $E = \{H\}$, and $E^c = \emptyset$ if $E = \{H, T\}$. Therefore, the complement of the sample space is the null event; that is, $\Omega^c = \emptyset$.
1.2 Probability

1.2.1 Probability defined on events

An intuitive notion of the terminology and interpretation of probabilities is essential to functioning successfully in today's society. The most frequently quoted example is the weather reports we get from TV or radio. The phrases weather reporters often use are "there is a 40% chance of rain today" or "there is a 10% chance of rain today". Weather reporters have even coined the phrase "POP" (probability of precipitation) in making their reports. These numbers are important for our daily life. If POP is 100%, we should bring an umbrella to class or to work; and if POP is 0, one may safely plan an outdoor sporting event. The public has an intuitive notion about these numbers: a POP near 0 or a POP of 100% signals an event that will hardly occur or will occur for sure. How about a 50% POP? This depends on one's mental outlook. An optimist would not prepare for rain, whereas a pessimist would take an umbrella.

From this example, we obtain an intuitive interpretation of probabilities. First of all, probabilities are numbers between 0 and 1 inclusive; they outline a vision of whether an event will occur. Sometimes the numbers are replaced by percentages between 0% and 100%. Furthermore, probabilities near 0 indicate that the event under consideration is not likely to occur, and probabilities near 1 indicate that the event under consideration is very likely to occur. Probabilities near 1/2 indicate that the event in question has the same chance of occurring as of failing to occur.

Probability is often defined through the following axioms. Consider an experiment whose sample space is $\Omega$. For each event E of the sample space $\Omega$, a number $P(E)$ is defined such that it satisfies the following three conditions (axioms):

(1) $0 \le P(E) \le 1$,  (1.4a)
(2) $P(\Omega) = 1$,  (1.4b)

(3) for any sequence of events $E_1, E_2, \ldots$ which are mutually exclusive, that is, events for which $E_n E_m = \emptyset$ whenever $n \ne m$,

$P\left(\bigcup_{n=1}^{\infty} E_n\right) = \sum_{n=1}^{\infty} P(E_n)$.  (1.4c)
$P(E)$ is referred to as the probability of the event E. Because $E \cup E^c = \Omega$ and $E \cap E^c = \emptyset$, we have

$P(E) + P(E^c) = P(E \cup E^c) = P(\Omega) = 1$,  (1.5)

that is,

$P(E^c) = 1 - P(E)$.  (1.6)

This is the so-called complement law. Moreover, taking $E = \Omega$ in (1.6) gives $P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 0$.

Consider $P(E \cup F)$, the probability of all points either in E or in F. Note that $P(E) + P(F)$ is the probability of all points in E plus the probability of all points in F. Since any point that is in both E and F is counted twice, we must have

$P(E \cup F) = P(E) + P(F) - P(EF)$.  (1.7)
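To make these rules concrete, the short Python sketch below builds a finite, equally likely sample space for a die roll and checks the complement law (1.6) and the inclusion-exclusion formula (1.7) numerically. It is a minimal sketch under the equally-likely assumption; the helper name `prob` and the chosen events are illustrative, not part of the text above.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (Example 1.2).
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(E) under the equally likely assumption: |E| / |Omega|."""
    return Fraction(len(event & omega), len(omega))

E = {2, 4, 6}          # "an even number appears"
F = {1, 2, 3}          # "at most three appears"

# Axioms (1.4a)-(1.4b): 0 <= P(E) <= 1 and P(Omega) = 1.
assert 0 <= prob(E) <= 1 and prob(omega) == 1

# Complement law (1.6): P(E^c) = 1 - P(E).
assert prob(omega - E) == 1 - prob(E)

# Inclusion-exclusion (1.7): P(E u F) = P(E) + P(F) - P(EF).
assert prob(E | F) == prob(E) + prob(F) - prob(E & F)

print(prob(E), prob(E | F))   # 1/2, 5/6
```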
Example 1.5: Probability for coin tossing
In Example 1.1 for coin tossing, $E_1 = \{H\}$ and $E_2 = \{T\}$, and the sample space is $\Omega = \{H, T\}$. If we suppose that the chances for a head and a tail are equal (called equally likely), then we have

$P(\Omega) = 1$, $P(\{H\}) = P(\{T\}) = \frac{1}{2}$, $P(\emptyset) = 0$.  (1.8)

These satisfy the three conditions defining a probability. If instead a head is twice as likely to appear as a tail, we have

$P(\Omega) = 1$, $P(\{H\}) = \frac{2}{3}$, $P(\{T\}) = \frac{1}{3}$, $P(\emptyset) = 0$.  (1.9)
Example 1.6: Probability for die rolling
In the die rolling example (Example 1.2), if we suppose that all six numbers are equally likely to appear, then

$P(\{1\}) = P(\{2\}) = \cdots = P(\{6\}) = \frac{1}{6}$.

From condition (1.4c) of the probability definition, we conclude that the probability of getting an even number equals

$P(\{2, 4, 6\}) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{1}{2}$.  (1.10)
1.2.2 Conditional probability

Knowing that one event has occurred often forces us to reassess the probability of another event. Consider the 2004 presidential election. The probability that a randomly chosen voter cast his ballot for President Bush is 51%. But suppose we know this voter is a DC resident; this would make us revise the probability downward by a considerable amount. Alternatively, knowing the voter is a Texas resident would make us revise the probability upward. The notation for conditional probability is $\Pr(B \mid A)$, meaning the probability of event B given that event A occurs. Given that one event occurs, the conditional probability of a second event is the ratio of the probability of an occurrence common to both events to the probability of the given event:

$\Pr(B \mid A) = \dfrac{\Pr(A \cap B)}{\Pr(A)}$.  (1.11)
Example 1.7: Presidential election
If A is being a voter in Texas and B is being a Bush voter, the 2004 presidential election results reveal $\Pr(A) = 0.061$ (6.1% of all voters were from Texas) and $\Pr(A \cap B) = 0.037$ (3.7% of all voters were Texan Bush voters). So the probability of a Texan voting for Bush is

$\Pr(B \mid A) = \dfrac{\Pr(A \cap B)}{\Pr(A)} = \dfrac{0.037}{0.061} \approx 0.607$.  (1.12)
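As a quick numerical check of (1.11)-(1.12), the sketch below evaluates the conditional probability from the joint and marginal probabilities quoted in Example 1.7; the variable names are illustrative.

```python
# Probabilities quoted in Example 1.7 (2004 election data).
p_A = 0.061        # Pr(A): voter is from Texas
p_A_and_B = 0.037  # Pr(A n B): voter is a Texan Bush voter

# Conditional probability, equation (1.11).
p_B_given_A = p_A_and_B / p_A
print(f"Pr(B|A) = {p_B_given_A:.3f}")   # ~0.607
```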
One of the most important conclusions on conditional probability is stated in the following theorem.

Theorem 1.1 Multiplication rule

$\Pr(A \cap B) = \Pr(B \mid A)\Pr(A) = \Pr(A \mid B)\Pr(B)$.  (1.13)

This follows directly from the definition of conditional probability. The theorem says that the probability of an occurrence common to two events is the probability of the first event multiplied by the probability of the second event conditional on the first event. Theorem 1.1 can be extended to more general cases.

Theorem 1.2 The law of total probability
If $A_1, A_2, \ldots, A_n$ satisfy

$A_i \cap A_j = \emptyset$ if $i \ne j$, and $\bigcup_{i=1}^{n} A_i = \Omega$, $i, j = 1, 2, \ldots, n$,

then

$\Pr(B) = \sum_{i=1}^{n} \Pr(B \mid A_i)\Pr(A_i)$.  (1.14)

Proof: See Baht (1995).

The following theorem was discovered quite long ago, but did not receive extensive attention until the 1950s. Since then, statistics based on Bayes' rule has dominated the development of modern statistics.
Theorem 1.3 Bayes' rule
If $A_1, A_2, \ldots, A_n$ satisfy

$A_i \cap A_j = \emptyset$ if $i \ne j$, and $\bigcup_{i=1}^{n} A_i = \Omega$, $i, j = 1, 2, \ldots, n$,  (1.15)

then

$\Pr(A_i \mid B) = \dfrac{\Pr(B \mid A_i)\Pr(A_i)}{\sum_{j=1}^{n} \Pr(B \mid A_j)\Pr(A_j)}$.  (1.16)

Proof: We consider the case of the two events A and $A^c$. The multiplication rule and the fact that set intersection is commutative give

$\Pr(A \mid B)\Pr(B) = \Pr(A \cap B) = \Pr(B \mid A)\Pr(A)$,  (1.17)

and the law of total probability gives

$\Pr(B) = \Pr(B \mid A)\Pr(A) + \Pr(B \mid A^c)\Pr(A^c)$,  (1.18)

giving the result. []
Example 1.8: Testing for binary outcomes
Consider the massive testing of cattle for BSE (mad cow disease) introduced by the European Commission in 2001. Research into these tests revealed that a BSE-infected cow had a 70% chance of testing positive and a healthy cow had a 10% chance of testing positive. Denoting infection by B and testing positive by T, we have

$\Pr(T \mid B) = 0.70$, $\Pr(T \mid B^c) = 0.10$.  (1.19)

The probability that a cow had BSE was very low, around 0.00001. However, for the sake of clarity in the following calculations, let us assume $\Pr(B) = 0.01$, not quite so low a probability but one that will still illustrate the main idea. Suppose a cow tests positive. It is natural to ask what the probability is that this cow actually has BSE. By Bayes' rule,

$\Pr(B \mid T) = \dfrac{\Pr(T \mid B)\Pr(B)}{\Pr(T \mid B)\Pr(B) + \Pr(T \mid B^c)\Pr(B^c)} = \dfrac{0.70 \times 0.01}{0.70 \times 0.01 + 0.10 \times 0.99} \approx 0.066$.  (1.20)

If a cow tests positive, there is only a 6.6% chance that it actually is infected with BSE! The BSE test is sadly very ineffective. This was not immediately apparent from the false positive and false negative rates of 10% and 30%, but Bayes' rule reveals the full limitations of the test. The multiplication rule can be extended to more than two events.
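The following Python sketch reproduces the Bayes' rule calculation (1.20) for the BSE test, and then repeats it with the more realistic prior of 0.00001 mentioned in the text; the function name `posterior` is an illustrative choice.

```python
def posterior(prior, p_pos_given_infected, p_pos_given_healthy):
    """Pr(B|T) by Bayes' rule (1.16) with the two-event partition {B, B^c}."""
    numerator = p_pos_given_infected * prior
    denominator = numerator + p_pos_given_healthy * (1.0 - prior)
    return numerator / denominator

# Illustrative prior Pr(B) = 0.01 used in Example 1.8.
print(posterior(0.01, 0.70, 0.10))      # ~0.066
# The more realistic prior Pr(B) = 0.00001 makes the test look even worse.
print(posterior(0.00001, 0.70, 0.10))   # ~0.00007
```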
1.2.3 Independence

A prominent example of independent events is sampling with replacement. Consider drawing two playing cards from a standard deck. Let $H_1$ denote the first card being a Heart, and let $H_2$ denote the second card being a Heart. Now suppose the cards are drawn with replacement; that is, after the first card is drawn it is replaced, the deck is thoroughly reshuffled, and then the second card is drawn. We clearly have $\Pr(H_2 \mid H_1) = \Pr(H_2)$, because the probability that the second card is a Heart is 1/4 regardless of whether or not the first card drawn is a Heart. In contrast, if the cards are drawn without replacement, then $H_1$ and $H_2$ are no longer independent. If the first card is a Heart and is not replaced, then there are 12 Hearts left out of 51 cards, so

$\dfrac{12}{51} = \Pr(H_2 \mid H_1) \ne \Pr(H_2) = \dfrac{13}{52} = \dfrac{1}{4}$.  (1.21)
Independence can be defined in several ways. The following are all equivalent definitions of A and B being independent:

$\Pr(A \mid B) = \Pr(A)$,  (1.22a)

$\Pr(B \mid A) = \Pr(B)$,  (1.22b)

$\Pr(A \cap B) = \Pr(A)\Pr(B)$.  (1.22c)
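A short Monte Carlo sketch can make the contrast in (1.21)-(1.22) tangible: it estimates $\Pr(H_2 \mid H_1)$ by simulating card draws with and without replacement. The simulation size and helper names are illustrative assumptions.

```python
import random

def estimate_h2_given_h1(replace, trials=200_000):
    """Estimate Pr(second card is a Heart | first card is a Heart)."""
    deck = ["H"] * 13 + ["X"] * 39   # 13 Hearts, 39 other cards
    both = first = 0
    for _ in range(trials):
        c1 = random.choice(deck)
        if replace:
            c2 = random.choice(deck)
        else:
            rest = deck.copy()
            rest.remove(c1)          # draw the second card from 51 cards
            c2 = random.choice(rest)
        if c1 == "H":
            first += 1
            both += c2 == "H"
    return both / first

print(estimate_h2_given_h1(replace=True))    # ~0.250 = 1/4
print(estimate_h2_given_h1(replace=False))   # ~0.235 = 12/51
```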
1.3 Random variable

1.3.1 Random variable and distributions

In the previous sections, we studied experiments by defining a sample space and probabilities for every event in the sample space. The sample space and the corresponding event probabilities constitute a complete description of the experiment. But we are often interested not in a complete description of an experiment; rather, we wish to concentrate only on a few features. So we wish to move away from the comprehensiveness of probability theory towards a more focused approach. Defining and applying random variables is the means by which we focus on the experimental features of interest.

If we are able to assign a real value X to each elementary outcome $\omega$ of the sample space $\Omega$, then we define a real-valued function on the sample space. Such a function is called a random variable (r.v. or rv) on the sample space. We often use $X = X(\omega)$ or $Y = Y(\omega)$ to denote random variables, and the assigned values, denoted by small letters x, y, are called values of the random variable. More rigorously speaking, X is a random variable on $\Omega$ if $X = X(\omega)$ is a real-valued function defined on $\Omega$ satisfying

$\{\omega \mid X(\omega) \le x\} \in \wp(\Omega)$ for any real number x,  (1.23)

where $\wp(\Omega)$ denotes the power set of $\Omega$. Thus, a random variable implies two things. First, the value of the random variable X depends on the occurrence of one of the elementary outcomes $\omega$; because the latter is random, so is the value of the random variable. Second, the probability for a certain value of the random variable is deterministic, due to the fact that "$X(\omega) = x$" is an event in $\Omega$, and each event in $\Omega$ is pre-assigned a probability P. In summary, a random variable is characterized by "randomness in value but determinism in probability". It should be noted that it is possible to define two or more random variables on one single sample space; in an experiment of rolling two dice, for instance, another random variable Y may be defined as "six spots on the two dice".

Random variable is the most fundamental concept in probability theory, and the distribution function of a random variable fully describes its behavior. The distribution and the probability density function (pdf for short) are the basic definitions in this book. Suppose X is a random variable on the sample space $\Omega$. For any real number x, the probability of the event $X \le x$ is

$F(x) = \Pr(X \le x)$.  (1.24)

The function $F(x)$ is defined as the distribution of the random variable. Each random variable has a distribution, and different random variables may well have the same distribution function. The phrase "random variable X has a distribution $F(x)$" is equivalent to "X is distributed as $F(x)$" or "X obeys distribution $F(x)$", denoted by $X \sim F(x)$. A distribution function has the following three properties:

(1) Monotonic: if $x_1 < x_2$, then $F(x_1) \le F(x_2)$,  (1.25a)
(2) $F(-\infty) = \lim_{x \to -\infty} F(x) = 0$, $F(\infty) = \lim_{x \to \infty} F(x) = 1$,  (1.25b)

(3) Right continuous: $F(x + 0) = \lim_{\Delta x \downarrow 0} F(x + \Delta x) = F(x)$.  (1.25c)
On the other hand, any real-valued function having the above-defined properties is the distribution function of some random variable.

If the derivative of $F(x)$ exists, and the differential form

$dF(x) = f(x)\,dx$  (1.26)

is given, then $f(x)$ is called the probability density function (pdf). Because of property (1.25a), we always have

$f(x) \ge 0$, $-\infty < x < \infty$.  (1.27)

In cases where confusion may arise, a subscript is used to denote the pdf, in the form $f_X(x)$. If two random variables are considered, for example, symbols like $f_X(x)$ and $f_Y(y)$ show their difference. The differential $dF(x)$ represents the probability of the event $\{x < X \le x + dx\}$:

$dF(x) = \Pr(x < X \le x + dx)$.  (1.28)

Then the probability of the event $\{a < X \le b\}$ is given by

$P(a < X \le b) = \int_a^b dF(x) = \int_a^b f(x)\,dx = F(b) - F(a)$,  (1.29)

and from property (1.25b), we have

$\int_{-\infty}^{\infty} dF(x) = \int_{-\infty}^{\infty} f(x)\,dx = 1$.  (1.30)
We sometimes also call $F(x)$ the cumulative distribution function (CDF).

There are two types of random variables. Continuous random variables are those we have dealt with above. Opposite to continuous random variables are discrete random variables, to be discussed in the following. With practice it is easy to distinguish the two, and it is necessary to do so, since the methods of computing the associated probabilities depend on the type of variable involved. A discrete random variable is a random variable that can assume at most a finite or a countably infinite number of possible values. A discrete random variable X assumes discrete real values $x_1, x_2, \ldots, x_n$ with associated probabilities $p_1, p_2, \ldots, p_n$, often denoted by the finite scheme

$\begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ p_1 & p_2 & \cdots & p_n \end{pmatrix}$.  (1.31)

Because the distribution function $F(x) = \Pr(X \le x)$ of a discrete random variable is a step function with jump $p_k$ at each $x_k$, its density function can be expressed in terms of the Dirac delta function as

$f(x) = \sum_{k=1}^{n} p_k \delta(x - x_k)$.  (1.32)
In most practical applications, continuous random variables represent measured data, such as heights, lengths, distances and lifetimes. Discrete random variables represent count data, such as the number of red balls drawn from an urn or the number of boys in a family.
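As a small illustration of these definitions, the sketch below represents a discrete random variable by its finite scheme (1.31) and checks the CDF property (1.25b) for it, then checks the normalization (1.30) and the interval probability (1.29) for a continuous density by numerical integration. SciPy's `quad` routine and the chosen densities are illustrative assumptions.

```python
import math
from scipy.integrate import quad

# Discrete random variable given by a finite scheme (1.31):
# values x_k with probabilities p_k (an invented example).
scheme = {1: 0.2, 2: 0.5, 3: 0.3}
assert abs(sum(scheme.values()) - 1.0) < 1e-12   # probabilities sum to 1

def F_discrete(x):
    """Step-function CDF F(x) = Pr(X <= x) of the finite scheme."""
    return sum(p for xk, p in scheme.items() if xk <= x)

assert F_discrete(0) == 0.0 and F_discrete(3) == 1.0   # property (1.25b)

# Continuous random variable: a density f(x) >= 0 integrating to 1 (1.30).
f = lambda x: math.exp(-x) if x >= 0 else 0.0   # exponential density
total, _ = quad(f, 0, math.inf)
prob_ab, _ = quad(f, 1, 2)                      # P(1 < X <= 2), eq. (1.29)
print(total, prob_ab)   # ~1.0, ~0.2325
```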
1.3.2 Vector random variables and joint distribution

Suppose $\Omega$ is a probability space and n real-valued functions $X_1 = X_1(\omega)$, $X_2 = X_2(\omega)$, ..., $X_n = X_n(\omega)$, $\omega \in \Omega$, are defined on the space $\Omega$. If for any n real numbers $x_1, x_2, \ldots, x_n$ there always holds

$\{\omega \mid X_1(\omega) \le x_1, X_2(\omega) \le x_2, \ldots, X_n(\omega) \le x_n\} \in \wp(\Omega)$,  (1.33)

then $(X_1, \ldots, X_n)$ is an n-dimensional random variable (or random vector, or vector random variable) on the sample space $\Omega$. One big difference between a one-dimensional random variable introduced above and an n-dimensional random variable lies in the fact that only one real number is assigned to each elementary outcome $\omega$ for the former, while many real numbers are assigned for the latter.

Let $(X_1, X_2, \ldots, X_n)$ be an n-dimensional random variable on the sample space $\Omega$. For any n real numbers $x_1, x_2, \ldots, x_n$, the probability of the event "$X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n$",

$F(x_1, x_2, \ldots, x_n) = \Pr(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n)$,  (1.34)

is called the joint distribution function of the n-dimensional random variable $(X_1, X_2, \ldots, X_n)$, or the joint distribution in short. The joint distribution function fully describes the probabilistic properties of a high-dimensional random variable, including not only those of each single random variable but also the statistical correlation between any two random variables.

If there exists a function $f(x_1, x_2, \ldots, x_n)$ such that for any n real numbers $x_1, x_2, \ldots, x_n$ there always holds

$F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n} f(t_1, t_2, \ldots, t_n)\,dt_1\,dt_2 \cdots dt_n$,  (1.35)

then this n-dimensional random variable is called continuous, and $f(x_1, x_2, \ldots, x_n)$ is called its joint density function. A joint density function satisfies

(1) $f(x_1, x_2, \ldots, x_n) \ge 0$,  (1.36)

(2) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(t_1, t_2, \ldots, t_n)\,dt_1\,dt_2 \cdots dt_n = 1$.  (1.37)
Take the two-dimensional random variable (X, Y) for instance. The corresponding distribution function is

$F(x, y) = \Pr(X \le x, Y \le y)$,  (1.38)

which has the following properties:

(1) $F(x, y)$ is a non-decreasing function of either x or y;

(2) for any x and y,

$F(-\infty, y) = \lim_{x \to -\infty} F(x, y) = 0$,  (1.39a)

$F(x, -\infty) = \lim_{y \to -\infty} F(x, y) = 0$,  (1.39b)

$F(-\infty, -\infty) = 0$,  (1.39c)

$F(+\infty, +\infty) = \lim_{x \to \infty,\, y \to \infty} F(x, y) = 1$;  (1.39d)

(3) $F(x, y)$ is a right-continuous function in x and in y:

$F(x + 0, y) = F(x, y)$, $F(x, y + 0) = F(x, y)$;  (1.40)

(4) for any two points $(x_1, y_1)$ and $(x_2, y_2)$ on the plane with $x_1 \le x_2$ and $y_1 \le y_2$, we have

$F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0$.  (1.41)
Any two-dimensional function having the above-mentioned properties can be used as a joint distribution function.

If $F(x, y)$ is the joint distribution, then the distribution function of X can be obtained from $F(x, y)$ through

$F_X(x) = \Pr(X \le x) = \Pr(X \le x, Y < \infty) = F(x, \infty)$.  (1.42)

Similarly, the distribution function of Y can be obtained from $F(x, y)$ through

$F_Y(y) = F(\infty, y)$.  (1.43)

Both $F_X(x)$ and $F_Y(y)$ are called marginal distribution functions of the joint distribution function, or of the random variables X and Y. In the case of an n-dimensional random variable, the various marginal distribution functions can be obtained from

(1) $F_{X_1}(x_1) = F(x_1, \infty, \ldots, \infty)$,  (1.44a)

(2) $F_{X_2}(x_2) = F(\infty, x_2, \infty, \ldots, \infty)$,  (1.44b)

(3) $F_{X_1 X_2}(x_1, x_2) = F(x_1, x_2, \infty, \ldots, \infty)$,  (1.44c)

and so on. An n-dimensional joint distribution function has n one-dimensional marginal distribution functions, $\binom{n}{2}$ two-dimensional marginal distribution functions, ..., and n (n-1)-dimensional marginal distribution functions.
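The following sketch illustrates marginalization (1.42)-(1.44) for a small discrete joint distribution: the marginal of X is obtained by summing the joint probabilities over all values of Y. The joint table is an invented illustration.

```python
# An invented joint distribution Pr(X = x, Y = y) for illustration.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Marginal of X: sum over y, the discrete analogue of F_X(x) = F(x, inf).
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

print(marginal_x)   # {0: 0.4, 1: 0.6}
```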
1.3.3 Conditional distribution

If X is a random variable, and A is an event with positive probability, then the conditional probability

$F_X(x \mid A) = \Pr(X \le x \mid A) = \dfrac{\Pr(\{X \le x\} \cap A)}{\Pr(A)}$, $x \in (-\infty, \infty)$,  (1.45)

is called the conditional distribution function of the random variable X given the event A, or the conditional distribution in short. The conditional distribution describes the statistical law of the values a random variable assumes given that the event A occurs. Here A might be any event on the sample space $\Omega$, and different A may correspond to different conditional distributions. Thus, the conditional distribution function is a function not only of x but also of the event A.

If we take $A = \{a < X \le b\}$, then

$F_X(x \mid a < X \le b) = \begin{cases} 0, & x \le a, \\ \dfrac{F_X(x) - F_X(a)}{F_X(b) - F_X(a)}, & a < x \le b, \\ 1, & x > b. \end{cases}$  (1.46)

This is the compressed (or truncated) distribution function of the random variable X, or the compressed distribution in short.

Suppose (X, Y) is a two-dimensional continuous random variable and its joint density function is $f(x, y)$.
If the two events A and B are $\{x < X \le x + dx\}$ and $\{y < Y \le y + dy\}$, respectively, then

$\Pr(B) = f(y)\,dy$, $\Pr(AB) = f(x, y)\,dx\,dy$, $f(y) \ne 0$,  (1.47)

and we may define a conditional probability

$\Pr(A \mid B) = f(x \mid y)\,dx$.  (1.48)

The quantity

$f(x \mid y) = \dfrac{f(x, y)}{f(y)}$  (1.49)

is the conditional density function given that $Y = y$. It is a density function of x, thus satisfying

$f(x \mid y) \ge 0$, $\int_{-\infty}^{\infty} f(x \mid y)\,dx = 1$.  (1.50)

Similarly, we have the conditional density function given that $X = x$, defined by

$f(y \mid x) = \dfrac{f(x, y)}{f(x)}$.  (1.51)
Two events A and B are statistically independent if and only if

$P(A \cap B) = P(A)P(B)$.  (1.52)

This condition can be applied to two random variables X and Y by defining the two events $A' = \{X \le x\}$ and $B' = \{Y \le y\}$, which gives

$F(x, y) = F_X(x)F_Y(y)$.  (1.53)

From this definition and the definitions of the density functions, we obtain

$f(x, y) = f(x)f(y)$.  (1.54)

Suppose (X, Y) is a two-dimensional random variable. The following expectation, if it exists, is called the covariance:

$\operatorname{cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\} = E(XY) - E(X)E(Y)$.  (1.55)

In the more general study of the statistical independence of n random variables $X_1, X_2, \ldots, X_n$, we define events $A_i$ by

$A_i = \{X_i \le x_i\}$, $i = 1, 2, \ldots, n$,

where the $x_i$ are real. The random variables $X_i$ are said to be statistically independent if the following condition is satisfied:

$f(x_1, x_2, \ldots, x_n) = f(x_1)f(x_2) \cdots f(x_n)$.  (1.56)
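A compact numerical sketch of (1.49) and (1.54)-(1.55): for an invented discrete joint distribution, it computes a conditional distribution, tests the independence factorization, and evaluates the covariance.

```python
import itertools

# Invented joint probabilities Pr(X = x, Y = y) for illustration.
joint = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.3}

p_x = {x: sum(p for (a, b), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (a, b), p in joint.items() if b == y) for y in (0, 1)}

# Conditional distribution of X given Y = 0, the discrete analogue of (1.49).
cond = {x: joint[(x, 0)] / p_y[0] for x in (0, 1)}
print(cond)   # {0: 0.4, 1: 0.6}

# Independence check (1.54): the joint factorizes into the marginals.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in itertools.product((0, 1), repeat=2))
print(independent)   # True for this particular table

# Covariance (1.55): E(XY) - E(X)E(Y); zero here, as independence implies.
e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
print(e_xy - e_x * e_y)   # 0.0
```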
1.3.4 Expectations

The probabilistic properties of a population are fully described by its density function $f(x)$, and the density function contains infinitely many pieces of information about the random variable under consideration. The flight hours from Singapore to Beijing are a random variable depending on the weather, the conditions of the two airports, and so on. So are the flight hours from Singapore to Hong Kong. If the question is which flight takes more time, then our answer is surely the flight to Beijing. The reason we can answer so quickly is that we know there exists a single number able to describe the flight hours well: the average flight hours. The average flight time from Singapore to Hong Kong is 3 hours, while the average flight time from Singapore to Beijing is 6 hours.

If the question is changed to "which flight is more on time", the answer also depends on a single number, called the variance. The larger the variance, the less the flight is on time. If we say the first flight is more on time, then we mean that the variance of the first flight is smaller than that of the second, and vice versa. Therefore, variance describes the scatter of the random variable around a specified value, the mean in this case.

Many of us have had the following experience: although the average flight time from Singapore to Beijing is six hours, we have a vague feeling that over half of the flights take less than six hours, say five and a half hours. Then comes a third question: given the average flight hours, do we most probably use more flight hours or fewer flight hours than the average on a specified flight course? To answer this question, we need another quantity, called the skewness.

Continuing this process, we find that each such question might be answered by a single number, and all these numbers contain partial information about the population. For a continuous random variable, there exist infinitely many such numbers, while for a discrete random variable, such numbers are finitely many or countably infinitely many. Among these numbers, the most important are the average and the variance.

Other names for the average of a random variable are "the expected value of X", "the mean of X", "the statistical average of X", or "the mathematical expectation of the random variable X". We use $E(X)$ or $\bar{X}$ to denote the average. It is defined by

$E[X] = \bar{X} = \int_{-\infty}^{\infty} x f(x)\,dx$.  (1.57)

If X is a discrete random variable with n possible values $x_i$, each having a probability $\Pr(x_i)$ of occurrence, then

$E[X] = \sum_{i=1}^{n} x_i \Pr(x_i)$,  (1.58)

since the density function can be expressed as

$f(x) = \sum_{i=1}^{n} \Pr(x_i)\,\delta(x - x_i)$.  (1.59)
Example 1.9: The exponential distribution
Consider the exponential distribution defined by

$f(x) = \begin{cases} \dfrac{1}{\lambda} e^{-(x-a)/\lambda}, & x > a, \\ 0, & x \le a. \end{cases}$  (1.60)

From the definition,

$E[X] = \int_a^{\infty} \dfrac{x}{\lambda} e^{-(x-a)/\lambda}\,dx = \dfrac{e^{a/\lambda}}{\lambda} \int_a^{\infty} x e^{-x/\lambda}\,dx = a + \lambda$.  (1.61)
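The result (1.61) is easy to confirm numerically. The sketch below integrates $x f(x)$ for the exponential density with illustrative parameter values a = 2 and λ = 3, using SciPy's `quad`.

```python
import math
from scipy.integrate import quad

a, lam = 2.0, 3.0   # illustrative location and scale parameters

def f(x):
    """Exponential density (1.60) with location a and scale lambda."""
    return math.exp(-(x - a) / lam) / lam if x > a else 0.0

mean, _ = quad(lambda x: x * f(x), a, math.inf)
print(mean, a + lam)   # both ~5.0, confirming E[X] = a + lambda
```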
Often we must consider the case in which Y is a function of the random variable X. Suppose $Y = g(X)$ is a function of X. Then

$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$.  (1.62)

If X is a discrete random variable, then

$E[g(X)] = \sum_{i=1}^{n} g(x_i)\Pr(x_i)$,  (1.63)

where n may be infinite in some cases.

Expectation is a special case of a more general class of quantities called moments. The n-th moment about the origin is defined by

$m_n = E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx$, $n = 0, 1, 2, \ldots$,  (1.64)
which corresponds to replacing the function $g(x)$ by $g(x) = x^n$. It is clear that the zero-th moment is the area of the density function, while the first moment $m_1 = \bar{X}$ is the expected value of X.

Another type of moments, called central moments, or moments about the mean value of X, is defined by

$\mu_n = E[(X - \bar{X})^n] = \int_{-\infty}^{\infty} (x - \bar{X})^n f(x)\,dx$,  (1.65)

which is obtained by inserting $g(X) = (X - \bar{X})^n$ into definition (1.62). The moment $\mu_0 = 1$ is the area of $f(x)$, and the first central moment $\mu_1$ is zero; the expectation itself is the first moment about the origin. The second central moment is so important that it is denoted by a special symbol, $\sigma^2$ or $D(X)$, and given a special name, the variance, defined by

$\sigma^2 = D(X) = E(X^2) - E^2(X) = m_2 - m_1^2$.  (1.66)
Its square root is called the standard deviation of X. It measures the spread of the function $f(x)$ about the mean: roughly, the average distance of the data points from the mean. The more closely the values cluster around the mean, the smaller the standard deviation will be; see Figure 1.1.
[Figure 1.1 Standard deviation: two pdfs with standard deviations σ_A and σ_B about the mean.]
[Figure 1.2 Skewness: pdfs with plus, zero and minus skew.]
The third central moment, normalized by $\sigma^3$, is the above-mentioned skewness. Skewness refers to the symmetry of the curve: a perfectly symmetrical curve has a skewness of zero. The sign of the skewness describes the direction in which the long "tail" points, not the location where most of the data lie; see Figure 1.2.
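The sketch below estimates the mean, variance and skewness of a sample drawn from the exponential distribution of Example 1.9 and compares them with the analytic values; the parameter choices and the use of NumPy are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, lam = 2.0, 3.0                        # illustrative parameters
x = a + rng.exponential(lam, 100_000)    # sample from the pdf (1.60)

mean = x.mean()                          # estimates m1 = a + lam
var = ((x - mean) ** 2).mean()           # second central moment (1.66)
std = var ** 0.5
skew = ((x - mean) ** 3).mean() / std**3 # standardized third central moment

print(mean, var, skew)   # ~5.0, ~9.0, ~2.0 (exponential skewness is 2)
```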
1.3.5 Typical distributions

In this section, four important special distributions are briefly described.

1. The uniform distribution

The uniform distribution is the simplest one. A uniform random variable assumes all values on an interval with equal probabilities. The density function of a uniform random variable on an interval [a, b] is defined as

$f(x) = \begin{cases} \dfrac{1}{b - a}, & x \in [a, b], \\ 0, & x \notin [a, b]. \end{cases}$  (1.67)

A uniform random variable has the following properties:

(1) mean $E(X) = \dfrac{a + b}{2}$,  (1.68a)

(2) variance $D(X) = \dfrac{(b - a)^2}{12}$.  (1.68b)

The notation $X \sim U[a, b]$ denotes that X is uniformly distributed on the interval [a, b].
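A quick check of (1.68) by simulation, with illustrative endpoints a = 1 and b = 5:

```python
import numpy as np

a, b = 1.0, 5.0                          # illustrative interval
u = np.random.default_rng(1).uniform(a, b, 100_000)
print(u.mean(), (a + b) / 2)             # ~3.0 vs 3.0, eq. (1.68a)
print(u.var(), (b - a) ** 2 / 12)        # ~1.33 vs 1.33..., eq. (1.68b)
```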
2. The normal distribution

The normal distribution, or Gaussian distribution, has the density function of the form

$f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$, $-\infty < x < \infty$,  (1.69)

where $-\infty < \mu < \infty$ and $\sigma > 0$ are two parameters, the first being a location parameter and the second a scale parameter. We often use $N(\mu, \sigma^2)$ to denote the Gaussian distribution. If $\mu = 0$ and $\sigma = 1$, the distribution is called the standard normal distribution. The importance of the normal distribution lies in the fact that the sum of a large number of random variables approaches a Gaussian distribution if the influence of each variable is small. Its properties are:

(1) Moments:

$E(X) = \mu$, $D(X) = \sigma^2$.  (1.70)

(2) If X is distributed as $N(\mu, \sigma^2)$, then the normalized random variable $X^* = (X - \mu)/\sigma$ has distribution $N(0, 1)$.

(3) The weighted sum $c_1 X_1 + c_2 X_2 + \cdots + c_n X_n$ of jointly Gaussian random variables $X_1, \ldots, X_n$ is distributed as

$N\left(\sum_{i=1}^{n} c_i \mu_i,\; \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \operatorname{cov}(X_i, X_j)\right)$.  (1.71)

(4) If the sum of two independent random variables is a Gaussian random variable, then each random variable is a Gaussian random variable.
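The standardization in property (2) is easy to demonstrate numerically; the parameter values below are illustrative.

```python
import numpy as np

mu, sigma = 10.0, 2.0                    # illustrative parameters of N(mu, sigma^2)
x = np.random.default_rng(2).normal(mu, sigma, 100_000)

z = (x - mu) / sigma                     # property (2): z should be ~N(0, 1)
print(z.mean(), z.std())                 # ~0.0, ~1.0
```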
3. The chi-square distribution

Suppose $X_1, X_2, \ldots, X_n$ are independent standard normal random variables. Their squared sum $X_1^2 + X_2^2 + \cdots + X_n^2$ is distributed as

$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}, & x > 0, \\ 0, & x \le 0, \end{cases}$  (1.72)

where $\Gamma(a)$ denotes the Gamma function defined by

$\Gamma(a) = \int_0^{\infty} u^{a-1} e^{-u}\,du$.  (1.73)

Distributions having such a pdf are called Chi-square distributions with n degrees of freedom, denoted by $\chi^2(n)$. The mean and variance of a Chi-square variable are $E(X) = n$ and $D(X) = 2n$, respectively. Chi-square variables are additive, meaning that the sum of finitely many independent Chi-square random variables remains Chi-square distributed, with the degrees of freedom adding up. The Chi-square distribution is widely used for independence testing, so it is tabulated in Table 1.1.
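Critical values such as those in Table 1.1 can be reproduced with SciPy's `chi2` distribution; the sketch below recomputes the df = 5 row as upper-tail quantiles, an illustrative check.

```python
from scipy.stats import chi2

# Upper-tail areas used as column headers in Table 1.1.
alphas = [0.100, 0.050, 0.025, 0.010, 0.005]

# Critical values x with Pr(X > x) = alpha for a chi-square with df = 5.
row = [chi2.ppf(1 - a, df=5) for a in alphas]
print([f"{v:.5f}" for v in row])
# ['9.23636', '11.07050', '12.83250', '15.08627', '16.74960']
```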
4. The multivariate normal distribution

Let

$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$, $\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$, $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$.  (1.74)

Notice that $|\Sigma| = \sigma_1^2\sigma_2^2(1 - \rho^2)$ and $|\rho| < 1$, so $|\Sigma| > 0$ and

$\Sigma^{-1} = \dfrac{1}{\sigma_1^2\sigma_2^2(1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}$.  (1.75)

We define the density function of the Gaussian bivariate as
$f(y_1, y_2) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{-\dfrac{1}{2(1 - \rho^2)}\left[\left(\dfrac{y_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\,\dfrac{(y_1 - \mu_1)(y_2 - \mu_2)}{\sigma_1\sigma_2} + \left(\dfrac{y_2 - \mu_2}{\sigma_2}\right)^2\right]\right\}$.  (1.76)

A Gaussian multivariate has the property that

$E(\mathbf{y}) = \boldsymbol{\mu}$, $D(\mathbf{y}) = \Sigma$;  (1.77)
that is, the two parameters of a Gaussian multivariate are the mean vector and the covariance matrix.

Suppose x and y are m- and n-dimensional random vectors, respectively, and the components of x are independently and identically distributed as $N(0, 1)$. Let A be an $m \times n$ constant matrix, and let $\boldsymbol{\mu}$ be an $n \times 1$ vector. If y and $\boldsymbol{\mu} + A^{\mathsf{T}}\mathbf{x}$ obey the same distribution, then y is said to be a Gaussian multivariate, denoted by $\mathbf{y} \sim N_n(\boldsymbol{\mu}, A^{\mathsf{T}}A)$. We also use the symbol $\Sigma = A^{\mathsf{T}}A$ in some cases. Its density function is

$f(\mathbf{y}) = \dfrac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\dfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu})\right\}$,  (1.78)

where $|\cdot|$ represents the determinant. A linear transform of a Gaussian multivariate remains Gaussian, and all the marginal distributions of a Gaussian multivariate remain Gaussian.
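The property (1.77) can be checked by simulation: the sketch below samples from a bivariate normal with an illustrative mean vector and covariance matrix (1.74) and compares the sample mean and sample covariance with the parameters.

```python
import numpy as np

mu = np.array([1.0, -2.0])               # illustrative mean vector
sigma1, sigma2, rho = 2.0, 1.0, 0.5
Sigma = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])   # eq. (1.74)

y = np.random.default_rng(3).multivariate_normal(mu, Sigma, 200_000)

print(y.mean(axis=0))           # ~[1.0, -2.0], matching mu   (1.77)
print(np.cov(y, rowvar=False))  # ~Sigma                      (1.77)
```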
1.4 Concluding remarks

Random phenomena are everywhere around us. They are described by events and probabilities. By introducing random variables, the study of random phenomena is concentrated on finding the probability distribution or probability density function (pdf) of the random variable under consideration. Four special pdfs, all of them useful, are outlined here. More general methods for finding pdfs will be presented later.
Table 1.1 The Chi-square distribution (critical values x with Pr(X > x) equal to the upper-tail probability in the column heading)

df      .100       .050       .025       .010       .005
 1    2.70554    3.84146    5.02389    6.63490    7.87944
 2    4.60517    5.99146    7.37776    9.21034   10.59663
 3    6.25139    7.81473    9.34840   11.34487   12.83816
 4    7.77944    9.48773   11.14329   13.27670   14.86026
 5    9.23636   11.07050   12.83250   15.08627   16.74960
 6   10.64464   12.59159   14.44938   16.81189   18.54758
 7   12.01704   14.06714   16.01276   18.47531   20.27774
 8   13.36157   15.50731   17.53455   20.09024   21.95495
 9   14.68366   16.91898   19.02277   21.66599   23.58935
10   15.98718   18.30704   20.48318   23.20925   25.18818
11   17.27501   19.67514   21.92005   24.72497   26.75685
12   18.54935   21.02607   23.33666   26.21697   28.29952
13   19.81193   22.36203   24.73560   27.68825   29.81947
14   21.06414   23.68479   26.11895   29.14124   31.31935
15   22.30713   24.99579   27.48839   30.57791   32.80132
16   23.54183   26.29623   28.84535   31.99993   34.26719
17   24.76904   27.58711   30.19101   33.40866   35.71847
18   25.98942   28.86930   31.52638   34.80531   37.15645
19   27.20357   30.14353   32.85233   36.19087   38.58226
20   28.41198   31.41043   34.16961   37.56623   39.99685
21   29.61509   32.67057   35.47888   38.93217   41.40106
22   30.81328   33.92444   36.78071   40.28936   42.79565
23   32.00690   35.17246   38.07563   41.63840   44.18128
24   33.19624   36.41503   39.36408   42.97982   45.55851
25   34.38159   37.65248   40.64647   44.31410   46.92789
26   35.56317   38.88514   41.92317   45.64168   48.28988
27   36.74122   40.11327   43.19451   46.96294   49.64492
28   37.91592   41.33714   44.46079   48.27824   50.99338
29   39.08747   42.55697   45.72229   49.58788   52.33562
30   40.25602   43.77297   46.97924   50.89218   53.67196