Statistics & Probability Letters 12 (1991) 29-35
North-Holland

Existence and uniqueness of the solution of the likelihood equations for binary Markov chains

Søren Bisgaard
Center for Quality and Productivity Improvement and Department of Industrial Engineering, University of Wisconsin-Madison, Madison, WI 53705, USA

Laurel E. Travis
Department of Finance and Management Science, University of Alberta, Edmonton, Alberta, Canada T6G 2R6

Received June 1987
Revised August 1990

Abstract: The two-state Markov chain is a useful model when analyzing binary data that may be serially correlated. The likelihood equations for this model are two intersecting conics, and several solutions could be anticipated. However, we prove the existence and uniqueness of the solution, and give a simple method for numerical solution.

Keywords: Binary data, binary Markov chain, maximum likelihood estimation, modeling serial dependence.

1. Introduction

Binary data are frequently observed over time in situations where the validity of the standard assumption of independence appears questionable (see Box, Hunter and Hunter, 1978, pp. 78-82, 496 and 584). For example, in industrial production, successive items in a process may be checked and classified as defective or non-defective; since the causes of defects may persist for a period of time, there may be a tendency for the data on defects to be serially correlated. If serial correlation is detected, it can be a clue that a cause for the defects exists that might be identified and removed, resulting in improved quality. As another example, a consumer's repeated choice between two products can be modeled as binary data. Serial correlation can be used as a measure of brand loyalty, and thus provide useful information for planning marketing strategy. The evaluation of algorithms used to generate random numbers, when these are represented in binary code, presents a third example.

The simple two-state Markov chain is a useful model to consider for binary sequences of this kind. It allows for serial dependence and it contains the independent Bernoulli model as a special case. The two-state Markov chain was used by Goodman (1958) and Lehmann (1959) in discussions of the run-test. Cox (1958) combined the linear logistic regression model and the two-state Markov model and showed several interesting applications. The two-state Markov model was used by Gabriel and Neuman (1962) in meteorology, by Klotz (1971) in genetics, and by Crow and Miles (1977) in telecommunication. Klotz (1973) provided an asymptotic distribution of the maximum likelihood estimates and stated that the solution of the likelihood equations is difficult. Devore (1976) claimed, without actually showing it, that the maximum likelihood estimates can be found by solving a quadratic equation. Crow (1979) pointed out a mistake in that paper and showed that the maximum likelihood estimates are solutions to a cubic equation, a conclusion which is supported by Kim and Bai (1980). Other types of estimators have been discussed by Klotz (1973), and Crow and Miles (1979).

We will develop two quadratic equations which yield the maximum likelihood estimates for the parameters. Since, when written in this form, the likelihood equations are two quadratic equations in two unknowns, several solutions could be expected. Klotz (1971) showed that, conditional on one of the parameters being known, the estimate of the other parameter exists and is unique. However, in most practical applications both parameters are unknown. Thus the main thrust of this paper is to prove existence and uniqueness in the general case when both parameters are unknown. We also show how the equations can be solved in practice, and illustrate our approach with an example originally studied by Barnard, Jenkins and Winsten (1962) that yields a solution in the interior of the unit square. We also provide an example of a non-trivial binary sequence that yields a solution on the boundary of the unit square. It is of theoretical interest, but perhaps disheartening, that this, the simplest example of a Markov chain, requires a very careful enumeration of special cases to establish existence and uniqueness of the solution of the likelihood equations. More complicated Markov chains might be even more mischievous.

2. The model

Let {X_i ∈ {0, 1}; i = 1, 2, ...} be a stationary Markov chain or sequence of dependent Bernoulli variables. This model can be parametrized by the transition matrix

$$P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix},$$

where the first row gives the transition probabilities out of state 0, the second row those out of state 1, and 0 ≤ p ≤ 1, 0 ≤ q ≤ 1. Note that the parameters p and q are restricted to a particularly simple space, the unit square. This parameterization has the advantage that the likelihood equations are simple.

Assuming that the initial probabilities are equal to the stationary probabilities, i.e., Pr(X_1 = 1) = p/(p + q) and Pr(X_1 = 0) = q/(p + q), we can write the likelihood l(p, q | x) as

$$l(p, q \mid x) = \Pr(X_1 = x_1)\,\Pr(X_2 \mid X_1 = x_1)\cdots\Pr(X_n \mid X_{n-1} = x_{n-1})$$
$$= \frac{p^{x_1} q^{1-x_1}}{p+q} \prod_{i=1}^{n-1} (1-p)^{(1-x_i)(1-x_{i+1})}\, p^{(1-x_i)x_{i+1}}\, q^{x_i(1-x_{i+1})}\, (1-q)^{x_i x_{i+1}}$$
$$= \frac{p^a (1-p)^b\, q^c (1-q)^d}{p+q},$$

where a is x_1 plus the number of transitions from state 0 to state 1, b is the number of transitions from 0 to 0, c is 1 - x_1 plus the number of transitions from 1 to 0, and d is the number of transitions from 1 to 1. Notice that a, b, c and d are non-negative integers. We will assume for the remainder of the paper that we have more than two data points, i.e., that a + b + c + d > 2. When p = q = 0, the stationary probabilities are indeterminate, and it is reasonable to define

$$l(0, 0) = \begin{cases} 1 & \text{if } \forall i\colon x_i = 0, \text{ or } \forall i\colon x_i = 1, \\ 0 & \text{otherwise.} \end{cases}$$
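
To make the definitions of a, b, c and d concrete, the following sketch (our own illustration, not part of the original paper; all function names are ours) computes the four transition counts and evaluates l(p, q) for a given 0/1 sequence.

```python
# Illustrative sketch (not from the paper): sufficient statistics and likelihood
# for the two-state binary Markov chain, following the definitions above.

def transition_counts(x):
    """Return (a, b, c, d): a = x_1 + #(0->1), b = #(0->0),
    c = (1 - x_1) + #(1->0), d = #(1->1)."""
    a, c = x[0], 1 - x[0]
    b = d = 0
    for prev, cur in zip(x, x[1:]):
        if prev == 0 and cur == 1:
            a += 1
        elif prev == 0 and cur == 0:
            b += 1
        elif prev == 1 and cur == 0:
            c += 1
        else:
            d += 1
    return a, b, c, d

def likelihood(p, q, x):
    """l(p, q) = p^a (1-p)^b q^c (1-q)^d / (p + q), assuming a stationary start."""
    a, b, c, d = transition_counts(x)
    return p**a * (1 - p)**b * q**c * (1 - q)**d / (p + q)

x = [int(s) for s in "0111001010000101"]   # the Barnard et al. data used in Section 4
print(transition_counts(x))                 # (5, 4, 5, 2)
print(likelihood(0.5329, 0.6893, x))        # likelihood at the estimates reported later
```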

3. Solution to the likelihood equations

To examine the solution of the likelihood equations, we begin by considering the event that one or more of the statistics is zero, i.e., abcd = 0. In this case, we show that a maximum (not necessarily unique) lies on the boundary of the unit square. This done, we restrict our attention to the case abcd ≠ 0, and prove the existence and uniqueness of a maximum for this more common case.

3.1. Boundary solutions

Boundary solutions correspond to random walks with reflecting and/or absorbing barriers. In the case abcd = 0, the maximum of the likelihood function will fall on the boundary of the unit square. To see this, consider first the case where a = 0. Since a represents x_1 plus the number of transitions from 0 to 1, a = 0 implies that the data consist solely of zeros. Hence we have c = 1 and d = 0. The likelihood function reduces to l(p, q) = q(1 - p)^b/(p + q), and we see by inspection that all points with (p, q) ∈ {0} × [0, 1] are maxima that yield l(p, q) = 1. Thus, given the data, the most likely inference is that state 0 is an absorbing barrier. Since state 1 was never reached, the probability of transition out of state 1 is indeterminate. Thus, for this case, it is not surprising that the optimal value of q is not unique.

Similarly, the case where c = 0 corresponds to a data set with every element equal to one, and the non-unique set of boundary optima is (p, q) ∈ [0, 1] × {0}, where l(p, q) = 1. The inference here is that state 1 is an absorbing barrier, and we cannot conclude anything about the conditional probabilities associated with leaving or returning to state 0 given that we started in state 0.

Somewhat less trivial is the case where b = 0. Since the cases with a or c equal to zero have already been considered, we can assume that both these statistics are strictly positive. (An example of a data set with these properties is {111011011110110111111}.) Also, we can assume d > 0, since if b and d are both 0, the optimal solution is p = q = 1, corresponding to a cyclical 'random' walk. Thus we are interested in the case where a, c, d > 0, and l(p, q) = p^a q^c (1 - q)^d/(p + q). Since, in this case, any point with q = 0 or q = 1 yields l(p, q) = 0, we see that the maximum must have q ∈ (0, 1). Fixing any such q and using univariate calculus to maximize l as a function of p shows that, regardless of the choice of q ∈ (0, 1), p = 1 maximizes l. Fixing p = 1 and finding the optimal q value yields

$$\hat q = \frac{-(1+d) + \sqrt{(1+d)^2 + 4(c+d-1)c}}{2(c+d-1)}.$$
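
As a numerical check (our own sketch, not from the paper), q̂ above is the positive root of the quadratic (c + d - 1)q² + (1 + d)q - c = 0 obtained by setting the derivative of log l(1, q) to zero; for the example sequence of this subsection (c = 4, d = 12) it reproduces the value 0.2408 quoted in Section 4.

```python
# Our own check of the boundary case b = 0 (p fixed at 1): the stationary point of
# l(1, q) = q^c (1-q)^d / (1+q) solves (c+d-1) q^2 + (1+d) q - c = 0.
from math import sqrt

def q_hat_on_edge(c, d):
    return (-(1 + d) + sqrt((1 + d) ** 2 + 4 * (c + d - 1) * c)) / (2 * (c + d - 1))

print(q_hat_on_edge(4, 12))   # about 0.2408 for the example sequence of this section
```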

Thus, the unusual but non-trivial case where d ≠ 0 and b = 0 yields a unique boundary solution, and the case with b ≠ 0 and d = 0 is analogous. These solutions correspond to reflecting barriers in state 0 or state 1. We have shown that, if abcd = 0, there exists a solution (not necessarily unique) on the boundary of the unit square. From this point forward, we will assume that abcd ≠ 0.

3.2. Existence of a maximum

Recall that, when p = q = 0 and abcd ≠ 0, we have defined l = 0. Now l(p, q) is defined and continuous everywhere on the unit square (to prove continuity at (0, 0), let p = αq for any α ≥ 0 and notice that l → 0 as q → 0, since a, c ≥ 1). Thus l(p, q) attains a maximum on the unit square. Since abcd ≠ 0, l(p, q) = 0 whenever (p, q) is on the boundary of the unit square, and hence this maximum must lie in the interior. On the interior, any maximum of l must also be a maximum of log l and must satisfy ∂ log l/∂p = 0 and ∂ log l/∂q = 0. Thus l has a maximum on (0, 1) × (0, 1) that satisfies these two log-likelihood equations. It remains to show that the solution to these log-likelihood equations is unique. This will enable us to conclude that the solution is indeed the desired maximum.

3.3. Uniqueness of the solution

Taking the log of the likelihood function, differentiating, and equating to zero yields

$$\frac{a}{p} - \frac{b}{1-p} = \frac{1}{p+q} \qquad (1)$$

and

$$\frac{c}{q} - \frac{d}{1-q} = \frac{1}{p+q}. \qquad (2)$$

These likelihood equations can also be written as

$$F(p, q) = \{(a-1) + b\}p^2 - \{(a-1) - (a+b)q\}p - aq = 0 \qquad (3)$$

and

$$G(p, q) = \{(c-1) + d\}q^2 - \{(c-1) - (c+d)p\}q - cp = 0. \qquad (4)$$

It is readily shown that (3) and (4) each define a hyperbola in the (p, q) plane. The uniqueness of the solution of these likelihood equations is established by the following theorem.

Theorem. The system of equations F(p, q) = 0, G(p, q) = 0 has at most one solution on (0, 1) × (0, 1).

Before proving this theorem, we will need some definitions and lemmas. First we will show that the two hyperbolae defined by the likelihood equations each have only one branch passing through (0, 1) × (0, 1). This done, we will define functions that describe only these relevant branches and use these functions to prove the result.

Lemma 1. Assume a + b > 1. For all q ∈ (0, 1) there exists a unique p ∈ (0, 1) such that F(p, q) = 0 (i.e., the hyperbola defined by F(p, q) = 0 has one and only one branch passing through the interior of the unit square).

Proof. Fixing q and solving F(p, q) = 0 for p yields p = {-B ± (B² - 4AC)^{1/2}}/2A, where A = (a - 1) + b, B = -{(a - 1) - (a + b)q} and C = -aq. We proceed by showing that q ∈ (0, 1) implies (i) B² - 4AC > 0, (ii) {-B + (B² - 4AC)^{1/2}}/2A ∈ (0, 1), and (iii) {-B - (B² - 4AC)^{1/2}}/2A < 0. Statement (i) holds since A > 0, which follows from a + b > 1, and C < 0, so that -4AC > 0. Writing statement (ii) as 0 < -B + (B² - 4AC)^{1/2} < 2A, we see that the left inequality holds since the definitions of A and C imply 4AC < 0. Using statement (i) and some algebraic manipulations, the inequality on the right becomes q > -1, and statement (ii) follows. To prove (iii), notice that 4AC < 0 implies (B² - 4AC)^{1/2} > |B|. Thus B + (B² - 4AC)^{1/2} > 0 and (iii) follows. □

Lemma 2. Assume c + d > 1. For all p ∈ (0, 1) there exists a unique q ∈ (0, 1) such that G(p, q) = 0 (i.e., the hyperbola defined by G(p, q) = 0 has one and only one branch passing through the interior of the unit square).

Proof. Analogous to the proof of Lemma 1. □

Since we are only interested in the segments of F and G which pass through the set (0, 1) × (0, 1), we define

$$f(q) = \frac{-B + (B^2 - 4AC)^{1/2}}{2A},$$

with A, B and C as in the proof of Lemma 1, and we define g(p) analogously from equation (4). So in fact, we are looking for a solution (p̂, q̂) to the system p = f(q), q = g(p). It will also be convenient to define the following two open intervals:

$$R_p = \left(\frac{a-1}{a+b-1},\ \frac{a}{a+b}\right) \quad\text{and}\quad R_q = \left(\frac{c-1}{c+d-1},\ \frac{c}{c+d}\right).$$

Lemma 3. p̂ ∈ (0, 1) implies g(p̂) ∈ R_q, and q̂ ∈ (0, 1) implies f(q̂) ∈ R_p.

Notice that this implies that any solution to F(p, q) = 0, G(p, q) = 0 on (0, 1) × (0, 1) will fall in the range R_p × R_q.

Lemma 4. Except when a = 1 or c = 1, p̂ ∈ (0, 1) implies g'(p̂) ∈ (0, 1) and q̂ ∈ (0, 1) implies f'(q̂) ∈ (0, 1).

Proof of Lemma 3. p̂ ∈ (0, 1) implies g(p̂) ∈ (0, 1) by Lemma 2 and the definition of g(p). Define q̂ = g(p̂) and k = p̂/(p̂ + q̂), and notice that 0 < k < 1. Now if q̂ = g(p̂), then p̂ and q̂ must solve (2), i.e.

$$\frac{c}{\hat q} - \frac{d}{1-\hat q} = \frac{1-k}{\hat q},$$

so

$$\hat q = \frac{c - (1-k)}{c + d - (1-k)},$$

and a little algebra yields q̂ ∈ R_q. Similarly, q̂ ∈ (0, 1) implies f(q̂) ∈ R_p. □
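
As a numerical illustration of the system p = f(q), q = g(p) (our own sketch, not the authors' algorithm; names are ours), one can simply iterate the composition p ← f(g(p)). Lemma 4 places the derivative of the composition in (0, 1) when a ≠ 1 and c ≠ 1, so this iteration should contract to the unique interior root.

```python
# Our own numerical illustration of the system p = f(q), q = g(p): iterate the
# composition p <- f(g(p)). By Lemma 4 the composition's derivative lies in (0, 1)
# when a != 1 and c != 1, so the iteration contracts toward the unique interior root.
from math import sqrt

def branch(a, b, q):
    """Relevant branch of F(p, q) = 0 solved for p (Lemma 1); g is the same map
    with (a, b, q) replaced by (c, d, p)."""
    A = (a - 1) + b
    B = -((a - 1) - (a + b) * q)
    C = -a * q
    return (-B + sqrt(B * B - 4 * A * C)) / (2 * A)

a, b, c, d = 5, 4, 5, 2          # counts for the dataset 0111001010000101 (Section 4)
p = a / (a + b)                   # any starting value in (0, 1) will do here
for _ in range(100):
    q = branch(c, d, p)           # q = g(p)
    p = branch(a, b, q)           # p = f(q)
print(p, q)                       # about (0.5329, 0.6893), inside R_p x R_q
```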

Proof of Lemma 4. Differentiating (3) implicitly with respect to q yields

$$f'(q) = \frac{a - (a+b)f(q)}{2f(q)(a+b-1) - (a-1) + (a+b)q}.$$

Since f(q) ∈ R_p, both the numerator and the denominator of this expression are positive, so f'(q) > 0. Now assume q̂ ∈ (0, 1) and assume f'(q̂) ∉ (0, 1), i.e. [f'(q̂)]^{-1} ≤ 1. Simplifying this inequality yields

$$2f(\hat q)(a+b-1) - (a-1) + (a+b)\hat q \le a - (a+b)f(\hat q),$$

but since q̂ ∈ (0, 1) and f(q̂) ∈ R_p, this inequality implies

$$\frac{2(a-1)(a+b-1)}{a+b-1} - (a-1) + (a+b)\hat q < a - \frac{(a+b)(a-1)}{a+b-1},$$

i.e.,

$$(a-1) + (a+b)\hat q < a - \frac{(a+b)(a-1)}{a+b-1}.$$

Now for all values of a except 1, we know that the left side of this inequality is strictly greater than one, and the right side is strictly less than one, and thus we have a contradiction. Similarly, p̂ ∈ (0, 1) implies g'(p̂) ∈ (0, 1) except when c = 1. □

Proof of Theorem. First we consider the case where a ≠ 1 and c ≠ 1. Any solution (p̂, q̂) to the likelihood equations on (0, 1) × (0, 1) satisfies p̂ = f(q̂) and q̂ = g(p̂), and hence f(g(p̂)) - p̂ = 0. If there are two solutions (p₁, q₁) and (p₂, q₂) with p₁ and p₂ on (0, 1), they both have p₁, p₂ in R_p (by Lemma 3). Hence Rolle's Theorem, applied to f(g(p)) - p, gives us a p̄ in R_p such that f'(g(p̄))g'(p̄) = 1. However, g'(p̄) is in (0, 1) and f'(g(p̄)) is in (0, 1) as well (assuming that a ≠ 1 and c ≠ 1). Thus we have a contradiction. For the case with a = 1, the definitions of a and c imply that c must be 1 or 2 (assuming that c is not 0). This case is unusual, since the branches of the likelihood hyperbolae defined by f and g pass through (0, 0) (when a > 1, the solution (0, 0) falls on the branch of the hyperbola that was discarded). Because of this difference, there are two solutions to the likelihood equations on the unit square, but only one on the interior of the unit square. Unfortunately, the existence of the root at (0, 0) causes the above uniqueness proof to break down in this case. An alternate proof can be constructed by assuming that there are three solutions on the unit square and arguing that there must then be two points satisfying f'(g(p̄))g'(p̄) = 1. Applying Rolle's Theorem a second time to these two points gives an equation that, with a little algebra, can be reduced to a contradiction. Because of the limited relevance of this special case, we do not give the algebraic details. □

4. Calculation of the solution and examples

To solve the likelihood equations using the Newton-Raphson method, we generate successive approximations using

$$q_{n+1} = q_n - \frac{G[f(q_n), q_n]}{G'[f(q_n), q_n]},$$

where

$$G'(p, q) = 2[(c-1)+d]q - (c-1) + (c+d)p + (c+d)q\,p' - c\,p'$$

with

$$p' = f'(q) = \frac{a - (a+b)p}{2(a+b-1)p - (a-1) + (a+b)q},$$

and G and f are as given above. Once q is found, p can be found from f(q).
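
A minimal transcription of this Newton-Raphson scheme is sketched below (our own code; function and variable names are ours, and the counts a, b, c, d are hard-coded for the dataset used in the example that follows). It converges quickly to the estimates reported below.

```python
# Our own sketch of the Newton-Raphson iteration described above; names are ours.
from math import sqrt

a, b, c, d = 5, 4, 5, 2   # transition counts for the dataset (0111001010000101)

def f(q):
    """Relevant branch of F(p, q) = 0 solved for p (Lemma 1)."""
    A, B, C = (a - 1) + b, -((a - 1) - (a + b) * q), -a * q
    return (-B + sqrt(B * B - 4 * A * C)) / (2 * A)

def G(p, q):
    return ((c - 1) + d) * q**2 - ((c - 1) - (c + d) * p) * q - c * p

def dG(p, q):
    """Total derivative of G(f(q), q) with respect to q."""
    dp = (a - (a + b) * p) / (2 * (a + b - 1) * p - (a - 1) + (a + b) * q)  # f'(q)
    return 2 * ((c - 1) + d) * q - (c - 1) + (c + d) * p + (c + d) * q * dp - c * dp

q = 0.5                               # starting value q_1
for _ in range(10):
    p = f(q)
    step = G(p, q) / dG(p, q)
    q -= step
    if abs(step) < 1e-10:
        break
print(f(q), q)                        # about (0.5329, 0.6893)
```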


Using the dataset (0111001010000101) and a starting value of q₁ = 0.5, the procedure converges in three iterations, yielding the maximum likelihood estimates (p̂, q̂) = (0.5329, 0.6893). (A better starting value can be obtained using p₀ = a/(a+b) and q₀ = c/(c+d).) Barnard, Jenkins and Winsten (1962), using simple counts, reported p̂ = 5/9 = 0.5556 and q̂ = 4/6 = 0.6667 for this same dataset. The log-likelihood surface and the hyperbolae defined by equations (3) and (4) are shown in Figures 1 and 2, respectively. Lindqvist (1978) gives the asymptotic distribution of the likelihood estimates for a different parameterization of this model, so confidence intervals can readily be constructed.

In Section 3.1 we showed an example of a non-trivial binary sequence that produces a unique boundary solution. In this example a = 5, b = 0, c = 4 and d = 12. The maximum likelihood estimates of p and q are 1 and 0.2408, respectively. The practical interpretation, as inspection of the observations will also reveal, is that these data most likely come from a Markov chain with a reflecting barrier in state 0. The likelihood surface for this example is shown in Figure 3. Of course, for this situation Lindqvist's asymptotic results do not apply.

Fig. 1. Log-likelihood surface, data (0111001010000101).

Fig. 2. Equations (3) and (4), data set (0111001010000101). The hyperbola for (3) is plotted as a solid line; that for (4) as a dashed line.

Fig. 3. Log-likelihood surface, data {111011011110110111111}.

Acknowledgement

An initial draft of this paper was prepared in collaboration with the late Professor William G. Hunter. We want to express our appreciation for his help. This work was sponsored by National Science Foundation Grant DDM-8808138 and the College of Business and Economics, University of Wisconsin-Whitewater.


References

Barnard, G.A., G.M. Jenkins and C.B. Winsten (1962), Likelihood inference and time series, J. Roy. Statist. Soc. Ser. A 125, 321-372.
Box, G.E.P., W.G. Hunter and J.S. Hunter (1978), Statistics for Experimenters (Wiley, New York).
Cox, D.R. (1958), The regression analysis of binary sequences, J. Roy. Statist. Soc. Ser. B 20, 215-242.
Crow, E.L. (1979), Approximate confidence intervals for a proportion from Markov dependent trials, Comm. Statist. - Simulation Comput. B8(1), 1-23.
Crow, E.L. and M.J. Miles (1977), Confidence limits for digital error rates from dependent transmissions, OT Report 77-118 (Washington, DC).
Crow, E.L. and M.J. Miles (1979), Validation of estimators of a proportion from Markov dependent trials, Comm. Statist. - Simulation Comput. B8(1), 25-52.
Devore, J.L. (1976), A note on the estimation of parameters in a Bernoulli model with dependency, Ann. Statist. 4, 990-992.
Gabriel, K.R. and J. Neuman (1962), A Markov chain model for daily rainfall occurrence at Tel-Aviv, Quart. J. Roy. Meteorological Soc. 88, 90-95.
Goodman, L. (1958), Simplified runs tests and likelihood ratio tests for Markov chains, Biometrika 45, 181-197.
Kim, S. and D.S. Bai (1980), On parameter estimation in Bernoulli trials with dependence, Comm. Statist. - Theory Methods A9(13), 1401-1410.
Klotz, J. (1971), Markov chain clustering of births by sex, Proc. Sixth Berkeley Symp. Math. Statist. Probab. (Univ. of California Press, Berkeley, CA).
Klotz, J. (1973), Statistical inference in Bernoulli trials with dependence, Ann. Statist. 1, 373-379.
Lehmann, E.L. (1959), Testing Statistical Hypotheses (Wiley, New York).
Lindqvist, B. (1978), A note on Bernoulli trials with dependence, Scand. J. Statist. 5, 205-208.
