Statistics & Probability Letters 46 (2000) 161 – 168
Limit theorems for regression models of time series of counts

Michel Blais a, Brenda MacGibbon b,1, Roch Roy c,1,*

a Département de mathématiques et statistique, Université de Montréal, Canada
b GERAD and Département de mathématiques, Université du Québec à Montréal, Canada
c Centre de recherche mathématique and Département de mathématiques et statistique, Université de Montréal, Case postale 6128, Succursale Centre-ville, Montréal, Québec, Canada H3C 3J7

* Corresponding author. Tel.: +1-514-343-7977; fax: +1-514-343-5700. E-mail address: [email protected] (R. Roy).
1 This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Fonds FCAR (Government of Quebec).
Received September 1998; received in revised form March 1999
Abstract

Here we present some limit theorems for a general class of generalized linear models describing time series of counts $Y_1,\ldots,Y_n$. Following Zeger (Biometrika 75 (1988) 621–629), we suppose that the serial correlation depends on an unobservable latent process $\{\varepsilon_t\}$. Assuming that the conditional distribution of $Y_t$ given $\varepsilon_t$ belongs to the exponential family, that $Y_1|\varepsilon_1,\ldots,Y_n|\varepsilon_n$ are independent, and that the latent process satisfies a mixing condition, it is shown that the quasi-likelihood estimators of the regression coefficients are asymptotically normally distributed. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Poisson regression; Estimating equations; Exponential family distribution; Asymptotic distribution; Mixing
1. Introduction

In the last decade, there has been a growing interest in models and methods for analyzing time series of counts when explanatory variables are available. There are many real situations, such as the analysis of disease incidence, where it is essential to take into account the integer-valued character of the data in order to achieve satisfactory modeling. Cox (1981) distinguished between two types of models for time-dependent data: observation-driven and parameter-driven. In models driven by previous observations, the conditional model for $Y_t$ is specified as a function of the past observations $Y_{t-1},\ldots,Y_1$ and possibly also of lagged explanatory variables. In parameter-driven models, an unobservable latent process is assumed to generate the serial correlation. For time series of counts, such an approach has been studied by, for example, Zeger (1988), Chan and Ledolter (1995), Jorgensen et al. (1995) and Davis et al. (1999). In his work on serially correlated count data, Zeger (1988) assumes that the response $Y_t$ follows a log-linear model and that the latent process $\{\varepsilon_t\}$ is a stationary autoregressive one
without specifying the distribution of the $\varepsilon_t$. He further supposes that the conditional distribution of $Y_t$ given $\varepsilon_t$ is Poisson and that the $Y_t|\varepsilon_t$ are independent. Under these assumptions, he derives the quasi-likelihood estimator of the vector of regression coefficients using the estimating equations method (see, e.g., Godambe and Kale, 1991; Heyde, 1993).

In this paper, we consider a more general model than Poisson log-linear regression; we study the generalized linear model assuming only that each observation, conditional on the latent process, has a general exponential family distribution. In Section 2, we present the properties of this general model and describe how the quasi-likelihood estimator of the vector of regression coefficients can be obtained using estimating equations. In Section 3, under fairly weak mixing assumptions on the latent process, we establish the asymptotic normality of the quasi-likelihood estimator. This extends the result stated in Zeger (1988) for Poisson log-linear regression. The proof relies heavily on a one-dimensional central limit theorem for a triangular array of mixing random variables proved by Davidson (1992). A method for constructing asymptotically valid confidence regions for the regression parameters is also indicated.

2. The generalized linear model and the estimation problem

Here we use the notation of McCullagh and Nelder (1989), where the generalized linear model is an extension of the classical linear model. Let $Y = (Y_1,\ldots,Y_n)^T$ denote the response vector, $y$ a realization of $Y$, and $\mu = (\mu_1,\ldots,\mu_n)^T$ the mean vector of $Y$, with each $\mu_t$ satisfying $\mu_t = E(Y_t)$. The matrix $X$ is an $n \times p$ matrix of explanatory variables, $X_t$ denoting the $t$th row of $X$ and $x_{tj}$ denoting the $t$th observation of the $j$th covariate, for $t = 1,\ldots,n$ and $j = 1,\ldots,p$. The vector of regression parameters is denoted by $\beta = (\beta_1,\ldots,\beta_p)^T$, and $\eta = (\eta_1,\ldots,\eta_n)^T = X\beta$ represents the vector of linear predictors, with each $\eta_t$ satisfying

$$\eta_t = \sum_{j=1}^{p} x_{tj}\beta_j = X_t\beta. \qquad (1)$$
The link function $g$, with inverse function $h$ (assumed to be monotone and differentiable), is such that $\eta_t = g(\mu_t)$ and $\mu_t = h(\eta_t) = h(X_t\beta)$. For the classical linear model with normally distributed errors, $g$ is the identity function; for the log-linear model, $g$ is the log function. We suppose that

$$\mathrm{cov}(Y) = V_n, \qquad (2)$$

where $V_n$ is an $n \times n$ positive definite symmetric matrix of known functions of $\mu$, of the variance $\sigma^2$ and possibly of another parameter $\gamma$ which can be multidimensional but does not depend on $\beta$. If the $Y_t$'s are assumed to be independent, then a very simple general model is obtained by supposing that each $Y_t$ has an exponential family distribution whose density can be written as

$$f_{Y_t}(y_t; \theta, \phi) = \exp\left\{\frac{y_t\theta - b(\theta)}{a(\phi)} + c(y_t, \phi)\right\}, \qquad (3)$$

where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are given measurable functions and $\phi$ is a scaling parameter. For example, for the Poisson distribution with parameter $\lambda$: $\theta = \log\lambda$, $\phi = a(\phi) = 1$, $b(\theta) = \exp(\theta)$ and $c(y_t; \phi) = -\log y_t!$. For the negative binomial (with $r$, the number of successes, known and fixed, $p$ the probability of success, and where $Y_t$ counts the number of failures rather than the number of trials before the $r$th success): $\theta = \log(1-p)$, $\phi = a(\phi) = 1$, $b(\theta) = -r\log(1-e^{\theta})$ and $c(y_t; \phi) = \log\binom{y_t+r-1}{r-1}$. Gourieroux et al. (1984), among others, studied inference for Poisson models with specification errors using pseudo maximum likelihood estimation techniques.
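To make the parameterization in (3) concrete, here is a minimal sketch (ours, not from the paper) that encodes the Poisson and negative binomial cases in the $(a, b, c)$ form and checks by simulation that $E(Y_t) = b'(\theta)$ and $\mathrm{var}(Y_t) = a(\phi)\,b''(\theta)$; all function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson(lam) in form (3): theta = log(lam), a(phi) = 1,
# b(theta) = exp(theta), c(y, phi) = -log(y!)
lam = 3.5
theta = np.log(lam)
b_prime = np.exp(theta)        # b'(theta) = exp(theta), the mean
b_second = np.exp(theta)       # b''(theta) = exp(theta), the variance (a(phi) = 1)
y = rng.poisson(lam, size=200_000)
print(y.mean(), b_prime)       # both close to 3.5
print(y.var(), b_second)       # both close to 3.5

# Negative binomial (r successes, success prob p, Y = number of failures):
# theta = log(1 - p), b(theta) = -r log(1 - exp(theta))
r_, p_ = 4, 0.3
theta_nb = np.log(1 - p_)
mean_nb = r_ * np.exp(theta_nb) / (1 - np.exp(theta_nb))    # b'(theta) = r(1-p)/p
var_nb = r_ * np.exp(theta_nb) / (1 - np.exp(theta_nb))**2  # b''(theta) = r(1-p)/p^2
y_nb = rng.negative_binomial(r_, p_, size=200_000)
print(y_nb.mean(), mean_nb)    # both close to r(1-p)/p = 9.33
print(y_nb.var(), var_nb)      # both close to r(1-p)/p^2 = 31.1
```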
Table 1
Variance of the dependent variable $Y_t$ as a function of $\mu_t = E(Y_t)$ and the latent process $\{\varepsilon_t\}$ in the parameter-driven model for various discrete exponential family distributions

Distribution of $Y_t|\varepsilon_t$  | Time-varying parameter                    | $\mathrm{var}(Y_t|\varepsilon_t)$                  | $\mathrm{var}(Y_t)$
Poisson $(\lambda_t)$                 | $\lambda_t = \mu_t\varepsilon_t$          | $\mu_t\varepsilon_t$                               | $\mu_t + \sigma^2\mu_t^2$
Binomial $(n, p_t)$                   | $p_t = \mu_t\varepsilon_t/n$              | $\mu_t\varepsilon_t(n - \mu_t\varepsilon_t)/n$     | $\mu_t + \mu_t^2((n-1)\sigma^2 - 1)/n$
Negative binomial $(r, p_t)$          | $p_t = r/(\mu_t\varepsilon_t + r)$        | $\mu_t\varepsilon_t(\mu_t\varepsilon_t + r)/r$     | $\mu_t + \mu_t^2((r+1)\sigma^2 + 1)/r$
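As a quick sanity check on Table 1 (our illustration, not part of the paper), the Poisson row can be verified by simulation with an i.i.d. latent process of mean 1 and variance $\sigma^2$; a gamma choice for $\varepsilon_t$ is one convenient option.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n_rep = 2.0, 0.25, 500_000

# i.i.d. latent process with E(eps) = 1 and var(eps) = sigma^2 (gamma keeps it positive)
eps = rng.gamma(shape=1 / sigma2, scale=sigma2, size=n_rep)

# Poisson row of Table 1: lambda_t = mu_t * eps_t
y = rng.poisson(mu * eps)
print(y.var(), mu + sigma2 * mu**2)   # both close to 2 + 0.25 * 4 = 3.0
```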
Let $l_t = \log f_{Y_t}(y_t; \theta, \phi)$ denote the log-likelihood function for $Y_t$, whose density is assumed to be arbitrary. The score function, or score, $U(\theta; y_t)$, is defined as the derivative of the log-likelihood function:

$$U(\theta; y_t) = \frac{\partial l_t}{\partial \theta},$$

and its variance $F_t(\theta)$ satisfies

$$F_t(\theta) = \mathrm{var}\left(\frac{\partial l_t}{\partial \theta}\right) = -E\left(\frac{\partial^2 l_t}{\partial \theta^2}\right).$$

For an exponential family distribution, the score function and its variance satisfy

$$U(\theta; y_t) = \frac{y_t - b'(\theta)}{a(\phi)}, \qquad F_t(\theta) = \frac{b''(\theta)}{a(\phi)}. \qquad (4)$$
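For instance, a short simulation (ours, using the Poisson parameterization above) confirms that the score in (4) has mean zero and variance equal to the information $b''(\theta)/a(\phi)$:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.5
theta = np.log(lam)

y = rng.poisson(lam, size=500_000)
score = (y - np.exp(theta)) / 1.0     # U(theta; y) = (y - b'(theta)) / a(phi), a(phi) = 1
print(score.mean())                   # close to 0
print(score.var(), np.exp(theta))     # both close to b''(theta) = lam = 3.5
```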
For $n$ independent observations $Y_1,\ldots,Y_n$, the score $U_n(\theta; y)$ and its variance $F(\theta)$ are equal to the sums of the individual scores and variances, respectively. However, in the dependent case, more sophisticated models have to be constructed. We aimed for both generality and estimability in our model. Following the ideas in Zeger's (1988) log-linear model for serially dependent counts, we propose a more general model for which the parameters can still be easily estimated. This leads us to classical quasi-likelihood methods and estimating equations, which combine the advantages of maximum likelihood and least-squares estimation. For pertinent discussions of these methods, the reader is referred to Heyde (1993), Godambe and Kale (1991) and Liang and Zeger (1995). It was noted by Bradley (1973) and Wedderburn (1974) that if the $Y_t$'s have exponential family distributions, then the score function depends on the parameters only through the means and variances. Wedderburn thus suggested using the exponential family score function even when the underlying distribution is unspecified. In such a case, the estimating function is called a quasi-score estimating function and the derived estimator a quasi-likelihood estimator.

More precisely, following Zeger (1988), let $\{\varepsilon_t\}$ denote an underlying stationary "unobservable" process satisfying

$$E(\varepsilon_t) = 1, \qquad \mathrm{cov}(\varepsilon_t, \varepsilon_{t+\tau}) = \sigma^2\rho(\tau), \qquad (5)$$
where $\rho(\tau)$ denotes the correlation between $\varepsilon_t$ and $\varepsilon_{t+\tau}$. We suppose that $Y_1|\varepsilon_1,\ldots,Y_n|\varepsilon_n$ are independent and that each $Y_t|\varepsilon_t$ has an exponential family distribution with a density function analogous to that defined by Eq. (3) and with

$$E(Y_t|\varepsilon_t) = \mu_t\varepsilon_t, \qquad \mathrm{var}(Y_t|\varepsilon_t) = f(\mu_t, \varepsilon_t), \qquad (6)$$

where $\mu_t$ has been defined previously. If, for example, the link function is $g(\cdot) = \log(\cdot)$, then $Y_t|\varepsilon_t$ follows a log-linear model. Various other examples of link functions are given in Table 1.
In this parameter-driven model, we have that

$$E(Y_t) = \mu_t, \qquad \mathrm{var}(Y_t) = \sigma^2\mu_t^2 + E(\mathrm{var}(Y_t|\varepsilon_t)), \qquad \mathrm{cov}(Y_t, Y_{t+\tau}) = \mu_t\mu_{t+\tau}\sigma^2\rho(\tau), \quad \tau > 0.$$
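The autocovariance formula can be illustrated by simulation. The sketch below is ours; the lognormal AR(1) specification for $\varepsilon_t$ is one convenient stationary choice with $E(\varepsilon_t) = 1$, not something imposed by the paper. It checks $\mathrm{cov}(Y_t, Y_{t+1}) = \mu_t\mu_{t+1}\sigma^2\rho(1)$ for a constant mean $\mu_t = \mu$.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rep, mu, a, v = 400_000, 2.0, 0.6, 0.2   # a: AR(1) coefficient, v: var of log eps

# Stationary Gaussian AR(1) pair (Z_t, Z_{t+1}) with variance v and correlation a
z1 = rng.normal(0.0, np.sqrt(v), n_rep)
z2 = a * z1 + rng.normal(0.0, np.sqrt(v * (1 - a**2)), n_rep)

# Lognormal latent process with E(eps) = 1; then var(eps) = exp(v) - 1 = sigma^2
e1, e2 = np.exp(z1 - v / 2), np.exp(z2 - v / 2)
y1, y2 = rng.poisson(mu * e1), rng.poisson(mu * e2)

sigma2 = np.exp(v) - 1
rho1 = (np.exp(v * a) - 1) / sigma2        # lag-1 correlation of the latent process
print(np.cov(y1, y2)[0, 1])                # empirical cov(Y_t, Y_{t+1})
print(mu * mu * sigma2 * rho1)             # mu_t mu_{t+1} sigma^2 rho(1); both near 0.51
```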
Clearly, the $Y_t$'s are uncorrelated if the latent process is white noise, that is, if $\rho(\tau) = 0$ for $\tau \neq 0$. Estimating equations will now be used to estimate the vector of regression parameters. In the general case, where the $Y_t$ are not necessarily independent, the quasi-score, a vector of length $p$, is defined as

$$U_n(\beta) = D_n^T V_n^{-1}(Y - \mu), \qquad (7)$$

where $D_n$ is an $n \times p$ matrix with element $(t, j)$ given by $\partial\mu_t/\partial\beta_j$ (see McCullagh and Nelder, 1989, Chapter 9). The quasi-likelihood estimating equation for estimating the vector of regression parameters is given by

$$U_n(\hat{\beta}) = 0; \qquad (8)$$

a root $\hat{\beta}$ of this estimating equation is called a quasi-likelihood estimator. In practice, if $U_n$ depends on $\beta$ and on $\gamma$, a vector of parameters independent of $\beta$, a consistent estimator of $\gamma$ is substituted in (8). From (7), it is immediately seen that the vector $U_n(\beta)$ satisfies

$$E(U_n(\beta)) = 0, \qquad \mathrm{cov}(U_n(\beta)) = D_n^T V_n^{-1} D_n. \qquad (9)$$
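One standard way to solve (8) numerically, not spelled out in the paper, is Fisher scoring: iterate $\beta^{(k+1)} = \beta^{(k)} + (D_n^T V_n^{-1} D_n)^{-1} U_n(\beta^{(k)})$. The sketch below is our minimal illustration for the log-linear case $\mu_t = \exp(X_t\beta)$; it assumes $V_n$ (or a consistent estimate of it) is supplied, and the example uses independent Poisson counts as a degenerate special case with $\sigma^2 = 0$.

```python
import numpy as np

def quasi_score_fit(X, y, V, n_iter=25):
    """Solve U_n(beta) = D^T V^{-1} (y - mu) = 0 by Fisher scoring,
    for the log link mu = exp(X beta). V plays the role of V_n in (7)."""
    n, p = X.shape
    V_inv = np.linalg.inv(V)
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        D = mu[:, None] * X                      # D[t, j] = d mu_t / d beta_j
        U = D.T @ V_inv @ (y - mu)               # quasi-score (7)
        B = D.T @ V_inv @ D                      # B_n = cov(U_n), Eq. (12) below
        step = np.linalg.solve(B, U)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:         # root of (8) reached
            break
    return beta, B

# Example: independent Poisson counts, so V_n is diagonal
rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))
V = np.diag(np.exp(X @ beta_true))               # var(Y_t) = mu_t when sigma^2 = 0
beta_hat, B_n = quasi_score_fit(X, y, V)
print(beta_hat)                                  # close to (0.5, 0.3)
```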
Furthermore, expanding $U_n(\beta)$ in a Taylor series around $\beta^*$, with $\rho(\tau)$ and $\sigma^2$ fixed, we obtain that

$$U_n(\beta) = U_n(\beta^*) - C_n(\tilde{\beta})(\beta - \beta^*), \qquad (10)$$

where $C_n(\tilde{\beta}) = -\partial U_n/\partial\beta\,|_{\beta = \tilde{\beta}}$ and, for each $i = 1,\ldots,p$, $\tilde{\beta}_i$ lies between $\beta_i$ and $\beta_i^*$. The $i$th component of $U_n(\beta)$ is given by

$$\sum_{k=1}^{n} \frac{\partial\mu_k}{\partial\beta_i} \sum_{t=1}^{n} w^{kt}(Y_t - \mu_t), \qquad (11)$$

where $w^{kt}$ is the $(k,t)$th term of $V_n^{-1}$. The $(i,j)$th term of $C_n(\beta^*)$ is the derivative of the $i$th term of $-U_n(\beta^*)$ with respect to $\beta_j$ and is given by

$$-\sum_{s=1}^{n}\sum_{t=1}^{n} \frac{\partial w^{st}}{\partial\beta_j}\frac{\partial\mu_s}{\partial\beta_i}(Y_t - \mu_t) - \sum_{s=1}^{n}\sum_{t=1}^{n} w^{st}\frac{\partial^2\mu_s}{\partial\beta_i\partial\beta_j}(Y_t - \mu_t) + \sum_{s=1}^{n}\sum_{t=1}^{n} w^{st}\frac{\partial\mu_s}{\partial\beta_i}\frac{\partial\mu_t}{\partial\beta_j}.$$

Taking expectations, the first two terms disappear; thus $E(C_n(\beta^*)) = D_n^T V_n^{-1} D_n$. Using (9), we will henceforth let

$$B_n = D_n^T V_n^{-1} D_n = \mathrm{cov}(U_n(\beta^*)) = E(C_n(\beta^*)). \qquad (12)$$
From (12), the vector $U_n(\beta^*)$ can be treated as the derivative of a log-likelihood function. As mentioned in McCullagh and Nelder (1989, p. 333), it is to be expected that, under suitable regularity conditions, the root $\hat{\beta}$ of (8) will be asymptotically normally distributed with approximate covariance matrix $B_n^{-1}$. This result is proved in the following section.

3. Asymptotic distribution of the quasi-likelihood estimator

We outline here general conditions under which the quasi-likelihood estimator is asymptotically normal, so that inference about the parameters is feasible. We consider the class of $\alpha$-mixing processes. This notion is fundamental to these results and constitutes a fairly weak assumption. For more information about different types of mixing, the reader can consult Doukhan (1994, Section 1.3).
Definition 1. A sequence of random variables $\{\xi_t,\ t \in \mathbb{Z}\}$ defined on a probability space $(\Omega, \mathcal{B}, P)$ is $\alpha$-mixing if, for all $k \geq 1$ and $n \geq 1$,

$$|P(A \cap B) - P(A)P(B)| \leq \alpha(n) \quad \text{for all } A \in \mathcal{P}_{\xi,k},\ B \in \mathcal{F}_{\xi,n+k},$$

where $\{\alpha(n)\}$ is a sequence of real numbers such that $\alpha(n) \to 0$ as $n \to \infty$, and $\mathcal{P}_{\xi,k}$ and $\mathcal{F}_{\xi,n+k}$ are the $\sigma$-fields generated by $\{\ldots, \xi_{k-1}, \xi_k\}$ and $\{\xi_{n+k}, \xi_{n+k+1}, \ldots\}$, respectively.

Proposition 1. If $\{\varepsilon_t\}$ is an $\alpha$-mixing sequence and $\{Y_t\}$ is a process such that the $Y_t|\varepsilon_t$ are independent, then the process $\{Y_t\}$ is $\alpha$-mixing, with the mixing sequence $\{4\alpha(n)\}$.

Proof. Let $A \in \mathcal{P}_{Y,k}$ and $B \in \mathcal{F}_{Y,n+k}$. Since the $Y_t|\varepsilon_t$ are independent, we can write

$$P(A \cap B) - P(A)P(B) = E(P(A|\mathcal{P}_{\varepsilon,k})P(B|\mathcal{F}_{\varepsilon,n+k})) - E(P(A|\mathcal{P}_{\varepsilon,k}))E(P(B|\mathcal{F}_{\varepsilon,n+k})).$$

Letting $\xi = P(A|\mathcal{P}_{\varepsilon,k})$ and $\eta = P(B|\mathcal{F}_{\varepsilon,n+k})$, we have $|E(\xi\eta) - E(\xi)E(\eta)| \leq 4\alpha(n)$ by Theorem 17.2.1 in Ibragimov and Linnik (1971).

There are many examples of $\alpha$-mixing processes; in particular, ARMA processes are $\alpha$-mixing under fairly weak conditions given by Athreya and Pantula (1986). See also Doukhan (1994, Section 2.4). Besides the mixing condition, we will have to assume that the covariance matrix defined by Eq. (12) satisfies a weak limiting condition; that is, we suppose that the following limit exists:

$$\lim_{n\to\infty} \frac{1}{n} B_n = B. \qquad (13)$$

It is obvious that this condition is satisfied in the case of independent observations. A central limit theorem will be deduced from a multivariate generalization of a one-dimensional result on triangular arrays due to Davidson (1992), Theorem 1 given below. Henceforth, $\overset{D}{\to}$ means "convergence in distribution".
Theorem 1 (Davidson, 1992). Let $\{X_{nt};\ t = 1,\ldots,n;\ n \geq 1\}$ denote a triangular array of random variables defined on the probability space $(\Omega, \mathcal{F}, P)$ such that:
1. $E(X_{nt}) = 0$ and $E(\sum_{t=1}^{n} X_{nt})^2 = 1$.
Also, suppose that the next two assumptions are satisfied.
2. There exist a constant $r > 2$ and a triangular array of positive constants $\{c_{nt}\}$ such that

$$\sup_n \left\{ n \left(\max_{1\leq t\leq n} c_{nt}\right)^2 \right\} < \infty \quad \text{and} \quad \left(E\left|\frac{X_{nt}}{c_{nt}}\right|^r\right)^{1/r} \text{ is uniformly bounded in } t \text{ and } n.$$

3. For each $n$, the sequence $\{X_{nt}\}$ is $\alpha$-mixing for a sequence $\{\alpha(m)\}$ such that $\sum_m \alpha(m)^{r/(r-2)} < \infty$.

Then the following result is true:

$$\sum_{t=1}^{n} X_{nt} \overset{D}{\underset{n\to\infty}{\to}} N(0, 1).$$
Now let us apply this theorem to derive the asymptotic distribution of the quasi-score function $U_n(\beta)$.

Theorem 2. Let $U_n(\beta^*)$ denote the quasi-score function defined by (7) and assume that the following assumptions are satisfied:
1'. There exist positive constants $\delta$ and $K_1$ such that $E|Y_t - \mu_t|^{2+\delta} < K_1$ for all $t$.
2'. Each term of the matrix $D_n^T V_n^{-1}$, as a function of $n$, is bounded in absolute value.
3'. The process $\{\varepsilon_t\}$ is $\alpha$-mixing for a sequence $\{\alpha(m)\}$ such that $\sum_{m\geq 1} \alpha(m)^{(2+\delta)/\delta} < \infty$.
4'. $B$ and $B_n$ are positive definite for all $n$.

Then we have

$$\frac{1}{\sqrt{n}} U_n(\beta^*) \overset{D}{\underset{n\to\infty}{\to}} N_p(0, B),$$

where $N_p(0, B)$ denotes the $p$-dimensional normal distribution with mean vector $0$ and covariance matrix $B$.

Proof. It suffices to show that linear combinations of the components of $(1/\sqrt{n})U_n(\beta^*)$ are asymptotically normally distributed. More precisely, let $\lambda = (\lambda_1,\ldots,\lambda_p)^T$ be a vector of length $p$ such that $\lambda^T\lambda = 1$. For such a $\lambda$, we consider linear combinations of the components of $(1/\sqrt{n})U_n(\beta^*)$ given by (11):

$$\frac{1}{\sqrt{n}}\lambda^T U_n(\beta^*) = \frac{1}{\sqrt{n}} \sum_{k=1}^{p} \lambda_k \sum_{s=1}^{n} \sum_{t=1}^{n} d_{sk}\, w^{st} (Y_t - \mu_t) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \left( \sum_{k=1}^{p} \lambda_k \sum_{s=1}^{n} d_{sk}\, w^{st} \right) (Y_t - \mu_t),$$

where $d_{sk} = \partial\mu_s/\partial\beta_k$. The proof now depends on Davidson's one-dimensional result. Let

$$a_{nt} = \frac{\sum_{k=1}^{p} \lambda_k \sum_{s=1}^{n} d_{sk}\, w^{st}}{\sqrt{\lambda^T B_n \lambda}} = \frac{\lambda^T D_n^T V_n^{-1} e_{nt}}{\sqrt{\lambda^T B_n \lambda}},$$

where $e_{nt}$ is a vector of length $n$ which has the value 1 in the $t$th position and 0 elsewhere. Also, let $X_{nt} = a_{nt}(Y_t - \mu_t)$ for $n \geq 1$ and $t = 1,\ldots,n$. It follows easily that

$$E(X_{nt}) = 0 \quad \text{and} \quad E\left(\sum_{t=1}^{n} X_{nt}\right)^2 = \frac{\lambda^T D_n^T V_n^{-1} D_n \lambda}{\lambda^T B_n \lambda} = 1.$$

Thus assumption 1 of Davidson's theorem is verified. Let $c_{nt} = 1/\sqrt{n}$ and $r = 2+\delta$; then

$$\left(E\left|\frac{X_{nt}}{c_{nt}}\right|^r\right)^{1/r} = \sqrt{n}\,|a_{nt}|\,(E|Y_t - \mu_t|^{2+\delta})^{1/(2+\delta)}.$$

The expectation on the right-hand side is bounded by assumption 1'. It remains to examine

$$\sqrt{n}\,a_{nt} = \frac{\lambda^T D_n^T V_n^{-1} e_{nt}}{\sqrt{\tfrac{1}{n}\lambda^T B_n \lambda}}.$$

It follows from assumption 2' that there exists a positive constant $K_2$ such that the numerator is bounded by $K_2\sqrt{p}$. Now $(1/n)\lambda^T B_n \lambda \to \lambda^T B \lambda > 0$ as $n \to \infty$ by assumption 4' and Eq. (13). Thus there exists a positive integer $n_0$ such that for all $n > n_0$, $\tfrac{1}{n}\lambda^T B_n \lambda > \tfrac{1}{2}\lambda^T B \lambda > 0$. Let $K_3 = \min_{1\leq n\leq n_0}\{\tfrac{1}{n}\lambda^T B_n \lambda\}$, which is positive since the matrices $B_n$ are positive definite; then

$$\left(E\left|\frac{X_{nt}}{c_{nt}}\right|^r\right)^{1/r} < \frac{K_1^{1/(2+\delta)} K_2 \sqrt{p}}{\sqrt{\min\{K_3, \tfrac{1}{2}\lambda^T B \lambda\}}} \quad \text{for all } n, t,$$

and assumption 2 of Davidson's theorem is verified since $\sup_n\{n(\max_{1\leq t\leq n} c_{nt})^2\} = 1$. Proposition 1 implies that the process $\{Y_t\}$ is $\alpha$-mixing with the function $4\alpha(\cdot)$, where $\alpha(\cdot)$ is given by assumption 3'. Since $X_{nt} = a_{nt}(Y_t - \mu_t)$ and $a_{nt}$ is not random, assumption 3 of Davidson's theorem is also verified.
Therefore, it follows from Davidson's theorem that

$$\sum_{t=1}^{n} X_{nt} = \sum_{t=1}^{n} a_{nt}(Y_t - \mu_t) \overset{D}{\to} N(0, 1),$$

and consequently, by Eq. (13), we obtain that

$$\frac{1}{\sqrt{n}}\lambda^T U_n(\beta^*) = \left(\frac{\lambda^T B_n \lambda}{n}\right)^{1/2} \sum_{t=1}^{n} X_{nt} \overset{D}{\to} N(0, \lambda^T B \lambda)$$

for any $\lambda \in \mathbb{R}^p$, and the proof is completed.

The asymptotic distribution of the estimating equation estimator is easily derived from Theorem 2, and we obtain the following result.

Theorem 3. Let $\hat{\beta}_n$ be the estimator obtained from the estimating equation (8) with the quasi-score (7). In addition, let us suppose that the assumptions of Theorem 2 are satisfied as well as the following:
5'. $C_n(\beta)$ is continuous in a neighbourhood of $\beta^*$, the true value of $\beta$;
6'. $B_n^{-1} C_n(\hat{\beta}_n) \to I_p$ in probability, where $I_p$ denotes the identity matrix of dimension $p$.

Then we have

$$\sqrt{n}(\hat{\beta}_n - \beta^*) \overset{D}{\underset{n\to\infty}{\to}} N_p(0, B^{-1}).$$
Proof. The probability that the matrix $C_n(\hat{\beta}_n)$ is non-singular tends to 1 as $n \to \infty$ by assumptions 5' and 6'. Since $\hat{\beta}_n$ satisfies Eq. (8), we have from (10) that $0 = U_n(\hat{\beta}_n) = U_n(\beta^*) - C_n(\tilde{\beta}_n)(\hat{\beta}_n - \beta^*)$. This leads to

$$\hat{\beta}_n - \beta^* = C_n^{-1}(\tilde{\beta}_n) U_n(\beta^*) = C_n^{-1}(\tilde{\beta}_n) B_n B_n^{-1} U_n(\beta^*).$$

From assumption 6' again and from Eq. (13), $\sqrt{n}(\hat{\beta}_n - \beta^*)$ has the same asymptotic distribution as $(1/\sqrt{n}) B^{-1} U_n(\beta^*)$, which is $N_p(0, B^{-1})$ by Theorem 2.

Theorem 3 is an extension of the asymptotic results for log-linear models discussed in Proposition 1 of Zeger (1988). However, for statistical inference about the regression parameters, e.g., in order to construct confidence intervals or tests about these parameters, it is necessary to replace the matrices $V_n$ and $D_n$ by consistent estimators $\hat{V}_n$ and $\hat{D}_n$ obtained, for example, by the method of moments. The following corollary derives a limiting distribution in this case.

Corollary 1. Under the assumptions of Theorem 3, we have that

$$\hat{B}_n^{1/2}(\hat{\beta}_n - \beta^*) \overset{D}{\underset{n\to\infty}{\to}} N_p(0, I_p),$$

where

$$\hat{B}_n = \hat{D}_n^T \hat{V}_n^{-1} \hat{D}_n, \qquad (14)$$

$\hat{V}_n$ and $\hat{D}_n$ denoting consistent estimators of $V_n$ and $D_n$, respectively.

Proof. We have that $\sqrt{n}(\hat{\beta}_n - \beta^*) \overset{D}{\to} Z$, where $Z$ denotes a $p$-dimensional random vector with a $N_p(0, B^{-1})$ distribution. Also, it follows from Eqs. (13) and (14) that $(1/n)\hat{B}_n \to B$ in probability. By Theorem 4.4 of Billingsley (1968), we have that

$$\left(\sqrt{n}(\hat{\beta}_n - \beta^*),\ \frac{1}{n}\hat{B}_n\right) \overset{D}{\underset{n\to\infty}{\to}} (Z, B),$$

and by the continuous mapping theorem, it follows that

$$\left(\frac{1}{n}\hat{B}_n\right)^{1/2} \sqrt{n}(\hat{\beta}_n - \beta^*) \overset{D}{\underset{n\to\infty}{\to}} B^{1/2} Z,$$

where $B^{1/2} Z$ has a $N_p(0, I_p)$ distribution since $B^{1/2} B^{-1} B^{1/2} = I_p$.
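Corollary 1 is what makes Wald-type inference practical: approximate confidence intervals follow from treating $\hat{\beta}_n$ as approximately $N_p(\beta^*, \hat{B}_n^{-1})$. A minimal sketch (ours), reusing the hypothetical quasi_score_fit helper from the Section 2 illustration to supply $\hat{\beta}_n$ and $\hat{B}_n$:

```python
import numpy as np
from scipy.stats import norm

def wald_intervals(beta_hat, B_hat, level=0.95):
    """Componentwise Wald intervals based on Corollary 1:
    beta_hat is approximately N_p(beta*, B_hat^{-1})."""
    cov = np.linalg.inv(B_hat)                 # approximate covariance of beta_hat
    se = np.sqrt(np.diag(cov))                 # standard error of each coefficient
    z = norm.ppf(0.5 + level / 2)              # 1.96 for a 95% interval
    return np.column_stack([beta_hat - z * se, beta_hat + z * se])

# Usage with the output of the earlier sketch:
# beta_hat, B_n = quasi_score_fit(X, y, V_hat)
# print(wald_intervals(beta_hat, B_n))
```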
4. Conclusion

Following the ideas of Zeger (1988), we have introduced a class of generalized linear models for time series of counts in which serial correlation is generated by an unobservable latent process. The marginal expectation, $\mu_t$, depends on the regression parameters but not on the parameters of the latent process. These models can be non-stationary both in mean and in variance, and also allow for overdispersion. Quasi-likelihood estimators of the regression coefficients were obtained by the estimating equations technique, and their asymptotic normality was derived under weak mixing conditions on the latent process.

Acknowledgements

The authors would like to thank a referee for helpful comments.

References

Athreya, K.B., Pantula, S.G., 1986. A note on strong mixing of ARMA processes. Statist. Probab. Lett. 4, 187–190.
Billingsley, P., 1968. Convergence of Probability Measures. Wiley, New York.
Bradley, E.L., 1973. The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. J. Amer. Statist. Assoc. 68, 199–200.
Chan, K.S., Ledolter, J., 1995. Monte Carlo EM estimation for time series models involving counts. J. Amer. Statist. Assoc. 90, 242–252.
Cox, D.R., 1981. Statistical analysis of time series: some recent developments. Scand. J. Statist. 8, 93–115.
Davidson, J., 1992. A central limit theorem for globally nonstationary near-epoch dependent functions of mixing processes. Econom. Theory 8, 313–329.
Davis, R.A., Dunsmuir, W.T.M., Wang, Y., 1999. Modeling time series of count data. In: Ghosh, S. (Ed.), Asymptotics, Nonparametrics and Time Series. Marcel Dekker, New York, pp. 63–113.
Doukhan, P., 1994. Mixing: Properties and Examples. Lecture Notes in Statistics, No. 85. Springer, New York.
Godambe, V.P., Kale, B.K., 1991. Estimating functions: an overview. In: Godambe, V.P. (Ed.), Estimating Functions. Clarendon Press, Oxford, pp. 3–20.
Gourieroux, C., Monfort, A., Trognon, A., 1984. Pseudo maximum likelihood methods: applications to Poisson models. Econometrica 52, 701–720.
Heyde, C.C., 1993. Quasi-likelihood and general theory of inference for stochastic processes. In: Obretenov, A., Stefanov, V.T. (Eds.), Proceedings of the Seventh International Summer School on Probability and Mathematical Statistics, Varna, Bulgaria, 1991. Sci. Cult. Tech. Publ., Singapore, pp. 122–152.
Ibragimov, I.A., Linnik, Y.V., 1971. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen.
Jorgensen, B., Lundbye-Christensen, S., Song, X.-K., Sun, L., 1995. A state space model for multivariate longitudinal count data of mixed types. Technical Report No. 148, Department of Statistics, University of British Columbia.
Liang, K.Y., Zeger, S.L., 1995. Inference based on estimating functions in the presence of nuisance parameters. Statist. Sci. 10, 158–173.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, 2nd Edition. Chapman & Hall, London.
Wedderburn, R.W.M., 1974. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439–447.
Zeger, S.L., 1988. A regression model for time series of counts. Biometrika 75, 621–629.