Modelling and analysis of count data using a renewal process

Modelling and analysis of count data using a renewal process

STATISTICS& PROBABILITY LETTERS ELSEVIER Statistics & ProbabilityLetters 31 (1996) 129-132 Modelling and analysis of count data using a renewal proc...

222KB Sizes 0 Downloads 18 Views

STATISTICS& PROBABILITY LETTERS ELSEVIER

Statistics & ProbabilityLetters 31 (1996) 129-132

Modelling and analysis of count data using a renewal process M.J. Faddy Centre for Statistics, Department of Mathematics, The University of Queensland, Brisbane, Qld. 4072, Australia Received March 1995; revised November 1995

Abstract

The Poisson process may be generalised by an ordinary renewal process with the times between renewals following a gamma distribution. Over- (under-) dispersion of the distribution of the number of renewals relative to the Poisson distribution is the result, according as the coefficient of variation of this gamma distribution is > 1 ( < 1). This distribution of the number of renewals is then applied to analyse some data on seizure counts of epileptics. Keywords: Poisson distribution; Renewal process; Over-dispersion; Under-dispersion; Data analysis

I. Introduction

Dispersion modelling was a topic much in evidence from papers presented at the XVIIth International Biometric Conference; for example, Dean and Balshaw (1994) considered random effects models, and Smith (1994) discussed simple methods for compensating for over-dispersion in generalised linear models. Also, Faddy (1994) approached the problem of over- and under-dispersion relative to the Poisson distribution by changing the rate of the underlying Poisson process of 'events' from a constant to one dependent on the accumulated number of events that have occurred. The purpose of the present paper is to consider another generalisation of the Poisson process, namely the renewal process (Cox and Miller, 1965, Ch. 9); such a generalisation will also admit distributions of the number of renewals (events) which can be over-dispersed or under-dispersed relative to the Poisson distribution. The models will then be applied to an analysis of some data on numbers of epileptic seizures reported by subjects in a study discussed by Thall and Vail (1990). These data were also analysed in Breslow and Clayton (1993) using a generalised linear mixed model, which allowed for greater between-subject variation (than Poisson) by including some random effects in the linear predictor. The use of a renewal process to model the data essentially allows for this between-subject variation by having a different between-seizure (event) time distribution for each of the subjects in the study: some distributions will admit 'bursty' sequences of events with clusters close together in time separated by longer time periods, while others will result in more regular sequences of events. It is thus how these distributions vary with respect to the covariates measured in the study that will be the main feature of this analysis. Although these models might be considered non-standard by many statisticians, they can be amenable to simple numerical techniques for computation of the resulting probabilities and fitting to data, such as those 0167-7152/96/$12.00 © 1996 Elsevier Science B.V. All rights reserved PH S0167-7152(96)00023-5

130

M.J. Faddy I Statistics & Probability Letters 31 (1996) 129-132

provided by the package M A T L A B ® (1992).

2. The renewal process Cox and Miller (1965, Ch. 9) describe renewal processes; briefly, an ordinary renewal process is a point process such that, starting from 0, points (or events) occur with the sequence of times between these points being independent identically distributed continuous random variables, /'1, T2, T3..... say. The probability distribution of the number of events, X ( t ) , in the time interval (0, t) may be written in terms of the time Sn = T1 + T2 + "'" "+- T~ of occurrence of the nth event: P { X ( t ) = 0} = 1 - P ( S 1 <.t), P { X ( t ) = n} = P ( S n <<.t) - P(Sn+1 <~t),

for n~> 1.

(2.1)

If the individual T/'s are gamma distributed with probability density function, for t > 0: a b t b - I e - at

f ( t ; a,b) -

r(b)

(2.2)

,

then S, is also gamma distributed, with probability density function f ( t ; a, nb). The gamma distribution (2.2) has mean b p = E(Ti) = -

(2.3)

a

and variance b cr2 = Var(Ti) = ~-

(2.4)

leading to coefficient of variation 1/x/b. For b << 1, the shape of the distribution with a mode at 0 and a long tail is such as to produce a 'bursty' sequence of points in the renewal process, while for b >> 1 the shape is more symmetric producing a more regular sequence of points; b = 1 results in the exponential distribution for the inter-event times, and the renewal process corresponds to a simple Poisson process of 'rate' a, and Poisson distribution (mean at) for X ( t ) . Exact results, in general, from (2.1) for the distribution of X(t), the number of events in (0,t), are difficult to obtain, although the probabilities may be readily calculated numerically using the gamma distribution (2.2). However, asymptotic results for large t are available (Cox and Miller, 1965, Ch. 9) for the moments of X(t): t E { X ( t ) } ~ -,#

t72t Var{X(t)} ,~ #3'

(2.5)

where # and a 2 are the mean and variance of the inter-event times Tg; for simplicity, these asymptotic forms do not involve intercept terms. Eq. (2.5) is suggestive of Var{X(t)} > E { X ( t ) }

if a > #,

Var{X(t)} < E { X ( t ) }

if a < #.

Or over-dispersion (under-dispersion) relative to the Poisson distribution if the coefficient of variation of the inter-event time distribution a/# > 1 ( < 1). For the gamma distribution (2.2), these conditions reduce to bl).

M.J. Faddy I Statistics & Probability Letters 31 (1996) 129-132

131

Using the asymptotic moments (2.5) the mean and variance of X (t ) may be simultaneously modelled in terms of a vector x of covariates by writing tp =exp{xTat}

and

tr2t ~-T

=

exp{xrp},

where a and/] are vectors of coefficients. Bringing in the gamma moments (2.3) and (2.4), these reduce to 1 a = - exp{xT(2a -- fl)},

(2.6)

t

b -- exp{xT(~, --/~)}.

(2.7)

Here the quantities at and b do not depend on time t: this is entirely consistent as the probability distribution of X(t) (2.1) depends only on the inter-event time distribution (2.2) through this combination (b and at) of parameters. Given data on observed counts together with covariate values, the parameter vectors a and fl can be estimated by maximum likelihood using (2.6) and (2.7) to specify the gamma distribution (2.2) from which probabilities (2.1) may be calculated in terms of the incomplete gamma function, and the resulting likelihood maximised using the Nelder-Mead simplex algorithm: all this can be readily done using the M A T L A B ® (1992) package. Notice that although the distributional results (2.1) and (2.2) are exact, the likelihood as a function of ~ and /~ will be approximate due to use of (2.5). Also, the question of identifiability of the model parameters arises: it seems likely that if there is sufficient variation in the covariates, then parameter estimation should be quite possible, as with the example of the next section.

3. Some data Thall and Vail (1990) discuss some data on numbers of seizures over two-week periods for a group of 59 epileptics, 31 of whom received a treatment (progabide) and the remaining 28 a placebo. Other covariates were age and baseline seizure counts over an eight-week period prior to treatment. The post-treatment counts were measured over four successive two-week periods, and here the counts in the fourth period are taken as the response, with treatment, log(baseline count) and log(age) as the covariates. The renewal process defined by (2.1), (2.2), (2.6) and (2.7) with time t fixed arbitrarily at 1, was fitted to these data by maximum likelihood, with the following results: log-likelihood = -149.404 with all three covariates -149.955 with log(baseline count) and treatment -151.699 with log(baseline count) only -180.192 with no covariates. Age thus does not contribute much! The treatment has little effect on the variance (log-likelihood = -149.975, when this covariate only affects the mean), but some effect on the mean: 2 × log-likelihood ratio = 3.45 on 1 d.f. with a p-value of approximately 0.03 from a one-sided test for a positive treatment effect. The baseline seizure count is not surprisingly the most significant covariate, as this gives an indication of the severity of each subject's epilepsy: its effect is on both the mean (2x log-likelihood ratio = 10.01 on 1 d.f.) and variance (2× log-likelihood ratio = 19.36 on 1 d.f.). These results are broadly comparable with those reported in Breslow and Clayton (1993), except for the separation of the effects of the covariates on the mean and variance of the seizure counts. A glance at the estimates of the parameters a and b of the gamma distribution (2.2) from the fired forms (2.6) and (2.7) shows

132

M.J. Faddy / Statistics & Probability Letters 31 (1996) 129-132

that for all but one o f the 59 subjects b < 1; the median o f these estimates o f b is 0.37, and the quartiles 0.27 and 0.57, giving an indication of the degree o f over-dispersion relative to the Poisson distribution that is apparent in these data. But this approach of using a renewal process does admit the alternative possibility o f under-dispersion; this feature, and the above-mentioned separation o f covariate effects on the mean and variance of the counts are thus the principal features offered by this approach when compared with the mixed model o f Breslow and Clayton (1993). Also, it is perhaps worth pointing out that in some contexts, such as epilepsy, the distribution o f times between 'events' might itself be a useful measure to estimate: this renewal theory approach provides such estimates. Finally, saddle-point methods (Smith, 1958) could be used on the distribution of the number of renewals (2.1) if an analytical approximation is required, or if a distribution other than gamma for the inter-event times is used.

References Breslow, N.E. and D.G. Clayton (1993), Approximate inference in generalised linear mixed models, J. Amer. Statist. Assoc. 88, 9-25. Cox, D.R. and H.D. Miller (1965), The Theory o f Stochastic Processes (Methuen, London). Dean, C.B. and R. Balshaw (1994), Efficiency of the analysis of aggregate data in Poisson processes, Proc. XVllth Internat. Biometric Conf. 2, 190. Faddy, M.J. (1994), Modelling and analysis of under- and over-dispersed Poisson-like data, Proc. XVllth Internat. Biometric Conf. 2, 24. M A T L A B ® Reference Guide (1992), The MathWorks Inc., Natick, Mass. Smith, P.J. (1994), Overdispersion in generalised linear models, Proc. XV11th Internat. Biometric Conf. 2, 192. Smith, W.L. (1958), Renewal theory and its ramifications, Z Roy. Statist. Soc. Ser. B 20, 243-302. Thall, P.F. and S.C. Vail (1990), Some covariance models for longitudinal data with over-dispersion, Biometrics 46, 657~71.