Mathematics and Computers in Simulation 69 (2005) 4–11
A rich family of generalized Poisson regression models with applications S. Bae a , F. Famoye b , J.T. Wulu c , A.A. Bartolucci d , K.P. Singh a,∗ a
Department of Biostatistics, School of Public Health, University of North Texas Health Science Center, Fort Worth, TX 76107-2699, USA b Department of Mathematics, Central Michigan University, Mount Pleasant, MI 48859, USA c Bureau of Primary Care, Health Resources and Services Administration, Department of Health and Human Services, Bethesda, MD 20814, USA d Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-2699, USA Available online 11 March 2005
Abstract The Poisson regression (PR) model is inappropriate for modeling over- or under-dispersed (or inflated) data. Several generalizations of the PR model have been proposed for modeling such data. In this paper, a rich family of generalized Poisson regression (GPR) models is reviewed in detail. The family has a wide range of applications in various disciplines including agriculture, econometrics, patent applications, species abundance, medicine, and use of recreational facilities. To illustrate the usefulness of the family, several applications in different situations are given. For example, hospital discharge counts are modeled using GPR and other generalized models, and the fitted models show that household size, education, and income are positively related to diagnosis-related group (DRG) hospital discharges. One advantage of using the family is that it lets the data determine which model is appropriate for a given situation. It is expected that the results discussed in the paper will enhance our understanding of various forms of count data originating from primary health care facilities and medical domains. © 2005 IMACS. Published by Elsevier B.V. All rights reserved. Keywords: Poisson regression; Dispersion; Negative binomial regression; Sex partners; Hospital discharge
∗
Corresponding author. E-mail address:
[email protected] (K.P. Singh).
0378-4754/$30.00 © 2005 IMACS. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.matcom.2005.02.026
S. Bae et al. / Mathematics and Computers in Simulation 69 (2005) 4–11
1. Introduction
The generalized Poisson regression (GPR) model proposed by [5] is used to model count data that are affected by a number of known predictor variables. The model is based on the generalized Poisson distribution, which has been studied extensively; the reader is referred to [4] and the references therein for more details. Count data with too many zeros are common in a number of applications. Ridout et al. [24] cited examples of data with too many zeros from various disciplines including agriculture, econometrics, patent applications, species abundance, medicine, and use of recreational facilities. Several models have been proposed to handle count data with too many zeros: Lambert [20] described the zero-inflated Poisson regression (ZIPR) model with an application to defects in manufacturing; [16] described the zero-inflated binomial regression (ZIBR) model and incorporated random effects in the ZIPR and ZIBR models; and [21] generalized the ZIPR model to accommodate the extent of individual exposure. Other models in the literature include the hurdle model [23], the two-part model [17], and the semi-parametric model [14]. Details of these models can be found in [24] and additional references on ZIPR models can be found in [2]. A feature of many count data sets is the joint presence of excess zeros and long right tails, both relative to the Poisson assumption [15]. Both features may be accounted for by over-dispersion in the data. Excess zeros can occur as a result of clustering. Over-dispersion tends to increase the proportion of zeros, and whenever there are too many zeros relative to the Poisson assumption, the negative binomial regression and generalized Poisson regression models tend to improve the fit. For a better fit, an over-dispersed model that also incorporates excess zeros should serve as an alternative. Gurmu and Trivedi [15] illustrated this point.
They found that the negative binomial hurdle model, which allows for over-dispersion and also accommodates the presence of excess zeros, was the most appropriate among all the models they considered. Also, [24] considered various ZIPR models for an apple shoot propagation data set. They concluded that the ZIPR models were inadequate for the data, as there was still evidence of over-dispersion, and went on to fit zero-inflated negative binomial models. Gupta et al. [13] studied the zero-adjusted generalized Poisson distribution. They estimated the model parameters by the method of maximum likelihood and studied the effect of not using an adjusted (inflated or deflated) model when the occurrence of zeros differs from what is expected. They showed that more errors are committed for small values of the count if the adjustment is ignored. They noted that the zero-adjusted generalized Poisson distribution fitted the fetal movement data and the death notice data of the London Times very well. However, there are situations where non-zero numbers occur too often. For such situations, [10] proposed k-inflated generalized Poisson regression (k-IGPR) models. This paper discusses these models as a rich family of generalized Poisson regression models and illustrates the usefulness of the family by modeling several data sets including hospital discharge counts.
2. A family of generalized Poisson regression models
A broad generalization of the Poisson regression (PR) model is given as follows:
\[
P(Y = y_i \mid x_i, z_i) =
\begin{cases}
\varphi_i + (1 - \varphi_i)\, f(k; \mu_i, \alpha), & y_i = k, \\
(1 - \varphi_i)\, f(y_i; \mu_i, \alpha), & y_i \neq k,
\end{cases}
\tag{2.1}
\]
where 0 < ϕi < 1 and f(yi; µi, α), yi = 0, 1, 2, . . ., is the generalized Poisson regression model given by [9], which has been used in modeling various data sets [10,25,29] and is given by:
\[
f(y_i; \mu_i, \alpha) = \left(\frac{\mu_i}{1 + \alpha\mu_i}\right)^{y_i} \frac{(1 + \alpha y_i)^{y_i - 1}}{y_i!} \exp\left[\frac{-\mu_i (1 + \alpha y_i)}{1 + \alpha\mu_i}\right],
\tag{2.2}
\]
where yi = 0, 1, 2, . . ., $\mu_i = \mu_i(x_i) = \exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)$, xi = (xi1 = 1, xi2, xi3, . . ., xip) is the ith row of the covariate matrix X, and β = (β1, β2, . . ., βp) is an unknown p-dimensional vector of parameters. The mean of Yi is given by µi(xi) and the variance of Yi is given by µi(1 + αµi)^2. The GPR model in (2.2) is an extension of the Poisson regression model given by [11]. The parameter α measures the dispersion. When α = 0, the model in (2.2) reduces to the Poisson regression (PR) model. When α > 0, the model represents count data with over-dispersion and when α < 0, the GPR model represents count data with under-dispersion. As discussed by [10], the generalization in (2.1) allows for a decreasing proportion of k-values if $-f(k; \mu_i, \alpha)\,[1 - f(k; \mu_i, \alpha)]^{-1} < \varphi_i < 1$. The functions µi = µi(xi) and ϕi = ϕi(zi) satisfy $\log(\mu_i) = \sum_{j=1}^{p} x_{ij}\beta_j$ and $\operatorname{logit}(\varphi_i) = \log\left(\frac{\varphi_i}{1 - \varphi_i}\right) = \sum_{j=1}^{m} z_{ij}\delta_j$, where zi = (zi1 = 1, zi2, zi3, . . ., zim) is the ith row of the covariate matrix Z and δ = (δ1, δ2, . . ., δm) is an unknown m-dimensional vector of parameters. In this setup, the functions ϕi and µi are modeled via the logit and log link functions, respectively, so that logit(ϕi) and log(µi) are linear functions of the covariates. The generalization has been found appropriate for situations with too many k-values. The generalization is broad and most generalizations of the PR model are special cases of it. Therefore, it can be called a rich family of generalized Poisson regression models. The mean and variance of the family are given, respectively, by:
\[
E(y_i \mid x_i) = \varphi_i k + (1 - \varphi_i)\,\mu_i(x_i)
\tag{2.3}
\]
and
\[
V(y_i \mid x_i) = \varphi_i (1 - \varphi_i)(k - \mu_i)^2 + (1 - \varphi_i)\,\mu_i (1 + \alpha\mu_i)^2.
\tag{2.4}
\]
When k = 0, the family reduces to the zero-inflated generalized Poisson regression (ZIGPR) model. It reduces to the GPR model when ϕi = 0, and to the k-inflated Poisson regression model when α = 0. For positive values of ϕi, it represents the k-inflated generalized Poisson regression model and for negative values of ϕi, it represents the k-deflated generalized Poisson regression model. The k-deflation cases rarely occur in practice. The reader is referred to [10] for a detailed discussion.
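As a quick numerical check of (2.2)–(2.4), the following minimal Python sketch evaluates the GPR probability function and the k-inflated family, then compares the numerically summed mean and variance with the closed forms. The function names (`gpr_pmf`, `k_inflated_pmf`, `moments`), the parameter values, and the truncation point of the sums are illustrative assumptions and do not come from the original paper.

```python
import math

def gpr_pmf(y, mu, alpha):
    """Generalized Poisson probability f(y; mu, alpha) of Eq. (2.2)."""
    if 1.0 + alpha * y <= 0.0:
        return 0.0  # assumption: zero mass outside the support when alpha < 0
    log_f = (y * (math.log(mu) - math.log1p(alpha * mu))
             + (y - 1) * math.log1p(alpha * y)
             - math.lgamma(y + 1)
             - mu * (1.0 + alpha * y) / (1.0 + alpha * mu))
    return math.exp(log_f)

def k_inflated_pmf(y, mu, alpha, phi, k):
    """k-inflated family of Eq. (2.1): extra mass phi placed at y = k."""
    base = (1.0 - phi) * gpr_pmf(y, mu, alpha)
    return phi + base if y == k else base

def moments(pmf, upper=400):
    """Total mass, mean and variance by direct summation over 0..upper-1."""
    probs = [pmf(y) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

# Hypothetical parameter values chosen only for illustration.
mu, alpha, phi, k = 3.0, 0.1, 0.2, 2
total, mean, var = moments(lambda y: k_inflated_pmf(y, mu, alpha, phi, k))

mean_formula = phi * k + (1 - phi) * mu                      # Eq. (2.3)
var_formula = (phi * (1 - phi) * (k - mu) ** 2
               + (1 - phi) * mu * (1 + alpha * mu) ** 2)     # Eq. (2.4)
print(total, mean, mean_formula, var, var_formula)
```

Setting `phi = 0` in this sketch recovers the plain GPR model, and setting `alpha = 0` recovers the k-inflated Poisson case, mirroring the special cases listed above.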
3. Parameter estimation in the family
The log-likelihood function for the family of generalized Poisson regression models defined in (2.1) is given by:
\[
\log(L) = -\sum_{i=1}^{n} \log(1 + \omega_i) + \sum_{y_i = k} \log[\omega_i + f(k; \mu_i, \alpha)] + \sum_{y_i \neq k} \log[f(y_i; \mu_i, \alpha)],
\tag{3.1}
\]
where $\omega_i = \varphi_i/(1 - \varphi_i) = \exp\left(\sum_{j=1}^{m} z_{ij}\delta_j\right)$. The likelihood equations for the family are given in [10]. Fisher's information matrix is obtained by taking the expectations of minus the second derivatives. Based on the asymptotic normality of the maximum likelihood estimator $(\hat{\beta}, \hat{\delta}, \hat{\alpha})$, inferences on the regression coefficients and the dispersion parameter can be made. The Newton–Raphson algorithm is used to find the solutions of the likelihood equations for the model.
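To make the structure of (3.1) concrete, the sketch below evaluates the log-likelihood in plain Python and checks that it agrees with summing the logarithms of the probabilities in (2.1) directly. It only evaluates the function and does not implement the Newton–Raphson solver mentioned above. The toy data, the coefficient values, and the function names are hypothetical, and α ≥ 0 is assumed so that every count lies in the support.

```python
import math

def gpr_log_pmf(y, mu, alpha):
    """log f(y; mu, alpha) from Eq. (2.2); assumes 1 + alpha*y > 0."""
    return (y * (math.log(mu) - math.log1p(alpha * mu))
            + (y - 1) * math.log1p(alpha * y)
            - math.lgamma(y + 1)
            - mu * (1.0 + alpha * y) / (1.0 + alpha * mu))

def linear_predictor(coefs, row):
    return sum(c * v for c, v in zip(coefs, row))

def log_likelihood(beta, delta, alpha, y, X, Z, k):
    """Eq. (3.1), written in terms of omega_i = exp(z_i' delta)."""
    total = 0.0
    for y_i, x_i, z_i in zip(y, X, Z):
        mu_i = math.exp(linear_predictor(beta, x_i))      # log link
        omega_i = math.exp(linear_predictor(delta, z_i))  # logit link on phi_i
        total -= math.log1p(omega_i)
        if y_i == k:
            total += math.log(omega_i + math.exp(gpr_log_pmf(k, mu_i, alpha)))
        else:
            total += gpr_log_pmf(y_i, mu_i, alpha)
    return total

def log_likelihood_direct(beta, delta, alpha, y, X, Z, k):
    """Same quantity, obtained by summing the log of Eq. (2.1) term by term."""
    total = 0.0
    for y_i, x_i, z_i in zip(y, X, Z):
        mu_i = math.exp(linear_predictor(beta, x_i))
        omega_i = math.exp(linear_predictor(delta, z_i))
        phi_i = omega_i / (1.0 + omega_i)
        f_i = math.exp(gpr_log_pmf(y_i, mu_i, alpha))
        p_i = phi_i + (1.0 - phi_i) * f_i if y_i == k else (1.0 - phi_i) * f_i
        total += math.log(p_i)
    return total

# Hypothetical toy data: six counts, intercept plus one covariate in X,
# intercept only in Z, and inflation at k = 2.
y = [0, 2, 2, 5, 1, 2]
X = [[1, 0.5], [1, -1.0], [1, 0.0], [1, 1.2], [1, -0.3], [1, 0.8]]
Z = [[1]] * 6
beta, delta, alpha, k = [0.8, 0.3], [-1.0], 0.05, 2

ll = log_likelihood(beta, delta, alpha, y, X, Z, k)
ll_direct = log_likelihood_direct(beta, delta, alpha, y, X, Z, k)
print(ll, ll_direct)
```

In practice this function would be passed to a numerical optimizer over (β, δ, α). Working with ωi rather than ϕi keeps the inflation parameters unconstrained on the real line.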
4. Score test for k-inflation in the family
In this section, we discuss a score test to check whether the number of k-values is too large for a generalized Poisson regression (GPR) model to fit the data adequately. Famoye and Singh [10] considered the situation where there are n observations, nk of which are k-values. The score function U(β, α, 0) and the expected information matrix I(β, α, 0) can be calculated from the log-likelihood in (3.1). A score test in the family has the advantage that one needs to fit just the GPR model given in (2.2), which is the model under the null hypothesis. The score statistic for testing whether the GPR model fits the number of k-values well is given by:
\[
S(\hat{\beta}, \hat{\alpha}) = S(\hat{\beta}, \hat{\alpha}, 0) = U^{\top}(\hat{\beta}, \hat{\alpha}, 0)\,[I(\hat{\beta}, \hat{\alpha}, 0)]^{-1}\,U(\hat{\beta}, \hat{\alpha}, 0),
\tag{4.1}
\]
where $\hat{\alpha}$ and $\hat{\beta}$ are the maximum likelihood estimates of α and β under the null hypothesis of the GPR model. The elements of the score function U(β, α, 0) are given in Famoye and Singh [10]. Under the null hypothesis, the score statistic in (4.1) has an asymptotic chi-square distribution with 1 degree of freedom.
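For intuition about what the score measures, consider the simplified case of a constant inflation parameter, ωi = ω for all i. This is an assumption made here for illustration, and the parametrization used in [10] may differ. Differentiating (3.1) with respect to ω and evaluating at ω = 0 gives:

```latex
\[
\frac{\partial \log L}{\partial \omega}
  = -\frac{n}{1 + \omega} + \sum_{y_i = k} \frac{1}{\omega + f(k; \mu_i, \alpha)},
\qquad
\left. \frac{\partial \log L}{\partial \omega} \right|_{\omega = 0}
  = \sum_{i=1}^{n} \left[ \frac{I(y_i = k)}{f(k; \mu_i, \alpha)} - 1 \right],
\]
```

where I(·) is the indicator function. Under the GPR null hypothesis, E[I(yi = k)] = f(k; µi, α), so this component has expectation zero. A large positive value indicates more k-values in the data than the fitted GPR model predicts.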
5. Goodness-of-fit test
A measure of goodness-of-fit may be based on the log-likelihood statistic. The family reduces to the k-inflated Poisson regression (k-IPR) model when the dispersion parameter α = 0. To assess the adequacy of the k-IGPR model over the k-IPR model, one may test the hypothesis H0: α = 0 against Ha: α ≠ 0. The inclusion of the dispersion parameter α in the regression model is justified when H0 is rejected (Famoye and Singh [10]). To test the null hypothesis, one can use the likelihood ratio statistic. Alternatively, one can use the asymptotic Wald t-statistic.
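The two tests of H0: α = 0 can be sketched in a few lines of Python. The sketch assumes the likelihood ratio statistic is referred to a chi-square distribution with 1 degree of freedom and that the Wald statistic is compared with a standard normal distribution (a large-sample stand-in for the t reference mentioned above). The log-likelihood values, estimate, and standard error are hypothetical numbers, not results from the paper.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def likelihood_ratio_test(loglik_null, loglik_alt):
    """LR statistic 2*(l_alt - l_null) and its 1-df chi-square p-value."""
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    return stat, chi2_sf_1df(stat)

def wald_test(estimate, std_error):
    """Wald statistic and two-sided p-value (normal approximation)."""
    z = estimate / std_error
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical fitted values: k-IPR (alpha fixed at 0) versus k-IGPR.
lr_stat, lr_p = likelihood_ratio_test(loglik_null=-512.4, loglik_alt=-505.1)
wald_z, wald_p = wald_test(estimate=0.196, std_error=0.1)
print(lr_stat, lr_p, wald_z, wald_p)
```

A small p-value from either test would support keeping the dispersion parameter α in the model.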
6. Applications
6.1. Application to sex partners data
Note that for positive values of ϕi, the family represents the k-inflated generalized Poisson regression models and for negative values of ϕi, it represents the k-deflated generalized Poisson regression models. The k-deflation cases rarely occur in practice. Famoye and Singh [10] used the k-inflated Poisson regression model to model sex partner data from the NHANES III household survey interview (NHANES III 1988–1994, October 1996, http://www.cdc.gov/nchs/nhanes.htm).
The k-IGPR models gave a significant negative relationship between the number of sex partners and race, indicating that non-whites tend to have more sex partners than whites. There was a significant positive relationship between the number of sex partners and gender, which implies that males tend to have more sex partners than females. The independent variables education, age, work, health, and poverty income ratio had no significant relationship with the number of sex partners. The variable marital status had a significant negative relationship with the number of sex partners: subjects who were married or living as married had fewer sex partners than other subjects. The variables drink and marijuana had significant positive relationships with the number of sex partners. That is, subjects who had more than 9 drinks per day tended to have more sex partners, as did subjects who had used marijuana.
6.2. Application to farm injury data
When ϕi = 0, the family of GPR models given in (2.1) reduces to the GPR model of [9]. Wulu et al. [29] used the GPR model to analyze farm injury data from a study assessing the effects of risk factors, conducted jointly by the University of Alabama at Birmingham, USA, and Tennessee State University, USA. The results suggested that race, gender, educational level, and marital status were positively associated with agricultural injuries; that there was a significant inverse relationship between seat belt use and farm injury, with consistent seat belt users less likely to sustain farm injuries; and that having some form of medical condition appeared to have a significant positive impact on the frequency of farm injuries. The modeling results also showed that respondents with more hours (i.e., 4–8) of farm safety training showed a significant reduction in the number of injuries reported.
The estimated goodness-of-fit measures showed that the GPR models outperformed the NBR and PR models, and the GPR model's predictions of the frequency of farming injuries were consistent and supported the application of GPR models.
6.3. Application to hospital discharge count data
Studies have indicated variations, and reasons for them, in the utilization of hospital services and health care facilities by community residents. Several authors studied surgical and medical discharge abstracts from 1985 to 1987 for Maryland patients admitted to acute care hospitals. Their study indicated that race, income, and education are significant factors in differential hospital utilization rates [7,12]. The authors found that admission rates for medical reasons declined with increasing community income levels and were elevated in blacks. Kudur and Demlo [19] found that average income levels in areas with high admission rates were significantly lower than in areas with low admission rates. Wilson and Tedeschyi [27] found income levels positively associated with surgical discharge rates and a positive association between the percent of Medicaid population and medical discharge rates. Wennberg and Freeman [26] found, in a population-based study of the relationship between hospitalization rates for avoidable hospital condition (AHC) cases and median household income, a consistent correlation between low income and high rates of hospitalization. deShazo [7] noted that the random nature of the discharge count data in his research suggests fitting the pure Poisson model, because the data indicate the average number of discharges per person per time interval. He indicated that the goodness-of-fit of the pure Poisson model for the count data was poor, observed that the count data showed extra-Poisson variation, and thus decided to fit a log-linear model. deShazo argued that the log-linear model was appropriate for this purpose
since it can be extended to include measured community characteristics in a regression model, while still yielding an estimate of the amount of systematic variability beyond that predicted by the factors included in the model. The results of deShazo’s research support the hospital discharge findings of [19,26–28]. The hospital discharge data set originates from the Finite Information Systems for Hospitals (FISH) Database and is published by the Birmingham Regional Hospital Council in Alabama. deShazo [7] constructed and analyzed data sets from the FISH database. The data set used in deShazo’s research contains inpatient hospital utilization rates, 86,726 inpatient discharge counts, socio-demographic and socio-economic factors (e.g., age, race, education, income, sex, and level of poverty). The inpatient hospital discharges during a period of four years (1987–1990), in two Alabama counties (Jefferson and Shelby) were classified by deShazo [7] according to diagnosis related groups (DRGs) and small areas (or zip codes). Using the age-specific hospital utilization rates for the two counties, deShazo applied indirect standardization methods to obtain adjusted or expected discharge counts for all zip codes. The overall age-adjusted rates per 1000 for LV, AHC, and ACS discharges over the four-year period were 7.0, 16.1, and 5.7, respectively. Wulu [30] modeled hospital discharge counts using Poisson regression (PR) model, generalized Poisson regression (GPR) model and negative binomial (NB) model. The author addressed the following questions. (i) What is the relationship between hospital DRGs counts and socio-demographic covariates such as household income, race/ethnicity, education, family size, financial status, and population size for specific areas? (ii) Which covariates (factors or variables) make significant contributions to explain the particular hospital discharge count distributions? 
The variables considered by [30] were as follows: number of observed LV discharges (OBSLV); number of observed ACS discharges (OBSACS); number of observed AHC discharges (OBSAHC); number of adjusted or expected LV discharges (ADJLV); number of adjusted or expected ACS discharges (ADJACS); number of adjusted or expected AHC discharges (ADJAHC); low variation (LV) discharge rate (LVRATE); ambulatory care sensitive (ACS) discharge rate (ACSRATE); avoidable hospital condition (AHC) (THH90); median 1990 household income (HHINC90); percent of total black population (PCTBLACK); percent of households with less than US$ 7000 annual income (LT7K); percent of adults with less than high school education (PCTLTHSG). The issue of hospital discharge is important because it is economically driven and, at the same time, has significant implications for the accessibility and quality of care. Studies have shown that early discharge can lead to missed diagnosis [18] and hospital readmission [3], factors that influence the economics of healthcare. Thus, it is universally believed that discharge at any time, particularly early discharge, without careful assessment of the complexity of the medical and psychosocial factors is unacceptable health care [18]. In the public health and medical literature, a number of studies have examined racial disparity in early hospital discharge. For example, [22] examined the relationship between race and discharge against medical advice from hospitals. The investigators found that in 1990 there were an estimated 241,911 discharges against medical advice, accounting for 0.92% of all live discharges. African–American patients were 1.78 times more likely than White patients to be discharged against medical advice [22]. These early discharges may be due to minorities being less satisfied with the medical care they received.
Among Medicare beneficiaries, African–Americans and Hispanic patients report less satisfaction with care than White patients, and African–American patients report less confidence in their physicians [7]. Blendon et al. [1] revealed that racial differences in dissatisfaction with care are more visible in the inpatient setting
than in the outpatient setting. Dissatisfaction with inpatient care may lead a patient to leave a hospital against medical advice. Such discharges tend to place patients at risk for adverse medical outcomes, disrupt the therapeutic relationship, demoralize staff, waste medical resources, and expose the hospital to liability [6,8]. Additional studies that assess the risk of early discharge need to be conducted in order to evaluate the effectiveness of the current healthcare system. While cost is a major component of healthcare, it is important that accessibility and quality of care are not sacrificed in the restructuring of our system. In this paper, we described non-linear regression techniques appropriate for the analysis of hospital discharge data. The GPR models performed as well as or better than the PR and NBR models. The average rate for all three DRGs was 28.8 per 1000 across the four-year period for all zip code areas. AHC DRGs showed the highest number or rate of discharges, while ACS DRGs showed the smallest. The GPR models outperform the Poisson regression models except where the variation in diagnosis-related group (DRG) hospital discharge counts across zip code areas is relatively small.
7. Conclusion
The proposed family of generalized Poisson regression models is more flexible than other models for modeling count data from various disciplines in different situations. When k = 0, the family reduces to the zero-inflated generalized Poisson regression (ZIGPR) model, a special case of the family for modeling over-dispersed count data with too many zeros. There are cases where the zero-inflated negative binomial regression (ZINBR) model fails to converge [20]. For positive values of ϕi, the family represents the k-inflated generalized Poisson regression models and for negative values of ϕi, it represents the k-deflated generalized Poisson regression models. The k-deflation cases rarely occur in practice. The family reduces to the generalized Poisson regression (GPR) model proposed by [9] when ϕi = 0. Three examples have been given to illustrate the usefulness of the family in modeling. One advantage of using the family is that it lets the data determine which model is appropriate for a given situation.
References
[1] R.J. Blendon, L.H. Aiken, H.E. Freeman, C.R. Corey, Access to medical care for black and white Americans, J. Am. Med. Assoc. 261 (1989) 278–281.
[2] D. Bohning, E. Dietz, P. Schlattmann, L. Mendonca, U. Kirchner, The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology, J. R. Statistical Soc. A 162 (1999) 195–209.
[3] L.C. Camberg, N.E. Smith, M. Beaudet, J. Daley, M. Cagan, G. Thibault, Discharge, destination and repeat hospitalizations, Med. Care 35 (1997) 756–767.
[4] P.C. Consul, Generalized Poisson Distributions: Properties and Applications, Marcel Dekker Inc., New York, 1989.
[5] P.C. Consul, F. Famoye, Generalized Poisson regression model, Commun. Statist. Theor. Methods 21 (1) (1992) 89–109.
[6] M.C. Corley, K. Link, Men patients who leave a general hospital against medical advice: mortality rates within six months, J. Stud. Alcohol 42 (1981) 1058–1061.
[7] W.F. deShazo, A population based approach to monitoring inpatient utilization and determining the influence of sociodemographic factors on three theoretical categories of diagnosis related groups in a Two County Area of Alabama, DrPH Dissertation (unpublished), University of Alabama at Birmingham, 1997.
[8] P. Endicott, B. Watson, Interventions to improve the AMA-discharge rate for opiate-addicted patients, J. Psychosoc. Nurs. 32 (1994) 36–40.
[9] F. Famoye, Restricted generalized Poisson regression model, Commun. Statist. Theor. Methods 22 (1993) 1335–1354.
[10] F. Famoye, K.P. Singh, On inflated generalized Poisson regression models, Adv. Appl. Statist. 3 (2003) 145–158.
[11] E.L. Frome, M.H. Kurtner, J.J. Beauchamp, Regression analysis of Poisson-distributed data, J. Am. Statistical Assoc. 68 (1973) 288–298.
[12] B.S. Gillum, E.J. Grave, E. Wood, National hospital discharge survey, Vital Health Statist. 13 (1998) 1–51.
[13] P.L. Gupta, R.C. Gupta, R.C. Tripathi, Analysis of zero-adjusted count data, Comput. Statist. Data Anal. 23 (1996) 207–218.
[14] S. Gurmu, Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization, J. Appl. Econometrics 12 (1997) 225–242.
[15] S. Gurmu, P.K. Trivedi, Excess zeros in count models for recreational trips, J. Bus. Econ. Statist. 14 (4) (1996) 469–477.
[16] D.B. Hall, Zero-inflated, Poisson and binomial regression with random effects: a case study, Biometrics 56 (2000) 1030–1039.
[17] D. Heibron, Zero-altered and other regression models for count data with added zeros, Biometrical J. 36 (1994) 531–547.
[18] M. Kiely, A. Drum, W. Kessel, Early discharge, Clin. Perinatol. 25 (1998) 539–553.
[19] J.M. Kudur, L.K. Demlo, Small area variation in Iowa hospital utilization, Iowa Med. 75 (1985) 213–217.
[20] D. Lambert, Zero-inflated Poisson regression with an application to defects in manufacturing, Technometrics 34 (1992) 1–14.
[21] A.H. Lee, K. Wang, K.K.W. Yau, Analysis of zero-inflated Poisson data incorporating extent of exposure, Biometrical J. 43 (8) (2001) 963–975.
[22] E. Moy, B.A. Bartman, Race and hospital discharge against medical advice, J. Natl. Med. Assoc. 88 (1996) 658–660.
[23] J. Mullahy, Specification and testing of some modified count models, J. Econometrics 33 (1986) 341–365.
[24] M. Ridout, C.G.B. Demetrio, J. Hinde, Models for count data with many zeros, in: Invited Paper Presented at the 19th International Biometric Conference, Cape Town, South Africa, 1998, pp. 179–190.
[25] K.P. Singh, F. Famoye, Analysis of rates using a generalized Poisson regression model, Biometrical J. 35 (1993) 917–923.
[26] J. Wennberg, J. Freeman, Are hospital services rationed in New Haven or over-utilized in Boston, Lancet 1 (1987) 1185–1188.
[27] P. Wilson, P. Tedeschyi, Community correlates of hospital use, Health Serv. Res. 19 (1984) 133–146.
[28] R.A. Wolfe, G.R. Petroni, C.G. McLaughlin, L.F. McMahon Jr., Empirical evaluation of statistical models for counts or rates, Statist. Med. 10 (1991) 1405–1416.
[29] J.T. Wulu, K.P. Singh, F. Famoye, T.N. Thomas, G. McGwin, Regression analysis of count data, J. Indian Soc. Agric. Statist. 55 (2002) 220–231.
[30] J.T. Wulu, Generalized Poisson regression models with applications, Ph.D. Dissertation, Department of Biostatistics, University of Alabama at Birmingham, USA, 1999.