Bayesian variable selection for Poisson regression with underreported responses

Computational Statistics and Data Analysis 54 (2010) 3289–3299


Stephanie Powers a,∗, Richard Gerlach b, James Stamey c

a Red Deer College, Red Deer, Alberta, Canada
b University of Sydney, NSW, Australia
c Baylor University, Waco, TX, USA

Article info

Article history:
Received 24 November 2008
Received in revised form 19 February 2010
Accepted 1 April 2010
Available online 22 April 2010

Keywords: Misclassification; Poisson regression; MCMC; Model uncertainty

Abstract

Variable selection for Poisson regression when the response variable is potentially underreported is considered. A logistic regression model is used to model the latent underreporting probabilities. An efficient MCMC sampling scheme is designed, incorporating uncertainty about which explanatory variables affect the dependent variable and which affect the underreporting probabilities. Validation data is required in order to identify and estimate all parameters. A simulation study illustrates favorable results both in terms of variable selection and parameter estimation. Finally, the procedure is applied to a real data example concerning deaths from cervical cancer.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

The Poisson model is commonly used in epidemiology, economics, medicine and wildlife management. In some applications, not all events are reported by the "conventional" reporting system. For instance, Amoros et al. (2006) provided a thorough study of the impact of underreporting on the estimation of road crash casualty rates in France. Kumara and Chin (2005) also considered the impact of underreporting on estimating rates of traffic accidents. Whittemore and Gong (1991) investigated the effects of underreporting on the estimation of death rates due to cervical cancer. In all of these examples it was found that both the Poisson rates and the probabilities of reporting depended on subsets of the independent variables. Therefore, it is of interest to determine exactly which covariates are related to the Poisson rates and which are related to the reporting probabilities.

Accounting for underreporting in Poisson applications by combining a small error-free sample, referred to as validation data, with a larger fallible sample, referred to as the main study data, has been used in multiple studies. This allows parameter estimates obtained from the combined fallible and infallible samples to be corrected for, and essentially made robust to, the misclassification in the fallible data. As mentioned above, Whittemore and Gong (1991) corrected for underreporting in estimated cervical cancer rates in European countries; they used validation data for the reporting errors combined with fallible main study data in order to derive maximum likelihood estimators. Sposto et al. (1992) used autopsy data to correct fallible death certificate information in a study of cancer deaths in post-World War II Japan. Sposto et al. (1992) actually allowed for both under- and overreporting, but found the overreporting probabilities for all cancers were much smaller than the corresponding underreporting probabilities. Similar to Whittemore and Gong (1991), Sposto et al. (1992) estimated parameters using maximum likelihood. Work has also been done for studies with binomial outcomes. Cheng and



Corresponding author. Tel.: +1 403 342 3413; fax: +1 403 343 4028. E-mail address: [email protected] (S. Powers).

0167-9473/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2010.04.003


Hsueh (2003) considered a logistic regression, where they estimated parameters by combining error-prone data with error-free validation data. Prescott and Garthwaite (2005) used logistic regression to model a matched case-control study with differential misclassification, where validation data was used to correct for the misclassification. Recently, Nofuentes et al. (2009) considered sample size selection for a logistic model where a gold standard is only used on a subset of patients, similar to the double sampling model we advocate here. Rosychuk and Islam (2009) considered a Bayesian approach to modeling misclassified data.

We consider a Bayesian method for modeling underreporting using validation data for the reporting errors combined with a larger fallible data set. Specifically, we propose a log-linear model relating the set of potential covariates to the Poisson rate of interest and a logistic regression model relating the reporting probabilities to the covariates. Thus, if there are p potential covariates, there are 2^p × 2^p = 4^p candidate models; e.g. with 3 covariates there are 64 possible models. Gerlach et al. (2002) used an innovative approach to Bayesian model selection for the regular (no misclassification) logistic model, including a random effect term, for data sets with large numbers of covariates. Gerlach and Stamey (2007) applied this method to the logistic model with misclassification in the response. Choi et al. (2009) considered a similar approach to variable selection. We adapt the Gerlach and Stamey (2007) approach to the Poisson regression model, allowing for random effects to model potential over-dispersion.

We organize the paper as follows. In Section 2 we detail the Poisson model with underreporting and the variable selection procedure. In Section 3 we provide simulations illustrating the effectiveness of the procedure. In Section 4 we apply the method to a real data set. Finally, we provide some concluding comments in Section 5.

2. Poisson regression model and variable selection method

Suppose we wished to know what factors affect the rate of death due to cancer. However, not all cancer deaths are recorded as such. Using our Bayesian approach, we can determine which factors affect the rate of death due to cancer and also identify which factors affect the probability of a misdiagnosis. Similar to Whittemore and Gong (1991), our Bayesian approach has a fallible sample of counts y1, y2, . . . , yr = y, where r is the number of distinct covariate patterns, and

    yi ∼ Poisson(µi).    (2.1)

In our illustration, yi represents the number of reported cancer deaths in a fallible sample. The quantity µi = Ni λi πi represents the rate of observed occurrences for group i in the fallible sample. Ni is the known offset: the amount of study time contributed by the subjects in group i; in epidemiological applications, this is often person-years. The quantity λi represents the rate of the underlying true occurrence count (such as the actual death rate for a specific type of cancer), which includes the observed and unobserved occurrences. The quantity πi is the probability that an occurrence of the event of interest is correctly reported. We assume both λi and πi depend on a set of p potential covariates, X. For the Poisson rates λi, we model this dependence via the log link function with random error,

    log(λi) = z1i = Xi β1 + e1i,    i = 1, . . . , r,    (2.2)

where independently, e1i ∼ N(0, σ1²). For the reporting probabilities πi, we use the logistic link with error,

    log(πi/(1 − πi)) = z2i = Xi β2 + e2i,    i = 1, . . . , r,    (2.3)

where independently, e2i ∼ N(0, σ2²). The likelihood function, on observing the sample y, is

    L(β1, β2, σ1², σ2² | y) ∝ ∏_{i=1}^{r} (λi πi)^{yi} e^{−Ni λi πi},    (2.4)

where λi and πi are functions of β1, β2, σ1², and σ2² as in (2.2) and (2.3). The random effects terms in (2.2) and (2.3) allow for over-dispersion, which is not accounted for in the model of Whittemore and Gong (1991). In order to estimate β1, β2, σ1², and σ2², we require either an informative prior distribution under a Bayesian approach, or a set of validation data, or both. In this study, we assume that validation data is available. Validation data may only be available regarding the reporting probabilities, or it may be a true "double sample" where information on both the reporting probabilities and the Poisson rates is gained. For instance, in a case study on cervical cancer death rates (Whittemore and Gong, 1991), doctors from various countries were asked to complete a death certificate given a specific patient's case history. Unbeknownst to the doctors, the patient had died of an ulcerated tumor of the cervix. The resulting data yield information on how likely doctors from countries such as England or France are to report a true cervical cancer death, but provide no information on the overall death rate. This will be the most common situation, and we refer to this scenario as V1. In some situations, analysts do have access to a "double sample." To illustrate, if during a portion of the study time all deaths were thoroughly autopsied and the diagnosed causes were compared to the causes on the death certificates, then these data would provide information on both the reporting probability and the overall rate of occurrence. We will call this scenario V2.

In both scenarios V1 and V2 we have some form of validation data. Before we model the contributions of the validation data to the likelihood, let us first define a couple of terms used in the remainder of this paper. We refer to the perfectly sampled data as validation data, while the data where only a fallible classifier is used are referred to as the main study. Since the length of study time is different for the validation data and the main study, we refer to Ni as the number of person-years for potentially underreported observations in the ith covariate pattern, whereas ni denotes the additional number of person-years for which records are kept using both the cheaper fallible method and a more expensive infallible approach. Generally, ni is substantially less than Ni. In the case of V2, we can obtain the total number of actual occurrences (denoted ti) in covariate pattern i for the data in which we have an infallible test, where ti is a function of the underlying true occurrence rate λi from (2.2):

    ti ∼ Poisson(ni λi).    (2.5)

In the case of V1, where information is only available on the reporting probabilities, ti would be fixed in advance and thus not random. In both V1 and V2, the total number of occurrences recorded by the fallible search method in the validation data, denoted ui, is a function of the reporting probability πi from (2.3) and the number of actual occurrences in the validation sample from (2.5):

    ui | ti ∼ binomial(ti, πi).    (2.6)
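To make the two-sample structure concrete, the sketch below simulates main-study counts from (2.1) and V2 validation counts from (2.5) and (2.6), using the log and logit links of (2.2) and (2.3). This is our own illustrative Python sketch, not the authors' Matlab code; the coefficient values, variances, and offsets are those used later in Simulation 1 (Section 3.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 covariate patterns from 4 binary covariates, plus an intercept column.
patterns = np.array([[b1, b2, b3, b4]
                     for b1 in (0, 1) for b2 in (0, 1)
                     for b3 in (0, 1) for b4 in (0, 1)])
X = np.column_stack([np.ones(len(patterns)), patterns])

beta1 = np.array([0.75, 0.5, -2.0, 0.0, 0.0])  # Poisson-rate coefficients (Simulation 1)
beta2 = np.array([2.2, -1.9, 0.0, 0.0, 0.0])   # reporting-probability coefficients
sigma1 = sigma2 = np.sqrt(0.1)                 # random-effect standard deviations
N, n = 100, 25                                 # main-study and validation offsets

# Latent linear predictors with random effects, as in (2.2) and (2.3).
z1 = X @ beta1 + rng.normal(0.0, sigma1, len(X))
z2 = X @ beta2 + rng.normal(0.0, sigma2, len(X))
lam = np.exp(z1)                  # true occurrence rates, lambda_i
pi = 1.0 / (1.0 + np.exp(-z2))    # reporting probabilities, pi_i

# Main-study fallible counts (2.1), and V2 validation counts (2.5)-(2.6).
y = rng.poisson(N * lam * pi)     # reported occurrences in the main study
t = rng.poisson(n * lam)          # true occurrences in the validation sample
u = rng.binomial(t, pi)           # of those, the ones the fallible system reports
```

Under scenario V1, the counts t would instead be fixed by design, and only u would be random.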

In scenario V1, information from the validation data contributes to the likelihood through (2.6). In scenario V2, both (2.5) and (2.6) contribute to the likelihood. The authors are unaware of any existing literature that applies variable selection methods to situations such as V1 or V2. Following the work of Gerlach and Stamey (2007), which used variable selection on binomial logistic models with misclassification, we use data imputation methods and a Markov chain Monte Carlo (MCMC) sampling scheme to perform inference and variable selection. Data imputation allows the missing unreported occurrences in the main study data to be inferred, allowing a complete data likelihood to be employed in the MCMC algorithm. Let ti′ be the total number of actual occurrences, including those that are underreported, in the ith covariate pattern of the main study data; then

    ti′ ∼ shifted Poisson(λi(1 − πi), yi),    ti′ = yi, yi + 1, yi + 2, . . . ,    (2.7)

where λi is the true occurrence rate from (2.2) and πi is the reporting probability from (2.3). We impute ti′ as part of the MCMC sampling scheme using (2.7) and the method described in Appendix A. Given the observed counts from the validation data (ti from (2.5) and ui from (2.6)), the observed counts from the main study (yi from (2.1)), and the imputed counts (ti′ from (2.7)), the complete data likelihood function is

    L(β1, β2, σ1², σ2² | d) ∝ ∏_{i=1}^{r} λi^{ti′+ti} e^{−(Ni+ni)λi} πi^{yi+ui} (1 − πi)^{ti′+ti−(yi+ui)},    (2.8)

where d = (y, t, u) denotes the observed data for both the main study and validation samples. Again, the latent variables πi and λi, which are functions of β1, β2, σ1², and σ2², represent the underreporting probabilities and underlying true occurrence rates from (2.2) and (2.3). We wish to estimate z1i = log(λi) and z2i = log(πi/(1 − πi)) and perform variable selection on the covariates through β1 and β2. Using methods similar to Gerlach et al. (2002), we can simultaneously estimate the Poisson regression parameters, β1 and σ1², along with the logistic regression parameters, β2 and σ2², as part of an MCMC sampling scheme. George and McCulloch (1993) and Smith and Kohn (1996) pioneered the approach of introducing an indicator variable for each covariate and sampling these indicators in an MCMC algorithm for variable selection in continuous observation regression settings. We follow their idea and introduce indicator variable vectors, J1 and J2, containing 0's and 1's, to indicate which of the covariates are currently included in the Poisson (J1) and logistic (J2) models for the unknown parameters λ and π. Smith and Kohn (1996) illustrated the improvement in efficiency and MCMC convergence, as well as accuracy of estimation, when sampling these indicator vectors without conditioning upon their corresponding regression coefficients or the variance of the regression errors. Gerlach et al. (2002) illustrated that incorporating random effect terms, as in (2.2) and (2.3), allows a similar result to hold for discrete regression models. As such, we employ the following conditional posterior distributions for the indicator functions:

    p(J1j | d, X, z1, J1,≠j) ∝ p(d | z1) p(z1 | X, J1) p(J1j | J1,≠j),    for j = 1, . . . , p,    (2.9)

and

    p(J2j | d, X, z2, J2,≠j) ∝ p(d | z2) p(z2 | X, J2) p(J2j | J2,≠j),    for j = 1, . . . , p,    (2.10)

where the notation Jk,≠j indicates the vector Jk not including the jth term Jk,j. The distributions in (2.9) and (2.10) are not conditional upon the regression parameters βk or σk², k = 1, 2, because the first likelihood term p(d | zk) can be analytically integrated out. This is achieved in a straightforward manner; details are given in Appendix A. For model selection purposes, usually the modal vectors J1 and J2 from the MCMC sample are chosen as the final model, indicating which combination of covariates for z1 and z2 is overall the most likely. The posteriors above require three prior distribution settings:


(i) The probability of covariate inclusion in each of the two models, Pr(Jk,j = 1 | Jk,≠j). We set Pr(Jk,j = 1 | Jk,≠j) = 0.5 to reflect prior ignorance about each variable in each model. This probability is independent of the other indicators Jk,≠j, so Pr(Jk,j = 1 | Jk,≠j) = Pr(Jk,j = 1) = 0.5 and the prior on the indicator vectors is

    p(Jk) = ∏_{j=1}^{p} Pr(Jk,j = 1) = 0.5^p = 2^{−p},    for k = 1, 2.

This prior can easily be changed to reflect expert opinion about certain variables being included, or not included, in the model. However, the variable selection results will be sensitive to this choice, as illustrated later in our simulation study. As such, we recommend using 0.5, or close to this value, when no variable is favored a priori.

(ii) For the prior on the regression parameters, we chose

    βk | σk² ∼ N(0, c(k) σk² (Xk′ Xk)⁻¹),

where Xk is the matrix containing the columns of X corresponding to the elements of Jk that equal 1. We set c(k) to be sufficiently large (e.g. 100) to make the prior for βk reasonably diffuse; see Smith and Kohn (1996) for a discussion.

(iii) For the prior on the random effect variance, we used a flat prior on σk as suggested by Gelman (2006). However, the results obtained in Sections 3.1 and 3.2 are not sensitive to either the use of this prior or the more standard Jeffreys prior, p(σk²) ∝ 1/σk².
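Once the indicator vectors have been sampled, the marginal inclusion probability of each covariate and the modal (most-visited) model are read directly off the MCMC draws. A small post-processing sketch, ours rather than the authors', with a hypothetical toy array of draws:

```python
import numpy as np

def summarize_indicators(J_draws):
    """Marginal inclusion probabilities and the modal model from an
    (iterations x p) array of 0/1 indicator draws."""
    J_draws = np.asarray(J_draws)
    incl_prob = J_draws.mean(axis=0)  # estimated Pr(J_j = 1 | data) per covariate
    models, counts = np.unique(J_draws, axis=0, return_counts=True)
    modal = models[np.argmax(counts)]  # the most frequently visited model
    return incl_prob, modal

# Toy draws: covariate 1 is almost always in, covariate 2 almost always out.
draws = np.array([[1, 0], [1, 0], [1, 1], [1, 0], [0, 0]])
probs, modal = summarize_indicators(draws)
# probs is [0.8, 0.2]; modal is [1, 0]
```

The same summary applies separately to the J1 (Poisson) and J2 (logistic) draws.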

The posteriors for the latent processes z1 and z2 are

    p(zk | d, X, Jk, βk, σk²) ∝ p(d | zk) p(zk | X(Jk), βk, σk²),    for k = 1, 2.

Details for simulating from these distributions are given in Appendix A. In Appendix B we describe an alternative sampling scheme similar to the one proposed by Shephard and Pitt (1997) for latent vectors in state space time series models. Initial values are randomly chosen for the unknown model parameters and latent variables, β, σ², J, and z, from their prior distributions. Iterates are successively generated in turn from each of the posterior distributions detailed in the Appendices. Typically, we run the sampling scheme for 30,000 iterations as a burn-in period and then collect samples for the next 30,000 iterations, all of which are employed for posterior estimation and model selection purposes.

3. Simulation study

We considered several sets of parameters in a simulation study. The Matlab code used to perform the variable selection and generate the data sets is available from the authors upon request. We ran the simulations using both MCMC sampling schemes described in the Appendices and found they performed quite similarly in terms of model selection and parameter estimation. For the parameter configurations chosen, we found the procedure in Appendix A performed slightly better in terms of coverage of 95% intervals than the procedure in Appendix B. Therefore, in the simulation summaries all results use the procedure in Appendix A. In all simulations, except where noted, the prior inclusion probabilities were 0.5.

3.1. Simulation 1

For the first parameter configuration, we considered a situation with four potential independent variables; thus we have 2^4 × 2^4 = 256 total possible models. We set β1 = (0.75, 0.5, −2.0, 0, 0) for the Poisson rates and β2 = (2.2, −1.9, 0, 0, 0) for the logistic regression. Both random effects variances were 0.1. Thus, we expect two significant variables for the Poisson rates and one significant variable for the underreporting probabilities. For this first parameter configuration we generated 300 data sets.
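Within the sampling scheme used in these simulations, the imputation step (2.7) can be read as Poisson thinning: the unreported occurrences form an independent Poisson count added to the reported ones, so ti′ is at least yi. A sketch of that reading, ours rather than the authors' code, with the assumption that the offset Ni multiplies the unreported rate (which the printed form of (2.7) leaves implicit):

```python
import numpy as np

def impute_true_counts(y, N, lam, pi, rng):
    """Draw t'_i from the shifted Poisson in (2.7): the reported count y_i
    plus an independent Poisson number of unreported occurrences."""
    unreported = rng.poisson(N * lam * (1.0 - pi))  # occurrences the fallible system missed
    return y + unreported                           # support y_i, y_i + 1, ...

rng = np.random.default_rng(1)
y = np.array([20, 5, 0])          # observed (reported) main-study counts
lam = np.array([0.5, 0.1, 0.05])  # current draws of the true rates
pi = np.array([0.8, 0.9, 0.5])    # current draws of the reporting probabilities
t_prime = impute_true_counts(y, 100, lam, pi, rng)
```

One such draw per covariate pattern is taken at each MCMC iteration, after which the complete data likelihood (2.8) can be evaluated.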
In the first simulation we assume V2 validation data is available. Specifically, for each covariate pattern we generate a fallible Poisson count with rate µi = Ni λi πi, where λi and πi are computed using the coefficients just described and Ni = N = 100 for all groups. For the validation data, we generate ti ∼ Poisson(ni λi) for each covariate pattern, where ni = n = 25. Next, conditional on the true Poisson count, we generate ui | ti ∼ binomial(ti, πi). Using this generated data we ran the variable selection scheme described in Section 2 and recorded the variables included, the parameter estimates, the fitted Poisson rates, and the reporting probabilities.

Across the covariate patterns, the expected number of observed counts in the main study data ranged from approximately 20 to 210. For the validation data set, these counts ranged from around 6 to 70. Our simulation has rather small counts compared to some large-scale epidemiological studies, such as the Whittemore and Gong (1991) data set. On a 2.16 GHz Intel Core 2 Duo computer, 60,000 iterations with a 30,000 iteration burn-in took 857 s. Diagnostic plots in Fig. 1 reveal that the Markov chain converges nicely to the parameter values.

Table 1 provides the average posterior probability of each variable being included in the model and the proportion of data sets that included each variable. The procedure works quite well, as variables with non-zero coefficients have high probabilities of being included while those with 0 coefficients have low probabilities of being included. Overall, in the 300 simulated data sets, the correct model was the most probable 58% of the time. The second most probable model, which failed to include X1 in the Poisson model, was found to be the most probable in 19% of the simulations. X1 had the


Fig. 1. History plots for the coefficients of X1 and X2 in the Poisson regression indicate the Markov chain converges to the posterior distribution.

Table 1
Results from simulation 1.

Model     Variable   True value   Posterior probability   Inclusion in model
Poisson   X1         0.5          0.703                   0.743
          X2         −2           0.999                   1.000
          X3         0            0.190                   0.060
          X4         0            0.179                   0.040
Logistic  X1         −1.9         0.997                   1.000
          X2         0            0.242                   0.043
          X3         0            0.216                   0.037
          X4         0            0.214                   0.040
Fig. 2. True values (diamonds), average posterior estimates (squares), and 95% intervals for posterior mean estimates for λ from 300 simulated data sets across the 16 covariate patterns.

smallest magnitude effect (coefficient 0.5) of the variables in the actual model, causing it to be insignificant in some of the data sets. No other model was chosen best more than 5% of the time. Fig. 2 shows the fitted rates across all covariate patterns along with the true rates and the average 95% intervals; as seen there, the model does a good job of estimating these values. Similar plots for the reporting probabilities verify that those quantities are also well estimated.

For the same parameters, we reran the simulation, this time assuming V1 validation data: the Poisson counts generated in the validation data were treated as fixed, so the training sample was only informative for the reporting probabilities. The results of these simulations are summarized in Table 2. Again, the inclusion probabilities of the variables with non-zero coefficients are high, and these probabilities are low for those with 0 coefficients. The correct model was again selected most often, but not quite as often as when validation data informed both the rates and the reporting probabilities. The correct model had the highest probability 44% of the time and the model that excludes X1 in the Poisson model had the highest probability 30% of the time. The model that excludes X1 from both the Poisson model and the logistic model had the highest probability 8% of the time. The fitted rates estimate the true rates very well, similar to what we saw when we had validation data for both reporting probabilities and rates. Overall, having only validation data for the reporting probabilities


Table 2
Results from simulation 1 with validation data only on the reporting probability.

Model     Variable   True value   Posterior probability   Inclusion in model
Poisson   X1         0.5          0.626                   0.587
          X2         −2           0.997                   0.997
          X3         0            0.173                   0.060
          X4         0            0.171                   0.040
Logistic  X1         −1.9         0.870                   0.907
          X2         0            0.207                   0.043
          X3         0            0.170                   0.023
          X4         0            0.173                   0.033
Table 3
Results from simulation 1 with underreporting ignored.

Model     Variable   True value   Posterior probability   Inclusion in model
Poisson   X1         0.5          0.213                   0.090
          X2         −2           0.999                   1.000
          X3         0            0.185                   0.055
          X4         0            0.195                   0.065

Fig. 3. True values (diamonds), average posterior estimates (squares), and 95% intervals for posterior mean estimates for λ from 300 simulated data sets across the 16 covariate patterns when underreporting is ignored.

had only a minor impact on the performance of the model selection procedure. Finally, we note that for V2 data 18.3% of the simulations included one or more spurious covariates, while with V1 data 19.7% did.

For comparison purposes we considered the Poisson model when the underreporting was ignored; thus, only the fallible main study data was used. According to the results provided in Table 3, ignoring the underreporting greatly reduces the probability of X1 being included: X1 was included in only 9% of the simulated data sets. The most probable model was the one that correctly includes X2 but incorrectly leaves out X1; it was selected 82% of the time. The second and third most probable models were both selected about 6% of the time. The second most probable model was the correct model, while the third most probable removes X1 and adds X4. The true and fitted rates along with 95% intervals are provided in Fig. 3, which further illustrates the impact of ignoring the underreporting: the true values lie completely outside the intervals for several of the covariate patterns, and by quite a large amount in several cases.

3.2. Simulation 2

We next consider a parameter configuration with a larger set of potential variables. For this simulation we have

β1 = (2.5, 0.7, −0.5, 0.3, 0, 0, 0), and for the logistic regression we set β2 = (1.5, 0, 0, −0.7, 0, 1.2, 0). Again, both random effect variances are 0.1. Given the six potential variables, there are 2^6 × 2^6 = 4096 possible models. We again set Ni = N = 100 and ni = n = 25. Similar to the first simulation, the Markov chains converge nicely. The larger number of possible models increases the CPU time for each data set to approximately 3170 s. With V2 validation data, the model selection performance is again quite good: the correct model was the most probable model 77% of the time. The model chosen second most often was selected in 17% of the data sets; this model omitted X3 from the Poisson regression portion of the model, included all other variables with non-zero coefficients, and did not include any variables with 0 coefficients. No other model was selected in

Table 4
Results from simulation 2.

Model     Variable   True value   Posterior probability   Inclusion in model
Poisson   X1         0.7          0.999                   1.000
          X2         −0.5         0.994                   1.000
          X3         0.3          0.760                   0.817
          X4         0            0.122                   0.000
          X5         0            0.117                   0.000
          X6         0            0.123                   0.000
Logistic  X1         0            0.146                   0.033
          X2         0            0.150                   0.007
          X3         −0.7         0.999                   1.000
          X4         0            0.145                   0.020
          X5         1.2          1.000                   1.000
          X6         0            0.142                   0.007
Table 5
Results from simulation 2 with validation data only on the reporting probability.

Model     Variable   True value   Posterior probability   Inclusion in model
Poisson   X1         0.7          0.999                   1.000
          X2         −0.5         0.992                   1.000
          X3         0.3          0.740                   0.813
          X4         0            0.122                   0.000
          X5         0            0.121                   0.007
          X6         0            0.123                   0.000
Logistic  X1         0            0.149                   0.023
          X2         0            0.150                   0.007
          X3         −0.7         0.999                   1.000
          X4         0            0.153                   0.017
          X5         1.2          1.000                   1.000
          X6         0            0.144                   0.013
more than 2% of the data sets. Table 4 provides the average posterior probability of each variable being included in the model and the proportion of data sets that included each variable. Variables that should be included all have high probabilities, and variables that should not be included have low probabilities, indicating good performance.

If information is only available on the reporting probabilities, scenario V1, the performance is still very good. Specifically, the correct model was selected 74% of the time and the model omitting X3 from the Poisson regression was selected 17% of the time. Table 5 provides the average posterior probability of each variable being included in the model and the proportion of data sets that included each variable. It is again notable that the method performs well when validation data is only available for the reporting probabilities. As noted earlier, scenarios where validation data is only available for the reporting probabilities appear to be the more common case, so it is encouraging that the probabilities in Table 5 are almost as high as those in Table 4 for variables that should be in the model and are comparably low for variables that should not be included. The inclusion of spurious covariates is very similar for the two validation data scenarios: for V1 data, approximately 5% of the simulated data sets resulted in most probable models that included at least one spurious covariate, while just under 3% of the models for V2 data had at least one spurious covariate.

For this parameter configuration, if underreporting is ignored, very poor inferences result. The most selected model includes the variables with non-zero coefficients and adds two spurious variables; this model was selected 45% of the time. The correct model was selected best in only 18% of the simulated data sets, making it the third most frequent choice. Fig. 4 illustrates the very poor performance of the model in terms of estimating the Poisson rates when underreporting is ignored. We see that the true rates in general fall considerably outside the average of the 2.5th and 97.5th percentiles across all covariate patterns. In fact, the coverage is less than 10% for over half of the covariate patterns, with average coverage of 9.6% across all covariate patterns.

Finally, we investigate the impact our prior for the inclusion probabilities had on the variable selection process. For these same parameter values, we ran one simulation where the prior inclusion probabilities for all variables were 0.25 rather than 0.5, and another where they were 0.75. With priors of 0.25, the model selected most often was still the correct model; however, it was only selected 60% of the time, compared to 77% when a prior of 0.5 was used. When the priors were set to 0.75, the correct model was selected 83% of the time. Therefore, there does appear to be some sensitivity to the choice of prior inclusion probabilities.

4. Example

As an example we consider data found in Whittemore and Gong (1991). In the original study the data were stratified into age and country categories with the number of cervical cancer deaths as the response. For illustrative purposes we consider

Fig. 4. True values (diamonds), average posterior estimates (squares), and 95% intervals for posterior mean estimates for λ from 300 simulated data sets across the 64 covariate patterns when underreporting is ignored.

Table 6
MCMC output for cervical cancer data. Covariates included in the final model are in bold.

                         Variable   Post. mean   2.5th    97.5th   Post. prob.
Poisson rates            Constant   2.39         1.56     3.21     –
                         Belgium    NA           NA       NA       0.206
                         France     NA           NA       NA       0.165
                         Italy      −1.24        −2.11    −0.32    0.928
                         35–44      1.69         0.76     2.60     0.994
                         45–54      2.73         1.83     3.63     0.999
                         55–64      2.93         2.04     3.82     0.999
Reporting probabilities  Constant   5.07         3.69     6.66     –
                         Belgium    −3.17        −4.88    −1.60    0.470
                         France     −4.07        −5.78    −2.56    0.538
                         Italy      −3.47        −5.18    −1.91    0.481
Random effects           σ1²        0.23         0.10     0.50
                         σ2²        1.55         0.09     8.04

each country and age category as an independent variable. For the four countries, we let X1 = 1 for Belgium, X2 = 1 for France, and X3 = 1 for Italy; thus England is the country when all three are 0. For the ages, X4 = 1 for 35–44, X5 = 1 for 45–54, and X6 = 1 for 55–64; thus ages 25–34 are represented when X4 = X5 = X6 = 0. Validation data is only available on the reporting probabilities, and only for the country variables. Therefore, all six variables are available for the Poisson rates, but only X1, X2, and X3 are available for the logistic regression on the reporting probabilities. There are 2^6 × 2^3 = 512 possible models.

We applied the MCMC method described in Section 2 with 30,000 sampling iterations after a burn-in sample of 30,000 iterations. The CPU run time was 579 s. MCMC trace plots suggest convergence to the posterior distribution. The model selected most often by the MCMC variable selection method contained the variable Italy and all three age categories for the Poisson rates, while the logistic regression for the reporting probabilities contained all three country variables. The most likely model had a posterior probability of 0.181. In Table 6 we provide the means and 95% intervals for the parameters in the final model, along with the probability of inclusion for all model parameters. Interestingly, the second most likely model had the same variables for the Poisson rates but included none of the three country variables for the reporting probabilities; it had a posterior probability of 0.136. This appears to be due to the relatively small amount of validation data available for the reporting probabilities. For instance, if the validation data for the four countries is increased from 50 to 500 per country, the top three models all include all three countries in the reporting probability model.
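The indicator coding just described, with England and ages 25–34 as the baselines, can be made concrete with a short sketch. This is our own illustration of the coding, not the authors' code:

```python
import numpy as np

countries = ["England", "Belgium", "France", "Italy"]
ages = ["25-34", "35-44", "45-54", "55-64"]

def design_row(country, age):
    """Intercept, X1-X3 (country indicators), X4-X6 (age indicators)."""
    x_country = [int(country == c) for c in countries[1:]]  # England is the baseline
    x_age = [int(age == a) for a in ages[1:]]               # 25-34 is the baseline
    return [1] + x_country + x_age

# Full 16 x 7 design matrix over the country-by-age strata.
X = np.array([design_row(c, a) for c in countries for a in ages])
# design_row("Belgium", "35-44") -> [1, 1, 0, 0, 1, 0, 0]
```

Only the first four columns (intercept and country indicators) enter the logistic regression for the reporting probabilities, since validation data exists only for the country variables.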
If underreporting is ignored, the same variables are included for the Poisson rates; however, the second most probable model then includes all countries and age groups. When accounting for the underreporting, Belgium and France are very unlikely to be included in the model. This can be seen in Fig. 5, which plots the fitted rates both accounting for underreporting and ignoring it. In all cases except England, ignoring underreporting leads to significantly lower rates. More interestingly, when accounting for underreporting, the rates for Belgium and France are much closer to each other, and to England, than when ignoring it. Thus we can see how ignoring underreporting could cause spurious variables to be included in the model.


Fig. 5. Rates of cervical cancer when accounting for and ignoring underreporting.

Unfortunately, a direct comparison between our results and Whittemore and Gong (1991) is not possible. However, Whittemore and Gong (1991) did compare four different models: one in which death rates depended on age and country, another in which death rates depended only on age, and two more that assumed the reporting probabilities were the same across countries. They found that the model in which death rates depended on age and country and the reporting probabilities were allowed to vary outperformed the other models in a likelihood ratio test. Using our Bayesian variable selection method, we found that some of the countries and ages had a significant impact on death rates, while differences between countries explained some of the variability in underreporting.

5. Comments

We have presented a Bayesian approach to variable selection in a Poisson regression model where the response is subject to underreporting. The simulations demonstrate good performance of the method, both in determining the variables that affect the Poisson rates and those that affect the reporting probabilities. In addition to determining which variables are significant, we can also use the posterior probabilities as weights in Bayesian model averaging to forecast Poisson rates and reporting probabilities.

In this paper, we considered examples where underreporting depended on a subset of covariates. McInturff et al. (2004), Liu et al. (2004) and others considered logistic regression with response misclassification where the misclassification parameters did not depend on covariates. As in those papers, there may be cases where only a single reporting probability parameter is required; the method used in this paper has been applied with similar success to such cases (Powers, 2009).
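The Bayesian model averaging idea mentioned above can be sketched in a few lines: posterior model probabilities serve as weights on model-specific forecasts. This is a minimal illustration with made-up numbers, not values from the paper.

```python
# Minimal sketch of Bayesian model averaging: posterior model probabilities
# weight the per-model forecasts. The numbers below are purely illustrative.
def model_average(forecasts, post_probs):
    """Posterior-probability-weighted average of per-model forecasts."""
    assert abs(sum(post_probs) - 1.0) < 1e-9  # weights must sum to one
    return sum(f * w for f, w in zip(forecasts, post_probs))

rate_forecasts = [10.2, 11.0, 9.8]  # Poisson-rate forecast under each model
post_probs = [0.5, 0.3, 0.2]        # posterior model probabilities
bma_rate = model_average(rate_forecasts, post_probs)  # close to 10.36
```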
Further complications to the model, such as variable selection when one or more of the covariates are potentially misclassified, or the case where both the response and a covariate are subject to measurement error, could provide insightful further research.

Appendix A

We now provide some details regarding the conditional distributions in the MCMC sampling scheme. For the indicator variables in the model selection, and under the priors provided in Section 2, we have

P(Jk,j = 1 | d, zk, Jk,≠j) = 1 / (1 + hkj),

where

hkj = [ S(Jk,j = 1) / S(Jk,j = 0) ]^(r/2) √(1 + c(k)) · p(Jk,j = 0 | Jk,≠j) / p(Jk,j = 1 | Jk,≠j),

and S(Jk,j) = zk′zk − B′A⁻¹B, with B = Xk′zk and A = ((c(k) + 1)/c(k)) Xk′Xk.
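The Gibbs update for a single inclusion indicator can be sketched as below. This is our own illustration of the formulas above (names and the synthetic data are ours, not the authors'): S(J) is evaluated with the j-th column included and excluded, and the resulting inclusion probability is 1/(1 + h), here under a uniform prior on the indicators so the prior odds are 1.

```python
# Illustrative sketch of the Gibbs update for one inclusion indicator J_{k,j}.
# S(J) = z'z - B'A^{-1}B is computed with column j included (J=1) and
# excluded (J=0); the update probability is P(J_j = 1 | rest) = 1/(1 + h).
import numpy as np

def S(X, z, c):
    """Residual quantity z'z - B'A^{-1}B for the current design matrix X."""
    if X.shape[1] == 0:
        return float(z @ z)
    A = ((c + 1.0) / c) * (X.T @ X)
    B = X.T @ z
    return float(z @ z - B @ np.linalg.solve(A, B))

def prob_include(X_full, z, j, J, c, prior_odds=1.0):
    """P(J_j = 1 | rest) = 1 / (1 + h), h from the ratio of S values."""
    r = len(z)
    J1, J0 = J.copy(), J.copy()
    J1[j], J0[j] = True, False
    S1, S0 = S(X_full[:, J1], z, c), S(X_full[:, J0], z, c)
    h = (S1 / S0) ** (r / 2.0) * np.sqrt(1.0 + c) * prior_odds
    return 1.0 / (1.0 + h)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
z = X[:, 0] * 2.0 + rng.normal(size=50)  # only column 0 is truly relevant
J = np.array([True, True, True])
p = prob_include(X, z, 0, J, c=100.0)
# A strongly relevant column should have inclusion probability near 1.
```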


Gerlach and Eyre (2004) illustrated the use of the Metropolis–Hastings algorithm to estimate a vector of log-rates as part of the MCMC sampling scheme in a Poisson regression model (without underreporting) as follows. Let

p(z1j | dj, z1,≠j, β1, σ1²) ∝ p(dj | z1j) exp( −(1/(2σ1²)) (z1j − X1j β1)² ) = l(z1j) g(z1j)

be the full conditional distribution for the log-rate for covariate pattern j, where g(z1j) is the normal part of the conditional distribution and l(z1j) ∝ exp( z1j (t′j + tj) − exp(z1j)(Nj + nj) ). Here we simulate a proposal value from the prior,

z* | d, z1,≠j, β1, σ1² ∼ N(X1j β1, σ1²),

and accept this value with probability min{1, l(z*)/l(z^c)}, where z^c represents the current value of z1j; otherwise we keep the current value, i.e. z1j = z^c. This method is equivalent to that, for dynamic models, in Knorr-Held (1999). The idea is similar for the series z2, but we model the log-odds instead of the log-rates. Specifically, the conditional distribution is

p(z2j | dj, z2,≠j, β2, σ2²) ∝ [ exp(z2j (yj + uj)) / (1 + exp(z2j))^(t′j + tj) ] exp( −(1/(2σ2²)) (z2j − X2j β2)² ).
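The prior-proposal Metropolis–Hastings step for a single log-rate can be sketched as follows; this is our own minimal illustration, with t_sum standing in for t′j + tj and n_sum for Nj + nj, and all parameter values purely illustrative. The z2 step has the same structure with the logistic likelihood factor in place of the Poisson one.

```python
# Hedged sketch of the independence Metropolis-Hastings step for one log-rate
# z_{1j}: propose from the conditional prior N(X_{1j} beta_1, sigma_1^2) and
# accept with probability min{1, l(z*)/l(z^c)}. Names are illustrative.
import math
import random

def log_l(z, t_sum, n_sum):
    """log l(z) = z*(t'_j + t_j) - exp(z)*(N_j + n_j)."""
    return z * t_sum - math.exp(z) * n_sum

def mh_step(z_current, prior_mean, prior_sd, t_sum, n_sum, rng):
    """One MH update: proposal from the prior, Poisson-likelihood ratio."""
    z_star = rng.gauss(prior_mean, prior_sd)
    log_alpha = log_l(z_star, t_sum, n_sum) - log_l(z_current, t_sum, n_sum)
    if math.log(rng.random()) < log_alpha:  # accept with prob min{1, ratio}
        return z_star
    return z_current

rng = random.Random(1)
z = 0.0
for _ in range(2000):
    z = mh_step(z, prior_mean=1.0, prior_sd=0.5, t_sum=30.0, n_sum=10.0, rng=rng)
# After burn-in the chain wanders around the conditional mode (about 1.1 here).
```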

Let

l(z2j) = exp(z2j (yj + uj)) / (1 + exp(z2j))^(t′j + tj).

Then, simulate a proposal value z* | d, z2,≠j, β2, σ2² ∼ N(X2j β2, σ2²) and accept this value with probability min{1, l(z*)/l(z^c)}, where z^c represents the current value of z2j; otherwise keep the current value, i.e. z2j = z^c.

Gelman et al. (1996) recommend acceptance rates between 14% and 40% to obtain at least 80% of the maximum efficiency. The acceptance rates achieved for this paper were in line with these recommendations. For example, in simulation 1 the acceptance rates ranged from 13% to 60%, with an average of 22% for both series z1 and z2. Using the Whittemore and Gong (1991) data, the acceptance rates averaged 14% for series z1 and 16% for z2. It is of note that the Gelman et al. (1996) recommendations are based on simulating a multivariate normal distribution using a random walk Metropolis algorithm, whereas ours is an independence-kernel Metropolis–Hastings sampler with logistic and Poisson targets.

We also want to estimate βk and σk² for k = 1, 2. The posterior distributions for βk | d, zk, Jk and σk² | d, zk, Jk, βk are straightforward to compute. From the prior distributions in Section 2, and under a general Inverse Gamma(ak, bk) prior for σk²:

βk | d, zk, Jk ∼ t_r( Ak⁻¹Bk, Ak⁻¹Dk / r )

and

σk² | d, zk, Jk, βk ∼ Inverse Gamma( (r + qk)/2 + ak, Ck/2 + bk ),

where Ak = ((c(k) + 1)/c(k)) Xk′Xk, Bk = Xk′zk, Ck = (zk − Xkβk)′(zk − Xkβk) + (1/c(k)) βk′Xk′Xkβk, and Dk = zk′zk − Bk′Ak⁻¹Bk. Here qk is the number of columns of Xk. The standard Jeffreys prior is σk² ∼ Inverse Gamma(ak = 0, bk = 0), while a flat prior on σk is equivalent to σk² ∼ Inverse Gamma(ak = −0.5, bk = 0).

Appendix B

As an alternative to the sampling procedure described in Appendix A, which uses the prior distribution as the proposal function, we consider an approximation to the posterior p(zkj | dj, zk,≠j, βk, σk²) as a proposal. Specifically, we approximate p(zkj | dj, zk,≠j, βk, σk²) with a Gaussian distribution and use that approximation as a Metropolis–Hastings proposal distribution, similar to the work of Shephard and Pitt (1997) for latent vectors in state space models. For z1 we have



f(z1j) = p(z1j | dj, z1,≠j, β1, σ1²) ∝ exp( z1j (t′j + tj) − exp(z1j)(Nj + nj) − (1/(2σ1²)) (z1j − X1j β1)² ),

where we denote the posterior of z1j as f(z1j). The log of this posterior is given by

log f(z1j) = z1j (t′j + tj) − exp(z1j)(Nj + nj) − (1/(2σ1²)) (z1j − X1j β1)² + c.

The constant c is independent of z1j and accounts for the fact that the formula is proportional, not equal, to the posterior. We employ a second-order Taylor series expansion about the mode, at which the first derivative vanishes, to approximate this function via

log f(z1j) ≈ log f(ẑ) + (1/2)(z1j − ẑ)² ∂²log f(z1j)/∂z1j² |_(z1j = ẑ),    (B.1)

where ẑ is the mode of log f(z1j) and ∂²log f(z1j)/∂z1j² |_(z1j = ẑ) is the analytic second derivative of log f(z1j) evaluated at the mode ẑ. From (B.1) it follows that

f(z1j) ∝ exp( (1/2)(z1j − ẑ)² ∂²log f(z1j)/∂z1j² |_(z1j = ẑ) ),

which is equivalent to

z1j ∼ N( μ = ẑ, σ² = −h ( ∂²log f(z1j)/∂z1j² |_(z1j = ẑ) )⁻¹ ), where h = 1.
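The mode-finding and proposal construction above can be sketched as follows for a single log-rate. This is our own illustration with invented helper names and parameter values: it uses Newton–Raphson for the mode search (the paper uses Matlab's fminsearch) and the analytic second derivative of log f for the Poisson case, with T standing in for t′j + tj, N for Nj + nj, and (m, s2) for (X1jβ1, σ1²).

```python
# Sketch of the Appendix B proposal for a log-rate z_{1j}: locate the mode of
# log f by Newton-Raphson, then build the Gaussian proposal with variance
# -h / [log f]''(z_hat). Helper names and numbers are illustrative.
import math

def mode_of_log_f(T, N, m, s2, z=0.0, iters=50):
    """Newton-Raphson on log f(z) = z*T - exp(z)*N - (z - m)^2 / (2*s2)."""
    for _ in range(iters):
        grad = T - N * math.exp(z) - (z - m) / s2  # first derivative
        hess = -N * math.exp(z) - 1.0 / s2         # second derivative (< 0)
        z -= grad / hess
    return z

def gaussian_proposal(T, N, m, s2, h=1.7):
    """Mean and variance of the approximating Gaussian, variance scaled by h."""
    z_hat = mode_of_log_f(T, N, m, s2)
    hess = -N * math.exp(z_hat) - 1.0 / s2
    return z_hat, -h / hess

mu, var = gaussian_proposal(T=30.0, N=10.0, m=1.0, s2=0.25)
# mu is the conditional mode; var is the inflated curvature-based variance.
```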

This Gaussian approximation becomes the proposal distribution in a Metropolis–Hastings scheme. To avoid getting stuck in local modes and to enhance mixing during the MCMC iterations, we set h = 1.7 in this proposal, a choice we found led to Metropolis–Hastings acceptance rates in the 20%–60% range. We approximate the mode using the fminsearch optimization routine in Matlab, with −log f(z1j) as the function to be minimized; alternatively, the Newton–Raphson algorithm could be used. For computational efficiency, the mode search can be performed every x iterations, where x > 1. We found that x between 5 and 10 gave a worthwhile reduction in run time while maintaining fast convergence and good mixing for the MCMC chains. The procedure is similar for z2.

References

Amoros, E., Martin, J.L., Laumon, B., 2006. Under-reporting of road crash casualties in France. Accident Analysis and Prevention 38, 627–635.
Cheng, K.F., Hsueh, H.M., 2003. Estimation of a logistic regression model with mismeasured observations. Statistica Sinica 13, 111–127.
Choi, J., Fuentes, M., Reich, B.J., 2009. Spatial–temporal association between fine particulate matter and daily mortality. Computational Statistics and Data Analysis 53, 2889–3000.
Gelman, A., 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1, 515–533.
Gelman, A., Roberts, G., Gilks, W., 1996. Efficient Metropolis jumping rules. Bayesian Statistics 5, 599–607.
George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 881–893.
Gerlach, R., Bird, R., Hall, A., 2002. Bayesian variable selection in logistic regression: predicting company earnings direction. Australian and New Zealand Journal of Statistics 44 (2), 155–168.
Gerlach, R., Eyre, T., 2004. Bayesian variable selection in discrete regression models. University of Newcastle. Working Paper.
Gerlach, R., Stamey, J.D., 2007. Bayesian model selection for logistic regression with misclassified outcomes. Statistical Modelling 7, 255–273.
Knorr-Held, L., 1999. Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics 26, 129–144.
Kumara, S.S.P., Chin, H.C., 2005. Application of Poisson underreporting model to examine crash frequencies at signalized three-legged intersections. Transportation Research Record 1908, 46–50.
Liu, Y., Johnson, W.O., Gold, E.B., 2004. Bayesian analysis of risk factors for anovulation. Statistics in Medicine 23, 1901–1919.
McInturff, P., Johnson, W.O., Cowling, D., Gardner, I.A., 2004. Modeling risk when outcomes are subject to error. Statistics in Medicine 23, 1095–1109.
Nofuentes, J.A.R., del Castillo, J.D.L., Alonso, M.A.M., 2009. Determining sample size to evaluate and compare the accuracy of binary diagnostic tests in the presence of partial disease verification. Computational Statistics and Data Analysis 53, 742–755.
Powers, S., 2009. Bayesian Approaches to Inference and Variable Selection for Misclassified and Under-Reported Response Models. Baylor University Press, Waco, Texas.
Prescott, G., Garthwaite, P., 2005. Bayesian analysis of misclassified binary data from a matched case-control study with a validation sub-study. Statistics in Medicine 24 (3), 379–401.
Rosychuk, R.J., Islam, S., 2009. Parametric estimation in a model for misclassified Markov data—a Bayesian approach. Computational Statistics and Data Analysis 53, 3805–3816.
Shephard, N., Pitt, M.K., 1997. Likelihood analysis of non-Gaussian measurement time series. Biometrika 84, 653–667.
Smith, M., Kohn, R., 1996. Nonparametric regression using Bayesian variable selection. Journal of Econometrics 75, 317–344.
Sposto, R., Preston, D.L., Shimizu, Y., Mabuchi, K., 1992. The effect of diagnostic misclassification on non-cancer and cancer mortality dose response in A-bomb survivors. Biometrics 48, 605–617.
Whittemore, A.S., Gong, G., 1991. Poisson regression with misclassified counts: application to cervical cancer mortality rates. Applied Statistics 40, 81–93.