Bootstrapping with R to Determine Variances of Mixture Model Estimates in Predicting Confidence Intervals for Population Sizes

Bootstrapping with R to Determine Variances of Mixture Model Estimates in Predicting Confidence Intervals for Population Sizes

Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 86 (2016) 216 – 219 2016 International Electrical Engineering Cong...

108KB Sizes 3 Downloads 35 Views

Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 86 (2016) 216 – 219

2016 International Electrical Engineering Congress, iEECON2016, 2-4 March 2016, Chiang Mai, Thailand

Bootstrapping with R to determine variances of mixture model estimates in predicting confidence intervals for population sizes Chareena Ujeha, Pratana Satitvipaweea, Jutatip Sillabutraa, Pichitpong Soontornpipita, Prasong Kitidamrongsuka, Chukiat Viwatwongkasema * a

Department of Biostatistics, Faculty of Public Health, Mahidol University, Bangkok 10400, Thailand

Abstract It is not easy to find the variance estimates of mixture model via the theoretical derivation directly. Instead, bootstrapping denoting a resampling technique from an original sample dataset with replacement allocation is used to calculate variances of mixture estimates of zero-truncated Poisson distributions in a prediction of population size and its confidence interval. The application is the estimation of the number of drug (opium) users in Thailand 2007 under surveillance data of counts of treatment episodes in a case. The results indicated that there were 3,262 observed opium cases who received treatments, the estimate of the unobserved number of opium users without receiving any treatment was 3,931, leading to total population size estimate of 7,193 opium users. The 95% confidence intervals as a by-product of bootstrapping were 6,674-7,712 under bootstrap normal base, 6,782-7,761 under bootstrap 95% percentiles, and 6,626-7,605 under bootstrap t intervals. Bootstrapping algorithm with R program is available here. © 2016 2016 The TheAuthors. Authors.Published Published Elsevier by by Elsevier B.V.B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of iEECON2016. Peer-review under responsibility of the Organizing Committee of iEECON2016 Keywords: Bootstrapping with R; Variance Estimation; Mixture Model; Population Size Estimation

1. Introduction Bootstrapping is a resampling technique with replacement allocation from an original sample dataset in order to estimate the precision of sample statistics, to perform significance tests, and to validate statistical models when theoretical parts are complex and/or the sample size is not met enough conditions in practice. Mixture model

* Corresponding author. Tel.: +66-2-354-8530; fax: +66-2-354-8534. E-mail address: [email protected].

1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of iEECON2016 doi:10.1016/j.procs.2016.05.096

217

Chareena Ujeh et al. / Procedia Computer Science 86 (2016) 216 – 219

corresponding to mixture distribution is also based on a complex procedure that theoretical inference is difficult to get the result directly, especially in finding variance estimates. Bootstrap method initiated by Efron in 1979 1 is placed under pressure of this situation. 2. Methodology and design The formulas, notations, and data are followed from the work of Viwatwongkasem et al.

2, 3

. Let ni be the

number of drug (opium) users reported with i treatment episodes. The data of opium cases are n1 2200 , n2 703 , n3 197 , n4

76 , n5

50 , n6

33 , n7

3 . In total, the sample size, n

¦

m

n

i 1 m

3262 , for opium use are observed. k

Suppose that a mixture of zero-truncated Poisson densities is given as

f  (i, Q)

¦q

f (i, O j )

j 

where

j 1

f  (i, O j )

Po(i, O j ) / (1  Po(0, O j )) , q1 ,..., qk and O1 ,..., Ok are found from the k -component. The Horvitz-Thompson

estimator for estimating the total population size of opium use and the number of opium users without receiving any treatment (zero episode) can be estimated as § k · qˆ j ¸ n¨ ¦ and nˆ0 ¨ j 1 1  exp(Oˆ j ) ¸ © ¹ ¦im 1 i ni ei j , Oˆ j (1  exp(Oˆ j )) , and ei j ¦im 1 ni ei j Nˆ

where

qˆ j

1 m ¦ ni ei j ni1

Nˆ  n

f (i, O j )q j . ¦ f  (i, O cj )qcj  k jc 1

EM algorithm for mixtures of zero-truncated Poisson distributions Step 0 Choose some initial estimates qˆ (jr ) and Oˆ j( r ) at cycle r Step 1 Compute Nˆ ( r )

§ k · qˆ (jr ) ¸ , nˆ0( r ) n¨ ¦ ¨ j 1 1  exp(Oˆ j( r ) ) ¸ © ¹

Step 2 Compute new estimates as qˆ (jr 1)

(r ) Nˆ ( r )  n , and ei j

1 m ¦ ni ei( rj ) , Oˆj(r 1) ni1

f  (i, Oˆ (j r ) ) qˆ (jr ) . ¦kj c 1 f  (i, Oˆcj ( r ) ) qˆcj ( r )

¦im 1 i ni ei( rj ) ¦im 1 ni ei( rj )

(1  exp(Oˆ (j r ) ))

Step 3 Set r r  1 and go back to Step 1. The step 1 and 2 are repeated until convergence. Bootstrap variance to determine population size 1.

Bootstrap frequencies n1* , n2* ,..., nm* are sampled from a multinomial distribution with size parameter n and with nonparametric probability parameter pi ni / n

2.

For each bootstrap sample {n1* , n2* ,..., nm* } of size n , Nˆ * is constructed from above EM algorithm

3.

If there are R bootstrap samples, then population size estimates Nˆ 1* , Nˆ 2* ,....., Nˆ R* are available, the bootstrap mean and variance are N *

1 R ˆ* ¦ Nb , Vˆ 2 Rb1

intervals. 4.

R 1 ¦ ( Nˆ b*  N * )2 . Also, the standard error Vˆ is used to obtain bootstrap ( R  1) b 1





* * , Nˆ (0.975) Bootstrap 95% confidence intervals based on normal is Nˆ r 1.96Vˆ , Nˆ (0.025) under bootstrap percentile,

and t



Nˆ  t0.025 Vˆ , Nˆ  t0.975 Vˆ

( N  Nˆ ) / Vˆ . *



under bootstrap tD interval where tD is the D

th

ordered value of the statistics

218

Chareena Ujeh et al. / Procedia Computer Science 86 (2016) 216 – 219

3. Results There were 3,262 observed opium cases who received treatments at all episodes. Here the mixture of two components of zero-truncated Poisson densities was the best fitting model with the smallest BIC value. The results produced nˆ0 3,931 for the unobserved number of opium users without any treatment episodes and Nˆ 7,193 for the population size estimate of opium use. With 1,000 bootstrap samples, Table 1 presents the various results. Table 1 Results from 1,000 replicates of bootstrapping Estimators Nˆ

Bootstrap Method 7,193

Bootstrap mean N * Bias ( N *  Nˆ )

7,214

nˆ0

3,931

Bootstrap variance Vˆ 2

20.97 58625.62

Normal theory

6,719-7,668

Percentile

6,782-7,761

Bootstrap-t

6,626-7,605

The following R commands were a part of computation of bootstrap samples to obtain confidence intervals. ### Suppose the outputs from mixture model are in the following column names #### ### colnames(Output) <- c('k', 'lamda', 'q', 'n0', 'N', 'LL', 'BIC') ### ###################################################################### boot.mixture.PoisZero <- function(i, ni, k, Rep, ci) { R <- Rep m <- length (i) n <- sum (ni) Sampl.data <- matrix(integer(m*R), nrow = R, ncol = m) lamda <- matrix (integer(k*R), nrow = R, ncol = k) q <- matrix (integer(k*R), nrow = R, ncol = k) N <- integer(R) n0 <- integer(R) LL <- integer(R) BIC <- integer(R) for (ii in 1:R) { Sampl.data[ii,] <- Sampling.data(i, ni) Result <- mixture.Pois.ZERO(i, c(Sampl.data[ii,]), k) lamda[ii,] <- Result[,2] q[ii,] <- Result[,3] n0[ii] <- Result[1,4] N[ii] <- Result[1,5] LL[ii] <- Result[1,6] BIC[ii] <- Result[1,7] }

Chareena Ujeh et al. / Procedia Computer Science 86 (2016) 216 – 219

############################################################# ##### 95% CI of lamda, n0 & N by Normal dist ###### ############################################################# alpha.L <- (100-ci)*0.01*0.5 alpha.U <- 1-((100-ci)*0.01*0.5) Z

<- qnorm(alpha.U, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

Output <- mixture.Pois.ZERO(i, ni, k) L.ci.Lamda.boot.norm <- Output [,2] - Z*sqrt(colVars(lamda)) U.ci.Lamda.boot.norm <- Output [,2] + Z*sqrt(colVars(lamda)) L.ci.n0.boot.norm <- Output [1,4] - Z*sqrt(var(n0)) U.ci.n0.boot.norm <- Output [1,4] + Z*sqrt(var(n0)) L.ci.N.boot.norm <- Output [1,5] - Z*sqrt(var(N)) U.ci.N.boot.norm <- Output [1,5] + Z*sqrt(var(N)) ############################################################# ##### 95% CI of lamda, n0 & N by precentile interval ###### ############################################################# percentile.lamda.boot <- matrix (integer(k*2), nrow = 2, ncol = k) for (j in 1:k) { percentile.lamda.boot[,j] <- quantile(lamda[,j], c(alpha.L, alpha.U)) } ci.n0.boot.percentile <- quantile(n0, c(alpha.L, alpha.U)) ci.N.boot.percentile <- quantile(N, c(alpha.L, alpha.U)) }

4. Discussion We can observe that 95% percentile confidence interval provides the narrow band while 95% bootstrap-t confidence interval gives the highest wide range, so 95% bootstrap-normal base seems to provide the moderate rang lying between two interval estimates. Indeed, there are several ways to estimate the population size, instead of mixture models. Some are based on homogeneous assumption, such as maximum likelihood approach; some are heterogeneous, such as Chao’s and Zelterman’s methods 2. Recommendation for further study could focus on other approaches and compare their performances if possible. The other mixtures, such as binomial, geometric, negative binomial, should be considered in future researches. Finally, the R program for bootstrapping data is available. Acknowledgements This study was partially supported for publication by the China Medical Board (CMB), Faculty of Public Health, Mahidol University, Bangkok, Thailand. References 1. Erfron, B,Bootstrap methods: another look at the jackknife. Ann. statist.7: 1-26. 2. Viwatwongkasem, C., Kuhnert, R., Satitvipawee, P. A comparison of population size estimators under the truncated count model with and without allowance for contaminations. Biometrical Journal 2008; 50: 1006-1021. 3. Viwatwongkasem, C., Satitvipawee, P., Jareinpituk, S., Soontornpipit, P. Mixture models for estimating the number of drug users in Thailand 2005-2007, Applied Mathematics 2013; 4: 1242-1250.

219