COMPUTATIONAL STATISTICS & DATAANALYSIS ELSEVIER
Computational Statistics & Data Analysis 27 (1998) 291-309
Estimating population proportions from imputed data Svein Nordbotten Department of Information Science, University of Bergen, N-5020 Bergen, Norway Received 1 April 1997; accepted 1 August 1997
Abstract
Imputation estimates based on imputed values obtained from neural network models used in an "impute first-aggregate next" approach, have been computed from Norwegian population census and administrative register data. The imputation estimates were compared with simple unbiased estimates obtained by the traditional "aggregate first-estimate next" approach and found to be superior for estimating proportions in small subgroups. Predictors for predicting the accuracy of such imputation estimates were proposed. Results are promising for estimating small subgroup or area proportions. (~) 1998 Elsevier Science B.V. All rights reserved.
Keywords: Imputation estimates; Non-linear imputation models; Neural networks; Estimate accuracy prediction
1. Introduction In Norway, an extensive amount of administrative register data for each inhabitant is available to the Central Bureau of Statistics (CBS). These were used to prepare statistics for the population Census in 1990. In addition to the register-based statistics, estimates of totals, averages, and proportions for other non-register attributes were computed from a population survey sample. For smaller subgroups and areas, the sample-based estimates frequently failed to satisfy established accuracy criteria for publication. More sophisticated estimators could be used, for example, ratio and regression estimators, taking advantage of possible functional relationships between survey variables and register variables. The traditional estimation approach would be to aggregate first-estimate next, i.e. aggregate values from a sample would be used for estimating the survey attribute statistics. In the case of more advanced estimators, aggregates for the register 0167-9473/98/$19.00 (~) 1998 Els~,vicr Science B.V. All rights reserved PH S 0 1 6 7 - 9 4 7 3 ( 9 8 ) 0 0 0 1 1 - 5
292
S. Nordbotten/ Computational Statistics & Data Analysis 27 (1998) 291-309
variables from both the sample and the remaining population would be used to improve the estimates for survey attribute statistics. However, if the nature of an assumed relationship between a survey variable and register variables is non-linear, the use of aggregates as arguments in the non-linear estimator will give estimates different from those obtained if individual imputations are computed first by means of non-linear functional relationship and then aggregated to the required statistics. The latter approach will be called the impute first-aggregate next approach. Assumed relationship between dependent survey variables and independent register variables can be estimated in different ways, for example, by means of methods from the statistical theory of regression or by means of iterative methods from the theory of parallel distributed processing (Rumelhart, 1986). The aim of this study is to investigate if results, particularly for small subpopulations, from a population census could be made more useful by applying an impute first-aggregate next approach based on imputation models with parameters computed by means of an algorithm for iterative approximation.
2. Experiments 2.1. Imputation models The paradigm of parallel distributed processing, or neural networks, have been intensively studied and improved during the last decade. The relationship between neural network theory and the theory of regression has also been highlighted (Cheng and Titterington, 1994). The possibility of using models based on Artificial Neural Networks (ANN) for imputing survey variable values for each individual not included in the sample to obtain individual data for the whole population, has been studied and reported in a recent study (Nordbotten, 1996) on which the present investigation is based. Two alternative sets of imputation models were studied, and are referred to in this paper as the linear and the non-linear imputation models. All imputation models investigated can be considered as functions imputing the individual values of unknown survey variables based on known register variables as arguments. In total, 49 survey variable functions were developed in each of the two sets of models. All functions had 96 register variables as arguments. The linear models comprise functions of the form ,r-96
y~j)' = 1/(1 + exp-[wok + ~i
• WikX(ill)
for j = 1,...,49,
where X(l),...,x(96) denote the known register values a n d Y(1),''',Y~49) are model predictions of the survey values Y~I). . . . . Y09) for an individual. The w-s are the parameters or weights to be computed. The functions are identical to logit regression functions (Cheng-Ming Kuan, 1994). These models are also known as simple feed-
S. Nordbotten/ Computational Statistics & Data Analysis 27 (1998) 291-309
293
forward network models with sigmoid transfer functions.~ The properties of these models have been extensively studied, and their limitations were earlier pointed out (Minsky and Papert, 1969). Most of the objections can be escaped by non-linear imputation models tt
25
*
96
y~j) = 1/(1 + exp -[w0j + S k wkj(1/1 + exp -[Wok + Si wikx(i)])]) for j -- 1,...,49. These models are known as feed-forward networks with a layer of K hidden neurons. The hidden neurons are also characterized by sigmoid transfer functions. The hidden neurons correspond to latent variables in the statistical theory. In this study, the parameters were computed from a sample by means of an iterative algorithm called Backpropagation (Rumelhart, 1986). In the case of a linear model function, experiments carried out indicate that this algorithm seems to give parameter values approaching those computed by means of the least-squares method. Following the computation of the parameters, the models were used to impute individual survey variable values for unobserved individuals which were finally aggregated together with the observed values to estimates for the population proportions.
2.2. Estimators and accuracy predictors Three different estimators were used in the experiment. First, ordinary simple, unbiased estimates for the population proportions were computed. These estimates were referred to as the unbiased estimates. The second set of estimates used imputations of individual survey variable values from a linear imputation model. As pointed out, these estimates can be considered as approximations to what would have been obtained if ordinary multiple regression had been used for estimating the imputation models. These estimates are called the linear imputation estimates. Finally, the nonlinear imputation estimates were computed using the individual imputation values from the non-linear imputation models. The accuracy of an estimate is defined as the deviation of its value from its target. An accuracy predictor is a method for determining margins for the accuracy of an estimate subject to a specified risk of error. Accuracy predictions are wanted by producers of statistics for quality declaration of the estimates and by users for evaluating the appropriateness of statistics for their specific purposes. The standard error estimator for an unbiased estimate from a sample of observations, is an example of a well-known method used for accuracy prediction. Without a reliable predictor for the accuracy of imputation estimates, these estimates would be of limited interest. Cross valuation methods have been proposed for computing accuracy predictions for the individual imputations when the number of observations are small and are all needed for training (Moody, 1993). In the present application, the population is relatively large and a sample of several thousand individual observations is not prohibitive. An independent sample not overlapping the training sample, was a 1The linear models are therefore not strictly linear.
294
S. Nordbotten I Computational Statistics & Data Analysis 27 (1998) 291-309
reasonable and straightforward solution applied for predicting the accuracy margins for the imputation estimates. The population was assumed divided randomly into three parts. Sample 1 was used for computing the parameters of the imputation models and Sample 2 for developing the accuracy predictors. The members of each of these two samples were assumed to be surveyed. Sample 3 represented the remaining population for which individual imputations were required. The estimators and accuracy predictors used were: (a) The unbiased estimator Pr of a population proportion P based on observations in Samples 1 and 2 Px(~yt + Sy2)/(nl + n2), where the subscript numbers refer to the corresponding samples. The standard error of Pr Sr = s ' v / ( 1 - f ) / ( n l + n2) was used as a predictor of accuracy for this estimate where f is the sampling fraction (nl + n2)/N. The standard deviation s of y was estimated from Samples 1 and 2 as s = ~/(S(yl - ~1) z + E(Y2 -- J)2)2)/(n1 + n2 - 1), where J~l and J~2 are then means of the variables y in Samples 1 and 2. The sampling error of s was ignored. The accuracy of Pr will decrease with decreasing sample size. The estimator Pr is not likely to be useful for small subpopulations. (b) The linear imputation estimator Pe of P includes three steps. First, the coefficients w0 and w a of the linear imputation models are computed from Sample 1. Then individual imputations y~ for all members of Sample 3 are computed by means of the imputation models. Finally, the individual observed values of Samples 1 and 2 and the imputed values for Sample 3 are aggregated to the imputation estimate PL = (Syl + Sy2 + Sy;)/N. Resampling of the training sample will normally give different PL estimates. The accuracy of PL is determined by two factors, the training Sample 1 and the individual imputation errors z3 = y~ - y3, which are due to the imperfections of the imputation models. An accuracy predictor for PL estimates can be obtained by rewriting the estimator PL -~ (~Yl + Sy2 + Sy3 )IN + z~z3/N. For any estimate PL based on imputations from a given instance of the training sample, say Sample 1, we can predict the accuracy by SL = RMSE* ( ( 1 - f ) / N , It
S. Nordbottenl Computational Statistics & Data Analysis 27 (1998) 291-309
295
where R M S E is the root-mean-square error of z in the population expressed by RMSEL =
V~3/N.
Since the R M S E L is measuring the deviations of the imputed values from their corresponding target values, errors due to the training sample will also be included. R M S E was estimated by computing imputed values y~ for the individuals of Sample 2 and use the imputation errors z2 in
rmse = ~/~z2/(n2 - 1). The sampling error of the estimate rmse2 was ignored. The Sr predictor has the root of n as a denominator, while the SL predictor has the root of N as denominator. The ratio n/N = f will always be equal to or less than 1. Only if
S < rmse* x/~ will the Pr estimates thus be preferable to PL estimates. (c) The imputation estimator PE of P, is similar to PL, but with the non-linear imputation models substituting the linear models used for PL The PE estimator is defined by
PE = (Sy~ + Sy2 + 2Sy;')/N. The accuracy metric SE for this PE estimate is computed as
SE = RMSE* V/(1 - f ) / N , where RMSEE is the root-mean-square error of the z-variables when derived from the non-linear models. RMSEE is estimated from deviations between the individual imputed values by the non-linear models and the corresponding targets in Sample 2. Assuming that the estimates discussed can be considered as events from normal distributions, it can be expected that for about two of three estimates the absolute deviations from their target value are equal or less than their accuracy predictions.
2.2.1. Data Data from a municipality population of 17 326 individuals were used for the experiments. This particular municipality had paid CBS to have a 100% survey. Both survey and register variable data were thus available for all inhabitants, and well suited for experimentation. To simulate the normal situation for which survey values were collected from a sample of the population, a random survey sample of 2007 individuals was drawn. From this sample, a subsample of 1845 records of survey and corresponding register values, Sample l, was randomly extracted. The records for the remaining 2 0 0 7 - 1845 = 162 individuals, Sample 2, were used for computing the accuracy predictors. The rest, 15 319 individuals, was called Sample 3 for which imputations were required.
296
S. Nordbotten/ Computational Statistics & Data Analysis 27 (1998) 291-309
The individual survey variable values for all individuals of Samples 1 and 2 were first aggregated and then blown up to traditional unbiased estimates of the population proportions. Ten linear and 10 non-linear imputation models were developed. Each model could simultaneously impute values for 2-9 survey variables, in total 49 variables, based on the values of 96 register variables. The 49 survey variables were variables observed, particularly, for the population census. The 96 register variables were among a larger set of register variables also used in the census processing. They were selected after a number of preliminary experiments. The linear models included from 194 to 873 parameters to be computed, while the non-linear models had the same 96 independent variables as input variables, but included in addition 25 latent variables which implied from 2477 to 2660 parameters to be determined. 2 Based on the sample of 1845 individuals, the parameter sets for each of the 20 models were computed by means of a standard Backpropagation algorithm. The linear and the non-linear imputation models were applied to impute alternative values for individuals in Samples 2 and 3. The individual observed values for individuals in Samples 1 and 2 plus the imputed values for individuals in Sample 3 were then aggregated to form imputation estimates of population proportions for survey variables. Finally, all sets of estimates were compared with target proportions computed from the available, but not used, observed survey variable values for the whole population. Box 1. Summary of population and samples used Total population:
N = 17 326
Sample Sample Sample Sample Sample Sample
nl -- 1845 n2 = 162 n 3 = 15 319 na = 18 from Samples 1+2 nb ----144 from Sample 3 n about 1845
1: 2: 3: a: b: la-le:
In addition to the above samples, the population of a small census tract was extracted to test the validity o f the estimates for an area with few inhabitants. This subpopulation included 162 individuals (it is a coincidence that this tract has the same size as Sample 2) of which 18 individuals belonged to Sample 1+2 and for which survey variable values therefore were assumed observed and available. These 18 individuals were referred to as Sample a while the remaining 144 of the census tract were denoted Sample b. 2 As for the intercepts in regressions, the imputation models required auxiliary variables with constant values 1. The number of parameters for a non-linear model is for example: Parameters=Latent variables*(Register variables+l)+Survey variables*(Latent variables +1).
S. Nordbotten/ Computational Statistics & Data Analysis 27 (1998) 291-309
297
Box 2. Notations used for variables y~, y2 and Y3: pl,j~2 and Y3: y~ and y~: z2 and z3: xl, x2 and x3:
individual observed survey variables in Samples 1-3. means of Yl, Y2 and Y3 in Samples 1-3. individual imputed values in Samples 2 and 3. individual differences ( y ~ - y2) and (y~ -Y3) in Samples 2 and 3. individual observed register variables in Samples 1-3.
To demonstrate the effects of a random training sample on the imputation estimates, five additional random and mutually exclusive samples of size about 1800 individuals were finally selected from Sample 3. These are referred to as Sample la-le. For each of these, independent imputation models were computed for a selection of survey variables.
3. Empirical analysis and tests 3.1. Preliminary tests
A number of empirical tests were carried out. First, the assumption that the estimated standard deviations Sy from Sample 1+2 are acceptable, estimates of the corresponding standard deviations for the whole population, was investigated. Table 1, Columns (1) and (2) display the standard deviations for Sample 1+2 and for Sample 3. The figures do not reveal any significant differences, and the standard deviations s estimated from Sample 1+2 were accepted for the following analysis. A second question raised was whether the imputation models produced biased estimates. For this purpose, averages of the z variables were computed from Samples 2 and 3 and reported in Columns (3), and (4) of Table 1. The average biases are small for most variables. There is no good correspondence among the averages of z in Samples 2 with the averages of Sample 3. It does not seem possible to make useful predictions for biases in Sample 3 from computations based on Sample 2 data. The accuracy predictor for imputation estimates required the R M S E of the deviations z. Comparison of the rmsel from Sample 1 in Column (5) with the rmse3 Column (7) from Sample 3 in, confirms previous experience that estimates from a training sample underestimate the rmse3. Column (6) shows the rmse2 estimated from predictions and target values of Sample 2. These estimates are close to the rmse3 from Sample 3, and were therefore used in the computation of accuracy measures for the estimated proportions. A comparison of the rmse in Columns (6) and (7) with the standard deviations for the unbiased estimates in Columns (1) and (2) show that rmse are significantly smaller indicating better accuracy of the imputation estimates. Experimental computations of estimates for R M S E by means of the crossvalidation for non-linear models proposed by Moody, were also carried out for some selected variables on observations in Sample 1 (Moody, 1993). A 10-fold
0.305 0.488 0.289 0.452 0.341 0.285 0.104 0.094 0.165 0.500 0.445 0.499 0.197 0.113 0.498 0.233 0.143 0.204 0.230 0.180 0.402 0.310
Cohabitance Nobody Spouse Cohabitant Children Parents Siblings In-laws Grandparents/-children Other
Paid work Paid work> = 100h Paid work< 100 h
Work association Employed Self-employed Family member
Income in week Work income No work income
Hours usually working Hours: 1-9 Hours: 10-19 Hours: 20-29 Hours: 30-34 Hours: 35-39 Hours: 40 and more
pl
p2
p7
p8
p9 0.149 0.200 0.213 0.162 0.405 0.291
0.496 0.245
0.498 0.210 0.084
0.500 0.450
0.300 0.491 0.279 0.461 0.338 0.279 0.092 0.086 0.144
(2)
(1)
Variable
Model
s-3
s-1 + 2
Table 1 Standard deviations for y, biases and rmse for z Obs. standard deviations Sample 1+2 Sample 3
-0.019 -0.011 0.015 0.002 0.052 -0.040
-0.002 -0.008
0.007 -0.020 0.003
0.013 -0.020
0.002 -0.023 -0.018 0.019 -0.002 -0.020 0.009 0.011 -0.011
(3)
z-2
-0.004 0.006 0.007 0.007 0.014 0.011
0.002 0.009
0.004 0.007 0.010
0.012 -0.011
-0.003 -0.004 0.009 -0.014 0.001 0.002 0.002 0.003 0.003
(4)
z-3
Bias from non-lin, imp. Sample 2 Sample 3
0.083 0.107 0.100 0.128 0.182 0.158
0.030 0.079
0.095 0.074 0.047
0.070 0.071
0.148 0.118 0.122 0.194 0.111 0.117 0.080 0.062 0.112
(5)
rinse-1
0.173 0.208 0.236 0.226 0.314 0.339
0.081 0.285
0.335 0.226 0.120
0.302 0.295
0.249 0.163 0.199 0.270 0.188 0.222 0.054 0.061 0.174
(6)
rinse-2
Sample 2
0.149 0.211 0.214 0.185 0.344 0.312
0.034 0.238
0.316 0.229 0.111
0.243 0.244
0.224 0.150 0.175 0.295 0.213 0.213 0.108 0.095 0.169
(7)
rinse-3
Sample 3
R M S E from non-lin, imp.
Sample 1
'~
"& ~3
,,q
~.
g~
6"" ~,
t~
~-
.~
tJ
0.145 0.217 0.231 0.177 0.380 0.324 0.099
0.487 0.174 0.178
0.199 0.131 0.125 0.229 0.455
0.407 0.267 0.174 0.176 0.092 0.086
0.458 0.115 0.113 0.044 0.254 0.109
Working hours in week Hours: 1-9 Hours: 10-19 Hours: 20-29 Hours: 30-34 Hours: 35-39 Hours: 40 and more Not working
Work location in week Usual location One different location Different locations
Travel trips in week None Not at home One-trip Two--three trips Four or more trips
One-way travel time Minutes: <15 Minutes: 15-29 Minutes: 30-44 Minutes 45-59 Minutes: 60-89 Minutes: 90 or more
Means of transportation Car Bus Train Boat Bicycle Other
pl0
pll
p12
p13
p14 0.446 0.090 0.097 0.020 0.246 0.098
0.396 0.263 0.173 0.154 0.092 0.072
0.206 0.131 0.115 0.209 0.449
0.482 0.174 0.171
0.147 0.203 0.220 0.161 0.384 0.311 0.099
-0.092 0.016 -0.000 -0.001 0.053 0.007
-0.080 0.008 0.001 0.028 0.003 -0.003
-0.005 0.017 -0.015 -0.016 -0.046
-0.050 -0.005 0.043
-0.004 -0.003 -0.021 -0.020 0.007 0.008 -0.005
-0.025 0.009 0.005 0.000 0.029 0.006
-0.044 0.006 0.007 0.027 0.010 0.006
0.001 0.019 0.003 0.013 -0.018
-0.024 0.003 0.030
-0.003 0.002 -0.005 0.001 0.015 0.027 0.000
0.213 0.084 0.072 0.004 0.162 0.082
0.090 0.077
0.140
0.243 0.182 0.134
0.105 0.088 0.087 0.122 0.175
0.238 0.154 0.183
0.091 0.133 0.139 0.137 0.205 0.177 0.024
0.331 0.097 0.088 0.013 0.259 0.044
0.411 0.279 0.207 0.192 0.112 0.113
0.192 0.121 0.137 0.237 0.348
0.270 0.200 0.200
0.177 0.198 0.216 0.201 0.327 0.320 0.077
0.369 0.117 0.121 0.021 0.292 0.124
0.384 0.291 0.203 0.202 0.097 0.080
0.224 0.174 0.137 0.233 0.373
0.310 0.202 0.236
0.152 0.202 0.222 0.180 0.361 0.332 0.034
Y,
-,.4
R*
300
S. Nordbotten/ Computational Statistics & Data Analysis 27 (1998) 291-309
cross-validation was used. The method implied the division of Sample 1 into 10 nonoverlapping random subsamples of approximately equal sizes. Ten different sets of parameter estimates for each imputation model were trained by leaving one different subsample out each time. For each parameter estimate set, the mean square error of z were computed for the left out subsample. The 10 sets of mean square errors were averaged, and finally the cross-validation estimates rmsecross computed. The results of the cross-validation computations for the first imputation model, shown in Box 3, indicated that the rmse2 estimates from Sample 2 gave results closer to the rmse3 of Sample 3, than did the cross-validation rmsecr estimates from Sample 1. Still, with few observations available, cross-validation can obviously be a useful method for estimating accuracy predictions. Box 3. Root-mean-square error estimates from Sample 1 by means of non-linear cross-validation and from Samples 2 and 3 by ordinary computation Variable Cohabitance
Sample 1 rmsecross
Sample 2 rmse2
Sample 3 rmse3
Nobody Spouse Cohabitant Children Parents Siblings In-laws Grandparents/-children Other
0.248426 0.142619 0.187668 0.315753 0.227954 0.238513 0.120515 0.110045 0.214668
0.249405 0.163140 0.199472 0.269644 0.187937 0.222370 0.053572 0.061185 0.174427
0.250639 0.163947 0.200459 0.270978 0.188867 0.223470 0.053837 0.061488 0.175290
3.2. Estimates f o r the municipality population
Column (1) of Table 2, give the target proportions from Population Census data for the total municipality, while Columns (2)-(4) present the unbiased, the linear and the non-linear imputation estimates of the proportions, respectively, for the whole population. Estimates from all three estimators gave values very close to the target proportions. Columns (5)-(7) of Table 2 give the accuracy predictions for the three sets of estimates while the following Columns (8)-(10) give the corresponding relative predictions. Inspection of the figures reveal higher absolute as well as relative predicted accuracy for the imputation estimates than for the unbiased estimates. However, for a population and a sample of the sizes used, all estimators will in general give good results. 3.3. Estimates f o r the census tract
Table 3 displays the application of the three estimators on a census tract area with only 162 inhabitants of which 18 were identified belonging to Sample a for
S. Nordbottenl Computational Statistics & Data Analysis 27 (1998) 291-309
301
which observations were assumed available. The four Columns (1)-(4) show the target proportions, the unbiased, the linear and the non-linear imputation estimates. The target proportions were computed from the sums of all 162 observations in Sample a+b. The unbiased estimates were based on the observations in Sample a, while the linear and the non-linear imputation estimates were aggregated from the sums of observations from Sample a and the sums of the 144 individual imputations made for Sample b. Inspection of the estimates shows that the linear imputation estimates are in average much closer to the target proportions than the unbiased estimates and the non-linear imputation estimates are, in general, much closer to the target proportions than the linear imputation estimates. As could be expected because of the small size of Sample a, the unbiased estimates were very unreliable. Predictions of the relative estimate accuracies were computed based on the predictors discussed above, and are reported in Columns (5)-(7) in Table 3. The corresponding actual relative deviations between estimates and target proportions are given in Columns (8)-(10). Boxes 4 - 6 demonstrate the validity of the accuracy predictors for the unbiased, linear and non-linear imputation estimates, respectively. Assume that the publication principle is to publish only estimates which are predicted to deviate less than +20% from the target value with the risk that one out of three predictions is incorrect. The sum in the first column of Box 4 shows that the unbiased estimator provided 14 estimates which satisfied the requirement for publication. The linear and the non-linear imputation estimators gave 13 and 26 estimates which satisfied the publication criteria according to the sums in the first columns of Boxes 5 and 6, respectively. Box 4. Predicted and observed relative accuracy of the 49 unbiased estimates Observed < 0.2 < 0.2
> = 0.2
Sum
0
2
2
> = 0.2
14
23
47
Sum
14
25
49
Predicted
It was pointed out above, that the practical value of an estimator, however, depends on the possibility to predict the accuracy of the estimates produced. The first three boxes illustrate the success of the accuracy predictors to predict which estimates should be published. Only 2 unbiased estimates were predicted to satisfy the publication condition and both predictions were incorrect when compared with the actual targets, Fourteen
0.100 0.403 0.086 0.304 0.132 0.086 0.009 0.008 0.022
0.505 0.281
0.452 0.045 0.008
0.442 0.063
0.023 0.043 0.048 0.028 0.206 0.095
Paid work Paid work> =100h Paid work < 100 h
Work association Employed Self-employed Family member
Income in week Work income No work income
Hours usually working Hours: 1-9 Hours: 10-19 Hours: 20-29 Hours: 30-34 Hours: 35-39 Hours: 40 and more
Target P (1)
Cohabitance Nobody Spouse Cohabitant Children Parents Siblings In-laws Grandparents/children Other
Variable
0.021 0.043 0.056 0.033 0.203 0.108
0.459 0.058
0.464 0.040 0.013
0.516 0.273
0.104 0.391 0.092 0.290 0.134 0.089 0.011 0.009 0.028
P- Y (2)
0.006 0.024 0.043 0.015 0.234 0.049
0.435 0.046
0.455 0.032 0.008
0.497 0.297
0.108 0.414 0,075 0.310 0.135 0.104 0.004 0.005 0.006
Estimates P-L (3)
0.013 0.041 0.049 0.022 0.217 0.093
0.442 0.064
0.461 0.039 0.009
0.510 0.275
0.096 0.406 0,082 0.307 0.129 0.084 0.005 0.003 0.017
P-E (4)
0.003 0.004 0.005 0.004 0.008 0.007
0.010 0.005
0.010 0.004 0.002
0.010 0.009
0.006 0.010 0.006 0.009 0.007 0.006 0.002 0.002 0.003
0.001 0.002 0.002 0.002 0.003 0.002
0.001 0.002
0.002 0.002 0.001
0.002 0.002
0.002 0.001 0.002 0.002 0.002 0.002 0.001 0.001 0.001
0.001 0.001 0.002 0.002 0.002 0.002
0.001 0.002
0.002 0.002 0.001
0.002 0.002
0.002 0.001 0.001 0.002 0.001 0.002 0.000 0.000 0.001
Accuracy for estimates S -L S -E S- Y (6) (7) (5)
Table 2 Unbiased, linear and non-linear imputed estimates with accuracy measures
0.144 0.099 0.086 0.113 0.042 0.060
0.023 0.085
0.023 0.102 0.183
0.020 0.034
0.062 0.026 0.066 0.033 0.053 0.067 0.199 0.221 0.124
S-Y/P-Y (8)
0.067 0.038 0.031 0.046 0.014 0.015
0.002 0.039
0.005 0.045 0.044
0.005 0.007
0.021 0.004 0.017 0.008 0.011 0.021 0.091 0.128 0.036
0.097 0.036 0.035 0.074 0.010 0.026
0.001 0.032
0.005 0.041 0.096
0.004 0.008
0.019 0.003 0.017 0.006 0.010 0.019 0.071 0.151 0.072
Relative accuracy S-L/P-L S-E/P-E (9) (10)
x.i
~r,.
~,
-~ ~"
t,~
0.044 0.017 0.014 0.047 0.281
0.196 0.075 0.031 0.025 0.008 0.005
Travel trips in week None Not at home One trip Two-three trips Four or more trips
One way Minutes: Minutes: Minutes: Minutes: Minutes: Minutes:
Means o f transportation Car Bus Train Boat Bicycle Other
0.277 0.009 0.010 0.001 0.065 0.010
0.371 0.031 0.030
Work location in week Usual location One different location Different locations
travel time < 15 15-29 30-44 45-59 60 89 90 or more
0.022 0.043 0.052 0.027 0.179 0.110 0.010
Working hours in week Hours: 1-9 Hours: 10-19 Hours: 2 0 - 2 9 Hours: 3 0 - 3 4 Hours: 3 5 - 3 9 Hours: 40 and more Not working
0.299 0.013 0.013 0.002 0.069 0.012
0.210 0.077 0.031 0.032 0.008 0.007
0.041 0.017 0.016 0.055 0.292
0.388 0.031 0.033
0.021 0.049 0.057 0.032 0.175 0.120 0.010
0.259 0.002 0.009 0.105 0.029 0.009
0.108 0.018 0.010 0.034 0.003 0.005
0.015 0.015 0.012 0.025 0.257
0.415 0.012 0.008
0.006 0.012 0.045 0.016 0.131 0.116 0.006
0.261 0.009 0.008 0.000 0.075 0.008
0.143 0.053 0.024 0.032 0.001 0.001
0.032 0.022 0.011 0.046 0.269
0.355 0.021 0.044
0.013 0.036 0.036 0.016 0.191 0.122 0.009
0.010 0.002 0.002 0.001 0.005 0.002
0.009 0.006 0.004 0.004 0.002 0.002
0.004 0.003 0.003 0.005 0.010
0.010 0.004 0.004
0.003 0.005 0.005 0.004 0.008 0.007 0.002
0.003 0.001 0.001 0.002 0.002 0.000
0.003 0.002 0.001 0.002 0.001 0.001
0.001 0.001 0.001 0.002 0.003
0.002 0.001 0.002
0.002 0.002 0.002 0.001 0.003 0.002 0.001
0.002 0.001 0.001 0.000 0.002 0.000
0.003 0.002 0.001 0.001 0.001 0.001
0.001 0.001 0.001 0.002 0.002
0.002 0.001 0.001
0.001 0.001 0.002 0.001 0.002 0.002 0.001
0.032 0.180 0.183 0.469 0.077 0.191
0.041 0.073 0.117 0.116 0.227 0.242
0.101 0.157 0.165 0.087 0.033
0.026 0.117 0.114
0.142 0.092 0.086 0.115 0.046 0.057 0.209
0.009 0.061 0.089 1.119 0.026 0.000
0.016 0.032 0.045 0.051 0.068 0.077
0.034 0.066 0.081 0.031 0.010
0.005 0.037 0.046
0.071 0.031 0.032 0.044 0.017 0.018 0.082
0.009 0.079 0.074 0.207 0.025 0.038
0.021 0.038 0.062 0.043 0.815 0.637
0.042 0.039 0.093 0.036 0.009
0.005 0.067 0.032
0.097 0.040 0.042 0.089 0.012 0.019 0.060
I-,a
t-,a -,q
e~
5.
r~
Variable
Cohabitance Nobody Spouse Cohabitant Children Parents Siblings In-laws Grandparents/children Other
Paid work Paid work> =100h Paid work< 100h
Work association Employed Self-employed Family member
Income in week Work income No work income
Hours usually working Hours: 1-9 Hours: 10-19 Hours: 20-29 Hours: 30-34 Hours: 35-39 Hours: 40 and more
Model
p1
p2
p7
p8
p9
0.049 0.056 0.049 0.037 0.272 0.086
0.488 0.062
0.488 0.062 0.012
0.537 0.290
0.056 0.056 0.056 0.056 0.278 0.167
0.389 0.167
0.444 0.111 0.056
0.500 0.333
0.167 0.333 0.278 0.278 0.278 0.167 0.111 0.111 0.111
(2)
(1)
0.142 0.475 0.105 0.296 0.130 0.062 0.037 0.037 0.049
P- Y
P
Target
0.006 0.012 0.062 0.006 0.259 0.037
4.488 0.074
0.506 0.056 0.006
0.543 0.278
0.111 0.475 0.074 0.333 0.142 0.080 0.012 0.019 0.012
(3)
P-L
Estimates
0.012 0.068 0.051 0.031 0.258 0.109
0.481 0.093
0.494 0.070 0.014
0.556 0.256
0.109 0.472 0.109 0.323 0.135 0.073 0.031 0.030 0.029
(4)
P-E
Table 3 Simple, unbiased, linear and non-linear estimates for census tract
0.572 0.814 0.918 0.719 0.321 0.413
0.285 0.311
0.249 0.393 0.452
0.222 0.297
0.407 0.325 0.231 0.362 0.272 0.380 0.208 0.189 0.329
(5)
2.369 1.368 0.290 2.558 0.115 0.456
0.002 0.312
0.049 0.340 0.967
0.047 0.074
0.201 0.031 0.213 0.074 0.111 0.247 0.837 0.645 0.837
(6)
S-L/P-L
1.091 0.227 0.343 0.535 0.090 0.230
0.013 0.228
0.050 0.239 0.619
0.040 0.085
0.169 0.026 0.135 0.062 0.103 0.225 0.129 0.152 0.450
(7)
S-E/P-E
Predictions of rel. accuracy
S-Y/P-Y
0.125 0.000 0.125 0.500 0.023 0.929
0.203 1.700
0.089 0.800 3.500
0.069 0.149
0.174 0.299 1.647 0.062 1.143 1.700 2.000 2.000 1.250
(5)
0.875 0.778 0.250 0.833 0.046 0.571
8.203 0.200
0.038 0.100 0.500
0.011 0.043
0.217 0.000 0.294 0.125 0.095 0.300 0.667 0.500 0.750
(6)
1(3)-(1)1/(1)
0.762 0.221 0.031 0.155 0.050 0.263
0.013 0.500
0.013 0.133 0.167
0.036 0.118
0.230 0.006 0.043 0.089 0.042 0.186 0.167 0.194 0.419
(7)
f(4)-(1)l/(1)
Relative absolute deviation
1(2)-(1)1/(1)
Na
"xl
g~
g~
0.105 0.037 0.025 0.068 0.265
0.216 0.080 0.031 0.025 0.025 0.019
Travel trips in week None Not at home One trips Two-three trips Four or more trips
One way Minutes: Minutes: Minutes: Minutes: Minutes: Minutes:
Means o f transportation Car Bus Train Boat Bicycle Other
p12
p13
p14
0.309 0.025 0.037 0.025 0.080 0.025
0.414 0.049 0.049
Work location in week Usual location One different location Different locations
pll
travel time < 15 15-29 30 44 45-59 60-89 90 or more
0.037 0.068 0.062 0.031 0.247 0.105 0.012
Working hours in week Hours: 1-9 Hours: 10-19 Hours: 2 0 - 2 9 Hours: 3 0 - 3 4 Hours: 3 5 - 3 9 Hours: 40 and more Not working
pl0
0.389 0.111 0,111 0.111 0.111 0.111
0,167 0.111 0,167 0.056 0.056 0.056
0.056 0.056 0.056 0.111 0.278
0.333 0.056 0.111
0.056 0.056 0,056 0.056 0.278 0,167 0.056
0.272 0.019 0.037 0.173 0.049 0.025
0.148 0.025 0,025 0.043 0.006 0.006
0.025 0.019 0.019 0.037 0.278
0.469 0.006 0.019
0.006 0.006 0,074 0.012 0.167 0.080 0,006
0.311 0.031 0.043 0.024 0.108 0,028
0.148 0.056 0.025 0.037 0.006 0.006
0.049 0.019 0.012 0.068 0.296
0.418 0.025 0.065
0.021 0,049 0.056 0.022 0,241 0.114 0.007
0.262 0.230 0,226 0.089 0.508 0.217
0,543 0.534 0.232 0.703 0.367 0.344
0.796 0.523 0.501 0.457 0,364
0.325 0.697 0.357
0.579 0.866 0.926 0.708 0.304 0.433 0.397
0.098 0.456 0.322 0,134 0.382 0.000
0.231 1.026 0.592 0.391 0.967 0.967
0.592 0.645 0.721 0,483 0.114
0.044 1.934 0.853
2.558 2,558 0,255 1.184 0,186 0.278 1.368
0.079 0.232 0.154 0.041 0.178 0.117
0.206 0.372 0,622 0,384 1.343 1.358
0.288 0.485 0.824 0.259 0.087
0.048 0.595 0.227
0,624 0,296 0,284 0.669 0.100 0.208 0.821
0.260 3.500 2.000 3.500 0.385 3.500
0.229 0.385 4.400 1.250 1.250 2.000
0.471 0.500 1.250 0.636 0,047
0.194 0.125 1.250
0.500 0.182 0.100 0.800 0.125 0.588 3.500
0,120 0.250 0.000 6.000 0.385 0.000
0.314 0.692 0.200 0.750 0.750 0.667
0,765 0.500 0.250 0.455 0.047
0.134 0,875 0.625
0.833 0.909 0.200 0.600 0.325 0.235 0.500
0.009 0.250 0.150 0.032 0.343 0,124
0.314 0.308 0.200 0.500 0.750 0.667
0.529 0,500 0.500 0.000 0.116
0.010 0.496 0.319
0.431 0.272 0.086 0.277 0.022 0.083 0.437
tan
t,,a ,.q
g~
r~
306
S. Nordbotten I Computational Statistics & Data Analysis 27 (1998) 291-309
unbiased estimates were predicted to be too inaccurate for publication while they, in fact, satisfied the publication requirement. The predictor for the linear imputation estimates identified 14 estimates which should deviate less than 20% of which 3 were incorrect predictions. On the other side, only two estimates which should have been accepted were rejected by the prediction. For the non-linear estimates, the predictions implied that 20 estimates should pass the publication criterion of which two were incorrect when confronted with the actual target values. Among the 29 non-linear imputation estimates predicted to have a relative accuracy not satisfying the requirement for publication, eight proved to be acceptable when compared with the actual targets. These results indicate that the non-linear imputation estimates for the proportions of the census tract are significantly closer to the target proportions than both the unbiased and the linear imputation estimates. Also the accuracy predictions for the imputation estimates seem to be significantly more reliable than those for the unbiased estimates. Box 5. Predicted and observed relative accuracy of the 49 linear estimates Observed < 0.2 < 0.2
> = 0.2
Sum
11
3
14
2
33
35
13
36
49
Predicted > - - 0.2 Sum
Box 6. Predicted and observed relative accuracy of the 49 non-linear estimates Observed < 0.2 < 0.2
> = 0.2
Sum
18
2
20
8
21
29
26
23
49
Predicted > = 0.2 Sum
The above analysis has, in part, been repeated for the 55 other census tracts in the same municipality. The analysis for these areas supported the results reported above. To test if the imputation models could be used outside the municipality population from which the training sample was drawn, a second municipality from another part of the country and with a different socioeconomic structure, was studied. This
S. Nordbotten/ Computational Statistics & Data Analysis 27 (1998) 291-309
307
second municipality was also surveyed 100% in 1990. It comprised 44 census tracts and had only 230 inhabitants in average per tract. In the experiment, it was assumed that no sample survey had been carried out in this second municipality. No unbiased estimates could therefore be computed, and all individual values for the survey variables had therefore to be imputed. The non-linear imputation models developed for the first municipality were used for this purpose. A relative high number of the estimates for the small areas in this second municipality also satisfied the publication requirements. The accuracy predictions seemed to be almost as promising as those reported in Box 6 above for the first municipality. A detailed report on these small area experiments will be published in a future paper.
3.4. Effects from the random selection of the training sample It was pointed out above that the accuracy predictions used in the previous paragraphs include the errors due to random selection of the training sample. However, resampling of the training sample can produce more or less accurate imputation estimates with corresponding accuracy measures. Box 7 illustrates the variations in the PE estimates for the cohabitance proportions in the census tract based on imputation models with parameters computed from Samples 1 and 5 other mutually exclusive random training samples. Box 7. Estimates for census tract based on different training samples Cohabitance Nobody Spouse Cohabitant Children Parents Siblings In-laws Grandpar./-child Other
Sample lc
1
la
lb
0.109 0.472 0.109 0.323 0.135 0.073 0.031 0.030 0.029
0.117 0.475 0.093 0.370 0.160 0.049 0.025 0.025 0.025
0.093 0.469 0.105 0.358 0.148 0.080 0.025 0.025 0.031
0.117 0.481 0.086 0.370 0.148 0.031 0.25 0.025 0.031
Target ld
le
0,111 0.481 0.080 0.296 0.130 0.043 0.056 0.037 0.025
0.117 0.463 0.086 0.315 0.142 0.062 0.019 0.019 0.012
0.142 0.475 0.105 0.296 0.130 0.062 0.037 0.037 0.049
From the figures we can see that the 9 different training samples resulted in different final results. The estimates with the smallest deviation from their target proportions are marked. The figures indicate that Sample 1 did not produce significantly better or worse estimates than the other 5 training samples. Box 8 shows similar information about the Means of transportation estimates from the different training models. Also, these figures support the assumption that
308
S. Nordbotten I Computational Statistics & Data Analysis 27 (1998) 291-309
different random training samples produce different imputation models, but with sample of the size used, the variations in the final imputation estimates are moderate. Box 8. Estimates for census tract based on different training samples Means of transportation Car Bus Train Boat Bicycle Other
Sample lc
1
la
lb
0.311 0.031 0.043 0.024 0.108 0.028
0.358 0.025 0.025 0.019 0.086 0.025
0.340 0.025 0.025 0.025 0.031 0.025
0.259 0.031 0.031 0.019 0.062 0.025
Target ld
le
0.191 0.025 0.025 0.025 0.031 0.025
0.364 0.019 0.025 0.025 0.031 0.025
0.309 0.025 0.037 0.025 0.080 0.025
4. Conclusions
The main objective of this study was to evaluate the impute first-aggregate next estimators for proportions based on imputed values for small areas with few inhabitants. Individual imputations were obtained by means of models estimated from data for the individuals in a sample supplemented with administrative data for all individuals. The models used were ANN feed-forward models. The investigation indicated that imputations used in the impute first-aggregate next estimators, improve the results compared with those obtained by means of ordinary estimators. Accuracy predictors of these estimates were developed. In the same way as the standard error is used as an accuracy predictor for unbiased estimates, predictors for imputation estimates can be based on the root-mean-square error for individual imputations. The empirical study indicated that satisfactory predictors for the accuracy of imputation estimates can be based on a small sample independent of the sample used for estimating the parameters of the models. The reliability of accuracy predictions of imputation estimates seems to justify practical use.
Empirical computations for a small census tract illustrate that linear imputation estimates were significantly closer to the target proportions than corresponding unbiased estimates. The non-linear imputation estimates were even better. The use of non-linear imputation estimates should permit a substantial increase in the number of statistics which could be published, and provide a new readiness to prepare statistics on request. The effects of the random training sample on the accuracy of the estimates were investigated by resampling and recomputing the models. For training samples as large as the one used in the experiments, the sampling should not be expected to have a significant influence on the results. The predicted accuracy measures will always reflect the error effects due to the training sample.
S. Nordbotten I Computational Statistics & Data Analysis 27 (1998) 291-309
309
Acknowledgements This research has been carried out under a cooperation contract with the Norwegian Central Bureau of Statistics, and as part of the Statistical Information System (SIS) project at the Department of Information Science, University of Bergen. I am grateful for a number of suggestions offered by Professor Ib Thomsen, Research Director, Central Bureau of Statistics, Oslo.
References Cheng Bing, Terrington, D.M., 1994. Neural networks - a review from a statistical perspective. Statist. Sci. 9 (1) 2-54. Cheng-Ming Kuan, White, H., 1994. Artificial neural networks: an econometric perspective. Econom. Rev. 13(1) 1-91. Minsky, M., Papert, S., 1969. Perceptrons - an introduction to computational geometry. MIT Press, Cambridge, MA. Moody, J.E., 1993. Prediction risk and architecture selection for neural networks. In: Charkassy, V., Friedman, J.H., Wechsler, H. (Eds.), Statistics to Neural Networks. Theory and Pattern Recognition Applications. Springer, Berlin. Nordbotten, S., 1996. Neural network imputation applied on Norwegian 1990 Population census data utilizing administrative registers. J. Official Statist. 12 (4) 385-401. Rumelhart, D.E., McCleUand, J.L., 1986. Parallel distributed processing. The MIT Press, Cambridge, MA.