Statistics & Probability Letters 46 (2000) 337 – 345
Nonparametric estimation of individual food availability along with bootstrap confidence intervals in household budget surveys
V.G.S. Vasdekis a,∗, A. Trichopoulou b
a Department of Statistics, Athens University of Economics and Business, 76 Patision Street, 10434 Athens, Greece
b Department of Nutrition and Biochemistry, National School of Public Health, 196 Alexandras Avenue, 11521 Athens, Greece
Received May 1998; received in revised form May 1999
Abstract
An additive nonparametric model, whose estimation reduces to least squares, is proposed for the analysis of household budget survey data. The parameter estimates are biased. A first-order approximation for the bias is obtained and used to bias-correct the residuals of the model in order to construct bootstrap confidence intervals for the model parameter estimates. The results show somewhat shorter intervals, with less bias, than pointwise intervals based on a normal approximation. © 2000 Published by Elsevier Science B.V. All rights reserved.
Keywords: Nonparametric models; Least squares; Bias; Bootstrap confidence intervals
1. Introduction
A problem of great interest among nutrition researchers is the estimation of individual consumption per day for a series of food items. Traditionally, this is accomplished by using individual surveys in which every individual records the amounts of food he or she consumes within or outside the household. However, surveys of this type face a number of problems, such as the expense of conducting them and their susceptibility to underreporting. In recent years, there has been growing interest in exploiting data from household budget surveys (HBS). HBS are conducted regularly in many countries by the National Statistical Offices and their aim is, among other things, to record the amount of food brought into households, providing a measure of food availability for each household. Individual and household characteristics are also recorded in an attempt to link availability with demographic and socio-economic factors. Although there are problems of interpretation, these surveys are affordable and easily conducted. Therefore, exploiting the information they contain to derive individual estimates of availability is appealing.
∗ Corresponding author.
0167-7152/00/$ - see front matter © 2000 Published by Elsevier Science B.V. All rights reserved.
PII: S0167-7152(99)00120-0
Recently, Chesher (1997), building on the study by Engle et al. (1986), derived estimates of daily individual availability by age and sex from HBS data using a semi-parametric model. In the framework of seeking tools for the most efficient use of such data, this paper has two objectives. One is to propose an alternative model for the estimation of the individual availability curve by age and sex. The other, and more important, is to use the bootstrap to derive confidence intervals, thus providing a good measure of curve variability, something on which Chesher's approach did not focus. In Section 2, the model is described and estimation is accomplished using spline smoothing. In Section 3, a bootstrap approach is used for the derivation of confidence intervals and finally, in the last section, results are presented from data of the Greek 1993–1994 HBS.
2. Model, estimation and bias
One of the key assumptions concerning this approach to the data is that during the survey period, which is usually at least one week, the expected amount of food entering the household reflects the expected consumption of food. Thus, estimates of individual availability can be considered as estimates of individual consumption. Let us consider household $i$, $i = 1, \ldots, n$, of $m_i$ members, each of which has an average availability of a food item equal to $f(c_{ij}, z_i)$, $j = 1, \ldots, m_i$, where $c_{ij}$ is a vector of personal characteristics such as age, sex, education level or occupation and $z_i$ is a vector of household characteristics such as location of household, household income or an index such as the proportion of household food expenditure to the total household expenditure during the recording period. As regards the function $f$, Chesher (1997) assumed that the household and personal characteristics affect the deterministic part of availability multiplicatively, so that $f(c_{ij}, z_i) = f(c_{ij}) g(z_i)$.
If $c_{ij}$ represents age only and $z_i$ represents income, for example, then, under this assumption, a change in $g$ causes availability of different age groups to be affected proportionally. In this paper, we shall consider an additive approach in which
$$f(c_{ij}, z_i) = f(c_{ij}) + g(z_i). \tag{1}$$
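The contrast between the two assumptions can be made concrete with a minimal Python sketch. All numbers below (the age effects in `f_age` and the household effects) are hypothetical values chosen only for illustration; they do not come from the survey.

```python
import numpy as np

# Hypothetical mean availabilities f(c) for three age groups.
f_age = np.array([40.0, 80.0, 60.0])

# Hypothetical household effects for a low- and a high-income household.
g_low_mult, g_high_mult = 1.0, 1.5   # multiplicative form: f(c) * g(z)
g_low_add, g_high_add = 0.0, 20.0    # additive form (1):   f(c) + g(z)

# Change in availability when moving from the low to the high household effect.
mult_change = f_age * g_high_mult - f_age * g_low_mult      # proportional to f(c)
add_change = (f_age + g_high_add) - (f_age + g_low_add)     # same shift for all ages
```

Under the multiplicative form the change is a fixed fraction of each age group's availability, while under the additive form every age group moves by the same amount.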
Under this modelling, a change in $g$ causes availability of different age groups to be affected by the same value, something which is more restrictive than the multiplicative assumption. If, however, availability across age groups does not vary much, the differences between the two models may not be notable. Furthermore, the additive assumption leads to a simple least-squares solution, in contrast with the multiplicative one which, whether we use a parametric or a nonparametric model for $g$, leads to nonlinear estimation. Thus, the additive model can be considered a useful alternative to the multiplicative one. Under these assumptions, the average household availability is the sum of the average individual availabilities:
$$E(y_i \mid C_i, z_i) = \sum_{j=1}^{m_i} f(c_{ij}) + m_i\, g(z_i), \tag{2}$$
where $C_i = (c_{i1}, \ldots, c_{im_i})^T$ is a matrix of the individual characteristics of all members of the household. The model for the mean availability is completed by assuming a model for $g$. For this, we shall assume that $z_i$ is written in the form $(z_{di}, z_{ci})$, representing the discrete and the continuous part of the household characteristics, respectively. The vector $z_i$ affects mean availability through $g$, which is written as
$$g(z_{di}, z_{ci}) = z_{di}^T \beta_d + g_0(z_{ci}), \tag{3}$$
where $\beta_d$ is a vector of unknown coefficients and $g_0$ is a function of the continuous part of the covariates that needs to be estimated. To proceed with the estimation we use the same discretization argument found in Chesher (1997). Let us denote by $a_M$ and $a_F$ the range of ages for males and females, respectively, which we wish to include in the model. For example, if we wish to analyse all males from newborns up to 80 years of age,
then $a_M = 81$. It is not difficult to show that the mean availability for household $i$ is
$$E(y_i \mid C_i, z_i) = \beta_0 + \nu_{Mi}^T \beta_M + \nu_{Fi}^T \beta_F + m_i \left( z_{di}^T \beta_d + z_{ci}^T \beta_c \right), \tag{4}$$
where $\nu_{Mi}$ and $\nu_{Fi}$ are $a_M \times 1$ and $a_F \times 1$ vectors of counts of the number of males and females of household $i$ falling into each age category, respectively, and $z_{di}$ and $z_{ci}$ are $d \times 1$ and $c \times 1$ vectors of dummy variables, respectively. The former represents the discrete part of the household characteristics, while the latter represents the discretized version of the continuous part. $\beta_0$, $\beta_M$, $\beta_F$, $\beta_d$ and $\beta_c$ are coefficients to be estimated. The model is complete if we assume that the actual response is given by (4) plus an error with zero mean and variance $\sigma^2$. Model (4) assumes that within a household each individual's availability depends on his or her sex and age. Differences between individuals living in different households appear only through $\beta_d$ and $\beta_c$. Since $\beta_M$, $\beta_F$ and $\beta_c$ are coefficients representing discretized versions of functions of mean availability, estimation can be based on spline smoothing (Green and Silverman, 1994; Engle et al., 1986). The estimation problem is the minimization of
$$(y - X\beta)^T (y - X\beta) + \lambda_M^2\, \beta_M^T A_M^T A_M \beta_M + \lambda_F^2\, \beta_F^T A_F^T A_F \beta_F + \lambda^2\, \beta_c^T A_c^T A_c \beta_c, \tag{5}$$
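where $\lambda_M$, $\lambda_F$ and $\lambda$ are penalty coefficients, $X$ is an $n \times l$ design matrix with $l = a_M + a_F + c + d + 1$, $\beta^T = (\beta_0, \beta_M^T, \beta_F^T, \beta_d^T, \beta_c^T)$ is a $1 \times l$ vector of unknown coefficients and
$$A_s = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1 \end{pmatrix}$$
are $(\dim(s) - 2) \times \dim(s)$ matrices of second differences. Matrix $X$ is written as
$$X = \begin{pmatrix} x_1^T \begin{pmatrix} I_{a_M + a_F + 1} & 0 \\ 0 & m_1 I_{c+d} \end{pmatrix} \\ x_2^T \begin{pmatrix} I_{a_M + a_F + 1} & 0 \\ 0 & m_2 I_{c+d} \end{pmatrix} \\ \vdots \\ x_n^T \begin{pmatrix} I_{a_M + a_F + 1} & 0 \\ 0 & m_n I_{c+d} \end{pmatrix} \end{pmatrix},$$
where $x_i^T = (1, \nu_{Mi}^T, \nu_{Fi}^T, z_{di}^T, z_{ci}^T)$ is a $1 \times l$ vector of household covariates and $I_b$ is the $b \times b$ identity matrix. Eq. (5) is written in matrix form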
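$$F = (y - X\beta)^T (y - X\beta) + \beta^T \Lambda^T \Lambda \beta,$$
where $\Lambda$ is the block-diagonal matrix
$$\Lambda = \begin{pmatrix} 0 & & & & \\ & \lambda_M A_M & & & \\ & & \lambda_F A_F & & \\ & & & 0 & \\ & & & & \lambda A_c \end{pmatrix}, \tag{6}$$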
where $\boldsymbol{\lambda} = (\lambda_M, \lambda_F, \lambda)$ and $0$ is a $1 \times d$ vector of zeroes. Standard matrix differentiation leads to the solution
$$\hat\beta = (X^T X + \Lambda^T \Lambda)^{-1} X^T y \tag{7}$$
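with $\boldsymbol{\lambda}$ chosen by a method such as cross-validation or simply by the experimenter's subjective choice. Unless $\boldsymbol{\lambda} = 0$, this is a biased estimator of $\beta$ with mean
$$(X^T X + \Lambda^T \Lambda)^{-1} X^T X \beta \tag{8}$$
and variance
$$\sigma^2 (X^T X + \Lambda^T \Lambda)^{-1} X^T X (X^T X + \Lambda^T \Lambda)^{-1}. \tag{9}$$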
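As an illustration of (4)–(9), the following Python sketch assembles the design matrix and the penalty matrix for the case $d = c = 0$ and evaluates the estimator, its mean map and its variance factor. It is a sketch under stated assumptions rather than the code used for the paper: the function names, the simulated households, the six age classes per sex and the smoothing parameters are all hypothetical, and the all-zero rows of $\Lambda$ are omitted because they do not change $\Lambda^T \Lambda$.

```python
import numpy as np

def second_difference(k):
    """(k - 2) x k matrix of second differences with rows (1, -2, 1)."""
    A = np.zeros((k - 2, k))
    for r in range(k - 2):
        A[r, r:r + 3] = [1.0, -2.0, 1.0]
    return A

def design_row(male_ages, female_ages, a_M, a_F):
    """Row (1, nu_M^T, nu_F^T) of X for one household when d = c = 0."""
    row = np.zeros(1 + a_M + a_F)
    row[0] = 1.0
    for age in male_ages:
        row[1 + age] += 1.0            # count of males in each age class
    for age in female_ages:
        row[1 + a_M + age] += 1.0      # count of females in each age class
    return row

def penalty_matrix(a_M, a_F, lam_M, lam_F):
    """Lambda for d = c = 0: zero column for beta_0, then lam_M*A_M and lam_F*A_F."""
    Lam = np.zeros((a_M - 2 + a_F - 2, 1 + a_M + a_F))
    Lam[:a_M - 2, 1:1 + a_M] = lam_M * second_difference(a_M)
    Lam[a_M - 2:, 1 + a_M:] = lam_F * second_difference(a_F)
    return Lam

def penalized_fit(X, y, Lam):
    P = X.T @ X + Lam.T @ Lam
    beta_hat = np.linalg.solve(P, X.T @ y)     # Eq. (7)
    mean_map = np.linalg.solve(P, X.T @ X)     # Eq. (8): E(beta_hat) = mean_map @ beta
    cov_unit = mean_map @ np.linalg.inv(P)     # Eq. (9) divided by sigma^2
    return beta_hat, mean_map, cov_unit

# Simulated data (hypothetical): 6 age classes per sex, 400 households.
rng = np.random.default_rng(0)
a_M = a_F = 6
n = 400
X = np.array([
    design_row(rng.integers(0, a_M, size=rng.integers(0, 3)),
               rng.integers(0, a_F, size=rng.integers(1, 3)),
               a_M, a_F)
    for _ in range(n)
])
beta_true = np.concatenate(([5.0], np.linspace(40, 90, a_M), np.linspace(30, 60, a_F)))
y = X @ beta_true + rng.normal(scale=10.0, size=n)

Lam = penalty_matrix(a_M, a_F, lam_M=2.0, lam_F=2.0)
beta_hat, mean_map, cov_unit = penalized_fit(X, y, Lam)
```

In this simulation the true age curves are linear, so their second differences vanish and `mean_map @ beta_true` reproduces `beta_true`; for curved age profiles the mean in (8) differs from $\beta$.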
In applications where $\sigma^2$ is unknown, an estimate is used (see Carter and Eagleson, 1992):
$$\hat\sigma^2 = \frac{(y - X\hat\beta)^T (y - X\hat\beta)}{\operatorname{tr}\{(I_n - X (X^T X + \Lambda^T \Lambda)^{-1} X^T)^2\}}. \tag{10}$$
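A minimal Python sketch of (10) follows. The function name `sigma2_hat`, the toy design (an intercept plus five count-like columns) and the penalty weight are assumptions made for illustration only.

```python
import numpy as np

def sigma2_hat(X, y, Lam):
    """Variance estimate of Eq. (10): RSS / tr{(I_n - S)^2}, S being the smoother matrix."""
    n = X.shape[0]
    S = X @ np.linalg.solve(X.T @ X + Lam.T @ Lam, X.T)
    resid = y - S @ y                      # y - X beta_hat
    R = np.eye(n) - S
    return float(resid @ resid / np.trace(R @ R))

# Toy data (hypothetical): intercept plus five coefficients lying on a curve.
rng = np.random.default_rng(1)
n, k = 200, 5
X = np.column_stack([np.ones(n), rng.poisson(1.0, size=(n, k))]).astype(float)
beta = np.concatenate(([2.0], np.linspace(1.0, 3.0, k)))
y = X @ beta + rng.normal(scale=2.0, size=n)

# Second-difference penalty on the five curve coefficients; the intercept is unpenalized.
A = np.zeros((k - 2, k))
for r in range(k - 2):
    A[r, r:r + 3] = [1.0, -2.0, 1.0]
Lam = np.hstack([np.zeros((k - 2, 1)), 3.0 * A])

s2_penalized = sigma2_hat(X, y, Lam)
s2_unpenalized = sigma2_hat(X, y, np.zeros_like(Lam))
```

When the penalty is switched off, the smoother matrix is a projection and the estimate reduces to the usual residual sum of squares divided by $n - l$.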
An approximate expression for the bias is given in the next theorem.
Theorem 1. For model (4), the bias $E(\hat\beta) - \beta$ of $\hat\beta$ is given, to a first-order approximation, by
$$-(X^T X)^{-1} \Lambda^T \Lambda \beta. \tag{11}$$
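Proof. For a matrix $A$ for which the series converges, $(I + A)^{-1} = I - A + A^2 - \cdots$. Writing $X^T X + \Lambda^T \Lambda = X^T X (I + A)$ with $A = (X^T X)^{-1} \Lambda^T \Lambda$ and keeping terms up to first order in $A$, we get the expansion of the inverse of the matrix $X^T X + \Lambda^T \Lambda$ as
$$(X^T X + \Lambda^T \Lambda)^{-1} \approx (X^T X)^{-1} - (X^T X)^{-1} \Lambda^T \Lambda (X^T X)^{-1}. \tag{12}$$
By substituting (12) into (7), we get
$$\hat\beta \approx (X^T X)^{-1} X^T y - (X^T X)^{-1} \Lambda^T \Lambda (X^T X)^{-1} X^T y = \hat\beta_o - (X^T X)^{-1} \Lambda^T \Lambda \hat\beta_o,$$
where $\hat\beta_o$ is the usual least-squares regression coefficient. By taking expectations we conclude that, approximately, the bias has the form $-(X^T X)^{-1} \Lambda^T \Lambda \beta$.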
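The quality of the first-order approximation can be checked numerically. The sketch below compares the exact bias implied by (8), $(X^T X + \Lambda^T \Lambda)^{-1} X^T X \beta - \beta$, with the first-order term $-(X^T X)^{-1} \Lambda^T \Lambda \beta$ obtained by taking expectations in the expansion above. The Gaussian toy design, the curved coefficient vector and the unit penalty weight are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 6
X = rng.normal(size=(n, k))                 # toy design (hypothetical)
beta = np.sin(np.linspace(0.0, 3.0, k))     # curved coefficients, so the penalty is active

# Second-difference penalty with unit smoothing parameter.
A = np.zeros((k - 2, k))
for r in range(k - 2):
    A[r, r:r + 3] = [1.0, -2.0, 1.0]
Lam = 1.0 * A
XtX, L = X.T @ X, Lam.T @ Lam

# Exact bias from Eq. (8): E(beta_hat) - beta.
exact_bias = np.linalg.solve(XtX + L, XtX @ beta) - beta
# First-order approximation: -(X'X)^{-1} Lam'Lam beta.
approx_bias = -np.linalg.solve(XtX, L @ beta)

# Relative discrepancy between the two; it is of second order in (X'X)^{-1} Lam'Lam.
rel_err = np.linalg.norm(exact_bias - approx_bias) / np.linalg.norm(exact_bias)
```

The approximation is accurate when $\Lambda^T \Lambda$ is small relative to $X^T X$, which is the regime in which the series expansion converges quickly; for heavy smoothing the discrepancy grows.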
3. Bootstrap confidence intervals
Chesher (1997) used (9) to obtain pointwise 95% intervals, which are not confidence intervals in the strict sense but give an idea of the variability of the mean availability curves. By denoting $G = (X^T X + \Lambda^T \Lambda)^{-1} X^T X (X^T X + \Lambda^T \Lambda)^{-1}$, we get
$$\hat\beta_{Mi} \pm 1.96\, \hat\sigma \sqrt{G_{ii}}, \qquad \hat\beta_{Fj} \pm 1.96\, \hat\sigma \sqrt{G_{a_M + j,\, a_M + j}}, \tag{13}$$
where $G_{ii}$ and $G_{a_M + j,\, a_M + j}$ are the diagonal elements of $G$ which correspond to $\hat\beta_{Mi}$ and $\hat\beta_{Fj}$, respectively, $i = 1, \ldots, a_M$; $j = 1, \ldots, a_F$. In the same paper, Chesher proposed a bootstrap-like procedure which is based on the asymptotic normal distribution of the regression coefficients. This assumption, however, need not hold in practice, especially for food items which allow extreme availabilities. A better way to express the variability of the mean availability curves is to use the bootstrap in the way suggested by Härdle and Bowman (1988). The bootstrap technique estimates the unknown distribution $F$ of $\hat\beta$ by an empirical distribution $\hat F$. To get an initial estimate of the errors, we form the approximate residuals
$$\hat\varepsilon = y - X\hat\beta. \tag{14}$$
Since these residuals are biased, resampling should take place from bias-corrected residuals. These are constructed by considering the bias in (11) and by replacing $\beta$ by $\hat\beta_o$:
$$\hat\varepsilon_{BC} = \hat\varepsilon - X (X^T X)^{-1} \Lambda^T \Lambda (X^T X)^{-1} X^T y. \tag{15}$$
Centring is essential if we want these residuals to reflect the behaviour of the true observation errors. It is not difficult to show that the centred bias-corrected residuals have the form
$$\varepsilon_{BC} = \hat\varepsilon - X (X^T X)^{-1} \Lambda^T \Lambda (X^T X)^{-1} X^T y - \frac{\mathbf{1}^T \hat\varepsilon}{n}\, \mathbf{1}, \tag{16}$$
where $\mathbf{1}$ is a vector of 1's, since $\mathbf{1}^T X (X^T X)^{-1} \Lambda^T = 0$. By denoting the bootstrapped bias-corrected residuals by $\varepsilon^*_{BC}$, we get the bootstrapped observations
$$y^* = X\hat\beta + \varepsilon^*_{BC}. \tag{17}$$
Bootstrap estimates $\hat\beta^*$ of $\beta$ are easily obtained. If this process is repeated $B$ times, then we can get 95% percentile intervals (see also Efron and Tibshirani, 1993) from the empirical bootstrap distribution $\hat F_{\hat\beta^*}$ of $\hat\beta^*$ as
$$\left[ \hat F^{-1}_{\hat\beta^*}(0.025),\ \hat F^{-1}_{\hat\beta^*}(0.975) \right]. \tag{18}$$
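The steps (14)–(18) can be sketched in Python as follows. This is an illustrative sketch, not the code used for the results in the paper: the function names, the toy data, the number of replications `B` and the seed are hypothetical, and it is assumed that each bootstrap estimate is obtained by refitting (7) to $y^*$.

```python
import numpy as np

def centred_bc_residuals(X, y, Lam):
    """Residuals of Eq. (14), bias-corrected as in (15) and centred as in (16)."""
    XtX, L = X.T @ X, Lam.T @ Lam
    beta_hat = np.linalg.solve(XtX + L, X.T @ y)          # Eq. (7)
    beta_ols = np.linalg.solve(XtX, X.T @ y)              # least-squares estimate
    resid = y - X @ beta_hat                              # Eq. (14)
    correction = X @ np.linalg.solve(XtX, L @ beta_ols)   # X (X'X)^{-1} Lam'Lam beta_o
    return beta_hat, resid - correction - resid.mean()

def bootstrap_percentile_intervals(X, y, Lam, B=300, seed=0):
    """95% percentile intervals of Eq. (18) from B residual-bootstrap refits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = X.T @ X + Lam.T @ Lam
    beta_hat, eps_c = centred_bc_residuals(X, y, Lam)
    fitted = X @ beta_hat
    boot = np.empty((B, p))
    for b in range(B):
        y_star = fitted + rng.choice(eps_c, size=n, replace=True)   # Eq. (17)
        boot[b] = np.linalg.solve(P, X.T @ y_star)                  # assumed refit by Eq. (7)
    lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)         # Eq. (18)
    return beta_hat, lower, upper

# Toy data (hypothetical): intercept plus five count-like columns.
rng = np.random.default_rng(3)
n, k = 200, 5
X = np.column_stack([np.ones(n), rng.poisson(1.0, size=(n, k))]).astype(float)
beta = np.concatenate(([2.0], np.sin(np.linspace(0.0, 3.0, k))))
y = X @ beta + rng.normal(scale=1.0, size=n)

# Second-difference penalty on the curve coefficients; the intercept is unpenalized.
A = np.zeros((k - 2, k))
for r in range(k - 2):
    A[r, r:r + 3] = [1.0, -2.0, 1.0]
Lam = np.hstack([np.zeros((k - 2, 1)), 2.0 * A])

beta_hat, lower, upper = bootstrap_percentile_intervals(X, y, Lam, B=300, seed=0)
```

Because the design contains an unpenalized intercept, the correction term sums to zero and subtracting the mean of $\hat\varepsilon$ is enough to centre the corrected residuals, as stated after (16).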
4. An example
The following example comes from data taken from the 1993–1994 Greek HBS. Estimates of individual availability of red meat from 2986 households of the greater Athens area are presented for both sexes; the analysis considered families with individuals from newborns up to 80 years of age for both men and women. The individual characteristics considered were age and sex. Initially, no household characteristics were taken into account since our purpose was not to analyze the data thoroughly but mainly to give an example. Thus, in model (4), $d = c = 0$. Consequently, $\lambda = 0$ and $l = 163$. Estimates of individual availability for men and women were obtained using (7). These estimates are accompanied by pointwise 95% intervals calculated using (9), (10) and (13) (see Figs. 1 and 3). Bootstrap estimates along with bootstrap confidence intervals calculated as in (18) are presented in Figs. 2 and 4. All estimates were obtained using the same smoothing parameter, which for this case was $\lambda_M = \lambda_F = 100$. As can be seen, although there are differences between the estimated availability curves due to the reduction of bias in the bootstrap estimates, these differences are not acute. The picture changes somewhat when we consider larger values for the smoothing parameters (see Table 1). A comparison between Figs. 1 and 3 shows differences in red meat availability in infants. There are two possible reasons for this. One is that the effect of a household-related covariate, such as income or food expenditure, has not been taken into account. The other is that different smoothing parameters are appropriate for the curves corresponding to the two sexes. We investigated this point by obtaining optimal smoothing parameter values for the two curves using generalized cross-validation (Craven and Wahba, 1979). The values obtained were $\lambda_M = 206.72$ and $\lambda_F = 260.78$, showing that more smoothing of the females' curve is appropriate.
Table 2 shows these estimates for ages 0–5, along with estimates obtained when the ratio of food expenditure to total expenditure for each household is included in the model. By taking this ratio into account, the differences in availability between male and female infants, although smaller, do not vanish. The picture is better when we consider different smoothing parameters.
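For readers who wish to reproduce a choice of smoothing parameter by generalized cross-validation, a minimal Python sketch is given below. It uses the standard criterion $n \cdot \mathrm{RSS} / \{\operatorname{tr}(I_n - S)\}^2$, where $S$ is the smoother matrix, and searches a single smoothing parameter over a small grid. The function name, the toy data and the grid are assumptions; the analysis above selects two parameters, $\lambda_M$ and $\lambda_F$, which this sketch does not attempt.

```python
import numpy as np

def gcv_score(X, y, Lam):
    """Generalized cross-validation score n * RSS / tr(I_n - S)^2."""
    n = X.shape[0]
    S = X @ np.linalg.solve(X.T @ X + Lam.T @ Lam, X.T)   # smoother matrix
    resid = y - S @ y
    return float(n * (resid @ resid) / np.trace(np.eye(n) - S) ** 2)

# Toy data (hypothetical): intercept plus eight count-like columns on a curve.
rng = np.random.default_rng(4)
n, k = 150, 8
X = np.column_stack([np.ones(n), rng.poisson(1.0, size=(n, k))]).astype(float)
beta = np.concatenate(([1.0], np.sin(np.linspace(0.0, 3.0, k))))
y = X @ beta + rng.normal(scale=1.5, size=n)

# Unit second-difference penalty on the curve coefficients; intercept unpenalized.
A = np.zeros((k - 2, k))
for r in range(k - 2):
    A[r, r:r + 3] = [1.0, -2.0, 1.0]
Lam_unit = np.hstack([np.zeros((k - 2, 1)), A])

# Evaluate the criterion on a hypothetical grid and keep the minimizer.
grid = [0.0, 1.0, 3.0, 10.0, 30.0, 100.0]
scores = [gcv_score(X, y, lam * Lam_unit) for lam in grid]
best_lam = grid[int(np.argmin(scores))]
```

A finer grid or a numerical optimizer could replace the grid search; the coarse grid is used here only to keep the sketch short.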
Fig. 1. Estimated individual availability in red meat for males from the Athens area; pointwise intervals.
Fig. 2. Estimated individual availability in red meat for males from the Athens area; bootstrap CI.
Fig. 3. Estimated individual availability in red meat for females from the Athens area; pointwise intervals.
Fig. 4. Estimated individual availability in red meat for females from the Athens area; bootstrap CI.
Table 1
Individual estimates of availability in red meat for Athens area and for selected ages; two different smoothing parameter choices. Each interval is given as lower limit, upper limit.

           $\lambda_M = \lambda_F = 100$             $\lambda_M = \lambda_F = 1000$
Ages       95% intervals    95% bootstrap CI    95% intervals    95% bootstrap CI
Males
0          38.0, 106.6      39.7, 108.7         38.5, 76.4       38.6, 74.8
10         39.6, 74.4       41.4, 74.4          51.1, 75.9       53.8, 76.8
20         34.7, 67.8       37.3, 69.5          60.9, 81.6       65.3, 83.5
30         60.1, 99.5       60.1, 95.4          70.5, 94.1       71.6, 93.8
40         90.8, 141.2      90.8, 137.6         79.9, 107.4      76.4, 102.5
50         95.1, 148.7      95.1, 147.5         84.3, 115.2      79.8, 109.0
60         69.9, 122.0      75.2, 124.6         80.7, 116.6      79.4, 114.1
70         50.1, 105.1      50.6, 104.2         69.9, 115.3      75.2, 118.5
Females
0          63.6, 136.5      60.6, 127.5         45.2, 83.4       42.2, 79.9
10         29.9, 66.3       35.3, 70.1          46.3, 70.8       48.2, 72.9
20         29.3, 62.0       27.9, 59.4          45.3, 66.2       49.2, 69.8
30         19.0, 63.1       21.7, 63.8          44.2, 69.5       47.3, 71.7
40         33.9, 87.3       36.8, 88.1          45.4, 76.1       44.6, 74.1
50         58.3, 113.5      55.6, 108.0         45.2, 80.6       41.3, 74.6
60         54.0, 109.2      53.9, 106.6         39.7, 80.4       36.4, 73.8
70         29.1, 85.4       28.1, 86.3          27.9, 77.3       27.9, 74.4
Table 2
Individual availability in red meat for infants under three different estimation schemes. First two columns: subjective choice of smoothing parameters. Next two columns: same choice but with expenditure ratio effects taken into account. Final two columns: generalized cross-validation choice.

        $\lambda_M = \lambda_F = 100$    $\lambda_M = \lambda_F = 100$, expenditure      $\lambda_M = 206.72$, $\lambda_F = 260.78$
                                         ratio effects taken into account
Ages    Males    Females                 Males    Females                                Males    Females
0       72.3     100.0                   80.6     104.7                                  70.3     83.6
1       71.6     93.5                    79.1     97.7                                   69.2     80.3
2       70.9     87.1                    77.7     90.6                                   68.1     77.0
3       70.1     80.7                    76.0     83.6                                   67.0     73.7
4       69.0     74.3                    74.0     76.6                                   65.8     70.4
5       67.5     68.3                    71.7     69.9                                   64.5     67.3
Acknowledgements
We wish to thank a referee for his helpful comments. This study was supported by the FAIR-97-3096 European Union programme on “Compatibility and Individual Nutrition Surveys in Europe and Disparities in Food Habits”.
References
Carter, C.K., Eagleson, G.K., 1992. A comparison of variance estimators in non-parametric regression. J. Roy. Statist. Soc. B 54, 773–780.
Chesher, A., 1997. Diet revealed? Semiparametric estimation of nutrient intake-age relationships. J. Roy. Statist. Soc. A 160, 389–428.
Craven, P., Wahba, G., 1979. Smoothing noisy data with spline functions. Numer. Math. 31, 377–390.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Engle, R.F., Granger, C.W.J., Rice, J.A., Weiss, A., 1986. Semiparametric estimates of the relation between weather and electricity sales. J. Amer. Statist. Assoc. 81, 310–320.
Green, P.J., Silverman, B.W., 1994. Nonparametric Regression and Generalized Linear Models. Chapman & Hall, London.
Härdle, W., Bowman, A.W., 1988. Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. J. Amer. Statist. Assoc. 83, 102–110.