Journal of Econometrics 115 (2003) 277 – 291
www.elsevier.com/locate/econbase
Estimation of Lorenz curves: a Bayesian nonparametric approach

Hikaru Hasegawa∗, Hideo Kozumi

Graduate School of Economics and Business Administration, Hokkaido University, Kita 9, Nishi 7 Kita-ku, Sapporo 060-0809, Japan

Accepted 22 January 2003
Abstract

In this article, we estimate Lorenz curves using the recent development of the Bayesian nonparametric method with a Dirichlet process prior. We also consider contaminated observations of income and propose a method for removing these contaminated observations. Further, we present examples using both simulated data and real data to illustrate our approach. © 2003 Elsevier Science B.V. All rights reserved.

JEL classification: C11; C15; D63

Keywords: Dirichlet process; Gini index; Income distribution; Inequality; Markov chain Monte Carlo (MCMC)
1. Introduction

The Lorenz curve plays a central role in the analysis of income distribution and the evaluation of welfare judgments. In the literature concerning Lorenz curve estimation, there are two approaches (Ryu and Slottje, 1999). One is to assume a hypothetical statistical distribution for the income distribution and then estimate the Lorenz curve using this statistical distribution (McDonald and Xu, 1995). The other is to fit a specific functional form to the Lorenz curve directly, and estimate the empirical Lorenz curves (Ryu and Slottje, 1996; Sarabia et al., 1999; Chotikapanich and Griffiths, 2002). Both approaches are based on parametric estimation and/or parametric approximations (Ryu and Slottje, 1996, 1999).

∗ Corresponding author. Tel.: +81-11-706-3179; fax: +81-11-706-4947. E-mail address: [email protected] (H. Hasegawa).

0304-4076/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0304-4076(03)00098-8
H. Hasegawa, H. Kozumi / Journal of Econometrics 115 (2003) 277 – 291
Furthermore, cross-sectional income data collected by survey usually suffer from survey errors such as interviewer errors, respondent errors, and the like (Groves, 1989). The estimated Lorenz curve is strongly influenced when the income data are contaminated by such errors. Recently, Cowell and Victoria-Feser (2001) focused on this problem from a non-Bayesian point of view. They applied a robust estimation method to the estimation of Lorenz curves with contaminated income data.

The objective of this article is to propose an alternative approach to Lorenz curve estimation with contaminated data using Bayesian nonparametric analysis. In our model, the estimation of the income distribution and the investigation of contaminated observations are carried out simultaneously.

Our Bayesian nonparametric approach uses a Dirichlet process prior (Ferguson, 1973, 1974, 1983). Since the recent development of the Markov chain Monte Carlo (MCMC) method overcame the computational burden of posterior analysis with a Dirichlet process prior, Bayesian nonparametric analysis has been researched intensively (Escobar, 1994; West et al., 1994; Escobar and West, 1995; Dey et al., 1998).¹ One of the advantages of our Bayesian approach is that our model permits heteroscedasticity in individual income. Furthermore, our approach divides the individual income data into several income groups as part of the estimation procedure. In this sense, the possibility of dependency among individual incomes exists, though they are assumed to be independent.

We also present a method of removing the contaminated observations. With the use of a mixture model for contaminated and noncontaminated data, the contaminated observations are clearly discriminated in the posterior inference. This enables us to estimate the Lorenz curve using the predictive distribution for the noncontaminated data.

This article is organized as follows.
Section 2 presents the Bayesian nonparametric model for density estimation with contaminated data. In Section 3, we describe the posterior inference procedure and the estimation of the Lorenz curve and the Gini index. In Section 4, we illustrate our Bayesian approach using both simulated data and real data. Finally, a brief summary and some extensions of our approach are given in Section 5.

2. Bayesian nonparametric model

Let $y_i$ be the income of the $i$th household or individual ($i = 1, \dots, n$). We assume that the income data take positive values and that a small number of observations are suspected to be contaminated. Since $y_i > 0$, we consider the logarithm of income, denoted by $x_i = \log y_i$ hereafter. Taking into account the presence of contaminated data, we assume that the distribution of $x_i$ can be expressed with a mixture density given by

$$p(x_i \mid \theta_i, \theta_c, \epsilon) = (1 - \epsilon)\, p_0(x_i \mid \theta_i) + \epsilon\, p_1(x_i \mid \theta_c) \quad (i = 1, \dots, n), \tag{1}$$
1 Recent applications of Bayesian nonparametric analysis with Dirichlet process prior to econometric models are as follows: Kozumi and Hasegawa (2000), Campolieti (2000) and Hirano (2002).
where $\theta_i$ and $\theta_c$ are unknown parameter vectors, $p_0$ and $p_1$ are density functions, and $\epsilon$ is a mixture proportion constrained to $(0, 1)$. For calculation and interpretation of the mixture density (1), it is sometimes convenient to introduce unobservable independent random variables, $z_i \in \{0, 1\}$, with the probability mass function

$$p(z_i = j) = (1 - \epsilon)^{1 - z_i} \epsilon^{z_i} \quad (j = 0, 1). \tag{2}$$

Then the distribution of $x_i$ conditional on $z_i$ can be written as

$$p(x_i \mid \theta_i, \theta_c, z_i) = \begin{cases} p_0(x_i \mid \theta_i) & \text{if } z_i = 0, \\ p_1(x_i \mid \theta_c) & \text{if } z_i = 1. \end{cases} \tag{3}$$
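As a quick consistency check (a worked step added for illustration, writing $\epsilon$ for the mixture proportion and $\theta_i$, $\theta_c$ for the component parameters, which is an assumed labeling of the symbols), summing the conditional density (3) over the two values of $z_i$ with the weights in (2) recovers the mixture density (1):

$$p(x_i \mid \theta_i, \theta_c, \epsilon) = \sum_{j \in \{0, 1\}} p(z_i = j)\, p(x_i \mid \theta_i, \theta_c, z_i = j) = (1 - \epsilon)\, p_0(x_i \mid \theta_i) + \epsilon\, p_1(x_i \mid \theta_c).$$

The latent indicators therefore change nothing about the model for $x_i$; they only make the component membership explicit.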
Suppose the realized values of $z_i$ indicate whether the corresponding observations are contaminated, $z_i = 1$, or noncontaminated, $z_i = 0$. It is then possible to assume that the distribution of noncontaminated data is described by the first component $p_0$ of the mixture density (1) and the distribution of contaminated data by the second component $p_1$. Furthermore, the parameter $\epsilon$ gives the proportion of contaminated data.

We now specify the distributions $p_0$ and $p_1$. The number of contaminated observations is usually much smaller than that of noncontaminated observations. Thus, it would be appropriate to employ a parametric approach when modeling the distribution of the contaminated data. We simply assume that

$$p_1(x_i \mid \theta_c) = f_N(x_i \mid \mu_c, \sigma_c^2), \quad \theta_c = (\mu_c, \sigma_c^2), \tag{4}$$

where $f_N(x \mid \mu, \sigma^2)$ denotes the density function of the normal distribution with mean $\mu$ and variance $\sigma^2$. On the other hand, many observations of noncontaminated data are available for the analysis of the income distribution. Therefore, we consider a Bayesian nonparametric approach to modeling the distribution $p_0$. Following Escobar and West (1995), we first assume that

$$p_0(x_i \mid \theta_i) = f_N(x_i \mid \mu_i, \sigma_i^2), \quad \theta_i = (\mu_i, \sigma_i^2) \tag{5}$$

and that the $\theta_i$ are a sample from an unknown distribution $G$. Integrating out $\theta_i$ yields the density

$$f(x_i \mid G) = \int p_0(x_i \mid \theta_i)\, dG(\theta_i). \tag{6}$$

This is no longer a normal distribution and provides a nonparametric class of distributions. It should be noted that our model permits heteroscedasticity of individual income.

A Bayesian analysis requires a prior distribution for $G$. We assume that $G$ has the Dirichlet process prior introduced by Ferguson (1973, 1974). The Dirichlet process is a probability measure on the space of all distributions. More specifically, given a proper base probability distribution $G_0$ and a precision parameter $\alpha$, a random probability distribution $G$ is said to be a Dirichlet process if, for any finite partition, $B_1, \dots, B_m$, of the parameter space, the random vector $(G(B_1), \dots, G(B_m))$ has a Dirichlet distribution
with a parameter vector $(\alpha G_0(B_1), \dots, \alpha G_0(B_m))$. Throughout this paper $DP(\alpha G_0)$ will be used to denote a Dirichlet process.

The Dirichlet process has been recognized as having several attractive features. Among them, one important feature is that it gives mass 1 to the set of discrete distributions. That is, any realizations are almost surely discrete (Blackwell, 1973; Antoniak, 1974). To explain this discreteness property, let $\theta = \{\theta_1, \dots, \theta_n\}$ denote a sample from the Dirichlet process. Then, as shown by Escobar (1994), when $G$ is integrated over its prior distribution, the sequence of $\theta_i$ follows

$$\theta_i \mid \theta_1, \theta_2, \dots, \theta_{i-1}, \alpha \; \begin{cases} = \theta_j & \text{with probability } \dfrac{1}{\alpha + i - 1}, \\ \sim G_0 & \text{with probability } \dfrac{\alpha}{\alpha + i - 1}, \end{cases} \tag{7}$$

where $\theta_1 \sim G_0$. Consequently, the $\theta_i$ are partitioned into, say, $k$ ($\le n$) groups such that all $\theta_i$ in the same group are the same while those in different groups differ. These $k$ distinct values of $\theta_i$ are a sample from $G_0$. According to the partition of the $\theta_i$, the $x_i$ also belong to the $k$ different income groups. In this sense, our model permits the possibility of dependence among the $x_i$.

From the above discussion, we can see that the distribution of noncontaminated data is approximated as a finite mixture of normal distributions. Since an arbitrary density on the real line can be approximated using a mixture of normal distributions (Ferguson, 1983), we are able to obtain a rich class of distributions for noncontaminated data using a Dirichlet process prior.

It should be noted that, without restrictions on the parameters, the component densities $p_0$ and $p_1$ can be the distributions for contaminated and noncontaminated data, respectively. Income is restricted to positive values in our model, and problems associated with contaminated data arise in the upper tail of the income distribution (Cowell and Victoria-Feser, 1996, 2001). Thus we impose the constraint $\mu_c > \mu_i$ on the means of the component densities to ensure identifiability.
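The sequential scheme in (7) can be simulated directly, which makes the clustering of the case parameters easy to see. The following is a minimal sketch rather than the procedure used in the paper: the function name `polya_urn_sample`, the standard normal stand-in for the base distribution, the precision value and the seed are all assumptions chosen for illustration.

```python
import numpy as np

def polya_urn_sample(n, alpha, base_draw, rng):
    """Draw theta_1..theta_n sequentially as in Eq. (7).

    base_draw(rng) returns one draw from the base distribution G0.
    """
    thetas = []
    for i in range(n):  # i earlier draws exist when the next one is generated
        if i == 0 or rng.random() < alpha / (alpha + i):
            # New value from G0, probability alpha / (alpha + i)
            thetas.append(base_draw(rng))
        else:
            # Copy one earlier value, each with probability 1 / (alpha + i)
            thetas.append(thetas[rng.integers(i)])
    return thetas

rng = np.random.default_rng(0)                  # arbitrary seed
base = lambda r: float(r.normal(0.0, 1.0))      # hypothetical stand-in for G0
draws = polya_urn_sample(200, 1.0, base, rng)   # alpha = 1.0 is an assumed value
k = len(set(draws))                             # number of distinct values (groups)
```

Larger values of the precision parameter make new draws from the base distribution more likely and so tend to produce more groups; with a precision of zero every draw after the first copies an earlier one.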
To complete the model description, we have to specify the prior distributions for the rest of the parameters. For the base prior distribution $G_0$, the conjugate normal/inverse-gamma distribution is chosen, that is,

$$dG_0(\mu_i, \sigma_i^2) \propto N(\mu_0, \tau_0 \sigma_i^2)\, IG\!\left(\frac{\nu_0}{2}, \frac{s_0}{2}\right) I(\mu_i < \mu_c),$$

where $I(\cdot)$ is the indicator function and $IG(a, b)$ denotes the inverse-gamma distribution with parameters $a$ and $b$. We also choose the following conjugate prior distributions for $\theta_c = (\mu_c, \sigma_c^2)$ and $\epsilon$:

$$(\mu_c, \sigma_c^2) \sim N(\mu_{c0}, \tau_{c0} \sigma_c^2)\, IG\!\left(\frac{\nu_{c0}}{2}, \frac{s_{c0}}{2}\right),$$

$$\epsilon \sim Be(a_0, b_0),$$

where $Be(a, b)$ is a beta distribution with parameters $a$ and $b$. Finally, following Escobar and West (1995), we assume that $\alpha$ follows the gamma distribution

$$\alpha \sim Ga(c_0, d_0).$$
3. Posterior inference

3.1. Gibbs sampling

In our model, the main computational difficulty arises in the sampling of $\theta_i$. However, West et al. (1994) proposed an efficient algorithm for sampling $\theta_i$ that uses the discreteness property of Dirichlet processes. The discreteness property states that any realization of $n$ case parameters $\theta_i$ generated from $G$ lies in a set of $k \le n$ distinct values, $\theta^* = \{\theta_1^*, \dots, \theta_k^*\}$. Let $S = \{S_1, \dots, S_n\}$ be the configuration vector. That is, $S_i = j$ if and only if $\theta_i = \theta_j^*$. The configurations $S_i$ map $\theta^*$ into $\theta$ by $\theta_i = \theta_j^*$ if $S_i = j$. Then, denoting $\theta^{(i)} = \{\theta_1, \dots, \theta_{i-1}, \theta_{i+1}, \dots, \theta_n\}$, the conditional distribution of $\theta_i$ is derived as

$$\theta_i \mid \theta^{(i)} \sim \frac{\alpha}{\alpha + n - 1}\, G_0 + \frac{1}{\alpha + n - 1} \sum_{j=1}^{k^{(i)}} n_j^{(i)}\, \delta(\theta_j^{*(i)}), \tag{8}$$

where $k^{(i)}$ is the number of distinct values in $\theta^{(i)}$, with $n_j^{(i)}$ elements taking the common value $\theta_j^{*(i)}$, and $\delta(z)$ denotes a distribution with mass 1 at the point $z$. Since the information in $\theta$ is equivalent to that of $\theta^*$, $S$ and $k$, sampling $\theta$ can be replaced with sampling $S$ and $\theta^*$. Thus, the algorithm proposed by West et al. (1994) consists of the following two steps:

(a) Generate $S_i$ from the conditional distribution

$$p(S_i = j \mid \cdots) = q_{ij}, \tag{9}$$

where

$$q_{ij} \propto \begin{cases} \alpha\, f_T(x_i \mid \mu_0, (1 + \tau_0) s_0 / \nu_0, \nu_0)^{1 - z_i}, & j = 0, \\ n_j^{(i)}\, f_N(x_i \mid \theta_j^{*(i)})^{1 - z_i}, & j > 0, \end{cases} \tag{10}$$

and $f_T(x \mid \mu, \sigma^2, \nu)$ is the density of the t-distribution with mean $\mu$, scale factor $\sigma^2$ and $\nu$ degrees of freedom. For any index $i$ such that $S_i = 0$, draw a new $\theta_i$ from the posterior obtained by updating the prior $G_0$ via the likelihood of $x_i$. Hereafter we use "$\mid \cdots$" to denote "conditional on all other variables".

(b) Generate $\theta_j^* = (\mu_j^*, \sigma_j^{*2})$ from the conditional distribution, which is the truncated-normal/inverse-gamma distribution,

$$\mu_j^*, \sigma_j^{*2} \mid \cdots \sim N(\hat{\mu}_j, \hat{\tau}_j \sigma_j^{*2})\, IG\!\left(\frac{\hat{\nu}_j}{2}, \frac{\hat{s}_j}{2}\right) I(\mu_j^* < \mu_c), \tag{11}$$

where

$$\hat{\mu}_j = \frac{\tau_0 n_j \bar{x}_j + \mu_0}{\tau_0 n_j + 1}, \quad \hat{\tau}_j = \frac{\tau_0}{\tau_0 n_j + 1}, \quad \hat{\nu}_j = \nu_0 + n_j,$$

$$\hat{s}_j = s_0 + \frac{n_j (\bar{x}_j - \mu_0)^2}{\tau_0 n_j + 1} + \sum_{i \in J_j} (x_i - \bar{x}_j)^2,$$

$$\bar{x}_j = \frac{1}{n_j} \sum_{i \in J_j} x_i, \quad J_j = \{i : z_i = 0 \cap S_i = j\},$$

and $n_j$ is the number of observations in $J_j$.

It should be noted that the sampling of $k$ is implicitly accomplished in (a). The conditional distribution of $z_i$ is written as

$$p(z_i = j \mid \cdots) \propto \{(1 - \epsilon)\, p_0(x_i \mid \theta_i)\}^{1 - z_i} \{\epsilon\, p_1(x_i \mid \theta_c)\}^{z_i}, \tag{12}$$
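which is actually a Bernoulli distribution. Thus we can simulate $z_i$ from it without any difficulty. Through conjugacy, the conditional distributions of $\mu_c$, $\sigma_c^2$, and $\epsilon$ are easily obtained and given as

$$\mu_c, \sigma_c^2 \mid \cdots \sim N(\hat{\mu}_c, \hat{\tau}_c \sigma_c^2)\, IG\!\left(\frac{\hat{\nu}_c}{2}, \frac{\hat{s}_c}{2}\right) I(\mu_c > \mu_j^*), \tag{13}$$

$$\epsilon \mid \cdots \sim Be(a_0 + m_1, b_0 + m_0), \tag{14}$$

where

$$\hat{\mu}_c = \frac{\tau_{c0} m_1 \bar{x}_c + \mu_{c0}}{\tau_{c0} m_1 + 1}, \quad \hat{\tau}_c = \frac{\tau_{c0}}{\tau_{c0} m_1 + 1}, \quad \hat{\nu}_c = \nu_{c0} + m_1,$$

$$\hat{s}_c = s_{c0} + \frac{m_1 (\bar{x}_c - \mu_{c0})^2}{\tau_{c0} m_1 + 1} + \sum_{i: z_i = 1} (x_i - \bar{x}_c)^2,$$

$$\bar{x}_c = \frac{1}{m_1} \sum_{i: z_i = 1} x_i, \quad m_j = \#\{i : z_i = j\}.$$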
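The updates in (11) and (13) share the same conjugate normal/inverse-gamma form, so one helper can serve both. The sketch below is illustrative only and makes several assumptions that are not stated in the text: $IG(a, b)$ is read as shape $a$ and scale $b$, the truncation indicators are ignored (they could be imposed afterwards, for example by rejecting draws that violate the ordering of the means), and the names `nig_posterior` and `draw_nig` are hypothetical.

```python
import numpy as np

def nig_posterior(x, mu0, tau0, nu0, s0):
    """Updated (mu_hat, tau_hat, nu_hat, s_hat) for the observations x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean() if n > 0 else 0.0
    denom = tau0 * n + 1.0
    mu_hat = (tau0 * n * xbar + mu0) / denom
    tau_hat = tau0 / denom
    nu_hat = nu0 + n
    s_hat = s0 + n * (xbar - mu0) ** 2 / denom + np.sum((x - xbar) ** 2)
    return mu_hat, tau_hat, nu_hat, s_hat

def draw_nig(mu_hat, tau_hat, nu_hat, s_hat, rng):
    """One untruncated draw of (mu, sigma^2)."""
    # Assumed parameterization: sigma^2 ~ IG(shape = nu/2, scale = s/2),
    # i.e. 1 / sigma^2 ~ Gamma(shape = nu/2, rate = s/2).
    sigma2 = 1.0 / rng.gamma(shape=nu_hat / 2.0, scale=2.0 / s_hat)
    mu = rng.normal(mu_hat, np.sqrt(tau_hat * sigma2))
    return mu, sigma2

rng = np.random.default_rng(0)  # arbitrary seed
params = nig_posterior([1.0, 3.0], mu0=0.0, tau0=1.0, nu0=4.0, s0=2.0)
mu, sigma2 = draw_nig(*params, rng)
```

Passing the observations of one group with the base-prior hyperparameters corresponds to (11); passing the observations flagged as contaminated with the hyperparameters of the contaminated component corresponds to (13). With no observations the function returns the prior hyperparameters unchanged.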
Finally, Escobar and West (1995) note that if we introduce a beta distributed variable $\eta \sim Be(\alpha + 1, n)$, the full conditional distribution of $\alpha$ is given by

$$\alpha \mid \cdots \sim w\, Ga(c_0 + k, d_0 - \log \eta) + (1 - w)\, Ga(c_0 + k - 1, d_0 - \log \eta),$$

where

$$\frac{w}{1 - w} = \frac{c_0 + k - 1}{n (d_0 - \log \eta)}.$$

3.2. Estimation of the Lorenz curve and Gini index

Given the density function of $y$, $f(y)$ ($y > 0$), the continuous formulation of the Lorenz curve is generally defined as

$$L(p) = \frac{1}{E(y)} \int_0^y z f(z)\, dz, \tag{15}$$
where $p = F(y)$ and $F(y)$ is the distribution function of $y$ (Lambert, 1993, p. 40). The continuous formulation of the Gini index is defined as

$$G = 1 - 2 \int_0^1 L(p)\, dp = -1 + \frac{2 E[y F(y)]}{E(y)} \tag{16}$$

(Lambert, 1993, p. 43). If a sample $\{y_1, \dots, y_N\}$ is observed, the Lorenz curve and the Gini index are estimated using their discrete formulations (Lambert, 1993, pp. 40, 44):

$$\hat{L}\!\left(\frac{k}{N}\right) = \frac{\sum_{i=1}^{k} y_{(i)}}{\sum_{i=1}^{N} y_{(i)}}, \tag{17}$$

$$\hat{G} = -\frac{N + 1}{N} + \frac{2}{N^2 \bar{y}} \sum_{i=1}^{N} i\, y_{(i)}, \tag{18}$$
where $y_{(i)}$ are the order statistics and $\bar{y}$ is the sample mean.

In a Bayesian framework, it is natural to estimate the Lorenz curve and the Gini index from the predictive distribution because it summarizes all the information on the data. The predictive distribution of the future observation $x_f$ is written as

$$p(x_f \mid X) = \int p(x_f \mid \theta_f, \theta_c, \epsilon)\, \pi(\theta_f, \theta_c, \epsilon \mid X)\, d\theta_f\, d\theta_c\, d\epsilon, \tag{19}$$

where $\pi(\cdot \mid X)$ denotes the posterior distribution. However, $p(x_f \mid \theta_f, \theta_c, \epsilon)$ includes the distribution of the contaminated data in our model, and inference on the Lorenz curve and the Gini index should be based on the distribution of noncontaminated data. Therefore, we consider the predictive distribution for noncontaminated data to be given as

$$p_0(x_f \mid X) = \int p_0(x_f \mid \theta_f)\, \pi(\theta_f \mid X)\, d\theta_f. \tag{20}$$

Since Eq. (8) implies that

$$\theta_f \mid \theta \sim \frac{\alpha}{\alpha + n}\, G_0 + \frac{1}{\alpha + n} \sum_{j=1}^{k} n_j\, \delta(\theta_j^*), \tag{21}$$

where $n_j$ is the number of $\theta_i$ taking the value $\theta_j^*$, we can obtain

$$p_0(x_f \mid \theta) = \frac{\alpha}{\alpha + n}\, f_T(x_f \mid \mu_0, (1 + \tau_0) s_0 / \nu_0, \nu_0) + \frac{1}{\alpha + n} \sum_{j=1}^{k} n_j\, f_N(x_f \mid \mu_j^*, \sigma_j^{*2}). \tag{22}$$

Thus the predictive distribution for the noncontaminated data can be rewritten as

$$p_0(x_f \mid X) = \int p_0(x_f \mid \theta)\, \pi(\theta \mid X)\, d\theta. \tag{23}$$

An easy route in Gibbs sampling is to sample $x_f$ from the predictive density. We estimate the Lorenz curve and the Gini index using (17) and (18). In addition,
the estimate of the predictive distribution is obtained as

$$\hat{p}_0(x_f \mid X) = \frac{1}{M} \sum_{i=1}^{M} p_0(x_f \mid \theta^{[i]}), \tag{24}$$

where $\theta^{[i]}$ are a simulated sample of $\theta$.
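The discrete formulas (17) and (18), applied to draws of income, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `lorenz_curve` and `gini_index` are hypothetical, and the normal draws below are only a placeholder for draws of log income from the predictive distribution of noncontaminated data, which in practice would come from the Gibbs sampler.

```python
import numpy as np

def lorenz_curve(y):
    """Return (p, L) with p = k/N and L(k/N) as in Eq. (17)."""
    y = np.sort(np.asarray(y, dtype=float))   # order statistics y_(1) <= ... <= y_(N)
    cum = np.cumsum(y)
    p = np.arange(1, y.size + 1) / y.size
    return p, cum / cum[-1]

def gini_index(y):
    """Discrete Gini index as in Eq. (18)."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    ranks = np.arange(1, n + 1)
    return -(n + 1) / n + 2.0 * np.sum(ranks * y) / (n ** 2 * y.mean())

rng = np.random.default_rng(0)             # arbitrary seed
x_f = rng.normal(0.0, 0.5, size=10_000)    # placeholder for predictive draws of log income
y_f = np.exp(x_f)                          # back-transform, since x = log y
p, L = lorenz_curve(y_f)
G = gini_index(y_f)
```

Because the model is specified for log income, the draws are exponentiated before the formulas are applied. A sample of identical incomes gives an index of zero, which is a convenient sanity check on the sign convention in (18).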
4. Applications using simulated data and real data

4.1. Simulated data

Following Cowell and Victoria-Feser (2001), we assume the income distribution is a Dagum distribution $D(\beta, \lambda, \delta)$ (Dagum, 1977) whose density function is

$$f(y \mid \beta, \lambda, \delta) = \frac{\lambda \delta\, y^{-(\delta + 1)} (1 + \lambda y^{-\delta})^{-(\beta + 1)}}{B(1, \beta)}. \tag{25}$$
The Dagum distribution is a special case of the generalized beta distribution (GB) (e.g., McDonald and Xu, 1995). Two samples of 5,000 observations were generated from $D(2, 1, 3)$ and $D(2, 1, 2.5)$, respectively. The sample from $D(2, 1, 3)$ Lorenz-dominates the sample from $D(2, 1, 2.5)$. In addition, we contaminated the sample from $D(2, 1, 3)$ by multiplying the largest 0.2% of the observations by 10. The contaminated sample from $D(2, 1, 3)$ no longer Lorenz-dominates the sample from $D(2, 1, 2.5)$, and vice versa. Let $CDa(\cdot, \cdot, \cdot)$ and $NDa(\cdot, \cdot, \cdot)$ be the contaminated and the noncontaminated samples from the Dagum distribution, respectively. Table 1 presents the summary statistics of three samples, $NDa(2, 1, 3)$, $NDa(2, 1, 2.5)$ and $CDa(2, 1, 3)$. Fig. 1 shows their Lorenz curves. These Lorenz curves are drawn using their empirical distributions.²

The priors for the estimation are defined by the hyperparameter values $\mu_0 = 0.0$, $\mu_{c0} = 3.0$, $\tau_0 = \tau_{c0} = 400$, $\nu_0 = \nu_{c0} = 4.0$, $s_0 = s_{c0} = 0.04$, $a_0 = b_0 = 1.0$, $c_0 = 2.0$, $d_0 = 0.5$, which reflect rather weak prior information. The MCMC simulation was run for 15,000 iterations and the first 5,000 samples were discarded as a burn-in period.

Table 1
Summary statistics of simulated data

         NDa(2,1,3)^a   CDa(2,1,3)^a   NDa(2,1,2.5)
mean     1.612          1.831          1.841
sd^b     1.096          5.590          1.858
min      0.243          0.243          0.229
max      17.932         179.317        50.768

a 'CDa' and 'NDa' denote the contaminated and noncontaminated samples.
b 'sd' denotes standard deviation.

2 The posterior results obtained afterward were generated using DIGITAL Visual Fortran version 6.6, and all figures were drawn using Ox version 3.1 (Doornik, 2001).

[Fig. 1. Empirical Lorenz curves (simulated data). Curves: NDa(2,1,3), CDa(2,1,3), NDa(2,1,2.5).]

[Fig. 2. Posterior probabilities of $z_i = 1$ (simulated data). Horizontal axis ticks run from 4950 to 5000.]
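The simulation design of this subsection (Dagum draws, then multiplying the largest 0.2% of observations by 10) can be sketched as below. This is a hedged illustration, not the code used for the paper: the three Dagum parameters are labeled `beta`, `lam`, `delta` (an assumed labeling), the draws use the distribution function $F(y) = (1 + \lambda y^{-\delta})^{-\beta}$ obtained by integrating the density (25), and the names `dagum_sample` and `contaminate_top` as well as the seed are hypothetical.

```python
import numpy as np

def dagum_sample(n, beta, lam, delta, rng):
    """Inverse-CDF draws from F(y) = (1 + lam * y**(-delta))**(-beta)."""
    u = rng.random(n)
    return (lam / (u ** (-1.0 / beta) - 1.0)) ** (1.0 / delta)

def contaminate_top(y, frac=0.002, factor=10.0):
    """Sort y and multiply the largest `frac` share of observations by `factor`."""
    out = np.sort(np.asarray(y, dtype=float))
    m = int(round(frac * out.size))   # 10 observations when N = 5000
    if m > 0:
        out[-m:] *= factor
    return out

rng = np.random.default_rng(0)                # arbitrary seed
nda = dagum_sample(5000, 2.0, 1.0, 3.0, rng)  # noncontaminated sample, D(2, 1, 3)
cda = contaminate_top(nda)                    # contaminated counterpart
```

The contaminated sample differs from the sorted clean sample only in its ten largest values, so its minimum is unchanged while its maximum is ten times larger, which is the pattern visible in the summary statistics reported for the two samples from $D(2, 1, 3)$.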
[Fig. 3. Estimates of the density (simulated data). Panels (a), (b) and (c).]
Using the contaminated sample $CDa(2, 1, 3)$, we implemented the MCMC procedure described in the previous sections. Fig. 2 shows the posterior probabilities of $z_i = 1$. From this figure, we can see that the noncontaminated and the contaminated data are discriminated exactly. Fig. 3 shows the estimated density functions. In this figure, (a) and (b) show the estimated density functions for noncontaminated and contaminated data, and (c) provides a comparison of the estimated density function and the data. The estimated density function fits the histogram well. Fig. 4 shows the Lorenz curve estimated using the predictive distribution of noncontaminated data. Although the estimated Lorenz curve lies slightly inside the Lorenz curve of $NDa(2, 1, 3)$, they are very similar. Table 2 summarizes the Gini indices. In this table, the Gini indices of $NDa(2, 1, 3)$, $CDa(2, 1, 3)$ and $NDa(2, 1, 2.5)$ are calculated from the discrete formulation. The estimated Gini index is slightly less than that of $NDa(2, 1, 3)$.

4.2. Real data

Using income data from the Panel Study of Income Dynamics (PSID), we examine whether our Bayesian approach is applicable to real income data. The data set was downloaded from the PSID site (http://www.isr.umich.edu/src/psid/). We use income data for 1997 from the 'Income Plus' Files: 1994–1997 Family Income and Components Files (Kim et al., 2000). The data consist of 6,307 observations.
[Fig. 4. Estimates of Lorenz curves (simulated data). Curves: posterior, NDa(2,1,3), NDa(2,1,2.5).]

Table 2
Gini index (simulated data)

NDa(2,1,3)   CDa(2,1,3)   NDa(2,1,2.5)   Posterior^a
0.297        0.381        0.356          0.288

a 'Posterior' denotes the Gini index estimated using the posterior results.
Table 3
Summary statistics (1997 Family income data)

         y          log y
Obs.^a   6260       6260
mean     46600.0    10.309
sd^b     50970.7    1.080
min      1          0.000
max      829914     13.629

a 'Obs.' denotes the number of observations.
b 'sd' denotes standard deviation.
We deleted zero and negative income observations from the data set. As a result, the number of observations becomes 6,260. Table 3 provides the summary statistics.
[Fig. 5. Posterior probabilities of $z_i = 1$ (1997 Family income data). Horizontal axis ticks run from 6200 to 6260.]
In order to examine our method of removing the contaminated data, we created artificially contaminated data by multiplying the thirteen (about 0.2%) largest observations by 10. The hyperparameter values are the same as before except for $\mu_0$ and $\mu_{c0}$. In this analysis we set $\mu_0 = 6.0$ and $\mu_{c0} = 14.0$ and ran the MCMC algorithm for 20,000 iterations following a burn-in phase of 5,000 iterations.

Fig. 5 shows the posterior probabilities of $z_i = 1$. The contaminated data are clearly discriminated. Fig. 6 shows the estimated density functions. The estimated density function fits the noncontaminated data. Fig. 7 shows the Lorenz curve estimated using the predictive distribution of noncontaminated data. As a result of removing the contaminated data, the estimated Lorenz curve lies slightly inside that of the nonartificial real data. Table 4 presents the Gini indices of the real data and the posterior results. Because the Gini index estimated from the posterior results does not use the contaminated data, which lie in the upper tail of the income distribution, it is slightly less than that of the real data.

[Fig. 6. Estimates of the density (1997 Family income data). Panels (a), (b) and (c).]

[Fig. 7. Estimates of Lorenz curves (1997 Family income data). Curves: posterior, data (contaminated), data (noncontaminated).]

Table 4
Gini index (1997 Family income data)

Data^a   Posterior^b
0.460    0.447

a 'Data' denotes the Gini index calculated from the empirical distribution.
b 'Posterior' denotes the Gini index estimated using the posterior results.

5. Concluding remarks

In this article, we proposed a Bayesian method for estimating Lorenz curves using the recent development of a Bayesian nonparametric method with a Dirichlet process prior. We also presented a method for removing the contaminated observations. From the results obtained using both simulated data and real data, we found that our approach estimated the Lorenz curves and Gini indices adequately. It should be noted that in addition to the Gini index, other inequality measures (e.g., the generalized entropy measure) can also be estimated using our approach.

In this article, we only used survey data in the estimation of the Lorenz curves and Gini indices. Lorenz curves and inequality measures are frequently estimated using grouped data. There are many classical methods for estimating Lorenz curves and inequality measures from grouped data. For example, Kakwani and Podder (1973, 1976) proposed a prominent method for the estimation of Lorenz curves and inequality measures with grouped data. Further, Chotikapanich and Griffiths (2002) estimated the Lorenz curve using a Dirichlet distribution for the error terms. These studies fit a specific functional form to the Lorenz curves. From the Bayesian point of view, we can extend our Bayesian nonparametric estimation method to the estimation of Lorenz curves and inequality measures from grouped data (Hasegawa, 2002). Therefore, our approach can be applied to a wide range of models which estimate Lorenz curves and inequality measures.

Acknowledgements

The very helpful comments from Professor Arnold Zellner, the editor, and an anonymous associate editor are gratefully acknowledged. The first author also acknowledges financial support from the Japanese Ministry of Education, Culture, Sports, Science and Technology under the Grant-in-Aid for Scientific Research No. 13630024.

References

Antoniak, C.E., 1974. Mixtures of Dirichlet processes with application to Bayesian nonparametric problems. Annals of Statistics 2, 1152–1174.
Blackwell, D., 1973. Discreteness of Ferguson Selections. Annals of Statistics 1, 356–358.
Campolieti, M., 2000. Bayesian Semiparametric estimation of discrete duration models: an application of the Dirichlet process prior. Journal of Applied Econometrics 16, 1–22.
Chotikapanich, D., Griffiths, W.E., 2002. Estimating Lorenz curves using a Dirichlet distribution. Journal of Business and Economic Statistics 20, 290–295.
Cowell, F.A., Victoria-Feser, M.-P., 1996. Robust properties of inequality measures. Econometrica 64, 77–101.
Cowell, F.A., Victoria-Feser, M.-P., 2001. Robust Lorenz curves: a semi-parametric approach, Distributional Analysis Discussion Paper 50, STICERD, London School of Economics, London WC2A 2AE.
Dagum, C., 1977. A new model of personal income distribution: specification and estimation. Economie Appliquée 30, 413–437.
Dey, D., Müller, P., Sinha, D. (Eds.), 1998. Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York.
Doornik, J.A., 2001. Ox 3.0: An Object-Oriented Matrix Programming Language. Timberlake Consultants Ltd., London.
Escobar, M.D., 1994. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association 89, 268–277.
Escobar, M.D., West, M., 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.
Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209–230.
Ferguson, T.S., 1974. Prior distributions on spaces of probability measures. The Annals of Statistics 2, 615–629.
Ferguson, T.S., 1983. Bayesian density estimation by mixtures of normal distribution. In: Rizvi, M.H., Rustagi, J.S. (Eds.), Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, New York, pp. 287–302.
Groves, R.M., 1989. Survey Errors and Survey Costs. Wiley, New York.
Hasegawa, H., 2002. Bayesian analysis of Lorenz curves from grouped data. Mimeographed, Hokkaido University.
Hirano, K., 2002. Semiparametric Bayesian inference in autoregressive panel data. Econometrica 70, 781–799.
Kakwani, N.C., Podder, N., 1973. On the estimation of Lorenz curves from grouped observations. International Economic Review 14, 278–292.
Kakwani, N.C., Podder, N., 1976. Efficient estimation of the Lorenz curve and associated inequality measures from grouped observations. Econometrica 44, 137–148.
Kim, Y.-S., Loup, T., Lupton, J., Stafford, F.P., 2000. Notes on the 'Income Plus' Files: 1994–1997 family income and components files. Documentation, the Panel Study of Income Dynamics (http://www.isr.umich.edu/src/psid/income94-97/y pls notes.htm).
Kozumi, H., Hasegawa, H., 2000. A Bayesian analysis of structural changes with an application to the displacement effect. The Manchester School 68, 476–490.
Lambert, P.J., 1993. The Distribution and Redistribution of Income: A Mathematical Analysis, 2nd Edition. Manchester University Press, Manchester.
McDonald, J.B., Xu, Y.J., 1995. A generalization of the beta distribution with application. Journal of Econometrics 66, 133–152.
Ryu, H.K., Slottje, D.J., 1996. Flexible functional form approaches for approximating the Lorenz curve. Journal of Econometrics 72, 251–274.
Ryu, H.K., Slottje, D.J., 1999. Parametric approximations of the Lorenz curve. In: Silber, J. (Ed.), Handbook on Income Inequality Measurement. Kluwer Academic Publishers, Boston, pp. 291–312.
Sarabia, J.-M., Castillo, E., Slottje, D.J., 1999. An ordered family of Lorenz curves. Journal of Econometrics 91, 43–60.
West, M., Müller, P., Escobar, M.D., 1994. Hierarchical priors and mixture models with application in regression and density estimation. In: Smith, A.F.M., Freeman, P. (Eds.), Aspects of Uncertainty: A Tribute to D.V. Lindley. Wiley, New York, pp. 363–386.