Economics Letters 129 (2015) 45–48
Economics Letters journal homepage: www.elsevier.com/locate/ecolet
A piecewise method for estimating the Lorenz curve✩
ZuXiang Wang a, Russell Smyth b,∗
a Wuhan University, China
b Monash University, Australia
highlights
• We develop a piecewise method to estimate the Lorenz curve.
• It provides a good approximation of the Lorenz curve and frequencies.
• We illustrate that the piecewise method produces plausible density estimates.
• We show that the method is useful when the income data has multiple peaks.
article info
Article history:
Received 2 January 2015
Received in revised form 4 February 2015
Accepted 6 February 2015
Available online 14 February 2015
abstract
We propose a piecewise method to estimate the Lorenz curve for grouped income data. Our illustrative application shows that the method can produce a more plausible density estimate when the income distribution data has multiple peaks and the Lorenz curve cannot be modeled satisfactorily over the entire interval with a single Lorenz curve model.
© 2015 Elsevier B.V. All rights reserved.
JEL classification: C80; D31
Keywords: Lorenz curve; Parametric form; Grouped income data
1. Introduction
Lorenz models can be used to model grouped income data to obtain both the Lorenz curve and the frequencies of the underlying income distribution. Many models are available in the literature (e.g., Ryn and Slottje, 1996; Sarabia et al., 1999, 2001; Rohde, 2009). A drawback of these models is that they do not simultaneously provide a good fit for both the Lorenz curve values and the frequencies (Wang et al., 2011). We suggest a piecewise Lorenz curve method to address this drawback. Moreover, income distribution data with
multiple peaks implies income polarization, which has drawn much recent attention (see, e.g., Foster and Wolfson, 2010). The method that we propose can be used to model such distributions. Kakwani (1976) and Cowell and Mehta (1982) approximate income data by piecewise interpolation of densities and then obtain Lorenz curves from the densities. Our approach is the opposite: we approximate the Lorenz curve piecewise and then obtain the densities from the Lorenz curves.

✩ Wang acknowledges financial support from the Social Science Foundation of China (10BJL015, 12AZD030), Chinese Ministry of Education (09YJA790152), the Center for Economic Development Research, Wuhan University and the Social Science Division, Wuhan University.
∗ Correspondence to: Department of Economics, Monash University 3800, Australia. Tel.: +61 399051560. E-mail address: [email protected] (R. Smyth).
http://dx.doi.org/10.1016/j.econlet.2015.02.008
0165-1765/© 2015 Elsevier B.V. All rights reserved.

2. The piecewise method for estimating the Lorenz curve
Suppose we have grouped income data on the interval $[c, d]$:
\[ (p_i, L_i)_{i=1}^{n}, \qquad (p_i, x_i/\mu)_{i=1}^{n}, \tag{1} \]
where $p_i$ is the proportion of income units whose income is less than or equal to $x_i$, and $L_i$ and $\mu$ are, respectively, the income share and the average income for those income units on $[c, d]$. Assume $0 \le x_1 < x_2 < \cdots < x_n$. Let $l(p)$ be the actual Lorenz curve on $[c, d]$. It follows that
\[ l(p_i) = L_i, \qquad l'(p_i) = x_i/\mu, \qquad i = 1, 2, \ldots, n. \tag{2} \]
Let the actual distribution function for incomes on $[c, d]$ be $F(x)$, with $p_{n_L} = F(x_{n_L})$ and $p_{n_R} = F(x_{n_R})$, where $n_L$ and $n_R$ are positive integers with $n_R < n$, $n_L < n$ and $n_R < n_L$, implying $x_{n_R} < x_{n_L}$. Therefore, $p_{n_L}$ and $1 - p_{n_R}$ are the population shares on $[c, x_{n_L}]$ and $[x_{n_R}, d]$ respectively. Let $L_L(p)$ and $L_R(p)$ be estimated Lorenz curves for incomes on $[c, x_{n_L}]$ and $[x_{n_R}, d]$ respectively. Assume $x_m \in [x_{n_R}, x_{n_L}]$. Define the two-piece-Lorenz-curve (TPLC) estimation for the grouped income data in (1) as:
\[
L(p) =
\begin{cases}
L_m \dfrac{L_L\left(p/p_{n_L}\right)}{L_L\left(p_m/p_{n_L}\right)}, & p = F(x),\ x \le x_m, \\[2ex]
L_m + (1 - L_m) \dfrac{L_R\left(\frac{p - p_{n_R}}{1 - p_{n_R}}\right) - L_R\left(\frac{p_m - p_{n_R}}{1 - p_{n_R}}\right)}{1 - L_R\left(\frac{p_m - p_{n_R}}{1 - p_{n_R}}\right)}, & p = F(x),\ x \ge x_m,
\end{cases}
\tag{3}
\]
where $L_m = l(p_m)$ and $p_m = F(x_m)$.
Fig. 1. TPLC and ELC using balanced fit with b = 1.
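As a rough illustration of how (3) joins two local curves, the following Python sketch builds $L(p)$ from two callables. The names `make_tplc`, `LL` and `LR`, and the toy curve $l(p) = p^2$ with join points 0.4, 0.5 and 0.6, are assumptions made for this example and do not come from the paper. The branch is selected by comparing $p$ with $p_m$, which presumes that $F$ is increasing.

```python
from typing import Callable

Lorenz = Callable[[float], float]


def make_tplc(LL: Lorenz, LR: Lorenz, p_nL: float, p_nR: float,
              p_m: float, L_m: float) -> Lorenz:
    """Return L(p) of Eq. (3), joining LL and LR at the point (p_m, L_m)."""
    left_at_m = LL(p_m / p_nL)                     # L_L(p_m / p_nL)
    right_at_m = LR((p_m - p_nR) / (1.0 - p_nR))   # L_R((p_m - p_nR) / (1 - p_nR))

    def L(p: float) -> float:
        if p <= p_m:                               # left branch, x <= x_m
            return L_m * LL(p / p_nL) / left_at_m
        q = (p - p_nR) / (1.0 - p_nR)              # right branch, x >= x_m
        return L_m + (1.0 - L_m) * (LR(q) - right_at_m) / (1.0 - right_at_m)

    return L


# Toy check (hypothetical): take l(p) = p**2 as the "actual" Lorenz curve.
def l(p: float) -> float:
    return p * p


p_nR, p_m, p_nL = 0.4, 0.5, 0.6


def LL(q: float) -> float:
    """Actual Lorenz curve of the left portion, rescaled to [0, 1]."""
    return l(q * p_nL) / l(p_nL)


def LR(q: float) -> float:
    """Actual Lorenz curve of the right portion, rescaled to [0, 1]."""
    return (l(p_nR + q * (1.0 - p_nR)) - l(p_nR)) / (1.0 - l(p_nR))


tplc = make_tplc(LL, LR, p_nL, p_nR, p_m, l(p_m))
# Largest gap between the joined curve and l(p) on a grid of 101 points.
max_gap = max(abs(tplc(k / 100) - l(k / 100)) for k in range(101))
```

When the two local curves are the actual conditional Lorenz curves, as in this toy setup, the joined curve should coincide with $l(p)$ up to floating-point error, so `max_gap` is expected to be negligible.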
Theorem. $L(p)$ in (3) is a Lorenz curve. Furthermore, if $L_L(p)$ and $L_R(p)$ are the actual Lorenz curves on $[c, x_{n_L}]$ and $[x_{n_R}, d]$ respectively, then (3) equals $l(p)$.
First, if $L_L(p)$ and $L_R(p)$ satisfy the definition of the Lorenz curve, so does $L(p)$, because $L(0) = 0$, $L(1) = 1$, $L'(p) \ge 0$ and $L''(p) \ge 0$.
Second, if $L_L(p)$ is the actual Lorenz curve for incomes on $[c, x_{n_L}]$, then it must satisfy
\[ L_L\left(p/p_{n_L}\right) = l(p)/L_{n_L}, \tag{4} \]
for any $p = F(x)$ with $x \le x_{n_L}$. If $L_R(p)$ is the actual Lorenz curve for the incomes on $[x_{n_R}, d]$, then it must satisfy
\[ L_R\left(\frac{p - p_{n_R}}{1 - p_{n_R}}\right) = \frac{l(p) - L_{n_R}}{1 - L_{n_R}}, \tag{5} \]
for any $p = F(x)$ with $x \ge x_{n_R}$. Relationships (4) and (5) imply that the right-hand side of (3) equals $l(p)$ for any $p \in [0, 1]$.
The TPLC estimation procedure in (3) is useful when there is a peak in either $[c, x_{n_L}]$ or $[x_{n_R}, d]$, or both, and the Lorenz curve on the entire interval cannot be modeled satisfactorily with a single Lorenz curve model. $L(p)$ in (3) can be generalized to more than two pieces. For example, one could estimate a three-piece-Lorenz-curve for the entire distribution by further creating a TPLC for incomes on $[x_m, d]$.
To apply the TPLC, we use a Lorenz model to estimate the grouped datasets for incomes on $[c, x_{n_L}]$ and $[x_{n_R}, d]$ respectively, obtaining estimated Lorenz curves $L_L(p)$ and $L_R(p)$. Entering them into (3), we obtain the TPLC on $[c, d]$. Irrespective of the method applied locally to the two portions, the convexity of the resulting curve is preserved according to the theorem. In estimation, $p = F(x)$ in (3) is replaced by the estimated distribution function for all incomes on $[c, d]$. It can be obtained by solving $x = \mu L'(p)$ for any $x \ge 0$, and the related density $f(x)$ can be estimated by
\[ f(x) = \frac{1}{\mu L''(p)}. \]
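The inversion $x = \mu L'(p)$ and the density formula above can be sketched numerically as follows. This is a minimal sketch that assumes the first and second derivatives of the Lorenz curve are available as callables; the function names, the bisection solver and the toy curve $L(p) = p^2$ are illustrative choices, not the authors' implementation.

```python
from typing import Callable

Fn = Callable[[float], float]


def cdf_from_lorenz(dL: Fn, mu: float, x: float, eps: float = 1e-12) -> float:
    """Solve mu * L'(p) = x for p by bisection; L' is non-decreasing on (0, 1)."""
    lo, hi = eps, 1.0 - eps          # stay off the endpoints, where L' may be unbounded
    if mu * dL(lo) >= x:
        return 0.0                   # x lies at or below the smallest income
    if mu * dL(hi) <= x:
        return 1.0                   # x lies at or above the largest income
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mu * dL(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def density_from_lorenz(dL: Fn, d2L: Fn, mu: float, x: float) -> float:
    """Estimate f(x) = 1 / (mu * L''(p)) evaluated at p = F(x)."""
    p = cdf_from_lorenz(dL, mu, x)
    return 1.0 / (mu * d2L(p))


# Toy example: L(p) = p**2 with mu = 1 corresponds to incomes uniform on [0, 2].
dL = lambda p: 2.0 * p
d2L = lambda p: 2.0
F_at_1 = cdf_from_lorenz(dL, 1.0, 1.0)
f_at_1 = density_from_lorenz(dL, d2L, 1.0, 1.0)
```

For the toy curve the quantile function is $x = 2p$, so both `F_at_1` and `f_at_1` should be close to 0.5.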
$x_m \in \{x_{n_R}, x_{n_R+1}, \ldots, x_{n_L}\}$ is used as a critical point joining $L_L(p)$ and $L_R(p)$ in forming the TPLC. Different values of $x_m$ will result in different $F(x)$, $L(p)$ and $f(x)$. $x_m$ can be determined by solving
\[ \min_{x_m,\; n_R \le m \le n_L} \left| F(x_{n_L}) - F(x_{n_R}) - \left(p_{n_L} - p_{n_R}\right) \right|. \tag{6} \]
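Alternatively, since the estimated density is generally discontinuous at any selection of $x_m \in \{x_{n_R}, x_{n_R+1}, \ldots, x_{n_L}\}$, we can take $x_m$ to minimize the jump in density at $x_m$ by solving
\[ \min_{x_m,\; n_R \le m \le n_L} \left\{ \lim_{\varepsilon \to 0} \left| f(x_m + \varepsilon) - f(x_m - \varepsilon) \right| \right\}. \tag{7} \]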
3. Illustrative application
In this section we compare the fit of the TPLC with that of an estimated Lorenz curve (ELC), applied to Swedish income distribution data for 1977 provided in Cowell and Mehta (1982). We use Swedish income data because it clearly contains multiple peaks, as demonstrated in Fig. 1. The Lorenz model
\[ G(p) = L_{\lambda_1}(p)^{\alpha} \left[ \delta L_{\beta}(p) + (1 - \delta) L_{\lambda}(p) \right]^{\eta}, \qquad \delta \in [0, 1], \tag{8} \]
is used to fit the entire dataset to obtain the ELC, where
\[ L_{\beta}(p) = 1 - (1 - p)^{\beta}, \qquad L_{\lambda}(p) = \frac{e^{\lambda p} - 1}{e^{\lambda} - 1}, \qquad \beta \in (0, 1], \quad \lambda > 0. \]
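To make the building blocks of model (8) concrete, here is a small Python sketch of $L_\beta$, $L_\lambda$ and $G$. The parameter values are hypothetical and chosen only for illustration; they are not the estimates reported in the Appendix.

```python
import math


def L_beta(p: float, beta: float) -> float:
    """Pareto-type Lorenz curve 1 - (1 - p)**beta, with beta in (0, 1]."""
    return 1.0 - (1.0 - p) ** beta


def L_lambda(p: float, lam: float) -> float:
    """Exponential-form Lorenz curve (e^{lam p} - 1) / (e^{lam} - 1), with lam > 0."""
    return math.expm1(lam * p) / math.expm1(lam)


def G(p: float, lam1: float, beta: float, lam: float,
      delta: float, alpha: float, eta: float) -> float:
    """Model (8): L_{lam1}(p)^alpha * [delta L_beta(p) + (1 - delta) L_lam(p)]^eta."""
    mix = delta * L_beta(p, beta) + (1.0 - delta) * L_lambda(p, lam)
    return L_lambda(p, lam1) ** alpha * mix ** eta


# Hypothetical parameter values (illustration only).
params = dict(lam1=0.5, beta=0.8, lam=2.0, delta=0.5, alpha=1.2, eta=0.3)
grid = [k / 50 for k in range(51)]
values = [G(p, **params) for p in grid]
```

On this grid the curve starts at 0, ends at 1 and increases in between, as a Lorenz curve should. `math.expm1` is used instead of `math.exp(...) - 1` to limit rounding error for small arguments.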
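Divide the income interval into two subintervals. The models
\[ H(p) = L_{\beta}(p)^{\alpha} \left[ \delta p + (1 - \delta) L_{\lambda}(p) \right]^{\eta}, \qquad \delta \in [0, 1], \tag{9} \]
and
\[ R(p) = p^{\alpha} L_{\beta}(p)^{\eta} \tag{9$'$} \]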
are applied respectively to the datasets for the left and right subintervals. $\alpha$ and $\eta$ in (8), (9) and (9′) satisfy $\alpha > 0$, $\eta \ge 0$ and $\alpha + \eta \ge 1$. H(p) and R(p) are considered by Wang et al. (2011), R(p) is considered by Sarabia et al. (1999), $L_{\beta}(p)$ is the Lorenz curve for the classical Pareto distribution and $L_{\lambda}(p)$ is proposed by Chotikapanich (1993). We use the balanced fit approach suggested by Wang et al. (2011), which estimates parameters by minimizing
\[ b \sum_{i=1}^{n} \left[ L_i - L(p_i) \right]^2 + (1 - b) \sum_{i=1}^{n} \left[ p_i - \hat{F}(x_i) \right]^2, \qquad b \in [0, 1], \tag{10} \]
where $L(p)$ is the Lorenz model and $\hat{F}(x_i)$ is the solution of the equation $\mu L'(p) = x_i$, and is the estimated frequency of income units at $x_i$. $\mu$ is average income. We use b = 1 or b = 0 below. Parameter estimates with standard errors are in the Appendix.
The dashed curve in Fig. 1 is the ELC for the Swedish data, while the two solid curves together are the TPLC for the data, both with b = 1. Fig. 2 displays the counterpart curves with b = 0. Note that $x_{n_R}$ with $n_R = 7$ and $x_{n_L}$ with $n_L = 9$ are selected for both estimates. The joining point $x_m = x_8$ determined by (7) is the same for both b = 1 and b = 0.
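The balanced fit objective (10) can be sketched in Python as below. This is only an illustration under stated assumptions: the one-parameter family $L(p) = p^a$ (the special case of (9′) with $\eta = 0$), the three synthetic data points, the bisection solver for $\hat{F}$ and the grid search all stand in for the paper's models, data and optimizer, which are not specified here.

```python
from typing import Callable, Sequence

Fn = Callable[[float], float]


def solve_cdf(dL: Fn, mu: float, x: float, eps: float = 1e-12) -> float:
    """F_hat(x): solve mu * L'(p) = x by bisection on (0, 1)."""
    lo, hi = eps, 1.0 - eps
    if mu * dL(lo) >= x:
        return 0.0
    if mu * dL(hi) <= x:
        return 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mu * dL(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def balanced_fit_loss(L: Fn, dL: Fn, p: Sequence[float], Ls: Sequence[float],
                      x: Sequence[float], mu: float, b: float) -> float:
    """Objective (10): b * sum (L_i - L(p_i))^2 + (1 - b) * sum (p_i - F_hat(x_i))^2."""
    lorenz_term = sum((Li - L(pi)) ** 2 for pi, Li in zip(p, Ls))
    freq_term = sum((pi - solve_cdf(dL, mu, xi)) ** 2 for pi, xi in zip(p, x))
    return b * lorenz_term + (1.0 - b) * freq_term


# Synthetic grouped data generated from L(p) = p**2 with mu = 1 (hypothetical).
p = [0.25, 0.50, 0.75]
Ls = [pi ** 2 for pi in p]
x = [2.0 * pi for pi in p]


def fit(b: float) -> float:
    """Grid search over a in the family L(p) = p**a, a in {1.1, 1.2, ..., 4.0}."""
    candidates = [1.0 + 0.1 * k for k in range(1, 31)]
    return min(candidates, key=lambda a: balanced_fit_loss(
        lambda q: q ** a, lambda q: a * q ** (a - 1.0), p, Ls, x, 1.0, b))


a_b1, a_b0 = fit(1.0), fit(0.0)
```

Because the synthetic data are generated exactly from $a = 2$, both the pure Lorenz fit (b = 1) and the pure frequency fit (b = 0) should select $a = 2$. With real grouped data the two fits generally differ, which is the trade-off that b controls.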
Table 1 Lorenz curve approximation. xi
Fig. 2. TPLC and ELC using balanced fit with b = 0.
The columns titled ELC in Table 1 give estimated Lorenz curve values when model (8) is used with either b = 1 or b = 0. The columns titled TPLC give the corresponding two-piece-Lorenz-curve approximation when models (9) and (9′) are used. We follow Sarabia et al. (1999) and use MSE, MAE and MAS to measure goodness of fit. The TPLC is a better fit than the ELC with b = 1 or b = 0. Between alternative b values, the TPLC performs better when b = 1.
Table 2 gives the frequency estimation from the estimated Lorenz curves, in which the column titled $f_i$ contains frequencies of income units in alternative intervals. The estimates from the ELC with b = 1 and b = 0 are indistinguishable, implying no improvement can be obtained through selecting b. The estimates with the TPLC are better than their ELC counterparts with b = 1 or b = 0. The TPLC density with b = 1 in Fig. 1 is visually less plausible than its counterpart with b = 0 in Fig. 2. However, the former produces a very attractive Lorenz curve estimate.
Let the parameter vector of Lorenz model $L(p)$ be $\theta$, and let the minimizers of $\sum_{i=1}^{n} [L_i - L(p_i)]^2$ and $\sum_{i=1}^{n} [p_i - \hat{F}(x_i)]^2$ be $\hat{\theta}^1$ and $\hat{\theta}^0$ respectively, with corresponding minimums $M^1$ and $M^0$. While b is allowed to vary, the minimum of (10) must be larger than, or equal to, $\min\{M^1, M^0\}$; otherwise $\hat{\theta}^1$, $\hat{\theta}^0$ or both cannot be optimal. However, with piecewise approximation, opportunities exist to obtain better estimates through optimal selection of b. Let $L_L^{b_L}(p)$ and $L_R^{b_R}(p)$, with $b_L$ and $b_R$, be Lorenz curve estimates on intervals $[c, x_{n_L}]$ and $[x_{n_R}, d]$ respectively. We can build different combinations of $L_L^{b_L}$ and $L_R^{b_R}$. For example, a better Lorenz curve or density estimate for the overall interval can be the TPLC from joining $L_L^1(p)$ and $L_R^0(p)$.¹
Next, we estimate the Gini index and the polarization index proposed by Foster and Wolfson (2010). The latter can be written as $2(1 - 2L(0.5) - G)/L'(0.5)$.² We do so for both the whole population and a majority larger than 89 per cent of the population with income less than h = 70. The findings are reported in Table 3.
¹ Consider our estimates for the Swedish data. Call the TPLC obtained by joining $L_L^1(p)$ on $[c, x_m]$ and $L_R^0(p)$ on $[x_m, d]$ the combination CA. Call the TPLC obtained by joining $L_L^0(p)$ on $[c, x_m]$ and $L_R^1(p)$ on $[x_m, d]$ the combination CB. We find that the Lorenz curve value estimates with CB are slightly better than those with CA in terms of the three error measures and the frequency estimate from CA is slightly better than that from CB in terms of the first two measures. The CB estimates of the Lorenz curve values are slightly inferior to the TPLC with b = 1 as shown in Table 1, while the frequency estimates from CA are slightly inferior to the TPLC with b = 0 as shown in Table 2. However, estimates from both CA and CB are better than single model estimates.
² The index is defined as $2(1 - 2L(0.5) - G)\mu/m$, with m the median, µ the average, G the Gini index, and L(·) the Lorenz curve. We have used $L'(0.5) = m/\mu$.
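The two indices can be computed from any Lorenz curve given as a callable. The sketch below uses the standard identity $G = 1 - 2\int_0^1 L(p)\,dp$ for the Gini index and the form $2(1 - 2L(0.5) - G)/L'(0.5)$ for the polarization index. The Simpson integration, the finite-difference derivative and the toy curve $L(p) = p^2$ are assumptions made for illustration, not the authors' procedure, and the example does not use the Swedish data.

```python
from typing import Callable

Fn = Callable[[float], float]


def gini(L: Fn, n: int = 1000) -> float:
    """Gini index 1 - 2 * integral_0^1 L(p) dp, by composite Simpson's rule (n even)."""
    h = 1.0 / n
    s = L(0.0) + L(1.0)
    s += 4.0 * sum(L((2 * k - 1) * h) for k in range(1, n // 2 + 1))  # odd nodes
    s += 2.0 * sum(L(2 * k * h) for k in range(1, n // 2))            # interior even nodes
    return 1.0 - 2.0 * s * h / 3.0


def polarization(L: Fn, h: float = 1e-5) -> float:
    """Foster-Wolfson index written as 2 * (1 - 2 L(0.5) - G) / L'(0.5)."""
    dL_half = (L(0.5 + h) - L(0.5 - h)) / (2.0 * h)   # central difference for L'(0.5)
    return 2.0 * (1.0 - 2.0 * L(0.5) - gini(L)) / dL_half


# Toy example: L(p) = p**2 (incomes uniform on [0, 2 * mu]).
toy = lambda p: p * p
G_toy = gini(toy)
P_toy = polarization(toy)
```

For the toy curve, $\int_0^1 p^2\,dp = 1/3$, $L(0.5) = 0.25$ and $L'(0.5) = 1$, so both indices should come out close to 1/3.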
Table 1
Lorenz curve approximation.

                                 b = 1              b = 0
x_i          p_i      L_i      ELC      TPLC      ELC      TPLC
5            0.0264   0.0016   0.0014   0.0020    0.0018   0.0023
10           0.0735   0.0102   0.0083   0.0105    0.0095   0.0111
15           0.1649   0.0385   0.0341   0.0381    0.0363   0.0387
20           0.2571   0.0774   0.0750   0.0776    0.0771   0.0778
25           0.3291   0.1167   0.1167   0.1166    0.1182   0.1169
30           0.3973   0.1621   0.1639   0.1621    0.1647   0.1623
35           0.4620   0.2132   0.2156   0.2133    0.2156   0.2133
40           0.5243   0.2699   0.2718   0.2699    0.2713   0.2699
45           0.5920   0.3398   0.3401   0.3409    0.3393   0.3397
50           0.6694   0.4290   0.4276   0.4301    0.4269   0.4285
55           0.7473   0.5283   0.5267   0.5287    0.5265   0.5271
60           0.8123   0.6188   0.6183   0.6187    0.6187   0.6172
70           0.8936   0.7459   0.7467   0.7460    0.7475   0.7440
80           0.9343   0.8195   0.8198   0.8199    0.8214   0.8172
100          0.9698   0.8957   0.8960   0.8961    0.9005   0.8923
120          0.9839   0.9327   0.9337   0.9329    0.9409   0.9287
150          0.9923   0.9599   0.9607   0.9598    0.9697   0.9556
200          0.9971   0.9796   0.9789   0.9792    0.9880   0.9755
500          0.9998   0.9968   0.9949   0.9971    0.9994   0.9956
MSE × 10^5                     0.2610   0.0175    1.4844   0.4020
MAE                            0.0012   0.0003    0.0026   0.0014
MAS                            0.0043   0.0011    0.0097   0.0044
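The three error measures can be recomputed from the table columns. The sketch below assumes that MSE, MAE and MAS denote the mean squared error, the mean absolute error and the maximum absolute error; the text does not spell these definitions out, so treat them as an interpretation. It uses the $L_i$ column and the ELC column for b = 1 as printed in Table 1.

```python
from typing import Sequence, Tuple


def fit_errors(actual: Sequence[float], fitted: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MSE, MAE, MAS): mean squared, mean absolute and maximum absolute error."""
    diffs = [a - f for a, f in zip(actual, fitted)]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    mas = max(abs(d) for d in diffs)
    return mse, mae, mas


# L_i and ELC (b = 1) values as printed in Table 1.
L_obs = [0.0016, 0.0102, 0.0385, 0.0774, 0.1167, 0.1621, 0.2132, 0.2699, 0.3398,
         0.4290, 0.5283, 0.6188, 0.7459, 0.8195, 0.8957, 0.9327, 0.9599, 0.9796,
         0.9968]
L_elc = [0.0014, 0.0083, 0.0341, 0.0750, 0.1167, 0.1639, 0.2156, 0.2718, 0.3401,
         0.4276, 0.5267, 0.6183, 0.7467, 0.8198, 0.8960, 0.9337, 0.9607, 0.9789,
         0.9949]

mse, mae, mas = fit_errors(L_obs, L_elc)
print(f"MSE x 1e5 = {mse * 1e5:.4f}, MAE = {mae:.4f}, MAS = {mas:.4f}")
```

Because the table entries are rounded to four decimals, values recomputed this way need not reproduce the published error measures exactly, although they should be of the same order.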
Table 2
Frequency approximation from estimated Lorenz curves.

                        b = 1              b = 0
x_i          f_i      ELC      TPLC      ELC      TPLC
5            0.0264   0.0387   0.0257    0.0310   0.0224
10           0.0471   0.0583   0.0562    0.0586   0.0566
15           0.0914   0.0670   0.0793    0.0717   0.0849
20           0.0922   0.0719   0.0942    0.0769   0.0935
25           0.0720   0.0745   0.0761    0.0779   0.0731
30           0.0682   0.0755   0.0643    0.0767   0.0667
35           0.0647   0.0754   0.0659    0.0745   0.0655
40           0.0623   0.0742   0.0520    0.0718   0.0644
45           0.0677   0.0722   0.0729    0.0690   0.0666
50           0.0773   0.0693   0.0877    0.0662   0.0811
55           0.0780   0.0653   0.0784    0.0634   0.0770
60           0.0649   0.0598   0.0611    0.0602   0.0624
70           0.0813   0.0914   0.0780    0.0939   0.0804
80           0.0407   0.0427   0.0416    0.0402   0.0418
100          0.0355   0.0321   0.0369    0.0309   0.0353
120          0.0141   0.0147   0.0141    0.0151   0.0130
150          0.0085   0.0105   0.0083    0.0137   0.0077
200          0.0048   0.0045   0.0045    0.0080   0.0043
500          0.0028   0.0017   0.0027    0.0000   0.0030
MSE × 10^5            9.9547   2.6686    8.2620   0.9197
MAE                   0.0075   0.0035    0.0073   0.0020
MAS                   0.0243   0.0120    0.0197   0.0095
Table 3
Inequality and polarization estimates for Sweden.

                 Entire population        89% at lower income
                 Gini     Polarization    Gini     Polarization
b = 1   ELC      0.3574   0.3165          0.3086   0.2997
        TPLC     0.3559   0.3264          0.3056   0.3184
b = 0   ELC      0.3556   0.3247          0.3076   0.3055
        TPLC     0.3573   0.3235          0.3047   0.3168
Gini estimates for the entire population are larger than 0.35 for the ELC and TPLC with b = 1 or b = 0, which are slightly larger than the estimates given by Cowell and Mehta (1982). The Gini estimates are not sensitive, across the ELC and TPLC, to the choice of b. The polarization index for the entire population is almost the same for the ELC and TPLC for both values of b. However, the estimates of the index for the subpopulation are larger for the TPLC than for the ELC.
4. Conclusion
We have proposed a multiple Lorenz curve method to model grouped income data. Using Swedish income data from 1977, we demonstrate that the method overcomes the drawback that the single Lorenz curve method cannot produce good approximations to both the Lorenz curve and the frequencies at the same time. We show that the TPLC works particularly well when the income data has multiple peaks.
Appendix
Parameter estimates of the Lorenz curves for Sweden

            λ1        β        λ         δ        α        η
b = 1       0.1530    0.3276   44.9958   0.7808   1.6329   0.1056
b = 0       17.9586   0.9998   35.0000   0.9590   0.0233   1.6074

                  β        λ         δ        α        η
b = 1   Left      1.0000   20.1024   0.0002   1.5807   0.0168
        Right     0.6453   –         –        0.3838   0.7056
b = 0   Left      1.0000   16.4930   0.0004   1.5249   0.0245
        Right     0.5466   –         –        0.6459   0.4767

Approximate standard errors of parameters for Sweden

            λ1         β        λ         δ        α        η
b = 1       0.1431     0.1432   12.8680   0.1837   0.0619   0.0797
b = 0       204.2071   0.1469   14.8907   0.0245   0.2823   0.1447

                  β        λ         δ        α        η
b = 1   Left      0.0188   15.1124   0.0011   0.0363   0.0169
        Right     0.0077   –         –        0.0287   0.0261
b = 0   Left      0.0295   7.7755    0.0014   0.0509   0.0176
        Right     0.0149   –         –        0.0288   0.0260

References
Chotikapanich, D., 1993. A comparison of alternative functional forms for the Lorenz curve. Econom. Lett. 3, 187–192.
Cowell, F.A., Mehta, F., 1982. The estimation and interpolation of inequality measures. Rev. Econom. Stud. 49, 273–290.
Foster, J.E., Wolfson, M.C., 2010. Polarization and the decline of the middle class: Canada and the U.S. J. Econ. Inequal. 8, 247–273.
Kakwani, N., 1976. On the estimation of income inequality measures from grouped observations. Rev. Econom. Stud. 43, 483–492.
Rohde, N., 2009. An alternative functional form for estimating the Lorenz curve. Econom. Lett. 105, 61–63.
Ryn, H.K., Slottje, D.J., 1996. Two flexible functional form approaches for approximating the Lorenz curve. J. Econometrics 72, 251–274.
Sarabia, J., Castillo, E., Slottje, D.J., 1999. An ordered family of Lorenz curves. J. Econometrics 91, 43–60.
Sarabia, J., Castillo, E., Slottje, D.J., 2001. An exponential family of Lorenz curves. South. Econ. J. 67, 748–756.
Wang, Z., Ng, Y.-K., Smyth, R., 2011. A general method for creating Lorenz curves. Rev. Income Wealth 57, 561–582.