Accepted Manuscript Estimation of Lorenz curves based on dummy variable regression Zheng-Xin Wang, Hai-Lun Zhang, Hong-Hao Zheng
PII: S0165-1765(19)30029-1
DOI: https://doi.org/10.1016/j.econlet.2019.01.021
Reference: ECOLET 8348
To appear in: Economics Letters
Received date: 3 June 2018
Revised date: 16 January 2019
Accepted date: 22 January 2019
Please cite this article as: Z.-X. Wang, H.-L. Zhang and H.-H. Zheng, Estimation of Lorenz curves based on dummy variable regression. Economics Letters (2019), https://doi.org/10.1016/j.econlet.2019.01.021
Highlights
We propose an estimation method of the Lorenz curve based on dummy variable regression.
We grant economic connotations to parameters in the estimation of the Lorenz curve.
The method has advantages of higher fitting precision and stronger adaptability.
Estimation of Lorenz curves based on dummy variable regression
Zheng-Xin Wang a,b,c, Hai-Lun Zhang a, Hong-Hao Zheng a
a. School of Economics, Zhejiang University of Finance & Economics, Hangzhou 310018, China
b. Center for Research of Regulation & Policy, Zhejiang University of Finance & Economics, Hangzhou 310018, China
c. Center for Regional Economy & Integrated Development, Zhejiang University of Finance & Economics, Hangzhou 310018, China
E-mail:
[email protected] (Zheng-Xin Wang);
[email protected] (Hai-Lun Zhang);
[email protected] (Hong-Hao Zheng, Corresponding author).
Abstract: We propose a new method for estimating the Lorenz curve based on dummy variable regression, which gives the parameters in the estimation of the Lorenz curve clear economic connotations. The dummy variable regression model of the Lorenz curve is established by introducing dummy variables that group the data by income level. On this basis, it is proved that the regression function has the convexity that the Lorenz curve should have. To verify the effectiveness of the new method, an empirical study is carried out using the income data of urban and rural residents in the Chinese Household Income Project Survey (CHIP) database. In the empirical study, the proposed model is compared with several existing classical parametric models. The results indicate that the slope parameter of the model represents the cumulative income–population resilience of the corresponding residents and reflects the income structure of residents to some extent. Meanwhile, compared with the classical parametric models, the newly proposed method has the advantages of higher fitting precision and stronger adaptability.
Keywords: Income distribution, Lorenz curve, Dummy variable regression
JEL codes: C80, D31
1. Introduction
The Lorenz curve, the basic tool for studying income inequality, is mainly described through the interpolation of grouped data or fitted using specific parametric models. Kakwani and Podder (1973), Villasenor and Arnold (1989), and Chotikapanich (1993) proposed single-parameter models for fitting the Lorenz curve. Sarabia et al. (2015) revealed that the bi-parametric estimation method of the Lorenz curve proposed by Wang and Smyth (2015) is essentially a bi-parametrization of the hyperbolic Lorenz curve (Arnold, 1986). Prendergast and Staudte (2016) introduced a quantile ratio index to replace the ratio part in the original Lorenz curve $L(p)$. By introducing more parameters, previous studies greatly improved the fitting precision of the Lorenz curve. However, parameters obtained with either single- or multi-parameter models lack clear economic connotations. Furthermore, the precision and reliability of the above-mentioned models are relatively low.
In this study, the curve is divided into segments using dummy variables and the Lorenz curve is fitted with a multivariate linear regression model, which improves the fitting precision while guaranteeing that the parameters in the model have clear economic connotations. It is worth noting that, unlike Prendergast and Staudte (2016), this study constructs a dummy variable regression model based on the idea of differentiation, which follows the original definition of the Lorenz curve. The two approaches differ as follows: the quantile regression model yields the wealth distribution at each quantile of the population, whereas the dummy variable regression model yields the average income of each population group.
2. Model construction
2.1 Segment notation of the Lorenz curve
The Lorenz curve is generally defined as follows (Rohde, 2009):
$$\eta = f(\pi)$$
where $\pi$ and $\eta$ represent the cumulative percents of population and wealth, respectively. The Lorenz curve is bound to meet the following requirements:
$$\frac{d\eta}{d\pi} > 0, \quad \frac{d^2\eta}{d\pi^2} > 0, \quad \eta(0) = 0, \quad \eta(1) = 1$$
At first, the general Lorenz curve is divided into n segments, as shown in Figure 1, to demonstrate the income levels of populations in different income ranges.
Figure 1 Schematic diagram of the segmented Lorenz curve
Then, the income level is divided into n segments using dummy variables. In this way, the Lorenz curve can be written in the following approximate form:
$$y = D_1(\beta_1 x + \alpha_1) + D_2(\beta_2 x + \alpha_2) + \cdots + D_n(\beta_n x + \alpha_n)$$
Therefore,
$$\lim_{n \to \infty} y = f(\pi)$$
where x and y denote the cumulative percents of population and wealth, respectively. When $x \in \left[\frac{i-1}{n}, \frac{i}{n}\right]$, $D_i = 1$; otherwise, $D_i = 0$.
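The segmented model above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code from the supplementary material: the function names (`empirical_lorenz`, `segment_index`, `fit_dummy_regression`, `predict`) are hypothetical, the incomes are synthetic log-normal draws instead of CHIP data, and boundary points x = i/n are assigned to the upper segment by assumption. No continuity or endpoint constraints are imposed here.

```python
import numpy as np


def empirical_lorenz(incomes):
    """Cumulative population share x and cumulative income share y."""
    inc = np.sort(np.asarray(incomes, dtype=float))   # rank by income level
    x = np.arange(1, inc.size + 1) / inc.size
    y = np.cumsum(inc) / inc.sum()
    return x, y


def segment_index(x, n):
    """0-based index of the segment [(i-1)/n, i/n] that contains each x."""
    return np.minimum(np.floor(x * n).astype(int), n - 1)


def fit_dummy_regression(x, y, n):
    """Least-squares fit of y = sum_i D_i * (slope_i * x + intercept_i)."""
    seg = segment_index(x, n)
    D = np.eye(n)[seg]                          # N x n matrix of dummies D_i
    X = np.hstack([D * x[:, None], D])          # regressors D_i * x and D_i
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:n], coef[n:]                   # slopes, intercepts


def predict(x, slopes, intercepts):
    """Fitted cumulative income share for each x."""
    seg = segment_index(x, slopes.size)
    return slopes[seg] * x + intercepts[seg]


rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=0.0, sigma=0.8, size=5000)   # synthetic data
x, y = empirical_lorenz(incomes)
for n in (5, 10, 20):
    slopes, intercepts = fit_dummy_regression(x, y, n)
    mse = np.mean((y - predict(x, slopes, intercepts)) ** 2)
    print(n, mse)
```

Because the dummies are mutually exclusive, this single regression is equivalent to fitting a separate least-squares line inside every segment, which is why each slope can be read as a group-level quantity.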
The proof of the convexity of the model is given in Appendix D.
2.2 The economic connotations of coefficients
The segmented curve is then replaced with straight line segments, which endows the parameters with proper economic connotations. The slope coefficient $\beta_i$ of the model describes the income level of each population group. With other conditions fixed, every one percent increase in the population results in an increase of $\beta_i$ in the cumulative percent of wealth. That is, $\beta_i$ is the average income of the population in the group. Although the intercept $\alpha_i$ does not have a straightforward economic connotation, the average accumulation of wealth and the intra-group Gini coefficient of the group can be calculated in combination with $\beta_i$.
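2.3 Expression of Gini coefficient
By using the above dummy variable regression function, the expression of the Gini coefficient can be easily obtained:
$$S_1 = \int_0^1 y\,dx = \int_0^1 \sum_{i=1}^{n} (\beta_i D_i x + \alpha_i D_i)\,dx = \sum_{i=1}^{n} \int_{\frac{i-1}{n}}^{\frac{i}{n}} (\beta_i x + \alpha_i)\,dx = \sum_{i=1}^{n} \left[\frac{\beta_i (2i-1)}{2n^2} + \frac{\alpha_i}{n}\right]$$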
Meanwhile, the intra-group inequality coefficient of each group is also obtained:
$$G_i = \beta_i + \frac{2n\alpha_i}{2i-1}, \quad i = 1, 2, \ldots, n$$
The overall Gini coefficient can be expressed as follows:
$$G = 1 - 2S_1 = 1 - \sum_{i=1}^{n} \left[\frac{\beta_i (2i-1)}{n^2} + \frac{2\alpha_i}{n}\right]$$
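As a worked check of the expression G = 1 - 2S_1, the sketch below sums the area under each linear segment on [(i-1)/n, i/n], which is slope_i(2i-1)/(2n^2) + intercept_i/n, and converts it into a Gini coefficient. The function name `gini_from_coefficients` is hypothetical and the two-segment curve is an invented toy example, not an estimate from the paper.

```python
import numpy as np


def gini_from_coefficients(slopes, intercepts):
    """Gini coefficient implied by n fitted line segments of equal width 1/n."""
    slopes = np.asarray(slopes, dtype=float)
    intercepts = np.asarray(intercepts, dtype=float)
    n = slopes.size
    i = np.arange(1, n + 1)
    # Area under segment i: integral of (slope_i * x + intercept_i)
    # over [(i-1)/n, i/n].
    areas = slopes * (2 * i - 1) / (2 * n**2) + intercepts / n
    return 1.0 - 2.0 * areas.sum()


# Perfect equality: the Lorenz curve is y = x in every segment.
print(gini_from_coefficients([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]))

# Toy curve through (0, 0), (0.5, 0.25) and (1, 1).
print(gini_from_coefficients([0.5, 1.5], [0.0, -0.5]))
```

For the toy curve the area under the two segments is 0.0625 + 0.3125 = 0.375, so the implied Gini coefficient is 1 - 0.75 = 0.25.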
3. Empirical test based on income data of urban and rural residents in China
To verify the fitting precision of the proposed model, the micro data on urban and rural household income from 1995 to 2013 in the Chinese Household Income Project Survey (CHIP) database are used to compare the fitting results of different models. The grouping model is estimated with n = 5, 10, and 20 and compared with the following classical estimation models of the Lorenz curve:
Kakwani-Podder:
$$\eta(\pi) = \pi e^{-\delta(1-\pi)}, \quad \delta > 0$$
Chotikapanich:
$$\eta(\pi) = \frac{e^{k\pi} - 1}{e^{k} - 1}, \quad k > 0$$
Pareto:
$$\eta(\pi) = 1 - (1-\pi)^{1/\gamma}, \quad \gamma > 1$$
Nicholas Rohde (2009):
$$\eta(\pi) = \frac{\pi(\beta - 1)}{\beta - \pi}, \quad \beta > 1$$
Bi-parametric (Wang & Smyth, 2015):
( )
1 1 1 2 ,1 1, 2 1 1 2 1 1
where $\eta$ and $\pi$ represent the cumulative percent of wealth and population, respectively. Estimating the above models on the same dataset yields the parameter results in Table 1. The regression outputs of the grouping models are listed in Table A.1 in the Appendix. In addition, we report estimation results based on different segments of the data (i.e., n=5, n=10, and n=20), as shown in Tables B.1, B.2, and B.3 in the Appendix. The comparison between the quantile regression model and the new approach in this research is presented in Table C.1 in the Appendix.
Table 1 The regression output of different parametric models

| Sample | Kakwani-Podder δ | Chotikapanich K | Pareto 1/γ | Nicholas Rohde (2009) β | Bi-parametric (Wang & Smyth, 2015) ω1 | Bi-parametric (Wang & Smyth, 2015) ω2 |
| --- | --- | --- | --- | --- | --- | --- |
| 1995 | 1.6896*** (0.0087) | 2.4637*** (0.0033) | 0.5508*** (0.0021) | 1.4483*** (0.0005) | 0.7039*** (0.0013) | -0.0851*** (0.0067) |
| 1999 | 1.8835*** (0.0142) | 2.1759*** (0.0019) | 0.6037*** (0.0014) | 1.5256*** (0.0008) | 0.5494*** (0.0017) | 0.56222*** (0.012) |
| 2011-r | 1.8966*** (0.0060) | 2.6550*** (0.0030) | 0.5249*** (0.0013) | 1.3957*** (0.0005) | 0.7015*** (0.0011) | 0.0725*** (0.0078) |
| 2011-u | 1.5657*** (0.0068) | 2.2534*** (0.0029) | 0.5748*** (0.0014) | 1.5104*** (0.0006) | 0.6534*** (0.0016) | 0.0279*** (0.0082) |
| 2012-r | 1.8819*** (0.0060) | 2.6201*** (0.0028) | 0.5335*** (0.0013) | 1.4030*** (0.0005) | 0.6883*** (0.0011) | 0.1327*** (0.0080) |
| 2012-u | 1.5039*** (0.0066) | 2.1852*** (0.0028) | 0.5857*** (0.0015) | 1.5354*** (0.0006) | 0.6488*** (0.0016) | -0.0008*** (0.0076) |
| 2013-r | 1.8923*** (0.0058) | 2.6323*** (0.0028) | 0.5305*** (0.0013) | 1.4000*** (0.0005) | 0.6870*** (0.0016) | 0.1534*** (0.0084) |
| 2013-u | 1.4551*** (0.0062) | 2.1518*** (0.0028) | 0.5839*** (0.0012) | 1.5493*** (0.0006) | 0.6581*** (0.0016) | -0.0665*** (0.0070) |

Note: Standard errors are in parentheses. u represents the urban data in the CHIP database, and r represents the rural data; the 1995 and 1999 data do not distinguish between rural and urban.
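The single-parameter benchmarks can be written down compactly for experimentation. The sketch below assumes the commonly cited forms of the Kakwani-Podder, Chotikapanich, Pareto and Rohde curves; the bi-parametric model is left out. The helper names `mse` and `grid_fit` are hypothetical, and the grid search is only a crude stand-in for nonlinear least squares, not the estimation procedure used for Table 1.

```python
import numpy as np


def kakwani_podder(pi, delta):
    return pi * np.exp(-delta * (1.0 - pi))


def chotikapanich(pi, k):
    return (np.exp(k * pi) - 1.0) / (np.exp(k) - 1.0)


def pareto(pi, inv_gamma):
    # inv_gamma stands for 1/gamma, the quantity reported in Table 1.
    return 1.0 - (1.0 - pi) ** inv_gamma


def rohde(pi, beta):
    return pi * (beta - 1.0) / (beta - pi)


def mse(model, theta, pi, eta):
    """Mean square error of a one-parameter Lorenz model."""
    return float(np.mean((eta - model(pi, theta)) ** 2))


def grid_fit(model, grid, pi, eta):
    """Return the grid value with the smallest MSE."""
    errors = [mse(model, theta, pi, eta) for theta in grid]
    return float(grid[int(np.argmin(errors))])


pi = np.linspace(0.0, 1.0, 201)
eta = rohde(pi, 1.5)                                   # synthetic target curve
beta_hat = grid_fit(rohde, np.linspace(1.01, 3.0, 200), pi, eta)
print(beta_hat, mse(kakwani_podder, 1.5, pi, eta))
```

Each form passes through (0, 0) and (1, 1) by construction, so the comparison across models rests entirely on how well the single shape parameter bends the curve toward the data.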
It can be seen from Table 1 that the classical models also yield highly significant parameter estimates. However, with only one or two parameters, the classical models cannot explain the Lorenz curve intuitively. By contrast, the parameter sequences fitted with the grouping model show the income level of each group more intuitively. Combined with the group interval and the number of groups, the model is capable of analyzing the income structure. The significance of the parameter estimates discussed above depends greatly on the specificity of the data, so the precision of each model is the main basis for comparing model performance on the same data. Table 2 reports the precision of the grouping model for n = 5, 10, and 20 together with that of the classical models.
Table 2 Mean square errors of different models

| Models | Kakwani-Podder | Chotikapanich | Pareto | Nicholas Rohde | Bi-parametric | Dreg (n=5) | Dreg (n=10) | Dreg (n=20) | Sample size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1995 | 0.2718 | 0.0207 | 0.0963 | 0.0110 | 0.0136 | 0.0596 | 0.0074 | 0.0031 | 7998 |
| 1999 | 0.4875 | 0.0135 | 0.0978 | 0.0166 | 0.0123 | 0.0193 | 0.0051 | 0.0023 | 9063 |
| 2011-r | 0.2253 | 0.0207 | 0.0907 | 0.0121 | 0.0147 | 0.0610 | 0.0085 | 0.004 | 10248 |
| 2011-u | 0.1998 | 0.0172 | 0.0834 | 0.0116 | 0.0127 | 0.0564 | 0.0066 | 0.0014 | 6557 |
| 2012-r | 0.2236 | 0.0194 | 0.0948 | 0.0122 | 0.0140 | 0.0608 | 0.0079 | 0.0036 | 10295 |
| 2012-u | 0.1943 | 0.0167 | 0.0845 | 0.0113 | 0.0121 | 0.056 | 0.0061 | 0.0014 | 6580 |
| 2013-r | 0.2202 | 0.0195 | 0.0930 | 0.0124 | 0.0144 | 0.0606 | 0.0082 | 0.0038 | 10389 |
| 2013-u | 0.1823 | 0.0174 | 0.0770 | 0.0111 | 0.0122 | 0.0556 | 0.0064 | 0.0029 | 6637 |
When n is set to 5, 10, and 20, the results of the dummy variable regression change as shown in Table 2. The parameter estimates of the three grouping models are given in Appendix Table A.2. Provided that the sample size is large enough, the precision of the model increases with the number of groups. The mean square errors (MSEs) of the three single-parameter models (i.e., Kakwani-Podder, Chotikapanich, and Pareto) are significantly larger than those of the other models. The single-parameter model proposed by Rohde (2009) differs only slightly from the bi-parametric model of Wang and Smyth (2015) in fitting effect, and its precision is higher than that of the dummy variable regression model with n=5. Since the data sets used in this paper have large sample sizes (see Table 2), grouping models with n=10 and n=20 are also estimated. When n=10 and n=20, the MSEs of the grouping model are smaller than those of the above two models. With a suitable choice of the number of groups, the grouping model greatly improves the fitting precision while endowing the parameters with economic connotations.
4. Conclusions
Parameters of the new model are endowed with clear economic connotations. The classical models and other models of the Lorenz curve proposed in recent years involve few parameters. With such a limited number of parameters, these parametric models are not able to describe the features of the Lorenz curve. Even though the parameters reflect the degree of bending of the Lorenz curve, they function similarly to the Gini coefficient: they merely present the phenomenon of inequality and its overall degree. The slope parameter obtained by the regression of the grouping model is the cumulative income–population resilience of the corresponding residents, which mirrors their income level to some extent. Therefore, the parameter has a clear economic connotation that is conducive to economic analysis.
The new model has high precision and strong adaptability.
For the grouping model, in which the number of groups and the group interval can be adjusted, the only factor limiting the grouping is the sample size. When the sample size is too small, a high fitting precision obtained by the grouping model is meaningless; the same holds for the classical models. As long as the sample size is sufficient, the grouping can be adjusted according to the sample size. When estimating Lorenz curves through dummy variable regression, researchers can refer to the code shared in the supplementary material. Adjusting the number of groups guarantees that the model has enough precision, while adjusting the group interval ensures that the information in the regression results is not too coarse and also saves degrees of freedom. In comparison, other parametric models basically have fixed forms, and only the values of the parameters can be adjusted. For special income data, different parametric models also perform differently.
References
Chotikapanich, D., 1993. A comparison of alternative functional forms for the Lorenz curve. Economics Letters 41, 21–29.
Chotikapanich, D., Griffiths, B., Rao, D. S., 2007. Estimating and combining national income distributions using limited data. Journal of Business and Economic Statistics 25, 97–109.
Gastwirth, J., Glauberman, M., 1976. The interpolation of the Lorenz curve and Gini index from grouped data. Econometrica 44, 479–483.
Kakwani, N. C., Podder, N., 1976. Efficient estimation of the Lorenz curve and associated inequality measures from grouped observations. Econometrica 44, 137–148.
Kakwani, N. C., Podder, N., 1973. On the estimation of Lorenz curves from grouped observations. International Economic Review 14, 278–292.
Prendergast, L. A., Staudte, R. G., 2016. Quantile versions of the Lorenz Curve. Electronic Journal of Statistics 10, 1896–1926.
Rohde, N., 2009. An alternative functional form for estimating the Lorenz curves. Economics Letters 105, 61–63.
, A., 2017. Yearly, monthly and weekly seasonality of tourism demand: A decomposition analysis. Tourism Management 60, 379–389.
Sarabia, J. M., Prieto, F., Jordá, V., 2015. About the hyperbolic Lorenz curve. Economics Letters 136, 42–45.
Sarabia, J. M., Jordá, V., 2014. Explicit expressions of the Pietra index for the generalized function for the size distribution of income. Physica A 416, 582–595.
Sarabia, J. M., Castillo, E., Slottje, D. J., 1999. An ordered family of Lorenz curves. Journal of Econometrics 91, 43–60.
Wang, Z. X., Smyth, R., 2015. A piecewise method for estimating the Lorenz curves. Economics Letters 129, 45–48.
Wang, Z. X., Smyth, R., 2015. A hybrid method for creating Lorenz curves. Economics Letters 133, 59–63.
Appendix A. Results obtained with dummy regression
To examine in detail how the number of groups n influences the estimation precision of the Lorenz curves with dummy variable regression, the estimation results of the dummy variable regression with different n for the 1995 CHIP data are listed in Table A.1. All of the regression coefficients are significant at the 0.01 level. The result indicates that, for n=5, 10, and 20, different groups differ significantly in income. The results of the three tests verify that the larger the number of groups n, the higher the precision of the model. However, this does not mean that the number of groups should be as large as possible in practical applications. Selecting a larger number of groups reduces the degrees of freedom of parameter estimation and may therefore enlarge the estimation error. For the linear model,
$$y = \sum_{i=1}^{n} (\beta_i D_i x + \alpha_i D_i)$$
Suppose that the sample size is N; then, with 2n estimated parameters, the degrees of freedom of the model are $N - 2n$ in the case that the income level is divided into n segments.
Obviously, the model consumes more degrees of freedom than the standard linear model. Therefore, a sufficiently large sample size is needed when selecting the number of groups, so as to ensure enough degrees of freedom and to prevent an excessive number of groups from affecting the model precision. In practical applications, researchers need to weigh the gain in precision from a larger number of groups against the loss of degrees of freedom, so as to select the number of groups reasonably.
upp m t f r
the regression coefficient
cti
2 1 “Th
c
mic c
t ti
f c ffici t ”
i represents the average slope of the Lorenz curves on a
certain number of groups. Therefore, it reflects the economic connotation the response extent of cumulative percentage of income to the variation of cumulative percentage of population, that is, the cumulative income–population resilience. Table A.1 Dummy variable regression with different n by 1995 CHIP data Model 1 (n=5)
Model 2 (n=10)
Model 3 (n=20)
(MSE=0.00766)
(MSE=0.00333)
(MES=0.00142)
1 2
3 4
5 6 7
8 9
10 11
12
13 14
15 16
17
0.2990***
0.2172***
0.1658***
(0.0013)
(0.0012)
(0.0012)
***
***
0.2646***
(0.0010)
(0.0007)
(0.0007)
***
***
0.3416***
(0.0011)
(0.0007)
(0.0005)
***
***
0.4057***
(0.0018)
(0.0006)
(0.0005)
***
***
0.4696***
(0.0007)
(0.0006)
***
0.5333***
(0.0008)
(0.0005)
***
0.5915***
(0.0010)
(0.0005)
***
0.6477***
(0.0015)
(0.0004)
***
0.7048***
(0.0027)
(0.0005)
2.5803***
0.7648***
(0.0169)
(0.0005)
0.5617
0.7993
1.1280
1.9934
(0.0096)
0.3736
0.5024
0.6198
0.7344
0.8659
1.0224
1.2408
1.5830
0.8304*** (0.0005) 0.9014*** (0.0006) 0.9771*** (0.0007) 1.0681*** (0.0008) 1.1780*** (0.0009) 1.3059*** (0.0011) 1.4694*** (0.0016)
1.7031***
18
(0.0021) 2.0653***
19
(0.0043) 3.3151***
20
(0.0291)
Note: the data in parentheses are robust standard errors.
Based on the large-sample CHIP data, the number of groups n is set to 20 to further estimate the Lorenz curves of rural and urban residents in China from 2011 to 2013. The coefficients of the dummy variable regressions are displayed in Table A.2. As shown in the table, residents with a larger group number, that is, rich residents, have larger coefficients, while residents with a smaller group number, that is, poor residents, have smaller coefficients. In addition, urban residents have larger coefficients than rural residents in the same year. The results imply that a population change in the high-income group influences income more than one in the low-income group, and that a population change in urban areas influences income more than one in rural areas.
Table A.2 The coefficients of dummy variable regressions by 2011-2013 CHIP data
1 2
3 4
5 6 7
2011-r
2012-r
2013-r
2011-u
2012-u
2013-u
0.1169***
0.1186***
0.1190***
0.3040***
0.3171***
0.2044***
(0.0009)
(0.0009)
(0.0009)
(0.0015)
(0.0016)
(0.0018)
***
***
***
***
***
0.3252***
0.2208
0.2219
0.2150
(0.0006)
(0.0006)
(0.0006)
(0.0006)
(0.0009)
(0.0007)
***
***
***
***
***
0.3993***
0.2860
0.4923
0.2955
0.2960
(0.0004)
(0.0005)
(0.0006)
(0.0008)
(0.0006)
(0.0005)
***
***
***
***
***
0.4690***
0.3663
0.6156
0.5003
0.3691
0.3596
(0.0008)
(0.0004)
(0.0006)
(0.0009)
(0.0008)
(0.0005)
***
***
***
***
***
0.5239***
0.4320
0.6887
0.6023
0.4435
0.4354
(0.0005)
(0.0005)
(0.0003)
(0.0008)
(0.0007)
(0.0003)
***
***
***
***
***
0.5862***
0.5072
0.7826
0.7031
0.4991
0.4933
(0.0005)
(0.0000)
(0.0006)
(0.0008)
(0.0009)
(0.0005)
***
***
***
***
***
0.6449***
0.5517
0.5545
0.5592
0.8322
0.7555
0.9435
0.8531
0.9147
8 9
10 11
12
13 14
15
16 17 18
19 20 gini
(0.0000)
(0.0006)
***
***
0.6030
(0.0004 ***
(0.0010)
(0.0009)
(0.0003)
***
***
0.6853***
0.6262
0.6392
(0.0007)
(0.0004)
(0.0005)
(0.0011)
(0.0007)
(0.0007)
***
***
***
***
***
0.7527***
0.6881
1.0025
0.7042
0.6730
(0.0003)
(0.0006)
(0.0005)
(0.0009)
(0.0010)
(0.0007)
***
***
***
***
***
0.7891***
0.7663
1.1150
1.0184
0.7491
0.7600
(0.0005)
(0.0000)
(0.0004)
(0.0014)
(0.0001)
(0.0004)
***
***
***
***
***
0.8670***
0.8275
1.1795
1.0961
0.8236
0.8381
(0.0000)
(0.0009)
(0.0008)
(0.0003)
(0.0013)
(0.0008)
***
***
***
***
***
0.9292***
0.8817
1.2958
1.1769
0.9069
0.8916
(0.0011)
(0.0008)
(0.0005)
(0.0016)
(0.0011)
(0.0007)
***
***
***
***
***
1.0253***
0.9887
1.3687
1.2575
0.9978
1.004
(0.0008)
(0.0001)
(0.0007)
(0.0009)
(0.0006)
(0.0007)
***
***
***
***
***
1.0912***
1.0980
1.4721
1.3439
1.0887
1.0985
(0.0004)
(0.0009)
(0.0002)
(0.0008)
(0.0023)
(0.0009)
***
***
***
***
***
1.1869***
1.1756
1.6139
1.4641
1.2203
1.2020
(0.0014)
(0.0009)
(0.0011)
(0.0029)
(0.0023)
(0.0009)
***
***
***
***
***
1.2966***
1.3457
1.6959
1.5563
1.3143
1.3268
(0.0011)
(0.0015)
(0.0005)
(0.0025)
(0.0022)
(0.0004)
***
***
***
***
***
1.4429***
1.4892
1.9175
1.7302
1.5056
1.5118
(0.0023)
(0.0010)
(0.0015)
(0.0050)
(0.0033)
(0.0017)
***
***
***
***
***
1.6273***
1.7399
2.1912
1.8996
1.7635
1.7553
(0.0025)
(0.0019)
(0.0015)
(0.0072)
(0.0064)
(0.0020)
***
***
***
***
***
1.9420***
2.1328
2.5897
2.1307
2.1092
2.0957
(0.0033)
(0.0032)
(0.0030)
(0.0114)
(0.0136)
(0.0035)
***
***
***
***
***
2.8690***
3.2994
3.1784
2.5020
5.1135
3.0763
3.2677
3.2672
(0.0267)
(0.0253)
(0.0263)
(0.1420)
(0.0982)
(0.0252)
0.4053
0.4010
0.4027
0.4363
0.4203
0.3402
Note: the data in parentheses are robust standard errors.
4.7388
Appendix B. Regression results with different models at different segments of the population
We supplement the regression results for different segments of the population when n=5, 10, and 20 in the 1995 CHIP data, based on the typical models and the dummy variable regression model, as displayed in Tables B.1, B.2, and B.3. The typical models in nonlinear form exhibit better goodness of fit than the dummy variable model in the majority of segments. However, the parameters of the typical models are mainly designed to measure the whole distribution, and these models focus primarily on the overall properties of the Lorenz curve, so the fitting results of some segments after segmentation show problems such as discontinuous points and deviations. We therefore conclude that these models may not be applicable to fitting in the segmentation case. The classical models adopted in the research, such as the Kakwani-Podder, Chotikapanich, Pareto, and Nicholas Rohde models, are continuous and pass through the fixed points (0, 0) and (1, 1). The Lorenz curves obtained by dividing the samples into several segments and regressing them separately necessarily differ, because the parameters of each are the result of pursuing a local optimum. Linking these different Lorenz curves into an integral whole is bound to produce a single curve that contains discontinuous points and deviates from the overall curve.
Table B.1 The MSEs of different models when n=5

| n=5 | Kakwani-Podder | Chotikapanich | Pareto | Nicholas Rohde | Bi-parametric | Dreg |
| --- | --- | --- | --- | --- | --- | --- |
| G1 | 0.5340 | 0.0018 | 0.0146 | 0.0035 | 0.0004 | 0.0039 |
| G2 | 0.0076 | 0.0028 | 0.0028 | 0.0007 | 0.0005 | 0.0018 |
| G3 | 0.0020 | 0.0040 | 0.0021 | 8.3e-5 | 0.0006 | 0.0020 |
| G4 | 0.0005 | 0.0050 | 0.0028 | 0.0002 | 0.0012 | 0.0033 |
| G5 | 0.0124 | 0.0119 | 0.0439 | 0.0045 | 0.0064 | 0.0164 |
Table B.2 The MSEs of different models when n=10

| n=10 | Kakwani-Podder | Chotikapanich | Pareto | Nicholas Rohde | Bi-parametric | Dreg |
| --- | --- | --- | --- | --- | --- | --- |
| G1 | 0.6778 | 0.0003 | 0.0069 | 0.0013 | 0.0002 | 0.0013 |
| G2 | 0.0073 | 0.0004 | 0.0012 | 0.0003 | 0.0001 | 0.0005 |
| G3 | 0.0023 | 0.0006 | 0.0009 | 0.0003 | 0.0001 | 0.0005 |
| G4 | 0.0014 | 0.0008 | 0.0006 | 0.0001 | 0.0001 | 0.0004 |
| G5 | 0.0007 | 0.0009 | 0.0005 | 3.4e-5 | 0.0002 | 0.0004 |
| G6 | 0.0004 | 0.0011 | 0.0005 | 1.2e-5 | 0.0002 | 0.0005 |
| G7 | 3.6e-6 | 0.0012 | 0.0006 | 0.0001 | 0.0002 | 0.0007 |
| G8 | 0.0002 | 0.0013 | 0.0007 | 4.0e-5 | 0.0004 | 0.0010 |
| G9 | 0.0009 | 0.0016 | 0.0013 | 0.0003 | 0.0008 | 0.0017 |
| G10 | 0.0088 | 0.0067 | 0.0457 | 0.0031 | 0.0038 | 0.0102 |
Table B.3 The MSEs of different models when n=20

| n=20 | Kakwani-Podder | Chotikapanich | Pareto | Nicholas Rohde | Bi-parametric | Dreg |
| --- | --- | --- | --- | --- | --- | --- |
| G1 | 0.8664 | 5.4e-5 | 0.0037 | 0.0006 | 1.09e-4 | 0.0005 |
| G2 | 0.0054 | 6.7e-5 | 0.0006 | 0.0002 | 2.78e-5 | 0.0002 |
| G3 | 0.0027 | 9.1e-5 | 0.0003 | 9.3e-5 | 3.99e-5 | 0.0001 |
| G4 | 0.0012 | 0.0001 | 0.0003 | 7.9e-5 | 2.79e-5 | 0.0001 |
| G5 | 0.0005 | 0.0001 | 0.0003 | 9.2e-5 | 7.35e-6 | 0.0001 |
| G6 | 0.0006 | 0.0002 | 0.0002 | 4.2e-5 | 3.45e-5 | 0.0001 |
| G7 | 0.0004 | 0.0002 | 0.0002 | 3.6e-5 | 3.12e-5 | 0.0001 |
| G8 | 0.0003 | 0.0002 | 0.0001 | 1.3e-5 | 4.12e-5 | 0.0001 |
| G9 | 0.0002 | 0.0002 | 0.0001 | 1.4e-5 | 3.56e-5 | 0.0001 |
| G10 | 0.0001 | 0.0003 | 0.0002 | 1.9e-5 | 2.85e-5 | 0.0001 |
| G11 | 0.0001 | 0.0003 | 0.0001 | 3.6e-5 | 4.47e-5 | 0.0001 |
| G12 | 8.8e-5 | 0.0003 | 0.0001 | 1.1e-5 | 5.33e-5 | 0.0001 |
| G13 | 1.5e-5 | 0.0003 | 0.0002 | 3.4e-5 | 4.61e-5 | 0.0002 |
| G14 | 1.9e-5 | 0.0003 | 0.0002 | 2.1e-5 | 6.16e-5 | 0.0002 |
| G15 | 4.4e-5 | 0.0003 | 0.0002 | 1.0e-5 | 8.12e-5 | 0.0002 |
| G16 | 7.1e-5 | 0.0003 | 0.0002 | 1.1e-5 | 1.212e-4 | 0.0003 |
| G17 | 0.0002 | 0.0004 | 0.0003 | 5.0e-5 | 1.565e-4 | 0.0004 |
| G18 | 0.0002 | 0.0004 | 0.0001 | 5.3e-5 | 0.000325 | 0.0005 |
| G19 | 0.0007 | 0.0007 | 0.0006 | 0.0003 | 5.958e-4 | 0.001 |
| G20 | 0.0056 | 0.0039 | 0.0457 | 0.0017 | 5958e-4 | 0.0062 |
Appendix C. Comparison between quantile and dummy variable regressions
Compared with the standard regression model, the quantile regression model gives a more comprehensive view of the conditional distribution of the explained variable, rather than merely analyzing its conditional mean. The estimated regression coefficient generally differs across quantiles, that is, the explanatory variable has different influences on the explained variable at different levels. In this respect the result is similar to the dummy variable regression proposed in this research. The comparison between the quantile and dummy variable regressions for the 1995 CHIP data is presented in Table C.1. As can be seen from Table C.1, the coefficients obtained by the quantile regression and the dummy variable regression are both ratios of the cumulative percentage of income to the cumulative percentage of population, but the results differ greatly. The difference is probably attributable to the following: (1) The quantile regression uses all the data before the quantile when regressing at each quantile, while the dummy variable regression only involves the data of the current segment. (2) The quantile regression takes the minimization of the absolute deviation as the objective function, while the dummy variable regression is still based on the least squares method. Like standard ordinary least squares, the quantile regression cannot force the Lorenz curve to pass through the origin and the point (1, 1). By adding constraints, this research guarantees that the dummy variable regression satisfies the above requirements of the Lorenz curve. The detailed processing approach is as follows: Considering that the Lorenz curve is actually a function of wealth distribution, it necessarily holds that
. To guarantee that the Lorenz curve passing through the
origin and the point (1, 1), parameters of the model need to meet the conditions that
=0 and
. The fitting function can be written as follows after adding the two constraints:
Then, the above function is converted to the following fixed-point regression model:
Table C.1 The results of quantile regression and dummy variable regression by 1995 CHIP data
Dependent variable: cumulative income. For each quantile, the first four columns give the quantile regression coefficient on cumulative population and the constant, each with its bootstrap standard error. The last column lists the dummy variable regression coefficients of groups 1 to 20.

| Quantile | Coef. cumulative population (quantile reg) | Bootstrap Std. Err. | Cons | Bootstrap Std. Err. (cons) | Pseudo R2 | Coef. (dummy reg) [R2=1.000] |
| --- | --- | --- | --- | --- | --- | --- |
| q05 | 0.7981 | 0.0050 | -0.1546 | 0.0025 | 0.7864 | 0.1658 |
| q10 | 0.7980 | 0.0049 | -0.1533 | 0.0025 | 0.7839 | 0.2646 |
| q15 | 0.7992 | 0.0047 | -0.1519 | 0.0024 | 0.7812 | 0.3416 |
| q20 | 0.8006 | 0.0047 | -0.1497 | 0.0024 | 0.7784 | 0.4057 |
| q25 | 0.8022 | 0.0046 | -0.1468 | 0.0024 | 0.7754 | 0.4696 |
| q30 | 0.8046 | 0.0045 | -0.1434 | 0.0023 | 0.7723 | 0.5333 |
| q35 | 0.8076 | 0.0042 | -0.1394 | 0.0021 | 0.7692 | 0.5915 |
| q40 | 0.8110 | 0.0040 | -0.1347 | 0.0019 | 0.7659 | 0.6477 |
| q45 | 0.8152 | 0.0036 | -0.1293 | 0.0019 | 0.7627 | 0.7048 |
| q50 | 0.8200 | 0.0035 | -0.1231 | 0.0016 | 0.7595 | 0.7648 |
| q55 | 0.8255 | 0.0033 | -0.1160 | 0.0017 | 0.7563 | 0.8304 |
| q60 | 0.8314 | 0.0033 | -0.1078 | 0.0019 | 0.7531 | 0.9014 |
| q65 | 0.8384 | 0.0040 | -0.0988 | 0.0023 | 0.7499 | 0.9771 |
| q70 | 0.8467 | 0.0044 | -0.0888 | 0.0025 | 0.7468 | 1.0681 |
| q75 | 0.8568 | 0.0050 | -0.0779 | 0.0026 | 0.7437 | 1.1780 |
| q80 | 0.8687 | 0.0046 | -0.0658 | 0.0021 | 0.7406 | 1.3059 |
| q85 | 0.8830 | 0.0043 | -0.0523 | 0.0021 | 0.7373 | 1.4694 |
| q90 | 0.9020 | 0.0048 | -0.0372 | 0.0016 | 0.7337 | 1.7031 |
| q95 | 0.9309 | 0.0043 | -0.0203 | 0.0014 | 0.7294 | 2.0653 |
| q99 | 0.9737 | 0.0034 | -0.0047 | 0.0005 | 0.7251 | 3.3151 |
Appendix D. Proof of the convexity of the model
The Lorenz curve increases gently at first and then grows sharply, that is, it is a necessary property for the curve to have positive first-order and second-order derivatives. Likewise, the model also needs to meet this property. The income $I_k$ of each individual can be regarded as the increment of the Lorenz curve along the vertical axis when the kth person is added. After ranking the income data of residents according to income level, it is obtained that
$$I_k \le I_{k+1}$$
which means that the cumulative curve of wealth grows more and more rapidly, that is, the Lorenz curve shows the property of a convex function. The model converts the original continuous curve into a piecewise linear function, and the strict convexity of the original function is weakened to nonstrict convexity in the model. The convexity of the model is proved as follows, writing $f(x) = \beta_i x + \alpha_i$ for x in the ith segment. For arbitrary $x_1$ and $x_2$, consider the following two cases:
1) When $x_1, x_2 \in \left[\frac{i-1}{n}, \frac{i}{n}\right]$, for arbitrary 0 < p < 1, there is
$$f[px_1 + (1-p)x_2] = \beta_i [px_1 + (1-p)x_2] + \alpha_i = pf(x_1) + (1-p)f(x_2)$$
That is,
$$f[px_1 + (1-p)x_2] \le pf(x_1) + (1-p)f(x_2)$$
2) When $x_1 \in \left[\frac{k-1}{n}, \frac{k}{n}\right]$ and $x_2 \in \left[\frac{k}{n}, \frac{k+1}{n}\right]$, for arbitrary 0 < p < 1, consider the case in which
$$px_1 + (1-p)x_2 \in \left[\frac{k-1}{n}, \frac{k}{n}\right]$$
Therefore,
$$f[px_1 + (1-p)x_2] = \beta_k [px_1 + (1-p)x_2] + \alpha_k$$
$$pf(x_1) + (1-p)f(x_2) = p(\beta_k x_1 + \alpha_k) + (1-p)(\beta_{k+1} x_2 + \alpha_{k+1})$$
Then, there is
$$f[px_1 + (1-p)x_2] - [pf(x_1) + (1-p)f(x_2)] = (1-p)[(\beta_k - \beta_{k+1})x_2 + (\alpha_k - \alpha_{k+1})] \le 0$$
where the inequality holds because $\beta_k \le \beta_{k+1}$ and the two adjacent segments meet at $x = \frac{k}{n}$, so the kth line lies on or below the (k+1)th line for $x_2 \ge \frac{k}{n}$. That is,
$$f[px_1 + (1-p)x_2] \le pf(x_1) + (1-p)f(x_2)$$
The case in which the combination falls in the (k+1)th segment is symmetric, and the result is proved in a similar way when $x_1$ and $x_2$ lie in other different groups.
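The two ingredients used in the proof, nondecreasing slopes and adjacent segments that meet at the knots x = k/n, can be checked numerically for any set of fitted coefficients. The sketch below is only an illustration: the function name `is_convex_piecewise` and the tolerance are assumptions, and the coefficient vectors are invented toy values rather than estimates from the tables.

```python
import numpy as np


def is_convex_piecewise(slopes, intercepts, tol=1e-9):
    """Check convexity of y = slope_i * x + intercept_i on [(i-1)/n, i/n].

    A piecewise linear function on equal-width segments is convex when it is
    continuous at every interior knot and its slopes are nondecreasing.
    """
    slopes = np.asarray(slopes, dtype=float)
    intercepts = np.asarray(intercepts, dtype=float)
    n = slopes.size
    knots = np.arange(1, n) / n                       # interior knots k/n
    left = slopes[:-1] * knots + intercepts[:-1]      # segment k evaluated at k/n
    right = slopes[1:] * knots + intercepts[1:]       # segment k+1 evaluated at k/n
    continuous = np.all(np.abs(left - right) <= tol)
    nondecreasing = np.all(np.diff(slopes) >= -tol)
    return bool(continuous and nondecreasing)


print(is_convex_piecewise([0.5, 1.5], [0.0, -0.5]))   # joined, slopes rising
print(is_convex_piecewise([1.5, 0.5], [0.0, 0.5]))    # joined, slopes falling
print(is_convex_piecewise([0.5, 1.5], [0.0, -0.4]))   # jump at x = 0.5
```

A segment-by-segment least-squares fit without constraints does not enforce continuity at the knots, so a check of this kind is a reasonable safeguard before the fitted coefficients are interpreted as a Lorenz curve.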