Estimation of Lorenz curves based on dummy variable regression


Accepted Manuscript Estimation of Lorenz curves based on dummy variable regression Zheng-Xin Wang, Hai-Lun Zhang, Hong-Hao Zheng

PII: S0165-1765(19)30029-1
DOI: https://doi.org/10.1016/j.econlet.2019.01.021
Reference: ECOLET 8348

To appear in: Economics Letters

Received date : 3 June 2018 Revised date : 16 January 2019 Accepted date : 22 January 2019 Please cite this article as: Z.-X. Wang, H.-L. Zhang and H.-H. Zheng, Estimation of Lorenz curves based on dummy variable regression. Economics Letters (2019), https://doi.org/10.1016/j.econlet.2019.01.021 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• We propose an estimation method of the Lorenz curve based on dummy variable regression.
• We grant economic connotations to parameters in the estimation of the Lorenz curve.
• The method has advantages of higher fitting precision and stronger adaptability.

Zheng-Xin Wang a,b,c, Hai-Lun Zhang a, Hong-Hao Zheng a

a. School of Economics, Zhejiang University of Finance & Economics, Hangzhou 310018, China
b. Center for Research of Regulation & Policy, Zhejiang University of Finance & Economics, Hangzhou 310018, China
c. Center for Regional Economy & Integrated Development, Zhejiang University of Finance & Economics, Hangzhou 310018, China

E-mail: [email protected] (Zheng-Xin Wang); [email protected] (Hai-Lun Zhang); [email protected] (Hong-Hao Zheng, Corresponding author).

Abstract: We propose a new estimation method of the Lorenz curve based on dummy variable regression for granting clear economic connotations to parameters in the estimation of the Lorenz curve. The dummy variable regression model of the Lorenz curve is established by introducing dummy variables to group data about income level. On this basis, it is proved that the regression function is endowed with the convexity that the Lorenz curve should have. To verify the effectiveness of the new method, an empirical study is carried out taking the income data of urban and rural residents in the Chinese Household Income Project Survey (CHIP) database as examples. In the empirical study, the proposed model is compared with several existing classical parameter models. The results indicate that the slope parameter of the model represents the cumulative income–population resilience of corresponding residents, and reflects the income structure of residents to some extent. Meanwhile, compared with classical parameter models, the newly proposed method has advantages of higher fitting precision and stronger adaptability.

Keywords: Income distribution, Lorenz curve, Dummy variable regression

JEL codes: C80, D31


1. Introduction

The Lorenz curve, the basic tool for studying income inequality, is mainly described through the interpolation of grouped data or fitted using specific parametric models. Kakwani and Podder (1973), Villasenor and Arnold (1989), and Chotikapanich (1993) proposed single-parameter models for fitting the Lorenz curve. Sarabia et al. (2015) revealed that the bi-parametric estimation method of the Lorenz curve proposed by Wang and Smyth (2015) is essentially a bi-parametrization of the hyperbolic Lorenz curve (Arnold, 1986). Prendergast and Staudte (2016) introduced a quantile ratio index to replace the ratio part of the original Lorenz curve $L(p)$, which is the product of $p$ and a ratio term. By introducing more parameters, previous studies greatly improved the fitting precision of the Lorenz curve. However, parameters obtained with either single- or multi-parameter models lack clear economic connotations. Furthermore, the precision and reliability of the above-mentioned models are relatively low.

In this study, the Lorenz curve is divided into segments using dummy variables and fitted with a multivariate linear regression model, which improves the fitting precision while guaranteeing that the parameters in the model have clear economic connotations. Unlike Prendergast and Staudte (2016), this study constructs a dummy variable regression model based on the idea of differentiation, which rests mainly on the original definition of the Lorenz curve. The two approaches differ as follows: the quantile regression model yields the wealth distribution at each quantile of the population, whereas the dummy variable regression model yields the average income of each population group.

2. Model construction

2.1 Segment notation of the Lorenz curve

The Lorenz curve is generally defined as follows (Rohde, 2009):

$$\eta = f(\pi)$$

where $\pi$ and $\eta$ represent the cumulative percentages of population and wealth, respectively. The Lorenz curve must satisfy the following requirements:

$$\frac{d\eta}{d\pi} \geq 0, \quad \frac{d^2\eta}{d\pi^2} \geq 0, \quad \eta(0) = 0, \quad \eta(1) = 1$$

First, the general Lorenz curve is divided into n segments, as shown in Figure 1, to show the income levels of populations in different income ranges.

Figure 1 Schematic diagram of the segmented Lorenz curve

Then, the income level is divided into n segments using dummy variables. In this way, the Lorenz curve can be written in the following approximate form:

$$y = D_1(\beta_1 x + \alpha_1) + D_2(\beta_2 x + \alpha_2) + \cdots + D_n(\beta_n x + \alpha_n)$$

Therefore,

$$\lim_{n \to \infty} y = f(\pi)$$

where x and y denote the cumulative percentages of population and wealth, respectively. When $x \in \left[\frac{i-1}{n}, \frac{i}{n}\right]$, $D_i = 1$; otherwise, $D_i = 0$.

The proof of the convexity of the model is given in Appendix D.

2.2 The economic connotations of coefficients

Replacing the segmented curve with straight line segments endows the parameters with economic connotations. The coefficient $\beta_i$ of the model describes the income level of each population group. Other conditions being fixed, every one percent increase in the population results in an increment of $\beta_i$ percent in the cumulative percentage of wealth. That is, $\beta_i$ is the average income of the population in the group. Although $\alpha_i$ does not have a straightforward economic connotation, the average accumulation of wealth and the intra-group Gini coefficient of the group can be calculated in combination with $\beta_i$.
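The segment-wise model can be illustrated with a minimal Python sketch. It is not the authors' supplementary code; the function names `lorenz_points` and `fit_segments`, the equal-width segments, and the rule for assigning boundary points are assumptions made for illustration. Because the model contains a full set of segment dummies and dummy-slope interactions with no common intercept, pooled least squares is equivalent to fitting a separate least-squares line inside each segment, which is what the sketch does.

```python
def lorenz_points(incomes):
    """Cumulative population and income shares (x, y) from raw incomes."""
    ordered = sorted(incomes)
    total = float(sum(ordered))
    count = len(ordered)
    xs, ys, running = [], [], 0.0
    for k, income in enumerate(ordered, start=1):
        running += income
        xs.append(k / count)
        ys.append(running / total)
    return xs, ys


def fit_segments(xs, ys, n):
    """Least-squares line (beta_i, alpha_i) inside each of n equal-width segments."""
    buckets = [([], []) for _ in range(n)]
    for x, y in zip(xs, ys):
        # Assumed convention: a boundary point x = i/n is assigned to the
        # segment that starts there; x = 1 falls into the last segment.
        idx = min(n - 1, int(x * n))
        buckets[idx][0].append(x)
        buckets[idx][1].append(y)
    betas, alphas = [], []
    for seg_x, seg_y in buckets:
        if len(set(seg_x)) < 2:
            raise ValueError("each segment needs at least two distinct x values")
        mean_x = sum(seg_x) / len(seg_x)
        mean_y = sum(seg_y) / len(seg_y)
        sxx = sum((x - mean_x) ** 2 for x in seg_x)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(seg_x, seg_y))
        beta = sxy / sxx  # slope: response of wealth share to population share
        betas.append(beta)
        alphas.append(mean_y - beta * mean_x)  # intercept of the segment line
    return betas, alphas


if __name__ == "__main__":
    xs, ys = lorenz_points(range(1, 1001))  # synthetic incomes 1..1000
    betas, alphas = fit_segments(xs, ys, n=5)
    print([round(b, 3) for b in betas])
```

On the synthetic incomes the slopes rise from the poorest to the richest segment, which is the pattern the text attributes to the slope coefficients.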

2.3 Expression of Gini coefficient

By using the above dummy variable regression function, the expression of the Gini coefficient can be easily obtained:

$$S_1 = \int_0^1 y\,dx = \int_0^1 \sum_{i=1}^{n} (\beta_i D_i x + \alpha_i D_i)\,dx = \sum_{i=1}^{n} \int_{\frac{i-1}{n}}^{\frac{i}{n}} (\beta_i x + \alpha_i)\,dx = \sum_{i=1}^{n} \left[ \frac{\beta_i (2i-1)}{2n^2} + \frac{\alpha_i}{n} \right]$$

Meanwhile, the intra-group inequality coefficient of each group is also obtained:

$$G_i = \beta_i + \frac{2n\alpha_i}{2i-1}, \quad i = 1, 2, \ldots, n$$

The overall Gini coefficient can be expressed as follows:

$$G = 1 - 2S_1 = 1 - \sum_{i=1}^{n} \frac{\beta_i (2i-1)}{n^2} - \frac{2}{n} \sum_{i=1}^{n} \alpha_i, \quad i = 1, 2, \ldots, n$$
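As a small worked check of the Gini expression, the following Python sketch computes the area under the fitted piecewise-linear curve segment by segment and returns one minus twice that area. It assumes n equal-width segments with slopes β_i and intercepts α_i; the function name `gini_from_segments` is hypothetical.

```python
def gini_from_segments(betas, alphas):
    """Gini coefficient implied by segment slopes (betas) and intercepts (alphas).

    Integrating beta_i * x + alpha_i over [(i-1)/n, i/n] gives
    beta_i * (2i - 1) / (2 n^2) + alpha_i / n; the Gini is 1 - 2 * (total area).
    """
    if len(betas) != len(alphas) or not betas:
        raise ValueError("betas and alphas must be non-empty and of equal length")
    n = len(betas)
    area = 0.0
    for i, (beta, alpha) in enumerate(zip(betas, alphas), start=1):
        area += beta * (2 * i - 1) / (2 * n * n) + alpha / n
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    # Line of perfect equality: every slope is 1 and every intercept is 0.
    print(gini_from_segments([1.0] * 4, [0.0] * 4))
    # Chords of y = x^2 on four segments: slope (2i-1)/n, intercept -i(i-1)/n^2.
    n = 4
    betas = [(2 * i - 1) / n for i in range(1, n + 1)]
    alphas = [-i * (i - 1) / n ** 2 for i in range(1, n + 1)]
    print(gini_from_segments(betas, alphas))
```

The second example approximates the curve y = x², whose exact Gini coefficient is 1/3; the piecewise-linear value approaches it as n grows.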

3. Empirical test based on income data of urban and rural residents in China

To verify the fitting precision of the proposed model, micro-level data on urban and rural household income for 1995-2013 from the Chinese Household Income Project Survey (CHIP) database are used to compare the fitting results of different models. The grouping model is estimated with n = 5, 10, and 20 and compared with the following classical estimation models of the Lorenz curve:

Kakwani-Podder:

$$\eta(\pi) = \pi e^{-\delta(1-\pi)}, \quad \delta > 0$$

Chotikapanich:

$$\eta(\pi) = \frac{e^{k\pi} - 1}{e^{k} - 1}, \quad k > 0$$

Pareto:

$$\eta(\pi) = 1 - (1-\pi)^{\frac{1}{\gamma}}, \quad \gamma > 1$$

Nicholas Rohde (2009):

$$\eta(\pi) = \pi \, \frac{\beta - 1}{\beta - \pi}, \quad \beta > 1$$

Bi-parametric (Wang & Smyth, 2015):

$$\eta(\pi) = \frac{1-\omega_1}{1+\omega_2} \cdot \frac{1+\omega_2\pi}{1-\omega_1\pi} \cdot \pi, \quad \omega_1 < 1, \ \omega_2 > -1$$

where $\eta$ and $\pi$ represent the cumulative percentages of wealth and population, respectively. Estimating the above models on the same dataset gives the parameter results in Table 1. The regression outputs of the grouping models are listed in Table A.1 in the Appendix. In addition, we report estimation results for different segments of the data (i.e., n=5, n=10, and n=20) in Tables B.1, B.2, and B.3 in the Appendix. The comparison between the quantile regression model and the new approach in this research is presented in Table C.1 in the Appendix.

Table 1 The regression output of different parametric models

Models       Kakwani-Podder   Chotikapanich   Pareto       Nicholas Rohde (2009)   Bi-parametric (Wang & Smyth, 2015)
Parameters   δ                K               1/γ          β                       ω1            ω2
1995         1.6896***        2.4637***       0.5508***    1.4483***               0.7039***     -0.0851***
             (0.0087)         (0.0033)        (0.0021)     (0.0005)                (0.0013)      (0.0067)
1999         1.8835***        2.1759***       0.6037***    1.5256***               0.5494***     0.56222***
             (0.0142)         (0.0019)        (0.0014)     (0.0008)                (0.0017)      (0.012)
2011-r       1.8966***        2.6550***       0.5249***    1.3957***               0.7015***     0.0725***
             (0.0060)         (0.0030)        (0.0013)     (0.0005)                (0.0011)      (0.0078)
2011-u       1.5657***        2.2534***       0.5748***    1.5104***               0.6534***     0.0279***
             (0.0068)         (0.0029)        (0.0014)     (0.0006)                (0.0016)      (0.0082)
2012-r       1.8819***        2.6201***       0.5335***    1.4030***               0.6883***     0.1327***
             (0.0060)         (0.0028)        (0.0013)     (0.0005)                (0.0011)      (0.0080)
2012-u       1.5039***        2.1852***       0.5857***    1.5354***               0.6488***     -0.0008***
             (0.0066)         (0.0028)        (0.0015)     (0.0006)                (0.0016)      (0.0076)
2013-r       1.8923***        2.6323***       0.5305***    1.4000***               0.6870***     0.1534***
             (0.0058)         (0.0028)        (0.0013)     (0.0005)                (0.0016)      (0.0084)
2013-u       1.4551***        2.1518***       0.5839***    1.5493***               0.6581***     -0.0665***
             (0.0062)         (0.0028)        (0.0012)     (0.0006)                (0.0016)      (0.0070)

Note: u represents the urban data in the CHIP database and r represents the rural data; the 1995 and 1999 data do not distinguish between rural and urban.
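For readers who want to reproduce the comparison, the classical parametric forms listed in this section can be written as plain Python functions. This is a sketch under assumptions: the function names are hypothetical, the formulas follow the expressions given in this section, and `mse` takes the mean of squared residuals, which may differ from the exact error measure reported in the tables.

```python
import math


def kakwani_podder(pi, delta):
    return pi * math.exp(-delta * (1.0 - pi))


def chotikapanich(pi, k):
    return (math.exp(k * pi) - 1.0) / (math.exp(k) - 1.0)


def pareto(pi, gamma):
    return 1.0 - (1.0 - pi) ** (1.0 / gamma)


def rohde(pi, beta):
    return pi * (beta - 1.0) / (beta - pi)


def wang_smyth(pi, omega1, omega2):
    scale = (1.0 - omega1) / (1.0 + omega2)
    return scale * pi * (1.0 + omega2 * pi) / (1.0 - omega1 * pi)


def mse(model, xs, ys, *params):
    """Mean of squared residuals between observed shares ys and model(xs)."""
    residuals = [(y - model(x, *params)) ** 2 for x, y in zip(xs, ys)]
    return sum(residuals) / len(residuals)


if __name__ == "__main__":
    # Parameter values taken from the 1995 row of Table 1 (1/gamma = 0.5508).
    curves = [
        (kakwani_podder, (1.6896,)),
        (chotikapanich, (2.4637,)),
        (pareto, (1.0 / 0.5508,)),
        (rohde, (1.4483,)),
        (wang_smyth, (0.7039, -0.0851)),
    ]
    for model, params in curves:
        print(model.__name__, round(model(0.5, *params), 4))
```

Each form passes through (0, 0) and (1, 1) by construction, so only the shape between the endpoints depends on the parameters.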

It can be seen from Table 1 that the classical models also have excellent parameter fitting results. In terms of interpretability, however, the classical models, involving only one or two parameters, fail to explain the Lorenz curve intuitively. By contrast, the parameter sequences fitted with the grouping model show the income level of each group more intuitively. Combined with the group interval and the number of groups, the model is capable of analyzing the income structure. The significance of the parameter estimates depends greatly on the specific data, so the precision of each model is the main basis for comparing model performance on the same data. Table 2 reports the results of the grouping model for n = 5, 10, and 20 together with the overall precision of the classical models.

Table 2 Mean square errors of different models

Models    Kakwani-Podder   Chotikapanich   Pareto    Nicholas Rohde   Bi-parametric   Dreg (n=5)   Dreg (n=10)   Dreg (n=20)   Sample size
1995      0.2718           0.0207          0.0963    0.0110           0.0136          0.0596       0.0074        0.0031        7998
1999      0.4875           0.0135          0.0978    0.0166           0.0123          0.0193       0.0051        0.0023        9063
2011-r    0.2253           0.0207          0.0907    0.0121           0.0147          0.0610       0.0085        0.004         10248
2011-u    0.1998           0.0172          0.0834    0.0116           0.0127          0.0564       0.0066        0.0014        6557
2012-r    0.2236           0.0194          0.0948    0.0122           0.0140          0.0608       0.0079        0.0036        10295
2012-u    0.1943           0.0167          0.0845    0.0113           0.0121          0.056        0.0061        0.0014        6580
2013-r    0.2202           0.0195          0.0930    0.0124           0.0144          0.0606       0.0082        0.0038        10389
2013-u    0.1823           0.0174          0.0770    0.0111           0.0122          0.0556       0.0064        0.0029        6637

When n is set to 5, 10, and 20, the results of the dummy variable regression change as described above. The parameter estimates of the three grouping models are given in Appendix Table A.1. Provided that the sample size is large enough, the precision of the model increases with the number of groups. The mean square errors (MSEs) of the three single-parameter models (i.e., Kakwani and Podder, Chotikapanich, and Pareto) are significantly larger than those of the other models. The single-parameter model proposed by Nicholas Rohde differs only slightly from the bi-parametric model of Wang and Smyth (2015) in fitting effect, and its precision is higher than that of the dummy variable regression model with n=5. Since the data set used in this paper has a large sample size (see Table 2), grouping models with n=10 and n=20 are also estimated. When n=10 and n=20, the MSEs of the grouping model are smaller than those of the above two models. With a suitable number of groups, the grouping model greatly improves the fitting precision while endowing the parameters with economic connotations.

4. Conclusions

Parameters of the new model are endowed with clear economic connotations. The classical models and other models of the Lorenz curve proposed in recent years involve few parameters. With such a limited number of parameters, the parametric models are not able to describe the features of the Lorenz curve. Even though the parameters reflect the degree of bending of the Lorenz curve, they serve much the same function as the Gini coefficient: they merely present the phenomenon of inequality and its overall degree. The slope parameter obtained by the regression of the grouping model is the cumulative income–population resilience of the corresponding residents, which mirrors the income level of the residents to some extent. Therefore, the parameter has a clear economic connotation that is conducive to economic analysis.

The new model has high precision and strong adaptability.
For the grouping model, whose group number and group interval can both be adjusted, the only limit on the grouping is the sample size. When the sample size is too small, the result is meaningless even if the grouping model attains high fitting precision; the same holds for the other classical models. As long as the sample size is sufficient, the grouping model can be adjusted according to the sample size. When estimating Lorenz curves through dummy variable regression, researchers can refer to the code shared in the supplementary material. Adjusting the number of groups guarantees that the model is precise enough, while adjusting the group interval ensures that the information in the regression results is not too coarse and also saves degrees of freedom. In comparison, other parametric models basically have fixed forms, and only the values of their parameters can be adjusted. For particular income data, different parametric models also perform differently.

References

Chotikapanich, D., 1993. A comparison of alternative functional forms for the Lorenz curve. Economics Letters 41, 21–29.
Chotikapanich, D., Griffiths, B., Rao, D. S., 2007. Estimating and combining national income distributions using limited data. Journal of Business and Economic Statistics 25, 97–109.
Gastwirth, J., Glauberman, M., 1976. The interpolation of the Lorenz curve and Gini index from grouped data. Econometrica 44, 479–483.
Kakwani, N. C., Podder, N., 1976. Efficient estimation of the Lorenz curve and associated inequality measures from grouped observations. Econometrica 44, 137–148.
Kakwani, N. C., Podder, N., 1973. On the estimation of Lorenz curves from grouped observations. International Economic Review 14, 278–292.
Prendergast, L. A., Staudte, R. G., 2016. Quantile versions of the Lorenz Curve. Electronic Journal of Statistics 10, 1896–1926.
Rohde, N., 2009. An alternative functional form for estimating the Lorenz curves. Economics Letters 105, 61–63.
, A., 2017. Yearly, monthly and weekly seasonality of tourism demand: A decomposition analysis. Tourism Management 60, 379–389.
Sarabia, J. M., Prieto, F., Jordá, V., 2015. About the hyperbolic Lorenz curve. Economics Letters 136, 42–45.
Sarabia, J. M., Jordá, V., 2014. Explicit expressions of the Pietra index for the generalized function for the size distribution of income. Physica A 416, 582–595.
Sarabia, J. M., Castillo, E., Slottje, D. J., 1999. An ordered family of Lorenz curves. Journal of Econometrics 91, 43–60.
Wang, Z. X., Smyth, R., 2015. A piecewise method for estimating the Lorenz curves. Economics Letters 129, 45–48.
Wang, Z. X., Smyth, R., 2015. A hybrid method for creating Lorenz curves. Economics Letters 133, 59–63.

Appendix A. Results obtained with dummy regression

To examine in detail how the number n of groups influences the estimation precision of the Lorenz curves with dummy variable regression, the estimation results of dummy variable regression with different n for the 1995 CHIP data are listed in Table A.1. All of the regression coefficients are significant at the 0.01 level. This indicates that, for n=5, 10, and 20, the groups differ significantly in income. The results of the three tests support the conclusion that the larger the number n of groups, the higher the precision of the model. However, this does not mean that the number of groups should be made as large as possible in practical applications. A larger number of groups reduces the degrees of freedom of parameter estimation and may therefore increase the estimation error. For the linear model,

$$y = \sum_{i=1}^{n} (\beta_i D_i x + \alpha_i D_i) + \varepsilon$$

Suppose that the sample size is N, then the degree of freedom of the model is obtained as

in the case that the income level is divided into n segments.

Obviously, the model consumes more degrees of freedom than the standard linear model. Therefore, a sufficiently large sample size is needed when selecting the number of groups, so as to ensure enough degrees of freedom and to prevent an overly large number of groups from harming the model precision. In practical applications, researchers need to balance the gain from a larger number of groups against the loss of degrees of freedom, so as to select the number of groups reasonably.

As a supplement to the section “The economic connotations of coefficients”, the regression coefficient $\beta_i$ represents the average slope of the Lorenz curves on a

certain number of groups. Therefore, its economic connotation is the response of the cumulative percentage of income to a change in the cumulative percentage of population, that is, the cumulative income–population resilience.

Table A.1 Dummy variable regression with different n by 1995 CHIP data

Coefficient   Model 1 (n=5)          Model 2 (n=10)         Model 3 (n=20)
              (MSE=0.00766)          (MSE=0.00333)          (MSE=0.00142)
β1            0.2990*** (0.0013)     0.2172*** (0.0012)     0.1658*** (0.0012)
β2            0.5617*** (0.0010)     0.3736*** (0.0007)     0.2646*** (0.0007)
β3            0.7993*** (0.0011)     0.5024*** (0.0007)     0.3416*** (0.0005)
β4            1.1280*** (0.0018)     0.6198*** (0.0006)     0.4057*** (0.0005)
β5            1.9934*** (0.0096)     0.7344*** (0.0007)     0.4696*** (0.0006)
β6                                   0.8659*** (0.0008)     0.5333*** (0.0005)
β7                                   1.0224*** (0.0010)     0.5915*** (0.0005)
β8                                   1.2408*** (0.0015)     0.6477*** (0.0004)
β9                                   1.5830*** (0.0027)     0.7048*** (0.0005)
β10                                  2.5803*** (0.0169)     0.7648*** (0.0005)
β11                                                         0.8304*** (0.0005)
β12                                                         0.9014*** (0.0006)
β13                                                         0.9771*** (0.0007)
β14                                                         1.0681*** (0.0008)
β15                                                         1.1780*** (0.0009)
β16                                                         1.3059*** (0.0011)
β17                                                         1.4694*** (0.0016)
β18                                                         1.7031*** (0.0021)
β19                                                         2.0653*** (0.0043)
β20                                                         3.3151*** (0.0291)

Note: the data in parentheses are robust standard errors.

Based on the large-sample CHIP data, the number n of groups is set to 20 to further estimate the Lorenz curves of rural and urban residents in China from 2011 to 2013. The coefficients of the dummy variable regressions are displayed in Table A.2. As shown in the table, groups with a larger group number, that is, richer residents, have larger coefficients, while groups with a smaller group number, that is, poorer residents, have smaller coefficients. In addition, urban residents have larger coefficients than rural residents in the same year. The results imply that a population change in the high-income group influences income more strongly than one in the low-income group, and that a population change in urban areas influences income more strongly than one in rural areas.

Table A.2 The coefficients of dummy variable regressions by 2011-2013 CHIP data

1 2

3 4

5 6 7

2011-r

2012-r

2013-r

2011-u

2012-u

2013-u

0.1169***

0.1186***

0.1190***

0.3040***

0.3171***

0.2044***

(0.0009)

(0.0009)

(0.0009)

(0.0015)

(0.0016)

(0.0018)

***

***

***

***

***

0.3252***

0.2208

0.2219

0.2150

(0.0006)

(0.0006)

(0.0006)

(0.0006)

(0.0009)

(0.0007)

***

***

***

***

***

0.3993***

0.2860

0.4923

0.2955

0.2960

(0.0004)

(0.0005)

(0.0006)

(0.0008)

(0.0006)

(0.0005)

***

***

***

***

***

0.4690***

0.3663

0.6156

0.5003

0.3691

0.3596

(0.0008)

(0.0004)

(0.0006)

(0.0009)

(0.0008)

(0.0005)

***

***

***

***

***

0.5239***

0.4320

0.6887

0.6023

0.4435

0.4354

(0.0005)

(0.0005)

(0.0003)

(0.0008)

(0.0007)

(0.0003)

***

***

***

***

***

0.5862***

0.5072

0.7826

0.7031

0.4991

0.4933

(0.0005)

(0.0000)

(0.0006)

(0.0008)

(0.0009)

(0.0005)

***

***

***

***

***

0.6449***

0.5517

0.5545

0.5592

0.8322

0.7555

0.9435

0.8531

0.9147

8 9

10 11

12

13 14

15

16 17 18

19  20 gini

(0.0000)

(0.0006)

***

***

0.6030

(0.0004 ***

(0.0010)

(0.0009)

(0.0003)

***

***

0.6853***

0.6262

0.6392

(0.0007)

(0.0004)

(0.0005)

(0.0011)

(0.0007)

(0.0007)

***

***

***

***

***

0.7527***

0.6881

1.0025

0.7042

0.6730

(0.0003)

(0.0006)

(0.0005)

(0.0009)

(0.0010)

(0.0007)

***

***

***

***

***

0.7891***

0.7663

1.1150

1.0184

0.7491

0.7600

(0.0005)

(0.0000)

(0.0004)

(0.0014)

(0.0001)

(0.0004)

***

***

***

***

***

0.8670***

0.8275

1.1795

1.0961

0.8236

0.8381

(0.0000)

(0.0009)

(0.0008)

(0.0003)

(0.0013)

(0.0008)

***

***

***

***

***

0.9292***

0.8817

1.2958

1.1769

0.9069

0.8916

(0.0011)

(0.0008)

(0.0005)

(0.0016)

(0.0011)

(0.0007)

***

***

***

***

***

1.0253***

0.9887

1.3687

1.2575

0.9978

1.004

(0.0008)

(0.0001)

(0.0007)

(0.0009)

(0.0006)

(0.0007)

***

***

***

***

***

1.0912***

1.0980

1.4721

1.3439

1.0887

1.0985

(0.0004)

(0.0009)

(0.0002)

(0.0008)

(0.0023)

(0.0009)

***

***

***

***

***

1.1869***

1.1756

1.6139

1.4641

1.2203

1.2020

(0.0014)

(0.0009)

(0.0011)

(0.0029)

(0.0023)

(0.0009)

***

***

***

***

***

1.2966***

1.3457

1.6959

1.5563

1.3143

1.3268

(0.0011)

(0.0015)

(0.0005)

(0.0025)

(0.0022)

(0.0004)

***

***

***

***

***

1.4429***

1.4892

1.9175

1.7302

1.5056

1.5118

(0.0023)

(0.0010)

(0.0015)

(0.0050)

(0.0033)

(0.0017)

***

***

***

***

***

1.6273***

1.7399

2.1912

1.8996

1.7635

1.7553

(0.0025)

(0.0019)

(0.0015)

(0.0072)

(0.0064)

(0.0020)

***

***

***

***

***

1.9420***

2.1328

2.5897

2.1307

2.1092

2.0957

(0.0033)

(0.0032)

(0.0030)

(0.0114)

(0.0136)

(0.0035)

***

***

***

***

***

2.8690***

3.2994

3.1784

2.5020

5.1135

3.0763

3.2677

3.2672

(0.0267)

(0.0253)

(0.0263)

(0.1420)

(0.0982)

(0.0252)

0.4053

0.4010

0.4027

0.4363

0.4203

0.3402

Note: the data in parentheses are robust standard errors.

4.7388

Appendix B. Regression results with different models at different segments of the population

We supplement the regression results for different segments of the population when n = 5, 10, and 20 in the 1995 CHIP data, based on the typical models and the dummy variable regression models, as displayed in Tables B.1, B.2, and B.3. The typical nonlinear models exhibit better goodness of fit than the dummy variable model in the majority of segments. However, the parameters of the typical models are intended to measure the distribution as a whole, and these models focus primarily on the overall properties of the Lorenz curve; after segmentation, the fitting results of some segments show problems such as discontinuous points and deviations. We therefore conclude that these models may not be suitable for fitting in the segmented case. The classical models adopted in the research, such as the Kakwani-Podder, Chotikapanich, Pareto, and Nicholas Rohde models, are continuous and pass through the fixed points (0, 0) and (1, 1). The Lorenz curves obtained by dividing the samples into several segments and regressing them separately are necessarily different, because the parameters of the models result from pursuing a local optimum. Linking these different Lorenz curves into an integral whole is bound to yield a curve that contains discontinuous points and deviates from the overall curve.

Table B.1 The MSEs of different models when n=5

n=5    Kakwani-Podder   Chotikapanich   Pareto    Nicholas Rohde   Bi-parametric   Dreg
G1     0.5340           0.0018          0.0146    0.0035           0.0004          0.0039
G2     0.0076           0.0028          0.0028    0.0007           0.0005          0.0018
G3     0.0020           0.0040          0.0021    8.3e-5           0.0006          0.0020
G4     0.0005           0.0050          0.0028    0.0002           0.0012          0.0033
G5     0.0124           0.0119          0.0439    0.0045           0.0064          0.0164

Table B.2 The MSEs of different models when n=10

n=10   Kakwani-Podder   Chotikapanich   Pareto    Nicholas Rohde   Bi-parametric   Dreg
G1     0.6778           0.0003          0.0069    0.0013           0.0002          0.0013
G2     0.0073           0.0004          0.0012    0.0003           0.0001          0.0005
G3     0.0023           0.0006          0.0009    0.0003           0.0001          0.0005
G4     0.0014           0.0008          0.0006    0.0001           0.0001          0.0004
G5     0.0007           0.0009          0.0005    3.4e-5           0.0002          0.0004
G6     0.0004           0.0011          0.0005    1.2e-5           0.0002          0.0005
G7     3.6e-6           0.0012          0.0006    0.0001           0.0002          0.0007
G8     0.0002           0.0013          0.0007    4.0e-5           0.0004          0.0010
G9     0.0009           0.0016          0.0013    0.0003           0.0008          0.0017
G10    0.0088           0.0067          0.0457    0.0031           0.0038          0.0102

Table B.3 The MSEs of different models when n=20

n=20   Kakwani-Podder   Chotikapanich   Pareto    Nicholas Rohde   Bi-parametric   Dreg
G1     0.8664           5.4e-5          0.0037    0.0006           1.09e-4         0.0005
G2     0.0054           6.7e-5          0.0006    0.0002           2.78e-5         0.0002
G3     0.0027           9.1e-5          0.0003    9.3e-5           3.99e-5         0.0001
G4     0.0012           0.0001          0.0003    7.9e-5           2.79e-5         0.0001
G5     0.0005           0.0001          0.0003    9.2e-5           7.35e-6         0.0001
G6     0.0006           0.0002          0.0002    4.2e-5           3.45e-5         0.0001
G7     0.0004           0.0002          0.0002    3.6e-5           3.12e-5         0.0001
G8     0.0003           0.0002          0.0001    1.3e-5           4.12e-5         0.0001
G9     0.0002           0.0002          0.0001    1.4e-5           3.56e-5         0.0001
G10    0.0001           0.0003          0.0002    1.9e-5           2.85e-5         0.0001
G11    0.0001           0.0003          0.0001    3.6e-5           4.47e-5         0.0001
G12    8.8e-5           0.0003          0.0001    1.1e-5           5.33e-5         0.0001
G13    1.5e-5           0.0003          0.0002    3.4e-5           4.61e-5         0.0002
G14    1.9e-5           0.0003          0.0002    2.1e-5           6.16e-5         0.0002
G15    4.4e-5           0.0003          0.0002    1.0e-5           8.12e-5         0.0002
G16    7.1e-5           0.0003          0.0002    1.1e-5           1.212e-4        0.0003
G17    0.0002           0.0004          0.0003    5.0e-5           1.565e-4        0.0004
G18    0.0002           0.0004          0.0001    5.3e-5           0.000325        0.0005
G19    0.0007           0.0007          0.0006    0.0003           5.958e-4        0.001
G20    0.0056           0.0039          0.0457    0.0017           5958e-4         0.0062

Appendix C. Comparison between quantile and dummy variable regressions

Compared with the standard regression model, the quantile regression model gives a more comprehensive view of the conditional distribution of the explained variable, rather than merely analyzing its conditional mean. The estimated regression coefficient generally differs across quantiles; that is, the explanatory variable influences the explained variable differently at different levels. In this respect the result is similar to the dummy variable regression proposed in this research. The comparison between the quantile and dummy variable regressions for the 1995 CHIP data is presented in Table C.1. As can be seen from Table C.1, the coefficients obtained by the quantile regression and the dummy variable regression are both ratios of the cumulative percentage of income to the cumulative percentage of population, but the regression results are very different. The difference is probably attributable to the following: (1) when regressing on a given quantile, the quantile regression uses all the data before that quantile, whereas the dummy variable regression involves only the data of the current segment; (2) the quantile regression takes the minimization of the absolute deviation as its objective function, whereas the dummy variable regression is still based on the least squares method. Like standard ordinary least squares, the quantile regression cannot force the Lorenz curve to pass through the origin and the point (1, 1). By adding constraints, this research guarantees that the dummy variable regression satisfies these requirements of the Lorenz curve. The detailed processing approach is as follows: Considering that the Lorenz curves are actually the function of wealth distribution, there is bound to have

. To guarantee that the Lorenz curve passing through the

origin and the point (1, 1), parameters of the model need to meet the conditions that

=0 and

. The fitting function can be written as follows after adding the two constraints:

Then, the above function is converted to the following fixed-point regression model:

Table C.1 The results of quantile regression and dummy variable regression by 1995 CHIP data

Dependent variable: cumulative income. For the quantile regression, each row reports the coefficient on cumulative population and the constant (cons), each with its bootstrap standard error, followed by the pseudo R2. The last column is the dummy variable regression coefficient (R2=1.000).

Quantile   Coef. (cumulative population)   Bootstrap Std. Err.   Coef. (cons)   Bootstrap Std. Err.   Pseudo R2   Coef. (dummy reg)
q05        0.7981                          0.0050                -0.1546        0.0025                0.7864      0.1658
q10        0.7980                          0.0049                -0.1533        0.0025                0.7839      0.2646
q15        0.7992                          0.0047                -0.1519        0.0024                0.7812      0.3416
q20        0.8006                          0.0047                -0.1497        0.0024                0.7784      0.4057
q25        0.8022                          0.0046                -0.1468        0.0024                0.7754      0.4696
q30        0.8046                          0.0045                -0.1434        0.0023                0.7723      0.5333
q35        0.8076                          0.0042                -0.1394        0.0021                0.7692      0.5915
q40        0.8110                          0.0040                -0.1347        0.0019                0.7659      0.6477
q45        0.8152                          0.0036                -0.1293        0.0019                0.7627      0.7048
q50        0.8200                          0.0035                -0.1231        0.0016                0.7595      0.7648
q55        0.8255                          0.0033                -0.1160        0.0017                0.7563      0.8304
q60        0.8314                          0.0033                -0.1078        0.0019                0.7531      0.9014
q65        0.8384                          0.0040                -0.0988        0.0023                0.7499      0.9771
q70        0.8467                          0.0044                -0.0888        0.0025                0.7468      1.0681
q75        0.8568                          0.0050                -0.0779        0.0026                0.7437      1.1780
q80        0.8687                          0.0046                -0.0658        0.0021                0.7406      1.3059
q85        0.8830                          0.0043                -0.0523        0.0021                0.7373      1.4694
q90        0.9020                          0.0048                -0.0372        0.0016                0.7337      1.7031
q95        0.9309                          0.0043                -0.0203        0.0014                0.7294      2.0653
q99        0.9737                          0.0034                -0.0047        0.0005                0.7251      3.3151

Appendix D. Proof of the convexity of the model

The Lorenz curve increases gently at first and then grows sharply; that is, positive first-order and second-order derivatives are a necessary property of the curve. The model must satisfy the same property. The income $I_k$ of each individual can be regarded as the increment of the Lorenz curve along the vertical axis when the kth person is added. After ranking the income data of residents by income level, we have

$$I_k \leq I_{k+1},$$

which means that the cumulative curve of wealth grows more and more rapidly; that is, the Lorenz curve has the property of a convex function. The model converts the original continuous curve into a piecewise linear function, and the strict convexity of the original function is weakened to nonstrict convexity in the model. The convexity of the model is proved as follows. For arbitrary $x_1$ and $x_2$, consider the following two cases.

1) When $x_1, x_2 \in \left[\frac{i-1}{n}, \frac{i}{n}\right]$, for arbitrary $0 < p < 1$,

$$f[px_1 + (1-p)x_2] = \beta_i [px_1 + (1-p)x_2] + \alpha_i = pf(x_1) + (1-p)f(x_2)$$

That is,

$$f[px_1 + (1-p)x_2] \leq pf(x_1) + (1-p)f(x_2)$$

2) When $x_1 \in \left[\frac{k-1}{n}, \frac{k}{n}\right]$ and $x_2 \in \left[\frac{k}{n}, \frac{k+1}{n}\right]$, for arbitrary $0 < p < 1$,

$$px_1 + (1-p)x_2 \in \left[\frac{k-1}{n}, \frac{k}{n}\right]$$

Therefore,

$$f[px_1 + (1-p)x_2] = \beta_k [px_1 + (1-p)x_2] + \alpha_k$$

$$pf(x_1) + (1-p)f(x_2) = p(\beta_k x_1 + \alpha_k) + (1-p)(\beta_{k+1} x_2 + \alpha_{k+1})$$

Then,

$$f[px_1 + (1-p)x_2] - [pf(x_1) + (1-p)f(x_2)] = (1-p)[(\beta_k - \beta_{k+1})x_2 + (\alpha_k - \alpha_{k+1})] \leq 0$$

That is,

$$f[px_1 + (1-p)x_2] \leq pf(x_1) + (1-p)f(x_2)$$

The proof is similar when $x_1$ and $x_2$ lie in other pairs of different groups.
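The convexity inequality can also be checked numerically for a given set of fitted coefficients. The Python sketch below is an illustration rather than part of the proof: the function names, the grid-based midpoint test, and the tolerance are assumptions, and segments are taken to be of equal width with slopes β_i and intercepts α_i.

```python
def piecewise_value(x, betas, alphas):
    """Evaluate the piecewise-linear model at x in [0, 1] (equal-width segments)."""
    n = len(betas)
    idx = min(n - 1, int(x * n))  # boundary points go to the segment starting there
    return betas[idx] * x + alphas[idx]


def is_midpoint_convex(betas, alphas, grid=200, tol=1e-12):
    """Check f((a + b) / 2) <= (f(a) + f(b)) / 2 for all pairs on a uniform grid."""
    points = [j / grid for j in range(grid + 1)]
    values = [piecewise_value(x, betas, alphas) for x in points]
    for a, fa in zip(points, values):
        for b, fb in zip(points, values):
            mid = piecewise_value((a + b) / 2.0, betas, alphas)
            if mid > (fa + fb) / 2.0 + tol:
                return False
    return True


if __name__ == "__main__":
    # Chords of y = x^2 on four segments: continuous, slopes increasing.
    n = 4
    betas = [(2 * i - 1) / n for i in range(1, n + 1)]
    alphas = [-i * (i - 1) / n ** 2 for i in range(1, n + 1)]
    print(is_midpoint_convex(betas, alphas))
    # A continuous curve whose slope falls from 1.5 to 0.5 is concave.
    print(is_midpoint_convex([1.5, 0.5], [0.0, 0.5]))
```

A grid check can only detect violations at the sampled points, so it complements the analytical argument rather than replacing it.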