Applied Mathematics and Computation 361 (2019) 274–293
Computation and application of robust data-driven bandwidth selection for gradient function estimation
Qichang Xie∗, Qiankun Sun
Department of Finance, Shandong Technology and Business University, Shandong, PR China
Keywords: Bandwidth selection; Composite quantile regression; Gradient estimation; Local polynomial fitting
Abstract: The significance of gradient estimates in nonparametric regression cannot be neglected, as the gradient is the key quantity for assessing marginal effects in empirical applications. However, the performance of the resulting estimates depends closely on the selection of smoothing parameters, and the existing methods of parameter choice are either too complicated or not robust enough. To improve computational efficiency and robustness, a data-driven bandwidth selection procedure is proposed in this paper for computing the gradient of an unknown function based on local linear composite quantile regression. This bandwidth selection method overcomes the difficulty that the infeasible selection program requires direct observation of the true gradient. Moreover, the leading bias and variance of the estimated gradient are obtained under certain regularity conditions. It is shown that the bandwidth selection method possesses an oracle property, in the sense that the selected bandwidth is asymptotically equivalent to the optimal bandwidth that would be chosen if the true gradient were known. Monte Carlo simulations and a real example are conducted to demonstrate the finite sample properties of the suggested method. Both the simulations and the application corroborate that our technique delivers more effective and robust derivative estimators than some existing approaches. © 2019 Elsevier Inc. All rights reserved.
1. Introduction

Nonparametric models, which guard against misspecification of the functional form, play an essential role in modern statistical theory and have been extensively used in empirical research. Quite a few regression methodologies and the relevant theoretical properties have been established for estimating mean functions; a comprehensive survey is given by Fan and Gijbels [3]. In contrast, nonparametric derivative estimation is an issue that has not received enough attention, even though applications of derivative estimation are significant and widespread [5,21]. In particular, since the gradient measures the change in the dependent variable in response to a unit change in the regressors, its estimation is a major approach for evaluating marginal effects in data analysis. Gasser and Müller [4] introduced a kernel estimator for the gradient of a regression function. Applying differentiated estimation, Härdle [7] also took gradient estimation into account, but the variance of the resulting estimator was rather large. To address this problem, Wang and Lin [23] examined derivative estimation by a local constant approximation in least squares regression. Although methods for estimating derivatives are by now well developed, the success of the estimation depends sensitively on the choice of smoothing parameters, or bandwidths. For nonparametric mean functions, there
∗ Corresponding author. E-mail addresses: [email protected], [email protected] (Q. Xie).
https://doi.org/10.1016/j.amc.2019.05.044
0096-3003/© 2019 Elsevier Inc. All rights reserved.
has been a large literature on bandwidth selection, for example [16,19,24]. However, bandwidth selection for nonparametric derivative estimation has received insufficient attention, because derivative estimates are usually obtained as "by-products" of a local polynomial estimation of the regression function. As Wang and Lin [23] argued, with the same bandwidth the mean estimator can achieve the optimal rate of convergence, while the derivative estimators cannot. To improve the performance of derivative estimation, different bandwidth selection approaches have been presented under distinct estimation criteria. Rice [20] adopted an integrated mean squared error criterion to choose the optimal smoothing parameter for a nonparametric derivative function. For gradient estimation, Müller et al. [18] selected the bandwidth by minimizing a generalized cross-validation criterion, and Henderson et al. [8] explored bandwidth selection for gradient estimates through a cross-validation technique. All of the above procedures for bandwidth selection are built on the least squares (LS) method, which, notwithstanding its elegance, is not robust and is often sensitive to outliers [9]. Since its introduction by Koenker and Bassett [14], quantile regression has been regarded as a robust alternative to the least squares method, also for estimating derivatives nonparametrically. Chaudhuri [1] adopted local quantile regression to deduce the Bahadur representation of a derivative estimator; this method was then employed by Chaudhuri et al. [2] to estimate the average derivative of a quantile function. Yu and Jones [25] obtained the asymptotic normality of a derivative estimator based on local linear quantile regression, and Lin et al. [17] recently explored bandwidth selection for the nonparametric conditional quantile derivative function using local linear quantile regression. Although the main advantages of quantile regression, robustness to outliers and the ability to characterize heterogeneous effects, are well recognized, its efficiency depends closely on the chosen quantile. For this reason, Zou and Yuan [27] introduced composite quantile regression (CQR) to obtain a highly efficient estimator. CQR incorporates distributional information over different quantiles, inherits the robustness of quantile regression and attains high estimation efficiency by averaging the quantile loss functions. Owing to these attractive statistical properties, CQR has also been applied to derivative estimation in various models, for instance nonparametric models [11], varying coefficient models [12], single-index models [10] and single-index coefficient models [26]. Nevertheless, little attention has been paid to improving the performance of the derivative estimation itself. What is more, the bandwidth selection problems of these derivative estimators have not been properly highlighted in the literature and still call for a better solution today. Outliers and heteroscedasticity may well be present in empirical data exhibiting nonlinear relationships, so it is necessary to develop a robust bandwidth choice method that adapts automatically to these conditions. To fulfill this aim, this paper develops a gradient-based cross-validation (GBCV) method for computing the local linear gradient estimator in the framework of CQR.
The existing bandwidth selection techniques, such as the "plug-in" [3] and "short-cut" [11] rules, need to estimate numerous unknown functions, so they are not fully automatic. In contrast, the suggested method is completely automatic and can be implemented without heavy computations. The main contributions of this paper are summarized as follows.
(1) The GBCV bandwidth selection approach generalizes the oracle LSCV construction for the gradient in [18] to local linear estimation and is fully data-driven. Although the LSCV construction has the oracle property, it is infeasible in practice because it requires observing the unknown true gradients. To overcome this limitation, the unknown true gradient in the oracle LSCV is replaced by the local cubic composite quantile gradient estimate, which yields the GBCV criterion. We show that the leading bias term of the GBCV criterion is dominated by the local linear estimator, whose bias order is larger than that of the local cubic estimator, and that the leading variance of the GBCV criterion becomes the variance of the difference between the two estimators. After adjustment by a constant that depends only on the kernel function, the bandwidth selected by minimizing the GBCV criterion is proved to be asymptotically equivalent to the one chosen by the oracle LSCV with the unknown true gradient. The proposed gradient-based bandwidth selection method is therefore feasible.
(2) Although a similar GBCV method has been investigated by Henderson et al. [8], their gradient estimates were based on local linear LS regression, which is not robust for a class of non-normal error distributions. For the sake of robustness, Lin et al. [17] generalized the GBCV technique to bandwidth selection in the setting of local linear quantile regression. However, local linear quantile regression relies heavily on the artificial choice of a single quantile and is likely to use the distributional information in a biased way. Our method makes up for the shortcomings of both techniques: on the one hand, it achieves robustness and significantly increases estimation efficiency under outliers and non-normal distributions; on the other hand, by combining information across a series of quantiles, it exploits the entire distributional information of the nonparametric response.
(3) Simulations and an application are conducted to compare the sampling performance of the proposed technique with the LSCV method and with the local linear LS method of Henderson et al. [8]. The simulation results reveal that our procedure for estimating gradients dominates the LSCV method and considerably outperforms the local linear LS method for all non-normal error distributions considered. The application substantiates the superiority of the suggested approach over the LSCV method in terms of efficiency and over the local linear LS method in terms of robustness. Hence the GBCV based on CQR is an efficient and robust alternative to some existing methods. Furthermore, our method can be extended to other important bandwidth selection problems, involving higher order derivatives, discrete covariates and other nonparametric models.
The rest of this paper is organized as follows. Section 2 proposes a new cross-validation procedure based on the local linear CQR method for gradient estimation and establishes the asymptotic properties of the chosen bandwidth. The results of a Monte Carlo simulation are reported in Section 3. Section 4 presents a real data analysis and Section 5 concludes the paper. All technical proofs are deferred to the Appendix.

2. Methodology

Nonparametric models are attractive because of their ability to capture nonlinear implicit relationships between variables. In addition, univariate heteroscedastic models, which include homoscedastic models as special cases, are common in practical applications and can easily be extended to multivariate settings [16]. We therefore consider the univariate nonparametric heteroscedastic model
y = m(x) + σ(x)ε,    (1)
where y stands for the response variable, x denotes a scalar covariate, m(x) is a smooth nonparametric function, and σ(x) is a positive function representing the standard deviation of the random error ε, which is assumed to have mean 0 and variance 1. Local linear fitting is a popular and successful method for estimating the mean function of model (1); the local linear estimator has good boundary behavior and is unbiased when the true model is linear [3]. In local linear fitting, m(xi) is locally approximated by a linear function, m(xi) ≈ m(x) + m'(x)(xi − x), and is then fitted by a linear model in a neighbourhood of x. As mentioned above, there is a large literature on the estimation of mean functions by local linear fitting. Here, in contrast, we are interested in the first-order derivative of m(x) with respect to x, denoted β(x) = dm(x)/dx. Suppose that {xi, yi}, i = 1, . . . , n, is an independent and identically distributed random sample from model (1). Recently, Henderson et al. [8] considered the local linear LS estimator of β(x) obtained by minimizing
\[
\sum_{i=1}^{n}\big[y_i-a-b(x_i-x)\big]^2K_{h,ix}, \qquad (2)
\]
where b estimates β(x), Kh,ix = h⁻¹K((xi − x)/h) is a kernel weight and h is the bandwidth. By solving (2), a closed-form expression for the gradient estimator can be derived, and the estimator is asymptotically normal. The resulting estimator also possesses many good statistical properties, such as design adaptability and high minimax efficiency [3]. However, the estimator is no longer consistent when the second moment of the error distribution is infinite. To avoid this deficiency of local linear LS, Lin et al. [17] extended the work of Henderson et al. [8] and studied the same gradient estimates via local linear quantile regression. Their quantile gradient estimates are robust, but their efficiency depends on the chosen quantile level. To address both issues, Kai et al. [11] introduced local linear CQR, which inherits the robustness of local linear quantile regression by combining different quantiles and can significantly improve upon the estimation efficiency of the local linear LS method for commonly encountered non-normal error distributions. Although the asymptotic expressions for the local linear CQR estimator were established by Kai et al. [11], bandwidth selection for the gradient estimate was not considered, and it is well known that the behavior of the estimator is closely related to the selected bandwidth. To fill this gap, we investigate bandwidth selection for the local linear CQR estimation of β(x). Let ρτk(u) = u(τk − I(u ≤ 0)) be the p check loss functions at the p quantile positions τk = k/(p + 1), k = 1, . . . , p. Denote by β̂LL(x) the local linear CQR estimator of β(x) obtained by minimizing the locally kernel weighted CQR loss
\[
\sum_{k=1}^{p}\sum_{i=1}^{n}\rho_{\tau_k}\big(y_i-a_k-b(x_i-x)\big)K_{h,ix}, \qquad (3)
\]
where b corresponds to the estimate of β(x). Bandwidth selection is an important issue for the performance of local polynomial estimators. Ideally, the bandwidth h for the derivative estimator should be chosen to minimize the expected mean squared error E{[β̂LL(x) − β(x)]²} or its sample analogue:
\[
CV(h)=\frac{1}{n}\sum_{i=1}^{n}\big[\hat{\beta}_{LL}(x_i)-\beta(x_i)\big]^2W(x_i), \qquad (4)
\]
where W(·) is a weight function with bounded support so as to trim out data near the boundary of the support of x. The leading term of CV(h) in (4) defined by CV0 (h) is
\[
CV_0(h)=\int \mathrm{MSE}_0\big[\hat{\beta}_{LL}(x)-\beta(x)\big]W(x)f(x)\,dx, \qquad (5)
\]
where f(x) is the density function of x and MSE0 denotes the leading term of the MSE (mean squared error). By arguments similar to those in [6,16], it yields
\[
CV_0(h)=\int\Big\{\big[\mathrm{Bias}_0(\hat{\beta}_{LL}(x))\big]^2+\mathrm{Var}_0(\hat{\beta}_{LL}(x))\Big\}W(x)f(x)\,dx+(s.o.), \qquad (6)
\]
Q. Xie and Q. Sun / Applied Mathematics and Computation 361 (2019) 274–293
277
where Bias0(β̂LL(x)) and Var0(β̂LL(x)) stand for the leading bias and leading variance of β̂LL(x), respectively, and (s.o.) collects terms of smaller probability order than the leading terms. As (6) shows, CV0(h) consists of the leading bias and leading variance of the local linear composite quantile gradient estimator. In the following, the leading bias and leading variance of β̂LL(x) are computed, with proofs given in the Appendix. For convenience, some notation is required. Let Fε(·) and fε(·) be the cumulative distribution function and the density function of the error ε, respectively. Set μj = ∫K(s)s^j ds and vj = ∫K²(s)s^j ds for j = 0, 1, . . . , 6. Define ϕ = Σ_{k=1}^p ϕk with ϕk = fε(ck), and λ(p) = Σ_{k=1}^p Σ_{k'=1}^p τ_{kk'}, where ck = Fε⁻¹(τk) and τ_{k,k'} = min(τk, τ_{k'}) − τk τ_{k'}. For the asymptotic analysis we require the following assumptions, which have been used in [3,11].

Assumption 1. m(x) has a continuous fourth-order derivative on x ∈ M, where M is the support of the trimming function.

Assumption 2. The density function f(·) of x is differentiable and positive on x ∈ M.

Assumption 3. The conditional variance σ²(x) is continuous in a neighbourhood of x0.

Assumption 4. The distribution of the error is symmetric and its density fε(·) is positive.

Assumption 5. The kernel function K(·) is a symmetric density function with bounded support.

Theorem 1. Under Assumptions 1–5, if h → 0 and nh³ → ∞, then the leading bias and leading variance of the local linear CQR estimator β̂LL(x) of β(x) are given by
\[
\mathrm{Bias}_0(\hat{\beta}_{LL}(x))=h^2\,\frac{(\mu_4-\mu_2^2)f'(x)m''(x)}{2\mu_2 f(x)}\equiv h^2B_1(x),
\qquad
\mathrm{Var}_0(\hat{\beta}_{LL}(x))=\frac{1}{nh^3}\,\frac{v_2\lambda(p)\sigma^2(x)}{\varphi^2\mu_2^2 f(x)}=\frac{1}{nh^3}V_1\Omega(x),
\]
where B1(x) = (μ4 − μ2²)f'(x)m''(x)/(2μ2 f(x)), V1 = v2λ(p)/(ϕ²μ2²) and Ω(x) = σ²(x)/f(x). Substituting the leading bias and variance into (6), the optimal bandwidth h0,opt for the gradient estimator is obtained by minimizing CV0(h), namely
\[
h_{0,opt}=\left[\frac{3V_1\int\sigma^2(x)W(x)\,dx}{4\int B_1^2(x)f(x)W(x)\,dx}\right]^{1/7}n^{-1/7}. \qquad (7)
\]
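For completeness, (7) can be checked directly from the first-order condition of the leading-term criterion; the following short display, which uses Ω(x) = σ²(x)/f(x), is our own verification rather than part of the original derivation:
\[
CV_0(h)=\int\Big[h^4B_1^2(x)+\frac{V_1\Omega(x)}{nh^3}\Big]W(x)f(x)\,dx,
\qquad
\frac{dCV_0(h)}{dh}=4h^3\!\int\! B_1^2(x)f(x)W(x)\,dx-\frac{3V_1}{nh^4}\!\int\!\sigma^2(x)W(x)\,dx=0
\;\Rightarrow\;
h^7=\frac{3V_1\int\sigma^2(x)W(x)\,dx}{4n\int B_1^2(x)f(x)W(x)\,dx}.
\]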
Although the bandwidth h0,opt is optimal, it cannot be used directly: it is infeasible because formula (7) involves unknown functions such as f(x), f'(x), ϕ and m''(x). Alternatively, the "plug-in" method initiated by Fan and Gijbels [3] can be used to estimate h0,opt. This approach first estimates the unknown functions with some initial bandwidths and then plugs the estimates into expression (7). In theory, the plug-in approach is difficult to apply because the plug-in formulas in (7) become rather complex when the model contains multiple regressors or discrete covariates. In practice, the plug-in method is not fully automatic, since it needs initial bandwidths for estimating the unknown functions, and it yields poor bandwidth choices when these initial choices deviate from the optimal values. To provide a feasible bandwidth, this paper recommends an automatic selection procedure whose output is denoted ĥ0,opt. The procedure is fully data-driven and does not require initial bandwidths. It is proved that the selected ĥ0,opt and the infeasible h0,opt are asymptotically equal, in the sense that ĥ0,opt/h0,opt = 1 + op(1). The goal is to construct a feasible objective function from which ĥ0,opt can be chosen. The objective function is built by replacing the true gradient in (4) with a consistent estimate β̂(x) of the gradient function β(x). Hence, the objective function is written as
\[
CV_{LCB}(h)=\frac{1}{n}\sum_{i=1}^{n}\big[\hat{\beta}_{LL}(x_i)-\hat{\beta}(x_i)\big]^2W(x_i). \qquad (8)
\]
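To make the estimator entering (8) concrete, the following R sketch computes the local linear CQR gradient estimate β̂LL(x) of (3) at a single point. It exploits the fact that, for a fixed slope b, the composite loss separates over the intercepts and each optimal ak is a kernel-weighted τk-quantile of yi − b(xi − x), so the profiled loss is convex in b and can be minimized in one dimension. The function names (check_loss, wquantile, cqr_grad_ll) and the use of optimize() are our own illustrative choices, not the authors' code.

```r
## Minimal sketch (not the authors' implementation) of the local linear CQR
## gradient estimate in (3) at a point x0, with a Gaussian kernel.
check_loss <- function(u, tau) u * (tau - (u <= 0))          # rho_tau(u)

## weighted tau-quantile: a minimiser of sum_i w_i * rho_tau(r_i - a)
wquantile <- function(r, w, tau) {
  o  <- order(r)
  cw <- cumsum(w[o]) / sum(w)
  r[o][which(cw >= tau)[1]]
}

cqr_grad_ll <- function(x, y, x0, h, p = 9) {
  taus <- (1:p) / (p + 1)
  w <- dnorm((x - x0) / h) / h                                # kernel weights K_{h,ix}
  profile <- function(b) {                                    # loss with a_k profiled out
    r <- y - b * (x - x0)
    sum(sapply(taus, function(tk) {
      a <- wquantile(r, w, tk)
      sum(w * check_loss(r - a, tk))
    }))
  }
  slope_scale <- max(1, diff(range(y)) / diff(range(x)))      # crude search range
  optimize(profile, interval = c(-5, 5) * slope_scale)$minimum
}
```

Profiling out the intercepts keeps the optimization one-dimensional regardless of p, which is convenient for the repeated evaluations required by the cross-validation criteria discussed below.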
The candidates for β̂(x) can be drawn from a series of local polynomial composite quantile derivative estimators. We borrow the idea of [8] that the local cubic estimate, denoted β̂(x) = β̂LCB(x), is the best option among the alternatives, where the subscript "LCB" stands for local cubic. By a third-order Taylor expansion, m(xi) is locally fitted by m(xi) ≈ m(x) + m'(x)(xi − x) + m''(x)(xi − x)²/2 + m'''(x)(xi − x)³/6 ≡ a + b1(xi − x) + b2(xi − x)² + b3(xi − x)³. The local cubic estimate is then derived from the local cubic composite quantile regression
\[
\sum_{k=1}^{p}\sum_{i=1}^{n}\rho_{\tau_k}\big(y_i-a_k-b_1(x_i-x)-b_2(x_i-x)^2-b_3(x_i-x)^3\big)K\!\left(\frac{x_i-x}{h}\right), \qquad (9)
\]
where b1 is the estimate of β(x). Minimizing the objective function (9) yields the estimates âk, b̂1, b̂2 and b̂3, and the estimate b̂1 is precisely the local cubic estimate β̂LCB(x). Correspondingly, the leading term of CVLCB(h) is
\[
CV_{0,LCB}(h)=\int \mathrm{MSE}_0\big[\hat{\beta}_{LL}(x)-\hat{\beta}_{LCB}(x)\big]W(x)f(x)\,dx
=\int\Big\{\big[\mathrm{Bias}_0\big(\hat{\beta}_{LL}(x)-\hat{\beta}_{LCB}(x)\big)\big]^2+\mathrm{Var}_0\big(\hat{\beta}_{LL}(x)-\hat{\beta}_{LCB}(x)\big)\Big\}W(x)f(x)\,dx. \qquad (10)
\]
To calculate CV0,LCB (h), it is necessary to estimate the leading bias and variance of βˆLL (x ) − βˆLCB (x ). We will show that Bias0 (βˆLL (x )) dominates the bias term Bias0 (βˆLL (x ) − βˆLCB (x )) as Bias0 (βˆLCB (x )) = O(h4 ) = o(h2 ) is negligible compared with the leading bias of local linear CQR estimator with order O(h2 ). Furthermore, since Var0 (βˆLL (x ) − βˆLCB (x )) = Var0 (βˆLL (x )) + Var0 (βˆLCB (x )) − 2Cov0 (βˆLL (x ), βˆLCB (x )), one needs to know the leading variance of both estimators and their covariance for estimating it. The results are summarized as follows. Theorem 2. Under Assumptions 1–5, if h → 0 and nh3 → ∞, then
\[
\mathrm{Bias}_0(\hat{\beta}_{LCB}(x))=O(h^4),
\]
\[
\mathrm{Var}_0(\hat{\beta}_{LCB}(x))=\frac{1}{nh^3}\,\frac{\lambda(p)(\mu_6^2v_2+\mu_4^2v_6-2\mu_4\mu_6v_4)}{\varphi^2(\mu_2\mu_6-\mu_4^2)^2}\,\frac{\sigma^2(x)}{f(x)}
=\frac{1}{nh^3}V_2\Omega(x),
\]
\[
\mathrm{Cov}_0(\hat{\beta}_{LL}(x),\hat{\beta}_{LCB}(x))=\frac{1}{nh^3}\,\frac{\lambda(p)(\mu_6v_2-\mu_4v_4)}{\varphi^2\mu_2(\mu_2\mu_6-\mu_4^2)}\,\frac{\sigma^2(x)}{f(x)}
=\frac{1}{nh^3}V_3\Omega(x),
\]
where
\[
V_2=\frac{\lambda(p)(\mu_6^2v_2+\mu_4^2v_6-2\mu_4\mu_6v_4)}{\varphi^2(\mu_2\mu_6-\mu_4^2)^2},
\qquad
V_3=\frac{\lambda(p)(\mu_6v_2-\mu_4v_4)}{\varphi^2\mu_2(\mu_2\mu_6-\mu_4^2)}.
\]
According to Theorems 1 and 2 and (10), we have
\[
CV_{0,LCB}(h)=\int\Big[h^4B_1^2(x)+\frac{V_{1,3}\,\Omega(x)}{nh^3}\Big]W(x)f(x)\,dx, \qquad (11)
\]
where
\[
V_{1,3}=V_1-2V_3+V_2=\frac{\lambda(p)\mu_4^2(\mu_2^2v_6-2\mu_2\mu_4v_4+\mu_4^2v_2)}{\varphi^2\mu_2^2(\mu_2\mu_6-\mu_4^2)^2}.
\]
By minimizing CV0,LCB(h), the optimal bandwidth is
\[
h_{0,cubic}=\left[\frac{3V_{1,3}\int\sigma^2(x)W(x)\,dx}{4\int B_1^2(x)W(x)f(x)\,dx}\right]^{1/7}n^{-1/7}.
\]

Theorem 3. We have that
\[
\frac{h_{0,opt}}{h_{0,cubic}}=\left(\frac{V_1}{V_{1,3}}\right)^{1/7}
=\left[\frac{v_2(\mu_2\mu_6-\mu_4^2)^2}{\mu_4^2(\mu_2^2v_6-2\mu_2\mu_4v_4+\mu_4^2v_2)}\right]^{1/7}, \qquad (12)
\]
which is a constant and free of unknown functions. Given Theorem 3, we can construct the asymptotically optimal bandwidth ĥ0,opt for the estimated gradient β̂LL(x). Specifically, one first obtains the bandwidth ĥ0,cubic by minimizing the feasible cross-validation objective function (8), and then multiplies ĥ0,cubic by the ratio (V1/V1,3)^{1/7} to obtain the corrected bandwidth ĥ0,opt. The corrected bandwidth ĥ0,opt is asymptotically equal to the infeasible bandwidth h0,opt in the sense that
\[
\frac{\hat{h}_{0,opt}}{h_{0,opt}}=\frac{(V_1/V_{1,3})^{1/7}\hat{h}_{0,cubic}}{h_{0,opt}}
=\frac{(V_1/V_{1,3})^{1/7}\big(h_{0,cubic}+o_p(h_{0,cubic})\big)}{h_{0,opt}}=1+o_p(1), \qquad (13)
\]
Fig. 1. Algorithm and flow chart.
where the second equality holds under regularity conditions resembling those in [6]. As shown in (13), the computation of the corrected bandwidth ĥ0,opt requires the ratio (V1/V1,3)^{1/7}. Since this ratio depends only on the kernel function, it can be calculated by direct algebraic operations for the commonly used kernels. For the Gaussian kernel, (V1/V1,3)^{1/7} ≈ 1.009 is very close to 1, indicating that hardly any adjustment by the ratio is needed, and a simple calculation gives (V1/V1,3)^{1/7} ≈ 0.852 for the Epanechnikov kernel. The ratio values for these two kernels are the same as those in [8], who discussed bandwidth selection for gradient estimation from local linear least squares regression, and the expression of the ratio (V1/V1,3)^{1/7} in (13) coincides with theirs as well. To aid understanding of the computational process, the algorithm and flow chart of the suggested method are depicted in Fig. 1. The optimization of (8) can be solved by a grid search. This approach needs a particular set of candidate bandwidths, and it is difficult to specify a reasonable set in theory. Alternatively, because a large bandwidth causes oversmoothing while a small bandwidth tends to produce undersmoothing, a rough set of bandwidths can be formed in light of these sensitivities. In the simulations of [3], hmin = (x(n) − x(1))/n and hmax = (x(n) − x(1))/2 were taken to construct the set of bandwidths, where x(1) and x(n) denote the first and last order statistics. A schematic implementation of the whole selection procedure is sketched below.
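The sketch that follows, again with our own illustrative function names and assuming the helpers check_loss, wquantile and cqr_grad_ll from the earlier sketch, implements the flow chart of Fig. 1: a local cubic CQR pilot as in (9), a grid search of (8) between hmin and hmax, and the kernel-dependent correction of Theorem 3 (about 1.009 for the Gaussian kernel used here, so the adjustment is minor).

```r
## Local cubic CQR pilot estimate of the gradient, as in (9); the intercepts
## are profiled out and the convex problem in (b1, b2, b3) is solved by optim().
cqr_grad_lcb <- function(x, y, x0, h, p = 9) {
  taus <- (1:p) / (p + 1)
  w <- dnorm((x - x0) / h) / h
  loss <- function(b) {
    r <- y - b[1] * (x - x0) - b[2] * (x - x0)^2 - b[3] * (x - x0)^3
    sum(sapply(taus, function(tk) {
      a <- wquantile(r, w, tk)
      sum(w * check_loss(r - a, tk))
    }))
  }
  optim(c(0, 0, 0), loss)$par[1]                      # b1 estimates beta(x0)
}

## Gradient-based cross-validation (8): grid search over h, followed by the
## correction (V1/V1,3)^{1/7} of Theorem 3 (roughly 1.009 for the Gaussian kernel).
gbcv_bandwidth <- function(x, y, p = 9, n_grid = 30, ratio = 1.009) {
  keep <- x >= quantile(x, 0.025) & x <= quantile(x, 0.975)   # trimming weight W(.)
  hs <- seq(diff(range(x)) / length(x), diff(range(x)) / 2, length.out = n_grid)
  cv <- sapply(hs, function(h)
    mean(sapply(x[keep], function(x0)
      (cqr_grad_ll(x, y, x0, h, p) - cqr_grad_lcb(x, y, x0, h, p))^2)))
  ratio * hs[which.min(cv)]                           # corrected bandwidth
}
```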
3. Simulation study

This section reports simulation studies that assess the finite sample performance of the gradient-based cross-validation method. For the local linear CQR estimates, we consider p = 5, 9, 19, 99, and the kernel function is chosen as the Gaussian kernel, i.e. K(u) = (2π)^{−1/2} exp(−u²/2). In practice, one may use other kernel functions, for instance the Uniform kernel K(u) = 0.5I(|u| ≤ 1), the Epanechnikov kernel K(u) = 0.75(1 − u²)I(|u| ≤ 1) and the Triangular kernel K(u) = (1 − |u|)I(|u| ≤ 1). Nevertheless, these kernels have nearly the same efficiency for local polynomial estimators [3], so the choice of the kernel function is not decisive.1 An advantage of the Gaussian kernel is its longer effective support, which exploits more of the information provided by the local data points around x. Moreover, with the Gaussian kernel the ratio (V1/V1,3)^{1/7} ≈ 1.009 is very close to 1, so there is essentially no need to adjust the bandwidth selected by the suggested method. Note that a comparison with the oracle setting is possible because the true gradient can be calculated exactly. The simulation was carried out in the R programming language with the package "quantreg" and was run on an Advanced Micro Devices machine with AMD 180–2.4 GHz CPU processors.

3.1. Example 1

Consider the following two data generating processes (DGP):
\[
\text{DGP1: } y_i = 1 + 0.5\sin(x_i) + [0.5x_i]\varepsilon_i, \qquad i = 1,\ldots,n,
\]
\[
\text{DGP2: } y_i = 1.5 + \ln(2 + x_i) + [0.8 + \sin(x_i)]\varepsilon_i, \qquad i = 1,\ldots,n,
\]
where the covariate xi is drawn from the uniform distribution on [−1, 1] and the error εi follows the normal distribution N(0, 1). The number of replications is 500 and the sample sizes are 100 and 200. For the weight function, we select W(xi) = I(q2.5% ≤ xi ≤ q97.5%) so as to trim the top and bottom 2.5% of the sample, where q2.5% and q97.5% are the 2.5% and 97.5% quantiles of the data, respectively. The true gradients are 0.5cos(xi) for DGP1 and 1/(2 + xi) for DGP2. For comparison, three ways of selecting the bandwidth are considered.
(i) Gradient-based cross-validation method (GBCV): the bandwidth h is selected by minimizing n⁻¹Σ_{i=1}^n [β̂LL(xi) − β̂LCB(xi)]², which is our suggested method.
(ii) Infeasible method (INFE): this method selects h by minimizing n⁻¹Σ_{i=1}^n [β̂LL(xi) − β(xi)]², where β(xi) is the true value of the gradient. This approach only serves as an optimal benchmark and is infeasible in practice.
(iii) Least squares cross-validation method (LSCV): this approach chooses h by minimizing
\[
\sum_{k=1}^{p}\sum_{i=1}^{n}\rho_{\tau_k}\big(y_i-\hat{m}^{(-i)}_{k,LL}(x_i)\big),
\]
where m̂^{(−i)}_{k,LL}(xi) is the leave-one-out estimator of m(xi) from the local linear CQR; a sketch of this criterion is given after this list.
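One way to read criterion (iii) in code is sketched below, assuming the helpers from Section 2; the k-th leave-one-out fitted value of m(xi) is taken here to be the k-th intercept of the local linear CQR fit, and the function name lscv_criterion is our own illustrative choice rather than the authors' implementation.

```r
## Sketch of the LSCV-type criterion in (iii): leave-one-out local linear CQR
## fits of m(.), scored by the composite check loss.
lscv_criterion <- function(x, y, h, p = 9) {
  taus <- (1:p) / (p + 1)
  sum(sapply(seq_along(x), function(i) {
    b  <- cqr_grad_ll(x[-i], y[-i], x[i], h, p)            # leave-one-out slope
    r  <- y[-i] - b * (x[-i] - x[i])
    w  <- dnorm((x[-i] - x[i]) / h) / h
    ak <- sapply(taus, function(tk) wquantile(r, w, tk))   # leave-one-out intercepts
    sum(sapply(seq_along(taus), function(k) check_loss(y[i] - ak[k], taus[k])))
  }))
}
```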
To evaluate the performance of each method, we compute the root mean squared error (RMSE), defined by
\[
\mathrm{RMSE}(\hat{\beta}^{\,j}_{LL})=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big[\hat{\beta}^{\,j}_{LL}(x_i)-\beta(x_i)\big]^2},
\]
where β̂^j_LL is the local linear composite quantile estimate of β = dm(·)/dx using the j-th method of bandwidth selection. The merits of the different estimates β̂^j_LL are measured through the ratio of root mean squared errors (RRMSE), that is,
\[
\mathrm{RRMSE}(\hat{\beta}^{\,j}_{LL})=\frac{\mathrm{RMSE}(\hat{\beta}^{\,j}_{LL})}{\mathrm{RMSE}(\hat{\beta}^{\,INFE}_{LL})}.
\]
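A compressed sketch of one replication of the DGP1 design, reusing the hypothetical helpers introduced above, illustrates how the RMSE and RRMSE entries of Tables 1 and 2 are obtained (only the GBCV selector and the infeasible benchmark are shown):

```r
## One replication of the DGP1 comparison (GBCV vs. the infeasible INFE benchmark).
set.seed(1)
n <- 100; p <- 9
x <- runif(n, -1, 1)
y <- 1 + 0.5 * sin(x) + 0.5 * x * rnorm(n)             # DGP1
beta_true <- function(x) 0.5 * cos(x)

keep <- x >= quantile(x, 0.025) & x <= quantile(x, 0.975)
rmse_for_h <- function(h)                              # RMSE of the gradient fit at h
  sqrt(mean(sapply(x[keep], function(x0)
    (cqr_grad_ll(x, y, x0, h, p) - beta_true(x0))^2)))

hs <- seq(diff(range(x)) / n, diff(range(x)) / 2, length.out = 20)
h_gbcv <- gbcv_bandwidth(x, y, p)                      # feasible selector
h_infe <- hs[which.min(sapply(hs, rmse_for_h))]        # infeasible benchmark
c(RMSE_GBCV = rmse_for_h(h_gbcv), RMSE_INFE = rmse_for_h(h_infe),
  RRMSE_GBCV = rmse_for_h(h_gbcv) / rmse_for_h(h_infe))
```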
Tables 1 and 2 report the simulation results for DGP1 and DGP2, respectively. As might be expected, the INFE method completely outperforms the GBCV and LSCV methods, and the GBCV method is clearly superior to the LSCV method for all p values and sample sizes. As a benchmark, the INFE approach always attains the smallest median of the RMSEs because it uses the true gradient β. Without this information, the medians of the RMSEs from the GBCV method are only slightly higher than those of the INFE method. In all cases considered, the LSCV method produces quite large medians of the RMSEs, indicating that its behavior is worse than that of the other two approaches. It should be pointed out, however, that the comparison here concerns the gradient function: the bandwidth selected by the LSCV method is not optimal for gradient estimation, although it is asymptotically efficient for the mean function m(x). The medians of the RRMSEs from GBCV are close to 1, which shows that the suggested method comes close to the INFE method in selecting the bandwidth for gradient estimation, whereas the LSCV method deviates severely from the INFE method, as its medians of the RRMSEs are huge.

1 A simulation for different kernels is carried out to reveal the effect of the selected kernel function on the resulting estimators. Because kernel selection is not the focus of this article, that simulation is reported in the Appendix (A.4).
Table 1. The medians of RMSEs and RRMSEs for DGP1.

p values | RMSE: INFE | RMSE: GBCV | RMSE: LSCV | RRMSE: GBCV | RRMSE: LSCV
n = 100
p = 5  | 0.0807 | 0.0953 | 0.3489 | 1.1171 | 16.4642
p = 9  | 0.0816 | 0.1000 | 0.3522 | 1.1998 | 17.4364
p = 19 | 0.0824 | 0.1001 | 0.3524 | 1.1999 | 17.2957
p = 99 | 0.0824 | 0.1006 | 0.3528 | 1.1697 | 17.5235
n = 200
p = 5  | 0.0773 | 0.0904 | 0.2566 | 1.2200 | 10.3730
p = 9  | 0.0765 | 0.0869 | 0.2546 | 1.1652 | 9.7639
p = 19 | 0.0770 | 0.0869 | 0.2561 | 1.1682 | 9.7862
p = 99 | 0.0771 | 0.0865 | 0.2540 | 1.1727 | 9.6980

Table 2. The medians of RMSEs and RRMSEs for DGP2.

p values | RMSE: INFE | RMSE: GBCV | RMSE: LSCV | RRMSE: GBCV | RRMSE: LSCV
n = 100
p = 5  | 0.1785 | 0.2370 | 0.9309 | 1.4289 | 33.6640
p = 9  | 0.1800 | 0.2321 | 0.9262 | 1.4627 | 30.7759
p = 19 | 0.1807 | 0.2278 | 0.9223 | 1.4614 | 30.1464
p = 99 | 0.1797 | 0.2264 | 0.9199 | 1.5061 | 30.6895
n = 200
p = 5  | 0.1523 | 0.2030 | 0.6065 | 1.5935 | 13.5687
p = 9  | 0.1502 | 0.2054 | 0.6277 | 1.8308 | 13.8200
p = 19 | 0.1514 | 0.2134 | 0.6174 | 1.6876 | 14.1797
p = 99 | 0.1503 | 0.2146 | 0.6169 | 1.6248 | 14.2338
3.2. Example 2

It is of interest to compare the gradient estimators of our method with those of [8]. To this end, the data are generated from
\[
y_i = 2 - 0.5x_i^2 + \sin(x_i) + \sigma(x_i)\varepsilon_i, \qquad i = 1,\ldots,n,
\]
where the covariate xi follows a uniform distribution on [0, π] and σ(xi) = (1 + cos(2xi))/4. We consider five error distributions for εi, ranging from symmetric to asymmetric: the normal distribution N(0,1), the t-distribution t(3), the Laplace distribution, the log-normal distribution and the mixture of normals 0.4N(0, 3²) + 0.6N(0, 2). The true gradient is β(xi) = −xi + cos(xi). We denote the suggested GBCV based on the local linear CQR estimator and the GBCV based on the local linear least squares estimator of [8] by β̂^GBCV_{LL,CQR} and β̂^GBCV_{LL,LS}, respectively. The performance of an estimator β̂(xi) is assessed by the mean squared error (MSE),
\[
\mathrm{MSE}(\hat{\beta})=\frac{1}{n}\sum_{i=1}^{n}\big[\hat{\beta}(x_i)-\beta(x_i)\big]^2,
\]
where β̂ equals either β̂^GBCV_{LL,CQR} or β̂^GBCV_{LL,LS}. The sample size is n = 150 and the number of simulations is 500.
Table 3 presents the medians and standard deviations of the MSEs over the 500 simulations. From Table 3, we find that the estimates β̂^GBCV_{LL,LS} outperform our estimates β̂^GBCV_{LL,CQR} when the error follows N(0,1). However, the estimates β̂^GBCV_{LL,CQR} are consistently superior to the estimates β̂^GBCV_{LL,LS} for the non-normal error distributions, because our method yields much lower medians and standard deviations than those given by Henderson et al. [8].

4. Empirical application

To further examine the finite sample performance, we use an illustrative example with real data to compare the least squares cross-validation (LSCV) and the suggested gradient-based cross-validation (GBCV) approaches. We apply a subset of the UK Family Expenditure Survey data with high net income, which consists of 300 observations collected in the UK Family Expenditure Survey in 1973; in a nonparametric framework, this data set was investigated by Kai et al. [11]. In this application, our interest is the relationship between food expenditure and net income. Consequently, we take the dependent variable y to be the logarithm of food-expenditure∗100 and the independent variable x to be the logarithm of net-income∗100.
Table 3. The medians and standard deviations of MSEs in Example 2 (standard deviations in parentheses).

Distribution | β̂^GBCV_{LL,LS} | β̂^GBCV_{LL,CQR}, p=5 | p=9 | p=19 | p=99
N(0,1)     | 0.0873 (0.1567) | 0.1350 (0.2271) | 0.1172 (0.2072) | 0.1196 (0.2094) | 0.1201 (0.2102)
t(3)       | 0.2476 (0.2937) | 0.1347 (0.2220) | 0.1663 (0.2495) | 0.1600 (0.2425) | 0.1603 (0.2426)
Laplace    | 0.1351 (0.2045) | 0.1029 (0.1816) | 0.1146 (0.1800) | 0.1096 (0.1705) | 0.1095 (0.1704)
Log-normal | 0.8517 (0.7353) | 0.4627 (0.2420) | 0.4522 (0.2427) | 0.4506 (0.2457) | 0.4490 (0.2462)
Mixture    | 0.3967 (0.4584) | 0.3197 (0.3941) | 0.3474 (0.3839) | 0.3702 (0.3964) | 0.3690 (0.3961)

Table 4. Descriptive statistics.

Statistics   | log(Food-expenditure∗100) | log(Net-income∗100)
Minimum      | -0.1366 | 1.0077
1st Quartile | 0.7425  | 1.2401
Median       | 0.8447  | 1.3039
3rd Quartile | 0.9283  | 1.3442
Maximum      | 1.0842  | 1.3697
Mean         | 0.8224  | 1.2801
Std.Dev.     | 0.1496  | 0.0789
Correlation  | 0.3243  |
Skewness     | -1.6599 | -1.1502
Kurtosis     | 6.6106  | 0.7140
Sample size  | 300     |
Then the specification of the nonparametric model is
\[
y = m(x) + \sigma(x)\varepsilon. \qquad (14)
\]
Table 4 displays the descriptive statistics of the two variables and shows a strong positive correlation between them. According to microeconomic theory, food expenditure should increase as net income increases. This monotonically increasing relationship indicates that the gradient function β(x) ≡ dm(x)/dx should be positive everywhere. Therefore, one expects that its estimate β̂LL(x), obtained from
\[
(\hat{m}(x),\hat{\beta}_{LL}(x))\equiv \arg\min_{a_1,\ldots,a_p,b}\;\sum_{k=1}^{p}\sum_{i=1}^{n}\rho_{\tau_k}\big(y_i-a_k-b(x_i-x)\big)K_{h,ix},
\]
should be greater than zero, i.e., β̂LL(x) > 0 for all x. To compare the performance of LSCV and GBCV, model (14) is estimated by local linear CQR with p = 5, 9, 19, and the bandwidths hLSCV and hGBCV are selected via LSCV and GBCV, respectively. Because the CQR estimates with the three values of p are very similar, we report only the CQR estimate with p = 9 in Fig. 2. Fig. 2(a) depicts the estimated regression functions. The estimated regression functions given by GBCV are much smoother than those of the LSCV method. This is reasonable because GBCV tends to produce a larger optimal bandwidth hGBCV than LSCV does, and a larger bandwidth implies a smoother nonparametric function estimate. Fig. 2(b) reports the estimated derivative functions. The estimated derivatives from the GBCV method are also much smoother than those based on the LSCV approach. Moreover, the derivative estimates from GBCV are all positive on the support of the independent variable x = log(Net-income*100), which supports the monotonically increasing link between income and expenditure in economic theory, whereas the derivative estimates from the LSCV method have two regions of negative values. This illustrative example indicates that, for local linear CQR, the conventional LSCV method is likely to choose a bandwidth that is too small rather than optimal for nonparametric derivative estimation, and the resulting insufficient smoothing produces large estimation deviations. In this sense, GBCV is preferable to LSCV for bandwidth selection whenever the marginal effects of the regressors, that is, the derivatives of the nonparametric function, are of primary interest. To further evaluate the suggested method, we use the same data set to compare the behavior of our GBCV based on local linear CQR estimators with the GBCV based on the local linear least squares estimators of [8]. Since the CQR estimates with different p are very similar, we only report the CQR results with p = 9 in Fig. 3. It can be seen from Fig. 3(a) that the overall patterns of the local linear least squares (LS) and the local linear CQR (CQR) estimates are the same, and the derivative estimates from both methods are all positive on the support of the independent variable.
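Assuming the survey data are held in a data frame fes with columns log_income and log_food (hypothetical names for the transformed variables), the derivative curve of Fig. 2 can be reproduced along the following lines with the helpers sketched in Section 2; this is an illustrative sketch rather than the authors' code.

```r
## Sketch of the Fig. 2 derivative estimate on the (hypothetical) data frame `fes`.
x <- fes$log_income; y <- fes$log_food
grid <- seq(quantile(x, 0.025), quantile(x, 0.975), length.out = 100)

h_gbcv <- gbcv_bandwidth(x, y, p = 9)                  # about 0.1 in the paper
beta_hat <- sapply(grid, function(x0) cqr_grad_ll(x, y, x0, h_gbcv, 9))

plot(grid, beta_hat, type = "l",
     xlab = "log(Net-income*100)", ylab = "estimated derivative")
abline(h = 0, lty = 2)                                 # the estimate should stay positive
```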
[Fig. 2. Estimated regression functions and their derivatives by local linear CQR with p = 9. The selected bandwidths are hLSCV = 0.024 and hGBCV = 0.1 by the LSCV and GBCV methods, respectively; (a) the scatter plot of the data and the estimated regression functions; (b) the estimated derivative functions.]
Now we demonstrate the robustness of the CQR method. From the scatter plot, there are three possible outlier observations: (1.2005, 0.2014), (1.2626, 0.2148) and (1.3479, -0.1366) (circled in the plot). To investigate their influence, we artificially perturb the data set by moving these three outliers to (1.2005, -1.0070), (1.2626, -1.0742) and (1.3479, -0.6834), respectively. After distorting the three observations, we refit the data with both the local linear LS and the local linear CQR procedures; see Fig. 3(b). Although the derivative estimates from the local linear LS remain positive on the support of the independent variable, they change dramatically in the presence of these three artificial outliers, whereas the local linear CQR estimates are nearly unaffected. Furthermore, as a more extreme demonstration, the three outliers are moved from (1.2005, 0.2014), (1.2626, 0.2148) and (1.3479, -0.1366) to (1.2005, -5.0350), (1.2626, -5.3700) and (1.3479, -3.4150), respectively, and the gradients are re-estimated via the local linear LS and the local linear CQR methods. The resulting estimates are presented in Fig. 3(c): the proposed CQR estimates are almost unaffected by the artificial distortion, whereas the LS estimates change markedly and take some negative values. In summary, our approach is more robust than the method offered by Henderson et al. [8].

[Fig. 3. Estimated derivative functions by local linear LS and local linear CQR with p = 9. The selected bandwidths are hLS = 0.1 and hCQR = 0.1 for the LS and CQR methods, respectively; (a) the estimated derivative functions; (b) the estimated derivative functions after moving the three possible outliers to the extreme case; (c) the estimated derivative functions after moving the three possible outliers to the more extreme case.]

5. Conclusion and discussion

This paper proposes a GBCV approach to compute the optimal bandwidth for local linear composite quantile regression. Different from the traditional LSCV, which selects the bandwidth for the mean function, this article focuses on the estimation of the gradient function because of its importance for the study of marginal effects. The bandwidth selection procedure is proved to be asymptotically optimal, in the sense that the gradient estimator attains the optimal convergence rate with the selected bandwidth. The results of both the simulation and the empirical study reveal that the proposed GBCV is better than the LSCV method, which leads to under-smoothed derivative estimates, and outperforms the local linear LS method in [8], which gives rise to non-robust derivative estimates. Although our methodology provides a general framework for selecting the optimal bandwidth of gradient estimation in robust regression, there are still some limitations of the proposed techniques. Here, the applicable range of the approach is discussed and possible directions for future research are put forward. Our model relies on a stationarity assumption that may be restrictive in practical applications; non-stationary factors are common when examining economic and financial issues over time. In the work of [15], the local linear CQR estimator was considered for model (1) with the regressor following Harris recurrent Markov
processes, covering both stationary and non-stationary cases. A natural extension of our bandwidth selection is therefore to allow the regressor to be generated by Harris recurrent Markov processes. Moreover, the local linear CQR employs uniform weights, which is only appropriate for symmetric errors; the resulting estimator incurs a large bias when the symmetry condition fails. In theory, an even more efficient estimator is obtained when nonuniform weights are used for the different quantile loss functions. Sun et al. [22] discussed the weighted local linear CQR for model (1), which accommodates symmetric and asymmetric random errors. It would be of great interest to generalize the bandwidth selection procedure to their method so as to further enhance the estimation efficiency.

Acknowledgement

We would like to thank the Editor and referees for detailed and insightful comments and suggestions that led to a substantial improvement of the article. Thanks also go to Yu Ma, Bo Xin, Shuchang Wu, Fengqin Yu, Guangxian Zhang, Jun Lian and Xu
Wang for helpful comments and discussion. This research was supported by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (Grant No. 16YJC790040).

Appendix

This section proves the theoretical results for the local composite quantile gradient estimators. To facilitate the derivations, we need some notation. Define uk = √(nh)(ak − m(x) − σ(x)ck), χi = (xi − x)/h and ηi,k = I(εi − ck ≤ 0) − τk. Furthermore, let ei,j be the unit vector of length j with 1 at position i.

A.1. Proof of Theorem 1

The minimization problem in (3) is equivalent to
\[
(\hat{a}_1,\ldots,\hat{a}_p,\hat{b})=\arg\min_{a_1,\ldots,a_p,b}G(a_1,\ldots,a_p,b)
=\arg\min_{a_1,\ldots,a_p,b}\sum_{k=1}^{p}\sum_{i=1}^{n}\rho_{\tau_k}\big(y_i-a_k-b(x_i-x)\big)K_{h,ix}, \qquad (15)
\]
where β̂LL(x) = b̂ is the local linear composite quantile gradient estimate of β(x). Let bh = b·h, Δ = √(nh)(bh − hm'(x)) and Δi,k = (uk + Δχi)/√(nh), and set di,k = ck(σ(xi) − σ(x)) + ri with ri = m(xi) − m(x) − hm'(x)χi. Then it follows that yi − ak − b(xi − x) = σ(xi)(εi − ck) + di,k − Δi,k. If we define θx = (u1, . . . , up, Δ)' and η*i,k = I(εi ≤ ck − di,k/σ(xi)) − τk, then by applying the Knight identity [13]
\[
\rho_{\tau_k}(u-v)-\rho_{\tau_k}(u)=v\{I(u\le 0)-\tau_k\}+\int_{0}^{v}\{I(u\le z)-I(u\le 0)\}\,dz, \qquad (16)
\]
the objective function G (a1 , . . . , a p , b) becomes
\[
L(\theta_x)=\sum_{k=1}^{p}u_k\frac{1}{\sqrt{nh}}\sum_{i=1}^{n}K_{h,ix}\eta^{*}_{i,k}
+\Delta\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K_{h,ix}\chi_i\eta^{*}_{i,k}
+\sum_{k=1}^{p}B_{n,k}(\theta_x), \qquad (17)
\]
where
\[
B_{n,k}(\theta_x)=\sum_{i=1}^{n}K_{h,ix}\int_{0}^{\Delta_{i,k}}\Big[I\Big(\varepsilon_i\le c_k-\frac{d_{i,k}}{\sigma(x_i)}+\frac{z}{\sigma(x_i)}\Big)-I\Big(\varepsilon_i\le c_k-\frac{d_{i,k}}{\sigma(x_i)}\Big)\Big]dz.
\]
Moreover, denote by Sn the (p + 1) × (p + 1) matrix whose (k, k) entry is (ϕk/n)Σ_{i=1}^n Kh,ix/σ(xi) for k = 1, . . . , p, whose (k, p + 1) and (p + 1, k) entries are (ϕk/n)Σ_{i=1}^n Kh,ix χi/σ(xi), whose (p + 1, p + 1) entry is (ϕ/n)Σ_{i=1}^n Kh,ix χi²/σ(xi), and whose remaining entries are zero; and let W*n = (W*11, . . . , W*1p, W*21)', where
\[
W^{*}_{1k}=\frac{1}{\sqrt{nh}}\sum_{i=1}^{n}K(\chi_i)\eta^{*}_{i,k},\quad k=1,\ldots,p,
\qquad
W^{*}_{21}=\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i\eta^{*}_{i,k}.
\]
Under Assumptions 1–3, we follow the proof of Lemma 3 in [11] and have
\[
L(\theta_x)=\tfrac{1}{2}\theta_x' S_n\theta_x+(W^{*}_n)'\theta_x+o_P(1). \qquad (18)
\]
According to (18), the first order condition for θx is
\[
S_n\hat{\theta}_x=-W^{*}_n+o_P(1). \qquad (19)
\]
In addition, write Wn = (W11, . . . , W1p, W21)', where
\[
W_{1k}=\frac{1}{\sqrt{nh}}\sum_{i=1}^{n}K(\chi_i)\eta_{i,k},\quad k=1,\ldots,p,
\qquad
W_{21}=\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i\eta_{i,k}.
\]
Then identity (19) can be rewritten as
\[
S_n\hat{\theta}_x=-\big[(W^{*}_n-W_n)-E\big((W^{*}_n-W_n)\,|\,X\big)\big]-E\big((W^{*}_n-W_n)\,|\,X\big)-W_n+o_P(1). \qquad (20)
\]
It can be proved that Var(W*n − Wn|X) = oP(1). Together with E(Wn|X) = 0, (20) reduces to
\[
S_n\hat{\theta}_x=-E(W^{*}_n|X)-W_n+o_P(1). \qquad (21)
\]
Define S as the (p + 1) × (p + 1) matrix with (k, k) entry ϕk f(x)/σ(x) for k = 1, . . . , p, with (k, p + 1) and (p + 1, k) entries ϕk hμ2 f'(x)/σ(x), with (p + 1, p + 1) entry ϕμ2 f(x)/σ(x) and zeros elsewhere, and define
\[
D=\Big(-\frac{h^2\varphi_1\mu_2 f(x)m''(x)}{2\sigma(x)},\;\ldots,\;-\frac{h^2\varphi_p\mu_2 f(x)m''(x)}{2\sigma(x)},\;-\frac{h^3\varphi\mu_4 f'(x)m''(x)}{2\sigma(x)}\Big)'.
\]
It follows from standard kernel estimation uniform convergence arguments that Sn = S + o(h) and (nh)^{-1/2}E(W*n|X) = D + o(h) under Assumptions 4 and 5. Through simple matrix calculation, this leads to S + o(h) = Ax + hQx + o(h), where Ax = f(x)σ⁻¹(x)diag(ϕ1, . . . , ϕp, ϕμ2) and
Qx is the (p + 1) × (p + 1) matrix with (k, p + 1) and (p + 1, k) entries μ2 f'(x)ϕk/σ(x) for k = 1, . . . , p and zeros elsewhere. Making use of the identity {Ax + hQx + o(h)}⁻¹ = Ax⁻¹ − hAx⁻¹QxAx⁻¹ + o(h), the first order condition can be decomposed as
\[
\hat{\theta}_x=-\sqrt{nh}\,[A_x^{-1}-hA_x^{-1}Q_xA_x^{-1}]D-[A_x^{-1}-hA_x^{-1}Q_xA_x^{-1}]W_n+(s.o.). \qquad (22)
\]
After tedious computations, Ax⁻¹ − hAx⁻¹QxAx⁻¹ is found to be the (p + 1) × (p + 1) matrix with (k, k) entry σ(x)/(ϕk f(x)) for k = 1, . . . , p, with (p + 1, p + 1) entry σ(x)/(ϕμ2 f(x)), with (k, p + 1) and (p + 1, k) entries −hσ(x)f'(x)/(ϕf²(x)), and with zeros elsewhere.
Therefore, the leading bias of the local linear composite quantile derivative estimator is
\[
\mathrm{Bias}_0(\hat{\beta}_{LL}(x))=E[\hat{\beta}_{LL}(x)-m'(x)]=h^{-1}E[\hat{b}_h-hm'(x)]
=\frac{E[\hat{\Delta}]}{\sqrt{nh^3}}
=-h^{-1}e'_{(p+1),(p+1)}[A_x^{-1}-hA_x^{-1}Q_xA_x^{-1}]D
=h^2\,\frac{(\mu_4-\mu_2^2)f'(x)m''(x)}{2\mu_2 f(x)}\equiv h^2B_1(x),
\]
and the leading variance of the gradient estimator is
\[
\mathrm{Var}_0(\hat{\beta}_{LL}(x))=\mathrm{Var}\big(e'_{(p+1),(p+1)}\delta_h\hat{\theta}_x\big)
=\frac{1}{nh^3}\mathrm{Var}\big(e'_{(p+1),(p+1)}A_x^{-1}W_n\big)
=\frac{1}{nh^3}E\Big[\Big(\frac{\sigma(x)}{\varphi\mu_2 f(x)}\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i\eta_{i,k}\Big)^2\Big]
=\frac{\sigma^2(x)}{n^2h^4\varphi^2\mu_2^2 f^2(x)}E\Big[\Big(\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i\eta_{i,k}\Big)^2\Big]
=\frac{1}{nh^3}\,\frac{v_2\lambda(p)\sigma^2(x)}{\varphi^2\mu_2^2 f(x)}=\frac{1}{nh^3}V_1\Omega(x),
\]
where δh = diag(1, . . . , 1, 1/√(nh³)) is a (p + 1) × (p + 1) diagonal matrix.
A.2. Proof of Theorem 2

The minimization problem in (9) can be written as
\[
(\hat{a}_1,\ldots,\hat{a}_p,\hat{b}_1,\hat{b}_2,\hat{b}_3)
=\arg\min_{a_1,\ldots,a_p,b_1,b_2,b_3}G(a_1,\ldots,a_p,b_1,b_2,b_3)
=\arg\min_{a_1,\ldots,a_p,b_1,b_2,b_3}\sum_{k=1}^{p}\sum_{i=1}^{n}\rho_{\tau_k}\big(y_i-a_k-b_1(x_i-x)-b_2(x_i-x)^2-b_3(x_i-x)^3\big)K_{h,ix}, \qquad (23)
\]
where b̂1, b̂2 and b̂3 correspond to the estimates of the first, second and third derivatives. Denote b1h = b1·h, b2h = b2·h², b3h = b3·h³, Δ1 = √(nh)(b1h − hm'(x)), Δ2 = √(nh)(b2h − h²m''(x)/2), Δ3 = √(nh)(b3h − h³m'''(x)/6), and Δi,k = (uk + Δ1χi + Δ2χi² + Δ3χi³)/√(nh). Besides, let di,k = ck(σ(xi) − σ(x)) + ri, where ri = m(xi) − m(x) − hm'(x)χi − h²m''(x)χi²/2 − h³m'''(x)χi³/6. Therefore, yi − ak − b1(xi − x) − b2(xi − x)² − b3(xi − x)³ = σ(xi)(εi − ck) + di,k − Δi,k holds. Write ϑx = (u1, . . . , up, Δ1, Δ2, Δ3)' and η*i,k = I(εi ≤ ck − di,k/σ(xi)) − τk. By identity (16), G(a1, . . . , ap, b1, b2, b3) can be rewritten as
\[
L(\vartheta_x)=\sum_{k=1}^{p}u_k\frac{1}{\sqrt{nh}}\sum_{i=1}^{n}K_{h,ix}\eta^{*}_{i,k}
+\sum_{j=1}^{3}\Delta_j\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K_{h,ix}\chi_i^{j}\eta^{*}_{i,k}
+\sum_{k=1}^{p}B_{n,k}(\vartheta_x), \qquad (24)
\]
where
\[
B_{n,k}(\vartheta_x)=\sum_{i=1}^{n}K_{h,ix}\int_{0}^{\Delta_{i,k}}\Big[I\Big(\varepsilon_i\le c_k-\frac{d_{i,k}}{\sigma(x_i)}+\frac{z}{\sigma(x_i)}\Big)-I\Big(\varepsilon_i\le c_k-\frac{d_{i,k}}{\sigma(x_i)}\Big)\Big]dz.
\]
Furthermore, define Sn as the (p + 3) × (p + 3) matrix whose upper-left p × p block is diag((ϕk/n)Σ_{i=1}^n Kh,ix/σ(xi), k = 1, . . . , p), whose (k, p + j) and (p + j, k) entries (k = 1, . . . , p; j = 1, 2, 3) are (ϕk/n)Σ_{i=1}^n Kh,ix χi^j/σ(xi), and whose lower-right 3 × 3 block has (j, j') entry (ϕ/n)Σ_{i=1}^n Kh,ix χi^{j+j'}/σ(xi); and let W̃*n = (W*11, . . . , W*1p, W*21, W*22, W*23)', where
\[
W^{*}_{1k}=\frac{1}{\sqrt{nh}}\sum_{i=1}^{n}K(\chi_i)\eta^{*}_{i,k},\quad k=1,\ldots,p,
\qquad
W^{*}_{2j}=\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i^{j}\eta^{*}_{i,k},\quad j=1,2,3.
\]
Similarly, under Assumptions 1–3 we have
\[
L(\vartheta_x)=\tfrac{1}{2}\vartheta_x' S_n\vartheta_x+(\widetilde{W}^{*}_n)'\vartheta_x+o_P(1), \qquad (25)
\]
and from (25) the first order condition for ϑx is
\[
S_n\hat{\vartheta}_x=-\widetilde{W}^{*}_n+o_P(1). \qquad (26)
\]
Denote W̃n = (W11, . . . , W1p, W21, W22, W23)', where
\[
W_{1k}=\frac{1}{\sqrt{nh}}\sum_{i=1}^{n}K(\chi_i)\eta_{i,k},\quad k=1,\ldots,p,
\qquad
W_{2j}=\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i^{j}\eta_{i,k},\quad j=1,2,3.
\]
Then the first order condition (26) turns into
\[
S_n\hat{\vartheta}_x=-E(\widetilde{W}^{*}_n|X)-\widetilde{W}_n+o_P(1). \qquad (27)
\]
By standard kernel estimation uniform convergence arguments, Sn = S + o(h) and (nh)^{-1/2}E(W̃*n|X) = D + o(h) under Assumptions 4 and 5, where S is 1/σ(x) times the (p + 3) × (p + 3) matrix with upper-left block diag(ϕ1 f(x), . . . , ϕp f(x)), with (k, p + 1), (k, p + 2) and (k, p + 3) entries ϕk hμ2 f'(x), ϕk μ2 f(x) and ϕk hμ4 f'(x), respectively (and symmetrically for the rows p + 1, p + 2, p + 3), and with lower-right 3 × 3 block
\[
\begin{pmatrix}
\varphi\mu_2 f(x) & \varphi h\mu_4 f'(x) & \varphi\mu_4 f(x)\\
\varphi h\mu_4 f'(x) & \varphi\mu_4 f(x) & \varphi h\mu_6 f'(x)\\
\varphi\mu_4 f(x) & \varphi h\mu_6 f'(x) & \varphi\mu_6 f(x)
\end{pmatrix},
\]
and where
\[
D=\Big(-\frac{h^4\varphi_1\mu_4 f(x)m^{(4)}(x)}{24\sigma(x)},\;\ldots,\;-\frac{h^4\varphi_p\mu_4 f(x)m^{(4)}(x)}{24\sigma(x)},\;
-\frac{h^5\varphi\mu_6 f'(x)m^{(4)}(x)}{24\sigma(x)},\;
-\frac{h^4\varphi\mu_6 f(x)m^{(4)}(x)}{24\sigma(x)},\;
-\frac{h^5\varphi\mu_8 f'(x)m^{(4)}(x)}{24\sigma(x)}\Big)'.
\]
Likewise, we have
S + o(h) = Ax + hQx + o(h), where Ax is (f(x)/σ(x)) times the (p + 3) × (p + 3) matrix with upper-left block diag(ϕ1, . . . , ϕp), with (k, p + 2) and (p + 2, k) entries ϕk μ2 (k = 1, . . . , p), with lower-right 3 × 3 block
\[
\begin{pmatrix}
\varphi\mu_2 & 0 & \varphi\mu_4\\
0 & \varphi\mu_4 & 0\\
\varphi\mu_4 & 0 & \varphi\mu_6
\end{pmatrix}
\]
and zeros elsewhere, and Qx is (f'(x)/σ(x)) times the (p + 3) × (p + 3) matrix with (k, p + 1) and (p + 1, k) entries ϕk μ2, with (k, p + 3) and (p + 3, k) entries ϕk μ4 (k = 1, . . . , p), with lower-right 3 × 3 block
\[
\begin{pmatrix}
0 & \varphi\mu_4 & 0\\
\varphi\mu_4 & 0 & \varphi\mu_6\\
0 & \varphi\mu_6 & 0
\end{pmatrix}
\]
and zeros elsewhere.
According to the identity {Ax + hQx + o(h)}⁻¹ = Ax⁻¹ − hAx⁻¹QxAx⁻¹ + o(h), the first order condition can be re-expressed as
\[
\hat{\vartheta}_x=-\sqrt{nh}\,[A_x^{-1}-hA_x^{-1}Q_xA_x^{-1}]D-[A_x^{-1}-hA_x^{-1}Q_xA_x^{-1}]\widetilde{W}_n+(s.o.). \qquad (28)
\]
If we define Z = ϕ(μ2² − μ4) and M = ϕ²(μ2μ6 − μ4²), then the inverse of Ax is (σ(x)/f(x)) times the (p + 3) × (p + 3) matrix with (k, k) entries (Z − ϕkμ2²)/(ϕkZ), with (k, k') entries −μ2²/Z for k ≠ k' (k, k' = 1, . . . , p), with (k, p + 2) and (p + 2, k) entries μ2/Z, with (p + 2, p + 2) entry −1/Z, with (p + 1, p + 1) entry ϕμ6/M, with (p + 1, p + 3) and (p + 3, p + 1) entries −ϕμ4/M, with (p + 3, p + 3) entry ϕμ2/M, and with zeros elsewhere.
Consequently, the leading bias of the local cubic composite quantile derivative estimator is
\[
\mathrm{Bias}_0(\hat{\beta}_{LCB}(x))=h^{-1}e'_{(p+1),(p+3)}[A_x^{-1}-hA_x^{-1}Q_xA_x^{-1}]D
=O(h^{-1})O(1)O(h^5)=O(h^4),
\]
and the leading variance of the local cubic composite quantile derivative estimator is
\[
\mathrm{Var}_0(\hat{\beta}_{LCB}(x))=\mathrm{Var}\big(e'_{(p+1),(p+3)}\delta_h\hat{\vartheta}_x\big)
=\frac{1}{nh^3}\mathrm{Var}\big(e'_{(p+1),(p+3)}A_x^{-1}\widetilde{W}_n\big)
=\frac{1}{nh^3}\mathrm{Var}\Big(\frac{\varphi\mu_6\sigma(x)}{Mf(x)}\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i\eta_{i,k}
-\frac{\varphi\mu_4\sigma(x)}{Mf(x)}\frac{1}{\sqrt{nh}}\sum_{k=1}^{p}\sum_{i=1}^{n}K(\chi_i)\chi_i^{3}\eta_{i,k}\Big)
\]
\[
=\frac{\sigma^2(x)}{nh^4}E\Big[\Big(\frac{\varphi\mu_6}{Mf(x)}\sum_{k=1}^{p}K(\chi_i)\chi_i\eta_{i,k}\Big)^2\Big]
+\frac{\sigma^2(x)}{nh^4}E\Big[\Big(\frac{\varphi\mu_4}{Mf(x)}\sum_{k=1}^{p}K(\chi_i)\chi_i^{3}\eta_{i,k}\Big)^2\Big]
-\frac{2\sigma^2(x)\varphi^2\mu_4\mu_6}{nh^4(Mf(x))^2}\sum_{k,k'}E\big[K^2(\chi_i)\chi_i^{4}\eta_{i,k}\eta_{i,k'}\big]
\]
\[
=\frac{1}{nh^3}\,\frac{\lambda(p)(\mu_6^2v_2+\mu_4^2v_6-2\mu_4\mu_6v_4)}{M(\mu_2\mu_6-\mu_4^2)}\,\frac{\sigma^2(x)}{f(x)}
=\frac{1}{nh^3}\,\frac{\lambda(p)(\mu_6^2v_2+\mu_4^2v_6-2\mu_4\mu_6v_4)}{\varphi^2(\mu_2\mu_6-\mu_4^2)^2}\,\frac{\sigma^2(x)}{f(x)}
=\frac{1}{nh^3}V_2\Omega(x),
\]
where δh = diag(1, . . . , 1, 1/√(nh³), 1/√(nh⁵), 1/√(nh⁷)) is a (p + 3) × (p + 3) diagonal matrix. For the first expectation,
\[
E\Big[\Big(\frac{\varphi\mu_6}{Mf(x)}\sum_{k=1}^{p}K(\chi_i)\chi_i\eta_{i,k}\Big)^2\Big]
=\frac{\varphi^2\mu_6^2}{(Mf(x))^2}\sum_{k,k'}E\big[E(\eta_{i,k}\eta_{i,k'}\,|\,x_i)K^2(\chi_i)\chi_i^2\big]
=\frac{\lambda(p)\varphi^2\mu_6^2}{(Mf(x))^2}\int K^2\Big(\frac{x_i-x}{h}\Big)\Big(\frac{x_i-x}{h}\Big)^2 f(x_i)\,dx_i
=\frac{h\lambda(p)\varphi^2\mu_6^2}{(Mf(x))^2}\int K^2(s)s^2f(sh+x)\,ds
=\frac{h\lambda(p)\varphi^2\mu_6^2 v_2 f(x)}{(Mf(x))^2}
=\frac{h\lambda(p)\mu_6^2 v_2}{\varphi^2(\mu_2\mu_6-\mu_4^2)^2 f(x)},
\]
and the other two expectations can be obtained in the same way. Moreover, the leading covariance is
\[
\mathrm{Cov}_0(\hat{\beta}_{LL}(x),\hat{\beta}_{LCB}(x))
=\frac{1}{nh^3}\mathrm{Cov}\big(e'_{(p+1),(p+1)}A_x^{-1}W_n,\;e'_{(p+1),(p+3)}A_x^{-1}\widetilde{W}_n\big)
=\frac{1}{nh^4}\Big[\frac{\mu_6\sigma^2(x)}{\mu_2 Mf^2(x)}\sum_{k,k'}E\big(K^2(\chi_i)\chi_i^2\eta_{i,k}\eta_{i,k'}\big)
-\frac{\mu_4\sigma^2(x)}{\mu_2 Mf^2(x)}\sum_{k,k'}E\big(K^2(\chi_i)\chi_i^4\eta_{i,k}\eta_{i,k'}\big)\Big]
\]
\[
=\frac{\lambda(p)\sigma^2(x)}{nh^3}\Big[\frac{\mu_6v_2}{\mu_2 Mf(x)}-\frac{\mu_4v_4}{\mu_2 Mf(x)}\Big]
=\frac{1}{nh^3}\,\frac{\lambda(p)(\mu_6v_2-\mu_4v_4)}{\varphi^2\mu_2(\mu_2\mu_6-\mu_4^2)}\,\frac{\sigma^2(x)}{f(x)}
=\frac{1}{nh^3}V_3\Omega(x).
\]
This completes the proof of the theorem.
Table 5. The simulation results for different kernels: bias of the gradient estimate at the indicated points.

Kernel function | -2.0 | -1.5 | -1.0 | -0.5 | 0.0 | 0.5 | 1.0 | 1.5 | 2.0
p = 5
Gaussian     | -0.8679 | 0.3235 | 0.6527 | -0.0724 | -0.1126 | -0.0875 | 0.4130 | -0.2361 | -1.0665
Uniform      | -0.7423 | 0.2882 | 0.5042 | -0.080  | -0.1541 | 0.0664  | 0.4170 | -0.0563 | -0.9971
Epanechnikov | -1.3722 | 0.2831 | 0.6717 | -0.1145 | -0.0881 | -0.0482 | 0.3761 | -0.3266 | -1.0665
Triangular   | -1.5713 | 0.2841 | 0.6509 | -0.0951 | -0.0790 | -0.0589 | 0.3435 | -0.3865 | -1.0873
p = 9
Gaussian     | -0.4882 | 0.3066 | 0.6053 | -0.0829 | -0.1144 | -0.0979 | 0.3925 | -0.2359 | -1.0722
Uniform      | -0.9194 | 0.2882 | 0.4957 | -0.0804 | -0.1533 | 0.0553  | 0.4071 | -0.1611 | -0.8389
Epanechnikov | -1.3361 | 0.2785 | 0.6193 | -0.0951 | -0.0835 | -0.0372 | 0.3914 | -0.3266 | -1.0665
Triangular   | -1.5755 | 0.1719 | 0.6725 | -0.0884 | -0.0748 | -0.0476 | 0.3832 | -0.3927 | -1.0464
p = 19
Gaussian     | -0.7423 | 0.3028 | 0.6053 | -0.0947 | -0.1152 | -0.1090 | 0.4023 | -0.2227 | -1.0465
Uniform      | -1.0907 | 0.2831 | 0.4780 | -0.0874 | -0.1543 | 0.0553  | 0.3965 | -0.1155 | -0.8400
Epanechnikov | -1.3616 | 0.2785 | 0.5891 | -0.0982 | -0.0867 | -0.0255 | 0.3977 | -0.3228 | -1.0622
Triangular   | -1.5567 | 0.1809 | 0.6496 | -0.0838 | -0.0790 | -0.0482 | 0.4071 | -0.3667 | -1.0915
p = 99
Gaussian     | -0.7359 | 0.2882 | 0.5977 | -0.0884 | -0.1163 | -0.1088 | 0.4031 | -0.2128 | -1.0613
Uniform      | -1.0886 | 0.2841 | 0.4761 | -0.0902 | -0.1539 | 0.0617  | 0.3784 | -0.1187 | -0.8389
Epanechnikov | -1.3722 | 0.2815 | 0.6025 | -0.0979 | -0.0868 | -0.0261 | 0.4023 | -0.3167 | -0.9971
Triangular   | -1.6215 | 0.2177 | 0.6455 | -0.1084 | -0.0812 | -0.0475 | 0.4119 | -0.3865 | -1.0655
A.3. Proof of Theorem 3 Substituting the leading bias and variance of βˆLL (x ) into the leading term of integrated mean square error, the crossvalidation criterion CV0 (h) equals
CV0 (h ) =
= =
MSE0 βˆLL (x ) − m (x ) W (x ) f (x )dx
Bias0 (βˆLL (x ))
h4 B21 (x ) +
2
+ Var0 (βˆLL (x )) W (x ) f (x )dx
V1 Ω (x ) W (x ) f (x )dx, nh3
where W(x) denotes a weight function with bounded support so as to trim out data near the boundary of the support of x. The optimal bandwidth defined by h0,opt is the value of h that minimizes CV0 (h), thereby
$$
h_{0,opt}=\left[\frac{3V_{1}\int\sigma^{2}(x)W(x)\,dx}{4\int B_{1}^{2}(x)f(x)W(x)\,dx}\right]^{1/7}n^{-1/7}.
$$
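For completeness, the first-order condition behind this expression (a standard calculation left implicit in the text, using $\Omega(x)=\sigma^{2}(x)/f(x)$) is

$$
\frac{d}{dh}\left\{h^{4}\!\int B_{1}^{2}(x)f(x)W(x)\,dx+\frac{V_{1}}{nh^{3}}\!\int\sigma^{2}(x)W(x)\,dx\right\}
=4h^{3}\!\int B_{1}^{2}fW\,dx-\frac{3V_{1}}{nh^{4}}\!\int\sigma^{2}W\,dx=0,
$$

so that $h^{7}=3V_{1}\int\sigma^{2}W\,dx\big/\big(4n\int B_{1}^{2}fW\,dx\big)$, which yields the $n^{-1/7}$ rate above.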
From the formula above, we find that $h_{0,opt}$ is infeasible since $B_{1}(x)$ involves unknown quantities, such as the density f(x), its derivative and higher-order derivatives of m(x). To provide a feasible bandwidth, we replace the unknown gradient $m'(x)$ in $\mathrm{MSE}_{0}$ with the local cubic composite quantile derivative estimate $\hat\beta_{LCB}(x)$ and obtain
$$
\begin{aligned}
CV_{0,LCB}(h)&=\int \mathrm{MSE}_{0}\big[\hat\beta_{LL}(x)-\hat\beta_{LCB}(x)\big]\,W(x)f(x)\,dx\\
&=\int\Big\{\big[\mathrm{Bias}_{0}\big(\hat\beta_{LL}(x)-\hat\beta_{LCB}(x)\big)\big]^{2}+\mathrm{Var}_{0}\big(\hat\beta_{LL}(x)-\hat\beta_{LCB}(x)\big)\Big\}\,W(x)f(x)\,dx\\
&=\int\Big\{\big[\mathrm{Bias}_{0}\big(\hat\beta_{LL}(x)\big)\big]^{2}+\mathrm{Var}_{0}\big(\hat\beta_{LL}(x)\big)+\mathrm{Var}_{0}\big(\hat\beta_{LCB}(x)\big)
-2\,\mathrm{Cov}_{0}\big(\hat\beta_{LL}(x),\hat\beta_{LCB}(x)\big)\Big\}\,W(x)f(x)\,dx\\
&=\int\Big\{h^{4}B_{1}^{2}(x)+\frac{V_{1,3}}{nh^{3}}\,\Omega(x)\Big\}\,W(x)f(x)\,dx,
\end{aligned}
$$

where
$$
\begin{aligned}
V_{1,3}&=V_{1}-2V_{3}+V_{2}
=\frac{\lambda(p)v_{2}}{\varphi^{2}\mu_{2}^{2}}
-\frac{2\lambda(p)\big(\mu_{6}v_{2}-\mu_{4}v_{4}\big)}{\varphi^{2}\mu_{2}\big(\mu_{2}\mu_{6}-\mu_{4}^{2}\big)}
+\frac{\lambda(p)\big(\mu_{6}^{2}v_{2}+\mu_{4}^{2}v_{6}-2\mu_{4}\mu_{6}v_{4}\big)}{\varphi^{2}\big(\mu_{2}\mu_{6}-\mu_{4}^{2}\big)^{2}}\\
&=\frac{\lambda(p)}{\varphi^{2}}\,\frac{\mu_{4}^{2}\big(\mu_{2}^{2}v_{6}-2\mu_{2}\mu_{4}v_{4}+\mu_{4}^{2}v_{2}\big)}{\mu_{2}^{2}\big(\mu_{2}\mu_{6}-\mu_{4}^{2}\big)^{2}}.
\end{aligned}
$$

Fig. 4. Estimated gradient functions by local linear CQR with p = 9. (a) The estimated gradient functions with Gaussian kernel; (b) The estimated gradient functions with Uniform kernel; (c) The estimated gradient functions with Epanechnikov kernel; (d) The estimated gradient functions with Triangular kernel.
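The algebraic simplification above is easy to verify symbolically. The following sketch (an illustration added here, not part of the paper; the variable names are ours) checks that $V_{1}-2V_{3}+V_{2}$ reduces to the stated closed form:

```python
import sympy as sp

# Kernel moment symbols: mu_j = int u^j K(u) du, v_j = int u^j K(u)^2 du
mu2, mu4, mu6, v2, v4, v6 = sp.symbols('mu2 mu4 mu6 v2 v4 v6', positive=True)
lam, phi = sp.symbols('lambda_p varphi', positive=True)

V1 = lam * v2 / (phi**2 * mu2**2)
V3 = lam * (mu6*v2 - mu4*v4) / (phi**2 * mu2 * (mu2*mu6 - mu4**2))
V2 = lam * (mu6**2*v2 + mu4**2*v6 - 2*mu4*mu6*v4) / (phi**2 * (mu2*mu6 - mu4**2)**2)

target = (lam / phi**2) * mu4**2 * (mu2**2*v6 - 2*mu2*mu4*v4 + mu4**2*v2) \
         / (mu2**2 * (mu2*mu6 - mu4**2)**2)

# The difference simplifies to zero, confirming the closed form of V_{1,3}
print(sp.simplify(V1 - 2*V3 + V2 - target))   # -> 0
```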
By minimizing $CV_{0,LCB}(h)$, the optimal bandwidth $h_{0,cubic}$ is

$$
h_{0,cubic}=\left[\frac{3V_{1,3}\int\sigma^{2}(x)W(x)\,dx}{4\int B_{1}^{2}(x)f(x)W(x)\,dx}\right]^{1/7}n^{-1/7}.
$$
It follows from the expressions of $h_{0,opt}$ and $h_{0,cubic}$ that the ratio is given by

$$
\frac{h_{0,opt}}{h_{0,cubic}}=\left(\frac{V_{1}}{V_{1,3}}\right)^{1/7}
=\left[\frac{v_{2}\big(\mu_{2}\mu_{6}-\mu_{4}^{2}\big)^{2}}{\mu_{4}^{2}\big(\mu_{2}^{2}v_{6}-2\mu_{2}\mu_{4}v_{4}+\mu_{4}^{2}v_{2}\big)}\right]^{1/7},
$$
which is a constant depending only on the kernel function. This completes the proof of the theorem.
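As a numerical illustration (added here; not part of the original proof), the constant can be evaluated for any given kernel by computing the moments $\mu_{j}=\int u^{j}K(u)\,du$ and $v_{j}=\int u^{j}K^{2}(u)\,du$. The sketch below does this for the Gaussian kernel via numerical integration:

```python
import numpy as np
from scipy.integrate import quad

# Gaussian kernel and its moments mu_j = int u^j K(u) du, v_j = int u^j K(u)^2 du
K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
mu = lambda j: quad(lambda u: u**j * K(u), -np.inf, np.inf)[0]
v = lambda j: quad(lambda u: u**j * K(u)**2, -np.inf, np.inf)[0]

mu2, mu4, mu6 = mu(2), mu(4), mu(6)
v2, v4, v6 = v(2), v(4), v(6)

num = v2 * (mu2 * mu6 - mu4**2)**2
den = mu4**2 * (mu2**2 * v6 - 2.0 * mu2 * mu4 * v4 + mu4**2 * v2)
print((num / den)**(1.0 / 7.0))   # ratio h_{0,opt}/h_{0,cubic}; about 1.009 for this kernel
```

Under this computation the constant is close to one for the Gaussian kernel, so plugging in the local cubic estimate rescales the optimal bandwidth only marginally.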
A.4. Comparison for different kernel functions

To assess the performance of different kernels, this simulation examines the impact of four commonly used kernel functions on the estimation of the gradient function. These kernel functions are the Gaussian kernel $K(u)=(2\pi)^{-1/2}\exp(-u^{2}/2)$, the Uniform kernel $K(u)=0.5\,I(|u|\le 1)$, the Epanechnikov kernel $K(u)=0.75(1-u^{2})I(|u|\le 1)$ and the Triangular kernel $K(u)=(1-|u|)I(|u|\le 1)$.
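As a quick numerical sanity check on these definitions (an illustration added here, with helper names of our own choosing), the sketch below verifies that each kernel integrates to one and reports its second moment:

```python
import numpy as np
from scipy.integrate import quad

# The four kernels as defined above
kernels = {
    "Gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi),
    "Uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "Epanechnikov": lambda u: 0.75 * (1.0 - u**2) * (np.abs(u) <= 1),
    "Triangular":   lambda u: (1.0 - np.abs(u)) * (np.abs(u) <= 1),
}

for name, K in kernels.items():
    # Integrate on a finite range (the Gaussian tail beyond |u| = 10 is negligible);
    # the breakpoints at +/-1 help quad with the compactly supported kernels.
    total = quad(K, -10, 10, points=[-1.0, 1.0])[0]
    mu2 = quad(lambda u: u**2 * K(u), -10, 10, points=[-1.0, 1.0])[0]
    print(f"{name:>12s}: integral = {total:.4f}, mu_2 = {mu2:.4f}")
```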
The observations are generated according to

$$
y_{i}=\sin(2x_{i})+2\exp(-0.5x_{i})+\sigma(x_{i})\varepsilon_{i},\qquad i=1,\ldots,n,
$$

where the covariate $x_{i}$ follows a uniform distribution on $[-2,2]$, the standard deviation is $\sigma(x_{i})=0.2x_{i}^{2}$ and the error $\varepsilon_{i}$ is drawn from the normal distribution N(0, 1). In this simulation, the sample size is fixed at n = 200 and the number of replications is 500. The true gradient function is $\beta(x_{i})=2\cos(2x_{i})-\exp(-0.5x_{i})$. The estimation of $\beta(\cdot)$ over $[-2,2]$ is carried out by local polynomial CQR with p = 5, 9, 19 and 99. To evaluate the behavior of the estimates, the bias is used as a criterion and is defined by $\mathrm{bias}=\hat\beta_{LL}(x_{i})-\beta(x_{i})$.

In Table 5, the biases based on these kernels are reported for the gradient estimate $\hat\beta_{LL}(x_{i})$ at the points $x_{i}=-2.0,-1.5,-1.0,-0.5,0.0,0.5,1.0,1.5,2.0$. It can be observed that the gradient estimates perform very similarly across the four kernel functions. In particular, the estimated gradients obtained with these kernels are more consistent at the interior points.

To further explore the impact of the kernel functions, we consider the gradient estimation with these kernels using different bandwidths, namely the optimal bandwidth $\hat h_{0,opt}$, $2.5\hat h_{0,opt}$ and $0.5\hat h_{0,opt}$. Fig. 4 shows the results of the gradient estimates with p = 9. As illustrated in the figure, for all four kernels the estimates with $\hat h_{0,opt}$ yield similarly good fits to the true gradient function, whereas the bandwidths $2.5\hat h_{0,opt}$ and $0.5\hat h_{0,opt}$ lead to similarly poor estimates. It can be concluded that the gradient estimation is not sensitive to the choice of kernel function but is determined by the choice of bandwidth. This conclusion agrees with the finding in [3].
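The simulation design above can be reproduced in outline. The following sketch is our own illustration, not the authors' code: the function names, the fixed bandwidth and the profiling strategy for the composite check loss are assumptions. It generates data from the model, estimates the gradient at a point by local linear CQR with the Gaussian kernel, and reports the bias:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def weighted_quantile(z, w, tau):
    """Minimiser of sum_i w_i * rho_tau(z_i - a) over a (a weighted tau-quantile)."""
    order = np.argsort(z)
    z, w = z[order], w[order]
    cw = np.cumsum(w) / np.sum(w)
    return z[np.searchsorted(cw, tau)]

def llcqr_gradient(x0, x, y, h, p=9, kernel=gauss_kernel):
    """Local linear CQR gradient estimate at x0: for each candidate slope b the
    quantile-specific intercepts a_k are weighted quantiles of the partial
    residuals, so only the scalar slope b needs to be profiled numerically."""
    taus = np.arange(1, p + 1) / (p + 1)
    w = kernel((x - x0) / h)

    def profile(b):
        resid = y - b * (x - x0)
        loss = 0.0
        for tau in taus:
            dev = resid - weighted_quantile(resid, w, tau)
            loss += np.sum(w * dev * (tau - (dev < 0.0)))   # check loss rho_tau
        return loss

    return minimize_scalar(profile, bounds=(-10.0, 10.0), method="bounded").x

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2.0, 2.0, n)
y = np.sin(2 * x) + 2 * np.exp(-0.5 * x) + 0.2 * x**2 * rng.standard_normal(n)

beta_true = lambda t: 2 * np.cos(2 * t) - np.exp(-0.5 * t)
h = 0.3   # an illustrative fixed bandwidth; the paper selects h data-adaptively

for x0 in (-1.0, 0.0, 1.0):
    est = llcqr_gradient(x0, x, y, h)
    print(f"x0 = {x0:+.1f}: estimate = {est:+.4f}, bias = {est - beta_true(x0):+.4f}")
```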
References

[1] P. Chaudhuri, Nonparametric estimates of regression quantiles and their local Bahadur representation, Ann. Stat. 19 (1991) 760–777.
[2] P. Chaudhuri, K. Doksum, A. Samarov, On average derivative quantile regression, Ann. Stat. 25 (1997) 715–744.
[3] J. Fan, I. Gijbels, Local Polynomial Modelling and Its Applications, Chapman and Hall, London, 1996.
[4] T. Gasser, H.G. Müller, Estimating regression functions and their derivatives by the kernel method, Scand. J. Stat. 11 (1984) 171–185.
[5] S. Ghosh, A. Deb, G. Sarkar, Taylor series approach for function approximation using estimated higher derivatives, Appl. Math. Comput. 284 (2016) 89–101.
[6] P. Hall, Q. Li, J.S. Racine, Nonparametric estimation of regression functions in the presence of irrelevant regressors, Rev. Econ. Stat. 89 (2007) 784–789.
[7] W. Härdle, Applied Nonparametric Regression, Cambridge University Press, Cambridge, 1990.
[8] D.J. Henderson, Q. Li, C.F. Parmeter, S. Yao, Gradient-based smoothing parameter selection for nonparametric regression estimation, J. Econom. 184 (2015) 233–241.
[9] P.J. Huber, E.M. Ronchetti, Robust Statistics, John Wiley & Sons, New York, 2009.
[10] R. Jiang, Z. Zhou, W. Qian, Y. Chen, Two step composite quantile regression for single-index models, Comput. Stat. Data Anal. 64 (2013) 180–191.
[11] B. Kai, R. Li, H. Zou, Local composite quantile regression smoothing: an efficient and safe alternative to local polynomial regression, J. R. Stat. Soc. B 72 (2010) 49–69.
[12] B. Kai, R. Li, H. Zou, New efficient estimation and variable selection methods for semiparametric varying-coefficient partially linear models, Ann. Stat. 39 (2011) 305–332.
[13] K. Knight, Limit theory for autoregressive parameter estimates in an infinite variance random walk, Can. J. Stat. 17 (1989) 261–278.
[14] R. Koenker, G.S. Bassett, Regression quantiles, Econometrica 46 (1978) 33–50.
[15] D. Li, R. Li, Local composite quantile regression smoothing for Harris recurrent Markov processes, J. Econom. 194 (2016) 44–56.
[16] Q. Li, J.S. Racine, Nonparametric Econometrics: Theory and Practice, Princeton University Press, Princeton, 2007.
[17] W. Lin, Z. Cai, Z. Li, L. Su, Optimal smoothing in nonparametric conditional quantile derivative function estimation, J. Econom. 188 (2015) 502–513.
[18] H.G. Müller, U. Stadtmüller, T. Schmitt, Bandwidth choice and confidence intervals for derivatives of noisy data, Biometrika 74 (1987) 743–749.
[19] R. Raya-Miranda, M. Martínez-Miranda, Data-driven local bandwidth selection for additive models with missing data, Appl. Math. Comput. 217 (2011) 10328–10342.
[20] J.A. Rice, Bandwidth choice for differentiation, J. Multivar. Anal. 19 (1986) 251–264.
[21] P. Roul, V. Prasad Goura, B-spline collocation methods and their convergence for a class of nonlinear derivative dependent singular boundary value problems, Appl. Math. Comput. 341 (2019) 428–450.
[22] J. Sun, Y.J. Gai, L. Lin, Weighted local linear composite quantile estimation for the case of general error distributions, J. Stat. Plan. Inference 143 (2013) 1049–1063.
[23] W. Wang, L. Lin, Derivative estimation based on difference sequence via locally weighted least squares regression, J. Mach. Learn. Res. 16 (2015) 2617–2641.
[24] H. Wang, L. Zhou, Bandwidth selection of nonparametric threshold estimator in jump-diffusion models, Comput. Math. Appl. 73 (2017) 211–219.
[25] K. Yu, M.C. Jones, Local linear quantile regression, J. Am. Stat. Assoc. 93 (1998) 228–237.
[26] R. Zhang, Y. Lv, W. Zhao, J. Liu, Composite quantile regression and variable selection in single-index coefficient model, J. Stat. Plan. Inference 176 (2016) 1–26.
[27] H. Zou, M. Yuan, Composite quantile regression and the oracle model selection theory, Ann. Stat. 36 (2008) 1108–1126.