A robust and efficient estimation and variable selection method for partially linear single-index models

A robust and efficient estimation and variable selection method for partially linear single-index models

Accepted Manuscript A robust and efficient estimation and variable selection method for partially linear single-index models Hu Yang, Jing Yang PII: D...

459KB Sizes 0 Downloads 103 Views

Accepted Manuscript A robust and efficient estimation and variable selection method for partially linear single-index models Hu Yang, Jing Yang PII: DOI: Reference:

S0047-259X(14)00108-0 http://dx.doi.org/10.1016/j.jmva.2014.04.024 YJMVA 3745

To appear in:

Journal of Multivariate Analysis

Received date: 8 September 2013 Please cite this article as: H. Yang, J. Yang, A robust and efficient estimation and variable selection method for partially linear single-index models, Journal of Multivariate Analysis (2014), http://dx.doi.org/10.1016/j.jmva.2014.04.024 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights (for review)

Highlights • We propose a robust and efficient estimation method based on local modal regression.

• The asymptotic normality of the proposed estimators are established. • The estimator is shown to be superior compared with OLS. • A variable selection procedure for covariates is shown to possess oracle property. • A modified EM algorithm is proposed.

1

*Manuscript Click here to download Manuscript: A robust and efficient estimation and variable selection Click for partially here to view linearlinked single-index References models.te

A robust and efficient estimation and variable selection method for partially linear single-index models Hu Yanga , Jing Yanga,∗ a College

of Mathematics and Statistics, Chongqing University, Chongqing 401331, China

Abstract In this paper, a new robust and efficient estimation approach based on local modal regression is proposed for partially linear single-index models, of which the univariate nonparametric link function is approximated by local polynomial regression. The asymptotic normality of proposed estimators for both the parametric and nonparametric parts are established. We show that the resulting estimator is more efficient than the ordinary least-square-based estimation in the case of outliers or heavy-tail error distribution, and as asymptotically efficient as the least square estimator when there are no outliers and the error is normal distribution. To achieve sparsity when there exist irrelevant variables in the model, a variable selection procedure based on SCAD penalty is developed to select significant parametric covariates and is shown to possess oracle property under some regularity conditions. We also propose a practical modified EM algorithm for the new method. Some Monte Carlo simulations and a real data set are conducted to illustrate the finite sample performance of the proposed estimators. Keywords: Partially linear single-index models; Local modal regression; Robust estimation; Variable selection; Oracle property 1. Introduction Partially linear single-index models (PLSIM) have gained much attentions in recent years because such models not only extend the popular single-index models and linear models, but also retain the easy interpretation of parameters in multiple linear regression and the flexibility of the single-index model. In other words, PLSIM allow the response variable to depend linearly on some covariates and nonlinearly on a single linear combination of other variables called as single-index through a nonparametric link function. Specifically, the partially linear single-index model is defined as follows: Y = X T β0 + g(Z T γ0 ) + ε,

(1)

∗ Corresponding

author. Email addresses: [email protected] (Hu Yang), [email protected] (Jing Yang)

Preprint submitted to Journal of Multivariate Analysis

January 2, 2014

where “T ” denotes the transpose of a vector or matrix throughout this paper, Y is the univariate response variable, X and Z are the p-dimensional and qdimensional covariate vectors, respectively, g(·) is an unknown univariate link function, β0 is a vector of corresponding linear regression coefficient, γ0 is the single-index vector coefficient, and for the sake of model identification, we further assume that kγ0 k2 = 1 and the first component of γ0 is positive, ε is independent identically distributed (i.i.d.) random error with mean zero and ε is also independent of (X, Z). To estimate the parameters β0 and γ0 in PLSIM, Carroll et al. [2] firstly proposed a backfitting algorithm in a more general case, i.e., generalized PLSIM. Yu and Ruppert [23] argued that the estimators proposed by Carroll et al. [2] may become computationally unstable and introduced a penalized spline estimation procedure. Xia and H¨ ardle [27] applied the minimum average variance estimation (MAVE, [26]) method combined with the idea of dimension reduction to model (1). Wang et al. [20] proposed a two-stage estimation procedure under the mild condition that a few indices based on Z suffice to explain X. However, all these existing papers are based on least squares or likelihood method and the assumptions that the error is normally distributed, which are sensitive to outliers and lose efficiency for heavy-tail error distribution. Then a lot of work had been done to remedy these weaknesses.√One of the most popular and useful approaches is quantile regression for its n-consistency and asymptotic normality can be established without assuming any moment condition of the residual. Relevant Literature see [9, 22, 7] and a comprehensive book [10] among others. Recently, quantile regression for single-index model and PLSIM have been studied, Wu et al. [21] introduced an algorithm based on local linear quantile regression for single-index models. Jiang et al. [8] extended the composite quantile regression method to single-index models. Boente and Rodriguez [1] studied a family of robust estimates for the parametric and nonparametric coefficients under generalized PLSIM. Fan and Zhu [6] proposed a quantile regression version of out-product gradient (qOPG, [26]) estimation method in general semi-parametric models of which part of predictors are presented with a single-index. In practice, large amounts of variables may be collected and some of them are irrelevant which should be excluded from the final model, variable selection then becomes necessary for modeling, because inclusion of too many irrelevant variables can degrade the estimation accuracy and model interpretability. To this end, many good penalty functions such as bridge penalty [3], the least absolute shrinkage and selection operator (Lasso) [17], the smoothly clipped absolute deviation (SCAD, [4]), the adaptive Lasso [28] and the elastic net penalty [29] had been proposed. In recent years, these methods have been used in PLSIM. Liang et al. [13] applied the SCAD penalty to simultaneously estimate regression coefficients and select important variables. Lai and Guo [14] proposed a two-step procedure to perform estimation and variable selection for both linear and index parts via adaptive Lasso. Zhang et al. [30] adopted the coordinate-independent sparse estimation to select the significant variables in the parametric components. 2

More recently, Yao et al. [25] investigated a new estimation method based on a local modal regression (LMR) in a nonparametric model. Zhang et al. [31] studied the semiparametric partially linear varying coefficient model (SPLVCM) based on modal regression and developed a variable selection procedure to select significant parametric components. Zhao et al. [32] also studied the SPLVCM based on modal regression by B-spline basis. Liu et al. [15] proposed a new robust and efficient estimation procedure based on local modal regression for single-index models. All their results were confirmed not only to be robust to the data sets including outliers or heavy-tail error distribution, but also to be as asymptotically efficient as the least-square-based estimation when there are no outliers and the error is normally distributed. Motivated by this fact, in this paper, we devoted to extending the modal regression to PLSIM. The main goal of this paper is to develop a robust and effective estimation and variable selection procedure for estimating both parametric and nonparametric parts and selecting significant parametric components in model (1) based on modal regression, where the nonparametric link function is approximated by local polynomial regression. The best convergence rates, asymptotic normality as well as the oracle property of the proposed LMR estimators are established. We demonstrate that the LMR estimators could be more efficient than least squares for many non-normal errors and as asymptotically efficient as least squares for normal errors. The paper is organized as follows. In Section 2, we propose a new estimation method for PLSIM following the idea of modal regression method, and establish the asymptotic normalities of the LMR estimators for both nonparametric and parametric parts. A robust and efficient variable selection procedure via SCAD penalty is recommended, which can be used to select the significant parametric components in Section 3. Meanwhile, its oracle property is also established. In Section 4, we discuss the details of bandwidth selection both in theory and in practise and BIC criterion is suggested to select the tuning parameter. Moreover, we introduce a modified EM algorithm for implementation. Two Monte Carlo simulations and a real data example are given in Section 5 to demonstrate the finite sample performance of the proposed procedures. Finally, we make some conclusions in Section 6. All the conditions and the technical proofs are deferred to the Appendix A. 2. Estimation method and theoretical properties 2.1. local polynomial approximation and LMR for PLSIM The mode is an important numerical characteristic of a statistical distribution or data set, especially when there exist outliers in the data or the distribution is skewed, because the mean probably not be a good measurement of center, modal regression can provide more meaningful point prediction and larger coverage probability for prediction than the mean regression in those cases. [24, 25, 31, 15] systematically studied the modal regression for the linear model, univariate nonparametric regression model, SPLVCM and single-index 3

model, respectively. In this paper, we devote to extending modal regression to PLSIM and studying the variable selection for the parametric components to achieve robust and efficient sparse estimators. Suppose that {Xi , Zi , Yi }ni=1 is an i.d.d. sample from model (1). For ZiT γ in the neighborhood of u, the unspecified link function g(·) can be locally linear approximated as g(Zi T γ) ≈ g(u) + g ′ (u)(ZiT γ − u) , a + b(ZiT γ − u),

where a = g(u) and b = g ′ (u). Given a initial estimator of γ0 , noted as γˆ (0) , we can obtain the estimators of β0 , g(·) and g ′ (·) by maximizing n  1X φh2 Yi − XiT β − a − b(ZiT γˆ (0) − u) Kh1 (ZiT γˆ (0) − u) n i=1

(2)

with respect to β, a and b. Where Kh1 (·) = K(·/h1 )/h1 and K(·) is a kernel density function, h1 is a bandwidth for estimating the nonparametric link function g(·) and h2 is another bandwidth for φ(·). Note that, the choice of φ(·) is not very crucial according to Yao et al. [25], and we choose the standard normal density for φ(·) throughout this paper for the ease of computation. ˜ a Then, we denote {β, ˜(u; γˆ (0) ), ˜b(u; γˆ (0) )} be the maximizer of (2) and define g˜(u; γˆ (0) ) = a ˜(u; γˆ (0) ). √ It is well known that the estimator β˜ is nh1 -consistent as presented in Theorem 1, since we only use the data in a local neighbourhood of ZiT γˆ (0) to estimate β. In √ order to get a more efficient estimator with the optimal convergence rate n, we may compute an improved estimator of β0 by using all the data, which can be obtained by maximizing n

 1X φh Yi − a ˜(ZiT γˆ(0) ) − XiT β n i=1 3

(3)

with respect to β. The estimator βˆ is defined as the maximizer of (3). √ Note that, after the n-consistent estimator βˆ is obtained, we can also compute an efficient estimation of the single-index parameter γ0 by using all data, which can be realized by maximizing the following objective function n  1X φh4 Yi − XiT βˆ − g˜(ZiT γ; γ) n i=1

(4)

with respect to γ. However, it seems difficult to calculate the maximizer straightly, following the similar way in [15], we apply an approximation of (4) by Taylor expression based on the estimators a ˜(u; γˆ (0) ) and ˜b(u; γˆ (0) ) from (2). Recall (0) (0) ′ that g˜(u; γˆ ) = a ˜(u; γˆ ) and g˜ (u; γˆ (0) ) = ˜b(u; γˆ (0) ), then we can get a more efficient estimator of γ0 by maximizing n  1X φh Yi − XiT βˆ − a ˜(ZiT γˆ (0) ; γˆ (0) ) − ˜b(ZiT γˆ (0) ; γˆ (0) )ZiT (γ − γˆ (0) ) n i=1 4

4

(5)

with respect to γ. The estimator γˆ is defined as the maximizer of (5). With βˆ and γˆ obtained from above estimation procedures, we can further improve the efficiency of a ˜ by maximizing n   1X φh Yi − XiT βˆ − a − b(ZiT γˆ − u) Kh5 ZiT γˆ − u n i=1 6

(6)

with respect to a and b. We denote {ˆ a(u; γˆ ), ˆb(u; γˆ )} be the maximizer of (6) and define the estimator of g(u) as gˆ(u) = a ˆ(u; γˆ). It is noteworthy that all the above six bandwidths {hk : k = 1, 2, . . . , 6} can be selected by data-driven method in practice so that the resulting estimators are adaptively robust. The detailed choices of these bandwidths will be discussed in Section 4. 2.2. Theoretical properties In this section, we aim to establish some relevant asymptotic theories of the proposed LMR estimators. Before this, we need the following definitions. Denote U0 = Z T γ0 , define Z Z µj = uj K(u)du and νj = uj K 2 (u)du, j = 0, 1, 2. Let

F (u, h) = E{φ′′h (ε) | U0 = u} 

1 A(X) =  0 X

0 µ2 0

 XT 0 , XX T

and

G(u, h) = E{φ′h (ε)2 | U0 = u}. 

ν0 B(X) =  0 ν0 X

0 ν2 0

 ν0 X T  0 ν0 XX T

and C(X) = (1, 0, X T )T . The following theorem describes the asymptotic nor˜ mality of a ˜(u), a ˜′ (u) and β. Theorem 1. Suppose that regularity conditions (C1)-(C7) given in the Appendix hold. If h1 → 0 and nh1 → ∞ as n → ∞, h2 is a constant and does not depend on n, then    a ˜(u) − g(u)   p 1 a′ (u) − g ′ (u))  − h21 µ2 Σ1 (u)−1 Γ(u)g ′′ (u) nh1  h1 (˜   2 β˜ − β0   1 d → N 0, (7) Σ1 (u)−1 Σ2 (u)Σ1 (u)−1 , fU0 (u) where g ′′ (u) is the second derivative of g(u), Γ(u) = F (u, h2 )E{C(X) | U0 = u}, Σ1 (u) = F (u, h2 )E{A(X) | U0 = u} and Σ2 (u) = G(u, h2 )E{B(X) | U0 = u}.

5

Theorem 2. Suppose that regularity conditions (C1)-(C7) given in the Appendix hold. If nh41 → 0 and nh21 / log(1/h1 ) → ∞ as n → ∞, h3 is a constant and does not depend on n, then  √ d n(βˆ − β0 ) → N 0, Q−1 ΩQ−1 ,

(8)

where Ω = Var{Xφ′h3 (ε) − D(U0 )}, D(U0 ) = φ′h2 (ε)R(U0 )P Σ1 (U0 )−1 C(X), R(U0 ) = E{F (U0 , h3 )X}, P = (1, 0, 0T )1×(2+p) and Q = E{F (U0 , h3 )XX T }. Remark 1. The optimal bandwidth h1 in Theorem 1 is h1,opt ∼ n−1/5 , obviously, this bandwidth does not satisfy the conditions in Theorem 2, which indicates that undersmoothing for g(·) becomes necessary in order to get the √ ˆ This is a common requirement n-consistency and asymptotic normality of β. in semiparametric models in the kernel literature. A sensible rule for choice h1 ˆ 1 = h1,opt × n−2/15 recommended in [2], which guarantees that the required is h bandwidth has correct order of magnitude for optimal asymptotic performance. Theorem 3. Let β be a known constant β0 or an estimator of β0 with order Op (n−1/2 ). Suppose that regularity conditions (C1)-(C7) given in the Appendix hold. If nh41 → ∞ as n → ∞, h2 and h4 are constants and does not depend on n, then  √ d (9) n(ˆ γ − γ0 ) → N 0, S − ∆S − ,  ′ where ∆ = E [φh4 (ε)g ′ (U0 )Z − φ′h2 (ε)E{g ′ (U0 )Zφ′′h4 (ε) | U0 }/FhU20 ]⊗2 , S =  E{φ′′h4 (ε)g ′ (U0 )2 ZZ T } − E g ′ (U0 )φ′′h4 (ε)ZE{g ′ (U0 )φ′′h2 (ε)Z T | U0 }/FhU20 with FhU20 = F (U0 , h2 ) and A⊗2 = AAT for any vector or matrix A, S − is a generalized inverse of S. Remark 2. As we can see from Theorem 3, the proposed procedure would eventually provide a consistent estimator of γ0 with an optimal convergence √ rate n and undersmoothing for g is also needed in this process. Theorem 4. Let γ be a known constant γ0 or an estimator of γ0 with order Op (n−1/2 ), β is also a known constant β0 or an estimator of β0 with order Op (n−1/2 ). Suppose that regularity conditions (C1)-(C7) given in the Appendix hold. If h5 → 0 and nh5 → ∞ as n → ∞, h6 is a constant and does not depend on n, then     p ν0 G(u, h6 ) 1 d . (10) ˆ(u; γ) − g(u) − µ2 h25 g ′′ (u) → N 0, nh5 a 2 fU0 (u)F (u, h6 )2

Remark 3. Compared with Theorem 1, we obtain that a ˆ(u) has the same conditional asymptotic bias as a ˜(u), but the conditional asymptotic variance of a ˆ(u) is smaller, which indicates that a ˆ(u) is a more efficient estimator than a ˜(u). Moreover, the conditional asymptotic distribution of a ˆ(u) is coincide with Theorem 2.2 in [25].

6

3. Variable selection for parametric components in PLSIM 3.1. Variable selection procedure In this section, we aim to construct the penalized estimation for β0 with SCAD penalty to select important variables and estimating their effects, simultaneously. To this end, we consider the following penalized function based on modal regression ℓλ (β) =

n X i=1

p X  pλj (|βj |), φh3 Yi − a ˜(ZiT γˆ ) − XiT β − n

(11)

j=1

where a ˜(ZiT γˆ) is the estimator of the link function g(·) obtained from (2) at point ZiT γˆ and γˆ is the maximizer of (4), pλ (t) is the SCAD penalty function at t that defined as follows  λ|t|, |t| ≤ λ,   2 (a −1)λ2 −(|t|−aλ)2 , λ < |t| ≤ aλ, pλ (t) = 2(a−1)   (a+1)λ2 , |t| > aλ 2 for given a > 2 and λ > 0, and its first-order derivative is   (aλ − t)+ ′ pλ (t) = λ I(t ≤ λ) + I(t > λ) , (a − 1)λ

where (t)+ = tI(t > 0) is the hinger loss function, λ is the penalized parameter, a is some constant usually taken to be a = 3.7 suggested in [4]. As we can see that the SCAD penalty is continuously differentiable on (−∞, 0) ∪ (0, ∞) but singular at 0, and its derivative vanishes outside [−aλ, aλ]. Therefore, it can provide sparse solutions by shrinking small coefficients toward zeros and produce unbiased estimators for large coefficients by putting zero weights to them. For given tuning parameters, we can get the penalized estimators by maximizing ℓλ (β) with respect to β. However, as the SCAD penalty function is singular at 0 and does not have continuous second-order derivative, it is difficult to maximizing (11) directly, we then apply locally quadratic approximation (LQA, [4]) to {pλj (|βj |)}pj=1 . Suppose that we are given an initial value βˆ(0) which is close to the maximizer (0) of (11), if βˆj is very close to 0, then set βˆj = 0. Otherwise, pλj (|βj |) can be locally quadratic approximated as 1 (0) (0)2 (0) (0) (0) pλj (|βj |) ≈ pλj (|βˆj |) + {p′λj (|βˆj |)/|βˆj |}(βj2 − βˆj ) for βj ≈ βˆj . 2 Consequently, we can get the sparse estimator noted as βˆλ by maximizing Qλ (β) =

p  nX (0) (0) ˜(ZiT γˆ) − XiT β − {p′λj (|βˆj |)/|βˆj |}βj2 , φh3 Yi − a 2 j=1 i=1

n X

7

(12)

3.2. Theoretical properties In this section, we study the theoretical properties of the penalized estimator βˆjλ and the following notations are needed. Decompose the true parameter T T vector β0 as (β0I , β0II )T with β0I ∈ Rl and β0II ∈ Rp−l . Without loss of generality, we assume that β0I = (β01 , β02 , . . . , β0l )T consists all of the nonzero components of β0 , and β0II = (β0,l+1 , β0,l+2 , . . . , β0,p )T corresponding to all λ T T the zero coefficients. Let βˆλ = ((βˆIλ )T , (βˆII ) ) be the corresponding penalized estimator. Denote λmax = max{λj : j = 1, 2, . . . , p} and λmin = min{λj : j = 1, 2, . . . , p}, j

j

an = max {|p′λj (|β0j |)| : β0j 6= 0} and bn = max {|p′′λj (|β0j |)| : β0j 6= 0}, 1≤j≤p

1≤j≤p

Φλ =

(p′λ1 (|β01 |), . . . , p′λl (|β0l |))T

and Ψλ = diag{p′′λ1 (|β01 |), . . . , p′′λl (|β0l |)}.

Theorem 5. Let βˆλ be the maximizer of (12). Suppose that regularity conditions (C1)-(C8) given in the Appendix hold. If nh41 → ∞ and nh21 / log(1/h1 ) → ∞ as n → ∞, h3 is a constant and √ does not depend on n. Further assume an → 0 and bn → 0, λmax → 0 and nλmin → ∞ as n → ∞, then λ (i) Sparsity: βˆII = 0, with probability tending to one, (ii) Asymptotic normality: √ d n(Ql + Ψλ ){βˆIλ − β0I + (Ql + Ψλ )−1 Φλ } → N (0, Ωl ),

(13)

where Ql and Ωl are the top-left l-by-l submatrices of Q and Ω in Theorem 2. Theorem 5 shows that our proposed penalized estimator achieve oracle property, i.e., βˆλ not only possess the sparsity property which will reduce model complexity and improve interpretability of the true model, but also satisfy asymptotic normality. From Theorem 5, we have the following corollary with its proof omitted here. √ Corollary 1. Under the same assumptions given in Theorem 5, if nΦλ = op (1) and Ψλ = op (1), then √ d n(βˆIλ − β0I ) → N (0, Ql −1 Ωl Ql −1 ).

(14)

4. Bandwidth selection and estimation algorithm 4.1. Asymptotic optimal bandwidth in theory Let γ be a known constant γ0 or an estimator of γ0 with order Op (n−1/2 ), by [12], we have that the ratio of the asymptotic variance of the LMR estimator to that of the local linear regression (LLR) estimator is given by RA(u, h2 ) =

G(u, h2 )F −2 (u, h2 ) . σ2 8

We can see that the ratio RA(u, h2 ) depends only on u and h2 , and it plays an important role in efficiency and robustness of estimators. Therefore, the optimal choice of h2 can be choosed as h2,opt = arg min RA(u, h2 ) = arg min G(u, h2 )F −2 (u, h2 ). h2

h2

(15)

It follows from (15) that h2,opt does not depend on n and only depends on the conditional error distribution of ε given U0 . Remark 4. As shown in Theorem 2.4 in [25], for all h2 > 0, inf h2 RA(u, h2 ) ≤ 1 regardless of the error distribution, and if the error follows normal distribution, we have RA(u, h2 ) > 1 and inf h2 RA(u, h2 ) = 1. That is to say, LMR estimator is more efficient than LLR estimator when there exist outliers and heavy-tail error distribution and does not lose efficiency under normal error distribution. To obtain the optimal bandwidth h1 , we can minimizes the mean squared error (MSE) of a ˜(u) based on (7), which results h1,opt = RA(u, h2,opt )1/5 hLLR , where hLLR is the asymptotic optimal bandwidth for LLR, see [12]. Consequently, in the same way as above, the optimal choices of h3 , h4 , h6 and h5 is given by h3,opt = arg min G(u, h3 )F −2 (u, h3 ), h4,opt = arg min G(u, h4 )F −2 (u, h4 ), h4

h3

h6,opt = arg min G(u, h6 )F −2 (u, h6 ), h5,opt = RA(u, h6,opt )1/5 hLLR . h6

√ ˆ a senAs mentioned in Remark 1, to obtain the n-consistent estimator β, sible bandwidth for h1 is usually generated by ˆh1,opt = h1,opt × n−2/15 . 4.2. Bandwidth selection in practice In practice, the error distribution is unknown, since ε is independent of U0 in this paper, then we can estimate F (h2 ) and G(h2 ) in implementation by n

F (h2 ) =

n

1X ′ 1 X ′′ φ (ˆ εi ) and G(h2 ) = {φ (ˆ εi )}2 , n i=1 n i=1

ˆ gˆ and γˆ are estimators based on the pilot where εˆi = Yi − XiT βˆ − gˆ(ZiT γˆ), β, d 2 ) = G(h ˆ 2 )Fˆ (h2 )−2 /ˆ estimators. Therefore, we can estimate RA(h σ 2 , where σ ˆ is also estimated based on the pilot estimator. Then, using the grid search d 2 ). As recommended in method, we can easily find h2,opt to minimize RA(h σ × 1.02j , j = 1, 2, . . . , k for [25], the possible grids points for h can be 0.5ˆ ˆ 2,opt , we obtain some fixed k, such as k = 50 or k = 100. After we get h 1/5 ˆ ˆ ˆ ˆ d h1,opt = RA(h2,opt ) hLLR , where hLLR is the selected optimal bandwidth for local linear regression. Similarly, we can find the optimal bandwidths for h3 , ˆ 1,opt × n−2/15 for h1 for the need of h4 , h5 and h6 . Note that, we shall use h undersmoothing. 9

4.3. Selection of tuning parameter As stated in Theorem 5, the proposed penalized estimator βˆλ possesses the oracle property. However, this attractive feature relies heavily on the tuning parameters. So, in practice, the choices of turning parameters λj , j = 1, 2, . . . , p should be determined first. Many selection criteria based on data-driven method such as CV, GCV, AIC and BIC selection can be used. In this paper, we adopt BIC selector to choose λj for its ability to identify the true model consistently. Because it is computationally expensive to minimize BIC, according to the approach of [5], we set λj = λSE(βˆj ) for j = 1, 2, . . . , p, where λ is the tuning √ parameter and SE(βˆj ) is standard error of the unpenalized n-consistent estimator of β0 produced by (3). Then, we use BIC criterion to select λ by minimizing the following objective function ) ( n   log(n) 1X T T ˆλ + ˜(Zi γˆ) − Xi β φh Yi − a dfλ , (16) BIC(λ) = − log n i=1 3 n where dfλ is the number of nonzero coefficients of βˆλ . Denote the optimal λ as λopt = arg minλ BIC(λ), and the simulation studies later shows that λopt performs very well. 4.4. Algorithm In this subsection, we propose a modified modal expectation-maximization (MEM) algorithm proposed by Li et al. [11]. Denote ξ = (a, b, β T )T , then the local linear modal regression (LMR) procedure for estimating β0 , γ0 and g(·) is as follows: √ Step 0. Obtain an initial n-consistent estimator γˆ (0) of γ0 by qOPG method (0) (0) [6], and standardize γˆ (0) by sign(ˆ γ1 )ˆ γ /kˆ γ (0) k. Note that, other robust estimation method such as median regression and composite quantile regression (CQR) can also used to obtain the initial value. T Step 1. Obtain ξ˜ based on (2) by updating ξ. Let ξ (0) = (a(0) , b(0) , β (0) )T be the initial estimation and set t = 0. E-step: Update {π1 (j | ξ (t) )}nj=1 by

φh2 (Yj − Xj∗ T ξ (t) )Kh1 (ZjT γˆ (0) − u) , π1 (j | ξ (t) ) = Pn T ˆ (0) − u) ∗ T (t) i=1 φh2 (Yi − Xi ξ )Kh1 (Zi γ

where Xi∗ = (1, (ZiT γˆ (0) − u), XiT )T . M-step: Update ξ to obtain ξ (t+1) by ξ (t+1)

= arg max ξ

= (X

∗T

n n o X π1 (i | ξ (t) ) log φh2 (Yi − Xi∗T ξ) i=1

W1 X ∗ )−1 X ∗ T W1 Y, 10

where X ∗ = (X1∗ , X2∗ , . . . , Xn∗ )T , Y = (Y1 , Y2 , . . . , Yn )T and W1 = diag{π1 (1 | ξ (t) ), π1 (2 | ξ (t) ), . . . , π1 (n | ξ (t) )}. Iterate the E-step and M-step until convergence. Step 2. Obtain βˆ based on (3) by updating β. Let β (0) = set t = 0. E-step: Update {π2 (j | β (t) )}nj=1 by π2 (j | β

(t)

˜(ZjT γˆ (0) ) − XjT β (t) φh3 Yj − a

) = Pn

i=1

= arg max β



Pn

φh3 Yi − a ˜(ZiT γˆ (0) ) − XiT β (t)

M-step: Update β to obtain β (t+1) by β (t+1)

1 n

i=1

˜ 0i ) and β(U

.

n n X o π2 (i | β (t) ) log φh3 Yi − a ˜(ZiT γˆ (0) ) − XiT β i=1

= (X T W2 X)−1 X T W2 Y ∗ ,

where X = (X1 , X2 , . . . , Xn )T , Y ∗ = (Y1 − a ˜(Z1T γˆ (0) ), Y2 − a ˜(Z2T γˆ (0) ), . . . , Yn − T (0) T (t) (t) a ˜(Zn γˆ )) and W2 = diag{π2 (1 | ξ ), π2 (2 | ξ ), . . . , π2 (n | ξ (t) )}. Iterate the E-step and M-step until convergence. Note that, when we need to conduct the penalized estimation procedure based on (12), a revising of Step 2 with some small changes is as follows: (0) Step 2♯ . Obtain βˆλ based on (12) by updating β. Let β λ = and set t = 0. E-step: Update {π2 (j | β (t) )}nj=1 by

π2 (j | β λ

(t)

(t+1)

Pn

i=1

˜ 0i ) β(U

(t)  φh3 Yj − a ˜(ZjT γˆ (0) ) − XjT β λ ) = Pn . (t)  ˜(ZiT γˆ (0) ) − XiT β λ i=1 φh3 Yi − a

M-step: Update β to obtain β λ βλ

1 n

= arg max β

n nX i=1

(t+1)

by

π2 (i | β λ

(t)

) log φh3 Yi − a ˜(ZiT γˆ (0) ) − XiT β

o n X ′ ˆ(0) (0) {pλj (|βj |)/|βˆj |}βj2 − 2 j=1  −1 (t) = X T {W2 + nΣλ (β λ )}X X T W2 Y ∗ , p



n o where Σλ (β) = diag p′λ1 (|β1 |)/|β1 |, p′λ2 (|β2 |)/|β2 |, . . . , p′λp (|βp |)/|βp | for any given vector β, X, Y ∗ and W2 are the same as in Step 2. Iterate the E-step and M-step until convergence.

11

Step 3. Obtain γˆ based on (5) by updating γ. Let γ (0) = γˆ (0) and set t = 0. E-step: Update {π3 (j | γ (t) )}nj=1 by  φh4 Y˜j∗ − ˜b(ZjT γˆ (0) ; γˆ (0) )ZjT (γ (t) − γˆ (0) ) (t) π3 (j | γ ) = Pn , φh Y˜ ∗ − ˜b(Z T γˆ (0) ; γˆ (0) )Z T (γ (t) − γˆ (0) ) i=1

XiT βˆ −

Y˜i∗

4

where = Yi − M-step: Update γ to obtain γ γ (t+1) = arg max γ

i

i

i

a ˜(ZiT γˆ (0) ). (t+1)

by

o π3 (i | γ (t) ) log φh4 Y˜i∗ − ˜b(ZiT γˆ (0) ; γˆ (0) )ZiT (γ − γˆ (0) )

n n X i=1

= (Z˜ ∗T W3∗ )−1 Z˜ ∗T W3 Y˜ ∗ ,

∗ ∗ ∗ ∗ )Zi , Y˜ ∗ = (Y˜1∗ , . . . , Y˜n∗ )T where Z˜ ∗ = (Z˜1 , Z˜2 , . . . , Z˜n )T , Z˜i = ˜b(ZiT γˆ (0) ; γˆ (0) and W3 = diag π3 (1 | ξ (t) ), π3 (2 | ξ (t) ), . . . , π3 (n | ξ (t) ) . Iterate the E-step and M-step until convergence.

Step 4. Continue Steps 1, 2 (2♯ ) and 3 until convergence, we obtain the estimator γˆ of γ0 . Step 5. With γˆ obtained from Step 4, we can get the estimator βˆ (βˆλ ) by carrying out Steps 1 and 2 (2♯ ). Step 6. With βˆ (βˆλ ) obtained from Step 5, combined with γˆ , we can get the estimator gˆ(u) = gˆ(u; γˆ ) by maximizing (6). Remark 5. It is worth noting that, as the usual EM algorithm, it is a prudent way to run the procedure from several different starting points and choose the best local optima, because the converged value may depend on the starting point and there is no guarantee that the algorithm will converge to the global optimal solution. 5. Numerical studies In this section, we conduct two different Monte Carlo simulations and a real dataset to assess the finite sample performance of our proposed estimation and variable selection method. In our simulations, we select Gaussian kernel as the kernel function, and an estimator is treated as zero if its absolute value is smaller than 10−6 . The optimal bandwidth for hLLR is obtained by 5-fold cross-validation (CV) method. To examine the robustness and efficiency of our proposed LMR estimator, we compare the simulation results with the popular least square estimator based on LLR. In order to evaluate the performance of the estimators, we consider the following criteria: (1) average value of the corresponding coefficients (Average); (2) bias (Bias) and standard deviation (SD); (3) generalized mean square error (GMSE). GMSE(β) = (βˆ−β)T E(XX T )(βˆ−β) and GMSE(γ) = (ˆ γ −γ)T E(ZZ T )(ˆ γ −γ). 12

Further more, the average number of the true zero coefficients correctly identified as zero (NC), the average number of the true nonzero coefficients incorrectly identified as zero (NIC) in β0 and the probability of correctly selecting the true model (PC) are reported to show the performance of variable selection procedure. To assess the performance of estimator gˆ for the conditional link function g, we apply the square root of average square errors (RASE) of gˆ, which is defined as )1/2 ( ngrid 1 X 2 , RASE(ˆ g) = kˆ g(ui ) − g(ui )k ngrid i=1

where {ui , i = 1, 2, . . . , ngrid } are the grid points at which the function g(·) is evaluated.

5.1. Monte Carlo simulations Example 1. In this example, we focus on the estimation of the proposed LMR estimator, and the following PLSIM is considered Y = X T β0 + g(Z T γ0 ) + ε,

(17)

where X = (X1 , X2 , X3 )T with Xj′ s are independently generated from uniform √ √ distribution U (− 3, 3) for j = 1, 2, 3 and β0 = (1, 0.5, 3)T , Z = (Z1 , Z2 , Z3 )T with each Zj being standard normal distribution and the correlation between Zi √ and Zj being 0.5|i−j| and θ0 = (2, 1, 1)T / 6, the nonparametric link function is given by g(Z T γ0 ) = 8 cos(Z T γ0 ) + exp{−(Z T γ0 )2 }. In order to show the robustness of our estimators, the following three different error distributions are considered: standard normally distributed N (0, 1), t(3) distribution which is used to produce heavy-tailed distribution and mixture of normals 0.9N (0, 1) + 0.1N (0, 10) (MN) which is used to produce the outliers. In our simulation experiments, 200 repetitions are carried out with sample size n = 100, 200, 400, and the corresponding results are summarized in Table 1 and Table 2. The last column of Table 2 presents the RASE values of estimation of g. It can be seen from these two tables, the proposed LMR estimator is as efficient as the LLR estimator when the error is normal distributed, for the other two error distributions t(3) and MN, LMR estimator performs much better than LLR estimator in terms of all the evaluation criterion defined above, which indicates that our proposed LMR estimator is robust to the data sets subjected to the outliers in the response variables or heavy-tailed error distributions. On the other hand, with the sample size n increase gradually, there is an obvious ˆ γˆ and gˆ get better and tendency that all the performance of the estimators β, better. All these simulation results corroborate our theoretical findings. Example 2. In this example, we focus on the variable selection of the proposed LMR estimator and the model set-up is similar to (17) except that X = (X1 , X2 , . . . , X8 )T and Z = (Z1 , Z2 , Z3 )T are jointly normally distributed with mean 0, variance 1 and correlation 0.5|i−j| , the nonparametric link function 13

is given by g(Z T γ0 ) = 3 sin(Z T γ0 )+(Z T γ0 −0.5)2 . As considered in Example 1, three different error distributions N (0, 1), t(3) and MN are generated to show the robustness of the proposed LMR estimator. Also 200 repetitions are operated with sample size n = 100, 200, 400, and the corresponding results are summarized in Table 3, and Fig. 1-Fig. 3 are the boxplots of the linear regression estimators βˆλ based on LMR amd LLR method from 200 simulated data sets for the case n = 100, 200, 400 with ε ∼ t(3), respectively. From the results presented in Table 3, it is noteworthy that the proposed variable selection procedure based on LMR method performs much better than LLR estimator in terms of NC, NIC and PC except for the normal errors, similar results can also be found in the three boxplot figures. Meanwhile, when the error is normal distribution, the difference between these two estimators is relatively small, which implies that LMR estimator almost not lose any efficiency in this case. All these results demonstrate the expected superiority of the proposed LMR estimator than LLR. Note that, we had also obtained the similar boxplots for the other two error distributions ε ∼ N (0, 1) and ε ∼ M N , but we omit presenting them in order to reduce the length of this article.

5.2. A real data example This subsection considers an application of our proposed method to a data set from a diabetes study which had been analyzed by [30]. The data contains 442 observations for diabetes patients on 11 variables, of which the response variable Y is a quantitative measurement of disease progression one year after baseline. Ten other statistical measurements on the 442 patients are the predictor covariates that include age, sex, body mass index (BMI), average blood pressure (BP), and six blood serum measurements about total cholesterol, density level, tension glaucoma level, and glucose concentration denoted by TC, LDL, HDL, TCH, LTG, and GLU, respectively. Beforehand, both the response variable Y and the covariates are standardized so that they have mean zero and unit variance. Here, we consider age, BMI and BP as the index variables Z and the other seven variables as the linear variables X. Then we apply the proposed variable selection procedure to analyze the data set by a partially linear single-index regression model stated as (1).

14

4 3.5 3.5 3

3 2.5

2.5 2 2

1.5 1

1.5

0.5 1

0 beta1

beta2

beta5

beta1

beta2

(a) LLR

beta5

(b) LMR

0.6

0.3

0.4

0.2

0.1

0.2

0 0 −0.1 −0.2 −0.2 −0.4 −0.3 −0.6

beta3

beta4

beta6

beta7

beta8

beta3

(c) LLR

beta4

beta6

beta7

beta8

(d) LMR

Fig. 1. Boxplots of the estimator βˆλ from 200 repeats with n = 100 and error ε ∼ t(3) in Example 2. (a) and (b) are the corresponding boxplots of the estimators for nonzero coefficients, (c) and (d) are the corresponding boxplots of the estimators for zero coefficients. The horizontal black dotted lines denote the true values of β0j for j = 1, 2, . . . , 8.

The corresponding results are presented in Table 4 and Fig. 4. As we can see from Table 4, the difference between the estimated parameters based on LLR and LMR are relatively small, however, LMR estimators have smaller standard errors than LLR approach which means that LMR has better performance. As to the effect of variable selection, both these two methods shrink some predictor variables exactly to zero and LMR is sparse than LLR method by shrinking TCH to zero that not appeared in LLR. Meanwhile, the nonzero estimators of the linear part imply that TC, LDL, and CTG seem to be the most important variable. In addition, the LLR and LMR estimations for the nonparametric link function g(·) are presented in Fig. 4 along with its 95% pointwise confidence intervals, there is a few discrepancies between these two estimations. From Fig. 4, we know that the link function g(·) is truly a nonlinear function varies 15

4 3.5 3 3 2.5

2.5

2 2

1.5 1

1.5 0.5 0 beta1

beta2

1

beta5

beta1

beta2

(a) LLR

beta5

(b) LMR

0.2

0.08

0.15

0.06 0.04

0.1

0.02 0.05 0 0

−0.02

−0.05

−0.04

−0.1

−0.06 −0.08

−0.15

−0.1 −0.2 −0.12 beta3

beta4

beta6

beta7

beta8

beta3

(c) LLR

beta4

beta6

beta7

beta8

(d) LMR

Fig. 2. Boxplots of the estimator βˆλ from 200 repeats with n = 200 and error ε ∼ t(3) in Example 2. (a) and (b) are the corresponding boxplots of the estimators for nonzero coefficients, (c) and (d) are the corresponding boxplots of the estimators for zero coefficients. The horizontal black dotted lines denote the true values of β0j for j = 1, 2, . . . , 8. with the single-index component and the partially linear single-index model is appropriate for this data set. 6. Conclusions In this paper, we study the LMR method for robust estimation in PLSIM with heavy-tailed error distributions or outliers in the response variables, and the asymptotic normality has been established under some regular conditions. The resulting estimator is shown to be more efficient or at least as efficient as the popular LLR regardless of the error distribution. To select significant covariates for the parametric components, we also propose a variable selection procedure with SCAD penalty and show its oracle property by choosing the penalty parameter properly. Furthermore, a modified EM algorithm is provided for computation. Note that, here we only consider the error is homoscedastic, 16

3.2

3.2 3

3

2.8

2.8

2.6

2.6 2.4

2.4

2.2

2.2

2

2

1.8

1.8

1.6

1.6

1.4

1.4

1.2 beta1

beta2

beta5

beta1

beta2

(a) LLR

beta5

(b) LMR

0.25 0.05 0.2 0.15

0

0.1 −0.05

0.05 0

−0.1

−0.05 −0.15

−0.1 −0.15

−0.2 −0.2 −0.25

−0.25 beta3

beta4

beta6

beta7

beta8

beta3

(c) LLR

beta4

beta6

beta7

beta8

(d) LMR

Fig. 3. Boxplots of the estimator βˆλ from 200 repeats with n = 400 and error ε ∼ t(3) in Example 2. (a) and (b) are the corresponding boxplots of the estimators for nonzero coefficients, (c) and (d) are the corresponding boxplots of the estimators for zero coefficients. The horizontal black dotted lines denote the true values of β0j for j = 1, 2, . . . , 8. it needs further study when the model has heteroscedastic error. In addition, some irrelevant variables may appeared in the index part in many cases, which is also an interesting open question, we may consider some efficient dimension reduction approaches or penalized spline combined with delete-one-component method to investigate it based on modal regression in the future. Other potential extensions may be to varying-coefficient single-index model or generalized PLSIM. Acknowledgements This research was supported by the National Natural Science Foundation of China (Grant No. 11171361) and Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20110191110033).

17

Appendix A. For our theoretical results, we make for technical convenience the following imposed regularity conditions. (C1) U0 has a bounded support and its density function fU0 (·) is positive and has a continuous second derivative. Moreover, fU0 (·) is uniformly continuous for γ in a neighborhood of γ0 . (C2) The random vector X has bounded support. (C3) The kernel function K(·) ≥ 0 has a symmetric density function with bounded support and satisfies Lipschitz condition. (C4) The nonparametric link function g(·) has continuous second derivative. (C5) E{φ′h (ε) | U0 = u} = 0, E{φ′h (ε)3 | U0 = u}, E{φ′′h (ε)2 | U0 = u} and E{φ′′′ h (ε) | U0 = u} are continuous with respect to u. (C6) F (u, h) and G(u, h) are continuous with respect to u. (C7) For any h > 0, F (u, h) < 0. (C8) The penalty function pλj (t) satisfies the following condition lim inf lim inf p′λj (t)/λj > 0. + n→∞

t→0

Note that, conditions (C1)-(C2) are commonly used and easily satisfied in the literature of PLSIM. (C3)-(C4) are a commonly adopted condition for kernel functions in local polynomial estimation. (C5)-(C8) are usual necessary conditions used in local modal nonparametric regression in [25, 31, 15]. The imposed condition E{φ′h (ε) | U0 = u} = 0 is satisfied if the error density is symmetric about zero, which ensures that the proposed estimator is consistent. Let Xi∗ (U0 ) = {1, (U0i − u)/h1 , XiT }T , δ = (a, b, β T )T , δ0 = (a0 , b0 , β0T )T , where a0 = g(u) and b0 = g ′ (u) and ri (U0 ) = g(U0i ) − a0 − b0 (U0i − u). Then we have the following lemmas hold. Lemma 1. Under the conditions (C1)-(C7), if h1 → 0 and nh1 → ∞ as n → ∞, then n

1X Kh1 (U0i − u)φ′′h2 (εi ) n i=1 n



1X Kh1 (U0i − u)φ′′h2 (εi )ri (U0 ) n i=1

U0i − u h1 

j

U0i − u h1

= F (u, h2 )fU0 (u)µj + op (1),

j

1 = h21 F (u, h2 )fU0 (u)g ′′ (u)µj+2 2 + op (h21 ).

The proof of Lemma 1 is available from Lemma 1 in [25] and is thus omitted here. ˜ where Denote H = diag(1, h, 11×p ), δ1 = H1 δ, δ10 = H1 δ0 and δ˜1 = H1 δ, H1 = H |h=h1 . 18

Lemma 2. Under the conditions (C1)-(C7), with probability approaching 1, there exists a constant local maximizer δ˜ = (˜ a, ˜b, β˜T )T of (2) such that kδ˜1 − δ10 k = Op ((nh1 )−1/2 + h21 ), where k · k denotes the Euclidean distance.

Lemma 3. Let Θ be the bounded support of U0 , under the same conditions in Theorem 1, the following equation holds uniformly in u ∈ Θ, δ˜1 − δ10 =

1 Σ1 (u)−1 Wn + Op (ωn h21 + ωn2 log1/2 (1/h1 )), nfU0 (u)

√ where ωn = 1/ nh1 . The proofs of Lemma 2 and Lemma 3 can be referred to Lemma A.2 and Lemma A.4 in [31], respectively. √ Proof of Theorem 1. Let γˆ (0) be a initial n-consistent estimator of γ0 , T T ˜ ˜ ˜ recall that δ = (˜ a, b, β ) maximizes (2), combined with condition (C4), we must have δ˜1 = H1 δ˜ satisfy the following equation n X i=1

˜ )Kh1 (ZiT γˆ (0) − u)φ′h (εi + τ˜i ) = 0. Xi∗ (U 2

˜ 0 ), U ˜ )−(˜ ˜i −u)−X T (β−β ˜ = (U ˜1 , U ˜2 , · · · , U ˜n )T where τ˜i = ri (U a −a0 )−(˜b−b0 )(U i ˜i = Z T γˆ (0) . and U i Using the Taylor expansion, it follows that   n X 1 ′′′ ∗ 2 ′′ ′ ∗ ˜ τi + φh2 (εi )˜ Xi (U )Ki,h1 φh2 (εi ) + φh2 (εi )˜ τi = 0, (A.1) 2 i=1

˜i − u). Note that, τ˜i = where ε∗i lies in εi and εi + τ˜i and Ki,h1 = Kh1 (U ∗T ˜ ˜ ˜ ri (U ) − Xi (U )(δ1 − δ10 ), then, the second term on the right-hand side of (A.1) can be written as n X i=1

˜ )Xi∗ (U ˜) − Ki,h1 φ′′h2 (εi )ri (U

n X i=1

˜ )Xi∗T (U ˜ )(δ˜1 − δ10 ) Ki,h1 φ′′h2 (εi )Xi∗ (U

, C1 + C2 . (A.2) √ (0) ˜ → U0 and f ˜ (u) = fU0 (u)(1 + op (1)) by Since γˆ is n-consistent, then U U condition (C1). Using Lemma 1, we have 1 2 nh µ2 fU0 (u)Γ(u)g ′′ (u)(1 + op (1)) + op (nh21 ), 2 1 C2 = −nfU0 (u)Σ1 (u)(1 + op (1))(δ˜1 − δ10 ),

C1 =

where Γ(u) and Σ1 (u) are defined in Theorem 1. 19

On the other hand, we have kδ˜1 − δ10 k = Op ((nh1 )−1/2 + h21 ) from Lemma 2. Following the similar idea in the proof of Theorem 1 in [31], with some simple calculations for the third term on the right-hand side of (A.1), we obtain n X i=1

2 ˜i − u)φ′′′ (ε∗ )˜ Xi∗ Kh1 (U h2 i τi = op (C2 ).

(A.3)

Then, by (A.1)-(A.3), we have δ˜1 − δ10

=

1 Σ1 (u)−1 Wn (1 + op (1)) nfU0 (u) 1 + h21 µ2 Σ1 (u)−1 Γ(u)g ′′ (u)(1 + op (1)), 2

(A.4)

P ˜i − u)φ′ (εi ). ˜ )Kh1 (U where Wn = ni=1 Xi∗ (U h2 Note that, by the condition (C5) and the similar proof of Lemma 1, we can easily obtain E(Wn ) = 0 and Var(Wn ) = nh−1 1 fU0 (u)Σ2 (u)(1 + op (1)), where Σ2 (u) are defined in Theorem 1. Again following the proof of Theorem 1 in [31], we have d (A.5) Wn → N (0, nh−1 1 fU0 (u)Σ2 (u)). Consequently, by (A.1)-(A.5), the asymptotic normality of δ˜1 is followed by (7). This completes the proof. Proof of Theorem 2. Recall that βˆ maximizes (3), we must have βˆ satisfy the following equation n X Xi φ′h3 (εi − τˆi ) = 0. i=1

Using the Taylor expansion, it follows that   n X 1 ∗ 2 = 0, (ε )ˆ τ τi + φ′′′ Xi φ′h3 (εi ) − φ′′h3 (εi )ˆ 2 h3 i i i=1

(A.6)

where τˆi = (˜ a − a0 ) + XiT (βˆ − β0 ), ε∗i lies in εi − τˆi and εi . Then, the second term on the right-hand side of (A.6) is −

n X i=1

a − a0 ) − φ′′h3 (εi )Xi (˜

n X i=1

φ′′h3 (εi )Xi XiT (βˆ − β0 ) , M1 + M2 .

(A.7)

Note that, it is easy to obtain that M2 = −nQ(βˆ − β0 ). where Q = E{F (U0 , h3 )XX T }. 20

(A.8)

Let P = (1, 0, 0T )1×(2+p) , by Lemma 3, it follows that sup |˜ a(u) − a0 (u) − 1/nfU0 (u)P Σ1 (u)−1 Wn |

u∈Θ

= Op (ωn h21 + ωn2 log1/2 (1/h1 )).

(A.9)

˜ → U0 and f ˜ (u) = fU0 (u)(1+op (1)), from (A.8) and the assumptions Because U U nh41 → 0 and nh21 / log(1/h1 ) → ∞ as n → ∞, we have M1 = −

n n 1 X X  −1 f (U0i )φ′h2 (εj )φ′′h3 (εi )Xi P Σ1 (u)−1 (U0i )Kh1 (U0j − U0i ) n j=1 i=1 U0

(1 + op (1)) + op (1). Denote

M3 = −

n X

D(U0j ),

(A.10)

j=1

where D(U0j ) = φ′h2 (εj )R(U0j )P Σ1 (U0j )−1 C(Xj ), R(U0j ) = E{F (U0j , h3 )X}. Then, we can obtain p (A.11) D1 → D3 with some simple moments calculations. On the other hand, Using Lemma 3 and (A.9), we have

|ˆ τi | ≤ |˜ a − a0 | + |XiT (βˆ − β0 )| = Op (kβˆ − β0 k) and

|ˆ τi |2 = Op (kβˆ − β0 k2 ).

Therefore, we can get that

n

1 X ′′′ ∗ 2 τ = op (D2 ). φ (ε )ˆ 2 i=1 h3 i i

(A.12)

Consequently, combine (A.6)-(A.12), we have n X √ √ {Xi φ′h3 (εi ) − D(U0i )} + op (1). n(βˆ − β0 ) = ( nQ)−1 i=1

Consider that have

E{Xi φ′h3 (εi )

− D(U0i )}=0, by the Central Limit Theorem, we

√ d n(βˆ − β0 ) → N (0, Q−1 ΩQ−1 ),

where Ω is defined in Theorem 2. This completes the proof. ˆ then model (1) becomes the singleProof of Theorem 3. Let Y˜ = Y − X T β, √ index model, by the results that βˆ is n-consistent from Theorem 2 and X has 21

bounded support in condition (C3), we can proof the asymptotic normality of γˆ is (9) following the idea in the proof Theorem 1 in [15]. √ Proof of Theorem 4. Note that βˆ is n-consistent and X has bounded support, the asymptotic normality of a ˆ can be similarly obtained by following the ideas in the proof of Theorem 1, then we omit it. Proof of TheoremP5. Using the same notations in the of Theorem 2, Pproof n p recall that ℓλ (β) = i=1 φh3 Yi − a ˜(ZiT γˆ ) − XiT β − n j=1 pλj (|βj |). Under the conditions of Theorem 5, for any given βI satisfying kβI − β0I k = Op (n−1/2 ) and any constant C, from Lemma A.6 in [31], we have ℓλ ((βIT , 0T )) =

max

kβII k≤Cn−1/2

T ℓλ ((βIT , βII )).

λ Therefore, there exists βˆλ satisfying βˆII = 0 and

∂ℓλ (β) ˆλ T T T = 0 for j = 1, . . . , l, ∂βj ((βI ) ,0 )

which is equivalent to 0

= =

  1 ∗ 2 τi + φ′′′ Xi φ′h3 (εi ) − φ′′h3 (εi )ˆ (ε )ˆ τ − np′λj (|βˆjλ |)sgn(βjλ ) h3 i i 2 i=1   n X 1 ′′′ ∗ 2 ′′ ′ τi + φh3 (εi )ˆ Xi φh3 (εi ) − φh3 (εi )ˆ τi 2 i=1 n X

−n{p′λj (|β0j |)sgn(β0j ) + p′′λj (|β0j | + op (1))(βˆjλ − β0j )},

where the second equality holds by applying the Taylor expansion. Consequently, following the similar proof in Theorem 2, by the Central Limit Theorem and Slutsky’s Theorem, we have √ d n(Ql + Ψλ ){βˆIλ − β0I + (Ql + Ψλ )−1 Φλ } → N (0, Ωl ), where Ql and Ωl are defined in Theorem 5. This completes the proof. References References [1] G. Boente, D. Rodriguez, Robust estimates in generalized partially linear single-index models, Test 21 (2012) 386-411. [2] R. Carroll, J. Fan, I. Gijbels, M. Wand, Generalized partially linear singleindex models, Journal of the American Statistical Association 92 (1997) 477-489. 22

[3] I. Frank, J. Friedman, A statistical view of some chemometrics tools, Technometrics 35 (1993) 109-135. [4] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 (2001) 1348-1360. [5] J. Fan, R. Li, New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis, Journal of the American Statistical Association 99 (2004) 710-723. [6] Y. Fan, L. Zhu, Estimation of general semi-parametric quantile regression, Journal of Statistical Planning and Inference 143 (2013) 896-910. [7] T. Honda, Quantile regression in varying coefficient models, Journal of Statistical Planning and Inference 121 (2004) 113-125. [8] R. Jiang, Z. Zhou, W. Qian, W. Shao, Single-index composite quantile regression, Journal of the Korean Statistical Society 3 (2012) 323-332. [9] R. Koenker, G. Bassett, Regression quantiles, Econometrica 46 (1978) 3350. [10] R. Koenker, Quantile Regression, Cambridge University Press, 2005. [11] J. Li, S. Ray, B. Lindsay, A nonparametric statistical approach to clustering via mode identification, Journal of Machine Learning Research 8 (2007) 1687-1723. [12] R. Li, H. Liang, Variable selection in semiparametric regression modeling, The Annals of Statistics 36 (2008) 261-286. [13] H. Liang, X. Liu, R. Li, C. Tsai, Estimation and testing for partially linear single-index models, The Annals of Statistics 38 (2010) 3811-3836. [14] P. Lai, S. Guo, Variable Selection for Partially Linear Single-index Model, Journal of Computational Information Systems 8 (2012) 5917-5924. [15] J. Liu, R. Zhang, W. Zhao, Y. Lv, A robust and efficient estimation method for single-index models, Journal of Multivariate Analysis 122 (2013) 226238. [16] Y. Mack, B. Silverman, Weak and strong uniform consistency of kernel regression estimates, Probability Theory and Related Fields 61 (1982) 405415. [17] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society Series B 58 (1996) 267-288. [18] H. Wang, G. Li, G. Jiang, Robust regression shrinkage and consistent variable selection via the LAD-LASSO, Journal of Business and Economics Statistics 20 (2007) 347-355. 23

[19] Q. Wang, X. Yin, A nonlinear multi-dimensional variable selection method for high dimensional data: sparse MAVE, Computational Statistics and Data Analysis 52 (2008) 4512-4520. [20] J. Wang, L. Xue, L. Zhu, Y. Chong, Estimation for a partial-linear singleindex model, The Annals of Statistics 38 (2010) 246-274. [21] T. Wu, K. Yu, Y. Yu, Single-index quantile regression, Journal of Multivariate Analysis 101 (2010) 1607-1621. [22] K. Yu, M. Jones, Local linear quantile regression, Journal of the American Statistical Association 93 (1998) 228-237. [23] Y. Yu, D. Ruppert, Penalized spline estimation for partially linear singleindex models, Journal of the American Statistical Association 97 (2002) 1042-1054. [24] W. Yao, L. Li, A new regression model: Modal Linear Regression, Technical Report, Available at http://www-personal.ksu.edu/wxyao/. [25] W. Yao, B. Lindsay, R. Li, Local modal regression, Journal of Nonparametric Statistics 24 (2012) 647-663. [26] Y. Xia, H. Tong, W. Li, L. Zhu, An adaptive estimation of dimension reduction space, Journal of the Royal Statistical Society Series B 64 (2002) 363-410. [27] Y. Xia, W. H¨ ardle, Semi-parametric estimation of partially linear singleindex models, Journal of Multivariate Analysis 97 (2006) 1162-1184. [28] H. Zou, The adaptive Lasso and its oracle properties, Journal of the American Statistical Association 101 (2006) 1418-1429. [29] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B 67 (2006) 301-320. [30] J. Zhang, Y. Yu, L. Zhu, H. Liang, Partial linear single index models with distortion measurement errors, Annals of the Institute of Statistical Mathematics 65 (2013) 1-31. [31] R. Zhang, W. Zhao, J. Liu, Robust estimation and variable selection for semiparametric partially linear varying coefficient model based on modal regression, Journal of Nonparametric Statistics 25 (2013) 523-544. [32] W. Zhao, R. Zhang, J. Liu, Y. Lv, Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression, Annals of the Institute of Statistical Mathematics (2013) 1-27.

References 24

25

MN

t(3)

N(0,1)

Dist.

400

200

100

400

200

100

400

200

100

n

1.0172 1.0142 1.0148 0.9894 1.0104 1.0064 0.9819 0.9851 0.9852 0.9885 1.0081 1.0063

LLR LMR LLR LMR LLR LMR

Average 0.9861 0.9849 0.9911 0.9894 1.0034 1.0051

LLR LMR LLR LMR LLR LMR

LLR LMR LLR LMR LLR LMR

Method

-0.0181 (0.2552) -0.0149 (0.1427) -0.0148 (0.1465) -0.0115 (0.1168) 0.0081 (0.0896) 0.0063 (0.0577)

0.0172 (0.2579) 0.0147 (0.1496) 0.0148 (0.1521) -0.0106 (0.1124) 0.0104 (0.0921) 0.0064 (0.0728)

β1 Bias (SD) -0.0139 (0.1323) -0.0151 (0.1334) -0.0089 (0.1069) -0.0106 (0.1052) 0.0034 (0.0522) 0.0051 (0.0506)

Estimation of β0 in Example 1. The true β0 = (1, 0.5, 3)T .

Table 1

0.4744 0.4927 0.4838 0.4891 0.5102 0.5047

0.5001 0.5123 0.4770 0.4885 0.5122 0.5050

Average 0.4867 0.4881 0.4892 0.4879 0.5051 0.5048

-0.0256 (0.3127) -0.0073 (0.1123) -0.0162 (0.1811) -0.0109 (0.1076) 0.0102 (0.0970) 0.0047 (0.0515)

0.0001 (0.1852) -0.0132 (0.1495) 0.0230 (0.1683) -0.0115 (0.1201) 0.0122 (0.1357) 0.0050 (0.0713)

β2 Bias (SD) -0.0133 (0.1381) -0.0119 (0.1373) -0.0108 (0.1017) -0.0121 (0.1078) 0.0051 (0.0534) 0.0048 (0.0519)

2.9836 2.9792 2.9869 2.9912 2.9893 2.9949

2.9829 3.0089 2.9844 2.9823 2.9891 3.0062

Average 2.9870 2.9828 2.9899 2.9885 2.9956 2.9947

-0.0164 -0.0208 -0.0131 -0.0088 -0.0107 -0.0051

(0.2139) (0.2373) (0.1273) (0.1157) (0.1002) (0.0544)

-0.0171 (0.2181) 0.0089 (0.1311) 0.0156 (0.1558) -0.0107 (0.1196) -0.0109 (0.0972) 0.0062 (0.0737)

β3 Bias (SD) -0.0130 (0.1339) -0.0172 (0.1564) -0.0101 (0.1003) -0.0115 (0.1034) -0.0044 (0.0520) -0.0053 (0.0523)

0.1249 0.0738 0.0748 0.0363 0.0217 0.0130

0.1126 0.0865 0.0675 0.0407 0.0251 0.0166

(0.2035) (0.1661) (0.1253) (0.0615) (0.0824) (0.0306)

(0.1896) (0.1501) (0.1368) (0.0628) (0.0816) (0.0319)

GMSE (SD) 0.0569 (0.0831) 0.0584 (0.0912) 0.0294 (0.0536) 0.0279 (0.0541) 0.0109 (0.0178) 0.0117 (0.0166)

26

MN

t(3)

N(0,1)

Dist.

400

200

100

400

200

100

400

200

100

n

0.8267 0.8061 0.8285 0.8248 0.8157 0.8109 0.8317 0.8122 0.8118 0.8137 0.8129 0.8133

LLR LMR LLR LMR LLR LMR

Average 0.8220 0.8106 0.8107 0.8119 0.8169 0.8165

LLR LMR LLR LMR LLR LMR

LLR LMR LLR LMR LLR LMR

Method

0.0152 (0.1612) -0.0043 (0.0884) -0.0047 (0.0872) -0.0028 (0.0407) -0.0036 (0.0442) -0.0032 (0.0321)

0.0102 (0.1474) -0.0104 (0.0726) 0.0120 (0.0952) 0.0083 (0.0584) -0.0008 (0.0417) -0.0056 (0.0346)

γ1 Bias (SD) 0.0055 (0.0587) -0.0059 (0.0692) -0.0058 (0.0394) -0.0046 (0.0392) 0.0004 (0.0173) 0.0000 (0.0143)

0.3789 0.3917 0.3905 0.3977 0.3986 0.4030

0.3812 0.3974 0.3904 0.4018 0.3966 0.4028

Average 0.3934 0.3903 0.4032 0.4029 0.4041 0.4035

-0.0294 -0.0166 -0.0177 -0.0106 -0.0096 -0.0052

(0.2194) (0.1090) (0.1148) (0.0634) (0.0661) (0.0363)

0.0270 (0.2102) -0.0108 (0.0924) 0.0178 (0.1250) -0.0065 (0.0556) 0.0117 (0.0634) -0.0054 (0.0374)

γ2 Bias (SD) -0.0149 (0.1081) -0.0179 (0.1173) -0.0050 (0.0387) -0.0053 (0.0391) -0.0042 (0.0199) -0.0047 (0.0233)

Estimation of γ0 in Example 1. The true γ0 = (0.8165, 0.4082, 0.4082)T .

Table 2

0.3921 0.4041 0.3871 0.3974 0.3995 0.4023

0.3909 0.4081 0.3944 0.4009 0.4035 0.4042

Average 0.4012 0.4017 0.4138 0.4005 0.4055 0.4054

-0.0162 -0.0042 -0.0211 -0.0108 -0.0087 -0.0059

(0.1793) (0.0735) (0.0926) (0.0642) (0.0594) (0.0387)

-0.0174 (0.2004) -0.0002 (0.0711) 0.0138 (0.1122) -0.0074 (0.0563) -0.0048 (0.0571) -0.0040 (0.0318)

γ3 Bias (SD) -0.0071 (0.0728) -0.0066 (0.0696) 0.0055 (0.0405) -0.0077 (0.0411) -0.0027 (0.0183) -0.0028 (0.0190)

0.1706 0.1105 0.1250 0.0558 0.0329 0.0213

0.1510 0.1038 0.1006 0.0513 0.0530 0.0206

(0.1722) (0.1057) (0.1143) (0.0516) (0.0603) (0.0220)

(0.1884) (0.1005) (0.1168) (0.0607) (0.0626) (0.0299)

GMSE (SD) 0.1071 (0.1030) 0.1133 (0.1067) 0.0061 (0.0414) 0.0068 (0.0426) 0.0030 (0.0212) 0.0032 (0.0226)

0.2876 0.2123 0.1470 0.1276 0.1248 0.0857

0.2641 0.2060 0.1305 0.1239 0.1121 0.0835

(0.1127) (0.0919) (0.0866) (0.0674) (0.0532) (0.0443)

(0.1246) (0.1102) (0.1007) (0.0741) (0.0685) (0.0570)

RASE (SD) 0.1469 (0.0705) 0.1556 (0.0758) 0.1006 (0.0517) 0.1040 (0.0596) 0.0643 (0.0334) 0.0685 (0.0369)

27

MN

t(3)

N(0,1)

Dist.

400

200

100

400

200

100

400

200

100

n 0.0487 0.0539 0.0199 0.0217 0.0086 0.0083 0.0945 0.0690 0.0543 0.0408 0.0217 0.0138 0.1069 0.0631 0.0525 0.0359 0.0167 0.0129

LLR LMR LLR LMR LLR LMR

LLR LMR LLR LMR LLR LMR

(0.0970) (0.0773) (0.0592) (0.0455) (0.0204) (0.0121)

(0.0732) (0.0601) (0.0600) (0.0372) (0.0139) (0.0131)

(0.0418) (0.0445) (0.0189) (0.0235) (0.0093) (0.0094)

GMSE(β) (SD)

LLR LMR LLR LMR LLR LMR

Method

0.0356 0.0248 0.0173 0.0109 0.0072 0.0066

0.0451 0.0238 0.0196 0.0124 0.0111 0.0059

0.0192 0.0197 0.0081 0.0082 0.0042 0.0036

(0.0536) (0.0354) (0.0210) (0.0158) (0.0076) (0.0075)

(0.0485) (0.0267) (0.0216) (0.0107) (0.0101) (0.0052)

(0.0200) (0.0215) (0.0069) (0.0080) (0.0040) (0.0033)

GMSE(γ) (SD)

Variable selection of the parametric components in Example 2.

Table 3

0.4772 0.2761 0.2683 0.1941 0.1371 0.1148

0.5480 0.3219 0.3445 0.1713 0.1524 0.1274

0.2299 0.2386 0.1370 0.1402 0.1028 0.1065

(0.2179) (0.1315) (0.1622) (0.1027) (0.0750) (0.0446)

(0.1943) (0.1151) (0.1164) (0.0659) (0.0743) (0.0422)

(0.0796) (0.1032) (0.0500) (0.0518) (0.0356) (0.0371)

RASE (SD)

4.4500 4.8500 4.8700 4.9350 4.9500 4.9850

4.2650 4.6750 4.6700 4.8750 4.9350 4.9700

4.7350 4.7250 4.9150 4.9000 4.9950 4.9950

NC

0.0150 0 0.0050 0 0 0

0.0200 0 0.0050 0 0 0

0.0050 0 0 0 0 0

NIC

0.8650 0.8900 0.9300 0.9750 0.9850 0.9900

0.8300 0.8650 0.9250 0.9650 0.9750 0.9900

0.9150 0.9000 0.9750 0.9700 0.9950 0.9950

PC

Table 4

Estimation for diabetes data. Variable LLR Age 0.2140 (0.0201) index part BMI 0.8235 (0.0164) BP 0.4626 (0.0125) Sex 0 (-) TC -0.6398 (0.2434) LDL 0.3238 (0.2641) linear part HDL 0.1331 (0.1394) TCH 0.0426 (0.1507) CTG 0.4519 (0.2276) GLU 0 (-)

28

LMR 0.1332 (0.0143) 0.8648 (0.0176) 0.4800 (0.0103) 0 (-) -0.5208 (0.1438) 0.4821 (0.1822) 0.1067 (0.1253) 0 (-) 0.5350 (0.1518) 0 (-)

1.5 1.0 0.5

Estimation of g(.)

0.0 −0.5 −3

−2

−1

0

1

2

3

4

2

3

4

Estimated index

0.5 0.0 −0.5

Estimation of g(.)

1.0

(a) LLR

−3

−2

−1

0

1

Estimated index

(b) LMR

Fig. 4. The local linear estimators of the link function g(·) (solid line) against the estimated index, along with the associated 95% pointwise confidence intervals (dotted lines). (a) is the estimation based on LLR method; (b) is the estimation based on LMR method. 29