Regularization methods for high-dimensional sparse control function models

Xinyi Xu, Xiangjie Li, Jingxiao Zhang



Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China

Article history: Received 29 May 2018; Received in revised form 21 September 2019; Accepted 21 September 2019; Available online xxxx.

Keywords: Endogeneity; Regularization; Control function; Consistency of estimation; Model selection; Penalized least squares

Abstract

Traditional penalty-based methods might not achieve variable selection consistency when endogeneity exists in high-dimensional data. In this article we construct a regularization framework based on the two-stage control function model, the so-called regularized control function (RCF) method, to estimate important covariate effects, select key instruments, and replace the CF-based hypothesis test with variable selection in order to identify truly endogenous predictors. Under appropriate conditions, we establish theoretical properties of the RCF estimators, including consistency of coefficient estimation and model selection. Simulation results confirm that the RCF method is effective and superior to its main competitor, the penalized least squares (PLS) method. The proposed method also provides insightful and interpretable results in a real data analysis.

1. Introduction

Modern scientific research allows for data whose number of predictors p is increasingly large relative to the sample size n. Many high-dimensional regularization methods of the form ''loss function + penalty'' have been well established under the assumptions of sparsity and exogeneity. The term ''exogeneity'', borrowed from econometrics, means that none of the regressors are correlated with the regression error, and it guarantees consistent selection of the important regressors. In both applied econometrics and medical research, however, endogeneity (correlation between the unobservable error and the explanatory variables) easily arises from a large pool of predictors (Angrist and Keueger, 1991; Guo and Small, 2016). For conventional penalized methods such as the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), and MCP (Zhang, 2010), such endogeneity causes inconsistency of variable selection and coefficient estimation, and possibly false scientific discoveries (Fan and Liao, 2014).

A natural idea for addressing the endogeneity problem in high-dimensional models is to borrow the instrumental variable (IV) technique from econometrics. For example, Belloni et al. (2012) apply Lasso-type methods to the IV model to estimate optimal instruments. Gautier and Tsybakov (2018) present an inference procedure based on the self tuning instrumental variables (STIV) estimator. Fan and Liao (2014) construct a penalized focused generalized method of moments (FGMM) estimator which applies the IV method and achieves the oracle property. Lin et al. (2015) propose a two-stage regularization (2SR) framework based on the classical two-stage least squares (2SLS) estimator. Belloni et al. (2018) focus on generalized method of moments problems and discuss cases including linear instrumental variable models as econometric applications.

Funding: Supported by the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities (16JJD910002) and the Outstanding Innovative Talents Cultivation Funded Programs 2018 of Renmin University of China.
Corresponding author: J. Zhang, [email protected].


All the existing methods, however, are not able to identify endogenous regressors, and thus have difficulty determining whether a predictor of interest is truly endogenous. In this article, we extend the control function (CF) method, which is inherently an IV method widely implemented in econometrics, to the high-dimensional sparse setting. In addition to dealing with endogeneity, the CF approach leads to a robust, regression-based Hausman test of whether the suspected endogenous variables are actually endogenous (Wooldridge, 2010). In high-dimensional models, where the classical test fails because the equation is no longer identifiable, we transform this test into a variable selection problem by imposing penalty functions in both stages of the CF model. Theoretical analysis, simulation results and a real data analysis all verify that, under appropriate assumptions, the proposed regularized control function (RCF) estimators achieve consistency of variable selection and coefficient estimation, while identifying optimal instruments and the truly (or most) endogenous predictors. In fact, most of the literature focuses on the endogeneity of one or a few key variables, or on settings where most of the endogeneity is contained in only a small subset of the predictors. Exploiting exact or approximate sparsity of the endogenous variables is therefore reasonable, and sheds light on the actual endogenous structure.

The remainder of this article is organized as follows. In Section 2, we propose the regularization framework for high-dimensional sparse CF models. In Section 3, we investigate theoretical properties of the RCF estimators. Section 4 reports simulation results. Section 5 gives a real data analysis. In Section 6, we discuss possible directions of future development for the RCF method. Section 7 concludes the paper. All technical proofs are given in the Supplementary Materials.

2. Regularization method for control function model

In this section, we first introduce the two-stage control function (2SCF) model with the corresponding Hausman test (Wooldridge, 2010). Then we consider the control function model in the high-dimensional sparse setting, and finally impose a regularization framework on both stages to form the RCF method.

2.1. Two-stage control function model

Consider the structural equation, i.e., the model representing a causal relationship in econometrics,

y = β^T x + u,    (1)

where x = (x_1, . . . , x_p)^T and u is the structural random error. The key condition needed for OLS to consistently estimate β is that

E(u) = 0,  Cov(x_j, u) = 0,  for j = 1, . . . , p.

An explanatory variable x_j is said to be endogenous if it is correlated with the random error u. Endogeneity arises in many ways: measurement error of regressors, simultaneity between the response and predictors, unmeasured confounders affecting both the outcome and factors, and so on. Typically, constructing a valid CF relies on the availability of valid instrumental variables z. The basic idea of the CF method is to use the IV to split the unobservable error into two parts, one correlated and one uncorrelated with the regressors. The linear projection of x_j onto all the instruments z is called a reduced-form equation for the endogenous x_j, which is

x_j = γ_j^T z + v_j,    (2)

where γ_j ∈ R^{q×1}, z ∈ R^{q×1}, and v_j is the reduced random error with E(v_j) = 0 for j = 1, . . . , p. Let v = (v_1, . . . , v_p)^T. There are several assumptions for z to be valid instruments: (1) E(z^T u) = 0, implying the exogeneity of z; (2) the rank condition for identification, that z is partially correlated with at least one exogenous variable x_j; (3) E(z^T v_j) = 0 for j = 1, . . . , p, implying that z is independent of v; (4) no perfect collinearity among the components of z. Endogeneity of x_j arises if and only if u is correlated with v_j for j = 1, . . . , p. Write the linear projection of u on v in error form as

u = ρ^T v + e,    (3)

where ρ = (ρ_1, . . . , ρ_p)^T is the population regression coefficient. By definition, E(v_j e) = 0 and thus E(z^T e) = 0, because u and v_j are both uncorrelated with z for j = 1, . . . , p. Therefore, plugging Eq. (3) into the structural equation (1) gives a valid equation

y = β^T x + ρ^T v + e.    (4)

As e is uncorrelated with x and v, we can consistently estimate β and ρ by running an OLS regression of y on x and v. In effect, including v as additional regressors in Eq. (4) ''controls for'' the endogeneity of x. We can regard v as a proxy for the factors in u that are correlated with x. However, we cannot observe v; we only have observations of x and z. The solution is to replace v with the reduced-form residuals v̂ = (v̂_1, . . . , v̂_p)^T, where v̂_j = x_j − γ̂_j^T z is the residual from Eq. (2) for j = 1, . . . , p, to get

y = β^T x + ρ^T v̂ + ε,    (5)


where ε = ρ^T(v − v̂) + e. A two-stage control function (2SCF) model is thus established: first, run OLS regressions of x on z to obtain the residuals v̂; second, include v̂ as augmented regressors and estimate Eq. (5). As we can obtain consistent estimates of γ_j in Eq. (2), the OLS estimators from Eq. (5) will be consistent for β and ρ. When p ≪ n (in most instances p = 1), it is straightforward to use (5) to test the null hypothesis H_0: ρ_j = 0, since the usual t statistic is asymptotically valid under H_0 in the classical setting. The inclusion of v̂ hence serves another valuable purpose: it produces a Durbin–Wu–Hausman (DWH) test of H_0: ρ_j = 0, i.e., that the single x_j is actually exogenous (Wooldridge, 2010). In high-dimensional settings the DWH test has low power, and Guo et al. (2018) propose an improved endogeneity test to remedy this issue for a univariate variable.

2.2. High-dimensional sparse 2SCF model

Suppose we have n independent observations of (y, x, z), denoted by y ∈ R^{n×1}, X ∈ R^{n×p} and Z ∈ R^{n×q}. In the high-dimensional setting, where p and q can both be much larger than n, we are interested in making inference for the two-stage control function model:

X = ZΓ_0 + V,
y = Xβ_0 + Vρ_0 + e,    (6)

where Γ_0 = (γ_01, . . . , γ_0p) ∈ R^{q×p} is the coefficient matrix of the instruments with γ_0j ∈ R^{q×1} for j = 1, . . . , p, V = (v_1·^T, . . . , v_n·^T)^T ∈ R^{n×p} is the reduced random error with mean 0, β_0 ∈ R^{p×1} is the coefficient of the predictors, ρ_0 ∈ R^{p×1} is the coefficient of the augmented regressors, and e ∈ R^{n×1} is uncorrelated with X and V. The structural random error u ∈ R^{n×1} is represented as u = Vρ_0 + e. Estimators from the 2SCF method become extremely biased and invalid as p and q increase. As is typical in high-dimensional sparse modeling, we assume that both stages in model (6) are sparse, meaning that only a small subset of the regression coefficients in β_0, Γ_0 and ρ_0 are nonzero. In this sense, the endogenous variables in the first stage can be well approximated by a relatively small subset of instruments, and the majority of the endogeneity is contained in a few variables, with the rest too small to matter. In addition to selecting and estimating important covariate effects, we also regard the identification of optimal instruments as a variable selection problem (Belloni et al., 2012). In the high-dimensional sparse situation, the test of variable endogeneity becomes invalid following the failure of the traditional 2SCF estimators. Since the nonzero components of ρ_0 are supposed to correspond to the actually endogenous parts of X, we cast the identification of truly endogenous variables as a variable selection problem instead of a hypothesis testing problem. Our goal is therefore to identify and estimate the nonzero coefficients in β_0, Γ_0 and ρ_0.

2.3. Two-stage regularization and numerical implementation

We apply regularization methods in both stages of the 2SCF model to conduct more effective coefficient estimation and variable selection. In the first stage, we aim at identifying and estimating the nonzero coefficients of the instruments and obtaining the residuals V̂. The first-stage regularized estimator is defined through p penalized least squares problems:

γ̂_j = arg min_{γ_j ∈ R^q} { (1/2n) ∥x_j − Zγ_j∥_2^2 + Σ_{i=1}^q p_{λ_j}(|γ_ij|) },    (7)

where x_j ∈ R^{n×1} is the jth column of X, γ_j = (γ_1j, . . . , γ_qj)^T ∈ R^{q×1} for j = 1, . . . , p, p_{λ_j}(·) is a penalty function that promotes sparsity, and λ_j > 0 is the tuning parameter controlling the regularization strength for γ_j. Let Γ̂ = (γ̂_1, . . . , γ̂_p) ∈ R^{q×p}. Then we have V̂ = X − ZΓ̂.

We next substitute V̂ for V in the second stage of (6), and use penalties to identify and estimate the nonzero coefficients of predictors and control variables. The second-stage regularized estimator is defined as

(β̂^T, ρ̂^T)^T = arg min_{β, ρ ∈ R^p} { (1/2n) ∥y − Xβ − V̂ρ∥_2^2 + Σ_{j=1}^p p_{µ_1}(|β_j|) + Σ_{j=1}^p p_{µ_2}(|ρ_j|) },    (8)

where β_j and ρ_j are the jth components of β and ρ, p_{µ_1}(·) and p_{µ_2}(·) are penalty functions, and µ_1, µ_2 > 0 are tuning parameters controlling the regularization strength for β and ρ. Without loss of generality, we take µ_1 and µ_2 to be the same value µ in the computation and theoretical study, for simplicity. In this way we acquire the estimators for Γ_0, β_0 and ρ_0.

For the specific form of p_λ(t), t ≥ 0, we use the Lasso penalty p_λ(t) = λt, the SCAD penalty p_λ(t) = λ ∫_0^t {I(θ ≤ λ) + I(θ > λ)(aλ − θ)_+/[(a − 1)λ]} dθ with a > 2, and the MCP penalty p_λ(t) = ∫_0^t {(aλ − θ)_+/a} dθ with a > 1, in the later simulations and real data analysis. SCAD and MCP are asymptotically unbiased penalties with hyperparameters (a, λ). These penalty functions are the ones most commonly applied in variable selection, and their properties have been well established in high-dimensional sparse models.

We employ the coordinate descent algorithm to numerically minimize the penalized problems (7) and (8). This algorithm was developed by Fu (1998), Friedman et al. (2010) and Mazumder et al. (2011), and appears to be the fastest for computing regularization paths for a variety of loss functions. Taking (7) for instance, the iterative coordinate algorithm uses the univariate soft-thresholding operator S(·, λ) to update one coordinate of γ_j at a time, with the other coordinates kept fixed at their previous values, and successively updates each coordinate until convergence.
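To make the two-stage procedure concrete, the following is a minimal sketch of the RCF estimator with the Lasso penalty. Scikit-learn's cross-validated Lasso is used here as a stand-in for the coordinate descent solver of (7) and (8); the function name and the default tuning grids are our own illustrative choices, not part of the original implementation.

```python
# Minimal sketch of the two-stage RCF estimator with the Lasso penalty.
# Z: n x q instruments, X: n x p predictors, y: n-vector response.
import numpy as np
from sklearn.linear_model import LassoCV

def rcf_lasso(Z, X, y, cv=10):
    n, p = X.shape
    # Stage 1: p penalized regressions of each x_j on Z, as in (7),
    # each with its own cross-validated tuning parameter lambda_j.
    Gamma_hat = np.zeros((Z.shape[1], p))
    V_hat = np.empty_like(X)
    for j in range(p):
        fit1 = LassoCV(cv=cv, fit_intercept=False).fit(Z, X[:, j])
        Gamma_hat[:, j] = fit1.coef_
        V_hat[:, j] = X[:, j] - Z @ fit1.coef_   # reduced-form residuals
    # Stage 2: penalized regression of y on the augmented design (X, V_hat), as in (8).
    A = np.hstack([X, V_hat])
    fit2 = LassoCV(cv=cv, fit_intercept=False).fit(A, y)
    beta_hat, rho_hat = fit2.coef_[:p], fit2.coef_[p:]
    return Gamma_hat, beta_hat, rho_hat
```

Nonzero entries of rho_hat then flag the predictors estimated to be truly endogenous, mirroring the variable-selection replacement of the DWH test described above.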


Algorithm 1: Coordinate descent algorithm and tuning parameter selection

Step 1. For each λ_j, j = 1, . . . , p, input a grid of L increasing values Λ_j = {λ_1j, . . . , λ_Lj}.
  (a) For each value λ_lj with l ∈ {L, L − 1, . . . , 1}, repeat the following:
      i. Initialize γ̃_j(λ_lj) = 0 or some warm start.
      ii. For i = 1, . . . , q, cycle through the one-at-a-time updates
          γ̃_ij(λ_lj) = S( n^{-1} (x_j − Σ_{k≠i} γ̃_kj z_k)^T z_i , λ_lj ),
          until the updates converge to some γ̂_j(λ_lj).
      iii. Decrement l.
  (b) Return the solution path {γ̂_j(λ_lj) : λ_lj ∈ Λ_j}.
  (c) Define the K-fold cross-validation error for λ by
          CV(λ) = (1/K) Σ_{k=1}^K ∥x_j^{(k)} − Z^{(k)} γ̂_j^{(−k)}(λ)∥_2^2,
      where the superscript (k) corresponds to the kth part of the sample and (−k) stands for the estimate obtained with the kth part removed. Determine the final value of λ_j as λ_j = arg min_{λ_lj ∈ Λ_j} CV(λ_lj) and the final γ̂_j accordingly.

Step 2. Once Γ̂ = (γ̂_1, . . . , γ̂_p) is obtained, compute V̂ = X − ZΓ̂ and set A = (X, V̂) = (a_1, . . . , a_2p) and b = (β^T, ρ^T)^T = (b_1, . . . , b_2p)^T.

Step 3. For µ, input a grid of increasing values U = {µ_1, . . . , µ_M}.
  (a) For each value µ_m with m ∈ {M, M − 1, . . . , 1}, repeat the following:
      i. Initialize b̃(µ_m) = 0 or some warm start.
      ii. For j = 1, . . . , 2p, cycle through the one-at-a-time updates
          b̃_j(µ_m) = S( n^{-1} (y − Σ_{k≠j} b̃_k a_k)^T a_j , µ_m ),
          until the updates converge to some b̂(µ_m).
      iii. Decrement m.
  (b) Return the solution path {b̂(µ_m) : µ_m ∈ U}.
  (c) Define the K-fold cross-validation error for µ by
          CV(µ) = (1/K) Σ_{k=1}^K ∥y^{(k)} − A^{(k)} b̂^{(−k)}(µ)∥_2^2.
      Determine the final value of µ as µ = arg min_{µ_m ∈ U} CV(µ_m) and the final b̂ accordingly.
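As an illustration of Step 1(c), the following sketch computes the K-fold cross-validation error over a grid of λ values for a single first-stage regression. Scikit-learn's Lasso stands in for the coordinate descent updates; the helper name, the grid and K are illustrative assumptions rather than prescribed choices.

```python
# Sketch of the K-fold CV criterion CV(lambda) of Algorithm 1, Step 1(c),
# for one first-stage regression x_j ~ Z.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_error_path(Z, x_j, lambda_grid, K=10, seed=0):
    folds = KFold(n_splits=K, shuffle=True, random_state=seed)
    cv = np.zeros(len(lambda_grid))
    for train, test in folds.split(Z):
        for l, lam in enumerate(lambda_grid):
            # gamma_hat^{(-k)}(lambda): fit with the k-th part removed
            fit = Lasso(alpha=lam, fit_intercept=False).fit(Z[train], x_j[train])
            # ||x_j^{(k)} - Z^{(k)} gamma_hat^{(-k)}(lambda)||_2^2, averaged over folds
            cv[l] += np.sum((x_j[test] - Z[test] @ fit.coef_) ** 2) / K
    return cv

# lambda_j is the grid value minimising CV(lambda), e.g.:
# lam_j = lambda_grid[np.argmin(cv_error_path(Z, X[:, j], lambda_grid))]
```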

S(·, λ) is defined separately for each penalty. For the Lasso penalty,

S(t, λ) = sgn(t)(|t| − λ)+.

For the SCAD penalty,

S(t, λ) = sgn(t)(|t| − λ)+                                  if |t| ≤ 2λ,
        = sgn(t)[|t| − λa/(a − 1)] / [1 − 1/(a − 1)]         if 2λ < |t| ≤ aλ,
        = t                                                  if |t| > aλ.

For the MCP penalty,

S(t, λ) = sgn(t)(|t| − λ)+ / (1 − 1/a)                       if |t| ≤ aλ,
        = t                                                  if |t| > aλ,

where a is an additional tuning parameter, with a > 2 required for SCAD and a > 1 for MCP. Coordinate descent algorithms typically input a grid of increasing tuning parameter values with warm starts from nearby solutions, and then update the solutions as the regularization parameter changes, so as to produce a solution path and move toward a local minimum. We adopt the K-fold cross-validation error defined in Algorithm 1 as the criterion to select the optimal tuning parameters, where K depends on the sample size. The complete procedure of the coordinate descent algorithm and tuning parameter selection is summarized in Algorithm 1. Our simulation results demonstrate that the algorithm indeed reaches the desired sparse estimator.
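The following small sketch mirrors the three thresholding operators displayed above; only the function names and the default values of a are our own choices (a = 3.7 for SCAD and a = 3 for MCP are conventional illustrations, not values fixed by the paper).

```python
# Univariate thresholding operators S(t, lambda) for Lasso, SCAD (a > 2) and MCP (a > 1).
import numpy as np

def s_lasso(t, lam):
    return np.sign(t) * max(abs(t) - lam, 0.0)

def s_scad(t, lam, a=3.7):
    if abs(t) <= 2 * lam:
        return np.sign(t) * max(abs(t) - lam, 0.0)
    if abs(t) <= a * lam:
        return np.sign(t) * (abs(t) - a * lam / (a - 1)) / (1 - 1 / (a - 1))
    return t

def s_mcp(t, lam, a=3.0):
    if abs(t) <= a * lam:
        return np.sign(t) * max(abs(t) - lam, 0.0) / (1 - 1 / a)
    return t
```

Plugging any of these into the one-at-a-time updates of Algorithm 1 yields the corresponding penalized coordinate descent solver.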





3. Theoretical properties

We first introduce some notation. Throughout this paper, let ∥A∥_1 = max_j Σ_i |a_ij| and ∥A∥_∞ = max_i Σ_j |a_ij| for any matrix A = (a_ij). For any vector α and index set J, denote by α_J the subvector formed by the jth components of α with j ∈ J.


For any matrix A and index sets I, J, denote by A_IJ the submatrix formed by the (i, j)th entries of A with i ∈ I and j ∈ J. Let J^c denote the complement of J and |J| the number of elements in J. Define the support of a vector α as supp(α) = {j : α_j ≠ 0} and ∥α∥_0 = |supp(α)|. Let S_1 = supp(β_0), S_2 = supp(ρ_0) and S = supp[(β_0^T, ρ_0^T)^T]. Define the sparsity levels as r = max_{j=1,...,p} |supp(γ_0j)|, s_1 = |S_1| and s_2 = |S_2|, where γ_0j is the jth column of Γ_0.

3.1. Inconsistency of PLS with endogeneity

First we consider the penalized least squares (PLS) method, which has been proved effective and selection-consistent in high-dimensional sparse regression under the exogeneity assumption. The PLS estimator is defined as

β̂ = arg min_{β ∈ R^p} { (1/2n) ∥y − Xβ∥_2^2 + Σ_{j=1}^p p_µ(|β_j|) }.    (9)

Theorem 2.2 of Fan and Liao (2014) reveals that the PLS no longer works without the possibly ad hoc exogeneity assumption. In particular, in the simple case when s_1, the number of nonzero components of β_0, is bounded, if some regressors are correlated with the regression error, the PLS does not achieve variable selection consistency. Note that in the conditions of Theorem 2.2 of Fan and Liao (2014), the indices of the endogenous variables need not be indices of important regressors. Hence the consistency of PLS fails even if the endogeneity is present only in the unimportant regressors, which agrees with our simulation results in Section 4.3.

3.2. Theoretical properties of RCF estimators

We now concentrate on the theoretical performance of the RCF estimators when (n, p, q) are all allowed to diverge to infinity. We mainly discuss the properties under the representative Lasso penalty; the conclusions can be generalized to a large class of penalties.

Our first assumption guarantees the validity of Z and specifies the distribution of the random errors, where (1) is stronger than E(Z^T u) = 0 and E(Z^T V) = 0.

Assumption 1. The instruments Z, structural error u and reduced error V satisfy
(1) E[(V, u) | Z] = 0.
(2) (v_k·^T, u_k) | Z ∼ MVN(0, Σ) for k = 1, . . . , n, where (v_k·^T, u_k) ∈ R^{p+1} and Σ = (σ_ij)_{i,j=1,...,p+1}.

The next is the key assumption, the restricted eigenvalue (RE) assumption proposed by Bickel et al. (2009), which guarantees nice statistical properties of the Lasso selection.

Assumption 2. For a matrix A ∈ R^{n×m} and some integer t, 1 ≤ t ≤ m, define the ''restricted eigenvalue'' as

k(A, t) = min_{J ⊆ {1,...,m}, |J| ≤ t}  min_{δ ≠ 0, ∥δ_{J^c}∥_1 ≤ 3∥δ_J∥_1}  ∥Aδ∥_2 / (√n ∥δ_J∥_2).

Then the instruments Z ∈ R^{n×q} and the augmented regressors (X, V) ∈ R^{n×2p} satisfy
(1) There exists k_1 > 0 such that k(Z, r) ≥ k_1.
(2) There exists k_2 > 0 such that k((X, V), s) ≥ k_2.

Remark 1. Assumption 2 depicts the ''restricted'' positive definiteness of Z and (X, V). The integer r or s here plays the role of an upper bound on the sparsity of the coefficient Γ_0 or (β_0^T, ρ_0^T)^T, where s = s_1 + s_2. Under such sparsity, we next consider the parameter space with ∥Γ_0∥_1 ≤ L, ∥β_0∥_1 ≤ M_1 and ∥ρ_0∥_1 ≤ M_2 for some constants L, M_1, M_2 > 0.

The third assumption is the strong irrepresentable condition proposed by Zhao and Yu (2006), which is almost necessary and sufficient for the Lasso to select the true model.

Assumption 3. Recall that S = supp[(β_0^T, ρ_0^T)^T]. Then the covariance of the augmented regressors, C = n^{-1}(X, V)^T(X, V), satisfies
(1) C_SS is invertible with φ = ∥(C_SS)^{-1}∥_∞.
(2) There exists a constant 0 < η ≤ 1 such that ∥C_{S^c S}(C_SS)^{-1}∥_∞ ≤ 1 − η.

Remark 2. Assumption 3 states that the Lasso selects the true model consistently if and (almost) only if the unimportant predictors are ''irrepresentable'' by the important predictors, which requires that important and unimportant explanatory variables not be strongly correlated.
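For a given design and support set, Assumption 3(2) can be checked numerically. The following sketch computes the quantity ∥C_{S^c S}(C_SS)^{-1}∥_∞ for an augmented design (X, V); the helper name is ours and the check is purely illustrative.

```python
# Numeric check of the strong irrepresentable condition (Assumption 3).
import numpy as np

def irrepresentable_norm(X, V, S):
    A = np.hstack([X, V])                        # augmented regressors (X, V)
    n = A.shape[0]
    C = A.T @ A / n                              # C = n^{-1} (X, V)^T (X, V)
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(A.shape[1]), S)
    CSS_inv = np.linalg.inv(C[np.ix_(S, S)])     # (C_SS)^{-1}, Assumption 3(1)
    M = C[np.ix_(Sc, S)] @ CSS_inv               # C_{S^c S} (C_SS)^{-1}
    # ||.||_inf is the maximum absolute row sum, as defined in Section 3
    return np.max(np.sum(np.abs(M), axis=1))

# Assumption 3(2) requires this value to be at most 1 - eta for some eta in (0, 1].
```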


The estimation and prediction quality of the first-stage estimator Γ̂ is characterized by Theorem 1 of Lin et al. (2015), based on Assumptions 1 and 2(1). We give nonasymptotic bounds for the estimation and prediction loss of the second-stage estimator (β̂^T, ρ̂^T)^T in the Supplementary Materials. From these nonasymptotic loss bounds, we derive sufficient conditions for the RCF estimators to achieve estimation consistency in the following theorems.

Theorem 1. Suppose r^2 s^2 (log p + log q) = o(n). Under Assumptions 1 and 2, there exist regularization parameters {λ_{n,j}}_{j=1}^p = o(1) and µ_n = o(1), which are functions of (n, p, q), such that the regularized estimators β̂ and ρ̂ defined by Eq. (8) with the L_1 penalty satisfy

P(∥β̂ − β_0∥_1 + ∥ρ̂ − ρ_0∥_1 → 0) → 1,

and

P(n^{-1/2} ∥X(β̂ − β_0) + V̂(ρ̂ − ρ_0)∥_2 → 0) → 1.

We now discuss the model selection consistency of β̂ and ρ̂. Sign consistency is stronger than the usual selection consistency, which only requires the zeros to be matched, but not the signs. The irrepresentable condition avoids dealing with situations where a model is estimated with matching zeros but reversed signs, which can be misleading and hardly qualifies as a correctly selected model (Zhao and Yu, 2006). Thus the model selection consistency of β̂ and ρ̂ is established in the form of sign consistency in the following theorem. Denote by sgn(·) the sign function.

Theorem 2. Suppose r^2 s^2 (log p + log q) = o(n). Under Assumptions 1, 2 and 3, there exist regularization parameters {λ_{n,j}}_{j=1}^p = o(1) and µ_n = o(1), which are functions of (n, p, q), such that

P{∃ (β̂^T, ρ̂^T)^T such that sgn[(β̂^T, ρ̂^T)^T] = sgn[(β_0^T, ρ_0^T)^T]} → 1,

where (β̂^T, ρ̂^T)^T is a local minimizer of Eq. (8) with the L_1 penalty.

Compared with the PLS estimators, in the simpler case when s and r are bounded, the RCF still achieves estimation and variable selection consistency as long as log p + log q = o(n), even if some regressors are correlated with the regression error. The detailed proof of Theorems 1 and 2, with the specific functional forms of {λ_{n,j}}_{j=1}^p and µ_n, is provided in the Supplementary Materials. The generalization to a generic class of penalty functions parallels that of Lin et al. (2015); due to space limitations, we do not fully explore this direction here. Since the Lasso penalty is biased and does not possess the oracle property with unchanged tuning parameters (Zou, 2006), it is reasonable to conjecture that for an asymptotically unbiased penalty (e.g. SCAD, MCP), estimation and selection consistency can be achieved under weaker conditions than those for the Lasso.

4. Simulation studies

In this section, we conduct a series of simulations to verify the validity of the proposed RCF method. The PLS and RCF methods are carried out separately for comparison. The Lasso, SCAD and MCP penalties are applied to contrast the performance of the two approaches.

4.1. Example 1: Strong instrumental variables, endogeneity in both important and unimportant regressors

To test the performance of the PLS and RCF methods, we simulate the linear model

X = ZΓ_0 + V,
y = Xβ_0 + u,    (10)

with (n, p, q) = (200, 100, 100) and (300, 600, 600), standing for different magnitude relations between n and (p, q). The error terms u and V are generated by sampling the (p + 1)-vector (v_k·^T, u_k), k = 1, . . . , n, independently from MVN(0, Σ), where Σ = (σ_ij) with σ_ij = (0.2)^{|i−j|} for i, j = 1, . . . , p, σ_{p+1,p+1} = 1 and σ_{j,p+1} to be specified later. Entries of Z are sampled independently from N(1, 1). β_0 is generated by sampling s_1 = 5 nonzero components from N(b, 0.1) with P(b = 1) = P(b = −1) = 0.5. The columns of Γ_0 are generated by sampling r = 5 nonzero entries from N(a, 0.1) with P(a = 1) = P(a = −1) = 0.5. Given u, V, Z, β_0 and Γ_0, the covariates X and the response y are generated according to (10). The value of a represents the correlation between X and Z, i.e., the strength of the instruments; higher a means more informative Z. For the σ_{j,p+1}'s, j = 1, . . . , p, we set s_0 = 10 nonzero ones to 0.3, where five correspond to the nonzero components of β_0 and five are sampled randomly from the remaining locations. The σ_{j,p+1} = ρ_0j's here quantify the degree of endogeneity. In this way, five important and five unimportant regressors are endogenous.

Four performance measures are used to compare the methods: (1) the L_1 estimation loss of β_0, ∥β̂ − β_0∥_1; (2) the L_2 prediction loss of Xβ_0, n^{-1/2}∥X(β̂ − β_0)∥_2; (3) the true positive rate, TPR = TP/s_1; and (4) the Matthews correlation coefficient, MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)), where TP, TN, FP and FN stand for the numbers of correctly selected nonzero, correctly zero, incorrectly nonzero and incorrectly zero coefficients, respectively.
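The following sketch generates data following the design of Example 1 and evaluates the TPR and MCC measures. Dimensions and constants follow the text; the helper names, the interpretation of the N(b, 0.1) scale as a standard deviation, and the seeding are our own illustrative assumptions.

```python
# Sketch of the data-generating process of Example 1 and of the TPR/MCC measures.
import numpy as np

def generate_example1(n=200, p=100, q=100, r=5, s1=5, s0=10, a_abs=1.0, rho_endo=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Covariance of (v_k., u_k): sigma_ij = 0.2^|i-j| for i, j <= p, sigma_{p+1,p+1} = 1,
    # and s0 entries sigma_{j,p+1} = rho_endo introduce endogeneity.
    Sigma = np.zeros((p + 1, p + 1))
    idx = np.arange(p)
    Sigma[:p, :p] = 0.2 ** np.abs(np.subtract.outer(idx, idx))
    Sigma[p, p] = 1.0
    beta0 = np.zeros(p)
    important = rng.choice(p, s1, replace=False)
    beta0[important] = rng.choice([1.0, -1.0], s1) + 0.1 * rng.standard_normal(s1)
    others = rng.choice(np.setdiff1d(np.arange(p), important), s0 - s1, replace=False)
    endo = np.concatenate([important, others])       # 5 important + 5 unimportant endogenous
    Sigma[endo, p] = Sigma[p, endo] = rho_endo
    errors = rng.multivariate_normal(np.zeros(p + 1), Sigma, size=n)
    V, u = errors[:, :p], errors[:, p]
    Z = rng.normal(1.0, 1.0, size=(n, q))
    Gamma0 = np.zeros((q, p))
    for j in range(p):
        rows = rng.choice(q, r, replace=False)
        Gamma0[rows, j] = rng.choice([a_abs, -a_abs], r) + 0.1 * rng.standard_normal(r)
    X = Z @ Gamma0 + V
    y = X @ beta0 + u
    return Z, X, y, beta0, Gamma0

def tpr_mcc(est, truth):
    # TPR and Matthews correlation coefficient for a vector of estimated coefficients.
    tp = np.sum((est != 0) & (truth != 0)); tn = np.sum((est == 0) & (truth == 0))
    fp = np.sum((est != 0) & (truth == 0)); fn = np.sum((est == 0) & (truth != 0))
    tpr = tp / max(np.sum(truth != 0), 1)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tpr, mcc
```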


Table 1
Strong IV, endogeneity in both important and unimportant regressors. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

(n, p, q) = (200, 100, 100)
Method   PLS Est.     PLS Pred.    PLS TPRβ  PLS MCCβ     RCF Est.     RCF Pred.    RCF TPRβ  RCF MCCβ     RCF TPRρ     RCF MCCρ
Lasso    0.8 (0.27)   0.45 (0.09)  1 (0)     0.38 (0.09)  0.36 (0.09)  0.28 (0.05)  1 (0)     0.52 (0.09)  0.99 (0.03)  0.75 (0.08)
SCAD     0.54 (0.32)  0.4 (0.11)   1 (0)     0.64 (0.22)  0.08 (0.03)  0.09 (0.03)  1 (0)     0.81 (0.13)  1 (0)        0.76 (0.1)
MCP      0.56 (0.38)  0.4 (0.11)   1 (0)     0.71 (0.21)  0.09 (0.03)  0.1 (0.03)   1 (0)     0.83 (0.13)  1 (0)        0.77 (0.08)

(n, p, q) = (300, 600, 600)
Lasso    0.66 (0.16)  0.4 (0.05)   1 (0)     0.34 (0.06)  0.35 (0.06)  0.29 (0.05)  1 (0)     0.46 (0.06)  1 (0)        0.66 (0.08)
SCAD     0.39 (0.18)  0.33 (0.07)  1 (0)     0.65 (0.25)  0.05 (0.02)  0.07 (0.02)  1 (0)     0.92 (0.12)  1 (0)        0.76 (0.09)
MCP      0.37 (0.19)  0.33 (0.08)  1 (0)     0.76 (0.23)  0.05 (0.01)  0.07 (0.02)  1 (0)     0.93 (0.12)  1 (0)        0.76 (0.07)

Table 2
Weak instrumental variables. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

(n, p, q) = (200, 100, 100)
Method   PLS Est.     PLS Pred.    PLS TPRβ  PLS MCCβ     RCF Est.     RCF Pred.    RCF TPRβ  RCF MCCβ     RCF TPRρ     RCF MCCρ
Lasso    1.97 (0.51)  0.64 (0.09)  1 (0)     0.3 (0.07)   0.92 (0.27)  0.4 (0.09)   1 (0)     0.41 (0.06)  0.91 (0.11)  0.57 (0.1)
SCAD     1.68 (0.45)  0.66 (0.09)  1 (0)     0.43 (0.18)  0.31 (0.09)  0.16 (0.04)  1 (0)     0.53 (0.08)  1 (0)        0.61 (0.05)
MCP      1.63 (0.52)  0.65 (0.1)   1 (0)     0.5 (0.17)   0.35 (0.12)  0.18 (0.04)  1 (0)     0.6 (0.1)    1 (0)        0.68 (0.08)

(n, p, q) = (300, 600, 600)
Lasso    1.83 (0.39)  0.59 (0.04)  1 (0)     0.28 (0.06)  0.9 (0.17)   0.41 (0.09)  1 (0)     0.35 (0.03)  0.94 (0.1)   0.49 (0.06)
SCAD     1.67 (0.32)  0.62 (0.08)  1 (0)     0.33 (0.06)  0.19 (0.05)  0.1 (0.03)   1 (0)     0.52 (0.09)  1 (0)        0.57 (0.06)
MCP      1.51 (0.32)  0.6 (0.07)   1 (0)     0.44 (0.09)  0.21 (0.06)  0.12 (0.03)  1 (0)     0.59 (0.1)   1 (0)        0.61 (0.07)

The first two measures assess the estimation and prediction quality, and the latter two assess the variable selection quality. Lower estimation and prediction loss and higher TPR and MCC indicate better performance. In addition, to measure the accuracy of endogenous variable selection in the RCF method, we also display the TPR and MCC of the estimators of ρ, although there is no corresponding comparison for PLS. Throughout the simulations, we apply 10-fold cross-validation to choose the optimal tuning parameters.

Averages and standard deviations (values in parentheses) over 50 replications of each measure are reported in Table 1, which compares the performance of PLS and RCF. The RCF method outperforms PLS with remarkably lower and more stable estimation and prediction loss. Both methods work well in selecting the important predictors, but the MCCβ of RCF is generally higher than that of PLS. RCF also works well in selecting the truly endogenous variables, as indicated by TPRρ values close to 1 and satisfactory MCCρ. In all these respects, the RCF method maintains superior performance.

To reveal the consistency of the estimators as the sample size varies and provide an intuitive comparison, we display the trend curves of the L_1 estimation loss, L_2 prediction loss and MCC for both methods. In Fig. 1 we set the dimensions p = q = 100 and in Fig. 2 p = q = 600, with the sample size n varying from 200 to 1500. As presented in both figures, the RCF method outperforms PLS in all measures and is more stable, and we conclude that the RCF method is an effective and robust method.

4.2. Example 2: Weak instrumental variables

Consider a similar model as in Example 1, but take a such that P(a = 0.5) = P(a = −0.5) = 0.5, representing weak instruments with the correlation between X and Z weakened. Table 2 shows that the performance of both methods deteriorates somewhat, because the proportion of X correlated with u increases, but the RCF still performs relatively well in all measures.


Fig. 1. Performance curves of PLS and RCF for Example 1 with p = q = 100. Lower estimation and prediction loss, higher MCC indicate better performance.

Table 3
Endogeneity only in unimportant regressors. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

(n, p, q) = (200, 100, 100)
Method   PLS Est.     PLS Pred.    PLS TPRβ  PLS MCCβ     RCF Est.     RCF Pred.    RCF TPRβ  RCF MCCβ     RCF TPRρ     RCF MCCρ
Lasso    0.85 (0.29)  0.46 (0.08)  1 (0)     0.37 (0.07)  0.39 (0.12)  0.32 (0.05)  1 (0)     0.54 (0.09)  0.99 (0.02)  0.77 (0.1)
SCAD     0.53 (0.28)  0.39 (0.09)  1 (0)     0.64 (0.23)  0.09 (0.06)  0.09 (0.05)  1 (0)     0.83 (0.12)  1 (0)        0.76 (0.1)
MCP      0.57 (0.36)  0.4 (0.11)   1 (0)     0.69 (0.22)  0.09 (0.06)  0.1 (0.05)   1 (0)     0.85 (0.1)   1 (0)        0.77 (0.09)

(n, p, q) = (300, 600, 600)
Lasso    0.68 (0.29)  0.41 (0.09)  1 (0)     0.35 (0.07)  0.32 (0.07)  0.29 (0.06)  1 (0)     0.51 (0.09)  1 (0)        0.66 (0.09)
SCAD     0.41 (0.2)   0.33 (0.07)  1 (0)     0.71 (0.3)   0.06 (0.02)  0.08 (0.02)  1 (0)     0.91 (0.12)  1 (0)        0.76 (0.08)
MCP      0.39 (0.2)   0.34 (0.07)  1 (0)     0.77 (0.27)  0.06 (0.02)  0.08 (0.02)  1 (0)     0.92 (0.11)  1 (0)        0.77 (0.09)

4.3. Example 3: Endogeneity only in unimportant regressors

Consider a similar model as in Example 1, but for the σ_{j,p+1}'s, j = 1, . . . , p, set the s_0 = 10 nonzero ones to 0.3, all sampled randomly from the locations corresponding to the zero components of β_0. In this way, only unimportant regressors are endogenous and all the important regressors are exogenous. It is interesting to see from Table 3 that, even in this case, the PLS still does not select the true model correctly, whereas the RCF works well. This result agrees with the conclusion of Fan and Liao (2014) that the PLS fails even when the true model has no endogeneity, and verifies the advantage of the RCF method.


Fig. 2. Performance curves of PLS and RCF for Example 1 with p = q = 600. Lower estimation and prediction loss, higher MCC indicate better performance.

4.4. Example 4: No sparsity in endogenous variables

The RCF estimator is established under the assumption that the endogeneity is concentrated in a small subset of predictors, which is the situation considered in most of the literature and regarded as reasonable in practice. We now consider model settings in which this assumption is violated and investigate the sensitivity of the RCF method. Consider a similar model as in Example 1 with (n, p, q) = (200, 100, 100), but for the σ_{j,p+1}'s, j = 1, . . . , p, set s_0 nonzero ones to 0.1 with s_0 = 60, 70, 80, where five of them correspond to the important regressors. It can be seen from Table 4 that RCF retains its superiority in estimation and prediction until s_0 = 80, and does not lose its variable selection advantage until s_0 = 70. However, the selection quality for ρ_0 deteriorates seriously without sparsity in the endogenous variables. We can still conclude that, even when the sparsity assumption on ρ_0 is moderately violated, the RCF estimator is more efficient in terms of the estimation, prediction and variable selection of β_0. To make the results more intuitive, the superior items are marked in bold in Tables 4 and 5.

4.5. Example 5: Comparison with 2SR

It is known that in the linear model the CF estimators of β are identical to the two-stage least squares (2SLS) estimators. Nevertheless, this algebraic equivalence no longer holds when regularization is introduced in the estimation. Besides, the CF method has several attractive features. First, the CF model leads to a straightforward test of the null hypothesis that x_j is exogenous, and the test is easily made robust to heteroskedasticity. Second, the CF approach can be adapted to certain nonlinear models in cases where analogues of 2SLS have undesirable properties. When the model is not strictly linear in the endogenous regressors, Imbens and Wooldridge (2007) conjectured that the CF estimator, while less robust than 2SLS, might be much more precise because it keeps the explanatory variables


Table 4
No sparsity in endogenous variables. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

s0 = 60
Method   PLS Est.     PLS Pred.    PLS TPRβ  PLS MCCβ     RCF Est.     RCF Pred.    RCF TPRβ  RCF MCCβ     RCF TPRρ     RCF MCCρ
Lasso    0.68 (0.22)  0.39 (0.07)  1 (0)     0.39 (0.06)  0.48 (0.11)  0.35 (0.07)  1 (0)     0.49 (0.05)  0.42 (0.06)  0.28 (0.1)
SCAD     0.33 (0.27)  0.26 (0.1)   1 (0)     0.7 (0.24)   0.27 (0.12)  0.24 (0.08)  1 (0)     0.68 (0.13)  0.33 (0.08)  0.27 (0.09)
MCP      0.36 (0.33)  0.28 (0.12)  1 (0)     0.76 (0.24)  0.26 (0.13)  0.25 (0.09)  1 (0)     0.78 (0.13)  0.22 (0.1)   0.23 (0.07)

s0 = 70
Lasso    0.63 (0.23)  0.37 (0.07)  1 (0)     0.4 (0.1)    0.46 (0.12)  0.33 (0.06)  1 (0)     0.5 (0.09)   0.39 (0.05)  0.24 (0.08)
SCAD     0.3 (0.29)   0.24 (0.11)  1 (0)     0.78 (0.26)  0.22 (0.07)  0.2 (0.04)   1 (0)     0.66 (0.12)  0.28 (0.11)  0.25 (0.08)
MCP      0.38 (0.38)  0.27 (0.12)  1 (0)     0.78 (0.25)  0.23 (0.09)  0.21 (0.05)  1 (0)     0.76 (0.14)  0.21 (0.08)  0.22 (0.07)

s0 = 80
Lasso    0.66 (0.31)  0.37 (0.07)  1 (0)     0.4 (0.1)    0.42 (0.1)   0.32 (0.05)  1 (0)     0.53 (0.06)  0.41 (0.04)  0.24 (0.07)
SCAD     0.24 (0.11)  0.21 (0.06)  1 (0)     0.78 (0.22)  0.23 (0.08)  0.2 (0.06)   1 (0)     0.6 (0.07)   0.37 (0.05)  0.25 (0.07)
MCP      0.18 (0.08)  0.2 (0.06)   1 (0)     0.88 (0.16)  0.25 (0.14)  0.23 (0.09)  1 (0)     0.74 (0.13)  0.26 (0.1)   0.2 (0.06)

Table 5
Comparison with 2SR. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

(n, p, q) = (200, 100, 100)
Method   2SR Est.     2SR Pred.    2SR TPRβ  2SR MCCβ     RCF Est.     RCF Pred.    RCF TPRβ  RCF MCCβ     RCF TPRρ     RCF MCCρ
Lasso    0.8 (0.31)   0.57 (0.16)  1 (0)     0.54 (0.09)  0.36 (0.09)  0.28 (0.05)  1 (0)     0.52 (0.09)  0.99 (0.03)  0.75 (0.08)
SCAD     0.32 (0.18)  0.37 (0.15)  1 (0)     0.91 (0.13)  0.08 (0.03)  0.09 (0.03)  1 (0)     0.81 (0.13)  1 (0)        0.76 (0.1)
MCP      0.32 (0.18)  0.38 (0.16)  1 (0)     0.93 (0.11)  0.09 (0.03)  0.1 (0.03)   1 (0)     0.83 (0.13)  1 (0)        0.77 (0.08)

(n, p, q) = (300, 600, 600)
Lasso    0.84 (0.35)  0.56 (0.14)  1 (0)     0.46 (0.1)   0.35 (0.06)  0.29 (0.05)  1 (0)     0.46 (0.06)  1 (0)        0.66 (0.08)
SCAD     0.32 (0.14)  0.37 (0.13)  1 (0)     0.82 (0.13)  0.05 (0.02)  0.07 (0.02)  1 (0)     0.92 (0.12)  1 (0)        0.76 (0.09)
MCP      0.33 (0.15)  0.38 (0.13)  1 (0)     0.88 (0.13)  0.05 (0.01)  0.07 (0.02)  1 (0)     0.93 (0.12)  1 (0)        0.76 (0.07)

X unchanged (instead of using X̂) in the second stage. We reasonably speculate that, for linear models with regularization, the RCF approach, which uses more information, can improve the precision of the estimates compared with the 2SR method proposed by Lin et al. (2015). Considering the model in Example 1, the RCF and 2SR methods are carried out separately for comparison. It is clearly seen from Table 5 that, for the estimation of β_0 and the prediction of Xβ_0, the RCF estimator possesses absolute superiority over 2SR. The two methods lead to different estimates, and RCF has its distinctive advantage. A theoretical comparison remains to be further explored.

4.6. Example 6: Comparison with IVR

The instrumental variable regression (IVR) of Neykov et al. (2018) is also a practical method to deal with endogeneity in high-dimensional regression; it is a modification of the Dantzig selector and comparable to the RCF method. Consider a similar model as in Example 1. IVR constructs the following estimator:

β̂ = arg min ∥β∥_1, such that ∥n^{-1} Z^T(Xβ − y)∥_∞ ≤ λ,

where λ is a regularization parameter. We compare the IVR and RCF methods in Table 6, and the result further confirms the superiority of the RCF method.
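The IVR estimator above can be computed as a linear program after splitting β into its positive and negative parts. The following sketch uses scipy's linprog as the solver; the solver choice and function name are ours, not part of Neykov et al. (2018).

```python
# Sketch of the IVR (Dantzig-selector-type) estimator via linear programming.
import numpy as np
from scipy.optimize import linprog

def ivr_estimate(Z, X, y, lam):
    n, p = X.shape
    A = Z.T @ X / n                      # n^{-1} Z^T X
    b = Z.T @ y / n                      # n^{-1} Z^T y
    # minimise ||beta||_1 subject to ||A beta - b||_inf <= lam,
    # with beta = beta_plus - beta_minus and beta_plus, beta_minus >= 0.
    c = np.ones(2 * p)                   # objective: sum(beta_plus + beta_minus)
    A_ub = np.vstack([np.hstack([A, -A]),        #   A beta - b <= lam
                      np.hstack([-A, A])])       # -(A beta - b) <= lam
    b_ub = np.concatenate([lam + b, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]
```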


Table 6
Comparison with IVR. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

                  (n, p, q) = (200, 100, 100)                       (n, p, q) = (300, 600, 600)
Method            Est.         Pred.        TPRβ   MCCβ             Est.         Pred.        TPRβ   MCCβ
IVR  Dantzig      0.65 (0.2)   0.43 (0.09)  1 (0)  0.45 (0.08)      0.55 (0.24)  0.43 (0.2)   1 (0)  0.46 (0.12)
RCF  Lasso        0.36 (0.09)  0.28 (0.05)  1 (0)  0.52 (0.09)      0.35 (0.06)  0.29 (0.05)  1 (0)  0.46 (0.06)
     SCAD         0.08 (0.03)  0.09 (0.03)  1 (0)  0.81 (0.13)      0.05 (0.02)  0.07 (0.02)  1 (0)  0.92 (0.12)
     MCP          0.09 (0.03)  0.1 (0.03)   1 (0)  0.83 (0.13)      0.05 (0.01)  0.07 (0.02)  1 (0)  0.93 (0.12)

Table 7
Chosen factors with their coefficients by PLS and RCF.

                      PLS                              RCF
Factor                Lasso    SCAD     MCP            Lasso    SCAD     MCP
Industry              −0.67    −1.64    −1.64          −0.12    −0.12    −0.16
Tertiary              −0.46    −1.45    −1.44          0.00     0.00     0.00
Consumption ratio     0.82     0.88     0.88           0.94     1.10     1.10
Unemployment          −0.27    −0.27    −0.26          −0.08    −0.01    −0.01
Urbanization rate     −0.29    0.00     0.00           0.00     0.00     0.00
Education             −0.17    0.00     0.00           0.00     0.00     0.00

5. Data analysis of urban–rural income gap in China

In this section, we apply the proposed RCF method to data on China's urban–rural income inequality as an example. The data stem from the Statistical Yearbooks of 31 Chinese provinces during 2014–2015. We consider a model with p = 21 covariates, q = 21 instruments and n = 20 complete observations. Our goal is to identify the most influential factors of the income gap.

We take the Annual per Capita Disposable Income Ratio of Urban to Rural Households (Income Ratio) as the response to quantify the income gap. The considered factors of the urban–rural income gap include the following indicators: Per Capita GDP, Tertiary, Unemployment, Consumption Ratio, Urbanization Rate, Industry, Foreign Trade and 14 items of Local Financial Expenditure (including Education, Science and Technology, General Public Services and so on). Among them, Consumption Ratio is supposed to change together with Income Ratio as economic conditions fluctuate. Since we have found little literature investigating a direct relationship between government budget policy and income inequality, we add these 14 items as redundant variables expected to be filtered out. Some economic variables, such as consumption, are endogenous because they are determined simultaneously with income. A widespread econometric solution is to use lagged levels of the explanatory variables as instruments (Li and Pan, 2010). We therefore take the 21 indices in 2015 as predictors and use their values in 2014 as instruments. We also standardize the original predictors to eliminate the influence of their differing magnitudes. The leave-one-out cross-validation criterion is used here to choose the optimal tuning parameters.

We aim to determine which factors are likely to play critical roles in urban–rural income inequality. The factors chosen by PLS and RCF are listed with their coefficients in Table 7. We compare the two methods in terms of the economic rationality of the selected variables. There are several differences between the two methods' results. First, the RCF method produces a sparser model than PLS. Second, the effect of Consumption Ratio predominates in RCF, but is far exceeded by Industry and Tertiary in PLS. The latter is unreasonable, because the close connection between consumption and income has been recognized in an extensive literature; see Chen and Justin (2014) and Shen (2013). Moreover, the effects of Industry and Tertiary in PLS are markedly negative, which is completely opposite to the viewpoints of Lin et al. (1994), Cai and Yang (2000) and Kanbur and Zhang (2005), who argue that urban-biased policies and the government's priority on developing heavy industries expand income inequality, and of Shen (2013) and Chao and Shen (2014), who hold that the development of the secondary and tertiary industries leads to agriculture being neglected and further widens the income gap. In addition, Zhang and Liu (2012) conclude from an empirical study that urbanization has no significant influence on the urban–rural income gap. Most of the literature thus supports the suboptimality of the PLS results and the rationality of the RCF results. The outcome confirms that the RCF method can provide insightful and interpretable results for factor selection and effect estimation in real data problems.

6. Discussion on feasible extensions

Based on the proposed RCF method, there are several extensions worth exploring.
The two main directions are generalization of the model framework and improvement of the algorithm.



6.1. Framework for approximate sparsity Exact sparsity assumptions are hardly realistic for some applications concerning (1) with endogeneity. Learning from Belloni and Chernozhukov (2013), we now extend the current framework to allow for approximate sparsity. Reconsider the two-stage model:

{

xj = Zγ ∗j + vj

(11)

y = Xβ∗ + Vρ∗ + e,

where xj ∈ Rn is the jth column of X ∈ Rn×p , γ j ∈ Rq is the jth column of Γ ∈ Rq×p , vj = (v1j , . . . , vnj )T is the jth column of V ∈ Rn×p for j = 1, . . . , p, with p, q ≫ n. The distribution of error terms is the same as in Assumption 1, with vij ∼ N(0, σj2 ) (write σjj = σj2 ) for i = 1, . . . , n. Approximate sparsity allows all the entries in γ ∗j , β∗ and ρ∗ to be nonzero provided their coefficients decay at sufficiently fast rates. Then we consider the following oracle risk minimization program for the first stage of (11) for j = 1, . . . , p: k min ck2 + σj2 , where ck2 = min ∥Z(γ ∗j − γ j )∥22 /n. ∥γ j ∥0 ≤k n

0≤k≤q∧n

(12)

Not that ck2 + σj2 k/n is an upper bound on the risk of the best k-sparse least squares estimator. The oracle program (12) chooses the optimal value of k. Let rj be the smallest integer amongst these optimal values, and let

γ 0j ∈ arg min ∥Z(γ ∗j − γ j )∥22 /n.

(13)

∥γ j ∥0 ≤rj

We call γ 0j the oracle target value of the first stage, Tj = supp(γ 0j ) the oracle model, rj = |Tj | = ∥γ 0j ∥0 the dimension of the oracle model, and Zγ 0j the oracle approximation to Zγ ∗j . Similarly, the oracle risk minimization program for the second stage is k min c˜k2 + σe2 , where n

0≤k≤p∧n

c˜k2 =

min

∥(βT ,ρT )T ∥0 ≤k

∥X(β∗ − β) + V(ρ∗ − ρ)∥22 /n.

(14)

Let s be the smallest integer amongst the optimal values of k chosen by (14), and let (βT0 , ρT0 )T ∈

arg min ∥(βT ,ρT )T ∥0 ≤s

∥X(β∗ − β) + V(ρ∗ − ρ)∥22 /n.

(15)

In this way β0 is our ultimate oracle target value. Let S1 = supp(β0 ), S2 = supp(ρ0 ), s1 = |S1 | and s2 = |S2 |. The ‘‘true’’ model defined here is no longer the exactly sparse model (6), but the best s1 -dimensional approximation Xβ0 chosen by the oracle to the regression function Xβ∗ . Since Tj , S1 and S2 are unknown, we shall estimate them using Lasso-type methods. Note that Lasso is sometimes only used as a model selection device, and an extra step, which performs an OLS with the regressors selected by Lasso to obtain final estimators may be used (Belloni and Chernozhukov, 2013; Zhu, 2018). In the present manuscript yet, we only focus on the property of Lasso estimators, but do not pursue the OLS-post-Lasso development. ˆ and ρˆ are defined in the same form as (7) and (8) with Lasso penalty, but they are now The Lasso estimators γˆ j ’s, β estimators of the oracle targets γ 0j ’s, β0 and ρ0 defined in (13) and (15). As for the estimation and prediction property of the estimators, we still rely on the restricted eigenvalue (RE) condition proposed by Bickel et al. (2009), as stated in Assumption 2, Section 3.2. It is demonstrated that the nonasymptotic loss boundaries listed in Proposition 1 (refer to the Supplementary) and estimation consistency presented in Theorem 1 still hold under the approximate sparsity framework, with proof given in the Supplementary. We leave the sparsity property of the estimators for future investigation. 6.2. Tuning parameter selection We now discuss the effectiveness of the cross-validation criterion in variable selection, and try to make modification to the algorithm. 6.2.1. Stability in variable selection We first have a look at the stability selection (Meinshausen and Bühlmann, 2010), which computes the selection probability of each variable, over subsamples of certain number and size, for a grid of regularization parameters. In an effective variable selection method, important variables are supposed to stand out in the stability paths with high selection probability. Consider a similar model as in Example 1. The stability paths of the RCF method are displayed in Fig. 3. It is clearly seen that the overall stability paths of the RCF method seem hardly noisy. Therefore the RCF method with its algorithm is effective in distinguishing the most important factors. Please cite this article as: X. Xu, X. Li and J. Zhang, Regularization methods for high-dimensional sparse control function models. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.007.

X. Xu, X. Li and J. Zhang / Journal of Statistical Planning and Inference xxx (xxxx) xxx

13

Fig. 3. Stability paths for different methods based on 100 subsamples. For the left column (n, p, q) = (200, 100, 100) and for the right column (n, p, q) = (300, 600, 600). Variables with maximum selection probability at least 0.3 are displayed in solid lines, among which variables common to all the Lasso, SCAD and MCP penalties are shown in red and the distinct ones in blue, and the remaining variables are displayed in dashed lines. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

6.2.2. Comparison with ESCV-based RCF Lim and Yu (2016) introduce the concept of estimation stability, which means estimates from different samples are supposed to be close to each other. Taking this into consideration, we modify the cross validation criterion to the estimation stability cross validation (ESCV) criterion proposed by Lim and Yu (2016). Consider a similar model as in Example 1. We compare the result based on the CV and ESCV criterion in Table 8. To make the result more intuitive, we make the superior items bold in Table 8. It is clearly seen that ESCV improves variable selection quality but deteriorates in prediction, since it considers a combination of stability and prediction, rather than only the prediction, as a new criterion. ESCV cuts down model size and false positive rates (FPR), while sacrificing little of true positive rates (TPR), with selection of bigger (more regularized) Please cite this article as: X. Xu, X. Li and J. Zhang, Regularization methods for high-dimensional sparse control function models. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.007.

14

X. Xu, X. Li and J. Zhang / Journal of Statistical Planning and Inference xxx (xxxx) xxx Table Algorithm 2: Two-step penalization algorithm for choosing tuning parameters

For the first-stage regularization parameters λj ’s: Step 1. Over-Penalization (0)

Set σˆ j

1

= ( log p+n log q ) 4 . For arbitrary ς ∈ (0, 41 ), perform (7) with (0) log p+log q 1 −ς )2 j ( n

(0) j

λj = λ

= σˆ

(1)

to obtain the initial estimates γˆ j Adjusted-Penalization √

Step 2.

(1)

Set σˆ j

(1) 1 xj Z j 22 . For n (1) log p+log q 1 −ς )2 j ( n (2) the estimates j for

∥ − γˆ ∥

=

(1) j

λj = λ

for j = 1, . . . , p.

the same ς as in Step 1, perform (7) with

= σˆ

to obtain

γˆ

j = 1 , . . . , p.

(2)

(2)

(1)

Using γˆ j returned by Step 2 to construct σˆ j , apply additional adjustment to λj For the second-stage √ regularization parameter µ: Let σˆ max = max

1 n

j=1,...,p

Step 1.

(1)

by replacing σˆ j

(2)

with σˆ j .

∥xj − Zγˆ j ∥22 , Lˆ = max ∥γˆ j ∥1 , rˆ = max |supp(γˆ j )|, where γˆj is the first-stage estimator. j=1,...,p

j=1,...,p

Over-Penalization 1

n ˆ (0) = σˆ e(0) = ( ) 4 . For arbitrary ς ∈ (0, 14 ) and some constant c0 /k1 , perform (8) with Set M 2 log p+log q

µ=µ

(0)

=

(1)

Set σˆ e

=

µ = µ(1) =

σˆ

(1) e

√ (0) ˆ (0) σˆ max )( log p+log q ) 12 −ς Lˆ rˆ (σˆ e ∨ M 2

n

ˆ to obtain the initial estimates β Adjusted-Penalization √

Step 2.

ˆ Using β

c0 k1

(2)

(1) M2

(2)

and ρˆ

(1)

.

(1) ˆ (1) = ∥ρˆ (1) ∥1 . For the same ς and c0 /k1 as in Step 1, perform (8) with ∥ − Xβˆ − Vˆ ρˆ (1) ∥22 and M 2 √ (1) ˆ rˆ (σˆ e ∨ M ˆ (1) σˆ max )( log p+log q ) 12 −ς

1 y n c0 L k1

ˆ to obtain the estimates β and ρˆ

(1)

2 (2)

n

and ρˆ

(2)

. (2)

returned by Step 2 to construct σˆ e (2) e

(2)

ˆ , apply additional adjustment to µ by replacing and M 2

(2) M2 .

and ˆ with σˆ and ˆ Further iterations may result better performance with small samples. In the small sample practice, the choice of the constant c0 /k1 can be assisted with CV or ESCV criterion.

regularization parameters. In terms of parameter estimation, however, ESCV gives inferior performance when there is little dependence between predictors. This result is consistent with Lim and Yu (2016). In addition, Lim and Yu (2016) only present experimental results for Lasso, which is easily run using the R package ‘‘glmnet’’, but the computational cost greatly increases when we apply ESCV to the coordinate descent algorithm for the SCAD and MCP penalty. Thus we only display the result with Lasso penalty, but ESCV still provides an important direction of algorithm improvement. Due to the rather high computational cost in our simulation, we do not apply ESCV for all examples.

6.2.3. Two-step penalization algorithm Zhu (2018) proposes l1 -regularized two-stage least squares estimators with high-dimensional endogenous regressors and instruments (H2SLS). A two-step algorithm is provided for choosing regularization parameters with asymptotic guarantees: over penalization of the parameter to obtain the initial estimator, and then adjusting the parameter by the obtained estimator. Now refer to the propositions in the Supplementary. To preserve √ the asymptotic bounds of estimation loss for Lasso penalty, the regularization parameters are chosen as λj = C σj c0 L( e k1



log p+log q n

for j = 1, . . . , p, and

µ = σ ∨ M2 σmax ) where the first-stage error vij ∼ N(0, σ the second-stage error ei ∼ N(0, σe2 ) for i = 1, . . . , n, maxj=1,...,p ∥γ 0j ∥1 ≤ L, ∥ρ0 ∥1 ≤ M2 , σmax = maxj=1,...,p σj , r = maxj=1,...,p |supp(γ 0j )|, and C , c0 , k1 are r(log p+log q) , n

2 j ),

unknown constants. To improve the tuning parameter selection algorithm, we adopt a similar two-step procedure as Zhu p (2018), which reflects the conditions on {λn,j }j=1 and µn . The algorithm is named as two-step penalization (2SP) and summarized in detail in Table Algorithm 2. Next we compare the RCF method using two-step penalization algorithm for tuning parameter selection with the PLS method. Considering the models in Examples 1–3, Table 9 shows the performance of both methods with Lasso penalty. We still apply 10-fold cross-validation to choose the optimal tuning parameters in PLS estimation, since Chetverikov et al. (2016) have demonstrated that the cross-validated Lasso estimator achieves the fastest possible rate of convergence in the √ prediction norm up to a small logarithmic factor log(pn) in the model without endogeneity. The result further confirms the superiority of the RCF method using 2SP algorithm. In this article, the asymptotic bounds of estimation loss and conditions on regularization parameters are proposed for Lasso penalty as a representative. Due to the space limitation, we have not fully explored the asymptotic bounds for other penalties like SCAD and MCP. Therefore an elaborate design of similar over-penalization algorithm for more regularization penalties deserves further exploration. Please cite this article as: X. Xu, X. Li and J. Zhang, Regularization methods for high-dimensional sparse control function models. Journal of Statistical Planning and Inference (2019), https://doi.org/10.1016/j.jspi.2019.09.007.

X. Xu, X. Li and J. Zhang / Journal of Statistical Planning and Inference xxx (xxxx) xxx

15

Table 8
CV-based and ESCV-based RCF with Lasso penalty. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

                               Est.          Pred.         TPRβ     MCCβ          TPRρ          MCCρ          λmin          λmed
(n, p, q) = (200, 100, 100)
  CV                           0.36 (0.09)   0.28 (0.05)   1 (0)    0.52 (0.09)   0.99 (0.03)   0.75 (0.08)   0.87 (0.15)   0.85 (0.15)
  ESCV                         0.57 (0.09)   0.59 (0.05)   1 (0)    0.82 (0.1)    0.7 (0.03)    0.76 (0.08)   1.02 (0.23)   0.93 (0.3)
(n, p, q) = (300, 600, 600)
  CV                           0.35 (0.06)   0.29 (0.05)   1 (0)    0.46 (0.06)   1 (0)         0.66 (0.08)   1.04 (0.14)   1.03 (0.13)
  ESCV                         0.54 (0.14)   0.59 (0.15)   1 (0)    0.87 (0.12)   0.8 (0.22)    0.82 (0.13)   1.19 (0.42)   1.11 (0.26)

Table 9
Comparison of PLS and RCF using 2SP algorithm with Lasso penalty for Examples 1–3. (Lower Est. and Pred., higher TPR and MCC indicate better performance.)

                               PLS                                                 RCF
                               Est.         Pred.        TPRβ   MCCβ               Est.         Pred.        TPRβ   MCCβ         TPRρ         MCCρ
Ex1.
 (n, p, q) = (200, 100, 100)   0.8 (0.27)   0.45 (0.09)  1 (0)  0.38 (0.09)        0.62 (0.09)  0.38 (0.07)  1 (0)  0.54 (0.09)  0.9 (0.07)   0.65 (0.11)
 (n, p, q) = (300, 600, 600)   0.66 (0.16)  0.4 (0.05)   1 (0)  0.34 (0.06)        0.58 (0.06)  0.39 (0.07)  1 (0)  0.51 (0.09)  0.94 (0.08)  0.64 (0.14)
Ex2.
 (n, p, q) = (200, 100, 100)   1.97 (0.51)  0.64 (0.09)  1 (0)  0.3 (0.07)         1.53 (0.28)  0.59 (0.09)  1 (0)  0.38 (0.07)  0.78 (0.11)  0.41 (0.11)
 (n, p, q) = (300, 600, 600)   1.83 (0.39)  0.59 (0.04)  1 (0)  0.28 (0.06)        1.51 (0.20)  0.57 (0.08)  1 (0)  0.34 (0.05)  0.82 (0.11)  0.38 (0.1)
Ex3.
 (n, p, q) = (200, 100, 100)   0.85 (0.29)  0.46 (0.08)  1 (0)  0.37 (0.07)        0.62 (0.14)  0.40 (0.07)  1 (0)  0.54 (0.09)  0.88 (0.11)  0.64 (0.13)
 (n, p, q) = (300, 600, 600)   0.68 (0.29)  0.41 (0.09)  1 (0)  0.35 (0.07)        0.56 (0.08)  0.38 (0.08)  1 (0)  0.5 (0.12)   0.92 (0.09)  0.6 (0.13)

7. Conclusion

The presence of endogeneity in high-dimensional data makes the penalized least squares method inconsistent. In this article, we have drawn on the control function method from econometrics and proposed the regularized control function (RCF) method. We investigated theoretical conditions under which the RCF estimator achieves estimation and selection consistency. Through extensive simulations, we compared the RCF method with the existing PLS method and demonstrated that RCF is superior, providing more accurate estimation and prediction, consistent variable selection, and satisfactory identification of the truly endogenous regressors. The application to China's urban–rural income gap data illustrated the practicability of the RCF method.

Beyond economics, the RCF method may also be applicable in the natural sciences, such as pharmacy, biology, genetics and genomics. For example, endogeneity arises from unmeasured confounders in gene expression studies (Lin et al., 2015) or in observational studies of treatment effects (Guo and Small, 2016). The results in this article are expected to be extended further in such fields.

Acknowledgments

We thank the editors and reviewers for their detailed feedback and suggested improvements to this paper. We are also grateful to Ying Zhu, the author of Zhu (2018), for sharing the code of the over-penalization parameter-selection algorithm, which was very helpful to our work.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jspi.2019.09.007.


References

Angrist, J.D., Keueger, A.B., 1991. Does compulsory school attendance affect schooling and earnings? Q. J. Econ. 106 (4), 979–1014.
Belloni, A., Chen, D., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 (6), 2369–2429.
Belloni, A., Chernozhukov, V., 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2), 521–547.
Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C.B., Kato, K., 2018. High-dimensional econometrics and regularized GMM. arXiv preprint, 1806.01888.
Bickel, P.J., Ritov, Y., Tsybakov, A.B., 2009. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 (4), 1705–1732.
Cai, F., Yang, T., 2000. Political economy of the income gap between urban and rural areas. Soc. Sci. China 4.
Chao, X., Shen, K., 2014. The urban-rural income gap, the quality of labour force and the economic growth in China. Econ. Res. J. (Chin. J.) 6, 30–43.
Chen, B., Justin, Y.L., 2014. Development strategy, urbanization and the urban-rural income gap in China. Soc. Sci. China 35 (1), 5–20.
Chetverikov, D., Liao, Z., Chernozhukov, V., 2016. On cross-validated Lasso. arXiv preprint, 1605.02214.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 (456), 1348–1360.
Fan, J., Liao, Y., 2014. Endogeneity in high dimensions. Ann. Statist. 42 (3), 872–917.
Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 (1), 1–22.
Fu, W., 1998. Penalized regressions: The bridge versus the lasso. J. Comput. Graph. Statist. 7 (3), 397–416.
Gautier, E., Tsybakov, A.B., 2018. High-dimensional instrumental variables regression and confidence sets. arXiv preprint, 1105.2454.
Guo, Z., Kang, H., Cai, T.T., Small, D.S., 2018. Testing endogeneity with high dimensional covariates. J. Econometrics 207 (1), 175–187.
Guo, Z., Small, D., 2016. Control function instrumental variable estimation of nonlinear causal effect models. J. Mach. Learn. Res. 17 (1), 3448–3482.
Imbens, G.M., Wooldridge, J.M., 2007. Control functions and related methods. In: Summer Institute 2007 Methods Lectures. NBER.
Kanbur, R., Zhang, X., 2005. Fifty years of regional inequality in China: a journey through central planning, reform, and openness. Rev. Dev. Econ. 9 (1), 87–106.
Li, Z., Pan, W., 2010. Econometrics (Chinese Monograph), third ed. Higher Education Press, Beijing.
Lim, C., Yu, B., 2016. Estimation stability with cross-validation (ESCV). J. Comput. Graph. Statist. 25 (2), 464–492.
Lin, Y., Cai, F., Li, Z., 1994. China's Miracle: Development Strategy and Economic Reform (Chinese Monograph). Shanghai Sanlian Press, Shanghai Renmin Press, Shanghai.
Lin, W., Feng, R., Li, H., 2015. Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. J. Amer. Statist. Assoc. 110 (509), 270–288.
Mazumder, R., Friedman, J.H., Hastie, T., 2011. SparseNet: Coordinate descent with nonconvex penalties. J. Amer. Statist. Assoc. 106 (495), 1125–1138.
Meinshausen, N., Bühlmann, P., 2010. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 (4), 417–473.
Neykov, M., Ning, Y., Liu, J.S., Liu, H., 2018. A unified theory of confidence regions and testing for high dimensional estimating equations. Statist. Sci. 33 (3), 427–443.
Shen, Y., 2013. (Chinese thesis). Southwestern University of Finance and Economics.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1), 267–288.
Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge.
Zhang, C.H., 2010. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 (2), 894–942.
Zhang, Y., Liu, W., 2012. Population mobility, structure of fiscal expenditure and urban-rural income gap. Chin. Rural Econ. (Chin. J.) 1, 16–30.
Zhao, P., Yu, B., 2006. On model selection consistency of lasso. J. Mach. Learn. Res. 7 (12), 2541–2563.
Zhu, Y., 2018. Sparse linear models and l1-regularized 2SLS with high-dimensional endogenous regressors and instruments. J. Econometrics 202 (2), 196–213.
Zou, H., 2006. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 (476), 1418–1429.
