Semiparametric method for model structure discovery in additive regression models
Takuma Yoshida
Kagoshima University, Graduate School of Science and Engineering, 1-21-35 Korimoto, Kagoshima 890-8580, Japan. Tel.: +81-099-285-8039
[email protected]
Abstract

In regression analysis, there are two typical approaches, parametric methods and nonparametric methods. If prior information on the structure of the regression function is available, a parametric method is preferred since it is efficient and easily interpreted. When the model is misspecified, on the other hand, parametric estimators do not work well. Therefore, it is important to check whether the parametric model assumption is valid. To simultaneously discover the model structure and estimate the regression function in additive regression models, a new semiparametric method is proposed. First, a parametric model is prepared and its estimator is obtained for all additive components. Next, a nonparametric method is applied to the residual data associated with the parametric estimator. The final estimator is constructed by summing the parametric estimator and the nonparametric estimator of the residual data. In the second-step estimation, the B-spline method with an adaptive group lasso penalty is utilized. For each additive component, if the nonparametric estimator becomes the zero function, the final estimator reduces to a parametric estimator; in other words, the model structure is then discovered. The asymptotic properties of the proposed estimator are shown. A numerical study via a Monte Carlo simulation and a real data application are presented.

Keywords: Adaptive group lasso, Additive model, Parametric guided estimation, Parametric model discovery, Spline smoothing

2010 MSC: 62G08, 62J07
1. Introduction

Multiple regression captures the functional relationship between a scalar response and a multivariate predictor. One typical technique for multiple regression is the additive model, since it has the advantage of avoiding the phenomenon known as "the curse of dimensionality". For a scalar response Y and a p-variate predictor x = (x1, · · · , xp), the additive model is formulated as

\[ Y = \mu + \sum_{j=1}^{p} f_j(x_j) + \varepsilon, \tag{1} \]

where µ is the unknown intercept parameter, the fj are unknown regression functions, and ε is the error with mean 0 and finite variance. The additive model was proposed by Buja et al. [2], and several authors have developed regression analysis with additive models in the context of nonparametric methods. For example, Stone [24] used the B-spline method. Opsomer and Ruppert [22] and Opsomer [21] studied the kernel smoothing method. Horowitz and Mammen
[11], Horowitz et al. [12], and Liu and Yang [16] proposed two-stage nonparametric estimation methods. Recently, additive models for sparse settings have been developed. Typical results on sparse additive regression were given by Zhang and Lin [29], Meier et al. [19], Ravikumar et al. [23], and Huang et al. [13]. Thus, additive models are widely used. Basically, the development of additive models has focused on nonparametric regression methods. On the other hand, if it is believed that the parametric model assumption is valid for some fj from prior information or a pilot study, we can use the parametric model fj(xj|θj) for the jth component. When all components can be expressed as parametric models, (1) becomes

\[ Y = \mu + \sum_{j=1}^{p} f_j(x_j\,|\,\theta_j) + \varepsilon. \]

Then, the parameters θ1, · · · , θp are estimated via least squares or other methods. The parametric method has many advantages: the interpretation of the estimator is easy, the rate of convergence of the estimator is faster than that of the nonparametric estimator, and the estimation algorithm is simple. Hence the parametric model assumption is useful. However, in general, we cannot know whether the parametric model is valid. If the model assumption is reasonable, the parametric estimator works well. If it is not valid, on the other hand, the performance of the parametric estimator is not guaranteed. In this paper, we propose a new estimation method to check the correctness of the parametric model assumption and to estimate the regression function simultaneously. For this, we use a hybrid technique of the parametric guided method and the adaptive group lasso.
We now briefly describe the parametric guided method and introduce the idea of our study. For j = 1, · · · , p, we prepare the parametric model fj(·|θj) and estimate (θ1, · · · , θp) using the least squares method or other methods. Next, a nonparametric method is applied to the residual data associated with the parametric estimator. Finally, the estimator of each additive component is constructed by summing the parametric estimator of the true function and the nonparametric estimator of the residual data. Thus, the estimator is obtained by a two-step procedure. The final estimator combines the parametric and nonparametric structures, and so the proposed estimator is a semiparametric estimator. For regression problems, the parametric guided method has been developed by Glad [9], Naito [20], Martins-Filho et al. [17], Fan et al. [6], and Fan et al. [8]. The main assertion of this approach is that the parametric guided estimator behaves better than the fully nonparametric estimator when the prepared parametric model is near the true function. That is, the parametric guided estimator can be regarded as an improved version of the fully nonparametric estimator. On the other hand, the parametric guided estimator was not compared with a parametric estimator in the aforementioned articles. If the parametric model assumption is correct, the nonparametric component may not be needed. This motivates us to obtain a parametric estimator if fj(·) = fj(·|θj) and a nonparametric estimator if fj(·) ≠ fj(·|θj) for j = 1, · · · , p. To achieve this, we use the adaptive group lasso method in the second step of the parametric guided method.

The adaptive group lasso is known as a sparse estimation method and is a combination of the adaptive lasso and the group lasso. The adaptive lasso and the group lasso were proposed by Zou [31] and Yuan and Lin [27], respectively. Huang et al. [13] developed the adaptive group lasso for additive models (1) in nonparametric regression. In their setting, a sparse additive model is assumed, and so there exist j such that fj ≡ 0. Using the adaptive group lasso, the nonparametric estimator reduces to the zero function if the true function is the zero function. In our setting, the true function is the mean function of the residual: rj(·, θj) = fj(·) − fj(·|θj) for j = 1, · · · , p. Therefore, the final estimator reduces to the parametric estimator if rj is estimated to be the zero function. Such a case indicates that the parametric estimator can capture the structure of the regression function, and so we can decide fj(·) = fj(·|θj). If the estimator of a nonparametric component is not the zero function, the semiparametric estimator is obtained for that component. Consequently, using the adaptive group lasso in the parametric guided method, the model structure can be discovered.
Related methods have been proposed by Zhang et al. [28], Huang et al. [14], Lian et al. [15], and Wu and Stefanski [26]. However, Zhang et al. [28], Huang et al. [14], and Lian et al. [15] considered only the linear model fj(·|θj) = θj xj. Zhang et al. [28] and Lian et al. [15] proposed double penalization methods, but their methods cannot be extended to the general parametric model case. On the other hand, in Huang et al. [14], the true regression model was assumed to be fj(xj) = fj(xj|θj) + gj(xj), where fj(xj|θj) = θj xj and gj is a smooth function, and the profiling method was used to estimate θ1, · · · , θp and g1, · · · , gp. Wu and Stefanski [26] proposed a kernel-based variable selection method. In their approach, the parametric model is restricted to a polynomial model. Thus, all the above methods considered only linear and polynomial models. In contrast, there is no limit on the choice of the shape of the parametric models in our method, and so our method is more widely applicable than those of the above previous studies.

The remainder of this paper is organized as follows. In Section 2, we propose a two-step estimation based on the parametric guided method and consider sparse estimation. In addition, we note how to interpret our estimator. Section 3 shows the mathematical properties of our estimator. In Section 4, the proposed method is illustrated with Monte Carlo simulations and an analysis of a real data set. The proofs for the results given in Section 3 are provided in the Appendix.
2. Parametric guided estimation with adaptive group lasso
Let Y and X = (X1, · · · , Xp) be a random variable and a random vector of dimension p, respectively, and let {(Yi, Xi) : i = 1, · · · , n} be a random sample drawn from (Y, X). We assume that each Xj is supported on an interval [a, b], where a < b are finite numbers. We now consider the additive regression model

\[ Y_i = \mu + f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i, \quad i = 1, \cdots, n, \]

where µ is an unknown intercept parameter, the fj's are unknown univariate regression functions, xi = (xi1, · · · , xip) are the given values of the ith covariates, and ε1, · · · , εn are independent and identically distributed (i.i.d.) errors with mean 0 and finite variance σ². To satisfy identifiability of each additive function, suppose that E[fj(Xj)] = 0 for j = 1, · · · , p. The aim is to estimate µ and f1, · · · , fp. It is easy to see that the ordinary estimator of µ is µ̂ = n⁻¹ Σ_{i=1}^n Yi. Therefore, all responses are centered as Yi − µ̂ and we consider the following model:

\[ Y_i = f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i \tag{2} \]

with centered Yi. In (2), we estimate f1, · · · , fp by combining the parametric guided method and the adaptive group lasso.

First, we prepare the parametric models fj(xj|θj) (j = 1, · · · , p) by using prior information or based on pilot research, and consider the parametric additive model

\[ Y_i = f_1(x_{i1}\,|\,\theta_1) + \cdots + f_p(x_{ip}\,|\,\theta_p) + \varepsilon_i. \]

Then, the parameters θ1, · · · , θp are estimated via some method such as the least squares method. Let θ̂1, · · · , θ̂p be the least squares estimators of θ1, · · · , θp.
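To make the two-step construction concrete, the following Python sketch shows how this first step could be carried out with a generic nonlinear least squares routine. It is not code from the paper: the function names, the particular one-parameter guide models (which mirror the simulation design of Section 4), and the data layout are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical parametric guides g_j(x, theta_j), one per component.
guides = [
    lambda x, th: th[0] * np.sin(2.0 * np.pi * x),
    lambda x, th: th[0] * x**2,
    lambda x, th: th[0] * x**2,
    lambda x, th: th[0] * np.exp(-x**2),
]
dims = [1, 1, 1, 1]  # p_j = dimension of each theta_j

def fit_parametric_step(X, y, guides, dims):
    """Least squares fit of Y_i = sum_j f_j(x_ij | theta_j) + eps_i;
    returns the list of estimates theta_hat_1, ..., theta_hat_p."""
    cuts = np.cumsum(dims)[:-1]
    def residuals(theta):
        parts = np.split(theta, cuts)
        fit = sum(g(X[:, j], th) for j, (g, th) in enumerate(zip(guides, parts)))
        return y - fit
    sol = least_squares(residuals, x0=np.ones(int(np.sum(dims))))
    return np.split(sol.x, cuts)

# The residual responses passed to the second step would then be
#   theta_hat = fit_parametric_step(X, y, guides, dims)
#   y_tilde = y - sum(g(X[:, j], th)
#                     for j, (g, th) in enumerate(zip(guides, theta_hat)))
```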
The parametric estimator of fj(xj) is obtained as a centered version of fj(xj|θ̂j). The true additive model (2) can then be written as

\[ Y_i - f_1(x_{i1}|\hat\theta_1) - \cdots - f_p(x_{ip}|\hat\theta_p) = r_1(x_{i1}, \hat\theta_1) + \cdots + r_p(x_{ip}, \hat\theta_p) + \varepsilon_i, \]

where rj(xj, θj) = fj(xj) − fj(xj|θj). The function rj(xj, θ̂j) contains the unknown function fj and hence remains an unknown function. Accordingly, the functions r1, · · · , rp are estimated via a nonparametric method with the adaptive group lasso. In this paper, we use the B-spline method. Let {Bk : k = 1, · · · , K + d} be a dth-degree B-spline basis with knots a = κ0 < · · · < κK = b and additional knots κ−d+1 = · · · = κ−1 = a and κK+1 = · · · = κK+d = b. For j = 1, · · · , p, we approximate rj(xj, θ̂j) by the B-spline model

\[ \sum_{k=1}^{K+d} b_{j,k} B_k(x_j) = b_j^T B(x_j), \]

where the bj,k's are unknown parameters, bj = (bj,1, · · · , bj,K+d)ᵀ, and B(xj) = (B1(xj), · · · , BK+d(xj))ᵀ. Let Ỹi = Yi − f1(xi1|θ̂1) − · · · − fp(xip|θ̂p), i = 1, · · · , n. Then, the adaptive group lasso objective loss function is defined as

\[ L(b_1, \cdots, b_p) = \sum_{i=1}^{n}\Big(\tilde{Y}_i - \sum_{j=1}^{p} b_j^T B(x_{ij})\Big)^2 + \lambda \sum_{j=1}^{p} w_j \|b_j\|, \tag{3} \]

where w1, · · · , wp are weights and ||a|| = √(aᵀa) for any vector a. Although several weights can be considered, we use wj = 1/||b̃j||, where b̃j is the least squares estimator of bj.

We interpret the proposed method as follows. For j = 1, · · · , p, let Fj = {fj(·|θj) : θj ∈ R^{pj}}. By minimizing (3), if b̂j = 0, the final estimator reduces to the parametric estimator since r̂j(·, θ̂j) = 0. We can then conclude fj ∈ Fj, which means that the model structure of the regression function is discovered. If r̂j(·, θ̂j) ≠ 0, we obtain the semiparametric estimator f̂j(·) = fj(·|θ̂j) + r̂j(·, θ̂j) and hence we specify fj ∉ Fj. Thus, the proposed method discovers the model structure and estimates the regression function simultaneously. Note that the parametric guided estimator behaves better than the fully nonparametric estimator when the difference between fj and fj(xj|θ̂) is small (see Fan et al. [6] and Ghouch and Genton [10]). Accordingly, the proposed estimator f̂j(·) = fj(·|θ̂j) + r̂j(·, θ̂j) will have better curvature than the fully nonparametric estimator even if fj ∉ Fj.
Remark 1

The penalty ||bj|| is not differentiable at bj = 0, and so the optimization is not straightforward. In this paper, similarly to Fan and Li [7] and Lian et al. [15], (3) is solved by local quadratic approximation. Given the current estimate b_j^{(0)}, ||bj|| can be approximated by

\[ \|b_j\| \approx \|b_j^{(0)}\| + \frac{1}{2}\,\frac{\|b_j\|^2 - \|b_j^{(0)}\|^2}{\|b_j^{(0)}\|}. \]

After iteration, for a certain threshold c (for example, c = 10⁻⁶), if ||bj|| < c, we define b̂j = 0.
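The second step can be sketched as follows, again as an illustrative Python implementation rather than the author's code. `bspline_design` builds the design matrix of the dth-degree B-spline basis with the repeated boundary knots described above via the Cox-de Boor recursion, and `adaptive_group_lasso` minimizes (3) by the local quadratic approximation of this remark; all names and default settings are our assumptions.

```python
import numpy as np

def bspline_design(x, a, b, K, degree=3):
    """Design matrix of the K + degree B-spline basis functions on [a, b],
    with K equidistant knot intervals and repeated boundary knots
    (kappa_{-d} = ... = kappa_0 = a, kappa_K = ... = kappa_{K+d} = b)."""
    t = np.concatenate([np.full(degree, float(a)),
                        np.linspace(a, b, K + 1),
                        np.full(degree, float(b))])
    x = np.asarray(x, float)
    m = len(t) - 1
    # Degree 0: indicators of [t_k, t_{k+1}); close the last interval at b.
    B = ((t[:-1] <= x[:, None]) & (x[:, None] < t[1:])).astype(float)
    B[x == b, np.flatnonzero(t < b).max()] = 1.0
    for d in range(1, degree + 1):          # Cox-de Boor recursion; 0/0 := 0
        Bn = np.zeros((len(x), m - d))
        for k in range(m - d):
            if t[k + d] > t[k]:
                Bn[:, k] += (x - t[k]) / (t[k + d] - t[k]) * B[:, k]
            if t[k + d + 1] > t[k + 1]:
                Bn[:, k] += (t[k + d + 1] - x) / (t[k + d + 1] - t[k + 1]) * B[:, k + 1]
        B = Bn
    return B                                 # shape (n, K + degree)

def adaptive_group_lasso(Z_list, y, lam, weights, n_iter=200, thresh=1e-6):
    """Minimise ||y - sum_j Z_j b_j||^2 + lam * sum_j w_j ||b_j|| by
    iterated ridge updates (the local quadratic approximation of Remark 1)."""
    Z = np.hstack(Z_list)
    q, p = Z_list[0].shape[1], len(Z_list)
    b = np.linalg.lstsq(Z, y, rcond=None)[0]        # least squares start
    ZtZ, Zty = Z.T @ Z, Z.T @ y
    for _ in range(n_iter):
        norms = np.array([max(np.linalg.norm(b[j*q:(j+1)*q]), thresh)
                          for j in range(p)])
        # LQA: lam*w_j*||b_j|| ~ const + lam*w_j*||b_j||^2 / (2*||b_j^(0)||)
        D = np.diag(np.repeat(lam * weights / (2.0 * norms), q))
        b_new = np.linalg.solve(ZtZ + D, Zty)
        if np.linalg.norm(b_new - b) < 1e-8 * (1.0 + np.linalg.norm(b)):
            b = b_new
            break
        b = b_new
    for j in range(p):                               # hard threshold (Remark 1)
        if np.linalg.norm(b[j*q:(j+1)*q]) < thresh:
            b[j*q:(j+1)*q] = 0.0
    return b

# Adaptive weights from the unpenalised least squares fit, as in the text:
#   b_ls = np.linalg.lstsq(np.hstack(Z_list), y_tilde, rcond=None)[0]
#   weights = 1.0 / np.array([np.linalg.norm(b_ls[j*q:(j+1)*q])
#                             for j in range(p)])
```

With Z_j = bspline_design(x[:, j], a, b, K), a component j would be judged parametric exactly when the returned block b̂_j is the zero vector.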
Remark 2

In previous studies of the parametric guided method (see, e.g., Glad [9], Naito [20], Martins-Filho et al. [17], Fan et al. [6], and Fan et al. [8]), an ordinary non-sparse nonparametric smoothing technique, especially local linear smoothing, is used to estimate the residual curve. On the other hand, for additive models, Huang et al. [13] developed component selection using the adaptive group lasso. Our proposed method is a combination of the parametric guided method and the adaptive group lasso method. To the best of our knowledge, this is the first study of such a hybrid technique. The new contribution of this paper is that the model structure can be determined using the sparse estimation method in the second step of the parametric guided method. Further, there is no restriction on the choice of the class of parametric models, unlike in Zhang et al. [28], Huang et al. [14], Lian et al. [15], and Wu and Stefanski [26].

Remark 3

When the determination of the parametric model family is difficult for some component, we can use the zero function, i.e., fj(xj|θj) ≡ 0. For this component, the fully nonparametric estimator is obtained. In general, the shape of the parametric model is explored in a pilot study. After candidate parametric models are obtained, they are compared by using a model selection method such as AIC or BIC. When the prior information is somewhat weak, we investigate by using a scatter plot of the partial residual data associated with the nonparametric estimator. For the nonparametric estimator f̂j (j = 1, · · · , p), we calculate Ỹi^(−j) = Yi − Σ_{k≠j} f̂k(xik), which is the partial residual for the jth component (a small computational sketch is given below). From the scatter plot of {(Ỹi^(−j), xij) : i = 1, · · · , n} or the curvature of f̂j(xj), we guess the shape of the model fj(·|θj). However, this is only one part of the complete story of the selection of the class of parametric models. Further investigation of this problem is left as future study.
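A minimal sketch of the partial residual computation, assuming the fitted component values f̂k(xik) are stored columnwise; the function name is hypothetical:

```python
import numpy as np

def partial_residuals(y, component_fits, j):
    """Partial residuals for the j-th component:
    Y_i^(-j) = Y_i - sum_{k != j} fhat_k(x_ik),
    where component_fits[k] holds fhat_k(x_ik) for i = 1..n.
    Plotting these against x_j suggests a candidate parametric shape."""
    others = sum(f for k, f in enumerate(component_fits) if k != j)
    return y - others
```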
Remark 4

As another approach, we can consider the true function fj to be modeled as

\[ f_j(x_j) = f_j(x_j\,|\,\theta_j) + r_j(x_j), \quad j = 1, \cdots, p, \]

where fj(·|θj) are the prepared models and rj are nonparametric functions. Then, we estimate the parameters θ1, · · · , θp and the nonparametric functions r1, · · · , rp simultaneously, or by iterating between the two parts. This method was studied by Huang et al. [14] under the linear model fj(xj|θj) = θj xj. In such approaches, however, an extensive computational cost is incurred because an iterative algorithm must be run within the sparse estimation. For this reason, we adopt the parametric guided method.
3. Theoretical result
This section presents the asymptotic properties of the proposed semiparametric estimator. We give the general assumptions concerning the true functions and the error as Assumption A.

Assumption A
1. The distribution of X is absolutely continuous and its density is bounded away from zero and infinity on [a, b]^p.
2. The true functions f1, · · · , fp and the variance of ε are bounded.
3. The true functions f1, · · · , fp are ℓth continuously differentiable.

Define

\[ f(x\,|\,\theta) = f_1(x_1\,|\,\theta_1) + \cdots + f_p(x_p\,|\,\theta_p), \]

where θ = (θ1, · · · , θp)ᵀ. Let θ0 = (θ0,1, · · · , θ0,p)ᵀ be the minimizer of E[(Y − f(x|θ))²|X = x]. The assumptions for the prepared parametric models are given as Assumption B.

Assumption B
1. For fj ∈ Fj, there exists θ0,j ∈ R^{pj} such that fj(·) = fj(·|θ0,j), j = 1, · · · , p.
2. There exists exactly one minimizer θ0 of E[(Y − f(x|θ))²|X = x].
3. For j = 1, · · · , p, fj(xj|θj) (xj ∈ [a, b], θj ∈ R^{pj}) is bounded away from infinity.
4. For j = 1, · · · , p, fj(xj|θj) is ℓth continuously differentiable for xj ∈ [a, b], θj ∈ R^{pj}.

In the second step, we use the B-spline method. The minimum required conditions for B-spline smoothing are given as Assumption C. If Assumption C fails, the efficiency of the B-spline estimator is violated.

Assumption C
1. The number of knots K + d is smaller than n.
2. There exists a constant c1 > 0 such that c1⁻¹K⁻¹ ≤ max_j{|κ_{j+1} − κ_j|} ≤ c1K⁻¹.
ˆ j ) + rˆj (xj , θ ˆ j ). Under some reasonable conposed estimator fˆj (xj ) = fj (xj |θ
AC
ˆ − θ 0 = OP (n−1/2 ) and fj (xj |θ ˆ j ) − fj (xj |θ 0,j ) = OP (n−1/2 ) (see ditions, θ
175
ˆ j ). Let later). Thus, we only need to investigate the asymptotic order of rˆj (xj , θ b0 = (b0,1 , · · · , b0,p )T be the minimizer of 2 p X E Y˜0 − bTj B(xj ) , j=1
9
ACCEPTED MANUSCRIPT
where Y˜0 = Y − f (x|θ 0 ). From a property of the B-spline function (see Barrow and Smith [1]), under Assumptions A–C, for j = 1, · · · , p, we have ||rj (xj , θ 0 ) − bT0,j B(xj )|| ≤ O(K −η ), where η = min{`, d + 1}. This η will be important in
ˆ0 = (b ˆ0,1 , · · · , b ˆ0,p )T be the minimizer of the theorems that follow. Next, let b 2 p n n X X X Y˜i0 − L0 (b1 , · · · , bp |λ) = bTj B(xij ) + λ wj ||bj ||, (4) i=1
CR IP T
180
j=1
j=1
where
Y˜i0 = Yi − f1 (xi1 |θ 0,1 ) − · · · − fp (xip |θ 0,p ), i = 1, · · · , n. ˆT B(xj )|| ||rj (xj , θ 0 ) − b j
AN US
Then, we obtain
ˆT B(xj )|| ≤ ||rj (xj , θ 0 ) − bT0,j B(xj )|| + ||bT0,j B(xj ) − b 0,j ˆT B(xj ) − b ˆT B(xj )|| +||b 0,j j
T
T
T
ˆ B(xj )|| + ||b ˆ B(xj ) − b ˆ B(xj )|| = O(K −η ) + ||bT0,j B(xj ) − b 0,j 0,j j
M
ˆ0,j − b0,j || and and hence it is sufficient to evaluate the asymptotic order of ||b
ˆ0,j − b ˆj ||. In Theorem 1 below, the asymptotic order of ||b ˆ0,j −b0,j || is derived. ||b 185
ED
ˆ0,j − b ˆj || is asymptotically dominated Theorem 2, the main result, shows that ||b ˆ0,j − b0,j ||. As a result, the asymptotic order of ||ˆ ˆ j )|| by ||b r(xj , θ 0,j ) − rˆ(xj , θ η ˆT B(xj )||. is the same as that of O(K − ) + ||bT0,j B(xj ) − b 0,j
PT
On the other hand, if fj ∈ Fj , then b0 = 0 and so we want to obtain the
ˆj = 0. Thus, we also show P (||b ˆj || > 0) → 0 as n → ∞ for estimator as b
CE
fj ∈ Fj . This property is called sparsity. To obtain the asymptotic consistency
190
and sparsity of the estimator obtained from the adaptive group lasso, the weights need to depend on n, that is, wj = wj (n). Actually, we use the norm of the
AC
˜j || as w−1 , which is dependent on n. If w−1 → 0 least squares estimator ||b j j ˆj = 0 since the constraint of the penalty of the as n → ∞, we will obtain b
jth component becomes stronger. When wj−1 = O(1), on the other hand, the
195
estimator of the jth component is estimated as nonzero. Conditions on the growths of the weights, K, and λ are obtained. Assumption D 10
ACCEPTED MANUSCRIPT
1. For j = 1, · · · , p, if fj ∈ Fj , b0,j = 0 and there exists a increasing sequence {rn }∞ n=1 such that as n → ∞,
CR IP T
rn wj−1 = O(1). 2. For j = 1, · · · , p, if fj 6∈ Fj , there exist constants c2 , c3j > 0 such that ||b0,j || > c2 and as n → ∞, P (wj−1 ≥ c3j ) → 1.
n λK = o(1), = o(1). nrn K 1/2+d λrn
AN US
3. As n → ∞,
Assumption D 1 and 2 are natural conditions. From Stone [24], under the ˜j ||(j = 1, · · · , p). assumptions A–C, Assumption D 1 and 2 hold for wj−1 = ||b 200
˜0 || → 0 is satisfied. In Theorem 1, for j = 1, · · · , p, Therefore if b0,j = 0, then ||b ˆ0,j are shown. the sparsity and consistency of b
M
Theorem 1. Under Assumptions A–D, for j = 1, · · · , p, if fj ∈ Fj, as n → ∞,

\[ P(\|\hat{b}_{0,j}\| = 0) \to 1, \]

and if fj ∉ Fj,

\[ \|\hat{b}_{0,j} - b_{0,j}\|^2 = O_P(K^2/n) + O_P(K/n) + O(K^{-2\eta+1}) + O(K^2\lambda^2/n^2). \]

Next, the asymptotic order of the difference between r̂j(xj, θ̂j) and r̂j(xj, θ0,j) is derived. We then assume that θ̂ is a √n-consistent estimator of θ0. Using the result of White [25], this assumption holds under some weak conditions. Furthermore, for j = 1, · · · , p, we obtain fj(xj|θ̂j) − fj(xj|θ0,j) = γj,n(xj), where γj,n(xj) = O_P(n^{−1/2}) for any xj ∈ [a, b]. By combining Theorem 1 and the asymptotics for the parametric estimator, we obtain the following theorem.

Theorem 2. Suppose θ̂ − θ0 = O_P(n^{−1/2}). Under Assumptions A–D, for j = 1, · · · , p, if fj ∈ Fj, as n → ∞,

\[ P(\|\hat{b}_j\| = 0) \to 1, \]

and if fj ∉ Fj,

\[ \|\hat{b}_j - b_{0,j}\|^2 = O_P(K^2/n) + O_P(K/n) + O(K^{-2\eta+1}) + O(K^2\lambda^2/n^2). \]

Furthermore, for fj ∉ Fj (j = 1, · · · , p),

\[ \|\hat{r}_j(x_j, \hat\theta_j) - r_j(x_j, \theta_{0,j})\|^2 = O_P(K/n) + O_P(1/n) + O(K^{-2\eta}) + O(K\lambda^2/n^2). \]

From the result of Theorem 2, the asymptotics for the final estimator are obtained as follows.

Corollary 1. Under the same assumptions as Theorem 2, for j = 1, · · · , p, if fj ∉ Fj,

\[ \|\hat{f}_j(x_j) - f_j(x_j)\|^2 = O_P(K/n) + O_P(1/n) + O(K^{-2\eta}) + O(K\lambda^2/n^2). \]

The proof of Corollary 1 is easy since, for fj ∉ Fj,

\[ \|\hat{f}_j(x_j) - f_j(x_j)\|^2 \le \|f_j(x_j|\hat\theta_j) - f_j(x_j|\theta_{0,j})\|^2 + \|\hat{r}_j(x_j, \hat\theta_j) - r_j(x_j, \theta_{0,j})\|^2. \]

Thus, the asymptotic order of the final estimator is dominated by that of the nonparametric estimator r̂j(xj, θ̂j). When the final estimator is the same as the parametric estimator, √n-consistency holds. We next show the optimal rate of convergence of the proposed estimator. If b̂j = 0, the optimal order of the estimator is O_P(n⁻¹), and so we are interested only in the case fj ∉ Fj. When we use the least squares estimator ||b̃j||⁻¹ as wj, from Horowitz and Mammen [11], rn = O(n^{η/(2η+1)}) with K = O(n^{1/(2η+1)}) under Assumptions A–C. Consequently, we obtain the following corollary.

Corollary 2. Suppose the same assumptions as Theorem 2. Furthermore, assuming K = O(n^{1/(2η+1)}), rn = O(n^{η/(2η+1)}), and λ = O(n^α) with 1/(4η + 2) < α < 1/2, for j = 1, · · · , p, if fj ∉ Fj,

\[ E[\{\hat{f}_j(x_j) - f_j(x_j)\}^2] = O_P\big(n^{-2\eta/(2\eta+1)}\big). \]
It is known that the asymptotic bias of the parametric guided estimator is smaller than that of the fully nonparametric estimator when fj is close to fj (|θ 0,j ). From Barrow and Smith [1], the model bias between the function g
Bias(x) = −
g (η) (x) c(x), η!K η
CR IP T
and the B-spline model can be written as
where g (η) is the ηth derivative of g and c(x) is a bounded function. For the 220
nonparametric estimator, g is the true regression function fj . On the other hand, for the parametric guided estimator, g(x) = fj (x) − fj (x|θ 0,j ). Thus, if (η)
(η)
(η)
AN US
|fj (x) − fj (x|θ 0,j )| < fj (x) for x ∈ [a, b], bias reduction is achieved although the asymptotic variance is unchanged. This point is discussed in related articles, including Glad [9], Fan et al. [6], and Gouch and Genton [10]. Conse225
quently, if fj ∈ Fj , the nonparametric component vanishes. Even if fj 6∈ Fj , the asymptotic bias of the proposed estimator is smaller than that of the fully (η)
(η)
(η)
4. Numerical study
M
nonparametric estimator when |fj (x) − fj (x|θ 0,j )| < fj (x).
ED
4.1. Simulation
In this section, we demonstrate the numerical performance of the proposed
PT
estimator in terms of its estimation accuracy. The generating model is the
CE
additive model:
230
yi = f1 (xi1 ) + · · · + f4 (xi4 ) + εi , i = 1, · · · , n.
The error is independently generated from N (0, σ 2 ). The predictors x1 , · · · , xn
AC
are independently distributed as N4 (0, Σ), where Σ = (ρ|i−j| )ij . When ρ = 0,
Σ is the identity matrix. After xj1 , · · · , xjn are generated, they are modified as xij / maxk=1,··· ,n {|xkj |}. That is, all predictors are within [−1, 1] for j = 1, · · · , 4. The true regression functions are defined as f1 (t) = 2 sin(2πt) + γ1 t, f2 (t) = 2
5t + γ2 t, f3 (t) = 3t2 + γ3 t, and f4 (t) = e−t + γ4 t, where γ1 , · · · , γ4 ∈ R. 2
13
ACCEPTED MANUSCRIPT
Each function is modified so that E[fj (Xj )] = 0. As the parametric model in the first-step estimation, we prepare f1 (t|θ 1 ) = θ11 sin(2πt), f2 (t|θ 2 ) = θ21 t2 , 2
f3 (t|θ 3 ) = θ31 t2 , and f4 (t|θ 4 ) = θ41 e−t . Thus if γ1 = · · · = γ4 = 0, the
CR IP T
true function is included in the class of parametric models for all components. For a given prepared parametric model, the parameter θ is estimated by the
least squares method. In the second-step estimation, we use the cubic B-spline
with K = 5 equidistant knots for each component. The smoothing parameter λ is selected via 5-fold cross-validation. To evaluate the proposed method in
terms of structure selection, we summarize the result in terms of whether the
AN US
nonparametric component becomes 0. If the nonparametric estimator is 0, then the prepared parametric model can be regarded as the correct model. Thus, we count the number of times that the parametric model is identified in each additive component. Next, the estimation accuracy is evaluated by the sample version of mean integrated squared error (MISE):
where R is the number of replications, fˆr is the estimator with the rth repli-
ED
235
R J 1 X1X ˆ {fr (xj ) − f (xj )}2 , R r=1 J j=1
M
MISE =
cation, and x1 , · · · , xJ are test data generated via a 4-dimensional uniform
distribution on [−1, 1]4 . In the above two studies, we set J = 100 and R = 500.
PT
The sample sizes were n = 50, 100, and 200. The proposed semiparametric estimator, the parametric guided estimator with the adaptive group lasso, is 240
denoted SE-AGL. For comparison, we also calculate the MISE of the semipara-
CE
metric estimator with the group lasso (SE-GL), the semiparametric estimator with no penalty (SE-NP), the semiparametric estimator with an L2 penalty
AC
(SE-L2 ), the fully nonparametric estimator with no penalty (NE-NP), the fully nonparametric estimator with an L2 penalty (NE-L2 ), and the fully parametric
245
estimator (PE). If the weights are wj ≡ 1, SE-AGL reduces to SE-GL. Otherwise if λ = 0, SE-AGL and SE-NP are the same. NE-NP and NE-L2 are
constructed by the nonparametric B-spline method. In other words, if all parametric models are fj (xj |θ j ) ≡ 0, SE-NP and SE-L2 14
ACCEPTED MANUSCRIPT
are similar to NE-NP and NE-L2 , respectively. For the L2 penalty, we use the 250
second-order difference penalty proposed by Eilers and Marx [4]. The estimation algorithm of SE-L2 and NE-L2 follows the method of Marx and Eilers [18]. If
CR IP T
γ = (γ1 , · · · , γ4 ) = (0, 0, 0, 0), PE is the estimator of the correct model and so its performance should be better than those of other estimators.
In Table 1, for each additive component, the ratio of the number of times that 255
the nonparametric estimator is the zero function to the number of repetitions (Ratio) is given under several conditions. When γ = (0, 0, 0, 0), the parametric model assumption is correct. From the results, Ratio was near 1 even with
AN US
(ρ, σ) = (0.5, 1). For γ = (0.5, 0.5, 0, 0), only f3 (·|θ 3 ) and f4 (·|θ 4 ) are correct models. Therefore, Ratio was high for f3 and f4 compared with f1 and f2 . 260
However, the proposed method frequently determined that the nonparametric estimates were not needed for f1 and f2 . Similarly, for γ = (0.5, 0.5, 0.5, 0.5), Ratio indicated that the difference between the true function and the model was ignored many times. Intuitively, there is a gap between fj (·) and fj (·|θ j )
265
M
when γj = 5. In fact, for γ = (5, 5, 1, 1) and (5,5,5,5), Ratio is quite small. Thus, we confirmed that the nonparametric components are estimated as the
ED
zero function if fj (·) ≈ fj (·|θ j ). Otherwise if fj (·) 6≈ fj (·|θ j ), the proposed method indicated that the nonparametric components were needed to obtain the final estimator.
270
PT
Table 2 shows the sample MISE values of all the estimators under several conditions. In all cases, the MISE become small as n increases. We found that all
CE
semiparametric and nonparametric estimators had consistency and so the performances of these estimators were basically good. Even when (ρ, σ) increased,
the performances of the estimators were sufficiently good. For γ = (0, 0, 0, 0),
AC
the behaviors of SE-AGP, SE-GP, and PE were better than those of the other
275
estimators. In this case, the model assumption was completely correct and so the estimators of SE-AGP and SE-GP reduced to PE in almost every replication. Thus, we could confirm the efficiency of the proposed method. The rate of convergence of PE is faster than the nonparametric estimator in the correct model case. Hence, the result that the MISE of PE is smallest is not surprising. 15
ACCEPTED MANUSCRIPT
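The data-generating process and the sample MISE described above can be sketched in Python as follows. This is an illustrative reconstruction of the described design, not the author's simulation code, and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, rho, sigma, gamma, rng):
    """One replication of the simulation design of Section 4.1."""
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
    X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
    X = X / np.abs(X).max(axis=0)            # rescale each column into [-1, 1]
    f = [2.0 * np.sin(2.0 * np.pi * X[:, 0]) + gamma[0] * X[:, 0],
         5.0 * X[:, 1]**2 + gamma[1] * X[:, 1],
         3.0 * X[:, 2]**2 + gamma[2] * X[:, 2],
         np.exp(-X[:, 3]**2) + gamma[3] * X[:, 3]]
    F = np.column_stack(f)
    F -= F.mean(axis=0)                      # centre: E[f_j(X_j)] = 0 (sample version)
    y = F.sum(axis=1) + rng.normal(0.0, sigma, n)
    return X, y

def sample_mise(estimates, truth):
    """Sample MISE over R replications and J test points:
    estimates has shape (R, J); truth has shape (J,)."""
    return np.mean((estimates - truth[None, :])**2)

# Example: X, y = generate_data(200, 0.5, 1.0, np.array([0.5, 0.5, 0.0, 0.0]), rng)
```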
Table 1: Ratio for each setting.

               (γ1, γ2, γ3, γ4) = (0, 0, 0, 0)         (γ1, γ2, γ3, γ4) = (0, 0, 0.5, 0.5)
(ρ, σ)       (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)     (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)
n = 50   f1   0.995    0.944     0.863    0.845       0.998    0.933     0.836    0.816
         f2   0.964    0.953     0.854    0.835       0.943    0.895     0.812    0.795
         f3   0.992    0.983     0.834    0.827       0.871    0.820     0.723    0.717
         f4   0.994    0.963     0.946    0.952       0.914    0.878     0.738    0.692
n = 100  f1   1.000    1.000     0.925    0.921       0.963    0.942     0.892    0.851
         f2   1.000    0.993     0.934    0.873       0.944    0.931     0.873    0.812
         f3   1.000    0.994     0.969    0.928       0.908    0.858     0.647    0.682
         f4   1.000    1.000     0.957    0.915       0.902    0.825     0.659    0.646
n = 200  f1   1.000    1.000     0.945    0.932       0.966    0.952     0.912    0.872
         f2   1.000    0.993     0.964    0.821       0.953    0.955     0.926    0.861
         f3   1.000    1.000     0.969    0.935       0.928    0.847     0.683    0.656
         f4   1.000    1.000     0.962    0.953       0.901    0.853     0.688    0.632

               (γ1, γ2, γ3, γ4) = (0.5, 0.5, 0.5, 0.5)  (γ1, γ2, γ3, γ4) = (1, 1, 1, 1)
(ρ, σ)       (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)     (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)
n = 50   f1   0.634    0.626     0.542    0.536       0.231    0.193     0.128    0.081
         f2   0.645    0.598     0.576    0.523       0.192    0.152     0.114    0.102
         f3   0.672    0.623     0.538    0.518       0.151    0.247     0.091    0.075
         f4   0.716    0.701     0.612    0.602       0.266    0.212     0.097    0.071
n = 100  f1   0.512    0.568     0.411    0.433       0.082    0.081     0.027    0.012
         f2   0.523    0.493     0.394    0.387       0.093    0.083     0.014    0.021
         f3   0.496    0.458     0.458    0.416       0.068    0.052     0.002    0.014
         f4   0.532    0.516     0.492    0.439       0.103    0.098     0.023    0.012
n = 200  f1   0.494    0.517     0.354    0.335       0.009    0.012     0.010    0.000
         f2   0.504    0.482     0.327    0.312       0.012    0.023     0.022    0.021
         f3   0.476    0.451     0.368    0.333       0.011    0.016     0.005    0.001
         f4   0.512    0.507     0.416    0.375       0.014    0.021     0.012    0.000

               (γ1, γ2, γ3, γ4) = (5, 5, 1, 1)          (γ1, γ2, γ3, γ4) = (5, 5, 5, 5)
(ρ, σ)       (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)     (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)
n = 50   f1   0.002    0.008     0.004    0.005       0.001    0.002     0.000    0.001
         f2   0.011    0.008     0.008    0.004       0.005    0.003     0.002    0.002
         f3   0.011    0.013     0.021    0.021       0.002    0.022     0.000    0.000
         f4   0.012    0.016     0.018    0.026       0.002    0.002     0.001    0.000
n = 100  f1   0.002    0.006     0.002    0.004       0.002    0.001     0.000    0.000
         f2   0.008    0.005     0.008    0.004       0.001    0.001     0.000    0.000
         f3   0.007    0.010     0.016    0.015       0.000    0.002     0.000    0.000
         f4   0.009    0.011     0.016    0.019       0.000    0.000     0.000    0.000
n = 200  f1   0.002    0.005     0.003    0.007       0.001    0.001     0.000    0.000
         f2   0.007    0.006     0.008    0.004       0.000    0.001     0.000    0.000
         f3   0.007    0.009     0.012    0.020       0.000    0.000     0.001    0.001
         f4   0.008    0.010     0.009    0.016       0.000    0.000     0.000    0.000
Table 2: MISE (×100) of estimators.

                  (γ1, γ2, γ3, γ4) = (0, 0, 0, 0)       (γ1, γ2, γ3, γ4) = (0.5, 0.5, 0, 0)
(ρ, σ)          (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)    (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)
n = 50  SE-AGL   4.544    4.663     18.77    18.78      4.723    5.123     22.15    22.43
        SE-GL    4.551    4.671     18.79    18.81      4.768    5.201     22.17    22.52
        SE-NP    5.237    5.462     21.24    22.42      5.141    6.512     24.24    24.62
        SE-L2    4.813    5.001     23.12    20.15      5.021    5.822     23.56    23.41
        NE-NP    7.341    7.527     30.41    31.14      7.261    8.132     32.34    31.82
        NE-L2    7.124    6.891     28.46    28.31      6.621    6.871     29.51    29.43
        PE       4.531    4.661     18.71    18.72      4.763    5.131     22.21    22.44
n = 100 SE-AGL   0.624    0.631     7.352    7.435      0.713    0.726     1.032    1.102
        SE-GL    0.648    0.651     7.514    7.601      0.721    0.763     1.051    1.115
        SE-NP    0.741    0.783     8.125    8.461      0.778    0.792     3.775    5.443
        SE-L2    0.641    0.614     0.742    0.762      0.735    0.746     1.101    1.162
        NE-NP    0.952    0.992     10.51    10.78      0.931    0.956     1.277    1.444
        NE-L2    0.767    0.725     7.642    7.562      0.831    0.872     1.163    1.212
        PE       0.624    0.631     7.415    7.412      0.741    0.771     1.113    1.176
n = 200 SE-AGL   0.315    0.363     3.572    3.535      0.426    0.425     0.526    0.542
        SE-GL    0.331    0.384     3.601    3.566      0.427    0.429     0.532    0.542
        SE-NP    0.342    0.365     3.612    3.613      0.433    0.435     0.541    0.547
        SE-L2    0.333    0.357     3.593    3.583      0.433    0.428     0.527    0.541
        NE-NP    0.352    0.394     3.652    3.664      0.472    0.461     0.553    0.561
        NE-L2    0.350    0.382     3.636    3.664      0.457    0.442     0.544    0.542
        PE       0.314    0.363     3.573    3.531      0.421    0.418     0.520    0.535

                  (γ1, γ2, γ3, γ4) = (0.5, 0.5, 0.5, 0.5)  (γ1, γ2, γ3, γ4) = (1, 1, 1, 1)
(ρ, σ)          (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)    (0,0.5)  (0.5,0.5)  (0,1)   (0.5,1)
n = 50  SE-AGL   4.786    4.738     19.94    20.56      4.931    5.562     23.56    23.62
        SE-GL    4.784    4.741     19.96    20.55      4.937    5.572     23.62    23.71
        SE-NP    4.913    4.826     20.15    20.92      5.146    5.637     23.91    24.32
        SE-L2    4.521    4.745     19.75    20.46      5.024    5.552     23.52    23.92
        NE-NP    4.832    4.831     20.35    21.01      5.172    5.663     23.74    24.05
        NE-L2    4.522    4.751     19.97    21.02      5.041    5.613     23.57    23.82
        PE       6.352    8.416     31.25    32.25      84.05    84.26     122.2    122.8
n = 100 SE-AGL   0.643    0.675     9.256    12.46      0.747    0.812     12.25    12.93
        SE-GL    0.677    0.673     10.01    13.15      0.751    0.815     12.12    12.81
        SE-NP    0.644    0.681     9.325    13.72      0.834    0.891     12.31    14.03
        SE-L2    0.641    0.661     9.253    12.31      0.746    0.801     12.21    12.83
        NE-NP    0.672    0.713     9.672    13.95      0.851    0.931     13.21    14.21
        NE-L2    0.652    0.691     9.421    12.62      0.792    0.812     12.22    12.87
        PE       2.251    2.302     16.31    20.13      6.525    87.36     115.4    112.5
n = 200 SE-AGL   0.342    0.345     3.531    4.356      0.341    0.372     4.011    4.423
        SE-GL    0.361    0.352     3.601    4.418      0.354    0.391     4.101    4.491
        SE-NP    0.385    0.394     3.667    4.492      0.407    0.426     4.182    4.602
        SE-L2    0.362    0.344     3.512    4.343      0.345    0.373     4.095    4.503
        NE-NP    0.421    0.425     3.932    4.636      0.441    0.494     4.418    4.712
        NE-L2    0.381    0.393     3.771    4.531      0.409    0.451     4.215    4.611
        PE       2.531    2.321     17.12    20.13      6.613    89.21     119.4    120.6
Table 3: Ratio for LAND, SRP, ASR, and the proposed method (SPG).

                 LAND              SRP               ASR               SPG
(ρ, σ)       (0,0.5)  (0.5,1)  (0,0.5)  (0.5,1)  (0,0.5)  (0.5,1)  (0,0.5)  (0.5,1)
n = 50   f1   0.924    0.763    0.941    0.787    0.951    0.764    0.986    0.854
         f2   0.913    0.712    0.912    0.756    0.935    0.735    0.951    0.836
         f3   0.423    0.435    0.321    0.335    0.352    0.356    0.343    0.313
         f4   0.000    0.002    0.000    0.001    0.000    0.000    0.000    0.000
n = 100  f1   0.985    0.883    0.988    0.857    0.984    0.894    0.995    0.864
         f2   0.975    0.881    0.983    0.841    0.977    0.915    0.991    0.876
         f3   0.351    0.321    0.271    0.267    0.235    0.314    0.225    0.281
         f4   0.000    0.002    0.000    0.001    0.000    0.000    0.000    0.000
n = 200  f1   0.995    0.893    1.000    0.932    1.000    0.932    1.000    0.942
         f2   0.994    0.897    1.000    0.941    1.000    0.934    1.000    0.941
         f3   0.214    0.251    0.136    0.253    0.202    0.256    0.198    0.212
         f4   0.000    0.002    0.000    0.001    0.000    0.000    0.000    0.000
For γ = (0.5, 0.5, 0.5, 0.5), the MISE values of SE-AGL and SE-GL are smaller than that of PE for all combinations of (n, ρ, σ). This indicates that the model assumption was not correct in some replications and the nonparametric components captured the difference between the true function and the prepared model for each additive component. For γ = (1, 1, 1, 1), the performance of PE was inferior to those of the other semiparametric and nonparametric estimators. On the other hand, the MISE of the proposed estimator SE-AGL is near that of the other nonparametric estimators. Thus, we found that the performance of the proposed estimator is guaranteed to be on the same level as a fully nonparametric estimator even if the model assumption is not correct.

Finally, we compare the proposed method with other structure discovery methods, namely those of Zhang et al. [28] (LAND: linear and nonlinear discovery), Huang et al. [14] (SRP: semiparametric regression pursuit), and Wu and Stefanski [26] (ASR: automatic structure recovery). All of the above methods consider only the linear model fj(xj|θj) = θj xj. Therefore, we investigate the classification performance of linear structure discovery between our method and
these related methods. The true regression functions are redefined as f1(t) = 2t, f2(t) = t, f3(t) = e^{(t−1)/2}, and f4(t) = sin(2πt). The parametric model is set as the linear function fj(xj|θj) = θj xj (j = 1, 2, 3, 4). Thus, fj ∈ Fj (j = 1, 2), f4 ∉ F4, and f3 ∉ F3 but f3 ≈ f3(·|θ3). Similarly to Table 1, we calculate Ratio for the methods LAND, SRP, and ASR and the proposed method (SPG: sparse parametric guided). For f1 and f2, the true model is included in the linear model and so Ratio should be high. For f3 and f4, on the other hand, the value of Ratio is expected to be low. Table 3 shows the Ratio values for all estimators. We found that all the estimators discovered the linear structure for f1 and f2 in many replications. For f4 as well, the high accuracy of the methods was confirmed. For f3, Ratio is somewhat high since f3 is nearly linear; however, as n increases, Ratio becomes smaller. Although the difference is slight, the proposed estimator has better or similar performance compared with the other methods.

4.2. Application

We apply the proposed method to the analysis of a men's decathlon dataset that is available in the R package scar of Chen and Samworth [3]. This dataset includes data on male decathlon athletes who scored at least 6500 points in at least one athletic competition in 2012 and obtained points in every event of that competition. The sample size is n = 614. The purpose of the analysis is to investigate the relationship between the results of different events. For the analysis in this paper, we use only four variables. The response Y is the points scored in the javelin throw. The predictors are x1, the points scored in the shot put; x2, the points scored in the discus throw; and x3, the points scored in pole vaulting. These predictors have the three highest correlations with Y among all variables. To obtain numerical stability, all variables were rescaled as xij = xij / maxi{xij} (j = 1, 2, 3).

First, we select the class of parametric models for each component on the basis of a pilot study. In Figure 1, the fully nonparametric estimator of each additive function is shown. The estimators were constructed via the penalized spline method (see Marx and Eilers [18]). Using this information, we set f1(t|θ11) = θ11 t, f2(t|θ21, θ22, θ23, θ24) = θ21 t + θ22 t² + θ23 t³ + θ24 t⁴, and f3(t|θ31) = θ31 exp[t²].
[Figure 1: Fully nonparametric estimators of f1 (left), f2 (middle), and f3 (right).]

[Figure 2: Parametric estimators of f1, f2, and f3. The left, middle, and right panels are for the least squares estimators of f1(t|θ11), f2(t|θ21, θ22, θ23, θ24), and f3(t|θ31).]
From the least squares method, we obtained θ̂ = (θ̂11, θ̂21, θ̂22, θ̂23, θ̂24, θ̂31) = (1.12, −4.15, 1.74, 7.32, −5.96, 0.12). Figure 2 shows the structure of the parametric estimator for each component. We found that the structures of the fully nonparametric estimator and the parametric estimator are similar for f1 and f3. For f2, the curves of the two estimators are similar for x2 > 0.7, but clearly different for x2 < 0.5.

Using the parametric guided method with the adaptive group lasso, we check whether the model assumption is valid. In the second step of the proposed method, the degree of the B-spline bases was d = 3 (cubic spline), the number of knots was K = 5, and the smoothing parameter λ was determined by 5-fold cross-validation.
[Figure 3: Estimators for the second component f2. Left: proposed estimator f̂2(x2) (solid), parametric estimator f2(x2|θ̂2) (dashed), and nonparametric estimator of the residual r̂2(x2, θ̂2) (dot-dashed). Right: proposed estimator (solid) and fully nonparametric estimator (dashed).]
As a result, for f1 and f3, the nonparametric component was estimated as the zero function. That is, the models f1(x1|θ11) = θ11 x1 and f3(x3|θ31) = θ31 exp[x3²] were judged to be correct models. On the other hand, the nonparametric component for f2 did not vanish; indeed, we obtained ||b̂2|| = 4.122. Figure 3 shows the proposed estimator f̂2(x2) = f2(x2|θ̂2) + r̂2(x2, θ̂2). From the left panel, we see that r̂2(x2, θ̂2) is not the zero function and that the curve changes rapidly for x2 < 0.5. This indicates that the parametric component could not capture the true curve in this region and the nonparametric component modified the parametric curve. In the right panel, the proposed estimator and the fully nonparametric estimator are plotted. The parametric guided estimator has less curvature than the nonparametric estimator for 0.35 < x2 < 0.5.

To evaluate the accuracy of the estimator, we calculated the prediction error

\[ \mathrm{PEr} = E[(Y_{\mathrm{test}} - \hat{f}(x_{\mathrm{test}}))^2], \]

where (Ytest, xtest) denotes the test data, which are independent of the training data used to construct f̂(x). Specifically, PEr was approximated via its sample version in the context of 5-fold validation with 50 replications. The PEr (standard deviation) of the proposed estimator was 0.0783 (0.021). We also calculated the PEr (standard deviation) of the parametric estimator and the nonparametric estimator, and obtained 0.0963 (0.023) and 0.0812 (0.025), respectively. Thus, we found that the proposed estimator is more useful than the fully parametric estimator and the fully nonparametric estimator from the prediction error point of view.
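A sketch of this prediction-error approximation, assuming a fitting routine `fit` that returns a prediction function; the names and the exact replication scheme shown are illustrative assumptions:

```python
import numpy as np

def prediction_error(fit, X, y, n_folds=5, n_rep=50, seed=0):
    """Approximate PEr = E[(Y_test - fhat(x_test))^2] by repeated K-fold
    cross-validation; fit(X_train, y_train) must return a callable fhat.
    The standard deviation reported here is across fold-level errors."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_rep):
        idx = rng.permutation(len(y))
        for fold in np.array_split(idx, n_folds):
            train = np.setdiff1d(idx, fold)
            fhat = fit(X[train], y[train])
            errors.append(np.mean((y[fold] - fhat(X[fold]))**2))
    return float(np.mean(errors)), float(np.std(errors))
```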
5. Future extensions

We now describe some future problems related to this study. In this paper, we considered an additive model for which the proposed estimator discovers some parametric components. Thus, the model can be written as

\[ f(x) = \sum_{j=1}^{q} f_j(x_j\,|\,\theta_j) + \sum_{j=q+1}^{p} f_j(x_j), \]

where q < p. One extension is sparse additive models of the form

\[ f(x) = \sum_{j=1}^{p_1} f_j(x_j\,|\,\theta_j) + \sum_{j=p_1+1}^{p_1+p_2} f_j(x_j) + \sum_{j=p_1+p_2+1}^{p} 0(x_j), \tag{5} \]

where p1 + p2 < p and 0(xj) is the zero function. To cover this case, a parametric model should be constructed so that fj(xj|θj) = 0 for some θj. For example, basis function models can be used. That is, we assume that the parametric model is

\[ f_j(x_j\,|\,\theta_j) = \theta_j^T \phi_j(x_j), \]

where φj(xj) = (φ1,j(xj), · · · , φ_{pj,j}(xj))ᵀ and φk,j : R → R (k = 1, · · · , pj). If the parameter θ = (θ1, · · · , θp) is estimated via a sparse method such as the group lasso or the adaptive group lasso in the first-step estimation, this extension will be achieved (a small sketch is given below). Here, the adaptive group lasso estimator θ̂ is defined as the minimizer of

\[ \sum_{i=1}^{n}(Y_i - f(x_i\,|\,\theta))^2 + \lambda_1 \sum_{j=1}^{p} u_j \sqrt{\theta_j^T \theta_j}, \]

where λ1 is the smoothing parameter and u1, · · · , up are weights. When all weights are 1, the estimator reduces to the group lasso estimator. Under some reasonable conditions, the sparsity and consistency of the parametric and final estimators will be guaranteed.

A second extension is the high-dimensional case (p > n). In the high-dimensional setting, it is very difficult to construct the class of parametric models. Thus, this problem is not easy, despite being interesting. As one option, after the important variables have been chosen by sure independence screening (Fan et al. [5]) or model-free variable screening (Zhu et al. [30]), we can estimate the regression functions in a p < n setting.
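As a small sketch of the first extension above, under the assumption of a polynomial basis (one choice of basis that contains the zero function):

```python
import numpy as np

def guide_design(xj, p_j):
    """Phi_j with columns phi_{1,j}(x_j), ..., phi_{p_j,j}(x_j); here the
    hypothetical choice phi_{k,j}(t) = t^k, so that
    f_j(x_j | theta_j) = Phi_j @ theta_j and theta_j = 0 gives the zero function."""
    return np.column_stack([xj**k for k in range(1, p_j + 1)])

# With this linear-in-parameters guide, the first step becomes a group lasso
# problem over (theta_1, ..., theta_p); the LQA routine sketched in Section 2
# can be reused with Z_j replaced by Phi_j.
```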
Thirdly, we can add a smoothness penalty in the second-step estimation. That is, the loss function (3) is modified to

\[ L(b_1, \cdots, b_p) = \sum_{i=1}^{n}\Big(\tilde{Y}_i - \sum_{j=1}^{p} b_j^T B(x_{ij})\Big)^2 + \lambda \sum_{j=1}^{p} w_j\|b_j\| + \lambda_2 \sum_{j=1}^{p} v_j\, b_j^T Q b_j, \tag{6} \]

where λ2 is a smoothing parameter, v1, · · · , vp are weights, and Q is the second-order difference matrix (see Eilers and Marx [4]; a construction of Q is sketched below). According to existing studies on B-spline smoothing, the performance of the minimizer of (6) is better than that of (3). However, it is difficult to investigate the theoretical properties of an estimator obtained via (6). For further study, the development of the sparsity and the rate of convergence of the estimator obtained from (6) would be interesting. Also, extensions to other models, such as generalized linear models, varying coefficient models, and quantile regression models, are important problems.
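For completeness, a sketch of a standard construction of Q, following the second-order difference penalty of Eilers and Marx [4]; the function name is ours:

```python
import numpy as np

def second_order_penalty(q):
    """Q = D2' D2 with D2 the (q-2) x q second-order difference matrix,
    so that b_j' Q b_j penalises roughness of the spline coefficients."""
    D2 = np.diff(np.eye(q), n=2, axis=0)
    return D2.T @ D2
```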
Appendix

We give the proofs of the theorems of this paper. First, for convenience, we introduce some notation. Let Ỹ⁰ = (Ỹ1⁰, · · · , Ỹn⁰)ᵀ with Ỹi⁰ = Yi − f(xi|θ0), let Zj = (Bk(xij))ik for j = 1, · · · , p, and let Z = [Z1, · · · , Zp]. The loss function (4) can be written as

\[ L_0(b\,|\,\lambda) = \|\tilde{Y}^0 - Zb\|^2 + \lambda \sum_{j=1}^{p} w_j\|b_j\|, \]

where b = (b1ᵀ, · · · , bpᵀ)ᵀ. Define A = {j ∈ {1, · · · , p} : fj(·) ≠ fj(·|θ0,j)} and the n × |A|(K + d) matrix ZA = (Zj, j ∈ A), where |A| is the cardinality of A. To show Theorems 1 and 2, we will use the following lemma.

Lemma 1. The asymptotic order of the maximum eigenvalue of the matrix ZAᵀZA is O(nK⁻¹). Furthermore, that of (ZAᵀZA)⁻¹ is O(n⁻¹K).

The proof of Lemma 1 is described in Lemma 3 of Huang et al. [13].

Outline of proof of Theorem 1. The proof of Theorem 1 is the same as that for the nonparametric estimator under the assumption that the true function is rj(xj, θ0,j) = fj(xj) − fj(xj|θ0,j) (j = 1, · · · , p). Then, if fj ∈ Fj, rj(xj, θ0,j) = 0. Thus, for fj ∈ Fj, the sparsity property is obtained. For fj ∉ Fj, the rate of convergence of the B-spline estimator can be shown as in the proofs of Theorems 3 and 4 of Huang et al. [13], and so the details are omitted.

Proof of Theorem 2. To show Theorem 2, it is sufficient to show that

\[ P(\|\hat{b}_j\| > 0) \to 0, \quad j \notin A, \]

and to evaluate the asymptotic order of

\[ \|\hat{b}_j - \hat{b}_{0,j}\|, \quad j \in A, \]

as n → ∞. Let Ỹ = (Ỹ1, · · · , Ỹn)ᵀ and let γn = (Σ_{j=1}^p γj,n(xij), i = 1, · · · , n)ᵀ. Then Ỹi can be written as

\[ \tilde{Y}_i = Y_i - \sum_{j=1}^{p} f_j(x_{ij}|\hat\theta_j) = \tilde{Y}_{0,i} - \sum_{j=1}^{p}\{f_j(x_{ij}|\hat\theta_j) - f_j(x_{ij}|\theta_{0,j})\} = \tilde{Y}_{0,i} - \sum_{j=1}^{p}\gamma_{j,n}(x_{ij}). \]

Thus, we have Ỹ = Ỹ⁰ − γn and γn = O_P(n^{−1/2}). Therefore, (3) is asymptotically equivalent to L0(b|λ) and the first assertion of Theorem 2 is obvious from the sparsity property of Theorem 1. On the other hand, the second assertion is not obvious and we should check whether the influence of γn is negligible. Let Tj be a (K + d) × |A|(K + d) matrix with the form Tj = (O, O, · · · , I, O, · · · , O), where O is a (K + d) square matrix of zeros and I is a (K + d) identity matrix. From the proof of Theorem 3 of Huang et al. [13], for j ∈ A,

\[ \hat{b}_j - \hat{b}_{0,j} = T_j(Z_A^T Z_A)^{-1} Z_A^T \gamma_n - \lambda w_j T_j (Z_A^T Z_A)^{-1}\Big(\frac{\hat{b}_j}{\|\hat{b}_j\|} - \frac{\hat{b}_{0,j}}{\|\hat{b}_{0,j}\|}\Big). \]

From Lemma 1, we have

\[ \|T_j(Z_A^T Z_A)^{-1} Z_A^T \gamma_n\|^2 \le O(K^2/n^2)\,\|Z_A^T \gamma_n\|^2 = O(K^2/n^2)\,O(n/K)\,O(n^{-1}) = O(K/n^2). \]

Furthermore, since wj = O(1) for j ∈ A, Lemma 1 yields that

\[ \|\lambda w_j T_j(Z_A^T Z_A)^{-1}\hat{b}_j\|^2/\|\hat{b}_j\|^2 \le O_P(\lambda^2 K^2/n^2). \]

Hence,

\[ \|\hat{b}_j - \hat{b}_{0,j}\|^2 \le O(K/n^2) + O_P(\lambda^2 K^2/n^2), \]

and this yields that

\[ \|\hat{b}_j - b_{0,j}\|^2 \le \|\hat{b}_j - \hat{b}_{0,j}\|^2 + \|b_{0,j} - \hat{b}_{0,j}\|^2 = O_P(K/n^2) + O_P(\lambda^2 K^2/n^2) + O(K^{-2\eta+1}) + O_P(K^2/n) + O_P(\lambda^2 K^2/n^2) = O(K^{-2\eta+1}) + O_P(K^2/n) + O_P(\lambda^2 K^2/n^2). \]

This completes the second assertion of Theorem 2. By the properties of the B-spline function, there exist constants c3, c4 > 0 such that

\[ c_3 K^{-1}\|\hat{b}_j - b_{0,j}\|^2 \le \|\hat{r}_j(x_j, \hat\theta_j) - r_j(x_j, \theta_{0,j})\|^2 \le c_4 K^{-1}\|\hat{b}_j - b_{0,j}\|^2. \]

Thus, we obtain the final assertion of Theorem 2 as

\[ \|\hat{r}_j(x_j, \hat\theta_j) - r_j(x_j, \theta_{0,j})\|^2 = O(Kn^{-1}) + O(K^{-2\eta}) + O(K\lambda^2 n^{-2}). \]

Acknowledgments

The author is grateful to the Editor, Associate Editor, and two anonymous referees for their valuable comments and suggestions, which led to improvement of the paper. The research of the author was partly supported by KAKENHI 26730019.

References

[1] Barrow, D.L. and Smith, P.W. (1978). Asymptotic properties of best L2[0,1] approximation by splines with variable knots. Quart. Appl. Math. 36, 293-304.
[2] Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17, 453-555.
[3] Chen, Y. and Samworth, R.J. (2014). scar: shape-constrained additive regression: a maximum likelihood approach. R package version 0.2-0. Centre for Mathematical Sciences, University of Cambridge, Cambridge.
[4] Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties. Statist. Sci. 11, 89-121.
[5] Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106, 544-557.
[6] Fan, J., Wu, Y. and Feng, Y. (2009). Local quasi-likelihood with a parametric guide. Ann. Statist. 37, 4153-4183.
[7] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348-1360.
[8] Fan, J., Maity, A., Wang, Y. and Wu, Y. (2013). Parametrically guided generalized additive models with application to mergers and acquisitions data. J. Nonparametr. Statist. 25, 109-128.
[9] Glad, I.K. (1998). Parametrically guided non-parametric regression. Scand. J. Statist. 25, 649-668.
[10] Ghouch, A.E. and Genton, M.C. (2009). Local polynomial quantile regression with parametric features. J. Amer. Statist. Assoc. 104, 1416-1429.
[11] Horowitz, J.L. and Mammen, E. (2004). Nonparametric estimation of an additive model with a link function. Ann. Statist. 32, 2412-2443.
[12] Horowitz, J.L., Klemelä, J. and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli 12, 271-298.
[13] Huang, J., Horowitz, J.L. and Wei, F. (2010). Variable selection in nonparametric additive models. Ann. Statist. 38, 2282-2313.
[14] Huang, J., Wei, F. and Ma, S. (2012). Semiparametric regression pursuit. Statistica Sinica 22, 1403-1426.
[15] Lian, H., Liang, H. and Ruppert, D. (2015). Separation of covariates into nonparametric and parametric parts in high-dimensional partially linear additive models. Statistica Sinica 25, 591-608.
[16] Liu, R. and Yang, L. (2010). Spline-backfitted kernel smoothing of additive coefficient model. Econometric Theory 26, 29-59.
[17] Martins-Filho, C., Mishra, S. and Ullah, A. (2008). A class of improved parametrically guided nonparametric regression estimators. Econometric Rev. 27, 542-573.
[18] Marx, B.D. and Eilers, P.H.C. (1998). Direct generalized additive modeling with penalized likelihood. Comput. Statist. & Data Anal. 28, 193-209.
[19] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Statist. 37, 3779-3821.
[20] Naito, K. (2002). Semiparametric regression with multiplicative adjustment. Communications in Statistics, Theory and Methods 31, 2289-2309.
[21] Opsomer, J.D. (2000). Asymptotic properties of backfitting estimators. J. Mult. Anal. 73, 166-179.
[22] Opsomer, J.D. and Ruppert, D. (1998). A fully automated bandwidth selection method for fitting additive models. J. Amer. Statist. Assoc. 93, 605-619.
[23] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. Roy. Statist. Soc. B 71, 1009-1030.
[24] Stone, C.J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13, 689-705.
[25] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1-25.
[26] Wu, Y. and Stefanski, L.A. (2015). Automatic structure recovery for additive models. Biometrika 102, 381-395.
[27] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68, 49-67.
[28] Zhang, H.H., Cheng, G. and Liu, Y. (2011). Linear or nonlinear? Automatic structure discovery for partially linear models. J. Amer. Statist. Assoc. 106, 1099-1112.
[29] Zhang, H.H. and Lin, Y. (2006). Component selection and smoothing for nonparametric regression in exponential families. Statist. Sinica 16, 1021-1041.
[30] Zhu, L.P., Li, L., Li, R. and Zhu, L.X. (2011). Model-free feature screening for ultrahigh-dimensional data. J. Amer. Statist. Assoc. 106, 1464-1475.
[31] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101, 1418-1429.