Semiparametric method for model structure discovery in additive regression models

Accepted Manuscript

Semiparametric method for model structure discovery in additive regression models
Takuma Yoshida

PII: S2452-3062(17)30014-X
DOI: 10.1016/j.ecosta.2017.02.005
Reference: ECOSTA 49

To appear in: Econometrics and Statistics

Received date: 10 February 2016
Revised date: 9 February 2017
Accepted date: 16 February 2017

Please cite this article as: Takuma Yoshida, Semiparametric method for model structure discovery in additive regression models, Econometrics and Statistics (2017), doi: 10.1016/j.ecosta.2017.02.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Semiparametric method for model structure discovery in additive regression models

Takuma Yoshida
Kagoshima University, Graduate School of Science and Engineering, 1-21-35 Korimoto, Kagoshima 890-8580, Japan. Tel.: +81-099-285-8039; [email protected]

Abstract

In regression analysis, there are two typical approaches: parametric methods and nonparametric methods. If prior information on the structure of the regression function is available, a parametric method is preferred, since it is efficient and easily interpreted. When the model is misspecified, on the other hand, parametric estimators do not work well. Therefore, it is important to check whether the parametric model assumption is valid. To simultaneously discover the model structure and estimate the regression function in additive regression models, a new semiparametric method is proposed. First, a parametric model is prepared and its estimator is obtained for all additive components. Next, a nonparametric method is applied to the residual data associated with the parametric estimator. The final estimator is constructed by summing the parametric estimator and the nonparametric estimator of the residual data. In the second-step estimation, the B-spline method with an adaptive group lasso penalty is utilized. For each additive component, if the nonparametric estimator becomes the zero function, the final estimator reduces to a parametric estimator; in other words, the model structure can then be discovered. The asymptotic properties of the proposed estimator are shown. A numerical study via a Monte Carlo simulation and a real data application are presented.

Keywords: Adaptive group lasso, Additive model, Parametric guided estimation, Parametric model discovery, Spline smoothing

2010 MSC: 62G08, 62J07

Preprint submitted to Econometrics and Statistics, March 11, 2017

1. Introduction

Multiple regression captures the functional relationship between a scalar response and a multivariate predictor. One typical technique for multiple regression is the additive model, since it has the advantage of avoiding the phenomenon known as "the curse of dimensionality". For a scalar response $Y$ and a $p$-variate predictor $x = (x_1, \cdots, x_p)$, the additive model is formulated as
$$Y = \mu + \sum_{j=1}^{p} f_j(x_j) + \varepsilon, \qquad (1)$$
where $\mu$ is the unknown intercept parameter, the $f_j$ are unknown regression functions, and $\varepsilon$ is the error with mean 0 and finite variance. The additive model was proposed by Buja et al. [2], and several authors have developed regression analysis with additive models in the context of nonparametric methods. For example, Stone [24] used the B-spline method. Opsomer and Ruppert [22] and Opsomer [21] studied the kernel smoothing method. Horowitz and Mammen [11], Horowitz et al. [12], and Liu and Yang [16] proposed two-stage nonparametric estimation methods. Recently, additive models for sparse settings have been developed; typical results on sparse additive regression were obtained by Zhang and Lin [29], Meier et al. [19], Ravikumar et al. [23], and Huang et al. [13]. Thus, additive models are widely used.

The development of additive models has basically focused on nonparametric regression methods. On the other hand, if it is believed, on the basis of prior information or a pilot study, that the parametric model assumption is valid for some $f_j$, we can use the parametric model $f_j(x_j|\theta_j)$ for the $j$th component. When all components can be expressed as parametric models, (1) becomes
$$Y = \mu + \sum_{j=1}^{p} f_j(x_j|\theta_j) + \varepsilon.$$
Then, the parameters $\theta_1, \cdots, \theta_p$ are estimated via least squares or other methods. Since the parametric method has many advantages, such as that the interpretation of the estimator is easy, the rate of convergence of the estimator is faster than that of the nonparametric estimator, and the estimation algorithm

is simple, the parametric model assumption is useful. However, in general, we cannot know whether the parametric model is valid. If the model assumption is reasonable, the parametric estimator works well; if it is not valid, on the other hand, the performance of the parametric estimator is not guaranteed. In this paper, we propose a new estimation method that checks the correctness of the parametric model assumption and estimates the regression function simultaneously. For this, we use a hybrid technique combining the parametric guided method and the adaptive group lasso.

We now briefly describe the parametric guided method and introduce the idea of our study. For $j = 1, \cdots, p$, we prepare the parametric model $f_j(\cdot|\theta_j)$ and estimate $(\theta_1, \cdots, \theta_p)$ using the least squares method or other methods. Next, a nonparametric method is applied to the residual data associated with the parametric estimator. Finally, the estimator of each additive component is constructed by summing the parametric estimator of the true function and the nonparametric estimator of the residual data. Thus, the estimator is obtained by a two-step procedure. The final estimator combines the parametric and nonparametric structures, and so the proposed estimator is a semiparametric estimator. For regression problems, the parametric guided method has been developed by Glad [9], Naito [20], Martins et al. [17], Fan et al. [6], and Fan et al. [8]. The main assertion of this approach is that the parametric guided estimator behaves better than the fully nonparametric estimator when the prepared parametric model is near the true function. That is, the parametric guided estimator can be regarded as an improved version of the fully nonparametric estimator. On the other hand, the parametric guided estimator was not compared with a parametric estimator in the aforementioned articles. If the parametric model assumption is correct, the nonparametric component may not be needed. This motivates us to obtain a parametric estimator if $f_j(\cdot) = f_j(\cdot|\theta_j)$ and a nonparametric estimator if $f_j(\cdot) \neq f_j(\cdot|\theta_j)$ for $j = 1, \cdots, p$. To achieve this, we use the adaptive group lasso method in the second step of the parametric guided method.

The adaptive group lasso is known as a sparse estimation method and is a

combination of the adaptive lasso and the group lasso. The adaptive lasso and the group lasso were proposed by Zou [31] and Yuan and Lin [27], respectively. Huang et al. [13] developed the adaptive group lasso for additive models (1) in nonparametric regression. In their setting, a sparse additive model is assumed, so that there exist $j$ such that $f_j \equiv 0$. Using the adaptive group lasso, the nonparametric estimator reduces to the zero function if the true function is the zero function. In our setting, the true function is the mean function of the residual: $r_j(\cdot, \theta_j) = f_j(\cdot) - f_j(\cdot|\theta_j)$ for $j = 1, \cdots, p$. Therefore, the final estimator reduces to the parametric estimator if $r_j$ is estimated to be the zero function. Such a case indicates that the parametric estimator can capture the structure of the regression function, and so we can decide $f_j(\cdot) = f_j(\cdot|\theta_j)$. If the estimator of a nonparametric component is not the zero function, the semiparametric estimator is obtained for that component. Consequently, using the adaptive group lasso in the parametric guided method, the model structure can be discovered.

Related methods have been proposed by Zhang et al. [28], Huang et al. [14], Lian et al. [15], and Wu and Stefanski [26]. However, Zhang et al. [28], Huang et al. [14], and Lian et al. [15] considered only the linear model $f_j(\cdot|\theta_j) = \theta_j x_j$. Zhang et al. [28] and Lian et al. [15] proposed the double penalization method, but their method cannot be extended to the general parametric model case. On the other hand, in Huang et al. [14], the true regression model was assumed to be $f_j(x_j) = f_j(x_j|\theta_j) + g_j(x_j)$, where $f_j(x_j|\theta_j) = \theta_j x_j$ and $g_j$ is a smooth function, and the profiling method was used to estimate $\theta_1, \cdots, \theta_p$ and $g_1, \cdots, g_p$. Wu and Stefanski [26] proposed a kernel-based variable selection method in which the parametric model is restricted to a polynomial model. Thus, all the above methods considered only linear and polynomial models. In contrast, there is no limit on the choice of the shape of the parametric models in our method, and so our method can be applied more widely than those in the above previous studies.

The remainder of this paper is organized as follows. In Section 2, we propose a two-step estimation based on the parametric guided method and consider

sparse estimation. In addition, we note how to interpret our estimator. Section 3 shows the mathematical properties of our estimator. In Section 4, the proposed method is illustrated with Monte Carlo simulations and an analysis of a real data set. The proofs for the results given in Section 3 are provided in the Appendix.

2. Parametric guided estimation with adaptive group lasso

Let $Y$ and $X = (X_1, \cdots, X_p)$ be a random variable and a random vector of dimension $p$, respectively, and let $\{(Y_i, X_i) : i = 1, \cdots, n\}$ be a random sample drawn from $(Y, X)$. We assume that each $X_j$ is supported on an interval $[a, b]$, where $a < b$ are finite numbers. We now consider the additive regression model
$$Y_i = \mu + f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i, \quad i = 1, \cdots, n,$$
where $\mu$ is an unknown intercept parameter, the $f_j$'s are unknown univariate regression functions, $x_i = (x_{i1}, \cdots, x_{ip})$ are the given values of the $i$th covariates, and $\varepsilon_1, \cdots, \varepsilon_n$ are independent and identically distributed (i.i.d.) errors with mean 0 and finite variance $\sigma^2$. To ensure identifiability of each additive function, suppose that $E[f_j(X_j)] = 0$ for $j = 1, \cdots, p$. The aim is to estimate $\mu$ and $f_1, \cdots, f_p$. It is easy to see that the ordinary estimator of $\mu$ is $\hat{\mu} = n^{-1}\sum_{i=1}^{n} Y_i$. Therefore, all responses are centered as $Y_i - \hat{\mu}$, and we consider the following model:
$$Y_i = f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i \qquad (2)$$
with centered $Y_i$. In (2), we estimate $f_1, \cdots, f_p$ by combining the parametric guided method and the adaptive group lasso.

First, we prepare the parametric models $f_j(x_j|\theta_j)$ $(j = 1, \cdots, p)$ by using prior information or pilot research, and consider the parametric additive model
$$Y_i = f_1(x_{i1}|\theta_1) + \cdots + f_p(x_{ip}|\theta_p) + \varepsilon_i.$$
Then, the parameters $\theta_1, \cdots, \theta_p$ are estimated via some method such as the least squares method. Let $\hat{\theta}_1, \cdots, \hat{\theta}_p$ be the least squares estimators of

$\theta_1, \cdots, \theta_p$. The parametric estimator of $f_j(x_j)$ is obtained as a centered version of $f_j(x_j|\hat{\theta}_j)$. The true additive model (2) can then be written as
$$Y_i - f_1(x_{i1}|\hat{\theta}_1) - \cdots - f_p(x_{ip}|\hat{\theta}_p) = r_1(x_{i1}, \hat{\theta}_1) + \cdots + r_p(x_{ip}, \hat{\theta}_p) + \varepsilon_i,$$
where $r_j(x_j, \theta_j) = f_j(x_j) - f_j(x_j|\theta_j)$. The function $r_j(x_j, \hat{\theta}_j)$ contains the unknown function $f_j$ and hence remains an unknown function. Accordingly, the functions $r_1, \cdots, r_p$ are estimated via a nonparametric method with the adaptive group lasso. In this paper, we use the B-spline method. Let $\{B_k : k = 1, \cdots, K+d\}$ be a $d$th-degree B-spline basis with knots $a = \kappa_0 < \cdots < \kappa_K = b$ and additional knots $\kappa_{-d+1} = \cdots = \kappa_{-1} = a$ and $\kappa_{K+1} = \cdots = \kappa_{K+d} = b$. For $j = 1, \cdots, p$, we approximate $r_j(x_j, \hat{\theta}_j)$ by the B-spline model
$$\sum_{k=1}^{K+d} b_{j,k} B_k(x_j) = b_j^T B(x_j),$$
where the $b_{j,k}$'s are unknown parameters, $b_j = (b_{j,1}, \cdots, b_{j,K+d})^T$, and $B(x_j) = (B_1(x_j), \cdots, B_{K+d}(x_j))^T$. Let $\tilde{Y}_i = Y_i - f_1(x_{i1}|\hat{\theta}_1) - \cdots - f_p(x_{ip}|\hat{\theta}_p)$, $i = 1, \cdots, n$. Then, the adaptive group lasso objective function is defined as
$$L(b_1, \cdots, b_p) = \sum_{i=1}^{n}\left(\tilde{Y}_i - \sum_{j=1}^{p} b_j^T B(x_{ij})\right)^2 + \lambda \sum_{j=1}^{p} w_j \|b_j\|, \qquad (3)$$
where $w_1, \cdots, w_p$ are weights and $\|a\| = \sqrt{a^T a}$ for any vector $a$. Although several weights can be considered, we use $w_j = 1/\|\tilde{b}_j\|$, where $\tilde{b}_j$ is the least squares estimator of $b_j$.

We interpret the proposed method as follows. For $j = 1, \cdots, p$, let $\mathcal{F}_j = \{f_j(\cdot|\theta_j) \mid \theta_j \in \mathbb{R}^{p_j}\}$. By minimizing (3), if $\hat{b}_j = 0$, the final estimator reduces to the parametric estimator, since $\hat{r}_j(\cdot, \hat{\theta}_j) = 0$. We can then conclude $f_j \in \mathcal{F}_j$, which means that the model structure of the regression function is discovered. If $\hat{r}_j(\cdot, \hat{\theta}_j) \neq 0$, we obtain the semiparametric estimator $\hat{f}_j(\cdot) = f_j(\cdot|\hat{\theta}_j) + \hat{r}_j(\cdot, \hat{\theta}_j)$, and hence we specify $f_j \notin \mathcal{F}_j$. Thus, the proposed method discovers the model structure and estimates the regression function simultaneously. Note that the parametric guided estimator has better behavior than the fully nonparametric estimator when the difference between $f_j$ and $f_j(x_j|\hat{\theta})$ is small (see Fan

et al. [6] and Gouch and Genton [10]). Accordingly, the proposed estimator $\hat{f}_j(\cdot) = f_j(\cdot|\hat{\theta}_j) + \hat{r}_j(\cdot, \hat{\theta}_j)$ will behave better than the fully nonparametric estimator even if $f_j \notin \mathcal{F}_j$.

Remark 1

The penalty $\|b_j\|$ is not differentiable at zero, and so the optimization is not straightforward. In this paper, similarly to Fan and Li [7] and Lian et al. [15], (3) is solved by the local quadratic approximation. Given the current estimate $b_j^{(0)}$, $\|b_j\|$ can be approximated by
$$\|b_j\| \approx \|b_j^{(0)}\| + \frac{1}{2}\,\frac{\|b_j\|^2 - \|b_j^{(0)}\|^2}{\|b_j^{(0)}\|}.$$
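The local quadratic approximation turns each update of (3) into a weighted ridge problem. The following is a minimal sketch of the second-step estimation, not the author's code: the function names, the equidistant-knot B-spline design, and the small numerical floors inside the weights and LQA denominators are our own assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, K=5, d=3, a=-1.0, b=1.0):
    # K + d B-spline basis functions of degree d, with equidistant interior
    # knots on [a, b] and repeated boundary knots, as in Section 2.
    knots = np.r_[[a] * d, np.linspace(a, b, K + 1), [b] * d]
    cols = [BSpline(knots, np.eye(K + d)[k], d, extrapolate=False)(x)
            for k in range(K + d)]
    return np.nan_to_num(np.column_stack(cols))

def fit_sparse_residual(y_tilde, X, lam, K=5, d=3, n_iter=50, c=1e-6):
    # Minimize the adaptive group lasso objective (3) for residual data
    # y_tilde by local quadratic approximation (Remark 1): each step
    # replaces lam*w_j*||b_j|| by a ridge term and solves normal equations.
    n, p = X.shape
    m = K + d
    D = np.hstack([bspline_design(X[:, j], K, d) for j in range(p)])
    b_ls = np.linalg.lstsq(D, y_tilde, rcond=None)[0]   # unpenalized fit
    grp = lambda v, j: v[j * m:(j + 1) * m]
    w = np.array([1.0 / max(np.linalg.norm(grp(b_ls, j)), 1e-10)
                  for j in range(p)])                   # w_j = 1/||b~_j||
    b = b_ls.copy()
    for _ in range(n_iter):
        ridge = np.concatenate([
            np.full(m, lam * w[j] / (2.0 * max(np.linalg.norm(grp(b, j)), 1e-10)))
            for j in range(p)])
        b = np.linalg.solve(D.T @ D + np.diag(ridge), D.T @ y_tilde)
    norms = np.array([np.linalg.norm(grp(b, j)) for j in range(p)])
    b[np.repeat(norms < c, m)] = 0.0                    # threshold groups to 0
    return b, norms < c
```

Components whose residual signal is negligible are driven to the zero function by the growing ridge term, which is exactly the mechanism by which the parametric model is declared correct for that component.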

After iteration, for a certain threshold $c$ (for example, $c = 10^{-6}$), if $\|b_j\| < c$, we define $\hat{b}_j = 0$.

Remark 2

In previous studies of the parametric guided method (see, e.g., Glad [9], Naito [20], Martins et al. [17], Fan et al. [6], and Fan et al. [8]), an ordinary non-sparse nonparametric smoothing technique, especially local linear smoothing, is used to estimate the residual curve. On the other hand, for additive models, Huang et al. [13] developed component selection using the adaptive group lasso. Our proposed method is a combination of the parametric guided method and the adaptive group lasso method. To the best of our knowledge, this is the first study of such a hybrid technique. The new contribution of this paper is that the model structure can be determined by using the sparse estimation method in the second step of the parametric guided method. Further, there is no restriction on the choice of the class of parametric models, unlike in Zhang et al. [28], Huang et al. [14], Lian et al. [15], and Wu and Stefanski [26].

Remark 3

When the determination of the parametric model family is difficult for some component, we can then use the zero function, i.e., $f_j(x_j|\theta) \equiv 0$. For this component, the fully nonparametric estimator is obtained. In general, the shape of

the parametric model is explored in a pilot study. After candidate parametric models are obtained, they are compared by using a model selection method such as AIC or BIC. When the prior information is somewhat weak, we investigate by using a scatter plot of the partial residual data associated with the nonparametric estimator. For the nonparametric estimator $\hat{f}_j$ $(j = 1, \cdots, p)$, we calculate $\tilde{Y}_i^{(-j)} = Y_i - \sum_{k \neq j} \hat{f}_k(x_{ik})$, which is the partial residual for the $j$th component. From the scatter plot of $\{(\tilde{Y}_i^{(-j)}, x_{ij}) : i = 1, \cdots, n\}$ or the curvature of $\hat{f}_j(x_j)$, we can guess the shape of the model $f_j(\cdot|\theta_j)$. However, this is only one part of the complete story of how to select the class of parametric models; further investigation of this problem is left for future study.

Remark 4

As another approach, we can consider the true function $f_j$ to be modeled as
$$f_j(x_j) = f_j(x_j|\theta_j) + r_j(x_j), \quad j = 1, \cdots, p,$$
where $f_j(\cdot|\theta_j)$ are the prepared models and $r_j$ are nonparametric functions. We would then estimate the parameters $\theta_1, \cdots, \theta_p$ and the nonparametric functions $r_1, \cdots, r_p$ simultaneously, or iteratively between the two components. This method was studied by Huang et al. [14] under the linear model $f_j(x_j|\theta_j) = \theta_j x_j$. Such approaches, however, incur an extensive computational cost because the iterative algorithm requires repeated sparse estimation. Hence, we adhere to the two-step parametric guided method.

3. Theoretical result

This section presents the asymptotic properties of the proposed semiparametric estimator. We give the general assumptions concerning the true functions and the error as Assumption A.

Assumption A

1. The distribution of $X$ is absolutely continuous and its density is bounded away from zero and infinity on $[a, b]^p$.

2. The true functions $f_1, \cdots, f_p$ and the variance of $\varepsilon$ are bounded.
3. The true functions $f_1, \cdots, f_p$ are $\ell$ times continuously differentiable.

Define
$$f(x|\theta) = f_1(x_1|\theta_1) + \cdots + f_p(x_p|\theta_p),$$
where $\theta = (\theta_1, \cdots, \theta_p)^T$. Let $\theta_0 = (\theta_{0,1}, \cdots, \theta_{0,p})^T$ be the minimizer of $E[(Y - f(x|\theta))^2 \mid X = x]$. The assumptions for the prepared parametric models are given as Assumption B.

Assumption B

1. For $f_j \in \mathcal{F}_j$, there exists $\theta_{0,j} \in \mathbb{R}^{p_j}$ such that $f_j(\cdot) = f_j(\cdot|\theta_{0,j})$, $j = 1, \cdots, p$.
2. There exists exactly one minimizer $\theta_0$ of $E[(Y - f(x|\theta))^2 \mid X = x]$.
3. For $j = 1, \cdots, p$, $f_j(x_j|\theta_j)$ $(x_j \in [a, b],\ \theta_j \in \mathbb{R}^{p_j})$ is bounded away from infinity.
4. For $j = 1, \cdots, p$, $f_j(x_j|\theta_j)$ is $\ell$ times continuously differentiable for $x_j \in [a, b]$, $\theta_j \in \mathbb{R}^{p_j}$.

In the second step, we use the B-spline method. The minimum required conditions for B-spline smoothing are given as Assumption C. If Assumption C fails, the efficiency of the B-spline estimator is violated.

Assumption C

1. The number of knots $K + d$ is smaller than $n$.
2. There exists a constant $c_1 > 0$ such that $c_1^{-1} K^{-1} \leq \max_j\{|\kappa_{j+1} - \kappa_j|\} \leq c_1 K^{-1}$.
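The role of $K$ under Assumption C can be illustrated numerically: with equidistant knots (so the mesh-ratio condition C.2 holds trivially), the error of the best spline fit to a smooth function should shrink as $K$ grows, at the rate $K^{-\eta}$ discussed below. The following is an illustrative sketch under those assumptions; the function names are ours, not the paper's.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, K, d=3, a=-1.0, b=1.0):
    # Degree-d B-spline basis with K + 1 equidistant knots on [a, b]
    # (constant mesh ratio, so Assumption C.2 holds with c1 = 1).
    knots = np.r_[[a] * d, np.linspace(a, b, K + 1), [b] * d]
    cols = [BSpline(knots, np.eye(K + d)[k], d, extrapolate=False)(x)
            for k in range(K + d)]
    return np.nan_to_num(np.column_stack(cols))

def sup_error(f, K, d=3, n_grid=2000):
    # Sup-norm error of a least squares spline fit to a smooth f on a fine
    # grid; for an l-smooth f it decays like K^(-eta), eta = min(l, d + 1).
    x = np.linspace(-1.0, 0.999, n_grid)
    D = bspline_design(x, K, d)
    coef = np.linalg.lstsq(D, f(x), rcond=None)[0]
    return float(np.max(np.abs(D @ coef - f(x))))

errors = [sup_error(np.cos, K) for K in (4, 8, 16)]
```

Doubling $K$ roughly divides the error by $2^{d+1}$ for an infinitely smooth target, which is the $O(K^{-\eta})$ approximation property invoked in the next section.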

We briefly outline how to investigate the asymptotic properties of the proposed estimator $\hat{f}_j(x_j) = f_j(x_j|\hat{\theta}_j) + \hat{r}_j(x_j, \hat{\theta}_j)$. Under some reasonable conditions, $\hat{\theta} - \theta_0 = O_P(n^{-1/2})$ and $f_j(x_j|\hat{\theta}_j) - f_j(x_j|\theta_{0,j}) = O_P(n^{-1/2})$ (see later). Thus, we only need to investigate the asymptotic order of $\hat{r}_j(x_j, \hat{\theta}_j)$. Let $b_0 = (b_{0,1}, \cdots, b_{0,p})^T$ be the minimizer of
$$E\left[\left(\tilde{Y}_0 - \sum_{j=1}^{p} b_j^T B(x_j)\right)^2\right],$$

where $\tilde{Y}_0 = Y - f(x|\theta_0)$. From a property of the B-spline function (see Barrow and Smith [1]), under Assumptions A–C, for $j = 1, \cdots, p$, we have $\|r_j(x_j, \theta_0) - b_{0,j}^T B(x_j)\| \leq O(K^{-\eta})$, where $\eta = \min\{\ell, d+1\}$. This $\eta$ will be important in the theorems that follow. Next, let $\hat{b}_0 = (\hat{b}_{0,1}, \cdots, \hat{b}_{0,p})^T$ be the minimizer of
$$L_0(b_1, \cdots, b_p|\lambda) = \sum_{i=1}^{n}\left(\tilde{Y}_{i0} - \sum_{j=1}^{p} b_j^T B(x_{ij})\right)^2 + \lambda \sum_{j=1}^{p} w_j \|b_j\|, \qquad (4)$$
where
$$\tilde{Y}_{i0} = Y_i - f_1(x_{i1}|\theta_{0,1}) - \cdots - f_p(x_{ip}|\theta_{0,p}), \quad i = 1, \cdots, n.$$
Then, we obtain
$$\begin{aligned}
\|r_j(x_j, \theta_0) - \hat{b}_j^T B(x_j)\| &\leq \|r_j(x_j, \theta_0) - b_{0,j}^T B(x_j)\| + \|b_{0,j}^T B(x_j) - \hat{b}_{0,j}^T B(x_j)\| + \|\hat{b}_{0,j}^T B(x_j) - \hat{b}_j^T B(x_j)\| \\
&= O(K^{-\eta}) + \|b_{0,j}^T B(x_j) - \hat{b}_{0,j}^T B(x_j)\| + \|\hat{b}_{0,j}^T B(x_j) - \hat{b}_j^T B(x_j)\|,
\end{aligned}$$
and hence it is sufficient to evaluate the asymptotic orders of $\|\hat{b}_{0,j} - b_{0,j}\|$ and $\|\hat{b}_{0,j} - \hat{b}_j\|$. In Theorem 1 below, the asymptotic order of $\|\hat{b}_{0,j} - b_{0,j}\|$ is derived. Theorem 2, the main result, shows that $\|\hat{b}_{0,j} - \hat{b}_j\|$ is asymptotically dominated by $\|\hat{b}_{0,j} - b_{0,j}\|$. As a result, the asymptotic order of $\|\hat{r}_j(x_j, \theta_{0,j}) - \hat{r}_j(x_j, \hat{\theta}_j)\|$ is the same as that of $O(K^{-\eta}) + \|b_{0,j}^T B(x_j) - \hat{b}_{0,j}^T B(x_j)\|$.

On the other hand, if $f_j \in \mathcal{F}_j$, then $b_{0,j} = 0$, and so we want to obtain the estimator $\hat{b}_j = 0$. Thus, we also show $P(\|\hat{b}_j\| > 0) \to 0$ as $n \to \infty$ for $f_j \in \mathcal{F}_j$. This property is called sparsity. To obtain the asymptotic consistency and sparsity of the estimator obtained from the adaptive group lasso, the weights need to depend on $n$, that is, $w_j = w_j(n)$. Actually, we use the norm of the least squares estimator $\|\tilde{b}_j\|$ as $w_j^{-1}$, which depends on $n$. If $w_j^{-1} \to 0$ as $n \to \infty$, we will obtain $\hat{b}_j = 0$, since the penalty constraint on the $j$th component becomes stronger. When $w_j^{-1} = O(1)$, on the other hand, the $j$th component is estimated as nonzero. Conditions on the growth of the weights, $K$, and $\lambda$ are given in Assumption D.

Assumption D

1. For $j = 1, \cdots, p$, if $f_j \in \mathcal{F}_j$, then $b_{0,j} = 0$ and there exists an increasing sequence $\{r_n\}_{n=1}^{\infty}$ such that, as $n \to \infty$,
$$r_n w_j^{-1} = O(1).$$
2. For $j = 1, \cdots, p$, if $f_j \notin \mathcal{F}_j$, there exist constants $c_2, c_{3j} > 0$ such that $\|b_{0,j}\| > c_2$ and, as $n \to \infty$, $P(w_j^{-1} \geq c_{3j}) \to 1$.
3. As $n \to \infty$,
$$\frac{\lambda K}{n r_n} = o(1), \qquad \frac{n}{K^{1/2+d} \lambda r_n} = o(1).$$

Assumptions D.1 and D.2 are natural conditions. From Stone [24], under Assumptions A–C, D.1 and D.2 hold for $w_j^{-1} = \|\tilde{b}_j\|$ $(j = 1, \cdots, p)$. Therefore, if $b_{0,j} = 0$, then $\|\tilde{b}_j\| \to 0$ is satisfied. In Theorem 1, the sparsity and consistency of $\hat{b}_{0,j}$ are shown for $j = 1, \cdots, p$.

Theorem 1. Under Assumptions A–D, for $j = 1, \cdots, p$, if $f_j \in \mathcal{F}_j$, as $n \to \infty$,
$$P(\|\hat{b}_{0,j}\| = 0) \to 1,$$
and if $f_j \notin \mathcal{F}_j$,
$$\|\hat{b}_{0,j} - b_{0,j}\|^2 = O_P(K^2/n) + O_P(K/n) + O(K^{-2\eta+1}) + O(K^2\lambda^2/n^2).$$

Next, the asymptotic order of the difference between $\hat{r}_j(x_j, \hat{\theta}_j)$ and $\hat{r}_j(x_j, \theta_{0,j})$ is derived. We then assume that $\hat{\theta}$ is a $\sqrt{n}$-consistent estimator of $\theta_0$. Using the result of White [25], this assumption holds under some weak conditions. Furthermore, for $j = 1, \cdots, p$, we obtain $f_j(x_j|\hat{\theta}_j) - f_j(x_j|\theta_{0,j}) = \gamma_{j,n}(x_j)$, where $\gamma_{j,n}(x_j) = O_P(n^{-1/2})$ for any $x_j \in [a, b]$. By combining Theorem 1 and the asymptotics for the parametric estimator, we obtain the following theorem.

Theorem 2. Suppose $\hat{\theta} - \theta_0 = O_P(n^{-1/2})$. Under Assumptions A–D, for $j = 1, \cdots, p$, if $f_j \in \mathcal{F}_j$, as $n \to \infty$,
$$P(\|\hat{b}_j\| = 0) \to 1,$$

and if $f_j \notin \mathcal{F}_j$,
$$\|\hat{b}_{0,j} - b_{0,j}\|^2 = O_P(K^2/n) + O_P(K/n) + O(K^{-2\eta+1}) + O(K^2\lambda^2/n^2).$$
Furthermore, for $f_j \notin \mathcal{F}_j$ $(j = 1, \cdots, p)$,
$$\|\hat{r}_j(x_j, \hat{\theta}_j) - r_j(x_j, \theta_{0,j})\|^2 = O_P(K/n) + O_P(1/n) + O(K^{-2\eta}) + O(K\lambda^2/n^2).$$

From the result of Theorem 2, the asymptotics for the final estimator are obtained as follows.

Corollary 1. Under the same assumptions as Theorem 2, for $j = 1, \cdots, p$, if $f_j \notin \mathcal{F}_j$,
$$\|\hat{f}_j(x_j) - f_j(x_j)\|^2 = O_P(K/n) + O_P(1/n) + O(K^{-2\eta}) + O(K\lambda^2/n^2).$$

The proof of Corollary 1 is easy, since for $f_j \notin \mathcal{F}_j$,
$$\|\hat{f}_j(x_j) - f_j(x_j)\|^2 \leq 2\|f_j(x_j|\hat{\theta}_j) - f_j(x_j|\theta_{0,j})\|^2 + 2\|\hat{r}_j(x_j, \hat{\theta}_j) - r_j(x_j, \theta_{0,j})\|^2.$$
Thus, the asymptotic order of the final estimator is dominated by that of the nonparametric estimator $\hat{r}_j(x_j, \hat{\theta}_j)$. When the final estimator is the same as the parametric estimator, $\sqrt{n}$-consistency holds. We next show the optimal rate of convergence of the proposed estimator. If $\hat{b}_j = 0$, the optimal order of the estimator is $O_P(n^{-1})$, and so we are interested only in the case $f_j \notin \mathcal{F}_j$. When we use the least squares estimator $\|\tilde{b}_j\|^{-1}$ as $w_j$, from Horowitz and Mammen [11], $r_n = O(\|\tilde{b}_j\|^{-1}) = O(n^{\eta/(2\eta+1)})$ for $K = O(n^{1/(2\eta+1)})$ under Assumptions A–C. Consequently, we obtain the following corollary.

Corollary 2. Suppose the same assumptions as Theorem 2. Furthermore, assume $K = O(n^{1/(2\eta+1)})$, $r_n = O(n^{\eta/(2\eta+1)})$, and $\lambda = O(n^{\alpha})$ with $1/(4\eta+2) < \alpha < 1/2$. Then, for $j = 1, \cdots, p$, if $f_j \notin \mathcal{F}_j$,
$$E[\{\hat{f}_j(x_j) - f_j(x_j)\}^2] = O_P\left(n^{-2\eta/(2\eta+1)}\right).$$

Remark 5

g (η) (x) c(x), η!K η

CR IP T

and the B-spline model can be written as

where g (η) is the ηth derivative of g and c(x) is a bounded function. For the 220

nonparametric estimator, g is the true regression function fj . On the other hand, for the parametric guided estimator, g(x) = fj (x) − fj (x|θ 0,j ). Thus, if (η)

(η)

(η)

AN US

|fj (x) − fj (x|θ 0,j )| < fj (x) for x ∈ [a, b], bias reduction is achieved although the asymptotic variance is unchanged. This point is discussed in related articles, including Glad [9], Fan et al. [6], and Gouch and Genton [10]. Conse225

quently, if fj ∈ Fj , the nonparametric component vanishes. Even if fj 6∈ Fj , the asymptotic bias of the proposed estimator is smaller than that of the fully (η)

(η)

(η)

4. Numerical study

M

nonparametric estimator when |fj (x) − fj (x|θ 0,j )| < fj (x).

ED

4.1. Simulation

In this section, we demonstrate the numerical performance of the proposed

PT

estimator in terms of its estimation accuracy. The generating model is the

CE

additive model:

230

yi = f1 (xi1 ) + · · · + f4 (xi4 ) + εi , i = 1, · · · , n.

The error is independently generated from N (0, σ 2 ). The predictors x1 , · · · , xn

AC

are independently distributed as N4 (0, Σ), where Σ = (ρ|i−j| )ij . When ρ = 0,

Σ is the identity matrix. After xj1 , · · · , xjn are generated, they are modified as xij / maxk=1,··· ,n {|xkj |}. That is, all predictors are within [−1, 1] for j = 1, · · · , 4. The true regression functions are defined as f1 (t) = 2 sin(2πt) + γ1 t, f2 (t) = 2

5t + γ2 t, f3 (t) = 3t2 + γ3 t, and f4 (t) = e−t + γ4 t, where γ1 , · · · , γ4 ∈ R. 2

13

ACCEPTED MANUSCRIPT
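The simulation design above can be sketched as follows. This is our own illustrative reconstruction, not the author's code; in particular, the quadratic term in $f_2$ and the exponent in $f_4$ are reconstructions of superscripts that are garbled in the source.

```python
import numpy as np

def simulate_dataset(n, rho, sigma, gamma, rng):
    # One dataset from the Section 4.1 design: AR(1)-type covariance for the
    # 4 predictors, rescaling into [-1, 1], and centered additive components.
    idx = np.arange(4)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
    X = X / np.abs(X).max(axis=0)              # each column scaled into [-1, 1]
    f = [lambda t: 2.0 * np.sin(2.0 * np.pi * t) + gamma[0] * t,
         lambda t: 5.0 * t ** 2 + gamma[1] * t,   # reconstructed superscript
         lambda t: 3.0 * t ** 2 + gamma[2] * t,
         lambda t: np.exp(-t ** 2) + gamma[3] * t]  # reconstructed superscript
    # Center each component with its sample mean, mimicking E[f_j(X_j)] = 0.
    comps = [fj(X[:, j]) - fj(X[:, j]).mean() for j, fj in enumerate(f)]
    y = sum(comps) + sigma * rng.standard_normal(n)
    return X, y
```

Setting $\gamma = (0, 0, 0, 0)$ makes every component lie inside its prepared parametric class, while large $\gamma_j$ introduce a linear distortion the guides cannot capture.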

Each function is modified so that $E[f_j(X_j)] = 0$. As the parametric models in the first-step estimation, we prepare $f_1(t|\theta_1) = \theta_{11}\sin(2\pi t)$, $f_2(t|\theta_2) = \theta_{21}t^2$, $f_3(t|\theta_3) = \theta_{31}t^2$, and $f_4(t|\theta_4) = \theta_{41}e^{-t^2}$. Thus, if $\gamma_1 = \cdots = \gamma_4 = 0$, the true function is included in the class of parametric models for all components. For a given prepared parametric model, the parameter $\theta$ is estimated by the least squares method. In the second-step estimation, we use cubic B-splines with $K = 5$ equidistant knots for each component. The smoothing parameter $\lambda$ is selected via 5-fold cross-validation. To evaluate the proposed method in terms of structure selection, we summarize the results in terms of whether each nonparametric component becomes 0. If the nonparametric estimator is 0, the prepared parametric model can be regarded as the correct model. Thus, we count the number of times that the parametric model is identified for each additive component. Next, the estimation accuracy is evaluated by the sample version of the mean integrated squared error (MISE):
$$\mathrm{MISE} = \frac{1}{R}\sum_{r=1}^{R}\frac{1}{J}\sum_{j=1}^{J}\{\hat{f}_r(x_j) - f(x_j)\}^2,$$
where $R$ is the number of replications, $\hat{f}_r$ is the estimator from the $r$th replication, and $x_1, \cdots, x_J$ are test data generated from a 4-dimensional uniform distribution on $[-1, 1]^4$. In the above two studies, we set $J = 100$ and $R = 500$.

PT

The sample sizes were n = 50, 100, and 200. The proposed semiparametric estimator, the parametric guided estimator with the adaptive group lasso, is 240

denoted SE-AGL. For comparison, we also calculate the MISE of the semipara-

CE

metric estimator with the group lasso (SE-GL), the semiparametric estimator with no penalty (SE-NP), the semiparametric estimator with an L2 penalty

AC

(SE-L2 ), the fully nonparametric estimator with no penalty (NE-NP), the fully nonparametric estimator with an L2 penalty (NE-L2 ), and the fully parametric

245

estimator (PE). If the weights are wj ≡ 1, SE-AGL reduces to SE-GL. Otherwise if λ = 0, SE-AGL and SE-NP are the same. NE-NP and NE-L2 are

constructed by the nonparametric B-spline method. In other words, if all parametric models are fj (xj |θ j ) ≡ 0, SE-NP and SE-L2 14

ACCEPTED MANUSCRIPT

are similar to NE-NP and NE-L2 , respectively. For the L2 penalty, we use the 250

second-order difference penalty proposed by Eilers and Marx [4]. The estimation algorithm of SE-L2 and NE-L2 follows the method of Marx and Eilers [18]. If

CR IP T

γ = (γ1 , · · · , γ4 ) = (0, 0, 0, 0), PE is the estimator of the correct model and so its performance should be better than those of other estimators.

In Table 1, for each additive component, the ratio of the number of times that 255

the nonparametric estimator is the zero function to the number of repetitions (Ratio) is given under several conditions. When γ = (0, 0, 0, 0), the parametric model assumption is correct. From the results, Ratio was near 1 even with

AN US

(ρ, σ) = (0.5, 1). For γ = (0.5, 0.5, 0, 0), only f3 (·|θ 3 ) and f4 (·|θ 4 ) are correct models. Therefore, Ratio was high for f3 and f4 compared with f1 and f2 . 260

However, the proposed method frequently determined that the nonparametric estimates were not needed for f1 and f2 . Similarly, for γ = (0.5, 0.5, 0.5, 0.5), Ratio indicated that the difference between the true function and the model was ignored many times. Intuitively, there is a gap between fj (·) and fj (·|θ j )

265

M

when γj = 5. In fact, for γ = (5, 5, 1, 1) and (5,5,5,5), Ratio is quite small. Thus, we confirmed that the nonparametric components are estimated as the

ED

zero function if fj (·) ≈ fj (·|θ j ). Otherwise if fj (·) 6≈ fj (·|θ j ), the proposed method indicated that the nonparametric components were needed to obtain the final estimator.

270

PT

Table 2 shows the sample MISE values of all the estimators under several conditions. In all cases, the MISE become small as n increases. We found that all

CE

semiparametric and nonparametric estimators had consistency and so the performances of these estimators were basically good. Even when (ρ, σ) increased,

the performances of the estimators were sufficiently good. For γ = (0, 0, 0, 0),

AC

the behaviors of SE-AGP, SE-GP, and PE were better than those of the other

275

estimators. In this case, the model assumption was completely correct and so the estimators of SE-AGP and SE-GP reduced to PE in almost every replication. Thus, we could confirm the efficiency of the proposed method. The rate of convergence of PE is faster than the nonparametric estimator in the correct model case. Hence, the result that the MISE of PE is smallest is not surprising. 15

ACCEPTED MANUSCRIPT

Table 1: Ratio for each setting.

                 (γ1, γ2, γ3, γ4) = (0, 0, 0, 0)      (γ1, γ2, γ3, γ4) = (0, 0, 0.5, 0.5)
(ρ, σ)        (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)    (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)
n = 50   f1    0.995   0.944    0.863   0.845       0.998   0.933    0.836   0.816
         f2    0.964   0.953    0.854   0.835       0.943   0.895    0.812   0.795
         f3    0.992   0.983    0.834   0.827       0.871   0.820    0.723   0.717
         f4    0.994   0.963    0.946   0.952       0.914   0.878    0.738   0.692
n = 100  f1    1.000   1.000    0.925   0.921       0.963   0.942    0.892   0.851
         f2    1.000   0.993    0.934   0.873       0.944   0.931    0.873   0.812
         f3    1.000   0.994    0.969   0.928       0.908   0.858    0.647   0.682
         f4    1.000   1.000    0.957   0.915       0.902   0.825    0.659   0.646
n = 200  f1    1.000   1.000    0.945   0.932       0.966   0.952    0.912   0.872
         f2    1.000   0.993    0.964   0.821       0.953   0.955    0.926   0.861
         f3    1.000   1.000    0.969   0.935       0.928   0.847    0.683   0.656
         f4    1.000   1.000    0.962   0.953       0.901   0.853    0.688   0.632

                 (γ1, γ2, γ3, γ4) = (0.5, 0.5, 0.5, 0.5)   (γ1, γ2, γ3, γ4) = (1, 1, 1, 1)
(ρ, σ)        (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)    (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)
n = 50   f1    0.634   0.626    0.542   0.536       0.231   0.193    0.128   0.081
         f2    0.645   0.598    0.576   0.523       0.192   0.152    0.114   0.102
         f3    0.672   0.623    0.538   0.518       0.151   0.247    0.091   0.075
         f4    0.716   0.701    0.612   0.602       0.266   0.212    0.097   0.071
n = 100  f1    0.512   0.568    0.411   0.433       0.082   0.081    0.027   0.012
         f2    0.523   0.493    0.394   0.387       0.093   0.083    0.014   0.021
         f3    0.496   0.458    0.458   0.416       0.068   0.052    0.002   0.014
         f4    0.532   0.516    0.492   0.439       0.103   0.098    0.023   0.012
n = 200  f1    0.494   0.517    0.354   0.335       0.009   0.012    0.010   0.000
         f2    0.504   0.482    0.327   0.312       0.012   0.023    0.022   0.021
         f3    0.476   0.451    0.368   0.333       0.011   0.016    0.005   0.001
         f4    0.512   0.507    0.416   0.375       0.014   0.021    0.012   0.000

                 (γ1, γ2, γ3, γ4) = (5, 5, 1, 1)      (γ1, γ2, γ3, γ4) = (5, 5, 5, 5)
(ρ, σ)        (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)    (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)
n = 50   f1    0.002   0.008    0.004   0.005       0.001   0.002    0.000   0.001
         f2    0.011   0.008    0.008   0.004       0.005   0.003    0.002   0.002
         f3    0.011   0.013    0.021   0.021       0.002   0.022    0.000   0.000
         f4    0.012   0.016    0.018   0.026       0.002   0.002    0.001   0.000
n = 100  f1    0.002   0.006    0.002   0.004       0.002   0.001    0.000   0.000
         f2    0.008   0.005    0.008   0.004       0.001   0.001    0.000   0.000
         f3    0.007   0.010    0.016   0.015       0.000   0.002    0.000   0.000
         f4    0.009   0.011    0.016   0.019       0.000   0.000    0.000   0.000
n = 200  f1    0.002   0.005    0.003   0.007       0.001   0.001    0.000   0.000
         f2    0.007   0.006    0.008   0.004       0.000   0.001    0.000   0.000
         f3    0.007   0.009    0.012   0.020       0.000   0.000    0.001   0.001
         f4    0.008   0.010    0.009   0.016       0.000   0.000    0.000   0.000

Table 2: MISE (×100) of estimators.

                 (γ1, γ2, γ3, γ4) = (0, 0, 0, 0)      (γ1, γ2, γ3, γ4) = (0.5, 0.5, 0, 0)
(ρ, σ)          (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)   (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)
n = 50   SE-AGL  4.544   4.663    18.77   18.78      4.723   5.123    22.15   22.43
         SE-GL   4.551   4.671    18.79   18.81      4.768   5.201    22.17   22.52
         SE-NP   5.237   5.462    21.24   22.42      5.141   6.512    24.24   24.62
         SE-L2   4.813   5.001    23.12   20.15      5.021   5.822    23.56   23.41
         NE-NP   7.341   7.527    30.41   31.14      7.261   8.132    32.34   31.82
         NE-L2   7.124   6.891    28.46   28.31      6.621   6.871    29.51   29.43
         PE      4.531   4.661    18.71   18.72      4.763   5.131    22.21   22.44
n = 100  SE-AGL  0.624   0.631    7.352   7.435      0.713   0.726    1.032   1.102
         SE-GL   0.648   0.651    7.514   7.601      0.721   0.763    1.051   1.115
         SE-NP   0.741   0.783    8.125   8.461      0.778   0.792    3.775   5.443
         SE-L2   0.641   0.614    0.742   0.762      0.735   0.746    1.101   1.162
         NE-NP   0.952   0.992    10.51   10.78      0.931   0.956    1.277   1.444
         NE-L2   0.767   0.725    7.642   7.562      0.831   0.872    1.163   1.212
         PE      0.624   0.631    7.415   7.412      0.741   0.771    1.113   1.176
n = 200  SE-AGL  0.315   0.363    3.572   3.535      0.426   0.425    0.526   0.542
         SE-GL   0.331   0.384    3.601   3.566      0.427   0.429    0.532   0.542
         SE-NP   0.342   0.365    3.612   3.613      0.433   0.435    0.541   0.547
         SE-L2   0.333   0.357    3.593   3.583      0.433   0.428    0.527   0.541
         NE-NP   0.352   0.394    3.652   3.664      0.472   0.461    0.553   0.561
         NE-L2   0.350   0.382    3.636   3.664      0.457   0.442    0.544   0.542
         PE      0.314   0.363    3.573   3.531      0.421   0.418    0.520   0.535

                 (γ1, γ2, γ3, γ4) = (0.5, 0.5, 0.5, 0.5)   (γ1, γ2, γ3, γ4) = (1, 1, 1, 1)
(ρ, σ)          (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)   (0,0.5) (0.5,0.5)  (0,1)  (0.5,1)
n = 50   SE-AGL  4.786   4.738    19.94   20.56      4.931   5.562    23.56   23.62
         SE-GL   4.784   4.741    19.96   20.55      4.937   5.572    23.62   23.71
         SE-NP   4.913   4.826    20.15   20.92      5.146   5.637    23.91   24.32
         SE-L2   4.521   4.745    19.75   20.46      5.024   5.552    23.52   23.92
         NE-NP   4.832   4.831    20.35   21.01      5.172   5.663    23.74   24.05
         NE-L2   4.522   4.751    19.97   21.02      5.041   5.613    23.57   23.82
         PE      6.352   8.416    31.25   32.25      84.05   84.26    122.2   122.8
n = 100  SE-AGL  0.643   0.675    9.256   12.46      0.747   0.812    12.25   12.93
         SE-GL   0.677   0.673    10.01   13.15      0.751   0.815    12.12   12.81
         SE-NP   0.644   0.681    9.325   13.72      0.834   0.891    12.31   14.03
         SE-L2   0.641   0.661    9.253   12.31      0.746   0.801    12.21   12.83
         NE-NP   0.672   0.713    9.672   13.95      0.851   0.931    13.21   14.21
         NE-L2   0.652   0.691    9.421   12.62      0.792   0.812    12.22   12.87
         PE      2.251   2.302    16.31   20.13      6.525   87.36    115.4   112.5
n = 200  SE-AGL  0.342   0.345    3.531   4.356      0.341   0.372    4.011   4.423
         SE-GL   0.361   0.352    3.601   4.418      0.354   0.391    4.101   4.491
         SE-NP   0.385   0.394    3.667   4.492      0.407   0.426    4.182   4.602
         SE-L2   0.362   0.344    3.512   4.343      0.345   0.373    4.095   4.503
         NE-NP   0.421   0.425    3.932   4.636      0.441   0.494    4.418   4.712
         NE-L2   0.381   0.393    3.771   4.531      0.409   0.451    4.215   4.611
         PE      2.531   2.321    17.12   20.13      6.613   89.21    119.4   120.6

Table 3: Ratio for LAND, SRP, ASR, and the proposed method (SPG).

                   LAND              SRP               ASR               SPG
(ρ, σ)        (0,0.5) (0.5,1)   (0,0.5) (0.5,1)   (0,0.5) (0.5,1)   (0,0.5) (0.5,1)
n = 50   f1    0.924   0.763     0.941   0.787     0.951   0.764     0.986   0.854
         f2    0.913   0.712     0.912   0.756     0.935   0.735     0.951   0.836
         f3    0.423   0.435     0.321   0.335     0.352   0.356     0.343   0.313
         f4    0.000   0.002     0.000   0.001     0.000   0.000     0.000   0.000
n = 100  f1    0.985   0.883     0.988   0.857     0.984   0.894     0.995   0.864
         f2    0.975   0.881     0.983   0.841     0.977   0.915     0.991   0.876
         f3    0.351   0.321     0.271   0.267     0.235   0.314     0.225   0.281
         f4    0.000   0.002     0.000   0.001     0.000   0.000     0.000   0.000
n = 200  f1    0.995   0.893     1.000   0.932     1.000   0.932     1.000   0.942
         f2    0.994   0.897     1.000   0.941     1.000   0.934     1.000   0.941
         f3    0.214   0.251     0.136   0.253     0.202   0.256     0.198   0.212
         f4    0.000   0.002     0.000   0.001     0.000   0.000     0.000   0.000

For γ = (0.5, 0.5, 0.5, 0.5), the MISE values of SE-AGL and SE-GL are smaller than that of PE for all combinations of (n, ρ, σ). This indicates that the model assumption was not correct in some replications and that the nonparametric components captured the difference between the true function and the prepared model for each additive component. For γ = (1, 1, 1, 1), the performance of PE was inferior to those of the other semiparametric and nonparametric estimators. On the other hand, the MISE of the proposed estimator SE-AGL is near that of the other nonparametric estimators. Thus, we found that the performance of the proposed estimator is guaranteed to be on the same level as that of a fully nonparametric estimator even if the model assumption is not correct.

Finally, we compare the proposed method with other structure discovery methods, namely those of Zhang et al. [28] (LAND: linear and nonlinear discovery), Huang et al. [14] (SRP: semiparametric regression pursuit), and Wu and Stefanski [26] (ASR: automatic structure recovery). All of the above methods consider only the linear model fj(xj|θj) = θj xj. Therefore, we investigate the classification performance of linear structure discovery for our method and these related methods. The true regression functions are redefined as f1(t) = 2t, f2(t) = t, f3(t) = exp{(t − 1)/2}, and f4(t) = sin(2πt). The parametric model is set as the linear function fj(xj|θj) = θj xj (j = 1, 2, 3, 4). Thus, fj ∈ Fj (j = 1, 2), f4 ∉ F4, and f3 ∉ F3 but f3 ≈ f3(·|θ3). As in Table 1, we calculate Ratio for the methods LAND, SRP, and ASR and for the proposed method (SPG: sparse parametric guided). For f1 and f2, the true model is included in the linear model, and so Ratio should be high. For f3 and f4, on the other hand, the value of Ratio is expected to be low. Table 3 shows the Ratio values for all estimators. We found that all the estimators discovered the linear structure for f1 and f2 in many replications. For f4 as well, the high accuracy of the methods was confirmed. For f3, Ratio is somewhat high since f3 is nearly linear. However, as n increases, Ratio becomes smaller. Although the difference is slight, the proposed estimator has better or similar performance compared with the other methods.

4.2. Application

We apply the proposed method to the analysis of a men's decathlon dataset that is available in the R package scar of Chen and Samworth [3]. This dataset includes data on male decathlon athletes who scored at least 6500 points in at least one athletic competition in 2012 and obtained points in every event of that competition. The sample size is n = 614. The purpose of the analysis is to investigate the relationship between the results of different events. For the analysis in this paper, we use only four variables. The response Y is the points scored in the javelin throw. The predictors are x1, the points scored in the shot put; x2, the points scored in the discus throw; and x3, the points scored in pole vaulting. These predictors have the three highest correlations with Y among all variables. To obtain numerical stability, all variables were rescaled as xij = xij / maxi{xij} (j = 1, 2, 3).

First, we select the class of parametric models for each component on the basis of a pilot study. In Figure 1, the fully nonparametric estimator of each additive function is shown. The estimators were constructed via the penalized spline method (see Marx and Eilers [18]). Using this information, we set f1(t|θ11) = θ11 t, f2(t|θ21, θ22, θ23, θ24) = θ21 t + θ22 t^2 + θ23 t^3 + θ24 t^4, and f3(t|θ31) = θ31 exp[t^2].

[Figure 1: Fully nonparametric estimators of f1 (left), f2 (middle), and f3 (right).]

From the least squares method, we obtained θ̂ = (θ̂11, θ̂21, θ̂22, θ̂23, θ̂24, θ̂31) = (1.12, −4.15, 1.74, 7.32, −5.96, 0.12). Figure 2 shows the structure of the parametric estimator for each component. We found that the structures of the fully nonparametric estimator and the parametric estimator are similar for f1 and f3. For f2, the curves of the two estimators are similar for x2 > 0.7, but clearly different for x2 < 0.5.

[Figure 2: Parametric estimators of f1, f2, and f3. The left, middle, and right panels are for the least squares estimators of f1(t|θ11), f2(t|θ21, θ22, θ23, θ24), and f3(t|θ31).]

Using the parametric guided method with the adaptive group lasso, we check whether the model assumption is valid. In the second step of the proposed method, the degree of the B-spline bases was d = 3 (cubic spline), the number of knots was K = 5, and the smoothing parameter λ was determined by 5-fold cross-validation. As a result, for f1 and f3, the estimator of the nonparametric component was estimated as the zero function. That is, the models f1(x1|θ11) = θ11 x1 and f3(x3|θ31) = θ31 exp[x_3^2] were judged to be correct models. On the other hand, the nonparametric component for f2 did not vanish; in fact, we obtained ||b̂2|| = 4.122. Figure 3 shows the proposed estimator f̂2(x2) = f2(x2|θ̂2) + r̂2(x2, θ̂2). From the left panel, we see that r̂2(x2, θ̂2) is not the zero function and that the curve changes rapidly for x2 < 0.5. This indicates that the parametric component could not capture the true curve in this region and that the nonparametric component modified the parametric curve. In the right panel, the proposed estimator and the fully nonparametric estimator are plotted. The parametric guided estimator has less curvature than the nonparametric estimator for 0.35 < x2 < 0.5.

[Figure 3: Estimators of the second component f2. Left: proposed estimator f̂2(x2) (solid), parametric estimator f2(x2|θ̂2) (dashed), and nonparametric estimator of the residual r̂2(x2, θ̂2) (dot-dashed). Right: proposed estimator (solid) and fully nonparametric estimator (dashed).]

To evaluate the accuracy of the estimator, we calculated the prediction error

    PEr = E[(Y_test − f̂(x_test))^2],

where (Y_test, x_test) denotes the test data, which are independent of the training data used to construct f̂(x). Specifically, PEr was approximated via its sample version in the context of 5-fold validation with 50 replications. The PEr (standard deviation) of the proposed estimator was 0.0783 (0.021). We also calculated the PEr (standard deviation) of the parametric estimator and the nonparametric estimator, and obtained 0.0963 (0.023) and 0.0812 (0.025), respectively. Thus, we found that the proposed estimator is more useful than the fully parametric estimator and the fully nonparametric estimator from the prediction error point of view.
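The K-fold approximation of PEr described above can be sketched with a generic routine. A minimal Python/NumPy sketch with a stand-in ordinary least squares fit (the paper's semiparametric estimator would take its place; function names are hypothetical):

```python
import numpy as np

def cv_prediction_error(x, y, fit, predict, n_folds=5, seed=0):
    """Approximate PEr = E[(Y_test - f_hat(x_test))^2] by K-fold
    cross-validation: fit on K-1 folds, score squared error on the rest."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(x[train], y[train])
        errs.append(np.mean((y[test] - predict(model, x[test])) ** 2))
    return float(np.mean(errs))

# Stand-in estimator: ordinary least squares with an intercept.
def ols_fit(x, y):
    return np.linalg.lstsq(np.column_stack([np.ones(len(x)), x]), y, rcond=None)[0]

def ols_predict(beta, x):
    return np.column_stack([np.ones(len(x)), x]) @ beta
```

In the paper PEr is additionally averaged over 50 random splits; repeating the call with different seeds and averaging the results mimics that.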

5. Future extensions

We now describe some future problems related to this study. In this paper, we considered an additive model for which the proposed estimator discovers some parametric components. Thus, the model can be written as

    f(x) = Σ_{j=1}^{q} f_j(x_j|θ_j) + Σ_{j=q+1}^{p} f_j(x_j),

where q < p. One extension is sparse additive models of the form

    f(x) = Σ_{j=1}^{p1} f_j(x_j|θ_j) + Σ_{j=p1+1}^{p1+p2} f_j(x_j) + Σ_{j=p1+p2+1}^{p} 0(x_j),    (5)

where p1 + p2 < p and 0(x_j) is the zero function. To cover this case, a parametric model should be constructed so that f_j(x_j|θ_j) = 0 for some θ_j. For example, basis function models can be used. That is, we assume that the parametric model is

    f_j(x_j|θ_j) = θ_j^T φ_j(x_j),

where φ_j(x_j) = (φ_{1,j}(x_j), ..., φ_{pj,j}(x_j))^T and φ_{k,j} : R → R (k = 1, ..., pj). If the parameter θ = (θ_1, ..., θ_p) is estimated via a sparse method, such as the group lasso or the adaptive group lasso, in the first-step estimation, this extension will be achieved. Here, the adaptive group lasso estimator θ̂ is defined as the minimizer of

    Σ_{i=1}^{n} (Y_i − f(x_i|θ))^2 + λ1 Σ_{j=1}^{p} u_j √(θ_j^T θ_j),

where λ1 is the smoothing parameter and u1, ..., up are weights. When all weights are 1, the estimator reduces to the group lasso estimator. Under some reasonable conditions, the sparsity and consistency of the parametric and final estimators will be guaranteed.

A second extension is the high-dimensional case (p > n). In the high-dimensional setting, it is very difficult to construct the class of parametric models; thus, this problem is not easy, despite being interesting. As one option, after the important variables have been chosen by sure independence screening (Fan et al. [5]) or model-free variable screening (Zhu et al. [30]), we can estimate the regression functions in a p < n setting.
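The adaptive group lasso criterion above can be minimized by proximal gradient descent with blockwise soft-thresholding. A minimal Python/NumPy sketch, not the paper's implementation: the design blocks X_j stand in for the basis matrices of the groups, and the solver, step size, and tuning values are illustrative only.

```python
import numpy as np

def group_soft_threshold(v, t):
    # Block soft-thresholding: shrinks the whole coefficient block toward 0,
    # setting it exactly to zero when its norm falls below the threshold t.
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= t else (1.0 - t / nv) * v

def adaptive_group_lasso(y, blocks, lam, weights, n_iter=500):
    # Proximal gradient (ISTA) for
    #   min_b ||y - sum_j X_j b_j||^2 + lam * sum_j w_j ||b_j||.
    X = np.hstack(blocks)
    idx = np.cumsum([0] + [B.shape[1] for B in blocks])
    b = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X.T @ X, 2))  # 1 / Lipschitz constant
    for _ in range(n_iter):
        z = b - step * 2.0 * X.T @ (X @ b - y)        # gradient step
        for j, w in enumerate(weights):
            b[idx[j]:idx[j + 1]] = group_soft_threshold(
                z[idx[j]:idx[j + 1]], step * lam * w)
    return [b[idx[j]:idx[j + 1]] for j in range(len(blocks))]
```

Taking all weights w_j = 1 gives the plain group lasso; adaptive weights are typically the reciprocal norms of an initial consistent estimate, which is what lets irrelevant groups be thresholded exactly to zero.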

Thirdly, we can add a smoothness penalty in the second-step estimation. That is, the loss function (3) is improved as

L(b1 , · · · , bp ) =

n X i=1

Y˜i −

n X

M





j=1

p X j=1

2

bTj B(xij )

wj ||bj || + λ2

p X

vj bTj Qbj ,

(6)

j=1

ED

where λ2 is the smoothing parameter, v1, ..., vp are weights, and Q is the second-order difference matrix (see Eilers and Marx [4]). According to existing studies on B-spline smoothing, the performance of the minimizer of (6) is better than that of (3). However, it is difficult to investigate the theoretical properties of an estimator obtained via (6). For further study, the development of the sparsity and the rate of convergence of the estimator obtained from (6) would be interesting. Also, extensions to other models, such as generalized linear models, varying coefficient models, and quantile regression models, are important problems.
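The second-order difference penalty in (6) is easy to construct explicitly. A small Python/NumPy sketch following the P-spline construction of Eilers and Marx (the size m = 5 is arbitrary):

```python
import numpy as np

def second_order_penalty(m):
    """Build Q = D2^T D2, where D2 is the second-order difference matrix
    acting on m B-spline coefficients; b^T Q b equals the sum of squared
    second differences of b, so linear coefficient sequences are unpenalised."""
    D2 = np.diff(np.eye(m), n=2, axis=0)  # (m-2) x m, rows like [1, -2, 1, 0, ...]
    return D2.T @ D2

Q = second_order_penalty(5)
b_lin = np.arange(5.0)          # linear in the index: zero roughness penalty
b_quad = np.arange(5.0) ** 2    # quadratic: positive penalty
print(b_lin @ Q @ b_lin, b_quad @ Q @ b_quad)  # → 0.0 12.0
```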

Appendix

We give the proofs of the theorems of this paper. First, for convenience, we introduce some notation. Let Ỹ^0 = (Ỹ_1^0, ..., Ỹ_n^0)^T with Ỹ_i^0 = Y_i − f(x_i|θ_0), let Z_j = (B_k(x_{ij}))_{ik} for j = 1, ..., p, and let Z = [Z_1, ..., Z_p]. The loss function (4) can be written as

    L_0(b|λ) = ||Ỹ^0 − Zb||^2 + λ Σ_{j=1}^{p} w_j ||b_j||,

where b = (b_1^T, ..., b_p^T)^T. Define A = {j ∈ {1, ..., p} | f_j(·) ≠ f_j(·|θ_{0,j})} and the n × |A|(K + d) matrix Z_A = (Z_j, j ∈ A), where |A| is the cardinality of A. To show Theorems 1 and 2, we will use the following lemma.

Lemma 1. The asymptotic order of the maximum eigenvalue of the matrix Z_A^T Z_A is O(nK^{-1}). Furthermore, that of (Z_A^T Z_A)^{-1} is O(n^{-1}K).

The proof of Lemma 1 is described in Lemma 3 of Huang et al. [13].

Outline of proof of Theorem 1. The proof of Theorem 1 is the same as that for the nonparametric estimator under the assumption that the true function is r_j(x_j, θ_{0,j}) = f_j(x_j) − f_j(x_j|θ_{0,j}) (j = 1, ..., p). Then, if f_j ∈ F_j, r_j(x_j, θ_{0,j}) = 0. Thus, for f_j ∈ F_j, the sparsity property is obtained. For f_j ∉ F_j, the rate of convergence of the B-spline estimator can be shown as in the proofs of Theorems 3 and 4 of Huang et al. [13], and so the details are omitted.

Proof of Theorem 2. To show Theorem 2, it is sufficient to show

    P(||b̂_j|| > 0) → 0,  j ∉ A,

and to evaluate the asymptotic order of ||b̂_j − b̂_{0,j}|| for j ∈ A as n → ∞. Let Ỹ = (Ỹ_1, ..., Ỹ_n)^T and let γ_n = (Σ_{j=1}^{p} γ_{j,n}(x_{ij}), i = 1, ..., n)^T. Then Ỹ_i can be written as

    Ỹ_i = Y_i − Σ_{j=1}^{p} f_j(x_{ij}|θ̂_j)
        = Ỹ_i^0 − Σ_{j=1}^{p} ( f_j(x_{ij}|θ̂_j) − f_j(x_{ij}|θ_{0,j}) )
        = Ỹ_i^0 − Σ_{j=1}^{p} γ_{j,n}(x_{ij}).

Thus, we have Ỹ = Ỹ^0 − γ_n and γ_n = O_P(n^{-1/2}). Therefore, (3) is asymptotically equivalent to L_0(b|λ), and the first assertion of Theorem 2 is obvious from the sparsity property of Theorem 1. On the other hand, the second assertion is not obvious, and we should check whether the influence of γ_n is negligible. Let T_j be the (K + d) × |A|(K + d) matrix of the form T_j = (O, O, ..., I, O, ..., O), where O is a (K + d) square matrix of zeros and I is the (K + d) identity matrix. From the proof of Theorem 3 of Huang et al. [13], for j ∈ A,

    b̂_j − b̂_{0,j} = T_j (Z_A^T Z_A)^{-1} Z_A^T γ_n − λ w_j T_j (Z_A^T Z_A)^{-1} ( b̂_j/||b̂_j|| − b̂_{0,j}/||b̂_{0,j}|| ).

From Lemma 1, we have

    ||T_j (Z_A^T Z_A)^{-1} Z_A^T γ_n||^2 = O(K^2/n^2) ||Z_A^T γ_n||^2 = O(K^2/n^2) O(n/K) O(n^{-1}) = O(K/n^2).

Hence,

PT

ˆj − b ˆ0,j ||2 ≤ O(K/n2 ) + OP (λ2 K 2 /n2 ) ||b

and this yields that

AC

CE

ˆj − b0,j ||2 ||b



ˆj − b ˆ0,j ||2 + ||b0,j − b ˆ0,j ||2 ||b

= OP (K/n2 ) + OP (λ2 K 2 /n2 ) +O(K −2η+1 ) + OP (K 2 /n) + OP (λ2 K 2 /n2 ) = O(K −2η+1 ) + OP (K 2 /n) + OP (λ2 K 2 /n2 ).

This completes the second assertion of Theorem 2. By the properties of the B-spline function, there exist constants c3 , c4 > 0 such that ˆj − b0,j ||2 ≤ ||ˆ ˆ j ) − rj (xj , θ 0,j )||2 ≤ c4 K −1 ||b ˆj − b0,j ||2 . c3 K −1 ||b rj (xj , θ 25

Thus, we obtain the final assertion of Theorem 2 as

    ||r̂_j(x_j, θ̂_j) − r_j(x_j, θ_{0,j})||^2 = O(Kn^{-1}) + O(K^{-2η}) + O(Kλ^2 n^{-2}).

Acknowledgments

The author is grateful to the Editor, Associate Editor, and two anonymous referees for their valuable comments and suggestions, which led to improvement of the paper. The research of the author was partly supported by KAKENHI 26730019.

References

[1] Barrow, D.L. and Smith, P.W. (1978). Asymptotic properties of best L2[0,1] approximation by splines with variable knots. Quart. Appl. Math. 36 293–304.

[2] Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17 453–555.

[3] Chen, Y. and Samworth, R.J. (2014). scar: shape-constrained additive regression: a maximum likelihood approach. R package version 0.2-0. Centre for Mathematical Sciences, University of Cambridge, Cambridge.

[4] Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties. Statist. Sci. 11 89–121.

[5] Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106 544–557.

[6] Fan, J., Wu, Y. and Feng, Y. (2009). Local quasi-likelihood with a parametric guide. Ann. Statist. 37 4153–4183.

[7] Fan, J.Q. and Li, R.Z. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.

[8] Fan, J., Maity, A., Wang, Y. and Wu, Y. (2013). Parametrically guided generalized additive models with application to mergers and acquisitions data. J. Nonparametr. Statist. 25 109–128.

[9] Glad, I.K. (1998). Parametrically guided non-parametric regression. Scand. J. Statist. 25 649–668.

[10] Ghouch, A.E. and Genton, M.C. (2009). Local polynomial quantile regression with parametric features. J. Amer. Statist. Assoc. 104 1416–1429.

[11] Horowitz, J.L. and Mammen, E. (2004). Nonparametric estimation of an additive model with a link function. Ann. Statist. 32 2412–2443.

[12] Horowitz, J.L., Klemelä, J. and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli 12 271–298.

[13] Huang, J., Horowitz, J.L. and Wei, F. (2010). Variable selection in nonparametric additive models. Ann. Statist. 38 2282–2313.

[14] Huang, J., Wei, F. and Ma, S. (2012). Semiparametric regression pursuit. Statistica Sinica 22 1403–1426.

[15] Lian, H., Liang, H. and Ruppert, D. (2015). Separation of covariates into nonparametric and parametric parts in high-dimensional partially linear additive models. Statistica Sinica 25 591–608.

[16] Liu, R. and Yang, L. (2010). Spline-backfitted kernel smoothing of additive coefficient model. Econometric Theory 26 29–59.

[17] Martins-Filho, C., Mishra, S. and Ullah, A. (2008). A class of improved parametrically guided nonparametric regression estimators. Econometric Rev. 27 542–573.

[18] Marx, B.D. and Eilers, P.H.C. (1998). Direct generalized additive modeling with penalized likelihood. Comput. Statist. & Data Anal. 28 193–209.

[19] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821.

[20] Naito, K. (2002). Semiparametric regression with multiplicative adjustment. Communications in Statistics, Theory and Methods 31 2289–2309.

[21] Opsomer, J.D. (2000). Asymptotic properties of backfitting estimators. J. Mult. Anal. 73 166–179.

[22] Opsomer, J.D. and Ruppert, D. (1998). A fully automated bandwidth selection method for fitting additive models. J. Amer. Statist. Assoc. 93 605–619.

[23] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. Roy. Statist. Soc. B 71 1009–1030.

[24] Stone, C.J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689–705.

[25] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1–25.

[26] Wu, Y. and Stefanski, L.A. (2015). Automatic structure recovery for additive models. Biometrika 102 381–395.

[27] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68 49–67.

[28] Zhang, H.H., Cheng, G. and Liu, Y. (2011). Linear or nonlinear? Automatic structure discovery for partially linear models. J. Amer. Statist. Assoc. 106 1099–1112.

[29] Zhang, H.H. and Lin, Y. (2006). Component selection and smoothing for nonparametric regression in exponential families. Statist. Sinica 16 1021–1041.

[30] Zhu, L.P., Li, L., Li, R. and Zhu, L.X. (2011). Model-free feature screening for ultrahigh-dimensional data. J. Amer. Statist. Assoc. 106 1464–1475.

[31] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.