Adaptive penalized splines for data smoothing✩

Computational Statistics and Data Analysis 108 (2017) 70–83

Lianqiang Yang a,b,∗, Yongmiao Hong b

a School of Mathematical Sciences, Anhui University, Hefei 230601, China
b Department of Economics, Cornell University, Ithaca, NY 14853, USA

Highlights

• An adaptive penalized splines model for data smoothing is proposed.
• The model is locally adaptive to the inhomogeneity of observations.
• Local penalization is constructed without complicated pre-estimation.

Article history: Received 6 February 2016; Received in revised form 18 October 2016; Accepted 24 October 2016; Available online 4 November 2016.

Keywords: Nonparametric regression; Data smoothing; Penalized splines; Adaptivity; Local penalty

Abstract

Data-driven adaptive penalized splines are considered via the principle of constrained regression. A locally penalized vector based on the local ranges of the data is generated and added into the penalty matrix of the classical penalized splines, which remarkably improves the local adaptivity of the model for data heterogeneity. The algorithm complexity and simulations are studied. The results show that the adaptive penalized splines outperform smoothing splines, l1 trend filtering and classical penalized splines in estimating functions with inhomogeneous smoothness.

1. Introduction

Several non- and semi-parametric regression methods have been developed in recent years, including classical methods, such as kernels, local polynomials, smoothing splines, sieves and wavelets, and relative newcomers, such as l1 trend filtering, the lasso and the generalized lasso (Wahba, 1990; Green and Silverman, 1994; Fan and Gijbels, 1996; Eubank, 1999; Ruppert et al., 2003; Hastie et al., 2001). Certain regression methods, e.g., smoothing splines, penalized splines, l1 trend filtering and the generalized lasso, obey the principle of maximizing likelihood functions under constraints, where the model is commonly designed as y_i = f(x_i) + ϵ_i. Given observations (x_i, y_i), the design points are arranged such that x_i ≤ x_{i+1}, i = 1, …, n. The regression function to be estimated is f(x), and the errors ϵ_i are generally assumed to be independent and identically Gaussian with mean 0 and variance σ². Once f(x) is represented by a linear combination of basis functions B_j(x) as Σ_{j=1}^p B_j(x) β_j, the

✩ The R codes are attached to the electronic version of this paper.



∗ Corresponding author at: School of Mathematical Sciences, Anhui University, Hefei 230601, China. E-mail address: [email protected] (L. Yang).



estimate is defined by a penalized least-squares optimization problem:

β̂ = arg min_{β ∈ R^p} { ∥Y − Xβ∥_2² + λ · penalty(β) },   (1)

where Y = (y_1, …, y_n)′, β = (β_1, …, β_p)′, the design matrix X is generated by X(i, j) = B_j(x_i), ∥·∥_2 is the Euclidean norm, λ is a tuning parameter, and penalty(β) is a positive convex function of β.

Constrained regression based on splines or piecewise polynomials is applied in several popular methods. First, several types of spline bases are used for B_j(x), such as the truncated power basis (Ruppert, 2002), the B-spline basis (De Boor, 1978; Eilers and Marx, 1996), the natural spline basis (Wahba, 1990) and other piecewise polynomial bases (Tibshirani et al., 2014). Second, several types of penalties are used, such as smoothing splines with ∫ f″(x)² dx (Wahba, 1990; Gu, 2002), penalized splines with ∥D^{(k)}β∥_2² (O'Sullivan, 1988; Eilers and Marx, 1996), l1 trend filtering with ∥D^{(1)}β∥_1, where D^{(k)} is the difference matrix of order k (Kim et al., 2009), and the generalized lasso based on splines (Bazerque et al., 2011; Lin and Zhang, 2006) with ∥Dβ∥_1, where D is a particular matrix (Tibshirani and Taylor, 2011). Intuitively, the first term in the target function (1) measures the estimator's loss, and the second term penalizes the estimator's undesirable characteristics; the weight λ controls the trade-off between the two. It can also be said that the target function maintains the balance between the goodness of fit and the degrees of freedom of the model.

Adaptivity strongly influences model efficiency; therefore, how to improve it using a well-designed model is an interesting research topic in nonparametric regression analysis. Adaptive spline smoothing has been developed along several lines. The first focuses on the selection of knots (Friedman, 1991; Mammen et al., 1997; Zhou and Shen, 2001; Ruppert, 2002); these procedures always require complex computations to determine the number and positions of the knots. The second focuses on the construction of penalties (Ruppert and Carroll, 2000; Jullion and Lambert, 2007; Pintore et al., 2006; Liu and Guo, 2010; Storlie et al., 2010; Jang and Oh, 2011; Wang et al., 2013); in these procedures, the idea of generalizing the penalty parameter λ from a constant to a function λ(·) plays the key role, and the function λ(·) is always pre-estimated by some elaborate technique. Furthermore, Bayesian adaptive penalized splines have also been studied. Baladandayuthapani et al. (2005) and Jullion and Lambert (2007) give heteroscedastic prior distributions for the penalty parameters, whereas Lang and Brezger (2004), Scheipl and Kneib (2009), Krivobokova et al. (2008) and Crainiceanu et al. (2007) give heteroscedastic prior distributions for the regression parameters. The estimates are normally obtained by Markov chain Monte Carlo (MCMC) or restricted maximum likelihood (REML). Krivobokova et al. (2008) further developed a fast and simple algorithm for Bayesian P-splines based on a Laplace approximation of the marginal likelihood.

In particular, Tibshirani et al. (2014) present a detailed study of l1 trend filtering (Kim et al., 2009) and compare it with smoothing splines (Wahba, 1990) and locally adaptive regression splines (Mammen et al., 1997). Trend filtering has adaptivity similar to that of locally adaptive regression splines but outperforms smoothing splines in estimating functions with inhomogeneous smoothness. However, greater local adaptivity appears necessary in certain cases, such as Fig. 1, where the Mexican hat function, its scatter points with random errors, and l1 trend filtering fits with 15, 20 and 40 degrees of freedom are shown. Clearly, with more degrees of freedom, the cusp of the hat becomes well fitted, but the flat part is over-fitted with more wiggles; with fewer degrees of freedom, the flat part is well fitted, but the cusp is under-fitted.

Most of the aforementioned adaptive splines are based on smoothing splines rather than penalized splines (Eilers and Marx, 1996; Ruppert et al., 2003), and complicated pre-computations are always required. In this paper, a new method called adaptive penalized splines is proposed based on the classical penalized splines. The procedure only adds a weight vector to the differences in the coefficients; the weight vector is fully data driven and does not need to be pre-estimated. Solving the model is a simple convex quadratic optimization problem, as in generalized ridge regression. Empirical results and simulations show that this new method can remarkably improve the local adaptivity of the model compared to l1 trend filtering, smoothing splines and classical penalized splines.

The remainder of this paper is organized as follows: Section 2 constructs the adaptive penalized splines, and the computation of the nonparametric estimator is considered in Section 3; a simulation study comparing the adaptive penalized splines, l1 trend filtering, smoothing splines and classical penalized splines is presented in Section 4; an application is shown in Section 5; and Section 6 contains some discussion.

2. Adaptive penalized splines

First, we provide a short review of classical penalized splines, i.e., P-splines, a name coined by Eilers and Marx (1996), although the method can be traced back to O'Sullivan (1988). Here, X is constructed using an l-degree B-spline basis and penalty(β) equals ∥D^{(k)}β∥_2². Then, (1) is equivalently written as

β̂ = arg min_{β ∈ R^p} { Σ_{i=1}^n ( y_i − Σ_{j=1}^p B_j(x_i) β_j )² + λ Σ_{j=k+1}^p (△^k β_j)² },   (2)

where △^k denotes the kth-order difference operator, with △^1 β_j = β_j − β_{j−1} and △^k = △(△^{k−1}). In the case of k = 2, the P-splines can be considered a discrete version of smoothing splines. Additional details about P-splines can be found in Eubank (1999), Green and Silverman (1994) and Ramsay and Silverman (2005), among others.
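For concreteness, the following R sketch assembles the pieces of (2): a B-spline design matrix with quantile-placed interior knots (the layout adopted in Section 2 below) and a kth-order difference penalty, solved as a ridge-type system. The helper name pspline_fit() and its defaults are ours, not the paper's code.

```r
# A minimal sketch of the classical P-spline estimate in (2); only
# splines::splineDesign from base R is assumed.
library(splines)

pspline_fit <- function(x, y, K = ceiling(length(x) / 5) - 1,
                        l = 3, k = 2, lambda = 1) {
  # Interior knots at the j/(K + 1)th sample quantiles of x.
  kappa <- quantile(x, probs = (1:K) / (K + 1))
  # Boundary knots repeated l + 1 times give p = K + l + 1 basis functions.
  knots <- c(rep(min(x), l + 1), kappa, rep(max(x), l + 1))
  X <- splineDesign(knots, x, ord = l + 1)
  p <- ncol(X)
  # kth-order difference matrix D^(k) of dimension (p - k) x p.
  D <- diag(p)
  for (i in seq_len(k)) D <- diff(D)
  # Generalized ridge solution: beta = (X'X + lambda D'D)^{-1} X'Y.
  beta <- solve(crossprod(X) + lambda * crossprod(D), crossprod(X, y))
  list(beta = beta, fitted = X %*% beta, X = X, D = D)
}
```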


Fig. 1. Smoothing by trend filtering. Panel (a) shows the Mexican hat function (solid) and its scatter plot; panels (b), (c) and (d) show the Mexican hat function (dotted) and the trend filtering estimates (solid) with 15, 20 and 40 degrees of freedom, respectively.

The penalties in (2) are used to control the volatility of the estimate or, in the geometrical sense, to prevent the fitting of local wiggles. This is because the differences in the coefficients are similar to the derivatives of the function, based on the convex hull property of B-splines (De Boor, 1978). The tuning parameter λ controls the trade-off between the goodness of fit and the volatility of the regression function. However, λ acts as a single global weight on the total penalty: each (△^k β_j)² is equally penalized without considering the inhomogeneous smoothness of the observations, although such inhomogeneity emerges in many cases, such as the data in Fig. 1.

Given this analysis, a rational and intuitive remedy is to match each (△^k β_j)² with a well-defined weight ω_j, which should be a monotonically decreasing function of the volatility of the local observations. The underlying idea is that if the local data exhibit greater volatility, a smaller penalty should be imposed on the local estimate to provide greater flexibility; in contrast, if the local data exhibit less volatility, a higher penalty should be imposed on the local estimate to reduce flexibility. Inspired by this idea, we give a notably simple selection of ω_j that is fully data driven and does not involve pre-estimates as in certain other methods. The selection is only a power of a linear function of the local data range.

In detail, we first follow the recommendation in Ruppert and Carroll (2000) and let the interior knot κ_j of the P-splines be the j/(K + 1)th sample quantile of the x_i s, j = 1, …, K, where K is the number of interior knots. The boundary knots for the l-degree B-spline basis are set as κ_{−l} = ⋯ = κ_0 = x_1 and κ_{K+1} = ⋯ = κ_{K+l+1} = x_n. Then, the p = K + l + 1 B-spline basis functions B_j(x), j = 1, …, p, are defined. Because of the local support property, B_j(x) vanishes outside the interval (κ_{j−l−1}, κ_j) but remains positive inside it (De Boor, 1978). With △^k β_j computed from β_j back to β_{j−k}, let

r_j = range{ y_i | x_i ∈ [κ_{j−k−l−1}, κ_j] } = max_{x_s, x_t ∈ [κ_{j−k−l−1}, κ_j]} (y_s − y_t)

and R = max_j {r_j}. Furthermore, we define

ω_j = (R − r_j)^γ,   j = k + 1, …, p,   (3)


Table 1
The models to be compared with each other in the simulation study.

Models | References
Smoothing splines | Wahba (1990)
Trend filtering | Kim et al. (2009)
Classical penalized splines | Eilers and Marx (1996)
Penalized splines by REML | Ruppert et al. (2003)
Fast adaptive penalized splines (FAPS) by REML | Krivobokova et al. (2008)
Adaptive penalized splines | This paper

where the positive integer γ strengthens the effect of the linear function of r_j. Thus, ω_j is a monotonically decreasing function of the range r_j, and the range r_j is the simplest index of the volatility of the local data. The adaptive P-splines based on local penalization are then estimated by

β̂ = arg min_{β ∈ R^p} { Σ_{i=1}^n ( y_i − Σ_{j=1}^p B_j(x_i) β_j )² + λ Σ_{j=k+1}^p (ω_j △^k β_j)² }.   (4)
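To make the construction concrete, the following R sketch computes the ranges r_j, the weights ω_j in (3) and the fit (4) directly; it reuses the design of the previous sketch. The function name, the default γ = 2 and the index bookkeeping (translating the paper's knot subscripts κ_{−l}, …, κ_{K+l+1} into R's 1-based positions) are our own choices rather than the paper's code.

```r
# A sketch of the weights (3) and the locally penalized fit (4).
adaptive_pspline_fit <- function(x, y, K = ceiling(length(x) / 5) - 1,
                                 l = 3, k = 2, lambda = 1, gamma = 2) {
  kappa <- quantile(x, probs = (1:K) / (K + 1))
  knots <- c(rep(min(x), l + 1), kappa, rep(max(x), l + 1))
  X <- splines::splineDesign(knots, x, ord = l + 1)
  p <- ncol(X)
  D <- diag(p)
  for (i in seq_len(k)) D <- diff(D)
  # r_j = local range of y over [kappa_{j-k-l-1}, kappa_j], j = k+1, ..., p;
  # kappa_{j-k-l-1} sits at position j - k of the knots vector, kappa_j at j + l + 1.
  r <- sapply((k + 1):p, function(j) {
    in_window <- x >= knots[j - k] & x <= knots[j + l + 1]
    diff(range(y[in_window]))
  })
  omega <- (max(r) - r)^gamma                 # weights in (3)
  G <- crossprod(diag(omega) %*% D)           # G = (W D^(k))'(W D^(k)), cf. Section 3
  beta <- solve(crossprod(X) + lambda * G, crossprod(X, y))
  list(beta = beta, fitted = X %*% beta, omega = omega, X = X, G = G)
}
```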



3. Computational considerations

We rewrite (4) in matrix form:

β̂ = arg min_{β ∈ R^p} { ∥Y − Xβ∥_2² + λ β^T G β },   (5)

with penalization matrix G = (W D^{(k)})^T (W D^{(k)}), where W = diag{ω_{k+1}, …, ω_p}. Thus (5) is a generalized ridge regression problem. Furthermore, because the B-spline basis functions are linearly independent, the design matrix X has full column rank, and the unique solution of (5) is

β̂ = (X^T X + λG)^{−1} X^T Y.   (6)
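A sketch of (6) together with the fast computation described in the paragraph that follows; B, U, C, Z and the diagonal system follow the paper's notation, while the wrapper name fast_path() is ours.

```r
# Solve (6) for many lambdas via the Reinsch-form decomposition below:
# X'X = B^{-1}B^{-T} (Cholesky), BGB' = UCU' (eigendecomposition), Z = XB'U.
fast_path <- function(X, G, Y, lambdas) {
  R <- chol(crossprod(X))            # X'X = R'R, so B = (R')^{-1}
  B <- t(solve(R))
  e <- eigen(B %*% G %*% t(B), symmetric = TRUE)
  U <- e$vectors
  C <- e$values                      # diagonal entries of C
  Z <- X %*% t(B) %*% U              # computed once; note Z'Z = I
  ZtY <- crossprod(Z, Y)
  lapply(lambdas, function(lambda) {
    shrink <- 1 / (1 + lambda * C)   # solves (I + lambda C) alpha = Z'Y
    list(fitted = Z %*% (shrink * ZtY),
         df = sum(shrink))           # tr{H(lambda)} = tr{(I + lambda C)^{-1}}
  })
}
```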

As discussed in Ramsay and Silverman (2005, Section 5.4.1), the size of λG is suggested to be ten orders of magnitude greater than the size of X^T X. Our rule of thumb is that γ is chosen to make the size of G approximately 10^5 or 10^6 times the size of X^T X, and λ is selected based on some information rule such as GCV, CV or AIC. In our simulation studies, this rule of thumb works quite well.

Fast computation can be obtained with the matrix decomposition technique introduced in Ruppert and Carroll (2000) as follows. Denote the Cholesky decomposition of X^T X as X^T X = B^{−1}B^{−T} and the eigendecomposition of BGB^T as BGB^T = UCU^T, where U is an orthogonal matrix and C is diagonal. Then, for each λ, ŷ = Xβ̂ = Zα̂, where Z = XB^T U and α̂ is the solution of the diagonal system (I + λC)α̂ = Z^T Y. Because B, U, C and Z need to be pre-computed only once, the total computation becomes quite fast. It is also clear that the trace of the hat matrix is tr{H(λ)} = tr{X(X^T X + λG)^{−1}X^T} = tr{(I + λC)^{−1}}, which is used to measure the degrees of freedom of the model estimate. This procedure originates from Reinsch (1967) and is called the Reinsch form in Eubank (1999) and Tibshirani et al. (2014), among others.

4. Comparison to trend filtering, smoothing splines, and classical penalized splines

Tibshirani et al. (2014) give detailed comparisons of three nonparametric tools, trend filtering (Kim et al., 2009), smoothing splines (Wahba, 1990) and locally adaptive regression splines (Mammen et al., 1997), because all of these tools produce adaptive splines. Trend filtering and locally adaptive regression splines share notably similar adaptivity to the local level of smoothness of the data and are substantially better than smoothing splines. However, penalized splines were not considered in Tibshirani et al. (2014). In this section, classical penalized splines, smoothing splines, trend filtering and adaptive penalized splines are compared. Locally adaptive regression splines (Mammen et al., 1997) are not considered because they have the same adaptivity as trend filtering but suffer from high computational complexity.

Classical penalized splines are well studied in Ruppert et al. (2003), and the R package SemiPar was written by M.P. Wand to accompany this book. The penalized splines are represented as mixed models, and the coefficients and the smoothness tuning parameter are estimated using restricted maximum likelihood (REML) with the function spm in this package. Krivobokova et al. (2008) present fast adaptive penalized splines (FAPS) and develop the package AdaptFit, where the hierarchical model is estimated using REML with the function asp; both are extended to the package AdaptFitOS and the function asp2 in Wiesenfarth et al. (2012). This method provides results comparable to other approaches but with significantly reduced numerical effort. The penalized splines by REML are therefore also evaluated, for comparison with the estimates whose tuning parameters are selected by GCV or AIC. The complexity of the algorithms is discussed, and simulation studies on the adaptivity of these methods are presented for three examples. For clarity, the models and their original references are listed in Table 1; penalized spline is always abbreviated as P-spline in the figures.


Fig. 2. Mexican hat function (dotted) and its estimates (solid) using (a) smoothing splines, (b) trend filtering, (c) classical P-splines, (d) P-splines by REML, (e) FAPS and (f) adaptive P-splines.

4.1. Complexity of algorithms

As described in Tibshirani et al. (2014), the complexity of the trend filtering algorithm is O(n^{3/2}) via the primal–dual interior point method (Kim et al., 2009), where n is the number of given observations. The complexity is improved to O(n) if


Fig. 3. Residual sum of squares (left) and sum of squared errors (right) plotted versus degrees of freedom. The panels in the top row correspond to the data set from Fig. 2, and in the bottom row to average values from 50 data sets. The estimates are based on smoothing splines, trend filtering, classical P-splines, P-splines by REML and adaptive P-splines.

the path algorithm of the generalized lasso (described in Tibshirani and Taylor, 2011) is employed. The smoothing spline fitted values can also be computed in O(n) operations because the number of B-spline basis functions is n. Penalized splines use far fewer knots than smoothing splines; thus, the number of basis functions p is also much less than n, which makes the complexity of the algorithms based on penalized splines O(p) with p ≪ n. For example, in our next simulation study, the j/(K + 1)th quantile of the observations is selected as the interior knot of the cubic B-splines, where K = ⌈n/5⌉ − 1 and j = 1, …, K; then p = ⌈n/5⌉ + 3, which is a notable advantage when the observations are dense. Here, the ceiling function ⌈x⌉ denotes the smallest integer greater than or equal to x.

4.2. Simulation study

The function smooth.spline is used to compute the smoothing spline estimate. The function trendfilter in the package genlasso performs the trend filtering fitting. The function spm in the SemiPar package generates the P-splines using REML. The function asp2 in the AdaptFitOS package calculates the fast adaptive penalized splines (FAPS) using REML. The fitting codes of the adaptive penalized splines and classical penalized splines by AIC or GCV are written in R, and all R codes in this paper are attached to the electronic version of this paper (see Appendix A).

Example 1 (Mexican Hat). This function is given by f(x) = −1 + 1.5x + 0.2φ(x − 0.6), where φ(x) is the density function of the normal distribution N(0, 0.02) and the x_i are 200 independent sample points from the uniform distribution U(0, 1). Then, y_i = f(x_i) + ε_i are independently generated with ε_i ∼ N(0, σ²), σ² = 0.2.
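Data like those of Example 1 can be generated in a few lines of R, reading N(0, 0.02) as a normal distribution with variance 0.02 (so sd = sqrt(0.02)); the seed is an arbitrary choice of ours.

```r
set.seed(1)                                    # arbitrary, for reproducibility
n <- 200
x <- sort(runif(n))                            # design points from U(0, 1)
f <- -1 + 1.5 * x + 0.2 * dnorm(x - 0.6, sd = sqrt(0.02))  # Mexican hat
y <- f + rnorm(n, sd = sqrt(0.2))              # errors with variance 0.2
fit <- adaptive_pspline_fit(x, y)              # sketch from Section 2
```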


Fig. 4. Doppler function (dotted) and its estimates (solid) using (a) smoothing splines, (b) trend filtering, (c) classical P-splines, (d) P-splines by REML, (e) FAPS and (f) adaptive P-splines.

Fig. 2 displays the estimates of one data set. Panels (a), (b), (c), (d), (e) and (f) are estimates by the smoothing splines with minimum AIC, trend filtering, classical P-splines with minimum AIC, P-splines by REML, FAPS by REML and adaptive P-splines with minimum AIC, where AIC = Σ_{i=1}^n (ŷ_i − y_i)² + 2 tr(H) σ². The j/(K + 1)th quantile of the observations, where K = ⌈n/5⌉ − 1 and j = 1, …, K, is selected as the knot for classical P-splines, P-splines by REML and adaptive P-splines, and


Fig. 5. Residual sum of squares (left) and sum of squared errors (right) plotted versus degrees of freedom. The panels in the top row correspond to the data set from Fig. 4, and in the bottom row to average values from 50 data sets. The estimates are based on smoothing splines, trend filtering, classical P-splines, P-splines by REML and adaptive P-splines.

the degrees of trend filtering, classical P-splines and adaptive P-splines are 3. The adaptive P-spline estimate clearly exhibits better adaptivity than the other five methods because it fits the true function well in both the cusp and the flat region of the curve, whereas the other methods have unnecessary wiggles in the flat region. If these wiggles are reduced by decreasing the number of degrees of freedom, the cusp region of the curve is underestimated, as shown in Fig. 1.

Furthermore, the residual sum of squares Σ_{i=1}^n (ŷ_i − y_i)² and the sum of squared errors Σ_{i=1}^n (ŷ_i − f(x_i))² under different degrees of freedom are calculated (because the degrees of freedom in the function asp2 cannot be tuned as an argument, FAPS is not included in such a situation, as below). Fig. 3 shows the results. Panels (a) and (b) show the residual sum of squares and sum of squared errors for the data set from Fig. 2. Panels (c) and (d) are the mean values over fifty data sets. The smoothing splines and the P-splines by REML or minimum AIC behave almost identically and worse than trend filtering when approximately 15–35 degrees of freedom are used, whereas the adaptive P-splines remarkably reduce the residual sum of squares and sum of squared errors compared to the other four methods for identical numbers of degrees of freedom. It is worth noting that panels (c) and (d) present notably similar shapes to panels (a) and (b) but with smaller vertical values.

Example 2 (Doppler). This function is given by f(x) = 5(x(1 − x))^{0.5} sin(2π(1 + c)/(x + c)), where c = 0.05 and the x_i are 400 independent sample points from the uniform distribution U(0, 1). Then, y_i = f(x_i) + ε_i are independently generated with ε_i ∼ N(0, σ²), σ² = 0.2.

Fig. 4 displays the estimates of one data set. Panels (a), (b), (c), (d), (e) and (f) are estimates using smoothing splines with minimum AIC, trend filtering, classical P-splines with minimum AIC, P-splines by REML, FAPS by REML and adaptive P-splines with minimum AIC. The j/(K + 1)th quantile of the observations, where K = ⌈n/2⌉ − 1 and j = 1, …, K, is selected


Fig. 6. Block function (dotted) and its estimates (solid) using (a) smoothing splines, (b) trend filtering, (c) classical P-splines, (d) P-splines by REML, (e) FAPS and (f) adaptive P-splines.

as the knot for classical P-splines, P-splines by REML and adaptive P-splines, and the degrees of trend filtering, classical P-splines and adaptive P-splines are 3. It is clear that the adaptive P-splines achieve adaptivity competitive with FAPS and outperform the other methods.


Fig. 7. Residual sum of squares (left) and sum of squared errors (right) plotted versus degrees of freedom. The panels in the top row correspond to the data set from Fig. 6, and in the bottom row to average values from 50 data sets. The estimates are based on smoothing splines, trend filtering, classical P-splines, P-splines by REML and adaptive P-splines.

Fig. 5 shows the residual sum of squares and sum of squared errors with different numbers of degrees of freedom. Panels (a) and (b) show the case using the data set from Fig. 4. Panels (c) and (d) are the mean values over fifty data sets. The smoothing splines and classical P-splines behave similarly and are worse than trend filtering, whereas the adaptive P-splines are better than the other methods when approximately 50–100 degrees of freedom are used.

Example 3 (Block). The data are generated using y_i = c_i + ε_i, where

c_i = 0,   ε_i ∼ N(0, 0.01) for i = 1, …, 40;
c_i = 0.5, ε_i ∼ N(0, 0.02) for i = 41, …, 80;
c_i = 0.2, ε_i ∼ N(0, 0.05) for i = 81, …, 120;
c_i = 0,   ε_i ∼ N(0, 0.02) for i = 121, …, 160;
c_i = 0.4, ε_i ∼ N(0, 0.01) for i = 161, …, 200.
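A sketch of this data-generating process, again reading N(0, v) as a normal distribution with variance v; the paper does not specify the design points x_i, so the equally spaced grid below is our assumption.

```r
c_i <- rep(c(0, 0.5, 0.2, 0, 0.4), each = 40)           # piecewise levels
v_i <- rep(c(0.01, 0.02, 0.05, 0.02, 0.01), each = 40)  # error variances
y   <- c_i + rnorm(200, sd = sqrt(v_i))
x   <- seq_len(200) / 200    # assumed design, not stated in the paper
```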

Here, the ε_i are independent with piecewise heteroscedasticity to increase the complexity. The information criterion AIC used in Examples 1 and 2 is therefore replaced by GCV = Σ_{i=1}^n (ŷ_i − y_i)² / (1 − tr(H)/n)², because the piecewise heteroscedasticity leaves no unique σ², which is necessary in the calculation of AIC.

Fig. 6 displays the estimates of one data set. Panels (a), (b), (c), (d), (e) and (f) are estimates by smoothing splines with minimum GCV, trend filtering, classical P-splines with minimum GCV, P-splines by REML, FAPS by REML and adaptive P-splines with minimum GCV. The knots are selected in the same way as in Example 2, and the degrees of trend filtering, classical P-splines and adaptive P-splines are 0. Similar to the previous examples, the adaptive P-splines achieve better adaptivity than the other five methods because they better match the true function and are not significantly disturbed by the piecewise heteroscedasticity.
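Selecting λ by GCV is immediate with the fast path of Section 3; a sketch follows, where the helper name gcv_select() and the λ grid are arbitrary choices of ours.

```r
gcv_select <- function(X, G, y, lambdas = 10^seq(-4, 8, by = 0.25)) {
  n <- length(y)
  fits <- fast_path(X, G, y, lambdas)          # sketch from Section 3
  gcv <- sapply(fits, function(f) sum((f$fitted - y)^2) / (1 - f$df / n)^2)
  best <- which.min(gcv)
  list(lambda = lambdas[best], fit = fits[[best]], gcv = gcv)
}
```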


Fig. 8. Mean AIC , GCV plotted versus degrees of freedom. Panels (a), (b) and (c) correspond to the 50 data sets used in Examples 1, 2 and 3, respectively. The estimates are based on smoothing splines, trend filtering, classical P-splines, P-splines by REML and adaptive P-splines.

Fig. 7 shows the residual sum of squares and sum of squared errors with different numbers of degrees of freedom and the mean values over fifty data sets. The adaptive P-splines have smaller values than the other methods when approximately 7 degrees of freedom are used, smaller values than the smoothing splines and classical P-splines when more than 6 degrees of freedom are used, but larger values than trend filtering when more than 8 degrees of freedom are used. In this example, trend filtering obtains the smallest residual sum of squares and sum of squared errors when more than 8 degrees of freedom are used because trend filtering weights the goodness of fit more heavily relative to the model complexity, which is also shown in Fig. 8.

Fig. 8 shows the scores of the information rules against the number of degrees of freedom for the three examples. Panel (a) corresponds to the Mexican hat. Here, the adaptive P-splines obviously outperform the other four models: they obtain the minimum AIC with the smallest number of degrees of freedom and obtain the smallest minimum AICs among the five methods. Panel (b) corresponds to the Doppler function, where the adaptive P-splines are superior when 50–100 degrees of freedom are used. Panel (c) corresponds to the Block function, where the adaptive P-splines obtain the minimum GCV when fewer than 20 degrees of freedom are used, whereas the numbers of degrees of freedom of the classical P-splines, smoothing splines, P-splines by REML and trend filtering are approximately 40, 60, 60 and greater than 100, respectively. These results show that the adaptive P-splines use relatively few parameters to obtain better performance.

5. Application

In this section, these methods are used to smooth the mcycle data set in the R library MASS, which measures the acceleration over time in a simulated motorcycle crash to test the efficacy of crash helmets (Silverman, 1985). Fig. 9 shows the estimates, where the knots are selected in the same way as in Examples 2 and 3 and the degrees of trend filtering, classical


Fig. 9. Estimates of the mcycle data set using (a) smoothing splines, (b) trend filtering, (c) classical P-splines, (d) P-splines by REML, (e) FAPS and (f) adaptive P-splines.

P-splines and adaptive P-splines are 2. Smoothing splines, classical P-splines and adaptive P-splines obtain the minimum GCV when approximately 12 degrees of freedom are used. The P-splines and FAPS by REML also use approximately 12 degrees of freedom. Trend filtering obtains a local minimum GCV when 12 degrees of freedom are used; however, its overall minimum GCV is obtained when more than 100 degrees of freedom are used, and the estimate of trend filtering is notably


wiggly. Here, the residual sums of squares of the smoothing splines, trend filtering, classical P-splines, P-splines by REML, FAPS and adaptive P-splines are 429.25, 439.75, 426.68, 420.59, 425.57 and 428.90, respectively. However, unlike the other four methods, the FAPS and adaptive P-splines have two advantages, as indicated in Fig. 9. First, they obtain the smoothest estimates. Second, the left part of their estimates (x < 13) is equal to 0, and the right part (x > 33) decreases to 0. These characteristics are physically sensible because the acceleration caused by the outside force should change smoothly, should be zero until the crash begins and should decrease to 0 after the crash finishes. In contrast, the other four methods show increasing deviation from 0 when x > 50, and trend filtering and smoothing splines also show a small deviation from 0 when x < 13.

6. Conclusions and discussion

We have constructed adaptive P-splines, an improved version of traditional P-splines for nonparametric regression and data smoothing, demonstrated the efficiency of their estimation procedures and shown their adaptivity to data heterogeneity. The philosophy of this method is based on both synthesis and analysis, two concepts discussed in Section 7.2 of Tibshirani et al. (2014). From the synthesis perspective, the B-spline design matrix is used to capture the desired characteristics of the estimator. From the analysis perspective, the penalty matrix is the product of the difference matrix and the diagonal matrix generated from the local data ranges: the difference matrix prevents the estimator from over-fitting, and the diagonal matrix yields a penalty term that is non-uniform along the data. Simulations and applications show that the FAPS and adaptive P-splines outperform smoothing splines, trend filtering and classical P-splines in matching the actual underlying function with identical numbers of degrees of freedom or minimized information rule scores.

However, some questions about adaptive P-splines remain to be studied in detail. The optimal γ and λ in (6) should be driven by the data, and studies on how to pre-estimate or analytically derive their values, or on path algorithms, are expected. The asymptotic properties and the extension to multivariate cases should also be examined.

Acknowledgments

The authors are grateful to the associate editor and two referees for their valuable comments and insightful suggestions, which were used to improve the quality of this paper.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.csda.2016.10.022.

References

Baladandayuthapani, V., Mallick, B.K., Carroll, R.J., 2005. Spatially adaptive Bayesian penalized regression splines (P-splines). J. Comput. Graph. Statist. 14 (2), 378–394.
Bazerque, J.A., Mateos, G., Giannakis, G.B., 2011. Group-lasso on splines for spectrum cartography. IEEE Trans. Signal Process. 59 (10), 4648–4663.
Crainiceanu, C.M., Ruppert, D., Carroll, R.J., Joshi, A., Goodner, B., 2007. Spatially adaptive Bayesian penalized splines with heteroscedastic errors. J. Comput. Graph. Statist. 16 (2), 265–288.
De Boor, C., 1978. A Practical Guide to Splines. Springer-Verlag, New York.
Eilers, P.H., Marx, B.D., 1996. Flexible smoothing with B-splines and penalties. Statist. Sci. 11 (2), 89–121.
Eubank, R.L., 1999. Nonparametric Regression and Spline Smoothing. Marcel Dekker, New York.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and its Applications. Chapman & Hall, London.
Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Statist. 19 (1), 1–141.
Green, P., Silverman, B., 1994. Nonparametric Regression and Generalized Linear Models. Chapman & Hall, London.
Gu, C., 2002. Smoothing Spline ANOVA Models. Springer-Verlag, New York.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer-Verlag, New York.
Jang, D., Oh, H.-S., 2011. Enhancement of spatially adaptive smoothing splines via parameterization of smoothing parameters. Comput. Statist. Data Anal. 55 (2), 1029–1040.
Jullion, A., Lambert, P., 2007. Robust specification of the roughness penalty prior distribution in spatially adaptive Bayesian P-splines models. Comput. Statist. Data Anal. 51 (5), 2542–2558.
Kim, S.-J., Koh, K., Boyd, S., Gorinevsky, D., 2009. l1 trend filtering. SIAM Rev. 51 (2), 339–360.
Krivobokova, T., Crainiceanu, C.M., Kauermann, G., 2008. Fast adaptive penalized splines. J. Comput. Graph. Statist. 17 (1), 1–20.
Lang, S., Brezger, A., 2004. Bayesian P-splines. J. Comput. Graph. Statist. 13 (1), 183–212.
Lin, Y., Zhang, H.H., 2006. Component selection and smoothing in smoothing spline analysis of variance models. Ann. Statist. 34 (5), 2272–2297.
Liu, Z., Guo, W., 2010. Data driven adaptive spline smoothing. Statist. Sinica 20 (3), 1143–1163.
Mammen, E., van de Geer, S., 1997. Locally adaptive regression splines. Ann. Statist. 25 (1), 387–413.
O'Sullivan, F., 1988. Nonparametric estimation of relative risk using splines and cross-validation. SIAM J. Sci. Stat. Comput. 9 (3), 531–542.
Pintore, A., Speckman, P., Holmes, C.C., 2006. Spatially adaptive smoothing splines. Biometrika 93 (1), 113–125.
Ramsay, J., Silverman, B., 2005. Functional Data Analysis. Springer-Verlag, New York.
Reinsch, C.H., 1967. Smoothing by spline functions. Numer. Math. 10 (3), 177–183.
Ruppert, D., 2002. Selecting the number of knots for penalized splines. J. Comput. Graph. Statist. 11 (4), 735–757.
Ruppert, D., Carroll, R.J., 2000. Spatially-adaptive penalties for spline fitting. Aust. N. Z. J. Stat. 42 (2), 205–223.
Ruppert, D., Wand, M.P., Carroll, R.J., 2003. Semiparametric Regression. Cambridge University Press, New York.
Scheipl, F., Kneib, T., 2009. Locally adaptive Bayesian P-splines with a normal-exponential-gamma prior. Comput. Statist. Data Anal. 53 (10), 3533–3552.


Silverman, B.W., 1985. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. R. Stat. Soc. Ser. B Stat. Methodol. 47 (1), 1–52.
Storlie, C.B., Bondell, H.D., Reich, B.J., 2010. A locally adaptive penalty for estimation of functions with varying roughness. J. Comput. Graph. Statist. 19 (3), 569–589.
Tibshirani, R.J., 2014. Adaptive piecewise polynomial estimation via trend filtering. Ann. Statist. 42 (1), 285–323.
Tibshirani, R.J., Taylor, J., 2011. The solution path of the generalized lasso. Ann. Statist. 39 (3), 1335–1371.
Wahba, G., 1990. Spline Models for Observational Data. SIAM, Philadelphia.
Wang, X., Du, P., Shen, J., 2013. Smoothing splines with varying smoothing parameter. Biometrika 100 (4), 955–970.
Wiesenfarth, M., Krivobokova, T., Klasen, S., Sperlich, S., 2012. Direct simultaneous inference in additive models and its application to model undernutrition. J. Amer. Statist. Assoc. 107 (500), 1286–1296.
Zhou, S., Shen, X., 2001. Spatially adaptive regression splines and accurate knot selection schemes. J. Amer. Statist. Assoc. 96 (453), 247–259.