Tensor product splines and functional principal components

Philip T. Reiss∗, Meng Xu
Department of Statistics, University of Haifa, Haifa 19105, Israel


Abstract: Functional principal component analysis for sparse longitudinal data usually proceeds by first smoothing the covariance surface, and then obtaining an eigendecomposition of the associated covariance operator. Here we consider the use of penalized tensor product splines for the initial smoothing step. Drawing on a result regarding finite-rank symmetric integral operators, we derive an explicit spline representation of the estimated eigenfunctions, and show that the effect of penalization can be notably disparate for alternative approaches to tensor product smoothing. The latter phenomenon is illustrated with two data sets derived from magnetic resonance imaging of the human brain.

© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Article history: Received 5 December 2018; Received in revised form 7 October 2019; Accepted 7 October 2019.
Keywords: Bivariate smoothing; Covariance function; Covariance operator; Eigenfunction; Roughness penalty

1. Introduction

Extensions of principal component analysis from multivariate data to densely observed functions were popularized in the 1990s (Rice and Silverman, 1991; Ramsay and Dalzell, 1991; Silverman, 1996). A quite different formulation of functional principal component analysis (FPCA), suitable for sparse longitudinal data, was introduced by Yao et al. (2005). The crux of this reformulation is that, just as the loading vectors of classical principal components are eigenvectors of the sample covariance matrix, functional principal components can be defined by eigenfunctions of an estimated covariance operator. For functions f : D → R arising from the underlying process on a domain D, typically a finite interval on the real line, Yao et al. (2005) derive FPCs by applying local linear smoothing to estimate the covariance function C(s, t) = Cov[f(s), f(t)], and then estimating the eigenfunctions of the resulting covariance operator.

Due in part to advances in relevant software (Wood, 2006, 2017), it has recently become popular to carry out a similar procedure for FPCA, but with the covariance function smoothing performed by penalized tensor product splines (e.g., Di et al., 2009; Goldsmith et al., 2013; Xiao et al., 2018). Such approaches assume that the covariance function has the form
$$C(s,t) = b(s)^T \Theta\, b(t), \qquad (1)$$
where b(s) = [b_1(s), ..., b_K(s)]^T denotes a spline basis on D and Θ is a symmetric K × K matrix. Leveraging the computational efficiency and flexibility of penalized spline smoothers, this line of work has extended FPCA to an impressive variety of complex data structures (Di et al., 2014; Cederbaum et al., 2018).

In this paper we present a result on finite-rank linear integral operators, and discuss two important ramifications thereof for functional principal component analysis by tensor product splines. The ideas are illustrated with two very different data sets based on magnetic resonance imaging of the human brain: one is from a study of development of the cerebral cortex in children and adolescents, and the other concerns white matter microstructure in adults.

∗ Corresponding author. E-mail address: [email protected] (P.T. Reiss). https://doi.org/10.1016/j.jspi.2019.10.006


2. A result on finite-rank symmetric integral operators

In what follows let F = L²(D), where D is a compact subset of a Euclidean space. Given a function C ∈ L²(D × D), the integral operator 𝒞 : F → F with kernel C is given by
$$(\mathcal{C}h)(s) = \int_{D} C(s,t)\,h(t)\,dt \qquad (2)$$
for h ∈ F. The following result, proved in Appendix A, is the key to most of our development.

Theorem 1. Suppose there exist linearly independent functions ψ_1, ..., ψ_K ∈ F and a real symmetric matrix R = (r_ij)_{1≤i,j≤K} such that for s, t ∈ D,
$$C(s,t) = \sum_{i=1}^{K}\sum_{j=1}^{K} r_{ij}\,\psi_i(s)\,\psi_j(t). \qquad (3)$$
Let ψ(s) = [ψ_1(s), ..., ψ_K(s)]^T and let G be the K × K Gram matrix with (i, j) entry g_ij ≡ ∫_D ψ_i(s)ψ_j(s) ds. Then, for λ ≠ 0, φ ∈ F, and 𝒞 given by (2), 𝒞φ = λφ if and only if
$$\phi(s) = v^T\psi(s) \qquad (4)$$
for a vector v ∈ R^K satisfying
$$RGv = \lambda v. \qquad (5)$$

In other words, (4) establishes a correspondence between eigenfunctions φ(·) of 𝒞 and eigenvectors v of RG associated with a given nonzero eigenvalue. For non-repeated eigenvalues of RG, this is a one-to-one correspondence.
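As a quick sanity check (a worked special case added here for illustration; it is not part of the original development), consider K = 1, so that C(s, t) = r ψ(s)ψ(t) for a single function ψ and scalar r:

```latex
% With K = 1, write g = \int_D \psi(t)^2\,dt. Then for the operator in (2),
\[
(\mathcal{C}\psi)(s) \;=\; \int_D r\,\psi(s)\,\psi(t)\,\psi(t)\,dt
\;=\; r\Big(\int_D \psi(t)^2\,dt\Big)\,\psi(s) \;=\; (r g)\,\psi(s),
\]
% so the single nonzero eigenvalue is \lambda = r g with eigenfunction \phi = \psi,
% which is exactly the statement RGv = \lambda v with R = r, G = g and v = 1 in (4)-(5).
```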

3. Simplified estimation of the eigenfunctions

Our first application of Theorem 1 is to simplify estimation of the covariance operator's eigenfunctions. In the classical longitudinal data setup with n repeatedly observed individuals, the ith individual has responses y_i1, ..., y_im_i measured at time points t_i1, ..., t_im_i. The functional data framework casts these responses as a sample y_i(t_i1), ..., y_i(t_im_i) from the underlying function y_i(·). Following Staniswalis and Lee (1998), Yao et al. (2005) first obtain a mean function estimate μ̂(·); then each value
$$[y_i(t_{ij_1}) - \hat\mu(t_{ij_1})]\,[y_i(t_{ij_2}) - \hat\mu(t_{ij_2})], \qquad (6)$$
for i = 1, ..., n and distinct j_1, j_2 ∈ {1, ..., m_i}, has approximate expectation C(t_{ij_1}, t_{ij_2}), and therefore the set of these values can be smoothed to obtain an estimate Ĉ(·,·) of the covariance function. Yao et al. (2005) then estimate the eigenfunctions of the covariance operator by matrix eigendecomposition of a discretized version of Ĉ. This approach has been followed by authors who have used tensor product splines for covariance smoothing. But as we now show, estimation of C(·,·) by tensor product splines renders the discretization step superfluous.

Assume we have obtained a tensor product spline estimate of the covariance function (1), that is, Ĉ(s,t) = b(s)^T Θ̂ b(t) for a symmetric K × K matrix Θ̂; methods for doing this are considered below in Section 4. Our goal is then to eigendecompose the estimated covariance operator 𝒞̂, defined by the right side of (2) but with C replaced by Ĉ. Theorem 1 reduces this task to finding the eigenvalue–eigenvector pairs (λ_1, v_1), ..., (λ_K, v_K) of Θ̂G, where G is the Gram matrix [∫_D b_i(s)b_j(s) ds]_{1≤i,j≤K}. Then, for k = 1, ..., K, φ_k(·) = b(·)^T v_k is an eigenfunction of 𝒞̂ with corresponding eigenvalue λ_k, and indeed these are all the eigenfunctions of 𝒞̂ corresponding to nonzero eigenvalues. We can thus represent the eigenfunctions using coefficients with respect to a spline basis, as did Ramsay and Silverman (2005), who frame the problem very differently but also estimate the principal components by solving a matrix eigenproblem for the spline coefficients. This common representation facilitates comparison of alternative approaches to FPCA.
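The following R sketch illustrates the computation just described. It is an illustration only, not the authors' code: the B-spline basis, the crude grid-based quadrature for G, and the placeholder Θ̂ are assumed choices.

```r
## Sketch: eigenfunctions of C-hat(s,t) = b(s)' Theta_hat b(t) via Theorem 1,
## for a cubic B-spline basis on D = [0, 1]. Theta_hat below is a placeholder.
library(splines)
K     <- 10
knots <- seq(0, 1, length.out = K - 2)[-c(1, K - 2)]   # K - 4 interior knots
bfun  <- function(s) bs(s, knots = knots, degree = 3,
                        intercept = TRUE, Boundary.knots = c(0, 1))

sgrid <- seq(0, 1, length.out = 1001)
B     <- bfun(sgrid)                                   # 1001 x K basis matrix
w     <- rep(1 / (length(sgrid) - 1), length(sgrid))   # crude quadrature weights
G     <- t(B) %*% (B * w)                              # Gram matrix of the basis

Theta_hat <- diag(K)                # placeholder for a symmetric K x K estimate
eig    <- eigen(Theta_hat %*% G)    # eigenpairs (lambda_k, v_k) of Theta-hat G
lambda <- Re(eig$values)            # eigenvalues of the estimated operator
phi    <- B %*% Re(eig$vectors)     # phi_k(s) = b(s)' v_k, evaluated on the grid
matplot(sgrid, phi[, 1:4], type = "l",
        xlab = "s", ylab = "leading eigenfunctions")
```

In practice one would also rescale each v_k so that φ_k has unit L² norm, i.e. so that v_k^T G v_k = 1; the eigen() output is normalized in the Euclidean, not the L², sense.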

4. Penalized tensor product spline smoothing of the covariance function

To ensure adequate smoothness, tensor product splines are usually fitted by penalized least squares, and a second application of Theorem 1 is to characterize the shrinkage induced by roughness penalties when estimating the covariance function. We present this in Section 5, after first providing some needed details on covariance smoothing.


4.1. Penalized least squares setup

Combining the first paragraph of Section 3 with (1) yields
$$E\big[\{y_i(t_{ij_1}) - \hat\mu(t_{ij_1})\}\{y_i(t_{ij_2}) - \hat\mu(t_{ij_2})\}\big] \approx b(t_{ij_1})^T\Theta\,b(t_{ij_2}) = [b(t_{ij_2})^T\otimes b(t_{ij_1})^T]\,\theta,$$
where θ ≡ vec(Θ). Thus Θ can be estimated by setting up a regression problem with coefficient vector θ, response vector z consisting of the values (6) for all i and distinct j_1, j_2, and design matrix X having rows b(t_{ij_2})^T ⊗ b(t_{ij_1})^T arranged in the same order. We can then estimate θ via the penalized least squares estimator
$$\hat\theta = \arg\min_{\theta}\big(\|z - X\theta\|^2 + \theta^T\mathcal{P}\theta\big) = (X^TX + \mathcal{P})^{-1}X^Tz, \qquad (7)$$
where θ^T 𝒫 θ is a roughness penalty that typically has the form
$$\mathcal{P} = \alpha_s(P\otimes M) + \alpha_t(M\otimes P) \qquad (8)$$
for symmetric K × K matrices M, P. For example, taking M = I_K and P to be a difference penalty matrix yields the P-spline penalty of Eilers and Marx (2003) and Currie et al. (2006). Alternatively, if P induces a univariate roughness penalty and M is the Gram matrix with (i, j) entry
$$m_{ij} = \int_{D} b_i(s)\,b_j(s)\,ds, \qquad (9)$$
then, as shown in Appendix B, we obtain an exact version of the tensor product penalty proposed by Wood (2006), as opposed to the approximate version that he derives.
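As an illustration of this setup, here is a minimal R sketch (not the authors' code). It reuses bfun, G and K from the sketch in Section 3, and treats resid_prod — a hypothetical data frame with columns t1, t2 and the products (6) in column z — and the fixed smoothing parameters alpha_s, alpha_t as given; in practice the smoothing parameters would be chosen by REML, as discussed in Section 6.

```r
## Design matrix with rows b(t_ij2)' kron b(t_ij1)', and the estimator (7).
build_row <- function(t1, t2) as.numeric(kronecker(bfun(t2), bfun(t1)))
X <- t(mapply(build_row, resid_prod$t1, resid_prod$t2))   # n_pairs x K^2
z <- resid_prod$z

## Penalty (8): second-difference marginal penalty P0, with M = I_K (P-splines)
## or M = G (the exact Wood-type penalty of Appendix B).
D2  <- diff(diag(K), differences = 2)
P0  <- crossprod(D2)
M   <- diag(K)                                            # or M <- G
Pen <- alpha_s * kronecker(P0, M) + alpha_t * kronecker(M, P0)

theta_hat <- solve(crossprod(X) + Pen, crossprod(X, z))
Theta_hat <- matrix(theta_hat, K, K)                      # coefficient matrix in (1)
```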

4.2. "Square" versus "triangle" smoothing

The above covariance smoothing procedure usually includes the "responses" (6) for all (t_{ij_1}, t_{ij_2}) ∈ D² with j_1 ≠ j_2, with the diagonal treated separately due to the assumed presence of measurement error. As an alternative to smoothing on the entire square region D², one can exploit symmetry and perform smoothing only on those points for which t_{ij_1} > t_{ij_2}. In this way one obtains an estimate Ĉ(s, t) for the triangular region T = {(s, t) ∈ D² : s > t}, which can be extended to all of D² by taking Ĉ(s, t) = Ĉ(t, s) for s < t. Such an approach, implemented by Fabian Scheipl, is currently available as an option in the refund package (Goldsmith et al., 2018) for R (R Core Team, 2018). These two smoothing strategies will henceforth be referred to as "square" and "triangle" smoothing, and the resulting covariance function estimates and operators will be denoted by Ĉ_sq, 𝒞̂_sq and by Ĉ_tr, 𝒞̂_tr, respectively. In the next two sections, we show that the effect of penalization can be very different in these two approaches.
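The two strategies can be sketched as follows. This is again an illustration only, using mgcv's tensor product smooths, the hypothetical resid_prod above, and sgrid from the Section 3 snippet; the refund package provides an implementation of both strategies, which also handles the diagonal and the subsequent eigendecomposition.

```r
## "Square" vs. "triangle" covariance smoothing with tensor product smooths.
library(mgcv)
sq_dat <- resid_prod                        # all off-diagonal pairs (t1 != t2)
tr_dat <- subset(resid_prod, t1 > t2)       # lower triangle only

fit_sq <- gam(z ~ te(t1, t2, k = 10), data = sq_dat, method = "REML")
fit_tr <- gam(z ~ te(t1, t2, k = 10), data = tr_dat, method = "REML")

## Evaluate on a grid; extend the triangle fit to all of D^2 by symmetry.
grd  <- expand.grid(t1 = sgrid, t2 = sgrid)
C_sq <- matrix(predict(fit_sq, grd), length(sgrid))
refl <- data.frame(t1 = pmax(grd$t1, grd$t2), t2 = pmin(grd$t1, grd$t2))
C_tr <- matrix(predict(fit_tr, refl), length(sgrid))
```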

5. Direction of penalization

A popular special case of the penalty (8) is a second-derivative penalty in both the s- and the t-direction, i.e.,
$$\alpha_s\int_{D}\!\int_{D}\left[\frac{\partial^2}{\partial s^2}C(s,t)\right]^2 ds\,dt \;+\; \alpha_t\int_{D}\!\int_{D}\left[\frac{\partial^2}{\partial t^2}C(s,t)\right]^2 ds\,dt \qquad (10)$$
(Wood, 2006). Such a penalty, or alternatively a second-difference penalty, shrinks toward bilinear functions of s, t. As we now show, the impact of this shrinkage differs for the two approaches described in Section 4.2.

5.1. Square smoothing

For a symmetric function on D² such as the covariance, bilinear functions of s, t, i.e. functions that are not penalized by (10), have the form
$$C(s,t) = a + b(s+t) + cst = \psi(s)^T R\,\psi(t) \qquad (11)$$
where
$$\psi(s) = \begin{bmatrix} 1 \\ s \end{bmatrix} \quad\text{and}\quad R = \begin{bmatrix} a & b \\ b & c \end{bmatrix}. \qquad (12)$$

Chen et al. (2019) note that the traditional random intercept and slope model has covariance of the form (11) (see their Eq. (2)), and propose a formal test of that model.


If we let
$$G = \begin{bmatrix} \int_D 1\,ds & \int_D s\,ds \\[2pt] \int_D s\,ds & \int_D s^2\,ds \end{bmatrix},$$
then by Theorem 1, a covariance operator with kernel C(·,·) of the form (11), (12) has at most two distinct eigenfunctions associated with positive eigenvalues, namely
$$\phi_1(s) = p_1 + q_1 s \quad\text{and}\quad \phi_2(s) = p_2 + q_2 s, \qquad (13)$$
where (p_1, q_1)^T, (p_2, q_2)^T are eigenvectors of RG. Hence square smoothing with penalty (10) shrinks the eigenfunctions toward the two-dimensional space of linear functions on D. It is straightforward to prove analogous results for other derivative or difference penalties.
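For concreteness, a small numerical illustration (added here; the values of a, b, c are arbitrary choices for which R is positive definite, not taken from the paper):

```r
## Eigenstructure of the limiting "bilinear" covariance (11)-(12) on D = [0, 1]:
## the eigenvectors of RG give the coefficients of the linear eigenfunctions (13).
a <- 1; b <- 0.4; c <- 0.8
R <- matrix(c(a, b,
              b, c), 2, 2)
G <- matrix(c(1,   1/2,
              1/2, 1/3), 2, 2)     # Gram matrix of (1, s) on [0, 1]
eig <- eigen(R %*% G)
eig$values                         # the two nonzero eigenvalues of the operator
eig$vectors                        # column k gives (p_k, q_k): phi_k(s) = p_k + q_k * s
```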

5.2. Triangle smoothing

When smoothing is performed on T rather than on all of D², the general form for a function that is not penalized by (10) is a + bs + ct + dst rather than (11). When such a function is extended to D² by symmetry, the resulting function can be written as C(s, t) = a + b min{s, t} + c max{s, t} + dst, or equivalently as
$$C(s,t) = a + c(s+t) + \Delta\min\{s,t\} + d\,st, \qquad (14)$$
where Δ = b − c. Such a function has a ridge along the line s = t and two bilinear pieces on either side. Assume for simplicity that D = [0, 1]. In Appendix C we show that for a covariance operator with kernel of the form (14) and eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ 0,
$$\lambda_j \ge 4\Delta\,[(2j+3)\pi]^{-2} \qquad (15)$$
for all j ≥ 1.
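A quick numerical illustration of the bound (15) (added here; the kernel parameters below are arbitrary values that make C nonnegative definite):

```r
## Discretize the kernel (14) on a fine grid and compare its eigenvalues to (15).
a <- 1; cc <- 0.5; d <- 1; Delta <- 1
s <- seq(0, 1, length.out = 400)
Cmat <- outer(s, s, function(u, v) a + cc * (u + v) + Delta * pmin(u, v) + d * u * v)
lam  <- eigen(Cmat, symmetric = TRUE)$values * (s[2] - s[1])  # ~ operator eigenvalues
j <- 1:5
cbind(lambda_j = lam[j], lower_bound = 4 * Delta / ((2 * j + 3) * pi)^2)
## each lambda_j comfortably exceeds the corresponding lower bound
```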

In summary, the argument of Section 5.1 implies that penalty (10) shrinks the covariance function estimate Ĉ_sq toward the kernel of a rank-2 covariance operator with linear eigenfunctions. By contrast, (15) shows that Ĉ_tr is shrunk toward the kernel of a covariance operator whose eigenvalue decay is bounded below by a polynomial rate.

6. Examples

The results of Section 5 refer to the limits of Ĉ_sq and of Ĉ_tr as the smoothing parameters α_s, α_t go to infinity. In practical applications, we recommend choosing α_s, α_t to optimize the restricted maximum likelihood (REML) criterion (Ruppert et al., 2003; Reiss and Ogden, 2009; Wood, 2011). Even with such (finite) choices of the smoothing parameters, the different directions of penalization described in Section 5 may be reflected in noticeably different estimates Ĉ_sq, Ĉ_tr and resulting FPC decompositions, as we now illustrate with two examples, both drawn from neuroimaging applications.

6.1. Cortical thickness development

Our first data set was derived from a magnetic resonance imaging (MRI) study of typical brain development conducted at the U.S. National Institute of Mental Health (Giedd et al., 2015). Fig. 1 presents measures of cortical thickness averaged over the brain, for 131 male participants in the study, as a function of age. Each individual was measured at 2–5 different ages, yielding a total of 355 observations.

The top row of Fig. 2 displays the covariance surface estimates Ĉ_sq and Ĉ_tr, obtained with the fpca.sc function in the refund package for R (Goldsmith et al., 2018). For over 90% of the square region, the estimates differ by less than 5%, but the contours (approximately quarter-circles for Ĉ_sq versus broken line segments for Ĉ_tr) make the differences clearer. The REML-optimal smoothing parameters are α_s = α_t = 530423 for square smoothing and α_s = 28249, α_t = 765122 for triangle smoothing. For square smoothing, we have neither a proof that REML always chooses equal smoothing parameters nor a counterexample. On the other hand, it is readily proved that when α_s = α_t, Ĉ_sq is symmetric with respect to its two arguments (see Appendix D). When we reran triangle smoothing with the constraint α_s = α_t, the resulting Ĉ_tr was virtually unchanged; we thus conclude that the difference between the two fits is driven not by the equality of the smoothing parameters in square smoothing, but by the different domains over which smoothing is performed. While smoothing parameter values as high as the α_t estimate above sometimes indicate non-convergence of the underlying REML optimization algorithm, in both cases the output from the R package mgcv (Wood, 2017) confirms that convergence was attained.

The two resulting FPC decompositions are very different, as can be seen in the accompanying plots of the estimated eigenfunctions φ̂_1, φ̂_2, φ̂_3, φ̂_4 scaled by the square roots of the corresponding eigenvalues. Square smoothing yields one highly dominant eigendirection, reflecting shrinkage toward a finite-rank covariance operator as in Section 5.1, and φ̂_1 is approximately linear, as we would expect from (13). Triangle smoothing produces eigenvalues that decrease more gradually, as expected from (15), so that several directions of variation can be discerned.


Fig. 1. Cortical thickness, averaged over the brain, for 131 males, each measured at several ages; observations for the same individual are joined by line segments.

Table 1. First four eigenvalues of the estimated covariance operators based on square smoothing (𝒞̂_sq) versus triangle smoothing (𝒞̂_tr), for the cortical thickness data. The REML column gives the values obtained when the smoothing parameters are chosen to optimize the REML criterion; the other columns give results when a constant (−6, −3, 3, or 6) is added to the REML-based log smoothing parameter values.

               −6        −3        REML      +3        +6
𝒞̂_sq   λ̂_1    0.2797    0.2805    0.2805    0.2805    0.2805
        λ̂_2    0.0006    0.0000    0.0000    0.0000    0.0000
        λ̂_3    0.0001    0.0000    0.0000    0.0000    0.0000
        λ̂_4    0.0000    0.0000    0.0000    0.0000    0.0000
𝒞̂_tr   λ̂_1    0.2738    0.2757    0.2758    0.2758    0.2758
        λ̂_2    0.0060    0.0031    0.0033    0.0033    0.0033
        λ̂_3    0.0010    0.0022    0.0022    0.0022    0.0022
        λ̂_4    0.0005    0.0007    0.0008    0.0008    0.0008

The relevance of our results in Section 5 on the limits of Ĉ_sq, Ĉ_tr as α_s, α_t → ∞ is underlined by Table 1, which displays the eigenvalue estimates that ensue when α_s, α_t are chosen by REML, along with results obtained with much smaller and much larger smoothing parameters (specifically, with log α_s and log α_t equal to their REML-based values plus −6, −3, 3 or 6). It appears that for both approaches, the REML-based values are essentially the values attained in the infinite-smoothing-parameter limit. In sum, the disparate results for Ĉ_sq versus Ĉ_tr, and for the ensuing FPC decompositions, are consistent with our discussion in Section 5.

Fig. 2. Above: Covariance function estimates Ĉ_sq (left) and Ĉ_tr (right) for the cortical thickness data, with locations (t_{ij_1}, t_{ij_2}) of the values (6) that are smoothed to produce these estimates; thus only points with t_{ij_1} > t_{ij_2} are included at right. Below: Scaled eigenfunction estimates √λ̂_j φ̂_j, for j = 1, 2, 3, 4, for the resulting estimated covariance operators 𝒞̂_sq (left) and 𝒞̂_tr (right), with percent variance explained shown in the legends.

6.2. White matter microstructure

Delaigle and Hall (2016) distinguish between two forms of partially observed functional data: the "sparse functional case", in which each curve is observed at a small number of points distributed throughout D, and the "fragmentary functional case", in which the points on each curve are restricted to a small portion of D and thus represent only a fragment of the underlying curve. In the fragmentary case, no curve is observed both near the beginning and near the end of the interval D, and thus there are no "responses" (6) in two corners of the square D² to which smoothing is applied.

Delaigle and Hall (2016) argue, therefore, that the FPCA approach of Yao et al. (2005) is unsuitable for fragment data. This might be raised as an objection to the analysis of Section 6.1, since the cortical thickness data are an example of fragment data.

Our second example does not suffer from this shortcoming. We consider a portion of a diffusion tensor imaging (DTI) data set previously analyzed by Goldsmith et al. (2011) and in numerous subsequent papers on functional data. DTI is an MRI-based imaging modality that seeks to characterize white matter fibers in the brain by tracing water diffusion in the white matter. Our data set, provided with the R package refund, consists of fractional anisotropy (FA), a DTI-derived measure of white matter integrity, at 55 locations along the right corticospinal tract, in 142 individuals. (The full data set includes 382 FA curves from these 142 individuals, but we included only the first curve for each participant.) By randomly sampling subsets of the 55 points on each curve (see Fig. 3), we created artificial but realistic sparse functional data sets with which to examine further the effects of our two smoothing strategies.

Fig. 3. Fractional anisotropy in the right corticospinal tract for 142 individuals. Above: The full curves comprising FA at 55 locations. Below: Sparse functional observations, derived by randomly selecting 5 points on each curve.

Fig. 4. Proportion of variance explained by the first two FPCs, based on the two smoothing methods, for sparse functional data consisting of random subsamples from each of the FA curves.

Fig. 4 presents boxplots of the proportion of variance explained by the first two FPCs, for 50 artificial data sets with 2, 3, 5, 10, ..., 50 points per curve. As seen in the left subfigure, for square smoothing with 2–3 points per curve, the first two FPCs usually explain virtually all the variance. This suggests that for very sparse data, square smoothing shrinks the covariance almost entirely toward the two-dimensional null space of the penalty, as described in Section 5.1. As the number of observations per curve increases, the variance explained levels off a bit below 70%. The right subfigure shows that for triangle smoothing as well, as the functions are observed more densely, the proportion of variance explained by the first two FPCs decreases substantially.

As we would expect from Section 5.2, this proportion is markedly lower for triangle than for square smoothing; the disparity is seen not only for very sparse data, but persists for densely observed data, for which the first two FPCs explain only about 60% of the variance.

Fig. 5 displays a typical example of the two contrasting eigendecompositions, with 5 randomly sampled points per FA curve. The eigendecomposition of 𝒞̂_sq is very much consistent with the development in Section 5.1: almost all of the variance is explained by two eigenfunctions, which are essentially linear as in (13). The eigenfunctions of 𝒞̂_tr, on the other hand, are nonlinear, and about 8% of the variance is left unexplained by the first two.


Fig. 5. Covariance function estimates Ĉ_sq, Ĉ_tr, and associated scaled eigenfunction estimates √λ̂_j φ̂_j, as in Fig. 2, but here for the FA data with 5 points per curve.

The fact that our empirical results were consistent with the theory of Section 5, for both of these two very different examples, suggests that our results regarding shrinkage may be highly relevant to a variety of real-data applications of FPCA.

7. Discussion

We have presented a result on finite-rank symmetric integral operators, and derived two consequences for FPCA via tensor product splines. First, we reduced the eigendecomposition of an estimated covariance operator to a matrix eigendecomposition (Section 3). Second, we demonstrated that two different penalized smoothing methods for the covariance shrink in quite different directions (Section 5). The real-data examples in Section 6 illustrate how these differences can be important in practice. In view of the (empirical) Bayesian interpretation of penalized smoothing, when the functional data are sampled very sparsely the impact of the prior is especially pronounced, and thus the FPCA results should be treated with caution.

Our examination of penalization in covariance smoothing has focused on penalties of type (8), which are quite standard for tensor product spline smoothing, and in particular on (10). In thin plate spline smoothing (e.g., Green and Silverman, 1994), the penalty functional includes not only the partial derivatives in (10) but also the mixed partial derivative ∂²C(s,t)/∂s∂t. Such a penalty could be applied with tensor product splines as well, but the generic unpenalized function would then be C(s,t) = a + b(s+t) rather than (11). It is readily shown that such a function is not a valid covariance function unless b = 0, i.e., unless C(s,t) is a constant.


Disparity in results of the sort seen in Figs. 2 and 5 can have significant scientific implications. In applications such as that of Section 6.1, an unwarranted assumption that all individuals have parallel trajectories can cause one to overlook important variation in the timing of development (Thompson et al., 2011; Reiss et al., 2016). A single dominant eigendirection, as found by square smoothing for this example, is a weaker condition than parallel trajectories, but a highly restrictive one nonetheless. Thus the triangle smoothing approach, which finds multiple non-negligible eigendirections, may in this case reveal developmental variation that would otherwise be missed. On the other hand, given the relative sparsity of the longitudinal observations, it is possible that these data do not support precise estimation of more than one eigendirection; in this case, function reconstructions derived from triangle smoothing might be modeling noise rather than signal.

A more general argument in favor of square smoothing is that penalizing the covariance toward having two linear eigenfunctions (13) may be seen as a natural counterpart of traditional scatterplot smoothing with a second-derivative penalty, which shrinks toward a linear fit. Moreover, since the classical random intercept/random slope model has covariance function of the form (11), as noted above and by Chen et al. (2019), the square smoothing approach of shrinking toward (11) could be viewed as a natural bridge between the FPCA and linear mixed-effects approaches to longitudinal data.

The above remarks, taken together, do not point to a clear preference for either square or triangle smoothing. What does seem clear is that more work is needed to better understand the properties of functional principal component estimators based on tensor product smoothing.

Acknowledgments

This work was supported in part by grant 1777/16 from the Israel Science Foundation. Many thanks to Uriah Finkel, Fabian Scheipl, Luo Xiao and David Zucker for helpful discussions, and to Aaron Alexander-Bloch, Jay Giedd and Armin Raznahan for providing the cortical thickness data. The DTI data were collected at Johns Hopkins University and the Kennedy-Krieger Institute. Theorem 1 was inspired by a similar result derived in unpublished lecture notes of Ilya Goldsheid.

Appendix A. Proof of Theorem 1

By (3),
$$C(s,t) = \sum_{i=1}^{K}\psi_i(s)\,f_i(t) \qquad (A.1)$$
where
$$f_i(t) = \sum_{j=1}^{K} r_{ij}\,\psi_j(t). \qquad (A.2)$$
If φ is an eigenfunction of the operator 𝒞 given by (2) with associated eigenvalue λ ≠ 0, then
$$\phi(s) = \lambda^{-1}(\mathcal{C}\phi)(s) = \lambda^{-1}\int_{D}\sum_{i=1}^{K}\psi_i(s)\,f_i(t)\,\phi(t)\,dt = \sum_{i=1}^{K} v_i\,\psi_i(s) \qquad (A.3)$$
where
$$v_i = \lambda^{-1}\int_{D} f_i(t)\,\phi(t)\,dt$$
for i = 1, ..., K. We claim that v = (v_1, ..., v_K)^T satisfies (5). To see this, observe that for all s,
$$\sum_{i=1}^{K} v_i\,\psi_i(s) = \lambda^{-1}\int_{D} C(s,t)\,\phi(t)\,dt
= \lambda^{-1}\int_{D}\sum_{i=1}^{K}\psi_i(s)\,f_i(t)\sum_{j=1}^{K} v_j\,\psi_j(t)\,dt
= \lambda^{-1}\sum_{i=1}^{K}\sum_{j=1}^{K}\left[v_j\int_{D} f_i(t)\,\psi_j(t)\,dt\right]\psi_i(s),$$
by (A.1) and (A.3). For each i, equating the ψ_i(s) coefficients of the first and last expressions above gives
$$\lambda v_i = \sum_{j=1}^{K}\left[v_j\int_{D} f_i(t)\,\psi_j(t)\,dt\right].$$
Thus v is an eigenvector, with associated eigenvalue λ, of the K × K matrix with (i, j) entry
$$\int_{D} f_i(t)\,\psi_j(t)\,dt = \int_{D}\sum_{k=1}^{K} r_{ik}\,\psi_k(t)\,\psi_j(t)\,dt = \sum_{k=1}^{K} r_{ik}\,g_{kj},$$
by (A.2). But this matrix is simply RG, so the claim is proved.

Conversely, given λ ≠ 0 and v ∈ R^K satisfying (5), the following argument shows that φ given by (4) is an eigenfunction of 𝒞 with eigenvalue λ:
$$(\mathcal{C}\phi)(s) = \int_{D} C(s,t)\,\phi(t)\,dt
= \int_{D}\sum_{i=1}^{K} f_i(t)\,\psi_i(s)\sum_{k=1}^{K} v_k\,\psi_k(t)\,dt
= \sum_{i=1}^{K}\sum_{k=1}^{K}\left[\int_{D} f_i(t)\,v_k\,\psi_k(t)\,dt\right]\psi_i(s)$$
$$= \sum_{i=1}^{K}\sum_{k=1}^{K}\left[\int_{D}\sum_{j=1}^{K} r_{ij}\,\psi_j(t)\,v_k\,\psi_k(t)\,dt\right]\psi_i(s)
= \sum_{i=1}^{K}\sum_{k=1}^{K}\sum_{j=1}^{K} r_{ij}\left(\int_{D}\psi_j(t)\,\psi_k(t)\,dt\right)v_k\,\psi_i(s)$$
$$= \sum_{i=1}^{K}\sum_{k=1}^{K}(RG)_{ik}\,v_k\,\psi_i(s)
= \sum_{i=1}^{K}(\lambda v_i)\,\psi_i(s)
= \lambda\,\phi(s).$$

Appendix B. Exact formula for Wood's tensor product penalty

For a general bivariate function C : D × D → R, of the form (1) but not necessarily symmetric, the penalty proposed by Wood (2006) is
$$\alpha_s\int_{D} J_s[C(\cdot,t)]\,dt + \alpha_t\int_{D} J_t[C(s,\cdot)]\,ds, \qquad (B.1)$$
where J_s(·), J_t(·) are roughness functionals for functions on D. The claim at the end of Section 4.1 is that the general penalty (8), with appropriate P and with M given by (9), yields (B.1). It suffices to show that if J_s(g) = J_t(g) = β^T P β, for some symmetric K × K matrix P and for any g of the form g(v) = b(v)^T β, then
$$\int_{D} J_s[C(\cdot,t)]\,dt = \theta^T(M\otimes P)\theta \qquad (B.2)$$
and
$$\int_{D} J_t[C(s,\cdot)]\,ds = \theta^T(P\otimes M)\theta. \qquad (B.3)$$

To prove (B.2), note that the above expression for J_s, together with (1), implies that J_s[C(·,t)] = b(t)^T Θ^T P Θ b(t) for all t ∈ D. Thus, by standard results on the vec operator and Kronecker products,
$$\int_{D} J_s[C(\cdot,t)]\,dt = \int_{D} b(t)^T\Theta^T P\,\Theta\,b(t)\,dt
= \operatorname{tr}\left[\Theta^T P\,\Theta\int_{D} b(t)\,b(t)^T dt\right]
= \operatorname{tr}(\Theta^T P\,\Theta M)$$
$$= \operatorname{vec}(\Theta)^T\operatorname{vec}(P\,\Theta M)
= \operatorname{vec}(\Theta)^T\left[(M\otimes P)\operatorname{vec}(\Theta)\right]
= \theta^T(M\otimes P)\theta,$$
establishing (B.2). Likewise, (B.3) follows by noting that J_t[C(s,·)] = b(s)^T Θ P Θ^T b(s) for all s ∈ D; the details are very similar and are therefore omitted. For a slightly more general characterization of the penalty (B.1), see Theorem 1 of Reiss et al. (2014).

Appendix C. Proof of (15)

The covariance operator with kernel (14) is 𝒞 = 𝒞_1 + 𝒞_2, where 𝒞_1 is the integral operator with kernel a + c(s+t) + dst and 𝒞_2 is the integral operator with kernel Δ min{s, t}. Arguing as in Section 5.1, 𝒞_1 is of rank at most 2 and thus has an eigendecomposition
$$\mathcal{C}_1 = \lambda_1(\mathcal{C}_1)\,\phi_1^{\mathcal{C}_1}\otimes\phi_1^{\mathcal{C}_1} + \lambda_2(\mathcal{C}_1)\,\phi_2^{\mathcal{C}_1}\otimes\phi_2^{\mathcal{C}_1}, \qquad (C.1)$$
where λ_j(·) denotes the jth largest eigenvalue of an operator. By the nonnegative definiteness of 𝒞, for all f ∈ F orthogonal to φ_1^{𝒞_1}, φ_2^{𝒞_1} we obtain
$$0 \le \langle\mathcal{C}f, f\rangle = \langle\mathcal{C}_2 f, f\rangle. \qquad (C.2)$$
But 𝒞_2 is Δ times the covariance operator of the standard Brownian motion process on [0, 1]; hence (C.2) implies Δ ≥ 0, and 𝒞_2 is a nonnegative definite operator with λ_k(𝒞_2) = 4Δ[(2k − 1)π]^{-2} for each k ≥ 1 (Hsing and Eubank, 2015, pp. 118–119).

The argument then proceeds similarly to the proof of Theorem 3.22 of Schott (2016). Let φ_k^𝒞 be the eigenfunction of 𝒞 corresponding to eigenvalue λ_k(𝒞), and let "g ⊥ {...}" be a shorthand meaning that g ∈ F is orthogonal to each of the functions within the curly brackets. For each j ≥ 1,
$$\lambda_j(\mathcal{C}) = \max\{\langle\mathcal{C}g,g\rangle : g\perp\{\phi_1^{\mathcal{C}},\dots,\phi_{j-1}^{\mathcal{C}}\},\ \|g\|=1\}$$
$$\ge \max\{\langle\mathcal{C}g,g\rangle : g\perp\{\phi_1^{\mathcal{C}},\dots,\phi_{j-1}^{\mathcal{C}},\phi_1^{\mathcal{C}_1},\phi_2^{\mathcal{C}_1}\},\ \|g\|=1\}$$
$$= \max\{\langle\mathcal{C}_2 g,g\rangle : g\perp\{\phi_1^{\mathcal{C}},\dots,\phi_{j-1}^{\mathcal{C}},\phi_1^{\mathcal{C}_1},\phi_2^{\mathcal{C}_1}\},\ \|g\|=1\}$$
$$\ge \min_{f_1,\dots,f_{j+1}\in F}\ \max\{\langle\mathcal{C}_2 g,g\rangle : g\perp\{f_1,\dots,f_{j+1}\},\ \|g\|=1\}$$
$$= \lambda_{j+2}(\mathcal{C}_2) = 4\Delta\,[(2j+3)\pi]^{-2},$$
where the second equality holds since g ⊥ {φ_1^{𝒞_1}, φ_2^{𝒞_1}} implies 𝒞_1 g = 0 by (C.1), and the second-last equality follows from the Courant–Fischer theorem.

Appendix D. Square smoothing with equal smoothing parameters

Let θ* = vec(Θ^T), and let penSSR(θ) denote the penalized sum of squared residuals criterion in (7). We show here that, for square smoothing with α_s = α_t in (8), penSSR[(θ + θ*)/2] ≤ penSSR(θ) for arbitrary θ. This implies that if the two smoothing parameters in (8) are equal, the square smoothing estimate must be symmetric, since for any asymmetric coefficient matrix Θ, the symmetric matrix (Θ + Θ^T)/2 attains a lower value of the criterion being minimized in (7). The proof uses the vec operator and Kronecker products as in Appendix B, and consists of showing that for any θ ∈ R^{K²}, (i) ‖z − X(θ + θ*)/2‖² ≤ ‖z − Xθ‖², and (ii) if 𝒫 = α[(P ⊗ M) + (M ⊗ P)] [i.e., (8) with α = α_s = α_t], then
$$\tfrac{1}{4}(\theta + \theta^*)^T\mathcal{P}(\theta + \theta^*) \le \theta^T\mathcal{P}\theta. \qquad (D.1)$$

To prove (i) we first note that in square smoothing, each "response" (6) appears twice in the penalized regression of Section 4.1: once with fitted value [b(t_{ij_2})^T ⊗ b(t_{ij_1})^T]θ = b(t_{ij_1})^T Θ b(t_{ij_2}) and once with fitted value [b(t_{ij_1})^T ⊗ b(t_{ij_2})^T]θ = b(t_{ij_2})^T Θ b(t_{ij_1}) = b(t_{ij_1})^T Θ^T b(t_{ij_2}). Thus the contribution of a given response (6), say z, to the sum of squared residuals is
$$\left[z - b(t_{ij_1})^T\Theta\,b(t_{ij_2})\right]^2 + \left[z - b(t_{ij_1})^T\Theta^T b(t_{ij_2})\right]^2
\ge 2\left[z - b(t_{ij_1})^T\{(\Theta+\Theta^T)/2\}\,b(t_{ij_2})\right]^2$$
$$= \left[z - \{b(t_{ij_2})^T\otimes b(t_{ij_1})^T\}(\theta+\theta^*)/2\right]^2 + \left[z - \{b(t_{ij_1})^T\otimes b(t_{ij_2})^T\}(\theta+\theta^*)/2\right]^2,$$
where the inequality is by the identity (z − a)² + (z − b)² ≥ 2[z − (a + b)/2]². Summing over all the responses z gives (i). To prove (ii) we note that θ^T(P ⊗ M)θ = θ*^T(M ⊗ P)θ* and θ^T(M ⊗ P)θ = θ*^T(P ⊗ M)θ*, from which it follows that θ^T 𝒫 θ = θ*^T 𝒫 θ*. Consequently, the right side of (D.1) can be written as ½(θ^T 𝒫 θ + θ*^T 𝒫 θ*), and thus the right side minus the left side can be shown to equal ¼(θ − θ*)^T 𝒫 (θ − θ*), which is non-negative as required.

References

Cederbaum, J., Scheipl, F., Greven, S., 2018. Fast symmetric additive covariance smoothing. Comput. Statist. Data Anal. 120, 25–41.
Chen, S.T., Xiao, L., Staicu, A.-M., 2019. A smoothing-based goodness-of-fit test of covariance for functional data. Biometrics 75, 562–571.
Currie, I.D., Durban, M., Eilers, P.H.C., 2006. Generalized linear array models with applications to multidimensional smoothing. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 (2), 259–280.
Delaigle, A., Hall, P., 2016. Approximating fragmented functional data by segments of Markov chains. Biometrika 103 (4), 779–799.
Di, C., Crainiceanu, C.M., Caffo, B.S., Punjabi, N.M., 2009. Multilevel functional principal component analysis. Ann. Appl. Stat. 3 (1), 458.
Di, C., Crainiceanu, C.M., Jank, W.S., 2014. Multilevel sparse functional principal component analysis. Stat 3 (1), 126–143.
Eilers, P.H.C., Marx, B.D., 2003. Multivariate calibration with temperature interaction using two-dimensional penalized signal regression. Chemometr. Intell. Lab. Syst. 66 (2), 159–174.
Giedd, J.N., Raznahan, A., Alexander-Bloch, A., Schmitt, E., Gogtay, N., Rapoport, J.L., 2015. Child Psychiatry Branch of the National Institute of Mental Health longitudinal structural magnetic resonance imaging study of human brain development. Neuropsychopharmacology 40 (1), 43–49.
Goldsmith, J., Bobb, J., Crainiceanu, C.M., Caffo, B., Reich, D., 2011. Penalized functional regression. J. Comput. Graph. Statist. 20 (4), 830–851.
Goldsmith, J., Greven, S., Crainiceanu, C., 2013. Corrected confidence bands for functional data using principal components. Biometrics 69 (1), 41–51.
Goldsmith, J., Scheipl, F., Huang, L., Wrobel, J., Gellar, J., Harezlak, J., McLean, M.W., Swihart, B., Xiao, L., Crainiceanu, C., Reiss, P.T., 2018. refund: Regression with functional data. R package version 0.1-17. URL https://CRAN.R-project.org/package=refund.
Green, P.J., Silverman, B.W., 1994. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, Boca Raton, Florida.
Hsing, T., Eubank, R., 2015. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, Chichester, UK.
R Core Team, 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Ramsay, J.O., Dalzell, C.J., 1991. Some tools for functional data analysis (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 53, 539–572.
Ramsay, J.O., Silverman, B.W., 2005. Functional Data Analysis, second ed. Springer, New York.
Reiss, P.T., Crainiceanu, C.M., Thompson, W.K., Huo, L., 2016. Modeling change in the brain: methods for cross-sectional and longitudinal data. In: Ombao, H., Lindquist, M., Thompson, W., Aston, J. (Eds.), Handbook of Neuroimaging Data Analysis. Chapman and Hall/CRC, pp. 467–494.
Reiss, P.T., Huang, L., Chen, H., Colcombe, S., 2014. Varying-smoother models for functional responses. arXiv preprint arXiv:1412.0778.
Reiss, P.T., Ogden, R.T., 2009. Smoothing parameter selection for a class of semiparametric linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 (2), 505–523.
Rice, J.A., Silverman, B.W., 1991. Estimating the mean and covariance structure nonparametrically when the data are curves. J. R. Stat. Soc. Ser. B Stat. Methodol. 53 (1), 233–243.
Ruppert, D., Wand, M.P., Carroll, R.J., 2003. Semiparametric Regression. Cambridge University Press, New York.
Schott, J.R., 2016. Matrix Analysis for Statistics, third ed. John Wiley & Sons, Hoboken, New Jersey.
Silverman, B.W., 1996. Smoothed functional principal components analysis by choice of norm. Ann. Statist. 24 (1), 1–24.
Staniswalis, J., Lee, J., 1998. Nonparametric regression analysis of longitudinal data. J. Amer. Statist. Assoc. 93 (444), 1403–1418.
Thompson, W.K., Hallmayer, J., O'Hara, R., 2011. Design considerations for characterizing psychiatric trajectories across the lifespan: Application to effects of APOE-e4 on cerebral cortical thickness in Alzheimer's disease. Am. J. Psychiatry 168 (9), 894–903.
Wood, S.N., 2006. Low-rank scale-invariant tensor product smooths for generalized additive mixed models. Biometrics 62 (4), 1025–1036.
Wood, S.N., 2011. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 73 (1), 3–36.
Wood, S.N., 2017. Generalized Additive Models: An Introduction with R, second ed. CRC Press, Boca Raton, Florida.
Xiao, L., Li, C., Checkley, W., Crainiceanu, C., 2018. Fast covariance estimation for sparse functional data. Stat. Comput. 28 (3), 511–522.
Yao, F., Müller, H., Wang, J., 2005. Functional data analysis for sparse longitudinal data. J. Amer. Statist. Assoc. 100, 577–590.
