Randomized Sketches for Sparse Additive Models
Fode Zhang (a), Xuejun Wang (b), Rui Li (c)*, and Heng Lian (d)

(a) Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China
(b) School of Mathematical Sciences, Anhui University, Hefei, China
(c) School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai, China
(d) Department of Mathematics, City University of Hong Kong
Abstract

Sparse additive models in the setting of reproducing kernel Hilbert spaces have theoretically optimal properties even in high dimensions. However, their computational cost is heavy, especially since the fitting involves two regularization parameters associated with different norms. In the big data setting, fitting the model becomes infeasible. As a first attempt to address this issue, we consider the fixed-dimensional setting and propose a randomized sketching approach for sparse additive models. It is shown that the sketched estimator has the same optimal convergence rate as the standard estimator. Monte Carlo examples are presented to illustrate the performance of the estimators.
* Email: [email protected]
Keywords: Convergence rate; Kernel method; Random projection; Sparse additive models.
1 Introduction
Statistical estimation of multivariate functions is generally difficult to interpret and may suffer from the curse of dimensionality. Semiparametric models that involve only the estimation of univariate functions in settings with multiple predictors are thus popular. This motivated the proposal of additive models, in which each predictor can enter the model through a flexible nonparametric component (Hastie and Tibshirani, 1990) while only univariate functions need to be estimated. On the other hand, even when a great many predictors are available for inclusion in the initial model, many of them may not be relevant, and including them decreases prediction accuracy. Indeed, one popular way out in the literature is to assume that only a few of the component functions are nonzero in the true model and to use sparsity-inducing penalties to identify the nonzero functions (Lin and Zhang, 2006; Ravikumar et al., 2008; Meier et al., 2009; Xue, 2009; Huang et al., 2010).

Different approaches for additive models have been proposed in the literature, including regression splines, smoothing splines, and kernel methods. Furthermore, adopting the reproducing kernel Hilbert space (RKHS) framework, theoretical results for sparse additive models have gained increasing interest (Lin and Zhang, 2006; Meier et al., 2009). The RKHS framework is very general, including linear regression, polynomial regression, and smoothing splines as special cases. Its mathematical elegance has also led to several developments on the optimal convergence rates of the model in high dimensions, with p possibly larger than the sample size n (Koltchinskii and Yuan, 2010; Raskutti et al., 2012; Suzuki and Sugiyama, 2013), although we focus on the fixed-dimensional setting in the current paper. Rather, we are interested in the case where n is so large that computational efficiency in fitting the model is the main concern. We also adopt the RKHS framework in this work.

Fitting sparse additive models with penalty terms typically requires an iterative algorithm, and thus it is not clear how to give an explicit computational complexity bound as in classical computation theory. However, one can note that even for kernel ridge regression with a single function, the standard implementation requires computing the inverse of an n x n kernel matrix and has a complexity of O(n^3), where n is the sample size. Thus model fitting becomes infeasible in the big data setting, which has attracted a lot of attention recently. One general approach to dealing with large data with a theoretical guarantee is the divide-and-conquer method, which partitions the entire data set into multiple subsets, computes an estimator on each subset, and finally aggregates the multiple estimators into a final estimator. This approach has been investigated in McDonald et al. (2009); Zinkevich et al. (2010); Zhang et al. (2013, 2015); Balcan et al. (2015); Zhao et al. (2016); Lee et al. (2017); Shi et al. (2017); Banerjee et al. (2017), although its use for sparse additive models has not been studied.

In this paper, we consider approximations of sparse additive models based on random projection, also known as randomized sketches. This method works by projecting the n-dimensional data vectors to an m-dimensional space using a random sketching matrix (Pilanci and Wainwright, 2015, 2016; Wang et al., 2017, 2018). Most closely related to our work is that of Yang et al. (2017) on randomized sketches for kernel ridge regression (KRR), and our results rely on some of the results obtained in that paper. The method of Yang et al. (2017) applies the sketch to the kernel matrix, carefully constructed to retain the minimax optimality of the KRR estimate. Our contribution in this paper is to establish the statistical properties of randomized sketches for sparse additive models. Unlike KRR, here the penalized estimator does not have an explicit form, which makes our proof significantly different from that of Yang et al. (2017), in spite of some partial similarities. Note that even with p = 1 (p is the number of predictors) the model does not reduce to that of KRR, since the latter uses a ridge penalty involving the square of the RKHS norm. In contrast, our penalty contains both the empirical $\ell_2$-norm of the functions evaluated at the observations and the (non-squared) RKHS norm. Our work is a natural extension of random sketches to the case of multiple predictors, which expands the applicability of the method. We regard sketching as an alternative to the more popular divide-and-conquer approach rather than as a method that should replace it. One can also envision future developments that combine the two methods.

Additive models in the RKHS framework have been studied extensively, with many elegant theoretical results in the statistics and machine learning literature. Many other models have been proposed under the RKHS framework, including for example quantile regression, and even non-regression models such as kernel principal component analysis and kernel canonical correlation analysis. Sketching for these models probably requires techniques different from those used in this work in order to establish theoretical properties, which is outside the scope of the current paper.

The rest of the article is organized as follows. In Section 2, we present the sparse additive model and the sketching method based on a random projection matrix. In Section 3, we establish the optimal convergence rate of the penalized estimator. Section 4 reports some simulation studies demonstrating the statistical and computational properties of the sketched estimator. We conclude with some discussion in Section 5. Proofs of some technical lemmas are relegated to the appendix.
2 Random sketches for sparse additive models
First we review the concept of reproducing kernel Hilbert spaces (RKHS). Suppose $(\mathcal{X}, \mathcal{B}, P)$ is a probability space on $\mathcal{X}$ with a $\sigma$-algebra $\mathcal{B}$ and probability measure $P$. The $L_2(P)$ norm of a function is denoted by $\|\cdot\|$. A Hilbert space $\mathcal{H} \subset L_2(P)$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is an RKHS if there is a positive-definite kernel function $K: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ (i.e., $K(x_i, x_j) = K(x_j, x_i)$ and $\sum_{i,j} a_i a_j K(x_i, x_j) \ge 0$ for any sequence $x_i \in \mathcal{X}$ and $a_i \in \mathbb{R}$) such that (i) $K(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$, and (ii) we have the reproducing property $\langle f, K(\cdot, x)\rangle_{\mathcal{H}} = f(x)$ for all $f \in \mathcal{H}$, $x \in \mathcal{X}$. We assume that for some constant $C$, $\sup_{x\in\mathcal{X}} K(x, x) < C$. Then for any $f \in \mathcal{H}$,
$$\|f\|_\infty^2 = \sup_x |\langle f, K(\cdot, x)\rangle_{\mathcal{H}}|^2 \le \sup_x \|K(\cdot, x)\|_{\mathcal{H}}^2\|f\|_{\mathcal{H}}^2 = \sup_x K(x, x)\|f\|_{\mathcal{H}}^2 \le C\|f\|_{\mathcal{H}}^2.$$
That is, the functions in the unit ball of $\mathcal{H}$ are uniformly bounded. Under suitable regularity conditions, Mercer's theorem implies that the kernel has a spectral decomposition
$$K(x, y) = \sum_{k=1}^\infty \mu_k \phi_k(x)\phi_k(y),$$
with eigenvalues $\mu_1 \ge \mu_2 \ge \cdots \ge 0$, and eigenfunctions $\phi_k$ orthogonal in $L_2(P)$. The decay rate of the eigenvalues quantifies the complexity of the RKHS and determines the convergence rate for nonparametric estimation in the RKHS. More specifically, the complexity function is defined as
$$R(u) = \left(\frac{1}{n}\sum_{k=1}^\infty \min\{\mu_k, u^2\}\right)^{1/2}.$$
Let $\gamma_n = \inf\{\gamma: R(\gamma) \le \gamma^2\}$. Then $\gamma_n$ is uniquely defined and is the optimal convergence rate in $L_2(P)$-error for estimating $f \in \mathcal{H}$ in nonparametric regression (Raskutti et al., 2012).
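As a concrete illustration of these quantities (a standard calculation under an assumed polynomial eigenvalue decay, not a result specific to this paper), suppose $\mu_k \asymp k^{-2\alpha}$ for some $\alpha > 1/2$, as for Sobolev-type kernels of smoothness $\alpha$. Splitting the sum at $k \approx u^{-1/\alpha}$ gives
$$R(u) \asymp \frac{1}{\sqrt{n}}\, u^{1 - \frac{1}{2\alpha}}, \qquad \text{so that} \qquad \gamma_n \asymp n^{-\frac{\alpha}{2\alpha+1}}, \qquad \gamma_n^2 \asymp n^{-\frac{2\alpha}{2\alpha+1}},$$
recovering the usual nonparametric rate for an $\alpha$-smooth function.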
We are given an independent and identically distributed sample $(x_i = (x_{i1}, \ldots, x_{ip})^\top, y_i)$, $i = 1, \ldots, n$, from the generating model
$$y_i = \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i,$$
where $\epsilon_i$ is a mean-zero error independent of the covariates $x_i$. We assume $\epsilon_i \sim N(0, 1)$ for simplicity of the theory. We use $f^*(x_i) = \sum_{j=1}^p f_j^*(x_{ij})$ to denote the truth. Note that the decomposition of the conditional mean as $\sum_{j=1}^p f_j^*(x_{ij})$ might not be unique (and typically is not), but this will not cause confusion in the following since we focus on the estimation of $f^*$ instead of each individual component. One can also regard this as saying that we simply fix one possible decomposition. This is the same perspective as taken in Koltchinskii and Yuan (2010); Suzuki and Sugiyama (2013). We assume each $f_j^*$ lies in a certain RKHS $\mathcal{H}_j$ with kernel $K_j$. The estimator we use is
$$\hat f = \sum_j \hat f_j, \qquad (\hat f_1, \ldots, \hat f_p) = \mathop{\arg\min}_{f_j \in \mathcal{H}_j,\, j=1,\ldots,p}\; \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p f_j(x_{ij})\Big)^2 + \lambda_1\sum_{j=1}^p \|f_j\|_n + \lambda_2\sum_{j=1}^p \|f_j\|_{\mathcal{H}},$$
where $\|f_j\|_n = (\sum_i f_j^2(x_{ij})/n)^{1/2}$, roughly speaking, controls sparseness, while $\|f_j\|_{\mathcal{H}}$ controls both sparseness and smoothness. The mixed norm is necessary to obtain the optimal convergence rate in the literature.

By the representer theorem (Kimeldorf and Wahba, 1971), each $\hat f_j$ can be expressed as $\hat f_j(x) = \sum_{i=1}^n \hat\omega_{ij} K_j(x, x_{ij})$ for some $\hat\omega_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$. Thus, defining the kernel matrices $K_j = (K_j(x_{ij}, x_{i'j}))_{i,i'}$ and using the reproducing property $\langle K_j(\cdot, x), K_j(\cdot, x')\rangle_{\mathcal{H}} = K_j(x, x')$, the optimization problem transforms into
$$\min_{\omega_j = (\omega_{1j},\ldots,\omega_{nj})^\top,\, j=1,\ldots,p}\; \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_j (K_j\omega_j)_i\Big)^2 + \lambda_1\sum_{j=1}^p \sqrt{\omega_j^\top K_j^2\omega_j/n} + \lambda_2\sum_{j=1}^p \sqrt{\omega_j^\top K_j\omega_j},$$
where $(K_j\omega_j)_i$ denotes the $i$-th component of the vector $K_j\omega_j$.
Because the dimension of the problem (the dimension of $\omega = (\omega_1^\top, \ldots, \omega_p^\top)^\top$) is $np$ with sample size $n$, when $n$ gets large it quickly becomes infeasible to solve the optimization problem. The idea of randomized sketches for kernel ridge regression is to restrict all $\omega_j$ to an $m$-dimensional subspace of $\mathbb{R}^n$ with $m \ll n$, utilizing an $n \times m$ sketching matrix $S$. Adapted to our context, this means that, given $S \in \mathbb{R}^{n\times m}$, we set $\hat\omega_j = S\hat\alpha_j$ with $\hat\alpha = (\hat\alpha_1^\top,\ldots,\hat\alpha_p^\top)^\top$ given by
$$\hat\alpha = \mathop{\arg\min}_{\alpha = (\alpha_1^\top,\ldots,\alpha_p^\top)^\top}\; \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_j (K_j S\alpha_j)_i\Big)^2 + \lambda_1\sum_j \sqrt{\alpha_j^\top S^\top K_j^2 S\alpha_j/n} + \lambda_2\sum_j \sqrt{\alpha_j^\top S^\top K_j S\alpha_j}. \qquad (1)$$
Commonly used sketching matrices include sub-Gaussian sketches and randomized orthogonal system (ROS) sketches. For sub-Gaussian sketches, the columns of $S$ are independent sub-Gaussian vectors; this includes, as a special case, entries of $S$ that are i.i.d. Gaussian with variance $1/m$. For ROS sketches, the random matrix is obtained by sampling (with replacement) the columns of a fixed $n \times n$ orthonormal matrix and randomly changing signs. See Yang et al. (2017) for details.
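For concreteness, the following R sketch generates the three types of sketching matrices used later in the paper. This is illustrative code only: the function names are ours, the ROS construction assumes $n$ is a power of two so that a Hadamard matrix of order $n$ exists, and we apply the random signs per coordinate with the $\sqrt{n/m}$ rescaling, as in the standard ROS construction of Yang et al. (2017).

# Gaussian sketch: i.i.d. N(0, 1/m) entries, S is n x m
gaussian_sketch <- function(n, m) {
  matrix(rnorm(n * m, sd = 1 / sqrt(m)), n, m)
}

# Subsampling sketch: m randomly selected columns of the n x n identity matrix
subsample_sketch <- function(n, m) {
  S <- matrix(0, n, m)
  S[cbind(sample(n, m), 1:m)] <- 1
  S
}

# ROS sketch based on the Hadamard matrix (assumes n is a power of 2)
ros_sketch <- function(n, m) {
  H <- matrix(1, 1, 1)
  while (nrow(H) < n) H <- rbind(cbind(H, H), cbind(H, -H))  # Sylvester construction
  H <- H / sqrt(n)                             # orthonormal n x n matrix
  cols  <- sample(n, m, replace = TRUE)        # sample columns with replacement
  signs <- sample(c(-1, 1), n, replace = TRUE) # random sign per coordinate
  sqrt(n / m) * (signs * H[, cols, drop = FALSE])
}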
3 Asymptotic theory
For simplicity of notation, we assume $K_j \equiv K$ and $\mathcal{H}_j \equiv \mathcal{H}$. Recall that the complexity function is given by
$$R(u) := \left(\frac{1}{n}\sum_{k=1}^\infty \min\{\mu_k, u^2\}\right)^{1/2},$$
and we let $\gamma_n := \min\{\gamma: R(\gamma) \le \gamma^2\}$.
It is known that the minimax rate for the additive model is $\|\hat f - f^*\|^2 = O_p(\gamma_n^2)$ (Raskutti et al., 2012).
Suppose that the kernel matrices have the eigendecompositions $K_j/n = U_j D_j U_j^\top$, where $U_j$ is an orthonormal matrix and $D_j$ is diagonal with diagonal elements $\hat\mu_{j1} \ge \hat\mu_{j2} \ge \cdots \ge \hat\mu_{jn} \ge 0$. The empirical complexity functions are
$$\hat R_j(u) := \left(\frac{1}{n}\sum_{k=1}^n \min\{\hat\mu_{jk}, u^2\}\right)^{1/2}.$$
Define
$$\hat\gamma_{nj} := \min\{\gamma: \hat R_j(\gamma) \le \gamma^2\}.$$
As shown in Section 3.2 of Koltchinskii and Yuan (2010), $\hat\gamma_{nj}$ and $\gamma_n$ are of the same order, and thus we do not distinguish between the two and simply write $\hat\gamma_{nj}$ as $\gamma_n$.
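The empirical complexity function and the critical radius $\hat\gamma_{nj}$ can be computed directly from the eigenvalues of $K_j/n$. The following R sketch is our own illustrative code (the grid bounds are arbitrary and the returned value is only a grid approximation of $\hat\gamma_{nj}$).

# Approximate the critical radius gamma_hat for a kernel matrix K (n x n)
critical_radius <- function(K) {
  n      <- nrow(K)
  hat_mu <- pmax(eigen(K / n, symmetric = TRUE, only.values = TRUE)$values, 0)
  R_hat  <- function(u) sqrt(sum(pmin(hat_mu, u^2)) / n)  # empirical complexity function
  grid   <- 10^seq(-4, 1, length.out = 500)
  ok     <- sapply(grid, R_hat) <= grid^2                 # gammas with R_hat(gamma) <= gamma^2
  min(grid[ok])
}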
When $S$ satisfies suitable conditions, the sketched estimator can still achieve the optimal convergence rate with $m \ll n$. The exact required order of $m$ depends on the type of projection matrix $S$ chosen. Define
$$d_n = \min\{k: \mu_k \le \gamma_n^2\},$$
which can be regarded as the statistical dimension of the problem. Consider the eigendecomposition of the kernel matrices mentioned above. Following Yang et al. (2017), a sketching matrix $S$ is said to be $K$-satisfiable if there is a constant $c$ such that
$$\|U_{(1)j}^\top(SS^\top - I)U_{(1)j}\|_{op} \le 1/2, \qquad \|S^\top U_{(2)j}D_{(2)j}^{1/2}\|_{op} \le c\gamma_n, \qquad (2)$$
where $U_{(1)j}\in\mathbb{R}^{n\times d_n}$ contains the first $d_n$ columns of $U_j$, $U_{(2)j}\in\mathbb{R}^{n\times(n-d_n)}$ contains the remaining $n-d_n$ columns of $U_j$, $D_{(2)j} = \mathrm{diag}\{\hat\mu_{j,d_n+1},\ldots,\hat\mu_{jn}\}$, and $\|\cdot\|_{op}$ denotes the operator norm for matrices. Lemma 5 of Yang et al. (2017) showed that when $m \ge cd_n$ for the Gaussian sketch, or $m \ge cd_n\log^4(n)$ for the ROS sketch, $S$ is $K$-satisfiable with high probability. Note that since we assume $p$ is fixed, (2) is satisfied simultaneously for all $j = 1,\ldots,p$ with high probability.

We make the following assumption.

Assumption (A). There exists a constant $\beta > 0$ such that
$$\inf_{f_j\in\mathcal{H},\, j=1,\ldots,p} \frac{\|\sum_{j=1}^p f_j\|}{\big(\sum_{j=1}^p \|f_j\|^2\big)^{1/2}} \ge \beta.$$

We assume that $\beta$ is a fixed constant. This assumption also appears in Koltchinskii and Yuan (2010) and can roughly be regarded as a condition constraining the correlations between different covariates. It is an extension of the eigenvalue condition commonly used in sparse linear models. We can now state our main result.

Theorem 1 Consider the sketched problem (1) with a given $K$-satisfiable projection matrix $S$. Assume assumption (A) holds. If $f_j^*\in\mathcal{H}$, $j = 1,\ldots,p$, then, when $\lambda_1 = C\gamma_n$ and $\lambda_2 = L\gamma_n^2$ with $L$ a sufficiently large constant, we have $\|\hat f - f^*\|^2 = O_p(\gamma_n^2)$ and $\|\hat f - f^*\|_n^2 = O_p(\gamma_n^2)$, where $\|\hat f - f^*\|_n^2 = \sum_i(\hat f(x_i) - f^*(x_i))^2/n$, $\hat f = \sum_{i,j}\hat\omega_{ij}K(\cdot, x_{ij})$, $\hat\omega_j = S\hat\alpha_j$, and $\hat\alpha_j$ is as defined in (1).
As we discussed above, when using Gaussian sketches or ROS sketches, as long as $m \ge cd_n$ (for Gaussian sketches) or $m \ge cd_n\log^4(n)$ (for ROS sketches), $S$ is $K$-satisfiable with probability approaching one. Thus, in such a situation, we can achieve $\|\hat f - f^*\|^2 = O_p(\gamma_n^2)$.
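Condition (2) can also be inspected numerically for a given sketch. The R sketch below is illustrative only: the function name is ours, the empirical eigenvalues are used in place of $\mu_k$ when computing $d_n$, the constant $c$ in the second bound of (2) is not pinned down by the theory (so we simply return the two operator norms, the first of which should be at most $1/2$ and the second of order $\gamma_n$), and edge cases such as $d_n = n$ are not handled.

# Compute d_n and the two operator norms appearing in condition (2)
check_K_satisfiable <- function(K, S, gamma_n) {
  n  <- nrow(K)
  eg <- eigen(K / n, symmetric = TRUE)
  U  <- eg$vectors                        # columns ordered by decreasing eigenvalue
  mu <- pmax(eg$values, 0)
  d_n <- which(mu <= gamma_n^2)[1]        # statistical dimension: min{k : mu_k <= gamma_n^2}
  U1 <- U[, 1:d_n, drop = FALSE]
  U2 <- U[, -(1:d_n), drop = FALSE]
  D2_half <- diag(sqrt(mu[-(1:d_n)]), n - d_n)
  op1 <- norm(t(U1) %*% (S %*% t(S) - diag(n)) %*% U1, type = "2")
  op2 <- norm(t(S) %*% U2 %*% D2_half, type = "2")
  list(d_n = d_n, first_bound = op1, second_bound_over_gamma = op2 / gamma_n)
}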
Proof of Theorem 1. Throughout the proof, $C$ denotes a generic positive constant. Let $\tilde f_j$ be the sketched KRR estimator when the responses are given by $f_j^*(x_{ij})$. That is, we set $\tilde f_j = \sum_i \tilde\omega_{ij}K(\cdot, x_{ij})$ with $\tilde\omega_{ij} = (S\tilde\alpha)_i$ for $\tilde\alpha$ that solves
$$\min_{\alpha\in\mathbb{R}^m} \sum_{i=1}^n \big(f_j^*(x_{ij}) - (K_j S\alpha)_i\big)^2 + n\tilde\lambda\,\alpha^\top S^\top K_j S\alpha,$$
with $\tilde\lambda = 2\gamma_n^2$. Lemma 1 shows that $\|\tilde f_j - f_j^*\|_n^2 \le C\gamma_n^2$ and $\|\tilde f_j\|_{\mathcal{H}} \le C$, where $\|\tilde f_j - f_j^*\|_n = (\sum_i(\tilde f_j(x_{ij}) - f_j^*(x_{ij}))^2/n)^{1/2}$.

Since $\hat\alpha$ in the definition of $\hat f$ minimizes (1), while $\tilde f_j$ also has the form $\sum_i \omega_{ij}K(\cdot, x_{ij})$ with $\omega_j = (\omega_{1j},\ldots,\omega_{nj})^\top$ in the column space of $S$, we have
$$\frac{1}{n}\sum_i (y_i - \hat f(x_i))^2 + \lambda_1\sum_j\|\hat f_j\|_n + \lambda_2\sum_j\|\hat f_j\|_{\mathcal{H}} \le \frac{1}{n}\sum_i (y_i - \tilde f(x_i))^2 + \lambda_1\sum_j\|\tilde f_j\|_n + \lambda_2\sum_j\|\tilde f_j\|_{\mathcal{H}}.$$
By $(y_i - \hat f(x_i))^2 = (y_i - \tilde f(x_i) - (\hat f(x_i) - \tilde f(x_i)))^2 = (y_i - \tilde f(x_i))^2 + (\hat f(x_i) - \tilde f(x_i))^2 - 2(y_i - \tilde f(x_i))(\hat f(x_i) - \tilde f(x_i))$, we get
$$\begin{aligned}
&\|\hat f - \tilde f\|_n^2 + \lambda_1\sum_j\|\hat f_j\|_n + \lambda_2\sum_j\|\hat f_j\|_{\mathcal{H}}\\
&\quad\le \frac{2}{n}\sum_i (y_i - \tilde f(x_i))(\hat f(x_i) - \tilde f(x_i)) + \lambda_1\sum_j\|\tilde f_j\|_n + \lambda_2\sum_j\|\tilde f_j\|_{\mathcal{H}}\\
&\quad= \frac{2}{n}\sum_i \epsilon_i(\hat f(x_i) - \tilde f(x_i)) - \frac{2}{n}\sum_i (\tilde f(x_i) - f^*(x_i))(\hat f(x_i) - \tilde f(x_i)) + \lambda_1\sum_j\|\tilde f_j\|_n + \lambda_2\sum_j\|\tilde f_j\|_{\mathcal{H}}.
\end{aligned}$$
Using $\|\tilde f_j\|_n - \|\hat f_j\|_n \le \|\hat f_j - \tilde f_j\|_n$, we get from the above that
$$\begin{aligned}
&\|\hat f - \tilde f\|_n^2 + \lambda_2\sum_j\|\hat f_j\|_{\mathcal{H}}\\
&\quad\le \frac{2}{n}\sum_i \epsilon_i(\hat f(x_i) - \tilde f(x_i)) - \frac{2}{n}\sum_i (\tilde f(x_i) - f^*(x_i))(\hat f(x_i) - \tilde f(x_i)) + \lambda_1\sum_j\|\hat f_j - \tilde f_j\|_n + \lambda_2\sum_j\|\tilde f_j\|_{\mathcal{H}}.
\end{aligned}$$
By Lemma 1 of Raskutti et al. (2012), with probability at least $1 - c_1\exp\{-c_2 n\gamma_n^2\}$ for some constants $c_1, c_2 > 0$,
$$\frac{2}{n}\sum_i \epsilon_i(\hat f(x_i) - \tilde f(x_i)) \le C\gamma_n\sum_j\|\hat f_j - \tilde f_j\|_n + C\gamma_n^2\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}.$$
To simplify the presentation, we often refrain from mentioning "with probability at least $1 - c_1\exp\{-c_2 n\gamma_n^2\}$" below in our bounds. Moreover, by the Cauchy-Schwarz inequality,
$$\frac{1}{n}\sum_i (\tilde f(x_i) - f^*(x_i))(\hat f(x_i) - \tilde f(x_i)) \le \|\tilde f - f^*\|_n\|\hat f - \tilde f\|_n.$$
Thus we have
$$\begin{aligned}
&\|\hat f - \tilde f\|_n^2 + \lambda_2\sum_j\|\hat f_j\|_{\mathcal{H}}\\
&\quad\le C\gamma_n\sum_j\|\hat f_j - \tilde f_j\|_n + C\gamma_n^2\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}} + 2\|\tilde f - f^*\|_n\|\hat f - \tilde f\|_n + \lambda_1\sum_j\|\hat f_j - \tilde f_j\|_n + \lambda_2\sum_j\|\tilde f_j\|_{\mathcal{H}}\\
&\quad\le C\gamma_n\sum_j\|\hat f_j - \tilde f_j\|_n + C\gamma_n^2\sum_j\|\hat f_j\|_{\mathcal{H}} + C\gamma_n^2\sum_j\|\tilde f_j\|_{\mathcal{H}} + 2\|\tilde f - f^*\|_n\|\hat f - \tilde f\|_n + \lambda_1\sum_j\|\hat f_j - \tilde f_j\|_n + \lambda_2\sum_j\|\tilde f_j\|_{\mathcal{H}}\\
&\quad\le (\lambda_1 + C\gamma_n + 2\|\tilde f - f^*\|_n)\sum_j\|\hat f_j - \tilde f_j\|_n + C\gamma_n^2\sum_j\|\hat f_j\|_{\mathcal{H}} + (\lambda_2 + C\gamma_n^2)\sum_j\|\tilde f_j\|_{\mathcal{H}},
\end{aligned} \qquad (3)$$
where in the second inequality we used $\|\hat f_j - \tilde f_j\|_{\mathcal{H}} \le \|\hat f_j\|_{\mathcal{H}} + \|\tilde f_j\|_{\mathcal{H}}$, and in the last step we used $\|\hat f - \tilde f\|_n \le \sum_j\|\hat f_j - \tilde f_j\|_n$.
Furthermore, using Lemma 4 and assumption (A),
$$\begin{aligned}
\sum_j\|\hat f_j - \tilde f_j\|_n &\le C\sum_j\|\hat f_j - \tilde f_j\| + C\gamma_n\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}\\
&\le (C/\beta)\|\hat f - \tilde f\| + C\gamma_n\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}\\
&\le (C/\beta)\|\hat f - \tilde f\| + C\gamma_n\sum_j(\|\hat f_j\|_{\mathcal{H}} + \|\tilde f_j\|_{\mathcal{H}}).
\end{aligned} \qquad (4)$$
Using (4) in (3), we get
$$\begin{aligned}
\|\hat f - \tilde f\|_n^2 + \lambda_2\sum_j\|\hat f_j\|_{\mathcal{H}} &\le C(\lambda_1 + \gamma_n + \|\tilde f - f^*\|_n)\|\hat f - \tilde f\| + C(\lambda_1\gamma_n + \|\tilde f - f^*\|_n\gamma_n + \gamma_n^2)\sum_j\|\hat f_j\|_{\mathcal{H}}\\
&\quad + C(\lambda_2 + \lambda_1\gamma_n + \|\tilde f - f^*\|_n\gamma_n + \gamma_n^2)\sum_j\|\tilde f_j\|_{\mathcal{H}}.
\end{aligned} \qquad (5)$$
By moving the term $\sum_j\|\hat f_j\|_{\mathcal{H}}$ to the LHS, this means
$$\sum_j\|\hat f_j\|_{\mathcal{H}} \le C_1\gamma_n^{-1}\|\hat f - \tilde f\| + C\sum_j\|\tilde f_j\|_{\mathcal{H}}, \qquad (6)$$
and we note that the constant $C_1$ above can be made sufficiently small since $\lambda_2 = L\gamma_n^2$ with $L$ chosen to be large enough.
Using Lemma 3,
$$\begin{aligned}
\Big|\|\hat f - \tilde f\|^2 - \|\hat f - \tilde f\|_n^2\Big| &\le C\gamma_n\Big(\sum_j\|\hat f_j - \tilde f_j\|\Big)\Big(\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}\Big) + C\gamma_n^2\Big(\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}\Big)^2\\
&\le C\gamma_n\|\hat f - \tilde f\|\Big(\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}\Big) + C\gamma_n^2\Big(\sum_j\|\hat f_j - \tilde f_j\|_{\mathcal{H}}\Big)^2\\
&\le \frac14\|\hat f - \tilde f\|^2 + C\gamma_n^2\Big(\sum_j(\|\hat f_j\|_{\mathcal{H}} + \|\tilde f_j\|_{\mathcal{H}})\Big)^2\\
&\le \frac12\|\hat f - \tilde f\|^2 + C\gamma_n^2\Big(\sum_j\|\tilde f_j\|_{\mathcal{H}}\Big)^2,
\end{aligned} \qquad (7)$$
where in the second inequality we used assumption (A), in the third inequality we used the Cauchy-Schwarz inequality as well as $\|\hat f_j - \tilde f_j\|_{\mathcal{H}} \le \|\hat f_j\|_{\mathcal{H}} + \|\tilde f_j\|_{\mathcal{H}}$, and in the last step we used (6) and that $C_1$ is small enough. This combined with (5) gives
$$\begin{aligned}
\|\hat f - \tilde f\|^2 &\le \|\hat f - \tilde f\|^2 - \|\hat f - \tilde f\|_n^2 + C\gamma_n\|\hat f - \tilde f\| + C\gamma_n^2\sum_j\|\tilde f_j\|_{\mathcal{H}}\\
&\le \frac12\|\hat f - \tilde f\|^2 + C\gamma_n\|\hat f - \tilde f\| + C\gamma_n^2\sum_j\|\tilde f_j\|_{\mathcal{H}} + C\gamma_n^2\Big(\sum_j\|\tilde f_j\|_{\mathcal{H}}\Big)^2.
\end{aligned} \qquad (8)$$
Moving $\frac12\|\hat f - \tilde f\|^2$ to the left-hand side proves $\|\hat f - \tilde f\|^2 \le C\gamma_n^2$, owing also to Lemma 1. From (7), this also gives $\|\hat f - \tilde f\|_n^2 \le C\gamma_n^2$. Finally, Lemma 1 gives $\|\tilde f_j - f_j^*\|_n \le C\gamma_n$ for each $j$, so $\|\tilde f - f^*\|_n \le C\gamma_n$ since $p$ is fixed, and Lemma 4 then gives $\|\tilde f - f^*\| \le C(\|\tilde f - f^*\|_n + \gamma_n\sum_j(\|\tilde f_j\|_{\mathcal{H}} + \|f_j^*\|_{\mathcal{H}})) \le C\gamma_n$. The triangle inequality thus yields $\|\hat f - f^*\|^2 = O_p(\gamma_n^2)$ and $\|\hat f - f^*\|_n^2 = O_p(\gamma_n^2)$, completing the proof.
4 Numerical studies
We perform some simulations to illustrate the performance of the sketched estimator. We use a relatively small dimension $p = 5$. For each sample size $n$, most of the observations ($n - \lfloor\sqrt{n}\rfloor$ of them) have predictors generated uniformly on $[1/2, 1]^p$. The rest ($\lfloor\sqrt{n}\rfloor$ of them) are generated by $x_{ij} \sim N(0, 1/\sqrt{n})$ and taking absolute values so that all predictors lie in $[0, 1]$. This heterogeneous way of generating predictors follows one illustration in Yang et al. (2017). The responses are generated from
$$y_i = \sum_{j=1}^5 f_j(x_{ij}) + \epsilon_i,$$
where $f_1(x) = -6x(1-x)$, $f_2(x) = \sin(2\pi x)/(2 - \sin(2\pi x))$, $f_3 = f_4 = f_5 \equiv 0$, and the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. We set $n = 128, 256, 512, 1024$ and $\sigma = 0.2, 0.5$. We use the second-order Sobolev kernel $K(s, t) = \min\{s, t\}^2\max\{s, t\}/2 - \min\{s, t\}^3/6 + st + 1$.
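The data-generating mechanism and the kernel above can be coded directly. The following R sketch (function names are ours; we read $N(0, 1/\sqrt{n})$ as having variance $1/\sqrt{n}$) produces one replication of the first simulation.

# Second-order Sobolev kernel used in the simulations
sobolev_K <- function(s, t) {
  pmin(s, t)^2 * pmax(s, t) / 2 - pmin(s, t)^3 / 6 + s * t + 1
}

# One simulated data set with p = 5 predictors and two nonzero components
gen_data <- function(n, sigma) {
  p  <- 5
  n2 <- floor(sqrt(n))
  X1 <- matrix(runif((n - n2) * p, 0.5, 1), n - n2, p)    # uniform on [1/2, 1]^p
  X2 <- abs(matrix(rnorm(n2 * p, sd = n^(-1/4)), n2, p))  # |N(0, 1/sqrt(n))|
  X  <- rbind(X1, X2)
  f  <- -6 * X[, 1] * (1 - X[, 1]) +
        sin(2 * pi * X[, 2]) / (2 - sin(2 * pi * X[, 2]))
  list(X = X, f = f, y = f + rnorm(n, sd = sigma))
}

# Kernel matrix for the j-th predictor, e.g. j = 1:
# K1 <- outer(X[, 1], X[, 1], sobolev_K)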
For the sketching matrix, we consider both the Gaussian sketch (entries of the matrix are i.i.d. Gaussian) and the ROS sketch based on the Hadamard matrix. We also use the subsampling matrix, in which the columns of $S$ are $m$ randomly selected columns of the identity matrix; this is equivalent to randomly sampling $m$ columns of the kernel matrix. Note that this subsampling sketch is different from random subsampling of the data, since we only sample the columns of the kernel matrix while the number of rows is still $n$. We compute the standard estimator and the sketched estimator with $m \in \{n/2, n/2^2, \ldots, n/2^5\}$. For example, when $n = 128$, $m$ ranges from 4 to 64. In each case 200 data sets are generated. The tuning parameters are selected by 5-fold cross-validation. The accuracy of an estimator is assessed by $\mathrm{RMSE} = (\sum_{i=1}^n(\hat f(x_i) - f(x_i))^2/n)^{1/2}$.
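For completeness, the error measure and the fold assignment used for cross-validation can be written as follows (a trivial sketch with our own names).

rmse <- function(fhat, f) sqrt(mean((fhat - f)^2))                  # RMSE over the n design points
make_folds <- function(n, K = 5) sample(rep(1:K, length.out = n))   # random K-fold assignment for CV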
To make the computational problem easier, we make the following change to the penalty. Our implementation is based on the penalty $\sum_j \lambda_1\sqrt{\|f_j\|_n^2 + \lambda_2\|f_j\|_{\mathcal{H}}^2}$. Since $(x+y)/\sqrt{2} \le \sqrt{x^2+y^2} \le x+y$, the new penalty is closely related to the one presented earlier and was also used in Meier et al. (2009). The new penalty has the advantage that it can be written as
$$\sum_j \lambda_1\sqrt{\alpha_j^\top S^\top K_j^2 S\alpha_j/n + \lambda_2\,\alpha_j^\top S^\top K_j S\alpha_j} = \sum_j \lambda_1\sqrt{\alpha_j^\top(S^\top K_j^2 S/n + \lambda_2 S^\top K_j S)\alpha_j},$$
so the optimization problem is simply a group lasso problem. The original penalty formulation has two separate penalty terms, so existing implementations for the group lasso cannot be used directly. The model is fitted using the R package gglasso, which computes the solution path efficiently.
The results are presented in Figure 1. The two rows correspond to σ = 0.2 and 0.5, respectively. In each panel, the differently colored curves correspond to different sample sizes (the black, red, green, and blue curves correspond to n = 128, 256, 512, 1024, respectively), and m decreases along the x-axis. The RMSEs of the estimators without sketching are visually indistinguishable from those of the sketched estimators with m = n/2 in all situations and thus are not shown. For Gaussian and ROS sketches, we see that the errors do not increase much even for m much smaller than n. For example, for σ = 0.2 with the Gaussian sketch (upper-left panel), when the sample size is n = 256 (red curve), there is no obvious deterioration in RMSE even when m = 16. The subsampling sketch performs slightly worse.

We further perform a simulation with a much larger sample size n = 2^15 = 32768, p = 10 (two nonzero functions as before), and σ = 2 or 5, using the same data-generating mechanism. For such a large sample size, the standard estimator cannot be computed on our machine, which stops responding after a while. In this simulation we compare the sketching method with other approaches. For the sketching method, we use the Gaussian sketch, ROS sketch, and subsampling sketch as before, with m = 8, 32. The choice of m = 32 is roughly based on computational feasibility, since it would take too long to compute the estimators with larger m. We compare the proposed method with the divide-and-conquer method, which computes the estimator by partitioning the entire data into M equally sized parts, computing an estimator on each part, and finally taking the average of the M estimators. The number of machines is taken to be M = 10, 20, chosen such that the computational speed for M = 10 is similar to that of the sketching method with m = 32. Lastly, we compare with baseline subsampling, in which a subset of the data of size m = 2000 or 5000 is selected for fitting the model, with m = 5000 again resulting in simulation time similar to the sketching method with m = 32. The results for the five methods (including the three sketching methods), with two settings (two choices of m or M) for each method, are reported in Table 1.

In Table 1, the ten settings are indicated by methods (1)-(10). We see that the sketching methods are comparable to the divide-and-conquer method in estimation accuracy, while the baseline subsampling method clearly has the worst performance. The table also reports the standard errors for the differences in RMSE for some pairs of methods. Comparing (2) and (4), we see that the differences between the Gaussian sketch and the ROS sketch are insignificant (the difference in RMSE is smaller than twice its standard error). On the other hand, the difference between the Gaussian sketch and the subsampling sketch is significant. Comparing the Gaussian sketch with divide-and-conquer, the differences are not statistically significant.

The implementations were run on the author's desktop computer with an Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60 GHz processor and 64GB RAM. In terms of computation times, the simulations with n = 256 in one particular parameter setting were completed in about half an hour, while the sketched estimators with m = 16 were completed in less than a minute. Our implementation in R is available at https://github.com/shellinglian/SAM.sketch.

Finally, we use three data sets from the UCI machine learning repository for further illustration. The first is the air quality data set, which contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device located within an Italian city (De Vito et al., 2008). These are used as predictors, together with temperature and humidity. Of interest is the prediction of the true hourly averaged CO concentration. The second is the wine quality data set, where the aim is to predict quality scores (a value between 0 and 10) from 11 predictors such as acidity, density, pH value, etc.
Figure 1: For the first simulation, RMSE of $\hat f$ as $m$ varies in $\{n/2, n/2^2, \ldots, n/2^5\}$; each panel plots RMSE against $\log_2(n/m)$. Lines of different colors correspond to different $n$. The two rows correspond to $\sigma = 0.2$ and $0.5$, respectively. The three columns correspond to Gaussian sketches, ROS sketches, and subsampling sketches, respectively.
Table 1: For the second simulation with n = 2^15 = 32768, we report the RMSE for various methods and the standard error of the difference in RMSE for some selected pairs of methods.

                                        sigma = 2   sigma = 5
  Gaussian sketch       (1) m = 8       0.143       0.292
                        (2) m = 32      0.115       0.209
  ROS sketch            (3) m = 8       0.148       0.298
                        (4) m = 32      0.117       0.214
  Sub sketch            (5) m = 8       0.182       0.325
                        (6) m = 32      0.139       0.258
  Divide-and-Conquer    (7) M = 20      0.130       0.278
                        (8) M = 10      0.109       0.219
  Subsampling           (9) m = 2000    0.290       0.524
                        (10) m = 5000   0.218       0.342
  Standard deviation of (2)-(4)         0.006       0.008
  Standard deviation of (2)-(6)         0.011       0.020
  Standard deviation of (2)-(8)         0.009       0.011
  Standard deviation of (2)-(10)        0.033       0.042
The third data set is the metro interstate traffic volume data, in which log(traffic volume) is regressed on 3 predictors (temperature, rain amount, and percentage of cloud cover), using only data on non-holidays. All predictors are standardized to lie within [0, 1]. We randomly partition each data set 100 times with a training size n in {1024, 2048}, with the rest used as a validation set. Prediction results using the Gaussian sketch, ROS sketch, subsampling sketch (m = 32), divide-and-conquer (M = 10), and subsampling (m = 300) are reported, with these m or M chosen such that the different methods have similar computation times. Results using all training data are also shown in Table 2. As in the simulations, we see that the sketching method compares favorably with the other methods. We have also computed standard errors as in our simulations and find that the difference between the Gaussian sketching method and the subsampling method is significant, while the difference between the Gaussian sketching method and the divide-and-conquer method is not.

Table 2: Prediction accuracy of different estimators.

                                   Air quality   Wine quality   Traffic volume
  Gaussian sketch      n = 1024    0.261         3.74           1.57
                       n = 2048    0.255         3.17           1.27
  ROS sketch           n = 1024    0.260         3.72           1.59
                       n = 2048    0.257         3.29           1.32
  Sub sketch           n = 1024    0.267         3.79           1.70
                       n = 2048    0.260         3.44           1.34
  Divide-and-Conquer   n = 1024    0.263         3.74           1.58
                       n = 2048    0.252         3.25           1.29
  Subsampling          n = 1024    0.282         3.93           1.78
                       n = 2048    0.281         3.92           1.78
  Standard estimator   n = 1024    0.251         3.57           1.49
                       n = 2048    0.242         3.04           1.15
5 Conclusions
In this paper, we constructed a sketched estimator for sparse additive models in the reproducing kernel Hilbert space framework. We established that the convergence rate of the sketched estimator is the same as that of the standard, unsketched estimator. Our numerical examples demonstrate the good performance of the sketched estimator, which can also be much faster to compute.

In our numerical examples, the performances of sketching and divide-and-conquer are similar. However, it is well known that, even for a parametric model, there is a limit on the number of partitions one can use in practice and in theory. In theory, it is known that the number of partitions can be at most O(√n), beyond which the divide-and-conquer method will fail; this has also been demonstrated numerically in various specific models (Zhao et al., 2016). The sketching method can be combined with divide-and-conquer to achieve better performance. More specifically, when n is extremely large, we can first apply the divide-and-conquer method with O(√n) partitions. Since each partition, with at least √n observations, can still be very large, we can then use the sketching method on each partition to calculate the local estimators. This kind of strategy has been demonstrated to be beneficial for parametric ridge regression in Wang et al. (2018). Whether it works favorably in the current setting is of course unknown, and we leave this as future work.

A natural but non-trivial next step is to extend the theory to the high-dimensional setting. Also, in the high-dimensional setting, it would be interesting to combine sketching with random projection on p to reduce both n and p simultaneously. Wang et al. (2017) made some attempts in this direction for linear models, and it would be interesting to consider similar strategies for additive models.
Acknowledgements

The authors sincerely thank the editor, the associate editor, and several anonymous reviewers for their insightful comments that improved the manuscript. The research of Fode Zhang is partially supported by the Fundamental Research Funds for the Central Universities (No. JBK1901053, JBK1806002 and JBK140507) of China. The research of Xuejun Wang is partially supported by NSFC (11671012). Rui Li's research was supported by the National Social Science Fund of China (No. 17BTJ025). The research of Heng Lian is supported by City University of Hong Kong Start-up grant 7200521, by Hong Kong RGC general research fund 11301718 and 11300519, and by Project 11871411 from NSFC and the Shenzhen Research Institute, City University of Hong Kong.
Appendix: Some Lemmas

Lemma 1 When $S$ is $K$-satisfiable, we have
$$\|\tilde f_j - f_j^*\|_n^2 \le C\gamma_n^2 \qquad \text{and} \qquad \|\tilde f_j\|_{\mathcal{H}} \le C.$$

Proof of Lemma 1. The proof of Lemma 2 in Yang et al. (2017) showed that when $\tilde\lambda = 2\gamma_n^2$,
$$\|\tilde f_j - f_j^*\|_n^2 + \tilde\lambda\|\tilde f_j\|_{\mathcal{H}}^2 \le C\gamma_n^2,$$
which immediately implies both results in the statement of the lemma.
Lemma 2
$$E\sup_{f_j\in\mathcal{H}} \frac{\big|\,\|\sum_{j=1}^p f_j\|_n^2 - \|\sum_{j=1}^p f_j\|^2\,\big|}{\big(\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)} = O(\gamma_n).$$

Proof of Lemma 2. By the symmetrization argument, we have
$$\begin{aligned}
&E\sup_{f_j\in\mathcal{H}} \frac{\big|\,\|\sum_{j=1}^p f_j\|_n^2 - \|\sum_{j=1}^p f_j\|^2\,\big|}{\big(\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)}\\
&\quad\le 2E\sup_{f_j\in\mathcal{H}} \frac{(1/n)\sum_i \sigma_i\big(\sum_{j=1}^p f_j(x_{ij})\big)^2}{\big(\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)}\\
&\quad\le 4\sup_{f_j\in\mathcal{H}} \frac{\|\sum_j f_j\|_\infty}{\sum_{j=1}^p\|f_j\|_{\mathcal{H}}}\cdot E\left[\sup_{f_j\in\mathcal{H}} \frac{(1/n)\sum_i \sigma_i\big(\sum_{j=1}^p f_j(x_{ij})\big)}{\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}}\right],
\end{aligned}$$
where $\sigma_i$, $i = 1, \ldots, n$, are i.i.d. Rademacher variables, and the last inequality follows from the contraction inequality for the Rademacher complexity (see, e.g., Theorem 2.2 of Koltchinskii (2011)). Obviously
$$\sup_{f_j\in\mathcal{H}} \frac{\|\sum_j f_j\|_\infty}{\sum_{j=1}^p\|f_j\|_{\mathcal{H}}} \le C,$$
and
$$E\left[\sup_{f_j\in\mathcal{H}} \frac{(1/n)\sum_i \sigma_i\big(\sum_{j=1}^p f_j(x_{ij})\big)}{\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}}\right] \le E\left[\max_j \sup_{f_j\in\mathcal{H}} \frac{(1/n)\sum_i \sigma_i f_j(x_{ij})}{\|f_j\| + \gamma_n\|f_j\|_{\mathcal{H}}}\right].$$
Since $p$ is assumed to be fixed, the lemma would be implied by
$$E\sup_{f\in\mathcal{H}} \frac{(1/n)\sum_i \sigma_i f(x_{ij})}{\|f\| + \gamma_n\|f\|_{\mathcal{H}}} \le C\gamma_n. \qquad (9)$$
Now we use the results in Mendelson (2002), which established that
$$E\left[\sup_{f\in\mathcal{H},\,\|f\|_{\mathcal{H}}\le 1,\,\|f\|\le u} \Big|(1/n)\sum_i \sigma_i f(x_{ij})\Big|\right] \le CR(u). \qquad (10)$$
Since
$$\frac{\|f\|}{\gamma_n^{-1}\|f\| + \|f\|_{\mathcal{H}}} \le \gamma_n, \qquad \frac{\|f\|_{\mathcal{H}}}{\gamma_n^{-1}\|f\| + \|f\|_{\mathcal{H}}} \le C,$$
(10) implies
$$E\sup_{f\in\mathcal{H}} \frac{(1/n)\sum_i \sigma_i f(x_{ij})}{\gamma_n^{-1}\|f\| + \|f\|_{\mathcal{H}}} \le CR(\gamma_n) = C\gamma_n^2,$$
where we used the definition of $\gamma_n$, which satisfies $R(\gamma_n) = \gamma_n^2$. The above is obviously equivalent to (9). This finishes the proof of the lemma.
Using Talagrand's concentration inequality, the above bound in expectation leads to a bound with high probability, as stated in the following lemma.

Lemma 3 With probability at least $1 - c_1\exp\{-c_2 n\gamma_n^2\}$,
$$\Big|\,\big\|\sum_{j=1}^p f_j\big\|_n^2 - \big\|\sum_{j=1}^p f_j\big\|^2\,\Big| \le C\gamma_n\Big(\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\Big)\Big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\Big),$$
for all $f_j \in \mathcal{H}$, $j = 1, \ldots, p$.
Proof of Lemma 3. The function
$$\frac{\big(\sum_{j=1}^p f_j\big)^2}{\big(\sum_j\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)} \qquad (11)$$
is obviously bounded by $C/\gamma_n$. Using
$$\frac{\big(\sum_{j=1}^p f_j\big)^2}{\big(\sum_j\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)} \le \frac{\|\sum_{j=1}^p f_j\|_\infty}{\sum_{j=1}^p\|f_j\|_{\mathcal{H}}}\cdot\frac{|\sum_{j=1}^p f_j|}{\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}} \le C\,\frac{|\sum_{j=1}^p f_j|}{\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}},$$
the variance of (11) is bounded by a constant. Talagrand's concentration inequality implies
$$\sup_{f_j\in\mathcal{H}} \frac{\big|\,\|\sum_{j=1}^p f_j\|_n^2 - \|\sum_{j=1}^p f_j\|^2\,\big|}{\big(\sum_j\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)} \le CE\sup_{f_j\in\mathcal{H}} \frac{\big|\,\|\sum_{j=1}^p f_j\|_n^2 - \|\sum_{j=1}^p f_j\|^2\,\big|}{\big(\sum_j\|f_j\|_{\mathcal{H}}\big)\big(\sum_{j=1}^p\|f_j\| + \gamma_n\sum_{j=1}^p\|f_j\|_{\mathcal{H}}\big)} + C\sqrt{\frac{t}{n}} + C\frac{t}{n\gamma_n},$$
with probability at least $1 - c_1 e^{-c_2 t}$. Setting $t = n\gamma_n^2$ yields the conclusion.
The following lemma is a corollary of Lemma 3.

Lemma 4 With probability at least $1 - c_1\exp\{-c_2 n\gamma_n^2\}$, for all $f = \sum_{j=1}^p f_j$, $f_j\in\mathcal{H}$,
$$\|f\|_n \le C\Big(\|f\| + \gamma_n\sum_j\|f_j\|_{\mathcal{H}}\Big) \qquad \text{and} \qquad \|f\| \le C\Big(\|f\|_n + \gamma_n\sum_j\|f_j\|_{\mathcal{H}}\Big).$$
Proof of Lemma 4. By Lemma 3, we have
$$\begin{aligned}
\|f\|_n^2 &\le \|f\|^2 + C\Big(\gamma_n\sum_j\|f_j\|\sum_j\|f_j\|_{\mathcal{H}} + \gamma_n^2\big(\sum_j\|f_j\|_{\mathcal{H}}\big)^2\Big)\\
&\le \|f\|^2 + C\Big(\gamma_n\|f\|\sum_j\|f_j\|_{\mathcal{H}} + \gamma_n^2\big(\sum_j\|f_j\|_{\mathcal{H}}\big)^2\Big)\\
&\le C\Big(\|f\|^2 + \gamma_n^2\big(\sum_j\|f_j\|_{\mathcal{H}}\big)^2\Big),
\end{aligned} \qquad (12)$$
where the first inequality used Lemma 3, the second inequality used assumption (A), and the last is based on the Cauchy-Schwarz inequality. This proves the first bound of the lemma. To show the second bound of the lemma, we have similarly
$$\begin{aligned}
\|f\|^2 &\le \|f\|_n^2 + C\Big(\gamma_n\|f\|\sum_j\|f_j\|_{\mathcal{H}} + \gamma_n^2\big(\sum_j\|f_j\|_{\mathcal{H}}\big)^2\Big) \qquad (13)\\
&\le \|f\|_n^2 + \frac12\|f\|^2 + \frac{C^2\gamma_n^2}{2}\sum_j\|f_j\|_{\mathcal{H}}^2 + C\gamma_n^2\big(\sum_j\|f_j\|_{\mathcal{H}}\big)^2. \qquad (14)
\end{aligned}$$
Moving $(1/2)\|f\|^2$ on the RHS to the LHS proves the second bound of the lemma.
References

Balcan, M.-F., Liang, Y., Song, L., Woodruff, D., and Xie, B. "Communication efficient distributed kernel principal component analysis." arXiv:1503.06858 (2015).

Banerjee, M., Durot, C., and Sen, B. "Divide and conquer in non-standard problems and the super-efficiency phenomenon." The Annals of Statistics, to appear (2017).

De Vito, S., Massera, E., Piga, M., Martinotto, L., and Di Francia, G. "On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario." Sensors and Actuators B: Chemical, 129(2):750–757 (2008).

Hastie, T. and Tibshirani, R. Generalized Additive Models. Monographs on Statistics and Applied Probability. London; New York: Chapman and Hall, 1st edition (1990).

Huang, J., Horowitz, J. L., and Wei, F. "Variable selection in nonparametric additive models." Annals of Statistics, 38(4):2282–2313 (2010).

Kimeldorf, G. and Wahba, G. "Some results on Tchebycheffian spline functions." Journal of Mathematical Analysis and Applications, 33:82–95 (1971).

Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. New York: Springer (2011).

Koltchinskii, V. and Yuan, M. "Sparsity in multiple kernel learning." The Annals of Statistics, 38(6):3660–3695 (2010).

Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. "Communication-efficient sparse regression." Journal of Machine Learning Research, 18:1–30 (2017).

Lin, Y. and Zhang, H. H. "Component selection and smoothing in multivariate nonparametric regression." Annals of Statistics, 34(5):2272–2297 (2006).

McDonald, R., Mann, G., and Silberman, N. "Efficient large-scale distributed training of conditional maximum entropy models." Proceedings of Advances in Neural Information Processing Systems, 1231–1239 (2009).

Meier, L., van de Geer, S., and Bühlmann, P. "High-dimensional additive modeling." Annals of Statistics, 37(6B):3779–3821 (2009).

Mendelson, S. "Geometric parameters of kernel machines." In International Conference on Computational Learning Theory, 29–43 (2002).

Pilanci, M. and Wainwright, M. J. "Randomized sketches of convex programs with sharp guarantees." IEEE Transactions on Information Theory, 61(9):5096–5115 (2015).

Pilanci, M. and Wainwright, M. J. "Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares." Journal of Machine Learning Research, 17:1–38 (2016).

Raskutti, G., Wainwright, M. J., and Yu, B. "Minimax-optimal rates for sparse additive models over kernel classes via convex programming." The Journal of Machine Learning Research, 13:389–427 (2012).

Ravikumar, P., Liu, H., Lafferty, J., and Wasserman, L. "SpAM: Sparse additive models." In Platt, J., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, 1201–1208. MIT Press, Cambridge, MA (2008).

Shi, C., Lu, W., and Song, R. "A massive data framework for M-estimators with cubic-rate." Journal of the American Statistical Association, to appear (2017).

Suzuki, T. and Sugiyama, M. "Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness." The Annals of Statistics, 41:1381–1405 (2013).

Wang, J., Lee, J. D., Mahdavi, M., Kolar, M., and Srebro, N. "Sketching meets random projection in the dual: A provable recovery algorithm for big and high-dimensional data." Electronic Journal of Statistics, 11(2):4896–4944 (2017).

Wang, S., Gittens, A., and Mahoney, M. W. "Sketched ridge regression: optimization perspective, statistical perspective, and model averaging." Journal of Machine Learning Research, 18:1–50 (2018).

Xue, L. "Consistent variable selection in additive models." Statistica Sinica, 19:1281–1296 (2009).

Yang, Y., Pilanci, M., and Wainwright, M. J. "Randomized sketches for kernels: Fast and optimal nonparametric regression." Annals of Statistics, 45(3):991–1023 (2017).

Zhang, Y., Duchi, J. C., and Wainwright, M. J. "Communication-efficient algorithms for statistical optimization." Journal of Machine Learning Research, 14:3321–3363 (2013).

Zhang, Y., Duchi, J. C., and Wainwright, M. J. "Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates." Journal of Machine Learning Research, 16:3299–3340 (2015).

Zhao, T., Cheng, G., and Liu, H. "A partially linear framework for massive heterogeneous data." Annals of Statistics, 44(4):1400–1437 (2016).

Zinkevich, M., Smola, A., and Weimer, M. "Parallelized stochastic gradient descent." In Proceedings of Advances in Neural Information Processing Systems, 2595–2603 (2010).
Author contributions

Fode Zhang: Writing - original draft, Methodology, Software. Xuejun Wang: Formal analysis, Software. Rui Li: Formal analysis, Software. Heng Lian: Conceptualization, Writing - review and editing, Formal analysis.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fode Zhang was born in Gansu, China, in 1983. He received the B.S. degree in mathematics from Northwest Normal University, Lanzhou, Gansu, China, in 2009, and the Ph.D. degree in statistics from Northwestern Polytechnical University, Xi'an, Shaanxi, China, in 2017. He is currently an Associate Professor in the Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics in Chengdu, China. His research interests include statistical inference, statistical manifolds, and information geometry.
Xuejun Wang was born in Anhui, China, in 1981. He received the B.S. and Ph.D. degrees in Statistics from Anhui University, Anhui, China, in 2007 and 2010, respectively. He is a professor and doctoral supervisor in the School of Mathematical Sciences of Anhui University, Anhui, China. His current research interests include probability limit theory, nonparametric regression estimation, and semiparametric regression estimation.
Rui Li was born in Shanxi, China, in 1980. He received the B.S. degree in Mathematics in 2003 from Taiyuan Normal College, the M.S. degree in Probability and Statistics in 2007 from Jiangsu University, and the Ph.D. degree in Statistics in 2015 from Shanghai University of Finance and Economics, China. He is an associate professor at Shanghai University of International Business and Economics. His current research interests are nonparametric and semiparametric models.

Heng Lian is currently an Associate Professor in the Department of Mathematics, City University of Hong Kong. He previously worked as an Assistant Professor at Nanyang Technological University, Singapore, and later as a Senior Lecturer at the University of New South Wales, Australia. His research interests include mathematical statistics, machine learning, and pattern recognition.