Journal of Statistical Planning and Inference 143 (2013) 276–282
Learning rates of multi-kernel regression by orthogonal greedy algorithm

Hong Chen (a), Luoqing Li (b), Zhibin Pan (a,*)

(a) College of Science, Huazhong Agricultural University, Wuhan 430070, China
(b) Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, China
(*) Corresponding author. E-mail address: [email protected] (Z. Pan).
Article history: Received 22 July 2012; accepted 3 August 2012; available online 16 August 2012.

Abstract
We investigate the problem of regression from multiple reproducing kernel Hilbert spaces by means of an orthogonal greedy algorithm. The greedy algorithm is appealing because it uses only a small portion of the candidate kernels to represent the approximation of the regression function, and can greatly reduce the computational burden of traditional multi-kernel learning. Satisfactory learning rates are obtained based on the Rademacher chaos complexity and data dependent hypothesis spaces.
Keywords: Sparse; Multi-kernel learning; Orthogonal greedy algorithm; Data dependent hypothesis space; Rademacher chaos complexity; Learning rate
1. Introduction

Kernel methods have been extensively used in various learning tasks, and their performance largely depends on the data representation induced by the choice of kernel function. Due to the practical importance of multi-kernel learning, many recent studies in machine learning have been devoted to the data dependent choice of kernel; see, e.g., Lanckriet et al. (2004), Micchelli and Pontil (2005), Wu et al. (2007), Ying and Zhou (2007), Ying and Campbell (2010), and Chen and Li (2010). In the regression setting, the multi-kernel models mentioned above can usually be formulated as a regularized framework in reproducing kernel Hilbert spaces.

Let us recall some basic concepts of multi-kernel regularized regression. Let $X$ be a compact subset of $\mathbb{R}^d$ and let $Y$ be contained in $[-M, M]$. The product space $Z := X \times Y$ is assumed to be measurable and is endowed with an unknown probability measure $\rho$. Input–output pairs $(x, y)$ are sampled according to $\rho$. For every $x \in X$, let $\rho(y\,|\,x)$ be the conditional (w.r.t. $x$) probability measure on $Y$, and let $\rho_X$ be the marginal probability measure on $X$. The error for a measurable function $f: X \to Y$ is the so-called expected risk
$$\mathcal{E}(f) := \|y - f\|_{L^2_\rho}^2 = \int_Z (y - f(x))^2 \, d\rho.$$
It is known that the function which minimizes $\mathcal{E}(f)$ is the regression function defined by
$$f_\rho(x) = \int_Y y \, d\rho(y\,|\,x), \quad x \in X. \qquad (1)$$
From the assumption $y \in [-M, M]$, we know that $|f_\rho(x)| \le M$.
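For reference (this standard identity is not spelled out in the text above, but follows directly from the definitions), the excess risk of any measurable $f$ equals its $L^2_{\rho_X}$ distance to $f_\rho$, which is why $f_\rho$ is the minimizer of $\mathcal{E}$:
$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X = \|f - f_\rho\|_{L^2_{\rho_X}}^2 .$$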
Set $\mathbb{N}_m := \{1, 2, \ldots, m\}$ for any $m \in \mathbb{N}$. A training set of size $m$ is drawn by sampling $m$ independent and identically distributed pairs according to $\rho$,
$$\mathbf{z} := \{z_i, i \in \mathbb{N}_m\} = \{(x_i, y_i), i \in \mathbb{N}_m\} \in Z^m.$$
Throughout the paper, we restrict our attention to a prescribed set $\mathcal{K}$ of candidate Mercer kernels. We say that $K: X \times X \to \mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semi-definite, i.e., for any finite set of distinct points $\{x_1, x_2, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semi-definite. The candidate reproducing kernel Hilbert space $\mathcal{H}_K$ associated with a Mercer kernel $K$ is defined as the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$, equipped with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_K}$ defined by $\langle K_x, K_y \rangle_{\mathcal{H}_K} = K(x, y)$. The reproducing property is given by
$$\langle K_x, f \rangle_{\mathcal{H}_K} = f(x), \quad \forall x \in X, \ f \in \mathcal{H}_K. \qquad (2)$$
Denote by $C(X)$ the space of continuous functions on $X$ with the supremum norm $\|\cdot\|_\infty$. Because of the continuity of each $K \in \mathcal{K}$ and the compactness of $X$, we have
$$\kappa := \sup_{K \in \mathcal{K}} \sup_{x \in X} \sqrt{K(x, x)} < \infty.$$
So, the reproducing property above tells us that
$$\|f\|_\infty \le \kappa \|f\|_K, \quad \forall f \in \mathcal{H}_K.$$
The empirical error with respect to the random samples $\mathbf{z}$ is defined as
$$\mathcal{E}_{\mathbf{z}}(f) := \|y - f\|_m^2 = \frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2,$$
where $\|\cdot\|_m$ is the $L^2$ norm with respect to the discrete measure $\frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}$, with $\delta_u$ the Dirac measure at $u$. In general, the regularization scheme of multi-kernel regression is defined as the two-layer minimization problem
$$f_{\mathbf{z},\lambda} := \arg\min_{K \in \mathcal{K}} \ \min_{f \in \mathcal{H}_K} \{\mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2\}, \quad \lambda > 0. \qquad (3)$$
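To make the two-layer scheme (3) concrete, the following sketch (in Python; the helper names, the candidate kernel family, and the Gaussian-width example are our own illustrative choices, not taken from the paper) solves the inner kernel ridge regression in its usual closed form for each candidate kernel and then selects the kernel with the smallest regularized objective.

```python
import numpy as np

def kernel_ridge(K_mat, y, lam):
    """Inner minimization of (3) for a fixed kernel: the minimizer of
    E_z(f) + lam * ||f||_K^2 over H_K has the form f = sum_i alpha_i K_{x_i}
    with alpha = (K + m * lam * I)^{-1} y."""
    m = len(y)
    alpha = np.linalg.solve(K_mat + m * lam * np.eye(m), y)
    fitted = K_mat @ alpha
    objective = np.mean((y - fitted) ** 2) + lam * alpha @ K_mat @ alpha
    return alpha, objective

def multi_kernel_regression(X, y, candidate_kernels, lam):
    """Outer minimization of (3): pick the candidate kernel whose inner
    solution attains the smallest regularized empirical objective."""
    best = None
    for kernel in candidate_kernels:
        K_mat = kernel(X, X)
        alpha, obj = kernel_ridge(K_mat, y, lam)
        if best is None or obj < best[0]:
            best = (obj, kernel, alpha)
    return best  # (objective, selected kernel, expansion coefficients)

def gaussian_kernel(sigma):
    """Illustrative candidate kernels: Gaussians with different widths."""
    def k(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2 * sigma ** 2))
    return k

# Example usage (synthetic data):
# X = np.random.rand(50, 2); y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
# best = multi_kernel_regression(X, y, [gaussian_kernel(s) for s in (0.1, 0.5, 1.0)], lam=0.01)
```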
Its error analysis has been well developed with various techniques in learning theory (see, e.g., Lanckriet et al., 2004; Micchelli and Pontil, 2005; Ying and Zhou, 2007; Ying and Campbell, 2010; Chen and Li, 2010). Here we are interested in the case where the total number $n$ of candidate kernels $\mathcal{K} = \{K^j : j \in \mathbb{N}_n\}$ is large, but only a relatively small number of them is necessary to represent the approximation of the regression function $f_\rho$. Note that the solution of (3) belongs to the hypothesis space
$$\mathcal{H}_{\mathbf{z},\mathcal{K}} = \left\{ f = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i^j K^j_{x_i} : a_i^j \in \mathbb{R} \right\},$$
which involves expansions over all the candidate kernels and all the training data. This may result in a heavy computational burden when the number of candidate kernels is large. Thus, a sparse representation of the solution is crucial for improving the efficiency of multi-kernel learning. Only recently have the sparsity properties of multi-kernel learning been studied, in Koltchinskii and Yuan (2008, 2010). For the multiple kernel regularized method with a sparsity penalty, an oracle inequality for the excess risk is established in Koltchinskii and Yuan (2008).

In this paper, we consider realizing a sparse representation by greedy selection of important kernels and training samples. We denote the set of candidate kernels by $\mathcal{K}^{k_{\mathbf{z}}}$, where $k_{\mathbf{z}}$ is the number of candidate kernels remaining after $k$ steps of feature selection. The hypothesis space based on kernel selection is defined by
$$\mathcal{H}^{k_{\mathbf{z}}}_{\mathbf{z},\mathcal{K}} = \left\{ f = \sum_{i=1}^{m} \sum_{j=1}^{k_{\mathbf{z}}} a_i^j K^j_{x_i} : a_i^j \in \mathbb{R} \right\},$$
with $\ell^1$ norm
$$\|f\|_{\ell^1} = \inf \left\{ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_i^j| : f = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i^j K^j_{x_i} \right\}.$$
The data dependent hypothesis space $\mathcal{H}_{\mathbf{z},\mathcal{K}}$ can be considered a natural extension of the single-kernel setting in Xiao and Zhou (2010) and Shi et al. (2011) to the multi-kernel setting. Based on this hypothesis space, a new multi-kernel orthogonal greedy algorithm (MOGA) is introduced in Table 1. The algorithm in Table 1 can be divided into two parts: selecting features $\{f_k\}$ and solving the empirical risk minimization to derive $\hat{f}_k$. In fact, the goal of the kernel normalization is to make the error analysis feasible; it does not affect the predictive performance of the algorithm. There have been some statistical analyses of orthogonal greedy algorithms in learning problems (Barron et al., 2008; Zhang, 2009). However, to the best of our knowledge, there are no studies of kernel choice by greedy algorithms.
Table 1
Multi-kernel orthogonal greedy algorithm (MOGA).

Input: $\mathbf{z} \in Z^m$, the candidate kernel set $\mathcal{K}$, and $T > 0$.
Step 1 (Normalization): $\tilde{K}^j_{x_i} = K^j_{x_i} / \|K^j_{x_i}\|_m$, $i \in \mathbb{N}_m$, $j \in \mathbb{N}_n$.
  Dictionary: $\mathcal{D}_{m+n} = \{\tilde{K}^j_{x_i} : i \in \mathbb{N}_m, j \in \mathbb{N}_n\}$.
Step 2 (Computation): Let $\hat{f}_0 = 0$.
  for $k = 1, 2, \ldots$
    let $f_k = \arg\max_{g \in \mathcal{D}_{m+n}} \left| \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{f}_{k-1}(x_i))\, g(x_i) \right|$
    let $\hat{\mathcal{H}}_k = \mathrm{Span}(f_1, \ldots, f_k)$
    let $\hat{f}_k = \arg\min_{h \in \hat{\mathcal{H}}_k} \mathcal{E}_{\mathbf{z}}(h) = \arg\min_{K \in \mathcal{K}^{k_{\mathbf{z}}}} \min_{f \in \mathcal{H}_K} \mathcal{E}_{\mathbf{z}}(f)$
    if $\|y - \hat{f}_k\|_m^2 + \|\hat{f}_k\|_{\ell^1} \le \|y\|_m^2$ and $k \ge T$: break
  end
Output: $\hat{f}_k$
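The following is a minimal sketch of the procedure in Table 1 (in Python). It is our own illustrative reading, not the authors' code: each dictionary atom is stored as its vector of values on the sample, the selection step picks the atom with the largest absolute empirical correlation with the current residual, and the stopping rule mirrors the condition in Table 1; all function and variable names are ours.

```python
import numpy as np

def moga(X, y, candidate_kernels, T):
    """Sketch of the multi-kernel orthogonal greedy algorithm (Table 1).
    Dictionary atoms are the kernel sections K^j_{x_i}, represented by their
    value vectors (K^j(x_i, x_1), ..., K^j(x_i, x_m)) normalized in ||.||_m."""
    m = len(y)
    # Step 1: build the normalized dictionary.
    atoms = []
    for kernel in candidate_kernels:
        K_mat = kernel(X, X)                    # K_mat[i, s] = K^j(x_i, x_s)
        for i in range(m):
            col = K_mat[i]
            atoms.append(col / np.sqrt(np.mean(col ** 2)))
    D = np.array(atoms)                         # one row per dictionary atom

    # Step 2: greedy selection and empirical risk minimization.
    selected, f_hat = [], np.zeros(m)
    for k in range(1, D.shape[0] + 1):
        residual = y - f_hat
        corr = np.abs(D @ residual) / m         # |1/m sum_i (y_i - f_hat(x_i)) g(x_i)|
        selected.append(int(np.argmax(corr)))
        basis = D[selected].T                   # m x k design matrix of chosen atoms
        coef, *_ = np.linalg.lstsq(basis, y, rcond=None)   # least squares on Span(f_1,...,f_k)
        f_hat = basis @ coef
        # Stopping rule, with sum(|coef|) standing in for the ell^1 norm of f_hat.
        if np.mean((y - f_hat) ** 2) + np.sum(np.abs(coef)) <= np.mean(y ** 2) and k >= T:
            break
    return f_hat, selected
```

In practice one would also track which candidate kernels the selected atoms come from (giving $k_{\mathbf{z}}$), but that bookkeeping is omitted from the sketch.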
Our method tries to bring together three distinct concepts that have received independent attention in learning theory: multi-kernel learning (Ying and Campbell, 2010; Koltchinskii and Yuan, 2008), orthogonal greedy algorithms for feature selection (Barron et al., 2008; Zhang, 2009), and error analysis with data dependent hypothesis spaces (Wu et al., 2007; Wu and Zhou, 2008; Xiao and Zhou, 2010; Shi et al., 2011). The goal is to establish an estimate of the excess risk $\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho)$. Based on the data dependent feature selection, we derive satisfactory learning rates which are faster than $(\log n / m)^{1/2}$.

The paper is organized as follows. In Section 2, we decompose the excess risk into the sample error, the hypothesis error, and the approximation error. Estimates of the sample error and the hypothesis error are given in Sections 3 and 4, respectively. The explicit learning rate of MOGA is presented in Section 5.

2. Error decomposition

To compare $\hat{f}_k$, which lies in the data dependent space $\mathcal{H}^{k_{\mathbf{z}}}_{\mathbf{z},\mathcal{K}}$, with the target $f_\rho$, we now introduce a data-independent function space.
Definition 1. Define the hypothesis space
$$\mathcal{H}_\infty = \left\{ f = \sum_{i=1}^{\infty} \sum_{j=1}^{n} a_i^j \tilde{K}^j_{u_i} : \{a_i^j\} \in \ell^1, \ \{u_i\} \subset X, \ \tilde{K}^j_{u_i} = K^j_{u_i} / \|K^j_{u_i}\|_{L^2_{\rho_X}} \right\},$$
equipped with the norm
$$\|f\|_{\mathcal{H}_\infty} = \inf \left\{ \sum_{i=1}^{\infty} \sum_{j=1}^{n} |a_i^j| : f = \sum_{i=1}^{\infty} \sum_{j=1}^{n} a_i^j \tilde{K}^j_{u_i} \right\}.$$
In order to investigate how well $\hat{f}_k$ approximates $f_\rho$, we introduce the regularizing function
$$f_\lambda = \arg\min_{f \in \mathcal{H}_\infty} \{\mathcal{E}(f) + \lambda \|f\|_{\mathcal{H}_\infty}\}. \qquad (4)$$
Now we present the following error decomposition.

Proposition 1. Let $f_\rho$, $f_\lambda$, and $\hat{f}_k$ be defined by (1), (4), and Table 1, respectively. Then we have
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) \le \mathcal{S}(\mathbf{z}, \lambda) + \mathcal{H}(\mathbf{z}, \lambda) + \mathcal{D}(\lambda),$$
where
$$\mathcal{S}(\mathbf{z}, \lambda) = \mathcal{E}(\hat{f}_k) - \mathcal{E}_{\mathbf{z}}(\hat{f}_k) + \mathcal{E}_{\mathbf{z}}(f_\lambda) - \mathcal{E}(f_\lambda),$$
$$\mathcal{H}(\mathbf{z}, \lambda) = \mathcal{E}_{\mathbf{z}}(\hat{f}_k) - \mathcal{E}_{\mathbf{z}}(f_\lambda),$$
$$\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda \|f_\lambda\|_{\mathcal{H}_\infty}.$$
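A one-line sketch of why the decomposition holds (added here for readability; it simply inserts and removes the empirical risks and uses $\lambda \|f_\lambda\|_{\mathcal{H}_\infty} \ge 0$):
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) = \big[\mathcal{E}(\hat{f}_k) - \mathcal{E}_{\mathbf{z}}(\hat{f}_k) + \mathcal{E}_{\mathbf{z}}(f_\lambda) - \mathcal{E}(f_\lambda)\big] + \big[\mathcal{E}_{\mathbf{z}}(\hat{f}_k) - \mathcal{E}_{\mathbf{z}}(f_\lambda)\big] + \big[\mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho)\big] \le \mathcal{S}(\mathbf{z}, \lambda) + \mathcal{H}(\mathbf{z}, \lambda) + \mathcal{D}(\lambda).$$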
The first term $\mathcal{S}(\mathbf{z}, \lambda)$ is called the sample error. The second term $\mathcal{H}(\mathbf{z}, \lambda)$ is called the hypothesis error; it depends on the data dependent hypothesis space $\mathcal{H}^{k_{\mathbf{z}}}_{\mathbf{z},\mathcal{K}}$, which need not contain the regularizing function $f_\lambda \in \mathcal{H}_\infty$. The last term $\mathcal{D}(\lambda)$ is called the approximation error.

3. Estimate of sample error

In this paper we adopt the Rademacher chaos complexity as the capacity measure of the hypothesis space. It has been discussed in a more general setting in Srebro and Ben-David (2006), and has recently been used in the error analysis of learning algorithms in Clemencon et al. (2008), Ying and Campbell (2010), and Chen and Li (2010).
Definition 2. Let $\mathcal{F}$ be a class of functions on $X \times X$ and let $\{\epsilon_i : i \in \mathbb{N}_m\}$ be independent Rademacher random variables. Moreover, let $\mathbf{x} = \{x_i : i \in \mathbb{N}_m\}$ be independent random variables distributed according to a distribution $\mu$ on $X$. The homogeneous Rademacher chaos of order two, with respect to the Rademacher random variables $\epsilon$, is the random variable system
$$\left\{ \hat{U}_g(\epsilon) = \frac{1}{m} \sum_{i,j \in \mathbb{N}_m,\, i < j} \epsilon_i \epsilon_j\, g(x_i, x_j) \ :\ g \in \mathcal{F} \right\}.$$
We refer to the expectation of its supremum,
$$\hat{U}_m(\mathcal{F}) = \mathbb{E}_\epsilon \Big[ \sup_{g \in \mathcal{F}} |\hat{U}_g(\epsilon)| \Big],$$
as the empirical Rademacher chaos complexity.
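As a concrete illustration of this quantity (our own sketch in Python, not part of the paper), $\hat{U}_m(\mathcal{K})$ for a finite kernel family can be approximated by Monte Carlo over Rademacher draws, given the Gram matrices of the candidate kernels on the sample:

```python
import numpy as np

def empirical_rademacher_chaos(kernel_matrices, n_draws=200, rng=None):
    """Monte Carlo estimate of U_m(K) = E_eps sup_K |(1/m) sum_{i<j} eps_i eps_j K(x_i, x_j)|
    for a finite family of kernels, given their Gram matrices on the sample."""
    rng = np.random.default_rng() if rng is None else rng
    m = kernel_matrices[0].shape[0]
    iu = np.triu_indices(m, k=1)            # index pairs with i < j
    sups = []
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=m)
        chaos = np.outer(eps, eps)[iu]      # eps_i * eps_j for i < j
        vals = [abs(np.sum(chaos * K[iu])) / m for K in kernel_matrices]
        sups.append(max(vals))
    return float(np.mean(sups))
```

For a finite family of n base kernels, Lemma 1 below bounds this quantity by $25 e \kappa^2 \log(n+1)$, so such a numerical estimate is mainly useful as a sanity check.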
The following result is presented in Ying and Campbell (2010) for a set of finitely many base kernels.

Lemma 1. If $\mathcal{K}$ consists of a finite number of kernels, $\mathcal{K} = \{K^j, j \in \mathbb{N}_n\}$, then
$$\hat{U}_m(\mathcal{K}) \le 25 e \kappa^2 \log(n+1).$$
Since $k_{\mathbf{z}} < k$, we have $\hat{U}_m(\mathcal{K}^{k_{\mathbf{z}}}) \le 25 e \kappa^2 \log(k_{\mathbf{z}}+1) \le 25 e \kappa^2 \log(k+1)$.

Lemma 2. For almost every $\mathbf{z} \in Z^m$, we have $\|\hat{f}_k\|_{\mathcal{H}_\infty} \le \kappa M^2$ and $\|f_\lambda\|_{\mathcal{H}_\infty} \le \mathcal{D}(\lambda)/\lambda$.

Proof. From the definition of $\hat{f}_k$ in Table 1, we know that $\|\hat{f}_k\|_{\ell^1} \le M^2$. The bound on $\hat{f}_k$ then follows from the inequality $\|\hat{f}_k\|_{\mathcal{H}_\infty} \le \kappa \|\hat{f}_k\|_{\ell^1}$. Based on the definition of $f_\lambda$, we can derive the desired upper bound. □

In order to obtain a uniform upper bound on $\mathcal{S}(\mathbf{z}, \lambda)$, we consider the ball
$$B^R_{\mathcal{K}} = \{f \in \mathcal{H}_\infty : \|f\|_{\mathcal{H}_\infty} \le R\},$$
where $R = \max\{\kappa M^2, \mathcal{D}(\lambda)/\lambda\}$. We have now assembled the material needed for the following estimate of the sample error. The proof is a simple extension of the uniform convergence bounds in Ying and Zhou (2007) and Chen and Li (2010).

Proposition 2. For any $0 < \delta < 1$, with probability at least $1-\delta$, there holds
$$\mathcal{S}(\mathbf{z}, \lambda) \le 160(M + \kappa R)\kappa R \sqrt{\frac{\log k}{m}} + 3(M + \kappa R)^2 \sqrt{\frac{\log(2/\delta)}{m}} + \frac{8(M + \kappa R)R + 2M^2}{\sqrt{m}}.$$
Proof. By McDiarmid's bounded difference inequality (see, e.g., Theorem 9.2 of Devroye et al., 1997), with probability at least $1-\delta/2$ there holds
$$\mathcal{S}(\mathbf{z}, \lambda) \le \sup_{f \in B^R_{\mathcal{K}}} |\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)| \le \mathbb{E} \sup_{f \in B^R_{\mathcal{K}}} |\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)| + (M + \kappa R)^2 \sqrt{\frac{\ln(2/\delta)}{m}}. \qquad (5)$$
The first term on the right-hand side of (5) can be estimated by the standard symmetrization argument. Indeed, writing $V(y, f(x)) := (y - f(x))^2$ for the least squares loss, with probability at least $1-\delta/2$ we have
$$\mathbb{E} \sup_{f \in B^R_{\mathcal{K}}} |\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)| \le 2\, \mathbb{E}\, \mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \left| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \epsilon_i V(y_i, f(x_i)) \right| \le 2\, \mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \left| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \epsilon_i V(y_i, f(x_i)) \right| + 2(M + \kappa R)^2 \sqrt{\frac{\ln(2/\delta)}{m}}. \qquad (6)$$
Note that $\|f\|_\infty \le \kappa \|f\|_{\mathcal{H}_\infty} \le \kappa R$ for all $f \in B^R_{\mathcal{K}}$. Now, we define the truncated function $\bar{V}(y, f(x)) = V(y, f(x)) - V(y, 0)$. Then we know that $\bar{V}(y, \cdot): \mathbb{R} \to \mathbb{R}$ has Lipschitz constant $2(M + \kappa R)$ with respect to $f$ and $\bar{V}(y, 0) = 0$. Applying the concentration property of Rademacher averages (see, e.g., Theorem 12 of Bartlett and Mendelson, 2002) to the space $B^R_{\mathcal{K}}$ implies that, with probability $1-\delta/2$,
$$\mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i V(y_i, f(x_i)) \Big| \le \mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i \bar{V}(y_i, f(x_i)) \Big| + \mathbb{E}_\epsilon \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i V(y_i, 0) \Big|$$
$$\le 4(M + \kappa R)\, \mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i f(x_i) \Big| + \Big( \mathbb{E}_\epsilon \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i V(y_i, 0) \Big|^2 \Big)^{1/2}$$
$$= 4(M + \kappa R)\, \mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i f(x_i) \Big| + \Big( \mathbb{E}_\epsilon \Big[ \sum_{i,j \in \mathbb{N}_m} \epsilon_i \epsilon_j V(y_i, 0) V(y_j, 0) \Big] \Big)^{1/2}.$$
It follows that
$$\mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i V(y_i, f(x_i)) \Big| \le 4(M + \kappa R)\, \mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i f(x_i) \Big| + M^2 \sqrt{m}.$$
Finally, based on the reproducing property (2), we observe that
$$\mathbb{E}_\epsilon \sup_{f \in B^R_{\mathcal{K}}} \Big| \sum_{i \in \mathbb{N}_m} \epsilon_i f(x_i) \Big| = R\, \mathbb{E}_\epsilon \sup_{K \in \mathcal{K}} \sup_{\|f\|_{\mathcal{H}_K} \le 1} \Big| \Big\langle \sum_{i \in \mathbb{N}_m} \epsilon_i K_{x_i}, f \Big\rangle_{\mathcal{H}_K} \Big| = R\, \mathbb{E}_\epsilon \sup_{K \in \mathcal{K}} \Big( \sum_{i,j \in \mathbb{N}_m} \epsilon_i \epsilon_j K(x_i, x_j) \Big)^{1/2}$$
$$\le R\, \mathbb{E}_\epsilon \sup_{K \in \mathcal{K}} \Big( \sum_{i,j \in \mathbb{N}_m,\, i \ne j} \epsilon_i \epsilon_j K(x_i, x_j) \Big)^{1/2} + R\, \mathbb{E}_\epsilon \sup_{K \in \mathcal{K}} \Big( \sum_{i \in \mathbb{N}_m} K(x_i, x_i) \Big)^{1/2} \le R\sqrt{2m \hat{U}_m(\mathcal{K})} + R\kappa\sqrt{m} \le R\sqrt{50 e \kappa^2 m \log(k+1)} + R\kappa\sqrt{m},$$
where we use the fact that the trace of $(K(x_i, x_j))_{i,j \in \mathbb{N}_m}$ is no more than $\kappa^2 m$. Combining the above inequalities with (5) and (6) yields the desired result. □

4. Estimate of hypothesis error

Denote
$$\mathcal{H}^m_\infty = \left\{ f = \sum_{i=1}^{\infty} \sum_{j=1}^{n} a_i^{j,m} \tilde{K}^{j,m}_{u_i} : a_i^{j,m} = a_i^j \|\tilde{K}^j_{u_i}\|_m, \ \tilde{K}^{j,m}_{u_i} = \tilde{K}^j_{u_i} / \|\tilde{K}^j_{u_i}\|_m, \ \sum_{i=1}^{\infty} \sum_{j=1}^{n} a_i^j \tilde{K}^j_{u_i} \in \mathcal{H}_\infty \right\}$$
with the norm
$$\|f\|_{\mathcal{H}^m_\infty} = \inf \left\{ \sum_{i=1}^{\infty} \sum_{j=1}^{n} |a_i^{j,m}| = \sum_{i=1}^{\infty} \sum_{j=1}^{n} \|\tilde{K}^j_{u_i}\|_m\, |a_i^j| : f = \sum_{i=1}^{\infty} \sum_{j=1}^{n} a_i^{j,m} \tilde{K}^{j,m}_{u_i} \right\}.$$
To estimate the difference of empirical error between $\hat{f}_k$ and $f_\lambda$, we introduce the following inequality (Theorem 2.3 in Barron et al., 2008).

Lemma 3. For any $f \in \mathcal{H}_\infty$, the error of MOGA in Table 1 satisfies
$$\mathcal{E}_{\mathbf{z}}(\hat{f}_k) - \mathcal{E}_{\mathbf{z}}(f) \le \frac{4\|f\|^2_{\mathcal{H}^m_\infty}}{k}.$$
Since $\|f\|^2_{\mathcal{H}^m_\infty}$ depends on $\mathbf{z}$, we must further relate it to $\|f\|^2_{\mathcal{H}_\infty}$. The estimate of $\mathcal{H}(\mathbf{z}, \lambda)$ below is proved by means of the Hoeffding inequality.

Proposition 3. Assume that $K(u, v) \ge k_0$ for each $K \in \mathcal{K}$ and $u, v \in X$. For any $\delta > 0$, the inequality
$$\mathcal{H}(\mathbf{z}, \lambda) \le \min\left\{ \kappa^2 k_0^{-2},\ \Big(1 + \kappa k_0^{-1}\sqrt{\frac{\log(m/\delta)}{2m}}\Big)^2 \right\} \frac{\mathcal{D}^2(\lambda)}{k\lambda^2}$$
holds with confidence at least $1-\delta$.
Proof. From the definitions of $\|f\|_{\mathcal{H}^m_\infty}$ and $\|f\|_{\mathcal{H}_\infty}$, we know that $\|f\|_{\mathcal{H}^m_\infty} \le \kappa k_0^{-1} \|f\|_{\mathcal{H}_\infty}$. We also note that $\mathbb{E}|\tilde{K}^j(u_i, x)|^2 = 1$ and
$$\big| \|\tilde{K}^j_{u_i}\|_m - 1 \big| = \left| \sqrt{\frac{1}{m}\sum_{s=1}^{m} |\tilde{K}^j(u_i, x_s)|^2} - 1 \right| \le \left| \frac{1}{m}\sum_{s=1}^{m} |\tilde{K}^j(u_i, x_s)|^2 - 1 \right|.$$
Then, based on the Hoeffding inequality, for any $i$ we have
$$P\left\{ \left| \frac{1}{m}\sum_{s=1}^{m} |\tilde{K}^j(u_i, x_s)|^2 - \mathbb{E}|\tilde{K}^j(u_i, x)|^2 \right| \ge \epsilon \right\} \le \exp\left\{ -\frac{2k_0^2 \epsilon^2 m}{\kappa^2} \right\}.$$
By setting $\delta = \exp\{-2k_0^2 \epsilon^2 m / \kappa^2\}$, we have with confidence $1-\delta$,
$$\|\tilde{K}^j_{u_i}\|_m \le \left| \frac{1}{m}\sum_{s=1}^{m} |\tilde{K}^j(u_i, x_s)|^2 - \mathbb{E}|\tilde{K}^j(u_i, x)|^2 \right| + 1 \le 1 + \kappa k_0^{-1}\sqrt{\frac{\log(1/\delta)}{2m}}. \qquad (7)$$
Hence,
$$\|f_\lambda\|^2_{\mathcal{H}^m_\infty} \le \left(1 + \kappa k_0^{-1}\sqrt{\frac{\log(1/\delta)}{2m}}\right)^2 \|f_\lambda\|^2_{\mathcal{H}_\infty}$$
holds true with confidence at least $1-m\delta$. Finally, combining the above inequality with Lemmas 2 and 3, we derive the desired result. □
5. Learning rate

In this paper, we adopt the following condition on $\mathcal{D}(\lambda)$, which can be found in Xiao and Zhou (2010) and Shi et al. (2011).

Definition 3. We say the target function $f_\rho$ can be approximated with exponent $0 < q \le 1$ in $\mathcal{H}_\infty$ if there exists a constant $c_q \ge 1$ such that
$$\mathcal{D}(\lambda) \le c_q \lambda^q, \quad \forall \lambda > 0.$$
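As a simple illustration (ours, not from the paper): if $f_\rho$ itself belongs to $\mathcal{H}_\infty$, then comparing $f_\lambda$ with the competitor $f = f_\rho$ in the minimization (4) gives
$$\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda \|f_\lambda\|_{\mathcal{H}_\infty} \le \mathcal{E}(f_\rho) - \mathcal{E}(f_\rho) + \lambda \|f_\rho\|_{\mathcal{H}_\infty} = \lambda \|f_\rho\|_{\mathcal{H}_\infty},$$
so the condition holds with $q = 1$ and $c_q = \max\{1, \|f_\rho\|_{\mathcal{H}_\infty}\}$.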
We now formulate the convergence rate for MOGA as defined in Table 1.

Theorem 1. Assume that $f_\rho$ can be approximated with exponent $0 < q \le 1$ in $\mathcal{H}_\infty$. Choose $T = m^\gamma$ satisfying $T < \min\{e^m, mn\}$ for some $\gamma > 0$. Then, for any $0 < \delta < 1$, with confidence $1-\delta$,
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) \le
\begin{cases}
c_1 \log(1/\delta) \left(\dfrac{\log m}{m}\right)^{q/(4-2q)}, & \gamma \ge \sqrt{\dfrac{m^{1-2\gamma}}{\log m}}, \\[2ex]
c_2 \log(1/\delta)\, m^{q\gamma(q-2)}, & \gamma \le \sqrt{\dfrac{m^{1-2\gamma}}{\log m}}.
\end{cases}$$
Here the constants $c_1, c_2$ are independent of $m$, $k$, and $\delta$.

Proof. Based on Propositions 1-3, we have
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) \le 160(M + \kappa R)\kappa R \sqrt{\frac{\log k}{m}} + 3(M + \kappa R)^2 \sqrt{\frac{\log(2/\delta)}{m}} + \frac{8(M + \kappa R)R + 2M^2}{\sqrt{m}} + \mathcal{D}(\lambda) + \min\left\{ \kappa^2 k_0^{-2},\ \Big(1 + \kappa k_0^{-1}\sqrt{\frac{\log(m/\delta)}{2m}}\Big)^2 \right\} \frac{\mathcal{D}^2(\lambda)}{k\lambda^2}$$
with confidence $1-2\delta$. Then, from the condition on $\mathcal{D}(\lambda)$, we have with confidence $1-\delta$:
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) \le c \log(1/\delta) \left( \sqrt{\frac{\log k}{m}} + \lambda^q + \frac{\lambda^{2q-2}}{k} \right),$$
where $c$ is a constant independent of $m$, $\delta$, and $k$.

When $\gamma \ge \sqrt{m^{1-2\gamma}/\log m}$, by setting $\lambda = (\log k / m)^{1/(4-2q)}$, we have with confidence $1-\delta$
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) \le 3c \log(1/\delta) \left(\frac{\log k}{m}\right)^{q/(4-2q)}.$$
When $\gamma \le \sqrt{m^{1-2\gamma}/\log m}$, by setting $\lambda = k^{q-2}$, we have with confidence $1-\delta$
$$\mathcal{E}(\hat{f}_k) - \mathcal{E}(f_\rho) \le 3c \log(1/\delta)\, k^{-(2-q)q}.$$
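(A brief check of the first case, added for the reader under our reading that the output step satisfies $k \ge \sqrt{m/\log k}$, which the choice $T = m^\gamma$ is meant to guarantee: with $\lambda = (\log k/m)^{1/(4-2q)}$ and $m \ge \log k$,
$$\lambda^q = \Big(\frac{\log k}{m}\Big)^{q/(4-2q)}, \qquad \sqrt{\frac{\log k}{m}} \le \lambda^q \ \text{ since } \frac{q}{4-2q} \le \frac{1}{2}, \qquad \frac{\lambda^{2q-2}}{k} = \frac{1}{k}\Big(\frac{m}{\log k}\Big)^{(2-2q)/(4-2q)} \le \lambda^q \ \text{ when } k \ge \Big(\frac{m}{\log k}\Big)^{1/2},$$
so each of the three terms in the bound above is at most $\lambda^q$, which accounts for the factor 3.)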
The proof of Theorem 1 is complete. □

When $\gamma \ge \sqrt{m^{1-2\gamma}/\log m}$ and $q \to 1$, the convergence rate is of the order $(\log m / m)^{1/2}$. For $n \gg m$, our learning rate is faster than $(\log n / m)^{1/2}$. In the special single-kernel setting, our convergence rate is satisfactory compared with $\ell^1$-regularization methods, e.g., $O(n^{-1/(3(d+1))})$ in Xiao and Zhou (2010) and $O(n^{-1/5})$ in Sun and Wu (2011). Although we need an additional lower bound on the kernel functions $K$, our result does not need the restrictive conditions on $X$ and $\rho$ imposed in Xiao and Zhou (2010) and Shi et al. (2011).
In Raskutti et al. (2009), a minimax lower bound scaling as $\log n / m$ is obtained for the type of problems discussed here. When $n / e^{\sqrt{m \log m}} = O(1)$, our upper bound essentially coincides with this lower bound, so MOGA attains minimax optimal rates under a suitable choice of $T$. Finally, our error analysis could be extended to the non-i.i.d. case by introducing techniques as in Xu and Chen (2008), Zou et al. (2011), and Farahmand and Szepesvári (2012); we leave this for future study.
Acknowledgments

This work was supported partially by the National Natural Science Foundation of China (NSFC) under Grant Nos. 11001092 and 11071058, the Natural Science Foundation of Hubei Province under Grant No. 2010CDA008, and the Fundamental Research Funds for the Central Universities (Program Nos. 2011PY130, 2011QC022).

References

Barron, A.R., Cohen, A., Dahmen, W., DeVore, R., 2008. Approximation and learning by greedy algorithms. Annals of Statistics 36, 64-94.
Bartlett, P.L., Mendelson, S., 2002. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, 463-482.
Chen, H., Li, L.Q., 2010. Learning rates of multi-kernel regularized regression. Journal of Statistical Planning and Inference 140, 2562-2568.
Clemencon, S., Lugosi, G., Vayatis, N., 2008. Ranking and empirical minimization of U-statistics. Annals of Statistics 36, 844-874.
Devroye, L., Györfi, L., Lugosi, G., 1997. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Farahmand, A., Szepesvári, C., 2012. Regularized least-squares regression: learning from a β-mixing sequence. Journal of Statistical Planning and Inference 142, 493-505.
Koltchinskii, V., Yuan, M., 2008. Sparse recovery in large ensembles of kernel machines. In: COLT 2008, pp. 229-238.
Koltchinskii, V., Yuan, M., 2010. Sparsity in multiple kernel learning. Annals of Statistics 38, 3660-3695.
Lanckriet, G., Cristianini, N., Bartlett, P.L., Ghaoui, L., Jordan, M.I., 2004. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5, 27-72.
Micchelli, C.A., Pontil, M., 2005. Learning the kernel function via regularization. Journal of Machine Learning Research 6, 1099-1125.
Raskutti, G., Wainwright, M., Yu, B., 2009. Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness. In: NIPS 20, pp. 1563-1570.
Shi, L., Feng, Y.L., Zhou, D.X., 2011. Concentration estimates for learning with $\ell^1$-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis 31, 286-302.
Srebro, N., Ben-David, S., 2006. Learning bounds for support vector machines with learned kernels. In: Lugosi, G., Simon, H.U. (Eds.), Proceedings of the 19th Annual Conference on Learning Theory, Pittsburgh, PA, USA, pp. 169-183.
Sun, H.W., Wu, Q., 2011. Least square regression with indefinite kernels and coefficient regularization. Applied and Computational Harmonic Analysis 30, 96-109.
Wu, Q., Ying, Y., Zhou, D.X., 2007. Multi-kernel regularized classifiers. Journal of Complexity 23, 108-134.
Wu, Q., Zhou, D.X., 2008. Learning with sample dependent hypothesis spaces. Computers and Mathematics with Applications 56, 2896-2907.
Xiao, Q.W., Zhou, D.X., 2010. Learning by nonsymmetric kernel with data dependent spaces and $\ell^1$-regularizer. Taiwanese Journal of Mathematics 14, 1821-1836.
Xu, Y.L., Chen, D.R., 2008. Learning rates of regularized regression for exponentially strongly mixing sequence. Journal of Statistical Planning and Inference 138, 2180-2189.
Ying, Y., Campbell, C., 2010. Rademacher chaos complexities for learning the kernel problem. Neural Computation 22, 2858-2886.
Ying, Y., Zhou, D.X., 2007. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research 8, 249-276.
Zhang, T., 2009. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research 10, 555-568.
Zou, B., Chen, R., Xu, Z.B., 2011. Learning performance of Tikhonov regularization algorithm with geometrically beta-mixing observations. Journal of Statistical Planning and Inference 141, 1077-1087.