Optimal learning rates of $\ell^p$-type multiple kernel learning under general conditions

Shaogao Lv, Fanyin Zhou
The School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China

Article history: Received 25 March 2013; Received in revised form 7 September 2014; Accepted 13 September 2014; Available online 22 September 2014

Keywords: Multiple kernel learning; Kernel learning; Correlation measure; Generalization ability; Local Rademacher complexity

Abstract

One of the most promising kernel learning methods is the $\ell^p$-type multiple kernel learning proposed by Kloft et al. (2009), which can adaptively select kernel functions in supervised learning problems. The generalization error of such methods has recently received wide attention in machine learning and statistics. The present study establishes a new generalization error bound under a more general framework, in which the correlation among the reproducing kernel Hilbert spaces (RKHSs) is taken into account and the smoothness restriction on the target function is relaxed. In this case, the interaction between the estimation and approximation errors must be considered simultaneously. Optimal learning rates are derived by applying the local Rademacher complexity technique, and they are given in terms of the capacity of the RKHSs spanned by the multiple kernels and the regularity of the target function. © 2014 Elsevier Inc. All rights reserved.

1. Introduction

Kernel-based learning methods, such as support vector machines (SVMs), have been extensively applied in supervised learning tasks, including classification and regression. These methods implicitly map the input data into a high dimensional feature space, in which the implicit mapping $\Phi$ is defined by a kernel function that returns the inner product $\langle\Phi(x),\Phi(y)\rangle$ between the images of data points $x$ and $y$. Hence, these kernel-learning approaches are computationally more efficient than methods that must project $x$ and $y$ explicitly into the feature space. From a practical perspective, kernel-based algorithms are defined on an infinite-dimensional function space, yet for many applications they work efficiently in a finite-dimensional space and can capture nonlinear structure in many real-world data sets. Kernels and the associated RKHSs are simple and broadly applicable; thus, they play an increasingly important role in machine learning and statistics.

Selecting regularization parameters is an immediate concern once a kernel is given. This is typically handled by cross-validation or generalized cross-validation [16]. However, most kernel-based learning algorithms rely critically on the choice of the kernel function, which raises the issue of choosing an optimal kernel from a collection of candidates. Numerous kernel selection methods have been proposed in the literature. For example, the kernel target alignment of Cristianini et al. [13] and Cortes et al. [9] aims to learn the entries of a kernel matrix using the outer product of the label vector as the ground truth. A gradient descent algorithm was developed by Chapelle et al. [12] and Bousquet and Herrmann [5] to minimize an estimate of the SVM generalization error


over a set of parameters defined in the kernel sequence. The hyperkernels method was introduced by Ong et al. to analyze kernel functions directly in an inductive setting. Alternative kernel selection approaches include the DC programming and semi-infinite programming realized by Argyriou et al. [1] and Gehler and Nowozin [15], respectively. Nevertheless, the approaches mentioned above often lead to non-convex optimization problems, for which computational efficiency is difficult to ensure. Kernel learning algorithms with different settings of the width parameter of Gaussian kernels must be explored to obtain an optimal parameter from a specified interval [32]. These particular approaches lead to a standard convex optimization program, since the Gaussian kernels correspond to a sequence of capacity-limited function spaces; however, such algorithms lack the flexibility to model data of great complexity. Using classification information, Zamania et al. [37] proposed to determine a kernel function for kernel principal component and kernel linear discriminant analyses.

Learning kernels as linear combinations of multiple kernel functions has recently received considerable attention in machine learning. Kloft et al. [17] proposed the so-called $\ell^p$-norm ($1 < p \le 2$) multiple kernel learning (MKL) method, which has been proven useful and effective in terms of both theoretical analysis and practical applications. $\ell^p$-norm MKL is an empirical minimization algorithm that operates on the multi-kernel class consisting of functions
$$\left\{ f \in \mathcal{H}_{K_\theta} : K_\theta = \sum_{m=1}^M \theta_m K_m,\ \|\theta\|_p \le 1,\ \theta \ge 0 \right\},$$
where $M$ is the number of given candidate kernels. $\ell^p$-norm MKL has been successfully applied to real-world problems. For example, Kloft et al. [18] originally implemented the algorithm to clarify problems in bioinformatics; their results revealed that $\ell^p$-norm MKL ($p > 1$) achieves more accurate

prediction results than the state-of-the-art methods. Yu et al. [36] developed an $\ell^2$-norm MKL algorithm and applied it to genomic data fusion; their investigation showed that this algorithm achieves performance comparable to the conventional SVM-MKL algorithms. Moreover, in generic object recognition, Nakashika and Suga [24] proposed a novel feature selection method based on MKL, and their experimental results illustrated the effectiveness of the proposed automatic feature selection method. Some researchers have recently interpreted MKL from different viewpoints. For instance, Xu et al. [35] presented a soft margin perspective for MKL under a more general framework, in which many existing MKL formulations appear as special cases. Mao et al. [23] introduced a probabilistic interpretation of MKL and proposed a hierarchical Bayesian model that simultaneously learns a data-dependent prior and the classifier.

Over the past few years, MKL has been analyzed extensively from a theoretical point of view. Cortes et al. [8,9] obtained convergence rates for $\ell^p$-MKL of order $\sqrt{\log(M)/n}$ for $p = 1$ and $M^{1-1/p}/\sqrt{n}$ for $1 < p \le 2$. Kloft et al. [18] derived a similar convergence bound with improved constants. Bartlett et al. [6] and Kloft et al. [19] adopted localization techniques, including local Rademacher complexities, and consequently obtained sharp learning rates. Kloft and Blanchard [19] provided a localized convergence analysis of $\ell^p$-MKL; however, their analysis relied on the strong condition that the underlying RKHSs are not correlated with one another. Taking the correlation between candidate kernels into account, Suzuki [30] derived fast learning rates for dense-type regularization in a unifying framework that includes $\ell^p$-MKL as a special case. However, these analyses were conducted under the strong assumption that the target function is smooth and lies in the hypothesis space on which the algorithm works. The approximation error is then neglected, and the relationship between the regularization parameter and the smoothness of the target function cannot be fully reflected. In the statistical learning theory literature, generalization error = estimation error + approximation error for a given estimator (see details in [4]). Multiple kernels evidently increase the functional complexity; hence, learning with multiple kernels can only achieve better generalization when it significantly improves the approximation ability. Based on this argument, it is natural to consider the case in which the target function does not lie in the hypothesis space. In this case, previous techniques for analyzing $\ell^p$-MKL are no longer valid, because the upper bound on the estimator grows to infinity as the sample size $n$ increases, which affects the upper bound of the estimation error. Therefore, more explicit analysis methods must be developed to derive optimal learning rates under general conditions. To the best of our knowledge, no existing study has explicitly analyzed how the correlations among RKHSs affect learning rates in the multi-kernel learning setting.

The present study primarily aims to derive the optimal learning rates of $\ell^p$-MKL under the mild condition that the target function does not lie in the kernel class. In this case, optimal rates can be obtained by uniformly bounding the second moments of functions from an adequate class by their expectations.
Classical empirical process theory based on local Rademacher complexities is extended to more general cases, and the optimal rates in this investigation are derived with an iterative technique. This study provides the following contributions:

- Optimal learning rates are obtained by taking the correlation structure of the underlying RKHSs into account. The final result shows that the correlation greatly affects the convergence rates.
- Convergence rates are established under a mild assumption on the target function that effectively relaxes the usual constraints on it.
- As a by-product, a tight concentration inequality is provided by applying the local Rademacher complexity under the general conditions stipulated in Section 5. Moreover, the advantages of MKL in terms of approximation ability are discussed.

The rest of this paper is organized as follows. Section 2 formulates the classical supervised learning problem, introduces the MKL algorithm for the regression problem, and provides the main assumptions for the theoretical analysis. Section 3 presents


the theoretical results of the study, and some specific approximation abilities of MKL are discussed in Section 4. Section 5 provides the key techniques, and Section 6 summarizes the conclusions of the study. Some useful lemmas and propositions are deferred to the Appendix.

2. Preliminary

This section specifies the problem setup, notations, and assumptions required for the convergence analysis.

2.1. Problem formulation

Some notations used in the paper are as follows. Let $z = \{(X_i, Y_i)\}_{i=1}^n$ be independent copies of $(X, Y)$ with values in $X \times Y$, where $X$ is an input space in $\mathbb{R}^d$ and $Y$ is a Borel subset of $\mathbb{R}$. Let $\rho$ denote the distribution on $X \times Y$ and $\rho_X$ the marginal distribution of $X$. The goal of prediction is to learn a reasonably good prediction rule $f : X \to \mathbb{R}$ from the empirical data $\{(X_i, Y_i)\}_{i=1}^n$. To be precise, given a loss function $\ell : Y \times \mathbb{R} \to \mathbb{R}_+$, we define the risk of a prediction rule $f$ as

$$\mathcal{E}(f) := \int_{X \times Y} \ell(y, f(x))\, d\rho.$$

An optimal prediction rule with respect to the above loss is given by
$$f^* = \arg\min_{f : X \to \mathbb{R}} \mathcal{E}(f),$$
where the minimization is taken over all measurable functions and the minimum is presumed to be attained and unique. This study mainly focuses on the regression problem, which corresponds to the least square loss $\ell(y, f(x)) = (y - f(x))^2$. In this regression setting, the optimal prediction rule can be expressed as
$$f^*(x) = \int_Y y\, d\rho(y|x),$$

where $\rho(\cdot|x)$ is the conditional distribution of $y$ given $x$ induced by $\rho$. Unfortunately, $\rho(\cdot|x)$ is often unknown and we cannot compute $f^*$ directly. It is natural to estimate $f^*$ by an estimator $f_z$ built from the empirical data, so that $f_z$ generalizes well to unseen data. Throughout this paper, $\rho(\cdot|x)$ is assumed to be supported on $[-T, T]$, and it follows that $|f^*(x)| \le T$ for $x \in X$ almost surely. This boundedness assumption can be relaxed to an unbounded one similar to that in Caponnetto et al. [7]; however, that direction is not pursued here for simplicity.

Given $M$ RKHSs $\{\mathcal{H}_m\}_{m=1}^M$, each generated by a specific kernel $K_m$, an $\ell^p$-norm ($1 < p \le 2$) type regularization scheme is considered with regularization term $\|\cdot\|_{p,M}$. That is, for $f = \sum_{m=1}^M f_m$ ($f_m \in \mathcal{H}_m$), the regularization term is defined by the norm

$$\|f\|_{p,M} = \left\|\big(\|f_m\|_{\mathcal{H}_m}\big)_{m=1}^M\right\|_{\ell^p} = \left(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\right)^{1/p}.$$
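As a computational aside, this block norm is easy to evaluate from the kernel expansion coefficients of each component, since the reproducing property gives $\|f_m\|_{\mathcal{H}_m}^2 = \alpha_m^\top K_m \alpha_m$. The following minimal sketch illustrates this; the function and variable names are illustrative and not part of the paper.

```python
import numpy as np

def block_lp_norm(alphas, gram_matrices, p=1.5):
    """Evaluate ||f||_{p,M} for f = sum_m f_m, where f_m = sum_i alpha_{m,i} K_m(., x_i).

    alphas: list of coefficient vectors (one per kernel), each of shape (n,)
    gram_matrices: list of Gram matrices K_m of shape (n, n)
    Uses ||f_m||_{H_m}^2 = alpha_m^T K_m alpha_m (reproducing property).
    """
    block_norms = [np.sqrt(a @ K @ a) for a, K in zip(alphas, gram_matrices)]
    return np.sum(np.asarray(block_norms) ** p) ** (1.0 / p)

# toy usage with two RBF kernels on random inputs (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

def rbf(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K1, K2 = rbf(X, 0.5), rbf(X, 2.0)
a1, a2 = 0.1 * rng.normal(size=20), 0.1 * rng.normal(size=20)
print(block_lp_norm([a1, a2], [K1, K2], p=1.5))
```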

The function class is then defined as
$$\mathcal{H}_{p,M} := \left\{ f = \sum_{m=1}^M f_m :\ \|f\|_{p,M} < \infty \right\}.$$

The original $\ell^p$-type MKL proposed by Kloft et al. [17] can then be transformed into the following equivalent optimization problem:
$$\hat f_z = \sum_{m=1}^M \hat f_m = \arg\min_{f_m \in \mathcal{H}_m} \frac{1}{n}\sum_{i=1}^n \left(y_i - \sum_{m=1}^M f_m(x_i)\right)^2 + \lambda \|f\|_{p,M}^2. \tag{2.1}$$

This problem can be solved by a finite dimensional optimization procedure according to the representer theorem; feasible algorithms include block coordinate descent, active set methods, and greedy approximation methods, as sketched below.
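The sketch below illustrates one such block coordinate strategy for the least-squares problem (2.1): it alternates kernel ridge regression with a combined kernel $K_\theta = \sum_m \theta_m K_m$ and a closed-form update of the kernel weights of the kind used in the $\ell^p$-MKL literature. This is a minimal illustration under those assumptions, not the authors' implementation, and all names are hypothetical.

```python
import numpy as np

def lp_mkl_krr(K_list, y, lam=0.1, p=1.5, n_iter=20):
    """Sketch of lp-norm MKL for least squares: alternate (i) kernel ridge
    regression with the combined kernel K_theta = sum_m theta_m K_m and
    (ii) a closed-form update of the kernel weights theta with ||theta||_p = 1."""
    M, n = len(K_list), len(y)
    theta = np.full(M, M ** (-1.0 / p))          # feasible start: ||theta||_p = 1
    alpha = np.zeros(n)
    for _ in range(n_iter):
        K_theta = sum(t * K for t, K in zip(theta, K_list))
        # KRR step for (1/n)||y - K alpha||^2 + lam * alpha^T K alpha
        alpha = np.linalg.solve(K_theta + lam * n * np.eye(n), y)
        # block norms ||f_m||_{H_m}^2 = theta_m^2 * alpha^T K_m alpha
        norms = np.array([t ** 2 * (alpha @ K @ alpha) for t, K in zip(theta, K_list)])
        norms = np.maximum(norms, 1e-12)
        # weight update of the form used in the lp-MKL literature, then normalize
        theta = norms ** (1.0 / (p + 1))
        theta /= np.sum(theta ** p) ** (1.0 / p)
    return alpha, theta
```

The returned `alpha` and `theta` determine the estimator through $\hat f_z(x) = \sum_m \theta_m \sum_i \alpha_i K_m(x, x_i)$ in this parametrization; other equivalent parametrizations are used in practice.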

2.2. Notations and assumptions

A space $\mathcal{H}_K$ is an RKHS if there exists a symmetric positive semi-definite function $K : X \times X \to \mathbb{R}$ such that, for each $x \in X$, (i) the function $K(\cdot, x)$ belongs to the Hilbert space $\mathcal{H}_K$ and (ii) the reproducing property $f(x) = \langle f, K(\cdot, x)\rangle_{\mathcal{H}_K}$ holds for all $f \in \mathcal{H}_K$. Let the integral operator $L_m : L^2(\rho_X) \to L^2(\rho_X)$ associated with the kernel $K_m$ be defined as
$$L_m(f) := \int_X K_m(\cdot, x) f(x)\, d\rho_X(x).$$


It is known that this operator is compact, positive, and self-adjoint [11]; thus all of its eigenvalues are nonnegative and at most countable. We denote by $\mu_{l,m}$ the $l$-th largest eigenvalue of the integral operator $L_m$, and by $\{\phi_{l,m}\}$ the associated eigenfunctions, which form an orthonormal basis of $L^2(\rho_X)$. The decay rate of these eigenvalues plays an important role in our analysis, because it essentially determines the functional complexity of each $\mathcal{H}_m$.

Assumption 1 (Spectral Assumption). There exist $0 < s_m < 1$ and $c > 0$ such that

$$\mu_{l,m} \le c\, l^{-1/s_m}, \qquad \forall\, l \ge 1,\ 1 \le m \le M.$$
It has been shown that the Spectral Assumption is equivalent to the classical covering number assumption [28]. Moreover, the corresponding RKHS becomes "simpler" as $s_m$ decreases, which can also be probed empirically, as sketched below.
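The following rough sketch illustrates the heuristic that the eigenvalues of the normalized Gram matrix $K_m/n$ approximate those of $L_m$, so a log-log fit of eigenvalue against rank gives an empirical estimate of the decay exponent $1/s_m$. This is an illustration only and is not part of the theory.

```python
import numpy as np

def empirical_spectrum_decay(K):
    """Estimate the polynomial decay exponent of the spectrum of K/n.

    The eigenvalues of the normalized Gram matrix approximate those of the
    integral operator L_m, so the negative slope of log(eigenvalue) versus
    log(rank) is a rough empirical proxy for 1/s_m in Assumption 1."""
    n = K.shape[0]
    evals = np.linalg.eigvalsh(K / n)[::-1]      # descending order
    evals = evals[evals > 1e-10]
    ranks = np.arange(1, len(evals) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(evals), 1)
    return evals, -slope                         # -slope approximates 1/s_m
```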

Another important notion is introduced to characterize the degree of "dependence" of the spaces $\mathcal{H}_m$, $m = 1, \dots, M$. Intuitively, as the overlap among the RKHSs $\mathcal{H}_m$ increases, the complexity of the composite RKHS decreases accordingly. Let $\kappa_M$ be defined as
$$\kappa_M := \sup\left\{ \kappa \ge 0 \ :\ \kappa \le \frac{\left\|\sum_{m=1}^M f_m\right\|_{L^2(\rho_X)}^2}{\sum_{m=1}^M \|f_m\|_{L^2(\rho_X)}^2},\ \ f_m \in \mathcal{H}_m\ (m = 1, \dots, M) \right\}. \tag{2.2}$$

Assumption 2 (Incoherence Assumption). $\kappa_M$ is strictly bounded from below; there exists a constant $C_0$ satisfying

$$0 < C_0 < \kappa_M.$$
If the functions in the spaces $\mathcal{H}_m$, $m \in \{1, 2, \dots, M\}$, are orthogonal in $L^2(\rho_X)$, then $\kappa_M$ is clearly equal to 1. Essentially, the quantity $\kappa_M$ measures the correlation of the $M$ RKHSs. It has been used frequently in the theory of sparse recovery, including the so-called "restricted isometry constants" introduced by Candes and Tao [10]. Koltchinskii and Yuan [21] also adopted this quantity to select an appropriate kernel in the multiple kernel setting. Bach [3] and Suzuki [30] imposed Assumption 2 to verify the consistency of $\ell^1$-MKL and $\ell^p$-MKL ($p > 1$), respectively. The quantity $\kappa_M$ can be bounded by other geometric quantities that describe how dependent the spaces of random functions $\mathcal{H}_m \subseteq L^2(\rho_X)$ are; related discussions were given by Koltchinskii and Yuan [21] in the sparse kernel learning literature. For completeness, these conclusions are restated in Appendix A.

The correlation between different objects in the RKHS literature can be characterized in different ways. For example, a kernel function $K$ can be taken as a similarity measure between spike times [25], and Fukumizu et al. [14] assess the correlation of two RKHSs via the inner product in the $L^2$-norm for canonical correlation analysis. Recently, a generalized correlation function called correntropy has been proposed to offer insights into the geometric relationship among different RKHSs [34]. Similar discussions can also be found in the work of Vepakomma and Elgammal [31], in which the so-called distance correlation is introduced as a statistical measure of dependence between newly learned features and response variables. Nevertheless, the current study deals with correlation in the multi-kernel setting, which is significantly different from the focus of the studies mentioned above. The quantity (2.2) used in Assumption 2 simplifies the criterion of this work and makes the main theoretical results easier to interpret. A better correlation measure may exist in certain contexts, and different correlation measures may be closely related to one another; further investigation is beyond the scope of this study, and only some existing conclusions are listed in Appendix A.

Assumption 3 (Kernel Assumption). For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable with respect to its RKHS norm, and $\sup_{x \in X} |K_m(x, x)| \le 1$.

This assumption is satisfied by the majority of commonly used kernels on bounded domains. By the positive semi-definiteness of the kernel $K_m$, it follows that

$$|K_m(x, y)|^2 \le |K_m(x, x)|\,|K_m(y, y)| \le 1, \qquad \forall\, x, y \in X.$$
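Before turning to the main results, the following sketch gives a crude empirical proxy for the incoherence quantity $\kappa_M$ of (2.2): population $L^2(\rho_X)$ norms are replaced by empirical norms on a sample, each $f_m$ is represented by its in-sample values $K_m a_m$, and the infimum becomes a generalized eigenvalue problem. This construction is only an illustration and is not taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh, block_diag

def empirical_kappa(K_list, eps=1e-8):
    """Rough empirical proxy for kappa_M in (2.2).

    Each f_m is represented through its sample values K_m a_m, so the ratio
    ||sum_m f_m||^2 / sum_m ||f_m||^2 (empirical L^2 norms) becomes
    a^T B^T B a / a^T D a with B = [K_1, ..., K_M], D = blockdiag(K_m^2).
    The infimum over a is the smallest generalized eigenvalue of (B^T B, D)."""
    B = np.hstack(K_list)                         # n x (n*M)
    D = block_diag(*[K @ K for K in K_list])      # (n*M) x (n*M)
    A = B.T @ B
    w = eigh(A, D + eps * np.eye(D.shape[0]), eigvals_only=True)
    return max(float(w[0]), 0.0)
```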

3. Main results

In this section, we derive upper bounds on $\mathcal{E}(\hat f_z) - \mathcal{E}(f^*)$ and provide detailed comparisons with some existing results. Two additional notations are employed to describe the results. For $R > 0$, denote
$$\mathcal{W}(R) = \{ z \in Z^n : \|\hat f_z\|_{p,M} \le R \}.$$


If $f^*$ lies in the composite kernel class, $R$ can be bounded by a constant. In our setting, however, an approximation error is allowed, and the natural upper bound for $R$ tends to infinity as $n$ increases. Since the learner $\hat f_z$ may also exceed $f^*$ in the $L^\infty$-norm, it is natural to apply a projection operator to $\hat f_z$; such projections were introduced into learning algorithms to improve learning rates [28].

Definition 1. The projection operator $\pi$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ as

$$\pi(f)(x) = \begin{cases} T, & \text{if } f(x) > T, \\ f(x), & \text{if } |f(x)| \le T, \\ -T, & \text{if } f(x) < -T. \end{cases}$$
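In code, the projection of Definition 1 is simply a clipping of predicted values to $[-T, T]$; the short sketch below is illustrative only.

```python
import numpy as np

def project(f_values, T):
    """Projection operator pi of Definition 1: truncate values to [-T, T]."""
    return np.clip(f_values, -T, T)

print(project(np.array([-3.0, 0.2, 5.0]), T=1.0))   # -> [-1.   0.2  1. ]
```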

Moreover, we need the regularization error function $\mathcal{D}_p : [0, \infty) \to [0, \infty)$,
$$\mathcal{D}_p(\lambda) = \min_{f \in \mathcal{H}_{p,M}} \left\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_{p,M}^2 \right\}, \qquad \lambda > 0. \tag{3.1}$$
$\mathcal{D}_p(\lambda)$ measures the approximation ability of the hypothesis space $\mathcal{H}_{p,M}$ with respect to $f^*$ in the space $L^2(\rho_X)$. It is generally assumed that, for some $0 < \beta \le 1$ and $C_\beta > 0$,
$$\mathcal{D}_p(\lambda) \le C_\beta \lambda^\beta. \tag{3.2}$$

This regularization error was extensively studied by Steinwart and Christmann [27] for a single kernel, where its relation to the standard approximation error was also addressed. Cucker and Smale [11] and Smale and Zhou [29] related the behavior of $\mathcal{D}_2(\lambda)$ to interpolation spaces and to certain powers of the integral operator $L_m$. The approximation ability $\mathcal{D}_p(\lambda)$ under the multi-kernel framework is discussed in Section 4.

We can now state the main result of this study, which establishes a convergence rate for the MKL algorithm (2.1).

Theorem 1. Suppose that Assumptions 1–3 are satisfied and $|y| \le T$ almost surely. For any $0 < \delta < 1$ and $\lambda$ satisfying $\frac{3M^{2/p^*}\log(2/\delta)}{n\lambda} \le 1$, there exists a set $V_R$ with $\rho(V_R) \le \delta$ such that, for all $z \in \mathcal{W}(R) \setminus V_R$, the regularized error $\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) + \lambda\|\hat f_z\|_{p,M}^2$ is bounded by
$$C_0'\left\{ \frac{\log(2/\delta)\, R}{n} + \left(\frac{\mathcal{D}_p(\lambda)}{\lambda} \vee R^2\right)^{\frac{1}{1+1/s_{\max}}} \kappa_M^{-\frac{2}{1+1/s_{\max}}} \left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}} + \mathcal{D}_p(\lambda) \right\}, \tag{3.3}$$
where $C_0' = 9 + 88 c_p T^2 + 128 c_p T^2 M^{\frac{p^*(3+1/s_{\max})}{12(1+1/s_{\max})}}$, $A \vee B := \max\{A, B\}$, $s_{\max} := \max_{1 \le m \le M} s_m$ with $s_m$ as in Assumption 1, and $p^*$ is the conjugate exponent satisfying $\frac{1}{p} + \frac{1}{p^*} = 1$.

To apply Theorem 1 one needs an appropriate $R$ such that $\mathcal{W}(R) = Z^n$. A standard argument verifies that, for all $\lambda > 0$ and almost all $z \in Z^n$, $\|\hat f_z\|_{p,M} \le \frac{T}{\sqrt{\lambda}}$; accordingly, one may take $R = \frac{T}{\sqrt{\lambda}}$. A similar proof was presented by Wu et al. [33], hence the details are omitted.

By substituting this bound for $R$ into Theorem 1, the following upper bound on the generalization error is obtained when Assumption 2 holds.

Corollary 1. Under the assumptions of Theorem 1, assume that (3.2) holds with some $0 < \beta \le 1$. For any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) + \lambda\|\hat f_z\|_{p,M}^2 \le \tilde C \log\left(\frac{2}{\delta}\right)\left(\frac{1}{n}\right)^{\frac{\beta}{(\beta+1/2)(1+s_{\max})}},$$
by taking $\lambda = \left(\frac{1}{n}\right)^{\frac{1}{(\beta+1/2)(1+s_{\max})}}$, where $\tilde C$ is a constant independent of $n$ and $\delta$.

Notice that the exponent $\frac{\beta}{(\beta+1/2)(1+s_{\max})}$ of the learning rate is always less than $2/3$, which is weak by the standards of the statistical learning theory literature. Improved error bounds can be obtained by providing a tight bound for $\|\hat f_z\|_{p,M}$, which can be realized by adopting the more elaborate iterative technique developed by Wu et al. [33]. By employing the tight bound on $\|\hat f_z\|_{p,M}$ (given in the Appendix), Theorem 1 immediately yields the stronger learning rates stated below.

Theorem 2. Suppose Assumptions 1–3 are satisfied and $|y| \le T$ almost surely. Assume that $\mathcal{D}_p(\lambda) = O(\lambda^\beta)$. Set $\lambda = n^{-\frac{1}{\beta+s_{\max}}}$ with $\beta + s_{\max} \ge 1$. For any $0 < \delta < 1$ and $n \ge N_{M,\delta}$, with confidence at least $1 - \frac{2\delta}{1+s_{\max}}$,
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) \le C_M \left(\log\frac{2}{\delta}\right)\left(\frac{1}{n}\right)^{\frac{\beta}{\beta+s_{\max}}}.$$
In particular, when $f^* \in \mathcal{H}_{p,M}$, we have
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) \le C_M \left(\log\frac{2}{\delta}\right)\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}},$$


where
$$N_{M,\delta} := \left(\frac{8T^2\log(2/\delta)}{C_M^2}\right)^{1+1/s_{\max}} \vee \left(\frac{1}{C_M}\right)^2,$$
and
$$C_M := 53T^2 + 128\, T c_p\, (2C_M)^{\frac{1}{1+s_{\max}}}\, M^{\frac{p^*(3s_{\max}+1)}{12(1+s_{\max})}}\, \kappa_M^{-\frac{s_{\max}}{1+s_{\max}}}.$$
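For orientation, the regularization schedule and the rate predicted by Theorem 2 can be tabulated numerically once $\beta$ and $s_{\max}$ are fixed; the values of $\beta$ and $s_{\max}$ used below are arbitrary choices for illustration only.

```python
import numpy as np

def theorem2_schedule(n, beta, s_max):
    """Regularization parameter and predicted excess-risk rate from Theorem 2,
    assuming beta (from (3.2)) and s_max (from Assumption 1) are known."""
    lam = n ** (-1.0 / (beta + s_max))
    rate = n ** (-beta / (beta + s_max))
    return lam, rate

for n in (10**3, 10**4, 10**5):
    print(n, theorem2_schedule(n, beta=1.0, s_max=0.5))
```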

Theorem 2 establishes the learning rates of MKL algorithms under quite general conditions, recovering as special cases the results in single kernel settings (e.g., [28,33]). The learning rates in this study are optimal when $\beta + s_{\max} \ge 1$, because the lower bounds for such learning algorithms are attained (see Theorem 9 of [28]). Theorem 2 also provides a confidence-based estimate for the least squares error of the MKL algorithms; the dependence of the estimate on the confidence level is of the mild form $\log(2/\delta)$ rather than the more common form $2/\delta$. The result therefore implies almost sure convergence, which in turn guarantees convergence in probability. Considering a positive correlation between RKHSs, the conclusion of Proposition 5 in the Appendix shows that when $\kappa_M$ becomes larger, more overlap arises according to the definition of $\rho(I,J)$ given there; thus the functional complexity of the composite kernel space is reduced, which ultimately improves the learning rates. This observation conforms with the conclusion of Theorem 2. The convergence rate derived in Theorem 2 also achieves the minimax rate verified by Suzuki [30], up to a constant term depending on $M$ and $\delta$. Our findings indicate that the learning rate does not depend on the parameter $p$. Compared with the main results of Suzuki [30], the conclusion of the current study holds in two situations, namely homogeneous and inhomogeneous complexities, and the optimal learning rates are extended to the case in which $f^*$ does not lie in the RKHS generated by the composite kernel. Under these more general settings, the ideal parameter $\lambda$ depends on prior knowledge such as the regularity of $f^*$; this provides some insight into selecting an appropriate regularization parameter. Various types of kernels are likely to be used in real settings, and the composite RKHSs are then correlated with one another. Several works report that $\ell^p$-MKL ($p > 1$) has numerical performance comparable to $\ell^1$-MKL in experiments [8,9], and existing empirical studies also indicate that different types of kernels are frequently combined in a reasonable manner; the theoretical analysis of this study supports these experimental results. When prior knowledge on the target function is available, the parameters $\lambda$ and $\kappa_M$ can be tuned accordingly: if the target function is less smooth, smaller $\lambda$ and $\kappa_M$ must be selected to enhance the approximation ability of the algorithm, and vice versa.

4. Approximation ability of MKL

Our main result on the learning rates of (2.1) is stated under conditions on the approximation ability of $\mathcal{H}_{p,M}$ with respect to $f^*$ and on the capacity of $\mathcal{H}_{p,M}$. Note that assumption (3.2) requires that $\mathcal{D}_p(\lambda)$ tends to zero at a polynomial rate as $\lambda \to 0$. As observed by Cucker and Smale [11], $\mathcal{D}_p(\lambda) = o(\lambda)$ would imply $f^* = 0$, so $\beta = 1$ in (3.2) is the best we can expect; this case is equivalent to $f^* \in \mathcal{H}_{p,M}$ when $\mathcal{H}_{p,M}$ is dense in $L^2_{\rho_X}$, see Smale and Zhou [29]. Assumption (3.2) with $p = 2$ has been characterized in terms of interpolation spaces by Cucker and Smale [11]. Proposition 1.1 of Smale and Zhou [29] shows that, if $\rho_X$ is the Lebesgue measure on $X$ and the target function $f^* \in H^s$, a Sobolev space of smoothness $s$, then when a Gaussian kernel $G_\sigma(x,y) = \exp(-\sigma\|x-y\|^2)$ with a fixed scale $\sigma$ is used, a polynomial decay of $\mathcal{D}_p(\lambda)$ is impossible. Allowing the scale of the Gaussian kernels to vary, however, Example 1 of Micchelli et al.
[22] successfully obtains a polynomial decay in the multi-kernel setting. This shows that multi-kernel learning can improve the approximation power in some sense, and consequently improve the learning ability. To make the description clear, we restate the corresponding conclusions explicitly as follows. Let $(B, \|\cdot\|)$ be a Banach space and $(H, \|\cdot\|_H)$ a dense subspace with $\|b\| \le \|b\|_H$ for $b \in H$. Given $a \in B$, define the approximation ability of $H$ with respect to $a$ as

$$\mathcal{I}(a, R) := \inf_{\|b\|_H \le R} \|a - b\|, \qquad R > 0.$$

Given a bounded domain $X$, let $H$ be the range of the integral operator
$$L_\sigma f(x) = \left(\frac{\sigma}{\pi}\right)^{d/2}\int_X G_\sigma(x, t) f(t)\, dt.$$
The norm of $b = L_\sigma f \in H$ is $\|b\|_H = \|L_\sigma^{-1} b\|_{L^2(\rho_X)}$, that is, $\|L_\sigma f\|_H = \|f\|_{L^2(\rho_X)}$ for $f \in L^2(\rho_X)$. We now state a comparative conclusion on the approximation ability of a fixed Gaussian kernel versus Gaussian kernels with varying parameters; it is obtained by combining Proposition 1.1 of Smale and Zhou [29] with Example 1 of Micchelli et al. [22].


Proposition 1. If $a \in L^2([0,1]^d)$ is not $C^\infty$, then for any $\epsilon > 0$ and any fixed parameter $\sigma$,
$$\mathcal{I}(a, R) := \inf_{\|b\|_{L^2([0,1]^d)} \le R} \|a - L_\sigma b\|_{L^2([0,1]^d)} \ne O(R^{-\epsilon}).$$
Nevertheless, in the special case $a \in H^s([0,1]^d)$, if we allow $\sigma = R^{\frac{4+s}{(4+s)d+2s}}$, then
$$\mathcal{I}(a, R) = O\left(R^{-\frac{2s}{(4+s)d+2s}}\right).$$

Define the synthetic kernel $\tilde K = \sum_{m=1}^M K_m$. Then each element of $\mathcal{H}_{\tilde K}$ can be expressed as $f = \sum_{m=1}^M f_m$ ($f_m \in \mathcal{H}_{K_m}$), and the norm is given by

$$\|f\|_{\tilde K}^2 = \min\left\{ \sum_{m=1}^M \|f_m\|_{K_m}^2 \ :\ f = \sum_{m=1}^M f_m,\ f_m \in \mathcal{H}_{K_m} \right\}.$$

Similar to Example 1 of Micchelli et al. [22], we can obtain a polynomial decay of $\mathcal{D}_2(\lambda)$ in our multiple kernel setting. Precisely, let $\{\sigma_m\}$ be a sequence arranged in nondecreasing order and consider the concrete kernel $\tilde K_\sigma = \sum_{m=1}^M G_{\sigma_m}(x, y)$. The following conclusion about the approximation ability of multiple kernels holds; it is proved in the Appendix.

Example 1. Suppose $d\rho_X$ is the Lebesgue measure on $X$ and $f^*$ lies in the Sobolev space
$$H^s(\mathbb{R}^d) = \left\{ f \in L^2(\mathbb{R}^d) : \|f\|_{H^s} = \left((2\pi)^{-d}\int_{\mathbb{R}^d}(1+|\xi|^2)^s |\tilde f(\xi)|^2\, d\xi\right)^{1/2} < \infty \right\},$$
where $\tilde f$ denotes the Fourier transform of $f$. If we take $\sigma_m = \left(\frac{1}{\lambda}\right)^{\frac{4+s}{(4+s)d+2s}}$, then
$$\mathcal{D}_2(\lambda) = O\left(\lambda^{\frac{2s}{(4+s)d+2s}}\right), \qquad \forall\, M \in \mathbb{N}_+.$$
If instead we assume $\sigma_m = m^c$ for some $c > 0$ and take $M = \left\lceil \left(\frac{1}{\lambda}\right)^{\frac{4+s}{c(2s+d(4+s))}} \right\rceil$, then
$$\mathcal{D}_2(\lambda) = O\left(\lambda^{1-\frac{d(4+s)}{2s+d(4+s)}}\right),$$
where $\lceil k \rceil$ denotes the smallest positive integer greater than or equal to $k$.
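The following small experiment is in the spirit of Example 1, but it is only an illustration with arbitrary widths, sample size and noise level, not a reproduction of the stated rates: kernel ridge regression with a synthetic kernel that averages several Gaussian scales is compared with a single fixed-scale Gaussian on a target of limited smoothness.

```python
import numpy as np

def rbf(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * d2)

def krr_fit_predict(K_train, K_test, y, lam):
    alpha = np.linalg.solve(K_train + lam * len(y) * np.eye(len(y)), y)
    return K_test @ alpha

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1)); Xt = rng.uniform(-1, 1, size=(500, 1))
f = lambda x: np.abs(x[:, 0])                 # a target of limited smoothness
y = f(X) + 0.05 * rng.normal(size=200)
lam = 1e-3
sigmas = [2.0 ** k for k in range(0, 6)]      # several Gaussian scales sigma_m

K1, K1t = rbf(X, X, 4.0), rbf(Xt, X, 4.0)     # single fixed-scale Gaussian
Km = sum(rbf(X, X, s) for s in sigmas) / len(sigmas)
Kmt = sum(rbf(Xt, X, s) for s in sigmas) / len(sigmas)

err_single = np.mean((krr_fit_predict(K1, K1t, y, lam) - f(Xt)) ** 2)
err_multi = np.mean((krr_fit_predict(Km, Kmt, y, lam) - f(Xt)) ** 2)
print(err_single, err_multi)
```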

Example 1 shows that, when the generalization error is dominated by the approximation error, the generalization ability can be improved by employing multiple kernels. Unlike $\ell^1$-MKL, $\ell^p$-MKL ($p > 1$) enhances the learning ability by enlarging the functional complexity rather than by selecting one proper kernel.

5. Error decomposition

In this section, we decompose the generalization error into two parts: a deterministic part described by the regularization error, and the sample error reflecting the fluctuation caused by random sampling. Upper bounds for both are given in this section. First, we define the total sample error as

$$\mathcal{S}_z(\lambda, f) = \{\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}_z(\pi(\hat f_z))\} + \{\mathcal{E}_z(f) - \mathcal{E}(f)\}, \qquad \forall\, f \in \mathcal{H}_{p,M},$$
where $\mathcal{E}_z(f) := \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2$ is the empirical risk. The function $f$ above can be chosen arbitrarily, but proper choices lead to good estimates of the regularization error. A good and common choice is $f = f_\lambda$, where

$$f_\lambda = \arg\min_{f \in \mathcal{H}_{p,M}} \{\mathcal{E}(f) + \lambda\|f\|_{p,M}^2\}. \tag{5.1}$$

With these preparations, we can state the following error decomposition.

Proposition 2. Let $f_\lambda$ be defined by (5.1). Then
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) + \lambda\|\hat f_z\|_{p,M}^2 \le \mathcal{S}_z(\lambda, f_\lambda) + \mathcal{D}_p(\lambda). \tag{5.2}$$

Proof. We can decompose $\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*)$ into
$$\{\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}_z(\pi(\hat f_z))\} + \big\{\mathcal{E}_z(\pi(\hat f_z)) + \lambda\|\hat f_z\|_{p,M}^2 - \big(\mathcal{E}_z(f_\lambda) + \lambda\|f_\lambda\|_{p,M}^2\big)\big\} + \{\mathcal{E}_z(f_\lambda) - \mathcal{E}(f_\lambda)\} + \{\mathcal{E}(f_\lambda) - \mathcal{E}(f^*) + \lambda\|f_\lambda\|_{p,M}^2\} - \lambda\|\hat f_z\|_{p,M}^2.$$


Since the map $\pi$ is contractive, we have $\mathcal{E}_z(\pi(\hat f_z)) \le \mathcal{E}_z(\hat f_z)$. This implies that the second term above is at most zero, by the definition of $\hat f_z$. The desired result then follows immediately. $\square$

In this paper we consider the capacity of the ball $\mathcal{H}_{p,M}(D)$ of radius $D$,
$$\mathcal{H}_{p,M}(D) := \{ f \in \mathcal{H}_{p,M} \ :\ \|f\|_{p,M} \le D \}.$$
We further split the sample error $\mathcal{S}_z(\lambda, f_\lambda)$ into the components
$$\mathcal{S}_z(\lambda, f_\lambda) = \mathcal{S}_1 + \mathcal{S}_2,$$
where
$$\mathcal{S}_1 := \{\mathcal{E}_z(f_\lambda) - \mathcal{E}_z(f^*)\} - \{\mathcal{E}(f_\lambda) - \mathcal{E}(f^*)\}, \qquad \mathcal{S}_2 := \{\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*)\} - \{\mathcal{E}_z(\pi(\hat f_z)) - \mathcal{E}_z(f^*)\}.$$

$\mathcal{S}_1$ can be easily bounded by applying a one-sided Bernstein type concentration inequality, stated as Lemma 1 in the Appendix. The following inequality was shown in Lv and Zhu (2012) with a slight modification.

Proposition 3. Define the random variable $\xi_2(z) = (f_\lambda(x) - y)^2 - (f^*(x) - y)^2$. For every $0 < \delta < 1$, with confidence at least $1 - \delta/2$,
$$\mathcal{S}_1 = \frac{1}{n}\sum_{i=1}^n \xi_2(z_i) - \mathbb{E}(\xi_2) \le \mathcal{D}_p(\lambda)\left(1 + \frac{3M^{2/p^*}\log(2/\delta)}{n\lambda}\right) + \frac{6T^2\log(2/\delta)}{n}.$$

Estimating $\mathcal{S}_2$ is more involved, since it depends on the sample $z$. Our approach to $\mathcal{S}_2$ is to apply the local Rademacher complexity argument presented below. To this end, let us recall the Rademacher complexity. Introduce Rademacher variables $\sigma_i$ and i.i.d. samples $z_i$ ($i = 1, \dots, n$); then the global Rademacher average is given by

$$R_n(\mathcal{F}) := \mathbb{E}_{z,\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(z_i)\right],$$
where $\mathcal{F}$ is a hypothesis space specified in advance. To make full use of the variance of the functions, the local Rademacher complexity of $\mathcal{F}$ is given by
$$R_{n,r}(\mathcal{F}) := \mathbb{E}_{z,\sigma}\left[\sup_{f \in \mathcal{F},\ \mathbb{E}f^2 \le r} \frac{1}{n}\sum_{i=1}^n \sigma_i f(z_i)\right].$$
Note that it subsumes the global Rademacher complexity as the special case $r = \infty$. Since local Rademacher complexities are never larger than the corresponding global ones, they often lead to sharper bounds, especially when the variance of $f(z_i)$ for $f \in \mathcal{F}$ is exploited. A more detailed discussion of the local Rademacher complexity can be found in Bartlett et al. [6]. To estimate $\mathcal{S}_2$, we define the following function class associated with the least square loss:
$$\mathcal{H}_{\ell s}(R) = \left\{ g(z) = (f(x) - y)^2 - (f^*(x) - y)^2 \ :\ \|f\|_\infty \le T,\ \|f\|_{p,M} \le R \right\}.$$
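For intuition, the global Rademacher average of a single RKHS ball can be estimated by Monte Carlo using the identity $\sup_{\|f\|_{\mathcal{H}} \le D}\frac{1}{n}\sum_i \sigma_i f(z_i) = \frac{D}{n}\sqrt{\sigma^\top K \sigma}$, which follows from the reproducing property; the local version additionally restricts the empirical $L^2$ norm and has no such closed form. The sketch below is a single-kernel illustration only, not the multi-kernel class $\mathcal{H}_{p,M}$ used in the analysis.

```python
import numpy as np

def global_rademacher(K, D, n_mc=200, rng=None):
    """Monte Carlo estimate of the global Rademacher average of an RKHS ball
    of radius D for a single kernel with Gram matrix K: for each Rademacher
    vector s, the supremum over the ball equals (D/n) * sqrt(s^T K s)."""
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    vals = []
    for _ in range(n_mc):
        s = rng.choice([-1.0, 1.0], size=n)
        vals.append(D / n * np.sqrt(max(s @ K @ s, 0.0)))
    return float(np.mean(vals))
```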

Proposition 4. Let $\mathcal{H}_{\ell s}(R)$ be defined as above, and assume that Assumptions 1–3 are satisfied. Then for any $0 < \delta < 1$, with confidence at least $1 - \delta/2$,
$$\mathbb{E}(g) - \frac{1}{n}\sum_{i=1}^n g(z_i) \le \frac{\mathbb{E}(g)}{2} + \frac{88 c_p \log(2/\delta)\, T^2 R}{n} + \mathcal{D}_p(\lambda) + 128 c_p T^2 M^{\frac{p^*(3+1/s_{\max})}{12(1+1/s_{\max})}}\left(\frac{\mathcal{D}_p(\lambda)}{\lambda} \vee R^2\right)^{\frac{1}{1+1/s_{\max}}}\kappa_M^{-\frac{2}{1+1/s_{\max}}}\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}}, \qquad \forall\, g \in \mathcal{H}_{\ell s}(R).$$

Proof. A simple calculation shows that
$$\mathbb{E}(g^2) = \int (f(x) - f^*(x))^2 (f(x) + f^*(x) - 2y)^2\, d\rho(z) \le 16T^2\,\mathbb{E}(g), \qquad \forall\, \|f\|_\infty \le T.$$
To verify the variance condition of Proposition 6, we can choose $T(g)$ as
$$T(g) := 16T^2\,\mathbb{E}(g) = 16T^2\int_X (f(x) - f^*(x))^2\, d\rho_X(x).$$


This means that we can take $B = 16T^2$ in Proposition 6. Moreover, it is easily proved that the least square loss restricted to $[-T, T]$ is Lipschitz continuous, that is,
$$|L(y, t) - L(y, t')| \le 4T|t - t'| \qquad \text{for any } y, t, t' \in [-T, T].$$
Consequently, according to Lemma 2, we can bound $R_n(g \in \mathcal{H}_{\ell s}(R) : T(g) \le r)$ by
$$8T\, R_n\left(f \in \mathcal{H}_{p,M}(R) : \int_X (f(x) - f^*(x))^2\, d\rho_X(x) \le \frac{r}{16T^2}\right). \tag{5.3}$$

We notice that $\|f - f_\lambda\|^2_{L^2(\rho_X)} \le 2\big(\|f - f^*\|^2_{L^2(\rho_X)} + \|f^* - f_\lambda\|^2_{L^2(\rho_X)}\big)$. This shows that
$$\begin{aligned}
R_n\left(f \in \mathcal{H}_{p,M}(R) : \int_X (f - f^*)^2 d\rho_X \le \frac{r}{16T^2}\right)
&\le R_n\left(f \in \mathcal{H}_{p,M}(R) : \|f - f_\lambda\|^2_{L^2(\rho_X)} \le 2\|f^* - f_\lambda\|^2_{L^2(\rho_X)} + \frac{r}{8T^2}\right)\\
&\le R_n\left(f \in \mathcal{H}_{p,M}(R) : \|f - f_\lambda\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right)\\
&\le R_n\left(f : \|f\|_{p,M} \le R,\ \|g\|_{p,M} \le \|f_\lambda\|_{p,M},\ \|f - g\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right)\\
&\le R_n\left(f - g : f, g \in \mathcal{H}_{p,M}\left(\sqrt{\tfrac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R\right),\ \|f - g\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right)\\
&= R_n\left(f \in \mathcal{H}_{p,M}\left(\sqrt{\tfrac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R\right) : \|f\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right),
\end{aligned}$$
where the second inequality follows from (3.1), and the fourth inequality and the last equality both follow from the symmetry of the $\sigma_i$ and of $\mathcal{H}_{p,M}$. Together with (5.3), the above inequality shows that

$$R_n\big(g \in \mathcal{H}_{\ell s}(R) : T(g) \le r\big) \le C_1\, R_n\left(f \in \mathcal{H}_{p,M}\left(\sqrt{\tfrac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R\right) : \|f\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right), \tag{5.4}$$
where $C_1 := 8T$. It remains to find a suitable sub-root function $\psi$ such that

$$\psi(r) \ge C_1\, R_n\left(f : \|f\|_{p,M} \le \sqrt{\tfrac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R,\ \|f\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right) \qquad \text{for all } r > r^*.$$
Denote $R_D := \sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R$. According to Lemma 3, we have

$$R_n\left(f : \|f\|_{p,M} \le R_D,\ \|f\|^2_{L^2(\rho_X)} \le 2\mathcal{D}_p(\lambda) + \frac{r}{8T^2}\right) \le \sqrt{\frac{2\mathcal{D}_p(\lambda)\sum_{m=1}^M h_m}{\kappa_M n}} + \sqrt{\frac{2r\sum_{m=1}^M h_m}{8T^2\kappa_M n}} + \sqrt{\frac{c_p R_D^2}{n}\left\|\Big(\sum_{l=h_m+1}^{\infty}\mu_{l,m}\Big)_{m=1}^M\right\|_{\frac{p^*}{2}}} + \frac{R_D\, c\, M^{\frac{1}{p^*}}}{n}.$$

Define
$$a = \frac{C_1^2\sum_{m=1}^M h_m}{4T^2\kappa_M n}, \qquad b = C_1\sqrt{\frac{2\mathcal{D}_p(\lambda)\sum_{m=1}^M h_m}{\kappa_M n}} + C_1\sqrt{\frac{c_p R_D^2}{n}\left\|\Big(\sum_{l=h_m+1}^{\infty}\mu_{l,m}\Big)_{m=1}^M\right\|_{\frac{p^*}{2}}} + \frac{C_1 R_D\, c\, M^{\frac{1}{p^*}}}{n},$$
and set $\psi(r) := \sqrt{ar} + b$.

To find the fixed point of $\psi(r)$, we only need to solve the equation $\sqrt{ar} + b = r$, which is equivalent to solving $r^2 - (a + 2b)r + b^2 = 0$ for its positive root. This implies that $r^* \le a + 2b$, so it suffices to bound $a$ and $b$ separately. By Proposition 6, taking $K = 2$, for all $0 < \delta < 1$ we have, with confidence at least $1 - \delta/2$,
$$\mathbb{E}(g) - \frac{1}{n}\sum_{i=1}^n g(z_i) \le \frac{\mathbb{E}(g)}{2} + 2r^* + \frac{\big(70 \vee 88\log(2/\delta)\big)T^2}{n}, \qquad \forall\, g \in \mathcal{H}_{\ell s}(R). \tag{5.5}$$

We now compute an upper bound for the fixed point $r^*$ under Assumption 1, that is, $\mu_{l,m} \le c\, l^{-1/s_m}$ for all $l \ge 1$ and $1 \le m \le M$. This implies that
$$\sum_{l=h_m+1}^{\infty}\mu_{l,m} \le c\sum_{l=h_m+1}^{\infty} l^{-1/s_m} \le c\int_{h_m}^{\infty} x^{-1/s_m}\, dx = \frac{c}{1/s_m - 1}\, h_m^{1-1/s_m}, \qquad m = 1, \dots, M. \tag{5.6}$$


To estimate $b$, note that
$$C_1\sqrt{\frac{2\mathcal{D}_p(\lambda)\sum_{m=1}^M h_m}{\kappa_M n}} \le \mathcal{D}_p(\lambda) + \frac{2C_1^2\sum_{m=1}^M h_m}{\kappa_M n};$$
then we can bound $r^*$ by
$$r^* \le \mathcal{D}_p(\lambda) + \frac{4C_1^2\sum_{m=1}^M h_m}{\kappa_M n} + C_1\sqrt{\frac{c_p R_D^2}{n}\left\|\Big(\sum_{l=h_m+1}^{\infty}\mu_{l,m}\Big)_{m=1}^M\right\|_{\frac{p^*}{2}}} + \frac{C_1 R_D\, c\, M^{\frac{1}{p^*}}}{n}.$$
Observing that $C_1 = 8T$ and applying the $\ell^p$-to-$\ell^q$ conversion $\|a\|_{\ell^q} \le M^{\frac{1}{q}-\frac{1}{p}}\|a\|_{\ell^p}$ for $0 < q < p \le \infty$, we have
$$\frac{4C_1^2\sum_{m=1}^M h_m}{\kappa_M n} = \frac{256T^2\sum_{m=1}^M h_m}{\kappa_M n} \le 256T^2\sqrt{\frac{M\sum_{m=1}^M h_m^2}{\kappa_M^2 n^2}} \le 256T^2\sqrt{\frac{M^{\frac{2}{p^*}+\frac{1}{2}}\left\|\big(h_m^2\big)_{m=1}^M\right\|_{\frac{p^*}{2}}}{\kappa_M^2 n^2}}.$$
Using the subadditivity of $\sqrt{\cdot}$ again, we can bound $r^*$ by

$$r^* \le \mathcal{D}_p(\lambda) + \frac{C_1 + 256T^2}{\sqrt{2}}\min_{0 \le h_m \le \infty}\sqrt{\frac{M^{\frac{2}{p^*}+\frac{1}{2}}}{\kappa_M^2 n^2}\left\|\big(h_m^2\big)_{m=1}^M\right\|_{\frac{p^*}{2}} + \frac{c_p R_D^2}{n}\left\|\Big(\sum_{l=h_m+1}^{\infty}\mu_{l,m}\Big)_{m=1}^M\right\|_{\frac{p^*}{2}}} + \frac{C_1 R_D\, c\, M^{\frac{1}{p^*}}}{n}. \tag{5.7}$$

2

Substitute the Eq. (5.6) into the above inequality and take the derivative with respect to hm , and let the derived derivative be zero, the optimal hm can be expressed as

hm ¼ ðcp R2D M ð4þp

 Þ=2p

1

j2M nÞ1þ1=sm :

Re-substituting this choice into (5.7), we obtain
$$r^* = O\left(\sqrt{\left\|\Big(n^{-\frac{2}{1+s_m}}\Big)_{m=1}^M\right\|_{\frac{p^*}{2}}}\right),$$
which shows that the asymptotic rate of convergence in $n$ is determined by the kernel with the slowest decaying spectrum. Denote $h_{\max} := \big(c_p R_D^2 M^{\frac{4+p^*}{2p^*}} \kappa_M^2 n\big)^{\frac{1}{1+1/s_{\max}}}$. In view of (5.7), we can bound $r^*$ by

$$\begin{aligned}
r^* &\le \mathcal{D}_p(\lambda) + \frac{\big(8T + 256T^2 M^{\frac12}\big)h_{\max}}{\kappa_M^2 n} + \frac{C_1 R_D\, c\, M^{\frac{1}{p^*}}}{n} + 8T\sqrt{\frac{2c_p}{n}}\left(\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R\right)\\
&\le \mathcal{D}_p(\lambda) + 128 c_p\big(8T + 256T^2\big)\left(\frac{\mathcal{D}_p(\lambda)}{\lambda}\vee R^2\right)^{\frac{1}{1+1/s_{\max}}}\kappa_M^{-\frac{2}{1+1/s_{\max}}}\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}} + 8T\sqrt{\frac{2c_p}{n}}\left(\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R\right)\\
&\le \mathcal{D}_p(\lambda) + c_p(264T)^2 M^{\frac{p^*(3s_{\max}+1)}{12(1+s_{\max})}}\left(\frac{\mathcal{D}_p(\lambda)}{\lambda}\vee R^2\right)^{\frac{1}{1+1/s_{\max}}}\kappa_M^{-\frac{2}{1+1/s_{\max}}}\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}} + 8T\sqrt{\frac{2c_p}{n}}\left(\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}} \vee R\right).
\end{aligned}$$
Thus we complete the proof of Proposition 4. $\square$

6. Discussion

Among approaches for learning kernels, $\ell^p$-norm ($p > 1$) MKL has been shown to be helpful in practice and frequently outperforms classical single-kernel machines in empirical applications. However, existing theoretical results only show that its generalization ability depends on the "worst" kernel (see [19,30]), which leaves a large gap between the practical and theoretical pictures. We note that $\ell^p$-MKL can have better generalization performance only when it improves the approximation error, in the sense of statistical learning theory. For this reason, we consider a general case in which the target function is not necessarily in the composite kernel space. This paper establishes optimal learning rates under more general conditions than in the previous literature. Our results demonstrate explicitly how the generalization error of the algorithm is influenced by the correlation among the candidate kernels and by the regularity of the target function. By exploring the approximation ability of $\ell^p$-MKL in Section 4, we conclude that $\ell^p$-MKL can achieve better generalization performance than single kernel machines provided that the target function is less smooth or its structure is more complex. Consequently, our study provides new insights into learning kernels with $\ell^p$-norm ($p > 1$) regularization.


Acknowledgements

We are deeply grateful to the editor and the anonymous reviewers for their valuable suggestions, which helped improve our manuscript greatly. The first author's research is supported partially by the National Natural Science Foundation of China (Nos. 11226111 and 11301421), as well as by the Fundamental Research Funds for the Central Universities (Nos. 14TD0046 and JBK140210).

Appendix A. Geometric characteristics of the Incoherence Assumption

First, for $J \subseteq \{1, 2, \dots, M\}$, we will use a modification of $\kappa_M$ indexed by $J$, denoted $\beta_2(J, \rho_X)$:
$$\beta_2(J, \rho_X) := \sup\left\{ \kappa \ge 0 \ :\ \kappa \le \frac{\left\|\sum_{m=1}^M f_m\right\|^2_{L^2(\rho_X)}}{\sum_{j \in J}\|f_j\|^2_{L^2(\rho_X)}},\ \ f_m \in \mathcal{H}_m\ (m = 1, \dots, M) \right\}.$$

Denote by $\mathcal{H}_J$ the linear span of $\{f_m : m \in J\}$. We will use the following quantity, which characterizes the correlation between the subspaces $\mathcal{H}_I$ and $\mathcal{H}_J$ for $I, J \subseteq \{1, 2, \dots, M\}$:
$$\rho(I, J) = \sup\left\{ \frac{\langle f, g\rangle_{L^2(\rho_X)}}{\|f\|_{L^2(\rho_X)}\,\|g\|_{L^2(\rho_X)}} \ :\ f \in \mathcal{H}_I,\ g \in \mathcal{H}_J,\ f \ne 0,\ g \ne 0 \right\}.$$
Denote $\rho(J) := \rho(J, J^c)$, where $J^c$ is the complement of $J$. Clearly, $\rho(I, J)$ is the maximal cosine of the angle in $L^2(\rho_X)$ between vectors in $\mathcal{H}_I$ and $\mathcal{H}_J$, analogous to the notion of canonical correlation in multivariate statistical analysis. Finally, we will use one more quantity. Given $f_m \in \mathcal{H}_m$, $m \in \{1, 2, \dots, M\}$, denote by $\mu(\{f_m : m \in J\})$ the minimal eigenvalue of the Gram matrix $\big(\langle f_i, f_j\rangle_{L^2(\rho_X)}\big)_{i,j \in J}$, and let

$$\mu(J) := \inf\big\{ \mu(\{f_m : m \in J\}) \ :\ f_m \in \mathcal{H}_m,\ \|f_m\|_{L^2(\rho_X)} = 1 \big\}.$$
The following simple proposition reveals some relations among these quantities; its proof follows easily from Koltchinskii [20].

Proposition 5. For all $J \subseteq \{1, 2, \dots, M\}$, there holds
$$\beta_2(J, \rho_X) \le \frac{1}{\mu(J)\big(1 - \rho^2(J)\big)}.$$
This result shows that the quantity $\beta_2(J, \rho_X)$ is small if the spaces of random functions $\mathcal{H}_m$, $m \in \{1, 2, \dots, M\}$, satisfy suitable "weak correlation" conditions.

Appendix B. Some useful lemmas and propositions

The following result is a slight modification of Theorem 3 in Blanchard et al. [2].

Proposition 6. Let $\mathcal{F}$ be a class of measurable functions from $Z$ to $[a, b]$, and suppose there is a functional $T : \mathcal{F} \to \mathbb{R}_+$ satisfying $\mathrm{Var}(f) \le T(f) \le B\,\mathbb{E}(f)$ for any $f \in \mathcal{F}$ and some positive constant $B$. Let $\psi$ be a sub-root function and let $r^*$ be its fixed point. Assume that, for any $r \ge r^*$,
$$R_n\big(f \in \mathcal{F} : T(f) \le r\big) \le \psi(r).$$
Then for all $t > 0$ and all $K > 1$, with probability at least $1 - e^{-t}$,
$$\mathbb{E}f - \frac{1}{n}\sum_{i=1}^n f(z_i) \le \frac{\mathbb{E}f}{K} + \frac{704(K-1)}{B}\,r^* + \frac{t(K-1)\big(11(b-a) + 26BK\big)}{Kn}, \qquad \forall\, f \in \mathcal{F}.$$

Lemma 1. Let $\xi$ be a random variable on a probability space $Z$ with variance $\sigma^2$, satisfying $|\xi - \mathbb{E}(\xi)| \le M_\xi$ for some constant $M_\xi$. Then for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mathbb{E}(\xi) \le \frac{2M_\xi\log(1/\delta)}{3m} + \sqrt{\frac{2\sigma^2\log(1/\delta)}{m}}.$$

Lemma 2. Let $\mathcal{F}$ be a function class. If $L : \mathbb{R} \to \mathbb{R}$ is Lipschitz with constant $L_0$, then $R_n(L \circ \mathcal{F}) \le 2L_0\, R_n(\mathcal{F})$.


Lemma 3. Suppose that the Kernel Assumption (Assumption 3) and the Incoherence Assumption (Assumption 2) both hold. Then the local Rademacher complexity of the multi-kernel class $\mathcal{H}_{p,M}(D)$ can be bounded, for any $1 \le p \le 2$, as
$$R_{n,r}(\mathcal{H}_{p,M}(D)) \le \sqrt{\frac{4r\sum_{m=1}^M h_m}{\kappa_M n}} + \sqrt{\frac{4c_p D^2}{n}\left\|\Big(\sum_{l=h_m+1}^{\infty}\mu_{l,m}\Big)_{m=1}^M\right\|_{\frac{p^*}{2}}} + \frac{D\, c\, M^{\frac{1}{p^*}}}{n},$$
where $h_1, \dots, h_M$ are arbitrary nonnegative integers.

This lemma can be proved by an almost literal repetition of Theorem 5 of Kloft and Blanchard [19], which was stated under a no-correlation assumption. Note the case with positive correlations among the RKHSs: more overlap arises as $\kappa_M$ increases, and in this case the functional complexity of the composite RKHS becomes smaller accordingly. This phenomenon coincides with the conclusion given above. Theorem 1 immediately yields the following relation, since $\lambda\|\hat f_z\|_{p,M}^2$ is bounded by the right-hand side of (3.3).

Lemma 4. Under the assumptions of Theorem 1, for any $0 < \delta < 1$ and $R > T$, there is a set $V_R$ with $\rho(V_R) \le \delta$ such that

$$\mathcal{W}(R) \subseteq \mathcal{W}\big(a_n\sqrt{R} + b_n\big) \cup V_R,$$
where $a_n, b_n$ are independent of $R$ and given by
$$a_n := \sqrt{\frac{(\log(2/\delta)+1)\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}}}{\lambda}}, \qquad b_n := \sqrt{\frac{\left(\frac{\mathcal{D}_p(\lambda)}{\lambda}\right)^{\frac{s_{\max}}{1+s_{\max}}}\left(\frac{1}{n\lambda}\right)^{\frac{1}{1+s_{\max}}} + \mathcal{D}_p(\lambda)}{\lambda}}.$$
By applying the iterative procedure, we can obtain an improved bound for $\|\hat f_z\|_{p,M}$. Intuitively, we expect $\hat f_z$ to be a good approximation of $f_\lambda$, and we know $\|f_\lambda\|_{p,M} \le \sqrt{\mathcal{D}_p(\lambda)/\lambda}$, so one would expect $\|\hat f_z\|_{p,M}$ to be bounded by a multiple of $\sqrt{\mathcal{D}_p(\lambda)/\lambda}$ as well. More details on this technique can be found in Wu et al. [33], and we omit the process.

Lemma 5. Under the assumptions of Theorem 1, let $\lambda = n^{-\frac{1}{\beta+s_{\max}}}$ with $\beta + s_{\max} \ge 1$. For any $0 < \delta < 1$ and $n \ge N_{M,\delta}$, with confidence at least $1 - \frac{\delta}{\beta+s_{\max}}$,
$$\|\hat f_z\|_{p,M} \le M^{1/p}\left(3\sqrt{\log(2/\delta)} + 2 + (2C_M)^{\frac{1}{\beta+s_{\max}}}T\right)\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}},$$
where $N_{M,\delta}$ and $C_M$ are defined in Theorem 2.

Appendix C. Main proofs

Proof of Example 1. We define an approximating function as
$$f_{M,\sigma} = \frac{1}{M}\sum_{m=1}^M \left(\frac{\sigma_m}{\pi}\right)^{d/2}\int_X G_{\sigma_m}(x, y) f^*(y)\, dy.$$

kf M;r 

 f k2L2 ðqX Þ

R



X

Grm ðx; yÞf ðyÞdy 2 HGrm ,



2

1 X M  pkk2

 2 

 rm e g ¼ k f M;r  f kL2 ðqX Þ ¼

e  1 fe

2

M m¼1

then



Write C :¼ ðp

 2 kf k2L2 ðqX Þ

þ

D2 ðkÞ 6 inf fkf M;r  rm

By taking

rm ¼

 1

ð4þsÞ ð4þsÞdþ2s

k



D2 ðkÞ 6 C k

2s ð4þsÞdþ2s

On the other hand, if

 kf k2Hs Þ,

 f k2L2 ðqX Þ

follows



6 ðp2 kf k2L2 ðqX Þ þ kf k2Hs Þ

L ðqX Þ



it

M 1X  2s rm4þs : M m¼1

we have that

þ

(

kkf M;r k2Gr g

) M M 2s 1X kX 4þs d 6 C inf rm þ r : rm M M m¼1 m m¼1 

, the above inequality gives

;

8M 2 Nþ :

rm ¼ mc with some c > 0, then there exist two constants c1 ; c2 , such that

M 2sc 1X  2s rm4þs 6 c1 M4þs ; M m¼1

that

f M;r 2 HGr ,

and


and
$$\frac{1}{M}\sum_{m=1}^M \sigma_m^{d} \le c_2 M^{dc}.$$
Combining the above two inequalities with the estimate of $\mathcal{D}_2(\lambda)$, we have
$$\mathcal{D}_2(\lambda) \le C\big(M^{-\frac{2sc}{4+s}} + \lambda M^{dc}\big),$$
where $C$ is a positive constant independent of $M$ and $n$. When $M = \left\lceil\left(\frac{1}{\lambda}\right)^{\frac{4+s}{c(2s+d(4+s))}}\right\rceil$, there holds
$$\mathcal{D}_2(\lambda) \le C\lambda^{1-\frac{d(4+s)}{2s+d(4+s)}}.$$
Thus we complete the proof of Example 1. $\square$

Proof of Theorem 1. First, it follows from Proposition 2 that
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) + \lambda\|\hat f_z\|^2_{p,M} \le \mathcal{S}_z(\lambda, f_\lambda) + \mathcal{D}_p(\lambda).$$
Also note that $\mathcal{S}_z(\lambda, f_\lambda) = \mathcal{S}_1 + \mathcal{S}_2$, and explicit bounds for $\mathcal{S}_1$ and $\mathcal{S}_2$ are derived in Propositions 3 and 4, respectively. The result of Theorem 1 then follows easily. $\square$

Proof of Corollary 1. Since $\|\hat f_z\|_{p,M} \le \frac{T}{\sqrt{\lambda}}$, Theorem 1 tells us that $\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) + \lambda\|\hat f_z\|^2_{p,M}$ is bounded by
$$\tilde C\log\left(\frac{2}{\delta}\right)\left\{\left(\frac{1}{\lambda}\right)^{\frac12}\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}} + \mathcal{D}_p(\lambda)\right\},$$
where $\tilde C$ is a constant. Let $\left(\frac{1}{\lambda}\right)^{\frac12}\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}} = \mathcal{D}_p(\lambda)$; this implies that $\lambda = \left(\frac{1}{n}\right)^{\frac{1}{(\beta+1/2)(1+s_{\max})}}$ and $\mathcal{D}_p(\lambda) = \left(\frac{1}{n}\right)^{\frac{\beta}{(\beta+1/2)(1+s_{\max})}}$. Thus it follows that
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) + \lambda\|\hat f_z\|^2_{p,M} \le \tilde C\log\left(\frac{2}{\delta}\right)\left(\frac{1}{n}\right)^{\frac{\beta}{(\beta+1/2)(1+s_{\max})}}.$$
This completes the proof. $\square$

Proof of Theorem 2. For any $0 < \delta < 1$ and $n \ge N_{M,\delta}$, Lemma 5 means that we can take
$$R = M^{1/p}\left(3\sqrt{\log(2/\delta)} + 2 + (2C_M)^{\frac{1}{\beta+s_{\max}}}T\right)\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}},$$
and the measure of the set $\mathcal{W}(R)$ is at least $1 - \frac{\delta}{\beta+s_{\max}}$. Applying Theorem 1 with this $R$, and writing $C_1 = M^{1/p}\big(3\sqrt{\log(2/\delta)} + 2 + (2C_M)^{\frac{1}{1+s_{\max}}}T\big)$, we see that for each $z \in \mathcal{W}(R)\setminus V_R$,
$$\mathcal{E}(\pi(\hat f_z)) - \mathcal{E}(f^*) \le C_M\log\left(\frac{2}{\delta}\right)\left\{\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}}\,\frac{1}{n} + \left(\frac{\mathcal{D}_p(\lambda)}{\lambda}\right)^{\frac{1}{1+1/s_{\max}}}\left(\frac{1}{n}\right)^{\frac{1}{1+s_{\max}}} + \mathcal{D}_p(\lambda)\right\}. \tag{6.1}$$
Note that $\lambda = n^{-\frac{1}{\beta+s_{\max}}}$ with $\beta + s_{\max} \ge 1$ implies $n\lambda \ge 1$, so that $\sqrt{\frac{\mathcal{D}_p(\lambda)}{\lambda}}\,\frac{1}{n} \le \mathcal{D}_p(\lambda)$. Putting these bounds into (6.1) yields the conclusion immediately. $\square$

References

[1] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Mach. Learn. 73 (2008) 243–272.
[2] B. Blanchard, O. Bousquet, P. Massart, Statistical performance of support vector machines, Ann. Stat. 36 (2008) 489–531.
[3] F. Bach, Consistency of the group lasso and multiple kernel learning, J. Mach. Learn. Res. 9 (2008) 1179–1225.
[4] O. Bousquet, S. Boucheron, G. Lugosi, Introduction to statistical learning theory, Adv. Lect. Mach. Learn., Lect. Notes Comput. Sci. 3176 (2004) 169–207.
[5] O. Bousquet, D.J.L. Herrmann, On the complexity of learning the kernel matrix, NIPS (2002).
[6] P.L. Bartlett, O. Bousquet, S. Mendelson, Local Rademacher complexities, Ann. Stat. 33 (2005) 1497–1537.
[7] A. Caponnetto, E. De Vito, Optimal rates for regularized least-squares algorithms, Found. Comput. Math. 7 (2007) 331–368.
[8] C. Cortes, M. Mohri, A. Rostamizadeh, Generalization bounds for learning kernels, in: 27th ICML, 2010a.
[9] C. Cortes, M. Mohri, A. Rostamizadeh, Two-stage learning kernel algorithms, in: 27th ICML, 2010b.
[10] E. Candes, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Stat. 35 (2007) 2313–2351.
[11] F. Cucker, S. Smale, On the mathematical foundations of learning, Bull. Am. Soc. 39 (2001) 1–49.
[12] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (2002) 131–159.
[13] N. Cristianini, J. Kandola, A. Elisseeff, J. Shawe-Taylor, On kernel-target alignment, NIPS (2002).
[14] K. Fukumizu, F.R. Bach, A. Gretton, Statistical consistency of kernel canonical correlation analysis, J. Mach. Learn. Res. 8 (2007) 361–383.
[15] P.V. Gehler, S. Nowozin, Infinite kernel learning, NIPS (2008).
[16] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer Series in Statistics, 2002.


[17] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.R. Muller, A. Zien, Efficient and accurate $\ell^p$-norm multiple kernel learning, NIPS 22 (2009) 997–1005 (MIT Press, Cambridge).
[18] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, $\ell^p$-norm multiple kernel learning, J. Mach. Learn. Res. 12 (2011) 953–997.
[19] M. Kloft, G. Blanchard, On the convergence rate of $\ell^p$-norm multiple kernel learning, J. Mach. Learn. Res. 13 (2012) 2465–2501.
[20] V. Koltchinskii, Sparsity in penalized empirical risk minimization, Ann. IHP 45 (2009) 7–57.
[21] V. Koltchinskii, M. Yuan, Sparsity in multiple kernel learning, Ann. Stat. 38 (2010) 3660–3695.
[22] A. Micchelli, M. Pontil, Q. Wu, D.X. Zhou, Error Bounds for Learning the Kernel, Research Note 05/09, University College London, 2005.
[23] Q. Mao, I.W. Tsang, S.H. Gao, L. Wang, Generalized multiple kernel learning with data-dependent priors, IEEE Trans. Neural Netw. Learn. Syst. (2014), in press, http://dx.doi.org/10.1109/TNNLS.2014.2334137.
[24] T. Nakashika, A. Suga, Generic object recognition using automatic region extraction and dimensional feature integration utilizing multiple kernel learning, ICASSP (2011) 1229–1232.
[25] A.R. Paiva, I. Park, J.C. Principe, A reproducing kernel Hilbert space framework for spike train signal processing, Neural Comput. 21 (2009) 424–449.
[26] C. Scovel, I. Steinwart, Fast rates for support vector machines, Lect. Notes Comput. Sci. 59 (2005) 853–888.
[27] I. Steinwart, A. Christmann, Support Vector Machines, Springer, New York, 2008.
[28] I. Steinwart, D. Hush, C. Scovel, Optimal rates for regularized least squares regression, in: 22nd ACLT, 2009, pp. 79–93.
[29] S. Smale, D.X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003) 1–25.
[30] T. Suzuki, Unifying framework for fast learning rate of non-sparse multiple kernel learning, NIPS (2011).
[31] P. Vepakomma, A. Elgammal, Learning distance correlation maximizing functions in vector-valued reproducing kernel Hilbert spaces, in press.
[32] Q. Wu, Y.M. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complex. 23 (2007) 108–138.
[33] Q. Wu, Y.M. Ying, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171–192.
[34] J.W. Xu, Nonlinear Signal Processing based on Reproducing Kernel Hilbert Space, LAP Lambert Academic Publishing, 2009.
[35] X.X. Xu, I.W. Tsang, D. Xu, Soft margin multiple kernel learning, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 749–761.
[36] S. Yu, T. Falck, A. Daemen, L.C. Tranchevent, J.A. Suykens, B.D. Moor, Y. Moreau, $L^2$-norm multiple kernel learning and its application to biomedical data fusion, BMC Bioinform. 11 (2010) 309–333.
[37] B. Zamania, A. Akbaria, B. Nasersharif, Evolutionary combination of kernels for nonlinear feature transformation, Inform. Sci. 274 (2014) 95–107.