Journal of Statistical Planning and Inference 139 (2009) 2543–2551

Convergence rates of generalization errors for margin-based classification

Changyi Park, Department of Statistics, University of Seoul, Jeonnong-dong 90, Dongdaemun-gu, Seoul 130-743, Republic of Korea

Article history: Received 31 January 2007; received in revised form 20 November 2008; accepted 20 November 2008; available online 30 November 2008.

Keywords: Classification; Convex and non-convex loss; Empirical process; Statistical learning theory

Abstract

This paper develops a general approach to quantifying the size of generalization errors for margin-based classification. A trade-off between geometric margins and training errors is exhibited, along with the complexity of a binary classification problem. Consequently, learning theory can be treated in a broader framework that handles both convex and non-convex margin classifiers, including support vector machines, kernel logistic regression, and ψ-learning. Examples for both linear and nonlinear classification are provided.

1. Introduction

Classification, as a tool for extracting information from data, plays an important role in science and engineering. Among the various classification methodologies, margin-based classification has recently seen significant developments. The central problem this paper addresses is the accuracy of various margin classifiers obtained by minimizing a penalized objective function in binary classification.

The machine-learning literature on classification is vast, and the citations below are necessarily brief; with apologies, we cite only work that bears directly on our framework. Lin (2000) studies the rates of convergence of support vector machines, based on a formulation of the method of sieves, where the rates are the same as those in function estimation. Shen et al. (2003) derive a learning theory for non-convex ψ-learning, where the rates are usually faster than those in function estimation; in fact, they show that a fast rate of n^{-1} is attainable by ψ-learning in a linear non-separable example. Zhang (2004) and Lugosi and Vayatis (2004) obtain consistency for convex margin losses. Bartlett et al. (2005) obtain data-dependent bounds using local Rademacher averages. Steinwart and Scovel (2007) and Blanchard et al. (2008) study the rates for support vector machines using Gaussian kernels. Audibert and Tsybakov (2007) obtain interesting results showing that plug-in rules may attain super-fast rates, i.e., rates faster than n^{-1}. We note, however, that their formulation is in terms of regression function estimation; while that formulation is suited for obtaining minimax rates, there is some gap between regression function estimation and margin-based classification in practice. Finally, Bartlett et al. (2006) obtain rates of convergence for convex losses adopting the low noise assumption along with boundedness conditions.

In this paper, we derive a general upper bound theory for margin-based classifiers, convex or non-convex, obtained by minimizing a surrogate objective function with penalization. In particular, we derive probabilistic as well as risk upper bounds on the generalization error under an arbitrary loss satisfying a Lipschitz condition.

This paper is organized as follows. Section 2 sets out the notation and preliminaries for margin-based classification. Section 3 discusses, one by one, the assumptions needed to quantify generalization error bounds.


Section 4 establishes a general upper bound theory for the generalization error of a general margin-based classifier. Section 5 illustrates the general theory via linear and nonlinear learning examples in terms of the speed of convergence of the generalization errors. All proofs are collected in Section 6.

2. Preliminaries

The primitives of classification involve an input space X ⊂ R^d, an output space Y = {−1, +1}, and a training sample {(X_i, Y_i)}_{i=1}^n, consisting of a random sample on the joint probability space (X × Y, B(X) × 2^Y, P(·, ·)), with B(X) a σ-field on X. Classification uses the training sample to construct f such that the classifier sign(f) decides the class assignment of an input x ∈ X. The performance is determined by the margin yf(x), where (x, y) ∈ X × Y, a correct classification corresponding to yf(x) > 0. Consequently, the overall performance of a classifier is determined by the margin.

More precisely, margin-based classification begins with a surrogate loss V(x, y) = V(yf(x)) for (x, y) ∈ X × Y, followed by minimizing the objective function Σ_{i=1}^n V(Y_i f(X_i)) over a class of candidate decision functions f ∈ F, where F is a class of functions, the parameter space. To prevent overfitting, a non-negative penalty functional J(f) is added, yielding the constrained optimization problem of minimizing

l(f) = n^{−1} Σ_{i=1}^n V(Y_i f(X_i)) + λ J(f)    (2.1)

over F, where λ > 0 is a penalization constant for the penalty functional J. As in other penalization procedures, see for example Wahba (1990), λ controls the trade-off between the training error and the penalty. The minimizer of (2.1) with respect to f ∈ F yields an estimated decision function f̂, and hence the classifier sign(f̂). In machine learning, the penalty functional is usually the inverse of the "geometric margin", or the conditional Fisher information (Corduneanu and Jaakola, 2003). The penalty functional can also be an upper bound on the complexity of the function class F or on the variance of individual functions in the class. In particular, in the linear case the geometric margin of a linear decision function f is defined to be 2/‖w‖², where f(x) = ⟨w, x⟩ + b is a hyperplane, ⟨·, ·⟩ is the usual inner product on R^d, and b ∈ R. In the nonlinear case, the geometric margin is 2/‖g‖²_K = 2/Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j), where f has the representation g(x) + b ≡ Σ_{i=1}^n α_i K(x, x_i) + b, and K: X × X → R is a proper kernel assumed to satisfy Mercer's (1909) condition; this ensures that ‖g‖²_K is a proper norm.

The above formulation provides a general framework for margin classification; in particular, different choices of V yield different learning methodologies. Below, we list the more popular examples.

Example 2.1 (SVM). Support vector machines use the hinge loss defined by V(z) = [1 − z]_+, where [z]_+ = max{0, z}. Its variants are V(z) = [1 − z]_+^q for q ≥ 1, see Lin (2002).

Example 2.2 (KLR). Kernel logistic regression adopts the logistic loss V(z) = log(1 + e^{−z}), see Zhu and Hastie (2005).

Example 2.3 (ψ-learning). The ψ-loss is of the form V(z) = ψ(z), defined as ψ(z) = 0 for z ≥ 1, ψ(z) = 1 − z for 0 ≤ z < 1, and ψ(z) = 1 otherwise; see Shen et al. (2003) for the general form of ψ losses.

Example 2.4 (Sigmoid-learning). The normalized sigmoid loss V(z) = 1 − tanh(cz) is also margin-based, see Mason et al. (2000).

We would like to point out that the distance weighted discrimination proposed by Marron et al. (2007) uses a nonstandard loss function depending on the margin. However, this loss function may not be compared directly with other margin-based losses since the penalty term is different from the standard choice.
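To make the framework concrete, the following Python sketch (illustrative only, not part of the paper; all function and variable names are ours) evaluates the four surrogate losses listed above and the penalized objective (2.1) for a linear decision function f(x) = ⟨w, x⟩ + b, taking J(f) = ‖w‖²/2, i.e., the reciprocal of the geometric margin up to a constant.

```python
import numpy as np

# Surrogate margin losses V(z), z = y * f(x); definitions follow Examples 2.1-2.4.
def hinge(z, q=1.0):            # SVM: [1 - z]_+^q
    return np.maximum(0.0, 1.0 - z) ** q

def logistic(z):                # kernel logistic regression: log(1 + e^{-z})
    return np.log1p(np.exp(-z))

def psi(z):                     # psi-loss: 0 if z >= 1, 1 - z if 0 <= z < 1, 1 otherwise
    return np.where(z >= 1.0, 0.0, np.where(z >= 0.0, 1.0 - z, 1.0))

def sigmoid(z, c=1.0):          # normalized sigmoid loss: 1 - tanh(cz)
    return 1.0 - np.tanh(c * z)

def penalized_objective(w, b, X, y, V, lam):
    """Empirical version of (2.1): n^{-1} sum_i V(y_i f(x_i)) + lam * J(f),
    with f(x) = <w, x> + b and J(f) = ||w||^2 / 2 (inverse geometric margin)."""
    margins = y * (X @ w + b)
    return V(margins).mean() + lam * 0.5 * np.dot(w, w)

# Illustrative usage on random data (hypothetical numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
w, b, lam = np.ones(3), 0.0, 0.1
for name, V in [("hinge", hinge), ("logistic", logistic), ("psi", psi), ("sigmoid", sigmoid)]:
    print(name, penalized_objective(w, b, X, y, V, lam))
```

For a nonlinear classifier one would replace ⟨w, x⟩ + b by Σ_{i=1}^n α_i K(x, x_i) + b and the penalty by ‖g‖²_K = α′Kα, where K is the kernel matrix.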


The empirical version of minimizing EV(Yf(X)) is minimizing (2.1) without the penalty term. Denote by f_V the minimizer of EV(Yf(X)). Then

e_V(f, f_V) = EV(Yf(X)) − EV(Yf_V(X))    (2.2)

is known as the excess surrogate risk, where f is a measurable function. The misclassification loss is defined by L(z) = (1/2)(1 − sign(z)). The Bayes rule is defined as f̄ = sign(f*), where f*(x) = P(Y = 1|X = x) − 1/2 is the regression function obtained by minimizing the generalization error EL(Yf(X)) over all measurable f. The excess risk is defined as

e(f, f̄) = EL(Yf(X)) − EL(Yf̄(X)) ≥ 0.    (2.3)
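The next sketch (ours, using a hypothetical one-dimensional toy distribution) illustrates the two quantities just defined: it estimates by Monte Carlo the excess risk (2.3) and the excess surrogate risk (2.2) of a deliberately shifted classifier, using the hinge loss as V and taking f_V to be the Bayes rule, which is legitimate for the hinge loss (see the remark below).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Toy model: X uniform on [-1, 1], P(Y=1|X=x) = 0.2 + 0.6 * (x > 0),
# so f*(x) = P(Y=1|X=x) - 1/2 and the Bayes rule is sign(x).
X = rng.uniform(-1.0, 1.0, size=n)
pstar = 0.2 + 0.6 * (X > 0)
Y = np.where(rng.uniform(size=n) < pstar, 1.0, -1.0)

L = lambda z: 0.5 * (1.0 - np.sign(z))          # misclassification loss
V = lambda z: np.maximum(0.0, 1.0 - z)          # hinge loss as the surrogate

f_bayes = np.sign(X)                             # f_bar = sign(f*)
f_cand  = np.sign(X - 0.3)                       # a deliberately shifted classifier

excess_risk = L(Y * f_cand).mean() - L(Y * f_bayes).mean()        # (2.3), Monte Carlo estimate
excess_surrogate = V(Y * f_cand).mean() - V(Y * f_bayes).mean()   # (2.2) with f_V = f_bar
print(excess_risk, excess_surrogate)
```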

Throughout this paper, the surrogate loss V is assumed to be an upper envelope of L, meaning that L(z) ≤ V(z) for all z ∈ R. Note that f̄ can be taken as f_V for the hinge or ψ losses.

Finally, we introduce a complexity measure of a function class F, the L₂-metric entropy with bracketing. Given any ε > 0, the set {(f_j^l, f_j^u)}_{j=1}^{N_ε} is called an ε-bracketing function of F if for any f ∈ F there is a j such that f_j^l ≤ f ≤ f_j^u, and ‖f_j^u − f_j^l‖₂ ≤ ε for all j = 1, ..., N_ε, where ‖·‖₂ is the L₂-norm. The L₂-metric entropy with bracketing H_B(ε, F) of F is defined as the logarithm of the cardinality of the smallest ε-bracketing function of F.

We develop a theory for the convergence of the excess risk of f̂ obtained from minimizing the objective function (2.1) defined by the surrogate loss V. Our theory treats convex as well as non-convex losses in a unified way.

3. Assumptions

In order to bound the excess risk (2.3) of a given learning procedure based on a surrogate loss, a number of technical assumptions need to be made. We list these conditions below for use in Section 4. In the following, C₁, C₂, ... denote positive constants independent of n.

A1. V is Lipschitz, i.e., |V(z₁) − V(z₂)| ≤ C₁|z₁ − z₂| for z₁, z₂ ∈ R.

A2. F is uniformly bounded on X.

A3. For some positive sequence s_n = O(n^{−s_V}) with s_V > 0, there exists a sequence {f_n} ⊂ F such that e_V(f_n, f_V) ≤ s_n and J_n = max{J(f_n), 1}.

A4. The excess risk satisfies e(f, f̄) ≤ C₂ e_V(f, f_V)^α and E{V(Yf(X)) − V(Yf_V(X))}² ≤ C₃ e_V(f, f_V)^β for some constants 0 < α, β ≤ 1, where C₂, C₃ may depend on V.

A5. There exists a sequence {ε_n} of positive numbers such that H_B(ε_n, F) = O(nε_n^{4−2β}) for the same β as in A4.

A6. There exist some constant 0 < θ ≤ ∞ and C₄ such that P(x ∈ X: |f*(x)| ≤ δ) ≤ C₄δ^θ for sufficiently small δ > 0.

Some comments on these assumptions are in order. Assumption A1 is required for bounding H_B(u, D_{V,n}) by H_B(u, F), where D_{V,n} = {d: d(f) = V(yf(x)) − V(yf_n(x))}; on a compact interval of R, C₁ may depend on V alone. Assumption A2 is needed for unbounded losses; for bounded losses such as ψ, it is not necessary. Assumption A3 is a condition on the approximation error rate that can be determined for a given F and the joint distribution of (X, Y). Assumption A4 is a moment condition for the empirical process defined by the surrogate loss. Assumption A5 is a complexity condition on F; the smallest ε_n satisfying A5 yields the best upper bound on the estimation error rate of a classifier. Assumption A6 is the low noise assumption introduced in Mammen and Tsybakov (1999). It describes the boundary behavior of the optimal decision function; in particular, θ = +∞ corresponds to the easiest classification problems, such as separable cases. As the polynomial kernel example in Section 5 below shows, θ = +∞ is not limited to separable cases.

4. Main results

The following theorem yields the rate of convergence to the Bayes risk for general losses under a number of assumptions and represents the most general result to date.

Theorem 4.1. Suppose A1–A5 are satisfied and let δ_n² = max{ε_n², 2s_n}. If f̂ is a minimizer of l(f) with λJ_n ≤ δ_n²/2, then

P(e(f̂, f̄) ≥ δ_n^{2α}) ≤ 3.5 exp(−C₅ n(λJ_n)^{2−β})


as n → ∞, where 0 < α, β ≤ 1. Furthermore,

e(f̂, f̄) = O_P(δ_n^{2α})  and  E{e(f̂, f̄)} = O(δ_n^{2α}),

provided that n(λJ_n)^{2−β} is bounded away from zero.

The following result covers convex loss functions and is similar to the results of Bartlett et al. (2006).

Corollary 4.2. Suppose A2, A3, A5, and A6 hold, and let V be strictly convex with a continuous second derivative. Let α = max{1/2, θ/(θ + 1)} and δ_n² = max{ε_n^{2α}, 2s_n}, where 0 < θ ≤ ∞. For any minimizer f̂ of l(f) with λJ_n ≤ δ_n²/2,

P(e(f̂, f̄) ≥ δ_n²) ≤ 3.5 exp(−C₆ nλJ_n)

as n → ∞. Furthermore,

e(f̂, f̄) = O_P(δ_n²)  and  E{e(f̂, f̄)} = O(δ_n²),

provided that nλJ_n is bounded away from zero.

Yang (1999) obtained minimax rates for squared error loss without A6. From Lemma 6.1 (Section 6), we see that α = 1/2 without A6, so in this case the best possible rate our corollary yields is n^{−1/2}; this is identical to Yang's result.

The following result is applicable to SVM.

Corollary 4.3. Let V be the hinge loss and suppose A2, A3, A5, and A6 hold. Let δ_n² = max{ε_n², 2s_n}. For any minimizer f̂ of l(f) with λJ_n ≤ δ_n²/2, there exists a constant C₇ > 0 such that

P(e(f̂, f̄) ≥ δ_n²) ≤ 3.5 exp(−C₇ n(λJ_n)^{(θ+2)/(θ+1)})

as n → ∞, where 0 < θ ≤ ∞. Furthermore,

e(f̂, f̄) = O_P(δ_n²)  and  E{e(f̂, f̄)} = O(δ_n²),

provided that n(λJ_n)^{(θ+2)/(θ+1)} is bounded away from zero.

The following result applies to ψ-learning, see Shen et al. (2003).

Corollary 4.4. Let V be the ψ loss, and suppose A3, A5, and A6 hold. Let δ_n² = max{ε_n², 2s_n}. For any minimizer f̂ of l(f) with λJ_n ≤ δ_n²/2,

P(e(f̂, f̄) ≥ δ_n²) ≤ 3.5 exp(−C₈ n(λJ_n)^{(θ+2)/(θ+1)})

as n → ∞, where 0 < θ ≤ ∞. Furthermore,

e(f̂, f̄) = O_P(δ_n²)  and  E{e(f̂, f̄)} = O(δ_n²),

provided that n(λJ_n)^{(θ+2)/(θ+1)} is bounded away from zero.

5. Examples

In this section, we illustrate the above results using the polynomial and Gaussian kernels. Throughout this section, it is assumed that X = {x ∈ R^d: x₁² + ··· + x_d² ≤ 1} is the unit ball in R^d for d ≥ 1 and that the underlying marginal distribution on X is uniform.

5.1. Polynomial kernel

Let K(x, y) = (⟨x, y⟩ + 1)^p for x, y ∈ X be a polynomial kernel of order p ≥ 1. This kernel induces F consisting of all polynomials of order at most p. From (83) and (84) in Kolmogorov and Tikhomirov (1959), it follows that H_B(ε, F) = O((1/ε)^{d/p}). Let f_t(x) = x₁ be the true decision function. Given x ∈ X, let

P(Y = +1|X = x) = 1 − r if x₁ ≥ 0, and r otherwise,

where 0 < r < 1/2. Note that this is a non-separable classification problem with optimal Bayes risk r. For any sufficiently small δ > 0, P(x ∈ X: |f*(x)| ≤ δ) = 0 because p*(x) = P(Y = +1|X = x) has jump discontinuities. This implies A6 with θ = +∞.
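A quick simulation of this design (a sketch under the stated assumptions; the ball-sampling scheme and sample sizes are our choices) confirms that the Bayes risk is close to r and that P(|f*(X)| ≤ δ) = 0 for any δ < 1/2 − r, which is the θ = +∞ case of A6.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n = 3, 0.2, 200_000

# X uniform on the unit ball in R^d: direction uniform on the sphere, radius ~ U^{1/d}.
Z = rng.normal(size=(n, d))
U = rng.uniform(size=n)
X = Z / np.linalg.norm(Z, axis=1, keepdims=True) * U[:, None] ** (1.0 / d)

# P(Y=+1 | X=x) = 1-r if x_1 >= 0, r otherwise; the Bayes rule is sign(x_1).
pstar = np.where(X[:, 0] >= 0, 1.0 - r, r)
Y = np.where(rng.uniform(size=n) < pstar, 1.0, -1.0)

bayes_risk = np.mean(Y != np.sign(X[:, 0]))      # should be close to r
fstar = pstar - 0.5                              # |f*(x)| = 1/2 - r everywhere
low_noise = np.mean(np.abs(fstar) <= 0.1)        # = 0 for any delta < 1/2 - r, so theta = +infinity
print(bayes_risk, low_noise)
```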


Consider ψ-learning. For the choice f_n = nf_t, e_V(f_n, f̄) = O(n^{−1}), implying A3 with s_V = 1. A5 is fulfilled with ε_n = n^{−p/(2p+d)} when (nλJ_n)^{−1} ∼ n^{−d/(2p+d)}. By Corollary 4.4, we have e(f̂, f̄) = O(n^{−2p/(2p+d)} log(1/δ)) except on a set of probability less than some small δ > 0, and Ee(f̂, f̄) = O(n^{−2p/(2p+d)}). For comparison, apply Theorem 1 in Shen et al. (2003): the metric entropy for sets is bounded by O((1/ε)^{(d−1)/p}) for d > 1 and p > d − 1 (Dudley, 1974), hence the resulting rate is n^{−p/(p+d−1)}, which is consistent with the minimax rate of Mammen and Tsybakov (1999). Since the class of classification sets is induced by the class of functions, Theorem 1 in Shen et al. (2003) may yield faster rates than ours in some cases.

Remark 5.1. The linear kernel can be seen as a special case of the polynomial kernel with p = 1. For the linear example, Shen et al. (2003, Corollary 3) yields a rate of n^{−2/(d+2)} for ψ-learning. Since the metric entropy for sets is bounded by O(log(1/ε)), their Theorem 1 yields a rate of n^{−1} log n; they sharpened the rate to n^{−1} using a local metric entropy bound.

Remark 5.2. Since F is not sufficiently large, A3 is not satisfied for convex losses, and hence our results do not apply in this example. The results in Bartlett et al. (2006) are not applicable either, because the convergence of the excess risk is not guaranteed without the convergence of the surrogate risk. This seems to be a difficult, yet important, question for future research.

5.2. Gaussian kernel

Consider the Gaussian kernel K(x, y) = exp(−‖x − y‖²/(2σ²)), where σ > 0. Let F be the space of functions induced by this kernel. The metric entropy of F in sup-norm is H_∞(ε, F) = O((log 1/ε)^{d+1}) by (4.8) of Zhou (2002), and it is easy to show that H_B(ε, F) = O((log 1/ε)^{d+1}). Assume that the conditional distributions of X given Y = +1 and Y = −1 are normal with mean vectors μ₊ and μ₋, respectively, and common covariance matrix σ²I, where μ₊ = (+1, 0, ..., 0)′, μ₋ = (−1, 0, ..., 0)′, and 0 < σ² < ∞. Let π ∈ (0, 1) be the mixing parameter, chosen such that |log(π/(1 − π))| < 2/σ². By Bayes' theorem, p*(x) = π exp(2x₁/σ²)/(1 − π + π exp(2x₁/σ²)). Denote the true decision function by f_t(x) = x₁ − x₁*, where x₁* = (σ²/2) ln(π/(1 − π)). Using an argument similar to that in the polynomial kernel example, for any sufficiently small δ > 0,

P(x ∈ X: |f*(x)| ≤ δ) ≤ P(x ∈ X: |x₁ − x₁*| ≤ (σ²/2) ln((1 + 2δ)/(1 − 2δ))) ≤ P(x ∈ X: |x₁ − x₁*| ≤ C₉δ) = O(δ),

using a Taylor series expansion. Hence A6 is satisfied with θ = 1.

First, let us consider a strictly convex loss V. For KLR, it is easily shown that f_V(x) = (1/σ²)f_t ∈ F. Hence A3 is satisfied with an arbitrarily fast rate. When (nλJ_n)^{−1} ∼ 1/(log n)^{d+1}, the choice ε_n = n^{−1/2}(log n)^{(d+1)/2} fulfills A5; in this case α = 1/2. By Corollary 4.2, we have e(f̂, f̄) = O(n^{−1/2}(log n)^{(d+1)/2} log(1/δ)) except on a set of probability tending to zero, and Ee(f̂, f_V) = O(n^{−1/2}(log n)^{(d+1)/2}).

Let V be the hinge loss. We can take a sequence of bounded functions {tanh(nf_t)} in C^∞(X) converging to f̄ in sup-norm.

Hence A3 is satisfied with some 0 < s_V ≤ 1 because C^∞(X) is a subset of the space spanned by F. A5 is satisfied with ε_n = n^{−1/3}(log n)^{(d+1)/3}. By Corollary 4.3, we have e(f̂, f̄) = O(n^{−2/3}(log n)^{2(d+1)/3} log(1/δ)) except on a set whose probability tends to zero, and Ee(f̂, f̄) = O(n^{−2/3}(log n)^{2(d+1)/3}). Due to the approximation error, the rate is at best n^{−2/3}(log n)^{2(d+1)/3}.

For ψ-learning, let f_n = nf_t. Then A3 is satisfied with s_V = 1 because e_V(f_n, f̄) = O(n^{−1}). By Corollary 4.4, we have e(f̂, f̄) = O(n^{−2/3}(log n)^{2(d+1)/3} log(1/δ)) except on a set whose probability tends to zero, and Ee(f̂, f̄) = O(n^{−2/3}(log n)^{2(d+1)/3}). In this example, if θ = +∞, then the best possible rate is n^{−1}(log n)^{d+1}.

Remark 5.3. Steinwart and Scovel (2007) obtain fast rates up to n^{−1} for SVM using Gaussian kernels. In addition to the low noise assumption, they impose a geometric noise condition, which describes the concentration of |2p* − 1| dP_X near the decision boundary, where P_X denotes the marginal distribution of X.
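As a numerical sanity check on the low-noise calculation for this example, the following sketch (ours; it samples from the stated normal mixture without enforcing the restriction of X to the unit ball, and the parameter values are arbitrary) estimates P(|f*(X)| ≤ δ) for a few values of δ; the estimates scale roughly linearly in δ, in line with θ = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma, pi_mix, n = 2, 1.0, 0.4, 400_000   # |log(pi/(1-pi))| < 2/sigma^2 holds for these values

# X | Y=+1 ~ N(mu_plus, sigma^2 I), X | Y=-1 ~ N(mu_minus, sigma^2 I), P(Y=+1) = pi_mix.
Y = np.where(rng.uniform(size=n) < pi_mix, 1.0, -1.0)
mu = np.zeros((n, d)); mu[:, 0] = Y           # mu_plus = (+1,0,...), mu_minus = (-1,0,...)
X = mu + sigma * rng.normal(size=(n, d))

# p*(x) = pi e^{2x_1/sigma^2} / (1 - pi + pi e^{2x_1/sigma^2}), f* = p* - 1/2.
u = pi_mix * np.exp(2.0 * X[:, 0] / sigma**2)
fstar = u / (1.0 - pi_mix + u) - 0.5

for delta in (0.02, 0.04, 0.08):
    print(delta, np.mean(np.abs(fstar) <= delta))  # roughly proportional to delta (theta = 1)
```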

6. Proofs

This section provides the proofs of our main results. We first state and prove some technical lemmas, which lead to the proofs of the corollaries; the proof of Theorem 4.1 is given last.

Lemma 6.1. Let V be a strictly convex loss with a continuous second derivative. Then A2 and A6 imply A4 with α = max{1/2, θ/(θ + 1)} and β = 1.


Proof. Let us prove the first inequality of A4. We first state two facts to be used in the proof. For any strictly convex loss V,

sign(f_V) = sign(f*) a.s.    (6.1)

by Theorem 3.1 of Lin (2002), and there exists C₁₀ depending only on V such that

|f*(x)| ≤ C₁₀|f_V(x)|    (6.2)

by Eq. (3) in Lin (2002). Define A_V(f(X)) = p*(X)V(f(X)) + (1 − p*(X))V(−f(X)). It can be seen that A_V(f(X)) = E{V(Yf(X))|X} a.s. Note that f_V(x) is a minimizer of f ↦ A_V(f(x)) for any fixed x. By Taylor's expansion of A_V(f(X)) at f_V(X),

A_V(f(X)) − A_V(f_V(X)) = A_V′(f_V(X))(f(X) − f_V(X)) + (A_V″(g(X))/2)(f(X) − f_V(X))² ≥ 0 a.s.,    (6.3)

where g(x), an intermediate value in the expansion, lies between f(x) and f_V(x). It follows from (6.3) that A_V′(f_V(X)) = 0 a.s., by setting f(x) = f_V(x) + h₁ for small h₁ > 0 and f(x) = f_V(x) − h₂ for small h₂ > 0. By the definition of minimization, A_V″(g(X)) ≥ 0 a.s. Using Hölder's inequality with (6.3) and (6.2),

e_V(f, f_V) = E{A_V(f(X)) − A_V(f_V(X))}
 ≥ C₁₂[E{|f(X) − f_V(X)| I(sign(f(X)) ≠ sign(f_V(X)))}]²
 ≥ C₁₂[E{|f_V(X)| I(sign(f(X)) ≠ sign(f_V(X)))}]²
 ≥ C₁₂C₁₀^{−2}[E{|f*(X)| I(sign(f(X)) ≠ sign(f_V(X)))}]²
 ≥ C₁₂C₁₀^{−2} e(f, f_V)²    (6.4)

where C₁₁ = sup_{f∈F} ‖f‖_∞ < ∞ and C₁₂ = inf_{|z|≤C₁₁} V″(z)/2 > 0. Similarly, using A6 with (6.3) and (6.2), we have

e_V(f, f_V) ≥ C₁₂ E{(f(X) − f_V(X))² I(sign(f(X)) ≠ sign(f_V(X)), |f_V(X)| ≥ δ)}
 ≥ C₁₂ δ[E{|f_V(X)| I(sign(f(X)) ≠ sign(f_V(X)))} − P(|f_V(X)| ≤ δ)]
 ≥ C₁₂ δ[C₁₀^{−1} E{|f*(X)| I(sign(f(X)) ≠ sign(f_V(X)))} − P(|f*(X)| ≤ C₁₀δ)]
 ≥ C₁₂ δ(C₁₀^{−1} e(f, f_V) − C₄(C₁₀δ)^θ)
 ≥ (C₁₂/2) C₁₀^{−2} (2C₄C₁₀)^{−1/θ} e(f, f_V)^{(θ+1)/θ}    (6.5)

for the choice δ = C₁₀^{−(θ+1)/θ}(e(f, f_V)/(2C₄))^{1/θ}. Furthermore, e(f, f_V) = e(f, f̄) follows from (6.1). Combining (6.4) and (6.5), we have established the desired result.

The second inequality of A4 can also be proved by a Taylor expansion argument. Let C₁₃ = sup_{|z|≤C₁₁}[V′(z)]². Using (6.3), we have

E{V(Yf(X)) − V(Yf_V(X))}² ≤ C₁₃E{f(X) − f_V(X)}² ≤ C₃ e_V(f, f_V),

where C₃ = C₁₂^{−1}C₁₃. □

Lemma 6.2. For the hinge loss, A2 and A6 imply A4 with α = 1 and β = θ/(θ + 1).

Proof. By Corollary 3.1 in Zhang (2004), the first inequality holds with C₂ = 1. From Lemma 6.6 in Steinwart and Scovel (2007),

E{V(Yf(X)) − V(Yf̄(X))}² ≤ (‖(f*)^{−1}‖_{θ,∞} + 2)(‖f‖_∞ + 1)^{(θ+2)/(θ+1)} e_V(f, f̄)^{θ/(θ+1)},

where ‖·‖_{θ,∞} denotes the norm in Lorentz spaces. By A2 and A6,

C₃ = (‖(f*)^{−1}‖_{θ,∞} + 2)(sup_{f∈F} ‖f‖_∞ + 1)^{(θ+2)/(θ+1)} < ∞. □

Lemma 6.3. Suppose that V is the ψ loss. Then A6 implies A4 with α = 1 and β = θ/(θ + 1).

Proof. Proposition 1 in Shen et al. (2003) yields the first inequality of A4 with α = 1 and C₂ = 1. The second inequality holds with β = θ/(θ + 1) and C₃ = 2^{−θ/(θ+1)}(16C₄^{1/(θ+1)} + 8); see the proof of Theorem 3.4 in Liu and Shen (2006). □


6.1. Proof of corollaries

Corollaries 4.2–4.4 follow from Lemmas 6.1–6.3, respectively, and Theorem 4.1.

In the proof of Theorem 4.1, we apply the one-sided large deviation inequality of Shen and Wong (1994) presented below.

Theorem 6.4. Let F be a class of functions bounded above by T with Ef(Z) = 0. Define E_n(f) = n^{−1/2} Σ_{i=1}^n (f(Z_i) − Ef(Z_i)) and let v ≥ sup_F Var(f(Z)). For M > 0 and 0 < ε < 1, let

ψ₂(M, v, F) = M² / (2[4v + MT/(3n^{1/2})])

and s = εM/(8n^{1/2}). Suppose that

H_B(v^{1/2}, F) ≤ (ε/4) ψ₂(M, v, F),   M ≤ εn^{1/2}v/(4T),    (6.6)

εv^{1/2} ≤ T,    (6.7)

and, if s < v^{1/2},

∫_{s/4}^{v^{1/2}} H_B(u, F)^{1/2} du ≤ εM^{3/2}/(2^{10}v^{1/2}).    (6.8)

Then it follows that

P*( sup_F E_n(f) ≥ M ) ≤ 3 exp(−(1 − ε) ψ₂(M, v, F)).

For the proof and technical details of the theorem, see Shen and Wong (1994).

6.2. Proof of Theorem 4.1

To obtain our results, we apply Kolmogorov's chaining technique; interested readers may consult van de Geer (2000) for other examples of the chaining technique. Let l(f, X, Y) = V(Yf(X)) + λJ(f). Define the scaled empirical process as

E_n(f) = n^{−1} Σ_{i=1}^n [l(f, X_i, Y_i) − l(f_n, X_i, Y_i) − E{l(f, X_i, Y_i) − l(f_n, X_i, Y_i)}].

Let A_{a,b} = {f ∈ F: 2^{a−1}δ_n² ≤ e_V(f, f_V) < 2^a δ_n², 2^{b−1}J_n ≤ J(f) < 2^b J_n} and A_{a,0} = {f ∈ F: 2^{a−1}δ_n² ≤ e_V(f, f_V) < 2^a δ_n², J(f) < J_n}, for a = 1, 2, ... and b = 1, 2, .... To bound P(e(f̂, f̄) ≥ δ_n^{2α}), we apply Theorem 6.4 to P(A_{a,b}) by controlling the mean and the variance of the process defined by l(f, X_i, Y_i) − l(f_n, X_i, Y_i). By the first inequality of A4,

P(e(f̂, f̄) ≥ C₂δ_n^{2α}) ≤ P(e_V(f̂, f_V) ≥ δ_n²)
 ≤ P*( sup_{f∈F: e_V(f,f_V) ≥ δ_n²} n^{−1} Σ_{i=1}^n {l(f_n, X_i, Y_i) − l(f, X_i, Y_i)} ≥ 0 ) ≡ I,

where P* denotes the outer probability measure. The last inequality is trivial: since f̂ minimizes n^{−1} Σ_{i=1}^n l(f, X_i, Y_i) over F and f_n ∈ F, we have n^{−1} Σ_{i=1}^n l(f_n, X_i, Y_i) ≥ n^{−1} Σ_{i=1}^n l(f̂, X_i, Y_i). Without loss of generality, we may assume that C₂ = 1, because it can be absorbed into C₅. To bound I, it suffices to bound P(A_{a,b}) for each a, b = 1, ... after decomposing F into the A_{a,b}'s. To this end, we need some inequalities for the first and second moments of l(f, X_i, Y_i) − l(f_n, X_i, Y_i) for f ∈ A_{a,b}. By A3, we have 2e_V(f_n, f_V) ≤ 2s_n ≤ δ_n².


This, together with λJ_n ≤ δ_n²/2, gives

inf_{f∈A_{a,b}} E{l(f, X_i, Y_i) − l(f_n, X_i, Y_i)} = inf_{f∈A_{a,b}} {e_V(f, f_V) + λ(J(f) − J_n)} − e_V(f_n, f_V)
 ≥ 2^{a−2}δ_n² + λ(2^{b−1} − 1)J_n ≡ M_n(a, b)    (6.9)

for any integers a, b ≥ 1, and

inf_{f∈A_{a,0}} E{l(f, X_i, Y_i) − l(f_n, X_i, Y_i)} ≥ (2^{a−2} − 1/2)δ_n² ≥ 2^{a−3}δ_n² ≡ M_n(a, 0).    (6.10)

By the triangle inequality and the second inequality of A4,

sup_{f∈A_{a,b}} E{V(Yf(X)) − V(Yf_n(X))}² ≤ C₃2^β ( sup_{f∈A_{a,b}} e_V(f, f_V) + e_V(f_n, f_V) )^β ≤ C₁₄ M_n(a, b)^β ≡ v_n(a, b)    (6.11)

for a = 1, 2, ... and b = 0, 1, ..., where C₁₄ = (1 + 4^β)C₃.

Using (6.9) and (6.10), we have I ≤ Σ_{a,b} P*(sup_{f∈A_{a,b}} E_n(f) ≥ M_n(a, b)) + Σ_a P*(sup_{f∈A_{a,0}} E_n(f) ≥ M_n(a, 0)) ≡ I₁ + I₂. Next, we bound I₁ and I₂ separately.

Define D_{V,n,k} = {d_f: d_f(x, y) = V(yf(x)) − V(yf_n(x)), f ∈ F_k}, where F_k = {f ∈ F: J(f) ≤ kJ_n}; similarly, D_{V,n} is defined for F. Denote by {(f_j^l, f_j^u)}_{j=1,...,N_ε} an ε-bracketing function of F_k. Define d_j^l(x, y) = min{V(yf_j^l(x)), V(yf_j^u(x))} − V(yf_n(x)) and d_j^u(x, y) = max{V(yf_j^l(x)), V(yf_j^u(x))} − V(yf_n(x)). For any f ∈ F_k, there is a j = 1, ..., N_ε such that f_j^l ≤ f ≤ f_j^u, implying that d_j^l ≤ d_f ≤ d_j^u. Hence {(d_j^l, d_j^u)}_{j=1,...,N_ε} is an ε-bracketing function of D_{V,n,k}. By the Lipschitz condition A1, ‖d_j^u − d_j^l‖ ≤ C₁‖f_j^u − f_j^l‖ for j = 1, ..., N_ε, where C₁ can be taken to depend on V alone. Hence

H_B(ε, D_{V,n,k}) ≤ H_B(ε/C₁, F_k).    (6.12)

Define

φ(ε_n, k) = ∫_{C₁₅M_n(a,b)}^{v_n(a,b)^{1/2}} H_B^{1/2}(u, D_{V,n,k}) du / M_n(a, b),

where we take C₁₅ = ε/32, with ε defined below. By A5, (6.12), and the facts that φ(ε_n, k) is non-increasing in a and in M_n(a, b) for a = 1, ..., that F_k ⊂ F, that δ_n ≥ ε_n, and that 0 < β ≤ 1, we have

sup_{k≤2^b} φ(ε_n, k) ≤ sup_{k≤2^b} ∫_{C₁₅M_n(1,b)}^{(C₁₄M_n(1,b)^β)^{1/2}} H_B^{1/2}(u/C₁, F_k) du / M_n(1, b)
 ≤ 2^{(b−2)(β/2−1)} ε_n^{−2} H_B^{1/2}(ε_n²/(2C₁), F) ≤ 4ε_n^{−2} H_B^{1/2}(ε_n²/(2C₁), F) ≤ C₁₆ n^{1/2},    (6.13)

where C₁₆ may depend on β and C₁. By A2, C₁₇ = sup_{f∈F} ‖d_f‖_∞ < ∞. With the choices M = n^{1/2}M_n(a, b), v = v_n(a, b), T = C₁₇, and ε = 4C₃^{−1/2}, one can show that (6.6)–(6.8) hold. Hence Theorem 6.4 is applicable to the class D_{V,n,k}. Following the argument of the proof of Theorem 1 in Shen et al. (2003), one can show that

I₁ ≤ Σ_{b=1}^∞ Σ_{a=1}^∞ 3 exp( − nM_n(a, b)² / (2[4v_n(a, b) + M_n(a, b)C₁₇/3]) )
 ≤ Σ_{b=1}^∞ Σ_{a=1}^∞ 3 exp(−C₅ nM_n(a, b)^{2−β})
 ≤ Σ_{b=1}^∞ Σ_{a=1}^∞ 3 exp(−C₅ n[(2^{a−2}δ_n²)^{2−β} + (λ(2^{b−1} − 1)J_n)^{2−β}])
 ≤ 3 exp(−C₅ n(λJ_n)^{2−β}) / [1 − exp(−C₅ n(λJ_n)^{2−β})]²

for a generic constant C₅. Similarly,

I₂ ≤ 3 exp(−C₅ n(λJ_n)^{2−β}) / [1 − exp(−C₅ n(λJ_n)^{2−β})]².

Finally,

I ≤ 6 exp(−C₅ n(λJ_n)^{2−β}) / [1 − exp(−C₅ n(λJ_n)^{2−β})]².

This implies that I^{1/2} ≤ (5/2 + I^{1/2}) exp(−C₅ n(λJ_n)^{2−β}). Since I ≤ I^{1/2} ≤ 1, the result follows. □


Acknowledgments

This is a part of my Ph.D. dissertation at the Ohio State University. The author is very indebted to Professor Xiaotong Shen for his support and guidance. The author is also grateful to Professors Peter T. Kim and Ja-Yong Koo for their valuable discussions. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-331-C00055).

References

Audibert, J.-Y., Tsybakov, A., 2007. Fast learning rates for plug-in classifiers. Ann. Statist. 35, 608–633.
Bartlett, P.L., Bousquet, O., Mendelson, S., 2005. Local Rademacher complexities. Ann. Statist. 34, 1497–1537.
Bartlett, P.L., Jordan, M.I., McAuliffe, J.D., 2006. Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101, 138–156.
Blanchard, G., Bousquet, O., Massart, P., 2008. Statistical performance of support vector machines. Ann. Statist. 36, 489–531.
Corduneanu, A., Jaakola, T., 2003. On information regularization. In: Proceedings of the Nineteenth Annual Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, pp. 151–158.
Dudley, R.M., 1974. Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory 10, 227–236.
Kolmogorov, A.N., Tikhomirov, V.M., 1959. ε-entropy and ε-capacity of sets in a functional space. Uspekhi Mat. Nauk 14, 3–86 (in Russian; English translation in Amer. Math. Soc. Transl. 17 (1961) 277–364).
Lin, Y., 2000. Some asymptotic properties of the support vector machine. TR 1029r, Department of Statistics, University of Wisconsin-Madison.
Lin, Y., 2002. A note on margin-based loss functions in classification. Statist. Probab. Lett. 68, 73–82.
Liu, Y., Shen, X., 2006. On multicategory ψ-learning and support vector machine. J. Amer. Statist. Assoc. 101, 500–509.
Lugosi, G., Vayatis, N., 2004. On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32, 30–55.
Mammen, E., Tsybakov, A.B., 1999. Smooth discriminant analysis. Ann. Statist. 27, 1808–1829.
Marron, J.S., Todd, M., Ahn, J., 2007. Distance weighted discrimination. J. Amer. Statist. Assoc. 102, 1267–1271.
Mason, L., Baxter, J., Bartlett, P., Frean, M.R., 2000. Boosting algorithms as gradient descent in function space. Adv. Neural Inform. Process. Syst. 12, 512–518.
Mercer, J., 1909. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A 209, 415–446.
Shen, X., Wong, W.H., 1994. Convergence rate of sieve estimates. Ann. Statist. 22, 580–615.
Shen, X., Zhang, X., Tseng, G.C., Wong, W.H., 2003. On ψ-learning. J. Amer. Statist. Assoc. 98, 724–734.
Steinwart, I., Scovel, J.C., 2007. Fast rates for support vector machines using Gaussian kernels. Ann. Statist. 35, 575–607.
van de Geer, S., 2000. Empirical Processes in M-estimation. Cambridge University Press, Cambridge, UK.
Wahba, G., 1990. Spline Models for Observational Data. CBMS-NSF Regional Conference Series, SIAM, Philadelphia.
Yang, Y., 1999. Minimax nonparametric classification. Part I: rates of convergence; Part II: model selection for adaptation. IEEE Trans. Inform. Theory 45, 2271–2292.
Zhang, T., 2004. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32, 56–84.
Zhou, D.X., 2002. The covering number in learning theory. J. Complexity 18, 739–767.
Zhu, J., Hastie, T., 2005. Kernel logistic regression and the import vector machine. J. Comput. Graph. Statist. 14, 185–205.