A local Vapnik–Chervonenkis complexity


Neural Networks 82 (2016) 62–75


Luca Oneto a,∗, Davide Anguita a, Sandro Ridella b

a DIBRIS - University of Genoa, Via Opera Pia 13, I-16145 Genoa, Italy
b DITEN - University of Genoa, Via Opera Pia 11A, I-16145 Genoa, Italy

Article history: Received 26 January 2016; Received in revised form 19 May 2016; Accepted 1 July 2016; Available online 18 July 2016.

Keywords: Local Rademacher Complexity; Local Vapnik–Chervonenkis entropy; Generalization error bounds; Statistical Learning Theory; Complexity measures.

Abstract. We define in this work a new localized version of a Vapnik–Chervonenkis (VC) complexity, namely the Local VC-Entropy, and, building on this new complexity, we derive a new generalization bound for binary classifiers. The Local VC-Entropy-based bound improves on Vapnik's original results because it is able to discard those functions that, most likely, will not be selected during the learning phase. The result is achieved by applying the localization principle to the original global complexity measure, in the same spirit as the Local Rademacher Complexity. By exploiting and improving a recently developed geometrical framework, we show that it is also possible to relate the Local VC-Entropy to the Local Rademacher Complexity by finding an admissible range for one given the other. In addition, the Local VC-Entropy allows one to reduce the computational requirements that arise when dealing with the Local Rademacher Complexity in binary classification problems. © 2016 Elsevier Ltd. All rights reserved.

1. Introduction

In learning systems, the development of effective measures for assessing the complexity of hypothesis classes is fundamental for enabling a precise control of the outcome of the learning process. One of the first attempts was made several decades ago, with the theory developed by V.N. Vapnik and A.Y. Chervonenkis, who proposed, among others, the well-known VC-Dimension (Vapnik, 1998; Zhang, Bian, Tao, & Lin, 2012; Zhang & Tao, 2013). The VC-Dimension defines the complexity of a hypothesis class as the cardinality of the largest set of points that can be shattered by functions of the class. Unfortunately, the VC-Dimension, like other measures, is a global one, because it takes into account all the functions in the hypothesis class, and, furthermore, it is data-independent, because it does not take into account the actual distribution of the data available for learning. As a consequence of targeting this worst-case learning scenario, the VC-Dimension leads to very pessimistic generalization bounds. In order to deal with the second issue, effective data-dependent complexity measures have been developed, which take into account the actual distribution of the data and produce tighter estimates of the complexity of the class. As an example,



∗ Corresponding author. E-mail addresses: [email protected] (L. Oneto), [email protected] (D. Anguita), [email protected] (S. Ridella). http://dx.doi.org/10.1016/j.neunet.2016.07.002 0893-6080/© 2016 Elsevier Ltd. All rights reserved.

data-dependent versions of the VC Complexities have been developed in Boucheron, Lugosi, and Massart (2000) and Shawe-Taylor, Bartlett, Williamson, and Anthony (1998), and, together with the Rademacher Complexity (Bartlett & Mendelson, 2003; Koltchinskii, 2001), they represent the state-of-the-art tools in this field. In recent years, the Rademacher Complexity has been further improved, as researchers have succeeded in developing local data-dependent complexity measures (Bartlett, Bousquet, & Mendelson, 2002, 2005; Cortes, Kloft, & Mohri, 2013; Koltchinskii, 2006; Oneto, Ghio, Ridella, & Anguita, 2015b; van de Geer, 2006). Local measures improve over global ones thanks to their ability to take into account only those functions of the hypothesis class that are most likely to be chosen by the learning procedure, i.e. the models with small error. In particular, the Local Rademacher Complexity has been shown to accurately capture the nature of the learning process, both from a theoretical point of view (Bartlett et al., 2005; Koltchinskii, 2006; Oneto et al., 2015b) and in real-world applications (Cortes et al., 2013; Kloft & Blanchard, 2011; Lei, Binder, Dogan, & Kloft, 2015; Steinwart & Scovel, 2005). We propose in this work a localized version of a VC Complexity, namely the Local VC-Entropy, and show how it can be related to the Local Rademacher Complexity through an extension of the geometrical framework presented in Anguita, Ghio, Oneto, and Ridella (2014). Gaining more insight into the mechanisms underlying different notions of complexity, and the non-trivial relationships among them, is crucial, both from a theoretical and a practical point of view. For this reason, in fact, the literature


targeting the connections between different complexity measures is quite large (Bousquet, 2002; Ledoux & Talagrand, 1991; Lei, Ding, & Bi, 2015; Massart, 2000; Sauer, 1972; Shelah, 1972; Srebro, Sridharan, & Tewari, 2010; Vapnik, 1998). Finally, we show how to exploit the Local VC-Entropy to bypass the computational difficulties that arise when computing the Local Rademacher Complexity in binary classification problems. The localization of the VC-Entropy allows us to introduce into the VC Theory the same improvements achieved by the localization of the Rademacher Complexity, such as the derivation of generalization bounds that are refined with respect to their global counterparts. In fact, based on this new localized notion of complexity, we propose a new generalization bound that does not take into account all the functions in the set, but only the ones with small error. The paper is structured as follows. In Section 2 we introduce the theoretical framework and the two localized notions of complexity, i.e. the Local Rademacher Complexity and the new Local VC-Entropy. In Section 3 we propose a new bound on the generalization error based on the Local VC-Entropy, and we show that this new bound is actually able to discard those functions that will never be chosen by the algorithm for classification purposes but are usually considered when estimating the generalization error. Section 4 is devoted to the connections between the Local Rademacher Complexity and the new Local VC-Entropy, obtained by extending a geometrical framework introduced in Anguita et al. (2014), which deals only with the global version of these complexities. In Section 5 we show the computational advantages of the Local VC-Entropy with respect to the Local Rademacher Complexity. Section 6 concludes the paper.

2. A local Vapnik–Chervonenkis complexity

Let µ be a probability distribution over X × Y, where Y = {±1}. We denote by F a class of {±1}-valued functions f ∈ F on X, and suppose that $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ with n > 1 is sampled according to µ. The accuracy of an f ∈ F in representing µ is measured according to the indicator function $I(f(X), Y) = \frac{1 - Y f(X)}{2}$, namely I(f(X), Y) = 0 if f(X) = Y and I(f(X), Y) = 1 if f(X) ≠ Y. Consequently, the empirical error $\hat{L}_n(f)$ and the generalization error L(f) of an f ∈ F can be defined as:

\[ \hat{L}_n(f) = \frac{1}{n} \sum_{i=1}^{n} I(f(X_i), Y_i), \tag{1} \]
\[ L(f) = \mathbb{E}_{(X,Y)}\, I(f(X), Y). \tag{2} \]

Let us define the following quantity:

\[ F_{D_n} = \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F \right\}, \tag{3} \]

which is the set of functions restricted to the sample. In other words, $F_{D_n}$ is the set of distinct functions distinguishable within F with respect to the dataset $D_n$. The VC-Entropy $H_n(F)$ and the Annealed VC-Entropy $A_n(F)$, together with their empirical counterparts $\hat{H}_n(F)$ and $\hat{A}_n(F)$ (Vapnik, 1998), are defined as:

\[ \hat{H}_n(F) = \ln \left| F_{D_n} \right|, \qquad H_n(F) = \mathbb{E}_{X_1, \ldots, X_n} \hat{H}_n(F), \tag{4} \]
\[ \hat{A}_n(F) = \hat{H}_n(F), \qquad A_n(F) = \ln \mathbb{E}_{X_1, \ldots, X_n} \left| F_{D_n} \right|. \tag{5} \]

Let $\{\sigma_1, \ldots, \sigma_n\}$ be n independent Rademacher random variables, for which $\mathbb{P}\{\sigma_i = +1\} = \mathbb{P}\{\sigma_i = -1\} = 1/2$. Then, the Rademacher Complexity $\hat{R}_n(F)$ (Bartlett & Mendelson, 2003; Koltchinskii, 2001) and its deterministic counterpart $R_n(F)$ are defined as:

\[ \hat{R}_n(F) = \mathbb{E}_{\sigma} \sup_{f \in F} \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, I(f(X_i), Y_i), \tag{6} \]
\[ R_n(F) = \mathbb{E}_{X_1, \ldots, X_n} \hat{R}_n(F). \tag{7} \]

Note that the definition of Rademacher Complexity adopted in this paper is in agreement with the one that appears in the recent literature (Anguita et al., 2014; Bartlett & Mendelson, 2003; Koltchinskii, 2001; Oneto, Ghio, Ridella, & Anguita, 2015a); in fact:

\[ \mathbb{E}_{\sigma} \sup_{f \in F} \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, I(f(X_i), Y_i) = \mathbb{E}_{\sigma} \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i). \tag{8} \]

The Local Rademacher Complexity $\widehat{LR}_n(F, r)$ (Bartlett et al., 2005; Koltchinskii, 2006), together with its expected value $LR_n(F, r)$, is defined as:

\[ \widehat{LR}_n(F, r) = \hat{R}_n\!\left( \left\{ f : f \in F,\ \hat{L}_n(f) \le r \right\} \right), \tag{9} \]
\[ LR_n(F, r) = R_n\!\left( \left\{ f : f \in F,\ L(f) \le r \right\} \right). \tag{10} \]

The Local Rademacher Complexity improves over its global counterpart thanks to its ability to take into account only those functions of the hypothesis class that are most likely to be chosen by the learning procedure. This is due to the fact that, in the definitions of Eqs. (9) and (10), the parameter r shrinks the hypothesis space by discarding the functions with large error; r is connected with the generalization ability of an f ∈ F, as we will see in Theorem 3.5. Note, again, that the definitions of Local Rademacher Complexity of Eqs. (9) and (10) are in agreement with the ones that appear in the recent literature (Bartlett et al., 2005; Koltchinskii, 2006; Oneto et al., 2015b); in fact:

\[ \widehat{LR}_n(F, r) = \mathbb{E}_{\sigma} \sup_{f \in \left\{ f :\ f \in F,\ \frac{1}{n} \sum_{i=1}^{n} [I(f(X_i), Y_i)]^2 \le r \right\}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, I(f(X_i), Y_i), \tag{11} \]
\[ LR_n(F, r) = \mathbb{E}_{X_1, \ldots, X_n} \mathbb{E}_{\sigma} \sup_{f \in \left\{ f :\ f \in F,\ \mathbb{E}_{(X,Y)} [I(f(X), Y)]^2 \le r \right\}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, I(f(X_i), Y_i), \tag{12} \]

since $I(f(X_i), Y_i) \in \{0, 1\}$ and, therefore, $[I(f(X_i), Y_i)]^2 = I(f(X_i), Y_i)$. Consequently, in this paper the definitions of Local Rademacher Complexity refer not to (F) but to (I ◦ F), since, as we will show in the next section, we are interested in relating the local complexity measures to the generalization error. In the framework of the VC Theory, to the best of our knowledge, such an approach has never been proposed: in fact, the VC Theory takes into account the whole hypothesis space. In a recent, unpublished preprint (Lei, Ding et al., 2015), an attempt at estimating the Local Rademacher Complexity via a Covering-Number-based (Zhou, 2002) upper bound has been made, but it still relies on the original work on the Local Rademacher Complexity (Bartlett et al., 2005). In this paper we propose, instead, a localized version of a complexity measure based on the VC Theory, and show that it can be effectively exploited in learning theory for deriving a new generalization bound. This complexity measure extends the proposal of Vapnik (1998) by introducing the notion of localization into the traditional framework of Vapnik's Statistical Learning Theory. Let us localize the set of functions defined in Eq. (3) by introducing a constraint on the error, controlled by a parameter r:
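For a finite class on a fixed sample, the quantities above can be evaluated by brute force. The sketch below is a toy illustration (the labels and the behaviours of the class F on the sample are made up for the example, they are not taken from the paper): it computes the empirical VC-Entropy of Eq. (4) and checks the equality of Eq. (8) by enumerating all $2^n$ realizations of σ.

```python
import itertools
import math

# Labels of a toy sample and the predictions of a small, hypothetical
# function class F on that sample (values chosen only for illustration).
Y = [1, -1, 1, -1]
n = len(Y)
F = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [-1, 1, -1, 1],
    [1, -1, -1, 1],
]

def I(a, b):
    """Hard loss of Eq. (1): 0 if the two labels agree, 1 otherwise."""
    return (1 - a * b) // 2

# Empirical VC-Entropy, Eq. (4): log-cardinality of F restricted to the sample.
H_hat = math.log(len({tuple(f) for f in F}))

def rademacher_on_loss():
    """E_sigma sup_f (2/n) sum_i sigma_i I(f(X_i), Y_i): left side of Eq. (8)."""
    tot = 0.0
    for s in itertools.product([-1, 1], repeat=n):
        tot += max(2 / n * sum(si * I(fi, yi) for si, fi, yi in zip(s, f, Y))
                   for f in F)
    return tot / 2 ** n

def rademacher_standard():
    """E_sigma sup_f (1/n) sum_i sigma_i f(X_i): right side of Eq. (8)."""
    tot = 0.0
    for s in itertools.product([-1, 1], repeat=n):
        tot += max(1 / n * sum(si * fi for si, fi in zip(s, f)) for f in F)
    return tot / 2 ** n
```

Since the expectation over σ is taken exactly (all $2^n$ sign vectors are enumerated), the two estimates coincide up to floating-point error, which mirrors the equality stated in Eq. (8).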

\[ \hat{F}_{(D_n, r)} = \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ \hat{L}_n(f) \le r \right\}, \tag{13} \]
\[ F_{(D_n, r)} = \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ L(f) \le r \right\}; \tag{14} \]

then, the empirical Local VC-Entropy, and its expected counterpart, can be defined as:

\[ \widehat{LH}_n(F, r) = \ln \left| \hat{F}_{(D_n, r)} \right|, \tag{15} \]
\[ LH_n(F, r) = \mathbb{E}_{X_1, \ldots, X_n} \ln \left| F_{(D_n, r)} \right|. \tag{16} \]
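Eq. (15) is directly computable for a finite class: it is the log-count of the distinct behaviours on the sample whose empirical error does not exceed r. A minimal sketch, again with illustrative, made-up predictions rather than data from the paper:

```python
import math

# Sample labels and the behaviours of a hypothetical class on that sample.
Y = [1, 1, -1, -1, 1, -1]
n = len(Y)
F = [
    [1, 1, -1, -1, 1, -1],    # 0 errors
    [1, 1, -1, -1, 1, 1],     # 1 error
    [1, -1, -1, -1, -1, -1],  # 2 errors
    [-1, -1, 1, 1, -1, 1],    # 6 errors
]

def emp_error(f):
    """Empirical error of Eq. (1), with the hard loss."""
    return sum(fi != yi for fi, yi in zip(f, Y)) / n

def local_vc_entropy(r):
    """Empirical Local VC-Entropy of Eq. (15): the log-cardinality of the
    behaviours of F on the sample with empirical error at most r."""
    restricted = {tuple(f) for f in F if emp_error(f) <= r}
    return math.log(len(restricted))
```

As r grows from 0 to 1 the localized set inflates from the empirical-risk minimizers to the whole of $F_{D_n}$, so the Local VC-Entropy increases monotonically towards the global VC-Entropy of Eq. (4).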

As we will show in the next section, it is possible to derive a fully empirical generalization bound on the true error of a classifier, based on this complexity. Moreover, we will show that the Local VC-Entropy and the Local Rademacher Complexity are tightly related, analogously to their global versions (Anguita et al., 2014).

3. A local Vapnik–Chervonenkis complexity based bound

In order to present our main contribution we need some preliminary results. Some of them are already available in the literature and are reported in Appendix A, while others are new and are derived here; the proofs are reported in Appendix B. The first theorem allows us to bound the generalization error of a function based on a property of the entire class. This is a technical result that allows us to normalize the distance between the true and the empirical error of the functions.

Theorem 3.1. It is possible to upper-bound the generalization error of an f ∈ F as follows:

\[ L(f) \le \min_{K \in (1, \infty)} \frac{K}{K-1} \left[ \hat{L}_n(f) + \frac{r}{K} \right], \tag{17} \]
\[ \text{s.t.} \quad \sup_{\alpha \in (0, 1]} \alpha \sup_{f \in \{f :\ f \in F,\ L(f) \le r/\alpha\}} \left[ L(f) - \hat{L}_n(f) \right] \le \frac{r}{K}, \qquad r > 0. \]

The following result shows that also the Annealed VC-Entropy is concentrated around its expected value.

Lemma 3.2. The Annealed VC-Entropy is concentrated around its expected value: the following inequality holds with probability at least $(1 - e^{-x})$:

\[ A_{2n}(F) \le 8 \hat{A}_n(F) + 16x = 8 \hat{H}_n(F) + 16x. \tag{18} \]

The Annealed VC-Entropy is the milestone of Vapnik's results (Vapnik & Kotz, 1982), since it allows bounding the generalization error given the empirical one (and vice versa). Unfortunately, the resulting bound is not computable; thanks to the previous lemma, however, we can derive a computable version of it and encapsulate the localization principle. The original Vapnik generalization bound is reported in Theorem A.3, while the following corollary gives its fully empirical version.

Corollary 3.3. Given a space of functions F and a dataset $D_n$, the following inequality holds with probability at least $(1 - 4e^{-x})$:

\[ \sup_{f \in F} \left[ L(f) - \hat{L}_n(f) \right] \le 6 \sqrt{\frac{\hat{H}_n(F) + 2x}{n}}\ \sqrt{\sup_{f \in F} L(f)}. \tag{19} \]

Moreover, with probability at least $(1 - 4e^{-x})$, we can state ∀f ∈ F that:

\[ \frac{L(f) - \hat{L}_n(f)}{\sqrt{L(f)}} \le 6 \sqrt{\frac{\hat{H}_n(F) + 2x}{n}}, \tag{20} \]
\[ \hat{L}_n(f) \le L(f) + 3 \sqrt{\frac{\hat{H}_n(F) + 2x}{n}}. \tag{21} \]

The bounds proposed in Corollary 3.3 are, to the best knowledge of the authors, new, even if some attempts to derive this result can be found in Boucheron et al. (2000) and Massart (2000). In order to obtain the main result of this section we still need one more lemma.

Lemma 3.4. With probability at least $(1 - 4e^{-x})$, the class of functions with small generalization error is concentrated around the ones with small empirical error:

\[ \{f :\ f \in F,\ L(f) \le r\} \subseteq \left\{ f :\ f \in F,\ \hat{L}_n(f) \le r + 3 \sqrt{\frac{\hat{H}_n(\{f :\ f \in F,\ L(f) \le r\}) + 2x}{n}} \right\}. \tag{22} \]

The proof of Lemma 3.4 is straightforward, since it is a simple application of the inequality of Eq. (21) of Corollary 3.3. Finally, we can state the main result of this section.

Theorem 3.5. Given a function f ∈ F chosen according to $D_n$, we can state, with probability at least $(1 - 9e^{-x})$, that the following generalization bound holds:

\[ L(f) \le \min_{K \in (1, \infty)} \frac{K}{K-1} \left[ \hat{L}_n(f) + \frac{r}{K} \right], \tag{23} \]
\[ \text{s.t.} \quad \sup_{\alpha \in (0, 1]} 6 \sqrt{\frac{r \alpha\, (T(r, \alpha) + 2x)}{n}} \le \frac{r}{K}, \qquad r > 0, \]

where

\[ T(r, \alpha) = \hat{H}_n\!\left( \left\{ f :\ f \in F,\ L(f) \le \tfrac{r}{\alpha} \right\} \right) \le \widehat{LH}_n\!\left( F,\ \tfrac{r}{\alpha} + 3 \sqrt{\tfrac{T(r, \alpha) + 2x}{n}} \right). \]

Theorem 3.5 can be simplified when F contains at least one function which perfectly fits the available data.

Corollary 3.6. If $\exists f^* \in F :\ \hat{L}_n(f^*) = 0$, we can state that, with probability at least $(1 - 9e^{-x})$, the following generalization bound holds:

\[ L(f^*) \le r, \tag{24} \]
\[ \text{s.t.} \quad \sup_{\alpha \in (0, 1]} 6 \sqrt{\frac{r \alpha\, (T(r, \alpha) + 2x)}{n}} \le r, \qquad r > 0, \]

where $T(r, \alpha) \le \widehat{LH}_n\!\left( F,\ \tfrac{r}{\alpha} + 3 \sqrt{\tfrac{T(r, \alpha) + 2x}{n}} \right)$.

The proof is straightforward and consists in observing that, in Theorem 3.5, if $\exists f^* \in F :\ \hat{L}_n(f^*) = 0$, then K → 1.

3.1. Understanding Theorem 3.5: a toy example

In this section we apply Theorem 3.5 to a toy example, in order to stress the importance of the localization effect. In particular, we show that the bound proposed in Theorem 3.5 takes into account only functions with small generalization error and that, as the cardinality of the training set increases, the number of functions discarded by the bound rapidly grows. Let us suppose that P(X) is a uniform distribution over a circle in two dimensions. In polar coordinates (ρ, θ), we define P(Y|X) as follows:

\[ P(+1|X) = 1 \ \text{if}\ \theta \in \left[ -\tfrac{\pi}{2}, \tfrac{\pi}{2} \right), \ \text{else}\ P(+1|X) = 0, \tag{25} \]
\[ P(-1|X) = 1 \ \text{if}\ \theta \in \left[ \tfrac{\pi}{2}, \tfrac{3\pi}{2} \right), \ \text{else}\ P(-1|X) = 0. \tag{26} \]


The hypothesis space F is composed of the sheaf of straight lines that pass through the center of the circle. An example of $D_n$ with n = 4, together with the set $F_{D_n}$, is depicted in Fig. 1; for simplicity, let us suppose that n is an even number.

Fig. 1. Toy example: a possible $D_n$ with n = 4 and with the associated $F_{D_n}$.

We have to compute $|F_{D_n}|$ and $|\hat{F}_{(D_n, r)}|$ and plug them into Corollary 3.3 and Theorem 3.5, in order to compare the state-of-the-art global bound with our proposal. Let us consider the worst-case scenario, in which $|F_{D_n}|$ attains the largest possible value, namely when the samples in $D_n$ are placed in general position (Cover, 1965; Klęsk & Korzeń, 2011). It is easy to note that

\[ \left| F_{D_n} \right| \le 2n. \tag{27} \]

Note that this is an upper bound, since if, for example, two points fall in the same position, or at exactly opposite sides of the circle, this number decreases. Let us note also that:

\[ \left| \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ n \hat{L}_n(f) = 0 \right\} \right| = 1, \]
\[ \left| \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ n \hat{L}_n(f) = 1 \right\} \right| = 2, \]
\[ \vdots \]
\[ \left| \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ n \hat{L}_n(f) = \tfrac{n}{2} \right\} \right| = 2, \]
\[ \vdots \]
\[ \left| \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ n \hat{L}_n(f) = n - 1 \right\} \right| = 2, \]
\[ \left| \left\{ \{f(X_1), \ldots, f(X_n)\} : f \in F,\ n \hat{L}_n(f) = n \right\} \right| = 1. \tag{28} \]

Consequently, it is possible to state that:

\[ \left| \hat{F}_{(D_n, r)} \right| \le \min \left[ 2nr + 1,\ 2n \right] \le 2nr + 1. \tag{29} \]

Based on these observations we can show how to apply the results of Corollary 3.3 and Theorem 3.5 to this toy problem, but first we need one additional lemma.

Lemma 3.7. Let us consider a, b, d ∈ [0, ∞) and c ∈ [1, ∞). Then the function

\[ a \alpha \ln \left( \frac{b}{\alpha} + c \right) + d \]

monotonically increases with α ∈ (0, 1].

By construction, there is at least one classifier that perfectly classifies the dataset:

\[ \exists f^* \in F :\ \hat{L}_n(f^*) = 0. \tag{30} \]

Therefore, according to Corollary 3.3, we can state that, with probability at least $(1 - 4e^{-x})$:

\[ \frac{L(f^*) - \hat{L}_n(f^*)}{\sqrt{L(f^*)}} \le 6 \sqrt{\frac{\hat{H}_n(F) + 2x}{n}}. \tag{31} \]

Finally, by exploiting Eqs. (4), (27), (30) and (31), we have that, with probability at least $(1 - 4e^{-x})$, the following inequality holds:

\[ L(f^*) \le 36\, \frac{\ln(2n) + 2x}{n}. \tag{32} \]

Let us now consider Corollary 3.6 and Eq. (28). With probability at least $(1 - 9e^{-x})$, the following inequality holds:

\[ L(f^*) \le r, \tag{33} \]

if the following inequality is satisfied:

\[ \sup_{\alpha \in (0, 1]} 6 \sqrt{\frac{r \alpha\, (T(r, \alpha) + 2x)}{n}} \le r, \qquad r > 0, \tag{34} \]

where, by also exploiting Eq. (28) and Theorem 3.5:

\[ T(r, \alpha) \le \widehat{LH}_n\!\left( F,\ \frac{r}{\alpha} + 3 \sqrt{\frac{T(r, \alpha) + 2x}{n}} \right) \le \ln \left( \min \left[ 2n \left( \frac{r}{\alpha} + 3 \sqrt{\frac{T(r, \alpha) + 2x}{n}} \right) + 1,\ 2n \right] \right). \tag{35} \]

Because of Eqs. (27) and (28) we have that T(r, α) ≤ ln(2n), and then we obtain:

\[ T(r, \alpha) \le \ln \left( 2n \left( \frac{r}{\alpha} + 3 \sqrt{\frac{\ln(2n) + 2x}{n}} \right) + 1 \right). \tag{36} \]

By substituting it in Eq. (34) we have:

\[ \sup_{\alpha \in (0, 1]} 6 \sqrt{\frac{r \alpha\, (T(r, \alpha) + 2x)}{n}} \le \sup_{\alpha \in (0, 1]} 6 \sqrt{\frac{r \alpha \left[ \ln \left( 2n \left( \frac{r}{\alpha} + 3 \sqrt{\frac{\ln(2n) + 2x}{n}} \right) + 1 \right) + 2x \right]}{n}} = 6 \sqrt{\frac{r \left[ \ln \left( 2n \left( r + 3 \sqrt{\frac{\ln(2n) + 2x}{n}} \right) + 1 \right) + 2x \right]}{n}}, \tag{37} \]

thanks to Lemma 3.7, since the sup is attained at α = 1. This last term must be less than or equal to r/K which, in the setting of Corollary 3.6, reduces to r. Consequently, we can state that, with probability at least $(1 - 9e^{-x})$:

\[ L(f^*) \le r, \tag{38} \]

if the inequality of Eq. (34) is satisfied, namely if

\[ 6 \sqrt{\frac{r \left[ \ln \left( 2n \left( r + 3 \sqrt{\frac{\ln(2n) + 2x}{n}} \right) + 1 \right) + 2x \right]}{n}} \le r, \qquad r > 0. \tag{39} \]

Let $r^*$ be the smallest r that satisfies the inequality of Eq. (39); then, with probability at least $(1 - 9e^{-x})$:

\[ L(f^*) \le r^*. \tag{40} \]

The number of functions that are taken into account by the new bound of Eq. (40) is

\[ \exp \left[ T(r^*, 1) \right] = 2n \left( r^* + 3 \sqrt{\frac{\ln(2n) + 2x}{n}} \right) + 1, \tag{41} \]

while in the case of Vapnik's result (Eq. (32)) it is

\[ \exp \left[ \hat{H}_n(F) \right] = 2n. \tag{42} \]

Fig. 2 shows the comparison between these two values (Eqs. (41) and (42)) for different values of n and with 95% confidence level. As expected, the number of functions taken into account by the bound of Theorem 3.5 is remarkably smaller than the one of Corollary 3.3. The result is even more evident in Fig. 3, where the ratio between Eqs. (41) and (42) is reported. As n increases, the number of functions taken into account by the bound which exploits the Local VC-Entropy decreases exponentially with respect to the total number of functions belonging to F.

Fig. 2. Eqs. (41) and (42) for different values of n and with 95% confidence level.

Fig. 3. Ratio between Eqs. (41) and (42) for different values of n and with 95% confidence level.

3.2. Generalizing the toy example

The result of this toy example is just a particular case of a more general result, which we report in the next corollary. In particular, it is possible to prove that, when the cardinality of the localized set of functions defined in Eq. (13) grows linearly with the constraint over the empirical error defined by r, the number of functions taken into account by the Local VC-Entropy based bound is smaller than the one taken into account by the Vapnik result based on the Global VC-Entropy, as n increases. Note that the toy example of Section 3.1 falls within this more general result.

Corollary 3.8. Under the same hypothesis of Corollary 3.6, if:

\[ \left| \hat{F}_{(D_n, r)} \right| \le a n^p r + b, \tag{43} \]

where a, p ∈ [0, M] and b ∈ [1, M], with M positive and finite, are constant values (note that b = 1, since there is only one function f ∈ F which gives $\hat{L}_n(f) = 0$ or, in other words, $|\hat{F}_{(D_n, 0)}| = 1$ (Vapnik, 1998)), then the number of functions f ∈ F taken into account by the Local VC-Entropy based bound of Corollary 3.6 is smaller than the one taken into account by the Vapnik result based on the Global VC-Entropy of Corollary 3.3, when n is large enough.

In the next corollary we present the general case in which $\nexists f^* \in F :\ \hat{L}_n(f^*) = 0$.

Corollary 3.9. Under the same hypothesis of Theorem 3.5, if:

\[ \left| \hat{F}_{(D_n, r)} \right| \le a n^p \max \left[ 0,\ r - \hat{L}_n(f^*) \right] + b, \tag{44} \]

where $f^*$ is a function in F with its associated empirical error $\hat{L}_n(f^*) \in [0, 1]$, and a, p ∈ [0, M] and b ∈ [1, M], with M positive and finite, are constant values (note that b ≥ 1, since $f^*$ exists and gives $\hat{L}_n(f^*)$), then the number of functions f ∈ F taken into account by the Local VC-Entropy based bound of Theorem 3.5 is smaller than the one taken into account by the Vapnik result based on the Global VC-Entropy of Corollary 3.3, when n is large enough.

Finally, note that in all the results presented in this section we have not optimized the constants K, α, or $r^*$, in order to present a closed-form solution, which is more intelligible for the reader.
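The fixed point $r^*$ of Eq. (39) and the function counts of Eqs. (41) and (42) can be computed numerically. The sketch below is only an illustration of the inequalities as reconstructed above, not code from the paper; the choices n = 10^5 and x = 3 are arbitrary:

```python
import math

n = 100_000   # sample size (arbitrary, for illustration)
x = 3.0       # confidence parameter (arbitrary)

def lhs(r):
    """Left-hand side of the inequality of Eq. (39) for the toy example."""
    inner = 2 * n * (r + 3 * math.sqrt((math.log(2 * n) + 2 * x) / n)) + 1
    return 6 * math.sqrt(r * (math.log(inner) + 2 * x) / n)

# Smallest r on a fine grid satisfying Eq. (39): lhs(r) <= r.
r_star = next(r for r in (k / 10 ** 6 for k in range(1, 10 ** 6 + 1))
              if lhs(r) <= r)

# Number of functions kept by the local bound, Eq. (41) ...
kept_local = 2 * n * (r_star + 3 * math.sqrt((math.log(2 * n) + 2 * x) / n)) + 1
# ... versus the global Vapnik count of Eq. (42).
kept_global = 2 * n
```

For this setting $r^*$ is of the order of $5 \cdot 10^{-3}$, so the localized bound retains only a few percent of the 2n behaviours counted by the global bound, reproducing the trend shown in Figs. 2 and 3.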

$\phi(\cdot)$ and $\bar{\phi}(\cdot)$ (with inverses $\phi^{-1}(\cdot)$ and $\bar{\phi}^{-1}(\cdot)$) such that:

\[ \phi\!\left( \hat{H}_n(F) \right) \le \hat{R}_n(F) \le \bar{\phi}\!\left( \hat{H}_n(F) \right), \tag{45} \]
\[ \bar{\phi}^{-1}\!\left( \hat{R}_n(F) \right) \le \hat{H}_n(F) \le \phi^{-1}\!\left( \hat{R}_n(F) \right). \tag{46} \]

These lower and upper bounds can be obtained with different methods: $\bar{\phi}(\cdot)$ can be derived by exploiting Massart's Lemma (Massart, 2000) and its improvements (Anguita et al., 2014), while $\phi(\cdot)$ can be obtained by exploiting some results relating the Rademacher Complexity, the VC-Dimension, the Fat-Shattering dimension and the VC-Entropy (Anguita et al., 2014; Duan, 2012; Sauer, 1972; Srebro et al., 2010). However, a different approach to the problem has also been shown in Anguita et al. (2014), leading to the tightest lower and upper bounds of the Rademacher Complexity based on the VC-Entropy (and vice versa). The result has been derived by exploiting some geometrical properties of the problem, which allow finding the best and the worst possible geometrical configurations and, consequently, obtaining the best formulations of $\phi(\cdot)$ and $\bar{\phi}(\cdot)$ (and, consequently, of $\phi^{-1}(\cdot)$ and $\bar{\phi}^{-1}(\cdot)$).

As a matter of fact, the results shown in Anguita et al. (2014) can be directly applied in order to relate also the Local Rademacher Complexity and the Local VC-Entropy. A new hypothesis space $G^r$ must be defined, such that:

\[ G^r = \left\{ f : f \in F,\ \hat{L}_n(f) \le r \right\}, \tag{47} \]
\[ G^r_{|D_n} = \left\{ \{g(X_1), \ldots, g(X_n)\} : g \in G^r \right\}. \tag{48} \]


Then, it is possible to highlight that:

\[ \hat{H}_n(G^r) = \widehat{LH}_n(F, r), \qquad \hat{R}_n(G^r) = \widehat{LR}_n(F, r), \tag{49} \]

and, consequently, to obtain that:

\[ \phi\!\left( \widehat{LH}_n(F, r) \right) \le \widehat{LR}_n(F, r) \le \bar{\phi}\!\left( \widehat{LH}_n(F, r) \right), \tag{50} \]
\[ \bar{\phi}^{-1}\!\left( \widehat{LR}_n(F, r) \right) \le \widehat{LH}_n(F, r) \le \phi^{-1}\!\left( \widehat{LR}_n(F, r) \right), \tag{51} \]

where only the results of Anguita et al. (2014) have been used. As we will show in the next section, the result reported in Eqs. (50) and (51) can be remarkably improved by taking into account the localization effect.

4.1. The geometrical framework and the effects of localization

Let us define the following sets, where $\sigma = \{\sigma_1, \ldots, \sigma_n\}$ (Anguita et al., 2014):

\[ S = \left\{ \sigma : \sigma_i \in \{\pm 1\} \right\}, \tag{52} \]
\[ S^{i}_{D_n} = \left\{ \sigma : \min_{f \in F} \sum_{j=1}^{n} I(f(X_j), \sigma_j) = i,\ \sigma \in S \right\}. \tag{53} \]

In other words, S is the set of all the $2^n$ possible configurations of σ, while $S^{i}_{D_n}$ is the set of σ that can be classified by F, given $D_n$, with at best i errors. For these two sets, the following properties hold:

\[ \bigcup_{i=0}^{n} S^{i}_{D_n} = S, \qquad |S| = 2^n, \tag{54} \]
\[ S^{i}_{D_n} \cap S^{j}_{D_n} = \begin{cases} \emptyset & \text{if } i \ne j \\ S^{i}_{D_n} & \text{if } i = j. \end{cases} \tag{55} \]

Consequently, both the VC-Entropy and the Rademacher Complexity can be rewritten in terms of these sets (Anguita et al., 2014):

\[ \hat{H}_n(F) = \ln \left| S^{0}_{D_n} \right|, \tag{56} \]
\[ \hat{R}_n(F) = 1 - \frac{2}{2^n n} \sum_{i=0}^{n} i \left| S^{i}_{D_n} \right|. \tag{57} \]

Now, let us generalize these sets in order to take into account the locality constraint (here $y = \{Y_1, \ldots, Y_n\}$ denotes the vector of the labels of $D_n$, so that $\hat{L}_n(f) \le r$ is equivalent to $\sum_{i=1}^{n} I(f(X_i), y_i) \le nr$):

\[ S^{r,i}_{D_n} = \left\{ \sigma : \min_{f \in F:\ \hat{L}_n(f) \le r} \sum_{j=1}^{n} I(f(X_j), \sigma_j) = i,\ \sigma \in S \right\}. \tag{58} \]

Note that, in the considered case, $r \in \{0, \tfrac{1}{n}, \tfrac{2}{n}, \ldots, 1\}$, while $\hat{H}_n(F) = \ln |S^{1,0}_{D_n}|$ and $\hat{R}_n(F) = 1 - \tfrac{2}{2^n n} \sum_{i=0}^{n} i |S^{1,i}_{D_n}|$, i.e. the global complexities are recovered for r = 1.

Analogously to Eqs. (56) and (57), it is also possible to reformulate the Local VC-Entropy and the Local Rademacher Complexity in terms of these sets:

\[ \widehat{LH}_n(F, r) = \ln \left| S^{r,0}_{D_n} \right|, \tag{59} \]
\[ \widehat{LR}_n(F, r) = 1 - \frac{2}{2^n n} \sum_{i=0}^{n} i \left| S^{r,i}_{D_n} \right|. \tag{60} \]

Then, the Local VC-Entropy can be upper bounded:

\[ \widehat{LH}_n(F, r) = \ln \left| \left\{ \sigma : \min_{f \in F} \sum_{j=1}^{n} I(f(X_j), \sigma_j) = 0,\ \sigma \in S,\ \sum_{i=1}^{n} I(f(X_i), y_i) \le nr \right\} \right| = \ln \left| \left\{ \sigma : \min_{f \in F} \sum_{j=1}^{n} I(f(X_j), \sigma_j) = 0,\ \sigma \in S,\ \sum_{i=1}^{n} I(\sigma_i, y_i) \le nr \right\} \right| \le \ln \sum_{i=0}^{nr} \binom{n}{i}. \tag{61} \]

In other words:

• the Local VC-Entropy counts the number of functions in F, given $D_n$, which perfectly classify at least one of the possible configurations σ ∈ S having Hamming distance from y lower than or equal to nr;
• the Local Rademacher Complexity computes the average of the minimum number of errors performed by the functions in F, given $D_n$, which have empirical error lower than or equal to r; in this case, all the $2^n$ possible configurations σ ∈ S are contemplated.

In other words, analogously to Anguita et al. (2014), the Local Rademacher Complexity takes into account the distribution of $|S^{r,i}_{D_n}|$ as i is varied, while the Local VC-Entropy only contemplates $|S^{r,0}_{D_n}|$.

In order to obtain $\phi$ and $\bar{\phi}$ for the localized bounds, let us consider Eq. (60). Given $S^{r,0}_{D_n}$, the distribution of the $|S^{r,i}_{D_n}|$ which maximizes (or minimizes) the Rademacher Complexity must be searched. For this purpose, σ ∈ S can be interpreted as the vertices of an n-dimensional hypercube. Thus, the set $S^{r,0}_{D_n}$ corresponds to the subset of vertices of this hypercube that can be represented by the functions in F, given $D_n$. These vertices must be concentrated around the vertex y: all the vertices at Hamming distance greater than nr from y cannot belong to $S^{r,0}_{D_n}$. Consequently, the set of vertices at Hamming distance lower than or equal to nr from y is:

\[ S^{r}_{tot} = \left\{ \sigma : \sigma \in S,\ \sum_{i=1}^{n} I(\sigma_i, y_i) \le nr \right\}. \tag{62} \]

Moreover, the minimum number of errors performed by the functions in F on a particular σ′ ∈ S corresponds to the minimum Hamming distance of σ′ from the vertices $\sigma^0 \in S^{r,0}_{D_n}$. Thanks to this interpretation, the Local Rademacher Complexity of Eq. (60) can be reformulated as follows:

\[ \widehat{LR}_n(F, r) = 1 - \frac{2}{2^n n} \sum_{\sigma \in S}\ \min_{\sigma^0 \in S^{r,0}_{D_n}} d\!\left( \sigma^0, \sigma \right), \tag{63} \]

where d(·, ·) denotes the Hamming distance.
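The reformulations above are easy to check by brute force on a tiny class. The sketch below uses a hypothetical set of behaviours on n = 3 points (illustrative values, not data from the paper) and confirms that the direct definition of the Rademacher Complexity, the set-based form of Eq. (57), and the min-Hamming-distance form of Eq. (63) with r = 1 all coincide:

```python
import itertools

# Behaviours of a hypothetical class on n = 3 points (illustrative values).
F = [(1, 1, 1), (1, -1, -1), (-1, 1, -1)]
n = 3
sigmas = list(itertools.product([-1, 1], repeat=n))

def hamming(u, v):
    return sum(ui != vi for ui, vi in zip(u, v))

# Direct definition: E_sigma sup_f (1/n) sum_i sigma_i f(X_i).
rad_direct = sum(max(sum(si * fi for si, fi in zip(s, f)) / n for f in F)
                 for s in sigmas) / 2 ** n

# Eq. (57): via the cardinalities |S^i| of Eq. (53)
# (minimum achievable number of errors, per configuration sigma).
S = [0] * (n + 1)
for s in sigmas:
    S[min(hamming(f, s) for f in F)] += 1
rad_sets = 1 - 2 / (2 ** n * n) * sum(i * S[i] for i in range(n + 1))

# Eq. (63) with r = 1: S^{1,0} is the set of distinct behaviours of F, and
# the complexity is one minus a scaled average min Hamming distance to it.
S_10 = set(F)
rad_geom = 1 - 2 / (2 ** n * n) * sum(min(hamming(v, s) for v in S_10)
                                      for s in sigmas)
```

All three quantities agree exactly, since for every σ the best agreement $(1/n)\sum_i \sigma_i f(X_i)$ equals one minus twice the minimum fraction of errors.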

Let us consider the following two extreme cases for the vertices:

1. the $\sigma^0 \in S^{r,0}_{D_n}$ are clustered inside the portion of the hypercube delimited by the vertices $S^{r}_{tot}$, i.e. they are close to each other (in terms of Hamming distance);
2. the $\sigma^0 \in S^{r,0}_{D_n}$ are scattered inside the portion of the hypercube delimited by the vertices $S^{r}_{tot}$, i.e. they are far from each other.


In the clustered case, $\widehat{LR}_n(F, r)$ will be smaller than in the scattered case. In fact, the more the vertices $\sigma^0 \in S^{r,0}_{D_n}$ are scattered, the smaller the value of $\min_{\sigma^0 \in S^{r,0}_{D_n}} d(\sigma^0, \sigma)$, with σ ∈ S, is likely to become (so $\widehat{LR}_n(F, r)$ increases). As a consequence, in order to find $\bar{\phi}$, we have to minimize the sum in Eq. (63) by searching for the set $S^{r,0}_{min}$ and the configuration of its vertices $\sigma^0 \in S^{r,0}_{min}$ that satisfy the following properties:

• the cardinalities of $S^{r,0}_{min}$ and $S^{r,0}_{D_n}$ must be the same, i.e. $|S^{r,0}_{min}| = |S^{r,0}_{D_n}|$;
• the vertices $\sigma^0 \in S^{r,0}_{min}$ must be as scattered as possible inside the portion of the hypercube delimited by the vertices $S^{r}_{tot}$.

Consequently, the problem of finding $\bar{\phi}$ can be reformulated as follows:

\[ \bar{\phi} = 1 - \frac{2}{2^n n} \min_{S^{r,0}_{min} \subseteq S^{r}_{tot}} \sum_{\sigma \in S}\ \min_{\sigma^0 \in S^{r,0}_{min}} d\!\left( \sigma^0, \sigma \right), \qquad \text{s.t.} \quad \left| S^{r,0}_{min} \right| = \left| S^{r,0}_{D_n} \right|. \tag{64} \]

In other words, the worst possible configuration of vertices $\sigma^0 \in S^{r,0}_{min}$ inside the portion of the hypercube delimited by the vertices $S^{r}_{tot}$ must be considered, independently of both the probability distribution P and the data $D_n$. An analogous idea can be exploited for finding $\phi$: the set $S^{r,0}_{max}$ and the configuration of vertices $\sigma^0 \in S^{r,0}_{max}$ must be identified which allow maximizing the sum in Eq. (63) and, thus, minimizing $\widehat{LR}_n(F, r)$:

\[ \phi = 1 - \frac{2}{2^n n} \max_{S^{r,0}_{max} \subseteq S^{r}_{tot}} \sum_{\sigma \in S}\ \min_{\sigma^0 \in S^{r,0}_{max}} d\!\left( \sigma^0, \sigma \right), \qquad \text{s.t.} \quad \left| S^{r,0}_{max} \right| = \left| S^{r,0}_{D_n} \right|. \tag{65} \]

Note that, for the sake of simplicity, we can safely consider that all the vertices in $S^{r}_{tot}$ are at Hamming distance lower than or equal to nr from y = 0, because of the symmetry properties of the hypercube.

4.2. The upper bound $\bar{\phi}$

The upper bound $\bar{\phi}$ can be found by identifying the set $S^{r,0}_{min}$ that solves Problem (64). In general, while admitting several solutions due to the geometry of the problem, the closed-form expression for $S^{r,0}_{min}$ cannot be easily derived. However, we can explicitly compute the upper bound in two cases, namely when $|S^{r,0}_{D_n}|$ is minimum ($|S^{r,0}_{D_n}| = 1$) and when $|S^{r,0}_{D_n}|$ is maximum ($|S^{r,0}_{D_n}| = \sum_{i=0}^{nr} \binom{n}{i}$). When $|S^{r,0}_{D_n}| = 1$, the following result is derived:

\[ \bar{\phi} = 1 - \frac{2}{2^n n} \sum_{i=1}^{n} i \binom{n}{i}, \tag{66} \]

since only one vertex can be represented with no errors, and the $\binom{n}{i}$ vertices at distance i from it are represented with i errors. Instead, when $|S^{r,0}_{D_n}| = \sum_{i=0}^{nr} \binom{n}{i}$:

\[ \bar{\phi} = 1 - \frac{2}{2^n n} \sum_{i=nr+1}^{n} (i - nr) \binom{n}{i}, \tag{67} \]

since all the vertices in $S^{r}_{tot}$ can be represented with no errors, while the vertices at distance i from $S^{r}_{tot}$ are represented with i errors. In the general case, the $|S^{r,0}_{D_n}|$ available vertices must be scattered inside $S^{r}_{tot}$ so that each of them covers a Hamming ball which is as large as possible:

\[ \sum_{i=0}^{nr} \binom{n}{i} = \left| S^{r,0}_{D_n} \right| \sum_{i=0}^{n_{min}} \binom{n}{i} + q. \tag{68} \]

Since the q configurations $\sigma \in S^{r}_{tot}$, characterized by distance $n_{min} + 1$ from more than one configuration of the vertices $\sigma^0 \in S^{r,0}_{min}$, should not be counted twice, $n_{min} \in \{0, \ldots, n\}$ can be computed by solving the following problem:

\[ \max n_{min} :\quad \left| S^{r,0}_{D_n} \right| \sum_{i=0}^{n_{min}} \binom{n}{i} \le \sum_{i=0}^{nr} \binom{n}{i}. \tag{69} \]

For the remaining vertices, not included in $S^{r}_{tot}$ but belonging to S, the same considerations as above can be carried out: the $\binom{n}{nr+i}$ vertices at distance i from $S^{r}_{tot}$ are represented, at least, with i errors. Obviously, this solution is not the best possible one, leaving some space for future improvements: in fact, some of the vertices at distance i from $S^{r}_{tot}$ can be represented with more than i errors. Then, the solution of Eq. (64) becomes:

\[ \bar{\phi} = 1 - \frac{2}{2^n n} \left[ \left| S^{r,0}_{D_n} \right| \sum_{i=1}^{n_{min}} i \binom{n}{i} + \sum_{i=nr+1}^{n} (i - nr) \binom{n}{i} \right], \tag{70} \]

where $n_{min}$ is found by solving Problem (69). Note that, when r = 1 (namely, when the Local Rademacher Complexity and the Global Rademacher Complexity coincide), the same result derived in Anguita et al. (2014) is obtained (see Eq. (37) in Anguita et al., 2014).

4.3. The lower bound $\phi$

A procedure analogous to the one described in Section 4.2 can be exploited for $\phi$: in this case, the set $S^{r,0}_{max} \subseteq S^{r}_{tot}$ with the most clustered configuration of vertices must be found, in which the distance among all the vertices $\sigma^0 \in S^{r,0}_{max}$ is minimized. Again, the solution to this problem is not unique (Anguita et al., 2014). Since the most clustered configuration of the vertices must be found, and $S^{r}_{tot}$ is already clustered around y, it can be indifferently either $S^{r,0}_{max} \subseteq S^{r}_{tot}$ or $S^{r,0}_{max} \subseteq S$ (as in Anguita et al., 2014): thus, the lower bound coincides with the one in Anguita et al. (2014). Then, the following problem must be solved:

\[ \max n_{max}, n^{1}_{max}, \ldots, n^{n}_{max} :\quad \left| S^{r,0}_{D_n} \right| = \sum_{i=0}^{n_{max}-1} \binom{n}{i} + \cdots \tag{71} \]

 

i

i=1

+ r (nmin + 1) +

 r ,0   r ,0  S  = S  . Dn max

s.t.

i

i =0

  min d σ 0 , σ , r ,0

r ,0

nr     nmin      nmin +1   n  r ,0   n  r ,0   n ≤ < SD . SDn   n

   r ,0  s.t. q < SDn 

r ,0

r ,0

increasing the distance i. At some point, a value of distance nmin will be such that:

i=0

min d σ 0 , σ ,



r ,0

In the general case, the approach is tricky. The vertices Smin in r Stot must be arranged to be as scattered as possible. In order to nr n r contemplate all the i=1 i in Stot vertices, it is possible to start 0 from the ones at distance i = 0 from σ 0 ∈ Smin and proceed by

u1

i1 =l1 e1

k

i1 =1 n1 max

u2

i2 =l2 e2

···

i2 =1 n2 max

un

in =L en

nmax −1 j=1 

ij

1,

in =1 nmax nmax

means that the summations are

stopped when either is = us , s = 1, . . . , n, or i1 = e1 ∧ i2 = e2 ∧ · · · ∧ in = en . Note that es ≤ us , s = 1, . . . , n, in general.
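As a numerical sanity check, the two closed-form cases of Eqs. (66) and (67) can be evaluated with binomial coefficients alone. The sketch below is ours, not code from the paper; for |S_{D_n}^{r,0}| = 1 the identity \sum_i i \binom{n}{i} = n 2^{n-1} makes \overline{\phi} vanish, as expected for a single representable vertex.

```python
from math import comb

def phi_bar_single(n):
    # Eq. (66): a single vertex is representable with no errors;
    # the C(n, i) vertices at Hamming distance i each cost i errors.
    return 1 - (2 / (2**n * n)) * sum(i * comb(n, i) for i in range(1, n + 1))

def phi_bar_full_ball(n, nr):
    # Eq. (67): the whole Hamming ball of radius nr is representable;
    # a vertex at distance i > nr from the centre costs i - nr errors.
    return 1 - (2 / (2**n * n)) * sum(
        (i - nr) * comb(n, i) for i in range(nr + 1, n + 1)
    )

print(phi_bar_single(10))        # ~0, since sum_i i*C(n,i) = n*2^(n-1)
print(phi_bar_full_ball(10, 2))  # strictly positive once nr > 0
```

Observing how `phi_bar_full_ball` grows with `nr` reproduces, in miniature, the localization effect discussed below: a smaller ball (smaller r) gives a smaller upper bound.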

L. Oneto et al. / Neural Networks 82 (2016) 62–75


Fig. 4. Lower (\underline{\phi}) and upper (\overline{\phi}) bounds varying r; n = 30. (a) Different values of r. (b) Small values of r.

Hence we can derive \underline{\phi}:

\underline{\phi} = 1 - \frac{2}{2^n n} \left[ 2 \sum_{i_1=1}^{n^1_{max}} \cdots \sum_{i_{n_{max}}=1}^{n^{n_{max}}_{max}} \sum_{i=n_{max}}^{\,n - \sum_{j=1}^{n_{max}-1} i_j} \binom{n}{i} (i - n_{max}) - \cdots \right],  (72)

where n_{max}, n^1_{max}, ..., n^{n_{max}}_{max} are found by solving Problem (71).

4.4. How tight are the new bounds?

This brief section is devoted to graphically comparing the results obtained in Sections 4.2 and 4.3. In Figs. 4 and 5 the obtained lower (\underline{\phi}) and upper (\overline{\phi}) bounds are shown for different values of r and n. When r < 1, the localization effect clearly shows up, making the derived bound remarkably tighter than the one obtained in Anguita et al. (2014). Note that, when r = 1, the same results of Anguita et al. (2014) are recovered, where the relation between the global versions of VC-Entropy and Rademacher Complexity was analyzed.

As underlined in Section 4.2, the tightness of the upper bound \overline{\phi} could be improved for small values of r and of the Local VC-Entropy. In fact, if we consider a small modification of Massart's Lemma (Bousquet, 2002; Massart, 2000), it is easy to prove that

\widehat{LR}_n(F, r) \le \overline{\phi}^M = \sqrt{\frac{2 r \widehat{LH}_n(F, r)}{n}}.  (73)

Let us study the Local Rademacher Complexity. For every λ ≥ 0, we have that (Massart, 2000):

\exp\left[\lambda n \widehat{LR}_n(F, r)\right] = \exp\left[\lambda E_\sigma \max_{f \in F_{(D_n, r)}} \sum_{i=1}^{n} \sigma_i f(X_i)\right] \le E_\sigma \max_{f \in F_{(D_n, r)}} \exp\left[\lambda \sum_{i=1}^{n} \sigma_i f(X_i)\right] \le \sum_{f \in F_{(D_n, r)}} \prod_{i=1}^{n} \cosh[\lambda f(X_i)] \le \sum_{f \in F_{(D_n, r)}} \prod_{i=1}^{n} \exp\left[\frac{\lambda^2 f(X_i)^2}{2}\right] \le |F_{(D_n, r)}| \exp\left[\frac{\lambda^2 p^2}{2}\right],  (74)

where we have exploited Hoeffding's Lemma (Hoeffding, 1963) and where p^2 = \max_{f \in F_{(D_n,r)}} \left[f(X_1)^2 + \cdots + f(X_n)^2\right]. Let us substitute \lambda = \sqrt{2 \widehat{LH}_n(F, r)}/p; then we have

\widehat{LR}_n(F, r) \le \frac{p}{n} \sqrt{2 \ln|F_{(D_n, r)}|}.  (75)

It is easy to see that p^2 = nr. In Fig. 6 we report the comparison between \overline{\phi} and \overline{\phi}^M for different r and n: as expected, the upper bound \overline{\phi}^M is smaller than \overline{\phi} for small values of the Local VC-Entropy and of r.
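The Massart-type bound of Eq. (73) is cheap to evaluate once an estimate of the Local VC-Entropy is available. A minimal sketch (the function name is ours, not the paper's):

```python
from math import sqrt, log

def phi_bar_massart(n, r, local_vc_entropy):
    # Eq. (73): upper bound sqrt(2 r LH_n(F, r) / n) on the
    # empirical Local Rademacher Complexity.
    return sqrt(2 * r * local_vc_entropy / n)

# With the maximal entropy ln(2^n) = n ln 2 and r = 1 the bound exceeds 1
# (the trivial bound), so it is informative only when r or the entropy
# is small -- exactly the regime highlighted in the text.
print(phi_bar_massart(100, 1.0, 100 * log(2)))
print(phi_bar_massart(100, 0.1, 20.0))
```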


Fig. 5. Lower (\underline{\phi}) and upper (\overline{\phi}) bounds varying r; n = 100. (a) Different values of r. (b) Small values of r.

5. Computational aspects

The bounds derived in the previous sections also have a major effect on the effort needed for computing the above complexities. In particular, computing the Local Rademacher Complexity or the Local VC-Entropy is a problem where the computational requirements increase exponentially with the number of samples (Amaldi & Kann, 1995; Anguita et al., 2014; Bartlett et al., 2005; Bartlett & Mendelson, 2003; Hoffgen, Simon, & Vanhorn, 1995; Johnson & Preparata, 1978; Vapnik & Kotz, 1982). In this section we will show that, at least in a quite general case, this problem can be circumvented for the Local VC-Entropy, while it is not possible for the Local Rademacher Complexity.

Let us consider a kernel function k : X × X → R, which is a symmetric positive semi-definite function that corresponds to a dot product in a Reproducing Kernel Hilbert Space, i.e. there exists a φ : X → K ⊆ R^D, where K is a Hilbert Space, such that k(X_1, X_2) = ⟨φ(X_1), φ(X_2)⟩ with X_1, X_2 ∈ X. Let us define as F the set of kernelized binary classifiers such that:

f(X) = \operatorname{sign}\left[\sum_{i=1}^{n} \alpha_i k(X_i, X) + b\right], \quad \alpha \in \mathbb{R}^n,\ b \in \mathbb{R}.  (76)

In the following we will show that, for the F defined in Eq. (76), it is possible to estimate the Local VC-Entropy with a procedure whose computational requirements grow polynomially with the number of samples, while this is not possible for the Local Rademacher Complexity.

5.1. Local Rademacher Complexity

The Local Rademacher Complexity requires the solution of the following minimization problem (see Eq. (6)):

\widehat{LR}_n(F) = E_\sigma \sup_{f \in F,\ \hat{L}_n(f) \le r} \left[1 - \frac{2}{n} \sum_{i=1}^{n} I(f(X_i), \sigma_i)\right] = 1 - 2 E_\sigma \inf_{f \in F,\ \hat{L}_n(f) \le r} \frac{1}{n} \sum_{i=1}^{n} I(f(X_i), \sigma_i).  (77)

More explicitly, we have to: (I) compute the minimum number of errors that the class of functions commits on a random realization of the labels σ, and (II) replicate this minimization process 2^n times, in order to compute the expectation with respect to σ. Note that (I) is an NP-hard problem (Amaldi & Kann, 1995; Hoffgen et al., 1995; Johnson & Preparata, 1978) and (II) is computationally intractable. Problem (I) cannot be bypassed even if, in some particular cases, the minimum number of errors can be both upper and lower bounded (Anguita, Ghio, Greco, Oneto, & Ridella, 2010; Anguita, Ghio, Oneto, & Ridella, 2012). Unfortunately, these bounds are often so loose as to make them hardly useful in practice: they are tight only when the number of errors is very small. However, the interesting case is when \widehat{LR}_n(F) is small enough to guarantee good generalization performance (Bartlett, Boucheron, & Lugosi, 2002), that is when E_\sigma \inf_{f \in F,\ \hat{L}_n(f) \le r} \frac{1}{n} \sum_{i=1}^{n} I(f(X_i), \sigma_i) is large.
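For concreteness, the kernelized class of Eq. (76) can be sketched as follows; the RBF kernel and the helper names are illustrative choices of ours, not prescribed by the paper:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def predict(X_train, alpha, b, X):
    # Eq. (76): f(X) = sign(sum_i alpha_i k(X_i, X) + b).
    return np.sign(rbf_kernel(X, X_train) @ alpha + b)

X_train = np.array([[0.0], [1.0]])
alpha, b = np.array([1.0, -1.0]), 0.0
print(predict(X_train, alpha, b, np.array([[0.1], [0.9]])))  # [ 1. -1.]
```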


Fig. 6. Comparison of \overline{\phi} and \overline{\phi}^M for different values of n and r. (a) n = 30. (b) n = 100.

Problem (II), instead, can be circumvented by resorting to a Monte Carlo estimation over the 2^n possible cases. Let us compute:

\widetilde{LR}_n(F) = \frac{1}{k} \sum_{j=1}^{k} \sup_{f \in F,\ \hat{L}_n(f) \le r} \left[1 - \frac{2}{n} \sum_{i=1}^{n} I(f(X_i), \sigma^j_i)\right],  (78)

where 1 ≤ k ≤ 2^n is the number of Monte Carlo trials and {σ^1, ..., σ^k} ⊆ S. The effect of computing \widetilde{LR}_n(F), instead of \widehat{LR}_n(F), can be explicited by noting that the Monte Carlo trials can be modeled as a sampling without replacement from the 2^n possible label configurations. Then, we can apply any bound for the tail of the hypergeometric distribution like, for example, the Serfling's bound (Serfling, 1974), to write:

P\left[\widetilde{LR}_n(F) \ge \widehat{LR}_n(F) + \epsilon\right] \le e^{-\frac{2 k \epsilon^2}{1 - \frac{k-1}{2^n}}}.  (79)

This result shows that \widetilde{LR}_n(F) is tightly concentrated around its mean \widehat{LR}_n(F); then the estimation of \widehat{LR}_n(F) does not depend on 2^n but on k, making Problem (II) computationally tractable. Another possibility is to exploit the concentration inequalities for bounded difference functions (Boucheron, Lugosi, & Massart, 2013; McDiarmid, 1989), which show that we can effectively estimate \widehat{LR}_n(F) through a single realization of σ ∈ S (Bartlett & Mendelson, 2003; Klęsk & Korzeń, 2011). Let us define

\overline{LR}_n(F) = \sup_{f \in F,\ \hat{L}_n(f) \le r} \left[1 - \frac{2}{n} \sum_{i=1}^{n} I(f(X_i), \sigma_i)\right],  (80)

with σ ∈ S. Consequently we can state that (Bartlett & Mendelson, 2003; Klęsk & Korzeń, 2011):

P\left[\overline{LR}_n(F) \ge \widehat{LR}_n(F) + \epsilon\right] \le e^{-\frac{n \epsilon^2}{8}}.  (81)

Thanks to this result, Problem (II) can be completely ignored. Obviously, in practical applications, and especially when n is small, it is better to exploit the approach of Eq. (79) rather than the one of Eq. (81), in order to decrease the variability of the result (Anguita et al., 2012; Bartlett & Mendelson, 2003).

When we consider a Reproducing Kernel Hilbert Space (Aronszajn, 1950; Vapnik, 1998), some authors have proposed to upper-bound the Local Rademacher Complexity of a kernel class by computing the eigenvalues of the Gram matrix. However, this upper bound cannot be used for bounding the generalization error measured with the indicator loss function (Theorem 6.5 in Bartlett et al., 2005; Mendelson, 2002), even though it can be used in other contexts like the MKL problem (Cortes et al., 2013; Kloft & Blanchard, 2011). For what concerns, instead, the binary classification problem faced in this paper, no computationally tractable solution has been proposed in the past. To the best of the authors' knowledge, the only attempt in this direction is Theorem 6.3 in Bartlett et al. (2005), where it has been shown that the Local Rademacher Complexity can be computed whenever weighted empirical risk minimization can be performed. In other words, if there is an efficient algorithm for minimizing a weighted sum of classification errors, then there is an efficient algorithm for computing an upper bound on the localized Rademacher averages.
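The Monte Carlo estimator of Eq. (78) is easy to illustrate when the hypothesis class is a small, explicitly enumerated set of ±1 prediction vectors, so that the inner optimization is a brute-force minimum rather than the NP-hard problem discussed above. A sketch under that simplifying assumption (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 8, 200, 0.25

# Toy finite class: each row is the vector (f(X_1), ..., f(X_n)) of
# +/-1 predictions of one function; y are the observed labels.
# y itself is included so the localized class is never empty.
y = rng.choice([-1, 1], size=n)
F = np.vstack([y, rng.choice([-1, 1], size=(50, n))])

# Localization: keep only the functions with empirical error <= r.
F_loc = F[np.mean(F != y, axis=1) <= r]

# Eq. (78): average, over k random sigma, of the best achievable
# agreement, i.e. 1 - 2 * (minimal fraction of errors on sigma).
sigmas = rng.choice([-1, 1], size=(k, n))
errs = np.mean(F_loc[None, :, :] != sigmas[:, None, :], axis=2)  # (k, m)
lr_tilde = np.mean(1 - 2 * errs.min(axis=1))
print(lr_tilde)
```

Shrinking r shrinks `F_loc` and hence the estimate, which is the localization effect the bound exploits; the Serfling bound of Eq. (79) then quantifies how close `lr_tilde` is to the full expectation over all 2^n label configurations.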


Nevertheless, this problem remains computationally intractable (Bartlett et al., 2005; Hoffgen et al., 1995).

5.2. Local VC-Entropy

When we deal with Reproducing Kernel Hilbert Spaces, contrarily to the Local Rademacher Complexity, the estimation of the Local VC-Entropy can be performed with a procedure where the computational requirements grow polynomially with the number of samples. Let us reformulate Eq. (61) as:

\widehat{LH}_n(F, r) = \ln\left|\left\{\sigma : \min_{f \in F} \sum_{j=1}^{n} I(f_j, \sigma_j) = 0,\ \sigma \in S,\ \sum_{i=1}^{n} I(\sigma_i, y_i) \le nr\right\}\right|.  (82)

Eq. (82) requires counting for how many σ ∈ S the following problem has a solution:

\min_{\alpha, b}\ (\cdot), \quad \text{s.t. } \sigma_i \left[\sum_{j=1}^{n} \alpha_j k(X_j, X_i) + b\right] > 0\ \ \forall i \in \{1, \ldots, n\}, \quad \sum_{i=1}^{n} I(\sigma_i, y_i) \le nr.  (83)

Note that there is no objective function, since we just have to check whether a solution exists or, in other words, whether at least one value of (w, b) exists such that we can perfectly shatter σ when its Hamming distance from y is less than or equal to nr. Note also that the constraint \sum_{i=1}^{n} I(\sigma_i, y_i) \le nr can be checked before starting the minimization process, since only constant quantities are involved. Consequently, Problem (83) becomes a Linear Programming (LP) problem, which can be solved in polynomial time (Boyd & Vandenberghe, 2004). Therefore, in order to compute the Local VC-Entropy of Eq. (82): (I) the polynomial problem of Eq. (83) must be solved for a σ ∈ S; then, (II) this process must be replicated for each σ ∈ S (i.e. 2^n times). But step (II) can again be avoided by bounding the desired quantity, up to the required precision, by exploiting the Serfling's bound applied to \widehat{LH}_n(F, r)/n. By resorting to a Monte Carlo estimation over the 2^n possible cases we can compute:

\widetilde{LH}_n(F, r) = \ln\left|\left\{\sigma : \min_{f \in F} \sum_{j=1}^{n} I(f_j, \sigma_j) = 0,\ \sigma \in \{\sigma^1, \ldots, \sigma^k\},\ \sum_{i=1}^{n} I(\sigma_i, y_i) \le nr\right\}\right|,  (84)

where 1 ≤ k ≤ 2^n is the number of Monte Carlo trials and {σ^1, ..., σ^k} ⊆ S. The effect of computing \widetilde{LH}_n(F, r), instead of \widehat{LH}_n(F, r), can be modeled again as a sampling without replacement from the 2^n possible label configurations. We observe that:

\widehat{LH}_n(F, r) \in [1, \ln(2^n)]\ \rightarrow\ \frac{\widehat{LH}_n(F, r)}{n} \in [1/n, \ln(2)].  (85)

Then, we can apply the Serfling's bound (Serfling, 1974), and write:

P\left[\frac{\widehat{LH}_n(F, r)}{n} \ge \frac{\widetilde{LH}_n(F, r)}{k} + \epsilon\right] \le e^{-\frac{2 k \epsilon^2}{\ln^2(2)\left(1 - \frac{k-1}{2^n}\right)}} \le e^{-\frac{k \epsilon^2}{1 - \frac{k-1}{2^n}}}.  (86)

Consequently, \widehat{LH}_n(F, r)/n is also tightly concentrated around its mean, and the Local VC-Entropy can be easily estimated from the data.

6. Conclusion

In this paper we have defined a localized VC complexity measure, the Local VC-Entropy, and, based on this complexity, we have developed a new bound on the generalization error of a classifier. We showed that this bound is able to discard those functions that, most likely, will not be selected during the learning phase. Then, we showed that this new local notion of complexity can be related to the well-known Local Rademacher Complexity. This relation allows one to find an admissible range for the value of one complexity given the other, and to bypass the computationally intractable problem of computing the Local Rademacher Complexity in binary classification. A further advantage of the proposed approach is that we addressed mainly empirical quantities, the empirical Local Rademacher Complexity and the empirical Local VC-Entropy, which are the only ones that can be estimated from the data.

Appendix A. Known results

The next two lemmas describe some properties relating the Annealed VC-Entropy and the VC-Entropy. The proofs can be retrieved in Boucheron et al. (2000).

Lemma A.1. The Annealed VC-Entropy can be upper-bounded through the VC-Entropy as follows:

A_{2n}(F) \le 2 A_n(F) \le 4 H_n(F).  (A.1)

The following lemma shows that the VC-Entropy can be estimated based on its empirical version (Boucheron et al., 2000).

Lemma A.2. The Empirical VC-Entropy is concentrated around its expected value. The following inequality holds with probability at least (1 − e^{−x}):

H_n(F) \le \widehat{H}_n(F) + \sqrt{2 x H_n(F)}.  (A.2)

Note that Lemma A.2 is not the sharpest concentration result available (Boucheron et al., 2000, 2013), but in this paper we keep the proofs more readable and do not focus on getting optimal constants. We also recall here the Vapnik's generalization bounds.

Theorem A.3. Given a space of functions F and a dataset D_n:

P\left[\sup_{f \in F} \frac{L(f) - \hat{L}_n(f)}{\sqrt{L(f)}} \ge t\right] \le 4 \exp\left[\left(\frac{A_{2n}(F)}{n} - \frac{t^2}{4}\right) n\right],  (A.3)

P\left[\sup_{f \in F} \left(L(f) - \hat{L}_n(f)\right) \ge t\right] \le 4 \exp\left[\left(\frac{A_{2n}(F)}{n} - t^2\right) n\right].  (A.4)
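The feasibility check of Problem (83) can be sketched with an off-the-shelf LP solver. Since the constraints are homogeneous in (α, b), the strict inequalities σ_i(·) > 0 can be replaced by σ_i(·) ≥ 1 without loss of generality. The sketch below uses `scipy.optimize.linprog`; the function name and the margin trick are our choices, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def shatterable(K, sigma):
    """Problem (83): does some (alpha, b) satisfy
    sigma_i * (sum_j alpha_j K[j, i] + b) >= 1 for all i?
    Pure feasibility check: the objective is identically zero."""
    n = len(sigma)
    # Variables x = (alpha_1, ..., alpha_n, b); linprog wants A_ub @ x <= b_ub.
    A_ub = -np.diag(sigma) @ np.hstack([K.T, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.status == 0  # 0 = solved (feasible), 2 = infeasible

# Linear kernel on four collinear points: the labelling (-,-,+,+) is
# realizable by an affine function, the alternating one is not.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
K = X @ X.T
print(shatterable(K, np.array([-1, -1, 1, 1])))   # True
print(shatterable(K, np.array([-1, 1, -1, 1])))   # False
```

Counting the feasible σ within Hamming distance nr of y, over all (or a Monte Carlo sample of) label configurations, then yields the Local VC-Entropy estimate of Eqs. (82) and (84).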

Appendix B. Proofs

Proof of Theorem 3.1. Given a space of functions F with the associated loss function I(f(X), Y), the space of loss functions \mathcal{I} is defined as:

\mathcal{I} = I \circ F = \{I(f(X), Y) : f \in F\}.  (B.1)


Consequently, we reformulate the notions of generalization and empirical error in this new space \mathcal{I} as:

L(f) = L(I) = E_{(X,Y)} I(f(X), Y),  (B.2)

\hat{L}_n(f) = \hat{L}_n(I) = \frac{1}{n} \sum_{i=1}^{n} I(f(X_i), Y_i).  (B.3)

We use the loss space instead of the function space since it allows us to normalize it with respect to the generalization error and easily obtain the result. Let us normalize the loss space \mathcal{I}:

\mathcal{I}_r = \left\{\frac{r}{L(I) \vee r}\, I : I \in \mathcal{I}\right\},  (B.4)

and let us suppose that, for r > 0 and K > 1,

\sup_{I_r \in \mathcal{I}_r} \left[L(I_r) - \hat{L}_n(I_r)\right] \le \frac{r}{K}.  (B.5)

Then, note that ∀ I_r ∈ \mathcal{I}_r we have

L(I_r) \le \hat{L}_n(I_r) + \sup_{I_r \in \mathcal{I}_r} \left[L(I_r) - \hat{L}_n(I_r)\right] \le \hat{L}_n(I_r) + \frac{r}{K}.  (B.6)

Let us consider the two cases for the normalization factor: (i) L(I) ≤ r and (ii) L(I) > r. In the first case I_r = I, so the following inequality holds:

L(I) = L(I_r) \le \hat{L}_n(I_r) + \frac{r}{K} = \hat{L}_n(I) + \frac{r}{K}.  (B.7)

In the second case, I_r = \frac{r}{L(I)} I and \frac{r}{L(I)} \in [0, 1]. Then:

L(I) - \hat{L}_n(I) = \frac{L(I)}{r}\left[L(I_r) - \hat{L}_n(I_r)\right] \le \frac{L(I)}{r} \cdot \frac{r}{K} = \frac{L(I)}{K}.  (B.8)

By solving the last inequality with respect to L(I), we obtain the following bound:

L(I) \le \frac{K}{K-1} \hat{L}_n(I).  (B.9)

By combining the inequalities of Eqs. (B.7) and (B.9), we have that, if Property (B.5) holds, we can state that ∀ K > 1:

L(I) \le \max\left[\frac{K}{K-1} \hat{L}_n(I),\ \hat{L}_n(I) + \frac{r}{K}\right] \le \frac{K}{K-1} \hat{L}_n(I) + \frac{r}{K}, \quad \forall I \in \mathcal{I}.  (B.10)

Let us define the star-shaped (Bartlett et al., 2005) version of the loss space \mathcal{I}_r:

\mathcal{I}^s_r = \{\alpha I : \alpha \in (0, 1],\ I \in \mathcal{I},\ L(\alpha I) \le r\}.  (B.11)

Since \frac{r}{L(I) \vee r} \le 1, we can state that

\mathcal{I}_r \subseteq \mathcal{I}^s_r.  (B.12)

If the inequality of Eq. (B.5) holds for \mathcal{I}^s_r, r > 0 and K > 1:

\sup_{I^s_r \in \mathcal{I}^s_r} \left[L(I^s_r) - \hat{L}_n(I^s_r)\right] = \sup_{I \in \mathcal{I},\ \alpha \in (0,1],\ L(\alpha I) \le r} \alpha\left[L(I) - \hat{L}_n(I)\right] = \sup_{\alpha \in (0,1]} \alpha \sup_{I \in \mathcal{I},\ L(\alpha I) \le r} \left[L(I) - \hat{L}_n(I)\right] = \sup_{\alpha \in (0,1]} \alpha \sup_{f \in F,\ L(f) \le \frac{r}{\alpha}} \left[L(f) - \hat{L}_n(f)\right] \le \frac{r}{K},  (B.13)

then, thanks to the inequality of Eq. (B.10), we can state that ∀ f ∈ F:

L(f) \le \frac{K}{K-1} \hat{L}_n(f) + \frac{r}{K}.  (B.14)

So the claim of the theorem is proved. In fact, since the bound holds for every r > 0 and K > 1 which satisfy the condition of Eq. (B.13), we can take the combination of r and K that gives the minimum of Eq. (B.14). □

Proof of Lemma 3.2. Let us consider Lemma A.2: by solving Eq. (A.2) with respect to H_n(F), we have, with probability at least (1 − e^{−x}), that:

H_n(F) \le \widehat{H}_n(F) + \frac{1}{2}\left[2x + \sqrt{2x}\sqrt{4\widehat{H}_n(F) + 2x}\right] \le 2\widehat{H}_n(F) + 4x,  (B.15)

since \sqrt{a+b} \le \sqrt{a} + \sqrt{b} and 2\sqrt{ab} \le a + b with a, b \in [0, +\infty). By combining Lemma A.1 with the inequality of Eq. (B.15), we have, with probability at least (1 − e^{−x}), that

A_{2n}(F) \le 4 H_n(F) \le 8 \widehat{H}_n(F) + 16x = 8 \widehat{A}_n(F) + 16x. \quad \square  (B.16)

Proof of Corollary 3.3. In order to prove Property (19), let us consider the first property of Theorem A.3 in order to state that, with probability at least (1 − 4e^{−x}), the following inequality holds:

\sup_{f \in F} \frac{L(f) - \hat{L}_n(f)}{\sqrt{L(f)}} \le 2\sqrt{\frac{A_{2n}(F) + x}{n}}.  (B.17)

Moreover, the first term of the inequality of Eq. (B.17) can be bounded as follows:

\sup_{f \in F} \left[L(f) - \hat{L}_n(f)\right] \le \sqrt{\sup_{f \in F} L(f)}\ \sup_{f \in F} \frac{L(f) - \hat{L}_n(f)}{\sqrt{L(f)}}.  (B.18)

By combining the inequalities of Eqs. (B.17), (B.18) and Lemma 3.2, the inequality of Eq. (19) can be straightforwardly derived. By combining, instead, Theorem A.3 with Lemma 3.2, we obtain the inequalities of Eqs. (20) and (21). □

Proof of Theorem 3.5. Let us consider Theorem 3.1 and Corollary 3.3; then, with probability at least (1 − 4e^{−x}):

\sup_{\alpha \in (0,1]} \alpha \sup_{f \in F,\ L(f) \le \frac{r}{\alpha}} \left[L(f) - \hat{L}_n(f)\right] \le \sup_{\alpha \in (0,1]} \alpha \cdot 6\sqrt{\frac{r}{\alpha} \cdot \frac{\widehat{H}_n\left(\left\{f : f \in F,\ L(f) \le \frac{r}{\alpha}\right\}\right) + 2x}{n}} = \sup_{\alpha \in (0,1]} 6\sqrt{\frac{r \alpha \left[\widehat{H}_n\left(\left\{f : f \in F,\ L(f) \le \frac{r}{\alpha}\right\}\right) + 2x\right]}{n}}.  (B.19)

Note also that, thanks to Lemma 3.4, we can state that, with probability at least (1 − 4e^{−x}), the following inequality holds:

\widehat{H}_n\left(\left\{f : f \in F,\ L(f) \le \frac{r}{\alpha}\right\}\right) \le \widehat{H}_n\left(\left\{f : f \in F,\ \hat{L}_n(f) \le \frac{r}{\alpha} + 3\sqrt{\frac{\widehat{H}_n\left(\left\{f : f \in F,\ L(f) \le \frac{r}{\alpha}\right\}\right) + 2x}{n}}\right\}\right).  (B.20)


The inequality of Eq. (B.20) can be rewritten as:

T(r, \alpha) \le \widehat{H}_n\left(\left\{f : f \in F,\ \hat{L}_n(f) \le \frac{r}{\alpha} + 3\sqrt{\frac{T(r, \alpha) + 2x}{n}}\right\}\right) = \widehat{LH}_n\left(F,\ \frac{r}{\alpha} + 3\sqrt{\frac{T(r, \alpha) + 2x}{n}}\right).  (B.21)

We can also reformulate the inequality of Eq. (B.19) in order to obtain:

\sup_{\alpha \in (0,1]} \alpha \sup_{f \in F,\ L(f) \le \frac{r}{\alpha}} \left[L(f) - \hat{L}_n(f)\right] \le \sup_{\alpha \in (0,1]} 6\sqrt{\frac{r \alpha \left[T(r, \alpha) + 2x\right]}{n}},  (B.22)

which concludes the proof. □

Proof of Lemma 3.7. In order to prove the statement, let us note that

\frac{d}{d\alpha}\left[a \alpha \ln\left(\frac{b}{\alpha} + c\right) + d\right] = a \ln\left(\frac{b}{\alpha} + c\right) - \frac{a b}{b + \alpha c}.  (B.23)

Moreover, let us remember that

\ln(x) \le x - 1,\ x > 0\ \rightarrow\ \ln(x) \ge 1 - \frac{1}{x},\ x > 0.  (B.24)

Consequently:

a \ln\left(\frac{b}{\alpha} + c\right) - \frac{a b}{b + \alpha c} \ge a\left[1 - \frac{\alpha}{b + \alpha c}\right] - \frac{a b}{b + \alpha c} = a\, \frac{b + \alpha c - \alpha - b}{b + \alpha c} = a \alpha\, \frac{c - 1}{b + \alpha c} \ge 0.  (B.25)

Consequently, the statement of this lemma is proved. □

Proof of Corollary 3.8. For what concerns the Vapnik's result of Corollary 3.3, which considers all the functions in F, we have that the number of considered functions is:

a n^p + b.  (B.26)

For what concerns the Local VC-Entropy based bound of Corollary 3.6, by following the same argument presented in this section for the toy problem, thanks to Corollary 3.6 and Lemma 3.7, we have that, with probability at least (1 − 9e^{−x}),

L(f^*) \le r^*,  (B.27)

where f^* is the one which gives \hat{L}_n(f^*) = 0 and r^* is the smallest r that satisfies the following inequality:

36\ \frac{\ln\left[a n^p \left(r + 3\sqrt{\frac{\ln(a n^p + b) + 2x}{n}}\right) + b\right] + 2x}{n} \le r, \quad r > 0.  (B.28)

Note that, for n → ∞, r^* → 0 with order O(ln(n)/n). The number of functions that are taken into account is

a n^p \left(r^* + 3\sqrt{\frac{\ln(a n^p + b) + 2x}{n}}\right) + b.  (B.29)

Consequently, since r^* + 3\sqrt{\frac{\ln(a n^p + b) + 2x}{n}} < 1 for n large enough, the statement of our theorem is proved. □

Proof of Corollary 3.9. Analogously to Corollary 3.8, for what concerns the Vapnik's result of Corollary 3.3, which considers all the functions in F, we have that the number of considered functions is:

a n^p \left(1 - \hat{L}_n(f^*)\right) + b.  (B.30)

For what concerns the Local VC-Entropy based bound of Theorem 3.5, we have that, with probability at least (1 − 9e^{−x}),

L(f^*) \le \min_{K \in (1, \infty)} \left[\frac{K}{K-1} \hat{L}_n(f^*) + \frac{r^*}{K}\right] \le \frac{\sqrt[4]{n}}{\sqrt[4]{n} - 1} \hat{L}_n(f^*) + \frac{r^*}{\sqrt[4]{n}}.  (B.31)

Note that, by setting K = \sqrt[4]{n}, we have just obtained a looser constant without compromising the validity of the bound, and if r^* → 0 for n → ∞ the consistency of the bound is ensured. Moreover, f^* ∈ F and r^* is the smallest r that satisfies the following inequality:

\sup_{\alpha \in (0,1]} 6\sqrt{\frac{r \alpha \left[T(r, \alpha) + 2x\right]}{n}} \le \frac{r}{\sqrt[4]{n}}, \quad r > 0,  (B.32)

where

T(r, \alpha) \le \widehat{LH}_n\left(F,\ \frac{r}{\alpha} + 3\sqrt{\frac{T(r, \alpha) + 2x}{n}}\right).  (B.33)

Note that T(r, \alpha) \le \ln\left[a n^p \left(r - \hat{L}_n(f^*)\right) + b\right] and then:

T(r, \alpha) \le \ln\left[a n^p \max\left\{0,\ \frac{r}{\alpha} + 3\sqrt{\frac{\ln\left[a n^p \left(r - \hat{L}_n(f^*)\right) + b\right] + 2x}{n}} - \hat{L}_n(f^*)\right\} + b\right].  (B.34)

Based on these last results, we can state that:

\sup_{\alpha \in (0,1]} 6\sqrt{\frac{r \alpha \left[T(r, \alpha) + 2x\right]}{n}} \le \frac{r}{\sqrt[4]{n}}\ \rightarrow\ \sup_{\alpha \in (0,1]} 36\ \frac{\alpha \left[T(r, \alpha) + 2x\right]}{\sqrt{n}} = 36\ \frac{T(r, 1) + 2x}{\sqrt{n}} \le r,  (B.35)

since we have exploited Lemma 3.7. Consequently, r^* must become zero, as n increases, as O(ln(n)/\sqrt{n}). Finally, we can count the number of functions that are taken into account by Theorem 3.5:

a n^p \max\left\{0,\ r^* + 3\sqrt{\frac{\ln\left[a n^p \left(r^* - \hat{L}_n(f^*)\right) + b\right] + 2x}{n}} - \hat{L}_n(f^*)\right\} + b.  (B.36)

Since r^* + 3\sqrt{\frac{\ln\left[a n^p \left(r^* - \hat{L}_n(f^*)\right) + b\right] + 2x}{n}} < 1 for n large enough, the statement of our theorem is proved. □

References

Amaldi, E., & Kann, V. (1995). The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147, 181–210.
Anguita, D., Ghio, A., Greco, N., Oneto, L., & Ridella, S. (2010). Model selection for support vector machines: Advantages and disadvantages of the machine learning theory. In International joint conference on neural networks.
Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2012). In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 23, 1390–1406.
Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2014). A deep connection between the Vapnik–Chervonenkis entropy and the Rademacher complexity. IEEE Transactions on Neural Networks and Learning Systems, 25, 2202–2211.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bartlett, P. L., Boucheron, S., & Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48, 85–113.
Bartlett, P. L., Bousquet, O., & Mendelson, S. (2002). Localized Rademacher complexities. In Computational learning theory.
Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33, 1497–1537.
Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3, 463–482.
Boucheron, S., Lugosi, G., & Massart, P. (2000). A sharp concentration inequality with applications. Random Structures & Algorithms, 16, 277–292.
Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Bousquet, O. (2002). Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. (Ph.D. thesis), Paris: Ecole Polytechnique.
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Cortes, C., Kloft, M., & Mohri, M. (2013). Learning kernels using local Rademacher complexity. In Advances in neural information processing systems.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 326–334.
Duan, H. H. (2012). Bounding the fat shattering dimension of a composition function class built using a continuous logic connective. The Waterloo Mathematics Review, 2, 4–19.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Hoffgen, K. U., Simon, H. U., & Vanhorn, K. S. (1995). Robust trainability of single neurons. Journal of Computer and System Sciences, 50, 114–125.
Johnson, D. S., & Preparata, F. P. (1978). The densest hemisphere problem. Theoretical Computer Science, 6, 93–107.
Klęsk, P., & Korzeń, M. (2011). Sets of approximating functions with finite Vapnik–Chervonenkis dimension for nearest-neighbors algorithms. Pattern Recognition Letters, 32, 1882–1893.
Kloft, M., & Blanchard, G. (2011). The local Rademacher complexity of lp-norm multiple kernel learning. In Advances in neural information processing systems.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47, 1902–1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 2593–2656.
Ledoux, M., & Talagrand, M. (1991). Probability in Banach spaces: Isoperimetry and processes, Vol. 23. Springer.
Lei, Y., Binder, A., Dogan, U., & Kloft, M. (2015). Localized multiple kernel learning—a convex approach. ArXiv preprint arXiv:1506.04364.
Lei, Y., Ding, L., & Bi, Y. (2015). Local Rademacher complexity bounds based on covering numbers. ArXiv preprint arXiv:1510.01463.
Massart, P. (2000). Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse Mathématiques, 9, 245–303.
McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141, 148–188.
Mendelson, S. (2002). Geometric parameters of kernel machines. In Computational learning theory.
Oneto, L., Ghio, A., Ridella, S., & Anguita, D. (2015a). Global Rademacher complexity bounds: From slow to fast convergence rates. Neural Processing Letters.
Oneto, L., Ghio, A., Ridella, S., & Anguita, D. (2015b). Local Rademacher complexity: Sharper risk bounds with and without unlabeled samples. Neural Networks, 65, 115–125.
Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A, 13, 145–147.
Serfling, R. J. (1974). Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, 2, 39–48.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44, 1926–1940.
Shelah, S. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41, 247–261.
Srebro, N., Sridharan, K., & Tewari, A. (2010). Smoothness, low noise and fast rates. In Advances in neural information processing systems.
Steinwart, I., & Scovel, C. (2005). Fast rates for support vector machines. In Learning theory.
van de Geer, S. (2006). Discussion: Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 2664–2671.
Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
Vapnik, V. N., & Kotz, S. (1982). Estimation of dependences based on empirical data, Vol. 41. New York: Springer-Verlag.
Zhang, C., Bian, W., Tao, D., & Lin, W. (2012). Discretized-Vapnik–Chervonenkis dimension for analyzing complexity of real function classes. IEEE Transactions on Neural Networks and Learning Systems, 23, 1461–1472.
Zhang, C., & Tao, D. (2013). Structure of indicator function classes with finite Vapnik–Chervonenkis dimensions. IEEE Transactions on Neural Networks and Learning Systems, 24, 1156–1160.
Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739–767.