Local Rademacher Complexity Machine


ARTICLE IN PRESS

JID: NEUCOM

[m5G;February 7, 2019;19:45]

Neurocomputing xxx (xxxx) xxx

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Local Rademacher Complexity Machine

Luca Oneto a,∗, Sandro Ridella b, Davide Anguita a

a DIBRIS, University of Genova, Via Opera Pia 13, Genova I-16145, Italy
b DITEN, University of Genova, Via Opera Pia 11a, Genova I-16145, Italy

Article info

Article history: Received 28 June 2018; Revised 5 October 2018; Accepted 25 October 2018; Available online xxx

Keywords: Vapnik–Chervonenkis Theory; Support Vector Machines; Local Rademacher Complexity Theory; Local Rademacher Complexity Machine; Kernel methods

Abstract

Support Vector Machines (SVMs) are state-of-the-art, powerful learning algorithms that can effectively solve many real world problems. SVMs are the transposition of the Vapnik–Chervonenkis (VC) theory into a learning algorithm. In this paper, we present the Local Rademacher Complexity Machine (LRCM), a transposition of the Local Rademacher Complexity (LRC) theory, the state-of-the-art evolution of the VC theory, into a learning algorithm. Analogously to what has been done for the SVMs, we will first present the theoretical ideas behind the LRC theory, then show how these ideas can be translated into a learning algorithm, the LRCM, and finally how the LRCM can be made efficient and kernelizable. By exploiting a series of real world datasets, we will show the effectiveness of the LRCM against the SVMs.

1. Introduction

In the context of learning from data, understanding and controlling the learning process is of paramount importance, since it allows us to study how learning algorithms behave and how to improve them. One of the first attempts in this direction was the theory developed by Vapnik and Chervonenkis, the well-known Vapnik–Chervonenkis (VC) theory [1,2]. The VC theory tells us that the generalization error of a model, chosen in a set of possible ones, is bounded by its empirical error, plus a term which measures the size of the set of models, the VC-Dimension, plus another term which takes into account the fact that the sample may not be a good representation of the distribution of the data, the confidence term. Unfortunately, the VC-Dimension (i) is a global measure of complexity, namely it takes into account all the functions in the set, and (ii) is data-independent, because it does not take into account the actual distribution of the data available for learning. In other words, even the functions that will never be chosen by the learning algorithm (functions with high empirical error), or samples that have small probability of being sampled, are taken into account. As a consequence, the VC-Dimension leads to very pessimistic estimates. In order to deal with issue (ii), data-dependent complexity measures have been developed, which allow us to take into account the actual distribution of the data and to produce tighter estimates. The Rademacher Complexity (RC) [3–5] represents the state-of-the-art tool in this context. In order to deal, instead, with issue (i), researchers have succeeded in developing local data-dependent complexity measures like the Local VC-Dimension [6] and the Local RC (LRC) theory [7–10]. Local measures improve over global ones thanks to their ability to take into account only the functions with small error, namely the ones that will most likely be chosen by the learning procedure. In particular, the LRC has been shown to accurately capture the nature of the learning process, both from a theoretical point of view [8–17] and in many different real world applications (e.g. small sample problems [18], resource limited models [19], graph kernel learning [20,21], and multiple kernel learning [22]). The suggestion of the LRC theory is quite simple: a good set of models should contain models with small error over the data and large error over randomly labeled data. Basically, LRC states that a good model should be able to fit just the observed data and not a noisy version of it.

Apart from the intrinsic theoretical importance of the above mentioned results, they have also allowed the development of practical tools which are now milestones in learning from data. The most compelling example are the Support Vector Machines (SVMs), state-of-the-art and powerful learning algorithms that can effectively solve many real world problems [23,24]. SVMs are the transposition of the VC theory into a learning algorithm which optimizes the trade-off between accuracy and complexity of the learned model [25,26]. Despite the great success of SVMs, to the best knowledge of the authors, no one has tried to develop a learning algorithm based on the LRC theory analogously to what has been done for the VC theory. For this reason, inspired by the SVMs, we propose in this paper the Local Rademacher Complexity Machine (LRCM), a transposition of the LRC theory into a learning algorithm which improves on the original SVMs by including the new intuitions behind the LRC. In fact, our proposal is able, like the SVMs, to be efficient and to generate both linear and nonlinear models by exploiting the kernel trick [27]. Moreover, the LRCM introduces a new regularization term which forces the solution to have small error on the available data but large error on a randomly labeled sample, implementing the idea behind the LRC theory.

The rest of the paper is organized as follows. Section 2 formalizes the LRC theory, focusing on the advances with respect to the VC theory. Section 3 shows, in analogy with the SVM, how the LRC theory can be transposed into an efficient and easily kernelizable learning algorithm, the LRCM. Section 4 compares the performance of the LRCM against its predecessor, the SVM, showing the advantages of the former. Section 5 discusses the results obtained in this work, opens a new perspective on how this work can evolve in the future, and opens the way to a new field of research. In the Appendix, we report material that is not essential to the comprehension of the paper, in order to improve its readability.

∗ Corresponding author. E-mail addresses: [email protected] (L. Oneto), [email protected] (S. Ridella), [email protected] (D. Anguita).

https://doi.org/10.1016/j.neucom.2018.10.087
0925-2312/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: L. Oneto, S. Ridella and D. Anguita, Local Rademacher Complexity Machine, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.10.087

2. From the VC theory to the LRC theory

2.1. VC and RC based theories

We consider the conventional binary classification framework [2]: based on the observation of x ∈ X ⊆ ℝ^d, one has to estimate y ∈ Y ⊆ {±1} by choosing a suitable hypothesis h: X → Ŷ ⊆ ℝ in a set of possible ones H. A learning algorithm selects h ∈ H by exploiting a set of labeled samples D_n: {(x_1, y_1), ..., (x_n, y_n)}. D_n consists of a sequence of independent samples distributed according to μ over X × Y. The generalization error associated with a hypothesis h ∈ H is defined as

L(h) = E_{(x,y)} ℓ(h(x), y),    (1)

where ℓ: Ŷ × Y → [0, 1] is a predefined loss function. As μ is unknown, L(h) cannot be explicitly computed, but we can compute the empirical error, namely the empirical estimator of the generalization error

L̂_n^y(h) = (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i).    (2)

The purpose of any learning procedure is to select the best model h∗, namely the one which minimizes the generalization error over all possible functions

h∗ = arg min_h L(h).    (3)

Obviously, we cannot check all possible functions, and L(h) cannot be computed since μ is unknown. What we can do is select different sets of functions ℋ = {H_1, H_2, ...} and estimate L(h) with a suitable estimator L̃(h, n, H, δ) which allows us to find a function h̃∗, close to h∗, such that

h̃∗ = arg min_{h ∈ H ∈ ℋ} L̃(h, n, H, δ).    (4)

Typically L̃(h) = L̃(h, n, H, δ) is a probabilistic upper bound of L(h), i.e. L(h) ≤ L̃(h, n, H, δ), which depends on the empirical error of h, the number of samples, the size of the class H, and the desired probability (1 − δ) for the bound to hold. Different approaches exist for building L̃(h, n, H, δ), and they basically differ in the way they compute the size of the class H. In the next sections we will review the milestone results of Vapnik and Chervonenkis about the VC theory [1,2], then show how these results have been improved with the RC theory [3–5], and then present the state-of-the-art results of the LRC theory [7–10]. Note that, in this work, we will not exploit the tightest bounds available in the literature, since this is out of the scope of this paper; instead, we will present the most interpretable ones.

In order to present the VC and RC based generalization bounds, we first need to recall some preliminary definitions. The VC based generalization theory has been developed for the specific case when Ŷ = Y ⊆ {±1} and the Hard loss function ℓ_H(h(x), y) = [yh(x) ≤ 0] is exploited, where the Iverson bracket notation is used. Let us define the following quantity

H_{D_n} = {{h(x_1), ..., h(x_n)} | h ∈ H},    (5)

which is the set of functions restricted to the sample. In other words, H_{D_n} is the set of distinct functions distinguishable within H with respect to the dataset D_n. We can then define the empirical VC-Entropy [2] Â_n(H) as the logarithm of the number of functions in H_{D_n}

Â_n(H) = ln(|H_{D_n}|),    (6)

where |·| is the cardinality of a set, and the Growth Function [2] G_n(H) as the largest value of Â_n(H) that we can obtain over all possible n samples D_n extracted from μ

G_n(H) = max_{D_n} Â_n(H).    (7)

An elegant interpretation of the empirical VC-Entropy [28] shows that Â_n(H) is the logarithm of the number of possible combinations of labels that we can assign to the x_i ∈ D_n, out of a maximum of 2^n possible ones, which can be perfectly shattered by H. In other words, Â_n(H) throws away the information about the y_i ∈ D_n, keeping just the information about the x_i ∈ D_n. As a consequence, all the functions in H are taken into consideration, even the ones with high error [6]. Furthermore, the Growth Function basically throws away also the information about the x_i ∈ D_n by considering the worst possible n samples of the input space [2]. This is the reason why the Growth Function is considered a data-independent measure of the complexity of H. At this point we can finally define the VC-Dimension [2] d_VC(H) as the maximum number of samples that can be perfectly shattered by H

d_VC(H) = max{n | n ∈ ℕ⁺, G_n(H) = n ln(2)}.    (8)
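For a small finite hypothesis class, the quantities in Eqs. (5), (6) and (8) can be computed by brute force, which makes the definitions concrete. A minimal sketch in Python (the threshold class and the sample points are hypothetical, chosen only for illustration):

```python
import numpy as np

def empirical_vc_entropy(hypotheses, xs):
    # Eq. (6): ln of the number of distinct labelings of xs induced by the class
    labelings = {tuple(h(x) for x in xs) for h in hypotheses}
    return np.log(len(labelings))

# Hypothetical toy class: 1-D threshold classifiers h_t(x) = sign(x - t)
thresholds = np.linspace(-2.0, 2.0, 41)
hypotheses = [lambda x, t=t: 1 if x > t else -1 for t in thresholds]

xs = [-1.0, 0.0, 1.0]            # n = 3 points
entropy = empirical_vc_entropy(hypotheses, xs)
# Thresholds realize only n + 1 = 4 of the 2^n = 8 labelings, so entropy = ln(4)
```

For this class G_n(H) = ln(n + 1) < n ln(2) for every n ≥ 2, so by Eq. (8) d_VC(H) = 1.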

Unfortunately, the VC-Dimension cannot be computed from the data, but based on its definition we can recall the milestone result of [1,2,29], which states that the following generalization error bound holds with probability (1 − δ)

L(h) ≤ L̂_n^y(h) + √(d_VC ln(n) / n) + √(ln(4/δ) / n),    ∀h ∈ H.    (9)

The latter is the main result of the VC theory. As we have just seen, the VC theory possesses three main weaknesses: (i) it deals only with the binary classification case when the Hard loss function is exploited, (ii) d_VC is basically data-independent and cannot be computed from the data, and (iii) the VC-Dimension takes into account also the functions in H with high empirical error, namely the functions that will never be selected by any learning algorithm. In order to fill the first two gaps, the RC theory has been developed. For what concerns the third issue, a localized version of the VC theory which takes into account only the functions with low empirical error has been developed in [6], but the state-of-the-art solution is the LRC theory [8,10] that we will present in Section 2.2.

In order to present the RC theory we have to define some preliminary quantities. Let σ_1, ..., σ_n be n {±1}-valued independent Rademacher random variables for which P(σ_i = +1) = P(σ_i = −1) = 1/2. Then we can define the empirical RC as

R̂̂_n(H) = sup_{h∈H} (2/n) Σ_{i=1}^n σ_i ℓ(h(x_i), y_i).    (10)
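The empirical RC of Eq. (10), and its expectation over the Rademacher variables, can be approximated by Monte Carlo for a finite class. A hedged sketch under the Hard loss (the threshold class and the random toy data are assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_loss(pred, y):
    return 1.0 if pred * y <= 0 else 0.0      # Hard loss [y h(x) <= 0]

def empirical_rc(hypotheses, xs, ys, n_draws=200):
    # Monte Carlo estimate of E_sigma sup_h (2/n) sum_i sigma_i * loss_i(h), cf. Eq. (10)
    n = len(xs)
    losses = np.array([[hard_loss(h(x), y) for x, y in zip(xs, ys)]
                       for h in hypotheses])  # one row of losses per hypothesis
    draws = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        draws.append((losses @ sigma).max() * 2.0 / n)  # sup over h for this sigma
    return float(np.mean(draws))

thresholds = np.linspace(-2.0, 2.0, 41)
hypotheses = [lambda x, t=t: 1.0 if x > t else -1.0 for t in thresholds]
xs = rng.normal(size=20)
ys = np.sign(xs)                  # labels realizable by the threshold t = 0
rc_estimate = empirical_rc(hypotheses, xs, ys)
```

Since the losses are [0,1]-valued, the estimate always lies in [0, 2]; classes that cannot fit the random signs produce values near zero.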


Often the empirical RC is presented as

R̂_n(H) = E_{σ_1,...,σ_n} R̂̂_n(H),    (11)

which, while still empirical, is less computationally efficient, since we also have to compute the expected value with respect to the Rademacher random variables. R̂_n(H) is sometimes preferred to R̂̂_n(H) since it may lead to sharper generalization bounds [29]. The RC is defined as

R_n(H) = E_{D_n} R̂_n(H).    (12)

Note that the empirical RC is data dependent, in particular it depends on the x_i ∈ D_n, and can be computed directly from the available data. Note also that, if ℓ(h(x), y) can be expressed as ℓ(h(x), y) = (1 − yh(x))/2, or in general if ℓ is symmetric (i.e. ℓ(h(x), y) = 1 − ℓ(h(x), −y)), then R̂̂_n(H) can be reformulated as

R̂̂_n(H) = sup_{h∈H} [1 − 2 L̂_n^σ(h)] = sup_{h∈H} [2 L̂_n^{−σ}(h) − 1];    (13)

for the proof refer, for example, to [30], Section IV.A. Thanks to this simple reformulation it is possible to derive a first interpretation of the empirical RC: it measures the capacity of the class of functions to fit random labels. If H is not able to fit the noise, R̂̂_n(H) will be small (since L̂_n^σ(h) is large), and vice versa. A more elegant interpretation [6] shows that the RC is a generalization of the VC approach which takes into account not just the functions which perfectly shatter a random configuration of the labels, but also the errors of the ones that are not able to shatter the data. Based on the definition of RC we can recall the milestone result of [3,4], which states that the following generalization error bound holds with probability (1 − δ)

L(h) ≤ L̂_n^y(h) + R̂̂_n(H) + 3 √(ln(4/δ) / n),    ∀h ∈ H,    (14)

where any [0,1]-bounded loss function can be exploited. The latter is the main result of the RC theory. Note that both the VC and the RC based generalization bounds state that with probability (1 − δ)

L(h) ≤ L̂_n^y(h) + C_n(H) + ε(n, δ),    ∀h ∈ H,    (15)

where L̂_n^y(h) is the empirical error, C_n(H) is a term which depends on the size, or complexity, of the class of functions, measured with the VC-Dimension or the RC, and ε(n, δ) is a confidence term. Note also that, if Ŷ = Y ⊆ {±1} and the Hard loss function is exploited, it is possible to prove that two functions LB and UB exist such that

LB[d_VC(H)] ≤ R_n(H) ≤ UB[d_VC(H)],    (16)

UB^{−1}[R_n(H)] ≤ d_VC(H) ≤ LB^{−1}[R_n(H)],    (17)

where LB^{−1} and UB^{−1} are the inverses of LB and UB respectively [28]. This means that the VC and the RC based generalization bounds are basically two faces of the same coin and can be generalized under the same approach adopted in Eq. (15).

As we will see in Section 3, if we set L̃(h, n, H, δ) = L̂_n^y(h) + C_n(H) + ε(n, δ), we can plug it into Eq. (4) and directly derive the SVMs from there. In fact, SVMs search for the h which minimizes the trade-off between the accuracy L̂_n^y(h) and the complexity C_n(H) of the class from which we choose our model or, in other words, the estimated generalization error.1

1 Note that ε(n, δ) does not depend on h nor on H, so it does not influence the choice of h.

2.2. LRC theory

The issue of the VC and the RC theories that we still have to address is how to disregard the functions with high error, namely the ones that are not useful for learning, when measuring the size, or complexity, of a class of functions for estimating the generalization error of the model chosen in the class. In this context, the real breakthrough and state-of-the-art solution was the LRC theory [8,10], which was able, for the first time, to take into account just the functions in H with small error when computing the RC for estimating the generalization ability of a model. The main result of the LRC theory is a fully empirical generalization error bound [4,10]. In particular, it is possible to state that the following bound holds with probability (1 − δ) and ∀h ∈ H

L(h) ≤ (k̂ / (k̂ − 1)) L̂_n^y(h) + k̂ R̂̂_n({h' | h' ∈ H, L̂_n^y(h') ≤ L̂_n^y(h) + ĉ}) + ε(k̂, n, δ),    (18)

where ε(k̂, n, δ) is a confidence term and k̂ > 1 and ĉ ≥ 0 are quantities that can be computed from the data (see Appendix B for more details). The LRC based generalization bound of Eq. (18), with respect to the VC and RC based ones of Eq. (15), shows that an optimal model should not be just the result of a compromise between the accuracy of that model and the complexity of the space of models from which the model has been chosen. Eq. (18) states that, among the models with small empirical error, the functions which perform badly on random labels should be preferred. In the next section, similarly to what has been done with the VC and RC based bounds exploited for deriving the SVMs, we will build a learning algorithm from the LRC based generalization bound, called the LRCM, which tries to take advantage of the improvements of the LRC theory over the VC and RC theories.

3. From the SVMs to the LRCM

In this section we will show how the VC and RC theories can be transposed into the SVM learning algorithm (see Section 3.1), and then, by analogy, how we can do the same with the LRC theory, transposing it into the LRCM learning algorithm (see Section 3.2). For this purpose, let us consider the same framework as the SVMs [2,25], where h(x) = w^T φ(x) + b, where φ: ℝ^d → ℝ^D with usually D ≫ d, where H is defined by w : w ∈ ℝ^D, ‖w‖ ≤ a ∈ [0, ∞) and b ∈ ℝ, and where ‖·‖ is the Euclidean norm. Since we are dealing with binary classification problems, the Hard loss function ℓ_H(h(x), y) = (1 − y sign[h(x)])/2 should be adopted. Unfortunately, ℓ_H is not convex; hence, for the SVMs, the Hinge loss function ℓ_ξ(h(x), y) = max[0, 1 − yh(x)], the simplest yet most effective convex upper bound of the Hard loss function, is exploited [31]. In this setting it is possible to prove that the RC can be bounded as follows [4]

R̂_n(H) ≤ sup_{‖w‖≤a} (‖w‖ / n) √(Σ_{i=1}^n φ(x_i)^T φ(x_i)) ≤ (a / n) √(Σ_{i=1}^n φ(x_i)^T φ(x_i)),    (19)

which means that

C_n(H) ∝ sup_{‖w‖≤a} ‖w‖ ≤ a.    (20)
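For the linear class used here, the supremum over ‖w‖ ≤ a inside the Rademacher average has the closed form (a/n)‖Σ_i σ_i φ(x_i)‖, and Eq. (19) bounds its expectation by the data-dependent quantity (a/n)√(Σ_i φ(x_i)^T φ(x_i)). A numerical sanity check of this relationship, assuming φ is the identity and random toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, a = 50, 5, 3.0
X = rng.normal(size=(n, d))          # rows play the role of phi(x_i), with phi = identity

# Closed form of the supremum for one realization of sigma:
# sup_{||w|| <= a} (1/n) sum_i sigma_i w^T phi(x_i) = (a / n) * ||sum_i sigma_i phi(x_i)||
draws = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    draws.append(a / n * np.linalg.norm(sigma @ X))
rc_mc = float(np.mean(draws))        # Monte Carlo estimate of the Rademacher average

# Data-dependent upper bound from Eq. (19): (a / n) * sqrt(sum_i phi(x_i)^T phi(x_i))
rc_bound = a / n * np.sqrt((X ** 2).sum())
```

The bound follows from Jensen's inequality, since E‖Σ_i σ_i φ(x_i)‖ ≤ √(E‖Σ_i σ_i φ(x_i)‖²) = √(Σ_i ‖φ(x_i)‖²) once the cross terms vanish in expectation.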

3.1. From the VC and the RC theories to the SVMs

The SVMs are the result of the simple combination of the approach described in Eq. (4), exploiting the estimator of Eq. (15) and


the observation of Eq. (20), which allows us to state that [2]

w̃, b̃ = arg min_{w∈ℝ^D, b∈ℝ} (1/n) Σ_{i=1}^n max[0, 1 − y_i(w^T φ(x_i) + b)] + λ ‖w‖²,    (21)

where λ ∈ [0, ∞) is a hyperparameter which balances the trade-off between accuracy and complexity of the solution. Note that the connection between the VC and the RC theories and the SVMs is not so trivial. In fact, rigorously speaking, the complexity term (or regularizer) should consider the parameter a (since the hypothesis space is defined by w : w ∈ ℝ^D, ‖w‖ ≤ a ∈ [0, ∞) and b ∈ ℝ) and not ‖w‖. But since a would be constant no matter which w we choose such that ‖w‖ ≤ a, it is more appropriate to use directly ‖w‖, which, by analogy, measures the complexity of the selected function. By reformulating Problem (21) we obtain the more familiar primal formulation of the SVMs [25]

min_{w,b} (1/2) ‖w‖² + C Σ_{i=1}^n ξ_i    (22)
s.t. y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ∀i ∈ {1, ..., n}
     ξ_i ≥ 0,  ∀i ∈ {1, ..., n},

where C ∈ [0, ∞) is a hyperparameter which balances the trade-off between accuracy and complexity of the solution. C must be tuned during the Model Selection (MS) phase (see Section 3.4), which finds the best hyperparameter for the problem under exam [29]. By computing the dual formulation of Problem (22) we have that [25]

min_{α∈ℝ^n} (1/2) α^T Q^y α − 1^T α    (23)
s.t. y^T α = 0
     0 ≤ α ≤ C,

where Q^y_{i,j} = y_i y_j φ(x_i)^T φ(x_j), C = [C, ..., C]^T is the vector whose components all equal C, y = [y_1, ..., y_n]^T, b is the Lagrange multiplier of the equality constraint in Problem (23), and

h(x) = Σ_{i=1}^n α_i y_i φ(x_i)^T φ(x) + b.    (24)
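As an illustration of Problem (21), a minimal subgradient-descent sketch for a linear model (φ the identity); this is not the solver used in the paper, where the dual (23) is solved with standard SVM solvers, and the toy data are hypothetical:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    # Subgradient descent on Problem (21):
    # (1/n) sum_i max[0, 1 - y_i (w^T x_i + b)] + lam * ||w||^2
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # points with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical linearly separable toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
w, b = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

On well-separated clusters like these the sketch reaches zero training error; dedicated solvers remain preferable for real problems and for the kernelized dual.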

By exploiting the kernel trick [27] it is possible to overcome the necessity of knowing φ explicitly, avoiding the curse of dimensionality and feature-mapping engineering, by using a suitable kernel function which satisfies the Mercer condition. Though problem-specific kernel functions have been developed throughout the years (e.g. [32,33]), the linear and the Gaussian kernels are the most used ones [34] because of their peculiar properties. The kernel trick allows us to exploit the kernel K(a, b) = φ(a)^T φ(b) both during training and in the forward phase, obtaining Q^y_{i,j} = y_i y_j K(x_i, x_j) and

h(x) = Σ_{i=1}^n α_i y_i K(x_i, x) + b.    (25)
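The equivalence underlying Eqs. (24) and (25) can be checked directly: with the linear kernel K(a, b) = a^T b, the kernel expansion of h(x) coincides with the explicit model w^T x + b, where w = Σ_i α_i y_i φ(x_i). A sketch (the dual variables α and the offset b are hypothetical values, not the output of a solver):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
alpha = rng.uniform(0, 1, size=n)    # hypothetical dual variables
b = 0.5

def linear_kernel(a, c):
    return a @ c                      # K(a, c) = phi(a)^T phi(c) with phi = identity

def h_kernel(x):
    # Forward phase of Eq. (25): sum_i alpha_i y_i K(x_i, x) + b
    return sum(alpha[i] * y[i] * linear_kernel(X[i], x) for i in range(n)) + b

# The same model written explicitly: w = sum_i alpha_i y_i phi(x_i), as in Eq. (24)
w = (alpha * y) @ X
x_new = rng.normal(size=d)
```

Swapping `linear_kernel` for any Mercer kernel gives a model whose φ never needs to be written down.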

Each kernel, for example the Gaussian one K(a, b) = exp(−γ ‖a − b‖²), has one or more hyperparameters, in this case γ ∈ [0, ∞), that need to be tuned during the MS phase.

3.2. From the LRC theory to the LRCM

The LRC based generalization error upper bound of Eq. (18) adds a more profound intuition with respect to the VC and RC based ones of Eq. (15). Eq. (18) states that we should not look just at the complexity of the entire space H but we

should look only at a subset of the h ∈ H with small error. Furthermore, Eq. (18) also tells us that, among the functions with small empirical error over the available samples, we should prefer the ones with high error on random labels. If the chosen function has small empirical error on the available data and behaves badly on random noise, it is probably a function with good generalization properties. As a consequence, analogously to the SVMs, exploiting the Hinge loss function instead of the Hard one, and combining the approach described in Eq. (4) with the estimator of Eq. (18) and the observations of Eqs. (13) and (20), we obtain the LRCM

w̃, b̃ = arg min_{w∈ℝ^D, b∈ℝ} (1/n) Σ_{i=1}^n max[0, 1 − y_i(w^T φ(x_i) + b)] + λ_1 ‖w‖² + λ_2 (1/n) Σ_{i=1}^n max[0, 1 + σ_i(w^T φ(x_i) + b)],    (26)

where λ_1, λ_2 ∈ [0, ∞) are hyperparameters. λ_1 balances the trade-off between accuracy and complexity of the solution. λ_2 balances the capacity of the selected function to perform badly on the random labels and well on the actual ones. Problem (26) minimizes the error over the data, measured with the Hinge loss function, and simultaneously balances the complexity of the solution, measured with ‖w‖², but the complexity is computed by taking into account only the functions with high error over the random labels σ_1, ..., σ_n. In fact, when λ_1 ∈ (0, ∞) is large we force the solution to fit the available data with simple functions (small ‖w‖²). Moreover, when λ_2 ∈ (0, ∞) is large we also force the solution to make a high error over the σ_i, since Σ_{i=1}^n max[0, 1 + σ_i h(x_i)] = Σ_{i=1}^n ℓ_ξ(h(x_i), −σ_i). Note that the term Σ_{i=1}^n max[0, 1 + σ_i h(x_i)] also acts as a random regularizer, analogously to dropout in neural networks [35]. Note that, analogously to the SVMs, the connection between the LRC theory and the LRCM is not so trivial. In fact, rigorously speaking, the complexity term (or regularizer) should consider all the functions in the hypothesis space such that w : w ∈ ℝ^D, ‖w‖ ≤ a ∈ [0, ∞) and b ∈ ℝ and such that their error on random labels is small. But since this quantity would be constant no matter which function we choose such that ‖w‖ ≤ a and such that the error on random labels is low, it is more appropriate to use directly ‖w‖ and the error on random labels, which, by analogy, measure the complexity of the selected function. By reformulating Problem (26) we obtain the primal formulation of the LRCM

min_{w,b} (1/2) ‖w‖² + C_1 Σ_{i=1}^n ξ_i^y + C_2 Σ_{i=1}^n ξ_i^σ    (27)
s.t. y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i^y,  ∀i ∈ {1, ..., n}
     −σ_i(w^T φ(x_i) + b) ≥ 1 − ξ_i^σ,  ∀i ∈ {1, ..., n}
     ξ_i^y, ξ_i^σ ≥ 0,  ∀i ∈ {1, ..., n},

where C_1, C_2 ∈ [0, ∞) are hyperparameters. The larger C_1, the smaller the error on the available samples; the larger C_2, the larger the error on the randomly labeled samples. C_1 and C_2 must be tuned during the Model Selection (MS) phase (see Section 3.4), which finds the best hyperparameters for the problem under exam [29]. Note that, with respect to the SVMs, one additional hyperparameter needs to be tuned, which obviously increases the computational requirements of the MS phase. By computing the dual formulation of Problem (27) (see Appendix A) we have that

min_{α,β∈ℝ^n} (1/2) [α; β]^T [Q^y, Q^{y,σ}; Q^{σ,y}, Q^σ] [α; β] − 1^T [α; β]    (28)
s.t. [y; −σ]^T [α; β] = 0
     0 ≤ [α; β] ≤ [C_1; C_2],

where [α; β] denotes the vertical concatenation of α and β, Q^σ_{i,j} = σ_i σ_j φ(x_i)^T φ(x_j), Q^{y,σ}_{i,j} = −y_i σ_j φ(x_i)^T φ(x_j), Q^{σ,y} = (Q^{y,σ})^T, b is the Lagrange multiplier of the equality constraint of Problem (28), and

h(x) = Σ_{i=1}^n (α_i y_i − β_i σ_i) φ(x_i)^T φ(x) + b.    (29)
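To make Eq. (26) concrete, here is a minimal subgradient-descent sketch of the unconstrained LRCM formulation for a linear model. This is not the authors' implementation: the paper solves the dual (28) with standard SVM solvers and assigns the σ_i by Nearly Homogeneous Multi-Partitioning, whereas here the σ_i are plain random draws and the data are a hypothetical toy problem:

```python
import numpy as np

def train_linear_lrcm(X, y, sigma, lam1=0.01, lam2=0.1, epochs=300, lr=0.1):
    # Subgradient descent on Eq. (26) with h(x) = w^T x + b:
    # (1/n) sum_i max[0, 1 - y_i h(x_i)] + lam1 ||w||^2
    # + lam2 (1/n) sum_i max[0, 1 + sigma_i h(x_i)]
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        h = X @ w + b
        act_y = y * h < 1              # active hinge terms on the true labels
        act_s = sigma * h > -1         # active terms of the random-label regularizer
        grad_w = (-(y[act_y, None] * X[act_y]).sum(axis=0)
                  + lam2 * (sigma[act_s, None] * X[act_s]).sum(axis=0)) / n + 2 * lam1 * w
        grad_b = (-y[act_y].sum() + lam2 * sigma[act_s].sum()) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
sigma = rng.choice([-1.0, 1.0], size=40)   # plain random labels, only for illustration
w, b = train_linear_lrcm(X, y, sigma)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

With a small λ_2 the extra term perturbs the solution only mildly on separable data, while still penalizing functions that fit the random labels.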

Problem (28) can be easily solved with standard SVM solvers [36]. Problem (28) allows, analogously to Problem (23), to exploit the kernel trick [27] and implement nonlinear models as well. The only issue still open in Problem (27) (and consequently also in Problem (28)) is how to set the σ_i with i ∈ {1, ..., n}. In order to reduce the probability of an unlucky realization, we exploit the proposal described in [37] to use the Nearly Homogeneous Multi-Partitioning technique developed in [38]. The idea is to split the original dataset into two almost homogeneous subsets and to assign a different label to each of the two subsets. This is a heuristic method for assigning the noisiest possible labels to the available samples. In this way, the term Σ_{i=1}^n max[0, 1 + σ_i h(x_i)] measures the capacity of the function to underfit the noisiest possible configuration of the labels.

3.3. Extension to multiclass problems

In the previous sections, we recalled the SVMs and defined the LRCM learning algorithm for binary classification purposes. However, many real world problems are characterized by c ∈ {2, 3, 4, ...} classes, and SVMs, and consequently also the LRCM, can natively tackle only binary (c = 2), not multiclass (c > 2), classification problems. In order to fill this gap, several techniques exist in the literature [39–42]. In this work, we decided to exploit the All-Versus-All (AVA) approach for three reasons: (i) it is the simplest method, (ii) it does not create unbalanced classification problems if the original problem is balanced, and (iii) the size of each learning problem is small. In fact, the AVA method consists in building N_c(N_c − 1)/2 training sets, where N_c is the number of classes, each one containing data from two different classes, u and v. These sets are used for training N_c(N_c − 1)/2 different binary classifiers, and the resulting models are saved for the forward phase.
When a new pattern has to be classified, it is fed as input to all the binary classifiers, and the majority vote is taken as the prediction.

3.4. Model selection and error estimation

The selection of the optimal hyperparameters and the unbiased estimation of the performance of the final model is the fundamental problem in learning from data [29]. In this work we will use resampling methods because of their simplicity and effectiveness in many real world problems [29]. Resampling methods rely on a simple idea: the original dataset D_n is resampled once or many (r) times, with or without replacement, to build three independent datasets called training, validation, and test sets, respectively L^i_l, V^i_v, and T^i_t, with i ∈ {1, ..., r}. Note that

L^i_l ∩ V^i_v = ∅,  L^i_l ∩ T^i_t = ∅,  V^i_v ∩ T^i_t = ∅,  L^i_l ∪ V^i_v ∪ T^i_t = D_n,  ∀i ∈ {1, ..., r}.    (30)

Then, in order to select the best set of hyperparameters among a set of possible ones, we choose the one which allows us to create models, based on L^i_l with i ∈ {1, ..., r}, which perform well, on average, on V^i_v with i ∈ {1, ..., r}. Since the data in L^i_l are i.i.d. with the data in V^i_v, the idea is that the best set of hyperparameters should be the one which achieves a small error on a data set that is independent from the training set. Once the best set of hyperparameters is found, one can select the best model by training the algorithm with the whole dataset and testing it on T^i_t. The error of the model trained with L^i_l ∪ V^i_v is independent from T^i_t, and consequently the error on the test set will be an unbiased estimator of the generalization error. In this work we decided to use the non-parametric Bootstrap, the most powerful resampling method [29]. Consequently l = n, and L^i_l must be sampled with replacement from D_n, while V^i_v and T^i_t are sampled without replacement from the part of D_n that has not been sampled in L^i_l [29]. Note that for the non-parametric Bootstrap procedure r ≤ (2n−1 choose n).

4. Experimental results

In this section, we will compare the performance of the proposed LRCM against the SVMs. For this purpose we will consider two main scenarios:

(1) Small Sample Binary Classification Problems and Linear Models: in this case we will consider binary classification problems where d ≫ n. Consequently, a nonlinear formulation is generally not needed, and linear SVM and LRCM will be exploited.

(2) Average or Large Sample Multiclass Problems and Non-Linear Models: in this case we will consider multiclass classification problems where d ≈ n or d ≪ n. Consequently, a nonlinear formulation is generally needed, and non-linear SVM and LRCM will be exploited with the Gaussian kernel.

We will consider a series of standard benchmark datasets available in the literature. When a dataset possesses categorical features, one-hot encoding is exploited. When a column contains missing values, an additional binary feature is included to code the missing value, while the missing entries are replaced with the average value of that column. For what concerns the MS phase, we set r = 1000 in the non-parametric Bootstrap and we search for C, C_1, C_2 ∈ 10^{−6, −5.5, ..., +6} in the linear case; in the non-linear case with the Gaussian kernel, we additionally search for γ ∈ 10^{−6, −5.5, ..., +6}. In all the tables we report the performance on the test set as explained in Section 3.4.

4.1. Small Sample Binary Classification Problems and Linear Models

In order to verify whether the LRCM improves over the performance of the SVMs in the small-sample setting with linear models, we make use of several Human Gene Expression datasets, the same exploited in [18] (see Table 1). Since some of these datasets are multi-class problems, analogously to [18], we map the multi-class problems into two-class problems according to the description of Table 1. In Table 2, we present the average number of errors on the test set, comparing the results obtained with SVM and LRCM. From the results of Table 2 it is possible to note that LRCM outperforms SVM on most of the datasets in a statistically significant way (the table reports Student's t confidence intervals at 95%). Moreover, when LRCM does not outperform SVM, the two perform comparably. The only exception is the Leukemia 1 dataset, where SVM outperforms LRCM.

Please cite this article as: L. Oneto, S. Ridella and D. Anguita, Local Rademacher Complexity Machine, Neurocomputing, https://doi.org/ 10.1016/j.neucom.2018.10.087


Table 1
Human Gene Expression datasets (see [18] for details): mapping of the multi-class into two-class problems.

Id    | Dataset            | d      | n   | Class +1                                                | Class −1
HGE01 | Brain Tumor 1      | 5920   | 90  | Medulloblastoma                                         | Malignant glioma, AT/RT, normal cerebellum, and PNET
HGE02 | Brain Tumor 2      | 10,367 | 50  | Classic glioblastomas and anaplastic oligodendrogliomas | Non-classic glioblastomas and anaplastic oligodendrogliomas
HGE03 | Colon Cancer 1     | 22,283 | 47  | Already two-class
HGE04 | Colon Cancer 2     | 2000   | 62  | Already two-class
HGE05 | DLBCL              | 5469   | 77  | Already two-class
HGE06 | Duke Breast Cancer | 7129   | 44  | Already two-class
HGE07 | Leukemia           | 7129   | 72  | Already two-class
HGE08 | Leukemia 1         | 5327   | 72  | ALL B-cell                                              | ALL T-cell and AML
HGE09 | Leukemia 2         | 11,225 | 72  | ALL                                                     | AML and MLL
HGE10 | Lung Cancer        | 12,600 | 203 | Adeno                                                   | Normal, squamous, COID, and SMCL
HGE11 | Myeloma            | 28,032 | 105 | Already two-class
HGE12 | Prostate Tumor     | 10,509 | 102 | Already two-class
HGE13 | SRBCT              | 2308   | 83  | EWS                                                     | RMS, BL, and NB

Table 2
Human Gene Expression datasets (see Table 1): average number of errors on the test sets performed by SVM and LRCM.

Dataset | SVM        | LRCM
D01     | 5.4 ± 0.3  | 4.2 ± 0.1
D02     | 0.0 ± 0.0  | 0.0 ± 0.0
D03     | 5.0 ± 0.2  | 4.1 ± 0.1
D04     | 4.0 ± 1.0  | 3.7 ± 0.2
D05     | 5.4 ± 0.3  | 4.3 ± 0.3
D06     | 5.0 ± 0.2  | 3.9 ± 0.1
D07     | 5.0 ± 0.0  | 4.1 ± 0.1
D08     | 8.2 ± 0.2  | 8.7 ± 0.4
D09     | 8.0 ± 0.0  | 6.8 ± 0.1
D10     | 14.4 ± 0.3 | 10.1 ± 0.3
D11     | 6.8 ± 0.2  | 6.8 ± 0.1
D12     | 6.6 ± 0.2  | 5.9 ± 0.2
D13     | 6.6 ± 0.2  | 6.2 ± 0.2
# Wins  | 3          | 12

Table 3
Subset of the UCI datasets [43] taken from [44]: percentage of errors on the test sets performed by SVM and LRCM.

Id  | Dataset                                    | n      | d   | c   | SVM          | LRCM
D01 | Annealing                                  | 798    | 38  | 6   | 0.33 ± 0.02  | 0.33 ± 0.03
D02 | Audiology (Standardized)                   | 226    | 69  | 24  | 3.44 ± 0.06  | 3.21 ± 0.06
D03 | Bach                                       | 5665   | 17  | 102 | 3.39 ± 0.04  | 3.91 ± 0.05
D04 | Banknote Authentication                    | 1372   | 5   | 2   | 3.77 ± 0.05  | 3.12 ± 0.01
D05 | Blogger                                    | 100    | 6   | 2   | 4.76 ± 0.19  | 2.52 ± 0.20
D06 | Breast Tissue                              | 106    | 10  | 6   | 8.67 ± 0.27  | 6.78 ± 0.29
D07 | Car Evaluation                             | 1728   | 6   | 4   | 9.67 ± 0.39  | 10.81 ± 0.67
D08 | Contraceptive Method Choice                | 1473   | 9   | 3   | 16.79 ± 0.95 | 15.12 ± 0.85
D09 | CNAE-9                                     | 1080   | 857 | 9   | 13.33 ± 0.54 | 11.12 ± 0.41
D10 | Fertility                                  | 100    | 10  | 2   | 22.06 ± 1.40 | 18.29 ± 0.70
D11 | Glass Identification                       | 214    | 10  | 7   | 21.19 ± 0.67 | 19.98 ± 1.15
D12 | Human Activity Recognition                 | 10,299 | 561 | 6   | 10.67 ± 0.52 | 11.96 ± 0.81
D13 | Horse Colic                                | 368    | 27  | 2   | 10.33 ± 0.85 | 7.73 ± 0.35
D14 | LSVT Voice Rehabilitation                  | 126    | 309 | 2   | 12.33 ± 0.77 | 13.45 ± 0.43
D15 | Mice Protein Expression                    | 1080   | 82  | 8   | 11.67 ± 0.16 | 12.35 ± 0.30
D16 | Libras Movement                            | 360    | 91  | 15  | 23.08 ± 0.97 | 20.98 ± 1.69
D17 | Nursery                                    | 12,960 | 8   | 5   | 15.12 ± 0.87 | 14.26 ± 0.66
D18 | Optical Recognition of Handwritten Digits  | 5620   | 64  | 10  | 12.50 ± 0.30 | 11.56 ± 0.44
D19 | Parkinson                                  | 197    | 23  | 2   | 15.73 ± 0.49 | 10.98 ± 0.59
D20 | Pittsburgh Bridges                         | 108    | 13  | 6   | 11.33 ± 0.57 | 11.33 ± 0.54
D21 | Seeds                                      | 210    | 7   | 3   | 28.13 ± 1.99 | 26.12 ± 1.22
D22 | Image Segmentation                         | 2310   | 19  | 7   | 26.85 ± 1.74 | 24.13 ± 0.62
D23 | Sensorless Drive Diagnosis                 | 58,509 | 49  | 11  | 34.38 ± 0.43 | 32.12 ± 1.64
D24 | Tic-Tac-Toe Endgame                        | 958    | 9   | 2   | 29.79 ± 1.10 | 26.54 ± 0.93
D25 | Wine                                       | 178    | 13  | 3   | 44.64 ± 1.86 | 47.89 ± 1.68
D26 | Yeast                                      | 1484   | 8   | 10  | 43.67 ± 1.42 | 48.49 ± 2.02
# Wins |                                         |        |     |     | 9            | 19
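The non-linear experiments of Table 3 rely on the Gaussian kernel with γ selected on the grid described above. Assuming the usual parameterization K(xi, xj) = exp(−γ‖xi − xj‖²), a minimal sketch of the kernel matrix computation (function name ours):

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2), computed without explicit
    # loops via the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (np.sum(A * A, axis=1)[:, None]
          + np.sum(B * B, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

The clamp to zero guards against small negative distances produced by floating-point cancellation, so the matrix stays a valid Gram matrix in practice.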

4.2. Average or Large Sample Multiclass Problems and Non-Linear Models

In order to verify whether the LRCM improves over the performance of the SVMs in the average or large sample setting with non-linear models, we make use of several UCI datasets [43], the same exploited in [18] (see Table 3). In Table 3, we present the percentage of errors on the test sets, comparing the results obtained with SVM and LRCM. From the results of Table 3 it is possible to note how LRCM outperforms SVM on many of the datasets in a statistically significant way (the table reports the Student's t confidence interval at 95%). In this case LRCM still has a statistical advantage over SVM, but it is less evident with respect to the small sample setting. This can be due to the fact that the regularizer introduced in the LRCM has more effect when d ≫ n, since in that regime there are many solutions with small error and the regularizer can substantially change the selection of the boundary.

5. Discussion

Our work shows for the first time that the Local Rademacher Complexity Theory, like the Vapnik–Chervonenkis one, can be exploited to inspire a new learning algorithm, the Local Rademacher


Complexity Machine, and results on a series of real world datasets show the potential benefit of the proposal, especially for small sample problems. Apart from the fact that our proposal works well in practice and is grounded in a state-of-the-art theory for estimating the generalization ability of a model, this paper presents a new data-dependent regularization scheme. This new regularization term is general enough to be exploited in many other classification algorithms (e.g. Extreme Learning Machines, Multitask Learning, and Neural Networks), and it can be easily extended to regression tasks by substituting the Rademacher Complexity with the Gaussian one. More generally, this work is a step forward toward a new way of designing learning algorithms, and it lays the groundwork for translating other theories. For example, a great result would be to translate the Algorithmic Stability theory into a learning algorithm, capturing the intuition behind the theory itself in a practical tool which can be exploited daily by practitioners.
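The quantity at the heart of the proposal, the empirical Rademacher complexity Rˆn, can itself be approximated numerically by Monte Carlo over random sign vectors. A minimal illustration for a finite set of hypotheses represented by their predictions on the sample (names and setup ours, for intuition only):

```python
import numpy as np

def empirical_rademacher(preds, n_mc=2000, seed=0):
    # preds[j, i] = h_j(x_i): predictions of |H| hypotheses on n points.
    # Estimates E_sigma sup_{h in H} (1/n) sum_i sigma_i h(x_i) by
    # averaging the sup over n_mc random Rademacher sign vectors sigma
    rng = np.random.default_rng(seed)
    n = preds.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return float(np.mean(np.max(sigma @ preds.T / n, axis=1)))
```

For a single hypothesis the estimate concentrates around zero, while for the symmetric class {h, −h} on n points it concentrates roughly around sqrt(2/(π n)), illustrating how richer classes yield larger complexity.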


Appendix A. The dual formulation of LRCM

In this Appendix, we show how the LRCM dual formulation of Eq. (28) can be derived from the LRCM primal formulation of Eq. (27). First let us compute the Lagrangian

L = (1/2)‖w‖² + C1 Σ_{i=1}^{n} ξi^y + C2 Σ_{i=1}^{n} ξi^σ − Σ_{i=1}^{n} αi [yi (w^T φ(xi) + b) − 1 + ξi^y] − Σ_{i=1}^{n} βi [−σi (w^T φ(xi) + b) − 1 + ξi^σ] − Σ_{i=1}^{n} μi ξi^y − Σ_{i=1}^{n} ηi ξi^σ, (A.1)

and list the different constraints. The ones of the primal formulation

yi (w^T φ(xi) + b) ≥ 1 − ξi^y, ∀i ∈ {1, ..., n} (A.2)

−σi (w^T φ(xi) + b) ≥ 1 − ξi^σ, ∀i ∈ {1, ..., n} (A.3)

ξi^y, ξi^σ ≥ 0, ∀i ∈ {1, ..., n}, (A.4)

the ones on the Lagrange multipliers

αi, βi, μi, ηi ≥ 0, ∀i ∈ {1, ..., n}, (A.5)

the complementary conditions

αi [yi (w^T φ(xi) + b) − 1 + ξi^y] = 0, ∀i ∈ {1, ..., n} (A.6)

βi [−σi (w^T φ(xi) + b) − 1 + ξi^σ] = 0, ∀i ∈ {1, ..., n} (A.7)

μi ξi^y = 0, ∀i ∈ {1, ..., n} (A.8)

ηi ξi^σ = 0, ∀i ∈ {1, ..., n}, (A.9)

and the Karush–Kuhn–Tucker (KKT) conditions

∂L/∂w = w − Σ_{i=1}^{n} αi yi φ(xi) + Σ_{i=1}^{n} βi σi φ(xi) = 0 (A.10)

∂L/∂b = −Σ_{i=1}^{n} αi yi + Σ_{i=1}^{n} βi σi = 0 (A.11)

∂L/∂ξi^y = C1 − αi − μi = 0, ∀i ∈ {1, ..., n} (A.12)

∂L/∂ξi^σ = C2 − βi − ηi = 0, ∀i ∈ {1, ..., n}. (A.13)

From Eq. (A.10) we obtain that

w = Σ_{i=1}^{n} (αi yi − βi σi) φ(xi). (A.14)

From Eq. (A.11) we obtain that

Σ_{i=1}^{n} (αi yi − βi σi) = 0. (A.15)

From Eqs. (A.12) and (A.5) we obtain that

αi ≤ C1, ∀i ∈ {1, ..., n}. (A.16)

From Eqs. (A.13) and (A.5) we obtain that

βi ≤ C2, ∀i ∈ {1, ..., n}. (A.17)

By substituting these expressions in the Lagrangian we easily obtain the LRCM dual formulation.
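For completeness, substituting Eq. (A.14) into the Lagrangian, using (A.11)–(A.13) to cancel the terms in b, ξi^y, and ξi^σ, and enforcing (A.15)–(A.17), the dual problem takes the following box-constrained quadratic form. This is a sketch obtained term by term from the KKT system above; the exact statement is Eq. (28) in the main text, which is not reproduced here:

```latex
\max_{\alpha,\beta}\;
  \sum_{i=1}^{n} (\alpha_i + \beta_i)
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
    (\alpha_i y_i - \beta_i \sigma_i)(\alpha_j y_j - \beta_j \sigma_j)\,
    \phi(x_i)^{T}\phi(x_j)
\quad \text{s.t.}\quad
  0 \le \alpha_i \le C_1,\;
  0 \le \beta_i \le C_2,\;
  \sum_{i=1}^{n} (\alpha_i y_i - \beta_i \sigma_i) = 0 .
```

Since the data appear only through the inner products φ(xi)^T φ(xj), the formulation is kernelizable in the usual way.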

Appendix B. Deriving Eq. (18)

Note that the following proof can be derived in a more general scenario and with more refined inequalities, but for the purpose of this paper this proof is general enough. Let us consider Theorem 3.7 of [10], which states that the following bound holds with probability (1 − δ):

L(h) ≤ max_{k≥1} [k/(k−1)] [Lˆn^y(h) + k r∗ + 2 √(ln(3/δ)/(2n))], (B.1)

where r∗ is the fixed point of the following sub-root function ψ(r):

ψ(r) = Rˆn{α ℓ(h(x), y) : α ∈ [0, 1], h ∈ H, (1/n) Σ_{i=1}^{n} ℓ²(h(xi), yi) ≤ (1/α²) [3r + ln(3/δ)/(2n)]} + 3 √(2 ln(3/δ)/n). (B.2)

Let us consider also Theorem 5 of [45], which states that the following bound holds with probability (1 − δ):

Rˆn(H) ≤ Rˆˆn(H) + √(8 ln(2/δ)/n). (B.3)

Let us also consider, for simplicity, the case when the hard loss function is exploited. Then, we can easily state that the following bound holds with probability (1 − δ):

L(h) ≤ max_{k≥1} [k/(k−1)] [Lˆn^y(h) + k r∗ + 2 √(ln(6/δ)/(2n))], (B.4)

where r∗ = ψ(r∗) and

ψ(r∗) = √(2 ln(3/δ)/n) + Rˆn{α ℓ(h(x), y) : α ∈ [0, 1], h ∈ H, (1/n) Σ_{i=1}^{n} ℓ²(h(xi), yi) ≤ (1/α²) [3r∗ + ln(6/δ)/(2n)]}
 ≤ Rˆˆn{α ℓ(h(x), y) : α ∈ [0, 1], h ∈ H, Lˆn^y(h) ≤ (1/α²) [3r∗ + ln(6/δ)/(2n)]} + 3 √(2 ln(6/δ)/n)
 ≤ Rˆˆn{h : h ∈ H, Lˆn^y(h) ≤ 3 Rˆˆn(H) + √(ln(6/δ)/(2n))} + 3 √(2 ln(6/δ)/n). (B.5)

By setting

Δ(kˆ, n, δ) = (3kˆ + 2) √(2 ln(6/δ)/n), (B.6)

cˆ = 3 Rˆˆn(H) + √(ln(6/δ)/(2n)), (B.7)

where cˆ can be obviously computed from the data, and by setting

kˆ = arg max_{k≥1} [k/(k−1)] [Lˆn^y(h) + k r∗ + 2 √(ln(6/δ)/(2n))], (B.8)

our statement of Eq. (18) is derived. Note that the constants can be improved, but this is out of the scope of this presentation.

References

[1] V.N. Vapnik, A.Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Soviet Math. Dokl. 9 (4) (1968) 915–918.
[2] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[3] V. Koltchinskii, Rademacher penalties and structural risk minimization, IEEE Trans. Inf. Theory 47 (5) (2001) 1902–1914.
[4] P.L. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, J. Mach. Learn. Res. 3 (2002) 463–482.
[5] L. Oneto, A. Ghio, S. Ridella, D. Anguita, Global Rademacher complexity bounds: from slow to fast convergence rates, Neural Process. Lett. 43 (2) (2015) 567–602.
[6] L. Oneto, D. Anguita, S. Ridella, A local Vapnik–Chervonenkis complexity, Neural Netw. 82 (2016) 62–75.
[7] P.L. Bartlett, O. Bousquet, S. Mendelson, Localized Rademacher complexities, in: Proceedings of the Computational Learning Theory, 2002.
[8] P.L. Bartlett, O. Bousquet, S. Mendelson, Local Rademacher complexities, Ann. Stat. 33 (4) (2005) 1497–1537.
[9] V. Koltchinskii, Local Rademacher complexities and oracle inequalities in risk minimization, Ann. Stat. 34 (6) (2006) 2593–2656.
[10] L. Oneto, A. Ghio, S. Ridella, D. Anguita, Local Rademacher complexity: sharper risk bounds with and without unlabeled samples, Neural Netw. 65 (2015) 115–125.
[11] T. Liu, G. Lugosi, G. Neu, D. Tao, Algorithmic stability and hypothesis complexity, in: Proceedings of the International Conference on Machine Learning, 2017.
[12] N. Yousefi, Y. Lei, M. Kloft, M. Mollaghasemi, G.C. Anagnostopoulos, Local Rademacher complexity-based learning guarantees for multi-task learning, J. Mach. Learn. Res. 19 (38) (2018) 1–47.
[13] Y. Lei, L. Ding, Y. Bi, Local Rademacher complexity bounds based on covering numbers, Neurocomputing 218 (2016) 320–330.
[14] V. Syrgkanis, A sample complexity measure with applications to learning optimal auctions, in: Proceedings of the Advances in Neural Information Processing Systems, 2017.
[15] Y. Maximov, M.R. Amini, Z. Harchaoui, Rademacher complexity bounds for a penalized multi-class semi-supervised algorithm, J. Artif. Intell. Res. 61 (2018) 761–786.
[16] C. Xu, T. Liu, D. Tao, C. Xu, Local Rademacher complexity for multi-label learning, IEEE Trans. Image Process. 25 (3) (2016) 1495–1507.
[17] N. Zhivotovskiy, S. Hanneke, Localization of VC classes: beyond local Rademacher complexities, Theor. Comput. Sci. 742 (2018) 27–49.
[18] D. Anguita, A. Ghio, L. Oneto, S. Ridella, In-sample model selection for trimmed hinge loss support vector machine, Neural Process. Lett. 36 (3) (2012) 275–283.
[19] D. Anguita, A. Ghio, L. Oneto, S. Ridella, Learning with few bits on small-scale devices: from regularization to energy efficiency, in: Proceedings of the European Symposium on Artificial Neural Networks, 2014.
[20] L. Oneto, N. Navarin, M. Donini, A. Sperduti, F. Aiolli, D. Anguita, Measuring the expressivity of graph kernels through statistical learning theory, Neurocomputing 268 (C) (2017) 4–16.
[21] L. Oneto, N. Navarin, M. Donini, S. Ridella, A. Sperduti, F. Aiolli, D. Anguita, Learning with kernels: a local Rademacher complexity-based analysis with application to graph kernels, IEEE Trans. Neural Netw. Learn. Syst. 29 (10) (2018) 4660–4671.
[22] C. Cortes, M. Kloft, M. Mohri, Learning kernels using local Rademacher complexity, in: Proceedings of the Neural Information Processing Systems, 2013, pp. 2760–2768.
[23] L. Wang, Support Vector Machines: Theory and Applications, Springer Science & Business Media, 2005.
[24] M. Wainberg, B. Alipanahi, B.J. Frey, Are random forests truly the best classifiers? J. Mach. Learn. Res. 17 (1) (2016) 3837–3841.
[25] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[26] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[27] B. Schölkopf, The kernel trick for distances, in: Proceedings of the Neural Information Processing Systems, 2001.
[28] D. Anguita, A. Ghio, L. Oneto, S. Ridella, A deep connection between the Vapnik–Chervonenkis entropy and the Rademacher complexity, IEEE Trans. Neural Netw. Learn. Syst. 25 (12) (2014) 2202–2211.
[29] L. Oneto, Model selection and error estimation without the agonizing pain, WIREs Data Min. Knowl. Discov. 8 (4) (2018).
[30] D. Anguita, A. Ghio, L. Oneto, S. Ridella, In-sample and out-of-sample model selection and error estimation for support vector machines, IEEE Trans. Neural Netw. Learn. Syst. 23 (9) (2012) 1390–1406.
[31] L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, A. Verri, Are loss functions all the same? Neural Comput. 16 (5) (2004) 1063–1076.
[32] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, J. Mach. Learn. Res. 2 (2002) 419–444.
[33] T. Gärtner, P. Flach, S. Wrobel, On graph kernels: hardness results and efficient alternatives, in: Proceedings of the Learning Theory and Kernel Machines, Springer, 2003.
[34] S.S. Keerthi, C.J. Lin, Asymptotic behaviors of support vector machines with Gaussian kernel, Neural Comput. 15 (7) (2003) 1667–1689.
[35] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[36] J. Shawe-Taylor, S. Sun, A review of optimization methodologies in support vector machines, Neurocomputing 74 (17) (2011) 3609–3618.
[37] D. Anguita, A. Ghio, L. Oneto, S. Ridella, Maximal discrepancy vs. Rademacher complexity for error estimation, in: Proceedings of the European Symposium on Artificial Neural Networks, 2011.
[38] M. Aupetit, Nearly homogeneous multi-partitioning with a deterministic generator, Neurocomputing 72 (7) (2009) 1379–1389.
[39] A.C. Lorena, A.C.P.L.F. De Carvalho, J.M.P. Gama, A review on the combination of binary classifiers in multiclass problems, Artif. Intell. Rev. 30 (1–4) (2008) 19.
[40] D. Anguita, S. Ridella, D. Sterpi, A new method for multiclass support vector machines, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2004.
[41] R. Rifkin, A. Klautau, In defense of one-vs-all classification, J. Mach. Learn. Res. 5 (2004) 101–141.
[42] K.B. Duan, S.S. Keerthi, Which is the best multiclass SVM method? An empirical study, in: Proceedings of the International Workshop on Multiple Classifier Systems, 2005, pp. 278–285.
[43] D. Dua, E. Karra Taniskidou, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017, http://archive.ics.uci.edu/ml.
[44] I. Orlandi, L. Oneto, D. Anguita, Random forests model selection, in: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2016.
[45] P. Klesk, M. Korzen, Sets of approximating functions with finite Vapnik–Chervonenkis dimension for nearest-neighbors algorithms, Pattern Recognit. Lett. 32 (14) (2011) 1882–1893.

Luca Oneto was born in Rapallo, Italy in 1986. He received his B.Sc. and M.Sc. in Electronic Engineering at the University of Genoa, Italy, in 2008 and 2010, respectively. In 2014 he received his Ph.D. from the same university in the School of Sciences and Technologies for Knowledge and Information Retrieval with the thesis "Learning Based On Empirical Data". In 2017 he obtained the Italian National Scientific Qualification for the role of Associate Professor in Computer Engineering and in 2018 the one in Computer Science. He is currently an Assistant Professor at the University of Genoa with particular interests in Statistical Learning Theory, Machine Learning, and Data Mining.


Sandro Ridella received the Laurea degree in electronic engineering from the University of Genoa, Genoa, Italy, in 1966. Currently, he is a Full Professor at the Department of Biophysical and Electronic Engineering (DIBE, now DITEN Dept.), University of Genoa, where he teaches circuits and algorithms for signal processing. In the last five years, his scientific activity has been mainly focused on the field of neural networks.


Davide Anguita received the ‘Laurea’ degree in Electronic Engineering and a Ph.D. degree in Computer Science and Electronic Engineering from the University of Genoa, Genoa, Italy, in 1989 and 1993, respectively. After working as a Research Associate at the International Computer Science Institute, Berkeley, CA, on special-purpose processors for neurocomputing, he returned to the University of Genoa. He is currently Associate Professor of Computer Engineering with the Department of Informatics, BioEngineering, Robotics, and Systems Engineering (DIBRIS). His current research focuses on the theory and application of kernel methods and artificial neural networks.
