Refined bounds for online pairwise learning algorithms


ARTICLE IN PRESS

JID: NEUCOM

[m5G;December 5, 2017;16:34]

Neurocomputing 0 0 0 (2017) 1–10

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Refined bounds for online pairwise learning algorithms☆

Xiaming Chen a, Yunwen Lei b,∗

a Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China
b Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China

Article info

Article history: Received 2 April 2017; Revised 7 September 2017; Accepted 17 November 2017; Available online xxx. Communicated by Dr Yiming Ying.

Keywords: Pairwise learning; Online learning; Learning theory; Reproducing kernel Hilbert space

Abstract

Motivated by the recent growing interest in pairwise learning problems, we study the generalization performance of the Online Pairwise lEaRning Algorithm (OPERA) in a reproducing kernel Hilbert space (RKHS) without an explicit regularization. The convergence rates established in this paper can be arbitrarily close to $O(T^{-\frac{1}{2}})$ within $T$ iterations and largely improve the existing convergence rates for OPERA. Our novel analysis is conducted by showing an almost boundedness of the iterates encountered in the learning process with high probability, after establishing an induction lemma on refining the RKHS norm estimate of the iterates.

☆ The work described in this paper is partially supported by the Research Grants Council of Hong Kong [Project No. CityU 11303915] and by National Natural Science Foundation of China under Grants 11461161006, 11471292 and 11771012.
∗ Corresponding author. E-mail addresses: [email protected] (X. Chen), [email protected] (Y. Lei).

https://doi.org/10.1016/j.neucom.2017.11.049
0925-2312/© 2017 Elsevier B.V. All rights reserved.

Please cite this article as: X. Chen, Y. Lei, Refined bounds for online pairwise learning algorithms, Neurocomputing (2017), https://doi.org/10.1016/j.neucom.2017.11.049

1. Introduction

Machine learning often refers to a process of inferring the relationship underlying some examples $\{z_t = (x_t, y_t)\}_{t=1}^{T}$ drawn from a probability measure $\rho$ defined over $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ with a compact input space $\mathcal{X} \subset \mathbb{R}^d$ and an output space $\mathcal{Y} \subset \mathbb{R}$. For many machine learning problems, the relationship can be expressed by a function from $\mathcal{X}$ to $\mathcal{Y}$, and the quality of a model $f : \mathcal{X} \to \mathbb{R}$ can be quantified by a local error $V(y, f(x))$ induced by a function $V : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$. For example, a binary classification problem aims to build a classifier $f$ from $\mathcal{X}$ to $\mathcal{Y} = \{\pm 1\}$, and typical choices of $V$ include the zero-one loss $V(y, a) = 1_{\{ya<0\}}$ and its surrogates $V(y, a) = \phi(ya)$ with a convex nonnegative function $\phi$. Here $1_{\{\cdot\}}$ is the indicator function. Regression problems aim to estimate the output value $y$ by a function $f : \mathcal{X} \to \mathcal{Y}$, and the quality of $f$ at $(x, y)$ can be measured by an increasing function of the discrepancy between $f(x)$ and $y$. We refer to these learning problems as "pointwise learning" since the local error $V(y, f(x))$ only involves a single example $z = (x, y) \in \mathcal{Z}$.

Recently, there is growing interest in another important class of learning problems which we refer to as "pairwise learning" problems. For pairwise learning problems, the associated local error depends on a pair of examples $z = (x, y)$, $\tilde z = (\tilde x, \tilde y)$ and we wish to build an estimator $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. To be precise, the local error of $f$ at $(z, \tilde z)$ can be typically quantified by $V(r(y, \tilde y), f(x, \tilde x))$, where $r : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a reducing function whose specific realization depends on the application domain. Many machine learning problems can be incorporated into the framework of pairwise learning by choosing appropriate reducing functions $r$ and loss functions $V$, including ranking [6,23], similarity and metric learning [2,4,11], AUC maximization [32], gradient learning [22] and learning under the minimum error entropy criterion [8,13,14]. For example, the problem of ranking aims to learn a good order $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ between $z$ and $\tilde z$ such that we predict $y \le \tilde y$ if $f(x, \tilde x) \le 0$. The local error of $f$ at $(z, \tilde z)$ can be naturally quantified by $1_{\{(y-\tilde y) f(x,\tilde x)<0\}}$, which is of the form $V(r(y, \tilde y), f(x, \tilde x))$ by taking $V(r, a) = 1_{\{ra<0\}}$ and $r(y, \tilde y) = y - \tilde y$. A convex surrogate of this 0-1 loss is the so-called least squares ranking loss

\[ V^{sq}(r(y, \tilde y), f(x, \tilde x)) := \big( |y - \tilde y| - \mathrm{sgn}(y - \tilde y)\,(f(x) - f(\tilde x)) \big)^2 \]

studied in [1,3,34,36], where $\mathrm{sgn}(a)$ denotes the sign of $a \in \mathbb{R}$.

Training examples in some machine learning problems become available in a sequential manner. Online learning provides an efficient method to handle these learning tasks by iteratively updating the model $f_t$ upon the arrival of an example $z_t = (x_t, y_t)$, $t \in \mathbb{N}$. As the counterpart of batch learning, which handles all training samples at the same time, online learning enjoys an additional advantage in computational efficiency. This computational advantage is especially appealing in the pairwise learning context since the objective function for batch pairwise learning over $T$ examples involves $O(T^2)$ terms. Motivated by these observations, the generalization analysis of online pairwise learning has recently received considerable attention [3,12,15,18,29,34]. In particular, an error bound of the order $O(T^{-\frac{1}{3}} \log T)$ was established in [34] for an Online Pairwise lEaRning Algorithm (OPERA) in a reproducing kernel Hilbert space (RKHS) after $T$ iterations. Unlike existing work requiring the iterates to be restricted to a bounded domain or the loss function to be strongly convex [15,29], OPERA is implemented in an RKHS, without constraints on the iterates, to minimize a non-strongly convex objective function.

This paper aims to refine these theoretical results. To be precise, we give an error bound of the order arbitrarily close to $O(T^{-\frac{1}{2}})$ for OPERA in [34]. This improvement is achieved by establishing the "boundedness" of the iterates encountered in the learning process, whose norm was only shown to grow at most polynomially with respect to the iteration number in [34]. Our novel analysis is based on an induction lemma showing that $\{f_t\}_{t=1}^{T}$ would belong to a ball of radius $O(t^{\alpha-\nu})$ with high probability if one can show that it belongs to a ball of radius $O(t^{\alpha})$, where $\nu$ is a positive constant depending only on the step size sequence. The "boundedness" of the iterates can then be derived by repeatedly applying this induction lemma.

2. Main results

Throughout this paper, we assume that the training examples

$\{z_t = (x_t, y_t)\}_{t\in\mathbb{N}}$ are independently drawn from $\rho$ in an online manner. We consider online pairwise learning in an RKHS defined on the product space $\mathcal{X}^2 = \mathcal{X} \times \mathcal{X}$. Let $K : \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ be a Mercer kernel, i.e., a continuous, symmetric and positive semidefinite kernel. The associated RKHS $\mathcal{H}_K$ is defined as the completion of the linear span of the functions $\{K_{(x,\tilde x)}(\cdot) := K((x, \tilde x), (\cdot, \cdot)) : (x, \tilde x) \in \mathcal{X}^2\}$ under an inner product satisfying the following reproducing property

\[ \langle K_{(x,\tilde x)}, g \rangle = g(x, \tilde x), \qquad \forall x, \tilde x \in \mathcal{X} \text{ and } g \in \mathcal{H}_K. \]

Denote $\kappa := \sup_{x,\tilde x\in\mathcal{X}} \sqrt{K((x, \tilde x), (x, \tilde x))}$, and throughout the paper we assume that $|y| \le M$ almost surely for some $M > 0$. We study a specific pairwise learning problem with the local error taking the least squares form $V(r(y, \tilde y), f(x, \tilde x)) = (f(x, \tilde x) - y + \tilde y)^2$, which coincides with the least squares ranking loss $V^{sq}$, with applications to ranking problems [1], if $y \ne \tilde y$. These two loss functions would be identical almost surely if the set $\{(z, \tilde z) \in \mathcal{Z} \times \mathcal{Z} : y = \tilde y\}$ is of measure zero under the probability measure $\rho \times \rho$. For this specific pairwise learning problem studied in [3,34,36], an efficient OPERA starting with $f_1 = f_2 = 0$ was introduced in [34] as follows

\[ f_{t+1} = f_t - \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} \big( f_t(x_t, x_j) - y_t + y_j \big) K_{(x_t,x_j)}, \qquad t = 2, 3, \ldots. \tag{2.1} \]

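Here $\{\gamma_t > 0 : t \in \mathbb{N}\}$ is usually referred to as the sequence of step sizes. This paper only considers polynomially decaying step sizes of the form $\gamma_t = \frac{t^{-\theta}}{\mu}$ with $\theta \in (\frac{1}{2}, 1)$ and $\mu \ge \kappa^2$, which implies $\gamma_t \kappa^2 \le 1$ for all $t \in \mathbb{N}$. The generalization error of a function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined by

\[ \mathcal{E}(f) = \int_{\mathcal{Z}\times\mathcal{Z}} (f(x, \tilde x) - y + \tilde y)^2 \, d\rho(x, y)\, d\rho(\tilde x, \tilde y). \]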
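The update (2.1) with the step sizes $\gamma_t = t^{-\theta}/\mu$ can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the Gaussian kernel on pairs (`pair_kernel`), the function name `opera` and the synthetic data are assumptions introduced here. For this kernel $K((x,\tilde x),(x,\tilde x)) = 1$, so $\kappa = 1$ and $\mu = 1$ satisfies $\mu \ge \kappa^2$.

```python
import numpy as np

def pair_kernel(p, q, sigma=1.0):
    """Gaussian kernel on X^2 (an illustrative choice, not from the paper); kappa = 1."""
    d = np.concatenate(p) - np.concatenate(q)
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))

def opera(xs, ys, theta=0.75, mu=1.0, kernel=pair_kernel):
    """Run update (2.1) for t = 2, ..., T with gamma_t = t**(-theta) / mu.

    Returns (pairs, coefs, f), where f evaluates the final iterate
    f_{T+1} = sum_i coefs[i] * K_{pairs[i]}.
    """
    pairs, coefs = [], []  # the empty expansion encodes f_1 = f_2 = 0

    def f(x, x_tilde):
        q = (x, x_tilde)
        return sum(c * kernel(p, q) for p, c in zip(pairs, coefs))

    for t in range(2, len(xs) + 1):
        x_t, y_t = xs[t - 1], ys[t - 1]
        gamma_t = t ** (-theta) / mu
        # All residuals use the current iterate f_t, so compute them before updating.
        residuals = [f(x_t, xs[j]) - y_t + ys[j] for j in range(t - 1)]
        for j, r in enumerate(residuals):
            pairs.append((x_t, xs[j]))
            coefs.append(-gamma_t * r / (t - 1))
    return pairs, coefs, f

# Hypothetical synthetic data with |y| <= M = 1.
rng = np.random.default_rng(0)
xs = [rng.uniform(-1.0, 1.0, size=2) for _ in range(20)]
ys = [float(np.clip(x.sum(), -1.0, 1.0)) for x in xs]
pairs, coefs, f_final = opera(xs, ys)
```

The kernel expansion stores one coefficient per pair $(x_t, x_j)$, so after $T$ examples it holds $T(T-1)/2$ terms; this naive representation is chosen for readability rather than efficiency.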

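Define the pairwise regression function $\tilde f_\rho$ as the difference between two standard regression functions

\[ \tilde f_\rho(x, \tilde x) = f_\rho(x) - f_\rho(\tilde x), \qquad f_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y|x). \]

We denote by $L^2_\rho(\mathcal{X}^2)$ the space of square integrable functions on the domain $\mathcal{X} \times \mathcal{X}$, i.e.,

\[ L^2_\rho(\mathcal{X}^2) = \Big\{ f : \mathcal{X} \times \mathcal{X} \to \mathbb{R} : \|f\|_\rho = \Big( \int_{\mathcal{X}\times\mathcal{X}} |f(x, \tilde x)|^2 \, d\rho_X(x)\, d\rho_X(\tilde x) \Big)^{\frac{1}{2}} < \infty \Big\}, \]

where $\rho_X$ is the marginal distribution of $\rho$ over $\mathcal{X}$. Analogous to the standard least squares regression problem [7], the following identity holds for any $f \in L^2_\rho(\mathcal{X}^2)$ [13,34]

\[ \mathcal{E}(f) - \mathcal{E}(\tilde f_\rho) = \|f - \tilde f_\rho\|_\rho^2, \tag{2.2} \]

from which we also see clearly that $\tilde f_\rho$ minimizes the functional $\mathcal{E}(\cdot)$ among all measurable functions. Our generalization analysis requires a standard regularity assumption on the pairwise regression function in terms of the integral operator $L_K : L^2_\rho(\mathcal{X}^2) \to L^2_\rho(\mathcal{X}^2)$ defined by

\[ L_K f = \int_{\mathcal{X}\times\mathcal{X}} f(x, \tilde x) K_{(x,\tilde x)} \, d\rho_X(x)\, d\rho_X(\tilde x). \]

The integral operator $L_K$ is compact and positive since $K$ is a Mercer kernel, from which the fractional power $L_K^\beta$ ($\beta > 0$) can be well defined by $L_K^\beta f = \sum_{j=1}^\infty \lambda_j^\beta \alpha_j \psi_j$ for $f = \sum_{j=1}^\infty \alpha_j \psi_j \in L^2_\rho(\mathcal{X}^2)$, where $\{\lambda_j\}_{j\in\mathbb{N}}$ are the positive eigenvalues and $\{\psi_j\}_{j\in\mathbb{N}}$ are the associated orthonormal eigenfunctions. Furthermore, we know $L_K^{\frac{1}{2}}(L^2_\rho(\mathcal{X}^2)) = \mathcal{H}_K$ with the norm satisfying

\[ \|g\|_K = \big\| L_K^{-\frac{1}{2}} g \big\|_\rho, \qquad \forall g \in \mathcal{H}_K. \]

Here $L_K^\beta(L^2_\rho(\mathcal{X}^2)) = \{L_K^\beta f : f \in L^2_\rho(\mathcal{X}^2)\}$ for any $\beta > 0$. Throughout this paper, we always assume the existence of an $f^* \in \mathcal{H}_K$ such that $f^* = \arg\min_{f\in\mathcal{H}_K} \mathcal{E}(f)$, which implies that

\[ \frac{1}{2} \nabla \mathcal{E}(f^*) = \int_{\mathcal{Z}\times\mathcal{Z}} \big( f^*(x, \tilde x) - y + \tilde y \big) K_{(x,\tilde x)} \, d\rho(x, y)\, d\rho(\tilde x, \tilde y) = L_K\big(f^* - \tilde f_\rho\big) = 0, \tag{2.3} \]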

where $\nabla$ is the gradient operator. This assumption, weaker than assuming $\tilde f_\rho \in \mathcal{H}_K$, was also imposed in the literature, see, e.g., [5,35]. The following theorem, to be proved in Section 4.3, establishes learning rates for the last iterate $f_{T+1}$ generated by OPERA. For any $a \in \mathbb{R}$, let $\lfloor a \rfloor$ and $\lceil a \rceil$ denote the largest integer not larger than $a$ and the smallest integer not smaller than $a$, respectively.

Theorem 1. Let $\{f_t : t \in \mathbb{N}\}$ be given by OPERA. Suppose $\tilde f_\rho \in L_K^\beta(L^2_\rho(\mathcal{X}^2))$ with some $\beta > 0$. For any $0 < \delta < 1$, with probability $1 - \delta$, there holds

\[ \mathcal{E}(f_{T+1}) - \mathcal{E}(\tilde f_\rho) \le \bar C\, T^{-\min(2\beta,1)(1-\theta)} \log^2\frac{T}{\delta} \Big( \log \frac{T \lceil \frac{1-\theta}{2\theta-1} \rceil}{\delta} \Big)^{\lceil \frac{2-2\theta}{2\theta-1} \rceil} \log^3(eT), \]

where $\bar C$ is a constant independent of $T$ and $\delta$.

Our proof of Theorem 1 is based on the following key observation, to be proved in Section 4.2, on the boundedness of $\{f_t\}_{t=1}^T$ (up to factors of $\log T$) with high probability.

Proposition 2. Let $\{f_t : t \in \mathbb{N}\}$ be given by OPERA and assume $\gamma_t \kappa^2 \le 1$ for any $t \in \mathbb{N}$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \|f_t\|_K \le \tilde C \Big( \log \frac{8T \lceil \frac{1-\theta}{2\theta-1} \rceil}{\delta} + 1 \Big)^{\lceil \frac{1-\theta}{2\theta-1} \rceil} \log(eT), \tag{2.4} \]

where $\tilde C$ is a constant independent of $T$ and $\delta$.
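To make the polynomial part of the rate in Theorem 1 concrete, the following Python sketch evaluates the exponent $\min(2\beta,1)(1-\theta)$ for a few choices of $\theta$. It is an illustration only: the function name `rate_exponent` is hypothetical, the logarithmic factors are ignored, and exact rational arithmetic is used merely to avoid rounding noise.

```python
from fractions import Fraction

def rate_exponent(beta, theta):
    """Exponent a of the factor T**(-a) in Theorem 1: a = min(2*beta, 1) * (1 - theta).

    Logarithmic factors are ignored; theta must lie in (1/2, 1).
    """
    beta, theta = Fraction(beta), Fraction(theta)
    if not Fraction(1, 2) < theta < 1:
        raise ValueError("theta must lie in (1/2, 1)")
    return min(2 * beta, Fraction(1)) * (1 - theta)

# As theta decreases towards 1/2 the exponent increases towards min(beta, 1/2).
for theta in (Fraction(3, 4), Fraction(3, 5), Fraction(51, 100)):
    print(theta, rate_exponent(Fraction(1, 2), theta))
```

For $\beta \ge \frac{1}{2}$ the exponent equals $1-\theta$, which exceeds the exponent $\frac{1}{3}$ of the earlier bound quoted in the Introduction as soon as $\theta < \frac{2}{3}$.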


3. Discussions

Online pointwise learning has received considerable attention due to its computational efficiency in tackling large-scale datasets ubiquitous in the big-data era. For the specific least squares loss function, error bounds were established for online pointwise learning in RKHSs with a regularizer [19,26,28,30] or without a regularizer [31]. The generalization performance of online regularized learning and online unregularized learning in RKHSs involving general loss functions was studied in [16,20,33,35]. These discussions were recently extended to online pointwise learning in Banach spaces to capture the geometry of the problem at hand [17].

Unlike online pointwise learning, which uses unbiased estimates of the true gradients in the learning process, the randomized gradient $\frac{1}{t-1}\sum_{j=1}^{t-1}(f_t(x_t, x_j) - y_t + y_j)K_{(x_t,x_j)}$ used in online pairwise learning is a biased estimate of the gradient $\nabla\mathcal{E}(f_t)$ since $f_t$ depends on $z_1, \ldots, z_{t-1}$. This essential difference between online pointwise learning and online pairwise learning raises challenges in the theoretical analysis of online pairwise learning algorithms. We overcome this difficulty by an error decomposition that allows us to remove the dependency of $f_t$ on previous examples by estimating its RKHS norm with high probability. This error decomposition was used in [34] to derive convergence rates of the order $O\big(T^{-\min(\frac{\beta}{\beta+1}, \frac{1}{3})} \log T \log(8T/\delta)\big)$ ($\delta$ is the confidence level) for OPERA. This paper shows that the convergence rate

\[ O\Big( T^{-\min(2\beta,1)(1-\theta)} \Big( \log \frac{T(1-\theta)}{\delta(2\theta-1)} \Big)^{\lceil \frac{2-2\theta}{2\theta-1} \rceil} \log^2\frac{T}{\delta}\, \log^3(eT) \Big) \]

is possible for OPERA with $\gamma_t = \frac{1}{\mu} t^{-\theta}$, $\theta > 1/2$, which can be arbitrarily close to $O(T^{-\min(\beta, \frac{1}{2})})$ by letting $\theta$ approach $\frac{1}{2}$ (up to logarithmic factors). The suboptimal convergence rate $O\big(T^{-\min(\frac{\beta}{\beta+1}, \frac{1}{3})} \log T \log(8T/\delta)\big)$ in [34] was derived by establishing a crude upper bound on $\|f_t\|_K$ as $O(t^{\frac{1-\theta}{2}})$, while Proposition 2 shows that $f_t$, $t = 1, \ldots, T$, belong to a ball of radius

\[ O\Big( \Big( \log \frac{T \lceil \frac{1-\theta}{2\theta-1} \rceil}{\delta} \Big)^{\lceil \frac{1-\theta}{2\theta-1} \rceil} \log T \Big) \]

with high probability. This "boundedness" (up to logarithmic factors) of the iterates allows us to get improved error bounds.

Below we describe some related work on online pairwise learning algorithms. Online projected pairwise learning algorithms of the form

\[ f_{t+1} = \mathrm{Proj}_{B_R}\Big( f_t - \frac{\gamma_t}{t-1} \sum_{j=1}^{t-1} V'_-\big(r(y_t, y_j), f_t(x_t, x_j)\big) K_{(x_t,x_j)} \Big), \qquad \forall t \in \mathbb{N}, \tag{3.1} \]

were studied in [15,29] by establishing online to batch conversion bounds for the average of iterates $\bar f_T = \frac{1}{T-1}\sum_{t=1}^{T-1} f_t$. Here $\mathrm{Proj}_{B_R}$ denotes the projection onto a ball $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$ of radius $R$ and $V'_-$ denotes the left-hand derivative of $V$ with respect to the second argument. For the specific least squares loss function $V(y, f) = (y - f)^2$ and $r(y, \tilde y) = y - \tilde y$, the error bound for the average output function $\bar f_T$ in [15] reduces to $O\big(T^{-\beta} \log\frac{T}{\delta}\big)$ if $0 < \beta \le \frac{1}{2}$ [34], which is slightly better than our bound for $f_T$ given in Theorem 1. However, our analysis addresses the more challenging problem of the generalization performance of the last iterate of OPERA and therefore does not need an averaging operator. Furthermore, the algorithm (3.1) requires an additional projection step at each iteration, which can be time-consuming and raises a challenging question on how to tune $R$. Although implemented in an unconstrained setting, the iterates $\{f_t\}_{t\in\mathbb{N}}$ in OPERA are shown in this paper to belong to a "bounded" ball with high probability.

A fast convergence rate of the order $O\big(T^{1-2\theta} \log^2(T/\delta)\big)$ was recently established for the estimation error of OPERA (2.1) with the specific kernel $K_{(x,\tilde x)} = x - \tilde x$ [3]. The analysis in [3] crucially relies on the fact that the iterates $f_t$ belong to the range of the associated covariance operator almost surely, which, however, does not hold for a general kernel. Also, the approximation error was not considered in [3].

Online regularized pairwise learning in an RKHS was studied in [12]. Under the decay assumption

\[ D(\lambda) := \inf_{f\in\mathcal{H}_K} \big[ \mathcal{E}^V(f) - \mathcal{E}^V(f_\rho) + \lambda \|f\|_K^2 \big] = O(\lambda^{2\beta}), \qquad 0 < \beta \le \frac{1}{2}, \tag{3.2} \]

the convergence rate $O(T^{-\frac{\beta}{2\beta+1}} \log T)$ was established for a general convex loss function [12]. Here $\mathcal{E}^V(f) = \int_{\mathcal{Z}\times\mathcal{Z}} V(r(y, \tilde y), f(x, \tilde x))\, d\rho(z)\, d\rho(\tilde z)$ is the generalization error of $f$ measured by the loss function $V$. Under the condition (3.2), a generalization error bound of the order $O(T^{-\frac{2\beta}{q+1+4\beta}} \log T)$ was given for online unregularized pairwise learning algorithms involving a convex loss function satisfying an increment condition of order $q$ [18]. For the specific least squares loss function, the error bound in [18] reduces to $O(T^{-\frac{\beta}{1+2\beta}} \log T)$. For the least squares loss function, the decay condition (3.2) holds if $\tilde f_\rho \in L_K^\beta(L^2_\rho)$ [7,27], and the error bound of the order $O(T^{-\frac{\beta}{1+2\beta}})$ (up to logarithmic factors) in [12,18] is worse than our bound, which is arbitrarily close to $O(T^{-\min(\beta, \frac{1}{2})})$. It should be mentioned that the error bounds in [12,18,35] are derived in expectation, while our bounds and the results in [3,34] hold in probability. As future work, it would be interesting to extend the analysis to some other related learning problems [9,10,21,24,25].

4. Proof of main results

4.1. Error decomposition

This section introduces an error decomposition which is crucial for the subsequent analysis. For any $t, k \in \mathbb{N}$ with $k \le t$, denote $\omega_k^t(L_K) = \prod_{j=k}^{t} (I - \gamma_j L_K)$. Define

\[ S_{(z,\tilde z)} = (y - \tilde y) K_{(x,\tilde x)} \qquad \text{and} \qquad \hat S_t = \frac{1}{t-1} \sum_{j=1}^{t-1} S_{(z_t,z_j)}. \]

Denote

\[ S_t = \int_{\mathcal{Z}} \hat S_t \, d\rho(z_t) = \frac{1}{t-1} \sum_{j=1}^{t-1} \int_{\mathcal{X}} (f_\rho(x) - y_j) K_{(x,x_j)} \, d\rho_X(x). \]

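For any $x, \tilde x \in \mathcal{X}$, define the linear operator $L_{(x,\tilde x)} : \mathcal{H}_K \to \mathcal{H}_K$ by

\[ L_{(x,\tilde x)}(f) = f(x, \tilde x) K_{(x,\tilde x)}, \qquad f \in \mathcal{H}_K. \]

For any $t \in \mathbb{N}$, introduce

\[ \hat L_t = \frac{1}{t-1} \sum_{j=1}^{t-1} L_{(x_t,x_j)} \qquad \text{and} \qquad L_t = \int_{\mathcal{Z}} \hat L_t \, d\rho(z_t). \]

That is,

\[ \hat L_t(f) = \frac{1}{t-1} \sum_{j=1}^{t-1} f(x_t, x_j) K_{(x_t,x_j)}, \qquad L_t(f) = \frac{1}{t-1} \sum_{j=1}^{t-1} \int_{\mathcal{X}} f(x, x_j) K_{(x,x_j)} \, d\rho_X(x). \]

The iteration (2.1) can be rewritten as

\[ f_{t+1} = f_t - \gamma_t \big( \hat L_t(f_t) - \hat S_t \big) = (I - \gamma_t L_K) f_t - \gamma_t (\hat L_t - L_K)(f_t) + \gamma_t \hat S_t, \]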


from which one gets the following error decomposition for $f_{t+1} - \tilde f_\rho$ [34]

\[ f_{t+1} - \tilde f_\rho = -\omega_2^t(L_K)(\tilde f_\rho) - \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A^j - \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^j. \tag{4.1} \]

Here, we introduce

\[ \hat A^j = (L_j - L_K) f_j - \big( S_j - L_K \tilde f_\rho \big), \qquad \hat B^j = (\hat L_j - L_j) f_j - \big( \hat S_j - S_j \big). \]

The term $\omega_2^t(L_K)(\tilde f_\rho)$ depends on the regularity of the pairwise regression function and we refer to it as the approximation error, while the term $\sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A^j + \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^j$ depends on the training examples and is referred to as the sample error. The underlying reason to decompose the sample error into two terms involving $\hat A^j$ and $\hat B^j$ is that the summands of $\sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^j$ form a martingale difference sequence and $L_j - L_K$ is a summation of independent and identically distributed (i.i.d.) random variables with expectation zero.

4.2. Boundedness of iterates

This section aims to prove Proposition 2 on the boundedness of $\{f_t\}_{t=1}^T$ (up to a logarithmic factor of $T$). Our proof is based on Proposition 3 and Lemma 4. Proposition 3 gives a probabilistic bound for the RKHS norm of a varied sample error where each $f_j$ is multiplied by the indicator function of the event $\{\|f_j\|_K \le C j^\alpha\}$. Based on this key proposition, Lemma 4 further shows that $\{f_t\}_{t=1}^T$ would belong to an RKHS ball of radius $O(t^{\frac{1}{2}+\alpha-\theta})$ with high probability if one can show in probability that $\|f_t\|_K \le C t^\alpha$, $\forall t = 2, \ldots, T$. A key observation is that the exponent would be improved from $\alpha$ to $\frac{1}{2} + \alpha - \theta$ if $\theta > \frac{1}{2}$. Then, Proposition 2 can be proved by applying Proposition 3 $\lceil \frac{1-\theta}{2\theta-1} \rceil$ times. The identity (2.3) motivates us to consider the following error decomposition

\[ f_{t+1} - f^* = -\omega_2^t(L_K)(f^*) - \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \bar A^j - \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^j, \tag{4.2} \]

where

\[ \bar A^j := (L_j - L_K) f_j - \big( S_j - L_K f^* \big). \]

The underlying reason to consider the error decompositions (4.1) and (4.2), involving $\tilde f_\rho$ and $f^*$, respectively, is due to the identity $E[S_j] = L_K \tilde f_\rho = L_K f^*$. As compared to (4.1), an advantage of (4.2) is that $f^*$ belongs to $\mathcal{H}_K$, which would be essential for us to establish the almost boundedness of the iterates with high probability. For any $C \ge 1$, $\alpha \in (0, \theta]$ and $j \in \mathbb{N}$, introduce

\[ \bar A^{j,C,\alpha} = (L_j - L_K) f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} - \big( S_j - L_K f^* \big), \qquad \hat B^{j,C,\alpha} = (\hat L_j - L_j) f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} - \big( \hat S_j - S_j \big). \tag{4.3} \]

Proposition 3. Let $C \ge 1$, $\theta \in (\frac{1}{2}, 1)$ and $\alpha \in (0, \theta]$. Then for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following inequality holds for all $2 \le t \le T$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \bar A^{j,C,\alpha} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^{j,C,\alpha} \Big\|_K \le \begin{cases} \dfrac{C C_{\delta,T}\, t^{\frac{1}{2}+\alpha-\theta}}{1+2\alpha-2\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[2mm] C C_{\delta,T} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}, \end{cases} \tag{4.4} \]

where $C_{\delta,T} = \max\big( (25M + 15\kappa)\kappa\mu^{-1}, 1 \big) \log\frac{8T}{\delta}$.

The proof of Proposition 3, based on the error decomposition (4.2), is postponed to the end of this subsection.

Let $HS(\mathcal{H}_K)$ be the Hilbert space of Hilbert-Schmidt operators on $\mathcal{H}_K$ with the inner product $\langle A, B \rangle = \mathrm{Tr}(A B^\top)$ for any $A, B \in HS(\mathcal{H}_K)$, where $B^\top$ denotes the adjoint of $B$ and $\mathrm{Tr}$ denotes the trace of a linear operator. The space $HS(\mathcal{H}_K)$ belongs to the space $(L(\mathcal{H}_K), \|\cdot\|_{L(\mathcal{H}_K)})$ of bounded linear operators on $\mathcal{H}_K$, satisfying

\[ \|A\|_{L(\mathcal{H}_K)} \le \|A\|_{HS}, \qquad \forall A \in HS(\mathcal{H}_K). \tag{4.5} \]

Lemma 4 (Induction lemma). Let $C \ge 1$, $\delta_0 \in (0, 1)$, $\alpha \in (0, \theta]$ and $\delta \in (0, 1 - \delta_0)$. If, with probability $1 - \delta_0$, the following inequality holds for all $2 \le t \le T$

\[ \|f_t\|_K \le C t^\alpha, \]

then, with probability $1 - \delta_0 - \delta$, the following inequality holds for all $2 \le t \le T$

\[ \|f_t\|_K \le \begin{cases} \dfrac{C \tilde C_{\delta,T}\, t^{\frac{1}{2}+\alpha-\theta}}{1+2\alpha-2\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[2mm] C \tilde C_{\delta,T} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}, \end{cases} \]

where we introduce $\tilde C_{\delta,T} = C_{\delta,T} + \|f^*\|_K$.

Proof. Introduce

\[ \Omega = \big\{ (z_1, \ldots, z_T) : \|f_j\|_K \le C j^\alpha, \ j = 1, \ldots, T \big\}. \]

We know $\Pr(\Omega) \ge 1 - \delta_0$. For any $(z_1, \ldots, z_T) \in \Omega$, we get $\bar A^{j,C,\alpha} = \bar A^j$ and $\hat B^{j,C,\alpha} = \hat B^j$, $\forall j = 1, \ldots, T$, and (4.2) then implies that

\[ \begin{aligned} \|f_{t+1}\|_K &\le \|I - \omega_2^t(L_K)\|_{L(\mathcal{H}_K)} \|f^*\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \bar A^j \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^j \Big\|_K \\ &\le \|f^*\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \bar A^{j,C,\alpha} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^{j,C,\alpha} \Big\|_K. \end{aligned} \]

According to Proposition 3, there exists a set $\tilde\Omega \subset \mathcal{Z}^T$ with probability measure at least $1 - \delta$ such that (4.4) holds for any $(z_1, \ldots, z_T) \in \tilde\Omega$. Then, for any $(z_1, \ldots, z_T) \in \Omega \cap \tilde\Omega$, the following inequality holds for all $2 \le t \le T$

\[ \|f_{t+1}\|_K \le \|f^*\|_K + \begin{cases} \dfrac{C C_{\delta,T}\, t^{\frac{1}{2}+\alpha-\theta}}{1+2\alpha-2\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[2mm] C C_{\delta,T} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}. \end{cases} \]

It is clear that $\Pr(\Omega \cap \tilde\Omega) \ge 1 - \delta - \delta_0$. The stated inequality then follows by taking $\tilde C_{\delta,T} = C_{\delta,T} + \|f^*\|_K$ and noting $f_1 = f_2 = 0$. The proof is complete. □

Proof of Proposition 2. According to Lemma 1 in [34], we know that

\[ \|f_t\|_K \le C_1 t^{\frac{1-\theta}{2}}, \qquad \forall t = 2, 3, \ldots, T, \]

where $C_1 = \max(2M, 1)$. Iteratively applying Lemma 4 with $\alpha = \frac{1-\theta}{2} - k\big(\theta - \frac{1}{2}\big)$, $k = 0, \ldots, \lceil \frac{2-3\theta}{2\theta-1} \rceil - 1$, would show that with probability $1 - \lceil \frac{2-3\theta}{2\theta-1} \rceil \delta$, the following inequality holds for all $t = 2, \ldots, T$ (note $\alpha \le \theta$)

\[ \|f_t\|_K \le C_1 \prod_{k=0}^{\lceil \frac{2-3\theta}{2\theta-1} \rceil - 1} \frac{\tilde C_{\delta,T}}{2 - 3\theta - k(2\theta - 1)} \; t^{\frac{1-\theta}{2} - \lceil \frac{2-3\theta}{2\theta-1} \rceil (\theta - \frac{1}{2})}. \]

Note $\frac{1-\theta}{2} - \lceil \frac{2-3\theta}{2\theta-1} \rceil \big(\theta - \frac{1}{2}\big) \le \theta - \frac{1}{2}$. Applying again Lemma 4 with $\alpha = \frac{1-\theta}{2} - \lceil \frac{2-3\theta}{2\theta-1} \rceil \big(\theta - \frac{1}{2}\big)$ then shows that with probability at least $1 - \lceil \frac{1-\theta}{2\theta-1} \rceil \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \begin{aligned} \|f_t\|_K &\le C_1 \tilde C_{\delta,T} \prod_{k=0}^{\lceil \frac{2-3\theta}{2\theta-1} \rceil - 1} \frac{\tilde C_{\delta,T}}{2 - 3\theta - k(2\theta - 1)} \log(eT) \\ &= \max(2M, 1)\, C_\theta \Big( \max\big( (25M + 15\kappa)\kappa\mu^{-1}, 1 \big) \log\frac{8T}{\delta} + \|f^*\|_K \Big)^{\lceil \frac{1-\theta}{2\theta-1} \rceil} \log(eT), \end{aligned} \]

where

\[ C_\theta = \prod_{k=0}^{\lceil \frac{2-3\theta}{2\theta-1} \rceil - 1} \frac{1}{2 - 3\theta - k(2\theta - 1)}. \]

The above probabilistic inequality can be written in the stated form. The proof is complete. □

We now turn to prove Proposition 3. For this purpose, we need to introduce three lemmas. Lemma 5 provides elementary inequalities for series. Lemma 6 establishes a Bennett's inequality for random variables in a Hilbert space [26] and Lemma 7 is the Pinelis-Bernstein inequality for martingale difference sequences in a Hilbert space [28].
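Returning briefly to the proof of Proposition 2, the sequence of exponents used there can be traced with exact rational arithmetic. The sketch below is illustrative only (the function name `exponent_schedule` is hypothetical): it lists $\alpha_k = \frac{1-\theta}{2} - k(\theta - \frac{1}{2})$ for $k = 0, \ldots, \lceil \frac{2-3\theta}{2\theta-1} \rceil$ and counts the applications of Lemma 4, which should equal $\lceil \frac{1-\theta}{2\theta-1} \rceil$.

```python
from fractions import Fraction
from math import ceil

def exponent_schedule(theta):
    """Trace the exponents alpha_k from the proof of Proposition 2.

    Returns (alphas, rounds): alphas[k] = (1 - theta)/2 - k*(theta - 1/2) for
    k = 0, ..., K with K = ceil((2 - 3*theta)/(2*theta - 1)), and rounds = K + 1
    is the total number of applications of Lemma 4.
    """
    theta = Fraction(theta)
    if not Fraction(1, 2) < theta < 1:
        raise ValueError("theta must lie in (1/2, 1)")
    step = theta - Fraction(1, 2)
    K = ceil((2 - 3 * theta) / (2 * theta - 1))  # nonnegative for theta in (1/2, 1)
    alphas = [(1 - theta) / 2 - k * step for k in range(K + 1)]
    return alphas, K + 1

alphas, rounds = exponent_schedule(Fraction(11, 20))
print([str(a) for a in alphas], rounds)
```

The first $K$ exponents exceed $\theta - \frac{1}{2}$, so each of those rounds lowers the polynomial exponent by $\theta - \frac{1}{2}$; the last exponent is at most $\theta - \frac{1}{2}$, which is the case of Lemma 4 that yields the $\log(et)$ bound.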

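Lemma 5. For any $\lambda > 0$, we have

\[ \sum_{j=1}^{t} j^{-\lambda} \le 1 + \sum_{j=2}^{t} \int_{j-1}^{j} x^{-\lambda} \, dx \le \begin{cases} \dfrac{t^{1-\lambda} - \lambda}{1-\lambda}, & \text{if } \lambda < 1, \\[2mm] \log(et), & \text{if } \lambda = 1, \\[1mm] \dfrac{\lambda}{\lambda-1}, & \text{if } \lambda > 1. \end{cases} \tag{4.6} \]

Lemma 6. Let $\{\xi_i : i = 1, \ldots, t\}$ be independent random variables in a Hilbert space $H$ with norm $\|\cdot\|$. Suppose that almost surely $\|\xi_i\| \le B$ and $E[\|\xi_i\|^2] \le \sigma^2 < \infty$. Then, for any $0 < \delta < 1$, the following inequality holds with probability at least $1 - \delta$

\[ \Big\| \frac{1}{t} \sum_{i=1}^{t} \big[ \xi_i - E[\xi_i] \big] \Big\| \le \frac{2B \log\frac{2}{\delta}}{t} + \sigma \sqrt{\frac{\log\frac{2}{\delta}}{t}}. \]

Lemma 7. Let $\{S_k : k \in \mathbb{N}\}$ be a martingale difference sequence in a Hilbert space with norm $\|\cdot\|$. Suppose that almost surely $\|S_k\| \le B$ and $\sum_{k=1}^{t} E[\|S_k\|^2 \,|\, S_1, \ldots, S_{k-1}] \le \sigma_t^2$. Then, for any $0 < \delta < 1$, the following inequality holds with probability at least $1 - \delta$

\[ \max_{1\le j\le t} \Big\| \sum_{k=1}^{j} S_k \Big\| \le 2 \Big( \frac{B}{3} + \sigma_t \Big) \log\frac{2}{\delta}. \]

Proof of Proposition 3. For any $2 \le t \le T$, the following inequality is clear

\[ \begin{aligned} &\Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \bar A^{j,C,\alpha} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^{j,C,\alpha} \Big\|_K \\ &\quad \le \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_1^{j,C,\alpha} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_2^{j} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_2^{j} \Big\|_K, \end{aligned} \tag{4.7} \]

where we introduce

\[ \hat A_1^{j,C,\alpha} = (L_j - L_K) f_j 1_{\{\|f_j\|_K \le C j^\alpha\}}, \qquad \hat B_1^{j,C,\alpha} = (\hat L_j - L_j) f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} \]

and

\[ \hat A_2^{j} = -\big( S_j - L_K f^* \big), \qquad \hat B_2^{j} = -\big( \hat S_j - S_j \big). \]

We now proceed with the proof in five steps.

Step 1: Estimate $\big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_2^{j} \big\|_K$. Let the vector-valued random variable $\xi(z) = \int_{\mathcal{X}} (f_\rho(\tilde x) - y) K_{(\tilde x,x)} \, d\rho_X(\tilde x) \in \mathcal{H}_K$. It follows from (2.3) that

\[ E[\xi] = \int_{\mathcal{Z}\times\mathcal{Z}} \big( f_\rho(\tilde x) - f_\rho(x) \big) K_{(\tilde x,x)} \, d\rho(x, y)\, d\rho(\tilde x, \tilde y) = L_K \tilde f_\rho = L_K f^*. \]

It holds that $\|\xi\|_K \le \int_{\mathcal{X}} |f_\rho(\tilde x) - y| \, \|K_{(\tilde x,x)}\|_K \, d\rho_X(\tilde x) \le 2\kappa M$. For any $j = 2, \ldots, T$, applying Lemma 6 with $B = \sigma = 2\kappa M$ and $H = \mathcal{H}_K$ shows the following inequality with probability $1 - \frac{\delta}{T}$

\[ \|S_j - L_K f^*\|_K = \Big\| \frac{1}{j-1} \sum_{\ell=1}^{j-1} \xi(z_\ell) - E[\xi] \Big\|_K \le \frac{4\kappa M \log\frac{2T}{\delta}}{j-1} + 2\kappa M \sqrt{\frac{\log\frac{2T}{\delta}}{j-1}} \le \frac{6\sqrt{2}\kappa M \log\frac{2T}{\delta}}{\sqrt{j}}. \tag{4.8} \]

This, together with $\gamma_j = \mu^{-1} j^{-\theta}$ and (4.6) with $\theta \ge \frac{1}{2}$, shows that with probability $1 - \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \sum_{j=2}^{t} \gamma_j \|\hat A_2^{j}\|_K \le \frac{6\sqrt{2} M \kappa}{\mu} \log\frac{2T}{\delta} \sum_{j=2}^{t} j^{-\theta-\frac{1}{2}} \le \frac{6\sqrt{2} M \kappa}{\mu} \log\frac{2T}{\delta} \log(et). \]

Since $\|\omega_{j+1}^t(L_K)\|_{L(\mathcal{H}_K)} \le 1$, with probability at least $1 - \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_2^{j} \Big\|_K \le \sum_{j=2}^{t} \gamma_j \|\omega_{j+1}^t(L_K)\|_{L(\mathcal{H}_K)} \|\hat A_2^{j}\|_K \le \sum_{j=2}^{t} \gamma_j \|\hat A_2^{j}\|_K \le \frac{6\sqrt{2} M \kappa}{\mu} \log\frac{2T}{\delta} \log(et). \tag{4.9} \]

Step 2: Estimate $\big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_2^{j} \big\|_K$. It is clear that $E[\hat B_2^{j} \,|\, z_1, \ldots, z_{j-1}] = 0$ and therefore $\{\xi_j = \gamma_j \omega_{j+1}^t(L_K) \hat B_2^{j} : j \in \mathbb{N}\}$ is a martingale difference sequence. Since $y \in [-M, M]$ almost surely, we know

\[ \|\omega_{j+1}^t(L_K) \hat B_2^{j}\|_K \le \|\hat B_2^{j}\|_K \le 4\kappa M, \]

from which and (4.6) with $\theta \ge \frac{1}{2}$, we derive the following inequalities for any $2 \le t \le T$

\[ \sum_{j=2}^{t} \gamma_j^2 E\big[ \|\omega_{j+1}^t(L_K) \hat B_2^{j}\|_K^2 \,\big|\, z_1, \ldots, z_{j-1} \big] \le \big( 4\kappa M \mu^{-1} \big)^2 \sum_{j=2}^{t} j^{-2\theta} \le \big( 4\kappa M \mu^{-1} \big)^2 \log(et) \]

and

\[ \max_{2\le j\le t} \|\gamma_j \omega_{j+1}^t(L_K) \hat B_2^{j}\|_K \le \max_{2\le j\le t} \gamma_j \|\hat B_2^{j}\|_K \le 4\kappa M \mu^{-1}. \]

For any $2 \le t \le T$, applying Lemma 7 then yields the following inequality with probability $1 - \frac{\delta}{T}$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_2^{j} \Big\|_K \le 2 \Big( \frac{4}{3} \kappa M \mu^{-1} + 4\kappa M \mu^{-1} \sqrt{\log(et)} \Big) \log\frac{2T}{\delta} \le \frac{32 M \kappa \mu^{-1}}{3} \sqrt{\log(et)} \log\frac{2T}{\delta}. \]

Therefore, with probability $1 - \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_2^{j} \Big\|_K \le \frac{32 M \kappa \mu^{-1}}{3} \sqrt{\log(et)} \log\frac{2T}{\delta}. \tag{4.10} \]

Step 3: Estimate $\big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_1^{j,C,\alpha} \big\|_K$. For any $2 \le t \le T$, according to $\|\omega_{j+1}^t(L_K)\|_{L(\mathcal{H}_K)} \le 1$ and (4.5), we derive

\[ \begin{aligned} \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_1^{j,C,\alpha} \Big\|_K &\le \sum_{j=2}^{t} \gamma_j \|\omega_{j+1}^t(L_K) \hat A_1^{j,C,\alpha}\|_K \le \sum_{j=2}^{t} \gamma_j \big\| (L_j - L_K) f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} \big\|_K \\ &\le \sum_{j=2}^{t} \gamma_j \|L_j - L_K\|_{L(\mathcal{H}_K)} \big\| f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} \big\|_K \le C \sum_{j=2}^{t} \gamma_j \|L_j - L_K\|_{HS}\, j^\alpha. \end{aligned} \]

Define the vector-valued random variable $\xi(x) = \int_{\mathcal{X}} \langle \cdot, K_{(\tilde x,x)} \rangle_K K_{(\tilde x,x)} \, d\rho_X(\tilde x)$. Since $\|\langle \cdot, K_{(\tilde x,x)} \rangle_K K_{(\tilde x,x)}\|_{HS} \le \kappa^2$, we have $\|\xi\|_{HS} \le \int_{\mathcal{X}} \|\langle \cdot, K_{(\tilde x,x)} \rangle_K K_{(\tilde x,x)}\|_{HS} \, d\rho_X(\tilde x) \le \kappa^2$. For any $2 \le j \le T$, applying Lemma 6 with $B = \sigma = \kappa^2$ and $H = HS(\mathcal{H}_K)$ gives the following inequality with probability $1 - \frac{\delta}{T}$

\[ \|L_j - L_K\|_{HS} = \Big\| \frac{1}{j-1} \sum_{\ell=1}^{j-1} \xi(x_\ell) - E[\xi] \Big\|_{HS} \le \frac{2\kappa^2 \log\frac{2T}{\delta}}{j-1} + \kappa^2 \sqrt{\frac{\log\frac{2T}{\delta}}{j-1}} \le \frac{3\sqrt{2}\kappa^2 \log\frac{2T}{\delta}}{\sqrt{j}}. \tag{4.11} \]

Therefore, with probability $1 - \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \sum_{j=2}^{t} \gamma_j \|L_j - L_K\|_{HS}\, j^\alpha \le 3\sqrt{2}\kappa^2\mu^{-1} \log\frac{2T}{\delta} \sum_{j=2}^{t} j^{\alpha-\theta-\frac{1}{2}} \le \begin{cases} \dfrac{3\sqrt{2}\kappa^2\mu^{-1} \log\frac{2T}{\delta}}{\frac{1}{2}+\alpha-\theta}\, t^{\frac{1}{2}+\alpha-\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[2mm] 3\sqrt{2}\kappa^2\mu^{-1} \log\frac{2T}{\delta} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}. \end{cases} \]

Then, with probability $1 - \delta$, the following inequality holds for all $t = 2, \ldots, T$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat A_1^{j,C,\alpha} \Big\|_K \le \begin{cases} \dfrac{3\sqrt{2} C \kappa^2\mu^{-1} \log\frac{2T}{\delta}}{\frac{1}{2}+\alpha-\theta}\, t^{\frac{1}{2}+\alpha-\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[2mm] 3\sqrt{2} C \kappa^2\mu^{-1} \log\frac{2T}{\delta} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}. \end{cases} \tag{4.12} \]

Step 4: Estimate $\big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha} \big\|_K$. Since $f_j 1_{\{\|f_j\|_K \le C j^\alpha\}}$ depends only on $z_1, \ldots, z_{j-1}$, we know

\[ E\big[ \hat B_1^{j,C,\alpha} \,\big|\, z_1, \ldots, z_{j-1} \big] = E\big[ \hat L_j - L_j \,\big|\, z_1, \ldots, z_{j-1} \big] f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} = 0. \]

Therefore, $\{\xi_j = \gamma_j \omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha} : j \in \mathbb{N}\}$ is a martingale difference sequence. The term $\hat B_1^{j,C,\alpha}$ can be controlled as follows

\[ \|\hat B_1^{j,C,\alpha}\|_K \le \|\hat L_j - L_j\|_{L(\mathcal{H}_K)} \big\| f_j 1_{\{\|f_j\|_K \le C j^\alpha\}} \big\|_K \le C \|\hat L_j - L_j\|_{HS}\, j^\alpha \le 2\kappa^2 C j^\alpha, \]

from which and $\alpha \le \theta$ we derive

\[ \sum_{j=2}^{t} \gamma_j^2 E\big[ \|\omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha}\|_K^2 \,\big|\, z_1, \ldots, z_{j-1} \big] \le \big( 2\mu^{-1} C \kappa^2 \big)^2 \sum_{j=2}^{t} j^{2(\alpha-\theta)} \le \begin{cases} \dfrac{(2C\mu^{-1}\kappa^2)^2}{1+2\alpha-2\theta}\, t^{1+2\alpha-2\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[2mm] (2C\mu^{-1}\kappa^2)^2 \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}, \end{cases} \]

and

\[ \max_{2\le j\le t} \|\gamma_j \omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha}\|_K \le 2\kappa^2 C \max_{2\le j\le t} \gamma_j j^\alpha \le 2C\kappa^2\mu^{-1}. \]

Now, for any $2 \le t \le T$, we can apply Lemma 7 to derive the following inequality with probability $1 - \frac{\delta}{T}$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha} \Big\|_K \le \begin{cases} 2C\mu^{-1}\kappa^2 \Big( \dfrac{2}{3} + \dfrac{2\, t^{\frac{1}{2}+\alpha-\theta}}{\sqrt{1+2\alpha-2\theta}} \Big) \log\dfrac{2T}{\delta} \le \dfrac{16 C \mu^{-1}\kappa^2}{3\sqrt{1+2\alpha-2\theta}}\, t^{\frac{1}{2}+\alpha-\theta} \log\dfrac{2T}{\delta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[3mm] 2C\kappa^2\mu^{-1} \Big( \dfrac{2}{3} + 2\sqrt{\log(et)} \Big) \log\dfrac{2T}{\delta} \le \dfrac{16 C \mu^{-1}\kappa^2}{3} \sqrt{\log(et)} \log\dfrac{2T}{\delta}, & \text{if } \alpha \le \theta - \frac{1}{2}. \end{cases} \]

Therefore, with probability at least $1 - \delta$, the following inequality holds for all $2 \le t \le T$

\[ \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B_1^{j,C,\alpha} \Big\|_K \le \begin{cases} \dfrac{6C\mu^{-1}\kappa^2}{\sqrt{1+2\alpha-2\theta}}\, t^{\frac{1}{2}+\alpha-\theta} \log\dfrac{2T}{\delta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[3mm] 6C\mu^{-1}\kappa^2 \sqrt{\log(et)} \log\dfrac{2T}{\delta}, & \text{if } \alpha \le \theta - \frac{1}{2}. \end{cases} \tag{4.13} \]

Step 5: Combination of the obtained bounds. Combining (4.7), (4.9), (4.10), (4.12) and (4.13) together, for any $0 < \delta < 1/4$, with probability $1 - 4\delta$, the following inequality holds for all $2 \le t \le T$

\[ \begin{aligned} &\Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \bar A^{j,C,\alpha} \Big\|_K + \Big\| \sum_{j=2}^{t} \gamma_j \omega_{j+1}^t(L_K) \hat B^{j,C,\alpha} \Big\|_K \\ &\quad \le 20\mu^{-1} M \kappa \log(et) \log\frac{2T}{\delta} + \begin{cases} \dfrac{15 C \kappa^2\mu^{-1}}{1+2\alpha-2\theta}\, t^{\frac{1}{2}+\alpha-\theta} \log\dfrac{2T}{\delta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[3mm] 11 C \kappa^2\mu^{-1} \log\dfrac{2T}{\delta} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}, \end{cases} \\ &\quad \le \begin{cases} (25M + 15\kappa)\kappa\mu^{-1} C\, \dfrac{\log\frac{2T}{\delta}}{1+2\alpha-2\theta}\, t^{\frac{1}{2}+\alpha-\theta}, & \text{if } \alpha > \theta - \frac{1}{2}, \\[3mm] (20M + 11\kappa)\kappa\mu^{-1} C \log\dfrac{2T}{\delta} \log(et), & \text{if } \alpha \le \theta - \frac{1}{2}, \end{cases} \end{aligned} \]

where we have used the following elementary inequality for the case $\alpha > \theta - \frac{1}{2}$

\[ \log(et) \le \frac{2 (et)^{\frac{1}{2}+\alpha-\theta}}{e(1+2\alpha-2\theta)} = \frac{2\, t^{\frac{1}{2}+\alpha-\theta} e^{\alpha-\theta-\frac{1}{2}}}{1+2\alpha-2\theta}. \]

The above inequality can be rewritten as (4.4). The proof is complete. □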
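Two elementary facts used in the proof of Proposition 3 are easy to sanity-check numerically: the series bound of Lemma 5 and the inequality $\log(et) \le \frac{2(et)^{1/2+\alpha-\theta}}{e(1+2\alpha-2\theta)}$ from Step 5. The Python sketch below is an illustration under the assumption that the first case of Lemma 5 reads $\frac{t^{1-\lambda}-\lambda}{1-\lambda}$; the helper names are hypothetical.

```python
import math

def partial_sum(lam, t):
    """Left-hand side of Lemma 5: sum_{j=1}^t j**(-lam)."""
    return sum(j ** (-lam) for j in range(1, t + 1))

def lemma5_bound(lam, t):
    """Right-hand side of Lemma 5 in its three cases (as assumed in the lead-in)."""
    if lam < 1:
        return (t ** (1 - lam) - lam) / (1 - lam)
    if lam == 1:
        return math.log(math.e * t)
    return lam / (lam - 1)

def elementary_gap(alpha, theta, t):
    """Right side minus left side of the Step 5 inequality; needs alpha > theta - 1/2."""
    a = 0.5 + alpha - theta
    rhs = 2 * (math.e * t) ** a / (math.e * (1 + 2 * alpha - 2 * theta))
    return rhs - math.log(math.e * t)

# Spot checks on a small grid; the tolerance absorbs floating-point rounding.
for lam in (0.3, 0.5, 1.0, 1.5, 2.0):
    for t in (1, 2, 10, 1000):
        assert partial_sum(lam, t) <= lemma5_bound(lam, t) + 1e-12
for alpha, theta in ((0.6, 0.75), (0.3, 0.6), (0.75, 0.75)):
    assert min(elementary_gap(alpha, theta, t) for t in range(1, 2001)) >= -1e-12
```

The second check is an instance of the general inequality $\log x \le \frac{x^a}{ea}$ for $x > 0$ and $a > 0$, applied with $x = et$ and $a = \frac{1}{2}+\alpha-\theta$.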

4.3. Proof of main theorem

This section presents the proof of Theorem 1. To this aim, we first give a proposition (Proposition 8), analogous to Proposition 3, focusing instead on the $L^2_\rho(\mathcal{X}^2)$-norm of a varied sample error. The proof of Proposition 8 is postponed to the end of this subsection.

Proposition 8. Let $\{f_t : t \in \mathbb{N}\}$ be defined as (2.1). For any $2 \le t \le T$ and $0 < \delta < 1$, the following inequality holds with probability at least $1 - \delta$

    t t      γ j ωtj+1 (LK )Aˆ j,C,0  +  γ j ωtj+1 (LK )Bˆ j,C,0  j=2

ρ

≤ 8κ (1 + κ )(2M + κ C ) log

j=2

'

4T

t 

δ

j=2



ρ

γj

j (1 +

t

= j+1

γ ) 2 1

Please cite this article as: X. Chen, Y. Lei, Refined bounds for online pairwise learning algorithms, Neurocomputing (2017), https://doi.org/10.1016/j.neucom.2017.11.049

ARTICLE IN PRESS

JID: NEUCOM

[m5G;December 5, 2017;16:34]

X. Chen, Y. Lei / Neurocomputing 000 (2017) 1–10

+

 T

where

T

= j+1 γ

1+

j=2

12

γ j2



( ,





To apply Proposition 8, we need to estimate the last two terms in (4.14), which is addressed by the following two lemmas to be proved in Appendix A. Lemma 9. If θ ∈ ( 12 , 1 ) and t ≥ 2, then

j=2

γj



j 1+

where

t

)

Cθ ,μ = μ

− 12

= j+1

γ



2 1 −θ 1 μ2θ + 2 + 2θ +1 + 1 − 1−θ 3

j=2

1+

t

= j+1 γ

= C θ ,μ

1

μ



1 −θ

max t θ −1 , t −θ log 3t ≤C θ ,μ 2





2 1 −θ 2θ 1 − 1 −θ + 2θ − 1 3

2

(4.16)



. Then, C θ ,μ

(4.17)

−β ≤ Cκ ,β LK f˜ρ ρ T −β (1−θ ) .

Plugging this estimate into (4.17) implies the following inequality ˜ for any (z1 , . . . , zT ) ∈  ∩  −β ≤ Cκ ,β LK f˜ρ ρ T −β (1−θ )

*

+ 8κ (1 + κ )(2M + κ C ) log





T ≤C T,δ

2 2θ + 22θ μ−2 . μ min(μ, 1 )

− min β ,



1 ( 1 −θ ) 2 ,

4T

δ

C¯θ ,μ



θ −1 log(eT 1−θ )T

2

where

For any linear operator A : L2ρ (X 2 ) → L2ρ (X 2 ), we denote by

A L2 the operator norm of A, i.e., A L2 = sup f ρ ≤1 A f ρ . ρ

θ −1

log(eT 1−θ )T

fT +1 − f˜ρ ρ ≤ ω2t (LK ) f˜ρ ρ + 8κ (1 + κ )(2M + κ C ) θ −1  4T log C¯θ ,μ log(eT 1−θ )T 2 . δ

t θ −1 log et 1−θ , ≤C θ ,μ where



fT +1 − f˜ρ ρ

)

γ j2

C¯θ ,μ

ω2t (LK ) f˜ρ ρ ≤ ω2T (LK )LβK L(L2ρ ) L−K β f˜ρ ρ

− 12 (2θ + 1 )(1 − θ ) 21 * . 2θ − 1

Lemma 10. If θ ∈ ( 12 , 1 ) and t ≥ 2, then t 

δ

According to Lemma 11, the term ω2t (LK ) f˜ρ ρ can be bounded by

 − θ θ −1  θ −1  12 ≤ Cθ ,μ max t 2 , t 2 = Cθ ,μt 2 ,



4T

˜ , where C¯θ ,μ = Cθ ,μ + holds for any (z1 , . . . , zT ) ∈  ˜ for any (z1 , . . . , zT ) ∈  ∩ , we derive

Aˆ j,C,0 = L j − LK f j 1{ f j K ≤C} − S j − LK f˜ρ .

t 

8κ (1 + κ )(2M + κ C ) log

(4.14)



7

: = C L−β f˜ρ ρ C T,δ κ ,β K



β Lemma 11 [34]. Let β > 0. If f˜ρ ∈ LK (L2ρX ), then

log

ω2T (LK )LβK (L2ρ ) ≤ Cκ ,β T −β (1−θ ) ,

(4.15)

where Cκ ,β := (( βe )β + κ 2β )κ 2β (μ(1 − θ ))β (1 − ( 12 )1−θ )−β . We are now in a position to present the proof of Theorem 1. Proof of Theorem 1. Introduce



log +8κ (1+κ ) 2M+κ C

ρ

4T

δ

C¯θ ,μ



θ  8T  21θ−−1

δ



+1

   21θ−−1θ

log(eT )



log(e T ).

is the constant defined in Proposition 2. The stated result Here C ˜ ≥ 1 − 2δ .  follows from (2.2) and Pr( ∩ ) We now turn to prove Proposition 8 based on the following lemma which can be found in the proof of Lemma 3 in [31].

   = (z1 , . . . , zT ) : f j K ≤ C, j = 1, . . . , T ,

Lemma 12. If β > 0, then there holds

where C denotes the right hand side of (2.3). Proposition 2 shows Pr() ≥ 1 − δ . For any (z1 , . . . , zT ) ∈ , we get Aˆ j,C,0 = Aˆ j and Bˆ j,C,0 = Bˆ j , ∀ j = 1, . . . , T , and (4.1) then implies that

ωtj (LK )LβK L(L2ρ ) ≤

  T  

fT +1 − f˜ρ ρ ≤ ω (LK ) f˜ρ ρ +  γ j ωtj+1 (LK )Aˆ j 

For any f ∈ HK , there exists g ∈ L2ρ (X 2 ) such that L1K/2 g = f and

f K = g ρ . Then, it is clear

T 2

j=2

  T   + γ j ωTj+1 (LK )Bˆ j  j=2

ρ

 T

  γ j ωTj+1 (LK )Aˆ j,C,0 

ρ

  T   + γ j ωTj+1 (LK )Bˆ j,C,0  . ρ

    T T      γ j ωTj+1 (LK )Aˆ j,C,0  +  γ j ωTj+1 (LK )Bˆ j,C,0  ≤ ρ

= j

j=2

(4.18)

  Proof of Proposition 8. We first estimate  tj=2 γ j ωtj+1 (LK )   Aˆ j,C,0  . According to the triangle inequality and (4.18), we derive

Plugging the estimates given in Lemmas 9, 10 into (4.14), there ex˜ ⊂ Z T with probability measure at least 1 − δ such that ists a set  the following inequality

j=2

 − β t γ .

= ωtj+1 (LK )L1K/2 L(L2ρ ) f K .

j=2

j=2

e



+ κ 2β min 1,

ωtj+1 (LK ) f ρ = ωtj+1 (LK )L1K/2 g ρ ≤ ωtj+1 (LK )L1K/2 L(L2ρ ) g ρ

ρ

 = ω2T (LK ) f˜ρ ρ + 

 β β



t 

ρ

γ j ωtj+1 (LK )Aˆ j,C,0 ρ ≤

j=2

t 

γ j ωtj+1 (LK )Aˆ j,C,0 ρ

j=2



t 

γ j ωtj+1 (LK )L1K/2 L(L2ρ ) Aˆ j,C,0 K .

j=2

(4.19)

ρ
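As a sanity check on the operator-norm factor $\|\omega^t_{j+1}(L_K)L_K^{1/2}\|_{\mathcal L(L^2_\rho)}$ appearing in (4.19), the sketch below evaluates it in a finite-dimensional stand-in and compares it with the bound $\sqrt2(1+\kappa)\bigl(1+\sum_{\ell=j+1}^t\gamma_\ell\bigr)^{-1/2}$ used for this factor in the proof. Everything specific here is an assumption made for illustration: $L_K$ is replaced by a diagonal matrix with eigenvalues in $[0,\kappa^2]$, $\omega^t_{j+1}(L)$ is taken to be $\prod_{\ell=j+1}^t(I-\gamma_\ell L)$, the step sizes are $\gamma_\ell=\ell^{-\theta}/\mu$ with $\mu\ge\kappa^2$, and the function names are hypothetical.

```python
import math

def omega_sqrt_norm(j, t, theta, mu, eigenvalues):
    """Operator norm of prod_{l=j+1}^t (I - gamma_l L) L^{1/2} for a diagonal
    matrix L with the given nonnegative eigenvalues (assumed stand-in for L_K)."""
    worst = 0.0
    for lam in eigenvalues:
        prod = 1.0
        for l in range(j + 1, t + 1):
            gamma_l = l ** (-theta) / mu  # assumed step size gamma_l = l^{-theta} / mu
            prod *= 1.0 - gamma_l * lam
        worst = max(worst, abs(prod) * math.sqrt(lam))
    return worst

def spectral_bound(j, t, theta, mu, kappa):
    """sqrt(2) (1 + kappa) / (1 + sum_{l=j+1}^t gamma_l)^{1/2}."""
    tail = sum(l ** (-theta) / mu for l in range(j + 1, t + 1))
    return math.sqrt(2.0) * (1.0 + kappa) / math.sqrt(1.0 + tail)

if __name__ == "__main__":
    kappa, mu, theta, t = 1.0, 1.0, 0.75, 200
    eigs = [k / 50.0 * kappa ** 2 for k in range(51)]  # grid on [0, kappa^2]
    for j in (1, 10, 100, 199, 200):
        print(j, omega_sqrt_norm(j, t, theta, mu, eigs), spectral_bound(j, t, theta, mu, kappa))
```

For $j=t$ the product is empty, so the left-hand side reduces to $\kappa$ and the bound reduces to $\sqrt2(1+\kappa)$, in line with the convention that an empty sum of step sizes equals zero.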


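By Lemma 12 and the elementary inequality $\min\{1,a^{-1/2}\}\le\sqrt{2/(a+1)}$ for any $a>0$, it holds that

\[
\bigl\|\omega^t_{j+1}(L_K)L_K^{1/2}\bigr\|_{\mathcal L(L^2_\rho)}\le\frac{\sqrt2(1+\kappa)}{\bigl(1+\sum_{\ell=j+1}^{t}\gamma_\ell\bigr)^{1/2}},
\tag{4.20}
\]

where we use the notation from (4.5) that $\sum_{\ell=t+1}^{t}\gamma_\ell=0$. Furthermore, it follows

\[
\begin{aligned}
\bigl\|\hat A^{j,C,0}\bigr\|_K&\le\bigl\|(L_j-L_K)f_j\mathbb 1_{\{\|f_j\|_K\le C\}}\bigr\|_K+\|S_j-L_K\tilde f_\rho\|_K\\
&\le C\|L_j-L_K\|_{\mathcal L(\mathcal H_K)}+\|S_j-L_K\tilde f_\rho\|_K\\
&\le C\|L_j-L_K\|_{HS}+\|S_j-L_K\tilde f_\rho\|_K.
\end{aligned}
\]

Plugging this estimate, (4.20), (4.8) and (4.11) into (4.19), we get the following inequality with probability $1-\frac\delta2$:

\[
\begin{aligned}
\Bigl\|\sum_{j=2}^{t}\gamma_j\omega^t_{j+1}(L_K)\hat A^{j,C,0}\Bigr\|_\rho&\le\sum_{j=2}^{t}\frac{\sqrt2(1+\kappa)\gamma_j}{\bigl(1+\sum_{\ell=j+1}^{t}\gamma_\ell\bigr)^{1/2}}\Bigl[\frac{3\sqrt2\kappa^2C\log\frac{4T}\delta}{\sqrt j}+\frac{6\sqrt2\kappa M\log\frac{4T}\delta}{\sqrt j}\Bigr]\\
&=(1+\kappa)\log\frac{4T}\delta\,\bigl(6\kappa^2C+12\kappa M\bigr)\sum_{j=2}^{t}\frac{\gamma_j}{\sqrt j\,\bigl(1+\sum_{\ell=j+1}^{t}\gamma_\ell\bigr)^{1/2}}.
\end{aligned}
\tag{4.21}
\]

We now turn to estimate $\bigl\|\sum_{j=2}^{t}\gamma_j\omega^t_{j+1}(L_K)\hat B^{j,C,0}\bigr\|_\rho$. Analogous to the proof of Proposition 3, we know that $\{\xi_j=\gamma_j\omega^t_{j+1}(L_K)\hat B^{j,C,0}\}_j$ is a martingale difference sequence. The term $\|\hat B^{j,C,0}\|_K$ can be bounded as follows:

\[
\begin{aligned}
\bigl\|\hat B^{j,C,0}\bigr\|_K&\le\bigl\|(\widehat L_j-L_j)f_j\mathbb 1_{\{\|f_j\|_K\le C\}}\bigr\|_K+\bigl\|\widehat S_j-S_j\bigr\|_K\\
&\le\bigl\|\widehat L_j-L_j\bigr\|_{\mathcal L(\mathcal H_K)}\bigl\|f_j\mathbb 1_{\{\|f_j\|_K\le C\}}\bigr\|_K+\bigl\|\widehat S_j-S_j\bigr\|_K\\
&\le C\bigl\|\widehat L_j-L_j\bigr\|_{HS}+2\kappa M\le2\kappa^2C+4\kappa M.
\end{aligned}
\]

Using this estimate, (4.18) and (4.20), we derive

\[
\bigl\|\omega^t_{j+1}(L_K)\hat B^{j,C,0}\bigr\|_\rho\le\bigl\|\omega^t_{j+1}(L_K)L_K^{1/2}\bigr\|_{\mathcal L(L^2_\rho)}\bigl\|\hat B^{j,C,0}\bigr\|_K\le\frac{2\sqrt2(1+\kappa)\kappa(2M+\kappa C)}{\bigl(1+\sum_{\ell=j+1}^{T}\gamma_\ell\bigr)^{1/2}}.
\]

It then follows that

\[
\sum_{j=2}^{T}\gamma_j^2\,\mathbb E\Bigl[\bigl\|\omega^t_{j+1}(L_K)\hat B^{j,C,0}\bigr\|_\rho^2\Bigm|z_1,\dots,z_{j-1}\Bigr]\le8(1+\kappa)^2\kappa^2(2M+\kappa C)^2\sum_{j=2}^{T}\frac{\gamma_j^2}{1+\sum_{\ell=j+1}^{T}\gamma_\ell},
\]

and

\[
\begin{aligned}
B:=\sup_{2\le j\le T}\bigl\|\gamma_j\omega^t_{j+1}(L_K)\hat B^{j,C,0}\bigr\|_\rho&\le2\sqrt2(1+\kappa)\kappa(2M+\kappa C)\sup_{2\le j\le T}\frac{\gamma_j}{\bigl(1+\sum_{\ell=j+1}^{T}\gamma_\ell\bigr)^{1/2}}\\
&\le2\sqrt2(1+\kappa)\kappa(2M+\kappa C)\Bigl(\sum_{j=2}^{T}\frac{\gamma_j^2}{1+\sum_{\ell=j+1}^{T}\gamma_\ell}\Bigr)^{1/2}.
\end{aligned}
\]

Applying Lemma 7 then implies the following inequality with probability at least $1-\frac\delta2$:

\[
\Bigl\|\sum_{j=2}^{t}\gamma_j\omega^t_{j+1}(L_K)\hat B^{j,C,0}\Bigr\|_\rho\le\frac{16\sqrt2\,\kappa(1+\kappa)(2M+\kappa C)}{3}\log\frac4\delta\,\Bigl(\sum_{j=2}^{T}\frac{\gamma_j^2}{1+\sum_{\ell=j+1}^{T}\gamma_\ell}\Bigr)^{1/2}.
\tag{4.22}
\]

The stated result then follows from (4.21) and (4.22). □

Acknowledgments

We are grateful to the reviewers and Dr. Cheng Wang for their comments on the unboundedness of the regression function in Lemma 4 of the previous version, as well as for the constructive comments that helped us improve the presentation of the paper. We are also grateful to Prof. Ding-Xuan Zhou for constructive comments.

Appendix A. Proof of Lemmas 9 and 10

Proof of Lemma 9. We write

\[
\sum_{j=2}^{t}\frac{\gamma_j}{\sqrt j\,\bigl(1+\sum_{\ell=j+1}^{t}\gamma_\ell\bigr)^{1/2}}=\sum_{j=2}^{\lfloor t/2\rfloor}\frac{\gamma_j}{\sqrt j\,\bigl(1+\sum_{\ell=j+1}^{t}\gamma_\ell\bigr)^{1/2}}+\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac{\gamma_j}{\sqrt j\,\bigl(1+\sum_{\ell=j+1}^{t}\gamma_\ell\bigr)^{1/2}}:=J_1+J_2.
\]

For any $j\le\lfloor t/2\rfloor$, by $\sum_{\ell=j+1}^{t}\ell^{-\theta}\ge\frac1{1-\theta}\bigl[(t+1)^{1-\theta}-(j+1)^{1-\theta}\bigr]$, we get

\[
\begin{aligned}
\sum_{\ell=j+1}^{t}\gamma_\ell&\ge\frac1{\mu(1-\theta)}\bigl[(t+1)^{1-\theta}-(j+1)^{1-\theta}\bigr]\ge\frac1{\mu(1-\theta)}\bigl[(t+1)^{1-\theta}-2^{\theta-1}(t+2)^{1-\theta}\bigr]\\
&=\frac{(t+1)^{1-\theta}}{\mu(1-\theta)}\Bigl[1-\Bigl(\frac{t+2}{2t+2}\Bigr)^{1-\theta}\Bigr]\ge\frac{(t+1)^{1-\theta}}{\mu(1-\theta)}\Bigl[1-\bigl(\tfrac23\bigr)^{1-\theta}\Bigr],\quad\forall t\ge2.
\end{aligned}
\tag{A.1}
\]

Applying (4.6) with $\theta>\frac12$, the term $J_1$ can be bounded by

\[
\begin{aligned}
J_1&\le\mu^{-\frac12}(1-\theta)^{\frac12}\Bigl(1-\bigl(\tfrac23\bigr)^{1-\theta}\Bigr)^{-\frac12}(t+1)^{\frac{\theta-1}2}\sum_{j=2}^{\lfloor t/2\rfloor}j^{-(\theta+\frac12)}\\
&\le\mu^{-\frac12}\frac{(2\theta+1)(1-\theta)^{\frac12}}{2\theta-1}\Bigl(1-\bigl(\tfrac23\bigr)^{1-\theta}\Bigr)^{-\frac12}(t+1)^{\frac{\theta-1}2}.
\end{aligned}
\]

For any $j\ge\lfloor t/2\rfloor+1$, since $j\le t$, we know

\[
1+\sum_{\ell=j+1}^{t}\gamma_\ell\ge1+\sum_{\ell=j+1}^{t}\frac{t^{-\theta}}\mu=\frac1\mu\bigl[\mu+(t-j)t^{-\theta}\bigr],
\tag{A.2}
\]

from which we get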


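\[
J_2\le\mu^{-\frac12}\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac{j^{-(\theta+\frac12)}}{\bigl[\mu+(t-j)t^{-\theta}\bigr]^{\frac12}}\le(2^{-1}t)^{-(\theta+\frac12)}\mu^{-\frac12}\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac1{\bigl[\mu+(t-j)t^{-\theta}\bigr]^{\frac12}}.
\]

With a change of variables, we derive

\[
\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac1{\bigl[\mu+(t-j)t^{-\theta}\bigr]^{\frac12}}\le\sum_{j=0}^{\lfloor t/2\rfloor}\frac1{\bigl[\mu+jt^{-\theta}\bigr]^{\frac12}}=\frac1{\sqrt\mu}+\sum_{j=1}^{\lfloor t/2\rfloor}\frac1{\bigl[\mu+jt^{-\theta}\bigr]^{\frac12}}.
\]

The term $\sum_{j=1}^{\lfloor t/2\rfloor}\bigl[\mu+jt^{-\theta}\bigr]^{-\frac12}$ can be further controlled by

\[
\begin{aligned}
\sum_{j=1}^{\lfloor t/2\rfloor}\frac1{\bigl[\mu+jt^{-\theta}\bigr]^{\frac12}}&\le\sum_{j=1}^{\lfloor t/2\rfloor}\int_{j-1}^{j}\frac{dx}{\bigl[\mu+xt^{-\theta}\bigr]^{\frac12}}\le\int_0^{\frac t2}\frac{dx}{\bigl[\mu+xt^{-\theta}\bigr]^{\frac12}}\\
&=t^\theta\int_0^{\frac t2}\frac{d\bigl(\mu+xt^{-\theta}\bigr)}{\bigl[\mu+xt^{-\theta}\bigr]^{\frac12}}=t^\theta\int_\mu^{\mu+2^{-1}t^{1-\theta}}\tilde x^{-\frac12}\,d\tilde x\\
&=2t^\theta\Bigl[\sqrt{\mu+2^{-1}t^{1-\theta}}-\sqrt\mu\Bigr]\le2^{\frac12}t^{\frac{1+\theta}2}.
\end{aligned}
\]

Combining the above three inequalities together, we get

\[
J_2\le(2^{-1}t)^{-(\theta+\frac12)}\mu^{-\frac12}\Bigl[\mu^{-\frac12}+2^{\frac12}t^{\frac{1+\theta}2}\Bigr]=2^{\theta+\frac12}\mu^{-1}t^{-(\theta+\frac12)}+2^{\theta+1}\mu^{-\frac12}t^{-\frac\theta2}.
\]

The stated bound follows by putting the bounds on $J_1$ and $J_2$ together. The proof is complete. □

Proof of Lemma 10. We write

\[
\sum_{j=2}^{t}\frac{\gamma_j^2}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}=\sum_{j=2}^{\lfloor t/2\rfloor}\frac{\gamma_j^2}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}+\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac{\gamma_j^2}{1+\sum_{\ell=j+1}^{t}\gamma_\ell}:=J_1+J_2.
\]

According to (A.1) and (4.6) with $\theta>\frac12$, the term $J_1$ can be bounded by

\[
J_1\le\sum_{j=2}^{\lfloor t/2\rfloor}\mu^{-1}(1-\theta)\Bigl(1-\bigl(\tfrac23\bigr)^{1-\theta}\Bigr)^{-1}(t+1)^{\theta-1}j^{-2\theta}\le\mu^{-1}\frac{2\theta}{2\theta-1}\Bigl(1-\bigl(\tfrac23\bigr)^{1-\theta}\Bigr)^{-1}t^{\theta-1}.
\]

By (A.2), we derive

\[
J_2\le\frac1\mu\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac{j^{-2\theta}}{\mu+(t-j)t^{-\theta}}\le\frac{(2^{-1}t)^{-2\theta}}\mu\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac1{\mu+(t-j)t^{-\theta}}.
\]

With a change of variables, we derive

\[
\sum_{j=\lfloor t/2\rfloor+1}^{t}\frac1{\mu+(t-j)t^{-\theta}}\le\sum_{j=0}^{\lfloor t/2\rfloor}\frac1{\mu+jt^{-\theta}}=\frac1\mu+\sum_{j=1}^{\lfloor t/2\rfloor}\frac1{\mu+jt^{-\theta}}.
\]

The term $\sum_{j=1}^{\lfloor t/2\rfloor}\bigl(\mu+jt^{-\theta}\bigr)^{-1}$ can be further controlled by

\[
\begin{aligned}
\sum_{j=1}^{\lfloor t/2\rfloor}\frac1{\mu+jt^{-\theta}}&\le\frac1{\min(\mu,1)}\sum_{j=1}^{\lfloor t/2\rfloor}\int_{j-1}^{j}\frac{dx}{1+xt^{-\theta}}\le\frac1{\min(\mu,1)}\int_0^{\frac t2}\frac{dx}{1+xt^{-\theta}}\\
&=\frac{t^\theta}{\min(\mu,1)}\int_0^{\frac t2}\frac{d\bigl(1+xt^{-\theta}\bigr)}{1+xt^{-\theta}}=\frac{t^\theta}{\min(\mu,1)}\int_1^{1+2^{-1}t^{1-\theta}}\tilde x^{-1}\,d\tilde x\\
&=\frac{t^\theta}{\min(\mu,1)}\log\bigl(1+2^{-1}t^{1-\theta}\bigr)\le\frac{t^\theta}{\min(\mu,1)}\log\frac{3t^{1-\theta}}2.
\end{aligned}
\]

Combining the above three inequalities together, we get

\[
J_2\le\frac{2^{2\theta}t^{-\theta}}{\mu\min(\mu,1)}\log\frac{3t^{1-\theta}}2+2^{2\theta}\mu^{-2}t^{-2\theta}.
\]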
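As a numerical sanity check of Lemmas 9 and 10, the following sketch evaluates both step-size sums directly and compares them with the stated right-hand sides $C_{\theta,\mu}t^{(\theta-1)/2}$ and $\widetilde C_{\theta,\mu}t^{\theta-1}\log(et^{1-\theta})$. It assumes polynomially decaying step sizes $\gamma_j=j^{-\theta}/\mu$, which is the form the estimates (A.1) and (A.2) rely on; the function names and the parameter grid are hypothetical.

```python
import math

def lemma_sums(t, theta, mu):
    """Left-hand sides of Lemmas 9 and 10 for gamma_j = j^{-theta} / mu (assumed)."""
    tail = 0.0  # running value of sum_{l=j+1}^t gamma_l, starting from the empty sum at j = t
    s9 = s10 = 0.0
    for j in range(t, 1, -1):  # j = t, t-1, ..., 2
        gamma_j = j ** (-theta) / mu
        s9 += gamma_j / (math.sqrt(j) * math.sqrt(1.0 + tail))
        s10 += gamma_j ** 2 / (1.0 + tail)
        tail += gamma_j
    return s9, s10

def lemma_bounds(t, theta, mu):
    """Right-hand sides C_{theta,mu} t^{(theta-1)/2} and C~_{theta,mu} t^{theta-1} log(e t^{1-theta})."""
    r = 1.0 - (2.0 / 3.0) ** (1.0 - theta)
    c9 = mu ** (-0.5) * (
        2.0 ** (theta + 0.5) * mu ** (-0.5)
        + 2.0 ** (theta + 1.0)
        + r ** (-0.5) * (2.0 * theta + 1.0) * math.sqrt(1.0 - theta) / (2.0 * theta - 1.0)
    )
    c10 = (
        2.0 * theta / (mu * (2.0 * theta - 1.0) * r)
        + 2.0 ** (2.0 * theta) / (mu * min(mu, 1.0))
        + 2.0 ** (2.0 * theta) / mu ** 2
    )
    b9 = c9 * t ** ((theta - 1.0) / 2.0)
    b10 = c10 * t ** (theta - 1.0) * math.log(math.e * t ** (1.0 - theta))
    return b9, b10

if __name__ == "__main__":
    for theta, mu in [(0.6, 0.5), (0.75, 1.0), (0.9, 2.0)]:
        for t in (2, 10, 100, 1000):
            print(theta, mu, t, lemma_sums(t, theta, mu), lemma_bounds(t, theta, mu))
```

The constants grow as $\theta$ approaches $\frac12$ or $1$, through the factors $(2\theta-1)^{-1}$ and $\bigl(1-(\tfrac23)^{1-\theta}\bigr)^{-1}$, so the check is most informative for $\theta$ away from the endpoints.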

The stated bound follows by putting the bounds on $J_1$ and $J_2$ together. The proof is complete. □

References

[1] S. Agarwal, P. Niyogi, Generalization bounds for ranking algorithms via algorithmic stability, J. Mach. Learn. Res. 10 (2009) 441–474.
[2] A. Bellet, A. Habrard, Robustness and generalization for metric learning, Neurocomputing 151 (2015) 259–267.
[3] M. Boissier, S. Lyu, Y. Ying, D.X. Zhou, Fast convergence of online pairwise learning algorithms, in: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, pp. 204–212.
[4] Q. Cao, Z.C. Guo, Y. Ying, Generalization bounds for metric and similarity learning, Mach. Learn. 102 (1) (2016) 115–132.
[5] A. Caponnetto, E. De Vito, Optimal rates for the regularized least-squares algorithm, Found. Comput. Math. 7 (3) (2007) 331–368.
[6] S. Clémençon, G. Lugosi, N. Vayatis, Ranking and empirical minimization of U-statistics, Ann. Stat. (2008) 844–874.
[7] F. Cucker, D.X. Zhou, Learning Theory: an Approximation Theory Viewpoint, Cambridge University Press, 2007.
[8] J. Fan, T. Hu, Q. Wu, D.X. Zhou, Consistency analysis of an empirical minimum error entropy algorithm, Appl. Comput. Harmonic Anal. 41 (1) (2016) 164–189.
[9] Y. Feng, X. Huang, L. Shi, Y. Yang, J.A. Suykens, Learning with the maximum correntropy criterion induced losses for regression, J. Mach. Learn. Res. 16 (2015) 993–1034.
[10] Z.C. Guo, D.H. Xiang, X. Guo, D.X. Zhou, Thresholded spectral algorithms for sparse approximations, Anal. Appl. 15 (03) (2017) 433–455.
[11] Z.C. Guo, Y. Ying, Guaranteed classification via regularized similarity learning, Neural Comput. 26 (3) (2014) 497–522.
[12] Z.C. Guo, Y. Ying, D.X. Zhou, Online regularized learning with pairwise loss functions, Adv. Comput. Math. 43 (1) (2016) 127–150.
[13] T. Hu, J. Fan, Q. Wu, D.X. Zhou, Learning theory approach to minimum error entropy criterion, J. Mach. Learn. Res. 14 (2013) 377–397.
[14] T. Hu, J. Fan, Q. Wu, D.X. Zhou, Regularization schemes for minimum error entropy principle, Anal. Appl. 13 (04) (2015) 437–455.
[15] P. Kar, B. Sriperumbudur, P. Jain, H. Karnick, On the generalization ability of online learning algorithms for pairwise loss functions, in: Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 441–449.
[16] Y. Lei, L. Shi, Z.C. Guo, Convergence of unregularized online learning algorithms, 2017, arXiv preprint arXiv:1708.02939.


[17] Y. Lei, D.X. Zhou, Analysis of online composite mirror descent algorithm, Neural Comput. 29 (3) (2017) 825–860.
[18] J. Lin, Y. Lei, B. Zhang, D.X. Zhou, Online pairwise learning algorithms with convex loss functions, Inf. Sci. 406–407 (2017) 57–70.
[19] J. Lin, L. Rosasco, Generalization properties of doubly online learning algorithms, 2017, arXiv preprint arXiv:1707.00577.
[20] J. Lin, D.X. Zhou, Online learning algorithms can converge comparably fast as batch learning, IEEE Trans. Neural Netw. Learn. Syst. (2017), doi:10.1109/TNNLS.2017.2677970.
[21] S.B. Lin, D.X. Zhou, Distributed kernel-based gradient descent algorithms, Constr. Approx. (2017) 1–28, doi:10.1007/s00365-017-9379-1.
[22] S. Mukherjee, D.X. Zhou, Learning coordinate covariances via gradients, J. Mach. Learn. Res. 7 (2006) 519–549.
[23] W. Rejchel, Fast rates for ranking with large families, Neurocomputing 168 (2015) 1104–1110.
[24] L. Shi, Learning theory estimates for coefficient-based regularized regression, Appl. Comput. Harmonic Anal. 34 (2) (2013) 252–265.
[25] L. Shi, Y.L. Feng, D.X. Zhou, Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmonic Anal. 31 (2) (2011) 286–302.
[26] S. Smale, Y. Yao, Online learning algorithms, Found. Comput. Math. 6 (2) (2006) 145–170.
[27] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2) (2007) 153–172.
[28] P. Tarres, Y. Yao, Online learning as stochastic approximation of regularization paths: optimality and almost-sure convergence, IEEE Trans. Inf. Theory 60 (9) (2014) 5716–5735.
[29] Y. Wang, R. Khardon, D. Pechyony, R. Jones, Generalization bounds for online learning algorithms with pairwise loss functions, in: Proceedings of the COLT, 23, 2012, 13–1.
[30] Y. Yao, On complexity issues of online learning algorithms, IEEE Trans. Inf. Theory 56 (12) (2010) 6470–6481.
[31] Y. Ying, M. Pontil, Online gradient descent learning algorithms, Found. Comput. Math. 8 (5) (2008) 561–596.
[32] Y. Ying, L. Wen, S. Lyu, Stochastic online AUC maximization, in: Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 451–459.
[33] Y. Ying, D.X. Zhou, Online regularized classification algorithms, IEEE Trans. Inf. Theory 52 (11) (2006) 4775–4788.
[34] Y. Ying, D.X. Zhou, Online pairwise learning algorithms, Neural Comput. 28 (4) (2016) 743–777.
[35] Y. Ying, D.X. Zhou, Unregularized online learning algorithms with general loss functions, Appl. Comput. Harmonic Anal. 42 (2) (2017) 224–244.
[36] Y. Zhao, J. Fan, L. Shi, Learning rates for regularized least squares ranking algorithm, Anal. Appl. 15 (6) (2017) 815–836.

Xiaming Chen received his M.S. degree from the School of Mathematical Sciences, Beijing Normal University, in 2015, and his B.S. degree from the School of Mathematical Sciences, South China Normal University, in 2012. He is currently a Ph.D. candidate at the Department of Mathematics, City University of Hong Kong. His research interests include pairwise learning, deep learning and related topics.

Yunwen Lei received the Ph.D. degree from the School of Computer Science, Wuhan University, in 2014, and the B.S. degree from the College of Mathematics and Econometrics, Hunan University, in 2008. From 2015 to 2017, he was a postdoctoral research fellow at the Department of Mathematics, City University of Hong Kong. He is currently a research assistant professor at the Department of Computer Science and Engineering, Southern University of Science and Technology. His main research interests include machine learning, statistical learning theory and convex optimization.
