Analysis of convergence performance of neural networks ranking algorithm✩

Yongquan Zhang, Feilong Cao∗
Department of Information and Mathematics Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, PR China

Article history: Received 15 September 2011; received in revised form 31 March 2012; accepted 29 June 2012.

Keywords: Ranking algorithm; Neural networks; Covering number; Convergence rate

✩ This research was supported by the National Natural Science Foundation of China (No. 61101240) and the Zhejiang Provincial Natural Science Foundation of China (Nos. Y6110117, Q12A010096).
∗ Corresponding author. Fax: +86 571 86835737. E-mail address: [email protected] (F. Cao).

Abstract

The ranking problem is to learn a real-valued function which gives rise to a ranking over an instance space; it has gained much attention in machine learning in recent years. This article analyzes the convergence performance of a neural networks ranking algorithm by means of the given samples and the approximation property of neural networks. The upper bounds on the convergence rate provided by our results can be considerably tight and are independent of the dimension of the input space when the target function satisfies a smoothness condition. The obtained results imply that neural networks are able to adapt to the ranking function on the instance space, and hence circumvent the curse of dimensionality under this smoothness condition.

1. Introduction

The analysis of the convergence performance of learning algorithms is an important and active topic in machine learning research. To our knowledge, Vapnik and Chervonenkis (1971) were the first to study learning algorithms in this way, establishing a convergence analysis for classification from a statistical viewpoint. Since then, many different tools have been used to study the convergence performance of learning algorithms, applied both to classification (learning of binary-valued functions) and to regression (learning of real-valued functions). In many learning problems, however, the goal is not simply to classify objects into one of a fixed number of classes; instead, a ranking of objects is desired. For example, in information retrieval one wants to retrieve documents from some database that are 'relevant' to a given query or topic. In such problems, one needs a ranking of the documents so that relevant documents are ranked higher than irrelevant ones. Recently, the ranking problem has gained much attention in machine learning (see Agarwal & Niyogi, 2005, 2009; Clemencon, Lugosi, & Vayatis, 2008; Cohen, Schapire, & Singer, 1999; Cortes, Mohri, & Rastogi, 2007; Cossock & Zhang, 2006; Cucker & Smale, 2001, 2002). In the ranking problem, we learn a real-valued function which assigns scores to instances; the scores themselves do not matter, however, as we are only interested in the relative ranking of instances induced by them.


Ranking has now been successfully applied in many fields, such as social choice theory (Arrow, 1970), statistics (Lehmann, 1975) and mathematical economics (Chiang & Wainwright, 2005). In machine learning, Cohen et al. (1999) were the first to study ranking. Since then, many researchers have studied this topic from the machine learning point of view. For example, Crammer and Singer (2002) and Herbrich, Graepel, and Obermayer (2000) considered the related but distinct problem of ordinal regression. Radlinski and Joachims (2005) developed an algorithmic framework for ranking in information retrieval applications. Agarwal and Niyogi (2005) and Freund, Iyer, Schapire, and Singer (2003) each considered the convergence properties of ranking algorithms for the special setting of bipartite ranking. Clemencon et al. (2008) gave statistical convergence properties of ranking algorithms based on empirical and convex risk minimization, using the theory of U-statistics. Agarwal and Niyogi (2009) studied the convergence properties of ranking algorithms in a more general setting of the ranking problem that arises frequently in applications, bounding the convergence error via algorithmic stability. Burges et al. (2005) developed a neural network based algorithm for the ranking problem. Although there have been several recent advances in developing algorithms for various settings of the ranking problem, the study of generalization properties of ranking algorithms has been largely limited to the special setting of bipartite ranking (see Agarwal & Niyogi, 2005; Freund et al., 2003). Similar to Agarwal and Niyogi (2009), we study the convergence property of ranking learning algorithms in a more general setting of the ranking problem that arises frequently in applications and practice. Our convergence rates are derived by using the approximation property of neural networks and the covering number, instead of the notion of algorithmic stability in a reproducing kernel Hilbert space used in Agarwal and Niyogi (2009).



As in both classification and regression, the ranking problem takes place over some hypothesis space, which should have good approximation properties for the ranking function. It is well known that feedforward neural networks (FNNs) have the universal approximation property for continuous or integrable functions defined on a compact set, and there are algorithms to carry out the approximation. Cybenko (1989) first proved that if the activation function in an FNN is a continuous sigmoidal function, then any continuous function on the unit cube I = [0, 1]^d ⊂ R^d can be approximated by FNNs. Since then, methods different from Cybenko's have been designed. Meanwhile, a series of investigations into the conditions on the activation function ensuring the validity of the density theorem can be found in Chen and Chen (1995a, 1995b), Chen, Chen, and Liu (1995), Hornik (1991), and Mhaskar and Micchelli (1992). The complexity of FNN approximation mainly describes the relationship among the topology of the hidden layer (such as the number of neurons and the values of the weights), the approximation ability, and the approximation rate; its study has attracted much attention in recent years (Cao, Zhang, & He, 2009; Cao, Zhang, & Xu, 2009; Chui & Li, 1992; Maiorov & Meir, 1998; Xu & Cao, 2005). In machine learning, FNNs are often used as the hypothesis space when studying the convergence performance of learning algorithms. For example, Barron (1993) gave the convergence rate of a least squares regression learning algorithm by means of the approximation property of FNNs. Hamers and Kohler (2006) obtained nonasymptotic bounds on least squares regression estimates by minimizing the empirical risk over suitable sets of FNNs. Recently, Kohler and Mehnert (2011) analyzed the rate of convergence of a least squares learning algorithm over FNNs for smooth regression functions. In this article, we study a ranking learning algorithm using neural networks, where the hypothesis space is chosen as a class of FNNs with one hidden layer.

The article is organized into six sections. Following the introduction in the present section, we describe the general ranking problem and introduce neural networks in Section 2. In Section 3, we estimate the approximation error of the ranking algorithm via the approximation property of neural networks. Section 4 estimates the sample error; the obtained upper bound, in connection with the approximation error, leads to an upper bound on the convergence rate of the neural networks ranking algorithm. In Section 5, we compare our results with related work. Finally, we conclude the article with the obtained results.

2. General ranking problem and neural networks

In the ranking problem, one is given samples of ordering relationships among instances in some instance space X, and the goal is to learn from these samples a real-valued function that ranks future instances. Ranking problems arise in many domains: in user-preference modeling, one wants to order movies or texts according to one's preferences; in information retrieval, one is interested in retrieving documents from some database that are 'relevant' to a given query or topic.
In such problems, one wants to return to the user a list of documents that contains relevant documents at the top and irrelevant documents at the bottom; in other words, one wants a ranking of the documents such that relevant documents are ranked higher than irrelevant ones. In this article, we let X ⊂ B_r = {x ∈ R^d : ∥x∥_2 ≤ r} and Y = [0, M_0] ⊂ R for two positive constants r and M_0, and let ρ be a probability distribution on Z = X × Y, with ρ_X the marginal distribution on X and ρ(y|x) the conditional distribution. Denote by z = {z_i}_{i=1}^m = {(x_i, y_i)}_{i=1}^m ∈ Z^m a set of labeled samples drawn according to ρ. The goal of the ranking problem is to learn a real-valued function f : X → Y which correctly orders future instances in X on the basis of the random samples.

The function f is considered to rank x lower than x′ if f(x) < f(x′), and higher than x′ if f(x) > f(x′). The penalty for mistakenly ranking a pair of instances may be taken larger for pairs with a larger difference between their labels. We introduce the ranking loss function in the following definition.

Definition 1 (See Agarwal and Niyogi (2009)). A ranking loss function is a function ℓ : Y^X × (X × Y) × (X × Y) → R_+ ∪ {0} that assigns to each function f : X → Y and pair of examples (x, y), (x′, y′) ∈ X × Y a non-negative real number ℓ(f, (x, y), (x′, y′)).

The ranking loss function can be interpreted as the penalty of f in its relative ranking of two instances x and x′ given their corresponding labels y and y′. We shall require that the loss function ℓ be symmetric with respect to (x, y) and (x′, y′), that is, ℓ(f, (x, y), (x′, y′)) = ℓ(f, (x′, y′), (x, y)) for all f, (x, y) and (x′, y′). Several ranking loss functions are useful in the study of the ranking problem:

(1) the 0–1 ranking loss function:

ℓ_{0-1}(f, (x, y), (x′, y′)) = I_{(y−y′)(f(x)−f(x′))<0} + (1/2) I_{f(x)=f(x′)};

(2) the least squares ranking loss function:

ℓ_sq(f, (x, y), (x′, y′)) = ( |y − y′| − sgn(y − y′)(f(x) − f(x′)) )²;

(3) the discrete ranking loss function:

ℓ_disc(f, (x, y), (x′, y′)) = |y − y′| ( I_{(y−y′)(f(x)−f(x′))<0} + (1/2) I_{f(x)=f(x′)} );

(4) for γ > 0, the γ-ranking loss function is defined as follows:

ℓ_γ(f, (x, y), (x′, y′)) =
  |y − y′|,                                           if (f(x) − f(x′)) sgn(y − y′)/γ ≤ 0,
  |y − y′| − (f(x) − f(x′)) sgn(y − y′)/γ,            if 0 < (f(x) − f(x′)) sgn(y − y′)/γ < |y − y′|,
  0,                                                  if (f(x) − f(x′)) sgn(y − y′)/γ ≥ |y − y′|,

where for u ∈ R,

sgn(u) = 1 if u > 0; 0 if u = 0; −1 if u < 0.

According to the definitions of the discrete ranking loss function and the γ-ranking loss function, for all γ > 0 there holds ℓ_disc ≤ ℓ_γ.
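To make the four loss functions concrete, here is a minimal sketch that transcribes them directly from the definitions above; it is our own illustration rather than code from the paper, and the function names are ours.

```python
import numpy as np

def sgn(u):
    # sign function as defined above: 1 for u > 0, 0 for u = 0, -1 for u < 0
    return float(np.sign(u))

def loss_01(f, x, y, xp, yp):
    # 0-1 ranking loss: penalty 1 for a mis-ordered pair, 1/2 for a tie in scores
    d = (y - yp) * (f(x) - f(xp))
    return float(d < 0) + 0.5 * float(f(x) == f(xp))

def loss_sq(f, x, y, xp, yp):
    # least squares ranking loss
    return (abs(y - yp) - sgn(y - yp) * (f(x) - f(xp))) ** 2

def loss_disc(f, x, y, xp, yp):
    # discrete ranking loss: the 0-1 loss weighted by the label gap |y - y'|
    return abs(y - yp) * loss_01(f, x, y, xp, yp)

def loss_gamma(f, x, y, xp, yp, gamma):
    # gamma-ranking loss: a ramp from |y - y'| down to 0, so loss_disc <= loss_gamma
    t = (f(x) - f(xp)) * sgn(y - yp) / gamma
    gap = abs(y - yp)
    if t <= 0:
        return gap
    return gap - t if t < gap else 0.0
```

For example, with the scorer f = lambda x: x and the pair (0.2, 0.0), (0.8, 1.0), the pair is ordered correctly and loss_01 returns 0.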

By using the ranking loss, we define the expected ℓ-error of a function f:

E(f) = R_ℓ(f) = E_{(x,y),(x′,y′)∼ρ×ρ} ℓ(f, (x, y), (x′, y′)).  (1)

The function f_ρ^ℓ that minimizes the error (1) is given by

f_ρ^ℓ = arg min_{f : X→Y} E_{(x,y),(x′,y′)∼ρ×ρ} ℓ(f, (x, y), (x′, y′)),

where the minimum is taken over all measurable functions. In this article we always assume that f_ρ^ℓ exists and satisfies |f_ρ^ℓ(x)| ≤ log m for all x ∈ X. The corresponding empirical ranking error of the expected ranking ℓ-error is defined as

E_z(f) = R̂_ℓ(f) := (2/(m(m−1))) Σ_{i=1}^{m−1} Σ_{j=i+1}^m ℓ(f, (x_i, y_i), (x_j, y_j)).
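As an illustration, the empirical error E_z(f) above is just an average over all m(m−1)/2 sample pairs. The following sketch (our own, assuming `loss` is one of the functions from the previous listing) computes it directly:

```python
from itertools import combinations

def empirical_ranking_error(f, sample, loss):
    # sample: a list of (x_i, y_i) tuples drawn from rho
    # returns (2 / (m(m-1))) * sum_{i<j} loss(f, (x_i,y_i), (x_j,y_j))
    m = len(sample)
    total = sum(loss(f, xi, yi, xj, yj)
                for (xi, yi), (xj, yj) in combinations(sample, 2))
    return 2.0 * total / (m * (m - 1))
```

Because the summands share sample points across pairs, they are not independent; this is exactly the issue addressed with McDiarmid's inequality in Section 4.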


Definition 2. Let ℓ : Y^X × (X × Y) × (X × Y) → R_+ ∪ {0} be a ranking loss function. We say that the ranking loss function ℓ is Lipschitz continuous if

L = sup_{f,f′∈H, f≠f′} sup_{x,x′∈X; y,y′∈Y} |ℓ(f, (x, y), (x′, y′)) − ℓ(f′, (x, y), (x′, y′))| / ( |f(x) − f′(x)| + |f(x′) − f′(x′)| ) < ∞.

We also assume that M_1 = sup_{f∈H; x,x′∈X; y,y′∈Y} |ℓ(f, (x, y), (x′, y′))| < ∞, where H is a hypothesis space. In this article, we consider a suitable set of FNNs as the hypothesis space. The FNNs with one hidden layer and k hidden neurons are real-valued functions on R^d of the form

N(x) = Σ_{i=1}^k c_i σ(α_i^T x + β_i) + c_0,  (2)

where σ : R → R is called an activation function and α_i ∈ R^d, β_i, c_0, c_i ∈ R (i = 1, 2, ..., k) are the parameters that determine the network. For the activation function σ, one often uses a so-called sigmoidal function, i.e., a nondecreasing function satisfying

lim_{t→−∞} σ(t) = 0 and lim_{t→∞} σ(t) = 1.

In this article, we will use the so-called logistic squashing function

σ(t) = 1/(1 + e^{−t})  (t ∈ R)

as the activation function of the networks. We define the set F_{k,m} as the hypothesis space according to (2):

F_{k,m} = { Σ_{i=1}^k c_i σ(α_i^T x + β_i) + c_0 : α_i ∈ R^d, β_i, c_i ∈ R, Σ_{i=0}^k |c_i| ≤ L_m = log m }.

Since σ is bounded in absolute value by 1, the functions in F_{k,m} are bounded in absolute value by L_m. In the following, we consider the ranking learning algorithm

f_z ∈ arg min_{f∈F_{k,m}} (2/(m(m−1))) Σ_{i=1}^{m−1} Σ_{j=i+1}^m ℓ(f, (x_i, y_i), (x_j, y_j)).  (3)
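The hypothesis space F_{k,m} is concrete enough to write out. Below is a minimal sketch, with hypothetical parameter names of our choosing, that evaluates a one-hidden-layer logistic network N(x) and enforces the coefficient constraint Σ_{i=0}^k |c_i| ≤ L_m = log m; algorithm (3) would then minimize the empirical ranking error of this section over such networks (the paper does not prescribe a particular optimizer).

```python
import numpy as np

def logistic(t):
    # logistic squashing function sigma(t) = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-t))

class RankingFNN:
    """One-hidden-layer FNN: N(x) = sum_{i=1}^k c_i * sigma(alpha_i^T x + beta_i) + c0."""

    def __init__(self, alphas, betas, coeffs, c0, m):
        self.alphas = np.atleast_2d(alphas)  # k x d inner weight vectors alpha_i
        self.betas = np.asarray(betas)       # k thresholds beta_i
        self.coeffs = np.asarray(coeffs)     # k outer coefficients c_1, ..., c_k
        self.c0 = c0
        # membership in F_{k,m}: sum_{i=0}^k |c_i| <= L_m = log m,
        # which also bounds |N(x)| by L_m since |sigma| <= 1
        if np.abs(self.coeffs).sum() + abs(c0) > np.log(m):
            raise ValueError("coefficient constraint violated: not in F_{k,m}")

    def __call__(self, x):
        hidden = logistic(self.alphas @ np.atleast_1d(x) + self.betas)
        return float(self.coeffs @ hidden + self.c0)
```

For instance, RankingFNN(alphas=[[1.0]], betas=[0.0], coeffs=[0.5], c0=0.0, m=100) is a member of F_{1,100}, since 0.5 ≤ log 100.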

The efficiency of the algorithm (3) is measured by the expected ℓ-ranking error between f_z and the target function f_ρ^ℓ; i.e., we need to estimate the excess error E(f_z) − E(f_ρ^ℓ) by using the given samples and the approximation property of neural networks.

3. Estimating the approximation error of the ranking learning algorithm by neural networks

Here we would expect that the minimizer of the empirical error, i.e., f_z, is a good approximation of the minimizer f_ρ^ℓ. This is actually true if f_ρ^ℓ can be approximated by functions from F_{k,m}, measured by the decay of the approximation error defined as

D = inf_{f∈F_{k,m}} ( E(f) − E(f_ρ^ℓ) ).

Thus, the excess error E(f_z) − E(f_ρ^ℓ) may be divided into

E(f_z) − E(f_ρ^ℓ) ≤ { E(f_z) − E_z(f_z) + E_z(f_{k,m}) − E(f_{k,m}) } + D,  (4)

where the function f_{k,m} is defined as f_{k,m} = arg min_{f∈F_{k,m}} E(f). In fact,

E(f_z) − E(f_ρ^ℓ) ≤ E(f_z) − E_z(f_z) + E_z(f_z) − E_z(f_{k,m}) + E_z(f_{k,m}) − E(f_{k,m}) + D
≤ { E(f_z) − E_z(f_z) + E_z(f_{k,m}) − E(f_{k,m}) } + D,

where we used the definition of f_z (namely E_z(f_z) ≤ E_z(f_{k,m})) in the last inequality. The first term of (4) is called the sample error, and the second one, which measures the approximation ability of F_{k,m} for ρ, is called the approximation error; the latter has been well understood in Cucker and Smale (2001). Estimating D requires a smoothness assumption on the target; for neural networks such smoothness conditions are usually defined by imposing conditions on the Fourier transform of the minimizer f_ρ^ℓ. The Fourier transform f̂ of a function f ∈ L¹(R^d) is defined by

f̂(ω) = (2π)^{−d/2} ∫_{R^d} f(x) e^{−iω^T x} dx  (ω ∈ R^d).

If f̂ ∈ L¹(R^d), then the inversion formula

f(x) = (2π)^{−d/2} ∫_{R^d} f̂(ω) e^{iω^T x} dω  (5)

holds almost everywhere with respect to the Lebesgue measure. For 0 < C < ∞, we denote by F_C the class of functions for which (5) holds on R^d and, in addition,

∫_{R^d} |f̂(ω)| ∥ω∥_2 dω ≤ C.  (6)

Barron (1993) pointed out that the functions of this class are continuously differentiable on R^d. To estimate D, we need the following lemma (see Theorem 3 of Barron, 1993).

Lemma 3. For every function f ∈ F_C, X ⊆ B_r, and every k ≥ 1, there exists a function f_k ∈ F_{k,m} such that

∫_X (f(x) − f_k(x))² dρ_X ≤ C_0/k,

where C_0 = C_0(r) is a constant which depends only on the constant r.

Theorem 4. If the minimizer f_ρ^ℓ ∈ F_C for some C > 0, and ∥f_ρ^ℓ∥_∞ ≤ log m, then

D = inf_{f∈F_{k,m}} ( E(f) − E(f_ρ^ℓ) ) ≤ 2L √(C_0/k),

where C_0 and L are two constants which are independent of f_ρ^ℓ.


Proof. From Definition 2, we know that for any g ∈ F_{k,m},

inf_{f∈F_{k,m}} ( E(f) − E(f_ρ^ℓ) ) ≤ E(g) − E(f_ρ^ℓ)
= E_{(x,y),(x′,y′)∼ρ×ρ} ℓ(g, (x, y), (x′, y′)) − E_{(x,y),(x′,y′)∼ρ×ρ} ℓ(f_ρ^ℓ, (x, y), (x′, y′))
≤ ∫_{Z×Z} |ℓ(g, (x, y), (x′, y′)) − ℓ(f_ρ^ℓ, (x, y), (x′, y′))| dρ(x, y) dρ(x′, y′)
≤ L ∫_{Z×Z} { |g(x) − f_ρ^ℓ(x)| + |g(x′) − f_ρ^ℓ(x′)| } dρ_X dρ_X dρ(y|x) dρ(y′|x′)
= L ∫_{X×X} { |g(x) − f_ρ^ℓ(x)| + |g(x′) − f_ρ^ℓ(x′)| } dρ_X dρ_X
= 2L ∫_X |g(x) − f_ρ^ℓ(x)| dρ_X
≤ 2L ( ∫_X (g(x) − f_ρ^ℓ(x))² dρ_X )^{1/2} ( ∫_X 1² dρ_X )^{1/2}
= 2L ( ∫_X (g(x) − f_ρ^ℓ(x))² dρ_X )^{1/2},

where the last inequality follows from the Cauchy–Schwarz inequality. Taking g = f_k, Lemma 3 tells us

D = inf_{f∈F_{k,m}} ( E(f) − E(f_ρ^ℓ) ) ≤ 2L √(C_0/k).

This finishes the proof of Theorem 4. □

4. Estimating the convergence rate of the ranking learning algorithm

In this section, we estimate the sample error in (4). The obtained bound, together with the approximation error estimate of Section 3, leads to an estimate of the excess error E(f_z) − E(f_ρ^ℓ). In fact, according to the first part of (4), we obtain

E(f_z) − E_z(f_z) + E_z(f_{k,m}) − E(f_{k,m})
= { E_z(f_{k,m}) − E_z(f_ρ^ℓ) − ( E(f_{k,m}) − E(f_ρ^ℓ) ) } + { E(f_z) − E(f_ρ^ℓ) − ( E_z(f_z) − E_z(f_ρ^ℓ) ) }.  (7)

We know that the main difference in the formulation of the ranking problem, as compared with classification and regression, is that performance is measured on pairs of samples rather than on individual samples. This means that, unlike the empirical error in classification or regression, the empirical error in the ranking problem cannot be expressed as a sum of independent random variables. Indeed, this is the reason why the standard Hoeffding inequality used to derive convergence bounds in classification and regression can no longer be applied to obtain convergence bounds for the ranking error (see Agarwal & Niyogi, 2005). Therefore, we make use of the more general McDiarmid inequality to obtain uniform convergence bounds for the sample error in this article. Our main tool for estimating the sample error of the ranking algorithm is the following powerful concentration inequality of McDiarmid.

Theorem 5 (See McDiarmid, 1989). Let z_1, z_2, ..., z_m be independent random variables, each taking values in a set A, and let φ : A^m → R. Suppose that for each 1 ≤ j ≤ m there exists c_j > 0 such that

sup_{z_1,...,z_m∈A, z_j^∗∈A} |φ(z_1, ..., z_m) − φ(z_1, ..., z_{j−1}, z_j^∗, z_{j+1}, ..., z_m)| ≤ c_j.

Then for any ε > 0,

Prob{ φ(z_1, ..., z_m) − E[φ(z_1, ..., z_m)] ≥ ε } ≤ e^{−2ε² / Σ_{j=1}^m c_j²}.

We first estimate the first part of (7) in the following proposition.

Proposition 6. For 0 < δ ≤ 1, with confidence at least 1 − δ/2, there holds

( E_z(f_{k,m}) − E_z(f_ρ^ℓ) ) − ( E(f_{k,m}) − E(f_ρ^ℓ) ) ≤ √( (8M_1²/m) log(2/δ) ).

The proof follows the proof of a similar result for regression algorithms in Cucker and Smale (2001). In particular, the random variable ( E_z(f_{k,m}) − E_z(f_ρ^ℓ) ) − ( E(f_{k,m}) − E(f_ρ^ℓ) ), representing the difference between the empirical and expected errors of the minimizing function f_{k,m} in the hypothesis space F_{k,m} relative to the target function f_ρ^ℓ, is shown to satisfy the conditions of McDiarmid's inequality (McDiarmid, 1989). Details of the proof are provided in Appendix A.

In the following, we estimate the second part of (7). Because the random variable ξ = ℓ(f_z, (x, y), (x′, y′)) − ℓ(f_ρ^ℓ, (x, y), (x′, y′)) involves the sample z, the estimation is more difficult; we handle it by means of the covering number.

Definition 7. For a set F and ϵ > 0, let ρ_X be the marginal probability distribution on X and p ∈ N ∪ {∞}. An L_p(ρ_X)-ϵ-cover of F is a finite set of functions g_1, ..., g_k ∈ F with the property

min_{1≤j≤k} ( ∫_X |f(x) − g_j(x)|^p dρ_X )^{1/p} = min_{1≤j≤k} ∥f − g_j∥_p ≤ ϵ

for all f ∈ F. The L_p(ρ_X)-ϵ-covering number N_p(F, ϵ) of F is the minimal size of an L_p(ρ_X)-ϵ-cover of F. In case there is no finite L_p(ρ_X)-ϵ-cover of F, the L_p(ρ_X)-ϵ-covering number of F is defined by N_p(F, ϵ) = ∞. Since ∥f∥_p ≤ ∥f∥_∞ for any f ∈ F, we have N_∞(F, ϵ) ≤ N_p(F, ϵ) for p ∈ N ∪ {∞}.

Let B_R = { f ∈ F_{k,m} : ∥f∥_1 = ∫_X |f(x)| dρ_X ≤ R }. The covering number has been extensively studied; see, e.g., Pontil (2003) and Williamson, Smola, and Schölkopf (2001).

To estimate the term ( E(f_z) − E(f_ρ^ℓ) ) − ( E_z(f_z) − E_z(f_ρ^ℓ) ) in (7) concerning the random variable ξ, we need the following probability inequality.

Theorem 8. For all ε > 0, there holds

Prob{ sup_{f∈B_R} ( E(f) − E(f_ρ^ℓ) ) − ( E_z(f) − E_z(f_ρ^ℓ) ) ≤ ε } ≥ 1 − exp( log N_1(B_R, ε/(2L²)) − mε²/(32M_1²) ).

Since the ball B_R ⊂ F_{k,m}, we have N_1(B_R, ε/(2L²)) ≤ N_1(F_{k,m}, ε/(2L²)). Therefore, we only need to estimate the covering number N_1(F_{k,m}, ε/(2L²)). For the set

F_{k,m} = { Σ_{i=1}^k c_i σ(α_i^T x + β_i) + c_0 : α_i ∈ R^d, β_i, c_i ∈ R, Σ_{i=0}^k |c_i| ≤ L_m = log m },

the L_1-covering number can be upper bounded (see the proof of Theorem 16.1 in Györfi, Kohler, Krzyżak, & Walk, 2002) by

N_1(F_{k,m}, ε) ≤ ( 6eL_m(k+1)/ε )^{(2d+5)k+1},  (8)

where e is the base of the natural logarithm. The inequality (8) has already been used in Hamers and Kohler (2006) and Kohler and Mehnert (2011).
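To get a feel for the size of bound (8), the short sketch below (our own; the parameter values are illustrative, not from the paper) evaluates its right-hand side on the log scale, since the raw covering number is astronomically large:

```python
import math

def log_covering_bound(m, k, d, eps):
    # log of the right-hand side of (8):
    # N_1(F_{k,m}, eps) <= (6 e L_m (k+1) / eps) ** ((2d+5)k + 1), with L_m = log m
    L_m = math.log(m)
    exponent = (2 * d + 5) * k + 1
    return exponent * math.log(6 * math.e * L_m * (k + 1) / eps)

# the log-covering number grows linearly in k and d, but only through
# log log m in m; this is what makes the entropy term in Theorem 8 manageable
print(log_covering_bound(m=10_000, k=20, d=5, eps=0.1))
```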


The proof of Theorem 8 makes use of McDiarmid's inequality and is similar to that of Proposition 6, with two main differences. First, in Proposition 6 McDiarmid's inequality is applied to a single fixed function f_{k,m}, whereas in Theorem 8 the probability inequality must hold uniformly over the hypothesis space F_{k,m}. Second, the bounded-difference constants c_j in the two applications of McDiarmid's inequality are different. Details of the proof are provided in Appendix B.

Theorem 9. Let ρ be a probability distribution on Z and let ℓ : Y^X × (X × Y) × (X × Y) → R_+ ∪ {0} be a ranking loss function as in Definition 1. If the minimizer f_ρ^ℓ ∈ F_C for some C > 0, and ∥f_ρ^ℓ∥_∞ ≤ log m, then for any δ > 0, with confidence at least 1 − δ, there holds

E(f_z) − E(f_ρ^ℓ) ≤ A ( (log m)/m )^{1/4} + B √( log(2/δ)/m ),

where A = 16M_1² √( C_0(2d+6) ) and B = 8M_1.
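Since the bound of Theorem 9 is fully explicit, it can be evaluated numerically. The sketch below is our own transcription of the theorem as stated above, with illustrative values for M_1, C_0, d, and δ; it shows how slowly the dominant ((log m)/m)^{1/4} term decays:

```python
import math

def theorem9_bound(m, d, M1, C0, delta):
    # A * ((log m)/m)^(1/4) + B * sqrt(log(2/delta)/m), with
    # A = 16 M1^2 sqrt(C0 (2d+6)) and B = 8 M1 as in Theorem 9
    A = 16 * M1**2 * math.sqrt(C0 * (2 * d + 6))
    B = 8 * M1
    return A * (math.log(m) / m) ** 0.25 + B * math.sqrt(math.log(2 / delta) / m)

# illustrative values (ours); the first term dominates, and the rate in m
# does not deteriorate as the input dimension d grows
for m in (10**3, 10**5, 10**7):
    print(m, theorem9_bound(m, d=5, M1=1.0, C0=1.0, delta=0.05))
```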

The results of Theorem 9 show that the order of the generalization error of the ranking algorithm is independent of the dimension of the input space when the target function f_ρ^ℓ satisfies the smoothness condition. The details of the proof of Theorem 9, combining Proposition 6 with Theorem 8, are given in Appendix C.

5. Comparisons with related work

In Section 4, we studied the convergence performance of ranking learning algorithms in a setting that is more general than what has been considered previously, and derived an upper bound for ranking learning algorithms by using the approximation property of neural networks and the covering number. In this section we discuss how our results relate to other recent studies.

5.1. Comparison with generalization bounds for regression

Our generalization analysis of ranking learning algorithms is based on a similar analysis for regression algorithms by Kohler and Mehnert (2011). There are two differences between our work and theirs. The first is that in the ranking problem performance is measured on pairs of samples, rather than on individual samples as in regression estimation. This means that, unlike the empirical error in regression estimation, the empirical error in the ranking problem cannot be expressed as a sum of independent random variables. Indeed, this is the reason why the standard Hoeffding inequality used for convergence bounds in regression estimation can no longer be applied to obtain generalization bounds for ranking learning algorithms; we obtain our upper bounds on the generalization error of ranking algorithms by using McDiarmid's inequality instead. The second difference between the two generalization bounds is partly due to the difference in loss functions between ranking and regression, and partly due to a slight difference in the decomposition of the generalization error.

5.2. Comparison with the work of Agarwal et al.

The work that is perhaps most closely related to ours is that of Agarwal and Niyogi (2009), who study the generalization properties of ranking learning algorithms using the notion of algorithmic stability in a reproducing kernel Hilbert space, provided the space has stability properties. The sample setting considered by Agarwal et al. is similar to ours: the learner is given a sample set {(x_i, y_i)}_{i=1}^m, and the goal of the ranking problem is to learn a real-valued function f : X → Y which correctly orders the other instances in X on the basis of the random samples. The function f is considered to rank x lower than x′ if f(x) < f(x′), and higher if f(x) > f(x′). Although uniform convergence bounds for ranking learning algorithms rely on the smoothness of the target function, we have obtained an explicit upper bound for ranking learning algorithms whose order is independent of the dimension of the input space. There are two important differences between our work and that of Agarwal and Niyogi (2009). First, Agarwal et al. considered generalization properties of ranking learning algorithms by using the notion of algorithmic stability in a reproducing kernel Hilbert space when the space has stability properties; however, some kernel spaces may not have stability properties, such as the space spanned by a polynomial kernel function. Second, although they studied the generalization properties of ranking learning algorithms, uniform convergence bounds for the ranking learning algorithm were not derived explicitly in Agarwal and Niyogi (2009).

6. Conclusions

This article has studied the generalization performance of ranking algorithms in a setting where ranking preferences among instances are indicated by real-valued labels on the instances. This setting of the ranking problem arises frequently in practice and is more general than the setting of bipartite ranking. In contrast to analyses that derive generalization properties of ranking learning algorithms via algorithmic stability in a reproducing kernel Hilbert space, we give an upper bound on the uniform convergence rate of the ranking learning algorithm by using the approximation property of neural networks and the covering number. The obtained results imply that neural networks are able to adapt to ordering relationships among instances in the instance space X. The upper bounds on the uniform convergence rate of the ranking learning algorithm are considerably tight and independent of the dimension of the input space when the target function f_ρ^ℓ satisfies the smoothness condition. Under this smoothness condition, we are able to circumvent the curse of dimensionality in the ranking learning algorithm by using the approximation property of neural networks.

It is well known that upper estimates of the generalization error of ranking learning algorithms cannot completely characterize the convergence capability of learning algorithms in general, because an established upper estimate might be too loose to reflect their inherent convergence capability. Hence, in order to present the inherent convergence rates of a learning algorithm accurately, we need to estimate not only the upper bound on the convergence rate but also the lower bound. Naturally, the estimation of a lower bound is difficult but significant. So far we are not aware of any lower bound for the generalization error; it is an open question to analyze the lower bound on the generalization error made by ranking learning algorithms.

Appendix A. Proof of Proposition 6

Proof. Let φ : (X × Y)^m → R be defined as follows:

φ(z) = E_z(f_{k,m}) − E_z(f_ρ^ℓ) = (2/(m(m−1))) Σ_{i=1}^{m−1} Σ_{j=i+1}^m [ ℓ(f_{k,m}, (x_i, y_i), (x_j, y_j)) − ℓ(f_ρ^ℓ, (x_i, y_i), (x_j, y_j)) ].


Then, by linearity of expectation,

E_{z∼ρ^m}[φ(z)] = (2/(m(m−1))) Σ_{i=1}^{m−1} Σ_{j=i+1}^m E_{(x_i,y_i),(x_j,y_j)∼ρ×ρ} [ ℓ(f_{k,m}, (x_i, y_i), (x_j, y_j)) − ℓ(f_ρ^ℓ, (x_i, y_i), (x_j, y_j)) ] = E(f_{k,m}) − E(f_ρ^ℓ).

We shall show that the function φ satisfies the condition of Theorem 5. Let z = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m. For each (x_j^∗, y_j^∗) ∈ X × Y, let

z^j = ((x_1, y_1), ..., (x_{j−1}, y_{j−1}), (x_j^∗, y_j^∗), (x_{j+1}, y_{j+1}), ..., (x_m, y_m)).

Then for any 1 ≤ j ≤ m, we have

|φ(z) − φ(z^j)| ≤ |E_z(f_{k,m}) − E_{z^j}(f_{k,m})| + |E_z(f_ρ^ℓ) − E_{z^j}(f_ρ^ℓ)|
≤ (2/(m(m−1))) Σ_{i≠j} | ℓ(f_{k,m}, (x_i, y_i), (x_j, y_j)) − ℓ(f_{k,m}, (x_i, y_i), (x_j^∗, y_j^∗)) |
+ (2/(m(m−1))) Σ_{i≠j} | ℓ(f_ρ^ℓ, (x_i, y_i), (x_j, y_j)) − ℓ(f_ρ^ℓ, (x_i, y_i), (x_j^∗, y_j^∗)) |
≤ (2/(m(m−1))) M_1(m−1) + (2/(m(m−1))) M_1(m−1) = 4M_1/m.

Therefore, applying Theorem 5 to the function φ with c_j = 4M_1/m, for any ε > 0 we get

Prob{ ( E_z(f_{k,m}) − E_z(f_ρ^ℓ) ) − ( E(f_{k,m}) − E(f_ρ^ℓ) ) ≥ ε } ≤ exp( −2ε² / ( m(4M_1/m)² ) ) = exp( −mε²/(8M_1²) ).

Setting

ε = √( (8M_1²/m) log(2/δ) )

makes the right-hand side equal to δ/2. Hence, with confidence at least 1 − δ/2, there holds

( E_z(f_{k,m}) − E_z(f_ρ^ℓ) ) − ( E(f_{k,m}) − E(f_ρ^ℓ) ) ≤ √( (8M_1²/m) log(2/δ) ).

The proof of Proposition 6 is completed. □
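Proposition 6's concentration can be sanity-checked by simulation. The sketch below is entirely our own construction (synthetic distribution, fixed scorer standing in for f_{k,m}, 0–1 loss so that M_1 = 1): it draws repeated samples and compares the deviations between the empirical and (estimated) expected ranking error with the √(8M_1²/m) scale from the proposition.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def draw(m):
    # synthetic distribution: x uniform on [0, 1], noisy monotone label in [0, 1]
    x = rng.random(m)
    y = np.clip(x + 0.3 * rng.standard_normal(m), 0.0, 1.0)
    return list(zip(x, y))

def loss01(f, xi, yi, xj, yj):
    d = (yi - yj) * (f(xi) - f(xj))
    return float(d < 0) + 0.5 * float(f(xi) == f(xj))

def emp_err(f, sample):
    m = len(sample)
    s = sum(loss01(f, xi, yi, xj, yj)
            for (xi, yi), (xj, yj) in combinations(sample, 2))
    return 2.0 * s / (m * (m - 1))

f = lambda x: x                       # fixed scorer, playing the role of f_{k,m}
expected = emp_err(f, draw(1000))     # large-sample proxy for the expected error

m = 100
devs = [abs(emp_err(f, draw(m)) - expected) for _ in range(200)]
# deviations should be on the order of sqrt(8 * M1^2 / m) (log factor omitted)
print(max(devs), np.sqrt(8.0 / m))
```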

Appendix B. Proof of Theorem 8

Proof. Consider the function set F_R defined by

F_R = { ℓ(f, (x, y), (x′, y′)) − ℓ(f_ρ^ℓ, (x, y), (x′, y′)) : f ∈ B_R }.

Each function h ∈ F_R has the form h(z) = ℓ(f, (x, y), (x′, y′)) − ℓ(f_ρ^ℓ, (x, y), (x′, y′)) with f ∈ B_R, and satisfies E(h) = E(f) − E(f_ρ^ℓ). From the proof of Proposition 6, we know that there holds

Prob{ E(h) − E_z(h) ≥ ε } ≤ exp( −mε²/(8M_1²) )

for any h ∈ F_R.

Let J = N_1(F_R, ε/L) and consider h_1, ..., h_J such that the disks D_j centered at h_j and with radius ε/L cover F_R. By Definition 2, for all z ∈ Z^m and all h ∈ D_j, we have

|E(h) − E_z(h) − ( E(h_j) − E_z(h_j) )| ≤ 2L ∥h − h_j∥_1 ≤ 2ε.

Since this holds for all z ∈ Z^m and all h ∈ D_j, we get

sup_{h∈D_j} ( E(h) − E_z(h) ) ≥ 4ε ⟹ E(h_j) − E_z(h_j) ≥ 2ε.

We conclude that, for j = 1, 2, ..., J,

Prob{ sup_{h∈D_j} ( E(h) − E_z(h) ) ≥ 4ε } ≤ Prob{ E(h_j) − E_z(h_j) ≥ 2ε } ≤ exp( −mε²/(2M_1²) ).

Therefore

Prob{ sup_{h∈F_R} ( E(h) − E_z(h) ) ≥ 4ε } ≤ Σ_{j=1}^J Prob{ sup_{h∈D_j} ( E(h) − E_z(h) ) ≥ 4ε } ≤ J exp( −mε²/(2M_1²) ).

According to the definition of the functions in F_R, we know that

|h_1(z) − h_2(z)| = | ℓ(f_1, (x, y), (x′, y′)) − ℓ(f_ρ^ℓ, (x, y), (x′, y′)) − ℓ(f_2, (x, y), (x′, y′)) + ℓ(f_ρ^ℓ, (x, y), (x′, y′)) |
= | ℓ(f_1, (x, y), (x′, y′)) − ℓ(f_2, (x, y), (x′, y′)) |
≤ L ( |f_1(x) − f_2(x)| + |f_1(x′) − f_2(x′)| ).

Therefore ∥h_1 − h_2∥_1 ≤ 2L ∥f_1 − f_2∥_1, which implies that

log N_∞(F_R, ε/L) ≤ log N_∞(B_R, ε/(2L²)) ≤ log N_1(B_R, ε/(2L²)).

Replacing ε by ε/4, we get for all ε > 0

Prob{ sup_{f∈B_R} ( E(f) − E(f_ρ^ℓ) ) − ( E_z(f) − E_z(f_ρ^ℓ) ) ≤ ε } ≥ 1 − exp( log N_1(B_R, ε/(2L²)) − mε²/(32M_1²) ).

The proof of Theorem 8 is finished. □

Appendix C. Proof of Theorem 9

Proof. Let

h(ε) = log N_1(F_{k,m}, ε/(2L²)) − mε²/(32M_1²).

We may choose ε_R to be the positive solution of the equation

log N_1(F_{k,m}, ε/(2L²)) − mε²/(32M_1²) = log(δ/2).  (9)

The function h : R_+ → R is strictly decreasing, so ε_R ≤ ε∗ whenever h(ε∗) ≤ log(δ/2). From (8), we know that for ε ≥ 1/m,

log N_1(F_{k,m}, ε/(2L²)) ≤ ((2d+5)k + 1) log( 12L²eL_m(k+1)/ε ) ≤ (2d+6)k log( 12L²eL_m(k+1)m ).

So we get

h(ε) ≤ (2d+6)k log( 12L²eL_m(k+1)m ) − mε²/(32M_1²).

Thus, if we take ε∗ to be a positive number satisfying the inequality

(2d+6)k log( 12L²eL_m(k+1)m ) − m(ε∗)²/(32M_1²) ≤ log(δ/2),

then h(ε∗) ≤ log(δ/2). This inequality is satisfied by ε∗ written as

ε∗ = √( (32M_1²/m)( log(2/δ) + (2d+6)k log(12L²eL_m(k+1)m) ) ),

and therefore, taking the restriction ε ≥ 1/m into account,

ε_R ≤ √( (32M_1²/m)( log(2/δ) + (2d+6)k log(12L²eL_m(k+1)m) ) ) + 1/m.

We apply Theorem 8 to ε = ε_R and obtain that there exists a set V_R′ ⊆ Z^m, whose measure is at most δ/2, such that for all f ∈ B_R and z ∈ Z^m ∖ V_R′,

( E(f) − E(f_ρ^ℓ) ) − ( E_z(f) − E_z(f_ρ^ℓ) ) ≤ √( (32M_1²/m)( log(2/δ) + (2d+6)k log(12L²eL_m(k+1)m) ) ) + 1/m.

In particular, for any z ∈ Z^m ∖ V_R′, there holds

( E(f_z) − E(f_ρ^ℓ) ) − ( E_z(f_z) − E_z(f_ρ^ℓ) ) ≤ √( (32M_1²/m)( log(2/δ) + (2d+6)k log(12L²eL_m(k+1)m) ) ) + 1/m.

According to Proposition 6, there exists another set V_R″ ⊆ Z^m, whose measure is at most δ/2, such that for all z ∈ Z^m ∖ V_R″,

( E_z(f_{k,m}) − E_z(f_ρ^ℓ) ) − ( E(f_{k,m}) − E(f_ρ^ℓ) ) ≤ √( (8M_1²/m) log(2/δ) ).

Combining the above bounds with (4), (7), Definition 2, and Theorem 4, we can conclude that for all z ∈ Z^m ∖ (V_R′ ∪ V_R″),

E(f_z) − E(f_ρ^ℓ) ≤ √( (32M_1²/m)( log(2/δ) + (2d+6)k log(12L²eL_m(k+1)m) ) ) + 1/m + √( (8M_1²/m) log(2/δ) ) + 2M_1 √(C_0/k)
≤ √( 32M_1²(2d+6)k log(12L²eL_m(k+1)m)/m ) + 1/m + √( (32M_1²/m) log(2/δ) ) + 2M_1 √(C_0/k),

and this expression is minimized for

k = C∗ √( m/log m ),

where C∗ = 4M_1² √( C_0(2d+6) ). Altogether, we get

E(f_z) − E(f_ρ^ℓ) ≤ A ( (log m)/m )^{1/4} + B √( log(2/δ)/m ),

where A = 16M_1² √( C_0(2d+6) ) and B = 8M_1. The proof of Theorem 9 is completed. □

References

Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. In Proceedings of the 18th annual conference on learning theory.
Agarwal, S., & Niyogi, P. (2009). Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10, 441–474.
Arrow, K. J. (1970). Social choice and individual values (2nd ed.). Yale University Press.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–944.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd international conference on machine learning.
Cao, F. L., Zhang, Y. Q., & He, Z. R. (2009). Interpolation and rate of convergence by a class of neural networks. Applied Mathematical Modelling, 33, 1441–1456.
Cao, F. L., Zhang, Y. Q., & Xu, Z. B. (2009). The lower estimation of approximation rate for neural networks. Science in China Series F: Information Sciences, 52, 1283–1490.
Chen, T. P., & Chen, H. (1995a). Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Transactions on Neural Networks, 6, 904–910.
Chen, T. P., & Chen, H. (1995b). Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6, 911–917.
Chen, T. P., Chen, H., & Liu, R. (1995). Approximation capability in C(R^n) by multilayer feedforward networks and related problems. IEEE Transactions on Neural Networks, 6, 25–30.
Chiang, A. C., & Wainwright, K. (2005). Fundamental methods of mathematical economics (4th ed.). McGraw-Hill Irwin.
Chui, C. K., & Li, X. (1992). Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70, 131–141.
Clemencon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. Annals of Statistics, 36, 844–874.
Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10, 243–270.
Cortes, C., Mohri, M., & Rastogi, A. (2007). Magnitude-preserving ranking algorithms. In Proceedings of the 24th international conference on machine learning.
Cossock, D., & Zhang, T. (2006). Subset ranking using regression. In Proceedings of the 19th annual conference on learning theory.
Crammer, K., & Singer, Y. (2002). Pranking with ranking. Advances in Neural Information Processing Systems, 14, 641–647.
Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39, 1–49.
Cucker, F., & Smale, S. (2002). Best choices for regularization parameters in learning theory: on the bias–variance problem. Foundations of Computational Mathematics, 2, 413–428.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer.
Hamers, M., & Kohler, M. (2006). Nonasymptotic bounds on the L2 error of neural network regression estimates. Annals of the Institute of Statistical Mathematics, 58, 131–151.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Advances in large margin classifiers (pp. 115–132).
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251–257.
Kohler, M., & Mehnert, J. (2011). Analysis of the rate of convergence of least squares neural network regression estimates in case of measurement errors. Neural Networks, 24, 273–279.
Lehmann, E. L. (1975). Nonparametrics: statistical methods based on ranks. Holden-Day.
Maiorov, V., & Meir, R. S. (1998). Approximation bounds for smooth functions in C(R^d) by neural and mixture networks. IEEE Transactions on Neural Networks, 9, 969–978.
McDiarmid, C. (1989). On the method of bounded differences. In Surveys in combinatorics (pp. 148–188). Cambridge University Press.
Mhaskar, H. N., & Micchelli, C. A. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13, 350–373.
Pontil, M. (2003). A note on different covering numbers in learning theory. Journal of Complexity, 19, 665–671.
Radlinski, F., & Joachims, T. (2005). Query chains: learning to rank from implicit feedback. In Proceedings of the ACM conference on knowledge discovery and data mining.
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264–280.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2001). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47, 2516–2532.
Xu, Z. B., & Cao, F. L. (2005). Simultaneous L^p approximation order for neural networks. Neural Networks, 18, 914–923.