J. Math. Anal. Appl. 486 (2020) 123881
Debiased magnitude-preserving ranking: Learning rate and bias characterization

Hong Chen a,*, Yingjie Wang b, Biqin Song a, Han Li b,**

a College of Science, Huazhong Agricultural University, Wuhan 430070, China
b College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
Article history: Received 11 September 2018; Available online 20 January 2020; Submitted by U. Stadtmueller.

Keywords: Ranking; Bias correction; Integral operator; Sampling operator; Convergence rate; Reproducing kernel Hilbert space.
Abstract

Magnitude-preserving ranking (MPRank) under the Tikhonov regularization framework has shown competitive performance on information retrieval, besides theoretical advantages in computational feasibility and statistical guarantees. In this paper, we further characterize the learning rate and asymptotic bias of MPRank, and then propose a new debiased ranking algorithm. In terms of the operator representation and approximation techniques, we establish their convergence rates and bias characterizations. These theoretical results demonstrate that the new model has a smaller asymptotic bias than MPRank and can achieve a satisfactory convergence rate under appropriate conditions. In addition, some empirical examples are provided to verify the effectiveness of our debiased strategy.

© 2020 Elsevier Inc. All rights reserved.
1. Introduction

Magnitude-preserving ranking (MPRank) has been used successfully for recommendation tasks, document retrieval and drug discovery, see e.g., [10,7,18,36,4]. The original motivation of MPRank [10,1] is to construct an estimator that preserves the magnitude $y - y'$, where $y$ and $y'$ are the output scores for the inputs $x$ and $x'$, respectively. MPRank is closely related to the minimum error entropy criterion [15], and can be formulated as a Tikhonov regularization scheme, namely the sum of the least-squares ranking loss [10,1] and a kernel-norm regularizer in a reproducing kernel Hilbert space (RKHS). In particular, the predictor of MPRank can be obtained from a linear system [5,36,35] in terms of the operator representation. Theoretical properties of MPRank, especially its generalization bounds and learning rates, have been investigated via algorithmic stability in [10,1], operator approximation in [5,18], and concentration estimates associated with U-statistics [9,26,20,35]. Recently, a stochastic gradient algorithm was proposed
in [7] to reduce the computational complexity of MPRank with guarantees on generalization performance. Moreover, a multi-scale MPRank is formulated in [36] to learn non-flat ranking functions, where the generalization bound is derived from a concentration estimate with covering numbers. Recently, in [4], MPRank has been used successfully for structured prediction problems by combining it with input output kernel regression.

Following these works, this paper revisits the theoretical properties of MPRank, with particular emphasis on bias characterization and convergence analysis. It is well known that bias correction is an important strategy to improve the learning performance of regularized algorithms for classification [21] and regression [19,32,12,23]. For regularization kernel networks (RKN), a debiased version is proposed in [32] with an asymptotic analysis of the bias and variance. Moreover, Guo et al. [12] establish error bounds and learning rates of the debiased RKN in the distributed setting under the divide-and-conquer strategy. In [19], a communication-efficient approach is devised for sparse regression in the high-dimensional setting, and its theoretical properties, including the convergence rate and model selection consistency, are established. For the ℓ1-norm linear support vector machine (SVM), Lian and Fan [21] formulate the corresponding debiased model and show that it achieves satisfactory approximation of the model coefficients. Recently, a debiased estimator for partially linear models is proposed in [23], where its effectiveness is verified by a non-asymptotic oracle analysis and experimental evaluation on simulated data. Although these works have made rapid progress in understanding the debiased strategy, all of them are limited to regularization schemes with pointwise losses, e.g., the least-squares loss in [32,12,19] and the hinge loss in [21]. It is therefore natural and important to explore the theoretical properties of debiased regularized algorithms with pairwise losses, e.g., the ranking losses used in MPRank [10,1,25] and RankSVM [14,17,6].

In this article, we propose a debiased magnitude-preserving ranking (DMPRank) algorithm and analyze its asymptotic bias, computation, and learning rate. The estimator of our DMPRank can be represented explicitly by the ranking integral operator and the sampling operator constructed in [5]. By developing the operator approximation techniques in [5,18], we derive the bias characterizations of MPRank and DMPRank, respectively. Meanwhile, we establish the learning rates of DMPRank under appropriate conditions, as well as a refined error estimate for MPRank. It should be noticed that the bias characterization of MPRank has been discussed recently in [13] based on the operator representation [6] and a concentration inequality for U-statistics [16]. Although our paper is closely related to [13], our analysis and applications have two distinctive features. In theory, we establish the learning rate, the coefficient equation, and the bias characterization simultaneously for MPRank (when the target function belongs to the RKHS), while [13] only considered the bias characterization. In addition, our analysis is based on the operator approximation techniques in [6,18] and the second order operator decomposition [22], which differs from the error analysis in [13] associated with U-statistics.
In applications, we evaluate DMPRank on simulated and real data, which fills the gap in empirical validation of the debiased strategy.

The rest of this paper is arranged as follows. In Section 2, we recall MPRank and present its properties. In Section 3, we formulate DMPRank and state our theoretical results. After providing the technical proofs in Section 4, we evaluate the proposed approach empirically in Section 5. Section 6 concludes this paper.

2. Magnitude-preserving ranking

Regularized magnitude-preserving ranking (MPRank) is a popular kernel method for pairwise ranking due to its generalization guarantees [10,1,5,35] and empirical effectiveness [18,7,4]. Suppose a set of $n$ input-output pairs $z = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ is sampled randomly and independently according to an unknown probability distribution $\rho(x, y)$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is a compact metric space and $\mathcal{Y} \subset \mathbb{R}$.
For given observations $z = (x, y)$, $z' = (x', y') \in \mathcal{Z}$ and $f : \mathcal{X} \to \mathbb{R}$, the least-squares ranking loss
$$\ell(f, z, z') = \big(|y - y'| - \mathrm{sgn}(y - y')(f(x) - f(x'))\big)^2 = \big(y - y' - (f(x) - f(x'))\big)^2$$
is used in [10,1] to preserve the magnitude $y - y'$. The main goal of magnitude-preserving ranking is to learn a ranking function $f : \mathcal{X} \to \mathbb{R}$ with minimal expected ranking risk
$$\mathcal{E}(f) = \int_{\mathcal{Z}}\int_{\mathcal{Z}} \big(y - y' - (f(x) - f(x'))\big)^2 \, d\rho(x, y)\, d\rho(x', y').$$
It has been proved in [5,7,15] that $f_\rho$ is the minimizer of $\mathcal{E}(f)$ over all measurable functions, where
$$f_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y|x), \quad x \in \mathcal{X}.$$
The MPRank algorithm is formulated under a Tikhonov regularization scheme with a reproducing kernel Hilbert space (RKHS). The RKHS $\mathcal{H}_K$ associated with a Mercer kernel $K$ is defined to be the closure of the function class spanned by $\{K(x, \cdot), x \in \mathcal{X}\}$. For $\mathcal{H}_K$, the inner product satisfies $\langle K(x', \cdot), K(x, \cdot)\rangle_K = K(x, x')$ and the reproducing property gives $f(x) = \langle f, K(x, \cdot)\rangle_K$. In particular, for any $f \in \mathcal{H}_K$, there holds $\|f\|_\infty \le \kappa \|f\|_K$, where $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$. Detailed properties of RKHS and Mercer kernels can be found in [3,11].
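As a concrete illustration of these reproducing-kernel facts (a sketch of ours, not code from the paper), the following Python snippet builds $f = \sum_i c_i K(x_i,\cdot)$ for a Gaussian kernel, evaluates it through the kernel matrix, and numerically checks the bound $\|f\|_\infty \le \kappa\|f\|_K$; the kernel, bandwidth, and sample points are arbitrary choices made here for illustration.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / gamma^2) for rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / gamma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 3))      # sample points x_1, ..., x_n (illustrative)
c = rng.normal(size=20)                   # coefficients of f = sum_i c_i K(x_i, .)

K = gaussian_kernel(X, X)                 # kernel matrix (K(x_i, x_j))
f_at_X = K @ c                            # reproducing property: f(x_j) = sum_i c_i K(x_i, x_j)
f_norm_K = np.sqrt(c @ K @ c)             # RKHS norm: ||f||_K^2 = c^T K c

# Check ||f||_inf <= kappa * ||f||_K on a set of test points.
X_test = rng.uniform(-1, 1, size=(2000, 3))
f_at_test = gaussian_kernel(X_test, X) @ c
kappa = 1.0                               # sup_x sqrt(K(x, x)) = 1 for the Gaussian kernel
assert np.max(np.abs(f_at_test)) <= kappa * f_norm_K + 1e-9
```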
For a ranking function $f : \mathcal{X} \to \mathbb{R}$, the discrete version of $\mathcal{E}(f)$ associated with the empirical observations $z$ is called the empirical ranking risk, which is defined by
$$\mathcal{E}_z(f) = \frac{1}{n^2} \sum_{i,j=1}^{n} \big(y_i - y_j - (f(x_i) - f(x_j))\big)^2.$$
It has been shown in [5,7] that $\mathbb{E}\,\mathcal{E}_z(f) = \frac{n-1}{n}\,\mathcal{E}(f)$. Given training samples $z$, MPRank (formulated in [10,1,5]) estimates the score function $f_\rho$ by the following regularization scheme
$$f_z = \arg\min_{f \in \mathcal{H}_K} \Big\{ \mathcal{E}_z(f) + \lambda \|f\|_K^2 \Big\}, \qquad (1)$$
where $\lambda > 0$ is the regularization parameter. Recall that the difference between MPRank and the regularized least squares regression in [29,11] has been well discussed in [7,15,36]. By introducing the ranking integral operator and the sampling operator, [5,7] establish the representer theorem for the minimizer $f_z$ and show that the solution of (1) can be obtained easily via a linear system. Given a discrete subset $x = \{x_i\}_{i=1}^n$ belonging to $\mathcal{X}$, define the sampling operator $S_x : \mathcal{H}_K \to \mathbb{R}^n$ (see e.g., [28,29,27,30]) as
$$S_x(f) = (f(x_1), \dots, f(x_n))^T, \quad \forall f \in \mathcal{H}_K. \qquad (2)$$
The dual operator of $S_x$ is denoted by $S_x^*$, where
$$S_x^* c = \sum_{i=1}^{n} c_i K(x_i, \cdot), \quad c = (c_1, \dots, c_n)^T \in \mathbb{R}^n. \qquad (3)$$
Denote $D = nI - \mathbf{1}\mathbf{1}^T$, where $I$ is the $n \times n$ identity matrix and $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^n$. Let $Y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$ and let $K = (K(x_i, x_j))_{i,j=1}^n \in \mathbb{R}^{n \times n}$. For completeness, we recall the representer theorem (see [5,36]) for $f_z$ in (1).

Theorem 1. [5,36] For any given $z \in \mathcal{Z}^n$, the minimizer $f_z$ of (1) can be expressed as
$$f_z = \Big(\frac{1}{n^2} S_x^* D S_x + \frac{\lambda}{2} I\Big)^{-1} \frac{1}{n^2} S_x^* D Y. \qquad (4)$$
Moreover, we have $f_z = \sum_{i=1}^{n} \alpha_{z,i} K(x_i, \cdot)$, where $\alpha_z = (\alpha_{z,1}, \dots, \alpha_{z,n})^T \in \mathbb{R}^n$ is the unique solution of the linear system
$$\Big(DK + \frac{\lambda n^2}{2} I\Big)\alpha = DY. \qquad (5)$$
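For concreteness, here is a minimal Python sketch (our illustration, not the authors' code) that assembles $D$, $K$, and $Y$ and solves the linear system (5) for the MPRank coefficients; the Gaussian kernel and the toy data are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / gamma ** 2)

def mprank_fit(X, y, lam, gamma=1.0):
    """Solve (D K + (lam n^2 / 2) I) alpha = D Y for the MPRank coefficients (Theorem 1)."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, X, gamma)
    D = n * np.eye(n) - np.ones((n, n))           # D = nI - 11^T
    A = D @ K + 0.5 * lam * n ** 2 * np.eye(n)
    return np.linalg.solve(A, D @ y)

def mprank_predict(alpha, X_train, X_test, gamma=1.0):
    """Evaluate f_z(x) = sum_i alpha_i K(x_i, x) at the test inputs."""
    return gaussian_kernel_matrix(X_test, X_train, gamma) @ alpha

# toy usage
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(50, 3))
y_train = X_train.sum(axis=1) + 0.1 * rng.normal(size=50)
alpha = mprank_fit(X_train, y_train, lam=0.01)
scores = mprank_predict(alpha, X_train, rng.uniform(0, 1, size=(10, 3)))
```

Since $DK + \frac{\lambda n^2}{2}I$ is in general non-symmetric, a general linear solver is used in the sketch rather than a symmetric factorization.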
As shown in [8,36,35], the above theorem demonstrates the relationship between regularized least-squares ranking and regression. In particular, the operator representation is useful for the recent ranking theory analysis in [18,7]. To establish the approximation analysis of $f_z$ to $f_\rho$, the ranking integral operator $L_K : \mathcal{H}_K \to \mathcal{H}_K$ is introduced in [5], which is defined as
$$L_K f = \int_{\mathcal{X}}\int_{\mathcal{X}} f(x)\big(K(x, \cdot) - K(x', \cdot)\big)\, d\rho_X(x)\, d\rho_X(x'). \qquad (6)$$
Recall that $L_K$ is a self-adjoint positive operator on $\mathcal{H}_K$ with $\|L_K\| \le 2\kappa^2$ [5,18], and
$$\mathbb{E}\Big[\frac{1}{n^2} S_x^* D S_x\Big] = \frac{n-1}{n} L_K.$$
In the sequel, for simplicity, we denote
$$L_{K,x} = \frac{1}{n^2} S_x^* D S_x. \qquad (7)$$
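To make the operator $L_{K,x}$ tangible, the following Python sketch (ours, under the same illustrative Gaussian-kernel setup as above) checks at the sample points that, for $f = \sum_i \alpha_i K(x_i,\cdot)$, the element $L_{K,x} f$ has expansion coefficients $\frac{1}{n^2} D K \alpha$ and coincides with the pairwise sum $\frac{1}{n^2}\sum_{i,j} f(x_i)\big(K(x_i,\cdot) - K(x_j,\cdot)\big)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
X = rng.normal(size=(n, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # kernel matrix
D = n * np.eye(n) - np.ones((n, n))                               # D = nI - 11^T

alpha = rng.normal(size=n)            # f = sum_i alpha_i K(x_i, .)
f_vals = K @ alpha                    # S_x f = (f(x_1), ..., f(x_n))

# L_{K,x} f = (1/n^2) S_x* D S_x f has expansion coefficients beta = (1/n^2) D K alpha ...
beta = D @ K @ alpha / n ** 2
lkx_f_vals = K @ beta                 # (L_{K,x} f)(x_k) for each sample point x_k

# ... and equals the pairwise sum (1/n^2) sum_{i,j} f(x_i) (K(x_i, .) - K(x_j, .)).
pairwise = np.zeros(n)
for i in range(n):
    for j in range(n):
        pairwise += f_vals[i] * (K[i] - K[j])
pairwise /= n ** 2
assert np.allclose(lkx_f_vals, pairwise)
```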
Now, we introduce the basic assumption used in this paper.

Assumption 1. Assume that $f_\rho \in \mathcal{H}_K$, $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)} < \infty$, and $|y| \le M$ almost surely for any $(x, y) \in \mathcal{Z}$.
By introducing the data-independent intermediate function
$$f_\lambda = \Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K f_\rho, \qquad (8)$$
we get a refined error estimate for MPRank following the operator approximation techniques in [5,18,31]. For completeness, we present the proof in Section 4.

Theorem 2. Let Assumption 1 hold. For any $0 < \delta < 1$, with confidence at least $1 - \delta$, it holds that
$$\|f_z - f_\lambda\|_K \le \frac{48(\kappa M + \kappa^2 \|f_\rho\|_K)\log(4/\delta)}{\lambda \sqrt{n}} + \frac{8\kappa^2 \|f_\rho\|_K}{\lambda n}.$$
Moreover, if $L_K^{-r} f_\rho \in \mathcal{H}_K$ for some $r \in (0,1]$, then setting $\lambda = n^{-\frac{1}{2r+2}}$, we have
$$\|f_z - f_\rho\|_K \le C \log(4/\delta)\, n^{-\frac{r}{2r+2}}$$
with confidence at least $1 - \delta$, where $C$ is a positive constant independent of $n$ and $\delta$.

Theorem 2 is a refinement of the previous analysis in [5] on the approximation of $f_z$ to $f_\rho$, which improves the learning rate estimate from $O(n^{-\frac{r}{2r+3}})$ in [5] to $O(n^{-\frac{r}{2r+2}})$. In essence, the current result is consistent with the convergence analysis in [18] for regularized ranking algorithms. The above approximation analysis also implies a bound for the excess risk $\mathcal{E}(f_z) - \mathcal{E}(f_\rho)$ in terms of the properties of $\mathcal{E}(f)$ in [15,35].

As illustrated in [32,12,21], the bias characteristic is important for regularization methods and can provide new insights for constructing bias correction strategies. As demonstrated in [10,5,18], the learning performance of MPRank depends heavily on the difference between $f_z$ and $f_\rho$ in the $L^2_{\rho_X}$ sense. Here, we introduce
$$b(n, \lambda) = \|\mathbb{E}_z f_z - f_\rho\|_{L^2_{\rho_X}}$$
to measure the bias of MPRank. We now present the asymptotic result for $b(n, \lambda)$; a detailed proof is provided in the next section.

Theorem 3. Let Assumption 1 hold. Then,
(i) $\mathbb{E}_z f_z$ converges to $(L_K + \frac{\lambda}{2} I)^{-1} L_K f_\rho$ in $\mathcal{H}_K$, and the asymptotic bias of $f_z$ is $-\frac{\lambda}{2}(L_K + \frac{\lambda}{2} I)^{-1} f_\rho$.
(ii) Let $\{\tau_j\}$ and $\{\phi_j\}$ be the eigenvalues and eigenfunctions of $L_K$. If $f_\rho = \sum_{j:\tau_j \neq 0} \alpha_j \phi_j \in \mathcal{H}_K$, then
$$\lim_{n \to \infty} b(n, \lambda) = \sum_{j:\tau_j \neq 0} \frac{\lambda^2 \alpha_j^2}{(\lambda + 2\tau_j)^2}.$$
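As a worked illustration of part (ii) (our example, with illustrative numbers that are not from the paper), consider a target supported on a single eigenfunction:

```latex
% Single-component example for Theorem 3(ii) with illustrative values
% alpha_1 = 1, tau_1 = 1, lambda = 0.1 (all chosen here for illustration).
\[
  \lim_{n\to\infty} b(n,\lambda)
  = \frac{\lambda^{2}\alpha_1^{2}}{(\lambda+2\tau_1)^{2}}
  = \frac{(0.1)^{2}\cdot 1}{(2.1)^{2}}
  = \frac{0.01}{4.41}
  \approx 2.3\times 10^{-3},
\]
% so, for a fixed eigenvalue tau_1 > 0, the asymptotic bias decays at rate lambda^2.
```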
Theorem 3 demonstrates the bias characterization of MPRank in terms of the regularization parameter $\lambda$, the ranking integral operator $L_K$, and the property of the target function $f_\rho$. This result extends the previous bias characterization for least-squares regression in [32,12] to the ranking setting.

3. Debiased magnitude-preserving ranking

Theorem 3 shows that the asymptotic bias of the MPRank estimator is $-\frac{\lambda}{2}(L_K + \frac{\lambda}{2} I)^{-1} f_\rho$, which motivates us to obtain an asymptotically unbiased estimator by subtracting this bias from $f_z$. However, we cannot obtain the unbiased estimator directly, as both $L_K$ and $f_\rho$ are unknown. A natural strategy is to replace $L_K$ with its empirical version $L_{K,x}$ (i.e., $\frac{1}{n^2} S_x^* D S_x$) and to replace $f_\rho$ with the derived estimator $f_z$. Thus, the new debiased magnitude-preserving ranking (DMPRank) estimator is defined as the plug-in estimator
$$\tilde f_z = f_z + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1} f_z. \qquad (9)$$
Then, we can deduce that
$$\tilde f_z = \Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-2}\big(L_{K,x} + \lambda I\big)\frac{1}{n^2} S_x^* D Y.$$
Following the solution characterization of MPRank in [5,36,35], we can derive the following linear system to obtain $\tilde f_z$.
Theorem 4. Denote $\tilde f_z = \sum_{i=1}^{n} \tilde\alpha_{z,i} K(x_i, \cdot)$, where $\tilde\alpha_z = (\tilde\alpha_{z,1}, \dots, \tilde\alpha_{z,n})^T \in \mathbb{R}^n$. Then
$$\Big(DK + \frac{\lambda n^2}{2} I\Big)\tilde\alpha_z = \big(DK + \lambda n^2 I\big)\alpha_z, \qquad (10)$$
where $\alpha_z = (\alpha_{z,1}, \dots, \alpha_{z,n})^T \in \mathbb{R}^n$ is the coefficient vector of MPRank in Theorem 1.

Theorem 4 provides a quantitative relationship between the coefficient vector $\tilde\alpha_z$ of DMPRank and the vector $\alpha_z$ of MPRank. It also illustrates that the proposed DMPRank can be implemented by combining the linear systems (5) and (10), as sketched below.
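The following Python sketch (ours, in the same illustrative Gaussian-kernel setup as the MPRank snippet in Section 2) implements this two-step procedure: it solves (5) for $\alpha_z$ and then (10) for the debiased coefficients $\tilde\alpha_z$.

```python
import numpy as np

def dmprank_fit(X, y, lam, gamma=1.0):
    """Debiased MPRank: solve (5) for alpha_z, then (10) for the debiased coefficients."""
    n = X.shape[0]
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) / gamma ** 2)
    D = n * np.eye(n) - np.ones((n, n))

    # MPRank coefficients: (D K + (lam n^2 / 2) I) alpha = D Y               -- system (5)
    A = D @ K + 0.5 * lam * n ** 2 * np.eye(n)
    alpha = np.linalg.solve(A, D @ y)

    # Debiased coefficients: (D K + (lam n^2 / 2) I) alpha_tilde
    #                              = (D K + lam n^2 I) alpha                 -- system (10)
    rhs = (D @ K + lam * n ** 2 * np.eye(n)) @ alpha
    alpha_tilde = np.linalg.solve(A, rhs)
    return alpha, alpha_tilde
```

Note that (5) and (10) share the same coefficient matrix, so the extra cost of debiasing over MPRank is essentially one additional solve (the same factorization can be reused).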
Define the bias of the approximation of $\tilde f_z$ to $f_\rho$ as $\tilde b(n, \lambda) = \|\mathbb{E}_z \tilde f_z - f_\rho\|_{L^2_{\rho_X}}$. To characterize this bias, we introduce the plug-in estimator
$$\tilde f_\lambda = f_\lambda + \frac{\lambda}{2}\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} f_\lambda. \qquad (11)$$
Moreover, combining (8) and (11), we deduce that
$$\tilde f_\lambda = \Big(L_K + \frac{\lambda}{2} I\Big)^{-2}\big(L_K + \lambda I\big) L_K f_\rho. \qquad (12)$$
We are now in a position to present the error bounds and learning rates for DMPRank.

Theorem 5. Let Assumption 1 hold. Then, for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\|\tilde f_z - \tilde f_\lambda\|_K \le 2\|f_z - f_\lambda\|_K + \frac{48\kappa^2 \|f_\rho\|_K}{\lambda \sqrt{n}}\log(2/\delta) + \frac{4\kappa^2 \|f_\rho\|_K}{\lambda n}.$$
Moreover, if $L_K^{-2r} f_\rho \in \mathcal{H}_K$ for $r \in (0, 1]$, then setting $\lambda = n^{-\frac{1}{4r+2}}$, we have
$$\|\tilde f_z - f_\rho\|_K \le C \log(2/\delta)\, n^{-\frac{r}{2r+1}}$$
with confidence at least $1 - \delta$, where $C$ is a positive constant independent of $n$ and $\delta$.

Theorem 5 establishes the quantitative relationship between $\|\tilde f_z - \tilde f_\lambda\|_K$ and $\|f_z - f_\lambda\|_K$, which implies that the approximation ability of DMPRank relies heavily on that of MPRank. Meanwhile, our analysis also shows that DMPRank can achieve the approximation rate with polynomial decay $O(n^{-\frac{r}{2r+1}})$ under mild conditions. In particular, the derived approximation rate improves the previous convergence results, e.g., $O(n^{-\frac{r}{2r+3}})$ for MPRank in [5] and $O(n^{-\frac{r}{2r+2}})$ for generalized regularized ranking algorithms [18]. Recently, a fast learning rate close to $O(n^{-1})$ was obtained for MPRank in [35] under proper conditions on the hypothesis space capacity and the approximation error, where the analysis involves the Hoeffding decomposition for U-statistics [26,9,20] and concentration estimates with empirical covering numbers [33]. Different from [35], we apply the constructed operator approximation to obtain a capacity-independent learning rate. It should be noticed that our result extends the bias characterization and convergence analysis from the pointwise regularization kernel networks in [32,12] to the pairwise MPRank. Moreover, the approximation result in expectation can be obtained by combining Theorem 5 with Lemma 11 in [12].
Corollary 1. Under the conditions in Theorem 5, we have
$$\mathbb{E}_z \|\tilde f_z - f_\rho\|_K^2 = O\big(n^{-\frac{2r}{2r+1}}\big).$$
Notice that the current learning rate $O(n^{-\frac{2r}{2r+1}})$ in expectation is comparable with the error bound $O(\frac{\ln n}{n})$ for the ELMRank [8]. When $n^{\frac{1-2r}{2r+1}} \le \ln n$, the approximation rate derived here is faster than the related result in [8]. Now we state the bias characterization for DMPRank and provide its proof in the next section.
Theorem 6. Let Assumption 1 hold. Then,
(i) $\mathbb{E}_z \tilde f_z$ converges to $(L_K + \frac{\lambda}{2} I)^{-2}(L_K + \lambda I)L_K f_\rho$ in $\mathcal{H}_K$, and the asymptotic bias of $\tilde f_z$ is $-\frac{\lambda^2}{4}(L_K + \frac{\lambda}{2} I)^{-2} f_\rho$.
(ii) Let $\{\tau_j\}$ and $\{\phi_j\}$ be the eigenvalues and eigenfunctions of $L_K$. If $f_\rho = \sum_{j:\tau_j \neq 0} \alpha_j \phi_j \in \mathcal{H}_K$, then
$$\lim_{n \to \infty} \tilde b(n, \lambda) = \sum_{j:\tau_j \neq 0} \frac{\lambda^4 \alpha_j^2}{(\lambda + 2\tau_j)^4}.$$
Compared with the bias characterization of MPRank in Theorem 3, we find that DMPRank is effective in reducing the bias. Since $\frac{\lambda}{\lambda + 2\tau_j} \le 1$, the asymptotic bias of $\tilde f_z$ is much smaller than that of MPRank when $f_\rho$ depends only on the first several principal components. This observation is consistent with the bias analysis in [32,12,13] for the regularized kernel networks.

4. Proofs

In this section, we present the proofs of Theorems 2-6 following the operator approximation techniques in [5,18]. For compactness, we use the notation $\|\cdot\|$ for the operator norm from $\mathcal{H}_K$ to $\mathcal{H}_K$.

4.1. Preliminary lemmas

We first recall an inequality stated in Lemma 11 of [12], which transforms a bound holding with high probability into an estimate in expectation.

Lemma 1. Let $Q$ be a positive random variable. For any $0 < \delta < 1$, assume that $Q \le a\big(\log\frac{b}{\delta}\big)^{\tau}$ for some positive constants $a, b, \tau$ with confidence at least $1 - \delta$. Then, for any $s > 0$, there holds $\mathbb{E} Q^s \le a^s b\, \Gamma(\tau s + 1)$.

Now we introduce some results on the properties of $L_K$ and $L_{K,x}$. The following preliminary inequalities are used in [5] according to the properties of positive operators, inspired by the proof of Proposition 4.1 in [31]. Indeed, the self-adjointness and nonnegativity of $L_K$ ensure that its $r$th power $L_K^r$ makes sense [29,12].

Lemma 2. The operators $L_K$ in (6) and $L_{K,x}$ in (7) are self-adjoint positive operators on $\mathcal{H}_K$. Moreover, for $0 < r \le 1$ and $\gamma > 0$, there hold
$$\big\|(L_K + \gamma I)^{-1} L_K^r\big\| \le \gamma^{r-1} \quad \text{and} \quad \big\|(L_{K,x} + \gamma I)^{-1} L_{K,x}^r\big\| \le \gamma^{r-1}.$$
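As a quick numerical sanity check of Lemma 2 (ours, not part of the paper), the operator inequality can be verified on random positive semi-definite matrices standing in for $L_K$, for which the operator norm reduces to a maximum over eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(100):
    # random positive semi-definite matrix standing in for L_K
    B = rng.normal(size=(10, 10))
    A = B @ B.T
    tau = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # eigenvalues tau_j >= 0
    gamma = rng.uniform(0.01, 10.0)
    r = rng.uniform(0.01, 1.0)
    # ||(A + gamma I)^{-1} A^r|| = max_j tau_j^r / (tau_j + gamma) <= gamma^(r-1)
    lhs = np.max(tau ** r / (tau + gamma))
    assert lhs <= gamma ** (r - 1) + 1e-12
```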
The following deviation inequalities have been established in [5,18] based on the McDiarmid-Bernstein inequality for vector-valued random variables in [24].

Lemma 3. Let $z$ be the training set drawn independently according to $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\|L_K - L_{K,x}\| = \Big\|L_K - \frac{1}{n^2} S_x^* D S_x\Big\| \le \frac{24\kappa^2}{\sqrt{n}}\log(2/\delta) + \frac{2\kappa^2}{n}.$$

Lemma 4. Let $z$ be the training set drawn independently according to $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\Big\|\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big\|_K \le 12\kappa M \log(2/\delta)\Big(\frac{1}{n} + \frac{1}{\sqrt{n}}\Big) + \frac{1}{n}\|L_K f_\rho\|_K.$$

4.2. Theory characterization for MPRank

In this subsection, we derive the error bound for $\|f_z - f_\rho\|_K$ and prove the asymptotic bias of MPRank.

Proof of Theorem 2. According to the representations of $f_z$ in (4) and $f_\lambda$ in (8), we deduce that
$$f_z - f_\lambda = \Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}\Big(\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big) + \Big[\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1} - \Big(L_K + \frac{\lambda}{2} I\Big)^{-1}\Big] L_K f_\rho. \qquad (13)$$
As shown in Lemma 2 (see also [5,18]), $L_K + \frac{\lambda}{2} I$ and $L_{K,x} + \frac{\lambda}{2} I$ are positive operators on the reproducing kernel Hilbert space $\mathcal{H}_K$. Recall that, for invertible operators $A$ and $B$ on a Banach space, there holds $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, which is called the second order operator decomposition [22,12]. Hence, we have
$$\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1} - \Big(L_K + \frac{\lambda}{2} I\Big)^{-1} = \Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}\big(L_K - L_{K,x}\big)\Big(L_K + \frac{\lambda}{2} I\Big)^{-1}. \qquad (14)$$
Combining (13) and (14), we have
$$\|f_z - f_\lambda\|_K \le \Big\|\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}\Big(\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big)\Big\|_K + \Big\|\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}\big(L_K - L_{K,x}\big)\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K f_\rho\Big\|_K.$$
Moreover, from Lemma 2 we get
$$\|f_z - f_\lambda\|_K \le \frac{2}{\lambda}\Big\|\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big\|_K + \frac{2}{\lambda}\|L_K - L_{K,x}\| \cdot \Big\|\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K f_\rho\Big\|_K \le \frac{2}{\lambda}\Big\|\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big\|_K + \frac{2}{\lambda}\|L_K - L_{K,x}\| \cdot \|f_\rho\|_K. \qquad (15)$$
Putting the estimates in Lemmas 3 and 4 into (15), for any $0 < \delta < 1$, we have
$$\|f_z - f_\lambda\|_K \le \frac{48(\kappa M + \kappa^2 \|f_\rho\|_K)\log(2/\delta)}{\lambda \sqrt{n}} + \frac{8\kappa^2 \|f_\rho\|_K}{\lambda n} \qquad (16)$$
with confidence $1 - 2\delta$. Then, by replacing $\delta$ with $\delta/2$ in (16), we get the first statement in Theorem 2.
Proposition 5 in [5] states that
$$\|f_\lambda - f_\rho\|_K \le \|L_K^{-r} f_\rho\|_K\, \lambda^r$$
if $f_\rho$ is in the range of $L_K^r$ for some $0 < r \le 1$. Combining this bound with (16), we obtain the second estimate in Theorem 2.

Now we turn to the bias characterization in Theorem 3.

Proof of Theorem 3. Following the same idea as in the proof of Theorem 2, we obtain
$$\|\mathbb{E}_z f_z - f_\lambda\|_K = \|\mathbb{E}_z(f_z - f_\lambda)\|_K \le \mathbb{E}_z\|f_z - f_\lambda\|_K \le \frac{2}{\lambda}\,\mathbb{E}_z\Big\|\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big\|_K + \frac{2}{\lambda}\,\mathbb{E}_z\|L_K - L_{K,x}\| \cdot \|f_\rho\|_K. \qquad (17)$$
Applying Lemma 1 to Lemmas 3 and 4, we further get
$$\mathbb{E}_z\|L_K - L_{K,x}\| \le \frac{104\kappa^2}{\sqrt{n}} \qquad (18)$$
and
$$\mathbb{E}_z\Big\|\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big\|_K \le \frac{96\kappa M}{\sqrt{n}} + \frac{2\kappa^2 \|f_\rho\|_K}{n}. \qquad (19)$$
Putting (18) and (19) into (17), we get
$$\|\mathbb{E}_z f_z - f_\lambda\|_K \le \frac{192\kappa M}{\lambda \sqrt{n}} + \frac{4\kappa^2 \|f_\rho\|_K}{\lambda n} + \frac{208\kappa^2 \|f_\rho\|_K}{\lambda \sqrt{n}}.$$
Then,
$$\lim_{n \to \infty} \|\mathbb{E}_z f_z - f_\lambda\|_K = 0. \qquad (20)$$
This together with the definition of $f_\lambda$ in (8) yields the first conclusion in Theorem 3. It is easy to check that
$$f_\lambda - f_\rho = \Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K f_\rho - \Big(L_K + \frac{\lambda}{2} I\Big)^{-1}\Big(L_K + \frac{\lambda}{2} I\Big) f_\rho = -\frac{\lambda}{2}\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} f_\rho.$$
Combining the above equality with (20), we know that the asymptotic bias of $f_z$ is $-\frac{\lambda}{2}(L_K + \frac{\lambda}{2} I)^{-1} f_\rho$. Note that the convergence in $\mathcal{H}_K$ implies the convergence in $L^2_{\rho_X}$ [29,32]. Thus, we can verify the desired result for the setting $f_\rho = \sum_{j:\tau_j \neq 0} \alpha_j \phi_j \in \mathcal{H}_K$. This completes the proof.
4.3. Theory characterization for DMPRank

In this subsection, we present the proofs of Theorems 4-6.

Proof of Theorem 4. Recall that
$$\tilde f_z = f_z + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1} f_z.$$
Then
$$\Big(L_{K,x} + \frac{\lambda}{2} I\Big)\tilde f_z = \big(L_{K,x} + \lambda I\big) f_z. \qquad (21)$$
According to the definitions of $S_x$ and $S_x^*$, we know that $\tilde f_z = S_x^* \tilde\alpha_z$, $f_z = S_x^* \alpha_z$, and $S_x S_x^* \alpha = K\alpha$ for any $\alpha \in \mathbb{R}^n$. Hence, for the left side of (21), we have
$$\Big(L_{K,x} + \frac{\lambda}{2} I\Big)\tilde f_z = S_x^*\Big(\frac{1}{n^2} D S_x S_x^* + \frac{\lambda}{2} I\Big)\tilde\alpha_z = S_x^*\Big(\frac{1}{n^2} D K + \frac{\lambda}{2} I\Big)\tilde\alpha_z.$$
For the right side of (21), it is easy to check that
$$\big(L_{K,x} + \lambda I\big) f_z = S_x^*\Big(\frac{1}{n^2} D S_x S_x^* + \lambda I\Big)\alpha_z = S_x^*\Big(\frac{1}{n^2} D K + \lambda I\Big)\alpha_z.$$
Combining the above equalities together yields
$$S_x^*\Big[\Big(\frac{1}{n^2} D K + \frac{\lambda}{2} I\Big)\tilde\alpha_z - \Big(\frac{1}{n^2} D K + \lambda I\Big)\alpha_z\Big] = 0.$$
Hence the coefficient vector of $\tilde f_z$ satisfies the desired equation in Theorem 4.
Now we present the proof of Theorem 5.

Proof of Theorem 5. From the definitions of $\tilde f_z$ in (9) and $\tilde f_\lambda$ in (11), we deduce that
$$\begin{aligned} \tilde f_z - \tilde f_\lambda &= f_z + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1} f_z - f_\lambda - \frac{\lambda}{2}\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} f_\lambda \\ &= (f_z - f_\lambda) + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}(f_z - f_\lambda) + \frac{\lambda}{2}\Big[\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1} - \Big(L_K + \frac{\lambda}{2} I\Big)^{-1}\Big] f_\lambda \\ &= (f_z - f_\lambda) + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}(f_z - f_\lambda) + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}\big(L_K - L_{K,x}\big)\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} f_\lambda, \end{aligned}$$
where the last equality follows from the second order operator decomposition $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$ for positive operators $A$ and $B$. According to Lemma 2 and $f_\lambda = (L_K + \frac{\lambda}{2} I)^{-1} L_K f_\rho$, we further have
$$\|\tilde f_z - \tilde f_\lambda\|_K \le 2\|f_z - f_\lambda\|_K + \frac{2}{\lambda}\|L_K - L_{K,x}\| \cdot \Big\|\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K f_\rho\Big\|_K \le 2\|f_z - f_\lambda\|_K + \frac{2}{\lambda}\|L_K - L_{K,x}\| \cdot \|f_\rho\|_K. \qquad (22)$$
Putting the estimate of $\|L_K - L_{K,x}\|$ from Lemma 3 into (22), we obtain
$$\|\tilde f_z - \tilde f_\lambda\|_K \le 2\|f_z - f_\lambda\|_K + \frac{48\kappa^2 \|f_\rho\|_K}{\lambda \sqrt{n}}\log(2/\delta) + \frac{4\kappa^2 \|f_\rho\|_K}{\lambda n} \qquad (23)$$
with confidence $1 - \delta$.
Now we turn to bound $\|\tilde f_\lambda - f_\rho\|_K$. It is easy to verify that
$$\tilde f_\lambda - f_\rho = -\frac{\lambda^2}{4}\Big(L_K + \frac{\lambda}{2} I\Big)^{-2} f_\rho. \qquad (24)$$
When $L_K^{-2r} f_\rho \in \mathcal{H}_K$ with $r \in (0, 1]$, we further obtain
$$\|\tilde f_\lambda - f_\rho\|_K = \frac{\lambda^2}{4}\Big\|\Big(L_K + \frac{\lambda}{2} I\Big)^{-2} L_K^{2r}\, L_K^{-2r} f_\rho\Big\|_K \le \frac{\lambda^2}{4}\Big\|\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K^{r}\Big\|^2 \cdot \|L_K^{-2r} f_\rho\|_K \le \Big(\frac{\lambda}{2}\Big)^{2r}\|L_K^{-2r} f_\rho\|_K, \qquad (25)$$
where the last inequality depends on Lemma 2. Combining the estimate of $\|f_z - f_\lambda\|_K$ in Theorem 2 with (23) and (25), we have
$$\|\tilde f_z - f_\rho\|_K \le \frac{192(\kappa M + \kappa^2 \|f_\rho\|_K)\log(4/\delta)}{\lambda \sqrt{n}} + \frac{12\kappa^2 \|f_\rho\|_K}{\lambda n} + \Big(\frac{\lambda}{2}\Big)^{2r}\|L_K^{-2r} f_\rho\|_K$$
with confidence $1 - 2\delta$. The desired error bound follows by setting $\lambda = n^{-\frac{1}{4r+2}}$ and $\tilde\delta = 2\delta$ in the above estimate.
Proof of Theorem 6. With the same idea as in the proof of Theorem 5, we observe that
$$\tilde f_z - \tilde f_\lambda = (f_z - f_\lambda) + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}(f_z - f_\lambda) + \frac{\lambda}{2}\Big(L_{K,x} + \frac{\lambda}{2} I\Big)^{-1}\big(L_K - L_{K,x}\big)\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} f_\lambda$$
and
$$\|\mathbb{E}_z \tilde f_z - \tilde f_\lambda\|_K = \|\mathbb{E}_z(\tilde f_z - \tilde f_\lambda)\|_K \le \mathbb{E}_z\|\tilde f_z - \tilde f_\lambda\|_K.$$
Then,
$$\mathbb{E}_z\|\tilde f_z - \tilde f_\lambda\|_K \le 2\,\mathbb{E}_z\|f_z - f_\lambda\|_K + \frac{2}{\lambda}\,\mathbb{E}_z\|L_K - L_{K,x}\| \cdot \Big\|\Big(L_K + \frac{\lambda}{2} I\Big)^{-1} L_K f_\rho\Big\|_K \le 2\,\mathbb{E}_z\|f_z - f_\lambda\|_K + \frac{2\|f_\rho\|_K}{\lambda}\,\mathbb{E}_z\|L_K - L_{K,x}\|. \qquad (26)$$
Recall that, in the proof of Theorem 3,
$$\mathbb{E}_z\|L_K - L_{K,x}\| \le \frac{104\kappa^2}{\sqrt{n}}, \qquad \mathbb{E}_z\Big\|\frac{1}{n^2} S_x^* D Y - L_K f_\rho\Big\|_K \le \frac{96\kappa M}{\sqrt{n}} + \frac{2\kappa^2 \|f_\rho\|_K}{n},$$
and
$$\mathbb{E}_z\|f_z - f_\lambda\|_K \le \frac{192\kappa M}{\lambda \sqrt{n}} + \frac{4\kappa^2 \|f_\rho\|_K}{\lambda n} + \frac{208\kappa^2 \|f_\rho\|_K}{\lambda \sqrt{n}}.$$
Combining these estimates with (26), we obtain $\mathbb{E}_z\|\tilde f_z - \tilde f_\lambda\|_K \to 0$ as $n \to \infty$, which means that $\mathbb{E}_z \tilde f_z$ converges to $\tilde f_\lambda$. This together with (24) yields the desired bias characterization. Moreover, the second statement in Theorem 6 is obtained by direct computation.

5. Experiments

In this section, we evaluate DMPRank and MPRank on simulated examples and several benchmark datasets from drug discovery.
Table 1
Performance of DMPRank and MPRank in simulation.

| Dimension p | Sample size n | DMPRank Mean | DMPRank Deviation | MPRank Mean | MPRank Deviation |
| p = 1  | n = 25  | 4.89%  | 3.29% | 4.97%  | 3.37% |
| p = 1  | n = 50  | 4.03%  | 1.47% | 4.06%  | 1.54% |
| p = 1  | n = 100 | 3.47%  | 0.91% | 3.50%  | 0.91% |
| p = 1  | n = 200 | 3.31%  | 0.35% | 3.31%  | 0.36% |
| p = 1  | n = 400 | 3.22%  | 0.27% | 3.22%  | 0.27% |
| p = 5  | n = 25  | 8.27%  | 2.77% | 8.84%  | 2.81% |
| p = 5  | n = 50  | 5.33%  | 1.12% | 5.82%  | 1.36% |
| p = 5  | n = 100 | 4.08%  | 0.62% | 4.51%  | 0.76% |
| p = 5  | n = 200 | 3.83%  | 0.46% | 4.37%  | 0.54% |
| p = 5  | n = 400 | 3.62%  | 0.46% | 4.17%  | 0.54% |
| p = 10 | n = 25  | 15.48% | 5.42% | 16.31% | 5.31% |
| p = 10 | n = 50  | 9.21%  | 1.57% | 9.72%  | 1.84% |
| p = 10 | n = 100 | 6.80%  | 1.16% | 7.14%  | 1.20% |
| p = 10 | n = 200 | 5.46%  | 0.59% | 5.88%  | 0.70% |
| p = 10 | n = 400 | 4.99%  | 0.65% | 5.38%  | 0.72% |
5.1. Simulated examples

Inspired by [18], we generate the training samples $z = \{(x_i, y_i)\}_{i=1}^n$ by randomly choosing each coordinate of the inputs $\{x_i\}_{i=1}^n \subset \mathbb{R}^p$ from $\{1, 2, \dots, 100\}$ and setting the outputs $y_i = [\|x_i\|_2/2] + \epsilon_i$, $1 \le i \le n$, where $[\cdot]$ denotes the integer part and $\epsilon_i$ is independent noise sampled from the uniform distribution $U(-2, 2)$. The RKHS is associated with the Gaussian kernel $K(x, x') = \exp\big(-\frac{\|x - x'\|^2}{\gamma^2}\big)$. Each evaluation is repeated 100 times with sample sizes $n = 25, 50, 100, 200, 400$ and dimensions $p = 1, 5, 10$, respectively. For each evaluation we split the training samples into two subsets of $n/2$ elements. The functions $\tilde f_z$ in (9) and $f_z$ in (1) are constructed on the first subset for given parameters $\lambda$ and $\gamma$. The second subset is used to select the regularization parameter $\lambda$ and the kernel parameter $\gamma$ by minimizing the empirical misranking risk [9,18]
$$R_z(f) = \frac{\sum_{i,j=1}^{n} I_{\{y_i > y_j \,\wedge\, f(x_i) \le f(x_j)\}}}{\sum_{i,j=1}^{n} I_{\{y_i > y_j\}}},$$
where $I_{\{\varphi\}}$ is 1 if $\varphi$ is true and 0 otherwise. Here, $\lambda$ is searched over a discrete sequence of 100 values, $\lambda = \lambda_0 q^j$ with $\lambda_0 = 2$, $q = 0.95$, $j = 1, \dots, 100$, and $\gamma$ is searched over the grid $\{1, 5, 10, 50, 100, 500, 1000\}$. The functions constructed with the selected parameters $\lambda$ and $\gamma$ are then used to compare the performance of MPRank and DMPRank on a test set of 100 additional random inputs. Experimental results on the mean and standard deviation of the misranking risk are reported in Table 1, which shows the effectiveness of the proposed DMPRank. In addition, we report the average results (with $n = 50, 100, 200$) for varying $p$ in Fig. 1, which indicates that our DMPRank consistently performs better than MPRank as the sample size $n$ and the dimension $p$ grow.
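A minimal Python sketch of this simulation protocol (our reconstruction of the described setup, reusing the hypothetical mprank_predict and dmprank_fit helpers sketched in Sections 2 and 3; details such as the exact split and tie handling may differ from the authors' implementation):

```python
import numpy as np

def misranking_risk(y_true, scores):
    """Empirical misranking risk R_z(f): fraction of pairs with y_i > y_j that f misranks."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    gt = y_true[:, None] > y_true[None, :]               # pairs with y_i > y_j
    wrong = gt & (scores[:, None] <= scores[None, :])    # ... that f orders incorrectly
    return wrong.sum() / gt.sum()

def simulate(n, p, rng):
    """Coordinates drawn from {1,...,100}; y_i = [||x_i||_2 / 2] + U(-2, 2) noise."""
    X = rng.integers(1, 101, size=(n, p)).astype(float)
    y = np.floor(np.linalg.norm(X, axis=1) / 2.0) + rng.uniform(-2, 2, size=n)
    return X, y

rng = np.random.default_rng(0)
X, y = simulate(n=100, p=5, rng=rng)
X_fit, y_fit, X_val, y_val = X[:50], y[:50], X[50:], y[50:]

best = None
for lam in 2 * 0.95 ** np.arange(1, 101):                # lambda = 2 * 0.95^j, j = 1..100
    for gamma in [1, 5, 10, 50, 100, 500, 1000]:
        _, alpha_tilde = dmprank_fit(X_fit, y_fit, lam, gamma)   # hypothetical helper from Section 3
        val_scores = mprank_predict(alpha_tilde, X_fit, X_val, gamma)
        risk = misranking_risk(y_val, val_scores)
        if best is None or risk < best[0]:
            best = (risk, lam, gamma)
```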
Fig. 1. Performance of DMPRank and MPRank on varying dimensions p.
Table 2
Comparison of DMPRank and MPRank on QSAR datasets.

| Dataset | DMPRank Mean | DMPRank Deviation | MPRank Mean | MPRank Deviation |
| DHFR    | 34.56%       | 5.29%             | 36.41%      | 5.33%            |
| COX2    | 37.37%       | 5.94%             | 38.46%      | 6.39%            |
5.2. Performance comparison on QSAR datasets

We apply DMPRank and MPRank to two quantitative structure-activity relationship (QSAR) datasets: inhibitors of dihydrofolate reductase (DHFR) and of cyclooxygenase-2 (COX2), whose biological activities are represented as pIC50 values [2,8]. The DHFR inhibitor dataset contains 361 compounds with pIC50 values in (3.3, 9.8), and the COX2 inhibitor dataset contains 282 compounds with pIC50 values in (4, 9). Each compound in the two datasets is represented by the 2.5-D chemical descriptors in [2]: the DHFR inhibitor dataset includes 70 descriptors and the COX2 inhibitor dataset includes 74 descriptors. As a preprocessing step, these descriptors are scaled by z-score normalization. For each evaluation, we take 10 percent of the compounds as the test set and the rest as the training set. The selection of the hyper-parameters $\lambda$ and $\gamma$ follows the same setup as in the simulation. We repeat each evaluation 100 times and report the average results in Table 2.

6. Conclusions

In this paper, we have investigated the learning rates and asymptotic biases of MPRank and DMPRank. More specifically, we apply the operator representation to derive the solution equations for the two methods, and utilize operator approximation techniques to characterize their approximation rates and biases. It should be noticed that the derived learning rates are faster than the previous results in [5,7,18] under the capacity-independent analysis framework. Along the line of this paper, it is important to design a distributed version of MPRank to reduce its computational complexity on large-scale data, which is closely related to the divide-and-conquer strategy in [34,12,22].

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grants 11671161 and 11801201, and the Fundamental Research Funds for the Central Universities (Project Nos. 2662018QD018, 2662019FW003).

References

[1] S. Agarwal, P. Niyogi, Generalization bounds for ranking algorithms via algorithmic stability, J. Mach. Learn. Res. 10 (2009) 441-474.
[2] S. Agarwal, D. Dugar, S. Sengupta, Ranking chemical structures for drug discovery: a new machine learning approach, J. Chem. Inf. Model. 50 (5) (2010) 716-731.
[3] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68 (1950) 337-404.
[4] C. Brouard, E. Bach, S. Böcker, J. Rousu, Magnitude-preserving ranking for structured outputs, in: Asian Conference on Machine Learning, vol. 77, 2017, pp. 407-422.
[5] H. Chen, The convergence rate of a regularized ranking algorithm, J. Approx. Theory 164 (2012) 1513-1519.
[6] H. Chen, D.R. Chen, Learning rate of support vector machine for ranking, J. Approx. Theory 188 (2014) 57-68.
[7] H. Chen, Y. Tang, L. Li, Y. Yuan, X. Li, Y.Y. Tang, Error analysis of stochastic gradient descent ranking, IEEE Trans. Cybern. 43 (2013) 898-909.
[8] H. Chen, J.T. Peng, Y. Zhou, L. Li, Z. Pan, Extreme learning machine for ranking: generalization analysis and applications, Neural Netw. 53 (2014) 119-126.
[9] S. Clemencon, G. Lugosi, N. Vayatis, Ranking and empirical minimization of U-statistics, Ann. Stat. 36 (2008) 844-874.
[10] C. Cortes, M. Mohri, A. Rastogi, Magnitude-preserving ranking algorithms, in: Proc. 24th International Conference on Machine Learning (ICML), 2007.
[11] F. Cucker, D.X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge Univ. Press, Cambridge, U.K., 2007.
[12] Z. Guo, L. Shi, Q. Wu, Learning theory of distributed regression with bias corrected regularization kernel network, J. Mach. Learn. Res. 18 (2017) 1-25.
[13] F. He, Q. Wu, Bias corrected regularization kernel method in ranking, Anal. Appl. 17 (1) (2019) 1-17.
[14] R. Herbrich, T. Graepel, K. Obermayer, Large margin rank boundaries for ordinal regression, in: Advances in Large Margin Classifiers, 2000.
[15] T. Hu, J. Fan, Q. Wu, D.X. Zhou, Learning theory approach to minimum error entropy criterion, J. Mach. Learn. Res. 14 (2013) 377-397.
[16] T. Hu, Q. Wu, D.X. Zhou, Distributed kernel gradient descent algorithm for minimum error entropy principle, Appl. Comput. Harmon. Anal. (2019), https://doi.org/10.1016/j.acha.2019.01.002.
[17] T. Joachims, Optimizing search engines using clickthrough data, in: Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD), 2002.
[18] G. Kriukova, S.V. Pereverzyev, P. Tkachenko, On the convergence rate and some applications of regularized ranking algorithms, J. Complex. 33 (2016) 14-29.
[19] J.D. Lee, Q. Liu, Y. Sun, J.E. Taylor, Communication-efficient sparse regression, J. Mach. Learn. Res. 18 (2017) 1-30.
[20] H. Li, C.B. Ren, L.Q. Li, U-processes and preference learning, Neural Comput. 26 (2014) 2896-2924.
[21] H. Lian, Z. Fan, Divide-and-conquer for debiased ℓ1-norm support vector machine in ultra-high dimensions, J. Mach. Learn. Res. 18 (2018) 1-26.
[22] S.B. Lin, X. Guo, D.X. Zhou, Distributed learning with regularized least squares, J. Mach. Learn. Res. 18 (2017) 1-13.
[23] S. Lv, H. Lian, A debiased distributed estimation for sparse partially linear models in diverging dimensions, arXiv:1708.05487v1.
[24] S. Mukherjee, D.X. Zhou, Learning coordinate covariances via gradients, J. Mach. Learn. Res. 7 (2006) 519-549.
[25] T. Pahikkala, E. Tsivtsivadze, A. Airola, J. Järvinen, J. Boberg, An efficient algorithm for learning to rank from preference graphs, Mach. Learn. 75 (2009) 129-165.
[26] W. Rejchel, On ranking and generalization bounds, J. Mach. Learn. Res. 13 (2012) 1373-1392.
[27] L. Rosasco, M. Belkin, E.D. Vito, On learning with integral operators, J. Mach. Learn. Res. 11 (2010) 905-934.
[28] S. Smale, D.X. Zhou, Shannon sampling: connections to learning theory, Appl. Comput. Harmon. Anal. 19 (2005) 285-302.
[29] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153-172.
[30] H. Sun, Q. Wu, A note on applications of integral operator in learning theory, Appl. Comput. Harmon. Anal. 26 (3) (2009) 416-421.
[31] H. Sun, Q. Wu, Application of integral operator for regularized least square regression, Math. Comput. Model. 49 (2009) 276-285.
[32] Q. Wu, Bias corrected regularization kernel networks and its applications, in: International Joint Conference on Neural Networks (IJCNN), 2017, pp. 1072-1079.
[33] Q. Wu, Y. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complex. 23 (2007) 108-134.
[34] Y. Zhang, J. Duchi, M.J. Wainwright, Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates, J. Mach. Learn. Res. 15 (2015) 1-40.
[35] Y. Zhao, J. Fan, L. Shi, Learning rates for regularized least squares ranking algorithm, Anal. Appl. 15 (6) (2017) 815-836.
[36] Y. Zhou, H. Chen, R. Lan, Z. Pan, Generalization performance of regularized ranking with multiscale kernels, IEEE Trans. Neural Netw. Learn. Syst. 27 (5) (2015) 993-1002.