Journal of Complexity 43 (2017) 51–75
Multi-task learning via linear functional strategy✩

Abhishake Rastogi*, Sivananthan Sampath

Department of Mathematics, Indian Institute of Technology Delhi, New Delhi 110016, India

Article info

Article history: Received 20 April 2017; Accepted 2 August 2017; Available online 15 August 2017

Keywords: Multi-task learning; Manifold learning; Multi-penalty regularization; Linear functional strategy; Vector-valued RKHS; Error estimate
Abstract

In machine learning, multi-task learning is a natural approach that exploits the relations among different tasks to improve performance. We develop a theoretical analysis of the multi-penalty least-squares regularization scheme on a reproducing kernel Hilbert space in the vector-valued setting. The results hold for a general framework of vector-valued functions; therefore they can be applied to multi-task learning problems. We study an approach to multi-penalty regularization which is based on the idea of a linear combination of different regularized solutions. We estimate the coefficients of the linear combination by means of the so-called linear functional strategy. We discuss a theoretical justification of the linear functional strategy which, in particular, provides the optimal convergence rates for the multi-penalty regularization scheme. Finally, we provide an empirical analysis of the multi-view manifold regularization scheme based on the linear functional strategy for challenging multi-class image classification and species recognition with attributes. © 2017 Elsevier Inc. All rights reserved.
✩ Communicated by Sergei Pereverzyev.
* Corresponding author. E-mail addresses: [email protected] (A. Rastogi), [email protected] (S. Sampath).
http://dx.doi.org/10.1016/j.jco.2017.08.001 0885-064X/© 2017 Elsevier Inc. All rights reserved.

1. Introduction

In computer vision and image processing, we have to solve various problems simultaneously, such as object detection, classification, image denoising, inpainting and tracking of multiple agents. Several approaches have been proposed in learning to handle these challenges. We need enough resources to solve each of these problems separately, and the procedure of obtaining solutions one by one is computationally costly. Moreover, if we tackle the problems separately, we may ignore the relatedness of the various tasks. This motivates the field of multi-task learning [3,22,23,26,31,58], which simultaneously solves several distinct problems and incorporates the structure of the relations among different tasks. In order to tackle $T$ tasks simultaneously we consider vector-valued functions $f : X \to \mathbb{R}^T$, where the components of the function $f = (f_1, \ldots, f_T)$ describe the individual tasks. Micchelli and Pontil [48] introduced vector-valued reproducing kernel Hilbert spaces to facilitate the theory of multi-task learning. Every vector-valued RKHS corresponds to some operator-valued positive definite kernel [48]. Here we consider the general framework of vector-valued functions $f : X \to Y$, which includes the multi-task learning setting as the special case $Y = \mathbb{R}^T$.

Multi-task learning is well studied through vector-valued reproducing kernel Hilbert spaces [46,48]. Various algorithms that incorporate the structure of task relations are discussed in the literature [3,22,23,26]. A natural extension of manifold regularization [11] to multi-task learning has been presented which exploits output dependencies and enforces smoothness with respect to the input data geometry [43,50]. Minh et al. [49] generalized the concept of vector-valued manifold regularization to multi-view manifold regularization, which combines between-view regularization, measuring the consistency of the component functions across different views, with within-view regularization, measuring the smoothness of the component functions in their corresponding views. Here we propose a more general multi-penalty framework in the vector-valued function setting. In order to execute the learning algorithms we face two natural problems: the choice of an appropriate hypothesis space in which to search for the estimator, and the choice of the regularization parameters.
The fact that every reproducing kernel Hilbert space corresponds to some kernel reduces the problem of choosing an appropriate RKHS (hypothesis space) to that of choosing an appropriate kernel. In [4,17,47,53], the authors proposed multiple kernel learning from a set of kernels. The multi-view learning approach is also used to construct the regularized solution based on different views of the input data using different hypothesis spaces [15,42,43,49,57,59,60]. We consider the direct sum of reproducing kernel Hilbert spaces as the hypothesis space. The second crucial problem is the parameter choice strategy. Many regularization schemes depend on tuning parameters. Various parameter choice rules are discussed for single-penalty regularization [6,19,24,51,63] (see also the references therein) as well as for multi-penalty regularization [1,8,28,32,38–40,52] for ill-posed inverse problems and learning algorithms. We obtain various regularized solutions corresponding to different choices of regularization schemes, hypothesis spaces and parameter choice rules. An idea explored in the literature [2,7,29,34,44,54] is to use the calculated approximants in the construction of a new estimator. We are inspired by the work of Kriukova et al. [34], in which the authors discussed an approach to aggregate various regularized solutions for a ranking algorithm in single-penalty regularization. We adapt this strategy to the multi-task learning manifold regularization scheme to improve the accuracy of the results. We consider a linear combination of the calculated regularized solutions and try to obtain the combination $f_{\mathbf{z}} = \sum_{i=1}^{l} c_i f_{\mathbf{z}}^{i}$ which approximates the target function accurately. The coefficients of the linear combination are estimated by means of the linear functional strategy. The estimation of the values of bounded linear functionals (e.g., inner products) at the approximated elements is often called the linear functional strategy.
The information hidden inside various estimators can be integrated using this technique. We discuss the convergence issues of the constructed solution $f_{\mathbf{z}}$ based on the linear functional strategy. As a consequence of the convergence analysis of linear functionals, we obtain the optimal convergence rates for the minimizer of the multi-penalty regularization scheme. The paper is organized as follows. In Section 2, we describe the framework of the vector-valued multi-penalized learning problem with some basic definitions and notations. In Section 3, we analyze the error estimates of linear functionals for the multi-penalty regularization scheme, which in particular give the optimal convergence rates for the multi-penalty regularizer. In Section 4, we propose a parameter choice strategy, the ‘‘balanced-discrepancy principle’’, which adaptively selects a parameter on the discrepancy surface. In Section 5, we discuss the convergence issues of the constructed solution of multi-penalty regularization based on the linear functional strategy. In the last section, we demonstrate the performance of multi-penalty estimators based on the balanced-discrepancy principle and the linear functional strategy.
2. Multi-task learning via vector-valued RKHS

Multi-task learning [3,22,23,26] is the natural approach that exploits the structure of relations among different tasks. Multi-task learning reduces human effort and supervision by providing a joint solution of several problems. The joint solution has the potential to exploit task relatedness in order to improve learning. We formulate the framework of vector-valued functions $f : X \to \mathbb{R}^T$, where the components of the output variable correspond to the individual tasks ($T$ tasks). Here we consider a general regularized framework for learning vector-valued functions $f : X \to Y$. In particular, for $Y = \mathbb{R}^T$ the setting reduces to multi-task learning, which includes vector-valued regression and multi-label detection/classification.

Let the input space $X$ be a locally compact countable Hausdorff space and the output space $(Y, \langle\cdot,\cdot\rangle_Y)$ be a real separable Hilbert space. Suppose $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m} \in Z^m$ is a given set of observations drawn independently and identically according to the unknown joint probability distribution $\rho$ on the sample space $Z = X \times Y$. A widely used approach for learning a function from empirical data is to minimize a regularization functional consisting of an error term measuring the fit to the data and a smoothness term measuring the complexity of the function. In learning theory, this regularization scheme over the hypothesis space $H$ can be described as

\[ \operatorname*{argmin}_{f \in H} \left\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda_0 \|f\|_H^2 \right\}, \tag{1} \]

where the empirical error $\mathcal{E}_{\mathbf{z}}(f)$ is given by

\[ \mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^{m} \|f(x_i) - y_i\|_Y^2 . \]
Moreover, various authors felt the necessity of more sophisticated multi-penalty methods to incorporate various features in the solution such as boundedness, sparsity and the geometry of the probability distribution. Multi-penalized regularization methods for real-valued functions are well analyzed for ill-posed inverse problems [8,28,32,38,39,41,52]. In the learning theory framework, the convergence issues of multi-penalty regularization and its various parameter choice rules are well studied for scalar-valued functions [1,38,40,62]. In the context of vector-valued manifold regularization [42,43,50], a structured multi-output prediction problem is discussed which exploits output interdependencies and enforces smoothness with respect to the input data geometry:

\[ \operatorname*{argmin}_{f \in H} \left\{ \frac{1}{m} \sum_{i=1}^{m} \|f(x_i) - y_i\|_Y^2 + \lambda_A \|f\|_H^2 + \lambda_I \langle \mathbf{f}, M \mathbf{f} \rangle_{Y^n} \right\}, \tag{2} \]

where $\{(x_i, y_i) \in X \times Y : 1 \le i \le m\} \cup \{x_i \in X : m < i \le n\}$ is a given set of labeled and unlabeled data, $M$ is a symmetric, positive operator, $\lambda_A, \lambda_I \ge 0$ and $\mathbf{f} = (f(x_1), \ldots, f(x_n)) \in Y^n$. In order to optimize the vector-valued regularization functional there are two fundamental problems in learning theory. The first problem is to choose the appropriate hypothesis space, which is assumed to be the source of the estimator. The second, most crucial issue is to select regularization parameters that ensure good performance of the solution. Most regularized learning algorithms depend on the tuning of regularization parameters.
2.1. Mathematical preliminaries and notations

Definition 2.1 (Vector-valued Reproducing Kernel Hilbert Space). For a non-empty set $X$ and a real Hilbert space $(Y, \langle\cdot,\cdot\rangle_Y)$, the Hilbert space $(H, \langle\cdot,\cdot\rangle_H)$ of functions from $X$ to $Y$ is called a reproducing kernel Hilbert space if for any $x \in X$ and $y \in Y$, the linear functional which maps $f \in H$ to $\langle y, f(x)\rangle_Y$ is continuous.

Let $\mathcal{L}(Y)$ be the Banach space of bounded linear operators on $Y$. Suppose $K : X \times X \to \mathcal{L}(Y)$ is such that for $(x, z) \in X \times X$, $K^*(x, z) = K(z, x)$, and for every $m \in \mathbb{N}$ and $\{(x_i, y_i) : 1 \le i \le m\} \subset Z$ we have $\sum_{i,j=1}^{m} \langle y_i, K(x_i, x_j) y_j \rangle_Y \ge 0$. Then the function $K$ is said to be an operator-valued positive definite kernel. For every vector-valued RKHS $H$ there exists an operator-valued positive definite kernel $K$ such that for $K_x : Y \to H$, defined as $(K_x y)(z) = K(z, x)y$ for $x, z \in X$, the span of the set $\{K_x y : x \in X, y \in Y\}$ is dense in $H$. The reproducing property is $\langle y, f(x)\rangle_Y = \langle K_x y, f\rangle_H$, $\forall f \in H$. Moreover, there is a one-to-one correspondence between operator-valued positive definite kernels and vector-valued RKHSs [48]. So we denote the vector-valued RKHS $H$ corresponding to an operator-valued positive definite kernel $K$ by $H_K$, and its norm by $\|\cdot\|_{H_K}$.

The reproducing kernel Hilbert space (RKHS) is widely considered as a hypothesis space in scalar-valued learning algorithms. Since there is a one-to-one correspondence between reproducing kernel Hilbert spaces and Mercer kernels [48], choosing an appropriate RKHS (hypothesis space) is equivalent to choosing an appropriate kernel. The choice of kernel is crucial for the success of the optimization scheme. Micchelli and Pontil [4,17,47] discussed the concept of multiple kernel learning to enhance the richness of the hypothesis space. The kernel choice over a finite (or convex) set of kernels is well studied to fulfill the requirements of the learning algorithms. Many authors [15,42,43,49,57,59,60] considered the multi-view learning approach to construct estimators based on different views of the input data using different hypothesis spaces. Here we consider the direct sum of reproducing kernel Hilbert spaces in order to tackle the multiple kernel learning problem. The direct sum of reproducing kernel Hilbert spaces $H = H_{K_1} \oplus \ldots \oplus H_{K_v}$ is also an RKHS. Suppose $K$ is the kernel corresponding to the RKHS $H$. The vector-valued multi-penalized optimization problem then becomes
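\[ \operatorname*{argmin}_{f \in H_{K_1} \oplus \ldots \oplus H_{K_v}} \left\{ \frac{1}{m} \sum_{i=1}^{m} \|f(x_i) - y_i\|_Y^2 + \lambda_A \|f\|_H^2 + \lambda_I \langle \mathbf{f}, M \mathbf{f} \rangle_{Y^n} \right\}. \tag{3} \]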
In this paper, we assume the reproducing kernel Hilbert space $H$ is separable such that

(i) $K_x : Y \to H$ is a Hilbert–Schmidt operator for all $x \in X$ and $\kappa := \sqrt{\sup_{x \in X} \mathrm{Tr}(K_x^* K_x)} < \infty$, where for a Hilbert–Schmidt operator $A \in \mathcal{L}(H)$, $\mathrm{Tr}(A) := \sum_{k=1}^{\infty} \langle A e_k, e_k \rangle$ for an orthonormal basis $\{e_k\}_{k=1}^{\infty}$ of $H$.

(ii) The real function from $X \times X$ to $\mathbb{R}$, defined by $(x, t) \mapsto \langle K_t v, K_x w \rangle_H$, is measurable $\forall v, w \in Y$.
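Definition 2.2. The sampling operator $S_{\mathbf{x}} : H \to Y^m$ associated with a discrete subset $\mathbf{x} = (x_i)_{i=1}^{m}$ is defined by $S_{\mathbf{x}}(f) = (f(x))_{x \in \mathbf{x}}$. Then its adjoint is given by

\[ S_{\mathbf{x}}^* \mathbf{c} = \frac{1}{m} \sum_{i=1}^{m} K_{x_i} c_i, \qquad \forall \mathbf{c} = (c_1, \ldots, c_m) \in Y^m . \]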
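The adjoint relation of Definition 2.2 can be checked numerically in a simplified stand-in. The sketch below assumes scalar outputs ($Y = \mathbb{R}$), a finite-dimensional hypothesis space $H = \mathbb{R}^d$ with a hypothetical feature matrix `Phi` (so that $K_{x_i} c_i$ becomes `Phi[i] * c[i]`), and that $Y^m$ carries the empirical inner product $\langle u, v \rangle_m = \frac{1}{m}\sum_i u_i v_i$ associated with the norm $\|\cdot\|_m$. None of these names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 4                      # sample size and feature dimension (illustrative)
Phi = rng.normal(size=(m, d))    # row i: hypothetical feature vector of x_i

def S(f):
    """Sampling operator: S_x f = (f(x_i))_i = (<phi(x_i), f>)_i."""
    return Phi @ f

def S_adj(c):
    """Adjoint with the 1/m factor: S_x^* c = (1/m) sum_i phi(x_i) c_i."""
    return Phi.T @ c / m

def inner_m(u, v):
    """Empirical inner product <u, v>_m = (1/m) sum_i u_i v_i."""
    return float(u @ v) / m

f = rng.normal(size=d)
c = rng.normal(size=m)
lhs = inner_m(S(f), c)           # <S_x f, c>_m
rhs = float(f @ S_adj(c))        # <f, S_x^* c>_H
```

In this setting `lhs` and `rhs` agree up to rounding, which is the reason the factor $1/m$ appears in the adjoint once $Y^m$ is equipped with the empirical inner product.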
Now, the regularization scheme (3) can be expressed as

\[ \operatorname*{argmin}_{f \in H} \left\{ \|S_{\mathbf{x}} f - \mathbf{y}\|_m^2 + \lambda_A \|f\|_H^2 + \lambda_I \|(S_{\mathbf{x}'}^* M S_{\mathbf{x}'})^{1/2} f\|_H^2 \right\}, \tag{4} \]

where $\mathbf{x} = (x_i)_{i=1}^{m}$, $\mathbf{x}' = (x_i)_{i=1}^{n}$, $\mathbf{y} = (y_i)_{i=1}^{m}$ and $\|\mathbf{y}\|_m^2 = \frac{1}{m}\sum_{i=1}^{m} \|y_i\|_Y^2$. Our aim is to analyze the convergence issues and various parameter choice rules for such regularized multi-task learning schemes. Here we consider a more general framework of multi-penalized vector-valued regularization:
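\[ f_{\mathbf{z},\lambda} := \operatorname*{argmin}_{f \in H} \left\{ \|S_{\mathbf{x}} f - \mathbf{y}\|_m^2 + \lambda_0 \|f\|_H^2 + \sum_{j=1}^{p} \lambda_j \|B_j f\|_H^2 \right\}, \tag{5} \]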
where Bj : H → H (1 ≤ j ≤ p) are bounded operators, λ0 > 0, λj (1 ≤ j ≤ p) are non-negative real numbers and λ denotes the ordered set (λ0 , λ1 , . . . , λp ).
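Theorem 2.1. For a positive choice of $\lambda_0$, the functional (5) has the unique minimizer

\[ f_{\mathbf{z},\lambda} = \Big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \Big)^{-1} S_{\mathbf{x}}^* \mathbf{y}. \tag{6} \]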
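We obtain the explicit form of $f_{\mathbf{z},\lambda}$ by taking the functional derivative of the expression

\[ \|S_{\mathbf{x}} f - \mathbf{y}\|_m^2 + \lambda_0 \|f\|_H^2 + \sum_{j=1}^{p} \lambda_j \|B_j f\|_H^2 = \Big\langle \Big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \Big) f, f \Big\rangle_H - 2 \langle S_{\mathbf{x}}^* \mathbf{y}, f \rangle_H + \|\mathbf{y}\|_m^2 . \]

To discuss the convergence issues of the multi-penalty regularized solution $f_{\mathbf{z},\lambda}$, we consider the data-free version of the regularization scheme (5):

\[ f_\lambda := \operatorname*{argmin}_{f \in H} \left\{ \int_Z \|f(x) - y\|_Y^2 \, d\rho(x, y) + \lambda_0 \|f\|_H^2 + \sum_{j=1}^{p} \lambda_j \|B_j f\|_H^2 \right\}. \tag{7} \]

Using the fact $\mathcal{E}(f) = \int_Z \|f(x) - y\|_Y^2 \, d\rho(x, y) = \|L_K^{1/2}(f - f_H)\|_H^2 + \mathcal{E}(f_H)$, we get

\[ f_\lambda = \Big( L_K + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \Big)^{-1} L_K f_H . \tag{8} \]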
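The closed form (6) can be sanity-checked in a finite-dimensional stand-in. The following is a minimal sketch, assuming scalar outputs, $H = \mathbb{R}^d$, a hypothetical feature matrix `Phi` playing the role of $S_{\mathbf{x}}$ (so $S_{\mathbf{x}}^* = \Phi^\top / m$), and randomly generated penalty matrices `Bs` standing in for the operators $B_j$. These choices are illustrative assumptions, not the setting used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, p = 20, 5, 2
Phi = rng.normal(size=(m, d))            # hypothetical features; S_x f = Phi @ f
y = rng.normal(size=m)
lam0 = 0.1                               # lambda_0 > 0
lams = [0.05, 0.02]                      # lambda_1, ..., lambda_p >= 0
Bs = [rng.normal(size=(d, d)) for _ in range(p)]   # stand-ins for B_j

def objective(f):
    """Multi-penalty functional (5) in the finite-dimensional stand-in."""
    fit = np.sum((Phi @ f - y) ** 2) / m              # ||S_x f - y||_m^2
    pen = lam0 * f @ f + sum(l * np.sum((B @ f) ** 2) for l, B in zip(lams, Bs))
    return fit + pen

# Normal operator S_x^* S_x + lambda_0 I + sum_j lambda_j B_j^* B_j
A = Phi.T @ Phi / m + lam0 * np.eye(d) + sum(l * B.T @ B for l, B in zip(lams, Bs))
f_hat = np.linalg.solve(A, Phi.T @ y / m)             # formula (6)

# Gradient of the functional at f_hat; it should vanish at the minimizer
grad = 2 * (A @ f_hat - Phi.T @ y / m)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice; the positive $\lambda_0$ keeps the system matrix positive definite, which mirrors the uniqueness statement of Theorem 2.1.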
Further, we also consider the single-penalty version of the functional (7),

\[ f_{\lambda_0} := \operatorname*{argmin}_{f \in H} \left\{ \int_Z \|f(x) - y\|_Y^2 \, d\rho(x, y) + \lambda_0 \|f\|_H^2 \right\}, \tag{9} \]

which implies $f_{\lambda_0} = (L_K + \lambda_0 I)^{-1} L_K f_H$, where the integral operator $L_K$ is a self-adjoint, non-negative, compact operator on the Hilbert space $\big(L^2_{\rho_X}, \langle\cdot,\cdot\rangle_{L^2_{\rho_X}}\big)$ of functions from $X$ to $Y$ square-integrable with respect to $\rho_X$, defined as

\[ L_K(f)(x) := \int_X K(x, t) f(t) \, d\rho_X(t), \qquad x \in X . \]
The integral operator $L_K$ can also be defined as a self-adjoint operator on $H$. We use the same notation $L_K$ for both operators, defined on different domains. The problem of interest is to estimate the error $f_{\mathbf{z},\lambda} - f_H$. We establish the error bounds by estimating the terms $f_{\mathbf{z},\lambda} - f_\lambda$, $f_\lambda - f_{\lambda_0}$ and $f_{\lambda_0} - f_H$. Generally, we refer to the term $f_{\mathbf{z},\lambda} - f_\lambda$ as the sample error and to $f_\lambda - f_H$ as the approximation error. In order to achieve uniform convergence rates for learning algorithms, we consider some prior assumptions on the probability measure $\rho$ in terms of the complexity of the target function and a theoretical spectral parameter, the effective dimension. Following Bauer et al. [9] and Caponnetto et al. [18], we assume that the joint probability measure $\rho$ on $X \times Y$ satisfies the following:

(i)
\[ \int_Z \|y\|_Y^2 \, d\rho(x, y) < \infty . \tag{10} \]

(ii) There exists a minimizer of the generalization error over the RKHS $H$,
\[ f_H := \operatorname*{argmin}_{f \in H} \left\{ \int_Z \|f(x) - y\|_Y^2 \, d\rho(x, y) \right\}. \tag{11} \]
(iii) There exist some constants $M$, $\Sigma$ such that
\[ \int_Y \left( e^{\|y - f_H(x)\|_Y / M} - \frac{\|y - f_H(x)\|_Y}{M} - 1 \right) d\rho(y|x) \le \frac{\Sigma^2}{2M^2} \tag{12} \]
holds for almost all $x \in X$.

Assumption 2.1 (General Source Condition). Suppose
\[ \Omega_{\phi,R} := \{ f \in H : f = \phi(L_K) g \text{ and } \|g\|_H \le R \}, \]
where $\phi$ is a continuous increasing index function defined on the interval $[0, \kappa^2]$ with $\phi(0) = 0$. Then the condition $f_H \in \Omega_{\phi,R}$ is usually referred to as the general source condition [45].

Assumption 2.2 (Polynomial Decay Condition). We assume the eigenvalues $t_n$ of the integral operator $L_K$ follow the polynomial decay: for fixed positive constants $\alpha$, $\beta$ and $b > 1$,
\[ \alpha n^{-b} \le t_n \le \beta n^{-b} \qquad \forall n \in \mathbb{N}. \]

We consider the class of probability measures $\mathcal{P}_\phi$ which satisfy the conditions (i), (ii), (iii) and the general source condition. We also define the probability measure class $\mathcal{P}_{\phi,b}$ satisfying the conditions (i), (ii), (iii), the source condition and the polynomial decay condition. The error estimates studied in our analysis are based on the smoothness of the target function and the effective dimension. In the following subsection we briefly discuss the properties of the effective dimension.

2.2. Spectral theoretical parameter: effective dimension

The effective dimension is an important ingredient for achieving the optimal minimax convergence rates in learning theory. The effective dimension $\mathcal{N}(\alpha)$ of $H$ with respect to the marginal probability measure $\rho_X$ is defined by
\[ \mathcal{N}(\alpha) := \mathrm{Tr}\big( (L_K + \alpha I)^{-1} L_K \big), \qquad \text{for } \alpha > 0 . \]
Since the integral operator $L_K$ is a trace class operator, the effective dimension is finite. Using the singular value decomposition $L_K = \sum_{n=1}^{\infty} t_n \langle \cdot, e_n \rangle_K e_n$ for an orthonormal sequence $(e_n)_{n \in \mathbb{N}}$ of eigenvectors of $L_K$ with corresponding eigenvalues $(t_n)_{n \in \mathbb{N}}$ such that $t_1 \ge t_2 \ge \cdots \ge 0$, we get

\[ \mathcal{N}(\alpha) = \sum_{n=1}^{\infty} \frac{t_n}{\alpha + t_n}, \]

which implies that the function $\alpha \mapsto \mathcal{N}(\alpha)$ is continuous and decreasing from $\infty$ to zero for $0 < \alpha < \infty$. More properties can be found in [12,13,36,37,64]. The effective dimension measures the complexity of the hypothesis space $H$ according to the marginal probability measure $\rho_X$ [18]. The behavior of the effective dimension depends on the decay of the eigenvalues of the integral operator $L_K$. Suppose the eigenvalues of the integral operator satisfy Assumption 2.2; then the effective dimension $\mathcal{N}(\alpha)$ can be estimated from Proposition 3 of [18] as follows:

\[ \mathcal{N}(\alpha) := \mathrm{Tr}\big( (L_K + \alpha I)^{-1} L_K \big) \le \frac{\beta b}{b - 1} \alpha^{-1/b}, \qquad \text{for } b > 1, \tag{13} \]

and without the polynomial decay condition, we have

\[ \mathcal{N}(\alpha) \le \|(L_K + \alpha I)^{-1}\|_{\mathcal{L}(H)} \, \mathrm{Tr}(L_K) \le \frac{\kappa^2}{\alpha} . \]
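The two bounds on the effective dimension can be illustrated numerically. The sketch below assumes a synthetic spectrum $t_n = \beta n^{-b}$ with the illustrative values $\beta = 1$, $b = 2$, truncated at a finite number of terms, and it uses the trace of the truncated spectrum in place of $\kappa^2$ in the second bound. All of these are assumptions made for the example only.

```python
import numpy as np

def effective_dimension(eigs, alpha):
    """N(alpha) = sum_n t_n / (alpha + t_n) for a given (truncated) spectrum."""
    return float(np.sum(eigs / (eigs + alpha)))

beta, b = 1.0, 2.0                              # illustrative decay constants
n = np.arange(1, 200001, dtype=float)
eigs = beta * n ** (-b)                         # synthetic eigenvalues t_n = beta * n^{-b}
trace_LK = float(np.sum(eigs))                  # Tr(L_K) of the truncated spectrum

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
N_vals = [effective_dimension(eigs, a) for a in alphas]
poly_bounds = [beta * b / (b - 1) * a ** (-1 / b) for a in alphas]   # bound (13)
trace_bounds = [trace_LK / a for a in alphas]                        # Tr(L_K) / alpha
```

For small $\alpha$ the power-type bound (13) is much tighter than the trace bound, which is why the decay of the spectrum enters the parameter choice in the later theorems.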
Hence under the polynomial decay condition, we get that the effective dimension of H behaves like power-type function. However, in general, we may not expect power-type behavior of the effective dimension. Lu et al. [37] have shown that for Gaussian kernel with the uniform sampling on [0, 1], the effective dimension exhibits the log-type behavior (14) rather than the power-type behavior (13). Therefore we also consider the logarithm decay condition on the effective dimension N (α ) in our analysis.
Assumption 2.3 (Logarithmic Decay Condition). Assume that there exists some positive constant $c > 0$ such that

\[ \mathcal{N}(\alpha) \le c \log\left( \frac{1}{\alpha} \right), \qquad \forall \alpha > 0 . \tag{14} \]
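In single-parameter regularization, the concept of effective dimension is widely used to obtain the optimal convergence rates under Hölder's source condition [14,18,30] and the general source condition [37,56]. In order to observe the role of the effective dimension, we consider the general regularization scheme [9] for a one-parameter family of operator functions $\{g_\alpha\}$ with $0 < \alpha \le \kappa^2$: $f_{\mathbf{z},\alpha} = g_\alpha(S_{\mathbf{x}}^* S_{\mathbf{x}}) S_{\mathbf{x}}^* \mathbf{y}$. Lu et al. [37] provided the following estimates for general regularization under the general source condition $f_H \in \Omega_{\phi,R}$:

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\alpha} - f_H\|_\rho \le C \sqrt{\alpha} \left( \phi(\alpha) + \sqrt{\frac{\mathcal{N}(\alpha)}{\alpha m}} \right) \left( \log\left( \frac{6}{\eta} \right) \right)^3 \right\} \ge 1 - \eta \]

and

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\alpha} - f_H\|_H \le C \left( \phi(\alpha) + \sqrt{\frac{\mathcal{N}(\alpha)}{\alpha m}} \right) \left( \log\left( \frac{6}{\eta} \right) \right)^2 \right\} \ge 1 - \eta . \]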
The error estimates reveal the interesting fact that the error terms consist of an increasing and a decreasing function of $\alpha$, which led to the a posteriori choice of the regularization parameter $\alpha$ based on the balancing principle proposed in [37]. Hence the effective dimension plays a crucial role in the error analysis of regularized learning algorithms. Using the concept of effective dimension, the convergence rates in our analysis improve on the error bounds provided in [1] under the polynomial decay condition, and we also achieve the optimal minimax convergence rates for the multi-penalty regularization scheme (Theorem 3.5).

3. Error analysis of multi-penalty regularization under general source condition

In this section, we discuss the convergence issues of multi-penalty regularization using linear functional estimations. Linear functional error estimations are rather general, in the sense that the error estimates in the $H_K$-norm and the $L^2_{\rho_X}$-norm can be deduced from the error estimates of linear functionals. Moreover, the error analysis of linear functionals will be used to address the convergence issues of the constructed solution based on the linear functional strategy in the vector-valued setting (Section 5).

3.1. Error estimation of linear functionals

In Theorem 3.1, for the class of probability measures $\mathcal{P}_\phi$, we present the error bound for the estimation of the values of linear functionals $\ell_f(f_H) = \langle f, f_H \rangle_H$ by $\langle f, f_{\mathbf{z},\lambda} \rangle_H$ for a fixed value of the regularization parameter $\lambda$. The achievable accuracy for estimating $\langle f, f_H \rangle_H$ is essentially determined by the smoothness of the target function $f_H$ and the smoothness of the functional representer $f$ in terms of the general source condition. Here it is important to note that bounded linear functional values $\langle f, f_{\mathbf{z},\lambda} \rangle_H$ can be estimated more accurately than the regularized solution $f_{\mathbf{z},\lambda}$ (Proposition 2.15 [7,39]).
In Theorem 3.2, we derive the convergence rate of $\langle f, f_{\mathbf{z},\lambda} \rangle_H$ under the general source condition $f_H \in \Omega_{\phi,R}$ for the parameter choice based on the index function $\phi$ and the sample size $m$. For the class of probability measures $\mathcal{P}_{\phi,b}$, in Theorem 3.3 we derive the convergence rate in terms of the index function $\phi$, the parameter $b$ related to the polynomial decay condition and the sample size $m$. The parameter $b$ enters the parameter choice through the estimate (13) of the effective dimension and provides the optimal convergence rates. In particular, for sufficiently large sample size $m$, we can assume the following inequality:

\[ \frac{8 \kappa^2}{\sqrt{m}} \log\left( \frac{4}{\eta} \right) \le \lambda_0 . \tag{15} \]
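Theorem 3.1. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_\phi$ and $f \in \Omega_{\psi,R'}$ with the assumption that $t^{-1/2}\psi(t)$, $t^{-1}\phi(t)$, $t^{-1}\phi(t)\psi(t)$ are nonincreasing functions. Then for sufficiently large sample size $m$ according to (15), with confidence $1 - \eta$ for all $0 < \eta < 1$, $\langle f, f_H \rangle_H$ can be approximated as follows:

\[ |\langle f, f_H \rangle_H - \langle f, f_{\mathbf{z},\lambda} \rangle_H| \le R' \psi(\lambda_0) \left\{ 3R\phi(\lambda_0) + \frac{3 B_\lambda \|f_H\|_\rho}{\lambda_0^{3/2}} + 4 \left( \frac{\kappa M}{m \lambda_0} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m \lambda_0}} \right) \log\left( \frac{4}{\eta} \right) \right\}, \]

where $B_\lambda = \big\| \sum_{j=1}^{p} \lambda_j B_j^* B_j \big\|_{\mathcal{L}(H)}$ and $\|f\|_\rho := \|f\|_{L^2_{\rho_X}} = \left( \int_X \|f(x)\|_Y^2 \, d\rho_X(x) \right)^{1/2}$.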
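Proof. Under the smoothness assumptions $f_H \in \Omega_{\phi,R}$ and $f \in \Omega_{\psi,R'}$ there exist $g, v \in H$ such that $f_H = \phi(L_K) g$, $f = \psi(L_K) v$ and $\|g\|_H \le R$, $\|v\|_H \le R'$. Then we have

\[ |\langle f, f_{\mathbf{z},\lambda} \rangle_H - \langle f, f_H \rangle_H| \le I_S + I_A, \tag{16} \]

where $I_S = |\langle v, \psi(L_K)(f_{\mathbf{z},\lambda} - f_\lambda) \rangle_H|$ and $I_A = |\langle v, \psi(L_K)(f_\lambda - f_H) \rangle_H|$.

Now $f_{\mathbf{z},\lambda} - f_\lambda$ can be expressed as

\[ f_{\mathbf{z},\lambda} - f_\lambda = \Big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \Big)^{-1} \Big\{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_\lambda - \lambda_0 f_\lambda - \sum_{j=1}^{p} \lambda_j B_j^* B_j f_\lambda \Big\} . \]

Then $f_\lambda = \big( L_K + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \big)^{-1} L_K f_H$ implies

\[ L_K f_H = L_K f_\lambda + \lambda_0 f_\lambda + \sum_{j=1}^{p} \lambda_j B_j^* B_j f_\lambda . \]

Therefore,

\[ f_{\mathbf{z},\lambda} - f_\lambda = \Big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \Big)^{-1} \big\{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_\lambda - L_K (f_H - f_\lambda) \big\} . \]

Further we can continue as follows:

\[ I_S \le R' \|\psi(L_K) \Delta_S (L_K + \lambda_0 I)^{1/2}\|_{\mathcal{L}(H)} \, \big\| (L_K + \lambda_0 I)^{-1/2} \{ S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_H + (S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K)(f_H - f_\lambda) \} \big\|_H \]
\[ \le R' I_3 I_4 \left\{ I_1 + \frac{I_2}{\sqrt{\lambda_0}} \|f_\lambda - f_H\|_H \right\} \le R' I_3 I_4 \left\{ I_1 + \frac{I_2}{\sqrt{\lambda_0}} \|f_\lambda - f_{\lambda_0}\|_H + \frac{I_2}{\sqrt{\lambda_0}} \|f_{\lambda_0} - f_H\|_H \right\} \]

and

\[ I_A \le R' \left\{ \|\psi(L_K)(f_\lambda - f_{\lambda_0})\|_H + \|\psi(L_K)(f_{\lambda_0} - f_H)\|_H \right\}, \]

where $I_1 = \|(L_K + \lambda_0 I)^{-1/2} (S_{\mathbf{x}}^* \mathbf{y} - S_{\mathbf{x}}^* S_{\mathbf{x}} f_H)\|_H$, $I_2 = \|S_{\mathbf{x}}^* S_{\mathbf{x}} - L_K\|_{\mathcal{L}(H)}$, $I_3 = \|\psi(L_K)(L_K + \lambda_0 I)^{-1/2}\|_{\mathcal{L}(H)}$, $I_4 = \|(L_K + \lambda_0 I)^{1/2} \Delta_S (L_K + \lambda_0 I)^{1/2}\|_{\mathcal{L}(H)}$ and $\Delta_S = \big( S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \big)^{-1}$. The estimate of $I_1$ can be obtained from step 3.2 of Theorem 4 [18], while the estimate of $I_2$ can be found in Theorem 2 of De Vito et al. [25]: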
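\[ I_1 \le 2 \left( \frac{\kappa M}{m \sqrt{\lambda_0}} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m}} \right) \log\left( \frac{4}{\eta} \right), \qquad I_2 \le 2 \left( \frac{\kappa^2}{m} + \frac{\kappa^2}{\sqrt{m}} \right) \log\left( \frac{4}{\eta} \right) \]

holds with probability $1 - \eta$.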
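The expression $f_\lambda - f_{\lambda_0} = -\big( L_K + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \big)^{-1} \sum_{j=1}^{p} \lambda_j B_j^* B_j f_{\lambda_0}$ implies that

\[ \|f_\lambda - f_{\lambda_0}\|_H \le \frac{B_\lambda}{\lambda_0} \|f_{\lambda_0}\|_H \]

and

\[ \|\psi(L_K)(f_\lambda - f_{\lambda_0})\|_H \le \frac{B_\lambda I_3}{\sqrt{\lambda_0}} \|(L_K + \lambda_0 I)^{1/2} \Delta_A (L_K + \lambda_0 I)^{1/2}\|_{\mathcal{L}(H)} \|f_{\lambda_0}\|_H \le \frac{B_\lambda I_3}{\sqrt{\lambda_0}} \|f_{\lambda_0}\|_H , \]

where $B_\lambda = \big\| \sum_{j=1}^{p} \lambda_j B_j^* B_j \big\|_{\mathcal{L}(H)}$ and $\Delta_A = \big( L_K + \lambda_0 I + \sum_{j=1}^{p} \lambda_j B_j^* B_j \big)^{-1}$. Note that $f_{\lambda_0}$ is the minimizer of the functional (9), and taking $f = 0$ yields

\[ \|f_{\lambda_0} - f_H\|_\rho^2 + \lambda_0 \|f_{\lambda_0}\|_H^2 \le \|f_H\|_\rho^2 , \]

which gives

\[ \|f_{\lambda_0}\|_\rho \le 2 \|f_H\|_\rho \quad \text{and} \quad \|f_{\lambda_0}\|_H \le \frac{\|f_H\|_\rho}{\sqrt{\lambda_0}} . \]

Under the smoothness of $\phi$ and $\psi$, we get $\|f_{\lambda_0} - f_H\|_H \le R \phi(\lambda_0)$ and $\|\psi(L_K)(f_{\lambda_0} - f_H)\|_H \le R \psi(\lambda_0) \phi(\lambda_0)$. Now we have

\[ I_3 = \|\psi(L_K)(L_K + \lambda_0 I)^{-1/2}\|_{\mathcal{L}(H)} \le \sup_{0 < t \le \kappa^2} \frac{\psi(t)}{(t + \lambda_0)^{1/2}} \le \frac{\psi(\lambda_0)}{\sqrt{\lambda_0}} . \]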
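For sufficiently large $m$ we get with confidence $1 - \eta/2$,

\[ \|(L_K + \lambda_0 I)^{-1/2} (L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}) (L_K + \lambda_0 I)^{-1/2}\|_{\mathcal{L}(H)} \le \frac{I_2}{\lambda_0} \le \frac{4 \kappa^2}{\sqrt{m} \lambda_0} \log\left( \frac{4}{\eta} \right) \le \frac{1}{2}, \]

which implies

\[ I_4 \le \|(L_K + \lambda_0 I)^{1/2} (S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda_0 I)^{-1} (L_K + \lambda_0 I)^{1/2}\|_{\mathcal{L}(H)} = \big\| \{ I - (L_K + \lambda_0 I)^{-1/2} (L_K - S_{\mathbf{x}}^* S_{\mathbf{x}}) (L_K + \lambda_0 I)^{-1/2} \}^{-1} \big\|_{\mathcal{L}(H)} \le 2 . \]

Hence we conclude that with probability $1 - \eta$,

\[ I_S \le 2 R' \psi(\lambda_0) \left\{ 2 \left( \frac{\kappa M}{m \lambda_0} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m \lambda_0}} \right) \log\left( \frac{4}{\eta} \right) + \frac{B_\lambda}{\lambda_0^{3/2}} \|f_H\|_\rho + R \phi(\lambda_0) \right\} \]

and

\[ I_A \le R' \psi(\lambda_0) \left\{ \frac{B_\lambda}{\lambda_0^{3/2}} \|f_H\|_\rho + R \phi(\lambda_0) \right\} . \]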
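Combining the error bounds of $I_S$ and $I_A$ we obtain the desired result. □

Theorem 3.2. Under the same assumptions as in Theorem 3.1, with the parameter choice $\lambda_0 \in (0, 1]$, $\lambda_0 = \Theta^{-1}(m^{-1/2})$, $\lambda_j = (\Theta^{-1}(m^{-1/2}))^{3/2} \phi(\Theta^{-1}(m^{-1/2}))$ for $1 \le j \le p$, where $\Theta(t) = t\phi(t)$, $\langle f, f_H \rangle_H$ can be approximated as follows:

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ |\langle f, f_H \rangle_H - \langle f, f_{\mathbf{z},\lambda} \rangle_H| \le C \psi(\Theta^{-1}(m^{-1/2})) \phi(\Theta^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right) \right\} \ge 1 - \eta, \]

where $C = R' \big( 3R + 4\kappa M + 4\kappa\Sigma + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(H)} \|f_H\|_\rho \big)$.

Proof. Let $\Theta(t) = t\phi(t)$. Then it follows that

\[ \lim_{t \to 0} \frac{\Theta(t)}{\sqrt{t}} = \lim_{t \to 0} \frac{t^2}{\Theta^{-1}(t)} = 0 . \]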
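Under the parameter choice $\lambda_0 = \Theta^{-1}(m^{-1/2})$ we have

\[ \lim_{m \to \infty} m \lambda_0 = \infty . \]

Therefore for sufficiently large $m$, we get $m\lambda_0 \ge 1$ and

\[ \frac{1}{m \lambda_0} = \frac{\lambda_0^{1/2} \phi(\lambda_0)}{\sqrt{m \lambda_0}} \le \lambda_0^{1/2} \phi(\lambda_0) . \]

Under the parameter choice $\lambda_0 \le 1$, $\lambda_0 = \Theta^{-1}(m^{-1/2})$, $\lambda_j = (\Theta^{-1}(m^{-1/2}))^{3/2} \phi(\Theta^{-1}(m^{-1/2}))$ for $1 \le j \le p$, it follows from Theorem 3.1 that with confidence $1 - \eta$,

\[ |\langle f, f_H \rangle_H - \langle f, f_{\mathbf{z},\lambda} \rangle_H| \le C \psi(\Theta^{-1}(m^{-1/2})) \phi(\Theta^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right), \]

where $C = R' \big( 3R + 4\kappa M + 4\kappa\Sigma + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(H)} \|f_H\|_\rho \big)$. Hence our conclusion follows. □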
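The parameter choice of Theorem 3.2 requires inverting $\Theta(t) = t\phi(t)$. A minimal sketch of how this could be done numerically is given below; it assumes that $\phi$ is continuous and increasing on $(0, 1]$ with $\Theta(1) \ge m^{-1/2}$, uses plain bisection, and takes the Hölder-type index function $\phi(t) = t^r$ purely as an illustrative example. The function names are hypothetical.

```python
import math

def inverse_increasing(func, target, lo=1e-16, hi=1.0, iters=200):
    """Bisection for func(t) = target, func continuous and increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def parameter_choice(phi, m, p):
    """lambda_0 = Theta^{-1}(m^{-1/2}), lambda_j = lambda_0^{3/2} * phi(lambda_0)."""
    theta = lambda t: t * phi(t)
    lam0 = inverse_increasing(theta, m ** -0.5)
    lam_j = lam0 ** 1.5 * phi(lam0)
    return [lam0] + [lam_j] * p

r, m, p = 0.5, 10_000, 2                      # illustrative smoothness, sample size, penalties
lams = parameter_choice(lambda t: t ** r, m, p)
closed_form_lam0 = m ** (-1 / (2 * r + 2))    # Theta(t) = t^{1+r} can be inverted by hand
```

For $\phi(t) = t^r$ the inverse is available in closed form, $\lambda_0 = m^{-1/(2r+2)}$, which gives a direct check of the bisection; for a general index function only the numerical route is available.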
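Theorem 3.3. Under the same assumptions as in Theorem 3.1 and Assumption 2.2, with the parameter choice $\lambda_0 \in (0, 1]$, $\lambda_0 = \Psi^{-1}(m^{-1/2})$, $\lambda_j = (\Psi^{-1}(m^{-1/2}))^{3/2} \phi(\Psi^{-1}(m^{-1/2}))$ for $1 \le j \le p$, where $\Psi(t) = t^{\frac{1}{2} + \frac{1}{2b}} \phi(t)$, $\langle f, f_H \rangle_H$ can be approximated as follows:

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ |\langle f, f_H \rangle_H - \langle f, f_{\mathbf{z},\lambda} \rangle_H| \le C' \psi(\Psi^{-1}(m^{-1/2})) \phi(\Psi^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right) \right\} \ge 1 - \eta, \]

where $C' = R' \big( 3R + 4\kappa M + 4\sqrt{\beta b \Sigma^2 / (b - 1)} + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(H)} \|f_H\|_\rho \big)$.

Proof. Let $\Psi(t) = t^{\frac{1}{2} + \frac{1}{2b}} \phi(t)$. Then it follows that

\[ \lim_{t \to 0} \frac{\Psi(t)}{\sqrt{t}} = \lim_{t \to 0} \frac{t^2}{\Psi^{-1}(t)} = 0 . \]

Under the parameter choice $\lambda_0 = \Psi^{-1}(m^{-1/2})$ we have

\[ \lim_{m \to \infty} m \lambda_0 = \infty . \]

Therefore for sufficiently large $m$, we get $m\lambda_0 \ge 1$ and

\[ \frac{1}{m \lambda_0} = \frac{\lambda_0^{\frac{1}{2b}} \phi(\lambda_0)}{\sqrt{m \lambda_0}} \le \lambda_0^{\frac{1}{2b}} \phi(\lambda_0) . \]

Under the parameter choice $\lambda_0 \le 1$, $\lambda_0 = \Psi^{-1}(m^{-1/2})$, $\lambda_j = (\Psi^{-1}(m^{-1/2}))^{3/2} \phi(\Psi^{-1}(m^{-1/2}))$ for $1 \le j \le p$, it follows from Theorem 3.1 and the estimate (13) that with confidence $1 - \eta$,

\[ |\langle f, f_H \rangle_H - \langle f, f_{\mathbf{z},\lambda} \rangle_H| \le C' \psi(\Psi^{-1}(m^{-1/2})) \phi(\Psi^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right), \tag{17} \]

where $C' = R' \big( 3R + 4\kappa M + 4\sqrt{\beta b \Sigma^2 / (b - 1)} + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(H)} \|f_H\|_\rho \big)$. Hence our conclusion follows. □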
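For a Hölder-type index function $\phi(t) = t^r$ (an assumption made here only for illustration), the parameter choice $\lambda_0 = \Psi^{-1}(m^{-1/2})$ reduces to powers of $m$, and the exponents can be computed exactly. The sketch below uses rational arithmetic; the helper name `holder_exponents` is hypothetical. Each returned value `e` means that the corresponding quantity equals $m^{-e}$.

```python
from fractions import Fraction

def holder_exponents(r, b):
    """Exponents e with quantity = m^{-e}, for phi(t) = t^r and Psi(t) = t^{1/2 + 1/(2b) + r}."""
    r, b = Fraction(r), Fraction(b)
    psi_power = Fraction(1, 2) + 1 / (2 * b) + r     # Psi(t) = t^{psi_power}
    lam0 = Fraction(1, 2) / psi_power                # Psi(lambda_0) = m^{-1/2}
    lam_j = lam0 * (Fraction(3, 2) + r)              # lambda_j = lambda_0^{3/2} * phi(lambda_0)
    rate_H = lam0 * r                                # phi(lambda_0)
    rate_rho = lam0 * (r + Fraction(1, 2))           # lambda_0^{1/2} * phi(lambda_0)
    return lam0, lam_j, rate_H, rate_rho

# Illustrative values: smoothness r = 1/2 and eigenvalue decay b = 2
lam0, lam_j, rate_H, rate_rho = holder_exponents(Fraction(1, 2), 2)
```

With $r = 1/2$ and $b = 2$ this yields $\lambda_0 = m^{-2/5}$ and $\lambda_j = m^{-4/5}$, in agreement with the general expressions $m^{-b/(2br+b+1)}$ and $m^{-(2br+3b)/(4br+2b+2)}$ quoted in Remark 3.1.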
3.2. Convergence rates of multi-penalty regularizer

Here the upper convergence rates of the regularized solution $f_{\mathbf{z},\lambda}$ are derived from the estimates of Theorems 3.2 and 3.3 for the classes of probability measures $\mathcal{P}_\phi$ and $\mathcal{P}_{\phi,b}$, respectively. In Theorem 3.4, we discuss the error estimates under the general source condition and the parameter choice rule based on the index function $\phi$ and the sample size $m$. Under the polynomial decay condition, in Theorem 3.5 we obtain the optimal minimax convergence rates in terms of the index function $\phi$, the parameter $b$ and the number of samples $m$. In particular, under Hölder's source condition, we present the convergence rates under the logarithmic decay condition on the effective dimension in Corollary 3.1.
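Theorem 3.4. Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_\phi$. Then for all $0 < \eta < 1$, the following error estimates hold:

(i) If $\phi(t)$ and $t/\phi(t)$ are nondecreasing functions, then with confidence $1 - \eta$ we have

\[ \|f_{\mathbf{z},\lambda} - f_H\|_H \le \left\{ 3R\phi(\lambda_0) + \frac{3 B_\lambda \|f_H\|_\rho}{\lambda_0^{3/2}} + 4 \left( \frac{\kappa M}{m \lambda_0} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m \lambda_0}} \right) \log\left( \frac{4}{\eta} \right) \right\} \]

and under the parameter choice $\lambda_0 \in (0, 1]$, $\lambda_0 = \Theta^{-1}(m^{-1/2})$, $\lambda_j = (\Theta^{-1}(m^{-1/2}))^{3/2} \phi(\Theta^{-1}(m^{-1/2}))$ for $1 \le j \le p$, where $\Theta(t) = t\phi(t)$, we have

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\lambda} - f_H\|_H \le C \phi(\Theta^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right) \right\} \ge 1 - \eta, \]

where $B_\lambda = \big\| \sum_{j=1}^{p} \lambda_j B_j^* B_j \big\|_{\mathcal{L}(H)}$ and $C = 3R + 4\kappa M + 4\kappa\Sigma + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(H)} \|f_H\|_\rho$.

(ii) If $\phi(t)$ and $\sqrt{t}/\phi(t)$ are nondecreasing functions, then with confidence $1 - \eta$ we have

\[ \|f_{\mathbf{z},\lambda} - f_H\|_\rho \le \left\{ 3R\lambda_0^{1/2}\phi(\lambda_0) + \frac{3 B_\lambda \|f_H\|_\rho}{\lambda_0} + 4 \left( \frac{\kappa M}{m \sqrt{\lambda_0}} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m}} \right) \log\left( \frac{4}{\eta} \right) \right\} \]

and under the parameter choice $\lambda_0 \in (0, 1]$, $\lambda_0 = \Theta^{-1}(m^{-1/2})$, $\lambda_j = (\Theta^{-1}(m^{-1/2}))^{3/2} \phi(\Theta^{-1}(m^{-1/2}))$ for $1 \le j \le p$, where $\Theta(t) = t\phi(t)$, we have

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\lambda} - f_H\|_\rho \le C (\Theta^{-1}(m^{-1/2}))^{1/2} \phi(\Theta^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right) \right\} \ge 1 - \eta, \]

provided that

\[ \sqrt{m} \lambda_0 \ge 8\kappa^2 \log(4/\eta) . \]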
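Proof. When $f \in H$ we get

\[ \|f_{\mathbf{z},\lambda} - f_H\|_H = \sup_{\|f\|_H = 1} |\langle f, f_H \rangle_H - \langle f, f_{\mathbf{z},\lambda} \rangle_H| \le \left\{ 3R\phi(\lambda_0) + \frac{3 B_\lambda \|f_H\|_\rho}{\lambda_0^{3/2}} + 4 \left( \frac{\kappa M}{m \lambda_0} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m \lambda_0}} \right) \log\left( \frac{4}{\eta} \right) \right\} . \]

On the other hand, for $L_K^{-1/2} f \in H$ we have

\[ \|f_{\mathbf{z},\lambda} - f_H\|_\rho = \|L_K^{1/2} (f_{\mathbf{z},\lambda} - f_H)\|_H = \sup_{\|v\|_H = 1} |\langle L_K^{1/2} v, f_H \rangle_H - \langle L_K^{1/2} v, f_{\mathbf{z},\lambda} \rangle_H| \]
\[ \le \left\{ 3R\lambda_0^{1/2}\phi(\lambda_0) + \frac{3 B_\lambda \|f_H\|_\rho}{\lambda_0} + 4 \left( \frac{\kappa M}{m \sqrt{\lambda_0}} + \sqrt{\frac{\Sigma^2 \mathcal{N}(\lambda_0)}{m}} \right) \log\left( \frac{4}{\eta} \right) \right\} . \]

Under the parameter choice $\lambda_0 \le 1$, $\lambda_0 = \Theta^{-1}(m^{-1/2})$, $\lambda_j = (\Theta^{-1}(m^{-1/2}))^{3/2} \phi(\Theta^{-1}(m^{-1/2}))$ for $1 \le j \le p$, we get

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\lambda} - f_H\|_H \le C \phi(\Theta^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right) \right\} \ge 1 - \eta \]

and

\[ \operatorname*{Prob}_{\mathbf{z} \in Z^m} \left\{ \|f_{\mathbf{z},\lambda} - f_H\|_\rho \le C (\Theta^{-1}(m^{-1/2}))^{1/2} \phi(\Theta^{-1}(m^{-1/2})) \log\left( \frac{4}{\eta} \right) \right\} \ge 1 - \eta, \]

where $C = 3R + 4\kappa M + 4\kappa\Sigma + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(H)} \|f_H\|_\rho$. Hence our conclusions follow. □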
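Corollary 3.1. For the Hölder source condition $f_H \in \Omega_{\phi,R}$, $\phi(t) = t^r$, under the same assumptions of Theorem 3.4 and Assumption 2.3 on the effective dimension $\mathcal{N}(\lambda_0)$, with the parameter choice
\[
\lambda_0 = \left( \frac{\log m}{m} \right)^{\frac{1}{2r+1}} \quad \text{and} \quad \lambda_j = \left( \frac{\log m}{m} \right)^{\frac{2r+3}{4r+2}} \quad (1 \le j \le p),
\]
for all $0 < \eta < 1$, we have the following convergence rates with confidence $1 - \eta$:
\[
\|f_{z,\lambda} - f_H\|_{\mathcal{H}} \le C \left( \frac{\log m}{m} \right)^{\frac{r}{2r+1}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le 1
\]
and
\[
\|f_{z,\lambda} - f_H\|_\rho \le C \left( \frac{\log m}{m} \right)^{\frac{1}{2}} \log\left(\frac{4}{\eta}\right) \quad \text{for } 0 \le r \le \frac{1}{2},
\]
where $C = 3R + M/16\kappa^3 + 2\sqrt{2c}\,\Sigma + 3 \sum_{j=1}^{p} \|B_j^* B_j\|_{\mathcal{L}(\mathcal{H})} \|f_H\|_\rho$.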
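As a quick consistency check on the exponents in Corollary 3.1 (a worked computation, not part of the original statement), one may substitute $\phi(t) = t^r$ and $\lambda_0 = (\log m / m)^{1/(2r+1)}$ into the relation $\lambda_j = \lambda_0^{3/2}\phi(\lambda_0)$ and into the rate factors $\phi(\lambda_0)$ and $\lambda_0^{1/2}\phi(\lambda_0)$ that appear in Theorem 3.4:

```latex
\begin{align*}
\lambda_j &= \lambda_0^{3/2}\,\phi(\lambda_0) = \lambda_0^{\,r+3/2}
           = \left(\frac{\log m}{m}\right)^{\frac{2r+3}{4r+2}},\\
\phi(\lambda_0) &= \lambda_0^{\,r}
           = \left(\frac{\log m}{m}\right)^{\frac{r}{2r+1}},\\
\lambda_0^{1/2}\,\phi(\lambda_0) &= \lambda_0^{\,r+1/2}
           = \left(\frac{\log m}{m}\right)^{\frac{1}{2}}.
\end{align*}
```

In particular, the exponent of the $\|\cdot\|_\rho$ rate does not depend on $r$ under this parameter choice.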
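Theorem 3.5. Let z be i.i.d. samples drawn according to the probability measure $\rho \in \mathcal{P}_{\phi,b}$. Then under the parameter choice $\lambda_0 \in (0, 1]$, $\lambda_0 = \Psi^{-1}(m^{-1/2})$, $\lambda_j = (\Psi^{-1}(m^{-1/2}))^{3/2} \phi(\Psi^{-1}(m^{-1/2}))$ for $1 \le j \le p$, where $\Psi(t) = t^{\frac{1}{2} + \frac{1}{2b}} \phi(t)$, for all $0 < \eta < 1$, the following error estimates hold with confidence $1 - \eta$:

(i) If $\phi(t)$ and $t/\phi(t)$ are nondecreasing functions, then we have
\[
\operatorname{Prob}_{z \in Z^m} \left\{ \|f_{z,\lambda} - f_H\|_{\mathcal{H}} \le \bar{C} \phi(\Psi^{-1}(m^{-1/2})) \log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta
\]
and
\[
\lim_{\tau \to \infty} \limsup_{m \to \infty} \sup_{\rho \in \mathcal{P}_{\phi,b}} \operatorname{Prob}_{z \in Z^m} \left\{ \|f_{z,\lambda} - f_H\|_{\mathcal{H}} > \tau \phi(\Psi^{-1}(m^{-1/2})) \right\} = 0,
\]
where $\bar{C} = 2R + 4\kappa M + 2M \|B^* B\|_{\mathcal{L}(\mathcal{H})} + 4\kappa\Sigma$.

(ii) If $\phi(t)$ and $\sqrt{t}/\phi(t)$ are nondecreasing functions, then we have
\[
\operatorname{Prob}_{z \in Z^m} \left\{ \|f_{z,\lambda} - f_H\|_\rho \le \bar{C} (\Psi^{-1}(m^{-1/2}))^{1/2} \phi(\Psi^{-1}(m^{-1/2})) \log\left(\frac{4}{\eta}\right) \right\} \ge 1 - \eta
\]
and
\[
\lim_{\tau \to \infty} \limsup_{m \to \infty} \sup_{\rho \in \mathcal{P}_{\phi,b}} \operatorname{Prob}_{z \in Z^m} \left\{ \|f_{z,\lambda} - f_H\|_\rho > \tau (\Psi^{-1}(m^{-1/2}))^{1/2} \phi(\Psi^{-1}(m^{-1/2})) \right\} = 0,
\]
provided that
\[
\sqrt{m}\,\lambda_0 \ge 8\kappa^2 \log(4/\eta).
\]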
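The parameter choice of Theorem 3.5 requires inverting $\Psi$. The following is a minimal sketch, assuming the Hölder index function $\phi(t) = t^r$, of how $\lambda_0 = \Psi^{-1}(m^{-1/2})$ and $\lambda_j = \lambda_0^{3/2}\phi(\lambda_0)$ could be evaluated numerically by bisection. The function names (`psi`, `invert_increasing`, `parameter_choice`) are illustrative and do not come from the paper.

```python
import math


def psi(t, r, b):
    """Psi(t) = t^(1/2 + 1/(2b)) * phi(t), assuming phi(t) = t^r."""
    return t ** (0.5 + 1.0 / (2.0 * b) + r)


def invert_increasing(fn, target, lo=1e-300, hi=1.0, iters=200):
    """Solve fn(t) = target on [lo, hi] by bisection in log scale.

    Assumes fn is increasing and fn(lo) <= target <= fn(hi).
    """
    for _ in range(iters):
        mid = math.sqrt(lo) * math.sqrt(hi)  # geometric midpoint
        if fn(mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo) * math.sqrt(hi)


def parameter_choice(m, r, b):
    """Return (lambda_0, lambda_j) for sample size m under phi(t) = t^r."""
    lam0 = invert_increasing(lambda t: psi(t, r, b), m ** -0.5)
    lam_j = lam0 ** 1.5 * lam0 ** r  # lambda_0^{3/2} * phi(lambda_0)
    return lam0, lam_j
```

For $\phi(t) = t^r$ the inverse is also available in closed form, $\lambda_0 = m^{-b/(2br+b+1)}$, so the bisection is only needed for index functions without an explicit inverse.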
The lower convergence rates are studied for any learning algorithm over the class of probability measures $\mathcal{P}_{\phi,b}$ in Theorems 3.10 and 3.12 of [56]. Here we have discussed the upper convergence rates for the multi-penalty regularization scheme in the vector-valued function setting. The parameter choice $\lambda = \lambda(m)$ is said to be optimal if the upper convergence rates for $\lambda = \lambda(m)$ coincide with the lower convergence rates. For the parameter choice $\lambda = (\lambda_0, \ldots, \lambda_p)$, $\lambda_0 = \Psi^{-1}(m^{-1/2})$, $\lambda_j = (\Psi^{-1}(m^{-1/2}))^{3/2} \phi(\Psi^{-1}(m^{-1/2}))$, $1 \le j \le p$, the upper convergence rates of Theorem 3.5 coincide with the lower minimax rates of Theorems 3.10 and 3.12 of [56]. Therefore the choice of the parameter is optimal.

Remark 3.1. Under the same assumptions of Theorem 3.5, for the Hölder source condition $f_H \in \Omega_{\phi,R}$, $\phi(t) = t^r$, with the parameter choice $\lambda_0 \in (0, 1]$, $\lambda_0 = m^{-\frac{b}{2br+b+1}}$ and $\lambda_j = m^{-\frac{2br+3b}{4br+2b+2}}$ for $1 \le j \le p$, we obtain the optimal minimax convergence rates $O(m^{-\frac{br}{2br+b+1}})$ for $0 \le r \le 1$ and $O(m^{-\frac{2br+b}{4br+2b+2}})$ for $0 \le r \le \frac{1}{2}$ in the RKHS-norm and the $L^2_{\rho_X}$-norm, respectively.

Remark 3.2. Under the polynomial decay condition, the convergence rates in our analysis (Theorem 3.5) improve on the error bounds provided in [1]. Using the concept of effective dimension,
we achieve the optimal minimax convergence rates for the multi-penalty regularization scheme for the class of probability measures Pφ,b. Under the logarithmic decay condition, the convergence rates for multi-penalty regularization discussed in Corollary 3.1 have the same order as those of single-parameter regularization provided in Corollary 5.2 [37].

Remark 3.3. For real-valued functions and multi-task algorithms (the output space Y ⊂ R^m for some m ∈ N) we can obtain the error estimates from our analysis without imposing any condition on the conditional probability measure (12) for the bounded output space Y.

We discussed a parameter choice that depends on the smoothness of the target function, which is not known in general. Here we present a data-driven approach which requires information about the samples only.

4. Parameter choice rules

Parameter choice strategy is the backbone of regularization schemes, since most regularization schemes depend on tuning parameters. Various prior and posterior parameter choice rules have been proposed for single-penalty regularization [6,19,24,51,63] (also see references therein). Many regularization parameter selection approaches have been discussed for multi-penalized ill-posed inverse problems, such as the discrepancy principle [38,41], the quasi-optimality principle [28,52], the balanced-discrepancy principle [32], the heuristic L-curve [10], noise structure based parameter choice rules [5,8,20], and some approaches which require reduction to single-penalty regularization [16]. Due to the growing interest in multi-penalty regularization, multi-parameter choice rules have also been discussed in the learning theory framework, such as the discrepancy principle [38,40], the balanced-discrepancy principle [1], and a parameter choice strategy based on the generalized cross validation score [62]. The discrepancy principle is a widely applied parameter choice rule, known as the first parameter choice rule in regularization theory [55].
The discrepancy approach is discussed for ill-posed problems as well as for learning algorithms [38] in multi-penalty regularization strategies. The discrepancy principle produces a discrepancy surface of the parameter $\lambda = (\lambda_0, \lambda_1, \ldots, \lambda_p)$, and our aim is to pick one point on this surface. We discuss the "balanced-discrepancy principle", a parameter choice approach which adaptively chooses the first parameter according to the balancing principle and then locates a point on the discrepancy surface corresponding to this choice. Here we generalize the balanced-discrepancy principle for two-parameter regularization [1].

The balancing principle selects the parameter via a trade-off between the sample error and the approximation error. Since the sample error decreases and the approximation error increases as a function of $\lambda_0$, the balancing principle seeks an ideal parameter value $\lambda_0^{\circ}$ for which the sample error and the approximation error are equal. The error estimates for the single-penalty regularization scheme ($\lambda_j = 0$ for $1 \le j \le p$) corresponding to (5) can be obtained from Theorem 10 [9]:
\[
\operatorname{Prob}_{z \in Z^m} \left\{ \|f_{z,\lambda_0} - f_H\|_{\mathcal{H}} \le C \log\left(\frac{4}{\eta}\right) \left( \frac{1}{\lambda_0 \sqrt{m}} + \phi(\lambda_0) \right) \right\} \ge 1 - \eta,
\]
where $\phi$ is the index function with the assumption that $t^{-1}\phi(t)$ is a nonincreasing function. Given a geometric sequence $\lambda_0^i = \lambda_0^{start} q^i$ with $q > 1$ and $\lambda_0^{start} \le c\,m^{-1/2}$, the ideal parameter $\lambda_0^{\circ}$ can be estimated by
\[
\bar{\lambda}_0 = \max \left\{ \lambda_0^i : \|f_{z,\lambda_0^j} - f_{z,\lambda_0^{j-1}}\|_{\mathcal{H}} \le \frac{4C \log(4/\eta)}{\lambda_0^{j-1} \sqrt{m}}, \; j = 1, \ldots, i - 1 \right\}. \tag{18}
\]
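A minimal sketch of the selection rule (18) is given below. It assumes that the norms of the consecutive differences $\|f_{z,\lambda_0^j} - f_{z,\lambda_0^{j-1}}\|_{\mathcal{H}}$ have already been computed and that a value for the constant $C$ is supplied by the user. The function name and argument layout are hypothetical, and the index range follows (18) as printed.

```python
import math


def balancing_parameter(lams, increments, C, eta, m):
    """Pick lambda_0 from an increasing geometric grid following rule (18).

    lams       : [lambda_0^0, lambda_0^1, ...], increasing.
    increments : increments[j - 1] = ||f_{z,lambda_0^j} - f_{z,lambda_0^{j-1}}||_H
                 for j = 1, ..., len(lams) - 1.
    C, eta, m  : constant, confidence level and sample size in the bound.
    """
    best = lams[0]
    for i in range(1, len(lams)):
        admissible = all(
            increments[j - 1]
            <= 4.0 * C * math.log(4.0 / eta) / (lams[j - 1] * math.sqrt(m))
            for j in range(1, i)  # j = 1, ..., i - 1 as in (18)
        )
        if not admissible:
            break  # the checked set only grows with i, so later i fail too
        best = lams[i]
    return best
```

In practice $C$ is rarely known, so implementations of the balancing principle typically treat it as a tuning constant.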
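For further discussion see references [1,24]. Now we project $\bar{\lambda}_0$ on the discrepancy surface $\{\lambda \in \mathbb{R}^{p+1} : \|S_x f_{z,\lambda} - y\|_m = c\delta\}$, where $\|S_x f_H - y\|_m \le \delta$ and $c > 0$. The discrepancy principle $\|S_x f_{z,\lambda} - y\|_m = c\delta$ implies
\[
\|y\|_m^2 - \|S_x f_{z,\lambda}\|_m^2 - 2\lambda_0 \|f_{z,\lambda}\|_{\mathcal{H}}^2 - 2 \sum_{j=1}^{p} \lambda_j \|B_j f_{z,\lambda}\|_{\mathcal{H}}^2 = c^2 \delta^2. \tag{19}
\]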
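Equation (19) can be checked by a short computation. The sketch below assumes (this is not restated in the present section) that $f_{z,\lambda}$ minimizes $\|S_x f - y\|_m^2 + \lambda_0\|f\|_{\mathcal{H}}^2 + \sum_{j}\lambda_j\|B_j f\|_{\mathcal{H}}^2$ and therefore satisfies the normal equation in the first line, with $S_x^*$ the adjoint of $S_x$ with respect to $\langle\cdot,\cdot\rangle_m$.

```latex
\begin{align*}
\Big(S_x^*S_x + \lambda_0 I + \sum_{j=1}^{p}\lambda_j B_j^*B_j\Big) f_{z,\lambda} &= S_x^* y,\\
\langle S_x f_{z,\lambda}, y\rangle_m
  &= \|S_x f_{z,\lambda}\|_m^2 + \lambda_0\|f_{z,\lambda}\|_{\mathcal{H}}^2
     + \sum_{j=1}^{p}\lambda_j\|B_j f_{z,\lambda}\|_{\mathcal{H}}^2,\\
\|S_x f_{z,\lambda} - y\|_m^2
  &= \|y\|_m^2 - 2\langle S_x f_{z,\lambda}, y\rangle_m + \|S_x f_{z,\lambda}\|_m^2\\
  &= \|y\|_m^2 - \|S_x f_{z,\lambda}\|_m^2 - 2\lambda_0\|f_{z,\lambda}\|_{\mathcal{H}}^2
     - 2\sum_{j=1}^{p}\lambda_j\|B_j f_{z,\lambda}\|_{\mathcal{H}}^2 .
\end{align*}
```

The second line is obtained by taking the inner product of the normal equation with $f_{z,\lambda}$. Setting the left-hand side of the last identity equal to $c^2\delta^2$ gives (19).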
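It is easy to see that this equation can be rewritten in the following form, so that for a given $\bar{\lambda}_0$ the parameter $\lambda_i = \lambda_i^{k+1}$ can be updated as
\[
\lambda_i^{k+1} = \frac{2(\lambda_i^k)^2 \|B_i f_{z,\lambda^k}\|_{\mathcal{H}}^2}{\|y\|_m^2 - \|S_x f_{z,\lambda^k}\|_m^2 - 2\bar{\lambda}_0 \|f_{z,\lambda^k}\|_{\mathcal{H}}^2 - 2 \sum_{j=1, j \ne i}^{p} \lambda_j^k \|B_j f_{z,\lambda^k}\|_{\mathcal{H}}^2 - c^2 \delta^2}, \quad \text{for } 1 \le i \le p, \tag{20}
\]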
where $\lambda^k = (\bar{\lambda}_0, \lambda_1^k, \ldots, \lambda_p^k)$. Continue the process until $|\lambda_i^{k+1} - \lambda_i^k| < \varepsilon$, $1 \le i \le p$, for some $\varepsilon > 0$.

Algorithm 1:
1. For a discretized sequence, calculate $\lambda_0 = \bar{\lambda}_0$ with the criterion (18) of the balancing principle.
2. For the initial values of $\lambda_i^k$ ($1 \le i \le p$), $c$, $\delta$, start with $k = 0$.
3. Calculate $f_{z,\lambda^k}$ and update $\lambda_i = \lambda_i^{k+1}$ according to (20).
4. If the stopping criterion $|\lambda_i^{k+1} - \lambda_i^k| < \varepsilon$ ($1 \le i \le p$) is satisfied, then stop updating $\lambda_i$; otherwise keep updating $\lambda_i$ according to (20).

Proposition 4.1. Suppose the non-negative parameters $\lambda^k$ satisfy $\|S_x f_{z,\lambda^k} - y\|_m > c\delta$. Then $\lambda_i = \lambda_i^{k+1}$, given by the formula (20), is a non-negative value with $\lambda_i^{k+1} \le \lambda_i^k$.

Suppose $(\lambda^k)_{k=0}^{\infty}$ is a sequence according to (20) satisfying $\|S_x f_{z,\lambda^k} - y\|_m > c\delta$. Then from Proposition 4.1 it follows that $\lambda_i^k$ converges to some $\bar{\lambda}_i$, and under continuity of $f_{z,\lambda}$ as a function of $\lambda$ we obtain from (20) that $\|S_x f_{z,\bar{\lambda}} - y\|_m = c\delta$ for $\bar{\lambda} = (\bar{\lambda}_0, \bar{\lambda}_1, \ldots, \bar{\lambda}_p)$.

5. Multi-penalty regularization based on linear functional strategy

We construct various regularized solutions corresponding to different choices of regularization schemes, hypothesis spaces (RKHSs) and parameter choice rules. Even in the execution of parameter choice rules we generate many regularized solutions corresponding to intermediate values of the regularization parameters, and then select only one of the approximants according to the parameter choice rule. Despite the numerical expense of constructing these regularized solutions, the information hidden inside the other approximants is ignored. To overcome this situation we try to accumulate the information of the various regularized solutions by means of the linear functional strategy. Anderssen [2] proposed the linear functional strategy, and it was further developed for ill-posed inverse problems [7,21,29,44]. In the learning theory framework, Kriukova et al. [34] adapted and extended the strategy to aggregate various regularized solutions corresponding to different parameter choices for a ranking algorithm. Further, the authors used this approach in aggregating all calculated regularized approximants based on Nyström type subsampling [35]. Pereverzyev et al. [54] discussed the linear functional strategy for a multiple kernel learning algorithm, where it is used to combine the estimators corresponding to different reproducing kernel Hilbert spaces (kernels).

We consider the linear combination of the regularized solutions and try to find the combination which is closest to the target function $f_H$. Given a finite set of approximants $\{f_z^i \in \mathcal{H} : 1 \le i \le l\}$, our aim is to find the optimizer of the problem:
\[
\operatorname*{argmin}_{(c_1, \ldots, c_l) \in \mathbb{R}^l} \left\| \sum_{i=1}^{l} c_i f_z^i - f_H \right\|. \tag{21}
\]
If we minimize the functional in the RKHS-norm, then the optimization problem (21) becomes equivalent to solving the linear system
\[
Gc = g,
\]
where the matrix $G = (\langle f_z^i, f_z^j \rangle_{\mathcal{H}})_{i,j=1}^{l}$, the vector $g = (\langle f_H, f_z^i \rangle_{\mathcal{H}})_{i=1}^{l}$ and $c = (c_i)_{i=1}^{l}$. Note that G consists of known values, so it can be determined, but g is inaccessible because it involves $f_H$. So we approximate it by means of Theorems 3.2 and 3.3.
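We assume that the number of approximants (the value of l) is negligible in comparison to the sample size m. We approximate $g = (\langle f_H, f_z^i \rangle_{\mathcal{H}})_{i=1}^{l}$ by the vector $\bar{g} = (\langle f_{z,\lambda}, f_z^i \rangle_{\mathcal{H}})_{i=1}^{l}$. Under the assumption of Theorem 3.2 for the class of probability measures $\mathcal{P}_{\phi}$,
\[
\operatorname{Prob}_{z \in Z^m} \left\{ \|g - \bar{g}\|_{\mathbb{R}^l} \le \tilde{C} \left( \phi(\Theta^{-1}(m^{-1/2})) \log\left(\frac{4}{\eta}\right) \right) \right\} \ge 1 - \eta \tag{22}
\]
and under the assumption of Theorem 3.3 for the class of probability measures $\mathcal{P}_{\phi,b}$,
\[
\operatorname{Prob}_{z \in Z^m} \left\{ \|g - \bar{g}\|_{\mathbb{R}^l} \le \tilde{C}' \left( \phi(\Psi^{-1}(m^{-1/2})) \log\left(\frac{4}{\eta}\right) \right) \right\} \ge 1 - \eta. \tag{23}
\]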
Now we combine the regularized solutions by means of the linear functional strategy in $\mathcal{H}_K$ (LFS-$\mathcal{H}_K$). If $G^{-1}$ exists, then we consider
\[
f_z = \sum_{i=1}^{l} \bar{c}_i f_z^i, \quad \text{where } \bar{c} = (\bar{c}_i)_{i=1}^{l} = G^{-1} \bar{g}.
\]
Theorem 5.1. For the class of probability measures $\mathcal{P}_{\phi}$ under the assumptions of Theorem 3.2, the convergence rate of the constructed solution $f_z$ based on LFS-$\mathcal{H}_K$ can be described as:
\[
\|f_z - f_H\|_{\mathcal{H}} = \min_{c \in \mathbb{R}^l} \left\| \sum_{i=1}^{l} c_i f_z^i - f_H \right\|_{\mathcal{H}} + O\left( \phi(\Theta^{-1}(m^{-1/2})) \log\left(\frac{4}{\eta}\right) \right)
\]
holds with the probability $1 - \eta$.

Theorem 5.2. For the class of probability measures $\mathcal{P}_{\phi,b}$ under the assumptions of Theorem 3.3, the convergence rate of the constructed solution $f_z$ based on LFS-$\mathcal{H}_K$ can be described as:
\[
\|f_z - f_H\|_{\mathcal{H}} = \min_{c \in \mathbb{R}^l} \left\| \sum_{i=1}^{l} c_i f_z^i - f_H \right\|_{\mathcal{H}} + O\left( \phi(\Psi^{-1}(m^{-1/2})) \log\left(\frac{4}{\eta}\right) \right)
\]
holds with the probability $1 - \eta$.

Suppose we optimize the functional (21) in the $L^2_{\rho_X}$-norm. Then the minimization problem (21) becomes equivalent to the problem of finding c such that
\[
Hc = h,
\]
where $H = (\langle f_z^i, f_z^j \rangle_\rho)_{i,j=1}^{l}$ and $h = (\langle f_H, f_z^i \rangle_\rho)_{i=1}^{l}$. In order to find c we need H and h. But since the probability distribution $\rho$ is inaccessible, we cannot compute them directly. Therefore we approximate these quantities with the help of the following expressions:
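\[
\langle f_H, f_z^i \rangle_\rho = \frac{1}{m} \sum_{s=1}^{m} y_s f_z^i(x_s) + \langle S_x^* S_x f_H - S_x^* y, f_z^i \rangle_{\mathcal{H}} + \langle L_K f_H - S_x^* S_x f_H, f_z^i \rangle_{\mathcal{H}}
\]
and
\[
\langle f_z^i, f_z^j \rangle_\rho = \frac{1}{n} \sum_{s=1}^{n} f_z^i(x_s) f_z^j(x_s) + \left\langle \left( L_K - S_{x'}^* S_{x'} \right) f_z^i, f_z^j \right\rangle_{\mathcal{H}},
\]
where $x = (x_1, \ldots, x_m)$ are the labeled inputs, $x' = (x_1, \ldots, x_n)$ are all inputs (labeled and unlabeled) and $y = (y_1, \ldots, y_m)$ are the outputs. The estimators $f_z^i = f_{z,\lambda^i}$ corresponding to different parameter choices can be estimated as
\[
\|f_{z,\lambda^i}\|_{\mathcal{H}} \le \|\Delta_S\|_{\mathcal{L}(\mathcal{H})} \left\| S_x^* y - S_x^* S_x f_H \right\|_{\mathcal{H}} + \left\| \Delta_S S_x^* S_x \right\|_{\mathcal{L}(\mathcal{H})} \|f_H\|_{\mathcal{H}} \le \frac{1}{\lambda_0^i} \left\| S_x^* y - S_x^* S_x f_H \right\|_{\mathcal{H}} + \|f_H\|_{\mathcal{H}}
\]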
and the estimators $f_z^i = f_{z,\lambda}^{K_i}$ corresponding to different RKHSs (kernels $K_i$) can be estimated as
\[
\|f_{z,\lambda}^{K_i}\|_{\mathcal{H}} \le \frac{1}{\lambda_0} \left\| \frac{1}{m} \sum_{s=1}^{m} K_i(\cdot, x_s)(y_s - f_{\mathcal{H}_{K_i}}(x_s)) \right\|_{\mathcal{H}} + \|f_{\mathcal{H}_{K_i}}\|_{\mathcal{H}},
\]
where $f_{\mathcal{H}_{K_i}}$ is the minimizer of the generalization error (11) over the RKHS $\mathcal{H}_{K_i}$. Hence under the condition (15), using Propositions 21 and 22 of [9], we can control the estimators $f_z^i$ and conclude the following result.

Theorem 5.3. Let z be i.i.d. samples drawn according to the probability measure $\rho$ with the hypothesis (12). Then for sufficiently large sample size m according to (15), with the confidence $1 - \eta$, we have
\[
\langle f_H, f_z^i \rangle_\rho \le \frac{1}{m} \sum_{s=1}^{m} y_s f_z^i(x_s) + C_1 \left( m^{-1/2} \log\left(\frac{4}{\eta}\right) \right)
\]
and
\[
\langle f_z^i, f_z^j \rangle_\rho \le \frac{1}{n} \sum_{s=1}^{n} f_z^i(x_s) f_z^j(x_s) + C_2 \left( n^{-1/2} \log\left(\frac{4}{\eta}\right) \right).
\]
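Under the assumption that the number of approximants l is negligible in comparison to the sample size m, we have with the confidence $1 - \eta$,
\[
\|H - \bar{H}\|_{\mathbb{R}^l \to \mathbb{R}^l} \le C_1' \left( n^{-1/2} \log\left(\frac{4}{\eta}\right) \right) \quad \text{and} \quad \|h - \bar{h}\|_{\mathbb{R}^l} \le C_2' \left( m^{-1/2} \log\left(\frac{4}{\eta}\right) \right),
\]
where $\bar{H} = \left( \frac{1}{n} \sum_{s=1}^{n} f_z^i(x_s) f_z^j(x_s) \right)_{i,j=1}^{l}$ and $\bar{h} = \left( \frac{1}{m} \sum_{s=1}^{m} y_s f_z^i(x_s) \right)_{i=1}^{l}$.

If $\bar{H}^{-1}$ exists, then for sufficiently large n we can assume that
\[
\|H - \bar{H}\|_{\mathbb{R}^l \to \mathbb{R}^l} < 1 / \|\bar{H}^{-1}\|_{\mathbb{R}^l \to \mathbb{R}^l}.
\]
Using the Banach theorem for inverse operators ([33], V. 4.5) we get
\[
\|H^{-1}\|_{\mathbb{R}^l \to \mathbb{R}^l} \le \frac{\|\bar{H}^{-1}\|_{\mathbb{R}^l \to \mathbb{R}^l}}{1 - \|\bar{H}^{-1}\|_{\mathbb{R}^l \to \mathbb{R}^l} \|H - \bar{H}\|_{\mathbb{R}^l \to \mathbb{R}^l}} = O(1). \tag{24}
\]
For $c = H^{-1} h$ and $\bar{c} = \bar{H}^{-1} \bar{h}$ we have
\[
c - \bar{c} = H^{-1}(h - \bar{h}) + H^{-1}(\bar{H} - H)\bar{c}.
\]
It follows that
\[
\|c - \bar{c}\|_{\mathbb{R}^l} = O\left( m^{-1/2} \log\left(\frac{4}{\eta}\right) \right).
\]
It is clear that for the constructed solution
\[
f_z = \sum_{i=1}^{l} \bar{c}_i f_z^i, \quad \bar{c} = (\bar{c}_i)_{i=1}^{l}, \tag{25}
\]
based on the linear functional strategy in $L^2_{\rho_X}$ (LFS-$L^2_{\rho_X}$), we get
\[
\|f_z - f_H\|_\rho \le \min_{c \in \mathbb{R}^l} \left\| \sum_{i=1}^{l} c_i f_z^i - f_H \right\|_\rho + l \|c - \bar{c}\|_{\mathbb{R}^l} \max_{1 \le i \le l} \|f_z^i\|_\rho. \tag{26}
\]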
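For completeness, the two steps used above can be written out. This is a routine verification under the definitions $Hc = h$ and $\bar{H}\bar{c} = \bar{h}$, where $c = H^{-1}h$ is the minimizer of (21) in the $L^2_{\rho_X}$-norm; it is a sketch added for the reader and not part of the original argument.

```latex
\begin{align*}
H^{-1}(h-\bar h) + H^{-1}(\bar H - H)\bar c
  &= c - H^{-1}\bar h + H^{-1}\bar H\bar c - \bar c
   = c - \bar c, \qquad \text{since } \bar H\bar c = \bar h,\\
\Big\|\sum_{i=1}^{l}\bar c_i f_z^i - f_H\Big\|_\rho
  &\le \Big\|\sum_{i=1}^{l} c_i f_z^i - f_H\Big\|_\rho
     + \sum_{i=1}^{l}|c_i-\bar c_i|\,\|f_z^i\|_\rho\\
  &\le \Big\|\sum_{i=1}^{l} c_i f_z^i - f_H\Big\|_\rho
     + \sqrt{l}\,\|c-\bar c\|_{\mathbb{R}^l}\max_{1\le i\le l}\|f_z^i\|_\rho .
\end{align*}
```

The last step uses the Cauchy–Schwarz inequality, and since $\sqrt{l} \le l$ this recovers the bound (26).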
Theorem 5.4. Under the assumptions of Theorem 5.3, the convergence rate of the constructed solution $f_z$ based on LFS-$L^2_{\rho_X}$ can be described as:
\[
\|f_z - f_H\|_\rho \le \min_{c \in \mathbb{R}^l} \left\| \sum_{i=1}^{l} c_i f_z^i - f_H \right\|_\rho + O\left( m^{-1/2} \log\left(\frac{4}{\eta}\right) \right)
\]
holds with the probability $1 - \eta$.

We assumed the existence of $G^{-1}$ and $\bar{H}^{-1}$ in the analysis of the aggregation approaches based on the linear functional strategy. Here we address the invertibility of G and $\bar{H}$ in Theorem 5.6 using the following theorem (Theorem 2 [49]).

Theorem 5.5. The minimization problem (3) has a unique solution, given by $f_{z,\lambda} = \sum_{s=1}^{n} K_{x_s} a_s$ for some vector $a = (a_1, \ldots, a_n)^T \in Y^n$.

Using the representation theorem, we obtain the expressions of G and $\bar{H}$ in terms of the Gramian matrix $K = (K(x_i, x_j))_{i,j=1}^{n}$ and the coefficients $a^i = (a_1^i, \ldots, a_n^i)^T$ of the regularized solutions $f_{z,\lambda^i} = \sum_{s=1}^{n} K_{x_s} a_s^i$ ($1 \le i \le l$), i.e., $G = A^T K A$ and $\bar{H} = A^T K^2 A$, where $A = [a^1, \ldots, a^l]$. Then it is easy to show the following result.

Theorem 5.6. The regularized solutions $f_{z,\lambda^i} = \sum_{s=1}^{n} K_{x_s} a_s^i$ for $a^i = (a_1^i, \ldots, a_n^i)^T \in Y^n$ ($1 \le i \le l$) are linearly independent if and only if the $a^i$'s are linearly independent. Moreover, for a positive definite kernel K, the matrices G and $\bar{H}$ are invertible if the regularized solutions $f_{z,\lambda^i}$ are linearly independent.
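The aggregation step of LFS-$L^2_{\rho_X}$ can be sketched in a few lines for the scalar-valued case. The code below is an illustration under stated assumptions, not the authors' implementation: the approximants are given through their kernel-expansion coefficients $a^i$, the labeled inputs come first in the Gram matrix, and $\bar{H}$, $\bar{h}$ are formed with the $1/n$ and $1/m$ normalizations of their definitions above. The helper names `solve_linear` and `lfs_l2_weights` are hypothetical.

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def lfs_l2_weights(K, coeffs, y):
    """Aggregation weights c_bar solving H_bar c = h_bar.

    K      : n x n Gram matrix (K(x_s, x_t)) over all inputs, labeled first.
    coeffs : list of l coefficient vectors a^i, f^i(x_s) = sum_t K[s][t] a^i[t].
    y      : labels of the first m inputs.
    """
    n, m = len(K), len(y)
    # values[i][s] = f^i(x_s) at every (labeled and unlabeled) input
    values = [[sum(K[s][t] * a[t] for t in range(n)) for s in range(n)]
              for a in coeffs]
    H_bar = [[sum(fi[s] * fj[s] for s in range(n)) / n for fj in values]
             for fi in values]
    h_bar = [sum(y[s] * fi[s] for s in range(m)) / m for fi in values]
    return solve_linear(H_bar, h_bar)
```

By Theorem 5.6, the system is solvable whenever the coefficient vectors are linearly independent and the kernel is positive definite; for nearly dependent approximants a regularized or least-squares solve may be preferable.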
6. Numerical realization

In our experiments, we demonstrate the performance of multi-penalty regularization based on linear functional strategy for both scalar-valued functions and multi-task learning problems. We consider the well-known academic example of an extrapolating regression problem in the scalar-valued function setting. Then we present an extensive empirical analysis of the multi-view manifold regularization scheme based on LFS-$L^2_{\rho_X}$ for the challenging multi-class image classification and species recognition with attributes.

6.1. Linear functional strategy on scalar-valued functions

In this section, we compare the performance of two-parameter choice rules with the proposed aggregation approaches based on linear functional strategy for multi-penalty regularization schemes, and describe the efficiency of the regularization algorithm based on linear functional strategy statistically using the relative error measure. To analyze the multi-penalty regularization based on linear functional strategy in the scalar-valued function setting, we consider the well-known academic example [24,40,47],
\[
f_\rho(x) = \frac{1}{10} \left\{ x + 2 \left( e^{-8\left(\frac{4\pi}{3} - x\right)^2} - e^{-8\left(\frac{\pi}{2} - x\right)^2} - e^{-8\left(\frac{3\pi}{2} - x\right)^2} \right) \right\}, \quad x \in [0, 2\pi], \tag{27}
\]
which belongs to the reproducing kernel Hilbert space $\mathcal{H}$ corresponding to the kernel $K(x, y) = xy + \exp(-8(x - y)^2)$. Consider a training set $z_m = \{(x_i, y_i)\}_{i=1}^{m}$, where corresponding to the empirical inputs $x = (x_i)_{i=1}^{m} = \{\frac{\pi}{10}(i - 1)\}_{i=1}^{m}$ the outputs are generated as
\[
y_i = f_\rho(x_i) + \delta \xi_i, \quad i = 1, \ldots, m, \tag{28}
\]
and $\xi_i$ follows the uniform distribution over $[-1, 1]$ with $\delta = 0.02$.
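The data of this example can be generated as in the following minimal sketch, which transcribes (27), the kernel $K$ and the sampling rule (28). The function names and the fixed random seed are illustrative choices and are not taken from the paper.

```python
import math
import random


def f_rho(x):
    """Target function (27) on [0, 2*pi]."""
    bumps = (math.exp(-8.0 * (4.0 * math.pi / 3.0 - x) ** 2)
             - math.exp(-8.0 * (math.pi / 2.0 - x) ** 2)
             - math.exp(-8.0 * (3.0 * math.pi / 2.0 - x) ** 2))
    return 0.1 * (x + 2.0 * bumps)


def kernel(x, y):
    """K(x, y) = x*y + exp(-8 (x - y)^2)."""
    return x * y + math.exp(-8.0 * (x - y) ** 2)


def make_samples(m, delta=0.02, seed=0):
    """Inputs x_i = (pi/10)(i - 1), i = 1..m, and noisy outputs as in (28)."""
    rng = random.Random(seed)  # seed fixed only for reproducibility
    xs = [math.pi / 10.0 * i for i in range(m)]
    ys = [f_rho(x) + delta * rng.uniform(-1.0, 1.0) for x in xs]
    return xs, ys
```

For instance, `make_samples(15)` produces a training set of the size used in the experiment described below, and `math.pi / 10.0 * 15` gives the additional input $x_{16}$.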
Fig. 1. The constructed solution fz2 (blue line) based on LFS-Lρ2X , the ideal estimator fρ (red line), the approximants (green lines) and the empirical data z15 with additional input x16 (red stars). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
We consider the semi-supervised problem of labeled and unlabeled data points, which can be viewed as a special case of the problem proposed in [11],
\[
\operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \lambda_0 \|f\|_{\mathcal{H}}^2 + \lambda_1 \|S_{x'} f\|_n^2 \right\},
\]
where $\{(x_i, y_i) \in X \times Y : 1 \le i \le m\} \cup \{x_i \in X : m < i \le n\}$ is the given set of labeled and unlabeled data and $x' = (x_i : 1 \le i \le n)$.

In our experiments, we construct the four approximants $f_{z,\lambda^i}$ ($i = 1, 2, 3, 4$) for the samples $\{(x_i, y_i)\}_{i=1}^{15}$ with the unlabeled point $\{x_{16}\}$, corresponding to the regularization parameters $\lambda^1 = (0.5, 1)$, $\lambda^2 = (0.1, 0.1)$, $\lambda^3 = (0.01, 0.01)$ and $\lambda^4 = (0.02, 1)$. We construct the estimators $f_z^1$ and $f_z^2$ from the approximants $\{f_{z,\lambda^i}\}_{i=1}^{4}$ of the multi-penalty regularization based on LFS-$\mathcal{H}_K$ and LFS-$L^2_{\rho_X}$, respectively. The estimators with fixed regularization parameters and the estimator based on LFS-$L^2_{\rho_X}$ are shown in Fig. 1.

In general, we apply the parameter choice rules to obtain the regularized solutions. Here we compare the regularized solutions corresponding to the discrepancy principle [38], the balanced-discrepancy principle [1] and the proposed aggregation approaches based on linear functional strategy for multi-penalty regularization in learning theory. In the balanced-discrepancy principle, we obtain the regularized solution $f_{z,\lambda}^{(1)}$ by first choosing $\lambda_1$ in accordance with the balancing principle [24] with $\lambda_2 = 0$ and then finding the corresponding value of $\lambda_2$ on the discrepancy curve. We get the parameter values $\lambda = (\lambda_1, \lambda_2) = (6 \times 10^{-4}, 3.5114 \times 10^{-4})$ according to the balanced-discrepancy principle. We construct the multi-penalty regularizer $f_{z,\lambda}^{(2)}$ by interchanging the roles of $\lambda_1$ and $\lambda_2$ in the balanced-discrepancy principle, with parameters $\lambda = (6.2083 \times 10^{-4}, 3.3 \times 10^{-4})$. The estimator $f_{z,\lambda}^{(3)}$ is constructed using the discrepancy parameter choice rule proposed by Shuai Lu et al. [38] with parameters $\lambda = (6.0294 \times 10^{-4}, 3.514 \times 10^{-4})$.

To demonstrate the reliability of the aggregation approaches based on linear functional strategy, we generate samples 100 times in accordance with Eq. (28) for $\delta = 0.02$. In our experiment, we compare the performance of the multi-parameter choice rules against the proposed aggregation approaches based on linear functional strategy using the relative error measure $\frac{\|f - f_\rho\|}{\|f\|}$ with $f = f_z^1$, $f = f_z^2$, $f = f_{z,\lambda}^{(1)}$, $f = f_{z,\lambda}^{(2)}$ and $f = f_{z,\lambda}^{(3)}$, which is listed in Tables 1, 2 & 3. We illustrate the error estimates for different
Table 1
Statistical performance interpretation of the estimators in HK-norm.

Regularized solutions   Mean relative error   Median relative error   Standard deviation of relative error
f_z^1                   0.0322                0.0324                  0.0126
f_z^2                   0.0338                0.0345                  0.0140
f_{z,λ}^(1)             0.0322                0.0324                  0.0126
f_{z,λ}^(2)             0.0363                0.0360                  0.0128
f_{z,λ}^(3)             0.0327                0.0353                  0.0131
Table 2
Statistical performance interpretation of the estimators in empirical L2ρX-norm.

Regularized solutions   Mean relative error   Median relative error   Standard deviation of relative error
f_z^1                   0.0043                0.0043                  0.0012
f_z^2                   0.0042                0.0037                  0.0013
f_{z,λ}^(1)             0.0043                0.0043                  0.0012
f_{z,λ}^(2)             0.0037                0.0036                  0.0010
f_{z,λ}^(3)             0.0043                0.0037                  0.0013
Table 3
Statistical performance interpretation of the estimators in sup-norm.

Regularized solutions   Mean relative error   Median relative error   Standard deviation of relative error
f_z^1                   0.0650                0.0646                  0.0089
f_z^2                   0.0637                0.0603                  0.0096
f_{z,λ}^(1)             0.0648                0.0644                  0.0089
f_{z,λ}^(2)             0.0603                0.0587                  0.0079
f_{z,λ}^(3)             0.0648                0.0606                  0.0094
multi-penalty regularizers in the $\mathcal{H}_K$-norm, the $\|\cdot\|_m$-empirical norm and the sup-norm in Fig. 2(a), (b) & (c), respectively. We observe from Fig. 1 that the aggregation approach provides a good estimator based on different regularized solutions corresponding to fixed parameter choices. From the statistical analysis, we also observe that the multi-penalty estimators based on linear functional strategy and those corresponding to the various parameter choice rules, which provide the parameters $\lambda = (\lambda_1, \lambda_2)$ on the discrepancy curve, perform similarly.

6.2. Multi-task learning via linear functional strategy

In the numerical experiments, we consider the multi-view manifold regularization scheme of Minh et al. [49] to construct the estimator based on the different views of the input data. The views are the different features extracted from the input examples for the given data sets. Suppose the function $f = (f^1, \ldots, f^v) \in \mathcal{H} = \mathcal{H}_{K^1} \times \cdots \times \mathcal{H}_{K^v}$ is given corresponding to the v views of the inputs, and the combination operator $C : Y^v \to Y$ is defined by $Cf(x) = \sum_{i=1}^{v} c_i f^i(x)$, for $c = (c_1, \ldots, c_v) \in \mathbb{R}^v$. Then we consider the optimization problem associated to the labeled data $\{(x_i, y_i)\}_{i=1}^{m}$ and unlabeled data $\{x_i\}_{i=m+1}^{n}$:
\[
f_{z,\lambda} = \operatorname*{argmin}_{f \in \mathcal{H},\, c \in S_\alpha^{v-1}} \frac{1}{m} \sum_{i=1}^{m} \|y_i - Cf(x_i)\|_Y^2 + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_B \|(S_{x'}^* M_B S_{x'})^{1/2} f\|_{\mathcal{H}}^2 + \lambda_W \|(S_{x'}^* M_W S_{x'})^{1/2} f\|_{\mathcal{H}}^2, \tag{29}
\]
(a) Relative error in H-norm.
(b) Relative error in emp. L2 -norm × 10−3 .
(c) Relative error in sup-norm.
Fig. 2. Figures show relative errors in ∥ · ∥H -norm (a), ∥ · ∥m -empirical norm (b) and infinity norm (c) corresponding to 100 test problems with samples according to (28) with δ = 0.02 for all estimators.
where the regularization parameters $\lambda_A, \lambda_B, \lambda_W \ge 0$, $x' = (x_i)_{i=1}^{n}$, and $M_B = I_n \otimes (M_v \otimes I_Y)$ and $M_W = L \otimes I_Y$ are symmetric, positive operators. Here $M_v = v I_v - 1_v 1_v^T$, $I_v$ is the identity of size $v \times v$, $1_v$ is a vector of size $v \times 1$ with all entries equal to one, $\otimes$ is the Kronecker product and L is a graph Laplacian. The graph Laplacian L is the block matrix of size $n \times n$, with block $(i, j)$ being the $v \times v$ diagonal matrix given by
\[
L_{ij} = \operatorname{diag}(L_{ij}^1, \ldots, L_{ij}^v), \tag{30}
\]
where the scalar graph Laplacian $L^i$ is induced by the symmetric, nonnegative weight matrix $W^i$. The first penalty controls the complexity of the function in the ambient space, the second penalty measures the consistency of the component functions across different views, and the third penalty measures the smoothness of the component functions in their corresponding views.

In order to optimize the functional simultaneously for f and c, we first fix c on the sphere $S_\alpha^{v-1} = \{x \in \mathbb{R}^v : \|x\| = \alpha\}$ and minimize for the estimator $f \in \mathcal{H}$. Then we fix the function f and optimize for the combination operator c. The solution of the regularization scheme (29) can be evaluated according to Theorem 10 [49].

We demonstrate the performance of LFS-$L^2_{\rho_X}$ on the considered multi-view manifold regularization problem (29) for the challenging task of multi-class image classification. We use the well-known Caltech-101 data set provided in Fei-Fei et al. [27] for object recognition. The data set consists of P = 102 classes of objects and about 40 to 800 images per category. For our experiments, we randomly selected 15 images from each class. We used the four extracted features PHOW gray, PHOW color, geometric blur (GB) and self-similarity (SSIM) on three levels of the spatial pyramid considered in Vedaldi et al. [61]. We consider the $\chi^2$-kernels provided in [61] corresponding to each view $x^i$ of the input $x = (x^1, \ldots, x^v)$. The operator-valued kernel for the manifold learning scheme (29) is given by $K(x, t) = G(x, t) \otimes I_P$ for
\[
G(x, t) = \sum_{i=1}^{v} K^i(x^i, t^i) e_i e_i^T, \tag{31}
\]
where $K^i$ is the scalar-valued kernel defined on the ith view and $e_i = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^v$ is the ith coordinate vector. We consider three splits of the given data and the results are reported in terms of
accuracy by averaging over all three splits. The test set contains 15 objects per class in all experiments. For the output values, we set $y = (-1, \ldots, 1, \ldots, -1) \in Y = \mathbb{R}^P$ if x belongs to the kth class, i.e., the entry in the kth place is 1 and all other entries are −1.
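The output coding and the final classification rule ("return the index of the largest entry") can be illustrated as follows. This is a small sketch with hypothetical function names, and it uses zero-based class indices, whereas the text counts classes from 1.

```python
def encode_label(k, num_classes):
    """Return the target vector with +1 at position k and -1 elsewhere."""
    return [1.0 if j == k else -1.0 for j in range(num_classes)]


def predict_class(scores):
    """Return the index of the largest entry of a score vector."""
    return max(range(len(scores)), key=lambda j: scores[j])
```

With this coding, `predict_class(encode_label(k, P))` returns `k`, so a perfectly fitted output vector is decoded to the correct class.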
Algorithm 2: Semi-supervised least-square regression and classification based on LFS-$L^2_{\rho_X}$

This algorithm calculates the set of regularized solutions $f_{z,\lambda}^{K_r}$ corresponding to the views $(v_1^r, \ldots, v_s^r)$ for $1 \le r \le l$ according to Theorem 10 [49]. Then it calculates the aggregated solution $f_z$ from the regularized solutions $f_{z,\lambda}^{K_r}$ ($1 \le r \le l$) based on the linear functional strategy in $L^2_{\rho_X}$.

Input:
– Training data $\{(x_i, y_i)\}_{i=1}^{m} \cup \{x_i\}_{i=m+1}^{n}$, with m labeled and n − m unlabeled examples.
– Testing data $t_i$.

Parameters:
– The regularization parameters $\lambda_A$, $\lambda_B$, $\lambda_W$.
– The weight vector c.
– The number of classes P.
– Scalar-valued kernels $K^i$ corresponding to the ith view.

Procedure:
– To calculate the set of regularized solutions $f_{z,\lambda}^{K_r}$ ($1 \le r \le l$), compute kernel matrices $G_r[x'] = (G_r(x_i, x_j))_{i,j=1}^{n}$ for $x' = \{x_1, \ldots, x_n\}$ corresponding to views $(v_1^r, \ldots, v_s^r)$ according to (31).
– Compute the graph Laplacian L according to (30).
– Compute $B_r = \left( (J_m^n \otimes cc^T) + m\lambda_B (I_n \otimes M_v) + m\lambda_W L \right) G_r[x']$, where $J_m^n : \mathbb{R}^n \to \mathbb{R}^n$ is a diagonal matrix of size $n \times n$, with the first m entries on the main diagonal being 1 and the rest being 0.
– Compute $C = c^T \otimes I_P$ and $\mathbf{C}^* = I_{n \times m} \otimes C^*$ for $I_{n \times m} = [I_m, 0_{m \times (n-m)}]^T$.
– Compute $Y_C$ such that $\mathbf{C}^* y = \operatorname{vec}(Y_C^T)$, where the vectorization of an $n \times P$ matrix A, denoted $\operatorname{vec}(A)$, is the $nP \times 1$ column vector obtained by stacking the columns of the matrix A on top of one another.
– Solve the matrix equations $B_r A_r + m\lambda_A A_r = Y_C$ for $A_r$ ($1 \le r \le l$).
– Compute kernel matrices $G_r[x, x'] = (G_r(x_i, x_j))_{ij}$ for $1 \le i \le m$ and $1 \le j \le n$.
– Compute $\bar{H}_{rq} = \frac{1}{n} \operatorname{vec}\left( A_r^T G_r[x']^T (I_{n \times m} \otimes c) \right)^T \operatorname{vec}\left( A_q^T G_q[x']^T (I_{n \times m} \otimes c) \right)$ and $\bar{h}_r = \frac{1}{m} y^T \operatorname{vec}\left( A_r^T G_r[x, x']^T (I_m \otimes c) \right)$ for $1 \le r, q \le l$.
– Compute $\bar{c} = (\bar{c}_1, \ldots, \bar{c}_l)^T = \bar{H}^{-1} \bar{h}$ for $\bar{H} = (\bar{H}_{rq})_{r,q=1}^{l}$ and $\bar{h} = (\bar{h}_r)_{r=1}^{l}$.
– Compute kernel matrices $G_r[t_i, x]$ between $t_i$ and x.
– Compute the value of the estimator $f_z$ based on linear functional strategy at $t_i$, i.e.,
\[
f_z(t_i) = \sum_{r=1}^{l} \bar{c}_r f_{z,\lambda}^{K_r}(t_i) = \sum_{r=1}^{l} \bar{c}_r \operatorname{vec}\left( A_r^T G_r[t_i, x]^T \right).
\]

Output: Multi-class classification: return the index of $\max(Cf_z(t_i))$.
First of all, we take the same choice of regularization parameters $\lambda_A = 10^{-5}$, $\lambda_B = 10^{-6}$, $\lambda_W = 10^{-6}$ as considered in Minh et al. [49] to see the performance of the least square manifold learning scheme (29) based on linear functional strategy. In supervised learning, we take the number of labeled data per class l ∈ {1, 5, 10, 15}. In semi-supervised learning, we take the number of unlabeled data u = 5 while increasing the number of labeled data per class l ∈ {1, 5, 10}. In the results of Tables 4, 5 & 6, we use this choice of regularization parameters. We have chosen the uniform combination operator $c = \frac{1}{v}(1, \ldots, 1)^T \in \mathbb{R}^v$ in the results of Tables 4, 5 & 7. LFS-ℓ2 represents the estimator based on LFS-$L^2_{\rho_X}$ using the single-view estimators for the lower level of views, i.e., PHOW color and gray ℓ2, SSIM ℓ2, GB. Similarly, LFS-ℓ0 and LFS-ℓ1 represent the estimators based on LFS-$L^2_{\rho_X}$ using the upper and middle levels of the features' spatial pyramid. LFS-ALL is the estimator based on LFS-$L^2_{\rho_X}$ using all
Table 4
Results of single-view learning on Caltech-101 using each feature and based on LFS-L2ρX.

                 Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
Estimators       l=1,u=5   l=5,u=5   l=10,u=5  l=1,u=0   l=5,u=0   l=10,u=0  l=15,u=0
PHOW color ℓ0    13.66%    33.14%    39.72%    13.51%    33.16%    39.76%    41.87%
PHOW color ℓ1    17.10%    42.03%    48.87%    16.97%    42.14%    48.87%    51.81%
PHOW color ℓ2    18.71%    45.86%    52.66%    18.65%    45.86%    52.75%    55.49%
PHOW gray ℓ0     20.31%    45.38%    52.18%    20.44%    45.42%    52.22%    54.38%
PHOW gray ℓ1     24.53%    54.86%    61.31%    24.47%    54.88%    61.37%    63.70%
PHOW gray ℓ2     25.64%    56.75%    63.90%    25.73%    56.71%    63.97%    66.27%
SSIM ℓ0          15.27%    35.27%    41.87%    15.49%    35.38%    41.94%    45.10%
SSIM ℓ1          20.83%    45.12%    52.18%    20.74%    45.23%    52.27%    55.14%
SSIM ℓ2          22.64%    48.47%    55.19%    22.68%    48.47%    55.27%    57.58%
GB               25.01%    44.49%    40.44%    24.92%    45.97%    21.35%    29.24%

LFS-ℓ0           26.67%    49.78%    58.82%    26.51%    48.95%    52.83%    53.03%
LFS-ℓ1           31.42%    49.87%    67.04%    30.68%    58.56%    62.88%    64.95%
LFS-ℓ2           32.09%    55.64%    66.43%    32.00%    57.43%    60.37%    63.77%
LFS-ALL          29.46%    52.00%    58.45%    29.04%    52.79%    58.13%    61.44%
Table 5
Results of multi-view learning on Caltech-101 using features on each level of the spatial pyramid and based on LFS-L2ρX.

                 Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
Estimators       l=1,u=5   l=5,u=5   l=10,u=5  l=1,u=0   l=5,u=0   l=10,u=0  l=15,u=0
MVL-ℓ0           27.52%    57.67%    64.36%    26.23%    57.76%    64.40%    66.47%
MVL-ℓ1           31.13%    63.09%    69.54%    29.98%    62.79%    69.63%    71.72%
MVL-ℓ2           32.94%    64.20%    70.98%    31.18%    63.97%    71.02%    73.33%
MVL-ALL          30.41%    61.46%    68.00%    28.45%    61.37%    68.08%    70.24%
LFS-MVL          32.77%    64.84%    71.44%    31.26%    64.60%    71.68%    73.49%
Table 6
Results of multi-view learning on Caltech-101 using features on each level of the spatial pyramid and based on LFS-L2ρX with optimal choice of combination operator.

                 Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
Estimators       l=1,u=5   l=5,u=5   l=10,u=5  l=1,u=0   l=5,u=0   l=10,u=0  l=15,u=0
MVL-ℓ0-optC      29.78%    60.33%    65.90%    29.61%    60.48%    66.62%    66.49%
MVL-ℓ1-optC      32.90%    64.62%    70.39%    31.68%    64.60%    70.92%    72.29%
MVL-ℓ2-optC      33.49%    65.16%    71.20%    32.07%    65.29%    71.44%    73.64%
LFS-MVL-optC     33.59%    66.34%    72.22%    32.29%    66.14%    72.31%    74.10%
the single-view estimators. MVL-ℓ2 is the multi-view learning estimator using the views PHOW color and gray ℓ2, SSIM ℓ2, GB. Similar notations follow for MVL-ℓ0 and MVL-ℓ1. MVL-ALL is the multi-view learning estimator constructed using all 10 views. LFS-MVL represents the estimator based on LFS-$L^2_{\rho_X}$ on the regularized solutions MVL-ℓ0, MVL-ℓ1 and MVL-ℓ2.

In Table 4, we observe that the aggregation approach based on linear functional strategy improves the accuracy of classification for single-view estimators in both the supervised and semi-supervised settings. In Table 5, we present the performance of multi-view estimators based on the different features of the spatial pyramid. Tables 4 and 5 demonstrate that multi-view learning performs significantly better than single-view learning.

In Table 6, we go one step further by choosing the optimal combination operator c. We optimize the combination operator over the sphere $S_\alpha^{v-1}$ for the fixed function f. To obtain the optimal weight vector c, we created a validation set by selecting 5 examples for each class from the training set. The validation set is used to determine the best value of c found over all the iterations using different initializations. We iterate 25 times in the optimization procedure
Table 7
Results of multi-view learning on Caltech-101 using features on each level of the spatial pyramid and based on LFS-$L^2_{\rho_X}$ with the balanced-discrepancy principle parameter choice.

Estimators   Setting      Acc.      λ_A               λ_B               λ_W
MVL-ℓ0       l=1, u=5     25.49%    5.20 × 10^{-6}    1.34 × 10^{-8}    5.83 × 10^{-8}
MVL-ℓ0       l=5, u=5     57.91%    5.20 × 10^{-6}    1.60 × 10^{-7}    3.45 × 10^{-7}
MVL-ℓ0       l=10, u=5    64.55%    5.20 × 10^{-6}    2.37 × 10^{-7}    5.46 × 10^{-7}
MVL-ℓ1       l=1, u=5     29.43%    5.20 × 10^{-6}    7.13 × 10^{-9}    5.22 × 10^{-8}
MVL-ℓ1       l=5, u=5     62.70%    5.20 × 10^{-6}    1.25 × 10^{-7}    3.23 × 10^{-7}
MVL-ℓ1       l=10, u=5    69.72%    5.20 × 10^{-6}    2.54 × 10^{-7}    5.31 × 10^{-7}
MVL-ℓ2       l=1, u=5     30.92%    5.20 × 10^{-6}    6.22 × 10^{-9}    5.09 × 10^{-8}
MVL-ℓ2       l=5, u=5     64.23%    5.20 × 10^{-6}    1.36 × 10^{-7}    3.13 × 10^{-7}
MVL-ℓ2       l=10, u=5    71.13%    5.20 × 10^{-6}    3.85 × 10^{-7}    5.49 × 10^{-7}
LFS-MVL      l=1, u=5     30.13%    -                 -                 -
LFS-MVL      l=5, u=5     64.55%    -                 -                 -
LFS-MVL      l=10, u=5    72.00%    -                 -                 -
Table 8
Results of multi-view learning on Caltech-101 using features on each level of the spatial pyramid and based on LFS-$L^2_{\rho_X}$ with the balanced-discrepancy principle parameter choice and optimal combination operator.

Estimators   Setting      Acc.      λ_A               λ_B               λ_W
MVL-ℓ0       l=1, u=5     26.80%    5.20 × 10^{-6}    1.34 × 10^{-8}    5.83 × 10^{-8}
MVL-ℓ0       l=5, u=5     60.48%    5.20 × 10^{-6}    1.60 × 10^{-7}    3.45 × 10^{-7}
MVL-ℓ0       l=10, u=5    66.91%    5.20 × 10^{-6}    2.37 × 10^{-7}    5.46 × 10^{-7}
MVL-ℓ1       l=1, u=5     30.13%    5.20 × 10^{-6}    7.13 × 10^{-9}    5.22 × 10^{-8}
MVL-ℓ1       l=5, u=5     64.55%    5.20 × 10^{-6}    1.25 × 10^{-7}    3.23 × 10^{-7}
MVL-ℓ1       l=10, u=5    70.72%    5.20 × 10^{-6}    2.54 × 10^{-7}    5.31 × 10^{-7}
MVL-ℓ2       l=1, u=5     31.55%    5.20 × 10^{-6}    6.22 × 10^{-9}    5.09 × 10^{-8}
MVL-ℓ2       l=5, u=5     65.36%    5.20 × 10^{-6}    1.36 × 10^{-7}    3.13 × 10^{-7}
MVL-ℓ2       l=10, u=5    71.57%    5.20 × 10^{-6}    3.85 × 10^{-7}    5.49 × 10^{-7}
LFS-MVL      l=1, u=5     31.79%    -                 -                 -
LFS-MVL      l=5, u=5     66.67%    -                 -                 -
LFS-MVL      l=10, u=5    71.94%    -                 -                 -
of c. We fixed ∥α∥ = 1 in the optimization problem for the combination operator. The optimal choice of the combination operator yields clear improvements in classification accuracy over the uniform weight approach. In Table 7, we construct the multi-view estimators based on the balanced-discrepancy principle with the initial choice of parameters $\lambda_A = 4 \times 10^{-6}$, $\lambda_B = 6.9 \times 10^{-5}$ and $\lambda_W = 1.7 \times 10^{-5}$. Table 8 shows the multi-view estimators based on the balanced-discrepancy principle with the optimal combination operator.

In all the tables, we observe that LFS-MVL performs better than the multi-view estimators MVL-ℓ0, MVL-ℓ1, MVL-ℓ2 and MVL-ALL. The results demonstrate that the performance of the proposed methods improves significantly as the size of the labeled set increases. We also observe that unlabeled data can help improve performance when the number of labeled examples is small. This finding suggests that, when only a few labeled examples are available, the unlabeled data provide additional information about the distribution of the input space. On the other hand, when there are sufficient labeled data to represent the distribution of the input space well, the unlabeled data do not improve the results. In all the cases shown in the tables, we obtain better results with the aggregation approach than with the other estimators.

Acknowledgments
The authors are grateful to the anonymous referees for their valuable suggestions and comments, which helped improve the quality of the paper.
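For illustration, the aggregation step behind the LFS-MVL estimators can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the code used for the experiments reported above: outputs are taken to be scalar-valued, the inner products in $L^2_{\rho_X}$ are approximated by empirical averages over a labeled sample, and the function name `lfs_weights` and the synthetic data are hypothetical.

```python
import numpy as np


def lfs_weights(preds, y):
    """Coefficients of the linear combination sum_i c_i f_i.

    preds: (n, k) array; column i holds estimator f_i evaluated at n points.
    y:     (n,) array of observed outputs at the same points.
    Assumption: empirical averages stand in for the L^2 inner products.
    """
    n = preds.shape[0]
    gram = preds.T @ preds / n   # G_ij ~ <f_i, f_j>
    rhs = preds.T @ y / n        # g_i  ~ <f_rho, f_i>
    # Least-squares solve of G c = g; tolerant of nearly collinear estimators.
    c, *_ = np.linalg.lstsq(gram, rhs, rcond=None)
    return c


# Synthetic stand-ins for three regularized solutions of varying quality.
rng = np.random.default_rng(0)
target = rng.normal(size=200)
preds = np.column_stack(
    [target + rng.normal(scale=s, size=200) for s in (0.8, 0.5, 0.3)]
)

c = lfs_weights(preds, target)
aggregated = preds @ c

# Empirical mean squared error of each estimator and of the aggregate.
mse_single = ((preds - target[:, None]) ** 2).mean(axis=0)
mse_agg = ((aggregated - target) ** 2).mean()
print("weights:", c)
print("single-estimator MSE:", mse_single)
print("aggregated MSE:", mse_agg)
```

On the sample used to compute the weights, the aggregate is the least-squares projection onto the span of the individual estimators, so its empirical error cannot exceed that of any single estimator. Whether this carries over to unseen data depends on the sample size and on how well the empirical averages approximate the true inner products.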