Appendix 3 Proofs of the Main Results of Chapter 5
THEOREM (General form of theorem 5.1).– For a labeling function $f \in \mathcal{G}$, let $\mathcal{F} = \{(x, y) \mapsto \ell(f(x), y)\}$ be a class consisting of the bounded functions with the range $[a, b]$. Let $S_k = \{(x_1^{(k)}, y_1^{(k)}), \ldots, (x_{N_k}^{(k)}, y_{N_k}^{(k)})\}$ be a labeled sample of size $N_k$ drawn from $\mathcal{S}_k$ for all $k \in \{1, \ldots, K\}$. Let $\mathbf{w} = (w_1, \ldots, w_K) \in [0, 1]^K$ with $\sum_{k=1}^{K} w_k = 1$. Then, given any arbitrary $\xi \geq D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T})$, for any $\sum_{k=1}^{K} N_k \geq \frac{8(b-a)^2}{\xi'^2}$ and any $\epsilon > 0$, with probability at least $1 - \epsilon$ the following holds:

$$\sup_{f \in \mathcal{F}} \left| \sum_{k=1}^{K} w_k R_{\hat{S}_k} f - R_{\mathcal{T}} f \right| \leq D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T}) + \left( \frac{\ln \mathcal{N}_1^{\mathbf{w}}\!\left(\xi'/8, \mathcal{F}, 2\sum_{k=1}^{K} N_k\right) - \ln(\epsilon/8)}{\prod_{k=1}^{K} N_k \Big/ \left( 32(b-a)^2 \sum_{k=1}^{K} \left( w_k^2 \prod_{i \neq k} N_i \right) \right)} \right)^{\frac{1}{2}},$$

where $\xi' = \xi - D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T})$ and

$$D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T}) = \sum_{k=1}^{K} w_k D_{\mathcal{F}}(\mathcal{S}_k, \mathcal{T}).$$
Here, the quantity $\mathcal{N}_1^{\mathbf{w}}(\xi, \mathcal{F}, 2\sum_{k=1}^{K} N_k)$ stands for the uniform entropy number, given by the following equation:

$$\mathcal{N}_1^{\mathbf{w}}\!\left(\xi, \mathcal{F}, 2\sum_{k=1}^{K} N_k\right) = \sup_{\{S^{2N_k}\}_{k=1}^{K}} \mathcal{N}\!\left(\xi, \mathcal{F}, \ell_1^{\mathbf{w}}\big(\{S^{2N_k}\}_{k=1}^{K}\big)\right),$$

where, for all source samples $S_k$ and their associated ghost samples $S_k' = \{(x_i'^{(k)}, y_i'^{(k)})\}_{i=1}^{N_k}$ drawn from $\mathcal{S}_k$, $k = 1, \ldots, K$, we set $S^{2N_k} = \{S_k, S_k'\}$, and the metric $\ell_1^{\mathbf{w}}$ is a variation of the $\ell_1$ metric defined for $f \in \mathcal{F}$ based on the following norm:

$$\|f\|_{\ell_1^{\mathbf{w}}(\{S^{2N_k}\}_{k=1}^{K})} = \sum_{k=1}^{K} \frac{w_k}{N_k} \sum_{i=1}^{N_k} \left( \left| f\big(x_i^{(k)}, y_i^{(k)}\big) \right| + \left| f\big(x_i'^{(k)}, y_i'^{(k)}\big) \right| \right).$$
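To make the $\ell_1^{\mathbf{w}}$ norm concrete, here is a minimal Python sketch (the helper name and the toy data are ours, not from the original text) that evaluates the induced $\ell_1^{\mathbf{w}}$ distance between two loss functions over $K$ source samples and their ghost samples:

```python
import numpy as np

def l1w_distance(f, g, samples, ghosts, w):
    """l1^w distance between two functions f, g of (x, y) pairs.

    samples[k] and ghosts[k] hold the k-th source sample and its ghost
    sample as arrays of rows (x, y); w contains weights summing to 1.
    """
    total = 0.0
    for S_k, S_k_ghost, w_k in zip(samples, ghosts, w):
        N_k = len(S_k)
        diff = lambda S: np.abs(f(S) - g(S)).sum()
        total += (w_k / N_k) * (diff(S_k) + diff(S_k_ghost))
    return total

# Toy example: losses induced by two scalar predictors under squared loss.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(10, 2)) for _ in range(3)]  # rows are (x, y)
ghosts = [rng.normal(size=(10, 2)) for _ in range(3)]
f = lambda S: (S[:, 0] - S[:, 1]) ** 2        # loss of h1(x) = x
g = lambda S: (0.5 * S[:, 0] - S[:, 1]) ** 2  # loss of h2(x) = x / 2
print(l1w_distance(f, g, samples, ghosts, w=[0.5, 0.3, 0.2]))
```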
Compared to the classical result relating the true and the empirical risk defined with respect to the same distribution, the proposed bound contains an additional discrepancy term $D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T})$. The proposed result coincides with the classical one when every source domain matches the target domain, i.e., when $D_{\mathcal{F}}(\mathcal{S}_k, \mathcal{T}) = 0$ holds for all $1 \leq k \leq K$.

THEOREM (General form of theorem 5.2).– For a labeling function $f \in \mathcal{G}$, let $\mathcal{F} = \{(x, y) \mapsto \ell(f(x), y)\}$ be a loss function class consisting of the bounded functions with the range $[a, b]$ for a space of labeling functions $\mathcal{G}$. Let $\mathbf{w} = (w_1, \ldots, w_K) \in [0, 1]^K$ with $\sum_{k=1}^{K} w_k = 1$. If the following holds:

$$\lim_{N_1, \ldots, N_K \to \infty} \frac{\ln \mathcal{N}_1^{\mathbf{w}}\!\left(\xi'/8, \mathcal{F}, 2\sum_{k=1}^{K} N_k\right)}{\prod_{k=1}^{K} N_k \Big/ \left( 32(b-a)^2 \sum_{k=1}^{K} \left( w_k^2 \prod_{i \neq k} N_i \right) \right)} < \infty,$$

then for any $\xi \geq D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T})$, with $\xi' = \xi - D_{\mathcal{F}}^{\mathbf{w}}(\mathcal{S}, \mathcal{T})$, we have:

$$\lim_{N_1, \ldots, N_K \to \infty} \Pr\left\{ \sup_{f \in \mathcal{F}} \left| \sum_{k=1}^{K} w_k R_{\hat{S}_k} f - R_{\mathcal{T}} f \right| > \xi \right\} = 0.$$
By the Cauchy–Schwarz inequality, setting $w_k = \frac{N_k}{\sum_{k=1}^{K} N_k}$ for all $1 \leq k \leq K$ minimizes the second term of the theorem presented above and leads to the following result:

$$\sup_{f \in \mathcal{F}} \left| \sum_{k=1}^{K} w_k R_{\hat{S}_k} f - R_{\mathcal{T}} f \right| \leq \frac{\sum_{k=1}^{K} N_k D_{\mathcal{F}}(\mathcal{S}_k, \mathcal{T})}{\sum_{k=1}^{K} N_k} + \left( \frac{\ln \mathcal{N}_1^{\mathbf{w}}\!\left(\xi'/8, \mathcal{F}, 2\sum_{k=1}^{K} N_k\right) - \ln(\epsilon/8)}{\left( \sum_{k=1}^{K} N_k \right) \Big/ \left( 32(b-a)^2 \right)} \right)^{\frac{1}{2}}.$$

This particular value of the weight vector $\mathbf{w}$ leads to the fastest rate of convergence, as confirmed by the numerical results provided by the authors in the original paper; a small numerical sketch of these weights is given below.
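As an illustration, the following short Python sketch (the sample sizes and discrepancy values are made up) computes the Cauchy–Schwarz-optimal weights and the resulting first term of the bound:

```python
import numpy as np

# Hypothetical source sample sizes and discrepancies D_F(S_k, T); in
# practice the discrepancies would have to be estimated from data.
N = np.array([200, 1000, 50])
D = np.array([0.30, 0.10, 0.45])

w = N / N.sum()            # optimal weights w_k = N_k / sum_k N_k
first_term = np.dot(w, D)  # equals sum_k N_k D_F(S_k, T) / sum_k N_k
print(w, first_term)
```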
LEMMA 5.2 ([RED 17]).– Let $\mathcal{S}_X, \mathcal{T}_X \in \mathcal{P}(X)$ be two probability measures on $\mathbb{R}^d$. Assume that the cost function $c(x, x') = \|\phi(x) - \phi(x')\|_{\mathcal{H}_k}$, where $\mathcal{H}_k$ is an RKHS equipped with the kernel $k : X \times X \to \mathbb{R}$ induced by $\phi : X \to \mathcal{H}_k$ and $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}_k}$. Assume further that the loss function $\ell_{h,f} : x \mapsto \ell(h(x), f(x))$ is convex, symmetric and bounded, obeys the triangle inequality, and has the parametric form $|h(x) - f(x)|^q$ for some $q > 0$. Assume also that the kernel $k$ in the RKHS $\mathcal{H}_k$ is square-root integrable w.r.t. both $\mathcal{S}_X, \mathcal{T}_X$ for all $\mathcal{S}_X, \mathcal{T}_X \in \mathcal{P}(X)$, where $X$ is separable and $0 \leq k(x, x') \leq K$, $\forall x, x' \in X$. If $\|\ell_{h,f}\|_{\mathcal{H}_k} \leq 1$, then the following holds:

$$\forall (h, h') \in \mathcal{H}_k^2, \quad R_{\mathcal{T}}^q(h, h') \leq R_{\mathcal{S}}^q(h, h') + W_1(\mathcal{S}_X, \mathcal{T}_X).$$
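Before turning to the proof, here is a small Python sketch illustrating the inequality in one dimension with the identity feature map $\phi(x) = x$, so that the cost reduces to $|x - x'|$ and $W_1$ is the usual one-dimensional Wasserstein distance; the hypotheses and the distributions are arbitrary choices of ours:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 5000)  # source sample, S_X = N(0, 1)
xt = rng.normal(0.5, 1.0, 5000)  # target sample, T_X = N(0.5, 1)

h = lambda x: 0.8 * x   # two linear hypotheses
hp = lambda x: 1.0 * x

loss = lambda x: np.abs(h(x) - hp(x))  # q = 1: ell(h, h') = |h - h'|
RS, RT = loss(xs).mean(), loss(xt).mean()
W1 = wasserstein_distance(xs, xt)

print(f"R_T = {RT:.3f}  <=  R_S + W1 = {RS + W1:.3f}")
```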
PROOF.– We have:

$$\begin{aligned}
R_{\mathcal{T}}^q(h, h') &= R_{\mathcal{T}}^q(h, h') + R_{\mathcal{S}}^q(h, h') - R_{\mathcal{S}}^q(h, h') \\
&= R_{\mathcal{S}}^q(h, h') + \mathop{\mathbb{E}}_{x' \sim \mathcal{T}_X}\left[ \langle \phi(x'), \ell \rangle_{\mathcal{H}_k} \right] - \mathop{\mathbb{E}}_{x \sim \mathcal{S}_X}\left[ \langle \phi(x), \ell \rangle_{\mathcal{H}_k} \right] \\
&= R_{\mathcal{S}}^q(h, h') + \left\langle \mathop{\mathbb{E}}_{x' \sim \mathcal{T}_X}[\phi(x')] - \mathop{\mathbb{E}}_{x \sim \mathcal{S}_X}[\phi(x)], \ell \right\rangle_{\mathcal{H}_k} \\
&\leq R_{\mathcal{S}}^q(h, h') + \|\ell\|_{\mathcal{H}_k} \left\| \mathop{\mathbb{E}}_{x' \sim \mathcal{T}_X}[\phi(x')] - \mathop{\mathbb{E}}_{x \sim \mathcal{S}_X}[\phi(x)] \right\|_{\mathcal{H}_k} \\
&\leq R_{\mathcal{S}}^q(h, h') + \left\| \int_X \phi \, d(\mathcal{S}_X - \mathcal{T}_X) \right\|_{\mathcal{H}_k}.
\end{aligned}$$
The second line is obtained by using the reproducing property applied to $\ell$, and the third line follows from the properties of the expected value. The fourth line here is due to the properties of the inner product (the Cauchy–Schwarz inequality), whereas the fifth line is due to $\|\ell_{h,f}\|_{\mathcal{H}_k} \leq 1$. Now, using the dual form of the integral given by the Kantorovich–Rubinstein theorem, we have, for any coupling $\gamma \in \Pi(\mathcal{S}_X, \mathcal{T}_X)$:

$$\left\| \int_X \phi \, d(\mathcal{S}_X - \mathcal{T}_X) \right\|_{\mathcal{H}_k} = \left\| \int_{X \times X} \left( \phi(x) - \phi(x') \right) d\gamma(x, x') \right\|_{\mathcal{H}_k} \leq \int_{X \times X} \left\| \phi(x) - \phi(x') \right\|_{\mathcal{H}_k} d\gamma(x, x').$$

Now we can obtain the final result by taking the infimum over $\gamma$ on the right-hand side of the inequality (the left-hand side does not depend on $\gamma$), i.e.:

$$\left\| \int_X \phi \, d(\mathcal{S}_X - \mathcal{T}_X) \right\|_{\mathcal{H}_k} \leq \inf_{\gamma \in \Pi(\mathcal{S}_X, \mathcal{T}_X)} \int_{X \times X} \left\| \phi(x) - \phi(x') \right\|_{\mathcal{H}_k} d\gamma(x, x') = W_1(\mathcal{S}_X, \mathcal{T}_X),$$

which gives $R_{\mathcal{T}}^q(h, h') \leq R_{\mathcal{S}}^q(h, h') + W_1(\mathcal{S}_X, \mathcal{T}_X)$.
THEOREM 5.5 ([RED 17]).– Let $S_u$, $T_u$ be unlabeled samples of size $N_S$ and $N_T$, drawn independently from $\mathcal{S}_X$ and $\mathcal{T}_X$, respectively. Let $S$ be a labeled sample of size $m$ generated by drawing $\beta m$ points from $\mathcal{T}_X$ ($\beta \in [0, 1]$) and $(1 - \beta) m$ points from $\mathcal{S}_X$, and labeling them according to $f_T$ and $f_S$, respectively. If $\hat{h} \in \mathcal{H}$ is the empirical minimizer of $R_{\hat{S}}^{\alpha}(h)$ on $S$ and $h_T^* = \operatorname{argmin}_{h \in \mathcal{H}} R_{\mathcal{T}}^q(h)$, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ (over the choice of samples), we have:

$$R_{\mathcal{T}}^q(\hat{h}) \leq R_{\mathcal{T}}^q(h_T^*) + c_1 + 2(1 - \alpha)\left( W_1(\hat{\mathcal{S}}_X, \hat{\mathcal{T}}_X) + \lambda + c_2 \right),$$

where

$$c_1 = 2\sqrt{\frac{2K\left( \frac{(1-\alpha)^2}{1-\beta} + \frac{\alpha^2}{\beta} \right) \log(2/\delta)}{m}} + 4\sqrt{\frac{K}{m}}\left( \frac{\alpha}{\sqrt{\beta}} + \frac{1-\alpha}{\sqrt{1-\beta}} \right), \qquad c_2 = \sqrt{\frac{2\log(1/\delta)}{\varsigma}}\left( \sqrt{\frac{1}{N_S}} + \sqrt{\frac{1}{N_T}} \right).$$
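Taking the statement above at face value (and treating the constant $\varsigma$ of the underlying Wasserstein concentration result as given), a small Python sketch can evaluate both terms for concrete values of the parameters; all the numbers below are arbitrary:

```python
import numpy as np

def c1(alpha, beta, m, K, delta):
    # First term: McDiarmid-type deviation; second: Rademacher complexity.
    a = 2 * np.sqrt(2 * K * ((1 - alpha) ** 2 / (1 - beta)
                             + alpha ** 2 / beta) * np.log(2 / delta) / m)
    b = 4 * np.sqrt(K / m) * (alpha / np.sqrt(beta)
                              + (1 - alpha) / np.sqrt(1 - beta))
    return a + b

def c2(N_S, N_T, delta, varsigma=1.0):
    # varsigma is the constant of the Wasserstein concentration result,
    # set to 1.0 here for lack of a concrete value.
    return np.sqrt(2 * np.log(1 / delta) / varsigma) * (
        np.sqrt(1 / N_S) + np.sqrt(1 / N_T))

print(c1(alpha=0.5, beta=0.5, m=1000, K=1.0, delta=0.05))
print(c2(N_S=2000, N_T=2000, delta=0.05))
```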
PROOF.– The proof of this theorem follows the proof of theorem 3.3 and differs from it only in the concentration inequality obtained for the empirical combined error. This inequality is established by the following lemma.

LEMMA A3.1.– Under the assumptions of lemma 5.2, let $S$ be a labeled sample of size $m$ generated by drawing $\beta m$ points from $\mathcal{T}_X$ ($\beta \in [0, 1]$) and $(1 - \beta) m$ points from $\mathcal{S}_X$, and labeling them according to $f_T$ and $f_S$, respectively. Then, for any $\epsilon > 0$ and all $h \in \mathcal{H}$ with $0 \leq k(x_i, x_j) \leq K$, the following holds:

$$\Pr\left[ \left| \hat{R}^{\alpha}(h) - R^{\alpha}(h) \right| > \epsilon + 2\sqrt{\frac{K}{m}}\left( \frac{\alpha}{\sqrt{\beta}} + \frac{1-\alpha}{\sqrt{1-\beta}} \right) \right] \leq \exp\left( \frac{-\epsilon^2 m}{2K\left( \frac{(1-\alpha)^2}{1-\beta} + \frac{\alpha^2}{\beta} \right)} \right).$$
PROOF.– First, we use McDiarmid's inequality in order to obtain the right-hand side of the inequality, by bounding the maximum change in magnitude of the quantity of interest when one of the sample vectors is changed. For the sake of completeness, we give its definition here.

DEFINITION A3.1 (McDiarmid's inequality).– Suppose $X_1, X_2, \ldots, X_n$ are independent random variables taking values in a set $X$, and assume that $f : X^n \to \mathbb{R}$ is a function of $X_1, X_2, \ldots, X_n$ that satisfies, for all $i$ and all $x_1, \ldots, x_n, x_i' \in X$:

$$\left| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \right| \leq c_i.$$

Then the following inequality holds for any $\epsilon > 0$:

$$\Pr\left[ \left| f(x_1, x_2, \ldots, x_n) - \mathbb{E}\left[ f(x_1, x_2, \ldots, x_n) \right] \right| > \epsilon \right] \leq \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right).$$
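This inequality is easy to check numerically. The sketch below (a Monte Carlo experiment of our own construction, not from the original text) applies it to the empirical mean of $n$ i.i.d. uniform variables on $[0, 1]$, for which each $c_i = 1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 100, 100_000, 0.1

# Empirical tail probability of the deviation of the mean from E[f] = 0.5.
means = rng.uniform(size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) > eps)

# McDiarmid bound: exp(-2 eps^2 / sum_i c_i^2) with c_i = 1/n.
bound = np.exp(-2 * eps ** 2 / (n * (1 / n) ** 2))

print(f"empirical tail: {empirical:.5f}  <=  bound: {bound:.5f}")
```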
We now rewrite the difference between the empirical and the true combined error in the following way:

$$\begin{aligned}
\left| \hat{R}^{\alpha}(h) - R^{\alpha}(h) \right| &= \left| \alpha\left( R_{\mathcal{T}}^q(h) - R_{\hat{\mathcal{T}}}^q(h) \right) - (\alpha - 1)\left( R_{\mathcal{S}}^q(h) - R_{\hat{\mathcal{S}}}^q(h) \right) \right| \\
&= \bigg| \alpha \mathop{\mathbb{E}}_{x \sim \mathcal{T}_X}\left[ \ell_q(h(x), f_T(x)) \right] - (\alpha - 1) \mathop{\mathbb{E}}_{y \sim \mathcal{S}_X}\left[ \ell_q(h(y), f_S(y)) \right] \\
&\qquad - \frac{\alpha}{m\beta} \sum_{i=1}^{\beta m} \ell_q(h(x_i), f_T(x_i)) + \frac{\alpha - 1}{m(1 - \beta)} \sum_{i=1}^{m(1-\beta)} \ell_q(h(y_i), f_S(y_i)) \bigg| \\
&\leq \sup_{h \in \mathcal{H}} \bigg| \alpha \mathop{\mathbb{E}}_{x \sim \mathcal{T}_X}\left[ \ell_q(h(x), f_T(x)) \right] - (\alpha - 1) \mathop{\mathbb{E}}_{y \sim \mathcal{S}_X}\left[ \ell_q(h(y), f_S(y)) \right] \\
&\qquad - \frac{\alpha}{m\beta} \sum_{i=1}^{\beta m} \ell_q(h(x_i), f_T(x_i)) + \frac{\alpha - 1}{m(1 - \beta)} \sum_{i=1}^{m(1-\beta)} \ell_q(h(y_i), f_S(y_i)) \bigg|.
\end{aligned}$$
Changing either $x_i$ or $y_i$ in this expression changes its value by at most $\frac{2\alpha\sqrt{K}}{\beta m}$ and $\frac{2(1-\alpha)\sqrt{K}}{(1-\beta)m}$, respectively. This gives us the denominator of the exponential in definition A3.1:

$$\beta m \left( \frac{2\alpha\sqrt{K}}{\beta m} \right)^2 + (1 - \beta) m \left( \frac{2(1-\alpha)\sqrt{K}}{(1-\beta)m} \right)^2 = \frac{4K}{m}\left( \frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta} \right).$$
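This algebraic simplification can be verified symbolically; the short sketch below uses sympy to confirm that the sum of the squared bounded differences equals $\frac{4K}{m}\left( \frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta} \right)$:

```python
import sympy as sp

alpha, beta, m, K = sp.symbols('alpha beta m K', positive=True)

# Bounded differences from the proof: changing one x_i or one y_i.
c_x = 2 * alpha * sp.sqrt(K) / (beta * m)
c_y = 2 * (1 - alpha) * sp.sqrt(K) / ((1 - beta) * m)

total = beta * m * c_x ** 2 + (1 - beta) * m * c_y ** 2
target = 4 * K / m * (alpha ** 2 / beta + (1 - alpha) ** 2 / (1 - beta))
print(sp.simplify(total - target))  # prints 0
```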
Then, we bound the expectation of the difference between the true and empirical combined errors by the sum of Rademacher averages over the samples. Denoting by $X'$ an i.i.d. sample of size $\beta m$ drawn independently of $X$ (and likewise $Y'$ for $Y$), and using the symmetrization technique with Rademacher variables $\sigma_i$, we have:

$$\begin{aligned}
&\mathop{\mathbb{E}}_{X,Y} \sup_{h \in \mathcal{H}} \bigg| \alpha \mathop{\mathbb{E}}_{x \sim \mathcal{T}_X}\left[ \ell(h(x), f_T(x)) \right] - (\alpha - 1) \mathop{\mathbb{E}}_{y \sim \mathcal{S}_X}\left[ \ell(h(y), f_S(y)) \right] \\
&\hspace{5em} - \frac{\alpha}{m\beta} \sum_{i=1}^{\beta m} \ell(h(x_i), f_T(x_i)) + \frac{\alpha - 1}{m(1 - \beta)} \sum_{i=1}^{m(1-\beta)} \ell(h(y_i), f_S(y_i)) \bigg| \\
&\leq \mathop{\mathbb{E}}_{X,Y} \sup_{h \in \mathcal{H}} \bigg| \frac{\alpha}{m\beta} \mathop{\mathbb{E}}_{X'}\left[ \sum_{i=1}^{\beta m} \ell(h(x_i'), f_T(x_i')) \right] - \frac{\alpha}{m\beta} \sum_{i=1}^{\beta m} \ell(h(x_i), f_T(x_i)) \\
&\hspace{5em} - \frac{\alpha - 1}{m(1 - \beta)} \mathop{\mathbb{E}}_{Y'}\left[ \sum_{i=1}^{m(1-\beta)} \ell(h(y_i'), f_S(y_i')) \right] + \frac{\alpha - 1}{m(1 - \beta)} \sum_{i=1}^{m(1-\beta)} \ell(h(y_i), f_S(y_i)) \bigg| \\
&\leq \mathop{\mathbb{E}}_{X,X',Y,Y',\sigma} \sup_{h \in \mathcal{H}} \bigg| \frac{\alpha}{m\beta} \sum_{i=1}^{\beta m} \sigma_i \left( \ell(h(x_i), f_T(x_i)) - \ell(h(x_i'), f_T(x_i')) \right) \\
&\hspace{5em} + \frac{1 - \alpha}{m(1 - \beta)} \sum_{i=1}^{m(1-\beta)} \sigma_i \left( \ell(h(y_i), f_S(y_i)) - \ell(h(y_i'), f_S(y_i')) \right) \bigg| \\
&\leq 2\sqrt{\frac{K}{m}}\left( \frac{\alpha}{\sqrt{\beta}} + \frac{1 - \alpha}{\sqrt{1 - \beta}} \right).
\end{aligned}$$

Finally, the Rademacher averages are, in their turn, bounded using a theorem from [BAR 02]. Using this inequality in definition A3.1 gives us the desired result:

$$\Pr\left[ \left| \hat{R}^{\alpha}(h) - R^{\alpha}(h) \right| > \epsilon + 2\sqrt{\frac{K}{m}}\left( \frac{\alpha}{\sqrt{\beta}} + \frac{1-\alpha}{\sqrt{1-\beta}} \right) \right] \leq \exp\left( \frac{-\epsilon^2 m}{2K\left( \frac{(1-\alpha)^2}{1-\beta} + \frac{\alpha^2}{\beta} \right)} \right).$$
The final result can now be obtained by following the steps in the proof of theorem 3.3 and applying the corresponding results with the Wasserstein distance and the lemma proved above.