Sub-domain adaptation learning methodology
Jun Gao a,b,c,⇑, Rong Huang a, Hanxiong Li c
a School of Information Eng., Yancheng Institute of Technology, Yancheng, China
b School of Automation, Southeast University, Nanjing, China
c Department of System Eng. & Eng. Management, City University of Hong Kong, Hong Kong Special Administrative Region
Article history: Received 7 April 2013 Received in revised form 15 November 2014 Accepted 27 November 2014 Available online xxxx
Keywords: Maximum mean discrepancy Local weighted mean Projected maximum local weighted mean discrepancy Multi-label classification Support vector machines
Abstract
Regarded as global methods, Maximum Mean Discrepancy (MMD) based transfer learning frameworks reflect only the global distribution discrepancy and global structural differences between domains; they reflect neither the inner local distribution discrepancies nor the local structural differences between domains. To address this problem, a novel transfer learning framework with local learning ability, the Sub-domain Adaptation Learning Framework (SDAL), is proposed. In this framework, a Projected Maximum Local Weighted Mean Discrepancy (PMLMD) is constructed by integrating the theory and method of the Local Weighted Mean (LWM) into MMD. PMLMD reflects the global distribution discrepancy between domains by accumulating the local distribution discrepancies between the local sub-domains of the domains. In particular, we show in theory that PMLMD is a generalized measure of MMD. On the basis of SDAL, two novel methods are proposed by using Multi-label Classifiers (MLC) and the Support Vector Machine (SVM). Finally, tests on artificial datasets, high-dimensional text datasets and face datasets show that the SDAL-based transfer learning methods are superior to or at least comparable with benchmarking methods.
© 2014 Published by Elsevier Inc.
1. Introduction
A major assumption in traditional statistical models is that the training data and the future data must have the same distribution; that is, the training data and the test data are Independently and Identically Distributed (I.I.D.).1 However, when the distributions are not identical, almost all traditional learning models have to be rebuilt for the future data. In real-world applications, it is common that data such as cross-language texts, biological information, social internet information and multi-task studies [20] may be non-I.I.D. The key challenge in these applications is that accurately labeled task-specific data are scarce while task-relevant data are abundant. Learning with non-I.I.D. data in such scenarios helps build accurate models by leveraging relevant data to perform new learning tasks, identifying the true connections among samples and their labels, and expediting knowledge discovery by simplifying the expensive data collection process. For such cases, Transfer Learning (TL) or Knowledge Transfer algorithms have been proposed [23,29,4,8,27,12,19]. The aim of TL algorithms is to effectively build a statistical learning model for the target domain by using knowledge obtained from the source domain [25]. These algorithms focus on knowledge transfer between different tasks or domains instead of simply
⇑ Corresponding author at: School of Information Eng., Yancheng Institute of Technology, Yancheng, China. E-mail address: [email protected] (J. Gao).
1 In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (I.I.D.) if each random variable has the same probability distribution as the others and all are mutually independent.
generalizing across problems. From this perspective, TL algorithms differ greatly from traditional supervised, semi-supervised or unsupervised methods: the latter assume the training and the test data are drawn from the same distribution, whereas the former handle source and target domains with different distributions. Domain Adaptation Learning (DAL), a special TL method, addresses how to build a statistical learning model after learning knowledge from source and target domains that have different but related distributions. In DAL, the major computational problem is how to minimize the distribution discrepancy between the task domains. We therefore need an effective measure that reflects the distribution discrepancy between the task domains; this is the greatest challenge in building an effective DAL model. Recently, researchers have proposed several measures of the distribution discrepancy between two domains, such as the Kullback-Leibler distance (KL-distance) [30] and Maximum Mean Discrepancy (MMD) [3]. The KL-distance is a parametric estimation method; it requires continual prior density estimates in the process of measuring the distribution discrepancy of the domains. MMD, by contrast, is a nonparametric measure; it reflects the distribution discrepancy between the source and target domains by calculating the mean difference between them, which makes MMD simple, effective and intuitive. Based on MMD, some traditional methods, such as the Transductive Support Vector Machine (TSVM) [15], Multi-label Classifiers (MLC) [14] and feature selection, have been rebuilt to address several domain adaptation learning problems. Furthermore, based on MMD, Quanz and Huan [24] proposed the Projected Maximum Mean Discrepancy (PMMD)2 to express the distribution discrepancy between the embedded subspaces of the source and target domains.
Both MMD and PMMD express the distribution discrepancy between domains as the population mean difference between the domains or between their corresponding embedded subspaces. As stated in statistical theory [33], the population mean or expectation of a domain, as an effective statistical feature, indicates the global distribution and structure information of the domain. From the perspective of geometry, the population mean describes spatial data well when the data follow a Gaussian distribution. So MMD and PMMD reflect, to some extent, the discrepancy of the global distribution or the global structure information between the source and target domains, and they are most suitable for reflecting the distribution discrepancy between domains with an apparently Gaussian distribution. At this level, the MMD-based domain adaptation learning methods, such as the Multi-view Transfer Learning with a Large Margin Framework (MVTL-LM) [37], the Domain Transfer Multiple Kernel Learning Framework (DTMKL) [9,10], the Domain Adaptation Support Vector Machine (DASVM) [5], the Domain Adaptation Kerneled Support Vector Machine (DAKSVM) [28], Maximum Mean Discrepancy Embedding (MMDE) [21] and the Multi-label Classification Learning Framework (MCLF) [6], fall into the category of global methods; hence, to some extent, they ignore the local discrepancy between domains and the local structural information of the different domains.
So far, little research has addressed this problem with a measure that has local learning ability. Therefore, in this paper we propose a novel transfer learning framework with local learning ability: the Sub-domain Adaptation Learning Framework (SDAL). In this framework, we integrate the Local Weighted Mean (LWM) [2] into MMD and propose a novel criterion that can effectively measure the distribution discrepancy between the source domain and the target domain, the Projected Maximum Local Weighted Mean Discrepancy (PMLMD). Finally, within the SDAL framework, two sub-domain adaptation learning methods are constructed for TL problems by using MLC and SVM. The framework in this paper has the following advantages:
(1) PMLMD, based on LWM and MMD, can not only effectively calculate the local distribution discrepancy between sub-domains of different domains but also effectively reflect the local geometrical discrepancy between different domains. Furthermore, it can reflect the global distribution and geometrical discrepancy between domains by accumulating the local distribution discrepancies between sub-domains. We theoretically show that PMLMD is a generalization of MMD and PMMD. Additionally, a novel definition, the Closest Local Sub-domain (CLSD), is presented to specify how the local distribution discrepancy between the source and target domains is calculated.
(2) SDAL accommodates many traditional statistical learning methods, such as SVM, Support Vector Regression (SVR) [38] and MLC, for solving domain adaptation learning problems. In particular, in the SDAL framework we use MLC and SVM to build two sub-domain adaptation learning methods for TL. The former, MLC based sub-domain adaptation learning (MLC-SDAL), is a local linear multi-label classification sub-domain adaptation learning method; it is an effective classifier that can realize low-dimensional embedding of the original input spaces corresponding to the source and target domains, so it can be taken as a generalized form of MLC and MCLF. The latter, SVM based sub-domain adaptation learning (SVM-SDAL), is not only a large-margin domain support vector method but also a local learning classifier; at this level, it inherits and extends the advantages of TSVM and of the MMD-based support vector machine algorithms above. SDAL can, of course, also address traditional supervised, semi-supervised or unsupervised recognition problems.
(3) All the advantages of SDAL are justified by tests on artificial data with obvious local manifold features, high-dimensional text data and face recognition data.
2 PMMD and MMD differ in that PMMD reflects the distribution discrepancy between the low-dimensional spaces embedded in the source domain and the target domain, respectively, while MMD reflects the distribution discrepancy between the source domain and the target domain themselves.
[Fig. 1 panels: y = sin(x) on (−π, π) (left) and the ellipse x²/3² + y²/2² = 1 (right); the small circles mark the sub-domains ∀D1q of D1 and ∀D2d of D2, annotated with disPMLMD(D1q, D2d′) ≠ 0, disPMLMD(D2d, D1q′) ≠ 0 and disPMLMD(D1, D2) ≠ 0.]
Fig. 1. Distribution discrepancy between two functions with the same mean but different distributions.
We organize the rest of this paper as follows: in Section 2, we briefly review the MMD measure and describe its problems; SDAL is discussed in Section 3; the methods are then tested in Section 4; and finally we draw conclusions and discuss future work.
2. Maximum Mean Discrepancy (MMD)
For any source domain $D_s = \{x_{si}\}_{i=1}^{n_s}$ drawn from distribution P and target domain $D_t = \{z_{tj}\}_{j=1}^{n_t}$ drawn from distribution W, the MMD between $D_s$ and $D_t$ can be written as follows in an RKHS [3]:

$$dist(D_s, D_t) = \sup_{\|f\|_{\mathcal{H}}\le 1}\left(E_{x\sim P}[f(x)] - E_{z\sim W}[f(z)]\right) = \left\|E_{x\sim P}[\phi(x)] - E_{z\sim W}[\phi(z)]\right\|_{\mathcal{H}} \qquad (1)$$

where $\phi(\cdot)$ is a nonlinear projection from the original input space to the high-dimensional Hilbert space and $E(\cdot)$ is the mathematical expectation. For simplicity, Eq. (1) can be transformed into the empirical expression

$$dist^2(D_s, D_t) = \left\|\frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_{si}) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(z_{tj})\right\|_{\mathcal{H}}^2. \qquad (2)$$
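The following is a minimal sketch of how the empirical squared MMD of Eq. (2) can be evaluated with kernel evaluations only; the Gaussian kernel, the bandwidth and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix between the rows of A and B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_squared(Xs, Zt, sigma=1.0):
    # Empirical squared MMD of Eq. (2) expanded with kernels:
    # ||mean phi(x) - mean phi(z)||^2 = mean(Kss) - 2*mean(Kst) + mean(Ktt).
    Kss = gaussian_kernel(Xs, Xs, sigma)
    Ktt = gaussian_kernel(Zt, Zt, sigma)
    Kst = gaussian_kernel(Xs, Zt, sigma)
    return Kss.mean() - 2 * Kst.mean() + Ktt.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xs = rng.normal(0.0, 1.0, size=(200, 2))   # source samples (illustrative)
    Zt = rng.normal(0.5, 1.0, size=(200, 2))   # shifted target samples
    print(mmd_squared(Xs, Zt))
```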
From Eq. (2), the distribution discrepancy is calculated from the means of the source and target domains. There exist several deficiencies in MMD, as shown in Figs. 1 and 2, where the patches drawn as small circles are sub-domains of the domains D1 and D2, respectively. The graphs in Fig. 1 refer to the functions $y = \sin(x)$, $x \in (-\pi, \pi)$, and $\frac{x^2}{3^2} + \frac{y^2}{2^2} = 1$, and those in Fig. 2 refer to the functions $y = \exp\left(-\frac{|x|^2}{100^2}\right)$ and $y = \exp\left(-\frac{|x|^2}{1^2}\right)$. The two functions in Fig. 1 have the same means but completely different distributions, and those in Fig. 2 have the same means and similar distributions but different deviations. When we use an MMD-based DAL framework [37,9,10,28,33] in the above cases and utilize MMD3 to measure the distribution discrepancy between the two domains, we find disMMD(D1, D2) = 0. This implies that the domains in Figs. 1 and 2 have no distribution discrepancy, which is a wrong conclusion. Another problem in the MMD-based learning framework is that the mean is calculated as a standard average. However, since the contributions of the samples are not equal, different weights should be assigned to the samples. Thus, the weighted
3 We use the MMD measure with the linear-kernel function in [6].
Fig. 2. Distribution discrepancy between two functions with the same mean and similar distributions.
Fig. 3. Sub-domain adaptation learning framework.
average could be a good solution. On the other hand, different distributions always exist at different locations of a domain. To solve this problem effectively, the domain should be decomposed into patches, and each patch should be processed separately.
3. Sub-domain adaptation learning framework
Why, to some extent, are the MMD-based DAL frameworks not adapted to the problems in Figs. 1 and 2? It is because the MMD-based DAL frameworks focus only on the global discrepancy but ignore the local discrepancy between domains. Therefore, we propose a novel adaptation learning framework with local learning ability—SDAL (Fig. 3). There are three parts in the SDAL framework: domain decomposition, the sub-domain distribution discrepancy measure and the sub-domain adaptation methods. The following is the SDAL framework in this paper.
3.1. Domain decomposition
According to manifold learning theory [17], non-Gaussian or manifold-valued data are usually handled through local sub-domains, because non-Gaussian data can be locally viewed as Gaussian4 and a curved manifold can be locally viewed as Euclidean.5 However, when real-world sub-domain adaptation learning problems are addressed using this theory, two problems remain unsolved: one is how to effectively decompose the source and target domains, which have different distributions, into local data sub-domains; the other is how to effectively construct sample weights within the sub-domains. Fortunately, we may use an approach similar to that in [40,39] to solve these problems, even though that approach is usually used for non-transfer pattern recognition problems.
4 This means the data in a sub-domain follow a Gaussian distribution.
5 This means the space corresponding to a sub-domain can be viewed as a Euclidean space.
Definition 1 (Local Sub-domain). For any source domain $D_s = \{x_{si}\}_{i=1}^{n_s}$ and target domain $D_t = \{z_{tj}\}_{j=1}^{n_t}$ with different but related distributions, and for $\forall x_{si} \in D_s$ and $\forall z_{tj} \in D_t$, let $D_{si} = \{x_{si}^{(c_1)}\}_{c_1=1}^{k_1} \subset D_s$ consist of the $k_1$ nearest samples of $x_{si}$ and $D_{tj} = \{z_{tj}^{(c_2)}\}_{c_2=1}^{k_2} \subset D_t$ consist of the $k_2$ nearest samples of $z_{tj}$; these are regarded as the local sub-domains in the source and target domains, respectively. Here $x_{si}^{(c_1)}$ denotes the $c_1$th nearest neighbor of $x_{si}$ and $z_{tj}^{(c_2)}$ the $c_2$th nearest neighbor of $z_{tj}$.
If there is a projection operator $w$ with $D'_{si} = w(D_{si})$ and $D'_{tj} = w(D_{tj})$, then according to [2] the LWMs of $D_{si}$, $D_{tj}$, $D'_{si}$ and $D'_{tj}$ can be written as

$$\sum_{c_1=1}^{k_1}\frac{\beta_{si}^{(c_1)} x_{si}^{(c_1)}}{\sum_{p_1=1}^{k_1}\beta_{si}^{(p_1)}},\quad \sum_{c_2=1}^{k_2}\frac{\beta_{tj}^{(c_2)} z_{tj}^{(c_2)}}{\sum_{p_2=1}^{k_2}\beta_{tj}^{(p_2)}},\quad \sum_{c_1=1}^{k_1}\frac{\beta_{si}^{(c_1)} w(x_{si}^{(c_1)})}{\sum_{p_1=1}^{k_1}\beta_{si}^{(p_1)}} \quad\text{and}\quad \sum_{c_2=1}^{k_2}\frac{\beta_{tj}^{(c_2)} w(z_{tj}^{(c_2)})}{\sum_{p_2=1}^{k_2}\beta_{tj}^{(p_2)}},$$

respectively, where $\beta_{si}^{(c_1)} = \exp\left(-\frac{\|x_{si}-x_{si}^{(c_1)}\|^2}{h_1}\right)$ is the weight of sample $x_{si}^{(c_1)}$ in $D_{si}$ and $\beta_{tj}^{(c_2)} = \exp\left(-\frac{\|z_{tj}-z_{tj}^{(c_2)}\|^2}{h_2}\right)$ is the weight of sample $z_{tj}^{(c_2)}$ in $D_{tj}$; $h_1$ and $h_2$ are the heat kernel parameters in the heat kernel function $\exp\left(-\frac{d}{h}\right)$.
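A sketch, under the notation of Definition 1, of forming the k-nearest-neighbour sub-domain of a sample and its heat-kernel local weighted mean; the brute-force neighbour search and the function names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def local_subdomain(X, i, k):
    # k nearest neighbours of X[i] within its own domain (Definition 1).
    d = np.linalg.norm(X - X[i], axis=1)
    idx = np.argsort(d)[1:k + 1]          # skip the sample itself
    return X[idx]

def local_weighted_mean(x, neighbours, h):
    # Heat-kernel weights beta = exp(-||x - x^(c)||^2 / h), normalised so they
    # sum to 1, then the weighted mean of the neighbours (the LWM of Section 3.1).
    w = np.exp(-np.sum((neighbours - x) ** 2, axis=1) / h)
    w = w / w.sum()
    return w @ neighbours
```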
Then, we propose the measure PMLMD as follows.
3.2. Sub-domain distribution discrepancy measure
According to Definition 1, we combine the theory and approach of LWM with the principles of MMD to propose PMLMD (Fig. 4), a novel measure that can effectively capture the local distribution discrepancy between domains.
3.2.1. Projected Maximum Local Weighted Mean Discrepancy (PMLMD)
According to Fig. 4, PMLMD uses the accumulated local distribution discrepancies between every sub-domain of a domain and its respective CLSD to indicate the global distribution discrepancy. The local distribution discrepancy within the domains is expressed by the local weighted mean difference, which has some local learning ability; PMLMD thus realizes global learning through local learning. This idea, different from MMD, has been used widely in many local learning algorithms. To show its effectiveness, we utilize the PMLMD measure to calculate the distribution discrepancy between the functions in Figs. 1 and 2 and obtain disPMLMD(D1, D2) ≠ 0, which indicates that those domains do have a distribution discrepancy; this is of course true.

Definition 2 (CLSD). For any two domains $D_1 = \{D_{1q}\}_{q=1}^{N}$ and $D_2 = \{D_{2d}\}_{d=1}^{M}$ having a distribution discrepancy, $D_{1q}$ is a sub-domain of $D_1$ with Gaussian distribution $P_q$ and $D_{2d}$ is a sub-domain of $D_2$ with Gaussian distribution $W_d$. For $\forall D_{1q}$, if there is $D_{2d'} \in D_2$ that satisfies the following equation, then $D_{2d'}$ is the Closest Local Sub-domain (CLSD) of $D_{1q}$ in $D_2$:
$$dist(D_{1q}, D_{2d'}) = \left\|E_{x\sim P_q}[f(x)] - E_{z\sim W_{d'}}[f(z)]\right\|_{\mathcal{H}} = \min_{d=1,\ldots,M}\left\|E_{x\sim P_q}[f(x)] - E_{z\sim W_d}[f(z)]\right\|_{\mathcal{H}} \qquad (3)$$
Note that when a domain with a non-Gaussian or manifold distribution is decomposed into sub-domains, its sub-domains generally have different Gaussian distributions: if the sub-domains had the same distribution, the domain itself would be Gaussian, which contradicts the prerequisite. So we hold that it is reasonable to define the sub-domains in Definition 2 as having different Gaussian distributions.
[Fig. 4 flow: the domains D1 and D2 are decomposed, via manifold learning theory, into sub-domains ∀D1q (q = 1, ..., ns) and ∀D2d (d = 1, ..., nt); LWM + MMD yield the CLSDs D2d′ ⊂ D2 and D1q′ ⊂ D1; LWM + PMMD give the local discrepancies dis(D1q, D2d′) and dis(D2d, D1q′), which are accumulated over q and d into the global discrepancy disPMLMD(D2, D1) used in the PMLMD based objective function.]
Fig. 4. Construction of PMLMD measurement.
Using LWM instead of the standard average to replace the expectations of Definition 2 (Eq. (3)), and taking the linear projection $\phi(x) = x$, the estimator of the CLSD is

$$dist_{CLSD}(D_{1q}, D_{2d'}) = \left\|\sum_{c_1=1}^{k_1}\frac{\beta_{1q}^{(c_1)} v_{1q}^{(c_1)}}{\sum_{p_1=1}^{k_1}\beta_{1q}^{(p_1)}} - \sum_{c_2=1}^{k_2}\frac{\beta_{2d'}^{(c_2)} u_{2d'}^{(c_2)}}{\sum_{p_2=1}^{k_2}\beta_{2d'}^{(p_2)}}\right\| = \min_{d=1,\ldots,M}\left\|\sum_{c_1=1}^{k_1}\frac{\beta_{1q}^{(c_1)} v_{1q}^{(c_1)}}{\sum_{p_1=1}^{k_1}\beta_{1q}^{(p_1)}} - \sum_{c_2=1}^{k_2}\frac{\beta_{2d}^{(c_2)} u_{2d}^{(c_2)}}{\sum_{p_2=1}^{k_2}\beta_{2d}^{(p_2)}}\right\| \qquad (4)$$

where $v_{1q}^{(c_1)}$ and $u_{2d}^{(c_2)}$ denote the samples of the sub-domains $D_{1q}$ and $D_{2d}$ and the $\beta$ terms are the corresponding LWM weights.
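A small sketch of the CLSD search of Eq. (4) with the linear projection: every sub-domain of one domain is paired with the sub-domain of the other domain whose local weighted mean is closest. The helper name and the pre-computed LWM inputs are our own assumptions.

```python
import numpy as np

def closest_local_subdomain(lwm_query, lwm_candidates):
    # Eq. (4): the CLSD of a sub-domain is the candidate sub-domain whose
    # local weighted mean is nearest in Euclidean norm (phi(x) = x).
    # lwm_query: (d,) LWM of one sub-domain; lwm_candidates: (M, d) LWMs of
    # all sub-domains of the other domain.
    dists = np.linalg.norm(lwm_candidates - lwm_query, axis=1)
    j = int(np.argmin(dists))
    return j, dists[j]
```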
Then, we propose the PMLMD measure as follows.

Definition 3 (PMLMD). For any source domain $D_s = \{D_{si}\}_{i=1}^{n_s}$ and target domain $D_t = \{D_{tj}\}_{j=1}^{n_t}$ having different but related distributions, let $\forall D_{si} \in D_s$ be a local sub-domain of the source domain with samples $x_{si} = (x_{si1}, \ldots, x_{sin})^T \in D_s$, and $\forall D_{tj} \in D_t$ a local sub-domain of the target domain with samples $z_{tj} = (z_{tj1}, \ldots, z_{tjn})^T \in D_t$. When the low-dimensional embedded sub-spaces corresponding to $D_s$ and $D_t$ are $D'_s$ and $D'_t$ respectively, PMLMD is defined as

$$dist^2_{PMLMD}(D'_s, D'_t) = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} c_{ij}\, dist^2_{PMLMD}(D'_{si}, D'_{tj}) = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} c_{ij}\left\|\sum_{c_1=1}^{k_1}\frac{\beta_{si}^{(c_1)}\, w(\phi(x_{si}^{(c_1)}))}{\sum_{p_1=1}^{k_1}\beta_{si}^{(p_1)}} - \sum_{c_2=1}^{k_2}\frac{\beta_{tj}^{(c_2)}\, w(\phi(z_{tj}^{(c_2)}))}{\sum_{p_2=1}^{k_2}\beta_{tj}^{(p_2)}}\right\|_F^2 \qquad (5)$$

where

$$c_{ij} = \begin{cases}\dfrac{1}{n_s+n_t} & \text{if } D_{si} \text{ is the CLSD of } D_{tj} \text{ in } D_s \text{ or } D_{tj} \text{ is the CLSD of } D_{si} \text{ in } D_t,\\[4pt] 0 & \text{otherwise},\end{cases}$$

are the correlation coefficients of the local sub-domains, $w$ is the low-dimensional embedded projection, $\phi(\cdot)$ is a nonlinear projection from the original input space to the high-dimensional Hilbert space, and $\|\cdot\|_F$ is the Frobenius norm.

Regarding Definition 3, when we let $\beta_{si} = \beta_{tj} = 1$, $k_1 = n_s$ and $k_2 = n_t$, the PMLMD measure reduces to PMMD; if, in addition, the low-dimensional embedded projection is linear, PMLMD reduces to MMD. Therefore, at this level, PMLMD does not alter the manifestation of the distribution discrepancy between domains but generalizes MMD and PMMD to some extent.

Please note that $w(\phi(\cdot))$ in Eq. (5) is a low-dimensional embedding operator in the high-dimensional Hilbert space, so that it can be written as $w(\phi(\cdot)) = \omega^T\phi(\cdot)$, where $\omega$ is a nonlinear projection on the high-dimensional Hilbert space. According to the Representer Theorem [16], $\omega$ is formed by linear combinations of the samples in the Hilbert space, that is, $\omega = \sum_{r=1}^{n_s+n_t} a_r \phi(x_r)$, in which $a_r$ is the corresponding correlation coefficient of the sample $\phi(x_r)$. If we let $X = D_s \cup D_t = \{x_{s1}, \ldots, x_{sn_s}, z_{t1}, \ldots, z_{tn_t}\} \subset R^{n\times(n_s+n_t)}$ and $a = (a_{s1}, \ldots, a_{sn_s}, a_{t1}, \ldots, a_{tn_t})^T \in R^{n_s+n_t}$, then $\omega$ can also be written as $\omega = \phi(X)a$, and Eq. (5) is simplified by the following theorem.

Theorem 1. PMLMD in Eq. (5) can be simplified as

$$dist^2_{PMLMD}(D'_s, D'_t) = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} c_{ij}\, dist^2_{PMLMD}(D'_{si}, D'_{tj}) = tr(a^T K L K^T a), \qquad (6)$$

where $L$ is the global distribution discrepancy weight matrix and $K = \begin{bmatrix} K_{ss} & K_{st} \\ K_{ts} & K_{tt} \end{bmatrix} \in R^{(n_s+n_t)\times(n_s+n_t)}$, in which $K_{ss}$, $K_{tt}$ and $K_{st}$ are the kernel matrices on the source domain, on the target domain and across the domains, respectively.
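Algorithm 1 in Appendix B (referenced below) gives the authors' construction of the weight matrix L; the following is only a hypothetical sketch of how L in Eq. (6) could be assembled from the CLSD pairings and the LWM weights, under a pooled-sample data layout of our own choosing.

```python
import numpy as np

def pmlmd_weight_matrix(pairs, n_total):
    # pairs: list of (s_idx, s_beta, t_idx, t_beta, c) for each matched sub-domain
    # pair (Dsi, Dtj); s_idx/t_idx index into the pooled sample set X = Ds U Dt
    # (n_total = ns + nt), s_beta/t_beta are the normalised LWM weights, and c is
    # the correlation coefficient c_ij (1/(ns+nt) for matched CLSD pairs).
    # Each pair contributes c * s s^T with s sparse (+beta on the source
    # neighbours, -beta on the target neighbours), so that
    # dist^2_PMLMD = tr(a^T K L K^T a) as in Theorem 1.
    L = np.zeros((n_total, n_total))
    for s_idx, s_beta, t_idx, t_beta, c in pairs:
        s = np.zeros(n_total)
        s[s_idx] += s_beta        # indices assumed distinct within a sub-domain
        s[t_idx] -= t_beta
        L += c * np.outer(s, s)
    return L
```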
The proof of the above theorem can be found in Appendix A. From Eq. (6), we know that PMLMD hinges on the weight matrix L of the global distribution discrepancy between the source and target domains. In order to make the subsequent methods applicable, we explicitly construct the algorithm for computing the weight matrix L; see Algorithm 1 in Appendix B.

According to Theorem 1, PMLMD can be transformed, via Eq. (6), into a function similar to PMMD but with a different inner structure and geometrical meaning. This transformation is useful for building the objective function of SDAL in this paper and makes it easy to integrate classical statistical learning methods to develop sub-domain adaptation learning algorithms with local learning capability.

3.2.2. PMLMD based objective function
From the above analysis, the MMD-based DAL frameworks [37,9,10,28,33] are, to some extent, unable to reflect the inner local structure and local information of data spaces having different distributions. We therefore propose a PMLMD based objective function.
Definition 4 (PMLMD Based Objective Function). For any source domain $D_s = \{x_{si}\}_{i=1}^{n_s}$ and target domain $D_t = \{z_{tj}\}_{j=1}^{n_t}$, the corresponding embedded subspaces are $D'_s = \{w(x_{si})\}_{i=1}^{n_s}$ and $D'_t = \{w(z_{tj})\}_{j=1}^{n_t}$ respectively, where $w$ is the projection operator. The PMLMD based objective function is

$$\arg\min_{f\in\mathcal{H}_K} J(f) = \arg\min_{f\in\mathcal{H}_K}\left(R(f, k, D_s) + \lambda\, dist^2_{PMLMD}(D'_s, D'_t)\right) \qquad (7)$$

where $f$ on the left of the equation is the decision function in the high-dimensional Hilbert space $\mathcal{H}_K$ induced by the kernel function $k$; $R(f, k, D_s)$ on the right is the risk function, which depends on the labeled samples of the source domain $D_s$; $dist^2_{PMLMD}(D'_s, D'_t)$ is the local distribution discrepancy between $D'_s$ and $D'_t$; and $\lambda > 0$ is introduced to balance the distribution difference between the two domains against the risk function on the labeled patterns. As discussed above, at this level SDAL has better generalization capability.
3.3. Sub-domain adaptation learning method
As we know, by integrating traditional statistical learning methods such as SVM, MLC or Support Vector Regression (SVR), the objective function in Eq. (7) can be turned into a range of sub-domain adaptation learning methods. For simplicity, we choose only MLC and SVM to build two novel sub-domain adaptation learning methods.
243 244 245 246
250 251
253
3.3.1. MLC-SDAL: MLC based sub-domain adaptation learning method P Pns in MLC [14] to model the first objective R(f, k, Ds) in Eq. (7). We use the cost function m p¼1 i¼1 ‘ðxsi ; f p ; Y sip Þ þ hXðf p Þ
254
Correspondingly we build the objective function of MLC-SDAL.
252
255 256 257 258
259
261 262 263 264 265
266
268 269 270 271 272 273
274
276 277 278
279
s Definition 5 (MLC-SDAL). Suppose a source domain Ds ¼ Xs ¼ fxsi gjni¼1 2 Rnns of m classes. Y 2 Rns m is the class label indicator matrix that encodes the class information of ns samples to m classes. Let the target domain be t Dt ¼ Xt ¼ fztj gjnj¼1 2 Rnnt . The low-dimensional embedded sub-space of Ds is D0s and that of Dt is D0t . For 8xsi 2 Rn , if the Q4 sample belongs to the pth class, let Y sip ¼ 1; otherwise, let Y sip ¼ 0. Then, the objective function of MLC-SDAL is
arg minJðf p ; WÞ ¼ arg min WT W¼Il
WT W¼Il
! ! ns m X X 2 ‘ðxsi ; f p ; Y sip Þ þ hXðf p Þ þ kdisPMLMD D0s ; D0t ; p¼1
where fp is the classification decision function of the pth class, W ¼ ½x1 ; . . . ; xl 2 Rnl is the orthogonal projection transformation matrix related to the optimization problem in Eq. (8), Il is l order unit matrix. According to Eq. (7) in Appendix A related to Theorem 1, let /(X) = X where X = Xs [ Xt, and thus Eq. (7) can be 2 tr(WTXLXTW), which indicates tr(WTXLXTW) is the linear form of distPMLMD D0s ; D0t . Therefore, Eq. (8) can be transformed into
! ! ns m X X T T arg minJðf p ; WÞ ¼ arg min ‘ðxsi ; f p ; Y sip Þ þ hXðf p Þ þ ktrðW XLX WÞ : WT W¼Il
WT W¼Il
p¼1
282 283 284 285
286 287 288
ð9Þ
i¼1
We adopt the method in [14] to define the classification decision function of the pth class; therefore, let T f p ðxÞ ¼ uTp x ¼ g Tp x þ hp WT x in which up 2 Rn is the weight vector of the decision function; g p 2 Rn is the weight vector of the original input space; and hp 2 Rl is the weight vector 2of the embedded subspace. We use the least square error to replace the loss function ‘, that is, let ‘ðxsi ; f p ; Y il Þ ¼ uTp xsi Y il . The structural risk function related to the labeled samples in the source domain Ds in Eq. (9) can be written as ns m X X ‘ðxsi ; f p ; Y sip Þ þ hXðf p Þ p¼1
! ¼
i¼1
1 kXT U Ys k2F þ hkU WHk2F þ gkUk2F ; ns s
ð10Þ
where U = [u1, . . . , um], H = [h1, . . . , hm]. Input Eq. (10) into Eq. (9) and we get the final objective function of MLC-SDAL:
arg min JðH; U; WÞ ¼ arg min 281
ð8Þ
i¼1
WT W¼Il
WT W¼Il
1 kXTs U Ys k2F þ hkU WHk2F þ gkUk2F þ ktrðWT XLXT WÞ : ns
ð11Þ
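The paper obtains H*, U* and W* with the method of [14,6] (discussed next); what follows is only a hypothetical sketch of one alternating minimisation of Eq. (11), with the closed-form updates being our own reading rather than the authors' exact procedure.

```python
import numpy as np

def mlc_sdal_alternating(Xs, Ys, X, L, l, theta, eta, lam, n_iter=20):
    # Hypothetical alternating minimisation of Eq. (11).
    # Xs: n x ns labelled source samples, Ys: ns x m label indicator matrix,
    # X: n x (ns+nt) pooled samples, L: PMLMD weight matrix (Theorem 1),
    # l: embedding dimension, theta/eta/lam: trade-off parameters.
    n, ns = Xs.shape
    XLXt = X @ L @ X.T
    W = np.linalg.qr(np.random.default_rng(0).normal(size=(n, l)))[0]
    for _ in range(n_iter):
        # U-step: with W fixed and H eliminated (H = W^T U), U has a
        # ridge-regression-type closed form.
        A = Xs @ Xs.T / ns + theta * (np.eye(n) - W @ W.T) + eta * np.eye(n)
        U = np.linalg.solve(A, Xs @ Ys / ns)
        # W-step: top-l eigenvectors of (theta*U U^T - lam*X L X^T), W^T W = I_l.
        M = theta * (U @ U.T) - lam * XLXt
        vals, vecs = np.linalg.eigh(M)
        W = vecs[:, np.argsort(vals)[::-1][:l]]
    H = W.T @ U
    return U, W, H
```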
In order to solve Eq. (11) effectively, we adopt a method similar to that in [14,6] to obtain the solutions H*, U* and W* of the problem. It can be seen from Eq. (9) that MLC-SDAL is apparently a generalized form of the multi-label classification methods in [14,6].

3.3.2. SVM-SDAL: SVM based sub-domain adaptation learning method
The objective function of Least Squares Support Vector Machines (LS-SVM) [32] is used to model the first objective $R(f, k, D_s)$ in Eq. (7), and we build the objective function of SVM-SDAL in the same way.
Definition 6 (SVM-SDAL). For the training dataset $D_s = X_s = \{x_{si}, y_{si}\}_{i=1}^{n_s} \subset R^n\times\{-1,+1\}$ and the target domain $D_t = X_t = \{z_{tj}\}_{j=1}^{n_t} \in R^{n\times n_t}$, the SVM-SDAL classifier is obtained by minimizing the following optimization problem:

$$\min_{\omega,b} J(\omega, b) = \min_{\omega,b}\left(\frac{C}{2}\|\omega\|^2 + \frac{C}{2}\sum_{i=1}^{n_s}e_{si}^2 + \lambda\, dis^2_{PMLMD}(D'_s, D'_t)\right) \quad s.t.\ y_{si}(\omega^T\phi(x_{si}) + b) = 1 - e_{si},\ i = 1, \ldots, n_s. \qquad (12)$$

According to Theorem 1, since $dist^2_{PMLMD}(D'_s, D'_t) = tr(a^TKLK^Ta)$, Eq. (12) becomes

$$\min_{\omega,b} J(\omega, b) = \min_{\omega,b}\left(\frac{C}{2}\|\omega\|^2 + \frac{C}{2}\sum_{i=1}^{n_s}e_{si}^2 + \lambda\, tr(a^TKLK^Ta)\right) \quad s.t.\ y_{si}(\omega^T\phi(x_{si}) + b) = 1 - e_{si},\ i = 1, \ldots, n_s. \qquad (13)$$
For Eq. (13), according to the Representer Theorem [16], if we let $\omega = \phi(X)a$, we can transform $\|\omega\|^2$ and $\omega^T\phi(x_{si})$ into the following forms:

$$\|\omega\|^2 = a^TKa, \qquad \omega^T\phi(x_{si}) = a^T\phi(X)^T\phi(x_{si}) = a^TK_{si}.$$

Then Eq. (13) can be rewritten as

$$\min_{\omega,b} J(\omega, b) = \min_{\omega,b}\left(a^T\left(\lambda KLK + \frac{C}{2}K\right)a + \frac{C}{2}\sum_{i=1}^{n_s}e_{si}^2\right) \quad s.t.\ y_{si}(a^TK_{si} + b) = 1 - e_{si},\ i = 1, \ldots, n_s. \qquad (14)$$
Like the process of calculating the optimal solution of LS-SVM, we solve Eq. (14) through the following linear system:

$$\begin{pmatrix} 0 & Y_{n_s}^T \\ Y_{n_s} & P + C^{-1}I_{n_s} \end{pmatrix}\begin{pmatrix} b \\ \mu \end{pmatrix} = \begin{pmatrix} 0 \\ 1_{n_s} \end{pmatrix} \qquad (15)$$

where $1_{n_s} = (1, \ldots, 1)^T$, $P_{ij} = y_{si}y_{sj}K_{sj}^T(2\lambda KLK + CK)^{-1}K_{si}$, $\mu = (\mu_1, \ldots, \mu_{n_s})^T$ is the vector of Lagrange multipliers, and $Y_{n_s} = (y_{s1}, \ldots, y_{sn_s})^T$.
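A minimal sketch of assembling and solving the linear system of Eq. (15); the kernel matrix K, the weight matrix L and the labels are assumed given, the labelled source samples are assumed to occupy the first ns columns of K, and the matrix to be inverted is assumed non-singular.

```python
import numpy as np

def svm_sdal_solve(K, L, y, lam, C):
    # K: (ns+nt) x (ns+nt) kernel matrix, y: labels (+1/-1) of the ns source
    # samples, lam/C: trade-off parameters of Eqs. (12)-(15).
    ns = len(y)
    Ks = K[:, :ns]                                  # columns K_si of the labelled samples
    Minv = np.linalg.inv(2 * lam * K @ L @ K + C * K)   # assumed invertible
    P = (y[:, None] * y[None, :]) * (Ks.T @ Minv @ Ks)  # P_ij of Eq. (15)
    A = np.zeros((ns + 1, ns + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = P + np.eye(ns) / C
    rhs = np.concatenate(([0.0], np.ones(ns)))
    sol = np.linalg.solve(A, rhs)
    b, mu = sol[0], sol[1:]                         # bias and Lagrange multipliers
    return b, mu
```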
In order to upgrade the capability of SVM-SDAL for multi-class classification problems, we present Multiple Classification SVM-SDAL (MSVM-SDAL) by integrating the Fuzzy Least Squares Support Vector Machine for Multi-class Problems (FLS-SVM) [24] and using the 'One-against-One' strategy.
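A short sketch of generic 'One-against-One' voting over binary classifiers; the callable classifier handles and their ordering are our own assumptions, not the FLS-SVM formulation of [24].

```python
from itertools import combinations
import numpy as np

def one_against_one_predict(classifiers, classes, X):
    # classifiers: binary SVM-SDAL models, one per class pair, in the order
    # produced by combinations(range(len(classes)), 2); each returns +1 for the
    # first class of its pair and -1 for the second. Majority vote decides.
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for (a, b), clf in zip(combinations(range(len(classes)), 2), classifiers):
        pred = clf(X)
        votes[pred > 0, a] += 1
        votes[pred <= 0, b] += 1
    return np.asarray(classes)[votes.argmax(axis=1)]
```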
3.4. An analysis of time complexity
3.4.1. The time complexity of MLC-SDAL
For MLC-SDAL, most of the time is consumed in calculating the projection matrix W. Generally speaking, the maximum rank of the weight matrix $L = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} c_{ij}L_{ij}$ of the global distribution discrepancy between patches is $n_s n_t$. However, there is always one patch in the target domain that is closest to one of the patches in the source domain and vice versa. Thus, the weight matrix L has at most $(n_s + n_t)$ terms, and the maximum rank of L is $(n_s + n_t)$ in practice. Then the maximum rank of $XLX^T$ is $(n_s + n_t)$. Furthermore, the rank of $UU^T$ is at most $\min(n, m)$, so the rank of $\lambda XLX^T - \theta UU^T$ is at most $\min(n, m + n_s + n_t)$. Suppose the dataset is high-dimensional, that is, $n \gg m + n_s + n_t$; the time complexity of calculating W is then $O(n^2(m + n_s + n_t))$, and so the time complexity of MLC-SDAL is $O(n^2(m + n_s + n_t))$.
3.4.2. The time complexity of SVM-SDAL
For SVM-SDAL, time is mainly spent on two computations: one is the inversion of the $n_s$-rank matrix; the other is the calculation of $(2\lambda KLK + CK)^{-1}$. The time complexity of inverting the $n_s$-rank matrix is $O(n_s^{2.5})$, and that of $(2\lambda KLK + CK)^{-1}$ is $O((n_s + n_t)^{2.5})$ [32]. In the coefficient matrix of Eq. (15), the value of $(2\lambda KLK + CK)^{-1}$ does not change while SVM-SDAL runs, so we can compute $(2\lambda KLK + CK)^{-1}$ in advance, before the coefficient matrix is constructed. Thus, the time complexity of SVM-SDAL is not more than $O(n_s^{2.5})$.
4. Experimental study
In order to evaluate the effectiveness of the proposed algorithms and their extensions in this paper for DAL problems, we systematically compare them with several state-of-the-art algorithms on different datasets. Three different classes of
[Fig. 5 panels: (a) dataset in the source domain; (b) two-moon dataset with rotation angle 30°; (c) two-moon dataset with rotation angle 60°; (d) two-moon dataset with rotation angle 80°.]
Fig. 5. Two-moon dataset based source domain and target domains.
domain adaptation problems are investigated: (1) a series of two-dimensional synthetic problems of different complexity based on a two-moon dataset; (2) several real-world cross-domain text classification problems on domain adaptation datasets such as 20Newsgroups and Reuter-21578 [11]; and (3) a real multi-class, intra-domain face recognition problem on the ORL and Yale datasets.6 For all these datasets, true labels are available for both source and target domain instances; however, prior information related to the target domain is used only for an objective and quantitative assessment of the performance of the proposed methods. We test on the two-moon datasets to show (1) the stronger local learning capability of the proposed framework and (2) the influence of the balance parameter λ on its learning capability. The tests on two real-world high-dimensional text datasets show the capability of the proposed methods to address real-world problems and the influence of the nearest-neighbor parameters k. Finally, through tests on face datasets, we show the multi-class transfer learning capability of the proposed algorithms.
Currently, how to set the algorithm parameters for such methods is an open and active topic. In general, the algorithm parameters are set manually. In order to evaluate the performance of the algorithms, the strategy pointed out in [1] is to specify a set of prior parameters first and then use the best cross-validation mean rate over the set to estimate the generalized accuracy. We adopt this strategy in this paper: cross-validation7 is used on the training set for parameter selection. Finally, please note that in the following sections we conduct the experiments on randomly sampled data for 5 rounds and use 10-fold cross-validation in Sections 4.1 and 4.2 and 4-fold cross-validation in Section 4.3; the mean and standard deviation of the results on the testing data are used for performance evaluation. We choose the ratio of correctly classified samples over the total number of samples as the classification accuracy measure. The testing environment is an Intel Core2 with 2.0 GHz main frequency, 2 GB RAM and the Vista system.
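A small illustrative helper for the evaluation protocol just described (5 random rounds, accuracy reported as mean ± standard deviation); the classifier handle `run_once` is a placeholder of ours, not part of the paper.

```python
import numpy as np

def evaluate(run_once, n_rounds=5, seed=0):
    # run_once(rng) -> accuracy (correctly classified / total) on the test split
    # of one random round; return mean and standard deviation over the rounds,
    # as reported in Tables 1 and 3-5.
    rng = np.random.default_rng(seed)
    accs = np.array([run_once(rng) for _ in range(n_rounds)])
    return accs.mean(), accs.std()
```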
4.1. Experiments on artificial datasets
4.1.1. Data sets
Artificial two-moon data have a typical manifold distribution and are therefore often used to test local learning capability. In order to test the capability of the linear MLC-SDAL and of SVM-SDAL, which can not only realize transfer learning but also keep the local information of the samples, we use a two-moon dataset containing 300 samples as the source domain. The dataset is divided into a positive and a negative class, each of which has 150 samples (see Fig. 5(a)). After rotating the data in the source domain counterclockwise ten times, we obtain 10 target domains with different but related distributions. Fig. 5(b)-(d) show the target domains obtained by rotating the source-domain data by 30°, 60° and 80°, respectively. As Fig. 5 shows, the larger the rotation angle, the larger the distribution discrepancy between the source domain and the target domain, and hence the more complicated the corresponding local adaptation learning problem becomes.
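A sketch of generating a rotated two-moon source/target pair in the spirit of Fig. 5; the exact moon geometry, noise level and scaling used by the authors are not specified, so the generator below is purely illustrative.

```python
import numpy as np

def two_moons(n_per_class=150, noise=0.1, seed=0):
    # Generic two-moon generator used only to mimic the source domain of Fig. 5(a).
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, np.pi, n_per_class)
    upper = np.c_[np.cos(t), np.sin(t)] + rng.normal(0, noise, (n_per_class, 2))
    lower = np.c_[1 - np.cos(t), 0.5 - np.sin(t)] + rng.normal(0, noise, (n_per_class, 2))
    X = np.vstack([upper, lower])
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    return X, y

def rotate(X, degrees):
    # Counter-clockwise rotation used to build the target domains.
    a = np.deg2rad(degrees)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return X @ R.T

Xs, ys = two_moons()
Xt30 = rotate(Xs, 30)   # target domain with rotation angle 30 degrees
```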
4.1.2. Comparison methods
The test is designed as follows: to show the effectiveness of the linear method in this paper (MLC-SDAL), we compare it with MLC [14], a classical supervised classification method, and with Multi-label Classification Domain Adaptation (MCDA) [6], an MCLF-based transfer learning method. Similarly, the nonlinear method in this paper (SVM-SDAL) is compared with the supervised SVM, the semi-supervised TSVM and LMPROJ in terms of transfer learning capability.
6 http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.
7 Due to the distribution difference between the source domain and the target domain in the learning process of a domain adaptation classifier, the generalization capability of classifiers is limited to some extent when cross-validation is used to set the classifier parameters [24,9,11]. But this unbiased method is still widely used in some transfer learning methods [37,28,18,31,26], which we follow to set the parameters.
[Fig. 6 panels: (a) MLC-SDAL and (b) SVM-SDAL; each panel plots classification accuracy against λ for the two-moon datasets with rotation angles 30°, 60° and 80°.]
Fig. 6. Influence of the two-moon correlation parameter λ on classification accuracy.
4.1.3. Implementation details
In the process of testing, we let the parameters of the methods be $k_1 = k_2 \in \{2, 4, 6, 8\}$ and $h_1 = h_2 \in \{2^{-10}, 2^{-8}, 2^{-6}, \ldots, 2^{0}, 2^{2}, \ldots, 2^{6}, 2^{8}, 2^{10}\}$, and we set the parameters θ and η in MLC-SDAL and a and b in MLC and MCDA in the way described in [14,6]. All four nonlinear methods use the Gaussian kernel function, in which σ is set to the square root of the average norm of the training samples [10]. Meanwhile, we use the testing results of the two local learning methods on the two-moon datasets with rotation angles of 30°, 60° and 80° to show the influence of λ in the local domain adaptation learning framework on learning capability. These results are obtained by fixing all other parameters and letting $\lambda \in \{2^{-7}, 2^{-5}, 2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{3}, 2^{5}, 2^{7}\}$; see the results in Fig. 6. We test each method 5 times; the means and standard deviations demonstrate the accuracy, and the mean computation times indicate the efficiency.
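A sketch of the parameter selection just described as a plain grid search; `train_and_score` stands in for training MLC-SDAL or SVM-SDAL with cross-validation on the training set and is an assumption of ours, not the authors' code.

```python
from itertools import product

# Parameter ranges of Section 4.1.3.
k_grid      = [2, 4, 6, 8]
h_grid      = [2.0 ** p for p in range(-10, 11, 2)]
lambda_grid = [2.0 ** p for p in (-7, -5, -3, -2, -1, 0, 1, 3, 5, 7)]

def select_parameters(train_and_score):
    # Return the parameter setting with the best cross-validation mean rate.
    best = None
    for k, h, lam in product(k_grid, h_grid, lambda_grid):
        score = train_and_score(k1=k, k2=k, h1=h, h2=h, lam=lam)
        if best is None or score > best[0]:
            best = (score, dict(k1=k, k2=k, h1=h, h2=h, lam=lam))
    return best[1]
```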
4.1.4. Experimental results
Table 1 shows that the classification performance of all the algorithms decreases as the rotation angle increases, which is consistent with the fact that the larger the rotation angle, the larger the distribution discrepancy between the source and target domains, and hence the poorer the adaptability of the algorithms. We also find that the classification accuracy of MLC and SVM is clearly lower than that of the other methods, which shows to some extent that they are not good at solving transfer learning problems. Furthermore, Table 1 shows that the accuracy of the two algorithms proposed in this paper is higher than that of the other DAL methods, which indicates that the local learning framework in this paper improves the learning capability of the algorithms. In terms of efficiency, however, the proposed methods are not as good as the others, because they need to partition the domains and calculate the CLSD of each sub-domain; we will therefore pay special attention to partitioning the domains efficiently in future work. Fig. 6 shows the influence of the correlation parameter λ on the classification accuracy of MLC-SDAL and SVM-SDAL: as λ varies from small to large, the classification accuracy tends to change from low to high, but a larger λ does not always lead to higher accuracy. This analysis supports that λ affects the transfer learning ability.
4.2. Experiments on high-dimensional text datasets
4.2.1. Data sets
The basic idea of our design is to utilize the hierarchy of the datasets to distinguish domains. Specifically, the task is defined as main-category classification. Each main category is split into two disjoint parts with different sub-categories: one as labeled data in the source domain and the other as unlabeled data in the target domain. Two typical high-dimensional datasets, 20Newsgroups and Reuter-21578, are often used to test domain adaptation learning methods [24,37,9,28,18,13,22,5]. The 20Newsgroups dataset is a text collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. It contains four main categories, i.e., 'comp', 'rec', 'sci' and 'talk', as well as some small categories, such as 'alt.atheism' and 'misc.forsale'. Each of the four main categories contains several subcategories, which are assigned to different domains. Using these four main categories, we create six sub-datasets, denoted 20Ng-1, 20Ng-2, 20Ng-3, 20Ng-4, 20Ng-5 and 20Ng-6, respectively. The detailed descriptions of these sub-datasets are summarized in Table 2. Each of the sub-datasets ensures that the domains of labeled and unlabeled data are related, since they are under
Table 1. Comparison of testing results of 7 algorithms on 7 two-moon datasets with different distributions. Each cell gives accuracy ± standard deviation (computation time); rows are the target domains (rotation angle).

Linear algorithms, columns: MLC | MCDA | MLC-SDAL
10°: 0.8739 ± 0.0679 (0.6855) | 0.9203 ± 0.0392 (0.8676) | 0.9387 ± 0.0262 (1.7920)
15°: 0.8511 ± 0.0743 (0.5333) | 0.9127 ± 0.911 (0.9191) | 0.9253 ± 0.0503 (1.5695)
20°: 0.8099 ± 0.0487 (0.4058) | 0.8809 ± 0.0632 (1.0232) | 0.9128 ± 0.0753 (1.8011)
25°: 0.8648 ± 0.0819 (0.3911) | 0.8716 ± 0.0525 (0.8999) | 0.8894 ± 0.0380 (2.1087)
30°: 0.7658 ± 0.0533 (0.3120) | 0.8480 ± 0.0226 (0.9229) | 0.8667 ± 0.0207 (2.0313)
40°: 0.7099 ± 0.0737 (0.6288) | 0.7946 ± 0.0835 (1.1471) | 0.8065 ± 0.0764 (1.9970)
50°: 0.7023 ± 0.0655 (0.4963) | 0.7622 ± 0.0371 (0.8846) | 0.7809 ± 0.0288 (1.8822)
60°: 0.6681 ± 0.0738 (0.4966) | 0.7323 ± 0.0802 (0.9085) | 0.7483 ± 0.0427 (2.0916)
70°: 0.5268 ± 0.0913 (0.5407) | 0.6451 ± 0.0755 (0.8930) | 0.6901 ± 0.0663 (2.017)
80°: 0.4322 ± 0.0549 (0.5070) | 0.5082 ± 0.0625 (0.8878) | 0.5946 ± 0.0394 (2.2337)

Nonlinear algorithms, columns: SVM | TSVM | LMPROJ | SVM-SDAL
10°: 0.9852 ± 0.0428 (4.3045) | 0.9866 ± 0.0377 (6.6922) | 0.9875 ± 0.0342 (7.9487) | 0.9887 ± 0.0259 (9.1172)
15°: 0.9755 ± 0.0436 (4.2857) | 0.9812 ± 0.0652 (8.0480) | 0.9855 ± 0.0726 (8.2073) | 0.9864 ± 0.0488 (9.6616)
20°: 0.9543 ± 0.0658 (4.9970) | 0.9762 ± 0.0551 (7.4435) | 0.9806 ± 0.0564 (8.7825) | 0.9824 ± 0.0329 (10.5866)
25°: 0.9468 ± 0.0639 (4.7563) | 0.9712 ± 0.0599 (7.0811) | 0.9733 ± 0.0469 (8.0122) | 0.9806 ± 0.0483 (10.0879)
30°: 0.9211 ± 0.0649 (4.1740) | 0.9355 ± 0.0825 (6.7848) | 0.9499 ± 0.0586 (8.6543) | 0.9786 ± 0.0396 (10.0063)
40°: 0.8579 ± 0.0816 (5.3356) | 0.9128 ± 0.0792 (8.002) | 0.9459 ± 0.0683 (8.9879) | 0.9558 ± 0.0439 (8.9814)
50°: 0.7423 ± 0.0545 (4.9972) | 0.8574 ± 0.0614 (8.1733) | 0.9087 ± 0.0334 (8.5446) | 0.9158 ± 0.0408 (9.8644)
60°: 0.7093 ± 0.0772 (5.7197) | 0.7696 ± 0.0562 (7.8991) | 0.8356 ± 0.0361 (8.0921) | 0.8664 ± 0.0285 (9.9875)
70°: 0.5869 ± 0.0643 (4.7251) | 0.6854 ± 0.0491 (7.7843) | 0.7442 ± 0.0599 (9.3007) | 0.8158 ± 0.0379 (9.6570)
80°: 0.5648 ± 0.0697 (4.9109) | 0.5991 ± 0.0474 (7.6086) | 0.6952 ± 0.0367 (8.8666) | 0.7337 ± 0.0329 (9.9928)
Table 2. Descriptions of sub-datasets from 20Newsgroups and Reuter-21578. For each sub-dataset, the table gives the positive (+) and negative (−) subcategories assigned to the source and target domains and the number of features.

20Newsgroups (26,214 features): 20Ng-1 (comp vs rec), 20Ng-2 (comp vs sci), 20Ng-3 (comp vs talk), 20Ng-4 (rec vs sci), 20Ng-5 (rec vs talk) and 20Ng-6 (sci vs talk). For example, in 20Ng-1 the source domain uses comp.graphics and comp.os.ms-windows.misc (+) versus rec.autos and rec.motorcycles (−), while the target domain uses comp.sys.ibm.pc.hardware and comp.sys.mac.hardware (+) versus rec.sport.baseball and rec.sport.hockey (−). The remaining sub-datasets split the subcategories comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.med, sci.electronics, sci.space, talk.politics.guns, talk.politics.mideast and talk.politics.misc between the source and target domains of their two main categories in the same manner.

Reuter-21578: Rut-1 (orgs vs people), source and target domains drawn from orgs.{. . .} (+) versus people.{. . .} (−), 4771 features; Rut-2 (orgs vs places), orgs.{. . .} (+) versus places.{. . .} (−), 4415 features; Rut-3 (people vs places), people.{. . .} (+) versus places.{. . .} (−), 4562 features.

Note: Since there are too many subcategories in Reuters-21578, we omit the composition details of the last three datasets here.
the same main categories. Besides, the domains are also ensured to be different, since they are drawn from different subcategories. Reuters-21578 is one of the most famous test collections for the evaluation of automatic text categorization techniques. It contains five main categories, among which 'orgs', 'people' and 'places' are the three largest. The Reuters-21578 corpus also has a hierarchical structure. Using these three main categories, we generate three sub-datasets, Rut-1 (orgs vs people), Rut-2 (orgs vs places) and Rut-3 (people vs places), for cross-domain classification in the same way as for 20Newsgroups. Since they consist of approximately 500 kinds of subcategories, the detailed description cannot be listed here.

4.2.2. Comparison methods
In this section, the comparison among MLC, MCDA, SVM, TSVM and LMPROJ is conducted, and the parameters are set in the same way as in Section 4.1. In order to show that the methods in this paper surpass the others in dealing with real-world cross-domain text classification problems, we also compare them with the following methods, whose TL capability has been justified by reported experiments: Locally Weighted Ensemble (LWE) [11], Cross-Domain Spectral Classification (CDCS) [18], Kernel Mean Matching (KMM) [13], Transfer Component Analysis (TCA) [22] and Domain Adaptation Support Vector Machine (DASVM) [5]. LWE combines multiple models for transfer learning, whose weights are dynamically assigned according to a model's predictive power on each test example. CDCS, a spectral domain-transfer learning method, uses out-of-domain structural constraints to regularize the in-domain supervision by designing a novel cost function from normalized cut. As a nonparametric TL method, KMM reduces the mismatch between two different domains by directly producing re-sampling weights without distribution estimation. TCA tries to learn transfer components across domains in an RKHS using MMD and performs domain adaptation via a new parametric kernel using feature extraction methods, which can dramatically minimize the distance between domain distributions by projecting data onto the learned transfer components; furthermore, TCA can handle large datasets and naturally leads to out-of-sample generalization. DASVM realizes transfer learning by extending TSVM to label unlabeled target samples progressively while simultaneously removing some auxiliary labeled samples.

4.2.3. Implementation details
In order to improve the efficiency of the methods, we construct six datasets for the experiments according to Table 2: 20Newsgroups5%,8 20Newsgroups15%, 20Newsgroups25%, Reuter5%,9 Reuter15% and Reuter25%. In all these datasets, the samples account

8 The dataset 20Newsgroups5% consists of training and testing datasets. The training dataset includes 5% labeled data and 5% unlabeled data in each subcategory of the source and target domains, respectively, while the testing dataset contains 60% of the unlabeled data in the target domain.
9 The construction process of Reuter5% is the same as that of 20Newsgroups5%.
Table 3. Recognition results of 12 algorithms on 20Newsgroups5% and Reuter5% (accuracy, mean ± standard deviation).

Linear algorithms, columns: MLC | CDCS | MCDA | LWE | MLC-SDAL
20Ng-1: 0.6951 ± 0.0138 | 0.7992 ± 0.0551 | 0.7491 ± 0.0595 | 0.8359 ± 0.0448 | 0.7116 ± 0.0283
20Ng-2: 0.6557 ± 0.0270 | 0.6673 ± 0.0804 | 0.6848 ± 0.0790 | 0.7022 ± 0.0694 | 0.7149 ± 0.0672
20Ng-3: 0.7791 ± 0.0461 | 0.8616 ± 0.0648 | 0.8166 ± 0.0598 | 0.8659 ± 0.0702 | 0.8266 ± 0.0607
20Ng-4: 0.5807 ± 0.0331 | 0.5862 ± 0.0394 | 0.5918 ± 0.0694 | 0.6088 ± 0.0315 | 0.6125 ± 0.0415
20Ng-5: 0.6115 ± 0.0124 | 0.7637 ± 0.0773 | 0.6519 ± 0.0607 | 0.6933 ± 0.0356 | 0.7084 ± 0.0715
20Ng-6: 0.7172 ± 0.0540 | 0.6945 ± 0.0819 | 0.7211 ± 0.0772 | 0.6823 ± 0.0791 | 0.6998 ± 0.0402
Rut-1: 0.6994 ± 0.0217 | 0.7989 ± 0.0812 | 0.7097 ± 0.0593 | 0.8028 ± 0.0509 | 0.7393 ± 0.0613
Rut-2: 0.6817 ± 0.0473 | 0.7019 ± 0.0373 | 0.6929 ± 0.0671 | 0.6621 ± 0.0594 | 0.7238 ± 0.0490
Rut-3: 0.6028 ± 0.0712 | 0.6066 ± 0.0716 | 0.5817 ± 0.0618 | 0.6721 ± 0.0245 | 0.6546 ± 0.0381

Nonlinear algorithms, columns: SVM | TSVM | KMM | TCA | LMPROJ | DASVM | SVM-SDAL
20Ng-1: 0.7515 ± 0.0429 | 0.7667 ± 0.0371 | 0.7826 ± 0.0195 | 0.7813 ± 0.0391 | 0.7837 ± 0.0186 | 0.7911 ± 0.0412 | 0.8133 ± 0.0219
20Ng-2: 0.7081 ± 0.0675 | 0.7103 ± 0.0441 | 0.7331 ± 0.0648 | 0.7554 ± 0.0462 | 0.7633 ± 0.0374 | 0.7965 ± 0.0732 | 0.8239 ± 0.0467
20Ng-3: 0.8393 ± 0.0772 | 0.8455 ± 0.0676 | 0.8579 ± 0.0637 | 0.8877 ± 0.0349 | 0.8997 ± 0.0472 | 0.8879 ± 0.0408 | 0.9047 ± 0.0421
20Ng-4: 0.6027 ± 0.0675 | 0.7255 ± 0.0646 | 0.7419 ± 0.0682 | 0.7646 ± 0.0684 | 0.7559 ± 0.0719 | 0.7734 ± 0.0374 | 0.8298 ± 0.0482
20Ng-5: 0.6441 ± 0.0490 | 0.6896 ± 0.0438 | 0.7090 ± 0.0723 | 0.6981 ± 0.0726 | 0.7058 ± 0.0678 | 0.7339 ± 0.0573 | 0.7765 ± 0.0506
20Ng-6: 0.7146 ± 0.0608 | 0.7451 ± 0.0549 | 0.7514 ± 0.0485 | 0.7723 ± 0.0701 | 0.8072 ± 0.0633 | 0.8173 ± 0.0561 | 0.8271 ± 0.0284
Rut-1: 0.7335 ± 0.0667 | 0.7515 ± 0.0538 | 0.7831 ± 0.0611 | 0.7947 ± 0.0548 | 0.8048 ± 0.0642 | 0.8326 ± 0.0418 | 0.8315 ± 0.0296
Rut-2: 0.7267 ± 0.0437 | 0.7327 ± 0.0538 | 0.7385 ± 0.0514 | 0.7349 ± 0.0682 | 0.7040 ± 0.0551 | 0.7106 ± 0.0281 | 0.7282 ± 0.0447
Rut-3: 0.6483 ± 0.0376 | 0.6779 ± 0.0406 | 0.6927 ± 0.0481 | 0.6788 ± 0.0435 | 0.6581 ± 0.0489 | 0.7059 ± 0.0482 | 0.7395 ± 0.0291
Table 4. Recognition results of 12 algorithms on 20Newsgroups15% and Reuter15% (accuracy, mean ± standard deviation).

Linear algorithms, columns: MLC | CDCS | MCDA | LWE | MLC-SDAL
20Ng-1: 0.7021 ± 0.0303 | 0.8314 ± 0.0467 | 0.7839 ± 0.0618 | 0.8609 ± 0.0572 | 0.7808 ± 0.0337
20Ng-2: 0.6720 ± 0.0194 | 0.6909 ± 0.0783 | 0.7088 ± 0.0824 | 0.7182 ± 0.0713 | 0.7741 ± 0.0591
20Ng-3: 0.8259 ± 0.0109 | 0.8824 ± 0.0594 | 0.8364 ± 0.0570 | 0.8813 ± 0.0628 | 0.8279 ± 0.0513
20Ng-4: 0.6106 ± 0.0402 | 0.5987 ± 0.0421 | 0.6263 ± 0.0461 | 0.6279 ± 0.0389 | 0.6331 ± 0.0318
20Ng-5: 0.6358 ± 0.0236 | 0.7835 ± 0.0507 | 0.6792 ± 0.0438 | 0.7109 ± 0.0554 | 0.7213 ± 0.0593
20Ng-6: 0.7248 ± 0.0386 | 0.7164 ± 0.0681 | 0.7324 ± 0.0692 | 0.7044 ± 0.0613 | 0.7157 ± 0.0246
Rut-1: 0.7103 ± 0.0343 | 0.8157 ± 0.0554 | 0.7445 ± 0.0533 | 0.8097 ± 0.0621 | 0.7644 ± 0.0597
Rut-2: 0.6913 ± 0.0182 | 0.7299 ± 0.0404 | 0.7225 ± 0.0517 | 0.7039 ± 0.0392 | 0.7379 ± 0.0579
Rut-3: 0.6380 ± 0.0694 | 0.6170 ± 0.0673 | 0.6149 ± 0.0429 | 0.6784 ± 0.0509 | 0.6881 ± 0.0309

Nonlinear algorithms, columns: SVM | TSVM | KMM | TCA | LMPROJ | DASVM | SVM-SDAL
20Ng-1: 0.8016 ± 0.0587 | 0.8290 ± 0.0276 | 0.8326 ± 0.0318 | 0.8466 ± 0.0344 | 0.8471 ± 0.0254 | 0.8420 ± 0.0281 | 0.8479 ± 0.0167
20Ng-2: 0.7217 ± 0.0439 | 0.7382 ± 0.0683 | 0.7812 ± 0.0660 | 0.8104 ± 0.0399 | 0.8167 ± 0.0424 | 0.8273 ± 0.0589 | 0.8619 ± 0.0308
20Ng-3: 0.8511 ± 0.0613 | 0.8763 ± 0.0749 | 0.8917 ± 0.0445 | 0.9198 ± 0.0472 | 0.9249 ± 0.0366 | 0.9391 ± 0.0697 | 0.9314 ± 0.0297
20Ng-4: 0.6601 ± 0.0673 | 0.7608 ± 0.0732 | 0.7895 ± 0.0641 | 0.8290 ± 0.0316 | 0.8137 ± 0.0634 | 0.8208 ± 0.0687 | 0.8617 ± 0.0341
20Ng-5: 0.7099 ± 0.0474 | 0.7217 ± 0.0379 | 0.7587 ± 0.0572 | 0.7638 ± 0.0694 | 0.7785 ± 0.0754 | 0.7868 ± 0.0449 | 0.7949 ± 0.0377
20Ng-6: 0.7736 ± 0.0513 | 0.8023 ± 0.0470 | 0.7934 ± 0.0637 | 0.8244 ± 0.0575 | 0.8397 ± 0.0457 | 0.8257 ± 0.0663 | 0.8664 ± 0.0396
Rut-1: 0.7790 ± 0.0638 | 0.7847 ± 0.0609 | 0.8159 ± 0.0536 | 0.8279 ± 0.0578 | 0.8475 ± 0.0608 | 0.8559 ± 0.0396 | 0.8660 ± 0.0307
Rut-2: 0.7588 ± 0.0502 | 0.7731 ± 0.0462 | 0.7939 ± 0.0535 | 0.8127 ± 0.0493 | 0.7498 ± 0.0481 | 0.7643 ± 0.0371 | 0.7938 ± 0.0268
Rut-3: 0.6529 ± 0.0337 | 0.7148 ± 0.0356 | 0.7155 ± 0.0334 | 0.7016 ± 0.0274 | 0.7008 ± 0.0527 | 0.7649 ± 0.0274 | 0.8067 ± 0.0318
20Ng-1 Accuracy
20Ng-2 Accuracy
20Ng-3 Accuracy
20Ng-4 Accuracy
20Ng-5 Accuracy
20Ng-6 Accuracy
Rut-1 Accuracy
Rut-2 Accuracy
Rut-3 Accuracy
Linear Algorithms
MLC CDCS MCDA LWE MLC-SDAL
0.7429 ± 0.0586 0.8425 ± 0.0339 0.8087 ± 0.0694 0.8721 ± 0.0507 0.8129 ± 0.0198
0.6997 ± 0.0891 0.7013 ± 0.0719 0.7309 ± 0.0813 0.7715 ± 0.0827 0.7846 ± 0.0772
0.8326 ± 0.0791 0.9016 ± 0.0608 0.8476 ± 0.0662 0.8997 ± 0.0699 0.8555 ± 0.0564
0.6179 ± 0.0528 0.6526 ± 0.0486 0.6557 ± 0.0638 0.6988 ± 0.0491 0.6924 ± 0.0554
0.6721 ± 0.0843 0.7929 ± 0.0684 0.6887 ± 0.0542 0.7278 ± 0.0621 0.7389 ± 0.0646
0.7428 ± 0.0743 0.7192 ± 0.0710 0.7438 ± 0.0698 0.7488 ± 0.0692 0.7519 ± 0.0384
0.7321 ± 0.0813 0.8462 ± 0.0797 0.7586 ± 0.0776 0.8204 ± 0.0468 0.7719 ± 0.0756
0.6967 ± 0.0451 0.7349 ± 0.0488 0.7509 ± 0.0502 0.7221 ± 0.0447 0.7615 ± 0.0387
0.6540 ± 0.0713 0.6479 ± 0.0682 0.6588 ± 0.0594 0.6905 ± 0.0487 0.6953 ± 0.0403
No-linear Algorithms
SVM TSVM KMM TCA LMPROJ DASVM SVM-SDAL
0.8209 ± 0.0577 0.8407 ± 0.0397 0.8547 ± 0.0290 0.8662 ± 0.0227 0.8514 ± 0.0276 0.8588 ± 0.0385 0.8705 ± 0.0195
0.7619 ± 0.0545 0.7438 ± 0.0694 0.7929 ± 0.0742 0.8326 ± 0.0510 0.8485 ± 0.0426 0.8478 ± 0.0669 0.8827 ± 0.0289
0.9017 ± 0.0667 0.8904 ± 0.0713 0.9190 ± 0.0529 0.9410 ± 0.0288 0.9521 ± 0.0534 0.9493 ± 0.0717 0.9557 ± 0.0383
0.6883 ± 0.0616 0.7828 ± 0.0678 0.8301 ± 0.0669 0.8420 ± 0.0616 0.8409 ± 0.0659 0.8529 ± 0.0571 0.8711 ± 0.0359
0.7129 ± 0.0615 0.7319 ± 0.0557 0.7812 ± 0.0601 0.7737 ± 0.0704 0.7915 ± 0.0707 0.8102 ± 0.0611 0.8121 ± 0.0318
0.7928 ± 0.0512 0.8202 ± 0.0661 0.8146 ± 0.0737 0.8358 ± 0.0690 0.8546 ± 0.0593 0.8482 ± 0.0703 0.8755 ± 0.0345
0.8158 ± 0.0793 0.8198 ± 0.0671 0.8434 ± 0.0529 0.8499 ± 0.0602 0.8552 ± 0.0746 0.8701 ± 0.0389 0.8849 ± 0.0397
0.7754 ± 0.0418 0.7809 ± 0.0399 0.8027 ± 0.0492 0.8155 ± 0.0551 0.8007 ± 0.0551 0.8097 ± 0.0487 0.8117 ± 0.0331
0.6792 ± 0.0414 0.7235 ± 0.0333 0.7348 ± 0.0336 0.7038 ± 0.0318 0.7094 ± 0.0437 0.7969 ± 0.0227 0.8276 ± 0.0355
Note: MLC and SVM are supervised methods, TSVM is a semi-supervised method, and the others are TL methods.
Table 5. Recognition results of 12 algorithms on 20Newsgroups25% and Reuter25% (accuracy, mean ± standard deviation).
[Fig. 7 consists of two panels, (a) MLC-SDAL and (b) SVM-SDAL, each plotting accuracy against k1 = k2 (from 0 to 20) for 20Ng-2, 20Ng-4 and Rut-1.]
Fig. 7. Influence of the nearest neighbor parameters k1 and k2 on the local learning ability of MLC-SDAL and SVM-SDAL.
Fig. 8. Source and target domain samples constructed from the ORL dataset: (a) source domain samples; (b) target domain samples with noise following a normal distribution; (c) target domain samples with noise following a Poisson distribution; (d) target domain samples with noise following a gamma distribution.
Fig. 9. Source and target domain samples constructed from the Yale dataset: (a) source domain samples; (b) target domain samples with noise following a normal distribution; (c) target domain samples with noise following a Poisson distribution; (d) target domain samples with noise following a gamma distribution.
for the same percentage of the target domains, that is, 60%. Tables 3–5 report the results (mean ± standard deviation) of these experiments. In addition, we use the results of the two local learning methods proposed in this paper on 20Ng-2 and 20Ng-4 (from 20Newsgroups25%) and Rut-1 (from Reuter25%) to show how the nearest neighbor parameters k1 and k2 of the sub-domain adaptation learning framework influence its learning capability. The results in Fig. 7 are obtained by fixing all other parameters and varying only k1 and k2, with k1 = k2 taking values in {1, 2, ..., 10, 12, 14, 16, 18, 20}. The settings of LWE, CDCS, KMM, TCA and DASVM are the same as in [11,18,13,22,5], respectively.

4.2.4. Experimental results
The classification accuracies in Tables 3–5 show that the TL methods are superior to both the traditional supervised methods (MLC, SVM) and the semi-supervised method (TSVM). These tables also show that the improved transfer learning capability of the proposed methods stems, to some extent, from the stronger local learning capability obtained by integrating the local weighted mean. In particular, our methods achieve results similar to or better than those of LWE, which itself has some local learning capability; this indicates that the sub-domain transfer learning framework has stronger local learning capability.
Table 6. Comparison of testing results of 11 algorithms on the ORL and Yale datasets (accuracy, mean ± standard deviation).

Linear algorithms
Dataset | MLC | CDCS | MCDA | LWE | MLC-SDAL
ORL_Gauss | 0.6682 ± 0.0853 | 0.6920 ± 0.0753 | 0.6932 ± 0.0453 | 0.7199 ± 0.0829 | 0.7291 ± 0.0828
ORL_Poiss | 0.6880 ± 0.0951 | 0.7296 ± 0.0649 | 0.6870 ± 0.0685 | 0.8175 ± 0.0493 | 0.7108 ± 0.0365
ORL_Gam | 0.6898 ± 0.0554 | 0.6911 ± 0.0939 | 0.7061 ± 0.0547 | 0.6970 ± 0.0648 | 0.7084 ± 0.0467
Yale_Gauss | 0.5314 ± 0.0857 | 0.5801 ± 0.0033 | 0.5993 ± 0.0764 | 0.6234 ± 0.0725 | 0.6010 ± 0.0346
Yale_Poiss | 0.6005 ± 0.0068 | 0.6277 ± 0.0824 | 0.6479 ± 0.0820 | 0.6188 ± 0.0822 | 0.6219 ± 0.0740
Yale_Gam | 0.6104 ± 0.0930 | 0.6931 ± 0.0752 | 0.7087 ± 0.0958 | 0.7243 ± 0.0866 | 0.6891 ± 0.0808

Non-linear algorithms
Dataset | SVM | TSVM | KMM | TCA | LMPROJ | DASVM | MSVM-SDAL
ORL_Gauss | 0.7002 ± 0.0581 | 0.7204 ± 0.0407 | 0.7579 ± 0.0671 | 0.7869 ± 0.0676 | 0.7769 ± 0.0545 | 0.8379 ± 0.0678 | 0.8436 ± 0.0723
ORL_Poiss | 0.7168 ± 0.0433 | 0.7399 ± 0.0761 | 0.8043 ± 0.0528 | 0.8411 ± 0.0753 | 0.8198 ± 0.0489 | 0.8710 ± 0.0590 | 0.8486 ± 0.0439
ORL_Gam | 0.6997 ± 0.0725 | 0.7420 ± 0.0731 | 0.8210 ± 0.0246 | 0.8510 ± 0.0439 | 0.8546 ± 0.0437 | 0.8537 ± 0.0834 | 0.8807 ± 0.0394
Yale_Gauss | 0.5879 ± 0.0746 | 0.6580 ± 0.0519 | 0.6887 ± 0.0599 | 0.7009 ± 0.0582 | 0.7189 ± 0.0822 | 0.7068 ± 0.0454 | 0.7290 ± 0.0428
Yale_Poiss | 0.6544 ± 0.0803 | 0.6697 ± 0.0348 | 0.7139 ± 0.0790 | 0.7319 ± 0.0706 | 0.7228 ± 0.0493 | 0.7216 ± 0.0761 | 0.7254 ± 0.0644
Yale_Gam | 0.7091 ± 0.0492 | 0.7208 ± 0.0693 | 0.7223 ± 0.0648 | 0.7368 ± 0.0643 | 0.7288 ± 0.0757 | 0.7597 ± 0.0629 | 0.7984 ± 0.0473
capability. As the number of unlabeled target domain samples in the testing dataset increases, the accuracy of our methods improves considerably. In addition, the proposed methods are more stable than the other transfer learning methods. Fig. 7 shows that the learning capability of the framework depends on the values of k1 and k2; that is, it is affected by the size of the local sub-domains, a sensitivity shared by all local learning methods.
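The parameter study in Fig. 7 amounts to a simple grid sweep over k1 = k2 with all other parameters held fixed. A minimal sketch of such a sweep is given below; train_and_score is a hypothetical placeholder for training an SDAL-based method and scoring it on the target test set, not a function defined in this paper.

```python
def sweep_k(train_and_score, X_src, y_src, X_tgt, y_tgt,
            k_values=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20)):
    """Grid sweep over the nearest-neighbour parameters with k1 = k2,
    all other parameters held fixed (as in Fig. 7).

    `train_and_score` is a hypothetical stand-in: it trains an SDAL-based
    method on (X_src, y_src, X_tgt, y_tgt) and returns its accuracy."""
    accuracy = {}
    for k in k_values:
        accuracy[k] = train_and_score(X_src, y_src, X_tgt, y_tgt, k1=k, k2=k)
    return accuracy
```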
4.3. Experiments on face datasets
4.3.1. Data sets
The face datasets ORL and Yale are often used to test the effectiveness of multi-classification methods. ORL consists of 40 classes, each containing 10 faces with different expressions (see Fig. 8); Yale consists of 15 classes, each containing 11 faces with different expressions (see Fig. 9). To build the source and target domains, we randomly chose five samples from each class in ORL and Yale, respectively, as the source domain, and added noise following Gaussian, Poisson and gamma distributions to each class of samples in ORL and Yale, thus obtaining three target domains whose distributions differ from the source domain. We denote the target domains ORL_Gauss, ORL_Poiss and ORL_Gam, and Yale_Gauss, Yale_Poiss and Yale_Gam. The noise data were generated with the Matlab functions normrnd, poissrnd and gamrnd. Figs. 8 and 9 show the original data and the noise-added data of one class in ORL and Yale, respectively.
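As an illustration of this construction, the following sketch uses Python/NumPy in place of the Matlab functions normrnd, poissrnd and gamrnd mentioned above; the noise parameters shown are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_targets(X_src, sigma=0.1, lam=3.0, shape_k=2.0, scale=0.05):
    """Build three target domains by perturbing the source samples X_src
    (n_samples x n_features) with Gaussian, Poisson and gamma noise.
    The parameter values are illustrative only."""
    X_gauss = X_src + rng.normal(0.0, sigma, size=X_src.shape)     # cf. Matlab normrnd
    X_poiss = X_src + rng.poisson(lam, size=X_src.shape)           # cf. Matlab poissrnd
    X_gam   = X_src + rng.gamma(shape_k, scale, size=X_src.shape)  # cf. Matlab gamrnd
    return X_gauss, X_poiss, X_gam

# Example: vectorised face images chosen as the source domain (stand-in data),
# with the perturbed copies forming the three noisy target domains.
X_src = rng.random((200, 1024))
ORL_Gauss, ORL_Poiss, ORL_Gam = make_noisy_targets(X_src)
```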
4.3.2. Comparison methods
MSVM-SDAL, as the extension of SVM-SDAL, combines the characteristics of the Fuzzy Least Squares Multi-classification SVM (FLS-SVM), which makes it better suited to multi-classification problems. Therefore, to test the multi-classification learning capability of the SDAL framework, we compared MLC, SVM, TSVM, MCDA, CDCS, LWE, LMPROJ, KMM, TCA and DASVM with MLC-SDAL and MSVM-SDAL.

4.3.3. Implementation details
Since SVM, TSVM, CDCS, LWE, LMPROJ and DASVM cannot address multi-classification problems directly, we decomposed the multi-classification problems into binary-classification problems for these six methods and tested them with the "one-against-one" strategy. The parameters of these ten methods are set as in Section 4.2. In the experiments, we let k1 = k2 ∈ {2, 4, 6, 8} in MLC-SDAL and MSVM-SDAL, constructed the training data from the source domain samples together with three samples randomly chosen from each class of the target domains, and used the remaining target domain samples as test samples. We repeated the random sampling for five rounds and used 4-fold cross-validation. Table 6 reports the mean and standard deviation of the classification accuracy.
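The evaluation protocol described above (one-against-one decomposition of the binary methods, repeated random sampling, 4-fold cross-validation) can be sketched as follows; the scikit-learn estimators are stand-ins for the compared binary classifiers, not the implementations used in this paper.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

def evaluate_one_vs_one(X, y, rounds=5, folds=4, seed=0):
    """Decompose a multi-class problem into binary sub-problems
    (one-against-one) and report mean/std accuracy over `rounds`
    random 4-fold cross-validations."""
    accs = []
    for r in range(rounds):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            clf = OneVsOneClassifier(SVC(kernel="rbf", C=1.0))  # stand-in binary learner
            clf.fit(X[train_idx], y[train_idx])
            accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```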
4.3.4. Experimental results
Table 6 shows that MLC-SDAL and MSVM-SDAL have stronger multi-classification capability than the other methods. In particular, MSVM-SDAL, owing to the integration of fuzzy theories, is better suited to handling unclassifiable regions, which makes it more effective than the nonlinear SVM, TSVM, LMPROJ and DASVM.
5. Conclusions
We have proposed SDAL, a statistical learning framework with local domain adaptation learning ability, built around a MMD- and LWM-based transfer learning measure (PMLMD) that has local learning capability. PMLMD reflects not only the local distribution discrepancy between the source and target domains but also the inner differences between
the local structures and local information of the domains. Furthermore, owing to the integration of LWM, we can identify the contribution of each sample to preserving the local structure of its sub-domain. On the basis of this framework, we have presented two SDAL-based sub-domain transfer learning algorithms, MLC-SDAL and SVM-SDAL, obtained by integrating classical statistical methods. Both the theoretical analyses and the tests on artificial and real-world datasets show the strong classification capability of these two algorithms. Open problems remain, such as how to improve the efficiency of the algorithms and how to set the parameters of transfer learning methods effectively.
Acknowledgements
This work is supported in part by the National Science Foundation of China under Grants 61375001, 61272210 and 60903100, and by the Natural Science Foundation of Jiangsu Province of China under Grant BK2011417.
Appendix A
A.1. Proof of Theorem 1
Proof 1. Let $X$ be the union of the source and target domains,
$$X = D_s \cup D_t = \{\underbrace{x_{s_1}, \ldots, x_{s_{n_s}}, z_{t_1}, \ldots, z_{t_{n_t}}}_{n_s + n_t}\},$$
and let $w(\phi(x)) = \omega^T \phi(x)$ according to Eq. (5). For two local patches $D_{s_i} \in D_s$ and $D_{t_j} \in D_t$, when the weights in these local patches are extended to the whole data set $X$, we obtain
$$\beta_{s_i} = \Bigg(\underbrace{\beta_{s_i}^{(1)}\Big/\sum_{p_1=1}^{n_s}\beta_{s_i}^{(p_1)},\ \ldots,\ \beta_{s_i}^{(c)}\Big/\sum_{p_1=1}^{n_s}\beta_{s_i}^{(p_1)},\ \ldots,\ \beta_{s_i}^{(n_s)}\Big/\sum_{p_1=1}^{n_s}\beta_{s_i}^{(p_1)}}_{n_s},\ \underbrace{0,\ \ldots,\ 0}_{n_t}\Bigg)^T, \tag{A.1}$$
$$\beta_{t_j} = \Bigg(\underbrace{0,\ \ldots,\ 0}_{n_s},\ \underbrace{\beta_{t_j}^{(1)}\Big/\sum_{p_2=1}^{n_t}\beta_{t_j}^{(p_2)},\ \ldots,\ \beta_{t_j}^{(c)}\Big/\sum_{p_2=1}^{n_t}\beta_{t_j}^{(p_2)},\ \ldots,\ \beta_{t_j}^{(n_t)}\Big/\sum_{p_2=1}^{n_t}\beta_{t_j}^{(p_2)}}_{n_t}\Bigg)^T. \tag{A.2}$$
Then $\sum_{c_1=1}^{k_1}\beta_{s_i}^{(c_1)}\omega^T\phi\big(x_{s_i}^{(c_1)}\big)\big/\sum_{p_1=1}^{k_1}\beta_{s_i}^{(p_1)}$ and $\sum_{c_2=1}^{k_2}\beta_{t_j}^{(c_2)}\omega^T\phi\big(z_{t_j}^{(c_2)}\big)\big/\sum_{p_2=1}^{k_2}\beta_{t_j}^{(p_2)}$ can be rewritten as
$$\frac{\sum_{c_1=1}^{k_1}\beta_{s_i}^{(c_1)}\,\omega^T\phi\big(x_{s_i}^{(c_1)}\big)}{\sum_{p_1=1}^{k_1}\beta_{s_i}^{(p_1)}} = \beta_{s_i}^T\Phi(X)^T\omega, \tag{A.3}$$
$$\frac{\sum_{c_2=1}^{k_2}\beta_{t_j}^{(c_2)}\,\omega^T\phi\big(z_{t_j}^{(c_2)}\big)}{\sum_{p_2=1}^{k_2}\beta_{t_j}^{(p_2)}} = \beta_{t_j}^T\Phi(X)^T\omega. \tag{A.4}$$
Substituting (A.3) and (A.4) into Eq. (5), Eq. (5) becomes
$$\mathrm{dist}^2_{\mathrm{PMLMD}}\big(D'_s, D'_t\big) = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t}\gamma_{ij}\,\big\|\beta_{s_i}^T\Phi(X)^T\omega - \beta_{t_j}^T\Phi(X)^T\omega\big\|_F^2. \tag{A.5}$$
Since $\|A\|_F^2 = \mathrm{tr}(A^TA)$, Eq. (A.5) is transformed into
$$\sum_{i=1}^{n_s}\sum_{j=1}^{n_t}\gamma_{ij}\,\big\|\beta_{s_i}^T\Phi(X)^T\omega - \beta_{t_j}^T\Phi(X)^T\omega\big\|_F^2
= \mathrm{tr}\Bigg(\sum_{i=1}^{n_s}\sum_{j=1}^{n_t}\gamma_{ij}\big(\beta_{s_i}^T\Phi(X)^T\omega - \beta_{t_j}^T\Phi(X)^T\omega\big)^T\big(\beta_{s_i}^T\Phi(X)^T\omega - \beta_{t_j}^T\Phi(X)^T\omega\big)\Bigg)
= \mathrm{tr}\Bigg(\omega^T\Phi(X)\sum_{i=1}^{n_s}\sum_{j=1}^{n_t}\gamma_{ij}\big(\beta_{s_i}\beta_{s_i}^T + \beta_{t_j}\beta_{t_j}^T - 2\beta_{s_i}\beta_{t_j}^T\big)\Phi(X)^T\omega\Bigg). \tag{6}$$
Let $L_{ij} = \beta_{s_i}\beta_{s_i}^T + \beta_{t_j}\beta_{t_j}^T - 2\beta_{s_i}\beta_{t_j}^T$ be the weight matrix of the distribution discrepancy between patches and $R_{ij} = \mathrm{diag}(\underbrace{r_{ij}, \ldots, r_{ij}}_{n_s+n_t})$ be the correlated coefficient matrix; then Eq. (6) can be rewritten as
$$\sum_{i=1}^{n_s}\sum_{j=1}^{n_t}\gamma_{ij}\,\big\|\beta_{s_i}^T\Phi(X)^T\omega - \beta_{t_j}^T\Phi(X)^T\omega\big\|_F^2
= \mathrm{tr}\Bigg(\omega^T\Phi(X)\sum_{i=1}^{n_s}\sum_{j=1}^{n_t}R_{ij}L_{ij}\,\Phi(X)^T\omega\Bigg)
= \mathrm{tr}\big(\omega^T\Phi(X)\,L\,\Phi(X)^T\omega\big), \tag{7}$$
where $L = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t}R_{ij}L_{ij}$ is the weight matrix of the global distribution discrepancy between the source domain and the target domain. According to the Representer Theorem [16], the nonlinear projection transformation $\omega$ can be written as $\Phi(X)\alpha$, where $\alpha = (\alpha_{s_1}, \ldots, \alpha_{s_{n_s}}, \alpha_{t_1}, \ldots, \alpha_{t_{n_t}})^T \in \mathbb{R}^{n_s+n_t}$ is the column vector of $(n_s + n_t)$ coefficients. Defining the inner-product kernel function $k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle = \phi(x_i)^T\phi(x_j)$, Eq. (7) can then be turned into the form of Eq. (6). The theorem is proved. □
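As a numerical sanity check of the identity derived above, the following sketch verifies, for a linear feature map φ(x) = x and randomly generated placeholder data, that the patch-wise sum of Eq. (A.5) coincides with the trace form of Eq. (7); all quantities here are illustrative stand-ins, not data from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
ns, nt, d = 4, 3, 5
n = ns + nt

Phi = rng.normal(size=(d, n))      # Phi(X); linear feature map phi(x) = x
omega = rng.normal(size=d)         # projection vector
beta_s = rng.random((ns, n))       # extended patch weights, cf. Eq. (A.1)
beta_t = rng.random((nt, n))       # extended patch weights, cf. Eq. (A.2)
gamma = rng.random((ns, nt))       # correlated coefficients

lhs, L = 0.0, np.zeros((n, n))
for i in range(ns):
    for j in range(nt):
        # left-hand side: gamma_ij * (beta_si^T Phi^T omega - beta_tj^T Phi^T omega)^2
        diff = beta_s[i] @ Phi.T @ omega - beta_t[j] @ Phi.T @ omega
        lhs += gamma[i, j] * diff ** 2
        # accumulate L = sum_ij gamma_ij * L_ij
        L_ij = (np.outer(beta_s[i], beta_s[i]) + np.outer(beta_t[j], beta_t[j])
                - 2.0 * np.outer(beta_s[i], beta_t[j]))
        L += gamma[i, j] * L_ij

rhs = omega @ Phi @ L @ Phi.T @ omega   # tr(omega^T Phi L Phi^T omega), a scalar here
assert np.isclose(lhs, rhs)
```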
Appendix B
Algorithm 1. The construction of the weight matrix L.
Input: the source domain $D_s$ and the target domain $D_t$ with $X = D_s \cup D_t$; the heat kernel parameters $h_1$ and $h_2$; the k-NN parameters $k_1$ and $k_2$.
Output: the global distribution discrepancy weight matrix $L$.
Step 1: Build the patches in $D_s$ and $D_t$ according to Definition 4, i.e., $D_s = \{D_{s_i}\}_{i=1}^{n_s}$ and $D_t = \{D_{t_j}\}_{j=1}^{n_t}$.
Step 2: For each $D_{s_i} \in D_s$ and each $D_{t_j} \in D_t$:
  Step 2.1: Calculate the weights $\beta_{s_i}$ and $\beta_{t_j}$ on $X$ using Eqs. (A.1) and (A.2), respectively.
  Step 2.2: Build the weight matrix $L_{ij} = \beta_{s_i}\beta_{s_i}^T + \beta_{t_j}\beta_{t_j}^T - 2\beta_{s_i}\beta_{t_j}^T$ of the distribution discrepancy between the local patches $D_{s_i}$ and $D_{t_j}$.
  Step 2.3: Calculate the CLSD in $D_t$ of $D_{s_i}$ and the CLSD in $D_s$ of $D_{t_j}$ according to Eq. (4). When $D_{s_i}$ is the CLSD of $D_{t_j}$ or vice versa, define the correlated coefficient matrix of the local patches as $R_{ij} = \mathrm{diag}(\underbrace{1, \ldots, 1}_{n_s+n_t})$; otherwise, $R_{ij} = \mathrm{diag}(\underbrace{0, \ldots, 0}_{n_s+n_t})$.
Step 3: Calculate the global distribution discrepancy weight matrix $L = \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} R_{ij}L_{ij}$.
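A minimal NumPy sketch of Algorithm 1 is given below. It assumes the extended weight vectors of Eqs. (A.1) and (A.2) and the CLSD test of Step 2.3 are supplied by the caller; beta_s, beta_t and is_clsd are placeholders for the constructions of Definition 4 and Eq. (4) in the main text.

```python
import numpy as np

def build_weight_matrix(beta_s, beta_t, is_clsd):
    """Construct the global distribution discrepancy weight matrix L (Algorithm 1).

    beta_s : (ns, ns+nt) array, row i is the extended weight vector beta_si (Eq. A.1)
    beta_t : (nt, ns+nt) array, row j is the extended weight vector beta_tj (Eq. A.2)
    is_clsd: callable (i, j) -> bool, True if D_si is the CLSD of D_tj or vice versa
    """
    ns, n = beta_s.shape
    nt = beta_t.shape[0]
    L = np.zeros((n, n))
    for i in range(ns):
        b_si = beta_s[i][:, None]                       # column vector
        for j in range(nt):
            b_tj = beta_t[j][:, None]
            # Step 2.2: local discrepancy weight matrix L_ij
            L_ij = b_si @ b_si.T + b_tj @ b_tj.T - 2.0 * b_si @ b_tj.T
            # Step 2.3: R_ij is the identity (r_ij = 1) or zero (r_ij = 0)
            r_ij = 1.0 if is_clsd(i, j) else 0.0
            # Step 3 (accumulated): L = sum_ij R_ij L_ij
            L += r_ij * L_ij
    return L
```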
References
[1] L.G. Abril, C. Angulo, F. Velasco, J.A. Ortega, A note on the bias in SVMs for multi-classification, IEEE Trans. Neural Netw. 19 (4) (2008) 723–725.
[2] C.G. Atkeson, A.W. Moore, S. Schaal, Locally weighted learning, Artif. Intell. Rev. 11 (1–5) (1997) 11–73.
[3] K.M. Borgwardt, A. Gretton, M.J. Rasch, H.P. Kriegel, B. Scholkopf, A.J. Smola, Integrating structured biological data by kernel maximum mean discrepancy, Bioinformatics 22 (14) (2006) 49–57.
[4] H. Bou Ammar, E. Eaton, M.E. Taylor, D. Constantin Mocanu, K. Driessens, G. Weiss, K. Tuyls, An automated measure of MDP similarity for transfer in reinforcement learning, in: Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[5] L. Bruzzone, M. Marconcini, Domain adaptation problems: a DASVM classification technique and a circular validation strategy, IEEE Trans. Pattern Anal. Mach. Intell. 32 (5) (2010) 770–787.
[6] B. Chen, W. Lam, I.W. Tsang, T.L. Wong, Extracting discriminative concepts for domain adaptation in text mining, in: KDD, 2009, pp. 179–188.
[7] N.Y. Deng, Y.J. Tian, The New Method of Data Mining: Support Vector Machine, Science Press, Beijing, 2004.
[8] Z.H. Deng, Y.Z. Jiang, L.B. Cao, S.T. Wang, Knowledge-leverage based TSK fuzzy system with improved knowledge transfer, in: FUZZ-IEEE, 2014, pp. 178–185.
[9] L.X. Duan, I.W. Tsang, D. Xu, Domain transfer multiple kernel learning, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 465–479.
[10] L.X. Duan, D. Xu, I.W. Tsang, Domain adaptation from multiple sources: a domain-dependent regularization approach, IEEE Trans. Neural Netw. Learn. Syst. 23 (3) (2012) 504–518.
[11] J. Gao, W. Fan, J. Jiang, J.W. Han, Knowledge transfer via multiple model local structure mapping, in: KDD, 2008, pp. 283–291.
[12] R. Gopalan, R. Li, R. Chellappa, Unsupervised adaptation across domain shifts by generating intermediate data representations, IEEE Trans. Pattern Anal. Mach. Intell. 36 (11) (2014) 2288–2302.
[13] J. Huang, A. Smola, A. Gretton, K.M. Borgwardt, B. Scholkopf, Correcting sample selection bias by unlabeled data, in: Advances in Neural Information Processing Systems, vol. 19, 2007, pp. 601–608.
[14] S. Ji, L. Tang, S. Yu, J. Ye, Extracting shared subspace for multi-label classification, in: KDD, 2008, pp. 381–389.
[15] T. Joachims, Transductive inference for text classification using support vector machines, in: ICML, 1999, pp. 200–209.
[16] T. Kanamori, S. Hido, M. Sugiyama, A least-squares approach to direct importance estimation, J. Mach. Learn. Res. 10 (1) (2009) 1391–1445.
[17] J. Lee, Riemannian Manifolds: An Introduction to Curvature, Springer-Verlag Press, Berlin, 2003.
[18] X. Ling, W. Dai, G.R. Xue, Q. Yang, Y. Yu, Spectral domain-transfer learning, in: KDD, 2008, pp. 488–496.
[19] M.S. Long, J.M. Wang, G.G. Ding, S.J. Pan, P.S. Yu, Adaptation regularization: a general framework for transfer learning, IEEE Trans. Knowl. Data Eng. 26 (5) (2014) 1076–1089.
[20] S. Ozawa, A. Roy, D. Roussinov, A multitask learning model for online pattern recognition, IEEE Trans. Neural Netw. 20 (3) (2009) 430–445.
[21] S.J. Pan, J.T. Kwok, Q. Yang, Transfer learning via dimensionality reduction, in: AAAI, 2008, pp. 677–682.
[22] S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. 22 (2) (2011) 199–210.
[23] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.
[24] B. Quanz, J. Huan, Large margin transductive transfer learning, in: CIKM, 2009, pp. 1327–1336.
[25] B. Quanz, J. Huan, M. Mishra, Knowledge transfer with low-quality data: a feature extraction issue, IEEE Trans. Knowl. Data Eng. 25 (10) (2013) 1789–1802.
[26] C.W. Seah, I.W. Tsang, Y.S. Ong, Transfer ordinal label learning, IEEE Trans. Neural Netw. Learn. Syst. 24 (11) (2013) 1863–1876.
[27] J. Shell, S. Coupland, Fuzzy transfer learning: methodology and application, Inform. Sci. 293 (2015) 59–79.
[28] J.W. Tao, K.F.L. Chung, S.T. Wang, On minimum distribution discrepancy support vector machine for domain adaptation, Pattern Recog. 45 (11) (2012) 3962–3984.
[29] J.W. Tao, W.J. Hu, S.T. Wang, Sparsity regularization label propagation for domain adaptation learning, Neurocomputing 139 (2014) 202–219.
[30] T. van Erven, P. Harremoës, Rényi divergence and Kullback–Leibler divergence, IEEE Trans. Inform. Theory 60 (7) (2014) 3797–3820.
[31] T. Tommasi, F. Orabona, B. Caputo, Learning categories from few examples with multi model knowledge transfer, IEEE Trans. Pattern Anal. Mach. Intell. 36 (5) (2014) 928–941.
[32] D. Tsujinishi, S. Abe, Fuzzy least squares support vector machines for multiclass problems, Neural Netw. 16 (5–6) (2003) 785–792.
[33] V. Vapnik, Statistical Learning Theory, Wiley Press, New York, 1998.
[34] V. Vural, G. Fung, J. Dy, B. Rao, Semi-supervised classifiers using a-priori metric information, Optim. Meth. Softw. 23 (4) (2006) 521–532.
[35] Y.Y. Wang, S.C. Chen, Z.H. Zhou, New semi-supervised classification method based on modified cluster assumption, IEEE Trans. Neural Netw. Learn. Syst. 23 (5) (2012) 689–702.
[36] Z. Wang, S.C. Chen, New least squares support vector machines based on matrix patterns, Neur. Process. Lett. 26 (1) (2007) 41–56.
[37] D. Zhang, J.R. He, Y. Liu, L. Si, R.D. Lawrence, Multi-view transfer learning with a large margin approach, in: KDD, 2011, pp. 1208–1216.
[38] L. Zhang, W.D. Zhou, P.C. Chang, J. Liu, Z. Yan, T. Wang, F.Z. Li, Kernel sparse representation-based classifier, IEEE Trans. Signal Process. 60 (4) (2012) 1684–1695.
[39] W. Zhang, X.G. Wang, D.L. Zhao, X.O. Tang, Graph degree linkage: agglomerative clustering on a directed graph, in: ECCV, 2012, pp. 428–441.
[40] D.L. Zhao, Z.C. Lin, R. Xiao, X.O. Tang, Linear Laplacian discrimination for feature extraction, in: CVPR, 2007, pp. 1–7.