DT-LET: Deep Transfer Learning by Exploring Where to Transfer
Communicated by Dr. Lu Xiaoqiang
Journal Pre-proof
DT-LET: Deep Transfer Learning by Exploring Where to Transfer Jianzhe Lin, Liang Zhao, Qi Wang, Rabab Ward, Z. Jane Wang PII: DOI: Reference:
S0925-2312(20)30087-4 https://doi.org/10.1016/j.neucom.2020.01.042 NEUCOM 21799
To appear in:
Neurocomputing
Received date: Revised date: Accepted date:
4 November 2019 31 December 2019 10 January 2020
Please cite this article as: Jianzhe Lin, Liang Zhao, Qi Wang, Rabab Ward, Z. Jane Wang, DTLET: Deep Transfer Learning by Exploring Where to Transfer, Neurocomputing (2020), doi: https://doi.org/10.1016/j.neucom.2020.01.042
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Elsevier B.V. All rights reserved.
DT-LET: Deep Transfer Learning by Exploring Where to Transfer Jianzhe Lina , Liang Zhaob , Qi Wangc,d,∗, Rabab Warda , Z. Jane Wanga a Department
of Electrical and Computer Engineering, University of British Columbia, Canada Engineering from Dalian University of Technology, China c School of Computer Science,Unmanned System Research Institute d Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University b Software
Abstract Previous transfer learning methods based on deep network assume the knowledge should be transferred between the same hidden layers of the source domain and the target domains. This assumption doesn’t always hold true, especially when the data from the two domains are heterogeneous with different resolutions. In such case, the most suitable numbers of layers for the source domain data and the target domain data would differ. As a result, the high level knowledge from the source domain would be transferred to the wrong layer of target domain. Based on this observation, “where to transfer” proposed in this paper might be a novel research area. We propose a new mathematic model named DT-LET to solve this heterogeneous transfer learning problem. In order to select the best matching of layers to transfer knowledge, we define specific loss function to estimate the corresponding relationship between high-level features of data in the source domain and the target domain. To verify this proposed cross-layer model, experiments for two cross-domain recognition/classification tasks are conducted, and the achieved superior results demonstrate the necessity of layer correspondence searching. Keywords: Transfer learning, Where to transfer, Deep learning.
✩ DT-LET can be pronounced as ”Data Let”, meaning let data to be transferred to the right place, in accord with ”Where to Transfer” ∗ Corresponding author Email address:
[email protected] (Qi Wang)
Preprint submitted to Neurocomputing
January 23, 2020
1. Introduction Transfer learning or domain adaption aims at digging potential information in auxiliary source domain to assist the learning task in target domain, where insufficient labeled data with prior knowledge exist [1]. Especially for tasks like image classifi5
cation or recognition, the labeled data (the target domain data) are highly required but always not enough as labeling process would be quite tedious and laborious. Without the help of related data (the source domain data), the learning tasks will fail. Therefore, having a better use of auxiliary source domain data by transfer learning methods has attracted researchers’ attention.
10
It should be noted direct use of labeled source domain data on a new scene of target domain would result in poor performance due to the semantic gap between the two domains, even they are representing the same objects [2][3][4][5]. The semantic gap can be resulted from different acquisition conditions(illumination or view angle), and the use of different cameras or sensors. Transfer learning methods are proposed to
15
overcome this semantic gap [6][7][8][9]. Traditionally, these transfer learning methods would adopt linear or non-linear transformation with kernel function to learn a common subspace on which the gap is bridged [10][11][12]. Recent advancement has proven that the features learnt on such common subspace are inefficient. Therefore, deep learning based model has been introduced due to its power on high level feature
20
representation. Current deep learning based transfer learning topics include two research branches, what knowledge to transfer and how to transfer knowledge [13]. For what knowledge to transfer, researchers mainly concentrate on instance-based transfer learning and parameter transfer approaches. Instance-based transfer learning methods assume that
25
only certain parts of the source data can be reused for learning in the target domain by re-weighting [14]. As for parameter transfer approaches, people mainly try to find the pivot parameters in deep network to transfer to accelerate the transfer process. For how to transfer knowledge, different deep networks are introduced to complete the transfer learning process. However, for both research areas, the right correspondence of layers
30
is ignored.
2
2. Limitations and Contributions The current limitation of transfer learning problems are concluded below. For what knowledge to transfer problem, the transferred content might even be negative or wrong. A fundamental problem for current transfer learning work should be nega35
tive transfer[15]. If the knowledge from the source domain to target domain is transferred to wrong layers, the transferred knowledge is quite error-prone. With the wrong prior information added, bad effect can be generated on target domain data. For how to transfer knowledge problem, as the two deep networks for the source domain data and the target domain data need to have the same number of layers, the two models
40
could not be optimal at the same time. This situation is especially important for crossresolution heterogeneous transfer. For data with different resolutions, The data with higher resolution might need more max-pooling layers than the data with lower resolution, and more neural network layers are needed. Based on the above observation and assumption, we propose a new research topic, where to transfer. In this work, the
45
number of layers for two domains does not need to be the same, and optimal matching of layers will be found by the newly proposed objective function. With the best parameters from the source domain data transferred to the right layer of the target domain, the performance of the target domain learning task can be improved. The proposed work is named Deep Transfer Learning by Exploring where to Trans-
50
fer (DT-LET), which is based on Stacked Auto-Encoders[16]. A detailed flowchart is shown in Fig. 1. The main contributions are concluded as follows. • This paper for the first time introduces the where to transfer problem. The deep networks from the source domain and the target domain no longer need to be with the same parameter settings, and the cross-layer transfer learning is proposed in
55
this paper. • We propose a new principle for finding the correspondence between neural net-
works in the source domain and in the target domain by defining new unified objctive loss function. By optimizing this objective function, the best setting of two deep networks as well as the correspondence relationship can be figured out.
3
Source Domain Xs Training Data
SVM
Common Subspace Ω
SVM
Target Domain It Data
CCA ..
..
CCA
..
.. ..
Cross layer matching CCA
.. reshape
.. reshape
Source Domain Co-occurence Data Cs
Target Domain Co-occurenceData Ct
Figure 1: The flowchart of the proposed DT-LET framework. The two neural networks are first trained by the co-occurrence data Cs and Ct . After network training, the common subspace is found and the training l is transferred to such space to train SVM classifier, to classify D . data DS T
60
3. Related Work Deep learning intends to learn nonlinear representation of raw data to reveal the hidden features [17][18]. However, a large number of labeled data are required to avoid over-fitting during the feature learning process. To achieve this goal, transfer 4
learning has been introduced to augment the data with prior knowledge. By aligning 65
data from different domains to high-level correlation space, the data information on different domains can be shared. To find this correlation space, many deep transfer learning frameworks have been proposed in recent years. The main motivation is to bridge the semantic gap between the two deep neural networks of the source domain and the target domain. However, due to the complexity of transfer learning, some trans-
70
fer mechanisms still lack satisfying interpreting. Based on this consideration, quite a few interesting ideas have been generated. To solve how to determine which domain to be source or target problem, Fabio et al. [19] propose to automatically align domains for the source and target domain. To boost the transfer efficiency and find extra profit during the transfer process, deep mutual learning [20] has been proposed to transfer
75
knowledge bidirectionally. The function of each layer in transfer learning is explored in [21]. The transfer learning with unequal classes and data are experimented in [22] and [23] respectively. However, all the above works still just explain what knowledge to transfer and how to transfer knowledge problems. They still ignore interpreting the matching mechanisms between layers of deep networks of the source domain and the
80
target domain. For this problem, We also name it as DT-LET: Deep Transfer Learning by Exploring Where to Transfer. For this work, we adopt stacked denoising autoencoder(SDA) as the baseline deep network for transfer learning. Glorot et al. for the first time employ stacked denoising to learn homogeneous features based on joint space for sentiment classification [24]. The computation com-
85
plexity is further reduced by Chen et al. by the proposing of Marginalized Stacked Denoising Autoencoder(mSDA) [25]. In this work, some characteristics of word vector are set to zero in the equations of expectation to optimize the representation. Still by matching the marginal as well as conditional distribution, Zhang et al. and Zhuang et al. also develop SDA based homogeneous transfer learning framework [26][27]. For
90
heterogeneous case, Zhou et al. [28] propose an extension of mSDA to bridge the semantic gap by finding the cross-domain corresponding instances in advance. Google brain team in recent time introduces generative adversarial network to SDA and propose the Wasserstein Auto-Encoders [29] to generate samples of better quality on target domain. It can be found SDA is with quite high potential, and our work also chooses 5
95
SDA as the basic neural network for the where to transfer problem. 4. Deep Mapping Mechanism The general framework of such deep mapping mechanism can be summarized as three steps, network setting up, correlation maximization, and layer matching. We would like to first introduce the deep mapping mechanism by defining the variables.
100
s , in which the The samples in the source domain are denoted as DS = {Iis }ni=1
l , they labeled data in the source domain is further denoted as DSl = {Xis , Yis }ni=1
are used to supervise the classification process. In the target domain, the samples are t demoted as DT = {Iit }ni=1 . The co-occurrence data [30](the data in the source domain
and the target domain belonging to the same classes but with no prior label information) 105
c in the source domain are denoted as CS = {Cis }ni=1 , in target domain are denoted as
c CT = {Cit }ni=1 . They are further jointly represented by DC = {CiS , CiT }ni c , which
are used to supervise the transfer learning process. The parameters of deep network in the source domain are denoted by ΘS = {W s , bs }, and ΘT = {W t , bt } in the target domain.
110
The matching of layers is denoted by Rs,t = {ri11 ,j1 , ri22 ,j2 , ..., rima ,jb }, in which a
represents the total number of layers for the source domain data, and b represents the
total layers for the target domain data. m is the total number of matching layers. We define here, if m = min{a − 1, b − 1}(as the first layer is the original layer which will not be used to transfer, m is compared with a-1 or b-1 instead of a or b), we define the 115
transfer process as full rank transfer learning; else if m < min{a − 1, b − 1}, we define this case as non-full rank transfer learning.
The common subspace is represented by Ω and the final classifier is represented by Ψ. The labeled data DSl from the source domain are used to predict the label of DT by applying Ψ(Ω(DT )). 120
4.1. Network Setting Up The stacked auto-encoder (SAE) is first employed in the source domain and the target domain to get the hidden feature representation H S and H T of original data as
6
shown in eq. (1) and eq. (2). H S (n + 1) = f (W S (n) × H S (n) + bS (n)), n > 1;
(1)
H S (n) = f (W S (n) × CS + bS (n)), n = 1. H T (n + 1) = f (W T (n) × H T (n) + bT (n)), n > 1;
(2)
H T (n) = f (W T (n) × CT + bT (n)), n = 1. Here W S and bS are parameters from neural network ΘS , W T and bT are param125
eters from neural network ΘT . H S (n) and H T (n) mean the nth hidden layers in the source domain and in the target domain respectively. The two neural networks are first initialized by above functions. 4.2. Correlation Maximization To set up the initial relationship of the two neural networks, we resort to Canonical
130
Correlation Analysis (CCA) which can maximize the correlation between two domains [31]. A multi-layer correlation model based on the above deep networks is further constructed. Both the CS and the CT are projected by CCA to a common subspace Ω on which a uniformed representation is generated. Such projection matrices obtained by CCA are denoted as V S (n) and V T (n). To find optimal neural networks in the
135
source domain and in the target domain, we have two general objectives: to minimize the reconstruction error of neural networks of the source domain and the target domain, and to maximize the correlation between the two neural networks. To achieve the second objective, we need on one hand find the best layer matching, on the other hand maximize the correlation between corresponding layers. To achieve this goal, we can
140
minimize the final objective function
L(Rs,t ) =
Ls (θS ) + LT (θT ) , P (V S , V T )
(3)
in this function, the objective function is defined as L, and L(Rs,t ) is in corresponding with different matching of Rs,t . We would like to generate the best matching by finding the minimum L(Rs,t ). In L(Rs,t ), Ls (ΘS ) and LT (ΘT ) represent the reconstruction
7
errors of data in the source domain and the target domain, which are defined as follows: LS (θS ) = [
ns 1 X 1 ( ||h S S (Cis ) − Xis ||2 )] ns i=1 2 W ,b
(4)
S
S
n −1 nl n l+1 X λ X X S(l) 2 + (Wkj ) 2 j=1 S
l=1
k=1
LT (θT ) = [
mt 1 1 X ( ||hW T ,bT (Cit ) − Cit ||2 )] nt i=1 2
(5)
T
T
n −1 nl n l+1 X λ X X T (l) 2 (Wkj ) , + 2 j=1 T
l=1
k=1
Here nS and nT are the numbers of their layers, nSl and nSl are the numbers of neurons in layer l, and λ is the trade-off parameter. The third term P (Vs , Vt ) represents the domain divergence after projection by CCA which we want to maximize. The definition for this term is in eq. (6)
S
T
P (V , V ) =
S nX −1
l=2
145
where
P
ST
T
q
= H S(l) H T (l) ,
T
V S(l)
P
SS
T P T (l) V S(l) ST V q , P T P S(l) T (l) V T (l) SS V TT V T
= H S(l) H S(l) ,
P
TT
(6) T
= H T (l) H T (l) , By
minimizing eq. (3), we can collectively train the two neural networks θT = {W T , bT } and θS = {W S , bS }. 4.3. Layer Matching After constructing the multiple layers of the networks by eq. (3), we need to further find the best matching for layers after construction of neural networks. As different layer matching would generate different function loss value L in eq. (3), we define the objective function for layer matching as Rs,t = arg min L.
8
(7)
5. Model Training 150
Here we would like to first optimize the eq. 3. As the equation is not joint convex with all the parameters θS , θT , Vs , and Vt , and the two parameters θS and θT are not related with V S and v T , we would like to introduce two-step iteration optimization. 5.1. Step.1: Updating V S , V T with fixed ΘS , ΘT In Eq. (3), the optimization of V S , V T is just related to the dominator term. The
155
optimization of each layer V S (l1 ), V T (l2 )(suppose the layer 11 on source domain is in corresponding with layer l2 on target domain) can be formulated as q
max
V S(l1 ) ,V T (l2 )
T
As V S(l1 ) eq. (8) as
P
SS
T
V S(l1 )
T P T (l2 ) V S(l1 ) ST V q P T P S(l1 ) T (l2 ) V T (l2 ) SS V TT V T
V S(l1 ) = 1 and V T (l2 )
max V S(l1 ) T
s.t.V S(l1 )
T
P
P
ST
SS
P
TT
V T (l2 ) = 1 [31], we can rewrite
V T (l2 ) , T
V S(l1 ) = 1, V T (l2 )
(8)
P
TT
V T (l2 ) = 1
(9)
This is a typical constrained problem which can be solved as a series of unconstrained minimization problems. we introduce the Lagrangian multiplier to solve this problem and we have T
L(wl , V S(l1 ) , V T (l2 ) ) = V S(l1 )
X
ST
V T (l2 )
wlS S(l1 )T X (V V S(l1 ) − 1) SS 2 T X wT + l (V T (l2 ) V T (l2 ) − 1) TT 2
+
(10)
Then we take the partial derivatives for V S Eq. (10) and get X X ∂L T (l2 ) S = V − w V S(l1 ) = 0. l ST SS ∂V S(l1 )
9
(11)
The partial derivatives for V T is the same. Now we can have the final solution as X 160
ST
X−1 XT TT
ST
V S(l1 ) = wl2
X
SS
V S(l1 ) ,
(12)
here we assume wl = wlS = wlT . From here V S (l1 ) and wl can be solved by the generalized eigenvalue decomposition and we can also get the corresponding V T (l2 ). 5.2. Step.2: Updating ΘS , ΘT with fixed V S , V T As ΘS and ΘT are mutual independent and with the same form, we here just demonstrate the solution of ΘS on the source domain (the solution of ΘT can be derived similarly). Actually the objective division operation is with the same function with subtraction operation and we reformulate the objective function as min φ(θS ) = LS (θS ) − Γ(V S , V T ) θS
(13)
Here we apply the gradient descent method to adjust the parameter as ∂φ ∂W S(l1 ) ∂LS (θS ) ∂Γ(V S , V T ) − = ∂W S(l1 ) ∂W S(l1 ) (αS(l1 +1) − β S(l1 +1) + ωl γ S(l1 +1) ) × H S(l1 ) = nc + λS W S(l1 )
(14)
∂φ ∂bS(l1 ) ∂LS (θS ) ∂Γ(V S , V T ) = − = ∂bS(l1 ) ∂bS(l1 ) (αS(l1 +1) − β S(l1 +1) + ωl γ S(l1 +1) ) , nc
(15)
W S(l1 ) = W S(l1 ) − µS
bS(l1 ) = bS(l1 ) − µS
in which
αS(l1 )
−(DSl − H S(l1 ) ) · H S(l1 ) · (1 − H S(l1 ) ), l = nS T = W S(l1 ) αS(l1 +1) · H S(l1 ) · (1 − H S(l1 ) ), l = 2, ..., nS − 1 10
(16)
β S(l1 ) =
γ S(l1 ) =
l = nS
0, T
H T (l2 ) V T (l2 ) V S(l1 ) · H S(l1 ) · (1 − H S(l1 ) ),
(17)
l = 2, ..., nS − 1 0, l = nS T
H S(l1 ) V S(l1 ) V S(l1 ) · H S(l1 ) · (1 − H S(l1 ) ), .
(18)
l = 2, ..., nS − 1
The operator · here stands for the dot product. The same optimization process
works for ΘT on the target domain.
algorithm 1 Deep Mapping Model Training Input: DC = {Cis , Cit }ni c , Input: λS = 1, λT = 1, µS = 0.5, µT = 0.5 Output: Θ(W S , bS ), Θ(W T , bT ), V S , V T 1: function N ETWORK S ET U P 2: Initialize Θ(W S , bS ), Θ(W T , bT ) ← RandomN um 3: repeat 4: for l = 1, 2, ..., nS do 5: V S ← arg min L(ωl , V S(l) ) 6: end for 7: for l = 1, 2, ..., nT do 8: V T ← arg min L(ωl , V T (l) ) 9: end for 10: θS = arg min φ(θS ), θT = arg min φ(θT ) 11: until Convergence 12: end function 13: function L AYER M ATCHING 14: Initialize Rs,t ← RandomM atching 15: Initialize m ← 0, σ ← 1 16: if m < mit then 17: if σ > 0.05 then m 18: Rs,t = E(m) = {e1 (t), e2 (t), ..., ea−1 (t)} 19: E = {E 1 , E 2 , ..., E nc } 20: E(m + 1) = argmax{S(E1 ), S(E2 ), ..., S(Enc )} 21: m = m+1 22: end if 23: end if 24: end function
165
After these two optimizations for each layer, the two whole networks (the source 11
domain network and the target domain network) are further fine-tuned by the backpropagation process. The forward and backward propagations will iterate until convergence. 5.3. Optimization of Rs,t 170
We finally get the minimized L(Rs,t ) by the above procedure. As described previously, we have a − 1 layers used for source domain and b − 1 layers for target domain, the expected computation complexity of exhaustive search can be approximated by
O(a × b), and the problem should be NP-hard to optimize. To reduce this to polynomial complexity, we introduce the Immune Clonal Strategy(ICS) to solve this problem. 175
We take Eq. 7 as the affinity function in the ICS. The source domain layers are regarded as the antibodies, and the target domain layers are viewed as the antigen. Various antibodies have different effectiveness for the antigen. By maximizing the affinity function, the best antibodies are chosen. The detailed optimization procedure is modeled as an iterative process. It includes three phases: clone, mutation, and selection.
180
5.3.1. Clone Phase At the very first, we set the a − 1 layers on the source domain as the antibodies
E(m) = {e1 (m), e2 (m), ..., ea−1 (m)}. For random ei (t) we have b(as ei (t) = 0 is also one choice) choices of values, representing the corresponding layers in the target domain. For notational simplicity, we omit the iteration number m in the following 185
explanation with no loss of understandability. The initial E(m) is cloned for nc times and we get E = {E 1 , E 2 , ..., E nc }. 5.3.2. Mutation Phase The randomly chosen antibodies are not the best. Therefore, a mutation phase after the clone phase is necessary. For example, for the antibody E i , we randomly replace
190
Nm representatives in its clone E i by the same number of elements. There is no doubt that these newly introduced elements should differ from the former representatives, which enrich the diversity of the original antibodies. After this, mutated antibodies E = {E 1 , E 2 , ..., E nc } are obtained. 12
5.3.3. Selection Phase With the obtained antibodies, which are manifestly more various than the original set, we will select the most promising ones for the next round of processing. The principle is also defined with the affinity values. Higher ones indicate more fitness. Therefore,we have E(m + 1) = argmax{S(E1 ), S(E2 ), ..., S(Enc )} 195
(19)
which means that the antibody with the largest affinity value is taken as E(m + 1) to enter the next iteration. The iteration does not terminate until the change between S(E(m)) and S(E(m + 1)) is smaller than threshold or the maximum number of iteration mit is reached. The final E(m) then is output as the minimized Rs,t . However, after our experi-
200
ments, we heuristically find the number of matching layers would almost be in direct proportion to the resolution of images. The training process is finally summarized in Alg. (1). 5.4. Classification on Common Semantic Subspace The final classification is performed on the common subspace Ω. The unlabeled
205
data on the target domain DT and the labeled DS are both projected to the common subspace Ω by the correlation coefficients V S (nS ) and V T (nT ). The projection is formulated as H S = AS (nS )V S (nS ) in the source domain, and H T = AT (nT )V T (nT ) in the target domain. The standard SVM algorithm is applied on Ω. The classifier Ψ is trained by {HiS , YiS }ni S . This trained classifier Ψ is applied to DT as Ψ(H T ). The
210
pseudo-code of the proposed approach can be found in Algorithm (2) below. 6. Experiments
We carry out our DT-LET framework on two cross-domain recognition tasks, handwritten digit recognition, and text-to-image classification.
13
algorithm 2 Classification on Common Semantic Subspace Input: X S , Y S , V S , X T , V T Input: Θ(W S , bS ), Θ(W T , bT ), ns , nt Output: Y T 1: function SVMTRAINING (X S , Y S , V S , Θ(W S , bS ), ns ) 2: for i = 1, 2, 3, .., ns do 3: Calculate H S (ns ) for X S (ns ) by Θ(W S , bS ) as Eq. (1) 4: AS ← H S (ns )V S (ns ) 5: end for 6: Ψ ← {AS , Y S } 7: end function 8: function SVMTESTING (X T , V T , Θ(W T , bT ), nt ) 9: for j = 1, 2, 3, .., nT do 10: Calculate H T (nt ) for X T (nt ) by Θ(W T , bT ) as Eq. (2) 11: AT ← H T (nt )V T (nt ) 12: Y T ← Ψ(AT ) 13: end for 14: end function 6.1. Experimental Dataset Descriptions 215
Handwritten digit recognition:For this task, we mainly conduct the experiment on Multi Features Dataset collected from UCI machine learning repository. This dataset consists of features of handwritten numerals (0-9, in total 10 classes) extracted from a collection of Dutch utility maps. 6 features exist for each numeral and we choose the most popular features 216-D profile correlations and 240-D pixel averages in 2*3
220
windows to complete the transfer learning based recognition task. Text-to-image classification:For this task, we make use of NUS-WIDE dataset. In our experiment, the images in this dataset are represented with 500-D visual features and annotated with 1000-D text tags from Flickr. 10 categories of instances are included in this classification task, which are birds, building, cars, cat, dog, fish, flowers,
225
horses, mountain, and plane. 6.2. Comparative Methods and Evaluation As the proposed ET-LET framework mainly have four components, deep learning, CCA, layer matching, and SVM classifier, we first select the baseline method, DeepCCA-SVM(DCCA-SVM)[25] as baseline comparison methods. We also conduct ex-
230
periment just without layer matching(the number of layers are the same on the source 14
and the target domains) while all the other parameters are the same with the proposed ET-LET, and we name this framework NoneDT-LET. The other deep learning based comparison methods are duft-tDTNs[32], DeepCoral[33], DANN[34], and ADDA[35] methods. For these methods, duft-tDTNs is the most rep235
resentative, which should be up to now heterogenous transfer learning method with best performance. DeepCoral should be the first deep learning based domain adaptation framework, DANN should for the first time introduce the domain adversarial concept to domain adaptation, and ADDA is the most famous unsupervised domain adaptation method.
240
For the deep network based method, the DCCA-SVM, duft-tDTNs, DeepCoral, DANN, NoneDT-LET are all with 4 layers for the source domain and the target domain data, as we find more or less layer would generate worse performance. At last, for the evaluation metric, we select the classification accuracies on the target domain data over the 2 pairs of datasets.
245
6.3. Task 1: Handwritten Digit Recognition In the first experiment, we conduct our study for handwritten digit recognition. The source domain data are the 240-D pixel averages in 2*3 windows feature, while the target domain data are the 216-D profile correlations feature. As there are 10 classes 2 in total, we complete 45 (C10 ) binary classification tasks, for each category, the ac-
250
curacy is the average accuracy of 9 binary classification tasks. We use 60% data as co-occurrence data to complete the transfer learning process and find the common subspace, 20% labeled samples on source domain as the training samples, and the rest samples on target domain as the testing samples to complete the classification process. The experiments are repeated for 100 times with 100 sets of randomly chosen training
255
and testing data to avoid data bias [36]. The final accuracy is the average accuracy of the 100 repeated experiments. This data setting applies for all four methods under comparison. For the deep network, the numbers of neurons of 4 layer networks are 240-170100-30 for source domain data and 216-154-92-30 for target domain data, this setting
260
works for the all comparison methods. For the proposed DT-LET, we find the best two 15
duft-tDTNs
DeepCoral
DANN
ADDA
NoneDT-LET
3 ) DT-LET(r5,4
2 ) DT-LET(r4,3
numeral 0 1 2 3 4 5 6 7 8 9
DCCA-SVM
Table 1: Classification Accuracy Results on Multi Feature Dataset. (The best performance is emphasized by boldface.)
0.961 0.943 0.955 0.945 0.956 0.938 0.958 0.962 0.948 0.944
0.972 0.956 0.972 0.956 0.969 0.949 0.966 0.968 0.954 0.958
0.864 0.805 0.855 0.873 0.881 0.815 0.893 0.847 0.904 0.915
0.923 0.941 0.911 0.961 0.933 0.946 0.968 0.929 0.968 0.963
0.966 0.978 0.982 0.973 0.980 0.970 0.961 0.979 0.971 0.969
0.983 0.964 0.979 0.966 0.980 0.958 0.978 0.978 0.965 0.970
0.989 0.976 0.980 0.976 0.987 0.971 0.988 0.975 0.968 0.976
0.984 0.982 0.989 0.975 0.983 0.977 0.986 0.985 0.975 0.961
2 3 layer matching with lowest loss after 25 iterations are r4,3 and r5,4 . The numbers of 2 neurons for r4,3 are 240-170-100-30 for source domain data and 216-123-30 for target
domain data. The average objective function loss of the all 45 binary classification 3 tasks for these two are 0.856 and 0.832 respectively. The numbers of neurons for r5,4 265
are 240-185-130-75-30 for source domain data and 216-154-92-30 for target domain data. The one-against-one SVM classification is applied for final classification. The average classification accuracies of 10 categories are shown in Tab. 1. The matching correlation is detailed in Fig. 2. As can be found in Tab. 1, the best performances have been highlighted, which all
270
exist in DT-LET framework. However, the best performances for different categories 3 2 do not exist in the framework with same layer matching. Overall, r5,4 and r4,3 should
be the best two layer matchings compared with other settings. Based on these results, we heuristically get the conclusion that the best layer matching ratio(5/4, 4/3) is generally in direct proportion to the dimension ratio of original data(240/216). However, 275
more matched layers do not guarantee better performance as the classification results 2 for number “1”, “2”, “5”, “7”, “8” of DT-LET (r4,3 ) with 2 layer matchings perform 3 better than DT-LET(r5,4 ) with 3 layer matchings.
16
Layer matching for Multi Feature Dataset ..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
DCCA-SVM, duft-tDTNs, NoneDT-LET
..
..
..
..
DT-LET, 4(source)-3(target) layer matching ..
..
DT-LET, 5(source)-4(target) layer matching
Figure 2: The comparison of different layer matching setting for different frameworks on Multi Feature Dataset.
6.4. Task 2: Text-to-image Classification In the second experiment, we conduct our study for Text-to-image classification. 280
The source domain data are the 1000-D text feature, while the target domain data are 2 the 500-D image feature. As there are 10 classes in total, we complete 45 (C10 ) binary
classification tasks. We still use 60% data as co-occurrence data[30], 20% labeled samples on source domain as the training samples, and the rest samples on target domain as the testing samples. The same data setting as Task 1 applies for all four methods 285
under comparison. For the deep network, the numbers of neurons of 4 layer networks are 1000-750500-200 for source domain data and 500-400-300-200 for target domain data, this setting works for the all comparison methods. For the proposed DT-LET, we find the best 2 3 2 two layer matchings with lowest loss after 25 iterations are r5,3 , r5,4 and r5,4 (non-
290
full rank). The average objective function loss of 45 binary classification tasks for these two layer matchings are 3.231, 3.443 and 3.368. The numbers of neurons for 2 r5,3 are 1000-800-600–400-200 for source domain data and 500-350-200 for target do3 2 main data. The numbers of neurons for both r5,4 and r5,4 are 1000-750-500-200 for
source domain data and 500-400-300-200 for target domain data. As matching prin295
2 ciple would also influence the performance of transfer learning, we present two r5,3
with different matching principles as shown in Fig. 3(the average objective function loss for the two different matching principles are 3.231 and 3.455), in which all the detailed layer matching principles are described. For this task, as the overall accura17
Table 2: Classification Accuracy Results on NUS-WIDE Dataset. (The best performance is emphasized by boldface (method names abbreviated).) DT-LET categories DC. DC DA. d-tD AD. N-T 2 (1) 2 (2) 2 3 r5,3 r5,3 r5,4 r5,4 birds 0.78 0.77 0.67 0.81 0.81 0.78 0.83 0.83 0.85 0.83 building 0.81 0.78 0.67 0.83 0.83 0.82 0.88 0.84 0.88 0.89 cars 0.80 0.77 0.69 0.81 0.85 0.81 0.83 0.83 0.87 0.85 cat 0.80 0.77 0.78 0.83 0.81 0.81 0.87 0.87 0.86 0.87 dog 0.80 0.77 0.70 0.82 0.81 0.81 0.85 0.85 0.86 0.82 fish 0.77 0.75 0.73 0.82 0.79 0.78 0.85 0.84 0.85 0.84 flowers 0.80 0.78 0.77 0.84 0.81 0.81 0.86 0.84 0.84 0.88 horses 0.80 0.78 0.72 0.82 0.83 0.81 0.84 0.81 0.84 0.83 mountain 0.82 0.79 0.75 0.82 0.82 0.83 0.83 0.81 0.82 0.83 plane 0.82 0.79 0.79 0.82 0.79 0.83 0.81 0.83 0.83 0.83 average 0.80 0.77 0.77 0.83 0.83 0.81 0.84 0.83 0.85 0.85
cies are generally lower than task 1, we would like to compare more different settings 300
for this cross-layers matching task. We first verify the effectiveness of DT-LET framework. Compared with the comparison methods, the accuracy of DT-lET framework is generally with around 85% accuracy while the comparison methods are generally with no more than 80%. This observation generates the conclusion that finding the appropriate layer matching is essential. The second comparison is between the full rank and
305
2 non-full rank framework. As can be found in the table, actually r5,4 is with the highest
overall accuracy, although the other non-full rank DT-LETs do not perform quite well. This observation gives us a hint that full rank transfer is not always best as negative transfer would degrade the performance. However, the full rank transfer is generally good, although not optimal. The third comparison is between the same transfers with 310
2 different matching principles. We present two r5,3 with different matching principles,
and we find the performances vary. The case 1 performs better than case 2. This result tell us continuous transfer might be better than discrete transfer: as for case 1, the transfer is in the last two layers of both domains, and in case 2, the transfer is conducted in layer 3 and layer 5 of the source domain data. 315
By comparing specific objects, we can find the objects with large semantic difference with other categories are with higher accuracy. For the objects which are hard to classify and with low accuracy, like “birds” and “plane”, the accuracies are always low even the DT-LET is introduced. This observation proves the conclusion that DT-
18
Table 3: Effects of the number of neurons at the last layer
layer matching 3 r5,4 2 r4,3
10 0.9082 0.8853
20 0.9543 0.9677
30 0.9786 0.9797
40 0.9771 0.9713
50 0.9653 0.9522
LET can only be used to improve the transfer process, which helps with the following 320
classification process; while the classification accuracy is still based on the semantic difference of data of different 10categories. We also have to point out the relationship between the average objective function 2 loss and the classification accuracy is not strictly positive correlated. Overall r5,4 is
with the highest classification accuracy while its average objective function loss is not 325
lowest. Based on this observation, we have to point out, the lowest average objective function loss can only generate the best transfer leaning result with optimal common subspace. On such common subspace, the data projected from target domain are classified. These classification results are also influenced by the classifier as well as training samples projected randomly from the source domain. Therefore, we conclude as fol-
330
lows. We can just guarantee a good classification performance after getting the optimal transfer learning result, while the classification accuracy is also influenced by the classification settings. 6.5. Parameter Sensitivity In this section, we study the effect of different parameters in our networks. We
335
have to point out the even the layer matching is random, the last layer of the two neural networks from the source domain and the target domain must be correlated to construct the common subspace. Actually, the number of neurons at last layer would also affect the final classification result. For the last layer, we take experiments on Multi Feature Dataset as an example. The result is shown in Tab. 3.
340
From this figure, it can be noted when the number of neuron is 30, the performance is the best. Therefore in our former experiments, 30 neurons are used. The conclusion can also be drawn that more neurons are not always better. Based on this observation, The number of layers in Task 1 is set as 30, and in Task 2 as 200.
19
Layer matching for NUS-WIDE Dataset ..
..
..
..
..
..
..
.. ..
..
..
..
..
..
..
..
DT-LET, 5(source)-3(target) layer matching, with 2 matchings (case 2)
DT-LET, 5(source)-3(target) layer matching, with 2 matchings (case 1) ..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
DT-LET, 5(source)-4(target) layer matching, with 2 matching
DT-LET, 5(source)-4(target) layer matching, with 3 matching
Figure 3: The comparison of different layer matching setting for different frameworks on NUS-WIDE Dataset.
7. Conclusion 345
In this paper, we propose a novel framework, referred to as Deep Transfer Learning by Exploring where to Transfer(DT-LET), for hand writing digit recognition and textto-image classification. In the proposed model, we first find the best matching of deep layers for transfer between the source and target domains. After the matching, the final correlated common subspace is found on which classifier is applied. Experimental
350
results support the effectiveness of the proposed framework. For the future work, We would propose more robust model to solve the proposed “where to transfer” problem.
20
Declaration of interests The authors declare that they have no known competing financial interests or per355
sonal relationships that could have appeared to influence the work reported in this paper. Author Statement Jianzhe Lin: Writing paper, Conceptualization, Methodology, Software Liang Zhao: Methodology Qi Wang, Rabab Ward, Z. Jane Wang: Writing - review & editing, Su-
360
pervision. References [1] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transaction on Knowledge and Data Engineering 22 (10) (2010) 1345–1359. [2] Z. Y., Y. Chen, Z. Lu, S. J. Pan, G. R. Xue, Y. Yu, Q. Yang, Heterogeneous transfer
365
learning for image classification, In Proc. AAAI. [3] L. Duan, D. Xu, W. Tsang, Learning with augmented features for heterogeneous domain adaptation, In proc. ICML. [4] X. Lu, T. Gong, X. Zheng, Multisource compensation network for remote sensing cross-domain scene classification, IEEE Transactions on Geoscience and Remote
370
Sensing. [5] Q. Wang, J. Gao, X. Li, Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes, IEEE Transactions on Image Processing. [6] W. Dai, Y. Chen, G. Xue, Q. Yang, Y. Yu, Translated learning: Transfer learning across different feature spaces, In proc. NIPS.
375
[7] T. Liu, Q. Yang, D. Tao, Understanding how feature structure transfers in transfer learning, In Proc. IJCAI.
21
[8] J. Wang, Y. Chen, S. Hao, W. Feng, Z. Shen, Balanced distribution adaptation for transfer learning, In Proc. ICDM. [9] Q. Wang, J. Wan, X. Li, Robust hierarchical deep learning for vehicular manage380
ment, IEEE Transactions on Vehicular Technology 68 (5) (2018) 4148–4156. [10] Y. Yan, W. Li, M. Ng, M. Tan, H. Wu, H. Min, Q. Wu, Translated learning: Transfer learning across different feature spaces, In Proc. IJCAI. [11] J. S. Smith, B. T. Nebgen, R. Zubatyuk, N. Lubbers, C. Devereux, K. Barros, S. Tretiak, O. Isayev, A. E. Roitberg, Approaching coupled cluster accuracy
385
with a general-purpose neural network potential through transfer learning, Nature communications 10 (1) (2019) 1–8. [12] Q. Wang, Q. Li, X. Li, Hyperspectral band selection via adaptive subspace partition strategy, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
390
[13] J. Li, H. Zhang, Y. Huang, L. Zhang, Visual domain adaptation: a survey of recent advances, IEEE Signal Processing Magazine 33 (3) (2015) 53–69. [14] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, B. Scholkopf, Domain adaptation with conditional transferable components, In Proc. ICML. [15] B. Tan, Y. Zhang, S. J. Pan, Q. Yang, Distant domain transfer learning, in Proc.
395
AAAI. [16] F. Zhuang, X. Cheng, P. Luo, P. J., Q. He, Supervised representation learning with double encoding-layer autoencoder for transfer learning, ACM Transactions on Intelligent Systems and Technology (2018) 1–16. [17] M. Long, J. Wang, Y. Cao, J. Sun, P. Yu, Deep learning of transferable representa-
400
tion for scalable domain adaptation, IEEE Transactions on Knowledge and Data Engineering 28 (8) (2016) 2027–2040.
22
[18] Y. Yuan, X. Zheng, X. Lu, Hyperspectral image superresolution by transfer learning, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (5) (2017) 1963–1974. 405
[19] F. M. Carlucci, L. Porzi, B. Caput, Autodial: Automatic domain alignment layers, arXiv:1704.08082. [20] Y. Zhang, T. Xiang, T. M. Hospedales, H. Lu, Deep mutual learning, In proc. CVPR. [21] E. Collier, R. DiBiano, S. Mukhopadhyay, Cactusnets: Layer applicability as a
410
metric for transfer learning, arXiv:1711.01558. [22] I. Redko, N. Courty, R. Flamary, D. Tuia, Optimal transport for multi-source domain adaptation under target shift, arXiv:1803.04899. [23] M. Bernico, Y. Li, Z. Dingchao, Investigating the impact of data volume and domain similarity on transfer learning applications, In proc. CVPR.
415
[24] X. Glorot, A. Bordes, Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in Proc. ICML. [25] Y. Yu, S. Tang, K. Aizawa, A. Aizawa, Category-based deep cca for fine-grained venue discovery from multimodal data, IEEE Transactions on Neural Networks and Learning Systems (2018) 1–9.
420
[26] F. Zhuang, X. Cheng, P. Luo, P. S. J., H. Q., Supervised representation learning: Transfer learning with deep autoencoders, in Proc. IJCAI. [27] X. Zhang, F. X. Yu, S. F. Chang, S. Wang, Supervised representation learning: Transfer learning with deep autoencoders, Computer Science. [28] S. J. Zhou, J. T.and Pan, I. W. Tsang, Y. Yan, Hybrid heterogeneous transfer
425
learning through deep learning, in Proc. AAAI. [29] I. Tolstikhin, O. Bousquet, S. Gelly, B. Schoelkopf, Wasserstein auto-encoders, arXiv:1711.01558. 23
[30] L. Yang, L. Jing, J. Yu, M. K. Ng, Learning transferred weights from cooccurrence data for heterogeneous transfer learning, IEEE Transactions on Neural 430
Networks and Learning Systems 27 (11) (2016) 2187–2200. [31] D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural computation. [32] J. Tang, X. Shu, Z. Li, G. J. Qi, J. Wang, Generalized deep transfer networks for knowledge propagation in heterogeneous domains, ACM Transactions on Multi-
435
media Computing, Communications, and Applications 12 (4) (2016) 68:1–68:22. [33] B. Sun, K. Saenko, Deep coral: Correlation alignment for deep domain adaptation, In Proc. ECCV. [34] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks,
440
Journal of Machine Learning Research 17 (59) (2016) 1–35. URL http://jmlr.org/papers/v17/15-239.html [35] E. Tzeng, J. Hoffman, K. Saenko, T. Darrell, Adversarial discriminative domain adaptation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7167–7176.
445
[36] T. Tommasi, N. Quadrianto, B. Caputo, C. Lampert, Beyond dataset bias: Multitask unaligned shared knowledge transfer, In Proc. ACCV.
24
Jianzhe Lin received the B. E. degree in Optical Engineering and the second B. A. degree in English from the Huazhong University of Science and Technology, and the master degree in the Center from Chinese Academy 450
of Sciences, China. He is a Ph.D student in Department of Electrical and Computer Engineering at the University of British Columbia, Canada. His current research interests include computer vision and machine learning.
Liang Zhao received his M.S. and Ph.D. degrees in Software Engineering from Dalian University of Technology, China, in 2014 and 2018, 455
respectively. He is working as a full teacher in the School of Software Technology, Dalian University of Technology. His research interests include data mining and machine learning.
Wang Qi received the B.E. degree in automation and
25
the Ph.D. degree in pattern recognition and intelligent systems from the University of 460
Science and Technology of China, Hefei, China, in 2005 and 2010, respectively. He is currently a Professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, China. His research interests include computer vision and pattern recognition.
Rabab Ward is currently a Professor Emeritus with the 465
Electrical and Computer Engineering Department,The University of British Columbia (UBC), Canada. Her research interests are mainly in the areas of signal, image,and video processing. She is a fellow of the Royal Society of Canada, the Institute of Electrical and Electronics Engineers,the Canadian Academy of Engineers,and the Engineering Institute of Canada. She is the President of the IEEE Signal Processing
470
Society.
Z. Jane Wang received the B.Sc. degree from Tsinghua University, Beijing, China, in 1996 and the Ph.D. degree from the University of Connecticut in 2002. Since August 2004, she has been with the Department of Electrical and Computer Engineering at the University of British Columbia, Canada, where she is 475
currently a professor.
26