Journal Pre-proof

Geometric Knowledge Embedding for unsupervised domain adaptation

Hanrui Wu, Yuguang Yan, Yuzhong Ye, Michael K. Ng, Qingyao Wu

PII: S0950-7051(19)30508-8
DOI: https://doi.org/10.1016/j.knosys.2019.105155
Reference: KNOSYS 105155

To appear in: Knowledge-Based Systems

Received date: 18 July 2019
Revised date: 19 October 2019
Accepted date: 22 October 2019

Please cite this article as: H. Wu, Y. Yan, Y. Ye et al., Geometric Knowledge Embedding for unsupervised domain adaptation, Knowledge-Based Systems (2019), doi: https://doi.org/10.1016/j.knosys.2019.105155.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
Highlights

Geometric Knowledge Embedding for Unsupervised Domain Adaptation

Hanrui Wu, Yuguang Yan, Yuzhong Ye, Michael K. Ng, Qingyao Wu

- We exploit the geometric information of the source and target data to learn discriminative representations. In this sense, the features of each sample and its neighbors are both considered during the learning procedure.
- We introduce MMD into the graph convolutional network to explore geometric knowledge for learning transferable embeddings. To the best of our knowledge, GCN has not previously been applied to domain adaptation problems, which makes the proposed method a useful complement to existing domain adaptation approaches.
- We conduct comprehensive experiments on four real-world data sets, covering object recognition, image classification and text categorization, to demonstrate the effectiveness of our proposed method.
Geometric Knowledge Embedding for Unsupervised Domain Adaptation

Hanrui Wu^a, Yuguang Yan^a, Yuzhong Ye^a, Michael K. Ng^b and Qingyao Wu^a,∗

a South China University of Technology, Guangzhou, China
b The University of Hong Kong, Hong Kong, China

∗ Corresponding author
ARTICLE INFO

Keywords: Domain Adaptation; Graph-based Model; Geometric Knowledge; Graph Convolutional Network; Maximum Mean Discrepancy

ABSTRACT

Domain adaptation aims to transfer auxiliary knowledge from a source domain to enhance the learning performance on a target domain. Recent studies have suggested that deep networks are able to achieve promising results for domain adaptation problems. However, deep neural networks cannot reveal the underlying geometric information of the input data. Such geometric information is very useful for describing the relationship between the samples from the source and target domains. In this paper, we propose a novel learning algorithm named GKE, which stands for Geometric Knowledge Embedding. In GKE, we use a graph-based model to explore the underlying geometric structure of the input source and target data based on their similarities. Concretely, we develop a graph convolutional network to learn discriminative representations based on the constructed graph. To obtain effective transferable representations, we match the source and target domains by reducing the Maximum Mean Discrepancy (MMD) between their learned representations. Extensive experiments on real-world data sets demonstrate that the proposed method outperforms existing domain adaptation methods.
1. Introduction
In machine learning and data mining problems, feature representation learning is an effective strategy for improving learning performance [16, 54, 53], especially since the development of deep learning. In recent years, deep neural networks have shown promising performance in learning effective representations across a variety of tasks and domains [16]. However, an issue called data bias or domain shift usually degrades performance when a model trained on one domain (i.e., the source domain) is used to predict data in another, novel domain (i.e., the target domain) [38, 16]. To mitigate the negative effects of domain shift, a technique named domain adaptation, also known as transfer learning, has emerged and been successfully applied in many settings, such as text categorization [58, 57, 35], object recognition [46, 21, 43, 32, 45, 14], and image classification [42, 33, 39, 30].

Domain adaptation can be classified into two major categories: supervised domain adaptation [55, 50, 49], where some labeled target data are available, and unsupervised domain adaptation [42, 43, 32, 33, 39, 30], where no labeled target data are available. In this paper, we focus on the unsupervised domain adaptation problem, which is more challenging and practical.

The main challenge in domain adaptation is how to reduce the domain shift, which is also referred to as domain discrepancy. To this end, many Maximum Mean Discrepancy (MMD) based algorithms have been proposed [37, 7, 46, 14, 30]. All of these algorithms contain an MMD reduction module to minimize the domain discrepancy. For example, TCA [37] uses MMD to measure the domain discrepancy in a reproducing kernel Hilbert space (RKHS) and minimizes the MMD between the source and target domains in a low-dimensional transformation. DAN [30] applies MMD to layers embedded in an RKHS and optimally matches the different domain distributions.
These methods demonstrate the effectiveness of MMD in measuring the domain discrepancy between two empirical distributions.

Recently, various researchers have considered the geometric information of data samples based on their similarities [20, 29, 47]. These works have shown that geometric information can be very useful for describing the relationship between samples. As a result, several domain adaptation methods, such as [31, 15], adopt graph-based models to preserve intrinsic data information for learning a domain-invariant feature embedding. However, these methods use explicit graph-based regularization in the loss function and thus rely on the assumption that connected nodes in the graph are likely to share the same label, which may limit their modeling capacity. The graph convolutional network (GCN) [27] was introduced to handle this issue, and has shown superiority in learning informative geometric information in graphs by exploring the similarity relationship between samples. Motivated by this, we use GCN to address the shortcoming of deep neural networks that they cannot reveal the underlying geometric information of the input data.

In this paper, in order to learn discriminative and transferable representations, we develop a method called Geometric Knowledge Embedding (GKE) to exploit the geometric information of data, which is omitted in most existing domain adaptation methods. Specifically, we construct a graph to describe the relationship between samples of both source and target data based on their similarities, and develop a graph convolutional network [27] to learn discriminative representations based on the constructed graph. Different from [27], the convolution-like operation in GKE is performed on nodes that have the same number of adjacent nodes. As a result, for each sample, the features of the sample itself and of its neighbors are taken into consideration to learn the new representations. In order to obtain representations that are transferable across domains, we match the source and target domains by reducing the Maximum Mean Discrepancy (MMD) between their learned representations.

The benefits of the proposed algorithm are two-fold. First, GKE fills the gap left by most existing domain adaptation methods, which fail to reveal the underlying geometric information of the input data. Second, the learned geometric information of the source and target data can guide the learning of more transferable representations. We highlight our principal contributions as follows.
- We exploit the geometric information of the source and target data to learn discriminative representations. In this sense, the features of each sample and its neighbors are both considered during the learning procedure.
- We introduce MMD into the graph convolutional network to explore geometric knowledge for learning transferable embeddings. To the best of our knowledge, GCN has not previously been applied to domain adaptation problems, which makes the proposed method a useful complement to existing domain adaptation approaches.
- We conduct comprehensive experiments on four real-world data sets, covering object recognition, image classification and text categorization, to demonstrate the effectiveness of our proposed method.

The remainder of the paper is organized as follows. We review important studies regarding domain adaptation and graph convolutional networks in Section 2. We then introduce our proposed method in detail in Section 3. The experimental results are presented in Section 4. Section 5 concludes the paper.

2. Related Studies
2.1. Domain Adaptation

2.1.1. Shallow Model

A number of domain adaptation methods have been studied over the past decades [36, 26, 17, 37, 9, 22, 18, 10, 19, 11, 31, 12, 52, 7]. These methods can be referred to as shallow approaches, and they mainly follow two strategies: instance reweighting and feature representation matching. Instance reweighting assigns new weights to source samples to reduce the marginal distribution difference between the source and target domains [26, 11, 12]. Chu et al. [11, 12] propose to jointly adapt the source and target marginal distributions and learn an SVM classifier. However, methods based on this strategy assume that the conditional distributions of the source and target domains are the same [38, 52]. To relax this assumption, feature representation matching seeks to learn latent invariant features that minimize the domain distribution divergence and bridge the source and target domains [37, 22, 19, 31, 42, 7]. TCA [37] uses MMD to measure the domain discrepancy in a reproducing kernel Hilbert space (RKHS), and learns a low-dimensional linear transformation such that the marginal distribution difference between the source and target domains is minimized. GFK [22] adopts KL divergences to estimate the domain difference, and integrates an infinite number of subspaces to discover new feature representations. CORAL [42] minimizes domain discrepancy by aligning the second-order statistics of the source and target distributions. D-CORAL [43] further extends CORAL to perform end-to-end adaptation in deep neural networks.
Figure 1: Learning procedure of our proposed model.
2.1.2. Deep Model

In contrast to shallow models, deep models focus on learning transferable features via deep neural networks [46, 21, 33, 45, 14, 39, 30]. The aim is to obtain a common feature space in which the domain discrepancy is minimal. To this end, many researchers seek to reduce the Maximum Mean Discrepancy (MMD) [46, 30], which has proved effective for measuring domain discrepancy. For example, DDC [46] uses linear-kernel MMD and the prediction loss on the source domain to learn a feature subspace that is both discriminative and domain invariant. DTLC-N [14] adopts MMD in a nonlinear form to estimate the probability densities and to build deep structures for effective domain-invariant feature learning. DAN [30] applies MMD to layers embedded in an RKHS. The method in [40] further combines MMD and CORAL in a two-stream Convolutional Neural Network (CNN) to learn feature representations.

Another approach to matching the domain distributions in deep models is adversarial learning [23]. Adversarial training methods learn a representation by minimizing the domain discrepancy using an adversarial objective with respect to a domain discriminator [21, 33, 45, 39]. RevGrad [21] uses a domain discriminator, and an encoder is trained to confuse this discriminator via a gradient reversal layer. JAN-A [33] maximizes the Joint Maximum Mean Discrepancy (JMMD) in an adversarial network to make the distributions of the source and target domains more distinguishable. MADA [39] explores multi-mode structures using multiple domain discriminators to enable fine-grained alignment of different data distributions.
2.2. Graph Convolutional Network
Graph Convolutional Network (GCN) was first introduced in [27] to classify nodes in a graph in a semi-supervised learning paradigm. Different from traditional methods that use the graph for regularization [59, 56, 4, 48], GCN relaxes the assumption that connected nodes in the graph are more likely to share the same label. Relaxing this assumption allows additional information held in the graph edges to be exploited. Recent studies have shown that GCN can be used in several fields, such as image classification [47] and medical applications [20, 29]. The work in [47] utilizes both word embeddings and a knowledge graph: it builds a knowledge graph in which each node corresponds to a semantic category, and then uses GCN to transfer information between different categories. The work in [20] adopts GCN to learn a representation of the local neighborhood around each node in a graph in order to predict protein interfaces. The work in [29] applies a siamese GCN to learn a similarity metric between irregular graphs. Given the promising performance achieved by GCN in exploring the informative geometric knowledge in graphs, in this paper we apply GCN to learn discriminative representations by exploiting the underlying geometric information of the source and target data, which is rarely considered in existing domain adaptation approaches.
3. Methodology

In this section, we present the Geometric Knowledge Embedding (GKE) model for effective domain adaptation. We begin with a statement of the unsupervised domain adaptation problem, and then provide an overview and detailed discussion of the proposed model.
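Table 1
Notations and descriptions used in this paper.

| Notation | Description | Notation | Description |
|---|---|---|---|
| $\mathcal{D}_s$, $\mathcal{D}_t$ | source/target domain | $\mathbf{A}$ | adjacency matrix |
| $\mathbf{X}_s$, $\mathbf{X}_t$ | source/target data | $\mathbf{X}$ | input data |
| $\mathbf{Z}_s$, $\mathbf{Z}_t$ | learned source/target data | $\mathbf{Z}$ | learned data |
| $\mathbf{x}_{s,i}$, $\mathbf{x}_{t,i}$ | source/target sample | $\delta$ | kernel bandwidth |
| $n_s$, $n_t$ | #source/target samples | $k$ | #nearest neighbors |
| $d$, $\hat{d}$ | #input/learned features | $C$ | #classes |
| $\mathcal{P}$, $\mathcal{Q}$ | source/target distributions | $\lambda$ | tradeoff parameter |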
3.1. Problem Statement
In this part, we first provide the definitions of the terminology, and then describe the notations. For clarity, we list the frequently used notations and their descriptions in Table 1.

Definition 1. (Domain). A domain $\mathcal{D}$ consists of two components: a feature space $\mathcal{X} = \mathbb{R}^d$, where $d$ is the number of features, and a marginal probability distribution $\mathcal{P}(\mathbf{x})$, where $\mathbf{x} \in \mathcal{X}$.

Definition 2. (Task). Given a specific domain $\mathcal{D} = \{\mathcal{X}, \mathcal{P}(\mathbf{x})\}$, a task $\mathcal{T}$ consists of two components: a label space $\mathcal{Y} = \{1, \ldots, C\}$, where $C$ is the number of labels, and a classifier $f(\mathbf{x})$ that can be modeled as a conditional probability distribution $\mathcal{P}(y|\mathbf{x})$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \mathcal{Y}$.
The source domain data, denoted as $\mathbf{X}_s = [\mathbf{x}_{s,1}, \ldots, \mathbf{x}_{s,n_s}]^\top \in \mathbb{R}^{n_s \times d}$, are drawn from distribution $\mathcal{P}$, and the target domain data, denoted as $\mathbf{X}_t = [\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,n_t}]^\top \in \mathbb{R}^{n_t \times d}$, are drawn from distribution $\mathcal{Q}$, where $d$ is the feature dimension, and $n_s$ and $n_t$ are the numbers of samples in the source and target domains, respectively. In the problem of unsupervised domain adaptation, we are given a labeled source domain $\mathcal{D}_s = \{\mathbf{X}_s, \mathbf{y}_s\}$ and an unlabeled target domain $\mathcal{D}_t = \{\mathbf{X}_t\}$, where $\mathbf{y}_s = [y_{s,1}, \ldots, y_{s,n_s}]^\top$ and $y_{s,i} \in \{1, \ldots, C\}$ is the label of $\mathbf{x}_{s,i}$. In this problem, the feature and label spaces of the source and target domains are the same, i.e., $\mathcal{X}_s = \mathcal{X}_t$ and $\mathcal{Y}_s = \mathcal{Y}_t$, while the distributions of the source and target domains are different, i.e., $\mathcal{P} \neq \mathcal{Q}$. Our goal is to extract the knowledge from the source domain $\mathcal{D}_s$ to improve the prediction performance on $\mathbf{X}_t$ in the target domain $\mathcal{D}_t$.

3.2. Overview

We seek to exploit the underlying geometric information contained in the samples to address the problem of unsupervised domain adaptation. Instead of learning from the representations of samples only, we also take advantage of the similarities between samples to learn discriminative representations for the source and target domains. Figure 1 illustrates the learning procedure of the proposed method GKE. Concretely, we first apply a Convolutional Neural Network (CNN), such as AlexNet [28], to extract high-level features from the input data. In order to take the geometric structure of the data into consideration, we then construct a graph based on the extracted features by finding the $k$-nearest neighbors of each sample, so that the similarity relationship between samples of both the source and target domains is introduced into the learning model. After that, we apply a graph convolutional network to learn discriminative representations based on the samples and their neighbors. Meanwhile, we reduce the MMD between the learned source and target representations to guarantee feature transferability.

Let $\mathbf{Z}_s$ and $\mathbf{Z}_t$ be the feature representations of the source and target data learned by GKE, respectively, and denote by $f(\cdot)$ the learned classifier. Formally, we define the loss function as follows:

$$\mathcal{L} = \ell_{CE}(\mathbf{y}_s, f(\mathbf{Z}_s)) + \lambda \Omega(\mathbf{Z}_s, \mathbf{Z}_t), \tag{1}$$

where $\lambda$ is the tradeoff parameter, $\ell_{CE}(\cdot, \cdot)$ is the cross-entropy error over all the source samples, and $\Omega(\cdot, \cdot)$ is the empirical estimate of the MMD between the learned source and target representations. Next, we discuss the construction of our GCN, the matching of domain distributions, and the classifier training.
3.3. Construction of Graph Convolutional Network
We first briefly describe the graph convolutional network, and then present the construction of our GCN in detail. GCN was first proposed in [27] to solve the semi-supervised node classification problem, which aims to make predictions for entities by exploiting knowledge from some labeled entities. This problem can be modeled as graph-based semi-supervised learning, in which each entity is treated as a node and the relation between two entities is represented as an edge. Rather than using explicit graph-based regularization in the loss function as traditional approaches do [4, 48], GCN adopts the adjacency matrix to model the similarity information of samples, which improves the diffusion of knowledge across edges at all GCN layers. Formally, a graph convolutional layer computes the following transformation:

$$\mathbf{H}^{(l+1)} = \sigma\big(\tilde{\mathbf{A}} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\big), \tag{2}$$

where $\sigma(\cdot)$ denotes an activation function, $\mathbf{H}^{(l)}$ is the matrix of activations in the $l$-th layer (so that $\mathbf{H}^{(0)}$ is the input data), $\mathbf{W}^{(l)}$ is a layer-specific trainable weight matrix, and $\tilde{\mathbf{A}}$ is a normalized adjacency matrix which can be obtained using the renormalization trick [27]:

$$\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}, \tag{3}$$
where $\mathbf{A}$ is the adjacency matrix of a graph with added self-connections, and $D_{ii} = \sum_j A_{ij}$.

In this paper, we exploit the similarity relationship between both source and target samples to construct a graph, and apply GCN to learn discriminative representations for the samples based on their own features and those of their neighbors. Specifically, we apply an AlexNet pretrained on ImageNet to extract features for the source and target samples, where the extracted features are denoted as $\mathbf{X} = [\mathbf{X}_s, \mathbf{X}_t] = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$, $n = n_s + n_t$. After that, we construct the similarity matrix $\mathbf{A}$ based on $\mathbf{X}$. From a geometric perspective, according to [6], data points are usually drawn from a low-dimensional manifold embedded in a high-dimensional ambient space. According to the local invariance assumption [3], if two examples $\mathbf{x}_i$, $\mathbf{x}_j$ are close in the intrinsic geometry of the data distribution underlying domain $\mathcal{D}$, then their embeddings $\mathbf{z}_i$ and $\mathbf{z}_j$ should also be close. To exploit the geometric information of the input data, motivated by [31] and [5], we encode the geometric structure knowledge by constructing a $k$-nearest neighbors graph $\mathbf{A}$ on the data points:

$$A_{ij} = \begin{cases} \exp\left(-\dfrac{(1 - \cos(\mathbf{x}_i, \mathbf{x}_j))^2}{\beta^2}\right), & \text{if } \mathbf{x}_i \in \mathcal{N}_k(\mathbf{x}_j) \text{ or } \mathbf{x}_j \in \mathcal{N}_k(\mathbf{x}_i); \\ 1, & \mathbf{x}_i = \mathbf{x}_j; \\ 0, & \text{otherwise}; \end{cases} \tag{4}$$
where $\mathcal{N}_k(\mathbf{x}_i)$ denotes the set of $k$ nearest neighbors of sample $\mathbf{x}_i$. Here, we use the heat kernel weighting [5] to measure the similarity; $\beta$ is the bandwidth parameter of the heat kernel and, following [13], we set $\beta = \frac{1}{k} \sum_{i=1}^{k} (1 - \cos(\mathbf{x}_i, \mathbf{x}_j))$ if $\mathbf{x}_i \in \mathcal{N}_k(\mathbf{x}_j)$. Lastly, we obtain the normalized adjacency matrix $\tilde{\mathbf{A}}$ using Eq. (3).

Based on the constructed graph, we design a three-layer GCN where each layer takes the representation from the previous layer as input and outputs a new representation. We formulate the proposed three-layer GCN in terms of the input features $\mathbf{X}$ and the normalized adjacency matrix $\tilde{\mathbf{A}}$ as

$$G(\mathbf{X}, \tilde{\mathbf{A}}; \mathbf{W}) = \tilde{\mathbf{A}}\, \sigma\Big(\tilde{\mathbf{A}}\, \sigma\big(\tilde{\mathbf{A}} \mathbf{X} \mathbf{W}^{(0)}\big) \mathbf{W}^{(1)}\Big) \mathbf{W}^{(2)}, \tag{5}$$

where $\mathbf{W} = \{\mathbf{W}^{(0)}, \mathbf{W}^{(1)}, \mathbf{W}^{(2)}\}$ are the GCN parameters to be learned during training. We use the Rectified Linear Unit (ReLU) as the activation function $\sigma(\cdot)$ and stack the convolution operations one after another. Denoting by $d_l$ the feature dimension of the $l$-th layer, we have $d_0 = d$, $\tilde{\mathbf{A}} \in \mathbb{R}^{n \times n}$, $\mathbf{X} \in \mathbb{R}^{n \times d}$, $\mathbf{W}^{(0)} \in \mathbb{R}^{d \times d_1}$, $\mathbf{W}^{(1)} \in \mathbb{R}^{d_1 \times d_2}$ and $\mathbf{W}^{(2)} \in \mathbb{R}^{d_2 \times d_3}$, so the output of the GCN model is a matrix in $\mathbb{R}^{n \times d_3}$ in which each row contains the learned features of one sample. Denoting by $\mathbf{Z} = [\mathbf{Z}_s, \mathbf{Z}_t] = [\mathbf{z}_1, \ldots, \mathbf{z}_n]^\top \in \mathbb{R}^{n \times \hat{d}}$ the features learned by GKE for both the source and target domains, we have $\mathbf{Z} = G(\mathbf{X}, \tilde{\mathbf{A}}; \mathbf{W})$, where $\hat{d}$ is the dimension of the learned features, i.e., $\hat{d} = d_3$.
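The graph construction of Eq. (4), the normalization of Eq. (3) and the forward pass of Eq. (5) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' TensorFlow implementation: the function names are hypothetical, the data and layer sizes are toy values, the weights are random rather than trained, and a single global bandwidth $\beta$ (the mean cosine distance over all $k$-nearest-neighbor pairs) is assumed as one possible reading of the bandwidth rule.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances 1 - cos(x_i, x_j) between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)  # guard against all-zero rows
    return 1.0 - Xn @ Xn.T


def knn_heat_kernel_graph(X, k):
    """Symmetric k-NN graph with heat-kernel weights, in the spirit of Eq. (4)."""
    n = X.shape[0]
    dist = cosine_distance_matrix(X)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors of each row
    rows = np.repeat(np.arange(n), k)
    cols = nn_idx.ravel()
    # Assumption: one global bandwidth, the mean distance over all k-NN pairs.
    beta = max(dist[rows, cols].mean(), 1e-12)
    mask = np.zeros((n, n), dtype=bool)
    mask[rows, cols] = True
    mask = mask | mask.T                      # x_i in N_k(x_j) or x_j in N_k(x_i)
    np.fill_diagonal(dist, 0.0)
    A = np.where(mask, np.exp(-(dist ** 2) / beta ** 2), 0.0)
    np.fill_diagonal(A, 1.0)                  # self-connections
    return A


def normalize_adjacency(A):
    """Eq. (3): D^{-1/2} A D^{-1/2} with D_ii = sum_j A_ij."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def relu(H):
    return np.maximum(H, 0.0)


def gcn_forward(X, A_norm, weights):
    """Eq. (5): three graph convolutions with ReLU after the first two."""
    W0, W1, W2 = weights
    H1 = relu(A_norm @ X @ W0)
    H2 = relu(A_norm @ H1 @ W1)
    return A_norm @ H2 @ W2


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))  # toy stand-in for stacked source and target features
A_norm = normalize_adjacency(knn_heat_kernel_graph(X, k=3))
dims = [8, 6, 5, 4]           # d, d1, d2, d3 (toy sizes)
weights = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1])) for i in range(3)]
Z = gcn_forward(X, A_norm, weights)
print(Z.shape)
```

Each row of `Z` mixes the features of a sample with those of its graph neighbors, which is the sense in which the learned representation carries geometric information. Dense matrices are used for brevity; for large $n$ a sparse adjacency matrix would likely be preferable.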
3.4. Domain Distribution Matching
Although GCN is able to leverage samples in both domains and their neighbors to learn discriminative representations, it still suffers from domain discrepancy, which degrades performance in the target domain. To alleviate this issue, we reduce the term $\Omega(\cdot, \cdot)$ in Eq. (1), which estimates the distribution discrepancy between the learned representations of the source and target data. Specifically, we apply Maximum Mean Discrepancy (MMD) [24] to measure the domain discrepancy due to its effectiveness and efficiency. Given the source and target representations $\mathbf{Z}_s$ and $\mathbf{Z}_t$ learned by GCN, MMD measures the distribution discrepancy between the source and target domains based on the samples $\mathbf{Z}_s$ and $\mathbf{Z}_t$. When the MMD is small, the two domains are likely to be similarly distributed. Formally, MMD can be formulated as follows:

$$\Omega(\mathbf{Z}_s, \mathbf{Z}_t) := \sup_{g \in \mathcal{G}} \big(\mathbb{E}_s[g(\mathbf{z}_s)] - \mathbb{E}_t[g(\mathbf{z}_t)]\big), \tag{6}$$

where $g(\cdot)$ is a mapping function from the function class $\mathcal{G}$, and $\mathbb{E}_s[\cdot]$ and $\mathbb{E}_t[\cdot]$ denote the expectations with respect to the source and target distributions. By replacing the population expectations with empirical expectations computed on the samples $\mathbf{Z}_s$ and $\mathbf{Z}_t$, we can rewrite Eq. (6) as the empirical estimate

$$\Omega(\mathbf{Z}_s, \mathbf{Z}_t) := \sup_{g \in \mathcal{G}} \left( \frac{1}{n_s} \sum_{i=1}^{n_s} g(\mathbf{z}_{s,i}) - \frac{1}{n_t} \sum_{i=1}^{n_t} g(\mathbf{z}_{t,i}) \right). \tag{7}$$

For Eq. (7) to be a useful discrepancy measure, $\mathcal{G}$ must be rich enough that the population MMD vanishes if and only if the distributions of $\mathbf{Z}_s$ and $\mathbf{Z}_t$ are identical, and restrictive enough that the empirical estimate of the MMD converges quickly, which guarantees the consistency of the test [24]. In particular, the function classes in universal reproducing kernel Hilbert spaces (RKHSs) satisfy both properties. In this sense, MMD represents the distance between the mean embeddings of two distributions, and an empirical estimate of the (squared) MMD can be calculated as

$$\Omega(\mathbf{Z}_s, \mathbf{Z}_t) = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \langle g(\mathbf{z}_{s,i}), g(\mathbf{z}_{s,j}) \rangle + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \langle g(\mathbf{z}_{t,i}), g(\mathbf{z}_{t,j}) \rangle - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \langle g(\mathbf{z}_{s,i}), g(\mathbf{z}_{t,j}) \rangle. \tag{8}$$

By introducing a kernel function into the empirical MMD, i.e., $\mathcal{K}(\mathbf{z}_{s,i}, \mathbf{z}_{s,j}) = \langle g(\mathbf{z}_{s,i}), g(\mathbf{z}_{s,j}) \rangle$, we obtain

$$\Omega(\mathbf{Z}_s, \mathbf{Z}_t) = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \mathcal{K}(\mathbf{z}_{s,i}, \mathbf{z}_{s,j}) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \mathcal{K}(\mathbf{z}_{t,i}, \mathbf{z}_{t,j}) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \mathcal{K}(\mathbf{z}_{s,i}, \mathbf{z}_{t,j}). \tag{9}$$

Here, we adopt the Gaussian kernel, i.e., $\mathcal{K}(\mathbf{z}_{s,i}, \mathbf{z}_{s,j}) = \exp\left(-\frac{1}{2\delta^2} \|\mathbf{z}_{s,i} - \mathbf{z}_{s,j}\|_2^2\right)$, where $\delta > 0$ is the bandwidth parameter.
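As a rough illustration of Eq. (9), the kernel-based estimate reduces to three averages over Gaussian kernel matrices. The sketch below uses NumPy with hypothetical function names and synthetic data; the bandwidth value is arbitrary. Note that the sums in Eq. (9) include the diagonal terms $i = j$, which is the form usually called the biased (V-statistic) estimate of the squared MMD.

```python
import numpy as np


def gaussian_kernel(Za, Zb, delta):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 * delta^2)) for rows of Za and Zb."""
    sq = (
        np.sum(Za ** 2, axis=1)[:, None]
        + np.sum(Zb ** 2, axis=1)[None, :]
        - 2.0 * Za @ Zb.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * delta ** 2))


def mmd2(Zs, Zt, delta=1.0):
    """Empirical squared MMD as in Eq. (9); diagonal terms are included."""
    k_ss = gaussian_kernel(Zs, Zs, delta).mean()  # (1 / n_s^2) * double sum
    k_tt = gaussian_kernel(Zt, Zt, delta).mean()  # (1 / n_t^2) * double sum
    k_st = gaussian_kernel(Zs, Zt, delta).mean()  # (1 / (n_s n_t)) * double sum
    return k_ss + k_tt - 2.0 * k_st


rng = np.random.default_rng(0)
Zs = rng.normal(size=(50, 4))               # synthetic "source" representations
Zt_near = rng.normal(size=(60, 4))          # same distribution as the source
Zt_far = rng.normal(loc=2.0, size=(60, 4))  # shifted distribution
print(mmd2(Zs, Zt_near), mmd2(Zs, Zt_far))
```

The shifted sample yields the larger value, which is the behavior the regularizer $\Omega$ relies on: minimizing it pushes the learned source and target representations toward similar distributions.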
3.5. Classifier Training

Define $f(\mathbf{z}_s) = \mathrm{softmax}(\boldsymbol{\Theta} \mathbf{z}_s)$, where $\boldsymbol{\Theta} = [\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_C]^\top \in \mathbb{R}^{C \times \hat{d}}$, and $\boldsymbol{\theta}_c$ denotes the parameters for class $c$. GKE trains the classifier by minimizing the cross-entropy loss over all the labeled source samples:

$$\ell_{CE}(\mathbf{y}_s, f(\mathbf{Z}_s)) = -\sum_{i=1}^{n_s} \sum_{c=1}^{C} y_c \log f_c(\mathbf{z}_{s,i}), \tag{10}$$

where $f_c(\mathbf{z}_{s,i})$ is the $c$-th element of $f(\mathbf{z}_{s,i})$, and $y_c$ equals 1 if the label of $\mathbf{x}_{s,i}$ is $c$ and 0 otherwise. Consequently, given a learned target representation $\mathbf{z}_t$, we make the prediction by

$$\hat{y}_t = \arg\max_{c} \boldsymbol{\theta}_c^\top \mathbf{z}_t, \quad c \in \{1, \ldots, C\}, \tag{11}$$

which is also the index of the largest element of $\boldsymbol{\Theta} \mathbf{z}_t$.

Table 2
Statistical information of the Office, Office-Caltech and Testbed data sets.

| Data set | Domain | #Samples | #Classes |
|---|---|---|---|
| Office | Amazon (A) | 2817 | 31 |
| Office | DSLR (D) | 498 | 31 |
| Office | Webcam (W) | 795 | 31 |
| Office-Caltech | Amazon (A) | 958 | 10 |
| Office-Caltech | DSLR (D) | 157 | 10 |
| Office-Caltech | Webcam (W) | 295 | 10 |
| Office-Caltech | Caltech-256 (C) | 1299 | 10 |
| Testbed | Caltech-256 (C) | 3847 | 40 |
| Testbed | ImageNet (I) | 4000 | 40 |
| Testbed | Sun (S) | 2626 | 40 |

Figure 2: Some images from the Office, Caltech-256 and Testbed data sets.
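The classifier of Section 3.5 and its role in the objective of Eq. (1) can be sketched as follows. This is an illustrative NumPy version with hypothetical function names, not the authors' code. Labels are 0-based here (the paper uses $1, \ldots, C$), and the discrepancy term is passed in as a callable; the `mean_gap` stand-in used in the demo is simply the squared distance between feature means (the linear-kernel case), not the Gaussian-kernel estimate of Eq. (9).

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(Zs, ys, Theta):
    """Eq. (10): cross-entropy summed over source samples; ys holds 0-based labels."""
    probs = softmax(Zs @ Theta.T)
    return -np.sum(np.log(probs[np.arange(len(ys)), ys] + 1e-12))


def predict(Zt, Theta):
    """Eq. (11): index of the largest element of Theta z_t for each row of Zt."""
    return np.argmax(Zt @ Theta.T, axis=1)


def gke_objective(Zs, ys, Zt, Theta, lam, discrepancy):
    """Eq. (1): cross-entropy plus lam times a domain discrepancy term."""
    return cross_entropy(Zs, ys, Theta) + lam * discrepancy(Zs, Zt)


def mean_gap(Za, Zb):
    """Stand-in discrepancy: squared distance between the feature means."""
    return float(np.sum((Za.mean(axis=0) - Zb.mean(axis=0)) ** 2))


Theta = np.eye(3)                 # toy classifier: C = 3 classes, 3 learned features
Zs = 5.0 * np.eye(3)              # one well-separated source sample per class
ys = np.array([0, 1, 2])
Zt = Zs + 0.1                     # slightly shifted "target" representations
print(predict(Zt, Theta))
print(gke_objective(Zs, ys, Zt, Theta, lam=1.0, discrepancy=mean_gap))
```

Since softmax is monotone, taking the argmax of the raw scores $\boldsymbol{\Theta} \mathbf{z}_t$ gives the same prediction as taking the argmax of the softmax output, which is why Eq. (11) does not need the normalization.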
4. Experiment

In this section, we conduct extensive experiments on domain adaptation problems to evaluate the effectiveness of GKE. We first describe the data sets used in our experiments in detail, and list the baseline methods used for comparison. We then provide the experimental results and discussions. In-depth experimental analysis is also included. The promising results demonstrate the effectiveness of the proposed method.

4.1. Data Sets

We perform experiments to investigate the effectiveness of the proposed method in three cases: object recognition, image classification and text categorization. Specifically, for the object recognition problem, we adopt two widely used data sets: Office and Office-Caltech. We use the Testbed data set for the image classification case. Table 2 lists the statistical information of these three data sets, and Figure 2 shows some example images from them. For the text categorization problem, we use the Multi-Language data set to evaluate the effectiveness of the proposed model. Table 3 presents the statistical information of this data set.

Table 3
Statistical information of the Multi-Language data set.

| Data set | Tasks | #Samples | #Features | #Classes |
|---|---|---|---|---|
| Multi-Language | FR → EN, GR → EN, IT → EN, SP → EN | 600 | 21,531 | 6 |
| Multi-Language | EN → FR, GR → FR, IT → FR, SP → FR | 600 | 24,893 | 6 |
| Multi-Language | EN → GR, FR → GR, IT → GR, SP → GR | 600 | 34,279 | 6 |
| Multi-Language | EN → IT, FR → IT, GR → IT, SP → IT | 600 | 15,506 | 6 |
| Multi-Language | EN → SP, FR → SP, GR → SP, IT → SP | 600 | 11,547 | 6 |

• Office The Office data set contains images of 31 categories over three object domains [41]: images in Amazon (A) are downloaded from the Amazon website, DSLR (D) contains high-resolution images obtained from a digital SLR camera, and Webcam (W) consists of low-resolution images taken with a web camera. The data set contains a total of 4,110 images, with a minimum of 7 and a maximum of 100 samples per domain and category. We use these 3 domains to construct 3 × 2 = 6 learning tasks, i.e., A → D, A → W, D → A, D → W, W → A and W → D. Specifically, we use a fine-tuned AlexNet [28] model to extract features for the Office data set.
• Office-Caltech The Caltech-256 (C) data set [25] includes a total of 30,607 images in 256 categories, among which 10 categories overlap with the Office data set. We extract the images of these shared classes from the 4 domains and construct 4 × 3 = 12 learning tasks, i.e., A → C, A → D, A → W, C → A, C → D, C → W, D → A, D → C, D → W, W → A, W → C and W → D. Specifically, we use a pretrained AlexNet [28] model to extract features for the Office-Caltech data set.
• Testbed The Testbed data set [44] contains 10,473 images in 40 categories. This data set is collected from three domains: Caltech-256 (C) contains 256 categories with a minimum of 80 and a maximum of 827 images; ImageNet (I) contains around 21,000 object classes organized according to the WordNet hierarchy; and Sun (S) contains a total of 142,165 pictures and was created as a comprehensive collection of annotated images covering a large variety of environmental scenes, places and objects. We use these 3 domains to construct 3 × 2 = 6 learning tasks, i.e., C → I, C → S, I → C, I → S, S → C and S → I. Specifically, we use a pretrained AlexNet [28] model to extract features for the Testbed data set.
• Multi-Language The Multi-Language data set [2] includes feature characteristics of documents written in five different languages, i.e., English (EN), French (FR), German (GR), Italian (IT) and Spanish (SP), that share the same six categories, i.e., C15, CCAT, E21, ECAT, GCAT and M11. Each language contains documents written in or translated into that language. Taking English as an example, the original documents are those written in English, and their translated versions are the documents translated from French, German, Spanish or Italian. We use each of the four sets of translated documents as a source domain and the original documents as the target domain, thus constructing 5 × 4 = 20 learning tasks, i.e., FR → EN, GR → EN, IT → EN, SP → EN, EN → FR, GR → FR, IT → FR, SP → FR, EN → GR, FR → GR, IT → GR, SP → GR, EN → IT, FR → IT, GR → IT, SP → IT, EN → SP, FR → SP, GR → SP and IT → SP. The Multi-Language data set contains a large number of samples; to speed up the experiments, we randomly select 100 samples per category for each language. Specifically, we directly use the term frequency-inverse document frequency (TF-IDF) features for the Multi-Language data set.
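The task counts quoted above (3 × 2 = 6, 4 × 3 = 12 and 5 × 4 = 20) are simply the numbers of ordered source/target pairs of distinct domains, and selecting 100 samples for each of the six categories gives the 600 samples per language listed in Table 3. The short sketch below reproduces both; the helper names and the random seed are illustrative assumptions, not part of the original experimental code.

```python
from itertools import permutations
import random


def learning_tasks(domains):
    """All ordered (source, target) pairs with source != target."""
    return [f"{s} -> {t}" for s, t in permutations(domains, 2)]


def subsample_per_class(labels, per_class, seed=0):
    """Indices of up to `per_class` randomly chosen samples for each label."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        chosen.extend(rng.sample(members, min(per_class, len(members))))
    return sorted(chosen)


office = learning_tasks(["A", "D", "W"])
office_caltech = learning_tasks(["A", "C", "D", "W"])
multi_language = learning_tasks(["EN", "FR", "GR", "IT", "SP"])
print(len(office), len(office_caltech), len(multi_language))  # 6 12 20

# Toy label vector: 6 categories with 150 documents each.
toy_labels = [c for c in range(6) for _ in range(150)]
print(len(subsample_per_class(toy_labels, per_class=100)))    # 600
```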
4.2. Implementation
We implement GKE on the TensorFlow platform [1]¹. GKE contains 3 convolutional layers and 1 dense layer, with output channel numbers 2,048 → 1,024 → 512 → $C$. We estimate the domain discrepancy using the empirical MMD on the output of the last convolutional layer, i.e., the 512-dimensional output. We train the model for a maximum of 200 epochs with a learning rate of 0.001. Moreover, we use ReLU as the activation function $\sigma(\cdot)$ and apply 50% dropout in the network.
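For reference, the settings reported above can be collected in a small configuration object. This is only a sketch: the class name, the field names and the helper are hypothetical, the input dimension of 4,096 is an assumption (the text does not state the dimension of the extracted features), and the default of 31 classes corresponds to the Office data set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GKEConfig:
    """Hyperparameters reported in Section 4.2 (field names are illustrative)."""
    conv_channels: Tuple[int, ...] = (2048, 1024, 512)  # outputs of the 3 conv layers
    num_classes: int = 31       # C; 31 for Office, 10 for Office-Caltech, 40 for Testbed
    mmd_layer: int = 2          # index of the last conv layer (512-d output)
    max_epochs: int = 200
    learning_rate: float = 1e-3
    dropout_rate: float = 0.5
    activation: str = "relu"


def weight_shapes(input_dim: int, cfg: GKEConfig) -> List[Tuple[int, int]]:
    """Shapes of the weight matrices: three conv layers followed by one dense layer."""
    dims = (input_dim,) + cfg.conv_channels + (cfg.num_classes,)
    return list(zip(dims[:-1], dims[1:]))


cfg = GKEConfig()
# Assumption: 4096-dimensional input features; the source does not state this value.
print(weight_shapes(4096, cfg))
```

Under these assumptions the learned representation on which the MMD is computed has $\hat{d} = 512$ dimensions, and the final dense layer plays the role of $\boldsymbol{\Theta}$ in Section 3.5.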
4.3. Baselines
We use several shallow models and deep learning approaches as baselines, and list them as follows.
1 https://www.tensorflow.org/
• AlexNet [28]. AlexNet is a standard deep model for learning representations, which is a baseline without considering domain discrepancy. Here, we use AlexNet as the base architecture for deep models, and also use it to extract features for shallow models on object recognition and image classification tasks.
• GFK [22]. Geodesic Flow Kernel (GFK) learns a new feature representation based on the proposed geodesic flow kernel. Then a classifier is learned on the new feature representation.
• SA [19]. Subspace Alignment (SA) aligns the source subspace with the target subspace to discover a domain invariant feature subspace. After that, a classifier is trained on the learned feature subspace.
• TCA [37]. Transfer Component Analysis (TCA) finds common latent features using MMD-penalized Kernel PCA. Then a classifier is learned on the latent features.
• GCMF [31]. Graph co-regularized Collective Matrix tri-Factorization (GCMF) first presents the prior knowledge by a graph structure, then simultaneously maximizes the empirical likelihood and preserves the geometric structure.
• CORAL [42]. CORrelation ALignment (CORAL) aligns the second-order statistics of the source and target distributions with a linear transformation, and then a classifier is trained on the transformed features.
• D-CORAL [43]. Deep CORAL (D-CORAL) extends CORAL to learn a nonlinear transformation that aligns correlations of layer activations in deep neural networks.
• RevGrad [21]. RevGrad adapts a single network layer to match source and target by making them indistinguishable in a domain adversarial learning procedure. Then the discriminative and domain invariant feature representation is learned and a classifier is obtained.
• JAN-A [33]. Adversarial Joint Adaptation Network (JAN-A) maximizes joint maximum mean discrepancy (JMMD) in an adversarial network to make the distributions of the source and target domains more distinguishable. Once the new feature representation is learned, a classifier can be trained.
• MADA [39]. Multi-Adversarial Domain Adaptation (MADA) captures multi-mode structures using multiple domain discriminators to align different data distributions. Thus the transferable features are obtained. After that, a classifier can be learned on the transferable features.
• DDC [46]. Deep Domain Confusion (DDC) adds an adaptation layer regularized by linear-kernel MMD to maximize domain confusion, thus learns a representation that is both semantically meaningful and domain invariant. Then a classifier is obtained on the learned representation.
• RTN [32]. Residual Transfer Network (RTN) is also an MMD-based method, which jointly learns transferable features and adaptive classifiers via deep residual learning.
• DAN [30]. Deep Adaptation Network (DAN) discovers transferable features by matching kernel embeddings of multi-layer representations in reproducing kernel Hilbert spaces (RKHSs). Then a classifier is applied to learn the final predictions.
• DTLC-N [14]. Deep Transfer Low-rank Coding nonlinear version (DTLC-N) uses MMD as the estimate of probability densities, and combines multilayer common dictionaries, low-rank coding, and CNN architecture into a unified procedure.
• MMD-CORAL [40]. MMD-CORAL proposes to jointly adapt the features and classifiers, in which the features are adapted by both MMD and CORAL and the classifiers are adapted by minimizing the entropy loss of the target data.
Note that among the above baseline methods, only TCA, GCMF and CORAL can be applied to the text categorization problem according to their original papers. In addition, our proposed GKE contains 3 parameters: the number of nearest neighbors 𝑘, the MMD tradeoff parameter 𝜆 and the Gaussian kernel bandwidth 𝛿. Specifically, we set 𝑘 = 4 for the Office data set, 𝑘 = 8 for the Office-Caltech and Testbed data sets, and 𝑘 = 16 for the Multi-Language data set. To obtain the best performance, we search 𝜆 ∈ {10^0, 10^1, …, 10^5} and 𝛿 ∈ {2^0, 2^1, …, 2^5}.
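To make the roles of 𝜆 and 𝛿 concrete, the following sketch computes a biased empirical estimate of the squared MMD with a Gaussian kernel and lists the two search grids stated above. The kernel parameterization exp(−‖x − y‖² / (2𝛿²)) is an assumption made for this example and may differ from the paper's definition; the synthetic data and the function names are likewise illustrative.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth):
    """Pairwise Gaussian kernel; the 2 * bandwidth**2 scaling is an assumption."""
    sq_dist = (np.sum(x**2, axis=1)[:, None]
               + np.sum(y**2, axis=1)[None, :]
               - 2.0 * x @ y.T)
    sq_dist = np.maximum(sq_dist, 0.0)          # guard against tiny negative values
    return np.exp(-sq_dist / (2.0 * bandwidth**2))


def mmd2(source, target, bandwidth):
    """Biased empirical estimate of the squared MMD between two samples."""
    k_ss = gaussian_kernel(source, source, bandwidth)
    k_tt = gaussian_kernel(target, target, bandwidth)
    k_st = gaussian_kernel(source, target, bandwidth)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()


# Search grids as stated in the text.
lambdas = [10.0**p for p in range(6)]   # 10^0, ..., 10^5
deltas = [2.0**p for p in range(6)]     # 2^0, ..., 2^5

# Synthetic features: one target drawn like the source, one with a shifted mean.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(50, 8))
tgt_same = rng.normal(0.0, 1.0, size=(50, 8))
tgt_shift = rng.normal(2.0, 1.0, size=(50, 8))

same_gap = mmd2(src, tgt_same, bandwidth=4.0)
shift_gap = mmd2(src, tgt_shift, bandwidth=4.0)
print(same_gap, shift_gap)
```

In a training objective of the form classification loss + 𝜆 · MMD², a larger 𝜆 places more weight on shrinking this quantity, while 𝛿 controls the scale at which the two samples are compared.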
Table 4 Accuracies (mean±standard deviations) of different methods for domain adaptation on the Office data set.
Method      A→D         D→A         A→W         W→A         D→W         W→D          Average
AlexNet     64.2 ± 0.4  45.5 ± 0.6  61.6 ± 0.5  48.3 ± 0.5  95.4 ± 0.3  99.0 ± 0.3   69.0
GFK         58.6 ± 0.0  52.4 ± 0.0  58.4 ± 0.0  46.1 ± 0.0  93.6 ± 0.0  91.0 ± 0.0   66.7
SA          60.1 ± 0.0  50.1 ± 0.0  58.5 ± 0.0  47.3 ± 0.0  92.0 ± 0.0  98.9 ± 0.0   67.8
TCA         57.8 ± 0.0  51.6 ± 0.0  59.0 ± 0.0  47.9 ± 0.0  90.2 ± 0.0  88.2 ± 0.0   65.8
CORAL       62.2 ± 0.0  48.4 ± 0.0  61.9 ± 0.0  48.2 ± 0.0  96.2 ± 0.0  99.5 ± 0.0   69.4
D-CORAL     66.8 ± 0.6  52.8 ± 0.2  66.4 ± 0.4  51.5 ± 0.3  95.7 ± 0.3  99.2 ± 0.1   72.1
RevGrad     72.3 ± 0.3  52.4 ± 0.4  73.0 ± 0.5  50.4 ± 0.5  96.4 ± 0.3  99.2 ± 0.3   74.0
JAN-A       72.8 ± 0.3  57.5 ± 0.2  75.2 ± 0.4  56.3 ± 0.2  96.6 ± 0.2  99.6 ± 0.1   76.3
MADA        74.1 ± 0.1  56.0 ± 0.2  78.5 ± 0.2  54.5 ± 0.3  99.8 ± 0.1  100.0 ± 0.0  77.2
DDC         64.9 ± 0.4  47.2 ± 0.5  61.0 ± 0.5  49.4 ± 0.6  95.0 ± 0.3  98.5 ± 0.3   69.3
RTN         71.0 ± 0.2  50.5 ± 0.3  73.3 ± 0.2  51.0 ± 0.1  96.8 ± 0.2  99.6 ± 0.1   73.7
DAN         71.7 ± 0.4  50.0 ± 0.6  73.9 ± 0.5  51.4 ± 0.6  96.8 ± 0.3  99.6 ± 0.2   73.9
DTLC-N      68.2 ± 0.5  54.9 ± 0.3  70.4 ± 0.3  53.9 ± 0.5  96.9 ± 0.5  99.3 ± 0.4   73.9
MMD-CORAL   71.2        54.6        72.1        53.9        97.3        98.7         74.6
GKE         74.9 ± 0.3  63.3 ± 0.0  78.6 ± 0.3  60.1 ± 0.0  96.7 ± 0.0  100.0 ± 0.0  78.9
Table 5 Accuracies (mean±standard deviations) of different methods for domain adaptation on the Office-Caltech data set.
Task     AlexNet    GFK       SA         TCA       DDC        RTN    DAN        DTLC-N     MMD-CORAL  GKE
A→C      84.6±0.2   76.2±0.0  82.1±0.0   81.2±0.0  85.0±0.2   88.1   88.0±0.3   87.2±0.7   89.1       88.4±0.1
A→D      88.5±0.3   86.0±0.0  87.3±0.0   82.8±0.0  89.0±0.2   95.5   92.8±0.2   93.1±0.4   96.6       99.7±0.3
A→W      83.1±0.2   89.5±0.0  81.7±0.0   84.4±0.0  86.1±0.3   95.2   96.1±0.1   93.2±0.5   95.7       97.6±0.0
C→A      91.8±0.2   90.7±0.0  92.1±0.0   92.1±0.0  91.9±0.2   93.7   93.5±0.2   93.2±0.6   93.6       93.5±0.0
C→D      89.0±0.2   77.1±0.0  86.6±0.0   87.9±0.0  88.8±0.3   94.2   91.4±0.3   91.4±0.3   93.4       94.3±0.0
C→W      83.1±0.2   78.0±0.0  84.1±0.0   88.1±0.0  85.4±0.2   96.9   96.3±0.1   92.3±0.6   95.2       98.3±0.0
D→A      89.3±0.2   89.8±0.0  84.6±0.0   90.4±0.0  89.5±0.2   93.8   94.3±0.2   92.8±0.5   94.7       93.5±0.1
D→C      80.9±0.3   77.9±0.0  76.6±0.0   79.6±0.0  81.1±0.3   84.6   82.4±0.3   82.9±0.3   84.7       83.8±0.1
D→W      97.7±0.2   97.0±0.0  97.0±0.0   96.9±0.0  98.2±0.1   99.2   99.0±0.1   99.3±0.3   99.4       99.7±0.0
W→A      83.8±0.3   88.5±0.0  82.1±0.0   85.6±0.0  84.9±0.3   92.5   93.4±0.2   93.3±0.6   94.8       94.4±0.0
W→C      77.7±0.3   77.1±0.0  74.1±0.0   75.5±0.0  78.0±0.3   86.6   87.3±0.3   82.1±0.4   86.5       88.9±0.1
W→D      100.0±0.0  98.1±0.0  100.0±0.0  99.4±0.0  100.0±0.0  100.0  100.0±0.0  100.0±0.0  100.0      100.0±0.0
Average  87.5       85.5      85.7       87.0      88.2       93.4   92.9       91.7       93.6       94.3
4.4. Results for Object Recognition
Table 4 reports the classification accuracy on the Office data set. For fair comparison, the results of CORAL [42], D-CORAL [43], RTN [32], JAN-A [33], MADA [39], DTLC-N [14] and MMD-CORAL [40] are taken directly from their original papers, and the results of the remaining baselines are taken directly from [30]. GKE obtains the best performance on five of the six tasks (tying with MADA on W → D). According to [30], the domains D and W are distributed very similarly, while both differ significantly from domain A; thus D → W and W → D are easy transfer tasks, and A → D, D → A, A → W and W → A are hard transfer tasks. From Table 4, on the hard transfer tasks, e.g., D → A and W → A, GKE improves the performance dramatically over the baseline methods (63.3% versus 57.5% for the second-best JAN-A on D → A, and 60.1% versus 56.3% on W → A), while on the easy transfer tasks, e.g., W → D, GKE achieves comparable accuracy. The promising results highlight the importance of exploiting the underlying geometric information of the input data, and suggest that GKE is able to learn more transferable representations for effective domain adaptation. We draw several interesting observations as follows.
• AlexNet is a standard deep learning approach, which performs worse than deep domain adaptation methods. This reveals that applying deep networks to learn abstract representations alone cannot effectively handle the domain shift issue [51, 33].
• Deep transfer models outperform both standard deep models and shallow methods in general, which indicates that introducing a domain discrepancy reduction module into deep neural networks helps learn more transferable representations.
• Compared to MMD-based methods, such as TCA, DDC, DAN and DTLC-N, GKE achieves better or comparable results, which confirms that our GCN is able to exploit the underlying geometric information of the data and is thus beneficial for learning discriminative representations.
For the Office-Caltech data set, we report the classification accuracy in Table 5. As in the experiments on the Office data set, we take the results of RTN, DTLC-N and MMD-CORAL from their original papers, and the results of the remaining baseline methods directly from [30]. We have similar observations as on the Office data set. GKE obtains the best average accuracy among all the methods. For example, on the task A → D, GKE obtains a significant performance improvement of 3.1% compared to the second best MMD-CORAL, and sets new state-of-the-art results. This clearly validates the effectiveness of the proposed method on object recognition tasks.
Table 6 Accuracies (%) of different methods for domain adaptation on the Testbed data set.
Method   C→I   I→C   C→S   S→C   I→S   S→I   Average
AlexNet  66.1  73.8  21.9  24.6  22.4  22.4  38.5
GFK      52.0  58.5  18.6  21.1  20.1  17.4  31.3
SA       43.7  52.0  13.9  15.8  15.1  14.3  25.8
TCA      48.6  54.0  15.6  14.6  14.8  12.0  26.6
CORAL    66.2  74.7  22.9  26.9  25.4  25.2  40.2
GKE      70.0  76.4  24.1  32.8  25.6  34.5  43.9
Table 7 Accuracies (%) of different methods for domain adaptation on the Multi-Language data set.
Task     SVM   TCA   GCMF  CORAL  GKE
FR→EN    43.0  64.3  71.5  62.8   73.0
GR→EN    45.7  57.7  64.2  60.8   70.2
IT→EN    32.0  52.8  57.8  59.3   66.3
SP→EN    35.8  47.0  66.5  62.0   67.2
EN→FR    66.3  64.2  69.5  74.5   74.7
GR→FR    37.0  61.2  70.0  66.3   68.8
IT→FR    46.0  65.5  66.0  64.8   72.3
SP→FR    43.7  61.0  67.2  65.3   71.5
EN→GR    48.2  62.3  67.0  66.2   72.0
FR→GR    53.2  49.8  71.2  67.3   72.8
IT→GR    27.8  55.7  65.8  58.8   68.5
SP→GR    29.2  47.0  56.7  53.3   66.5
EN→IT    40.0  57.0  59.0  55.5   64.5
FR→IT    41.7  58.7  63.0  60.2   64.3
GR→IT    32.2  47.8  58.8  55.5   65.3
SP→IT    41.0  55.0  59.8  63.0   65.8
EN→SP    42.3  52.8  60.3  58.8   63.0
FR→SP    50.7  49.8  64.7  64.7   67.7
GR→SP    32.7  50.5  57.8  58.0   62.2
IT→SP    62.2  69.7  62.5  68.7   69.3
Average  42.5  56.5  64.0  62.3   68.3
Figure 3: Sensitivity of the parameters 𝑘, 𝜆 and 𝛿 on task W → A on the Office data set. (Recovered panel: accuracy (%) of GKE and JAN-A versus 𝑘 on a log2 scale.)
Figure 4: The t-SNE visualization of sample features on tasks A → W and W → A on the Office-Caltech data set. Panels: (a) AlexNet: A → W; (b) GKE: A → W; (c) AlexNet: A → W; (d) GKE: A → W; (e) AlexNet: W → A; (f) GKE: W → A; (g) AlexNet: W → A; (h) GKE: W → A. Legend: source and target domains; classes 1 to 10.
4.5. Results for Image Classification
For the image classification experiments, we present the classification accuracy in Table 6. The results of the baseline methods are taken directly from [42]. Similar observations can be drawn as on the Office and Office-Caltech data sets. GKE outperforms the baseline methods on all 6 learning tasks. This clearly validates the effectiveness of the proposed method for image classification applications.
4.6. Results for Text Categorization
To investigate the effectiveness of the proposed method on the text categorization problem, we compare GKE with SVM [8], TCA [37], GCMF [31] and CORAL [42]. We present the experimental results on the Multi-Language data set in Table 7. SVM only achieves an average classification accuracy of 42.5%, and performs very poorly on many of the learning tasks, which indicates that the adaptation difficulty varies a lot across the 20 learning tasks. TCA, GCMF, CORAL and GKE perform better than SVM, which validates that domain adaptation methods can indeed improve the target prediction performance. Compared to TCA and CORAL, which seek to minimize the domain discrepancy but omit the geometric information of the data, GKE achieves better performance. In addition, we observe that GKE outperforms the baseline methods on most tasks (18 out of 20), and obtains an average classification accuracy of 68.3%, a significant improvement of 4.3% over the second best GCMF. This demonstrates that our proposed model is able to effectively handle text categorization problems.
Figure 5: Convergence performance of tasks W → A and W → C on the Office-Caltech data set. Panels: (a) Accuracy w.r.t. epoch; (b) Ω w.r.t. epoch.
4.7. Parameters Sensitivity Analysis
We use the learning task W → A on the Office data set to evaluate the sensitivity of the parameters 𝑘, 𝜆 and 𝛿. We vary one of them over a range while fixing the rest. Specifically, we vary 𝑘 ∈ {2^1, 2^2, …, 2^6}, 𝜆 ∈ {10^0, 10^1, …, 10^5} and 𝛿 ∈ {2^0, 2^1, …, 2^5}. Figure 3 presents the results, including the best accuracy among the baseline methods (in this task, JAN-A). We observe that as the parameters 𝑘, 𝜆 and 𝛿 increase, the accuracy of GKE first increases, reaches its top performance, and then decreases. The stable accuracy curves indicate that GKE can outperform the baselines consistently. Specifically, for the MMD regularization parameter 𝜆, if 𝜆 is set too small, the MMD regularizer has little effect on the learned representation, while if 𝜆 is set too large, the heavy regularizer forces all data points to stay too close to each other, leading to poor representation learning.
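The one-at-a-time protocol described above can be sketched as follows. The function `sensitivity_sweep`, the default values for 𝜆 and 𝛿, and the scoring function `toy_evaluate` are hypothetical placeholders introduced for this example; in particular, `toy_evaluate` is a synthetic surrogate and does not reproduce any accuracy reported in the paper. In practice it would be replaced by a routine that trains GKE and returns the target accuracy.

```python
import math


def sensitivity_sweep(evaluate, defaults, grids):
    """Vary one parameter at a time over its grid while fixing the others."""
    curves = {}
    for name, values in grids.items():
        curve = []
        for value in values:
            params = dict(defaults)   # copy so other parameters stay at defaults
            params[name] = value
            curve.append((value, evaluate(**params)))
        curves[name] = curve
    return curves


# Grids as stated in the text.
grids = {
    "k": [2**p for p in range(1, 7)],        # 2^1, ..., 2^6
    "lam": [10.0**p for p in range(6)],      # 10^0, ..., 10^5
    "delta": [2.0**p for p in range(6)],     # 2^0, ..., 2^5
}

# k = 4 is the Office setting from the text; lam and delta defaults are assumptions.
defaults = {"k": 4, "lam": 100.0, "delta": 4.0}


def toy_evaluate(k, lam, delta):
    """Synthetic surrogate score peaking at the defaults; NOT real results."""
    return (60.0
            - abs(math.log2(k) - 2.0)
            - abs(math.log10(lam) - 2.0)
            - abs(math.log2(delta) - 2.0))


curves = sensitivity_sweep(toy_evaluate, defaults, grids)
for name, curve in curves.items():
    best_value, best_score = max(curve, key=lambda pair: pair[1])
    print(name, best_value, round(best_score, 2))
```

Sweeping one parameter at a time requires 6 + 6 + 6 = 18 evaluations here, compared with 6 × 6 × 6 = 216 for a full grid, at the cost of ignoring interactions between parameters.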
4.8. Feature Visualization
We use the t-SNE [34] tool to analyse the feature transferability of the learning tasks A → W and W → A on the Office-Caltech data set with the features learned by AlexNet and GKE. For a clear view, we randomly choose 100 samples from each domain. Figure 4 shows the results. Specifically, for task A → W, Figures 4(a) and 4(b) present the representations learned by AlexNet and GKE, respectively. In Figures 4(c) and 4(d), we use different colors to indicate the 10 classes. Several observations can be drawn as follows.
• From Figures 4(a) and 4(b), we observe that the source and target representations learned by GKE follow more similar distributions than those learned by AlexNet, which demonstrates that the domain discrepancy between the source and target domains is dramatically reduced through GKE.
• Figures 4(c) and 4(d) indicate that GKE performs better in aligning the samples that share a common class, leading to the improvement of prediction performance.
• From Figures 4(b) and 4(d), we can see that not only is the domain discrepancy between the source and target data reduced, but the distances across different classes are also increased, which demonstrates that GKE is able to obtain transferable and discriminative representations.
The descriptions of Figures 4(e), 4(f), 4(g) and 4(h) for task W → A are similar to those for task A → W, and we draw similar observations as on task A → W.
4.9. Convergence Performance
We examine the convergence performance by taking the learning tasks W → A and W → C on the Office-Caltech data set as examples, and report the results in Figure 5. Figures 5(a) and 5(b) show the accuracy and the domain discrepancy w.r.t. the learning epoch, respectively, in which the function Ω(⋅, ⋅) is defined in Eq. (7). We observe that the two learning tasks have similar convergence behavior. As the number of epochs increases, the accuracy improves and converges to a stable value, and the domain discrepancy between the source and target domains decreases and converges quickly. This further validates the effectiveness of the proposed method.
5. Conclusion
In this paper, we have presented a novel method called GKE to exploit the underlying geometric information of the source and target data for solving the unsupervised domain adaptation problem. Specifically, we apply the graph convolutional network to learn discriminative features by leveraging the similarity relationship between samples, and introduce MMD into the graph convolutional network to learn transferable representations. We conduct comprehensive experiments on real-world data sets and compare GKE with several well-known domain adaptation methods. Experimental results demonstrate that GKE is able to obtain promising performance in applications such as object recognition, image classification and text categorization. It is worth noting that other CNN models such as ResNet may also be adopted in the proposed GKE. For fair comparison, we only use AlexNet and leave the work on ResNet as future work. In addition, the recently proposed adversarial loss can also be used to estimate the domain discrepancy; we will study how to effectively apply the adversarial loss to our model. Moreover, predicting the labels of further input vectors on the target domain after training is an interesting direction for future work.
Acknowledgements
This work was supported by National Natural Science Foundation of China (NSFC) 61876208; Guangdong Provincial Scientific and Technological funds 2017B090901008, 2018B010108002; Natural Science Foundation of Guangdong Province 2015A030310446; National Key R&D Program of China 2018YFC0830900; Pre-Research Foundation of China 61400010205; Pearl River S&T Nova Program of Guangzhou 201806010081; CCF-Tencent Open Research Fund RAGR20190103; and Hong Kong Research Grant Council GRF 12306616, 12200317, 12300218 and 12300519.
References
[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] Amini, M., Usunier, N., Goutte, C., 2009. Learning from multiple partially observed views-an application to multilingual text categorization, in: Advances in Neural Information Processing Systems, pp. 28–36.
[3] Belkin, M., Niyogi, P., 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural Information Processing Systems, pp. 585–591.
[4] Belkin, M., Niyogi, P., Sindhwani, V., 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, 2399–2434.
[5] Cai, D., He, X., Han, J., Huang, T.S., 2010. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1548–1560.
[6] Cai, D., He, X., Wang, X., Bao, H., Han, J., 2009. Locality preserving nonnegative matrix factorization, in: International Joint Conferences on Artificial Intelligence, pp. 1010–1015.
[7] Cao, Y., Long, M., Wang, J., 2018. Unsupervised domain adaptation with distribution matching machines, in: Association for the Advancement of Artificial Intelligence.
[8] Chang, C.C., Lin, C.J., 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27.
[9] Chen, M., Weinberger, K.Q., Blitzer, J., 2011. Co-training for domain adaptation, in: Advances in Neural Information Processing Systems, pp. 2456–2464.
[10] Chen, Z., Zhang, W., 2013. Domain adaptation with topical correspondence learning, in: International Joint Conferences on Artificial Intelligence.
[11] Chu, W.S., De la Torre, F., Cohn, J.F., 2013. Selective transfer machine for personalized facial action unit detection, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3515–3522.
[12] Chu, W.S., De la Torre, F., Cohn, J.F., 2017. Selective transfer machine for personalized facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 529–545.
[13] Defferrard, M., Bresson, X., Vandergheynst, P., 2016. Convolutional neural networks on graphs with fast localized spectral filtering, in: Advances in Neural Information Processing Systems, pp. 3844–3852.
[14] Ding, Z., Fu, Y., 2018. Deep transfer low-rank coding for cross-domain learning. IEEE Transactions on Neural Networks and Learning Systems.
[15] Ding, Z., Li, S., Shao, M., Fu, Y., 2018. Graph adaptive knowledge transfer for unsupervised domain adaptation, in: Proceedings of the European Conference on Computer Vision, pp. 37–52.
[16] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T., 2014. DeCAF: A deep convolutional activation feature for generic visual recognition, in: International Conference on Machine Learning, pp. 647–655.
[17] Duan, L., Xu, D., Tsang, I.W.H., Luo, J., 2010. Visual event recognition in videos by learning from web data, in: IEEE Conference on Computer Vision and Pattern Recognition.
[18] Duan, L., Xu, D., Tsang, I.W.H., Luo, J., 2012. Visual event recognition in videos by learning from web data. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 1667–1680.
[19] Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T., 2013. Unsupervised visual domain adaptation using subspace alignment, in: IEEE International Conference on Computer Vision, pp. 2960–2967.
[20] Fout, A., Byrd, J., Shariat, B., Ben-Hur, A., 2017. Protein interface prediction using graph convolutional networks, in: Advances in Neural Information Processing Systems, pp. 6530–6539.
[21] Ganin, Y., Lempitsky, V., 2015. Unsupervised domain adaptation by backpropagation, in: International Conference on Machine Learning, pp. 1180–1189.
[22] Gong, B., Shi, Y., Sha, F., Grauman, K., 2012. Geodesic flow kernel for unsupervised domain adaptation, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 2066–2073.
[23] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
[24] Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A., 2012. A kernel two-sample test. Journal of Machine Learning Research 13, 723–773.
[25] Griffin, G., Holub, A., Perona, P., 2007. Caltech-256 object category dataset.
[26] Huang, J., Gretton, A., Borgwardt, K.M., Schölkopf, B., Smola, A.J., 2007. Correcting sample selection bias by unlabeled data, in: Advances in Neural Information Processing Systems, pp. 601–608.
[27] Kipf, T.N., Welling, M., 2017. Semi-supervised classification with graph convolutional networks, in: International Conference on Learning Representations.
[28] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097–1105.
[29] Ktena, S.I., Parisot, S., Ferrante, E., Rajchl, M., Lee, M., Glocker, B., Rueckert, D., 2017. Distance metric learning using graph convolutional networks: Application to functional brain networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 469–477.
[30] Long, M., Cao, Y., Cao, Z., Wang, J., Jordan, M.I., 2018. Transferable representation learning with deep adaptation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[31] Long, M., Wang, J., Ding, G., Shen, D., Yang, Q., 2014. Transfer learning with graph co-regularization. IEEE Transactions on Knowledge and Data Engineering 26, 1805–1818.
[32] Long, M., Zhu, H., Wang, J., Jordan, M.I., 2016. Unsupervised domain adaptation with residual transfer networks, in: Advances in Neural Information Processing Systems, pp. 136–144.
[33] Long, M., Zhu, H., Wang, J., Jordan, M.I., 2017. Deep transfer learning with joint adaptation networks, in: International Conference on Machine Learning, pp. 2208–2217.
[34] Maaten, L.v.d., Hinton, G., 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.
[35] Moon, S., Carbonell, J.G., 2017. Completely heterogeneous transfer learning with attention-what and what not to transfer, in: International Joint Conferences on Artificial Intelligence, pp. 1–2.
[36] Pan, S.J., Kwok, J.T., Yang, Q., Pan, J.J., 2007. Adaptive localization in a dynamic WiFi environment through multi-view learning, in: Association for the Advancement of Artificial Intelligence, pp. 1108–1113.
[37] Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q., 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22, 199–210.
[38] Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 1345–1359.
[39] Pei, Z., Cao, Z., Long, M., Wang, J., 2018. Multi-adversarial domain adaptation, in: Association for the Advancement of Artificial Intelligence.
[40] Rahman, M.M., Fookes, C., Baktashmotlagh, M., Sridharan, S., 2019. On minimum discrepancy estimation for deep domain adaptation. arXiv preprint arXiv:1901.00282.
[41] Saenko, K., Kulis, B., Fritz, M., Darrell, T., 2010. Adapting visual category models to new domains, in: European Conference on Computer Vision, pp. 213–226.
[42] Sun, B., Feng, J., Saenko, K., 2016. Return of frustratingly easy domain adaptation, in: Association for the Advancement of Artificial Intelligence, p. 8.
[43] Sun, B., Saenko, K., 2016. Deep CORAL: Correlation alignment for deep domain adaptation, in: European Conference on Computer Vision, Springer. pp. 443–450.
[44] Tommasi, T., Tuytelaars, T., 2014. A testbed for cross-dataset analysis, in: European Conference on Computer Vision, Springer. pp. 18–31.
[45] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T., 2017. Adversarial discriminative domain adaptation, in: IEEE Conference on Computer Vision and Pattern Recognition, p. 4.
[46] Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T., 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
[47] Wang, X., Ye, Y., Gupta, A., 2018. Zero-shot recognition via semantic embeddings and knowledge graphs, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6857–6866.
[48] Weston, J., Ratle, F., Mobahi, H., Collobert, R., 2012. Deep learning via semi-supervised embedding, in: Neural Networks: Tricks of the Trade. Springer, pp. 639–655.
[49] Yan, Y., Li, W., Ng, M., Tan, M., Wu, H., Min, H., Wu, Q., 2017. Learning discriminative correlation subspace for heterogeneous domain adaptation, in: International Joint Conferences on Artificial Intelligence, pp. 3252–3258.
[50] Yan, Y., Li, W., Wu, H., Min, H., Tan, M., Wu, Q., 2018. Semi-supervised optimal transport for heterogeneous domain adaptation, in: International Joint Conferences on Artificial Intelligence, pp. 2969–2975.
[51] Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems, pp. 3320–3328.
[52] Zhang, J., Li, W., Ogunbona, P., 2017. Joint geometrical and statistical alignment for visual domain adaptation, in: International Conference on Computer Vision and Pattern Recognition.
[53] Zhang, L., Zhang, L., Du, B., You, J., Tao, D., 2019. Hyperspectral image unsupervised classification by robust manifold matrix factorization. Information Sciences 485, 154–169.
[54] Zhang, L., Zhang, Q., Du, B., Huang, X., Tang, Y.Y., Tao, D., 2016a. Simultaneous spectral-spatial feature selection and extraction for hyperspectral images. IEEE Transactions on Cybernetics 48, 16–28.
[55] Zhang, L., Zuo, W., Zhang, D., 2016b. LSDT: Latent sparse domain transfer learning for visual adaptation. IEEE Transactions on Image Processing 25, 1177–1191.
[56] Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B., 2004. Learning with local and global consistency, in: Advances in Neural Information Processing Systems, pp. 321–328.
[57] Zhou, J.T., Pan, S.J., Tsang, I.W., Ho, S.S., 2016. Transfer learning for cross-language text categorization through active correspondences construction, in: Association for the Advancement of Artificial Intelligence, pp. 2400–2406.
[58] Zhou, J.T., Pan, S.J., Tsang, I.W., Yan, Y., 2014. Hybrid heterogeneous transfer learning through deep learning, in: Association for the Advancement of Artificial Intelligence, pp. 2213–2220.
[59] Zhu, X., Ghahramani, Z., Lafferty, J.D., 2003. Semi-supervised learning using Gaussian fields and harmonic functions, in: International Conference on Machine Learning, pp. 912–919.