Projected Transfer Sparse Coding for cross domain image representation

J. Vis. Commun. Image R. 33 (2015) 265–272
Xiao Li, Min Fang (corresponding author), Ju-Jie Zhang
School of Computer Science and Technology, Xidian University, 710071, China

Article history: Received 30 July 2015; Accepted 28 September 2015; Available online 8 October 2015.

Keywords: Image representation; Domain adaptation; Sparse coding; Projection matrix; L2,1 norm; Shared dictionary; Maximum Mean Discrepancy; K-SVD

Abstract

Sparse coding has been used successfully for image representation. However, when there is considerable variation between the source and target domains, sparse coding cannot achieve satisfactory results. In this paper, we propose a Projected Transfer Sparse Coding (PTSC) algorithm. To reduce the distribution difference, we project the source and target data into a shared low dimensional space, in which we jointly learn a projection matrix, a shared dictionary, and the sparse codes of the source and target data. Unlike existing methods, the sparse representations are learnt from the projected data, which are invariant to the distribution difference and to irrelevant source samples; the sparse representations are therefore robust and improve classification performance. No explicit correspondence across domains is required. We learn the projection matrix, the discriminative sparse representations, and the dictionary in a unified objective function. Our image representation method yields state-of-the-art results.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Image representation is one of the major topics in computer vision, in which sparse coding has been used successfully and achieves outstanding performance. Sparse coding yields a sparse representation, so that each datum is represented by a linear combination of a few dictionary elements. Olshausen et al. [1] pointed out that natural images can be well represented by sparse coding. Recently, sparse coding has been applied successfully to computer vision problems, e.g., image classification [2,3], object recognition [4] and face recognition [5–7]. Sparse coding has been used for computer vision tasks in three main ways. Firstly, it has been used in the feature extraction stage of the Bag-of-Words (BoW) model, boosting the performance of image classification [2]. Secondly, it can be used for image representation [8,9]: a sparse representation uses a few dictionary items to represent an image, making the representation easy to understand and interpret. Thirdly, for face recognition, the training images are often chosen as the dictionary; a test image is then represented over that dictionary and classified by assigning the class with the lowest reconstruction error, so sparse coding can act as a sparse representation-based classifier [5–7].


The merit of sparse coding is that samples can be well interpreted by a linear combination of a few dictionary elements. Besides, sparse representations capture the main information of images and thus resist noise to a certain degree. Consider a motivating example: we aim to classify images obtained by a web camera, but the labeled images are too limited to train a robust image classifier, and annotating the unlabeled images is expensive and laborious. We regard this dataset as the target domain. Fortunately, we have a large amount of labeled images obtained by a digital SLR camera, namely the source domain, so our goal is to leverage the source domain images to help the target classification task. When the feature distribution of the source domain differs greatly from that of the target domain, directly applying classifiers or object models trained on the source domain to the target domain performs poorly. Visual domains differ greatly in image distribution even when they contain images of the same categories, owing to factors such as scene, location, pose, viewing angle and background clutter. Transfer learning [8,10,11] has been widely studied to solve this problem, producing excellent results; a survey on transfer learning is presented in [12]. Sparse coding methods may not be suitable when training and test data have different distributions. In recent years, many researchers have focused on the use of sparse coding in transfer learning. In [8,13], source and target samples are combined to learn a shared dictionary and the sparse representations of all source and target samples.


However, when the source domain is very different from the target domain, some source samples are not relevant to the target samples in the original space, even when Maximum Mean Discrepancy (MMD) is used to reduce the distribution distance. In [14], Shekhar et al. proposed a method that jointly learns the projections of source and target data and a dictionary in the projected low dimensional space, using the few labeled data available in the target domain. Qiu et al. [15] used regression to adapt dictionaries. In practical applications, however, labels for the target data may be unobtainable.

In this paper, we focus on the unsupervised domain adaptation problem, where target labels are unavailable. We project the source and target data onto a low dimensional space. The projection matrix is constrained by Maximum Mean Discrepancy to reduce the distribution difference between the source and target domains, which helps learn a compact shared dictionary. Some samples in the source domain may be irrelevant to the target domain even in the low dimensional shared space, so we constrain the source projection matrix according to the samples' relevance to the target data using the L2,1 norm, which makes the rows of the matrix sparse and thereby selects source samples; irrelevant samples are discarded. We learn the projection matrix, the discriminative sparse representations of all samples, and the dictionary in a unified objective function. The learned low dimensional representation thus captures the common structure of the data; moreover, the dictionary is representative of the projected data from both domains, and the sparse representations are effective for representing them. We perform experiments on the USPS–MNIST, MSRC–VOC2007 and Office–Caltech256 dataset pairs, whose domains follow different distributions. The results demonstrate that our method performs better than other state-of-the-art methods.

This paper is organized as follows. Section 2 provides a brief review of related work. In Section 3, we develop a novel Projected Transfer Sparse Coding (PTSC) algorithm to deal with cross domain image representation problems. In Section 4, the optimization is derived. In Section 5, extensive experiments are conducted, and Section 6 presents conclusions and future work.

2. Related work

Researchers have developed many algorithms to obtain sparse and efficient representations of images. Sparse coding has received growing attention because of its advantages and promising performance in many computer vision applications [5,8,13]. Given input samples Y = [y_1, ..., y_n] \in R^{d \times n}, the sparse codes X \in R^{m \times n} and the dictionary D = [d_1, ..., d_m] \in R^{d \times m} can be trained by solving the following problem:

    (D, X) = \arg\min_{D,X} \|Y - DX\|_F^2  \quad s.t. \ \forall i, \|x_i\|_0 \le T    (1)

where x_i is a column of X, and T is the sparsity factor, meaning that every sparse code has fewer than T nonzero elements. The Frobenius norm is defined as \|P\|_F = \sqrt{\sum_i \sum_j P_{ij}^2}. There are many optimization methods for the sparse coding problem, such as matching pursuit [16], orthogonal matching pursuit [17] and basis pursuit [18]. The dictionary can be learned from the data by optimization algorithms such as K-SVD [19] and MOD [20]. The K-SVD algorithm is efficient and requires few iterations to converge; therefore, we use K-SVD in our experiments.
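As an illustration of problem (1), the sketch below greedily codes one signal in the spirit of orthogonal matching pursuit [17]. It is a minimal NumPy sketch, assuming the dictionary columns are L2-normalized; the safeguards of production solvers are omitted.

import numpy as np

def omp(D, y, T):
    # Greedy approximation of: min_x ||y - D x||_2  s.t.  ||x||_0 <= T,
    # assuming the columns (atoms) of D have unit L2 norm.
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(T):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        support.append(j)
        # refit the coefficients on the chosen support by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x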

However, traditional sparse coding methods [5,19] do not consider the distribution mismatch, known as the distribution difference, between the training and testing domains. Transfer learning aims to deal with the different distributions of the source and target domains. A classical strategy is to learn a new domain invariant feature representation by minimizing the distribution divergence, measured by the empirical Maximum Mean Discrepancy (MMD) [8,13,21–23]. Domain Transfer Multiple Kernel Learning [22] simultaneously minimizes the structural risk of an SVM and the MMD in kernel space. Duan et al. [23] proposed Adaptive Multiple Kernel Learning (AMKL) to cope with the distribution mismatch between the web and consumer video domains. Transfer Component Analysis (TCA), proposed in [10], focuses on reducing the distance between the two marginal distributions while maximizing the data variance. SA [24] learns a linear transformation between the subspaces of the domains induced by their eigenvectors. Long et al. [25] proposed a method that incorporates feature matching and instance reweighting, minimizing the MMD and an L2,1 norm together with Principal Component Analysis to construct new representations that are robust to the distribution difference. In this paper, we also use the L2,1 norm, constraining the source projection matrix so as to eliminate irrelevant source samples.

The Transfer Sparse Coding (TSC) algorithm [8] constructs sparse representations for images by incorporating sparse coding, MMD and a graph Laplacian into a unified objective. Al-Shedivat et al. [13] presented a method unifying sparse representations, domain transfer and classification into one objective. Both learn the sparse representations in the original space, whereas in our method the source and target data are projected onto a low dimensional space. Zhu and Shao [26] proposed a method that measures the cross domain divergence by constructing virtual correspondences across the domains through a transformation matrix; it learns two dictionaries, one per domain. Shekhar et al. [14] proposed to learn the projections of source and target data and a shared dictionary in the projected low dimensional space in a supervised way. Huang and Wang [28] proposed to solve cross domain image synthesis and recognition problems by jointly learning a coupled dictionary and a projection matrix. In contrast, we focus on the unsupervised domain adaptation problem, where the labels of the target data are unavailable. Partially Shared Dictionary Learning (PSDL) [27], proposed by Ranjan et al., jointly learns different dictionaries for the source and target domains; the dictionary elements common across domains are then used to obtain sparse representations on which a classifier is trained. PSDL learns separate dictionaries in the original space, while we learn the dictionary and sparse codes in a common space and simultaneously reweight the source samples.

3. Projected Transfer Sparse Coding (PTSC) algorithm

Define the source domain set D_s = {Y_s, H_s} and the target domain set D_t = {Y_t}, in which Y_s \in R^{d \times n_s} is the source data, H_s \in R^{n_s \times 1} holds the labels of the source data, and Y_t \in R^{d \times n_t} is the target data. n_s is the number of source samples and n_t the number of target samples. Let Y = [Y_s, Y_t] \in R^{d \times n}, with n = n_s + n_t. Source and target samples come from different probability distributions. In this paper, we introduce a unified objective function to solve the cross domain image representation problem, which we call the Projected Transfer Sparse Coding algorithm.

3.1. Sparse coding

We use a matrix P \in R^{k \times d} to project the source and target data onto the shared subspace, k being the dimension of the subspace.
To obtain the dictionary D \in R^{k \times p} and the sparse codes X \in R^{p \times n}, the projected source and target data are reconstructed by minimizing the following reconstruction error:


    \min_{D,X} \|PY - DX\|_F^2  \quad s.t. \ \forall i, \|x_i\|_0 \le T    (2)

Since k is smaller than d, the data are projected onto a low dimensional feature space, in which the dimension of each dictionary atom is small enough to handle. We learn the projection matrix and the dictionary jointly in order to preserve the common structure of the source and target data. Suppose P = B'Y'; then PY = B'Y'Y. Inspired by the satisfactory results of the TCA algorithm [10], we replace Y'Y with a kernel matrix K, where K(y_i, y_j) = y_i' y_j, so the new representation in the shared subspace is B'K, whose i-th column represents the sample y_i. B \in R^{n \times k} transforms the features into a k-dimensional space. Any kernel satisfying Mercer's condition [29] can be used, such as the Gaussian kernel K(y_i, y_j) = exp(-t \|y_i - y_j\|^2), the polynomial kernel K(y_i, y_j) = (y_i' y_j + 1)^t, or the linear kernel K(y_i, y_j) = y_i' y_j. Thus, problem (2) can be reformulated as

    \min_{D,X} \|B'K - DX\|_F^2  \quad s.t. \ \forall i, \|x_i\|_0 \le T    (3)
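For reference, the kernel matrix K above can be formed directly from the stacked data Y = [Y_s, Y_t]. A small NumPy sketch for the three kernels named follows; the function name and the parameter t are ours.

import numpy as np

def kernel_matrix(Y, kind="linear", t=1.0):
    # K[i, j] = K(y_i, y_j) for the columns y_i of the d x n data matrix Y.
    G = Y.T @ Y                          # inner products y_i' y_j
    if kind == "linear":
        return G
    if kind == "poly":
        return (G + 1.0) ** t            # (y_i' y_j + 1)^t
    if kind == "gaussian":
        sq = np.sum(Y * Y, axis=0)       # squared norms ||y_i||^2
        return np.exp(-t * (sq[:, None] + sq[None, :] - 2.0 * G))
    raise ValueError("unknown kernel: " + kind)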

3.2. Domain transfer

We want to reweight the source data according to their relevance to the target data. Therefore, we constrain the source projection matrix by the L2,1 norm, as proposed in [25], which makes B row sparse and thereby reweights the source samples. B can be partitioned as B = [B_s; B_t], in which B_s projects the source data and B_t projects the target data. This gives the following instance reweighting term on the projection matrix:

    \|B_s\|_{2,1}^2 + \|B_t\|_F^2    (4)

Each row of B represents the importance of one sample. Since the L2,1 norm leads to row sparsity, applying it to the source projection matrix eliminates the irrelevant source samples. Here \|B\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{k} B_{ij}^2} denotes the L2,1 norm of a matrix B.

The distribution divergence between the projected source and target data is measured by the nonparametric Maximum Mean Discrepancy in an infinite dimensional Reproducing Kernel Hilbert Space (RKHS) [10]:

    dist(Y_s, Y_t)^2 = \| (1/n_s) \sum_{i=1}^{n_s} B'k_i - (1/n_t) \sum_{j=1}^{n_t} B'k_j \|^2 = tr(B'KMK'B)    (5)

where k_i denotes the i-th column of K and

    M_{ij} = 1/(n_s n_s) if y_i, y_j \in D_s;  M_{ij} = 1/(n_t n_t) if y_i, y_j \in D_t;  M_{ij} = -1/(n_s n_t) otherwise    (6)

The distribution distance between the source and target domains can be reduced by minimizing this term.
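The MMD matrix M of Eq. (6) depends only on the sample counts; a minimal sketch:

import numpy as np

def mmd_matrix(ns, nt):
    # M from Eq. (6): 1/ns^2 on the source block, 1/nt^2 on the target block,
    # -1/(ns*nt) elsewhere, so that tr(B'KMK'B) is the squared mean distance.
    n = ns + nt
    M = np.full((n, n), -1.0 / (ns * nt))
    M[:ns, :ns] = 1.0 / (ns * ns)
    M[ns:, ns:] = 1.0 / (nt * nt)
    return M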

3.3. The PTSC model

Therefore, combining (3)-(5), we have the following unified objective function:

    \min_{B,D,X} \|B'K - DX\|_F^2 + \alpha (\|B_s\|_{2,1}^2 + \|B_t\|_F^2) + \beta \, tr(B'KMK'B)  \quad s.t. \ \forall i, \|x_i\|_0 \le T    (7)

Similar to the TCA algorithm, to avoid the trivial solution (B = 0), we also impose the constraint B'KHK'B = I, in which H = I - (1/n) 1 1' is the centering matrix, I \in R^{n \times n} is the identity matrix and 1 \in R^n is the column vector of all ones. Thus, the objective function becomes:

    \min_{B,D,X} \|B'K - DX\|_F^2 + \alpha (\|B_s\|_{2,1}^2 + \|B_t\|_F^2) + \beta \, tr(B'KMK'B)
    s.t. \ B'KHK'B = I, \ \forall i, \|x_i\|_0 \le T    (8)

Here \alpha and \beta are regularization parameters that weight the instance reweighting term and the MMD regularization term. It is worth noting that when \alpha is zero, the objective is equivalent to the combination of TCA and sparse coding; we call this special case PTSC1, which does not consider the instance reweighting term. We also evaluate PTSC1 in the experiments to isolate the impact of reweighting the source samples.
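For monitoring convergence, the value of objective (8) can be evaluated directly. A sketch under the notation above; the function name is ours, B is n x k with the first n_s rows belonging to the source:

import numpy as np

def ptsc_objective(B, K, D, X, M, ns, alpha, beta):
    # Reconstruction + instance reweighting + MMD terms of Eq. (8).
    Z = B.T @ K                                    # projected data, k x n
    recon = np.linalg.norm(Z - D @ X, "fro") ** 2
    l21 = np.sum(np.linalg.norm(B[:ns], axis=1))   # ||B_s||_{2,1} = sum of row norms
    reweight = l21 ** 2 + np.linalg.norm(B[ns:], "fro") ** 2
    mmd = np.trace(B.T @ K @ M @ K.T @ B)
    return recon + alpha * reweight + beta * mmd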

4. Optimization of PTSC

To solve the problem, we update the dictionary, the sparse codes and the projection matrix alternately: first we update X and D with B fixed, then we update B with D and X fixed. The detailed optimization procedures are presented below.

4.1. Updating D and X

With B fixed, it can be seen from the unified objective function (8) that only the first term involves the dictionary, so we only need to solve the following problem:

    \min_{D,X} \|B'K - DX\|_F^2  \quad s.t. \ \forall i, \|x_i\|_0 \le T    (9)

Since B is fixed, the signal matrix B'K is known, and we can obtain the dictionary D and the sparse codes X by the K-SVD algorithm [19]. Writing Z = B'K, the objective takes the simpler form:

    \min_{D,X} \|Z - DX\|_F^2  \quad s.t. \ \forall i, \|x_i\|_0 \le T    (10)

Define E_i = Z - \sum_{j \ne i} d_j x^j, where d_j is the j-th atom (column) of D and x^j is the j-th row of X. For atom i, the problem is equivalent to:

    \min_{d_i, x^i} \|E_i - d_i x^i\|_F^2    (11)

Thus we decompose E_i via the SVD:

    U \Sigma V' = SVD(E_i), \quad d_i = U(:, 1), \quad x^i = \Sigma(1, 1) V(1, :)    (12)

where U(:, 1) denotes the first column of U and V(1, :) the first row of V. In this way we obtain the dictionary atoms and the corresponding sparse coefficient rows one by one, and finally the full dictionary D and sparse code matrix X.
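A toy K-SVD driver implementing the updates (11)-(12) is sketched below. It reuses the omp sketch from Section 2, assumes at least p signals, and, as noted in the comments, omits the support restriction of the full K-SVD [19] for brevity.

import numpy as np

def update_atom(Z, D, X, i):
    # Rank-1 update of atom d_i and coefficient row x^i via Eq. (12).
    # (Full K-SVD restricts E_i to the columns whose codes use atom i;
    # that refinement is omitted in this sketch.)
    E = Z - D @ X + np.outer(D[:, i], X[i, :])   # residual with atom i removed
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, i] = U[:, 0]                            # d_i <- first left singular vector
    X[i, :] = S[0] * Vt[0, :]                    # x^i <- sigma_1 * first right singular row

def ksvd(Z, p, T, n_iter=10):
    # Alternate sparse coding (OMP) and dictionary updates on the signals Z (k x n).
    rng = np.random.default_rng(0)
    D = Z[:, rng.choice(Z.shape[1], size=p, replace=False)].copy()  # init atoms from data
    D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)
    X = np.zeros((p, Z.shape[1]))
    for _ in range(n_iter):
        for j in range(Z.shape[1]):
            X[:, j] = omp(D, Z[:, j], T)         # coding stage, Eq. (10)
        for i in range(p):
            update_atom(Z, D, X, i)              # dictionary stage, Eq. (12)
    return D, X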

4.2. Updating B

Nguyen et al. [4] proved that the optimal dictionary has the form D = ZA = B'KA, with A \in R^{n \times p}. Plugging this into Eq. (8), we obtain:

    \min_{B,X} \|B'K - B'KAX\|_F^2 + \alpha (\|B_s\|_{2,1}^2 + \|B_t\|_F^2) + \beta \, tr(B'KMK'B)
    s.t. \ B'KHK'B = I, \ \forall i, \|x_i\|_0 \le T    (13)

When X and D are fixed, the objective function can be written as:


    \min_{B} tr(B'K((I - AX)'(I - AX) + \beta M)K'B) + \alpha (\|B_s\|_{2,1}^2 + \|B_t\|_F^2)  \quad s.t. \ B'KHK'B = I    (14)


Let

    L = (I - AX)'(I - AX) + \beta M
    W = diag(b_1, ..., b_n), \quad b_i = 1/(2\|B_i\|) for i \in [1, n_s], \quad b_i = 1 for i \in [n_s + 1, n]    (15)

where B_i denotes the i-th row of B.

The objective can then be written as:

    \min_{B} tr(B'KLK'B) + \alpha \, tr(B'WB)  \quad s.t. \ B'KHK'B = I    (16)

Taking the Lagrangian of the function with Lagrange multiplier Q:

    \min_{B} tr(B'KLK'B) + \alpha \, tr(B'WB) - tr((B'KHK'B - I)Q)    (17)

Setting the derivative of the function with respect to B to zero gives

    (KLK' + \alpha W)B = KHK'BQ

Substituting this back into (17), we have

    \min_{B} tr((B'KHK'B)^{-1} B'(KLK' + \alpha W)B)    (18)

which is equivalent to the maximization problem

    \max_{B} tr((B'(KLK' + \alpha W)B)^{-1} (B'KHK'B))    (19)

Then B can be obtained by solving the eigendecomposition of (KLK' + \alpha W)^{-1} KHK' and taking the k largest eigenvectors. However, W depends on the value of B, so we use an alternating optimization strategy that updates B and W iteratively. Algorithm 1 summarizes the procedure for solving the projection matrix B.

Algorithm 1. Solving the projection matrix B
Input: source data Y_s, target data Y_t; parameters \alpha, \beta.
1: Compute M through (6), compute the kernel matrix K, construct the centering matrix H, and initialize W as the identity matrix.
Update:
2: Compute the projection matrix B by solving (19).
3: Compute W via (15).
Until convergence.
Output: the projection matrix B.
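A NumPy/SciPy sketch of Algorithm 1, solving (19) as a symmetric-definite generalized eigenproblem; the function name and the small eps guard against zero rows are ours, and C = KLK' + \alpha W is assumed positive definite as constructed.

import numpy as np
from scipy.linalg import eigh

def solve_projection(K, L, H, ns, alpha, k, n_iter=10, eps=1e-8):
    # Alternate B (k leading generalized eigenvectors of KHK' v = lambda (KLK' + alpha W) v,
    # cf. Eq. (19)) and the reweighting matrix W (Eq. (15)).
    n = K.shape[0]
    KHK = K @ H @ K.T
    W = np.eye(n)                                # initialize W as the identity
    B = None
    for _ in range(n_iter):
        C = K @ L @ K.T + alpha * W
        vals, vecs = eigh(KHK, C)                # eigenvalues in ascending order
        B = vecs[:, -k:]                         # k largest eigenvectors
        # (eigenvectors come out C-orthonormal; rescaling to satisfy
        # B'KHK'B = I exactly is omitted in this sketch)
        w = np.ones(n)
        w[:ns] = 1.0 / (2.0 * np.linalg.norm(B[:ns], axis=1) + eps)
        W = np.diag(w)                           # b_i = 1/(2||B_i||) on the source rows
    return B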

We alternately solve for the projection matrix and the sparse representations until the objective function converges. Algorithm 2 summarizes the complete Projected Transfer Sparse Coding (PTSC) algorithm.

Algorithm 2. Projected Transfer Sparse Coding (PTSC)
Input: source data Y_s, target data Y_t, dictionary size p, subspace dimension k, parameters \alpha, \beta, T.
1: Randomly initialize B \in R^{n \times k}.
Update:
2: Compute the dictionary D and the sparse codes X via (12) with B fixed.
3: Compute the projection matrix B via Algorithm 1.
Until convergence.
Output: the sparse codes X, the dictionary D, the projection matrix B.
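Putting the pieces together, a sketch of Algorithm 2 built from the helper sketches above (kernel_matrix, mmd_matrix, ksvd, solve_projection); all of these names are ours, not a published API.

import numpy as np

def ptsc(Ys, Yt, p, k, alpha, beta, T, n_outer=10):
    # Alternate the (D, X) update (K-SVD on Z = B'K) with the B update.
    Y = np.hstack([Ys, Yt])
    ns, n = Ys.shape[1], Ys.shape[1] + Yt.shape[1]
    K = kernel_matrix(Y, kind="poly", t=4.0)     # polynomial kernel, cf. Section 5.2
    M = mmd_matrix(ns, n - ns)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix H = I - (1/n)11'
    B = np.random.default_rng(0).standard_normal((n, k))
    D = X = None
    for _ in range(n_outer):
        Z = B.T @ K
        D, X = ksvd(Z, p, T)                     # dictionary and sparse codes of Z
        A = np.linalg.pinv(Z) @ D                # D = ZA  =>  A = Z^+ D (cf. [4])
        R = np.eye(n) - A @ X
        L = R.T @ R + beta * M                   # L of Eq. (15)
        B = solve_projection(K, L, H, ns, alpha, k)
    return B, D, X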

5. Experiment

Table 1. Descriptions of the benchmark image datasets.

Dataset      # Examples   # Features   # Classes   Abbreviation
USPS         2000         256          10          USPS
MNIST        2500         256          10          MNIST
MSRC         1269         240          6           MSRC
VOC2007      1530         240          6           VOC
Amazon       958          800          10          A
Webcam       295          800          10          W
DSLR         157          800          10          D
Caltech256   1095         800          10          C

We evaluate our unsupervised domain adaptation method (PTSC) on image classification tasks that use limited images from the target domain and a large amount of images from the source domain. We train a Logistic Regression (LR) [38] classifier on the sparse representations obtained by PTSC and classify the unlabeled target data. We conduct experiments on the publicly available benchmark datasets USPS, MNIST, MSRC, VOC2007, Office and Caltech256 (see Table 1).

5.1. Dataset description

USPS dataset [30] and MNIST dataset [31]: both datasets contain images of the handwritten digits "0"-"9", but they follow very different distributions. Fig. 1 shows some examples from the two datasets. The USPS dataset consists of about 9000 images of size 16 x 16; the MNIST dataset consists of about 70,000 images of size 28 x 28. We resize all images to 16 x 16 and flatten them into 256-dimensional vectors. We randomly select 2000 images from USPS and 2500 images from MNIST to form our USPS–MNIST dataset; switching the source and target roles gives the MNIST–USPS dataset.

MSRC dataset [32] and VOC2007 dataset [33]: the MSRC dataset consists of about 4323 images and the VOC2007 dataset of about 5011 images. MSRC contains standard images while VOC2007 contains digital photos; they follow very different distributions but share 6 common semantic classes: "aeroplane", "bicycle", "bird", "car", "cow" and "sheep". Fig. 1 shows some examples from the two datasets. Following [8], we extract 128-dimensional dense SIFT (DSIFT) [34] features from all images, form a codebook from the DSIFT features by K-means, and apply max pooling to obtain the image features. We fix the codebook size at 240, so the image feature dimension is 240, and we normalize the features with the L2 norm. We randomly select 1269 images from MSRC and 1530 images from VOC2007 to form our MSRC–VOC dataset; switching the source and target roles gives the VOC–MSRC dataset.

Office dataset [35] and Caltech256 dataset [36]: the Office dataset contains the Amazon, DSLR and Webcam datasets, which come from different channels: images in Amazon are downloaded from online merchants, images in Webcam are captured by a web camera, and images in DSLR are captured by a digital SLR camera. Since the Caltech-256 dataset shares some categories with the Office dataset, we use it as well. The four datasets follow very different distributions but share 10 common semantic classes: "backpack", "touring-bike", "calculator", "head-phones", "computer-keyboard", "laptop", "computer-monitor", "computer-mouse", "coffee-mug" and "video-projector". Fig. 2 shows some examples from the four datasets.


Fig. 1. Some image samples from the USPS, MNIST, MSRC and VOC2007 datasets.

We extract SURF features [37] and then obtain a codebook of 800 visual words from the SURF features by K-means. All images are thus represented by 800-dimensional vectors, and the extracted features are normalized with the L2 norm. From the four datasets, we select one dataset as the source domain and another as the target domain.


5.2. Experiment setup

We compare our PTSC algorithm with the following baseline methods.

1. LR [38]: an LR classifier learned on the source data, used as the baseline. The parameter C in LR is determined by searching the grid {0.01, 0.1, 1, 10, 100}.
2. PCA [39]: Principal Component Analysis is used for dimensionality reduction; the reduced dimension is 64 in our experiments. Since PCA can reduce noise and capture the main information of the features, we use the PCA-processed features in all the experiments except the first one.
3. TCA [10]: source and target data are projected onto a latent space that preserves the data properties.
4. TJM [25]: it considers feature matching and instance reweighting simultaneously.
5. SA [24]: it learns two subspaces from the eigenvectors of the source and target domains and then aligns them with a linear transformation matrix.
6. SC [19]: it only learns the sparse codes of the source and target data.
7. TSC [8]: when learning the sparse codes of the source and target data, it also considers the distribution divergence and the geometric structure of the data.
8. SDDL [14]: it learns a class-specific dictionary and the sparse representations in a low dimensional shared space.
9. PTSC1: proposed in this paper; a special case of our PTSC with \alpha = 0, i.e., without the instance reweighting term.

TCA, TJM and SA are transfer learning methods; their new representations are used to train a supervised LR classifier that classifies the target data. SC, TSC, SDDL and our PTSC1 and PTSC algorithms are all sparse representation methods in the cross domain setting; the resulting sparse representations are likewise used to train a supervised LR classifier for the target data. For the TCA and TJM methods, we set the number of subspace bases k to 20 and the regularization parameter \lambda to 1. We set the subspace dimension in SA to 20. The dictionary size of SDDL is set to 60 and its subspace dimension is fixed at 30. We fix the dictionary size in SC, TSC, PTSC1 and PTSC at 128, the instance reweighting regularization parameter at \alpha = 10, and the MMD regularization parameter at \beta = 1. Unless otherwise noted, the sparsity factor is T = 30, and we use a polynomial kernel with parameter 4. We also analyze the effectiveness of our algorithm and investigate the importance of the parameters. We repeat all experiments 10 times and report the average classification accuracies and standard deviations.
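The classification step can be reproduced with any liblinear-style logistic regression [38]; a sketch using scikit-learn, with a function name and defaults that are ours:

from sklearn.linear_model import LogisticRegression

def classify_target(X, Hs, ns, C=1.0):
    # Train LR on the source sparse codes (first ns columns of X)
    # and predict labels for the target codes.
    Xs, Xt = X[:, :ns].T, X[:, ns:].T      # scikit-learn expects samples as rows
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(Xs, Hs.ravel())
    return clf.predict(Xt)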

5.3. Experiment results

Tables 2-4 report the performance of the different methods; our method achieves the best results on most datasets.

(1) LR does not cope with the distribution mismatch between the source and target data, so it is much worse than the transfer learning based methods. This demonstrates that the distribution divergence between source and target data plays an important role in classifying the test samples. Throughout this paper, once the new representations are learned by the different algorithms, we use an LR classifier for the image classification tasks.

(2) With the features reduced by PCA, the results differ little from those with the original features, except for declines on the A-C, W-D and D-W datasets, so we use the PCA features in the experiments. SC outperforms LR, presumably because sparse representations capture the important information of images and reduce noise. However, both underperform TSC, TJM and our proposed algorithm, because they do not take the distribution divergence into consideration.

(3) TCA and TJM are both transfer learning based methods, and both outperform PCA. Both use MMD to measure the distribution difference between the source and target domains and learn the new representations in a latent space. In addition, TJM eliminates the irrelevant source samples, so TJM performs better than TCA.

(4) Tables 2-4 show that our method performs better than TSC. TSC computes the sparse representations in the original space, while ours are computed in the low dimensional space. This makes sense because we preserve the data properties and reduce the distribution difference; furthermore, we eliminate the source samples that are irrelevant to the target samples by constraining the projection matrix.

(5) Compared with TJM, we compute the sparse codes of the dimension-reduced source and target features and classify using those sparse codes. The results show that our proposed PTSC algorithm dramatically outperforms TJM. The reason may be that the sparse codes learned by our algorithm are more discriminative and less influenced by the distribution difference, which benefits the classification tasks.


Fig. 2. Some image samples from the Office and Caltech-256 datasets.

Table 2. Comparisons of classification accuracies (%) on the USPS–MNIST dataset pairs.

S → T          LR            PCA           TCA           TJM           SA            SC            TSC           SDDL          PTSC1         PTSC
USPS → MNIST   28.85±0.32    30.52±0.24    31.95±0.64    39.50±0.43    35.70±0.50    26.85±0.30    33.20±0.45    35.21±0.43    32.45±0.40    40.30±0.32
MNIST → USPS   40.05±0.52    36.22±0.48    41.72±0.51    42.66±0.67    35.22±0.44    34.77±0.78    33.50±0.57    51.54±0.37    40.88±0.39    52.11±0.37

Table 3. Comparisons of classification accuracies (%) on the MSRC–VOC dataset pairs.

S → T        LR            PCA           TCA           TJM           SA            SC            TSC           SDDL          PTSC1         PTSC
MSRC → VOC   32.05±0.24    31.75±0.53    32.58±0.82    31.30±0.74    30.52±0.15    30.54±0.73    31.96±0.63    22.93±0.33    32.16±0.34    34.30±0.72
VOC → MSRC   49.05±0.39    41.41±0.21    42.08±0.83    50.52±0.56    42.47±0.50    45.47±0.68    52.04±0.73    40.69±0.26    47.67±0.53    55.82±0.56

Table 4. Comparisons of classification accuracies (%) on the Office and Caltech256 datasets.

S → T   LR            PCA           TCA           TJM           SA            SC            TSC           SDDL          PTSC1         PTSC
C → A   40.91±0.86    40.52±0.19    45.31±0.32    46.14±0.26    45.90±0.33    50.62±0.63    52.40±0.08    48.40±0.41    51.25±0.28    52.19±0.21
C → W   34.57±0.63    38.13±0.56    30.76±0.27    38.98±0.37    36.30±0.88    35.76±0.27    38.47±0.46    36.47±0.57    37.69±0.21    39.42±0.21
C → D   33.12±0.15    38.58±0.26    35.94±0.29    44.21±0.66    39.50±0.76    42.58±0.26    25.47±0.77    65.47±0.77    43.31±0.21    44.58±0.07
A → C   36.42±0.03    34.18±0.08    39.35±0.19    38.64±0.65    35.44±0.50    40.16±0.03    39.62±0.60    26.82±0.90    35.17±0.36    41.91±0.38
A → W   36.27±0.12    38.30±0.51    36.94±0.32    37.64±0.25    38.20±0.90    38.64±0.43    28.08±0.47    59.76±0.60    37.62±0.71    44.03±0.39
A → D   33.12±0.10    42.03±0.82    31.21±0.22    37.57±0.96    37.22±0.56    36.94±0.27    27.65±0.61    29.90±1.22    35.03±0.18    43.14±0.27
W → C   33.83±0.79    35.52±0.98    34.63±0.94    30.17±0.36    30.62±0.82    26.62±0.51    27.60±0.46    26.98±0.75    28.22±0.08    30.81±0.75
W → A   38.23±0.46    39.45±0.72    28.99±0.58    29.47±0.39    38.10±0.70    30.37±0.58    35.49±0.06    44.11±0.96    34.75±0.29    35.17±0.93
W → D   78.34±0.39    71.97±0.45    85.98±0.73    76.43±0.31    77.40±0.88    66.24±0.23    63.94±0.29    64.64±1.04    67.51±0.59    89.17±0.42
D → C   32.68±0.03    31.16±0.65    31.25±0.56    30.45±0.41    32.00±0.90    22.43±0.29    27.24±0.84    24.66±0.50    27.60±0.46    26.89±0.61
D → A   35.17±0.37    36.53±0.44    33.61±0.17    32.98±0.54    38.55±0.80    25.57±0.41    31.00±0.21    41.22±0.91    29.74±0.08    32.67±0.53
D → W   69.32±0.22    73.22±0.33    77.76±0.27    84.61±0.42    82.30±0.78    60.30±0.42    70.00±0.31    72.20±1.41    72.37±0.29    80.60±0.81

(6) The results show that PTSC is better than SA in most cases. A major difference between PTSC and SA is that we apply sparse coding to learn sparse representations in the common space and reweight the source samples, whereas SA only learns a subspace from the eigenvectors of the two domains and does not reweight source samples.

(7) SDDL requires labeled source and target samples while our method does not. Even so, PTSC performs better than SDDL on most (12 out of 16) datasets, because we use Maximum Mean Discrepancy to reduce the distribution difference and the L2,1 norm on the source projection matrix to reweight the source samples.

(8) PTSC1 and PTSC both perform better than the other algorithms, and PTSC is slightly better than PTSC1. This shows that reweighting the source samples is important for visual domain adaptation: since the visual domain difference is large, it is not enough to simply

project the data into a low dimensional space, and the L2,1 norm is better than the L2 norm for constraining the source projection matrix.

5.4. Effectiveness analysis

The distribution distance measures the disparity across domains. We run the SC, TSC, PTSC1 and PTSC algorithms on the A–W dataset to obtain the sparse representations of all images, and we compute the MMD distance between the source and target sparse codes. The squared distance can be computed as

    \| (1/n_s) \sum_{i=1}^{n_s} x_s^i - (1/n_t) \sum_{j=1}^{n_t} x_t^j \|_F^2 = tr(XMX')

in which M is the same as in Eq. (6). The smaller the distribution distance, the better the method.
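This quantity is a one-liner given the mmd_matrix helper sketched in Section 3.2:

import numpy as np

def sparse_code_mmd(X, ns, nt):
    # Squared MMD between source and target sparse codes: tr(X M X').
    return np.trace(X @ mmd_matrix(ns, nt) @ X.T)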


Fig. 3 shows the MMD distance between the source and target sparse codes for the different algorithms. The distance of SC is the largest because SC does not consider the distribution divergence across domains. The distance of TSC is smaller than that of SC, since TSC considers the distribution divergence. The distance of PTSC1 is about the same as that of TSC but larger than that of PTSC, which indicates that reweighting the source samples according to their relevance to the target is important for classification performance. When learning the sparse codes, we consider not only mapping the source and target data into the same latent space but also the correlation between source and target samples. This comparison demonstrates the effectiveness of our proposed method.

Fig. 3. The MMD distance of different algorithms.

5.5. Parameter analysis

We run the PTSC algorithm under different parameter combinations to evaluate the influence of the MMD and instance reweighting terms. The parameter sensitivity results on the A–W dataset are shown in Fig. 4. Apart from the parameters \alpha and \beta used for drawing the figure, we fix the dictionary size at 128. We set the instance reweighting regularization parameter \alpha by searching {0.001, 0.01, 0.1, 1, 10, 100}; the results show that the algorithm is robust over a large range of \alpha. When \alpha is zero, the source data are treated equally and cannot be reweighted by their relevance to the target data. \beta weights the importance of the distribution difference and ranges over {0, 0.01, 0.1, 1, 10, 100}; Fig. 4 shows that the performance is poor when \beta = 0, i.e., when the distribution difference is not considered. As Fig. 4 shows, setting \alpha = 10 and \beta = 1 boosts the performance.

Fig. 4. Classification accuracies on the A–W dataset with different \alpha and \beta.
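The sweep behind Fig. 4 can be reproduced with the sketches above. A hypothetical loop follows: Ys, Yt, Hs, Ht denote preloaded source/target features and labels, and the subspace dimension k = 20 is our illustrative choice, not a value stated for PTSC in the paper.

import itertools
import numpy as np

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
betas = [0, 0.01, 0.1, 1, 10, 100]
acc = {}
for a, b in itertools.product(alphas, betas):
    B, D, X = ptsc(Ys, Yt, p=128, k=20, alpha=a, beta=b, T=30)
    pred = classify_target(X, Hs, ns=Ys.shape[1])
    acc[(a, b)] = float(np.mean(pred == Ht))
print(max(acc, key=acc.get))   # best (alpha, beta); around (10, 1) per Fig. 4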

5.6. Dictionary size

We report the accuracy rates for different dictionary sizes while the other parameters are fixed at their best values. Results for several dataset pairs, A–W, USPS–MNIST and VOC–MSRC, are shown in Fig. 5. Performance is poor when the dictionary is small and improves as the dictionary grows. The reason may be that a small dictionary is less discriminative, because different features may be represented by the same dictionary element; on the other hand, a very large dictionary may generalize poorly and also incurs high computational complexity, since the dimension of the new image representation is high, so a properly sized dictionary is needed. The performance improvement flattens beyond a dictionary size of 128; balancing computational complexity against performance, we set the dictionary size to 128.

Fig. 5. Classification accuracies with different dictionary sizes.


5.7. Convergence study


Since PTSC is an iterative algorithm, we iteratively optimize B, X and D. Fig. 6 shows that the objective value decreases steadily over the iterations and converges within 10 iterations. Therefore, we conclude that our proposed algorithm converges.

Fig. 6. The convergence of the PTSC algorithm.


6. Conclusions and future work

A novel algorithm has been presented for solving cross domain image representation problems. We minimize the distribution divergence across domains using Maximum Mean Discrepancy. In addition, the source projection matrix is constrained by the L2,1 norm in order to reweight the source samples according to their relevance to the target data. The learned low dimensional representations capture the common structure of the source and target data, so the sparse representations learnt from them are robust to the distribution difference and retain the main information of the images. Moreover, the projection matrix, the dictionary and the sparse representations are learned in a unified objective. We have shown on the public real-world dataset pairs USPS–MNIST, MSRC–VOC2007 and Office–Caltech256 that the proposed unsupervised domain adaptation approach (PTSC) improves target image classification. In the future, we plan to take multiple kernel learning into account when learning the kernel matrix, which may be more precise for the final classification tasks, and to extend our method to the multi-domain setting.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61472305, 61070143, 61303034), the Science and Technology Project of Shaanxi Province, China (Grant No. 2015GY027), and the Fundamental Research Funds for the Central Universities (Grant No. SMC1405).

References

[1] B.A. Olshausen et al., Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (6583) (1996) 607-609.
[2] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1794-1801.
[3] C. Zhang, S. Wang, Q. Huang, C. Liang, J. Liu, Q. Tian, Laplacian affine sparse coding with tilt and orientation consistency for image classification, J. Visual Commun. Image Represent. 24 (7) (2013) 786-793.
[4] H. Nguyen, V. Patel, N. Nasrabadi, R. Chellappa, Design of non-linear kernel dictionaries for object recognition, IEEE Trans. Image Process. 22 (12) (2013) 5123-5135.
[5] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210-227.
[6] M. Yang, D. Zhang, J. Yang, Robust sparse coding for face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 625-632.
[7] C.-Y. Lu, H. Min, J. Gui, L. Zhu, Y.-K. Lei, Face recognition via weighted sparse representation, J. Visual Commun. Image Represent. 24 (2) (2013) 111-116.
[8] M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, P.S. Yu, Transfer sparse coding for robust image representation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 407-414.
[9] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, D. Cai, Graph regularized sparse coding for image representation, IEEE Trans. Image Process. 20 (5) (2011) 1327-1336.
[10] S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Networks 22 (2) (2011) 199-210.
[11] M. Fang, Y. Guo, X. Zhang, X. Li, Multi-source transfer learning based on label shared subspace, Pattern Recognit. Lett. 51 (2015) 101-106.

[12] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345-1359.
[13] M. Al-Shedivat, J.J.-Y. Wang, M. Alzahrani, J.Z. Huang, X. Gao, Supervised transfer sparse coding, in: Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 1665-1672.
[14] S. Shekhar, V. Patel, H. Nguyen, R. Chellappa, Coupled projections for adaptation of dictionaries, IEEE Trans. Image Process. 24 (10) (2015) 2941-2954.
[15] Q. Qiu, V.M. Patel, P. Turaga, R. Chellappa, Domain adaptive dictionary learning, in: Computer Vision - ECCV 2012, Springer, 2012, pp. 631-645.
[16] S.G. Mallat, Z. Zhang, Matching pursuits with time-frequency dictionaries, IEEE Trans. Signal Process. 41 (12) (1993) 3397-3415.
[17] Y.C. Pati, R. Rezaiifar, P. Krishnaprasad, Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition, in: Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, IEEE, 1993, pp. 40-44.
[18] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1) (1998) 33-61.
[19] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311-4322.
[20] K. Engan, S.O. Aase, J. Hakon Husoy, Method of optimal directions for frame design, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 1999, pp. 2443-2446.
[21] J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3209-3216.
[22] L. Duan, I.W. Tsang, D. Xu, Domain transfer multiple kernel learning, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 465-479.
[23] L. Duan, D. Xu, I.-H. Tsang, J. Luo, Visual event recognition in videos by learning from web data, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1667-1680.
[24] B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2960-2967.
[25] M. Long, J. Wang, G. Ding, J. Sun, P.S. Yu, Transfer joint matching for unsupervised domain adaptation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1410-1417.
[26] F. Zhu, L. Shao, Weakly-supervised cross-domain dictionary learning for visual recognition, Int. J. Comput. Vision 109 (1-2) (2014) 42-59.
[27] V. Ranjan, G. Harit, C. Jawahar, Learning partially shared dictionaries for domain adaptation, in: Computer Vision - ACCV 2014 Workshops, Springer, 2014, pp. 247-261.
[28] D.-A. Huang, Y.-C.F. Wang, Coupled dictionary and feature space learning with applications to cross-domain image synthesis and recognition, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2496-2503.
[29] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273-297.
[30] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 550-554.
[31] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278-2324.
[32] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation, in: Computer Vision - ECCV 2006, Springer, 2006, pp. 1-15.
[33] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vision 88 (2) (2010) 303-338.
[34] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 91-110.
[35] K. Saenko, B. Kulis, M. Fritz, T. Darrell, Adapting visual category models to new domains, in: Computer Vision - ECCV 2010, Springer, 2010, pp. 213-226.
[36] G. Griffin, A. Holub, P. Perona, Caltech-256 Object Category Dataset, Technical Report 7694, California Institute of Technology, 2007.
[37] H. Bay, T. Tuytelaars, L. Van Gool, SURF: speeded up robust features, in: Computer Vision - ECCV 2006, Springer, 2006, pp. 404-417.
[38] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: a library for large linear classification, J. Mach. Learn. Res. 9 (2008) 1871-1874.
[39] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometr. Intell. Lab. Syst. 2 (1) (1987) 37-52.