Engineering Applications of Artificial Intelligence 28 (2014) 181–189
Beyond cross-domain learning: Multiple-domain nonnegative matrix factorization

Jim Jing-Yan Wang a,b, Xin Gao a,*

a Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
b Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

* Corresponding author. E-mail addresses: [email protected] (J.J.-Y. Wang), [email protected] (X. Gao).
Article history: Received 19 February 2013; Received in revised form 7 October 2013; Accepted 1 November 2013; Available online 12 December 2013.

Abstract
Traditional cross-domain learning methods transfer learning from a source domain to a target domain. In this paper, we propose the multiple-domain learning problem, in which several domains are treated equally. The multiple-domain learning problem assumes that samples from different domains have different distributions, but share the same feature and class label spaces. Each domain can be a target domain, while also serving as a source domain for the other domains. We propose a novel multiple-domain representation method for this problem. The method is based on nonnegative matrix factorization (NMF) and learns a basis matrix and coding vectors for the samples, so that the distribution mismatch among the different domains is reduced under an extended variant of the maximum mean discrepancy (MMD) criterion. The novel algorithm, multiple-domain NMF (MDNMF), was evaluated on two challenging multiple-domain learning problems: multiple-user spam email detection and multiple-domain glioma diagnosis. The effectiveness of the proposed algorithm is experimentally verified.
Keywords: Data representation; Nonnegative matrix factorization; Cross-domain learning
1. Introduction

The cross-domain learning problem has recently attracted much attention from the machine learning community. The goal is to learn a classifier for the samples in a target domain in which the number of labeled samples is not sufficient for learning. At the same time, it is assumed that a source domain with a sufficient number of labeled samples exists and can aid the learning problem of the target domain. The source domain samples share the same feature space and class label space as the target domain, but the distribution of the source domain samples differs significantly from that of the target domain. Thus the source domain samples cannot be used directly for the learning problem of the target domain; instead, domain transfer, or domain adaptation, is needed to learn from the source domain to the target domain. Recently, many works have addressed the transfer learning problem (Daume, 2007; Yang et al., 2007; Jiang et al., 2008; Bruzzone and Marconcini, 2010; Duan et al., 2012a,b). Previous cross-domain learning research mainly focused on learning from a single source domain to a target domain
(Daume, 2007; Yang et al., 2007; Jiang et al., 2008; Bruzzone and Marconcini, 2010; Duan et al., 2012a,b). It is usually assumed that there are only a few labeled samples in the target domain, while there are a great number of labeled samples in the source domain. However, in many real-world applications, there are usually more than two domains, and each domain has only a few labeled samples. In such cases, the goal is to learn a classifier for each domain with the help of labeled samples from all other domains; that is, each domain can be a target domain and, at the same time, a source domain for all other domains. For example, in the problem of spam email detection, we may have several email subsets from different users. For each user, a large number of emails are collected, but only a small portion of them are labeled as non-spam or spam. Because the data distributions of the different users' emails are different but related, they can be treated as different domains. Every user needs a classifier for spam detection but does not have enough labeled emails, so every user's domain is a target domain. At the same time, each user's data are also helpful for learning the other users' classifiers, so they are also source domains. We define this problem as the multiple-domain learning problem. It has the following features:

1. Several (usually more than two) domains are considered in the learning procedure.
2. All the domains are treated equally. No domain is specified as the source or the target domain. Each domain can be
a target domain, while also being a source domain for the other domains.
3. A classifier is needed for each domain, but the number of labeled samples in each domain is limited.

Although there are many real-world applications of the multiple-domain learning problem, surprisingly little attention has been paid to the learning problem of multiple domains. Recently, a few methods have been presented to learn from multiple source domains to a single target domain. For example, Duan et al. (2012a) proposed the domain adaptation machine (DAM) for the multiple source domain adaptation problem, which learns a robust classifier for the target domain by leveraging many base classifiers that can be learned using the labeled samples from the source domains or the target domain. Zhuang et al. (2010) proposed a centralized consensus regularization (CCR) framework for learning from multiple source domains to a target domain; it trains a local classifier by considering both the local data available in one source domain and the prediction consensus with the classifiers learned from the other source domains. Yao and Doretto (2010) proposed the multiple source transfer AdaBoost (MDTAB) by extending the boosting framework to transfer knowledge from multiple sources. Sun et al. (2011) proposed a two-stage-weighting-based method for multiple source domain adaptation (TSWMSD), which combines data samples from multiple sources, weighted according to marginal and conditional probability differences, with the target domain data. Tu and Sun (2012a) proposed a multiple source domain adaptation method based on ensemble learning; using model-friendly classifiers, different test samples are dynamically assigned different weights. Tu and Sun (2012b) further proposed a novel framework that simultaneously maximizes separability among classes and minimizes separability among domains; to this end, the class-separate objectives and the domain-merge objectives are combined into a unified objective.

All the aforementioned works deal with learning from multiple source domains for a single target domain, which can be regarded as a special case of our multiple-domain learning problem. Methods that learn from multiple source domains can be extended to the multiple-domain learning problem by treating each domain as the target domain and all other domains as source domains in turn. However, this strategy has the following limitations:

- It is quite time-consuming. For each domain, we must perform a learning procedure to learn from all other domains, which is inefficient when the number of domains is large.
- It assumes that all the samples in the source domains are labeled, which is not true for the multiple-domain learning problem. In this case, only the labeled samples from the other domains are utilized, while the unlabeled ones are neglected.
- The classifier is learned only for one specified target domain and cannot be applied to other domains; learning a single classifier for all the domains is impossible.
Nonnegative matrix factorization (NMF) has been well studied and applied as a data representation method, due to its ability to find the latent structure of data (Cai et al., 2011, 2009, 2008; Wang et al., 2012a, 2013a,b). However, up to now, all research has been limited to the single-domain problem, and no cross-domain or multiple-domain NMF method has been studied yet. To fill these gaps, in this paper we develop a novel data representation method for multiple-domain data representation, based on NMF. We map all the samples from multiple domains with different data distributions into a common representation space with a common distribution by NMF. A distribution mismatch term is constructed and applied to the coding vectors of the samples under the framework of NMF. With the common representation of samples from multiple domains, a robust classifier can be trained directly for the classification of samples from multiple domains.

The rest of this paper is organized as follows. In Section 2, we propose the NMF learning algorithm for the representation of samples from multiple domains. In Section 3, we show the experimental results of the proposed algorithm on two real-world multiple-domain learning problems. Finally, we conclude the paper and discuss possible future work in Section 4.
2. Multiple-domain nonnegative matrix factorization

In this section, we introduce the proposed NMF method for the representation of data samples from multiple domains.

2.1. Objective function

To introduce the proposed NMF method, we construct an objective function for the factorization of all the samples from multiple domains, by considering the following two problems simultaneously.

NMF problem: Given a training data set with N data samples denoted as $D = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}_+^D$ is the D-dimensional nonnegative feature vector of the i-th sample, we organize it as a nonnegative matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}_+^{D \times N}$. NMF tries to find two low-rank nonnegative matrices $H \in \mathbb{R}_+^{D \times K}$ and $W \in \mathbb{R}_+^{K \times N}$ so that their product approximates the original data matrix, $X \approx HW$ (Zheng et al., 2007; Gruber et al., 2009). $H = [h_1, \ldots, h_K]$, where $h_k \in \mathbb{R}_+^D$ is the k-th column of H, can be regarded as a basis matrix with each column $h_k$ being a basis vector. In this way, each sample $x_i$ is approximated as a linear combination of the basis vectors,

$$x_i \approx \sum_{k=1}^{K} h_k w_{ki} = H w_i, \qquad (1)$$

where $w_i = [w_{1i}, \ldots, w_{Ki}]^\top$ is the linear combination coefficient vector of the i-th sample and also the i-th column of W; it can be regarded as a new representation of $x_i$ with respect to H. The coefficient matrix W is also called the coding matrix, and $w_i$ the coding vector of $x_i$. To find the optimal factorization matrices H and W, we minimize the squared $L_2$ norm distance between the original matrix X and the product HW, so that the approximation error is as small as possible. The NMF problem can be formulated as the following constrained minimization problem:

$$\min_{H, W} \left\{ \sum_{i: x_i \in D} \| x_i - H w_i \|_2^2 = \| X - HW \|_2^2 = \mathrm{Tr}\big[ (X - HW)(X - HW)^\top \big] \right\} \quad \mathrm{s.t.} \; H \ge 0, \; W \ge 0, \qquad (2)$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix.
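As a quick sanity check on the equivalences in (2), the following numpy sketch (our own illustrative code, not from the paper; all variable names are ours) confirms that the per-sample sum, the squared matrix norm, and the trace form of the reconstruction error coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 20, 5, 100                      # feature dim, basis size, sample count
X = rng.random((D, N))                    # nonnegative data matrix
H = rng.random((D, K))                    # basis matrix
W = rng.random((K, N))                    # coding matrix

R = X - H @ W                             # residual X - HW
per_sample = sum(np.sum(R[:, i] ** 2) for i in range(N))
matrix_norm = np.linalg.norm(R, 'fro') ** 2
trace_form = np.trace(R @ R.T)

# All three expressions of the NMF objective in Eq. (2) agree.
assert np.allclose(per_sample, matrix_norm) and np.allclose(matrix_norm, trace_form)
```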
Multiple-domain distribution mismatch reduction problem: Suppose that the entire data set is composed of M domains with different data distributions. We denote the sample set of the m-th domain as $D_m$, so that $D = \bigcup_{m=1}^{M} D_m$. The number of samples in $D_m$ is denoted as $N_m$, and $N = \sum_{m=1}^{M} N_m$. Because of the significant differences of the distributions among different domains, directly performing NMF on the samples from different domains as in (2) may cause serious problems. The difference among the same class in different domains might even be larger than the differences among classes within the same domain, making the NMF results not discriminative enough for the classification problem. To solve this issue, it is crucial to reduce the differences among the data distributions of the different domains (Duan et al., 2012a). To this end, we reduce the distribution mismatch of the coding vectors $w_i$ of the samples, instead of the samples $x_i$ themselves. In this way, all the samples share a common distribution in the coding space, so that they can be used directly to train a classifier applicable to all the domains. To represent the distribution of the m-th domain, we calculate the mean of the coding vectors of the samples from this domain,

$$\bar{w}_m = \frac{1}{N_m} \sum_{i: x_i \in D_m} w_i. \qquad (3)$$

It is used as the expectation of the distribution of the coding vectors of $D_m$. To compare the distribution mismatch of a pair of domains, we use the maximum mean discrepancy (MMD) (Borgwardt et al., 2006) as the basic criterion, which has been used to develop cross-domain component analysis methods (Pan et al., 2011), cross-domain classifiers (Duan et al., 2009), etc. Based on MMD, we compare the coding vector distributions of a pair of domains $D_m$ and $D_{m'}$ through the squared $L_2$ norm distance between the means $\bar{w}_m$ and $\bar{w}_{m'}$ of the two domains in the coding vector space, namely

$$\mathrm{Dist}(D_m, D_{m'}) = \| \bar{w}_m - \bar{w}_{m'} \|_2^2. \qquad (4)$$

MMD was proposed to compare the mismatch of a pair of domains. However, in the multiple-domain learning problem, there are many (usually more than two) domains. In order to extend it to measure the mismatch of multiple-domain distributions, we consider the following two protocols.

One-to-one protocol: To compare the mismatch of multiple-domain distributions, we can compare each pair of domains in the domain set using MMD, and then take the sum as the final distribution mismatch measure. In this way, we seek a common coding vector space for the NMF of multiple domains by minimizing the distribution distances between all pairs of domains. Based on the distribution distance defined in (4), this problem can be formulated as

$$\min_{W} \sum_{m, m' = 1: m < m'}^{M} \| \bar{w}_m - \bar{w}_{m'} \|_2^2 \quad \mathrm{s.t.} \; W \ge 0. \qquad (5)$$

We call it the "one-to-one" (OTO) protocol because each domain is compared to every other one. The objective function of (5) contains $M(M-1)/2$ terms, making it a complex problem.

One-to-all protocol: Due to the complexity of the OTO protocol, we propose the "one-to-all" (OTA) protocol to simplify the formulation. OTA makes the coding vector means of all the domains maximally close to the average of the means of all the domains. To this end, we first compute the average of the mean coding vectors of all the domains as $(1/M) \sum_{m'=1}^{M} \bar{w}_{m'}$, and then learn the coding vectors so that the mean of each domain is as close to this average as possible. In this way, the optimization problem in (5) is improved to the following one:

$$\min_{W} \sum_{m=1}^{M} \left\| \bar{w}_m - \frac{1}{M} \sum_{m'=1}^{M} \bar{w}_{m'} \right\|_2^2 \quad \mathrm{s.t.} \; W \ge 0. \qquad (6)$$

By (6), we hope that the distributions of all the domains are transferred to a common distribution in the coding vector space, represented by the average of the mean coding vectors of all the domains. This criterion reduces the complexity by decreasing the number of terms in the objective function from $M(M-1)/2$ to $M$. It can also be proven that the OTA objective is a lower bound of the OTO objective, i.e.,

$$\sum_{m=1}^{M} \left\| \bar{w}_m - \frac{1}{M} \sum_{m'=1}^{M} \bar{w}_{m'} \right\|_2^2 \le \sum_{m, m' = 1: m \ne m'}^{M} \| \bar{w}_m - \bar{w}_{m'} \|_2^2. \qquad (7)$$

By substituting (3) into (6), we have

$$\min_{W} \sum_{m=1}^{M} \left\| \frac{1}{N_m} \sum_{i: x_i \in D_m} w_i - \frac{1}{M} \sum_{m'=1}^{M} \frac{1}{N_{m'}} \sum_{j: x_j \in D_{m'}} w_j \right\|_2^2 = \sum_{m=1}^{M} \| W \pi^m \|_2^2 = \sum_{m=1}^{M} \left\| \sum_{i: x_i \in D} \pi_i^m w_i \right\|_2^2 = \sum_{m=1}^{M} \mathrm{Tr}(W \pi^m \pi^{m\top} W^\top) \quad \mathrm{s.t.} \; W \ge 0, \qquad (8)$$

where $\pi^m = [\pi_1^m, \ldots, \pi_N^m]^\top$ and

$$\pi_i^m = \begin{cases} \dfrac{1}{N_m}\left(1 - \dfrac{1}{M}\right), & \text{if } x_i \in D_m, \\[2mm] -\dfrac{1}{M N_{m'}}, & \text{if } x_i \in D_{m'} \; (m' \ne m). \end{cases} \qquad (9)$$

We further denote the matrix $\Pi^m = \pi^m \pi^{m\top} = [\Pi_{ij}^m]$, where $\Pi_{ij}^m = \pi_i^m \pi_j^m$. Moreover, each matrix $\Pi^m$ can be separated into two nonnegative parts $\Pi_+^m \ge 0$ and $\Pi_-^m \ge 0$, where $\Pi_+^m$ only contains the absolute values of the positive elements and $\Pi_-^m$ only contains the absolute values of the negative elements, so that the original matrix is represented as the difference of these parts,

$$\Pi^m = \Pi_+^m - \Pi_-^m. \qquad (10)$$

We observe that the (i,j)-th elements of the matrices $\Pi_+^m$ and $\Pi_-^m$ are

$$\Pi_{+ij}^m = \begin{cases} \dfrac{1}{N_m^2}\left(1 - \dfrac{1}{M}\right)^2, & \text{if } x_i, x_j \in D_m, \\[2mm] \dfrac{1}{M N_{m'}} \dfrac{1}{M N_{m''}}, & \text{if } x_i \in D_{m'} \; (m' \ne m), \; x_j \in D_{m''} \; (m'' \ne m), \\[2mm] 0, & \text{otherwise}, \end{cases}$$

$$\Pi_{-ij}^m = \begin{cases} \dfrac{1}{N_m}\left(1 - \dfrac{1}{M}\right) \dfrac{1}{M N_{m'}}, & \text{if } x_i \in D_m, \; x_j \in D_{m'} \; (m' \ne m), \\[2mm] \dfrac{1}{M N_{m'}} \dfrac{1}{N_m}\left(1 - \dfrac{1}{M}\right), & \text{if } x_i \in D_{m'} \; (m' \ne m), \; x_j \in D_m, \\[2mm] 0, & \text{otherwise}. \end{cases} \qquad (11)$$

By substituting (10) into (8), we have

$$\min_{W} \sum_{m=1}^{M} \mathrm{Tr}\big[ W (\Pi_+^m - \Pi_-^m) W^\top \big] = \mathrm{Tr}\left( W \sum_{m=1}^{M} \Pi_+^m W^\top \right) - \mathrm{Tr}\left( W \sum_{m=1}^{M} \Pi_-^m W^\top \right) = \mathrm{Tr}(W \Pi_+ W^\top) - \mathrm{Tr}(W \Pi_- W^\top) \quad \mathrm{s.t.} \; W \ge 0, \qquad (12)$$

where $\Pi_+ = \sum_{m=1}^{M} \Pi_+^m$ and $\Pi_- = \sum_{m=1}^{M} \Pi_-^m$.
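To make the construction in (9)-(11) concrete, here is a small numpy sketch (our own illustrative code; the function name mismatch_matrices and the domain_of array are our inventions) that builds the aggregated $\Pi_+$ and $\Pi_-$ of (12) directly from the outer products $\pi^m \pi^{m\top}$ and their sign split:

```python
import numpy as np

def mismatch_matrices(domain_of, M):
    """Build Pi_plus and Pi_minus of Eqs. (9)-(12) from per-sample domain indices.

    domain_of: length-N integer array, domain_of[i] = m means x_i belongs to D_m.
    """
    N = len(domain_of)
    counts = np.bincount(domain_of, minlength=M)   # N_m for each domain m
    Pi_plus = np.zeros((N, N))
    Pi_minus = np.zeros((N, N))
    for m in range(M):
        # Eq. (9): pi_i^m = (1/N_m)(1 - 1/M) if x_i in D_m, else -1/(M N_{m'}).
        pi = np.where(domain_of == m,
                      (1.0 / counts[m]) * (1.0 - 1.0 / M),
                      -1.0 / (M * counts[domain_of]))
        Pi_m = np.outer(pi, pi)                    # Pi^m = pi^m pi^m^T, Eq. (10)
        Pi_plus += np.maximum(Pi_m, 0.0)           # positive part of Pi^m
        Pi_minus += np.maximum(-Pi_m, 0.0)         # negative part of Pi^m
    return Pi_plus, Pi_minus

# Toy usage: 3 domains with 4, 3, and 5 samples.
domain_of = np.array([0] * 4 + [1] * 3 + [2] * 5)
Pp, Pm = mismatch_matrices(domain_of, M=3)
assert np.all(Pp >= 0) and np.all(Pm >= 0)         # both parts are nonnegative
```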
By considering the NMF problem and the OTA criterion for multiple-domain distribution mismatch reduction simultaneously, we combine the two optimization problems (2) and (12) to formulate the MDNMF problem:

$$\min_{H, W} O(H, W) = \mathrm{Tr}\big[ (X - HW)(X - HW)^\top \big] + \alpha \, \mathrm{Tr}(W \Pi_+ W^\top) - \alpha \, \mathrm{Tr}(W \Pi_- W^\top) \quad \mathrm{s.t.} \; H \ge 0, \; W \ge 0, \qquad (13)$$

where $\alpha$ is the trade-off parameter.
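For monitoring convergence, the combined objective $O(H, W)$ in (13) can be evaluated directly; a minimal sketch follows (the function name is ours, and Pi_plus/Pi_minus are the matrices $\Pi_+$ and $\Pi_-$ built above):

```python
import numpy as np

def mdnmf_objective(X, H, W, Pi_plus, Pi_minus, alpha):
    """Value of O(H, W) in Eq. (13): reconstruction error plus mismatch terms."""
    R = X - H @ W
    recon = np.trace(R @ R.T)                  # Tr[(X - HW)(X - HW)^T]
    mismatch = np.trace(W @ Pi_plus @ W.T) - np.trace(W @ Pi_minus @ W.T)
    return recon + alpha * mismatch
```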
2.2. Optimization

We adopt the Lagrange multiplier method to optimize (13). We introduce the Lagrange multiplier matrices $\Phi$ and $\Psi$ for the constraints $H \ge 0$ and $W \ge 0$, respectively. Thus, the Lagrange function $L$ of (13) is

$$L = \mathrm{Tr}\big[ (X - HW)(X - HW)^\top \big] + \alpha \, \mathrm{Tr}(W \Pi_+ W^\top) - \alpha \, \mathrm{Tr}(W \Pi_- W^\top) + \mathrm{Tr}(\Phi H^\top) + \mathrm{Tr}(\Psi W^\top). \qquad (14)$$

By setting the partial derivatives of $L$ with respect to H and W to zero, we have

$$\frac{\partial L}{\partial H} = -2 X W^\top + 2 H W W^\top + \Phi = 0,$$
$$\frac{\partial L}{\partial W} = -2 H^\top X + 2 H^\top H W + 2 \alpha W \Pi_+ - 2 \alpha W \Pi_- + \Psi = 0. \qquad (15)$$

By using the KKT conditions $\Phi \circ H = 0$ and $\Psi \circ W = 0$, where $\circ$ denotes the element-wise product, we obtain the following equations for the factorization matrices H and W:

$$-(X W^\top) \circ H + (H W W^\top) \circ H = 0,$$
$$-(H^\top X) \circ W + (H^\top H W) \circ W + \alpha (W \Pi_+) \circ W - \alpha (W \Pi_-) \circ W = 0. \qquad (16)$$

These equations lead to the following updating rules:

$$H \leftarrow H \circ \frac{[X W^\top]}{[H W W^\top]}, \qquad W \leftarrow W \circ \frac{[H^\top X + \alpha W \Pi_-]}{[H^\top H W + \alpha W \Pi_+]}, \qquad (17)$$
where $[\cdot]/[\cdot]$ denotes element-wise division.

2.3. Algorithm

Based on the optimization results in (17), we can design the iterative MDNMF learning algorithm, which is listed in Algorithm 1. The update rules are repeated until a maximum iteration number T is reached.

Algorithm 1. MDNMF learning algorithm.
INPUT: Training set D of M domains and its data matrix X;
INPUT: Iteration number T;
Initialize the basis matrix $H^0$ and the coding matrix $W^0$;
Initialize the matrices $\Pi_+$ and $\Pi_-$ as in (11);
for $t = 1, \ldots, T$ do
  Update the basis matrix $H^t$ and the coding matrix $W^t$ by fixing the previous basis matrix $H^{t-1}$ and coding matrix $W^{t-1}$, as in (17);
end for
OUTPUT: Basis matrix $H^T$ and coding matrix $W^T$.
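A compact numpy sketch of Algorithm 1 might look as follows (our own illustrative code, reusing mismatch_matrices from the earlier sketch; the small eps added to the denominators to avoid division by zero is our implementation detail, not part of the paper):

```python
import numpy as np

def mdnmf_fit(X, domain_of, M, K, alpha, T=200, eps=1e-10, seed=0):
    """Algorithm 1: learn basis H and coding matrix W by the updates of Eq. (17)."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, K))                                 # initialize H^0
    W = rng.random((K, N))                                 # initialize W^0
    Pi_plus, Pi_minus = mismatch_matrices(domain_of, M)    # Eq. (11)
    for _ in range(T):
        # First rule of Eq. (17): H <- H o [X W^T] / [H W W^T]
        H *= (X @ W.T) / (H @ W @ W.T + eps)
        # Second rule of Eq. (17): W <- W o [H^T X + a W Pi_-] / [H^T H W + a W Pi_+]
        W *= ((H.T @ X + alpha * W @ Pi_minus)
              / (H.T @ H @ W + alpha * W @ Pi_plus + eps))
    return H, W
```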
Time complexity analysis: In this learning algorithm, the basis matrix H and the coding matrix W are updated alternately in each iteration. Since H is of size $D \times K$ and W is of size $K \times N$, the time complexity of the training procedure is $O(T \cdot K \cdot (D + N))$.
Moreover, before the iterations, the matrices $\Pi_+$ and $\Pi_-$, of size $N \times N$, must be initialized; thus the final time complexity of this algorithm is $O(T \cdot K \cdot (D + N) + N^2)$.

2.4. Representation of new samples

When a test sample x from the m-th domain arrives, we try to represent it as a new coding vector w using the learned basis matrix H, so that $x \approx H w$. During the representation procedure, we assume that the input of this new sample does not affect the coding of the training samples, so $w_i$ is fixed for all training samples $x_i \in D$. We also assume that the average of the coding vector means of all the domains is not affected by x. To account for the effect of the new sample on the distribution of the m-th domain, the coding vector mean of the m-th domain is re-computed as $\frac{1}{N_m + 1} \big( \sum_{i: x_i \in D_m} w_i + w \big)$ by extending $D_m$ with the coding vector w of the new test sample. By replacing the mean coding vector in (6) and considering the approximation error of the new sample, we need to optimize the following objective function for the representation of the test sample x:

$$\min_{w} R(w) = \| x - H w \|_2^2 + \alpha \left\| \frac{1}{N_m + 1}\Big( \sum_{i: x_i \in D_m} w_i + w \Big) - \frac{1}{M} \sum_{m'=1}^{M} \frac{1}{N_{m'}} \sum_{i: x_i \in D_{m'}} w_i \right\|_2^2 = \| x - H w \|_2^2 + \alpha \left\| \frac{1}{N_m + 1} w + W \pi \right\|_2^2 \quad \mathrm{s.t.} \; w \ge 0, \qquad (18)$$

where $\pi = [\pi_1, \ldots, \pi_N]^\top$ with

$$\pi_i = \begin{cases} \dfrac{1}{N_m + 1} - \dfrac{1}{M N_m}, & \text{if } x_i \in D_m, \\[2mm] -\dfrac{1}{M N_{m'}}, & \text{if } x_i \in D_{m'} \; (m' \ne m). \end{cases}$$

We also separate $\pi$ into two nonnegative parts, $\pi = \pi_+ - \pi_-$, where

$$\pi_{+i} = \begin{cases} \dfrac{1}{N_m + 1} - \dfrac{1}{M N_m}, & \text{if } x_i \in D_m, \\[2mm] 0, & \text{otherwise}, \end{cases} \qquad \pi_{-i} = \begin{cases} \dfrac{1}{M N_{m'}}, & \text{if } x_i \in D_{m'} \; (m' \ne m), \\[2mm] 0, & \text{otherwise}. \end{cases} \qquad (19)$$

By substituting $\pi = \pi_+ - \pi_-$ into (18), we have the following optimization problem:

$$\min_{w} \| x - H w \|_2^2 + \alpha \left\| \frac{1}{N_m + 1} w + W \pi_+ - W \pi_- \right\|_2^2 = \mathrm{Tr}\big[ (x - Hw)(x - Hw)^\top \big] + \alpha \, \mathrm{Tr}\left[ \left( \frac{w}{N_m + 1} + W \pi_+ - W \pi_- \right) \left( \frac{w}{N_m + 1} + W \pi_+ - W \pi_- \right)^\top \right] \quad \mathrm{s.t.} \; w \ge 0. \qquad (20)$$

We denote the Lagrange multiplier vector for the constraint $w \ge 0$ as $\psi$. The Lagrange function $L$ of problem (20) is

$$L = \mathrm{Tr}\big[ (x - Hw)(x - Hw)^\top \big] + \alpha \, \mathrm{Tr}\left[ \left( \frac{w}{N_m + 1} + W \pi_+ - W \pi_- \right) \left( \frac{w}{N_m + 1} + W \pi_+ - W \pi_- \right)^\top \right] + \mathrm{Tr}(\psi w^\top). \qquad (21)$$

By setting the derivative of $L$ with respect to w to zero, we have

$$\frac{\partial L}{\partial w} = -2 H^\top x + 2 H^\top H w + 2 \alpha \left( \frac{1}{N_m + 1} \right)^2 w + 2 \alpha \frac{1}{N_m + 1} W \pi_+ - 2 \alpha \frac{1}{N_m + 1} W \pi_- + \psi = 0. \qquad (22)$$

Using the KKT condition $\psi \circ w = 0$, we obtain the following equation for w:

$$(H^\top H w) \circ w + \alpha \left( \frac{1}{N_m + 1} \right)^2 w \circ w + \alpha \frac{1}{N_m + 1} (W \pi_+) \circ w = (H^\top x) \circ w + \alpha \frac{1}{N_m + 1} (W \pi_-) \circ w, \qquad (23)$$

which leads to the following updating rule for w:

$$w \leftarrow w \circ \frac{\left[ H^\top x + \alpha \frac{1}{N_m + 1} W \pi_- \right]}{\left[ H^\top H w + \alpha \left( \frac{1}{N_m + 1} \right)^2 w + \alpha \frac{1}{N_m + 1} W \pi_+ \right]}. \qquad (24)$$
Using this updating rule, a new test sample x is represented as a new coding vector by fixing the basis matrix H and the coding vectors $w_i$ of the training samples $x_i \in D$. The algorithm for the representation of a new test sample is given in Algorithm 2.

Algorithm 2. MDNMF representation algorithm.
INPUT: Test sample x;
INPUT: Basis matrix H and coding matrix W learned from the training set using Algorithm 1;
INPUT: Iteration number T;
Initialize the coding vector $w^0$;
Initialize the vectors $\pi_+$ and $\pi_-$ as in (19);
for $t = 1, \ldots, T$ do
  Update the coding vector $w^t$ by fixing the previous coding vector $w^{t-1}$, the basis matrix H, and the coding matrix W, as in (24);
end for
OUTPUT: Coding vector $w^T$.
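Correspondingly, Algorithm 2 can be sketched as follows (again our own illustrative code; m is the domain index of the test sample, domain_of is the per-training-sample domain index array from the earlier sketches, and eps is our safeguard against division by zero):

```python
import numpy as np

def mdnmf_represent(x, H, W, domain_of, M, m, alpha, T=200, eps=1e-10, seed=0):
    """Algorithm 2: code a new sample x from domain m by the update of Eq. (24)."""
    rng = np.random.default_rng(seed)
    K = H.shape[1]
    w = rng.random(K)                                   # initialize w^0
    counts = np.bincount(domain_of, minlength=M)        # N_m for each domain
    Nm = counts[m]
    # Eq. (19): split pi into nonnegative parts pi_plus and pi_minus.
    pi_plus = np.where(domain_of == m, 1.0 / (Nm + 1) - 1.0 / (M * Nm), 0.0)
    pi_minus = np.where(domain_of != m, 1.0 / (M * counts[domain_of]), 0.0)
    c = 1.0 / (Nm + 1)
    for _ in range(T):
        # Eq. (24), with H and W fixed and only w updated.
        w *= ((H.T @ x + alpha * c * (W @ pi_minus))
              / (H.T @ H @ w + alpha * c**2 * w + alpha * c * (W @ pi_plus) + eps))
    return w
```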
Time complexity analysis: In this representation algorithm, only the coding vector w of size K is updated, T times. Besides, a vector $\pi$ of size N is initialized before the iterations; thus the time complexity is $O(T \cdot K + N)$.
3. Experiments

In this section, we evaluate the proposed representation method on two multiple-domain learning tasks: multiple-user spam email detection and multiple-domain glioma diagnosis using gene expression data.

3.1. Experiment I: multiple-user spam email detection

3.1.1. Dataset and setup

The task of spam email detection is to classify a given test email as non-spam or spam. We use a spam email dataset collected from publicly available sources (Duan et al., 2012b). The dataset consists of the emails from the inboxes of 15 different users. Due to the differences among the distributions of the emails received by individual users, we treat each user as an individual domain. Each user may have labeled some of the emails in his/her inbox, but not all of them. When we try to learn a classifier to help a user detect spam emails automatically, other users' emails may be helpful. Thus the multiple-user spam email detection problem is a typical multiple-domain learning problem, since all the users need a spam email classifier. In the dataset, for each user we have 400 emails, 300 of which were used as training samples in the training set, and the remaining 100 were used as test samples in an independent test set. Thus in total we have 300 × 15 = 4500 samples in the training set and 100 × 15 = 1500 samples in the independent test set. Among all the training samples, we randomly select a small part (30%) of
them as labeled samples while keeping the remaining ones unlabeled. To represent the emails, we first extract word-frequency features from each email, then weight the word-frequency features using a word weighting algorithm (Wang et al., 2011) based on boosting (Li et al., 2008), and finally apply Algorithm 1 to the training samples to turn the feature vectors into coding vectors. To classify the test samples, we train a single semi-supervised SVM classifier regularized by ensemble manifold regularization (Geng et al., 2012, 2009) using the coding vectors of all the training samples from the different domains. To conduct the evaluation, we first perform 10-fold cross validation on the training set to optimize the parameters of the MDNMF model and the SVM classifier. To classify a test sample, we first represent it using Algorithm 2, and then classify it with the SVM classifier learned in the training procedure.

To evaluate the performance of the multiple-domain learning methods, we employ the following performance criteria for the 10-fold cross validation: Sensitivity, Specificity, F-Score, Accuracy (ACC), and Matthews correlation coefficient (MCC). We define a true positive (TP) as a spam email classified correctly as spam, a false negative (FN) as a spam email classified wrongly as non-spam, a true negative (TN) as a non-spam email classified correctly as non-spam, and a false positive (FP) as a non-spam email classified wrongly as spam. With these definitions, we have the following classification performance measures for the training set:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{FP + TN}, \quad \mathrm{FScore} = \frac{2\,TP}{2\,TP + FP + FN}, \quad \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \qquad (25)$$

Note that for all the measures except MCC, the value range is $[0, 1]$, and a better classifier achieves a higher value. The value range of MCC is $[-1, 1]$; a larger MCC corresponds to a better classifier, with $-1$ and $1$ representing the worst and the best classifier, respectively. For the independent test, receiver operating characteristic (ROC) curves (Ben-David, 2008) and Recall-Precision curves are reported as performance measures. The ROC curve is created by plotting the true positive rate (TPR) vs. the false positive rate (FPR) at various threshold settings, where TPR and FPR are defined as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}. \qquad (26)$$

Moreover, the area under the ROC curve (AUC) (Ben-David, 2008) is also used as a single performance measure to compare the classification results. The Recall-Precision curve is created by plotting the precision vs. the recall at various threshold settings, where precision and recall are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}. \qquad (27)$$
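For concreteness, all of these measures can be computed from the confusion-matrix counts as follows (a minimal sketch; the function name is ours):

```python
import math

def classification_measures(tp, fp, tn, fn):
    """Measures of Eqs. (25)-(27) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # also the Recall and the TPR
    specificity = tn / (fp + tn)
    fscore = 2 * tp / (2 * tp + fp + fn)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return dict(sensitivity=sensitivity, specificity=specificity, fscore=fscore,
                acc=acc, mcc=mcc, fpr=fpr, precision=precision)
```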
A better classifier should achieve a ROC curve closer to the upper left corner with a larger AUC value, and a Recall-Precision curve closer to the upper right corner.

Since there are no other multiple-domain learning methods to compare our algorithm with directly, we compared the proposed MDNMF with state-of-the-art cross-domain learning methods that learn from multiple source domains for a target domain, including DAM (Duan et al., 2012b), CCR (Zhuang et al., 2010), TSWMSD (Sun et al., 2011), and MDTAB (Yao and Doretto, 2010). DAM (Duan et al., 2012b) requires pre-learned base classifiers; we used the standard SVM as the base classifier for each domain, so there are 15 base classifiers in total. Moreover, we also provide the results
of a baseline method, which helps to identify the cross-domain issues. The baseline method is developed by putting all training samples of the different domains into one merged dataset and training a single SVM on the merged dataset.

3.1.2. Results

The boxplots of the performance measures for the 10-fold cross validation on the training set are given in Fig. 1. From this figure, it is easy to see that the proposed MDNMF significantly outperforms the other multiple source domain learning algorithms, which is confirmed by the improvements of MDNMF over those methods on all the performance criteria. The reason is that the samples from all the domains are mapped into a common coding space via MDNMF, where they are combined and further used to train a robust classifier for the problem. Because MDNMF provides a unified multiple-domain training set refined into a common final representation, it is not necessary to design domain transfer classifiers. Moreover, with respect to the use of the unlabeled samples, the state-of-the-art methods clearly have no advantage over MDNMF.

Fig. 1. Performances of the 10-fold cross validation on the spam email training set. (a) Sensitivity, (b) Specificity, (c) F-Score, (d) ACC, and (e) MCC.

Fig. 2 shows the ROC and Recall-Precision curves of the methods on the independent test set of the spam email dataset. The AUC values of the ROC curves are summarized in Table 1. It can be seen that MDNMF achieves the best performance among all the compared methods. In particular, DAM achieves an AUC of 0.8848 on the independent spam email test set and the baseline method achieves only 0.8101, while MDNMF achieves 0.9084. The main difference between MDNMF and the other methods is that MDNMF exploits the unlabeled samples from all the domains and maps the samples into a common space. We can see that, through the multiple-domain mapping, MDNMF achieves performance improvements with respect to the concerned performance measures such as TPR, FPR, Recall, and Precision. This is also confirmed by Fig. 2.

Fig. 2. ROC curves and Recall-Precision curves of the independent test on the spam email test set. (a) ROC and (b) Recall-Precision.

Table 1
The AUC values of the ROC curves of different methods on the test set.

Method     AUC
MDNMF      0.9084
TSWMSD     0.8848
DAM        0.8869
MDTAB      0.8585
CCR        0.8457
Baseline   0.8101
3.2. Experiment II: multiple-domain glioma diagnosis

In the second group of experiments, we test our data representation method on the multiple-domain glioma diagnosis problem.

3.2.1. Dataset and setup

We collect 399 samples from four different datasets generated by four different research groups (Murat et al., 2008; Sun et al., 2006; Freije et al., 2004; Nutt et al., 2003). The number of samples in each dataset varies from 50 to 180. Each sample belongs to one of the following four classes:

1. Glioblastoma (GBM): 248 samples.
2. Anaplastic oligodendroglioma (AO) and anaplastic astrocytoma (AA): 79 samples.
3. Astrocytoma (A): 45 samples.
4. Non-tumor: 27 samples.

Microarray analysis (Simek et al., 2004) is used to determine the expression of the samples for 272 genes commonly shared by the four datasets, and these 272 gene expression values are used as the original nonnegative features of the samples. Due to the different equipment and data pre-processing methods used by the different groups, the distributions of the four datasets are significantly different; thus we treat the four datasets as four domains. To evaluate the performance of MDNMF, we perform 10-fold cross validation. The entire dataset is separated into 10 independent folds, and each fold is used as the test set in turn while the other folds are used as the training set. Note that the parameters are only tuned using the 9 folds in the training set by 9-fold cross validation. We adopt ACC as the criterion for this classification problem. In the experiments, we adopt the one-versus-all protocol to handle the multi-class problem, and we also use the standard SVM as the base classifier for each domain in the DAM (Duan et al., 2012b) method. Thus there are 16 base classifiers (4 domains × 4 classes) in total in this experiment.

3.2.2. Results

The boxplot of the ACC values for the 10-fold cross validation on the glioma gene expression dataset is given in Fig. 3. In most of the folds, our algorithm achieves higher recognition accuracies, especially when compared with the baseline method. Moreover, TSWMSD and DAM appear to outperform MDTAB, CCR, and the baseline. The results of TSWMSD, DAM, CCR, and our algorithm all rely on a robust SVM classifier. The difference is that most multiple source domain learning algorithms employ multiple source domains to directly refine an SVM model, while MDNMF learns a common data representation space for the learning of the SVM. Compared with multiple-domain classifier learning, multiple-domain data representation shows more adaptability to the variously distributed domains.

Fig. 3. Boxplot of the ACC values of the 10-fold cross validation on the multiple-domain glioma gene expression dataset.

We also compared the training and testing time with the other methods, and the results are reported in Fig. 4. From this figure we can observe that, compared to the multiple source domain learning methods, the proposed algorithm reduces the learning time significantly. The reason is as follows: since the multiple source domain learning methods are not designed for the multiple-domain learning problem, when we use them we have to treat
each domain as the target domain in turn. Thus the learning procedure is conducted for every domain, treating it as the target domain while leaving the others as source domains. In our algorithm, since we treat all the domains equally, we only perform the learning procedure once for all the domains. The only exception is the baseline method: because it does not conduct any transfer learning procedure, it also performs one single learning procedure on all the samples, regardless of which domain they belong to.

Fig. 4. Training and testing time of various methods on the multiple-domain glioma gene expression dataset.
4. Conclusion and future works

In this paper, we proposed the multiple-domain learning problem and developed a novel multiple-domain data representation method. The multiple-domain learning problem is summarized as a learning problem over several equally treated domains, all of which lack sufficient labeled samples for learning robust classifiers. No domain is specified as the source domain or the target domain, as in traditional cross-domain learning problems. Moreover, we extended NMF for the representation of samples from multiple domains, so that the coding vectors share a common distribution and a single classifier can be learned in the coding space for all the domains. The experiments on two challenging multiple-domain learning tasks demonstrated the effectiveness and efficiency of the proposed method.

In the future, we will develop novel multiple-domain data representation methods by extending other representation methodologies to multiple domains, such as multiple-domain sparse coding (Gao et al., 2011, 2013), multiple-domain hashing (Wang et al., 2010, 2012b), multiple-domain feature selection (Keynia, 2012; Esfandian et al., 2012; Chatterjee and Bhattacherjee, 2011), and multiple-domain component analysis (Aradhya et al., 2008). Moreover, we could also conduct the multiple-domain learning directly at the level of classification or ranking, to propose the multiple-domain SVM (Abdi and Giveki, 2013; Wang et al., 2013c) or multiple-domain database ranking (Yang et al., 2012), by reducing the multiple-domain mismatch of the distributions of classification prediction results or ranking scores. Recently, the multiple-domain SVM was reported by Ji and Sun (2013). This method deals with the multiple-domain learning problem and tries to learn effective classifiers for it. However, the method reported in Ji and Sun (2013) learns different classifiers for different domains, while in our framework we first learn a common representation for all the domains, and then learn a single classifier for all the domains. Moreover, our framework can only handle binary classification up to now, whereas the multiple-domain SVM developed in Ji and Sun (2013) targets multi-class learning.
Acknowledgments

The study was supported by grants from the Chongqing Key Laboratory of Computational Intelligence, China (Grant no. CQ-LCI-2013-02), the Tianjin Key Laboratory of Cognitive Computing and Application, China, and the King Abdullah University of Science and Technology (KAUST), Saudi Arabia.
References

Abdi, M., Giveki, D., 2013. Automatic detection of erythemato-squamous diseases using PSO-SVM based on association rules. Eng. Appl. Artif. Intell. 26, 603-608.
Aradhya, V.N.M., Kumar, G.H., Noushath, S., 2008. Multilingual OCR system for South Indian scripts and English documents: an approach based on Fourier transform and principal component analysis. Eng. Appl. Artif. Intell. 21 (4), 658-668.
Ben-David, A., 2008. About the relationship between ROC curves and Cohen's kappa. Eng. Appl. Artif. Intell. 21 (6), 874-882.
Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.-P., Schoelkopf, B., Smola, A.J., 2006. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), E49-E57.
Bruzzone, L., Marconcini, M., 2010. Domain adaptation problems: a DASVM classification technique and a circular validation strategy. IEEE Trans. Pattern Anal. Mach. Intell. 32 (5), 770-787.
Cai, D., He, X., Wu, X., Han, J., 2008. Non-negative matrix factorization on manifold. In: The Eighth IEEE International Conference on Data Mining, ICDM 2008, pp. 63-72.
Cai, D., He, X., Wang, X., Bao, H., Han, J., 2009. Locality preserving nonnegative matrix factorization. In: Boutilier, C. (Ed.), Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09), pp. 1010-1015.
Cai, D., He, X., Han, J., Huang, T., 2011. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1548-1560.
Chatterjee, S., Bhattacherjee, A., 2011. Genetic algorithms for feature selection of image analysis-based quality monitoring model: an application to an iron mine. Eng. Appl. Artif. Intell. 24 (5), 786-795.
Daume III, H., 2007. Frustratingly easy domain adaptation. In: ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 256-263.
Duan, L., Tsang, I.W., Xu, D., Maybank, S.J., 2009. Domain transfer SVM for video concept detection. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 1375-1381.
Duan, L., Tsang, I.W., Xu, D., 2012a. Domain transfer multiple kernel learning. IEEE Trans. Pattern Anal. Mach. Intell. 34 (3), 465-479.
Duan, L., Xu, D., Tsang, I.W.-H., 2012b. Domain adaptation from multiple sources: a domain-dependent regularization approach. IEEE Trans. Neural Netw. Learn. Syst. 23 (3), 504-518.
Esfandian, N., Razzazi, F., Behrad, A., 2012. A clustering based feature selection method in spectro-temporal domain for speech recognition. Eng. Appl. Artif. Intell. 25 (6), 1194-1202.
Freije, W., Castro-Vargas, F., Fang, Z., Horvath, S., Cloughesy, T., Liau, L., Mischel, P., Nelson, S., 2004. Gene expression profiling of gliomas strongly predicts survival. Cancer Res. 64 (18), 6503-6510.
Gao, S., Chia, L.-T., Tsang, I.-H., 2011. Multi-layer group sparse coding for concurrent image classification and annotation. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, pp. 2809-2816.
Gao, S., Tsang, I.-H., Chia, L.-T., 2013. Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35, 92-104.
Geng, B., Xu, C., Tao, D., Yang, L., Hua, X.-S., 2009. Ensemble manifold regularization. In: CVPR: 2009 IEEE Conference on Computer Vision and Pattern Recognition, vols. 1-4, pp. 2388-2394.
Geng, B., Tao, D., Xu, C., Yang, L., Hua, X.-S., 2012. Ensemble manifold regularization. IEEE Trans. Pattern Anal. Mach. Intell. 34 (6), 1227-1233.
Gruber, P., Meyer-Baese, A., Foo, S., Theis, F.J., 2009. ICA, kernel methods and nonnegativity: new paradigms for dynamical component analysis of fMRI data. Eng. Appl. Artif. Intell. 22 (4-5), 497-504.
Ji, Y., Sun, S., 2013. Multitask multiclass support vector machines: model and experiments. Pattern Recognit. 46 (3), 914-924.
Jiang, W., Zavesky, E., Chang, S.-F., Loui, A., 2008. Cross-domain learning methods for high-level visual concept classification. In: 2008 15th IEEE International Conference on Image Processing, ICIP 2008, pp. 161-164.
Keynia, F., 2012. A new feature selection algorithm and composite neural network for electricity price forecasting. Eng. Appl. Artif. Intell. 25 (8), 1687-1697.
Li, X., Wang, L., Sung, E., 2008. AdaBoost with SVM-based component classifiers. Eng. Appl. Artif. Intell. 21 (5), 785-795.
Murat, A., Migliavacca, E., Gorlia, T., Lambiv, W.L., Shay, T., Hamou, M.-F., de Tribolet, N., Regli, L., Wick, W., Kouwenhoven, M.C.M., Hainfellner, J.A., Heppner, F.L., Dietrich, P.-Y., Zimmer, Y., Cairncross, J.G., Janzer, R.-C., Domany, E., Delorenzi, M., Stupp, R., Hegi, M.E., 2008. Stem cell-related "Self-Renewal" signature and high epidermal growth factor receptor expression associated with resistance to concomitant chemoradiotherapy in glioblastoma. J. Clin. Oncol. 26 (18), 3015-3024.
Nutt, C., Mani, D., Betensky, R., Tamayo, P., Cairncross, J., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M., Batchelor, T., Black, P., von Deimling, A., Pomeroy, S., Golub, T., Louis, D., 2003. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res. 63 (7), 1602-1607.
Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q., 2011. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 22 (2), 199-210.
Simek, K., Fujarewicz, K., Swierniak, A., Kimmel, M., Jarzab, B., Wiench, M., Rzeszowska, J., 2004. Using SVD and SVM methods for selection, classification, clustering and modeling of DNA microarray data. Eng. Appl. Artif. Intell. 17 (4), 417-427.
Sun, L., Hui, A., Su, Q., Vortmeyer, A., Kotliarov, Y., Pastorino, S., Passaniti, A., Menon, J., Walling, J., Bailey, R., Rosenblum, M., Mikkelsen, T., Fine, H., 2006. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell 9 (4), 287-300.
Sun, Q., Chattopadhyay, R., Panchanathan, S., Ye, J., 2011. A two-stage weighting framework for multi-source domain adaptation. In: Advances in Neural Information Processing Systems, pp. 505-513.
Tu, W., Sun, S., 2012a. Dynamical ensemble learning with model-friendly classifiers for domain adaptation. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 1181-1184.
Tu, W., Sun, S., 2012b. Cross-domain representation-learning framework with combination of class-separate and domain-merge objectives. In: Proceedings of the First International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining, CDKD '12, pp. 18-25.
Wang, J., Kumar, S., Chang, S.-F., 2010. Semi-supervised hashing for scalable image retrieval. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 13-18 June 2010, San Francisco, CA, USA, pp. 3424-3431.
Wang, J., Li, Y., Zhang, Y., Xie, H., Wang, C., 2011. Boosted learning of visual word weighting factors for bag-of-features based medical image retrieval. In: Proceedings of the Sixth International Conference on Image and Graphics (ICIG 2011), pp. 1035-1040.
Wang, J.J.-Y., Almasri, I., Gao, X., 2012a. Adaptive graph regularized nonnegative matrix factorization via feature selection. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 963-966.
Wang, J., Kumar, S., Chang, S.-F., 2012b. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2393-2406.
Wang, J.J.-Y., Bensmail, H., Gao, X., 2013a. Multiple graph regularized nonnegative matrix factorization. Pattern Recognit. 46 (10), 2840-2847.
Wang, J.J.-Y., Wang, X., Gao, X., 2013b. Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinforma. 14 (1), 107.
Wang, X.-Y., Zhang, B.-B., Yang, H.-Y., 2013c. Active SVM-based relevance feedback using multiple classifiers ensemble and features reweighting. Eng. Appl. Artif. Intell. 26, 368-381.
Yang, J., Yan, R., Hauptmann, A.G., 2007. Cross-domain video concept detection using adaptive SVMs. In: Proceedings of the ACM International Multimedia Conference and Exhibition, pp. 188-197.
Yang, Y., Nie, F., Xu, D., Luo, J., Zhuang, Y., Pan, Y., 2012. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans. Pattern Anal. Mach. Intell. 34, 723-742.
Yao, Y., Doretto, G., 2010. Boosting for transfer learning with multiple sources. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 1855-1862.
Zheng, Z., Yang, J., Zhu, Y., 2007. Initialization enhancer for non-negative matrix factorization. Eng. Appl. Artif. Intell. 20 (1), 101-110.
Zhuang, F., Luo, P., Xiong, H., Xiong, Y., He, Q., Shi, Z., 2010. Cross-domain learning from multiple sources: a consensus regularization perspective. IEEE Trans. Knowl. Data Eng. 22, 1664-1678.