Expert Systems with Applications 38 (2011) 12151–12159
Multiple-kernel SVM based multiple-task oriented data mining system for gene expression data analysis

Zhenyu Chen a,*, Jianping Li b, Liwei Wei b, Weixuan Xu b, Yong Shi c,d

a Department of Management Science and Engineering, School of Business Administration, Northeastern University, Shenyang 110819, China
b Institute of Policy & Management, Chinese Academy of Sciences, Beijing 100080, China
c Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100080, China
d College of Information Sciences and Technology, University of Nebraska at Omaha, Omaha, NE 68118, USA
Keywords: Support vector machine; Multiple-kernel learning; Feature selection; Data fusion; Decision rule; Associated rule; Subclass discovery; Gene expression
Abstract

Gene expression profiling using the DNA microarray technique has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Recently, many computational methods have been used to discover marker genes, make class predictions and perform class discovery based on the gene expression data of cancer tissue. However, those techniques fall short in some critical areas: (a) interpretation of the solution and of the extracted knowledge; (b) integration of data from various sources and incorporation of prior knowledge into the system; (c) provision of a global understanding of complex biological systems through a complete knowledge discovery framework. This paper proposes a multiple-kernel SVM based data mining system. Multiple tasks, including feature selection, data fusion, class prediction, decision rule extraction, associated rule extraction and subclass discovery, are incorporated in an integrated framework. The ALL-AML leukemia dataset is used to demonstrate the performance of this system.

© 2011 Elsevier Ltd. All rights reserved.
This research has been partially supported by grants from the National Natural Science Foundation of China (#70621001, #70531040).
* Corresponding author. E-mail addresses: [email protected] (Z. Chen), [email protected] (J. Li), [email protected] (L. Wei), [email protected] (W. Xu), [email protected] (Y. Shi).
doi:10.1016/j.eswa.2011.03.025

1. Introduction

DNA microarray technology makes it possible to measure the expression levels of thousands of genes simultaneously in a single experiment and allows us to investigate the molecular state of a living cell. There are numerous potential applications of DNA microarray technology. In particular, it provides valuable insights into the molecular characteristics of cancers. Utilizing gene expression data may be the most direct way to improve diagnostic accuracy and to discover therapeutic drug targets (Alizadeh, Eisen, & Davis, 2000; Alon et al., 1999; Golub, Slonim, & Tamayo, 1999; Matthias, Anthony, & Nikola, 2003; Yeoh, Ross, & Shurtleff, 2002). Many machine learning and statistical methods have been used to discover potential knowledge from the gene expression data of cancer tissue and to give new insights into the diagnosis and treatment of cancers. Most of the literature on tissue gene expression data analysis focuses on three issues: gene identification, class discovery and class prediction.

Gene identification is the selection of genes with powerful explanatory capacity for a specific task such as classification. Gene identification
can improve the accuracy of classifiers and reduce computational costs. More importantly, a small subset of biologically relevant genes may be useful for understanding the underlying mechanisms of cancer and for designing less expensive experiments (Niijima & Kuhara, 2006). The methods proposed to tackle gene identification fall into two basic categories: filter and wrapper methods (Inza, Larranaga, Blanco, & Cerrolaza, 2004). In general, wrapper methods obtain better predictive accuracy than filter methods, but they require more computational time. Gene identification algorithms may also be categorized in another way: gene ranking and gene subset selection. Gene ranking is the most commonly used technique. It evaluates each gene individually with respect to a certain criterion and assigns a score reflecting its ability to distinguish between the various classes of samples. Measures based on machine learning and statistical learning theory (Fuhrman et al., 2000; Golub et al., 1999; Su, Murali, Pavlovic, Schaffer, & Kasif, 2003), such as information gain, the twoing rule, sum minority, max minority, the Gini index, sum of variances, the one-dimensional SVM and the t-statistic, are popular criteria for ranking genes. The drawback of gene ranking methods is that they evaluate each gene in isolation and ignore gene-to-gene correlation. From a biological perspective, we know that groups of genes, as atomic units, work together as pathway components and reflect the state of the cell (Piatetsky-Shapiro & Tamayo, 2003). Gene subset selection, on the contrary, conducts the search for a good subset by using the classifier itself as part of the evaluation function. Considering the high computational cost of the searching process, heuristic
algorithms such as SVM-RFE (Duan, Rajapakse, & Wang, 2005; Guyon, Weston, & Barnhill, 2002; Tang, Zhang, & Huang, 2007) and saliency analysis (Cao, Seng, Gu, & Lee, 2003) are widely used.

Class discovery usually refers to identifying previously unknown subclasses using unsupervised techniques. Hierarchical clustering is a popular unsupervised tool due to its intuitive appeal and visualization properties (Eisen, Spellman, Brown, & Botstein, 1998). Another family of clustering methods is greedy search-based iterative descent clustering, such as the self-organizing map (SOM) (Tamayo et al., 1999), K-means clustering (Li, Weinberg, Darden, & Pedersen, 2001) and Bayesian clustering (Roth & Lange, 2004). Clustering techniques usually face difficulties that include choosing the right number of clusters and evaluating the clustering results (Monti, Tamayo, & Mesirov, 2003).

Class prediction refers to the assignment of particular samples to already-defined classes, which may reflect current states or future outcomes (Golub et al., 1999). Widely used methods for class prediction include decision trees (Camp & Slattery, 2002), artificial neural networks (ANN) (Ando, Suguro, & Kobayashi, 2003; Tan & Pan, 2005; Tung & Quek, 2005), support vector machines (SVM) (Mao, Zhou, & Pi, 2005; Statnikov, Aliferis, & Tsamardinos, 2005) and so on.

The limitation of the above-mentioned methods is that each of them can deal with only one task: gene identification, class discovery or class prediction. However, microarray data mining involves various tasks, and it is necessary to incorporate them into an enlarged system (Matthias et al., 2003):

(1) Feature selection can strongly influence the performance of the methods. The most distinctive character of gene expression data is that it contains a large number of gene expression values (several thousands to tens of thousands) and a relatively small number of samples (a few dozen). This brings great challenges for commonly used knowledge discovery methods. When the number of features far exceeds the number of available training samples, most classification and clustering methods, such as decision trees, ANN and K-means, are sensitive to noise and susceptible to overfitting (Guyon et al., 2002; Monti et al., 2003; Radivojac, Chawla, & Dunker, 2004; Tung & Quek, 2005). Therefore, feature selection is a necessary prior stage of gene expression data mining (Ando et al., 2003; Matthias et al., 2003; Tan & Pan, 2005; Tung & Quek, 2005). Besides, many feature selection methods require a prefiltering stage that refines the raw feature set in order to achieve satisfactory performance (Tung & Quek, 2005; Zhu, Ong, & Dash, 2007). Especially for wrapper-based feature selection methods, hybrid wrapper-filter frameworks are a popular way to improve performance.

(2) With the continuous emergence of new DNA data and new array technologies, how to integrate data from various sources and incorporate prior knowledge is another challenging problem (Fellenberg, 2003). Different microarray techniques use different mechanisms to measure gene expression levels, which makes the expression levels reported by different techniques incomparable with each other. Consequently, one challenge is integrating databases so as to connect this disparate information, as well as performing studies that collectively analyze datasets from diverse sources with heterogeneous formats.
A natural approach is to apply normalization so that the gene expression values from different data sources conform to each other (Goh & Kasabov, 2003). Another is to use statistical methods to combine the experimental results obtained from each single-source dataset (Filkov & Skiena, 2003; Hwang et al., 2005).
(3) Considering the high cost and long duration of experimental research, various tasks in the medical field, including disease diagnosis, outcome prediction and drug discovery, depend on computational methods. It is therefore necessary to design a multiple-task oriented knowledge discovery system as a complete scheme. Currently, the hybridized pipeline is a commonly used way to integrate multiple tasks by chaining a sequence of methods (Kim, Zhou, Morse, & Park, 2005; Matthias et al., 2003; Radivojac et al., 2004; Tan & Pan, 2005; Tung & Quek, 2005; Sethi & Leangsuksun, 2006). Each method used in such a system independently sets its initial values, carries out its own optimization algorithm and tunes its own free parameters. Although each individual method may perform well, their integration usually does not: a small computational error may be transmitted to the following step and enlarged, which increases the uncertainty of the data mining system. For example, suppose a filter method is used to identify the relevant genes, an SVM is used for classification and a decision tree is used to extract comprehensible rules. The gene subset selected by the filter method, which is usually not optimal, may greatly affect the performance of the SVM and the decision tree. Besides, the good classification performance of the SVM cannot be taken advantage of by the next step, rule extraction by the decision tree.

In this paper, a multiple-kernel SVM (MK-SVM) is proposed for multiple-task oriented microarray data mining. Unlike the standard SVM, which is usually viewed as a "black box", the multiple-kernel SVM based gene expression data mining system is applicable to feature selection, data fusion, class prediction, decision rule extraction, associated rule extraction and subclass discovery.

This paper is organized as follows: In Section 2, we give a series of algorithms. In Section 3, we develop a methodology for multiple-task oriented gene expression data mining. Section 4 presents the case study on the ALL-AML leukemia dataset. Section 5 summarizes the results and draws a general conclusion.

2. Multi-kernel based SVM: Proposed algorithms

2.1. SVM: A brief introduction

We give only a brief introduction to SVM for a typical binary classification problem; the basic SVM concepts can be found in Chen, Li, and Wei (2007), Cristianini and Shawe-Taylor (2000) and Vapnik (1995). Given a set of data points $G = \{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^m$ and $y_i \in \{+1, -1\}$, the decision function of the SVM is
$$f(x) = \langle w, \phi(x) \rangle + b, \tag{1}$$

where $\phi(x)$ is a mapping of the sample $x$ from the input space to a high-dimensional feature space and $\langle \cdot,\cdot \rangle$ denotes the dot product in the feature space. The optimal values of $w$ and $b$ can be obtained by solving the following regularized optimization problem:

$$\min\; J(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i, \tag{2}$$

$$\text{s.t.}\quad y_i\big(\langle w, \phi(x_i)\rangle + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, n, \tag{3}$$

where $\xi_i$ is the $i$th slack variable and $C$ is the regularization parameter. This problem is solved computationally through its dual form:
$$\max\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \tag{4}$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0,\quad 0 \le \alpha_i \le C,\quad i = 1, \ldots, n, \tag{5}$$

where $\alpha_i$ is the Lagrange multiplier corresponding to the sample $x_i$ and $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle$ is a kernel function. The most commonly used kernel function is the Gaussian kernel:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right), \tag{6}$$

where $\sigma^2$ is the kernel parameter. The generalization performance of the SVM is mainly affected by the kernel parameters, for example $\sigma^2$, and by the regularization parameter $C$. They have to be set beforehand and tuned carefully.

2.2. Design of multiple-kernels

Multiple-kernel learning replaces the commonly used single fixed kernel, say the Gaussian kernel, by a multi-kernel (Bach, Lanckrient, & Jordan, 2004; Chen, 2008; Chen et al., 2007; Lanckrient, Cristianini, Bartlett, El Ghaoui, & Jordan, 2004; Li, Chen, Wei, & Xu, 2007; Micchelli & Pontil, 2005; Parrado-Hernández, Mora-Jiménez, Arenas-García, Figueiras-Vidal, & Navia-Vázquez, 2003; Sonnenburg, Ratsch, Schafer, & Scholkopf, 2006; Tsang & Kwok, 2006; Wei, Chen, & Li, 2011). A multi-kernel is an ensemble of multiple heterogeneous kernels or of parameterizations of a single kernel. It provides the chance to choose an optimal kernel. Moreover, it can also be used to improve the transparency of the SVM (Chen, 2008; Chen et al., 2007).

Formulation 1: In practice, a multi-kernel can be written as a linear combination of basic kernels:

$$k(x_i, x_j) = \sum_{d} \beta_d k_d(x_i, x_j), \quad \beta_d \ge 0, \tag{7.1}$$

where $\beta_d$ is the weight on the corresponding basic kernel $k_d(x_i, x_j)$. Usually, heterogeneous kernels, or one kernel that is continuously or discretely parameterized, are selected as the basic kernels.

Formulation 2: A free parameter $w_d$ can be introduced to integrate expert and situational knowledge:

$$k(x_i, x_j) = \sum_{d} w_d \beta_d k_d(x_i, x_j), \quad \beta_d \ge 0. \tag{7.2}$$

Unlike $\beta_d$, which is learned automatically, $w_d$ is set according to expert and situational knowledge.

Formulation 3: A multiple-kernel can also be written as a convex combination of basic kernels:

$$k(x_i, x_j) = \sum_{d} \beta_d k_d(x_i, x_j), \quad \beta_d \ge 0, \quad \sum_{d}\beta_d = 1. \tag{7.3}$$

For the corresponding applications, versatile multiple-kernels can be designed by adopting different basic kernels. Three kinds of multiple-kernels based on Eq. (7.1) are proposed:

(1) A new kernel, the single-feature kernel, was introduced in Chen et al. (2007) for feature selection via SVM. In the present paper it is called $K_1$:

$$K_1(x_i, x_j) = \sum_{d=1}^{m} w_d \beta_d k(x_{i,d}, x_{j,d}), \quad \beta_d \ge 0, \tag{8}$$

where $x_{i,d}$ denotes the feature in the $d$th dimension of the $i$th input vector. In Eq. (8), the parameter $\beta_d$, which is named the feature coefficient, represents the weight on the corresponding single-feature basic kernel. Through the introduction of the feature coefficients $\beta$, the optimal feature subset can be obtained in an easier way than by the usual combinatorial optimization methods (see details in the next two sections). The selected feature subset is the set of features having nonzero feature coefficients.

(2) Let $G^{(s)} = \{(x_i^{(s)}, y_i^{(s)})\}$, where $x_i^{(s)} \in \mathbb{R}^m$ and $y_i^{(s)} \in \{+1, -1\}$ come from the $s$th data source. We define the following multi-kernel to integrate multi-source data:

$$K_2(x_i, x_j) = \sum_{s} g_s\, k\big(x_i^{(s)}, x_j^{(s)}\big), \quad g_s \ge 0. \tag{9}$$

(3) Similarly, a new multi-kernel can be defined to perform feature selection and multi-source data fusion simultaneously:

$$K_3(x_i, x_j) = \sum_{d} \beta_d \sum_{s} g_s\, k\big(x_{i,d}^{(s)}, x_{j,d}^{(s)}\big) = \sum_{s,d} \mu_{s,d}\, k\big(x_{i,d}^{(s)}, x_{j,d}^{(s)}\big), \tag{10}$$

where $\mu_{s,d} = \beta_d g_s \ge 0$. Besides, the ANOVA technique (Gunn & Kandola, 2002) can be used in the multi-kernel to increase the number of basic kernels.
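For concreteness, a minimal sketch of how the single-feature multi-kernel $K_1$ of Eq. (8) can be assembled from Gaussian basic kernels is given below; the function names and the uniform initialization $\beta_d = 1$ are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Gaussian kernel matrix (Eq. (6)) between row-sample matrices A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma2)

def k1_kernel(X, Z, beta, w, sigma2):
    """Single-feature multi-kernel K1 (Eq. (8)): a weighted sum of
    one Gaussian basic kernel per feature dimension d."""
    K = np.zeros((X.shape[0], Z.shape[0]))
    for d in range(X.shape[1]):
        kd = gaussian_kernel(X[:, [d]], Z[:, [d]], sigma2)
        K += w[d] * beta[d] * kd
    return K

# toy usage with uniform initial feature coefficients beta_d = 1
X = np.random.default_rng(1).normal(size=(6, 4))
beta = np.ones(4)   # learned feature coefficients (here: initial values)
w = np.ones(4)      # expert-knowledge weights of Eq. (7.2)
K = k1_kernel(X, X, beta, w, sigma2=10.0)
```

$K_2$ and $K_3$ follow the same pattern, with the outer sum running over data sources, or over sources and feature dimensions jointly.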
2.3. One-step algorithm

When the multi-kernel in Eq. (7.2) is used, Eqs. (4) and (5) are written as:

$$\max\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{d} w_d \beta_d k_d(x_i, x_j), \tag{11}$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0,\quad 0 \le \alpha_i \le C,\; i = 1, \ldots, n,\quad \beta_d \ge 0. \tag{12}$$

Naturally, a two-stage iterative procedure can be used to decompose the original problem into two sub-problems: a quadratic program and a linear program, both of which can be solved by standard optimizers. The convergence of the two-stage iterative algorithm is discussed and proved in Gunn and Kandola (2002) and Lee, Kim, and Lee (2006). The two-stage iterative algorithm typically converges in a few steps, and it is often sufficient to take a one-step update (Lee et al., 2006), which results in a low computational cost. By contrast, most other algorithms proposed for microarray data mining are not capable of solving, within reasonable time, optimization problems that combine such a large number of basic kernels (no fewer than the number of genes).

Algorithm 1. One-step algorithm

Input: Input data vectors $x_i \in \mathbb{R}^m$ and outputs $y_i \in \{+1, -1\}$.
Step 1: Initialization. Set the original regularization parameters $(C, \lambda)$ and the kernel parameter $\sigma^2$ at random. The original feature coefficients $\beta^{(0)}$ are set to $\{\beta_d^{(0)} = 1 \mid d = 1, \ldots, m\}$.
Step 2: Solve for the Lagrange coefficients $\alpha$. The Lagrange coefficients $\alpha$ are obtained by solving the following quadratic program, in which the original feature coefficients $\beta^{(0)}$ are used:

$$\max\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{d=1}^{m} w_d \beta_d^{(0)} k_d(x_i, x_j), \tag{17}$$

$$\text{s.t.}\quad \sum_{i=1}^{n} y_i \alpha_i = 0,\quad 0 \le \alpha_i \le C,\quad i = 1, \ldots, n. \tag{18}$$

Step 3: Solve for the feature coefficients $\beta$. The feature coefficients $\beta$ are obtained by solving the following dual linear program (see the Appendix for the detailed derivation), in which the Lagrange coefficients $\alpha$ solved in the last step are used:

$$\max\; \sum_{i=1}^{n} u_i, \tag{19}$$

$$\text{s.t.}\quad \sum_{i=1}^{n} u_i y_i \sum_{j=1}^{n}\alpha_j y_j k(x_{i,d}, x_{j,d}) \le w_d,\; d = 1, \ldots, m;\quad \sum_{i=1}^{n} u_i y_i = 0;\quad 0 \le u_i \le \lambda,\; i = 1, \ldots, n. \tag{20}$$

Step 4: Tune the free parameters and go back to Step 2 until the output is optimal.
Output: $f(x_i) = \operatorname{sign}\left(\sum_{d=1}^{m} w_d \beta_d \sum_{j=1}^{n} \alpha_j y_j k_d(x_i, x_j) + b\right)$.
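The paper gives no code; as a rough sketch under our own assumptions, one alternation of Algorithm 1 can be written with off-the-shelf solvers: the alpha-step (Eqs. (17)-(18)) reuses a standard SVM trained on the precomputed multi-kernel of Eq. (7.2), and the beta-step solves a linear program that is one primal consistent with the stated dual (19)-(20) (the paper's own derivation is in its Appendix, not reproduced here, and the placement of the expert weights w_d in the beta-step is our assumption). All function and variable names below are ours.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def alpha_step(K, y, C):
    """QP step, Eqs. (17)-(18): fit a standard SVM on the precomputed
    multi-kernel matrix K and recover alpha_i * y_i and the bias b."""
    svc = SVC(C=C, kernel="precomputed").fit(K, y)
    alpha_y = np.zeros(len(y))
    alpha_y[svc.support_] = svc.dual_coef_[0]  # entries equal alpha_i * y_i
    return alpha_y, svc.intercept_[0]

def beta_step(S, y, w, lam):
    """LP step: minimize sum_d w_d*beta_d + lam*sum_i xi_i subject to
    y_i*(S[i,:] @ beta + b) >= 1 - xi_i, beta >= 0, xi >= 0 -- a linear
    program whose dual is Eqs. (19)-(20); variables are (beta, xi, b)."""
    n, m = S.shape
    c = np.concatenate([w, lam * np.ones(n), [0.0]])
    A_ub = np.hstack([-y[:, None] * S, -np.eye(n), -y[:, None]])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (m + n) + [(None, None)]  # bias b is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m]  # the feature coefficients beta

def one_step(basic_kernels, y, w, C, lam):
    """One alternation: alpha from the QP with beta^(0) = 1, then beta
    from the LP. basic_kernels has shape (m, n, n)."""
    m = basic_kernels.shape[0]
    K = np.tensordot(w * np.ones(m), basic_kernels, axes=1)      # Eq. (7.2)
    alpha_y, b = alpha_step(K, y, C)
    S = np.stack([Kd @ alpha_y for Kd in basic_kernels], axis=1)  # S[i, d]
    return beta_step(S, y, w, lam), alpha_y, b
```

Features whose returned beta is (numerically) zero are discarded, which is how the one-step algorithm doubles as the feature selector of Algorithm 1.1 below.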
Algorithm 1.1. Feature selection algorithm

Input: Input data vectors $x_i \in \mathbb{R}^m$ and outputs $y_i \in \{+1, -1\}$.
Step 1: Initialization. Use the single-feature kernel $K_1$, i.e.

$$k_d(x_i, x_j) = k(x_{i,d}, x_{j,d}).$$

Step 2: Train an MK-SVM by the one-step algorithm (Algorithm 1).
Output: $f(x_i) = \operatorname{sign}\left(\sum_{d=1}^{m} w_d \beta_d \sum_{j=1}^{n} \alpha_j y_j k(x_{i,d}, x_{j,d}) + b\right)$.

Algorithm 1.2. Data fusion algorithm

Input: Input data vectors $x_i \in \mathbb{R}^m$ and outputs $y_i \in \{+1, -1\}$.
Step 1: Initialization. Use the multi-source kernel $K_2$, i.e.

$$k_s(x_i, x_j) = k\big(x_i^{(s)}, x_j^{(s)}\big).$$

Step 2: Train an MK-SVM by the one-step algorithm (Algorithm 1).
Output: $f(x_i) = \operatorname{sign}\left(\sum_{s} g_s \sum_{j=1}^{n} \alpha_j y_j k\big(x_i^{(s)}, x_j^{(s)}\big) + b\right)$.

2.4. Direct and hybrid algorithms

A multi-kernel with the constraint $\sum_d \beta_d = 1$ is designed:

$$k(x_i, x_j) = \sum_{d} \beta_d k_d(x_i, x_j), \quad \beta_d \ge 0, \quad \sum_{d}\beta_d = 1. \tag{21}$$

Let $\gamma_{i,d} = \alpha_i \beta_d$; then Eqs. (11) and (12) change into (see Chen (2008) for the detailed derivation):

$$\max\; \sum_{d}\sum_{i=1}^{n}\gamma_{i,d} - \frac{1}{2}\sum_{i,j,d} \gamma_{i,d}\,\gamma_{j,d}\, y_i y_j k_d(x_i, x_j), \tag{22}$$

$$\text{s.t.}\quad \sum_{d=1}^{M}\sum_{i=1}^{n} y_i \gamma_{i,d} = 0;\quad 0 \le \sum_{d=1}^{M}\gamma_{i,d} \le C,\; i = 1, \ldots, n;\quad \gamma_{i,d} \ge 0,\; d = 1, \ldots, M. \tag{23}$$

In the two-stage iterative algorithm, two sets of coefficients $(\alpha, \beta)$ need to be optimized iteratively. In contrast, the above formulation uses the multi-kernel with the constraint $\sum_d \beta_d = 1$ and results in only one set of coefficients, $\gamma$. Its advantage is that $\gamma$ can be obtained directly by any standard SVM optimizer. Details of SVM optimizers can be found in Cristianini and Shawe-Taylor (2000).

Algorithm 2. Direct algorithm

Input: Input data vectors $x_i \in \mathbb{R}^m$ and outputs $y_i \in \{+1, -1\}$.
Step 1: Initialization. Set the original regularization parameter $C$ and the kernel parameter (i.e. $\sigma^2$) at random.
Step 2: Train the MK-SVM described in Eqs. (22) and (23) with a standard SVM optimizer.
Step 3: Tune the free parameters and go back to Step 2 until the output is optimal.

For microarray data, the coefficient matrix ($n \times m$) is sometimes too large to be optimized. In Chen et al. (2007), a 1-norm based linear program is formulated to further reduce the computational cost. Alternatively, we propose a hybrid algorithm in which the one-step algorithm (Algorithm 1) is used first to reduce the number of features to some dozens; the direct algorithm (Algorithm 2) is then used to refine the results and deliver them to Algorithm 4 for rule extraction.

Algorithm 3. Hybrid algorithm

Input: Input data vectors $x_i \in \mathbb{R}^m$ and outputs $y_i \in \{+1, -1\}$.
Step 1: Train an MK-SVM by the one-step algorithm (Algorithm 1) and obtain a feature subset.
Step 2: In the feature subspace selected in the last step, train an MK-SVM by the direct algorithm (Algorithm 2).

2.5. Rule extraction algorithms

The rule extraction algorithms include three components: decision rule extraction, rule reduction and ordering, and associate rule extraction (see Algorithm 4).

In Chen et al. (2007), we give the approach for extracting decision rules from a linear SVM. For example, the decision rules extracted in a 2-dimensional space are written as follows:

$$\text{IF } SV_1 \le x_1 \le +1 \ \text{and}\ -1 \le x_2 \le SV_2,\ \text{THEN } y = 1\,(-1), \tag{24}$$

where $SV$ denotes a support vector. In Chen (2008), we extend the linear-SVM based decision rule extraction approach to the nonlinear case; moreover, the computational complexities are compared there.

Algorithm 4. Rule extraction from MK-SVM

Input: Input data vectors $x_i \in \mathbb{R}^m$ and outputs $y_i \in \{+1, -1\}$.
Step 1: Do for $t = 1, \ldots, T$:
  (1) Set $w_d^{(t)}$ according to expert knowledge.
  (2) Train an MK-SVM by the hybrid algorithm (Algorithm 3) and obtain the support vectors.
  (3) Extract decision rules from the MK-SVM (see Chen (2008) and Chen et al. (2007) for details) and obtain a rule set.
  (4) Refine the rules extracted in the last step by rule reduction and ordering (see Chen (2008) and Chen et al. (2007) for details).
  (5) Discover new subclasses (Algorithm 5).
Step 2: Extract associate rules from the MK-SVM (Algorithm 6) and obtain a set of associate rules.
Step 3: Refine the discovered new subclasses.

Unlike traditional unsupervised learning, MK-SVM gives a new way to discover new subclasses. Two measures are defined for subclass discovery:

(1) Overlapping: it measures how many times all the premises of two rules are satisfied by one sample while their consequents match the target decision at the same time.
(2) Coverage: it measures how many samples are covered by at least one rule of a pairwise rule.
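As an illustration, the two measures can be encoded as follows under a rule representation of our own choosing (a premise predicate plus a class label); none of the names below come from the paper.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

# A rule is modeled as a premise (boolean test on a sample) plus a class label.
@dataclass
class Rule:
    premise: Callable[[np.ndarray], bool]
    label: int

def overlapping(r1: Rule, r2: Rule, X: np.ndarray, y: np.ndarray) -> int:
    """Count samples satisfying BOTH premises while both consequents
    match the target decision."""
    return sum(1 for xi, yi in zip(X, y)
               if r1.premise(xi) and r2.premise(xi)
               and r1.label == yi and r2.label == yi)

def coverage(r1: Rule, r2: Rule, X: np.ndarray) -> int:
    """Count samples covered by at least one rule of the pair."""
    return sum(1 for xi in X if r1.premise(xi) or r2.premise(xi))
```

Ranking pairs by low overlapping and high coverage is exactly the selection step of Algorithm 5 below.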
The pairwise rules with low overlapping and a high coverage rate may reveal the potential subclasses. Thus, in the following algorithm, those rules are viewed as possible divisions into subclasses. For class discovery, validation of the computational results is extremely important. It is accepted that if the discovered subclasses reflect the true structure, then a class predictor built on these subclasses should perform well (Golub et al., 1999).

Algorithm 5. New subclass discovery from MK-SVM

Input: A set of rules obtained by the decision rule extraction and refinement algorithms.
Step 1: Compute the overlapping and coverage rate of each pairwise rule.
Step 2: Rank all pairwise rules and select the one with the lowest overlapping and highest coverage rate.
Step 3: Label the samples: the samples covered by one rule are labeled as class 1, and the others as class 2.
Step 4: The samples with the new labels are used to train a new MK-SVM.
Step 5: The remaining samples, which are not covered by the top pairwise rule, are fed into the trained MK-SVM to be labeled.

MK-SVM can also be used to discover associate rules, and thereby associate genes. The confidence measure is defined as follows:

$$\text{Confidence} = \frac{\text{number of samples covered by the two rules at the same time}}{\text{total number of samples}}$$

Algorithm 6. Extracting associate rules from MK-SVM

Input: Two sets of rules.
Step 1: Compute the confidence measure of each pair of rules across the two rule sets.
Step 2: Rank all pairwise rules and select those with the highest confidence.

3. Multiple-kernel SVM based gene expression data mining system: A methodology

Fig. 1 shows the six applications of MK-SVM to gene expression data mining: feature selection, data fusion, class prediction, decision rule extraction, associated rule extraction and subclass discovery. The algorithms for those applications were given in the last section. The MK-SVM based gene expression data mining system is shown in Fig. 2.

[Fig. 1. Six applications of MK-SVM in the gene expression data mining system.]
[Fig. 2. MK-SVM based gene expression data mining system. Two expert-knowledge weight settings each drive feature selection and data fusion, followed by class prediction, decision rule extraction, associated rule extraction and subclass discovery.]
4. Experiments

The ALL-AML leukemia dataset is used to evaluate the performance of the MK-SVM based system. The ALL-AML leukemia dataset (Golub et al., 1999) is already divided into a training set and a testing set. The training set consists of 38 bone marrow samples (27 ALL and 11 AML) with 7129 probes from 6817 human genes. The testing set provides 34 samples, with 20 ALL and 14 AML. The ALL-AML leukemia dataset has been analyzed in various previous papers, and it is highly suited to demonstrating the whole process of MK-SVM based microarray data mining.

In this section, the raw data are used without any preprocessing, and the Gaussian kernel is used. The results reported in this section are obtained on the independent testing set and with 10-fold cross-validation on the training set. Our implementation is carried out in the Matlab 6.5 development environment.

4.1. Feature selection, data fusion and class prediction

By computational analysis, Guyon et al. revealed that there are significant differences between the distributions of the training and test sets of the ALL-AML leukemia dataset (Guyon et al., 2002). They further pointed out that the training and test sets come from different data sources. Therefore, we can conduct multiple-source data fusion on this dataset. The initial dataset (38 training and 34 testing samples) is named D1. The training and testing sets are each randomly divided into two equal parts; one part of the training set and one part of the testing set constitute a new dataset, D2, and the other two parts constitute D3.

Firstly, Algorithm 1.1 is used to identify the optimal diagnostic gene subset. The regularization parameter λ in Eq. (20) controls the sparsity of the feature coefficients β and hence the size of the selected gene subset; therefore, it should be tuned carefully (Chen et al., 2007). The classification accuracies obtained with different kernels (Gaussian, K1, K2 and K3), datasets (D1, D2, D3) and kernel parameters (σ²) are shown in Table 1. There are 34 testing samples for D1 with the Gaussian and K1 kernels. When the MK-SVM is trained on D3 (resp. D2), D2 (resp. D3) is used as the testing set (36 samples). Note that the Gaussian kernel is used as the basic kernel in K1, K2 and K3.

Table 1
Classification accuracies with different kernels, kernel parameters (σ²) and independent testing sets.

σ²      Gaussian (D1)   K1 (D1)   K2 (D2)   K3 (D2)   K2 (D3)   K3 (D3)
5000    16/34           26/34     32/36     33/36     26/36     30/36
7000    16/34           26/34     32/36     34/36     27/36     30/36
9000    18/34           25/34     33/36     33/36     27/36     31/36
11000   18/34           25/34     34/36     34/36     27/36     32/36
13000   23/34           28/34     35/36     33/36     27/36     33/36
15000   21/34           26/34     35/36     33/36     28/36     33/36
17000   19/34           25/34     35/36     35/36     30/36     31/36

Since the raw gene expression data contain large variations and a high level of noise, the SVM with the Gaussian kernel does not perform well (see the second column of Table 1). There are three ways to improve its performance: (1) One way is to choose a suitable preprocessing method to reduce the variation of the data. (2) Another is to reduce the feature space via feature selection; compared with the SVM with the Gaussian kernel, better classification accuracies are achieved when the single-feature kernel (K1) is used, indicating that effective feature reduction improves the robustness of SVM on high-noise data. (3) The fusion of multiple-source data is the third way. Table 1 shows that the classification accuracies of MK-SVM with K2 are far better than those of SVM with the Gaussian kernel, and even better than those on the reduced feature subset (MK-SVM with K1), which suggests that data heterogeneity interferes with the selection of the optimal gene subset. MK-SVM with K3 achieves the best results: only one sample in D2 (as the independent testing set) and three samples in D3 are misclassified. Table 2 shows the selected optimal gene subsets on the three training sets.

Table 2
Optimal gene subsets on the three datasets.

Dataset   Optimal gene subset
D1        AF009426, D49950, J03930, M31166, M54995, M77142, M81933, U46499, X51804
D2        L09717, M23197, M81933, U48736, U46499, U57094, X07743, U17838, M31211, U72936, M28170
D3        AF009426, J03890, M23197, M33195, M63379, S77763, U18062, U51990, U79288, X65644, X76648, M83652, M8399, U75276, M31523
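A sketch of the D2/D3 construction described above, with small stand-in matrices in place of the real data (the paper does not publish its splitting code, so all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

# stand-ins for the real 38 x 7129 training and 34 x 7129 testing matrices
X_tr, y_tr = rng.normal(size=(38, 20)), rng.choice([-1, 1], 38)
X_te, y_te = rng.normal(size=(34, 20)), rng.choice([-1, 1], 34)

def half_split(X, y):
    """Randomly split a sample set into two (nearly) equal halves."""
    idx = rng.permutation(len(y))
    h = len(y) // 2
    return (X[idx[:h]], y[idx[:h]]), (X[idx[h:]], y[idx[h:]])

tr_a, tr_b = half_split(X_tr, y_tr)   # halves of the training set
te_a, te_b = half_split(X_te, y_te)   # halves of the testing set
# D2 = one training half + one testing half; D3 = the remaining halves
D2_X, D2_y = np.vstack([tr_a[0], te_a[0]]), np.concatenate([tr_a[1], te_a[1]])
D3_X, D3_y = np.vstack([tr_b[0], te_b[0]]), np.concatenate([tr_b[1], te_b[1]])
```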
4.2. Decision rule extraction

Based on the selected gene subsets, Algorithm 4 is used to extract decision rules. With the threshold of the false-alarm measure set to 0.1 and the threshold of the soundness measure set to 0.9, the rules extracted on the original dataset (38 training and 34 testing samples) are shown in Table 3. It can be seen that the rule set contains only one marker gene: U46499. When the thresholds of the false-alarm and soundness measures are set to 0.2 and 0.6, respectively, the extracted rules are shown in Table 4; this rule set contains three genes: U46499, M31166 and M77142. In comparison with Table 3, the overall accuracy does not change.

The performance of an interpretable model is evaluated by three measures: the number of rules, the number of conditions per rule and the number of misclassified samples. The performance of MK-SVM and of some other interpretable models (Fang et al., 2006; He, Tang, Zhang, & Sunderraman, 2006) is shown in Table 5. CART is a kind of classification tree. ANFIS is an adaptive network-based fuzzy inference system (He et al., 2006). FARM-DS is a fuzzy association rule mining algorithm (He et al., 2006). SVM-RFE (Guyon et al., 2002) is used as the filter to select the relevant genes for the above three rule extraction methods. It can be seen that MK-SVM, which uses a small number of rules with few conditions per rule, achieves the highest classification accuracy. That is to say, MK-SVM has good generalization performance and good interpretability simultaneously.
Table 3
Decision table 1 (the thresholds of the false-alarm and soundness measures are 0.1 and 0.9, respectively).

No.       Rule body       Class   False-alarm (testing set)   Soundness (testing set)
1         U46499 ≤ 146    ALL     0/20                        18/20
2         U46499 ≥ 208    AML     0/14                        13/14
Overall                           0/34                        31/34
Table 4
Decision table 2 (the thresholds of the false-alarm and soundness measures are 0.2 and 0.6, respectively).

No.       Rule body       Class   False-alarm (testing set)   Soundness (testing set)
1         U46499 ≤ 146    ALL     0/20                        18/20
2         M31166 ≥ 174    ALL     0/20                        5/20
3         U46499 ≥ 1049   AML     0/14                        11/14
4         M77142 ≥ 122    AML     2/14                        7/14
Overall                           0/34                        31/34
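For reference, the following sketch evaluates a rule against a labeled set under one plausible reading of the two measures, in which both are normalized by the size of the rule's target class; this normalization is consistent with the denominators printed in Tables 3 and 4, but it is our assumption, since the formal definitions are given in Chen et al. (2007) and are not reproduced here. The column index is illustrative.

```python
import numpy as np

def rule_measures(premise, rule_class, X, y):
    """Soundness and false-alarm of a rule on a labeled set, under the
    reading that both are normalized by the size of the rule's class."""
    in_class = (y == rule_class)
    covered = np.array([premise(x) for x in X])
    soundness = (covered & in_class).sum() / in_class.sum()
    false_alarm = (covered & ~in_class).sum() / in_class.sum()
    return false_alarm, soundness

# rule 1 of Table 3: U46499 <= 146 -> ALL (gene column index is illustrative)
GENE_COL = 0
fa, s = rule_measures(lambda x: x[GENE_COL] <= 146, "ALL",
                      np.array([[91.0], [1121.0]]), np.array(["ALL", "AML"]))
```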
Table 5
Performance of MK-SVM and some other interpretable models.

Feature selection       Rule extraction       No. of rules   No. of conditions per rule   Misclassification (testing set)
MK-SVM: Algorithm 1     MK-SVM: Algorithm 4   2–4            1.0                          0
SVM-RFE                 CART                  4              2.0                          7
SVM-RFE                 FARM-DS               5              4.8                          2
SVM-RFE                 ANFIS                 2              8.0                          1
–                       Rough set             1.33           1.5                          1–4

4.3. Associated rule (gene) extraction

Due to the complex interactions of genes, there are usually several groups of genes with similar functions. In Algorithm 4, when a big weight is put on the previously selected genes, those genes are kept out, and the newly selected genes may have the same explanatory capacity. When big weights are put on the genes U46499 and M31166, the identified associated genes are as shown in Table 6.

Table 6
Associated genes of U46499 and M31166.

Gene              Weight   σ²      Associated genes
U46499            1000     10000   D49950, J03930, M31166, M77142, U12471
U46499            1000     5000    D49950, J03930, M31166, M77142, U12471
U46499            1000     2000    M31166, M77142, U12471, U84487
U46499, M31166    1000     2000    D49950, U84487, M54995, M77142, U12471, U72621, U84487, U09087
U46499, M31166    1000     5000    D49950, J03930, L13278, M54995, M77142, M81933, U12471, U84487, U09087

The rules extracted from the first set of associated genes in Table 6 (σ² = 10000) are shown in Table 7.

Table 7
Rules extracted from the associated genes (σ² = 10000).

Rule no.     Rule body       Class   False-alarm (testing set)   Soundness (testing set)
1            D49950 ≤ 54     AML     0/14                        8/14
2            M31166 ≥ 174    ALL     0/20                        5/20
3            M77142 ≥ 122    AML     2/14                        7/14
4            M31166 ≤ 72     AML     3/14                        13/14
5            D49950 ≥ 168    ALL     4/20                        9/20
Overall                              2/34                        28/34
Confidence                                                       25/34

Table 8 shows the associated rules, the associated genes and their confidence measures. There are 25 samples in the testing set covered by the two sets of associated rules simultaneously.

Table 8
Associated rules and their confidence measures.

Associated rule no.   Rule body       Rule body       Confidence
1                     U46499 ≤ 146    M31166 ≥ 174    5/34
2                     U46499 ≤ 146    D49950 ≥ 168    9/34
3                     U46499 ≥ 208    D49950 ≤ 54     8/34
4                     U46499 ≥ 208    M77142 ≥ 122    7/34
5                     U46499 ≥ 208    M31166 ≤ 72     13/34
Overall                                               25/34

The relationships (for example, potential pathways during the development of the cancer) between the associated genes U46499, M31166, M77142 and D49950 should be further revealed by analyzing time-series expression data. This method also provides an opportunity to draw up an efficient treatment plan for an individual patient. The three samples listed in Table 9 are used to demonstrate the process of drawing up a treatment plan based on two decision tables (Tables 4 and 7); the diagnosis process for these three samples is shown in Table 10. For example, sample one satisfies the premises of "rule a" and "rule 5", both of which correspond to the class ALL leukemia, so the diagnosis result for this sample is ALL leukemia. A drug would thus be designed to adjust the expression levels of the two discovered targets, U46499 and D49950, back to normal ones.

Table 9
Three samples and their expression levels.

Gene       U46499   D49950   M31166   M77142
Sample 1   91       214      91       33
Sample 2   1121     333      78       37
Sample 3   1415     455      10       634

Table 10
Decision table for diagnosis.

Rule no.    Rule body       Class   Sample 1   Sample 2   Sample 3
a           U46499 ≤ 146    ALL     Yes        –          –
b           U46499 ≥ 208    AML     –          Yes        Yes
Diagnosis                           ALL        AML        AML
1           D49950 ≤ 54     AML     –          –          –
2           M31166 ≥ 174    ALL     –          –          –
3           M77142 ≥ 122    AML     –          –          Yes
4           M31166 ≤ 72     AML     –          –          Yes
5           D49950 ≥ 168    ALL     Yes        Yes        Yes
Diagnosis                           ALL        AML        AML
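The diagnosis step of Table 10 can be mimicked in a few lines of code. The tuple encoding of rules below is ours; the thresholds and classes come from Tables 3 and 7, and the sketch only lists which rules fire for each sample (the paper resolves conflicting rules by rule ordering, which is not reproduced here).

```python
# Rules encoded as (gene, operator, threshold, class); encoding is ours.
RULES_A = [("U46499", "<=", 146, "ALL"), ("U46499", ">=", 208, "AML")]
RULES_B = [("D49950", "<=", 54, "AML"), ("M31166", ">=", 174, "ALL"),
           ("M77142", ">=", 122, "AML"), ("M31166", "<=", 72, "AML"),
           ("D49950", ">=", 168, "ALL")]

def fires(rule, sample):
    gene, op, thr, _ = rule
    v = sample[gene]
    return v <= thr if op == "<=" else v >= thr

def diagnose(rule_set, sample):
    """Return the classes of all rules whose premises the sample satisfies."""
    return [r[3] for r in rule_set if fires(r, sample)]

# the three samples of Table 9
samples = [{"U46499": 91,   "D49950": 214, "M31166": 91, "M77142": 33},
           {"U46499": 1121, "D49950": 333, "M31166": 78, "M77142": 37},
           {"U46499": 1415, "D49950": 455, "M31166": 10, "M77142": 634}]

for i, s in enumerate(samples, 1):
    print(f"sample {i}: rules a/b -> {diagnose(RULES_A, s)}, "
          f"Table 7 rules -> {diagnose(RULES_B, s)}")
```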
4.4. Subclass discovery

We now use MK-SVM to discover the subclasses T-ALL and B-ALL in the ALL-AML leukemia dataset. If we set a low threshold on the soundness measure, more rules are included in the rule set (see Table 11).

Table 11
Decision rule extraction for subclass discovery (47 samples of ALL leukemia).

No.   Class   Rule body                                                                       False-alarm   Soundness
1     ALL     L20321 ≥ 45                                                                     2/47          21/47
2     ALL     U72621 ≤ 43                                                                     1/47          23/47
3     ALL     M29064 ≥ 26 and U72621 ≥ 83                                                     2/47          6/47
4     ALL     U20350 ≥ 29 and Y07759 ≥ 2                                                      0/47          5/47
5     ALL     U20350 ≥ 56 and U72621 ≥ 24 and L20321 ≤ 44 and M29064 ≥ 87 and M77140 ≤ 317   1/47          9/47
6     ALL     U20350 ≥ 56 and L20321 ≤ 44 and M77140 ≤ 152 and M29064 ≥ 87 and M20747 ≥ 43   1/47          6/47

Among those rules, the pairwise rules with low overlapping and a high coverage measure may reveal the potential subclasses. We then rank the pairwise rules and select the top one to re-label the ALL samples (see Table 12).

Table 12
Overlapping and coverage measures for the pairwise rules.

Pairwise rules   1–3   1–4   1–5   1–6   2–3   2–4   2–5   2–6
Overlap          2     2     0     0     0     1     2     4
Coverage         24    24    30    25    26    25    27    20
Rank             6     7     1     3     2     4     5     8

It is an important problem to evaluate and correct the discovered subclasses. The re-labeled samples are used to train a new MK-SVM and to perform rule extraction; the performance of this method is then tested on the ALL leukemia training set (38 samples). The results on the ALL leukemia testing set are omitted here, since there is only one T-ALL sample in the testing set. The experimental results are shown in Table 13.

Table 13
Performance of MK-SVM on T-ALL and B-ALL subclass discovery.

No.       Rule body      Class   False-alarm   Soundness
1         X02160 ≤ 36    T-ALL   0/38          5/38
2         X58072 ≤ 260   B-ALL   1/38          20/38
3         Y00264 ≤ 63    B-ALL   0/38          17/38
4         D13720 ≥ 288   T-ALL   3/38          8/38
Overall                          1/38          38/38

Golub et al. (1999) use SOM to cluster this dataset: AML samples are separated from ALL samples with four errors, and AML, T-ALL and B-ALL are also separated from one another (two errors). In Roth and Lange (2004) and Alexandridis, Lin, and Irwin (2004), a Gaussian mixture model optimized by the expectation-maximization (EM) algorithm is used for class discovery and class prediction. In Roth and Lange (2004), AML samples are separated from ALL samples with two errors, and T-ALL samples are separated from the other samples with one error. The method proposed in
Alexandridis et al. (2004) achieves high accuracy, with only one error. von Heydebreck, Huber, and Poustka (2001) use statistical measures to discover the ten highest-scoring partitions, among which three partitions are similar to the true AML, T-ALL and B-ALL labels. Compared with these methods, the performance of MK-SVM is promising (one error).

5. Conclusions

This paper proposes a multiple-kernel SVM based gene expression data mining system. Multiple tasks, including feature selection, data fusion, class prediction, decision rule extraction, associated rule extraction and subclass discovery, are incorporated in one system and carried out by the multiple-kernel SVM. Experiments conducted on the ALL-AML leukemia dataset indicate that feature selection and data fusion can improve the robustness of SVM on noisy and heterogeneous data. Compared with some other interpretable methods, the multiple-kernel SVM, using a compact rule set that contains a small number of rules with few conditions per rule (usually no more than three), achieves high classification accuracy.

References

Alexandridis, R., Lin, S., & Irwin, M. (2004). Class discovery and classification of tumor samples using mixture modeling of gene expression data - a unified approach. Bioinformatics, 20, 2545–2552.
Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the USA, 96, 6745–6750.
Ando, T., Suguro, M., Kobayashi, T., et al. (2003). Selection of causal gene sets for lymphoma prognostication from expression profiling and construction of prognostic fuzzy neural network models. Journal of Bioscience and Bioengineering, 96, 161–167.
Bach, F. R., Lanckrient, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality and the SMO algorithm. In G. Russell & S. Dale (Eds.), Proceedings of the twenty-first international conference on machine learning (pp. 41–48). New York: ACM.
Camp, N. J., & Slattery, M. L. (2002). Classification tree analysis: A statistical tool to investigate risk factor interactions with an example for colon cancer. Cancer Causes & Control, 13, 813–823.
Cao, L., Seng, C. K., Gu, Q., & Lee, H. P. (2003). Saliency analysis of support vector machines for gene selection in tissue classification. Neural Computing & Applications, 11, 244–249.
Chen, Z. (2008). Research on support vector ensemble kernel knowledge discovery model. PhD thesis, Graduate University of the Chinese Academy of Sciences.
Chen, Z., Li, J., & Wei, L. (2007). A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue. Artificial Intelligence in Medicine, 41(2), 161–175.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.
Duan, K., Rajapakse, J. C., Wang, H., et al. (2005). Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Transactions on Nanobioscience, 4, 228–234.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95, 14863–14868.
Fang, J., & Grzymala-Busse, J. W. (2006). Leukemia prediction from gene expression data - a rough set approach. In Proceedings of the 8th international conference on artificial intelligence and soft computing (pp. 899–908).
Fellenberg, M. (2003). Developing integrative bioinformatics systems. Biosilico, 1(5), 177–183.
Filkov, V., & Skiena, S. (2003). Integrating microarray data by consensus clustering. In Proceedings of the international conference on tools with artificial intelligence (pp. 418–426).
Fuhrman, S., Cunningham, M. J., Wen, X., Zweiger, G., Seilhamer, J. J., & Somogyi, R. (2000). The application of Shannon entropy in the identification of putative drug targets. Biosystems, 55, 5–14.
Goh, L., & Kasabov, N. (2003). Integrated gene expression analysis of multiple microarray data sets based on a normalization technique and on adaptive connectionist model. In Proceedings of the international joint conference on neural networks, 3, 1724–1728.
Golub, T. R., Slonim, D. K., Tamayo, P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Gunn, S. R., & Kandola, J. S. (2002). Structural modelling with sparse kernels. Machine Learning, 48, 137–163.
Guyon, I., Weston, J., Barnhill, S., et al. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
He, Y., Tang, Y., Zhang, Y., & Sunderraman, R. (2006). Mining fuzzy association rules from microarray gene expression data for leukemia classification. In Proceedings of 2006 IEEE international conference on granular computing (pp. 461–464).
Hwang, D., Rust, A. G., Ramsey, S., Smith, J. J., Leslie, D. M., Weston, A. D., et al. (2005). A data integration methodology for systems biology. Proceedings of the National Academy of Sciences of the USA, 102, 17296–17301.
Inza, I., Larranaga, P., Blanco, R., & Cerrolaza, A. J. (2004). Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, 31, 91–103.
Kim, H., Zhou, J. X., Morse, H. C., III, & Park, H. (2005). A three-stage framework for gene expression data analysis by L1-norm support vector regression. International Journal of Bioinformatics Research and Applications, 1, 51–62.
Lanckrient, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Lee, Y., Kim, Y., Lee, S., et al. (2006). Structured multicategory support vector machines with analysis of variance decomposition. Biometrika, 93(3), 555–571.
Li, J., Chen, Z., Wei, L., & Xu, W. (2007). Feature selection via least squares support feature machine. International Journal of Information Technology & Decision Making, 6(4), 1–16.
Li, L., Weinberg, C. R., Darden, T. A., & Pedersen, L. G. (2001). Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131–1142.
Mao, Y., Zhou, X., Pi, D., et al. (2005). Constructing support vector machine ensembles for cancer classification based on proteomic profiling. Genomics Proteomics & Bioinformatics, 3, 238–241.
Matthias, E. F., Anthony, R., & Nikola, K. (2003). Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue. Artificial Intelligence in Medicine, 28, 165–189.
Micchelli, C. A., & Pontil, M. (2005). Learning the kernel function via regularization. Journal of Machine Learning Research, 6, 1099–1125.
Monti, S., Tamayo, P., Mesirov, J., et al. (2003). Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52, 91–118.
Niijima, S., & Kuhara, S. (2006). Gene subset selection in kernel-induced feature space. Pattern Recognition Letters, 27, 1884–1892.
Parrado-Hernández, E., Mora-Jiménez, I., Arenas-García, J., Figueiras-Vidal, A. R., & Navia-Vázquez, A. (2003). Growing support vector classifiers with controlled complexity. Pattern Recognition, 36(7), 1479–1488.
Piatetsky-Shapiro, G., & Tamayo, P. (2003). Microarray data mining: Facing the challenges. ACM SIGKDD Explorations Newsletter, 5, 1–5.
Radivojac, P., Chawla, N. V., Dunker, A. K., et al. (2004). Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics, 37, 224–239.
Roth, V., & Lange, T. (2004). Bayesian class discovery in microarray datasets. IEEE Transactions on Biomedical Engineering, 51, 707–718.
Sethi, P., & Leangsuksun, C. (2006). A novel computational framework for fast distributed computing and knowledge integration for microarray gene expression data analysis. In Proceedings of the 20th international conference on advanced information networking and applications, 2 (pp. 613–617).
Sonnenburg, S., Ratsch, G., Schafer, C., & Scholkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., et al. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643.
Su, Y., Murali, T. M., Pavlovic, V., Schaffer, M., & Kasif, S. (2003). RankGene: Identification of diagnostic genes based on expression data. Bioinformatics, 19, 1578–1579.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences, 96, 2907–2912.
Tan, A. H., & Pan, H. (2005). Predictive neural networks for gene expression data analysis. Neural Networks, 18, 297–306.
Tang, Y. C., Zhang, Y. Q., & Huang, Z. (2007). Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(3), 365–381.
Tsang, I. W., & Kwok, J. T. (2006). Efficient hyperkernel learning using second-order cone programming. IEEE Transactions on Neural Networks, 17, 48–58.
Tung, W. L., & Quek, C. (2005). GenSo-FDSS: A neural-fuzzy decision support system for pediatric ALL cancer subtype identification using gene expression data. Artificial Intelligence in Medicine, 33, 61–88.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
von Heydebreck, A., Huber, W., Poustka, A., et al. (2001). Identifying splits with clear separation: A new class discovery method for gene expression data. Bioinformatics, 17(Suppl. 1), S107–S114.
Wei, L., Chen, Z., & Li, J. (2011). Evolution strategies based adaptive Lp LS-SVM. Information Sciences, doi:10.1016/j.ins.2011.02.029.
Yeoh, E., Ross, M. E., & Shurtleff, S. A. (2002). Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 133–143.
Zhu, Z. X., Ong, Y. S., & Dash, M. (2007). Wrapper-filter feature selection algorithm using a memetic framework. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37, 70–76.