Computers in Biology and Medicine 43 (2013) 933–941
Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data

Yan Cui a, Chun-Hou Zheng b, Jian Yang a,*, Wen Sha b

a School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
b College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui, China

* Corresponding author. Tel.: +86 25 8431 7297. E-mail address: [email protected] (J. Yang).
Article history: Received 20 December 2011; Accepted 26 April 2013
Keywords: Sparse maximum margin; Feature extraction; Gene selection; Gene expression data

Abstract

Dimensionality reduction is necessary for gene expression data classification. In this paper, we propose a new method for reducing the dimensionality of gene expression data. First, based on a sparse representation, we develop a new criterion for characterizing the margin, called sparse maximum margin discriminant analysis (SMMDA); this approach can be used to find an optimal transform matrix such that the sparse margin is maximal in the transformed space. Second, using SMMDA, we present a new feature extraction method for gene expression data. Third, based on SMMDA, we propose a new discriminant gene selection method. During gene selection, we first find the one-dimensional projection of the gene expression data in the most separable direction using SMMDA. Then, we apply the sparse representation technique to regress the projection, obtaining the relevance vector for the gene set. Discriminant genes are then selected according to this vector. Compared with the conventional method of maximum margin discriminant analysis, the proposed SMMDA method successfully avoids the difficulty of parameter selection. Extensive experiments using publicly available gene expression datasets show that SMMDA is efficient for feature extraction and gene selection.

© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

The rapid development of microarray technologies, which can simultaneously assess the expression level of thousands of genes, makes the precise, objective, and systematic analysis and diagnosis of human cancers possible. By monitoring the expression levels of thousands of genes in cells simultaneously, microarray experiments could lead to a complete observation of the molecular variations among tumors and hence result in a reliable classification. Gene expression data from DNA microarrays can be characterized by many variables (genes), but with only a few observations (experiments). Mathematically, the data can be expressed as a matrix X = (x_{ij})_{m×n}, where each row represents a gene and each column represents a sample or a patient for tumor diagnosis. The numerical value of x_{ij} denotes the expression level of a specific gene i (i = 1, 2, ..., m) in a specific sample j (j = 1, 2, ..., n). Statistically, the fact that there is a very large number of variables (genes) with only a small number of observations (samples) makes most of the classical data analysis methods infeasible, including classification. Fortunately, this problem can be avoided by selecting only the relevant features or extracting the essential features from the original data, where the former methodology belongs to feature
selection or subset selection and the latter is called feature extraction. For feature selection, i.e., selecting a subset of genes from the original data, many related studies on tumor classification have been reported [1–4,11,12,21–23]. Moreover, a comparative study of discrimination methods based on selected sets of genes can be found in the literature [5]. Feature extraction is another type of widely-used method for tumor classification. Instead of selecting key genes from expression data, feature extraction methods aim to extract the most representative features with low dimensions from the original data. The popular feature extraction methods involved in gene data analysis include principal component analysis (PCA), linear discriminant analysis (LDA), complete PCA plus LDA [13], and partial least squares (PLS) [6]. A systematic benchmarking of these methods is reported in the literature [7]. These methods have good performance on tumor classification; however, they do not work well for non-Gaussian data sets [8]. To overcome this problem, Fukunaga and Mantock [9] presented a method called nonparametric discriminant analysis (NDA). This method is a classic margin-based discriminant analysis. In recent years, many other nonparametric discriminant analysis methods have been developed, such as nonparametric feature analysis (NFA) [8] and the maximum margin criterion (MMC) [10]. The maximum margin criterion for robust feature extraction can avoid the small sample size (SSS) problem, i.e., the problem that the number of samples is very small compared with the dimension of the
samples. In a geometrical sense, MMC projects input patterns onto the subspaces spanned by the normals of a set of pairwise orthogonal margin-maximizing hyperplanes. It aims to maximize the average marginal distances between different classes, and the corresponding features can enhance the separability better than PCA and LDA (Fig. 1). Experimental results on face recognition [14–16] indicate that the method can be efficiently used for discriminant feature extraction. Similar to face recognition, microarray data typically contain thousands of genes on each chip, and the number of collected tumor samples is much smaller than that of the genes [17,18]. Therefore, gene expression data analysis also needs an effective discriminant analysis method for feature extraction, which aims to find the optimal projection such that the maximum margin criterion is maximized after the projection of the samples.

MMC is regarded as a nonparametric extension of LDA [19]. It measures the between-class scatter based on marginal information, using the K-nearest neighbor technique. The nonparametric discriminant analysis method, however, encounters the problem of how to choose the optimal K. In addition, the weighting function used to deemphasize the samples far from the classification margin is too complicated. To solve these problems, in this paper, we propose a new maximum margin characterization method by virtue of sparse representation, because it has a discriminative nature for classification [20]. The presented method, called sparse maximum margin discriminant analysis (SMMDA), successfully avoids the difficulty of parameter selection and does not need the weighting function. SMMDA, as a nonparametric discriminant analysis method for feature extraction, does not need to select the parameter K. Moreover, it obtains the number of margin samples flexibly by using a sparse representation technique.

In addition, SMMDA can be used for gene selection. Gene selection has the following biological explanation: most of the abnormalities in cell behavior are due to irregular gene activities, and thus, it is critical to highlight these specific genes [18]. SMMDA can find the one-dimensional projection of the gene expression data in the most separable direction; thus, we can use the sparse representation technique to regress the projection to obtain the relevance vector for the gene set and to select the genes according to the vector.

The remainder of this paper is organized as follows. Section 2 describes the proposed method: the maximum margin criterion is first presented, and the algorithms of sparse maximum margin discriminant analysis for feature extraction and gene selection are then given. Section 3 presents the numerical experiments. Section 4 concludes the paper and outlines directions for future work.
2. Methods

2.1. Maximum margin criterion

We are given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) ∈ ℝ^m × {C_1, ..., C_l}, where sample x_i is an m-dimensional vector and y_i is the corresponding class label of sample x_i (i = 1, 2, ..., n). The maximum margin criterion aims to maximize the distances between classes after the transformation, and the criterion is [10]

J = \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} p_i p_j \, d(C_i, C_j)    (1)
We can use the distance between the mean vectors as the distance between classes, i.e.,

d(C_i, C_j) = d(m_i, m_j)    (2)
where m_i and m_j are the mean vectors of class C_i and class C_j, respectively, and p_i and p_j are the a priori probabilities of class C_i and class C_j. Eq. (2) does not take the scatter of the classes into account; thus, it is not suitable for classification. Even if the distance between the mean vectors is large, it is not easy to separate two classes that have a large spread and overlap with each other. In statistics, we usually use the overall variance tr(S_i) to measure the scatter of the data, where S_i is the covariance matrix of class C_i. Then, we define the interclass distance as

d(C_i, C_j) = d(m_i, m_j) - \mathrm{tr}(S_i) - \mathrm{tr}(S_j)    (3)
With Eq. (3), we can decompose Eq. (1) into two parts:

J = \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} p_i p_j \, d(m_i, m_j) - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} p_i p_j (\mathrm{tr}(S_i) + \mathrm{tr}(S_j))    (4)
The second part equals tr(S_w). By employing the Euclidean distance, the first part can be simplified to tr(S_b), which measures the overall variance of the class mean vectors. Then, we obtain

J = \mathrm{tr}(S_b - S_w)    (5)
A large J indicates that the class mean vectors are scattered in a large space and that each class has a small spread; in other words, samples in the same class are close to each other, while samples in different classes are far apart. More details of the method can be found in [10]. Li et al. [10] proposed using the maximum margin criterion to find projection vectors. With a projection W, the criterion becomes

J = \mathrm{tr}(W^T (S_b - S_w) W)    (6)
The projection vectors W that maximize Eq. (6) can be found as the eigenvectors of S_b − S_w that correspond to the largest eigenvalues. The advantage of using the maximum margin criterion is that we need not compute the inverse of S_w; hence, the singularity problem can be avoided.
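To make the criterion concrete, the following is a minimal numerical sketch of Eqs. (5) and (6), assuming numpy is available and that the a priori probabilities p_i are taken as the empirical class frequencies; the function name `mmc_transform` and the samples-by-genes data layout are our own choices, and for genome-scale data the m × m scatter matrices are formed here only for illustration.

```python
import numpy as np

def mmc_transform(X, y, d):
    """Maximum margin criterion (Eqs. 5-6): project onto the d eigenvectors of
    (Sb - Sw) with the largest eigenvalues.  X is samples-by-genes, y holds labels."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    overall_mean = X.mean(axis=0)

    m = X.shape[1]
    Sb = np.zeros((m, m))            # scatter of the class mean vectors
    Sw = np.zeros((m, m))            # prior-weighted within-class covariances
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - overall_mean)[:, None]
        Sb += p * (diff @ diff.T)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)

    # Sb - Sw is symmetric, so a plain eigendecomposition suffices;
    # no inverse of Sw is needed, which avoids the singularity problem.
    evals, evecs = np.linalg.eigh(Sb - Sw)
    W = evecs[:, np.argsort(evals)[::-1][:d]]    # projection matrix of Eq. (6)
    return X @ W                                  # d-dimensional features
```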
2.2. Sparse maximum margin discriminant analysis for feature extraction
Fig. 1. An illustration of the behavior of PCA, LDA and MMC for a binary classification problem.
When performing dimensionality reduction, we want to find a mapping from the measurement space to some feature space such that J is maximized after the transformation. The maximum margin criterion (MMC), however, is a nonparametric discriminant analysis method: it measures the between-class scatter based on marginal information, using the K-nearest neighbor technique.
Given two pattern classes C_1 and C_2, the training samples are x_i^1 (i = 1, 2, ..., N_1) and x_j^2 (j = 1, 2, ..., N_2), respectively. MMC uses

S_b^{MMC} = \sum_{i=1}^{N_1} w_i (x_i^1 - m_i)(x_i^1 - m_i)^T + \sum_{j=1}^{N_2} w_j (x_j^2 - m_j)(x_j^2 - m_j)^T    (7)
where S_b^{MMC} is the two-class nonparametric between-class scatter matrix, and N_1 and N_2 are the numbers of samples in C_1 and C_2, respectively. Here, m_i = \sum_{q=1}^{K} x_{iq}^2 and m_j = \sum_{q=1}^{K} x_{jq}^1, where x_{iq}^2 is the q-th nearest neighbor from C_2 to x_i^1, and x_{jq}^1 is the q-th nearest neighbor from class C_1 to x_j^2. Here, w_i is a weighting function that is used to deemphasize the samples that are far from the classification margin.

The nonparametric discriminant analysis method, however, encounters the problem of how to choose the optimal K. In addition, the weighting function w_i used to deemphasize the samples far from the classification margin is very complicated. To solve these problems, we use a sparse representation to replace the K-nearest neighbor technique. Let us consider the one-sided case first and begin with the samples in C_1. For a given sample x_i^1 ∈ C_1, we denote A = [x_1^1, ..., x_{i-1}^1, x_{i+1}^1, ..., x_{N_1}^1] and B = [x_1^2, ..., x_{N_2}^2]. Then, the overcomplete dictionary for x_i^1 is D = [A, B]. We seek a reconstructive vector using the following equation:

\hat{\alpha} = \arg\min \|\alpha_i\|_0 \quad \text{subject to} \quad x_i^1 = D\alpha_i    (8)

This problem is NP-hard; however, if the solution is sufficiently sparse, then the solution to Eq. (8) is equivalent to that of the following L_1-minimization problem [24]:

\hat{\alpha} = \arg\min \|\alpha_i\|_1 \quad \text{subject to} \quad x_i^1 = D\alpha_i    (9)

This problem can be solved by standard convex programming methods [25]. We change the form of the representation as follows:

x_i^1 = D\hat{\alpha}_i = [A, B] \begin{pmatrix} \alpha_i^A \\ \alpha_i^B \end{pmatrix} = A\alpha_i^A + B\alpha_i^B    (10)

In this equation, x_i^1 is decomposed into two parts, i.e., the within-class part Aα_i^A and the between-class part Bα_i^B, following the parallelogram rule illustrated in Fig. 2.

In MMC, the margin samples are searched for by the K-nearest neighbor technique, and the parameter K is nonadaptive and must be the same for each sample. In our method, the margin samples are the support training samples of class C_2, i.e., the samples of class C_2 that correspond to nonzero components of α_i^B. The number of margin samples obtained by the sparse representation is self-adapting. However, the importance of the margin samples differs for classification. We can use the sparse representation to compute the weighting values of the margin samples, i.e., α_i^B.

The individual between-class difference is defined as

S_b^1 = x_i^1 - B\alpha_i^B = A\alpha_i^A    (11)

As illustrated by Fig. 2, the local sparse margin for sample x_i^1 can be measured by the L_2-norm of S_b^1. The between-class scatter matrix with respect to C_1 is defined as

S_{1,b} = \sum_{i=1}^{N_1} S_b^1 (S_b^1)^T    (12)

MMC uses a complicated weighting function to deemphasize the samples far from the classification margin, which exerts a negative influence on the classification. Our method does not need the weighting function, because the farther away x_i^1 is from the classification margin, the less information Bα_i^B contains. Therefore, the weighting function is unnecessary in our method. We measure the within-class scatter of x_i^1 by the within-class difference S_w^1:

S_w^1 = x_i^1 - A\alpha_i^A = B\alpha_i^B    (13)

Then, we can also give a nonparametric version of the within-class scatter matrix S_w constructed from S_w^1. Note that MMC suggests only a nonparametric version of the between-class scatter matrix S_b. The within-class scatter matrix with respect to class C_1 is defined as

S_{1,w} = \sum_{i=1}^{N_1} S_w^1 (S_w^1)^T    (14)

For the two-class case, the between-class and within-class scatter matrices are defined as

S_b = S_{1,b} + S_{2,b}    (15)

S_w = S_{1,w} + S_{2,w}    (16)

Extending the algorithm to multi-class cases is direct. Suppose that there are l pattern classes C_1, ..., C_l. We convert the multi-class cases into two-class cases in the following way: C_i is viewed as one class, and the remaining classes are viewed as the other class. The between-class scatter matrix S_{i,b} and within-class scatter matrix S_{i,w} with respect to C_i (i = 1, ..., l) can be computed by Eq. (12) and Eq. (14), respectively. Based on these matrices, we can construct the between-class and within-class scatter matrices as follows:

S_b = \sum_{i=1}^{l} S_{i,b}    (17)

S_w = \sum_{i=1}^{l} S_{i,w}    (18)

Considering a mapping W, we would like to maximize

J(W) = \mathrm{tr}(W^T (S_b - S_w) W)    (19)

The optimal projection W_opt is chosen as the matrix with orthogonal columns [w_1, ..., w_n] following the criterion

W_{\mathrm{opt}} = \arg\max_{W} \mathrm{tr}(W^T (S_b - S_w) W) = \arg\max_{W} \sum_{i=1}^{n} w_i^T (S_b - S_w) w_i    (20)

Here, {w_i | i = 1, ..., n} are the generalized eigenvectors of the equation (S_b − S_w) w = λw that correspond to the n largest generalized eigenvalues [λ_1, ..., λ_n].

Fig. 2. Illustration of SMMDA for two-class cases. Here, O is the base point. Aα_i^A and Bα_i^B form the sides of a parallelogram. S_b^1 is the local sparse margin of x_i^1, and S_w^1 measures the within-class scatter of x_i^1.

The algorithm can be summarized as follows:

Step 1: For sample x_j^i ∈ C_i, compute \hat{\alpha}_i based on Eq. (9);
Step 2: Compute the between-class scatter matrix S_b and the within-class scatter matrix S_w;
Step 3: Compute the d largest generalized eigenvalues and the corresponding eigenvector matrix W of (S_b − S_w); then, obtain the d-dimensional data Y = W^T X.

Because this algorithm is based on a sparse representation, we named it sparse maximum margin discriminant analysis (SMMDA).
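As an illustration of Steps 1–3 for the two-class case, the sketch below assumes a samples-by-genes layout and numpy/scikit-learn; it approximates the equality-constrained L_1-minimization of Eq. (9) with a Lasso penalty (the paper instead solves Eq. (9) with a standard convex programming method [25]), and all function names and the regularization value are our own assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso   # used here as an L1 proxy for Eq. (9)

def sparse_code(x, D, lam=0.01):
    """Approximate Eq. (9): a sparse coefficient vector alpha with x ~ D @ alpha."""
    return Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(D, x).coef_

def smmda_scatter(X1, X2):
    """Between- and within-class scatter matrices (Eqs. 11-16) for classes C1, C2.
    X1, X2 are samples-by-features matrices."""
    m = X1.shape[1]
    Sb, Sw = np.zeros((m, m)), np.zeros((m, m))
    for Xa, Xb in ((X1, X2), (X2, X1)):          # one-sided case, then the other side
        for i in range(Xa.shape[0]):
            x = Xa[i]
            A = np.delete(Xa, i, axis=0).T       # same-class atoms as columns
            B = Xb.T                             # other-class atoms as columns
            alpha = sparse_code(x, np.hstack([A, B]))
            aA, aB = alpha[:A.shape[1]], alpha[A.shape[1]:]
            sb = x - B @ aB                      # Eq. (11): between-class difference
            sw = x - A @ aA                      # Eq. (13): within-class difference
            Sb += np.outer(sb, sb)               # Eq. (12)
            Sw += np.outer(sw, sw)               # Eq. (14)
    return Sb, Sw

def smmda_transform(X1, X2, d):
    """Step 3: the d leading eigenvectors of (Sb - Sw) form the projection W."""
    Sb, Sw = smmda_scatter(X1, X2)
    evals, evecs = np.linalg.eigh(Sb - Sw)
    return evecs[:, np.argsort(evals)[::-1][:d]]  # columns w1, ..., wd
```

With this layout, the d-dimensional features of a samples-by-genes matrix X are obtained as X @ W, which corresponds to Y = W^T X in the genes-by-samples notation of Step 3.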
2.3. Sparse maximum margin discriminant analysis-based gene selection

Hang [26,27] proposed two gene selection methods based on sparse representations; these methods have good classification performance. However, if we use different values to label the samples, then the selected genes are inconsistent. For example, let y_i be the label of the i-th sample; choosing y_i ∈ {−1, 1} or y_i ∈ {0, 1} leads to different gene selection results. In our method, we first use SMMDA to find the projection s of the training sample matrix X onto the most separable direction w_1. Then, we use a sparse representation technique to regress the projection s to obtain a sparse relevance vector for the gene set. The nonzero elements of the relevance vector indicate the relevance of the corresponding genes to the projection s, and thus, we can select the corresponding genes according to the sparse vector. Compared with the methods in [26,27], the presented method avoids obtaining different gene selection results due to the use of different label values.

We denote the gene expression matrix as X = (x_{ij})_{m×n}, where each row represents a gene and each column represents a sample. Here, s is the projection of the gene expression data in the most separable direction for the two-class case, which is obtained by SMMDA in Step 3, i.e., s = w_1^T X, where w_1 is the first eigenvector of the equation (S_b − S_w) w = λw. We construct the model as follows:

s = \alpha^T X    (21)

In this equation, s can be seen as the signal, and X is the dictionary. However, the α obtained from Eq. (21) might not be sparse. We would like to find a vector α that is as sparse as possible, i.e., such that ‖α‖_0 is as small as possible. A sparse α indicates the relevance of the corresponding genes to s: the nonzero elements correspond to the genes that contribute to s, whereas the zero elements correspond to the genes that make no contribution to s. Based on this idea, we can select genes according to the nonzero elements of α. To this end, we construct the following model:

\min \|\alpha\|_0 \quad \text{subject to} \quad \alpha^T X = s    (22)

Similar to Eq. (8), we convert it to

\min \|\alpha\|_1 \quad \text{subject to} \quad \alpha^T X = s    (23)

\tilde{\alpha} = \mathrm{abs}(\alpha)    (24)

Here, \tilde{\alpha} is the relevance vector for the gene set; it is the absolute score of the sparse vector α. Additionally, \tilde{\alpha}_j, j ∈ {1, 2, ..., m}, indicates the relevance of the j-th gene to s. The larger the value of \tilde{\alpha}_j is, the higher the classification ability of the j-th gene. The genes are then sorted in descending order according to the values of \tilde{\alpha}_j. Finally, we select the first M genes for the sample classification.

It is also easy to extend the gene selection algorithm to multi-class cases. Suppose that there are l pattern classes C_1, ..., C_l. We convert the multi-class cases into two-class cases in the following way: C_i is viewed as one class, and the remaining classes are viewed as the other class; s_i is the projection of the gene expression data in the most separable direction for the i-th two-class case. Then, we have

\min \|\alpha_i\|_1 \quad \text{subject to} \quad \alpha_i^T X = s_i    (25)

\tilde{\alpha} = \sum_{i=1}^{l} \mathrm{abs}(\alpha_i)    (26)

The SMMDA-based gene selection algorithm can be summarized as follows:

Step 1: For the i-th two-class case (i = 1, 2, ..., l), compute the projection s_i by using SMMDA;
Step 2: For each s_i (i = 1, 2, ..., l), calculate α_i by solving Eq. (25);
Step 3: Obtain the absolute score \tilde{\alpha} using Eq. (26);
Step 4: Sort the m genes in descending order according to \tilde{\alpha}_j; then, select the first M genes.
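To make Steps 1–4 concrete for one two-class case, the sketch below writes the equality-constrained problem of Eq. (23) as the standard basis-pursuit linear program (α = u − v with u, v ≥ 0) and solves it with scipy; this is one way to realize the convex programming step, and the function names and the use of `linprog` are our own assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def smmda_gene_scores(X, w1):
    """Relevance scores of Eqs. (21)-(24).
    X: genes-by-samples matrix (m x n); w1: most separable direction from SMMDA."""
    m, n = X.shape
    s = w1 @ X                              # s = w1^T X, the one-dimensional projection
    # Basis pursuit: min ||alpha||_1  s.t.  X^T alpha = s, with alpha = u - v, u, v >= 0.
    A_eq = np.hstack([X.T, -X.T])           # n equality constraints, 2m non-negative unknowns
    res = linprog(c=np.ones(2 * m), A_eq=A_eq, b_eq=s,
                  bounds=(0, None), method="highs")
    alpha = res.x[:m] - res.x[m:]
    return np.abs(alpha)                    # Eq. (24): absolute relevance scores

def select_genes(X, w1, M):
    """Sort genes by score in descending order and keep the first M."""
    return np.argsort(smmda_gene_scores(X, w1))[::-1][:M]
```

For the multi-class case of Eqs. (25) and (26), the same scoring would be run once per one-versus-rest projection s_i and the absolute scores summed before ranking.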
3. Results and discussion

After achieving a lower-dimensional rendering of the data by feature extraction or gene selection, we adopt a suitable classifier to classify the data, such as K-NN, neural networks, support vector machines (SVM), or classification trees. In this paper, we apply the K-NN classifier for its simplicity; in addition, it could be the best classifier for tumor classification when low-dimensional features are used [5]. In our experiments, k was set to 5 for the K-NN classifier; in fact, when k varied from 3 to 7, the experimental results were almost unchanged. Moreover, another classifier, the sparse representation classifier [20], was also adopted for comparison in the experiments. The performance of SMMDA is evaluated in comparison with other representative feature extraction methods, such as PCA, LDA, and PLS, and with other gene selection methods, such as the ratio of between-group to within-group sums of squares of the genes (BW) [5] and one-way analysis of variance (ANOVA) [28].

3.1. Two-class classification experiments

To evaluate the performance of the proposed method, we applied it to five publicly available datasets, i.e., the colon, acute leukemia, glioma, prostate and breast datasets. The datasets used in these experiments are summarized in Table 1.

Table 1. Description of the datasets.

No.  Dataset          Samples  Genes   Classes
1    Colon            62       2000    2
2    Acute leukemia   72       7129    2
3    Breast           97       24188   2
4    Glioma           50       12625   2
5    Prostate         136      12600   2

Colon dataset [29]: In this dataset, the training set is composed of 40 colon tissues, of which 14 are normal and 26 are tumor samples. The test set is composed of 22 tissues, of which 8 are normal and 14 are tumor samples. The number of gene expression levels is 2000.

Acute leukemia dataset [30]: In this dataset, the training set is composed of 38 leukemia patients, of which 11 suffer from acute myeloid leukemia (AML) and 27 suffer from acute lymphoblastic leukemia (ALL). The test set is composed of 34 patients, of which 14 suffer from AML and 20 suffer from ALL. The number of gene expression levels is 7129.

Glioma dataset [31]: In this dataset, the training set is composed of 21 gliomas with classic histology, of which 14 are glioblastomas and 7 are anaplastic oligodendrogliomas. The test set is composed of 29 gliomas with non-classic histology, of which
14 are glioblastomas and 15 are anaplastic oligodendrogliomas. The number of gene expression levels is 12625.

Prostate dataset [32]: In this dataset, the training set is composed of 102 prostate tissues, of which 50 are normal and 52 are tumor samples. The test set is composed of 34 tissues, of which 9 are normal and 25 are tumor samples. The number of gene expression levels is 12600.

Breast dataset [33]: In this dataset, the training set is composed of 78 breast cancer patients, of which 34 develop metastases within 5 years and 44 remain disease-free within 5 years. The test set is composed of 19 patients, of which 12 develop metastases within 5 years and 7 remain disease-free within 5 years. The number of gene expression levels is 24188.

To obtain reliable experimental results with comparability and repeatability across the numerical experiments, we used the original division of each dataset into separate training and test sets and, in addition, we reshuffled all of the datasets randomly and used the shuffled datasets in the experiments. In total, the numerical experiments were performed with 20 random splittings of each original dataset. The randomly selected training and test sets contain the same number of samples of each class as the original training and test sets. We constructed the classification models using the training samples and estimated the correct classification rate using the test set.

The classification results using PLS, PCA, LDA, SVM and the proposed SMMDA are listed in Table 2. For each classification problem, the experimental results are reported as the mean value and the corresponding standard deviation of the accuracy on the original dataset and the 20 randomly partitioned sets. The corresponding dimensions are also listed in Table 2. Methods 2–5 listed in Table 2 were used for comparison. Considering the 'high dimensions and small sample size'
characteristics of gene expression data, SVM could be the best classifier for classifying the original data [34]. In the experiment, we trained an SVM with an RBF kernel on the training data and then used it to classify the test data. In method 3, we first used PCA to reduce the dimensionality of the gene expression data and then used K-NN for classification. Methods 2 and 4 were similar to method 3, except that PCA was replaced by PLS [6] and LDA, respectively. The values in the Accuracy column of Table 2 are the best classification results of the five methods, i.e., the highest accuracy obtained by varying the dimension of the used features, with the statistical means and standard deviations of the accuracy on the original test dataset and the 20 random splits. It should be noted that, because of the very high dimensions and the relatively large number of samples, our computer (CPU 2.80 GHz, memory 2 GB) could not process the prostate cancer dataset and the breast cancer dataset using PLS.

From Table 2, it can be seen that, for the five datasets, the mean classification accuracies of SMMDA are higher than those of the remaining methods. Therefore, our method, SMMDA, achieves better classification results than the other methods. SVM classifies the colon and leukemia datasets very well, yet it is the worst method for classifying the glioma dataset. PLS achieves good accuracy, yet it is memory-demanding. Concerning the area under the curve (AUC) statistics and the classification sensitivity and specificity on the five datasets, the proposed SMMDA method is much better than the others.

We also used the sparse representation classifier (SRC) to classify the five datasets. The best accuracy results and the corresponding dimensions are listed in Table 3. The AUC score and the classification sensitivity and specificity can also be found in Table 3. From the experimental results, it can be seen that the proposed method is also effective.
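For readers who want to reproduce the protocol, the following is a rough sketch of the repeated random splitting and K-NN (k = 5) evaluation described above, assuming scikit-learn; `extract_features` and `train_sizes` are hypothetical placeholders we introduce to stand in for any of the compared dimensionality reduction methods and for the per-class training-set sizes of the original split.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def random_split_accuracy(X, y, train_sizes, extract_features, n_runs=20, k=5, seed=0):
    """Mean and standard deviation of K-NN accuracy over repeated random splits
    that keep the same number of training samples per class as the original split.
    extract_features(X_train, y_train, X_test) -> (F_train, F_test)."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        train_idx, test_idx = [], []
        for c, size in train_sizes.items():          # e.g. {normal: 14, tumor: 26} for the colon data
            idx = rng.permutation(np.where(y == c)[0])
            train_idx.extend(idx[:size])
            test_idx.extend(idx[size:])
        F_train, F_test = extract_features(X[train_idx], y[train_idx], X[test_idx])
        knn = KNeighborsClassifier(n_neighbors=k).fit(F_train, y[train_idx])
        accs.append(knn.score(F_test, y[test_idx]))
    return np.mean(accs), np.std(accs)
```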
Table 2. Summary of the results on five datasets using the K-NN classifier.

Colon cancer data
No.  Method  Accuracy       Dimension  AUC            Sensitivity    Specificity
1    SMMDA   92.19 ± 3.83   8          93.33 ± 4.12   94.83 ± 3.86   89.55 ± 3.81
2    PLS     89.06 ± 4.03   3          89.29 ± 5.28   90.00 ± 4.89   88.12 ± 4.71
3    PCA     80.63 ± 6.01   10         80.71 ± 6.41   81.75 ± 6.35   79.51 ± 6.57
4    LDA     82.81 ± 4.03   1          83.57 ± 5.23   84.13 ± 4.33   81.49 ± 4.73
5    SVM     84.38 ± 4.94   –          86.43 ± 4.62   87.75 ± 4.81   81.01 ± 5.07

Acute leukemia data
1    SMMDA   96.88 ± 2.55   8          100.00 ± 0.00  93.76 ± 2.01   99.67 ± 2.13
2    PLS     94.53 ± 2.99   3          94.65 ± 5.69   93.06 ± 4.28   96.01 ± 3.74
3    PCA     81.25 ± 4.94   10         85.65 ± 6.12   82.36 ± 4.95   80.14 ± 4.83
4    LDA     92.71 ± 4.44   1          95.58 ± 2.71   93.51 ± 3.07   91.91 ± 3.18
5    SVM     92.19 ± 4.04   –          97.94 ± 2.19   93.17 ± 3.28   91.21 ± 3.44

Breast cancer data
1    SMMDA   73.53 ± 11.01  20         75.17 ± 9.38   70.59 ± 9.94   76.47 ± 10.08
2    PLS     –              –          –              –              –
3    PCA     67.89 ± 12.21  50         69.71 ± 11.28  64.71 ± 11.19  71.07 ± 11.23
4    LDA     67.65 ± 11.00  1          69.47 ± 9.66   68.63 ± 10.22  66.67 ± 10.78
5    SVM     67.65 ± 8.32   –          70.43 ± 11.49  66.67 ± 9.06   68.63 ± 9.40

Glioma data
1    SMMDA   76.67 ± 9.80   9          77.50 ± 8.77   73.55 ± 7.09   79.79 ± 7.51
2    PLS     70.00 ± 8.01   5          71.15 ± 7.94   70.59 ± 8.48   69.41 ± 8.54
3    PCA     72.86 ± 6.59   8          75.85 ± 8.64   68.18 ± 7.24   77.54 ± 7.39
4    LDA     71.98 ± 4.26   1          73.46 ± 6.52   72.55 ± 5.57   71.41 ± 5.54
5    SVM     66.67 ± 4.02   –          67.18 ± 5.32   65.22 ± 5.95   68.12 ± 6.09

Prostate cancer data
1    SMMDA   94.44 ± 3.49   10         95.19 ± 4.95   93.51 ± 3.02   95.37 ± 2.96
2    PLS     –              –          –              –              –
3    PCA     77.08 ± 4.74   3          78.75 ± 6.42   72.73 ± 2.93   81.43 ± 2.95
4    LDA     88.89 ± 4.39   1          92.16 ± 3.80   87.50 ± 3.86   90.28 ± 3.92
5    SVM     87.96 ± 4.75   –          92.02 ± 3.05   85.48 ± 4.33   90.44 ± 4.17
Table 3. Summary of the results on five datasets using the SRC classifier.

Colon cancer data
No.  Method  Accuracy       Dimension  AUC            Sensitivity    Specificity
1    SMMDA   90.63 ± 3.13   10         91.91 ± 4.83   92.33 ± 3.88   88.93 ± 3.68
2    PLS     87.50 ± 4.94   5          88.23 ± 5.03   89.05 ± 4.78   85.95 ± 5.10
3    PCA     81.25 ± 4.94   15         81.57 ± 6.49   81.88 ± 6.43   80.62 ± 6.45
4    LDA     81.25 ± 6.75   1          83.01 ± 5.25   83.76 ± 4.46   78.74 ± 4.94
5    SVM     84.38 ± 4.94   –          86.43 ± 4.62   87.75 ± 4.81   81.01 ± 5.07

Acute leukemia data
1    SMMDA   96.88 ± 3.13   10         100.00 ± 0.00  96.63 ± 3.26   98.78 ± 3.19
2    PLS     95.06 ± 2.89   5          96.03 ± 4.03   95.65 ± 5.06   94.47 ± 4.92
3    PCA     79.69 ± 4.04   15         82.25 ± 6.44   80.02 ± 4.89   79.36 ± 4.91
4    LDA     91.97 ± 4.56   1          93.69 ± 3.13   91.46 ± 3.12   92.48 ± 3.16
5    SVM     92.19 ± 4.04   –          97.94 ± 2.19   93.17 ± 3.28   91.21 ± 3.44

Breast cancer data
1    SMMDA   72.55 ± 8.98   20         74.88 ± 8.72   70.21 ± 8.89   74.89 ± 9.07
2    PLS     –              –          –              –              –
3    PCA     67.41 ± 12.00  45         69.50 ± 10.76  64.36 ± 10.57  70.46 ± 10.43
4    LDA     66.18 ± 10.05  1          69.02 ± 9.06   67.71 ± 9.89   64.65 ± 9.92
5    SVM     67.65 ± 8.32   –          70.43 ± 11.49  66.67 ± 9.06   68.63 ± 9.40

Glioma data
1    SMMDA   75.00 ± 9.96   8          76.33 ± 9.07   74.25 ± 8.75   75.75 ± 8.71
2    PLS     70.32 ± 7.92   5          71.07 ± 8.14   70.02 ± 8.64   70.62 ± 8.72
3    PCA     72.50 ± 6.85   6          75.39 ± 9.86   68.23 ± 7.48   76.77 ± 7.62
4    LDA     71.55 ± 4.55   1          73.18 ± 6.85   72.46 ± 6.06   70.64 ± 6.04
5    SVM     66.67 ± 4.02   –          67.18 ± 5.32   65.22 ± 5.95   68.12 ± 6.09

Prostate cancer data
1    SMMDA   94.18 ± 3.52   12         94.88 ± 5.05   92.75 ± 4.55   95.61 ± 4.49
2    PLS     –              –          –              –              –
3    PCA     76.67 ± 4.21   3          78.45 ± 6.54   72.33 ± 5.74   81.01 ± 5.68
4    LDA     87.44 ± 4.63   1          91.73 ± 4.03   87.15 ± 3.83   87.73 ± 3.64
5    SVM     87.96 ± 4.75   –          92.02 ± 3.05   85.48 ± 4.33   90.44 ± 4.17
Table 4. Summary of the results on five datasets with gene selection using the K-NN classifier.

Colon cancer data
No.  Method  Accuracy       AUC            Sensitivity    Specificity
1    SMMDA   91.25 ± 2.61   90.32 ± 3.18   93.51 ± 3.14   92.18 ± 3.08
2    BW      88.28 ± 2.99   87.68 ± 5.26   90.91 ± 4.60   85.65 ± 4.78
3    ANOVA   86.98 ± 3.65   87.10 ± 4.67   88.63 ± 4.05   85.33 ± 4.15

Acute leukemia data
1    SMMDA   96.88 ± 3.13   100.00 ± 0.00  95.64 ± 3.02   98.69 ± 2.89
2    BW      92.19 ± 5.85   95.25 ± 4.06   93.91 ± 3.58   90.47 ± 3.64
3    ANOVA   91.73 ± 4.64   93.43 ± 4.39   92.85 ± 4.09   90.61 ± 4.15

Breast cancer data
1    SMMDA   70.61 ± 10.63  71.67 ± 8.83   74.51 ± 9.48   66.71 ± 9.57
2    BW      69.12 ± 10.05  69.41 ± 9.66   69.57 ± 9.86   68.67 ± 9.94
3    ANOVA   70.32 ± 11.27  69.97 ± 9.34   72.55 ± 9.73   68.09 ± 9.81

Glioma data
1    SMMDA   75.71 ± 8.67   79.00 ± 8.52   72.73 ± 8.89   78.69 ± 8.75
2    BW      73.86 ± 7.92   73.54 ± 7.13   70.18 ± 7.85   77.54 ± 7.99
3    ANOVA   74.02 ± 7.85   77.05 ± 8.15   70.59 ± 8.23   77.45 ± 8.34

Prostate cancer data
1    SMMDA   93.21 ± 3.28   93.92 ± 2.05   94.12 ± 2.99   92.30 ± 3.17
2    BW      90.65 ± 3.62   91.28 ± 3.24   91.18 ± 3.52   90.12 ± 3.67
3    ANOVA   90.92 ± 4.08   92.13 ± 5.59   91.91 ± 4.75   89.93 ± 4.54
For gene selection, the highest classification results using BW, ANOVA and SMMDA are listed in Table 4. The numbers of selected genes for the five datasets were 40, 90, 100, 50 and 100, respectively. For each classification problem, the experimental results are again reported as the mean value and the corresponding standard deviation of the accuracy on the original dataset and the 20 randomly partitioned datasets. The AUC score and the classification sensitivity and specificity on the five datasets are also included in Table 4. Method 2 in Table 4 performed a preliminary selection of genes on the basis of the ratio of their between-groups to within-groups sums of squares and then used the K-NN classifier for classification. Method 3 used non-parametric one-way analysis of variance (ANOVA) for gene selection. The experimental results of the three methods listed in Table 4 are their best classification results, and Fig. 3 shows the accuracies of these methods on the colon data. In Fig. 3, the X-axis represents M, i.e., the number of selected genes used for classification, and the Y-axis represents the mean accuracy over the 21 experiments; 5 genes were added at each step. It is noted that the accuracy decreases when the number of genes exceeds 55. From Table 4 and Fig. 3, we can see that SMMDA outperforms the other methods. Although both BW and ANOVA are also efficient, they are not as stable as SMMDA.

We also adopted SRC as a classifier in the gene selection experiments. The corresponding experimental results are listed in Table 5. In this experiment, we achieve results similar to those obtained using K-NN.

3.2. Multi-class classification experiments

In the multi-class classification experiments, we used another two datasets to further evaluate the performance of the proposed method. One is the Leukemia dataset [35], which has become a benchmark in cancer classification. In this dataset, the distinction between acute myelogenous leukemia (AML), acute lymphoblastic leukemia (ALL), and mixed-lineage leukemia (MLL) is known. The dataset contains 11225 genes in 72 samples. We randomly selected 15 cases of AML, 15 cases of ALL and 15 cases of MLL as the training set and used the remaining samples as test data.
Fig. 3. The mean classification accuracy on the test set of the Colon dataset with gene selection (accuracy versus the number of selected genes, for SMMDA, BW and ANOVA).

Table 6. Summary of the results on the multi-class datasets using the K-NN classifier.

Leukemia data
No.  Method  Accuracy       Dimension  AUC
1    SMMDA   97.32 ± 2.47   10         100.00 ± 0.00
2    PLS     91.67 ± 5.97   3          95.73 ± 6.28
3    PCA     87.14 ± 5.15   5          90.85 ± 5.69
4    LDA     91.03 ± 6.42   2          95.36 ± 6.86

Brain tumor data
1    SMMDA   85.38 ± 3.86   5          87.45 ± 3.32
2    PLS     77.00 ± 3.87   3          79.62 ± 2.93
3    PCA     73.83 ± 4.51   3          74.98 ± 4.74
4    LDA     78.00 ± 4.22   3          80.26 ± 4.19
Table 5. Summary of the results on five datasets with gene selection using the SRC classifier.

Colon cancer data
No.  Method  Accuracy       AUC            Sensitivity    Specificity
1    SMMDA   90.00 ± 2.62   89.97 ± 3.24   91.88 ± 3.36   88.12 ± 3.18
2    BW      88.03 ± 2.74   87.46 ± 4.69   90.32 ± 4.89   85.74 ± 4.69
3    ANOVA   86.88 ± 4.08   87.04 ± 4.56   88.21 ± 4.05   85.55 ± 4.11

Acute leukemia data
1    SMMDA   95.31 ± 4.03   100.00 ± 0.00  92.62 ± 3.59   98.00 ± 3.47
2    BW      90.63 ± 4.94   92.94 ± 4.58   91.14 ± 3.73   90.12 ± 3.65
3    ANOVA   91.47 ± 3.86   93.21 ± 4.64   92.53 ± 4.16   90.41 ± 4.05

Breast cancer data
1    SMMDA   70.21 ± 10.05  71.45 ± 9.63   74.34 ± 10.35  66.08 ± 10.15
2    BW      68.42 ± 10.53  68.75 ± 9.86   69.27 ± 10.69  67.57 ± 10.73
3    ANOVA   70.45 ± 10.67  69.63 ± 10.07  71.85 ± 11.16  69.05 ± 11.18

Glioma data
1    SMMDA   75.00 ± 7.91   78.88 ± 7.92   72.55 ± 7.99   77.45 ± 7.83
2    BW      73.63 ± 7.62   72.97 ± 8.58   70.07 ± 8.92   77.19 ± 9.03
3    ANOVA   73.50 ± 7.94   76.69 ± 8.67   70.73 ± 9.27   76.27 ± 9.21

Prostate cancer data
1    SMMDA   92.24 ± 3.00   93.25 ± 2.62   93.34 ± 2.96   91.14 ± 3.04
2    BW      90.26 ± 4.13   91.16 ± 4.08   90.72 ± 3.85   89.80 ± 3.91
3    ANOVA   90.65 ± 3.87   91.57 ± 3.28   90.85 ± 3.61   90.45 ± 3.53

Table 7. Summary of the results on the multi-class datasets with gene selection using the K-NN classifier.

Leukemia data
No.  Method  Accuracy       AUC
1    SMMDA   96.55 ± 3.15   100.00 ± 0.00
2    BW      93.90 ± 3.55   97.55 ± 4.05
3    ANOVA   92.09 ± 4.03   96.91 ± 4.81

Brain tumor data
1    SMMDA   86.69 ± 3.36   88.92 ± 3.59
2    BW      83.47 ± 3.55   85.28 ± 3.73
3    ANOVA   82.83 ± 3.58   85.09 ± 3.96
Another dataset is the Brain tumor dataset [31]. There are 4 types of malignant glioma in this dataset: classic glioblastomas, classic anaplastic oligodendrogliomas, non-classic glioblastomas, and non-classic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10367. From this dataset, we randomly selected 5 cases of classic glioblastomas, 5 cases of classic anaplastic oligodendrogliomas, 5 cases of non-classic glioblastomas, and 5 cases of non-classic anaplastic oligodendrogliomas as the training set, and we used the remaining samples as test data. The experiment was run 21 times. The classification results are listed in Tables 6 and 7 and Fig. 4. Table 7 and Fig. 4 present the gene selection experimental results. The number of selected genes was 500 for both datasets; it should be noted that the accuracy decreases when the number of genes exceeds 800. In Fig. 4, 50 genes were added at each step. Because SVM is essentially a two-class classifier [36], we did not use it to classify these two datasets in these experiments. Moreover, we applied the SRC classifier to classify the two datasets. Tables 8 and 9 list the experimental results using the SRC classifier. From these experiments, we can see that, compared with the two-class classification, the advantage of our method is even more obvious in the multi-class classification task.
Fig. 4. The mean classification accuracy on the test set of the Leukemia dataset with gene selection.
Table 8. Summary of the results on the multi-class datasets using the SRC classifier.

Leukemia data
No.  Method  Accuracy       Dimension  AUC
1    SMMDA   96.67 ± 3.05   10         100.00 ± 0.00
2    PLS     91.00 ± 5.86   5          95.10 ± 6.35
3    PCA     86.90 ± 5.00   5          89.97 ± 5.42
4    LDA     90.00 ± 6.45   2          94.12 ± 6.83

Brain tumor data
1    SMMDA   84.89 ± 3.84   5          87.05 ± 4.12
2    PLS     75.50 ± 4.22   3          76.89 ± 4.39
3    PCA     73.33 ± 4.77   3          74.18 ± 4.81
4    LDA     77.83 ± 4.37   3          79.86 ± 4.65
Table 9. Summary of the results on the multi-class datasets with gene selection using the SRC classifier.

Leukemia data
No.  Method  Accuracy       AUC
1    SMMDA   95.89 ± 3.43   100.00 ± 0.00
2    BW      92.83 ± 4.02   97.32 ± 4.53
3    ANOVA   92.00 ± 4.33   96.63 ± 4.86

Brain tumor data
1    SMMDA   84.36 ± 4.18   86.75 ± 5.06
2    BW      82.95 ± 3.69   85.13 ± 4.86
3    ANOVA   80.79 ± 3.61   83.59 ± 4.54

From the above experiments, it can be seen that, for feature extraction, the proposed method achieves much higher and more stable accuracy than PCA, PLS and LDA, and for gene selection, our method performs better than BW and ANOVA. It should be noted that the standard deviations of the results from the proposed method are not smaller than those of the other methods on the high-dimensional datasets. Of course, these results are only preliminary conclusions from our experiments; in the future, more experiments will be performed to verify them.
4. Conclusions

This article presented a new maximum margin discriminant analysis method based on sparse representation (SMMDA), which can be used for both feature extraction and gene selection. The method characterizes the margin by sparse representation; based on this characterization, a class margin criterion is designed for determining an optimal transform matrix such that the sparse margin is maximal in the transformed space. For gene selection, SMMDA finds the one-dimensional projection of the gene expression data in the most separable direction, then uses a sparse representation technique to regress the projection to obtain the relevance vector for the gene set, and selects genes according to this vector.

The proposed method was applied to gene expression data analysis. Compared with other widely-used feature extraction and gene selection algorithms, the proposed SMMDA technique utilizes the margin information and maximizes the class margin. The experimental results demonstrate that it is very effective in extracting discriminant features from gene expression data and in selecting genes with discriminative information.

This paper is a meaningful attempt to connect sparse maximum margin discriminant analysis with the bioinformatics field. Considering the rapid development of both subjects, there should be much room for the effective application of sparse maximum margin discriminant analysis to bioinformatics. In the future, we will extend the method and attempt to achieve better results.
Conflict of interest statement

All authors declare that they have no competing interests.
Acknowledgments

This work was partially supported by the National Science Fund for Distinguished Young Scholars under Grant nos. 61125305 and 61233011, the Key Project of Chinese Ministry of Education, the National Natural Science Foundation of China under Grant no. 61272339, and the Key Project of Anhui Educational Committee under Grant no. KJ2012A005.
References

[1] K. Bae, B.K. Mallick, Gene selection using a two-level hierarchical Bayesian model, Bioinformatics 20 (18) (2004) 3423–3430.
[2] K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci, B.K. Mallick, Gene selection: a Bayesian variable selection approach, Bioinformatics 19 (1) (2003) 90–97.
[3] J.G. Liao, K.V. Chin, Logistic regression for disease classification using microarray data: model selection in large p and small n case, Bioinformatics 23 (15) (2007) 1945–1951.
[4] H.H. Zhang, J. Ahn, X. Lin, C. Park, Gene selection using support vector machines with non-convex penalty, Bioinformatics 22 (1) (2006) 88–95.
[5] S. Dudoit, J. Fridlyand, T.P. Speed, Comparison of discrimination methods for the classification of tumor using gene expression data, J. Am. Stat. Assoc. 97 (457) (2002) 77–87.
[6] D.V. Nguyen, D.M. Roche, Tumor classification by partial least squares using microarray gene expression data, Bioinformatics 18 (1) (2002) 39–50.
[7] N. Pochet, F. De Smet, J.A.K. Suykens, B.L.R. De Moor, Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction, Bioinformatics 20 (17) (2004) 3185–3195.
[8] Z.F. Li, D.H. Lin, X. Tang, Nonparametric discriminant analysis for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 31 (4) (2009) 755–761.
[9] K. Fukunaga, J. Mantock, Nonparametric discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 5 (6) (1983) 671–678.
[10] H.F. Li, T. Jiang, K.S. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Trans. Neural Networks 17 (1) (2006) 157–165.
[11] S. Zhu, D. Wang, K. Yu, T. Li, Feature selection for gene expression using model-based entropy, IEEE/ACM Trans. Comput. Biol. Bioinformatics 7 (1) (2008) 25–36.
[12] T.K. Paul, H. Iba, Prediction of cancer class with majority voting genetic programming classifier using gene expression data, IEEE/ACM Trans. Comput. Biol. Bioinformatics 6 (2) (2009) 353–367.
[13] J. Yang, J.Y. Yang, Why can LDA be performed in PCA transformed space? Pattern Recognit. 36 (2) (2003) 563–566.
[14] W.H. Yang, D.Q. Dai, Two-dimensional maximum margin feature extraction for face recognition, IEEE Trans. Syst. Man Cybern. Part B: Cybern.—Spec. Issue Cybern. Cognitive Inf. 39 (4) (2009) 1002–1012.
[15] Y.Z. Li, J.Y. Yang, Y.J. Zheng, New and efficient feature extraction methods based on maximum margin criterion, J. Syst. Simulation 19 (2007) 1061–1066.
[16] W.H. Yang, D.Q. Dai, H. Yan, Feature extraction and uncorrelated discriminant analysis for high-dimensional data, IEEE Trans. Knowl. Data Eng. 20 (5) (2008) 601–614.
[17] D.S. Huang, C.H. Zheng, Independent component analysis-based penalized discriminant method for tumor classification using gene expression data, Bioinformatics 22 (15) (2006) 1855–1862.
[18] C.H. Zheng, D.S. Huang, L. Zhang, X.Z. Kong, Tumor clustering using nonnegative matrix factorization with gene selection, IEEE Trans. Inf. Technol. Biomed. 13 (4) (2009) 599–607.
[19] A. Kocsor, K. Kovacs, C. Szepesvari, Margin maximizing discriminant analysis, in: Proceedings of the 15th European Conference on Machine Learning, 32 (1) (2004), pp. 227–238.
[20] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[21] Q. Shen, Z. Mei, B.X. Ye, Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification, Comput. Biol. Med. 39 (7) (2009) 646–649.
[22] S. Shah, A. Kusiak, Cancer gene search with data-mining and genetic algorithms, Comput. Biol. Med. 37 (2) (2007) 251–261.
[23] Y. Tang, Y. Zhang, Z. Huang, Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis, IEEE/ACM Trans. Comput. Biol. Bioinformatics 4 (3) (2007) 365–381.
[24] D. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Comm. Pure Appl. Math. 59 (6) (2006) 797–829.
[25] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM Rev. 43 (1) (2001) 129–159.
[26] X. Hang, Gene selection using l1-norm least square regression, in: World Congress on Computer Science and Information Engineering, vol. 5, 2009, pp. 38–39.
[27] X. Hang, Multiclass gene selection on microarray data using l1-norm least square regression, in: International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, 2009, pp. 52–55.
[28] B. Jones, Matlab Statistics Toolbox, The MathWorks, Inc., Natick, MA, 1997.
[29] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA 96 (12) (1999) 6745–6750.
[30] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[31] C.L. Nutt, D.R. Mani, R.A. Betensky, P. Tamayo, J.G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M.E. McLaughlin, T.T. Batchelor, P.M. Black, A. Deimling, S.L. Pomeroy, T.R. Golub, D.N. Louis, Gene expression-based classification of malignant gliomas correlates better with survival than histological classification, Cancer Res. 63 (7) (2003) 1602–1607.
[32] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, W.R. Sellers, Gene expression correlates of clinical prostate cancer behavior, Cancer Cell 1 (2) (2002) 203–209.
[33] L.J. Van 't Veer, H. Dai, M.J. Van De Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. Van Der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, S.H. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530–536.
[34] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, D. Haussler, Support vector machines classification and validation of cancer tissue samples using microarray expression data, Bioinformatics 16 (10) (2000) 906–914.
[35] S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. den Boer, M.D. Minden, S.E. Sallan, E.S. Lander, T.R. Golub, S.J. Korsmeyer, MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia, Nat. Genet. 30 (1) (2002) 41–47.
[36] B. Li, C.H. Zheng, D.S. Huang, L. Zhang, K. Han, Gene expression data classification using locally linear discriminant embedding, Comput. Biol. Med. 40 (10) (2010) 1016–1025.