Gene 706 (2019) 188–200

Contents lists available at ScienceDirect

Gene

journal homepage: www.elsevier.com/locate/gene

Research paper

Gene selection for microarray data classification via adaptive hypergraph embedded dictionary learning

Xiao Zheng a,1, Wenyang Zhu b,1, Chang Tang c, Minhui Wang d,⁎

a Wuhan University of Technology Hospital, Wuhan University of Technology, Wuhan 430070, China
b Department of Interventional Radiology, The Affiliated Huai'an Hospital of Xuzhou Medical University, Huai'an 223100, China
c School of Computer Science, China University of Geosciences, Wuhan 430074, China
d Department of Pharmacy, People's Hospital of Lian'shui County, Huai'an 223300, China
ARTICLE INFO
ABSTRACT
Due to the rapid development of DNA microarray technology, a large amount of microarray data has been generated, and classifying these data has been shown to be useful for cancer diagnosis, treatment and prevention. However, microarray data classification is still a challenging task since gene expression data often contain a huge number of genes but only a small number of samples. As a result, a computational method for reducing the dimension of microarray data is necessary. In this paper, we introduce a computational gene selection model for microarray data classification via adaptive hypergraph embedded dictionary learning (AHEDL). Specifically, a dictionary is learned from the feature space of the original high-dimensional microarray data, and this learned dictionary is used to represent the original genes with a reconstruction coefficient matrix. Then we use an l2,1-norm regularization to impose row sparsity on the coefficient matrix for selecting discriminative genes. Meanwhile, in order to capture the local manifold geometrical structure of the original microarray data in a high-order manner, a hypergraph is adaptively learned and embedded into the model. An iterative updating algorithm is designed to solve the optimization problem. In order to validate the efficacy of the proposed model, we have conducted experiments on six publicly available microarray data sets, and the results demonstrate that AHEDL outperforms other state-of-the-art methods in terms of microarray data classification.
Keywords: Gene selection; Microarray data classification; Hypergraph learning; Dictionary learning
Abbreviations: AHEDL, Adaptive Hypergraph Embedded Dictionary Learning; ADMM, Alternating Direction Method of Multipliers; SVM, Support Vector Machine; RF, Random Forest; k-NN, k-Nearest Neighbor; CV, Cross Validation; MSVM-RFE, Multiclass Support Vector Machine-Recursive Feature Elimination; KernelPLS, Kernel Partial Least Squares; WLMGS, Weight Local Modularity based Gene Selection; GRSL-GS, Gene Selection via Subspace Learning and Manifold Regularization; LNNFW, Local-Nearest-Neighbors-based Feature Weighting for Gene Selection; RLR, Regularized Logistic Regression; ACC, accuracy; SD, standard deviation; ANOVA, Analysis of Variance; DF, Degrees of Freedom; SS, Sum-of-Square; MS, Mean Sum-of-Square; F, F-value; Sig, statistical significance; SRBCT, Small Round Blue Cell Tumors; GCM, Global Cancer Map; CLL_SUB_111, B-cell chronic lymphocytic leukemia.

⁎ Corresponding author. E-mail address: [email protected] (M. Wang).
1 Xiao Zheng and Wenyang Zhu contributed equally as first authors to this work.

https://doi.org/10.1016/j.gene.2019.04.060
Received 27 October 2018; Received in revised form 3 April 2019; Accepted 22 April 2019; Available online 11 May 2019
0378-1119/ © 2019 Elsevier B.V. All rights reserved.
1. Introduction
Due to the rapid development of biomedical and DNA microarray technology, a great deal of genomic microarray data has come into being (Lj et al., 2002). It has been verified that classifying these microarray data plays an important role in drug development (Liang et al., 2018), cancer diagnosis, treatment and prevention (Golub et al., 1999a; Guyon et al., 2002; Nguyen and Nahavandi, 2016; De et al., 2017; Naranjo et al., 2017; Huang et al., 2016; Anauate et al., 2017; Zhang et al., 2018; Huang and Liang, 2018; Das et al., 2018; Li et al., 2018). However, the small sample size but high feature dimension makes the classification task challenging (Scott, 2008; Oh et al., 2004; Buza, 2016). Based on existing biological research, it has been verified that only a small number of genes play critical roles in biological processes and in indicating diseases (Liao et al., 2003; Guo et al., 2017), while the remaining genes are often noisy and redundant. In addition, processing the original high-dimensional microarray data not only degrades the final performance of classification algorithms but also increases the computational burden on hardware. To this end, it is urgent to reduce the dimensionality of the original high-dimensional microarray data by selecting a discriminative subset of genes which can obtain better classification results (Mitra et al., 2002; Dy and Brodley, 2004; He et al., 2005; Chuang et al., 2012; Song et al., 2016; Ramos et al., 2017; Miao et al., 2017; Wang et al., 2017a; Tang et al., 2018a; Algamal et al., 2018). Generally speaking, the gene selection task is very similar to feature selection in the data mining and machine learning communities (Mitra et al., 2002; Alrajab et al., 2017; Odeh and Baareh, 2016; Luo et al., 2013; Shi et al., 2015; Luo et al., 2016; Shen et al., 2018a; Shen et al., 2018b; Shen et al., 2016; Li et al., 2019; Tang et al., 2018b; Tang et al., 2018c; Tang et al., 2018d; Tang et al., 2019a; Tang et al., 2019b). Both tasks aim to select a subset of the dimensions of the original data.

In recent years, many computational gene selection methods have been proposed to facilitate tumor classification and diagnosis. Based on their different evaluation functions, we can classify these methods into three categories, i.e., filter methods, wrapper methods and embedded methods. Filter methods usually act as a preprocessing step and often work independently of the classifier. Most of these methods measure the importance of genes according to some criterion such as the Z-score (Thomas et al., 2001), t-test (Dudoit et al., 2000; Long et al., 2001), signal-to-noise ratio (Golub et al., 1999b), mutual information (Cai et al., 2009), information gain (Chuang et al., 2012) and the Laplacian score (He et al., 2005). Golub et al. (1999b) first proposed the signal-to-noise ratio function for evaluating the pros and cons of genes. As two conventional methods for feature selection, ReliefF (Robnik-Šikonja and Kononenko, 2003) and MRMR (Peng et al., 2005) have been combined for gene selection (Yi et al., 2008). Sun et al. (2018) proposed a cross-entropy based multi-filter ensemble method for microarray data classification; in their method, multiple filters are applied to the microarray data in order to obtain several pre-selected feature subsets with different classification abilities. Wrapper methods often use the classification accuracy as an indicator for optimal gene subset selection. Guyon et al. (2002) developed a recursive feature elimination algorithm for gene selection. The algorithm works by recursively eliminating the parameters of the support vector machine, and it has earned great success in gene selection (Duan et al., 2005; Zhou and Tuck, 2007; Liang et al., 2011; Tapia et al., 2012). Ghosh and Chinnaiyan (2005) combined this strategy with Lasso estimation and put forward an optimal gene classification procedure. By introducing the Markov blanket technique, Wang et al. (2017b) presented an improved wrapper-based gene selection method which can identify target genes while eliminating redundant ones in an efficient way.

In embedded methods, the structure and representation properties of the data are usually exploited to model the gene selection problem. These models are often built under the assumption that each feature dimension can be linearly reconstructed by other relevant features. Representative methods in this category include matrix factorization (Wang et al., 2016a; Zheng et al., 2011; Du et al., 2017) and low-rank representation (Wang et al., 2016b). Wang et al. (2016b) used a Laplacian graph to regularize the low-rank representation and proposed a differentially expressed genes selection model. Zheng et al. (2011) applied nonnegative matrix factorization to select important tumor genes. Considering that the l2,1-norm is robust to outliers in the original data, Wang et al. (2016a) proposed a robust characteristic gene selection method with l2,1-norm regularization. Guo et al. (2018) proposed an ensemble consensus-guided unsupervised feature selection model for identifying disease-associated genes. By combining three machine learning algorithms, namely Monte Carlo feature selection, random forest, and rough set-based rule learning, Wang et al. (2018) successfully identified genes with significant expression differences between PDX and original human tumors. In the unsupervised case, the intrinsic local geometric structure can be regarded as a priori information and can be easily exploited in embedded methods. Therefore, these methods usually achieve superior performance in many cases, and they have attracted more and more attention. In recent years, the self-representation model has been successfully used for feature/gene selection (Tang et al., 2018b; Tang et al., 2018c; Zhu et al., 2015; Shang et al., 2016; Zhu et al., 2017a; Liu et al., 2017). In these methods, the features/genes are reconstructed by other ones, a certain regularization term is imposed on the representation coefficient matrix for selecting discriminative features/genes, and graph Laplacian regularization is commonly used to preserve the local geometrical structure of the data.

Although previous representation based approaches have achieved great success, there are still some issues. Firstly, traditional representation based approaches usually use the original features/genes as the reconstruction dictionary, which degrades the final selection performance because the original data often contain noisy feature dimensions and the original features may not be representative. Secondly, previous methods usually use a pre-defined graph, but constructing an appropriate graph by using a manually designed function is challenging since practical data always contain noise and outliers and an accurate similarity measuring function is hard to define. Thirdly, a traditional graph can only describe pair-wise relations between data, but cannot capture high-order relations, so the complex structures implied in the data cannot be sufficiently exploited. Tang et al. (2018a) proposed a gene selection method for microarray data classification via subspace learning and manifold regularization. In this method, the original microarray data are projected into a lower-dimensional subspace and the projection matrix is used to measure the weights of different genes. Meanwhile, the Laplacian graph is used to preserve the local geometric structure of the original data in the projected subspace. However, this method also uses a traditional pre-defined pair-wise graph, so high-order relations between data cannot be captured. In order to address the above-mentioned issues, we propose an adaptive hypergraph embedded dictionary learning method (AHEDL) for gene selection, to improve the classification of microarray data.
Considering that dictionary learning can obtain a more robust representation of the original data (Gao et al., 2013; Liu et al., 2015), we learn a dictionary from the feature space of the original high-dimensional microarray data, and this learned dictionary is used to reconstruct the original genes with a reconstruction coefficient matrix. Then the l2,1-norm regularization is used to impose row sparsity on the coefficient matrix for selecting discriminative genes. In addition, in order to capture the local manifold geometrical structure of the original microarray data in a high-order manner, a hypergraph is adaptively learned and embedded into the model. We develop an iterative updating algorithm to solve the optimization problem. Finally, we have conducted experiments on six publicly available microarray data sets to validate the efficacy of the proposed method, and the results demonstrate that AHEDL outperforms other state-of-the-art methods in terms of microarray data classification.
2. Proposed method

In this section, we firstly give a brief introduction about hypergraph regularization, and then we introduce our proposed adaptive hypergraph embedded dictionary learning model (AHEDL). Throughout this paper, matrices are written as boldface capital letters and vectors are denoted as boldface lower case letters. For an arbitrary matrix M ∈ ℝ^{m×n}, M_ij denotes its (i, j)-th entry, and m^i and m_j denote the i-th row and the j-th column of M, respectively. Tr(M) is the trace of M if M is square, and M^T is the transpose of M. I_m is the identity matrix of size m × m (denoted by I if the size is clear from the context). 1 is a vector whose elements are all 1. The l2,1-norm of a matrix M is defined as $\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{m} \|\mathbf{m}^{i}\|_{2} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} M_{ij}^{2}}$, and $\|\mathbf{M}\|_{F} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{ij}^{2}}$ is the well-known Frobenius norm of M.

2.1. Hypergraph regularization

As discussed in the first section, a traditional graph can only describe the pair-wise relations between data points when preserving the local geometric structure, but it cannot exploit the higher-order complex relations in the data. Here, we give an illustration of gene-disease relations using Fig. 1. In Fig. 1, the black dots and the black lines indicate the genes and the relations between two genes, respectively. The dotted lines indicate the relations among no fewer than two genes. A traditional simple graph only describes gene-gene relations, e.g., l1 vs. l2 (i.e., l1 and l2 are two genes related to a certain disease), l2 vs. l3, and l2 vs. l4, but it cannot reflect the relations that occur in real-world cases, where a specific disease is usually related to more than two genes. For instance, three of the four genes in Fig. 1 (i.e., l1, l2 and l3) may be related to a certain disease while another disease may be related to three other genes (i.e., l1, l2 and l4). The hyperedges in a hypergraph can easily describe this type of relation, as the colored dotted lines in Fig. 1 show. The hypergraph has been used in many previous works including clustering (Zhou and Huang, 2006) and classification (Gao et al., 2014). In this work, we focus on adaptively learning a hypergraph to preserve the high-order local geometric structures of the microarray data.

A hypergraph can be mathematically denoted as $\mathcal{G} = (V, E, \mathbf{w})$, where V = [v_i] and E = [e_i] represent the sets of vertexes and hyperedges, respectively, and w = [w_i] contains the weights of the hyperedges. Different from traditional simple graph edges, which only consist of pairs of vertices, hyperedges can be arbitrarily sized sets of vertices. If the binary vertex-edge relations are represented by an incidence matrix H, each element of H is defined as:

$$\mathbf{H}(v_i, e_j) = \begin{cases} 1, & \text{if } v_i \in e_j, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

Then the degree of a hyperedge $e_i$ can be calculated via $\delta(e_i) = \sum_{v_j \in V} h(v_j, e_i)$. For a specific vertex $v_j$, its degree can be obtained via $d(v_j) = \sum_{v_j \in e_i,\, e_i \in E} w(e_i)\, h(v_j, e_i)$. The hypergraph Laplacian matrix can be formulated as:

$$\mathbf{L} = \mathbf{I} - \mathbf{D}_{v}^{-\frac{1}{2}} \mathbf{H} \mathbf{W} \mathbf{D}_{e}^{-1} \mathbf{H}^{T} \mathbf{D}_{v}^{-\frac{1}{2}}, \tag{2}$$

where I ∈ ℝ^{n×n} is an identity matrix; D_e, D_v and W, respectively, are the diagonal matrices of δ = [δ(e_i)], d = [d(v_j)] and w = [w(e_i)]; and n is the number of data samples.

Fig. 1. An intuitive comparison of a simple graph and a hypergraph. The edges (i.e., e1, e2 and e3, represented by black lines) in a simple graph can only reflect the relations between two genes. The hyperedges (i.e., he1 and he2, represented by colored dotted lines) can capture the relations among no fewer than two genes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
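The construction in Eqs. (1)–(2) translates directly into a few lines of linear algebra. The following is a minimal NumPy sketch, not the authors' code; the function name and the assumption that the hyperedges are already supplied as an incidence matrix are ours.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Hypergraph Laplacian of Eq. (2) from an incidence matrix.

    H : (n, m) binary matrix, H[i, j] = 1 if vertex v_i belongs to hyperedge e_j (Eq. (1)).
    w : (m,) positive hyperedge weights.
    Returns the n x n matrix L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}.
    """
    W = np.diag(w)
    delta = H.sum(axis=0)              # hyperedge degrees delta(e_j)
    d = H @ w                          # vertex degrees d(v_i) = sum_j w(e_j) H[i, j]
    De_inv = np.diag(1.0 / delta)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    n = H.shape[0]
    return np.eye(n) - Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt
```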
2.2. Adaptive hypergraph embedded dictionary learning for gene selection

Given a microarray data matrix X = {x1, x2, ⋯, xn} = {f1; f2; ⋯; fd} ∈ ℝ^{d×n}, where n and d represent the numbers of samples and genes, respectively, x_i ∈ ℝ^{d×1} is a sample and f_j ∈ ℝ^{1×n} is a gene vector. Traditional representation based gene selection models can be uniformly defined as follows:

$$\min_{\mathbf{C}}\ \mathcal{L}(\mathbf{X}, \mathbf{C}) + \mathcal{R}(\mathbf{C}), \tag{3}$$

where $\mathcal{L}(\mathbf{X}, \mathbf{C})$ is the loss function which regularizes the data representation term, and $\mathcal{R}(\mathbf{C})$ is used to regularize the representation coefficient matrix C, which can be used to reflect the importance of gene dimensions. For example, in some self-representation based methods, each gene is represented as a linear combination of its relevant genes and the reconstruction errors are minimized. In order to select important genes, the l2,1-norm is usually deployed to regularize C for row sparsity. In addition, the well-studied graph Laplacian is often integrated into $\mathcal{R}(\mathbf{C})$ for local manifold structure preservation. Different from previous methods which select genes in the original data space, we investigate how to perform gene selection in a new space, i.e., the dictionary basis space. It has been verified that sparse dictionary learning distinguishes important elements from unimportant ones by assigning the codes of unimportant elements as zeros and those of important ones as nonzeros, which enables sparse dictionary learning to reduce the impact of noise and enhance the efficiency of learning models (Yu et al., 2015; Mairal et al., 2010; Zhu et al., 2017b). This motivates us to perform gene selection in the learned dictionary basis space in this work. Given a microarray data matrix X ∈ ℝ^{d×n} (where each column represents a data sample), we intend to learn m dictionary bases P ∈ ℝ^{d×m} to induce the new representations Q ∈ ℝ^{m×n} of X. In order to enable our model to output gene importance, we set the number of bases equal to the number of gene dimensions in this work (i.e., m = d). The objective function of basis and representation learning can be defined as:

$$\min_{\mathbf{P}, \mathbf{Q}}\ \|\mathbf{X} - \mathbf{P}\mathbf{Q}\|_{F}^{2} + \beta\|\mathbf{Q}\|_{2,1}, \quad \text{s.t. } \|\mathbf{p}_{i}\|_{2} \le 1,\ i = 1, 2, \ldots, m, \tag{4}$$

where β is a positive balancing parameter and ‖p_i‖_2 ≤ 1 enforces the learned dictionary atoms to be compact. Here, we use the l2,1-norm instead of the l1-norm used in traditional dictionary learning models, mainly for two reasons. Firstly, the l1-norm leads to element-wise sparsity and cannot reflect the importance of different features. Instead, the l2,1-norm can impose row sparsity on Q, which can be used to measure the distance in feature dimensions via the l2-norm regularization. Secondly, imposing row sparsity on Q can select the closely related bases from P to represent each sample, which also captures the local property of the original data.

Recent literature has verified that preserving the local geometric structure of data is critical for unsupervised feature learning (Liu et al., 2014), and the well-studied graph Laplacian is often used to accomplish this task (Shang et al., 2016; Wang and Wang, 2017). In this work, we adaptively learn a hypergraph to preserve the local geometric structure of the data samples in a high-order manner and constrain that if two samples are close to each other, their representation coefficients should also be similar. We formulate this property as the following hypergraph Laplacian regularization term:

$$\min_{\mathbf{W}, \mathbf{D}_{e}, \mathbf{D}_{v}}\ \frac{1}{2}\sum_{e \in E,\ \mathbf{x}_{i}, \mathbf{x}_{j} \in V} \frac{w(e)\, h(\mathbf{x}_{i}, e)\, h(\mathbf{x}_{j}, e)}{\delta(e)}\, \|\mathbf{q}_{i} - \mathbf{q}_{j}\|_{2}^{2} + \gamma\|\mathbf{W}\|_{F}^{2}, \tag{5}$$

where γ is a positive balancing parameter, and q_i and q_j are the representation coefficient vectors of the i-th and j-th samples. In such a manner, the hypergraph and the sample representation coefficient matrix constrain each other to obtain their optimal solutions. On the one hand, the hypergraph can be adaptively learned to preserve the high-order local geometrical structure of the data samples; on the other hand, the learned coefficient matrix can constrain the hypergraph learning process. After a simple mathematical transformation, Eq. (5) becomes

$$\min_{\mathbf{D}_{e}, \mathbf{D}_{v}, \mathbf{W}}\ \mathrm{Tr}(\mathbf{Q}\mathbf{L}\mathbf{Q}^{T}) + \gamma\,\mathrm{Tr}(\mathbf{W}^{T}\mathbf{W}), \quad \text{s.t. } \mathbf{w}^{T}\mathbf{1} = 1,\ w(e_{i}) > 0. \tag{6}$$

By combining Eq. (4) and Eq. (5) together, and imposing some constraints on the weights of the hyperedges, we have our AHEDL model as follows:

$$\min_{\mathbf{P}, \mathbf{Q}, \mathbf{D}_{e}, \mathbf{D}_{v}, \mathbf{W}}\ \|\mathbf{X} - \mathbf{P}\mathbf{Q}\|_{F}^{2} + \alpha\,\mathrm{Tr}(\mathbf{Q}\mathbf{L}\mathbf{Q}^{T}) + \beta\|\mathbf{Q}\|_{2,1} + \gamma\,\mathrm{Tr}(\mathbf{W}^{T}\mathbf{W}), \quad \text{s.t. } \|\mathbf{p}_{i}\|_{2} \le 1,\ \mathbf{w}^{T}\mathbf{1} = 1,\ w(e_{i}) > 0, \tag{7}$$

where α is another positive constant for balancing the hypergraph Laplacian regularization term. As can be seen from Eq. (7), our proposed AHEDL model integrates dictionary learning, hypergraph learning and gene selection into a uniform framework. During the optimization process, the three variables P, W and Q constrain each other to reach their individual optima. Since the columns of P = {p1, p2, ⋯, pm} form the basis of the new representation of X, the row sparsity imposed on Q by using the l2,1-norm can be deployed to measure the importance of gene dimensions in the new dictionary basis space. In contrast to the low-level representation in the original data space used by previous methods, AHEDL captures a higher-level and more abstract representation.
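For readers who prefer code to notation, a small sketch of how the objective in Eq. (7) could be evaluated for fixed variables is given below. The function name and argument conventions are illustrative and not part of the paper.

```python
import numpy as np

def ahedl_objective(X, P, Q, L, w, alpha, beta, gamma):
    """Value of the AHEDL objective in Eq. (7) for fixed variables (a sketch).

    X: (d, n) data, P: (d, m) dictionary, Q: (m, n) codes,
    L: (n, n) hypergraph Laplacian, w: hyperedge weights (diagonal of W).
    """
    reconstruction = np.linalg.norm(X - P @ Q, 'fro') ** 2
    manifold = alpha * np.trace(Q @ L @ Q.T)
    row_sparsity = beta * np.sum(np.linalg.norm(Q, axis=1))   # l2,1-norm of Q
    weight_reg = gamma * np.sum(w ** 2)                        # Tr(W^T W) for diagonal W
    return reconstruction + manifold + row_sparsity + weight_reg
```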
Table 1
Statistics of the microarray data sets.

Data sets | #Instance | #Gene number | #Class
CLL_SUB_111 | 111 | 11,340 | 3
Breast | 95 | 4869 | 3
Lung | 203 | 12,600 | 5
Tumors-11 | 174 | 12,533 | 11
SRBCT | 83 | 2308 | 4
GCM | 198 | 16,063 | 14

2.3. Optimization solution of AHEDL

There are five variables to optimize in Eq. (7), and it is difficult to optimize them simultaneously since Eq. (7) is not jointly convex in all variables. To this end, we develop an alternating updating method that optimizes each variable while fixing the others until the algorithm converges.

2.3.1. Update Q with other variables fixed

While the other variables are fixed, the problem for solving Q is transformed into

$$\min_{\mathbf{Q}}\ \|\mathbf{X} - \mathbf{P}\mathbf{Q}\|_{F}^{2} + \alpha\,\mathrm{Tr}(\mathbf{Q}\mathbf{L}\mathbf{Q}^{T}) + \beta\|\mathbf{Q}\|_{2,1}. \tag{8}$$

Eq. (8) is convex, but non-smooth due to the l2,1-norm regularization on Q. Here, we use an iteratively reweighted least-squares algorithm to solve it. Supposing the current estimate of Q in the sub-problem of updating Q is Q^t, we define a diagonal weighting matrix G^t with its i-th diagonal element $g_{i}^{t} = \frac{1}{2\|\mathbf{q}_{t}^{i}\|_{2}}$, and then Q^{t+1} is updated by solving the following weighted least squares problem:

$$\mathbf{Q}^{t+1} = \arg\min_{\mathbf{Q}}\ \mathrm{Tr}\big((\mathbf{X} - \mathbf{P}\mathbf{Q})^{T}(\mathbf{X} - \mathbf{P}\mathbf{Q})\big) + \alpha\,\mathrm{Tr}(\mathbf{Q}\mathbf{L}\mathbf{Q}^{T}) + \beta\,\mathrm{Tr}(\mathbf{Q}^{T}\mathbf{G}^{(t)}\mathbf{Q}). \tag{9}$$

Taking the derivative of Eq. (9) with respect to Q and setting it to zero, we have

$$(\mathbf{P}^{T}\mathbf{P} + \beta\mathbf{G})\mathbf{Q} + \alpha\mathbf{Q}\mathbf{L} = \mathbf{P}^{T}\mathbf{X}. \tag{10}$$

The above equation is a Sylvester equation (Bartels and Stewart, 1972). Since P^TP + βG is strictly positive definite, Eq. (10) has a stable solution. The iterative algorithm for solving Q is summarized in Algorithm 1.

Algorithm 1. Iterative algorithm for solving Q.
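Eq. (10) can be handed to a standard Sylvester solver. The following NumPy/SciPy sketch of the reweighted iteration in Eqs. (8)–(10) is ours; the initialization of Q and the fixed iteration count are assumptions, not choices stated in the paper.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_Q(X, P, L, alpha, beta, n_iters=10, eps=1e-8):
    """Iteratively re-weighted update of Q (Eqs. (8)-(10)), a sketch.

    Each pass rebuilds the diagonal weighting G from the current row norms of Q
    and solves the Sylvester equation (P^T P + beta*G) Q + alpha Q L = P^T X.
    """
    Q = np.linalg.pinv(P) @ X                 # assumed warm start
    for _ in range(n_iters):
        row_norms = np.linalg.norm(Q, axis=1)
        G = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))  # g_ii = 1 / (2 ||q^i||_2)
        A = P.T @ P + beta * G
        Q = solve_sylvester(A, alpha * L, P.T @ X)
    return Q
```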
2.3.2. Update P by fixing other variables

With the other variables fixed, the problem (Eq. (7)) for solving P is transformed into

$$\min_{\mathbf{P}}\ \|\mathbf{X} - \mathbf{P}\mathbf{Q}\|_{F}^{2}, \quad \text{s.t. } \|\mathbf{p}_{i}\|_{2} \le 1. \tag{11}$$
To solve Eq. (11), we use the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011), introducing an auxiliary variable matrix F to obtain the optimal solution of P. Then we have

$$\min_{\mathbf{P}, \mathbf{F}}\ \|\mathbf{X} - \mathbf{P}\mathbf{Q}\|_{F}^{2}, \quad \text{s.t. } \mathbf{F} = \mathbf{P},\ \|\mathbf{f}_{i}\|_{2} \le 1. \tag{12}$$

The optimal P can be obtained by the following iteration steps:

$$\begin{aligned}
\mathbf{P}^{(t+1)} &= \arg\min_{\mathbf{P}}\ \|\mathbf{X} - \mathbf{P}\mathbf{Q}\|_{F}^{2} + \kappa\|\mathbf{P} - \mathbf{F}^{(t)} + \mathbf{Y}^{(t)}\|_{F}^{2},\\
\mathbf{F}^{(t+1)} &= \arg\min_{\mathbf{F}}\ \|\mathbf{P}^{(t+1)} - \mathbf{F} + \mathbf{Y}^{(t)}\|_{F}^{2}, \quad \text{s.t. } \|\mathbf{f}_{i}\|_{2} \le 1,\\
\mathbf{Y}^{(t+1)} &= \mathbf{Y}^{(t)} + \mathbf{P}^{(t+1)} - \mathbf{F}^{(t+1)},
\end{aligned} \tag{13}$$

where Y is the Lagrange multiplier, κ is a penalty parameter, and t denotes the iteration index.
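A compact sketch of the ADMM iteration in Eq. (13) might look as follows. The closed-form P-step and the default value of κ are our own choices and are not prescribed by the paper.

```python
import numpy as np

def update_P_admm(X, Q, kappa=1.0, n_iters=30):
    """ADMM sketch for Eqs. (11)-(13): learn dictionary P with norm-bounded atoms."""
    d, m = X.shape[0], Q.shape[0]
    P = np.zeros((d, m))
    F = np.zeros((d, m))
    Y = np.zeros((d, m))
    QQt = Q @ Q.T
    XQt = X @ Q.T
    for _ in range(n_iters):
        # P-step: minimise ||X - P Q||_F^2 + kappa ||P - F + Y||_F^2 (closed form)
        B = QQt + kappa * np.eye(m)
        P = np.linalg.solve(B.T, (XQt + kappa * (F - Y)).T).T
        # F-step: project each column of P + Y onto the unit l2 ball
        F = P + Y
        F = F / np.maximum(np.linalg.norm(F, axis=0), 1.0)
        # dual update
        Y = Y + P - F
    return F
```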
2.3.3. Update H and De by fixing other variables

As discussed in the previous section, the edge weights generated from the original data are not robust to noise, which degrades the quality of the hypergraph. In order to tackle this problem, we learn the hyperedges from the reconstructed data space, in which the impact of noise can be efficiently reduced. Here we use the following formulation to construct the set of hyperedges:

$$e_{i} = \{v_{j} \mid s(\mathbf{P}\mathbf{q}_{i}, \mathbf{P}\mathbf{q}_{j}) \ge \tau\,\bar{s}_{i}\}, \quad i, j = 1, \ldots, n, \tag{14}$$

where $s(\cdot, \cdot)$ denotes the similarity between two reconstructed samples, $\bar{s}_{i}$ is the average similarity between Pq_i and each of the other reconstructed data samples, and τ is a constant which is set to 0.5 in our experiments. It can be seen from Eq. (14) that AHEDL learns the incidence matrix H from the reconstructed data space and different samples are allocated different numbers of neighbors. Different from previous simple graph methods (Liu et al., 2006) and hypergraph methods (Zhou and Huang, 2006; Somu et al., 2016), which learn the graphs from the original data and assume a fixed number of neighbors for all samples, AHEDL is more flexible and robust.
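Since the similarity measure in Eq. (14) is not pinned down in the text, the sketch below assumes a Gaussian similarity on the reconstructed samples PQ; the kernel-width heuristic and function name are illustrative only.

```python
import numpy as np

def build_incidence_from_reconstruction(P, Q, tau=0.5):
    """Sketch of Eq. (14): one hyperedge per sample, built in the reconstructed space P Q.

    Sample j joins hyperedge e_i when its similarity to P q_i is at least tau times
    the average similarity of P q_i to the other reconstructed samples.
    """
    R = P @ Q                                   # reconstructed samples, one per column
    n = R.shape[1]
    sq_dists = np.sum((R[:, :, None] - R[:, None, :]) ** 2, axis=0)
    sigma = np.mean(sq_dists) + 1e-12
    S = np.exp(-sq_dists / sigma)               # assumed similarity between reconstructed samples
    H = np.zeros((n, n))
    for i in range(n):
        others = np.delete(S[i], i)
        threshold = tau * others.mean()          # tau times the average similarity for sample i
        H[:, i] = (S[:, i] >= threshold).astype(float)
    return H                                     # H[j, i] = 1 if v_j belongs to hyperedge e_i
```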
Fig. 2. The classification accuracy of different methods with different numbers of selected genes on the different data sets. Panels: (a) CLL_SUB_111, (b) Breast, (c) Lung, (d) Tumors-11, (e) SRBCT, (f) GCM; each panel plots classification accuracy against the number of selected genes (1–50) for F-test, MSVM-RFE, KernelPLS, WLMGS, RLR, LNNFW, GRSL-GS and AHEDL.
2.3.4. Update W and Dv by fixing other variables

By ignoring the other fixed variables, the objective function with respect to W can be written as follows:

$$\min_{\mathbf{W}}\ \mathrm{Tr}(\mathbf{Q}\mathbf{L}\mathbf{Q}^{T}) + \gamma\,\mathrm{Tr}(\mathbf{W}^{T}\mathbf{W}), \quad \text{s.t. } \mathbf{w}^{T}\mathbf{1} = 1,\ w(e_{i}) > 0. \tag{15}$$

By substituting the definition of L into Eq. (15), it can be converted into the following form:

$$\min_{\mathbf{W}}\ \mathrm{Tr}\big(\mathbf{Q}(\mathbf{I} - \mathbf{D}_{v}^{-\frac{1}{2}}\mathbf{H}\mathbf{W}\mathbf{D}_{e}^{-1}\mathbf{H}^{T}\mathbf{D}_{v}^{-\frac{1}{2}})\mathbf{Q}^{T}\big) + \gamma\,\mathrm{Tr}(\mathbf{W}^{T}\mathbf{W}), \quad \text{s.t. } \mathbf{w}^{T}\mathbf{1} = 1,\ w(e_{i}) > 0. \tag{16}$$

By letting $\mathbf{Z} = \mathbf{D}_{e}^{-1}\mathbf{H}^{T}\mathbf{D}_{v}^{-\frac{1}{2}}\mathbf{Q}^{T}\mathbf{Q}\mathbf{D}_{v}^{-\frac{1}{2}}\mathbf{H}$ and $\mathbf{z} = \mathrm{diag}(\mathbf{Z})$, and since W is a diagonal matrix, Eq. (15) can be rewritten in the following form:

$$\min_{\mathbf{w}}\ -\mathbf{z}^{T}\mathbf{w} + \gamma\|\mathbf{w}\|_{2}^{2}, \quad \text{s.t. } \mathbf{w}^{T}\mathbf{1} = 1,\ w(e_{i}) > 0. \tag{17}$$

Then Eq. (17) can be transformed into the following form:

$$\min_{\mathbf{w}}\ \left\|\mathbf{w} - \frac{1}{2\gamma}\mathbf{z}\right\|_{2}^{2}, \quad \text{s.t. } \mathbf{w}^{T}\mathbf{1} = 1,\ w(e_{i}) > 0. \tag{18}$$

The Lagrangian function corresponding to Eq. (18) can be established as:

$$\mathcal{L}(\mathbf{w}, \eta, \boldsymbol{\theta}) = \left\|\mathbf{w} - \frac{1}{2\gamma}\mathbf{z}\right\|_{2}^{2} - \eta(\mathbf{w}^{T}\mathbf{1} - 1) - \boldsymbol{\theta}^{T}\mathbf{w}, \tag{19}$$

where η ≥ 0 and θ ≥ 0 are the Lagrangian multipliers. According to the Karush-Kuhn-Tucker conditions, we obtain the closed-form solution for w(e_i) as:

$$w(e_{i}) = \frac{1}{2}\left(\frac{1}{\gamma}z_{i} + \eta\right), \quad i = 1, \ldots, n. \tag{20}$$

Then we have W = diag(w) and

$$d(v_{i}) = \sum_{v_{i} \in e_{j},\ e_{j} \in E} w(e_{j})\, h(v_{i}, e_{j}), \quad i = 1, \ldots, n, \tag{21}$$

with D_v = diag(d).

After solving AHEDL and obtaining Q, we compute the l2-norm of each row of Q and sort the rows in descending order of this norm; the top K discriminative gene dimensions are then chosen for the final microarray data classification. For clarity, the entire algorithm for solving AHEDL is summarized in Algorithm 2.

Algorithm 2. Iterative algorithm for solving AHEDL.
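Putting the sub-steps together, the alternating procedure of Algorithm 2 could be approximated as in the sketch below, which reuses the helper sketches given earlier. Note that the hyperedge-weight step here simply clips and renormalises z/(2γ) instead of computing the exact KKT solution of Eq. (20), and the initialisation is arbitrary, so this is an approximation of the procedure, not the authors' implementation.

```python
import numpy as np

def ahedl_gene_selection(X, alpha, beta, gamma, top_k=50, n_outer=20, tau=0.5):
    """High-level sketch of Algorithm 2 using the helper sketches defined above."""
    d, n = X.shape
    P = np.linalg.qr(np.random.randn(d, d))[0]          # m = d dictionary atoms, unit columns
    w = np.ones(n) / n                                   # one hyperedge per sample
    H = np.eye(n)
    L = hypergraph_laplacian(H, w)
    for _ in range(n_outer):
        Q = update_Q(X, P, L, alpha, beta)               # Sec. 2.3.1
        P = update_P_admm(X, Q)                          # Sec. 2.3.2
        H = build_incidence_from_reconstruction(P, Q, tau)   # Sec. 2.3.3
        # Sec. 2.3.4 (approximated): z = diag(De^-1 H^T Dv^-1/2 Q^T Q Dv^-1/2 H),
        # then a clipped, renormalised z / (2*gamma) stands in for Eq. (20).
        delta = np.maximum(H.sum(axis=0), 1e-12)
        dv = np.maximum(H @ w, 1e-12)
        Dv_is = np.diag(1.0 / np.sqrt(dv))
        Z = np.diag(1.0 / delta) @ H.T @ Dv_is @ Q.T @ Q @ Dv_is @ H
        z = np.diag(Z)
        w = np.maximum(z / (2.0 * gamma), 0)
        w = w / w.sum() if w.sum() > 0 else np.ones(n) / n
        L = hypergraph_laplacian(H, w)
    gene_scores = np.linalg.norm(Q, axis=1)              # row l2-norms of Q
    return np.argsort(-gene_scores)[:top_k]
```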
2.4. Algorithm complexity and convergence analysis

In this subsection, we give a brief analysis of the optimization algorithm for solving AHEDL.

2.4.1. Complexity analysis

As presented in Algorithm 2, the problem of solving AHEDL can be decomposed into four steps. The main computational cost lies in solving Q and P. For solving Q, the classical algorithm for the Sylvester equation is the Bartels-Stewart algorithm (Bartels and Stewart, 1972), whose complexity is O(m^3). Let T1 be the iteration number in Algorithm 1; then the time complexity of updating Q in each iteration of Algorithm 2 is O(T1 m^3). For solving P, let T2 be the iteration number of the ADMM algorithm; since m = d, the time complexity of updating P in each iteration of Algorithm 2 is O(T2 (d^2 n + d^3)).

2.4.2. Convergence analysis

It is difficult to theoretically prove the convergence of Algorithm 2 since there are five blocks in the AHEDL model and the objective function in Eq. (7) is non-smooth. However, the convergence of each sub-problem can be well guaranteed. In addition, we give empirical evidence on real data in the experimental section, and the results suggest that the proposed algorithm has very strong and stable convergence behaviour.
Table 5
The top 50 selected genes of AHEDL in the SRBCT data set.

Ranking | Image ID | Gene name
1 | 812105 | Transmembrane Protein
2 | 80338 | Selenium binding protein 1
3 | 841641 | Cyclin D1 (PRAD1: parathyroid adenomatosis 1)
4 | 796258 | Sarcoglycan, alpha (50 KD dystrophin-associated glycoprotein)
5 | 383188 | Recoverin
6 | 782193 | Thioredoxin
7 | 745019 | EH domain containing 1
8 | 45632 | Glycogen synthase 1 (muscle)
9 | 289645 | Amyloid beta (A4) precursor-like protein 1
10 | 796475 | ESTs, Moderately similar to skeletal muscle LIM-protein FHL3 [H. sapiens]
11 | 244618 | ESTs
12 | 767495 | GLI-Kruppel family member GLI3 (Greig cephalopolysyndactyly syndrome)
13 | 1473131 | Transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
14 | 809910 | Interferon-inducible
15 | 486787 | Calponin 3, acidic
16 | 770394 | Fc fragment of IgG, receptor, transporter, alpha
17 | 774502 | Protein tyrosine phosphatase, non-receptor type 12
18 | 810057 | Cold shock domain protein A
19 | 1435862 | Antigen identified by monoclonal antibodies 12E7, F21 and O13
20 | 811000 | Lectin, galactoside-binding, soluble, 3 binding protein (galectin 6 binding protein)
21 | 491565 | Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain, 2
22 | 841620 | Dihydropyrimidinase-like 2
23 | 781014 | Suppression of tumorigenicity 5
24 | 756556 | Complement component 1 inhibitor (angioedema, hereditary)
25 | 234376 | Homo sapiens mRNA: cDNA DKFZp564F112 (from clone DKFZp564F112)
26 | 1434905 | Homeo box B7
27 | 815239 | Guanine nucleotide exchange factor; 115-kD; mouse Lsc homolog
28 | 75009 | EphB4
29 | 244637 | Homo sapiens mRNA full length insert cDNA clone EUROIMAGE
30 | 296448 | Insulin-like growth factor 2 (somatomedin A)
31 | 207274 | Human DNA for insulin-like growth factor II (IGF-2); exon 7 and additional ORF
32 | 41591 | Meningioma (disrupted in balanced translocation) 1
33 | 39796 | 3-Hydroxymethyl-3-methylglutaryl-Coenzyme A lyase (hydroxymethylglutaricaciduria)
34 | 208718 | Annexin A1
35 | 824704 | Mannose phosphate isomerase
36 | 377048 | Homo sapiens incomplete cDNA for a mutated allele of a myosin class I, myh-1c
37 | 769716 | Neurofibromin 2 (bilateral acoustic neuroma)
38 | 755750 | Non-metastatic cells 2, protein (NM23B) expressed in
39 | 204545 | ESTs
40 | 362483 | Spectrin, beta, non-erythrocytic 1
41 | 202901 | ESTs
42 | 178825 | Neurogranin (protein kinase C substrate, RC3)
43 | 868630 | Transforming growth factor beta-stimulated protein TSC-22
44 | 1474684 | Ephrin-A1
45 | 143306 | Lymphocyte-specific protein 1
46 | 203003 | Non-metastatic cells 4, protein expressed in
47 | 32299 | Homo sapiens myo-inositol monophosphatase 2 mRNA, complete cds
48 | 280837 | Human Chromosome 16 BAC clone CIT987SK-A-101F10
49 | 784593 | ESTs
50 | 151449 | Protein tyrosine phosphatase, non-receptor type 21
Table 6
Averaged classification accuracy (ACC ± SD) comparison between traditional sample pair-wise graph learning (AGEDL) and the proposed adaptive hypergraph learning (AHEDL) (%).

Methods | Classifiers | CLL_SUB_111 | Breast | Lung | Tumors-11 | SRBCT | GCM
AHEDL | k-NN | 77.59 ± 5.91 | 65.72 ± 7.72 | 93.97 ± 2.45 | 84.81 ± 4.13 | 98.78 ± 1.55 | 65.73 ± 4.95
AHEDL | RF | 77.28 ± 5.69 | 64.67 ± 0.46 | 95.57 ± 2.93 | 83.37 ± 4.43 | 98.52 ± 1.30 | 64.97 ± 5.27
AHEDL | SVM | 76.92 ± 5.81 | 64.27 ± 7.40 | 94.53 ± 2.59 | 83.88 ± 3.41 | 98.96 ± 1.54 | 65.74 ± 4.88
AGEDL | k-NN | 76.49 ± 6.21 | 64.37 ± 7.56 | 92.76 ± 2.37 | 83.21 ± 4.52 | 97.45 ± 1.54 | 64.97 ± 4.78
AGEDL | RF | 76.68 ± 5.77 | 63.96 ± 0.65 | 94.83 ± 2.49 | 82.93 ± 4.43 | 97.85 ± 1.27 | 63.84 ± 5.34
AGEDL | SVM | 76.42 ± 5.64 | 63.18 ± 7.38 | 93.86 ± 2.62 | 83.09 ± 3.41 | 98.25 ± 1.48 | 65.11 ± 4.65
3. Experimental results and discussions
In this section, we conduct experiments to validate the efficacy of the proposed AHEDL.

3.1. Classification methods

In this work, three kinds of classification methods, including Support Vector Machine, Random Forest and k-Nearest Neighbor, are used to test the selected gene subsets for microarray data classification. The classification accuracy is used as the evaluation criterion.

Support Vector Machine (SVM) (Cortes and Vapnik, 1995) is a supervised learning method with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each belonging to a specific class, the SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling (Platt, 1999) exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.

Random Forest (RF), or random decision forests (Ho, 2002; Ho, 1998), is an ensemble learning method for classification, regression and other tasks; it operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

k-Nearest Neighbor (k-NN) (Altman, 1992) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression: in k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small); if k = 1, the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object, which is the average of the values of its k nearest neighbors.

For each classification method, we use 5-fold Cross Validation (CV) (Geisser, 1993; Kohavi, 1995; Devijver and Kittler, 1982) in our experiments on each microarray data set with the selected gene subset in order to avoid "selection bias". Specifically, the samples with the selected gene subset of each data set are randomly partitioned into 5 equally sized groups; 4 groups are chosen for training and the remaining one is used for testing. The process is repeated five times, with each of the 5 subsets used once as the testing data.

3.2. Data sets

Six publicly available microarray data sets, including B-cell chronic lymphocytic leukemia (CLL_SUB_111), Breast, Lung, Tumors-11, small round blue cell tumors (SRBCT) and global cancer map (GCM)2, are used to test the effectiveness of the proposed gene selection method. These data sets were collected for the diagnosis of various cancers such as lung cancer, B-cell chronic lymphocytic leukemia, Ewing's family of tumors, neuroblastoma, non-Hodgkin lymphoma, rhabdomyosarcoma and prostate cancer. All six microarray data sets have the following characteristics: 1) the data sets are typically high-dimensional, and four of them have over 10,000 dimensions; 2) the number of genes is much larger than the number of samples; 3) these data sets contain a large number of redundant and irrelevant genes which are harmful for classification. The statistics of these data sets are summarized in Table 1.

2 CLL_SUB_111 and Lung can be downloaded from: http://featureselection.asu.edu/datasets.php; Breast and GCM can be downloaded from: http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi; Tumors-11 and SRBCT can be downloaded from: http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html.
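A scikit-learn sketch of the evaluation protocol described in Section 3.1, applied to one data set and one selected gene subset, is shown below; the classifier hyper-parameters are illustrative defaults, since the paper does not report them.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def evaluate_gene_subset(X, y, selected_genes):
    """5-fold CV accuracy (%) of SVM, RF and k-NN on a selected gene subset.

    X is (n_samples, n_genes); `selected_genes` is an index array of chosen genes.
    """
    X_sub = X[:, selected_genes]
    cv = StratifiedKFold(n_splits=5, shuffle=True)
    classifiers = {
        "SVM": SVC(kernel="linear"),
        "RF": RandomForestClassifier(n_estimators=100),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
    }
    return {name: cross_val_score(clf, X_sub, y, cv=cv).mean() * 100
            for name, clf in classifiers.items()}
```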
3.3. Experimental setting

In AHEDL, there are four parameters that need to be set, i.e., α, β, γ and the number of selected genes K. The values of α, β and γ were tuned by a "grid-search" strategy over {10^-3, 10^-2, 10^-1, 1, 10, 10^2, 10^3}. Since the optimal number of selected genes is unknown, we set different numbers of selected genes for all data sets, and the best classification results from the optimal parameter combination were reported. Note that gene selection methods aim to select a minimum number of genes that achieves highly accurate prediction performance. In our experiments, the selected gene number was tuned over {10, 20, 30, 40, 50}. After completing the gene selection process, the three classification methods were used to classify the microarray data. Seven state-of-the-art gene selection methods were chosen for comparison; they are as follows:
• MSVM-RFE (Multiclass SVM-RFE) (Zhou and Tuck, 2007), which is an extension of Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for multiclass classification.
• F-test (Cao et al., 2009), which is one of the most popular filter gene selection methods and is based on statistical hypothesis testing.
• KernelPLS (Kernel Partial Least Squares) (Sun et al., 2014), which is a very efficient method based on partial least squares (PLS) and the theory of Reproducing Kernel Hilbert Spaces.
• WLMGS (Weight Local Modularity based Gene Selection) (Zhao and Wu, 2016), in which the discriminative power of a gene subset is evaluated by using the weight local modularity of a weighted sample graph in the gene subset, where the intra-class distance is small and the inter-class distance is large.
• RLR (Guo et al., 2016), in which a kernel-based approach is used to estimate the class centroid to define both the between-class separability and the within-class compactness for a newly defined linear discriminant analysis criterion.
• LNNFW (An et al., 2017), which is based on the trick of locally minimizing the within-class distances and maximizing the between-class distances with the k nearest neighbors rule.
• GRSL-GS (Tang et al., 2018a), which is a gene selection method for microarray data classification via subspace learning and manifold regularization.
For MSVM-RFE, we adopt the strategy used by Guyon et al. (2002), which recursively removes the genes ranked in the lowest 50% at each iteration. The number of components of KernelPLS is determined by 5-fold CV. For WLMGS and GRSL-GS, the number of nearest neighbors in constructing the sample graph is set to 5. The kernel width σ used in the Gaussian kernel function and the other regularization parameters in GRSL-GS and RLR are tuned with 5-fold CV. For the other parameters of the other methods, we use the settings recommended in the corresponding references. All the implementation programs were run on a desktop computer with an Intel Core i5-4200M 2.5 GHz CPU and 8 GB RAM.
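The tuning procedure described above can be sketched as a plain grid search. `ahedl_gene_selection` and `evaluate_gene_subset` refer to the earlier sketches, and the loop structure is an assumption about how such a search could be organised rather than the authors' script.

```python
from itertools import product
import numpy as np

def grid_search_ahedl(X, y, param_grid=(1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3),
                      gene_numbers=(10, 20, 30, 40, 50)):
    """Grid-search alpha, beta, gamma and the gene number K, scored by the CV protocol above."""
    best = {"acc": -np.inf}
    for alpha, beta, gamma, k in product(param_grid, param_grid, param_grid, gene_numbers):
        genes = ahedl_gene_selection(X.T, alpha, beta, gamma, top_k=k)  # AHEDL expects genes x samples
        accs = evaluate_gene_subset(X, y, genes)
        acc = max(accs.values())            # best classifier for this setting
        if acc > best["acc"]:
            best = {"acc": acc, "alpha": alpha, "beta": beta, "gamma": gamma, "k": k}
    return best
```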
3.4. Classification accuracy for different data sets

For each data set, we have 5 selected gene subsets with the numbers of selected genes varying from 10 to 50. For each subset, we used the three classifiers and 5-fold CV to perform classification, and this process was repeated 20 times in order to eliminate the impact of the randomness introduced by the 5-fold CV. For each running time, the averaged classification accuracies over the 5 gene subsets of each data set are shown in Table 2. Due to the page limit, we only present the results of the first 15 running times. In Table 2, the rows "Ave" and "SD" give the average accuracy and the standard deviation of the results from these 15 running times, respectively. As can be seen, the selected gene subsets of our method are suitable for different classifiers, and good classification performance can be achieved by using different classifiers with the selected gene subsets.

Table 2
Classification accuracy of each running time for different data sets (%). For each data set, the three values are the k-NN / RF / SVM accuracies.

Running time | CLL_SUB_111 | Breast | Lung | Tumors-11 | SRBCT | GCM
#1 | 82.03 / 69.36 / 87.54 | 76.71 / 68.70 / 64.25 | 95.57 / 99.81 / 95.20 | 86.08 / 87.18 / 78.96 | 98.40 / 97.60 / 99.14 | 62.92 / 67.92 / 65.92
#2 | 71.72 / 76.12 / 81.92 | 64.20 / 60.10 / 73.06 | 95.49 / 91.09 / 94.53 | 80.50 / 86.29 / 84.48 | 97.98 / 96.92 / 99.31 | 61.31 / 51.25 / 64.64
#3 | 77.67 / 77.97 / 81.24 | 77.63 / 71.49 / 61.06 | 89.92 / 99.78 / 92.65 | 81.69 / 81.34 / 92.48 | 98.88 / 96.91 / 97.63 | 60.16 / 60.05 / 60.64
#4 | 74.28 / 82.05 / 71.34 | 61.75 / 68.54 / 61.71 | 94.89 / 99.59 / 91.34 | 80.66 / 83.81 / 80.52 | 99.77 / 98.88 / 96.64 | 63.25 / 68.46 / 66.33
#5 | 79.37 / 76.35 / 78.69 | 68.97 / 82.45 / 46.68 | 93.39 / 95.43 / 89.53 | 85.81 / 84.10 / 82.73 | 99.44 / 99.89 / 98.64 | 60.74 / 65.88 / 59.88
#6 | 73.84 / 86.82 / 73.27 | 64.05 / 71.06 / 57.87 | 91.83 / 93.79 / 93.20 | 79.84 / 83.34 / 88.39 | 98.54 / 99.34 / 95.67 | 71.40 / 63.22 / 68.34
#7 | 79.78 / 75.00 / 74.63 | 49.41 / 72.63 / 52.97 | 92.48 / 92.98 / 91.46 | 79.10 / 78.74 / 89.03 | 98.71 / 99.26 / 98.97 | 60.04 / 66.00 / 64.36
#8 | 81.57 / 79.31 / 73.44 | 56.85 / 63.77 / 60.74 | 97.37 / 96.89 / 95.52 | 85.16 / 80.74 / 85.62 | 98.82 / 99.98 / 99.97 | 73.58 / 69.48 / 55.82
#9 | 87.05 / 81.00 / 70.50 | 69.34 / 53.69 / 66.95 | 92.39 / 92.93 / 93.48 | 81.24 / 81.72 / 87.43 | 97.77 / 95.31 / 99.04 | 62.90 / 63.92 / 52.50
#10 | 76.03 / 77.54 / 71.54 | 64.30 / 61.28 / 74.46 | 91.07 / 93.07 / 98.64 | 84.44 / 80.93 / 87.61 | 99.67 / 98.34 / 95.43 | 57.78 / 59.82 / 62.86
#11 | 64.58 / 82.40 / 75.49 | 56.48 / 57.84 / 76.52 | 94.15 / 96.93 / 97.89 | 81.68 / 86.60 / 82.71 | 99.94 / 98.79 / 97.54 | 62.14 / 62.77 / 69.94
#12 | 72.29 / 79.05 / 66.36 | 60.34 / 59.04 / 62.12 | 99.14 / 96.47 / 94.67 | 84.61 / 93.57 / 85.93 | 97.78 / 99.49 / 97.71 | 69.90 / 62.62 / 64.64
#13 | 85.17 / 72.90 / 79.87 | 66.49 / 67.21 / 61.38 | 94.21 / 97.87 / 93.17 | 73.21 / 77.26 / 82.66 | 99.44 / 96.42 / 99.75 | 69.92 / 67.69 / 64.67
#14 | 70.87 / 67.01 / 76.28 | 57.02 / 59.73 / 69.64 | 95.11 / 96.29 / 97.32 | 85.66 / 83.28 / 79.75 | 98.67 / 97.18 / 99.94 | 66.24 / 65.32 / 67.63
#15 | 82.48 / 88.07 / 80.40 | 72.53 / 56.77 / 67.28 | 91.91 / 93.67 / 90.97 | 91.36 / 76.22 / 86.72 | 97.91 / 99.28 / 98.14 | 66.52 / 65.99 / 56.93
Ave | 77.25 / 78.06 / 76.17 | 64.40 / 64.95 / 63.78 | 93.93 / 95.77 / 93.97 | 82.74 / 83.01 / 85.00 | 98.85 / 98.24 / 98.23 | 64.59 / 64.03 / 63.01
SD | 5.87 / 5.56 / 5.28 | 7.59 / 7.40 / 7.72 | 2.37 / 2.69 / 2.56 | 4.02 / 4.21 / 3.66 | 0.85 / 1.37 / 1.41 | 4.59 / 4.41 / 4.77

3.5. Comparison with other gene selection methods

In order to verify the superiority of the proposed AHEDL, we compare it with several other state-of-the-art gene selection methods. The averaged accuracy (ACC) and standard deviation (SD) over the 20 running times, obtained by the three different classifiers using the top 10, 20, 30, 40 and 50 genes selected by the different gene selection methods, are shown in Table 3. As we can see, the proposed AHEDL outperforms the other methods in terms of averaged classification accuracy. Among the eight methods, the proposed AHEDL yields the best results, which demonstrates that AHEDL can effectively select most of the discriminative genes for the microarray data classification task.

Table 3
Averaged classification accuracy (ACC ± SD) of different methods by using different classifiers (%). The best result in each column (marked in bold in the original layout) is obtained by AHEDL.

Methods | Classifiers | CLL_SUB_111 | Breast | Lung | Tumors-11 | SRBCT | GCM
F-test | k-NN | 56.20 ± 6.88 | 56.77 ± 9.34 | 87.83 ± 2.19 | 56.63 ± 3.97 | 95.50 ± 1.67 | 52.72 ± 6.58
F-test | RF | 58.69 ± 6.69 | 58.75 ± 9.00 | 87.30 ± 1.88 | 55.93 ± 4.54 | 95.52 ± 2.32 | 50.93 ± 6.84
F-test | SVM | 57.95 ± 6.54 | 57.51 ± 8.96 | 85.86 ± 2.27 | 57.76 ± 4.41 | 95.81 ± 1.91 | 51.74 ± 6.67
MSVM-RFE | k-NN | 73.50 ± 6.00 | 61.78 ± 8.34 | 91.13 ± 2.43 | 82.42 ± 3.31 | 97.10 ± 1.67 | 53.89 ± 5.72
MSVM-RFE | RF | 73.88 ± 5.78 | 58.70 ± 8.42 | 90.26 ± 2.96 | 82.47 ± 3.34 | 96.09 ± 1.16 | 54.05 ± 5.59
MSVM-RFE | SVM | 73.21 ± 6.05 | 59.83 ± 8.10 | 91.71 ± 2.68 | 81.86 ± 3.39 | 97.67 ± 1.61 | 52.68 ± 5.27
KernelPLS | k-NN | 74.57 ± 6.69 | 58.75 ± 6.97 | 89.69 ± 3.14 | 68.27 ± 4.34 | 97.83 ± 1.79 | 48.81 ± 4.64
KernelPLS | RF | 74.46 ± 7.15 | 58.64 ± 7.31 | 89.60 ± 3.02 | 71.11 ± 4.68 | 98.05 ± 1.76 | 51.26 ± 4.60
KernelPLS | SVM | 73.91 ± 6.95 | 58.91 ± 7.27 | 91.54 ± 2.67 | 69.38 ± 4.62 | 96.91 ± 2.20 | 50.36 ± 5.03
WLMGS | k-NN | 74.07 ± 5.79 | 59.28 ± 8.43 | 91.74 ± 2.84 | 79.86 ± 4.35 | 97.56 ± 2.35 | 59.79 ± 4.56
WLMGS | RF | 75.05 ± 6.15 | 61.69 ± 7.75 | 92.45 ± 2.34 | 82.62 ± 4.00 | 97.05 ± 1.73 | 60.45 ± 4.98
WLMGS | SVM | 75.08 ± 6.03 | 59.98 ± 7.97 | 92.68 ± 2.76 | 81.11 ± 4.27 | 97.45 ± 1.97 | 61.76 ± 4.97
RLR | k-NN | 76.89 ± 6.10 | 61.29 ± 7.58 | 91.86 ± 2.70 | 82.70 ± 3.77 | 96.88 ± 1.74 | 63.96 ± 5.25
RLR | RF | 73.62 ± 5.69 | 62.04 ± 7.27 | 92.45 ± 2.84 | 83.11 ± 4.54 | 97.38 ± 1.20 | 61.05 ± 5.12
RLR | SVM | 75.09 ± 5.71 | 60.53 ± 7.36 | 93.82 ± 2.88 | 81.93 ± 4.12 | 97.97 ± 1.51 | 62.09 ± 5.19
LNNFW | k-NN | 75.65 ± 6.52 | 60.40 ± 7.80 | 90.27 ± 2.48 | 81.15 ± 4.20 | 96.19 ± 1.98 | 61.44 ± 5.19
LNNFW | RF | 74.19 ± 5.76 | 60.70 ± 7.42 | 92.34 ± 2.37 | 81.63 ± 4.59 | 96.83 ± 2.47 | 61.86 ± 5.08
LNNFW | SVM | 75.16 ± 6.14 | 60.64 ± 7.86 | 92.16 ± 2.48 | 82.04 ± 4.31 | 97.93 ± 2.18 | 62.84 ± 5.49
GRSL-GS | k-NN | 76.78 ± 5.90 | 64.32 ± 7.67 | 93.50 ± 2.47 | 82.80 ± 4.10 | 98.08 ± 1.58 | 64.45 ± 4.96
GRSL-GS | RF | 76.98 ± 5.74 | 63.06 ± 7.44 | 94.29 ± 2.96 | 82.28 ± 4.39 | 97.87 ± 1.31 | 62.95 ± 5.22
GRSL-GS | SVM | 76.15 ± 5.82 | 62.82 ± 7.38 | 93.73 ± 2.56 | 83.27 ± 3.98 | 98.14 ± 1.57 | 63.64 ± 4.85
AHEDL | k-NN | 77.59 ± 5.91 | 65.72 ± 7.72 | 93.97 ± 2.45 | 84.81 ± 4.13 | 98.78 ± 1.55 | 65.73 ± 4.95
AHEDL | RF | 77.28 ± 5.69 | 64.67 ± 0.46 | 95.57 ± 2.93 | 83.37 ± 4.43 | 98.52 ± 1.30 | 64.97 ± 5.27
AHEDL | SVM | 76.92 ± 5.81 | 64.27 ± 7.40 | 94.53 ± 2.59 | 83.88 ± 3.41 | 98.96 ± 1.54 | 65.74 ± 4.88

3.6. Classification performance with different numbers of selected genes

Since it is hard to determine the optimal number of selected genes, here we investigate the classification performance of the different methods with different numbers of selected genes. The classification results of the different methods on the different data sets are plotted in Fig. 2. Since we use three different classifiers, for each data set and each gene subset we plot the best result among the classifiers in Fig. 2. As can be seen, our AHEDL obtains steadily better performance than the other methods across different numbers of selected genes. Even with a small number of selected genes, our method selects more discriminative genes than the other methods. The gene subsets selected by AHEDL can thus better serve the classification of microarray data.

3.7. Statistical test

Similar to previous works (Chen et al., 2014; Li et al., 2016), we use two-way Analysis of Variance (ANOVA) to test whether the eight algorithms are significantly different in terms of classification accuracy. In our ANOVA test, the accuracy of each running time is defined as the "Dependent Variable" and the algorithms and data sets are defined as "Fixed Factors". Here we use the results of SVM for each method and data set. The ANOVA results for classification accuracy are listed in Table 4. As can be seen, all the p-values are smaller than 0.05, which demonstrates the significant differences in classification accuracy among the eight algorithms.

Table 4
ANOVA for classification accuracy of SVM.

Source | SS | DF | MS | F | Statistical significance (Sig.)
Data set | 110,557.374 | 5 | 18,754.187 | 801.402 | 0.000
Method | 6631.172 | 7 | 1102.541 | 49.795 | 0.000
Data set × Method | 5694.513 | 35 | 162.473 | 7.294 | 0.000
Error | 10,187.611 | 479 | 27.619 | — | —

3.8. Case study

In order to demonstrate whether the genes selected by AHEDL are important and disease related, we list the top 50 selected genes in the SRBCT data set obtained by AHEDL in Table 5. As can be seen, most of the disease-related genes have been selected. Here, we give some examples. It has been verified that AF1Q, corresponding to image ID 812105, is highly expressed in neuroblastoma. AF1Q has also been found by many previous works (Khan et al., 2001; Fu and Fu-Liu, 2005; Pal et al., 2007). By searching in GEO (Gene Expression Omnibus), the results show that AF1Q discriminates between NB and Ewing family tumors. Further, it has been observed that SGCA, corresponding to image ID 796258, is highly expressed in rhabdomyosarcoma (RMS). SGCA has also been found in Khan et al. (2001) and Pal et al. (2007), and a search in GEO profiles shows that SGCA can discriminate RMS from Ewing's sarcoma. EHD1, corresponding to image ID 745019, is moderately expressed in a few cases of RMS, and it has also been found by Pal et al. (2007); GEO profiles also show that EHD1 is moderately expressed in RMS. Again, it has been observed that the EST corresponding to image ID 244618 is highly expressed in RMS. This EST has also been found by Khan et al. (2001), and GEO profiles show that it is highly expressed in RMS. Furthermore, it has been observed that image ID 234376 is highly expressed in Ewing's sarcoma (EWS).

4. Discussions and conclusions

4.1. Discussions

4.1.1. The advantages of hypergraph learning

In previous literature, there are many graph based feature selection or gene selection models. However, traditional methods often use a pair-wise graph to preserve the local geometrical structure of the data, e.g., GRSL-GS (Tang et al., 2018a), so the high-order information cannot be effectively captured. In our proposed AHEDL, we adaptively learn a hypergraph to capture the local geometrical structure of the data in a high-order manner, and thus we can obtain better results. As shown in Table 3 and Fig. 2, our proposed AHEDL performs steadily better than GRSL-GS. In order to validate the efficacy of the proposed adaptive hypergraph learning regularization in a more intuitive way, we use a traditional sample pair-wise graph to replace the hypergraph in Eq. (7) and refer to this method as adaptive graph embedded dictionary learning (AGEDL). The comparison between AGEDL and AHEDL is shown in Table 6. As can be seen, the proposed adaptive hypergraph learning regularization can effectively improve the final classification accuracy, which demonstrates that the adaptive hypergraph learning term helps the model to select a more discriminative gene subset.

Fig. 3. The objective values of the function in Eq. (7) on the six data sets used in our experiments. Panels: (a) CLL_SUB_111, (b) Breast, (c) Lung, (d) Tumors-11, (e) SRBCT, (f) GCM; each panel plots the objective value against the number of iterations (1–50).

4.1.2. Convergence analysis of Algorithm 2

Although it is difficult to theoretically prove the convergence of Algorithm 2, each sub-problem can be well guaranteed to converge. Here, we show the variation of the objective values of Eq. (7) with different iteration numbers on the six data sets in Fig. 3. The empirical evidence from Fig. 3 also demonstrates the strong and stable convergence
behaviour of the proposed algorithm. Additionally, the proposed algorithm converges very fast within nearly 10 iterations on the six data sets.
Commun. ACM 15 (9), 820–826. Boyd, S., Parikh, N., Chu, E., Peleato, B., 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), 1–122. Buza, K., 2016. Classification of gene expression data: a hubness-aware semi-supervised approach. Comput. Methods Prog. Biomed. 127 (C), 105–113. Cai, R., Hao, Z., Yang, X., Wen, W., 2009. An efficient gene selection algorithm based on mutual information. Neurocomputing 72 (4–6), 991–999. Cao, K.A.L., Bonnet, A., Gadat, S., 2009. Multiclass classification and gene selection with a stochastic algorithm. Comput. Stat. Data Anal. 53 (10), 3601–3615. Chen, K.H., Wang, K.J., Tsai, M.L., Wang, K.M., Adrian, A.M., Cheng, W.C., Yang, T.S., Teng, N.C., Tan, K.P., Chang, K.S., 2014. Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinf. 15 (1), 49. Chuang, L.Y., Yang, C.H., Li, J.C., Yang, C.H., 2012. A hybrid BPSO-CGA approach for gene selection and classification of microarray data. J. Comput. Biol. 19 (1), 68. Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297. Das, S., Rai, A., Mishra, D.C., Rai, S.N., 2018. Statistical approach for selection of biologically informative genes. Gene 655, 71–83. De, C.L., Giannoccaro, M., Marchesi, E., Bossi, P., Favales, F., Locati, L.D., Licitra, L., Pilotti, S., Canevari, S., 2017. Integrative miRNA-gene expression analysis enables refinement of associated biology and prediction of response to cetuximab in head and neck squamous cell cancer. Genes 8 (1), 35. Devijver, P.A., Kittler, J., 1982. Pattern Recognition: A Statistical Approach. Prentice/hall International. Du, S., Ma, Y., Li, S., Ma, Y., 2017. Robust unsupervised feature selection via matrix factorization. Neurocomputing 241, 115–127. Duan, K.B., Rajapakse, J.C., Wang, H., Azuaje, F., 2005. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans. Nanobioscience 4 (3), 228–234. Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P., 2000. Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments. Stat. Sin. 12 (1), 111–139. Dy, J.G., Brodley, C.E., 2004. Feature selection for unsupervised learning. J. Mach. Learn. Res. 5, 845–889. Fu, L.M., Fu-Liu, C.S., 2005. Evaluation of gene importance in microarray data based upon probability of selection. BMC Bioinf. 6 (1), 67. Gao, S., Tsang, I.W., Chia, L.T., 2013. Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), 92–104. Gao, Y., Ji, R., Cui, P., Dai, Q., Hua, G., 2014. Hyperspectral image classification through bilayer graph-based learning. IEEE Trans. Image Process. 23 (7), 2769–2778. Geisser, S., 1993. Predictive Inference: An Introduction. Chapman and Hall. Ghosh, D., Chinnaiyan, A.M., 2005. Classification and selection of biomarkers in genomic data using Lasso. J Biomed Biotechnol 2005 (2), 147. Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E., 1999a. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531537. Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E., 1999b. 
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531537. https://doi.org/10.1126/science.286.5439.531. Guo, S., Guo, D., Chen, L., Jiang, Q., 2016. A centroid-based gene selection method for microarray data classification. J. Theor. Biol. 400, 32–41. Guo, S., Guo, D., Chen, L., Jiang, Q., 2017. A l1-regularized feature selection method for local dimension reduction on microarray data. Comput. Biol. Chem. 67, 92–101. Guo, X., Jiang, X., Xu, J., Quan, X., Wu, M., Zhang, H., 2018. Ensemble consensus-guided unsupervised feature selection to identify Huntingtons disease-associated genes. Genes 9 (7), 350. Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 (1–3), 389–422. He, X., Cai, D., Niyogi, P., 2005. Laplacian score for feature selection. In: Advances in Neural Information Processing Systems. vol. 18. pp. 507–514. Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20 (8), 832–844. Ho, T.K., 2002. Random decision forests. In: International Conference on Document Analysis and Recognition, pp. 278. Huang, H.H., Liang, Y., 2018. Hybrid l 1/2+2 method for gene selection in the cox proportional hazards model. Comput. Methods Prog. Biomed. 164, 65–73. Huang, X., Gao, Y., Jiang, B., Zhou, Z., Zhan, A., 2016. Reference gene selection for quantitative gene expression studies during biological invasions: a test on multiple genes and tissues in a model ascidian Ciona savignyi. Gene 576, 79–87. Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., et al., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7 (6), 673. Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International Joint Conference on Artificial Intelligence, pp. 1137–1143. Li, X., Li, M., Yin, M., 2016. Multiobjective ranking binary artificial bee colony for gene selection problems using microarray datasets. IEEE/CAA J. Autom. Sinica (99), 1–16. Li, J., Wang, Y., Jiang, T., Xiao, H., Song, X., 2018. Grouped gene selection and multiclassification of acute leukemia via new regularized multinomial regression. Gene 667, 18–24. Li, S., Tang, C., Liu, X., Liu, Y., Chen, J., 2019. Dual graph regularized compact feature
4.1.3. Future works

Although AHEDL achieves good performance in gene selection, several directions remain open for future investigation. We will focus on three aspects. First, non-linear models with kernel functions will be developed to replace the linear feature representation model in Eq. (7), so as to cope with microarray data that exhibit complex structure. Second, since sequencing technologies are now widely used, we will adapt the proposed AHEDL to sequencing data analysis tasks such as clustering and classification. Third, more efficient models will be investigated to handle large-scale microarray data with a large number of samples and/or gene dimensions.

4.2. Conclusions

In this paper, we propose a gene selection method for microarray data classification via adaptive hypergraph embedded dictionary learning (AHEDL). Instead of measuring the importance of different gene dimensions in the original feature space, we learn a dictionary and use it to represent the original genes with a reconstruction coefficient matrix. An l2,1-norm regularization is imposed on the coefficient matrix to enforce row sparsity and thereby select discriminative genes (see the sketch following the Acknowledgements). In order to capture the local manifold geometrical structure of the original microarray data in a high-order manner, we adaptively learn a hypergraph and embed it into the model. An iterative updating algorithm is developed to solve the resulting optimization problem. Finally, experiments on six publicly available microarray data sets validate the efficacy of the proposed model for microarray data classification.

Authors' contributions

Conceived and designed the study: Xiao Zheng and Minhui Wang. Collected and preprocessed data: Wenyang Zhu and Chang Tang. Performed the experiments: Xiao Zheng and Wenyang Zhu. Analyzed results: Xiao Zheng. Drafted the manuscript: Xiao Zheng and Wenyang Zhu. Corrected the manuscript: Minhui Wang.

Declaration of Competing Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Acknowledgements

This work was partly supported by the National Science Foundation of China (61701451) and the Fundamental Research Funds for the Central Universities, China University of Geosciences, Wuhan (CUG170654).
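As a minimal sketch of the gene-selection step summarized in the conclusions, assuming the common convention that genes correspond to rows of the learned reconstruction coefficient matrix and that an off-the-shelf linear SVM from scikit-learn serves as the downstream classifier, genes can be ranked by the l2-norm of their rows (the quantity the l2,1-norm regularizer drives toward row sparsity). The variable names, matrix shapes, and synthetic data below are illustrative assumptions only, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


def select_genes(coef_matrix, top_k):
    """Rank genes by the l2-norm of their rows in the coefficient matrix.

    coef_matrix : (n_genes, n_atoms) array of row-sparse reconstruction
                  coefficients produced by the dictionary-learning step.
    top_k       : number of top-ranked genes to keep.
    """
    row_norms = np.linalg.norm(coef_matrix, axis=1)   # one importance score per gene
    ranked = np.argsort(row_norms)[::-1]              # indices in descending score order
    return ranked[:top_k]


# Illustrative usage with synthetic data (shapes only; real inputs would be a
# microarray expression matrix and the coefficient matrix learned by AHEDL).
rng = np.random.default_rng(0)
n_samples, n_genes, n_atoms, top_k = 60, 2000, 100, 50
X = rng.standard_normal((n_samples, n_genes))           # expression matrix (samples x genes)
y = rng.integers(0, 2, size=n_samples)                  # class labels
coef_matrix = rng.standard_normal((n_genes, n_atoms))   # stand-in for learned coefficients

selected = select_genes(coef_matrix, top_k)
acc = cross_val_score(SVC(kernel="linear"), X[:, selected], y, cv=5).mean()
print(f"5-fold CV accuracy on {top_k} selected genes: {acc:.3f}")
```

In practice, the coefficient matrix would be the output of the AHEDL optimization rather than random numbers, and the number of retained genes (top_k here) would be tuned, for example by cross validation on the training data.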