Group sparse feature selection on local learning based clustering

Yue Wu, Can Wang*, Jiajun Bu, Chun Chen

Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science, Zhejiang University, Hangzhou 310027, China

* Corresponding author. Tel.: +86 571 87951431; fax: +86 571 87953955. E-mail address: [email protected] (C. Wang).
Article info

Article history: Received 14 August 2014; Received in revised form 17 July 2015; Accepted 18 July 2015. Communicated by X. Gao.

Keywords: Unsupervised; Group sparsity; Local learning; Feature selection

Abstract

Feature selection plays an important role in many machine learning applications. By extracting meaningful features and eliminating redundancies and noise, it effectively improves the accuracy and efficiency of the learning algorithm. In this paper, an unsupervised feature selection method called GSFS-llc (group sparse feature selection on local learning based clustering) is proposed. GSFS-llc first retrieves the cluster structure in a dataset using LLC and then selects the important features that best preserve this cluster structure by L2,1-norm regularized regression. By combining group sparsity regularization with a locality based learning algorithm, GSFS-llc leads to more robust results than other methods and selects features that respect the underlying geometric structure of the dataset. Extensive experimental results on real-world handwritten digit, human face, voice and object image datasets demonstrate the superiority of the GSFS-llc algorithm. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

Many applications nowadays, such as computer vision, pattern recognition and data mining, are facing data of increasing dimensionality. Higher dimensionality contains more information but may also introduce more redundancy and noise. As a consequence, the "curse of dimensionality" [1,2] is a prevalent problem for learning in high dimensional space, i.e. algorithms and procedures that are analytically and computationally effective in low dimensional space become totally impractical in this case. Various dimensionality reduction techniques have been introduced to ease this problem. The most canonical among them are feature selection and feature extraction algorithms. Unlike feature extraction methods, such as PCA [3] and LPP [4], which transform the original data into a reduced representation set of features, feature selection methods only choose a relevant feature subset from the original data. Thus feature selection is not only simpler but also better preserves the actual meaning of each feature, which helps to understand its function. The selected subset with the most informative features effectively reduces the redundancies and noise in the data and makes the following learning tasks more accurate. Feature selection methods can be roughly classified as supervised or unsupervised, according to whether label information is involved in the selection. Typical supervised feature selection methods [5–8] select features having high correlations with class labels. Unsupervised methods, in contrast, mainly exploit the distribution of data samples to find the optimal feature subset.
As labels are usually expensive to acquire, unsupervised methods are used more broadly than supervised ones. However, without the guidance of label information, feature selection becomes a more challenging issue in an unsupervised setting. Many existing unsupervised methods select features that best preserve certain properties of the dataset, such as maximum variance and minimum redundancy. Influential works such as [9–11] are typical unsupervised feature selection methods. In recent years, feature selection methods have started to explore geometric structures in high dimensional data space. As recent works have shown that data spaces are often low-dimensional manifolds embedded in high-dimensional ambient spaces [12–14], various feature selection methods explicitly considering manifold structures have been proposed, including the Laplacian score [15], eigenvalue sensitive feature selection [16], multi-cluster feature selection [17], the local kernel regression score [18] and feature selection for local learning based clustering [11,19]. These methods either directly optimize for the manifold structure or use manifold regularizations in their models to select features that respect the intrinsic manifold structure in data space. Among them, feature selection methods based on local learning based clustering (LLC) show great improvements in robustness and accuracy on datasets with manifold structure [11,19]. However, the limitation of these manifold-based methods for feature selection is that they neglect the group structures in data space. Recent studies [20–22] point out that group structures are inherent in many datasets such as documents, images, and DNA data, where documents of the same topic naturally form a group, and in DNA data features can be grouped by their metabolic profiling. As data from the same group tend to share the same sparsity pattern in their low-dimensional representation, considering the shared group structures is expected to improve the performance of feature selection algorithms [6,23].
In this paper, a new unsupervised group sparse feature selection method based on LLC, named GSFS-llc, is proposed. The motivation of our work is to exploit both the manifold structure and the group structure in feature selection. Using the cluster labels from LLC, GSFS-llc establishes an L2,1-norm regularized regression framework to select important features that respect both the group structure and the manifold structure in data space. As group structure is inherent in many datasets, GSFS-llc gains better performance compared with methods that only consider the manifold structure in data space. It is worthwhile to highlight the following aspects of our work:
1. We establish a regression framework using the cluster labels from LLC and group sparse regularization. To the best of our knowledge, this is the first time that local learning based clustering is combined with group sparse regularization for feature selection.
2. By choosing the features that best fit the LLC cluster labels, GSFS-llc naturally respects the manifold structure in data space.
3. By inducing sparse features from L2,1-norm regularized regression, GSFS-llc explicitly considers the group structure in data space. It is thus expected to gain better performance in feature selection and to inherit some nice properties associated with group structures [22], such as better stability in the presence of noise.
The rest of the paper is organized as follows: Section 2 gives a brief review of related work; Section 3 describes the GSFS-llc algorithm; Section 4 presents the experiments comparing GSFS-llc with other feature selection methods; Section 5 concludes the paper.
2. Related works

Guyon et al. categorized feature selection algorithms into filter methods, wrapper methods and embedded methods [24]. These three kinds of methods differ in how the learning algorithm is incorporated in evaluating and selecting features. Filter methods [25,15,23,16,26] capture the inner properties of the data using statistical techniques such as variance, Pearson correlation, and mutual information, and select features before running the learning algorithm. Filter methods are usually the most efficient among all feature selection methods. Wrapper methods [27,8,28,29] are associated with some predictive model for scoring the chosen feature subsets. Given a candidate feature subset, a wrapper method trains a specific predictive model on that subset and then uses the model's performance to score it.
The feature subset with the highest score is selected as the final result. Embedded methods such as [7,19] perform variable selection in the training process and are usually specific to given learning machines. Decision trees, such as CART, which have a built-in mechanism to perform variable selection, are typical embedded methods.

Recent studies such as ISOMAP [12] and LLE [13] have shown that data spaces in many cases are low-dimensional manifolds embedded in high-dimensional ambient spaces. This manifold assumption is broadly verified on existing datasets such as Yale, PIE, and USPS. Many locality based methods have been developed in recent years to uncover the underlying manifolds. Consequently, some feature selection methods incorporate this manifold assumption by choosing the features that respect the intrinsic geometric structure of the data space. Cheung et al. proposed a filter method called the local kernel regression (LKR) score [18]. This algorithm seeks the features which minimize the within-neighborhood estimation error and meanwhile have the maximum variance over all data samples. Using different neighborhood graphs, the LKR score can deal with both supervised and unsupervised situations efficiently. Sun et al. also proposed a local learning based feature selection method that decomposes an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then uses the large margin framework to learn feature relevance [30]. Zeng et al. proposed an unsupervised feature selection method based on local learning based clustering (LLC-fs) [11,19]. They incorporate a binary selection vector into the LLC objective function to select the optimal features for LLC. The GSFS-llc proposed in this paper is also based on LLC, but it differs from LLC-fs in that features are selected as the result of a group sparsity-inducing norm. Thus the selected features inherit the stability associated with the group structure and achieve better performance in the presence of noise.

Another related research area actively explored in recent years is sparse regression. By using different regularizations, such as LASSO [31] (using the l1-norm) and the elastic net [32] (using a combination of the l1-norm and the l2-norm), sparse regression is able to retrieve various sparse bases from the data space under different sparsity assumptions. While LASSO and the elastic net are designed for vectors, group sparse regularization (the L2,1-norm) is designed for matrices. It first calculates the l2-norm of each row of the matrix and then sums the results to form an l1-norm. That is, for $M \in \mathbb{R}^{p \times q}$, $\|M\|_{2,1} = \sum_{i=1}^{p} \sqrt{\sum_{j=1}^{q} m_{ij}^{2}}$. More stable fitting results can be obtained by using group sparse regularization than by using LASSO, in that the elements of each row are more likely to be zero or non-zero together when fitted to different clusters, as shown in Fig. 1. Sparse regularizations are used to select important features in data space since they can efficiently retrieve the features correlated with class labels.
Fig. 1. Fitting results of regressions using L2,1 regularization vs. l1 regularization. The 5 × 10 regression coefficient matrix in the figure shows the fitting results of the 5 features on 10 clusters, with each column being the fitting result on a specific cluster. Blue elements denote non-zero coefficients and white elements denote zero coefficients. Obviously the group sparse regularization (L2,1-norm) leads to more stable fitting results than LASSO (l1-norm). Thus group sparsity based feature selection is expected to be more robust than LASSO based feature selection. (a) Group sparse coefficient matrix obtained by L2,1-norm regularized regression. (b) Sparse coefficient matrix obtained by l1-norm regularized regression. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
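For concreteness, the L2,1-norm defined above and the group pattern illustrated in Fig. 1 can be computed with a few lines of numpy. This is only an illustrative sketch; the toy matrix W below is hypothetical and not taken from the paper.

```python
import numpy as np

def l21_norm(M):
    # L2,1-norm: l2-norm of each row, then summed (an l1-norm over the rows).
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

def l1_norm(M):
    # Entry-wise l1-norm, as used by LASSO-style penalties.
    return np.sum(np.abs(M))

# Toy 5 x 10 coefficient matrix in the spirit of Fig. 1:
# rows are features, columns are clusters.
W = np.zeros((5, 10))
W[1, :] = 0.8        # a feature relevant to every cluster (whole row non-zero)
W[3, ::3] = 0.5      # a feature relevant to only a few clusters
print(l21_norm(W), l1_norm(W))
```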
Nie et al. proposed a supervised feature selection method called RFS [6] by regressing over class labels with L2,1-norm regularization. Fang et al. proposed a locality and similarity preserving embedding (LSPE) for feature selection [33]. It first constructs a nearest neighbor graph to preserve the locality structure in data space and then maps the structure to the reconstruction coefficients such that the similarity among the data points is preserved. By adding L2,1-norm minimization constraints on the projection matrix, a sparse result is obtained. Cai et al. proposed a filter method called multi-cluster feature selection (MCFS) [17] that uses spectral embedding to obtain the cluster information and then solves a series of l1-norm regularized least square regression problems to retrieve the sparse coefficients. One major novelty of MCFS is that it extends sparsity regularized regression to an unsupervised setting. Inspired by MCFS, we propose the SFS-llc algorithm in this paper by combining local learning based clustering (LLC) with LASSO. Since both SFS-llc and MCFS rely on the l1-norm to induce sparsity, they both suffer from the instability problem described in Fig. 2 when dealing with multi-cluster datasets. To overcome this limitation, GSFS-llc is further proposed in this paper, using group sparse regularization to retrieve the important feature subset in data space.

Fig. 2. A failed example for some existing unsupervised feature selection methods. Feature b obviously has better overall discriminating power than feature a in that it is able to separate all three clusters (colored blue, green and red respectively), while feature a fails to separate the blue cluster from the green cluster. However, feature selection based on maximum variance or Laplacian score will rank feature a > b as feature a has a larger variance. MCFS will also rank feature a > b in that it also selects features that maximally discriminate one cluster from the rest. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

3. Group sparse feature selection on local learning based clustering

The main idea of GSFS-llc is to select important features from the dataset via a two-step process: (1) obtain the cluster labels using LLC [34]; (2) fit the obtained cluster labels using the L2,1-norm regularized regression model and select the features with large regression coefficients. Since our regression model and related analysis are fundamentally based on the cluster labels from LLC, we first introduce the LLC algorithm here, as presented in [34,11,19].

3.1. Local learning based clustering for cluster analysis

Under the manifold assumption, the data space can be treated as linear within a small neighborhood. It is therefore reasonable to approximate the label of a data sample using its neighbors'. As there exist $C \ge 2$ clusters in the dataset, we use an indicator vector y to represent a cluster label. The indicator vector for a sample point in the l-th cluster is a vector of C elements whose l-th element is 1 and all others are 0.

Given a set of data points $X = [x_1, x_2, \ldots, x_N]^T = [f_1, f_2, \ldots, f_M]$, $x_i \in \mathbb{R}^M$ denotes a sample point and $f_j \in \mathbb{R}^N$ denotes a feature. $\mathcal{N}_i$ represents the set of $x_i$'s neighbors and $n_i = |\mathcal{N}_i|$ is its cardinality. Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel function. The kernel matrix of $x_i$'s neighbors is defined as $K_i = [K(x_u, x_v)] \in \mathbb{R}^{n_i \times n_i}$ for $x_u, x_v \in \mathcal{N}_i$, and $k_i = [K(x_i, x_j)]$ for all $x_j \in \mathcal{N}_i$. Different kernel functions can be adopted, such as the linear kernel $K(x_u, x_v) = x_u \cdot x_v$ and the heat kernel $K(x_u, x_v) = e^{-\|x_u - x_v\|_2^2 / 2\sigma^2}$.

Using kernel regression, the estimate of the l-th element of $x_i$'s cluster indicator vector is

$\hat{y}_i^l = \sum_{x_j \in \mathcal{N}_i} \beta_{ij}^l K(x_i, x_j)$   (1)

Using kernel ridge regression makes the above problem easier to solve; the coefficients $\beta_{ij}^l$ in Eq. (1) can then be obtained from the following optimization problem:

$\min_{\beta_i^l \in \mathbb{R}^{n_i}} \|K_i \beta_i^l - y_i^l\|^2 + \lambda (\beta_i^l)^T K_i \beta_i^l$   (2)

where $\beta_i^l = [\beta_{ij}^l] \in \mathbb{R}^{n_i}$ is the coefficient vector, $y_i^l = [y_j^l]^T \in \mathbb{R}^{n_i}$, and $\lambda > 0$ is the regularization parameter. The solution to problem (2) is $\beta_i^l = (K_i + \lambda I)^{-1} y_i^l$. Substituting it back into (1), we get

$\hat{y}_i^l = k_i^T (K_i + \lambda I)^{-1} y_i^l$   (3)

Let

$\alpha_i^T = k_i^T (K_i + \lambda I)^{-1}$   (4)

which is independent of the cluster indicator $y_i^l$ and can be easily calculated since $k_i$ and $K_i$ are already known. The estimated cluster indicator $\hat{y}_i^l$ can then be expressed as a linear combination of its neighbors' cluster indicators:

$\hat{y}_i^l = \alpha_i^T y_i^l$   (5)

So the overall prediction error can be calculated as follows:

$\sum_{l=1}^{C} \sum_{i=1}^{N} (y_i^l - \hat{y}_i^l)^2 = \sum_{l=1}^{C} \|y^l - \hat{y}^l\|^2 = \sum_{l=1}^{C} \|y^l - A y^l\|^2 = \mathrm{tr}[Y^T (I - A)^T (I - A) Y] = \mathrm{tr}(Y^T T Y)$   (6)

where $Y = [y^1, \ldots, y^C]$ is the cluster indicator matrix which we want to solve for; $T = (I - A)^T (I - A)$; C is the number of clusters and N is the number of sample points. $A = [a_{ij}]$ is an $N \times N$ sparse matrix whose entry $a_{ij}$ equals the corresponding element of $\alpha_i$ in Eq. (4) if $x_j \in \mathcal{N}_i$, and 0 otherwise. By relaxing Y into the continuous domain, the prediction error in Eq. (6) can be minimized by solving the following optimization problem:

$\min_{Y \in \mathbb{R}^{N \times C}} \mathrm{trace}(Y^T T Y) \quad \mathrm{s.t.} \quad Y^T Y = I$   (7)

The solution for Y is given by the top C eigenvectors of T corresponding to the smallest eigenvalues. Since the number of clusters C is usually unknown in practice, the top u eigenvectors are used instead, where u is a preset parameter.
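To make the construction above concrete, the following is a rough numpy sketch of the LLC step (Eqs. (1)–(7)). It assumes a heat kernel and dense matrices for readability; a practical implementation would exploit the sparsity of A, and the function and parameter names are ours, not from the original LLC code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def llc_indicators(X, k=5, lam=1.0, u=10, sigma=1.0):
    """Return the N x u scaled cluster indicator matrix Y of Eq. (7)."""
    N = X.shape[0]
    D = cdist(X, X, 'sqeuclidean')
    K = np.exp(-D / (2.0 * sigma ** 2))              # heat kernel
    A = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]             # k nearest neighbors of x_i
        Ki = K[np.ix_(nbrs, nbrs)]                   # neighborhood kernel matrix K_i
        ki = K[i, nbrs]                              # vector k_i
        A[i, nbrs] = np.linalg.solve(Ki + lam * np.eye(k), ki)   # alpha_i, Eq. (4)
    I = np.eye(N)
    T = (I - A).T @ (I - A)                          # Eq. (6)
    _, vecs = np.linalg.eigh(T)                      # eigenvalues in ascending order
    return vecs[:, :u]                               # eigenvectors with smallest eigenvalues
```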
3.2. Feature selection using L2,1-norm regularized regression

The scaled cluster indicator matrix Y obtained in the previous step embodies the cluster structure in the dataset. The next step is to find the important features by exploiting this cluster structure. A simple idea is to treat this as a regression problem where the cluster indicators in Y are used as the target variables. The obtained regression coefficients w can then be used to measure the importance of the features. It is important to point out here that the feature selection methods in this paper are unsupervised in nature. Although a regression technique is exploited in our feature selection process, the target variables used in regression are not the true labels of the samples (which are unknown), but rather the labels obtained from the clustering algorithm LLC. Thus a better interpretation of our regression based feature selection is that we select the features that best preserve local data consistency in the original data space. This is the fundamental difference between the methods proposed in this paper and supervised feature selection methods such as RFS [6], which selects the features best correlated with known class labels.

Given a column $y^k$ of the scaled cluster indicator matrix, where $k \in \{1, \ldots, u\}$, the coefficients which indicate the importance of each feature to the k-th cluster can be obtained by minimizing the following least square regression problem:

$\min_{w_k} \|y^k - X w_k\|_2^2$   (8)

Adding an l1-norm regularizer to the least square regression problem, Eq. (8) becomes

$\min_{w_k} \|y^k - X w_k\|_2^2 + \gamma \|w_k\|_1$   (9)

where $\gamma > 0$ is the regularization parameter and the l1-norm regularizer makes the coefficient vector $w_k$ sparse when $\gamma$ is set properly. The features corresponding to the non-zero coefficients in $w_k$ are relevant to cluster k, and the larger the absolute value of $w_{kj}$, the more relevant the feature $f_j$. The LASSO problem in Eq. (9) can be solved easily by the Least Angle Regression (LARS) algorithm, which guarantees the sparseness of $w_k$ by specifying the number of non-zero entries instead of tuning the parameter $\gamma$. That is much more convenient for feature selection.
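For reference, the per-cluster LASSO fits of Eq. (9) used by SFS-llc (and, analogously, by MCFS) could be computed along the LARS path, for example with scikit-learn. In this sketch the cap on the number of non-zero coefficients is imposed approximately through max_iter; this is an assumption of ours rather than the exact routine used in the paper.

```python
import numpy as np
from sklearn.linear_model import lars_path

def sfs_llc_coefficients(X, Y, n_nonzero=20):
    """Fit each column of the scaled indicator matrix Y independently (Eq. (9))."""
    M, u = X.shape[1], Y.shape[1]
    W = np.zeros((M, u))
    for c in range(u):
        # 'lasso' variant of LARS; limiting the number of path steps roughly
        # limits the number of non-zero coefficients.
        _, _, coefs = lars_path(X, Y[:, c], method='lasso', max_iter=n_nonzero)
        W[:, c] = coefs[:, -1]       # coefficients at the last point of the path
    return W
```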
Note that in Eq. (9) each $y^k$ is fitted by X independently, and the largest coefficient of a feature across the different clusters is chosen as the feature's score that determines its selection. This may lead to a problem: when a feature discriminates one single cluster very well, the corresponding coefficient is large, and this large coefficient gained from one single cluster may lead to the final selection of the feature, even though the feature may be useless in discriminating the other clusters and gain coefficients close to zero from them. Selecting such features may deteriorate the quality of the whole feature subset. To better illustrate this situation, we present a simple example in Fig. 2 showing that some existing unsupervised feature selection methods may fail to select the best discriminating feature. There are three Gaussians in a two-dimensional space. Feature a has better discriminating power in separating the red cluster from the other two clusters, but it fails to separate the blue cluster from the green cluster. It is obvious that globally feature b can better discriminate all the clusters than feature a. Existing unsupervised feature selection methods, such as maximum variance and the Laplacian score [15], will fail in this example as they will rank feature a > b since a has a larger variance. Even some newer methods, such as MCFS [17] and the SFS-llc proposed in this paper, will rank feature a > b because they consider the discriminating power on each cluster independently using LASSO. In contrast, GSFS-llc uses $\ell_{2,1}$-regularization and selects features in groups. Consequently it tends to select features that have better overall discriminating power, and it ranks feature b > a in this case.

To overcome this problem, the u least square regression problems are considered together as follows:

$\min_{W} \sum_{k=1}^{u} \|y^k - X w_k\|_2^2 = \min_{W} \|Y - XW\|_F^2 = \min_{W} \sum_{i=1}^{N} \|y_i - x_i^T W\|_2^2$   (10)

where $y_i$ is the i-th row vector of the scaled cluster indicator matrix Y, i.e. the cluster indicator vector of sample $x_i$. As indicated in [6], a small change to Eq. (10) can make it robust to outliers, namely using the following robust loss function instead of the least square loss:

$\min_{W} \sum_{i=1}^{N} \|y_i - x_i^T W\|_2 = \min_{W} \|Y - XW\|_{2,1}$   (11)

where $\|M\|_{2,1} = \sum_{i=1}^{p} \sqrt{\sum_{j=1}^{q} m_{ij}^2}$ for $M \in \mathbb{R}^{p \times q}$ denotes the L2,1-norm. Sparsity can be induced from the robust loss function in Eq. (11) by adding an L2,1 regularizer. The final objective function then becomes

$\min_{W} \|Y - XW\|_{2,1} + \gamma \|W\|_{2,1}$   (12)

This optimization problem can be efficiently solved by the method proposed in [6], which also proves that the algorithm converges to the global optimum of the problem. This indicates that our GSFS-llc can retrieve a near-optimal feature subset. After the coefficient matrix W is acquired, the final ranking score of each feature can be easily calculated. For the j-th feature, the GSFS-llc score is defined as

$\mathrm{Score}_{\mathrm{GSFS\text{-}llc}}(j) = \max_i |w_{ji}|$   (13)

That is, the GSFS-llc score of the j-th feature is the largest absolute value of its corresponding coefficients. Finally, all features are sorted by their GSFS-llc scores in descending order, and the top d features form the candidate feature subset. The complete GSFS-llc algorithm is summarized in Table 1.
Table 1
The GSFS-llc algorithm.

Input: Data matrix X with N sample points and M features; the number of nearest neighbors k; the number of used eigenvectors u; the regularization parameter γ; the number of selected features d.
Output: d selected features.

1. Construct the k nearest neighbor graph and the kernel matrix K from the data matrix X.
2. Based on the neighbor graph and the kernel matrix, compute α_i in Eq. (4) and construct the sparse matrix A.
3. Solve the optimization problem in Eq. (7) to obtain Y = [y^1, ..., y^u], which contains the top u eigenvectors corresponding to the smallest eigenvalues.
4. Solve the L2,1-norm regularized regression problem in Eq. (12) to obtain the coefficient matrix W.
5. Compute the GSFS-llc score of each feature according to Eq. (13) and return the top d features ranked by their GSFS-llc scores.
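Steps 4 and 5 of Table 1 can be sketched with a simple iteratively reweighted least squares scheme for Eq. (12), in the spirit of the solver of [6]; the code below is our own minimal approximation (least-squares initialization, fixed iteration count), not the exact algorithm of that paper.

```python
import numpy as np

def l21_regression(X, Y, gamma=1.0, n_iter=30, eps=1e-8):
    """Approximately solve  min_W ||Y - XW||_{2,1} + gamma * ||W||_{2,1}  (Eq. (12))."""
    W = np.linalg.lstsq(X, Y, rcond=None)[0]         # ordinary least squares start
    for _ in range(n_iter):
        R = Y - X @ W
        d = 1.0 / (2.0 * np.maximum(np.linalg.norm(R, axis=1), eps))   # loss row weights
        dt = 1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps))  # penalty row weights
        XtD = (X * d[:, None]).T
        W = np.linalg.solve(XtD @ X + gamma * np.diag(dt), XtD @ Y)
    return W

def gsfs_llc_scores(W):
    """Eq. (13): the score of feature j is max_i |w_ji|."""
    return np.abs(W).max(axis=1)

# Step 5 of Table 1: keep the d highest-scoring features, e.g.
# selected = np.argsort(-gsfs_llc_scores(W))[:d]
```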
3.3. Computational complexity analysis

A simple analysis of the computational complexity of GSFS-llc is as follows (N: sample size; M: sample dimensionality; k: number of nearest neighbors):
- Constructing the kernel matrix K incurs a cost of $O(N^2 M)$.
- Constructing the sparse matrix A requires finding the k nearest neighbors of each $x_i$ and computing $\alpha_i$ in Eq. (4), with time complexity about $O(N^2 k + N k^3)$.
- Solving the optimization problem in Eq. (7) costs about $O(N^3)$.
- Solving the L2,1-norm regularized regression problem in Eq. (12) costs about $O(N^3)$.
- Computing the GSFS-llc scores requires $O(dM)$, and ranking the features by their GSFS-llc scores requires $O(M \log M)$.

Since k is usually a quite small constant and $d \ll M$, the total computational complexity of GSFS-llc is about $O(N^2 M + N^3 + M \log M)$. This computational complexity is comparable with other group sparsity based algorithms.
4. Experiments

In this section, the performance of the GSFS-llc algorithm is evaluated on various datasets including handwritten digits, human faces, voices and object images. The experiment setup is described in the following subsection.

4.1. Experiment setup

To demonstrate the effectiveness of the GSFS-llc algorithm, the following state-of-the-art feature selection algorithms are compared with GSFS-llc:
- MCFS, proposed in [17], which uses spectral embedding to obtain the cluster structure in a dataset and l1-norm regularized least square regression to select the features that best preserve the cluster structure.
- LS (Laplacian score), proposed in [15], which selects the features that best align with the manifold structure of the data space.
- LKR, proposed in [18], which seeks the features that minimize the within-neighborhood estimation error and maximize the variance over all data samples.
- SFS-llc, which differs from GSFS-llc in that it uses Eq. (9) instead of Eq. (12) to calculate the coefficient matrix W.
The five datasets used in the experiments are USPS08, ORL, ISOLET5, COIL100 and CIFAR10. USPS08 is a subset of the famous USPS handwritten digit database [35]. The original dataset consists of 9298 16 × 16 handwritten digit images; the subset containing only the digits 0 and 8 is chosen to form the USPS08 dataset with 2261 images. ORL is a face dataset consisting of 400 images belonging to 40 subjects (10 samples per subject) [36]. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). The cropped dataset (available at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html), whose samples are 32 × 32 images with 256 grey levels per pixel, is used in the experiments. ISOLET5 is a spoken letter recognition dataset [37]. There are 150 subjects who spoke the name of each letter of the alphabet twice. The speakers are grouped into sets of 30 speakers each, referred to as ISOLET1, ISOLET2, ISOLET3, ISOLET4 and ISOLET5. The ISOLET5 subset, which has 1559 samples with 617 features, is used in the experiments. COIL100 is the Columbia object image library of 100 objects. The images of each object are taken 5° apart as the object is rotated on a turntable, and each object has 72 images. The size of each image is 32 × 32 pixels, with 256 grey levels per pixel; thus each image is represented by a 1024-dimensional vector. The CIFAR10 dataset consists of 60,000 32 × 32 color images in 10 classes, with 6000 images per class [38]. In the experiments a 512-dimensional gist feature vector is used to represent each image in CIFAR10. The statistics of the datasets are summarized in Table 2.

To quantitatively evaluate the experimental results, two metrics are used to measure the clustering performance: the accuracy (AC) and the normalized mutual information (NMI), which are evaluated based on a mapping between cluster labels and true labels. Here a cluster label denotes the label obtained from the clustering algorithm, and a true label denotes the class label given in the test dataset (ground truth). However, there exists no direct mapping between a cluster label and its true label. This mapping problem can be solved by a permutation mapping function that maps each cluster label to a true label in the dataset. The best mapping can be found by the Kuhn–Munkres algorithm [39,40], also known as the Hungarian algorithm. The objective of the Kuhn–Munkres algorithm is to find the optimal label mapping that produces the largest number of matching pairs between cluster labels and true labels. The corresponding true label for a cluster label in this optimal mapping is then called its equivalent label. The AC is then defined as

$AC = \frac{\sum_{i=1}^{n} \delta(\mathrm{map}(t_i), l_i)}{n}$   (14)

where n is the total number of samples, and $t_i$ and $l_i$ denote the obtained cluster label and the true label of data point $x_i$. $\delta(u, v) = 1$ if $u = v$ and $\delta(u, v) = 0$ otherwise, and $\mathrm{map}(t_i)$ is the permutation mapping function that maps each cluster label $t_i$ to the equivalent label (ground truth) from the dataset.

Let C denote the set of clusters obtained from the ground truth and $C'$ denote the set of clusters obtained from the algorithm. The mutual information between C and $C'$ is defined as

$MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}$   (15)

where $p(c_i)$ and $p(c'_j)$ are the probabilities that a sample randomly selected from the data belongs to clusters $c_i$ and $c'_j$, and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected sample belongs to clusters $c_i$ and $c'_j$ simultaneously. Mutual information is also a widely used performance criterion for feature selection [41]. The normalized mutual information used in the experiments is defined as

$NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}$   (16)

where H(C) and $H(C')$ denote the entropies of C and $C'$, respectively. It is straightforward to see that $NMI(C, C') \in [0, 1]$, with $NMI(C, C') = 1$ if C and $C'$ are identical and $NMI(C, C') = 0$ if C and $C'$ are independent.
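The two metrics can be computed, for instance, with scipy's Hungarian solver and scikit-learn's NMI; note that scikit-learn must be told to normalize by max(H(C), H(C')) to match Eq. (16). This is a sketch we provide for clarity, not the evaluation code used by the authors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """AC of Eq. (14): best one-to-one label mapping via the Kuhn-Munkres algorithm."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    # Contingency table: counts of samples shared by each (cluster, class) pair.
    cont = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                      for t in classes] for c in clusters])
    rows, cols = linear_sum_assignment(-cont)        # maximize matched pairs
    return cont[rows, cols].sum() / true_labels.size

def nmi(true_labels, cluster_labels):
    """NMI of Eq. (16), normalized by max(H(C), H(C'))."""
    return normalized_mutual_info_score(true_labels, cluster_labels,
                                        average_method='max')
```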
4.2. Experimental results

To demonstrate the effectiveness of the GSFS-llc algorithm, the k-means algorithm is performed on the chosen datasets with the different feature subsets obtained by the different feature selection algorithms. For all these methods, a Gaussian kernel is used to calculate the similarities between samples, and k = 5 nearest neighbors are chosen to construct the adjacency/kernel matrix. The number of used eigenvectors in SFS-llc, GSFS-llc and MCFS is set to the number of clusters. The regularization parameter γ in GSFS-llc is set to 1.

Fig. 3 shows the feature selection result on the USPS08 dataset by the GSFS-llc algorithm. Fig. 3(a) and (b) show the mean images of clusters 0 and 8, and Fig. 3(c) and (d) show the top 10 selected features on the mean image of each cluster. The top 10 selected features obviously differentiate digit 0 from digit 8 in that each selected feature (a pixel in the image) shows a clear contrast.
Table 2
Statistics of the datasets.

Dataset    # of samples   # of features   # of classes
USPS08     2261           256             2
ORL        400            1024            40
ISOLET5    1559           617             26
COIL100    7200           1024            100
CIFAR10    60,000         512             10
Fig. 4 and Table 3 show the clustering results on the USPS08 dataset. The parameter d denotes the number of selected features; in Table 3 each column corresponds to a value of d and each row to a feature selection method, and the last column shows the clustering result using all features. As can be seen from the results, both GSFS-llc and SFS-llc obviously outperform the other methods, while GSFS-llc achieves the best performance in all cases. Another important observation is that GSFS-llc clearly outperforms clustering with all features in both AC and NMI, showing that it is competent in reducing the redundancies and noise in the data.

In order to further test the performance of GSFS-llc in the presence of noise, another experiment is performed on the USPS08 dataset by adding random salt and pepper noise to the original images, so that some pixels turn into pure white or black as shown in Fig. 5. The clustering results are shown in Fig. 6 and Table 4. As can be observed from the experimental results, as the noise density increases from 0 to 40%, the clustering accuracy of GSFS-llc decreases from 97.57% to 84.19%. The important observation here, however, is the slightly widening gap between GSFS-llc and SFS-llc, especially when the noise density is below 30%. We therefore conclude that GSFS-llc tends to be more robust than SFS-llc in the presence of noise. This is a benefit brought by the group sparse regularization, as indicated in [22].

Fig. 7 and Table 5 show the clustering results on the ORL dataset. Similar to the experimental results on USPS08, GSFS-llc still achieves the best performance on the ORL dataset in almost all cases.
Fig. 3. The feature selection result on USPS08 dataset. The top 10 selected features obviously differentiate digit 0 from digit 8 in that each selected feature (a pixel in the image) shows a clear contrast. (a) Mean image of 0. (b) Mean image of 8. (c) Top 10 features of 0. (d) Top 10 features of 8.
Fig. 4. Clustering accuracy and normalized mutual information vs. the number of selected features on USPS08 dataset. Both GSFS-llc and SFS-llc obviously outperform the other methods and GSFS-llc achieves the best performance in all the cases. (a) Clustering accuracy. (b) Normalized mutual information.
Table 3
Clustering result on USPS08 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        97.52 73.91 82.04 82.84 83.37 82.26 82.22 82.40 82.71 84.70 84.25 84.34 84.39 84.17 84.25 84.34
LKR       74.92 78.51 78.77 78.59 79.48 79.39 80.85 80.67 80.81 81.20 84.17 85.10 85.98 86.42 84.65 84.34
MCFS      66.70 97.52 79.43 76.25 78.77 86.86 89.34 90.23 89.47 89.16 88.90 89.21 86.33 85.01 84.79 84.34
SFS-llc   95.40 97.30 96.77 96.42 95.89 94.96 93.45 93.06 93.94 92.48 90.54 89.56 88.63 88.01 87.00 84.34
GSFS-llc  97.57 98.01 97.17 97.30 97.43 97.48 97.26 97.39 96.99 96.55 95.62 94.56 93.10 88.50 91.20 84.34

Normalized mutual information (%):
LS        81.52 30.76 39.37 39.91 40.28 38.37 38.46 38.58 39.22 42.52 42.07 41.91 41.98 41.60 41.76 41.91
LKR       18.96 27.78 27.60 27.11 28.74 29.06 32.50 32.88 33.30 34.00 39.47 42.65 44.73 45.44 42.30 41.91
MCFS      15.55 81.46 36.29 31.82 35.04 46.75 52.60 55.22 53.38 52.49 52.20 51.60 45.55 43.23 42.99 41.91
SFS-llc   70.76 80.24 77.51 76.29 73.95 70.47 65.06 63.49 66.68 61.30 55.57 52.99 50.40 48.61 46.59 41.91
GSFS-llc  81.87 84.42 79.60 80.41 81.40 81.36 80.23 80.98 78.67 76.18 71.66 67.03 61.56 50.98 56.38 41.91

Note: The parameter d denotes the number of selected features.
Fig. 5. The original and noisy images in USPS08. (a) Original image in USPS08. (b) After adding 10% salt and pepper noise.
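For reproducibility of this setup, salt and pepper noise of a given density can be injected as in the sketch below. This is our own illustrative implementation; the exact noise generator used by the authors is not specified.

```python
import numpy as np

def add_salt_pepper(images, density=0.10, rng=None):
    """Turn a fraction `density` of the pixels into pure white or pure black."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images.astype(float).copy()
    mask = rng.random(noisy.shape) < density         # pixels to corrupt
    salt = rng.random(noisy.shape) < 0.5             # half salt, half pepper
    noisy[mask & salt] = noisy.max()                 # salt (white)
    noisy[mask & ~salt] = noisy.min()                # pepper (black)
    return noisy
```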
Fig. 6. Clustering accuracy and normalized mutual information vs. noise density on USPS08 dataset. GSFS-llc outperforms the others in both clustering accuracy and normalized mutual information. As the noise density increases, the clustering accuracy of GSFS-llc decreases only slightly and its curve is quite smooth compared with the others. (a) Clustering accuracy. (b) Normalized mutual information.
Table 4
Clustering result on the noisy USPS08 dataset. Each row gives one method's results for noise densities of 0, 5, 10, 15, 20, 25, 30, 35 and 40%.

Clustering accuracy (%):
LS        66.48 81.63 80.81 76.98 77.31 75.36 75.34 74.20 73.81
LKR       97.52 61.72 57.39 54.08 59.05 62.16 62.97 62.09 61.00
MCFS      77.53 95.00 87.71 87.82 83.18 86.35 85.15 82.60 81.19
SFS-llc   95.40 96.15 96.57 92.80 91.39 90.24 86.24 80.29 80.19
GSFS-llc  97.57 96.36 96.65 94.60 93.77 92.16 89.48 83.10 84.19

Normalized mutual information (%):
LS        24.39 37.88 36.83 30.95 30.88 27.42 26.05 23.33 21.78
LKR       81.64 0.05 0.04 0.04 0.04 0.03 0.03 0.03 0.05
MCFS      34.76 74.17 56.89 53.25 43.22 45.54 41.22 34.97 31.52
SFS-llc   70.76 76.74 76.45 65.47 60.58 55.59 45.63 33.40 32.38
GSFS-llc  81.87 78.04 76.77 71.81 67.66 61.87 54.36 39.92 40.39
Fig. 7. Clustering accuracy and normalized mutual information vs. the number of selected features on ORL dataset. GSFS-llc achieves the best performance on ORL dataset in almost all cases. (a) Clustering accuracy. (b) Normalized mutual information.
The overall best results, 65.75% in clustering accuracy and 79.39% in normalized mutual information, also beat the results obtained with all features, which are only 58.75% and 75.48%. The clustering results on ISOLET5 are shown in Fig. 8 and Table 6. As seen from the results, overall GSFS-llc still beats all the other methods in both clustering accuracy and normalized mutual information. The clustering results based on the selected feature subsets also beat the result using all features on the ISOLET5 dataset, because of the noise and redundancy in the data.

The clustering results on the COIL100 dataset are shown in Fig. 9 and Table 7. As on ISOLET5, the performance of GSFS-llc is lower than MCFS when selecting a small feature subset, but the gap between them is not very large. When a large enough feature subset is chosen, GSFS-llc still outperforms the others. This time the clustering performance on the selected feature subsets is lower than that on all features. This is mainly because the number of clusters in the COIL100 dataset is 100, which is quite large compared with its 1024 features, so the discriminating power of the selected subsets, which consist of at most 200 features, is not sufficient for separating 100 clusters.
Table 5
Clustering result on ORL dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        25.25 27.50 29.00 31.50 32.50 31.75 33.50 33.25 33.25 34.00 37.50 37.25 38.25 41.50 40.25 58.75
LKR       28.00 32.75 31.50 33.75 36.50 38.25 36.25 37.50 36.50 38.25 38.00 38.00 41.75 44.00 47.00 58.75
MCFS      48.75 56.25 51.75 50.75 58.25 53.50 52.50 54.50 54.50 53.00 53.50 52.25 51.25 53.00 50.50 58.75
SFS-llc   44.50 47.25 52.75 57.50 51.50 54.25 54.00 53.75 61.25 55.50 60.75 54.50 55.75 58.50 54.00 58.75
GSFS-llc  50.75 56.25 57.50 63.00 61.50 59.00 60.50 65.75 59.75 54.50 64.75 55.00 54.75 61.75 61.00 58.75

Normalized mutual information (%):
LS        50.04 50.39 52.39 54.90 55.04 53.93 55.94 56.75 56.66 56.20 59.67 59.36 60.97 62.43 62.63 75.48
LKR       52.01 55.12 54.77 56.03 59.24 61.31 60.00 60.57 59.10 62.36 60.94 62.41 64.67 64.76 68.48 75.48
MCFS      68.01 73.15 71.13 69.70 74.12 72.70 72.55 74.40 73.55 72.76 73.23 73.16 71.97 72.79 71.01 75.48
SFS-llc   66.34 67.79 71.38 74.70 72.23 73.06 72.86 73.18 75.90 73.43 77.68 74.66 73.75 76.91 72.45 75.48
GSFS-llc  69.58 73.77 73.24 78.00 77.59 75.20 76.90 79.39 75.05 74.27 79.12 73.28 74.45 78.81 77.85 75.48

Note: The parameter d denotes the number of selected features.
Fig. 8. Clustering accuracy and normalized mutual information vs. the number of selected features on ISOLET5 dataset. GSFS-llc beats all the other methods in both clustering accuracy and normalized mutual information. (a) Clustering accuracy. (b) Normalized mutual information.
Table 6
Clustering result on ISOLET5 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        32.65 34.96 38.74 36.63 37.78 38.81 40.35 44.32 42.59 47.66 49.39 50.93 53.18 49.71 52.85 51.19
LKR       29.89 37.27 37.08 39.00 37.97 36.37 38.10 35.86 36.05 40.92 38.81 41.69 42.08 47.34 45.16 51.19
MCFS      38.29 47.92 52.98 47.08 56.13 56.25 56.77 53.11 52.85 53.11 47.79 54.20 49.07 53.69 52.66 51.19
SFS-llc   35.98 46.06 57.02 52.34 54.84 58.76 51.25 54.27 59.20 54.33 53.88 54.59 55.23 52.79 51.89 51.19
GSFS-llc  31.05 52.02 49.84 54.46 52.34 57.92 60.68 57.09 62.73 63.31 58.18 60.81 56.51 59.97 60.30 51.19

Normalized mutual information (%):
LS        47.80 48.80 53.51 54.04 53.53 56.85 58.90 62.00 60.95 65.26 66.27 67.03 69.46 69.07 70.58 69.44
LKR       43.97 48.79 52.26 53.86 54.55 51.11 52.88 53.62 52.91 55.88 55.59 59.14 61.45 61.74 61.76 69.44
MCFS      53.70 64.26 67.43 66.91 69.81 70.62 71.01 70.34 70.06 70.36 66.37 70.55 68.64 70.43 69.54 69.44
SFS-llc   51.17 60.20 67.69 67.61 68.02 72.18 69.95 70.82 73.51 72.20 71.51 71.83 71.75 71.53 69.66 69.44
GSFS-llc  41.30 63.44 66.96 69.34 70.94 74.61 75.27 72.83 75.32 75.70 75.87 76.29 75.32 75.63 75.69 69.44

Note: The parameter d denotes the number of selected features.
Fig. 9. Clustering accuracy and normalized mutual information vs. the number of selected features on COIL100 dataset. The performance of GSFS-llc is lower than MCFS when selecting a small feature subset, but the gap between them is not very large. When a large enough feature subset is chosen, GSFS-llc outperforms the others and remains stable. (a) Clustering accuracy. (b) Normalized mutual information.
Table 7
Clustering result on COIL100 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        13.00 15.82 16.50 16.78 21.10 20.07 22.15 22.99 24.00 25.01 24.18 24.51 29.18 29.06 30.50 53.25
LKR       16.31 18.99 23.28 22.88 25.50 26.43 27.39 29.17 30.06 31.39 38.46 40.54 38.65 41.99 43.39 53.25
MCFS      34.39 39.17 40.04 41.44 43.60 42.47 46.28 45.10 46.78 45.72 45.93 46.83 47.14 48.75 48.42 53.25
SFS-llc   20.57 26.60 32.72 40.06 40.92 41.92 44.53 44.13 44.76 46.94 46.67 46.17 47.86 47.10 49.78 53.25
GSFS-llc  29.47 37.86 38.06 41.25 43.07 45.42 45.03 47.28 47.00 46.39 49.76 48.54 49.28 48.49 48.88 53.25

Normalized mutual information (%):
LS        32.06 35.10 36.02 36.95 41.76 41.80 42.60 44.04 45.54 46.63 47.23 47.66 52.45 52.06 53.36 77.42
LKR       35.95 40.70 46.00 47.78 49.57 51.07 52.29 53.89 56.33 56.77 62.48 65.01 64.81 67.08 67.61 77.42
MCFS      59.88 64.13 64.45 66.32 66.72 67.57 69.22 69.75 69.59 70.26 71.14 71.22 72.59 72.37 72.34 77.42
SFS-llc   41.01 50.30 56.54 63.76 66.74 66.54 68.47 68.93 69.91 71.07 71.28 71.36 71.77 72.60 73.23 77.42
GSFS-llc  51.76 60.50 62.35 65.74 66.70 69.54 70.00 70.18 70.74 70.97 72.54 73.13 72.84 73.26 73.38 77.42

Note: The parameter d denotes the number of selected features.
Fig. 10. Clustering accuracy and normalized mutual information vs. the number of selected features on CIFAR10 dataset. The performance of GSFS-llc beats the others in most of the cases. (a) Clustering accuracy. (b) Normalized mutual information.
Although the analysis in Section 3.3 shows that GSFS-llc has a relatively high worst-case time complexity, its performance on real datasets is much better than the worst-case scenario $O(N^2 M + N^3 + M \log M)$. The major reason is that the kNN adjacency/kernel matrix is highly sparse, so many time-consuming matrix operations can be greatly accelerated. In order to study the performance of GSFS-llc on large-scale datasets, we conduct an experiment on the CIFAR10 dataset with 60,000 data samples. The clustering results are shown in Fig. 10 and Table 8.
Table 8
Clustering result on CIFAR10 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        20.08 20.53 22.53 22.21 23.93 23.59 24.30 24.09 24.07 24.28 24.70 25.60 26.43 26.55 26.66 28.32
LKR       20.08 21.84 21.23 22.59 22.06 22.74 22.77 23.57 23.65 24.13 25.37 25.90 26.74 26.16 26.70 28.32
MCFS      20.60 23.17 24.86 24.19 26.51 26.60 26.47 26.90 27.53 27.12 27.02 27.08 27.07 28.26 27.93 28.32
SFS-llc   21.16 22.16 23.98 24.86 25.72 26.44 26.96 28.58 26.92 25.97 27.15 27.34 27.09 27.26 27.27 28.32
GSFS-llc  22.22 23.80 24.06 25.13 27.86 27.88 27.18 27.26 27.14 27.12 27.70 27.20 28.42 28.54 28.58 28.32

Normalized mutual information (%):
LS        8.39 10.29 11.42 11.27 12.14 11.42 12.47 12.89 12.87 12.85 13.41 14.11 14.51 14.84 15.66 16.86
LKR       8.39 11.59 11.90 11.93 11.74 12.28 12.72 13.08 13.39 13.61 14.01 14.24 14.74 15.39 15.86 16.86
MCFS      7.18 10.91 12.23 13.63 14.03 14.63 14.49 14.97 15.08 14.93 15.28 16.20 15.96 15.69 16.11 16.86
SFS-llc   8.72 12.76 13.37 13.92 15.13 15.49 16.21 16.89 16.73 15.90 16.24 16.50 16.83 16.73 16.75 16.86
GSFS-llc  10.89 13.00 13.82 14.63 15.48 16.21 16.14 16.15 16.84 16.98 17.32 17.30 17.59 17.47 17.60 16.86

Note: The parameter d denotes the number of selected features.
Table 9
Time costs on CIFAR10.

Method         LS     LKR    MCFS   SFS-llc   GSFS-llc
Time cost (s)  204    1522   254    1766      1776
As the results show, our method achieves the best performance in most cases. We also show the time costs of each algorithm in Table 9 (with the number of selected features d = 50). The time cost of GSFS-llc is much better than expected; in fact it is almost comparable with the other feature selection methods of lower time complexity, even on large datasets.

To summarize the experimental results, we finally put an emphasis on the comparison among MCFS, SFS-llc and GSFS-llc. All three are unsupervised feature selection methods combining a clustering algorithm with a sparse regularization. To be more specific, MCFS is a combination of Laplacian eigenmaps and LASSO; SFS-llc is a combination of LLC and LASSO; and GSFS-llc is a combination of LLC and group sparse regularization. The performance characteristics of a specific feature selection method can then be traced back to those of its components. From the experimental results shown in Figs. 4–9, the performance of SFS-llc is better than that of MCFS in most cases. This performance gain is the result of the advantage of LLC over Laplacian eigenmaps, as indicated in [34]. It can also be observed from the experimental results that GSFS-llc achieves much better clustering results than SFS-llc in most cases. This indicates that using group sparsity regularized regression leads to a considerable performance gain over using LASSO.

All the experimental results above clearly show the effectiveness of the proposed GSFS-llc algorithm on handwritten digit, human face, spoken letter and object image data. By selecting feature subsets, the GSFS-llc algorithm reduces the impact of noise and redundancy in the data and improves the clustering performance.
4.3. Parameter selection

The GSFS-llc algorithm has three parameters: the number of nearest neighbors k, the number of used eigenvectors u and the regularization parameter γ. In order to explore the impact of each parameter, experiments fixing two parameters and tuning the remaining one are performed on the ORL dataset. All the experiments use the top 50 features for clustering. Fig. 11 shows the clustering results when fixing u = 40 and γ = 1 and tuning k from 3 to 20. Both the clustering accuracy and the normalized mutual information reach a relatively high value at k = 5 and then decrease slightly, so using k = 5 nearest neighbors to estimate a sample's label seems appropriate. Fig. 12 shows the results of tuning u from 1 to 60 while fixing k = 5 and γ = 1. As seen from the results, GSFS-llc achieves the best performance around u = 40, which equals the number of clusters in the ORL dataset. Fig. 13 shows the result of tuning the regularization parameter γ from $10^{-2}$ to $10^{3}$ while fixing k = 5 and u = 40. The clustering performance slightly increases until γ reaches 1, and then drops. According to the experimental results, the proper value for γ is 1.
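The parameter study of this subsection can be reproduced along the following lines, reusing the helper functions sketched earlier (llc_indicators, l21_regression, gsfs_llc_scores, clustering_accuracy, nmi). X and labels stand for the data matrix and ground-truth labels of the dataset at hand, and the wrapper below is our own glue code, not part of the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def run_gsfs_llc(X, labels, k=5, u=40, gamma=1.0, d=50):
    Y = llc_indicators(X, k=k, u=u)                        # steps 1-3 of Table 1
    W = l21_regression(X, Y, gamma=gamma)                  # step 4
    top = np.argsort(-gsfs_llc_scores(W))[:d]              # step 5: top d features
    pred = KMeans(n_clusters=u, n_init=10).fit_predict(X[:, top])
    return clustering_accuracy(labels, pred), nmi(labels, pred)

# Vary one parameter while fixing the other two, as in Figs. 11-13.
for k in [3, 5, 10, 15, 20]:
    print('k =', k, run_gsfs_llc(X, labels, k=k, u=40, gamma=1.0))
for gamma in 10.0 ** np.arange(-2, 4):
    print('gamma =', gamma, run_gsfs_llc(X, labels, k=5, u=40, gamma=gamma))
```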
5. Conclusion and future work

This paper proposes an unsupervised feature selection method, GSFS-llc, based on a regularized regression framework. By fitting the cluster labels obtained from local learning based clustering (LLC), GSFS-llc naturally selects features that respect the manifold structure in data space. Moreover, by using group sparse regularization, GSFS-llc inherits the stability associated with group structure and exhibits stable performance in the presence of noise. In comparison with other state-of-the-art feature selection algorithms, GSFS-llc outperforms the others on various datasets including handwritten digits, human faces, voices and object images. The experiments also demonstrate that GSFS-llc is robust on a dataset injected with noise.

There are several interesting problems to be investigated in our future work: (1) The proposed model exploits both the manifold structure and the group structure in data space by combining local learning based clustering and L2,1 regularization. As there exist many different models both for learning the manifold structure and for recovering the group sparsity pattern, it is possible to combine different models and come up with more effective methods. (2) It is also possible to extend the current model by adding other constraints, such as nonnegative constraints, which would make the relevancies between features and clusters accumulate additively. The advantage of nonnegative constraints is exhibited in various models such as [42,43]. It will be interesting to explore the effect of nonnegative constraints in our model.
Fig. 11. Clustering accuracy and normalized mutual information vs. the number of neighbors on ORL dataset. (a) Clustering accuracy. (b) Normalized mutual information.
Fig. 12. Clustering accuracy and normalized mutual information vs. the number of used eigenvectors on ORL dataset. (a) Clustering accuracy. (b) Normalized mutual information.
Fig. 13. Clustering accuracy and normalized mutual information vs. the value of the regularization parameter γ on ORL dataset. (a) Clustering accuracy. (b) Normalized mutual information.
Acknowledgments This work is supported by Zhejiang Provincial Natural Science Foundation of China (Grant no. LZ13F020001), National Science Foundation of China (Grant nos. 61173185, 61173186), National Key Technology R&D Program (Grant no. 2014BAK15B02) and Demonstration of Digital Medical Service and Technology in Destined Region (Grant no. 2012AA022814).
References

[1] R.E. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, New Jersey, 1961.
[2] M. Verleysen, Learning high-dimensional data, in: Nato Science Series Sub Series III Computer And Systems Sciences, vol. 186, 2003, pp. 141–162.
[3] I. Jolliffe, Principal Component Analysis, Wiley Online Library, 2005.
[4] X. He, P. Niyogi, Locality preserving projections, in: Advances in Neural Information Processing Systems, 2003.
[5] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.
[6] F. Nie, H. Huang, X. Cai, C.H. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, Adv. Neural Inf. Process. Syst. (2010) 1813–1821.
[7] M. Tan, L. Wang, I.W. Tsang, Learning sparse SVM for feature selection on very high dimensional datasets, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1047–1054.
[8] Y. Zhang, S. Li, T. Wang, Z. Zhang, Divergence-based feature selection for separate classes, Neurocomputing 101 (2013) 32–42.
[9] J.G. Dy, C.E. Brodley, Feature selection for unsupervised learning, J. Mach. Learn. Res. 5 (2004) 845–889.
[10] C. Boutsidis, M.W. Mahoney, P. Drineas, Unsupervised feature selection for principal components analysis, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 61–69.
[11] H. Zeng, Y.-M. Cheung, Feature selection for local learning based clustering, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2009, pp. 414–425.
[12] J.B. Tenenbaum, V. De Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[13] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[14] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural Information Processing Systems, 2001.
[15] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, Adv. Neural Inf. Process. Syst. (2005) 507–514.
[16] Y. Jiang, J. Ren, Eigenvalue sensitive feature selection, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 89–96.
[17] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, in: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, 2010, pp. 333–342.
[18] Y. Cheung, H. Zeng, Local kernel regression score for selecting features of high-dimensional data, IEEE Trans. Knowl. Data Eng. 21 (2009) 1798–1802.
[19] H. Zeng, Y. Cheung, Feature selection and kernel learning for local learning-based clustering, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 1532–1547.
[20] J. Kim, R. Monteiro, H. Park, Group sparsity in nonnegative matrix factorization, in: SDM, 2012, pp. 851–862.
[21] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68 (2006) 49–67.
[22] J. Huang, T. Zhang, The benefit of group sparsity, Ann. Stat. 38 (2010) 1978–2004.
[23] Y. Yang, H.T. Shen, Z. Ma, Z. Huang, X. Zhou, l2,1-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.
[24] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[25] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edition, John Wiley and Sons, New York, 2001.
[26] G. Doquire, M. Verleysen, A graph Laplacian based approach to semi-supervised feature selection for regression problems, Neurocomputing 121 (2013) 5–13.
[27] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.
[28] F. Benoît, M. van Heeswijk, Y. Miche, M. Verleysen, A. Lendasse, Feature selection for nonlinear models with extreme learning machines, Neurocomputing 102 (2013) 111–124.
[29] G. Doquire, M. Verleysen, Mutual information-based feature selection for multilabel classification, Neurocomputing 122 (2013) 148–155.
[30] Y. Sun, S. Todorovic, S. Goodison, Local-learning-based feature selection for high-dimensional data analysis, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1610–1626.
[31] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) (1996) 267–288.
[32] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67 (2005) 301–320.
[33] X. Fang, Y. Xu, X. Li, Z. Fan, H. Liu, Y. Chen, Locality and similarity preserving embedding for feature selection, Neurocomputing 128 (2013) 304–315.
[34] M. Wu, B. Schölkopf, A local learning approach for clustering, Adv. Neural Inf. Process. Syst. (2006) 1529–1536.
[35] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 550–554.
[36] F.S. Samaria, A.C. Harter, Parameterization of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
[37] M.A. Fanty, R. Cole, Spoken letter recognition, Adv. Neural Inf. Process. Syst. 3 (1990) 220–226.
[38] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, 2009.
[39] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Res. Logist. Q. 2 (1955) 83–97.
[40] J. Munkres, Algorithms for the assignment and transportation problems, J. Soc. Ind. Appl. Math. 5 (1957) 32–38.
[41] B. Frénay, G. Doquire, M. Verleysen, Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification, Neurocomputing 112 (2013) 64–78.
[42] D. Wang, X. Gao, X. Wang, Semi-supervised nonnegative matrix factorization via constraint propagation, IEEE Transactions on Cybernetics, http://dx.doi.org/10.1109/TCYB.2015.2399533.
[43] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, X. He, Document summarization based on data reconstruction, in: AAAI, 2012.
Yue Wu received the B.S. degree in computer science from Zhejiang University, China, in 2010. He is currently a Ph.D. candidate in the College of Computer Science at Zhejiang University. His research interests include machine learning, pattern recognition and image retrieval.
Can Wang is currently an associate professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, data mining, machine learning and information accessibility.
Jiajun Bu is a Professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, computer graphics, embedded system and computer supported cooperative work.
Chun Chen is a professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, data mining, computer vision, computer graphics and embedded technology.