Group sparse feature selection on local learning based clustering

Yue Wu, Can Wang*, Jiajun Bu, Chun Chen

Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science, Zhejiang University, Hangzhou 310027, China

* Corresponding author. Tel.: +86 571 87951431; fax: +86 571 87953955. E-mail address: [email protected] (C. Wang).
Article info

Article history: Received 14 August 2014; Received in revised form 17 July 2015; Accepted 18 July 2015. Communicated by X. Gao.

Keywords: Unsupervised; Group sparsity; Local learning; Feature selection

Abstract

Feature selection plays an important role in many machine learning applications. By extracting meaningful features and eliminating redundancies and noise, it effectively improves the accuracy and efficiency of the learning algorithm. In this paper, an unsupervised feature selection method called GSFS-llc (group sparse feature selection on local learning based clustering) is proposed. GSFS-llc first retrieves the cluster structure in a dataset using LLC and then selects the important features that best preserve this cluster structure by L2,1-norm regularized regression. By combining group sparsity regularization with a locality based learning algorithm, GSFS-llc leads to more robust results than other methods and selects features that respect the underlying geometric structure of the dataset. Extensive experimental results on real-world handwritten digit, human face, voice and object image datasets demonstrate the superiority of the GSFS-llc algorithm. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

Many applications nowadays, such as computer vision, pattern recognition and data mining, are facing data of increasing dimensionality. Higher dimensionality contains more information but may also introduce more redundancy and noise. As a consequence, the "curse of dimensionality" [1,2] is a prevalent problem for learning in high dimensional space, i.e. algorithms and procedures that are analytically and computationally effective in low dimensional space become totally impractical in this case. Various dimensionality reduction techniques have been introduced to ease this problem. The most canonical among them are feature selection and feature extraction algorithms. Unlike feature extraction methods, such as PCA [3] and LPP [4], which transform the original data into a reduced representation set of features, feature selection methods only choose a relevant feature subset from the original data. Thus feature selection is not only simpler but also better preserves the actual meaning of each feature, which helps to understand its function. The selected subset with the most informative features effectively reduces the redundancies and noise in the data and makes the following learning tasks more accurate. Feature selection methods can be roughly classified as supervised or unsupervised, according to whether label information is involved in the selection. Typical supervised feature selection methods [5–8] select features having high correlations with class labels. Unsupervised methods, in contrast, mainly exploit the distribution of data samples to find the optimal feature subset.
As labels are usually expensive to acquire, unsupervised methods are used more broadly than supervised ones. However, without the guidance of label information, feature selection becomes a more challenging issue in an unsupervised setting. Many existing unsupervised methods select features that best preserve certain properties of the dataset, such as maximum variance and minimum redundancy. Influential works such as [9–11] are typical unsupervised feature selection methods. In recent years, feature selection methods have started to explore geometric structures in high dimensional data space. As recent works have shown that data spaces are often low-dimensional manifolds embedded in high-dimensional ambient spaces [12–14], various feature selection methods explicitly considering manifold structures have been proposed, including the Laplacian score [15], eigenvalue sensitive feature selection [16], multi-cluster feature selection [17], the local kernel regression score [18] and feature selection for local learning based clustering [11,19]. These methods either directly optimize for the manifold structure or use manifold regularizations in their models to select features that respect the intrinsic manifold structure in data space. Among them, feature selection methods based on local learning based clustering (LLC) show great improvements in robustness and accuracy on datasets with manifold structure [11,19]. However, the limitation of these manifold-based methods for feature selection is that they neglect the group structures in data space. Recent studies [20–22] point out that group structures are inherent in many datasets such as documents, images, and DNA data, where documents of the same topic naturally form a group, and in DNA data features can be grouped by their metabolic profiling. As data from the same group tend to share the same sparsity pattern in their low-dimensional representation, considering the shared group structures is expected to improve the performance of feature selection algorithms [6,23].
In this paper, a new unsupervised group sparse feature selection method based on LLC, named GSFS-llc, is proposed. The motivation of our work is to exploit both the manifold structure and the group structure in feature selection. Using the cluster labels from LLC, GSFS-llc establishes an L2,1-norm regularized regression framework to select important features that respect both the group structure and the manifold structure in data space. As group structure is inherent in many datasets, GSFS-llc gains better performance compared with methods that only consider the manifold structure in data space. It is worthwhile to highlight the following aspects of our work:
1. We establish a regression framework using the cluster labels from LLC and group sparse regularization. To the best of our knowledge, this is the first time that local learning based clustering is combined with group sparse regularization for feature selection.
2. By choosing the features that best fit the LLC cluster labels, GSFS-llc naturally respects the manifold structure in data space.
3. By inducing sparse features from L2,1-norm regularized regression, GSFS-llc explicitly considers the group structure in data space. It is thus expected to gain better performance in feature selection and to inherit some nice properties associated with group structures [22], such as better stability in the presence of noise.
The rest of the paper is organized as follows: Section 2 gives a brief review of related work; Section 3 describes the GSFS-llc algorithm; Section 4 presents the experiments comparing GSFS-llc with other feature selection methods; Section 5 concludes the paper.
2. Related works

Guyon et al. categorized feature selection algorithms into filter methods, wrapper methods and embedded methods [24]. These three kinds of methods differ in how the learning algorithm is incorporated in evaluating and selecting features. Filter methods [25,15,23,16,26] capture the inner properties of the data using statistical techniques such as variance, Pearson correlation, and mutual information, and select features before running the learning algorithm. Filter methods are usually the most efficient among all feature selection methods. Wrapper methods [27,8,28,29] are associated with some predictive model for scoring the chosen feature subsets. Given a candidate feature subset, a wrapper method trains a specific predictive model on that subset and then uses the model's performance to score it.
The feature subset with the highest score is selected as the final result. Embedded methods such as [7,19] perform variable selection in the training process and are usually specific to given learning machines. Decision trees, such as CART, which have a built-in mechanism to perform variable selection, are typical embedded methods.

Recent studies such as ISOMAP [12] and LLE [13] have shown that data spaces in many cases are low-dimensional manifolds embedded in high-dimensional ambient spaces. This manifold assumption is broadly verified on existing datasets such as Yale, PIE, and USPS. Many locality based methods have been developed in recent years to uncover the underlying manifolds. Consequently, some feature selection methods incorporate this manifold assumption by choosing the features that respect the intrinsic geometric structure of the data space. Cheung et al. proposed a filter method called the local kernel regression (LKR) score [18]. This algorithm seeks the features which minimize the within-neighborhood estimation error and meanwhile have the maximum variance over all data samples. Using different neighborhood graphs, the LKR score can deal with both supervised and unsupervised situations efficiently. Sun et al. also proposed a local learning based feature selection method that decomposes an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then uses the large margin framework to learn feature relevance [30]. Zeng et al. proposed an unsupervised feature selection method based on local learning based clustering (LLC-fs) [11,19]. They incorporate a binary selection vector into the LLC objective function to select the optimal features for LLC. The GSFS-llc proposed in this paper is also based on LLC, but it differs from LLC-fs in that features are selected as the result of a group sparsity-inducing norm. Thus the selected features inherit the stability associated with the group structure and achieve better performance in the presence of noise.

Another related research area actively explored in recent years is sparse regression. By using different regularizations, such as LASSO [31] (using the l1-norm) and the elastic net [32] (using a combination of the l1-norm and the l2-norm), sparse regression is able to retrieve various sparse bases from the data space under different sparsity assumptions. While LASSO and the elastic net are designed for vectors, group sparse regularization (the L2,1-norm) is designed for matrices. It first calculates the l2-norm of each row of the matrix and then sums the results to form an l1-norm. That is, for $M \in \mathbb{R}^{p \times q}$, $\|M\|_{2,1} = \sum_{i=1}^{p} \sqrt{\sum_{j=1}^{q} m_{ij}^{2}}$. More stable fitting results can be obtained by using group sparse regularization than by using LASSO, in that the elements of each row are more likely to be zero or non-zero together when fitted to different clusters, as shown in Fig. 1. Sparse regularizations are used to select important features in data space since they can efficiently retrieve the features correlated with class labels.
Fig. 1. Fitting results of regressions using L2,1 regularization vs. l1 regularization. The 5 × 10 regression coefficient matrix in the figure shows the fitting results of the 5 features on 10 clusters, with each column being the fitting result on a specific cluster. Blue elements denote non-zero coefficients and white elements denote zero coefficients. Obviously the group sparse regularization (L2,1-norm) leads to more stable fitting results than LASSO (l1-norm). Thus group sparsity based feature selection is expected to be more robust than LASSO based feature selection. (a) Group sparse coefficient matrix obtained by L2,1-norm regularized regression. (b) Sparse coefficient matrix obtained by l1-norm regularized regression. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
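For concreteness, the L2,1-norm defined above and the group pattern illustrated in Fig. 1 can be computed with a few lines of numpy. This is only an illustrative sketch; the toy matrix W below is hypothetical and not taken from the paper.

```python
import numpy as np

def l21_norm(M):
    # L2,1-norm: l2-norm of each row, then summed (an l1-norm over the rows).
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

def l1_norm(M):
    # Entry-wise l1-norm, as used by LASSO-style penalties.
    return np.sum(np.abs(M))

# Toy 5 x 10 coefficient matrix in the spirit of Fig. 1:
# rows are features, columns are clusters.
W = np.zeros((5, 10))
W[1, :] = 0.8        # a feature relevant to every cluster (whole row non-zero)
W[3, ::3] = 0.5      # a feature relevant to only a few clusters
print(l21_norm(W), l1_norm(W))
```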
Nie et al. proposed a supervised feature selection method called RFS [6] by regressing over class labels with L2,1-norm regularization. Fang et al. proposed a locality and similarity preserving embedding (LSPE) for feature selection [33]. It first constructs a nearest neighbor graph to preserve the locality structure in data space and then maps the structure to the reconstruction coefficients such that the similarity among the data points is preserved. By adding L2,1-norm minimization constraints on the projection matrix, a sparse result is obtained. Cai et al. proposed a filter method called multi-cluster feature selection (MCFS) [17] that uses spectral embedding to obtain the cluster information and then solves a series of l1-norm regularized least square regression problems to retrieve the sparse coefficients. One major novelty of MCFS is that it extends sparsity regularized regression to an unsupervised setting. Inspired by MCFS, we propose the SFS-llc algorithm in this paper by combining local learning based clustering (LLC) with LASSO. Since both SFS-llc and MCFS rely on the l1-norm to induce sparsity, they both suffer from the instability problem described in Fig. 2 when dealing with multi-cluster datasets. To overcome this limitation, GSFS-llc is further proposed in this paper, using group sparse regularization to retrieve the important feature subset in data space.

Fig. 2. A failed example for some existing unsupervised feature selection methods. Feature b obviously has better overall discriminating power than feature a in that it is able to separate all three clusters (colored blue, green and red respectively), while feature a fails to separate the blue cluster from the green cluster. However, feature selection based on maximum variance or Laplacian score will rank feature a > b as feature a has a larger variance. MCFS will also rank feature a > b in that it also selects features that maximally discriminate one cluster from the rest. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

3. Group sparse feature selection on local learning based clustering

The main idea of GSFS-llc is to select important features from the dataset via a two-step process: (1) obtain the cluster labels using LLC [34]; (2) fit the obtained cluster labels using the L2,1-norm regularized regression model and select the features with large regression coefficients. Since our regression model and related analysis are fundamentally based on the cluster labels from LLC, we first introduce the LLC algorithm here, as presented in [34,11,19].

3.1. Local learning based clustering for cluster analysis

Under the manifold assumption, the data space can be treated as linear within a small neighborhood. It is therefore reasonable to approximate the label of a data sample using its neighbors'. As there exist $C \ge 2$ clusters in the dataset, we use an indicator vector y to represent a cluster label. The indicator vector for a sample point in the l-th cluster is a vector of C elements whose l-th element is 1 and all others are 0.

Given a set of data points $X = [x_1, x_2, \ldots, x_N]^T = [f_1, f_2, \ldots, f_M]$, $x_i \in \mathbb{R}^M$ denotes a sample point and $f_j \in \mathbb{R}^N$ denotes a feature. $\mathcal{N}_i$ represents the set of $x_i$'s neighbors and $n_i = |\mathcal{N}_i|$ is its cardinality. Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel function. The kernel matrix of $x_i$'s neighbors is defined as $K_i = [K(x_u, x_v)] \in \mathbb{R}^{n_i \times n_i}$ for $x_u, x_v \in \mathcal{N}_i$, and $k_i = [K(x_i, x_j)]$ for all $x_j \in \mathcal{N}_i$. Different kernel functions can be adopted, such as the linear kernel $K(x_u, x_v) = x_u \cdot x_v$ and the heat kernel $K(x_u, x_v) = e^{-\|x_u - x_v\|_2^2 / 2\sigma^2}$.

Using kernel regression, the estimate of the l-th element of $x_i$'s cluster indicator vector is

$\hat{y}_i^l = \sum_{x_j \in \mathcal{N}_i} \beta_{ij}^l K(x_i, x_j)$   (1)

Using kernel ridge regression makes the above problem easier to solve; the coefficients $\beta_{ij}^l$ in Eq. (1) can then be obtained from the following optimization problem:

$\min_{\beta_i^l \in \mathbb{R}^{n_i}} \|K_i \beta_i^l - y_i^l\|^2 + \lambda (\beta_i^l)^T K_i \beta_i^l$   (2)

where $\beta_i^l = [\beta_{ij}^l] \in \mathbb{R}^{n_i}$ is the coefficient vector, $y_i^l = [y_j^l]^T \in \mathbb{R}^{n_i}$, and $\lambda > 0$ is the regularization parameter. The solution to problem (2) is $\beta_i^l = (K_i + \lambda I)^{-1} y_i^l$. Substituting it back into (1), we get

$\hat{y}_i^l = k_i^T (K_i + \lambda I)^{-1} y_i^l$   (3)

Let

$\alpha_i^T = k_i^T (K_i + \lambda I)^{-1}$   (4)

which is independent of the cluster indicator $y_i^l$ and can be easily calculated since $k_i$ and $K_i$ are already known. The estimated cluster indicator $\hat{y}_i^l$ can then be expressed as a linear combination of its neighbors' cluster indicators:

$\hat{y}_i^l = \alpha_i^T y_i^l$   (5)

So the overall prediction error can be calculated as follows:

$\sum_{l=1}^{C} \sum_{i=1}^{N} (y_i^l - \hat{y}_i^l)^2 = \sum_{l=1}^{C} \|y^l - \hat{y}^l\|^2 = \sum_{l=1}^{C} \|y^l - A y^l\|^2 = \mathrm{tr}[Y^T (I - A)^T (I - A) Y] = \mathrm{tr}(Y^T T Y)$   (6)

where $Y = [y^1, \ldots, y^C]$ is the cluster indicator matrix which we want to solve for; $T = (I - A)^T (I - A)$; C is the number of clusters and N is the number of sample points. $A = [a_{ij}]$ is an $N \times N$ sparse matrix whose entry $a_{ij}$ equals the corresponding element of $\alpha_i$ in Eq. (4) if $x_j \in \mathcal{N}_i$, and 0 otherwise. By relaxing Y into the continuous domain, the prediction error in Eq. (6) can be minimized by solving the following optimization problem:

$\min_{Y \in \mathbb{R}^{N \times C}} \mathrm{trace}(Y^T T Y) \quad \mathrm{s.t.} \quad Y^T Y = I$   (7)

The solution for Y is given by the top C eigenvectors of T corresponding to the smallest eigenvalues. Since the number of clusters C is usually unknown in practice, the top u eigenvectors are used instead, where u is a preset parameter.
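To make the construction above concrete, the following is a rough numpy sketch of the LLC step (Eqs. (1)–(7)). It assumes a heat kernel and dense matrices for readability; a practical implementation would exploit the sparsity of A, and the function and parameter names are ours, not from the original LLC code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def llc_indicators(X, k=5, lam=1.0, u=10, sigma=1.0):
    """Return the N x u scaled cluster indicator matrix Y of Eq. (7)."""
    N = X.shape[0]
    D = cdist(X, X, 'sqeuclidean')
    K = np.exp(-D / (2.0 * sigma ** 2))              # heat kernel
    A = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]             # k nearest neighbors of x_i
        Ki = K[np.ix_(nbrs, nbrs)]                   # neighborhood kernel matrix K_i
        ki = K[i, nbrs]                              # vector k_i
        A[i, nbrs] = np.linalg.solve(Ki + lam * np.eye(k), ki)   # alpha_i, Eq. (4)
    I = np.eye(N)
    T = (I - A).T @ (I - A)                          # Eq. (6)
    _, vecs = np.linalg.eigh(T)                      # eigenvalues in ascending order
    return vecs[:, :u]                               # eigenvectors with smallest eigenvalues
```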
3.2. Feature selection using L2,1-norm regularized regression

The scaled cluster indicator matrix Y obtained in the previous step embodies the cluster structure in the dataset. The next step is to find the important features by exploiting this cluster structure. A simple idea is to treat this as a regression problem where the cluster indicators in Y are used as the target variables. The obtained regression coefficients w can then be used to measure the importance of the features. It is important to point out here that the feature selection methods in this paper are unsupervised in nature. Although a regression technique is exploited in our feature selection process, the target variables used in regression are not the true labels of the samples (which are unknown), but rather the labels obtained from the clustering algorithm LLC. Thus a better interpretation of our regression based feature selection is that we select the features that best preserve local data consistency in the original data space. This is the fundamental difference between the methods proposed in this paper and supervised feature selection methods such as RFS [6], which selects the features best correlated with known class labels.

Given a column $y^k$ of the scaled cluster indicator matrix, where $k \in \{1, \ldots, u\}$, the coefficients which indicate the importance of each feature to the k-th cluster can be obtained by minimizing the following least square regression problem:

$\min_{w_k} \|y^k - X w_k\|_2^2$   (8)

Adding an l1-norm regularizer to the least square regression problem, Eq. (8) becomes

$\min_{w_k} \|y^k - X w_k\|_2^2 + \gamma \|w_k\|_1$   (9)

where $\gamma > 0$ is the regularization parameter and the l1-norm regularizer makes the coefficient vector $w_k$ sparse when $\gamma$ is set properly. The features corresponding to the non-zero coefficients in $w_k$ are relevant to cluster k, and the larger the absolute value of $w_{kj}$, the more relevant the feature $f_j$. The LASSO problem in Eq. (9) can be solved easily by the Least Angle Regression (LARS) algorithm, which guarantees the sparseness of $w_k$ by specifying the number of non-zero entries instead of tuning the parameter $\gamma$. That is much more convenient for feature selection.
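For reference, the per-cluster LASSO fits of Eq. (9) used by SFS-llc (and, analogously, by MCFS) could be computed along the LARS path, for example with scikit-learn. In this sketch the cap on the number of non-zero coefficients is imposed approximately through max_iter; this is an assumption of ours rather than the exact routine used in the paper.

```python
import numpy as np
from sklearn.linear_model import lars_path

def sfs_llc_coefficients(X, Y, n_nonzero=20):
    """Fit each column of the scaled indicator matrix Y independently (Eq. (9))."""
    M, u = X.shape[1], Y.shape[1]
    W = np.zeros((M, u))
    for c in range(u):
        # 'lasso' variant of LARS; limiting the number of path steps roughly
        # limits the number of non-zero coefficients.
        _, _, coefs = lars_path(X, Y[:, c], method='lasso', max_iter=n_nonzero)
        W[:, c] = coefs[:, -1]       # coefficients at the last point of the path
    return W
```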
Note that in Eq. (9) each $y^k$ is fitted by X independently, and the largest coefficient of a feature across the different clusters is chosen as the feature's score that determines its selection. This may lead to a problem: when a feature discriminates one single cluster very well, the corresponding coefficient is large, and this large coefficient gained from one single cluster may lead to the final selection of the feature, even though the feature may be useless in discriminating the other clusters and gain coefficients close to zero from them. Selecting such features may deteriorate the quality of the whole feature subset. To better illustrate this situation, we present a simple example in Fig. 2 showing that some existing unsupervised feature selection methods may fail to select the best discriminating feature. There are three Gaussians in a two-dimensional space. Feature a has better discriminating power in separating the red cluster from the other two clusters, but it fails to separate the blue cluster from the green cluster. It is obvious that globally feature b can better discriminate all the clusters than feature a. Existing unsupervised feature selection methods, such as maximum variance and the Laplacian score [15], will fail in this example as they will rank feature a > b since a has a larger variance. Even some newer methods, such as MCFS [17] and the SFS-llc proposed in this paper, will rank feature a > b because they consider the discriminating power on each cluster independently using LASSO. In contrast, GSFS-llc uses $\ell_{2,1}$-regularization and selects features in groups. Consequently it tends to select features that have better overall discriminating power, and it ranks feature b > a in this case.

To overcome this problem, the u least square regression problems are considered together as follows:

$\min_{W} \sum_{k=1}^{u} \|y^k - X w_k\|_2^2 = \min_{W} \|Y - XW\|_F^2 = \min_{W} \sum_{i=1}^{N} \|y_i - x_i^T W\|_2^2$   (10)

where $y_i$ is the i-th row vector of the scaled cluster indicator matrix Y, i.e. the cluster indicator vector of sample $x_i$. As indicated in [6], a small change to Eq. (10) can make it robust to outliers, namely using the following robust loss function instead of the least square loss:

$\min_{W} \sum_{i=1}^{N} \|y_i - x_i^T W\|_2 = \min_{W} \|Y - XW\|_{2,1}$   (11)

where $\|M\|_{2,1} = \sum_{i=1}^{p} \sqrt{\sum_{j=1}^{q} m_{ij}^2}$ for $M \in \mathbb{R}^{p \times q}$ denotes the L2,1-norm. Sparsity can be induced from the robust loss function in Eq. (11) by adding an L2,1 regularizer. The final objective function then becomes

$\min_{W} \|Y - XW\|_{2,1} + \gamma \|W\|_{2,1}$   (12)

This optimization problem can be efficiently solved by the method proposed in [6], which also proves that the algorithm converges to the global optimum of the problem. This indicates that our GSFS-llc can retrieve a near-optimal feature subset. After the coefficient matrix W is acquired, the final ranking score of each feature can be easily calculated. For the j-th feature, the GSFS-llc score is defined as

$\mathrm{Score}_{\mathrm{GSFS\text{-}llc}}(j) = \max_i |w_{ji}|$   (13)

That is, the GSFS-llc score of the j-th feature is the largest absolute value of its corresponding coefficients. Finally, all features are sorted by their GSFS-llc scores in descending order, and the top d features form the candidate feature subset. The complete GSFS-llc algorithm is summarized in Table 1.
Table 1
The GSFS-llc algorithm.

Input: Data matrix X with N sample points and M features; the number of nearest neighbors k; the number of used eigenvectors u; the regularization parameter γ; the number of selected features d.
Output: d selected features.

1. Construct the k nearest neighbor graph and the kernel matrix K from the data matrix X.
2. Based on the neighbor graph and the kernel matrix, compute α_i in Eq. (4) and construct the sparse matrix A.
3. Solve the optimization problem in Eq. (7) to obtain Y = [y^1, ..., y^u], which contains the top u eigenvectors corresponding to the smallest eigenvalues.
4. Solve the L2,1-norm regularized regression problem in Eq. (12) to obtain the coefficient matrix W.
5. Compute the GSFS-llc score of each feature according to Eq. (13) and return the top d features ranked by their GSFS-llc scores.
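Steps 4 and 5 of Table 1 can be sketched with a simple iteratively reweighted least squares scheme for Eq. (12), in the spirit of the solver of [6]; the code below is our own minimal approximation (least-squares initialization, fixed iteration count), not the exact algorithm of that paper.

```python
import numpy as np

def l21_regression(X, Y, gamma=1.0, n_iter=30, eps=1e-8):
    """Approximately solve  min_W ||Y - XW||_{2,1} + gamma * ||W||_{2,1}  (Eq. (12))."""
    W = np.linalg.lstsq(X, Y, rcond=None)[0]         # ordinary least squares start
    for _ in range(n_iter):
        R = Y - X @ W
        d = 1.0 / (2.0 * np.maximum(np.linalg.norm(R, axis=1), eps))   # loss row weights
        dt = 1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps))  # penalty row weights
        XtD = (X * d[:, None]).T
        W = np.linalg.solve(XtD @ X + gamma * np.diag(dt), XtD @ Y)
    return W

def gsfs_llc_scores(W):
    """Eq. (13): the score of feature j is max_i |w_ji|."""
    return np.abs(W).max(axis=1)

# Step 5 of Table 1: keep the d highest-scoring features, e.g.
# selected = np.argsort(-gsfs_llc_scores(W))[:d]
```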
3.3. Computational complexity analysis

A simple analysis of the computational complexity of GSFS-llc is as follows (N: sample size; M: sample dimensionality; k: number of nearest neighbors):
- Constructing the kernel matrix K incurs a cost of $O(N^2 M)$.
- Constructing the sparse matrix A requires finding the k nearest neighbors of each $x_i$ and computing $\alpha_i$ in Eq. (4), with time complexity about $O(N^2 k + N k^3)$.
- Solving the optimization problem in Eq. (7) costs about $O(N^3)$.
- Solving the L2,1-norm regularized regression problem in Eq. (12) costs about $O(N^3)$.
- Computing the GSFS-llc scores requires $O(dM)$, and ranking the features by their GSFS-llc scores requires $O(M \log M)$.

Since k is usually a quite small constant and $d \ll M$, the total computational complexity of GSFS-llc is about $O(N^2 M + N^3 + M \log M)$. This computational complexity is comparable with other group sparsity based algorithms.
4. Experiments

In this section, the performance of the GSFS-llc algorithm is evaluated on various datasets including handwritten digits, human faces, voices and object images. The experiment setup is described in the following subsection.

4.1. Experiment setup

To demonstrate the effectiveness of the GSFS-llc algorithm, the following state-of-the-art feature selection algorithms are compared with GSFS-llc:
- MCFS, proposed in [17], which uses spectral embedding to obtain the cluster structure in a dataset and l1-norm regularized least square regression to select the features that best preserve the cluster structure.
- LS (Laplacian score), proposed in [15], which selects the features that best align with the manifold structure of the data space.
- LKR, proposed in [18], which seeks the features that minimize the within-neighborhood estimation error and maximize the variance over all data samples.
- SFS-llc, which differs from GSFS-llc in that it uses Eq. (9) instead of Eq. (12) to calculate the coefficient matrix W.
The five datasets used in the experiments are USPS08, ORL, ISOLET5, COIL100 and CIFAR10. USPS08 is a subset of the famous USPS handwritten digit database [35]. The original dataset consists of 9298 16 × 16 handwritten digit images; the subset containing only the digits 0 and 8 is chosen to form the USPS08 dataset with 2261 images. ORL is a face dataset consisting of 400 images belonging to 40 subjects (10 samples per subject) [36]. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). The cropped dataset (available at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html), whose samples are 32 × 32 images with 256 grey levels per pixel, is used in the experiments. ISOLET5 is a spoken letter recognition dataset [37]. There are 150 subjects who spoke the name of each letter of the alphabet twice. The speakers are grouped into sets of 30 speakers each, referred to as ISOLET1, ISOLET2, ISOLET3, ISOLET4 and ISOLET5. The ISOLET5 subset, which has 1559 samples with 617 features, is used in the experiments. COIL100 is the Columbia object image library of 100 objects. The images of each object are taken 5° apart as the object is rotated on a turntable, and each object has 72 images. The size of each image is 32 × 32 pixels, with 256 grey levels per pixel; thus each image is represented by a 1024-dimensional vector. The CIFAR10 dataset consists of 60,000 32 × 32 color images in 10 classes, with 6000 images per class [38]. In the experiments a 512-dimensional gist feature vector is used to represent each image in CIFAR10. The statistics of the datasets are summarized in Table 2.

To quantitatively evaluate the experimental results, two metrics are used to measure the clustering performance: the accuracy (AC) and the normalized mutual information (NMI), which are evaluated based on a mapping between cluster labels and true labels. Here a cluster label denotes the label obtained from the clustering algorithm, and a true label denotes the class label given in the test dataset (ground truth). However, there exists no direct mapping between a cluster label and its true label. This mapping problem can be solved by a permutation mapping function that maps each cluster label to a true label in the dataset. The best mapping can be found by the Kuhn–Munkres algorithm [39,40], also known as the Hungarian algorithm. The objective of the Kuhn–Munkres algorithm is to find the optimal label mapping that produces the largest number of matching pairs between cluster labels and true labels. The corresponding true label for a cluster label in this optimal mapping is then called its equivalent label. The AC is then defined as

$AC = \frac{\sum_{i=1}^{n} \delta(\mathrm{map}(t_i), l_i)}{n}$   (14)

where n is the total number of samples, and $t_i$ and $l_i$ denote the obtained cluster label and the true label of data point $x_i$. $\delta(u, v) = 1$ if $u = v$ and $\delta(u, v) = 0$ otherwise, and $\mathrm{map}(t_i)$ is the permutation mapping function that maps each cluster label $t_i$ to the equivalent label (ground truth) from the dataset.

Let C denote the set of clusters obtained from the ground truth and $C'$ denote the set of clusters obtained from the algorithm. The mutual information between C and $C'$ is defined as

$MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}$   (15)

where $p(c_i)$ and $p(c'_j)$ are the probabilities that a sample randomly selected from the data belongs to clusters $c_i$ and $c'_j$, and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected sample belongs to clusters $c_i$ and $c'_j$ simultaneously. Mutual information is also a widely used performance criterion for feature selection [41]. The normalized mutual information used in the experiments is defined as

$NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}$   (16)

where H(C) and $H(C')$ denote the entropies of C and $C'$, respectively. It is straightforward to see that $NMI(C, C') \in [0, 1]$, with $NMI(C, C') = 1$ if C and $C'$ are identical and $NMI(C, C') = 0$ if C and $C'$ are independent.
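The two metrics can be computed, for instance, with scipy's Hungarian solver and scikit-learn's NMI; note that scikit-learn must be told to normalize by max(H(C), H(C')) to match Eq. (16). This is a sketch we provide for clarity, not the evaluation code used by the authors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """AC of Eq. (14): best one-to-one label mapping via the Kuhn-Munkres algorithm."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    # Contingency table: counts of samples shared by each (cluster, class) pair.
    cont = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                      for t in classes] for c in clusters])
    rows, cols = linear_sum_assignment(-cont)        # maximize matched pairs
    return cont[rows, cols].sum() / true_labels.size

def nmi(true_labels, cluster_labels):
    """NMI of Eq. (16), normalized by max(H(C), H(C'))."""
    return normalized_mutual_info_score(true_labels, cluster_labels,
                                        average_method='max')
```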
4.2. Experimental results

To demonstrate the effectiveness of the GSFS-llc algorithm, the k-means algorithm is performed on the chosen datasets with the different feature subsets obtained by the different feature selection algorithms. For all these methods, a Gaussian kernel is used to calculate the similarities between samples, and k = 5 nearest neighbors are chosen to construct the adjacency/kernel matrix. The number of used eigenvectors in SFS-llc, GSFS-llc and MCFS is set to the number of clusters. The regularization parameter γ in GSFS-llc is set to 1.

Fig. 3 shows the feature selection result on the USPS08 dataset by the GSFS-llc algorithm. Fig. 3(a) and (b) show the mean images of clusters 0 and 8, and Fig. 3(c) and (d) show the top 10 selected features on the mean image of each cluster. The top 10 selected features obviously differentiate digit 0 from digit 8 in that each selected feature (a pixel in the image) shows a clear contrast.
Table 2
Statistics of the datasets.

Dataset    # of samples   # of features   # of classes
USPS08     2261           256             2
ORL        400            1024            40
ISOLET5    1559           617             26
COIL100    7200           1024            100
CIFAR10    60,000         512             10
Fig. 4 and Table 3 show the clustering results on the USPS08 dataset. The parameter d denotes the number of selected features; in Table 3 each column corresponds to a value of d and each row to a feature selection method, and the last column shows the clustering result using all features. As can be seen from the results, both GSFS-llc and SFS-llc obviously outperform the other methods, while GSFS-llc achieves the best performance in all cases. Another important observation is that GSFS-llc clearly outperforms clustering with all features in both AC and NMI, showing that it is competent in reducing the redundancies and noise in the data.

In order to further test the performance of GSFS-llc in the presence of noise, another experiment is performed on the USPS08 dataset by adding random salt and pepper noise to the original images, so that some pixels turn into pure white or black as shown in Fig. 5. The clustering results are shown in Fig. 6 and Table 4. As can be observed from the experimental results, as the noise density increases from 0 to 40%, the clustering accuracy of GSFS-llc decreases from 97.57% to 84.19%. The important observation here, however, is the slightly widening gap between GSFS-llc and SFS-llc, especially when the noise density is below 30%. We therefore conclude that GSFS-llc tends to be more robust than SFS-llc in the presence of noise. This is a benefit brought by the group sparse regularization, as indicated in [22].

Fig. 7 and Table 5 show the clustering results on the ORL dataset. Similar to the experimental results on USPS08, GSFS-llc still achieves the best performance on the ORL dataset in almost all cases.
Fig. 3. The feature selection result on USPS08 dataset. The top 10 selected features obviously differentiate digit 0 from digit 8 in that each selected feature (a pixel in the image) shows a clear contrast. (a) Mean image of 0. (b) Mean image of 8. (c) Top 10 features of 0. (d) Top 10 features of 8.
Fig. 4. Clustering accuracy and normalized mutual information vs. the number of selected features on USPS08 dataset. Both GSFS-llc and SFS-llc obviously outperform the other methods and GSFS-llc achieves the best performance in all the cases. (a) Clustering accuracy. (b) Normalized mutual information.
Table 3
Clustering result on USPS08 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        97.52 73.91 82.04 82.84 83.37 82.26 82.22 82.40 82.71 84.70 84.25 84.34 84.39 84.17 84.25 84.34
LKR       74.92 78.51 78.77 78.59 79.48 79.39 80.85 80.67 80.81 81.20 84.17 85.10 85.98 86.42 84.65 84.34
MCFS      66.70 97.52 79.43 76.25 78.77 86.86 89.34 90.23 89.47 89.16 88.90 89.21 86.33 85.01 84.79 84.34
SFS-llc   95.40 97.30 96.77 96.42 95.89 94.96 93.45 93.06 93.94 92.48 90.54 89.56 88.63 88.01 87.00 84.34
GSFS-llc  97.57 98.01 97.17 97.30 97.43 97.48 97.26 97.39 96.99 96.55 95.62 94.56 93.10 88.50 91.20 84.34

Normalized mutual information (%):
LS        81.52 30.76 39.37 39.91 40.28 38.37 38.46 38.58 39.22 42.52 42.07 41.91 41.98 41.60 41.76 41.91
LKR       18.96 27.78 27.60 27.11 28.74 29.06 32.50 32.88 33.30 34.00 39.47 42.65 44.73 45.44 42.30 41.91
MCFS      15.55 81.46 36.29 31.82 35.04 46.75 52.60 55.22 53.38 52.49 52.20 51.60 45.55 43.23 42.99 41.91
SFS-llc   70.76 80.24 77.51 76.29 73.95 70.47 65.06 63.49 66.68 61.30 55.57 52.99 50.40 48.61 46.59 41.91
GSFS-llc  81.87 84.42 79.60 80.41 81.40 81.36 80.23 80.98 78.67 76.18 71.66 67.03 61.56 50.98 56.38 41.91

Note: The parameter d denotes the number of selected features.
Fig. 5. The original and noisy images in USPS08. (a) Original image in USPS08. (b) After adding 10% salt and pepper noise.
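For reproducibility of this setup, salt and pepper noise of a given density can be injected as in the sketch below. This is our own illustrative implementation; the exact noise generator used by the authors is not specified.

```python
import numpy as np

def add_salt_pepper(images, density=0.10, rng=None):
    """Turn a fraction `density` of the pixels into pure white or pure black."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images.astype(float).copy()
    mask = rng.random(noisy.shape) < density         # pixels to corrupt
    salt = rng.random(noisy.shape) < 0.5             # half salt, half pepper
    noisy[mask & salt] = noisy.max()                 # salt (white)
    noisy[mask & ~salt] = noisy.min()                # pepper (black)
    return noisy
```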
Fig. 6. Clustering accuracy and normalized mutual information vs. noise density on USPS08 dataset. GSFS-llc outperforms the others in both clustering accuracy and normalized mutual information. As the noise density increases, the clustering accuracy of GSFS-llc decreases only slightly and its curve is quite smooth compared with the others. (a) Clustering accuracy. (b) Normalized mutual information.
Table 4
Clustering result on the noisy USPS08 dataset. Each row gives one method's results for noise densities of 0, 5, 10, 15, 20, 25, 30, 35 and 40%.

Clustering accuracy (%):
LS        66.48 81.63 80.81 76.98 77.31 75.36 75.34 74.20 73.81
LKR       97.52 61.72 57.39 54.08 59.05 62.16 62.97 62.09 61.00
MCFS      77.53 95.00 87.71 87.82 83.18 86.35 85.15 82.60 81.19
SFS-llc   95.40 96.15 96.57 92.80 91.39 90.24 86.24 80.29 80.19
GSFS-llc  97.57 96.36 96.65 94.60 93.77 92.16 89.48 83.10 84.19

Normalized mutual information (%):
LS        24.39 37.88 36.83 30.95 30.88 27.42 26.05 23.33 21.78
LKR       81.64 0.05 0.04 0.04 0.04 0.03 0.03 0.03 0.05
MCFS      34.76 74.17 56.89 53.25 43.22 45.54 41.22 34.97 31.52
SFS-llc   70.76 76.74 76.45 65.47 60.58 55.59 45.63 33.40 32.38
GSFS-llc  81.87 78.04 76.77 71.81 67.66 61.87 54.36 39.92 40.39
Fig. 7. Clustering accuracy and normalized mutual information vs. the number of selected features on ORL dataset. GSFS-llc achieves the best performance on ORL dataset in almost all cases. (a) Clustering accuracy. (b) Normalized mutual information.
The overall best results, 65.75% in clustering accuracy and 79.39% in normalized mutual information, also beat the results obtained with all features, which are only 58.75% and 75.48%. The clustering results on ISOLET5 are shown in Fig. 8 and Table 6. As seen from the results, overall GSFS-llc still beats all the other methods in both clustering accuracy and normalized mutual information. The clustering results based on the selected feature subsets also beat the result using all features on the ISOLET5 dataset, because of the noise and redundancy in the data.

The clustering results on the COIL100 dataset are shown in Fig. 9 and Table 7. As on ISOLET5, the performance of GSFS-llc is lower than MCFS when selecting a small feature subset, but the gap between them is not very large. When a large enough feature subset is chosen, GSFS-llc still outperforms the others. This time the clustering performance on the selected feature subsets is lower than that on all features. This is mainly because the number of clusters in the COIL100 dataset is 100, which is quite large compared with its 1024 features, so the discriminating power of the selected subsets, which consist of at most 200 features, is not sufficient for separating 100 clusters.
Table 5
Clustering result on ORL dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        25.25 27.50 29.00 31.50 32.50 31.75 33.50 33.25 33.25 34.00 37.50 37.25 38.25 41.50 40.25 58.75
LKR       28.00 32.75 31.50 33.75 36.50 38.25 36.25 37.50 36.50 38.25 38.00 38.00 41.75 44.00 47.00 58.75
MCFS      48.75 56.25 51.75 50.75 58.25 53.50 52.50 54.50 54.50 53.00 53.50 52.25 51.25 53.00 50.50 58.75
SFS-llc   44.50 47.25 52.75 57.50 51.50 54.25 54.00 53.75 61.25 55.50 60.75 54.50 55.75 58.50 54.00 58.75
GSFS-llc  50.75 56.25 57.50 63.00 61.50 59.00 60.50 65.75 59.75 54.50 64.75 55.00 54.75 61.75 61.00 58.75

Normalized mutual information (%):
LS        50.04 50.39 52.39 54.90 55.04 53.93 55.94 56.75 56.66 56.20 59.67 59.36 60.97 62.43 62.63 75.48
LKR       52.01 55.12 54.77 56.03 59.24 61.31 60.00 60.57 59.10 62.36 60.94 62.41 64.67 64.76 68.48 75.48
MCFS      68.01 73.15 71.13 69.70 74.12 72.70 72.55 74.40 73.55 72.76 73.23 73.16 71.97 72.79 71.01 75.48
SFS-llc   66.34 67.79 71.38 74.70 72.23 73.06 72.86 73.18 75.90 73.43 77.68 74.66 73.75 76.91 72.45 75.48
GSFS-llc  69.58 73.77 73.24 78.00 77.59 75.20 76.90 79.39 75.05 74.27 79.12 73.28 74.45 78.81 77.85 75.48

Note: The parameter d denotes the number of selected features.
Fig. 8. Clustering accuracy and normalized mutual information vs. the number of selected features on ISOLET5 dataset. GSFS-llc beats all the other methods in both clustering accuracy and normalized mutual information. (a) Clustering accuracy. (b) Normalized mutual information.
Table 6
Clustering result on ISOLET5 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        32.65 34.96 38.74 36.63 37.78 38.81 40.35 44.32 42.59 47.66 49.39 50.93 53.18 49.71 52.85 51.19
LKR       29.89 37.27 37.08 39.00 37.97 36.37 38.10 35.86 36.05 40.92 38.81 41.69 42.08 47.34 45.16 51.19
MCFS      38.29 47.92 52.98 47.08 56.13 56.25 56.77 53.11 52.85 53.11 47.79 54.20 49.07 53.69 52.66 51.19
SFS-llc   35.98 46.06 57.02 52.34 54.84 58.76 51.25 54.27 59.20 54.33 53.88 54.59 55.23 52.79 51.89 51.19
GSFS-llc  31.05 52.02 49.84 54.46 52.34 57.92 60.68 57.09 62.73 63.31 58.18 60.81 56.51 59.97 60.30 51.19

Normalized mutual information (%):
LS        47.80 48.80 53.51 54.04 53.53 56.85 58.90 62.00 60.95 65.26 66.27 67.03 69.46 69.07 70.58 69.44
LKR       43.97 48.79 52.26 53.86 54.55 51.11 52.88 53.62 52.91 55.88 55.59 59.14 61.45 61.74 61.76 69.44
MCFS      53.70 64.26 67.43 66.91 69.81 70.62 71.01 70.34 70.06 70.36 66.37 70.55 68.64 70.43 69.54 69.44
SFS-llc   51.17 60.20 67.69 67.61 68.02 72.18 69.95 70.82 73.51 72.20 71.51 71.83 71.75 71.53 69.66 69.44
GSFS-llc  41.30 63.44 66.96 69.34 70.94 74.61 75.27 72.83 75.32 75.70 75.87 76.29 75.32 75.63 75.69 69.44

Note: The parameter d denotes the number of selected features.
Fig. 9. Clustering accuracy and normalized mutual information vs. the number of selected features on COIL100 dataset. The performance of GSFS-llc is lower than MCFS when selecting a small feature subset, but the gap between them is not very large. When a large enough feature subset is chosen, GSFS-llc outperforms the others and remains stable. (a) Clustering accuracy. (b) Normalized mutual information.
Table 7
Clustering result on COIL100 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        13.00 15.82 16.50 16.78 21.10 20.07 22.15 22.99 24.00 25.01 24.18 24.51 29.18 29.06 30.50 53.25
LKR       16.31 18.99 23.28 22.88 25.50 26.43 27.39 29.17 30.06 31.39 38.46 40.54 38.65 41.99 43.39 53.25
MCFS      34.39 39.17 40.04 41.44 43.60 42.47 46.28 45.10 46.78 45.72 45.93 46.83 47.14 48.75 48.42 53.25
SFS-llc   20.57 26.60 32.72 40.06 40.92 41.92 44.53 44.13 44.76 46.94 46.67 46.17 47.86 47.10 49.78 53.25
GSFS-llc  29.47 37.86 38.06 41.25 43.07 45.42 45.03 47.28 47.00 46.39 49.76 48.54 49.28 48.49 48.88 53.25

Normalized mutual information (%):
LS        32.06 35.10 36.02 36.95 41.76 41.80 42.60 44.04 45.54 46.63 47.23 47.66 52.45 52.06 53.36 77.42
LKR       35.95 40.70 46.00 47.78 49.57 51.07 52.29 53.89 56.33 56.77 62.48 65.01 64.81 67.08 67.61 77.42
MCFS      59.88 64.13 64.45 66.32 66.72 67.57 69.22 69.75 69.59 70.26 71.14 71.22 72.59 72.37 72.34 77.42
SFS-llc   41.01 50.30 56.54 63.76 66.74 66.54 68.47 68.93 69.91 71.07 71.28 71.36 71.77 72.60 73.23 77.42
GSFS-llc  51.76 60.50 62.35 65.74 66.70 69.54 70.00 70.18 70.74 70.97 72.54 73.13 72.84 73.26 73.38 77.42

Note: The parameter d denotes the number of selected features.
Fig. 10. Clustering accuracy and normalized mutual information vs. the number of selected features on CIFAR10 dataset. The performance of GSFS-llc beats the others in most of the cases. (a) Clustering accuracy. (b) Normalized mutual information.
Although the analysis in Section 3.3 shows that GSFS-llc has a relatively high worst-case time complexity, its performance on real datasets is much better than the worst-case scenario $O(N^2 M + N^3 + M \log M)$. The major reason is that the kNN adjacency/kernel matrix is highly sparse, so many time-consuming matrix operations can be greatly accelerated. In order to study the performance of GSFS-llc on large-scale datasets, we conduct an experiment on the CIFAR10 dataset with 60,000 data samples. The clustering results are shown in Fig. 10 and Table 8.
Table 8
Clustering result on CIFAR10 dataset. Each row gives one method's results for d = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200 selected features and, in the last column, for all features.

Clustering accuracy (%):
LS        20.08 20.53 22.53 22.21 23.93 23.59 24.30 24.09 24.07 24.28 24.70 25.60 26.43 26.55 26.66 28.32
LKR       20.08 21.84 21.23 22.59 22.06 22.74 22.77 23.57 23.65 24.13 25.37 25.90 26.74 26.16 26.70 28.32
MCFS      20.60 23.17 24.86 24.19 26.51 26.60 26.47 26.90 27.53 27.12 27.02 27.08 27.07 28.26 27.93 28.32
SFS-llc   21.16 22.16 23.98 24.86 25.72 26.44 26.96 28.58 26.92 25.97 27.15 27.34 27.09 27.26 27.27 28.32
GSFS-llc  22.22 23.80 24.06 25.13 27.86 27.88 27.18 27.26 27.14 27.12 27.70 27.20 28.42 28.54 28.58 28.32

Normalized mutual information (%):
LS        8.39 10.29 11.42 11.27 12.14 11.42 12.47 12.89 12.87 12.85 13.41 14.11 14.51 14.84 15.66 16.86
LKR       8.39 11.59 11.90 11.93 11.74 12.28 12.72 13.08 13.39 13.61 14.01 14.24 14.74 15.39 15.86 16.86
MCFS      7.18 10.91 12.23 13.63 14.03 14.63 14.49 14.97 15.08 14.93 15.28 16.20 15.96 15.69 16.11 16.86
SFS-llc   8.72 12.76 13.37 13.92 15.13 15.49 16.21 16.89 16.73 15.90 16.24 16.50 16.83 16.73 16.75 16.86
GSFS-llc  10.89 13.00 13.82 14.63 15.48 16.21 16.14 16.15 16.84 16.98 17.32 17.30 17.59 17.47 17.60 16.86

Note: The parameter d denotes the number of selected features.
Table 9
Time costs on CIFAR10.

Method         LS     LKR    MCFS   SFS-llc   GSFS-llc
Time cost (s)  204    1522   254    1766      1776
As the results show, our method achieves the best performance in most cases. We also show the time costs of each algorithm in Table 9 (with the number of selected features d = 50). The time cost of GSFS-llc is much better than expected; in fact it is almost comparable with the other feature selection methods of lower time complexity, even on large datasets.

To summarize the experimental results, we finally put an emphasis on the comparison among MCFS, SFS-llc and GSFS-llc. All three are unsupervised feature selection methods combining a clustering algorithm with a sparse regularization. To be more specific, MCFS is a combination of Laplacian eigenmaps and LASSO; SFS-llc is a combination of LLC and LASSO; and GSFS-llc is a combination of LLC and group sparse regularization. The performance characteristics of a specific feature selection method can then be traced back to those of its components. From the experimental results shown in Figs. 4–9, the performance of SFS-llc is better than that of MCFS in most cases. This performance gain is the result of the advantage of LLC over Laplacian eigenmaps, as indicated in [34]. It can also be observed from the experimental results that GSFS-llc achieves much better clustering results than SFS-llc in most cases. This indicates that using group sparsity regularized regression leads to a considerable performance gain over using LASSO.

All the experimental results above clearly show the effectiveness of the proposed GSFS-llc algorithm on handwritten digit, human face, spoken letter and object image data. By selecting feature subsets, the GSFS-llc algorithm reduces the impact of noise and redundancy in the data and improves the clustering performance.
4.3. Parameter selection

The GSFS-llc algorithm has three parameters: the number of nearest neighbors k, the number of used eigenvectors u and the regularization parameter γ. In order to explore the impact of each parameter, experiments fixing two parameters and tuning the remaining one are performed on the ORL dataset. All the experiments use the top 50 features for clustering. Fig. 11 shows the clustering results when fixing u = 40 and γ = 1 and tuning k from 3 to 20. Both the clustering accuracy and the normalized mutual information reach a relatively high value at k = 5 and then decrease slightly, so using k = 5 nearest neighbors to estimate a sample's label seems appropriate. Fig. 12 shows the results of tuning u from 1 to 60 while fixing k = 5 and γ = 1. As seen from the results, GSFS-llc achieves the best performance around u = 40, which equals the number of clusters in the ORL dataset. Fig. 13 shows the result of tuning the regularization parameter γ from $10^{-2}$ to $10^{3}$ while fixing k = 5 and u = 40. The clustering performance slightly increases until γ reaches 1, and then drops. According to the experimental results, the proper value for γ is 1.
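The parameter study of this subsection can be reproduced along the following lines, reusing the helper functions sketched earlier (llc_indicators, l21_regression, gsfs_llc_scores, clustering_accuracy, nmi). X and labels stand for the data matrix and ground-truth labels of the dataset at hand, and the wrapper below is our own glue code, not part of the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def run_gsfs_llc(X, labels, k=5, u=40, gamma=1.0, d=50):
    Y = llc_indicators(X, k=k, u=u)                        # steps 1-3 of Table 1
    W = l21_regression(X, Y, gamma=gamma)                  # step 4
    top = np.argsort(-gsfs_llc_scores(W))[:d]              # step 5: top d features
    pred = KMeans(n_clusters=u, n_init=10).fit_predict(X[:, top])
    return clustering_accuracy(labels, pred), nmi(labels, pred)

# Vary one parameter while fixing the other two, as in Figs. 11-13.
for k in [3, 5, 10, 15, 20]:
    print('k =', k, run_gsfs_llc(X, labels, k=k, u=40, gamma=1.0))
for gamma in 10.0 ** np.arange(-2, 4):
    print('gamma =', gamma, run_gsfs_llc(X, labels, k=5, u=40, gamma=gamma))
```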
5. Conclusion and future work

This paper proposes an unsupervised feature selection method, GSFS-llc, based on a regularized regression framework. By fitting the cluster labels obtained from local learning based clustering (LLC), GSFS-llc naturally selects features that respect the manifold structure in data space. Moreover, by using group sparse regularization, GSFS-llc inherits the stability associated with group structure and exhibits stable performance in the presence of noise. In comparison with other state-of-the-art feature selection algorithms, GSFS-llc outperforms the others on various datasets including handwritten digits, human faces, voices and object images. The experiments also demonstrate that GSFS-llc is robust on a dataset injected with noise.

There are several interesting problems to be investigated in our future work: (1) The proposed model exploits both the manifold structure and the group structure in data space by combining local learning based clustering and L2,1 regularization. As there exist many different models both for learning the manifold structure and for recovering the group sparsity pattern, it is possible to combine different models and come up with more effective methods. (2) It is also possible to extend the current model by adding other constraints, such as nonnegative constraints, which would make the relevancies between features and clusters accumulate additively. The advantage of nonnegative constraints is exhibited in various models such as [42,43]. It will be interesting to explore the effect of nonnegative constraints in our model.
Fig. 11. Clustering accuracy and normalized mutual information vs. the number of neighbors on ORL dataset. (a) Clustering accuracy. (b) Normalized mutual information.
Fig. 12. Clustering accuracy and normalized mutual information vs. the number of used eigenvectors on ORL dataset. (a) Clustering accuracy. (b) Normalized mutual information.
Fig. 13. Clustering accuracy and normalized mutual information vs. the value of the regularization parameter γ on ORL dataset. (a) Clustering accuracy. (b) Normalized mutual information.
Acknowledgments This work is supported by Zhejiang Provincial Natural Science Foundation of China (Grant no. LZ13F020001), National Science Foundation of China (Grant nos. 61173185, 61173186), National Key Technology R&D Program (Grant no. 2014BAK15B02) and Demonstration of Digital Medical Service and Technology in Destined Region (Grant no. 2012AA022814).
References

[1] R.E. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, New Jersey, 1961.
[2] M. Verleysen, Learning high-dimensional data, in: Nato Science Series Sub Series III Computer And Systems Sciences, vol. 186, 2003, pp. 141–162.
[3] I. Jolliffe, Principal Component Analysis, Wiley Online Library, 2005.
[4] X. He, P. Niyogi, Locality preserving projections, in: Advances in Neural Information Processing Systems, 2003.
[5] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.
[6] F. Nie, H. Huang, X. Cai, C.H. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, Adv. Neural Inf. Process. Syst. (2010) 1813–1821.
[7] M. Tan, L. Wang, I.W. Tsang, Learning sparse SVM for feature selection on very high dimensional datasets, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1047–1054.
[8] Y. Zhang, S. Li, T. Wang, Z. Zhang, Divergence-based feature selection for separate classes, Neurocomputing 101 (2013) 32–42.
[9] J.G. Dy, C.E. Brodley, Feature selection for unsupervised learning, J. Mach. Learn. Res. 5 (2004) 845–889.
[10] C. Boutsidis, M.W. Mahoney, P. Drineas, Unsupervised feature selection for principal components analysis, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 61–69.
[11] H. Zeng, Y.-M. Cheung, Feature selection for local learning based clustering, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2009, pp. 414–425.
[12] J.B. Tenenbaum, V. De Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[13] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[14] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural Information Processing Systems, 2001.
[15] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, Adv. Neural Inf. Process. Syst. (2005) 507–514.
[16] Y. Jiang, J. Ren, Eigenvalue sensitive feature selection, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 89–96.
[17] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, in: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, 2010, pp. 333–342.
[18] Y. Cheung, H. Zeng, Local kernel regression score for selecting features of high-dimensional data, IEEE Trans. Knowl. Data Eng. 21 (2009) 1798–1802.
[19] H. Zeng, Y. Cheung, Feature selection and kernel learning for local learning-based clustering, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 1532–1547.
[20] J. Kim, R. Monteiro, H. Park, Group sparsity in nonnegative matrix factorization, in: SDM, 2012, pp. 851–862.
[21] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68 (2006) 49–67.
[22] J. Huang, T. Zhang, The benefit of group sparsity, Ann. Stat. 38 (2010) 1978–2004.
[23] Y. Yang, H.T. Shen, Z. Ma, Z. Huang, X. Zhou, l2,1-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.
[24] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[25] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edition, John Wiley and Sons, New York, 2001.
[26] G. Doquire, M. Verleysen, A graph Laplacian based approach to semi-supervised feature selection for regression problems, Neurocomputing 121 (2013) 5–13.
[27] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.
[28] F. Benoît, M. van Heeswijk, Y. Miche, M. Verleysen, A. Lendasse, Feature selection for nonlinear models with extreme learning machines, Neurocomputing 102 (2013) 111–124.
[29] G. Doquire, M. Verleysen, Mutual information-based feature selection for multilabel classification, Neurocomputing 122 (2013) 148–155.
[30] Y. Sun, S. Todorovic, S. Goodison, Local-learning-based feature selection for high-dimensional data analysis, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1610–1626.
[31] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodol.) (1996) 267–288.
[32] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67 (2005) 301–320.
[33] X. Fang, Y. Xu, X. Li, Z. Fan, H. Liu, Y. Chen, Locality and similarity preserving embedding for feature selection, Neurocomputing 128 (2013) 304–315.
[34] M. Wu, B. Schölkopf, A local learning approach for clustering, Adv. Neural Inf. Process. Syst. (2006) 1529–1536.
[35] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 550–554.
[36] F.S. Samaria, A.C. Harter, Parameterization of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
[37] M.A. Fanty, R. Cole, Spoken letter recognition, Adv. Neural Inf. Process. Syst. 3 (1990) 220–226.
[38] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, 2009.
[39] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Res. Logist. Q. 2 (1955) 83–97.
[40] J. Munkres, Algorithms for the assignment and transportation problems, J. Soc. Ind. Appl. Math. 5 (1957) 32–38.
[41] B. Frénay, G. Doquire, M. Verleysen, Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification, Neurocomputing 112 (2013) 64–78.
[42] D. Wang, X. Gao, X. Wang, Semi-supervised nonnegative matrix factorization via constraint propagation, IEEE Transactions on Cybernetics, http://dx.doi.org/10.1109/TCYB.2015.2399533.
[43] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, X. He, Document summarization based on data reconstruction, in: AAAI, 2012.
Yue Wu received the B.S. degree in computer science from Zhejiang University, China, in 2010. He is currently a Ph.D. candidate in the College of Computer Science at Zhejiang University. His research interests include machine learning, pattern recognition and image retrieval.
Can Wang is currently an associate professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, data mining, machine learning and information accessibility.
Jiajun Bu is a Professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, computer graphics, embedded system and computer supported cooperative work.
Chun Chen is a professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, data mining, computer vision, computer graphics and embedded technology.