Nonnegative Laplacian embedding guided subspace learning for unsupervised feature selection

Pattern Recognition 93 (2019) 337–352
Yong Zhang, Qi Wang, Dun-wei Gong, Xian-fang Song
School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221008, China

Article history: Received 4 September 2017; Revised 21 February 2019; Accepted 24 April 2019; Available online 25 April 2019.
Keywords: Unsupervised feature selection; Nonnegative Laplacian embedding; Subspace learning; Class labels.

Abstract

Unsupervised feature selection plays an important role in machine learning and data mining, and it is very challenging because class labels are unavailable. In this paper, we propose an unsupervised feature selection framework that combines the discriminative information of class labels with subspace learning. In the proposed framework, nonnegative Laplacian embedding is first utilized to produce pseudo labels, so as to improve the classification accuracy. Then, an optimal feature subset is selected by subspace learning guided by the discriminative information of the class labels, on the premise of maintaining the local structure of the data. We develop an iterative strategy for updating the similarity matrix and the pseudo labels, which yields more accurate pseudo labels, and we prove the convergence of the proposed strategy. Finally, experimental results on six real-world datasets demonstrate the superiority of the proposed approach over seven state-of-the-art ones. © 2019 Elsevier Ltd. All rights reserved.

1. Introduction

As the capabilities of acquiring and storing information increase, more and more attributes (features) are involved in pattern recognition and machine learning. Accordingly, high-dimensional data raise a variety of issues, such as increased storage requirements, prolonged processing time, and redundant information. All these problems deteriorate the performance of learning algorithms and restrict their applications. Therefore, reducing the dimension of data has become an increasingly important task [11]. Generally, there are two ways to reduce the dimension of data. One is feature extraction, which seeks a projection of the original data space onto a lower-dimensional space [10]. The other is feature selection, which seeks a low-dimensional feature subset by removing redundant features while maintaining a high classification accuracy [12,37,59]. Feature selection is generally guided by specific criteria when seeking an optimal feature subset, so it can be formulated as a discrete combinatorial optimization problem [50], and the number of feature combinations increases exponentially as the dimension of the data increases. This often results in "the curse of dimensionality" [3,60]. Considering whether data labels are


available or not, existing approaches can be divided into three categories: supervised [2,17,20,34,44], semi-supervised [6,25,31,55], and unsupervised [8,9,22,27,45,46] approaches.

A supervised approach guides the process of feature selection by using data labels. When there are adequate labeled data, this approach is usually the first choice because of its high reliability and good classification accuracy. Representative supervised methods include information entropy, mutual information, ReliefF, spectral analysis, branch-and-bound techniques, hypergraph learning, and the Hilbert–Schmidt independence criterion [1,2,14,15,28,39,40,53,56,57,66]. Moreover, population-based heuristic search methods, such as particle swarm optimization [16,58–60], genetic algorithms [24,51], the forest optimization algorithm [13], and artificial bee colony optimization [18], have also been used. In real life, however, not all samples are labeled, and labeling samples can be very expensive, which severely limits the applications of this approach.

When only few labeled data are available, the semi-supervised approach is introduced to deal with the "small-labeled-sample" situation [63]. Semi-supervised feature selection can be viewed as a compromise between the supervised and unsupervised cases. The key issue of the semi-supervised approach is to sufficiently utilize both labeled and unlabeled data to boost the learning performance [61]. In recent years, several semi-supervised feature selection methods have been proposed. Liu et al. [31] implemented a new noise-insensitive trace ratio criterion in a graph-based semi-supervised feature selection method. Yang et al. [55] developed a two-level manifold learning method for multimedia data, which


constructed several independent graphs and combined them into the data representation. Yang et al. [52] gave a manifold learning based semi-supervised multimedia search approach, which sorts the data points based on local regression and global alignment. Wang et al. [47] developed a novel semi-supervised representative feature selection algorithm based on information theory and Markov blankets, which can effectively remove irrelevant and redundant features. Recently, Sheikhpour et al. [41] summarized existing semi-supervised feature selection methods in detail.

Compared with the supervised and semi-supervised approaches, the unsupervised feature selection approach is more challenging because no data labels are available. Existing unsupervised methods can generally be divided into two categories, i.e., similarity preservation and clustering performance maximization. The first preserves representative features reflecting the local structure of the original data and selects a feature subset consistent with its global structure. The Laplacian score-based method [19], unsupervised feature selection by preserving class structures [5], maximum margin feature selection with manifold regularization [62], the unsupervised spectral feature selection method for mixed data [43], L2,1-norm based unsupervised optimal feature selection [49], and the global and local structure preserving sparse subspace learning model [68] are among the classical similarity-preservation methods. The second category, i.e., clustering performance maximization, selects representative features that maximize performance in terms of one or more criteria [33]. The most typical strategy is to assign pseudo labels to data points, transforming an unsupervised feature selection problem into a supervised counterpart. Recently, researchers have proposed various methods to generate pseudo labels, including spectral embedding [65], spectral clustering [29,48], matrix factorization [4,21,23,38], and dictionary learning [7,69], to name a few. Among them, the spectral embedding-based method regards the eigenvectors of a similarity matrix as pseudo labels, but its generation process is independent of feature selection. In contrast, the spectral clustering-based method simultaneously generates pseudo labels and selects features. These two methods only preserve the similarity of features through a similarity matrix, which reflects the distribution of data on only part of the dimensions. The matrix factorization-based method generates the pseudo labels with a cluster indicator matrix by learning a set of bases [26]; this kind of method only considers the overfitting of data and ignores the local information of the cluster labels. Besides, the decomposition of the feature density matrix often increases the complexity of an algorithm. As for the last category, the dictionary learning-based method can reflect the distribution of the data structure, but its computational cost is relatively high, resulting in poor applicability. Note that both of the latter two methods emphasize the reconstruction of data; however, few methods can maintain sample similarity and reconstruct data simultaneously.

Focusing on both the similarity and the reconstruction of data, this paper presents a new nonnegative Laplacian embedding-guided subspace learning method for unsupervised feature selection, called NLE-SLFS. The main highlights are as follows:

(1) A new unsupervised learning framework is proposed by integrating nonnegative Laplacian embedding with subspace learning.
A sparsity term in the proposed feature selection model can reduce noisy or redundant features.
(2) An algorithm that iteratively generates the similarity matrix and the pseudo labels is developed, so as to obtain more accurate pseudo labels.
(3) The proposed algorithm has a global optimal solution because of its mathematical convexity. This means that NLE-SLFS has great stability and is not sensitive to the initialization.
(4) Experiments on six benchmark datasets illustrate that NLE-SLFS outperforms several state-of-the-art methods in terms of clustering performance.

This paper is organized as follows. Section 2 provides some basic concepts. Section 3 describes the proposed algorithm. Section 4 evaluates the efficiency of the proposed algorithm. Section 5 gives conclusions and points out several open problems.

2. Related concepts

This section provides a brief review of unsupervised feature selection and Laplacian embedding.

2.1. Unsupervised feature selection

Let X = [x_1, x_2, ..., x_n]^T ∈ R^{n×d} be a set of samples, where d and n are the numbers of features and samples respectively, and x_i ∈ R^d is the feature descriptor of the ith sample. The purpose of unsupervised feature selection is to find a feature subset that contains the most important information of the data. Specifically, unsupervised feature selection can be described by the following objective function to be optimized [70]:

U = GL(X, \lambda_1) + T(X, \lambda_2)    (1)

where the function GL(X, λ_1) is employed to generate pseudo labels, T(X, λ_2) is utilized to select features, and λ_1 and λ_2 are the model parameters. T(X, λ_2) is a loss minimization function with the following expression:

T(X, \lambda_2) = l(X, W) + \theta H(W)    (2)

where l(X, W) is a loss function and W ∈ R^{d×c} is the feature selection matrix. The parameter c represents the number of classes. Let G ∈ R^{n×c} be the pseudo label matrix; then l(X, W) can be formulated as l(X, W) = ‖XW − G‖_{2,1}. H(W) is a sparse regularization term, which is employed to remove noisy features by controlling W, and θ is a regularization parameter used to balance the influence of the sparse regularization term on the results of feature selection. Note that λ_2 in the function (2) denotes all parameters involved in the feature selection model, including the regularization parameter θ.

The performance of a feature selection algorithm depends mainly on GL(X, λ_1). If GL(X, λ_1) can effectively describe the real distribution of the data, the algorithm will have the expected performance. Recently, many methods have been introduced to generate pseudo labels. Tang et al. [45] combined discriminant analysis and pseudo labels in an unsupervised feature selection framework. Yang et al. [54] generated class labels with a linear classifier and proposed a joint framework integrating discriminant analysis and L2,1-norm minimization into unsupervised feature selection. Although various techniques have been proposed to generate pseudo labels, most of them ignore the subspace multiplicity structure of data, i.e., that features from various classes lie in a union of lower-dimensional subspaces. When intra-class variations are very large, the spatial proximity of data, which is widely used in standard clustering algorithms, often does not hold.

In addition, since the paper involves different types of mathematical norms, we list their descriptions here to improve readability: ‖X‖_{2,1} represents the L2,1-norm of a matrix X, ‖X‖_2 represents the L2-norm of a matrix or vector X, and ‖X‖_F is the Frobenius norm of a matrix X.
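As a quick illustration of this notation, the following NumPy lines (an example of ours, not taken from the paper) evaluate the three norms:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)
l21 = np.sum(np.linalg.norm(X, axis=1))   # ||X||_{2,1}: sum of the L2 norms of the rows of X
fro = np.linalg.norm(X, "fro")            # ||X||_F: Frobenius norm of X
v = X[0]
l2 = np.linalg.norm(v)                    # ||v||_2: Euclidean norm of a vector
print(l21, fro, l2)
```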


2.2. Laplacian embedding

The input data of Laplacian embedding is a matrix S, which captures the pairwise similarities among n objects and represents the weights of the edges of a graph with n nodes. Multi-dimensional embedding maps the n nodes of the graph into a lower, p-dimensional subspace with coordinates (g_1, ..., g_p) ∈ R^p. Let ‖g_i − g_j‖_2 be the Euclidean distance between nodes i and j. If g_i is similar to g_j, g_i will be adjacent to g_j in the embedded space, so the value of ‖g_i − g_j‖_2 will be small. Thus, the embedding can be obtained from the following optimization problem:

\min_G J(G) = \sum_{i,j=1}^{p} \|g_i - g_j\|_2^2 S_{ij} = \sum_{i,j=1}^{p} g_i^T (D - S)_{ij} g_j = 2\,\mathrm{Tr}\big(G(D - S)G^T\big)    (3)

where S is the similarity matrix, D = diag(d_1, d_2, ..., d_p) is the degree matrix with d_i = \sum_j S_{ij}, and L = D − S is the graph Laplacian. In addition, a constraint is utilized to prevent G from prematurely converging to 0. The embedding is then obtained by solving the following minimization problem:

\min_G \mathrm{Tr}\big(G^T (D - S) G\big), \quad \text{s.t.}\; G^T G = I    (4)

The solution of the above problem is formed by the k smallest eigenvectors of L. By introducing Laplacian embedding into the graph segmentation problem [36], we have

\min_G \mathrm{Tr}\big[G^T (D - S) G\big], \quad \text{s.t.}\; G^T G = I    (5)

Given the fact that the solution of (4) is composed of the k smallest eigenvectors, the solution of the segmentation matrix G will have mixed signs, resulting in the deviation of G from the ideal solution of cluster labels. Therefore, the clusters cannot be identified directly from the solution of (5). In the proposed framework, we tackle the clustering problem in the following two steps: first, we obtain the eigenvectors of L = D − S by Laplacian embedding; then, we obtain the cluster labels by k-means clustering in the eigenvector space.
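For concreteness, this two-step procedure can be sketched as follows; it is a minimal illustration that assumes a precomputed symmetric similarity matrix S, and the use of SciPy and scikit-learn here is our choice rather than the authors':

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def laplacian_embedding_clusters(S, k):
    """Step 1: embed the graph with the k smallest eigenvectors of L = D - S.
       Step 2: run k-means in the embedded space to obtain cluster labels."""
    D = np.diag(S.sum(axis=1))                      # degree matrix
    L = D - S                                       # graph Laplacian
    _, G = eigh(L, subset_by_index=[0, k - 1])      # k smallest eigenvectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(G)
    return G, labels
```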

3. The proposed framework

This section develops an unsupervised feature selection framework by combining the discriminative information of data labels with subspace learning. To fulfill this task, we first employ the nonnegative Laplacian embedding method to produce pseudo labels. Then, we formulate an optimization model for feature selection and develop an algorithm that iteratively updates the similarity matrix and the pseudo labels simultaneously. Finally, the convergence of the proposed algorithm is analyzed.

3.1. Generating pseudo labels

Laplacian embedding has been employed to solve spectral clustering problems. Most previous studies, however, ignore the non-negativity of the cluster indicator G. Compared with traditional Laplacian embedding, nonnegative Laplacian embedding (NLE) [33] has a soft clustering capability; in other words, its solution can be regarded as a posterior probability with which an object is assigned to a cluster. This paper replaces traditional Laplacian embedding with NLE to generate class labels. Generally, the eigenvectors of G include both positive and negative elements, so they cannot be taken directly as a cluster indicator for assigning sample labels. To address this issue, we add a nonnegative constraint G ≥ 0 to the problem (6). Setting G ≥ 0 ensures that the proposed algorithm directly obtains a usable cluster indicator, i.e., an eigenvector of G with discriminative information, from which a good pseudo label can be obtained for each sample. Let G be an indicator matrix which guides the classification of the data, i.e., a pseudo label matrix, and let S be a similarity matrix capturing the pairwise similarities between samples. Applying nonnegative Laplacian embedding to clustering then amounts to optimizing the following problem:

\max_G \mathrm{Tr}\big[G^T (S - D + \sigma I) G\big], \quad \text{s.t.}\; G^T G = I,\; G \ge 0    (6)

where σ is the largest eigenvalue of the matrix L = D − S, introduced so that the matrix (S − D + σI) is nonnegative and nonsingular. The nonnegativity is achieved by enforcing the constraint G ≥ 0, which has the following benefits: 1) clusters can be obtained directly by solving the problem (6), and 2) the obtained nonnegative solution is more accurate and can be used as the desired cluster indicator. Since ‖S − D + σI‖_2^2 and ‖GG^T‖_2^2 are constants when G^T G = I, and

\big\|(S - D + \sigma I) - G G^T\big\|_2^2 = \|S - D + \sigma I\|_2^2 - 2\,\mathrm{Tr}\big((S - D + \sigma I) G G^T\big) + \|G G^T\|_2^2    (7)

the problem (6) is equivalent to the problem

\min_G \big\|(S - D + \sigma I) - G G^T\big\|_2^2, \quad \text{s.t.}\; G^T G = I,\; G \ge 0    (8)

So the problems (6) and (8) have similar solutions. According to the nonnegative matrix factorization algorithm proposed in [26], we obtain the following updating rule of G for the problem (8):

G_{ij} \leftarrow G_{ij}\,\frac{\big[(S + \sigma I)G + G V^-\big]_{ij}}{\big[D G + G V^+\big]_{ij}}, \quad V = G^T (S - D + \sigma I) G    (9)

where V^+ and V^- are the positive and negative parts of V, respectively. We can therefore employ the rule (9) to update the indicator matrix G in the problem (6). After the matrix G converges, its nonnegative columns provide cluster assignment information, which can be regarded as discriminative information. Guided by this discriminative information of the data, the subspace learning can achieve effective subspace segmentation results.
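A schematic NumPy implementation of the rule (9) might look as follows; the random initialization, the fixed iteration count and the eps safeguard are our simplifications and not part of the paper:

```python
import numpy as np

def nle_pseudo_labels(S, c, n_iter=200, eps=1e-12):
    """Sketch of the multiplicative update (9) for nonnegative Laplacian embedding."""
    n = S.shape[0]
    D = np.diag(S.sum(axis=1))
    sigma = np.linalg.eigvalsh(D - S).max()        # largest eigenvalue of L = D - S
    A = S - D + sigma * np.eye(n)                  # intended to be nonnegative and nonsingular
    G = np.abs(np.random.rand(n, c))               # random nonnegative initialization (our choice)
    for _ in range(n_iter):
        V = G.T @ A @ G
        V_pos = (np.abs(V) + V) / 2.0              # positive part of V
        V_neg = (np.abs(V) - V) / 2.0              # negative part of V
        numer = (S + sigma * np.eye(n)) @ G + G @ V_neg
        denom = D @ G + G @ V_pos + eps
        G = G * numer / denom                      # multiplicative update of rule (9)
    return G
```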

3.2. Problem formulation

After obtaining the pseudo label matrix (or indicator matrix) G, we classify the samples using G, so that each sample is assigned an appropriate class label. Moreover, the L2,1-norm is used to alleviate the negative impact of abnormal samples. As a result, we have the following optimization problem:

\arg\min_W \|XW - G\|_{2,1}    (10)

We expect the classification results to preserve a local structure consistent with that of the original data, so we add the following term:

\arg\min_W \sum_{i,j=1}^{n} S_{ij}\|x_i W - x_j W\|_2^2    (11)

By combining (6), (10), and (11), we obtain the following optimization problem:

\arg\min_{W,G} \|XW - G\|_{2,1} + \sum_{i,j=1}^{n} S_{ij}\|x_i W - x_j W\|_2^2 + \frac{\alpha}{2}\mathrm{Tr}(G^T L G) + \beta\|W\|_{2,1}, \quad \text{s.t.}\; G^T G = I,\; G \ge 0    (12)


where ‖W‖_{2,1} is the sparsity regularization term, which is used to determine whether or not a feature is selected (w^i denotes the ith row of W), and α and β are two tradeoff parameters. The purpose of the subspace embedding is to minimize Tr(G^T L G) under the nonnegative and orthogonal constraints.

3.3. Solving the problem

An effective iterative algorithm is proposed in this subsection to solve the optimization problem (12). Specifically, we alternately update G and W. First, we update W with G fixed, so that the problem (12) reduces to:

\arg\min_W \|XW - G\|_{2,1} + \sum_{i,j=1}^{n} S_{ij}\|x_i W - x_j W\|_2^2 + \beta\|W\|_{2,1}    (13)

Owing to the non-smoothness of the L2,1-norm and the L2-norm in the problem (13), it is difficult to solve directly. We therefore transform the problem (13) into the following problem:

\min_W \mathrm{Tr}\big((XW - G)^T D_v (XW - G)\big) + \mathrm{Tr}\big((XW)^T \tilde{L} (XW)\big) + \beta\,\mathrm{Tr}\big(W^T D_w W\big)    (14)

where the diagonal matrices are given by D_v(ii) = (2\|(XW - G)^i\|_2)^{-1} and D_w(ii) = (2\|W^i\|_2)^{-1}, \tilde{L} = \tilde{D} - \tilde{S}, x_i and x_j are any two samples in the dataset, and \tilde{S}_{ij} = S_{ij}/(2\|x_i W - x_j W\|_2). Setting the derivative of the problem (14) with respect to W equal to 0, we obtain the following formula:

W^{t+1} = \big(X^T D_v^t X + X^T \tilde{L}^t X + \beta D_w^t\big)^{-1} X^T D_v^t G    (15)

When W given by the formula (15) is fixed, the problem (12) simplifies to:

\arg\min_G \; \frac{1}{2}\|XW - G\|_{2,1} + \frac{\alpha}{2}\mathrm{Tr}(G^T L G) + \frac{\eta}{4}\|G^T G - I\|_F^2, \quad \text{s.t.}\; G \ge 0    (16)

where the penalty term \frac{\eta}{4}\|G^T G - I\|_F^2 is used to relax the orthogonality constraint. By incorporating the Lagrangian function into the problem (16), the objective can be transformed as follows:

L(G, \rho) = \frac{1}{2}\|XW - G\|_{2,1} + \frac{\alpha}{2}\mathrm{Tr}(G^T L G) + \frac{\eta}{4}\|G^T G - I\|_F^2 + \langle \rho, G \rangle    (17)

where ρ ∈ R^{n×c} is a Lagrangian multiplier. Setting the derivative of the function (17) with respect to G to 0, we have:

\frac{\partial L}{\partial G} = XW - G + \alpha L G + \eta (G G^T G - G) + \rho = 0    (18)

Based on the Karush-Kuhn-Tucker (KKT) condition [32], we have ρ_{ij} G_{ij} = 0. Further, the update formula of G can be obtained as follows:

G^{t+1}_{ij} = G^t_{ij}\,\frac{(XW + \eta G^t)_{ij}}{\big(G^t + \alpha L G^t + \eta G^t G^{tT} G^t\big)_{ij}}    (19)

Finally, G and W converge by repeating the formulas (15) and (19), and the matrix W is the final output of the proposed algorithm. Algorithm 1 shows the detailed steps of the proposed iterative procedure.

Algorithm 1. The proposed iterative algorithm for NLE-SLFS.
Input: the data matrix X ∈ R^{n×d}; the number of selected features k; the control parameters α, β.
Output: the feature selection matrix W ∈ R^{d×c}.
1: Obtain an initial value of the matrix G by the method proposed in Section 3.1, and set t = 1;
2: Repeat the following steps until G converges:
   2.1 Update W by the formula (15);
   2.2 Update G by the formula (19);
   2.3 Set t = t + 1;
3: Sort ‖W_i‖_2, i = 1, ..., d, and select the k features with the largest values to form the final feature subset.
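For illustration, a compact sketch of Algorithm 1 is given below. It assumes the data matrix X, a similarity matrix S and an initial nonnegative G (for example, produced by the NLE sketch in Section 3.1) are available; the least-squares initialization of W, the fixed iteration count, the value of eta and the clipping safeguards are our own simplifications rather than the authors' implementation:

```python
import numpy as np

def nle_slfs(X, S, G, k, alpha=1.0, beta=1.0, eta=1e8, n_iter=50, eps=1e-12):
    """Sketch of Algorithm 1: alternate the W update (15) and the G update (19),
    then rank features by the row norms of W."""
    n, d = X.shape
    D = np.diag(S.sum(axis=1))
    L = D - S
    # least-squares initialization of W (our choice; the paper does not specify one)
    W = np.linalg.lstsq(X, G, rcond=None)[0]
    for _ in range(n_iter):
        # ----- update W with G fixed, formula (15) -----
        R = X @ W - G
        Dv = np.diag(1.0 / (2.0 * np.linalg.norm(R, axis=1) + eps))
        Dw = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
        XW = X @ W
        dist = np.linalg.norm(XW[:, None, :] - XW[None, :, :], axis=2)
        S_t = S / (2.0 * dist + eps)                      # reweighted similarity
        L_t = np.diag(S_t.sum(axis=1)) - S_t
        A = X.T @ Dv @ X + X.T @ L_t @ X + beta * Dw
        W = np.linalg.solve(A, X.T @ Dv @ G)
        # ----- update G with W fixed, formula (19) -----
        numer = X @ W + eta * G
        denom = G + alpha * (L @ G) + eta * (G @ (G.T @ G))
        # clipping below is a numerical safeguard of ours, not part of the paper
        G = G * np.maximum(numer, eps) / np.maximum(denom, eps)
    scores = np.linalg.norm(W, axis=1)                    # ||W_i||_2 for every feature
    return np.argsort(scores)[::-1][:k]                   # indices of the k selected features
```

As in Algorithm 1, the reweighted graph term is recomputed from the current XW in every iteration, and the k features with the largest ‖W_i‖_2 are returned.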

3.4. Convergence analysis

This subsection investigates the convergence of the proposed algorithm. First of all, we introduce the following lemma [35].

Lemma 1. For any two nonzero vectors p and q, the following inequality holds:

\|q\|_2 - \frac{\|q\|_2^2}{2\|p\|_2} \le \|p\|_2 - \frac{\|p\|_2^2}{2\|p\|_2}    (20)

Let the objective function of the problem (12) be:

F(W, G) = \frac{1}{2}\|XW - G\|_{2,1} + \frac{1}{2}\sum_{i,j=1}^{n} S_{ij}\|x_i W - x_j W\|_2^2 + \frac{\alpha}{2}\mathrm{Tr}(G^T L G) + \beta\|W\|_{2,1} + \frac{\eta}{4}\|G^T G - I\|_F^2    (21)

Firstly, we consider the case of fixing G. Let

f(W) = \mathrm{Tr}\big((XW - G)^T D_v (XW - G)\big) + \beta\,\mathrm{Tr}\big(W^T D_w W\big)    (22)

According to the formula (14), the function (21) then becomes

F_G(W) = f(W) + \sum_{i,j=1}^{n} \tilde{S}_{ij}\|x_i W - x_j W\|_2^2    (23)

According to the formula (15), we can obtain

W^{t+1} = \arg\min_W f(W) + \sum_{i,j=1}^{n} \tilde{S}^t_{ij}\|x_i W - x_j W\|_2^2    (24)

Because \tilde{S}^t_{ij} = S_{ij}/(2\|x_i W^t - x_j W^t\|_2), we have

f(W^{t+1}) + \sum_{i,j=1}^{n} S_{ij}\frac{\|x_i W^{t+1} - x_j W^{t+1}\|_2^2}{2\|x_i W^t - x_j W^t\|_2} \le f(W^t) + \sum_{i,j=1}^{n} S_{ij}\frac{\|x_i W^t - x_j W^t\|_2^2}{2\|x_i W^t - x_j W^t\|_2}    (25)

According to Lemma 1, we have

\sum_{i,j=1}^{n} S_{ij}\left(\|x_i W^{t+1} - x_j W^{t+1}\|_2 - \frac{\|x_i W^{t+1} - x_j W^{t+1}\|_2^2}{2\|x_i W^t - x_j W^t\|_2}\right) \le \sum_{i,j=1}^{n} S_{ij}\left(\|x_i W^t - x_j W^t\|_2 - \frac{\|x_i W^t - x_j W^t\|_2^2}{2\|x_i W^t - x_j W^t\|_2}\right)    (26)

Taking both (25) and (26) into consideration, the following inequality holds:

f(W^{t+1}) + \sum_{i,j=1}^{n} S_{ij}\|x_i W^{t+1} - x_j W^{t+1}\|_2 \le f(W^t) + \sum_{i,j=1}^{n} S_{ij}\|x_i W^t - x_j W^t\|_2    (27)

Hence, we have F(W^{t+1}, G) < F(W^t, G), suggesting that the value of the objective function (23) decreases monotonically when updating W by Algorithm 1.


Therefore, the proposed algorithm is convergent with respect to W when G is fixed. Since the problem (13) is convex, the solution given by the formula (15) is the global optimal solution of the problem (13).

Secondly, we fix W. According to the formula (19), we can obtain a function of G as follows:

U(G) = \frac{1}{2}\|XW - G\|_{2,1} + \frac{\alpha}{2}\mathrm{Tr}(G^T L G) + \frac{\eta}{4}\|G^T G - I\|_F^2    (28)

Since the value of W^t is fixed, we have U(G^t) = F(W^t, G^t). By introducing an auxiliary function of U as shown in [67], it is easy to prove that U(G^{t+1}) < U(G^t). Thus, we have

F(W^t, G^{t+1}) \le F(W^t, G^t)    (29)

Based on the formulas (27) and (29), the following inequality is obtained:

F(W^{t+1}, G^{t+1}) \le F(W^t, G^{t+1}) \le F(W^t, G^t)    (30)

This indicates that the objective of the problem (12) monotonically decreases under the updating rules in Algorithm 1. Therefore, the proposed algorithm is convergent.

3.5. Further discussion

In the problem (12), if we consider the extreme case that G is enforced to be linear, i.e., G = XW, we obtain the following simplified optimization problem:

\min_W \mathrm{Tr}\big(W^T X^T L X W\big) + \beta\|W\|_{2,1}, \quad \text{s.t.}\; XW > 0,\; W^T W = I    (31)

Obviously, the computational complexity and the analysis of the above problem are simpler than those of the problem (12). However, as noted in [42], G is likely to be nonlinear in many applications. Therefore, the problem (12) proposed in this paper is superior to the problem (31) because it is not restricted to a linear G. Moreover, by incorporating discriminative analysis and L2,1-norm minimization into a joint framework, the UDFS algorithm [54] also proposed an objective function similar to the problem (31). Our experimental results in Section 4 show that the proposed method is superior to UDFS on most datasets.

To exploit the discriminative information in unsupervised scenarios, pseudo-label techniques have been studied and used in earlier work, such as MCFS [5] and NDFS [29]. As indicated in [29], performing clustering and feature selection simultaneously explicitly enforces that G is linearly approximated by the selected features, making the results more accurate. Like the NDFS algorithm, our feature selection algorithm also learns G and W simultaneously. However, compared with NDFS, our method adds a local structure term to the problem (12) in order to keep a local structure consistent with that of the original data. Furthermore, the experimental results in Section 4 show that both the proposed method and NDFS are superior to three comparison algorithms without label information, LS, SPEC and UDFS, on most datasets, and, with the help of the local structure term, the performance of the proposed method is also better than that of NDFS on most datasets.

4. Experiments and analyses

In order to evaluate the effectiveness of the proposed NLE-SLFS algorithm, we apply it to six real datasets and compare it with seven state-of-the-art unsupervised algorithms.

4.1. Datasets

We select six datasets from different fields for the experiments. Table 1 lists brief information about these datasets. Among these datasets, COIL20, ORL32, Yale64 and warpPIE contain various kinds of face or object images; Isolet1 contains speech signals of 30 speakers who spoke the name of each alphabet letter twice; and TOX-171 is the microarray dataset with retrieval ID GDS1454 from the GEO gene expression data repository. All these data are normalized for standardization.

Table 1. Brief information of the benchmark datasets.

Dataset    # of instances   # of features   # of classes   Type of data
Isolet1    1560             617             26             Spoken letter
ORL32      400              1024            40             Face images
COIL20     1440             1024            20             Object images
Yale64     165              1024            15             Face images
WarpPIE    210              2420            10             Face images
TOX-171    171              5748            4              Biological data

4.2. Experimental settings

We compare the proposed algorithm with the following seven methods:

1) Laplacian score (LS) [19]: LS is a filter method which employs the Laplacian score to evaluate the effectiveness of features via Laplacian eigenmaps and locality-preserving projection. It selects features based on a feature similarity matrix, which preserves the local manifold structure of the samples.
2) Multi-cluster unsupervised feature selection (MCFS) [5]: MCFS detects the distribution of data by spectral analysis and formulates feature selection as a regression problem with an L1-norm minimization regularization term to retain the cluster structure of the data.
3) Unsupervised discriminative feature selection (UDFS) [54]: UDFS combines the local discriminative information of data with an L2,1-norm sparse constraint to perform feature selection.
4) Regularized self-representation (RSR) [71]: RSR selects the most representative features by combining an L2,1-norm fitting term with L2,1-norm minimization regularization, which promotes robustness to outliers.
5) Spectral feature selection (SPEC) [64]: SPEC is also a filter method, which selects features by using the structure of the graph matrix induced from the pairwise sample similarities. Unlike other feature selection methods, SPEC is a unified supervised and unsupervised feature selection algorithm.
6) Unsupervised feature selection using nonnegative spectral analysis (NDFS) [29]: NDFS embeds spectral clustering to guide feature selection directly. In order to obtain more accurate results and reduce redundant features, it also imposes a nonnegative constraint and L2,1-norm minimization regularization on the scaled cluster indicator matrix and the feature selection process.
7) Unsupervised feature selection via diversity-induced self-representation (DISR) [30]: DISR takes into account both the representativeness and the diversity of features and employs the self-representation of data to preserve discriminative features.

For all the datasets, the number of selected features is set from 10 to 100 with an interval of 10, and from 100 to 200 with an interval of 20. Following previous studies [30,29], the size of the nearest neighborhood used to build the Laplacian matrix is set to m = 5 for MCFS, LS, UDFS, NDFS, DISR and NLE-SLFS. MCFS, NDFS, DISR and NLE-SLFS require the parameter m to build a similarity matrix, and UDFS requires it to build the local total scatter and between-class scatter matrices. For a fair comparison, the regularization parameter and the sparsity parameter in LS, SPEC, MCFS, UDFS, NDFS, DISR and NLE-SLFS are tuned within {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}. The maximum number of iterations for all the algorithms is set to 100. The parameters α and β are fixed to 1 for all the tests.
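The m-nearest-neighbor similarity matrix used by these methods can be built, for example, as follows; the heat-kernel weighting and the bandwidth sigma are assumptions on our part, since the paper does not specify the kernel:

```python
import numpy as np

def knn_similarity(X, m=5, sigma=1.0):
    """m-nearest-neighbor similarity graph with a heat kernel (kernel choice is an assumption)."""
    n = X.shape[0]
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:m + 1]                         # m nearest neighbors (skip self)
        S[i, idx] = np.exp(-d2[i, idx] / (2.0 * sigma ** 2))
    return np.maximum(S, S.T)                                    # symmetrize the graph
```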

Table 2. The ACC values and the associated number of selected features obtained by the comparison algorithms (average ACC ± std / # features).

Data sets   LS                     MCFS                   NDFS                   UDFS
COIL20      0.4395 ± 0.0258/50     0.5886 ± 0.0642/200    0.6312 ± 0.0517/120    0.5549 ± 0.0870/90
Isolet1     0.4752 ± 0.0761/200    0.5537 ± 0.0435/200    0.5387 ± 0.0302/200    0.3007 ± 0.0402/180
ORL32       0.4280 ± 0.0718/200    0.5371 ± 0.0422/200    0.5569 ± 0.0331/200    0.4020 ± 0.0309/200
TOX-171     0.3967 ± 0.0315/60     0.4490 ± 0.0425/200    0.4645 ± 0.0167/200    0.4145 ± 0.0167/200
warpPIE     0.3029 ± 0.0285/180    0.5042 ± 0.0234/200    0.3287 ± 0.0156/200    0.4985 ± 0.0355/200
Yale64      0.3889 ± 0.0307/200    0.4406 ± 0.0348/200    0.4507 ± 0.0135/200    0.3684 ± 0.0119/200

Data sets   SPEC                   RSR                    DISR                   NLE-SLFS
COIL20      0.4721 ± 0.0958/200    0.5719 ± 0.0190/100    0.6341 ± 0.0193/160    0.6729 ± 0.0365/90
Isolet1     0.3997 ± 0.1245/140    0.5072 ± 0.0320/200    0.5301 ± 0.0321/200    0.6417 ± 0.0310/200
ORL32       0.4584 ± 0.0512/200    0.5025 ± 0.0199/200    0.5368 ± 0.0385/200    0.6507 ± 0.0264/200
TOX-171     0.3711 ± 0.0229/200    0.4670 ± 0.0265/200    0.4589 ± 0.0210/200    0.5114 ± 0.0189/200
warpPIE     0.2975 ± 0.0270/200    0.3660 ± 0.0474/200    0.4537 ± 0.0474/200    0.5483 ± 0.0243/200
Yale64      0.3834 ± 0.0385/200    0.4427 ± 0.0391/200    0.4979 ± 0.0264/200    0.5101 ± 0.0326/200

Table 3. The NMI values and the associated number of selected features obtained by the comparison algorithms (average NMI ± std / # features).

Data sets   LS                     MCFS                   NDFS                   UDFS
COIL20      0.5236 ± 0.0359/200    0.6215 ± 0.0296/200    0.6830 ± 0.0461/200    0.6515 ± 0.0824/200
Isolet1     0.6383 ± 0.0896/200    0.6424 ± 0.0575/140    0.6862 ± 0.0522/200    0.4636 ± 0.0479/200
ORL32       0.6274 ± 0.0406/200    0.6968 ± 0.0290/200    0.7152 ± 0.0400/200    0.6530 ± 0.0634/200
TOX-171     0.1383 ± 0.0282/200    0.1889 ± 0.0550/140    0.1240 ± 0.0110/80     0.1486 ± 0.0211/140
warpPIE     0.3121 ± 0.0265/200    0.5126 ± 0.0318/70     0.2476 ± 0.0232/50     0.4686 ± 0.0241/60
Yale64      0.4732 ± 0.0355/100    0.5299 ± 0.0353/200    0.5378 ± 0.0133/200    0.4246 ± 0.0086/200

Data sets   SPEC                   RSR                    DISR                   NLE-SLFS
COIL20      0.5729 ± 0.0627/200    0.6401 ± 0.0212/70     0.6732 ± 0.0558/200    0.7140 ± 0.0235/60
Isolet1     0.5443 ± 0.1120/200    0.6769 ± 0.0690/200    0.6396 ± 0.0745/200    0.7303 ± 0.0591/200
ORL32       0.6141 ± 0.0337/200    0.7117 ± 0.0248/200    0.6992 ± 0.0474/200    0.7519 ± 0.0277/160
TOX-171     0.0831 ± 0.0154/10     0.1994 ± 0.0258/200    0.2351 ± 0.0241/120    0.2441 ± 0.0179/100
warpPIE     0.3054 ± 0.0208/30     0.4160 ± 0.0297/90     0.3604 ± 0.0513/20     0.5062 ± 0.0573/160
Yale64      0.4677 ± 0.0334/200    0.5535 ± 0.0235/200    0.5739 ± 0.0268/200    0.5844 ± 0.0314/200

When the feature selection process is finished, the k-means algorithm is employed to examine the effectiveness of the selected features. Since the k-means algorithm depends on the initial clusters, we repeatedly run it 20 times with different random initializations and adopt the average values. The performance of an unsupervised feature selection algorithm is evaluated based on its clustering results. The number of clusters is set to the true number of classes provided in the datasets. We use two evaluation metrics, i.e., clustering accuracy (ACC) and normalized mutual information (NMI), to measure the clustering performance. Let p_i and q_i be the predicted label and the real label of the ith sample in a dataset, respectively. Then the value of ACC can be calculated as follows:

\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(q_i, \mathrm{map}(p_i)\big)}{n}    (32)

where n is the number of samples, map(·) is a permutation mapping function which maps each predicted label to its corresponding true label by the Kuhn–Munkres algorithm, and δ(a, b) is the delta function defined as follows:

\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases}    (33)

ACC measures the percentage of predicted labels that agree with the true class labels; a larger value of ACC indicates that the predicted labels are closer to the true ones, and hence better clustering performance. Given two variables P and Q, the value of NMI can be calculated as follows:

\mathrm{NMI} = \frac{I(P, Q)}{\sqrt{H(P)H(Q)}}    (34)

where I(P, Q) is the mutual information of P and Q, and H(P) and H(Q) are the entropies of P and Q, respectively. In the experiments, NMI reflects the consistency between the true labels and the cluster labels obtained with the selected features; a higher value of NMI(P, Q) implies that P predicts Q well.
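For reference, the two metrics can be computed as in the following sketch, which uses the Kuhn–Munkres assignment from SciPy for the mapping in (32) and the scikit-learn implementation of NMI with geometric-mean normalization for (34); the helper names are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC as in Eq. (32): best one-to-one mapping of predicted clusters to true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_cls = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n_cls, n_cls), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                               # count co-occurrences
    row, col = linear_sum_assignment(-cost)           # maximize matched samples
    return cost[row, col].sum() / len(y_true)

def clustering_nmi(y_true, y_pred):
    """NMI as in Eq. (34), normalized by the geometric mean of the entropies."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="geometric")
```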

where I(P, Q) is mutual information of P and Q, H(P) and H(Q) are the entropies of P and Q, respectively. In the experiments, the NMI is used to reflect the consistency of similarity between true labels and clustering ones by using the selected features, and a higher value of NMI(P, Q) implies that P can well predict Q. 4.3. Experimental results We report and analyze experimental results of all compared algorithms in this subsection. Tables 2 and 3 list the values of ACC and NMI produced by the eight algorithms. For each algorithm, we vary the number of selected features following the above settings, given the fact that the optimal feature subset is unknown in advance. The best results are highlighted in bold. From the two tables, NLE-SLFS performs better than all the compared algorithms for all the datasets but warpPIE. For warpPIE, NLE-SLFS has the second best value in terms of NMI, but has the best value in terms of ACC. NDFS performs reasonably well for a part of cases but achieves bad solutions for some cases such as warpPIE. Therefore, NLE-SLFS is superior to the comparison algorithms on most datasets in terms of ACC and NMI. Furthermore, we compare clustering results given by different algorithms with different number of features to illustrate the influence of feature selection on clustering. Figs. 1 and 2 depict the values of ACC and NMI with respect to the number of selected features, respectively. We have the following observations from Fig. 2. When the number of features belongs to {10,...,200}, NLE-SLFS has the best ACC values on coil20, Isolet1, TOX-171, and ORL32, which are significantly greater than the comparison algorithms.NLE-SLFS achieves the highest ACC values for all the six datasets, with its classification accuracies being 0.7261 for coil20, 0.6874 for Isolet1, 0.6872 for ORL32, 0.5335 for TOX-171, 0.5845 for warpPIE, and 0.5556 for Yale64.For warpPIE, when the number of selected features is larger


Fig. 1. The clustering accuracy (ACC) obtained by different feature selection algorithms with different numbers of features.

than 50, the ACC value obtained by NLE-SLFS is significantly better than those of the other seven comparison algorithms. When it is smaller than 50, UDFS achieves the best ACC value, whereas NLE-SLFS obtains the second best value. For Yale64, when the number of features is smaller than 80 or larger than 160, NLE-SLFS obtains the best ACC values; it obtains the second best results when the number of features is within the range [80, 160].

The following observations can be obtained from Fig. 2. The NMI values obtained by NLE-SLFS are superior to those of the comparison algorithms for ORL32 and Isolet1. For COIL20, a larger NMI value is obtained when the number of selected features is smaller than 120; when the number is more than 120, NLE-SLFS still has good NMI values, similar to those of UDFS and DISR. For TOX-171, NLE-SLFS has better NMI values than UDFS, SPEC, LS, and NDFS


Fig. 2. The normalized mutual information (NMI) obtained by different feature selection algorithms with different numbers of features.

for any number of features, and it also has better NMI values than RSR and MCFS in most cases. DISR still shows a good performance, which is competitive with NLE-SLFS. For Yale64, NLE-SLFS has better NMI values than UDFS, SPEC, LS, MCFS, and NDFS when the number of selected features varies from 10 to 200; DISR also has good results which can compete with NLE-SLFS. Moreover, NLE-SLFS achieves the largest NMI values for all six datasets. Its values are 0.7412 for COIL20, 0.789 for Isolet1, 0.7815 for ORL32, 0.288 for TOX-171, 0.5547 for warpPIE, and 0.6260 for Yale64.

To evaluate the efficiency of the proposed model, Table 4 lists the time cost of the different algorithms on the datasets. Because the LS algorithm separates the evaluation of features from the clustering learning task, it takes less time than the other algorithms. However, this is also the reason why it achieves poorer performance than the other algorithms (see Tables 2 and 3 for details).


Table 4. Time cost of different algorithms over different datasets (unit: seconds).

Dataset    LS               MCFS             NDFS             UDFS
ORL32      5.581 ± 0.786    9.063 ± 0.739    11.139 ± 0.915   10.918 ± 0.755
COIL20     30.926 ± 1.847   18.582 ± 1.168   19.463 ± 1.423   16.824 ± 1.374
Isolet1    0.438 ± 0.025    15.285 ± 0.968   18.592 ± 1.269   14.539 ± 1.067
Yale64     2.439 ± 0.738    9.613 ± 0.786    13.183 ± 0.037   11.66 ± 0.914
TOX-171    11.89 ± 0.668    170.92 ± 1.772   1.849 ± 0.019    1763.18 ± 2.536
WarpPIE    4.695 ± 0.017    40.95 ± 1.423    29.15 ± 1.237    71.23 ± 1.723

Dataset    SPEC              RSR              DISR             NLE-SLFS
ORL32      20.44 ± 1.021     23.575 ± 1.237   30.126 ± 1.428   10.446 ± 0.989
COIL20     57.4345 ± 1.725   23.548 ± 1.236   27.836 ± 1.227   15.521 ± 1.045
Isolet1    24.584 ± 1.976    20.561 ± 1.235   19.589 ± 1.563   17.985 ± 1.037
Yale64     1.507 ± 0.019     32.582 ± 0.148   26.575 ± 1.847   23.493 ± 1.159
TOX-171    11.23 ± 0.843     73.589 ± 1.846   78.483 ± 1.239   27.683 ± 1.265
WarpPIE    4.117 ± 0.084     60.588 ± 1.941   62.542 ± 1.941   26.958 ± 1.151

Table 5. The ACC values obtained by different algorithms on datasets with various proportions of noise.

            COIL20              Isolet1             ORL32
Method      5%        10%       5%        10%       5%        10%
SPEC        0.55659   0.51413   0.39192   0.36801   0.53214   0.48851
LS          0.4852    0.44274   0.49903   0.46394   0.47137   0.46235
MCFS        0.6376    0.61309   0.5782    0.55358   0.53262   0.54862
UDFS        0.64652   0.59111   0.44166   0.30179   0.45221   0.41512
NDFS        0.65421   0.64274   0.54935   0.53109   0.56311   0.54423
RSR         0.60128   0.58806   0.52836   0.51511   0.54331   0.52208
DISR        0.64552   0.65162   0.53479   0.51876   0.55269   0.55279
NLE-SLFS    0.64617   0.70686   0.58592   0.57031   0.60368   0.57227

            TOX-171             WarpPIE             Yale64
Method      5%        10%       5%        10%       5%        10%
SPEC        0.38362   0.3728    0.39261   0.33857   0.40724   0.38878
LS          0.40261   0.39372   0.38011   0.32179   0.46878   0.40237
MCFS        0.45714   0.44929   0.56092   0.48095   0.53269   0.49561
UDFS        0.45058   0.40488   0.49857   0.47309   0.37327   0.3603
NDFS        0.52631   0.45731   0.39523   0.33071   0.48631   0.47575
RSR         0.46852   0.44966   0.41659   0.40802   0.52794   0.49141
DISR        0.44567   0.43553   0.40351   0.39841   0.53948   0.51978
NLE-SLFS    0.44764   0.42371   0.49632   0.44743   0.55123   0.51815

Compared with the other wrapper-based and embedded algorithms, the proposed model shows a significant improvement in time cost, where the use of the selection matrix W and the pseudo label matrix G both play important roles.

In conclusion, we can see from these tables and figures that NLE-SLFS performs better than the other comparison algorithms in terms of both ACC and NMI on most datasets. There are possibly three reasons: 1) the data labels are first produced using NLE, which provides good guidance for feature selection; 2) the pseudo labels and the feature selection matrix are alternately updated during the process of feature selection, which can effectively approximate the real conditions of the data; and 3) the local structure preservation term is conducive to maintaining the original structure of the data. Hence, NLE-SLFS eliminates redundant features for clustering, which is beneficial for reducing the computational complexity of a learning algorithm.

Furthermore, in order to illustrate the ability of the proposed algorithm to handle noisy data, we added 5% and 10% noise to each of the six datasets. Experimental results under different noise proportions are shown in Tables 5 and 6. We can see from Tables 5 and 6 that for data with fewer features, noise has a greater impact on the performance of the proposed algorithm. However, for high-dimensional data that contain a certain percentage of noise, noise has little effect on the results of feature selection. The reason is that for low-dimensional data, noise inevitably destroys the local structure of the data, whereas for high-dimensional data, noise has less influence on the structure because the internal structure is sparse. By exploiting the local geometric structure of the data distribution, a better performance can be expected. The proposed method uses a local structure retention term, which makes it more competitive than other methods in dealing with noisy data. Compared with other methods, it can be observed that feature selection can be enhanced by removing noisy and redundant information; besides, selecting a feature subset also makes the subsequent processing more effective, which is crucial for processing high-dimensional data. Moreover, our proposed method still has a strong ability to deal with noisy data: as shown in Tables 5 and 6, its performance on the six datasets remains very good.

4.4. Sensitivity analysis of parameters

We conduct a sensitivity analysis of the impact of the parameters in (15) and (13) on the performance of our algorithm. Firstly, we fix α and change the value of β within {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000} and the number of selected features k within {10, 20, ..., 100, 120, ..., 200}. Figs. 3 and 4 depict the ACC and NMI values of NLE-SLFS for different values of k and β, respectively. It can be seen that the performance of NLE-SLFS changes differently for different datasets in terms of both ACC and NMI. Taking the ACC values shown in Fig. 3 as an example, a large β improves the performance of NLE-SLFS on Isolet1 but degrades its performance on Yale64 and ORL32. A good result is achieved when β lies in the middle of the tuned range for most datasets. Next, we fix β and change the value of the parameter α and the number of selected features k. Figs. 5 and 6 show the ACC and NMI values of NLE-SLFS under various combinations of k and α. We can also see that the performance of NLE-SLFS changes differently for different datasets when α varies within {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}. Specifically, the ACC and NMI values of NLE-SLFS change little for COIL20, ORL32 and warpPIE; for the remaining datasets, the change of α has a relatively clear influence on the performance of NLE-SLFS. For Isolet1, NLE-SLFS has good ACC and NMI values when α is set to a larger value; for Yale64, a small α helps NLE-SLFS improve its ACC values; and for TOX-171, when α = 1, NLE-SLFS has good ACC and NMI values for different values of k. Overall, the proposed NLE-SLFS shows good robustness under different values of α and β.
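The tuning protocol described above amounts to a simple grid search, sketched below; run_nle_slfs and evaluate are hypothetical callables (for instance, the Algorithm 1 sketch given earlier and a k-means-based ACC/NMI evaluator) and are not defined in the paper:

```python
def sensitivity_grid(X, S, G0, y_true, run_nle_slfs, evaluate):
    """Grid search over beta and k, following the ranges stated in the text."""
    betas = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]
    ks = list(range(10, 101, 10)) + list(range(120, 201, 20))
    results = {}
    for beta in betas:
        for k in ks:
            selected = run_nle_slfs(X, S, G0.copy(), k, beta=beta)   # feature indices
            results[(beta, k)] = evaluate(X[:, selected], y_true)    # e.g. (ACC, NMI)
    return results
```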


Fig. 3. The ACC values given by NLE-SLFS with different k and β, where # feature denotes the number of selected features k and beta denotes β.


Fig. 4. The NMI values given by NLE-SLFS with different k and β, where # feature denotes the number of selected features k and beta denotes β.


Fig. 5. The ACC values given by NLE-SLFS with different k and α, where # feature denotes the number of selected features k and alpha denotes α.


Fig. 6. The NMI values given by NLE-SLFS with different k and α, where # feature denotes the number of selected features k and alpha denotes α.


Table 6. The NMI values obtained by different algorithms on datasets with various proportions of noise.

            COIL20              Isolet1             ORL32
Method      5%        10%       5%        10%       5%        10%
SPEC        0.64584   0.57383   0.58644   0.54028   0.70585   0.61803
LS          0.51105   0.47762   0.63621   0.5949    0.65636   0.62591
MCFS        0.64253   0.56974   0.67085   0.61829   0.72886   0.68138
UDFS        0.67768   0.64042   0.55427   0.449936  0.69818   0.66503
NDFS        0.73417   0.64646   0.74498   0.64882   0.75565   0.65296
RSR         0.6479    0.61446   0.66649   0.65698   0.70116   0.69695
DISR        0.62327   0.61832   0.6235    0.61732   0.70892   0.70189
NLE-SLFS    0.70164   0.58206   0.68126   0.69421   0.72569   0.7161

            TOX-171             WarpPIE             Yale64
Method      5%        10%       5%        10%       5%        10%
SPEC        0.13709   0.07263   0.31847   0.30256   0.48836   0.45276
LS          0.18277   0.17984   0.35439   0.32661   0.48644   0.46187
MCFS        0.19723   0.15815   0.54649   0.48297   0.57141   0.5583
UDFS        0.1283    0.13527   0.4855    0.44483   0.38892   0.39212
NDFS        0.12706   0.10879   0.27302   0.23214   0.53629   0.52821
RSR         0.25543   0.2382    0.41563   0.40782   0.52176   0.50754
DISR        0.21647   0.19537   0.30851   0.28251   0.58132   0.5645
NLE-SLFS    0.25114   0.22811   0.53618   0.49427   0.60819   0.5803

The two parameters can be chosen from a wide range, but the use of too small or too large values is not recommended. Moreover, although how to identify the optimal values of the two parameters depends on the data and is still an open problem, the results in Tables 5 and 6 show that NLE-SLFS obtains satisfactory solutions on the test datasets compared with the other algorithms.

Acknowledgments This work was jointly supported by National Natural Science Foundation of China (No. 61876185, 61573361), and Six Talent Peak Projects in Jiangsu Province (No. DZXX-053).

References 5. Conclusions In this paper, we proposed an unsupervised feature selection framework based onthe subspace learning and the nonnegative Laplacian embedding, called NLE-SLFS. In the framework, the nonnegative Laplacian embedding was used to generate high-quality pseudo class labels, and the subspace learning with local structure preservation was developed to find optimal feature subset under the help of discriminative information provided by pseudo class labels. The effective cooperation between the nonnegative Laplacian embedding and the subspace learning obviously improved the performance of the proposed algorithm. The analysis from theory to experiments verified the effectiveness of the proposed algorithm. We have conduct extensive numerical tests on six real-world data to verify the effectiveness of the proposed NLE-SLFS. The experimental results proved that our NLE-SLFS can achieve equal or better performance compared with existing representative methods, LS, SPEC, MCFS, UDFS, SPEC, NDFS and DISR, in terms of ACC/NMI measures. Moreover, an additional experiment also showed that the proposed NLE-SLFS has stable performance for the change of the two key tradeoff parameters α and β . However, NLE-SLFS maybe has several deficiencies. Firstly, its performance relies partly on the quality of pseudo class labels generated. How to get accurate results for high-dimensional complex data is still an open problem. Secondly, NLE-SLFS may converge to a local optimal feature subset because the process of selecting features is independent of any learning algorithm. In the future work, we will study the applications of nature-inspired algorithms with global exploration capability in unsupervised feature selection problems. In addition, extending NLE-SLFS to tackle more structure complex datasets, like data with missing values or corruptions, is our research contents in the future.

[1] E. Atashpaz-Gargari, M.S. Reis, U.M. Braga-Neto, J. Barrera, E.R. Dougherty, A fast branch-and-bound algorithm for u-curve feature selection, Pattern Recognit. 73 (2018) 172–188. [2] J. Bedo, K. Borgwardt, A. Gretton, K.M. Borgwardt, Supervised feature selection via dependence estimation, Comput. Sci. (2007) 823–830. abs/0704.2668. [3] V. Bolon-Canedo, N. Sanchez-Marono, A. Alonso-Betanzos, Recent advances and emerging challenges of feature selection in the context of big data, Knowl. Based Syst. 86 (2015) 33–45. [4] D. Cai, X. He, J. Han, T.S. Huang, Graph regularized nonnegative matrix factorization for data representation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 1548–1560. [5] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp. 333–342. [6] X. Chang, F. Nie, Y. Yang, H. Huang, A convex formulation for semi-supervised multi-label feature selection, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014, pp. 1171–1177. [7] J. Chen, L. Jiao, Z. Wen, High-level feature selection with dictionary learning for unsupervised SAR imagery terrain classification, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 10 (1) (2017) 145–160. [8] L. Du, Y.D. Shen, Unsupervised feature selection with adaptive structure learning, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, ACM, New York, NY, USA, 2015, pp. 209–218. [9] L. Du, Z. Shen, X. Li, P. Zhou, Y.D. Shen, Local and global discriminative learning for unsupervised feature selection, in: Proceedings of the IEEE 13th International Conference on Data Mining, 2013, pp. 131–140. [10] R.O. Duda, P.E. Hart, D. G. Stork, Pattern Classification, John Wiley-Sons, New York, USA, 2012. [11] H. Fang, L. Chen, R. Srinivasan, Influence of time and length size feature selections for human activity sequences recognition, ISA Trans. 53 (1) (2014) 134–140. [12] W.F. Gao, L. Hu, P. Zhang, Class-specific mutual information variation for feature selection, Pattern Recognit. 79 (2018) 328–339. [13] M. Ghaemi, M. Feizi-Derakhshi, Feature selection using forest optimization algorithm, Pattern Recognit. 60 (2016) 121–129. [14] R.M. Gray, Entropy and Information Theory, Springer Science and Business Media. Berlin/Heidelberg, Germany, 2011. [15] Q. Gu, Z. Li, J. Han, Generalized fisher score for feature selection, in: Proceedings of the 27th International Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2011, pp. 266–273.

Y. Zhang, Q. Wang and D.-w. Gong et al. / Pattern Recognition 93 (2019) 337–352 [16] F. Hafiz, A. Faizal, N. Patel, C. Naik, A two-dimensional (2-d) learning framework for particle swarm based feature selection, Pattern Recognit. 76 (2018) 416–433. [17] Y. Han, Y. Yang, X. Zhou, Co-regularized ensemble for feature selection, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI, 2013, pp. 1380–1386. [18] E. Hancer, B. Xue, M.J. Zhang, D. Karaboga, B. Akay, Pareto front feature selection based on artificial bee colony optimization, Inf. Sci. 422 (2018) 462–479. [19] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Proceedings of the International Conference on Neural Information Processing Systems, 2005, pp. 507–514. [20] R. He, T. Tan, L. Wang, W.S. Zheng, l2 , 1 regularized correntropy for robust feature selection, in: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2504–2511. [21] C. Hou, F. Nie, D. Yi, Y. Wu, Feature selection via joint embedding learning and sparse regression, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence., 2011, pp. 1324–1329. [22] J. Hu, J. Pei, J. Tang, How can i index my thousands of photos effectively and automatically? An unsupervised feature selection approach, in: Proceedings of the 2014 SIAM International Conference on Data Mining, 2014, pp. 136–144. [23] J. Huang, F. Nie, H. Huang, C. Ding, Robust manifold non-negative matrix factorization, ACM Trans. Knowl. Discov. Data 8 (3) (2013) 1–21. [24] S. Jadhav, H. He, K. Jenkins, Information gain directed genetic algorithm wrapper feature selection for credit rating, Appl. Soft Comput. 69 (2018) 541–553. [25] X. Kong, S.Y. Philip, Semi-supervised feature selection for graph classification, IEEE Trans. Knowl. Discov. Data Min. (2010) 793–802. [26] D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, Adv. Neural Inf. Process. Syst. 13 (20 0 0). [27] Z. Li, J. Liu, Y. Yang, X. Zhou, H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. 26 (9) (2014) 2138–2150. [28] F. Li, D.Q. Miao, W. Pedrycz, Granular multi-label feature selection based on mutual information, Pattern Recognit. 67 (2017) 410–423. [29] Z. Li, Y. Yang, J. Liu, H. Lu, Unsupervised feature selection using nonnegative spectral analysis, in: International Conference on Artificial Intelligence, 2012, pp. 1026–1032. [30] Y. Liu, K. Liu, C. Zhang, J. Wang, X. Wang, Unsupervised feature selection via diversity-induced self-representation, Neurocomputing (2016) 350–363. [31] Y. Liu, F. Nie, J. Wu, L. Chen, Efficient semi-supervised feature selection with noise insensitive trace ratio criterion, Neurocomputing 105 (2013) 12–18. [32] L. Lovász, M. Plummer, Matching Theory, American Mathematical Society, North-Holland, Amsterdam, 2009. [33] D. Luo, C. Ding, H. Huang, T. Li, Non-negative Laplacian embedding, in: Proceedings of IEEE Ninth International Conference on Data Mining, 2009, pp. 337–346. [34] F. Nie, H. Huang, X. Cai, C.H. Ding, Efficient and robust feature selection via joint l2 , 1 -norms minimization, in: Advances in Neural Information Processing Systems, 2010, pp. 1813–1821. [35] F. Nie, H. Wang, H. Huang, C.H.Q. Ding, Unsupervised and semi-supervised learning via l1 -norm graph, in: Proceedings of IEEE International Conference on Computer Vision, 2011, pp. 2268–2273. [36] F. Nie, S. Xiang, Y. Jia, C. Zhang, S. 
Yan, Trace ratio criterion for feature selection, in: Proceedings of the National Conference on Artificial Intelligence, volume 2, 2008, pp. 671–676. [37] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Comput. Soc. 27 (8) (2005) 1226–1238. [38] M. Qian, C. Zhai, Robust unsupervised feature selection, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013, pp. 1621–1627. [39] J.L. Rodgers, W.A. Nicewander, Thirteen ways to look at the correlation coefficient, Am. Stat. 42 (1) (1988) 59–66. [40] A. Senawi, H.L. Wei, S.A. Billings, A new maximum relevance-minimum multicollinearity (MRmMC) method for feature selection and ranking, Pattern Recognit. 67 (2017) 47–61. [41] R. Sheikhpour, M.A. Sarram, S. Gharaghani, M.A.Z. Chahooki, A survey on semi-supervised feature selection methods, Pattern Recognit. 64 (2017) 141–158. [42] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 888–905. [43] S. Solorio-Fernandez, J.F. Martinez-Trinidad, J.A. Carrasco-Ochoa, A new unsupervised spectral feature selection method for mixed data: a filter approach, Pattern Recognit. 72 (2017) 314–326. [44] U. Talukdar, S.M. Hazarika, J.Q. Gan, A kernel partial least square based feature selection method, Pattern Recognit. 83 (2018) 91–106. [45] J. Tang, X. Hu, H. Gao, H. Liu, Discriminant analysis for unsupervised feature selection, in: Proceedings of the 2014 SIAM International Conference on Data Mining, 2014, pp. 938–946. [46] S. Wang, J. Tang, H. Liu, Embedded unsupervised feature selection, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 470–476. [47] Y.T. Wang, J.D. Wang, H. Liao, H.Y. Chen, An efficient semi-supervised representatives feature selection algorithm based on information theory, Pattern Recognit. 61 (2017) 511–523. [48] X. Wang, X. Zhang, Z. Zeng, Q. Wu, J. Zhang, Unsupervised spectral feature selection with l1-norm graph, Neurocomputing (2016) 47–54. [49] J.J. Wen, Z.H. Lai, Y.W. Zhan, J.R. Cui, The l2,1-norm-based unsupervised optimal feature selection with applications to action recognition, Pattern Recognit. 60 (2016) 515–530. [50] L. Wolf, A. Shashua, Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight based approach, J. Mach. Learn. Res. 6 (2005) 1855–1887. [51] Z. Xuan, X. Gao, J.J. Wang, H. Yu, Z.Y. Wang, Z.R. Chi, Eye tracking data guided feature selection for image classification, Pattern Recognit. 63 (2017) 56–70. [52] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 723–742. [53] J.B. Yang, C.J. Ong, An effective feature selection method via mutual information estimation, IEEE Trans. Syst. Man Cybern.-Syst. 42 (6) (2012) 1550–1559. [54] Y. Yang, H.T. Shen, Z.G. Ma, Z. Huang, X.F. Zhou, l2,1-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594. [55] Y. Yang, Y.T. Zhuang, F. Wu, Y.H. Pan, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Trans. Multimed. 10 (2008) 437–446. [56] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, in: Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 856–863. [57] Z.H. Zhang, L. Bai, Y.H. Liang, E. Hancock, Joint hypergraph learning and sparse regression for feature selection, Pattern Recognit. 63 (2017) 291–309. [58] Y. Zhang, D. Gong, J. Cheng, Multi-objective particle swarm optimization approach for cost-based feature selection in classification, IEEE/ACM Trans. Comput. Biol. Bioinf. 99 (22) (2017) 64–75. [59] Y. Zhang, D. Gong, Y. Hu, W. Zhang, Feature selection algorithm based on bare bones particle swarm optimization, Neurocomputing 148 (2015) 150–157. [60] Y. Zhang, D. Gong, W. Zhang, Feature selection of unreliable data using an improved multi-objective PSO algorithm, Neurocomputing 171 (2016) 1281–1290. [61] Q. Zhang, J. Sun, G. Zhong, J. Dong, Random multi-graphs: a semi-supervised learning framework for classification of high dimensional data, Image Vis. Comput. 60 (2017) 30–37. [62] B. Zhao, J. Kwok, F. Wang, C. Zhang, Unsupervised maximum margin feature selection with manifold regularization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 888–895. [63] Z. Zhao, H. Liu, Semi-supervised feature selection via spectral analysis, in: Proceedings of the SIAM International Conference on Data Mining, 2007, pp. 1–12. [64] Z. Zhao, H. Liu, Spectral feature selection for supervised and unsupervised learning, in: Proceedings of the International Conference on Machine Learning, 2007, pp. 1151–1157. [65] Z. Zhao, L. Wang, H. Liu, Efficient spectral feature selection with minimum redundancy, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010, pp. 673–678. [66] K.F. Zheng, X.J. Wang, Feature selection method with joint maximal information entropy between features and class, Pattern Recognit. 77 (2018) 20–29. [67] N. Zhou, H. Cheng, W. Pedrycz, Y. Zhang, H. Liu, Discriminative sparse subspace learning and its application to unsupervised feature selection, ISA Trans. (2016) 104–118. [68] N. Zhou, Y.Y. Xu, H. Cheng, J. Fang, W. Pedrycz, Global and local structure preserving sparse subspace learning: an iterative approach to unsupervised feature selection, Pattern Recognit. 53 (2016) 87–101. [69] P. Zhu, Q. Hu, C. Zhang, W. Zuo, Coupled dictionary learning for unsupervised feature selection, in: Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016, pp. 2422–2428. [70] P. Zhu, W. Zhu, Q. Hu, C. Zhang, W. Zuo, Subspace clustering guided unsupervised feature selection, Pattern Recognit. 66 (2017) 364–374. [71] P. Zhu, W. Zuo, L. Zhang, Q. Hu, S.C. Shiu, Unsupervised feature selection by regularized self-representation, Pattern Recognit. 48 (2) (2015) 438–446.

Yong Zhang received the M.S. and Ph.D. degrees in control theory and control engineering from China University of Mining and Technology in 2006 and 2009, respectively. He is currently with the School of Information and Control Engineering, China University of Mining and Technology. His research interests include intelligence optimization and data mining.

Qing Wang received the B.S. degree in mathematics and applied mathematics from Henan Polytechnic University in 2015. He is currently master student at the School of Information and Control Engineering, China University of Mining and Technology. His research interests include particle swarm and feature selection.


Dunwei Gong received the Ph.D. degree in control theory and control engineering from China University of Mining and Technology in 1999. He is a professor in the School of Information and Control Engineering, China University of Mining and Technology. His main research interests include intelligence optimization and control.

Xianfang Song received the B.S. degree in automation from Luoyang Institute of Science and Technology in 2014. She is currently master student at the School of Information and Control Engineering, China University of Mining and Technology. Her research interests include particle swarm and feature selection.