Unsupervised feature selection via adaptive hypergraph regularized latent representation learning


Deqiong Ding a, Xiaogao Yang b, Fei Xia c, Tiefeng Ma a, Haiyun Liu d,∗, Chang Tang e

a School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China
b College of Mechanical Engineering, Hunan University of Arts and Science, Changde 415000, China
c Associate Research Fellow, Naval University of Engineering; Optical Engineering Postdoctoral Mobile Station, NUDT, Changsha 410073, China
d Institute of Cardiovascular Disease Research, The Affiliated Huai’an Hospital of Xuzhou Medical University, Huai’an 223100, China
e School of Computer Science, China University of Geosciences, Wuhan 430074, China

Article info
Article history: Received 8 May 2019; Revised 4 August 2019; Accepted 6 October 2019; Available online xxx. Communicated by Dr Xiaofeng Zhu.
MSC: 00-01; 99-00
Keywords: Unsupervised feature selection; Hypergraph learning; Latent representation learning; Local structure preservation
Neurocomputing, https://doi.org/10.1016/j.neucom.2019.10.018

Abstract
Due to the rapid development of multimedia technology, a large amount of unlabelled data with high dimensionality needs to be processed. The high dimensionality of data not only increases the computational burden on computer hardware, but also prevents algorithms from obtaining optimal performance. Unsupervised feature selection, regarded as a means of dimensionality reduction, has been widely recognized as an important and challenging pre-processing step for many machine learning and data mining tasks. However, we observe at least two issues in previous unsupervised feature selection methods. Firstly, traditional unsupervised feature selection algorithms usually assume that the data instances are identically distributed and that there is no dependency between them. However, the data instances are not only associated with high-dimensional features but are also inherently interconnected with each other. Secondly, the traditional similarity graph used in previous methods can only describe the pair-wise relations of data, but cannot capture high-order relations, so the complex structures implied in the data cannot be sufficiently exploited. In this work, we propose a robust unsupervised feature selection method which embeds latent representation learning into feature selection. Instead of measuring feature importance in the original data space, feature selection is carried out in the learned latent representation space, which is more robust to noise. In order to capture the local manifold geometrical structure of the original data in a high-order manner, a hypergraph is adaptively learned and embedded into the resulting model. An efficient alternating algorithm is developed to optimize the problem. Experimental results on eight benchmark data sets demonstrate the effectiveness of the proposed method.
© 2019 Elsevier B.V. All rights reserved.

1. Introduction

With the arrival of the information explosion era, a large amount of high-dimensional data has come into being, including images, texts, and medical microarray data [1,2]. Directly processing high-dimensional data not only significantly increases the computation time and memory burden of algorithms and computer hardware, but also results in poor performance due to the existence of irrelevant, noisy and redundant dimensions. In the original high-dimensional feature space, the distance concentration phenomenon makes classical distance-based models, e.g., KNN, fail to work [3,4]. Usually, the intrinsic dimensionality of high-dimensional data is small [5–8] and only a part

∗ Corresponding author. E-mail address: [email protected] (H. Liu).

of the features are discriminative for learning tasks such as data clustering and classification, since the noisy and redundant features mixed in the original data often degrade the performance of learning algorithms [9,10]. As an effective pre-processing of high-dimensional data, feature selection [11–14] aims to accomplish dimensionality reduction by removing irrelevant and redundant features while preserving the intrinsic data structure. In the past decades, a variety of feature selection methods have been proposed based on different prior knowledge of the data. Considering the availability of data labels, previous feature selection methods can generally be classified into three classes: supervised [15], semi-supervised [16] and unsupervised methods [17–29]. For supervised feature selection methods, the labels of training samples are known in advance [30,31]; these methods aim to select discriminative features by distinguishing samples from different classes. Due to its robustness to outliers, sparse learning [32,33] has been used as a powerful technique in supervised feature selection.

In some cases, only a part of the data labels are known in advance while the rest are unlabelled, and labelling large amounts of unlabelled data instances is time consuming and fairly expensive. For this reason, semi-supervised methods [34–36] came into being. These methods aim to select discriminative features by connecting labelled samples according to their label information and unlabelled samples by their relationship to the labelled samples. In most practical applications, it is laborious to obtain the labels of samples, especially in today's era of high-dimensional data explosion. Extracting the most discriminative information from unlabelled data is a challenging problem. Since unsupervised feature selection can determine feature importance based on the underlying properties of the original data without label information, more and more researchers have focused on it in recent years [37]. We also focus on unsupervised feature selection in this paper. In general, unsupervised feature selection methods can be summarized into three kinds, i.e., filter, wrapper, and embedding methods. Filter methods use feature ranking techniques to evaluate the importance of a single feature or a feature subset, and the commonly used ranking metrics include variance [38], Laplacian score [39], feature similarity [17], and trace ratio [40]. Wrapper methods select features based on the clustering or classification performance of the learning algorithms [41,42], and they search for features that better suit the learning tasks. Embedding methods combine feature selection and model reconstruction, and they often learn a feature weight vector or matrix to reflect the feature importance [18,43–48]. Compared with filter and wrapper methods, the advantage of embedding methods is that they can take different data properties into account, e.g., manifold structure and data distribution priors. Thus, embedding methods can usually obtain better performance. Since local manifold structure has been verified to be more important than global structure [43], most embedded methods try to exploit the local structure for feature selection. The well-studied graph Laplacian is usually deployed to preserve the local structure of the original data. Although promising performance has been achieved by previous unsupervised feature selection methods in various applications, there still exist two issues in existing methods. Firstly, they usually assume that the data instances are identically distributed and that there is no dependency between them. However, the data instances are not only associated with high-dimensional features but are also inherently interconnected with each other. Even if the data instances originate from homologous sources, they are often influenced by external conditions, e.g., illumination changes in face images. Secondly, the traditional similarity graph used in previous methods for local geometrical structure preservation can only describe the pair-wise relations of data, but cannot capture the high-order relations, so the complex structures implied in the data cannot be sufficiently exploited. In order to address these two issues, an unsupervised feature selection method via adaptive hypergraph regularized latent representation learning (termed AHRLRL for short) is proposed in this work.
Specifically, we embed a latent representation learning model into the unsupervised feature selection framework, in which we measure the importance of features in the learned latent representation space rather than in the original data space. In this way, the effect of noise mixed in the original data can be efficiently reduced. Meanwhile, in order to capture the local manifold geometrical structure of the original data in a high-order manner, a hypergraph is adaptively learned and embedded into the model. We summarize the main contributions of this paper as follows:
• A latent representation learning model is proposed and embedded into the task of unsupervised feature selection to exploit the interconnection between data samples.
• Instead of performing feature selection in the original data space, the feature representation model is built in the learned latent space, which can not only robustly characterize the intrinsic data structure but also in turn act as label information to serve the feature selection process.

• A hypergraph is adaptively learned and embedded into the model to capture the local manifold geometrical structure of the original data in a high-order manner.
• An alternating algorithm with a convergence guarantee is developed to optimize the resulting problem, and experiments on various datasets are conducted to validate the superiority of the proposed method over several other state-of-the-art ones.

2. Related work

In this section, we give a brief review of previous work on unsupervised feature selection, latent representation learning and hypergraph regularization.

2.1. Unsupervised feature selection

As a kind of dimensionality reduction method, unsupervised feature selection selects a relevant subset of features which contains the most discriminative information of the original data without using the labels of data samples. In the past decades, many unsupervised feature selection methods have been proposed. In the previous literature, these methods are roughly categorized into three classes: filter, wrapper, and embedded methods. Filter methods are independent of the learning tasks, and they select the optimal feature subsets by exploiting the intrinsic properties of the data [17,38–40]. For example, He et al. [39] proposed the Laplacian Score (LS) as a measure of the locality preservation of data. The LS is based on the assumption of a data manifold structure, i.e., if two data points belong to the same class, they should be close to each other. Spectral graph theory has also been deployed for unsupervised feature selection [15]. Based on information measurement, Liu et al. [49] performed feature selection as feature clustering in a hierarchically agglomerative way. In [50], a similarity preserving criterion for feature selection was proposed, which surpassed many widely used criteria. Wang et al. [45] proposed a so-called maximum projection and minimum redundancy feature selection method. Roffo et al. [51] took the feature distributions into consideration and cast feature selection as a problem over an infinite number of paths among feature distributions. The main limitation of filter methods lies in that they treat features individually and do not take the possible correlation among features into consideration; thus the redundancy in feature subsets cannot be efficiently removed. Approaches based on the wrapper model are dependent on predetermined learning algorithms such as clustering and classification [41,42], and they often select features to better serve the given learning tasks for improving learning performance. Dy et al. [52] exploited an Expectation-Maximization (EM) clustering algorithm to select the optimal feature subset through scatter separability and maximum likelihood. In [53], the number of errors on a validation subset was used to remove redundant features. In general, wrapper-based methods can outperform filter models [54]; however, in most wrapper-based methods the optimization problem is computationally intractable. For embedded methods, a learning model is usually trained with all of the features, and then some redundant features are removed while the performance of the learning model is well maintained.
Examples include the Support Vector Machine (SVM) based recursive feature elimination method [55], the Regularized Least Squares Classifier (RLSC) [56], Single-Set Spectral Sparsification (BSS) for k-means [57], and the Randomized Feature Selection (RFS) algorithm for k-means [58].

In the past few years, sparse learning based methods have also been proposed; these methods select important features by minimizing the fitting errors together with some sparse regularization terms, and many variants with good performance and interpretability have been proposed. The basic principle behind these methods is that sparsity regularization can be used to constrain the importance of different features. In order to enforce the sparsity of the feature weights, many sparsity-inducing norms have been exploited, e.g., sparse logistic regression [59] and group sparsity [43,60]. Typical methods include multi-cluster feature selection (MCFS) [61], joint feature selection and subspace learning (FSSL) [62], unsupervised discriminative feature selection (UDFS) [63], and unsupervised feature selection using feature similarity (FSFS) [17]. Recently, self-representation based methods have been proposed to select the most representative features and have shown superior results [48,64]. The rationale behind these methods is that each feature can be well reconstructed by a linear combination of its relevant features, and the representation coefficient matrix with a sparsity constraint can be used as the feature weight. It has been verified that preserving the local geometric structure of data is especially important for unsupervised feature selection [43]. To this end, the graph Laplacian regularization term has been widely used to preserve local geometric structure in many embedded methods [18,43,44]. However, the traditional similarity graph used in previous methods can only describe the pair-wise relations of data, but cannot capture the high-order relations, so the complex structures implied in the data cannot be sufficiently exploited. In addition, most previous methods perform feature selection in the original data space, so the final performance is often affected by the noise mixed in the original data.

2.2. Latent representation

Latent representation learning can benefit many data mining and machine learning tasks and has attracted increasing attention recently, especially for network data [65–69]. In networks, instances often connect to each other due to a variety of factors. Latent representations of different instances interact with each other and form link information, and instances with similar latent representations are more likely to be connected with each other than instances with dissimilar latent representations. Usually, the latent representations can be obtained from the link information by a symmetric nonnegative matrix factorization model [70,71], which decomposes the link matrix F into the product of a nonnegative matrix V and its transpose V^T in a low-dimensional latent space as follows:

\min_{V} \|F - VV^T\|_F^2,   (1)

where V ∈ R^{n×c} contains the latent representations of all n data instances, and c is the number of latent factors. The link matrix F describes the association and similarity between data instances. In [70,71], the symmetric nonnegative matrix factorization model is used to capture the cluster structure for data clustering. In this work, we borrow this idea and learn the latent representation from the affinity matrix of data instances for unsupervised feature selection. Different from previous methods, which evaluate feature importance in the original data space, we aim to exploit the relationships between data instances and use them to serve feature selection in the learned latent representation space, which is more robust to noise.
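As a concrete illustration of Eq. (1), the sketch below runs a damped multiplicative update for symmetric nonnegative matrix factorization on a nonnegative affinity matrix F. It is a minimal NumPy example with names of our own choosing, not the solver used later in this paper (there, V is updated jointly with the other variables via Eq. (19)).

```python
import numpy as np

def symnmf(F, c, n_iter=200, beta=0.5, eps=1e-12):
    """Minimal sketch for min_{V >= 0} ||F - V V^T||_F^2 (Eq. (1)).

    F: n x n nonnegative symmetric link/affinity matrix; c: number of latent factors.
    The damped multiplicative rule (damping factor beta in (0, 1]) is a common choice
    for symmetric NMF and keeps V nonnegative at every step."""
    n = F.shape[0]
    rng = np.random.default_rng(0)
    V = rng.random((n, c))
    for _ in range(n_iter):
        num = F @ V                      # gradient-related numerator
        den = V @ (V.T @ V) + eps        # denominator, eps avoids division by zero
        V *= (1.0 - beta) + beta * (num / den)
    return V
```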


Fig. 1. An intuitive comparison of a simple graph and a hypergraph. The edges (i.e., e1, e2 and e3, represented by black lines) in a simple graph can only reflect the relations between two genes. The hyperedges (i.e., he1 and he2, represented by colored dotted lines) can capture the relations among two or more genes.

2.3. Hypergraph regularization

As we discussed in the first section, a traditional graph can only describe the pair-wise relations between data to preserve the local geometric structure, but the higher-order complex relations in the data cannot be exploited. Here we give an illustration using the gene-disease relations in Fig. 1, where the black dots and the black lines indicate the genes and the relations between two genes, respectively. The dotted lines indicate the relations among two or more genes. A traditional simple graph only describes the gene-gene relations, e.g., l1 vs. l2 (i.e., l1 and l2 are two genes related to a certain disease), l2 vs. l3, and l2 vs. l4, but cannot reflect the relations found in real-world cases, i.e., a specific disease is usually related to more than two genes. For instance, three of the four genes in Fig. 1 (i.e., l1, l2 and l3) may be related to a certain disease while another disease may be related to three other genes (i.e., l1, l2 and l4). The hyperedges in a hypergraph can easily describe this type of relation, as shown by the colored dotted lines in Fig. 1. Hypergraphs have been used in many previous works, including clustering [72] and classification [73]. In this work, we focus on adaptively learning a hypergraph to preserve the high-order local geometric structures of the data. A hypergraph can be mathematically denoted as G = (V, E, w), where V = [v_i] and E = [e_i] represent the set of vertices and the set of hyperedges, respectively, and w = [w_i] contains the weights of the hyperedges. Different from traditional simple graph edges, which only consist of pairs of vertices, hyperedges can be arbitrarily sized sets of vertices. If the binary vertex-edge relations are represented by an incidence matrix H, each element of H is defined as:

H(v_i, e_j) = \begin{cases} 1, & \text{if } v_i \in e_j, \\ 0, & \text{otherwise,} \end{cases}   (2)

Then the degree of a hyperedge e_i can be calculated via \delta(e_i) = \sum_{v_j \in V} h(v_j, e_i). For a specific vertex v_j, its degree can be obtained via d(v_j) = \sum_{v_j \in e_i,\, e_i \in E} w(e_i) h(v_j, e_i). The hypergraph Laplacian matrix can be formulated as:

L = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2},   (3)

where I ∈ R^{n×n} is an identity matrix, D_e, D_v and W are, respectively, the diagonal matrices of δ = [δ(e_i)], d = [d(v_j)] and w = [w(e_i)], and n is the number of data samples.
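The hypergraph quantities above translate directly into code. The following is a minimal NumPy sketch of Eqs. (2)-(3), assuming a dense 0/1 incidence matrix; the function and variable names are ours, not from the paper.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Hypergraph Laplacian of Eq. (3).

    H: n x m incidence matrix with H[i, j] = 1 if vertex v_i belongs to hyperedge e_j (Eq. (2));
    w: length-m vector of hyperedge weights."""
    W = np.diag(w)
    d_v = H @ w                          # vertex degrees d(v_i) = sum_j w(e_j) H[i, j]
    d_e = H.sum(axis=0)                  # hyperedge degrees delta(e_j) = sum_i H[i, j]
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d_v, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(d_e, 1e-12))
    n = H.shape[0]
    return np.eye(n) - Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt
```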


3. Proposed method

Before we give the details of our proposed method, we first define some notations to be used throughout this paper. Boldface capital letters and boldface lower case letters represent matrices and vectors, respectively. For an arbitrary matrix M ∈ R^{m×n}, M_{ij} denotes its (i, j)th entry, while m^i and m_j denote the ith row and jth column of M, respectively. Tr(M) is the trace of M if M is square, and M^T is the transpose of M. ⟨A, B⟩ represents the inner product of two matrices. I_m is the identity matrix of size m × m. The l_{2,1}-norm of a matrix M is defined as \|M\|_{2,1} = \sum_{i=1}^{m} \|m^i\|_2 = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} M_{ij}^2}, and \|M\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{ij}^2} is the well-known Frobenius norm of M.

As aforementioned, most previous methods perform feature selection in the original data space and assume that the data instances are independent and identically distributed. However, in reality, data instances inevitably contain some noisy features and samples, which induces noisy similarity values between data samples. Latent representations that are directly derived from adjacency information may therefore degenerate feature selection in the original data space. In addition, data instances are not always independent and identically distributed when they are generated from heterogeneous sources, or from homologous sources under different external conditions. Therefore, it is desirable to embed latent representation learning into the feature selection phase on the content space. As a result, latent representation learning and feature selection can help and boost each other: feature selection in the latent space avoids the influence of noisy features in the original data space, while the data transformed by the feature transformation matrix helps learn better latent representations that are robust to noisy adjacency information. As latent factors encode some hidden attributes of instances, they should be related to some features (or attributes) of the data instances. Therefore, we take the latent representation matrix V as a constraint and model the content information of the data through a multivariate linear regression model:

\min_{T} \|XT - V\|_F^2,   (4)

where X ∈ R^{n×d} is the original data matrix whose rows x_i ∈ R^d are samples, and T ∈ R^{d×c} is a transformation parameter matrix. The l_2-norm of the ith row vector, \|T(i,:)\|_2, can be used as the feature weight because it reflects the importance of the ith feature in the latent space: if the ith feature takes part in the representation of other features in the latent space, then \|T(i,:)\|_2 must be significant. Therefore, row-sparsity is expected when regularizing the coefficient matrix T. To achieve this goal, we add an l_{2,1}-norm regularization term on T for joint sparsity among all c latent factors and obtain the following model:

\min_{T, V} \|XT - V\|_F^2 + \alpha \|T\|_{2,1},   (5)

where the parameter α controls the sparseness of the model. By combining the objective function of latent representation learning in Eq. (1) with the feature selection model in Eq. (5), the objective function that embeds latent representation learning into the feature selection phase can be formulated as follows:

\min_{T, V} \|XT - V\|_F^2 + \|F - VV^T\|_F^2 + \alpha \|T\|_{2,1}, \quad \text{s.t. } V \ge 0.   (6)

Since the first two terms of Eq. (6) are both for latent representation learning, we set the same weight for these two terms. It has been indicated in recent research that both global pairwise sample similarity and local geometric structure are critical for feature selection, while preserving the local manifold geometric structure is more important for unsupervised feature selection. In Eq. (6), we hope to preserve the local manifold geometric structure of the original data in the latent space, thus a graph Laplacian regularization term is added to our model. Considering that a traditional graph can only describe the pair-wise relations between samples, we adaptively learn a hypergraph to preserve the local geometric structure of data samples in a high-order manner, and constrain that if two samples are close to each other, their representations in the learned latent space should also be similar. We formulate this property as the following hypergraph Laplacian regularization term:

\min_{W, D_e, D_v} \frac{1}{2} \sum_{e \in E}\sum_{x_i, x_j \in V} \frac{w(e) h(x_i, e) h(x_j, e)}{\delta(e)} \|[XT](i,:) - [XT](j,:)\|_2^2 + \gamma \|W\|_F^2, \quad \text{s.t. } w^T \mathbf{1} = 1,\ w(e_i) > 0,   (7)

where γ is a positive balancing parameter, and [XT](i,:) and [XT](j,:) are the latent representations of the i-th and j-th samples. In such a manner, the hypergraph G and the sample representation coefficient matrix constrain each other to obtain their optimal solutions: on the one hand, the hypergraph can be adaptively learned to preserve the high-order local geometrical structure of the data samples; on the other hand, the learned coefficient matrix constrains the hypergraph learning process. After a simple mathematical transformation, Eq. (7) becomes

\min_{H, D_e, D_v, W} Tr(T^T X^T L X T) + \gamma Tr(W^T W), \quad \text{s.t. } w^T \mathbf{1} = 1,\ w(e_i) > 0.   (8)

By combining Eqs. (6) and (8) together, and imposing some constraints on the weights of the hyperedges, we obtain our AHRLRL model as follows:

\min_{T, V, H, D_e, D_v, W} \|XT - V\|_F^2 + \|F - VV^T\|_F^2 + \alpha \|T\|_{2,1} + \beta Tr(T^T X^T L X T) + \gamma Tr(W^T W), \quad \text{s.t. } V \ge 0,\ w^T \mathbf{1} = 1,\ w(e_i) > 0,   (9)

where β is a parameter that balances the local manifold geometric structure regularization. As can be seen from Eq. (9), our AHRLRL model integrates feature selection, latent representation learning and hypergraph learning into a unified framework. In addition, the learned latent representation and hypergraph serve to boost the feature selection process. Without label information, it is not easy to assess the feature relevance of data. In contrast to the scarce label information, data similarity information can be easily computed by using distance metrics such as the Euclidean distance and the cosine similarity [74]. On the one hand, there are some inaccurate, noisy similarity values since noisy features are mixed in the original data, resulting in a negative effect on feature selection algorithms. On the other hand, for a certain dataset, the same feature set is used for representing each data instance, thus there exists a homophily effect [75] in the data, and the data content information affects and depends on the latent representations derived from the data association structure. Therefore, we can embed the latent representation learning into the feature selection phase. In such a manner, data content information can serve to mitigate the negative effects of noisy similarity values when learning latent representations, while good latent representations in turn contribute to selecting more meaningful features. In other words, these two phases cooperate and boost each other. In this way, the learned latent representations capture the inherent correlations between data instances and are more robust to noisy similarities in the original similarity matrix. When the latent representation V is fixed, it can be regarded as playing the role of label information to guide the feature selection process. In this work, we use the affinity matrix, which measures the similarity of data instance pairs, to represent the link matrix in Eq. (9); its entry is computed according to the following equation:

F_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),   (10)

where x_i and x_j are the feature vectors which represent the ith and jth data instances, respectively, and \|x_i - x_j\|^2 is the square of the Euclidean distance between the two data instances.
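A minimal sketch of Eq. (10) is given below; since the paper does not state how σ is chosen, the median-distance heuristic used here is our own assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def affinity_matrix(X, sigma=None):
    """Link/affinity matrix F of Eq. (10): F_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = cdist(X, X, metric="sqeuclidean")
    if sigma is None:
        # heuristic choice (not specified in the paper): median pairwise distance
        sigma = np.sqrt(np.median(sq_dist)) + 1e-12
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```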


4. Optimization algorithm and theoretical analysis

There are five variables in Eq. (9), so it is difficult to solve for them simultaneously. In this section, we first present an effective alternating optimization to iteratively solve the problem, and then give a theoretical analysis of the proposed algorithm. Specifically, we optimize one variable at a time while fixing the others.

4.1. Optimization algorithm

4.1.1. Subproblem 1: Solving T when the other variables are fixed
When the other variables are fixed, the objective function is convex w.r.t. T, and T can be obtained by solving the following optimization problem:

F(T) = \min_{T} \|XT - V\|_F^2 + \alpha \|T\|_{2,1} + \beta Tr(T^T X^T L X T).   (11)

Eq. (11) can be solved by the iterative reweighted least-squares (IRLS) algorithm [48,76]. For the IRLS algorithm, we need to introduce a diagonal matrix Λ ∈ R^{d×d}, with the ith diagonal element given by

Λ(i, i) = \frac{1}{2\|T(i,:)\|_2}.   (12)

Then Eq. (11) can be converted into the following weighted least-squares problem:

F(\hat{T}) = \min_{T} \|XT - V\|_F^2 + \alpha Tr(T^T Λ T) + \beta Tr(T^T X^T L X T).   (13)

By taking the derivative of F(\hat{T}) with respect to T and setting it to zero, we have

X^T(XT - V) + \alpha Λ T + \beta X^T L X T = 0.   (14)

Then the closed-form solution of T can be obtained by

\hat{T} = (X^T X + \alpha Λ + \beta X^T L X)^{-1} X^T V.   (15)

4.1.2. Subproblem 2: Solving V when the other variables are fixed
When the other variables are fixed, V can be obtained by solving the following optimization problem:

F(V) = \min_{V} \|XT - V\|_F^2 + \|F - VV^T\|_F^2, \quad \text{s.t. } V \ge 0.   (16)

The Lagrange multiplier method is employed to solve problem (16). For this purpose, we introduce a Lagrange multiplier Φ ∈ R^{n×c} for the constraint V ≥ 0 and construct the Lagrange function as follows:

F(\hat{V}) = \min_{V} \|XT - V\|_F^2 + \|F - VV^T\|_F^2 + Tr(Φ V^T).   (17)

Requiring that the derivative of F(\hat{V}) with respect to V be equal to zero, we obtain

-2XT + 2V - 4FV + 4VV^T V + Φ = 0.   (18)

According to the Kuhn-Tucker condition Φ_{ij} V_{ij} = 0, we have

V_{ij} \leftarrow V_{ij} \frac{(2XT + 4FV)_{ij}}{(2V + 4VV^T V)_{ij}},   (19)

where ← represents assignment.

4.1.3. Subproblem 3: Updating H and D_e by fixing the other variables
As aforementioned, the edge weights generated from the original feature space may be affected by noise, which will induce an inaccurate hypergraph. To tackle this problem, we learn the hyperedges from the learned latent representation space, in which the effect of noisy and irrelevant features can be reduced as much as possible. We then use the following formulation to construct the set of hyperedges:

e_i = \{v_j \mid \theta(v_i, v_j) \le \tau \bar{\sigma}_i\}, \quad i, j = 1, \ldots, n,   (20)

where \bar{\sigma}_i is the average similarity between v_i and each of the other data points in the latent representation space, and τ is a constant which is set to 0.5 in our experiments. It can be seen from Eq. (20) that AHRLRL learns the incidence matrix H from the latent representation space, and different samples are allocated different numbers of neighbors. Compared with previous simple graph methods [77,78] and hypergraph methods [72,79], which learn the graphs from the original data and assume a fixed number of neighbors for all samples, AHRLRL is more flexible and robust.

4.1.4. Subproblem 4: Updating W and D_v by fixing the other variables
By ignoring the other, fixed variables, the objective function with respect to W can be written as follows:

\min_{W} \beta Tr(T^T X^T L X T) + \gamma Tr(W^T W), \quad \text{s.t. } w^T \mathbf{1} = 1,\ w(e_i) > 0.   (21)

By letting O = D_e^{-1} H^T D_v^{-1/2} X T T^T X^T D_v^{-1/2} H and o = diag(O), and since W is a diagonal matrix, Eq. (21) can be rewritten in the following form:

\min_{w} -\beta o^T w + \gamma \|w\|_2^2, \quad \text{s.t. } w^T \mathbf{1} = 1,\ w_i > 0.   (22)

Then Eq. (22) can be transformed into the following form:

\min_{w} \left\| w - \frac{\beta o}{2\gamma} \right\|_2^2, \quad \text{s.t. } w^T \mathbf{1} = 1,\ w_i > 0.   (23)

The Lagrangian function corresponding to Eq. (23) can be established as:

L(w, \eta, \xi) = \left\| w - \frac{\beta o}{2\gamma} \right\|_2^2 - \eta (w^T \mathbf{1} - 1) - \xi^T w,   (24)

where η ≥ 0 and ξ ≥ 0 are the Lagrangian multipliers. According to the Karush-Kuhn-Tucker conditions, we obtain the closed-form solution for w_i as:

w_i = \left( \frac{\beta o_i}{2\gamma} \right)_{+}, \quad i = 1, \ldots, n.   (25)

Then we have W = diag(w) and

d(v_i) = \sum_{v_i \in e_j,\ e_j \in E} w(e_j) h(v_i, e_j), \quad D_v = diag(d), \quad i, j = 1, \ldots, n.   (26)
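Before summarizing the whole procedure, the sketch below illustrates the adaptive hyperedge construction of Eq. (20) and the weight update of Eq. (25). Two points are our own reading rather than statements of the paper: θ is treated here as a pairwise Euclidean distance (with σ̄_i the corresponding per-sample average), and the weights are renormalized afterwards so that w^T 1 = 1.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_hyperedges(Z, tau=0.5):
    """Eq. (20): hyperedge e_i collects the samples whose distance to v_i in the
    latent space Z = XT is at most tau times the average distance from v_i."""
    dist = cdist(Z, Z, metric="euclidean")
    avg = dist.sum(axis=1) / (dist.shape[0] - 1)        # average over the other samples
    H = (dist <= tau * avg[:, None]).astype(float).T    # column i is hyperedge e_i
    return H

def update_hyperedge_weights(o, beta, gamma):
    """Eq. (25): w_i = (beta * o_i / (2 * gamma))_+ , followed by a renormalization
    to satisfy the simplex constraint w^T 1 = 1 (our simplification)."""
    w = np.maximum(beta * o / (2.0 * gamma), 0.0)
    return w / max(w.sum(), 1e-12)
```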

For clarity, the entire algorithm for solving AHRLRL is summarized in Algorithm 1.

Algorithm 1. Iterative algorithm for solving AHRLRL.
Input: Original data matrix X ∈ R^{n×d}; parameters α, β and γ.
Initialization: Initialize the hypergraph G using the original data; Λ = I; V = rand(n, c).
while not converged do
  1. Update T by solving Eq. (11);
  2. Update V by Eq. (19);
  3. Update H and D_e by using Eq. (20);
  4. Update W and D_v by solving Eq. (21);
  5. Calculate L by using Eq. (3);
end while
Output: T.
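A compact sketch of Algorithm 1 follows; it reuses the illustrative helpers defined earlier (affinity_matrix, build_hyperedges, update_hyperedge_weights, hypergraph_laplacian) and fills in choices the pseudo-code leaves open (for example, the degree matrices are refreshed with the previous weights before w is updated), so it should be read as an approximation rather than the authors' implementation.

```python
import numpy as np

def ahrlrl(X, F, c, alpha, beta, gamma, n_iter=30):
    """Alternating optimization sketch of Algorithm 1; returns the transformation T.
    Features are then ranked by the row norms ||T(i, :)||_2."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    V = rng.random((n, c))                                   # V = rand(n, c)
    Lam = np.eye(d)                                          # IRLS reweighting matrix, Eq. (12)
    H = build_hyperedges(X)                                  # initial hypergraph from the original data
    w = np.full(H.shape[1], 1.0 / H.shape[1])
    L = hypergraph_laplacian(H, w)
    for _ in range(n_iter):
        # Step 1: T update, closed form of Eq. (15), followed by the IRLS reweighting of Eq. (12)
        A = X.T @ X + alpha * Lam + beta * X.T @ L @ X
        T = np.linalg.solve(A, X.T @ V)
        Lam = np.diag(1.0 / (2.0 * np.linalg.norm(T, axis=1) + 1e-12))
        # Step 2: V update, multiplicative rule of Eq. (19)
        V *= (2 * X @ T + 4 * F @ V) / (2 * V + 4 * V @ (V.T @ V) + 1e-12)
        # Steps 3-5: rebuild hyperedges in the latent space, update weights, refresh L
        Z = X @ T
        H = build_hyperedges(Z)
        De_inv = np.diag(1.0 / np.maximum(H.sum(axis=0), 1e-12))
        Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(H @ w, 1e-12)))
        O = De_inv @ H.T @ Dv_inv_sqrt @ Z @ Z.T @ Dv_inv_sqrt @ H
        w = update_hyperedge_weights(np.diag(O), beta, gamma)
        L = hypergraph_laplacian(H, w)
    return T
```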

4.2. Algorithm analysis

In this section, we give a brief analysis of the optimization algorithm for solving AHRLRL.

4.2.1. Algorithm complexity
As can be seen from Algorithm 1, the problem of solving AHRLRL can be decomposed into four sub-problems. The main computational cost lies in solving T, whose complexity is O(d^3) in each iteration of Eq. (11). Therefore, for each outer iteration, the computational complexity of Algorithm 1 is O(t_1 d^3), where t_1 is the number of iterations needed for solving Eq. (11).


Table 1. Statistics of the datasets used in this paper.

Data sets    | Instance | Feature | Class | Type
ORL          | 400      | 1024    | 40    | Face images
warpPIE10P   | 210      | 2420    | 10    | Face images
orlraws10P   | 100      | 10304   | 10    | Face images
COIL20       | 1440     | 1024    | 20    | Object images
Isolet       | 1560     | 617     | 26    | Speech signal
CLL_SUB_111  | 111      | 11340   | 3     | Biological microarray
Prostate_GE  | 102      | 5966    | 2     | Biological microarray
USPS         | 9298     | 256     | 10    | Digit images

4.2.2. Convergence analysis
It is difficult to prove the convergence of Algorithm 1 in general, since there are six blocks in the AHRLRL model and the objective function (9) is non-smooth. However, the convergence of each sub-problem can be well guaranteed. In addition, empirical evidence on real data suggests that the proposed algorithm has very strong and stable convergence behaviour. We will also demonstrate this point in the experiment section.

5. Experimental results

In this section, we present the evaluation of the proposed unsupervised feature selection method on several public data sets. Meanwhile, we compare our algorithm with several state-of-the-art unsupervised feature selection methods, and a parameter sensitivity analysis is also reported.

5.1. Datasets

In our experiments, eight different publicly available datasets were used for evaluation, including three face image datasets, i.e., ORL [61], orlraws10P and warpPIE10P (1), one object image dataset COIL20 (2), one speech signal dataset Isolet (3), two biological microarray datasets, i.e., CLL_SUB_111 (4) and Prostate_GE (5), and one digit image dataset USPS. The statistics of these datasets are summarized in Table 1.

5.2. Comparison methods and parameter settings

To validate the effectiveness of our AHRLRL method, we follow the common experimental settings of previous unsupervised feature selection works, i.e., we evaluate the clustering performance obtained with the selected features. We compare AHRLRL with the following state-of-the-art unsupervised feature selection approaches:
• Baseline: All of the original features are adopted;
• LS: Laplacian Score [39], in which the features most consistent with the Gaussian Laplacian matrix are selected;
• MCFS: Multi-cluster feature selection [61], which uses the l_1-norm to regularize the feature selection process as a spectral information regression problem;
• RSR: Regularized self-representation feature selection method [48], which uses the l_{2,1}-norm to measure the fitting error and also promotes sparsity;
• MFFS: A feature selection criterion developed from the viewpoint of subspace learning, which converts the unsupervised feature selection problem into a matrix factorization problem [80];

Dataset sources: (1) http://featureselection.asu.edu/datasets.php; (2) http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php; (3) http://archive.ics.uci.edu/ml/datasets/ISOLET; (4) http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2466; (5) http://www.ncbi.nlm.nih.gov/pubmed/12381711.

• GLoSS: Global and local structure preserving sparse subspace learning model for unsupervised feature selection [44], which can simultaneously realize feature selection and subspace learning;
• GSR_SFS: Graph self-representation sparse feature selection [81], in which a traditional fixed similarity graph is used to preserve the local geometrical structure of the data;
• l1-UFS: Unsupervised feature selection via l_1-norm regularized graph learning [25], in which the l_1-norm replaces the traditional l_2-norm for measuring sample similarity in the selected feature space;
• DSRMR: Robust unsupervised feature selection via dual self-representation and manifold regularization [21], in which a feature self-representation term is used for feature reconstruction while a sample self-representation term is deployed to learn the similarity graph for local geometrical structure preservation;
• AEFS: Autoencoder feature selector for unsupervised feature selection, which combines autoencoder regression and group lasso tasks [82].

Several parameters need to be set in AHRLRL and the other methods. For LS, GLoSS, MCFS, and GSR_SFS, we fixed the neighborhood size to 5 for all the datasets. For GLoSS and GSR_SFS, the Gaussian kernel width of the distance function is set to 1. Similar to previous feature selection works [21,44,81], in order to make a fair comparison of the different methods, we tuned the remaining parameters for all of the methods (including our AHRLRL) by a "grid-search" strategy over {10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3}. Because the optimal number of selected features is unknown, we report, for each algorithm and each dataset, the best clustering results obtained with the optimal parameters. For all of the datasets, we tuned the number of selected features over {20, 30, . . . , 90, 100}.

5.3. Clustering experiment settings and results

5.3.1. Experiment settings
Similar to previous unsupervised feature selection methods, for each method and each dataset, we use the selected features to perform k-means clustering, which uses the squared Euclidean distance to measure the similarity between two data instances. We then compare the clustering results with the true labels, which represent the classes to which the data samples belong. Two widely used evaluation metrics, i.e., accuracy (ACC) and normalized mutual information (NMI), are employed to evaluate the clustering performance; larger ACC and NMI represent better performance. Since the performance of k-means depends on the initialization, we run it 20 times with random starting points and report the average value.
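This evaluation protocol can be summarized in a few lines; the sketch below assumes the learned transformation matrix T from Algorithm 1 and uses scikit-learn's k-means, with function names of our own choosing.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate_selected_features(X, T, y_true, k, n_clusters, n_runs=20):
    """Rank features by ||T(i, :)||_2, keep the top k, and average the clustering
    quality of k-means over n_runs random restarts (NMI shown; ACC is analogous)."""
    scores = np.linalg.norm(T, axis=1)               # feature weights
    top_k = np.argsort(scores)[::-1][:k]             # indices of the k most important features
    X_sel = X[:, top_k]
    nmi_runs = []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = km.fit_predict(X_sel)
        nmi_runs.append(normalized_mutual_info_score(y_true, labels))
    return float(np.mean(nmi_runs))
```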

If the clustering result of x_i is denoted by q_i and its true label by p_i, then ACC is defined as follows:

ACC = \frac{\sum_{i=1}^{n} \delta(p_i, map(q_i))}{n},   (27)

where δ (x, y ) = 1 if x = y, otherwise δ (x, y ) = 0. map(qi ) is the best mapping function that permutes clustering labels to match the true labels using the Kuhn-Munkres algorithm. Given two variables P and Q, NMI is defined as

NMI(P, Q) = \frac{I(P, Q)}{\sqrt{H(P)H(Q)}},   (28)

where H(P) and H(Q) are the entropies of P and Q, respectively, and I(P, Q) is the mutual information between P and Q. For clustering, P and Q are the clustering results and the true labels, respectively. NMI reflects the consistency between clustering results and ground truth labels.
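A small sketch of both metrics is given below: the ACC of Eq. (27) needs the optimal label permutation, which can be obtained with SciPy's Hungarian solver, while for Eq. (28) scikit-learn's normalized_mutual_info_score with average_method='geometric' matches the geometric-mean normalization used here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC of Eq. (27): best one-to-one matching between cluster labels and true labels
    (Kuhn-Munkres / Hungarian algorithm), then the fraction of correctly mapped samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    agreement = np.zeros((classes.size, classes.size), dtype=np.int64)
    for i, c_pred in enumerate(classes):
        for j, c_true in enumerate(classes):
            agreement[i, j] = np.sum((y_pred == c_pred) & (y_true == c_true))
    row, col = linear_sum_assignment(-agreement)      # maximize the total agreement
    return agreement[row, col].sum() / y_true.size

def clustering_nmi(y_true, y_pred):
    """NMI of Eq. (28), using the geometric mean of the entropies as normalization."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="geometric")
```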


Table 2. Best clustering results (ACC% ± std%) and the corresponding numbers of selected features of different feature selection algorithms on different datasets. The best results are highlighted in bold.

Datasets     | Baseline   | LS         | MCFS       | RSR        | MFFS       | GLoSS      | GSR_SFS    | l1-UFS     | DSRMR      | AEFS       | AHRLRL
ORL          | 55.79±2.69 | 48.55±2.30 | 50.01±2.42 | 54.04±2.66 | 55.19±2.44 | 54.19±1.94 | 55.88±2.63 | 58.24±2.38 | 57.69±3.17 | 56.84±3.62 | 58.93±2.51
warpPIE10P   | 27.12±1.91 | 42.93±2.00 | 38.74±3.26 | 34.81±2.83 | 42.14±2.61 | 41.74±2.61 | 39.36±2.14 | 46.64±3.44 | 44.29±2.12 | 47.25±4.06 | 48.24±3.23
orlraws10P   | 72.76±6.57 | 61.90±4.24 | 75.87±7.63 | 75.29±6.52 | 73.85±7.28 | 76.83±5.41 | 76.04±6.23 | 76.65±5.35 | 76.78±5.28 | 75.77±4.93 | 78.28±5.06
COIL20       | 58.32±5.40 | 49.86±3.66 | 59.22±3.14 | 60.68±3.68 | 61.14±3.88 | 58.87±1.40 | 60.56±5.55 | 63.55±5.47 | 61.16±3.07 | 62.34±4.04 | 64.52±4.22
Isolet       | 58.02±3.47 | 53.79±2.76 | 56.30±3.06 | 53.97±2.52 | 60.91±2.74 | 61.72±3.62 | 58.90±2.19 | 60.84±2.37 | 59.80±3.74 | 62.64±3.41 | 64.43±2.76
CLL_SUB_111  | 53.15±5.91 | 45.68±6.87 | 46.04±6.17 | 45.95±6.79 | 48.65±6.14 | 47.14±5.32 | 47.48±6.38 | 65.77±1.81 | 51.49±5.54 | 55.85±2.96 | 56.55±4.47
Prostate_GE  | 58.82±1.02 | 55.88±0.97 | 59.80±1.21 | 60.88±1.38 | 64.71±1.05 | 60.78±1.14 | 61.76±0.89 | 63.82±1.69 | 64.27±1.13 | 67.46±5.37 | 77.83±1.56
USPS         | 66.43±2.15 | 62.59±2.12 | 72.52±2.82 | 72.41±4.66 | 70.82±3.73 | 67.71±3.87 | 70.04±3.72 | 73.91±1.54 | 74.75±5.23 | 74.63±3.75 | 76.67±3.37

Table 3. Best clustering results (NMI% ± std%) and the corresponding numbers of selected features of different feature selection algorithms on different datasets. The best results are highlighted in bold.

Datasets     | Baseline   | LS         | MCFS       | RSR        | MFFS       | GLoSS      | GSR_SFS    | l1-UFS     | DSRMR      | AEFS       | AHRLRL
ORL          | 71.95±1.21 | 69.59±1.13 | 70.23±1.32 | 73.25±1.29 | 74.40±1.17 | 73.23±1.46 | 74.18±1.25 | 77.87±1.71 | 73.95±1.42 | 74.20±2.35 | 75.44±1.21
warpPIE10P   | 27.92±3.53 | 47.26±2.62 | 39.98±2.65 | 37.41±2.49 | 47.89±2.54 | 43.68±1.89 | 44.24±2.64 | 53.21±3.37 | 50.02±2.47 | 54.87±2.49 | 58.63±2.72
orlraws10P   | 79.96±3.90 | 70.47±3.30 | 85.81±4.68 | 82.79±3.62 | 83.36±5.68 | 83.65±4.51 | 76.03±5.18 | 86.42±4.65 | 83.44±3.19 | 85.84±4.51 | 88.26±4.82
COIL20       | 74.01±2.97 | 64.80±1.69 | 70.89±1.35 | 72.76±1.19 | 72.84±2.06 | 49.94±1.03 | 72.78±2.02 | 75.84±1.39 | 74.27±2.12 | 74.82±2.66 | 77.74±2.22
Isolet       | 74.67±1.37 | 68.43±1.08 | 70.26±1.07 | 69.67±1.06 | 73.41±1.00 | 73.05±1.91 | 71.87±1.34 | 74.71±1.00 | 73.76±1.53 | 74.07±1.38 | 76.23±1.76
CLL_SUB_111  | 18.07±4.56 | 11.91±9.83 | 10.19±5.38 | 9.70±5.58  | 9.85±5.38  | 10.87±5.60 | 11.18±5.29 | 20.63±5.62 | 19.09±4.44 | 16.31±5.84 | 15.76±5.48
Prostate_GE  | 2.37±0.05  | 2.46±0.98  | 2.21±0.62  | 3.53±0.84  | 8.47±0.67  | 3.46±0.71  | 4.06±0.70  | 8.78±0.73  | 10.51±0.85 | 15.17±2.24 | 22.39±1.12
USPS         | 61.63±1.65 | 59.50±1.62 | 65.57±0.44 | 66.16±1.75 | 65.51±0.67 | 61.42±1.81 | 62.54±1.07 | 68.37±1.21 | 67.24±2.04 | 66.96±2.83 | 69.18±1.13

5.3.2. Clustering results
Tables 2 and 3 present the experimental results of the different methods in terms of ACC and NMI on the different datasets; the corresponding numbers of selected features are also reported, and the best results are highlighted in bold. As can be seen, some comparison methods (such as LS) fall below the baseline in some cases, because such methods cannot select the most important features for some datasets. However, most of the methods work better than the baseline, which means that only a part of the features are discriminative for learning tasks such as clustering, and these methods can effectively select the most important features while removing some redundant and noisy features. The results also show that the proposed method performs better than the other state-of-the-art methods in terms of ACC. There are three major reasons. Firstly, unlike previous methods which treat each data instance independently, the interconnection information between data instances is exploited by latent representation learning in our method. Secondly, we measure the importance of the different feature dimensions in the latent space rather than in the original feature space, which makes our method robust to noisy features and data instances. Thirdly, the graph based manifold regularization term can well preserve the local geometrical structure of the data, which has been verified to be critical for unsupervised feature selection in the previous literature [43]. It should be noted that our method significantly outperforms the other methods on the biological microarray dataset Prostate_GE, which is due to the characteristics of biological genetic data acquisition. Biological microarray data are obtained by detecting different genes under different conditions; the number of detected genes corresponds to the feature dimension, while each detecting condition generates one data instance. In such a manner, different data instances are essentially generated from the same genes and thus inevitably depend on each other. Since the latent representation learning in our method exploits exactly this dependence between microarray data instances, our method performs significantly better than the other methods on this dataset.

It should be noted that our method does not perform the best on the CLL_SUB_111 dataset. CLL_SUB_111 is a biological microarray dataset which is used to identify molecular correlates of genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). It has been verified that a strikingly high degree of correlation between loss or gain of genomic material and the amount of transcripts from the affected regions leads to the hypothesis of gene dosage as a significant pathogenic factor [83], i.e., most of the features (genes) are important for clustering and classifying these data samples. Therefore, most of the ACC results of the comparison methods are lower than the baseline for this dataset, since only a part of the features (genes) are chosen for clustering after performing feature selection. Even so, our method still works better than most of the other compared ones, which also demonstrates that the proposed method can select the most discriminative features. In order to illustrate the effect of feature selection on clustering, we show the detailed performance of all methods with respect to different numbers of selected features on the different datasets. The ACC values and the NMI values with respect to different numbers of selected features are plotted in Figs. 2 and 3, respectively. As can be seen, with different numbers of selected features, our method AHRLRL steadily performs better than the other methods. It is worth noting that when using fewer features, our method obtains higher clustering accuracy than the baseline in most cases, which validates that our method can save clustering time and improve clustering accuracy.

5.3.3. Parameter sensitivity
There are three balancing parameters in our proposed method (i.e., α, β and γ). To study the sensitivity of the method with regard to the parameters in Eq. (9), experiments were conducted by fixing two of them and varying the remaining one. Figs. 4 and 5 plot the ACC and NMI values given by AHRLRL w.r.t. parameter γ while keeping α = 1 and β = 1, respectively. The results show that AHRLRL performs stably well for different γ's with fixed numbers of selected features.
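The sensitivity protocol can be reproduced with a simple sweep; the snippet below assumes the illustrative helpers defined earlier (ahrlrl, affinity_matrix, evaluate_selected_features) and pre-loaded X, y_true and n_clusters, so it is only a sketch of the experimental loop rather than the authors' script.

```python
import numpy as np

# Fix alpha = beta = 1, sweep gamma and the number of selected features,
# and record the averaged clustering metric for each setting.
gammas = [1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3]
feature_counts = list(range(20, 101, 10))
F = affinity_matrix(X)                                # Eq. (10)
results = {}
for gamma in gammas:
    T = ahrlrl(X, F, c=n_clusters, alpha=1.0, beta=1.0, gamma=gamma)
    for k in feature_counts:
        results[(gamma, k)] = evaluate_selected_features(X, T, y_true, k, n_clusters)
```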


[Figure 2: eight panels, (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE, (h) USPS; each panel plots clustering accuracy (ACC) against the number of selected features (20-100) for Baseline, LS, MCFS, RSR, MFFS, GLoSS, GSR_SFS, l1-UFS, DSRMR, AEFS and AHRLRL.]
Fig. 2. The clustering accuracy (ACC) of using different selected features by different methods on different datasets.


[Figure 3: eight panels, (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE, (h) USPS; each panel plots normalized mutual information (NMI) against the number of selected features (20-100) for Baseline, LS, MCFS, RSR, MFFS, GLoSS, GSR_SFS, l1-UFS, DSRMR, AEFS and AHRLRL.]
Fig. 3. The normalized mutual information (NMI) of using different selected features by different methods on different datasets.


[Figure 4: eight surface plots, (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE, (h) USPS; each shows the ACC of AHRLRL as a function of γ (10^-3 to 10^3) and the number of selected features (20-100).]
Fig. 4. ACC of our method w.r.t γ on different datasets with α = 1 and β = 1.


[Figure 5: eight surface plots, (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE, (h) USPS; each shows the NMI of AHRLRL as a function of γ (10^-3 to 10^3) and the number of selected features (20-100).]
Fig. 5. NMI of our method w.r.t γ on different datasets with α = 1 and β = 1.


[Figure 6: eight surface plots, (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE, (h) USPS; each shows the ACC of AHRLRL as a function of β (10^-3 to 10^3) and the number of selected features (20-100).]
Fig. 6. ACC of our method w.r.t β on different datasets with α = 1 and γ = 1.


[Fig. 7 appears here: panels (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE and (h) USPS, each plotting NMI against the number of selected features (#Feature, 20–100) and β (0.001–1000).]

Fig. 7. NMI of our method w.r.t. β on different datasets with α = 1 and γ = 1.


[Fig. 8 appears here: panels (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE and (h) USPS, each plotting ACC against the number of selected features (#Feature, 20–100) and α (0.001–1000).]

Fig. 8. ACC of our method w.r.t. α on different datasets with β = 1 and γ = 1.


[Fig. 9 appears here: panels (a) ORL, (b) warpPIE10P, (c) orlraws10P, (d) COIL20, (e) Isolet, (f) CLL_SUB_111, (g) Prostate_GE and (h) USPS, each plotting NMI against the number of selected features (#Feature, 20–100) and α (0.001–1000).]

Fig. 9. NMI of our method w.r.t. α on different datasets with β = 1 and γ = 1.


[Fig. 10 appears here: panels (a)–(h) for ORL, warpPIE10P, orlraws10P, COIL20, Isolet, CLL_SUB_111, Prostate_GE and USPS, each plotting the objective function value of AHRLRL against the number of iterations (0–50).]

Fig. 10. The convergence curves of the objective function in Eq. (9) on eight datasets with α = β = 1.


Figs. 6 and 7 show the ACC and NMI values of AHRLRL w.r.t. parameter β with α = 1 and γ = 1 fixed. The results are somewhat sensitive to β on ORL, warpPIE10P and COIL20 in terms of both ACC and NMI. For ORL, higher ACC and NMI are obtained when β > 1. For warpPIE10P, AHRLRL performs well when 0.1 < β < 100. For COIL20, the best ACC and NMI are achieved at β = 0.1, and the performance is stable for the other values of β. Figs. 8 and 9 show the ACC and NMI values of AHRLRL w.r.t. parameter α with β = 1 and γ = 1 fixed. When α = 1, the results on warpPIE10P rise sharply to a peak. For COIL20, the results change rapidly and a larger α yields better results when 0.001 < α < 10. For Isolet, the NMI value reaches its best when α = 1. In all other cases, AHRLRL performs stably well. (A minimal sketch of the parameter-sweep evaluation protocol is given after the Acknowledgments.)

5.4. Empirical convergence analysis

Although we cannot theoretically prove the convergence of Algorithm 1, the convergence of each sub-problem is guaranteed. Fig. 10 shows how the objective value of Eq. (9) varies with the number of iterations on the different datasets. The empirical evidence in Fig. 10 demonstrates the strong and stable convergence behaviour of the proposed algorithm. (A minimal monitoring sketch is also given after the Acknowledgments.)

6. Conclusion

In this work, we propose a robust unsupervised feature selection method that embeds latent representation learning and adaptive hypergraph learning into a single model. Instead of measuring feature importance in the original data space, we perform feature selection in the learned latent representation space, which is more robust to noise. A hypergraph is adaptively learned and embedded into the resultant model to capture the local manifold geometrical structure of the original data in a high-order manner. We develop an efficient alternating algorithm to optimize the resultant model. Experiments on eight benchmark data sets demonstrate the effectiveness of the proposed method.

Declaration of Competing Interest

None.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61572515 and 61701451, in part by the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan), under Grant No. CUG170654, and in part by the China Postdoctoral Science Foundation under Grant No. 2016M593023.
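The parameter surfaces in Figs. 4–9 are obtained by sweeping one regularization parameter over {0.001, 0.01, 0.1, 1, 10, 100, 1000} and the number of selected features over {20, 30, ..., 100}, with the remaining parameters fixed to 1. The following fragment is a minimal sketch of such a sweep under the usual unsupervised feature selection protocol (k-means on the selected features, scored by ACC and NMI); select_features is a hypothetical stand-in for the AHRLRL feature-selection routine, not the implementation used in the paper.

import numpy as np
from itertools import product
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_acc(y_true, y_pred):
    # Clustering accuracy: best one-to-one matching between cluster and class labels.
    _, y_true = np.unique(y_true, return_inverse=True)
    _, y_pred = np.unique(y_pred, return_inverse=True)
    cost = np.zeros((y_true.max() + 1, y_pred.max() + 1), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize the number of matched samples
    return cost[rows, cols].sum() / y_true.size

def sensitivity_sweep(X, y, select_features, n_clusters):
    # Grid over one regularization parameter (here gamma) and the number of selected
    # features, mirroring the axes of Figs. 4-9; alpha and beta are held at 1.
    param_grid = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    feature_grid = range(20, 101, 10)
    results = {}
    for gamma, n_sel in product(param_grid, feature_grid):
        idx = select_features(X, n_sel, alpha=1.0, beta=1.0, gamma=gamma)  # hypothetical AHRLRL call
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, idx])
        results[(gamma, n_sel)] = (cluster_acc(y, labels),
                                   normalized_mutual_info_score(y, labels))
    return results

Averaging ACC and NMI over several k-means restarts (the n_init argument, or repeated calls with different seeds) reduces the variance of each grid point.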
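Since only the convergence of each sub-problem is guaranteed, the objective value of Eq. (9) is tracked across iterations in practice, as plotted in Fig. 10. The fragment below is a minimal monitoring sketch with a relative-change stopping rule; ahrlrl_step is a hypothetical routine that performs one full round of the alternating updates and returns the updated variables together with the current objective value.

def run_with_monitoring(ahrlrl_step, state, max_iter=50, tol=1e-4):
    # Alternate the sub-problem updates and record the objective of Eq. (9);
    # stop when its relative change falls below tol, or after max_iter rounds
    # (Fig. 10 reports 50 iterations per dataset).
    history = []
    for it in range(max_iter):
        state, obj = ahrlrl_step(state)  # hypothetical: one round of the alternating updates
        history.append(obj)
        if it > 0 and abs(history[-2] - obj) <= tol * max(abs(history[-2]), 1e-12):
            break
    return state, history

With tol = 0 the loop simply runs the full 50 rounds while recording the curve, which is how plots such as those in Fig. 10 can be produced.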





Deqiong Ding received a master’s degree from Kunming University of Science and Technology in China in 2011. She is currently a Ph.D. candidate at the School of Statistics, Southwestern University of Finance and Economics, Chengdu, China. Her current research interests include statistical analysis and methodological research, big data analysis methods and applications.

Xiaogao Yang received his Ph.D. degree from the College of Mechanical Engineering of Chongqing University in 2015. He is currently a lecturer at Hunan University of Arts and Science. His research interests include big data analysis and fluid machinery.

Fei Xia received his Ph.D. from the Computer College of the National University of Defense Technology, China, in 2010. He is currently an Associate Research Fellow in the Electronic Engineering College, Naval University of Engineering, Wuhan, China. Dr. Xia has published 40+ peer-reviewed papers, including those in highly regarded journals such as Parallel Computing, Super Computing, Bioinformatics, etc. His current research interests include deep learning, reconfigurable computing and high performance computing.

Tiefeng Ma received his Ph.D. in probability theory and mathematical statistics from Beijing University of Technology, Beijing, China, in 2008. He is now a professor and doctoral supervisor at Southwestern University of Finance and Economics, Chengdu, China. Dr. Ma has published more than 40 peer-reviewed papers. His research interests include multivariate statistical analysis, methods and applications of big data analysis, and portfolio analysis.


Haiyun Liu is now a doctor at the Institute of Cardiovascular Disease Research, The Affiliated Huai’an Hospital of Xuzhou Medical University. His current research interests include medical data analysis for disease diagnosis and medical data mining.


Chang Tang received his Ph.D. degree from Tianjin University, Tianjin, China, in 2016. He was with the AMRL Lab of the University of Wollongong from Sep. 2014 to Sep. 2015. He is now an associate professor at the School of Computer Science, China University of Geosciences, Wuhan, China. Dr. Tang has published 30+ peer-reviewed papers, including those in highly regarded journals and conferences such as IEEE-TPAMI, IEEE-TMM, IEEE T-HMS, IEEE SPL, ICCV, CVPR, ACMM, AAAI, etc. He served on the Technical Program Committees of IJCAI 2018, ICME 2018, AAAI 2019/2020, ICME 2019, IJCAI 2019 and CVPR 2019. His current research interests include machine learning and data mining.
