Multinomial mixture model with feature selection for text clustering




Knowledge-Based Systems 21 (2008) 704–708


Minqiang Li *, Liang Zhang

School of Management, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China

Article history: Received 27 June 2007; Accepted 24 March 2008; Available online 31 March 2008

Keywords: Text clustering; Multinomial mixture model; Feature selection; Text mining

Abstract

The task of selecting relevant features is a hard problem in unsupervised text clustering due to the absence of class labels that would guide the search. This paper proposes a new mixture model method for unsupervised text clustering, named the multinomial mixture model with feature selection (M3FS). In M3FS, we introduce the concept of component-dependent "feature saliency" to the mixture model. We say a feature is relevant to a certain mixture component if its feature saliency value is higher than a predefined threshold. The feature selection process is thus treated as a parameter estimation problem, and the Expectation-Maximization (EM) algorithm is used to estimate the model. Experimental results on commonly used text datasets show that the M3FS method has good clustering performance and feature selection capability.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

With the rapid development of storage technologies and communication networks, massive amounts of text documents are now convenient to access, and unsupervised clustering has become an important task for text mining. The goal of clustering is to determine the intrinsic grouping of a set of unlabeled data. Many clustering models and algorithms [1-4] have been investigated in different application environments, such as hierarchical clustering, partition clustering, and probabilistic clustering models.

In most text clustering tasks, the vocabulary size is very large and the dataset is extremely sparse, which hinders the performance of clustering algorithms. The feature space dimension therefore needs to be reduced, and the commonly used technique is feature selection. One of the difficulties of feature selection in unsupervised text clustering is that there are no labels to guide the clustering process, so useful features must be extracted from the high-dimensional feature space while taking the performance of the clustering algorithm into account. Inspired by [3,5], we propose a novel multinomial mixture model with feature selection (M3FS) to cluster text documents. In M3FS, a new component-dependent feature saliency concept is introduced to the model, so that clustering and feature selection are performed simultaneously. We employ machine learning and statistical methods to investigate model-based clustering algorithms for text clustering.

The structure of this paper is as follows. Section 2 reviews the major methods for text clustering and feature selection. Section 3 introduces the probability model and the parameter estimation algorithm of the M3FS method. Experiments on several text datasets are reported in Section 4. Section 5 discusses related work and the computational complexity of the M3FS method. Finally, we conclude the paper in the last section.

2. Related work

Recently, many models and algorithms [3-6] have been successfully applied to text clustering tasks. Most of them are probabilistic finite mixture models. Liu et al. [7] used Gaussian mixtures and the EM algorithm [8] to cluster text documents and presented a corresponding model selection method. Nigam et al. [3] proposed a multinomial mixture model for text clustering: each word follows a multinomial distribution over C classes, and the document model is a mixture of these multinomials. Law et al. [5] adopted the EM algorithm together with the minimum message length model selection criterion to simultaneously estimate the feature saliencies and the number of clusters. Other works include hierarchical Bayesian models such as Latent Dirichlet Allocation (LDA) [6] and Probabilistic Latent Semantic Analysis (PLSA) [4]. Both perform "soft" clustering over the text data: documents are modeled by mixtures of latent topics, and the topics are probability distributions over the words.

Spectral analysis is another technique that has been successfully applied to the text clustering scenario. Dhillon [9] modeled the document collection as a bipartite graph between documents and words; the clustering problem was then posed as a graph partitioning problem and a spectral clustering algorithm was proposed. Li et al. [10] employed spectral analysis of text sets in the similarity space to predict clustering behavior before actual clustering was performed.

Selecting relevant features in unsupervised learning scenarios is a hard problem due to the absence of labeled data samples that would guide the search.


Although feature selection significantly influences the effectiveness of clustering algorithms [11], few studies have addressed it so far. Dash and Liu [11] proposed a simple filter-based feature selection approach for unsupervised clustering which follows the "clustering and relevance determination" [12] strategy: the dataset is first partitioned by a clustering algorithm, and the relevance values of the features are then scored so as to select the most relevant ones. Filter-based feature selection approaches do not consider the properties of the clustering methods, and it is usually hard to score feature relevance properly without taking the clustering algorithm and feature correlations into account. To address these issues, various wrapper-based feature selection methods have been proposed by combining feature selection strategies and clustering algorithms [5,12-15]. Law et al. [5] introduced the concept of "feature saliency" to the Gaussian mixture model, so that feature selection becomes part of the model's parameter estimation. Roth and Lange [12] combined a Gaussian mixture model with a Bayesian marginalization mechanism as the feature selection principle. Dy and Brodley [15] suggested a heuristic to compare feature subsets. Vaithyanathan and Dom [16] used the genetic algorithm for feature selection in k-means clustering. Wrappers are usually more complicated and computationally expensive than filters, but wrapper-based methods are able to achieve better clustering accuracy than filter-based methods by considering the properties of the learning algorithm.

3. M3FS method

For the text clustering task considered in this paper, Salton's bag-of-words model is employed to represent documents: a document is represented as a vector of word counts over a fixed vocabulary. We denote the document set as $D = \{\vec{w}_1, \ldots, \vec{w}_N\}$, where $\vec{w} = (w_1, \ldots, w_M)$ is a document vector and $w_m$ is the frequency of term (word) $m$.

3.1. Multinomial mixture model

We utilize the multinomial mixture model of Nigam et al. [3] for text clustering. A document $\vec{w}$ follows a $K$-component multinomial mixture distribution, and its probability density function can be written as:

$$p(\vec{w}; \Theta) = \sum_{k=1}^{K} \lambda_k\, p(\vec{w}; \vec{\theta}_k) = \sum_{k=1}^{K} \lambda_k \frac{\left(\sum_{m=1}^{M} w_m\right)!}{\prod_{m=1}^{M} w_m!} \prod_{m=1}^{M} \theta_{k,m}^{w_m} \qquad (1)$$

where $\Theta = \{\lambda_1, \ldots, \lambda_K, \vec{\theta}_1, \ldots, \vec{\theta}_K\}$ are the model's parameters, $\lambda_1, \ldots, \lambda_K$ are the mixing probabilities, and $\vec{\theta}_k = (\theta_{k,1}, \ldots, \theta_{k,M})$ are the parameters defining the $k$-th component, with $\sum_{m=1}^{M} \theta_{k,m} = 1$ and $\sum_{k=1}^{K} \lambda_k = 1$.

In order to cluster the documents, we introduce the unobserved indicators $\vec{c} = (c_1, \ldots, c_K)$, where exactly one $c_k$ equals 1 and the others are 0. The indicator vector set $C = \{\vec{c}_1, \ldots, \vec{c}_N\}$ specifies the mixture components from which the observed examples are drawn. The observed document vector $\vec{w}$ and the latent indicator vector $\vec{c}$ form the complete data $\vec{x} = \{\vec{w}, \vec{c}\}$, $\vec{x} \in X = D \cup C$. The joint density function of $(\vec{w}, \vec{c})$ is:

$$p(\vec{x}; \Theta) = p(\vec{w}, \vec{c}; \Theta) = \prod_{k=1}^{K} \left[ \lambda_k \frac{\left(\sum_{m=1}^{M} w_m\right)!}{\prod_{m=1}^{M} w_m!} \prod_{m=1}^{M} \theta_{k,m}^{w_m} \right]^{c_k} \qquad (2)$$

3.2. Parameter estimation

Here, an $N \times K$ parameter matrix $G$ is introduced to specify the probability of a document belonging to a cluster; we call it the "Clustering Probability" (CP) matrix. The elements of $G$ are defined as:

$$g_{n,k} = E(c_{n,k} \mid D, \Theta) = \Pr(c_{n,k} = 1 \mid \vec{w}_n; \Theta) = \frac{\lambda_k\, p(\vec{w}_n; \vec{\theta}_k)}{\sum_{k'=1}^{K} \lambda_{k'}\, p(\vec{w}_n; \vec{\theta}_{k'})} \qquad (3)$$

In the soft-clustering scenario, with the constant term neglected, the log-likelihood of the model is:

$$L = \sum_{n=1}^{N} \sum_{k=1}^{K} g_{n,k} \left( \log \lambda_k + \sum_{m=1}^{M} w_{n,m} \log \theta_{k,m} \right) \qquad (4)$$
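As an illustration only (not part of the original paper), the following sketch computes the quantity in Eq. (4) from a term-frequency matrix, a CP matrix and the current parameter estimates; the function and variable names are our own, and the value can be used to monitor convergence of the EM iterations described next.

```python
import numpy as np

def soft_log_likelihood(W, G, lam, theta, eps=1e-12):
    """Soft-clustering log-likelihood of Eq. (4), constant terms dropped.

    W:     (N, M) term-frequency matrix (documents x words)
    G:     (N, K) clustering-probability (CP) matrix
    lam:   (K,)   mixing probabilities
    theta: (K, M) component word distributions, rows sum to 1
    """
    # log lambda_k + sum_m w_nm log theta_km, for every (document, component) pair
    per_doc_comp = np.log(lam + eps)[None, :] + W @ np.log(theta + eps).T  # (N, K)
    return float(np.sum(G * per_doc_comp))
```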

In order to estimate the model's parameters and compute the maximum a posteriori (MAP) estimate, we use Dirichlet priors on $\{\lambda_k\}$ and $\{\theta_{k,m}\}$, with corresponding hyper-parameters $\delta_\lambda$ and $\delta_\theta$; this procedure is also called Laplacian smoothing. The EM algorithm is then used to estimate the parameters of the model, where the E-step is defined as:

$$g_{n,k}^{(t)} = \Pr(c_{n,k} = 1 \mid \vec{w}_n; \hat{\Theta}^{(t-1)}) = \frac{\hat{\lambda}_k^{(t-1)} \prod_{m=1}^{M} \left(\hat{\theta}_{k,m}^{(t-1)}\right)^{w_{n,m}}}{\sum_{k'=1}^{K} \hat{\lambda}_{k'}^{(t-1)} \prod_{m=1}^{M} \left(\hat{\theta}_{k',m}^{(t-1)}\right)^{w_{n,m}}} \qquad (5)$$

The parameters $\{\lambda_k\}$ and $\{\theta_{k,m}\}$ are updated in the M-step by:

$$\lambda_k^{(t)} \propto \delta_\lambda - 1 + \sum_{n=1}^{N} g_{n,k}^{(t)} \qquad (6)$$

$$\theta_{k,m}^{(t)} \propto \delta_\theta - 1 + \sum_{n=1}^{N} w_{n,m}\, g_{n,k}^{(t)} \qquad (7)$$

The experiments on the Reuters dataset by Rigouste et al. [17] verified that the value of $\delta_\lambda$ has little, if any, influence on the clustering result; we therefore set $\delta_\lambda = 1$ and $\delta_\theta = 2$. Taking into account the constraints $\sum_{k=1}^{K} \lambda_k = 1$ and $\sum_{m=1}^{M} \theta_{k,m} = 1$, $k = 1, \ldots, K$, we have:

$$\lambda_k^{(t)} = \frac{\delta_\lambda - 1 + \sum_{n=1}^{N} g_{n,k}^{(t)}}{\sum_{k'=1}^{K} \left(\delta_\lambda - 1 + \sum_{n=1}^{N} g_{n,k'}^{(t)}\right)} = \frac{\sum_{n=1}^{N} g_{n,k}^{(t)}}{N} \qquad (8)$$

$$\theta_{k,m}^{(t)} = \frac{\delta_\theta - 1 + \sum_{n=1}^{N} w_{n,m}\, g_{n,k}^{(t)}}{\sum_{m'=1}^{M} \left(\delta_\theta - 1 + \sum_{n=1}^{N} w_{n,m'}\, g_{n,k}^{(t)}\right)} = \frac{1 + \sum_{n=1}^{N} w_{n,m}\, g_{n,k}^{(t)}}{M + \sum_{m'=1}^{M} \sum_{n=1}^{N} w_{n,m'}\, g_{n,k}^{(t)}} \qquad (9)$$
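A minimal sketch of these smoothed EM updates, assuming dense numpy arrays and working in log space in the E-step to avoid underflow; the function and variable names (em_multinomial_mixture, W, G) are ours, not the paper's:

```python
import numpy as np

def em_multinomial_mixture(W, K, n_iter=50, seed=0, eps=1e-12):
    """EM for the multinomial mixture of Eq. (1) with delta_lambda = 1, delta_theta = 2."""
    rng = np.random.default_rng(seed)
    N, M = W.shape
    # Random soft initialization of the CP matrix G
    G = rng.dirichlet(np.ones(K), size=N)                        # (N, K)
    for _ in range(n_iter):
        # M-step, Eqs. (8) and (9)
        lam = G.sum(axis=0) / N                                  # (K,)
        theta = 1.0 + G.T @ W                                    # (K, M), Laplacian smoothing
        theta /= theta.sum(axis=1, keepdims=True)                # rows sum to 1
        # E-step, Eq. (5), computed in log space
        log_post = np.log(lam + eps)[None, :] + W @ np.log(theta + eps).T  # (N, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        G = np.exp(log_post)
        G /= G.sum(axis=1, keepdims=True)
    return lam, theta, G
```

In practice the loop can be stopped early once the log-likelihood of Eq. (4) changes by less than a small tolerance.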

3.3. Feature selection

The proposed multinomial mixture model is based on the Naïve Bayesian assumption, as in [22]: all features of the documents are independent of each other given the class. A modified "feature saliency" method is then proposed to perform feature selection over the dataset. Law et al. [5] defined "feature saliency" as a set of real-valued parameters that measure the relevance of the features to the clustering process; this turns feature selection into a parameter estimation problem that can be carried out during clustering. The latent variables $\vec{\phi} = (\phi_1, \ldots, \phi_M)$ are defined to indicate the relevance of the features to the clustering process, where

$$\phi_m = \begin{cases} 1, & \text{if feature } m \text{ is relevant} \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

However, we notice that when documents are clustered, different topics have different high-frequency words. For example, words like "cast" and "director" often appear in movie-related documents, while "stock market" and "fund" appear frequently in documents on finance and economics. Therefore, we change the "feature saliency" vector into a matrix $\Phi = \{\phi_{k,m}\}$, where

$$\phi_{k,m} = \begin{cases} 1, & \text{if feature } m \text{ is relevant to component } k \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

We then assume that the features relevant to a mixture component follow the corresponding component distribution, while the irrelevant features follow a shared distribution $q(w_m; \vec{\gamma})$, where $\vec{\gamma}$ is the parameter vector.


Since the latent matrix $\Phi$ and the shared distribution $q(w_m; \vec{\gamma})$ for irrelevant features are introduced into the multinomial model, it may rightly be named the multinomial mixture model with feature selection (M3FS). The M3FS model can be written as:

$$p(\vec{x}; \Theta) = p(\vec{w}, \vec{c}, \Phi; \Theta) = \prod_{k=1}^{K} \left[ \lambda_k \frac{\left(\sum_{m=1}^{M} w_m\right)!}{\prod_{m=1}^{M} w_m!} \prod_{m=1}^{M} \left(\theta_{k,m}^{w_m}\right)^{\phi_{k,m}} \left(\gamma_m^{w_m}\right)^{1-\phi_{k,m}} \right]^{c_k} \qquad (12)$$

The log-likelihood of the model is:

$$L = \sum_{n=1}^{N} \sum_{k=1}^{K} g_{n,k} \left\{ \log \lambda_k + \sum_{m=1}^{M} w_{n,m} \left[ s_{k,m} \log \theta_{k,m} + (1 - s_{k,m}) \log \gamma_m \right] \right\} \qquad (13)$$

where $G = \{g_{n,k}\}$ is the CP matrix and the feature saliency matrix $S = \{s_{k,m}\}$ indicates the probability that feature $m$ is relevant to mixture component $k$. In order to perform Laplacian smoothing, Dirichlet priors on $\{\lambda_k\}$, $\{\theta_{k,m}\}$ and $\{\gamma_m\}$ with hyper-parameters $\delta_\lambda = 1$, $\delta_\theta = 2$ and $\delta_\gamma = 2$ are introduced into the model. The EM algorithm is then used to estimate the model's parameters, where the critical steps are:

E-step: Update the CP matrix $G$:

$$g_{n,k}^{(t)} = \Pr(c_{n,k} = 1 \mid \vec{w}_n; \hat{\Theta}^{(t-1)}) = \frac{\hat{\lambda}_k^{(t-1)} \prod_{m=1}^{M} \left(\hat{\theta}_{k,m}^{(t-1)}\right)^{w_{n,m} \hat{s}_{k,m}^{(t-1)}} \left(\hat{\gamma}_m^{(t-1)}\right)^{w_{n,m} (1 - \hat{s}_{k,m}^{(t-1)})}}{\sum_{k'=1}^{K} \hat{\lambda}_{k'}^{(t-1)} \prod_{m=1}^{M} \left(\hat{\theta}_{k',m}^{(t-1)}\right)^{w_{n,m} \hat{s}_{k',m}^{(t-1)}} \left(\hat{\gamma}_m^{(t-1)}\right)^{w_{n,m} (1 - \hat{s}_{k',m}^{(t-1)})}} \qquad (14)$$

M-step: Estimate the parameter matrix $S$:

$$\hat{s}_{k,m}^{(t)} = \frac{1}{N} \sum_{n=1}^{N} \frac{\left(\hat{\theta}_{k,m}^{(t-1)}\right)^{w_{n,m} \hat{s}_{k,m}^{(t-1)}} g_{n,k}^{(t)}}{\left(\hat{\theta}_{k,m}^{(t-1)}\right)^{w_{n,m} \hat{s}_{k,m}^{(t-1)}} + \left(\hat{\gamma}_m^{(t-1)}\right)^{w_{n,m} (1 - \hat{s}_{k,m}^{(t-1)})}} \qquad (15)$$

The other parameters of the model are estimated as:

$$\lambda_k^{(t)} = \frac{\sum_{n=1}^{N} g_{n,k}^{(t)}}{N} \qquad (16)$$

$$\theta_{k,m}^{(t)} = \frac{1 + \sum_{n=1}^{N} w_{n,m}\, s_{k,m}^{(t)}\, g_{n,k}^{(t)}}{M + \sum_{m'=1}^{M} \sum_{n=1}^{N} w_{n,m'}\, s_{k,m'}^{(t)}\, g_{n,k}^{(t)}} \qquad (17)$$

$$\gamma_m^{(t)} = \frac{1 + \sum_{n=1}^{N} \sum_{k=1}^{K} w_{n,m} \left(1 - s_{k,m}^{(t)}\right) g_{n,k}^{(t)}}{M + \sum_{m'=1}^{M} \sum_{n=1}^{N} \sum_{k=1}^{K} w_{n,m'} \left(1 - s_{k,m'}^{(t)}\right) g_{n,k}^{(t)}} \qquad (18)$$

The feature selection threshold is denoted as $\delta$ ($0 < \delta < 1$). When the EM iterations stop, the features with a saliency value higher than the threshold are selected:

$$F = \{ w_m \mid s_{k,m} > \delta \} \qquad (19)$$

Notice that the same feature may have different saliency values for different topics; a feature is selected if it is salient to at least one of the topics.
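To make these updates concrete, here is a small sketch of one M3FS EM iteration and of the thresholding step, under our reading of Eqs. (14)-(19); it assumes dense numpy arrays, and all names (m3fs_em_step, select_features, W, S, and so on) are ours rather than the paper's:

```python
import numpy as np

def m3fs_em_step(W, G, lam, theta, gamma, S, eps=1e-12):
    """One EM iteration of M3FS as we read Eqs. (14)-(18).

    W: (N, M) term frequencies; G: (N, K) CP matrix; lam: (K,) mixing weights;
    theta: (K, M) component word distributions; gamma: (M,) shared distribution
    for irrelevant features; S: (K, M) feature-saliency probabilities.
    """
    N, M = W.shape
    log_theta, log_gamma = np.log(theta + eps), np.log(gamma + eps)

    # E-step, Eq. (14): responsibilities with saliency-weighted word log-probabilities
    log_word = S * log_theta + (1.0 - S) * log_gamma              # (K, M)
    log_post = np.log(lam + eps)[None, :] + W @ log_word.T        # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)
    G = np.exp(log_post)
    G /= G.sum(axis=1, keepdims=True)

    # M-step, Eq. (15): update the saliency matrix S (our reading of the update)
    rel = theta ** (W[:, None, :] * S[None, :, :])                # (N, K, M)
    irr = gamma[None, None, :] ** (W[:, None, :] * (1.0 - S[None, :, :]))
    S = (rel * G[:, :, None] / (rel + irr + eps)).mean(axis=0)    # (K, M)

    # M-step, Eqs. (16)-(18): mixing weights, component and shared word distributions
    lam = G.sum(axis=0) / N
    theta = 1.0 + np.einsum('nk,km,nm->km', G, S, W)              # Laplacian smoothing
    theta /= theta.sum(axis=1, keepdims=True)
    gamma = 1.0 + np.einsum('nk,km,nm->m', G, 1.0 - S, W)
    gamma /= gamma.sum()
    return G, lam, theta, gamma, S

def select_features(S, delta=0.5):
    """Keep feature m if it is salient to at least one component: max_k s_km > delta."""
    return np.flatnonzero(S.max(axis=0) > delta)
```

Initialization (for example, a random Dirichlet draw for G, a uniform S = 0.5, and empirical word frequencies for theta and gamma) and a convergence test on the log-likelihood of Eq. (13) would wrap around this step.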

4. Experiments

4.1. Text datasets

Four standard text datasets are used in our experiments: Classic-3 (ftp://ftp.cs.cornell.edu/pub/smart/), Hitech-6 (http://trec.nist.gov), Reuters (http://www.daviddlewis.com/resources/testcollections/reuters21578/), and Newsgroups-20 (http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html). Their main features are summarized in Table 1. Note that the distribution of documents over the classes is extremely unbalanced in the Reuters dataset; a subset of the top 10 categories is therefore selected and named Reuters-10.

Table 1. Test datasets summary

Dataset          Documents   Classes   Words
Classic-3        3891        3         41,681
Hitech-6         1530        6         10,080
Reuters-10       5771        10        15,996
Newsgroups-20    19,949      20        43,586
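For readers who want to build a comparable setup, the snippet below constructs term-frequency vectors for a small slice of the 20 Newsgroups corpus with scikit-learn. It is illustrative only: the category subset and vectorizer settings are our own choices, and the resulting counts will not match Table 1 because the paper's exact preprocessing is not specified.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# A small subset of categories keeps the dense term-frequency matrix manageable.
cats = ['sci.space', 'rec.autos', 'talk.politics.mideast']
newsgroups = fetch_20newsgroups(subset='all', categories=cats,
                                remove=('headers', 'footers', 'quotes'))
vectorizer = CountVectorizer(stop_words='english', min_df=5, max_features=5000)
W = vectorizer.fit_transform(newsgroups.data).toarray().astype(float)  # (N, M) counts
labels = np.asarray(newsgroups.target)                                 # class labels for evaluation
print(W.shape, len(vectorizer.vocabulary_))
```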

4.2. Evaluation metrics

In order to evaluate the performance of the proposed text clustering algorithm, we use the commonly adopted F-measure [18] and NMI [19]. Treating each cluster as the result of a query, the F-measure [18] combines precision and recall, which are defined individually as:

$$\text{Precision}_{i,j} = \frac{N_{i,j}}{N_j}, \qquad \text{Recall}_{i,j} = \frac{N_{i,j}}{N_i} \qquad (20)$$

where $N_{i,j}$ is the number of documents with class label $i$ in cluster $j$, $N_i$ is the number of documents with class label $i$, and $N_j$ is the number of documents in cluster $j$. The F-measure of cluster $j$ and class $i$ is then calculated as:

$$F_{i,j} = \frac{2\, \text{precision}_{i,j} \cdot \text{recall}_{i,j}}{\text{precision}_{i,j} + \text{recall}_{i,j}} \qquad (21)$$

The overall F-measure is the weighted average of the best F-measure over clusters for each class:

$$F = \sum_i \frac{n_i}{n} \max_j F_{i,j} \qquad (22)$$

The normalized mutual information (NMI) [19] is also generally used to evaluate clustering performance. Suppose that $S$ represents the assigned cluster label of a document and $T$ denotes its actual class label. The mutual information of $S$ and $T$ is calculated as:

$$I(S, T) = \sum_{s \in S,\, t \in T} p(s, t) \log \frac{p(s, t)}{p(s)\, p(t)} \qquad (23)$$

When the mutual information is normalized by the entropies $H(S)$ and $H(T)$, the NMI is calculated as:

$$\text{NMI} = \frac{I(S, T)}{\sqrt{H(S)\, H(T)}} = \frac{\sum_{i,j} N_{i,j} \log \frac{N\, N_{i,j}}{N_i N_j}}{\sqrt{\left(\sum_i N_i \log \frac{N_i}{N}\right)\left(\sum_j N_j \log \frac{N_j}{N}\right)}} \qquad (24)$$
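A short sketch of these two metrics, written directly from Eqs. (20)-(24); the function names are ours, and scikit-learn's normalized_mutual_info_score (with a geometric average) could be used instead of the hand-rolled NMI:

```python
import numpy as np

def f_measure(labels, clusters):
    """Overall F-measure of Eqs. (20)-(22): weighted best F over clusters, per class."""
    classes, cluster_ids = np.unique(labels), np.unique(clusters)
    n = len(labels)
    total = 0.0
    for i in classes:
        n_i = np.sum(labels == i)
        best = 0.0
        for j in cluster_ids:
            n_ij = np.sum((labels == i) & (clusters == j))
            if n_ij == 0:
                continue
            n_j = np.sum(clusters == j)
            precision, recall = n_ij / n_j, n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

def nmi(labels, clusters):
    """NMI of Eqs. (23)-(24), computed from the class-by-cluster contingency table."""
    classes, cluster_ids = np.unique(labels), np.unique(clusters)
    n = len(labels)
    C = np.array([[np.sum((labels == i) & (clusters == j)) for j in cluster_ids]
                  for i in classes], dtype=float)
    n_i, n_j = C.sum(axis=1), C.sum(axis=0)
    nz = C > 0
    mi = np.sum(C[nz] * np.log(n * C[nz] / np.outer(n_i, n_j)[nz]))
    h_s = -np.sum(n_i * np.log(n_i / n))
    h_t = -np.sum(n_j * np.log(n_j / n))
    return mi / np.sqrt(h_s * h_t)
```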


4.3. Experiment results

A series of experiments is conducted to investigate the performance of the proposed M3FS model for text clustering. In the following experiments, we perform 20-fold cross-validation and report the average results over the 20 folds on the test sets.

Table 2 shows the M3FS feature selection results over the text datasets, with the feature selection threshold set to 0.5. The dimensions of the selected feature spaces are much smaller than the original ones. In order to examine the feature selection ability of M3FS in detail, we set different feature selection thresholds and run the clustering algorithm on the Classic-3 dataset. Table 3 shows the F-measure and NMI results of clustering based on the selected features; here δ is the feature selection threshold, and δ = 0 means that feature selection is not performed and the original vocabulary is used as the feature set. Table 3 shows that M3FS can reduce the number of selected features to about 5% of the original with δ ∈ [0.5, 0.7] while achieving relatively better clustering performance. Table 4 lists the top 10 features of the three clusters on the Classic-3 dataset.

Finally, we experimentally compare our clustering method with popular existing text clustering methods over the four text datasets. The algorithms used for comparison are k-means (with randomly selected initial centre points), the multinomial mixture model (M3) [3], and spectral clustering [20], which are main-stream approaches. The results are reported in Table 5. Both the F-measure and the NMI show that M3FS yields the best results among the four clustering methods. The proposed algorithm uses the selected features with high saliency values, and the irrelevant features that may degrade the clustering algorithms are effectively removed. Therefore, the M3FS algorithm can achieve better performance than the other clustering methods, which use all the available features.

Table 2. M3FS feature selection results over the four datasets

Dataset          Number of all words   Number of selected features
Classic-3        41,681                2387
Hitech-6         10,080                2096
Reuters-10       15,996                1885
Newsgroups-20    43,586                3950

Table 3. Performance of M3FS feature selection on Classic-3

Feature selection threshold δ   Number of selected features   F-measure   NMI
0                               41,681                        0.92        0.83
0.1                             35,844                        0.93        0.84
0.3                             16,072                        0.96        0.88
0.5                             2387                          0.94        0.86
0.7                             1264                          0.81        0.79

Table 4. Top 10 features with the highest feature saliency values on Classic-3

Cluster 1: cells, patients, hormone, blood, cancer, children, renal, deaths, abortions, rats
Cluster 2: aero, heat, engine, plane, wing, supersonic, laminar, shock, transfer, power
Cluster 3: retrieval, system, libraries, research, library, science, scientific, computer, information, literature

Table 5. Performance of different clustering methods on the four datasets

F-measure
Method      Classic-3     Hitech-6      Reuters-10    Newsgroups-20
k-Means     0.66 ± 0.08   0.40 ± 0.07   0.43 ± 0.04   0.11 ± 0.03
M3          0.92 ± 0.01   0.41 ± 0.02   0.67 ± 0.01   0.45 ± 0.03
M3FS        0.96 ± 0.02   0.44 ± 0.01   0.72 ± 0.02   0.53 ± 0.02
Spectral    0.96 ± 0.02   0.43 ± 0.02   0.68 ± 0.03   0.48 ± 0.03

NMI
Method      Classic-3     Hitech-6      Reuters-10    Newsgroups-20
k-Means     0.40 ± 0.06   0.20 ± 0.05   0.33 ± 0.06   0.08 ± 0.04
M3          0.83 ± 0.03   0.23 ± 0.04   0.51 ± 0.02   0.55 ± 0.03
M3FS        0.93 ± 0.02   0.26 ± 0.01   0.54 ± 0.01   0.61 ± 0.02
Spectral    0.93 ± 0.02   0.24 ± 0.01   0.52 ± 0.01   0.58 ± 0.02

5. Discussion

The M3FS clustering method is inspired by the works of [5,21], but it is characterized by new elements. First, instead of Gaussian mixtures, we adopt the multinomial mixture model, which has been shown to be more suitable for text clustering [3]. Second, the feature saliency in M3FS is a probability matrix that stores the probabilities of the features being relevant to the individual mixture components, whereas in [5,21] the saliency of a feature is a single probability that the feature is relevant to the clustering as a whole.

The M3FS method is based on the EM algorithm. Since each iteration of the EM algorithm consists of an E-step and an M-step, each with a time complexity of O(NMK), the total clustering time of M3FS is O(2INMK), where I is the total number of EM iterations.

6. Conclusion

In this paper, we present a novel mixture model method for text clustering, named the multinomial mixture model with feature selection (M3FS). M3FS introduces indicator variables into the mixture model as latent variables, so that feature selection becomes a parameter estimation problem, and the feature saliency probability matrix can be estimated by the EM algorithm. The "feature saliency" is defined as the probability that a feature is relevant to a mixture component, and the same feature may have different feature saliency values over different mixture components (clusters). The results of experiments on text datasets show that M3FS has good clustering capabilities on text data.

Since many real-world text clustering problems are semi-supervised [22], we plan to extend the M3FS method to semi-supervised text clustering and text classification in future work.

Acknowledgements

This work was supported by the National Science Foundation of China (Grant No. 70171002) and the Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20020056047).

References

[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[2] P. Bradley, U. Fayyad, C. Reina, Clustering very large databases using EM mixture models, in: Proc. 15th Intern. Conf. on Pattern Recognition (ICPR 2000), 2000, pp. 76–80.
[3] K. Nigam, A.K. McCallum, S. Thrun, T.M. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2/3) (2000) 103–134.
[4] T. Hofmann, Probabilistic latent semantic analysis, in: Proc. Uncertainty in Artificial Intelligence (UAI'99), 1999, pp. 289–296.
[5] M.H.C. Law, M.A.T. Figueiredo, A.K. Jain, Simultaneous feature selection and clustering using mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (9) (2004) 1154–1166.
[6] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[7] X. Liu, Y. Gong, W. Xu, S. Zhu, Document clustering with cluster refinement and model selection capabilities, in: Proc. 25th Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval, Tampere, Finland, 2002.
[8] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B39 (1977) 1–38.
[9] I. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, Technical Report 2001-05, UT Austin CS Dept., 2001.


[10] W. Li, W. Ng, E. Lim, Spectral analysis of text collection for similarity-based clustering, in: Proc. 20th International Conf. on Data Engineering (ICDE'04), 2004, p. 833.
[11] M. Dash, H. Liu, Feature selection for clustering, in: Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD-2000), 2000.
[12] V. Roth, T. Lange, Feature selection in clustering problems, in: Advances in Neural Information Processing Systems, vol. 16, MIT Press, Cambridge, MA, 2004.
[13] Y. Kim, W. Street, F. Menczer, Feature selection for unsupervised learning via evolutionary search, in: Proc. Sixth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2000, pp. 365–369.
[14] A.v. Heydebreck, W. Huber, A. Poustka, M. Vingron, Identifying splits with clear separation: a new class discovery method for gene expression data, Bioinformatics 17 (2001).
[15] J. Dy, C. Brodley, Feature subset selection and order identification for unsupervised learning, in: Proc. ICML 2000, 2000, pp. 247–254.

[16] S. Vaithyanathan, B. Dom, Generalized model selection for unsupervised learning in high dimensions, in: S. Solla, T. Leen, K. Müller (Eds.), Proc. NIPS 12, MIT Press, 2000.
[17] L. Rigouste, O. Cappé, F. Yvon, Evaluation of a probabilistic method for unsupervised text clustering, in: Applied Stochastic Models and Data Analysis (ASMDA 2005), 2005.
[18] C.J. van Rijsbergen, Information Retrieval, Butterworth, London, 1979.
[19] A. Strehl, J. Ghosh, Cluster ensembles – a knowledge reuse framework for combining partitions, Journal of Machine Learning Research 3 (2002) 583–617.
[20] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: Advances in Neural Information Processing Systems, vol. 14, MIT Press, 2002, pp. 849–856.
[21] M.W. Graham, D.J. Miller, Unsupervised learning of parsimonious mixtures on large feature spaces, Penn State EE Dept. Technical Report, 2004.
[22] C. Lee, Improving classification performance using unlabeled data: Naïve Bayesian case, Knowledge-Based Systems 20 (2007) 220–224.